CN111814611A

CN111814611A - Multi-scale face age estimation method and system embedded with high-order information

Info

Publication number: CN111814611A
Application number: CN202010590398.4A
Authority: CN
Inventors: 钟福金; 王新月
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Dragon Totem Technology Hefei Co ltd
Priority date: 2020-06-24
Filing date: 2020-06-24
Publication date: 2020-10-23
Anticipated expiration: 2040-06-24
Also published as: CN111814611B

Abstract

The invention relates to the field of face age estimation, in particular to a multi-scale face age estimation method and a system embedded with high-order information, wherein the method comprises the following steps: inputting a face image, and preprocessing the face image; inputting the face image into a residual error network for global feature extraction to construct a global branch; inserting blocks for extracting high-order age information at different positions of the global branch; taking the output characteristic diagram of the first convolutional layer of ResNet as the input of a long-term and short-term memory network, acquiring the position information of an age sensitive area, and obtaining a local characteristic diagram through cutting to construct a local branch; performing joint optimization on the two branches through a back propagation minimized loss function, and performing iterative training on a neural network; and inputting the test set into a trained neural network model, and calculating and outputting a final predicted age according to the age characteristics. The network model of the invention has the advantages of low calculation cost, high precision and strong applicability of related products.

Description

Multi-scale face age estimation method and system embedded with high-order information

Technical Field

The invention belongs to the field of face age estimation, and particularly relates to a multi-scale face age estimation method and system with embedded high-order information.

Background

The purpose of face age estimation is to automatically output biological age through a face image, and the face age estimation method is widely applied to the fields of face retrieval based on age, accurate advertisement, intelligent monitoring, human-computer interaction (HCI), Internet access control and the like, and is an active research topic in computer vision. Due to the combined action of internal factors of facial aging (such as various genes) and complex changes of facial images (such as facial poses at different angles and camera vision), the facial aging process is uncontrollable and personalized, and accurate and reliable automatic age estimation from facial images is extremely challenging.

The classical age estimation algorithm consists of two successive but relatively independent phases: age feature extraction and age estimation. According to the way of feature extraction, the current face age estimation methods can be divided into two categories: the method is based on the traditional machine learning method; the second is a method based on deep learning. The traditional machine learning method mainly extracts age features manually and then classifies the age features through a traditional classifier, so that the age estimation of the human face is realized. In recent years, with the development of deep learning technology, the deep neural network has the most advanced performance in image recognition, can realize automatic extraction of facial features, is widely applied to age estimation, and achieves the achievement superior to the traditional machine learning method.

In the prior art, the design of a deep convolutional neural network mainly focuses on a deeper or wider network to enhance the nonlinear modeling capability of a model, but a face age estimation method based on deep learning has the problem that the face age feature expression which can not well take global-local details into account, so that the feature expression capability of the CNN is limited to a certain extent. Therefore, how to realize the face age estimation feature expression which takes global and local details into consideration is one of the future face age estimation research directions.

Disclosure of Invention

In view of the above-mentioned problem of lack of global-local feature expression capability, the present invention aims to provide a method and a system for multi-scale face age estimation with embedded high-order information, which can better perform global-local age feature expression, and further enhance the nonlinear modeling capability of the model by inserting a block for extracting high-order age features into the network, thereby effectively improving the accuracy of face age estimation and realizing high-precision age estimation.

In a first aspect of the present invention, the present invention provides a multi-scale age estimation method embedded with high-order information, comprising the following steps:

a multi-scale face age estimation method embedded with high-order information comprises the following steps:

inputting a face image set with an accurate age label as a data set, and preprocessing the face image data set;

inputting the preprocessed face image into a baseline model ResNet-50, and extracting a shallow feature map through a convolution layer and a maximum pooling layer;

after the shallow feature map is extracted, four groups of residual modules which are sequentially connected are connected to form a residual network, the residual network is used as a global branch, and global features of the face image are extracted;

embedding a global second-order pooling block between the first set of residual blocks and the second set of residual blocks, thereby generating a high-dimensional global image representation in the global branch;

taking the shallow feature map as the input of a long-term and short-term memory neural network, constructing a local branch and extracting the local features of the age sensitive area;

performing joint optimization to solve a cross entropy loss function of the two branches, performing iterative training on a convolutional neural network formed by the global branch and the local branch until convergence, and storing a trained convolutional neural network model;

and inputting the face image to be detected into the trained convolutional neural network model, and calculating and outputting the final predicted age by the classifier according to the age characteristics.

In a second aspect of the present invention, the present invention provides a multi-scale face age estimation system embedded with high-order information, comprising an image acquisition module, a data preprocessing module, a data enhancement module, a neural network module and an output module;

the image acquisition module is used for inputting a data set and acquiring face image information or a face image to be detected;

the data preprocessing module is used for carrying out face detection, face alignment and cutting on the face image information or a face image to be detected and carrying out pixel normalization processing on the face image;

the data enhancement module is used for expanding the training set according to random horizontal turning, zooming, rotating and translating operations;

the neural network module is used for constructing and training a convolutional neural network formed by the global module and the local module;

preferably, a sharing module is further arranged in front of the global module and the local module, and the sharing module is used for transferring between the global module and the local module;

the global module is used for extracting and learning global features;

the local module is used for extracting and learning local features;

the output module is used for outputting the final predicted age of the face image to be detected.

The invention has the beneficial technical effects that:

(1) the invention has the effects of high speed and high precision, and can accurately estimate the age of any input face image.

(2) The invention provides a novel multi-scale feature extraction framework giving consideration to global-local information, ensures that the network can extract age features of different types (global and local details) through multi-scale feature extraction, enhances the feature characterization capability of the network, and overcomes the defects in the existing face age estimation method.

(3) According to the invention, the GSoP block used for extracting high-order age information is embedded in the age estimation network, the high-order module can capture global second-order statistical information along a channel dimension or a position dimension, and the nonlinear modeling capability of the model is stronger than that of a traditional first-order network.

Drawings

Fig. 1 is a flowchart of a multi-scale face estimation method with embedded high-order information according to an embodiment of the present invention;

FIG. 2 is a high level block diagram of an embodiment of the present invention;

FIG. 3 is a schematic diagram of a training process according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a multi-scale network embedded with high-order information according to an embodiment of the present invention;

fig. 5 is a diagram illustrating an application effect of the embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention discloses a multi-scale face age estimation method embedded with high-order information, which comprises the following steps of:

In one embodiment, the data set used in the present invention is a Morph II face age data set comprising 55134 face images from 13618 people in the age range of 16-77 years with age labels of age value taken from a controlled environment. In order to ensure the sufficiency of the training set and the reasonableness of the test set, the invention adopts the widely used protocol S1-S2-S3 on the data set to carry out experiments, specifically, the embodiment of the invention repeats all the steps included in the data set twice, wherein S1 is adopted as the training set in the first pass, and S2+ S3 is adopted as the test set; the second pass used S2 as the training set and S1+ S3 as the test set. The original image provided by the Morph II face age data set has the advantages of high quality, small noise, large quantity and the like, and is convenient for subsequent processing of experiments.

Preprocessing the Morph II dataset: the method comprises the following steps of adopting a Multi-task convolutional neural network (MTCNN) to carry out face detection on an originally acquired face image, carrying out key point alignment through eye center, nose tip and upper lip coordinates, uniformly cutting a processed image into 256 multiplied by 256 sizes, carrying out random horizontal turning, scaling, rotating (such as +/-5 degrees) and translation series of data amplification operations on a candidate training set to enhance the generalization capability of a subsequent convolutional neural network model, and carrying out pixel normalization processing on the processed face image, wherein the formula comprises the following steps:

X_pix＝(X_pix-128)/128

wherein, in the present invention, X_pixThe correspondence is the input face image pixel value, specifically, the face image pixel value input to the MTCNN network.

And sequentially transmitting the training sample images after data enhancement to a neural network, and training the network by utilizing a back propagation minimized loss function. Compared with the traditional age estimation algorithm, the baseline model ResNet-50 is adopted to reduce the size of the model and improve the accuracy of the model, the ResNet-50 adds a bypass connection (shortcut) branch outside an original convolutional layer to form a basic residual module, the original mapping H (X) is expressed as H (X) (F (X)) + x, wherein F (X) is residual mapping, x is an input signal, the learning of H (X) by the convolutional layer is converted into the learning of F (X) by a residual module structure, the learning of F (X) is simpler than that of H (X), and the structure effectively solves the attenuation problem caused by the fact that the network layer number is too deep while reducing the calculation amount.

Inputting a face image into a ResNet-50 network, performing shallow feature extraction through a convolutional layer and a maximum pooling layer to serve as an input feature map of each next branch network, specifically, a feature map with an input channel number of 3 performs feature extraction through a convolutional layer with a core size of 7 × 7, a channel number of 64 and a step size of 2, an output feature map has a size of 112 × 112 and an output channel of 64, and an output channel is 64 through a maximum pooling layer with a core size of 3 × 3 and a step size of 2, and the output feature map at the moment serves as an input feature map of each next branch.

After the shallow feature map is extracted, four groups of residual modules which are sequentially connected are connected to form a residual network, the residual network is used as a global branch, and the global features of the face image are extracted;

it can be understood that the core improvement of the present invention lies in two branch networks proposed by the present invention, namely a global branch and a local branch, and for the global branch, the core of the present invention is to make some modifications to the baseline model ResNet-50, extract a shallow feature map in the convolution layer and the maximum pooling layer of the baseline model ResNet-50, and connect four sets of sequentially connected residual modules behind the maximum pooling layer to form a residual network, and extract the global feature of the face image by using the residual network as the global branch, and on the other hand, embed a high-order module for extracting high-order age information in the global branch, and instead, in the local branch, the shallow feature map is used as the input of the LSTM, and the coordinates of the local feature are obtained by using the gate structure of the LSTM, and then the local feature map is obtained by clipping. In the present invention, if not specifically emphasized, the residual error network of the present invention mainly refers to a structure formed by a plurality of sets of residual error modules after the baseline model ResNet-50, and of course, the above division refers to only the point for more highlighting the improvement of the present invention, and those skilled in the art can adaptively understand the present invention according to the overall embodiment and the attached drawings.

In this embodiment, the convolution layer and the maximum pooling layer of the baseline model ResNet-50 are used as a shared layer, and an output characteristic diagram of the shared layer is used as an input of a dual-branch network to form a hybrid network structure composed of a global branch and a local branch, i.e., a convolutional neural network model finally obtained by the present invention;

furthermore, the global branch is composed of a residual module and a high-order embedding module.

Further, the process of constructing the global branch includes the steps of:

firstly, inputting a feature map of a shared layer into a global network branch, wherein the global network branch is formed by connecting 4 groups of residual modules in series, the number of input channels of each group of residual modules is 64, 128, 256 and 512, each residual module is formed by convolution operation, BN (Batch Normalization) operation and ReLU (modified Linear Unit) operation, the series of operations are applied to mapping of global features, and corresponding output channels of the series of operations are changed into 256, 512, 1024 and 2048;

then, a global second-order pooling block is embedded between the first group of residual modules and the second group of residual modules, and the embedding process of the global second-order pooling block comprises the following steps:

inserting a block for extracting high-order information into a residual error network, specifically, as shown in fig. 2, inputting a three-dimensional tensor of h ' × w ' × c ', and performing 1 × 1 convolution on the three-dimensional tensor to obtain a three-dimensional tensor of h ' × w ' × c; wherein h 'and w' are respectively the length and width of the input face image, c 'is the number of channels, and c is less than c';

calculating the correlation of the channels to obtain a fixed-size c × c covariance matrix, and performing row direction normalization on the covariance matrix;

performing two continuous operations of covariance matrix row convolution and Sigmoid nonlinear activation, and outputting a weight vector of c multiplied by 1;

multiplying each channel of the input tensor by a corresponding element in the weight vector to obtain a new three-dimensional tensor h 'xw' xc which is used as the input of a subsequent residual error module;

and inserting a matrix normalized covariance matrix at the end of the last residual module of the residual network to generate a final global feature representation.

In an embodiment, after the convolution operation is performed by the first residual error module, a 128 × 128 × 256 three-dimensional tensor is input, which is the length, width and channel number of the feature map, respectively, and the three-dimensional tensor is subjected to 1 × 1 convolution first to obtain a 128 × 128 × c three-dimensional tensor, where it is noted that the calculation cost can be reduced by c < c', in this embodiment, 256 is taken, and no compression parameter number operation is performed; then, calculating channel correlation to obtain a fixed-size c multiplied by c covariance matrix, and carrying out row direction normalization on the covariance matrix; then, two continuous operations of covariance matrix row convolution and Sigmoid nonlinear activation are executed, and a weight vector of c multiplied by 1 is output; each channel of the input tensor is multiplied by a corresponding element in the weight vector, and the channels are emphasized or suppressed in a soft manner, so that a new three-dimensional tensor 128 x c representing the global features is obtained.

Finally, replacing the first-order global average pooling by a second-order statistical method at the end of the network, inserting a matrix normalization covariance matrix as a final global image representation, thereby realizing the embedding of high-order information, specifically, outputting a 7 × 7 × 2048 three-dimensional tensor after performing feature mapping on the fourth residual layer of ResNet-50, adjusting the three-dimensional tensor into a feature matrix X with a dimension of 2048 and an eigenvalue of 49, and then passing through the feature matrix X

The computation of the covariance matrix employs pooling of the second order, in which

Where I and 1 are respectively an n identity matrix and a matrix of all 1 s.

In one embodiment, the method for constructing the local branch and extracting the local features of the age-sensitive region based on the Long Short-term memory neural network (LSTM) is composed of the Long Short-term memory neural network, a local region positioning module and a cutting module.

Further, the process of constructing the local branch comprises the following steps:

firstly, an output characteristic diagram of a sharing layer is input into an LSTM, an LSTM unit controls the state of the unit through a structure of a gate, the characteristics of a current image are considered, position information of other similar images is utilized, an age sensitive area is more comprehensive to be positioned, and the structure of the gate is divided into an input gate, a forgetting gate and an output gate. First, the forgetting gate is from the previous state C_prevSelect information in the output of (1), input gate and new candidate vector C generated by tanh layer_in-tanThe purpose of the multiplication, and then the combination of the two sources of information for status update, is to discard unnecessary information and add new information. Furthermore, the state output of the LSTM hidden layer is obtained using the cell state, which is held between-1 and multiplied by the output value of the output gate. The formula includes:

C_next＝forget_gate⊙C_prev+in_gate⊙C_in-tan

h_next＝out_gate⊙tanh(C_next)

C_in-tan＝tan h(W_C[h_prev，x_input]+b_C)

wherein, forget_gateForgetting gate, in, representing long-short term memory neural network LSTM_gateInput gate, out, representing LSTM_gateAn output gate representing an LSTM; the lines indicate the same or a symbol; c_prevAnd C_next、h_nextThe previous state, the current state and the hidden state of the LSTM, respectively; c_in-tanIs a candidate vector for updating the state of the cell, W_CAnd b_CRespectively representing the weight and the offset, x_inputIs the input of the LSTM;

then, S is added_nextInputting into a positioning module consisting of a convolutional layerAnd an activation function of the S type, S_nextAs input to the convolutional layer, the output is l_1-4＝L(W*S_next) Wherein l is_1-4Representing a four-dimensional vector, representing coordinates (x, y), width and height, respectively, and updating the LSTM unit block and the location block using a cross-entropy loss function strategy in the back propagation process.

Finally, the LSTM clipping module clips the position coordinates to obtain a local feature map with a size of 112 × 112, and sequentially inputs the local feature map into the following 4 residual module groups for local feature learning, wherein the number of input channels is 64, 128, 256, and 512, and the number of corresponding output channels is 256, 512, 1024, and 2048;

and (3) performing cross entropy loss solution on the global branch and the local branch jointly, performing joint optimization on the two branches through a back propagation minimum loss function, and performing iterative training on the neural network.

Further, the loss function is expressed as follows:

P_final(X_i)＝P_global+0.5P_local

wherein the content of the first and second substances,

representing loss of convolutional neural network, P_globalRepresenting the predicted age probability, P, of a sample i in a global branch_localRepresenting the predicted age probability, P, of a sample i in a global branch_final(X_i) The final predicted age of sample i is represented and n represents the number of training lumped samples of the face image.

Using an Adam optimizer to carry out training adjustment, after multiple rounds of training, leading the neural network to tend to be stable, ending the iteration process, and obtaining a trained convolutional neural network model, wherein the training process is shown as figure 3,

after an image data set is obtained, preprocessing a face image;

constructing a multi-scale network model embedded with high-order information, namely a convolutional neural network model constructed by the invention;

training the network using the data set and performing multiple iterations;

and solving the loss of the result output by the network and the real age value label corresponding to the face image until the loss tends to be stable.

At this time, the training is finished and the trained convolutional neural network model is output.

The trained convolutional neural network is shown in fig. 4.

When the trained neural network model is used, the image containing the human face is input into the trained neural network model, and the trained neural network model calculates the predicted age value of the sample according to the weight parameters obtained in advance.

A multi-scale human face age estimation system embedded with high-order information comprises an image acquisition module, a data preprocessing module, a data enhancement module, a neural network module and an output module;

the image acquisition module is used for inputting a data set and acquiring face image information or a face image to be detected; the image acquisition module is used as a data reading inlet of the whole system and is used for inputting a data set and acquiring pixels and age tags of an original image;

the data enhancement module is used for expanding the training set according to random horizontal turning, zooming, rotating and translating operations; data enhancement is carried out on the limited training set to increase the generalization capability of the model, so that the network can deal with face estimation under a more complex background such as an uncontrolled environment;

the neural network module is used for constructing and training a convolutional neural network formed by the global module and the local module; the neural network module is used for training and testing a network and is a core module of the whole system;

the global module is used for extracting and learning global features, and the local module is used for extracting and learning local features;

in a preferred embodiment, the baseline model ResNet-50 can be used as a sharing layer, provides the input of the global module and the local module, and can realize the transfer between the global module and the local module.

The output module is used for outputting the age estimation value of the face image to be detected.

The global module comprises a residual error module and a high-order module, wherein the residual error modules are sequentially connected to form a residual error network, the residual error network extracts global features of the face image, and the high-order module introduces a global second-order pooling block from a lower layer to a higher layer, so that second-order statistical information of the face image is fully utilized.

The high-order module is used for embedding high-order information and comprises: a convolution module with the size of 1 multiplied by 1, which is used for integrating the information of each channel and reducing the number of output channels at the same time so as to compress the parameter number; the covariance matrix module is used for calculating channel correlation, obtaining a covariance matrix with a fixed size, and normalizing the covariance matrix in the row direction; and the covariance convolution module is used for performing covariance matrix row convolution and Sigmoid nonlinear activation two continuous operations.

The local module comprises a long-short term memory neural network, a local area positioning module and a cutting module, wherein the long-short term memory neural network is used for updating the state, the local area positioning module is used for positioning the coordinate, the width and the height of the age sensitive area, and the cutting module cuts the local characteristic diagram according to the local position information. The invention relates to a multi-scale human face age estimation system embedded with high-order information, which comprises an image acquisition module, a data preprocessing module, a neural network module and an output module.

Fig. 5 is a face age estimation diagram of the present invention, after inputting the leftmost original face picture, preprocessing the face according to the face key point detection to highlight the age characteristics of the face image, especially identifying the distance between the five sense organs of the face; and inputting the processed picture into a multi-scale human face age estimation network embedded with high-order information for feature extraction and age estimation. It can be seen that after the global features and the local features of the face image are extracted, the age corresponding to the face can be estimated to be 22.

It can be understood that, part of features of the multi-scale face age estimation method and system embedded with high-order information of the present invention may be mutually cited, for example, a global branch in the method corresponds to a global module of the system, etc., and those skilled in the art may correspondingly understand and implement the present invention according to the embodiments of the present invention, and the present invention is not described in detail.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A multi-scale face age estimation method embedded with high-order information is characterized by comprising the following steps:

2. The method of claim 1, wherein the preprocessing of the face image dataset comprises performing face detection and face alignment using a multitask convolutional neural network, and cropping the face image to the same size, performing data enhancement on a candidate training set in the face image dataset, and performing pixel normalization on the face image according to the following formula:

X_pix＝(X_pix-128)/128

wherein, X_pixIs the input face image pixel value.

3. The method as claimed in claim 1, wherein the constructing of the convolution layer and the maximum pooling layer of the baseline model ResNet-50 comprises inputting a face image into the ResNet-50, outputting a shallow feature map of the face image through the convolution layer with convolution kernel size of 7 x 7 and step size of 2, and outputting a feature map of 112 x 112, and then outputting the shallow feature map through the maximum pooling layer.

4. The method as claimed in claim 1, wherein after extracting the shallow feature map, the extracted shallow feature map is sequentially passed through four different residual error modules, each residual error module group sequentially includes residual error modules of 3, 4, 6, and 3, the output dimensions of the residual error modules in each group are different, and the sizes of the output feature maps are 56 × 56, 28 × 28, 14 × 14, and 7 × 7.

5. The method for estimating the age of a multi-scale face embedded with high-order information as claimed in claim 1, wherein the embedding process of the global second-order pooling block comprises:

inserting a block for extracting high-order information into a residual error network, specifically, inputting a three-dimensional tensor of h ' × w ' × c ', and performing 1 × 1 convolution on the three-dimensional tensor to obtain a three-dimensional tensor of h ' × w ' × c; wherein h 'and w' are respectively the length and width of the input face image, c 'is the number of channels, and c is less than c';

and inserting a matrix normalization covariance matrix at the tail end of the last residual module of the residual network to generate the final global feature representation of the face image.

6. The multi-scale face age estimation method embedded with high-order information as claimed in claim 1, wherein the process of constructing local branches based on long-short term memory neural network and extracting local features of age sensitive region comprises:

the long-short term memory neural network automatically keeps the position information of other face images similar to the current face image through a long-term and short-term storage mechanism to realize the positioning function, and the calculation formula comprises the following steps:

C_next＝forget_gate⊙C_prev+in_gate⊙C_in-tan

h_next＝out_gate⊙tanh(C_next)

C_in-tan＝tan h(W_C[h_prev，x_input]+b_C)

wherein, forget_gateA forgetting gate representing the long-short term memory neural network LSTM,in_gateinput gate, out, representing LSTM_gateAn output gate representing an LSTM; the lines indicate the same or a symbol; c_prevAnd C_next、h_nextThe previous state, the current state and the hidden state of the LSTM, respectively; c_in-tanIs a candidate vector for updating the state of the cell, W_CAnd b_CRespectively representing the weight and the offset, x_inputIs the input of the LSTM;

generating coordinates, width and height of the local area box sensitive to age by status update, the formula comprising:

l_1-4＝L(W*S_next)

wherein l_1-4Represents a four-dimensional vector, representing coordinates (x, y), width and height, S, respectively_nextFor the joint output of LSTM, W is the total parameter, L (. eta.) represents the convolution function;

and cutting according to the position coordinates to obtain local features of the age sensitive area, and sequentially inputting the local features into four groups of residual error modules in the residual error network for local feature learning.

7. The method for estimating age of a multi-scale face embedded with high-order information as claimed in claim 1, wherein the cross entropy loss function is expressed as follows:

P_final(X_i)＝P_global+0.5P_local

wherein the content of the first and second substances,

8. A multi-scale human face age estimation system embedded with high-order information is characterized by comprising an image acquisition module, a data preprocessing module, a data enhancement module, a neural network module and an output module;

9. The system according to claim 8, wherein the global module comprises a residual module and a high-order module, the residual module is used for extracting global features of the face image, and the high-order module introduces a global second-order pooling block from a lower layer to a higher layer, so that second-order statistical information of the face image is fully utilized.

10. The system of claim 8, wherein the local modules include a long-short term memory neural network for updating status, a local area location module for locating coordinates, width and height of the age-sensitive area, and a clipping module for clipping the local feature map according to the local location information.