CN115063862A - Age estimation method based on feature contrast loss - Google Patents
- Publication number
- CN115063862A (Application No. CN202210731136.4A)
- Authority
- CN
- China
- Prior art keywords
- constructing
- network
- sub
- window
- head
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/178—Human faces, e.g. facial parts, sketches or expressions estimating age from face image; using age information for improving recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
Abstract
The invention discloses an age estimation method based on feature contrast loss, belonging to the field of computer vision. First, an attention mechanism is selected as the basic structure of the feature extraction network, and an offset window transformation network built on the attention mechanism serves as the backbone for extracting robust age features from face images. Then a distance estimation network that computes relative distances between features is designed; a feature-based contrast loss guides the feature space to preserve the ordinal constraint relation of the label space, so that tail features can exploit the information of head features. This improves the prediction accuracy on tail data and alleviates the long-tail distribution problem in age estimation.
Description
Technical Field
The invention belongs to the field of machine learning and relates to age estimation from face images; it mainly addresses the long-tail distribution problem in the age estimation task.
Background
The long-tail distribution phenomenon is widespread in real-world data sets, and machine learning models that rely on such data for training are often affected by it: the model's fitting error on tail data is far larger than on head data. For example, in face attribute analysis, the age distribution of existing age data sets is markedly long-tailed: a large amount of data falls in the middle-age range, while only a few samples exist in the infant and elderly age ranges. A deep model trained on such a long-tailed age data set gives accurate predictions for the middle-age group but has large errors for infants and the elderly, which is an urgent problem in current long-tail age regression.
Existing solutions to the long-tail distribution problem fall into two categories: data-based methods and model-based methods. Data-based methods include resampling and re-weighting. Resampling can be realized by undersampling the head data or oversampling the tail data, but this may cause overfitting on the tail data and fail to fully exploit the large amount of head data. Re-weighting constructs different loss weights for samples according to the label distribution over the whole training set, usually assigning smaller weights to head data and larger weights to tail data; when the data volume is huge, re-weighting may make optimization unstable. Model-based methods include two-stage methods, transfer learning, etc. A two-stage method first trains a feature extraction network on the actual data set, then freezes its parameters and retrains the prediction head with a re-weighting method; transfer learning models the head data and tail data separately so that knowledge of the head data is transferred to the tail data. These methods apply to any long-tail distribution, but they do not adequately account for the differences and connections between the long-tail regression task and the long-tail classification task. Reference: Zhou B, Cui Q, Wei X S, et al. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition [C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 9719-9728.
Unlike the long-tail classification task, where no constraint relation exists among the class labels, a long-tail regression task such as age estimation has an ordinal relation among the age labels. To exploit this relation, a label-smoothing and feature-smoothing method was proposed for long-tail regression analysis: it mines the correlations of data between the label space and the feature space, first performing smooth transformations on both spaces so that adjacent labels can make full use of each other's features, and then combining general long-tail methods on this basis, finally reducing the error on tail data. The method starts from the two dimensions of label space and feature space and offers a new research direction for solving long-tail regression. Reference: Yang Y, Zha K, Chen Y, et al. Delving into deep imbalanced regression [C]. International Conference on Machine Learning, PMLR, 2021: 11842-11851.
Aiming at the large error on tail data in age estimation, the invention proposes an age estimation method based on feature contrast loss that improves the estimation accuracy on tail data.
Disclosure of Invention
The invention provides an age estimation method based on feature contrast loss to address the large tail-data error caused by long-tail distribution in the age estimation task.
The invention consists of three parts: a feature extraction network, a distance estimation network, and a prediction head. The feature extraction network adopts a Swin Transformer structure (an offset window transformation network) and extracts features from face images; the distance estimation network accepts a pair of features as input and outputs the distance between them, which is used to compute the feature-based contrast loss; the prediction head receives a single feature as input and outputs the corresponding predicted value, which is used to compute the L2 loss between the true value and the predicted value. The training process first samples a batch and computes each sample's features through the feature extraction network; then every two features are paired and the pairs are input into the distance estimation network to compute the feature-based contrast loss, while each single feature is input into the prediction head to compute the L2 loss; finally, the feature-based contrast loss and the L2 loss are combined, and all parameters of the whole model are optimized simultaneously through backpropagation. In the testing stage, a sample first passes through the feature extraction network to obtain its features, and the prediction head then produces the predicted value from the features. In this way, the feature contrast loss introduced on top of the L2 loss lets the features of head and tail data correct each other and preserves the ordinal constraint relation between labels in the feature space, improving the model's fitting ability on tail data. The overall structure of the method is shown in FIG. 1.
For convenience in describing the present disclosure, certain terms are first defined.
Definition 1: softmax function. The softmax function normalizes a vector x so that each element lies in the range (0, 1) and all elements sum to 1 after normalization. The normalized value of the i-th element can be expressed as softmax(x)_i = exp(x_i) / Σ_{k=1}^{K} exp(x_k), where K is the total number of elements of the vector x.
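Definition 1 can be checked numerically; the following is a generic NumPy illustration of the softmax function, not code from the patent:

```python
import numpy as np

def softmax(x):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    e = np.exp(x - np.max(x))
    return e / e.sum()

probs = softmax(np.array([1.0, 2.0, 3.0]))
```

Every element of `probs` lies in (0, 1) and the elements sum to 1, as the definition requires.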
Definition 2: attention is paid to the mechanism. The attention mechanism is a method for transforming features, and usually requires that the features are mapped into query, key and value 3 modules, which are abbreviated as Q, K, V; then calculating the matching degree of the query and the key; and finally, carrying out weighted output with value, wherein the attention mechanism used by the invention can be expressed as: attention (Q, K, V) ═ softmax (QK) T ) V, the schematic structural diagram of which is shown in FIG. 2.
Definition 3: a multi-head attention mechanism. Using different mappings for characteristics to obtain a plurality of groups of different query, key and value modules, respectively calculating an attention mechanism for each group Q, K, V, and then cascading and linearly transforming the attention results of each group to realize a multi-head attention mechanism which can be expressed as MultiHead (Q, K, V) ═ Concat (head) 1 ,…,head h )W o Head therein h Attention results for group h are shown.
Definition 4: and (5) layer normalization. Layer Normalization (LN) is to normalize all neurons in a certain Layer, and to scale and translate after Normalization. The layer normalization can be expressed as:where μ, σ denotes the mean and variance of all neurons in the layer, and γ, β denotes the scaling and translation parameters.
Definition 5: the GELU activation function. The expression of the GELU activation function is GELU (x) ═ x × Φ (x), where Φ (x) is the cumulative distribution function of the standard gaussian distribution.
Definition 6: the ReLU function. The ReLU function expression is ReLU (x) max (0, x).
Definition 7: the scatter function. The Flatten function transforms the shape of the tensor, and expands the high-dimensional tensor into a one-dimensional vector.
Therefore, the technical scheme of the invention is an age estimation method based on feature contrast loss, which comprises the following steps:
step 1: preprocessing the data set;
firstly, acquire an image data set for age estimation, perform face alignment on the images, and normalize them to [-1, 1]; then randomly divide the data into a training set and a test set; the images in the training set are randomly cropped and randomly mirror-flipped, while the images in the test set are only center-cropped, to the same size as the training images;
Step 2: constructing a feature extraction network;
1) constructing an area embedding unit;
firstly, the image is divided into sub-regions: the image obtained in step 1 is divided into a number of a×a sub-regions; each sub-region is then convolved with an a×a convolution kernel at stride a, and the result is normalized with layer normalization;
2) constructing window division;
on the basis of the sub-regions, two forms of window division are performed, using division mode one and division mode two; division mode one selects adjacent b×b sub-regions as one window and partitions the whole image accordingly; division mode two translates the windows by half a window size to the right and downward on the basis of division mode one and cyclically shifts the top-left window;
3) constructing a multilayer perceptron module;
the multilayer perceptron module consists of two fully connected layers; the first fully connected layer is followed by GELU activation, and the second uses no activation function;
4) constructing an offset window module;
the offset window module consists of two consecutive multi-head attention modules: one computes multi-head attention under division mode one and is denoted W-MSA; the other computes multi-head attention under division mode two and is denoted SW-MSA;
5) constructing a sub-region merging module;
the sub-region merging module is a down-sampling module that merges adjacent a×a sub-regions into one sub-region while enlarging the channel dimension of the features by a factor of a;
6) constructing a feature extraction network;
firstly, the input image is fed into the region embedding unit to obtain features; then the features are transformed and extracted by several offset window modules, followed by a sub-region merging module that reduces the feature dimensions; the offset window modules and sub-region merging modules are stacked repeatedly to construct the offset window transformation network; finally, the output features are flattened and linearly transformed;
Step 3: constructing a distance estimation network;
the distance estimation network consists of 3 fully connected layers; the first two fully connected layers are followed by ReLU activation, and the last has no activation;
Step 4: constructing a prediction head;
the prediction head consists of 2 fully connected layers; the first fully connected layer is followed by softmax activation, and the second has no activation;
Step 5: determining the loss function;
1) constructing the feature contrast loss;
first, construct image-label sample pairs {(x_1, y_1), …, (x_n, y_n), …, (x_N, y_N)} from the training set of step 1, where x_n denotes an image, y_n denotes its label, and N denotes the total number of samples; then input the images of the sample pairs into the feature extraction network to obtain the corresponding features (f_1, …, f_N). With the distance estimation network denoted DE, the feature contrast loss is:

L_con = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} | DE(f_i, f_j) − |y_i − y_j| |
2) constructing a prediction loss;
if the prediction head is denoted G, the prediction loss L_pred is calculated by:

L_pred = (1/N) Σ_{i=1}^{N} (G(f_i) − y_i)²

where G(f_i) denotes the prediction result for feature f_i;
3) constructing a total loss function;
the total loss function is the weighted sum of the feature contrast loss and the prediction loss with weight coefficient λ, and has the form:

L_total = L_pred + λ · L_con
step 6: training network parameters;
performing network training by using the total loss function constructed in the step 5, and updating parameters of the feature extraction network, the distance estimation network and the prediction head;
Step 7: testing stage; select the feature extraction network and the prediction head trained in step 6; for a given picture, first input it into the feature extraction network to extract features, then input the features into the prediction head to obtain the predicted age.
The innovations of the invention are:
1) features of the face image are extracted with an attention-based offset window transformation network structure, yielding a more robust feature representation;
2) a feature-based contrast loss function is proposed: a distance estimation module is introduced to compute the distance between two features, and the feature-based contrast loss constrains the feature space to preserve the ordinal constraint relation of the label space, so that the features of tail data can acquire information from the features of head data, improving the accuracy of age estimation on tail data.
Drawings
FIG. 1 is a schematic diagram of the network architecture of the method of the present invention;
FIG. 2 is a schematic view of the attention mechanism of the present invention;
FIG. 3 is a schematic diagram of a region embedding unit according to the present invention;
FIG. 4 is a schematic diagram of window division according to the present invention;
FIG. 5 is a schematic diagram of a multi-layered perceptron of the present invention;
FIG. 6 is a schematic diagram of an offset window according to the present invention;
FIG. 7 is a schematic diagram of a feature extraction network according to the present invention;
FIG. 8 is a schematic diagram of a distance estimation network according to the present invention;
FIG. 9 is a diagram of the prediction head according to the present invention.
Detailed description of embodiments:
step 1: preprocessing the data set;
acquiring a MOPRPH II data set, wherein the MORPPH II data set is an age estimation data set and comprises 55134 images; firstly, carrying out face alignment on an image, and normalizing the image to [ -1,1 ]; then randomly selecting 80% of data as a training set, and taking the rest 20% as a test set; the images in the training set were randomly cropped to 224 x 224 size and randomly mirror-flipped, and the images in the test set were only cropped in the center, again to 224 x 224 size.
Step 2: constructing a feature extraction network;
1) Constructing a region embedding unit (Patch Embedding). The image is first divided into sub-regions: the 224×224 image is divided into 56×56 sub-regions of size 4×4; each sub-region is then convolved with a 4×4 convolution kernel at stride 4 and normalized with layer normalization. The structure of the region embedding unit is shown in FIG. 3.
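A 4×4 convolution applied at stride 4 is equivalent to cutting the image into non-overlapping 4×4 patches and projecting each flattened patch with one linear map; the sketch below uses that equivalence, with 96 as an assumed embedding dimension:

```python
import numpy as np

def patch_embed(img, W, a=4):
    # Split an H x W x C image into non-overlapping a x a patches and
    # project each flattened patch with W -- equivalent to an a x a
    # convolution applied at stride a.
    H, Wd, C = img.shape
    h, w = H // a, Wd // a
    patches = img.reshape(h, a, w, a, C).transpose(0, 2, 1, 3, 4).reshape(h * w, a * a * C)
    return patches @ W

rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))
W = rng.normal(size=(4 * 4 * 3, 96))   # 96 is an assumed embedding dimension
tokens = patch_embed(img, W)           # 56 * 56 = 3136 tokens
```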
2) Constructing the window division. On the basis of the sub-regions, two forms of window division are performed, denoted division mode one and division mode two; division mode two translates the windows by half a window size to the right and downward on the basis of division mode one and cyclically shifts the top-left window. Division modes one and two correspond to the left and right sub-images of FIG. 4, where the thin boxes denote sub-regions, the thick boxes denote windows, and the numbers denote sub-region labels.
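The two division modes can be illustrated on a small grid of sub-region labels; the cyclic shift of division mode two is what `np.roll` performs (the 8×8 grid and window size 4 are illustrative, not the patent's configuration):

```python
import numpy as np

def window_partition(grid, b):
    # Cut an (H, W) grid of sub-region labels into non-overlapping b x b windows.
    H, W = grid.shape
    return grid.reshape(H // b, b, W // b, b).transpose(0, 2, 1, 3).reshape(-1, b, b)

grid = np.arange(64).reshape(8, 8)            # 8 x 8 sub-regions, window size 4
windows_one = window_partition(grid, 4)       # division mode one

# Division mode two: translate by half a window (2 sub-regions) right and down,
# realized as a cyclic shift so the top-left content wraps around.
shifted = np.roll(grid, shift=(-2, -2), axis=(0, 1))
windows_two = window_partition(shifted, 4)
```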
3) Constructing a multilayer perceptron module. The multilayer perceptron module (MLP) consists of two fully connected layers; the first is followed by GELU activation, and the second uses no activation function. The structure of the MLP module is shown in FIG. 5.
4) Constructing an offset window module. The offset window module consists of two consecutive multi-head attention blocks that differ only in the windows used when computing attention, i.e., the two division methods of division mode one and division mode two; the multi-head attention computed under division modes one and two is denoted W-MSA and SW-MSA, respectively. The offset window module applies layer normalization before each multi-head attention, and the multi-head attention output is again layer-normalized and nonlinearly transformed. The structure of the offset window module is shown in FIG. 6.
5) Constructing a sub-region merging module. The sub-region merging module (Patch Merge) is a down-sampling module that merges adjacent 2×2 sub-regions into one sub-region while doubling the channel dimension of the features.
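Merging 2×2 sub-regions while doubling the channels amounts to concatenating the four C-dimensional vectors of each 2×2 group into a 4C vector and projecting it to 2C; a sketch with assumed sizes:

```python
import numpy as np

def patch_merge(x, W):
    # Concatenate each 2 x 2 group of C-dim features into a 4C vector,
    # then project to 2C: resolution halves, channels double.
    H, Wd, C = x.shape
    merged = x.reshape(H // 2, 2, Wd // 2, 2, C).transpose(0, 2, 1, 3, 4)
    merged = merged.reshape(H // 2, Wd // 2, 4 * C)
    return merged @ W

rng = np.random.default_rng(0)
x = rng.normal(size=(56, 56, 96))          # assumed feature map size
W = rng.normal(size=(4 * 96, 2 * 96))      # 4C -> 2C projection
y = patch_merge(x, W)
```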
6) Constructing the feature extraction network. The offset window transformation network first feeds the input image into the region embedding unit to obtain features; the features are then transformed and extracted by several offset window modules, followed by a sub-region merging module that reduces the feature dimensions; the offset window modules and sub-region merging modules are stacked repeatedly to construct the offset window transformation network; finally, the output features are flattened and linearly transformed. The structure of the feature extraction network is shown in FIG. 7.
Step 3: constructing a distance estimation network. The distance estimation network consists of 3 fully connected layers; the first two are followed by ReLU activation, and the last has no activation. The structure of the distance estimation network is shown in FIG. 8.
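The 3-layer distance estimation network can be sketched as a small MLP; the layer widths and feature dimension here are assumptions for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def distance_estimator(fi, fj, params):
    # ReLU after the first two fully connected layers, no activation on the
    # last; the feature pair is concatenated into a single input vector.
    (W1, b1), (W2, b2), (W3, b3) = params
    h = relu(np.concatenate([fi, fj]) @ W1 + b1)
    h = relu(h @ W2 + b2)
    return (h @ W3 + b3).item()

rng = np.random.default_rng(0)
d = 96   # assumed feature dimension
params = [(0.1 * rng.normal(size=(2 * d, 64)), np.zeros(64)),
          (0.1 * rng.normal(size=(64, 64)), np.zeros(64)),
          (0.1 * rng.normal(size=(64, 1)), np.zeros(1))]
dist = distance_estimator(rng.normal(size=d), rng.normal(size=d), params)
```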
Step 4: constructing a prediction head. The prediction head consists of 2 fully connected layers; the first is followed by softmax activation, and the second has no activation. The structure of the prediction head is shown in FIG. 9.
Step 5: designing the loss function;
1) Constructing the feature contrast loss. First construct image-label pairs {(x_1, y_1), …, (x_N, y_N)} from the training set in step 1, then input the images into the feature extraction network to obtain the corresponding features (f_1, …, f_N). With the distance estimation network denoted DE, the feature contrast loss is:

L_con = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} | DE(f_i, f_j) − |y_i − y_j| |
The feature contrast loss computes two distances: DE(f_i, f_j) estimates the distance between features f_i and f_j, and |y_i − y_j| is the L1 distance between the two corresponding labels. Optimizing the feature distance to be close to the label distance lets the features exchange information and preserves the ordinal constraint relation of the label space.
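Under this description, the feature contrast loss can be written as the average over feature pairs of |DE(f_i, f_j) − |y_i − y_j||; the pairwise averaging below is an assumed normalization:

```python
import numpy as np

def feature_contrast_loss(feats, labels, DE):
    # Pull the estimated feature distance DE(f_i, f_j) toward the L1
    # distance |y_i - y_j| of the corresponding labels, averaged over pairs.
    N = len(labels)
    total = 0.0
    for i in range(N):
        for j in range(N):
            total += abs(DE(feats[i], feats[j]) - abs(labels[i] - labels[j]))
    return total / (N * N)

# Toy check: with a perfect estimator the loss vanishes (here each "feature"
# is simply the age itself, so the true distance is recoverable).
feats = np.array([[25.0], [30.0], [70.0]])
labels = np.array([25.0, 30.0, 70.0])
loss = feature_contrast_loss(feats, labels, lambda fi, fj: abs(fi[0] - fj[0]))
```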
2) Constructing the prediction loss. With the prediction head denoted G, the prediction loss is calculated by:

L_pred = (1/N) Σ_{i=1}^{N} (G(f_i) − y_i)²
3) Constructing the total loss function. The total loss function is the weighted sum of the feature contrast loss and the prediction loss with weight coefficient λ, and has the form:

L_total = L_pred + λ · L_con
step 6: training network parameters; performing network training by using the total loss function constructed in the step 5, and updating parameters of the feature extraction network, the distance estimation network and the prediction head;
and 7: and in the testing stage, the trained feature extraction network and the prediction head in the step 6 are selected, for a given picture, the feature extraction network is firstly input to extract features, and then the features are input into the prediction head to obtain the predicted age.
Claims (1)
1. An age estimation method based on feature contrast loss, the method comprising:
step 1: preprocessing the data set;
firstly, acquire an image data set for age estimation, perform face alignment on the images, and normalize them to [-1, 1]; then randomly divide the data into a training set and a test set; the images in the training set are randomly cropped and randomly mirror-flipped, while the images in the test set are only center-cropped, to the same size as the training images;
step 2: constructing a feature extraction network;
1) constructing an area embedding unit;
firstly, the image is divided into sub-regions: the image obtained in step 1 is divided into a number of a×a sub-regions; each sub-region is then convolved with an a×a convolution kernel at stride a, and the result is normalized with layer normalization;
2) constructing window division;
on the basis of the sub-regions, two forms of window division are performed, using division mode one and division mode two; division mode one selects adjacent b×b sub-regions as one window and partitions the whole image accordingly; division mode two translates the windows by half a window size to the right and downward on the basis of division mode one and cyclically shifts the top-left window;
3) constructing a multilayer perceptron module;
the multilayer perceptron module consists of two fully connected layers; the first fully connected layer is followed by GELU activation, and the second uses no activation function;
4) constructing an offset window module;
the offset window module consists of two consecutive multi-head attention modules: one computes multi-head attention under division mode one and is denoted W-MSA; the other computes multi-head attention under division mode two and is denoted SW-MSA;
5) constructing a subregion merging module;
the sub-region merging module is a down-sampling module that merges adjacent a×a sub-regions into one sub-region while enlarging the channel dimension of the features by a factor of a;
6) constructing a feature extraction network;
firstly, the input image is fed into the region embedding unit to obtain features; then the features are transformed and extracted by several offset window modules, followed by a sub-region merging module that reduces the feature dimensions; the offset window modules and sub-region merging modules are stacked repeatedly to construct the offset window transformation network; finally, the output features are flattened and linearly transformed;
Step 3: constructing a distance estimation network;
the distance estimation network consists of 3 fully connected layers; the first two fully connected layers are followed by ReLU activation, and the last has no activation;
Step 4: constructing a prediction head;
the prediction head consists of 2 fully connected layers; the first fully connected layer is followed by softmax activation, and the second has no activation;
Step 5: determining the loss function;
1) constructing the feature contrast loss;
first, construct image-label sample pairs {(x_1, y_1), …, (x_n, y_n), …, (x_N, y_N)} from the training set of step 1, where x_n denotes an image, y_n denotes its label, and N denotes the total number of samples; then input the images of the sample pairs into the feature extraction network to obtain the corresponding features (f_1, …, f_N). With the distance estimation network denoted DE, the feature contrast loss is:

L_con = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} | DE(f_i, f_j) − |y_i − y_j| |
2) constructing a prediction loss;
if the prediction head is denoted G, the prediction loss L_pred is calculated by:

L_pred = (1/N) Σ_{i=1}^{N} (G(f_i) − y_i)²

where G(f_i) denotes the prediction result for feature f_i;
3) constructing a total loss function;
the total loss function is the weighted sum of the feature contrast loss and the prediction loss with weight coefficient λ, and has the form:

L_total = L_pred + λ · L_con
Step 6: training network parameters;
performing network training by using the total loss function constructed in the step 5, and updating parameters of the feature extraction network, the distance estimation network and the prediction head;
Step 7: testing stage; select the feature extraction network and the prediction head trained in step 6; for a given picture, first input it into the feature extraction network to extract features, then input the features into the prediction head to obtain the predicted age.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210731136.4A CN115063862B (en) | 2022-06-24 | 2022-06-24 | Age estimation method based on feature contrast loss |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115063862A true CN115063862A (en) | 2022-09-16 |
CN115063862B CN115063862B (en) | 2024-04-23 |
Family
ID=83202233
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210731136.4A Active CN115063862B (en) | 2022-06-24 | 2022-06-24 | Age estimation method based on feature contrast loss |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115063862B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140300758A1 (en) * | 2013-04-04 | 2014-10-09 | Bao Tran | Video processing systems and methods |
US20170351905A1 (en) * | 2016-06-06 | 2017-12-07 | Samsung Electronics Co., Ltd. | Learning model for salient facial region detection |
CN108171209A (en) * | 2018-01-18 | 2018-06-15 | 中科视拓(北京)科技有限公司 | Face age estimation method based on metric learning with convolutional neural networks |
CN112950631A (en) * | 2021-04-13 | 2021-06-11 | 西安交通大学口腔医院 | Age estimation method based on saliency-map constraints and lateral cephalometric X-ray images |
CN114038055A (en) * | 2021-10-27 | 2022-02-11 | 电子科技大学长三角研究院(衢州) | Image generation method based on contrast learning and generation countermeasure network |
Non-Patent Citations (4)
Title |
---|
HONGYU PAN et al.: "Revised Contrastive Loss for Robust Age Estimation from Face", 2018 24th International Conference on Pattern Recognition (ICPR), 29 November 2018 (2018-11-29) *
LILI PAN, MINGMING MENG, YAZHOU REN, YALI ZHENG, ZENGLIN XU: "Self-Paced Deep Regression Forests with Consideration of Ranking Fairness", Computer Vision and Pattern Recognition, 11 June 2022 (2022-06-11) *
MENG MINGMING: "Research on Deep Discriminative Models for Facial Attribute Analysis", China Master's Theses Full-text Database, Information Science and Technology, 15 January 2023 (2023-01-15) *
LI DAXIANG; MA XUAN; REN YAQIONG; LIU YING: "Age Estimation Algorithm Based on Deep Cost-Sensitive CNN", Pattern Recognition and Artificial Intelligence, no. 02, 15 February 2020 (2020-02-15) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||