CN115565051B - Lightweight face attribute recognition model training method, recognition method and device - Google Patents

Lightweight face attribute recognition model training method, recognition method and device

Info

Publication number
CN115565051B
Authority
CN
China
Prior art keywords
face
face attribute
training
recognition model
extraction network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211421512.6A
Other languages
Chinese (zh)
Other versions
CN115565051A (en)
Inventor
郭理鹏
陆金刚
王为
方伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Xinsheng Electronic Technology Co Ltd
Original Assignee
Zhejiang Xinsheng Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Xinsheng Electronic Technology Co Ltd filed Critical Zhejiang Xinsheng Electronic Technology Co Ltd
Priority to CN202211421512.6A priority Critical patent/CN115565051B/en
Publication of CN115565051A publication Critical patent/CN115565051A/en
Application granted granted Critical
Publication of CN115565051B publication Critical patent/CN115565051B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a lightweight face attribute recognition model training method, a recognition method, a computer device and a storage medium. The training method preprocesses an acquired face data set to form a face training image set, then constructs, based on the face attribute data, a feature extraction network comprising a plurality of sequentially connected structure blocks, where the output feature map of each structure block is a multi-dimensional tensor fusing the input feature map information, channel information obtained by transforming the input feature map, and spatial position information. A predicted probability value for each face attribute category is obtained from the output of the feature extraction network, and the loss function used for that category in the error loss calculation is selected according to the relationship between the predicted probability value and a probability threshold hyperparameter. The constructed feature extraction network is trained on the error loss output by the selected loss function to obtain the face attribute recognition model.

Description

Lightweight face attribute recognition model training method, recognition method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a lightweight face attribute recognition model training method, a recognition method, a computer device, and a storage medium.
Background
With the rapid development of society, fast and effective automatic identity authentication has become increasingly important in the security field. Face attribute recognition is the most direct means of identity verification: compared with other human biometric features it is direct, accurate and efficient, is more readily accepted by users, and is hard for subjects to notice. It has become an important auxiliary technique in fields such as intelligent monitoring, public security systems and security verification systems.
Like other biometric recognition technologies, face attribute recognition first extracts features and then classifies the extracted features. Traditional machine learning methods extract features from face training images using methods such as Local Binary Patterns (LBP) and the Scale-Invariant Feature Transform (SIFT), then feed the features into classifiers such as a Support Vector Machine (SVM) or a Decision Tree (DT) to obtain the face attribute result. These methods improve face attribute recognition performance to a certain extent, but they are easily affected by factors such as environment, illumination and posture, and it is difficult for them to achieve good results in industrial applications.
At present, face attribute recognition methods based on deep learning dominate; a robust, high-precision model is trained on a large number of face training images in standard poses. These methods fall into two main classes. In the first, each attribute is recognized by a separate model; running multiple models simultaneously occupies a large amount of resources and slows recognition, and such methods are especially hard to deploy well on edge devices with limited computing resources. In the second, face attribute features are usually extracted with depthwise separable convolutions or a deep residual network. Because a deep residual network has many fixed feature layers, a face attribute model trained with it has a large parameter count and memory footprint and is difficult to apply on a low-power chip. A depthwise separable convolutional network can reduce the parameter count and computation to a certain extent, but when the model is quantized for deployment on a chip, the separable convolutions suffer a large precision loss, introducing errors into model inference and seriously harming recognition accuracy. In addition, when a multi-branch model predicts multiple attributes simultaneously, every picture in the data set must be labeled with all the attributes. In real scenes some attributes occur with low probability, so with large data volumes the data become severely imbalanced, and partial labeling errors and low data-set quality arise, leading to difficult model training, model overfitting, low recognition accuracy and difficulty in bringing the algorithm to production.
Disclosure of Invention
In order to overcome at least one defect of the prior art, the invention provides a lightweight face attribute recognition model training method, a recognition method, a computer device and a storage medium.
To achieve the above object, the invention provides a training method for a lightweight face attribute recognition model, comprising:
preprocessing the acquired face data set to form a face training image set, wherein each face training image in the set carries corresponding face attribute data and a label value;
constructing, based on the face attribute data, a feature extraction network fusing input information, channel information and spatial position information to extract features of the face training images; the feature extraction network comprises a plurality of sequentially connected structure blocks; the input feature map of each structure block is transformed to generate a first feature map, and the first feature map is aggregated along two mutually perpendicular spatial dimensions to obtain channel information and spatial position information, respectively; the obtained channel information and spatial position information are embedded into the first feature map to form a second feature map; the second feature map is fused with the input feature map of the structure block to form an output feature map that is a multi-dimensional tensor;
determining, from the output of the feature extraction network, the loss function each face attribute category uses when calculating the loss error; if the predicted probability value of a face attribute category is smaller than a probability threshold hyperparameter, a first loss function containing the probability threshold hyperparameter is selected to calculate the error loss between the predicted probability value of that category and the label value; otherwise a second loss function is selected to calculate the error loss of that category;
training the constructed feature extraction network according to the error loss to obtain a face attribute recognition model, the probability threshold hyperparameter and the numbers of structure blocks and channels in the feature extraction network being dynamically updated during training.
According to an embodiment of the invention, the output feature map of the last structure block in the feature extraction network is converted to lower dimensionality: keeping the dimension that indexes the input images, the remaining dimensions are flattened into a one-dimensional vector and output to a fully connected layer, which outputs the predicted values for all face attribute categories.
According to an embodiment of the invention, based on the types of face attributes in the face attribute data and the categories corresponding to each attribute, the predicted probability value of each face attribute category is obtained from the predicted values, output by the feature extraction network, that contain all face attribute categories.
According to an embodiment of the invention, the input feature map of each structure block is convolved, regularized and transformed by a nonlinear activation function to form a first feature map.
According to an embodiment of the invention, the first loss function introduces a probability threshold hyperparameter on the basis of the second loss function to attenuate the loss weight; the first and second loss functions each contain an equalization hyperparameter for balancing positive and negative samples and a difficulty hyperparameter for balancing simple and difficult samples.
According to an embodiment of the invention, the probability threshold hyperparameter is dynamically updated during training with a period of a preset number of epochs, i.e. full traversals of the face training image set.
According to one embodiment of the invention, each time the probability threshold hyperparameter is updated, the parameters of the feature extraction network are saved as a candidate recognition model; a test set is input into the candidate recognition models, and the candidate with the best prediction accuracy on the test set is selected as the trained face attribute recognition model.
The invention also provides a lightweight face attribute identification method, which comprises the following steps:
acquiring a face image to be recognized;
and carrying out face attribute recognition on the face image to be recognized by using the face attribute recognition model obtained by training by using the lightweight face attribute recognition model training method to obtain a recognition result.
The invention also provides computer equipment which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the lightweight face attribute recognition model training method or the lightweight face attribute recognition method when executing the computer program.
In another aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the lightweight face attribute recognition model training method or the lightweight face attribute recognition method.
In summary, the lightweight face attribute recognition model training method and recognition method provided by the invention use a single model to recognize multiple facial attributes. When the feature extraction network is constructed, the first feature map formed by transforming the input of each structure block is aggregated along two spatial dimensions to generate channel information and spatial position information, respectively. The second feature map formed by embedding these two kinds of information considers both the relationships between feature map channels and the spatial position information, so the model can better locate and identify the target while the parameter count and computation of the model are effectively reduced. Meanwhile, the input feature map of each block is fused again with the second feature map to retain the input feature map information, effectively compensating for the loss of input information caused by the transformations applied during feature extraction and improving recognition accuracy.
In addition, on the data side, selecting the loss function based on the probability threshold hyperparameter and dynamically adjusting that hyperparameter during training effectively alleviates the model overfitting and low recognition accuracy caused by problematic training samples, greatly improving the recognition accuracy of the model. The training method and recognition method of the lightweight face attribute recognition model provided by the invention can be deployed on low-compute edge devices while maintaining high recognition accuracy.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
FIG. 1 is an application scenario diagram of a lightweight face attribute recognition model training method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a lightweight face attribute recognition model training method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a feature extraction network module.
Fig. 4 is a schematic diagram illustrating the principle of each module in the feature extraction network.
Fig. 5 is a schematic specific flowchart of step S30 in fig. 2.
Fig. 6 is a schematic diagram corresponding to fig. 5.
Fig. 7 is a schematic diagram illustrating a specific flowchart of step S40 in fig. 2.
Fig. 8 is a schematic structural diagram of a lightweight face attribute recognition model training device according to an embodiment of the present invention.
Fig. 9 is a schematic flow chart of a lightweight face attribute identification method according to an embodiment of the present invention.
Fig. 10 is a diagram showing an internal structure of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The training method for the lightweight face attribute recognition model provided by the embodiment can be applied to an application environment shown in fig. 1. Wherein the terminal 101 communicates with the server 102 via a network. The server 102 receives a model training instruction sent by the terminal 101, and the server 102 responds to the model training instruction to obtain a face training image set, wherein the face training image set comprises a plurality of face training images, and each face training image is correspondingly marked with face attribute data and a label value. The server 102 constructs a feature extraction network based on the face attribute data of each face training image and inputs a plurality of face training images in the face training image set into the constructed feature extraction network. The server 102 continuously trains the feature extraction network based on the error loss between the predicted probability value and the label value of each face attribute category output by the feature extraction network, and takes the trained feature extraction network as a face attribute recognition model. The terminal 101 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 102 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In an embodiment, as shown in fig. 2, a lightweight face attribute recognition model training method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
and step S10, acquiring a face data set. The face data set is a set comprising a plurality of images with different facial attributes, and the facial attributes of the images in the face data set can be understood as various types, such as various attributes including different sexes, different ages, different expressions, whether glasses are worn or not, whether a mask is worn or not, and the like. Specifically, the face data set is obtained through the following steps: and acquiring a face image from the starting database and the monitoring scene data and cleaning the face image to obtain a face data set. Specifically, the terminal 102 may carry a link of open source data in the issued training instruction so that the server can collect the face data set based on the link.
And step S20, preprocessing the acquired face data set to form a face training image set, wherein each face training image in the face training image set carries corresponding face attribute data and a label value.
Step S201: perform detection on each face image in the face data set to obtain the face region and key points of each image. Taking the positional relationship among the key points of a face in the standard pose as reference, the face region is rectified according to the key points detected on the face image, so that each acquired face image is converted to the standard pose and scaled to a uniform size to form a face training image.
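As one concrete illustration of step S201, a minimal alignment sketch follows, assuming OpenCV, a five-point landmark detector and a 112×112 output size; the reference landmark template and the crop size are illustrative assumptions, not fixed by this embodiment.

```python
import cv2
import numpy as np

# Assumed reference landmark positions (eyes, nose tip, mouth corners) for a
# frontal "standard pose" face inside a 112x112 crop.
REFERENCE_5PTS = np.float32([
    [38.3, 51.7], [73.5, 51.5], [56.0, 71.7], [41.5, 92.4], [70.7, 92.2]
])

def align_face(image: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    """Warp a detected face so its keypoints match the standard-pose template."""
    # Estimate a similarity transform from detected keypoints to the template.
    matrix, _ = cv2.estimateAffinePartial2D(landmarks.astype(np.float32),
                                            REFERENCE_5PTS)
    # Rectify the face region and scale it to the uniform training size.
    return cv2.warpAffine(image, matrix, (112, 112))
```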
Step S202: perform attribute labeling on each face training image to form the corresponding face attribute data and label values. The face attribute data includes the number of face attributes and the categories contained in each attribute. In this embodiment the number of attributes is five: gender (male, female), glasses (wearing glasses, not wearing glasses), expression (smiling, not smiling), age (child, young, old) and mask (wearing mask, not wearing mask); that is, the numbers of categories per attribute are n1 = 2 (gender), n2 = 2 (glasses), n3 = 2 (expression), n4 = 3 (age) and n5 = 2 (mask).
(mask). However, the present invention does not set any limit to the number of face attributes and the category of each attribute. In other embodiments, different types and numbers of attributes may be selected according to different application scenarios.
Each attribute category is encoded to form a corresponding label value, e.g. for gender: female is 0, male is 1; encoding and labeling according to this rule generates the label value corresponding to each attribute category. The face training images carrying the face attribute data and label values are stored in a database of the server to form the face training image set.
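A minimal sketch of this encoding rule follows; the attribute names and category order in the dictionary are illustrative, only the 0/1 coding for gender is given by the text above.

```python
ATTRIBUTE_CLASSES = {
    "gender":     ["female", "male"],            # n1 = 2
    "glasses":    ["no_glasses", "glasses"],     # n2 = 2
    "expression": ["not_smiling", "smiling"],    # n3 = 2
    "age":        ["child", "young", "old"],     # n4 = 3
    "mask":       ["no_mask", "mask"],           # n5 = 2
}

def encode_labels(annotation: dict) -> list[int]:
    """Map each annotated attribute category to its integer label value."""
    return [ATTRIBUTE_CLASSES[attr].index(annotation[attr])
            for attr in ATTRIBUTE_CLASSES]

# e.g. encode_labels({"gender": "male", "glasses": "no_glasses",
#                     "expression": "smiling", "age": "young",
#                     "mask": "no_mask"})  ->  [1, 0, 1, 1, 0]
```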
When the server 102 receives the model training instruction, it obtains the face training image set from the database in response to the instruction. Step S30: based on the number of face attributes M in the face attribute data and the number of categories ni of each attribute, construct a feature extraction network fusing input information, channel information and spatial position information to extract features of the face training images. This embodiment describes the construction of the feature extraction network in detail on the basis of the RegNet network structure. However, the invention places no limit on the choice of this underlying network; in other embodiments, other neural network structures comprising a plurality of sequentially connected structure blocks may be chosen as the model basis, such as a CNN.
As shown in fig. 3 and 4, the RegNet network mainly consists of three parts: a trunk (Stem), a body (Body) and a head (Head). The trunk and head are fixed: the trunk is an ordinary convolutional layer with a 3×3 kernel and a stride of 2, and the head is a classifier composed of global pooling and a fully connected layer. Most important is the body, which consists of a stack of 4 stages (Stages), each stage consisting of a series of structure blocks (Blocks) stacked in turn.
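This stem/body/head layout can be sketched as follows, assuming PyTorch; the per-stage block counts and channel widths are placeholders that this embodiment tunes during training, and Block stands in for the structure block detailed in steps S301 to S306 below.

```python
import torch.nn as nn

class Block(nn.Module):
    """Placeholder structure block; the attention-augmented version is
    sketched after step S306 below."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.conv(x)

class RegNetSkeleton(nn.Module):
    def __init__(self, stage_blocks=(1, 2, 4, 2),
                 stage_channels=(32, 64, 128, 256), num_outputs=11):
        super().__init__()
        # Stem: fixed 3x3 convolution with stride 2.
        self.stem = nn.Sequential(
            nn.Conv2d(3, stage_channels[0], 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(stage_channels[0]), nn.ReLU(inplace=True))
        # Body: 4 stages, each a stack of structure blocks.
        stages, in_ch = [], stage_channels[0]
        for depth, ch in zip(stage_blocks, stage_channels):
            stages += [Block(in_ch if i == 0 else ch, ch) for i in range(depth)]
            in_ch = ch
        self.body = nn.Sequential(*stages)
        # Head: global pooling followed by a fully connected classifier.
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(stage_channels[-1], num_outputs))

    def forward(self, x):
        return self.head(self.body(self.stem(x)))
```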
The steps for constructing the feature extraction network based on the RegNet network structure provided in this embodiment are described in detail below with reference to figs. 3 and 6.
After obtaining the underlying RegNet model:
step S301, inputting characteristic diagram of each structure block
Figure 193380DEST_PATH_IMAGE006
Performing transformation processing to form a first feature map
Figure 257151DEST_PATH_IMAGE007
. In this embodiment, the transformation process includes convolution, regularization, and Relu activation function. However, the present invention is not limited thereto.
Step S302, a first characteristic diagram
Figure 633816DEST_PATH_IMAGE007
Feature aggregation is performed along two spatial dimensions, horizontal and vertical, respectively, and the encoded height is
Figure 833853DEST_PATH_IMAGE008
To (1) a
Figure 233742DEST_PATH_IMAGE009
The output of each channel is represented as:
Figure 909443DEST_PATH_IMAGE010
has a width of
Figure 177613DEST_PATH_IMAGE011
To (1) a
Figure 373102DEST_PATH_IMAGE009
The output of each channel is represented as:
Figure 334105DEST_PATH_IMAGE012
wherein, the first and the second end of the pipe are connected with each other,
Figure 998567DEST_PATH_IMAGE013
indicating the second in the width direction
Figure 680215DEST_PATH_IMAGE013
A plurality of coordinate points;
Figure 120423DEST_PATH_IMAGE014
inputting the first feature map on the c channel;
Figure 111382DEST_PATH_IMAGE015
indicating the first in the height direction
Figure 512408DEST_PATH_IMAGE015
A plurality of coordinate points;
Figure 122380DEST_PATH_IMAGE007
is a first characteristic diagram of the light source,
Figure 774685DEST_PATH_IMAGE016
and
Figure 77491DEST_PATH_IMAGE017
are respectively a first characteristic diagram
Figure 434654DEST_PATH_IMAGE007
C is the channel index in the first characteristic diagram,
Figure 848318DEST_PATH_IMAGE018
and
Figure 856594DEST_PATH_IMAGE019
width and height of the first profile, respectively.
To the first feature map by the above two transformations
Figure 330300DEST_PATH_IMAGE007
In the horizontal and vertical directionsAnd the direction is aggregated to obtain a characteristic diagram of the perception channel information in the horizontal dimension, and obtain a characteristic diagram of the perception space position information in the vertical dimension.
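In tensor terms, these two aggregations are simply means over the width and height axes; a sketch assuming a PyTorch tensor of shape [S, C, H, W]:

```python
import torch

F = torch.randn(8, 64, 56, 56)           # first feature map of one block
z_h = F.mean(dim=3, keepdim=True)        # [S, C, H, 1]: averages over width W
z_w = F.mean(dim=2, keepdim=True)        # [S, C, 1, W]: averages over height H
# z_h keeps, per channel, one value for each height coordinate (channel plus
# vertical position); z_w does the same along the horizontal direction.
```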
Step S303: perform intermediate feature mapping on the two acquired perceptual feature maps. Specifically, the two perceptual feature maps obtained in step S302 are concatenated, transformed by a 1×1 convolution, and processed with a nonlinear activation function to form an intermediate feature map f that encodes the channel information and spatial position information in the horizontal and vertical directions:

$$ f = \delta\big(F_1([z^h, z^w])\big) $$

where [·,·] is the concatenation operation along the spatial dimension, F_1 is a 1×1 convolution, and δ is the nonlinear activation function; the ReLU activation function is used here.
Step S304: decompose the intermediate feature map f into two tensors f^h and f^w along the two spatial dimensions, horizontal and vertical. Two 1×1 convolutions F_h and F_w transform f^h and f^w, respectively, into tensors with the same number of channels as the first feature map F:

$$ g^h = \sigma\big(F_h(f^h)\big) $$

$$ g^w = \sigma\big(F_w(f^w)\big) $$

where σ is the sigmoid activation function.
Step S305: the two transformed tensors g^h and g^w, which represent the channel information and spatial position information, are embedded into the first feature map F to output the second feature map Y:

$$ Y_c(h, w) = F_c(h, w) \times g_c^h(h) \times g_c^w(w) $$

Step S306: after the second feature map Y is obtained, it is fused with the input feature map X of the structure block to obtain an output feature map that is a multi-dimensional tensor containing the input feature map information, channel information and spatial position information. In this embodiment, the second feature map Y is added to the input feature map X, and the sum is passed through the nonlinear activation function ReLU to give the output feature map:

$$ \mathrm{Out}_c = \mathrm{ReLU}(Y_c + X_c) $$

where c is the channel index and ReLU is the activation function.
The feature extraction network provided by this embodiment builds on a RegNet network that models the relationships between feature map channels, and embeds spatial position information so that the relationships between channels and the feature space are considered simultaneously (in figs. 3 and 4, CA denotes the embedding of channel information and spatial position information in each Block); thus every weight parameter in the second feature map contains both inter-channel information and spatial position information. Considering channel and spatial information together greatly improves the network's ability to locate target information, improves recognition accuracy, and greatly reduces the parameter count and computation. Furthermore, the second feature map is fused again with the input feature map, so the output feature map of each structure block is a four-dimensional tensor [S, C, H, W], where S is the number of face training images input per training step; C is the number of feature map channels, characterizing the channel information; and H and W are the feature map height and width, characterizing the spatial position information. Fusing the second feature map with the input feature map preserves the information of the input feature map, effectively avoiding the information loss caused by the feature map transformations applied while embedding spatial position information, and improves the comprehensiveness and integrity of feature extraction to further guarantee recognition accuracy. However, the invention places no limit on the specific form of the output feature map; in other embodiments, the output feature map may include several dimensions characterizing multiple kinds of input feature map information.
As shown in figs. 3 and 4, within one stage the output of each structure block is the input of the next structure block, and the output of the last structure block in a stage is input to the first structure block of the next stage. The output of the last structure block in the last stage forms the output of the body, which is connected into the head.
In this embodiment, step S307: the output feature map of the last structure block in the feature extraction network is converted to lower dimensionality; keeping the dimension that indexes the input images, the remaining dimensions are flattened into a one-dimensional vector and output to a fully connected layer, which outputs the predicted values for all face attribute categories. Specifically, the four-dimensional tensor [S, C, H, W] output by the last structure block in the last stage is expanded from the first dimension S, the later dimensions being converted into a one-dimensional vector, giving [S, C×H×W]. This two-dimensional tensor [S, C×H×W] is then used as the input of a fully connected layer with N outputs, producing the output vector [S, N], where N is the sum of the category counts ni over the M face attributes. As stated above, the sum of all categories of the five face attributes is N = 2 + 2 + 2 + 3 + 2 = 11, so the output vector [S, 11] contains the predicted values of the 11 face attribute categories. However, the invention places no limit on the face attribute categories or on the number of categories each attribute contains.
Step S40: obtain the predicted probability value of each face attribute category from the output of the feature extraction network, and determine the loss function used for that category when calculating the error loss, based on the relationship between the predicted probability value and the probability threshold hyperparameter. Since the fully connected layer of the feature extraction network constructed in step S30 outputs the predicted value of each face attribute category, this embodiment provides an implementation of step S40, shown in fig. 7, to determine that relationship, as follows:
Step S401: convert the predicted value of each face attribute category into a corresponding predicted probability. Each attribute has two or three output categories: for the gender attribute the two categories are male and female; likewise the glasses, expression and mask attributes each have two output categories, while the age attribute has the three categories child, young and old. During model training and recognition only one output category per attribute is confirmed correct, so a softmax function converts the predicted values of each attribute's categories into corresponding predicted probability values. In this embodiment the five attributes have 11 predicted values, and the softmax functions produce the corresponding 11 predicted probability values.
After the predicted probability values of all face attribute categories are obtained, the error loss between the predicted probability value of each category and its corresponding label value must be calculated to guide the optimization of the model. Because each face image contains multiple attributes and each attribute contains multiple categories, the face attribute data set is a multi-attribute data set, which is very prone to problems such as data imbalance, partially incorrect labels and low data-set quality; error loss computed with the existing cross-entropy loss function then depends excessively on the classes with few samples, causing overfitting. In view of this, this embodiment introduces, on the basis of the FocalLoss function, an equalization hyperparameter, a difficulty hyperparameter and a probability threshold hyperparameter θ to form a first loss function and a second loss function, and selects the first or second loss function according to the relationship between the predicted probability value and θ, which greatly alleviates the data imbalance, partial labeling errors and low quality of the multi-attribute data set. The specific steps are as follows:
Step S402: judge whether the predicted probability value of each face attribute category obtained in step S401 is smaller than the probability threshold hyperparameter θ. If the predicted probability value p_t of a face attribute category is smaller than θ, step S403 is performed: the first loss function, which contains θ, is selected to calculate the error loss between the predicted probability value p_t of that category and its label value, the loss weight being attenuated on the basis of θ. If step S402 finds that the predicted probability value of a face attribute category is greater than or equal to θ, the second loss function is selected to calculate the error loss of that category. The first loss function introduces the probability threshold hyperparameter θ on the basis of the second loss function; the two are expressed as follows:
$$
L(p_t) =
\begin{cases}
-\dfrac{p_t}{\theta}\,\alpha\,(1-p_t)^{\gamma}\,\log(p_t), & p_t < \theta \quad \text{(first loss function)} \\[6pt]
-\alpha\,(1-p_t)^{\gamma}\,\log(p_t), & p_t \ge \theta \quad \text{(second loss function)}
\end{cases}
$$

where L is the error loss; p_t is the predicted probability value of the face attribute category after the softmax function, 0 < p_t < 1; α and γ are hyperparameters, α being the equalization hyperparameter that balances positive and negative samples, 0 ≤ α ≤ 1, and γ the difficulty hyperparameter that balances simple and difficult samples, γ ≥ 0; θ is the probability threshold hyperparameter (the attenuation factor p_t/θ shown above is one concrete form of the weight attenuation described here). When the predicted probability value p_t of a face attribute category is smaller than θ, the loss weight is attenuated, effectively reducing the influence of the small number of erroneous samples on the model.
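A sketch of the loss selection of steps S402 and S403 follows, assuming PyTorch; the p_t/θ attenuation factor and the α, γ defaults are assumptions consistent with, but not fixed by, the description above.

```python
import torch

def face_attribute_loss(p_t: torch.Tensor, alpha: float = 0.25,
                        gamma: float = 2.0, theta: float = 0.5) -> torch.Tensor:
    """p_t: probability predicted for the true category of one attribute."""
    # Second loss function: the FocalLoss term.
    focal = -alpha * (1.0 - p_t) ** gamma * torch.log(p_t)
    # First loss function: attenuate the weight when p_t < theta, so a few
    # mislabeled samples cannot dominate the gradient (assumed p_t/theta form).
    attenuated = (p_t / theta) * focal
    return torch.where(p_t < theta, attenuated, focal).mean()
```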
An initial value is set for the probability threshold hyperparameter θ, and k denotes the number of times θ has been updated, with k = 0 initially. As the number of training iterations increases, θ is gradually increased as a function of k so as to progressively reduce the effect of mislabeled samples on the model. Specifically, θ is dynamically updated during training with a period of a preset number of epochs, i.e. full traversals of the face training image set; for example, after every 50 training epochs k increases by 1, and θ is dynamically adjusted accordingly.
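A sketch of this periodic update; the initial value, step size and ceiling are assumptions, and only the 50-epoch period and the monotone increase follow from the description above.

```python
UPDATE_PERIOD = 50     # epochs between threshold updates

def update_theta(epoch: int, theta0: float = 0.3, step: float = 0.05,
                 ceiling: float = 0.7) -> float:
    k = epoch // UPDATE_PERIOD          # number of completed update periods
    return min(theta0 + k * step, ceiling)
```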
Step S50: train the constructed feature extraction network according to the error loss, dynamically updating the probability threshold hyperparameter and the numbers of structure blocks and channels in the feature extraction network during training, to obtain the face attribute recognition model. Specifically, the loss errors calculated by the first and second loss functions in step S40 are fed to an Adam optimizer with an initial learning rate of 0.001 and a weight decay coefficient of 0.0005. The training parameters are set as: initial learning rate 0.001, batch size 128, and 1000 training epochs. The pre-training model constructed in step S30 is loaded, the face attribute data and label values of the face training image set from step S10 are used as input, and steps S20 to S50 are repeated to train the model until a face attribute recognition model meeting the requirements is obtained.
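Combining the pieces above, the optimization loop of step S50 might be sketched as follows, assuming PyTorch; train_set is an assumed dataset object yielding (image, labels) pairs with labels as a [S, 5] integer tensor, and the helper functions are the sketches given earlier.

```python
import torch

model = RegNetSkeleton()   # the feature extraction network built in step S30
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

for epoch in range(1000):
    theta = update_theta(epoch)                    # dynamic threshold
    for images, labels in loader:                  # labels: [S, 5], long dtype
        logits = model(images)
        probs = per_attribute_probs(logits)
        # Gather the probability of the labelled category per attribute and
        # accumulate the thresholded focal loss over the five attributes.
        loss = sum(face_attribute_loss(p.gather(1, labels[:, i:i+1]).squeeze(1),
                                       theta=theta)
                   for i, p in enumerate(probs))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```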
The model constructed in step S30 of this embodiment is based on the RegNet network; exploiting the adjustability of the number of structure blocks (Blocks) and channels in the model, the number of blocks and channels in each stage and the probability threshold hyperparameter θ are dynamically adjusted during training to obtain the model with the best recognition accuracy. In this embodiment, each time θ is updated, the parameters of the feature extraction network at that moment are saved as a candidate recognition model; for example, θ is updated after every 50 training epochs and the feature extraction network parameters at that point are stored to form a candidate recognition model. A face image test set obtained in advance is input into the candidate recognition models, and the candidate with the best prediction accuracy on the test set is selected as the trained face attribute recognition model. The face test images in the test set are obtained with the acquisition and preprocessing of steps S10 and S20. However, the invention places no limit on how the optimal face attribute recognition model is determined; in other embodiments, when the numbers of structure blocks and channels of the underlying network cannot be adjusted, the optimal face attribute recognition model can be selected according to the convergence of the loss function and the test accuracy.
In one embodiment, after the face attribute recognition model is obtained through training, the face attribute recognition model can be used for face attribute recognition. Specifically, a face image to be recognized is acquired, and the face image to be recognized is input to the face attribute recognition model. The facial attribute recognition model determines a plurality of attributes in the facial image, such as attributes of different sexes, different glasses, different expressions, different ages, different masks and the like, by performing feature extraction and feature classification on the facial image to be recognized.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In an embodiment, as shown in fig. 8, the embodiment further provides a lightweight face attribute recognition model training device, including:
the acquisition module 10 acquires a face data set.
And the preprocessing module 20 is used for preprocessing the acquired face data set to form a face training image set, wherein each face training image in the face training image set carries corresponding face attribute data and a label value.
The network construction module 30 constructs a feature extraction network fusing input information, channel information and spatial position information based on the face attribute data to extract features of the face training images; the feature extraction network comprises a plurality of sequentially connected structure blocks; the input feature map of each structure block is transformed to generate a first feature map, and the first feature map is aggregated along two mutually perpendicular spatial dimensions to obtain channel information and spatial position information, respectively; the obtained channel information and spatial position information are embedded into the first feature map to form a second feature map; and the second feature map is fused with the input feature map of the structure block to form an output feature map that is a multi-dimensional tensor.
The loss function determining module 40 determines, from the output of the feature extraction network, the loss function each face attribute category uses when calculating the loss error; if the predicted probability value of a face attribute category is smaller than the probability threshold hyperparameter, a first loss function containing the probability threshold hyperparameter is selected to calculate the error loss between the predicted probability value of that category and the label value; otherwise a second loss function is selected to calculate the error loss of that category.
the training module 50 trains the constructed feature extraction network according to the error loss to obtain a face attribute recognition model; and dynamically updating the probability threshold value hyperparameter and the number of the structural blocks in the feature extraction network in the training process.
In an embodiment, the network construction module 30 further converts the output feature map of the last structure block in the feature extraction network to lower dimensionality: keeping the dimension that indexes the input images, the remaining dimensions are flattened into a one-dimensional vector and output to a fully connected layer, which outputs the predicted values for all face attribute categories.
In one embodiment, based on the types of face attributes in the face attribute data and the categories corresponding to each attribute, the loss function determining module 40 further obtains the predicted probability value of each face attribute category from the predicted values, output by the feature extraction network, that contain all face attribute categories.
In one embodiment, the training module 50 dynamically updates the probability threshold hyperparameters during the training process with a preset number of passes through the training set of face training images as a period.
In one embodiment, each time the probability threshold hyperparameter is updated, the training module 50 saves the feature extraction network parameters as a candidate recognition model; the test set is input into the candidate recognition models, and the candidate with the best prediction accuracy on the test set is selected as the trained face attribute recognition model.
For the specific limitation of the training apparatus for the lightweight face attribute recognition model, reference may be made to the above limitation on the training method for the lightweight face attribute recognition model, and details are not described here again. All modules in the lightweight face attribute recognition model training device can be completely or partially realized through software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In an embodiment, as shown in fig. 9, there is further provided a lightweight face attribute identification method, where the identification method includes:
step S100, a face image to be recognized is obtained, where the face image to be recognized may be an image obtained by shooting through a camera on a monitoring device or an electronic terminal.
And S200, correcting the face image to be recognized to a standard face posture based on the preprocessing step of the step S20 in the model training method.
And step S300, loading the face attribute recognition model obtained by training the lightweight face attribute recognition model training method, and inputting the preprocessed face image to be recognized to obtain a face attribute recognition result.
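A sketch of this recognition flow, reusing the preprocessing and model sketches above; camera_frame and detected_landmarks are assumed inputs from an upstream face detector.

```python
import torch

model.eval()
with torch.no_grad():
    face = align_face(camera_frame, detected_landmarks)       # step S200
    x = torch.from_numpy(face).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    probs = per_attribute_probs(model(x))                     # step S300
    result = {attr: classes[p.argmax(dim=1).item()]
              for (attr, classes), p in zip(ATTRIBUTE_CLASSES.items(), probs)}
# e.g. {'gender': 'male', 'glasses': 'no_glasses', ...}
```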
FIG. 10 is a diagram that illustrates an internal structure of the computer device in one embodiment. The computer device may specifically be the server 102 in fig. 1. As shown in fig. 10, the computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program, which, when executed by the processor, causes the processor to implement the lightweight face attribute recognition model training method. The internal memory may also store a computer program, and when the computer program is executed by the processor, the computer program may enable the processor to execute a lightweight face attribute recognition model training method.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, the lightweight face attribute recognition model training apparatus provided in the present application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in fig. 10. The memory of the computer device may store various program modules constituting the lightweight face attribute recognition model training apparatus, such as the obtaining module 10, the preprocessing module 20, the network construction module 30, the loss function determination module 40 and the training module 50 shown in fig. 8. The program modules constitute computer programs that cause the processor to execute the steps of the lightweight face attribute recognition model training method of the embodiments of the present application described in the present specification.
In one embodiment, a computer device is provided, which includes a memory and a processor, the memory storing a computer program, which when executed by the processor, causes the processor to perform the steps of the above-mentioned lightweight face attribute recognition model training method. Here, the steps of the lightweight face attribute recognition model training method may be steps in the lightweight face attribute recognition model training methods of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, which stores a computer program, and when the computer program is executed by a processor, the computer program causes the processor to execute the steps of the training method for the lightweight face attribute recognition model. The steps of the training method for the lightweight face attribute recognition model may be steps in the training method for the lightweight face attribute recognition model in the foregoing embodiments.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the lightweight face attribute recognition method described above. Here, the steps of the lightweight face attribute identification method may be steps in the lightweight face attribute identification methods of the above embodiments.
In one embodiment, a computer readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the steps of the lightweight face attribute recognition method described above. Here, the steps of the lightweight face attribute identification method may be steps in the lightweight face attribute identification methods of the foregoing embodiments.
In summary, for the problems of face attribute algorithm models and their deployment, the lightweight face attribute recognition model training method and recognition method provided by the invention propose a feature extraction network based on RegNet fused with feature map spatial position information. After fusion, the feature extraction network considers both the relationships between feature map channels and the position information of the feature space. While effectively reducing the model's parameter count and computation, this solves the large quantization precision loss, model errors and low recognition accuracy caused by the conventional use of depthwise separable convolutions, effectively improving the performance of the face attribute algorithm on low-compute edge devices.
For the problems of face attribute data, the method improves FocalLoss with a combination of multiple hyperparameters. Optimizing on the basis of the FocalLoss loss, it adjusts the balance between samples and the weight of hard-to-classify samples, effectively handling the classification difficulty caused by imbalanced face attribute sample data and low face quality. The probability threshold hyperparameter is dynamically adjusted as the training epochs progress to reduce the influence of mislabeled samples on the model. This effectively resolves the complex data preprocessing, model overfitting and low recognition accuracy caused by imbalanced samples, low image quality and a small number of erroneous labels in the training set.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A training method for a lightweight face attribute recognition model is characterized by comprising the following steps:
preprocessing the acquired face data set to form a face training image set, wherein each face training image in the face training image set carries corresponding face attribute data and a label value;
constructing, based on the face attribute data, a feature extraction network fusing input information, channel information and spatial position information to extract features of the face training images; the feature extraction network comprises a plurality of sequentially connected structure blocks; an input feature map of each structure block is transformed to generate a first feature map, and the first feature map is aggregated along two mutually perpendicular spatial dimensions to obtain channel information and spatial position information respectively; the obtained channel information and spatial position information are embedded into the first feature map to form a second feature map; and the second feature map is fused with the input feature map of the structure block to form an output feature map of a multi-dimensional tensor;
determining, according to the output of the feature extraction network, the loss function selected for each face attribute category when calculating loss errors; if the predicted probability value of a face attribute category is smaller than a probability threshold hyperparameter, selecting a first loss function containing the probability threshold hyperparameter to calculate the error loss between the predicted probability value of the attribute category and the label value; otherwise, selecting a second loss function to calculate the error loss of the attribute category;
training the constructed feature extraction network according to the error loss to obtain a face attribute recognition model; wherein the probability threshold hyperparameter, the number of structure blocks and the number of channels in the feature extraction network are dynamically updated during the training process.
2. The training method of the lightweight face attribute recognition model according to claim 1, wherein the output feature map of the last structure block in the feature extraction network is subjected to dimensionality reduction: starting from the dimension in which the input feature map information is located, the remaining dimensions are flattened into a one-dimensional vector and output to a fully connected layer, and the fully connected layer outputs predicted values covering all face attribute categories.
3. The training method of the lightweight face attribute recognition model according to claim 2, wherein the predicted probability value of each face attribute category is obtained from the predicted values covering all face attribute categories output by the feature extraction network, based on the types of face attributes in the face attribute data and the categories corresponding to each attribute.
4. The training method of the lightweight face attribute recognition model according to claim 1, wherein the input feature map of each structure block is transformed into the first feature map after convolution, regularization and a nonlinear activation function.
5. The training method of the lightweight face attribute recognition model according to claim 1, wherein the first loss function introduces the probability threshold hyperparameter on the basis of the second loss function to attenuate the loss weight; the first loss function and the second loss function each comprise a balance hyperparameter for balancing positive and negative samples and a difficulty hyperparameter for balancing simple and hard samples, both of which are hyperparameters.
6. The training method of the lightweight face attribute recognition model according to claim 1, wherein the probability threshold hyperparameter is dynamically updated during the training process, with a preset number of rounds of traversing the face training image set as the update cycle.
7. The training method of the lightweight face attribute recognition model according to claim 6, wherein each time the probability threshold hyperparameter is updated, the parameters of the feature extraction network are saved as a candidate recognition model; and a test set is input into the plurality of candidate recognition models, and the candidate recognition model with the best prediction accuracy on the test set is selected as the trained face attribute recognition model.
8. A lightweight face attribute identification method is characterized by comprising the following steps:
acquiring a face image to be recognized;
carrying out face attribute recognition on the face image to be recognized by using the face attribute recognition model trained with the lightweight face attribute recognition model training method according to any one of claims 1 to 7, to obtain a recognition result.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN202211421512.6A 2022-11-15 2022-11-15 Lightweight face attribute recognition model training method, recognition method and device Active CN115565051B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211421512.6A CN115565051B (en) 2022-11-15 2022-11-15 Lightweight face attribute recognition model training method, recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211421512.6A CN115565051B (en) 2022-11-15 2022-11-15 Lightweight face attribute recognition model training method, recognition method and device

Publications (2)

Publication Number Publication Date
CN115565051A CN115565051A (en) 2023-01-03
CN115565051B true CN115565051B (en) 2023-04-18

Family

ID=84769736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211421512.6A Active CN115565051B (en) 2022-11-15 2022-11-15 Lightweight face attribute recognition model training method, recognition method and device

Country Status (1)

Country Link
CN (1) CN115565051B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368672A (en) * 2020-02-26 2020-07-03 苏州超云生命智能产业研究院有限公司 Construction method and device for genetic disease facial recognition model
WO2021068487A1 (en) * 2019-10-12 2021-04-15 深圳壹账通智能科技有限公司 Face recognition model construction method, apparatus, computer device, and storage medium
CN112766176A (en) * 2021-01-21 2021-05-07 深圳市安软科技股份有限公司 Training method of lightweight convolutional neural network and face attribute recognition method
CN113920571A (en) * 2021-11-06 2022-01-11 北京九州安华信息安全技术有限公司 Micro-expression identification method and device based on multi-motion feature fusion
CN114693963A (en) * 2021-12-15 2022-07-01 全球能源互联网研究院有限公司 Recognition model training and recognition method and device based on electric power data feature extraction
CN114821736A (en) * 2022-05-13 2022-07-29 中国人民解放军国防科技大学 Multi-modal face recognition method, device, equipment and medium based on contrast learning

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Bozhen Hu. A lightweight spatial and temporal multi-feature fusion network for defect detection. IEEE Transactions on Image Processing. 2020, full text. *
Michael Weber. Automated Focal Loss for Image based Object Detection. arXiv:1904.09048v1. 2019, full text. *
Jiang Kaiyong; Gan Junying; Tan Haiying. Face beauty prediction model based on deep learning and its application. Journal of Wuyi University (Natural Science Edition). 2018, (02), full text. *
Yin Qian. Face detection algorithm based on a lightweight neural network. Journal of Changzhou College of Information Technology. 2019, (06), full text. *
Li Ya; Zhang Yunan; Peng Cheng; Yang Junqin; Liu Miao. Face attribute recognition method based on multi-task learning. Computer Engineering. 2020, (03), full text. *
Ge Bailin. Research and application of object detection algorithms based on lightweight deep convolutional neural networks. CNKI Master's Electronic Journals. 2022, full text. *

Also Published As

Publication number Publication date
CN115565051A (en) 2023-01-03

Similar Documents

Publication Publication Date Title
CN109902546B (en) Face recognition method, face recognition device and computer readable medium
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
WO2019100724A1 (en) Method and device for training multi-label classification model
CN109271958B (en) Face age identification method and device
CN109840531A (en) The method and apparatus of training multi-tag disaggregated model
CN111368672A (en) Construction method and device for genetic disease facial recognition model
CN110222718B (en) Image processing method and device
CN109902192B (en) Remote sensing image retrieval method, system, equipment and medium based on unsupervised depth regression
CN110414541B (en) Method, apparatus, and computer-readable storage medium for identifying an object
CN111832581B (en) Lung feature recognition method and device, computer equipment and storage medium
US20230316733A1 (en) Video behavior recognition method and apparatus, and computer device and storage medium
CN113505797B (en) Model training method and device, computer equipment and storage medium
CN112699941B (en) Plant disease severity image classification method, device, equipment and storage medium
CN113221645B (en) Target model training method, face image generating method and related device
Dai et al. Hybrid deep model for human behavior understanding on industrial internet of video things
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN113705596A (en) Image recognition method and device, computer equipment and storage medium
CN113434699A (en) Pre-training method of BERT model, computer device and storage medium
CN114821736A (en) Multi-modal face recognition method, device, equipment and medium based on contrast learning
CN111275005A (en) Drawn face image recognition method, computer-readable storage medium and related device
CN109101984B (en) Image identification method and device based on convolutional neural network
CN115565051B (en) Lightweight face attribute recognition model training method, recognition method and device
CN116758379A (en) Image processing method, device, equipment and storage medium
CN113516182B (en) Visual question-answering model training and visual question-answering method and device
CN112699809B (en) Vaccinia category identification method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant