CN111967382A - Age estimation method, and training method and device of age estimation model - Google Patents


Info

Publication number
CN111967382A
CN111967382A (application number CN202010822523.XA)
Authority
CN
China
Prior art keywords
frame
video
frames
age
age estimation
Prior art date
Legal status
Pending
Application number
CN202010822523.XA
Other languages
Chinese (zh)
Inventor
苏驰
李凯
刘弘也
王育林
Current Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd filed Critical Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN202010822523.XA priority Critical patent/CN111967382A/en
Publication of CN111967382A publication Critical patent/CN111967382A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/178Human faces, e.g. facial parts, sketches or expressions estimating age from face image; using age information for improving recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an age estimation method, a training method of an age estimation model, and a corresponding device. The method comprises: acquiring multiple video frames containing a human face, the multiple video frames having a time sequence; sequentially inputting each video frame into a pre-trained age estimation model to obtain an output result corresponding to each video frame, wherein the age estimation model is configured to: according to the time sequence of the input video frames, for each video frame other than the first, determine the output result of the current video frame from the features of the current video frame and the features of the video frame preceding it; and determining the age of the face based on the output results of the multiple video frames. When the method is used to estimate the age of the person whose face appears in a video, the features of video frames at different times can be fused, so that the age estimation model extracts richer and more comprehensive temporal feature information, which improves the accuracy and stability of the age estimation.

Description

Age estimation method, and training method and device of age estimation model
Technical Field
The invention relates to the technical field of video processing, in particular to an age estimation method, an age estimation model training method and an age estimation model training device.
Background
Age is an important face attribute with significant applications in human-computer interaction, intelligent commerce, security monitoring, entertainment and other fields. In the related art, age estimation can be performed by a trained deep learning model. Such a model is usually trained on single face images and can accurately estimate the age of the person in a face image. However, with the development of video technology, the people appearing in videos also need age estimation, and because the same face can differ considerably across different frames of a video, the deep learning model struggles to produce a stable and accurate age estimation result when applied to video.
Disclosure of Invention
The invention aims to provide an age estimation method, an age estimation model training method and an age estimation model training device, so as to improve the accuracy and stability of estimating the age of a person in a video.
In a first aspect, an embodiment of the present invention provides an age estimation method, where the method includes: acquiring a plurality of frames of video frames containing human faces, wherein the plurality of frames of video frames have time sequence, and the human faces contained in the plurality of frames of video frames belong to the same person; sequentially inputting each frame of video frame into an age estimation model which is trained in advance to obtain an output result corresponding to each frame of video frame, wherein the age estimation model is used for: according to the time sequence of input multi-frame video frames, for each frame of video frames except the first frame, determining the output result of the current video frame according to the characteristics of the current video frame and the characteristics of the video frame before the current video frame; the age of the face is determined based on the output results of the plurality of frames of video.
In an alternative embodiment, the age estimation model comprises a feature extraction network, a recurrent neural network and an age estimation network; the step of sequentially inputting each video frame into the pre-trained age estimation model to obtain the output result corresponding to each video frame includes: extracting, through the feature extraction network, feature data of each of the multiple video frames; fusing, through the recurrent neural network, the feature data of the first of the multiple video frames with itself to obtain the fusion feature of the first video frame; for each video frame other than the first, fusing, through the recurrent neural network and in time order, the feature data of the current video frame with the fusion feature of the previous video frame to obtain the fusion feature of the current video frame; and performing feature extraction on the fusion feature of each video frame through the age estimation network to obtain the output result of each video frame.
In an alternative embodiment, the weight parameter of the age estimation model is determined according to the loss amount in the process of machine learning; the loss amount is determined according to the output result of each frame of video frame output by the age estimation model and the age label corresponding to the multiple frames of video frames; the age tag is used to indicate the age of a person contained in a multi-frame video frame.
In an alternative embodiment, the loss amount includes a first loss value and a second loss value; the first loss value is used to indicate: the difference between the output result of each video frame output by the age estimation model and the age label; the second loss value is used to indicate: the difference between the output result of each video frame output by the age estimation model and the mean of the output results over the video frames.
In an alternative embodiment, the first loss value is determined by the following equation:
L_{age} = \frac{1}{T} \sum_{i=1}^{T} \left| \hat{a}_i - a \right|

wherein L_{age} represents the first loss value; a represents the age label; \hat{a}_i represents the output result corresponding to the i-th video frame of the multiple video frames; T represents the total number of video frames; \sum denotes a summation operation.
In an alternative embodiment, the second loss value is determined by the following equation:
L_{var} = \frac{1}{T} \sum_{i=1}^{T} \left( \hat{a}_i - m \right)^2

wherein L_{var} represents the second loss value; \hat{a}_i represents the output result corresponding to the i-th video frame of the multiple video frames; T represents the total number of video frames; m represents the mean of the output results corresponding to the video frames; \sum denotes a summation operation.
In an optional embodiment, the step of determining the age of the human face based on the output result of the plurality of frames of video frames includes: and calculating the average value of the output results corresponding to each frame of video frames in the plurality of frames of video frames, and determining the average value as the age of the face.
In a second aspect, an embodiment of the present invention provides a training method for an age estimation model, where the training method includes: acquiring a sample video; the sample video comprises a plurality of frames of video frames, and the age labels carried by each frame of the plurality of frames of video frames of the sample video are the same; inputting a sample video into an initial model, and determining an age estimation result of a current video frame according to the characteristics of the current video frame and the characteristics of a video frame before the current video frame for each frame of video frames except a first frame in the sample video according to the time sequence of the multiple frames of video frames through the initial model; and performing machine learning training on the initial model based on the age estimation result and the age label of each frame of video frame to obtain an age estimation model.
In an alternative embodiment, the initial model of the age estimation model includes a feature extraction network, a recurrent neural network and an age estimation network; the step of inputting the sample video into the initial model so as to determine, through the initial model and according to the time sequence of the multiple video frames, the age estimation result of each video frame in the sample video other than the first from the features of the current video frame and the features of the video frame preceding it includes: extracting, through the feature extraction network, feature data of each video frame in the sample video; fusing, through the recurrent neural network, the feature data of the first video frame in the sample video with itself to obtain the fusion feature of the first video frame; for each video frame other than the first, fusing, through the recurrent neural network and in time order, the feature data of the current video frame with the fusion feature of the previous video frame to obtain the fusion feature of the current video frame; and performing feature extraction on the fusion feature of each video frame through the age estimation network to obtain the age estimation result of each video frame.
In an optional embodiment, the step of performing machine learning training on the initial model based on the age estimation result and the age label of each video frame to obtain the age estimation model includes: determining the loss amount according to the age estimation result and the age label of each video frame; updating the weight parameters of the initial model based on the loss amount; and continuing to execute the step of obtaining a sample video until the loss amount converges or a preset number of training iterations is reached, so as to obtain the age estimation model.
In an optional embodiment, the step of determining the loss amount according to the age estimation result and the age tag of each frame of the video frame includes: determining a first loss value according to the difference between the age estimation result of each frame of video frame in the sample video and the age label; determining a second loss value according to the difference between the age estimation result of each frame of video frame in the sample video and the mean value of the age estimation result of each frame of video frame; and obtaining the loss amount according to the first loss value and the second loss value.
In a third aspect, an embodiment of the present invention provides an age estimation apparatus, including: the system comprises a video frame acquisition module, a face recognition module and a face recognition module, wherein the video frame acquisition module is used for acquiring a plurality of frames of video frames containing faces, the plurality of frames of video frames have time sequence, and the faces contained in the plurality of frames of video frames belong to the same person; the video frame input module is used for sequentially inputting each frame of video frame into the age estimation model which is trained in advance to obtain an output result corresponding to each frame of video frame; the age estimation model is used for: according to the time sequence of input multi-frame video frames, for each frame of video frames except the first frame, determining the output result of the current video frame according to the characteristics of the current video frame and the characteristics of the video frame before the current video frame; and the age estimation module is used for determining the age of the face based on the output result of the multi-frame video frames.
In a fourth aspect, an embodiment of the present invention provides a training apparatus for an age estimation model, where the training apparatus includes: the sample acquisition module is used for acquiring a sample video; the sample video comprises a plurality of frames of video, and the age labels corresponding to each frame of the plurality of frames of video of the sample video are the same; the sample input module is used for inputting the sample video into the initial model so as to determine the age estimation result of the current video frame according to the characteristics of the current video frame and the characteristics of the video frame before the current video frame for each frame of the video frames except the first frame in the sample video according to the time sequence of the multiple frames of the video frames through the initial model; and the model training module is used for performing machine learning training on the initial model based on the age estimation result and the age label of each frame of video frame to obtain an age estimation model.
In a fifth aspect, an embodiment of the present invention provides an electronic device, which includes a processor and a memory, where the memory stores machine executable instructions capable of being executed by the processor, and the processor executes the machine executable instructions to implement the above age estimation method or the above training method of the age estimation model.
In a sixth aspect, embodiments of the present invention provide a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the above age estimation method or the above training method of an age estimation model.
The embodiment of the invention has the following beneficial effects:
The embodiments of the invention provide an age estimation method, a training method of an age estimation model, and a corresponding device. First, multiple video frames containing a human face are obtained, the multiple video frames having a time sequence; each video frame is then sequentially input into a pre-trained age estimation model to obtain an output result corresponding to each video frame, wherein the age estimation model is configured to: according to the time sequence of the input video frames, for each video frame other than the first, determine the output result of the current video frame from the features of the current video frame and the features of the video frame preceding it; the age of the face is then determined based on the output result of each video frame. When this method estimates the age of the person whose face appears in a video, the features of video frames at different times can be fused, so that the age estimation model extracts richer and more comprehensive temporal feature information, which improves the accuracy and stability of the age estimation.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention as set forth above.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of an age estimation method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another age estimation method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an age estimation model according to an embodiment of the present invention;
fig. 4 is a flowchart of a training method of an age estimation model according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an age estimation apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an age estimation model training apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Automatic face age estimation is an important biometric identification technology and has long been a popular research topic in pattern recognition and computer vision. The face age estimation problem generally refers to automatically estimating the true age of a face from an input face image using computer vision and related techniques.
In the related art, there are two age estimation approaches. The first is the traditional face age estimation algorithm: facial features (such as active appearance features, anthropometric features, biologically inspired features, and the like) are usually extracted manually from a face image, and a regressor mapping these facial features to an age is then trained; the trained regressor can estimate the age of a face image to be evaluated. However, this approach lacks high-level semantic information about the face, so the accuracy of its age estimation results is low.
The second approach estimates age with a trained deep learning model. Such a model is usually trained on single face images; the trained model establishes a mapping between an input face and its age and can accurately estimate the age of the person in a face image, but it is sensitive to changes in the pose, expression, illumination and the like of the input face image. This approach can therefore learn high-level semantic information about the face and improve the accuracy of face age estimation on single images. However, with the development of video technology, the people appearing in videos also need age estimation, and because the same face can differ considerably across different frames of a video, the deep learning model struggles to produce a stable and accurate age estimation result when applied to video.
Based on the above description, the embodiment of the invention provides an age estimation method, an age estimation model training method and an age estimation model training device. The technology can be applied to the scenes of age identification and age estimation in the fields of human-computer interaction, intelligent commerce, safety monitoring, entertainment and the like, and particularly can be applied to the scenes of age estimation of people in videos. To facilitate understanding of the present embodiment, an age estimation method disclosed in the present embodiment will be described in detail first, and as shown in fig. 1, the method includes the following steps:
step S102, obtaining a plurality of frames of video containing human faces, wherein the plurality of frames of video have time sequence.
The human face contained in each of the multiple video frames belongs to the same person, and the face may have different poses, lighting or expressions in different video frames; that is, the face in the video may change to some extent over time. For example, a video frame may show a frontal face, a profile, a smiling or crying face, a face looking up or down, or a face under strong or soft lighting. The time sequence of the multiple video frames refers to the order in which adjacent video frames are generated; for example, for a given video, the first video frame is necessarily generated earlier than the second video frame, and the second video frame earlier than the third.
In a specific implementation, the multi-frame video frame may be a video frame in a to-be-processed video that includes a human face, and when a video duration of the to-be-processed video is short (for example, the video duration is less than a preset time threshold), the multi-frame video may be each frame of the to-be-processed video; when the video duration of the video to be processed is longer (for example, the video duration is greater than or equal to the preset time threshold), a specified number of video frames can be extracted from the video to be processed as multi-frame video frames, so that the calculation amount of the subsequent age estimation can be reduced. The preset time threshold and the specified extraction quantity are set according to the requirements of users.
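By way of illustration, the frame-extraction step above might look as follows; this is a minimal sketch assuming OpenCV (cv2) is available, with the frame count used as a stand-in for the video duration, and the function name and numbers are illustrative rather than taken from the patent:

```python
import cv2
import numpy as np

def sample_face_frames(video_path: str, num_frames: int = 20) -> list:
    """Uniformly sample up to `num_frames` frames, in time order, from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= num_frames:
        indices = np.arange(total)                                    # short video: keep every frame
    else:
        indices = np.linspace(0, total - 1, num_frames).astype(int)   # long video: sample evenly
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)                                      # frames keep their time sequence
    cap.release()
    return frames
```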
Step S104, sequentially inputting each frame of video frame into an age estimation model which is trained in advance to obtain an output result corresponding to each frame of video frame; wherein the age estimation model is for: and according to the time sequence of the input multi-frame video frames, determining the output result of the current video frame for each frame of video frames except the first frame according to the characteristics of the current video frame and the characteristics of the video frame before the current video frame.
And step S106, determining the age of the human face based on the output result of the multi-frame video frames.
The age estimation model may adopt a deep learning model or a neural network model. The age estimation model is usually obtained through machine learning training according to a preset training sample set, and can extract the characteristics of each frame of video frames in a plurality of frames of video frames and obtain the corresponding output result of each frame of video frames according to the characteristics of each frame of video frames and the time sequence of the video frames.
In specific implementation, firstly, each frame of video frames in a plurality of acquired video frames is sequentially input into an age estimation model which is trained in advance, the age estimation model can extract the characteristics of each frame of video frames, then the output result of a first frame of video frame is obtained according to the characteristics of the first frame of video frame, the output result of the current video frame is obtained according to the characteristics of the current video frame and the characteristics of the previous video frame of the current video frame aiming at the video frames except the first frame in the plurality of frames of video frames, and therefore the age estimation model can output the output result of each frame of video frames in the plurality of frames of video frames.
The output result of each frame of video frame output by the age estimation model is the age estimation value of the face in each frame of video frame, and the average value of the age estimation values corresponding to each frame of video frame can be determined as the age of the face; the maximum value or the minimum value of the age estimation value corresponding to each frame of video frame can also be determined as the age of the face; the age of the face can be obtained through the age estimation value corresponding to each frame of the video according to other rules.
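As a small illustrative sketch of the aggregation rules mentioned above (mean, maximum or minimum of the per-frame age estimates; the embodiments below use the mean):

```python
def aggregate_age(per_frame_ages: list, rule: str = "mean") -> float:
    """Combine the per-frame age estimates into a single age for the face."""
    if rule == "mean":
        return sum(per_frame_ages) / len(per_frame_ages)
    if rule == "max":
        return max(per_frame_ages)
    if rule == "min":
        return min(per_frame_ages)
    raise ValueError(f"unknown aggregation rule: {rule}")
```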
In the age estimation method provided by the embodiment of the invention, multiple video frames containing a human face are first obtained, the multiple video frames having a time sequence; each video frame is then sequentially input into a pre-trained age estimation model to obtain an output result corresponding to each video frame, wherein the age estimation model is configured to: according to the time sequence of the input video frames, for each video frame other than the first, determine the output result of the current video frame from the features of the current video frame and the features of the video frame preceding it; the age of the face is then determined based on the output result of each video frame. When this method estimates the age of the person whose face appears in a video, the features of video frames at different times can be fused, so that the age estimation model extracts richer and more comprehensive temporal feature information, which improves the accuracy and stability of the age estimation.
The embodiment of the invention also provides another age estimation method, which is realized on the basis of the method of the embodiment; the method mainly describes a specific process of acquiring a plurality of frames of video frames containing a human face (realized by the following step S202), a specific process of obtaining an output result corresponding to each frame of video frames based on sequentially inputting each frame of video frames into an age estimation model trained in advance (realized by the following steps S204-S212), and a specific process of determining the age of the human face based on the output result of the plurality of frames of video frames (realized by the following step S214); as shown in fig. 2, the method comprises the steps of:
step S202, extracting a specified number of video frames from a video to be processed containing human faces; and determining the extracted specified number of video frames as the multi-frame video frames.
Each frame of the video to be processed contains a human face, and the video to be processed may be a video shot by the user through a camera, a webcam or another device communicatively connected to the electronic device, or a previously shot video held in a storage device. The specified number may be set by the user according to the computational budget, service requirements, or the like, for example 20 frames or 50 frames. In specific implementations, only the specified number of video frames is extracted in order to improve the computational efficiency of the subsequent age estimation.
Step S204, sequentially inputting each frame of video frames in the plurality of frames of video frames into an age estimation model which is trained in advance; the age estimation model includes: a feature extraction network, a recurrent neural network, and an age estimation network.
And step S206, extracting the feature data of each frame of video frame in the multiple frames of video frames through the feature extraction network.
The feature extraction layer can extract the feature data of each video frame; the feature data can also be understood as the image features of the image corresponding to each video frame, so high-level semantic information of each video frame can be obtained. The feature extraction layer may include a convolutional layer and an activation function layer connected in sequence. The activation function layer applies a function transformation to the image features output by the convolutional layer; this transformation breaks the purely linear combination computed by the convolutional layer, and the activation function may specifically be a Sigmoid, tanh or ReLU function, among others. To improve the performance of the feature extraction layer, it may generally include multiple groups of sequentially connected convolutional and activation function layers; how many groups are included can be determined by the speed and precision requirements of the specific application.
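A minimal sketch of such a feature extraction stack in PyTorch is given below; the choice of ReLU activations, the channel sizes, the number of groups and the pooling step before the final fully connected layer are assumptions made for illustration and are not prescribed by the patent:

```python
import torch.nn as nn

def make_feature_extractor(in_channels: int = 3,
                           channels: tuple = (32, 64),
                           feature_dim: int = 128) -> nn.Module:
    """Groups of (convolution, activation) layers followed by a fully connected
    layer that maps each frame to a feature vector of length `feature_dim`
    (the length c mentioned later in the text)."""
    layers = []
    prev = in_channels
    for ch in channels:                                   # multiple conv + activation groups
        layers += [nn.Conv2d(prev, ch, kernel_size=3, stride=2, padding=1),
                   nn.ReLU(inplace=True)]
        prev = ch
    layers += [nn.AdaptiveAvgPool2d(1),                   # assumed pooling so FC1 is size-independent
               nn.Flatten(),
               nn.Linear(prev, feature_dim)]              # FC1
    return nn.Sequential(*layers)
```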
And step S208, fusing the characteristic data of the first frame video frame in the multi-frame video frames with the characteristic data of the first frame video frame through a recurrent neural network to obtain the fusion characteristic of the first frame video frame.
Step S210, fusing the feature data of the current video frame in the video frames except the first frame in the multi-frame video frames with the fusion feature corresponding to the video frame of the previous frame of the current video frame through the recurrent neural network according to the time sequence of the video frames to obtain the fusion feature of the current video frame.
The recurrent neural network (RNN) is a type of neural network that takes sequence data as input, recurses along the direction in which the sequence evolves, and whose nodes (recurrent units) are connected in a chain. In a specific implementation, because the first of the multiple video frames has no preceding video frame in the time sequence, the recurrent neural network fuses the feature data of the first video frame output by the feature extraction layer with itself to obtain the fusion feature of the first video frame.
For video frames except for the first frame in a plurality of frames of video frames, the recurrent neural network can fuse the feature data of the current video frame with the fusion feature corresponding to the video frame of the previous frame of the current video frame to obtain the fusion feature of the current video frame, so that the fusion feature of each frame of video frame is obtained.
And step S212, performing feature extraction on the fusion features of each frame of video frame through an age estimation network to obtain an output result of each frame of video frame.
The age estimation network may be regarded as the output layer of the age estimation model; the output layer may be a fully connected layer (FC), which extracts features from the fusion feature of each video frame output by the recurrent neural network to obtain the age estimation result of each video frame.
For a better understanding of how the age estimation model estimates age, Fig. 3 shows a schematic structural diagram of an age estimation model. Block1, Block2 and FC1 in Fig. 3 form the feature extraction network, where Block1 and Block2 are each composed of a group of convolutional and activation function layers, and FC1 is a fully connected layer; RNN in Fig. 3 denotes the recurrent neural network, and FC2 denotes the age estimation network, which is a fully connected layer. Networks with the same name in Fig. 3 share their parameters (for example, all FC2 fully connected layers share one set of parameters).
Assuming the multiple video frames comprise T video frames, the T video frames are sequentially input into the corresponding feature extraction layers in time order, and the feature data corresponding to each video frame are obtained through Block1, Block2 and FC1. The feature data may be a feature vector f_i of length c (the length is determined by the network parameters set by the user), with i \in [1, \dots, T], where f_i denotes the feature vector extracted from the i-th video frame. The feature data of the video frames at different times are then fused through the recurrent neural network; specifically, the feature vector of each video frame is fed into the recurrent neural network (RNN) to obtain the fusion feature h_i, i \in [1, \dots, T], of each video frame, where h_i denotes the fusion feature of the i-th video frame and combines features of video frames at different times.

As can be seen from Fig. 3, the fusion feature of the first video frame depends only on the feature vector of the first video frame, while the fusion feature of every other video frame is obtained from the fusion feature of the previous video frame and the feature vector of the current video frame; for example, the fusion feature h_1 of the first video frame is fused with the feature vector f_2 of the second video frame to obtain the fusion feature h_2 of the second video frame. Finally, the fusion feature of each video frame is fed into the age estimation network FC2 corresponding to that frame to obtain the output result \hat{a}_i of the i-th video frame, which in some embodiments may also be referred to as an age estimation result.
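A minimal PyTorch sketch of the structure of Fig. 3 is given below, reusing the `make_feature_extractor` helper sketched earlier as Block1 + Block2 + FC1; the use of `nn.RNN`, the hidden size and the tensor layout are assumptions, while FC2 is shared across all time steps as the text describes:

```python
import torch
import torch.nn as nn

class AgeEstimationModel(nn.Module):
    def __init__(self, feature_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.backbone = make_feature_extractor(feature_dim=feature_dim)  # Block1, Block2, FC1
        self.rnn = nn.RNN(input_size=feature_dim, hidden_size=hidden_dim,
                          batch_first=True)                              # fuses f_i with h_{i-1}
        self.fc2 = nn.Linear(hidden_dim, 1)                              # FC2, shared over time steps

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, 3, H, W), already in time order
        b, t = frames.shape[:2]
        f = self.backbone(frames.flatten(0, 1)).view(b, t, -1)   # f_1 ... f_T
        h, _ = self.rnn(f)                                       # h_1 ... h_T (fusion features)
        return self.fc2(h).squeeze(-1)                           # per-frame outputs \hat{a}_1 ... \hat{a}_T

# Usage sketch: the face age of step S214 is the mean of the per-frame outputs.
# predicted_age = AgeEstimationModel()(clip).mean(dim=1)
```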
Step S214, calculating the average value of the output results of each frame of video frames in the plurality of frames of video frames, and determining the average value as the age of the human face.
Assuming the multiple video frames are T video frames, the age of the face is

\hat{a} = \frac{1}{T} \sum_{i=1}^{T} \hat{a}_i

where \hat{a}_i denotes the output result of the i-th of the T video frames.
In specific implementation, the weight parameters of each network in the age estimation model are determined according to the loss amount in the process of machine learning; the loss amount is determined according to the output result of each frame of video frame output by the age estimation model and the age label corresponding to the multiple frames of video frames; the age tag is used for indicating the age of a person contained in a multi-frame video frame; the specific training process of the age estimation model will be described in detail in the following embodiments of the training method of the age estimation model, and will not be described herein again.
In the above method, a specified number of video frames are first extracted from a to-be-processed video containing a human face and determined as the multiple video frames; each video frame is sequentially input into a pre-trained age estimation model comprising a feature extraction network, a recurrent neural network and an age estimation network. The feature data of each video frame are extracted through the feature extraction network; the feature data of the first video frame are fused with themselves through the recurrent neural network to obtain the fusion feature of the first video frame; for each video frame other than the first, the feature data of the current video frame are fused, in time order, with the fusion feature of the previous video frame through the recurrent neural network to obtain the fusion feature of the current video frame; feature extraction is performed on the fusion feature of each video frame through the age estimation network to obtain the output result of each video frame; and finally the average of the output results of the video frames is calculated and determined as the age of the face. In this way, the age of the face is estimated by an age estimation model that can automatically learn multi-level semantic features related to age; video input is explicitly taken into account, and the recurrent neural network fuses the information contained in video frames at different times, so that the features extracted by the network carry richer and more comprehensive temporal feature information, improving the accuracy of the age estimation.
For the embodiment of the age estimation method, an embodiment of the present invention further provides a training method of an age estimation model, as shown in fig. 4, the method includes the following steps:
step S402, obtaining a sample video; the sample video comprises a plurality of frames of video, and each frame of the plurality of frames of video of the sample video carries the same age tag.
The sample video is usually a sample in a training sample set, the training sample set includes a large number of samples, each sample includes a plurality of frames of video frames of a face and carries an age tag corresponding to the face in the plurality of frames of video, the faces included in the plurality of frames of video frames are the same, and the expressions and postures of the faces in the video frames at different times may be different or the same. In a particular implementation, the age label may be determined by the following steps 10-11:
step 10, obtaining a plurality of labeling results corresponding to the sample video; the labeling result is used for identifying the age value of the person in the sample video; the labeled age value in the labeling result is one of a plurality of preset age values.
The preset age values are set by the research and development personnel according to the requirements, the range and the number of the age values are also set according to the research and development requirements, for example, 101 age values can be set, and the age values are integers between 0 and 100 and respectively represent 0 to 100. The plurality of labeling results corresponding to the sample video may be n labeling results obtained after preset n individuals perform age labeling on the people in the sample video, respectively, and the age value labeled by the n individuals is one of the preset age values.
Step 11, calculating an average value of the age values corresponding to the plurality of labeling results to obtain an age average value; the age mean is used as an age label for the sample video.
For example, assuming the preset age values are the integers from 0 to 100, n people label the age of the person in the sample video, yielding n labeling results a_k, k \in [1, \dots, n], where a_k denotes the labeling result of the k-th person for the sample video. From the n labeling results, the age mean is obtained as:

a = \left\lfloor \frac{1}{n} \sum_{k=1}^{n} a_k \right\rfloor

where a denotes the age mean of the sample video, i.e. the age label of the sample video, and \lfloor \cdot \rfloor denotes rounding down.
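A short sketch of this label computation, assuming the n labeling results are plain integers:

```python
import math

def age_label(annotations: list) -> int:
    """Age label a: floor of the mean of the n annotators' age values."""
    return math.floor(sum(annotations) / len(annotations))

# e.g. age_label([23, 25, 24, 26]) == 24
```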
Step S404, inputting the sample video into the initial model, and determining the age estimation result of the current video frame according to the characteristics of the current video frame and the characteristics of the video frame before the current video frame for each frame of the video frames except the first frame in the sample video according to the time sequence of the multiple frames of the video frames through the initial model.
In a specific implementation, the initial model of the age estimation model includes: a feature extraction network, a recurrent neural network and an age estimation network; the above step S404 can be realized by the following steps 20 to 23:
and 20, extracting the feature data of each frame of video frame in the sample video through a feature extraction network.
And step 21, fusing the feature data of the first video frame in the sample video with itself through the recurrent neural network to obtain the fusion feature of the first video frame.
And step 22, fusing the feature data of the current video frame in the video frames except the first frame in the sample video and the fusion feature corresponding to the video frame of the previous frame of the current video frame through the recurrent neural network according to the time sequence of the video frames to obtain the fusion feature of the current video frame.
And step 23, performing feature extraction on the fusion features of each frame of video frame through an age estimation network to obtain an age estimation result of each frame of video frame.
And step S406, performing machine learning training on the initial model based on the age estimation result and the age label of each frame of video frame to obtain an age estimation model.
In a specific implementation, the loss amount is determined according to the age estimation result and the age label of each video frame; the weight parameters of the initial model are updated based on the loss amount; and the step of obtaining a sample video is continued until the loss amount converges or a preset number of training iterations is reached, so as to obtain the age estimation model.
The step of determining the loss amount based on the age estimation result and the age label of each frame of the video frame may be implemented by the following steps 30 to 32:
and step 30, determining a first loss value according to the age estimation result of each frame of video frame in the sample video and the difference between the age estimation result and the age label. That is, the first loss value is used to indicate: and the age estimation result of each frame of video output by the age estimation model is different from the age label.
Specifically, the first loss value is determined by the following equation:
L_{age} = \frac{1}{T} \sum_{i=1}^{T} \left| \hat{a}_i - a \right|

wherein L_{age} represents the first loss value; a represents the age label of the sample video; \hat{a}_i represents the age estimation result corresponding to the i-th video frame (equivalently, the output result corresponding to the i-th video frame); T represents the total number of video frames; \sum denotes a summation operation and |\cdot| denotes the absolute value.
And step 31, determining a second loss value according to the difference between the age estimation result of each video frame in the sample video and the mean of the age estimation results. That is, the second loss value is used to indicate: the difference between the age estimation result of each video frame output by the age estimation model and the mean of the age estimation results over the video frames.
Specifically, the second loss value is determined by the following equation:
L_{var} = \frac{1}{T} \sum_{i=1}^{T} \left( \hat{a}_i - m \right)^2, \qquad m = \frac{1}{T} \sum_{i=1}^{T} \hat{a}_i

wherein L_{var} represents the second loss value; \hat{a}_i represents the age estimation result corresponding to the i-th video frame (equivalently, the output result corresponding to the i-th video frame); T represents the total number of video frames; m represents the mean of the age estimation results over the video frames; \sum denotes a summation operation.
As model training progresses, L_{var} decreases continuously, thereby reducing the variance of the age estimation results of the video frames at different times; that is, it prevents cases of large variance such as the first video frame being estimated as 30 years old while the second video frame is estimated as 5 years old. The second loss value therefore constrains the variance of the age estimation results at different times, making the results more consistent and improving the stability of the age estimation results.
And step 32, obtaining the loss amount according to the first loss value and the second loss value. For example, the sum of the first loss value and the second loss value is determined as the loss amount.
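A minimal PyTorch sketch of steps 30 to 32 is given below, assuming `preds` holds the per-frame age estimation results \hat{a}_i of one sample video and `label` holds its age label a; summing the two terms with equal weight is the simple combination named in step 32:

```python
import torch

def age_loss(preds: torch.Tensor, label: float) -> torch.Tensor:
    """preds: tensor of shape (T,) with the per-frame age estimates of one sample video."""
    t = preds.numel()
    l_age = (preds - label).abs().sum() / t      # first loss value L_age
    m = preds.mean()                             # mean of the per-frame estimates
    l_var = ((preds - m) ** 2).sum() / t         # second loss value L_var
    return l_age + l_var                         # loss amount
```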
In a specific implementation, the weight parameters of the initial model may be updated based on the loss amount by the following steps 40-43:
Step 40, calculating the derivative \frac{\partial L}{\partial w} of the loss amount with respect to each weight parameter to be updated in the initial model, where L is the loss amount and w is a weight parameter to be updated. The weight parameters to be updated may be all parameters of the initial model or a subset of parameters randomly selected from the initial model; the updated weight parameters are the weights of each layer of the network in the initial model. The derivatives with respect to the weight parameters to be updated can generally be solved by the back-propagation algorithm. If the loss amount is large, the difference between the estimation result of the current initial model and the expected result is large, so the derivative of the loss amount with respect to each weight parameter to be updated serves as the basis for updating that parameter.

Step 41, updating each weight parameter to be updated as

w \leftarrow w - \alpha \cdot \frac{\partial L}{\partial w}

where \alpha is a preset coefficient, a manually set hyper-parameter that may be, for example, 0.01 or 0.001. This process may also be referred to as stochastic gradient descent. The derivative of each weight parameter to be updated can be understood as the direction in which the loss amount decreases most rapidly with respect to the current parameter; adjusting the parameter along this direction reduces the loss amount quickly and allows the weight parameters to converge.
Step 42, judging whether the parameters of the updated initial model have all converged; if not, executing the step of determining a sample video based on the preset training sample set; otherwise, performing step 43.
If the parameters of the updated initial model are not all converged, determining a new sample video based on a preset training sample set, and continuing to execute the steps S402-S406 until the parameters of the updated initial model are all converged.
And step 43, determining the initial model after the parameters are updated as the trained age estimation model.
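A condensed sketch of the update loop described in steps 40 to 43 is shown below, assuming the model and loss sketched earlier and plain stochastic gradient descent; the convergence test (loss change below a small tolerance, or a maximum number of iterations) is an illustrative stand-in for the patent's convergence check:

```python
import torch

def train(model, sample_loader, lr: float = 0.01, max_steps: int = 10000, tol: float = 1e-4):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)    # step 41: w <- w - alpha * dL/dw
    prev_loss = None
    for step, (clip, label) in enumerate(sample_loader):      # clip: (1, T, 3, H, W) sample video
        preds = model(clip)                                    # per-frame age estimates
        loss = age_loss(preds.squeeze(0), float(label))        # loss amount L_age + L_var
        optimizer.zero_grad()
        loss.backward()                                        # step 40: derivatives via back-propagation
        optimizer.step()
        if prev_loss is not None and abs(prev_loss - loss.item()) < tol:
            break                                              # steps 42/43: treat as converged
        if step + 1 >= max_steps:
            break
        prev_loss = loss.item()
    return model                                               # trained age estimation model
```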
In a specific implementation, the samples in a preset sample set can be divided, according to a preset ratio (for example, 10:1), into a sample set used for training the model and a sample set used for verifying the model. The recognition precision of the trained age estimation model can then be determined on the verification set: a test sample is taken from the verification set, the test sample comprising a sample video and the age label corresponding to that sample video; the test sample is input into the trained age estimation model to obtain the age estimation result of each video frame; the average of the per-frame age estimation results is compared with the age label to judge whether the age estimation result is correct; and further test samples are taken from the verification set until all samples in it have been used. The correctness of the test result of each test sample is then tallied to obtain the prediction precision of the trained age estimation model.
In the training method of the age estimation model, a sample video is first obtained; the sample video is input into an initial model, and through the initial model, according to the time sequence of the multiple video frames, the age estimation result of each video frame other than the first in the sample video is determined from the features of the current video frame and the features of the video frame preceding it; machine learning training is then performed on the initial model based on the age estimation result and the age label of each video frame to obtain the age estimation model. During training, the model can automatically learn multi-level semantic features related to age, and the age estimation model can fuse the features contained in video frames at different times, so that the features extracted by the model carry richer and more comprehensive temporal feature information, improving the accuracy of the age estimation. In addition, the method constrains the variance of the age estimation results of video frames at different times, making those results more consistent and further improving the stability of the age estimation results.
Corresponding to the embodiment of the age estimation method, an embodiment of the present invention further provides an age estimation apparatus, as shown in fig. 5, the apparatus including:
the video frame acquiring module 50 is configured to acquire multiple video frames containing faces, where the multiple video frames have a time sequence, and the faces contained in the multiple video frames belong to the same person.
A video frame input module 51, configured to sequentially input each frame of video frame into a pre-trained age estimation model to obtain an output result corresponding to each frame of video frame, where the age estimation model is configured to: and according to the time sequence of the input multi-frame video frames, determining the output result of the current video frame for each frame of video frames except the first frame according to the characteristics of the current video frame and the characteristics of the video frame before the current video frame.
And an age estimation module 52, configured to determine an age of the human face based on an output result of the plurality of frames of video.
The age estimation apparatus first acquires multiple video frames containing a human face, the multiple video frames having a time sequence; each video frame is sequentially input into a pre-trained age estimation model to obtain an output result corresponding to each video frame, wherein the age estimation model is configured to: according to the time sequence of the input video frames, for each video frame other than the first, determine the output result of the current video frame from the features of the current video frame and the features of the video frame preceding it; the age of the face is then determined based on the output result of each video frame. When the apparatus is used to estimate the age of the person whose face appears in a video, the features of video frames at different times can be fused, so that the age estimation model extracts richer and more comprehensive temporal feature information, which improves the accuracy and stability of the age estimation.
Specifically, the age estimation model includes: a feature extraction network, a recurrent neural network and an age estimation network. The video frame input module 51 is configured to: extract feature data of each of the plurality of video frames through the feature extraction network; fuse, through the recurrent neural network, the feature data of the first video frame of the plurality of video frames with its own feature data to obtain the fusion feature of the first video frame; for each video frame other than the first frame, fuse, through the recurrent neural network and according to the time sequence of the video frames, the feature data of the current video frame with the fusion feature of the video frame preceding the current video frame to obtain the fusion feature of the current video frame; and perform feature extraction on the fusion feature of each video frame through the age estimation network to obtain the output result of each video frame.
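For illustration, the following is a minimal sketch of such a three-part model, written in PyTorch under stated assumptions: the small convolutional backbone, the GRU cell, the hidden size and all names (AgeEstimationModel, feature_net, rnn_cell, age_net) are choices made for the example and are not prescribed by this embodiment.

```python
import torch
import torch.nn as nn

class AgeEstimationModel(nn.Module):
    """Illustrative sketch: feature extraction network + recurrent neural
    network + age estimation network, processing frames in time order."""

    def __init__(self, feat_dim=128, hidden_dim=64):
        super().__init__()
        # Feature extraction network: any per-frame CNN backbone would do;
        # a tiny convolutional stack keeps the sketch self-contained.
        self.feature_net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Recurrent neural network: fuses the current frame's features with
        # the fusion feature of the previous frame.
        self.rnn_cell = nn.GRUCell(feat_dim, hidden_dim)
        # Age estimation network: maps each fusion feature to a scalar output.
        self.age_net = nn.Linear(hidden_dim, 1)

    def forward(self, frames):
        # frames: (T, 3, H, W), ordered by time; one output result per frame.
        outputs, h = [], None
        for t in range(frames.size(0)):
            feat = self.feature_net(frames[t:t + 1])   # (1, feat_dim)
            # A zero initial hidden state stands in for fusing the first
            # frame's features with themselves; later steps fuse the current
            # features with the previous frame's fusion feature.
            h = self.rnn_cell(feat, h)                 # (1, hidden_dim)
            outputs.append(self.age_net(h).squeeze())  # per-frame output result
        return torch.stack(outputs)                    # shape (T,)
```

Any recurrent cell (for example an LSTM) could play the same fusing role; the sketch only fixes the data flow described above.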
Further, the video frame acquisition module 50 is configured to: extract a specified number of video frames from a to-be-processed video containing the human face, and determine the extracted video frames as the plurality of video frames.
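As a hedged illustration of this frame-extraction step, the snippet below samples a specified number of frames uniformly from a video with OpenCV; uniform spacing, the default frame count and the function name sample_frames are assumptions made for the example, since the embodiment only requires that a specified number of frames be extracted.

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=8):
    """Extract `num_frames` frames, evenly spaced in time, from a video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))  # jump to the chosen frame
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames  # BGR images in temporal order, used as the multi-frame input
```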
In a specific implementation, the weight parameters of the age estimation model are determined according to the loss amount in the machine learning process; the loss amount is determined according to the output result of each video frame output by the age estimation model and the age label corresponding to the plurality of video frames; the age label is used to indicate the age of the face contained in the plurality of video frames.
Further, the loss amount includes a first loss value and a second loss value. The first loss value is used to indicate the difference between the output result of each video frame output by the age estimation model and the age label; the second loss value is used to indicate the difference between the output result of each video frame output by the age estimation model and the average value of the output results corresponding to the video frames.
Specifically, the first loss value is determined by the following equation:
L_{age} = \frac{1}{T}\sum_{i=1}^{T}\left(\hat{a}_i - a\right)^2

wherein L_{age} represents the first loss value; a represents the age label; \hat{a}_i represents the output result corresponding to the i-th video frame in the plurality of video frames; T represents the total number of video frames; \sum denotes a summation operation.
Specifically, the second loss value is determined by the following equation:
L_{var} = \frac{1}{T}\sum_{i=1}^{T}\left(\hat{a}_i - m\right)^2, \qquad m = \frac{1}{T}\sum_{i=1}^{T}\hat{a}_i

wherein L_{var} represents the second loss value; \hat{a}_i represents the output result corresponding to the i-th video frame in the plurality of video frames; T represents the total number of video frames; m represents the average value of the output results corresponding to the video frames; \sum denotes a summation operation.
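For illustration, the two loss terms defined above could be computed from the per-frame outputs as in the sketch below; the squared-error form of the first loss, the weighting factor lambda_var and the function name age_losses are assumptions made for the example.

```python
import torch

def age_losses(outputs, age_label, lambda_var=1.0):
    """outputs: tensor of shape (T,), one output result per frame;
    age_label: the ground-truth age shared by all frames of the video."""
    # First loss: difference between each frame's output result and the age label.
    l_age = ((outputs - age_label) ** 2).mean()
    # Second loss: spread of the per-frame output results around their mean,
    # which pushes the estimates at different moments to agree with each other.
    m = outputs.mean()
    l_var = ((outputs - m) ** 2).mean()
    return l_age + lambda_var * l_var
```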
Further, the age estimation module 52 is further configured to: calculate the average value of the output results corresponding to the video frames in the plurality of video frames, and determine the average value as the age of the face.
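Combining the sketches above (whose names AgeEstimationModel, sample_frames and frames_tensor are illustrative assumptions), the final age could be obtained as follows:

```python
import torch

# Placeholder input standing in for the sampled face frames returned by
# sample_frames after conversion to a float tensor of shape (T, 3, H, W).
frames_tensor = torch.rand(8, 3, 112, 112)

model = AgeEstimationModel()
model.eval()
with torch.no_grad():
    per_frame_outputs = model(frames_tensor)     # output result of each frame, shape (T,)
    face_age = per_frame_outputs.mean().item()   # average of the per-frame results
print(f"estimated age: {face_age:.1f}")
```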
The age estimation apparatus provided in the embodiment of the present invention has the same implementation principle and technical effect as those of the age estimation method embodiment, and for the sake of brief description, reference may be made to the corresponding contents in the method embodiment for the part not mentioned in the apparatus embodiment.
Corresponding to the above embodiment of the training method of the age estimation model, an embodiment of the present invention further provides a training apparatus of an age estimation model, as shown in fig. 6, the training apparatus includes:
A sample acquisition module 60, configured to acquire a sample video; the sample video includes a plurality of video frames, and the age labels corresponding to the video frames of the sample video are the same.
A sample input module 61, configured to input the sample video into the initial model, so that the initial model determines, according to the time sequence of the plurality of video frames, an age estimation result for each video frame in the sample video other than the first frame based on the features of the current video frame and the features of the video frame preceding the current video frame.
A model training module 62, configured to perform machine learning training on the initial model based on the age estimation result and the age label of each video frame to obtain the age estimation model.
The training apparatus of the age estimation model firstly obtains a sample video; the sample video is input into an initial model, and the initial model determines, according to the time sequence of the plurality of video frames, an age estimation result for each video frame in the sample video other than the first frame based on the features of the current video frame and the features of the video frame preceding the current video frame; machine learning training is then performed on the initial model based on the age estimation result and the age label of each video frame to obtain the age estimation model. In the process of training the age estimation model, multi-level semantic features related to age can be learned automatically, and the age estimation model can fuse the features contained in video frames at different moments, so that the features extracted by the network contain more comprehensive time-sequence information and the accuracy of age estimation is improved. In addition, the variance of the age estimation results of the video frames at different moments is constrained, so that these results are more consistent with each other and the stability of the age estimation result is further improved.
Specifically, the initial model of the age estimation model includes: a feature extraction network, a recurrent neural network and an age estimation network. The sample input module 61 is configured to: extract feature data of each video frame in the sample video through the feature extraction network; fuse, through the recurrent neural network, the feature data of the first video frame of the sample video with its own feature data to obtain the fusion feature of the first video frame; for each video frame in the sample video other than the first frame, fuse, through the recurrent neural network and according to the time sequence of the video frames, the feature data of the current video frame with the fusion feature corresponding to the video frame preceding the current video frame to obtain the fusion feature of the current video frame; and perform feature extraction on the fusion feature of each video frame through the age estimation network to obtain the age estimation result of each video frame.
Further, the model training module 62 is configured to: determine the loss amount according to the age estimation result and the age label of each video frame; update the weight parameters of the initial model based on the loss amount; and continue to execute the step of obtaining a sample video until the loss amount converges or a preset number of training iterations is reached, so as to obtain the age estimation model.
Specifically, the model training module 62 is further configured to: determine a first loss value according to the difference between the age estimation result of each video frame in the sample video and the age label; determine a second loss value according to the difference between the age estimation result of each video frame in the sample video and the mean value of the age estimation results of the video frames; and obtain the loss amount from the first loss value and the second loss value.
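A minimal sketch of this training procedure is given below, reusing the illustrative AgeEstimationModel and age_losses from the earlier sketches; the Adam optimizer, learning rate, epoch count and the structure of the data loader are assumptions made for the example.

```python
import torch

def train(model, data_loader, num_epochs=10, lr=1e-4, lambda_var=1.0):
    """data_loader is assumed to yield (frames, age_label) pairs, where frames
    has shape (T, 3, H, W) and age_label is the sample video's shared label."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_epochs):                   # or: until the loss amount converges
        for frames, age_label in data_loader:
            outputs = model(frames)               # per-frame age estimation results
            loss = age_losses(outputs, age_label, lambda_var)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                      # update the weight parameters
    return model
```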
Specifically, the first loss value is determined by the following equation:
L_{age} = \frac{1}{T}\sum_{i=1}^{T}\left(\hat{a}_i - a\right)^2

wherein L_{age} represents the first loss value; a represents the age label; \hat{a}_i represents the age estimation result corresponding to the i-th video frame in the plurality of video frames; T represents the total number of video frames; \sum denotes a summation operation.
Specifically, the second loss value is determined by the following equation:
L_{var} = \frac{1}{T}\sum_{i=1}^{T}\left(\hat{a}_i - m\right)^2, \qquad m = \frac{1}{T}\sum_{i=1}^{T}\hat{a}_i

wherein L_{var} represents the second loss value; \hat{a}_i represents the age estimation result corresponding to the i-th video frame in the plurality of video frames (the same quantity as the output result above); T represents the total number of video frames; m represents the mean value of the age estimation results of the video frames; \sum denotes a summation operation.
The implementation principle and the generated technical effect of the training device of the age estimation model provided by the embodiment of the invention are the same as those of the embodiment of the training method of the age estimation model, and for the sake of brief description, corresponding contents in the embodiment of the method can be referred to where the embodiment of the device is not mentioned.
An embodiment of the present invention further provides an electronic device, which is shown in fig. 7 and includes a processor 101 and a memory 100, where the memory 100 stores machine executable instructions that can be executed by the processor 101, and the processor executes the machine executable instructions to implement the age estimation method or the training method of the age estimation model.
Further, the electronic device shown in fig. 7 further includes a bus 102 and a communication interface 103, and the processor 101, the communication interface 103, and the memory 100 are connected through the bus 102.
The Memory 100 may include a high-speed Random Access Memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 103 (which may be wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, and the like may be used. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one double-headed arrow is shown in FIG. 7, but this does not indicate only one bus or one type of bus.
The processor 101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above methods may be completed by integrated logic circuits of hardware in the processor 101 or by instructions in the form of software. The processor 101 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or executed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM or an EPROM, or registers. The storage medium is located in the memory 100, and the processor 101 reads the information in the memory 100 and completes the steps of the methods of the foregoing embodiments in combination with its hardware.
An embodiment of the present invention further provides a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions, and when the machine-executable instructions are called and executed by a processor, the machine-executable instructions cause the processor to implement the age estimation method or the training method for the age estimation model, and specific implementation may refer to method embodiments and will not be described herein again.
The computer program product of the age estimation method, the training method of the age estimation model, and the apparatuses provided in the embodiments of the present invention includes a computer-readable storage medium storing program code, and the instructions included in the program code may be used to execute the methods described in the foregoing method embodiments; for specific implementations, reference may be made to the method embodiments, which are not repeated here.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present invention, which are used to illustrate the technical solutions of the present invention rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can still modify the technical solutions described in the foregoing embodiments, or readily conceive of changes to them, or make equivalent substitutions for some of the technical features, within the technical scope disclosed by the present invention; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention and shall all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (15)

1. A method of age estimation, the method comprising:
acquiring a plurality of frames of video frames containing human faces, wherein the plurality of frames of video frames have time sequence, and the human faces contained in the plurality of frames of video frames belong to the same person;
sequentially inputting each frame of video frame into a pre-trained age estimation model to obtain an output result corresponding to each frame of video frame; wherein the age estimation model is to: according to the time sequence of the input multiple frames of video frames, determining the output result of the current video frame for each frame of video frames except the first frame according to the characteristics of the current video frame and the characteristics of the video frame before the current video frame;
and determining the age of the human face based on the output result of the plurality of frames of video frames.
2. The method of claim 1, wherein the age estimation model comprises: a feature extraction network, a recurrent neural network and an age estimation network;
the step of sequentially inputting each frame of the video frame into a pre-trained age estimation model to obtain an output result corresponding to each frame of the video frame comprises the following steps:
extracting feature data of each frame of video frames in the multiple frames of video frames through the feature extraction network;
fusing the feature data of a first frame video frame in the multiple frames of video frames with the feature data of the first frame video frame through the recurrent neural network to obtain the fusion feature of the first frame video frame;
fusing the feature data of the current video frame and the fusion feature of the video frame of the previous frame of the current video frame in the video frames except the first frame in the multi-frame video frames according to the time sequence of the video frames through the recurrent neural network to obtain the fusion feature of the current video frame;
and performing feature extraction on the fusion features of each frame of the video frame through the age estimation network to obtain an output result of each frame of the video frame.
3. The method of claim 1, wherein the weight parameters of the age estimation model are determined from the amount of loss during machine learning; the loss amount is determined according to the output result of each frame of the video frame output by the age estimation model and the age label corresponding to the multiple frames of the video frames; the age tag is used to indicate the age of a person contained in the multi-frame video frame.
4. The method of claim 3, wherein the amount of loss comprises a first loss value and a second loss value;
the first loss value is used to indicate: the difference between the output result corresponding to each frame of the video frame output by the age estimation model and the age label;
the second loss value is used to indicate: and the difference between the output result corresponding to each frame of the video frame output by the age estimation model and the average value of the output results corresponding to each frame of the video frame.
5. The method of claim 4, wherein the first loss value is determined by the following equation:
L_{age} = \frac{1}{T}\sum_{i=1}^{T}\left(\hat{a}_i - a\right)^2

wherein L_{age} represents the first loss value; a represents the age label; \hat{a}_i represents the output result corresponding to the i-th video frame in the multi-frame video frames; T represents the total number of video frames of the multi-frame video frames; \sum denotes a summation operation.
6. The method of claim 4, wherein the second loss value is determined by the following equation:
L_{var} = \frac{1}{T}\sum_{i=1}^{T}\left(\hat{a}_i - m\right)^2, \qquad m = \frac{1}{T}\sum_{i=1}^{T}\hat{a}_i

wherein L_{var} represents the second loss value; \hat{a}_i represents the output result corresponding to the i-th video frame in the multi-frame video frames; T represents the total number of video frames of the multi-frame video frames; m represents the average value of the output results corresponding to the video frames in the multi-frame video frames; \sum denotes a summation operation.
7. The method of claim 1, wherein the step of determining the age of the face based on the output of the plurality of frames of video comprises:
and calculating the average value of the output results corresponding to each frame of the video frames in the plurality of frames of video frames, and determining the average value as the age of the face.
8. A method for training an age estimation model, the method comprising:
acquiring a sample video; the sample video comprises a plurality of frames of video frames, and the age labels carried by the video frames of each frame of the plurality of frames of the sample video are the same;
inputting the sample video into an initial model, and determining an age estimation result of a current video frame according to the characteristics of the current video frame and the characteristics of a video frame before the current video frame for each frame of the sample video except a first frame according to the time sequence of a plurality of frames of the video frames through the initial model;
and performing machine learning training on the initial model based on the age estimation result of each frame of the video frame and the age label to obtain the age estimation model.
9. Training method according to claim 8, wherein the initial model of the age estimation model comprises: a feature extraction network, a recurrent neural network and an age estimation network;
the step of inputting the sample video into an initial model, and determining an age estimation result of a current video frame according to the characteristics of the current video frame and the characteristics of a video frame before the current video frame for each frame of the sample video except a first frame according to the time sequence of the frames of the video through the initial model, includes:
extracting feature data of each frame of video frame in the sample video through the feature extraction network;
fusing the feature data of a first frame video frame in the sample video with the feature data of the first frame video frame through the recurrent neural network to obtain the fusion feature of the first frame video frame;
fusing the feature data of the current video frame in the video frames except the first frame in the sample video with the corresponding fusion feature of the video frame of the previous frame of the current video frame through the recurrent neural network according to the time sequence of the video frames to obtain the fusion feature of the current video frame;
and performing feature extraction on the fusion features of each frame of the video frame through the age estimation network to obtain an age estimation result of each frame of the video frame.
10. A training method as claimed in claim 8, wherein the step of performing machine learning training on the initial model based on the age estimation result of each frame of the video frame and the age label to obtain the age estimation model comprises:
determining the loss amount according to the age estimation result of each frame of the video frame and the age label;
updating a weight parameter of the initial model based on the loss amount; and continuing to execute the step of obtaining the sample video until the loss amount is converged or reaches a preset training frequency, so as to obtain the age estimation model.
11. The training method of claim 10, wherein the step of determining the amount of loss based on the age estimation result of the video frame and the age label of each frame comprises:
determining a first loss value according to the difference between the age estimation result of each frame of the video frame in the sample video and the age label;
determining a second loss value according to the difference between the age estimation result of each frame of the video frame in the sample video and the mean value of the age estimation result of each frame of the video frame;
and obtaining the loss amount according to the first loss value and the second loss value.
12. An age estimation apparatus, characterized in that the apparatus comprises:
the system comprises a video frame acquisition module, a face recognition module and a face recognition module, wherein the video frame acquisition module is used for acquiring a plurality of frames of video frames containing faces, the plurality of frames of video frames have time sequence, and the faces contained in the plurality of frames of video frames belong to the same person;
the video frame input module is used for sequentially inputting each frame of video frame into an age estimation model which is trained in advance to obtain an output result corresponding to each frame of video frame; the age estimation model is used for: according to the time sequence of the input multiple frames of video frames, determining the output result of the current video frame for each frame of video frames except the first frame according to the characteristics of the current video frame and the characteristics of the video frame before the current video frame;
and the age estimation module is used for determining the age of the face based on the output result of the plurality of frames of video frames.
13. An apparatus for training an age estimation model, the apparatus comprising:
the sample acquisition module is used for acquiring a sample video; the sample video comprises a plurality of frames of video frames, and the age labels corresponding to the video frames of each frame of the plurality of frames of the sample video are the same;
the sample input module is used for inputting the sample video into an initial model so as to determine an age estimation result of the current video frame according to the characteristics of the current video frame and the characteristics of the video frame before the current video frame for each frame of the video frames except the first frame in the sample video according to the time sequence of the video frames of a plurality of frames through the initial model;
and the model training module is used for performing machine learning training on the initial model based on the age estimation result of each frame of the video frame and the age label to obtain the age estimation model.
14. An electronic device comprising a processor and a memory, the memory storing machine executable instructions executable by the processor, the processor executing the machine executable instructions to implement the age estimation method of any one of claims 1 to 7 or the training method of the age estimation model of any one of claims 8 to 11.
15. A machine-readable storage medium having stored thereon machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the age estimation method of any of claims 1 to 7 or the training method of the age estimation model of any of claims 8 to 11.
CN202010822523.XA 2020-08-14 2020-08-14 Age estimation method, and training method and device of age estimation model Pending CN111967382A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010822523.XA CN111967382A (en) 2020-08-14 2020-08-14 Age estimation method, and training method and device of age estimation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010822523.XA CN111967382A (en) 2020-08-14 2020-08-14 Age estimation method, and training method and device of age estimation model

Publications (1)

Publication Number Publication Date
CN111967382A true CN111967382A (en) 2020-11-20

Family

ID=73389043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010822523.XA Pending CN111967382A (en) 2020-08-14 2020-08-14 Age estimation method, and training method and device of age estimation model

Country Status (1)

Country Link
CN (1) CN111967382A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766238A (en) * 2021-03-15 2021-05-07 电子科技大学中山学院 Age prediction method and device
CN112766238B (en) * 2021-03-15 2023-09-26 电子科技大学中山学院 Age prediction method and device
CN112949662A (en) * 2021-05-13 2021-06-11 北京市商汤科技开发有限公司 Image processing method and device, computer equipment and storage medium
CN113343895A (en) * 2021-06-24 2021-09-03 北京欧珀通信有限公司 Target detection method, target detection device, storage medium, and electronic apparatus
CN113343895B (en) * 2021-06-24 2024-01-23 北京欧珀通信有限公司 Target detection method, target detection device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN110070067B (en) Video classification method, training method and device of video classification method model and electronic equipment
CN111967382A (en) Age estimation method, and training method and device of age estimation model
CN110472090B (en) Image retrieval method based on semantic tags, related device and storage medium
CN109902202B (en) Video classification method and device
CN110717325B (en) Text emotion analysis method and device, electronic equipment and storage medium
CN110070029B (en) Gait recognition method and device
CN111401339B (en) Method and device for identifying age of person in face image and electronic equipment
CN112950581A (en) Quality evaluation method and device and electronic equipment
CN110096617B (en) Video classification method and device, electronic equipment and computer-readable storage medium
CN112528764B (en) Facial expression recognition method, system and device and readable storage medium
CN110263733B (en) Image processing method, nomination evaluation method and related device
CN111401343A (en) Method for identifying attributes of people in image and training method and device for identification model
CN112183672A (en) Image classification method, and training method and device of feature extraction network
CN113191216A (en) Multi-person real-time action recognition method and system based on gesture recognition and C3D network
CN115526166A (en) Image-text emotion inference method, system, storage medium and equipment
CN111967383A (en) Age estimation method, and training method and device of age estimation model
CN114398350A (en) Cleaning method and device for training data set and server
CN112036293A (en) Age estimation method, and training method and device of age estimation model
CN112949571A (en) Method for identifying age, and training method and device of age identification model
CN108665455B (en) Method and device for evaluating image significance prediction result
CN111008579A (en) Concentration degree identification method and device and electronic equipment
CN113780444B (en) Training method of tongue fur image classification model based on progressive learning
CN112070060A (en) Method for identifying age, and training method and device of age identification model
CN115063858A (en) Video facial expression recognition model training method, device, equipment and storage medium
CN114218434A (en) Automatic labeling method, automatic labeling device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination