CN113052144B

CN113052144B - Training method, device and equipment of living human face detection model and storage medium

Info

Publication number: CN113052144B
Application number: CN202110482189.2A
Authority: CN
Inventors: 喻晨曦
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2021-04-30
Filing date: 2021-04-30
Publication date: 2023-02-28
Anticipated expiration: 2041-04-30
Also published as: CN113052144A

Abstract

The invention discloses a training method of a living body face detection model, which is applied to the technical field of artificial intelligence and is used for solving the technical problem that the existing living body face detection model does not have higher prediction accuracy in a real scene. The method provided by the invention comprises the following steps: acquiring face sample image sets belonging to different fields; selecting teacher networks corresponding to the fields one by one, and performing two-class training on the corresponding teacher networks through the face sample image set; freezing the trained teacher network; outputting the prediction probability of the face image as the living body face through each teacher network; training the student network by taking the average value of the output prediction probabilities as a target probability value of the student network; and taking the parameters of the trained feature extractor of the student network as initial values of the parameters of the feature extractor in the living body face detection model, and carrying out meta-learning training on the living body face detection model through the face sample image set until the loss function of the living body face detection model is converged.

Description

Training method, device and equipment of living human face detection model and storage medium

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a training method, a device, equipment and a storage medium for a human face detection model.

Background

With the updating and upgrading of mobile phones and the changing of shooting processing technology, a deception means that a non-face living body image is fake as a face living body image emerges endlessly, at present, the recognition of the face living body image is generally predicted through a trained living body face detection model, and when the non-face living body is predicted and recognized, a user is required to further check whether the face image is a living body face image.

In the training process of the living body face detection model, the fact that the prediction accuracy of the trained living body face detection model is limited by the type of the face image sample participating in the training is found, for example, when the shooting light of the face image sample participating in the training is bright, the prediction accuracy of the trained living body face detection model on whether the face image with dark shooting light is a living body face is reduced.

Because the living body face detection model trained by the conventional means is difficult to resist environmental noise, such as too strong or too weak illumination, different imaging qualities derived from different imaging devices, and the like, the trained living body face detection model lacks robustness for a new attack type, so that the living body face detection model trained by the conventional training method does not have high prediction accuracy in a real scene.

Disclosure of Invention

The embodiment of the invention provides a training method, a training device, equipment and a storage medium of a living body face detection model, and aims to solve the technical problem that the existing living body face detection model is difficult to resist environmental noise and poor in robustness, so that the existing living body face detection model does not have high prediction accuracy in a real scene.

A method for training a human face detection model, the method comprising:

acquiring a face sample image set belonging to different fields, wherein the face sample image in the face sample image set carries living or non-living labels;

selecting teacher networks corresponding to the fields one by one, and performing two-class training on the corresponding teacher networks through the face sample image set;

when the loss function of the teacher network is converged and the prediction accuracy of the teacher network is within a preset range after training, freezing the teacher network;

inputting the face image in the face image data set into each teacher network, and outputting the prediction probability that the face image is a living face through each teacher network;

taking the average value of the prediction probability output by each teacher network as the target probability value of the student network, and training the student network through the face image data set;

when the loss function of the student network is converged, acquiring parameters of a feature extractor of the student network;

taking the parameters of the feature extractor of the student network as initial values of the parameters of the feature extractor in the living body face detection model, and performing meta-learning training on the living body face detection model through the face sample image set;

and when the loss function of the living body face detection model is converged, obtaining the trained living body face detection model.

An apparatus for training a human face detection model, the apparatus comprising:

the system comprises a sample acquisition module, a data acquisition module and a data processing module, wherein the sample acquisition module is used for acquiring a face sample image set belonging to different fields, and the face sample images in the face sample image set carry living or non-living labels;

the first training module is used for selecting teacher networks corresponding to the fields one by one and performing two-class training on the corresponding teacher networks through the face sample image set;

the freezing module is used for freezing the teacher network when the loss function of the teacher network is converged and the prediction accuracy of the teacher network is within a preset range after training;

the probability calculation module is used for inputting the face images in the face image data set into each teacher network and outputting the prediction probability that the face images are living body faces through each teacher network;

the second training module is used for taking the average value of the prediction probabilities output by the teacher networks as the target probability value of the student network and training the student network through the face image data set;

the parameter acquisition module is used for acquiring the parameters of the feature extractor of the student network when the loss function of the student network is converged;

the meta-learning module is used for taking parameters of the feature extractor of the student network as initial values of the parameters of the feature extractor in the living body face detection model and carrying out meta-learning training on the living body face detection model through the face sample image set;

and the convergence module is used for obtaining the trained living body face detection model when the loss function of the living body face detection model converges.

A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above-mentioned training method of a living body face detection model when executing the computer program.

A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the above-mentioned steps of the method of training a living body face detection model.

The invention provides a training method, a device, equipment and a storage medium of a living body face detection model, wherein two classes of teacher networks in corresponding fields are trained through face sample image sets in different fields, when the loss function of the teacher networks is converged and the prediction accuracy of the teacher networks is within a preset range, the teacher networks are frozen, before the student networks are trained, face images in face image data sets are input into the teacher networks, the prediction probabilities of the face images as living body faces are output through the teacher networks, the average value of the prediction probabilities output through the teacher networks is used as the target probability value of the student networks, the student networks are trained through the face image data sets, when the loss function of the student networks is converged, the parameters of feature extractors of the student networks are obtained, then the parameters of the feature extractors of the student networks are used as the initial values of the parameters of the feature extractors in the living body face detection model, the face detection model is subjected to meta-learning through the face image sets, and when the loss function of the living body detection model is converged, the initial values of the face images of the living body face detection model to be detected can be subjected to face detection. When the human face living body detection model trained through the scheme detects the human face living body images, the human face images shot in different fields can be compatible, the robustness is high, and the human face living body detection model trained through the scheme can have high prediction accuracy regardless of the human face images shot through different mobile phones or the human face images shot with different light intensities.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.

FIG. 1 is a schematic diagram of an application environment of a training method for a living human face detection model according to an embodiment of the present invention;

FIG. 2 is a flowchart of a training method of a live face detection model according to an embodiment of the present invention;

FIG. 3 is a flowchart of meta-learning training of a live-body face detection model according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a training apparatus for a living human face detection model according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a computer device according to an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The training method of the living human face detection model provided by the application can be applied to the application environment shown in fig. 1, wherein the computer equipment is communicated with a server through a network. The computer device may be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.

In an embodiment, as shown in fig. 2, a training method for a human face detection model is provided, which is described by taking the computer device in fig. 1 as an example, and includes the following steps S101 to S108.

S101, obtaining face sample image sets belonging to different fields, wherein the face sample images in the face sample image sets carry living or non-living labels.

It is understood that the field of the face sample image set in this step represents face sample images taken by different types of shooting devices or face sample images taken under different lighting conditions. For example, a face sample image shot by an apple mobile phone can be classified into a first field, a face sample image shot by a Huawei mobile phone can be classified into a second field, when the type of the camera of the face sample image is not clear, the face sample image shot with the illumination intensity within the range of a first threshold value can be classified into a third field, the face sample image shot with the illumination intensity within the range of a second threshold value can be classified into a fourth field, and so on, so that face sample image sets belonging to various fields can be obtained.

In this embodiment, the face sample images belonging to the same class or the same field are classified into one class, and then the face sample image sets of different fields can be obtained.

In one embodiment, the face sample image set includes, but is not limited to, public data sets such as CASIA-MFSD, IDIAP, MSU-MFSD, ouu-NPU, LCC FASD, NUAA, faceForensics + +, and the like, and also includes data sets that are arranged internally in the company.

And S102, selecting teacher networks corresponding to the fields one by one, and performing two-class training on the corresponding teacher networks through the face sample image set.

In one embodiment, the teacher network may optionally use Resnet 50. The teacher networks corresponding to each of the domains may each employ a Resnet 50 neural network. When the number of the fields of the face sample image set is 10, the number of the teacher networks is also 10, and each teacher network corresponds to one field of the face sample image set.

Resnet is an abbreviation for Residual Network (Residual Network), which is widely used in the field of object classification and the like and as part of the classical neural Network of the computer vision task backbone, where Resnet 50 is a typical Residual neural Network.

S103, when the loss function of the teacher network is converged and the prediction accuracy of the teacher network is within a preset range after training, freezing the teacher network.

In one embodiment, the network Freeze may be implemented by a Freeze command. It will be appreciated that the purpose of the frozen model is not to train any more, and the current parameters resulting from the training are quantified and used only for forward derivation through the teacher's network.

In one embodiment, the predetermined range is, for example, 90%, indicating that the teacher network is frozen when the average error rate (including missed versus missed) of the teacher network is within 10%.

And S104, inputting the face images in the face image data set into each teacher network, and outputting the prediction probability that the face images are living faces through each teacher network.

Wherein the image dataset is, for example, an open source ImageNet image dataset, which is a large visual database for visual object recognition software research, more than 1400 million image URLs (uniform resource locators) are manually annotated by ImageNet to indicate objects in a picture, and a bounding box is also provided in at least one million images.

And S105, taking the average value of the prediction probabilities output by the teacher networks as a target probability value of the student network, and training the student network through the face image data set.

In one embodiment, the student network may be selected from Resnet 34, mobilent v3small or mobilent v2, and the aim is to make it difficult for the antagonistic learner to distinguish the teacher network teacher from the student network student by performing antagonistic learning training on them.

S106, when the loss function of the student network is converged, obtaining the parameters of the feature extractor of the student network.

The target loss of the student network is the loss of KL subvergence and antagonistic learning of probability distribution of student network students and teacher network teacher.

S107, taking the parameters of the feature extractor of the student network as initial values of the parameters of the feature extractor in the living body face detection model, and performing meta-learning training on the living body face detection model through the face sample image set.

Fig. 3 is a flowchart of meta-learning training a living body face detection model according to an embodiment of the present invention, and in one embodiment, as shown in fig. 3, the step of performing meta-learning training on the living body face detection model through the face sample image set includes the following steps S301 to S303:

s301, extracting a sampling set comprising positive samples and negative samples from the face sample image set;

s302, sequentially performing meta-training, meta-testing and meta-optimization on the living human face detection model through the sampling set;

and S303, judging whether the loss function of the living body face detection model in the meta-optimization stage is converged, if so, judging that the living body face detection model is trained, otherwise, circularly extracting a sampling set comprising a positive sample and a negative sample from the face sample image set to the step of judging whether the loss function of the living body face detection model in the meta-optimization stage is converged, until the loss function of the living body face detection model in the meta-optimization stage is converged.

In one embodiment, the sequentially performing meta-training, meta-testing and meta-optimization on the living human face detection model through the sampling set includes:

dividing the face sample image set into a meta-training set and a meta-testing set;

extracting a plurality of sampling sets from the meta-training set, and training the living human face detection model in a meta-training stage through the sampling sets;

when the loss function of the living body face detection model in the meta-training stage is converged, testing and training the living body face detection model in the meta-testing stage through the meta-testing set;

when the loss function of the living body face detection model in the meta-test stage is converged, carrying out meta-optimization on the living body face detection model according to the training results of the living body face detection model in the meta-training stage and the meta-test stage;

and when the loss function of the living body face detection model in the meta-optimization stage is converged, judging that the loss function of the living body face detection model is converged.

The face sample image in one field in the face sample image set can be used as a meta-test set, and the face sample images in the other fields can be used as a meta-training set.

In one embodiment, the living body face detection model comprises a feature extractor, a depth map estimator and a meta-learning classifier, and the loss function of the living body face detection model in the meta-training stage comprises a first classification loss function, an iteration function of the meta-learning classifier and a first depth map estimation loss function, wherein the first classification loss function is:

L _cls (θ _F ，θ _M ) = Σ ylogM (F (x)) + (1-y) log (1-M (F (x))) where θ _F A parameter, θ, of a feature extractor representing the living body face detection model _M Parameters representing the meta-learning classifier, x represents a face sample image in the sampling set, y represents a value of a true label of the face sample image in the sampling set, F (x) represents a feature extracted from the face sample image by a feature extractor of the face detection model, and L (x) represents a feature extracted from the face sample image by a feature extractor of the face detection model _cls (θ _F ，θ _M ) Representing the first classification loss function.

The iterative function of the meta-learning classifier is:

wherein, theta _M Parameters representing the meta-learning classifier, alpha representing a hyper-parameter,

represents the parameter theta at the i-th training _M Gradient of (a), L _cls (θ _F ，θ _M ) An output result representing the first classification loss function.

The common parameters can be obtained through continuous training and learning of the model, but in general, the hyper-parameters are not learned, and the hyper-parameters are framework parameters in the machine learning model, such as the number of classes in a clustering method, or the number of topics in a topic model, and the like, and are called hyper-parameters. They are not the same as the parameters learned during training, and the hyper-parameters are usually set manually and adjusted by trial and error, or a series of exhaustive parameter sets are combined and enumerated, also called grid search.

The first depth map estimation loss function is:

L _Dtrn (θ _F ，θ _D )＝∑||D(F(X))-I|| ²

wherein, theta _F A parameter, θ, of a feature extractor representing the live face detection model _D Parameters representing the depth map estimator, F (X) representing a first feature extracted by a feature extractor of the face detection model on a face sample image in the meta training set, D (F (X)) representing an output result of the depth map estimator on the first feature extracted, and I representing a true depth of the first feature extracted.

In one embodiment, the extracted true depth I of the first feature may be obtained by an open source face 3D reconstruction and dense alignment technique PRNet.

The present embodiment uses an open source algorithmic approach, such as PRnet, to estimate the depth of the feature map. Here, for all labeled datasets, the face live view estimates the depth map using the PRnet and as a real depth map, the face non-live view we set a matrix of all 0's, which is equivalent to an all black depth map.

Further, the loss function of the living body face detection model in the meta-test stage comprises a second classification loss function and a second depth map estimation loss function, wherein the second classification loss function is as follows:

wherein N denotes a total domain number of the face sample image set, i denotes a domain number to which a face sample image currently used for meta-testing belongs, F (x) denotes a feature extracted from the face sample image by a feature extractor of the face detection model, M' _i (F (x)) represents the output of the meta-learning classifier on the extracted features, L _cls-tst Representing the second classification loss function.

Further, the second depth map estimation loss function is:

L _Dtst (θ _F ，θ _D )＝∑||D(F(X))-I|| ²

wherein F (X) represents a second feature extracted by the feature extractor of the face detection model for the face sample images in the meta test set, D (F (X)) represents an output result of the depth map estimator for the second feature extracted, and I represents a true depth of the second feature extracted.

The second depth map estimation loss function has the same expression as the first depth map estimation loss function, but because the sample input to the second depth map estimation loss function is an element test set and the sample input to the first depth map estimation loss function is an element training set, the output of the second depth map estimation loss function is different from the output of the first depth map estimation loss function.

Further, the true depth I of the second feature may be obtained by an open source face 3D reconstruction and dense alignment technique PRNet.

In one embodiment, the loss functions of the living body face detection model in the meta-optimization stage comprise a meta-learning classifier parameter loss function, a feature extractor parameter loss function and a depth map estimator parameter loss function. Wherein the meta-learning classifier parameter loss function is:

wherein beta and gamma represent hyperparameters, L _cls (θ _F ，θ _M ) An output value, L, representing said first classification loss function _cls-tst An output result representing the second classification loss function,

representing a parameter theta during training _M Of the gradient of (c).

It can be understood that, when the output value of the parameter loss function of the meta learning classifier tends to a stable state, the parameter loss function of the meta learning classifier converges to obtain the final parameter θ of the meta learning classifier _M 。

The feature extractor parameter loss function is:

wherein, theta _F Parameters of a feature extractor representing a live face detection model, beta and gamma represent hyper-parameters,

representing a parameter theta during training _F Gradient of (a), L _Dtst (θ _F ，θ _D ) An output value, L, representing the estimated loss function of the second depth map _cls-tst An output value, L, representing said second classification loss function _cls (θ _F ，θ _M ) Representing the first classification lossOutput value of function, L _Dtrn (θ _F ，θ _D ) An output value representing the first depth map estimation loss function.

It can be understood that, when the output value of the parameter loss function of the feature extractor tends to a steady state, the parameter loss function of the feature extractor converges to obtain the final parameter θ of the feature extractor of the living human face detection model _F 。

In one embodiment, the depth map estimator parameter loss function is:

wherein, theta _D Parameters representing the depth map estimator, beta and gamma represent hyper-parameters,

representing a parameter theta during training _D Gradient of (a), L _Dtst (θ _F ，θ _D ) An output value, L, representing the estimated loss function of the second depth map _Dtrn (θ _F ，θ _D ) An output value representing the first depth map estimation loss function.

It can be understood that when the output value of the depth map estimator parametric loss function tends to be in a steady state, the convergence of the depth map estimator parametric loss function is shown, and the final parameter θ of the depth map estimator parametric loss function is obtained _D 。

And S108, when the loss function of the living body face detection model is converged, obtaining the trained living body face detection model.

When the living body face detection is carried out through the trained living body face detection model, the face image to be detected is input into the living body face detection model, and whether the face image to be detected is the prediction result of the living body face can be obtained.

At present, because enough real business data are lacked for model training and testing, the distribution of the real business data is almost unpredictable, the domains of the data sets of the public market and the business are often greatly different, and the distribution of the training and the testing can not be aligned by sampling the data sets in a feasible mode. Therefore, the field intersection of the mapped data sets can be maximized, and the face detection model is trained by combining face sample image sets in dozens of fields with meta-learning, so that the trained face detection model can have higher prediction accuracy under the conditions that training sample data and actual images to be detected are human face living bodies or not in different fields.

The training method of the living body face detection model provided by the embodiment can judge and feed back the living body confidence probability and the depth map of a client in an average 50 milliseconds according to the front face of an identity card uploaded by the client, and after the test of millions of data by the post model detection misjudgment of online interception, the algorithm can identify a suspected mobile phone frame, an image with unnatural distortion and other samples which are artificially judged to be an obvious non-living body at a probability of 99.3%, can cover more than 87% of service non-living body characteristic scenes, and can well perform 40% of novel attacks after the test.

The training method of the living body face detection model provided in this embodiment performs two-class training on teacher networks in corresponding fields through face sample image sets in different fields, freezes the teacher network when a loss function of the teacher network converges and a prediction accuracy of the teacher network is within a preset range, inputs face images in face image data sets to each teacher network before training a student network, outputs prediction probabilities that the face images are living body faces through each teacher network, takes an average value of the prediction probabilities output through each teacher network as a target probability value of the student network, trains the student network through the face image data sets, acquires parameters of a feature extractor of the student network when the loss function of the student network converges, then performs meta-learning on the face detection model through the face sample image sets, and performs the living body face detection on the living body face detection model to be trained when the loss function of the living body detection model converges. When the human face living body detection model trained through the scheme is used for detecting the human face living body images, the human face images shot in different fields can be compatible, the robustness is high, and the human face living body detection model trained through the scheme can have high prediction accuracy regardless of the human face images shot through different mobile phones or the human face images shot with different light intensities.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

In an embodiment, a training device for a living body face detection model is provided, and the training device for the living body face detection model is in one-to-one correspondence with the training method for the living body face detection model in the above embodiment. As shown in fig. 4, the training apparatus 100 for the living body face detection model includes a sample acquisition module 11, a first training module 12, a freezing module 13, a probability calculation module 14, a second training module 15, a parameter acquisition module 16, a meta learning module 17, and a convergence module 18. The functional modules are explained in detail as follows:

the system comprises a sample acquisition module 11, a data processing module and a data processing module, wherein the sample acquisition module 11 is used for acquiring a face sample image set belonging to different fields, and the face sample images in the face sample image set carry living or non-living labels;

the first training module 12 is used for selecting teacher networks corresponding to the fields one by one and performing classification training on the corresponding teacher networks through the face sample image set;

a freezing module 13, configured to freeze the teacher network when the loss function of the teacher network is converged and the prediction accuracy of the teacher network is within a preset range after training;

a probability calculation module 14, configured to input the face image in the face image data set to each teacher network, and output the prediction probability that the face image is a living face through each teacher network;

a second training module 15, configured to train the student networks through the face image data set, by using an average value of the prediction probabilities output by the teacher networks as a target probability value of the student networks;

a parameter obtaining module 16, configured to obtain a parameter of a feature extractor of the student network when a loss function of the student network converges;

a meta-learning module 17, configured to perform meta-learning training on the living body face detection model through the face sample image set, using parameters of the feature extractor of the student network as initial values of parameters of the feature extractor in the living body face detection model;

and the convergence module 18 is configured to obtain the trained living body face detection model when the loss function of the living body face detection model converges.

In one embodiment, the meta learning module 17 includes:

the sampling unit is used for extracting a sampling set comprising positive samples and negative samples from the face sample image set;

the meta-training unit is used for sequentially carrying out meta-training, meta-testing and meta-optimization on the living human face detection model through the sampling set;

and the circulating unit is used for judging whether the loss function of the living body face detection model in the meta-optimization stage is converged, if so, judging that the living body face detection model is trained, otherwise, circulating the step of extracting a sampling set comprising a positive sample and a negative sample from the face sample image set to judge whether the loss function of the living body face detection model in the meta-optimization stage is converged, until the loss function of the living body face detection model in the meta-optimization stage is converged.

Further, the meta-training unit specifically includes:

the classification unit is used for dividing the face sample image set into a meta-training set and a meta-testing set;

the first training unit is used for extracting a plurality of sampling sets from the meta-training set and training the living human face detection model in a meta-training stage through the sampling sets;

the first testing unit is used for testing and training the living body face detection model in the meta-testing stage through the meta-testing set when the loss function of the living body face detection model in the meta-training stage is converged;

the first optimization unit is used for carrying out meta-optimization on the living body face detection model according to the training results of the living body face detection model in the meta-training stage and the meta-testing stage when the loss function of the living body face detection model in the meta-testing stage is converged;

and the judging unit is used for judging the convergence of the loss function of the living body face detection model when the loss function of the living body face detection model in the meta-optimization stage converges.

In one embodiment, the living body face detection model comprises a feature extractor, a meta learning classifier and a depth map estimator, wherein the loss function of the living body face detection model in the meta training stage comprises a first classification loss function, an iterative function of the meta learning classifier and a first depth map estimation loss function;

the first classification loss function is:

L _cls (θ _F ，θ _M )＝∑ylogM(F(x))+(1-y)log(1-M(F(x)))

wherein, theta _F A parameter, θ, of a feature extractor representing the living body face detection model _M Parameters representing the meta-learning classifier, x represents a face sample image in the sample set, y represents a value of a true label of the face sample image in the sample set, F (x) represents a feature extracted from the face sample image by a feature extractor of the face detection model, and L (x) represents a feature extracted from the face sample image by a feature extractor of the face detection model _cls (θ _F ，θ _M ) Representing the first classification loss function;

the iterative function of the meta-learning classifier is:

represents the parameter theta at the i-th training _M Gradient of (a), L _cls (θ _F ，θ _M ) An output result representing the first classification loss function;

the first depth map estimated loss function is:

L _Dtrn (θ _F ，θ _D )＝∑||D(F(X))-I|| ² wherein, theta _F A parameter, θ, of a feature extractor representing the living body face detection model _D Parameters representing the depth map estimator, F (X) representing a first feature extracted by a feature extractor of the face detection model on a face sample image in the meta training set, D (F (X)) representing an output result of the depth map estimator on the first feature extracted, and I representing a true depth of the first feature extracted.

Further, the loss function of the living body face detection model in the meta-test stage comprises a second classification loss function and a second depth map estimation loss function;

the second classification loss function is:

wherein N denotes a total domain number of the face sample image set, i denotes a domain number to which a face sample image currently used for meta-testing belongs, F (x) denotes a feature extracted from the face sample image by a feature extractor of the face detection model, M' _i (F (x)) represents the output of the meta-learning classifier on the extracted features, L _cls-tst Representing the second classification loss function;

the second depth map estimation loss function is:

L _Dtst (θ _F ，θ _D )＝∑||D(F(X))-I|| ²

Further, the loss function of the living body face detection model in the meta-optimization stage comprises a meta-learning classifier parameter loss function, a feature extractor parameter loss function and a depth map estimator parameter loss function;

the meta-learning classifier parameter loss function is:

representing a parameter theta during training _M A gradient of (a);

the feature extractor parameter loss function is:

wherein, theta _F Parameters of a feature extractor representing a live face detection model,

representing a parameter theta during training _F Gradient of (a), L _Dtst (θ _F ，θ _D ) An output value, L, representing an estimated loss function of said second depth map _cls-tst An output value, L, representing said second classification loss function _cls (θ _F ，θ _M ) An output value, L, representing said first classification loss function _Dtrn (θ _F ，θ _D ) An output value representing the first depth map estimation loss function;

the depth map estimator parameter loss function is:

When the face living body detection model obtained by the training device of the living body face detection model provided by the embodiment is used for detecting the face living body image, the face living body detection model can be compatible with the face images shot in different fields, has stronger robustness, and can have higher prediction accuracy regardless of the face images shot by different mobile phones or the face images shot by different light intensities.

Wherein the meaning of "first" and "second" in the above modules/units is only to distinguish different modules/units, and is not used to define which module/unit has higher priority or other defining meaning. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to such process, method, article, or apparatus, and such that a division of modules presented in this application is merely a logical division and may be implemented in a practical application in a further manner.

For specific limitations of the training apparatus for the living body face detection model, reference may be made to the above limitations of the training method for the living body face detection model, and details are not repeated here. All or part of each module in the training device of the living body face detection model can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operating system and the computer program to run on the non-volatile storage medium. The network interface of the computer device is used for communicating with an external server through a network connection. The computer program is executed by a processor to implement a method of training a human face detection model.

In one embodiment, a computer device is provided, which includes a memory, a processor and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the steps of the training method for a living human face detection model in the above-mentioned embodiments, such as the steps 101 to 108 shown in fig. 2 and other extensions of the method and related steps. Alternatively, the processor, when executing the computer program, implements the functions of the modules/units of the training apparatus for a living body face detection model in the above-described embodiments, such as the functions of the modules 11 to 18 shown in fig. 4. To avoid repetition, further description is omitted here.

The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like which is the control center for the computer device and which connects the various parts of the overall computer device using various interfaces and lines.

The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the computer apparatus by executing or executing the computer programs and/or modules stored in the memory and calling data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, video data, etc.) created according to the use of the cellular phone, etc.

The memory may be integrated in the processor or may be provided separately from the processor.

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the training method of the living body face detection model in the above-described embodiments, such as the steps 101 to 108 shown in fig. 2 and extensions of other extensions and related steps of the method. Alternatively, the computer program, when executed by the processor, implements the functions of the modules/units of the training apparatus for a living body face detection model in the above-described embodiments, such as the functions of the modules 11 to 18 shown in fig. 4. To avoid repetition, further description is omitted here.

In the training method, the training device, the training equipment and the storage medium of the living body face detection model provided by the embodiment, two-class training is performed on teacher networks in corresponding fields through face sample image sets in different fields, when the loss function of the teacher networks is converged and the prediction accuracy of the teacher networks is within a preset range, the teacher networks are frozen, before training of student networks, the face images in face image data sets are input into the teacher networks, the prediction probabilities that the face images are living body faces are output through the teacher networks, the average value of the prediction probabilities output through the teacher networks is used as the target probability value of the student networks, the student networks are trained through the face image data sets, when the loss function of the student networks is converged, the parameters of feature extractors of the student networks are obtained, then the parameters of the feature extractors of the student networks are used as the initial values of the parameters of the feature extractors in the face detection model, the living body detection model is subjected to meta-learning through the face sample image sets, and when the loss function of the living body detection model is converged, the living body face detection model can be subjected to face detection by using the face sample image sets for face detection. When the human face living body detection model trained through the scheme is used for detecting the human face living body images, the human face images shot in different fields can be compatible, the robustness is high, and the human face living body detection model trained through the scheme can have high prediction accuracy regardless of the human face images shot through different mobile phones or the human face images shot with different light intensities.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), rambus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. A method for training a human face detection model, the method comprising:

acquiring a face sample image set belonging to different fields, wherein the face sample image in the face sample image set carries living or non-living marks;

inputting the face images in the face image data set into each teacher network, and outputting the prediction probability that the face images are living faces through each teacher network;

taking the average value of the prediction probabilities output by the teacher networks as a target probability value of a student network, and training the student network through the face image data set;

taking parameters of a feature extractor of the student network as initial values of the parameters of the feature extractor in the living body face detection model, and performing meta-learning training on the living body face detection model through the face sample image set;

when the loss function of the living body face detection model is converged, obtaining a trained living body face detection model;

wherein the step of performing meta-learning training on the living body face detection model through the face sample image set comprises:

extracting a sampling set comprising positive samples and negative samples from the face sample image set;

sequentially performing element training, element testing and element optimization on the living human face detection model through the sampling set;

judging whether the loss function of the living body face detection model in the meta-optimization stage is converged, if so, judging that the living body face detection model is trained, otherwise, circularly extracting a sampling set comprising a positive sample and a negative sample from the face sample image set to the step of judging whether the loss function of the living body face detection model in the meta-optimization stage is converged, until the loss function of the living body face detection model in the meta-optimization stage is converged;

wherein, the steps of performing meta-training, meta-testing and meta-optimization on the living human face detection model in sequence through the sampling set comprise:

when the loss function of the living body face detection model in the meta-training stage is converged, performing test training on the living body face detection model in the meta-testing stage through the meta-testing set;

when the loss function of the living body face detection model in the meta-optimization stage is converged, judging the convergence of the loss function of the living body face detection model;

the living body face detection model comprises a feature extractor, a meta-learning classifier and a depth map estimator, wherein the loss function of the living body face detection model in the meta-training stage comprises a first classification loss function, an iterative function of the meta-learning classifier and a first depth map estimation loss function;

the first classification loss function is:

L _cls (θ _F ，θ _M )＝∑y logM(F(x))+(1-y)log(1-M(F(x)))

wherein, theta _F A parameter, θ, of a feature extractor representing the living body face detection model _M Parameters representing the meta-learning classifier, x represents a face sample image in the sampling set, y represents a value of a true label of the face sample image in the sampling set, F (x) represents a feature extracted from the face sample image by a feature extractor of the face detection model, M (F (x)) represents an output result of the meta-learning classifier on the extracted F (x), and L (F (x)) represents an output result of the meta-learning classifier on the extracted F (x) _cls (θ _F ，θ _M ) Representing the first classification loss function;

the iterative function of the meta-learning classifier is:

the first depth map estimated loss function is:

L _Dtrn (θ _F ，θ _D )＝∑||D(F(X))-I|| ²

wherein, theta _F A parameter, θ, of a feature extractor representing the live face detection model _D Parameters representing the depth map estimator, F (X) representing a first feature extracted by a feature extractor of the face detection model on a face sample image in the meta training set, D (F (X)) representing an output result of the depth map estimator on the first feature extracted, and I representing a true depth of the first feature extracted;

wherein, the loss function of the living human face detection model in the meta-test stage comprises a second classification loss function and a second depth map estimation loss function;

the second classification loss function is:

wherein N represents the total domain number of the face sample image set, i represents the domain number of the face sample image currently used for meta-test, and F (x) represents the feature extracted from the face sample image by the feature extractor of the face detection model，M′ _i (F (x)) represents the output of the meta-learning classifier on the extracted features, L _cls-tst Representing the second classification loss function;

the second depth map estimation loss function is:

L _Dtst (θ _F ，θ _D )＝∑||D(F(X))-I|| ²

wherein F (X) represents a second feature extracted by a feature extractor of the face detection model for a face sample image in the meta-test set, D (F (X)) represents an output result of the depth map estimator for the second feature extracted, and I represents a true depth of the second feature extracted;

the loss function of the living body face detection model in the meta-optimization stage comprises a meta-learning classifier parameter loss function, a feature extractor parameter loss function and a depth map estimator parameter loss function;

the meta-learning classifier parameter loss function is:

wherein beta and gamma represent hyper-parameters, L _cls (θ _F ，θ _M ) An output value, L, representing said first classification loss function _cls-tst Represents an output result of the second classification loss function,

representing a parameter theta during training _M A gradient of (a);

the feature extractor parameter loss function is:

representing a parameter theta during training _F Gradient of (a), L _Dtst (θ _F ，θ _D ) An output value, L, representing the estimated loss function of the second depth map _cls-tst An output value, L, representing said second classification loss function _cls (θ _F ，θ _M ) An output value, L, representing said first classification loss function _Dtrn (θ _F ，θ _D ) An output value representing the first depth map estimation loss function;

the depth map estimator parameter loss function is:

2. A training apparatus for a living body face detection model, the apparatus being constructed according to the training method for a living body face detection model of claim 1, the apparatus comprising:

the system comprises a sample acquisition module, a data processing module and a data processing module, wherein the sample acquisition module is used for acquiring a face sample image set belonging to different fields, and the face sample images in the face sample image set carry living or non-living labels;

the first training module is used for selecting teacher networks which are in one-to-one correspondence with the fields and performing two-class training on the corresponding teacher networks through the face sample image set;

the probability calculation module is used for inputting the face images in the face image data set into each teacher network and outputting the prediction probability that the face images are living faces through each teacher network;

the second training module is used for training the student networks through the face image data sets by taking the average value of the prediction probabilities output by the teacher networks as the target probability value of the student networks;

a parameter obtaining module, configured to obtain a parameter of a feature extractor of the student network when a loss function of the student network converges;

the meta-learning module is used for performing meta-learning training on the living body face detection model through the face sample image set by taking parameters of the feature extractor of the student network as initial values of the parameters of the feature extractor in the living body face detection model;

the convergence module is used for obtaining a trained living body face detection model when the loss function of the living body face detection model converges;

wherein the meta learning module comprises:

the element training unit is used for carrying out element training, element testing and element optimization on the living body face detection model in sequence through the sampling set;

and the circulating unit is used for judging whether the loss function of the living body face detection model in the meta-optimization stage is converged, if so, judging that the living body face detection model is trained, otherwise, circulating the step of extracting a sampling set comprising a positive sample and a negative sample from the face sample image set to judge whether the loss function of the living body face detection model in the meta-optimization stage is converged, and till the loss function of the living body face detection model in the meta-optimization stage is converged.

3. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the training method of the living body face detection model as claimed in claim 1 when executing the computer program.

4. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method for training a living body face detection model according to claim 1.