CN116246314A - Living body detection model training and living body detection method and device - Google Patents

Living body detection model training and living body detection method and device

Info

Publication number
CN116246314A
Authority
CN
China
Prior art keywords
living body
loss
optical flow
network
dimensional reconstruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211610853.8A
Other languages
Chinese (zh)
Inventor
杨博文
李建树
刘健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202211610853.8A priority Critical patent/CN116246314A/en
Publication of CN116246314A publication Critical patent/CN116246314A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40Spoof detection, e.g. liveness detection
    • G06V40/45Detection of the body part being alive
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/70Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in livestock or poultry

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for living body detection model training and living body detection, which comprises the following steps: acquiring, according to an input frame sample set, a face auxiliary view corresponding to each frame in the input frame sample set and a living body true value corresponding to the input frame sample set; extracting a sample feature sequence of the input frame sample set through a feature extraction backbone network; inputting the sample feature sequence into an auxiliary supervision network, and calculating an auxiliary loss according to the output auxiliary estimation graph and the face auxiliary view; inputting the sample feature sequence into a time domain detection network, and calculating a living body prediction loss according to the output living body prediction value and the living body true value; and determining a comprehensive loss according to the auxiliary loss and the living body prediction loss, and training the living body detection model with minimizing the comprehensive loss as the training target. Accordingly, the invention also discloses devices for living body detection model training and living body detection.

Description

Living body detection model training and living body detection method and device
Technical Field
The invention relates to a living body detection technology, in particular to a method and a device for training a living body detection model and detecting living bodies.
Background
Face recognition technology is widely applied in everyday scenarios such as payment and access control. With the development of 3D head model and facial mask manufacturing processes, their appearance has become more and more lifelike, and living body detection in existing face recognition systems is increasingly likely to be defeated.
There are several problems with existing living body detection algorithms: a living body detection algorithm that takes a single frame as input can only defend against low-quality 2D attacks and cannot make an accurate judgment on a high-quality 3D attack; a living body detection algorithm that takes dense frames as input needs to compute every frame, and the computational cost is so high that deployment on a low-compute platform (such as a mobile phone) is difficult; interactive living body detection requires the user to perform specified actions, the verification process is cumbersome, and it cannot defend against a mechanically driven movable head model or a prerecorded high-definition video.
Therefore, a living body detection scheme that can accurately defend against high-quality 3D attacks at reasonable computational cost is needed to solve the above problems.
Disclosure of Invention
The embodiments of the invention provide a method and a device for living body detection model training and living body detection, which are used to effectively defend against high-quality 3D attacks while reducing computational overhead, so as to provide the user with a non-perceptive (silent) liveness check experience.
The embodiment of the invention adopts the following technical scheme:
the invention provides a living body detection model training method, wherein the living body detection model at least comprises a feature extraction backbone network, an auxiliary supervision network and a time domain detection network, and the living body detection model training method comprises the following steps:
acquiring a face auxiliary view corresponding to each frame in an input frame sample set and a living body true value corresponding to the input frame sample set according to the input frame sample set;
extracting a sample feature sequence of the input frame training set through the feature extraction backbone network;
inputting the sample feature sequence into an auxiliary supervision network, and calculating auxiliary loss according to the output auxiliary estimation graph and the auxiliary face view;
inputting the sample characteristic sequence into the time domain detection network, and calculating the living body prediction loss according to the output living body prediction value and living body true value;
and determining comprehensive loss according to the auxiliary loss and the living body predicted loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
According to the invention, the living body detection model is trained with minimizing the comprehensive loss as the training target, so that after sparse sampling of the detection image, an accurate living body detection result for the corresponding image can be obtained. This effectively improves the capability of defending against high-quality 3D attacks, reduces the computational cost, and gives the user a better non-perceptive liveness check experience.
Further, the auxiliary supervision network is a three-dimensional reconstruction projection estimation network, the face auxiliary view is a face three-dimensional reconstruction projection view, and after the feature extraction backbone network extracts the sample feature sequence of the input frame training set, the method comprises the steps of:
encoding the sample feature sequence, inputting each frame of the encoded sample feature sequence into the three-dimensional reconstruction projection estimation network, and calculating three-dimensional reconstruction projection loss according to the output three-dimensional reconstruction projection estimation graph and the three-dimensional reconstruction projection view;
inputting the coded sample characteristic sequence into the time domain detection network, and calculating living body prediction loss according to the output living body prediction value and living body true value;
and determining comprehensive loss according to the three-dimensional reconstruction projection loss and the living body prediction loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
Further, the obtaining the three-dimensional reconstruction projection view of the face corresponding to each frame in the sample set includes:
for each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network;
and calculating the orthogonal projection view of the three-dimensional reconstruction information to a camera plane as the three-dimensional reconstruction projection view.
Further, calculating a three-dimensional reconstruction projection loss according to the output three-dimensional reconstruction projection estimation diagram and the three-dimensional reconstruction projection view, including:
and calculating the mean square error loss of the three-dimensional reconstruction projection estimated graph and the three-dimensional reconstruction projection view as the three-dimensional reconstruction projection loss.
Further, the auxiliary supervision network is an optical flow estimation network, the face auxiliary view is a face optical flow view, and after the feature extraction backbone network extracts the sample feature sequence of the input frame training set, the method includes:
inputting adjacent frames of the sample feature sequence into the optical flow estimation network, and calculating optical flow loss according to the output optical flow estimation graph and the optical flow view;
encoding the sample characteristic sequence, inputting the encoded sample characteristic sequence into the time domain detection network, and calculating the living body prediction loss according to the output living body prediction value and living body true value;
and determining comprehensive loss according to the optical flow loss and the living body predicted loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
Further, the obtaining the face optical flow view in the sample set includes:
Aiming at the adjacent frames in the sample set, obtaining optical flow estimation information of the whole scene through an optical flow estimation algorithm;
for each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network;
and taking the orthogonal projection view of the three-dimensional reconstruction information to the camera plane as a three-dimensional reconstruction projection, taking the three-dimensional reconstruction projection as a mask, and filtering to obtain the face optical flow view.
Further, obtaining the optical flow estimation graph output by the optical flow estimation network for adjacent frames in the sample feature sequence includes:
inputting adjacent frames in the sample feature sequence into the optical flow estimation network to respectively obtain optical flow features;
and inputting the difference value of the optical flow characteristics of the adjacent frames into a convolution up-sampling module to obtain the optical flow estimation graph.
Further, the calculating optical flow loss from the optical flow estimation map and the optical flow view includes:
acquiring the relative position change of the face between adjacent frames;
and after translational alignment according to the relative position change, calculating the mean square error of the optical flow estimation graph and the optical flow view as the optical flow loss.
Further, calculating a living body prediction loss from the output living body prediction value and the living body true value includes:
And calculating the cross entropy loss of the living body predicted value and the living body true value as the living body predicted loss.
The invention provides a living body detection method, which comprises the following steps:
collecting a detection image, and acquiring an input frame sequence from the detection image;
inputting the input frame sequence into a feature extraction backbone network in a pre-trained living body detection model, and obtaining a feature sequence output by the feature extraction backbone network;
and encoding the characteristic sequence, inputting the encoded characteristic sequence into a time domain detection network in the living body detection model, and obtaining a living body detection result output by the time domain detection network, wherein the living body detection model is obtained by training by adopting the living body detection model training method.
Further, obtaining an input frame sequence from the detected image includes:
and performing sparse sampling on the detection image to obtain a preset number of sampling frames serving as the input frame sequence.
The invention provides a living body detection training device, which comprises an input module, a spatial domain module, a time domain module and a training module:
the input module obtains, according to an input frame sample set, a face auxiliary view corresponding to each frame in the sample set and a living body true value corresponding to the input frame sample set;
the spatial domain module extracts a sample feature sequence of the input frame training set through the feature extraction backbone network, inputs the sample feature sequence into an auxiliary supervision network, and calculates auxiliary loss according to the output auxiliary estimation graph and the face auxiliary view;
the time domain module inputs the coded sample feature sequence into a time domain detection network, and calculates living body prediction loss according to the output living body prediction value and living body true value;
and the training module determines comprehensive loss according to the auxiliary loss and the living body predicted loss, and trains the living body detection model by taking the minimum comprehensive loss as a training target.
Further, the auxiliary supervision network is a three-dimensional reconstruction projection estimation network, and the face auxiliary view is a face three-dimensional reconstruction projection view:
the spatial domain module encodes the sample feature sequence, inputs each frame of the encoded sample feature sequence into the three-dimensional reconstruction projection estimation network, and calculates three-dimensional reconstruction projection loss according to the output three-dimensional reconstruction projection estimation diagram and the three-dimensional reconstruction projection view;
and the training module determines comprehensive loss according to the three-dimensional reconstruction projection loss and the living body prediction loss, and trains the living body detection model by taking the minimum comprehensive loss as a training target.
Further, the acquisition module obtains three-dimensional reconstruction information of a face area through a three-dimensional reconstruction network for each frame in the sample set, and calculates an orthogonal projection view from the three-dimensional reconstruction information to a camera plane as the three-dimensional reconstruction projection view of the face.
Further, the auxiliary supervision network is an optical flow estimation network, and the face auxiliary view is a face optical flow view:
the spatial domain module inputs adjacent frames of the sample feature sequence into the optical flow estimation network, and calculates optical flow loss according to the output optical flow estimation diagram and the optical flow view;
and the training module determines comprehensive loss according to the optical flow loss and the living body predicted loss, and trains the living body detection model by taking the minimum comprehensive loss as a training target.
Further, the acquisition module obtains optical flow estimation information of the whole scene through an optical flow estimation algorithm according to adjacent frames in the sample set;
for each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network; and taking the orthogonal projection view of the three-dimensional reconstruction information to the camera plane as a three-dimensional reconstruction projection, taking the three-dimensional reconstruction projection as a mask, and filtering to obtain the face optical flow view.
Further, the spatial domain module is further configured to input adjacent frames in the sample feature sequence into the optical flow estimation network to obtain optical flow features respectively, and to input the difference value of the optical flow features of the adjacent frames into a convolution up-sampling module to obtain the optical flow estimation graph.
Further, the spatial domain module is further used for acquiring the relative position change of the face between adjacent frames and, after translational alignment according to the relative position change, calculating the mean square error of the optical flow estimation graph and the optical flow view as the optical flow loss.
The present invention provides a living body detection apparatus, which includes:
the acquisition module is used for acquiring detection images and acquiring an input frame sequence from the detection images;
the feature module inputs the input frame sequence into a feature extraction backbone network in a pre-trained living body detection model, and obtains a feature sequence output by the feature extraction backbone network;
the detection module encodes the characteristic sequence, and inputs the encoded characteristic sequence into a time domain detection network in the living body detection model to obtain a living body detection result output by the time domain detection network, wherein the living body detection model is trained by adopting the living body detection model training method.
Further, the acquisition module performs sparse sampling on the acquired living body detection target image to obtain a preset number of sampling frames, and the preset number of sampling frames are used as the input frame sequence.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the steps in the model training and living body detection methods described above.
The present invention also provides a computing device comprising a memory and a processor, the memory having executable code stored therein which, when executed by the processor, performs the steps in the model training and living body detection methods described above.
The at least one technical scheme adopted by the embodiment of the invention has the following beneficial effects:
according to the embodiments of the invention, the living body detection model is trained with minimizing the comprehensive loss as the training target, so that after sparse sampling of the detection image, an accurate living body detection result for the corresponding image can be obtained. This effectively improves the capability of defending against high-quality 3D attacks, reduces the computational cost, and provides the user with a better non-perceptive liveness check experience.
Drawings
FIG. 1 schematically illustrates a flow chart of a living body detection model training method according to the present invention in one embodiment.
Fig. 2 schematically shows a flow chart of a living body detection method according to the present invention in one embodiment.
Fig. 3 schematically shows a structure of the living body detection training device according to the present invention in one embodiment.
Fig. 4 schematically shows a structure of the living body detecting apparatus according to the present invention in one embodiment.
Fig. 5 schematically shows a network structure of the living body detection model training method according to the present invention in one embodiment.
FIG. 6 schematically shows a flow chart of the data preparation stage of the living body detection model training method according to the present invention in one embodiment.
FIG. 7 schematically illustrates a method of calculating the translation-invariant optical flow loss according to the present invention in one embodiment.
Detailed Description
The following describes the technical solution of living body detection model training and living body detection according to the present invention in further detail with reference to specific embodiments and the corresponding drawings, but the detailed description does not limit the present invention.
In an embodiment of the present invention, a method for training a living body detection model is provided, where the living body detection model at least includes a feature extraction backbone network, an auxiliary supervision network, and a time domain detection network, and fig. 1 schematically shows a flow chart of the living body detection model training method according to the present invention under an implementation mode, where the method includes:
100: and acquiring the face auxiliary view corresponding to each frame in the sample set and the living body true value corresponding to the input frame sample set according to the input frame sample set.
Illustratively, the input frame sample set is derived from continuously captured dynamic user facial information, mainly including inherent attributes such as facial features and dynamic attributes such as expressions and actions, and may be a video segment, or may be continuously captured image frames, including but not limited to two-dimensional images, depth images, and the like, which are transmitted as an input sequence into a living body detection model for living body detection model training.
The face auxiliary view is a view obtained by analyzing an inherent attribute of the face image with existing face analysis techniques. A specific attribute of the face image, such as edge information, illumination change, texture information, three-dimensional information or depth information, can be extracted and displayed in the view, so that the distribution and variation of that feature can be observed directly and targeted calculations and other operations can be performed on it.
In some embodiments, a three-dimensional reconstruction projection view or an optical flow view is selected as the face auxiliary view. The three-dimensional reconstruction projection view reflects the three-dimensional structure information of the face image and represents depth information by color; the optical flow view shows the motion of pixel points in the image over time. In this way, single-frame information mining and multi-frame dynamic changes can both be used as liveness cues, and facial-region linkage information and micro-expression change information over time can be captured during feature extraction.
Furthermore, the obtained sample set and the face auxiliary views can be subjected to data enhancement, so that the characteristics of the image are more obvious and the efficiency of subsequent model training is improved, for example by jointly rotating and cropping the three views and applying color and blur augmentation to the RGB view.
It should be noted that the living body detection model belongs to supervised learning in machine learning, which means that an input sample set and a sample label are known, the sample set is input into the model to obtain a corresponding output result, model parameters are adjusted according to a comparison result of the result and the corresponding sample label, and a training effect is optimized; and after the optimal model is obtained through known data training for a plurality of times, the model is applied to new data, so that an output result is obtained. In this example, the live truth value is the sample label, which is known prior to training, and the value is expressed as "live" or "not live".
110: and extracting sample feature sequences of the input frame training set through a feature extraction backbone network.
In some embodiments, a convolutional feature extraction backbone network is employed to extract sample feature sequences of the input frame training set. In some embodiments, feature extraction is performed using a ResNet network. In some other embodiments, a network such as MobileNet, VGG may also be employed to extract sample feature sequences of the input frame training set.
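As an illustration of this step only, the following is a minimal sketch (not the patent's exact architecture) of applying a convolutional backbone frame by frame to a sparse input sequence; the ResNet-18 backbone, the 128-dimensional projection and all tensor shapes are assumptions chosen for the example.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FrameBackbone(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        base = resnet18()                        # any CNN backbone (ResNet/MobileNet/VGG) would do
        self.cnn = nn.Sequential(*list(base.children())[:-1])  # drop the classification head
        self.proj = nn.Linear(512, feat_dim)     # project to the feature dimension used downstream

    def forward(self, frames):                   # frames: (B, N, 3, H, W), N sparsely sampled frames
        b, n, c, h, w = frames.shape
        x = self.cnn(frames.view(b * n, c, h, w)).flatten(1)   # (B*N, 512)
        return self.proj(x).view(b, n, -1)                     # sample feature sequence (B, N, D)

feats = FrameBackbone()(torch.randn(2, 5, 3, 224, 224))        # e.g. 5 sampled frames per clip
print(feats.shape)                                             # torch.Size([2, 5, 128])
```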
120: and inputting the sample characteristic sequence into an auxiliary supervision network, and calculating auxiliary loss according to the output auxiliary estimation graph and the face auxiliary view.
In machine learning, in order to obtain a reference index capable of reflecting the model performance in multiple aspects, an auxiliary supervision network is introduced, the network obtains a corresponding loss value after processing certain auxiliary features of an input sample set, and all obtained loss is comprehensively evaluated on the model, so that the model performance can be more comprehensively known, and the model training efficiency is improved. In some embodiments, only one type of secondary supervisory network may be employed; in some other embodiments, multiple auxiliary supervisory networks may be used together as the case may be.
In some embodiments, the face auxiliary view comprises a three-dimensional reconstruction projection view or an optical flow view, and the auxiliary supervision network correspondingly comprises a three-dimensional reconstruction projection estimation network or an optical flow estimation network. The auxiliary supervision network outputs an auxiliary estimation graph corresponding to the face auxiliary view, from which the respective auxiliary losses are calculated. Loss is an index for judging the performance of a model during machine learning training. It is calculated with different loss functions, which operate on the predicted value and the true value to represent the difference between them; the smaller the loss, the better. Common loss functions include the classification error rate, the mean square error (MSE) and the cross entropy (CE).
130: and inputting the sample characteristic sequence into a time domain detection network, and calculating the living body prediction loss according to the output living body prediction value and the living body true value.
In the spatial domain detection network, spatial characteristic information of the face image, such as depth information, three-dimensional information and texture information, can be extracted; in the time domain detection network, the motion characteristics of the face area across the multi-frame images can be captured. By combining spatial-domain depth and temporal-domain motion into temporal-domain depth, the unique characteristics of each frame of image can be embodied, thereby better distinguishing living bodies from non-living bodies.
Specifically, a cross entropy loss of the living body predicted value and the living body true value is calculated as the living body predicted loss.
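A minimal sketch of this loss, assuming the time domain detection network emits two logits (live / not live) per sample; the tensors below are placeholders.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 2)                 # temporal network output for 4 samples (placeholder)
live_truth = torch.tensor([1, 0, 1, 1])    # 1 = live, 0 = not live (sample labels)
liveness_loss = F.cross_entropy(logits, live_truth)
print(liveness_loss.item())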
140: and determining comprehensive loss according to the auxiliary loss and the living body predicted loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
In some embodiments, the auxiliary loss and the living body predicted loss are weighted according to a certain proportion to obtain a comprehensive loss, and the network parameter theta is updated by taking the minimum comprehensive loss as a training target.
In some more specific embodiments, the network parameter θ is updated by gradient backpropagation (also known as the BP algorithm). Gradient backpropagation is a learning algorithm commonly used for neural networks. It is based on stochastic gradient descent (SGD) and the chain rule, and mainly comprises two stages, excitation propagation and weight updating; network parameters that meet the training target are obtained by repeatedly iterating these two stages. The excitation propagation stage is divided into forward propagation and backward propagation: in forward propagation, the training set is input into the network to obtain an excitation response; in backward propagation, the difference between the obtained excitation response and the target output corresponding to the training set, namely the response error, is computed. In the weight updating stage, for each weight in each layer of the network, the input excitation and the response error are multiplied to obtain the gradient of the weight; the gradient is multiplied by a learning-rate proportion, negated, and added to the current weight to obtain the updated weight. During backpropagation, the direction of the gradient indicates the direction in which the error grows, so each weight is updated along the direction of steepest gradient descent, and the error is propagated back to the previous layer after updating until all weights have been adjusted; the iteration continues until the model converges.
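The following is a hedged sketch of such a training step: a weighted comprehensive loss over the auxiliary loss and the living body prediction loss, minimized with SGD and backpropagation. The loss weights, the stand-in model and the learning rate are illustrative assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lambda_aux, lambda_live = 0.5, 1.0         # illustrative weights, not fixed by the patent

model = nn.Linear(8, 2)                    # stand-in for the full detection model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def training_step(aux_loss, liveness_loss):
    total = lambda_aux * aux_loss + lambda_live * liveness_loss  # comprehensive loss
    optimizer.zero_grad()
    total.backward()                       # gradient backpropagation through all sub-networks
    optimizer.step()                       # update the network parameters theta
    return total.detach()

out = model(torch.randn(4, 8))
liveness_loss = F.cross_entropy(out, torch.tensor([1, 0, 1, 1]))
aux_loss = out.pow(2).mean()               # placeholder for the auxiliary-network loss
print(training_step(aux_loss, liveness_loss))
```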
In some embodiments, the face auxiliary view is a face three-dimensional reconstruction projection view, and the face three-dimensional reconstruction projection view corresponding to each frame in the sample set is obtained, including the steps of:
for each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network;
and calculating the orthogonal projection view of the three-dimensional reconstruction information to the camera plane as a three-dimensional reconstruction projection view.
Three-dimensional reconstruction refers to an image-based modeling technique in the field of face modeling, namely restoring the three-dimensional structure from a plurality of two-dimensional images by extracting their feature information. In this embodiment, a 3DDFA network is used to obtain the three-dimensional reconstruction projection view of the face; this image is also referred to as a PNCC map and represents depth information with color changes. In some other embodiments, three-dimensional reconstruction networks such as FML or PRNet may also be employed to obtain the three-dimensional reconstruction projection view of the face.
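As an illustration only, the sketch below shows one way to form an orthogonal (orthographic) projection view from reconstructed face vertices, colouring each projected pixel by its normalized 3D coordinate in the spirit of a PNCC map. The random vertices, the image size and the depth-ordering convention are placeholder assumptions; a real pipeline would take the vertices from a reconstruction network such as 3DDFA.

```python
import numpy as np

def orthographic_projection_view(vertices, img_size=112):
    """vertices: (V, 3) reconstructed face points in camera coordinates (placeholder data below)."""
    vmin, vmax = vertices.min(0), vertices.max(0)
    ncc = (vertices - vmin) / (vmax - vmin + 1e-8)        # normalized coordinates -> RGB in [0, 1]
    # orthographic projection onto the camera plane: keep x, y; depth is encoded only in the colour
    px = np.clip((ncc[:, 0] * (img_size - 1)).astype(int), 0, img_size - 1)
    py = np.clip((ncc[:, 1] * (img_size - 1)).astype(int), 0, img_size - 1)
    view = np.zeros((img_size, img_size, 3), dtype=np.float32)
    order = np.argsort(vertices[:, 2])                    # far points first, near points last (z convention assumed)
    view[py[order], px[order]] = ncc[order]
    return view

proj = orthographic_projection_view(np.random.rand(5000, 3))
print(proj.shape)                                         # (112, 112, 3)
```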
In some more specific embodiments, the auxiliary supervisory network is a three-dimensional reconstruction projection estimation network, and after extracting a sample feature sequence of the input frame training set through the feature extraction backbone network, the method comprises:
encoding the sample feature sequence with a spatial domain Transformer, inputting each frame of the encoded sample feature sequence into the three-dimensional reconstruction projection estimation network, and calculating three-dimensional reconstruction projection loss according to the output three-dimensional reconstruction projection estimation graph and the three-dimensional reconstruction projection view;
Inputting the coded sample characteristic sequence into a time domain detection network, and calculating the living body prediction loss according to the output living body prediction value and the living body true value;
and determining comprehensive loss according to the three-dimensional reconstruction projection loss and the living body prediction loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
In some embodiments, the three-dimensional reconstructed projection loss is obtained by calculating a mean square error loss of the three-dimensional reconstructed projection estimate map and the three-dimensional reconstructed projection view. The mean square error is an expected value of the square of the difference between the estimated value and the true value, and is denoted as MSE, and the degree of change of the data can be evaluated. The smaller the value of the mean square error, the better the accuracy of the model description experimental data.
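A one-line illustration of this loss; the tensor shapes are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

proj_est = torch.rand(4, 3, 112, 112)   # estimated projection maps (placeholder)
proj_gt = torch.rand(4, 3, 112, 112)    # ground-truth projection views (placeholder)
reconstruction_proj_loss = F.mse_loss(proj_est, proj_gt)
```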
The comprehensive loss is obtained by weighting the three-dimensional reconstruction projection loss and the living body prediction loss in a certain proportion; with minimizing the comprehensive loss as the training target, the network parameter θ is updated by gradient backpropagation, and the iteration continues until the model converges.
In other embodiments, the face auxiliary view is a face optical flow view, and the obtaining a face optical flow view in the sample set includes the steps of:
aiming at adjacent frames in the sample set, obtaining optical flow estimation information of the whole scene through an optical flow estimation algorithm;
For each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network;
and taking the orthogonal projection view of the three-dimensional reconstruction information to the camera plane as a three-dimensional reconstruction projection, taking the three-dimensional reconstruction projection as a mask, and filtering to obtain a face optical flow view.
Optical flow refers to the apparent motion speed of patterns in an image over time, that is, the amount of motion of the same object pixel point from one frame of a video image to the next, represented by a two-dimensional vector. In this embodiment, the adopted optical flow estimation algorithm is RIFE; its input is two adjacent camera view frames in the sample set, the optical flows of the two images are generated and aligned by this method, and the optical flow of each pixel point in the corresponding image is output. Then, using the obtained three-dimensional reconstruction projection as a mask, the optical flow view is obtained by filtering. In some other embodiments, optical flow estimation algorithms such as FlowNet or FlowNet2 may also be employed.
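A hedged sketch of the masking step: the three-dimensional reconstruction projection acts as a face mask over the full-scene flow. Both arrays here are placeholders; in the described pipeline they would come from the optical flow algorithm (RIFE, FlowNet, ...) and from the projection step above.

```python
import numpy as np

def face_flow_view(scene_flow, projection_view):
    """scene_flow: (H, W, 2) per-pixel flow; projection_view: (H, W, 3) projection map."""
    face_mask = (projection_view.sum(axis=-1) > 0).astype(scene_flow.dtype)  # 1 where a face vertex projects
    return scene_flow * face_mask[..., None]                                 # zero the flow elsewhere

flow_view = face_flow_view(np.random.randn(112, 112, 2), np.random.rand(112, 112, 3))
```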
In some more specific embodiments, the auxiliary supervisory network is an optical flow estimation network, and after extracting a sample feature sequence of the input frame training set through the feature extraction backbone network, the method comprises:
inputting adjacent frames of the sample feature sequence into an optical flow estimation network, and calculating optical flow loss according to the output optical flow estimation graph and the optical flow view;
Encoding the sample characteristic sequence, inputting the encoded sample characteristic sequence into a time domain detection network, and calculating the living body prediction loss according to the output living body prediction value and living body true value;
and determining comprehensive loss according to the optical flow loss and the living body prediction loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
In some embodiments, outputting an optical flow estimation map from an optical flow estimation network for adjacent frames in a sequence of sample features includes the steps of:
inputting adjacent frames in the sample feature sequence into an optical flow estimation network to respectively obtain optical flow features;
and inputting the difference value of the optical flow characteristics of the adjacent frames into a convolution up-sampling module to obtain an optical flow estimation graph.
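A minimal sketch of such a head, under the assumption that the difference of the two frames' flow features is lifted to a small grid and upsampled by transposed convolutions to a two-channel flow estimate; all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class FlowHead(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 64 * 7 * 7)           # lift the feature vector to a small grid
        self.upsample = nn.Sequential(                      # convolutional up-sampling module
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 2, 4, stride=2, padding=1),  # 2-channel (dx, dy) flow map
        )

    def forward(self, feat_t, feat_t1):                     # optical flow features of two adjacent frames
        diff = feat_t1 - feat_t                             # motion cue: feature difference
        x = self.fc(diff).view(-1, 64, 7, 7)
        return self.upsample(x)                             # (B, 2, 56, 56) flow estimation map

flow_map = FlowHead()(torch.randn(2, 128), torch.randn(2, 128))
print(flow_map.shape)                                       # torch.Size([2, 2, 56, 56])
```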
In addition to common loss functions, the present invention also proposes an optical flow loss, namely a translation-invariant optical flow loss. In some embodiments, first, the relative position change of the face between adjacent frames is obtained by introducing an additional convolution kernel; then, after translational alignment according to this relative position change, the mean square error of the optical flow estimation graph and the optical flow view is calculated as the optical flow loss. This way of calculating the optical flow loss reduces the negative influence of overall head motion, i.e. the noise it introduces into the optical flow estimation, can effectively reflect micro-expression change information over time, and improves the ability to evaluate the model's effect.
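The sketch below shows one possible reading of this translation-invariant loss. The patent obtains the relative face position change with an additional convolution kernel and aligns the maps before taking the mean square error; here that global component is approximated by each map's mean flow, which is only an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def translation_invariant_flow_loss(flow_pred, flow_gt):
    """flow_pred, flow_gt: (B, 2, H, W) optical flow maps."""
    # subtract each map's mean flow so that a constant offset caused by overall
    # head translation between the frames does not dominate the error
    pred = flow_pred - flow_pred.mean(dim=(2, 3), keepdim=True)
    gt = flow_gt - flow_gt.mean(dim=(2, 3), keepdim=True)
    return F.mse_loss(pred, gt)

loss = translation_invariant_flow_loss(torch.randn(2, 2, 56, 56), torch.randn(2, 2, 56, 56))
```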
The comprehensive loss is obtained by weighting the optical flow loss and the living body prediction loss in a certain proportion; with minimizing the comprehensive loss as the training target, the network parameter θ is updated by gradient backpropagation, and the iteration continues until the model converges.
In some embodiments, a living body detection method is provided, and fig. 2 schematically shows a flow chart of the living body detection method according to the present invention in an embodiment, including:
200: collecting a detection image, and acquiring an input frame sequence from the detection image;
illustratively, using a camera as the image acquisition device, the input frames are derived from continuously captured dynamic user face information, which mainly includes inherent attributes such as facial features and dynamic attributes such as expressions and actions. The input may be a video segment or continuously captured image frames, including but not limited to two-dimensional images and depth images, which are transmitted as an input sequence into the living body detection model for living body detection.
In some embodiments, the video input sequence is sparsely sampled, i.e. a preset number of sampling frames, for example 2-10 frames, are obtained as the input frame sequence of this embodiment. In the living body detection model of this embodiment, the input can take the form of a frame sequence, an input method that is a compromise between single-frame images and dense-frame video: it can capture multi-frame dynamic information while meeting limited computing power requirements. With the development of 3D head model and facial mask manufacturing processes, their appearance is increasingly realistic, and living body detection in existing face recognition systems is increasingly likely to be defeated. Compared with existing living body detection algorithms, the introduction of sparse sampling can effectively defend against high-quality 3D head model and mask attacks; it extends single-frame silent living body detection, which inputs only one picture and can only defend against low-quality 2D attacks, and can make more reliable judgments by combining multi-frame information. Meanwhile, compared with dense-frame video living body detection, which needs to take every frame of a video sequence as input and estimates the subtle facial blood vessel color changes caused by the heartbeat through remote photoplethysmography, this approach greatly reduces the computing power requirement and relieves computational pressure, so that the product can be deployed in more low-compute scenarios. In addition, the sparse sampling method can omit the step in interactive living body detection of giving instructions and waiting for user feedback, realizing a non-perceptive silent check and relatively shortening the verification time, and it makes up for the deficiency that interactive living body detection cannot defend against a mechanically driven movable head model or a prerecorded high-definition video.
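As a small illustration of sparse sampling, the following picks a preset number of evenly spaced frame indices (5 here, within the 2-10 range mentioned above) from a longer capture; the clip length is a placeholder.

```python
import numpy as np

def sparse_sample(num_frames_total, num_samples=5):
    idx = np.linspace(0, num_frames_total - 1, num_samples)   # evenly spaced positions
    return np.round(idx).astype(int).tolist()

print(sparse_sample(150))   # e.g. [0, 37, 74, 112, 149] from a 150-frame clip
```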
210: inputting the input frame sequence into a feature extraction backbone network in a pre-trained living body detection model, and obtaining a feature sequence output by the feature extraction backbone network;
specifically, a convolutional feature extraction backbone network is employed to extract the feature sequence of the input frame sequence. In some embodiments, feature extraction is performed using a ResNet network. In some other embodiments, a network such as MobileNet or VGG may also be employed to extract the feature sequence of the input frame sequence.
220: and encoding the characteristic sequence, inputting the encoded characteristic sequence into a time domain detection network in a living body detection model, and obtaining a living body detection result output by the time domain detection network, wherein the living body detection model is trained by adopting the living body detection model training method.
Optionally, after the feature extraction backbone network extracts the feature sequence of the input frame sequence in step 210, the feature sequence is input into a spatial domain Transformer encoder in the pre-trained living body detection model for encoding, so as to obtain the encoded feature sequence. The encoded feature sequence is then directly input into the time domain detection network in the model to obtain the living body detection result.
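A hedged sketch of this inference path: backbone features of the sparse frame sequence are encoded by a spatial domain Transformer encoder and passed to a temporal head that aggregates over frames and outputs a live / not-live decision. The layer sizes, the mean pooling and the two-layer head are assumptions for illustration; the auxiliary supervision branches are not needed at inference time.

```python
import torch
import torch.nn as nn

class LivenessDetector(nn.Module):
    def __init__(self, feat_dim=128, n_heads=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.temporal_head = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))  # live / not-live logits

    def forward(self, feat_seq):                  # feat_seq: (B, N, D) backbone features
        encoded = self.spatial_encoder(feat_seq)  # spatial domain Transformer encoding
        pooled = encoded.mean(dim=1)              # aggregate over the N sampled frames
        return self.temporal_head(pooled)

logits = LivenessDetector()(torch.randn(1, 5, 128))
is_live = logits.argmax(dim=-1).item() == 1       # 1 = live under the label convention above
```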
The living body detection model is trained, as described in step 120, by combining the auxiliary loss and the living body prediction loss under the action of the auxiliary supervision network; more specifically, the training of the model is shaped by the three-dimensional reconstruction projection loss or the optical flow loss. Although the auxiliary supervision network is not needed during living body detection, detection with this model is still constrained by factors such as the three-dimensional reconstruction projection information or the optical flow information, so the detection result is more accurate and reliable.
Fig. 3 schematically shows a structural diagram of a living body detection training device according to an embodiment of the present invention, including an input module, a spatial module, a temporal module, and a training module:
300: and the input module acquires the face auxiliary view corresponding to each frame in the sample set and the living true value corresponding to the input frame sample set according to the input frame sample set.
Illustratively, the input frame sample set obtained by the input module is from continuously captured dynamic user face information, mainly including inherent attributes such as facial features and dynamic attributes such as expressions and actions, and can be a video or continuously captured image frames, including but not limited to two-dimensional images, depth images and the like, and is transmitted into a living body detection model as an input sequence for living body detection model training.
In some embodiments, the three-dimensional reconstructed projection view or the optical flow view is selected as the face auxiliary view in the input module. The three-dimensional reconstruction projection view reflects three-dimensional structure information of the face image and represents depth information by colors; the optical flow view displays the motion condition of the pixel points in the image under the time sequence change, so that single-frame information mining and multi-frame dynamic change can be simultaneously utilized as living clues, and facial region linkage information and micro expression change information under the time sequence can be captured in the feature extraction process.
Furthermore, the obtained sample set and the face auxiliary views can be subjected to data enhancement in the input module, so that the characteristics of the image are more obvious and the efficiency of subsequent model training is improved, for example by jointly rotating and cropping the three views and applying color and blur augmentation to the RGB view.
It should be noted that the living body detection model belongs to supervised learning in machine learning, which means that a sample set and a sample label entering an input module are known, the sample set is input into the model to obtain a corresponding output result, model parameters are adjusted according to a comparison result of the result and the corresponding sample label, and a training effect is optimized; and after the optimal model is obtained through known data training for a plurality of times, the model is applied to new data, so that an output result is obtained. In this example, the live truth value is the sample label, which is known prior to training, and the value is expressed as "live" or "not live".
310: the airspace module extracts a sample feature sequence of an input frame training set through a feature extraction backbone network, inputs the sample feature sequence into an auxiliary supervision network, and calculates auxiliary loss according to the output auxiliary estimation graph and the face auxiliary view.
In some embodiments, the spatial module employs a convolutional feature extraction backbone network to extract sample feature sequences of the input frame training set. In some embodiments, feature extraction is performed using a ResNet network. In some other embodiments, a network such as MobileNet, VGG may also be employed to extract sample feature sequences of the input frame training set.
In machine learning, in order to obtain reference indices that reflect the model performance in multiple aspects, an auxiliary supervision network is introduced, so that the model performance can be understood more comprehensively and the model training efficiency is improved. In some embodiments, only one type of auxiliary supervision network may be employed in the spatial domain module; in some other embodiments, multiple auxiliary supervision networks may be used together as the case requires.
In some embodiments, the face auxiliary view comprises a three-dimensional reconstruction projection view or an optical flow view, and the auxiliary supervision network correspondingly comprises a three-dimensional reconstruction projection estimation network or an optical flow estimation network. The spatial domain module outputs, through the auxiliary supervision network, an auxiliary estimation graph corresponding to the face auxiliary view, from which the respective auxiliary losses are calculated. Loss is an index for judging the performance of a model during machine learning training. It is calculated with different loss functions, which operate on the predicted value and the true value to represent the difference between them; the smaller the loss, the better. Common loss functions include the classification error rate, the mean square error (MSE) and the cross entropy (CE).
320: the time domain module inputs the coded sample characteristic sequence into a time domain detection network, and calculates the living body prediction loss according to the output living body prediction value and the living body true value.
In the spatial domain module, spatial characteristic information of the face image, such as depth information, three-dimensional information and texture information, can be extracted; in the time domain module, the motion characteristics of the face area across the multi-frame images can be captured. By combining spatial-domain depth and temporal-domain motion into temporal-domain depth, the unique characteristics of each frame of image can be embodied, thereby better distinguishing living bodies from non-living bodies.
Specifically, the time domain module calculates a cross entropy loss of the living body predicted value and the living body true value as the living body predicted loss.
330: the training module determines the comprehensive loss according to the auxiliary loss and the living body prediction loss, and trains the living body detection model by taking the minimum comprehensive loss as a training target.
In some embodiments, the training module weights the auxiliary loss and the living body predicted loss in a proportion to obtain a comprehensive loss, and updates the network parameter θ with the minimum comprehensive loss as the training target.
Fig. 4 schematically shows a structure of a living body detecting device according to the present invention in one embodiment, the device including:
400: and the acquisition module is used for acquiring the detection image and acquiring an input frame sequence from the detection image.
The acquisition module uses a camera as the image acquisition device; the input frame sample set comes from continuously captured dynamic user face information, which mainly includes inherent attributes such as facial features and dynamic attributes such as expressions and actions. The input may be a video segment or continuously captured image frames, including but not limited to two-dimensional images and depth images, which are transmitted as an input sequence into the living body detection model for living body detection.
In some embodiments, the acquisition module performs sparse sampling on the video input sequence, that is, it obtains a preset number of sampling frames, for example 2-10 frames, as the input frame sequence of this embodiment. In the living body detection model of this embodiment, the input can take the form of a frame sequence, an input method that is a compromise between single-frame images and dense-frame video: it can capture multi-frame dynamic information while meeting limited computing power requirements. With the development of 3D head model and facial mask manufacturing processes, their appearance is increasingly realistic, and living body detection in existing face recognition systems is increasingly likely to be defeated. Compared with existing living body detection algorithms, the introduction of sparse sampling can effectively defend against high-quality 3D head model and mask attacks; it extends single-frame silent living body detection, which inputs only one picture and can only defend against low-quality 2D attacks, and can make more reliable judgments by combining multi-frame information. Meanwhile, compared with dense-frame video living body detection, which needs to take every frame of a video sequence as input and estimates the subtle facial blood vessel color changes caused by the heartbeat through remote photoplethysmography, this approach greatly reduces the computing power requirement and relieves computational pressure, so that the product can be deployed in more low-compute scenarios. In addition, the sparse sampling method can omit the step in interactive living body detection of giving instructions and waiting for user feedback, realizing a non-perceptive silent check and relatively shortening the verification time, and it makes up for the deficiency that interactive living body detection cannot defend against a mechanically driven movable head model or a prerecorded high-definition video.
410: and the feature module inputs the input frame sequence into a feature extraction backbone network in a pre-trained living body detection model, and obtains a feature sequence output by the feature extraction backbone network.
In some embodiments, the feature module employs a convolutional feature extraction backbone network to extract the feature sequence of the input frame sequence. In some embodiments, feature extraction is performed using a ResNet network. In some other embodiments, a network such as MobileNet or VGG may also be employed.
420: the detection module encodes the feature sequence, and inputs the encoded feature sequence into a time domain detection network in a living body detection model to obtain a living body detection result output by the time domain detection network, wherein the living body detection model is trained by adopting the living body detection model training method.
Optionally, after the feature module extracts the feature sequence of the input frame sequence, the detection module inputs it into a spatial domain Transformer encoder in the pre-trained living body detection model for encoding, so as to obtain the encoded feature sequence. The encoded feature sequence is then directly input into the time domain detection network in the model to obtain the living body detection result.
The living body detection model in the detection module is trained, as described in step 120, by combining the auxiliary loss and the living body prediction loss under the action of the auxiliary supervision network; more specifically, the training of the model is shaped by the three-dimensional reconstruction projection loss or the optical flow loss. Although the auxiliary supervision network is not needed during living body detection, detection with this model is still constrained by factors such as the three-dimensional reconstruction projection information or the optical flow information, so the detection result is more accurate and reliable.
To further explain the living body detection model training method and device according to the present invention, a preferred embodiment is introduced. Fig. 5 schematically shows a network structure diagram of the living body detection model training method according to an embodiment of the present invention.
Before that, it is necessary to obtain, from the input frame sample set, the face auxiliary view corresponding to each frame of the sample set and the living body true value corresponding to the input frame sample set.
Optionally, fig. 6 schematically shows a flowchart of the data preparation stage of the living body detection model training method of the present invention in an embodiment, that is, the stage of generating the auxiliary views; in this embodiment, a three-dimensional reconstructed projection view and an optical flow view are used together as the auxiliary views.
As shown in fig. 6, first, for each camera-view frame $X_i$ in the input sample set, a three-dimensional reconstruction network is used to obtain the three-dimensional reconstruction information of the corresponding face area, from which the three-dimensional reconstructed projection view $P_i$ is computed. In this embodiment a 3DDFA network is used to obtain the three-dimensional reconstructed projection view of the face; this image is also called a PNCC map, in which the depth information is represented by color variation. In some other embodiments, three-dimensional reconstruction networks such as FML or PRNet may also be employed. Here $X_i$ denotes the i-th camera-view frame in the input sample set and $P_i$ denotes the three-dimensional reconstructed projection view of the i-th frame.
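A simplified sketch of how an orthographic projection view of this kind could be rendered from reconstructed face vertices, splatting per-vertex NCC colours onto the camera plane; the point-splat approximation, the depth convention (larger z assumed nearer the camera) and the function name are illustrative assumptions rather than the 3DDFA rendering procedure itself.

```python
import numpy as np

def orthographic_pncc(vertices, ncc_colors, height, width):
    """Render a PNCC-style projection view by orthographically projecting reconstructed
    face vertices onto the camera plane and painting each pixel with its NCC colour.

    vertices: (V, 3) x, y, z in image coordinates, from a 3D reconstruction network.
    ncc_colors: (V, 3) normalised-coordinate colours in [0, 1].
    """
    view = np.zeros((height, width, 3), dtype=np.float32)
    # paint far-to-near so that nearer vertices overwrite occluded ones
    order = np.argsort(vertices[:, 2])
    xs = np.clip(vertices[order, 0].astype(int), 0, width - 1)
    ys = np.clip(vertices[order, 1].astype(int), 0, height - 1)
    view[ys, xs] = ncc_colors[order]
    return view
```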
Next, for each pair of adjacent camera-view frames $X_i$ and $X_{i+1}$ in the sample set, the optical flow estimation information of the whole scene is obtained through an optical flow estimation algorithm; using the three-dimensional reconstruction projection as a mask, the flow belonging to the face region is filtered out to obtain the optical flow view $O_i$. In this embodiment the optical flow estimation algorithm is RIFE: its input is the two adjacent camera-view frames, and it outputs, for every pixel, the optical flow that aligns the two images. In some other embodiments, optical flow estimation algorithms such as FlowNet or FlowNet2 may also be employed. Here $O_i$ denotes the optical flow view of the i-th frame.
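A minimal sketch of this masking step, assuming the non-zero pixels of the projection view mark the face region; the array shapes and the helper name are illustrative.

```python
import numpy as np

def face_flow_view(flow, pncc_view):
    """Keep optical flow only inside the face region, using the 3D reconstruction
    projection (PNCC map) as a mask.

    flow: (H, W, 2) whole-scene optical flow between two adjacent frames.
    pncc_view: (H, W, 3) projection view; non-zero pixels mark the face region.
    """
    face_mask = np.any(pncc_view > 0, axis=-1, keepdims=True)   # (H, W, 1) boolean mask
    return flow * face_mask                                     # zero flow outside the face
```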
Furthermore, unified data augmentation can be applied simultaneously to the obtained sample set and the face auxiliary views, making the image characteristics more salient and improving the efficiency of subsequent model training. For example, rotation and crop augmentation are applied consistently to all three views, while color and blur augmentation are applied to the RGB view only, as sketched below.
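A simplified sketch of such unified augmentation, sharing one random crop across the three views and jittering only the RGB view; the crop ratio, jitter ranges and the assumption of 0-255 RGB values are illustrative, and rotation/blur are omitted for brevity.

```python
import numpy as np

def augment_triplet(rgb, pncc, flow, rng=None):
    """Apply the same geometric augmentation to all three views, and photometric
    augmentation to the RGB view only (a simplified illustration)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = rgb.shape[:2]
    # shared random crop: the geometric change must stay aligned across the three views
    ch, cw = int(0.9 * h), int(0.9 * w)
    top, left = rng.integers(0, h - ch + 1), rng.integers(0, w - cw + 1)
    crop = lambda img: img[top:top + ch, left:left + cw]
    rgb, pncc, flow = crop(rgb), crop(pncc), crop(flow)
    # photometric jitter on the RGB view only (values assumed in 0..255)
    rgb = np.clip(rgb * rng.uniform(0.8, 1.2) + rng.uniform(-10, 10), 0, 255)
    return rgb, pncc, flow
```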
The augmented sample set then serves as the input of the forward inference network, i.e. the living body detection model training network of this embodiment, while the augmented three-dimensional reconstruction projection views and optical flow views serve as auxiliary supervision signals and are fed into the auxiliary supervision network for processing.
In the model network shown in fig. 5, the augmented sample set is first subjected to sparse sampling to obtain the input frame sequence $[X_1, \dots, X_N]$, where N is the number of single-frame images in the input frame sequence, and the input frame sequence is input into the spatial-domain feature extraction network.
The convolutional feature extraction backbone network, realized in this network as convolution blocks (conv blocks), produces the sample feature sequence $[F_1, \dots, F_N]$ of the input frame training set, where $F_i$ is the feature map of the i-th frame. In some embodiments, feature extraction is performed using a ResNet network. In some other embodiments, networks such as MobileNet or VGG may also be employed to extract the sample feature sequence of the input frame training set.
In this embodiment, correspondingly, the three-dimensional reconstruction projection estimation map and the optical flow estimation map are used together as the auxiliary estimation maps, and the three-dimensional reconstruction projection loss and the optical flow loss are used together as the auxiliary losses. The auxiliary estimation maps and the auxiliary losses are obtained in the corresponding auxiliary supervision networks, which are part of the spatial-domain feature extraction network. The auxiliary estimation maps and auxiliary losses are obtained as follows:
Adjacent frames of the sample feature sequence are input into the optical flow estimation network, and the optical flow loss $Loss_{flow}$ is calculated from the output optical flow estimation map $\hat O_i$ and the optical flow view $O_i$, where $\hat O_i$ denotes the optical flow estimation map of the i-th frame and a loss denotes the difference between an estimated value and the corresponding true value.
The sample feature sequence $[F_1, \dots, F_N]$ is encoded with a spatial-domain Transformer encoder (Transformer Encoder); each frame of the encoded sample feature sequence $[\tilde F_1, \dots, \tilde F_N]$ is input into the three-dimensional reconstruction projection estimation network, which outputs a three-dimensional reconstruction projection estimation map $\hat P_i$ through a convolutional up-sampling (conv up-sampling) module. The three-dimensional reconstruction projection loss $Loss_{pncc}$ is then calculated by comparing it with the corresponding three-dimensional reconstruction projection view $P_i$, where $\hat P_i$ denotes the three-dimensional reconstruction projection estimation map of the i-th frame.
The encoded sample feature sequence $[\tilde F_1, \dots, \tilde F_N]$ is input into the time domain detection network, and the living body prediction loss $Loss_{ce}$ is calculated from the output living body predicted value $Y_{pred}$ and the living body true value $Y_{gt}$.
The integrated loss $Loss_{all}$ is determined from the auxiliary losses and the living body prediction loss, and the living body detection model is trained with minimization of the integrated loss as the training target.
The step of inputting the adjacent frames of the sample feature sequence into the optical flow estimation network and outputting an optical flow estimation map comprises:

inputting the adjacent frames of the sample feature sequence into the optical flow estimation network to obtain their respective optical flow features;

inputting the difference of the optical flow features of the adjacent frames, i.e. the feature change $\Delta F_i$, into a convolutional up-sampling (conv up-sample) module to obtain the optical flow estimation map $\hat O_i$.
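The following sketch shows one plausible conv up-sampling head that maps the feature difference of adjacent frames to a 2-channel flow map; the channel widths, layer depth and class name are assumptions, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class FlowEstimationHead(nn.Module):
    """Predict a 2-channel optical flow estimation map from the difference of the
    optical flow features of two adjacent frames (conv up-sampling sketch)."""
    def __init__(self, in_ch=512):
        super().__init__()
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(in_ch, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 2, kernel_size=3, padding=1),   # (dx, dy) per pixel
        )

    def forward(self, feat_i, feat_next):        # optical flow features of adjacent frames
        return self.upsample(feat_next - feat_i)

# usage: flow_map = FlowEstimationHead()(torch.randn(1, 512, 7, 7), torch.randn(1, 512, 7, 7))
```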
It should be noted that, while the present invention makes use of commonly used loss functions, it also proposes a new optical flow loss, namely a translation-invariant optical flow loss (Contrastive Flow Loss). Fig. 7 schematically illustrates the method of calculating this translation-invariant optical flow loss according to an embodiment of the present invention. In this embodiment, the relative position change of the face between adjacent frames is first obtained by introducing additional convolution kernels, specifically 8 contrastive convolution kernels; then, after translational alignment according to this relative position change, the mean square error between the optical flow estimation map and the optical flow view is calculated as the optical flow loss, as shown in the formula:
$$L_{CDL} = \sum_{i=1}^{8} \big\| K_i^{contrast} * D_P - K_i^{contrast} * D_G \big\|_2^2$$

where $L_{CDL}$ is the translation-invariant optical flow loss, i.e. the optical flow loss $Loss_{flow}$ of this embodiment; $K_i^{contrast}$ denotes the i-th contrastive convolution kernel and $*$ denotes convolution; $D_P$ denotes the predicted optical flow (Predicted Flow), i.e. the optical flow estimation map; and $D_G$ denotes the real optical flow (Ground-truth Flow), i.e. the optical flow view.
This way of computing the optical flow loss reduces the negative influence of overall head motion, which would otherwise introduce noise into the optical flow estimation, and effectively captures micro-expression changes along the time axis, improving the model's discriminative ability.
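A sketch of how a contrastive, translation-insensitive flow loss of this kind can be computed with eight fixed 3x3 contrastive convolution kernels (each comparing a pixel with one of its neighbours), so that a uniform head translation contributes little to the loss; the specific kernel design and per-channel reduction are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_flow_loss(pred_flow, gt_flow):
    """Compare local flow differences instead of absolute flow values, so that a global
    head translation (a constant flow offset) largely cancels out of the loss.

    pred_flow, gt_flow: (B, 2, H, W) optical flow estimation map and optical flow view.
    """
    # eight 3x3 contrastive kernels: +1 at one neighbour, -1 at the centre
    kernels = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            k = torch.zeros(1, 1, 3, 3)
            k[0, 0, 1, 1] = -1.0
            k[0, 0, 1 + dy, 1 + dx] = 1.0
            kernels.append(k)
    weight = torch.cat(kernels, dim=0)                        # (8, 1, 3, 3)

    loss = 0.0
    for c in range(pred_flow.size(1)):                        # apply per flow channel
        p = F.conv2d(pred_flow[:, c:c + 1], weight, padding=1)
        g = F.conv2d(gt_flow[:, c:c + 1], weight, padding=1)
        loss = loss + F.mse_loss(p, g)
    return loss
```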
It should be noted that the three-dimensional reconstruction projection loss $Loss_{pncc}$ is obtained by calculating the mean square error loss (MSE Loss) between the three-dimensional reconstruction projection estimation map and the three-dimensional reconstruction projection view. The mean square error, denoted MSE, is the expected value of the squared difference between the estimated value and the true value and measures how much the data vary; the smaller the mean square error, the more accurately the model describes the experimental data.
It should be noted that a class token mechanism is introduced into the time domain detection network, and classification is implemented by applying the linear classifier shown in the shaded part of fig. 5 to the encoded sample feature sequence output by the spatial-domain network.
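A minimal sketch of a class-token detection head of this kind; modelling the temporal aggregation as a small Transformer is an assumption (the patent only specifies a class token plus a linear classifier), and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TemporalDetector(nn.Module):
    """Time-domain detection with a class token and a linear classifier
    (cf. the shaded part of fig. 5)."""
    def __init__(self, feat_dim=512, n_heads=8, n_layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.classifier = nn.Linear(feat_dim, 2)           # live / spoof logits

    def forward(self, frame_tokens):                       # (B, N, feat_dim), one token per frame
        b = frame_tokens.size(0)
        seq = torch.cat([self.cls_token.expand(b, -1, -1), frame_tokens], dim=1)
        out = self.temporal(seq)
        return self.classifier(out[:, 0])                  # classify from the class token

# usage: logits = TemporalDetector()(torch.randn(4, 8, 512))   # -> (4, 2)
```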
Illustratively, the living body prediction loss $Loss_{ce}$ is obtained by calculating the cross-entropy loss (CE Loss) between the living body predicted value and the living body true value.
Note that the integrated loss $Loss_{all}$ is obtained by weighting the auxiliary losses and the living body prediction loss, i.e., in this embodiment, the three-dimensional reconstruction projection loss $Loss_{pncc}$, the optical flow loss $Loss_{flow}$ and the living body prediction loss $Loss_{ce}$, in certain proportions. With minimization of the integrated loss as the training target, the network parameters θ are updated by gradient back-propagation, iterating until the model converges.
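A sketch of one training update under these definitions; the weighting proportions are placeholders, since the patent does not disclose the actual values.

```python
def train_step(losses, optimizer, weights=(0.5, 0.5, 1.0)):
    """One update with the integrated loss as the training target.

    losses: (loss_pncc, loss_flow, loss_ce) scalar tensors;
    weights: illustrative proportions for the weighted combination.
    """
    loss_all = sum(w * l for w, l in zip(weights, losses))
    optimizer.zero_grad()
    loss_all.backward()        # gradients flow back to the network parameters theta
    optimizer.step()           # gradient-descent update
    return loss_all.detach()
```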
According to the embodiments of the present invention, the living body detection model is trained with minimization of the integrated loss as the training target, so that an accurate living body detection result can be obtained after sparse sampling of the detection images. This effectively improves the ability to defend against high-quality 3D attacks, greatly reduces the computing requirement, and gives the user a smoother, non-intrusive verification experience.
An embodiment of the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of model training and living body detection provided by the present invention as described above.
One embodiment of the present invention provides a computing device comprising a memory and a processor, the memory having executable code stored therein that when executed by the processor performs the method of model training and in vivo detection provided by the present invention described above.
It should be noted that the above-mentioned embodiments are merely examples of the present invention, and it is obvious that the present invention is not limited to the above-mentioned embodiments, and many similar variations are possible. All modifications attainable or obvious from the present disclosure set forth herein should be deemed to be within the scope of the present disclosure.

Claims (20)

1. A method of training a living body detection model, the model comprising at least a feature extraction backbone network, an auxiliary supervision network, and a time domain detection network, the method comprising:
acquiring a face auxiliary view corresponding to each frame in an input frame sample set and a living body true value corresponding to the input frame sample set according to the input frame sample set;
extracting a sample feature sequence of the input frame training set through the feature extraction backbone network;
inputting the sample feature sequence into the auxiliary supervision network, and calculating an auxiliary loss according to the output auxiliary estimation map and the face auxiliary view;
inputting the sample characteristic sequence into the time domain detection network, and calculating the living body prediction loss according to the output living body prediction value and living body true value;
and determining comprehensive loss according to the auxiliary loss and the living body predicted loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
2. The method for training a living body detection model according to claim 1, wherein the auxiliary supervision network is a three-dimensional reconstruction projection estimation network, the face auxiliary view is a face three-dimensional reconstruction projection view, and after the feature extraction backbone network extracts the sample feature sequence of the input frame training set, the method comprises:
encoding the sample feature sequence, inputting each frame of the encoded sample feature sequence into the three-dimensional reconstruction projection estimation network, and calculating three-dimensional reconstruction projection loss according to the output three-dimensional reconstruction projection estimation graph and the three-dimensional reconstruction projection view;
inputting the coded sample characteristic sequence into the time domain detection network, and calculating living body prediction loss according to the output living body prediction value and living body true value;
and determining comprehensive loss according to the three-dimensional reconstruction projection loss and the living body prediction loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
3. The method for training a living body detection model according to claim 2, wherein the step of obtaining the three-dimensional reconstruction projection view of the face corresponding to each frame in the sample set comprises the steps of:
for each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network;
And calculating the orthogonal projection view of the three-dimensional reconstruction information to a camera plane as the three-dimensional reconstruction projection view.
4. The living body detection model training method according to claim 2, wherein calculating the three-dimensional reconstruction projection loss from the output three-dimensional reconstruction projection estimation map and the three-dimensional reconstruction projection view comprises:
and calculating the mean square error loss of the three-dimensional reconstruction projection estimated graph and the three-dimensional reconstruction projection view as the three-dimensional reconstruction projection loss.
5. The in-vivo detection model training method of claim 1, the auxiliary supervisory network being an optical flow estimation network, the face auxiliary view being a face optical flow view, the method comprising, after extracting a sample feature sequence of the input frame training set through the feature extraction backbone network:
inputting adjacent frames of the sample feature sequence into the optical flow estimation network, and calculating optical flow loss according to the output optical flow estimation graph and the optical flow view;
encoding the sample characteristic sequence, inputting the encoded sample characteristic sequence into the time domain detection network, and calculating the living body prediction loss according to the output living body prediction value and living body true value;
And determining comprehensive loss according to the optical flow loss and the living body predicted loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
6. The in vivo detection model training method of claim 5, said obtaining a face optical flow view in said sample set comprising:
aiming at the adjacent frames in the sample set, obtaining optical flow estimation information of the whole scene through an optical flow estimation algorithm;
for each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network;
and taking the orthogonal projection view of the three-dimensional reconstruction information to the camera plane as a three-dimensional reconstruction projection, taking the three-dimensional reconstruction projection as a mask, and filtering to obtain the face optical flow view.
7. The in-vivo detection model training method of claim 5, wherein inputting adjacent frames of the sample feature sequence into the optical flow estimation network to output an optical flow estimation map comprises:
inputting adjacent frames in the sample feature sequence into the optical flow estimation network to respectively obtain optical flow features;
and inputting the difference value of the optical flow characteristics of the adjacent frames into a convolution up-sampling module to obtain the optical flow estimation graph.
8. The living body detection model training method according to claim 5, wherein the calculating of the optical flow loss from the optical flow estimation map and the optical flow view comprises:
acquiring the relative position change of the face between adjacent frames;
and after translational alignment according to the relative position change, calculating the mean square error of the optical flow estimation graph and the optical flow view as the optical flow loss.
9. The living body detection model training method according to claim 1, wherein calculating the living body prediction loss from the output living body predicted value and living body true value comprises:
and calculating the cross entropy loss of the living body predicted value and the living body true value as the living body predicted loss.
10. A living body detection method, comprising:
collecting a detection image, and acquiring an input frame sequence from the detection image;
inputting the input frame sequence into a feature extraction backbone network in a pre-trained living body detection model, and obtaining a feature sequence output by the feature extraction backbone network;
encoding the characteristic sequence, inputting the encoded characteristic sequence into a time domain detection network in the living body detection model, and obtaining a living body detection result output by the time domain detection network, wherein the living body detection model is trained by adopting the method as set forth in any one of claims 1 to 9.
11. The living body detection method according to claim 10, acquiring an input frame sequence from the detection image, comprising:
and performing sparse sampling on the detection image to obtain a preset number of sampling frames serving as the input frame sequence.
12. A living body detection training device comprises an input module, a space domain module, a time domain module and a training module:
the input module obtains, according to an input frame sample set, a face auxiliary view corresponding to each frame in the sample set and a living body true value corresponding to the input frame sample set;
the airspace module extracts a sample feature sequence of the input frame training set through the feature extraction backbone network, inputs the sample feature sequence into an auxiliary supervision network, and calculates auxiliary loss according to the output auxiliary estimation graph and the face auxiliary view;
the time domain module inputs the coded sample feature sequence into a time domain detection network, and calculates living body prediction loss according to the output living body prediction value and living body true value;
and the training module determines comprehensive loss according to the auxiliary loss and the living body predicted loss, and trains the living body detection model by taking the minimum comprehensive loss as a training target.
13. The living body detection training device according to claim 12, wherein the auxiliary supervisory network is a three-dimensional reconstruction projection estimation network, and the face auxiliary view is a face three-dimensional reconstruction projection view:
the airspace module encodes the sample feature sequence, each frame of the encoded sample feature sequence is input into the three-dimensional reconstruction projection estimation network, and three-dimensional reconstruction projection loss is calculated according to the output three-dimensional reconstruction projection estimation diagram and the three-dimensional reconstruction projection view;
and the training module determines comprehensive loss according to the three-dimensional reconstruction projection loss and the living body prediction loss, and trains the living body detection model by taking the minimum comprehensive loss as a training target.
14. The living body detection training device according to claim 13,
the acquisition module obtains three-dimensional reconstruction information of a face area through a three-dimensional reconstruction network aiming at each frame in the sample set, and calculates an orthogonal projection view of the three-dimensional reconstruction information to a camera plane as the three-dimensional reconstruction projection view of the face.
15. The in-vivo detection training apparatus of claim 12, the auxiliary supervisory network being an optical flow estimation network, the face auxiliary view being a face optical flow view:
The airspace module inputs adjacent frames of the sample feature sequence into the optical flow estimation network, and calculates optical flow loss according to the output optical flow estimation diagram and the optical flow view;
and the training module determines comprehensive loss according to the optical flow loss and the living body predicted loss, and trains the living body detection model by taking the minimum comprehensive loss as a training target.
16. The living body detection training device according to claim 15,
the acquisition module is used for acquiring optical flow estimation information of the whole scene through an optical flow estimation algorithm according to adjacent frames in the sample set;
for each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network; and taking the orthogonal projection view of the three-dimensional reconstruction information to the camera plane as a three-dimensional reconstruction projection, taking the three-dimensional reconstruction projection as a mask, and filtering to obtain the face optical flow view.
17. The living body detection training device according to claim 15,
the airspace module is also used for inputting adjacent frames in the sample feature sequence into the optical flow estimation network to respectively obtain optical flow features; and inputting the difference value of the optical flow characteristics of the adjacent frames into a convolution up-sampling module to obtain the optical flow estimation graph.
18. The living body detection training device according to claim 15,
the airspace module is also used for acquiring the relative position change of the face between adjacent frames; and after translational alignment according to the relative position change, calculating the mean square error of the optical flow estimation graph and the optical flow view as the optical flow loss.
19. A living body detection apparatus comprising:
the acquisition module is used for acquiring detection images and acquiring an input frame sequence from the detection images;
the feature module inputs the input frame sequence into a feature extraction backbone network in a pre-trained living body detection model, and obtains a feature sequence output by the feature extraction backbone network;
the detection module encodes the feature sequence, and inputs the encoded feature sequence into a time domain detection network in the living body detection model to obtain a living body detection result output by the time domain detection network, wherein the living body detection model is trained by the method according to any one of claims 1 to 9.
20. The living body detection device according to claim 19,
the acquisition module performs sparse sampling on the acquired living body detection target image to obtain a preset number of sampling frames serving as the input frame sequence.
CN202211610853.8A 2022-12-14 2022-12-14 Living body detection model training and living body detection method and device Pending CN116246314A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211610853.8A CN116246314A (en) 2022-12-14 2022-12-14 Living body detection model training and living body detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211610853.8A CN116246314A (en) 2022-12-14 2022-12-14 Living body detection model training and living body detection method and device

Publications (1)

Publication Number Publication Date
CN116246314A true CN116246314A (en) 2023-06-09

Family

ID=86628514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211610853.8A Pending CN116246314A (en) 2022-12-14 2022-12-14 Living body detection model training and living body detection method and device

Country Status (1)

Country Link
CN (1) CN116246314A (en)

Similar Documents

Publication Publication Date Title
CN113673307B (en) Lightweight video action recognition method
CN109064507B (en) Multi-motion-stream deep convolution network model method for video prediction
CN108520503B (en) Face defect image restoration method based on self-encoder and generation countermeasure network
Zhao et al. Learning to forecast and refine residual motion for image-to-video generation
CN110689599B (en) 3D visual saliency prediction method based on non-local enhancement generation countermeasure network
KR101396618B1 (en) Extracting a moving object boundary
CN112507990A (en) Video time-space feature learning and extracting method, device, equipment and storage medium
CN114170286B (en) Monocular depth estimation method based on unsupervised deep learning
CN115914505B (en) Video generation method and system based on voice-driven digital human model
CN114463218B (en) Video deblurring method based on event data driving
CN112270691B (en) Monocular video structure and motion prediction method based on dynamic filter network
CN112541865A (en) Underwater image enhancement method based on generation countermeasure network
CN110246171B (en) Real-time monocular video depth estimation method
CN111901532A (en) Video stabilization method based on recurrent neural network iteration strategy
Aakerberg et al. Semantic segmentation guided real-world super-resolution
CN114842542B (en) Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN114049434A (en) 3D modeling method and system based on full convolution neural network
CN116704585A (en) Face recognition method based on quality perception
CN114283301A (en) Self-adaptive medical image classification method and system based on Transformer
Zhu et al. Micro-expression recognition convolutional network based on dual-stream temporal-domain information interaction
CN116246314A (en) Living body detection model training and living body detection method and device
CN114663315B (en) Image bit enhancement method and device for generating countermeasure network based on semantic fusion
CN115424337A (en) Iris image restoration system based on priori guidance
CN114943746A (en) Motion migration method utilizing depth information assistance and contour enhancement loss
Farooq et al. Generating thermal image data samples using 3D facial modelling techniques and deep learning methodologies

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination