CN116246314A - Living body detection model training and living body detection method and device - Google Patents

Living body detection model training and living body detection method and device

Info

Publication number
CN116246314A
Authority
CN
China
Prior art keywords
living body
loss
optical flow
network
dimensional reconstruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211610853.8A
Other languages
Chinese (zh)
Inventor
杨博文
李建树
刘健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202211610853.8A priority Critical patent/CN116246314A/en
Publication of CN116246314A publication Critical patent/CN116246314A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40Spoof detection, e.g. liveness detection
    • G06V40/45Detection of the body part being alive
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/70Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in livestock or poultry

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for living body detection model training and living body detection, which comprises the following steps: acquiring, according to an input frame sample set, a face auxiliary view corresponding to each frame in the input frame sample set and a living body true value corresponding to the input frame sample set; extracting a sample feature sequence of the input frame sample set through a feature extraction backbone network; inputting the sample feature sequence into an auxiliary supervision network, and calculating an auxiliary loss according to the output auxiliary estimation graph and the face auxiliary view; inputting the sample feature sequence into a time domain detection network, and calculating a living body prediction loss according to the output living body prediction value and the living body true value; and determining a comprehensive loss according to the auxiliary loss and the living body prediction loss, and training the living body detection model with minimizing the comprehensive loss as the training target. Accordingly, the invention also discloses devices for living body detection model training and living body detection.

Description

Living body detection model training and living body detection method and device
Technical Field
The invention relates to a living body detection technology, in particular to a method and a device for training a living body detection model and detecting living bodies.
Background
Face recognition technology is widely applied in everyday scenarios such as payment and access control. With the development of 3D head model and facial mask manufacturing processes, their appearance has become more and more lifelike, and living body detection in existing face recognition systems is increasingly likely to be defeated.
There are several problems with existing living body detection algorithms: a living body detection algorithm that takes a single frame as input can only defend against low-quality 2D attacks and cannot make an accurate judgment on a high-quality 3D attack; a living body detection algorithm that takes dense frames as input needs to compute every frame, and the computational cost is so high that deployment on a low-compute platform (such as a mobile phone) is difficult; interactive living body detection requires the user to perform specified actions, the verification process is cumbersome, and it cannot defend against a mechanically driven movable head model or a prerecorded high-definition video.
Therefore, a living body detection scheme that can accurately defend against high-quality 3D attacks at reasonable computational cost is needed to solve the above problems.
Disclosure of Invention
The embodiments of the invention provide a method and a device for living body detection model training and living body detection, which are used to effectively defend against high-quality 3D attacks while reducing computational overhead, so as to provide the user with a non-perceptive (silent) liveness check experience.
The embodiment of the invention adopts the following technical scheme:
the invention provides a living body detection model training method, wherein the living body detection model at least comprises a feature extraction backbone network, an auxiliary supervision network and a time domain detection network, and the living body detection model training method comprises the following steps:
acquiring a face auxiliary view corresponding to each frame in an input frame sample set and a living body true value corresponding to the input frame sample set according to the input frame sample set;
extracting a sample feature sequence of the input frame training set through the feature extraction backbone network;
inputting the sample feature sequence into an auxiliary supervision network, and calculating auxiliary loss according to the output auxiliary estimation graph and the auxiliary face view;
inputting the sample characteristic sequence into the time domain detection network, and calculating the living body prediction loss according to the output living body prediction value and living body true value;
and determining comprehensive loss according to the auxiliary loss and the living body predicted loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
According to the invention, the living body detection model is trained with minimizing the comprehensive loss as the training target, so that after sparse sampling of the detection image, an accurate living body detection result for the corresponding image can be obtained. This effectively improves the capability of defending against high-quality 3D attacks, reduces the computational cost, and gives the user a better non-perceptive liveness check experience.
Further, the auxiliary supervision network is a three-dimensional reconstruction projection estimation network, the face auxiliary view is a face three-dimensional reconstruction projection view, and after the feature extraction backbone network extracts the sample feature sequence of the input frame training set, the method comprises the steps of:
encoding the sample feature sequence, inputting each frame of the encoded sample feature sequence into the three-dimensional reconstruction projection estimation network, and calculating three-dimensional reconstruction projection loss according to the output three-dimensional reconstruction projection estimation graph and the three-dimensional reconstruction projection view;
inputting the coded sample characteristic sequence into the time domain detection network, and calculating living body prediction loss according to the output living body prediction value and living body true value;
and determining comprehensive loss according to the three-dimensional reconstruction projection loss and the living body prediction loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
Further, the obtaining the three-dimensional reconstruction projection view of the face corresponding to each frame in the sample set includes:
for each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network;
and calculating the orthogonal projection view of the three-dimensional reconstruction information to a camera plane as the three-dimensional reconstruction projection view.
Further, calculating a three-dimensional reconstruction projection loss according to the output three-dimensional reconstruction projection estimation diagram and the three-dimensional reconstruction projection view, including:
and calculating the mean square error loss of the three-dimensional reconstruction projection estimated graph and the three-dimensional reconstruction projection view as the three-dimensional reconstruction projection loss.
Further, the auxiliary supervision network is an optical flow estimation network, the face auxiliary view is a face optical flow view, and after the feature extraction backbone network extracts the sample feature sequence of the input frame training set, the method includes:
inputting adjacent frames of the sample feature sequence into the optical flow estimation network, and calculating optical flow loss according to the output optical flow estimation graph and the optical flow view;
encoding the sample characteristic sequence, inputting the encoded sample characteristic sequence into the time domain detection network, and calculating the living body prediction loss according to the output living body prediction value and living body true value;
and determining comprehensive loss according to the optical flow loss and the living body predicted loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
Further, the obtaining the face optical flow view in the sample set includes:
Aiming at the adjacent frames in the sample set, obtaining optical flow estimation information of the whole scene through an optical flow estimation algorithm;
for each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network;
and taking the orthogonal projection view of the three-dimensional reconstruction information to the camera plane as a three-dimensional reconstruction projection, taking the three-dimensional reconstruction projection as a mask, and filtering to obtain the face optical flow view.
Further, obtaining the optical flow estimation graph output by the optical flow estimation network for adjacent frames in the sample feature sequence includes:
inputting adjacent frames in the sample feature sequence into the optical flow estimation network to respectively obtain optical flow features;
and inputting the difference value of the optical flow characteristics of the adjacent frames into a convolution up-sampling module to obtain the optical flow estimation graph.
Further, the calculating optical flow loss from the optical flow estimation map and the optical flow view includes:
acquiring the relative position change of the face between adjacent frames;
and after translational alignment according to the relative position change, calculating the mean square error of the optical flow estimation graph and the optical flow view as the optical flow loss.
Further, calculating a living body prediction loss from the output living body prediction value and the living body true value includes:
And calculating the cross entropy loss of the living body predicted value and the living body true value as the living body predicted loss.
The invention provides a living body detection method, which comprises the following steps:
collecting a detection image, and acquiring an input frame sequence from the detection image;
inputting the input frame sequence into a feature extraction backbone network in a pre-trained living body detection model, and obtaining a feature sequence output by the feature extraction backbone network;
and encoding the characteristic sequence, inputting the encoded characteristic sequence into a time domain detection network in the living body detection model, and obtaining a living body detection result output by the time domain detection network, wherein the living body detection model is obtained by training by adopting the living body detection model training method.
Further, obtaining an input frame sequence from the detected image includes:
and performing sparse sampling on the detection image to obtain a preset number of sampling frames serving as the input frame sequence.
The invention provides a living body detection training device, which comprises an input module, a spatial domain module, a time domain module and a training module:
the input module obtains, according to an input frame sample set, a face auxiliary view corresponding to each frame in the sample set and a living body true value corresponding to the input frame sample set;
the spatial domain module extracts a sample feature sequence of the input frame training set through the feature extraction backbone network, inputs the sample feature sequence into an auxiliary supervision network, and calculates auxiliary loss according to the output auxiliary estimation graph and the face auxiliary view;
the time domain module inputs the coded sample feature sequence into a time domain detection network, and calculates living body prediction loss according to the output living body prediction value and living body true value;
and the training module determines comprehensive loss according to the auxiliary loss and the living body predicted loss, and trains the living body detection model by taking the minimum comprehensive loss as a training target.
Further, the auxiliary supervision network is a three-dimensional reconstruction projection estimation network, and the face auxiliary view is a face three-dimensional reconstruction projection view:
the spatial domain module encodes the sample feature sequence, inputs each frame of the encoded sample feature sequence into the three-dimensional reconstruction projection estimation network, and calculates three-dimensional reconstruction projection loss according to the output three-dimensional reconstruction projection estimation diagram and the three-dimensional reconstruction projection view;
and the training module determines comprehensive loss according to the three-dimensional reconstruction projection loss and the living body prediction loss, and trains the living body detection model by taking the minimum comprehensive loss as a training target.
Further, the acquisition module obtains three-dimensional reconstruction information of a face area through a three-dimensional reconstruction network for each frame in the sample set, and calculates an orthogonal projection view from the three-dimensional reconstruction information to a camera plane as the three-dimensional reconstruction projection view of the face.
Further, the auxiliary supervision network is an optical flow estimation network, and the face auxiliary view is a face optical flow view:
the spatial domain module inputs adjacent frames of the sample feature sequence into the optical flow estimation network, and calculates optical flow loss according to the output optical flow estimation diagram and the optical flow view;
and the training module determines comprehensive loss according to the optical flow loss and the living body predicted loss, and trains the living body detection model by taking the minimum comprehensive loss as a training target.
Further, the acquisition module obtains optical flow estimation information of the whole scene through an optical flow estimation algorithm according to adjacent frames in the sample set;
for each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network; and taking the orthogonal projection view of the three-dimensional reconstruction information to the camera plane as a three-dimensional reconstruction projection, taking the three-dimensional reconstruction projection as a mask, and filtering to obtain the face optical flow view.
Further, the spatial domain module is further configured to input adjacent frames in the sample feature sequence into the optical flow estimation network to obtain optical flow features respectively, and to input the difference value of the optical flow features of the adjacent frames into a convolution up-sampling module to obtain the optical flow estimation graph.
Further, the spatial domain module is further used for acquiring the relative position change of the face between adjacent frames and, after translational alignment according to the relative position change, calculating the mean square error of the optical flow estimation graph and the optical flow view as the optical flow loss.
The present invention provides a living body detection apparatus, which includes:
the acquisition module is used for acquiring detection images and acquiring an input frame sequence from the detection images;
the feature module inputs the input frame sequence into a feature extraction backbone network in a pre-trained living body detection model, and obtains a feature sequence output by the feature extraction backbone network;
the detection module encodes the characteristic sequence, and inputs the encoded characteristic sequence into a time domain detection network in the living body detection model to obtain a living body detection result output by the time domain detection network, wherein the living body detection model is trained by adopting the living body detection model training method.
Further, the acquisition module performs sparse sampling on the acquired living body detection target image to obtain a preset number of sampling frames, and the preset number of sampling frames are used as the input frame sequence.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the steps in the model training and living body detection methods described above.
The present invention also provides a computing device comprising a memory and a processor, the memory having executable code stored therein which, when executed by the processor, performs the steps in the model training and living body detection methods described above.
The at least one technical scheme adopted by the embodiment of the invention has the following beneficial effects:
according to the embodiments of the invention, the living body detection model is trained with minimizing the comprehensive loss as the training target, so that after sparse sampling of the detection image, an accurate living body detection result for the corresponding image can be obtained. This effectively improves the capability of defending against high-quality 3D attacks, reduces the computational cost, and provides the user with a better non-perceptive liveness check experience.
Drawings
FIG. 1 schematically illustrates a flow chart of a living body detection model training method according to the present invention in one embodiment.
Fig. 2 schematically shows a flow chart of a living body detection method according to the present invention in one embodiment.
Fig. 3 schematically shows a structure of the living body detection training device according to the present invention in one embodiment.
Fig. 4 schematically shows a structure of the living body detecting apparatus according to the present invention in one embodiment.
Fig. 5 schematically shows a network structure of the living body detection model training method according to the present invention in one embodiment.
FIG. 6 schematically shows a flow chart of the data preparation stage of the living body detection model training method according to the present invention in one embodiment.
FIG. 7 schematically illustrates a method of calculating the translation-invariant optical flow loss according to the present invention in one embodiment.
Detailed Description
The following describes the technical solution of living body detection model training and living body detection according to the present invention in further detail with reference to specific embodiments and the corresponding drawings, but the detailed description does not limit the present invention.
In an embodiment of the present invention, a method for training a living body detection model is provided, where the living body detection model at least includes a feature extraction backbone network, an auxiliary supervision network, and a time domain detection network, and fig. 1 schematically shows a flow chart of the living body detection model training method according to the present invention under an implementation mode, where the method includes:
100: and acquiring the face auxiliary view corresponding to each frame in the sample set and the living body true value corresponding to the input frame sample set according to the input frame sample set.
Illustratively, the input frame sample set is derived from continuously captured dynamic user facial information, mainly including inherent attributes such as facial features and dynamic attributes such as expressions and actions, and may be a video segment, or may be continuously captured image frames, including but not limited to two-dimensional images, depth images, and the like, which are transmitted as an input sequence into a living body detection model for living body detection model training.
The face auxiliary view is a view obtained by analyzing an inherent attribute of the face image with existing face analysis techniques. A specific attribute of the face image, such as edge information, illumination change, texture information, three-dimensional information or depth information, can be extracted and displayed in the view, so that the distribution and variation of that feature can be observed directly and targeted calculations and other operations can be performed on it.
In some embodiments, a three-dimensional reconstruction projection view or an optical flow view is selected as the face auxiliary view. The three-dimensional reconstruction projection view reflects the three-dimensional structure information of the face image and represents depth information by color; the optical flow view shows the motion of pixel points in the image over time. In this way, single-frame information mining and multi-frame dynamic changes can both be used as liveness cues, and facial-region linkage information and micro-expression change information over time can be captured during feature extraction.
Furthermore, the obtained sample set and the face auxiliary views can be subjected to data enhancement, so that the characteristics of the image are more obvious and the efficiency of subsequent model training is improved, for example by jointly rotating and cropping the three views and applying color and blur augmentation to the RGB view.
It should be noted that the living body detection model belongs to supervised learning in machine learning, which means that an input sample set and a sample label are known, the sample set is input into the model to obtain a corresponding output result, model parameters are adjusted according to a comparison result of the result and the corresponding sample label, and a training effect is optimized; and after the optimal model is obtained through known data training for a plurality of times, the model is applied to new data, so that an output result is obtained. In this example, the live truth value is the sample label, which is known prior to training, and the value is expressed as "live" or "not live".
110: and extracting sample feature sequences of the input frame training set through a feature extraction backbone network.
In some embodiments, a convolutional feature extraction backbone network is employed to extract sample feature sequences of the input frame training set. In some embodiments, feature extraction is performed using a ResNet network. In some other embodiments, a network such as MobileNet, VGG may also be employed to extract sample feature sequences of the input frame training set.
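As an illustration of this step only, the following is a minimal sketch (not the patent's exact architecture) of applying a convolutional backbone frame by frame to a sparse input sequence; the ResNet-18 backbone, the 128-dimensional projection and all tensor shapes are assumptions chosen for the example.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FrameBackbone(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        base = resnet18()                        # any CNN backbone (ResNet/MobileNet/VGG) would do
        self.cnn = nn.Sequential(*list(base.children())[:-1])  # drop the classification head
        self.proj = nn.Linear(512, feat_dim)     # project to the feature dimension used downstream

    def forward(self, frames):                   # frames: (B, N, 3, H, W), N sparsely sampled frames
        b, n, c, h, w = frames.shape
        x = self.cnn(frames.view(b * n, c, h, w)).flatten(1)   # (B*N, 512)
        return self.proj(x).view(b, n, -1)                     # sample feature sequence (B, N, D)

feats = FrameBackbone()(torch.randn(2, 5, 3, 224, 224))        # e.g. 5 sampled frames per clip
print(feats.shape)                                             # torch.Size([2, 5, 128])
```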
120: and inputting the sample characteristic sequence into an auxiliary supervision network, and calculating auxiliary loss according to the output auxiliary estimation graph and the face auxiliary view.
In machine learning, in order to obtain a reference index capable of reflecting the model performance in multiple aspects, an auxiliary supervision network is introduced, the network obtains a corresponding loss value after processing certain auxiliary features of an input sample set, and all obtained loss is comprehensively evaluated on the model, so that the model performance can be more comprehensively known, and the model training efficiency is improved. In some embodiments, only one type of secondary supervisory network may be employed; in some other embodiments, multiple auxiliary supervisory networks may be used together as the case may be.
In some embodiments, the face auxiliary view comprises a three-dimensional reconstruction projection view or an optical flow view, and the auxiliary supervision network correspondingly comprises a three-dimensional reconstruction projection estimation network or an optical flow estimation network. The auxiliary supervision network outputs an auxiliary estimation graph corresponding to the face auxiliary view, from which the respective auxiliary losses are calculated. Loss is an index for judging the performance of a model during machine learning training. It is calculated with different loss functions, which operate on the predicted value and the true value to represent the difference between them; the smaller the loss, the better. Common loss functions include the classification error rate, the mean square error (MSE) and the cross entropy (CE).
130: and inputting the sample characteristic sequence into a time domain detection network, and calculating the living body prediction loss according to the output living body prediction value and the living body true value.
In the spatial domain detection network, spatial characteristic information of the face image, such as depth information, three-dimensional information and texture information, can be extracted; in the time domain detection network, the motion characteristics of the face area across the multi-frame images can be captured. By combining spatial-domain depth and temporal-domain motion into temporal-domain depth, the unique characteristics of each frame of image can be embodied, thereby better distinguishing living bodies from non-living bodies.
Specifically, a cross entropy loss of the living body predicted value and the living body true value is calculated as the living body predicted loss.
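A minimal sketch of this loss, assuming the time domain detection network emits two logits (live / not live) per sample; the tensors below are placeholders.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 2)                 # temporal network output for 4 samples (placeholder)
live_truth = torch.tensor([1, 0, 1, 1])    # 1 = live, 0 = not live (sample labels)
liveness_loss = F.cross_entropy(logits, live_truth)
print(liveness_loss.item())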
140: and determining comprehensive loss according to the auxiliary loss and the living body predicted loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
In some embodiments, the auxiliary loss and the living body predicted loss are weighted according to a certain proportion to obtain a comprehensive loss, and the network parameter theta is updated by taking the minimum comprehensive loss as a training target.
In some more specific embodiments, the network parameter θ is updated by gradient backpropagation (also known as the BP algorithm). Gradient backpropagation is a learning algorithm commonly used for neural networks. It is based on stochastic gradient descent (SGD) and the chain rule, and mainly comprises two stages, excitation propagation and weight updating; network parameters that meet the training target are obtained by repeatedly iterating these two stages. The excitation propagation stage is divided into forward propagation and backward propagation: in forward propagation, the training set is input into the network to obtain an excitation response; in backward propagation, the difference between the obtained excitation response and the target output corresponding to the training set, namely the response error, is computed. In the weight updating stage, for each weight in each layer of the network, the input excitation and the response error are multiplied to obtain the gradient of the weight; the gradient is multiplied by a learning-rate proportion, negated, and added to the current weight to obtain the updated weight. During backpropagation, the direction of the gradient indicates the direction in which the error grows, so each weight is updated along the direction of steepest gradient descent, and the error is propagated back to the previous layer after updating until all weights have been adjusted; the iteration continues until the model converges.
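The following is a hedged sketch of such a training step: a weighted comprehensive loss over the auxiliary loss and the living body prediction loss, minimized with SGD and backpropagation. The loss weights, the stand-in model and the learning rate are illustrative assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lambda_aux, lambda_live = 0.5, 1.0         # illustrative weights, not fixed by the patent

model = nn.Linear(8, 2)                    # stand-in for the full detection model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def training_step(aux_loss, liveness_loss):
    total = lambda_aux * aux_loss + lambda_live * liveness_loss  # comprehensive loss
    optimizer.zero_grad()
    total.backward()                       # gradient backpropagation through all sub-networks
    optimizer.step()                       # update the network parameters theta
    return total.detach()

out = model(torch.randn(4, 8))
liveness_loss = F.cross_entropy(out, torch.tensor([1, 0, 1, 1]))
aux_loss = out.pow(2).mean()               # placeholder for the auxiliary-network loss
print(training_step(aux_loss, liveness_loss))
```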
In some embodiments, the face auxiliary view is a face three-dimensional reconstruction projection view, and the face three-dimensional reconstruction projection view corresponding to each frame in the sample set is obtained, including the steps of:
for each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network;
and calculating the orthogonal projection view of the three-dimensional reconstruction information to the camera plane as a three-dimensional reconstruction projection view.
Three-dimensional reconstruction refers to an image-based modeling technique in the field of face modeling, namely restoring the three-dimensional structure from a plurality of two-dimensional images by extracting their feature information. In this embodiment, a 3DDFA network is used to obtain the three-dimensional reconstruction projection view of the face; this image is also referred to as a PNCC map and represents depth information with color changes. In some other embodiments, three-dimensional reconstruction networks such as FML or PRNet may also be employed to obtain the three-dimensional reconstruction projection view of the face.
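As an illustration only, the sketch below shows one way to form an orthogonal (orthographic) projection view from reconstructed face vertices, colouring each projected pixel by its normalized 3D coordinate in the spirit of a PNCC map. The random vertices, the image size and the depth-ordering convention are placeholder assumptions; a real pipeline would take the vertices from a reconstruction network such as 3DDFA.

```python
import numpy as np

def orthographic_projection_view(vertices, img_size=112):
    """vertices: (V, 3) reconstructed face points in camera coordinates (placeholder data below)."""
    vmin, vmax = vertices.min(0), vertices.max(0)
    ncc = (vertices - vmin) / (vmax - vmin + 1e-8)        # normalized coordinates -> RGB in [0, 1]
    # orthographic projection onto the camera plane: keep x, y; depth is encoded only in the colour
    px = np.clip((ncc[:, 0] * (img_size - 1)).astype(int), 0, img_size - 1)
    py = np.clip((ncc[:, 1] * (img_size - 1)).astype(int), 0, img_size - 1)
    view = np.zeros((img_size, img_size, 3), dtype=np.float32)
    order = np.argsort(vertices[:, 2])                    # far points first, near points last (z convention assumed)
    view[py[order], px[order]] = ncc[order]
    return view

proj = orthographic_projection_view(np.random.rand(5000, 3))
print(proj.shape)                                         # (112, 112, 3)
```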
In some more specific embodiments, the auxiliary supervisory network is a three-dimensional reconstruction projection estimation network, and after extracting a sample feature sequence of the input frame training set through the feature extraction backbone network, the method comprises:
encoding the sample feature sequence with a spatial domain Transformer, inputting each frame of the encoded sample feature sequence into the three-dimensional reconstruction projection estimation network, and calculating three-dimensional reconstruction projection loss according to the output three-dimensional reconstruction projection estimation graph and the three-dimensional reconstruction projection view;
Inputting the coded sample characteristic sequence into a time domain detection network, and calculating the living body prediction loss according to the output living body prediction value and the living body true value;
and determining comprehensive loss according to the three-dimensional reconstruction projection loss and the living body prediction loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
In some embodiments, the three-dimensional reconstructed projection loss is obtained by calculating a mean square error loss of the three-dimensional reconstructed projection estimate map and the three-dimensional reconstructed projection view. The mean square error is an expected value of the square of the difference between the estimated value and the true value, and is denoted as MSE, and the degree of change of the data can be evaluated. The smaller the value of the mean square error, the better the accuracy of the model description experimental data.
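A one-line illustration of this loss; the tensor shapes are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

proj_est = torch.rand(4, 3, 112, 112)   # estimated projection maps (placeholder)
proj_gt = torch.rand(4, 3, 112, 112)    # ground-truth projection views (placeholder)
reconstruction_proj_loss = F.mse_loss(proj_est, proj_gt)
```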
The comprehensive loss is obtained by weighting the three-dimensional reconstruction projection loss and the living body prediction loss in a certain proportion; with minimizing the comprehensive loss as the training target, the network parameter θ is updated by gradient backpropagation, and the iteration continues until the model converges.
In other embodiments, the face auxiliary view is a face optical flow view, and the obtaining a face optical flow view in the sample set includes the steps of:
aiming at adjacent frames in the sample set, obtaining optical flow estimation information of the whole scene through an optical flow estimation algorithm;
For each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network;
and taking the orthogonal projection view of the three-dimensional reconstruction information to the camera plane as a three-dimensional reconstruction projection, taking the three-dimensional reconstruction projection as a mask, and filtering to obtain a face optical flow view.
Optical flow refers to the apparent motion speed of patterns in an image over time, that is, the amount of motion of the same object pixel point from one frame of a video image to the next, represented by a two-dimensional vector. In this embodiment, the adopted optical flow estimation algorithm is RIFE; its input is two adjacent camera view frames in the sample set, the optical flows of the two images are generated and aligned by this method, and the optical flow of each pixel point in the corresponding image is output. Then, using the obtained three-dimensional reconstruction projection as a mask, the optical flow view is obtained by filtering. In some other embodiments, optical flow estimation algorithms such as FlowNet or FlowNet2 may also be employed.
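A hedged sketch of the masking step: the three-dimensional reconstruction projection acts as a face mask over the full-scene flow. Both arrays here are placeholders; in the described pipeline they would come from the optical flow algorithm (RIFE, FlowNet, ...) and from the projection step above.

```python
import numpy as np

def face_flow_view(scene_flow, projection_view):
    """scene_flow: (H, W, 2) per-pixel flow; projection_view: (H, W, 3) projection map."""
    face_mask = (projection_view.sum(axis=-1) > 0).astype(scene_flow.dtype)  # 1 where a face vertex projects
    return scene_flow * face_mask[..., None]                                 # zero the flow elsewhere

flow_view = face_flow_view(np.random.randn(112, 112, 2), np.random.rand(112, 112, 3))
```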
In some more specific embodiments, the auxiliary supervisory network is an optical flow estimation network, and after extracting a sample feature sequence of the input frame training set through the feature extraction backbone network, the method comprises:
inputting adjacent frames of the sample feature sequence into an optical flow estimation network, and calculating optical flow loss according to the output optical flow estimation graph and the optical flow view;
Encoding the sample characteristic sequence, inputting the encoded sample characteristic sequence into a time domain detection network, and calculating the living body prediction loss according to the output living body prediction value and living body true value;
and determining comprehensive loss according to the optical flow loss and the living body prediction loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
In some embodiments, outputting an optical flow estimation map from an optical flow estimation network for adjacent frames in a sequence of sample features includes the steps of:
inputting adjacent frames in the sample feature sequence into an optical flow estimation network to respectively obtain optical flow features;
and inputting the difference value of the optical flow characteristics of the adjacent frames into a convolution up-sampling module to obtain an optical flow estimation graph.
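A minimal sketch of such a head, under the assumption that the difference of the two frames' flow features is lifted to a small grid and upsampled by transposed convolutions to a two-channel flow estimate; all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class FlowHead(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 64 * 7 * 7)           # lift the feature vector to a small grid
        self.upsample = nn.Sequential(                      # convolutional up-sampling module
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 2, 4, stride=2, padding=1),  # 2-channel (dx, dy) flow map
        )

    def forward(self, feat_t, feat_t1):                     # optical flow features of two adjacent frames
        diff = feat_t1 - feat_t                             # motion cue: feature difference
        x = self.fc(diff).view(-1, 64, 7, 7)
        return self.upsample(x)                             # (B, 2, 56, 56) flow estimation map

flow_map = FlowHead()(torch.randn(2, 128), torch.randn(2, 128))
print(flow_map.shape)                                       # torch.Size([2, 2, 56, 56])
```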
In addition to common loss functions, the present invention also proposes an optical flow loss, namely a translation-invariant optical flow loss. In some embodiments, first, the relative position change of the face between adjacent frames is obtained by introducing an additional convolution kernel; then, after translational alignment according to this relative position change, the mean square error of the optical flow estimation graph and the optical flow view is calculated as the optical flow loss. This way of calculating the optical flow loss reduces the negative influence of overall head motion, i.e. the noise it introduces into the optical flow estimation, can effectively reflect micro-expression change information over time, and improves the ability to evaluate the model's effect.
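The sketch below shows one possible reading of this translation-invariant loss. The patent obtains the relative face position change with an additional convolution kernel and aligns the maps before taking the mean square error; here that global component is approximated by each map's mean flow, which is only an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def translation_invariant_flow_loss(flow_pred, flow_gt):
    """flow_pred, flow_gt: (B, 2, H, W) optical flow maps."""
    # subtract each map's mean flow so that a constant offset caused by overall
    # head translation between the frames does not dominate the error
    pred = flow_pred - flow_pred.mean(dim=(2, 3), keepdim=True)
    gt = flow_gt - flow_gt.mean(dim=(2, 3), keepdim=True)
    return F.mse_loss(pred, gt)

loss = translation_invariant_flow_loss(torch.randn(2, 2, 56, 56), torch.randn(2, 2, 56, 56))
```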
The comprehensive loss is obtained by weighting the optical flow loss and the living body prediction loss in a certain proportion; with minimizing the comprehensive loss as the training target, the network parameter θ is updated by gradient backpropagation, and the iteration continues until the model converges.
In some embodiments, a living body detection method is provided, and fig. 2 schematically shows a flow chart of the living body detection method according to the present invention in an embodiment, including:
200: collecting a detection image, and acquiring an input frame sequence from the detection image;
illustratively, using a camera as the image acquisition device, the input frames are derived from continuously captured dynamic user face information, which mainly includes inherent attributes such as facial features and dynamic attributes such as expressions and actions. The input may be a video segment or continuously captured image frames, including but not limited to two-dimensional images and depth images, which are transmitted as an input sequence into the living body detection model for living body detection.
In some embodiments, the video input sequence is sparsely sampled, i.e. a preset number of sampling frames, for example 2-10 frames, are obtained as the input frame sequence of this embodiment. In the living body detection model of this embodiment, the input can take the form of a frame sequence, an input method that is a compromise between single-frame images and dense-frame video: it can capture multi-frame dynamic information while meeting limited computing power requirements. With the development of 3D head model and facial mask manufacturing processes, their appearance is increasingly realistic, and living body detection in existing face recognition systems is increasingly likely to be defeated. Compared with existing living body detection algorithms, the introduction of sparse sampling can effectively defend against high-quality 3D head model and mask attacks; it extends single-frame silent living body detection, which inputs only one picture and can only defend against low-quality 2D attacks, and can make more reliable judgments by combining multi-frame information. Meanwhile, compared with dense-frame video living body detection, which needs to take every frame of a video sequence as input and estimates the subtle facial blood vessel color changes caused by the heartbeat through remote photoplethysmography, this approach greatly reduces the computing power requirement and relieves computational pressure, so that the product can be deployed in more low-compute scenarios. In addition, the sparse sampling method can omit the step in interactive living body detection of giving instructions and waiting for user feedback, realizing a non-perceptive silent check and relatively shortening the verification time, and it makes up for the deficiency that interactive living body detection cannot defend against a mechanically driven movable head model or a prerecorded high-definition video.
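As a small illustration of sparse sampling, the following picks a preset number of evenly spaced frame indices (5 here, within the 2-10 range mentioned above) from a longer capture; the clip length is a placeholder.

```python
import numpy as np

def sparse_sample(num_frames_total, num_samples=5):
    idx = np.linspace(0, num_frames_total - 1, num_samples)   # evenly spaced positions
    return np.round(idx).astype(int).tolist()

print(sparse_sample(150))   # e.g. [0, 37, 74, 112, 149] from a 150-frame clip
```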
210: inputting the input frame sequence into a feature extraction backbone network in a pre-trained living body detection model, and obtaining a feature sequence output by the feature extraction backbone network;
specifically, a convolutional feature extraction backbone network is employed to extract the feature sequence of the input frame sequence. In some embodiments, feature extraction is performed using a ResNet network. In some other embodiments, a network such as MobileNet or VGG may also be employed to extract the feature sequence of the input frame sequence.
220: and encoding the characteristic sequence, inputting the encoded characteristic sequence into a time domain detection network in a living body detection model, and obtaining a living body detection result output by the time domain detection network, wherein the living body detection model is trained by adopting the living body detection model training method.
Optionally, after the feature extraction backbone network extracts the feature sequence of the input frame sequence in step 210, the feature sequence is input into a spatial domain Transformer encoder in the pre-trained living body detection model for encoding, so as to obtain the encoded feature sequence. The encoded feature sequence is then directly input into the time domain detection network in the model to obtain the living body detection result.
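A hedged sketch of this inference path: backbone features of the sparse frame sequence are encoded by a spatial domain Transformer encoder and passed to a temporal head that aggregates over frames and outputs a live / not-live decision. The layer sizes, the mean pooling and the two-layer head are assumptions for illustration; the auxiliary supervision branches are not needed at inference time.

```python
import torch
import torch.nn as nn

class LivenessDetector(nn.Module):
    def __init__(self, feat_dim=128, n_heads=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.temporal_head = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))  # live / not-live logits

    def forward(self, feat_seq):                  # feat_seq: (B, N, D) backbone features
        encoded = self.spatial_encoder(feat_seq)  # spatial domain Transformer encoding
        pooled = encoded.mean(dim=1)              # aggregate over the N sampled frames
        return self.temporal_head(pooled)

logits = LivenessDetector()(torch.randn(1, 5, 128))
is_live = logits.argmax(dim=-1).item() == 1       # 1 = live under the label convention above
```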
The living body detection model is trained, as described in step 120, by combining the auxiliary loss and the living body prediction loss under the action of the auxiliary supervision network; more specifically, the training of the model is shaped by the three-dimensional reconstruction projection loss or the optical flow loss. Although the auxiliary supervision network is not needed during living body detection, detection with this model is still constrained by factors such as the three-dimensional reconstruction projection information or the optical flow information, so the detection result is more accurate and reliable.
Fig. 3 schematically shows a structural diagram of a living body detection training device according to an embodiment of the present invention, including an input module, a spatial module, a temporal module, and a training module:
300: and the input module acquires the face auxiliary view corresponding to each frame in the sample set and the living true value corresponding to the input frame sample set according to the input frame sample set.
Illustratively, the input frame sample set obtained by the input module is from continuously captured dynamic user face information, mainly including inherent attributes such as facial features and dynamic attributes such as expressions and actions, and can be a video or continuously captured image frames, including but not limited to two-dimensional images, depth images and the like, and is transmitted into a living body detection model as an input sequence for living body detection model training.
In some embodiments, the three-dimensional reconstructed projection view or the optical flow view is selected as the face auxiliary view in the input module. The three-dimensional reconstruction projection view reflects three-dimensional structure information of the face image and represents depth information by colors; the optical flow view displays the motion condition of the pixel points in the image under the time sequence change, so that single-frame information mining and multi-frame dynamic change can be simultaneously utilized as living clues, and facial region linkage information and micro expression change information under the time sequence can be captured in the feature extraction process.
Furthermore, the obtained sample set and the face auxiliary views can be subjected to data enhancement in the input module, so that the characteristics of the image are more obvious and the efficiency of subsequent model training is improved, for example by jointly rotating and cropping the three views and applying color and blur augmentation to the RGB view.
It should be noted that the living body detection model belongs to supervised learning in machine learning, which means that a sample set and a sample label entering an input module are known, the sample set is input into the model to obtain a corresponding output result, model parameters are adjusted according to a comparison result of the result and the corresponding sample label, and a training effect is optimized; and after the optimal model is obtained through known data training for a plurality of times, the model is applied to new data, so that an output result is obtained. In this example, the live truth value is the sample label, which is known prior to training, and the value is expressed as "live" or "not live".
310: the airspace module extracts a sample feature sequence of an input frame training set through a feature extraction backbone network, inputs the sample feature sequence into an auxiliary supervision network, and calculates auxiliary loss according to the output auxiliary estimation graph and the face auxiliary view.
In some embodiments, the spatial module employs a convolutional feature extraction backbone network to extract sample feature sequences of the input frame training set. In some embodiments, feature extraction is performed using a ResNet network. In some other embodiments, a network such as MobileNet, VGG may also be employed to extract sample feature sequences of the input frame training set.
In machine learning, in order to obtain reference indices that reflect the model performance in multiple aspects, an auxiliary supervision network is introduced, so that the model performance can be understood more comprehensively and the model training efficiency is improved. In some embodiments, only one type of auxiliary supervision network may be employed in the spatial domain module; in some other embodiments, multiple auxiliary supervision networks may be used together as the case requires.
In some embodiments, the face auxiliary view comprises a three-dimensional reconstruction projection view or an optical flow view, and the auxiliary supervision network correspondingly comprises a three-dimensional reconstruction projection estimation network or an optical flow estimation network. The spatial domain module outputs, through the auxiliary supervision network, an auxiliary estimation graph corresponding to the face auxiliary view, from which the respective auxiliary losses are calculated. Loss is an index for judging the performance of a model during machine learning training. It is calculated with different loss functions, which operate on the predicted value and the true value to represent the difference between them; the smaller the loss, the better. Common loss functions include the classification error rate, the mean square error (MSE) and the cross entropy (CE).
320: the time domain module inputs the coded sample characteristic sequence into a time domain detection network, and calculates the living body prediction loss according to the output living body prediction value and the living body true value.
In the spatial domain module, spatial characteristic information of the face image, such as depth information, three-dimensional information and texture information, can be extracted; in the time domain module, the motion characteristics of the face area across the multi-frame images can be captured. By combining spatial-domain depth and temporal-domain motion into temporal-domain depth, the unique characteristics of each frame of image can be embodied, thereby better distinguishing living bodies from non-living bodies.
Specifically, the time domain module calculates a cross entropy loss of the living body predicted value and the living body true value as the living body predicted loss.
330: the training module determines the comprehensive loss according to the auxiliary loss and the living body prediction loss, and trains the living body detection model by taking the minimum comprehensive loss as a training target.
In some embodiments, the training module weights the auxiliary loss and the living body predicted loss in a proportion to obtain a comprehensive loss, and updates the network parameter θ with the minimum comprehensive loss as the training target.
Fig. 4 schematically shows a structure of a living body detecting device according to the present invention in one embodiment, the device including:
400: and the acquisition module is used for acquiring the detection image and acquiring an input frame sequence from the detection image.
The acquisition module uses a camera as the image acquisition device; the input frame sample set comes from continuously captured dynamic user face information, which mainly includes inherent attributes such as facial features and dynamic attributes such as expressions and actions. The input may be a video segment or continuously captured image frames, including but not limited to two-dimensional images and depth images, which are transmitted as an input sequence into the living body detection model for living body detection.
In some embodiments, the acquisition module performs sparse sampling on the video input sequence, that is, it obtains a preset number of sampling frames, for example 2-10 frames, as the input frame sequence of this embodiment. In the living body detection model of this embodiment, the input can take the form of a frame sequence, an input method that is a compromise between single-frame images and dense-frame video: it can capture multi-frame dynamic information while meeting limited computing power requirements. With the development of 3D head model and facial mask manufacturing processes, their appearance is increasingly realistic, and living body detection in existing face recognition systems is increasingly likely to be defeated. Compared with existing living body detection algorithms, the introduction of sparse sampling can effectively defend against high-quality 3D head model and mask attacks; it extends single-frame silent living body detection, which inputs only one picture and can only defend against low-quality 2D attacks, and can make more reliable judgments by combining multi-frame information. Meanwhile, compared with dense-frame video living body detection, which needs to take every frame of a video sequence as input and estimates the subtle facial blood vessel color changes caused by the heartbeat through remote photoplethysmography, this approach greatly reduces the computing power requirement and relieves computational pressure, so that the product can be deployed in more low-compute scenarios. In addition, the sparse sampling method can omit the step in interactive living body detection of giving instructions and waiting for user feedback, realizing a non-perceptive silent check and relatively shortening the verification time, and it makes up for the deficiency that interactive living body detection cannot defend against a mechanically driven movable head model or a prerecorded high-definition video.
410: and the feature module inputs the input frame sequence into a feature extraction backbone network in a pre-trained living body detection model, and obtains a feature sequence output by the feature extraction backbone network.
In some embodiments, the feature module employs a convolutional feature extraction backbone network to extract the feature sequence of the input frame sequence. In some embodiments, feature extraction is performed using a ResNet network. In some other embodiments, a network such as MobileNet or VGG may also be employed.
420: the detection module encodes the feature sequence, and inputs the encoded feature sequence into a time domain detection network in a living body detection model to obtain a living body detection result output by the time domain detection network, wherein the living body detection model is trained by adopting the living body detection model training method.
Optionally, after the feature module extracts the feature sequence of the input frame sequence, the detection module inputs it into a spatial domain Transformer encoder in the pre-trained living body detection model for encoding, so as to obtain the encoded feature sequence. The encoded feature sequence is then directly input into the time domain detection network in the model to obtain the living body detection result.
The living body detection model in the detection module is trained, as described in step 120, by combining the auxiliary loss and the living body prediction loss under the action of the auxiliary supervision network; more specifically, the training of the model is shaped by the three-dimensional reconstruction projection loss or the optical flow loss. Although the auxiliary supervision network is not needed during living body detection, detection with this model is still constrained by factors such as the three-dimensional reconstruction projection information or the optical flow information, so the detection result is more accurate and reliable.
To further explain the living body detection model training method and device according to the present invention, a preferred embodiment is introduced. Fig. 5 schematically shows a network structure diagram of the living body detection model training method according to an embodiment of the present invention.
Before that, it is necessary to obtain, from the input frame sample set, the face auxiliary view corresponding to each frame of the sample set and the living body true value corresponding to the input frame sample set.
Optionally, fig. 6 schematically shows a flowchart of the data preparation stage of the living body detection model training method of the present invention in an embodiment, that is, the stage of generating the auxiliary views; in this embodiment, a three-dimensional reconstructed projection view and an optical flow view are used together as the auxiliary views.
As shown in fig. 6, first, for each camera-view frame $X_i$ in the input sample set, a three-dimensional reconstruction network is used to obtain the three-dimensional reconstruction information of the corresponding face area, from which the three-dimensional reconstructed projection view $P_i$ is computed. In this embodiment a 3DDFA network is used to obtain the three-dimensional reconstructed projection view of the face; this image is also called a PNCC map, in which the depth information is represented by color variation. In some other embodiments, three-dimensional reconstruction networks such as FML or PRNet may also be employed. Here $X_i$ denotes the i-th camera-view frame in the input sample set and $P_i$ denotes the three-dimensional reconstructed projection view of the i-th frame.
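A simplified sketch of how an orthographic projection view of this kind could be rendered from reconstructed face vertices, splatting per-vertex NCC colours onto the camera plane; the point-splat approximation, the depth convention (larger z assumed nearer the camera) and the function name are illustrative assumptions rather than the 3DDFA rendering procedure itself.

```python
import numpy as np

def orthographic_pncc(vertices, ncc_colors, height, width):
    """Render a PNCC-style projection view by orthographically projecting reconstructed
    face vertices onto the camera plane and painting each pixel with its NCC colour.

    vertices: (V, 3) x, y, z in image coordinates, from a 3D reconstruction network.
    ncc_colors: (V, 3) normalised-coordinate colours in [0, 1].
    """
    view = np.zeros((height, width, 3), dtype=np.float32)
    # paint far-to-near so that nearer vertices overwrite occluded ones
    order = np.argsort(vertices[:, 2])
    xs = np.clip(vertices[order, 0].astype(int), 0, width - 1)
    ys = np.clip(vertices[order, 1].astype(int), 0, height - 1)
    view[ys, xs] = ncc_colors[order]
    return view
```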
Next, for each pair of adjacent camera-view frames $X_i$ and $X_{i+1}$ in the sample set, the optical flow estimation information of the whole scene is obtained through an optical flow estimation algorithm; using the three-dimensional reconstruction projection as a mask, the flow belonging to the face region is filtered out to obtain the optical flow view $O_i$. In this embodiment the optical flow estimation algorithm is RIFE: its input is the two adjacent camera-view frames, and it outputs, for every pixel, the optical flow that aligns the two images. In some other embodiments, optical flow estimation algorithms such as FlowNet or FlowNet2 may also be employed. Here $O_i$ denotes the optical flow view of the i-th frame.
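A minimal sketch of this masking step, assuming the non-zero pixels of the projection view mark the face region; the array shapes and the helper name are illustrative.

```python
import numpy as np

def face_flow_view(flow, pncc_view):
    """Keep optical flow only inside the face region, using the 3D reconstruction
    projection (PNCC map) as a mask.

    flow: (H, W, 2) whole-scene optical flow between two adjacent frames.
    pncc_view: (H, W, 3) projection view; non-zero pixels mark the face region.
    """
    face_mask = np.any(pncc_view > 0, axis=-1, keepdims=True)   # (H, W, 1) boolean mask
    return flow * face_mask                                     # zero flow outside the face
```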
Furthermore, unified data augmentation can be applied simultaneously to the obtained sample set and the face auxiliary views, making the image characteristics more salient and improving the efficiency of subsequent model training. For example, rotation and crop augmentation are applied consistently to all three views, while color and blur augmentation are applied to the RGB view only, as sketched below.
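A simplified sketch of such unified augmentation, sharing one random crop across the three views and jittering only the RGB view; the crop ratio, jitter ranges and the assumption of 0-255 RGB values are illustrative, and rotation/blur are omitted for brevity.

```python
import numpy as np

def augment_triplet(rgb, pncc, flow, rng=None):
    """Apply the same geometric augmentation to all three views, and photometric
    augmentation to the RGB view only (a simplified illustration)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = rgb.shape[:2]
    # shared random crop: the geometric change must stay aligned across the three views
    ch, cw = int(0.9 * h), int(0.9 * w)
    top, left = rng.integers(0, h - ch + 1), rng.integers(0, w - cw + 1)
    crop = lambda img: img[top:top + ch, left:left + cw]
    rgb, pncc, flow = crop(rgb), crop(pncc), crop(flow)
    # photometric jitter on the RGB view only (values assumed in 0..255)
    rgb = np.clip(rgb * rng.uniform(0.8, 1.2) + rng.uniform(-10, 10), 0, 255)
    return rgb, pncc, flow
```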
The augmented sample set then serves as the input of the forward inference network, i.e. the living body detection model training network of this embodiment, while the augmented three-dimensional reconstruction projection views and optical flow views serve as auxiliary supervision signals and are fed into the auxiliary supervision network for processing.
In the model network shown in fig. 5, the augmented sample set is first subjected to sparse sampling to obtain the input frame sequence $[X_1, \dots, X_N]$, where N is the number of single-frame images in the input frame sequence, and the input frame sequence is input into the spatial-domain feature extraction network.
The convolutional feature extraction backbone network, realized in this network as convolution blocks (conv blocks), produces the sample feature sequence $[F_1, \dots, F_N]$ of the input frame training set, where $F_i$ is the feature map of the i-th frame. In some embodiments, feature extraction is performed using a ResNet network. In some other embodiments, networks such as MobileNet or VGG may also be employed to extract the sample feature sequence of the input frame training set.
In this embodiment, correspondingly, the three-dimensional reconstruction projection estimation map and the optical flow estimation map are used together as the auxiliary estimation maps, and the three-dimensional reconstruction projection loss and the optical flow loss are used together as the auxiliary losses. The auxiliary estimation maps and the auxiliary losses are obtained in the corresponding auxiliary supervision networks, which are part of the spatial-domain feature extraction network. The auxiliary estimation maps and auxiliary losses are obtained as follows:
Adjacent frames of the sample feature sequence are input into the optical flow estimation network, and the optical flow loss $Loss_{flow}$ is calculated from the output optical flow estimation map $\hat O_i$ and the optical flow view $O_i$, where $\hat O_i$ denotes the optical flow estimation map of the i-th frame and a loss denotes the difference between an estimated value and the corresponding true value.
The sample feature sequence $[F_1, \dots, F_N]$ is encoded with a spatial-domain Transformer encoder (Transformer Encoder); each frame of the encoded sample feature sequence $[\tilde F_1, \dots, \tilde F_N]$ is input into the three-dimensional reconstruction projection estimation network, which outputs a three-dimensional reconstruction projection estimation map $\hat P_i$ through a convolutional up-sampling (conv up-sampling) module. The three-dimensional reconstruction projection loss $Loss_{pncc}$ is then calculated by comparing it with the corresponding three-dimensional reconstruction projection view $P_i$, where $\hat P_i$ denotes the three-dimensional reconstruction projection estimation map of the i-th frame.
The encoded sample feature sequence $[\tilde F_1, \dots, \tilde F_N]$ is input into the time domain detection network, and the living body prediction loss $Loss_{ce}$ is calculated from the output living body predicted value $Y_{pred}$ and the living body true value $Y_{gt}$.
The integrated loss $Loss_{all}$ is determined from the auxiliary losses and the living body prediction loss, and the living body detection model is trained with minimization of the integrated loss as the training target.
The step of inputting the adjacent frames of the sample feature sequence into the optical flow estimation network and outputting an optical flow estimation map comprises:

inputting the adjacent frames of the sample feature sequence into the optical flow estimation network to obtain their respective optical flow features;

inputting the difference of the optical flow features of the adjacent frames, i.e. the feature change $\Delta F_i$, into a convolutional up-sampling (conv up-sample) module to obtain the optical flow estimation map $\hat O_i$.
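The following sketch shows one plausible conv up-sampling head that maps the feature difference of adjacent frames to a 2-channel flow map; the channel widths, layer depth and class name are assumptions, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class FlowEstimationHead(nn.Module):
    """Predict a 2-channel optical flow estimation map from the difference of the
    optical flow features of two adjacent frames (conv up-sampling sketch)."""
    def __init__(self, in_ch=512):
        super().__init__()
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(in_ch, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 2, kernel_size=3, padding=1),   # (dx, dy) per pixel
        )

    def forward(self, feat_i, feat_next):        # optical flow features of adjacent frames
        return self.upsample(feat_next - feat_i)

# usage: flow_map = FlowEstimationHead()(torch.randn(1, 512, 7, 7), torch.randn(1, 512, 7, 7))
```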
It should be noted that, while the present invention makes use of commonly used loss functions, it also proposes a new optical flow loss, namely a translation-invariant optical flow loss (Contrastive Flow Loss). Fig. 7 schematically illustrates the method of calculating this translation-invariant optical flow loss according to an embodiment of the present invention. In this embodiment, the relative position change of the face between adjacent frames is first obtained by introducing additional convolution kernels, specifically 8 contrastive convolution kernels; then, after translational alignment according to this relative position change, the mean square error between the optical flow estimation map and the optical flow view is calculated as the optical flow loss, as shown in the formula:
$$L_{CDL} = \sum_{i=1}^{8} \big\| K_i^{contrast} * D_P - K_i^{contrast} * D_G \big\|_2^2$$

where $L_{CDL}$ is the translation-invariant optical flow loss, i.e. the optical flow loss $Loss_{flow}$ of this embodiment; $K_i^{contrast}$ denotes the i-th contrastive convolution kernel and $*$ denotes convolution; $D_P$ denotes the predicted optical flow (Predicted Flow), i.e. the optical flow estimation map; and $D_G$ denotes the real optical flow (Ground-truth Flow), i.e. the optical flow view.
This way of computing the optical flow loss reduces the negative influence of overall head motion, which would otherwise introduce noise into the optical flow estimation, and effectively captures micro-expression changes along the time axis, improving the model's discriminative ability.
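A sketch of how a contrastive, translation-insensitive flow loss of this kind can be computed with eight fixed 3x3 contrastive convolution kernels (each comparing a pixel with one of its neighbours), so that a uniform head translation contributes little to the loss; the specific kernel design and per-channel reduction are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_flow_loss(pred_flow, gt_flow):
    """Compare local flow differences instead of absolute flow values, so that a global
    head translation (a constant flow offset) largely cancels out of the loss.

    pred_flow, gt_flow: (B, 2, H, W) optical flow estimation map and optical flow view.
    """
    # eight 3x3 contrastive kernels: +1 at one neighbour, -1 at the centre
    kernels = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            k = torch.zeros(1, 1, 3, 3)
            k[0, 0, 1, 1] = -1.0
            k[0, 0, 1 + dy, 1 + dx] = 1.0
            kernels.append(k)
    weight = torch.cat(kernels, dim=0)                        # (8, 1, 3, 3)

    loss = 0.0
    for c in range(pred_flow.size(1)):                        # apply per flow channel
        p = F.conv2d(pred_flow[:, c:c + 1], weight, padding=1)
        g = F.conv2d(gt_flow[:, c:c + 1], weight, padding=1)
        loss = loss + F.mse_loss(p, g)
    return loss
```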
It should be noted that the three-dimensional reconstruction projection loss $Loss_{pncc}$ is obtained by calculating the mean square error loss (MSE Loss) between the three-dimensional reconstruction projection estimation map and the three-dimensional reconstruction projection view. The mean square error, denoted MSE, is the expected value of the squared difference between the estimated value and the true value and measures how much the data vary; the smaller the mean square error, the more accurately the model describes the experimental data.
It should be noted that a class token mechanism is introduced into the time domain detection network, and classification is implemented by applying the linear classifier shown in the shaded part of fig. 5 to the encoded sample feature sequence output by the spatial-domain network.
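A minimal sketch of a class-token detection head of this kind; modelling the temporal aggregation as a small Transformer is an assumption (the patent only specifies a class token plus a linear classifier), and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TemporalDetector(nn.Module):
    """Time-domain detection with a class token and a linear classifier
    (cf. the shaded part of fig. 5)."""
    def __init__(self, feat_dim=512, n_heads=8, n_layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.classifier = nn.Linear(feat_dim, 2)           # live / spoof logits

    def forward(self, frame_tokens):                       # (B, N, feat_dim), one token per frame
        b = frame_tokens.size(0)
        seq = torch.cat([self.cls_token.expand(b, -1, -1), frame_tokens], dim=1)
        out = self.temporal(seq)
        return self.classifier(out[:, 0])                  # classify from the class token

# usage: logits = TemporalDetector()(torch.randn(4, 8, 512))   # -> (4, 2)
```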
Illustratively, the living body prediction loss $Loss_{ce}$ is obtained by calculating the cross-entropy loss (CE Loss) between the living body predicted value and the living body true value.
Note that the integrated loss $Loss_{all}$ is obtained by weighting the auxiliary losses and the living body prediction loss, i.e., in this embodiment, the three-dimensional reconstruction projection loss $Loss_{pncc}$, the optical flow loss $Loss_{flow}$ and the living body prediction loss $Loss_{ce}$, in certain proportions. With minimization of the integrated loss as the training target, the network parameters θ are updated by gradient back-propagation, iterating until the model converges.
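A sketch of one training update under these definitions; the weighting proportions are placeholders, since the patent does not disclose the actual values.

```python
def train_step(losses, optimizer, weights=(0.5, 0.5, 1.0)):
    """One update with the integrated loss as the training target.

    losses: (loss_pncc, loss_flow, loss_ce) scalar tensors;
    weights: illustrative proportions for the weighted combination.
    """
    loss_all = sum(w * l for w, l in zip(weights, losses))
    optimizer.zero_grad()
    loss_all.backward()        # gradients flow back to the network parameters theta
    optimizer.step()           # gradient-descent update
    return loss_all.detach()
```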
According to the embodiments of the present invention, the living body detection model is trained with minimization of the integrated loss as the training target, so that an accurate living body detection result can be obtained after sparse sampling of the detection images. This effectively improves the ability to defend against high-quality 3D attacks, greatly reduces the computing requirement, and gives the user a smoother, non-intrusive verification experience.
An embodiment of the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of model training and living body detection provided by the present invention as described above.
One embodiment of the present invention provides a computing device comprising a memory and a processor, the memory having executable code stored therein that when executed by the processor performs the method of model training and in vivo detection provided by the present invention described above.
It should be noted that the above-mentioned embodiments are merely examples of the present invention, and it is obvious that the present invention is not limited to the above-mentioned embodiments, and many similar variations are possible. All modifications attainable or obvious from the present disclosure set forth herein should be deemed to be within the scope of the present disclosure.

Claims (20)

1. A method of training a living body detection model, the model comprising at least a feature extraction backbone network, an auxiliary supervision network, and a time domain detection network, the method comprising:
acquiring a face auxiliary view corresponding to each frame in an input frame sample set and a living body true value corresponding to the input frame sample set according to the input frame sample set;
extracting a sample feature sequence of the input frame training set through the feature extraction backbone network;
inputting the sample feature sequence into the auxiliary supervision network, and calculating an auxiliary loss according to the output auxiliary estimation map and the face auxiliary view;
inputting the sample characteristic sequence into the time domain detection network, and calculating the living body prediction loss according to the output living body prediction value and living body true value;
and determining comprehensive loss according to the auxiliary loss and the living body predicted loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
2. The method for training a living body detection model according to claim 1, wherein the auxiliary supervision network is a three-dimensional reconstruction projection estimation network, the face auxiliary view is a face three-dimensional reconstruction projection view, and after the feature extraction backbone network extracts the sample feature sequence of the input frame training set, the method comprises:
encoding the sample feature sequence, inputting each frame of the encoded sample feature sequence into the three-dimensional reconstruction projection estimation network, and calculating three-dimensional reconstruction projection loss according to the output three-dimensional reconstruction projection estimation graph and the three-dimensional reconstruction projection view;
inputting the coded sample characteristic sequence into the time domain detection network, and calculating living body prediction loss according to the output living body prediction value and living body true value;
and determining comprehensive loss according to the three-dimensional reconstruction projection loss and the living body prediction loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
3. The method for training a living body detection model according to claim 2, wherein the step of obtaining the three-dimensional reconstruction projection view of the face corresponding to each frame in the sample set comprises the steps of:
for each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network;
And calculating the orthogonal projection view of the three-dimensional reconstruction information to a camera plane as the three-dimensional reconstruction projection view.
4. The living body detection model training method according to claim 2, wherein calculating the three-dimensional reconstruction projection loss from the output three-dimensional reconstruction projection estimation map and the three-dimensional reconstruction projection view comprises:
and calculating the mean square error loss of the three-dimensional reconstruction projection estimated graph and the three-dimensional reconstruction projection view as the three-dimensional reconstruction projection loss.
5. The in-vivo detection model training method of claim 1, the auxiliary supervisory network being an optical flow estimation network, the face auxiliary view being a face optical flow view, the method comprising, after extracting a sample feature sequence of the input frame training set through the feature extraction backbone network:
inputting adjacent frames of the sample feature sequence into the optical flow estimation network, and calculating optical flow loss according to the output optical flow estimation graph and the optical flow view;
encoding the sample characteristic sequence, inputting the encoded sample characteristic sequence into the time domain detection network, and calculating the living body prediction loss according to the output living body prediction value and living body true value;
And determining comprehensive loss according to the optical flow loss and the living body predicted loss, and training the living body detection model by taking the minimum comprehensive loss as a training target.
6. The in vivo detection model training method of claim 5, said obtaining a face optical flow view in said sample set comprising:
aiming at the adjacent frames in the sample set, obtaining optical flow estimation information of the whole scene through an optical flow estimation algorithm;
for each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network;
and taking the orthogonal projection view of the three-dimensional reconstruction information to the camera plane as a three-dimensional reconstruction projection, taking the three-dimensional reconstruction projection as a mask, and filtering to obtain the face optical flow view.
7. The in-vivo detection model training method of claim 5, wherein inputting adjacent frames of the sample feature sequence into the optical flow estimation network to output an optical flow estimation map comprises:
inputting adjacent frames in the sample feature sequence into the optical flow estimation network to respectively obtain optical flow features;
and inputting the difference value of the optical flow characteristics of the adjacent frames into a convolution up-sampling module to obtain the optical flow estimation graph.
8. The living body detection model training method according to claim 5, wherein the calculating of the optical flow loss from the optical flow estimation map and the optical flow view comprises:
acquiring the relative position change of the face between adjacent frames;
and after translational alignment according to the relative position change, calculating the mean square error of the optical flow estimation graph and the optical flow view as the optical flow loss.
9. The living body detection model training method according to claim 1, wherein calculating the living body prediction loss from the output living body predicted value and living body true value comprises:
and calculating the cross entropy loss of the living body predicted value and the living body true value as the living body predicted loss.
10. A living body detection method, comprising:
collecting a detection image, and acquiring an input frame sequence from the detection image;
inputting the input frame sequence into a feature extraction backbone network in a pre-trained living body detection model, and obtaining a feature sequence output by the feature extraction backbone network;
encoding the characteristic sequence, inputting the encoded characteristic sequence into a time domain detection network in the living body detection model, and obtaining a living body detection result output by the time domain detection network, wherein the living body detection model is trained by adopting the method as set forth in any one of claims 1 to 9.
11. The living body detection method according to claim 10, acquiring an input frame sequence from the detection image, comprising:
and performing sparse sampling on the detection image to obtain a preset number of sampling frames serving as the input frame sequence.
12. A living body detection training device comprises an input module, a space domain module, a time domain module and a training module:
the input module obtains, according to an input frame sample set, a face auxiliary view corresponding to each frame in the sample set and a living body true value corresponding to the input frame sample set;
the airspace module extracts a sample feature sequence of the input frame training set through the feature extraction backbone network, inputs the sample feature sequence into an auxiliary supervision network, and calculates auxiliary loss according to the output auxiliary estimation graph and the face auxiliary view;
the time domain module inputs the coded sample feature sequence into a time domain detection network, and calculates living body prediction loss according to the output living body prediction value and living body true value;
and the training module determines comprehensive loss according to the auxiliary loss and the living body predicted loss, and trains the living body detection model by taking the minimum comprehensive loss as a training target.
13. The living body detection training device according to claim 12, wherein the auxiliary supervisory network is a three-dimensional reconstruction projection estimation network, and the face auxiliary view is a face three-dimensional reconstruction projection view:
the airspace module encodes the sample feature sequence, each frame of the encoded sample feature sequence is input into the three-dimensional reconstruction projection estimation network, and three-dimensional reconstruction projection loss is calculated according to the output three-dimensional reconstruction projection estimation diagram and the three-dimensional reconstruction projection view;
and the training module determines comprehensive loss according to the three-dimensional reconstruction projection loss and the living body prediction loss, and trains the living body detection model by taking the minimum comprehensive loss as a training target.
14. The living body detection training device according to claim 13,
the acquisition module obtains three-dimensional reconstruction information of a face area through a three-dimensional reconstruction network aiming at each frame in the sample set, and calculates an orthogonal projection view of the three-dimensional reconstruction information to a camera plane as the three-dimensional reconstruction projection view of the face.
15. The in-vivo detection training apparatus of claim 12, the auxiliary supervisory network being an optical flow estimation network, the face auxiliary view being a face optical flow view:
The airspace module inputs adjacent frames of the sample feature sequence into the optical flow estimation network, and calculates optical flow loss according to the output optical flow estimation diagram and the optical flow view;
and the training module determines comprehensive loss according to the optical flow loss and the living body predicted loss, and trains the living body detection model by taking the minimum comprehensive loss as a training target.
16. The living body detection training device according to claim 15,
the acquisition module is used for acquiring optical flow estimation information of the whole scene through an optical flow estimation algorithm according to adjacent frames in the sample set;
for each frame in the sample set, three-dimensional reconstruction information of a face area is obtained through a three-dimensional reconstruction network; and taking the orthogonal projection view of the three-dimensional reconstruction information to the camera plane as a three-dimensional reconstruction projection, taking the three-dimensional reconstruction projection as a mask, and filtering to obtain the face optical flow view.
17. The living body detection training device according to claim 15,
the airspace module is also used for inputting adjacent frames in the sample feature sequence into the optical flow estimation network to respectively obtain optical flow features; and inputting the difference value of the optical flow characteristics of the adjacent frames into a convolution up-sampling module to obtain the optical flow estimation graph.
18. The living body detection training device according to claim 15,
the airspace module is also used for acquiring the relative position change of the face between adjacent frames; and after translational alignment according to the relative position change, calculating the mean square error of the optical flow estimation graph and the optical flow view as the optical flow loss.
19. A living body detection apparatus comprising:
the acquisition module is used for acquiring detection images and acquiring an input frame sequence from the detection images;
the feature module inputs the input frame sequence into a feature extraction backbone network in a pre-trained living body detection model, and obtains a feature sequence output by the feature extraction backbone network;
the detection module encodes the feature sequence, and inputs the encoded feature sequence into a time domain detection network in the living body detection model to obtain a living body detection result output by the time domain detection network, wherein the living body detection model is trained by the method according to any one of claims 1 to 9.
20. The living body detection device according to claim 19,
the acquisition module performs sparse sampling on the acquired living body detection target image to obtain a preset number of sampling frames serving as the input frame sequence.
CN202211610853.8A 2022-12-14 2022-12-14 Living body detection model training and living body detection method and device Pending CN116246314A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211610853.8A CN116246314A (en) 2022-12-14 2022-12-14 Living body detection model training and living body detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211610853.8A CN116246314A (en) 2022-12-14 2022-12-14 Living body detection model training and living body detection method and device

Publications (1)

Publication Number Publication Date
CN116246314A true CN116246314A (en) 2023-06-09

Family

ID=86628514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211610853.8A Pending CN116246314A (en) 2022-12-14 2022-12-14 Living body detection model training and living body detection method and device

Country Status (1)

Country Link
CN (1) CN116246314A (en)

Similar Documents

Publication Publication Date Title
CN113673307B (en) Lightweight video action recognition method
CN109064507B (en) Multi-motion-stream deep convolution network model method for video prediction
CN108520503B (en) Face defect image restoration method based on self-encoder and generation countermeasure network
Zhao et al. Learning to forecast and refine residual motion for image-to-video generation
CN110689599B (en) 3D visual saliency prediction method based on non-local enhancement generation countermeasure network
KR101396618B1 (en) Extracting a moving object boundary
CN112507990A (en) Video time-space feature learning and extracting method, device, equipment and storage medium
CN114170286B (en) Monocular depth estimation method based on unsupervised deep learning
CN115914505B (en) Video generation method and system based on voice-driven digital human model
CN114463218B (en) Video deblurring method based on event data driving
CN112270691B (en) Monocular video structure and motion prediction method based on dynamic filter network
CN112541865A (en) Underwater image enhancement method based on generation countermeasure network
CN110246171B (en) Real-time monocular video depth estimation method
CN111901532A (en) Video stabilization method based on recurrent neural network iteration strategy
Aakerberg et al. Semantic segmentation guided real-world super-resolution
CN114842542B (en) Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN114049434A (en) 3D modeling method and system based on full convolution neural network
CN116704585A (en) Face recognition method based on quality perception
CN114283301A (en) Self-adaptive medical image classification method and system based on Transformer
Zhu et al. Micro-expression recognition convolutional network based on dual-stream temporal-domain information interaction
CN116246314A (en) Living body detection model training and living body detection method and device
CN114663315B (en) Image bit enhancement method and device for generating countermeasure network based on semantic fusion
CN115424337A (en) Iris image restoration system based on priori guidance
CN114943746A (en) Motion migration method utilizing depth information assistance and contour enhancement loss
Farooq et al. Generating thermal image data samples using 3D facial modelling techniques and deep learning methodologies

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination