CN111209863B - Living model training and face living body detection method and device, and electronic equipment


Info

Publication number: CN111209863B
Authority: CN (China)
Prior art keywords: score, face, video, living, silence
Legal status: Active
Application number: CN202010012081.2A
Other languages: Chinese (zh)
Other versions: CN111209863A
Inventors: 王鹏, 姚聪, 陈坤鹏, 周争光
Assignee (current and original): Beijing Kuangshi Technology Co Ltd
Application filed by Beijing Kuangshi Technology Co Ltd; priority to CN202010012081.2A; publication of CN111209863A; application granted; publication of CN111209863B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 - Spoof detection, e.g. liveness detection
    • G06V40/45 - Detection of the body part being alive
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Abstract

The invention provides a living model training and face living body detection method and device, and an electronic device. The living model training method comprises the following steps: acquiring a plurality of silence videos and extracting a plurality of frame images to be trained from each; inputting the frame images to be trained into a preset living model to obtain an output result corresponding to the silence video; calculating the value of a loss function according to the output result and the labeling result of the silence video; and adjusting parameters of the living model according to the value of the loss function until the value of the loss function converges. In this way, multiple single-frame images can be extracted from the silence video for training and result output. Judging across multiple frame images avoids the low security level of single-frame methods, and because the silence video is captured without requiring specific actions, the excessive action complexity of dynamic methods is also avoided; the security level is high, the required action is simple, implementation is convenient, and the user experience is improved.

Description

Living model training and face living body detection method and device, and electronic equipment
Technical Field
The invention relates to the technical field of face recognition, and in particular to a living model training method and device, a face living body detection method and device, and an electronic device.
Background
With the wide application of face recognition technology, its security is gradually attracting attention. Face living body judgment refers to a technology that automatically judges whether the face in a given image or video comes from a real person or from a spoofed face; common spoofed faces include masks, printed photographs, photographs displayed on a screen and played video clips. Face living body judgment is an important anti-attack and anti-fraud technique, and is widely applied in industries and scenarios involving remote identity authentication, such as banking, insurance, internet finance and e-commerce.
Existing face living body judgment technologies can generally be divided into two categories: static methods and dynamic methods. Static methods judge the authenticity of a given face mainly through features such as color, texture and background objects in the image; they are simple and efficient, but their security level is not high, because static face images are easy to forge by means of PS (image editing), synthesis software, photographs displayed on a high-definition screen and the like, and the technical difficulty and cost of such forgery keep falling as technology develops. Dynamic methods judge single-frame images on the basis of actions and require the user to complete specified facial actions, such as opening the mouth or blinking, in front of the camera; however, these facial actions increase the difficulty of technical implementation and degrade the user experience.
Therefore, a face living body recognition technology with a high security level and simple actions is needed to improve the user experience.
Disclosure of Invention
The invention addresses the problem of providing face living body detection with both a high security level and simple user actions.
To solve the above problem, the present invention provides a living model training method, comprising:
acquiring a plurality of silence videos, and extracting a plurality of frame images to be trained from each silence video;
inputting the plurality of frame images to be trained into a preset living body model to obtain an output result corresponding to the silent video;
calculating the value of a loss function according to the output result and the labeling result of the silent video;
and adjusting parameters of the living model according to the value of the loss function until the value of the loss function converges.
In this way, multiple single-frame images can be extracted from the silence video and input into the living model for training and result output. Judging across multiple frame images avoids the low security level of single-frame methods, and because the silence video is captured without requiring specific actions, the excessive action complexity of dynamic methods is also avoided; the security level is high, the required action is simple, implementation is convenient, and the user experience is improved.
Optionally, the living model is a neural network model.
Optionally, after the obtaining of a plurality of silence videos and the extracting of a plurality of frame images to be trained from each silence video, the method further includes:
and obtaining a labeling result according to the silent video.
In this way, the labeling result obtained from the silence video gives the true result of the frame images to be trained, so that accuracy can be improved through subsequent comparison and a more accurate living model can be obtained.
Optionally, the obtaining a plurality of silence videos and extracting a plurality of frame images to be trained from each silence video includes:
acquiring one silence video, and dividing the silence video into a plurality of interval sections;
extracting a frame image from each interval; the frame image is a frame image to be trained;
and traversing all the silence videos to obtain the frame images to be trained.
In this way, one frame image is extracted from each interval segment after the segments are divided, avoiding the loss or excessive repetition of effective information caused by an uneven distribution of directly extracted frame images within the silence video (too long an interval between two frame images loses effective information; too short an interval repeats it).
Optionally, the output result of the living model at least includes: face presence score, living score, and attack score.
Optionally, the attack score includes: a screen flip score, a printing paper score, a paper cut-out score, a cut-out mask score and a 3D model score.
In this way, by finely classifying the attack types, the performance of the detection method and model on different attack types can be measured, their weaknesses on particular attack types can be remedied in time, and the goals of simple use, high precision and high security are achieved.
Optionally, in the labeling result of the silent video, when the face existence score indicates that no face exists, the loss function is determined by the face existence score in the labeling result of the silent video and the face existence score in the output result.
Optionally, in the labeling result of the silence video, when the face existence score characterizes that a face exists, the loss function is determined by the face existence score, the living body score and the attack score in the labeling result of the silence video and the face existence score, the living body score and the attack score in the output result.
In this way, by setting the loss function differently depending on whether a face is present, a correspondence is established between the output result and the actual classification, so that the loss function reflects the accuracy of the actual classification; adjusting the parameters of the living model through the loss function then improves the accuracy and discrimination of the model and achieves a better training effect.
Secondly, a face living body detection method is provided, comprising the following steps:
shooting a silence video of a face, and extracting a plurality of frame images to be evaluated from the silence video;
inputting the frame images to be evaluated into a preset living model to obtain an evaluation result; the living model is obtained by training with the living model training method described above;
and judging the detection result of the silent video according to the evaluation result.
In this way, multiple single-frame images can be extracted from the silence video and input into the living model for evaluation and result output. Judging across multiple frame images avoids the low security level of single-frame methods, and because the silence video is captured without requiring specific actions, the excessive action complexity of dynamic methods is also avoided; the security level is high, the required action is simple, implementation is convenient, and the user experience is improved.
Optionally, the duration of the silence video is 1-3 s. This reduces the user's shooting time and the difficulty of shooting, improving the experience.
Optionally, the judging of the detection result of the silence video according to the evaluation result includes:
judging whether the face presence score in the evaluation result is smaller than a preset face threshold;
if it is smaller than the face threshold, the living body detection of the silence video fails;
if it is not smaller than the face threshold, judging whether the living score is the highest among the living score and the attack scores;
if it is the highest score, the living body detection of the silence video passes;
if it is not the highest score, the living body detection of the silence video fails.
In this way, the attack types are finely classified; by measuring the performance of the detection method and model on different attack types, their weaknesses can be remedied in time and the attack type is obtained directly, achieving simple use, high precision and high security.
Thirdly, a living model training device is provided, comprising:
the acquisition unit is used for acquiring a plurality of silence videos and extracting a plurality of frame images to be trained from each silence video;
the model unit is used for inputting the plurality of frame images to be trained into a preset living model to obtain an output result corresponding to the silent video;
the calculating unit is used for calculating the value of the loss function according to the output result and the labeling result of the silent video;
and the adjusting unit is used for adjusting parameters of the living model according to the value of the loss function until the value of the loss function converges.
In this way, multiple single-frame images can be extracted from the silence video and input into the living model for training and result output. Judging across multiple frame images avoids the low security level of single-frame methods, and because the silence video is captured without requiring specific actions, the excessive action complexity of dynamic methods is also avoided; the security level is high, the required action is simple, implementation is convenient, and the user experience is improved.
Fourthly, a face living body detection device is provided, comprising:
the image capturing unit is used for capturing a silence video of a face and extracting a plurality of frame images to be evaluated from the silence video;
the evaluation unit is used for inputting the frame images to be evaluated into a preset living model to obtain an evaluation result; the living model is obtained by training with the living model training method described above;
and the judging unit is used for judging the detection result of the silent video according to the evaluation result.
In this way, multiple single-frame images can be extracted from the silence video and input into the living model for evaluation and result output. Judging across multiple frame images avoids the low security level of single-frame methods, and because the silence video is captured without requiring specific actions, the excessive action complexity of dynamic methods is also avoided; the security level is high, the required action is simple, implementation is convenient, and the user experience is improved.
Finally, an electronic device is provided, comprising a processor and a memory, wherein the memory stores a control program which, when executed by the processor, implements the above living model training method or the above face living body detection method.
In addition, a computer-readable storage medium is provided, storing instructions which, when loaded and executed by a processor, implement the above living model training method or the above face living body detection method.
Drawings
FIG. 1 is a flowchart of a living model training method according to one embodiment of the present invention;
FIG. 2 is a flowchart of a living model training method according to another embodiment of the present invention;
FIG. 3 is a flowchart of step 10 of the living model training method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a face living body detection method according to an embodiment of the present invention;
FIG. 5 is a flowchart of step 300 of the face living body detection method according to an embodiment of the present invention;
FIG. 6 is a block diagram of a living model training device according to an embodiment of the present invention;
FIG. 7 is a block diagram of a face living body detection device according to an embodiment of the present invention;
FIG. 8 is a block diagram of an electronic device according to an embodiment of the present invention;
FIG. 9 is a block diagram of another electronic device according to an embodiment of the present invention.
Reference numerals illustrate:
1-acquisition unit, 2-model unit, 3-calculation unit, 4-adjustment unit, 5-shooting unit, 6-evaluation unit, 7-judgment unit, 12-electronic device, 14-external device, 16-processing unit, 18-bus, 20-network adapter, 22-input/output (I/O) interface, 24-display, 28-system memory, 30-random access memory, 32-cache memory, 34-storage system, 40-utility, 42-program module.
Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
It will be apparent that the illustrated embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.
To facilitate understanding, the technical problem addressed by the invention and its technical principles are first described in detail.
Face recognition technology has been widely applied, and whether face recognition is safe depends on face living body judgment. Face living body judgment refers to a technique that automatically judges whether the face in a given image or video comes from a real person or from a spoofed face (a mask, a printed photograph, a photograph displayed on a screen, a played video clip, etc.).
That is, face living body judgment determines whether the image or video acquired by a camera or the like shows a real face or a spoofed face made with a mask, a printed photograph, a photograph displayed on a screen, a played video clip or the like. If it is a real face, the true identity of the face is then judged, completing recognition; if it is a spoofed face, judgment of the true identity is stopped, preventing recognition errors and other serious consequences.
Existing face living body judging technologies can be generally divided into two categories: static methods and dynamic methods.
After acquiring a single frame image, static methods judge the authenticity of the given face mainly through features such as color, texture and background objects in the image. Such methods are simple and efficient, but their security level is not high, because static face images are easy to forge by means of PS (image editing), synthesis software, photographs displayed on a high-definition screen and the like, and such forgeries are hard to expose through color, texture or background features. In fact, given a high-definition face image, even a human finds it difficult to judge whether the image shows a real face or a face obtained by other means such as a mask, a printed photograph, a photograph displayed on a screen or a screenshot of a played video clip. Moreover, as technology develops, the difficulty and cost of face forgery keep falling and the failure rate of recognition keeps rising, so the shortcomings of existing static methods are increasingly obvious.
Dynamic methods judge single-frame images on the basis of actions, requiring the user to complete specified facial actions, such as opening the mouth or blinking, in front of the camera. That is, the user performs mouth-opening, blinking and similar actions in front of the lens according to instructions; after the action video is obtained, a single frame image of the specific action is captured and then judged to determine whether the face is a living face or a spoofed face. Because one frame is extracted from a video, a counterfeiter must forge the entire video to deceive the system; these specific actions involve many facial nerves and are hard to forge, so the security level is high. However, precisely because the actions involve many facial nerves, deciding how and by what criteria to judge them is very complicated, so such judgment is hard to implement; in addition, performing prescribed actions, and repeating them whenever they are judged non-standard, is difficult for the user and makes for a poor experience.
Existing living body detection methods and their detection models are static or dynamic recognition schemes aimed at single-frame images, and therefore generally suffer from either a low security level or excessively complex actions.
The disclosed embodiments provide a living model training method, which may be performed by a living model training device; the device may be integrated into an electronic device such as a computer or a server. FIG. 1 is a flowchart of a living model training method according to an embodiment of the present invention; the living model training method comprises the following steps:
step 10, obtaining a plurality of silence videos, and extracting a plurality of frame images to be trained from each silence video;
The silence video can be obtained by having the user look at the camera or the screen for a period of time; the user does not need to perform any special or specific action, so the user experience is not degraded.
It should be noted that, the silence video (or each video) is formed by sequentially arranging multiple frames of images, and each frame of image may be an RGB picture or an infrared image. And extracting a plurality of frame images to be trained from each silent video, namely extracting a plurality of frame images to be trained from the images of which the frames in each silent video are sequentially arranged.
Optionally, the same number of frame images to be trained is extracted from each silence video, for example N frame images per silence video; a unified extraction standard greatly reduces recognition errors caused by inconsistent standards and improves the accuracy of model recognition.
Optionally, the extraction time interval between two adjacent frame images to be trained extracted from a silence video is the same. In this way, the frame images to be trained are extracted uniformly, avoiding recognition errors caused by non-uniform extraction and yielding a living model with higher recognition accuracy.
Step 30, inputting the plurality of frame images to be trained into a preset living body model to obtain an output result corresponding to the silent video;
The living model can be a model such as VGG16, AlexNet or ResNet, a model modified on such a basis, or a self-defined network structure or self-defined neural network.
After the living body model is determined, various parameters in the living body model are preset, and later stage is gradually adjusted through a loss function, so that the aim of obtaining the optimal living body model is fulfilled. It should be noted that, each parameter in the living model may be preset according to experience, or may be a default value set according to actual requirements or other manners such as conventional settings.
Inputting the plurality of frame images to be trained into the preset living model yields an output result corresponding to the silence video; this output result is the living model's prediction for the frame images to be trained (the silence video), not the true result of the frame images to be trained (the silence video).
Optionally, the living model is a neural network model. In this way, the neural network model can learn accurate judgment through training, improving the accuracy of face living body recognition.
In this step, the inputting the plurality of frame images to be trained into the living body model means that the plurality of frame images to be trained extracted from one silence video are all input into the living body model, so that whether the face in the silence video is a real face can be judged.
A plurality of frame images to be trained extracted from one silence video are input into the living model, and the living model then outputs a result; this output result is the output result of the silence video, i.e. the prediction/recognition result of the living model for the silence video.
It should be noted that the frame images to be trained are not input into the living model one frame at a time; instead, the frame images to be trained extracted from the same silence video are treated as one group, and the whole group is input into the living model to obtain one output result, which is the output result corresponding to that group of frame images and hence to the silence video.
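Purely for illustration, the following Python sketch shows one possible way such a living model could take a whole group of N frames and produce a single (p, c0, ..., c5) output; the channel-stacking fusion, the layer sizes and the frame count are assumptions of this sketch, since the description does not fix a concrete network structure:

import torch
import torch.nn as nn

class LivenessModel(nn.Module):
    def __init__(self, num_frames=8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3 * num_frames, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 7)  # 7 outputs: (p, c0, c1, ..., c5)

    def forward(self, frames):
        # frames: (batch, num_frames, 3, H, W); stack the group along channels
        b, n, c, h, w = frames.shape
        feats = self.backbone(frames.reshape(b, n * c, h, w)).flatten(1)
        return torch.sigmoid(self.head(feats))  # each score in [0, 1]

model = LivenessModel(num_frames=8)
scores = model(torch.rand(2, 8, 3, 112, 112))  # one 7-score output per video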
Step 40, calculating the value of a loss function according to the output result and the labeling result of the silence video;
The labeling result of the silence video is the true result of the frame images to be trained. It can be obtained together with the frame images or the silence video, or through manual identification. For example, whether an acquired silence video shows a real face or a spoofed face can be identified manually, and the identification result of the silence video is also the identification result of the frame images to be trained extracted from it; alternatively, the user can provide the identification result while recording the silence video, thereby giving the identification result of the extracted frame images.
And step 50, adjusting parameters of the living model according to the value of the loss function until the value of the loss function converges.
When adjusting the parameters of the neural network model (the living model) according to the value of the loss function, the model can be updated iteratively with small batches of samples. For example, given 10,000 frame images to be trained, 100 of them are selected each time and input into the neural network model; the value of the loss function over those 100 samples is calculated, and the parameters of the model are adjusted through that value; then another 100 frame images to be trained are selected and input into the adjusted network, and so on, until the value of the loss function converges.
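For illustration, a minimal sketch of this small-batch update loop, assuming the LivenessModel sketched above and the liveness_loss sketched further below; the optimizer, learning rate and convergence tolerance are assumptions, not taken from the description:

import torch

def train(model, dataset, loss_fn, epochs=10, batch_size=100, tol=1e-6):
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    prev = float("inf")
    for _ in range(epochs):
        for frames, labels in loader:        # labels are the (p, c0..c5) vectors
            opt.zero_grad()
            loss = loss_fn(model(frames), labels)
            loss.backward()                  # adjust parameters via the loss value
            opt.step()
        if abs(prev - loss.item()) < tol:    # crude convergence check
            break
        prev = loss.item()
    return model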
In this way, through steps 10-50, multiple single-frame images can be extracted from the silence video and input into the living model for training and result output. Judging across multiple frame images avoids the low security level of single-frame methods, and because the silence video is captured without requiring specific actions, the excessive action complexity of dynamic methods is also avoided; the security level is high, the required action is simple, implementation is convenient, and the user experience is improved.
Optionally, as shown in fig. 2, after step 10 (acquiring a plurality of silence videos and extracting a plurality of frame images to be trained from each silence video), the method further includes:
and step 20, obtaining a labeling result according to the silent video.
The labeling result is the true result of the frame images to be trained. It may be obtained together with the frame images or the silence video, stored in advance with the silence video or in another storage component, or obtained through manual identification. For example, whether an acquired silence video shows a real face or a spoofed face can be identified manually, and the identification result of the silence video is also the identification result of the frame images to be trained extracted from it; alternatively, the user can provide the identification result while recording the silence video.
In this way, the labeling result obtained from the silence video gives the true result of the frame images to be trained, so that accuracy can be improved through subsequent comparison and a more accurate living model can be obtained.
Optionally, as shown in fig. 3, the step 10 of obtaining a plurality of silence videos and extracting a plurality of to-be-trained frame images from each silence video includes:
step 11, obtaining one silence video, and dividing the silence video into a plurality of interval sections;
When frame images to be trained need to be extracted from a plurality of silence videos, one silence video is selected and extracted from; after extraction finishes, another silence video is selected, until all silence videos have been extracted.
Dividing the silence video into a plurality of interval segments means setting a number of time nodes within the silence video, with each pair of adjacent nodes bounding one segment; because the nodes also include time 0 and the final time, the number of segments is one more than the number of interior time nodes (those excluding time 0 and the final time).
Optionally, the silence video is uniformly divided into a plurality of interval segments, thereby improving the uniformity of the interval segments.
Optionally, the duration of each interval segment is a constant value; that is, when dividing the silence video, segments of a preset constant duration are marked off one by one, and any remaining duration shorter than that forms one segment of its own (alternatively, division may proceed from the final time back to time 0 with the preset duration, or another division scheme may be used). Limiting the segment duration prevents both the omission of effective information of the silence video caused by overly long segments and an excessive number of segments, and hence an increased amount of data to be processed, caused by overly short ones.
Step 12, extracting a frame image from each interval; the frame image is a frame image to be trained;
Each interval segment contains a plurality of RGB images (frame images); extracting one frame image from a segment means selecting one of those images as the extracted frame image.
In this way, one frame image is extracted from each interval segment after the segments are divided, avoiding the loss or excessive repetition of effective information caused by an uneven distribution of directly extracted frame images within the silence video (too long an interval between two frame images loses effective information; too short an interval repeats it).
And step 13, traversing all the silence videos to obtain the frame images to be trained.
One silence video is selected for extraction; after its extraction finishes, another silence video is selected, until all silence videos have been extracted. Traversal thus guarantees that all silence videos are processed in turn, improving video extraction efficiency.
In this way, through steps 11-13, one frame image is extracted from each interval segment after the segments are divided, avoiding the loss or excessive repetition of effective information caused by an uneven distribution of directly extracted frame images within the silence video (too long an interval between two frame images loses effective information; too short an interval repeats it).
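For illustration, a minimal Python sketch of steps 11-13 using OpenCV; the segment count, the choice of the middle frame of each segment and the file names are assumptions:

import cv2

def sample_frames(video_path, num_segments=8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for seg in range(num_segments):
        idx = int((seg + 0.5) * total / num_segments)  # e.g. middle of each segment
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)               # an H x W x 3 frame image
    cap.release()
    return frames

# Traverse all silence videos (hypothetical paths) to collect training frames.
training_frames = [sample_frames(p) for p in ["video_0.mp4", "video_1.mp4"]]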
Optionally, the output result of the living model at least includes: face presence score, living score, and attack score.
The face presence score is used to judge whether a face is present; the living score is used to judge whether that face is a living body (a living person, i.e. a real face); and the attack scores are used to judge whether that face is an attack (a spoofed face).
In this way, three kinds of scores are output in the output result, integrating face detection and living body detection into one model; a single living body detection model can detect the face and the living body simultaneously, and is simple to use, highly precise and highly secure.
Optionally, the attack score includes: a screen flip score, a printing paper score, a paper cut-out score, a cut-out mask score and a 3D model score.
The attack score covers five types, one for each of five kinds of attacks met in practice. These five major types - screen flip attacks, printing paper attacks, paper cut-out attacks, cut-out mask attacks and 3D model attacks - were obtained by extensive practice and by summarizing the attack types occurring in real life:
screen flip attack: the photographed object is a screen such as a mobile phone, computer, iPad or display screen, and the material includes pictures, videos, etc.;
printing paper attack: the photographed object is a complete sheet of printing paper of various sizes, and the materials include black-and-white paper, colored paper, glossy paper, suede paper, coated paper, etc.;
paper cut-out attack: the photographed object is printing paper cut along the contour of the human figure, with the same range of materials;
cut-out mask attack: the photographed object is cut printing paper combined with a real person, the cutting including the eyes, mouth, nose, etc., with the same range of materials;
3D model attack: the photographed object is a 3D model made of materials such as silica gel, plastic or graphite.
In this way, by finely classifying the attack types, the performance of the detection method and model on different attack types can be measured, their weaknesses on particular attack types can be remedied in time, and the goals of simple use, high precision and high security are achieved.
It should be noted that the above five major classes are only one way of classifying existing attack types; other classification standards may yield other numbers of types, for example four, three, seven or eight classes, and the specific number and kinds of classes can be adjusted to the actual situation.
When attacks are classified into five types, the output result and the labeling result are preferably set as 1×7 vectors, specifically:
y=(p,c0,c1,c2,c3,c4,c5)
wherein y represents the labeling result; p represents whether a face is present in the video, p=1 meaning a face is present and p=0 meaning no face (an output result may contain an intermediate value between 0 and 1, so a threshold can be set: above the threshold a face is considered present, below it absent); c0 represents the living score, c1 the screen flip attack score, c2 the printing paper attack score, c3 the paper cut-out attack score, c4 the cut-out mask attack score, and c5 the 3D model attack score.
Accordingly, the seven types of videos collected in the data collection and labeling stage (the living model training stage) and their corresponding labeling results are:
face-free video- (0,0,0,0,0,0,0)
Real person video (live video) — (1,1,0,0,0,0,0)
Screen flip attack video- (1,0,1,0,0,0,0)
Printing paper attack video- (1,0,0,1,0,0,0)
Paper-cut attack video- (1,0,0,0,1,0,0)
Cut-out mask attack video- (1,0,0,0,0,1,0)
3D model attack video- (1,0,0,0,0,0,1)
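For illustration, the seven labeling vectors above written as a lookup table; the key names are invented for this sketch, while the vectors come directly from the table above:

LABELS = {
    "no_face":        (0, 0, 0, 0, 0, 0, 0),
    "real_person":    (1, 1, 0, 0, 0, 0, 0),  # living video
    "screen_flip":    (1, 0, 1, 0, 0, 0, 0),
    "printing_paper": (1, 0, 0, 1, 0, 0, 0),
    "paper_cutout":   (1, 0, 0, 0, 1, 0, 0),
    "cutout_mask":    (1, 0, 0, 0, 0, 1, 0),
    "model_3d":       (1, 0, 0, 0, 0, 0, 1),
}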
Optionally, in the labeling result of the silent video, when the face existence score indicates that no face exists, the loss function is determined by the face existence score in the labeling result of the silent video and the face existence score in the output result.
That is, when the face presence score indicates that no face is present (the score is 0), the loss function is calculated only from the face presence score; the other scores in the labeling result and the output result are no longer considered.
Wherein, when the face presence score indicates that no face is present (or when p=0 in the labeling result), the loss function is:
L = (p - p̂)^2
wherein L is the value of the loss function, p is the face presence score in the labeling result of the silence video, and p̂ is the face presence score in the output result.
Optionally, in the labeling result of the silence video, when the face existence score characterizes that a face exists, the loss function is determined by the face existence score, the living body score and the attack score in the labeling result of the silence video and the face existence score, the living body score and the attack score in the output result.
That is, when the face presence score indicates that a face is present, the loss function is calculated from all the scores in the labeling result and the output result.
Wherein, when the face presence score indicates that a face is present (or when p=1 in the labeling result), the loss function is:
L = (p - p̂)^2 + ∑(c_i - ĉ_i)^2, with the sum taken over i = 0, ..., 5
wherein L is the value of the loss function, p is the face presence score in the labeling result, p̂ is the face presence score in the output result, c_i is a living score or attack score in the labeling result, and ĉ_i is the corresponding living score or attack score in the output result.
In this way, by setting the loss function differently depending on whether a face is present, a correspondence is established between the output result and the actual classification, so that the loss function reflects the accuracy of the actual classification; adjusting the parameters of the living model through the loss function then improves the accuracy and discrimination of the model and achieves a better training effect.
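For illustration, a sketch of this two-case loss; when p = 0 in the label only the face presence term contributes, and when p = 1 the squared errors of the living and attack scores are added as well:

import torch

def liveness_loss(output, label):
    # output, label: (batch, 7) tensors laid out as (p, c0, c1, ..., c5)
    p, p_hat = label[:, 0], output[:, 0]
    face_term = (p - p_hat) ** 2
    score_term = ((label[:, 1:] - output[:, 1:]) ** 2).sum(dim=1)
    # the score term only contributes for samples labeled p = 1 (face present)
    return (face_term + p * score_term).mean()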
The embodiments of the disclosure provide a face living body detection method, which may be performed by a face living body detection device; the device may be integrated into an electronic device such as a computer or a server. Fig. 4 is a flowchart of a face living body detection method according to an embodiment of the present invention; the face living body detection method comprises the following steps:
step 100, shooting a silence video of a human face, and extracting a plurality of frame images to be evaluated from the silence video;
In the face living body detection method, the specific content of step 100 may refer to the detailed descriptions of step 10 and steps 11-13 in the living model training method, and is not repeated here.
Optionally, the duration of the silence video is 1-3 s. Therefore, the shooting time of a user can be reduced, the action difficulty of shooting by the user is reduced, and the experience is improved.
Step 200, inputting the frame images to be evaluated into a preset living model to obtain an evaluation result; the living model is obtained by training with the living model training method described above;
In the face living body detection method, the specific content of step 200 may refer to the detailed description of step 30 in the living model training method, and is not repeated here.
The living model is obtained by training with the living model training method described above; the living model thus first learns and is trained, and the trained living model then judges and recognizes the frame images to be evaluated, so that they can be accurately classified.
And 300, judging the detection result of the silent video according to the evaluation result.
In this way, through steps 100-300, multiple single-frame images can be extracted from the silence video and input into the living model for evaluation and result output. Judging across multiple frame images avoids the low security level of single-frame methods, and because the silence video is captured without requiring specific actions, the excessive action complexity of dynamic methods is also avoided; the security level is high, the required action is simple, implementation is convenient, and the user experience is improved.
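For illustration, a sketch tying steps 100-300 together, reusing the sample_frames, LivenessModel and detect helpers sketched elsewhere in this description (all of them illustrative assumptions):

import torch

def face_liveness_check(video_path, model, face_threshold=0.4):
    frames = sample_frames(video_path, num_segments=8)            # step 100
    batch = torch.stack([
        torch.as_tensor(f).permute(2, 0, 1).float() / 255 for f in frames
    ]).unsqueeze(0)                                               # (1, N, 3, H, W)
    scores = model(batch)[0].tolist()                             # step 200
    return detect(scores, face_threshold)                         # step 300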
Optionally, as shown in fig. 5, the step 300 of determining the detection result of the silence video according to the evaluation result includes:
step 310, judging whether the face existence score in the evaluation result is smaller than a preset face threshold value;
A face presence score not smaller than the face threshold means the probability that a face is present is high, so a face can be considered present in the silence video; a face presence score smaller than the face threshold means the probability is very low, so the silence video can be considered to contain no face.
Optionally, the value range of the face threshold is 0.3-0.5. Therefore, the face existence score can be accurately judged, and whether the face exists in the silent video or not can be determined.
Step 320, if it is smaller than the face threshold, the living body detection of the silence video fails;
The silence video is shot by the user; if no face is present in it, the shooting itself was problematic. Since no face appears in the video, there can be no real face in it, so the living body detection of the silence video is judged to have failed.
Step 330, if it is not smaller than the face threshold, judging whether the living score is the highest among the living score and the attack scores;
Step 340, if it is the highest score, the living body detection of the silence video passes;
Step 350, if it is not the highest score, the living body detection of the silence video fails.
Only when a face exists in the silence video can it further be judged whether that face is a real face or a spoofed face.
The judgment consists of determining which of the living score and the attack scores is the highest: if the living score is the highest, the face in the silence video is a real face and the living body detection passes; if an attack score is the highest, the face in the silence video is a spoofed face and the living body detection fails.
Optionally, as described above, the attack score includes: a screen flip score, a printing paper score, a paper cut-out score, a cut-out mask score and a 3D model score.
It should be noted that the above five major classes are only one way of classifying existing attack types; other classification standards may yield other numbers of types, for example four, three, seven or eight classes, and the specific number and kinds of classes can be adjusted to the actual situation.
As described above, the evaluation result is:
y=(p,c0,c1,c2,c3,c4,c5)
The living score and the attack scores are c0, c1, c2, c3, c4 and c5. The judgment consists of finding the highest among c0-c5: if the living score c0 is the highest, the face in the silence video is a real face and the living body detection passes; if one of the attack scores c1-c5 is the highest, the face in the silence video is a spoofed face and the living body detection fails. Among the attack scores, c1 represents the screen flip attack score, c2 the printing paper attack score, c3 the paper cut-out attack score, c4 the cut-out mask attack score and c5 the 3D model attack score; whichever attack score is the highest indicates the attack type of the spoofed face in the silence video.
In this way, through steps 310-350, the attack types are finely classified; by measuring the performance of the detection method and model on different attack types, their weaknesses can be remedied in time and the attack type is obtained directly, achieving simple use, high precision and high security.
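For illustration, a sketch of the decision logic of steps 310-350; the 0.4 face threshold is one value within the 0.3-0.5 range suggested above:

def detect(scores, face_threshold=0.4):
    # scores: the (p, c0, c1, ..., c5) evaluation vector
    p, c = scores[0], scores[1:]
    if p < face_threshold:
        return False                 # no face detected: liveness fails
    # pass only if the living score c0 is the highest of c0..c5
    return max(range(len(c)), key=lambda i: c[i]) == 0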
The embodiments of the present disclosure provide a living model training apparatus for performing the living model training method according to the above-described aspects of the present disclosure, which is described in detail below.
FIG. 6 is a block diagram of a living model training device according to an embodiment of the present invention; wherein the living model training device comprises:
an acquiring unit 1, configured to acquire a plurality of silence videos, and extract a plurality of frame images to be trained from each silence video;
the model unit 2 is used for inputting the plurality of frame images to be trained into a preset living model to obtain an output result corresponding to the silent video;
a calculating unit 3, configured to calculate a value of a loss function according to the output result and a labeling result of the silence video;
an adjusting unit 4 for adjusting parameters of the living model according to the value of the loss function until the value of the loss function converges.
In this way, multiple single-frame images can be extracted from the silence video and input into the living model for training and result output. Judging across multiple frame images avoids the low security level of single-frame methods, and because the silence video is captured without requiring specific actions, the excessive action complexity of dynamic methods is also avoided; the security level is high, the required action is simple, implementation is convenient, and the user experience is improved.
Optionally, the living model is a neural network model.
Optionally, the acquiring unit 1 is further configured to: and obtaining a labeling result according to the silent video.
Optionally, the acquiring unit 1 is further configured to: acquiring one silence video, and dividing the silence video into a plurality of interval sections; extracting a frame image from each interval; the frame image is a frame image to be trained; and traversing all the silent videos to obtain a frame image to be trained.
Optionally, the output result of the living model at least includes: face presence score, living score, and attack score.
Optionally, the attack score includes: a screen flip score, a printing paper score, a paper cut-out score, a cut-out mask score and a 3D model score.
Optionally, in the labeling result of the silent video, when the face existence score indicates that no face exists, the loss function is determined by the face existence score in the labeling result of the silent video and the face existence score in the output result.
Optionally, in the labeling result of the silence video, when the face existence score characterizes that a face exists, the loss function is determined by the face existence score, the living body score and the attack score in the labeling result of the silence video and the face existence score, the living body score and the attack score in the output result.
The embodiment of the disclosure provides a face living body detection device, which is used for executing the face living body detection method disclosed by the invention, and the face living body detection device is described in detail below.
As shown in fig. 7, it is a block diagram of the structure of the face living body detection apparatus according to the embodiment of the present invention; wherein, human face living body detection device includes:
a shooting unit 5, configured to shoot a silence video of a face, and extract a plurality of frame images to be evaluated from the silence video;
the evaluation unit 6 is used for inputting the frame images to be evaluated into a preset living model to obtain an evaluation result; the living model is obtained by training with the living model training method described above;
And the judging unit 7 is used for judging the detection result of the silent video according to the evaluation result.
In this way, multiple single-frame images can be extracted from the silence video and input into the living model for evaluation and result output. Judging across multiple frame images avoids the low security level of single-frame methods, and because the silence video is captured without requiring specific actions, the excessive action complexity of dynamic methods is also avoided; the security level is high, the required action is simple, implementation is convenient, and the user experience is improved.
Optionally, the living model is a neural network model.
Optionally, the output result of the living model at least includes: face presence score, living score, and attack score.
Optionally, the attack score includes: a screen flip score, a printing paper score, a paper cut-out score, a cut-out mask score and a 3D model score.
Optionally, the judging unit 7 is further configured to: judge whether the face presence score in the evaluation result is smaller than a preset face threshold; if it is smaller than the face threshold, the living body detection of the silence video fails; if it is not smaller than the face threshold, judge whether the living score is the highest among the living score and the attack scores; if it is the highest score, the living body detection of the silence video passes; if it is not the highest score, the living body detection of the silence video fails.
It should be noted that the above-described embodiment of the apparatus is merely illustrative, for example, the division of the units is merely a logic function division, and there may be another division manner when actually implemented, and for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The internal functions and structures of the living model training device and the face living body detection device are described above. As shown in fig. 8, in practice, the living model training device and the face living body detection device may each be implemented as an electronic device comprising a processor and a memory, wherein the memory stores a control program which, when executed by the processor, implements the above living model training method or the above face living body detection method.
Fig. 9 is a block diagram of another electronic device, shown in accordance with an embodiment of the present invention. For example, the electronic device 800 may be a computer, a server, a terminal, a digital broadcast terminal, a messaging device, or the like.
The electronic device 12 shown in fig. 9 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in fig. 9, the electronic device 12 may be implemented in the form of a general-purpose electronic device. Components of the electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer-readable storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in the figure and commonly referred to as a "hard disk drive"). Although not shown in fig. 9, a magnetic disk drive for reading from and writing to a removable nonvolatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable nonvolatile optical disk (e.g., a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM), or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
Electronic device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the electronic device 12, and/or with any device (e.g., a network card, a modem, etc.) that enables the electronic device 12 to communicate with one or more other electronic devices. Such communication may occur through an input/output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks, such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet, via the network adapter 20. As shown, the network adapter 20 communicates with the other modules of the electronic device 12 over the bus 18. It should be noted that, although not shown, other hardware and/or software modules may be used in connection with the electronic device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the methods mentioned in the foregoing embodiments.
The electronic device of the invention can be a server or a terminal device with limited computing power, and the lightweight network structure of the invention is particularly suitable for the latter. Typical implementations of the terminal device include, but are not limited to: intelligent mobile communication terminals, unmanned aerial vehicles, robots, portable image processing devices, security devices, and the like.
The embodiment of the disclosure provides a computer readable storage medium storing instructions which, when loaded and executed by a processor, implement the living model training method or the face living body detection method.
The technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, and the like.
Although the present disclosure has been described above, its scope of protection is not limited thereto. Various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the disclosure, and such changes and modifications fall within the scope of the invention.

Claims (13)

1. A living model training method, comprising:
acquiring a plurality of silence videos, and extracting a plurality of frame images to be trained from each silence video;
inputting the plurality of frame images to be trained into a preset living model to obtain an output result corresponding to the silence video, wherein the output result of the living model at least comprises: a face presence score, a living score, and an attack score;
calculating the value of a loss function according to the output result and the labeling result of the silence video; and
adjusting parameters of the living model according to the value of the loss function until the value of the loss function converges;
wherein, in the labeling result of the silence video, when the face presence score indicates that a face is present, the loss function is determined by the face presence score, the living score, and the attack score in the labeling result of the silence video, and by the face presence score, the living score, and the attack score in the output result; in this case, the loss function is as follows:
$$L = -\left[\, p\log\hat{p} + (1-p)\log(1-\hat{p}) \,\right] - \sum_{i} c_i \log \hat{c}_i$$

where L is the value of the loss function, p is the face presence score in the labeling result of the silence video, \hat{p} is the face presence score in the output result, c_i is the living score or attack score in the labeling result, and \hat{c}_i is the living score or attack score in the output result.
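Purely as an illustration (not part of the claims), the following PyTorch-style sketch computes the loss above under the reconstructed cross-entropy form; the function name, the probability-valued scores, and the one-hot label vector are assumptions not stated in the source:

```python
import torch

def liveness_loss(p_hat, c_hat, p, c, eps=1e-7):
    """Training loss for one silence video (illustrative sketch).

    p_hat: predicted face presence score, 0-dim tensor in (0, 1)
    c_hat: predicted living/attack scores, tensor of shape (num_classes,)
    p:     labeled face presence score as a float (1.0 if a face is present)
    c:     labeled one-hot living/attack scores, tensor of shape (num_classes,)
    """
    p_hat = p_hat.clamp(eps, 1 - eps)
    c_hat = c_hat.clamp(eps, 1 - eps)
    # Binary cross-entropy on the face presence score (applied in both cases).
    loss = -(p * torch.log(p_hat) + (1 - p) * torch.log(1 - p_hat))
    if p > 0:
        # A face is labeled as present: add the cross-entropy over the
        # living score and the attack scores (claim 1). When no face is
        # present, only the presence term remains (claim 6 below).
        loss = loss - (c * torch.log(c_hat)).sum()
    return loss
```

With the five attack types of claim 5, c and c_hat would each have six entries: one living score followed by five attack scores.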
2. The living model training method of claim 1, wherein the living model is a neural network model.
3. The living model training method of claim 1, wherein the acquiring of a plurality of silence videos and the extracting of a plurality of frame images to be trained from each silence video further comprises:
obtaining a labeling result according to the silence video.
4. The living model training method of claim 1, wherein the acquiring of a plurality of silence videos and the extracting of a plurality of frame images to be trained from each silence video comprises:
acquiring one silence video, and dividing the silence video into a plurality of intervals;
extracting one frame image from each interval, the extracted frame image being a frame image to be trained; and
traversing all the silence videos to obtain the frame images to be trained.
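As an illustration of this per-interval sampling, here is a minimal sketch assuming OpenCV; the interval count of 8 and the choice of the middle frame of each interval are assumptions, since the claim allows any frame from each interval:

```python
import cv2

def extract_training_frames(video_path, num_intervals=8):
    """Split a silence video into equal intervals and take one frame from each."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_intervals):
        # Take the middle frame of interval i; any frame in the interval works.
        idx = int((i + 0.5) * total / num_intervals)
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```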
5. The living model training method of claim 1, wherein the attack score comprises: a screen recapture score, a printing paper score, a paper cutout score, a cutout mask score, and a 3D model score.
6. The living model training method according to claim 1, wherein, in the labeling result of the silence video, when the face presence score indicates that no face is present, the loss function is determined only by the face presence score in the labeling result of the silence video and the face presence score in the output result.
7. A face living body detection method, characterized by comprising:
shooting a silence video of a face, and extracting a plurality of frame images to be evaluated from the silence video;
inputting the frame images to be evaluated into a preset living model to obtain an evaluation result, the living model being obtained by training with the living model training method according to any one of claims 1 to 6; and
judging the detection result of the silence video according to the evaluation result.
8. The face living body detection method according to claim 7, wherein the duration of the silence video is 1-3 s.
9. The face living body detection method according to claim 7, wherein the judging of the detection result of the silence video according to the evaluation result comprises:
judging whether the face presence score in the evaluation result is smaller than a preset face threshold;
if it is smaller than the face threshold, the living body detection of the silence video fails;
if it is not smaller than the face threshold, judging whether the living score is the highest score among the living score and the attack scores;
if the living score is the highest score, the living body detection of the silence video passes; and
if the living score is not the highest score, the living body detection of the silence video fails.
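A minimal sketch of this decision rule follows; the 0.5 threshold and the convention that index 0 of the score list holds the living score are assumptions, as the claim only requires comparison against a preset face threshold:

```python
def judge_silence_video(face_score, class_scores, face_threshold=0.5):
    """Return True when live-face detection passes for the silence video."""
    # Step 1: a face must be detected at all.
    if face_score < face_threshold:
        return False
    # Step 2: the living score must be the highest among the living
    # score and all attack scores (a tie is treated as a pass here).
    living_score = class_scores[0]
    return living_score == max(class_scores)
```

In practice the face threshold would be tuned per deployment; the sketch simply mirrors the two-stage judgment of the claim.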
10. A living model training device, comprising:
an acquisition unit (1) for acquiring a plurality of silence videos and extracting a plurality of frame images to be trained from each silence video;
a model unit (2) for inputting the plurality of frame images to be trained into a preset living model to obtain an output result corresponding to the silence video, wherein the output result of the living model at least comprises: a face presence score, a living score, and an attack score;
a calculating unit (3) for calculating the value of a loss function according to the output result and the labeling result of the silence video; and
an adjustment unit (4) for adjusting parameters of the living model according to the value of the loss function until the value of the loss function converges;
wherein, in the labeling result of the silence video, when the face presence score indicates that a face is present, the loss function is determined by the face presence score, the living score, and the attack score in the labeling result of the silence video, and by the face presence score, the living score, and the attack score in the output result; in this case, the loss function is as follows:
$$L = -\left[\, p\log\hat{p} + (1-p)\log(1-\hat{p}) \,\right] - \sum_{i} c_i \log \hat{c}_i$$

where L is the value of the loss function, p is the face presence score in the labeling result of the silence video, \hat{p} is the face presence score in the output result, c_i is the living score or attack score in the labeling result, and \hat{c}_i is the living score or attack score in the output result.
11. A human face living body detection apparatus, characterized by comprising:
a shooting unit (5) for shooting a silence video of a human face and extracting a plurality of frame images to be evaluated from the silence video;
an evaluation unit (6) for inputting the frame images to be evaluated into a preset living model to obtain an evaluation result, the living model being obtained by training with the living model training method according to any one of claims 1 to 6; and
a judging unit (7) for judging the detection result of the silence video according to the evaluation result.
12. An electronic device comprising a processor and a memory, characterized in that the memory stores a control program which, when executed by the processor, implements the in-vivo model training method according to any one of claims 1-6 or implements the face in-vivo detection method according to any one of claims 7-9.
13. A computer readable storage medium storing instructions which when loaded and executed by a processor implement the in-vivo model training method of any one of claims 1-6 or the face in-vivo detection method of any one of claims 7-9.
CN202010012081.2A 2020-01-07 2020-01-07 Living model training and human face living body detection method and device and electronic equipment Active CN111209863B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010012081.2A CN111209863B (en) 2020-01-07 2020-01-07 Living model training and human face living body detection method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111209863A CN111209863A (en) 2020-05-29
CN111209863B true CN111209863B (en) 2023-12-15

Family

ID=70789573

Country Status (1)

Country Link
CN (1) CN111209863B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884001B (en) * 2021-01-15 2024-03-05 广东省特种设备检测研究院珠海检测院 Automatic grading method and system for graphitization of carbon steel

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670413A (en) * 2018-11-30 2019-04-23 腾讯科技(深圳)有限公司 Face living body verification method and device
CN110298230A (en) * 2019-05-06 2019-10-01 深圳市华付信息技术有限公司 Silent biopsy method, device, computer equipment and storage medium
CN110378219A (en) * 2019-06-13 2019-10-25 北京迈格威科技有限公司 Biopsy method, device, electronic equipment and readable storage medium storing program for executing
CN110427899A (en) * 2019-08-07 2019-11-08 网易(杭州)网络有限公司 Video estimation method and device, medium, electronic equipment based on face segmentation
CN110633647A (en) * 2019-08-21 2019-12-31 阿里巴巴集团控股有限公司 Living body detection method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805047B (en) * 2018-05-25 2021-06-25 北京旷视科技有限公司 Living body detection method and device, electronic equipment and computer readable medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Video-Based Face Recognition Using Ensemble of Haar-Like Deep Convolutional Neural Networks; Mostafa Parchami et al.; 2017 International Joint Conference on Neural Networks (IJCNN); full text *
Research on Face Detection and Recognition Methods in Video; 武警贺; China Master's Theses Full-text Database, Information Science and Technology, No. 08, 2019; full text *

Also Published As

Publication number Publication date
CN111209863A (en) 2020-05-29

Similar Documents

Publication Publication Date Title
EP3564854B1 (en) Facial expression recognition method, apparatus, electronic device, and storage medium
Korshunov et al. Vulnerability assessment and detection of deepfake videos
Atoum et al. Face anti-spoofing using patch and depth-based CNNs
EP3916627A1 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN108334848B (en) Tiny face recognition method based on generation countermeasure network
US11151397B2 (en) Liveness testing methods and apparatuses and image processing methods and apparatuses
CN109858371B (en) Face recognition method and device
CN106557726B (en) Face identity authentication system with silent type living body detection and method thereof
WO2020258667A1 (en) Image recognition method and apparatus, and non-volatile readable storage medium and computer device
Kähm et al. 2d face liveness detection: An overview
WO2019061658A1 (en) Method and device for positioning eyeglass, and storage medium
TW202036463A (en) Living body detection method, device, apparatus, and storage medium
CN109815797B (en) Living body detection method and apparatus
CN109934112B (en) Face alignment method and camera
CN111222433B (en) Automatic face auditing method, system, equipment and readable storage medium
WO2020159437A1 (en) Method and system for face liveness detection
KR20150128510A (en) Apparatus and method for liveness test, and apparatus and method for image processing
CN109657627A (en) Auth method, device and electronic equipment
Mady et al. Efficient real time attendance system based on face detection case study “MEDIU staff”
CN111209863B (en) Living model training and human face living body detection method and device and electronic equipment
Raghavendra et al. Robust 2D/3D face mask presentation attack detection scheme by exploring multiple features and comparison score level fusion
US20210049366A1 (en) Detecting Fake Videos
CN113468954B (en) Face counterfeiting detection method based on local area features under multiple channels
CN113723310B (en) Image recognition method and related device based on neural network
CN111126283A (en) Rapid in-vivo detection method and system for automatically filtering fuzzy human face

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant