WO2021179719A1 - Face detection method, apparatus, medium, and electronic device - Google Patents

Face detection method, apparatus, medium, and electronic device

Info

Publication number
WO2021179719A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
video stream
stream data
shaking
coordinates
Prior art date
Application number
PCT/CN2020/135548
Other languages
French (fr)
Chinese (zh)
Inventor
蔡中印
陆进
陈斌
宋晨
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021179719A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; localisation; normalisation
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive

Definitions

  • This application relates to the field of artificial intelligence as applied to face recognition, and in particular to a face liveness detection method, apparatus, medium, and electronic device.
  • Action-based liveness detection is one of the important means of liveness detection. Several actions are randomly selected from head shaking, nodding, opening and closing the mouth, opening and closing the eyes, and so on, and instructions are sent to the user; the user performs the corresponding actions in front of the camera, and the video recorded by the camera is then obtained and analyzed to produce the detection result. Shaking the head is one of the key actions in action-based liveness detection.
  • However, a new attack on liveness detection has emerged: following the instructions, an attacker shakes a sheet of paper or a head model bearing a face to simulate the head-shaking action. Current liveness detection methods cannot identify this tactic, resulting in low detection accuracy and high security risk.
  • The purpose of this application is to provide a face liveness detection method, apparatus, medium, and electronic device.
  • According to one aspect of this application, a face liveness detection method is provided. The method includes: inputting the face region image corresponding to the head-shake video stream data to be subjected to liveness detection into a preset recognition model, to obtain the face key point coordinates and the eye gaze offset vector output by the preset recognition model, where the preset recognition model is a face key point detection model combined with an eye gaze offset vector output layer, the face key point detection model includes convolutional layers, the eye gaze offset vector output layer is connected to the last convolutional layer of the face key point detection model, the face key point coordinates and the eye gaze offset vector each correspond to the face image frames included in the head-shake video stream data, and the eye gaze offset vector measures the degree to which the eyes' gaze deviates while the head is shaken; and determining, according to the face key point coordinates and the eye gaze offset vector corresponding to each face image frame, whether the head-shake video stream data passes the current stage of liveness detection.
  • According to another aspect of this application, a face liveness detection apparatus is provided. The apparatus includes: an input module configured to input the face region image corresponding to the head-shake video stream data to be subjected to liveness detection into a preset recognition model, to obtain the face key point coordinates and the eye gaze offset vector output by the preset recognition model, where the preset recognition model is a face key point detection model combined with an eye gaze offset vector output layer, the face key point detection model includes convolutional layers, the eye gaze offset vector output layer is connected to the last convolutional layer of the face key point detection model, the face key point coordinates and the eye gaze offset vector each correspond to the face image frames included in the head-shake video stream data, and the eye gaze offset vector measures the degree to which the eyes' gaze deviates while the head is shaken; and a judgment module configured to determine, according to the face key point coordinates and the eye gaze offset vector corresponding to each face image frame, whether the head-shake video stream data passes the current stage of liveness detection.
  • According to another aspect of this application, a computer-readable storage medium is provided. It stores computer-readable instructions that, when executed by a computer, cause the computer to perform the following method: inputting the face region image corresponding to the head-shake video stream data to be subjected to liveness detection into a preset recognition model, to obtain the face key point coordinates and the eye gaze offset vector output by the preset recognition model, where the preset recognition model is a face key point detection model combined with an eye gaze offset vector output layer, the face key point detection model includes convolutional layers, the eye gaze offset vector output layer is connected to the last convolutional layer of the face key point detection model, the face key point coordinates and the eye gaze offset vector each correspond to the face image frames included in the head-shake video stream data, and the eye gaze offset vector measures the degree to which the eyes' gaze deviates while the head is shaken; and determining, according to the face key point coordinates and the eye gaze offset vector corresponding to each face image frame, whether the head-shake video stream data passes the current stage of liveness detection.
  • According to another aspect of this application, an electronic device is provided. It includes a processor and a memory storing computer-readable instructions that, when executed by the processor, implement the following method: inputting the face region image corresponding to the head-shake video stream data to be subjected to liveness detection into a preset recognition model, to obtain the face key point coordinates and the eye gaze offset vector output by the preset recognition model, where the preset recognition model is a face key point detection model combined with an eye gaze offset vector output layer, the face key point detection model includes convolutional layers, the eye gaze offset vector output layer is connected to the last convolutional layer of the face key point detection model, the face key point coordinates and the eye gaze offset vector each correspond to the face image frames included in the head-shake video stream data, and the eye gaze offset vector measures the degree to which the eyes' gaze deviates while the head is shaken; and determining, according to the face key point coordinates and the eye gaze offset vector corresponding to each face image frame, whether the head-shake video stream data passes the current stage of liveness detection.
  • This application uses a face key point detection model combined with an eye gaze offset vector output layer to compute the eye gaze offset vector corresponding to the face region image, and uses that vector for face liveness detection. During liveness detection, fraud that shakes a sheet of paper or a head model bearing a face can therefore be identified, improving the accuracy of liveness detection and reducing security risk.
  • Fig. 1 is a schematic diagram of a system architecture for a face liveness detection method according to an exemplary embodiment.
  • Fig. 2 is a flowchart of a face liveness detection method according to an exemplary embodiment.
  • Fig. 3 is a schematic diagram of at least part of the structure of the preset recognition model used in a face liveness detection method according to an exemplary embodiment.
  • Fig. 4 is a flowchart of the steps preceding step 240 of the embodiment corresponding to Fig. 2.
  • Fig. 5 is a block diagram of a face liveness detection apparatus according to an exemplary embodiment.
  • Fig. 6 is a block diagram of an example electronic device implementing the above face liveness detection method according to an exemplary embodiment.
  • Fig. 7 shows a computer-readable storage medium implementing the above face liveness detection method according to an exemplary embodiment.
  • The technical solution of this application can be applied in the fields of artificial intelligence, smart cities, blockchain, and/or big data to realize liveness detection.
  • The data involved in this application, such as video stream data and/or face region images, may be stored in a database or in a blockchain (for example, via distributed blockchain storage); this application does not limit this.
  • Face liveness detection mainly refers to the process of judging, from a recorded video containing a face, whether that face is a live one.
  • Face liveness detection is one of the important technical means in the field of identity verification.
  • Action-based liveness detection is an important part of face liveness detection.
  • During action-based liveness detection, the user performs actions as directed by voice or text instructions, mainly head shaking, nodding, opening and closing the mouth, and opening and closing the eyes; alternatively, no instructions are issued and the user's actions are observed at random.
  • The implementation terminal of this application can be any device with computing, processing, and storage capabilities.
  • The device can be connected to external devices to receive or send data.
  • It can be a portable mobile device, such as a smartphone, tablet, notebook computer, or PDA (Personal Digital Assistant); a fixed device, such as computer equipment, a field terminal, a desktop computer, a server, or a workstation; or a collection of devices, such as the physical infrastructure of a cloud computing platform or a server cluster.
  • Optionally, the implementation terminal of this application is a server or the physical infrastructure of a cloud computing platform.
  • Fig. 1 is a schematic diagram of a system architecture for a face liveness detection method according to an exemplary embodiment.
  • The system architecture includes a server 110 and a mobile terminal 120.
  • The mobile terminal 120 may be, for example, a smartphone.
  • The mobile terminal 120 is connected to the server 110 through a communication link, so it can send data to, and receive data from, the server 110.
  • The server 110 runs a server-side program and hosts the preset recognition model; client software is installed and running on the mobile terminal 120. The server 110 is the implementation terminal in this embodiment.
  • A specific process may be as follows: by operating the client software on the mobile terminal 120, the user records head-shake video stream data and uploads it to the server 110; after receiving it, the server 110 runs the server-side program to extract the face region images from the head-shake video stream data; the server 110 then inputs the face region images into the preset recognition model to obtain the face key point coordinates and eye gaze offset vectors output by the model; finally, the server 110 uses those outputs to judge, and to output, the detection result for the current stage of liveness detection.
  • Fig. 1 shows only one embodiment of this application.
  • Although in this embodiment the implementation terminal is a server and the terminal providing the head-shake video stream data is a mobile terminal, in other embodiments or in practice the implementation terminal and the terminal providing the head-shake video stream data can be any of the terminals or devices described above.
  • Likewise, although in this embodiment the head-shake video stream data is sent from a terminal other than the implementation terminal, it can in fact be obtained directly by the local terminal.
  • This application does not limit this, and the protection scope of this application should not be restricted in any way.
  • Fig. 2 is a flowchart of a face liveness detection method according to an exemplary embodiment.
  • The face liveness detection method provided in this embodiment can be executed by a server and, as shown in Fig. 2, includes the following steps.
  • Step 240: Input the face region image corresponding to the head-shake video stream data to be subjected to liveness detection into a preset recognition model, and obtain the face key point coordinates and the eye gaze offset vector output by the preset recognition model.
  • The preset recognition model is a face key point detection model combined with an eye gaze offset vector output layer.
  • The face key point detection model includes convolutional layers.
  • The eye gaze offset vector output layer is connected to the last convolutional layer of the face key point detection model, and the face key point coordinates and the eye gaze offset vector each correspond to the face image frames included in the head-shake video stream data.
  • The eye gaze offset vector measures the degree to which the eyes' gaze deviates while the head is shaken.
  • The face key point coordinates and the eye gaze offset vector correspond to each face image frame; that is, for each face image frame there is a corresponding set of face key point coordinates and an eye gaze offset vector.
  • The eye gaze offset vector has a direction and a length: for example, gaze toward the left is positive and toward the right is negative, and the length can be defined as the normalized relative deviation of the pupil from the center of the eye socket.
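  • As a concrete illustration of that convention, here is a minimal sketch; the function name and the use of 2D landmark coordinates are illustrative assumptions, not details given by the application:

```python
import numpy as np

def eye_gaze_offset(pupil: np.ndarray, socket_left: np.ndarray,
                    socket_right: np.ndarray) -> float:
    """Signed, normalized offset of the pupil from the eye-socket center.

    Positive when the gaze points toward the left socket corner, negative
    toward the right; the magnitude is the pupil's displacement normalized
    by half the socket width, matching the convention described above.
    """
    center = (socket_left + socket_right) / 2.0
    half_width = np.linalg.norm(socket_left - socket_right) / 2.0
    axis = (socket_left - socket_right) / (2.0 * half_width)  # unit "left" axis
    return float(np.dot(pupil - center, axis)) / half_width

# Usage: a pupil one tenth of a half-socket-width left of center -> 0.1
# eye_gaze_offset(np.array([31., 20.]), np.array([40., 20.]), np.array([20., 20.]))
```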
  • Fig. 3 is a schematic diagram of at least part of the structure of the preset recognition model used in a face liveness detection method according to an exemplary embodiment.
  • The preset recognition model 300 includes at least a face key point detection model 310 and an eye gaze offset vector output layer 320.
  • The part framed by the dashed line is the face key point detection model 310, consisting of the convolutional layer 311 and the output part 312 that follows it.
  • The convolutional layer 311 can be a stack of multiple neural network layers.
  • The output part 312 ultimately outputs the face key point coordinates.
  • Other structures may exist before the convolutional layer 311 and between the network layers within it.
  • The eye gaze offset vector output layer 320 receives the output of the last convolutional layer and finally outputs the eye gaze offset vector corresponding to the face image frame.
  • The eye gaze offset vector output layer 320 is usually a fully connected layer.
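  • A minimal sketch of that structure follows, in PyTorch; the channel widths, the number of key points, and the two-component gaze output are illustrative assumptions rather than values given by the application:

```python
import torch
import torch.nn as nn

class PresetRecognitionModel(nn.Module):
    """Sketch of the structure in Fig. 3: a convolutional face key point
    detection backbone whose last convolutional layer also feeds a fully
    connected eye gaze offset vector output layer."""

    def __init__(self, num_keypoints: int = 68):
        super().__init__()
        # Convolutional layer 311: a stacked multilayer structure.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Output part 312: regresses (x, y) for each face key point.
        self.keypoint_head = nn.Linear(128, num_keypoints * 2)
        # Output layer 320: fully connected, fed by the last conv layer,
        # producing the eye gaze offset vector (here one value per eye).
        self.gaze_head = nn.Linear(128, 2)

    def forward(self, face_region: torch.Tensor):
        features = self.conv(face_region).flatten(1)
        return self.keypoint_head(features), self.gaze_head(features)
```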
  • Fig. 4 is a flowchart of the steps preceding step 240 of the embodiment corresponding to Fig. 2. Referring to Fig. 4, these include the following steps.
  • Step 210: Deframe the head-shake video stream data to be subjected to liveness detection, obtaining the face image frames corresponding to that video stream data.
  • Deframing the head-shake video stream data is the process of splitting it into individual face image frames.
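  • For example, a minimal deframing sketch using OpenCV (the function name is an assumption):

```python
import cv2

def deframe(video_path: str) -> list:
    """Split head-shake video stream data into its image frames."""
    frames = []
    capture = cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:          # end of the video stream
            break
        frames.append(frame)
    capture.release()
    return frames
```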
  • In one embodiment, before deframing the head-shake video stream data to be subjected to liveness detection, the method further includes: obtaining, from a user terminal, the head-shake video stream data to be subjected to liveness detection.
  • In one embodiment, before obtaining the head-shake video stream data from the user terminal, the method further includes: randomly selecting one preset action instruction from a plurality of preset action instructions, which include shaking the head, and sending the selected instruction to the user terminal. The head-shake video stream data is then obtained from the user terminal on the condition that the selected preset action instruction is to shake the head.
  • Step 220: Input a face image frame into a preset face detection model, and obtain the face detection frame coordinates corresponding to that face image frame.
  • The pixel area of a face image frame may be very large, while the face may occupy only a small part of it; to detect the face accurately, the region of the frame corresponding to the face must be identified specifically.
  • The face detection frame coordinates are the position coordinates of the region corresponding to the face within the face image frame.
  • The preset face detection model outputs the corresponding face detection frame coordinates for an input face image frame.
  • The preset face detection model can be implemented with various algorithms or principles, for example general machine learning algorithms or deep learning algorithms.
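  • As one possible stand-in for such a model, a classical detector can produce face detection frame coordinates; OpenCV's bundled Haar cascade is used here purely as an example, since the application does not prescribe a specific detector:

```python
import cv2

# A stand-in for the preset face detection model (illustrative only).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face_box(frame):
    """Return (x, y, w, h) for the most prominent face, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        return None
    return max(boxes, key=lambda b: b[2] * b[3])   # largest detection
```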
  • Step 230: Extract a face region image from the face image frame according to the face detection frame coordinates.
  • In one embodiment, extracting the face region image from the face image frame according to the face detection frame coordinates includes: determining, in the face image frame, the first face detection frame area corresponding to the face detection frame coordinates; expanding the first face detection frame area by a predetermined expansion ratio to obtain a second face detection frame area; and extracting the face region image from the range defined by the second face detection frame area.
  • The first face detection frame area can be a rectangle, and the face detection frame coordinates are coordinates that uniquely determine that rectangle: they can be the coordinates of the rectangle's four vertices, which fix its range, or the coordinates of the intersection of its two diagonals, which, together with a known length and width, also determine the corresponding rectangle.
  • The predetermined expansion ratio is the proportion by which the coverage area is enlarged beyond the original area.
  • It can be any predetermined ratio, such as 20%.
  • The expansion of the first face detection frame area can be performed in various ways or directions, for example from the center outward, to the left and right, up and down, toward the upper right, toward the lower left, and so on. After the expansion, the second face detection frame area is larger than the first face detection frame area.
  • The face region image is therefore not extracted directly from the range defined by the first face detection frame area; instead, the first face detection frame area is first expanded into the second face detection frame area, and the face region image is extracted from the range the latter defines. This makes the extracted face region image large enough to retain more information about the face, which improves the liveness detection result to a certain extent.
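  • As an illustration, a center-outward expansion by the 20% ratio mentioned above might look like the following sketch; the function name and the clipping to image bounds are assumptions:

```python
def expand_box(x1, y1, x2, y2, ratio=0.20, width=None, height=None):
    """Expand a face detection box from its center by the given ratio
    (20% here), optionally clipped to the image bounds."""
    dw = (x2 - x1) * ratio / 2.0
    dh = (y2 - y1) * ratio / 2.0
    x1, y1, x2, y2 = x1 - dw, y1 - dh, x2 + dw, y2 + dh
    if width is not None:
        x1, x2 = max(0, x1), min(width, x2)
    if height is not None:
        y1, y2 = max(0, y1), min(height, y2)
    return x1, y1, x2, y2

# Usage: crop the second (expanded) face detection frame area.
# x1, y1, x2, y2 = expand_box(100, 80, 220, 240, 0.20, frame_w, frame_h)
# face_region = frame[int(y1):int(y2), int(x1):int(x2)]
```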
  • In one embodiment, inputting the face image frames into the preset face detection model to obtain the face detection frame coordinates includes: inputting each face image frame into the preset face detection model to obtain the face detection frame coordinates corresponding to each frame; and extracting the face region images according to the face detection frame coordinates includes: extracting a face region image from each face image frame according to that frame's face detection frame coordinates.
  • Extracting the face region image from the face image frame is a matting (cropping) operation within the frame.
  • In this embodiment, every face region image is obtained by first determining the face detection frame coordinates with the preset face detection model and then extracting according to those coordinates.
  • In another embodiment, inputting the face image frames into the preset face detection model to obtain face detection frame coordinates includes: inputting at least one face image frame into the preset face detection model to obtain the first face detection frame coordinates corresponding to each such frame; and extracting face region images according to face detection frame coordinates includes: extracting, for each set of first face detection frame coordinates, the corresponding first face region image from its face image frame; inputting each first face region image into the preset recognition model to obtain its face key point coordinates and eye gaze offset vector; determining the circumscribed rectangle of the face corresponding to each first face region image's face key point coordinates; determining, from the circumscribed rectangles and a preset estimation algorithm, the second face detection frame coordinates corresponding to the face image frames that follow the at least one face image frame; and extracting face region images from those frames according to the determined coordinates.
  • The circumscribed rectangle of the face is the rectangle that just covers the face area; at least some points on the edge of the face area lie on it.
  • The preset estimation algorithm may be any algorithm capable of estimating or calculating the motion state of the face, for example a Kalman filter.
  • The Kalman filter, also described by the Kalman motion equations, is an algorithm that uses the state equations of a linear system, together with observations of the system's inputs and outputs, to optimally estimate the system state. Specifically, by feeding the circumscribed rectangles of the faces in at least one preceding face image frame into the Kalman motion equations, the second face detection frame coordinates for the current or subsequent face image frames can be predicted.
  • Two methods are thus used to determine the face detection frame coordinates corresponding to a face image frame. For at least one initial face image frame, the frame is input into the preset face detection model to obtain the face detection frame coordinates, and the corresponding face region image is extracted accordingly. For the current or subsequent face image frames, the coordinates are determined from the previously extracted face region images: each is input into the preset recognition model to obtain face key point coordinates, the corresponding circumscribed rectangle of the face is determined from those coordinates, and finally the circumscribed rectangles are input into the preset estimation algorithm to determine the second face detection frame coordinates for the current and subsequent face image frames.
  • Compared with running the face detection model on every frame, this method consumes less computing resources and is more efficient.
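  • A minimal sketch of the estimation step follows, using OpenCV's Kalman filter over the box center and size; the constant-velocity state layout and the noise values are illustrative assumptions:

```python
import cv2
import numpy as np

def face_bounding_rect(keypoints: np.ndarray):
    """Circumscribed rectangle of the face: the tightest axis-aligned box
    covering the face key points (an N x 2 array); returns (x, y, w, h)."""
    return cv2.boundingRect(keypoints.astype(np.float32))

def make_box_kalman() -> cv2.KalmanFilter:
    """Constant-velocity Kalman filter over (cx, cy, w, h) and velocities."""
    kf = cv2.KalmanFilter(8, 4)
    kf.transitionMatrix = np.eye(8, dtype=np.float32)
    for i in range(4):                     # position += velocity each frame
        kf.transitionMatrix[i, i + 4] = 1.0
    kf.measurementMatrix = np.eye(4, 8, dtype=np.float32)
    kf.processNoiseCov = np.eye(8, dtype=np.float32) * 1e-2
    kf.measurementNoiseCov = np.eye(4, dtype=np.float32) * 1e-1
    return kf

# Correct with the circumscribed rectangles of the initial frame(s):
#   kf.correct(np.array([[cx], [cy], [w], [h]], np.float32))
# then predict the second face detection frame coordinates:
#   cx, cy, w, h = kf.predict()[:4, 0]
```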
  • Step 250: Determine, according to the face key point coordinates and the eye gaze offset vector corresponding to each face image frame, whether the head-shake video stream data passes the current stage of liveness detection.
  • In one embodiment, the method further includes: when the current stage of liveness detection is passed, obtaining the face video stream data that follows the head-shake video stream data, and performing silent liveness detection on that face video stream data.
  • Various algorithms or models can be used to perform silent liveness detection on face video stream data.
  • During silent detection the person does not shake the head, and the position and angle of the face remain relatively unchanged.
  • Because the subsequent silent detection is performed only when the current stage of liveness detection is passed, the number of users reaching silent detection is far smaller than the number who would undergo silent detection alone.
  • The current stage of liveness detection therefore filters out a large number of users, which reduces resource consumption to a certain extent.
  • In one embodiment, the part of the preset recognition model related to the eye gaze offset vector output layer is trained as follows: obtain, from a sample data set, the normal face region images corresponding to normal head-shake video stream data and the paper-face region images corresponding to paper-face head-shake video stream data, the sample data set including multiple instances of each; input the normal face region images and the paper-face region images into the preset recognition model to obtain the face key point coordinates and eye gaze offset vectors output by the model for each; use the face key point coordinate sequences corresponding to the normal and paper-face head-shake video stream data to determine the face shaking degree sequence corresponding to each; for each normal and each paper-face head-shake video stream data, determine the face key point coordinates whose face shaking degree falls within a predetermined face shaking degree range as the first target face key point coordinates, and determine the score corresponding to that video stream data from the first target face key point coordinates and the eye gaze offset vectors; determine the score threshold using these scores; and train the preset recognition model based on the score threshold.
  • Determining whether the head-shake video stream data passes the current stage of liveness detection according to the face key point coordinates and eye gaze offset vector corresponding to each face image frame includes: from the face key point coordinates corresponding to the face image frames, determining those whose face shaking degree falls within the predetermined face shaking degree range as the second target face key point coordinates; determining, from the second target face key point coordinates and the eye gaze offset vectors, the score corresponding to the head-shake video stream data under liveness detection; and, if the score reaches the score threshold, determining that the current stage of liveness detection is passed, and otherwise that it is not.
  • The score of normal head-shake video stream data is generally greater than the score of paper-face head-shake video stream data.
  • In one embodiment, determining the score threshold using the scores includes: determining the score threshold from the scores corresponding to the normal head-shake video stream data, such that exactly a predetermined proportion of those scores reaches the threshold. Training the preset recognition model based on the score threshold then includes: determining the ratio of the number of paper-face head-shake videos whose score is below the score threshold to the total number of paper-face head-shake videos, and training the preset recognition model according to that ratio.
  • This ratio measures the proportion of all paper-face head-shake video stream data that is correctly identified as such, i.e. the correct rejection rate, which training seeks to increase.
  • The scores can also be used in other ways to determine the score threshold.
  • For example, the smallest value among a predetermined proportion of the scores, ordered from small to large, can be used as the score threshold; or the threshold can be chosen so that a predetermined proportion of the scores corresponding to the paper-face head-shake video stream data fails to reach it.
  • In one embodiment, the score ranked at the 99% position, counting from the largest downward, is used as the score threshold.
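  • A minimal sketch of such a percentile-style threshold and of the correct rejection rate described above follows; the function names and the NumPy implementation are assumptions:

```python
import numpy as np

def score_threshold(normal_scores, pass_ratio: float = 0.99) -> float:
    """Threshold such that exactly the predetermined proportion of normal
    (genuine) head-shake videos reach it, e.g. the score at the 99% rank
    counting from the largest, as in the example above."""
    ranked = np.sort(np.asarray(normal_scores, dtype=float))[::-1]
    return float(ranked[int(np.ceil(pass_ratio * len(ranked))) - 1])

def correct_rejection_rate(paper_scores, threshold: float) -> float:
    """Proportion of paper-face head-shake videos scoring below the
    threshold, i.e. the ratio that training seeks to increase."""
    return float((np.asarray(paper_scores, dtype=float) < threshold).mean())
```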
  • The face shaking degree is expressed as an angle and measures the size of the head-shake.
  • Changes in the face key point coordinates reflect the face shaking degree, so the corresponding face shaking degree sequence can be determined from the face key point coordinate sequence.
  • Determining the face shaking degree sequence from the face key point coordinate sequence can be implemented with various algorithms or models.
  • The predetermined face shaking degree range may be, for example, within 15 degrees.
  • Each normal face region image or paper-face region image corresponds to one face shaking degree.
  • All the face shaking degrees corresponding to the normal face region images constitute a face shaking degree sequence.
  • Likewise, the face shaking degrees corresponding to the paper-face region images in the paper-face head-shake video stream data form a face shaking degree sequence.
  • Both the normal and the paper-face head-shake video stream data are time-ordered sets of face image frames, so the corresponding normal face region images and paper-face region images take the form of image sequences, and the face key point coordinates corresponding to them can likewise exist as sequences.
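  • By way of illustration only, one simple geometric proxy for the per-frame shaking degree can be derived from the key points; the mapping below is an assumption, not the application's prescribed algorithm:

```python
import numpy as np

def shake_degree(left_eye, right_eye, nose_tip) -> float:
    """Approximate head yaw in degrees from how far the nose tip sits
    off-center between the outer eye corners (a rough proxy)."""
    left_eye, right_eye = np.asarray(left_eye), np.asarray(right_eye)
    eye_center = (left_eye + right_eye) / 2.0
    half_span = np.linalg.norm(right_eye - left_eye) / 2.0
    ratio = np.clip((nose_tip[0] - eye_center[0]) / half_span, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))

# Applying this frame by frame to the face key point coordinate sequence
# yields the face shaking degree sequence discussed above.
```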
  • In summary, a face key point detection model combined with an eye gaze offset vector output layer is used to compute the eye gaze offset vector corresponding to the face region image, and that vector is used for face liveness detection. During liveness detection, fraud that shakes a sheet of paper or a head model bearing a face can therefore be identified, improving the accuracy of liveness detection and reducing security risk.
  • This application also provides a face liveness detection apparatus; the following are the apparatus embodiments of this application.
  • Fig. 5 is a block diagram of a face liveness detection apparatus according to an exemplary embodiment.
  • As shown in Fig. 5, the apparatus 500 includes: an input module 510 configured to input the face region image corresponding to the head-shake video stream data to be subjected to liveness detection into a preset recognition model, obtaining the face key point coordinates and eye gaze offset vector output by the preset recognition model, where the preset recognition model is a face key point detection model combined with an eye gaze offset vector output layer, the face key point detection model includes convolutional layers, the eye gaze offset vector output layer is connected to the last convolutional layer of the face key point detection model, the face key point coordinates and the eye gaze offset vector each correspond to the face image frames included in the head-shake video stream data, and the eye gaze offset vector measures the degree to which the eyes' gaze deviates while the head is shaken; and a judgment module 520 configured to determine, according to the face key point coordinates and eye gaze offset vector corresponding to each face image frame, whether the head-shake video stream data passes the current stage of liveness detection.
  • This application also provides an electronic device capable of implementing the above method.
  • The electronic device 600 according to this embodiment of the application is described below with reference to Fig. 6.
  • The electronic device 600 shown in Fig. 6 is only an example and should not impose any limitation on the function or scope of use of the embodiments of this application.
  • The electronic device 600 is represented in the form of a general-purpose computing device.
  • The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, and a bus 630 connecting the different system components (including the storage unit 620 and the processing unit 610).
  • The storage unit 620 stores program code that can be executed by the processing unit 610, causing the processing unit 610 to perform the steps of the various exemplary embodiments described in the "Exemplary Method" section of this specification.
  • The storage unit 620 may include readable media in the form of volatile storage, such as a random access memory (RAM) 621 and/or a cache 622, and may further include a read-only memory (ROM) 623.
  • The storage unit 620 may also include a program/utility 624 having a set of (at least one) program modules 625, which include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
  • The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, the processing unit, or a local bus using any of a variety of bus architectures.
  • The electronic device 600 may also communicate with one or more external devices 800 (such as keyboards, pointing devices, and Bluetooth devices), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device (such as a router or modem) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 650, for example with the display unit 640.
  • The electronic device 600 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 660.
  • The network adapter 660 communicates with the other modules of the electronic device 600 through the bus 630.
  • Other hardware and/or software modules can be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • The example embodiments described here can be implemented in software, or in software combined with the necessary hardware. The technical solution according to the embodiments of this application can therefore be embodied as a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) or on a network, and which includes instructions that cause a computing device (such as a personal computer, server, terminal device, or network device) to execute the method according to the embodiments of this application.
  • This application also provides a computer-readable storage medium storing computer-readable instructions; when the computer-readable instructions are executed by a computer, the computer performs the method described above in this specification.
  • The storage medium involved in this application, such as the computer-readable storage medium, may be non-volatile or volatile.
  • Aspects of this application can also be implemented as a program product that includes program code.
  • When the program product runs on a terminal device, the program code causes the terminal device to perform the steps of the various exemplary embodiments described in the "Exemplary Method" section of this specification.
  • A program product 700 for implementing the above method according to an embodiment of this application may use a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer.
  • However, the program product of this application is not limited to this.
  • A readable storage medium can be any tangible medium that contains or stores a program for use by, or in combination with, an instruction execution system, apparatus, or device.
  • The program product can use any combination of one or more readable media.
  • A readable medium may be a readable signal medium or a readable storage medium.
  • A readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code.
  • Such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing.
  • A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device.
  • The program code contained on a readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, and so on, or any suitable combination of the above.
  • The program code for performing the operations of this application can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar languages.
  • The program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • The remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, through the Internet using an Internet service provider).

Abstract

Provided are a face liveness detection method, apparatus, medium, and electronic device. The method comprises: inputting the face region image corresponding to head-shake video stream data to be subjected to liveness detection into a preset recognition model, to obtain the face key point coordinates and the eye gaze offset vector output by the preset recognition model (240), the preset recognition model being a face key point detection model combined with an eye gaze offset vector output layer, and the eye gaze offset vector measuring the degree to which the eyes' gaze deviates while the head is shaken; and determining, according to the face key point coordinates and the eye gaze offset vector corresponding to each face image frame, whether the head-shake video stream data passes the current stage of liveness detection (250). During liveness detection, the method can identify fraud that shakes paper or head models bearing human faces, improving the accuracy of liveness detection and reducing security risks.

Description

Face liveness detection method, apparatus, medium, and electronic device

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 12, 2020, with application number 202011086784.6 and the invention title "Face liveness detection method, apparatus, medium and electronic device", the entire contents of which are incorporated into this application by reference.

Technical Field

This application relates to the field of artificial intelligence as applied to face recognition, and in particular to a face liveness detection method, apparatus, medium, and electronic device.

Background

Action-based liveness detection is one of the important means of liveness detection. Several actions are randomly selected from head shaking, nodding, opening and closing the mouth, opening and closing the eyes, and so on, and instructions are sent to the user; the user performs the corresponding actions in front of the camera, and the video recorded by the camera is then obtained and analyzed to produce the detection result. Shaking the head is one of the key actions in action-based liveness detection. However, the inventor found that a new attack on liveness detection has emerged: following the instructions, an attacker shakes a sheet of paper or a head model bearing a face to simulate the head-shaking action. Current action-based liveness detection methods cannot identify this tactic, resulting in low detection accuracy and high security risk.
技术问题technical problem
在人工智能和人脸识别技术领域,为了解决上述技术问题,本申请的目的在于提供一种人脸活体检测方法、装置、介质及电子设备。In the field of artificial intelligence and face recognition technology, in order to solve the above technical problems, the purpose of this application is to provide a method, device, medium, and electronic equipment for detecting a living body of a human face.
技术解决方案Technical solutions
根据本申请的一方面,提供了一种人脸活体检测方法,所述方法包括:将待进行活体检测的人脸摇头视频流数据所对应的人脸区域图片输入至预设识别模型,得到由所述预设识别模型输出的人脸关键点坐标和人眼视线偏移矢量,其中,所述预设识别模型为结合了人眼视线偏移矢量输出层的人脸关键点检测模型,所述人脸关键点检测模型包括卷积层,所述人眼视线偏移矢量输出层与所述人脸关键点检测模型中卷积层的最后一层相连,所述人脸关键点坐标和所述人眼视线偏移矢量分别与所述人脸摇头视频流数据所包括的各人脸图像帧相对应,所述人眼视线偏移矢量用于衡量人脸摇头过程中人眼视线的偏移程度;根据与各人脸图像帧对应的所述人脸关键点坐标和所述人眼视线偏移矢量确定所述人脸摇头视频流数据是否通过当前阶段的活体检测。According to one aspect of the present application, there is provided a method for detecting a human face. The method includes: inputting a face region picture corresponding to a face shaking video stream data to be subjected to a living body detection into a preset recognition model to obtain The face key point coordinates and the human eye sight offset vector output by the preset recognition model, wherein the preset recognition model is a face key point detection model combined with the human eye sight offset vector output layer, The face key point detection model includes a convolutional layer, the human eye sight offset vector output layer is connected to the last layer of the convolution layer in the face key point detection model, and the face key point coordinates are The human eye sight deviation vector corresponds to each face image frame included in the face shaking video stream data, and the human eye sight deviation vector is used to measure the degree of deviation of the human eye sight during the face shaking process Determining whether the face shaking video stream data passes the current stage of living body detection according to the face key point coordinates corresponding to each face image frame and the eye sight offset vector.
根据本申请的另一方面,提供了一种人脸活体检测装置,所述装置包括:输入模块,被配置为将待进行活体检测的人脸摇头视频流数据所对应的人脸区域图片输入至预设识别模型,得到由所述预设识别模型输出的人脸关键点坐标和人眼视线偏移矢量,其中,所述预设识别模型为结合了人眼视线偏移矢量输出层的人脸关键点检测模型,所述人脸关键点检测模型包括卷积层,所述人眼视线偏移矢量输出层与所述人脸关键点检测模型中卷积层的最后一层相连,所述人脸关键点坐标和所述人眼视线偏移矢量分别与所述人脸摇头视频流数据所包括的各人脸图像帧相对应,所述人眼视线偏移矢量用于衡量人脸摇头过程中人眼视线的偏移程度;判断模块,被配置为根据与各人脸图像帧对应的所述人脸关键点坐标和所述人眼视线偏移矢量确定所述人脸摇头视频流数据是否通过当前阶段的活体检测。According to another aspect of the present application, there is provided a face living detection device, the device comprising: an input module configured to input the face area picture corresponding to the face shaking video stream data to be subjected to the living detection to A preset recognition model to obtain the key point coordinates of the face and the eye sight offset vector output by the preset recognition model, where the preset recognition model is a face combined with the human eye sight offset vector output layer A key point detection model, the face key point detection model includes a convolutional layer, the human eye sight offset vector output layer is connected to the last layer of the convolution layer in the face key point detection model, and the person The coordinates of the key points of the face and the eye sight offset vector correspond to each face image frame included in the face shaking video stream data, and the eye sight offset vector is used to measure the process of shaking the head of the face The degree of deviation of the human eye line of sight; the judgment module is configured to determine whether the face shaking video stream data passes according to the face key point coordinates corresponding to each face image frame and the human eye line of sight offset vector Live detection at the current stage.
根据本申请的另一方面,提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机可读指令,当所述计算机可读指令被计算机执行时,使计算机执行以下方法:将待进行活体检测的人脸摇头视频流数据所对应的人脸区域图片输入至预设识别模型,得到由所述预设识别模型输出的人脸关键点坐标和人眼视线偏移矢量,其中,所述预设识别模型为结合了人眼视线偏移矢量输出层的人脸关键点检测模型,所述人脸关键点检测模型包括卷积层,所述人眼视线偏移矢量输出层与所述人脸关键点检测模型中卷积层的最后一层相连,所述人脸关键点坐标和所述人眼视线偏移矢量分别与所述人脸摇头视频流数据所包括的各人脸图像帧相对应,所述人眼视线偏移矢量用于衡量人脸摇头过程中人眼视线的偏移程度;根据与各人脸图像帧对应的所述人脸关键点坐标和所述人眼视线偏移矢量确定所述人脸摇头视频流数据是否通过当前阶段的活体检测。According to another aspect of the present application, there is provided a computer-readable storage medium, the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by the computer, the computer executes the following method: The face region picture corresponding to the face shaking video stream data to be subjected to the live detection is input into the preset recognition model, and the key point coordinates of the face and the eye sight offset vector output by the preset recognition model are obtained, where The preset recognition model is a face key point detection model combined with a human eye sight offset vector output layer, the face key point detection model includes a convolutional layer, and the human eye sight offset vector output layer is The last layer of the convolutional layer in the face key point detection model is connected, and the face key point coordinates and the eye sight offset vector are respectively connected to each face included in the face shaking video stream data. Corresponding to the image frame, the human eye sight deviation vector is used to measure the degree of deviation of the human eye sight in the process of shaking the head of the face; according to the face key point coordinates and the human eye corresponding to each face image frame The sight offset vector determines whether the face shaking video stream data passes the current stage of living body detection.
根据本申请的另一方面,提供了一种电子设备,所述电子设备包括:处理器;存储器,所述存储器上存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,实现以下方法:将待进行活体检测的人脸摇头视频流数据所对应的人脸区域图片输入至预设识别模型,得到由所述预设识别模型输出的人脸关键点坐标和人眼视线偏移矢量,其中,所述预设识别模型为结合了人眼视线偏移矢量输出层的人脸关键点检测模型,所述人脸关键点检测模型包括卷积层,所述人眼视线偏移矢量输出层与所述人脸关键点检测模型中卷积层的最后一层相连,所述人脸关键点坐标和所述人眼视线偏移矢量分别与所述人脸摇头视频流数据所包括的各人脸图像帧相对应,所述人眼视线偏移矢量用于衡量人脸摇头过程中人眼视线的偏移程度;根据与各人脸图像帧对应的所述人脸关键点坐标和所述人眼视线偏移矢量确定所述人脸摇头视频流数据是否通过当前阶段的活体检测。According to another aspect of the present application, there is provided an electronic device, the electronic device including: a processor; , To implement the following method: input the face area picture corresponding to the face shaking video stream data to be subjected to the live detection into the preset recognition model, and obtain the key point coordinates of the face and the line of sight of the human eye output by the preset recognition model Offset vector, wherein the preset recognition model is a face key point detection model combined with an output layer of the human eye line of sight offset vector, the face key point detection model includes a convolutional layer, and the human eye line of sight is biased The shift vector output layer is connected to the last layer of the convolutional layer in the face key point detection model. The face key point coordinates and the human eye sight offset vector are respectively compared with the face shaking head video stream data. The included face image frames correspond to each other, and the human eye sight offset vector is used to measure the degree of deviation of the human eye sight during the process of shaking the head of the face; according to the face key point coordinates corresponding to each face image frame Determine whether the video stream data of the human face shaking head passes the current stage of the living body detection by using the sight deviation vector of the human eye.
Beneficial Effects

This application uses a face key point detection model combined with an eye gaze offset vector output layer to compute the eye gaze offset vector corresponding to the face region image, and uses that vector for face liveness detection. During liveness detection, fraud that shakes a sheet of paper or a head model bearing a face can therefore be identified, improving the accuracy of liveness detection and reducing security risk.

It should be understood that the above general description and the following detailed description are merely exemplary and do not limit this application.
附图说明Description of the drawings
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本申请的实施例,并与说明书一起用于解释本申请的原理。The drawings herein are incorporated into the specification and constitute a part of the specification, show embodiments that conform to the application, and are used together with the specification to explain the principle of the application.
图1是根据一示例性实施例示出的一种人脸活体检测方法的系统架构示意图。Fig. 1 is a schematic diagram showing a system architecture of a method for detecting a human face according to an exemplary embodiment.
图2是根据一示例性实施例示出的一种人脸活体检测方法的流程图。Fig. 2 is a flow chart showing a method for detecting human face living according to an exemplary embodiment.
图3是根据一示例性实施例示出的用于人脸活体检测方法的预设识别模型的至少部分结构示意图。Fig. 3 is a schematic diagram showing at least part of the structure of a preset recognition model used in a method for detecting a human face according to an exemplary embodiment.
图4是根据图2对应实施例示出的一实施例的步骤240之前步骤的流程图。FIG. 4 is a flowchart of steps before step 240 of an embodiment shown in the embodiment corresponding to FIG. 2.
图5是根据一示例性实施例示出的一种人脸活体检测装置的框图。Fig. 5 is a block diagram showing a device for detecting human face living according to an exemplary embodiment.
图6是根据一示例性实施例示出的一种实现上述人脸活体检测方法的电子设备示例框图。Fig. 6 is a block diagram showing an example of an electronic device for realizing the above method for detecting a human face according to an exemplary embodiment.
图7是根据一示例性实施例示出的一种实现上述人脸活体检测方法的计算机可读存储介质。Fig. 7 shows a computer-readable storage medium for realizing the above-mentioned method for detecting human face living according to an exemplary embodiment.
本发明的实施方式Embodiments of the present invention
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。The exemplary embodiments will be described in detail here, and examples thereof are shown in the accompanying drawings. When the following description refers to the accompanying drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The implementation manners described in the following exemplary embodiments do not represent all implementation manners consistent with the present application. On the contrary, they are merely examples of devices and methods consistent with some aspects of the application as detailed in the appended claims.
此外,附图仅为本申请的示意性图解,并非一定是按比例绘制。图中相同的附图标记表示相同或类似的部分,因而将省略对它们的重复描述。附图中所示的一些方框图是功能实体,不一定必须与物理或逻辑上独立的实体相对应。In addition, the drawings are only schematic illustrations of the application and are not necessarily drawn to scale. The same reference numerals in the figures denote the same or similar parts, and thus their repeated description will be omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities.
本申请的技术方案可应用于人工智能、智慧城市、区块链和/或大数据技术领域,以实现活体检测。可选的,本申请涉及的数据如视频流数据和/或人脸区域图片等可存储于数据库中,或者可以存储于区块链中,比如通过区块链分布式存储,本申请不做限定。The technical solution of the present application can be applied to the fields of artificial intelligence, smart city, blockchain and/or big data technology to realize living body detection. Optionally, the data involved in this application, such as video stream data and/or face area pictures, etc., can be stored in a database, or can be stored in a blockchain, such as distributed storage through a blockchain, which is not limited in this application .
This application first provides a face liveness detection method. Face liveness detection refers to the process of judging, from a recorded video containing a face, whether the face in that video is a live one; it is one of the key technical means in the field of identity verification, and action-based liveness detection is an important part of it. During action-based liveness detection, the user is prompted by voice, text, or other instructions to perform corresponding actions, mainly including shaking the head, nodding, opening and closing the mouth, and opening and closing the eyes; alternatively, no instruction is issued and the user's actions are observed at random. When the user is instructed to shake their head, an attacker may fraudulently complete the action by shaking a sheet of paper or a head model bearing a face and uploading a video of these actions. Related technical means cannot detect this situation, so such fraud easily passes liveness detection, posing a serious security risk. The face liveness detection method provided by this application can identify such fraud, thereby improving the accuracy of liveness detection and reducing losses.
The implementation terminal of this application may be any device with computing, processing, and storage capabilities, which may be connected to external devices to receive or send data. It may be a portable mobile device such as a smartphone, tablet computer, laptop, or PDA (Personal Digital Assistant); a fixed device such as computer equipment, a field terminal, a desktop computer, a server, or a workstation; or a collection of multiple devices, such as the physical infrastructure of a cloud computing platform or a server cluster.
Optionally, the implementation terminal of this application may be a server or the physical infrastructure of a cloud computing platform.
Fig. 1 is a schematic diagram of the system architecture of a face liveness detection method according to an exemplary embodiment. As shown in Fig. 1, the system architecture includes a server 110 and a mobile terminal 120; the mobile terminal 120 may be, for example, a smartphone. The mobile terminal 120 is connected to the server 110 through a communication link, so it can send data to and receive data from the server 110. The server 110 runs a server-side program and hosts the preset recognition model, while client software is installed and running on the mobile terminal 120; the server 110 is the implementation terminal in this embodiment. When the face liveness detection method provided by this application is applied to the system architecture shown in Fig. 1, a specific process may be as follows: the user operates the client software on the mobile terminal 120 to record head-shaking face video stream data and upload it to the server 110; after receiving the data, the server 110 runs the server-side program to extract face region pictures from the video stream; the server 110 then inputs the face region pictures into the preset recognition model to obtain the face key point coordinates and eye gaze offset vectors output by the model; finally, the server 110 uses these outputs to determine and output the detection result of the current stage of liveness detection.
It is worth mentioning that Fig. 1 is only one embodiment of the present application. Although in this embodiment the implementation terminal is a server and the terminal providing the head-shaking face video stream data is a mobile terminal, in other embodiments or practical applications both may be any of the terminals or devices described above. Likewise, although in this embodiment the head-shaking face video stream data is sent from a terminal other than the implementation terminal, it may in fact be obtained directly by the local terminal. This application imposes no limitation in this regard, and its scope of protection should not be restricted accordingly.
Fig. 2 is a flowchart of a face liveness detection method according to an exemplary embodiment. The face liveness detection method provided in this embodiment may be executed by a server and, as shown in Fig. 2, includes the following steps.
Step 240: input the face region pictures corresponding to the head-shaking face video stream data to be subjected to liveness detection into a preset recognition model, and obtain the face key point coordinates and eye gaze offset vectors output by the preset recognition model.
The preset recognition model is a face key point detection model combined with an eye gaze offset vector output layer. The face key point detection model includes convolutional layers, and the eye gaze offset vector output layer is connected to the last of these convolutional layers. The face key point coordinates and the eye gaze offset vectors each correspond to the individual face image frames included in the head-shaking face video stream data, and the eye gaze offset vector measures the degree to which the eye gaze shifts while the head is shaken.
The face key point coordinates and the eye gaze offset vector correspond to each face image frame; that is, every face image frame has its own face key point coordinates and eye gaze offset vector.
An eye gaze offset vector has a direction and a length; for example, gaze toward the left may be taken as positive and gaze toward the right as negative. The length may be defined as the normalized relative distance by which the pupil deviates from the center of the eye socket.
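By way of a non-limiting illustration (this sketch is not part of the original disclosure), one way to obtain such a signed, normalized offset from pupil and eye-corner landmarks is shown below; the landmark layout and the sign convention (left positive) are assumptions:

```python
import numpy as np

def gaze_offset(pupil, eye_left_corner, eye_right_corner):
    """Signed, normalized horizontal offset of the pupil from the eye center.

    Positive values mean the pupil is shifted toward the left of the image,
    negative toward the right; the magnitude is normalized by half the eye
    width so it stays roughly within [-1, 1].
    """
    left = np.asarray(eye_left_corner, dtype=float)
    right = np.asarray(eye_right_corner, dtype=float)
    center = (left + right) / 2.0
    half_width = np.linalg.norm(right - left) / 2.0
    # Horizontal displacement of the pupil relative to the eye center.
    return float((center[0] - pupil[0]) / half_width)
```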
The applicant found that a real person looks at the phone while shaking their head, so the gaze shifts during the motion, whereas the gaze of a face on bent paper is relatively fixed. Because the paper bends and shakes, the gaze model's predictions exhibit some jitter, but this jitter is smaller than the gaze offset increments of real eyes looking left and right. A real face can therefore be distinguished from bent paper by comparing statistics of the gaze offset increments of real head-shaking faces with those of faces on bent paper.
Specifically, the structure of the preset recognition model is shown in Fig. 3. Fig. 3 is a schematic diagram of at least part of the structure of a preset recognition model used in a face liveness detection method according to an exemplary embodiment. As Fig. 3 shows, the preset recognition model 300 includes at least a face key point detection model 310 and an eye gaze offset vector output layer 320. The part framed by the dashed line is the face key point detection model 310, comprising convolutional layers 311 and an output part 312 after them; the convolutional layers 311 may be a stack of multiple neural network layers. The preset recognition model 300 receives a face image frame as input, and the output part 312 ultimately outputs the face key point coordinates. Of course, the preset recognition model 300 may include other structures before the convolutional layers 311 and between their individual layers. The eye gaze offset vector output layer 320 receives the input from the last convolutional layer and ultimately outputs the eye gaze offset vector corresponding to the face image frame; it is usually a fully connected layer.
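The following is a minimal PyTorch sketch of a two-headed model of this shape, given only for illustration; the backbone depth, channel counts, and the choice of 68 key points are assumptions rather than values specified by this application:

```python
import torch
import torch.nn as nn

class PresetRecognitionModel(nn.Module):
    """Face key point detector with an additional eye gaze offset head.

    Both heads read the features produced by the last convolutional layer,
    mirroring the structure described for model 300: the backbone plus
    keypoint head play the roles of parts 311/312, and the fully connected
    gaze head plays the role of layer 320.
    """

    def __init__(self, num_keypoints: int = 68):
        super().__init__()
        self.num_keypoints = num_keypoints
        # Convolutional layers (311): an illustrative small stack.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Output part (312): (x, y) coordinates for each key point.
        self.keypoint_head = nn.Linear(128, num_keypoints * 2)
        # Gaze offset vector output layer (320): a fully connected layer,
        # here producing one signed offset per eye.
        self.gaze_head = nn.Linear(128, 2)

    def forward(self, face_image: torch.Tensor):
        features = self.backbone(face_image).flatten(1)
        keypoints = self.keypoint_head(features).view(-1, self.num_keypoints, 2)
        gaze_offset = self.gaze_head(features)
        return keypoints, gaze_offset
```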
In one embodiment, the steps preceding step 240 may be as shown in Fig. 4. Fig. 4 is a flowchart of the steps preceding step 240 in an embodiment based on the embodiment corresponding to Fig. 2. Referring to Fig. 4, they are as follows.
Step 210: deframe the head-shaking face video stream data to be subjected to liveness detection to obtain the face image frames corresponding to that video stream data.
Deframing the head-shaking face video stream data is the process of splitting it into individual face image frames.
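A minimal sketch of this deframing step using OpenCV, assuming the received video stream has been saved to a file:

```python
import cv2

def deframe(video_path: str):
    """Split a recorded video stream into a list of BGR image frames."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:  # end of stream
            break
        frames.append(frame)
    capture.release()
    return frames
```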
In one embodiment, before deframing the head-shaking face video stream data to be subjected to liveness detection to obtain the corresponding face image frames, the method further includes: acquiring, from a user terminal, the head-shaking face video stream data to be subjected to liveness detection.
In one embodiment, before acquiring the head-shaking face video stream data from the user terminal, the method further includes: randomly selecting one preset action instruction from a plurality of preset action instructions, which include head-shaking, and sending the selected instruction to the user terminal; the head-shaking face video stream data is acquired from the user terminal on the condition that the selected preset action instruction is head-shaking.
Step 220: input the face image frames into a preset face detection model to obtain the face detection box coordinates corresponding to each face image frame.
The pixel area of a face image frame may be large, while the face itself may occupy only a part, or a small part, of it. For accurate face detection, it is therefore necessary to identify the region of the frame corresponding to the face specifically.
The face detection box coordinates are the position coordinates, within the face image frame, of the region corresponding to the face. The preset face detection model outputs the corresponding face detection box coordinates for an input face image frame; it may be implemented with various algorithms or principles, for example a general machine learning algorithm or a deep learning algorithm.
Step 230: extract a face region picture from the face image frame according to the face detection box coordinates.
In one embodiment, extracting a face region picture from the face image frame according to the face detection box coordinates includes: determining, in the face image frame, the first face detection box region corresponding to the face detection box coordinates; expanding the first face detection box region by a predetermined expansion ratio to obtain a second face detection box region; and extracting the face region picture based on the range delimited by the second face detection box region.
For example, the first face detection box region may be a rectangle, and the face detection box coordinates are coordinates that uniquely determine its extent. They may be the coordinates of the rectangle's four vertices, which suffice to fix the rectangle; or they may be the coordinates of the intersection of the rectangle's two diagonals, in which case the rectangle's extent can be determined from that intersection together with a preset length and width.
The predetermined expansion ratio is the proportion by which the coverage is enlarged relative to the original region; it may take various preset values, for example 20%. The expansion of the first face detection box region may be carried out in a variety of ways or directions, for example outward from the center, toward the left and right or top and bottom, or toward the upper right or lower left. After the expansion, the second face detection box region has a larger area than the first.
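A minimal sketch of a center-outward expansion, assuming boxes are given as (x, y, w, h) in pixels and using the 20% example ratio mentioned above:

```python
def expand_box(x, y, w, h, img_w, img_h, ratio=0.20):
    """Expand a detection box outward from its center by `ratio`,
    clamped to the image bounds."""
    dw, dh = w * ratio / 2, h * ratio / 2
    x0 = max(0.0, x - dw)
    y0 = max(0.0, y - dh)
    x1 = min(float(img_w), x + w + dw)
    y1 = min(float(img_h), y + h + dh)
    return x0, y0, x1 - x0, y1 - y0

# Cropping the face region picture from a frame:
# x0, y0, w0, h0 = expand_box(x, y, w, h, frame.shape[1], frame.shape[0])
# face_crop = frame[int(y0):int(y0 + h0), int(x0):int(x0 + w0)]
```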
In this embodiment, after the first face detection box region is determined from the face detection box coordinates, the face region picture is not extracted directly from the range it delimits; instead, the first face detection box region is first expanded into the second face detection box region, and the face region picture is extracted from the range delimited by the latter. This makes the extracted face region picture large enough to retain more information about the face, improving the liveness detection performance to some extent.
In one embodiment, inputting the face image frames into the preset face detection model to obtain the corresponding face detection box coordinates includes: inputting each face image frame into the preset face detection model to obtain the face detection box coordinates corresponding to that frame. Extracting face region pictures from the face image frames according to the face detection box coordinates then includes: extracting a face region picture from each face image frame according to the respective face detection box coordinates.
Extracting a face region picture from a face image frame is the process of cropping that frame. In this embodiment, every face region picture is obtained by first having the preset face detection model determine the face detection box coordinates and then extracting the picture according to those coordinates.
In one embodiment, inputting the face image frames into the preset face detection model to obtain the corresponding face detection box coordinates includes: inputting at least one face image frame into the preset face detection model to obtain the first face detection box coordinates corresponding to each such frame. Extracting face region pictures according to the face detection box coordinates then includes: extracting, for each set of first face detection box coordinates, the corresponding first face region picture from the corresponding face image frame; inputting each first face region picture into the preset recognition model to obtain its face key point coordinates and eye gaze offset vector; determining the face bounding rectangle corresponding to the face key point coordinates of each first face region picture; determining, from the face bounding rectangles and a preset estimation algorithm, the second face detection box coordinates corresponding to at least one face image frame following the at least one face image frame; and extracting, according to the determined second face detection box coordinates, the corresponding second face region picture from the corresponding face image frame.
The face bounding rectangle is the rectangle that just covers the face region: at least some points on the edge of the face region lie on it. The preset estimation algorithm may be any algorithm capable of estimating or extrapolating the motion state of the face, for example a Kalman filter. A Kalman filter, also called the Kalman filter equations or Kalman motion equations, is an algorithm that uses a linear system's state equations to optimally estimate the system state from observed inputs and outputs. Specifically, feeding the face bounding rectangles of at least one preceding face image frame into the Kalman motion equations yields the second face detection box coordinates for the current and even subsequent face image frames; these coordinates are predictions based on the Kalman motion equations.
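A minimal sketch of this prediction step is given below; for brevity it replaces the full Kalman filter with a plain constant-velocity extrapolation over the key-point bounding rectangles, which is a simplification of, not a substitute for, the estimation algorithm described above:

```python
import numpy as np

def keypoints_bounding_rect(keypoints):
    """Axis-aligned rectangle (x, y, w, h) that just covers the key points."""
    pts = np.asarray(keypoints, dtype=float)
    xs, ys = pts[:, 0], pts[:, 1]
    return np.array([xs.min(), ys.min(), xs.max() - xs.min(), ys.max() - ys.min()])

def predict_next_box(prev_boxes):
    """Constant-velocity extrapolation of the next detection box from the
    bounding rectangles of preceding frames (a stand-in for the Kalman
    prediction step)."""
    boxes = np.asarray(prev_boxes, dtype=float)
    if len(boxes) < 2:
        return boxes[-1]
    velocity = boxes[-1] - boxes[-2]  # per-frame change of (x, y, w, h)
    return boxes[-1] + velocity
```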
In this embodiment, two approaches are used to determine the face detection box coordinates across all face image frames. For at least one leading face image frame, the frame is input into the preset face detection model to obtain the face detection box coordinates, and the corresponding face region picture is extracted from the frame accordingly. For the current or subsequent face image frames, the coordinates are derived from the previously extracted face region pictures: the previously extracted face region picture is input into the preset recognition model to obtain the face key point coordinates, the corresponding face bounding rectangle is determined from those coordinates, and finally the face bounding rectangle is fed into the preset estimation algorithm to determine the second face detection box coordinates for the current and subsequent frames. Compared with simply running every face image frame through the preset face detection model, this approach consumes fewer computing resources and is more efficient.
Step 250: determine, from the face key point coordinates and the eye gaze offset vector corresponding to each face image frame, whether the head-shaking face video stream data passes the current stage of liveness detection.
In one embodiment, after determining whether the head-shaking face video stream data passes the current stage of liveness detection from the face key point coordinates and the eye gaze offset vectors corresponding to the face image frames, the method further includes: if the current stage of liveness detection is passed, acquiring face video stream data that follows the head-shaking face video stream data, and performing silent liveness detection on it.
Various algorithms or models may be used to perform silent liveness detection on the face video stream data. In face video stream data used for silent liveness detection, the person does not need to shake their head, and the position and angle of the face remain relatively unchanged.
In this embodiment, the subsequent silent detection is performed only when the current stage of liveness detection is passed. Since far fewer users can pass action-based liveness detection alone than can pass silent detection alone, running the current stage of liveness detection first filters out a large number of users, reducing resource consumption to some extent.
In one embodiment, the part of the preset recognition model related to the eye gaze offset vector output layer is trained as follows: obtain, from a sample data set, the normal face region pictures corresponding to normal head-shaking face video stream data and the face-paper region pictures corresponding to head-shaking face-paper video stream data, the sample data set including multiple items of normal head-shaking face video stream data and multiple items of head-shaking face-paper video stream data; input the normal face region pictures and the face-paper region pictures into the preset recognition model to obtain the face key point coordinates and eye gaze offset vectors that it outputs for each; determine, from the face key point coordinate sequences of the normal head-shaking face video stream data and of the head-shaking face-paper video stream data respectively, the head-shaking degree sequences corresponding to each; for each item of normal head-shaking face video stream data and each item of head-shaking face-paper video stream data, determine the face key point coordinates corresponding to head-shaking degrees within a predetermined head-shaking degree range as the first target face key point coordinates; for each such item, determine the difference between the largest and smallest eye gaze offset vectors among those corresponding to the first target face key point coordinates as that item's score; use the scores to determine a score threshold; and train the preset recognition model based on the score threshold.
Determining whether the head-shaking face video stream data passes the current stage of liveness detection from the face key point coordinates and eye gaze offset vectors corresponding to the face image frames then includes: determining, from the face key point coordinates corresponding to the face image frames, the face key point coordinates corresponding to head-shaking degrees within the predetermined range as the second target face key point coordinates; determining, from the second target face key point coordinates and the eye gaze offset vectors, the score corresponding to the head-shaking face video stream data under detection; and, if the score reaches the score threshold, determining that the current stage of liveness detection is passed, otherwise determining that it is not.
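A minimal sketch of this scoring rule, assuming per-frame head-shaking degrees and scalar gaze offsets have already been computed, and using an assumed 15-degree range (the example value given later in this description); the sketch also assumes at least one frame falls within that range:

```python
import numpy as np

def shake_video_score(shake_degrees, gaze_offsets, max_degree=15.0):
    """Score = (max - min) gaze offset over frames whose head-shaking
    degree lies within the predetermined range."""
    degrees = np.asarray(shake_degrees, dtype=float)
    gaze = np.asarray(gaze_offsets, dtype=float)
    in_range = gaze[np.abs(degrees) <= max_degree]
    return float(in_range.max() - in_range.min())

def passes_current_stage(score, threshold):
    """The video passes the current stage if its score reaches the threshold."""
    return score >= threshold
```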
The score of normal head-shaking face video stream data is generally greater than that of head-shaking face-paper video stream data.
In one embodiment, using the scores to determine the score threshold includes: determining, from the scores corresponding to the items of normal head-shaking face video stream data, a score threshold such that exactly a predetermined proportion of those scores reaches it. Training the preset recognition model based on the score threshold then includes: determining the ratio of the number of head-shaking face-paper video stream data scores below the score threshold to the total number of head-shaking face-paper video stream data scores, and training the preset recognition model according to that ratio.
This ratio measures the proportion of all head-shaking face-paper video stream data that is correctly identified as such, i.e., the correct rejection rate, which can therefore be raised through training.
Of course, the scores may also be used in other ways to determine the score threshold: for example, the smallest score within the leading predetermined proportion when the scores are ranked from small to large may be taken as the threshold, or a threshold may be chosen such that a predetermined proportion of the scores of the head-shaking face-paper video stream data fails to reach it.
Specifically, if there are 100 scores in total and the predetermined proportion is 99%, the score ranked 99th from largest to smallest is taken as the score threshold.
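A minimal sketch of this threshold selection over the scores of the genuine (normal face) videos:

```python
import numpy as np

def score_threshold(normal_scores, pass_proportion=0.99):
    """Threshold such that `pass_proportion` of genuine video scores reach
    it; with 100 scores and 99%, this is the 99th-largest score."""
    ranked = np.sort(np.asarray(normal_scores, dtype=float))[::-1]  # largest first
    k = int(np.ceil(pass_proportion * len(ranked)))
    return float(ranked[k - 1])
```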
The head-shaking degree of a face is measured in degrees and indicates the size of the head-shaking angle. Since changes in the face key point coordinates drive the head-shaking degree, the corresponding head-shaking degree sequence can be determined from the face key point coordinate sequence, and this can be implemented with various algorithms or models. The predetermined head-shaking degree range may be, for example, 15 degrees.
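By way of a rough, non-authoritative illustration of deriving a head-shaking degree from key points, the sketch below maps the horizontal asymmetry of an assumed nose-tip landmark between the two face-contour edges linearly to degrees; both the landmark choice and the linear mapping are assumptions:

```python
import numpy as np

def head_shake_degree(nose_tip, left_contour, right_contour, max_degree=90.0):
    """Crude yaw proxy: how far the nose tip sits from the horizontal
    midpoint of the face contour, mapped linearly to degrees."""
    mid_x = (left_contour[0] + right_contour[0]) / 2.0
    half_width = abs(right_contour[0] - left_contour[0]) / 2.0
    asymmetry = (nose_tip[0] - mid_x) / half_width  # roughly in [-1, 1]
    return float(np.clip(asymmetry, -1.0, 1.0) * max_degree)
```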
Each normal face region picture or face-paper region picture corresponds to one head-shaking degree. The head-shaking degrees of all normal face region pictures in an item of normal head-shaking face video stream data form a head-shaking degree sequence; likewise, the head-shaking degrees of all face-paper region pictures in an item of head-shaking face-paper video stream data form such a sequence.
Since both normal head-shaking face video stream data and head-shaking face-paper video stream data are temporally ordered sets of face image frames, their corresponding normal face region pictures and face-paper region pictures exist as picture sequences; by the same token, the corresponding face key point coordinates can also exist as sequences.
In summary, according to the face liveness detection method provided by the embodiment of Fig. 2, the face key point detection model combined with the eye gaze offset vector output layer is used to compute the eye gaze offset vector corresponding to each face region picture, and the eye gaze offset vectors are used for face liveness detection. In the liveness detection process, fraud carried out by shaking paper or a head model bearing a face can therefore be identified, improving the accuracy of liveness detection and reducing security risks.
This application also provides a face liveness detection apparatus; the following are the apparatus embodiments of this application.
Fig. 5 is a block diagram of a face liveness detection apparatus according to an exemplary embodiment. As shown in Fig. 5, the apparatus 500 includes: an input module 510 configured to input the face region pictures corresponding to the head-shaking face video stream data to be subjected to liveness detection into a preset recognition model and obtain the face key point coordinates and eye gaze offset vectors output by it, where the preset recognition model is a face key point detection model combined with an eye gaze offset vector output layer, the face key point detection model includes convolutional layers, the eye gaze offset vector output layer is connected to the last convolutional layer of the face key point detection model, the face key point coordinates and eye gaze offset vectors each correspond to the individual face image frames included in the head-shaking face video stream data, and the eye gaze offset vector measures the degree to which the eye gaze shifts while the head is shaken; and a judgment module 520 configured to determine, from the face key point coordinates and eye gaze offset vectors corresponding to the face image frames, whether the head-shaking face video stream data passes the current stage of liveness detection.
According to a third aspect of the present application, an electronic device capable of implementing the above method is also provided.
Those skilled in the art will understand that aspects of the present application may be implemented as a system, method, or program product. Accordingly, aspects of the present application may take the form of an entirely hardware implementation, an entirely software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software, which may be collectively referred to here as a "circuit", "module", or "system".
The electronic device 600 according to this embodiment of the present application is described below with reference to Fig. 6. The electronic device 600 shown in Fig. 6 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application. As shown in Fig. 6, the electronic device 600 takes the form of a general-purpose computing device. Its components may include, but are not limited to: the at least one processing unit 610 mentioned above, the at least one storage unit 620 mentioned above, and a bus 630 connecting the different system components (including the storage unit 620 and the processing unit 610). The storage unit stores program code executable by the processing unit 610, causing the processing unit 610 to perform the steps according to the various exemplary implementations of the present application described in the "Embodiment Method" section of this specification. The storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access memory (RAM) 621 and/or a cache 622, and may further include a read-only memory (ROM) 623. The storage unit 620 may also include a program/utility 624 having a set of (at least one) program modules 625, such program modules 625 including, but not limited to, an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures. The electronic device 600 may also communicate with one or more external devices 800 (such as a keyboard, a pointing device, or a Bluetooth device), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device (such as a router or modem) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 650, for example communication with a display unit 640. Moreover, the electronic device 600 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 660. As shown in the figure, the network adapter 660 communicates with the other modules of the electronic device 600 through the bus 630. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
From the description of the above embodiments, those skilled in the art will readily understand that the exemplary implementations described here may be realized in software, or in software combined with the necessary hardware. Accordingly, the technical solution according to the embodiments of the present application may be embodied in a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a portable hard disk) or on a network, and includes instructions that cause a computing device (such as a personal computer, server, terminal apparatus, or network device) to perform the method according to the embodiments of the present application.
According to a fourth aspect of the present application, a computer-readable storage medium is also provided, storing computer-readable instructions that, when executed by a computer, cause the computer to perform the method described above in this specification.
Optionally, the storage medium involved in this application, such as the computer-readable storage medium, may be non-volatile or volatile.
In some possible implementations, aspects of the present application may also be realized as a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary implementations of the present application described in the "Exemplary Method" section of this specification.
Referring to Fig. 7, a program product 700 for implementing the above method according to an embodiment of the present application is described. It may take the form of a portable compact disc read-only memory (CD-ROM) containing program code and may run on a terminal device, for example a personal computer. However, the program product of the present application is not limited to this: in this document, a readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a readable medium may be transmitted by any appropriate medium, including but not limited to wireless, wireline, optical cable, RF, or any suitable combination thereof.
Program code for carrying out the operations of the present application may be written in any combination of one or more programming languages, including object-oriented languages such as Java and C++ as well as conventional procedural languages such as the "C" language or similar. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may connect to an external computing device (for example, through the Internet using an Internet service provider).
Furthermore, the above drawings are merely schematic illustrations of the processing included in the methods according to the exemplary embodiments of the present application and are not intended to be limiting. It is readily understood that the processing shown in the drawings does not indicate or limit the temporal order of the operations; likewise, the operations may be performed, for example, synchronously or asynchronously in multiple modules. It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims (20)

  1. A face liveness detection method, wherein the method comprises:
    inputting the face region pictures corresponding to head-shaking face video stream data to be subjected to liveness detection into a preset recognition model, and obtaining the face key point coordinates and eye gaze offset vectors output by the preset recognition model, wherein the preset recognition model is a face key point detection model combined with an eye gaze offset vector output layer, the face key point detection model comprises convolutional layers, the eye gaze offset vector output layer is connected to the last convolutional layer of the face key point detection model, the face key point coordinates and the eye gaze offset vectors respectively correspond to the face image frames comprised in the head-shaking face video stream data, and the eye gaze offset vector is used to measure the degree to which the eye gaze shifts while the head is shaken;
    determining, according to the face key point coordinates and the eye gaze offset vectors corresponding to the face image frames, whether the head-shaking face video stream data passes the current stage of liveness detection.
  2. The method according to claim 1, wherein, before inputting the face region pictures corresponding to the head-shaking face video stream data to be subjected to liveness detection into the preset recognition model and obtaining the face key point coordinates and eye gaze offset vectors output by the preset recognition model, the method further comprises:
    deframing the head-shaking face video stream data to be subjected to liveness detection to obtain the face image frames corresponding to the head-shaking face video stream data;
    inputting the face image frames into a preset face detection model to obtain the face detection box coordinates corresponding to the face image frames;
    extracting face region pictures from the face image frames according to the face detection box coordinates.
  3. The method according to claim 2, wherein extracting a face region picture from a face image frame according to the face detection box coordinates comprises:
    determining, in the face image frame, a first face detection box region corresponding to the face detection box coordinates;
    expanding the first face detection box region by a predetermined expansion ratio to obtain a second face detection box region;
    extracting the face region picture based on the range delimited by the second face detection box region.
  4. The method according to claim 1, wherein, after determining, according to the face key point coordinates and the eye gaze offset vectors corresponding to the face image frames, whether the head-shaking face video stream data passes the current stage of liveness detection, the method further comprises:
    in a case where the current stage of liveness detection is passed, acquiring face video stream data following the head-shaking face video stream data;
    performing silent liveness detection on the face video stream data.
  5. The method according to claim 2, wherein inputting the face image frames into the preset face detection model to obtain the face detection box coordinates corresponding to the face image frames comprises:
    inputting each of the face image frames into the preset face detection model to obtain the face detection box coordinates corresponding to each face image frame;
    and wherein extracting face region pictures from the face image frames according to the face detection box coordinates comprises:
    extracting a face region picture from each face image frame according to the respective face detection box coordinates.
  6. The method according to claim 2, wherein inputting the face image frames into the preset face detection model to obtain the face detection box coordinates corresponding to the face image frames comprises:
    inputting at least one of the face image frames into the preset face detection model to obtain first face detection box coordinates respectively corresponding to the at least one face image frame;
    and wherein extracting face region pictures from the face image frames according to the face detection box coordinates comprises:
    extracting, according to each set of first face detection box coordinates, a corresponding first face region picture from the face image frame corresponding to those first face detection box coordinates;
    inputting each first face region picture into the preset recognition model to obtain the face key point coordinates and the eye gaze offset vector corresponding to each first face region picture;
    determining a face bounding rectangle corresponding to the face key point coordinates of each first face region picture;
    determining, according to the face bounding rectangles and a preset estimation algorithm, second face detection box coordinates corresponding to at least one face image frame following the at least one face image frame;
    extracting, according to the determined second face detection box coordinates, a corresponding second face region picture from the face image frame corresponding to the second face detection box coordinates.
7. The method according to any one of claims 1-6, wherein the part of the preset recognition model related to the human eye sight offset vector output layer is trained in the following manner:
    obtaining, from a sample data set, the normal face region pictures corresponding to normal face shaking video stream data and the face paper region pictures corresponding to face paper shaking video stream data, wherein the sample data set includes a plurality of normal face shaking video stream data and a plurality of face paper shaking video stream data;
    inputting the normal face region pictures and the face paper region pictures into the preset recognition model to obtain the face key point coordinates and the human eye sight offset vectors that are output by the preset recognition model and respectively correspond to the normal face region pictures and the face paper region pictures;
    determining the face shaking degree sequences corresponding to the normal face shaking video stream data and the face paper shaking video stream data, respectively using the face key point coordinate sequence corresponding to the normal face shaking video stream data and the face key point coordinate sequence corresponding to the face paper shaking video stream data;
    for each normal face shaking video stream data and each face paper shaking video stream data, determining the face key point coordinates corresponding to face shaking degrees within a predetermined face shaking degree range as first target face key point coordinates;
    for each normal face shaking video stream data and each face paper shaking video stream data, determining, among the human eye sight offset vectors corresponding to the first target face key point coordinates, the difference between the largest human eye sight offset vector and the smallest human eye sight offset vector as the score of that normal face shaking video stream data or that face paper shaking video stream data;
    determining a score threshold using each of the scores; and
    training the preset recognition model based on the score threshold;
    wherein the determining, according to the face key point coordinates corresponding to each face image frame and the human eye sight offset vector, whether the face shaking video stream data passes the current stage of living body detection comprises:
    determining, from the face key point coordinates corresponding to each face image frame, the face key point coordinates corresponding to face shaking degrees within the predetermined face shaking degree range as second target face key point coordinates;
    determining, according to the second target face key point coordinates and the human eye sight offset vector, the score corresponding to the face shaking video stream data to be subjected to living body detection; and
    if the score reaches the score threshold, determining that the current stage of living body detection is passed; otherwise, determining that the current stage of living body detection is not passed.
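The scoring rule in claim 7 exploits the fact that a live user keeps looking at the camera while turning the head, so the eye sight offset swings widely across the shake, whereas a printed face's gaze stays fixed relative to the head. A sketch of the per-video score, assuming head yaw in degrees as the face shaking degree and Euclidean norms to order the offset vectors (both are assumptions; the claims fix neither):

```python
import numpy as np

def shake_score(yaw_degrees, gaze_offsets, yaw_range=(-30.0, 30.0)):
    """Score one head-shaking video from per-frame model outputs.

    yaw_degrees:  per-frame face shaking degree derived from the face
                  key point coordinates (treated here as head yaw).
    gaze_offsets: per-frame human eye sight offset vectors, shape (T, 2).
    yaw_range:    the predetermined face shaking degree range; these
                  bounds are illustrative, not taken from the claims.
    """
    yaw = np.asarray(yaw_degrees, dtype=float)
    offsets = np.asarray(gaze_offsets, dtype=float)
    in_range = (yaw >= yaw_range[0]) & (yaw <= yaw_range[1])
    if not in_range.any():
        return 0.0  # no frame within the predetermined range
    # The claims compare the largest and smallest offset vectors;
    # ordering them by Euclidean norm is an assumption of this sketch.
    norms = np.linalg.norm(offsets[in_range], axis=1)
    return float(norms.max() - norms.min())

# A live face yields a large score; a rotated printout keeps gaze fixed
# to the head, so its score stays below the learned threshold.
```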
8. An apparatus for detecting a living body of a human face, wherein the apparatus comprises:
    an input module configured to input a face region picture corresponding to face shaking video stream data to be subjected to living body detection into a preset recognition model to obtain face key point coordinates and a human eye sight offset vector output by the preset recognition model, wherein the preset recognition model is a face key point detection model combined with a human eye sight offset vector output layer, the face key point detection model includes convolutional layers, the human eye sight offset vector output layer is connected to the last of the convolutional layers in the face key point detection model, the face key point coordinates and the human eye sight offset vector respectively correspond to the face image frames included in the face shaking video stream data, and the human eye sight offset vector is used to measure the degree to which the human eye sight deviates while the face shakes; and
    a judgment module configured to determine, according to the face key point coordinates corresponding to each face image frame and the human eye sight offset vector, whether the face shaking video stream data passes the current stage of living body detection.
9. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-readable instructions which, when executed by a computer, cause the computer to perform the following method:
    inputting a face region picture corresponding to face shaking video stream data to be subjected to living body detection into a preset recognition model to obtain face key point coordinates and a human eye sight offset vector output by the preset recognition model, wherein the preset recognition model is a face key point detection model combined with a human eye sight offset vector output layer, the face key point detection model includes convolutional layers, the human eye sight offset vector output layer is connected to the last of the convolutional layers in the face key point detection model, the face key point coordinates and the human eye sight offset vector respectively correspond to the face image frames included in the face shaking video stream data, and the human eye sight offset vector is used to measure the degree to which the human eye sight deviates while the face shakes; and
    determining, according to the face key point coordinates corresponding to each face image frame and the human eye sight offset vector, whether the face shaking video stream data passes the current stage of living body detection.
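Claims 8 and 9 describe the preset recognition model as a key point detector with a second output head wired to the last convolutional layer. A PyTorch sketch of that two-headed layout, with illustrative layer sizes and an assumed 68-point key point convention:

```python
import torch
import torch.nn as nn

class KeypointGazeNet(nn.Module):
    """Sketch of the preset recognition model: a convolutional face key
    point detector whose last convolutional layer also feeds a human eye
    sight offset vector output layer. Layer sizes and the 68-point key
    point convention are illustrative assumptions."""

    def __init__(self, num_keypoints: int = 68):
        super().__init__()
        self.backbone = nn.Sequential(          # the convolutional layers
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Key point head: one (x, y) pair per key point.
        self.keypoint_head = nn.Linear(128, num_keypoints * 2)
        # Gaze head, connected to the same last-convolutional features.
        self.gaze_head = nn.Linear(128, 2)

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x).flatten(1)
        return self.keypoint_head(feats), self.gaze_head(feats)

# Usage: keypoints, gaze = KeypointGazeNet()(torch.randn(1, 3, 112, 112))
```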
10. The computer-readable storage medium according to claim 9, wherein, before the face region picture corresponding to the face shaking video stream data to be subjected to living body detection is input into the preset recognition model to obtain the face key point coordinates and the human eye sight offset vector output by the preset recognition model, the computer-readable instructions, when executed by the computer, further cause the computer to perform:
    deframing the face shaking video stream data to be subjected to living body detection to obtain the face image frames corresponding to the face shaking video stream data;
    inputting the face image frames into a preset face detection model to obtain the face detection frame coordinates corresponding to the face image frames; and
    extracting face region pictures from the face image frames according to the face detection frame coordinates.
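A sketch of the preprocessing in claim 10: deframe the video with OpenCV and crop one face region picture per frame. The Haar cascade here stands in for the unspecified preset face detection model, and one face per frame is an assumption:

```python
import cv2

# Hypothetical detector built from OpenCV's bundled Haar cascade; it
# stands in for the unspecified preset face detection model.
_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return _cascade.detectMultiScale(gray)       # boxes as (x, y, w, h)

def extract_face_regions(video_path):
    """Deframe a head-shaking video and crop one face region picture
    per frame (a single face per frame is assumed)."""
    cap = cv2.VideoCapture(video_path)
    regions = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break                                # end of stream
        boxes = detect_faces(frame)
        if len(boxes):
            x, y, w, h = boxes[0]
            regions.append(frame[y:y + h, x:x + w])
    cap.release()
    return regions
```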
11. The computer-readable storage medium according to claim 9, wherein, after determining, according to the face key point coordinates corresponding to each face image frame and the human eye sight offset vector, whether the face shaking video stream data passes the current stage of living body detection, the computer-readable instructions, when executed by the computer, further cause the computer to perform:
    in the case that the current stage of living body detection is passed, obtaining the face video stream data subsequent to the face shaking video stream data; and
    performing silent living body detection on the face video stream data.
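Claim 11 chains the stages: the head-shaking (action) check gates a silent check on the footage that follows it. A minimal sketch with both checks as assumed callables:

```python
def liveness_pipeline(shake_stream, follow_up_stream,
                      action_check, silent_check):
    """Chain the two stages: the head-shaking (action) check gates the
    silent check. Both checks are assumed callables returning bool."""
    if not action_check(shake_stream):
        return False                    # fail fast on the action stage
    return silent_check(follow_up_stream)
```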
12. The computer-readable storage medium according to claim 10, wherein the inputting the face image frames into the preset face detection model to obtain the face detection frame coordinates corresponding to the face image frames specifically comprises:
    inputting each of the face image frames into the preset face detection model to obtain the face detection frame coordinates corresponding to each of the face image frames;
    and the extracting face region pictures from the face image frames according to the face detection frame coordinates specifically comprises:
    extracting a face region picture from each of the face image frames according to the respective face detection frame coordinates.
13. The computer-readable storage medium according to claim 10, wherein the inputting the face image frames into the preset face detection model to obtain the face detection frame coordinates corresponding to the face image frames specifically comprises:
    inputting at least one of the face image frames into the preset face detection model to obtain first face detection frame coordinates respectively corresponding to the at least one face image frame;
    and the extracting face region pictures from the face image frames according to the face detection frame coordinates specifically comprises:
    extracting, according to each of the first face detection frame coordinates, a corresponding first face region picture from the face image frame corresponding to the first face detection frame coordinates;
    inputting each of the first face region pictures into the preset recognition model to obtain the face key point coordinates and the human eye sight offset vector corresponding to each of the first face region pictures;
    determining the circumscribed rectangle of the face corresponding to the face key point coordinates of each of the first face region pictures;
    determining, according to the circumscribed rectangle of the face and a preset estimation algorithm, second face detection frame coordinates corresponding to at least one face image frame subsequent to the at least one face image frame; and
    extracting, according to the determined second face detection frame coordinates, a corresponding second face region picture from the face image frame corresponding to the second face detection frame coordinates.
14. The computer-readable storage medium according to any one of claims 9-13, wherein the part of the preset recognition model related to the human eye sight offset vector output layer is trained in the following manner:
    obtaining, from a sample data set, the normal face region pictures corresponding to normal face shaking video stream data and the face paper region pictures corresponding to face paper shaking video stream data, wherein the sample data set includes a plurality of normal face shaking video stream data and a plurality of face paper shaking video stream data;
    inputting the normal face region pictures and the face paper region pictures into the preset recognition model to obtain the face key point coordinates and the human eye sight offset vectors that are output by the preset recognition model and respectively correspond to the normal face region pictures and the face paper region pictures;
    determining the face shaking degree sequences corresponding to the normal face shaking video stream data and the face paper shaking video stream data, respectively using the face key point coordinate sequence corresponding to the normal face shaking video stream data and the face key point coordinate sequence corresponding to the face paper shaking video stream data;
    for each normal face shaking video stream data and each face paper shaking video stream data, determining the face key point coordinates corresponding to face shaking degrees within a predetermined face shaking degree range as first target face key point coordinates;
    for each normal face shaking video stream data and each face paper shaking video stream data, determining, among the human eye sight offset vectors corresponding to the first target face key point coordinates, the difference between the largest human eye sight offset vector and the smallest human eye sight offset vector as the score of that normal face shaking video stream data or that face paper shaking video stream data;
    determining a score threshold using each of the scores; and
    training the preset recognition model based on the score threshold;
    wherein the determining, according to the face key point coordinates corresponding to each face image frame and the human eye sight offset vector, whether the face shaking video stream data passes the current stage of living body detection specifically comprises:
    determining, from the face key point coordinates corresponding to each face image frame, the face key point coordinates corresponding to face shaking degrees within the predetermined face shaking degree range as second target face key point coordinates;
    determining, according to the second target face key point coordinates and the human eye sight offset vector, the score corresponding to the face shaking video stream data to be subjected to living body detection; and
    if the score reaches the score threshold, determining that the current stage of living body detection is passed; otherwise, determining that the current stage of living body detection is not passed.
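Claim 14 (like claim 7) leaves open how the score threshold is derived from the training scores. One plausible, assumed rule is to place the threshold in the gap between the score populations of the paper attacks and the live videos:

```python
import numpy as np

def choose_threshold(live_scores, paper_scores):
    """Pick a score threshold from training scores. The claims do not
    fix the rule; splitting the gap between the attack and live score
    populations is one plausible, assumed choice."""
    upper_attack = np.percentile(paper_scores, 95)  # most attacks below
    lower_live = np.percentile(live_scores, 5)      # most live above
    return float((upper_attack + lower_live) / 2.0)
```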
15. An electronic device, wherein the electronic device comprises:
    a processor; and
    a memory storing computer-readable instructions which, when executed by the processor, implement the following method:
    inputting a face region picture corresponding to face shaking video stream data to be subjected to living body detection into a preset recognition model to obtain face key point coordinates and a human eye sight offset vector output by the preset recognition model, wherein the preset recognition model is a face key point detection model combined with a human eye sight offset vector output layer, the face key point detection model includes convolutional layers, the human eye sight offset vector output layer is connected to the last of the convolutional layers in the face key point detection model, the face key point coordinates and the human eye sight offset vector respectively correspond to the face image frames included in the face shaking video stream data, and the human eye sight offset vector is used to measure the degree to which the human eye sight deviates while the face shakes; and
    determining, according to the face key point coordinates corresponding to each face image frame and the human eye sight offset vector, whether the face shaking video stream data passes the current stage of living body detection.
16. The electronic device according to claim 15, wherein, before the face region picture corresponding to the face shaking video stream data to be subjected to living body detection is input into the preset recognition model to obtain the face key point coordinates and the human eye sight offset vector output by the preset recognition model, the computer-readable instructions, when executed by the processor, further implement:
    deframing the face shaking video stream data to be subjected to living body detection to obtain the face image frames corresponding to the face shaking video stream data;
    inputting the face image frames into a preset face detection model to obtain the face detection frame coordinates corresponding to the face image frames; and
    extracting face region pictures from the face image frames according to the face detection frame coordinates.
17. The electronic device according to claim 15, wherein, after determining, according to the face key point coordinates corresponding to each face image frame and the human eye sight offset vector, whether the face shaking video stream data passes the current stage of living body detection, the computer-readable instructions, when executed by the processor, further implement:
    in the case that the current stage of living body detection is passed, obtaining the face video stream data subsequent to the face shaking video stream data; and
    performing silent living body detection on the face video stream data.
18. The electronic device according to claim 16, wherein the inputting the face image frames into the preset face detection model to obtain the face detection frame coordinates corresponding to the face image frames specifically comprises:
    inputting each of the face image frames into the preset face detection model to obtain the face detection frame coordinates corresponding to each of the face image frames;
    and the extracting face region pictures from the face image frames according to the face detection frame coordinates specifically comprises:
    extracting a face region picture from each of the face image frames according to the respective face detection frame coordinates.
19. The electronic device according to claim 16, wherein the inputting the face image frames into the preset face detection model to obtain the face detection frame coordinates corresponding to the face image frames specifically comprises:
    inputting at least one of the face image frames into the preset face detection model to obtain first face detection frame coordinates respectively corresponding to the at least one face image frame;
    and the extracting face region pictures from the face image frames according to the face detection frame coordinates specifically comprises:
    extracting, according to each of the first face detection frame coordinates, a corresponding first face region picture from the face image frame corresponding to the first face detection frame coordinates;
    inputting each of the first face region pictures into the preset recognition model to obtain the face key point coordinates and the human eye sight offset vector corresponding to each of the first face region pictures;
    determining the circumscribed rectangle of the face corresponding to the face key point coordinates of each of the first face region pictures;
    determining, according to the circumscribed rectangle of the face and a preset estimation algorithm, second face detection frame coordinates corresponding to at least one face image frame subsequent to the at least one face image frame; and
    extracting, according to the determined second face detection frame coordinates, a corresponding second face region picture from the face image frame corresponding to the second face detection frame coordinates.
20. The electronic device according to any one of claims 15-19, wherein the part of the preset recognition model related to the human eye sight offset vector output layer is trained in the following manner:
    obtaining, from a sample data set, the normal face region pictures corresponding to normal face shaking video stream data and the face paper region pictures corresponding to face paper shaking video stream data, wherein the sample data set includes a plurality of normal face shaking video stream data and a plurality of face paper shaking video stream data;
    inputting the normal face region pictures and the face paper region pictures into the preset recognition model to obtain the face key point coordinates and the human eye sight offset vectors that are output by the preset recognition model and respectively correspond to the normal face region pictures and the face paper region pictures;
    determining the face shaking degree sequences corresponding to the normal face shaking video stream data and the face paper shaking video stream data, respectively using the face key point coordinate sequence corresponding to the normal face shaking video stream data and the face key point coordinate sequence corresponding to the face paper shaking video stream data;
    for each normal face shaking video stream data and each face paper shaking video stream data, determining the face key point coordinates corresponding to face shaking degrees within a predetermined face shaking degree range as first target face key point coordinates;
    for each normal face shaking video stream data and each face paper shaking video stream data, determining, among the human eye sight offset vectors corresponding to the first target face key point coordinates, the difference between the largest human eye sight offset vector and the smallest human eye sight offset vector as the score of that normal face shaking video stream data or that face paper shaking video stream data;
    determining a score threshold using each of the scores; and
    training the preset recognition model based on the score threshold;
    wherein the determining, according to the face key point coordinates corresponding to each face image frame and the human eye sight offset vector, whether the face shaking video stream data passes the current stage of living body detection specifically comprises:
    determining, from the face key point coordinates corresponding to each face image frame, the face key point coordinates corresponding to face shaking degrees within the predetermined face shaking degree range as second target face key point coordinates;
    determining, according to the second target face key point coordinates and the human eye sight offset vector, the score corresponding to the face shaking video stream data to be subjected to living body detection; and
    if the score reaches the score threshold, determining that the current stage of living body detection is passed; otherwise, determining that the current stage of living body detection is not passed.
PCT/CN2020/135548 2020-10-12 2020-12-11 Face detection method, apparatus, medium, and electronic device WO2021179719A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011086784.6 2020-10-12
CN202011086784.6A CN112149615A (en) 2020-10-12 2020-10-12 Face living body detection method, device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
WO2021179719A1 (en)

Family

ID=73953002

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/135548 WO2021179719A1 (en) 2020-10-12 2020-12-11 Face detection method, apparatus, medium, and electronic device

Country Status (2)

Country Link
CN (1) CN112149615A (en)
WO (1) WO2021179719A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668553B (en) * 2021-01-18 2022-05-13 东莞先知大数据有限公司 Method, device, medium and equipment for detecting discontinuous observation behavior of driver
CN113392810A (en) * 2021-07-08 2021-09-14 北京百度网讯科技有限公司 Method, apparatus, device, medium and product for in vivo detection
CN113642428B (en) * 2021-07-29 2022-09-27 北京百度网讯科技有限公司 Face living body detection method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170169304A1 (en) * 2015-12-09 2017-06-15 Beijing Kuangshi Technology Co., Ltd. Method and apparatus for liveness detection
CN109886087A (en) * 2019-01-04 2019-06-14 平安科技(深圳)有限公司 A kind of biopsy method neural network based and terminal device
CN109977771A (en) * 2019-02-22 2019-07-05 杭州飞步科技有限公司 Verification method, device, equipment and the computer readable storage medium of driver identification
US20190377963A1 (en) * 2018-06-11 2019-12-12 Laurence Hamid Liveness detection
CN111160251A (en) * 2019-12-30 2020-05-15 支付宝实验室(新加坡)有限公司 Living body identification method and device
CN111401127A (en) * 2020-01-16 2020-07-10 创意信息技术股份有限公司 Human face living body detection joint judgment method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116110111A (en) * 2023-03-23 2023-05-12 平安银行股份有限公司 Face recognition method, electronic equipment and storage medium
CN116110111B (en) * 2023-03-23 2023-09-08 平安银行股份有限公司 Face recognition method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112149615A (en) 2020-12-29

Similar Documents

Publication Publication Date Title
KR102063037B1 (en) Identity authentication method, terminal equipment and computer readable storage medium
WO2021179719A1 (en) Face detection method, apparatus, medium, and electronic device
CN108875833B (en) Neural network training method, face recognition method and device
US10832069B2 (en) Living body detection method, electronic device and computer readable medium
WO2018228218A1 (en) Identification method, computing device, and storage medium
WO2018028546A1 (en) Key point positioning method, terminal, and computer storage medium
WO2018177379A1 (en) Gesture recognition, gesture control and neural network training methods and apparatuses, and electronic device
WO2018188453A1 (en) Method for determining human face area, storage medium, and computer device
WO2020024484A1 (en) Method and device for outputting data
CN111767900B (en) Face living body detection method, device, computer equipment and storage medium
WO2022105118A1 (en) Image-based health status identification method and apparatus, device and storage medium
WO2022100337A1 (en) Face image quality assessment method and apparatus, computer device and storage medium
WO2022188697A1 (en) Biological feature extraction method and apparatus, device, medium, and program product
WO2020124993A1 (en) Liveness detection method and apparatus, electronic device, and storage medium
WO2020006964A1 (en) Image detection method and device
WO2020238321A1 (en) Method and device for age identification
WO2023035531A1 (en) Super-resolution reconstruction method for text image and related device thereof
WO2020124994A1 (en) Liveness detection method and apparatus, electronic device, and storage medium
WO2021047069A1 (en) Face recognition method and electronic terminal device
WO2020052062A1 (en) Detection method and device
WO2023124040A1 (en) Facial recognition method and apparatus
WO2021169616A1 (en) Method and apparatus for detecting face of non-living body, and computer device and storage medium
WO2023173646A1 (en) Expression recognition method and apparatus
WO2021159669A1 (en) Secure system login method and apparatus, computer device, and storage medium
US11741986B2 (en) System and method for passive subject specific monitoring

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20924217

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20924217

Country of ref document: EP

Kind code of ref document: A1