CN105989264B - Biological characteristic living body detection method and system

Info

Publication number: CN105989264B
Application number: CN201510053281.1A
Authority: CN (China)
Other versions: CN105989264A (original language: Chinese, zh)
Inventor: 邓琼 (Deng Qiong)
Assignee (original and current): Beijing Authenmetric Data Technology Co., Ltd.
Legal status: Active (application granted)


Abstract

The application provides a biometric living body detection method and system for rejecting spoofing attacks mounted with face photographs, recorded face video, recorded voice and the like. First, a random action and voice sequence instruction is generated; the random action sequence instruction is then presented to the user both visually and audibly, while the user's face video and voice data are captured and fed back to the user in real time; whether the user is a living person is judged by analyzing the degree to which the captured data conforms to the random action sequence instruction. In this technical scheme, the action and voice instructions are generated randomly and are therefore difficult to attack with face photographs, videos or voice recordings prepared in advance; the visual and auditory presentation of the random action sequence instruction helps the user understand the instruction; and the synchronous feedback presentation of the user's video and voice effectively guides the user to perform the corresponding actions and utterances. The security of identity authentication is thereby improved, together with the usability and user experience of the product.

Description

Biological characteristic living body detection method and system
Technical Field
The invention relates to the technical field of automatic image analysis and biometric recognition, and in particular to a biometric living body detection method and system.
Background
Biometric identification has important applications in the field of identity authentication and authorization, for example, identity authentication in mobile payment is performed by using a face recognition means, and identity security in internet and mobile internet services can be enhanced. However, face recognition systems are vulnerable to face forgery, which raises information and identity security issues. For example, an attacker can acquire a face image of an account owner by some means, make a photograph, a video or a mask and other counterfeit features, and present the image in front of the recognition system to replace a real face to obtain illegal rights.
The main techniques currently used to distinguish real persons from forged faces fall into two categories. The first is texture-based: it captures the rich detail texture of human skin and distinguishes real from counterfeit features by analyzing the high-frequency components of that texture. The second is based on analysis of motion patterns: a person is usually in front of some background, and for a real person the motion of the body region and the motion of the background are relatively independent, so real persons and counterfeit features can be distinguished by analyzing the relative motion patterns of the body region and the background region.
However, the inventors have found that with the continual improvement of photo capture and printing technology, high-definition photographs of human skin containing rich texture detail can now be obtained, which greatly reduces the reliability of the first, texture-based category of methods; and the second category of methods generally cannot defeat the attack mode of video playback. At present there is no face recognition method or system that can effectively recognize real persons and resist counterfeit-feature attacks.
Disclosure of Invention
In order to overcome the problems of existing biometric anti-spoofing technology, the invention provides a biometric living body detection method and system based on combining human-computer interaction with pattern recognition, so as to reject spoofing attacks such as face photographs, face videos and voice recordings and improve the security of biometric recognition.
The biological characteristic in-vivo detection method and system provided by the application are detailed as follows.
According to a first aspect of embodiments of the present application, there is provided a biometric in-vivo detection method, including:
generating a random action sequence instruction;
transcoding the random motion sequence instructions into text, visual and/or audible coding and presenting in visual frames, audible sounds or a combination thereof;
collecting a user response image sequence;
synchronizing the response image sequence with the random motion sequence instructions for visual presentation;
analyzing a sequence of response actions of a user in the sequence of response images;
and judging whether the response action sequence conforms to the action sequence corresponding to the random action sequence instruction, and if so, judging that the response action sequence is from the living human.
The biological characteristic living body detection method can further comprise the following steps:
generating a random voice sequence instruction;
transcoding the random speech sequence instructions into text, visual and/or audible coding and presenting in visual frames, audible sounds or a combination thereof;
collecting a user response voice sequence;
analyzing the response voice of the user in the response voice sequence;
and judging whether the response voice conforms to the voice sequence corresponding to the random voice sequence instruction, and if so, judging that the response voice sequence is from the living human.
The biological feature living body detection method assigns a time stamp to each action in a random action sequence instruction, the time stamp is used for identifying the action time of each action or the starting time and the ending time of each action, and the time stamp is generated randomly.
Wherein the analyzing the sequence of response actions of the user in the sequence of response images comprises:
detecting a face in each image from the sequence of response images;
carrying out key point positioning on each face;
calculating a head posture corner according to the positioned key points of the human face;
calculating the facial expression type according to the positioned facial key points;
obtaining a response action sequence of the user according to the head posture corner and the facial expression type;
comparing the response action sequence with the action sequence corresponding to the random action sequence instruction, and calculating action type conformity;
and comparing the action type conformity with a first preset threshold, if the action type conformity is greater than the first preset threshold, judging that the response action sequence is from the living body, otherwise, not considering that the response action sequence is from the living body.
Wherein the analyzing the response action sequence of the user in the response image sequence may further include:
for each action in the sequence of responsive actions, calculating an action time for each action;
comparing the calculated action time of each action with the timestamp of each action, and calculating the action time conformity;
calculating the action overall conformity, where action overall conformity = action type conformity + W × action time conformity, and W is a weight;
and comparing the action overall conformity with a second preset threshold, if the action overall conformity is greater than the second preset threshold, judging that the response action sequence is from the living human, otherwise, not considering that the response action sequence is from the living human.
The biological characteristic living body detection method can further comprise the following steps:
identifying the content of the response voice sequence;
calculating the voice content conformity of the response voice sequence;
calculating the total conformity, where total conformity = action type conformity + w1 × action time conformity + w2 × voice content conformity, and w1 and w2 are weights;
and comparing the overall conformity with a third preset threshold, if the overall conformity is greater than the third preset threshold, judging that the response action sequence is from the living human, otherwise, not considering that the response action sequence is from the living human.
And setting the complexity of a random action sequence instruction and the sizes of the first preset threshold, the second preset threshold and the third preset threshold according to the safety level.
In accordance with a second aspect of embodiments of the present application, corresponding to a first aspect of embodiments of the present application, there is provided a biometric in-vivo detection system, comprising:
the action sequence instruction generating unit is used for generating a random action sequence instruction;
the action instruction presenting unit, which comprises a display and a loudspeaker and is used for converting the codes of the random action sequence instruction into text, visual and/or auditory encodings and presenting them as visual pictures, auditory sounds, or a combination of the two;
the display is used for displaying the text and/or the visual coded picture of the random motion sequence instruction, and the loudspeaker is used for playing the text and/or the auditory coded sound of the random motion sequence instruction;
the image acquisition unit is used for acquiring a user response face image sequence;
a response action presenting unit, which is used for synchronizing the response image sequence and the random action sequence instruction for visual presentation;
the action analysis unit is used for analyzing a response action sequence of the user in the response face image sequence;
and the action conformity degree judging unit is used for judging whether the response action sequence conforms to the action sequence corresponding to the random action sequence instruction or not, and if so, judging that the response action sequence is from the living human.
Wherein, the biological characteristic living body detection system can also comprise:
the voice instruction generating unit is used for generating a random voice instruction;
the voice instruction presenting unit, which comprises a display and a loudspeaker and is used for converting the codes of the random voice sequence instruction into text, visual and/or auditory encodings and presenting them as visual pictures, auditory sounds, or a combination of the two;
the voice acquisition unit is used for acquiring a user response voice sequence;
a voice analyzing unit for analyzing a response voice of the user in the response voice sequence;
and the voice conformity degree judging unit is used for judging whether the response voice conforms to the voice sequence corresponding to the random voice sequence instruction or not, and if so, judging that the response voice sequence is from the living human.
The random action sequence instruction generation unit assigns a time stamp for each action in the random action sequence instruction, the time stamp is used for identifying the action time of each action or the starting time and the ending time of each action, and the time stamp is generated randomly.
Wherein the motion analysis unit includes:
a face detection subunit, configured to detect a face in each image from the response action sequence;
the key point positioning subunit is used for positioning the key points of each face;
the head pose corner calculating subunit is used for calculating a head pose corner according to the positioned human face key points;
the facial expression type calculating subunit is used for calculating the facial expression type according to the positioned facial key points;
the action sequence identification subunit is used for obtaining the response action sequence according to the head posture corner and the facial expression type;
the action conformity degree judging unit comprises:
the action type conformity calculation subunit compares the response action sequence with the action sequence corresponding to the action sequence instruction and calculates the action type conformity;
and the first judgment subunit is used for comparing the action type conformity with a first preset threshold, if the action type conformity is greater than the first preset threshold, the response action type of the person in the response action sequence conforms to the random action sequence instruction, and judging that the response action sequence is from the living person, otherwise, the response action sequence is not from the living person.
Wherein, the action analysis unit may further include:
an action time calculating subunit, configured to calculate, for each action in the response action sequence, an action time of each action;
the action time conformity degree calculating subunit is used for comparing the calculated action time of each action with the time stamp of each action and calculating the action time conformity degree;
the action overall conformity calculating subunit, which is used for calculating the action overall conformity, where action overall conformity = action type conformity + w × action time conformity, and w is a weight;
and the second judgment subunit is used for comparing the action overall conformity with a second preset threshold, if the action overall conformity is greater than the second preset threshold, the response action of the person in the response action sequence conforms to the action sequence corresponding to the random action sequence instruction, and judging that the response action sequence is from the living person, otherwise, the response action sequence is not considered to be from the living person.
Wherein, the biological characteristic living body detection system can also comprise:
a voice analysis unit for recognizing the content of the response voice sequence;
a voice conformity degree calculating unit, configured to calculate the conformity degree of the response voice content of the response voice sequence;
a total conformity calculation unit for calculating a total conformity, which is an action type conformity + w1 × action time conformity + w2 × voice content conformity, wherein w1 and w2 are weights;
and the third judgment unit is used for comparing the overall conformity with a third preset threshold value, if the overall conformity is greater than the third preset threshold value, judging that the response action sequence is from the living body, otherwise, judging that the response action sequence is not from the living body.
And setting the complexity of a random action sequence instruction and the sizes of the first preset threshold, the second preset threshold and the third preset threshold according to the safety level.
The technical scheme provided by the embodiment of the application can have the following beneficial effects: the action and voice instruction are randomly generated, and the attack is difficult to be carried out by using a human face photo, a video or a voice corpus which are prepared in advance; the random action sequence instructions are presented visually and auditorily, so that the user is effectively helped to understand the instructions; the action and voice of the user are synchronously fed back and presented, and the user is effectively guided to make corresponding action and sound, so that the safety of identity authentication is improved, and the usability and user experience of the product are improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a flowchart illustrating a biometric liveness detection method according to an exemplary embodiment of the present application.
Fig. 2 is a schematic structural diagram of a biometric in-vivo detection system according to an exemplary embodiment of the present application.
FIG. 3 is a schematic diagram of a random motion sequence instruction visual presentation and a random voice sequence instruction visual presentation of a biometric liveness detection system.
FIG. 4 is a schematic diagram of a visual presentation of a random sequence of motion commands for a biometric liveness detection system with a vertical display.
FIG. 5 is a schematic diagram of a visual presentation of a random sequence of motion commands for a biometric liveness detection system with a landscape display.
FIG. 6 is a schematic diagram of the simultaneous visual presentation of text of random motion sequence instructions and a response motion sequence.
FIG. 7 is a schematic diagram of the synchronized display of random voice sequence instructions and random motion sequence instructions along with a captured user response motion sequence.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present application, but it will be appreciated by those skilled in the art that the present application may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail as not to unnecessarily obscure aspects of the embodiments.
Fig. 1 is a schematic flow chart of a biometric in-vivo detection method according to an exemplary embodiment of the present application, as shown in fig. 1, the method includes:
in step S101, a random motion sequence command is generated.
The random action sequence instruction indicates how the user should act. It consists of action type descriptions and may further include a time stamp assigned to each action, where the time stamp identifies either the action time of each action or the start time and end time of each action. A time stamp identifying the action time of each action is a relative time stamp, representing the duration of each action type in the random action sequence instruction; a time stamp identifying the start time and end time of each action is an absolute time stamp, from which the action time of each action can be calculated, i.e. the end time of the action minus its start time. The relative or absolute time stamps may themselves be randomly generated so that the random action sequence instruction has a higher degree of randomness; when absolute time stamps are randomly generated, the start time of the next action must be no earlier than the end time of the previous action. The action sequence instruction may be a single action instruction or a combination of multiple action instructions. Generating the random action sequence instruction may include:
(a1) the number of actions N is randomly determined, for example, N is 4.
(a2) From the candidate action type set, N action types are randomly selected and combined, and the order of the N action types in the combination is random, for example, 4 action types are randomly generated (head left turn 30 degrees → right turn 10 degrees → left turn 20 degrees → right turn 40 degrees), or (head left turn 30 degrees → turn right turn 0 degree → open mouth → close mouth).
(a3) Randomly assign an action time and/or a number of repetitions to each action type. The action time is the duration of each action type and can be added to the action sequence instruction as a relative or absolute time stamp, i.e. a time stamp identifying the action time is added to the description of each action type. Adding time stamps to the action sequence instruction makes it easy to associate and separate the individual action types, giving good consistency and reducing errors. When each action type is specified with an absolute time stamp, the start time of the next action equals the end time of the previous action.
The number of repetitions may also be specified for each action type; the default is 1. If the user performs the action more than the specified number of times within the specified action time, subsequent recognition only checks whether the action was performed within that time, not how many times. It is also possible to specify only the number of repetitions without an action time, or to specify both, with the action time and repetition count being the same or different across action types; some action types may have an action time while others do not, and likewise for the repetition count. For example, the action times and repetition counts may be specified as: shake the head from left to right 2 times, action time 2 seconds; shake the head from top to bottom 3 times, action time 3 seconds; open the mouth 2 times, action time 1 second; close the eyes 3 times, action time 2 seconds. The corresponding action sequence instruction is: (shake head left-to-right 2 times, action time 2 seconds) → (shake head top-to-bottom 3 times, action time 3 seconds) → (open mouth 2 times, action time 1 second) → (close eyes 3 times, action time 2 seconds), in which the action time is specified with relative time stamps. When the action time is specified with absolute time stamps, the corresponding action sequence instruction may be: (from second 0 to second 2, shake head left-to-right 2 times) → (from second 2 to second 5, shake head top-to-bottom 3 times) → (from second 5 to second 6, open mouth 2 times) → (from second 6 to second 8, close eyes 3 times).
The random action sequence command may have different complexity according to different security levels, for example, when the security level is high, the number N of actions in the action sequence is increased.
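As a purely illustrative sketch of step S101 (not a prescribed implementation), the Python code below generates a random action sequence instruction with randomly chosen action types, repetition counts and time stamps; the candidate action set and the value ranges are assumptions made for the example only.

    import random

    # Candidate action types and the value ranges below are illustrative assumptions.
    CANDIDATE_ACTIONS = ["shake head left-right", "nod up-down", "open mouth", "close eyes", "blink"]

    def generate_action_sequence(min_actions=3, max_actions=5):
        n = random.randint(min_actions, max_actions)      # (a1) random number of actions N
        sequence, start = [], 0.0
        for _ in range(n):
            action = random.choice(CANDIDATE_ACTIONS)     # (a2) random action type, random order
            repetitions = random.randint(1, 3)            # (a3) random number of repetitions
            duration = random.randint(1, 4)               # (a3) random action time in seconds
            sequence.append({
                "type": action,
                "repetitions": repetitions,
                "duration": duration,                     # relative time stamp
                "start": start, "end": start + duration,  # absolute time stamps
            })
            start += duration                             # next action starts when the previous ends
        return sequence

For a higher security level, min_actions and max_actions can simply be increased, raising the complexity of the instruction as described above.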
And step S102, converting the codes of the random motion sequence instructions into text, visual and/or auditory codes, and presenting the codes in a visual picture, auditory sound or combination mode of the visual picture and the auditory sound.
The codes of the random action sequence instruction are converted into text and/or visual encodings and presented as text, images or animations of actions such as "open mouth", "close mouth", "turn head left", "lower head" or "blink", which are then shown to the user visually on a display. Alternatively, the codes are converted into text and/or auditory encodings: the text is converted into speech by a TTS (Text To Speech) engine and, for example, the prompts "open mouth", "close mouth", "turn head left", "lower head", "blink" are broadcast through a loudspeaker. The instruction may also be presented to the user as a combination of both visual and auditory forms. Presenting and prompting both visually and audibly helps the user understand the instruction and perform the corresponding action promptly.
Step S103, collecting a user response image sequence.
The action of a user can be shot by using a camera or other image video shooting equipment, so that a response face image sequence is acquired, and each image in the response face image sequence is a video frame obtained by shooting.
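For illustration only, a minimal capture loop is sketched below using the OpenCV library (any camera API could be used instead); each frame is stored together with its capture time so the sequence can later be segmented by the instruction's time stamps. The device index and the way the duration is chosen are assumptions.

    import time
    import cv2  # OpenCV is one possible choice of capture library

    def capture_response_sequence(duration_s, device_index=0):
        """Collect (capture_time, frame) pairs from the camera for duration_s seconds."""
        cap = cv2.VideoCapture(device_index)
        frames, t0 = [], time.time()
        while time.time() - t0 < duration_s:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append((time.time() - t0, frame))  # time relative to the start of recording
            # In the full system the frame would also be displayed together with the
            # instruction (step S104) to give the user synchronous visual feedback.
        cap.release()
        return frames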
Step S104, synchronizing the response image sequence and the random action sequence instruction for visual presentation;
the synchronous visual presentation is that the shot response image sequence picture and the random action sequence instruction are simultaneously displayed on a screen and timely fed back to a user, so that the user can adjust own action and the random action is consistent with the action sequence instruction.
And step S105, analyzing the response action sequence of the user in the response image sequence.
Wherein, step S105 includes:
(a1) detecting a human face in each image from the response action sequence;
(a2) carrying out key point positioning on each face;
(a3) calculating a head posture corner according to the positioned key points of the human face;
(a4) calculating the facial expression type according to the positioned facial key points;
(a5) calculating the response action sequence according to the head posture and the expression type;
(a6) and comparing the calculated response action sequence with the action sequence corresponding to the random action sequence instruction, and calculating the action type conformity.
If a face is detected in an image, the subsequent steps continue; if no face is detected, that image is skipped. If no face is detected in any image, the whole process ends, and the user can be prompted visually or audibly to restart.
After a face is detected in the image, the face is subjected to key point positioning, that is, for each face image, a plurality of corresponding preset key points are selected, for example, 68 key points are selected, and a face detail contour can be outlined according to the coordinates of the key points. And calculating the pose and expression classification of the face on the basis of the key points.
In another possible implementation, the head pose rotation angle and the facial expression type may be obtained by a feature estimation method: a large amount of face image data in different poses and with different expressions is acquired in advance, appearance features are extracted from it, and a pose estimation classifier is trained with an SVM, regression or similar method; the classifier is then used to estimate the pose and expression of a face image. For example, Gabor feature extraction or LBP (Local Binary Pattern) feature extraction can be performed on the face image, and an SVM (Support Vector Machine) can be trained to obtain a pose and expression classifier that performs pose estimation and expression classification on the face image.
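As one possible realization of the feature-estimation approach just described (a sketch under assumed training data, using the scikit-image and scikit-learn libraries rather than any particular implementation of this application), LBP histogram features can be extracted from grayscale face crops and fed to an SVM classifier:

    import numpy as np
    from skimage.feature import local_binary_pattern
    from sklearn.svm import SVC

    def lbp_histogram(gray_face, points=8, radius=1):
        """Uniform LBP histogram of a grayscale face crop, used as an appearance feature."""
        lbp = local_binary_pattern(gray_face, points, radius, method="uniform")
        hist, _ = np.histogram(lbp, bins=points + 2, range=(0, points + 2), density=True)
        return hist

    def train_pose_expression_classifier(face_images, labels):
        """face_images: grayscale face crops; labels: pose/expression classes (assumed available)."""
        features = np.array([lbp_histogram(img) for img in face_images])
        classifier = SVC(kernel="rbf")
        classifier.fit(features, labels)
        return classifier

    def classify_face(classifier, gray_face):
        return classifier.predict([lbp_histogram(gray_face)])[0]

Gabor features could be substituted for the LBP histograms without changing the rest of the pipeline.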
After the head posture corner and the face expression type corresponding to each face image are obtained, segmenting a response face image sequence according to the head posture corner and the face expression type so as to separate and identify the human body action corresponding to each action instruction and obtain a response action sequence. The segmentation can be performed according to the time stamp of the action sequence instruction, and can also be performed according to the action type in the action sequence instruction.
Segmenting the captured response face image sequence according to the time stamps of the action sequence instruction means segmenting by the action times derived from the time stamps: when the time stamp is a relative time stamp, the relative time stamp itself is the action time and the sequence can be segmented directly by it; when the time stamp is an absolute time stamp, the sequence can be segmented by the start and end times of each action. For example, if the action sequence instruction is (shake head left-to-right 2 times, action time 2 seconds) → (shake head top-to-bottom 3 times, action time 3 seconds) → (open mouth 2 times, action time 1 second) → (close eyes 3 times, action time 2 seconds), the times indicated by the time stamps are 2 seconds, 3 seconds, 1 second and 2 seconds respectively, and the response face image sequence is cut into segments of 2 seconds, 3 seconds, 1 second and 2 seconds. If the action sequence instruction is (from second 0 to second 2, shake head left-to-right 2 times) → (from second 2 to second 5, shake head top-to-bottom 3 times) → (from second 5 to second 6, open mouth 2 times) → (from second 6 to second 8, close eyes 3 times), the start and end times indicated by the time stamps are seconds 0 to 2, 2 to 5, 5 to 6 and 6 to 8 respectively, and the response face image sequence is cut at those start and end times. For each segmented response face image sequence, the human action corresponding to that segment is recognized by combining the detected head pose rotation angles and facial expression types of its images. Taking head shaking as an example, the head pose rotation angle differs from frame to frame while the head is shaking; by combining the head pose rotation angle data of the segment, extracting its action features, and applying a conventional human action recognition algorithm, the action corresponding to the segment (and its number of repetitions) can be obtained. Combining the actions and repetition counts of all segments in their original temporal order yields the response action sequence.
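Segmentation by relative time stamps then amounts to cutting the timestamped frame list at the cumulative instruction durations; a minimal sketch, assuming the (capture_time, frame) pairs and instruction dictionaries of the earlier sketches, is:

    def segment_by_relative_timestamps(frames, instruction):
        """frames: list of (capture_time, frame) pairs; instruction: list of dicts with a 'duration' field."""
        segments, start = [], 0.0
        for action in instruction:
            end = start + action["duration"]
            segments.append([frame for t, frame in frames if start <= t < end])
            start = end
        return segments  # one frame list per instructed action, in the original order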
Segmenting according to the action types in the action sequence instruction means recognizing actions over the whole response face image sequence, using the head pose rotation angle and facial expression type of each face image, in the order of the action types (and their repetition counts) given in the instruction. For example, if the first action type in the instruction is head shaking with 2 repetitions, the whole response face image sequence is searched for a head-shaking action and its repetition count. If a head-shaking action is recognized, the sub-sequence of response face images corresponding to it is cut out of the whole sequence, its position within the whole sequence (for example at the front end) is retained, and the number of repetitions recognized for that segment (e.g. the number of head shakes) is recorded. Then, according to the second action type in the instruction, action recognition is performed on the remaining response face image sequence. For example, if the second action type is nodding 3 times, the remaining sequence is searched for a nodding action and its repetition count; if a nodding action is recognized, the corresponding sub-sequence is cut out, its temporal position within the whole sequence and its temporal relation to the first cut-out segment (for example front, middle or tail of the whole sequence, before or after the first segment) are retained, and its repetition count is recorded. This is repeated until segmentation has been performed for the last action type in the instruction. After segmentation, the human action corresponding to each segment and its repetition count are combined in the original temporal order to obtain the response action sequence. Action recognition on a response face image sequence can use a conventional action recognition algorithm based on the head pose rotation angle and facial expression type of each image. When the response face image sequence is segmented by action type, action recognition is already part of the segmentation process; the recognition can be repeated on each cut-out segment to confirm correctness, or the response action sequence can be obtained directly from the segmentation result without repeating it.
When the response face image sequence is segmented according to the time stamps or the action types of the action sequence instruction, a relative or absolute time stamp can be added to each cut-out segment to identify the duration, or the start and end times, of the corresponding action. When segmenting by time stamp, each segment already has a definite duration or start/end time, so the duration or start/end time of the corresponding action can be identified without adding a time stamp. When segmenting by action type, the times of the first and last face images (in temporal order) of each cut-out segment are taken as the start time and end time of the action corresponding to that segment, and time stamps are added to the corresponding actions in the response action sequence accordingly. Adding time stamps to the response action sequence separates the action types, makes it possible to select the remaining response face image sequence after each action has been recognized during segmentation, and allows the action time of each action type to be calculated.
For the segmentation of the response face image sequence, segmenting by time stamp is simple, but it requires the user to complete the actions strictly on time; users often cannot time their actions precisely and only approximately meet the required duration (for example, asked to shake the head for 2 s, the actual shaking may last 2.2 s), so segments cut by time stamp may contain incomplete actions or residual frames of other actions, causing errors in action recognition. Segmenting by action type is more complex, but the complete human action can be recognized accurately from the segments it produces.
In one possible implementation, when the response face image sequence is segmented by action type, if the first action type of the instruction cannot be recognized anywhere in the response face image sequence, or if it is recognized but the corresponding sub-sequence is not at the front end (the first part) of the whole sequence, the recognition can be judged to have failed without performing the subsequent steps. In this way, when the user's first action does not meet the requirement, or a counterfeit feature cannot perform the action at all, the current user is determined to be a non-living person and the process ends, so attacks from potentially unsafe features can be rejected more simply and quickly.
Comparing the calculated response action sequence with the action sequence corresponding to the action sequence instruction and calculating the action type conformity means comparing each action segment in the response action sequence with the corresponding action instruction, comparing the action type and the number of repetitions, and assigning each segment a weight according to the comparison result. For example, if the first action type in the instruction is head shaking with 3 repetitions, and the first action in the response action sequence is head shaking with 3 repetitions, the weight S1 of the first action in the response action sequence may be set to 1; if the first action in the response action sequence is head shaking but with only 2 repetitions, S1 may be set to 0.7; and so on. Adding the weights of all action segments gives the action type conformity.
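A minimal sketch of this action type conformity computation follows; the weighting scheme (1 for a full match, a reduced weight for the right type with too few repetitions, 0 otherwise) mirrors the example above, and the concrete value of the reduced weight is an assumption.

    def action_type_conformity(response_actions, instruction, partial_weight=0.7):
        """response_actions / instruction: lists of dicts with 'type' and 'repetitions' fields."""
        score = 0.0
        for response, instructed in zip(response_actions, instruction):
            if response["type"] != instructed["type"]:
                continue                                   # wrong action type: weight 0
            if response["repetitions"] >= instructed["repetitions"]:
                score += 1.0                               # type and repetition count both conform
            else:
                score += partial_weight                    # right type, too few repetitions
        return score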
Wherein, in the mode that the random motion sequence command is the motion type plus the relative timestamp, step S105 may further include:
(b1) for each action in the calculated sequence of responsive actions, calculating an action time for each action;
(b2) comparing the action time of each action with the relative time stamp of each action, and calculating action time conformity;
(b3) calculating the action overall conformity, where action overall conformity = action type conformity + w × action time conformity, and w is a weight;
The action time corresponding to each action in the obtained response action sequence is compared with the corresponding relative time stamp. According to the comparison result, a different time weight is set for each action segment of the response action sequence, and the time weights of all segments are added to obtain the action time conformity. The time weight of each action may be set to (1 - action time error) or to (1 / action time error). If the relative time stamp in the random action sequence instruction is t1 and the action time of the corresponding action in the response action sequence is t2, the action time error of that segment may be defined as
action time error = |t2 - t1| / t1,
so that its time weight is, for example, 1 - |t2 - t1| / t1. The action overall conformity is then: action overall conformity = action type conformity + w × action time conformity, where w is the weight.
Wherein, in the mode that the random motion sequence command is a motion type plus an absolute time stamp, step S105 may further include:
(c1) for each action in the calculated sequence of responsive actions, calculating an action time for each action;
(c2) comparing the action time of each action with the absolute time stamp of each action, and calculating the action time conformity;
(c3) calculating the action total conformity, namely the action type conformity + w multiplied by the action time conformity, wherein w is a weight;
The action time corresponding to each action in the obtained response action sequence is compared with the corresponding absolute time stamp. According to the comparison result, a different time weight is set for each action segment of the response action sequence, and the time weights of all segments are added to obtain the action time conformity. The time weight of each action may be set to (1 - action time error) or to (1 / action time error). If the absolute time stamp in the random action sequence instruction runs from second t1 to second t2, and the action time of the corresponding action in the response action sequence is T, the action time error of that segment may be defined as
action time error = |T - (t2 - t1)| / (t2 - t1),
so that its time weight is, for example, 1 - |T - (t2 - t1)| / (t2 - t1). The action overall conformity is then: action overall conformity = action type conformity + w × action time conformity, where w is the weight.
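For the absolute-time-stamp case, the action time conformity and the action overall conformity can be sketched as below; clamping the per-action time weight at 0 and the value of w are assumptions added for the example.

    def action_time_conformity(measured_times, instruction):
        """measured_times: action time T of each response action, in instruction order;
        instruction: list of dicts with absolute time stamps 'start' and 'end'."""
        total = 0.0
        for T, instructed in zip(measured_times, instruction):
            expected = instructed["end"] - instructed["start"]   # instructed duration t2 - t1
            error = abs(T - expected) / expected                 # action time error
            total += max(0.0, 1.0 - error)                       # time weight = 1 - error, clamped at 0
        return total

    def action_overall_conformity(type_conformity, time_conformity, w=0.5):
        return type_conformity + w * time_conformity             # w is an illustrative weight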
In addition, in the mode that the random motion sequence instruction is motion type plus absolute time stamp, another scheme may further include:
Analyzing whether the user performs the specified instructed action at the specified time stamps. For example, the action type instruction sequence is (second 0 to second 2, turn head left to 30 degrees) → (second 2 to second 4, turn right to 10 degrees) → (second 4 to second 5, turn left to 20 degrees) → (second 5 to second 7, turn right to 40 degrees); the absolute time stamps start at seconds 0, 2, 4 and 5 and end at seconds 2, 4, 5 and 7, respectively. The system checks whether the head pose is 30 degrees to the left at second 2, 10 degrees to the right at second 4, 20 degrees to the left at second 5 and 40 degrees to the right at second 7. If all of them match, it judges that the 4 corresponding head actions have been completed and that the response action sequence comes from a living person; otherwise the response action sequence is not considered to come from a living person.
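This checkpoint scheme can be sketched as follows; the pose_at helper (returning the estimated head yaw angle at a given time, positive for a left turn and negative for a right turn) and the tolerance value are hypothetical.

    def verify_pose_checkpoints(pose_at, checkpoints, tolerance_deg=5.0):
        """pose_at: function mapping a time in seconds to the estimated head yaw angle in degrees;
        checkpoints: (end_time_s, expected_yaw_deg) pairs taken from the absolute time stamps."""
        for end_time, expected in checkpoints:
            if abs(pose_at(end_time) - expected) > tolerance_deg:
                return False      # the instructed pose was not reached at the instructed time
        return True               # all checkpoints matched: judged to come from a living person

    # Checkpoints for the example above (left turn positive, right turn negative):
    # verify_pose_checkpoints(pose_at, [(2, 30), (4, -10), (5, 20), (7, -40)])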
And step S106, judging whether the response action sequence accords with the action sequence corresponding to the random action sequence instruction, and if so, judging that the response action sequence is from the living human.
And comparing the action type conformity with a first preset threshold, if the action type conformity is greater than the first preset threshold, judging that the response action sequence is from the living human, otherwise, judging that the response action sequence is not from the living human. In the mode that the random action sequence command is the action type plus the time stamp, the action overall conformity degree can also be compared with a second preset threshold value, if the action overall conformity degree is larger than the second preset threshold value, the response action of the person in the response action sequence conforms to the action sequence corresponding to the action sequence command, and the response action sequence is judged to be from the living person, otherwise, the response action sequence is not considered to be from the living person.
The first preset threshold and the second preset threshold may be set according to a requirement on a safety degree, for example, if the safety degree level is high, the first preset threshold and the second preset threshold are set to be large values.
The present application is further described below with an application case in the context of mobile payment liveness verification, so that those skilled in the art can better understand the principles and applications of the present application.
In the mobile payment process, whether the current user is a real person or not needs to be identified in order to prevent false authentication caused by counterfeit features during identity authentication. For the sake of brevity and clarity of the case, the main steps of the present application are described by way of example. In the mobile payment process, the instruction set of the living body recognition system candidate comprises three common actions of { shaking head, opening mouth and blinking }.
(1a) When the system starts the living body identification, the system randomly generates an action sequence instruction, for example, "shake the head 3 times from left to right, the action time is 6 seconds; opening the mouth for 2 times, and the action time is 1 second; blinking 4 times, and action time 2 seconds ", and generating an action instruction schematic diagram in the form of animation and presenting the diagram to the user.
(2a) The user starts shooting facing the camera according to the action instruction schematic diagram, can make corresponding actions according to requirements, collects the response human face image sequence by the system at the moment, finishes shooting after all actions of the user are finished, and finishes collecting the response human face image sequence by the system at the moment.
(3a) And obtaining a posture estimation classifier by using Gabor feature extraction and SVM training, and estimating the posture of each image, including the states of the head, the eyes, the nose and the mouth, of the acquired response human face image sequence frame by using the posture estimation classifier.
(4a) The response face image sequence is cut into three segments of 6 seconds, 1 second and 2 seconds according to the time stamps of the action instruction, and the human action corresponding to each segment is recognized from the pose of each face image in that segment, yielding the response action sequence.
If the human action recognized from the first segment of the response face image sequence is shaking the head from left to right 3 times, the weight S1 of the corresponding first action in the response action sequence (action type: shake head from left to right, repetitions: 3) is set to its full value, S1 = 1. If the corresponding action in the response action sequence is shaking the head from left to right only 2 times, S1 is set to a reduced value (for example 0.7); if only 1 time, to a still smaller value; and if no left-to-right head shaking is recognized from the first segment of the response action sequence, S1 = 0.
If the human action recognized from the second segment of the response face image sequence is opening the mouth 2 times, the weight S2 of the corresponding second action in the response action sequence is set to S2 = 1; if the mouth is opened only 1 time, S2 is set to a reduced value; and if the number of mouth openings is 0 (no mouth-opening action is recognized from the second segment), S2 = 0.
If the human action recognized from the third segment of the response face image sequence is blinking 4 times, the weight S3 of the corresponding third action in the response action sequence is set to S3 = 1; if the number of blinks is 3, 2 or 1, S3 is set to correspondingly smaller values; and if no blinking action is recognized from the third segment of the response action sequence, S3 = 0.
(5a) The action type conformity of the response action sequence with the random action sequence instruction is calculated as Sa = S1 + S2 + S3 and compared with the first preset threshold. If the first preset threshold is 2 and Sa < 2, the action type conformity is smaller than the first preset threshold, so recognition fails, the current user is judged not to be a living person, and accordingly authentication does not pass.
If the response action sequence is segmented according to the action type, the step (4a) can be replaced by:
(4b) A left-to-right head-shaking action is looked for over the whole response face image sequence. If a left-to-right head shake is recognized and the corresponding sub-sequence is at the front end of the whole response face image sequence but the number of shakes is less than three, the weight S1 of the first action in the response action sequence is set to a reduced value; if the number of shakes reaches three, S1 = 1 is recorded. The acquisition start time t0 of the response face image sequence and the acquisition time t1 of the last face image of the sub-sequence corresponding to the last left-to-right head shake are recorded, and the action time of the first action in the response action sequence is calculated as t1 - t0. Other values of the reduced weight may be set, or different values may be set for different numbers of repetitions.
A mouth-opening action is then looked for in the response face image sequence after time t1. If the number of recognized mouth openings is less than 2, the weight S2 of the second action in the response action sequence is set to a reduced value; if the number of mouth openings reaches 2, S2 = 1. The acquisition time t2 of the last face image of the sub-sequence corresponding to the last mouth opening is recorded, and t2 - t1 is taken as the action time of the second action in the response action sequence.
A blinking action is then looked for in the response face image sequence after time t2. If fewer than 4 blinks are recognized, the weight S3 of the third action in the response action sequence is set to a reduced value; if the number of blinks reaches 4, S3 = 1 is recorded. The acquisition time t3 of the last face image of the sub-sequence corresponding to the last blink is recorded, and t3 - t2 is taken as the action time of the third action in the response action sequence.
Meanwhile, the step (5a) can be replaced by:
(5b) The overall conformity of the response action sequence with the random action sequence instruction is calculated, for example, as
S = (S1 + S2 + S3) + η × [(1 - |(t1 - t0) - T1| / T1) + (1 - |(t2 - t1) - T2| / T2) + (1 - |(t3 - t2) - T3| / T3)],
wherein T1, T2, and T3 are timestamps corresponding to shaking head, opening mouth, and blinking in the random motion sequence command, respectively, and η is a weight coefficient.
And setting a second preset threshold value theta according to experience and safety level requirements, and judging as the living human when the overall conformity is greater than the second preset threshold value theta, otherwise, judging as the non-living human.
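Assuming the overall conformity formula reconstructed in step (5b) above, the decision for this application case can be sketched as:

    def example_overall_conformity(S, measured_times, instructed_times, eta=0.5):
        """S: action type weights [S1, S2, S3]; measured_times: [t1 - t0, t2 - t1, t3 - t2];
        instructed_times: [T1, T2, T3]; eta: weight coefficient (the value here is illustrative)."""
        type_term = sum(S)
        time_term = sum(max(0.0, 1.0 - abs(m - T) / T)
                        for m, T in zip(measured_times, instructed_times))
        return type_term + eta * time_term

    # e.g. S = [1.0, 1.0, 0.75], measured_times = [6.3, 1.1, 2.4], instructed_times = [6, 1, 2];
    # the user is judged to be a living person if the returned value exceeds the threshold theta.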
In one possible implementation manner, the biometric in-vivo detection method provided in the embodiment of the present application further includes:
(d1) generating a random voice sequence instruction;
(d2) transcoding the random speech sequence instructions into text, visual and/or audible coding and presenting in visual frames, audible sounds or a combination thereof;
(d3) Collecting a user response voice sequence;
(d4) analyzing the response voice of the user in the response voice sequence;
(d5) and judging whether the response voice conforms to the voice sequence corresponding to the random voice sequence instruction, and if so, judging that the response voice sequence is from the living human.
The random voice sequence instruction may be a string of text or a string of speech segments whose content is randomly generated, or several speech templates may be randomly drawn from a speech template library and combined into a voice instruction sequence. The generated voice instruction sequence can be shown to the user on a display in the form of text and images, played to the user through a loudspeaker as speech, or presented through both display and loudspeaker as a combination of text, images and speech playback. After receiving the instruction, the user utters a response according to the instructed speech, and the user response voice sequence is captured by a recording device. When the voice instruction sequence is a string of text or is composed of speech templates, audio analysis and recognition can be performed on the captured user response voice sequence; the audio analysis may be conventional audio content analysis and recognition, the recognition result is compared with the voice instruction sequence, and if the proportion of identical content exceeds a preset threshold, for example 90%, the user response voice sequence is judged to come from a living person. When the random voice instruction is a string of text, the recognition result can be converted into text and compared with the voice instruction sequence as text; if the proportion of matching words between the text converted from the speech content and the text of the voice instruction sequence exceeds a preset threshold, for example 90%, the user response voice sequence is judged to come from a living person. When the random voice instruction sequence is a string of speech segments, waveform matching analysis can be performed between the captured user response voice sequence and the voice instruction sequence, and if the waveform matching degree exceeds a preset threshold, the user response voice sequence is judged to come from a living person.
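When the voice instruction is a string of text, the comparison described above reduces to measuring how much of the instructed text is found in the recognition result; a minimal word-overlap sketch (the speech recognition itself is assumed to be performed by an external recognizer) is:

    def voice_content_conformity(recognized_text, instructed_text):
        """Fraction of the instructed words that appear in the recognized response (0.0 to 1.0)."""
        instructed_words = instructed_text.lower().split()
        recognized_words = set(recognized_text.lower().split())
        if not instructed_words:
            return 0.0
        matched = sum(1 for word in instructed_words if word in recognized_words)
        return matched / len(instructed_words)

    # e.g. voice_content_conformity("please read one two three", "one two three") returns 1.0;
    # the response is judged to come from a living person if this exceeds the preset threshold,
    # for example 0.9 (i.e. 90%).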
In one possible implementation, in combining image analysis and voice analysis, and performing recognition analysis on human actions and voices to determine whether the user is a living body, the biometric living body detection method may include:
(e1) generating a random action sequence instruction;
(e2) transcoding the random motion sequence instructions into text, visual and/or audible coding and presenting in visual frames, audible sounds or a combination thereof;
(e3) collecting a user response image sequence;
(e4) analyzing a sequence of response actions of a user in the sequence of response images;
(e5) judging whether the response action sequence conforms to the action sequence corresponding to the random action sequence instruction or not, and calculating the action overall conformity;
(e6) generating a random voice instruction;
(e7) transcoding the random speech sequence instructions into text, visual and/or audible coding and presenting in visual frames, audible sounds or a combination thereof;
(e8) collecting a user response voice sequence;
(e9) analyzing the response voice of the user in the response voice sequence;
(e10) calculating the voice content conformity of the response voice sequence, namely comparing the content of the response voice sequence with the random voice instruction sequence, assigning a weight to each voice in the response voice sequence according to the comparison result, and summing the weights to obtain the voice content conformity;
(e11) calculating the total conformity, where total conformity = action type conformity + w1 × action time conformity + w2 × voice content conformity, and w1 and w2 are weights;
(e12) comparing the total conformity with a third preset threshold; if the total conformity is greater than the third preset threshold, judging that the response action sequence comes from a living person, otherwise not considering it to come from a living person.
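For illustration only, a minimal sketch of the weighted fusion in steps (e10)-(e12). The formula (total conformity = action type conformity + w1 × action time conformity + w2 × voice content conformity) and the comparison with a third preset threshold come from the description; the concrete weight and threshold values below are illustrative assumptions, since the description only requires that they be preset (for example, according to the security level).

```python
# Sketch of the overall-conformity decision in steps (e10)-(e12).
# Weight values and the threshold below are illustrative placeholders.

def total_conformity(action_type_conf: float, action_time_conf: float,
                     voice_content_conf: float,
                     w1: float = 0.5, w2: float = 0.5) -> float:
    """total = action type conformity + w1 * action time conformity
               + w2 * voice content conformity"""
    return action_type_conf + w1 * action_time_conf + w2 * voice_content_conf

def is_live(action_type_conf: float, action_time_conf: float,
            voice_content_conf: float,
            third_preset_threshold: float = 1.2) -> bool:
    """Compare the total conformity with the third preset threshold."""
    score = total_conformity(action_type_conf, action_time_conf, voice_content_conf)
    return score > third_preset_threshold

# Example: correct action types, slightly late actions, good voice match.
print(is_live(1.0, 0.8, 0.95))  # True with the illustrative weights/threshold
```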
Through the above description of the method embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is in many cases the better implementation. Based on such understanding, the part of the technical solution of the present application that in essence contributes to the prior art may be embodied in the form of a software product stored in a storage medium, including instructions for causing an intelligent device to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
Fig. 2 is a schematic structural diagram of a biometric in-vivo detection system according to an exemplary embodiment of the present application. As shown in fig. 2, the system includes:
an action sequence instruction generating unit U201 for generating a random action sequence instruction;
an action instruction presenting unit U202, comprising a display and a loudspeaker, for converting the random action sequence instruction into text, visual and/or auditory codes and presenting them in visual frames, audible sounds, or a combination thereof;
the image acquisition unit U203 is used for acquiring a user response face image sequence;
a response action presentation unit U204, configured to synchronize the response image sequence with the random action sequence instruction for visual presentation;
the action analysis unit U205 is used for analyzing a response action sequence of the user in the response face image sequence;
and the action conformity degree judging unit U206 is used for judging whether the response action sequence conforms to the action sequence corresponding to the random action sequence instruction, and if so, judging that the response action sequence comes from the living human.
Wherein, the biological characteristic living body detection system can also comprise:
the voice instruction generating unit is used for generating a random voice instruction;
the voice instruction presenting unit, comprising a display and a loudspeaker, for converting the random voice sequence instruction into text, visual and/or auditory codes and presenting them in visual frames, audible sounds, or a combination thereof;
the voice acquisition unit is used for acquiring a user response voice sequence;
a voice analyzing unit for analyzing a response voice of the user in the response voice sequence;
and the voice conformity degree judging unit is used for judging whether the response voice sequence conforms to the voice sequence corresponding to the voice command or not, and if so, judging that the response voice sequence comes from the living body.
In one possible embodiment, the action sequence instruction generation unit assigns a time stamp to each action in the random action sequence instruction, the time stamp is used for identifying the action time of each action or the starting time and the ending time of each action, and the time stamp is generated randomly.
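For illustration only, a minimal sketch of generating a random action sequence instruction in which each action carries a randomly generated timestamp (here a start and an end time), as described above. The action vocabulary is taken from the examples in the text; the sequence length and the time ranges are illustrative assumptions.

```python
import random

# Sketch of random action-sequence generation with random timestamps.
# Sequence length and time ranges are illustrative assumptions.

ACTION_TEMPLATES = ["open mouth", "close mouth", "turn head left",
                    "lower head", "blink"]

def generate_action_sequence(length: int = 4, max_gap_s: float = 2.0,
                             max_duration_s: float = 1.5):
    """Return a list of (action, start_time, end_time) triples with
    randomly chosen actions and randomly generated timestamps."""
    sequence, t = [], 0.0
    for _ in range(length):
        action = random.choice(ACTION_TEMPLATES)
        t += random.uniform(0.5, max_gap_s)           # random start time
        duration = random.uniform(0.5, max_duration_s)
        sequence.append((action, round(t, 2), round(t + duration, 2)))
        t += duration
    return sequence

print(generate_action_sequence())
```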
Wherein, the action instruction presenting unit converts the random action sequence instruction into text and/or visual codes, such as the text, image, and animation forms of actions like 'open mouth', 'close mouth', 'turn head left', 'lower head', and 'blink', and presents them to the user visually through the display; it also converts the random action sequence instruction into text and/or auditory codes, first converting the instruction into text and then converting the text into speech through a TTS (Text To Speech) engine, for example broadcasting voices such as 'open mouth', 'close mouth', 'turn head left', 'lower head', and 'blink' through the loudspeaker; or it presents a combination of the visual and auditory forms to the user. Presenting the prompt both visually and audibly helps the user understand the instruction and perform the corresponding action in time.
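For illustration only, a minimal sketch of the auditory presentation path (instruction text → TTS → loudspeaker). The open-source pyttsx3 engine is used here only as an assumed stand-in; the description does not name a specific TTS engine.

```python
# Sketch of the auditory presentation path: instruction text -> TTS -> speaker.
# pyttsx3 is an assumed, freely available stand-in for the TTS engine.
import pyttsx3

def present_instruction_aurally(instruction_text: str) -> None:
    engine = pyttsx3.init()
    engine.say(instruction_text)      # e.g. "open mouth", "turn head left"
    engine.runAndWait()

present_instruction_aurally("Please turn your head to the left")
```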
The image acquisition unit captures the user's actions with a camera or other image/video shooting device, thereby acquiring the response face image sequence; each image in the response face image sequence is a captured video frame.
The response action presenting unit synchronously presents the response image sequence and the random action sequence instruction on the screen, feeding them back to the user in time so that the user can adjust his or her actions to be consistent with the random action sequence instruction.
Wherein, the action analysis unit may include:
a face detection subunit, configured to detect a face in each image from the response action sequence;
the key point positioning subunit is used for positioning the key points of each face;
the head pose corner calculating subunit is used for calculating a head pose corner according to the positioned human face key points;
the facial expression type calculating subunit is used for calculating the facial expression type according to the positioned facial key points;
and the action sequence identification subunit is used for calculating the response action sequence according to the head posture corner and the facial expression type.
The action conformity degree judging unit may include:
the action type conformity calculation subunit, which is used for comparing the calculated response action sequence with the action sequence corresponding to the random action sequence instruction and calculating the action type conformity;
and the first judgment subunit is used for comparing the action type conformity with a first preset threshold, if the action type conformity is greater than the first preset threshold, the response action type of the person in the response action sequence conforms to the random action sequence instruction, and judging that the response action sequence is from the living person, otherwise, the response action sequence is not from the living person.
Wherein the motion analysis unit further includes:
an action time calculating subunit configured to calculate, for each action in the calculated response action sequence, an action time of each action;
the action time conformity degree calculation subunit is used for comparing the action time of each action with the randomly generated time stamp and calculating the action time conformity degree;
the action overall conformity calculation subunit, which is used for calculating the action overall conformity, where action overall conformity = action type conformity + w × action time conformity, and w is a weight;
and the second judgment subunit is used for comparing the action overall conformity with a second preset threshold, if the action overall conformity is greater than the second preset threshold, the response action of the person in the response action sequence conforms to the action sequence corresponding to the action sequence instruction, and judging that the response action sequence is from the living person, otherwise, the response action sequence is not considered to be from the living person.
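For illustration only, a combined minimal sketch of what the action analysis and action conformity judging units above compute: mapping a head pose rotation angle and a facial expression state to an action label, scoring the action type conformity and the action time conformity against the random instruction and its timestamps, and forming the action overall conformity (action type conformity + w × action time conformity). The pose-angle and mouth-state thresholds, the time tolerance, the weight w, and the second preset threshold values are illustrative assumptions; face detection, key point positioning, and head pose estimation are assumed to be provided by upstream components.

```python
# Sketch of the action analysis and conformity judging described above.
# Thresholds, tolerance, weight w, and the second preset threshold are
# illustrative; keypoint detection and pose estimation are assumed upstream.

def classify_action(yaw_deg: float, pitch_deg: float, mouth_open: bool) -> str:
    """Map a head pose rotation and an expression state to an action label."""
    if mouth_open:
        return "open mouth"
    if yaw_deg <= -30:
        return "turn head left"
    if yaw_deg >= 30:
        return "turn head right"
    if pitch_deg >= 20:
        return "lower head"
    return "front face"

def action_type_conformity(expected: list, observed: list) -> float:
    """Fraction of instructed actions reproduced, in order, by the user."""
    if not expected:
        return 0.0
    return sum(1 for e, o in zip(expected, observed) if e == o) / len(expected)

def action_time_conformity(expected_times: list, observed_times: list,
                           tolerance_s: float = 0.5) -> float:
    """Fraction of actions performed within a tolerance of their timestamp."""
    if not expected_times:
        return 0.0
    hits = sum(1 for e, o in zip(expected_times, observed_times)
               if abs(e - o) <= tolerance_s)
    return hits / len(expected_times)

def action_overall_conformity(type_conf: float, time_conf: float,
                              w: float = 0.5) -> float:
    """overall = action type conformity + w * action time conformity"""
    return type_conf + w * time_conf

# Example: instruction front face -> turn head left -> open mouth.
expected_actions = ["front face", "turn head left", "open mouth"]
expected_times = [1.0, 3.0, 5.0]
observed_actions = [classify_action(0, 0, False),
                    classify_action(-40, 0, False),
                    classify_action(0, 0, True)]
observed_times = [1.1, 3.2, 5.4]

overall = action_overall_conformity(
    action_type_conformity(expected_actions, observed_actions),
    action_time_conformity(expected_times, observed_times))
print(overall > 1.2)   # compare with the second preset threshold (illustrative)
```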
In a possible implementation manner, the biological feature in-vivo detection system provided in the embodiment of the present application may further include:
a voice analysis unit for recognizing the content of the response voice sequence;
a voice conformity degree calculating unit, configured to calculate the conformity degree of the response voice content of the response voice sequence;
a total conformity calculation unit for calculating a total conformity, which is an action type conformity + w1 × action time conformity + w2 × voice content conformity, wherein w1 and w2 are weights;
and the third judgment unit is used for comparing the overall conformity with a third preset threshold value, if the overall conformity is greater than the third preset threshold value, judging that the response action sequence is from the living body, otherwise, judging that the response action sequence is not from the living body.
And setting the complexity of the random action sequence instruction and the sizes of the first preset threshold, the second preset threshold and the third preset threshold according to the safety level.
FIG. 3 is a schematic diagram of the visual presentation of a random action sequence instruction and a random voice sequence instruction in the biometric liveness detection system. Parts (1), (2), and (3) represent the random action sequence instruction (turn left 45° → front face → turn right 45°), presented with text and images, and part (4) represents the random voice sequence instruction, presented as text; in this example it is a sentence to be read aloud, and it could equally be a string of random digits.
FIG. 4 is a schematic diagram of the synchronized visual presentation of the random action sequence instruction and the user response image sequence in the biometric liveness detection system, with the display in portrait orientation. In order to better guide the captured subject to perform an action sequence conforming to the random action sequence instruction, the instruction and the captured response image sequence are presented visually on the display at the same time. In portrait orientation, the random action sequence instruction is displayed in the upper right corner of the captured response image sequence, guiding the user in real time to perform the corresponding response actions. Parts (1) to (4) of FIG. 4 show the presentation of the random action sequence instruction front face → side face → front face → open mouth wide, together with the corresponding response image sequence.
FIG. 5 is a schematic diagram of the synchronized visual presentation of the random action sequence instruction and the user response image sequence in the biometric liveness detection system, with the display in landscape orientation. Parts (1) to (4) of FIG. 5 show the presentation of the random action sequence instruction front face → side face → front face → open mouth wide, together with the corresponding response image sequence.
FIG. 6 is a schematic diagram of the simultaneous visual presentation of text and response image sequences of random motion sequence instructions, with each random motion sequence instruction being displayed on the top, and the captured user response image sequence being displayed on the bottom.
FIG. 7 is a schematic illustration of the synchronized display of random voice sequence instructions and random motion sequence instructions along with a sequence of captured user response images.
The embodiments in the present specification are described in a progressive manner; identical or similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, for the apparatus or system embodiments, since they are substantially similar to the method embodiments, the description is relatively brief, and for the relevant parts reference may be made to the corresponding descriptions of the method embodiments. The apparatus and system embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without inventive effort.
It is noted that, in this document, relational terms such as "first" and "second," and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, system, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A biological feature live body detection method, characterized by comprising:
generating a random action sequence instruction;
transcoding the random motion sequence instructions into text, visual and/or audible coding and presenting in visual frames, audible sounds or a combination thereof;
collecting a user response image sequence;
synchronizing the response image sequence with the random motion sequence instructions for visual presentation;
analyzing a sequence of response actions of a user in the sequence of response images;
the analyzing the response action sequence of the user in the response image sequence comprises:
detecting a face in each image from the sequence of response images;
carrying out key point positioning on each face;
calculating a head posture corner according to the positioned key points of the human face;
calculating the facial expression type according to the positioned facial key points;
obtaining a response action sequence of the user according to the head posture corner and the facial expression type;
comparing the response action sequence with the action sequence corresponding to the random action sequence instruction, and calculating action type conformity;
the analyzing the response action sequence of the user in the response image sequence further comprises:
for each action in the sequence of responsive actions, calculating an action time for each action;
comparing the calculated action time of each action with the timestamp of each action, and calculating the action time conformity;
generating a random voice sequence instruction;
transcoding the random speech sequence instructions into text, visual and/or audible coding and presenting in visual frames, audible sounds or a combination thereof;
collecting a user response voice sequence;
analyzing the response voice of the user in the response voice sequence;
identifying the content of the response voice sequence;
calculating the voice content conformity of the response voice sequence;
calculating the total conformity, where total conformity = action type conformity + w1 × action time conformity + w2 × voice content conformity, and w1 and w2 are weights;
and comparing the overall conformity with a third preset threshold, if the overall conformity is greater than the third preset threshold, judging that the response action sequence is from the living human, otherwise, not considering that the response action sequence is from the living human.
2. The biometric liveness detection method of claim 1, wherein a time stamp is assigned to each action in the random action sequence instructions, the time stamp identifying an action time of each action or a start time and an end time of each action, the time stamp being randomly generated.
3. The biometric liveness detection method according to claim 1, wherein the complexity of the random action sequence command and the magnitude of the third preset threshold are set according to a security level.
4. A biometric liveness detection system, comprising:
the action sequence instruction generating unit is used for generating a random action sequence instruction;
the action instruction presenting unit, comprising a display and a loudspeaker, for converting the random action sequence instruction into text, visual and/or auditory codes and presenting them in visual frames, audible sounds, or a combination thereof;
the image acquisition unit is used for acquiring a user response face image sequence;
the response action presentation unit is used for carrying out visual presentation on the response human face image sequence and the random action sequence instruction synchronously;
the action analysis unit is used for analyzing a response action sequence of the user in the response face image sequence;
the motion analysis unit includes:
a face detection subunit, configured to detect a face in each image from the response action sequence;
the key point positioning subunit is used for positioning the key points of each face;
the head pose corner calculating subunit is used for calculating a head pose corner according to the positioned human face key points;
the facial expression type calculating subunit is used for calculating the facial expression type according to the positioned facial key points;
the action sequence identification subunit is used for obtaining a response action sequence of the user according to the head posture corner and the facial expression type;
the action conformity degree judging unit comprises:
the action type conformity calculation subunit compares the response action sequence with the action sequence corresponding to the random action sequence instruction and calculates the action type conformity;
the motion analysis unit further includes:
an action time calculating subunit, configured to calculate, for each action in the response action sequence, an action time of each action;
the action time conformity degree calculating subunit is used for comparing the calculated action time of each action with the time stamp of each action and calculating the action time conformity degree;
the voice instruction generating unit is used for generating a random voice instruction;
the voice instruction presenting unit, comprising a display and a loudspeaker, for converting the random voice sequence instruction into text, visual and/or auditory codes and presenting them in visual frames, audible sounds, or a combination thereof;
the voice acquisition unit is used for acquiring a user response voice sequence;
the voice analysis unit is used for analyzing the response voice of the user in the response voice sequence and identifying the content of the response voice sequence;
a voice conformity degree calculating unit, configured to calculate the conformity degree of the response voice content of the response voice sequence;
a total conformity calculation unit for calculating a total conformity, which is an action type conformity + w1 × action time conformity + w2 × voice content conformity, wherein w1 and w2 are weights;
and the third judgment unit is used for comparing the overall conformity with a third preset threshold value, if the overall conformity is greater than the third preset threshold value, judging that the response action sequence is from the living body, otherwise, judging that the response action sequence is not from the living body.
5. The biometric liveness detection system of claim 4 wherein a time stamp is assigned to each action in the random sequence of actions instruction, the time stamp identifying an action time for each action or a start time and an end time for each action, the time stamp being randomly generated.
6. The biometric liveness detection system of claim 4 wherein the complexity of the random action sequence instructions and the magnitude of the third preset threshold are set according to a security level.
CN201510053281.1A 2015-02-02 2015-02-02 Biological characteristic living body detection method and system Active CN105989264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510053281.1A CN105989264B (en) 2015-02-02 2015-02-02 Biological characteristic living body detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510053281.1A CN105989264B (en) 2015-02-02 2015-02-02 Biological characteristic living body detection method and system

Publications (2)

Publication Number Publication Date
CN105989264A CN105989264A (en) 2016-10-05
CN105989264B true CN105989264B (en) 2020-04-07

Family

ID=57036969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510053281.1A Active CN105989264B (en) 2015-02-02 2015-02-02 Biological characteristic living body detection method and system

Country Status (1)

Country Link
CN (1) CN105989264B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066983B (en) * 2017-04-20 2022-08-09 腾讯科技(上海)有限公司 Identity verification method and device
CN107273794A (en) * 2017-04-28 2017-10-20 北京建筑大学 Live body discrimination method and device in a kind of face recognition process
CN107358152B (en) * 2017-06-02 2020-09-08 广州视源电子科技股份有限公司 Living body identification method and system
CN109784124A (en) * 2017-11-10 2019-05-21 北京嘀嘀无限科技发展有限公司 A kind of determination method of vivo identification, decision-making system and computer installation
CN111033508B (en) * 2018-04-25 2020-11-20 北京嘀嘀无限科技发展有限公司 System and method for recognizing body movement
CN110473311B (en) * 2018-05-09 2021-07-23 杭州海康威视数字技术股份有限公司 Illegal attack prevention method and device and electronic equipment
CN109325330A (en) * 2018-08-01 2019-02-12 平安科技(深圳)有限公司 Micro- expression lock generates and unlocking method, device, terminal device and storage medium
CN110866418B (en) * 2018-08-27 2023-05-09 阿里巴巴集团控股有限公司 Image base generation method, device, equipment, system and storage medium
CN109815810A (en) * 2018-12-20 2019-05-28 北京以萨技术股份有限公司 A kind of biopsy method based on single camera
CN109947238B (en) * 2019-01-17 2020-07-14 电子科技大学 Non-cooperative gesture recognition method based on WIFI
CN110263691A (en) * 2019-06-12 2019-09-20 合肥中科奔巴科技有限公司 Head movement detection method based on android system
CN110309743A (en) * 2019-06-21 2019-10-08 新疆铁道职业技术学院 Human body attitude judgment method and device based on professional standard movement
CN112287909B (en) * 2020-12-24 2021-09-07 四川新网银行股份有限公司 Double-random in-vivo detection method for randomly generating detection points and interactive elements
CN112836627B (en) * 2021-01-29 2022-07-19 支付宝(杭州)信息技术有限公司 Living body detection method and apparatus
CN112990113A (en) * 2021-04-20 2021-06-18 北京远鉴信息技术有限公司 Living body detection method and device based on facial expression of human face and electronic equipment
CN113743196A (en) * 2021-07-23 2021-12-03 北京眼神智能科技有限公司 Living body detection method, living body detection device and storage medium
CN114676409A (en) * 2022-02-28 2022-06-28 广西柳钢东信科技有限公司 Online electronic signing method based on mobile phone screen video and AI voice synthesis
CN114821824A (en) * 2022-05-10 2022-07-29 支付宝(杭州)信息技术有限公司 Living body detection method, living body detection device, living body detection equipment and living body detection medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100514353C (en) * 2007-11-26 2009-07-15 清华大学 Living body detecting method and system based on human face physiologic moving
CN101694691A (en) * 2009-07-07 2010-04-14 北京中星微电子有限公司 Method and device for synthesizing facial images
KR101184097B1 (en) * 2009-12-07 2012-09-18 삼성전자주식회사 Method for determining frontal pose of face
CN103778360A (en) * 2012-10-26 2014-05-07 华为技术有限公司 Face unlocking method and device based on motion analysis
CN103440479B (en) * 2013-08-29 2016-12-28 湖北微模式科技发展有限公司 A kind of method and system for detecting living body human face

Also Published As

Publication number Publication date
CN105989264A (en) 2016-10-05

Similar Documents

Publication Publication Date Title
CN105989264B (en) Biological characteristic living body detection method and system
WO2019127262A1 (en) Cloud end-based human face in vivo detection method, electronic device and program product
CN104361276B (en) A kind of multi-modal biological characteristic identity identifying method and system
CN112328999B (en) Double-recording quality inspection method and device, server and storage medium
CN105518708B (en) For verifying the method for living body faces, equipment and computer program product
CN106557726B (en) Face identity authentication system with silent type living body detection and method thereof
Ashraf et al. The painful face: Pain expression recognition using active appearance models
WO2019127365A1 (en) Face living body detection method, electronic device and computer program product
CN105612533B (en) Living body detection method, living body detection system, and computer program product
CN109800643B (en) Identity recognition method for living human face in multiple angles
WO2020134527A1 (en) Method and apparatus for face recognition
CN102375970B (en) A kind of identity identifying method based on face and authenticate device
WO2017101267A1 (en) Method for identifying living face, terminal, server, and storage medium
CN103324918B (en) The identity identifying method that a kind of recognition of face matches with lipreading recognition
CN106503691B (en) Identity labeling method and device for face picture
CN102385703B (en) A kind of identity identifying method based on face and system
CN111339806B (en) Training method of lip language recognition model, living body recognition method and device
CN106874830B (en) A kind of visually impaired people's householder method based on RGB-D camera and recognition of face
CN106709402A (en) Living person identity authentication method based on voice pattern and image features
EP3555799B1 (en) A method for selecting frames used in face processing
JP2016009453A (en) Face authentication device and face authentication method
CN104376250A (en) Real person living body identity verification method based on sound-type image feature
CN110032924A (en) Recognition of face biopsy method, terminal device, storage medium and electronic equipment
US20190138794A1 (en) Face recognition system and method thereof
CN113627256B (en) False video inspection method and system based on blink synchronization and binocular movement detection

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant