WO2024051380A1 - Living body detection method and apparatus, electronic device, and storage medium - Google Patents

Living body detection method and apparatus, electronic device, and storage medium

Info

Publication number
WO2024051380A1
WO2024051380A1 · PCT/CN2023/109776 · CN2023109776W
Authority
WO
WIPO (PCT)
Prior art keywords
detected
sound signal
sound source
living body
sound
Prior art date
Application number
PCT/CN2023/109776
Other languages
French (fr)
Chinese (zh)
Inventor
黄石磊
刘轶
程刚
廖晨
蒋志燕
Original Assignee
深圳市北科瑞声科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市北科瑞声科技股份有限公司
Publication of WO2024051380A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Definitions

  • the present disclosure relates to the field of biometric identification technology, specifically to living body detection methods, devices, electronic equipment and storage media.
  • Liveness detection is a method to determine the true physiological characteristics of objects in some identity verification scenarios.
  • liveness detection can use facial key-point localization and face tracking, combined with actions such as blinking, opening the mouth, shaking the head, and nodding.
  • it is a physiological-sign detection technique that verifies whether a real, live person is operating the device, in order to resist common attacks such as photos, face swapping, masks, occlusion, and screen replay.
  • the present disclosure provides living body detection methods, devices, electronic equipment and storage media.
  • the present disclosure provides a living body detection method, which method includes:
  • the sound source position and the lip position are compared for consistency, and the liveness detection result of the object to be detected is determined based on the comparison results.
  • the sound source position and the lip position are compared for consistency, including:
  • determining the living body detection result of the subject to be detected based on the comparison results includes:
  • the living body detection result of the object to be detected is determined to be a living body.
  • the method when the comparison result indicates that the sound source position and the lip position are consistent, the method further includes:
  • the step of determining that the living body detection result of the object to be detected is a living body is executed.
  • the method before determining the sound signal to be detected, the method further includes:
  • Output interactive instructions which are used to instruct the object to be detected to emit a sound signal corresponding to the preset text data
  • the step of determining the position of the sound source corresponding to the sound signal is performed.
  • the method further includes:
  • the generation process of interactive instructions includes:
  • determining the sound signal to be detected includes:
  • the sound signals collected by each microphone are synthesized and processed to obtain the sound signal to be detected.
  • determining the sound source position corresponding to the sound signal to be detected includes:
  • the present disclosure provides a living body detection device, which device includes:
  • the first determination module is configured to determine the sound signal to be detected and the image to be detected corresponding to the sound signal to be detected, where the image to be detected is an image containing the face of the subject to be detected;
  • a second determination module configured to determine the sound source position corresponding to the sound signal to be detected, and determine the lip position of the object to be detected based on the image to be detected;
  • the third determination module is configured to compare the sound source position and the lip position for consistency, and determine the living body detection result of the object to be detected based on the comparison result.
  • the third determination module is configured as:
  • the third determination module is configured as:
  • if the comparison result indicates that the sound source position and the lip position are inconsistent, it is determined that the living body detection result of the object to be detected is non-living body;
  • if they are consistent, the living body detection result of the object to be detected is determined to be a living body.
  • the device further includes:
  • the input module is configured to input the image to be detected and the sound signal to be detected to the trained mouth shape recognition model to obtain the output result of the mouth shape recognition model when the comparison result indicates that the sound source position and the lip position are consistent;
  • the first execution module is configured to execute the step of determining that the living body detection result of the object to be detected is a living body if the output result indicates that the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected.
  • the device further includes:
  • the output module is configured to output an interactive instruction before determining the sound signal to be detected, and the interactive instruction is used to instruct the object to be detected to emit a sound signal corresponding to the preset text data;
  • a recognition module configured to perform speech recognition on the sound signal to be detected and obtain text data corresponding to the sound signal to be detected before determining the sound source position corresponding to the sound signal to be detected;
  • a comparison module configured to compare the text data corresponding to the sound signal to be detected with the preset text data for consistency
  • the second execution module is configured to execute the step of determining the sound source position corresponding to the sound signal if the comparison result indicates that the text data corresponding to the sound signal to be detected is consistent with the preset text data.
  • the device further includes:
  • the third execution module is configured to return to the step of outputting interactive instructions if the comparison result indicates that the text data corresponding to the sound signal to be detected is inconsistent with the preset text data.
  • the output module is configured as:
  • the first determining module is configured as:
  • the sound signals collected by each microphone are synthesized and processed to obtain the sound signal to be detected.
  • the second determination module is configured as:
  • intersection position of multiple sound source directions is determined as the sound source position of the sound signal to be detected.
  • the present disclosure provides an electronic device, including: a processor and a memory.
  • the processor is configured to execute a living body detection program stored in the memory to implement the living body detection method described in the present disclosure.
  • the present disclosure provides a storage medium.
  • the storage medium stores one or more programs.
  • the one or more programs can be executed by one or more processors to implement the living body detection method described in the present disclosure.
  • in the solution of the present disclosure, the sound source position of the sound signal to be detected and the lip position of the object to be detected can be located directly. When the sound source position and the lip position are determined to be consistent, this indicates that the sound signal to be detected is emitted from the lips of the object to be detected and that the object is a living body; otherwise, the object is a non-living body. Thus, even if a non-authenticated user impersonates an authenticated user by obtaining that user's video image, the object to be detected can still be identified as a non-living body, which improves the accuracy and reliability of the living body detection results.
  • Figure 1 is a schematic diagram of an application scenario involved in an embodiment of the present disclosure
  • Figure 2 is a schematic diagram of an application scenario involved in an embodiment of the present disclosure
  • Figure 3 is a flow chart of a living body detection method provided by an embodiment of the present disclosure.
  • Figure 4 is a flow chart of a living body detection method provided by an embodiment of the present disclosure.
  • Figure 5 is a distribution diagram of microphones on a living body detection device provided by an embodiment of the present disclosure
  • Figure 6 is a schematic diagram of a living body detection device provided by an embodiment of the present disclosure determining a sound source position through microphones
  • Figure 7 is a schematic diagram of the sound source position provided by an embodiment of the present disclosure.
  • Figure 8 is a schematic diagram of the sound source position provided by an embodiment of the present disclosure.
  • Figure 9 is a flow chart of a living body detection method provided by an embodiment of the present disclosure.
  • Figure 10 is a flow chart of a living body detection method provided by an embodiment of the present disclosure.
  • Figure 11 is a block diagram of a living body detection device provided by an embodiment of the present disclosure.
  • FIG. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
  • Figure 1 is a schematic diagram of an application scenario involved in an embodiment of the present disclosure.
  • the application scenario shown in Figure 1 includes: user 11 and the living body detection device 13.
  • the living body detection device 13 may run a living body detection system and may be any of various electronic devices with a display screen, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers.
  • the above-mentioned display screen can be used to display the video signal captured by the camera and to prompt the position of the face to be detected.
  • a device with a display screen is taken as an example to illustrate the living body detection device 13.
  • user 11 can undergo liveness detection in the normal way.
  • the normal way here means that user 11 stands in front of the camera of the living body detection device 13; the device then directly captures a video image containing the face of user 11 through the camera and performs liveness detection on user 11 based on that video image.
  • FIG. 2 is a schematic diagram of an application scenario involved in an embodiment of the present disclosure.
  • the application scenario includes: user 11, user 12, living body detection device 13, terminal 14, and terminal 15. Among them, terminal 14 and terminal 15 can perform network communication.
  • Terminal 14 and terminal 15 may be hardware devices or software that support network connections to provide various network services.
  • terminals 14 and 15 may be various electronic devices with display screens, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers; in Figure 2, smartphones are used as an example.
  • if terminal 14 and terminal 15 are software, they can be installed on the electronic devices listed above. In this embodiment, terminal 14 and terminal 15 establish a video call by installing corresponding applications.
  • user 11 can point the display screen of terminal 14 toward the camera of the living body detection device 13, so that the device receives the video image of user 12 through its camera. Because the living body detection device 13 can detect vital signs from the video image of user 12, the detection passes. It can be seen that, in the existing technology, a non-authenticated user can impersonate an authenticated user by obtaining the authenticated user's video image and thereby pass liveness detection.
  • therefore, an embodiment of the present disclosure provides a living body detection method to prevent non-authenticated users from impersonating authenticated users by obtaining their video images, and to improve the accuracy of the detection results.
  • FIG 3 is a flow chart of a living body detection method provided by an embodiment of the present disclosure.
  • the process shown in Figure 3 can be applied to a living body detection device, such as the living body detection device 13 shown in Figure 1 .
  • the process may include the following steps:
  • Step 301 Determine the sound signal to be detected and the image to be detected corresponding to the sound signal to be detected, where the image to be detected is an image containing the face of the subject to be detected;
  • Step 302 Determine the sound source position corresponding to the sound signal to be detected, and determine the lip position of the object to be detected based on the image to be detected;
  • Step 303 Compare the sound source position and the lip position for consistency, and determine the living body detection result of the object to be detected based on the comparison result.
  • the above-mentioned sound signal to be detected is the sound signal received by the microphone when the living body detection device performs detection.
  • the above-mentioned image to be detected is the image collected by the camera during detection, and it contains the face of the subject to be detected.
  • the number of images to be detected can be one or more. When there are multiple images to be detected, they may be multiple frames of a video collected by the living body detection device through the camera.
  • since both the sound signal to be detected and the image to be detected are obtained by the living body detection device during the same detection, they correspond to each other.
  • the object to be detected is a real object.
  • the user 11 is the object to be detected.
  • the living body detection device 13 can directly capture an image containing the face of user 11 through the camera to obtain the image to be detected.
  • user 11 can directly emit a sound signal, which the living body detection device receives through the microphone; in this way, the living body detection device determines the sound signal to be detected.
  • the object to be detected is a virtual object.
  • the video image of the user 12 output on the display screen of the terminal 14 is the object to be detected.
  • the living body detection device 13 can collect an image containing the face of user 12 through the camera.
  • user 12 can emit a sound signal, which is collected by terminal 15 and sent to terminal 14.
  • terminal 14 can play the sound signal through its speaker.
  • the living body detection device 13 can then receive the sound signal to be detected through the microphone.
  • user 11 can also emit a sound signal, and the living body detection device 13 can receive the sound signal to be detected through a microphone.
  • the above-mentioned living body detection device may be provided with multiple microphones, usually at different locations.
  • for example, the living body detection device is provided with four microphones, one at each of its four corners.
  • when the living body detection device determines the sound signal to be detected, it can obtain the sound signals collected by the multiple microphones, yielding multiple sound signals.
  • the living body detection device can synthesize the sound signals collected by each microphone and take the synthesized signal as the sound signal to be detected; in this way, noise can be reduced and a clearer, more accurate sound signal obtained.
  • alternatively, the living body detection device can take the sound signal from any one of the above microphones as the sound signal to be detected.
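The synthesis step above can be realized, for example, by delay-and-sum beamforming; the disclosure does not name a specific algorithm, so the sketch below (pure Python, integer sample delays) is only one plausible realization:

```python
def delay_and_sum(signals, delays):
    """Shift each microphone channel by its estimated delay (in samples)
    and average the aligned channels."""
    # Usable length after shifting every channel.
    n = min(len(s) - d for s, d in zip(signals, delays))
    aligned = [s[d:d + n] for s, d in zip(signals, delays)]
    # Averaging reinforces the coherent speech component and attenuates
    # uncorrelated noise picked up by the individual microphones.
    return [sum(samples) / len(samples) for samples in zip(*aligned)]

# Toy example: the same 10-sample ramp reaches microphone 2 two samples late.
ramp = [float(i) for i in range(10)]
mic1 = ramp
mic2 = [0.0, 0.0] + ramp[:8]
combined = delay_and_sum([mic1, mic2], [0, 2])
```

In practice the per-channel delays would come from cross-correlating the channels; here they are given directly to keep the example short.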
  • The following is a unified description of step 302 and step 303:
  • the living body detection device can locate the sound source position corresponding to the sound signal to be detected. How to locate the sound source will be explained below through the process shown in Figure 4, which will not be described in detail here;
  • the above-mentioned image to be detected contains the face of the subject to be detected. Based on this, the life detection device can determine the lip position of the subject to be detected by performing facial recognition on the image to be detected;
  • the sound source position and the lip position can be compared for consistency, and the living body detection result of the object to be detected can be determined based on the comparison result. Specifically, when the comparison result indicates that the sound source position and the lip position are inconsistent, it indicates that the sound signal to be detected is not emitted from the lips of the subject to be detected, indicating that the living body detection result of the subject to be detected is non-living.
  • the object to be detected is a virtual object.
  • the video image of the user 12 output on the display screen of the terminal 14 is the object to be detected.
  • the living body detection device 13 can collect images including the video image of user 12 through a camera.
  • the user 12 can send out a sound signal, which is collected by the terminal 15 and sent to the terminal 14.
  • terminal 14 can play the sound signal through its speaker, and the speaker of terminal 14 may be located above, below, to the left, or to the right of the terminal 14 housing. That is, the sound source position of the sound signal may be above, below, to the left, or to the right of the housing. As shown in Figure 7, the sound source position may be located in any of areas A, B, C, D, or E, which is inconsistent with the lip position of the object to be detected.
  • alternatively, if user 11 emits a sound signal, the sound source position is at user 11's position, which is also inconsistent with the lip position of the object to be detected. In either case, it can be determined that the living body detection result of the object to be detected is non-living body.
  • the comparison result indicates that the sound source position and the lip position are consistent, it indicates that the sound signal to be detected is emitted from the lips of the subject to be detected, indicating that the living body detection result of the subject to be detected is a living body.
  • the object to be detected is a real object.
  • the user 11 is the object to be detected.
  • the living body detection device 13 can directly collect an image containing the face of user 11 through the camera to obtain the image to be detected.
  • user 11 can directly emit a sound signal, which the living body detection device receives through the microphone; in this way, the device determines the sound signal to be detected.
  • the sound source position as shown in Figure 8 can be obtained.
  • assuming the sound source position of the sound signal to be detected is located in area F, and area F is known to be the lip area of the subject to be detected, the sound source position is consistent with the lip position of the subject. At this time, it can be determined that the living body detection result of the subject to be detected is a living body.
  • in the above method, the image to be detected is an image containing the face of the object to be detected; the sound source position of the sound signal to be detected is determined, the lip position of the object to be detected is determined based on the image to be detected, the sound source position and the lip position are compared for consistency, and the liveness detection result of the object to be detected is determined based on the comparison result.
  • the sound source position of the sound signal to be detected and the lip position of the object to be detected can be directly located, and when it is determined that the sound source position and the lip position are consistent, it means that the sound signal to be detected is generated by the lips of the object to be detected.
  • the object to be detected is a living body; otherwise, the object to be detected is a non-living body. It is realized that even if a non-authenticated user impersonates an authenticated user by obtaining the authenticated user's video image, the object to be detected can still be identified as a non-living body, which improves the accuracy and reliability of the living body detection results.
  • FIG 4 is a flow chart of a living body detection method provided by an embodiment of the present disclosure.
  • the process shown in Figure 4 is based on the process shown in Figure 3 above, describing how the living body detection device locates the sound source position of the sound signal to be detected. As shown in Figure 4, the process may include the following steps:
  • Step 401 Decompose the sound signal to be detected to obtain multiple decomposed signals
  • Step 402 Determine the sound source direction of each decomposed signal
  • Step 403 Determine the intersection position of multiple sound source directions as the sound source position of the sound signal to be detected.
  • the above-described decomposed signal may be a sound signal in any direction among the sound signals in different directions contained in the sound signal to be detected.
  • the living body detection device may use microphone-array signal processing techniques to determine the sound source position of the sound signal to be detected.
  • N microphones may be provided on the above-mentioned living body detection device.
  • N may be greater than or equal to 3.
  • as shown in FIG. 5, which is a distribution diagram of microphones on a living body detection device provided by an embodiment of the present disclosure, the detection device is equipped with four microphones, one at each corner: microphone 1, microphone 2, microphone 3, and microphone 4.
  • in step 301, when multiple microphones are provided on the living body detection device, the sound signals collected by the multiple microphones can be obtained and synthesized to obtain the sound signal to be detected.
  • when determining the sound source position of the sound signal to be detected, the sound signal to be detected can be decomposed to obtain multiple decomposed signals, where each decomposed signal can correspond to a microphone.
  • The following is a unified description of step 402 and step 403:
  • each of the above decomposed signals may correspond to a microphone. Since each microphone can locate the sound source direction of the sound signal when receiving the sound signal, the sound source direction of each decomposed signal can be determined through the microphone corresponding to each decomposed signal.
  • the sound source direction of each decomposed signal can be determined.
  • the intersection position of the plurality of sound source directions may be determined as the sound source position of the sound signal to be detected.
  • each microphone can correspond to a decomposed signal of the sound signal to be detected, and the sound source direction of the decomposed signal can be determined. In this way, four sound source directions can be obtained. The intersection position of the four sound source directions is point A, then point A can be determined as the sound source position of the sound signal to be detected.
  • in this way, multiple decomposed signals are obtained by decomposing the sound signal to be detected, the sound source direction of each decomposed signal is determined, and the intersection position of the multiple sound source directions is determined as the sound source position of the sound signal to be detected, enabling more accurate localization of the sound source.
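One standard way to realize step 403 is to compute the least-squares point closest to all of the per-microphone direction rays; this is an illustrative sketch, not necessarily the method used in the disclosure, and positions are kept 2-D for simplicity:

```python
def intersect_directions(mic_positions, directions):
    """Least-squares intersection of 2-D rays, each given by a microphone
    position and a (mic -> source) direction vector."""
    # Accumulate A = sum(I - d d^T) and b = sum((I - d d^T) p) over all rays;
    # the solution of A x = b is the point closest to every ray.
    A = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for (px, py), (dx, dy) in zip(mic_positions, directions):
        norm = (dx * dx + dy * dy) ** 0.5
        dx, dy = dx / norm, dy / norm
        m = [[1 - dx * dx, -dx * dy], [-dx * dy, 1 - dy * dy]]
        A[0][0] += m[0][0]; A[0][1] += m[0][1]
        A[1][0] += m[1][0]; A[1][1] += m[1][1]
        b[0] += m[0][0] * px + m[0][1] * py
        b[1] += m[1][0] * px + m[1][1] * py
    # Solve the 2x2 system A x = b by Cramer's rule.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    x = (b[0] * A[1][1] - b[1] * A[0][1]) / det
    y = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return x, y

# Four corner microphones (as in Figure 5) all hearing a source at (1, 1).
mics = [(0, 0), (2, 0), (0, 2), (2, 2)]
dirs = [(1, 1), (-1, 1), (1, -1), (-1, -1)]
src = intersect_directions(mics, dirs)
```

With noisy real-world direction estimates the rays do not meet in a single point, and the least-squares formulation returns the point of minimum total distance to all rays.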
  • FIG. 9 is a flow chart of a living body detection method provided by an embodiment of the present disclosure. As shown in Figure 9, the process may include the following steps:
  • Step 901 Determine the sound signal to be detected and the image to be detected corresponding to the sound signal to be detected, where the image to be detected is an image containing the face of the subject to be detected;
  • Step 902 Determine the sound source position corresponding to the sound signal to be detected, and determine the lip position of the object to be detected based on the image to be detected;
  • Step 903 Determine the reference space area based on the lip position
  • Step 904 Determine whether the sound source position is within the reference space area. If so, perform step 906; if not, perform step 905;
  • Step 905 Determine that the living body detection result of the object to be detected is non-living body
  • Step 906 Input the image to be detected and the sound signal to be detected to the trained mouth shape recognition model, and obtain the output result of the above mouth shape recognition model;
  • Step 907 Determine whether the above output result indicates that the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected. If yes, step 908 is executed; if not, step 905 is executed;
  • Step 908 Determine that the living body detection result of the object to be detected is a living body.
  • For step 901 and step 902, please refer to the description of steps 301 to 302 above; they will not be described again here.
  • The following is a unified description of steps 903 to 905:
  • a reference space area may first be determined based on the above-mentioned lip position.
  • a sphere or cuboid area can be set as the reference space area centered on the above-mentioned lip position. Then, it is determined whether the sound source position is within the reference space area. If so, it can be determined that the sound source position and the lip position are consistent; if not, it can be determined that the sound source position and the lip position are inconsistent.
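The sphere variant of this check can be sketched as a point-in-sphere test; the 0.15 m radius below is an assumed tolerance for illustration, not a value from the disclosure:

```python
def positions_consistent(sound_src, lip_pos, radius=0.15):
    """True when the sound source lies inside a sphere of `radius`
    (an assumed tolerance, in metres) centred on the lip position."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(sound_src, lip_pos))
    return dist_sq <= radius ** 2

lips = (0.0, 0.0, 0.0)
near = positions_consistent((0.05, 0.02, 0.0), lips)  # a few cm from the lips
far = positions_consistent((0.6, 0.0, 0.0), lips)     # e.g. a phone speaker
```

A cuboid reference area would instead compare each coordinate against per-axis bounds, which allows different tolerances in depth versus height.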
  • The following is a unified description of steps 906 to 908:
  • the image to be detected and the sound signal to be detected can be input into the above trained mouth shape recognition model to obtain an output result indicating whether the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected.
  • if the output result is 1, the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected; if the output result is 0, it does not match.
  • the living body detection result of the object to be detected is a living body.
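Taken together, steps 904 to 908 amount to a two-stage decision in which both checks must pass; a minimal sketch (the 1/0 convention follows the text above, the result strings are illustrative):

```python
def liveness_result(positions_consistent, mouth_model_output):
    """Stage 1: sound source vs. lip position. Stage 2: the mouth shape
    recognition model's 1/0 output. Both must pass for a living body."""
    if not positions_consistent:
        return "non-living body"
    return "living body" if mouth_model_output == 1 else "non-living body"

case_match = liveness_result(True, 1)     # consistent position, mouth matches
case_mismatch = liveness_result(True, 0)  # consistent position, mouth closed
case_far = liveness_result(False, 1)      # sound comes from elsewhere
```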
  • suppose user 11 and the living body detection device 13 are in the same environment. When user 11, that is, the object to be detected, emits an "ah" sound at a certain moment, the living body detection device can directly capture, through the camera, an image containing the face of the object to be detected at that moment to obtain the image to be detected, and receive the "ah" sound signal through the microphone to obtain the sound signal to be detected.
  • the above-mentioned image to be detected and the sound signal to be detected are input to the above-mentioned mouth shape recognition model.
  • the mouth shape recognition model can determine that the mouth shape of the object to be detected is an "ah" mouth shape, which matches the sound signal to be detected; thus it can be determined that the object to be detected is a living body.
  • the video image of the user 12 output on the display screen of the terminal 14 is the object to be detected.
  • the living body detection device 13 can collect the image containing the video image of user 12 through the camera. Assume that at a certain moment user 11 emits an "ah" sound signal while user 12 emits no sound.
  • in this case, the image to be detected collected by the living body detection device is the video image of user 12, while the sound signal to be detected is the sound signal emitted by user 11.
  • the above sound signal to be detected and image to be detected are input into the mouth shape recognition model. Since user 12 emits no sound at the current moment, recognition of the image to be detected shows that the mouth of the object to be detected is closed, whereas the mouth shape corresponding to the "ah" sound signal should be open. The model therefore outputs that the sound signal to be detected does not match the mouth shape of the object to be detected, and the object to be detected is determined to be a non-living body.
  • when the sound source position is located within the reference space area determined based on the lip position, a further determination can be made, based on the model's output result, as to whether the living body detection result of the object to be detected is a living body.
  • that is, whether the object to be detected is a living body is determined by further checking whether the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected.
  • in this way, even if a non-authenticated user impersonates an authenticated user by obtaining the authenticated user's video image, the object to be detected can still be identified as a non-living body, which improves the accuracy and reliability of the liveness detection results.
  • Figure 10 is a flow chart of a living body detection method provided by an embodiment of the present disclosure. As shown in Figure 10, the process may include the following steps:
  • Step 1001. Output an interactive instruction, which is used to instruct the object to be detected to emit a sound signal corresponding to the preset text data;
  • Step 1002 Determine the sound signal to be detected and the image to be detected corresponding to the sound signal to be detected.
  • the image to be detected is an image containing the face of the object to be detected;
  • Step 1003 Perform speech recognition on the sound signal to be detected, and obtain text data corresponding to the sound signal to be detected;
  • Step 1004 Compare the text data corresponding to the sound signal to be detected with the preset text data for consistency;
  • Step 1005 Determine whether the comparison result indicates that the recognized text data is consistent with the preset text data. If yes, execute step 1006; if not, execute step 1001;
  • Step 1006 Determine the sound source position corresponding to the sound signal to be detected, and determine the lip position of the object to be detected based on the image to be detected;
  • Step 1007 Determine whether the position of the sound source and the position of the lips are consistent. If yes, execute step 1009; if not, execute step 1008;
  • Step 1008 Determine that the living body detection result of the object to be detected is a non-living body;
  • Step 1009 Input the image to be detected and the sound signal to be detected to the trained mouth shape recognition model, and obtain the output result of the above mouth shape recognition model;
  • Step 1010 Determine whether the above output result indicates that the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected. If yes, step 1011 is executed; if not, step 1008 is executed;
  • Step 1011 Determine that the living body detection result of the object to be detected is a living body.
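The flow of steps 1001 to 1011 above can be sketched as a single control loop. Every callable passed in below (prompt generation, capture, speech recognition, localization, the mouth shape model, and the retry limit) is a placeholder standing in for the components described in the flowchart, not an actual implementation.

```python
def liveness_check(generate_prompt, capture, recognize_text,
                   locate_source, locate_lips, positions_match,
                   mouth_shape_matches, max_attempts=3):
    """Illustrative control flow for steps 1001-1011 of the flowchart."""
    for _ in range(max_attempts):
        expected_text = generate_prompt()            # step 1001: output instruction
        sound, image = capture()                     # step 1002: signal + image
        if recognize_text(sound) != expected_text:   # steps 1003-1005: ASR + compare
            continue                                 # inconsistent: re-issue instruction
        source = locate_source(sound)                # step 1006: sound source position
        lips = locate_lips(image)                    # step 1006: lip position
        if not positions_match(source, lips):        # step 1007
            return False                             # step 1008: non-living body
        return mouth_shape_matches(image, sound)     # steps 1009-1011
    return False
```

The `max_attempts` cap is an assumption; the flowchart itself loops back to step 1001 without bounding the retries.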
  • the above interactive instruction can be generated by the living body detection device through the following method: first, calling a preset random number generation algorithm to generate a random array; then, generating the above preset text data based on the random array, and generating the interactive instruction based on the preset text data.
  • the living body detection device calls the preset random number generation algorithm to generate the following random array: 265910. Then, the living body detection device determines the random array as preset text data, generates an interactive instruction for instructing the subject to be detected to say 265910 based on the preset text data, and outputs the interactive instruction.
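The random-array generation exemplified above can be sketched as follows. The use of Python's `secrets` module and the six-digit length are illustrative assumptions; the disclosure only requires a preset random number generation algorithm and an instruction built from the resulting text.

```python
import secrets

def generate_preset_text(n_digits: int = 6) -> str:
    """Generate a random digit string to serve as the preset text data (e.g. '265910')."""
    return ''.join(str(secrets.randbelow(10)) for _ in range(n_digits))

def make_instruction(text: str) -> str:
    """Wrap the preset text data in an interactive instruction for the user."""
    return f"Please read the following digits aloud: {text}"
```

Because the digits are freshly generated for each detection attempt, a pre-recorded video of an authenticated user cannot contain the correct utterance.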
  • the life detection device can receive the sound signal through the microphone and determine the sound signal as the sound signal to be detected.
  • another object in the same environment as the life detection device as exemplified above can also speak the preset text data according to the interactive instruction, thereby generating a sound signal.
  • the life detection device can receive the sound signal through the microphone and determine the sound signal as the sound signal to be detected.
  • the living body detection device can perform speech recognition on the above sound signal to be detected through ASR (Automatic Speech Recognition) to obtain text data.
  • the life detection device can perform speech recognition on the above-mentioned sound signal to be detected through a convolutional neural network algorithm to obtain text data.
  • step 1001: when the living body detection device issues an interactive instruction instructing the object to be detected to say 265910, the object to be detected emits a corresponding sound signal according to the interactive instruction, and the living body detection device receives the sound signal through the microphone to obtain the sound signal to be detected. Afterwards, the living body detection device performs speech recognition on the sound signal to be detected and obtains the text data 265910.
  • The following is a unified description of steps 1004 to 1005:
  • the text data corresponding to the sound signal to be detected can be compared with the preset text data for consistency. Specifically, if the comparison result shows that the recognized text data is inconsistent with the preset text data, it means that the object to be detected did not speak the preset text data according to the interactive instruction output by the living body detection device. At this time, to allow for hearing errors by the object to be detected or recognition errors by the living body detection device, the living body detection device can regenerate the interactive instruction and output it.
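The consistency comparison of steps 1004 to 1005 can be sketched as a normalized string comparison. Ignoring whitespace and letter case is an assumption made here to tolerate formatting differences in the ASR output; the disclosure only requires a consistency comparison.

```python
def texts_consistent(recognized: str, preset: str) -> bool:
    """Compare recognized text with the preset text, ignoring whitespace and case."""
    def norm(s: str) -> str:
        return ''.join(s.split()).lower()
    return norm(recognized) == norm(preset)
```

An inconsistent result would send the flow back to step 1001 (re-issuing the instruction); a consistent result allows step 1006 to proceed.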
  • if consistent, step 1006 can be performed to further determine whether the object to be detected is a living body.
  • steps 1006 to 1008 please refer to the descriptions of steps 302 and 303 above, and will not be described again here.
  • steps 1009 to 1011 please refer to the description of steps 906 to 908 above, and will not be described again here.
  • an interactive instruction can be output to instruct the object to be detected to emit a sound signal corresponding to the preset text data; after the sound signal to be detected is determined, speech recognition is performed on it, and the recognized text data is compared with the preset text data for consistency. If they are consistent, the step of determining the sound source position of the sound signal to be detected can be performed.
  • through the interactive instruction, the object to be detected can be instructed to emit a sound signal corresponding to the preset text data, and it can be initially determined whether the object to be detected can interact with the living body detection device. This prevents counterfeiting with video images recorded in advance, so that when a non-authenticated user imitates an authenticated user by obtaining video images recorded in advance by the authenticated user, the object to be detected can be identified as non-living more quickly and accurately, thus improving the accuracy and reliability of the liveness detection results.
  • FIG. 11 is a block diagram of a living body detection device 110 provided according to an embodiment of the present disclosure. As shown in Figure 11, the device includes:
  • the first determination module 111 is used to determine the sound signal to be detected and the image to be detected corresponding to the sound signal to be detected, where the image to be detected is an image containing the face of the object to be detected;
  • the second determination module 112 is used to determine the sound source position corresponding to the sound signal to be detected, and determine the lip position of the object to be detected based on the image to be detected;
  • the third determination module 113 is configured to compare the sound source position and the lip position for consistency, and determine the living body detection result of the object to be detected based on the comparison result.
  • the third determination module 113 is configured as:
  • if the comparison result indicates that the sound source position and the lip position are inconsistent, it is determined that the living body detection result of the object to be detected is a non-living body;
  • if the comparison result indicates that the sound source position and the lip position are consistent, it is determined that the living body detection result of the object to be detected is a living body.
  • the device further includes (not shown in the figure):
  • an input module configured to input the image to be detected and the sound signal to be detected into a trained mouth shape recognition model when the comparison result indicates that the sound source position and the lip position are consistent, to obtain the output result of the mouth shape recognition model;
  • a first execution module configured to execute the step of determining that the living body detection result of the object to be detected is a living body if the output result indicates that the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected.
  • the device further includes (not shown in the figure):
  • An output module configured to output an interactive instruction before determining the sound signal to be detected, the interactive instruction being used to instruct the object to be detected to emit a sound signal corresponding to the preset text data;
  • a recognition module configured to perform speech recognition on the sound signal to be detected before determining the sound source position corresponding to the sound signal to be detected, and to obtain text data corresponding to the sound signal to be detected;
  • a comparison module configured to compare the text data corresponding to the sound signal to be detected with the preset text data for consistency;
  • the second execution module is configured to execute the step of determining the sound source position corresponding to the sound signal if the comparison result indicates that the text data corresponding to the sound signal to be detected is consistent with the preset text data.
  • the device further includes (not shown in the figure):
  • a third execution module configured to return to executing the step of outputting the interactive instruction if the comparison result indicates that the text data corresponding to the sound signal to be detected is inconsistent with the preset text data.
  • the output module is configured to:
  • the preset text data is generated based on the random array, and the interactive instruction is generated based on the preset text data.
  • the first determining module 111 is configured as:
  • the sound signals collected by each of the microphones are synthesized and processed to obtain the sound signal to be detected.
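The multi-microphone synthesis step can be sketched as follows. The disclosure only states that the per-microphone signals are synthesized, so the align-and-average (delay-and-sum) scheme and the integer sample delays below are assumptions standing in for whatever synthesis the device actually uses.

```python
def synthesize(signals, delays):
    """Align each microphone channel by its known sample delay, then average sample-wise.

    signals: list of per-microphone sample lists
    delays: per-channel integer sample delays (assumed known from array geometry)
    """
    aligned = [s[d:] for s, d in zip(signals, delays)]
    n = min(len(a) for a in aligned)  # truncate to the shortest aligned channel
    return [sum(a[i] for a in aligned) / len(aligned) for i in range(n)]
```

Averaging aligned channels reinforces the common speech component while attenuating uncorrelated noise, which is why multi-microphone capture yields a cleaner sound signal to be detected.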
  • the second determination module 112 is configured to:
  • intersection position of multiple sound source directions is determined as the sound source position of the sound signal to be detected.
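The "intersection of sound source directions" step can be sketched in two dimensions: each decomposed signal yields a ray (a microphone position plus a bearing), and the sound source position is taken as the intersection of two such rays. The microphone positions, bearing angles, and the 2-D simplification are illustrative assumptions.

```python
import math

def intersect_rays(p1, theta1, p2, theta2):
    """Intersect two 2-D rays given their origins and bearing angles (radians)."""
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    # Solve p1 + t*d1 == p2 + s*d2 for t via Cramer's rule.
    denom = d1[0] * (-d2[1]) - d1[1] * (-d2[0])
    if abs(denom) < 1e-9:
        return None  # parallel bearings: no unique intersection
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    t = (rx * (-d2[1]) - ry * (-d2[0])) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])
```

With more than two direction estimates, a least-squares point minimizing the distance to all rays would be the natural generalization of this pairwise intersection.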
  • FIG. 12 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • the electronic device 1200 shown in FIG. 12 includes: at least one processor 1201, a memory 1202, at least one network interface 1204, and a user interface 1203.
  • the various components in electronic device 1200 are coupled together through bus system 1205 .
  • the bus system 1205 is used to implement connection communication between these components.
  • the bus system 1205 also includes a power bus, a control bus and a status signal bus.
  • the various buses are labeled bus system 1205 in FIG. 12 .
  • the user interface 1203 may include a display, a keyboard, or a clicking device (eg, a mouse, a trackball, a touch pad, a touch screen, etc.).
  • the memory 1202 in the embodiment of the present disclosure may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memories.
  • the non-volatile memory can be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or flash memory.
  • the volatile memory may be random access memory (RAM), which is used as an external cache.
  • By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (Synchlink DRAM, SLDRAM), and direct Rambus random access memory (Direct Rambus RAM, DRRAM).
  • memory 1202 stores the following elements, executable units or data structures, or a subset thereof, or an extension thereof: operating system 12021 and applications 12022.
  • the operating system 12021 includes various system programs, such as framework layer, core library layer, driver layer, etc., which are used to implement various basic services and process hardware-based tasks.
  • Application 12022 includes various applications, such as media player, browser, etc., and is used to implement various application services.
  • the program that implements the method of an embodiment of the present disclosure may be included in the application program 12022.
  • by calling the programs or instructions stored in the memory 1202 (which in some embodiments may be the programs or instructions stored in the application program 12022), the processor 1201 is used to execute the methods provided by each method embodiment.
  • Method steps include, for example:
  • the sound source position and the lip position are compared for consistency, and the living body detection result of the object to be detected is determined based on the comparison result.
  • the methods disclosed in the above embodiments of the present disclosure can be applied to the processor 1201 or implemented by the processor 1201.
  • the processor 1201 may be an integrated circuit chip and has signal processing capabilities. During the implementation process, each step of the above method can be completed by instructions in the form of hardware integrated logic circuits or software in the processor 1201 .
  • the above-mentioned processor 1201 can be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the steps of the method disclosed in conjunction with the embodiments of the present disclosure can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software units in the decoding processor.
  • the software unit can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory 1202.
  • the processor 1201 reads the information in the memory 1202 and completes the steps of the above method in combination with its hardware.
  • the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof.
  • the processing unit can be implemented in one or more application specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units used to perform the functions described in this application, or combinations thereof.
  • the techniques described herein may be implemented by means of units that perform the functions described herein.
  • Software code may be stored in memory and executed by a processor.
  • the memory can be implemented in the processor or external to the processor.
  • the electronic device provided by the present disclosure can be the electronic device shown in Figure 12, and can perform all the steps of the living body detection methods in Figures 3-4 and Figures 9-10, thereby realizing the technical effects of the living body detection methods shown in Figures 3-4 and Figures 9-10; for brevity, details are not repeated here.
  • Embodiments of the present disclosure also provide storage media (computer-readable storage media).
  • the storage medium here stores one or more programs.
  • the storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk or a solid state drive; it may also include a combination of the above types of memory.
  • One or more programs in the storage medium can be executed by one or more processors to implement the above-mentioned living body detection method executed on the electronic device side.
  • the processor is used to execute the living body detection program stored in the memory to implement the following steps of the living body detection method executed on the electronic device side:
  • the sound source position and the lip position are compared for consistency, and the living body detection result of the object to be detected is determined based on the comparison result.
  • The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented in hardware, in software modules executed by a processor, or in a combination of the two.
  • Software modules may be located in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the art.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

In order to accurately identify whether an object to be subjected to detection is a living body, the present invention relates to a living body detection method and apparatus, an electronic device and a storage medium. The method comprises: determining a sound signal to be subjected to detection and an image to be subjected to detection which corresponds to said sound signal, wherein said image is an image including the face of an object to be subjected to detection; determining a sound source position corresponding to said sound signal, and determining a lip position of said object on the basis of said image; and performing a consistency comparison on the sound source position and the lip position, and determining a living body detection result of said object according to the comparison result.

Description

Living body detection method and apparatus, electronic device, and storage medium

References to related applications

This disclosure claims the full benefit of the invention patent application No. 202211077263.3, titled "Living body detection method and apparatus, electronic device, and storage medium", filed with the State Intellectual Property Office of the People's Republic of China on September 5, 2022, the entire contents of which are incorporated herein by reference.

Field

The present disclosure relates to the field of biometric identification technology, and specifically to a living body detection method and apparatus, an electronic device, and a storage medium.

Background

Liveness detection is a method of determining the true physiological characteristics of a subject in certain identity verification scenarios. In face recognition applications, liveness detection uses combined actions such as blinking, opening the mouth, shaking the head, and nodding, together with vital-sign detection techniques such as facial key point positioning and face tracking, to verify whether the user is a real living person, so as to resist common attack methods such as photos, face swaps, masks, occlusions, and screen remakes.

Overview
The present disclosure provides a living body detection method and apparatus, an electronic device, and a storage medium.

In a first aspect, the present disclosure provides a living body detection method, the method including:

determining a sound signal to be detected and an image to be detected corresponding to the sound signal to be detected, the image to be detected being an image containing the face of an object to be detected;

determining a sound source position corresponding to the sound signal to be detected, and determining a lip position of the object to be detected based on the image to be detected;

comparing the sound source position and the lip position for consistency, and determining a living body detection result of the object to be detected based on the comparison result.

In some embodiments, comparing the sound source position and the lip position for consistency includes:

determining a reference space area based on the lip position;

determining whether the sound source position is located within the reference space area;

when the sound source position is located within the reference space area, obtaining a comparison result indicating that the sound source position and the lip position are consistent;

when the sound source position is not located within the reference space area, obtaining a comparison result indicating that the sound source position and the lip position are inconsistent.

In some embodiments, determining the living body detection result of the object to be detected based on the comparison result includes:

when the comparison result indicates that the sound source position and the lip position are inconsistent, determining that the living body detection result of the object to be detected is a non-living body;

when the comparison result indicates that the sound source position and the lip position are consistent, determining that the living body detection result of the object to be detected is a living body.
In some embodiments, when the comparison result indicates that the sound source position and the lip position are consistent, the method further includes:

inputting the image to be detected and the sound signal to be detected into a trained mouth shape recognition model to obtain an output result of the mouth shape recognition model;

when the output result indicates that the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected, executing the step of determining that the living body detection result of the object to be detected is a living body.

In some embodiments, before determining the sound signal to be detected, the method further includes:

outputting an interactive instruction, the interactive instruction being used to instruct the object to be detected to emit a sound signal corresponding to preset text data;

and before determining the sound source position corresponding to the sound signal to be detected, the method further includes:

performing speech recognition on the sound signal to be detected to obtain text data corresponding to the sound signal to be detected;

comparing the text data corresponding to the sound signal to be detected with the preset text data for consistency;

when the comparison result indicates that the text data corresponding to the sound signal to be detected is consistent with the preset text data, executing the step of determining the sound source position corresponding to the sound signal.

In some embodiments, the method further includes:

when the comparison result indicates that the text data corresponding to the sound signal to be detected is inconsistent with the preset text data, returning to the step of outputting the interactive instruction.

In some embodiments, the generation process of the interactive instruction includes:

calling a preset random number generation algorithm to generate a random array;

generating the preset text data based on the random array, and generating the interactive instruction based on the preset text data.
In some embodiments, determining the sound signal to be detected includes:

acquiring sound signals collected by multiple microphones;

synthesizing the sound signals collected by each microphone to obtain the sound signal to be detected.

In some embodiments, determining the sound source position corresponding to the sound signal to be detected includes:

decomposing the sound signal to be detected to obtain multiple decomposed signals;

determining the sound source direction of each decomposed signal;

determining the intersection position of the multiple sound source directions as the sound source position of the sound signal to be detected.
In a second aspect, the present disclosure provides a living body detection apparatus, the apparatus including:

a first determination module configured to determine a sound signal to be detected and an image to be detected corresponding to the sound signal to be detected, the image to be detected being an image containing the face of an object to be detected;

a second determination module configured to determine a sound source position corresponding to the sound signal to be detected, and determine a lip position of the object to be detected based on the image to be detected;

a third determination module configured to compare the sound source position and the lip position for consistency, and determine a living body detection result of the object to be detected based on the comparison result.

In some embodiments, the third determination module is configured to:

determine a reference space area based on the lip position;

determine whether the sound source position is located within the reference space area;

if so, obtain a comparison result indicating that the sound source position and the lip position are consistent;

if not, obtain a comparison result indicating that the sound source position and the lip position are inconsistent.

In some embodiments, the third determination module is configured to:

determine that the living body detection result of the object to be detected is a non-living body if the comparison result indicates that the sound source position and the lip position are inconsistent;

determine that the living body detection result of the object to be detected is a living body if the comparison result indicates that the sound source position and the lip position are consistent.
In some embodiments, the apparatus further includes:

an input module configured to input the image to be detected and the sound signal to be detected into a trained mouth shape recognition model to obtain an output result of the mouth shape recognition model when the comparison result indicates that the sound source position and the lip position are consistent;

a first execution module configured to execute the step of determining that the living body detection result of the object to be detected is a living body if the output result indicates that the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected.

In some embodiments, the apparatus further includes:

an output module configured to output an interactive instruction before determining the sound signal to be detected, the interactive instruction being used to instruct the object to be detected to emit a sound signal corresponding to preset text data;

a recognition module configured to perform speech recognition on the sound signal to be detected before determining the sound source position corresponding to the sound signal to be detected, to obtain text data corresponding to the sound signal to be detected;

a comparison module configured to compare the text data corresponding to the sound signal to be detected with the preset text data for consistency;

a second execution module configured to execute the step of determining the sound source position corresponding to the sound signal if the comparison result indicates that the text data corresponding to the sound signal to be detected is consistent with the preset text data.

In some embodiments, the apparatus further includes:

a third execution module configured to return to executing the step of outputting the interactive instruction if the comparison result indicates that the text data corresponding to the sound signal to be detected is inconsistent with the preset text data.

In some embodiments, the output module is configured to:

call a preset random number generation algorithm to generate a random array;

generate the preset text data based on the random array, and generate the interactive instruction based on the preset text data.
在某些实施方案中,第一确定模块配置为:In some embodiments, the first determining module is configured as:
获取多个麦克风采集到的声音信号;Obtain sound signals collected by multiple microphones;
将每个麦克风采集到的声音信号进行合成处理,得到待检测声音信号。The sound signals collected by each microphone are synthesized and processed to obtain the sound signal to be detected.
In some embodiments, the second determination module is configured to:
decompose the sound signal to be detected to obtain multiple decomposed signals;
determine the sound source direction of each decomposed signal; and
determine the intersection of the multiple sound source directions as the sound source position of the sound signal to be detected.
In a third aspect, the present disclosure provides an electronic device, including a processor and a memory, where the processor is configured to execute a living body detection program stored in the memory to implement the living body detection method described in the present disclosure.
In a fourth aspect, the present disclosure provides a storage medium storing one or more programs, where the one or more programs can be executed by one or more processors to implement the living body detection method described in the present disclosure.
In some embodiments, according to the living body detection method of the present disclosure, the sound source position of the sound signal to be detected and the lip position of the object to be detected can be located directly. When the sound source position and the lip position are determined to be consistent, the sound signal to be detected was emitted from the lips of the object to be detected, and the object to be detected is a living body; otherwise, the object to be detected is a non-living body. In this way, even if a non-authenticated user impersonates an authenticated user by obtaining a video image of the authenticated user, the object to be detected can still be identified as a non-living body, which improves the accuracy and reliability of living body detection results.
Brief Description of the Drawings
Figure 1 is a schematic diagram of an application scenario according to an embodiment of the present disclosure;
Figure 2 is a schematic diagram of an application scenario according to an embodiment of the present disclosure;
Figure 3 is a flowchart of a living body detection method provided by an embodiment of the present disclosure;
Figure 4 is a flowchart of a living body detection method provided by an embodiment of the present disclosure;
Figure 5 is a diagram of the microphone layout on a living body detection device provided by an embodiment of the present disclosure;
Figure 6 is a schematic diagram of a living body detection device provided by an embodiment of the present disclosure determining a sound source position through microphones;
Figure 7 is a schematic diagram of a sound source position provided by an embodiment of the present disclosure;
Figure 8 is a schematic diagram of a sound source position provided by an embodiment of the present disclosure;
Figure 9 is a flowchart of a living body detection method provided by an embodiment of the present disclosure;
Figure 10 is a flowchart of a living body detection method provided by an embodiment of the present disclosure;
Figure 11 is a block diagram of a living body detection apparatus provided by an embodiment of the present disclosure;
Figure 12 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
Refer to Figure 1, which is a schematic diagram of an application scenario according to an embodiment of the present disclosure.
The application scenario shown in Figure 1 includes a user 11 and a living body detection device 13. The living body detection device 13 may be any of various electronic devices with a display screen on which a living body detection system is installed, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers. In this embodiment, the display screen can be used to display the video signal captured by the camera and to indicate the position of the face of the user to be detected. In Figure 1, a display screen is used as an example to represent the living body detection device 13.
In the application scenario shown in Figure 1, the user 11 can undergo living body detection in the normal way, meaning that the user 11 stands in front of the camera of the living body detection device 13. The living body detection device 13 can then directly collect, through the camera, a video image containing the face of the user 11, and perform living body detection on the user 11 based on that video image.
Refer to Figure 2, which is a schematic diagram of an application scenario according to an embodiment of the present disclosure. The application scenario includes a user 11, a user 12, a living body detection device 13, a terminal 14, and a terminal 15, where the terminal 14 and the terminal 15 can communicate over a network.
The terminal 14 and the terminal 15 may be hardware devices or software that support network connections and provide various network services. When the terminal 14 and the terminal 15 are hardware, they may be any of various electronic devices with a display screen, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers; Figure 2 takes smartphones as an example. When the terminal 14 and the terminal 15 are software, they may be installed on the electronic devices listed above. In this embodiment, the terminal 14 and the terminal 15 establish a video call by each installing a corresponding application.
In the application scenario shown in Figure 2, assume that the user 11 is a non-authenticated user and the user 12 is an authenticated user. When the user 11 wants to pass the living body detection of the living body detection device 13, the user 11 can make a video call with the user 12 through the terminal 14 and the terminal 15, or play through the terminal 14 a pre-recorded video image containing the face of the user 12. At this time, the video image of the user 12 is displayed on the display screen of the terminal 14.
In some embodiments, the user 11 can point the display screen of the terminal 14 toward the camera of the living body detection device 13, so that the living body detection device 13 receives the video image of the user 12 through the camera. Since the living body detection device 13 can detect vital signs from the video image of the user 12, the detection is passed. It can be seen that, in the prior art, a non-authenticated user can impersonate an authenticated user by obtaining a video image of the authenticated user and thereby pass living body detection.
On this basis, an embodiment of the present disclosure provides a living body detection method to prevent a non-authenticated user who impersonates an authenticated user by obtaining a video image of the authenticated user from passing living body detection, thereby improving the accuracy of living body detection results.
The living body detection method provided by the present disclosure is further explained below with reference to the accompanying drawings through specific embodiments; these embodiments do not constitute a limitation on the embodiments of the present disclosure.
Refer to Figure 3, which is a flowchart of a living body detection method provided by an embodiment of the present disclosure. In some embodiments, the process shown in Figure 3 can be applied to a living body detection device, such as the living body detection device 13 shown in Figure 1. As shown in Figure 3, the process may include the following steps:
Step 301: determine a sound signal to be detected and an image to be detected corresponding to the sound signal to be detected, where the image to be detected is an image containing the face of an object to be detected;
Step 302: determine a sound source position corresponding to the sound signal to be detected, and determine a lip position of the object to be detected based on the image to be detected;
Step 303: compare the sound source position and the lip position for consistency, and determine a living body detection result of the object to be detected according to the comparison result.
In this embodiment, the sound signal to be detected is the sound signal received through the microphone when the living body detection device performs living body detection. The image to be detected is an image containing the face of the object to be detected, collected through the camera when the living body detection device performs living body detection. There may be one or more images to be detected; when there are multiple images to be detected, they may refer to multiple images in a video collected by the living body detection device through the camera.
Since both the sound signal to be detected and the image to be detected are obtained when the living body detection device performs living body detection, the sound signal to be detected and the image to be detected are said to correspond to each other.
In an exemplary application scenario, the object to be detected is a real object. For example, in the application scenario shown in Figure 1 above, the user 11 is the object to be detected. In this case, the living body detection device 13 can directly collect an image containing the face of the user 11 through the camera to obtain the image to be detected. At the same time, the user 11 can directly emit a sound signal, which the living body detection device receives through the microphone, thereby determining the sound signal to be detected.
In another exemplary application scenario, the object to be detected is a virtual object. For example, in the application scenario shown in Figure 2 above, the video image of the user 12 output on the display screen of the terminal 14 is the object to be detected. In this case, the living body detection device 13 can collect an image containing the face of the user 12 through the camera. At the same time, the user 12 can emit a sound signal, which is collected by the terminal 15 and sent to the terminal 14, and the terminal 14 can play the sound signal through its speaker; in this way, the living body detection device 13 can receive the sound signal to be detected through the microphone. Alternatively, the user 11 can emit a sound signal, and the living body detection device 13 can receive the sound signal to be detected through the microphone.
In some embodiments, the living body detection device may be provided with multiple microphones, usually arranged at different positions; for example, four microphones may be provided, one at each of the four corners of the living body detection device. When determining the sound signal to be detected, the living body detection device can acquire the sound signals collected by the multiple microphones, obtaining multiple sound signals.
In some embodiments, the living body detection device can synthesize the sound signals collected by the microphones and determine the synthesized sound signal as the sound signal to be detected. In this way, noise can be eliminated, and a clearer and more accurate sound signal can be obtained.
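As a minimal sketch of this synthesis step, the per-microphone signals can be combined by sample-wise averaging, which attenuates uncorrelated noise while preserving the common speech component. A real device would more likely use delay-and-sum beamforming with per-channel time alignment; the embodiments do not fix the algorithm, so averaging here is an assumption for illustration.

```python
def synthesize_channels(channels):
    """Combine time-aligned per-microphone signals into the sound signal
    to be detected by sample-wise averaging. A stand-in for the
    unspecified synthesis processing: uncorrelated noise partially
    cancels across channels."""
    if not channels:
        raise ValueError("need at least one channel")
    length = len(channels[0])
    if any(len(c) != length for c in channels):
        raise ValueError("channels must be time-aligned and of equal length")
    return [sum(c[i] for c in channels) / len(channels) for i in range(length)]
```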
In some embodiments, the living body detection device can acquire the sound signal from any one of the above microphones and use that sound signal as the sound signal to be detected.
Step 302 and step 303 are described together below.
In some embodiments, the living body detection device can locate the sound source position corresponding to the sound signal to be detected; how this is done is explained below through the process shown in Figure 4 and is not detailed here.
In some embodiments, since the image to be detected contains the face of the object to be detected, the living body detection device can determine the lip position of the object to be detected by performing facial recognition on the image to be detected.
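One rough way to derive a lip position from a detected face is to place it at a fixed fraction of the face bounding-box height. The 0.75 fraction below is an anthropometric assumption for illustration only; the embodiments leave the facial-recognition method open, and a production system would instead use a facial-landmark model.

```python
def estimate_lip_position(face_box):
    """face_box is (x, y, w, h) in image coordinates from a face
    detector. The lips are assumed to sit on the vertical midline at
    roughly 3/4 of the way down the face box -- a heuristic, not the
    method prescribed by the embodiments."""
    x, y, w, h = face_box
    return (x + w / 2.0, y + 0.75 * h)
```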
In some embodiments, the sound source position and the lip position can be compared for consistency, and the living body detection result of the object to be detected can be determined according to the comparison result. Specifically, when the comparison result indicates that the sound source position and the lip position are inconsistent, the sound signal to be detected was not emitted from the lips of the object to be detected, meaning that the living body detection result of the object to be detected is non-living.
In an exemplary application scenario, assume that the object to be detected is a virtual object. For example, in the application scenario shown in Figure 2 above, the video image of the user 12 output on the display screen of the terminal 14 is the object to be detected; in this case, the living body detection device 13 can collect an image containing the video image of the user 12 through the camera.
At the same time, the user 12 can emit a sound signal, which is collected by the terminal 15 and sent to the terminal 14, and the terminal 14 can play the sound signal through its speaker. The speaker of the terminal 14 may be located at the top, bottom, left, or right of the housing of the terminal 14; that is, the sound source position of the sound signal may be at the top, bottom, left, or right of the housing of the terminal 14. As shown in Figure 7, the sound source position may be located in any one of regions A, B, C, D, or E, which is inconsistent with the lip position of the object to be detected. Alternatively, the user 11 can emit a sound signal, in which case the sound source position is at the location of the user 11, which is also inconsistent with the lip position of the object to be detected. In either case, the living body detection result of the object to be detected can be determined to be non-living.
Conversely, when the comparison result indicates that the sound source position and the lip position are consistent, the sound signal to be detected was emitted from the lips of the object to be detected, meaning that the living body detection result of the object to be detected is a living body.
In another exemplary application scenario, the object to be detected is a real object. For example, in the application scenario shown in Figure 1 above, the user 11 is the object to be detected. In this case, the living body detection device 13 can directly collect an image containing the face of the user 11 through the camera to obtain the image to be detected. At the same time, the user 11 can directly emit a sound signal, which the living body detection device receives through the microphone, thereby determining the sound signal to be detected.
Since the sound signal to be detected is emitted directly from the lips of the object to be detected, the sound source position shown in Figure 8 can be obtained. As shown in Figure 8, the sound source position of the sound signal to be detected is located in region F, which is known to be the lip region of the object to be detected. This is consistent with the lip position of the object to be detected, so the living body detection result of the object to be detected can be determined to be a living body.
In some embodiments, the sound signal to be detected and the corresponding image to be detected containing the face of the object to be detected are determined; then, the sound source position of the sound signal to be detected is determined, and the lip position of the object to be detected is determined based on the image to be detected; the sound source position and the lip position are compared for consistency, and the living body detection result of the object to be detected is determined according to the comparison result. In these embodiments, the sound source position of the sound signal to be detected and the lip position of the object to be detected can be located directly. When the sound source position and the lip position are determined to be consistent, the sound signal to be detected was emitted from the lips of the object to be detected, and the object to be detected is a living body; otherwise, the object to be detected is a non-living body. In this way, even if a non-authenticated user impersonates an authenticated user by obtaining a video image of the authenticated user, the object to be detected can still be identified as a non-living body, which improves the accuracy and reliability of living body detection results.
Refer to Figure 4, which is a flowchart of a living body detection method provided by an embodiment of the present disclosure. On the basis of the process shown in Figure 3 above, the process shown in Figure 4 describes how the living body detection device locates the sound source position of the sound signal to be detected. As shown in Figure 4, the process may include the following steps:
Step 401: decompose the sound signal to be detected to obtain multiple decomposed signals;
Step 402: determine the sound source direction of each decomposed signal;
Step 403: determine the intersection of the multiple sound source directions as the sound source position of the sound signal to be detected.
Each decomposed signal may be the sound signal in any one direction among the sound signals in different directions contained in the sound signal to be detected.
In some embodiments, in order to locate the sound source position of the sound signal to be detected more accurately, the living body detection device can use microphone array signal processing technology to determine the sound source position of the sound signal to be detected.
On this basis, the living body detection device may be provided with N microphones. Here, in order to determine the sound source position of the sound signal to be detected in three-dimensional space more accurately, N may be greater than or equal to 3. Figure 5 shows the microphone layout on a living body detection device provided by an embodiment of the present disclosure. As can be seen from Figure 5, the detection device is provided with four microphones, one at each corner: microphone 1, microphone 2, microphone 3, and microphone 4.
As described in step 301 above, when multiple microphones are provided on the living body detection device, the sound signals collected by the multiple microphones can be acquired and synthesized to obtain the sound signal to be detected.
On this basis, when determining the sound source position of the sound signal to be detected, the sound signal to be detected can be decomposed to obtain multiple decomposed signals, where each decomposed signal may correspond to one microphone.
Step 402 and step 403 are described together below.
As described in step 401 above, each decomposed signal may correspond to one microphone. Since each microphone can locate the sound source direction of a sound signal when receiving it, the sound source direction of each decomposed signal can be determined through the microphone corresponding to that decomposed signal.
On this basis, after the sound signal to be detected is decomposed into multiple decomposed signals, the sound source direction of each decomposed signal can be determined. In some embodiments, the intersection of the multiple sound source directions can be determined as the sound source position of the sound signal to be detected.
Assume that one microphone is provided at each of the four corners of the living body detection device. As shown in Figure 6, each microphone corresponds to one decomposed signal of the sound signal to be detected and can determine the sound source direction of that decomposed signal, yielding four sound source directions. The four sound source directions intersect at point A, so point A can be determined as the sound source position of the sound signal to be detected.
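The fusion of per-microphone directions into one position can be sketched as a least-squares intersection of rays. In 2-D, each ray is a microphone position plus a unit direction toward the source, and the point minimizing the summed squared distance to all rays solves a small linear system. This is one standard way to realize the "intersection of sound source directions" step; the embodiments do not prescribe the numerical method.

```python
def intersect_directions(rays):
    """rays: list of ((px, py), (dx, dy)) pairs, each a microphone
    position and a unit direction toward the source. Returns the point
    minimizing the summed squared distance to all rays -- the exact
    intersection when the directions are consistent."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (px, py), (dx, dy) in rays:
        # I - d d^T projects out the component along the ray direction
        m11, m12, m22 = 1.0 - dx * dx, -dx * dy, 1.0 - dy * dy
        a11 += m11; a12 += m12; a22 += m22
        b1 += m11 * px + m12 * py
        b2 += m12 * px + m22 * py
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("directions are parallel; no unique intersection")
    # solve the 2x2 normal equations by Cramer's rule
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)
```

With more than two microphones, the least-squares formulation also absorbs small per-microphone direction errors instead of requiring the rays to meet exactly.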
In some embodiments, the sound signal to be detected is decomposed to obtain multiple decomposed signals; the sound source direction of each decomposed signal is then determined, and the intersection of the multiple sound source directions is determined as the sound source position of the sound signal to be detected, so that the sound source position of the sound signal to be detected is located more accurately.
Refer to Figure 9, which is a flowchart of a living body detection method provided by an embodiment of the present disclosure. As shown in Figure 9, the process may include the following steps:
Step 901: determine a sound signal to be detected and an image to be detected corresponding to the sound signal to be detected, where the image to be detected is an image containing the face of an object to be detected;
Step 902: determine a sound source position corresponding to the sound signal to be detected, and determine a lip position of the object to be detected based on the image to be detected;
Step 903: determine a reference spatial region based on the lip position;
Step 904: determine whether the sound source position is within the reference spatial region; if so, execute step 906; if not, execute step 905;
Step 905: determine that the living body detection result of the object to be detected is non-living;
Step 906: input the image to be detected and the sound signal to be detected into a trained mouth shape recognition model to obtain the output result of the mouth shape recognition model;
Step 907: determine whether the output result indicates that the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected; if so, execute step 908; if not, execute step 905;
Step 908: determine that the living body detection result of the object to be detected is a living body.
For detailed descriptions of step 901 and step 902, refer to the descriptions of steps 301 and 302 above, which are not repeated here.
Steps 903 to 905 are described together below.
In this embodiment, a reference spatial region can first be determined based on the lip position. In some embodiments, a spherical or cuboid region centered on the lip position can be set as the reference spatial region. Then, it is determined whether the sound source position is within the reference spatial region: if so, the sound source position and the lip position can be determined to be consistent; if not, they can be determined to be inconsistent.
This processing can absorb errors in locating the sound source position, as well as inaccuracies in locating the lip position caused by changes in the posture of the object to be detected while emitting the sound signal, thereby making the consistency comparison result more accurate.
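A sketch of this tolerance check, using a sphere centered on the lip position. The 0.15 m radius is an illustrative assumption: the embodiments name spherical or cuboid regions but fix no size.

```python
def positions_consistent(source_pos, lip_pos, radius=0.15):
    """Return True when the sound source position falls inside the
    reference spatial region: a sphere of the given radius centered on
    the lip position. The radius absorbs localization error and small
    posture changes; 0.15 m is an assumed value, not from the
    embodiments."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(source_pos, lip_pos))
    return dist_sq <= radius ** 2
```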
Steps 906 to 908 are described together below.
In this embodiment, the image to be detected and the sound signal to be detected can be input into the trained mouth shape recognition model to obtain an output result indicating whether the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected.
In some embodiments, an output result of 1 indicates that the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected, and an output result of 0 indicates that they do not match.
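Combining the two checks of Figure 9, the final decision can be sketched as follows; the 1/0 convention follows the description above, and the helper itself is illustrative rather than part of the embodiments.

```python
def liveness_result(in_reference_region: bool, model_output: int) -> str:
    """Map the two checks of Figure 9 to a final result: the sound
    source must lie in the reference spatial region (steps 904-905), and
    the mouth shape recognition model must report a match, i.e. output 1
    (steps 907-908). Any failed check yields non-living (step 905)."""
    if not in_reference_region:
        return "non-living"
    return "living" if model_output == 1 else "non-living"
```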
在该实施例中,在上述输出结果表示待检测声音信号和待检测图像中待检测对象的口型相匹配时,可确定待检测对象的活体检测结果为活体。In this embodiment, when the above output result indicates that the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected, it can be determined that the living body detection result of the object to be detected is a living body.
在一示例性应用场景中,用户11与活体检测设备13处于同一环境中,那么当用户11,也即待检测对象在某一时刻发出“啊”的声音信号时,活体检测设备可以通过摄像头直接采集到当前时 刻,包含待检测对象面部的图像,得到待检测图像,以及通过麦克风接收到该“啊”的声音信号,得到待检测声音信号。In an exemplary application scenario, the user 11 and the life detection device 13 are in the same environment. Then when the user 11, that is, the object to be detected emits an "ah" sound signal at a certain moment, the life detection device can directly use the camera to Collect the current time At this moment, the image containing the face of the object to be detected is obtained to obtain the image to be detected, and the "ah" sound signal is received through the microphone to obtain the sound signal to be detected.
基于此,将上述待检测图像和待检测声音信号输入至上述口型识别模型,该口型识别模型通过对待检测图像中待检测对象的口型进行识别,可得到待检测对象的口型为“啊”的口型,说明待检测对象的口型与待检测声音信号相匹配,从而可确定待检测对象为活体。Based on this, the above-mentioned image to be detected and the sound signal to be detected are input to the above-mentioned mouth shape recognition model. By identifying the mouth shape of the object to be detected in the image to be detected, the mouth shape recognition model can obtain the mouth shape of the object to be detected as " "Ah" mouth shape indicates that the mouth shape of the object to be detected matches the sound signal to be detected, thus it can be determined that the object to be detected is a living body.
Conversely, when the output result indicates that the sound signal to be detected does not match the mouth shape of the object to be detected in the image to be detected, the living body detection result of the object to be detected can be determined to be non-living.
In the application scenario shown in Figure 2 above, the video image of user 12 output on the display screen of terminal 14 is the object to be detected; the living body detection device 13 can capture an image containing the video image of user 12 through its camera. Suppose that at a certain moment user 11 emits an "ah" sound signal while user 12 emits no sound at the same moment. The image to be detected collected by the living body detection device is then the video image of user 12, while the collected sound signal to be detected is the sound signal emitted by user 11.
The sound signal to be detected and the image to be detected are then input into the mouth shape recognition model. Since user 12 emits no sound signal at the current moment, recognizing the image to be detected yields a closed mouth shape for the object to be detected, whereas the sound signal to be detected is an "ah" sound signal, whose corresponding mouth shape should be open. The model therefore outputs a result indicating that the sound signal to be detected does not match the mouth shape of the object to be detected in the image to be detected, and the object to be detected is determined to be non-living.
In some embodiments, after it is determined that the sound source position lies within the reference space region determined based on the lip position, the sound signal to be detected and the image to be detected are input into the trained mouth shape recognition model, and whether the living body detection result of the object to be detected is a living body can be further determined from the output result. In these embodiments, whether the object to be detected is a living body is determined by further checking whether the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected. Thus, even if a non-authenticated user impersonates an authenticated user by obtaining the authenticated user's video image, the object to be detected can still be identified as non-living, improving the accuracy and reliability of the living body detection result.
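The internals of the mouth shape recognition model are not specified in this disclosure. As a minimal illustrative sketch (not the patented model), the 1/0 match output described above can be approximated by checking, frame by frame, whether mouth openness estimated from the image stream agrees with voice activity in the sound signal. The function name, inputs, and thresholds below are all assumptions introduced for illustration.

```python
def match_mouth_to_audio(mouth_openness, audio_energy,
                         open_thresh=0.3, energy_thresh=0.1):
    """Return 1 if mouth-open frames line up with voiced audio frames,
    else 0 (mirroring the model's 1/0 output described in the text).

    mouth_openness: per-frame lip-opening ratio from the image stream.
    audio_energy:   per-frame short-time energy of the sound signal.
    All names and thresholds here are illustrative, not from the patent.
    """
    agree = 0
    for m, e in zip(mouth_openness, audio_energy):
        mouth_open = m >= open_thresh
        voiced = e >= energy_thresh
        if mouth_open == voiced:
            agree += 1
    n = min(len(mouth_openness), len(audio_energy))
    # Require agreement on most frames to call it a match.
    return 1 if n and agree / n >= 0.8 else 0

# A speaker saying "ah": the mouth opens as the energy rises -> match (1).
print(match_mouth_to_audio([0.0, 0.5, 0.6, 0.1], [0.0, 0.4, 0.5, 0.05]))
# A silent on-screen face while someone off-screen speaks -> mismatch (0).
print(match_mouth_to_audio([0.0, 0.0, 0.0, 0.0], [0.0, 0.4, 0.5, 0.4]))
```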
Referring to Figure 10, a flowchart of a living body detection method provided by an embodiment of the present disclosure. As shown in Figure 10, the process may include the following steps:
Step 1001: Output an interactive instruction, the interactive instruction being used to instruct the object to be detected to emit a sound signal corresponding to preset text data.
Step 1002: Determine a sound signal to be detected and an image to be detected corresponding to the sound signal to be detected, the image to be detected being an image containing the face of the object to be detected.
Step 1003: Perform speech recognition on the sound signal to be detected to obtain text data corresponding to the sound signal to be detected.
Step 1004: Compare the text data corresponding to the sound signal to be detected with the preset text data for consistency.
Step 1005: Determine whether the comparison result indicates that the recognized text data is consistent with the preset text data; if so, execute step 1006; if not, execute step 1001.
Step 1006: Determine the sound source position corresponding to the sound signal to be detected, and determine the lip position of the object to be detected based on the image to be detected.
Step 1007: Determine whether the sound source position and the lip position are consistent; if so, execute step 1009; if not, execute step 1008.
Step 1008: Determine that the living body detection result of the object to be detected is non-living.
Step 1009: Input the image to be detected and the sound signal to be detected into the trained mouth shape recognition model to obtain an output result of the mouth shape recognition model.
Step 1010: Determine whether the output result indicates that the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected; if so, execute step 1011; if not, execute step 1008.
Step 1011: Determine that the living body detection result of the object to be detected is a living body.
In some embodiments, the interactive instruction can be generated by the living body detection device as follows: first, a preset random number generation algorithm is called to generate a random digit sequence; then, the preset text data is generated based on the random digit sequence, and the interactive instruction is generated based on the preset text data.
For example, the living body detection device calls the preset random number generation algorithm to generate the following random digit sequence: 265910. The living body detection device then determines this sequence as the preset text data, generates from it an interactive instruction instructing the object to be detected to say 265910, and outputs the interactive instruction.
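A minimal sketch of this challenge generation, assuming Python's `secrets` module as the "preset random number generation algorithm" (the disclosure does not name a specific algorithm):

```python
import secrets

def generate_challenge(num_digits=6):
    """Generate a random digit string for the subject to read aloud.
    Using `secrets` is one possible choice for the random number
    generation algorithm; the patent does not specify one."""
    return "".join(str(secrets.randbelow(10)) for _ in range(num_digits))

challenge = generate_challenge()
instruction = f"Please say: {challenge}"  # the interactive instruction
print(instruction)
```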
Under normal circumstances, after receiving the interactive instruction, the object to be detected speaks the preset text data as instructed, thereby producing a sound signal. The living body detection device can then receive this sound signal through its microphone and determine it as the sound signal to be detected.
Of course, in an exemplary application scenario, another object in the same environment as the living body detection device, as exemplified above, may also speak the preset text data according to the interactive instruction, thereby producing a sound signal. In this case the living body detection device likewise receives the sound signal through its microphone and determines it as the sound signal to be detected.
In some embodiments, the living body detection device can perform speech recognition on the sound signal to be detected through ASR (Automatic Speech Recognition) to obtain text data.
In some embodiments, the living body detection device can perform speech recognition on the sound signal to be detected through a convolutional neural network algorithm to obtain text data.
For example, in step 1001, when the living body detection device issues an interactive instruction instructing the object to be detected to say 265910, the object to be detected emits a corresponding sound signal according to the interactive instruction, and the living body detection device receives this sound signal through its microphone, obtaining the sound signal to be detected. The living body detection device then performs speech recognition on the sound signal to be detected and obtains the text data 265910.
Steps 1004 to 1005 are described together below:
In some embodiments, the text data corresponding to the sound signal to be detected can be compared with the preset text data for consistency. Specifically, if the comparison result indicates that the recognized text data is inconsistent with the preset text data, the object to be detected did not speak the preset text data according to the interactive instruction output by the living body detection device. In this case, to allow for hearing errors on the part of the object to be detected or recognition errors by the living body detection device, the living body detection device can regenerate the interactive instruction and output it again.
Conversely, if the comparison result indicates that the recognized text data is consistent with the preset text data, the object to be detected spoke the preset text data according to the interactive instruction output by the living body detection device. In this case, step 1006 can be executed to further determine whether the object to be detected is a living body.
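The consistency comparison of steps 1004 to 1005 can be sketched as a simple string comparison. Normalizing away spaces, punctuation, and case before comparing is an illustrative tolerance choice, not something the disclosure specifies:

```python
def texts_consistent(recognized, preset):
    """Compare ASR output with the preset text after stripping
    non-alphanumeric characters and lowercasing (assumed tolerance)."""
    def clean(s):
        return "".join(ch for ch in s if ch.isalnum()).lower()
    return clean(recognized) == clean(preset)

# ASR output with extra spaces still counts as consistent.
print(texts_consistent("2 6 5 9 1 0", "265910"))  # True
# A wrong digit is inconsistent: re-issue the interactive instruction.
print(texts_consistent("265911", "265910"))       # False
```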
For a detailed description of steps 1006 to 1008, refer to the descriptions of steps 302 and 303 above, which are not repeated here.
For a description of steps 1009 to 1011, refer to the descriptions of steps 906 to 908 above, which are not repeated here.
In some embodiments, before the sound signal to be detected is determined, an interactive instruction can first be output to instruct the object to be detected to emit a sound signal corresponding to the preset text data; after the sound signal to be detected is determined, it is recognized and the recognized text data is compared with the preset text data for consistency; if they are consistent, the step of determining the sound source position of the sound signal to be detected can be performed. In these embodiments, the interactive instruction first instructs the object to be detected to emit the sound signal corresponding to the preset text data, which makes it possible to determine preliminarily whether the object to be detected can interact with the living body detection device, and prevents the object to be detected from impersonating with a pre-recorded video image. Thus, when a non-authenticated user impersonates an authenticated user with a video image recorded in advance by the authenticated user, the object to be detected can be identified as non-living more quickly and accurately, improving the accuracy and reliability of the living body detection result.
Referring to Figure 11, a block diagram of a living body detection apparatus 110 provided by an embodiment of the present disclosure. As shown in Figure 11, the apparatus includes:
a first determination module 111, configured to determine a sound signal to be detected and an image to be detected corresponding to the sound signal to be detected, the image to be detected being an image containing a face of an object to be detected;
a second determination module 112, configured to determine a sound source position corresponding to the sound signal to be detected, and determine a lip position of the object to be detected based on the image to be detected;
a third determination module 113, configured to compare the sound source position and the lip position for consistency, and determine a living body detection result of the object to be detected based on a comparison result.
In some embodiments, the third determination module 113 is configured to:
determine a reference space region based on the lip position;
determine whether the sound source position is located within the reference space region;
if so, obtain a comparison result that the sound source position and the lip position are consistent;
if not, obtain a comparison result that the sound source position and the lip position are inconsistent.
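One way to realize this check, assuming the reference space region is a sphere of fixed radius around the lip position (the disclosure leaves the region's exact shape open), is:

```python
def source_in_reference_region(source_pos, lip_pos, radius=0.15):
    """Return True if the sound source lies within `radius` metres of
    the lip position. The spherical region and the 0.15 m radius are
    assumptions made for illustration only."""
    dist = sum((s - l) ** 2 for s, l in zip(source_pos, lip_pos)) ** 0.5
    return dist <= radius

# Sound source about 4 cm from the lips: consistent (living-body case).
print(source_in_reference_region((0.02, 0.03, 0.02), (0.0, 0.0, 0.0)))  # True
# Sound source 1 m away, e.g. a person off to the side: inconsistent.
print(source_in_reference_region((1.0, 0.0, 0.0), (0.0, 0.0, 0.0)))     # False
```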
In some embodiments, the third determination module 113 is configured to:
if the comparison result indicates that the sound source position and the lip position are inconsistent, determine that the living body detection result of the object to be detected is non-living;
if the comparison result indicates that the sound source position and the lip position are consistent, determine that the living body detection result of the object to be detected is a living body.
In some embodiments, the apparatus further includes (not shown in the figure):
an input module, configured to input the image to be detected and the sound signal to be detected into a trained mouth shape recognition model to obtain an output result of the mouth shape recognition model when the comparison result indicates that the sound source position and the lip position are consistent;
a first execution module, configured to execute the step of determining that the living body detection result of the object to be detected is a living body if the output result indicates that the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected.
In some embodiments, the apparatus further includes (not shown in the figure):
an output module, configured to output an interactive instruction before the sound signal to be detected is determined, the interactive instruction being used to instruct the object to be detected to emit a sound signal corresponding to preset text data;
a recognition module, configured to perform speech recognition on the sound signal to be detected before the sound source position corresponding to the sound signal to be detected is determined, to obtain text data corresponding to the sound signal to be detected;
a comparison module, configured to compare the text data corresponding to the sound signal to be detected with the preset text data for consistency;
a second execution module, configured to execute the step of determining the sound source position corresponding to the sound signal if the comparison result indicates that the text data corresponding to the sound signal to be detected is consistent with the preset text data.
In some embodiments, the apparatus further includes (not shown in the figure):
a third execution module, configured to return to the step of outputting the interactive instruction if the comparison result indicates that the text data corresponding to the sound signal to be detected is inconsistent with the preset text data.
In some embodiments, the output module is configured to:
call a preset random number generation algorithm to generate a random digit sequence;
generate the preset text data based on the random digit sequence, and generate the interactive instruction based on the preset text data.
In some embodiments, the first determination module 111 is configured to:
obtain sound signals collected by a plurality of microphones;
synthesize the sound signals collected by each of the microphones to obtain the sound signal to be detected.
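A minimal sketch of this synthesis step, using sample-wise averaging of the channels. The disclosure does not fix the combination method; delay-and-sum beamforming would be another common choice:

```python
def synthesize_microphones(channels):
    """Combine per-microphone sample streams into one signal by
    sample-wise averaging. Averaging is one simple 'synthesis' choice
    assumed here; the patent does not specify the method."""
    n = min(len(ch) for ch in channels)  # align to the shortest channel
    return [sum(ch[i] for ch in channels) / len(channels) for i in range(n)]

mics = [[0.0, 1.0, 0.0],   # microphone 1 samples
        [0.0, 0.8, 0.2]]   # microphone 2 samples
print(synthesize_microphones(mics))  # [0.0, 0.9, 0.1]
```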
In some embodiments, the second determination module 112 is configured to:
decompose the sound signal to be detected to obtain a plurality of decomposed signals;
determine a sound source direction of each of the decomposed signals;
determine an intersection position of the plurality of sound source directions as the sound source position of the sound signal to be detected.
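For two microphones in a plane, the intersection of the two sound source directions can be computed in closed form. The 2-D, two-direction case below is a simplification for illustration; with more than two directions, a least-squares intersection would typically be used:

```python
def intersect_directions(p1, d1, p2, d2):
    """Intersect two 2-D lines, each given by a microphone position p
    and a direction d toward the source, by solving
    p1 + t1*d1 = p2 + t2*d2 for t1 via Cramer's rule.
    Returns the intersection point, or None if the directions are
    parallel (no unique intersection)."""
    det = d1[0] * (-d2[1]) - (-d2[0]) * d1[1]
    if abs(det) < 1e-12:
        return None
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (rx * (-d2[1]) - (-d2[0]) * ry) / det
    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])

# Two mics at (-1, 0) and (1, 0), both hearing a source at (0, 1):
print(intersect_directions((-1.0, 0.0), (0.7071, 0.7071),
                           (1.0, 0.0), (-0.7071, 0.7071)))
# -> approximately (0.0, 1.0)
```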
Figure 12 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. The electronic device 1200 shown in Figure 12 includes: at least one processor 1201, a memory 1202, at least one network interface 1204, and a user interface 1203. The components of the electronic device 1200 are coupled together through a bus system 1205. It can be understood that the bus system 1205 is used to implement connection and communication between these components. In addition to a data bus, the bus system 1205 also includes a power bus, a control bus, and a status signal bus. For clarity, however, the various buses are all labeled as the bus system 1205 in Figure 12.
The user interface 1203 may include a display, a keyboard, or a pointing device (for example, a mouse, a trackball, a touch pad, or a touch screen).
It can be understood that the memory 1202 in the embodiments of the present disclosure may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (Synchlink DRAM, SLDRAM), and direct Rambus random access memory (Direct Rambus RAM, DRRAM). The memory 1202 described herein is intended to include, but is not limited to, these and any other suitable types of memory.
In some embodiments, the memory 1202 stores the following elements, executable units, or data structures, or a subset or extended set thereof: an operating system 12021 and application programs 12022.
The operating system 12021 contains various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks. The application programs 12022 contain various applications, such as a media player and a browser, for implementing various application services. A program implementing the method of an embodiment of the present disclosure may be included in the application programs 12022.
In an embodiment of the present disclosure, by calling a program or instructions stored in the memory 1202 (in some embodiments, a program or instructions stored in the application programs 12022), the processor 1201 is used to execute the method steps provided by the method embodiments, including, for example:
determining a sound signal to be detected and an image to be detected corresponding to the sound signal to be detected, the image to be detected being an image containing a face of an object to be detected;
determining a sound source position corresponding to the sound signal to be detected, and determining a lip position of the object to be detected based on the image to be detected;
comparing the sound source position and the lip position for consistency, and determining a living body detection result of the object to be detected based on a comparison result.
The method disclosed in the above embodiments of the present disclosure may be applied to the processor 1201 or implemented by the processor 1201. The processor 1201 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 1201 or by instructions in the form of software. The above processor 1201 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present disclosure. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present disclosure may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software units in a decoding processor. The software unit may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1202, and the processor 1201 reads the information in the memory 1202 and completes the steps of the above method in combination with its hardware.
It can be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing unit may be implemented in one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described in this application, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by units that perform the functions described herein. The software code may be stored in a memory and executed by a processor. The memory may be implemented in the processor or external to the processor.
The electronic device provided by the present disclosure may be the electronic device shown in Figure 12, and can execute all the steps of the living body detection methods in Figures 3 to 4 and Figures 9 to 10, thereby achieving the technical effects of the living body detection methods shown in those figures. For details, refer to the related descriptions of Figures 3 to 4 and Figures 9 to 10; for brevity, they are not repeated here.
Embodiments of the present disclosure also provide a storage medium (a computer-readable storage medium). The storage medium stores one or more programs. The storage medium may include a volatile memory, such as a random access memory; the memory may also include a non-volatile memory, such as a read-only memory, a flash memory, a hard disk, or a solid-state drive; the memory may also include a combination of the above types of memory.
The one or more programs in the storage medium can be executed by one or more processors to implement the above living body detection method executed on the electronic device side.
The processor is used to execute a living body detection program stored in the memory to implement the following steps of the living body detection method executed on the electronic device side:
determining a sound signal to be detected and an image to be detected corresponding to the sound signal to be detected, the image to be detected being an image containing a face of an object to be detected;
determining a sound source position corresponding to the sound signal to be detected, and determining a lip position of the object to be detected based on the image to be detected;
comparing the sound source position and the lip position for consistency, and determining a living body detection result of the object to be detected based on a comparison result.
Those skilled in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present disclosure.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in a random access memory (RAM), a memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The specific embodiments described above further describe the objectives, technical solutions, and beneficial effects of the present disclosure in detail. It should be understood that the above are only specific embodiments of the present disclosure and are not intended to limit the protection scope of the present disclosure. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall be included in the protection scope of the present disclosure.

Claims (12)

  1. A living body detection method, the method comprising:
    determining a sound signal to be detected and an image to be detected corresponding to the sound signal to be detected, the image to be detected being an image containing a face of an object to be detected;
    determining a sound source position corresponding to the sound signal to be detected, and determining a lip position of the object to be detected based on the image to be detected;
    comparing the sound source position and the lip position for consistency, and determining a living body detection result of the object to be detected based on a comparison result.
  2. The method according to claim 1, wherein comparing the sound source position and the lip position for consistency comprises:
    determining a reference spatial region based on the lip position;
    determining whether the sound source position is located within the reference spatial region;
    when the sound source position is located within the reference spatial region, obtaining a comparison result indicating that the sound source position and the lip position are consistent; and
    when the sound source position is not located within the reference spatial region, obtaining a comparison result indicating that the sound source position and the lip position are inconsistent.
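The consistency check in claim 2 reduces to a point-in-region test: build a reference region around the detected lip position and ask whether the estimated sound source position falls inside it. A minimal sketch, assuming an axis-aligned box as the reference region (the function name, coordinate convention, and the `margin` half-width are illustrative, not specified by the claims):

```python
def compare_positions(source_pos, lip_pos, margin=0.1):
    """Return True (consistent) when the sound source position lies
    within the reference spatial region, modeled here as an
    axis-aligned box of half-width `margin` centered on the lip
    position; otherwise return False (inconsistent)."""
    return all(abs(s - l) <= margin for s, l in zip(source_pos, lip_pos))
```

A replayed recording played from a loudspeaker beside the screen would place the source outside this region, which is the spoofing case the claim targets.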
  3. The method according to claim 1 or 2, wherein determining the living body detection result of the object to be detected according to the comparison result comprises:
    when the comparison result indicates that the sound source position and the lip position are inconsistent, determining that the living body detection result of the object to be detected is non-living; and
    when the comparison result indicates that the sound source position and the lip position are consistent, determining that the living body detection result of the object to be detected is living.
  4. The method according to claim 3, wherein when the comparison result indicates that the sound source position and the lip position are consistent, the method further comprises:
    inputting the image to be detected and the sound signal to be detected into a trained mouth shape recognition model to obtain an output result of the mouth shape recognition model; and
    when the output result indicates that the sound signal to be detected matches the mouth shape of the object to be detected in the image to be detected, performing the step of determining that the living body detection result of the object to be detected is living.
  5. The method according to any one of claims 1 to 4, wherein before determining the sound signal to be detected, the method further comprises:
    outputting an interactive instruction, the interactive instruction being used to instruct the object to be detected to emit a sound signal corresponding to preset text data; and
    before determining the sound source position corresponding to the sound signal to be detected, the method further comprises:
    performing speech recognition on the sound signal to be detected to obtain text data corresponding to the sound signal to be detected;
    comparing the text data corresponding to the sound signal to be detected with the preset text data for consistency; and
    when the comparison result indicates that the text data corresponding to the sound signal to be detected is consistent with the preset text data, performing the step of determining the sound source position corresponding to the sound signal.
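The text comparison in claim 5 can be sketched as a normalized string match between the speech recognition output and the preset challenge text. The normalization steps chosen here (Unicode NFKC, case folding, stripping non-alphanumeric characters) are assumptions for illustration; the claim only requires a consistency comparison:

```python
import unicodedata

def texts_consistent(recognized, preset):
    """Compare recognized text against the preset challenge text,
    tolerating case, spacing, and punctuation differences introduced
    by the speech recognizer."""
    def norm(s):
        s = unicodedata.normalize("NFKC", s).lower()
        return "".join(ch for ch in s if ch.isalnum())
    return norm(recognized) == norm(preset)
```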
  6. The method according to claim 5, further comprising:
    when the comparison result indicates that the text data corresponding to the sound signal to be detected is inconsistent with the preset text data, returning to the step of outputting the interactive instruction.
  7. The method according to claim 5 or 6, wherein generating the interactive instruction comprises:
    invoking a preset random number generation algorithm to generate a random array; and
    generating the preset text data based on the random array, and generating the interactive instruction according to the preset text data.
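The generation process in claim 7 can be sketched with a cryptographically strong random source, so that a pre-recorded reply cannot anticipate the challenge. The digit count and prompt wording below are assumptions, not part of the claims:

```python
import secrets

def generate_interaction(n_digits=4):
    """Generate a random digit array, derive the preset challenge text
    from it, and wrap the text in an interactive instruction for the
    subject to read aloud."""
    digits = [secrets.randbelow(10) for _ in range(n_digits)]
    preset_text = " ".join(str(d) for d in digits)
    instruction = f"Please read the following digits aloud: {preset_text}"
    return digits, preset_text, instruction
```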
  8. The method according to any one of claims 1 to 7, wherein determining the sound signal to be detected comprises:
    acquiring sound signals collected by a plurality of microphones; and
    synthesizing the sound signals collected by each of the microphones to obtain the sound signal to be detected.
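One simple reading of the synthesis in claim 8 is sample-wise averaging of time-aligned channels (a zero-delay delay-and-sum); a real system might instead apply per-channel delays or beamforming weights. A sketch with assumed names, on plain Python lists:

```python
def synthesize(channels):
    """Combine the per-microphone signals into a single sound signal
    to be detected by averaging corresponding samples across all
    channels (channels are assumed equal-length and time-aligned)."""
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]
```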
  9. The method according to any one of claims 1 to 8, wherein determining the sound source position corresponding to the sound signal to be detected comprises:
    decomposing the sound signal to be detected to obtain a plurality of decomposed signals;
    determining a sound source direction of each of the decomposed signals; and
    determining the intersection position of the plurality of sound source directions as the sound source position of the sound signal to be detected.
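The intersection step in claim 9 can be sketched in two dimensions: each decomposed signal yields a direction-of-arrival ray from a known microphone position, and the crossing point of the rays is taken as the source position. The function name and the (origin, azimuth) ray parameterization are illustrative assumptions:

```python
import math

def intersect_directions(p1, theta1, p2, theta2):
    """Intersect two sound source direction rays, each given as an
    origin point and an azimuth in radians, and return the crossing
    point as the estimated sound source position. Returns None when
    the directions are (near-)parallel and no intersection exists."""
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        return None
    # Solve p1 + t*d1 = p2 + s*d2 for t via Cramer's rule.
    t = ((p2[0] - p1[0]) * d2[1] - (p2[1] - p1[1]) * d2[0]) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])
```

With more than two directions, a practical system would take a least-squares point closest to all rays rather than a pairwise intersection.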
  10. A living body detection apparatus, the apparatus comprising:
    a first determination module, configured to determine a sound signal to be detected and an image to be detected corresponding to the sound signal to be detected, the image to be detected being an image containing the face of an object to be detected;
    a second determination module, configured to determine a sound source position corresponding to the sound signal to be detected, and to determine a lip position of the object to be detected based on the image to be detected; and
    a third determination module, configured to compare the sound source position and the lip position for consistency, and to determine a living body detection result of the object to be detected according to the comparison result.
  11. An electronic device, comprising a processor and a memory, the processor being configured to execute a living body detection program stored in the memory to implement the living body detection method according to any one of claims 1 to 9.
  12. A storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the living body detection method according to any one of claims 1 to 9.
PCT/CN2023/109776 2022-09-05 2023-07-28 Living body detection method and apparatus, electronic device, and storage medium WO2024051380A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211077263.3 2022-09-05
CN202211077263.3A CN115171227B (en) 2022-09-05 2022-09-05 Living body detection method, living body detection device, electronic apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2024051380A1 true WO2024051380A1 (en) 2024-03-14

Family

ID=83481566

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/109776 WO2024051380A1 (en) 2022-09-05 2023-07-28 Living body detection method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN115171227B (en)
WO (1) WO2024051380A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115171227B (en) * 2022-09-05 2022-12-27 深圳市北科瑞声科技股份有限公司 Living body detection method, living body detection device, electronic apparatus, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106911630A (en) * 2015-12-22 2017-06-30 上海仪电数字技术股份有限公司 Terminal and the authentication method and system of identity identifying method, terminal and authentication center
CN107767137A (en) * 2016-08-23 2018-03-06 中国移动通信有限公司研究院 A kind of information processing method, device and terminal
CN110210196A (en) * 2019-05-08 2019-09-06 北京地平线机器人技术研发有限公司 Identity identifying method and device
CN112560554A (en) * 2019-09-25 2021-03-26 北京中关村科金技术有限公司 Lip language-based living body detection method, device and storage medium
CN115171227A (en) * 2022-09-05 2022-10-11 深圳市北科瑞声科技股份有限公司 Living body detection method, living body detection device, electronic apparatus, and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4815661B2 (en) * 2000-08-24 2011-11-16 ソニー株式会社 Signal processing apparatus and signal processing method
KR20130101943A (en) * 2012-03-06 2013-09-16 삼성전자주식회사 Endpoints detection apparatus for sound source and method thereof
JP6030032B2 (en) * 2013-08-30 2016-11-24 本田技研工業株式会社 Sound processing apparatus, sound processing method, and sound processing program
US9996732B2 (en) * 2015-07-20 2018-06-12 International Business Machines Corporation Liveness detector for face verification
CN106709402A (en) * 2015-11-16 2017-05-24 优化科技(苏州)有限公司 Living person identity authentication method based on voice pattern and image features
CN107422305B (en) * 2017-06-06 2020-03-13 歌尔股份有限公司 Microphone array sound source positioning method and device
CN113138367B (en) * 2020-01-20 2024-07-26 中国科学院上海微系统与信息技术研究所 Target positioning method and device, electronic equipment and storage medium
CN113743160A (en) * 2020-05-29 2021-12-03 北京中关村科金技术有限公司 Method, apparatus and storage medium for biopsy
CN114252844A (en) * 2021-12-24 2022-03-29 中北大学 Passive positioning method for single sound source target

Also Published As

Publication number Publication date
CN115171227B (en) 2022-12-27
CN115171227A (en) 2022-10-11

Similar Documents

Publication Publication Date Title
US11093773B2 (en) Liveness detection method, apparatus and computer-readable storage medium
US20150294433A1 (en) Generating a screenshot
US10241990B2 (en) Gesture based annotations
WO2024051380A1 (en) Living body detection method and apparatus, electronic device, and storage medium
CN107831902B (en) Motion control method and device, storage medium and terminal
CN110785996B (en) Dynamic control of camera resources in a device with multiple displays
WO2020119032A1 (en) Biometric feature-based sound source tracking method, apparatus, device, and storage medium
WO2020051971A1 (en) Identity recognition method, apparatus, electronic device, and computer-readable storage medium
US10278001B2 (en) Multiple listener cloud render with enhanced instant replay
WO2021169616A1 (en) Method and apparatus for detecting face of non-living body, and computer device and storage medium
WO2022100690A1 (en) Animal face style image generation method and apparatus, model training method and apparatus, and device
JP2020520576A5 (en)
WO2021120190A1 (en) Data processing method and apparatus, electronic device, and storage medium
CN110427849B (en) Face pose determination method and device, storage medium and electronic equipment
WO2019052053A1 (en) Whiteboard information reading method and device, readable storage medium and electronic whiteboard
WO2023173686A1 (en) Detection method and apparatus, electronic device, and storage medium
WO2021103609A1 (en) Method and apparatus for driving interaction object, electronic device and storage medium
US10079028B2 (en) Sound enhancement through reverberation matching
JP4934158B2 (en) Video / audio processing apparatus, video / audio processing method, video / audio processing program
Dingli et al. Turning homes into low-cost ambient assisted living environments
CN109903054B (en) Operation confirmation method and device, electronic equipment and storage medium
CN115454287A (en) Virtual digital human interaction method, device, equipment and readable storage medium
US11120524B2 (en) Video conferencing system and video conferencing method
US10812898B2 (en) Sound collection apparatus, method of controlling sound collection apparatus, and non-transitory computer-readable storage medium
JP2023036273A (en) Information processing apparatus and information processing program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23862081

Country of ref document: EP

Kind code of ref document: A1