WO2022142521A1 - Liveness detection method and apparatus, device, and storage medium - Google Patents

Liveness detection method and apparatus, device, and storage medium

Info

Publication number
WO2022142521A1
WO2022142521A1 (PCT/CN2021/120422)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
information
video
frame sequence
lip language
Prior art date
Application number
PCT/CN2021/120422
Other languages
French (fr)
Chinese (zh)
Inventor
时旭
Original Assignee
北京旷视科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京旷视科技有限公司
Publication of WO2022142521A1 publication Critical patent/WO2022142521A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/40: Spoof detection, e.g. liveness detection
    • G06V 40/45: Detection of the body part being alive
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/441: Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • H04N 21/4415: Acquiring end-user identification using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning

Definitions

  • The present application relates to the technical field of multimedia information, and in particular to a liveness detection method, apparatus, device, and storage medium.
  • Liveness detection is a method for determining the real physiological characteristics of an object in some authentication scenarios.
  • In application scenarios where liveness is verified based on lip-language video, the user's current video data is generally captured in real time, and the video content is then checked for the audio-visual synchronization characteristics of a living body.
  • Audio-visual synchronization generally means that every frame being rendered by the player corresponds strictly to the segment of sound being played, with no deviation distinguishable by the human ear or eye.
  • At present, audio-visual synchronization detection usually uses a large number of labeled synchronized/unsynchronized videos as samples and trains a model with a neural network.
  • For an input video, the model outputs a synchronization score.
  • If the synchronization score is greater than a threshold, the audio and picture are judged synchronized; otherwise, they are judged unsynchronized.
  • The embodiments of the present application provide a liveness detection method, apparatus, device, and storage medium that significantly improve the accuracy of liveness detection.
  • An embodiment of the present application provides a liveness detection method, which may include: acquiring multimedia data to be detected; extracting audio data and video data from the multimedia data; performing speech recognition on the audio data to obtain speech information, and performing lip language recognition on the video data to obtain lip language information; and determining, according to the speech information and the lip language information, offset information between the audio data and the video data, and verifying, based on the offset information, whether the multimedia data comes from a living body.
  • Performing speech recognition on the audio data to obtain speech information may include: performing speech recognition on the audio data frame by frame to obtain audio element information of the audio data; and extracting the audio start frame sequence and audio end frame sequence of each element in the audio element information, the speech information including the audio element information, the audio start frame sequence, and the audio end frame sequence.
  • Performing lip language recognition on the video data to obtain lip language information may include: performing lip language recognition on the video data frame by frame to obtain lip language element information of the video data; and extracting the video start frame sequence and video end frame sequence of each element in the lip language element information, the lip language information including the lip language element information, the video start frame sequence, and the video end frame sequence.
  • Determining the offset information between the audio data and the video data according to the speech information and the lip language information may include: performing data standardization on the audio element information in the speech information and generating an audio element string of a target length from the standardized audio element information, and performing data standardization on the lip language element information in the lip language information and generating a lip language element string of the target length from the standardized lip language element information; and comparing the audio element string and the lip language element string with a target string respectively and, when both match the semantics of the target string, determining the offset information based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
  • Calculating the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence may include: for the audio element string and the lip language element string, calculating, for each element character, the start time difference between its audio start time and video start time and the end time difference between its audio end time and video end time, where the audio start time is determined from the audio start frame sequence, the audio end time from the audio end frame sequence, the video start time from the video start frame sequence, and the video end time from the video end frame sequence; calculating, for each element character, the time difference average of its start time difference and end time difference; and calculating the offset average of the time difference averages of all element characters, the offset information being this offset average.
  • The audio start time may be determined from the audio start frame sequence by: audio_start = (audio_fstart / audio_sampling_rate) * 1000, where audio_fstart is the audio start frame sequence corresponding to each element character in the audio element string and audio_sampling_rate is the audio sampling rate.
  • The audio end time may be determined from the audio end frame sequence by: audio_end = (audio_fend / audio_sampling_rate) * 1000, where audio_fend is the audio end frame sequence corresponding to each element character in the audio element string and audio_sampling_rate is the audio sampling rate.
  • The video start time may be determined from the video start frame sequence by: lip_start = (lip_fstart / fps) * 1000, where lip_fstart is the video start frame sequence corresponding to each element character in the lip language element string and fps is the preset video frame rate.
  • The video end time may be determined from the video end frame sequence by: lip_end = (lip_fend / fps) * 1000, where lip_fend is the video end frame sequence corresponding to each element character in the lip language element string and fps is the preset video frame rate.
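  • As an illustration only, the four frame-to-time conversions above can be expressed in the following Python sketch; the function names are ours rather than the patent's, and the concrete rates in the usage comment are placeholders.

    def audio_frame_to_ms(frame_index, audio_sampling_rate):
        # audio_start / audio_end: an audio frame (sample) index divided by the
        # sampling rate gives seconds; * 1000 converts to milliseconds.
        return (frame_index / audio_sampling_rate) * 1000

    def video_frame_to_ms(frame_index, fps):
        # lip_start / lip_end: a video frame index divided by the frame rate
        # gives seconds; * 1000 converts to milliseconds.
        return (frame_index / fps) * 1000

    # e.g. with audio_sampling_rate = 16000, audio frame 8000 -> 500.0 ms;
    # with fps = 25, video frame 13 -> 520.0 ms.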
  • Performing data standardization on the audio element information in the speech information and generating an audio element string of a target length, and performing data standardization on the lip language element information in the lip language information and generating a lip language element string of the target length, may include: converting the audio element information into the audio element string of the target length and converting the lip language element information into the lip language element string of the target length; recognizing the number of digits of the audio element string and of the lip language element string respectively; when the number of recognized digits is less than a first threshold, outputting a recognition error; when the number of recognized digits is greater than or equal to the first threshold and less than a second threshold, substituting a first preset value for the missing digits; and when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits with a matching algorithm based on the content of the audio element information and the lip language element information.
  • Verifying whether the multimedia data comes from a living body based on the offset information may include: judging whether the offset information is within a target offset range; if it is, determining that the multimedia data comes from a living body, and otherwise determining that it does not.
  • The target offset range may be obtained from statistics of actual test data and characterizes multimedia data recorded by a living body.
  • An embodiment of the present application provides a liveness detection apparatus, which may include: an acquisition module configured to acquire multimedia data to be detected; an extraction module configured to extract audio data and video data from the multimedia data; a recognition module configured to perform speech recognition on the audio data to obtain speech information and to perform lip language recognition on the video data to obtain lip language information; and a parsing module configured to determine, according to the speech information and the lip language information, offset information between the audio data and the video data, and to verify, based on the offset information, whether the multimedia data comes from a living body.
  • The recognition module may be configured to: perform speech recognition on the audio data frame by frame to obtain audio element information of the audio data; and extract the audio start frame sequence and audio end frame sequence of each element in the audio element information, the speech information including the audio element information, the audio start frame sequence, and the audio end frame sequence.
  • The recognition module may be configured to: perform lip language recognition on the video data frame by frame to obtain lip language element information of the video data; and extract the video start frame sequence and video end frame sequence of each element in the lip language element information, the lip language information including the lip language element information, the video start frame sequence, and the video end frame sequence.
  • The parsing module may be configured to: perform data standardization on the audio element information in the speech information and generate an audio element string of a target length from the standardized audio element information, and perform data standardization on the lip language element information in the lip language information and generate a lip language element string of the target length; and compare the audio element string and the lip language element string with a target string respectively and, when both match the semantics of the target string, determine the offset information based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
  • Calculating the offset information of the multimedia data may include: for the audio element string and the lip language element string, calculating, for each element character, the start time difference between its audio start time and video start time and the end time difference between its audio end time and video end time, where the audio start time is determined from the audio start frame sequence, the audio end time from the audio end frame sequence, the video start time from the video start frame sequence, and the video end time from the video end frame sequence; calculating, for each element character, the time difference average of its start time difference and end time difference; and calculating the offset average of the time difference averages of all element characters, the offset information being this offset average.
  • Performing data standardization on the audio element information in the speech information and generating an audio element string of a target length, and performing data standardization on the lip language element information in the lip language information and generating a lip language element string of the target length, may include: converting the audio element information into the audio element string of the target length and converting the lip language element information into the lip language element string of the target length; recognizing the number of digits of the audio element string and of the lip language element string respectively; when the number of recognized digits is less than the first threshold, outputting a recognition error; when the number of recognized digits is greater than or equal to the first threshold and less than the second threshold, substituting the first preset value for the missing digits; and when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits with a matching algorithm based on the content of the audio element information and the lip language element information.
  • The parsing module may be further configured to: determine whether the offset information is within the target offset range; if it is, determine that the multimedia data comes from a living body, and otherwise determine that it does not.
  • An embodiment of the present application provides an electronic device, which may include: a memory for storing a computer program; and a processor for executing the method of any one of the embodiments of the present application, so as to detect whether multimedia data comes from a living body.
  • An embodiment of the present application provides a non-transitory electronic-device-readable storage medium, which may include: a program that, when run by an electronic device, causes the electronic device to execute the method of any one of the embodiments of the present application.
  • An embodiment of the present application provides a computer program product, which may include a computer program that, when executed by a processor, implements the method of any one of the embodiments of the present application.
  • The liveness detection method, apparatus, device, and storage medium provided by the present application can extract the audio data and video data from multimedia data, perform speech recognition on the audio data and lip language recognition on the video data to obtain speech information and lip language information, analyze the speech information and the lip language information to obtain the offset information of the multimedia data, and then verify based on the offset information whether the multimedia data comes from a living body. A large number of sample annotations is therefore unnecessary, which saves detection cost, and the characteristics of the speech information and the lip language information are considered together, which improves the accuracy of liveness detection.
  • FIG. 1 is a schematic structural diagram of an electronic device according to an embodiment of the application.
  • FIG. 2 is a schematic diagram of a living body verification scene system according to an embodiment of the application.
  • FIG. 3 is a schematic flowchart of a method for detecting a living body according to an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a method for detecting a living body according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a living body detection device according to an embodiment of the present application.
  • This embodiment provides an electronic device 1, which may include: at least one processor 11 and a memory 12.
  • One processor is taken as an example.
  • The processor 11 and the memory 12 can be connected through the bus 10; the memory 12 stores instructions executable by the processor 11, and the instructions are executed by the processor 11 so that the electronic device 1 can execute all or part of the flow of the methods in the following embodiments, so as to detect the liveness information of the multimedia data.
  • the electronic device 1 may be a mobile phone, a notebook computer, a desktop computer, or a computing system composed of multiple computers.
  • FIG. 2 shows a liveness verification scenario system according to an embodiment of the present application, which may include: a server 20 and a client 30.
  • The server 20 may be implemented by the electronic device 1, and the server 20 may include: a speech recognition module 21 and a lip language recognition module 22.
  • The server 20 can generate random text information and display it on the client 30 for the user to read aloud; the client 30 can then record the multimedia data of the user reading it and upload the multimedia data to the server 20.
  • the server 20 may perform subsequent user authentication based on the multimedia data.
  • the above-mentioned method for subsequent user authentication based on multimedia data may also be performed on the client 30 .
  • the random text information can be a random number of a target length, for example, a four-digit random number, and a certain strategy can be used to avoid the continuous occurrence of the same number, so as to reduce the difficulty of identification.
  • In an attack video, the person in the multimedia data may only perform the mouth movements without making a sound, while someone off-camera reads the target numbers aloud.
  • In contrast, this embodiment comprehensively analyzes the multimedia data with the speech recognition module 21 and the lip language recognition module 22 to obtain the speech information and the lip language information, obtains the offset information of the multimedia data by analyzing the speech information and the lip language information, and verifies based on the offset information whether the multimedia data comes from a living body.
  • the living body detection solution in this embodiment can effectively prevent the above attack videos and improve the security of living body verification.
  • FIG. 3 is a method for detecting a living body according to an embodiment of the present application.
  • The method can be executed by the electronic device 1 shown in FIG. 1 and applied to the liveness verification scenario shown in FIG. 2 to accurately detect whether multimedia data comes from a living body, thereby improving the security of liveness verification.
  • Taking the server 20 as the executor, the method includes the following steps: Step 301: Acquire multimedia data to be detected.
  • The multimedia data can be real-time video data of the user to be verified; for example, it can be recorded while the user reads aloud random text content generated by the server 20.
  • The random text content can be a four-digit random number, and a certain strategy can be used to prevent the same digit from appearing consecutively, so as to reduce the recognition difficulty. Taking a random number as an example, the user reads the acquired four-digit random number aloud, completes the recording of the multimedia data, and uploads it to the server 20.
  • If the method is executed by the user terminal, the user terminal does not need to upload the multimedia data after acquiring it.
  • Step 302 Extract audio data and video data in the multimedia data.
  • The server 20 can extract the audio data from the video material uploaded by the user, specifying the audio sampling rate during extraction, and read the video frames at the video frame rate as the video data.
  • the audio data may include voice information
  • the video data may include image information of the user's lip language action.
  • the audio data in the multimedia data can be extracted according to a preset audio sampling rate.
  • The preset audio sampling rate can be specified by the server 20 and should accurately retain the relevant speech features of the original multimedia data for subsequent calculation.
  • the video data in the multimedia data can be read according to a preset video frame rate.
  • the preset video frame rate may be the frame rate at which the server 20 reads the video data, and the video frame rate needs to ensure that the read video data retains the video features in the original multimedia data for subsequent calculation.
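  • For illustration, the following sketch extracts audio at a preset sampling rate and reads frames at the video frame rate. The patent does not prescribe particular tools, so the use of ffmpeg and OpenCV here, like the 16 kHz rate, is our assumption.

    import subprocess
    import cv2  # OpenCV; an assumption, the patent names no specific library

    AUDIO_SAMPLING_RATE = 16000  # hypothetical preset audio sampling rate

    def extract_audio(video_path, wav_path):
        # Extract mono audio at the preset sampling rate using ffmpeg.
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
             "-ar", str(AUDIO_SAMPLING_RATE), wav_path],
            check=True,
        )

    def read_video_frames(video_path):
        # Read all frames and report the frame rate (fps) used later for timing.
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        frames = []
        ok, frame = cap.read()
        while ok:
            frames.append(frame)
            ok, frame = cap.read()
        cap.release()
        return frames, fps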
  • Step 303 Perform voice recognition on the audio data to obtain voice information, and perform lip language recognition on the video data to obtain lip language information.
  • Speech recognition may be performed frame by frame on the audio data of the user reading the four-digit random number aloud, so as to obtain the speech information.
  • lip language recognition can be performed frame by frame on the video data of the user reading a four-digit random number, that is, the lip language action of the user in the video image is recognized, and the lip language information is obtained.
  • Step 304 Obtain the offset information between the audio data and the video data by analyzing the voice information and the lip language information, and verify whether the multimedia data comes from a living body based on the offset information.
  • Specifically, the audio-visual synchronization characteristics of the speech information and the lip language information can be comprehensively analyzed to obtain the offset information of the multimedia data, and it can be verified based on the offset information whether the multimedia data comes from a living body.
  • The above liveness detection method extracts the audio data and video data from the multimedia data, performs speech recognition on the audio data and lip language recognition on the video data to obtain the speech information and the lip language information, analyzes them to obtain the offset information of the multimedia data, and then verifies based on the offset information whether the multimedia data comes from a living body.
  • In this way, a large number of sample annotations is unnecessary, which saves detection cost, and the characteristics of the speech information and the lip language information are considered together, which improves the accuracy of liveness detection, effectively defends against the attack videos described above, and improves the security of liveness verification.
  • FIG. 4 is a living body detection method according to an embodiment of the present application.
  • The method can be executed by the electronic device 1 shown in FIG. 1 and applied to the liveness verification scenario shown in FIG. 2 to accurately detect whether multimedia data comes from a living body, improving the security of liveness verification.
  • the method includes the following steps:
  • Step 401 Acquire multimedia data to be detected. For details, refer to the description of step 301 in the above embodiment.
  • Step 402 Extract audio data and video data in the multimedia data. For details, refer to the description of step 302 in the above embodiment.
  • Step 403 Perform speech recognition on the audio data frame by frame to obtain audio element information of the audio data.
  • speech recognition may be performed frame by frame on the audio data obtained in step 402 to obtain text information of the random number read aloud by the user.
  • the speech recognition process may be as follows:
  • Step 4c) Test the speech recognition model obtained in step 4b) with the test set to measure its performance.
  • Step 404 Extract the audio start frame sequence and audio end frame sequence of each element in the audio element information, and the voice information may include: audio element information, audio start frame sequence, and audio end frame sequence.
  • The audio element information obtained above may include at least the audio start frame sequence and audio end frame sequence of each element, for example those of each random digit, which are extracted from the audio element information.
  • Step 405 Perform lip language recognition on the video data frame by frame, and obtain lip language element information of the video data.
  • lip language recognition can be performed frame by frame on the video data obtained in step 402 to obtain lip language element information.
  • the process of lip language recognition may be as follows:
  • Step 6c) Test the lip language recognition model obtained in step 6b) with the test set to measure its performance.
  • Step 406 Extract the video start frame sequence and the video end frame sequence of each element in the lip language element information.
  • the lip language information may include: lip language element information, video start frame sequence and video end frame sequence.
  • The lip language element information obtained above may include at least the video start frame sequence and video end frame sequence of each element, for example those of each digit read by the user, which are extracted from the lip language element information.
  • the execution order of steps 403 to 404 and steps 405 to 406 is not limited.
  • Step 407 perform data standardization processing on the speech information, and generate an audio element string of target length based on the audio element information, perform data standardization processing on the lip language information, and generate a lip language element string of target length based on the lip language element information.
  • the multimedia data recorded by the user may exist in various formats, and the content may also be complicated.
  • To this end, the server 20 may first generate random text information, such as a four-digit random number, for the user to read aloud, and then record the multimedia data during the reading.
  • Therefore, the audio element information and the lip language element information need to be standardized into digit strings of a fixed length.
  • the target length here is the length of the random number generated by the server 20.
  • the random number generated by the server 20 is four digits, and the target length here is four digits.
  • Step 407 may specifically include: converting the audio element information into an audio element string of the target length, and converting the lip language element information into a lip language element string of the target length; recognizing the number of digits of the audio element string and of the lip language element string respectively; when the number of recognized digits is less than the first threshold, outputting a recognition error; when the number of recognized digits is greater than or equal to the first threshold and less than the second threshold, substituting the first preset value for the missing digits; and when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits with a matching algorithm based on the content of the audio element information and the lip language element information.
  • For example, error results with fewer than three digits can be filtered out, and missing digits can be replaced with -1. If the number of digits exceeds four, the accurately recognized digits are determined by the matching algorithm, and inaccurate recognitions are likewise replaced with -1.
  • Taking a four-digit random number as an example, the data standardization process may be as follows: first, the audio element information is converted into a four-digit audio element string and the lip language element information into a four-digit lip language element string, and the number of digits of each is determined. When the number of digits is less than three, it is judged a recognition error and the verification process is terminated. When the number of digits equals three, -1 is substituted for the missing digit. When the number of digits is exactly four, the recognition result is output directly. When the number of digits is greater than four, the matching algorithm extracts the exactly matching digits based on the content of the text information.
  • the missing digits are replaced by -1.
  • For example, if the content of the audio element information or lip language element information is the five-digit number (12345) and the four-digit random number generated by the server 20 is (1234), then (1234) can be extracted from (12345) by content and digit count as the standardized result.
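  • A minimal Python sketch of this standardization follows, assuming -1 as the first preset value and a greedy subsequence alignment standing in for the unspecified "matching algorithm" (the patent does not name one); it handles all results of three or more digits uniformly.

    def normalize_digits(recognized, target):
        # Standardize a recognized digit string to the target length (four
        # digits for a four-digit random number). Fewer than three digits is
        # a recognition error; unmatched positions are filled with -1.
        if len(recognized) < 3:
            raise ValueError("recognition error: fewer than three digits")
        result, i = [], 0
        for t in target:
            j = recognized.find(t, i)  # next occurrence of the target digit
            if j != -1:
                result.append(int(t))
                i = j + 1
            else:
                result.append(-1)     # missing digit -> first preset value
        return result

    # normalize_digits("12345", "1234") -> [1, 2, 3, 4]
    # normalize_digits("124", "1234")   -> [1, 2, -1, 4]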
  • Step 408: Compare the audio element string and the lip language element string with the target string respectively, and when both match the semantics of the target string, calculate the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
  • step 408 may include:
  • S81: For the audio element string and the lip language element string, calculate the start time difference between the audio start time and the video start time of each element character, and calculate the end time difference between the audio end time and the video end time of each element character.
  • An element character is a pronunciation element of the text content; if the four-digit random number is (1234), then 1, 2, 3, and 4 are its four element characters.
  • The audio start time is: audio_start = (audio_fstart / audio_sampling_rate) * 1000, where audio_fstart is the audio start frame sequence of each element character in the audio element string.
  • The audio end time is: audio_end = (audio_fend / audio_sampling_rate) * 1000, where audio_fend is the audio end frame sequence of each element character in the audio element string.
  • audio_sampling_rate is the audio sampling rate.
  • The video start time is: lip_start = (lip_fstart / fps) * 1000, where lip_fstart is the video start frame sequence of each element character in the lip language element string.
  • The video end time is: lip_end = (lip_fend / fps) * 1000, where lip_fend is the video end frame sequence of each element character in the lip language element string.
  • fps is the preset video frame rate.
  • abs() denotes the absolute value.
  • S82: Calculate the time difference average of the start time difference and the end time difference of each element character.
  • diff_time = (abs(lip_start - audio_start) + abs(lip_end - audio_end)) / 2.
  • diff_time is the offset for one element character (in ms); it returns the time interval between the lip timing and the audio timing, and its result represents the time difference average of that element character.
  • S83 Calculate the offset average value of the time difference average values of all elements, and the offset information is the offset average value.
  • the average value of the time difference of all elements can be averaged.
  • The following formula can be used to calculate the offset average: offset = (diff_time[0] + diff_time[1] + diff_time[2] + diff_time[3]) / 4.
  • diff_time[0] represents the time difference average of the first digit.
  • diff_time[1] represents the time difference average of the second digit.
  • diff_time[2] represents the time difference average of the third digit.
  • diff_time[3] represents the time difference average of the fourth digit.
  • The offset average of the multimedia data, calculated in this way from the audio element string and the lip language element string, is the offset information.
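  • The per-character and overall offset computation above can be sketched in Python as follows; the function names are ours, and the inputs are the per-character times in milliseconds obtained from the frame-to-time conversions already given.

    def diff_time_ms(audio_start, audio_end, lip_start, lip_end):
        # Average of the start and end time differences for one element
        # character; all four inputs are in milliseconds.
        return (abs(lip_start - audio_start) + abs(lip_end - audio_end)) / 2

    def offset_average(timings):
        # timings: one (audio_start, audio_end, lip_start, lip_end) tuple per
        # element character, e.g. four tuples for a four-digit random number.
        diff_time = [diff_time_ms(*t) for t in timings]
        return sum(diff_time) / len(diff_time)  # the offset information (ms)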
  • Step 409: Determine whether the offset information is within the target offset range. If yes, go to step 410; otherwise, go to step 411.
  • the target offset range can be obtained through actual test data statistics, which can characterize the characteristics of the multimedia data recorded by the living body.
  • Step 410: Output that the multimedia data comes from a living body.
  • If the offset information is within the target offset range, the offset of the multimedia data is small enough that it is multimedia data generated by the actual behavior of a living body, and it is output that the multimedia data comes from a living body.
  • Step 411: Output that the multimedia data does not come from a living body.
  • If the offset information is not within the target offset range, the current multimedia data may not be produced by the behavior of a living body, or may be maliciously synthesized attack data; it is then output that the multimedia data does not come from a living body, the verification fails in the liveness verification scenario shown in FIG. 2, and a warning can be issued.
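  • A short sketch of this decision, with a hypothetical target offset range standing in for the real one obtained from test-data statistics:

    TARGET_OFFSET_RANGE_MS = (0.0, 200.0)  # hypothetical; from test-data statistics

    def is_from_living_body(offset_ms):
        lo, hi = TARGET_OFFSET_RANGE_MS
        return lo <= offset_ms <= hi  # within range -> living body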
  • The above liveness detection method significantly improves the accuracy of liveness detection, reduces the missed detection rate, provides fault tolerance for videos in which the audio and picture are only slightly out of sync, and saves the cost of labeling a large number of unsynchronized videos.
  • FIG. 5 is a living body detection apparatus 500 according to an embodiment of the present application.
  • The apparatus can be applied to the electronic device 1 shown in FIG. 1 and to the liveness verification scenario shown in FIG. 2 to accurately detect whether multimedia data comes from a living body, improving the security of liveness verification.
  • The apparatus may include: an acquisition module 501, an extraction module 502, a recognition module 503, and a parsing module 504, which cooperate as follows:
  • the acquiring module 501 is configured to acquire multimedia data to be detected. For details, refer to the description of step 301 in the above embodiment.
  • the extraction module 502 is configured to extract audio data and video data in the multimedia data. For details, refer to the description of step 302 in the above embodiment.
  • the recognition module 503 is configured to perform speech recognition on audio data to obtain voice information, and perform lip language recognition on video data to obtain lip language information. For details, refer to the description of step 303 in the above embodiment.
  • the parsing module 504 is configured to parse and obtain offset information between the audio data and the video data according to the voice information and the lip language information, and verify whether the multimedia data comes from a living body based on the offset information. For details, refer to the description of step 304 in the above embodiment.
  • the recognition module 503 may be configured to: perform speech recognition on the audio data frame by frame, and obtain audio element information of the audio data.
  • the audio start frame sequence and audio end frame sequence of each element in the audio element information are extracted, and the voice information may include: audio element information, audio start frame sequence and audio end frame sequence.
  • The recognition module 503 may be configured to: perform lip language recognition on the video data frame by frame, and obtain lip language element information of the video data.
  • the video start frame sequence and the video end frame sequence of each element in the lip language element information are extracted, and the lip language information may include: lip language element information, video start frame sequence and video end frame sequence.
  • The parsing module 504 may be configured to: perform data standardization on the speech information and generate an audio element string of a target length from the audio element information, and perform data standardization on the lip language information and generate a lip language element string of the target length from the lip language element information.
  • The audio element string and the lip language element string can be compared with the target string respectively, and when both match the semantics of the target string, the offset information of the multimedia data is calculated based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
  • Calculating the offset information of the multimedia data may include: for the audio element string and the lip language element string, calculating the start time difference between the audio start time and the video start time of each element character, and the end time difference between the audio end time and the video end time of each element character; calculating the time difference average of the start time difference and the end time difference of each element character; and calculating the offset average of the time difference averages of all element characters, the offset information being the offset average.
  • Performing data standardization on the speech information to generate an audio element string of a target length from the audio element information, and on the lip language information to generate a lip language element string of the target length from the lip language element information, may include: converting the audio element information into an audio element string of the target length and converting the lip language element information into a lip language element string of the target length. The number of digits of the audio element string and of the lip language element string can be recognized respectively, and when the number of recognized digits is less than the first threshold, a recognition error can be output.
  • When the number of recognized digits is greater than or equal to the first threshold and less than the second threshold, the missing digits may be replaced by the first preset value.
  • When the number of recognized digits is greater than or equal to the second threshold, a matching algorithm can be used to extract the accurately matched digits.
  • The parsing module 504 may be further configured to: determine whether the offset information is within the target offset range; if it is, output that the multimedia data comes from a living body, and otherwise output that it does not.
  • The embodiments of the present application also provide a non-transitory electronic-device-readable storage medium, which may include: a program that, when run on an electronic device, causes the electronic device to execute all or part of the flow of the methods in the foregoing embodiments.
  • The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like.
  • the storage medium may also include a combination of the aforementioned kinds of memories.
  • The present application provides a liveness detection method, apparatus, device, and storage medium.
  • The method includes: acquiring multimedia data to be detected; extracting audio data and video data from the multimedia data; performing speech recognition on the audio data to obtain speech information, and performing lip language recognition on the video data to obtain lip language information; and determining, according to the speech information and the lip language information, offset information between the audio data and the video data, and verifying, based on the offset information, whether the multimedia data comes from a living body.
  • The present application significantly improves the accuracy of liveness detection, reduces the missed detection rate, provides fault tolerance for videos in which the audio and picture are only slightly out of sync, and saves the cost of labeling a large number of unsynchronized videos.
  • The liveness detection method, apparatus, device, and storage medium of the present application are reproducible and can be used in a variety of industrial applications.
  • For example, the liveness detection method, apparatus, device, and storage medium of the present application can be used in application scenarios of liveness verification based on lip-language video.

Abstract

A liveness detection method and apparatus, a device, and a storage medium. The method comprises: obtaining multimedia data to be detected; extracting audio data and video data from the multimedia data; performing speech recognition on the audio data to obtain speech information, and performing lip language recognition on the video data to obtain lip language information; and parsing the speech information and the lip language information to obtain offset information between the audio data and the video data, and verifying, on the basis of the offset information, whether the multimedia data is from a living body. The accuracy of liveness detection is significantly improved, the missed detection rate is reduced, and fault tolerance is provided for videos whose audio and picture are slightly out of sync, saving the cost of annotating a large number of unsynchronized videos.

Description

Liveness detection method, apparatus, device, and storage medium

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese Patent Application No. 2020115874691, filed with the China Patent Office on December 29, 2020 and entitled "Liveness Detection Method, Apparatus, Device, and Storage Medium", the entire contents of which are incorporated herein by reference.

TECHNICAL FIELD

The present application relates to the technical field of multimedia information, and in particular to a liveness detection method, apparatus, device, and storage medium.
BACKGROUND ART

Liveness detection is a method for determining the real physiological characteristics of a subject in certain identity verification scenarios. In application scenarios where liveness is verified based on lip-language video, the user's current video data is generally captured in real time, and the video content is then checked for the audio-visual synchronization characteristics of a living body.

Audio-visual synchronization generally means that every frame being rendered by the player corresponds strictly to the segment of sound being played, with no deviation distinguishable by the human ear or eye.

At present, audio-visual synchronization detection usually uses a large number of labeled synchronized/unsynchronized videos as samples and trains a model with a neural network. For an input video, the model outputs a synchronization score; if the score is greater than a threshold, the audio and picture are judged synchronized, otherwise unsynchronized.

However, this approach has the following drawbacks:

1) The ways in which video and audio can fall out of sync are complex, and it is difficult for a training set to cover such complex scenes.

2) The synchronization score output by the model is inaccurate, and misjudgments are often encountered in production environments.

3) The judgment logic of comparing a score against a threshold is too simple and has low fault tolerance.

Moreover, this approach increases the cost of labeling a large number of unsynchronized videos and lowers the accuracy of liveness detection. Methods and apparatuses that can significantly improve the accuracy of liveness detection are therefore highly desirable.
SUMMARY OF THE INVENTION

The embodiments of the present application provide a liveness detection method, apparatus, device, and storage medium that significantly improve the accuracy of liveness detection.

An embodiment of the present application provides a liveness detection method, which may include: acquiring multimedia data to be detected; extracting audio data and video data from the multimedia data; performing speech recognition on the audio data to obtain speech information, and performing lip language recognition on the video data to obtain lip language information; and determining, according to the speech information and the lip language information, offset information between the audio data and the video data, and verifying, based on the offset information, whether the multimedia data comes from a living body.

In some embodiments of the present application, performing speech recognition on the audio data to obtain speech information may include: performing speech recognition on the audio data frame by frame to obtain audio element information of the audio data; and extracting the audio start frame sequence and audio end frame sequence of each element in the audio element information, the speech information including the audio element information, the audio start frame sequence, and the audio end frame sequence.

In some embodiments of the present application, performing lip language recognition on the video data to obtain lip language information may include: performing lip language recognition on the video data frame by frame to obtain lip language element information of the video data; and extracting the video start frame sequence and video end frame sequence of each element in the lip language element information, the lip language information including the lip language element information, the video start frame sequence, and the video end frame sequence.

In some embodiments of the present application, determining the offset information between the audio data and the video data according to the speech information and the lip language information may include: performing data standardization on the audio element information in the speech information and generating an audio element string of a target length from the standardized audio element information, and performing data standardization on the lip language element information in the lip language information and generating a lip language element string of the target length from the standardized lip language element information; and comparing the audio element string and the lip language element string with a target string respectively and, when both match the semantics of the target string, determining the offset information based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.

In some embodiments of the present application, calculating the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence may include: for the audio element string and the lip language element string, calculating, for each element character, the start time difference between its audio start time and video start time and the end time difference between its audio end time and video end time, where the audio start time is determined from the audio start frame sequence, the audio end time from the audio end frame sequence, the video start time from the video start frame sequence, and the video end time from the video end frame sequence; calculating, for each element character, the time difference average of its start time difference and end time difference; and calculating the offset average of the time difference averages of all element characters, the offset information being this offset average.

In some embodiments of the present application, the audio start time may be determined from the audio start frame sequence by the formula audio_start = (audio_fstart / audio_sampling_rate) * 1000, where audio_fstart is the audio start frame sequence corresponding to each element character in the audio element string and audio_sampling_rate is the audio sampling rate.

In some embodiments of the present application, the audio end time may be determined from the audio end frame sequence by the formula audio_end = (audio_fend / audio_sampling_rate) * 1000, where audio_fend is the audio end frame sequence corresponding to each element character in the audio element string and audio_sampling_rate is the audio sampling rate.

In some embodiments of the present application, the video start time may be determined from the video start frame sequence by the formula lip_start = (lip_fstart / fps) * 1000, where lip_fstart is the video start frame sequence corresponding to each element character in the lip language element string and fps is the preset video frame rate.

In some embodiments of the present application, the video end time may be determined from the video end frame sequence by the formula lip_end = (lip_fend / fps) * 1000, where lip_fend is the video end frame sequence corresponding to each element character in the lip language element string and fps is the preset video frame rate.

In some embodiments of the present application, performing data standardization on the audio element information in the speech information and generating an audio element string of a target length, and performing data standardization on the lip language element information in the lip language information and generating a lip language element string of the target length, may include: converting the audio element information into the audio element string of the target length and converting the lip language element information into the lip language element string of the target length; recognizing the number of digits of the audio element string and of the lip language element string respectively; when the number of recognized digits is less than a first threshold, outputting a recognition error; when the number of recognized digits is greater than or equal to the first threshold and less than a second threshold, substituting a first preset value for the missing digits; and when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits with a matching algorithm based on the content of the audio element information and the lip language element information.

In some embodiments of the present application, verifying whether the multimedia data comes from a living body based on the offset information may include: judging whether the offset information is within a target offset range; if it is, determining that the multimedia data comes from a living body, and otherwise determining that it does not.

In some embodiments of the present application, the target offset range may be obtained from statistics of actual test data and characterizes multimedia data recorded by a living body.
本申请实施例提供了一种活体检测装置,可以包括:获取模块,被配置成用于获取待检测的多媒体数据;提取模块,被配置成用于提取所述多媒体数据中的音频数据和视频数据;识别模块,被配置成用于对所述音频数据进行语音识别,得到语音信息,以及对所述视频数据进行唇语识别,得到唇语信息;解析模块,被配置成用于根据所述语音信息和所述唇语信息,确定所述音频数据和所述视频数据之间的偏移信息,并基于所述偏移信息验证所述多媒体数据是否来自于活体。An embodiment of the present application provides a living body detection device, which may include: an acquisition module configured to acquire multimedia data to be detected; an extraction module configured to extract audio data and video data in the multimedia data Recognition module, is configured to be used to carry out speech recognition to described audio data, obtain speech information, and carry out lip language recognition to described video data, obtain lip language information; Parsing module, be configured to be used for according to described speech information and the lip language information, determine offset information between the audio data and the video data, and verify whether the multimedia data is from a living body based on the offset information.
In some embodiments of the present application, the recognition module may be configured to: perform speech recognition on the audio data frame by frame to obtain audio element information of the audio data; and extract an audio start frame sequence and an audio end frame sequence of each element in the audio element information. The voice information may include the audio element information, the audio start frame sequence, and the audio end frame sequence.
In some embodiments of the present application, the recognition module may be configured to: perform lip language recognition on the video data frame by frame to obtain lip language element information of the video data; and extract a video start frame sequence and a video end frame sequence of each element in the lip language element information. The lip language information may include the lip language element information, the video start frame sequence, and the video end frame sequence.
In some embodiments of the present application, the parsing module may be configured to: perform data standardization on the audio element information of the voice information and generate an audio element string of a target length based on the standardized audio element information; perform data standardization on the lip language element information in the lip language information and generate a lip language element string of the target length based on the standardized lip language element information; compare the audio element string and the lip language element string with a target string respectively; and, when both the audio element string and the lip language element string semantically match the target string, determine the offset information based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
In some embodiments of the present application, calculating the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence may include: for the audio element string and the lip language element string, calculating the start time difference between the audio start time and the video start time of each element character, and calculating the end time difference between the audio end time and the video end time of each element character, wherein the audio start time is determined based on the audio start frame sequence, the audio end time is determined based on the audio end frame sequence, the video start time is determined based on the video start frame sequence, and the video end time is determined based on the video end frame sequence; calculating a time difference average of the start time difference and the end time difference of each element character; and calculating an offset average of the time difference averages of all the element characters, where the offset information is the offset average.
In some embodiments of the present application, performing data standardization on the audio element information in the voice information and generating an audio element string of a target length based on the standardized audio element information, and performing data standardization on the lip language element information in the lip language information and generating a lip language element string of the target length based on the standardized lip language element information, may include: converting the audio element information into the audio element string of the target length, and converting the lip language element information into the lip language element string of the target length; identifying the number of digits of the audio element string and of the lip language element string respectively; when the number of recognized digits is less than a first threshold, outputting a recognition error; when the number of recognized digits is greater than or equal to the first threshold and less than a second threshold, replacing the missing digits with a first preset value; and when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits through a matching algorithm based on the content of the audio element information and the lip language element information.
In some embodiments of the present application, the parsing module may be further configured to: judge whether the offset information is within the target offset range; if the offset information is within the target offset range, determine that the multimedia data comes from a living body; otherwise, determine that the multimedia data does not come from a living body.
An embodiment of the present application provides an electronic device, which may include: a memory for storing a computer program; and a processor for executing the method of any one of the embodiments of the present application, so as to detect whether multimedia data comes from a living body.
An embodiment of the present application provides a non-transitory electronic-device-readable storage medium, which may include a program that, when run by an electronic device, causes the electronic device to perform the method of any one of the embodiments of the present application.
An embodiment of the present application provides a computer program product, which may include a computer program that, when executed by a processor, implements the method of any one of the embodiments of the present application.
The living body detection method, apparatus, device, and storage medium provided by the present application extract the audio data and the video data from the multimedia data, then perform speech recognition on the audio data and lip language recognition on the video data to obtain voice information and lip language information, parse the voice information and the lip language information to obtain the offset information of the multimedia data, and verify, based on the offset information, whether the multimedia data comes from a living body. In this way, no large-scale sample annotation is required, which saves detection cost, and the characteristics of both the voice information and the lip language information are considered, which improves the accuracy of living body detection.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present application more clearly, the accompanying drawings required by the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present application and should therefore not be regarded as limiting the scope; those of ordinary skill in the art may derive other related drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a living body verification scene system according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a living body detection method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a living body detection method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a living body detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings. In the description of the present application, the terms "first", "second", and the like are used only to distinguish descriptions and are not to be understood as indicating or implying relative importance. As shown in FIG. 1, this embodiment provides an electronic device 1, which may include at least one processor 11 and a memory 12; in FIG. 1, one processor is taken as an example. The processor 11 and the memory 12 may be connected through a bus 10. The memory 12 stores instructions executable by the processor 11, and the instructions are executed by the processor 11 so that the electronic device 1 can perform all or part of the flow of the methods in the embodiments below, so as to detect the living body information of multimedia data.
In an embodiment, the electronic device 1 may be a mobile phone, a notebook computer, a desktop computer, or a computing system composed of multiple computers.
Please refer to FIG. 2, which shows a living body verification scene system according to an embodiment of the present application; it may include a server 20 and a client 30. The server 20 may be implemented by the electronic device 1 and may include a speech recognition module 21 and a lip language recognition module 22. In an actual living body verification scenario, such as an access control system, when a user triggers identity verification, the server 20 may generate random text information and display it on the client 30 for the user to read aloud; the client 30 may then record the multimedia data of the user reading and upload it to the server 20. The server 20 may perform subsequent user identity verification based on the multimedia data.
In an embodiment, the above method of performing subsequent user identity verification based on the multimedia data may also be executed on the client 30.
The random text information may be a random number of a target length, for example a four-digit random number, and a certain strategy may be used to prevent the same digit from appearing consecutively, so as to reduce the recognition difficulty.
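As a minimal sketch of such a prompt generator (the concrete strategy is not fixed by the present application; the rejection rule below is only one illustrative choice), the four-digit random number could be produced as follows:

```python
import random

def generate_prompt(length: int = 4) -> str:
    """Generate a random digit prompt with no identical adjacent digits."""
    digits = [random.randint(0, 9)]
    while len(digits) < length:
        candidate = random.randint(0, 9)
        if candidate != digits[-1]:  # reject a digit equal to its predecessor
            digits.append(candidate)
    return "".join(str(d) for d in digits)
```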
However, in application scenarios of living body verification based on lip language video, the following types of malicious attack often appear in practice:
1. The person in the multimedia data only performs the mouth movements without making a sound, while someone outside the video reads the target digits aloud.
2. Audio is recorded in advance, and the prepared audio replaces the actual live audio of the video.
3. Video and audio are recorded in advance; after the target digits are identified, the four-digit audio and video are assembled.
In order to effectively prevent the security threats posed by such attack videos, this embodiment comprehensively analyzes the multimedia data based on the speech recognition module 21 and the lip language recognition module 22 to obtain voice information and lip language information, parses the voice information and the lip language information to obtain the offset information of the multimedia data, and then verifies, based on the offset information, whether the multimedia data comes from a living body.
The living body detection solution of this embodiment can effectively defend against the above attack videos and improve the security of living body verification.
Please refer to FIG. 3, which shows a living body detection method according to an embodiment of the present application. The method may be executed by the electronic device 1 shown in FIG. 1 and may be applied to the living body verification scenario shown in FIG. 2, so as to accurately detect whether multimedia data comes from a living body and improve the security of living body verification. Taking the server 20 executing the method as an example, the method includes the following steps. Step 301: acquire multimedia data to be detected.
In this step, the multimedia data may be real-time video material of the user to be verified. For example, random text content generated by the server 20 may be provided for the user to read aloud; the random text content may be a four-digit random number, with a certain strategy used to prevent the same digit from appearing consecutively so as to reduce the recognition difficulty. Taking random digits as an example, the user reads the acquired four-digit random number aloud, completes the multimedia data recording, and uploads it to the server 20.
In an embodiment, if the method is executed by the client, the client does not need to upload the multimedia data after acquiring it.
Step 302: extract the audio data and the video data from the multimedia data.
In this step, the server 20 may extract the audio data from the video material uploaded by the user; an audio sampling rate may be specified during extraction, and the video frame rate is read for the video data. The audio data may contain voice information, and the video data may contain image information of the user's lip movements.
In an embodiment, the audio data in the multimedia data may be extracted at a preset audio sampling rate. The preset audio sampling rate may be specified by the server 20 and should accurately preserve the relevant speech features of the original multimedia data for subsequent computation.
In an embodiment, the video data in the multimedia data may be read at a preset video frame rate.
The preset video frame rate may be the frame rate at which the server 20 reads the video data; it needs to ensure that the read video data preserves the video features of the original multimedia data for subsequent computation.
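As a minimal sketch of this extraction step (assuming the ffmpeg command-line tool and OpenCV are available; the 16 kHz mono format is an illustrative choice, not one mandated by the present application):

```python
import subprocess
import cv2

def extract_audio(video_path: str, wav_path: str, sampling_rate: int = 16000) -> None:
    """Extract the audio track at a specified sampling rate via the ffmpeg CLI."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
         "-ar", str(sampling_rate), wav_path],
        check=True,
    )

def read_frame_rate(video_path: str) -> float:
    """Read the frame rate (fps) of the recorded video."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS)
    capture.release()
    return fps
```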
Step 303: perform speech recognition on the audio data to obtain voice information, and perform lip language recognition on the video data to obtain lip language information.
In this step, speech recognition may be performed frame by frame, based on a neural network algorithm, on the audio data of the user reading the four-digit random number aloud, so as to obtain the voice information. Likewise, lip language recognition may be performed frame by frame, based on a neural network algorithm, on the video data of the user reading the four-digit random number aloud; that is, the user's lip movements in the video images are recognized to obtain the lip language information.
Step 304: parse the voice information and the lip language information to obtain the offset information between the audio data and the video data, and verify, based on the offset information, whether the multimedia data comes from a living body.
In this step, in order to effectively prevent the security threats posed by the above attack videos, the audio-visual synchronization characteristics of the voice information and the lip language information may be analyzed together to obtain the offset information of the multimedia data, and whether the multimedia data comes from a living body is then verified based on the offset information.
In the above living body detection method, the audio data and the video data are extracted from the multimedia data; speech recognition is performed on the audio data and lip language recognition on the video data to obtain the voice information and the lip language information; the offset information of the multimedia data is obtained by parsing the voice information and the lip language information; and whether the multimedia data comes from a living body is verified based on the offset information. In this way, no large-scale sample annotation is needed, which saves detection cost, and the characteristics of both the voice information and the lip language information are considered, which improves the accuracy of living body detection. The above attack videos can be effectively prevented, and the security of living body verification is improved.
Please refer to FIG. 4, which shows a living body detection method according to an embodiment of the present application. The method may be executed by the electronic device 1 shown in FIG. 1 and may be applied to the living body verification scenario shown in FIG. 2, so as to accurately detect whether multimedia data comes from a living body and improve the security of living body verification. The method includes the following steps.
Step 401: acquire multimedia data to be detected. For details, refer to the description of step 301 in the above embodiment.
Step 402: extract the audio data and the video data from the multimedia data. For details, refer to the description of step 302 in the above embodiment.
Step 403: perform speech recognition on the audio data frame by frame to obtain audio element information of the audio data.
In this step, speech recognition may be performed frame by frame, based on a neural network algorithm, on the audio data obtained in step 402, so as to obtain the text information of the random number read aloud by the user.
In an embodiment, the speech recognition process may be as follows:
4a): Collect a preset number of digit audio recordings (for example, audio of people reading the digits 0-9 aloud), annotate them, and split them into a training set, a validation set, and a test set.
4b): Train a neural network on the audio of the training set, while using the validation set to verify the intermediate results of the training process (adjusting the training parameters in real time); when the training accuracy and the validation accuracy reach a certain threshold, a speech recognition model is obtained.
4c): Test the speech recognition model obtained in step 4b) with the test set to measure the performance of the model.
4d): Input the audio data obtained in step 402 into the speech recognition model frame by frame; the model computes the audio element information of the audio data.
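The present application does not fix a network architecture or training framework. The following PyTorch-style sketch only illustrates the train-and-validate-until-threshold loop of steps 4b) and 4c); the model, data loaders, and threshold value are all hypothetical:

```python
import torch

def evaluate(model, loader) -> float:
    """Fraction of correctly classified samples in a loader."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for frames, labels in loader:
            correct += (model(frames).argmax(dim=1) == labels).sum().item()
            total += labels.numel()
    return correct / total

def train_until_threshold(model, train_loader, val_loader,
                          acc_threshold: float = 0.98, max_epochs: int = 100):
    """Train on labelled digit audio until train and validation accuracy reach the threshold."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        model.train()
        for frames, labels in train_loader:
            optimizer.zero_grad()
            loss_fn(model(frames), labels).backward()
            optimizer.step()
        if (evaluate(model, train_loader) >= acc_threshold
                and evaluate(model, val_loader) >= acc_threshold):
            break
    return model
```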
Step 404: extract the audio start frame sequence and the audio end frame sequence of each element in the audio element information; the voice information may include the audio element information, the audio start frame sequence, and the audio end frame sequence.
In this step, the audio element information obtained above may include at least the audio start frame sequence and the audio end frame sequence of each element, for example the audio start frame sequence and the audio end frame sequence of each random digit, which are extracted from the audio element information.
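One plausible way to obtain these per-element boundaries is to run-length-group the frame-by-frame recognition output. In the sketch below, the per-frame label list, the blank label -1 for non-speech frames, and the Element layout are all assumptions made for illustration:

```python
from dataclasses import dataclass

BLANK = -1  # assumed label for frames in which no digit is recognized

@dataclass
class Element:
    digit: int        # the recognized element, e.g. one digit 0-9
    start_frame: int  # start frame sequence of this element
    end_frame: int    # end frame sequence of this element

def group_elements(frame_labels: list[int]) -> list[Element]:
    """Collapse per-frame labels into elements with start/end frame sequences."""
    elements: list[Element] = []
    for i, label in enumerate(frame_labels):
        if label == BLANK:
            continue
        if elements and elements[-1].digit == label and elements[-1].end_frame == i - 1:
            elements[-1].end_frame = i               # extend the current run
        else:
            elements.append(Element(label, i, i))    # open a new run
    return elements
```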
Step 405: perform lip language recognition on the video data frame by frame to obtain lip language element information of the video data.
In this step, lip language recognition may be performed frame by frame, based on a neural network algorithm, on the video data obtained in step 402, so as to obtain the lip language element information.
In an embodiment, the lip language recognition process may be as follows:
6a): Collect a preset number of digit lip language videos, for example lip images captured while people read the digits 0-9 aloud, annotate their features, and split them into a training set, a validation set, and a test set.
6b): Train a neural network on the videos of the training set, while using the validation set to verify the intermediate results of the training process (adjusting the training parameters in real time); when the training accuracy and the validation accuracy reach a certain threshold, a lip language recognition model is obtained.
6c): Test the lip language recognition model obtained in step 6b) with the test set to measure the performance of the model.
6d): Input the video data obtained in step 402 into the lip language recognition model frame by frame to obtain the lip language element information of the video data computed by the model.
Step 406: extract the video start frame sequence and the video end frame sequence of each element in the lip language element information; the lip language information may include the lip language element information, the video start frame sequence, and the video end frame sequence.
In this step, the lip language element information may include at least the video start frame sequence and the video end frame sequence of each element, for example the video start frame sequence and the video end frame sequence of the user reading each digit aloud, which are extracted from the lip language element information. In an embodiment, the execution order of steps 403-404 and steps 405-406 is not limited.
Step 407: perform data standardization on the voice information and generate an audio element string of a target length based on the audio element information; perform data standardization on the lip language information and generate a lip language element string of the target length based on the lip language element information.
In this step, for the living body verification scenario shown in FIG. 2, the multimedia data recorded by the user may exist in various formats and its content may be complicated. To simplify data processing, before the multimedia data is collected, the server 20 may first generate random text information, for example a four-digit random number, for the user to read aloud, and the multimedia data of the reading is then recorded. In the subsequent data processing, the audio element information and the lip language element information need to be standardized into digit strings of a fixed length. The target length here is the length of the random number generated by the server 20; for example, if the random number generated by the server 20 has four digits, the target length is four. Since the user reads a four-digit random number aloud, the audio element information and the lip language element information need to be standardized into four digits. Four random digits are more conducive to the accuracy of the detection result. In an embodiment, step 407 may specifically include: converting the audio element information into an audio element string of the target length, and converting the lip language element information into a lip language element string of the target length; identifying the number of digits of the audio element string and of the lip language element string respectively; when the number of recognized digits is less than a first threshold, outputting a recognition error; when the number of recognized digits is greater than or equal to the first threshold and less than a second threshold, replacing the missing digits with a first preset value; and when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits through a matching algorithm based on the content of the audio element information and the lip language element information.
In an embodiment, taking a target length of four digits as an example, when data standardization is performed on the audio element information and the lip language element information, erroneous results of fewer than three digits may be filtered out, and -1 may substitute for missing digits. If the number of digits exceeds four, the accurately recognized digits are determined through a matching algorithm, and -1 likewise substitutes for inaccurately recognized digits.
In an embodiment, taking a four-digit random number as an example, the data standardization process may be as follows. First, the audio element information is converted into a four-digit audio element string and the lip language element information into a four-digit lip language element string, and the numbers of digits of the two strings are determined respectively. When the number of digits is less than three, it is judged as a recognition error and the verification process is terminated. When the number of digits equals three, -1 substitutes for the missing digit. When the number of digits is exactly four, the recognition result is output directly. When the number of digits is greater than four, the accurately matched digits are extracted through a matching algorithm based on the content of the text information; when fewer than four digits match accurately, -1 substitutes for the missing digits. For example, suppose the content of the audio element information or the lip language element information is the five-digit number (12345) while the four-digit random number generated by the server 20 is (1234); then the part of (12345) whose content and length are (1234) can be extracted as the result of the standardization.
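A sketch of these standardization rules, with the four-digit case and the first preset value -1 hard-coded, and with the "matching algorithm" reduced to an in-order scan against the target string (the application does not prescribe a particular matching algorithm):

```python
MISSING = "-1"  # first preset value standing in for an unrecognized digit

def align(recognized: str, target: str) -> list[str]:
    """Keep digits that match the target in order; mark the rest as missing."""
    result, j = [], 0
    for ch in target:
        k = recognized.find(ch, j)
        if k >= 0:
            result.append(ch)
            j = k + 1
        else:
            result.append(MISSING)
    return result

def normalize(recognized: str, target: str) -> list[str] | None:
    """Standardize a recognized digit sequence against the 4-digit target string.

    Returns four digit strings ('-1' marks a missing digit), or None when
    recognition is treated as an error (fewer than three digits).
    """
    if len(recognized) < 3:           # below the first threshold: recognition error
        return None
    if len(recognized) == len(target):
        return list(recognized)       # exactly four digits: output directly
    return align(recognized, target)  # three digits, or more than four
```

For example, normalize("12345", "1234") returns ['1', '2', '3', '4'], matching the (12345) to (1234) case described above.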
Step 408: compare the audio element string and the lip language element string with the target string respectively, and when both the audio element string and the lip language element string semantically match the target string, calculate the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
In an embodiment, when both the audio element string and the lip language element string semantically match the target string, step 408 may include:
S81: For the audio element string and the lip language element string, calculate the start time difference between the audio start time and the video start time of each element character, and calculate the end time difference between the audio end time and the video end time of each element character.
In this step, an element character is one pronunciation element of the text content; if the four-digit random number is (1234), then 1, 2, 3, and 4 are the four element characters. The audio element string and the lip language element string can be traversed, and the following formulas are used to compute, for each element character:
Audio start time: audio_start = (audio_fstart / audio_sampling_rate) * 1000.
Audio end time: audio_end = (audio_fend / audio_sampling_rate) * 1000.
Video start time: lip_start = (lip_fstart / fps) * 1000.
Video end time: lip_end = (lip_fend / fps) * 1000.
Then, for each element character, the following are computed:
Start time difference: abs(lip_start - audio_start).
End time difference: abs(lip_end - audio_end).
Here, audio_fstart is the audio start frame sequence of each element character in the audio element string, audio_fend is the audio end frame sequence of each element character in the audio element string, and audio_sampling_rate is the audio sampling rate; lip_fstart is the video start frame sequence of each element character in the lip language element string, lip_fend is the video end frame sequence of each element character in the lip language element string, and fps is the preset video frame rate. abs() takes the absolute value.
S82: Calculate the time difference average of the start time difference and the end time difference of each element character.
In this step, the time difference average may be calculated with the following formula:
diff_time = (abs(lip_start - audio_start) + abs(lip_end - audio_end)) / 2.
Here, diff_time denotes the offset formula (in ms); it returns the time interval between two time variables, that is, it calculates the time difference between two moments. Here, the result of diff_time represents the time difference average of each element character.
S83: Calculate the offset average of the time difference averages of all the element characters; the offset information is the offset average.
In this step, the time difference averages of all the element characters may be averaged. Taking a four-digit random number as an example, the offset average may be calculated with the following formula:
result = (diff_time[0] + diff_time[1] + diff_time[2] + diff_time[3]) / 4.
Here, result is the offset average; diff_time[0] denotes the time difference average of the first digit, diff_time[1] that of the second digit, diff_time[2] that of the third digit, and diff_time[3] that of the fourth digit.
Through the above steps S81 to S83, the offset average of the multimedia data, that is, the offset information, is calculated based on the audio element string and the lip language element string.
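The formulas of S81 to S83 translate directly into code. The sketch below assumes the Element structure from the earlier grouping sketch, with the audio and lip elements already paired by position:

```python
def offset_average(audio_elems, lip_elems, audio_sampling_rate: float, fps: float) -> float:
    """Offset average in ms over paired audio/lip elements (steps S81 to S83)."""
    diff_times = []
    for a, v in zip(audio_elems, lip_elems):
        audio_start = (a.start_frame / audio_sampling_rate) * 1000
        audio_end = (a.end_frame / audio_sampling_rate) * 1000
        lip_start = (v.start_frame / fps) * 1000
        lip_end = (v.end_frame / fps) * 1000
        start_diff = abs(lip_start - audio_start)       # start time difference
        end_diff = abs(lip_end - audio_end)             # end time difference
        diff_times.append((start_diff + end_diff) / 2)  # diff_time per element character
    return sum(diff_times) / len(diff_times)            # result, e.g. averaged over 4 digits
```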
Step 409: judge whether the offset information is within the target offset range. If so, proceed to step 410; otherwise, proceed to step 411.
In this step, the target offset range may be obtained statistically from actual test data; it characterizes the multimedia data recorded by a living body.
Step 410: output that the multimedia data comes from a living body.
In this step, if the offset information is within the target offset range, the offset of the multimedia data is small enough to correspond to multimedia data produced by the actual behavior of an ordinary living body, so it is output that the multimedia data comes from a living body.
Step 411: output that the multimedia data does not come from a living body.
In this step, if the offset information is not within the target offset range, the current multimedia data may not result from the behavior of a living body, or may be maliciously synthesized attack data; it is therefore output that the multimedia data does not come from a living body, and in the living body verification scenario shown in FIG. 2 this verification fails. A warning can be issued.
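Putting steps 409 to 411 together (the 300 ms bound below is purely illustrative; in practice the target offset range is a statistic obtained from real test data, as described above):

```python
TARGET_OFFSET_MS = 300.0  # illustrative bound, to be replaced by the measured range

def is_live(offset_ms: float, target_offset_ms: float = TARGET_OFFSET_MS) -> bool:
    """Steps 409-411: treat the data as live only when the offset is small enough."""
    return offset_ms <= target_offset_ms
```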
The above living body detection method significantly improves the accuracy of living body detection and reduces the missed detection rate. It provides a fault tolerance for videos in which a small amount of audio and picture is out of synchronization, and saves the original cost of annotating large numbers of audio-visually asynchronous videos.
Please refer to FIG. 5, which shows a living body detection apparatus 500 according to an embodiment of the present application. The apparatus may be applied to the electronic device 1 shown in FIG. 1 and to the living body verification scenario shown in FIG. 2, so as to accurately detect whether multimedia data comes from a living body and improve the security of living body verification. The apparatus may include an acquisition module 501, an extraction module 502, a recognition module 503, and a parsing module 504, which are related as follows:
The acquisition module 501 is configured to acquire multimedia data to be detected. For details, refer to the description of step 301 in the above embodiment.
The extraction module 502 is configured to extract the audio data and the video data from the multimedia data. For details, refer to the description of step 302 in the above embodiment.
The recognition module 503 is configured to perform speech recognition on the audio data to obtain voice information, and to perform lip language recognition on the video data to obtain lip language information. For details, refer to the description of step 303 in the above embodiment.
The parsing module 504 is configured to parse the voice information and the lip language information to obtain offset information between the audio data and the video data, and to verify, based on the offset information, whether the multimedia data comes from a living body. For details, refer to the description of step 304 in the above embodiment.
In an embodiment, the recognition module 503 may be configured to: perform speech recognition on the audio data frame by frame to obtain audio element information of the audio data; and extract the audio start frame sequence and the audio end frame sequence of each element in the audio element information. The voice information may include the audio element information, the audio start frame sequence, and the audio end frame sequence.
In an embodiment, the recognition module 503 may be configured to: perform lip language recognition on the video data frame by frame to obtain lip language element information of the video data; and extract the video start frame sequence and the video end frame sequence of each element in the lip language element information. The lip language information may include the lip language element information, the video start frame sequence, and the video end frame sequence.
In an embodiment, the parsing module 504 may be configured to: perform data standardization on the voice information and generate an audio element string of a target length based on the audio element information; perform data standardization on the lip language information and generate a lip language element string of the target length based on the lip language element information; compare the audio element string and the lip language element string with the target string respectively; and, when both the audio element string and the lip language element string semantically match the target string, calculate the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
In an embodiment, calculating the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence may include: for the audio element string and the lip language element string, calculating the start time difference between the audio start time and the video start time of each element character, and calculating the end time difference between the audio end time and the video end time of each element character; calculating the time difference average of the start time difference and the end time difference of each element character; and calculating the offset average of the time difference averages of all the element characters, where the offset information may be the offset average.
In an embodiment, performing data standardization on the voice information and generating an audio element string of a target length based on the audio element information, and performing data standardization on the lip language information and generating a lip language element string of the target length based on the lip language element information, may include: converting the audio element information into the audio element string of the target length, and converting the lip language element information into the lip language element string of the target length; identifying the number of digits of the audio element string and of the lip language element string respectively; when the number of recognized digits is less than a first threshold, outputting a recognition error; when the number of recognized digits is greater than or equal to the first threshold and less than a second threshold, replacing the missing digits with a first preset value; and when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits through a matching algorithm based on the content of the audio element information and the lip language element information.
In an embodiment, the parsing module 504 may be further configured to: judge whether the offset information is within the target offset range; if the offset information is within the target offset range, output that the multimedia data comes from a living body; otherwise, output that the multimedia data does not come from a living body.
For a detailed description of the above living body detection apparatus 500, refer to the descriptions of the relevant method steps in the above embodiments.
An embodiment of the present application further provides a non-transitory electronic-device-readable storage medium, which may include a program that, when run on an electronic device, causes the electronic device to perform all or part of the flow of the methods in the above embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like. The storage medium may also include a combination of the above kinds of memories.
Although the embodiments of the present application have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the present application, and all such modifications and variations fall within the scope defined by the appended claims.
Industrial Applicability
The present application provides a living body detection method, apparatus, device, and storage medium. The method includes: acquiring multimedia data to be detected; extracting audio data and video data from the multimedia data; performing speech recognition on the audio data to obtain voice information, and performing lip language recognition on the video data to obtain lip language information; parsing the voice information and the lip language information to obtain offset information between the audio data and the video data; and verifying, based on the offset information, whether the multimedia data comes from a living body. The present application significantly improves the accuracy of living body detection, reduces the missed detection rate, provides a fault tolerance for videos in which a small amount of audio and picture is out of synchronization, and saves the original cost of annotating large numbers of audio-visually asynchronous videos.
Furthermore, it can be understood that the living body detection method, apparatus, device, and storage medium of the present application are reproducible and can be used in a variety of industrial applications. For example, they can be used in application scenarios of living body verification based on lip language video.

Claims (20)

  1. A living body detection method, comprising:
    acquiring multimedia data to be detected;
    extracting audio data and video data from the multimedia data;
    performing speech recognition on the audio data to obtain voice information, and performing lip language recognition on the video data to obtain lip language information; and
    determining, according to the voice information and the lip language information, offset information between the audio data and the video data, and verifying, based on the offset information, whether the multimedia data comes from a living body.
  2. The method according to claim 1, wherein the performing speech recognition on the audio data to obtain voice information comprises:
    performing speech recognition on the audio data frame by frame to obtain audio element information of the audio data; and
    extracting an audio start frame sequence and an audio end frame sequence of each element in the audio element information, wherein the voice information comprises the audio element information, the audio start frame sequence, and the audio end frame sequence.
  3. The method according to claim 1 or 2, wherein the performing lip language recognition on the video data to obtain lip language information comprises:
    performing lip language recognition on the video data frame by frame to obtain lip language element information of the video data; and
    extracting a video start frame sequence and a video end frame sequence of each element in the lip language element information, wherein the lip language information comprises the lip language element information, the video start frame sequence, and the video end frame sequence.
  4. The method according to claim 3, wherein the determining, according to the voice information and the lip language information, offset information between the audio data and the video data comprises:
    performing data standardization on the audio element information in the voice information and generating an audio element string of a target length based on the standardized audio element information, and performing data standardization on the lip language element information in the lip language information and generating a lip language element string of the target length based on the standardized lip language element information; and
    matching the audio element string and the lip language element string with a target string respectively, and, when both the audio element string and the lip language element string semantically match the target string, determining the offset information based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
  5. The method according to claim 4, wherein calculating the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence comprises:
    for the audio element string and the lip language element string, calculating a start time difference between an audio start time and a video start time of each element character, and calculating an end time difference between an audio end time and a video end time of each element character, wherein the audio start time is determined based on the audio start frame sequence, the audio end time is determined based on the audio end frame sequence, the video start time is determined based on the video start frame sequence, and the video end time is determined based on the video end frame sequence;
    calculating a time difference average of the start time difference and the end time difference of each element character; and
    calculating an offset average of the time difference averages of all the element characters, wherein the offset information is the offset average.
  6. The method according to claim 5, wherein the audio start time is determined based on the audio start frame sequence by the formula audio_start = (audio_fstart / audio_sampling_rate) * 1000, where audio_fstart is the audio start frame sequence corresponding to each element character in the audio element string, and audio_sampling_rate is the audio sampling rate.
  7. The method according to claim 5, wherein the audio end time is determined based on the audio end frame sequence by the formula audio_end = (audio_fend / audio_sampling_rate) * 1000, where audio_fend is the audio end frame sequence corresponding to each element character in the audio element string, and audio_sampling_rate is the audio sampling rate.
  8. The method according to claim 5, wherein the video start time is determined based on the video start frame sequence by the formula lip_start = (lip_fstart / fps) * 1000, where lip_fstart is the video start frame sequence corresponding to each element character in the lip language element string, and fps is the preset video frame rate.
  9. The method according to claim 5, wherein the video end time is determined based on the video end frame sequence by the formula lip_end = (lip_fend / fps) * 1000, where lip_fend is the video end frame sequence corresponding to each element character in the lip language element string, and fps is the preset video frame rate.
  10. The method according to claim 4, wherein
    the performing data standardization processing on the audio element information in the voice information and generating an audio element string of a target length based on the data-standardized audio element information, and performing data standardization processing on the lip language element information in the lip language information and generating a lip language element string of the target length based on the data-standardized lip language element information, comprises:
    converting the audio element information into the audio element string of the target length, and converting the lip language element information into the lip language element string of the target length;
    respectively identifying the number of digits of the audio element string and of the lip language element string, and outputting a recognition error when the number of recognized digits is less than a first threshold;
    when the number of recognized digits is greater than or equal to the first threshold and less than a second threshold, substituting a first preset value for the digits whose recognition is missing;
    when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits through a matching algorithm based on the content of the audio element information and the lip language element information.
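A hedged sketch of the three branches in claim 10. The threshold values, the pad value, the error sentinel, and the truncation stand-in for the matching algorithm are all assumptions made for illustration; the claim fixes none of them:

```python
RECOGNITION_ERROR = None  # assumed sentinel for the "recognition error" output

def normalize_digit_string(digits, first_threshold, second_threshold,
                           pad_value="0"):
    """digits: list of recognized element characters (e.g. spoken digits)."""
    n = len(digits)
    if n < first_threshold:
        return RECOGNITION_ERROR  # too few digits recognized: recognition error
    if n < second_threshold:
        # Replace the missing positions with a first preset value.
        return digits + [pad_value] * (second_threshold - n)
    # Enough digits: keep the accurately matched positions. A real system
    # would apply the matching algorithm named in the claim; plain
    # truncation is only a placeholder here.
    return digits[:second_threshold]
```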
  11. The method according to any one of claims 1 to 10, wherein the verifying, based on the offset information, whether the multimedia data comes from a living body comprises:
    determining whether the offset information is within a target offset range;
    if the offset information is within the target offset range, determining that the multimedia data comes from a living body; otherwise, determining that the multimedia data does not come from a living body.
  12. The method according to claim 11, wherein the target offset range is obtained statistically from actual test data.
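The final check of claims 11 and 12 reduces to a range test. In the sketch below, the ±200 ms bounds are a made-up placeholder; per claim 12, the real range would be fitted statistically to actual test data:

```python
def is_from_living_body(offset_ms, target_range=(-200.0, 200.0)):
    """Returns True when the audio/video offset falls inside the target
    offset range, i.e. the multimedia data is judged to come from a
    live subject."""
    low, high = target_range
    return low <= offset_ms <= high
```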
  13. A liveness detection apparatus, comprising:
    an acquisition module configured to acquire multimedia data to be detected;
    an extraction module configured to extract audio data and video data from the multimedia data;
    a recognition module configured to perform speech recognition on the audio data to obtain voice information, and to perform lip language recognition on the video data to obtain lip language information;
    a parsing module configured to determine offset information between the audio data and the video data according to the voice information and the lip language information, and to verify, based on the offset information, whether the multimedia data comes from a living body.
  14. The liveness detection apparatus according to claim 13, wherein the recognition module is configured to:
    perform speech recognition on the audio data frame by frame to obtain audio element information of the audio data;
    extract the audio start frame sequence and the audio end frame sequence of each element in the audio element information, the voice information comprising the audio element information, the audio start frame sequence, and the audio end frame sequence.
  15. The liveness detection apparatus according to claim 13 or 14, wherein the recognition module is configured to:
    perform lip language recognition on the video data frame by frame to obtain lip language element information of the video data;
    extract the video start frame sequence and the video end frame sequence of each element in the lip language element information, the lip language information comprising the lip language element information, the video start frame sequence, and the video end frame sequence.
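Claims 14 and 15 both turn frame-by-frame recognition output into per-element start and end frame sequences. A minimal sketch of that grouping step, assuming one label per frame with None marking frames where nothing is recognized; the label format is an assumption for illustration:

```python
def element_frame_spans(frame_labels):
    """frame_labels: e.g. [None, '1', '1', '1', None, '5', '5', ...]
    Returns a list of (element, start_frame, end_frame) tuples.
    Note: consecutive identical labels with no gap merge into one span."""
    spans = []
    current, start = None, None
    for i, label in enumerate(frame_labels):
        if label != current:
            if current is not None:
                spans.append((current, start, i - 1))  # close previous element
            current, start = label, i
    if current is not None:
        spans.append((current, start, len(frame_labels) - 1))
    return spans
```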
  16. The liveness detection apparatus according to claim 15, wherein the parsing module is configured to:
    perform data standardization processing on the audio element information in the voice information and generate an audio element string of a target length based on the data-standardized audio element information, and perform data standardization processing on the lip language element information in the lip language information and generate a lip language element string of the target length based on the data-standardized lip language element information;
    match the audio element string and the lip language element string respectively against a target string, and, when both the audio element string and the lip language element string semantically match the target string, determine the offset information based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
  17. The liveness detection apparatus according to claim 16, wherein the calculating of the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence comprises:
    for the audio element string and the lip language element string, calculating, for each element character, a start time difference between the audio start time and the video start time, and calculating, for each element character, an end time difference between the audio end time and the video end time, wherein the audio start time is determined based on the audio start frame sequence, the audio end time is determined based on the audio end frame sequence, the video start time is determined based on the video start frame sequence, and the video end time is determined based on the video end frame sequence;
    calculating, for each of the element characters, a time-difference average of the start time difference and the end time difference;
    calculating an offset average of the time-difference averages of all the element characters, the offset information being the offset average.
  18. An electronic device, comprising:
    a memory for storing a computer program;
    a processor for executing the method according to any one of claims 1 to 12, so as to detect whether multimedia data comes from a living body.
  19. A non-transitory electronic-device-readable storage medium, comprising a program which, when run by an electronic device, causes the electronic device to execute the method according to any one of claims 1 to 12.
  20. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 12.
PCT/CN2021/120422 2020-12-29 2021-09-24 Liveness detection method and apparatus, device, and storage medium WO2022142521A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011587469.1A CN112733636A (en) 2020-12-29 2020-12-29 Living body detection method, living body detection device, living body detection apparatus, and storage medium
CN202011587469.1 2020-12-29

Publications (1)

Publication Number Publication Date
WO2022142521A1

Family

ID=75607094

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120422 WO2022142521A1 (en) 2020-12-29 2021-09-24 Liveness detection method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN112733636A (en)
WO (1) WO2022142521A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733636A (en) * 2020-12-29 2021-04-30 北京旷视科技有限公司 Living body detection method, living body detection device, living body detection apparatus, and storage medium
CN113810680A (en) * 2021-09-16 2021-12-17 深圳市欢太科技有限公司 Audio synchronization detection method and device, computer readable medium and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834900B (en) * 2015-04-15 2017-12-19 常州飞寻视讯信息科技有限公司 A kind of method and system combined audio-visual signal and carry out In vivo detection
CN109409204B (en) * 2018-09-07 2021-08-06 北京市商汤科技开发有限公司 Anti-counterfeiting detection method and device, electronic equipment and storage medium
CN110585702B (en) * 2019-09-17 2023-09-19 腾讯科技(深圳)有限公司 Sound and picture synchronous data processing method, device, equipment and medium
CN110704683A (en) * 2019-09-27 2020-01-17 深圳市商汤科技有限公司 Audio and video information processing method and device, electronic equipment and storage medium
CN111881726B (en) * 2020-06-15 2022-11-25 马上消费金融股份有限公司 Living body detection method and device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376250A (en) * 2014-12-03 2015-02-25 优化科技(苏州)有限公司 Real person living body identity verification method based on sound-type image feature
CN105426723A (en) * 2015-11-20 2016-03-23 北京得意音通技术有限责任公司 Voiceprint identification, face identification and synchronous in-vivo detection-based identity authentication method and system
CN108038443A (en) * 2017-12-08 2018-05-15 深圳泰首智能技术有限公司 Witness the method and apparatus of service testing result
CN108124488A (en) * 2017-12-12 2018-06-05 福建联迪商用设备有限公司 A kind of payment authentication method and terminal based on face and vocal print
CN112733636A (en) * 2020-12-29 2021-04-30 北京旷视科技有限公司 Living body detection method, living body detection device, living body detection apparatus, and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115209175A (en) * 2022-07-18 2022-10-18 忆月启函(盐城)科技有限公司 Voice transmission method and system
CN115209175B (en) * 2022-07-18 2023-10-24 深圳蓝色鲨鱼科技有限公司 Voice transmission method and system

Also Published As

Publication number Publication date
CN112733636A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
WO2022142521A1 (en) Liveness detection method and apparatus, device, and storage medium
RU2738325C2 (en) Method and device for authenticating an individual
CN106601243B (en) Video file identification method and device
WO2020244153A1 (en) Conference voice data processing method and apparatus, computer device and storage medium
CN110378228A (en) Video data handling procedure, device, computer equipment and storage medium are examined in face
WO2019196205A1 (en) Foreign language teaching evaluation information generating method and apparatus
WO2020019591A1 (en) Method and device used for generating information
US20070220265A1 (en) Searching for a scaling factor for watermark detection
US9626575B2 (en) Visual liveness detection
CN109118420B (en) Watermark identification model establishing and identifying method, device, medium and electronic equipment
CN113242361B (en) Video processing method and device and computer readable storage medium
CN110853646A (en) Method, device and equipment for distinguishing conference speaking roles and readable storage medium
CN112380922B (en) Method, device, computer equipment and storage medium for determining multiple video frames
CN110598008B (en) Method and device for detecting quality of recorded data and storage medium
CN114760442B (en) Monitoring system for online education management
CN113409771B (en) Detection method for forged audio frequency, detection system and storage medium thereof
CN112351047B (en) Double-engine based voiceprint identity authentication method, device, equipment and storage medium
CN112151038B (en) Voice replay attack detection method and device, readable storage medium and electronic equipment
CN113627387A (en) Parallel identity authentication method, device, equipment and medium based on face recognition
KR20200042979A (en) Method and System for Non-Identification of Personal Information in Imaging Device
CN116883900A (en) Video authenticity identification method and system based on multidimensional biological characteristics
CN115424634A (en) Audio and video stream data processing method and device, electronic equipment and storage medium
CN115331703A (en) Song voice detection method and device
CN108734144A (en) A kind of speaker's identity identifying method based on recognition of face
CN114140850A (en) Face recognition method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913266

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM1205A DATED 12/10/2023)