WO2022142521A1 - Liveness detection method and apparatus, device, and storage medium - Google Patents

Liveness detection method and apparatus, device, and storage medium

Info

Publication number
WO2022142521A1
WO2022142521A1 (PCT/CN2021/120422)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
information
video
frame sequence
lip language
Prior art date
Application number
PCT/CN2021/120422
Other languages
French (fr)
Chinese (zh)
Inventor
时旭
Original Assignee
北京旷视科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京旷视科技有限公司
Publication of WO2022142521A1 publication Critical patent/WO2022142521A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/40: Spoof detection, e.g. liveness detection
    • G06V 40/45: Detection of the body part being alive
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/441: Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • H04N 21/4415: Acquiring end-user identification using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning

Definitions

  • The present application relates to the technical field of multimedia information, and in particular to a liveness detection method, apparatus, device, and storage medium.
  • Liveness detection is a method for determining the real physiological characteristics of an object in some authentication scenarios.
  • In application scenarios where liveness is verified based on lip-language video, the user's current video data is generally captured in real time, and the video content is then checked for the audio-visual synchronization characteristics of a living body.
  • Audio-visual synchronization generally means that every frame being rendered by the player corresponds strictly to the segment of sound being played, with no deviation distinguishable by the human ear or eye.
  • At present, audio-visual synchronization detection usually uses a large number of labeled synchronized/unsynchronized videos as samples and trains a model with a neural network.
  • For an input video, the model outputs a synchronization score.
  • If the synchronization score is greater than a threshold, the audio and picture are judged synchronized; otherwise, they are judged unsynchronized.
  • The embodiments of the present application provide a liveness detection method, apparatus, device, and storage medium that significantly improve the accuracy of liveness detection.
  • An embodiment of the present application provides a liveness detection method, which may include: acquiring multimedia data to be detected; extracting audio data and video data from the multimedia data; performing speech recognition on the audio data to obtain speech information, and performing lip language recognition on the video data to obtain lip language information; and determining, according to the speech information and the lip language information, offset information between the audio data and the video data, and verifying, based on the offset information, whether the multimedia data comes from a living body.
  • Performing speech recognition on the audio data to obtain speech information may include: performing speech recognition on the audio data frame by frame to obtain audio element information of the audio data; and extracting the audio start frame sequence and audio end frame sequence of each element in the audio element information, the speech information including the audio element information, the audio start frame sequence, and the audio end frame sequence.
  • Performing lip language recognition on the video data to obtain lip language information may include: performing lip language recognition on the video data frame by frame to obtain lip language element information of the video data; and extracting the video start frame sequence and video end frame sequence of each element in the lip language element information, the lip language information including the lip language element information, the video start frame sequence, and the video end frame sequence.
  • Determining the offset information between the audio data and the video data according to the speech information and the lip language information may include: performing data standardization on the audio element information in the speech information and generating an audio element string of a target length from the standardized audio element information, and performing data standardization on the lip language element information in the lip language information and generating a lip language element string of the target length from the standardized lip language element information; and comparing the audio element string and the lip language element string with a target string respectively and, when both match the semantics of the target string, determining the offset information based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
  • Calculating the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence may include: for the audio element string and the lip language element string, calculating, for each element character, the start time difference between its audio start time and video start time and the end time difference between its audio end time and video end time, where the audio start time is determined from the audio start frame sequence, the audio end time from the audio end frame sequence, the video start time from the video start frame sequence, and the video end time from the video end frame sequence; calculating, for each element character, the time difference average of its start time difference and end time difference; and calculating the offset average of the time difference averages of all element characters, the offset information being this offset average.
  • The audio start time may be determined from the audio start frame sequence by: audio_start = (audio_fstart / audio_sampling_rate) * 1000, where audio_fstart is the audio start frame sequence corresponding to each element character in the audio element string and audio_sampling_rate is the audio sampling rate.
  • The audio end time may be determined from the audio end frame sequence by: audio_end = (audio_fend / audio_sampling_rate) * 1000, where audio_fend is the audio end frame sequence corresponding to each element character in the audio element string and audio_sampling_rate is the audio sampling rate.
  • The video start time may be determined from the video start frame sequence by: lip_start = (lip_fstart / fps) * 1000, where lip_fstart is the video start frame sequence corresponding to each element character in the lip language element string and fps is the preset video frame rate.
  • The video end time may be determined from the video end frame sequence by: lip_end = (lip_fend / fps) * 1000, where lip_fend is the video end frame sequence corresponding to each element character in the lip language element string and fps is the preset video frame rate.
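  • As an illustration only, the four frame-to-time conversions above can be expressed in the following Python sketch; the function names are ours rather than the patent's, and the concrete rates in the usage comment are placeholders.

    def audio_frame_to_ms(frame_index, audio_sampling_rate):
        # audio_start / audio_end: an audio frame (sample) index divided by the
        # sampling rate gives seconds; * 1000 converts to milliseconds.
        return (frame_index / audio_sampling_rate) * 1000

    def video_frame_to_ms(frame_index, fps):
        # lip_start / lip_end: a video frame index divided by the frame rate
        # gives seconds; * 1000 converts to milliseconds.
        return (frame_index / fps) * 1000

    # e.g. with audio_sampling_rate = 16000, audio frame 8000 -> 500.0 ms;
    # with fps = 25, video frame 13 -> 520.0 ms.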
  • Performing data standardization on the audio element information in the speech information and generating an audio element string of a target length, and performing data standardization on the lip language element information in the lip language information and generating a lip language element string of the target length, may include: converting the audio element information into the audio element string of the target length and converting the lip language element information into the lip language element string of the target length; recognizing the number of digits of the audio element string and of the lip language element string respectively; when the number of recognized digits is less than a first threshold, outputting a recognition error; when the number of recognized digits is greater than or equal to the first threshold and less than a second threshold, substituting a first preset value for the missing digits; and when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits with a matching algorithm based on the content of the audio element information and the lip language element information.
  • Verifying whether the multimedia data comes from a living body based on the offset information may include: judging whether the offset information is within a target offset range; if it is, determining that the multimedia data comes from a living body, and otherwise determining that it does not.
  • The target offset range may be obtained from statistics of actual test data and characterizes multimedia data recorded by a living body.
  • An embodiment of the present application provides a liveness detection apparatus, which may include: an acquisition module configured to acquire multimedia data to be detected; an extraction module configured to extract audio data and video data from the multimedia data; a recognition module configured to perform speech recognition on the audio data to obtain speech information and to perform lip language recognition on the video data to obtain lip language information; and a parsing module configured to determine, according to the speech information and the lip language information, offset information between the audio data and the video data, and to verify, based on the offset information, whether the multimedia data comes from a living body.
  • The recognition module may be configured to: perform speech recognition on the audio data frame by frame to obtain audio element information of the audio data; and extract the audio start frame sequence and audio end frame sequence of each element in the audio element information, the speech information including the audio element information, the audio start frame sequence, and the audio end frame sequence.
  • The recognition module may be configured to: perform lip language recognition on the video data frame by frame to obtain lip language element information of the video data; and extract the video start frame sequence and video end frame sequence of each element in the lip language element information, the lip language information including the lip language element information, the video start frame sequence, and the video end frame sequence.
  • The parsing module may be configured to: perform data standardization on the audio element information in the speech information and generate an audio element string of a target length from the standardized audio element information, and perform data standardization on the lip language element information in the lip language information and generate a lip language element string of the target length; and compare the audio element string and the lip language element string with a target string respectively and, when both match the semantics of the target string, determine the offset information based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
  • Calculating the offset information of the multimedia data may include: for the audio element string and the lip language element string, calculating, for each element character, the start time difference between its audio start time and video start time and the end time difference between its audio end time and video end time, where the audio start time is determined from the audio start frame sequence, the audio end time from the audio end frame sequence, the video start time from the video start frame sequence, and the video end time from the video end frame sequence; calculating, for each element character, the time difference average of its start time difference and end time difference; and calculating the offset average of the time difference averages of all element characters, the offset information being this offset average.
  • Performing data standardization on the audio element information in the speech information and generating an audio element string of a target length, and performing data standardization on the lip language element information in the lip language information and generating a lip language element string of the target length, may include: converting the audio element information into the audio element string of the target length and converting the lip language element information into the lip language element string of the target length; recognizing the number of digits of the audio element string and of the lip language element string respectively; when the number of recognized digits is less than the first threshold, outputting a recognition error; when the number of recognized digits is greater than or equal to the first threshold and less than the second threshold, substituting the first preset value for the missing digits; and when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits with a matching algorithm based on the content of the audio element information and the lip language element information.
  • The parsing module may be further configured to: determine whether the offset information is within the target offset range; if it is, determine that the multimedia data comes from a living body, and otherwise determine that it does not.
  • An embodiment of the present application provides an electronic device, which may include: a memory for storing a computer program; and a processor for executing the method of any one of the embodiments of the present application, so as to detect whether multimedia data comes from a living body.
  • An embodiment of the present application provides a non-transitory electronic-device-readable storage medium, which may include: a program that, when run by an electronic device, causes the electronic device to execute the method of any one of the embodiments of the present application.
  • An embodiment of the present application provides a computer program product, which may include a computer program that, when executed by a processor, implements the method of any one of the embodiments of the present application.
  • The liveness detection method, apparatus, device, and storage medium provided by the present application can extract the audio data and video data from multimedia data, perform speech recognition on the audio data and lip language recognition on the video data to obtain speech information and lip language information, analyze the speech information and the lip language information to obtain the offset information of the multimedia data, and then verify based on the offset information whether the multimedia data comes from a living body. A large number of sample annotations is therefore unnecessary, which saves detection cost, and the characteristics of the speech information and the lip language information are considered together, which improves the accuracy of liveness detection.
  • FIG. 1 is a schematic structural diagram of an electronic device according to an embodiment of the application.
  • FIG. 2 is a schematic diagram of a living body verification scene system according to an embodiment of the application.
  • FIG. 3 is a schematic flowchart of a method for detecting a living body according to an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a method for detecting a living body according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a living body detection device according to an embodiment of the present application.
  • This embodiment provides an electronic device 1, which may include: at least one processor 11 and a memory 12.
  • One processor is taken as an example.
  • The processor 11 and the memory 12 can be connected through the bus 10; the memory 12 stores instructions executable by the processor 11, and the instructions are executed by the processor 11 so that the electronic device 1 can execute all or part of the flow of the methods in the following embodiments, so as to detect the liveness information of the multimedia data.
  • the electronic device 1 may be a mobile phone, a notebook computer, a desktop computer, or a computing system composed of multiple computers.
  • FIG. 2 shows a liveness verification scenario system according to an embodiment of the present application, which may include: a server 20 and a client 30.
  • The server 20 may be implemented by the electronic device 1, and the server 20 may include: a speech recognition module 21 and a lip language recognition module 22.
  • The server 20 can generate random text information and display it on the client 30 for the user to read aloud; the client 30 can then record the multimedia data of the user reading it and upload the multimedia data to the server 20.
  • the server 20 may perform subsequent user authentication based on the multimedia data.
  • the above-mentioned method for subsequent user authentication based on multimedia data may also be performed on the client 30 .
  • the random text information can be a random number of a target length, for example, a four-digit random number, and a certain strategy can be used to avoid the continuous occurrence of the same number, so as to reduce the difficulty of identification.
  • In an attack video, the person in the multimedia data may only perform the mouth movements without making a sound, while someone off-camera reads the target numbers aloud.
  • In contrast, this embodiment comprehensively analyzes the multimedia data with the speech recognition module 21 and the lip language recognition module 22 to obtain the speech information and the lip language information, obtains the offset information of the multimedia data by analyzing the speech information and the lip language information, and verifies based on the offset information whether the multimedia data comes from a living body.
  • the living body detection solution in this embodiment can effectively prevent the above attack videos and improve the security of living body verification.
  • FIG. 3 is a method for detecting a living body according to an embodiment of the present application.
  • The method can be executed by the electronic device 1 shown in FIG. 1 and applied to the liveness verification scenario shown in FIG. 2 to accurately detect whether multimedia data comes from a living body, thereby improving the security of liveness verification.
  • Taking the server 20 as the executor, the method includes the following steps: Step 301: Acquire multimedia data to be detected.
  • The multimedia data can be real-time video data of the user to be verified; for example, it can be recorded while the user reads aloud random text content generated by the server 20.
  • The random text content can be a four-digit random number, and a certain strategy can be used to prevent the same digit from appearing consecutively, so as to reduce the recognition difficulty. Taking a random number as an example, the user reads the acquired four-digit random number aloud, completes the recording of the multimedia data, and uploads it to the server 20.
  • If the method is executed by the user terminal, the user terminal does not need to upload the multimedia data after acquiring it.
  • Step 302 Extract audio data and video data in the multimedia data.
  • The server 20 can extract the audio data from the video material uploaded by the user, specifying the audio sampling rate during extraction, and read the video frames at the video frame rate as the video data.
  • the audio data may include voice information
  • the video data may include image information of the user's lip language action.
  • the audio data in the multimedia data can be extracted according to a preset audio sampling rate.
  • The preset audio sampling rate can be specified by the server 20 and should accurately retain the relevant speech features of the original multimedia data for subsequent calculation.
  • the video data in the multimedia data can be read according to a preset video frame rate.
  • the preset video frame rate may be the frame rate at which the server 20 reads the video data, and the video frame rate needs to ensure that the read video data retains the video features in the original multimedia data for subsequent calculation.
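  • For illustration, the following sketch extracts audio at a preset sampling rate and reads frames at the video frame rate. The patent does not prescribe particular tools, so the use of ffmpeg and OpenCV here, like the 16 kHz rate, is our assumption.

    import subprocess
    import cv2  # OpenCV; an assumption, the patent names no specific library

    AUDIO_SAMPLING_RATE = 16000  # hypothetical preset audio sampling rate

    def extract_audio(video_path, wav_path):
        # Extract mono audio at the preset sampling rate using ffmpeg.
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
             "-ar", str(AUDIO_SAMPLING_RATE), wav_path],
            check=True,
        )

    def read_video_frames(video_path):
        # Read all frames and report the frame rate (fps) used later for timing.
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        frames = []
        ok, frame = cap.read()
        while ok:
            frames.append(frame)
            ok, frame = cap.read()
        cap.release()
        return frames, fps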
  • Step 303 Perform voice recognition on the audio data to obtain voice information, and perform lip language recognition on the video data to obtain lip language information.
  • Speech recognition may be performed frame by frame on the audio data of the user reading the four-digit random number aloud, so as to obtain the speech information.
  • lip language recognition can be performed frame by frame on the video data of the user reading a four-digit random number, that is, the lip language action of the user in the video image is recognized, and the lip language information is obtained.
  • Step 304 Obtain the offset information between the audio data and the video data by analyzing the voice information and the lip language information, and verify whether the multimedia data comes from a living body based on the offset information.
  • Specifically, the audio-visual synchronization characteristics of the speech information and the lip language information can be comprehensively analyzed to obtain the offset information of the multimedia data, and it can be verified based on the offset information whether the multimedia data comes from a living body.
  • The above liveness detection method extracts the audio data and video data from the multimedia data, performs speech recognition on the audio data and lip language recognition on the video data to obtain the speech information and the lip language information, analyzes them to obtain the offset information of the multimedia data, and then verifies based on the offset information whether the multimedia data comes from a living body.
  • In this way, a large number of sample annotations is unnecessary, which saves detection cost, and the characteristics of the speech information and the lip language information are considered together, which improves the accuracy of liveness detection, effectively defends against the attack videos described above, and improves the security of liveness verification.
  • FIG. 4 is a living body detection method according to an embodiment of the present application.
  • The method can be executed by the electronic device 1 shown in FIG. 1 and applied to the liveness verification scenario shown in FIG. 2 to accurately detect whether multimedia data comes from a living body, improving the security of liveness verification.
  • the method includes the following steps:
  • Step 401 Acquire multimedia data to be detected. For details, refer to the description of step 301 in the above embodiment.
  • Step 402 Extract audio data and video data in the multimedia data. For details, refer to the description of step 302 in the above embodiment.
  • Step 403 Perform speech recognition on the audio data frame by frame to obtain audio element information of the audio data.
  • speech recognition may be performed frame by frame on the audio data obtained in step 402 to obtain text information of the random number read aloud by the user.
  • the speech recognition process may be as follows:
  • Step 4c) Test the speech recognition model obtained in step 4b) with the test set to measure its performance.
  • Step 404 Extract the audio start frame sequence and audio end frame sequence of each element in the audio element information, and the voice information may include: audio element information, audio start frame sequence, and audio end frame sequence.
  • The audio element information obtained above may include at least the audio start frame sequence and audio end frame sequence of each element, for example those of each random digit, which are extracted from the audio element information.
  • Step 405 Perform lip language recognition on the video data frame by frame, and obtain lip language element information of the video data.
  • lip language recognition can be performed frame by frame on the video data obtained in step 402 to obtain lip language element information.
  • the process of lip language recognition may be as follows:
  • Step 6c) Test the lip language recognition model obtained in step 6b) with the test set to measure its performance.
  • Step 406 Extract the video start frame sequence and the video end frame sequence of each element in the lip language element information.
  • the lip language information may include: lip language element information, video start frame sequence and video end frame sequence.
  • The lip language element information obtained above may include at least the video start frame sequence and video end frame sequence of each element, for example those of each digit read by the user, which are extracted from the lip language element information.
  • the execution order of steps 403 to 404 and steps 405 to 406 is not limited.
  • Step 407 perform data standardization processing on the speech information, and generate an audio element string of target length based on the audio element information, perform data standardization processing on the lip language information, and generate a lip language element string of target length based on the lip language element information.
  • the multimedia data recorded by the user may exist in various formats, and the content may also be complicated.
  • To this end, the server 20 may first generate random text information, such as a four-digit random number, for the user to read aloud, and then record the multimedia data during the reading.
  • Therefore, the audio element information and the lip language element information need to be standardized into digit strings of a fixed length.
  • the target length here is the length of the random number generated by the server 20.
  • the random number generated by the server 20 is four digits, and the target length here is four digits.
  • Step 407 may specifically include: converting the audio element information into an audio element string of the target length, and converting the lip language element information into a lip language element string of the target length; recognizing the number of digits of the audio element string and of the lip language element string respectively; when the number of recognized digits is less than the first threshold, outputting a recognition error; when the number of recognized digits is greater than or equal to the first threshold and less than the second threshold, substituting the first preset value for the missing digits; and when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits with a matching algorithm based on the content of the audio element information and the lip language element information.
  • For example, error results with fewer than three digits can be filtered out, and missing digits can be replaced with -1. If the number of digits exceeds four, the accurately recognized digits are determined by the matching algorithm, and inaccurate recognitions are likewise replaced with -1.
  • Taking a four-digit random number as an example, the data standardization process may be as follows: first, the audio element information is converted into a four-digit audio element string and the lip language element information into a four-digit lip language element string, and the number of digits of each is determined. When the number of digits is less than three, it is judged a recognition error and the verification process is terminated. When the number of digits equals three, -1 is substituted for the missing digit. When the number of digits is exactly four, the recognition result is output directly. When the number of digits is greater than four, the matching algorithm extracts the exactly matching digits based on the content of the text information.
  • the missing digits are replaced by -1.
  • For example, if the content of the audio element information or lip language element information is the five-digit number (12345) and the four-digit random number generated by the server 20 is (1234), then (1234) can be extracted from (12345) by content and digit count as the standardized result.
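  • A minimal Python sketch of this standardization follows, assuming -1 as the first preset value and a greedy subsequence alignment standing in for the unspecified "matching algorithm" (the patent does not name one); it handles all results of three or more digits uniformly.

    def normalize_digits(recognized, target):
        # Standardize a recognized digit string to the target length (four
        # digits for a four-digit random number). Fewer than three digits is
        # a recognition error; unmatched positions are filled with -1.
        if len(recognized) < 3:
            raise ValueError("recognition error: fewer than three digits")
        result, i = [], 0
        for t in target:
            j = recognized.find(t, i)  # next occurrence of the target digit
            if j != -1:
                result.append(int(t))
                i = j + 1
            else:
                result.append(-1)     # missing digit -> first preset value
        return result

    # normalize_digits("12345", "1234") -> [1, 2, 3, 4]
    # normalize_digits("124", "1234")   -> [1, 2, -1, 4]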
  • Step 408: Compare the audio element string and the lip language element string with the target string respectively, and when both match the semantics of the target string, calculate the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
  • step 408 may include:
  • S81: For the audio element string and the lip language element string, calculate the start time difference between the audio start time and the video start time of each element character, and calculate the end time difference between the audio end time and the video end time of each element character.
  • An element character is a pronunciation element of the text content; if the four-digit random number is (1234), then 1, 2, 3, and 4 are its four element characters.
  • The audio start time is: audio_start = (audio_fstart / audio_sampling_rate) * 1000, where audio_fstart is the audio start frame sequence of each element character in the audio element string.
  • The audio end time is: audio_end = (audio_fend / audio_sampling_rate) * 1000, where audio_fend is the audio end frame sequence of each element character in the audio element string.
  • audio_sampling_rate is the audio sampling rate.
  • The video start time is: lip_start = (lip_fstart / fps) * 1000, where lip_fstart is the video start frame sequence of each element character in the lip language element string.
  • The video end time is: lip_end = (lip_fend / fps) * 1000, where lip_fend is the video end frame sequence of each element character in the lip language element string.
  • fps is the preset video frame rate.
  • abs() denotes the absolute value.
  • S82: Calculate the time difference average of the start time difference and the end time difference of each element character.
  • diff_time = (abs(lip_start - audio_start) + abs(lip_end - audio_end)) / 2.
  • diff_time is the offset for one element character (in ms); it returns the time interval between the lip timing and the audio timing, and its result represents the time difference average of that element character.
  • S83 Calculate the offset average value of the time difference average values of all elements, and the offset information is the offset average value.
  • the average value of the time difference of all elements can be averaged.
  • The following formula can be used to calculate the offset average: offset = (diff_time[0] + diff_time[1] + diff_time[2] + diff_time[3]) / 4.
  • diff_time[0] represents the time difference average of the first digit.
  • diff_time[1] represents the time difference average of the second digit.
  • diff_time[2] represents the time difference average of the third digit.
  • diff_time[3] represents the time difference average of the fourth digit.
  • The offset average of the multimedia data, calculated in this way from the audio element string and the lip language element string, is the offset information.
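  • The per-character and overall offset computation above can be sketched in Python as follows; the function names are ours, and the inputs are the per-character times in milliseconds obtained from the frame-to-time conversions already given.

    def diff_time_ms(audio_start, audio_end, lip_start, lip_end):
        # Average of the start and end time differences for one element
        # character; all four inputs are in milliseconds.
        return (abs(lip_start - audio_start) + abs(lip_end - audio_end)) / 2

    def offset_average(timings):
        # timings: one (audio_start, audio_end, lip_start, lip_end) tuple per
        # element character, e.g. four tuples for a four-digit random number.
        diff_time = [diff_time_ms(*t) for t in timings]
        return sum(diff_time) / len(diff_time)  # the offset information (ms)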
  • Step 409: Determine whether the offset information is within the target offset range. If yes, go to step 410; otherwise, go to step 411.
  • the target offset range can be obtained through actual test data statistics, which can characterize the characteristics of the multimedia data recorded by the living body.
  • Step 410: Output that the multimedia data comes from a living body.
  • If the offset information is within the target offset range, the offset of the multimedia data is small enough that it is multimedia data generated by the actual behavior of a living body, and it is output that the multimedia data comes from a living body.
  • Step 411: Output that the multimedia data does not come from a living body.
  • If the offset information is not within the target offset range, the current multimedia data may not be produced by the behavior of a living body, or may be maliciously synthesized attack data; it is then output that the multimedia data does not come from a living body, the verification fails in the liveness verification scenario shown in FIG. 2, and a warning can be issued.
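  • A short sketch of this decision, with a hypothetical target offset range standing in for the real one obtained from test-data statistics:

    TARGET_OFFSET_RANGE_MS = (0.0, 200.0)  # hypothetical; from test-data statistics

    def is_from_living_body(offset_ms):
        lo, hi = TARGET_OFFSET_RANGE_MS
        return lo <= offset_ms <= hi  # within range -> living body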
  • The above liveness detection method significantly improves the accuracy of liveness detection, reduces the missed detection rate, provides fault tolerance for videos in which the audio and picture are only slightly out of sync, and saves the cost of labeling a large number of unsynchronized videos.
  • FIG. 5 is a living body detection apparatus 500 according to an embodiment of the present application.
  • The apparatus can be applied to the electronic device 1 shown in FIG. 1 and to the liveness verification scenario shown in FIG. 2 to accurately detect whether multimedia data comes from a living body, improving the security of liveness verification.
  • The apparatus may include: an acquisition module 501, an extraction module 502, a recognition module 503, and a parsing module 504, which cooperate as follows:
  • the acquiring module 501 is configured to acquire multimedia data to be detected. For details, refer to the description of step 301 in the above embodiment.
  • the extraction module 502 is configured to extract audio data and video data in the multimedia data. For details, refer to the description of step 302 in the above embodiment.
  • the recognition module 503 is configured to perform speech recognition on audio data to obtain voice information, and perform lip language recognition on video data to obtain lip language information. For details, refer to the description of step 303 in the above embodiment.
  • the parsing module 504 is configured to parse and obtain offset information between the audio data and the video data according to the voice information and the lip language information, and verify whether the multimedia data comes from a living body based on the offset information. For details, refer to the description of step 304 in the above embodiment.
  • the recognition module 503 may be configured to: perform speech recognition on the audio data frame by frame, and obtain audio element information of the audio data.
  • the audio start frame sequence and audio end frame sequence of each element in the audio element information are extracted, and the voice information may include: audio element information, audio start frame sequence and audio end frame sequence.
  • The recognition module 503 may be configured to: perform lip language recognition on the video data frame by frame, and obtain lip language element information of the video data.
  • the video start frame sequence and the video end frame sequence of each element in the lip language element information are extracted, and the lip language information may include: lip language element information, video start frame sequence and video end frame sequence.
  • The parsing module 504 may be configured to: perform data standardization on the speech information and generate an audio element string of a target length from the audio element information, and perform data standardization on the lip language information and generate a lip language element string of the target length from the lip language element information.
  • The audio element string and the lip language element string can be compared with the target string respectively, and when both match the semantics of the target string, the offset information of the multimedia data is calculated based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
  • Calculating the offset information of the multimedia data may include: for the audio element string and the lip language element string, calculating the start time difference between the audio start time and the video start time of each element character, and the end time difference between the audio end time and the video end time of each element character; calculating the time difference average of the start time difference and the end time difference of each element character; and calculating the offset average of the time difference averages of all element characters, the offset information being the offset average.
  • Performing data standardization on the speech information to generate an audio element string of a target length from the audio element information, and on the lip language information to generate a lip language element string of the target length from the lip language element information, may include: converting the audio element information into an audio element string of the target length and converting the lip language element information into a lip language element string of the target length. The number of digits of the audio element string and of the lip language element string can be recognized respectively, and when the number of recognized digits is less than the first threshold, a recognition error can be output.
  • When the number of recognized digits is greater than or equal to the first threshold and less than the second threshold, the missing digits may be replaced by the first preset value.
  • When the number of recognized digits is greater than or equal to the second threshold, a matching algorithm can be used to extract the accurately matched digits.
  • The parsing module 504 may be further configured to: determine whether the offset information is within the target offset range; if it is, output that the multimedia data comes from a living body, and otherwise output that it does not.
  • The embodiments of the present application also provide a non-transitory electronic-device-readable storage medium, which may include: a program that, when run on an electronic device, causes the electronic device to execute all or part of the flow of the methods in the foregoing embodiments.
  • The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like.
  • the storage medium may also include a combination of the aforementioned kinds of memories.
  • The present application provides a liveness detection method, apparatus, device, and storage medium.
  • The method includes: acquiring multimedia data to be detected; extracting audio data and video data from the multimedia data; performing speech recognition on the audio data to obtain speech information, and performing lip language recognition on the video data to obtain lip language information; and determining, according to the speech information and the lip language information, offset information between the audio data and the video data, and verifying, based on the offset information, whether the multimedia data comes from a living body.
  • The present application significantly improves the accuracy of liveness detection, reduces the missed detection rate, provides fault tolerance for videos in which the audio and picture are only slightly out of sync, and saves the cost of labeling a large number of unsynchronized videos.
  • The liveness detection method, apparatus, device, and storage medium of the present application are reproducible and can be used in a variety of industrial applications.
  • For example, the liveness detection method, apparatus, device, and storage medium of the present application can be used in application scenarios of liveness verification based on lip-language video.

Abstract

A liveness detection method and apparatus, a device, and a storage medium. The method comprises: obtaining multimedia data to be detected; extracting audio data and video data from the multimedia data; performing speech recognition on the audio data to obtain speech information, and performing lip language recognition on the video data to obtain lip language information; and parsing the speech information and the lip language information to obtain offset information between the audio data and the video data, and verifying, on the basis of the offset information, whether the multimedia data is from a living body. The accuracy of liveness detection is significantly improved, the missed detection rate is reduced, and fault tolerance is provided for videos whose audio and picture are slightly out of sync, saving the cost of annotating a large number of unsynchronized videos.

Description

Liveness detection method, apparatus, device, and storage medium

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese Patent Application No. 2020115874691, filed with the China Patent Office on December 29, 2020 and entitled "Liveness Detection Method, Apparatus, Device, and Storage Medium", the entire contents of which are incorporated herein by reference.

TECHNICAL FIELD

The present application relates to the technical field of multimedia information, and in particular to a liveness detection method, apparatus, device, and storage medium.
BACKGROUND ART

Liveness detection is a method for determining the real physiological characteristics of a subject in certain identity verification scenarios. In application scenarios where liveness is verified based on lip-language video, the user's current video data is generally captured in real time, and the video content is then checked for the audio-visual synchronization characteristics of a living body.

Audio-visual synchronization generally means that every frame being rendered by the player corresponds strictly to the segment of sound being played, with no deviation distinguishable by the human ear or eye.

At present, audio-visual synchronization detection usually uses a large number of labeled synchronized/unsynchronized videos as samples and trains a model with a neural network. For an input video, the model outputs a synchronization score; if the score is greater than a threshold, the audio and picture are judged synchronized, otherwise unsynchronized.

However, this approach has the following drawbacks:

1) The ways in which video and audio can fall out of sync are complex, and it is difficult for a training set to cover such complex scenes.

2) The synchronization score output by the model is inaccurate, and misjudgments are often encountered in production environments.

3) The judgment logic of comparing a score against a threshold is too simple and has low fault tolerance.

Moreover, this approach increases the cost of labeling a large number of unsynchronized videos and lowers the accuracy of liveness detection. Methods and apparatuses that can significantly improve the accuracy of liveness detection are therefore highly desirable.
SUMMARY OF THE INVENTION

The embodiments of the present application provide a liveness detection method, apparatus, device, and storage medium that significantly improve the accuracy of liveness detection.

An embodiment of the present application provides a liveness detection method, which may include: acquiring multimedia data to be detected; extracting audio data and video data from the multimedia data; performing speech recognition on the audio data to obtain speech information, and performing lip language recognition on the video data to obtain lip language information; and determining, according to the speech information and the lip language information, offset information between the audio data and the video data, and verifying, based on the offset information, whether the multimedia data comes from a living body.

In some embodiments of the present application, performing speech recognition on the audio data to obtain speech information may include: performing speech recognition on the audio data frame by frame to obtain audio element information of the audio data; and extracting the audio start frame sequence and audio end frame sequence of each element in the audio element information, the speech information including the audio element information, the audio start frame sequence, and the audio end frame sequence.

In some embodiments of the present application, performing lip language recognition on the video data to obtain lip language information may include: performing lip language recognition on the video data frame by frame to obtain lip language element information of the video data; and extracting the video start frame sequence and video end frame sequence of each element in the lip language element information, the lip language information including the lip language element information, the video start frame sequence, and the video end frame sequence.

In some embodiments of the present application, determining the offset information between the audio data and the video data according to the speech information and the lip language information may include: performing data standardization on the audio element information in the speech information and generating an audio element string of a target length from the standardized audio element information, and performing data standardization on the lip language element information in the lip language information and generating a lip language element string of the target length from the standardized lip language element information; and comparing the audio element string and the lip language element string with a target string respectively and, when both match the semantics of the target string, determining the offset information based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.

In some embodiments of the present application, calculating the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence may include: for the audio element string and the lip language element string, calculating, for each element character, the start time difference between its audio start time and video start time and the end time difference between its audio end time and video end time, where the audio start time is determined from the audio start frame sequence, the audio end time from the audio end frame sequence, the video start time from the video start frame sequence, and the video end time from the video end frame sequence; calculating, for each element character, the time difference average of its start time difference and end time difference; and calculating the offset average of the time difference averages of all element characters, the offset information being this offset average.

In some embodiments of the present application, the audio start time may be determined from the audio start frame sequence by the formula audio_start = (audio_fstart / audio_sampling_rate) * 1000, where audio_fstart is the audio start frame sequence corresponding to each element character in the audio element string and audio_sampling_rate is the audio sampling rate.

In some embodiments of the present application, the audio end time may be determined from the audio end frame sequence by the formula audio_end = (audio_fend / audio_sampling_rate) * 1000, where audio_fend is the audio end frame sequence corresponding to each element character in the audio element string and audio_sampling_rate is the audio sampling rate.

In some embodiments of the present application, the video start time may be determined from the video start frame sequence by the formula lip_start = (lip_fstart / fps) * 1000, where lip_fstart is the video start frame sequence corresponding to each element character in the lip language element string and fps is the preset video frame rate.

In some embodiments of the present application, the video end time may be determined from the video end frame sequence by the formula lip_end = (lip_fend / fps) * 1000, where lip_fend is the video end frame sequence corresponding to each element character in the lip language element string and fps is the preset video frame rate.

In some embodiments of the present application, performing data standardization on the audio element information in the speech information and generating an audio element string of a target length, and performing data standardization on the lip language element information in the lip language information and generating a lip language element string of the target length, may include: converting the audio element information into the audio element string of the target length and converting the lip language element information into the lip language element string of the target length; recognizing the number of digits of the audio element string and of the lip language element string respectively; when the number of recognized digits is less than a first threshold, outputting a recognition error; when the number of recognized digits is greater than or equal to the first threshold and less than a second threshold, substituting a first preset value for the missing digits; and when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits with a matching algorithm based on the content of the audio element information and the lip language element information.

In some embodiments of the present application, verifying whether the multimedia data comes from a living body based on the offset information may include: judging whether the offset information is within a target offset range; if it is, determining that the multimedia data comes from a living body, and otherwise determining that it does not.

In some embodiments of the present application, the target offset range may be obtained from statistics of actual test data and characterizes multimedia data recorded by a living body.
本申请实施例提供了一种活体检测装置,可以包括:获取模块,被配置成用于获取待检测的多媒体数据;提取模块,被配置成用于提取所述多媒体数据中的音频数据和视频数据;识别模块,被配置成用于对所述音频数据进行语音识别,得到语音信息,以及对所述视频数据进行唇语识别,得到唇语信息;解析模块,被配置成用于根据所述语音信息和所述唇语信息,确定所述音频数据和所述视频数据之间的偏移信息,并基于所述偏移信息验证所述多媒体数据是否来自于活体。An embodiment of the present application provides a living body detection device, which may include: an acquisition module configured to acquire multimedia data to be detected; an extraction module configured to extract audio data and video data in the multimedia data Recognition module, is configured to be used to carry out speech recognition to described audio data, obtain speech information, and carry out lip language recognition to described video data, obtain lip language information; Parsing module, be configured to be used for according to described speech information and the lip language information, determine offset information between the audio data and the video data, and verify whether the multimedia data is from a living body based on the offset information.
In some embodiments of the present application, the recognition module may be configured to: perform speech recognition on the audio data frame by frame to obtain audio element information of the audio data; and extract an audio start frame sequence and an audio end frame sequence of each element in the audio element information. The voice information may include the audio element information, the audio start frame sequence, and the audio end frame sequence.
In some embodiments of the present application, the recognition module may be configured to: perform lip language recognition on the video data frame by frame to obtain lip language element information of the video data; and extract a video start frame sequence and a video end frame sequence of each element in the lip language element information. The lip language information may include the lip language element information, the video start frame sequence, and the video end frame sequence.
In some embodiments of the present application, the parsing module may be configured to: perform data standardization on the audio element information of the voice information and generate an audio element string of a target length based on the standardized audio element information; perform data standardization on the lip language element information in the lip language information and generate a lip language element string of the target length based on the standardized lip language element information; compare the audio element string and the lip language element string with a target string respectively; and, when both the audio element string and the lip language element string semantically match the target string, determine the offset information based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
In some embodiments of the present application, calculating the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence may include: for the audio element string and the lip language element string, calculating the start time difference between the audio start time and the video start time of each element character, and calculating the end time difference between the audio end time and the video end time of each element character, wherein the audio start time is determined based on the audio start frame sequence, the audio end time is determined based on the audio end frame sequence, the video start time is determined based on the video start frame sequence, and the video end time is determined based on the video end frame sequence; calculating a time difference average of the start time difference and the end time difference of each element character; and calculating an offset average of the time difference averages of all the element characters, where the offset information is the offset average.
In some embodiments of the present application, performing data standardization on the audio element information in the voice information and generating an audio element string of a target length based on the standardized audio element information, and performing data standardization on the lip language element information in the lip language information and generating a lip language element string of the target length based on the standardized lip language element information, may include: converting the audio element information into the audio element string of the target length, and converting the lip language element information into the lip language element string of the target length; identifying the number of digits of the audio element string and of the lip language element string respectively; when the number of recognized digits is less than a first threshold, outputting a recognition error; when the number of recognized digits is greater than or equal to the first threshold and less than a second threshold, replacing the missing digits with a first preset value; and when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits through a matching algorithm based on the content of the audio element information and the lip language element information.
In some embodiments of the present application, the parsing module may be further configured to: judge whether the offset information is within the target offset range; if the offset information is within the target offset range, determine that the multimedia data comes from a living body; otherwise, determine that the multimedia data does not come from a living body.
An embodiment of the present application provides an electronic device, which may include: a memory for storing a computer program; and a processor for executing the method of any one of the embodiments of the present application, so as to detect whether multimedia data comes from a living body.
An embodiment of the present application provides a non-transitory electronic-device-readable storage medium, which may include a program that, when run by an electronic device, causes the electronic device to perform the method of any one of the embodiments of the present application.
An embodiment of the present application provides a computer program product, which may include a computer program that, when executed by a processor, implements the method of any one of the embodiments of the present application.
The living body detection method, apparatus, device, and storage medium provided by the present application extract the audio data and the video data from the multimedia data, then perform speech recognition on the audio data and lip language recognition on the video data to obtain voice information and lip language information, parse the voice information and the lip language information to obtain the offset information of the multimedia data, and verify, based on the offset information, whether the multimedia data comes from a living body. In this way, no large-scale sample annotation is required, which saves detection cost, and the characteristics of both the voice information and the lip language information are considered, which improves the accuracy of living body detection.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present application more clearly, the accompanying drawings required by the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present application and should therefore not be regarded as limiting the scope; those of ordinary skill in the art may derive other related drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a living body verification scene system according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a living body detection method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a living body detection method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a living body detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings. In the description of the present application, the terms "first", "second", and the like are used only to distinguish descriptions and are not to be understood as indicating or implying relative importance. As shown in FIG. 1, this embodiment provides an electronic device 1, which may include at least one processor 11 and a memory 12; in FIG. 1, one processor is taken as an example. The processor 11 and the memory 12 may be connected through a bus 10. The memory 12 stores instructions executable by the processor 11, and the instructions are executed by the processor 11 so that the electronic device 1 can perform all or part of the flow of the methods in the embodiments below, so as to detect the living body information of multimedia data.
In an embodiment, the electronic device 1 may be a mobile phone, a notebook computer, a desktop computer, or a computing system composed of multiple computers.
Please refer to FIG. 2, which shows a living body verification scene system according to an embodiment of the present application; it may include a server 20 and a client 30. The server 20 may be implemented by the electronic device 1 and may include a speech recognition module 21 and a lip language recognition module 22. In an actual living body verification scenario, such as an access control system, when a user triggers identity verification, the server 20 may generate random text information and display it on the client 30 for the user to read aloud; the client 30 may then record the multimedia data of the user reading and upload it to the server 20. The server 20 may perform subsequent user identity verification based on the multimedia data.
In an embodiment, the above method of performing subsequent user identity verification based on the multimedia data may also be executed on the client 30.
The random text information may be a random number of a target length, for example a four-digit random number, and a certain strategy may be used to prevent the same digit from appearing consecutively, so as to reduce the recognition difficulty.
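As a minimal sketch of such a prompt generator (the concrete strategy is not fixed by the present application; the rejection rule below is only one illustrative choice), the four-digit random number could be produced as follows:

```python
import random

def generate_prompt(length: int = 4) -> str:
    """Generate a random digit prompt with no identical adjacent digits."""
    digits = [random.randint(0, 9)]
    while len(digits) < length:
        candidate = random.randint(0, 9)
        if candidate != digits[-1]:  # reject a digit equal to its predecessor
            digits.append(candidate)
    return "".join(str(d) for d in digits)
```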
However, in application scenarios of living body verification based on lip language video, the following types of malicious attack often appear in practice:
1. The person in the multimedia data only performs the mouth movements without making a sound, while someone outside the video reads the target digits aloud.
2. Audio is recorded in advance, and the prepared audio replaces the actual live audio of the video.
3. Video and audio are recorded in advance; after the target digits are identified, the four-digit audio and video are assembled.
In order to effectively prevent the security threats posed by such attack videos, this embodiment comprehensively analyzes the multimedia data based on the speech recognition module 21 and the lip language recognition module 22 to obtain voice information and lip language information, parses the voice information and the lip language information to obtain the offset information of the multimedia data, and then verifies, based on the offset information, whether the multimedia data comes from a living body.
The living body detection solution of this embodiment can effectively defend against the above attack videos and improve the security of living body verification.
Please refer to FIG. 3, which shows a living body detection method according to an embodiment of the present application. The method may be executed by the electronic device 1 shown in FIG. 1 and may be applied to the living body verification scenario shown in FIG. 2, so as to accurately detect whether multimedia data comes from a living body and improve the security of living body verification. Taking the server 20 executing the method as an example, the method includes the following steps. Step 301: acquire multimedia data to be detected.
In this step, the multimedia data may be real-time video material of the user to be verified. For example, random text content generated by the server 20 may be provided for the user to read aloud; the random text content may be a four-digit random number, with a certain strategy used to prevent the same digit from appearing consecutively so as to reduce the recognition difficulty. Taking random digits as an example, the user reads the acquired four-digit random number aloud, completes the multimedia data recording, and uploads it to the server 20.
In an embodiment, if the method is executed by the client, the client does not need to upload the multimedia data after acquiring it.
Step 302: extract the audio data and the video data from the multimedia data.
In this step, the server 20 may extract the audio data from the video material uploaded by the user; an audio sampling rate may be specified during extraction, and the video frame rate is read for the video data. The audio data may contain voice information, and the video data may contain image information of the user's lip movements.
In an embodiment, the audio data in the multimedia data may be extracted at a preset audio sampling rate. The preset audio sampling rate may be specified by the server 20 and should accurately preserve the relevant speech features of the original multimedia data for subsequent computation.
In an embodiment, the video data in the multimedia data may be read at a preset video frame rate.
The preset video frame rate may be the frame rate at which the server 20 reads the video data; it needs to ensure that the read video data preserves the video features of the original multimedia data for subsequent computation.
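As a minimal sketch of this extraction step (assuming the ffmpeg command-line tool and OpenCV are available; the 16 kHz mono format is an illustrative choice, not one mandated by the present application):

```python
import subprocess
import cv2

def extract_audio(video_path: str, wav_path: str, sampling_rate: int = 16000) -> None:
    """Extract the audio track at a specified sampling rate via the ffmpeg CLI."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
         "-ar", str(sampling_rate), wav_path],
        check=True,
    )

def read_frame_rate(video_path: str) -> float:
    """Read the frame rate (fps) of the recorded video."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS)
    capture.release()
    return fps
```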
Step 303: perform speech recognition on the audio data to obtain voice information, and perform lip language recognition on the video data to obtain lip language information.
In this step, speech recognition may be performed frame by frame, based on a neural network algorithm, on the audio data of the user reading the four-digit random number aloud, so as to obtain the voice information. Likewise, lip language recognition may be performed frame by frame, based on a neural network algorithm, on the video data of the user reading the four-digit random number aloud; that is, the user's lip movements in the video images are recognized to obtain the lip language information.
Step 304: parse the voice information and the lip language information to obtain the offset information between the audio data and the video data, and verify, based on the offset information, whether the multimedia data comes from a living body.
In this step, in order to effectively prevent the security threats posed by the above attack videos, the audio-visual synchronization characteristics of the voice information and the lip language information may be analyzed together to obtain the offset information of the multimedia data, and whether the multimedia data comes from a living body is then verified based on the offset information.
In the above living body detection method, the audio data and the video data are extracted from the multimedia data; speech recognition is performed on the audio data and lip language recognition on the video data to obtain the voice information and the lip language information; the offset information of the multimedia data is obtained by parsing the voice information and the lip language information; and whether the multimedia data comes from a living body is verified based on the offset information. In this way, no large-scale sample annotation is needed, which saves detection cost, and the characteristics of both the voice information and the lip language information are considered, which improves the accuracy of living body detection. The above attack videos can be effectively prevented, and the security of living body verification is improved.
Please refer to FIG. 4, which shows a living body detection method according to an embodiment of the present application. The method may be executed by the electronic device 1 shown in FIG. 1 and may be applied to the living body verification scenario shown in FIG. 2, so as to accurately detect whether multimedia data comes from a living body and improve the security of living body verification. The method includes the following steps.
Step 401: acquire multimedia data to be detected. For details, refer to the description of step 301 in the above embodiment.
Step 402: extract the audio data and the video data from the multimedia data. For details, refer to the description of step 302 in the above embodiment.
Step 403: perform speech recognition on the audio data frame by frame to obtain audio element information of the audio data.
In this step, speech recognition may be performed frame by frame, based on a neural network algorithm, on the audio data obtained in step 402, so as to obtain the text information of the random number read aloud by the user.
In an embodiment, the speech recognition process may be as follows:
4a): Collect a preset number of digit audio recordings (for example, audio of people reading the digits 0-9 aloud), annotate them, and split them into a training set, a validation set, and a test set.
4b): Train a neural network on the audio of the training set, while using the validation set to verify the intermediate results of the training process (adjusting the training parameters in real time); when the training accuracy and the validation accuracy reach a certain threshold, a speech recognition model is obtained.
4c): Test the speech recognition model obtained in step 4b) with the test set to measure the performance of the model.
4d): Input the audio data obtained in step 402 into the speech recognition model frame by frame; the model computes the audio element information of the audio data.
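The present application does not fix a network architecture or training framework. The following PyTorch-style sketch only illustrates the train-and-validate-until-threshold loop of steps 4b) and 4c); the model, data loaders, and threshold value are all hypothetical:

```python
import torch

def evaluate(model, loader) -> float:
    """Fraction of correctly classified samples in a loader."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for frames, labels in loader:
            correct += (model(frames).argmax(dim=1) == labels).sum().item()
            total += labels.numel()
    return correct / total

def train_until_threshold(model, train_loader, val_loader,
                          acc_threshold: float = 0.98, max_epochs: int = 100):
    """Train on labelled digit audio until train and validation accuracy reach the threshold."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        model.train()
        for frames, labels in train_loader:
            optimizer.zero_grad()
            loss_fn(model(frames), labels).backward()
            optimizer.step()
        if (evaluate(model, train_loader) >= acc_threshold
                and evaluate(model, val_loader) >= acc_threshold):
            break
    return model
```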
Step 404: extract the audio start frame sequence and the audio end frame sequence of each element in the audio element information; the voice information may include the audio element information, the audio start frame sequence, and the audio end frame sequence.
In this step, the audio element information obtained above may include at least the audio start frame sequence and the audio end frame sequence of each element, for example the audio start frame sequence and the audio end frame sequence of each random digit, which are extracted from the audio element information.
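One plausible way to obtain these per-element boundaries is to run-length-group the frame-by-frame recognition output. In the sketch below, the per-frame label list, the blank label -1 for non-speech frames, and the Element layout are all assumptions made for illustration:

```python
from dataclasses import dataclass

BLANK = -1  # assumed label for frames in which no digit is recognized

@dataclass
class Element:
    digit: int        # the recognized element, e.g. one digit 0-9
    start_frame: int  # start frame sequence of this element
    end_frame: int    # end frame sequence of this element

def group_elements(frame_labels: list[int]) -> list[Element]:
    """Collapse per-frame labels into elements with start/end frame sequences."""
    elements: list[Element] = []
    for i, label in enumerate(frame_labels):
        if label == BLANK:
            continue
        if elements and elements[-1].digit == label and elements[-1].end_frame == i - 1:
            elements[-1].end_frame = i               # extend the current run
        else:
            elements.append(Element(label, i, i))    # open a new run
    return elements
```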
Step 405: perform lip language recognition on the video data frame by frame to obtain lip language element information of the video data.
In this step, lip language recognition may be performed frame by frame, based on a neural network algorithm, on the video data obtained in step 402, so as to obtain the lip language element information.
In an embodiment, the lip language recognition process may be as follows:
6a): Collect a preset number of digit lip language videos, for example lip images captured while people read the digits 0-9 aloud, annotate their features, and split them into a training set, a validation set, and a test set.
6b): Train a neural network on the videos of the training set, while using the validation set to verify the intermediate results of the training process (adjusting the training parameters in real time); when the training accuracy and the validation accuracy reach a certain threshold, a lip language recognition model is obtained.
6c): Test the lip language recognition model obtained in step 6b) with the test set to measure the performance of the model.
6d): Input the video data obtained in step 402 into the lip language recognition model frame by frame to obtain the lip language element information of the video data computed by the model.
Step 406: extract the video start frame sequence and the video end frame sequence of each element in the lip language element information; the lip language information may include the lip language element information, the video start frame sequence, and the video end frame sequence.
In this step, the lip language element information may include at least the video start frame sequence and the video end frame sequence of each element, for example the video start frame sequence and the video end frame sequence of the user reading each digit aloud, which are extracted from the lip language element information. In an embodiment, the execution order of steps 403-404 and steps 405-406 is not limited.
Step 407: perform data standardization on the voice information and generate an audio element string of a target length based on the audio element information; perform data standardization on the lip language information and generate a lip language element string of the target length based on the lip language element information.
In this step, for the living body verification scenario shown in FIG. 2, the multimedia data recorded by the user may exist in various formats and its content may be complicated. To simplify data processing, before the multimedia data is collected, the server 20 may first generate random text information, for example a four-digit random number, for the user to read aloud, and the multimedia data of the reading is then recorded. In the subsequent data processing, the audio element information and the lip language element information need to be standardized into digit strings of a fixed length. The target length here is the length of the random number generated by the server 20; for example, if the random number generated by the server 20 has four digits, the target length is four. Since the user reads a four-digit random number aloud, the audio element information and the lip language element information need to be standardized into four digits. Four random digits are more conducive to the accuracy of the detection result. In an embodiment, step 407 may specifically include: converting the audio element information into an audio element string of the target length, and converting the lip language element information into a lip language element string of the target length; identifying the number of digits of the audio element string and of the lip language element string respectively; when the number of recognized digits is less than a first threshold, outputting a recognition error; when the number of recognized digits is greater than or equal to the first threshold and less than a second threshold, replacing the missing digits with a first preset value; and when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits through a matching algorithm based on the content of the audio element information and the lip language element information.
In an embodiment, taking a target length of four digits as an example, when data standardization is performed on the audio element information and the lip language element information, erroneous results of fewer than three digits may be filtered out, and -1 may substitute for missing digits. If the number of digits exceeds four, the accurately recognized digits are determined through a matching algorithm, and -1 likewise substitutes for inaccurately recognized digits.
In an embodiment, taking a four-digit random number as an example, the data standardization process may be as follows. First, the audio element information is converted into a four-digit audio element string and the lip language element information into a four-digit lip language element string, and the numbers of digits of the two strings are determined respectively. When the number of digits is less than three, it is judged as a recognition error and the verification process is terminated. When the number of digits equals three, -1 substitutes for the missing digit. When the number of digits is exactly four, the recognition result is output directly. When the number of digits is greater than four, the accurately matched digits are extracted through a matching algorithm based on the content of the text information; when fewer than four digits match accurately, -1 substitutes for the missing digits. For example, suppose the content of the audio element information or the lip language element information is the five-digit number (12345) while the four-digit random number generated by the server 20 is (1234); then the part of (12345) whose content and length are (1234) can be extracted as the result of the standardization.
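A sketch of these standardization rules, with the four-digit case and the first preset value -1 hard-coded, and with the "matching algorithm" reduced to an in-order scan against the target string (the application does not prescribe a particular matching algorithm):

```python
MISSING = "-1"  # first preset value standing in for an unrecognized digit

def align(recognized: str, target: str) -> list[str]:
    """Keep digits that match the target in order; mark the rest as missing."""
    result, j = [], 0
    for ch in target:
        k = recognized.find(ch, j)
        if k >= 0:
            result.append(ch)
            j = k + 1
        else:
            result.append(MISSING)
    return result

def normalize(recognized: str, target: str) -> list[str] | None:
    """Standardize a recognized digit sequence against the 4-digit target string.

    Returns four digit strings ('-1' marks a missing digit), or None when
    recognition is treated as an error (fewer than three digits).
    """
    if len(recognized) < 3:           # below the first threshold: recognition error
        return None
    if len(recognized) == len(target):
        return list(recognized)       # exactly four digits: output directly
    return align(recognized, target)  # three digits, or more than four
```

For example, normalize("12345", "1234") returns ['1', '2', '3', '4'], matching the (12345) to (1234) case described above.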
Step 408: compare the audio element string and the lip language element string with the target string respectively, and when both the audio element string and the lip language element string semantically match the target string, calculate the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
In an embodiment, when both the audio element string and the lip language element string semantically match the target string, step 408 may include:
S81: For the audio element string and the lip language element string, calculate the start time difference between the audio start time and the video start time of each element character, and calculate the end time difference between the audio end time and the video end time of each element character.
In this step, an element character is one pronunciation element of the text content; if the four-digit random number is (1234), then 1, 2, 3, and 4 are the four element characters. The audio element string and the lip language element string can be traversed, and the following formulas are used to compute, for each element character:
Audio start time: audio_start = (audio_fstart / audio_sampling_rate) * 1000.
Audio end time: audio_end = (audio_fend / audio_sampling_rate) * 1000.
Video start time: lip_start = (lip_fstart / fps) * 1000.
Video end time: lip_end = (lip_fend / fps) * 1000.
Then, for each element character, the following are computed:
Start time difference: abs(lip_start - audio_start).
End time difference: abs(lip_end - audio_end).
Here, audio_fstart is the audio start frame sequence of each element character in the audio element string, audio_fend is the audio end frame sequence of each element character in the audio element string, and audio_sampling_rate is the audio sampling rate; lip_fstart is the video start frame sequence of each element character in the lip language element string, lip_fend is the video end frame sequence of each element character in the lip language element string, and fps is the preset video frame rate. abs() takes the absolute value.
S82: Calculate the time difference average of the start time difference and the end time difference of each element character.
In this step, the time difference average may be calculated with the following formula:
diff_time = (abs(lip_start - audio_start) + abs(lip_end - audio_end)) / 2.
Here, diff_time denotes the offset formula (in ms); it returns the time interval between two time variables, that is, it calculates the time difference between two moments. Here, the result of diff_time represents the time difference average of each element character.
S83: Calculate the offset average of the time difference averages of all the element characters; the offset information is the offset average.
In this step, the time difference averages of all the element characters may be averaged. Taking a four-digit random number as an example, the offset average may be calculated with the following formula:
result = (diff_time[0] + diff_time[1] + diff_time[2] + diff_time[3]) / 4.
Here, result is the offset average; diff_time[0] denotes the time difference average of the first digit, diff_time[1] that of the second digit, diff_time[2] that of the third digit, and diff_time[3] that of the fourth digit.
Through the above steps S81 to S83, the offset average of the multimedia data, that is, the offset information, is calculated based on the audio element string and the lip language element string.
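The formulas of S81 to S83 translate directly into code. The sketch below assumes the Element structure from the earlier grouping sketch, with the audio and lip elements already paired by position:

```python
def offset_average(audio_elems, lip_elems, audio_sampling_rate: float, fps: float) -> float:
    """Offset average in ms over paired audio/lip elements (steps S81 to S83)."""
    diff_times = []
    for a, v in zip(audio_elems, lip_elems):
        audio_start = (a.start_frame / audio_sampling_rate) * 1000
        audio_end = (a.end_frame / audio_sampling_rate) * 1000
        lip_start = (v.start_frame / fps) * 1000
        lip_end = (v.end_frame / fps) * 1000
        start_diff = abs(lip_start - audio_start)       # start time difference
        end_diff = abs(lip_end - audio_end)             # end time difference
        diff_times.append((start_diff + end_diff) / 2)  # diff_time per element character
    return sum(diff_times) / len(diff_times)            # result, e.g. averaged over 4 digits
```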
Step 409: judge whether the offset information is within the target offset range. If so, proceed to step 410; otherwise, proceed to step 411.
In this step, the target offset range may be obtained statistically from actual test data; it characterizes the multimedia data recorded by a living body.
Step 410: output that the multimedia data comes from a living body.
In this step, if the offset information is within the target offset range, the offset of the multimedia data is small enough to correspond to multimedia data produced by the actual behavior of an ordinary living body, so it is output that the multimedia data comes from a living body.
Step 411: output that the multimedia data does not come from a living body.
In this step, if the offset information is not within the target offset range, the current multimedia data may not result from the behavior of a living body, or may be maliciously synthesized attack data; it is therefore output that the multimedia data does not come from a living body, and in the living body verification scenario shown in FIG. 2 this verification fails. A warning can be issued.
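Putting steps 409 to 411 together (the 300 ms bound below is purely illustrative; in practice the target offset range is a statistic obtained from real test data, as described above):

```python
TARGET_OFFSET_MS = 300.0  # illustrative bound, to be replaced by the measured range

def is_live(offset_ms: float, target_offset_ms: float = TARGET_OFFSET_MS) -> bool:
    """Steps 409-411: treat the data as live only when the offset is small enough."""
    return offset_ms <= target_offset_ms
```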
The above living body detection method significantly improves the accuracy of living body detection and reduces the missed detection rate. It provides a fault tolerance for videos in which a small amount of audio and picture is out of synchronization, and saves the original cost of annotating large numbers of audio-visually asynchronous videos.
Please refer to FIG. 5, which shows a living body detection apparatus 500 according to an embodiment of the present application. The apparatus may be applied to the electronic device 1 shown in FIG. 1 and to the living body verification scenario shown in FIG. 2, so as to accurately detect whether multimedia data comes from a living body and improve the security of living body verification. The apparatus may include an acquisition module 501, an extraction module 502, a recognition module 503, and a parsing module 504, which are related as follows:
The acquisition module 501 is configured to acquire multimedia data to be detected. For details, refer to the description of step 301 in the above embodiment.
The extraction module 502 is configured to extract the audio data and the video data from the multimedia data. For details, refer to the description of step 302 in the above embodiment.
The recognition module 503 is configured to perform speech recognition on the audio data to obtain voice information, and to perform lip language recognition on the video data to obtain lip language information. For details, refer to the description of step 303 in the above embodiment.
The parsing module 504 is configured to parse the voice information and the lip language information to obtain offset information between the audio data and the video data, and to verify, based on the offset information, whether the multimedia data comes from a living body. For details, refer to the description of step 304 in the above embodiment.
In an embodiment, the recognition module 503 may be configured to: perform speech recognition on the audio data frame by frame to obtain audio element information of the audio data; and extract the audio start frame sequence and the audio end frame sequence of each element in the audio element information. The voice information may include the audio element information, the audio start frame sequence, and the audio end frame sequence.
In an embodiment, the recognition module 503 may be configured to: perform lip language recognition on the video data frame by frame to obtain lip language element information of the video data; and extract the video start frame sequence and the video end frame sequence of each element in the lip language element information. The lip language information may include the lip language element information, the video start frame sequence, and the video end frame sequence.
In an embodiment, the parsing module 504 may be configured to: perform data standardization on the voice information and generate an audio element string of a target length based on the audio element information; perform data standardization on the lip language information and generate a lip language element string of the target length based on the lip language element information; compare the audio element string and the lip language element string with the target string respectively; and, when both the audio element string and the lip language element string semantically match the target string, calculate the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
In an embodiment, calculating the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence may include: for the audio element string and the lip language element string, calculating the start time difference between the audio start time and the video start time of each element character, and calculating the end time difference between the audio end time and the video end time of each element character; calculating the time difference average of the start time difference and the end time difference of each element character; and calculating the offset average of the time difference averages of all the element characters, where the offset information may be the offset average.
In an embodiment, performing data standardization on the voice information and generating an audio element string of a target length based on the audio element information, and performing data standardization on the lip language information and generating a lip language element string of the target length based on the lip language element information, may include: converting the audio element information into the audio element string of the target length, and converting the lip language element information into the lip language element string of the target length; identifying the number of digits of the audio element string and of the lip language element string respectively; when the number of recognized digits is less than a first threshold, outputting a recognition error; when the number of recognized digits is greater than or equal to the first threshold and less than a second threshold, replacing the missing digits with a first preset value; and when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits through a matching algorithm based on the content of the audio element information and the lip language element information.
In an embodiment, the parsing module 504 may be further configured to: judge whether the offset information is within the target offset range; if the offset information is within the target offset range, output that the multimedia data comes from a living body; otherwise, output that the multimedia data does not come from a living body.
For a detailed description of the above living body detection apparatus 500, refer to the descriptions of the relevant method steps in the above embodiments.
An embodiment of the present application further provides a non-transitory electronic-device-readable storage medium, which may include a program that, when run on an electronic device, causes the electronic device to perform all or part of the flow of the methods in the above embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like. The storage medium may also include a combination of the above kinds of memories.
Although the embodiments of the present application have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the present application, and all such modifications and variations fall within the scope defined by the appended claims.
Industrial Applicability
The present application provides a living body detection method, apparatus, device, and storage medium. The method includes: acquiring multimedia data to be detected; extracting audio data and video data from the multimedia data; performing speech recognition on the audio data to obtain voice information, and performing lip language recognition on the video data to obtain lip language information; parsing the voice information and the lip language information to obtain offset information between the audio data and the video data; and verifying, based on the offset information, whether the multimedia data comes from a living body. The present application significantly improves the accuracy of living body detection, reduces the missed detection rate, provides a fault tolerance for videos in which a small amount of audio and picture is out of synchronization, and saves the original cost of annotating large numbers of audio-visually asynchronous videos.
Furthermore, it can be understood that the living body detection method, apparatus, device, and storage medium of the present application are reproducible and can be used in a variety of industrial applications. For example, they can be used in application scenarios of living body verification based on lip language video.

Claims (20)

  1. A living body detection method, comprising:
    acquiring multimedia data to be detected;
    extracting audio data and video data from the multimedia data;
    performing speech recognition on the audio data to obtain voice information, and performing lip language recognition on the video data to obtain lip language information; and
    determining, according to the voice information and the lip language information, offset information between the audio data and the video data, and verifying, based on the offset information, whether the multimedia data comes from a living body.
  2. The method according to claim 1, wherein the performing speech recognition on the audio data to obtain voice information comprises:
    performing speech recognition on the audio data frame by frame to obtain audio element information of the audio data; and
    extracting an audio start frame sequence and an audio end frame sequence of each element in the audio element information, wherein the voice information comprises the audio element information, the audio start frame sequence, and the audio end frame sequence.
  3. The method according to claim 1 or 2, wherein the performing lip language recognition on the video data to obtain lip language information comprises:
    performing lip language recognition on the video data frame by frame to obtain lip language element information of the video data; and
    extracting a video start frame sequence and a video end frame sequence of each element in the lip language element information, wherein the lip language information comprises the lip language element information, the video start frame sequence, and the video end frame sequence.
  4. The method according to claim 3, wherein the determining, according to the voice information and the lip language information, offset information between the audio data and the video data comprises:
    performing data standardization on the audio element information in the voice information and generating an audio element string of a target length based on the standardized audio element information, and performing data standardization on the lip language element information in the lip language information and generating a lip language element string of the target length based on the standardized lip language element information; and
    matching the audio element string and the lip language element string with a target string respectively, and, when both the audio element string and the lip language element string semantically match the target string, determining the offset information based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
  5. The method according to claim 4, wherein calculating the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence comprises:
    for the audio element string and the lip language element string, calculating a start time difference between an audio start time and a video start time of each element character, and calculating an end time difference between an audio end time and a video end time of each element character, wherein the audio start time is determined based on the audio start frame sequence, the audio end time is determined based on the audio end frame sequence, the video start time is determined based on the video start frame sequence, and the video end time is determined based on the video end frame sequence;
    calculating a time difference average of the start time difference and the end time difference of each element character; and
    calculating an offset average of the time difference averages of all the element characters, wherein the offset information is the offset average.
  6. The method according to claim 5, wherein the audio start time is determined based on the audio start frame sequence by the formula audio_start = (audio_fstart / audio_sampling_rate) * 1000, where audio_fstart is the audio start frame sequence corresponding to each element character in the audio element string, and audio_sampling_rate is the audio sampling rate.
  7. The method according to claim 5, wherein the audio end time is determined based on the audio end frame sequence by the formula audio_end = (audio_fend / audio_sampling_rate) * 1000, where audio_fend is the audio end frame sequence corresponding to each element character in the audio element string, and audio_sampling_rate is the audio sampling rate.
  8. The method according to claim 5, wherein the video start time is determined based on the video start frame sequence by the formula lip_start = (lip_fstart / fps) * 1000, where lip_fstart is the video start frame sequence corresponding to each element character in the lip language element string, and fps is the preset video frame rate.
  9. The method according to claim 5, wherein the video end time is determined based on the video end frame sequence by the formula lip_end = (lip_fend / fps) * 1000, where lip_fend is the video end frame sequence corresponding to each element character in the lip language element string, and fps is the preset video frame rate.
  10. The method according to claim 4, wherein
    the performing data standardization processing on the audio element information in the voice information and generating an audio element string of a target length based on the data-standardized audio element information, and performing data standardization processing on the lip language element information in the lip language information and generating a lip language element string of the target length based on the data-standardized lip language element information, comprises:
    converting the audio element information into the audio element string of the target length, and converting the lip language element information into the lip language element string of the target length;
    respectively identifying the number of digits of the audio element string and of the lip language element string, and outputting a recognition error when the number of recognized digits is less than a first threshold;
    when the number of recognized digits is greater than or equal to the first threshold and less than a second threshold, substituting a first preset value for the digits whose recognition is missing;
    when the number of recognized digits is greater than or equal to the second threshold, extracting the accurately matched digits through a matching algorithm based on the content of the audio element information and the lip language element information.
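A hedged sketch of the three branches in claim 10. The threshold values, the pad value, the error sentinel, and the truncation stand-in for the matching algorithm are all assumptions made for illustration; the claim fixes none of them:

```python
RECOGNITION_ERROR = None  # assumed sentinel for the "recognition error" output

def normalize_digit_string(digits, first_threshold, second_threshold,
                           pad_value="0"):
    """digits: list of recognized element characters (e.g. spoken digits)."""
    n = len(digits)
    if n < first_threshold:
        return RECOGNITION_ERROR  # too few digits recognized: recognition error
    if n < second_threshold:
        # Replace the missing positions with a first preset value.
        return digits + [pad_value] * (second_threshold - n)
    # Enough digits: keep the accurately matched positions. A real system
    # would apply the matching algorithm named in the claim; plain
    # truncation is only a placeholder here.
    return digits[:second_threshold]
```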
  11. The method according to any one of claims 1 to 10, wherein the verifying, based on the offset information, whether the multimedia data comes from a living body comprises:
    determining whether the offset information is within a target offset range;
    if the offset information is within the target offset range, determining that the multimedia data comes from a living body; otherwise, determining that the multimedia data does not come from a living body.
  12. The method according to claim 11, wherein the target offset range is obtained statistically from actual test data.
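The final check of claims 11 and 12 reduces to a range test. In the sketch below, the ±200 ms bounds are a made-up placeholder; per claim 12, the real range would be fitted statistically to actual test data:

```python
def is_from_living_body(offset_ms, target_range=(-200.0, 200.0)):
    """Returns True when the audio/video offset falls inside the target
    offset range, i.e. the multimedia data is judged to come from a
    live subject."""
    low, high = target_range
    return low <= offset_ms <= high
```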
  13. A liveness detection apparatus, comprising:
    an acquisition module configured to acquire multimedia data to be detected;
    an extraction module configured to extract audio data and video data from the multimedia data;
    a recognition module configured to perform speech recognition on the audio data to obtain voice information, and to perform lip language recognition on the video data to obtain lip language information;
    a parsing module configured to determine offset information between the audio data and the video data according to the voice information and the lip language information, and to verify, based on the offset information, whether the multimedia data comes from a living body.
  14. The liveness detection apparatus according to claim 13, wherein the recognition module is configured to:
    perform speech recognition on the audio data frame by frame to obtain audio element information of the audio data;
    extract the audio start frame sequence and the audio end frame sequence of each element in the audio element information, the voice information comprising the audio element information, the audio start frame sequence, and the audio end frame sequence.
  15. The liveness detection apparatus according to claim 13 or 14, wherein the recognition module is configured to:
    perform lip language recognition on the video data frame by frame to obtain lip language element information of the video data;
    extract the video start frame sequence and the video end frame sequence of each element in the lip language element information, the lip language information comprising the lip language element information, the video start frame sequence, and the video end frame sequence.
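Claims 14 and 15 both turn frame-by-frame recognition output into per-element start and end frame sequences. A minimal sketch of that grouping step, assuming one label per frame with None marking frames where nothing is recognized; the label format is an assumption for illustration:

```python
def element_frame_spans(frame_labels):
    """frame_labels: e.g. [None, '1', '1', '1', None, '5', '5', ...]
    Returns a list of (element, start_frame, end_frame) tuples.
    Note: consecutive identical labels with no gap merge into one span."""
    spans = []
    current, start = None, None
    for i, label in enumerate(frame_labels):
        if label != current:
            if current is not None:
                spans.append((current, start, i - 1))  # close previous element
            current, start = label, i
    if current is not None:
        spans.append((current, start, len(frame_labels) - 1))
    return spans
```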
  16. The liveness detection apparatus according to claim 15, wherein the parsing module is configured to:
    perform data standardization processing on the audio element information in the voice information and generate an audio element string of a target length based on the data-standardized audio element information, and perform data standardization processing on the lip language element information in the lip language information and generate a lip language element string of the target length based on the data-standardized lip language element information;
    match the audio element string and the lip language element string respectively against a target string, and, when both the audio element string and the lip language element string semantically match the target string, determine the offset information based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
  17. The liveness detection apparatus according to claim 16, wherein the calculating of the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence comprises:
    for the audio element string and the lip language element string, calculating, for each element character, a start time difference between the audio start time and the video start time, and calculating, for each element character, an end time difference between the audio end time and the video end time, wherein the audio start time is determined based on the audio start frame sequence, the audio end time is determined based on the audio end frame sequence, the video start time is determined based on the video start frame sequence, and the video end time is determined based on the video end frame sequence;
    calculating, for each of the element characters, a time-difference average of the start time difference and the end time difference;
    calculating an offset average of the time-difference averages of all the element characters, the offset information being the offset average.
  18. An electronic device, comprising:
    a memory for storing a computer program;
    a processor for executing the method according to any one of claims 1 to 12, so as to detect whether multimedia data comes from a living body.
  19. A non-transitory electronic-device-readable storage medium, comprising a program which, when run by an electronic device, causes the electronic device to execute the method according to any one of claims 1 to 12.
  20. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 12.
PCT/CN2021/120422 2020-12-29 2021-09-24 Liveness detection method and apparatus, device, and storage medium WO2022142521A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011587469.1A CN112733636A (en) 2020-12-29 2020-12-29 Living body detection method, living body detection device, living body detection apparatus, and storage medium
CN202011587469.1 2020-12-29

Publications (1)

Publication Number Publication Date
WO2022142521A1

Family

ID=75607094

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120422 WO2022142521A1 (en) 2020-12-29 2021-09-24 Liveness detection method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN112733636A (en)
WO (1) WO2022142521A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733636A (en) * 2020-12-29 2021-04-30 北京旷视科技有限公司 Living body detection method, living body detection device, living body detection apparatus, and storage medium
CN113810680A (en) * 2021-09-16 2021-12-17 深圳市欢太科技有限公司 Audio synchronization detection method and device, computer readable medium and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834900B (en) * 2015-04-15 2017-12-19 常州飞寻视讯信息科技有限公司 A kind of method and system combined audio-visual signal and carry out In vivo detection
CN109409204B (en) * 2018-09-07 2021-08-06 北京市商汤科技开发有限公司 Anti-counterfeiting detection method and device, electronic equipment and storage medium
CN110585702B (en) * 2019-09-17 2023-09-19 腾讯科技(深圳)有限公司 Sound and picture synchronous data processing method, device, equipment and medium
CN110704683A (en) * 2019-09-27 2020-01-17 深圳市商汤科技有限公司 Audio and video information processing method and device, electronic equipment and storage medium
CN111881726B (en) * 2020-06-15 2022-11-25 马上消费金融股份有限公司 Living body detection method and device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376250A (en) * 2014-12-03 2015-02-25 优化科技(苏州)有限公司 Real person living body identity verification method based on sound-type image feature
CN105426723A (en) * 2015-11-20 2016-03-23 北京得意音通技术有限责任公司 Voiceprint identification, face identification and synchronous in-vivo detection-based identity authentication method and system
CN108038443A (en) * 2017-12-08 2018-05-15 深圳泰首智能技术有限公司 Witness the method and apparatus of service testing result
CN108124488A (en) * 2017-12-12 2018-06-05 福建联迪商用设备有限公司 A kind of payment authentication method and terminal based on face and vocal print
CN112733636A (en) * 2020-12-29 2021-04-30 北京旷视科技有限公司 Living body detection method, living body detection device, living body detection apparatus, and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115209175A (en) * 2022-07-18 2022-10-18 忆月启函(盐城)科技有限公司 Voice transmission method and system
CN115209175B (en) * 2022-07-18 2023-10-24 深圳蓝色鲨鱼科技有限公司 Voice transmission method and system

Also Published As

Publication number Publication date
CN112733636A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
WO2022142521A1 (en) Liveness detection method and apparatus, device, and storage medium
RU2738325C2 (en) Method and device for authenticating an individual
CN106601243B (en) Video file identification method and device
WO2020244153A1 (en) Conference voice data processing method and apparatus, computer device and storage medium
CN110378228A (en) Video data handling procedure, device, computer equipment and storage medium are examined in face
WO2019196205A1 (en) Foreign language teaching evaluation information generating method and apparatus
WO2020019591A1 (en) Method and device used for generating information
US20070220265A1 (en) Searching for a scaling factor for watermark detection
US9626575B2 (en) Visual liveness detection
CN109118420B (en) Watermark identification model establishing and identifying method, device, medium and electronic equipment
CN113242361B (en) Video processing method and device and computer readable storage medium
CN110853646A (en) Method, device and equipment for distinguishing conference speaking roles and readable storage medium
CN112380922B (en) Method, device, computer equipment and storage medium for determining multiple video frames
CN110598008B (en) Method and device for detecting quality of recorded data and storage medium
CN114760442B (en) Monitoring system for online education management
CN113409771B (en) Detection method for forged audio frequency, detection system and storage medium thereof
CN112351047B (en) Double-engine based voiceprint identity authentication method, device, equipment and storage medium
CN112151038B (en) Voice replay attack detection method and device, readable storage medium and electronic equipment
CN113627387A (en) Parallel identity authentication method, device, equipment and medium based on face recognition
KR20200042979A (en) Method and System for Non-Identification of Personal Information in Imaging Device
CN116883900A (en) Video authenticity identification method and system based on multidimensional biological characteristics
CN115424634A (en) Audio and video stream data processing method and device, electronic equipment and storage medium
CN115331703A (en) Song voice detection method and device
CN108734144A (en) A kind of speaker's identity identifying method based on recognition of face
CN114140850A (en) Face recognition method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913266

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM1205A DATED 12/10/2023)