WO2022142521A1 - Living body detection method and apparatus, device, and storage medium - Google Patents

Living body detection method and apparatus, device, and storage medium

Info

Publication number
WO2022142521A1
WO2022142521A1 · PCT/CN2021/120422 · CN2021120422W
Authority
WO
WIPO (PCT)
Prior art keywords
audio
information
video
frame sequence
lip language
Prior art date
Application number
PCT/CN2021/120422
Other languages
English (en)
Chinese (zh)
Inventor
时旭
Original Assignee
北京旷视科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京旷视科技有限公司
Publication of WO2022142521A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/441 Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • H04N21/4415 Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning

Definitions

  • the present application relates to the technical field of multimedia information, and in particular, to a method, apparatus, device and storage medium for living body detection.
  • Liveness detection is a method for determining the real physiological characteristics of an object in some authentication scenarios.
  • the current video data of the user is generally acquired in real time, and it is then detected, based on the video content, whether the video conforms to the audio-video synchronization characteristics of a living body.
  • Synchronization of audio and video generally means that each picture frame rendered by the player corresponds strictly to the piece of sound being played, with no deviation distinguishable by the human ear or the naked eye.
  • the audio and video synchronization detection method usually uses a large number of labeled audio-video synchronous/asynchronous videos as samples, and obtains a model through neural network training.
  • the model outputs a synchronization score for the input video.
  • if the score exceeds a preset threshold, the sound and picture are considered synchronized; otherwise, the sound and picture are not synchronized.
  • the embodiments of the present application provide a method, apparatus, device, and storage medium for living body detection, which significantly improve the accuracy of living body detection.
  • An embodiment of the present application provides a method for detecting a living body, which may include: acquiring multimedia data to be detected; extracting audio data and video data from the multimedia data; performing speech recognition on the audio data to obtain voice information, and performing lip language recognition on the video data to obtain lip language information; determining, according to the voice information and the lip language information, offset information between the audio data and the video data; and verifying, based on the offset information, whether the multimedia data comes from a living body.
  • performing speech recognition on the audio data to obtain voice information may include: performing speech recognition on the audio data frame by frame to acquire audio element information of the audio data; and extracting the audio start frame sequence and the audio end frame sequence of each element in the audio element information, where the voice information may include: the audio element information, the audio start frame sequence, and the audio end frame sequence.
  • performing lip language recognition on the video data to obtain lip language information may include: performing lip language recognition on the video data frame by frame to acquire lip language element information of the video data; and extracting the video start frame sequence and the video end frame sequence of each element in the lip language element information, where the lip language information may include: the lip language element information, the video start frame sequence, and the video end frame sequence.
  • the determining, according to the voice information and the lip language information, of the offset information between the audio data and the video data may include: performing data standardization processing on the audio element information in the voice information, and generating an audio element string of a target length based on the standardized audio element information; performing data standardization processing on the lip language element information in the lip language information, and generating a lip language element string of the target length based on the standardized lip language element information; and comparing the audio element string and the lip language element string with the target string respectively, and when both the audio element string and the lip language element string match the semantics of the target string, determining the offset information based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
  • calculating the offset information of the multimedia data may include: for the audio element string and the lip language element string, calculating the start time difference between the audio start time and the video start time of each element character, and calculating the end time difference between the audio end time and the video end time of each element character.
  • the audio start time is determined based on the audio start frame sequence
  • the audio end time is determined based on the audio end frame sequence
  • the video start time is determined based on the video start frame sequence
  • the video end time is determined based on the video end frame sequence; the time difference average of the start time difference and the end time difference of each element character is calculated; and the offset average of the time difference averages of all the element characters is calculated, the offset information being the offset average.
  • the audio start time may be calculated as audio_start = audio_fstart / audio_sampling_rate * 1000, where audio_fstart is the audio start frame sequence corresponding to each element character in the audio element string and audio_sampling_rate is the audio sampling rate.
  • the audio end time may be calculated as audio_end = audio_fend / audio_sampling_rate * 1000, where audio_fend is the audio end frame sequence corresponding to each element character in the audio element string and audio_sampling_rate is the audio sampling rate.
  • the video start time may be calculated as lip_start = lip_fstart / fps * 1000, where lip_fstart is the video start frame sequence corresponding to each element character in the lip language element string and fps is the preset video frame rate.
  • the video end time may be calculated as lip_end = lip_fend / fps * 1000, where lip_fend is the video end frame sequence corresponding to each element character in the lip language element string and fps is the preset video frame rate.
  • the data standardization processing performed on the audio element information in the voice information to generate an audio element string of a target length, and on the lip language element information in the lip language information to generate a lip language element string of the target length, may include: converting the audio element information into the audio element string of the target length, and converting the lip language element information into the lip language element string of the target length; respectively identifying the number of digits of the audio element string and of the lip language element string; when the identified number of digits is less than a first threshold, outputting a recognition error; when the identified number of digits is greater than or equal to the first threshold and less than a second threshold, substituting a first preset value for the missing digits; and when the identified number of digits is greater than or equal to the second threshold, extracting the accurately matching digits with a matching algorithm based on the content of the audio element information and the lip language element information.
  • the verifying, based on the offset information, whether the multimedia data comes from a living body may include: judging whether the offset information is within a target offset range; if the offset information is within the target offset range, determining that the multimedia data comes from a living body; otherwise, determining that the multimedia data does not come from a living body.
  • the target offset range may be obtained through actual test data statistics, and the target offset range represents the characteristics of the multimedia data recorded by the living body.
  • An embodiment of the present application provides a living body detection apparatus, which may include: an acquisition module configured to acquire multimedia data to be detected; an extraction module configured to extract audio data and video data from the multimedia data; a recognition module configured to perform speech recognition on the audio data to obtain voice information and to perform lip language recognition on the video data to obtain lip language information; and a parsing module configured to determine, according to the voice information and the lip language information, offset information between the audio data and the video data, and to verify, based on the offset information, whether the multimedia data comes from a living body.
  • the recognition module may be configured to: perform speech recognition on the audio data frame by frame to obtain audio element information of the audio data; and extract the audio start frame sequence and the audio end frame sequence of each element in the audio element information, where the voice information may include: the audio element information, the audio start frame sequence, and the audio end frame sequence.
  • the recognition module may be configured to: perform lip language recognition on the video data frame by frame to obtain lip language element information of the video data; and extract the video start frame sequence and the video end frame sequence of each element in the lip language element information, where the lip language information may include: the lip language element information, the video start frame sequence, and the video end frame sequence.
  • the parsing module may be configured to: perform data standardization processing on the audio element information in the voice information and generate an audio element string of a target length based on the standardized audio element information; perform data standardization processing on the lip language element information in the lip language information and generate a lip language element string of the target length based on the standardized lip language element information; and compare the audio element string and the lip language element string with the target string respectively, and when both the audio element string and the lip language element string match the semantics of the target string, determine the offset information based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence, and the video end frame sequence.
  • calculating the offset information of the multimedia data may include: for the audio element string and the lip language element string, calculating the start time difference between the audio start time and the video start time of each element character, and calculating the end time difference between the audio end time and the video end time of each element character.
  • the audio start time is determined based on the audio start frame sequence
  • the audio end time is determined based on the audio end frame sequence
  • the video start time is determined based on the video start frame sequence
  • the video end time is determined based on the video end frame sequence; the time difference average of the start time difference and the end time difference of each element character is calculated; and the offset average of the time difference averages of all the element characters is calculated, the offset information being the offset average.
  • the data standardization processing performed on the audio element information in the voice information to generate an audio element string of a target length, and on the lip language element information in the lip language information to generate a lip language element string of the target length, may include: converting the audio element information into the audio element string of the target length, and converting the lip language element information into the lip language element string of the target length; respectively identifying the number of digits of the audio element string and of the lip language element string; when the identified number of digits is less than the first threshold, outputting a recognition error; when the identified number of digits is greater than or equal to the first threshold and less than the second threshold, substituting the first preset value for the missing digits; and when the identified number of digits is greater than or equal to the second threshold, extracting the accurately matching digits with a matching algorithm based on the content of the audio element information and the lip language element information.
  • the parsing module may be further configured to: determine whether the offset information is within the target offset range; if the offset information is within the target offset range, determine that the multimedia data comes from a living body; otherwise, determine that the multimedia data does not come from a living body.
  • An embodiment of the present application provides an electronic device, which may include: a memory for storing a computer program; and a processor for executing the method of any one of some embodiments of the present application, so as to detect whether the multimedia data comes from a living body.
  • An embodiment of the present application provides a non-transitory electronic-device-readable storage medium, which may include: a program which, when run by an electronic device, causes the electronic device to execute the method of any one of some embodiments of the present application.
  • An embodiment of the present application provides a computer program product, and the computer program product may include a computer program, which implements the method described in any one of some embodiments of the present application when the computer program is executed by a processor.
  • the living body detection method, apparatus, device and storage medium provided by the present application can extract the audio data and video data from the multimedia data, perform speech recognition on the audio data and lip language recognition on the video data to obtain the voice information and the lip language information, analyze the voice information and the lip language information to obtain the offset information of the multimedia data, and then verify, based on the offset information, whether the multimedia data comes from a living body. In this way, no large number of sample annotations is needed, which saves detection cost, and the characteristics of the voice information and the lip language information are considered comprehensively, which improves the accuracy of living body detection.
  • FIG. 1 is a schematic structural diagram of an electronic device according to an embodiment of the application.
  • FIG. 2 is a schematic diagram of a living body verification scene system according to an embodiment of the application.
  • FIG. 3 is a schematic flowchart of a method for detecting a living body according to an embodiment of the present application
  • FIG. 4 is a schematic flowchart of a method for detecting a living body according to an embodiment of the present application
  • FIG. 5 is a schematic structural diagram of a living body detection device according to an embodiment of the present application.
  • this embodiment provides an electronic device 1 , which may include: at least one processor 11 and a memory 12 .
  • one processor is used as an example.
  • the processor 11 and the memory 12 can be connected through the bus 10; the memory 12 stores instructions executable by the processor 11, and the instructions are executed by the processor 11 so that the electronic device 1 can execute all or part of the processes of the methods in the following embodiments, so as to detect the living body information of the multimedia data.
  • the electronic device 1 may be a mobile phone, a notebook computer, a desktop computer, or a computing system composed of multiple computers.
  • FIG. 2 is a living body verification scene system according to an embodiment of the present application, which may include: a server 20 and a client 30 .
  • the server 20 may be implemented by the electronic device 1 , and the server 20 may include: a speech recognition module 21 and a lip language recognition module 22 .
  • the server 20 can generate random text information and display it on the client 30 for the user to read aloud; the client 30 can then record the multimedia data of the user reading aloud and upload the multimedia data to the server 20.
  • the server 20 may perform subsequent user authentication based on the multimedia data.
  • the above-mentioned method for subsequent user authentication based on multimedia data may also be performed on the client 30 .
  • the random text information can be a random number of a target length, for example, a four-digit random number, and a certain strategy can be used to avoid the continuous occurrence of the same number, so as to reduce the difficulty of identification.
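  • The publication does not specify the generation strategy; the sketch below is a minimal illustrative assumption in which each digit is drawn at random and redrawn whenever it would repeat its predecessor, so the same number never appears consecutively.

```python
import random

def generate_challenge(length: int = 4) -> str:
    """Generate a random digit string with no two identical adjacent digits."""
    digits = [random.choice("0123456789")]
    while len(digits) < length:
        d = random.choice("0123456789")
        if d != digits[-1]:  # avoid consecutive occurrences of the same number
            digits.append(d)
    return "".join(digits)

# e.g. a four-digit challenge such as "4725" for the user to read aloud
print(generate_challenge(4))
```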
  • in one type of attack video, the person in the multimedia data only completes the mouth movements without making a sound, while another person reads the target numbers aloud off-camera.
  • this embodiment comprehensively analyzes the multimedia data based on the speech recognition module 21 and the lip language recognition module 22 to obtain the voice information and the lip language information, obtains the offset information of the multimedia data by analyzing the voice information and the lip language information, and verifies, based on the offset information, whether the multimedia data comes from a living body.
  • the living body detection solution in this embodiment can effectively prevent the above attack videos and improve the security of living body verification.
  • FIG. 3 is a method for detecting a living body according to an embodiment of the present application.
  • the method can be executed by the electronic device 1 shown in FIG. 1 and can be applied to the living body verification scenario shown in FIG. 2 to accurately detect whether the multimedia data comes from a living body, which can improve the security of living body verification.
  • taking the server 20 executing the method as an example, the method includes the following steps: Step 301: Acquire multimedia data to be detected.
  • the multimedia data can be the real-time video data of the user to be verified; for example, it can be recorded based on the random text content generated by the server 20 for the user to read aloud.
  • the random text content can be a four-digit random number, and a certain strategy can be used to prevent the same number from appearing consecutively, so as to reduce the difficulty of recognition. Taking the random number as an example, the user reads the acquired four-digit random number aloud, completes the recording of the multimedia data, and uploads it to the server 20.
  • if the method is executed by the user terminal, the user terminal does not need to upload the multimedia data after acquiring it.
  • Step 302 Extract audio data and video data in the multimedia data.
  • the server 20 can extract the audio data from the video material uploaded by the user, specifying the audio sampling rate during extraction, and can read the video frames at the video frame rate as the video data.
  • the audio data may include voice information
  • the video data may include image information of the user's lip language action.
  • the audio data in the multimedia data can be extracted according to a preset audio sampling rate.
  • the preset audio sampling rate can be specified by the server 20, and should be such that the extracted audio accurately retains the relevant speech features of the original multimedia data for subsequent calculation.
  • the video data in the multimedia data can be read according to a preset video frame rate.
  • the preset video frame rate may be the frame rate at which the server 20 reads the video data, and the video frame rate needs to ensure that the read video data retains the video features in the original multimedia data for subsequent calculation.
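  • As an illustration of this step (a sketch under assumed tooling, not part of the publication), the audio track can be extracted at a preset sampling rate with the ffmpeg command-line tool and the video frames read with OpenCV; the file paths and the 16 kHz sampling rate below are assumptions.

```python
import subprocess
import cv2  # OpenCV, e.g. pip install opencv-python

AUDIO_SAMPLING_RATE = 16000  # preset audio sampling rate (assumed value)

def extract_audio(video_path: str, wav_path: str) -> None:
    """Extract the audio track as mono PCM WAV at the preset sampling rate."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-ar", str(AUDIO_SAMPLING_RATE), "-ac", "1", wav_path],
        check=True,
    )

def read_video_frames(video_path: str):
    """Read all video frames and the frame rate (fps) used for later timing."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames, fps
```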
  • Step 303 Perform voice recognition on the audio data to obtain voice information, and perform lip language recognition on the video data to obtain lip language information.
  • speech recognition may be performed frame by frame on the audio data of the user reading the four-digit random number aloud, so as to obtain the voice information.
  • lip language recognition can be performed frame by frame on the video data of the user reading a four-digit random number, that is, the lip language action of the user in the video image is recognized, and the lip language information is obtained.
  • Step 304 Obtain the offset information between the audio data and the video data by analyzing the voice information and the lip language information, and verify whether the multimedia data comes from a living body based on the offset information.
  • the audio-visual synchronization characteristics of the voice information and the lip language information can be comprehensively analyzed to obtain the offset information of the multimedia data, and it can be verified based on the offset information whether the multimedia data comes from a living body.
  • the above living body detection method extracts the audio data and video data from the multimedia data, performs speech recognition on the audio data and lip language recognition on the video data to obtain the voice information and the lip language information, obtains the offset information of the multimedia data by analyzing the voice information and the lip language information, and then verifies, based on the offset information, whether the multimedia data comes from a living body. In this way, no large number of sample annotations is needed, which saves detection cost; the characteristics of the voice information and the lip language information are considered comprehensively, which improves the accuracy of liveness detection; and the above attack videos can be effectively prevented, improving the security of living body verification.
  • FIG. 4 is a living body detection method according to an embodiment of the present application.
  • the method can be executed by the electronic device 1 shown in FIG. 1 and can be applied to the living body verification scenario shown in FIG. 2 to accurately detect whether the multimedia data comes from a living body, improving the security of living body verification.
  • the method includes the following steps:
  • Step 401 Acquire multimedia data to be detected. For details, refer to the description of step 301 in the above embodiment.
  • Step 402 Extract audio data and video data in the multimedia data. For details, refer to the description of step 302 in the above embodiment.
  • Step 403 Perform speech recognition on the audio data frame by frame to obtain audio element information of the audio data.
  • speech recognition may be performed frame by frame on the audio data obtained in step 402 to obtain text information of the random number read aloud by the user.
  • the speech recognition process may be as follows:
  • step 4c) Test the speech recognition model obtained in step 4b) with the test set to measure the performance of the model.
  • Step 404 Extract the audio start frame sequence and audio end frame sequence of each element in the audio element information, and the voice information may include: audio element information, audio start frame sequence, and audio end frame sequence.
  • the audio element information obtained above may include at least the audio start frame sequence and the audio end frame sequence of each element; for example, the audio start frame sequence and the audio end frame sequence of each random number are extracted from the audio element information.
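  • As an illustration of how the start and end frame sequences might be extracted, the sketch below collapses a hypothetical per-frame recognition output (one label per frame, None for silence) into (element, start frame, end frame) runs; the per-frame label format is an assumption, since the publication does not fix a model interface. The same grouping applies to the frame-by-frame lip language recognition of steps 405 and 406 below.

```python
from typing import List, Optional, Tuple

def extract_frame_sequences(
    frame_labels: List[Optional[str]],
) -> List[Tuple[str, int, int]]:
    """Group per-frame labels into (element, start_frame, end_frame) runs."""
    segments = []
    current, start = None, 0
    for i, label in enumerate(frame_labels):
        if label != current:
            if current is not None:
                segments.append((current, start, i - 1))
            current, start = label, i
    if current is not None:
        segments.append((current, start, len(frame_labels) - 1))
    return segments

# frames labeled [None, None, "1", "1", "1", None, "2", "2"]
# -> [("1", 2, 4), ("2", 6, 7)]
```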
  • Step 405 Perform lip language recognition on the video data frame by frame, and obtain lip language element information of the video data.
  • lip language recognition can be performed frame by frame on the video data obtained in step 402 to obtain lip language element information.
  • the process of lip language recognition may be as follows:
  • step 6c) Test the lip language recognition model obtained in step 6b) with the test set to measure the performance of the model.
  • Step 406 Extract the video start frame sequence and the video end frame sequence of each element in the lip language element information.
  • the lip language information may include: lip language element information, video start frame sequence and video end frame sequence.
  • the above lip language element information may at least include the video start frame sequence and the video end frame sequence of each element; for example, the video start frame sequence and the video end frame sequence of each number read by the user are extracted from the lip language element information.
  • the execution order of steps 403 to 404 and steps 405 to 406 is not limited.
  • Step 407 Perform data standardization processing on the voice information and generate an audio element string of the target length based on the audio element information; perform data standardization processing on the lip language information and generate a lip language element string of the target length based on the lip language element information.
  • the multimedia data recorded by the user may exist in various formats, and its content may also be complicated.
  • as described above, the server 20 may first generate random text information, such as a four-digit random number, for the user to read aloud, and the multimedia data during the reading is then recorded.
  • the audio element information and the lip language element information need to be standardized into a digital string with a fixed length.
  • the target length here is the length of the random number generated by the server 20.
  • the random number generated by the server 20 is four digits, and the target length here is four digits.
  • step 407 may specifically include: converting the audio element information into an audio element string of the target length, and converting the lip language element information into a lip language element string of the target length; identifying the number of digits of the audio element string and of the lip language element string respectively; when the identified number of digits is less than the first threshold, outputting a recognition error; when the identified number of digits is greater than or equal to the first threshold and less than the second threshold, substituting the first preset value for the missing digits; and when the identified number of digits is greater than or equal to the second threshold, extracting the accurately matching digits with a matching algorithm based on the content of the audio element information and the lip language element information.
  • taking the four-digit case as an example, error results with fewer than three digits can be filtered out, and missing digits can be marked with -1 instead. If the number of digits exceeds four, the accurately recognized digits are determined by the matching algorithm, and inaccurate recognitions are likewise replaced by -1.
  • the data standardization process may be as follows: first, the audio element information is converted into a four-digit audio element string and the lip language element information is converted into a four-digit lip language element string, and the numbers of digits of the audio element string and the lip language element string are determined respectively. When the number of digits is less than three, it is judged as a recognition error and the verification process is terminated. When the number of digits is equal to three, -1 is substituted for the missing digit. When the number of digits is exactly four, the recognition result is output directly. When the number of digits is greater than four, the exactly matching digits are extracted with the matching algorithm based on the content of the text information.
  • the missing digits are replaced by -1.
  • for example, if the content of the audio element information or lip language element information is the five-digit number (12345) and the four-digit random number generated by the server 20 is (1234), then (1234) can be extracted from (12345) by content and digit count as the standardization result.
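  • A minimal sketch of this standardization logic for the four-digit case follows, assuming a first threshold of three digits, a second threshold of five digits and -1 as the first preset value; aligning an over-long result against the target with difflib stands in for the unnamed matching algorithm.

```python
import difflib
from typing import List, Optional

TARGET_LEN = 4        # length of the server-generated random number
FIRST_THRESHOLD = 3   # fewer digits than this: recognition error
SECOND_THRESHOLD = 5  # this many digits or more: use the matching algorithm
MISSING = -1          # first preset value marking missing digits

def standardize(recognized: str, target: str) -> Optional[List[int]]:
    """Standardize a recognized digit string to the target length."""
    n = len(recognized)
    if n < FIRST_THRESHOLD:
        return None  # recognition error: terminate verification
    if n < SECOND_THRESHOLD:
        # 3 or 4 digits: keep them and mark any missing positions
        out = [int(c) for c in recognized[:TARGET_LEN]]
        return out + [MISSING] * (TARGET_LEN - len(out))
    # Too many digits: keep only positions that match the target exactly,
    # leaving inaccurately recognized positions marked as MISSING.
    out = [MISSING] * TARGET_LEN
    matcher = difflib.SequenceMatcher(a=target, b=recognized)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            out[block.a + k] = int(recognized[block.b + k])
    return out

# standardize("12345", "1234") -> [1, 2, 3, 4]
# standardize("123", "1234")   -> [1, 2, 3, -1]
```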
  • Step 408 Compare the audio element string and the lip language element string with the target string respectively, and when both the audio element string and the lip language element string match the semantics of the target string, calculate the offset information of the multimedia data based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence and the video end frame sequence.
  • step 408 may include:
  • S81 For the audio element string and the lip language element string, calculate the start time difference between the audio start time and the video start time of each element character, and the end time difference between the audio end time and the video end time of each element character.
  • an element character is a pronunciation element in the text content; if the four-digit random number is (1234), then 1, 2, 3 and 4 are the four element characters.
  • the audio start time and audio end time (in ms) of each element character can be obtained from the frame sequences as audio_start = audio_fstart / audio_sampling_rate * 1000 and audio_end = audio_fend / audio_sampling_rate * 1000, where audio_fstart is the audio start frame sequence of each element character in the audio element string, audio_fend is the audio end frame sequence of each element character in the audio element string, and audio_sampling_rate is the audio sampling rate.
  • the video start time and video end time (in ms) of each element character can be obtained as lip_start = lip_fstart / fps * 1000 and lip_end = lip_fend / fps * 1000, where lip_fstart is the video start frame sequence of each element character in the lip language element string, lip_fend is the video end frame sequence of each element character in the lip language element string, and fps is the preset video frame rate.
  • abs() is the absolute value function.
  • S82 Calculate the time difference average of the start time difference and the end time difference of each element character.
  • diff_time = (abs(lip_start - audio_start) + abs(lip_end - audio_end)) / 2.
  • diff_time is the offset formula (unit: ms); its function is to return the time interval between two time variables, that is, to calculate the time difference between two moments, and the result of diff_time represents the average time difference of each element character.
  • S83 Calculate the offset average of the time difference averages of all element characters; the offset information is the offset average.
  • that is, the time difference averages of all the element characters are averaged.
  • the following formula can be used to calculate the offset average: offset = (diff_time[0] + diff_time[1] + diff_time[2] + diff_time[3]) / 4
  • diff_time[0] represents the average time difference of the first digit, diff_time[1] that of the second digit, diff_time[2] that of the third digit, and diff_time[3] that of the fourth digit.
  • in this way, the offset average of the multimedia data, namely the offset information, is calculated based on the audio element string and the lip language element string.
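  • Putting S81 to S83 together (a sketch, not the publication's code), the offset average in milliseconds can be computed from the per-character frame sequences as below; the segment-list input format follows the grouping sketch above, and equal-length, same-order segment lists are assumed.

```python
def offset_average(
    audio_segments,  # [(char, audio_fstart, audio_fend), ...] per element character
    video_segments,  # [(char, lip_fstart, lip_fend), ...] per element character
    audio_sampling_rate: float,
    fps: float,
) -> float:
    """Average audio/lip timing difference (ms) over all element characters."""
    diffs = []
    for (_, a_fs, a_fe), (_, l_fs, l_fe) in zip(audio_segments, video_segments):
        audio_start = a_fs / audio_sampling_rate * 1000  # frame index -> ms
        audio_end = a_fe / audio_sampling_rate * 1000
        lip_start = l_fs / fps * 1000
        lip_end = l_fe / fps * 1000
        # S81/S82: average the start and end time differences per character
        diff_time = (abs(lip_start - audio_start) + abs(lip_end - audio_end)) / 2
        diffs.append(diff_time)
    # S83: the offset information is the mean over all element characters
    return sum(diffs) / len(diffs)

# Step 409 below then checks the result against a target offset range obtained
# from test data statistics, e.g. is_live = offset <= MAX_OFFSET_MS (assumed).
```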
  • Step 409 Determine whether the offset information is within the target offset range. If yes, go to step 410 , otherwise go to step 411 .
  • the target offset range can be obtained through statistics on actual test data, and it characterizes the multimedia data recorded by a living body.
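  • As an illustration of deriving the range from test data statistics (the concrete statistic is not specified in the publication, so the empirical-quantile choice below is an assumption), offsets measured on known live recordings can be aggregated and an upper bound taken from their distribution.

```python
from typing import List, Tuple

def target_offset_range(live_offsets: List[float], quantile: float = 0.95) -> Tuple[float, float]:
    """Derive a target offset range (ms) from offsets of known live samples."""
    ordered = sorted(live_offsets)
    upper = ordered[int(quantile * (len(ordered) - 1))]  # empirical quantile
    return (0.0, upper)

# e.g. offsets measured on genuine live recordings during testing
lower, upper = target_offset_range([80.0, 95.0, 120.0, 60.0, 110.0])
```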
  • Step 410 Output that the multimedia data comes from a living body.
  • if the offset information is within the target offset range, it means that the offset of the multimedia data is small enough to be consistent with multimedia data generated by the actual behavior of a living body, and it is output that the multimedia data comes from a living body.
  • Step 411 Output that the multimedia data does not come from a living body.
  • if the offset information is not within the target offset range, it means that the current multimedia data may not result from the behavior of a living body, or may be maliciously synthesized attack data; it is then output that the multimedia data does not come from a living body, the verification fails in the living body verification scenario shown in FIG. 2, and a warning can be issued.
  • the above living body detection method significantly improves the accuracy of living body detection, reduces the missed detection rate, provides a degree of fault tolerance for videos in which the audio and video are slightly out of sync, and saves the original cost of labeling a large number of audio-video asynchronous videos.
  • FIG. 5 is a living body detection apparatus 500 according to an embodiment of the present application.
  • the apparatus can be applied to the electronic device 1 shown in FIG. 1 and to the living body verification scenario shown in FIG. 2 to accurately detect whether multimedia data comes from a living body, improving the security of living body verification.
  • the apparatus may include: an acquisition module 501, an extraction module 502, a recognition module 503 and a parsing module 504; the relationships among the modules are as follows:
  • the acquiring module 501 is configured to acquire multimedia data to be detected. For details, refer to the description of step 301 in the above embodiment.
  • the extraction module 502 is configured to extract audio data and video data in the multimedia data. For details, refer to the description of step 302 in the above embodiment.
  • the recognition module 503 is configured to perform speech recognition on audio data to obtain voice information, and perform lip language recognition on video data to obtain lip language information. For details, refer to the description of step 303 in the above embodiment.
  • the parsing module 504 is configured to parse and obtain offset information between the audio data and the video data according to the voice information and the lip language information, and verify whether the multimedia data comes from a living body based on the offset information. For details, refer to the description of step 304 in the above embodiment.
  • the recognition module 503 may be configured to: perform speech recognition on the audio data frame by frame, and obtain audio element information of the audio data.
  • the audio start frame sequence and audio end frame sequence of each element in the audio element information are extracted, and the voice information may include: audio element information, audio start frame sequence and audio end frame sequence.
  • the recognition module 503 may be configured to: perform lip language recognition on the video data frame by frame, and obtain lip language element information of the video data.
  • the video start frame sequence and the video end frame sequence of each element in the lip language element information are extracted, and the lip language information may include: lip language element information, video start frame sequence and video end frame sequence.
  • the parsing module 504 may be configured to: perform data standardization processing on the voice information and generate an audio element string of a target length based on the audio element information, and perform data standardization processing on the lip language information and generate a lip language element string of the target length based on the lip language element information.
  • the audio element string and the lip language element string can be compared with the target string respectively, and when both match the semantics of the target string, the offset information of the multimedia data is calculated based on the audio element string, the audio start frame sequence, the audio end frame sequence, the lip language element string, the video start frame sequence and the video end frame sequence.
  • the offset information of the multimedia data is calculated, which may include: : For the audio element string and the lip language element string, calculate the start time difference between the audio start time and video start time of each element character, and calculate the audio end time and video end time of each element character respectively Termination time difference between times. Calculate the average time difference between the start time difference and the end time difference of each element character. Calculate the offset average value of the time difference average value of all element characters, and the offset information can be the offset average value.
  • the data standardization processing performed on the voice information to generate an audio element string of a target length based on the audio element information, and on the lip language information to generate a lip language element string of the target length based on the lip language element information, may include: converting the audio element information into an audio element string of the target length, and converting the lip language element information into a lip language element string of the target length. The numbers of digits of the audio element string and the lip language element string can be recognized respectively; when the recognized number of digits is less than the first threshold, a recognition error can be output.
  • when the recognized number of digits is greater than or equal to the first threshold and less than the second threshold, the missing digits may be replaced by the first preset value.
  • when the recognized number of digits is greater than or equal to the second threshold, a matching algorithm can be used to extract the accurately matching digits.
  • the parsing module 504 may be further configured to: determine whether the offset information is within the target offset range; if the offset information is within the target offset range, output that the multimedia data comes from a living body; otherwise, output that the multimedia data does not come from a living body.
  • the embodiments of the present application also provide a non-transitory electronic-device-readable storage medium, which may include: a program which, when run on the electronic device, can cause the electronic device to execute all or part of the processes of the methods in the foregoing embodiments.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like.
  • the storage medium may also include a combination of the aforementioned kinds of memories.
  • the present application provides a living body detection method, apparatus, device and storage medium.
  • the method includes: acquiring multimedia data to be detected; extracting audio data and video data from the multimedia data; performing speech recognition on the audio data to obtain voice information, and performing lip language recognition on the video data to obtain lip language information; analyzing the voice information and the lip language information to obtain offset information between the audio data and the video data; and verifying, based on the offset information, whether the multimedia data comes from a living body.
  • the present application significantly improves the accuracy of living body detection, reduces the missed detection rate, and provides a degree of fault tolerance for videos in which the audio and video are slightly out of sync, saving the original cost of labeling a large number of audio-video asynchronous videos.
  • the liveness detection method, apparatus, device and storage medium of the present application are reproducible and can be used in a variety of industrial applications.
  • the liveness detection method, apparatus, device and storage medium of the present application can be used in an application scenario of liveness verification based on lip language video.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Biomedical Technology (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A living body detection method and apparatus, a device and a storage medium are disclosed. The method comprises: acquiring multimedia data to be detected; extracting audio data and video data from the multimedia data; performing speech recognition on the audio data to obtain voice information, and performing lip language recognition on the video data to obtain lip language information; and parsing the voice information and the lip language information to obtain offset information between the audio data and the video data, and verifying, based on the offset information, whether the multimedia data comes from a living body. The accuracy of living body detection is significantly improved, the missed detection rate is reduced, and a degree of fault tolerance is provided for videos whose audio and picture are slightly out of sync, so the original cost of annotating a large number of audio-picture-asynchronous videos is saved.
PCT/CN2021/120422 2020-12-29 2021-09-24 Living body detection method and apparatus, device and storage medium WO2022142521A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011587469.1A CN112733636A (zh) 2020-12-29 2020-12-29 Living body detection method, apparatus, device and storage medium
CN202011587469.1 2020-12-29

Publications (1)

Publication Number Publication Date
WO2022142521A1 (fr)

Family

ID=75607094

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120422 WO2022142521A1 (fr) 2020-12-29 2021-09-24 Procédé et appareil de détection d'état vivant, dispositif et support de stockage

Country Status (2)

Country Link
CN (1) CN112733636A (fr)
WO (1) WO2022142521A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733636A (zh) * 2020-12-29 2021-04-30 北京旷视科技有限公司 Living body detection method, apparatus, device and storage medium
CN113810680A (zh) * 2021-09-16 2021-12-17 深圳市欢太科技有限公司 Audio synchronization detection method and apparatus, computer-readable medium and electronic device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834900B (zh) * 2015-04-15 2017-12-19 常州飞寻视讯信息科技有限公司 Method and system for living body detection combining audio and image signals
CN109409204B (zh) * 2018-09-07 2021-08-06 北京市商汤科技开发有限公司 Anti-counterfeiting detection method and apparatus, electronic device, and storage medium
CN110585702B (zh) * 2019-09-17 2023-09-19 腾讯科技(深圳)有限公司 Audio-picture synchronization data processing method, apparatus, device and medium
CN110704683A (zh) * 2019-09-27 2020-01-17 深圳市商汤科技有限公司 Audio and video information processing method and apparatus, electronic device and storage medium
CN111881726B (zh) * 2020-06-15 2022-11-25 马上消费金融股份有限公司 Living body detection method and apparatus, and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376250A (zh) * 2014-12-03 2015-02-25 优化科技(苏州)有限公司 Real-person living body identity verification method based on sound, shape and image features
CN105426723A (zh) * 2015-11-20 2016-03-23 北京得意音通技术有限责任公司 Identity authentication method and system based on voiceprint recognition, face recognition and synchronized living body detection
CN108038443A (zh) * 2017-12-08 2018-05-15 深圳泰首智能技术有限公司 Method and apparatus for witnessing service test results
CN108124488A (zh) * 2017-12-12 2018-06-05 福建联迪商用设备有限公司 Payment authentication method and terminal based on face and voiceprint
CN112733636A (zh) * 2020-12-29 2021-04-30 北京旷视科技有限公司 Living body detection method, apparatus, device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115209175A (zh) * 2022-07-18 2022-10-18 忆月启函(盐城)科技有限公司 Voice transmission method and system
CN115209175B (zh) * 2022-07-18 2023-10-24 深圳蓝色鲨鱼科技有限公司 Voice transmission method and system

Also Published As

Publication number Publication date
CN112733636A (zh) 2021-04-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913266

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM1205A DATED 12/10/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21913266

Country of ref document: EP

Kind code of ref document: A1