CN113113021A

CN113113021A - Voice biological recognition authentication real-time detection method and system

Info

Publication number: CN113113021A
Application number: CN202110396974.6A
Authority: CN
Inventors: 张寅�; 张翼
Original assignee: Effective Software Technology Shanghai Co ltd
Current assignee: Effective Software Technology Shanghai Co ltd
Priority date: 2021-04-13
Filing date: 2021-04-13
Publication date: 2021-07-13

Abstract

The invention provides a real-time detection method and a real-time detection system for voice biometric recognition authentication. The detection method comprises the following steps: step S1: generating an audio file for voice verification; step S2: the voice biometric recognition server receives the audio file generated in step S1; step S3: after being processed by the voice biometric recognition server, the audio file of step S2 is sent to the watermark processing module in the voice biometric recognition server for watermark detection. The system comprises: an identity authentication client for initiating an identity authentication request; a speaker for playing the audio watermark; and a voice biometric recognition server, to which the audio file is sent for processing after format conversion.

Description

Voice biological recognition authentication real-time detection method and system

Technical Field

The invention belongs to the field of voice biological recognition systems, and particularly relates to a voice biological recognition authentication real-time detection method and system.

Background

Voice biometric recognition (also known as speaker recognition) compares the voice model of an enrolled user with the voice model derived from a matching request and returns a probability score that the two speech samples come from the same person, similar in principle to fingerprint or facial recognition. Voice biometric recognition falls into two categories: text-dependent and text-independent. In text-dependent mode, the enrolled phrase, such as "my voice is my password", must be the same as the phrase spoken at authentication. In text-independent mode, the enrollment speech may differ from the authentication speech, but enrollment takes longer than in text-dependent mode. Text-dependent voice biometric systems are typically used for identity verification, such as application login or access control. They are, however, susceptible to a type of spoofing attack known as a playback or replay attack, in which an authentication phrase spoken by a legitimate user is captured by a recording device and then played through a speaker to defeat the text-dependent system. The system judges the audio from the speaker to come from the legitimate user and allows access to the application.

To combat such attacks, text-dependent systems are typically equipped with a real-time detection module containing algorithms that determine whether the audio sample comes from a live user or from a loudspeaker. However, such algorithms are statistical classifiers and therefore cannot reliably prevent replay attacks, especially when both the recording and the playback speaker are of high quality.

In addition, some text-dependent systems use an element of randomness, such as random digits, to ensure that the sample is live. For example, the system may prompt the user to speak a unique eight-digit random number sequence. In theory, a recording of an authentication session is then useless, because the next session will prompt a new set of digits. The drawback of this approach is that it requires speech recognition so that the system can verify that the spoken elements match the prompted elements. Because speech recognition is not perfectly accurate, it becomes unreliable when a user speaks with an accent or in an unsupported language or dialect, which severely reduces the overall accuracy of the voice biometric system and limits the number of people who can use voice biometrics as an authentication channel.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a voice biometric authentication real-time detection method and system that address the problem described in the background: a biometric system cannot reliably determine whether a received audio sample was captured live or was pre-recorded.

The invention provides a voice biological identification authentication real-time detection method, which comprises the following steps:

step S1: generating an audio file for voice verification;

step S11: the authentication client initiates an authentication request;

step S12: outputting the sound watermark by a loudspeaker;

the step S11 and the step S12 occur simultaneously or sequentially in time domain;

step S2: the voice biological recognition server receives the audio file generated in the step S1;

step S3: after being processed by the voice biometric recognition server, the audio file of step S2 is sent to a watermark processing module in the voice biometric recognition server for watermark detection.

In an embodiment of the present invention, the audio file generated in step S1 further includes a voice generated when the authentication request is executed in step S11 and a sound watermark generated when step S12 is executed, and the voice and the sound watermark are superimposed.

In one embodiment of the present invention, the audio file in step S1 is further processed and converted into a file type that can be recognized by the speech biometric recognition server, and then transmitted to the speech biometric recognition server.

In one embodiment of the present invention, an algorithm module in the watermark processing module detects the watermark in the file that the watermark processing module receives.

In one embodiment of the present invention, the method further comprises the following steps:

step S4: the watermark detected in step S3 is compared with the perturbation provided by the client;

step S5: and returning the detection result to the voice biological recognition server.

In one embodiment of the present invention, if the detected watermark is consistent with the perturbation, the signal is judged to be a real-time signal and this result is returned to the voice biometric recognition server; if the detected watermark is inconsistent with the perturbation, the signal is judged to be a replay signal and this result is returned to the voice biometric recognition server.

In one embodiment of the present invention, the watermark is determined to be inconsistent with the perturbation on the basis of the time at which the watermark occurs, the watermark length, or a combination of these.

The invention further provides a voice biometric authentication real-time detection system that applies the above real-time detection method, the system comprising:

the identity authentication client is used for starting an identity authentication request;

a speaker for playing the audio watermark;

and a voice biometric recognition server, to which the audio file is sent for processing after format conversion.

In one embodiment of the present invention, the voice biometric server further includes a watermark processing module, and the watermark processing module is configured to detect a watermark.

The invention has the beneficial effects that:

(1) The invention achieves real-time detection without adding friction to the user experience.

(2) The invention protects each voice authentication session against being recorded for reuse, thereby reducing the risk of playback attacks.

(3) The voiceprint watermark ensures the uniqueness of the voice password and improves the safety of user identity authentication and the confidentiality of user biological characteristic information.

Drawings

FIG. 1 is a schematic flow chart of a real-time detection method for voice biometric authentication according to the present invention;

FIG. 2 is a block diagram of a voice biometric authentication real-time detection system of the present invention.

Detailed Description

To facilitate understanding, the invention is described more fully below with reference to the accompanying drawings, in which several embodiments are shown. The invention may, however, be embodied in many different forms and is not limited to the embodiments described herein; these embodiments are provided so that the disclosure will be more thorough.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs; the terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention; as used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

For any playback attack to occur, the text-dependent authentication phrase must first be recorded from the target. It may be assumed that text-dependent phrases such as "my voice is my password" or "please verify my transaction" are not used in normal conversation outside the authentication context. Thus, the most likely opportunity for a recording device to capture such a phrase is during an authentication session.

The method described in the present invention shows how a voice biometric authentication system can detect a replay attack without using the conventional methods mentioned in the background. The method comprises the following steps:

step S1: generating an audio file for voice verification;

step S2: the voice biometric identification server receives the audio file generated in step S1;

step S3: the audio file in step S2 is processed by the speech biometric identification server, and then sent to the watermark processing module in the speech biometric identification server for watermark detection.

The substeps of step S1 are as follows:

step S11: the authentication client initiates an authentication request;

step S12: outputting the sound watermark by a loudspeaker;

step S11 and step S12 occur simultaneously or sequentially in time domain.

Further, the audio file generated at step S1 includes the voice generated when the authentication request is performed at step S11 and the sound watermark generated when step S12 is performed, which are superimposed.
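In practice the superposition happens acoustically, because the microphone picks up the voice and the speaker tone at the same time. The following is only a minimal software simulation of that mixing, useful for building test signals. The function name `mix_watermark`, its parameters, and the default amplitude are assumptions made for illustration and do not appear in the patent.

```python
import math


def mix_watermark(voice, sample_rate, freq_hz, start_s, duration_s, amplitude=0.1):
    """Return a copy of `voice` with a sine tone added over one time window.

    voice       : list of float samples (the spoken phrase)
    sample_rate : samples per second
    freq_hz     : tone frequency of the watermark
    start_s     : offset of the watermark from the start of the recording
    duration_s  : length of the watermark
    amplitude   : tone amplitude (assumed value, not taken from the patent)
    """
    mixed = list(voice)  # leave the caller's samples untouched
    first = round(start_s * sample_rate)
    last = min(len(mixed), first + round(duration_s * sample_rate))
    for n in range(first, last):
        phase = 2 * math.pi * freq_hz * (n - first) / sample_rate
        mixed[n] += amplitude * math.sin(phase)
    return mixed
```

Samples outside the watermark window are left unchanged, so the spoken phrase is only altered while the tone is playing.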

Further, the audio file of step S1 is processed and converted into a file type that can be recognized by the speech biometric recognition server, and then transmitted to the speech biometric recognition server.

Furthermore, an algorithm module in the watermark processing module detects the watermark in the file that the watermark processing module receives.

Further, the detection method of the present invention further comprises the steps of:

step S4: the watermark detected in step S3 is compared with the perturbation provided by the client;

In detail, if the detected watermark is consistent with the perturbation, the signal is judged to be a real-time signal and this result is returned to the voice biometric recognition server; if the detected watermark is inconsistent with the perturbation, the signal is judged to be a replay signal and this result is returned to the voice biometric recognition server.

The watermark is determined to be inconsistent with the perturbation on the basis of the time at which the watermark occurs, the watermark length, or a combination of these.

The real-time detection method is applied to the following voice biological identification authentication real-time detection system, and the real-time detection system comprises:

a speaker for playing the audio watermark;

and a voice biometric recognition server, to which the audio file is sent for processing after format conversion.

Furthermore, the voice biological recognition server also comprises a watermark processing module, and the watermark processing module is used for detecting the watermark.

If the authentication client emits a distinct signal that reaches all nearby recording devices, the authentication session is effectively watermarked, making any recording of it unusable for playback attacks.

Each time the authentication client initiates an authentication request, the speaker on the device plays an audio watermark that is captured by all nearby recorders. The audio watermark is played randomly, at a random time between the start and the end of the authentication request. When the final audio file is sent to the voice biometric recognition server for processing, the system also sends information describing the watermark, such as its sound signature, the moment at which it was played, and its timing relative to the beginning and end of the file.
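The client-side scheduling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `WatermarkPlan`, `plan_watermark`, and `build_metadata`, the metadata field names, and the length bounds of 0.3 to 0.5 seconds are all hypothetical, chosen to mirror the example values used elsewhere in the description.

```python
import random
from dataclasses import dataclass, asdict


@dataclass
class WatermarkPlan:
    frequency_hz: float  # tone frequency of the watermark
    start_s: float       # offset from the start of the recording
    duration_s: float    # how long the tone is played


def plan_watermark(session_length_s, rng=None, frequency_hz=14_000.0,
                   min_len_s=0.3, max_len_s=0.5):
    """Pick a random start time and length that fit inside the session."""
    if session_length_s <= max_len_s:
        raise ValueError("session is too short to hold a watermark")
    if rng is None:
        rng = random.SystemRandom()  # OS entropy, so the timing is hard to predict
    duration = rng.uniform(min_len_s, max_len_s)
    # The watermark must finish before the authentication request ends.
    start = rng.uniform(0.0, session_length_s - duration)
    return WatermarkPlan(frequency_hz, start, duration)


def build_metadata(plan):
    """Description of the watermark sent alongside the audio file.

    The field names are assumptions; the patent does not define a format.
    """
    return asdict(plan)
```

The server would later compare the watermark it finds in the audio against this description, so the client must send exactly the values it used when driving the speaker.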

The watermark processing module in the voice biometric recognition server searches the audio file for the watermark, according to the perturbation provided by the system, using a signal processing formula based on the Discrete Fourier Transform (DFT). If the audio file contains a single watermark that matches the signature provided by the device, the system may conclude that the audio file comes from a genuine user.
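One way such a DFT-based search could look is sketched below. It evaluates a single DFT component at the watermark frequency over short consecutive frames and reports each run of frames in which that component is strong. This is an assumption about how the search might be done, not the patent's algorithm; the names `tone_magnitude` and `find_watermarks`, the 50 ms frame length, and the threshold are hypothetical. The tone must lie below half the sample rate to be represented without aliasing, so a test signal for a 14 kHz tone would need a sample rate above 28 kHz.

```python
import cmath
import math


def tone_magnitude(samples, sample_rate, freq_hz):
    """Normalised magnitude of one DFT component at freq_hz over `samples`."""
    acc = 0j
    for n, x in enumerate(samples):
        acc += x * cmath.exp(-2j * math.pi * freq_hz * n / sample_rate)
    return 2.0 * abs(acc) / max(len(samples), 1)


def find_watermarks(samples, sample_rate, freq_hz, frame_s=0.05, threshold=0.1):
    """Return a (start_s, duration_s) pair for each run of frames with the tone.

    frame_s and threshold are assumed tuning values, not taken from the patent.
    """
    frame = max(1, round(frame_s * sample_rate))
    hits = []
    for start in range(0, len(samples) - frame + 1, frame):
        mag = tone_magnitude(samples[start:start + frame], sample_rate, freq_hz)
        hits.append(mag >= threshold)

    segments = []
    run_start = None
    for k, hit in enumerate(hits + [False]):  # sentinel closes a trailing run
        if hit and run_start is None:
            run_start = k
        elif not hit and run_start is not None:
            segments.append((run_start * frame_s, (k - run_start) * frame_s))
            run_start = None
    return segments
```

The time resolution of the result is one frame, so start times and lengths reported by this sketch are only accurate to about `frame_s` seconds.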

The signal processing formula is as follows:

where t0 is the time at which the watermark begins, t1 is the time at which the watermark ends, g(f) is the audio signal as a function of frequency, g(t) is the audio signal as a function of time, i is the imaginary unit (i² = -1), and f is the frequency, which is constrained by the sampling rate.
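These definitions are consistent with a Fourier transform evaluated over the watermark interval. A sketch of that form is given here as an assumption, not as the patent's exact expression; the hat notation and the sample-rate symbol f_s are introduced only for illustration.

```latex
\hat{g}(f) = \int_{t_0}^{t_1} g(t)\, e^{-2\pi i f t}\, dt, \qquad i^2 = -1

% Discrete counterpart for audio sampled at rate f_s,
% with n_0 = \lfloor t_0 f_s \rfloor and n_1 = \lfloor t_1 f_s \rfloor:
\hat{g}(f) \approx \frac{1}{f_s} \sum_{n=n_0}^{n_1} g\!\left(\frac{n}{f_s}\right) e^{-2\pi i f n / f_s}
```

A large magnitude of \hat{g}(f) at the watermark frequency over the interval from t0 to t1 would indicate that the tone is present in that interval.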

If another recording device is nearby, capturing the authentication phrase for a future playback attack, the watermark will also be present in its recording.

When the recording is used in a replay attack, the watermark processing module will detect two separate watermarks: one that matches the signature provided by the device for the current session and one that was already present in the recording. If multiple watermarks are detected, the watermark processing module should conclude that the authentication request may come from a playback source rather than from the legitimate speaker. The invention thus protects each voice authentication session against being recorded for reuse, thereby reducing the risk of playback attacks.
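The decision rule in this passage can be sketched as a small function. It is an illustration only: the name `classify_session`, the string results, and the 50 ms tolerance are hypothetical, and treating a missing watermark as a replay is an assumption drawn from the two outcomes the description names.

```python
def classify_session(detected, expected, tol_s=0.05):
    """Classify a session from the watermarks found in its audio.

    detected : list of (start_s, duration_s) pairs found in the audio
    expected : (start_s, duration_s) reported by the client for this session
    tol_s    : allowed timing error in seconds (assumed value)
    """
    def matches(segment):
        return (abs(segment[0] - expected[0]) <= tol_s
                and abs(segment[1] - expected[1]) <= tol_s)

    # Exactly one watermark, agreeing in start time and length: live user.
    if len(detected) == 1 and matches(detected[0]):
        return "live"
    # Extra watermarks, a mismatch, or no watermark at all: treat as replay.
    return "replay"
```

For example, `classify_session([(0.3, 0.5), (0.9, 0.4)], (0.9, 0.4))` reports a replay, because a second watermark is present in addition to the expected one.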

Specifically, with reference to FIG. 1 and FIG. 2, an example is as follows:

The authentication client initiates an authentication request and at the same time activates the speaker to broadcast a non-natural-sounding watermark, e.g., a 14 kHz sine wave played at 60 dB for 0.3 seconds, starting 1.0 second after the microphone is turned on. The client records this information, including the approximate timing, and transmits it to the server side. The audio is saved as a 16 kHz, 16-bit WAV file, converted to BASE64, and sent to the voice biometric recognition server for processing. A copy of the BASE64 data is then sent to a separate watermark processing module (WPM), whose algorithm searches for the watermark according to the perturbation provided by the client. If a watermark is detected and is consistent with the perturbation, the WPM returns a "real-time signal" result to the voice biometric recognition server. If other watermarks of different lengths are detected at different times, or the watermarks do not match the information in the metadata, the WPM returns a "replay signal" result to the voice biometric recognition server, which may then take corresponding action.
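The packaging step (16-bit WAV, then BASE64) can be sketched with the Python standard library. The function names `encode_wav_base64` and `decode_wav_base64`, the mono channel layout, and the float input range are assumptions for illustration; the patent specifies only the 16 kHz, 16-bit WAV format and the BASE64 conversion.

```python
import base64
import io
import struct
import wave


def encode_wav_base64(samples, sample_rate=16_000):
    """Pack floats in [-1, 1] as 16-bit mono PCM WAV and return BASE64 text."""
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono (assumed)
        wav.setsampwidth(2)            # 2 bytes per sample = 16 bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return base64.b64encode(buf.getvalue()).decode("ascii")


def decode_wav_base64(text):
    """Inverse of encode_wav_base64: returns (sample_rate, list of int samples)."""
    with wave.open(io.BytesIO(base64.b64decode(text)), "rb") as wav:
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    return rate, list(struct.unpack("<%dh" % (len(raw) // 2), raw))
```

The server side would decode the BASE64 text back to PCM samples before handing them to the watermark processing module.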

Assume that a genuine user is using a voice biometric recognition system that uses "my voice is my password" as its given phrase. When the user presses a button to initiate an authentication session, the authentication client selects a random point in time after the button press but before the end of the session, e.g., 0.3 seconds after the press, and triggers a speaker on the device to emit a unique, non-naturally occurring audio signature at a particular frequency (e.g., a 14 kHz sine wave) for 0.5 seconds, at a volume high enough to be recorded by all microphones in the vicinity, including the microphone on the authentication device. After the audio sample is sent to the server for processing, the watermark processing module searches the audio file for the specific watermark signature at the 0.3 second mark with a length of 0.5 seconds. The watermark processing module also looks for this 14 kHz sine wave at any other point in time in the audio sample.

Assume that the genuine user was recorded by a malicious party during that session and that the attacker uses the voice sample to compromise the genuine user's account. When the new authentication session is initiated, a new watermark is emitted 0.9 seconds after the button is pressed, for a length of 0.4 seconds. When the audio file arrives at the server for processing, however, the watermark processing module finds one watermark at the 0.9 second mark lasting 0.4 seconds and another at the 0.3 second mark lasting 0.5 seconds. The system may therefore conclude that the audio sample probably comes from a recorded source rather than a live user, because it contains a watermark that does not match what should have occurred. The watermarks may also overlap: for example, if the new watermark is emitted at the 0.4 second mark with a length of 0.7 seconds, the audio file contains a single watermark starting at the 0.3 second mark and lasting 0.8 seconds in total. The watermark processing module can detect that both the start time and the length of this watermark differ from the perturbation provided by the client application.
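The overlap arithmetic in this example can be checked with a short interval-merging sketch. The function name `merge_intervals` is hypothetical, and the sketch assumes that the replayed recording and the new session share the same time origin, as the example does.

```python
def merge_intervals(marks):
    """Merge overlapping watermarks given as (start_s, duration_s) pairs.

    Returns a list of (start_s, end_s) spans as they would appear in the audio.
    """
    spans = sorted((start, start + length) for start, length in marks)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With the recorded watermark (0.3 s start, 0.5 s length) and the new one (0.4 s start, 0.7 s length), the spans 0.3 to 0.8 and 0.4 to 1.1 merge into a single span from 0.3 to 1.1, which is 0.8 seconds long and matches neither the start time nor the length the client reported. With the non-overlapping pair from the earlier case (0.3 s and 0.9 s starts), the two spans stay separate.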

Assume that there is an injection attack in which the attacker is somehow able to bypass the integrity checks of the authentication client and inject the recording directly into the server. This type of attack would defeat any playback detection module, but not the watermark processing module, because the watermark description that the server receives from the client differs from the watermark in the injected audio. Since nothing is emitted during an injection, the injected audio contains only the original watermark, at the 0.3 second mark with a length of 0.5 seconds. The probability that the attacked authentication session is assigned a watermark description of exactly the 0.3 second mark and 0.5 second length is very small.

In a conventional playback attack, the initiation of the authentication session and the start of the recording are two separate manual operations, so the probability that the two watermarks overlap so closely that the watermark processing module cannot distinguish them is very small.

The embodiments above describe only certain implementations of the present invention. Although the description is specific and detailed, it is not to be construed as limiting the scope of the invention. Those skilled in the art can make several variations and modifications without departing from the concept of the invention, and these fall within its scope of protection. The scope of protection of this patent is therefore defined by the appended claims.

Claims

1. A voice biological identification authentication real-time detection method is characterized by comprising the following steps:

step S1: generating an audio file for voice verification;

step S11: the authentication client initiates an authentication request;

step S12: outputting the sound watermark by a loudspeaker;

2. The voice biometric authentication real-time detection method according to claim 1, wherein the audio file generated in the step S1 includes a voice generated when the authentication request is performed in the step S11 and a sound watermark generated when the step S12 is performed, which are superimposed.

3. The method for real-time detection of voice biometric authentication according to claim 2, wherein the audio file of step S1 is processed and converted into a file type recognizable by the voice biometric server, and then is transmitted to the voice biometric server.

4. The voice biometric authentication real-time detection method according to claim 3, wherein the watermark is detected from the file received by the watermarking module via an algorithm module in the watermarking module.

5. The voice biometric authentication real-time detection method according to claim 1, further comprising the steps of:

6. The voice biometric authentication real-time detection method according to claim 5, wherein if the detected watermark is consistent with the perturbation, the signal is judged to be a real-time signal and this result is returned to the voice biometric server; and if the detected watermark is inconsistent with the perturbation, the signal is judged to be a replay signal and this result is returned to the voice biometric server.

7. The voice biometric authentication real-time detection method according to claim 6, wherein the watermark being inconsistent with the perturbation comprises: the time when the watermark occurs, the watermark length, or a combination of one or more of these.

8. A voice biometric authentication system to which the voice biometric authentication real-time detection method according to any one of claims 1 to 7 is applied, the system comprising:

a speaker for playing the audio watermark;

9. The voice biometric authentication system of claim 8, wherein the voice biometric server comprises a watermarking module, the watermarking module configured to detect a watermark.