CN111194445A - Detection of replay attacks - Google Patents


Publication number
CN111194445A
Authority
CN
China
Prior art keywords
microphone
signal
frequency
source
component
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201880065597.4A
Other languages
Chinese (zh)
Inventor
J. P. Lesso
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cirrus Logic International Semiconductor Ltd
Original Assignee
Cirrus Logic International Semiconductor Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GBGB1803570.9A external-priority patent/GB201803570D0/en
Priority claimed from GBGB1804843.9A external-priority patent/GB201804843D0/en
Application filed by Cirrus Logic International Semiconductor Ltd filed Critical Cirrus Logic International Semiconductor Ltd
Publication of CN111194445A publication Critical patent/CN111194445A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G06F21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H04L63/0861 Network architectures or network communication protocols for network security for authentication of entities using biometrical features, e.g. fingerprint, retina-scan
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection

Abstract

To detect a replay attack on a voice biometric system, a speech signal is received at at least a first microphone and a second microphone. The speech signal has a component at a first frequency and a component at a second frequency. The detection method comprises the following steps: obtaining information about the location of the source of the first frequency component of the speech signal relative to the first microphone and the second microphone; obtaining information about the location of the source of the second frequency component of the speech signal relative to the first microphone and the second microphone; comparing the location of the source of the first frequency component with the location of the source of the second frequency component; and determining that the speech signal is likely to result from a replay attack if the location of the source of the first frequency component differs from the location of the source of the second frequency component by more than a threshold amount.

Description

Detection of replay attacks
Technical Field
Embodiments described herein relate to methods and apparatus for detecting replay attacks on voice biometric systems.
Background
Voice biometric systems are increasingly being used. In such systems, users train the system by providing samples of their speech during an enrollment phase. In subsequent use, the system is able to distinguish between registered users and unregistered speakers. Voice biometric systems can in principle be used to control access to a wide range of services and systems.
One way in which malicious parties attempt to defeat a voice biometric system is to obtain a recording of the speech of a registered user, and to play back that recording in an attempt to impersonate the registered user and gain access to services that are intended to be restricted to the registered user.
This is known as a replay attack or a spoofing attack.
Disclosure of Invention
According to a first aspect of the present invention, there is provided a method of detecting a replay attack on a voice biometric system, the method comprising: receiving a speech signal at at least a first microphone and a second microphone, wherein the speech signal has a component at a first frequency and a component at a second frequency; obtaining information about the location of the source of the first frequency component of the speech signal relative to the first microphone and the second microphone; obtaining information about the location of the source of the second frequency component of the speech signal relative to the first microphone and the second microphone; comparing the location of the source of the first frequency component with the location of the source of the second frequency component; and determining that the speech signal is likely to be from a replay attack if the location of the source of the first frequency component differs from the location of the source of the second frequency component by more than a threshold amount.
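The decision step of this first aspect can be illustrated with a minimal sketch. The function name, the angle-of-arrival inputs, and the 10-degree threshold below are all hypothetical illustrations; the patent does not specify any numeric values or this interface:

```python
def is_replay(angle_low_deg, angle_high_deg, threshold_deg=10.0):
    """Flag a likely replay attack when the apparent source locations of the
    low- and high-frequency components of the speech signal differ by more
    than a threshold. The angles are assumed to have been estimated
    separately for each component, e.g. from inter-microphone time
    differences; the 10-degree default is an arbitrary illustration."""
    return abs(angle_low_deg - angle_high_deg) > threshold_deg

# A live talker radiates all frequencies from one location...
assert not is_replay(12.0, 14.0)
# ...whereas a two-way loudspeaker separates the woofer and the tweeter.
assert is_replay(5.0, 25.0)
```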
According to another aspect of the present invention, there is provided a system for detecting a replay attack on a speaker recognition system, the system being configured to perform the method of the first aspect.
According to an aspect of the invention, there is provided a device comprising a system according to the preceding aspect. The device may comprise a mobile phone, an audio player, a video player, a mobile computing platform, a gaming device, a remote control device, a toy, a machine or home automation controller, or a household appliance.
According to an aspect of the invention, there is provided a computer program product comprising a computer readable tangible medium and instructions for performing the method according to the first aspect.
According to an aspect of the invention, there is provided a non-transitory computer-readable storage medium having stored thereon computer-executable instructions which, when executed by processor circuitry, cause the processor circuitry to perform the method according to the first aspect.
According to an aspect of the invention, there is provided an apparatus comprising a non-transitory computer-readable storage medium according to the previous aspect. The device may comprise a mobile phone, an audio player, a video player, a mobile computing platform, a gaming device, a remote control device, a toy, a machine or home automation controller, or a household appliance.
According to a second aspect of the present invention, there is provided a method of detecting a replay attack on a voice biometric system, the method comprising:
generating a first signal from sound received at a first microphone;
generating a second signal from sound received at a second microphone;
determining a location of an apparent source of the received sound using the first signal and the second signal; and
determining that the received sound is likely to be from a replay attack if the apparent source of the received sound is diffuse.
According to another aspect of the present invention, there is provided a system for detecting a replay attack on a voice biometric system, the system being configured to perform the method.
According to one aspect of the present invention, there is provided a method of detecting a replay attack on a speech recognition system, such as a speech biometric system, the method comprising:
generating a first signal from sound received at a first microphone;
generating a second signal from sound received at a second microphone;
determining a correlation function based on a correlation between the first signal and the second signal;
calculating the width of the central lobe of the determined correlation function; and
determining that the received sound is likely to be from a replay attack if the width of the central lobe of the determined correlation function exceeds a threshold.
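As a rough illustration of this aspect, the sketch below measures the central lobe of a cross-correlation as the distance between the first non-positive samples either side of its peak. This is a plain-NumPy illustration under assumed names and thresholds, not the patent's implementation:

```python
import numpy as np

def central_lobe_width(sig_a, sig_b):
    """Width, in samples, of the central lobe of the cross-correlation
    between two microphone signals, measured between the first
    non-positive samples on either side of the main peak."""
    corr = np.correlate(sig_a - np.mean(sig_a),
                        sig_b - np.mean(sig_b), mode="full")
    peak = int(np.argmax(corr))
    left = peak
    while left > 0 and corr[left] > 0:
        left -= 1
    right = peak
    while right < len(corr) - 1 and corr[right] > 0:
        right += 1
    return right - left

def likely_replay(sig_a, sig_b, width_threshold):
    """A diffuse source yields a broad central lobe; flag it if too wide."""
    return central_lobe_width(sig_a, sig_b) > width_threshold
```

A narrowband signal produces a broad lobe (for a 100 Hz tone sampled at 8 kHz the lobe spans roughly half a period, about 40 samples), whereas broadband sound from a point source correlates sharply.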
According to another aspect of the invention, there is provided a system for detecting a replay attack on a speech recognition system, such as a speech biometric system, the system being configured to perform the method.
According to one aspect of the present invention, there is provided a method of detecting a replay attack on a voice biometric system, the method comprising:
generating a first signal from sound received at a first microphone, wherein the first signal has a first component at a first frequency and a second component at a second frequency, and wherein the first frequency is higher than the second frequency;
generating a second signal from sound received at a second microphone, wherein the second signal has a first component at the first frequency and a second component at the second frequency;
determining a first correlation function based on a correlation between a first component of the first signal and a first component of the second signal;
calculating a width of a central lobe of the determined first correlation function;
determining a second correlation function based on a correlation between a second component of the first signal and a second component of the second signal;
calculating a width of a central lobe of the determined second correlation function; and
determining that the received sound is likely to be from a replay attack if the width of the central lobe of the second correlation function exceeds the width of the central lobe of the first correlation function by more than a threshold value.
According to another aspect of the present invention, there is provided a system for detecting a replay attack on a voice biometric system, the system being configured to perform the method.
According to another aspect of the invention, there is provided a device comprising a system according to any preceding aspect. The device may comprise a mobile phone, an audio player, a video player, a mobile computing platform, a gaming device, a remote control device, a toy, a machine or home automation controller, or a household appliance.
According to another aspect of the invention, there is provided a computer program product comprising a computer readable tangible medium and instructions for performing the method according to the second aspect.
According to another aspect of the present invention, there is provided a non-transitory computer-readable storage medium having stored thereon computer-executable instructions which, when executed by processor circuitry, cause the processor circuitry to perform the method according to the second aspect.
According to another aspect of the invention, there is provided an apparatus comprising a non-transitory computer-readable storage medium according to the previous aspect. The device may comprise a mobile phone, an audio player, a video player, a mobile computing platform, a gaming device, a remote control device, a toy, a machine or home automation controller, or a household appliance.
Drawings
For a better understanding of the present invention, and to show how the same may be carried into effect, reference will now be made to the accompanying drawings, in which:
FIG. 1 illustrates a smart phone;
FIG. 2 is a schematic diagram illustrating the form of a smart phone;
FIG. 3 illustrates a first scenario in which a replay attack is being performed;
FIG. 4 illustrates a second scenario in which a replay attack is being performed;
FIG. 5 shows a portion of FIG. 4 in more detail;
FIG. 6 illustrates sound transmission in the arrangement of FIG. 5;
FIG. 7 is a flow chart illustrating a method;
FIG. 8 is a block diagram illustrating a system for performing the method of FIG. 7;
FIG. 9 illustrates a stage in the method of FIG. 7;
FIG. 10 illustrates a stage in the method of FIG. 7;
FIG. 11 illustrates a stage in the method of FIG. 7;
FIG. 12 illustrates the result of performing the method of FIG. 7;
FIG. 13 illustrates another result of performing the method of FIG. 7;
FIG. 14 illustrates a third scenario in which a replay attack is being performed;
FIG. 15 illustrates sound transmission in the arrangement of FIG. 14;
FIG. 16 is a flow chart illustrating a method;
FIG. 17 is a block diagram illustrating a system for performing the method of FIG. 16;
FIG. 18 illustrates a first result of performing the method of FIG. 16; and
FIG. 19 illustrates a second result of performing the method of FIG. 16.
Detailed Description
The following description sets forth example embodiments in accordance with this disclosure. Other example embodiments and implementations will be apparent to those of ordinary skill in the art. Further, those of ordinary skill in the art will recognize that a variety of equivalent techniques may be applied in place of or in combination with the embodiments discussed below, and all such equivalents are to be considered encompassed by the present disclosure.
The methods described herein may be implemented in a wide variety of devices and systems, such as mobile phones, audio players, video players, mobile computing platforms, gaming devices, remote control devices, toys, machines, or home automation controllers, or home appliances. However, for ease of explanation of one implementation, an exemplary embodiment will be described in which the implementation occurs in a smartphone.
Fig. 1 illustrates a smartphone 10 having microphones 12, 12a and 12b for detecting ambient sounds. In normal use, the microphone 12 is of course used for detecting the speech of the user holding the smartphone 10, while the microphones 12a, 12b are arranged towards the top of the sides of the smartphone 10, and are therefore not clearly visible in fig. 1.
Fig. 2 is a schematic diagram illustrating the form of the smartphone 10.
In particular, fig. 2 shows a number of interconnected components of the smartphone 10. It should be understood that the smartphone 10 will in fact contain many other components, but the following description is sufficient for understanding the present invention.
Thus, fig. 2 shows the above mentioned microphone 12. In some embodiments, the smartphone 10 is provided with multiple microphones 12, 12a, 12b, etc.
Fig. 2 also shows a memory 14, which memory 14 may actually be provided as a single component or as multiple components. The memory 14 is provided for storing data and program instructions.
Fig. 2 also shows a processor 16, which processor 16 again may in fact be provided as a single component or as a plurality of components. For example, one component of the processor 16 may be an application processor of the smartphone 10.
Fig. 2 also shows a transceiver 18, which transceiver 18 is arranged to allow the smartphone 10 to communicate with an external network. For example, the transceiver 18 may include circuitry for establishing an internet connection via a WiFi local area network or via a cellular network.
Fig. 2 also shows audio processing circuitry 20 for performing operations on the audio signal detected by the microphone 12 as needed. For example, audio processing circuitry 20 may filter the audio signal or perform other signal processing operations.
Fig. 2 also shows at least one sensor 22. In an embodiment of the invention, the sensor is a magnetic field sensor for detecting a magnetic field. For example, the sensor 22 may be a Hall effect sensor capable of providing discrete measurements of magnetic field strength in three orthogonal directions. Other examples of sensors that may be used include a gyroscope sensor, an accelerometer, or a software-based sensor operable to determine the orientation of the phone, where such a software-based sensor may operate in conjunction with a software program such as the FaceTime™ system provided by Apple Inc.
In this embodiment, the smartphone 10 is provided with a voice biometric function and with a control function. Thus, the smart phone 10 is capable of performing a variety of functions in response to spoken commands from a registered user. The biometric function is able to distinguish spoken commands from registered users from the same command spoken by a different person. Accordingly, certain embodiments of the present invention relate to operating a smart phone or another portable electronic device with some voice operability, such as a tablet or laptop computer, a game console, a home control system, a home entertainment system, an in-vehicle entertainment system, a home appliance, etc., where voice biometric functions are performed in the device intended to execute spoken commands. Certain other embodiments relate to a system for performing voice biometric functions on a smart phone or other device that sends a command to a separate device if the voice biometric functions can confirm that the speaker is a registered user.
In some embodiments, while the voice biometric function is performed on the smart phone 10 or other device located near the user, the spoken command is transmitted using the transceiver 18 to a remote speech recognition system that determines the meaning of the spoken command. For example, the speech recognition system may be located on one or more remote servers in a cloud computing environment. A signal based on the meaning of the spoken command is then returned to the smartphone 10 or other local device. In other embodiments, the speech recognition system is also disposed on the smartphone 10.
One attempt to spoof a voice biometric system or automatic speech recognition system is to play a recording of the registered user's voice in a so-called replay attack or spoofing attack.
Thus, a method is described herein with reference to an embodiment in which it is desired to detect when sound has been played back through a loudspeaker rather than produced by a live human speaker. However, the method is equally applicable to other situations in which it is useful to detect whether sound comes from a point source or from a more diffuse source. One such embodiment is when it is desired to detect that sound received by an automatic speech recognition system was produced by a loudspeaker.
Fig. 3 illustrates one example of a situation in which a replay attack is being performed. Thus, in fig. 3, the smartphone 10 is provided with a voice biometric function. In this example, the smartphone 10 is, at least temporarily, in the possession of an attacker, who has another smartphone 30. The smartphone 30 has been used to record the voice of a registered user of the smartphone 10. The smartphone 30 is brought close to the microphone inlet 12 of the smartphone 10, and the recording of the registered user's voice is played back. If the voice biometric system is unable to detect that the registered user's voice that it detects is a recording, the attacker will gain access to one or more services that are intended to be accessible only by the registered user.
As is well known, smartphones such as the smartphone 30 are typically provided with loudspeakers of relatively low quality, owing to size constraints. Thus, a recording of a registered user's voice played back through such a loudspeaker will not be a perfect match for the user's voice, and this fact can be used to identify replay attacks. For example, the loudspeaker may have certain frequency characteristics which, if detected in the speech signal received by the voice biometric system, may be taken as indicating that the signal results from a replay attack.
Fig. 4 shows a second example of a situation in which a replay attack is being performed, in an attempt to defeat the detection method described above. Thus, in fig. 4, the smartphone 10 is provided with a voice biometric function. Again, in this example, the smartphone 10 is, at least temporarily, in the possession of an attacker, who has another smartphone 140. The smartphone 140 has been used to record the voice of a registered user of the smartphone 10.
In this example, the smartphone 140 is connected to a high-quality loudspeaker 150. The smartphone 10 is then positioned close to the loudspeaker 150, and the recording of the registered user's voice is played back through the loudspeaker 150. As before, if the voice biometric system is unable to detect that the registered user's voice that it detects is a recording, the attacker will gain access to one or more services that are intended to be accessible only by the registered user.
In this example, the loudspeaker 150 may be of sufficiently high quality that a recording of the registered user's voice played back through it cannot reliably be distinguished from the live user's voice, so the audio characteristics of the speech signal cannot be used to identify the replay attack.
However, it will be appreciated that many loudspeakers, and in particular high-quality loudspeakers, are electromagnetic loudspeakers, in which an electrical audio signal is applied to a voice coil located between the poles of a permanent magnet, causing the coil to move rapidly backwards and forwards. This movement of the coil causes a diaphragm attached to the coil to move backwards and forwards, thereby generating sound waves.
Fig. 5 illustrates the general form of one such loudspeaker device 150 in widespread use. In particular, the illustrated loudspeaker device 150 has two loudspeakers of the type described above, each with its own voice coil and diaphragm. The first of these loudspeakers is a woofer 152, which is intended to reproduce relatively low-frequency sounds, for example at frequencies up to 1 kHz or up to 2 kHz. The second is a tweeter 154, which is intended to reproduce relatively high-frequency sounds, for example at frequencies from 2 kHz up to the top of the audio frequency range, i.e. at least 20 kHz.
Note that there are also loudspeaker devices comprising more than two loudspeakers intended to reproduce different frequency ranges. The methods described herein can also be used to identify replay attacks that use such loudspeaker devices.
Fig. 6 shows a typical arrangement in which a speaker device 150 is being used to play back an utterance detected by the smartphone 10. Thus, fig. 6 shows that sound from the woofer 152 reaches the microphone 12 located at the bottom end of the smartphone 10, and also reaches the microphones 12a and 12b located at the top end of the smartphone 10. Fig. 6 also shows that sound from tweeter 154 reaches microphone 12 at the bottom end of smartphone 10, and also reaches microphones 12a and 12b at the top end of smartphone 10.
Thus, as can be seen from fig. 6, as viewed from the smartphone 10, the location of the source of the low-frequency sound from the woofer 152 differs from the location of the source of the high-frequency sound from the tweeter 154.
An understanding of this fact is used in the methods described herein.
Fig. 7 is a flowchart illustrating a method of detecting a replay attack on a voice biometric system, and fig. 8 is a block diagram illustrating functional blocks in the voice biometric system.
Thus, fig. 8 shows a voice biometric system 180, in which an audio signal generated by one or more of the microphones 12, 12a, 12b in response to detected ambient sounds is passed to a feature extraction block 182, which obtains features of any speech detected in the signal.
The extracted features are passed to a model comparison block 184, where they are compared with one or more models of the speech of registered users. For example, there may be only one registered user of the voice biometric system associated with a particular device 10. The extracted features of the detected speech are then compared with the model of that user's speech, in order to decide whether the detected speech should be considered to be the speech of the registered user.
In step 170 of the method of fig. 7, speech signals are received from at least a first microphone 12 and a second microphone 12a. In practice, the signals generated by the microphones 12, 12a may be passed to a voice activity detector, and only those sections of the signals that contain speech may be processed further.
The speech signal generated by the microphone 12 is passed to a first filter bank 186 and the speech signal generated by the microphone 12a is passed to a second filter bank 188. The filter banks 186, 188 extract components of the speech signal at least a first frequency and a second frequency.
For example, the filter banks 186, 188 may extract components in a first relatively narrow frequency band and components in a second relatively narrow frequency band. In this case, the two frequency bands may each have a bandwidth of, for example, 10 Hz to 200 Hz. The first frequency band may be centred on a frequency in the range 100 Hz to 1 kHz, for example around 200 Hz. The second frequency band may be centred on a frequency in the range 2 kHz to 15 kHz, for example 5 kHz.
Alternatively, the filter banks 186, 188 may extract components in a first relatively wide frequency band and components in a second relatively wide frequency band. In this case, the two frequency bands may each have a bandwidth of, for example, 200 Hz to 2 kHz. Again, the first frequency band may be centred on a frequency in the range 100 Hz to 1 kHz, for example around 200 Hz, and the second frequency band may be centred on a frequency in the range 2 kHz to 15 kHz, for example 5 kHz.
In other embodiments, more than two frequency components are extracted; for example, ten or more frequency components may be extracted. The filter bank may be implemented as a Fast Fourier Transform (FFT) block.
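Since the filter bank may be implemented as an FFT block, one possible NumPy sketch is the following. The band edges are illustrative picks from the ranges given above, not values specified by the patent:

```python
import numpy as np

def extract_bands(x, fs, bands=((100.0, 300.0), (4000.0, 6000.0))):
    """FFT-based filter bank: keep only the bins inside each band and
    transform back, giving one time-domain component per band. The band
    edges here are illustrative, not taken from the patent claims."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    components = []
    for lo, hi in bands:
        mask = (freqs >= lo) & (freqs <= hi)
        components.append(np.fft.irfft(spectrum * mask, n=len(x)))
    return components
```

Feeding in a 200 Hz tone plus a 5 kHz tone returns the two tones on separate outputs, which can then be processed independently by the location information derivation described next.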
In step 172 of the process shown in fig. 7, the extracted frequency components are passed to a location information derivation block 190. The position information derivation block 190 obtains information about the position of the source of the first frequency component of the speech signal relative to the first microphone and the second microphone.
In step 174 of the process shown in fig. 7, the position information derivation block 190 obtains information about the position of the source of the second frequency component of the speech signal relative to the first microphone and the second microphone.
In one embodiment, the location information derivation block 190 obtains information about the location of the source of each frequency component by determining the angle of arrival of that component at the first microphone and the second microphone.
Typically, a correlation method is used to determine the time difference between the two signals. In a preferred embodiment, the time delay of an arbitrary waveform is estimated using a method known as Generalized Cross-Correlation with Phase Transform (GCC-PHAT). In this case, GCC-PHAT is applied separately to the different frequency bands to measure the relative delays. These delays can then be transformed into angles of arrival, which provide information about the location of the source of the signal in each frequency band. A beamformer may be used.
The method for determining the location of the source of the respective frequency component is described in more detail below.
It can be seen from fig. 6 that sound emitted from the woofer 152 has a shorter path to the microphone 12 than to the microphone 12a, so it reaches the microphone 12a after reaching the microphone 12. This time difference can be used to provide some information about the location of the source of any component of the speech signal that is being produced by the woofer 152.
Conversely, sound emitted from the tweeter 154 has a shorter path to the microphone 12a than to the microphone 12, so it reaches the microphone 12 after reaching the microphone 12a. This time difference can be used to provide some information about the location of the source of any component of the speech signal that is being produced by the tweeter 154.
For example, in each case, the respective time difference may be determined by calculating a cross-correlation between the signals received at the two microphones 12, 12a. The peak in the cross-correlation indicates the time difference of arrival of the relevant frequency component at the two microphones.
It should be noted that, although in the example illustrated here the device 10 is positioned such that the path from the woofer 152 to the microphone 12 is shorter than the path to the microphone 12a, and the path from the tweeter 154 to the microphone 12a is shorter than the path to the microphone 12, the method described herein does not rely on any assumption about the position of the device 10 relative to the loudspeaker device 150. For most positions of the device, the time difference between the arrivals of the signal from the woofer 152 at the two microphones 12, 12a will differ from the time difference between the arrivals of the signal from the tweeter 154 at the two microphones 12, 12a.
Fig. 9 illustrates a form of processing for determining the cross-correlation of one frequency component. In particular, fig. 9 shows a form of processing for performing Generalized Cross-Correlation with Phase Transform (GCC-PHAT). This combines the computational efficiency of transform-domain processing with a spectral whitening stage, so as to compute the correlation function with the narrowest possible lobe.
The signals from the two microphones 1110, 1112 are passed to respective Fast Fourier Transform (FFT) blocks 1114, 1116. In the embodiments described above, where the location of the sources of the different frequency components is determined, the signals delivered to the FFT blocks 1114, 1116 are the relevant frequency components of the signals generated by the microphones 1110, 1112.
The outputs of the FFT blocks 1114, 1116 are passed to a correlation block 1118. The output of the correlation block 1118 is passed to a normalizer 1120 and the normalized result is passed to an Inverse Fast Fourier Transform (IFFT) block 1122 to give a correlation result.
Thus, the output of the IFFT block 1122 is the result of the cross-correlation of one frequency component between the signals generated by the first and second microphones in response to that frequency component. The signals received by the two microphones are typically the same, but with an offset that depends on the difference in time of arrival of the signals at the two microphones.
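The FFT, cross-spectrum, normalisation, and IFFT chain just described can be sketched compactly in NumPy. The function name, the zero-padding choice, and the sign convention are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def gcc_phat(sig_a, sig_b, fs):
    """Generalized Cross-Correlation with Phase Transform (GCC-PHAT):
    FFT both signals, form the cross-spectrum, normalise it to unit
    magnitude (the spectral whitening stage), and inverse-FFT to obtain
    a correlation whose peak gives the time difference of arrival.
    Returns the delay in seconds, positive when sig_b lags sig_a,
    at one-sample resolution."""
    n = len(sig_a) + len(sig_b)          # zero-pad to avoid circular wrap-around
    spec_a = np.fft.rfft(sig_a, n=n)
    spec_b = np.fft.rfft(sig_b, n=n)
    cross = np.conj(spec_a) * spec_b
    cross /= np.abs(cross) + 1e-12       # phase transform (whitening)
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate((cc[-(n // 2):], cc[: n // 2]))  # centre zero lag
    return (int(np.argmax(cc)) - n // 2) / fs
```

The whitening step is what gives the correlation its narrow central lobe; without it, the result is an ordinary cross-correlation.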
Fig. 10 shows the general form of the cross-correlation 1128. The peak 1130 of the cross-correlation 1128 occurs at a sample offset corresponding to a particular time difference of arrival. However, reading the peak directly from the cross-correlation 1128 only gives a result whose accuracy is limited by the sampling rate used by the correlator, and this can lead to relatively large errors.
The accuracy of the determination may be improved by interpolating the cross-correlation 1128 to determine the location of the peak 1130. One method that can be used for this is to apply a parabolic interpolation around the peak of the correlation waveform. That is, the sample nearest the peak 1130 is selected, together with one point on either side of it, and parabolic interpolation through these three points is performed to find the actual position of the peak.
To perform the parabolic interpolation, second-order polynomial interpolation is performed on the smoothed power spectrum using three points: the points at which the selected data crosses 0.8 of its peak value to the left and to the right (frequencies f_left and f_right), and the centre frequency f_meas. Thus:
A(f) = p2·f² + p1·f + p0
where p2, p1 and p0 are the coefficients of the polynomial, and (A_sel(f_left), f_left); (A_sel(f_meas), f_meas); and (A_sel(f_right), f_right) are the three selected points.
Therefore, we solve:
| f_left²   f_left   1 |   | p2 |   | A_sel(f_left)  |
| f_meas²   f_meas   1 | · | p1 | = | A_sel(f_meas)  |
| f_right²  f_right  1 |   | p0 |   | A_sel(f_right) |
These equations can then be solved to obtain the value of f at which the peak occurs (the vertex of the parabola, at f = -p1/(2·p2)), and this can be converted back to a time difference.
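For illustration, the three-point fit and peak solve described above might be sketched as follows; the function name and the test values are illustrative only:

```python
import numpy as np

def parabolic_peak(f_left, f_meas, f_right, a_left, a_meas, a_right):
    """Fit A(f) = p2*f**2 + p1*f + p0 through three points and return
    the abscissa of the vertex, i.e. the interpolated peak position."""
    M = np.array([[f_left ** 2,  f_left,  1.0],
                  [f_meas ** 2,  f_meas,  1.0],
                  [f_right ** 2, f_right, 1.0]])
    p2, p1, p0 = np.linalg.solve(M, np.array([a_left, a_meas, a_right]))
    return -p1 / (2.0 * p2)  # vertex of the fitted parabola

# Sanity check: samples of a known parabola peaking at f = 3.2 are
# recovered exactly, because the fit is exact for quadratic data.
true_peak = 3.2
f = lambda x: 1.0 - (x - true_peak) ** 2
est = parabolic_peak(2.0, 3.0, 4.0, f(2.0), f(3.0), f(4.0))
```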
An alternative, and potentially more robust, method for finding the location of the peak 1130, and hence the time difference of arrival between the signals received at the two microphones, is to use a Hilbert transform to shift the phase of the waveform. In fig. 10, the imaginary part of the Hilbert-transformed waveform is shown with reference numeral 1132. The peak 1130 of the waveform 1128 now corresponds to the point where the waveform 1132 crosses the zero line, and this point can be found by interpolating between the samples around the zero crossing. This is a sub-sample method, and so its accuracy is not limited by the sampling rate.
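A sketch of such a Hilbert-based sub-sample search is given below; the helper names and the test waveform are illustrative assumptions, not part of the disclosed embodiments:

```python
import numpy as np

def hilbert_imag(x):
    """Imaginary part of the analytic signal of x (FFT-based Hilbert transform)."""
    n = len(x)
    spec = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[1:n // 2] = 2.0
        h[n // 2] = 1.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spec * h).imag

def subsample_peak(corr):
    """Locate the peak of a correlation waveform with sub-sample precision.

    The Hilbert transform of the waveform crosses zero where the waveform
    itself peaks, so the fractional peak position is found by linearly
    interpolating the zero crossing that brackets the coarse peak.
    """
    h = hilbert_imag(corr)
    k = int(np.argmax(corr))          # coarse, sample-level peak
    for i in (k - 1, k):              # the crossing lies within one sample of k
        if h[i] <= 0.0 <= h[i + 1]:
            return i + h[i] / (h[i] - h[i + 1])
    return float(k)                   # fall back to the coarse peak

# Sanity check on a smooth lobe whose true peak lies between samples:
n, t0 = 64, 10.3
t = np.arange(n)
corr = np.cos(2 * np.pi * (t - t0) / n)
est = subsample_peak(corr)
```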
Thus, for this particular frequency component of the received sound, the time difference of arrival of the sound at the two microphones can be found.
Fig. 11 shows a case where the source of sound is located in the far field. In other words, the source is far enough away from the microphones that the respective paths 1140, 1142 from the source to the two microphones 1144, 1146 can be considered parallel. In fig. 11, d is the distance between the microphones 1144, 1146, and θ is the angle of approach of the paths 1140, 1142 from the source to the microphones 1144, 1146. By geometry, the additional distance l travelled by the path to the microphone 1144, compared with the path to the microphone 1146, is given by l = d·sin θ.
The time Δt taken to travel this additional distance can then be found from the equation l = c·Δt, where c is the speed of sound.
Thus, the approach angle θ can be obtained as θ = sin⁻¹(c·Δt/d).
The actual calculation of the time difference of arrival of the sound at the two microphones is affected by noise, and therefore there may be some uncertainty in the results obtained above. One possibility is that a measurement corrupted by noise may produce the physically impossible result that l is greater than d. To prevent the use of such measurements, it may be checked that the magnitude of (c·Δt/d) is less than 1 before attempting to calculate the value of θ.
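For illustration, the angle calculation with this sanity check might be sketched as follows; the function name and the example spacing and delay are illustrative assumptions:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def approach_angle(delta_t, mic_spacing):
    """Angle of approach (degrees) from a time difference of arrival.

    Combines l = d*sin(theta) with l = c*delta_t, and returns None when
    |c*delta_t/d| exceeds 1, i.e. when noise has produced the physically
    impossible result l > d.
    """
    ratio = SPEED_OF_SOUND * delta_t / mic_spacing
    if abs(ratio) > 1.0:
        return None  # corrupted measurement; do not use it
    return math.degrees(math.asin(ratio))

# A 0.1 ms arrival difference across a 10 cm microphone spacing:
theta = approach_angle(1e-4, 0.10)
```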
This embodiment assumes the use of two microphones. If more than two microphones are used, the same method may be used for different pairs of microphones, allowing the location of the source of this frequency component of the sound to be determined by triangulation.
Fig. 12 illustrates results obtained for an embodiment in which a received input signal is split into two frequency components. In particular, fig. 12 shows the component at the frequency f1 arriving at an approach angle θ1, and the component at the frequency f2 arriving at an approach angle θ2.
Thus, the time difference between the times of arrival at the two microphones gives some information about the location of the source of the first frequency component and the location of the source of the second frequency component, although in this embodiment this information does not accurately indicate the location of the source.
In step 176 of the method shown in fig. 7, the position information is passed to a comparator block 192 for comparing the position of the source of the first frequency component with the position of the source of the second frequency component.
In particular, it is determined whether the location of the source of the first frequency component differs from the location of the source of the second frequency component by more than a threshold amount. In the exemplary embodiment described above, where exact position information is not available, it may be determined that the location of the source of the first frequency component differs from the location of the source of the second frequency component by more than a threshold amount if the difference between the approach angles θ1 and θ2 exceeds a threshold amount.
If it is determined that the position of the source of the first frequency component differs from the position of the source of the second frequency component by more than a threshold amount, it is determined that the speech signal may be generated by a speaker, and thus the speech signal may be from a replay attack.
In this case, the operation of the speech biometric system is adapted. For example, as shown in FIG. 8, the output from the model comparison block 184 may be accompanied by some information regarding the likelihood that the input signal may come from a replay attack. This information may then be considered when deciding whether to treat the received speech signal as coming from and acting upon a registered user.
In other embodiments, the operation of the model comparison block 184 may be adapted, or even prevented, if it is determined that the speech signal is likely to be from a replay attack.
The implementation described thus far filters two frequency components of the received speech signal and obtains information about the locations of the sources of the two components. As mentioned above, more than two frequency components may be filtered from the received speech signal. In that case, a signal generated by a speaker device comprising woofer and tweeter speakers will be characterized in that all of the low-frequency components are found to come from one source position and all of the high-frequency components are found to come from a different source position.
Fig. 13 illustrates a case in which the position information takes the form of the approach angles of the frequency components of the received input signal. In this embodiment, six frequency components have been extracted, with approach angles θ1-θ6 corresponding to the frequencies f1-f6 respectively. The approach angles θ1-θ3 are all clustered together, and are much smaller than the approach angles θ4-θ6. These results indicate that the components at the frequencies f1-f3 come from a different source than the components at the frequencies f4-f6. In such a case, the results for the frequencies f1-f3 may be formed into one cluster and averaged, the results for the frequencies f4-f6 may be formed into another cluster and averaged, and the average result for the frequencies f1-f3 may be compared with the average result for the frequencies f4-f6. The average approach angles may be considered to give information about the locations of the sources of the first and second frequency components, and, if the location of the source of the first frequency component differs from the location of the source of the second frequency component by more than a threshold amount, it may be determined that the speech signal is likely to be from a replay attack. Alternatively, the maximum approach angle θ5 may be compared with the minimum approach angle θ2, and, if the location of the source of the frequency component having the maximum approach angle θ5 differs from the location of the source of the frequency component having the minimum approach angle θ2 by more than a threshold amount, it may be determined that the speech signal is likely to be from a replay attack.
In another embodiment, the second-largest approach angle θ4 may be compared with the second-smallest approach angle θ1, and, if the location of the source of the frequency component having the second-largest approach angle θ4 differs from the location of the source of the frequency component having the second-smallest approach angle θ1 by more than a threshold amount, it may be determined that the speech signal is likely to be from a replay attack.
Similarly, a signal generated by a speaker device comprising three speakers will be characterized in that one or more low-frequency components are found to come from one source position, one or more mid-frequency components from a second source position, and one or more high-frequency components from a third source position.
More generally, information is obtained about the respective locations of the sources of two or more frequency components of the speech signal relative to the first microphone and the second microphone. The locations of the sources of the frequency components are compared, and, if the location of the source of one frequency component differs from the location of the source of at least one other frequency component by more than a threshold amount, it is determined that the speech signal is likely to be from a replay attack.
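For illustration, this comparison of source locations across frequency components might be sketched as follows; the 1 kHz band split and the 10-degree threshold are illustrative values chosen for the example, not figures taken from this disclosure:

```python
def likely_replay(angles_by_freq, split_hz=1000.0, threshold_deg=10.0):
    """Compare the average approach angle of the low-frequency components
    with that of the high-frequency components: a gap larger than the
    threshold suggests separate woofer and tweeter sources."""
    low = [a for f, a in angles_by_freq if f < split_hz]
    high = [a for f, a in angles_by_freq if f >= split_hz]
    if not low or not high:
        return False  # cannot compare without results in both bands
    gap = abs(sum(high) / len(high) - sum(low) / len(low))
    return gap > threshold_deg

# Low components clustered near 5 degrees and high components near
# 40 degrees, as would be seen with separate woofer/tweeter sources:
two_source = [(200, 4.0), (500, 5.0), (900, 6.0),
              (2000, 39.0), (5000, 41.0), (8000, 40.0)]
# All components from one small source, as for a human mouth:
one_source = [(200, 4.0), (900, 6.0), (2000, 5.0), (8000, 7.0)]
```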
In the embodiment given above, signals from only two microphones 12, 12a are used.
In other embodiments, signals from more than two microphones are used. For example, if the signals from the three microphones 12, 12a, 12b are all used (where the signal from the microphone 12b is passed to the filter bank 194 and the resulting components are also passed to the position information derivation block 190), more accurate position information can be obtained. In principle, by receiving signals from three (or more) microphones, triangulation can be used to derive the exact location of the source of each component.
In that case, the spacing of the source locations may be compared with a threshold related to the size of a human mouth (a spacing exceeding the size of a human mouth indicating that the speech does not come from a human speaker), and/or the spacing of the source locations may be compared with a threshold related to the size of a typical speaker device (a spacing corresponding to the dimensions of a typical speaker device indicating that the speech is likely to come from such a speaker device).
In yet another embodiment, the method proceeds as described above, using signals from two microphones. However, if the apparent position of the source of the first frequency component and the apparent position of the source of the second frequency component are such that the comparator block 192 cannot draw a conclusion as to whether the speech signal is from a replay attack, the signal from a third microphone may also be considered. In other words, the signal from the third microphone is passed to a respective filter bank to extract the relevant frequency components, and the signals from the three microphones are then examined for each frequency component to obtain more accurate position information.
This more accurate location information for each frequency component can then be used to determine whether the different frequency components are from different source locations, and thus whether the speech signal is from a replay attack.
Fig. 14 shows a third embodiment of a situation in which a replay attack is being performed in an attempt to defeat the detection method described above. Thus, in fig. 14, the smartphone 10 is provided with a voice biometrics function. Again, in this embodiment, the smartphone 10 is at least temporarily in the possession of an attacker, who has another smartphone 240. The smartphone 240 has been used to record the voice of a registered user of the smartphone 10.
In this embodiment, the smartphone 240 is connected to a high-quality speaker 2150. The smartphone 10 is then positioned close to the speaker 2150, and the recording of the registered user's voice is played back through the speaker 2150. As before, if the voice biometric system is unable to detect that the voice it hears is a recording, the attacker will gain access to one or more services that are intended to be accessible only by the registered user.
In this embodiment, the speaker 2150 may be of sufficiently high quality that a recording of the registered user's voice played back through it is not reliably distinguishable from the live voice, and so the audio characteristics of the speech signal cannot be used to identify the replay attack.
In this embodiment, the speaker 2150 is an electrostatic speaker (ESL), or a balanced radiator speaker, or a bending mode speaker or bending wave speaker, or any other type of flat panel speaker.
One feature of many such speakers is that the apparent source of the sound is not located at a single point, but rather is diffuse, i.e. distributed across the speaker.
Fig. 15 shows a typical arrangement in which the speaker device 2150 is being used to play back speech that is detected by the smartphone 10. Thus, fig. 15 shows sound from a point 2152 in the lower part of the speaker 2150 reaching the microphone 12 located at the bottom end of the smartphone 10, and also reaching the microphones 12a and 12b located at the top end of the smartphone 10. Fig. 15 also shows sound from a point 2154 in the upper part of the speaker 2150 reaching the microphone 12 located at the bottom end of the smartphone 10, and also reaching the microphones 12a and 12b located at the top end of the smartphone 10.
Thus, as can be seen from fig. 15, from the point of view of the smartphone 10, the sound that it detects comes from a highly dispersed source.
This is in contrast to the case of human speech, where the sound comes from a relatively small area: the human mouth has, for example, a maximum jaw range of motion (ROM) or a maximum mouth opening (MMO) of about 5 cm-8 cm.
An understanding of this fact is used in the methods described herein.
Fig. 16 is a flowchart illustrating a method of detecting a replay attack on a voice biometric system, and fig. 17 is a block diagram illustrating functional blocks in the voice biometric system.
Specifically, in fig. 16, in step 2170, a first signal is generated at a first microphone. For example, the first microphone may be the microphone 12 located at the bottom end of the smartphone 10. The microphone generates a signal in response to the received sound. Similarly, in step 2172, a second signal is generated at a second microphone. For example, the second microphone may be one of the microphones 12a located at the upper end of the smartphone 10. Again, the microphone generates a signal in response to the received sound.
In the case of a smartphone, the first and second microphones may be spaced apart by a distance in the range of 5cm-20 cm.
As part of the biometric operation, the first and second signals are passed to a feature extraction block 2190, which feature extraction block 2190 extracts features of the audio signal in a known manner. In one embodiment, the characteristic of the audio signal may be a mel-frequency cepstral coefficient (MFCC). These features are passed to a model comparison block 2192 where they are compared to corresponding features extracted from the user's utterance during the registration process. Based on the comparison, it is determined whether the detected utterance is an utterance of a registered user.
At the same time, the first signal and the second signal are also transmitted to the position information derivation block 2194.
In step 2174, the position information derivation block 2194 determines the position of the apparent source of the received sound using the first signal and the second signal.
More specifically, in one embodiment, in step 2176, the position information derivation block 2194 performs a correlation operation on the first signal and the second signal and determines a correlation function.
The correlation operation determines the value of the cross-correlation Rxy between the first signal and the second signal for a range of time offsets. In this embodiment the first signal and the second signal are responses to the same received sound; however, the shape of the correlation depends on the location of the source of that sound. For example, if sound arrives at the second microphone after arriving at the first microphone, the signals will need to be offset in one direction relative to each other to achieve a match between them. This produces a high value of the correlation function at an offset in that direction. If sound arrives at the first microphone after arriving at the second microphone, the signals will need to be offset in the other direction relative to each other to achieve a match between them. This produces a high value of the correlation function at an offset in the other direction.
This assumes that the source of the received sound is a point source. In practice, however, the source has a finite width, so the overall correlation function calculated is the integral of these correlations between the received sounds from the point sources over the entire width of the source.
In particular, for any point within the finite width of the source of the sound, the times of flight of the sound from that point to the two microphones can be calculated as τ and τP. The difference between these two times will depend on the angle at which the sound from that point meets the plane containing the two microphones. If the source of the sound extends from -wO to wO in the width direction, the correlation result is the integral, over the width of the source, of the correlations between the sounds received from each point:

Rxy(t) = ∫ from -wO to wO [ ∫ s(u - τ(w)) · s(u + t - τP(w)) du ] dw

thus:

Rxy(t) = ∫ from -wO to wO Rss(t - (τP(w) - τ(w))) dw

where s is the source signal and Rss is its autocorrelation.
the width of the central lobe of this function is therefore dependent on the width of the source of the sound.
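This widening of the central lobe can be illustrated numerically: summing copies of a point-source correlation over a range of time-difference-of-arrival offsets (a discrete version of the integral above, using an illustrative triangular autocorrelation) broadens the lobe in proportion to the source width:

```python
import numpy as np

# R_ss: autocorrelation of the source signal, modelled as a triangular lobe
t = np.arange(-100, 101)
point_corr = np.maximum(1.0 - np.abs(t) / 10.0, 0.0)

# Discrete version of the integral over the source width: sum the point
# correlation over per-point time-difference-of-arrival offsets, here
# spanning +/- 15 samples:
spread_corr = sum(np.roll(point_corr, s) for s in range(-15, 16))

# The central lobe of the integrated correlation is wider than that of
# the point-source correlation:
point_width = int(np.count_nonzero(point_corr > 0))    # 19 samples
spread_width = int(np.count_nonzero(spread_corr > 0))  # 49 samples
```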
By making assumptions about the likely distance of the speaker 2150 from the smartphone 10 in the situation illustrated in fig. 15 (for example, an attacker may place the smartphone between 0.1 and 1.0 meters from the speaker), a suitable threshold may be set. This threshold may represent the maximum width of the central lobe that would be expected if the source of the sound were actually a human mouth. If the width of the central lobe exceeds this threshold, it can be determined that the source of the sound is likely to be a speaker.
Fig. 18 illustrates an example in which sound reaches the first microphone and the second microphone from a narrow source placed equidistant from the two microphones. The peak of the central lobe of the correlation function Rxy is therefore relatively sharp and located at zero offset, with a width W1 between the points at which the central lobe of the correlation function reaches zero.
In contrast, fig. 19 illustrates an example in which sound reaches the first microphone and the second microphone from a dispersed source, similar to the situation shown in fig. 15. Certain portions of the source (e.g., the point 2152 towards the bottom of the speaker 2150) are closer to the microphone 12 at the bottom end of the smartphone 10 than to the microphone 12a at the top end of the smartphone 10.

Sound from the point 2152, and from other similar points, therefore reaches the microphone 12a after reaching the microphone 12. Thus, as discussed above, sound from these points causes the correlation function to have a high value at an offset in one particular direction.

Conversely, other portions of the source (e.g., the point 2154 towards the top of the speaker 2150) are closer to the microphone 12a at the top end of the smartphone 10 than to the microphone 12 at the bottom end of the smartphone 10.

Sound from the point 2154, and from other similar points, therefore reaches the microphone 12 after reaching the microphone 12a. Thus, as discussed above, sound from these points causes the correlation function to have a high value at an offset in the direction opposite to that of the sound from points such as the point 2152.
Thus, in fig. 19, the peak of the central lobe of the correlation function Rxy is less sharp than in fig. 18, with a width W2 between the points at which the central lobe of the correlation function reaches zero.
In step 2178 of the process shown in fig. 16, this width of the center lobe of the correlation function is calculated.
In step 2180 of the process shown in fig. 16, this calculated width of the central lobe of the correlation function is passed to a determination block 2196, and, if the apparent source of the received sound is found to be diffuse, it is determined that the received sound is likely to come from a replay attack. For example, the apparent source of the received sound may be considered diffuse if it is larger than a human mouth, e.g. if it exceeds 5 cm in diameter.
Thus, as shown at step 2182, if the width of the central lobe of the correlation function exceeds a threshold, then it is determined that the received sound is from a replay attack. The threshold may be chosen such that if the source of the received sound exceeds a diameter of about 5cm-8cm, the width of the central lobe of the correlation function exceeds the threshold.
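For illustration, the width measurement and threshold test of steps 2178-2182 might be sketched as follows; the function names and the example lobes are illustrative assumptions:

```python
import numpy as np

def central_lobe_width(corr):
    """Width, in samples, between the zero crossings on either side of
    the main peak of a correlation function."""
    k = int(np.argmax(corr))
    right = k
    while right < len(corr) - 1 and corr[right] > 0:
        right += 1
    left = k
    while left > 0 and corr[left] > 0:
        left -= 1
    return right - left

def replay_suspected(corr, width_threshold):
    """Flag a diffuse apparent source: the central lobe is wider than the
    threshold chosen for a human mouth at the assumed distance."""
    return central_lobe_width(corr) > width_threshold

# A sharp lobe from a point-like source versus a broad lobe from a
# dispersed source, as in figs. 18 and 19:
t = np.arange(-50, 51)
narrow = np.maximum(1.0 - np.abs(t) / 5.0, -0.1)
wide = np.maximum(1.0 - np.abs(t) / 20.0, -0.1)
```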
In some embodiments, information about the usage pattern of the smartphone may be obtained. For example, information about the distance of the smartphone from the source of the received sound may be obtained, for example, using an ultrasonic proximity detection function or an optical proximity detection function. The threshold may then be set based on the distance of the smartphone from the source of the received sound.
If it is determined that the received sound is likely to be from a replay attack, an output flag or output signal is sent to another function of the voice biometric system. For example, the output of the model comparison block 2192 may be aborted, or may be changed such that subsequent processing blocks assign less weight (or no weight at all) to the output indicating that the speech is that of a registered speaker.
In the above embodiment, the signals from two microphones are used to determine whether the source of the received sound is diffuse. In other embodiments, the signals from three or more microphones may be cross-correlated (for example, with each other in pairs) to obtain more information about the spatial diversity of the source of the detected sound.
The embodiments given above use the signals from two microphones to determine whether the source of the received sound is diffuse. A further development is based on the recognition that, at least for some speakers, the apparent width of the speaker varies with frequency. More specifically, the speaker will appear wider at low frequencies than at high frequencies.
To take advantage of this, the position information derivation block 2194 includes two or more band pass filters for extracting respective frequency bands of the received signal. The above described method is then performed on these two frequency bands, respectively. More specifically, the first microphone generates a first signal from the received sound, wherein the first signal has a first component at a first frequency and a second component at a second frequency, and wherein the first frequency is higher than the second frequency. The second microphone generates a second signal from the received sound.
A first correlation function is then determined based on a correlation between the first component of the first signal and the first component of the second signal. The width of the central lobe of the first correlation function is calculated. A second correlation function is determined based on a correlation between the second component of the first signal and the second component of the second signal. The width of the central lobe of the second correlation function is calculated.
The two widths are then compared and if the width of the determined central lobe of the second correlation function exceeds the width of the determined central lobe of the first correlation function by more than a threshold, it is determined that the received sound may have been generated by a loudspeaker and may have been from a replay attack, for example.
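A sketch of this two-band check is given below; the FFT-mask band splitting, the band edges, and the margin are illustrative assumptions rather than values taken from this disclosure:

```python
import numpy as np

def band_limit(sig, fs, f_lo, f_hi):
    """Crude band-pass filter: zero all rFFT bins outside [f_lo, f_hi]."""
    spec = np.fft.rfft(sig)
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
    spec[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spec, n=len(sig))

def flat_panel_suspected(width_low_band, width_high_band, margin):
    """Flag a flat-panel speaker: its low-band central lobe should be
    wider than its high-band central lobe by more than the margin."""
    return (width_low_band - width_high_band) > margin

# Isolate the two bands of a two-tone test signal (100 Hz + 3 kHz):
fs, dur = 16000, 0.1
t = np.arange(int(fs * dur)) / fs
sig = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 3000 * t)
low = band_limit(sig, fs, 50, 1000)
high = band_limit(sig, fs, 1000, 8000)
```

Each band-limited pair of microphone signals would then be correlated as described above, and the two central-lobe widths passed to the comparison.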
Accordingly, methods and systems are disclosed that may be used to detect possible replay attacks.
Those skilled in the art will recognize that some aspects of the apparatus and methods described above may be embodied as processor control code, for example on a non-volatile carrier medium such as a magnetic disk, CD-ROM or DVD-ROM, in programmed memory such as read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. For many applications, embodiments of the invention will be implemented on a DSP (digital signal processor), an ASIC (application specific integrated circuit), or an FPGA (field programmable gate array). Thus, the code may comprise conventional program code or microcode, or, for example, code for setting up or controlling an ASIC or FPGA. The code may also comprise code for dynamically configuring a reconfigurable device, such as a re-programmable array of logic gates. Similarly, the code may comprise code for a hardware description language, such as Verilog™ or VHDL (Very High Speed Integrated Circuit Hardware Description Language). As will be appreciated by those skilled in the art, the code may be distributed among a plurality of coupled components in communication with each other. Where appropriate, the embodiments may also be implemented using code running on a field-(re)programmable analogue array or similar device in order to configure analogue hardware.
Note that as used herein, the term module should be used to refer to a functional unit or block that may be implemented at least in part by dedicated hardware components (such as custom circuitry), and/or by one or more software processors or appropriate code running on suitable general purpose processors or the like. The modules themselves may comprise other modules or functional units. A module may be provided by a number of components or sub-modules that need not be co-located and may be provided on different integrated circuits and/or run on different processors.
Embodiments may be implemented in a host device, in particular a portable host device and/or a battery-powered host device, such as a mobile computing device (e.g., a laptop or tablet computer), a gaming console, a remote control device, a home automation controller or a home appliance (including a home temperature or lighting control system), a toy, a machine (such as a robot), an audio player, a video player, or a mobile phone (e.g., a smartphone).
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim; "a" or "an" does not exclude a plurality; and a single feature or other unit may fulfil the functions of several units recited in the claims. Any reference numerals or labels in the claims shall not be construed as limiting their scope.

Claims (33)

1. A method of detecting a replay attack on a voice biometric system, the method comprising:
receiving a speech signal at at least a first microphone and a second microphone, wherein the speech signal has a component at a first frequency and a component at a second frequency;
obtaining information about a location of a source of a first frequency component of the speech signal relative to the first microphone and the second microphone;
obtaining information about a location of a source of a second frequency component of the speech signal relative to the first microphone and the second microphone;
comparing the location of the source of the first frequency component and the location of the source of the second frequency component; and
determining that the speech signal is likely to be from a replay attack if the location of the source of the first frequency component differs from the location of the source of the second frequency component by more than a threshold amount.
2. The method of claim 1, wherein obtaining information about a location of a source of a first frequency component of the speech signal relative to the first microphone and the second microphone comprises:
filtering the received speech signal to obtain the first frequency component; and
determining angles of arrival of the first frequency components at the first microphone and the second microphone;
and wherein obtaining information about the location of the source of the second frequency component of the speech signal relative to the first microphone and the second microphone comprises:
filtering the received speech signal to obtain the second frequency component; and
determining angles of arrival of the second frequency components at the first microphone and the second microphone.
3. The method of claim 2, wherein determining an angle of arrival of each frequency component at the first microphone and the second microphone comprises:
calculating a cross-correlation between a respective component of the speech signal received at the first microphone and the respective component of the speech signal received at the second microphone; and
obtaining information about the angle of arrival from the position of the peak in the calculated cross-correlation.
4. The method of claim 1, wherein obtaining information about a location of a source of a first frequency component of the speech signal relative to the first microphone and the second microphone comprises:
filtering the received speech signal to obtain the first frequency component; and
determining a time difference between arrival of the first frequency component at the first microphone and the second microphone;
and wherein obtaining information about the location of the source of the second frequency component of the speech signal relative to the first microphone and the second microphone comprises:
filtering the received speech signal to obtain the second frequency component; and
determining a time difference between arrival of the second frequency component at the first microphone and the second microphone.
5. A method according to any preceding claim, wherein the first frequency component comprises a frequency in the range below 1 kHz.
6. The method of claim 5, wherein the first frequency component comprises a frequency in the range of 100Hz-1 kHz.
7. A method according to any preceding claim, wherein the second frequency component comprises a frequency in the range above 1 kHz.
8. The method of claim 7, wherein the second frequency component comprises a frequency in the range of 2kHz-15 kHz.
9. The method of any preceding claim, comprising:
obtaining information about respective positions of sources of more than two frequency components of the speech signal relative to the first microphone and the second microphone;
comparing the locations of the sources of the frequency components; and
determining that the speech signal is likely to be from a replay attack if a location of a source of one frequency component differs from a location of a source of at least one other frequency component by more than a threshold amount.
10. The method of any preceding claim, further comprising:
after comparing the location of the source of the first frequency component and the location of the source of the second frequency component, if the result of the comparison is ambiguous:
receiving the speech signal at a third microphone;
obtaining additional information about a location of a source of a first frequency component of the speech signal relative to the first microphone, the second microphone, and the third microphone;
obtaining additional information about a location of a source of a second frequency component of the speech signal relative to the first microphone, the second microphone, and the third microphone;
and comparing, based on the additional information, the location of the source of the first frequency component and the location of the source of the second frequency component.
11. A system for detecting a replay attack in a speaker recognition system, the system comprising an input for receiving speech signals from at least a first microphone and a second microphone, and comprising a processor configured to:
receive a speech signal at at least the first microphone and the second microphone, wherein the speech signal has a component at a first frequency and a component at a second frequency;
obtain information about a location of a source of a first frequency component of the speech signal relative to the first microphone and the second microphone;
obtain information about a location of a source of a second frequency component of the speech signal relative to the first microphone and the second microphone;
compare the location of the source of the first frequency component and the location of the source of the second frequency component; and
determine that the speech signal is likely to be from a replay attack if the location of the source of the first frequency component differs from the location of the source of the second frequency component by more than a threshold amount.
12. A device comprising the system of claim 11.
13. The device of claim 12, wherein the device comprises a mobile phone, an audio player, a video player, a mobile computing platform, a gaming device, a remote control device, a toy, a machine or home automation controller, or a household appliance.
14. A computer program product comprising a tangible computer-readable medium and instructions for performing the method of any one of claims 1 to 10.
15. A non-transitory computer-readable storage medium having stored thereon computer-executable instructions that, when executed by processor circuitry, cause the processor circuitry to perform the method of any of claims 1-10.
16. A device comprising the non-transitory computer-readable storage medium of claim 15.
17. The device of claim 16, wherein the device comprises a mobile phone, an audio player, a video player, a mobile computing platform, a gaming device, a remote control device, a toy, a machine or home automation controller, or a household appliance.
18. A method of detecting a replay attack on a voice biometric system, the method comprising:
generating a first signal from sound received at a first microphone;
generating a second signal from sound received at a second microphone;
determining a location of an apparent source of the received sound using the first signal and the second signal; and
determining that the received sound is likely to be from a replay attack if the apparent source of the received sound is diffuse.
19. The method of claim 18, wherein the apparent source of the received sound is deemed diffuse if its diameter exceeds 8 cm.
20. A method of detecting a replay attack on a speaker recognition system, such as a voice biometric system, the method comprising:
generating a first signal from sound received at a first microphone;
generating a second signal from sound received at a second microphone;
determining a correlation function based on a correlation between the first signal and the second signal;
calculating the width of the central lobe of the determined correlation function; and
determining that the received sound is likely to be from a replay attack if the width of the central lobe of the determined correlation function exceeds a threshold.
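As a rough illustration of the correlation test in claim 20 (not part of the patent text; the zero-crossing definition of lobe width and the choice of threshold are assumptions of this sketch):

```python
import numpy as np

def central_lobe_width(x1, x2, fs):
    """Width, in seconds, of the central lobe of the cross-correlation
    of two microphone signals: the distance between the zero crossings
    on either side of the correlation peak."""
    corr = np.correlate(x1, x2, mode="full")
    peak = int(np.argmax(corr))
    left, right = peak, peak
    while left > 0 and corr[left - 1] > 0:
        left -= 1
    while right < len(corr) - 1 and corr[right + 1] > 0:
        right += 1
    return (right - left) / fs

def is_diffuse(x1, x2, fs, threshold_s):
    """A wide central lobe suggests a diffuse, loudspeaker-like source."""
    return central_lobe_width(x1, x2, fs) > threshold_s
```

Broadband speech from a point source gives a narrow, sharp correlation peak; a spatially spread or band-limited replay source widens the lobe.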
21. The method of claim 20, comprising:
obtaining information about a distance of a source of the sound from the first microphone and the second microphone; and
setting the threshold based on the distance.
22. The method of claim 21, wherein obtaining information about the distance of the source of the sound from the first microphone and the second microphone comprises:
determining a usage pattern of a device comprising the first microphone and the second microphone.
23. The method of any of claims 20-22, wherein the first microphone and the second microphone are spaced apart by a distance of 5-20 cm.
24. A method of detecting a replay attack on a voice biometric system, the method comprising:
generating a first signal from sound received at a first microphone, wherein the first signal has a first component at a first frequency and a second component at a second frequency, and wherein the first frequency is higher than the second frequency;
generating a second signal from sound received at a second microphone, wherein the second signal has a first component at the first frequency and a second component at the second frequency;
determining a first correlation function based on a correlation between a first component of the first signal and a first component of the second signal;
calculating a width of a central lobe of the determined first correlation function;
determining a second correlation function based on a correlation between a second component of the first signal and a second component of the second signal;
calculating a width of a central lobe of the determined second correlation function; and
determining that the received sound is likely to be from a replay attack if the width of the central lobe of the determined second correlation function exceeds the width of the central lobe of the determined first correlation function by more than a threshold value.
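The two-band test of claim 24 might be sketched as follows (illustration only; the band edges, the margin, and the brick-wall filter are assumptions, not values from the patent). Because the lower band is narrower in bandwidth, its correlation lobe is always wider, so only an excess beyond an allowed margin counts as evidence of a replay:

```python
import numpy as np

def _bandpass(x, fs, lo, hi):
    """Crude FFT brick-wall band-pass filter (illustration only)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spec, n=len(x))

def _lobe_width(x1, x2):
    """Central-lobe width, in samples, of the cross-correlation."""
    corr = np.correlate(x1, x2, mode="full")
    peak = int(np.argmax(corr))
    left, right = peak, peak
    while left > 0 and corr[left - 1] > 0:
        left -= 1
    while right < len(corr) - 1 and corr[right + 1] > 0:
        right += 1
    return right - left

def two_band_replay_check(x1, x2, fs, low=(100, 1000), high=(2000, 6000),
                          margin_s=0.005):
    """Compare the central-lobe widths of the low-band and high-band
    cross-correlations; an excess low-band width beyond margin_s
    seconds is treated as evidence of a replayed (diffuse) source."""
    w_high = _lobe_width(_bandpass(x1, fs, *high), _bandpass(x2, fs, *high))
    w_low = _lobe_width(_bandpass(x1, fs, *low), _bandpass(x2, fs, *low))
    return (w_low - w_high) / fs > margin_s
```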
25. A system for detecting a replay attack on a voice biometric system, the system comprising an input for receiving speech signals from at least a first microphone and a second microphone, and comprising a processor configured to:
generate a first signal from sound received at the first microphone;
generate a second signal from sound received at the second microphone;
determine a location of an apparent source of the received sound using the first signal and the second signal; and
determine that the received sound is likely to be from a replay attack if the apparent source of the received sound is diffuse.
26. A system for detecting a replay attack in a speaker recognition system, the system comprising an input for receiving speech signals from at least a first microphone and a second microphone, and comprising a processor configured to:
generate a first signal from sound received at the first microphone;
generate a second signal from sound received at the second microphone;
determine a correlation function based on a correlation between the first signal and the second signal;
calculate the width of the central lobe of the determined correlation function; and
determine that the received sound is likely to be from a replay attack if the width of the central lobe of the determined correlation function exceeds a threshold.
27. A system for detecting a replay attack on a voice biometric system, the system comprising an input for receiving speech signals from at least a first microphone and a second microphone, and comprising a processor configured to:
generate a first signal from sound received at the first microphone, wherein the first signal has a first component at a first frequency and a second component at a second frequency, and wherein the first frequency is higher than the second frequency;
generate a second signal from sound received at the second microphone, wherein the second signal has a first component at the first frequency and a second component at the second frequency;
determine a first correlation function based on a correlation between the first component of the first signal and the first component of the second signal;
calculate a width of a central lobe of the determined first correlation function;
determine a second correlation function based on a correlation between the second component of the first signal and the second component of the second signal;
calculate a width of a central lobe of the determined second correlation function; and
determine that the received sound is likely to be from a replay attack if the width of the central lobe of the determined second correlation function exceeds the width of the central lobe of the determined first correlation function by more than a threshold value.
28. A device comprising the system of any one of claims 25, 26 and 27.
29. The device of claim 28, wherein the device comprises a mobile phone, an audio player, a video player, a mobile computing platform, a gaming device, a remote control device, a toy, a machine or home automation controller, or a household appliance.
30. A computer program product comprising a tangible computer-readable medium and instructions for performing the method of any one of claims 18 to 24.
31. A non-transitory computer-readable storage medium having stored thereon computer-executable instructions that, when executed by processor circuitry, cause the processor circuitry to perform the method of any of claims 18 to 24.
32. A device comprising the non-transitory computer-readable storage medium of claim 31.
33. The device of claim 32, wherein the device comprises a mobile phone, an audio player, a video player, a mobile computing platform, a gaming device, a remote control device, a toy, a machine or home automation controller, or a household appliance.
CN201880065597.4A 2017-10-13 2018-10-11 Detection of replay attacks Pending CN111194445A (en)

Applications Claiming Priority (9)

Application Number Priority Date Filing Date Title
US201762571959P 2017-10-13 2017-10-13
US62/571,959 2017-10-13
US201762585721P 2017-11-14 2017-11-14
US62/585,721 2017-11-14
GBGB1803570.9A GB201803570D0 (en) 2017-10-13 2018-03-06 Detection of replay attack
GB1803570.9 2018-03-06
GBGB1804843.9A GB201804843D0 (en) 2017-11-14 2018-03-26 Detection of replay attack
GB1804843.9 2018-03-26
PCT/GB2018/052906 WO2019073234A1 (en) 2017-10-13 2018-10-11 Detection of replay attack

Publications (1)

Publication Number Publication Date
CN111194445A true CN111194445A (en) 2020-05-22

Family

ID=66100457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880065597.4A Pending CN111194445A (en) 2017-10-13 2018-10-11 Detection of replay attacks

Country Status (4)

Country Link
KR (1) KR102532584B1 (en)
CN (1) CN111194445A (en)
GB (1) GB2581595B (en)
WO (1) WO2019073234A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101166612B1 (en) * 2009-08-19 2012-07-18 엘지전자 주식회사 Method for processing sound source in terminal and terminal using the same
JP6220304B2 (en) 2014-03-28 2017-10-25 セコム株式会社 Voice identification device
US10516657B2 (en) * 2014-04-24 2019-12-24 Mcafee, Llc Methods and apparatus to enhance security of authentication
GB2541466B (en) * 2015-08-21 2020-01-01 Validsoft Ltd Replay attack detection

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102027536A (en) * 2008-05-14 2011-04-20 索尼爱立信移动通讯有限公司 Adaptively filtering a microphone signal responsive to vibration sensed in a user's face while speaking
US20140337016A1 (en) * 2011-10-17 2014-11-13 Nuance Communications, Inc. Speech Signal Enhancement Using Visual Information
US20130289999A1 (en) * 2012-04-30 2013-10-31 Research In Motion Limited Dual microphone voice authentication for mobile device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SI CHEN ET AL: "You Can Hear But You Cannot Steal: Defending Against Voice Impersonation Attacks on Smartphones" *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110140171A (en) * 2017-01-03 2019-08-16 皇家飞利浦有限公司 Use the audio capturing of Wave beam forming
CN110140171B (en) * 2017-01-03 2023-08-22 皇家飞利浦有限公司 Audio capture using beamforming
CN112151038A (en) * 2020-09-10 2020-12-29 达闼机器人有限公司 Voice replay attack detection method and device, readable storage medium and electronic equipment
WO2022052965A1 (en) * 2020-09-10 2022-03-17 达闼机器人有限公司 Voice replay attack detection method, apparatus, medium, device and program product

Also Published As

Publication number Publication date
WO2019073234A1 (en) 2019-04-18
GB2581595A (en) 2020-08-26
KR102532584B1 (en) 2023-05-12
KR20200066691A (en) 2020-06-10
GB2581595B (en) 2021-09-22
GB202004478D0 (en) 2020-05-13

Similar Documents

Publication Publication Date Title
US10839808B2 (en) Detection of replay attack
US11276409B2 (en) Detection of replay attack
US11704397B2 (en) Detection of replay attack
US10770076B2 (en) Magnetic detection of replay attack
US11017252B2 (en) Detection of liveness
US11705135B2 (en) Detection of liveness
US11023755B2 (en) Detection of liveness
CN111316668B (en) Detection of loudspeaker playback
US20210192033A1 (en) Detection of replay attack
US10692490B2 (en) Detection of replay attack
US9591405B2 (en) Automatic audio enhancement system
CN110554357A (en) Sound source positioning method and device
CN104041075A (en) Audio source position estimation
KR102532584B1 (en) Detection of replay attacks
CN107079219A (en) The Audio Signal Processing of user oriented experience
CN111201568A (en) Detection in situ
CN111344781A (en) Audio processing
WO2019229423A1 (en) Speaker verification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200522