WO2019073233A1 - Analysing speech signals - Google Patents

Analysing speech signals

Info

Publication number
WO2019073233A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
audio signal
channel
speaker
enrolled
Prior art date
Application number
PCT/GB2018/052905
Other languages
French (fr)
Inventor
John Paul Lesso
Original Assignee
Cirrus Logic International Semiconductor Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB1719731.0A external-priority patent/GB2567503A/en
Priority claimed from GBGB1719734.4A external-priority patent/GB201719734D0/en
Application filed by Cirrus Logic International Semiconductor Limited filed Critical Cirrus Logic International Semiconductor Limited
Priority to GB2004481.4A priority Critical patent/GB2580821B/en
Priority to CN201880065835.1A priority patent/CN111201570A/en
Publication of WO2019073233A1 publication Critical patent/WO2019073233A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification

Definitions

  • microphones which can be used to detect ambient sounds.
  • the ambient sounds include the speech of one or more nearby speaker.
  • Audio signals generated by the microphones can be used in many ways. For example, audio signals representing speech can be used as the input to a speech recognition system, allowing a user to control a device or system using spoken commands.
  • a method of analysis of an audio signal comprising: receiving an audio signal representing speech; extracting first and second components of the audio signal representing first and second acoustic classes of the speech respectively; analysing the first and second components of the audio signal with models of the first and second acoustic classes of the speech of an enrolled user; and based on said analysing, obtaining information about at least one of a channel and noise affecting said audio signal.
  • a system for analysing an audio signal configured for performing the method.
  • a device comprising such a system.
  • the device may comprise a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
  • a computer program product comprising a computer-readable tangible medium, and instructions for performing a method according to the first aspect.
  • a non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to the first aspect.
  • a method of speaker identification comprising: receiving an audio signal representing speech; removing effects of a channel and/or noise from the received audio signal to obtain a cleaned audio signal; obtaining an average spectrum of at least a part of the cleaned audio signal; comparing the average spectrum with a long term average speaker model for an enrolled speaker; and determining based on the comparison whether the speech is the speech of the enrolled speaker.
  • Obtaining an average spectrum of at least a part of the cleaned audio signal may comprise obtaining an average spectrum of a part of the cleaned audio signal representing voiced speech.
  • the first acoustic class may be voiced speech and the second acoustic class unvoiced speech.
  • the method may comprise comparing the average spectrum with respective long term average speaker models for each of a plurality of enrolled speakers; and determining based on the comparison whether the speech is the speech of one of the enrolled speakers.
  • the method may further comprise comparing the average spectrum with a Universal Background Model; and including a result of the comparing the average spectrum with the Universal Background Model in determining whether the speech is the speech of one of the enrolled speakers.
  • the method may comprise identifying one of the enrolled speakers as a most likely candidate as a source of the speech.
  • the method may comprise: obtaining information about the effects of a channel and/or noise on the received audio signal by: receiving the audio signal representing speech; extracting first and second components of the audio signal representing first and second acoustic classes of the speech respectively; analysing the first and second components of the audio signal with models of the first and second acoustic classes of the speech of an enrolled user; and, based on said analysing, obtaining information about at least one of a channel and noise affecting said audio signal.
  • the method may comprise analysing the first and second components of the audio signal with models of the first and second acoustic classes of the speech of a plurality of enrolled users, to obtain respective hypothetical values of the channel, and determining that the speech is not the speech of any enrolled speaker whose models give rise to physically implausible hypothetical values of the channel.
  • a hypothetical value of the channel may be considered to be physically implausible if it contains variations exceeding a threshold level across the relevant frequency range.
  • a hypothetical value of the channel may be considered to be physically implausible if it contains significant discontinuities.
  • a system for analysing an audio signal configured for performing the method.
  • a device comprising such a system.
  • the device may comprise a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
  • a computer program product comprising a computer-readable tangible medium, and instructions for performing a method according to the second aspect.
  • a non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to the second aspect.
  • Figure 1 illustrates a smartphone.
  • Figure 2 is a schematic diagram, illustrating the form of the smartphone.
  • Figure 3 is a flow chart illustrating a method of analysing an audio signal;
  • Figure 4 is a block diagram illustrating a system for analysing an audio signal;
  • Figure 5 illustrates results in the method of Figure 3;
  • Figure 6 is a block diagram illustrating an alternative system for analysing an audio signal;
  • Figure 7 is a block diagram illustrating a further alternative system for analysing an audio signal;
  • Figure 8 is a block diagram illustrating a further alternative system for analysing an audio signal;
  • Figure 9 illustrates a possible replay attack on a voice biometric system;
  • Figure 10 illustrates an effect of a replay attack;
  • Figure 11 is a flow chart illustrating a method of detecting a replay attack;
  • Figure 12 is a flow chart illustrating a method of identifying a speaker;
  • Figure 13 is a block diagram illustrating a system for identifying a speaker; and
  • Figure 14 is a block diagram illustrating another system for identifying a speaker.
  • Figure 1 illustrates a smartphone 10, having a microphone 12 for detecting ambient sounds.
  • the microphone is of course used for detecting the speech of a user who is holding the smartphone 10 close to their face.
  • FIG. 2 is a schematic diagram, illustrating the form of the smartphone 10.
  • Figure 2 shows various interconnected components of the smartphone 10. It will be appreciated that the smartphone 10 will in practice contain many other components, but the following description is sufficient for an understanding of the present invention.
  • Figure 2 shows the microphone 12 mentioned above.
  • the smartphone 10 is provided with multiple microphones 12, 12a, 12b, etc.
  • Figure 2 also shows a memory 14, which may in practice be provided as a single component or as multiple components.
  • the memory 14 is provided for storing data and program instructions.
  • Figure 2 also shows a processor 16, which again may in practice be provided as a single component or as multiple components.
  • one component of the processor 16 may be an applications processor of the smartphone 10.
  • Figure 2 also shows a transceiver 18, which is provided for allowing the smartphone 10 to communicate with external networks.
  • the transceiver 18 may include circuitry for establishing an internet connection either over a WiFi local area network or over a cellular network.
  • Figure 2 also shows audio processing circuitry 20, for performing operations on the audio signals detected by the microphone 12 as required.
  • the audio processing circuitry 20 may filter the audio signals or perform other signal processing operations.
  • the smartphone 10 is provided with voice biometric functionality, and with control functionality. Thus, the smartphone 10 is able to perform various functions in response to spoken commands from an enrolled user.
  • the biometric functionality is able to distinguish between spoken commands from the enrolled user, and the same commands when spoken by a different person.
  • certain embodiments of the invention relate to operation of a smartphone or another portable electronic device with some sort of voice operability, for example a tablet or laptop computer, a games console, a home control system, a home entertainment system, an in-vehicle entertainment system, a domestic appliance, or the like, in which the voice biometric functionality is performed in the device that is intended to carry out the spoken command.
  • Certain other embodiments relate to systems in which the voice biometric functionality is performed on a smartphone or other device, which then transmits the commands to a separate device if the voice biometric functionality is able to confirm that the speaker was the enrolled user.
  • the spoken commands are transmitted using the transceiver 18 to a remote speech recognition system, which determines the meaning of the spoken commands.
  • the speech recognition system may be located on one or more remote server in a cloud computing environment. Signals based on the meaning of the spoken commands are then returned to the smartphone 10 or other local device.
  • speech can be divided into voiced sounds and unvoiced or voiceless sounds.
  • a voiced sound is one in which the vocal cords of the speaker vibrate, and a voiceless sound is one in which they do not.
  • voiced and unvoiced sounds have different frequency properties, and these different frequency properties can be used to obtain useful information about the speech signal.
  • Figure 3 is a flow chart, illustrating a method of analysing an audio signal
  • Figure 4 is a block diagram illustrating functional blocks in the analysis system.
  • in step 50 of the method of Figure 3, an audio signal, which is expected to contain speech, is received on an input 70 of the system shown in Figure 4.
  • the received signal is divided into frames, which may for example have lengths in the range of 10-100 ms, and then passed to a voiced/unvoiced detection block 72.
  • in step 52 of the process, first and second components of the audio signal, representing first and second acoustic classes of the speech respectively, are extracted from the audio signal.
  • Extracting the first and second components of the audio signal may comprise identifying periods when the audio signal contains the first acoustic class of speech, and identifying periods when the audio signal contains the second acoustic class of speech. More specifically, extracting the first and second components of the audio signal may comprise identifying frames of the audio signal that contain the first acoustic class of speech, and frames that contain the second acoustic class of speech.
  • the first and second acoustic classes of the speech are voiced speech and unvoiced speech
  • there are several methods that can be used to identify voiced and unvoiced speech, for example: using a deep neural network (DNN), trained against a golden reference, for example using Praat software; performing an autocorrelation with unit delay on the speech signal (because voiced speech has a higher autocorrelation for non-zero lags); performing a linear predictive coding (LPC) analysis (because the initial reflection coefficient is a good indicator of voiced speech); looking at the zero-crossing rate of the speech signal (because unvoiced speech has a higher zero-crossing rate); looking at the short term energy of the signal (which tends to be higher for voiced speech); tracking the first formant frequency F0 (because unvoiced speech does not contain the first formant frequency); examining the error in a linear predictive coding (LPC) analysis (because the LPC prediction error is lower for voiced speech); or using automatic speech recognition to identify the words being spoken and hence the division of the speech into voiced and unvoiced speech.
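To make the simpler of these heuristics concrete, the following Python sketch classifies frames using only the zero-crossing-rate and short-term-energy cues mentioned above. It is an illustration rather than the patent's method: the threshold values, frame format and function names are assumptions, and a practical system might instead use the DNN or Praat-based approaches also listed.

```python
import numpy as np

def classify_frame(frame, zcr_threshold=0.15, energy_threshold=1e-4):
    """Label a single audio frame as 'voiced', 'unvoiced' or 'silence' using
    the zero-crossing-rate and short-term-energy heuristics.  The thresholds
    are illustrative and would need tuning in practice."""
    frame = np.asarray(frame, dtype=float)
    energy = np.mean(frame ** 2)                          # short-term energy
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0  # zero crossings per sample
    if energy < energy_threshold:
        return "silence"
    # voiced speech: relatively high energy, relatively low zero-crossing rate
    return "voiced" if zcr < zcr_threshold else "unvoiced"

def split_voiced_unvoiced(frames):
    """Partition a sequence of frames into the voiced component Sv and the
    unvoiced component Su (silence frames are discarded)."""
    labels = [classify_frame(f) for f in frames]
    voiced = [f for f, label in zip(frames, labels) if label == "voiced"]
    unvoiced = [f for f, label in zip(frames, labels) if label == "unvoiced"]
    return voiced, unvoiced
```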
  • the first and second acoustic classes of the speech are voiced speech and unvoiced speech.
  • the first and second acoustic classes of the speech may be any phonetically distinguishable acoustic classes. For example, they may be different phoneme classes, for example two different sets of vowels; they may be two different fricatives; or the first class may be fricatives while the second class may be sibilants.
  • the received signal may be supplied to a voice activity detection block, and only supplied to the voiced/unvoiced detection block 72 when it is determined that it does contain speech.
  • the step of identifying periods when the audio signal contains unvoiced speech may comprise identifying periods when the audio signal contains voiced speech, and identifying the remaining periods of speech as containing unvoiced speech.
  • the voiced/unvoiced detection block 72 may for example be based on Praat speech analysis software.
  • the voiced/unvoiced detection block 72 thus outputs the first component of the audio signal, Sv , representing voiced speech and the second component, Su , representing unvoiced speech.
  • by averaged spectra are meant spectra of the speech obtained and averaged over multiple frames.
  • the spectra can be averaged over enough data to provide reasonable confidence in the information that is obtained about the speech signal. In general terms, this information will become more reliable as more data is used to form the average spectra.
  • spectra averaged over 500ms of the relevant speech will be enough to provide reliable averaged spectra.
  • the length of time over which the averaged spectra are generated may be adapted based on the articulation rate of the speech, in order to ensure that the speech contains enough phonetic variation to provide a reliable average.
  • the length of time over which the averaged spectra are generated may be adapted based on the content of the speech. If the user is speaking a predetermined known phrase, this may be more discriminative than speaking words of the user's choosing, and so a useful average can be obtained in a shorter period.
  • the process illustrated in Figure 3 may be performed regularly while the user is speaking, providing regularly updated information at the end of the method as more speech is received. It may then be judged that enough speech has been processed when the results of the method converge to stable values.
  • the signal received on the input 70 is also passed to a speaker recognition block 74, which performs a voice biometric process to identify the speaker, from amongst a plurality of enrolled speakers.
  • the process of enrolment in a speaker recognition system typically involves the speaker providing a sample of speech, from which specific features are extracted, and the extracted features are used to form a model of the speaker's speech. In use, corresponding features are extracted from a sample of speech, and these are compared with the previously obtained model to obtain a measure of the likelihood that the speaker is the previously enrolled speaker.
  • the speaker recognition system attempts to identify one or more enrolled speaker without any prior expectation as to who the speaker should be. In other situations, there is a prior expectation as to who the speaker should be, for example because there is only one enrolled user of the particular device that is being used, or because the user has already identified themselves in some other way.
  • the speaker recognition block 74 is used to identify the speaker. In other examples, there may be an assumption that the speaker is a particular person, or is selected from a small group of people.
  • the first and second components of the audio signal are compared with models of the first acoustic class (for example the voiced component) of the speech of an enrolled user and of the second acoustic class (for example the unvoiced component) of the speech of the enrolled user.
  • comparing the first and second components of the audio signal with the models of the voiced and unvoiced speech of the enrolled user may comprise comparing magnitudes of the audio signal at a number of predetermined frequencies with magnitudes in the models
  • one or more speaker model is stored, for example in a database. Based on the output of the speaker recognition block 74, or based on a prior assumption as to who the speaker is expected to be, one or more speaker model is selected.
  • each speaker model contains separate models of the voiced speech and the unvoiced speech of the enrolled user. More specifically, the model of the voiced speech and the model of the unvoiced speech of the enrolled user each comprise amplitude values corresponding to multiple frequencies.
  • Figure 5 shows multiple speaker models.
  • each speaker model shown in Figure 5 comprises a long term averaged spectrum of the voiced components of the speech and a long term averaged spectrum of the unvoiced components of the speech.
  • These models are obtained from the respective speakers during previous separate enrolment processes, during which the speakers speak, either uttering predetermined standard test phrases or saying words of their own choosing.
  • Figure 5 shows the speaker models for five speakers, labelled Speaker 1 - Speaker 5.
  • the model for Speaker 1 comprises the long term averaged spectrum 90 of the voiced components of the speech and the long term averaged spectrum 91 of the unvoiced components of the speech;
  • the model for Speaker 2 comprises the long term averaged spectrum 92 of the voiced components of the speech and the long term averaged spectrum 93 of the unvoiced components of the speech;
  • the model for Speaker 3 comprises the long term averaged spectrum 94 of the voiced components of the speech and the long term averaged spectrum 95 of the unvoiced components of the speech;
  • the model for Speaker 4 comprises the long term averaged spectrum 96 of the voiced components of the speech and the long term averaged spectrum 97 of the unvoiced components of the speech;
  • the model for Speaker 5 comprises the long term averaged spectrum 98 of the voiced components of the speech and the long term averaged spectrum 99 of the unvoiced components of the speech.
  • the model of the speech comprises a vector containing amplitude values at a plurality of frequencies
  • the plurality of frequencies may be selected from within a frequency range that contains the most useful information for discriminating between speakers.
  • the range may be from 20Hz to 8kHz, or from 20Hz to 4kHz.
  • the frequencies at which the amplitude values are taken may be linearly spaced, with equal frequency spacings between each adjacent pair of frequencies.
  • the frequencies may be non-linearly spaced.
  • the frequencies may be equally spaced on the mel scale.
  • the number of amplitude values used to form the model of the speech may be chosen depending on the frequency spacings. For example, using linear spacings the model may contain amplitude values for 64 to 512 frequencies. Using mel spacings, it may be possible to use fewer frequencies, for example between 10 and 20 mel-spaced frequencies.
  • the model of the voiced speech may be indicated as Mv, where Mv represents a vector comprising one amplitude value at each of the selected frequencies, while the model of the unvoiced speech may be indicated as Mu , where Mu represents a vector comprising one amplitude value at each of the selected frequencies.
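As an illustration of how such a model vector might be built, the sketch below averages the magnitude spectra of many frames of one acoustic class and samples the result at mel-spaced frequencies, as suggested above. The FFT size, the 16-point mel grid and the 20 Hz to 4 kHz range are assumptions taken from the illustrative figures in the text, not prescribed values.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def long_term_average_spectrum(frames, fs, n_points=16, f_lo=20.0, f_hi=4000.0):
    """Average the magnitude spectra of many frames of one acoustic class
    (e.g. all voiced frames of an enrolment recording), then sample the
    average at mel-spaced frequencies to form a compact model vector such
    as Mv or Mu: one amplitude value at each selected frequency."""
    n_fft = 1024                                      # assumed FFT size
    spectra = [np.abs(np.fft.rfft(np.asarray(f, dtype=float), n_fft)) for f in frames]
    avg = np.mean(spectra, axis=0)                    # long term average spectrum
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    mel_freqs = mel_to_hz(np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_points))
    return np.interp(mel_freqs, fft_freqs, avg)       # amplitude at each mel-spaced frequency
```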
  • the received signal containing the user's speech will be affected by the properties of the channel, which we take to mean any factor that produces a difference between the user's speech and the speech signal as generated by the microphone, and the received signal will also be affected by noise.
  • the first and second components can be expressed as: Sv = a · Mv + n and Su = a · Mu + n, where
  • a represents the frequency spectrum of a multiplicative disturbance component, referred to herein as the channel, and
  • n represents the frequency spectrum of an additive disturbance component, referred to herein as the noise.
  • it may be advantageous to apply a low-pass filter, or a statistical filter such as a Savitzky-Golay filter, to the results in order to obtain low-pass filtered versions of the channel and noise characteristics.
  • a least squares method may be used to obtain solutions to the 2f different equations (two equations, in the two unknowns a and n, at each of the f frequencies considered).
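A minimal sketch of this estimation step is given below: at each frequency the two equations above are solved for the two unknowns a and n, here via a least squares call so the same code also copes with near-degenerate cases, with optional Savitzky-Golay smoothing of the raw estimates. The smoothing window, polynomial order and function names are illustrative assumptions, not values taken from the patent.

```python
import numpy as np
from scipy.signal import savgol_filter

def estimate_channel_and_noise(Sv, Su, Mv, Mu, smooth=True):
    """Solve, at each frequency f, the pair of equations
        Sv(f) = a(f) * Mv(f) + n(f)
        Su(f) = a(f) * Mu(f) + n(f)
    for the channel a and the noise n.  Inputs are the averaged spectra of
    the received voiced/unvoiced speech and the enrolled speaker's models,
    all sampled at the same set of frequencies (assumed to be at least 9)."""
    Sv, Su, Mv, Mu = (np.asarray(x, dtype=float) for x in (Sv, Su, Mv, Mu))
    a = np.empty_like(Sv)
    n = np.empty_like(Sv)
    for f in range(len(Sv)):
        # two equations in the two unknowns a(f) and n(f)
        A = np.array([[Mv[f], 1.0],
                      [Mu[f], 1.0]])
        b = np.array([Sv[f], Su[f]])
        (a[f], n[f]), *_ = np.linalg.lstsq(A, b, rcond=None)
    if smooth:
        # optional low-pass / Savitzky-Golay filtering of the raw estimates
        a = savgol_filter(a, window_length=9, polyorder=3)
        n = savgol_filter(n, window_length=9, polyorder=3)
    return a, n
```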
  • step 56 of the process shown in Figure 3 information is obtained about the channel and/or the noise affecting the audio signal.
  • This information can be used in many different ways.
  • Figure 6 illustrates one such use.
  • the system shown in Figure 6 is similar to the system of Figure 4, and the same reference numerals are used to refer to the same components of the system.
  • the comparison block 78 is used to obtain information about the channel a that is affecting the received audio signal.
  • the comparison block 78 may be used to obtain the frequency spectrum of the channel. This can be used to compensate the received audio signal to take account of the channel.
  • Figure 6 shows a channel compensation block 120, to which the audio signal received on the input 70 is supplied.
  • the channel compensation block 120 also receives the frequency spectrum of the channel a.
  • the channel compensation block 120 acts to remove the effects of the channel from the received signal, by dividing the received signal by the calculated channel a, before the received signal is passed to the speaker recognition block 74.
  • the output of the speaker recognition block 74, on the output 122 can be improved. That is, it can provide more reliable information about the identity of the speaker. This can then be supplied to a processing block 124 and used for any required purposes.
  • the output of the channel compensation block 120 containing the received signal after the effects of the channel have been removed, can be supplied to any suitable processing block 126, such as a speech recognition system, or the like.
  • Figure 7 illustrates another such use.
  • the system shown in Figure 7 is similar to the system of Figure 4, and the same reference numerals are used to refer to the same components of the system.
  • the comparison block 78 is used to obtain information about the noise n that is affecting the received audio signal. Specifically, the comparison block 78 may be used to obtain the frequency spectrum of the noise. This can be used to take account of the noise when processing the received audio signal.
  • Figure 7 shows a filter block 128, to which the audio signal received on the input 70 is supplied.
  • the filter block 128 also receives the frequency spectrum of the noise n.
  • the filter block 128 acts so as to ensure that noise does not adversely affect the operation of the speaker recognition block 74.
  • the calculated noise characteristic, n can be subtracted from the received signal before any further processing takes place.
  • the filter block 128 can remove the corrupted components of the received audio signal at those frequencies, before passing the signal to the speaker recognition block 74.
  • these components could instead be flagged as being potentially corrupted, before being passed to the speaker recognition block 74 or any further signal processing block.
  • the output of the speaker recognition block 74, on the output 122 can be improved. That is, it can provide more reliable information about the identity of the speaker. This can then be supplied to any suitable processing block 124, and used for any required purposes.
  • the output of the filter block 128, containing the received signal after the frequency components that are excessively corrupted by noise have been removed, can be supplied to any suitable processing block 130, such as a speech recognition system, or the like.
  • Figure 8 illustrates another such use.
  • the system shown in Figure 8 is similar to the system of Figure 4, and the same reference numerals are used to refer to the same components of the system.
  • the comparison block 78 is used to obtain information about the channel a and the noise n that are affecting the received audio signal. Specifically, the comparison block 78 may be used to obtain the frequency spectrum of the channel and of the noise. This can be used to take account of the channel and the noise when processing the received audio signal.
  • Figure 8 shows a combined filter block 134, to which the audio signal received on the input 70 is supplied.
  • the combined filter block 134 also receives the frequency spectrum of the channel a and the noise n.
  • the combined filter block 134 acts so as to ensure that channel effects and noise do not adversely affect the operation of the speaker recognition block 74.
  • the calculated noise characteristic, n can be subtracted from the received signal, and the remaining signal can be divided by the calculated channel a, before any further processing takes place.
  • the output of the speaker recognition block 74, on the output 122 can be improved. That is, it can provide more reliable information about the identity of the speaker. This can then be supplied to any suitable processing block 124, and used for any required purposes.
  • the output of the combined filter block 134 containing the received signal after the effects of the channel and the noise have been removed, can be supplied to any suitable processing block 136, such as a speech recognition system, or the like.
  • a further use of the information obtained about the channel and/or the noise affecting the audio signal is to overcome an attempt to deceive a voice biometric system by playing a recording of an enrolled user's voice in a so-called replay or spoof attack. Additionally, a further use of the information obtained about the channel and/or the noise affecting the audio signal is to remove their effects from a received audio signal, meaning that the average spectrum of the speech contained in the audio signal can be used as a biometric.
  • Figure 9 shows an example of a situation in which a replay attack is being performed.
  • the smartphone 10 is provided with voice biometric functionality.
  • the smartphone 10 is in the possession, at least temporarily, of an attacker, who has another smartphone 30.
  • the smartphone 30 has been used to record the voice of the enrolled user of the smartphone 10.
  • the smartphone 30 is brought close to the microphone inlet 12 of the smartphone 10, and the recording of the enrolled user's voice is played back. If the voice biometric system is unable to determine that the enrolled user's voice that it recognises is a recording, the attacker will gain access to one or more services that are intended to be accessible only by the enrolled user.
  • smartphones, such as the smartphone 30, are typically provided with loudspeakers that are of relatively low quality. Thus, the recording of an enrolled user's voice played back through such a loudspeaker will not be a perfect match with the user's voice, and this fact can be used to identify replay attacks.
  • Figure 10 illustrates the frequency response of a typical loudspeaker.
  • the loudspeaker suffers from low-frequency roll-off, as the bass response is limited by the size of the loudspeaker diaphragm.
  • the loudspeaker suffers from high-frequency roll-off.
  • there is a degree of pass-band ripple, as the magnitude of the response varies periodically between an upper level and a lower level.
  • the size of these effects will be determined by the quality of the loudspeaker.
  • the lower threshold frequency fL and the upper threshold frequency fu should be such that there is minimal low-frequency roll-off or high-frequency roll-off within the frequency range that is typically audible to humans.
  • size and cost constraints mean that many commercially available loudspeakers, such as those provided in smartphones such as the smartphone 30, do suffer from these effects to some extent.
  • the magnitude of the pass-band ripple, that is the difference between the upper and lower levels, will also depend on the quality of the loudspeaker. If the voice of a speaker is played back through a loudspeaker whose frequency response has the general form shown in Figure 10, then this may be detectable in the received audio signal containing the speech of that speaker. It has previously been recognised that, if a received audio signal has particular frequency characteristics, that may be a sign that the received audio signal is the result of a replay attack. However, the frequency characteristics of the received signal depend on other factors, such as the frequency characteristics of the speech itself, and the properties of any ambient noise, and so it is difficult to make a precise determination that a signal comes from a replay attack based only on the frequency characteristics of the received signal.
  • the method shown in Figure 3, and described with reference thereto, can be used to make a more reliable determination as to whether a signal comes from a replay attack.
  • the frequency characteristic of the ambient noise is determined, and this is subtracted from the received audio signal by means of the filter 128.
  • the received signal, with noise removed, is supplied to a processing block 130, which in this case may be a replay attack detection block.
  • the replay attack detection block may perform any of the methods disclosed in EP-2860706A, such as testing whether a particular spectral ratio (for example a ratio of the signal energy from 0-2kHz to the signal energy from 2-4kHz) has a value that may be indicative of replay through a loudspeaker, or whether the ratio of the energy within a certain frequency band to the energy of the complete frequency spectrum has a value that may be indicative of replay through a loudspeaker.
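The following sketch shows one way such a spectral ratio test could look. It is not the method of EP-2860706A itself; the decision threshold and the direction of the comparison are illustrative assumptions for the example.

```python
import numpy as np

def band_energy(signal, fs, f_lo, f_hi):
    """Energy of the signal within the band [f_lo, f_hi) Hz."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs < f_hi)
    return np.sum(spectrum[band])

def spectral_ratio_suggests_replay(signal, fs, threshold_db=6.0):
    """Compare the 0-2 kHz energy with the 2-4 kHz energy.  Replay through a
    small loudspeaker tends to depress the low band, so an unusually low
    ratio may indicate a replay attack.  The threshold is illustrative."""
    low = band_energy(signal, fs, 0.0, 2000.0)
    high = band_energy(signal, fs, 2000.0, 4000.0)
    ratio_db = 10.0 * np.log10(low / max(high, 1e-12))
    return ratio_db < threshold_db
```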
  • the method shown in Figure 3 is used to determine the frequency characteristic of the channel that affects the received speech. If the speech has been played back through a loudspeaker, the frequency response of the loudspeaker should be visible in the frequency characteristic of the channel.
  • Figure 11 is a flow chart, illustrating a method of determining whether the received signal may result from a replay attack.
  • an audio signal is received, representing speech.
  • step 142 information is obtained about a channel affecting said audio signal.
  • the information about the channel may be obtained by the method shown in Figure 3.
  • step 144 it is determined whether the channel has at least one characteristic of a loudspeaker.
  • determining whether the channel has at least one characteristic of a loudspeaker may comprise determining whether the channel has a low frequency roll-off.
  • the low-frequency roll-off may involve the measured channel decreasing at a relatively constant rate, such as 6dB per octave, for frequencies below a lower cut-off frequency fL, which may for example be in the range 50Hz - 700Hz.
  • determining whether the channel has at least one characteristic of a loudspeaker may comprise determining whether the channel has a high frequency roll-off.
  • the high-frequency roll-off may involve the measured channel decreasing at a relatively constant rate, such as 6dB per octave, for frequencies above an upper cut-off frequency fu, which may for example be in the range 18kHz - 24kHz.
  • determining whether the channel has at least one characteristic of a loudspeaker may comprise determining whether the channel has ripple in a pass-band thereof. For example, this may comprise applying a Welch periodogram to the channel, and determining whether there is a predetermined amount of ripple in the characteristic.
  • for example, the channel may be considered to have loudspeaker-like ripple if a degree of ripple (that is, a difference between the upper and lower levels in the frequency response shown in Figure 10) exceeding a threshold value such as 1dB, with a peak-to-trough frequency of about 100Hz, is found over the central part of the pass-band, for example from 100Hz - 10kHz.
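The sketch below combines the three checks described above (low-frequency roll-off, high-frequency roll-off and pass-band ripple) on an estimated channel response. The band edges and thresholds simply echo the illustrative values given in the text and would need tuning in practice; a classifier or neural network, as mentioned next, could replace the simple vote at the end.

```python
import numpy as np

def channel_has_loudspeaker_traits(freqs, channel, ripple_db=1.0, rolloff_db=6.0):
    """Check an estimated channel frequency response (magnitudes sampled at
    the frequencies 'freqs', in Hz) for three loudspeaker-like traits:
    low-frequency roll-off, high-frequency roll-off and pass-band ripple.
    Band edges and thresholds are illustrative, not values from the patent."""
    freqs = np.asarray(freqs, dtype=float)
    mag_db = 20.0 * np.log10(np.maximum(np.abs(np.asarray(channel, dtype=float)), 1e-12))

    passband = (freqs >= 100.0) & (freqs <= 10000.0)   # central pass-band
    ref = np.median(mag_db[passband])

    low_band = freqs < 50.0            # below an assumed lower cut-off
    high_band = freqs > 18000.0        # above an assumed upper cut-off
    low_rolloff = low_band.any() and mag_db[low_band].mean() < ref - rolloff_db
    high_rolloff = high_band.any() and mag_db[high_band].mean() < ref - rolloff_db

    # pass-band ripple: peak-to-trough variation within the central band
    ripple = (mag_db[passband].max() - mag_db[passband].min()) > ripple_db

    # treat two or three positive checks as characteristic of a loudspeaker
    return sum([bool(low_rolloff), bool(high_rolloff), bool(ripple)]) >= 2
```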
  • two or three of the steps 146, 148 and 150 may be performed, with the results being applied to a classifier, to determine whether the results of those steps are indeed characteristic of a loudspeaker frequency response.
  • the channel frequency response can be applied as an input to a neural network, which has been trained to distinguish channels that are characteristic of loudspeakers from other channels.
  • Figure 12 is a flow chart, illustrating a method of speaker identification
  • Figure 13 is a block diagram of a system for performing speaker identification.
  • the system may be implemented in a smartphone, such as the smartphone 10, or any other device with voice biometric functionality.
  • the smartphone 10 is able to perform various functions in response to spoken commands from an enrolled user.
  • the biometric functionality is able to distinguish between spoken commands from the enrolled user, and the same commands when spoken by a different person.
  • certain embodiments of the invention relate to operation of a smartphone or another portable electronic device with some sort of voice operability, for example a tablet or laptop computer, a games console, a home control system, a home entertainment system, an in-vehicle entertainment system, a domestic appliance, or the like, in which the voice biometric functionality is performed in the device that is intended to carry out the spoken command.
  • Certain other embodiments relate to systems in which the voice biometric functionality is performed on a smartphone or other device, which then transmits the commands to a separate device if the voice biometric functionality is able to confirm that the speaker was the enrolled user.
  • the spoken commands are transmitted using the transceiver 18 to a remote speech recognition system, which determines the meaning of the spoken commands.
  • the speech recognition system may be located on one or more remote server in a cloud computing environment. Signals based on the meaning of the spoken commands are then returned to the smartphone 10 or other local device.
  • the signal generated by a microphone 12 in response to ambient sound is received.
  • the received signal is divided into frames, which may for example have lengths in the range of 10-100 ms. These frames can be analysed to determine whether they represent speech, and only frames that represent speech are considered further.
  • the frames that represent speech are passed to a channel/noise removal block 180 and, in step 162 of the method, the effects of a channel and/or noise are removed from the received audio signal to obtain a cleaned audio signal.
  • the effects of the channel and/or noise can be determined by the method described above, or by any other suitable method, leaving a cleaned audio signal that is not adversely affected by any channel or noise effects.
  • the cleaned audio signal is passed to an averaging block 182, which obtains an average spectrum of at least a part of the cleaned audio signal.
  • the average spectrum is a spectrum of the relevant part or parts of the speech obtained and averaged over multiple frames.
  • the spectrum or spectra can be averaged over enough data to provide reasonable confidence in the average. In general terms, this average will become more reliable as more data is used to form the average spectrum or spectra. In some cases, spectra averaged over 500ms of the relevant speech will be enough to provide reliable averaged spectra.
  • the length of time over which the averaged spectrum or spectra are generated may be adapted based on the articulation rate of the speech, in order to ensure that the speech contains enough phonetic variation to provide a reliable average.
  • the length of time over which the averaged spectrum or spectra are generated may be adapted based on the content of the speech.
  • an average spectrum of at least a part of the cleaned audio signal is obtained in step 164.
  • this may comprise obtaining an average spectrum for parts of the cleaned audio signal representing one or more audio classes.
  • one or more components of the cleaned audio signal, representing different acoustic classes of the speech are extracted from the cleaned audio signal. Extracting the or each component of the cleaned audio signal may comprise identifying periods when the cleaned audio signal contains the relevant acoustic class of speech. More specifically, extracting the component or components of the cleaned audio signal may comprise identifying frames of the cleaned audio signal that contain the relevant acoustic class of speech.
  • obtaining an average spectrum of at least a part of the cleaned audio signal comprises obtaining an average spectrum of a part of the cleaned audio signal representing voiced speech. In some other embodiments, obtaining an average spectrum of at least a part of the cleaned audio signal comprises obtaining a first average spectrum of a part of the cleaned audio signal representing voiced speech and obtaining a second average spectrum of a part of the cleaned audio signal representing unvoiced speech.
  • the method involves obtaining an average spectrum for parts of the cleaned audio signal representing one or more audio classes, and the acoustic class is voiced speech (or the first and second acoustic classes of the speech are voiced speech and unvoiced speech)
  • there are several methods that can be used to identify voiced and unvoiced speech, for example: using a deep neural network (DNN), trained against a golden reference, for example using Praat software; performing an autocorrelation with unit delay on the speech signal (because voiced speech has a higher autocorrelation for non-zero lags); performing a linear predictive coding (LPC) analysis (because the initial reflection coefficient is a good indicator of voiced speech); looking at the zero-crossing rate of the speech signal (because unvoiced speech has a higher zero-crossing rate); looking at the short term energy of the signal (which tends to be higher for voiced speech); and the other techniques described above.
  • the acoustic classes of the speech may be voiced speech and unvoiced speech.
  • the acoustic classes of the speech may be any phonetically distinguishable acoustic classes. For example, they may be different phoneme classes, for example two different sets of vowels; they may be two different fricatives; or a first class may be fricatives while a second class may be sibilants.
  • the obtained average spectrum of at least a part of the cleaned audio signal is passed to a comparison block 184.
  • the comparison block 184 also receives one or more long term average speaker model for one or more enrolled speaker.
  • long term average speaker model means that enough of the speech of the enrolled speaker was used to form the model, either during enrolment or subsequently, that the model is relatively stable. In some embodiments or situations, there is only one enrolled speaker, and so the comparison block 184 receives the one or more long term average speaker model for that enrolled speaker. In some other embodiments or situations, there is more than one enrolled speaker, and so the comparison block 184 receives the one or more long term average speaker model for each enrolled speaker.
  • the comparison block 184 may additionally or alternatively receive a Universal Background Model (UBM), for example in the form of a model of the statistically average user.
  • the one or more long term average speaker model, and the Universal Background Model (UBM) if used, are stored in a model database 186.
  • the comparison block 184 may receive one or more long term average speaker model corresponding to the part of the cleaned audio signal for which the average spectrum was obtained.
  • obtaining an average spectrum of at least a part of the cleaned audio signal may comprise obtaining a first average spectrum of a part of the cleaned audio signal representing voiced speech and obtaining a second average spectrum of a part of the cleaned audio signal representing unvoiced speech.
  • the average spectrum of a part of the cleaned audio signal representing voiced speech can be calculated as SCv = (Sv - n) / a, and the average spectrum of a part representing unvoiced speech can similarly be calculated as SCu = (Su - n) / a.
  • the first average spectrum SCv is compared with a long term average speaker model Mv for voiced speech of the or each enrolled speaker being considered by the comparison block 184
  • the second average spectrum SCu is compared with a long term average speaker model Mu for unvoiced speech of the or each enrolled speaker being considered by the comparison block 184.
  • step 168 of the method the result of the comparison is passed to a determination block 188, which determines based on the comparison whether the speech is the speech of the enrolled speaker being considered by the comparison block 184. As mentioned above, this determination may be an accept/reject decision based on the comparison, as to whether the received speech matches sufficiently closely with the enrolled user who was expected to be the speaker.
  • a small number of speakers are enrolled, and suitable models of their speech are obtained during an enrolment process. Then, the determination made by the determination block 188 concerns which of those enrolled speakers was the most likely candidate as the source of the speech in the received audio signal. This determination may be based on the respective Log Spectral Distances (LSD) of the received speech from the different models, or may use Principal component analysis (PCA) or Linear discriminative analysis (LDA), as examples. When a Universal Background Model (UBM) is also considered, then the determination may take into account the result of a comparison between the received speech, the model of the enrolled user's speech, and the background model.
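As an illustration of such a determination, the sketch below scores the cleaned voiced spectrum against each enrolled speaker's model using the Log Spectral Distance, with an optional Universal Background Model check. The rejection margin, the dictionary-based model store and the function names are assumptions made for the example; PCA or LDA based scoring, mentioned above, would be alternatives.

```python
import numpy as np

def log_spectral_distance(spectrum, model):
    """Root-mean-square distance between two magnitude spectra, in dB."""
    s_db = 20.0 * np.log10(np.maximum(np.asarray(spectrum, dtype=float), 1e-12))
    m_db = 20.0 * np.log10(np.maximum(np.asarray(model, dtype=float), 1e-12))
    return np.sqrt(np.mean((s_db - m_db) ** 2))

def identify_speaker(cleaned_voiced_spectrum, enrolled_models, ubm=None, margin_db=3.0):
    """Return the enrolled speaker whose voiced model Mv is closest to the
    cleaned average spectrum SCv, or None if the UBM (when supplied) fits
    better than every enrolled model by the given margin.
    enrolled_models maps a speaker name to that speaker's Mv vector."""
    distances = {name: log_spectral_distance(cleaned_voiced_spectrum, mv)
                 for name, mv in enrolled_models.items()}
    best = min(distances, key=distances.get)
    if ubm is not None:
        # if the statistically average speaker fits clearly better, reject
        if log_spectral_distance(cleaned_voiced_spectrum, ubm) + margin_db < distances[best]:
            return None
    return best
```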
  • Figure 14 is another block diagram of a system for performing speaker identification.
  • the system may be implemented in a smartphone, such as the smartphone 10, or any other device with voice biometric functionality.
  • the smartphone 10 is able to perform various functions in response to spoken commands from an enrolled user.
  • the biometric functionality is able to distinguish between spoken commands from the enrolled user, and the same commands when spoken by a different person.
  • certain embodiments of the invention relate to operation of a smartphone or another portable electronic device with some sort of voice operability, for example a tablet or laptop computer, a games console, a home control system, a home entertainment system, an in-vehicle entertainment system, a domestic appliance, or the like, in which the voice biometric functionality is performed in the device that is intended to carry out the spoken command.
  • Certain other embodiments relate to systems in which the voice biometric functionality is performed on a smartphone or other device, which then transmits the commands to a separate device if the voice biometric functionality is able to confirm that the speaker was the enrolled user.
  • the spoken commands are transmitted using the transceiver 18 to a remote speech recognition system, which determines the meaning of the spoken commands.
  • the speech recognition system may be located on one or more remote server in a cloud computing environment. Signals based on the meaning of the spoken commands are then returned to the smartphone 10 or other local device.
  • Some embodiments are particularly suited to use in devices, such as home control systems, home entertainment systems, or in-vehicle entertainment systems, in which there will often be multiple enrolled users (for example between two and ten such users), and where the intended operation to be performed in response to a spoken command (such as "play my favourite music", or "increase the temperature in my room", for example) will depend on the identity of the speaker.
  • the signal generated by a microphone 12 in response to ambient sound is received.
  • the received signal is divided into frames, which may for example have lengths in the range of 10-100 ms. These frames can be analysed to determine whether they represent speech, and only frames that represent speech are considered further.
  • Extracting the or each component of the cleaned audio signal may comprise identifying periods when the audio signal contains the relevant acoustic class of speech. More specifically, extracting the component or components of the audio signal may comprise identifying frames of the audio signal that contain the relevant acoustic class of speech.
  • the extraction block 192 is a voiced/unvoiced detector (VU), which extracts respective components representing voiced and unvoiced speech, and outputs an average spectrum Sv of a part of the audio signal representing voiced speech, and an average spectrum Su of a part of the audio signal representing unvoiced speech.
  • the first and second acoustic classes of the speech are voiced speech and unvoiced speech
  • there are several methods that can be used to identify voiced and unvoiced speech, for example: using a deep neural network (DNN), trained against a golden reference, for example using Praat software; performing an autocorrelation with unit delay on the speech signal (because voiced speech has a higher autocorrelation for non-zero lags); performing a linear predictive coding (LPC) analysis (because the initial reflection coefficient is a good indicator of voiced speech); looking at the zero-crossing rate of the speech signal (because unvoiced speech has a higher zero-crossing rate); looking at the short term energy of the signal (which tends to be higher for voiced speech); tracking the first formant frequency F0 (because unvoiced speech does not contain the first formant frequency); examining the error in a linear predictive coding (LPC) analysis (because the LPC prediction error is lower for voiced speech); or using automatic speech recognition to identify the words being spoken and hence the division of the speech into voiced and unvoiced speech.
  • the acoustic classes of the speech may be voiced speech and unvoiced speech.
  • the acoustic classes of the speech may be any phonetically distinguishable acoustic classes. For example, they may be different phoneme classes, for example two different sets of vowels; they may be two different fricatives; or a first class may be fricatives while a second class may be sibilants.
  • the average spectra of the two components of the signal representing the two acoustic classes of the speech are then passed to a channel/noise calculation and removal block 194.
  • the system is provided with a purported identity of the speaker, and it is required to determine whether the received signal has in fact come from that speaker (referred to as speaker verification).
  • the system has multiple enrolled speakers, but has no further information as to which of the enrolled speakers is speaking at any given time, and it is required to identify which of those enrolled speakers is the speaker (referred to as speaker identification).
  • the system includes a database 196, which stores a long term average speaker model Mv for voiced speech of the or each enrolled speaker and a long term average speaker model Mu for unvoiced speech of the or each enrolled speaker (or models of other acoustic classes of the speech of each enrolled speaker).
  • the system may be required to perform speaker verification, or speaker identification.
  • the average spectrum Sv of the part of the audio signal representing voiced speech, and the average spectrum Su of the part of the audio signal representing unvoiced speech are combined with the model Mv for voiced speech of the purported speaker and the long term average speaker model Mu for unvoiced speech of the purported speaker to obtain values for the channel, a, and for the noise, n.
  • the channel/noise calculation and removal block 194 then removes the effect of the calculated channel and noise, to obtain a cleaned measurement SCv of the average spectrum of the voiced speech, calculated as SCv = (Sv - n) / a.
  • a cleaned measurement SCu of the average spectrum of the unvoiced speech can be similarly calculated as SCu = (Su - n) / a.
  • the cleaned measurement of the average spectrum of the relevant part of the speech is then passed to a comparison block 198, for comparison with the respective model of that part of the speech of the purported user.
  • the comparison score is output, indicating whether the cleaned measurement(s) of the average spectrum of the relevant part(s) of the speech is/are close enough to the model(s) to have a required degree of confidence that the signal comes from the speech of the purported speaker.
  • the comparison block 198 may additionally receive a Universal Background Model (UBM), for example in the form of a model of the statistically average user, from the database 196, and may use this when providing the output comparison score.
  • the average spectrum Sv of the part of the audio signal representing voiced speech, and the average spectrum Su of the part of the audio signal representing unvoiced speech, are combined with the respective models Mv for voiced speech of each enrolled speaker and the long term average speaker model Mu for unvoiced speech of each enrolled speaker to obtain preliminary or hypothetical values for the channel, a, and for the noise, n. Specifically, as before, the equations Sv = a · Mv + n and Su = a · Mu + n are solved separately using each enrolled speaker's models, giving a hypothetical channel and noise estimate for each enrolled speaker.
  • the channel/noise calculation and removal block 194 removes the effect of each of the calculated channel and noise values from the received signal, to obtain respective cleaned hypothetical measurements SCv of the average spectrum of the voiced speech, on the assumption that the speaker was the person whose speech model was used as the basis for those calculated values of the channel and noise.
  • for example, SCvA = (Sv - nA) / aA for enrolled speaker A, and SCvB = (Sv - nB) / aB for enrolled speaker B.
  • SCvA is compared with the model MvA for enrolled speaker A, and
  • SCvB is compared with the model MvB for enrolled speaker B.
  • the comparison score is then output, indicating whether the hypothetical cleaned measurement of the average spectrum of the relevant part of the speech for one of the enrolled speakers is close enough to the respective model to have a required degree of confidence that the signal comes from the speech of that enrolled speaker.
  • the result output by the comparison block 198 may simply indicate which of those enrolled speakers was the most likely candidate as the source of the speech in the received audio signal.
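Pulling the above together, the sketch below runs the per-speaker hypothesis test: for each enrolled speaker the channel and noise are estimated from that speaker's models, hypothetical channels that are physically implausible (excessive variation across frequency or sharp discontinuities, as discussed earlier) are rejected, and the remaining candidates are scored. Note that smoothing a and n matters here: with the exact per-frequency solution, the cleaned spectrum would reproduce the model by construction, so it is the smoothed (low-pass filtered) estimates that make the comparison informative. All thresholds, the moving-average smoother and the helper names are illustrative assumptions.

```python
import numpy as np

def smooth(x, k=5):
    """Moving-average stand-in for the low-pass / Savitzky-Golay filtering
    of the raw channel and noise estimates mentioned earlier."""
    return np.convolve(x, np.ones(k) / k, mode="same")

def channel_is_plausible(a, max_variation_db=30.0, max_step_db=6.0):
    """Reject hypothetical channels that vary too much across frequency or
    that contain sharp discontinuities (both thresholds are illustrative)."""
    a_db = 20.0 * np.log10(np.maximum(np.abs(a), 1e-12))
    return (a_db.max() - a_db.min() <= max_variation_db
            and np.max(np.abs(np.diff(a_db))) <= max_step_db)

def score_enrolled_speakers(Sv, Su, speaker_models):
    """For each enrolled speaker, hypothesise that they are talking, estimate
    the channel a and noise n from their models Mv/Mu, discard speakers whose
    hypothetical channel is physically implausible, and score the cleaned
    voiced spectrum SCv against Mv.  speaker_models maps a speaker name to a
    dict holding that speaker's 'Mv' and 'Mu' model vectors."""
    Sv, Su = np.asarray(Sv, dtype=float), np.asarray(Su, dtype=float)
    scores = {}
    for name, m in speaker_models.items():
        Mv, Mu = np.asarray(m["Mv"], dtype=float), np.asarray(m["Mu"], dtype=float)
        denom = np.where(np.abs(Mv - Mu) < 1e-9, 1e-9, Mv - Mu)
        a = smooth((Sv - Su) / denom)        # closed-form solution, then smoothed
        n = smooth(Sv - a * Mv)
        if not channel_is_plausible(a):
            continue                         # cannot plausibly be this speaker
        SCv = (Sv - n) / np.where(np.abs(a) < 1e-9, 1e-9, a)  # cleaned voiced spectrum
        lsd = np.sqrt(np.mean((20 * np.log10(np.maximum(np.abs(SCv), 1e-12))
                               - 20 * np.log10(np.maximum(Mv, 1e-12))) ** 2))
        scores[name] = lsd
    return min(scores, key=scores.get) if scores else None
```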
  • some aspects of the methods described above may be implemented as processor control code, for example on a non-volatile carrier medium such as a disk, CD- or DVD-ROM, programmed memory such as read only memory (Firmware), or on a data carrier such as an optical or electrical signal carrier.
  • for many applications, embodiments will be implemented on a DSP (Digital Signal Processor) or an ASIC (Application Specific Integrated Circuit).
  • the code may comprise conventional program code or microcode or, for example code for setting up or controlling an ASIC or FPGA.
  • the code may also comprise code for dynamically configuring re-configurable apparatus such as re-programmable logic gate arrays.
  • the code may comprise code for a hardware description language such as Verilog™ or VHDL (Very high speed integrated circuit Hardware Description Language).
  • the code may be distributed between a plurality of coupled components in communication with one another.
  • the embodiments may also be implemented using code running on a field-(re)programmable analogue array or similar device in order to configure analogue hardware.
  • the term module shall be used to refer to a functional unit or block which may be implemented at least partly by dedicated hardware components such as custom defined circuitry and/or at least partly by one or more software processors or appropriate code running on a suitable general purpose processor or the like.
  • a module may itself comprise other modules or functional units.
  • a module may be provided by multiple components or sub-modules which need not be co-located and could be provided on different integrated circuits and/or running on different processors.
  • Embodiments may be implemented in a host device, especially a portable and/or battery powered host device such as a mobile computing device for example a laptop or tablet computer, a games console, a remote control device, a home automation controller or a domestic appliance including a domestic temperature or lighting control system, a toy, a machine such as a robot, an audio player, a video player, or a mobile telephone for example a smartphone.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephone Function (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A method of analysis of an audio signal comprises: receiving an audio signal representing speech; extracting first and second components of the audio signal representing first and second acoustic classes of the speech respectively; analysing the first and second components of the audio signal with models of the first and second acoustic classes of the speech of an enrolled user. Based on the analysing, information is obtained about at least one of a channel and noise affecting the audio signal.

Description

ANALYSING SPEECH SIGNALS
Technical Field
Embodiments described herein relate to methods and devices for analysing speech signals.
Background
Many devices include microphones, which can be used to detect ambient sounds. In many situations, the ambient sounds include the speech of one or more nearby speaker. Audio signals generated by the microphones can be used in many ways. For example, audio signals representing speech can be used as the input to a speech recognition system, allowing a user to control a device or system using spoken commands.
Summary
According to a first aspect of the present invention, there is provided a method of analysis of an audio signal, the method comprising: receiving an audio signal representing speech; extracting first and second components of the audio signal representing first and second acoustic classes of the speech respectively; analysing the first and second components of the audio signal with models of the first and second acoustic classes of the speech of an enrolled user; and based on said analysing, obtaining information about at least one of a channel and noise affecting said audio signal.
According to another aspect of the present invention, there is provided a system for analysing an audio signal, configured for performing the method.
According to another aspect of the present invention, there is provided a device comprising such a system. The device may comprise a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
According to another aspect of the present invention, there is provided a computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to the first aspect.
According to another aspect of the present invention, there is provided a non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to the first aspect.
According to a second aspect of the invention, there is provided a method of speaker identification, comprising: receiving an audio signal representing speech; removing effects of a channel and/or noise from the received audio signal to obtain a cleaned audio signal; obtaining an average spectrum of at least a part of the cleaned audio signal; comparing the average spectrum with a long term average speaker model for an enrolled speaker; and determining based on the comparison whether the speech is the speech of the enrolled speaker.
Obtaining an average spectrum of at least a part of the cleaned audio signal may comprise obtaining an average spectrum of a part of the cleaned audio signal representing voiced speech.
Obtaining an average spectrum of at least a part of the cleaned audio signal may comprise obtaining a first average spectrum of a part of the cleaned audio signal representing a first acoustic class and obtaining a second average spectrum of a part of the cleaned audio signal representing a second acoustic class, and comparing the average spectrum with a long term average speaker model for an enrolled speaker may comprise comparing the first average spectrum with a long term average speaker model for the first acoustic class for the enrolled speaker and comparing the second average spectrum with a long term average speaker model for the second acoustic class for the enrolled speaker.
The first acoustic class may be voiced speech and the second acoustic class unvoiced speech. The method may comprise comparing the average spectrum with respective long term average speaker models for each of a plurality of enrolled speakers; and determining based on the comparison whether the speech is the speech of one of the enrolled speakers.
The method may further comprise comparing the average spectrum with a Universal Background Model; and including a result of the comparing the average spectrum with the Universal Background Model in determining whether the speech is the speech of one of the enrolled speakers.
The method may comprise identifying one of the enrolled speakers as a most likely candidate as a source of the speech.
The method may comprise: obtaining information about the effects of a channel and/or noise on the received audio signal by: receiving the audio signal representing speech; extracting first and second components of the audio signal representing first and second acoustic classes of the speech respectively; analysing the first and second components of the audio signal with models of the first and second acoustic classes of the speech of an enrolled user; and, based on said analysing, obtaining information about at least one of a channel and noise affecting said audio signal. The method may comprise analysing the first and second components of the audio signal with models of the first and second acoustic classes of the speech of a plurality of enrolled users, to obtain respective hypothetical values of the channel, and determining that the speech is not the speech of any enrolled speaker whose models give rise to physically implausible hypothetical values of the channel.
A hypothetical value of the channel may be considered to be physically implausible if it contains variations exceeding a threshold level across the relevant frequency range. A hypothetical value of the channel may be considered to be physically implausible if it contains significant discontinuities.
According to another aspect of the present invention, there is provided a system for analysing an audio signal, configured for performing the method.
According to another aspect of the present invention, there is provided a device comprising such a system. The device may comprise a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
According to another aspect of the present invention, there is provided a computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to the second aspect.
According to another aspect of the present invention, there is provided a non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to the second aspect.
Brief Description of Drawings
For a better understanding of the present invention, and to show how it may be put into effect, reference will now be made to the accompanying drawings, in which:-
Figure 1 illustrates a smartphone;
Figure 2 is a schematic diagram, illustrating the form of the smartphone;
Figure 3 is a flow chart illustrating a method of analysing an audio signal;
Figure 4 is a block diagram illustrating a system for analysing an audio signal;
Figure 5 illustrates results in the method of Figure 3;
Figure 6 is a block diagram illustrating an alternative system for analysing an audio signal;
Figure 7 is a block diagram illustrating a further alternative system for analysing an audio signal;
Figure 8 is a block diagram illustrating a further alternative system for analysing an audio signal;
Figure 9 illustrates a possible replay attack on a voice biometric system;
Figure 10 illustrates an effect of a replay attack;
Figure 11 is a flow chart illustrating a method of detecting a replay attack;
Figure 12 is a flow chart, illustrating a method of identifying a speaker;
Figure 13 is a block diagram illustrating a system for identifying a speaker; and
Figure 14 is a block diagram illustrating a system for identifying a speaker.
Detailed Description of Embodiments
The description below sets forth example embodiments according to this disclosure. Further example embodiments and implementations will be apparent to those having ordinary skill in the art. Further, those having ordinary skill in the art will recognize that various equivalent techniques may be applied in lieu of, or in conjunction with, the embodiments discussed below, and all such equivalents should be deemed as being encompassed by the present disclosure.
The methods described herein can be implemented in a wide range of devices and systems. However, for ease of explanation of one embodiment, an illustrative example will be described, in which the implementation occurs in a smartphone.
Figure 1 illustrates a smartphone 10, having a microphone 12 for detecting ambient sounds. In normal use, the microphone is of course used for detecting the speech of a user who is holding the smartphone 10 close to their face.
Figure 2 is a schematic diagram, illustrating the form of the smartphone 10.
Specifically, Figure 2 shows various interconnected components of the smartphone 10. It will be appreciated that the smartphone 10 will in practice contain many other components, but the following description is sufficient for an understanding of the present invention.
Thus, Figure 2 shows the microphone 12 mentioned above. In certain embodiments, the smartphone 10 is provided with multiple microphones 12, 12a, 12b, etc.
Figure 2 also shows a memory 14, which may in practice be provided as a single component or as multiple components. The memory 14 is provided for storing data and program instructions.
Figure 2 also shows a processor 16, which again may in practice be provided as a single component or as multiple components. For example, one component of the processor 16 may be an applications processor of the smartphone 10.
Figure 2 also shows a transceiver 18, which is provided for allowing the smartphone 10 to communicate with external networks. For example, the transceiver 18 may include circuitry for establishing an internet connection either over a WiFi local area network or over a cellular network.
Figure 2 also shows audio processing circuitry 20, for performing operations on the audio signals detected by the microphone 12 as required. For example, the audio processing circuitry 20 may filter the audio signals or perform other signal processing operations.
In this embodiment, the smartphone 10 is provided with voice biometric functionality, and with control functionality. Thus, the smartphone 10 is able to perform various functions in response to spoken commands from an enrolled user. The biometric functionality is able to distinguish between spoken commands from the enrolled user, and the same commands when spoken by a different person.
Thus, certain embodiments of the invention relate to operation of a smartphone or another portable electronic device with some sort of voice operability, for example a tablet or laptop computer, a games console, a home control system, a home entertainment system, an in-vehicle entertainment system, a domestic appliance, or the like, in which the voice biometric functionality is performed in the device that is intended to carry out the spoken command. Certain other embodiments relate to systems in which the voice biometric functionality is performed on a smartphone or other device, which then transmits the commands to a separate device if the voice biometric functionality is able to confirm that the speaker was the enrolled user.
In some embodiments, while voice biometric functionality is performed on the smartphone 10 or other device that is located close to the user, the spoken commands are transmitted using the transceiver 18 to a remote speech recognition system, which determines the meaning of the spoken commands. For example, the speech recognition system may be located on one or more remote server in a cloud computing environment. Signals based on the meaning of the spoken commands are then returned to the smartphone 10 or other local device.
Methods described herein proceed from the recognition that different parts of a user's speech have different properties.
Specifically, it is known that speech can be divided into voiced sounds and unvoiced or voiceless sounds. A voiced sound is one in which the vocal cords of the speaker vibrate, and a voiceless sound is one in which they do not.
It is now recognised that the voiced and unvoiced sounds have different frequency properties, and that these different frequency properties can be used to obtain useful information about the speech signal.
Figure 3 is a flow chart, illustrating a method of analysing an audio signal, and Figure 4 is a block diagram illustrating functional blocks in the analysis system.
Specifically, in step 50 in the method of Figure 3, an audio signal, which is expected to contain speech, is received on an input 70 of the system shown in Figure 4. The received signal is divided into frames, which may for example have lengths in the range of 10-100 ms, and then passed to a voiced/unvoiced detection block 72. Thus, in step 52 of the process, first and second components of the audio signal,
representing different first and second acoustic classes of the speech, are extracted from the received signal. Extracting the first and second components of the audio signal may comprise identifying periods when the audio signal contains the first acoustic class of speech, and identifying periods when the audio signal contains the second acoustic class of speech. More specifically, extracting the first and second components of the audio signal may comprise identifying frames of the audio signal that contain the first acoustic class of speech, and frames that contain the second acoustic class of speech.
When the first and second acoustic classes of the speech are voiced speech and unvoiced speech, there are several methods that can be used to identify voiced and unvoiced speech, for example: using a deep neural network (DNN), trained against a golden reference, for example using Praat software; performing an autocorrelation with unit delay on the speech signal (because voiced speech has a higher autocorrelation for non-zero lags); performing a linear predictive coding (LPC) analysis (because the initial reflection coefficient is a good indicator of voiced speech); looking at the zero-crossing rate of the speech signal (because unvoiced speech has a higher zero-crossing rate); looking at the short term energy of the signal (which tends to be higher for voiced speech); tracking the first formant frequency F0 (because unvoiced speech does not contain the first formant frequency); examining the error in a linear predictive coding (LPC) analysis (because the LPC prediction error is lower for voiced speech); using automatic speech recognition to identify the words being spoken and hence the division of the speech into voiced and unvoiced speech; or fusing any or all of the above.
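As an illustration only, two of the cues listed above (the zero-crossing rate and the short term energy) can be combined into a very simple frame classifier. The sketch below is not taken from this disclosure; the thresholds are hypothetical values, and a trained classifier such as a DNN would normally be preferred.

```python
import numpy as np

def classify_frames(frames, zcr_threshold=0.1, energy_threshold=1e-4):
    """Label each frame as 'voiced', 'unvoiced' or 'silence' (illustrative sketch).

    frames: 2-D array (num_frames x frame_length) of speech samples.
    Voiced speech tends to have high short-term energy and a low
    zero-crossing rate; unvoiced speech tends to show the opposite.
    """
    labels = []
    for frame in frames:
        energy = np.mean(frame ** 2)                           # short-term energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0   # zero-crossing rate per sample
        if energy < energy_threshold:
            labels.append('silence')     # too quiet to treat as speech
        elif zcr < zcr_threshold:
            labels.append('voiced')      # low ZCR with sufficient energy
        else:
            labels.append('unvoiced')    # high ZCR
    return labels
```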
In the embodiments described further below, the first and second acoustic classes of the speech are voiced speech and unvoiced speech. However, the first and second acoustic classes of the speech may be any phonetically distinguishable acoustic classes. For example, they may be different phoneme classes, for example two different sets of vowels; they may be two different fricatives; or the first class may be fricatives while the second class are sibilants.
The received signal may be supplied to a voice activity detection block, and only supplied to the voiced/unvoiced detection block 72 when it is determined that it does contain speech. In that case, or otherwise when there is reason to believe that the audio signal contains only speech, the step of identifying periods when the audio signal contains unvoiced speech may comprise identifying periods when the audio signal contains voiced speech, and identifying the remaining periods of speech as containing unvoiced speech. The voiced/unvoiced detection block 72 may for example be based on Praat speech analysis software.
The voiced/unvoiced detection block 72 thus outputs the first component of the audio signal, Sv , representing voiced speech and the second component, Su , representing unvoiced speech.
More specifically, in some embodiments, the first component of the audio signal, Sv, representing voiced speech and the second component, Su, representing unvoiced speech, are averaged spectra of the voiced and unvoiced components of the speech. By averaged spectra are meant spectra of the speech obtained and averaged over multiple frames.
The spectra can be averaged over enough data to provide reasonable confidence in the information that is obtained about the speech signal. In general terms, this information will become more reliable as more data is used to form the average spectra.
In some cases, spectra averaged over 500ms of the relevant speech will be enough to provide reliable averaged spectra. The length of time over which the averaged spectra are generated may be adapted based on the articulation rate of the speech, in order to ensure that the speech contains enough phonetic variation to provide a reliable average. The length of time over which the averaged spectra are generated may be adapted based on the content of the speech. If the user is speaking a predetermined known phrase, this may be more discriminative than speaking words of the user's choosing, and so a useful average can be obtained in a shorter period. The process illustrated in Figure 3 may be performed regularly while the user is speaking, providing regularly updated information at the end of the method as more speech is received. It may then be judged that enough speech has been processed when the results of the method converge to stable values.
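As a minimal sketch of how such an averaged spectrum might be formed, assuming the frames of one acoustic class have already been collected into a numpy array (the windowing choice is an assumption of the sketch, not a requirement of the method):

```python
import numpy as np

def averaged_spectrum(frames):
    """Average magnitude spectrum over all frames of one acoustic class.

    frames: 2-D array (num_frames x frame_length). A Hann window is applied
    to each frame before the FFT; averaging over many frames (e.g. around
    500 ms of the relevant class) smooths out phonetic detail.
    """
    window = np.hanning(frames.shape[1])
    spectra = np.abs(np.fft.rfft(frames * window, axis=1))
    return spectra.mean(axis=0)   # this becomes Sv or Su, depending on the class
```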
The signal received on the input 70 is also passed to a speaker recognition block 74, which performs a voice biometric process to identify the speaker, from amongst a plurality of enrolled speakers. The process of enrolment in a speaker recognition system typically involves the speaker providing a sample of speech, from which specific features are extracted, and the extracted features are used to form a model of the speaker's speech. In use, corresponding features are extracted from a sample of speech, and these are compared with the previously obtained model to obtain a measure of the likelihood that the speaker is the previously enrolled speaker. In some situations, the speaker recognition system attempts to identify one or more enrolled speaker without any prior expectation as to who the speaker should be. In other situations, there is a prior expectation as to who the speaker should be, for example because there is only one enrolled user of the particular device that is being used, or because the user has already identified themselves in some other way.
In this illustrated example, the speaker recognition block 74 is used to identify the speaker. In other examples, there may be an assumption that the speaker is a particular person, or is selected from a small group of people.
In step 54 of the process shown in Figure 3, the first and second components of the audio signal are compared with models of the first acoustic class (for example the voiced component) of the speech of an enrolled user and of the second acoustic class (for example the unvoiced component) of the speech of the enrolled user. For example, comparing the first and second components of the audio signal with the models of the voiced and unvoiced speech of the enrolled user may comprise comparing magnitudes of the audio signal at a number of predetermined frequencies with magnitudes in the models.
Thus, in the system shown in Figure 4, one or more speaker model is stored, for example in a database. Based on the output of the speaker recognition block 74, or based on a prior assumption as to who the speaker is expected to be, one or more speaker model is selected.
In this embodiment, each speaker model contains separate models of the voiced speech and the unvoiced speech of the enrolled user. More specifically, the model of the voiced speech and the model of the unvoiced speech of the enrolled user each comprise amplitude values corresponding to multiple frequencies.
Thus, Figure 5 shows multiple speaker models. Specifically, each speaker model shown in Figure 5 comprises a long term averaged spectrum of the voiced components of the speech and a long term averaged spectrum of the unvoiced components of the speech. These models are obtained from the respective speakers during previous separate enrolment processes, during which the speakers speak, either uttering predetermined standard test phrases or saying words of their own choosing.
Figure 5 shows the speaker models for five speakers, labelled Speaker 1 - Speaker 5. The model for Speaker 1 comprises the long term averaged spectrum 90 of the voiced components of the speech and the long term averaged spectrum 91 of the unvoiced components of the speech; the model for Speaker 2 comprises the long term averaged spectrum 92 of the voiced components of the speech and the long term averaged spectrum 93 of the unvoiced components of the speech; the model for Speaker 3 comprises the long term averaged spectrum 94 of the voiced components of the speech and the long term averaged spectrum 95 of the unvoiced components of the speech; the model for Speaker 4 comprises the long term averaged spectrum 96 of the voiced components of the speech and the long term averaged spectrum 97 of the unvoiced components of the speech; and the model for Speaker 5 comprises the long term averaged spectrum 98 of the voiced components of the speech and the long term averaged spectrum 99 of the unvoiced components of the speech. In each case, the model of the speech comprises a vector containing amplitude values at a plurality of frequencies.
The plurality of frequencies may be selected from within a frequency range that contains the most useful information for discriminating between speakers. For example, the range may be from 20Hz to 8kHz, or from 20Hz to 4kHz.
The frequencies at which the amplitude values are taken may be linearly spaced, with equal frequency spacings between each adjacent pair of frequencies. Alternatively, the frequencies may be non-linearly spaced. For example, the frequencies may be equally spaced on the mel scale.
The number of amplitude values used to form the model of the speech may be chosen depending on the frequency spacings. For example, using linear spacings the model may contain amplitude values for 64 to 512 frequencies. Using mel spacings, it may be possible to use fewer frequencies, for example between 10 and 20 mel-spaced frequencies. Thus, the model of the voiced speech may be indicated as Mv, where Mv represents a vector comprising one amplitude value at each of the selected frequencies, while the model of the unvoiced speech may be indicated as Mu , where Mu represents a vector comprising one amplitude value at each of the selected frequencies.
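For illustration, a mel-spaced frequency grid and the sampling of an averaged spectrum onto that grid might be computed as follows. The 20 Hz to 8 kHz range and 15 points are example values consistent with the ranges mentioned above, and the function names are assumptions of this sketch.

```python
import numpy as np

def mel_spaced_frequencies(f_min=20.0, f_max=8000.0, num_points=15):
    """Return frequencies equally spaced on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    return inv_mel(np.linspace(mel(f_min), mel(f_max), num_points))

def model_vector(avg_spectrum, fs, frame_length, freqs):
    """Sample an averaged spectrum at the chosen frequencies to form Mv or Mu."""
    bin_freqs = np.fft.rfftfreq(frame_length, d=1.0 / fs)
    return np.interp(freqs, bin_freqs, avg_spectrum)
```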
As will be appreciated, the received signal, containing the user's speech, will be affected by the properties of the channel, which we take to mean any factor that produces a difference between the user's speech and the speech signal as generated by the microphone, and the received signal will also be affected by noise.
Thus, assuming that the channel and the noise are constant over the period during which the received signal is averaged to form the first and second components of the received speech, these first and second components can be expressed as:
Sv = α·Mv + n, and
Su = α·Mu + n, where
α represents the frequency spectrum of a multiplicative disturbance component, referred to herein as the channel, and
n represents the frequency spectrum of an additive disturbance component, referred to herein as the noise.
Thus, with measurements Sv and Su, and with models Mv and Mu, these two equations can therefore be solved for the two unknowns, α and n.
Thus, for illustrative purposes,
α = (Su − Sv) / (Mu − Mv), and
n = (Sv·Mu − Su·Mv) / (Mu − Mv).
For completeness, it should be noted that, with measurements of the spectrum made at a plurality of frequencies, these two equations are effectively solved at each of the frequencies. Alternatively, with measurements made at f different frequencies, the equations Sv = α·Mv + n and Su = α·Mu + n can each be regarded as f different equations to be solved.
In that case, having solved the equations, it may be useful to apply a low-pass filter, or a statistical filter such as a Savitzky-Golay filter, to the results in order to obtain low-pass filtered versions of the channel and noise characteristics.
As an alternative example, a least squares method may be used to obtain solutions to the 2f different equations.
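A minimal numpy sketch of this per-frequency solution, with optional Savitzky-Golay smoothing, is shown below. The arrays Sv, Su, Mv and Mu are assumed to be sampled at the same frequencies, and the smoothing window and polynomial order are illustrative choices; frequencies at which Mu and Mv are very close would in practice be down-weighted or skipped, as discussed next.

```python
import numpy as np
from scipy.signal import savgol_filter

def estimate_channel_and_noise(Sv, Su, Mv, Mu, smooth=True):
    """Solve Sv = alpha*Mv + n and Su = alpha*Mu + n at each frequency.

    Sv, Su: measured averaged spectra of the voiced and unvoiced speech.
    Mv, Mu: the enrolled speaker's models at the same frequencies.
    """
    denom = Mu - Mv                       # small values here amplify model errors
    alpha = (Su - Sv) / denom             # multiplicative channel estimate
    noise = (Sv * Mu - Su * Mv) / denom   # additive noise estimate
    if smooth:
        alpha = savgol_filter(alpha, window_length=7, polyorder=2)
        noise = savgol_filter(noise, window_length=7, polyorder=2)
    return alpha, noise
```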
It will be noted that the calculations set out above rely on determining the difference (Mu − Mv) between the model of the unvoiced speech and the model of the voiced speech. Where these are similar, for example in the range 1.3 - 1.6kHz in the case of Speaker 1 in Figure 5, then any small uncertainties in either of the models will potentially be magnified into large errors in the calculated values for the channel and/or the noise. Thus, the calculated values in any such frequency ranges may be given lower significance in any subsequent processing steps that use the calculated values, for example a reduced weight can be applied to the values used in later processing steps. Alternatively, when it is known in advance that the model of the unvoiced speech and the model of the voiced speech are similar in a particular frequency range, the equations given above need not be solved for frequencies in this range.
Thus, as shown at step 56 of the process shown in Figure 3, information is obtained about the channel and/or the noise affecting the audio signal.
This information can be used in many different ways.
Figure 6 illustrates one such use. The system shown in Figure 6 is similar to the system of Figure 4, and the same reference numerals are used to refer to the same components of the system. In the system of Figure 6, the comparison block 78 is used to obtain information about the channel α that is affecting the received audio signal. Specifically, the comparison block 78 may be used to obtain the frequency spectrum of the channel. This can be used to compensate the received audio signal to take account of the channel.
For one example, Figure 6 shows a channel compensation block 120, to which the audio signal received on the input 70 is supplied. The channel compensation block 120 also receives the frequency spectrum of the channel α. The channel compensation block 120 acts to remove the effects of the channel from the received signal, by dividing the received signal by the calculated channel α, before the received signal is passed to the speaker recognition block 74.
Thus, the output of the speaker recognition block 74, on the output 122, can be improved. That is, it can provide more reliable information about the identity of the speaker. This can then be supplied to a processing block 124 and used for any required purposes.
The output of the channel compensation block 120, containing the received signal after the effects of the channel have been removed, can be supplied to any suitable processing block 126, such as a speech recognition system, or the like.
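A sketch of such a compensation step is given below, assuming the channel estimate α has been interpolated onto the same frequency grid as each frame's FFT; that interpolation and the guard against division by very small values are assumptions of the sketch, not requirements of the method.

```python
import numpy as np

def remove_channel(frames, alpha):
    """Divide each frame's spectrum by the channel estimate and resynthesise.

    frames: 2-D array (num_frames x frame_length) of time-domain samples.
    alpha: real-valued channel magnitude estimate, one value per rfft bin.
    """
    frame_length = frames.shape[1]
    spectra = np.fft.rfft(frames, axis=1)
    compensated = spectra / np.maximum(alpha, 1e-6)   # avoid division by zero
    return np.fft.irfft(compensated, n=frame_length, axis=1)
```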
Figure 7 illustrates another such use. The system shown in Figure 7 is similar to the system of Figure 4, and the same reference numerals are used to refer to the same components of the system.
In the system of Figure 7, the comparison block 78 is used to obtain information about the noise n that is affecting the received audio signal. Specifically, the comparison block 78 may be used to obtain the frequency spectrum of the noise. This can be used to take account of the noise when processing the received audio signal.
For one example, Figure 7 shows a filter block 128, to which the audio signal received on the input 70 is supplied. The filter block 128 also receives the frequency spectrum of the noise n. The filter block 128 acts so as to ensure that noise does not adversely affect the operation of the speaker recognition block 74. For example, the calculated noise characteristic, n, can be subtracted from the received signal before any further processing takes place.
In another example, where the level of noise exceeds a predetermined threshold level at one or more frequencies, such that the operation of the speaker recognition block 74 could be compromised, the filter block 128 can remove the corrupted components of the received audio signal at those frequencies, before passing the signal to the speaker recognition block 74. Alternatively, these components could instead be flagged as being potentially corrupted, before being passed to the speaker recognition block 74 or any further signal processing block.
Thus, the output of the speaker recognition block 74, on the output 122, can be improved. That is, it can provide more reliable information about the identity of the speaker. This can then be supplied to any suitable processing block 124, and used for any required purposes.
The output of the filter block 128, containing the received signal after the frequency components that are excessively corrupted by noise have been removed, can be supplied to any suitable processing block 130, such as a speech recognition system, or the like.
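One possible form of such a noise-handling step is sketched below in the frequency domain with an illustrative threshold; the subtraction and masking policy are assumptions for illustration rather than specific requirements of the disclosure.

```python
import numpy as np

def suppress_noisy_bins(spectrum, noise, threshold):
    """Subtract the noise estimate and drop heavily corrupted frequency bins.

    spectrum, noise: magnitude spectra on the same frequency grid.
    threshold: level above which a bin is treated as too corrupted to use.
    Returns the cleaned spectrum and a boolean mask of corrupted bins, which
    could instead be used merely to flag those bins to later processing.
    """
    cleaned = np.maximum(spectrum - noise, 0.0)   # simple spectral subtraction
    corrupted = noise > threshold
    cleaned[corrupted] = 0.0
    return cleaned, corrupted
```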
Figure 8 illustrates another such use. The system shown in Figure 8 is similar to the system of Figure 4, and the same reference numerals are used to refer to the same components of the system.
In the system of Figure 8, the comparison block 78 is used to obtain information about the channel α and the noise n that are affecting the received audio signal. Specifically, the comparison block 78 may be used to obtain the frequency spectrum of the channel and of the noise. This can be used to take account of the channel and the noise when processing the received audio signal.
For one example, Figure 8 shows a combined filter block 134, to which the audio signal received on the input 70 is supplied. The combined filter block 134 also receives the frequency spectrum of the channel α and the noise n. The combined filter block 134 acts so as to ensure that channel effects and noise do not adversely affect the operation of the speaker recognition block 74. For example, the calculated noise characteristic, n, can be subtracted from the received signal, and the remaining signal can be divided by the calculated channel α, before any further processing takes place.
Thus, the output of the speaker recognition block 74, on the output 122, can be improved. That is, it can provide more reliable information about the identity of the speaker. This can then be supplied to any suitable processing block 124, and used for any required purposes.
The output of the combined filter block 134, containing the received signal after the effects of the channel and the noise have been removed, can be supplied to any suitable processing block 136, such as a speech recognition system, or the like.
A further use of the information obtained about the channel and/or the noise affecting the audio signal is to overcome an attempt to deceive a voice biometric system by playing a recording of an enrolled user's voice in a so-called replay or spoof attack. Additionally, a further use of the information obtained about the channel and/or the noise affecting the audio signal is to remove their effects from a received audio signal, meaning that the average spectrum of the speech contained in the audio signal can be used as a biometric.
Figure 9 shows an example of a situation in which a replay attack is being performed. Thus, in Figure 9, the smartphone 10 is provided with voice biometric functionality. In this example, the smartphone 10 is in the possession, at least temporarily, of an attacker, who has another smartphone 30. The smartphone 30 has been used to record the voice of the enrolled user of the smartphone 10. The smartphone 30 is brought close to the microphone inlet 12 of the smartphone 10, and the recording of the enrolled user's voice is played back. If the voice biometric system is unable to determine that the enrolled user's voice that it recognises is a recording, the attacker will gain access to one or more services that are intended to be accessible only by the enrolled user.
It is known that smartphones, such as the smartphone 30, are typically provided with loudspeakers that are of relatively low quality. Thus, the recording of an enrolled user's voice played back through such a loudspeaker will not be a perfect match with the user's voice, and this fact can be used to identify replay attacks.
Figure 10 illustrates the frequency response of a typical loudspeaker. Thus, at frequencies below a lower threshold frequency fL, the loudspeaker suffers from low-frequency roll-off, as the bass response is limited by the size of the loudspeaker diaphragm. At frequencies above an upper threshold frequency fu, the loudspeaker suffers from high-frequency roll-off. At frequencies between the lower threshold frequency fL and the upper threshold frequency fu, there is a degree of pass-band ripple, as the magnitude of the response varies periodically between β1 and β2.
The size of these effects will be determined by the quality of the loudspeaker. For example, in a high quality loudspeaker, the lower threshold frequency fL and the upper threshold frequency fu should be such that there is minimal low-frequency roll-off or high-frequency roll-off within the frequency range that is typically audible to humans. However, size and cost constraints mean that many commercially available
loudspeakers, such as those provided in smartphones such as the smartphone 30, do suffer from these effects to some extent.
Similarly, the magnitude of the pass-band ripple, that is the difference between β1 and β2, will also depend on the quality of the loudspeaker.
If the voice of a speaker is played back through a loudspeaker whose frequency response has the general form shown in Figure 10, then this may be detectable in the received audio signal containing the speech of that speaker. It has previously been recognised that, if a received audio signal has particular frequency characteristics, that may be a sign that the received audio signal is the result of a replay attack. However, the frequency characteristics of the received signal depend on other factors, such as the frequency characteristics of the speech itself, and the properties of any ambient noise, and so it is difficult to make a precise determination that a signal comes from a replay attack based only on the frequency characteristics of the received signal. However, the method shown in Figure 3, and described with reference thereto, can be used to make a more reliable determination as to whether a signal comes from a replay attack.
In one possibility, as shown in Figure 7, the frequency characteristic of the ambient noise is determined, and this is subtracted from the received audio signal by means of the filter 128. The received signal, with noise removed, is supplied to a processing block 130, which in this case may be a replay attack detection block. For example, the replay attack detection block may perform any of the methods disclosed in EP-2860706A, such as testing whether a particular spectral ratio (for example a ratio of the signal energy from 0-2kHz to the signal energy from 2-4kHz) has a value that may be indicative of replay through a loudspeaker, or whether the ratio of the energy within a certain frequency band to the energy of the complete frequency spectrum has a value that may be indicative of replay through a loudspeaker.
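A minimal sketch of one such spectral-ratio test is given below; it is only an illustration of the general idea, and the decision threshold that would be applied to the returned value is not specified here.

```python
import numpy as np

def spectral_ratio_db(signal, fs):
    """Ratio (in dB) of signal energy from 0-2 kHz to energy from 2-4 kHz."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    low = spectrum[(freqs >= 0.0) & (freqs < 2000.0)].sum()
    high = spectrum[(freqs >= 2000.0) & (freqs < 4000.0)].sum()
    return 10.0 * np.log10(low / (high + 1e-12))
```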
In another possibility, the method shown in Figure 3 is used to determine the frequency characteristic of the channel that affects the received speech. If the speech has been played back through a loudspeaker, the frequency response of the loudspeaker should be visible in the frequency characteristic of the channel.
Figure 11 is a flow chart, illustrating a method of determining whether the received signal may result from a replay attack. In the method of Figure 11, in step 140, an audio signal is received, representing speech.
In step 142, information is obtained about a channel affecting said audio signal. For example, the information about the channel may be obtained by the method shown in Figure 3.
In step 144, it is determined whether the channel has at least one characteristic of a loudspeaker. As shown at step 146, determining whether the channel has at least one characteristic of a loudspeaker may comprise determining whether the channel has a low frequency roll-off. For example, the low-frequency roll-off may involve the measured channel decreasing at a relatively constant rate, such as 6dB per octave, for frequencies below a lower cut-off frequency fL, which may for example be in the range 50Hz - 700Hz. As shown at step 148, determining whether the channel has at least one characteristic of a loudspeaker may comprise determining whether the channel has a high frequency roll-off. For example, the high-frequency roll-off may involve the measured channel decreasing at a relatively constant rate, such as 6dB per octave, for frequencies above an upper cut-off frequency fu, which may for example be in the range 18kHz - 24kHz.
As shown at step 150, determining whether the channel has at least one characteristic of a loudspeaker may comprise determining whether the channel has ripple in a pass-band thereof. For example, this may comprise applying a Welch periodogram to the channel, and determining whether there is a predetermined amount of ripple in the characteristic. A degree of ripple (that is, a difference between β1 and β2 in the frequency response shown in Figure 10) exceeding a threshold value, such as 1 dB, and with a peak-to-trough frequency of about 100Hz, over the central part of the pass-band, for example from 100Hz - 10kHz, can be regarded as characteristic of a loudspeaker.
For example, two or three of the steps 146, 148 and 150 may be performed, with the results being applied to a classifier, to determine whether the results of those steps are indeed characteristic of a loudspeaker frequency response. As a further example, the channel frequency response can be applied as an input to a neural network, which has been trained to distinguish channels that are characteristic of loudspeakers from other channels.
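For illustration, simple roll-off and ripple figures could be derived from the estimated channel and passed to such a classifier. The band edges below follow the ranges mentioned above, while the feature definitions themselves are assumptions of this sketch; a real classifier could use quite different inputs.

```python
import numpy as np

def loudspeaker_features(alpha_db, freqs):
    """Derive crude indicators of a loudspeaker-like channel estimate.

    alpha_db: channel estimate in dB at the frequencies in freqs (Hz).
    Returns the spread of the channel below 700 Hz (low-frequency roll-off),
    above 18 kHz (high-frequency roll-off) and within 100 Hz - 10 kHz
    (pass-band ripple), for use as classifier inputs.
    """
    def spread(values):
        return float(values.max() - values.min()) if values.size else 0.0

    return {
        'low_rolloff_db': spread(alpha_db[freqs < 700.0]),
        'high_rolloff_db': spread(alpha_db[freqs > 18000.0]),
        'passband_ripple_db': spread(alpha_db[(freqs >= 100.0) & (freqs <= 10000.0)]),
    }
```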
If it is determined that the channel has a characteristic of a loudspeaker, then it may be determined, perhaps on the basis of other indicators too, that the received audio signal is the result of a replay attack. In that case, the speech in the received audio signal may be disregarded when attempting to verify that the speaker is the expected enrolled speaker.
Figure 12 is a flow chart, illustrating a method of speaker identification, and Figure 13 is a block diagram of a system for performing speaker identification.
As described above, the system may be implemented in a smartphone, such as the smartphone 10, or any other device with voice biometric functionality. Thus, the smartphone 10 is able to perform various functions in response to spoken commands from an enrolled user. The biometric functionality is able to distinguish between spoken commands from the enrolled user, and the same commands when spoken by a different person. Thus, certain embodiments of the invention relate to operation of a smartphone or another portable electronic device with some sort of voice operability, for example a tablet or laptop computer, a games console, a home control system, a home entertainment system, an in-vehicle entertainment system, a domestic appliance, or the like, in which the voice biometric functionality is performed in the device that is intended to carry out the spoken command. Certain other embodiments relate to systems in which the voice biometric functionality is performed on a smartphone or other device, which then transmits the commands to a separate device if the voice biometric functionality is able to confirm that the speaker was the enrolled user.
In some embodiments, while voice biometric functionality is performed on the smartphone 10 or other device that is located close to the user, the spoken commands are transmitted using the transceiver 18 to a remote speech recognition system, which determines the meaning of the spoken commands. For example, the speech recognition system may be located on one or more remote server in a cloud computing environment. Signals based on the meaning of the spoken commands are then returned to the smartphone 10 or other local device.
In step 160 of the method of Figure 12, the signal generated by a microphone 12 in response to ambient sound is received.
The received signal is divided into frames, which may for example have lengths in the range of 10-100 ms. These frames can be analysed to determine whether they represent speech, and only frames that represent speech are considered further.
The frames that represent speech are passed to a channel/noise removal block 180 and, in step 162 of the method, the effects of a channel and/or noise are removed from the received audio signal to obtain a cleaned audio signal. The effects of the channel and/or noise can be determined by the method described above, or by any other suitable method, leaving a cleaned audio signal that is not adversely affected by any channel or noise effects. In step 164 of the method, the cleaned audio signal is passed to an averaging block 182, which obtains an average spectrum of at least a part of the cleaned audio signal.
The average spectrum is a spectrum of the relevant part or parts of the speech obtained and averaged over multiple frames.
The spectrum or spectra can be averaged over enough data to provide reasonable confidence in the average. In general terms, this average will become more reliable as more data is used to form the average spectrum or spectra. In some cases, spectra averaged over 500ms of the relevant speech will be enough to provide reliable averaged spectra. The length of time over which the averaged spectrum or spectra are generated may be adapted based on the articulation rate of the speech, in order to ensure that the speech contains enough phonetic variation to provide a reliable average. The length of time over which the averaged spectrum or spectra are generated may be adapted based on the content of the speech.
As mentioned above, an average spectrum of at least a part of the cleaned audio signal is obtained in step 164. For example, this may comprise obtaining an average spectrum for parts of the cleaned audio signal representing one or more audio classes. To achieve this, one or more components of the cleaned audio signal, representing different acoustic classes of the speech, are extracted from the cleaned audio signal. Extracting the or each component of the cleaned audio signal may comprise identifying periods when the cleaned audio signal contains the relevant acoustic class of speech. More specifically, extracting the component or components of the cleaned audio signal may comprise identifying frames of the cleaned audio signal that contain the relevant acoustic class of speech.
In some embodiments, obtaining an average spectrum of at least a part of the cleaned audio signal comprises obtaining an average spectrum of a part of the cleaned audio signal representing voiced speech. In some other embodiments, obtaining an average spectrum of at least a part of the cleaned audio signal comprises obtaining a first average spectrum of a part of the cleaned audio signal representing voiced speech and obtaining a second average spectrum of a part of the cleaned audio signal representing unvoiced speech.
When the method involves obtaining an average spectrum for parts of the cleaned audio signal representing one or more audio classes, and the acoustic class is voiced speech (or the first and second acoustic classes of the speech are voiced speech and unvoiced speech), there are several methods that can be used to identify voiced and unvoiced speech, for example: using a deep neural network (DNN), trained against a golden reference, for example using Praat software; performing an autocorrelation with unit delay on the speech signal (because voiced speech has a higher autocorrelation for non-zero lags); performing a linear predictive coding (LPC) analysis (because the initial reflection coefficient is a good indicator of voiced speech); looking at the zero-crossing rate of the speech signal (because unvoiced speech has a higher zero-crossing rate); looking at the short term energy of the signal (which tends to be higher for voiced speech); tracking the first formant frequency F0 (because unvoiced speech does not contain the first formant frequency); examining the error in a linear predictive coding (LPC) analysis (because the LPC prediction error is lower for voiced speech); using automatic speech recognition to identify the words being spoken and hence the division of the speech into voiced and unvoiced speech; or fusing any or all of the above.
As mentioned above, the acoustic classes of the speech may be voiced speech and unvoiced speech. However, the acoustic classes of the speech may be any phonetically distinguishable acoustic classes. For example, they may be different phoneme classes, for example two different sets of vowels; they may be two different fricatives; or a first class may be fricatives while a second class are sibilants. In step 166 of the method, the obtained average spectrum of at least a part of the cleaned audio signal is passed to a comparison block 184. The comparison block 184 also receives one or more long term average speaker model for one or more enrolled speaker. The term "long term" average speaker model means that enough of the speech of the enrolled speaker was used to form the model, either during enrolment or subsequently, that the model is relatively stable. In some embodiments or situations, there is only one enrolled speaker, and so the comparison block 184 receives the one or more long term average speaker model for that enrolled speaker. In some other embodiments or situations, there is more than one enrolled speaker, and so the comparison block 184 receives the one or more long term average speaker model for each enrolled speaker.
In some other embodiments or situations, there is more than one enrolled speaker, but there is some additional information regarding the purported speaker. For example, a user of the device may have identified themselves in some way. In that case, the comparison block 184 receives the one or more long term average speaker model for that enrolled speaker.
In addition, in some embodiments, the comparison block 184 may additionally or alternatively receive a Universal Background Model (UBM), for example in the form of a model of the statistically average user.
The one or more long term average speaker model, and the Universal Background Model (UBM) if used, are stored in a model database 186. The comparison block 184 may receive one or more long term average speaker model corresponding to the part of the cleaned audio signal for which the average spectrum was obtained.
Thus, for example, obtaining an average spectrum of at least a part of the cleaned audio signal may comprise obtaining an average spectrum of a part of the cleaned audio signal representing voiced speech. That is, with a measurement Sv of the spectrum of the voiced speech, and with values having been calculated for the channel, α, and for the noise, n, the cleaned measurement SCv of the spectrum of the voiced speech can be calculated as:
SCv = (Sv − n) / α.
This can then be compared with the long term average speaker model Mv for voiced speech of the or each enrolled speaker being considered by the comparison block 184.
In other examples, obtaining an average spectrum of at least a part of the cleaned audio signal may comprise obtaining a first average spectrum of a part of the cleaned audio signal representing voiced speech and obtaining a second average spectrum of a part of the cleaned audio signal representing unvoiced speech.
As before, the average spectrum of a part of the cleaned audio signal representing voiced speech can be calculated as:
SCv = (Sv − n) / α,
and similarly the average spectrum of a part of the cleaned audio signal representing unvoiced speech can be calculated as:
SCu = (Su − n) / α.
The first average spectrum SCv is compared with a long term average speaker model Mv for voiced speech of the or each enrolled speaker being considered by the comparison block 184, and the second average spectrum SCu is compared with a long term average speaker model Mu for unvoiced speech of the or each enrolled speaker being considered by the comparison block 184.
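As a minimal sketch of this cleaning and comparison, assuming the channel α and noise n have already been estimated, one simple distance measure is a log spectral distance. The scoring below is only an illustration; it is not presented as the specific comparison performed by the comparison block 184.

```python
import numpy as np

def log_spectral_distance(S, M, eps=1e-10):
    """Root-mean-square difference between two magnitude spectra, in dB."""
    return np.sqrt(np.mean((20.0 * np.log10((S + eps) / (M + eps))) ** 2))

def verification_score(Sv, Su, Mv, Mu, alpha, noise):
    """Clean the measured voiced/unvoiced spectra and compare with the models."""
    SCv = (Sv - noise) / alpha   # cleaned voiced spectrum
    SCu = (Su - noise) / alpha   # cleaned unvoiced spectrum
    return log_spectral_distance(SCv, Mv) + log_spectral_distance(SCu, Mu)
```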
In step 168 of the method, the result of the comparison is passed to a determination block 188, which determines based on the comparison whether the speech is the speech of the enrolled speaker being considered by the comparison block 184. As mentioned above, this determination may be an accept/reject decision based on the comparison, as to whether the received speech matches sufficiently closely with the enrolled user who was expected to be the speaker.
In some examples, a small number of speakers (for example from 2 to 10) are enrolled, and suitable models of their speech are obtained during an enrolment process. Then, the determination made by the determination block 188 concerns which of those enrolled speakers was the most likely candidate as the source of the speech in the received audio signal. This determination may be based on the respective Log Spectral Distances (LSD) of the received speech from the different models, or may use Principal component analysis (PCA) or Linear discriminative analysis (LDA), as examples. When a Universal Background Model (UBM) is also considered, then the determination may take into account the result of a comparison between the received speech, the model of the enrolled user's speech, and the background model. Figure 14 is another block diagram of a system for performing speaker identification.
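The selection among several enrolled speakers just described might, purely as an illustration, reduce to picking the smallest log spectral distance and optionally requiring the best match to score better than the Universal Background Model; PCA- or LDA-based scoring could equally be used. The function below is a hypothetical sketch of one such decision rule, not the rule used by the determination block 188.

```python
import numpy as np

def identify_speaker(SCv, models, ubm=None, eps=1e-10):
    """Pick the enrolled speaker whose voiced model is closest to SCv.

    models: dict mapping speaker name -> long term average voiced spectrum Mv.
    ubm: optional Universal Background Model spectrum; if supplied, the best
    enrolled speaker must score better than the UBM to be accepted.
    """
    def lsd(S, M):
        return np.sqrt(np.mean((20.0 * np.log10((S + eps) / (M + eps))) ** 2))

    distances = {name: lsd(SCv, Mv) for name, Mv in models.items()}
    best = min(distances, key=distances.get)
    if ubm is not None and distances[best] >= lsd(SCv, ubm):
        return None, distances   # no enrolled speaker beats the background model
    return best, distances
```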
As described above, the system may be implemented in a smartphone, such as the smartphone 10, or any other device with voice biometric functionality. Thus, the smartphone 10 is able to perform various functions in response to spoken commands from an enrolled user. The biometric functionality is able to distinguish between spoken commands from the enrolled user, and the same commands when spoken by a different person. Thus, certain embodiments of the invention relate to operation of a smartphone or another portable electronic device with some sort of voice operability, for example a tablet or laptop computer, a games console, a home control system, a home entertainment system, an in-vehicle entertainment system, a domestic appliance, or the like, in which the voice biometric functionality is performed in the device that is intended to carry out the spoken command. Certain other embodiments relate to systems in which the voice biometric functionality is performed on a smartphone or other device, which then transmits the commands to a separate device if the voice biometric functionality is able to confirm that the speaker was the enrolled user.
In some embodiments, while voice biometric functionality is performed on the smartphone 10 or other device that is located close to the user, the spoken commands are transmitted using the transceiver 18 to a remote speech recognition system, which determines the meaning of the spoken commands. For example, the speech recognition system may be located on one or more remote server in a cloud computing environment. Signals based on the meaning of the spoken commands are then returned to the smartphone 10 or other local device.
Some embodiments are particularly suited to use in devices, such as home control systems, home entertainment systems, or in-vehicle entertainment systems, in which there will often be multiple enrolled users (for example between two and ten such users), and where the intended operation to be performed in response to a spoken command (such as "play my favourite music", or "increase the temperature in my room", for example) will depend on the identity of the speaker.
As in the system of Figure 13, the signal generated by a microphone 12 in response to ambient sound is received. The received signal is divided into frames, which may for example have lengths in the range of 10-100 ms. These frames can be analysed to determine whether they represent speech, and only frames that represent speech are considered further.
Components of the received audio signal, representing different acoustic classes of the speech, are then extracted in an extraction block 192. Extracting the or each component of the cleaned audio signal may comprise identifying periods when the audio signal contains the relevant acoustic class of speech. More specifically, extracting the component or components of the audio signal may comprise identifying frames of the audio signal that contain the relevant acoustic class of speech.
In the illustrated embodiment, the extraction block 192 is a voiced/unvoiced detector (VU), which extracts respective components representing voiced and unvoiced speech, and outputs an average spectrum Sv of a part of the audio signal representing voiced speech, and an average spectrum Su of a part of the audio signal representing unvoiced speech.
When the first and second acoustic classes of the speech are voiced speech and unvoiced speech, there are several methods that can be used to identify voiced and unvoiced speech, for example: using a deep neural network (DNN), trained against a golden reference, for example using Praat software; performing an autocorrelation with unit delay on the speech signal (because voiced speech has a higher autocorrelation for non-zero lags); performing a linear predictive coding (LPC) analysis (because the initial reflection coefficient is a good indicator of voiced speech); looking at the zero-crossing rate of the speech signal (because unvoiced speech has a higher zero-crossing rate); looking at the short term energy of the signal (which tends to be higher for voiced speech); tracking the first formant frequency F0 (because unvoiced speech does not contain the first formant frequency); examining the error in a linear predictive coding (LPC) analysis (because the LPC prediction error is lower for voiced speech); using automatic speech recognition to identify the words being spoken and hence the division of the speech into voiced and unvoiced speech; or fusing any or all of the above.
As mentioned above, the acoustic classes of the speech may be voiced speech and unvoiced speech. However, the acoustic classes of the speech may be any phonetically distinguishable acoustic classes. For example, they may be different phoneme classes, for example two different sets of vowels; they may be two different fricatives; or a first class may be fricatives while a second class are sibilants.
The average spectra of the two components of the signal representing the two acoustic classes of the speech are then passed to a channel/noise calculation and removal block 194.
In some embodiments, the system is provided with a purported identity of the speaker, and it is required to determine whether the received signal has in fact come from that speaker (referred to as speaker verification). In other embodiments, the system has multiple enrolled speakers, but has no further information as to which of the enrolled speakers is speaking at any given time, and it is required to identify which of those enrolled speakers is the speaker (referred to as speaker identification).
The system includes a database 196, which stores a long term average speaker model Mv for voiced speech of the or each enrolled speaker and a long term average speaker model Mu for unvoiced speech of the or each enrolled speaker (or models of other acoustic classes of the speech of each enrolled speaker).
As described above, the system may be required to perform speaker verification, or speaker identification.
In the case of speaker verification, the average spectrum Sv of the part of the audio signal representing voiced speech, and the average spectrum Su of the part of the audio signal representing unvoiced speech, are combined with the model Mv for voiced speech of the purported speaker and the long term average speaker model Mu for unvoiced speech of the purported speaker to obtain values for the channel, α, and for the noise, n. Specifically, as before:
α = (Su − Sv) / (Mu − Mv), and
n = (Sv·Mu − Su·Mv) / (Mu − Mv).
The channel/noise calculation and removal block 194 then removes the effect of the calculated channel and noise, to obtain a cleaned measurement SCv of the average spectrum of the voiced speech, calculated as:
SCv = (Sv - n) / a.
In other embodiments, a cleaned measurement SCu of the average spectrum of the unvoiced speech can be similarly calculated as:
SCu = (Su - n) / a.
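A minimal per-frequency-bin sketch of this channel/noise calculation and removal is given below, assuming Sv and Su are the measured average spectra and Mv and Mu the enrolled speaker's long term average models on the same frequency grid. The noise formula follows the expression given in the text; the epsilon guard against division by zero is an added assumption.

```python
import numpy as np

def estimate_channel_and_noise(Sv, Su, Mv, Mu, eps=1e-12):
    """Per-bin channel (a) and noise (n) estimates, as in the equations above."""
    denom = Mu - Mv
    denom = np.where(np.abs(denom) < eps, eps, denom)  # guard against zero division
    a = (Su - Sv) / denom
    n = (Su * Mv - Sv * Mu) / denom  # noise estimate, following the formula in the text
    return a, n

def remove_channel_and_noise(S, a, n, eps=1e-12):
    """Cleaned spectrum SC = (S - n) / a, giving SCv from Sv or SCu from Su."""
    a_safe = np.where(np.abs(a) < eps, eps, a)
    return (S - n) / a_safe
```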
The cleaned measurement of the average spectrum of the relevant part of the speech is then passed to a comparison block 198, for comparison with the respective model of that part of the speech of the purported user. The comparison score is output, indicating whether the cleaned measurement(s) of the average spectrum of the relevant part(s) of the speech is/are close enough to the model(s) to have a required degree of confidence that the signal comes from the speech of the purported speaker. As before, the comparison block 198 may additionally receive a Universal Background Model (UBM), for example in the form of a model of the statistically average user, from the database 196, and may use this when providing the output comparison score.
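The following is one possible, purely illustrative way the comparison block 198 could score the cleaned spectrum against the purported speaker's model, optionally normalised against a Universal Background Model. The log-spectral distance, the UBM normalisation, and the threshold are assumptions rather than the scoring defined by the invention.

```python
import numpy as np

def log_spectral_distance(S, M, eps=1e-12):
    """RMS log-spectral distance (dB) between a measured spectrum and a model."""
    return np.sqrt(np.mean((20 * np.log10((S + eps) / (M + eps))) ** 2))

def verification_score(SC, M_speaker, M_ubm=None):
    """Higher score means a better match to the purported speaker."""
    d_spk = log_spectral_distance(SC, M_speaker)
    if M_ubm is None:
        return -d_spk
    # Positive if the cleaned spectrum is closer to the speaker model than to the UBM.
    return log_spectral_distance(SC, M_ubm) - d_spk

def verify(SC, M_speaker, M_ubm=None, threshold=0.0):
    return verification_score(SC, M_speaker, M_ubm) > threshold
```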
In the case of speaker identification, the average spectrum Sv of the part of the audio signal representing voiced speech, and the average spectrum Su of the part of the audio signal representing unvoiced speech, are combined with the respective models Mv for voiced speech of each enrolled speaker and the respective long term average speaker models Mu for unvoiced speech of each enrolled speaker to obtain preliminary or hypothetical values for the channel, a, and for the noise, n. Specifically, as before:
a = (Su - Sv) / (Mu - Mv), and

n = (SuMv - SvMu) / (Mu - Mv).
These values for channel and noise are calculated for each of the possible speakers. The results may be such that it is clear that the speech could not have come from one or more of the enrolled speakers. Specifically, if the calculated values for the channel, a, based on the models for a particular speaker, are clearly physically implausible, it can be assumed that that speaker was not the source of the received speech signal. For example, if there are very large variations (of more than 20dB, say) in one of the calculated channels across the relevant frequency range, or if there are significant discontinuities in one of the calculated channels, this might indicate that that channel is physically implausible, and hence that the speaker whose model led to that calculated channel was not the person speaking at that time. Otherwise, the channel/noise calculation and removal block 194 removes the effect of each of the calculated channel and noise values from the received signal, to obtain respective cleaned hypothetical measurements SCv of the average spectrum of the voiced speech, on the assumption that the speaker was the person whose speech model was used as the basis for those calculated values of the channel and noise.
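A sketch of this plausibility test is shown below. The 20 dB variation figure comes from the text; the per-bin discontinuity limit of 6 dB is an illustrative assumption, as is the function name.

```python
import numpy as np

def channel_is_plausible(a, max_variation_db=20.0, max_step_db=6.0, eps=1e-12):
    """Reject a hypothetical channel that varies too much across the frequency range
    of interest, or that contains an abrupt bin-to-bin discontinuity."""
    a_db = 20 * np.log10(np.abs(a) + eps)
    variation_ok = (a_db.max() - a_db.min()) <= max_variation_db
    continuity_ok = np.all(np.abs(np.diff(a_db)) <= max_step_db)
    return bool(variation_ok and continuity_ok)
```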
Thus, in a case with two enrolled speakers A and B, having respective models MvA and MvB for their voiced speech and having respective models MuA and MuB for their unvoiced speech, it is possible to obtain respective hypothetical values for the channel and noise, namely:
aA = (Su - Sv) / (MuA - MvA), and

nA = (SuMvA - SvMuA) / (MuA - MvA) for enrolled speaker A, and

aB = (Su - Sv) / (MuB - MvB), and

nB = (SuMvB - SvMuB) / (MuB - MvB) for enrolled speaker B.

These are then provisionally removed from the received signal to give respective hypothetical cleaned measurements for the two enrolled users, namely:

SCvA = (Sv - nA) / aA for enrolled speaker A, and

SCvB = (Sv - nB) / aB for enrolled speaker B.
These hypothetical cleaned measurements of the average spectrum of the relevant part of the speech are then passed to a comparison block 198, for comparison with the respective model of that part of the speech of the relevant user.
Thus, SCvA is compared with the model MvA for enrolled speaker A, and SCvB is compared with the model MvB for enrolled speaker B.
The comparison score is then output, indicating whether the hypothetical cleaned measurement of the average spectrum of the relevant part of the speech for one of the enrolled speakers is close enough to the respective model to have a required degree of confidence that the signal comes from the speech of that enrolled speaker.
The result output by the comparison block 198 may simply indicate which of those enrolled speakers was the most likely candidate as the source of the speech in the received audio signal.
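Putting these steps together, the identification loop described above might look like the following sketch, which reuses the hypothetical helper functions sketched earlier (estimate_channel_and_noise, remove_channel_and_noise, channel_is_plausible, verification_score). The dictionary layout and the use of only the voiced spectrum in the final comparison are assumptions, not the claimed method.

```python
def identify_speaker(Sv, Su, enrolled, M_ubm=None):
    """enrolled maps a speaker label to (Mv, Mu) long term average models.
    Returns the most likely enrolled speaker and its score, or (None, None)."""
    best_label, best_score = None, None
    for label, (Mv, Mu) in enrolled.items():
        a, n = estimate_channel_and_noise(Sv, Su, Mv, Mu)
        if not channel_is_plausible(a):
            continue  # physically implausible channel: this speaker cannot be the source
        SCv = remove_channel_and_noise(Sv, a, n)  # hypothetical cleaned voiced spectrum
        score = verification_score(SCv, Mv, M_ubm)
        if best_score is None or score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```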
The skilled person will recognise that some aspects of the above-described apparatus and methods may be embodied as processor control code, for example on a non-volatile carrier medium such as a disk, CD- or DVD-ROM, programmed memory such as read only memory (Firmware), or on a data carrier such as an optical or electrical signal carrier. For many applications, embodiments of the invention will be implemented on a DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array). Thus the code may comprise conventional program code or microcode or, for example, code for setting up or controlling an ASIC or FPGA. The code may also comprise code for dynamically configuring re-configurable apparatus such as re-programmable logic gate arrays. Similarly, the code may comprise code for a hardware description language such as Verilog™ or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, the code may be distributed between a plurality of coupled components in communication with one another. Where appropriate, the embodiments may also be implemented using code running on a field-(re)programmable analogue array or similar device in order to configure analogue hardware.
Note that as used herein the term module shall be used to refer to a functional unit or block which may be implemented at least partly by dedicated hardware components such as custom defined circuitry and/or at least partly be implemented by one or more software processors or appropriate code running on a suitable general purpose processor or the like. A module may itself comprise other modules or functional units. A module may be provided by multiple components or sub-modules which need not be co-located and could be provided on different integrated circuits and/or running on different processors.
Embodiments may be implemented in a host device, especially a portable and/or battery powered host device such as a mobile computing device for example a laptop or tablet computer, a games console, a remote control device, a home automation controller or a domestic appliance including a domestic temperature or lighting control system, a toy, a machine such as a robot, an audio player, a video player, or a mobile telephone for example a smartphone.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim, "a" or "an" does not exclude a plurality, and a single feature or other unit may fulfil the functions of several units recited in the claims. Any reference numerals or labels in the claims shall not be construed so as to limit their scope.

Claims

1. A method of analysis of an audio signal, the method comprising:
receiving an audio signal representing speech;
extracting first and second components of the audio signal representing first and second acoustic classes of the speech respectively;
analysing the first and second components of the audio signal with models of the first and second acoustic classes of the speech of an enrolled user; and
based on said analysing, obtaining information about at least one of a channel and noise affecting said audio signal.
2. A method according to claim 1 , wherein extracting first and second components of the audio signal comprises:
identifying periods when the audio signal contains voiced speech; and identifying remaining periods of speech as containing unvoiced speech.
3. A method according to claim 1 or 2, wherein analysing the first and second components of the audio signal with the models of the first and second acoustic classes of the speech of the enrolled user comprises:
comparing magnitudes of the audio signal at a number of predetermined frequencies with magnitudes in the models of the first and second acoustic classes of the speech.
4. A method according to any preceding claim, comprising compensating the received audio signal for channel and/or noise.
5. A method according to any preceding claim, comprising:
performing a speaker identification process on the received audio signal to form a provisional decision on an identity of a speaker;
selecting the models of the first and second acoustic classes of the speech of the enrolled user, from a plurality of models, based on the provisional decision on the identity of the speaker;
compensating the received audio signal for channel and/or noise; and performing a second speaker identification process on the compensated received audio signal to form a final decision on the identity of the speaker.
6. A method according to claim 5, wherein compensating the received audio signal for channel and/or noise comprises:
identifying at least one part of a frequency spectrum of the received audio signal where a noise level exceeds a threshold level; and
ignoring the identified part of the frequency spectrum of the received audio signal when performing the second speaker identification process.
7. A method according to any of claims 1 to 6, wherein the first and second acoustic classes of the speech comprise voiced speech and unvoiced speech.
8. A method according to any of claims 1 to 6, wherein the first and second acoustic classes of the speech comprise first and second phoneme classes.
9. A method according to any of claims 1 to 6, wherein the first and second acoustic classes of the speech comprise first and second fricatives.
10. A method according to any of claims 1 to 6, wherein the first and second acoustic classes of the speech comprise fricatives and sibilants.
11. A system for analysis of an audio signal, the system comprising an input for receiving an audio signal, and being configured for:
receiving an audio signal representing speech;
extracting first and second components of the audio signal representing first and second acoustic classes of the speech respectively;
analysing the first and second components of the audio signal with models of the first and second acoustic classes of the speech of an enrolled user; and
based on said analysing, obtaining information about at least one of a channel and noise affecting said audio signal.
12. A device comprising a system as claimed in claim 11.
13. A device as claimed in claim 12, wherein the device comprises a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
14. A computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to any one of claims 1 to 10.
15. A non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to any one of claims 1 to 10.
16. A method of determining whether a received signal may result from a replay attack, the method comprising:
receiving an audio signal representing speech;
obtaining information about a channel affecting said audio signal; and
determining whether the channel has at least one characteristic of a loudspeaker.
17. A method according to claim 16, wherein determining whether the channel has at least one characteristic of a loudspeaker comprises:
determining whether the channel has a low frequency roll-off.
18. A method according to claim 17, wherein determining whether the channel has a low frequency roll-off comprises determining whether the channel decreases at a constant rate for frequencies below a lower cut-off frequency.
19. A method according to claim 16 or 17, wherein determining whether the channel has at least one characteristic of a loudspeaker comprises:
determining whether the channel has a high frequency roll-off.
20. A method according to claim 19, wherein determining whether the channel has a high frequency roll-off comprises determining whether the channel decreases at a constant rate for frequencies above an upper cut-off frequency.
21. A method according to claim 16, 17 or 19, wherein determining whether the channel has at least one characteristic of a loudspeaker comprises:
determining whether the channel has ripple in a pass-band thereof.
22. A method according to claim 21, wherein determining whether the channel has ripple in a pass-band thereof comprises determining whether a degree of ripple over a central part of the pass-band, for example from 100 Hz to 10 kHz, exceeds a threshold amount.
23. A system for determining whether a received signal may result from a replay attack, the system comprising an input for receiving an audio signal, and being configured for:
receiving an audio signal representing speech;
obtaining information about a channel affecting said audio signal; and
determining whether the channel has at least one characteristic of a loudspeaker.
24. A device comprising a system as claimed in claim 23.
25. A device as claimed in claim 24, wherein the device comprises a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
26. A computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to any one of claims 16 to 22.
27. A non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to any one of claims 16 to 22.
28. A method of speaker identification, comprising:
receiving an audio signal representing speech;
removing effects of a channel and/or noise from the received audio signal to obtain a cleaned audio signal;
obtaining an average spectrum of at least a part of the cleaned audio signal; comparing the average spectrum with a long term average speaker model for an enrolled speaker; and
determining based on the comparison whether the speech is the speech of the enrolled speaker.
29. A method according to claim 28, wherein obtaining an average spectrum of at least a part of the cleaned audio signal comprises obtaining an average spectrum of a part of the cleaned audio signal representing voiced speech.
30. A method according to claim 28, wherein obtaining an average spectrum of at least a part of the cleaned audio signal comprises obtaining a first average spectrum of a part of the cleaned audio signal representing a first acoustic class and obtaining a second average spectrum of a part of the cleaned audio signal representing a second acoustic class, and wherein
comparing the average spectrum with a long term average speaker model for an enrolled speaker comprises comparing the first average spectrum with a long term average speaker model for the first acoustic class for the enrolled speaker and comparing the second average spectrum with a long term average speaker model for the second acoustic class for the enrolled speaker.
31. A method according to claim 30, wherein the first acoustic class is voiced speech and the second acoustic class is unvoiced speech.
32. A method according to claim 28, 29, 30 or 31 , comprising comparing the average spectrum with respective long term average speaker models for each of a plurality of enrolled speakers; and
determining based on the comparison whether the speech is the speech of one of the enrolled speakers.
33. A method according to claim 32, further comprising comparing the average spectrum with a Universal Background Model; and
including a result of the comparing the average spectrum with the Universal Background Model in determining whether the speech is the speech of one of the enrolled speakers.
34. A method according to claim 32, comprising identifying one of the enrolled speakers as a most likely candidate as a source of the speech.
35. A method according to any of claims 28 to 34, comprising:
obtaining information about the effects of a channel and/or noise on the received audio signal by:
receiving the audio signal representing speech;
extracting first and second components of the audio signal representing first and second acoustic classes of the speech respectively; analysing the first and second components of the audio signal with models of the first and second acoustic classes of the speech of an enrolled user; and
based on said analysing, obtaining information about at least one of a channel and noise affecting said audio signal.
36. A method according to claim 35, comprising analysing the first and second components of the audio signal with models of the first and second acoustic classes of the speech of a plurality of enrolled users, to obtain respective hypothetical values of the channel, and determining that the speech is not the speech of any enrolled speaker whose models give rise to physically implausible hypothetical values of the channel.
37. A method according to claim 36, wherein a hypothetical value of the channel is considered to be physically implausible if it contains variations exceeding a threshold level across the relevant frequency range.
38. A method according to claim 36, wherein a hypothetical value of the channel is considered to be physically implausible if it contains significant discontinuities.
39. A system for analysis of an audio signal, the system comprising an input for receiving an audio signal, and being configured for:
receiving an audio signal representing speech;
removing effects of a channel and/or noise from the received audio signal to obtain a cleaned audio signal;
obtaining an average spectrum of at least a part of the cleaned audio signal; comparing the average spectrum with a long term average speaker model for an enrolled speaker; and
determining based on the comparison whether the speech is the speech of the enrolled speaker.
40. A device comprising a system as claimed in claim 39.
41. A device as claimed in claim 40, wherein the device comprises a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
42. A computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to any one of claims 28 to 38.
43. A non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to any one of claims 28 to 38.
PCT/GB2018/052905 2017-10-13 2018-10-11 Analysing speech signals WO2019073233A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB2004481.4A GB2580821B (en) 2017-10-13 2018-10-11 Analysing speech signals
CN201880065835.1A CN111201570A (en) 2017-10-13 2018-10-11 Analyzing speech signals

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US201762571978P 2017-10-13 2017-10-13
US62/571,978 2017-10-13
US201762578667P 2017-10-30 2017-10-30
US62/578,667 2017-10-30
GB1719731.0 2017-11-28
GB1719731.0A GB2567503A (en) 2017-10-13 2017-11-28 Analysing speech signals
GB1719734.4 2017-11-28
GBGB1719734.4A GB201719734D0 (en) 2017-10-30 2017-11-28 Speaker identification

Publications (1)

Publication Number Publication Date
WO2019073233A1 true WO2019073233A1 (en) 2019-04-18

Family

ID=66100464

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2018/052905 WO2019073233A1 (en) 2017-10-13 2018-10-11 Analysing speech signals

Country Status (3)

Country Link
CN (1) CN111201570A (en)
GB (1) GB2580821B (en)
WO (1) WO2019073233A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808595B (en) * 2020-06-15 2024-07-16 颜蔚 Voice conversion method and device from source speaker to target speaker

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002103680A2 (en) * 2001-06-19 2002-12-27 Securivox Ltd Speaker recognition system
US20070129941A1 (en) * 2005-12-01 2007-06-07 Hitachi, Ltd. Preprocessing system and method for reducing FRR in speaking recognition
WO2013022930A1 (en) * 2011-08-08 2013-02-14 The Intellisis Corporation System and method of processing a sound signal including transforming the sound signal into a frequency-chirp domain
EP2860706A2 (en) * 2013-09-24 2015-04-15 Agnitio S.L. Anti-spoofing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105244031A (en) * 2015-10-26 2016-01-13 北京锐安科技有限公司 Speaker identification method and device


Also Published As

Publication number Publication date
GB2580821B (en) 2022-11-09
GB202004481D0 (en) 2020-05-13
CN111201570A (en) 2020-05-26
GB2580821A (en) 2020-07-29

Similar Documents

Publication Publication Date Title
US11270707B2 (en) Analysing speech signals
US20200227071A1 (en) Analysing speech signals
US11042616B2 (en) Detection of replay attack
US11631402B2 (en) Detection of replay attack
CN110832580B (en) Detection of replay attacks
US11037574B2 (en) Speaker recognition and speaker change detection
US11056118B2 (en) Speaker identification
US20230401338A1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
US20200201970A1 (en) Biometric user recognition
US10839810B2 (en) Speaker enrollment
GB2576960A (en) Speaker recognition
US11074917B2 (en) Speaker identification
US10818298B2 (en) Audio processing
WO2019073233A1 (en) Analysing speech signals
US20200043503A1 (en) Speaker verification
US11024318B2 (en) Speaker verification
US20230343359A1 (en) Live speech detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 18788840; Country of ref document: EP; Kind code of ref document: A1.
ENP Entry into the national phase. Ref document number: 202004481; Country of ref document: GB; Kind code of ref document: A; Free format text: PCT FILING DATE = 20181011.
NENP Non-entry into the national phase. Ref country code: DE.
122 Ep: pct application non-entry in european phase. Ref document number: 18788840; Country of ref document: EP; Kind code of ref document: A1.