WO2000077772A2 - Speech and voice signal preprocessing


Info

Publication number
WO2000077772A2
WO2000077772A2 (PCT/GB2000/002332)
Authority
WO
WIPO (PCT)
Prior art keywords
signal
voice
signals
filtered
linearized
Prior art date
Application number
PCT/GB2000/002332
Other languages
French (fr)
Other versions
WO2000077772A3 (en
Inventor
Ronald Chalmers
Mark Christopher Simpson
Steven Leslie Pae
Original Assignee
Cyber Technology (Iom) Limited
Priority date
Filing date
Publication date
Application filed by Cyber Technology (Iom) Limited filed Critical Cyber Technology (Iom) Limited
Priority to GB0200735A priority Critical patent/GB2367938A/en
Priority to AU55471/00A priority patent/AU5547100A/en
Publication of WO2000077772A2 publication Critical patent/WO2000077772A2/en
Publication of WO2000077772A3 publication Critical patent/WO2000077772A3/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit


Abstract

In a system or method of voice or speech recognition, a voice waveform signal modeled as the product of a power component and an informational component is divided into higher and lower frequency signals, corresponding to the information signal and the power signal. The signals are amplified separately and then combined. By applying higher amplification to the information signal, a more detailed sample of the initial waveform can be provided to voice recognition or word recognition apparatus.

Description

SPEECH AND VOICE SIGNAL PROCESSING
FIELD OF THE INVENTION
This invention relates to a system and method for processing signals derived from human speech, and relates especially to applications in voice and speech recognition systems.
BACKGROUND OF THE INVENTION
Differences between individuals in anatomy, language, and background, amongst other factors, mean that people sound different from one another. The human brain possesses an extraordinary power of voice recognition which enables one to recognize a known speaker, e.g. over the telephone, without explicit identification by any other means. On the other hand, electronic voice recognition systems are becoming of increasing importance in a number of areas. In numerous security applications it is useful to be able to recognize a voice and to distinguish it from other voices. Such applications include, for example, a security access system that determines whether speech from a person requesting access to a secure area is that of an authorized person or a would-be intruder.
Electronic speech recognition is a developing technique, and one which, because of the advent of cheap and substantial computer processing power, is gaining widespread currency. Computer programs are already available which can be trained to convert spoken words into text on screen, vastly facilitating the production of written text. Other software features permit commands to be entered to programs by speech which would otherwise require the use of input devices such as a keyboard or a pointing device such as a mouse. In another area, computer-based information systems, for example for air flight schedules and fares, can be made entirely automatic if they are able to recognize questions asked by a telephone inquirer and then, using a voice synthesis system, produce the appropriate answer.
In known techniques of both voice and speech recognition, a voice waveform, i.e. an electrical signal derived from human speech, is sampled, and a matrix of power levels is generated therefrom either by time or by frequency. The resultant pattern is stored and then compared with an existing sample. For example, in voice recognition, the comparison enables a decision as to whether the same speaker provided both samples, to enable speaker verification. The percentage of points in the stored and new samples which must match before the samples are considered to originate from the same person can be varied so as to provide a varying level of certainty, and therefore of security, depending on the circumstances.
In speech recognition, the stored and new samples are compared to determine if different speakers are speaking the same word. Generally, the percentage of points in the pattern which must match is set lower than for voice recognition. Other versions of speech recognition software require speaker-specific training where samples of a particular speaker are initially accumulated. New samples of the same speaker are compared to these stored samples in order to determine the precise word spoken.
The usual approach taken in the prior art is to process a sample of the speech signal in the time domain. This sampling produces a 2-dimensional matrix of the speech signal which is then made subject to further treatment. Such treatment includes preliminary acoustic spectral analysis based on linear predictive coding, Mel frequency cepstral coefficients, cochlea modelling and others. One disadvantage of this type of approach concerns a fact known to the person skilled in the art, that most of the information contained in a speech signal resides in the higher end of the frequency spectrum. By not discriminating between regions of the frequency spectrum of the speech signal, over-emphasis is placed on the less important portion, with concomitant under-emphasis on the more important high frequency component. What results is a less accurate output than one which weights the spectral regions appropriately.
Existing patents on speech recognition include the following: US patent 5,864,804; US patent 6,067,513; US patent 5,839,099; US patent 5,313,531; US patent 5,960,395; and Japanese patent 5143098A2.
SUMMARY OF THE INVENTION
This invention includes a method for processing a voice signal comprising filtering the components of the voice signal modeled as the product of a power component and an informational component to derive an amplified signal for voice or speech recognition.
In a variation of the invention, the step of filtering the components comprises linearizing the components of the voice signal; duplicating the linearized signal to produce a first linearized signal and a second linearized signal; passing the first linearized signals through a low pass filter producing a first filtered signal and passing the second linearized signals through a high pass filter producing a second filtered signal; and amplifying the filtered signals.
In a further variation of the invention, the filtered signals are amplified differentially. Optionally, the first filtered signal is amplified at a lower gain level than the second filtered signal.
In a variation of the invention the low pass filter and the high pass filter include approximately non-overlapping passbands whereby the higher and the lower spectral components of the linearized signal are separated.
In another variation of the invention, the step of linearizing the components of the voice signal comprises determining the logarithm of the voice signal; and the method further comprises combining the filtered signals and determining the antilogarithm of the combined signals.
The invention also includes the variation where the step of linearizing the components of the voice signal comprises determining the logarithm of the voice signal; and further comprises determining the antilogarithm of the filtered signals.
According to another aspect of the invention, the method further comprises applying an additional voice or speech recognition method to the processed signal. The further voice or speech recognition optionally comprises processing with a hidden Markov model.
In another embodiment of the invention, the method for generating a plurality of processed signals from a voice signal for input to voice or speech recognition further comprises duplicating the voice signal, and analyzing the duplicated voice signal with frequency response analysis. The frequency response analysis optionally involves processing based on a hidden Markov model. In a variation of the invention, the method for voice or speech recognition further comprises applying a voice or speech recognition process to the plurality of signals generated.
Another embodiment of the invention relates to a method for identifying an individual making a fraudulent application to gain unauthorized entry into a voice-activated secured entry system in combination with a database comprising vectors of personal information and voice information of all authorized applicants, which comprises determining whether the application is a fraudulent or a non-fraudulent attempt to gain entry to the system, and recording the applicant's application voice information in the database if the application is determined to be fraudulent. In a variation, the invention further comprises cross-checking the voice information of an application determined to be non-fraudulent with voice information recorded in the database of previous fraudulent applications. Optionally, the cross-checking of the non-fraudulent application voice information against fraudulent application voice information occurs subsequent to the application process.
One embodiment of the invention relates to a method for voice or speech recognition comprising linearizing the components of a voice signal by determining the logarithm of the voice signal modeled as the product of a power component and an informational component; duplicating the linearized signal producing a first linearized signal and a second linearized signal; passing the first linearized signal through a low pass filter producing a first filtered signal and passing the second linearized signal through a high pass filter producing a second filtered signal; amplifying both filtered signals, the first filtered signal amplified at a higher gain than the second filtered signal; combining the filtered signals; determining the antilogarithm of the combined signals; and applying voice or speech recognition to the determined signal.
In a variation of the invention, the passbands of the high and low pass filters subdivide the frequency spectrum at a breakpoint frequency to be preferably between 200 and 400 Hertz.
A further variation involves the breakpoint frequency set preferably at approximately 300 Hertz.
According to another aspect of the invention, the passbands of the high and low pass filters sub-divide the frequency spectrum at a breakpoint frequency determined by ascertaining the context of the application and then generating the breakpoint frequency as a function of the context. The context comprises at least one of the personal characteristics of the speaker selected from the group comprising the gender, age, and language of the speaker.
A variation of the invention involves processing of the voice signal by a computer.
Another embodiment of the invention relates to a system for processing a voice signal comprising a device for filtering the components of the voice signal modeled as the product of a power component and an informational component, and a further device for deriving an amplified signal for voice or speech recognition.
The invention also includes the variation where the system for filtering the components comprises a device for linearizing the components of the voice signal; a device for duplicating the linearized signal to produce a first linearized signal and a second linearized signal; a device for passing the first linearized signals through a low pass filter producing a first filtered signal and a device for passing the second linearized signals through a high pass filter producing a second filtered signal; and a device for amplifying the filtered signals.
In another variation, the filtered signals are amplified differentially.
Another variation of the above involves amplifying the first filtered signal at a lower gain level than the second filtered signal.
Another embodiment of the invention further comprises means for combining the filtered signals.
In another variation of the invention, the low pass filter and the high pass filter have approximately non-overlapping passbands whereby the higher and the lower spectral components of the linearized signal are separated.
According to another aspect of the invention the means for linearizing the components of the voice signal comprises means for determining the logarithm of the voice signal, and further comprises means for combining the filtered signals, and means for determining the antilogarithm of the combined signals.
In another variation, the means for linearizing the components of the voice signal comprises means for determining the logarithm of the voice signal, and further comprises means for determining the antilogarithm of the filtered signals. A further variation further comprises means for applying voice or speech recognition to the processed signal. The means for further voice or speech recognition optionally comprises means for processing with a hidden Markov model.
Another aspect of the invention involves generating a plurality of processed signals from a voice signal for input to means for voice or speech recognition, and comprises means for duplicating the voice signal and means for analyzing the duplicated voice signal with frequency response analysis. The frequency response analysis optionally comprises processing with a hidden Markov model.
Another embodiment of the invention comprises means for applying a voice or speech recognition process to the plurality of signals.
Another embodiment of the invention relates to a system to identify an individual making a fraudulent application to gain unauthorized entry into a voice-activated secured entry system in combination with a database comprising vectors of personal information and voice information of all authorized applicants, which comprises means for determining whether the application is a fraudulent or a non-fraudulent attempt to gain entry to the system, and means for recording the applicant's application voice information in the database if the application is determined to be fraudulent.
Another variation comprises means for cross-checking the voice information of an application determined to be non-fraudulent with voice information recorded in the database of previous fraudulent applications. Optionally, the means for cross-checking the non-fraudulent application voice information against fraudulent application voice information performs the cross-checking as a separate process from that carried out by the means for determining whether an application is fraudulent or non-fraudulent.
The invention also includes the embodiment where a system for voice or speech recognition comprises the following: input means for obtaining a voice signal; means for linearizing the components of the voice signal by determining the logarithm of the voice signal modeled as the product of a power component and an informational component; means for duplicating the linearized signal producing a first linearized signal and a second linearized signal; means for passing the first linearized signals through a low pass filter having a lowpass passband producing a first filtered signal and means for passing the second linearized signals through a high pass filter having a highpass passband producing a second filtered signal; means for amplifying both filtered signals, the first filtered signal amplified at a higher gain than the second filtered signal; means for combining the filtered signals; means for determining the antilogarithm of the combined signals; and means for applying voice or speech recognition to the determined signal.
The invention includes the variation where the passbands of the high and low pass filters sub-divide the frequency spectrum. Another variation involves the breakpoint frequency lying preferably in the range of the frequency spectrum between 200 to 400 Hertz. In another further variation, the breakpoint frequency is preferably 300 Hertz.
According to another aspect of the invention, the passbands of the high and low pass filters sub-divide the frequency spectrum at a breakpoint frequency determined by means for ascertaining the context of the application, and means for generating the breakpoint frequency as a function of the context. The context optionally comprises at least one of the personal characteristics of the speaker selected from the group comprising the gender, age, and language of the speaker.
Another embodiment of the invention relates to a system for providing a signal for voice or speech recognition comprising microphone means to generate a multi-frequency electrical signal from the voice signal; a logarithmic amplifier to receive the electrical signal; a high pass filter and a low pass filter connected to the logarithmic amplifier; first and second amplifiers electrically connected one to each filter; and an exponential amplifier connected to the first and second amplifiers.
In another variation, the system further comprises means for speech or voice recognition connected to the output of the exponential amplifier.
Another aspect of the invention involves a computer being used as the means for speech or voice recognition.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will be described by way of example and with reference to the drawing in which: FIG. 1 is a block diagram of the invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
It is an object of the invention to provide a system and method to improve the effectiveness of both voice and speech recognition systems by pre-treating the voice signal at the initial stage of processing an audio signal.
According to one embodiment of the invention, a method of voice or speech recognition comprises generating a voice waveform signal, dividing the signal into a higher frequency signal and a lower frequency signal, amplifying the higher and lower frequency signals separately, combining the amplified signals, and applying a voice or speech recognition method to the combined amplified signals.
Preferably the signals are divided at a frequency such that the higher frequency signal relates to the information content of the voice waveform and the lower frequency signal relates to the power content of the voice waveform.
The approach used in the present invention thus provides a novel method of processing a speech sample. The approach is to break down the speech into two components and then subject them to separate treatment. The lower frequency spectrum generally relates to the power or volume of the spoken word while the higher frequency components can be considered to contain the bulk of the information content of the sample, including the inflection of the spoken word. After separation of these components, they can be processed differently to maximize the efficiency of the speech or voice recognition system. This method enables much more information about the speech sample to be generated.
Compared with the prior art approach, the method of the present invention may be thought of as generating a three-dimensional picture of the voice sample, rather than a two-dimensional one. By utilizing more processor power on the higher frequency elements of the signal, the system can be much more precise and vastly faster than conventional systems. In terms of the number of points selected for comparison purposes, the prior art two-dimensional approach normally employed may look at, for example, four thousand points, whereas the three-dimensional approach (using the same sampling frequency) can be set to identify and operate on many times that number of samples. Because the samples in the approach used in the method of the present invention can be selected in a biased fashion, due to the separation process, the downstream processing for voice or speech recognition can make use of many more points of reference, for example ten or more times as many, leading to greater accuracy and thus a much diminished risk of false voice recognition, or much higher accuracy of reflection of the words actually spoken in the case of speech or word recognition applications.
Using the approach in accordance with the present invention, the standard technique of normalizing the sample power using automatic gain control circuitry can be dispensed with. By processing the lower frequency component separately, the power level of the total sample is more accurately known and hence easier to control or manipulate. The above approach is far superior to simply amplifying or attenuating the entire signal, as that produces no additional information. This means that the signal can be adjusted before processing by the rest of the voice recognition processors in a more refined way that gives rise to a higher acceptance rate of the voice.
A voice signal V can be modeled as a combination of a volume or power signal P and an information signal I where
V = P I
By taking the logarithm of the signals one arrives at the simple equation:
Log V = Log P + Log I
thus providing an approach in which the signals are linearly separated but related by this simple combination. After such a separation, each signal can be processed in accordance with requirements, in contrast to prior art arrangements in which the entire voice signal V is amplified and processed.
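By way of illustration only, the multiplicative model above can be checked numerically in a few lines of Python; the sampling rate, component waveforms and library choice are assumptions made for this sketch and are not taken from the patent.

```python
# Minimal numerical sketch of the model V = P * I, using synthetic stand-ins
# for the power envelope P and the information signal I (illustrative values).
import numpy as np

fs = 8000                       # assumed sampling rate in Hz
t = np.arange(fs) / fs          # one second of samples

# Slowly varying "power/volume" component and a faster "information" component,
# both kept strictly positive so the logarithm is defined.
P = 1.0 + 0.5 * np.sin(2 * np.pi * 2 * t)
I = 1.0 + 0.3 * np.sin(2 * np.pi * 1200 * t)

V = P * I                       # the modeled voice signal

# Taking logarithms turns the product into a sum, so the two components
# become linearly separable by frequency-selective filtering.
assert np.allclose(np.log(V), np.log(P) + np.log(I))
```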
The invention will now be described by way of a preferred embodiment with reference to the accompanying drawing which illustrates a system for capturing and processing human speech. To a person skilled in the art, it is obvious that the components of this system, with the exception of the microphone, may be implemented either by analog circuitry or digital means after analog to digital conversion. For convenience of description, analog terminology is used presently.
In Figure 1, a human being 10 is shown adjacent a microphone 12 connected to a linear amplifier 14 which is connected through a logarithmic amplifier 16 to a high pass filter 18 and a low pass filter 20. The filters 18, 20 are connected respectively to first and second amplifiers 22, 24 which are both connected through an exponential amplifier 26 to a voice or speech recognition apparatus 28.
When the human being 10 speaks into the microphone, the output of the linear amplifier 14 is the signal V = P I. The output of the logarithmic amplifier 16 is a signal log V = log P + log I which is connected to both of the filters 18, 20. This in effect linearizes the two components of the voice signal. The output of the high pass filter 18 is the signal log I which is amplified by amplifier 22 to form a signal log I'. The output of the low pass filter 20 is the signal log P and the amplified signal from amplifier 24 is the signal log P'. This effects the separation of the components.
The frequency at which the signals are separated, the breakpoint, and therefore the dividing line between the passbands of the high pass and low pass filters 18, 20, is preferably around the 300 Hertz threshold. The breakpoint is to some extent arbitrary, as there is no exact frequency at which the signal changes from a power signal to an information signal, but 300 Hertz is generally recognized in speech analysis as providing a reasonable boundary between the power/volume component and the information component of human speech. The breakpoint frequency may also be determined adaptively depending upon the personal characteristics of the speaker and the language spoken.
Since the filters 18, 20 will, as is clear to a person skilled in the art, not have infinite attenuation outside their passbands, there is some merging of the power and information signals near the breakpoint frequency, which is acceptable under the circumstances.
Typically the amplification applied by the first amplifier 22 to the information signal is higher than the amplification applied by amplifier 24 to the power signal. The signals log I' and log P' are combined in the exponential amplifier 26 to give a combined amplified signal V = P' I', which is passed to the voice or speech recognition apparatus 28 for processing. Such a signal now includes enhanced detail in the information signal, thus allowing a more accurate matrix sample of the initial waveform. The lower amplification applied to the power signal will reduce the effect of differences in microphone sensitivity, loudness of speech, etc. The final voice or speech recognition may be based on, for example, a hidden Markov model or an adaptation of dynamic time warping, fuzzy logic, neural networks, template matching, expert systems, or a combination of these approaches for pattern recognition (See L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993; and C.H. Lee, F.K. Soong and K.K. Paliwal (Eds.), Automatic Speech and Speaker Recognition: Advanced Topics, Kluwer, Boston, 1996).
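The complete chain of Figure 1 can be sketched digitally as follows; this is a minimal illustration assuming a SciPy-based implementation, and the filter order and gain values are assumptions chosen only to show the principle (the information component amplified more than the power component), not figures from the patent.

```python
# Sketch of the chain: logarithm -> low/high pass split at a 300 Hz breakpoint
# -> differential gain -> recombination -> antilogarithm.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def preprocess(voice, fs, breakpoint_hz=300.0, power_gain=1.0, info_gain=2.0):
    """Return a signal in which the information component is emphasized."""
    eps = 1e-8
    # Work on the magnitude with a small floor so the logarithm is defined;
    # a practical system would treat sign/phase separately.
    log_v = np.log(np.abs(voice) + eps)                 # log V = log P + log I

    sos_lp = butter(4, breakpoint_hz, btype="low", fs=fs, output="sos")
    sos_hp = butter(4, breakpoint_hz, btype="high", fs=fs, output="sos")

    log_p = sosfiltfilt(sos_lp, log_v)                  # power/volume component
    log_i = sosfiltfilt(sos_hp, log_v)                  # information component

    # Amplify the information component more strongly than the power component,
    # then recombine and take the antilogarithm.
    return np.exp(power_gain * log_p + info_gain * log_i)

# Example usage with a synthetic one-second signal.
fs = 8000
t = np.arange(fs) / fs
voice = (1.0 + 0.5 * np.sin(2 * np.pi * 2 * t)) * np.sin(2 * np.pi * 440 * t)
processed = preprocess(voice, fs)
```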
Thus the voice or speech recognition process carried out in apparatus 28 can have a higher level of accuracy than has previously been possible. As is well known, the output 29 of apparatus 28 will be a "yes" or "no" signal for voice recognition, or an output corresponding to the one or more words in question in speech recognition systems, for example when the apparatus 28 is a personal computer running a direct dictation-to-screen program.
A further advantage according to the invention is that it is now easy for a person skilled in the art to introduce automatic gain control using the lower frequency component in another embodiment. This ensures that the signal is subsequently presented for further processing at optimum levels and reduces variations due to the speaker moving away from the microphone, etc.
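A minimal sketch of such gain control, assuming the same SciPy-based setup as in the earlier example, is given below; the target level and the safeguards against very small values are illustrative assumptions.

```python
# Use the low-frequency (power) component as the control signal for automatic
# gain control, scaling the input towards a target level.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def agc_from_power(voice, fs, breakpoint_hz=300.0, target_rms=0.1):
    """Scale the signal so its slowly varying power component sits near a target level."""
    sos_lp = butter(4, breakpoint_hz, btype="low", fs=fs, output="sos")
    power_envelope = np.sqrt(np.maximum(sosfiltfilt(sos_lp, voice ** 2), 0.0))
    gain = target_rms / np.maximum(power_envelope, 1e-6)
    return voice * gain
```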
According to another preferred embodiment, the system includes an adaptive component in order to determine the optimal breakpoint (cut-off) frequency for delimiting power and information (i.e. inflection) based on the characteristics of the speaker. There are many variations of voice types: some voices have a lot of bass resonance and a lot of inflection in that bass part, while other voices have resonance in the higher frequencies. A significant part of the resonance tends to depend on whether the input is from a male or a female, an adult or a child (but not necessarily so). The adaptive system allows for that differentiation. The voice differentiation could be implemented as a separate and prior component, or integrated as part of a feedback design shifting the break point depending upon the input voice signal. Alternatively, voice differentiation is first processed before it moves to the breakpoint where the power and information components are separated. In either case a higher level of voice and password acceptance results.
The object of another preferred embodiment is to transcend the limitation of a single language, for example, English. Languages vary enormously in sound, inflection and pronunciation; these determine the different resonance levels and frequency responses. The database and the break point may vary with language; although largely language independent, the system maintains different processing options that can support different languages. First, the spectral components are split based on the cut-off frequency, which break point is dependent upon the type of voice. The voice can be further categorized so that the character of continued processing depends upon the nature and origin of the input language. Therefore a language dependent system results. The software recognizes the language and type of speaker and makes the processing adaptive to these features. That is to say, a French-speaking person will be classified in a certain group; then if a female, in a subgroup thereof, further if a child: the processing branches off again and again prior to undergoing actual spectral separation. Subsequently, when another person uses the system, say a Chinese adult male, then the system adapts to this entirely new language and speaker type for voice comparison. The natural language, the gender, and the age of the speaker are all factors where context is permitted to influence the choice of an optimal breakpoint frequency.
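The context-dependent choice of breakpoint can be sketched as a simple mapping; the particular categories and frequency offsets below are hypothetical illustrations, chosen only to show how language, gender and age could shift the breakpoint within the 200 to 400 Hertz range, and are not values given in the description.

```python
# Hypothetical mapping from speaker context to a breakpoint frequency.
def breakpoint_for_context(language: str, gender: str, age_group: str) -> float:
    base = 300.0                                   # default breakpoint in Hz
    if gender == "female":
        base += 50.0                               # typically higher-pitched voices
    if age_group == "child":
        base += 100.0
    if language in ("fr", "zh"):                   # illustrative language adjustment
        base += 25.0
    return min(max(base, 200.0), 400.0)            # keep within 200-400 Hz

# Example: a French-speaking adult female would be assigned 375 Hz here.
print(breakpoint_for_context("fr", "female", "adult"))
```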
A further preferred embodiment implements dual-level security with a parallel frequency response analysis system. This parallel system is run so that the voice/speech system effectively has two levels of security. The additional test could be based on a hidden Markov model or some other similar model for the process. This allows the system to focus on its main application, namely overall voice recognition and its security. A check of the results from this standard speech recognition system and from the present invention gives rise to two masks that collectively achieve a greater level of verification. Therefore, the results of the standard system are taken and compared with this invention's results, achieving better comparison, analysis and outcome.
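One way to picture the two-level check, assuming each path exposes a scoring function, is the sketch below; the recognizer callables and the acceptance threshold are assumptions for illustration only.

```python
# Accept an applicant only when both the preprocessing-based path and the
# parallel frequency-response path agree.
from typing import Callable

def dual_level_verify(sample,
                      preprocess_recognizer: Callable,
                      parallel_recognizer: Callable,
                      threshold: float = 0.8) -> bool:
    score_a = preprocess_recognizer(sample)    # score from this invention's path
    score_b = parallel_recognizer(sample)      # score from the parallel standard path
    return score_a >= threshold and score_b >= threshold
```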
Still another preferred embodiment implements functionality to facilitate investigation of fraudulent applications. One component of the embodiment maintains a database of data and information about each user, including the person's voice and voice inflection. If anyone attempts to make a fraudulent application, the person's input voice information is also recorded. This fraudulent application is cross-referenced or cross-checked against the main database and voice information, whether at the time of application or subsequently, for example in a batched process on an interval basis. A match of a fraudulent applicant's voice information to their real application voice information would provide accurate and appropriate details for further security investigation.
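The bookkeeping for such an investigation can be sketched as follows; the data structure, similarity function and threshold are hypothetical placeholders, since the description does not prescribe a particular representation of the voice information.

```python
# Record voice vectors from applications judged fraudulent and cross-check
# later applications against them, either immediately or in a batch.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class FraudRegister:
    fraudulent_vectors: List[Tuple[str, list]] = field(default_factory=list)

    def record_fraudulent(self, applicant_id: str, voice_vector: list) -> None:
        self.fraudulent_vectors.append((applicant_id, voice_vector))

    def cross_check(self, voice_vector: list,
                    similarity: Callable[[list, list], float],
                    threshold: float = 0.9) -> List[str]:
        """Return ids of earlier fraudulent attempts whose voice matches this one."""
        return [fid for fid, vec in self.fraudulent_vectors
                if similarity(voice_vector, vec) >= threshold]
```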
Another preferred embodiment, as indicated via the dotted connections in Figure 1, has outputs of the first and second amplifiers 22, 24 connected to separate exponential amplifiers 30, 32, and the power signal and information signal are then processed separately using different methodology suitable for the type of signal.
One preferred embodiment implements this invention by analog means using electronic circuitry, including logarithmic and exponential converters and operational amplifiers to realize Sallen and Key high and low pass filters. Another preferred embodiment implements the invention digitally. In the case of the latter, an analog to digital converter digitizes the voice input from a microphone. A computer or dedicated hardware can then process the digitized voice signal. Filtering of the signal can be effected by standard digital filtering methods (See A.V. Oppenheim and R.W. Schafer, with J.R. Buck, Discrete-Time Signal Processing, Second Edition, Prentice-Hall, Inc., Upper Saddle River, NJ, 1999).
It will be appreciated that the above description relates to the preferred embodiments by way of example only. Many variations on the apparatus for delivering the invention will be obvious to those knowledgeable in the field, and such obvious variations are within the scope of the invention as described and claimed, whether or not expressly described.
All patents and publications referred to in this paper are incorporated by reference in their entirety.

Claims

What is claimed is:
1. A method for processing a voice signal comprising filtering the components of the voice signal modeled as the product of a power component and an informational component to derive an amplified signal for voice or speech recognition.
2. The method of claim 1, wherein the step of filtering the components comprises:
• linearizing the components of the voice signal;
• duplicating the linearized signal to produce a first linearized signal and a second linearized signal;
• passing the first linearized signal through a lowpass filter having a lowpass passband to produce a first filtered signal and passing the second linearized signal through a highpass filter having a highpass passband to produce a second filtered signal; and
• amplifying the filtered signals.
3. The method of claim 2, wherein the filtered signals are amplified differentially.
4. The method of any of claims 2 or 3, wherein the first filtered signal is amplified at a lower gain level than the second filtered signal.
5. The method of any of claims 2 to 4, further comprising combining the filtered signals.
6. The method of any of claims 2 to 5, wherein the low pass filter and the high pass filter include approximately non-overlapping passbands whereby the higher and the lower spectral components of the linearized signal are separated.
7. The method of any of claims 2 to 6, wherein:
• the step of linearizing the components of the voice signal comprises determining the logarithm of the voice signal; and
• the method further comprises:
• combining the filtered signals; and
• determining the antilogarithm of the combined signals.
8. The method of any of claims 2 to 6, wherein:
• the step of linearizing the components of the voice signal comprises determining the logarithm of the voice signal; and
• the method further comprises determining the antilogarithm of the filtered signals.
9. A method for voice or speech recognition comprising the method of any of claims 1 to 8, and further comprising applying voice or speech recognition to the processed signal and determining whether the voice signal is that of a recognized person, in the case of voice recognition, or of one or more recognized words, in the case of speech recognition.
10. The method of claim 9, wherein the voice or speech recognition comprises processing with a hidden Markov model.
11. A method for generating a plurality of processed signals from a voice signal for input to voice or speech recognition, comprising the method of any of claims 1 to 8, and further comprising:
• duplicating the voice signal; and
• analyzing the duplicated voice signal with frequency response analysis.
12. The method of claim 11, wherein the frequency response analysis comprises processing with a hidden Markov model.
13. A method for voice or speech recognition comprising the method of any of claims 11 or 12, and further comprising applying a voice or speech recognition process to the plurality of signals.
14. A method for identifying an individual making a fraudulent application to gain unauthorized entry into a voice-activated secured entry system, the system being in combination with a database comprising vectors of personal information and voice information of all authorized applicants, the method comprising the method of any of claims 9, 10 or 13, and further comprising:
• determining whether the application is a fraudulent or a non-fraudulent attempt to gain entry to the system; and
• recording the applicant's application voice information in the database if the application is determined to be fraudulent.
15. The method of claim 14, further comprising cross-checking the voice information of an application determined to be non-fraudulent with voice information recorded in the database of previous fraudulent applications.
16. The method of claim 15, wherein the cross-checking of the non-fraudulent application voice information against fraudulent application voice information occurs subsequent to the application process.
17. A method for voice or speech recognition comprising:
• linearizing the components of a voice signal by determining the logarithm of the voice signal modeled as the product of a power component and an informational component;
• duplicating the linearized signal to produce a first linearized signal and a second linearized signal;
• passing the first linearized signal through a low pass filter having a lowpass passband to produce a first filtered signal and passing the second linearized signal through a high pass filter having a highpass passband to produce a second filtered signal;
• amplifying both filtered signals, the first filtered signal amplified at a higher gain than the second filtered signal;
• combining the filtered signals;
• determining the antilogarithm of the combined signals; and
• applying voice or speech recognition to the determined signal.
18. The method of any of claims 2 to 17, wherein the passbands of the high and low pass filters sub-divide the frequency spectrum at a breakpoint frequency between 200 and 400 Hertz.
19. The method of claim 18, in which the breakpoint frequency is approximately 300 Hertz.
20. The method of any of claims 2 to 17, wherein the passbands of the high and low pass filters sub-divide the frequency spectrum at a breakpoint frequency determined by the following steps:
• determining the context of the application; and
• generating the breakpoint frequency as a function of the context.
21. The method of claim 20, wherein the context comprises at least one of the personal characteristics of the speaker selected from the group comprising the gender, age, and language of the speaker.
22. The method of any of claims 1 to 21, wherein the voice signal is processed by a computer.
23. A system for processing a voice signal comprising means for filtering the components of the voice signal modeled as the product of a power component and an informational component and means for deriving an amplified signal for voice or speech recognition.
24. The system of claim 23, wherein the means for filtering the components comprises:
• means for linearizing the components of the voice signal;
• means for duplicating the linearized signal to produce a first linearized signal and a second linearized signal;
• means for passing the first linearized signal through a low pass filter to produce a first filtered signal and means for passing the second linearized signal through a high pass filter to produce a second filtered signal; and
• means for amplifying the filtered signals.
25. The system of claim 24, wherein the filtered signals are amplified differentially.
26. The system of any of claims 24 or 25, wherein the first filtered signal is amplified at a lower gain level than the second filtered signal.
27. The system of any of claims 24 to 26, further comprising means for combining the filtered signals.
28. The system of any of claims 24 to 27, wherein the low pass filter and the high pass filter have approximately non-overlapping passbands whereby the higher and the lower spectral components of the linearized signal are separated.
29. The system of any of claims 24 to 28, wherein:
• the means for linearizing the components of the voice signal comprises means for determining the logarithm of the voice signal; and
• the system further comprises:
• means for combining the filtered signals; and
• means for determining the antilogarithm of the combined signals.
30. The system of any of claims 24 to 28, wherein:
• the means for linearizing the components of the voice signal comprises means for determining the logarithm of the voice signal; and
• the system further comprises means for determining the antilogarithm of the filtered signals.
31. A system for voice or speech recognition comprising the system of any of claims 23 to 30, and further comprising means for applying voice or speech recognition to the processed signal.
32. The system of claim 31, wherein the means for voice or speech recognition comprises means for processing with a hidden Markov model.
33. A system for generating a plurality of processed signals from a voice signal for input to means for voice or speech recognition, comprising the system of any of claims 23 to 31, and further comprising:
• means for duplicating the voice signal; and
• means for analyzing the duplicated voice signal with frequency response analysis.
34. The system of claim 33, wherein the frequency response analysis comprises processing with a hidden Markov model.
35. A system for voice or speech recognition comprising the system of any of claims 33 or 34, and further comprising means for applying a voice or speech recognition process to the plurality of signals.
36. A system for identifying an individual making a fraudulent application to gain unauthorized entry into a voice-activated secured entry system, in combination with a database comprising vectors of personal information and voice information of all authorized applicants, comprising the system of any of claims 31 to 33 and 35, and further comprising:
• means for determining whether the application is a fraudulent or a non-fraudulent attempt to gain entry to the system; and
• means for recording the applicant's application voice information in the database if the application is determined to be fraudulent.
37. The system of claim 36, further comprising means for cross-checking the voice information of an application determined to be non-fraudulent with voice information recorded in the database of previous fraudulent applications.
38. The system of claim 37, wherein the means for cross-checking the non-fraudulent application voice information against fraudulent application voice information performs the cross-checking as a separate process from that carried out by the means for determining whether an application is fraudulent or non-fraudulent.
39. A system for voice or speech recognition comprising:
• input means for obtaining a voice signal;
• means for linearizing the voice signal by determining the logarithm of the voice signal modeled as the product of a power component and an informational component;
• means for duplicating the linearized signal producing a first linearized signal and a second linearized signal;
• means for passing the first linearized signal through a low pass filter having a lowpass passband to produce a first filtered signal and means for passing the second linearized signal through a high pass filter having a highpass passband to produce a second filtered signal;
• means for amplifying both filtered signals, the first filtered signal amplified at a higher gain than the second filtered signal;
• means for combining the filtered signals;
• means for determining the antilogarithm of the combined signals; and
• means for applying voice or speech recognition to the determined signal.
40. The system of any of claims 24 to 39, wherein the passbands of the high and low pass filters sub-divide the frequency spectrum at a breakpoint frequency.
41. The system of any of claims 24 to 40, wherein the breakpoint frequency lies between 200 and 400 Hertz.
42. The system of claim 40 or 41, wherein the breakpoint frequency is approximately 300 Hertz.
43. The system of any of claims 24 to 40, wherein the passbands of the high and low pass filters sub-divide the frequency spectrum at a breakpoint frequency determined by the following:
• means for ascertaining the context of the application; and
• means for generating the breakpoint frequency as a function of the context.
44. The system of claim 43, wherein the context comprises at least one of the personal characteristics of the speaker selected from the group comprising the gender, age, and language of the speaker.
45. A system for providing a signal for voice or speech recognition, the system comprising:
• microphone means to generate a multi-frequency electrical signal from the voice signal;
• a logarithmic amplifier to receive the electrical signal;
• a high pass filter and a low pass filter connected to the logarithmic amplifier;
• a first amplifier electrically connected to the high pass filter and a second amplifier electrically connected to the low pass filter; and
• an exponential amplifier connected to the first and second amplifiers.
46. The system of claim 45, further comprising means for speech or voice recognition connected to the output of the exponential amplifier.
47. The system of claim 46, in which the means for speech or voice recognition comprises a computer.
PCT/GB2000/002332 1999-06-14 2000-06-14 Speech and voice signal preprocessing WO2000077772A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB0200735A GB2367938A (en) 1999-06-14 2000-06-14 Speech and voice signal processing
AU55471/00A AU5547100A (en) 1999-06-14 2000-06-14 Speech and voice signal processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB9913773.9 1999-06-14
GBGB9913773.9A GB9913773D0 (en) 1999-06-14 1999-06-14 Speech signal processing

Publications (2)

Publication Number Publication Date
WO2000077772A2 true WO2000077772A2 (en) 2000-12-21
WO2000077772A3 WO2000077772A3 (en) 2002-10-10

Family

ID=10855289

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2000/002332 WO2000077772A2 (en) 1999-06-14 2000-06-14 Speech and voice signal preprocessing

Country Status (3)

Country Link
AU (1) AU5547100A (en)
GB (2) GB9913773D0 (en)
WO (1) WO2000077772A2 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9571652B1 (en) 2005-04-21 2017-02-14 Verint Americas Inc. Enhanced diarization systems, media and methods of use
US8639757B1 (en) 2011-08-12 2014-01-28 Sprint Communications Company L.P. User localization using friend location information

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4827516A (en) * 1985-10-16 1989-05-02 Toppan Printing Co., Ltd. Method of analyzing input speech and speech analysis apparatus therefor
EP0625775A1 (en) * 1993-05-18 1994-11-23 International Business Machines Corporation Speech recognition system with improved rejection of words and sounds not contained in the system vocabulary
US5495522A (en) * 1993-02-01 1996-02-27 Multilink, Inc. Method and apparatus for audio teleconferencing a plurality of phone channels
WO1998043237A1 (en) * 1997-03-25 1998-10-01 The Secretary Of State For Defence Recognition system
US5878392A (en) * 1991-04-12 1999-03-02 U.S. Philips Corporation Speech recognition using recursive time-domain high-pass filtering of spectral feature vectors

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
OPPENHEIM A V, SCHAFER R W, STOCKHAM T G: "Nonlinear Filtering of Multiplied and Convolved Signals" PROCEEDINGS OF THE IEEE, no. 56, August 1968 (1968-08), pages 1264-1291, XP000946572 ISSN: 0165-1684 *
PINOLI J: "A general comparative study of the multiplicative homomorphic, log-ratio and logarithmic image processing approaches" SIGNAL PROCESSING. EUROPEAN JOURNAL DEVOTED TO THE METHODS AND APPLICATIONS OF SIGNAL PROCESSING,NL,ELSEVIER SCIENCE PUBLISHERS B.V. AMSTERDAM, vol. 58, no. 1, 1 April 1997 (1997-04-01), pages 11-45, XP004082677 ISSN: 0165-1684 *

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8793131B2 (en) 2005-04-21 2014-07-29 Verint Americas Inc. Systems, methods, and media for determining fraud patterns and creating fraud behavioral models
US8903859B2 (en) 2005-04-21 2014-12-02 Verint Americas Inc. Systems, methods, and media for generating hierarchical fused risk scores
US8924285B2 (en) 2005-04-21 2014-12-30 Verint Americas Inc. Building whitelists comprising voiceprints not associated with fraud and screening calls using a combination of a whitelist and blacklist
US8930261B2 (en) 2005-04-21 2015-01-06 Verint Americas Inc. Method and system for generating a fraud risk score using telephony channel based audio and non-audio data
US9113001B2 (en) 2005-04-21 2015-08-18 Verint Americas Inc. Systems, methods, and media for disambiguating call data to determine fraud
EA019949B1 (en) * 2009-09-24 2014-07-30 Общество с ограниченной ответственностью "Центр речевых технологий" Method for identifying a speaker based on random speech phonograms using formant equalization
US9047866B2 (en) 2009-09-24 2015-06-02 Speech Technology Center Limited System and method for identification of a speaker by phonograms of spontaneous oral speech and by using formant equalization using one vowel phoneme type
WO2011046474A3 (en) * 2009-09-24 2011-06-16 Общество С Ограниченной Ответственностью "Цeнтp Речевых Технологий" Method for identifying a speaker based on random speech phonograms using formant equalization
US9875739B2 (en) 2012-09-07 2018-01-23 Verint Systems Ltd. Speaker separation in diarization
US11227603B2 (en) 2012-11-21 2022-01-18 Verint Systems Ltd. System and method of video capture and search optimization for creating an acoustic voiceprint
US10720164B2 (en) 2012-11-21 2020-07-21 Verint Systems Ltd. System and method of diarization and labeling of audio data
US11776547B2 (en) 2012-11-21 2023-10-03 Verint Systems Inc. System and method of video capture and search optimization for creating an acoustic voiceprint
US11380333B2 (en) 2012-11-21 2022-07-05 Verint Systems Inc. System and method of diarization and labeling of audio data
US11367450B2 (en) 2012-11-21 2022-06-21 Verint Systems Inc. System and method of diarization and labeling of audio data
US11322154B2 (en) 2012-11-21 2022-05-03 Verint Systems Inc. Diarization using linguistic labeling
US10134400B2 (en) 2012-11-21 2018-11-20 Verint Systems Ltd. Diarization using acoustic labeling
US10134401B2 (en) 2012-11-21 2018-11-20 Verint Systems Ltd. Diarization using linguistic labeling
US10950242B2 (en) 2012-11-21 2021-03-16 Verint Systems Ltd. System and method of diarization and labeling of audio data
US10438592B2 (en) 2012-11-21 2019-10-08 Verint Systems Ltd. Diarization using speech segment labeling
US10446156B2 (en) 2012-11-21 2019-10-15 Verint Systems Ltd. Diarization using textual and audio speaker labeling
US10522152B2 (en) 2012-11-21 2019-12-31 Verint Systems Ltd. Diarization using linguistic labeling
US10522153B2 (en) 2012-11-21 2019-12-31 Verint Systems Ltd. Diarization using linguistic labeling
US10650826B2 (en) 2012-11-21 2020-05-12 Verint Systems Ltd. Diarization using acoustic labeling
US10950241B2 (en) 2012-11-21 2021-03-16 Verint Systems Ltd. Diarization using linguistic labeling with segmented and clustered diarized textual transcripts
US10692501B2 (en) 2012-11-21 2020-06-23 Verint Systems Ltd. Diarization using acoustic labeling to create an acoustic voiceprint
US10692500B2 (en) 2012-11-21 2020-06-23 Verint Systems Ltd. Diarization using linguistic labeling to create and apply a linguistic model
US10902856B2 (en) 2012-11-21 2021-01-26 Verint Systems Ltd. System and method of diarization and labeling of audio data
US10109280B2 (en) 2013-07-17 2018-10-23 Verint Systems Ltd. Blind diarization of recorded calls with arbitrary number of speakers
US9881617B2 (en) 2013-07-17 2018-01-30 Verint Systems Ltd. Blind diarization of recorded calls with arbitrary number of speakers
US11670325B2 (en) 2013-08-01 2023-06-06 Verint Systems Ltd. Voice activity detection using a soft decision mechanism
US10665253B2 (en) 2013-08-01 2020-05-26 Verint Systems Ltd. Voice activity detection using a soft decision mechanism
US9984706B2 (en) 2013-08-01 2018-05-29 Verint Systems Ltd. Voice activity detection using a soft decision mechanism
US10366693B2 (en) 2015-01-26 2019-07-30 Verint Systems Ltd. Acoustic signature building for a speaker from multiple sessions
US10726848B2 (en) 2015-01-26 2020-07-28 Verint Systems Ltd. Word-level blind diarization of recorded calls with arbitrary number of speakers
US9875743B2 (en) 2015-01-26 2018-01-23 Verint Systems Ltd. Acoustic signature building for a speaker from multiple sessions
US11636860B2 (en) 2015-01-26 2023-04-25 Verint Systems Ltd. Word-level blind diarization of recorded calls with arbitrary number of speakers
US9875742B2 (en) 2015-01-26 2018-01-23 Verint Systems Ltd. Word-level blind diarization of recorded calls with arbitrary number of speakers
CN106683686A (en) * 2016-11-18 2017-05-17 祝洋 Examinee gender statistical equipment and statistical method of same
US11538128B2 (en) 2018-05-14 2022-12-27 Verint Americas Inc. User interface for fraud alert management
US11240372B2 (en) 2018-10-25 2022-02-01 Verint Americas Inc. System architecture for fraud detection
US10887452B2 (en) 2018-10-25 2021-01-05 Verint Americas Inc. System architecture for fraud detection
US11115521B2 (en) 2019-06-20 2021-09-07 Verint Americas Inc. Systems and methods for authentication and fraud detection
US11652917B2 (en) 2019-06-20 2023-05-16 Verint Americas Inc. Systems and methods for authentication and fraud detection
US11868453B2 (en) 2019-11-07 2024-01-09 Verint Americas Inc. Systems and methods for customer authentication based on audio-of-interest

Also Published As

Publication number Publication date
GB2367938A (en) 2002-04-17
GB9913773D0 (en) 1999-08-11
WO2000077772A3 (en) 2002-10-10
GB0200735D0 (en) 2002-02-27
AU5547100A (en) 2001-01-02

Similar Documents

Publication Publication Date Title
Rosenberg Automatic speaker verification: A review
Bimbot et al. A tutorial on text-independent speaker verification
US6463415B2 (en) 69voice authentication system and method for regulating border crossing
WO2000077772A2 (en) Speech and voice signal preprocessing
US5666466A (en) Method and apparatus for speaker recognition using selected spectral information
Prabakaran et al. A review on performance of voice feature extraction techniques
JP2002514318A (en) System and method for detecting recorded speech
Charisma et al. Speaker recognition using mel-frequency cepstrum coefficients and sum square error
Kekre et al. Speaker recognition using Vector Quantization by MFCC and KMCG clustering algorithm
Goh et al. Robust computer voice recognition using improved MFCC algorithm
De Lara A method of automatic speaker recognition using cepstral features and vectorial quantization
Khanna et al. Application of vector quantization in emotion recognition from human speech
Sukor et al. Speaker identification system using MFCC procedure and noise reduction method
Londhe et al. Extracting Behavior Identification Features for Monitoring and Managing Speech-Dependent Smart Mental Illness Healthcare Systems
Hizlisoy et al. Text independent speaker recognition based on MFCC and machine learning
Jain et al. Speech features analysis and biometric person identification in multilingual environment
Bansal et al. Medwell Journals, 2007 Automatic Speaker Identification Using Vector Quantization
Aliyu et al. Development of a text-dependent speaker recognition system
Tsuge et al. Bone-and air-conduction speech combination method for speaker recognition
Chakraborty et al. An improved approach to open set text-independent speaker identification (OSTI-SI)
Higgins et al. A multi-spectral data-fusion approach to speaker recognition
Revathi et al. Text independent composite speaker identification/verification using multiple features
Cohen Forensic Applications of Automatic Speaker Verification
Yee et al. Classification of language speech recognition system
Iliadi Bio-inspired voice recognition for speaker identification

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
ENP Entry into the national phase in:

Ref country code: GB

Ref document number: 200200735

Kind code of ref document: A

Format of ref document f/p: F

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

NENP Non-entry into the national phase in:

Ref country code: JP