WO1991002348A1 - Speech recognition using spectral line frequencies - Google Patents

Speech recognition using spectral line frequencies

Publication number
WO1991002348A1
Authority
WO
WIPO (PCT)
Prior art keywords
line spectral
spectral frequencies
transfer function
speech
linear predictive
Application number
PCT/US1990/003844
Other languages
French (fr)
Inventor
Clifford Allan Wood
Morris Anthony Moore
James Michael Keba
Original Assignee
Motorola, Inc.
Priority date
1989-08-07
Filing date
1990-07-09
Publication date
1991-02-21
Application filed by Motorola, Inc.

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Abstract

A method for recognizing speech mathematically transforms a particular set of Linear Predictive Coding (4) coefficients derived from a spoken utterance (2) into unique Line Spectral Frequencies (5). Recognition is accomplished by optimally matching (6) the unique Line Spectral Frequencies to one of a set of predetermined Line Spectral Frequencies representing spoken utterances.

Description

SPEECH RECOGNITION USING SPECTRAL LINE FREQUENCIES
Field of the Invention
This invention relates in general to the processing of speech and more particularly to the recognition of speech.
Background of the Invention
The processing of speech for purposes of characterizing the vocal tract to achieve more efficient storage and transmission of speech is well known. One method for coding speech is using the Linear Predictive Coding model. Using Linear Predictive Coding methods, a stable all pole transfer function that linearly predicts future outputs based on the history of inputs can be derived representing the vocal tract.
Linear Predictive Coding coefficients can be used to recognize speech. Since the Linear Predictive Coding speech model is a digital filter, each utterance transformed using Linear Predictive Coding methods results in a unique set of filter coefficients. Each set of Linear Predictive Coding coefficients can be matched to templates containing a set of Linear Predictive Coding coefficients representing target utterances. To create the templates used in speech recognition, the speech recognition system must be trained. A speech recognition system is trained by compiling a large sample of utterances or words spoken by one or more individuals and statistically forming templates representing the characteristic set of Linear Predictive Coding coefficients of each utterance. This method of using Linear Predictive Coding coefficients to recognize speech works well in a system where the recognition hardware has been trained to the average user's voice characteristics. However, the percentage of correctly recognized utterances or words drops dramatically when the Linear Predictive Coding speech recognition system encounters an individual with a foreign accent, nasality, or an uncharacterized dialect. This drop in recognition score is caused by variations in amplitude, timing, and pitch between the same utterances spoken by different individuals or by the same individual at different times.
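The training step described above can be sketched in Python as follows. The description says only that templates are formed "statistically"; taking the arithmetic mean of the training vectors for each utterance is an assumption made here for illustration, and the name build_templates and the data layout are hypothetical.

```python
import numpy as np

def build_templates(training_set):
    """Form one template per utterance label.

    training_set maps a label (for example "one") to a list of
    equal-length coefficient vectors collected from training utterances.
    Assumption: the template is the element-wise mean of those vectors.
    """
    return {
        label: np.mean(np.asarray(vectors, dtype=float), axis=0)
        for label, vectors in training_set.items()
    }
```

Other statistics, such as a median or a per-coefficient variance stored alongside the mean, would fit the same description equally well.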
In an attempt to produce a more efficient method for coding speech, the Line Spectral Frequencies speech method, hereafter referred to as the Line Spectral Frequencies method, was developed. The Line Spectral Frequencies method transforms Linear Predictive Coding coefficients to create a lossless digital filter represented by a polynomial transfer function which has roots lying on the unit circle. Previously, the use of Line Spectral Frequencies for speech analysis was limited to the coding and decoding of speech.
In Linear Predictive Coding speech recognition, some of the information derived in the transform is not useful because recognition of an utterance requires primarily the identification of the spectral peaks, hereafter referred to as formants. Linear Predictive Coding speech recognition methods make a decision based on a larger, possibly more variable, volume of data because the information content of an utterance represented by Linear Predictive Coding coefficients is greater than that of the same utterance represented by Line Spectral Frequencies. This larger, more variable volume of data comprised within the spectrum represented by the Linear Predictive Coding coefficient filter increases the probability of an error during recognition. Thus, what is needed is a more efficient and accurate method for the recognition of speech which uses Line Spectral Frequencies.
Summary of the Invention
Accordingly, it is an object of the present invention to provide an improved method for the recognition of speech.
Another object of the invention is to provide for the recognition of speech where a particular utterance may not exactly match the same utterance spoken by the same or another person.
In carrying out the above and other objects of the invention in one form, there is provided a method for recognizing speech comprising the steps of deriving a transfer function from the speech, transforming the coefficients of the transfer function into representative line spectral frequencies, and matching the representative line spectral frequencies to predetermined line spectral frequencies. The above and other objects, features, and advantages of the present invention will be better understood from the following detailed description taken in conjunction with the accompanying drawings.
Brief Description of the Drawings
FIG. 1 is a block diagram of a particular embodiment of a Line Spectral Frequency speech recognition system.
FIG. 2 is a graph of the amplitude variations of a spoken word plotted in the time domain.
FIG. 3 is a graph of the smoothed Linear Predictive Coding filter data derived from FIG. 2.
FIG. 4 is a graph of the smoothed Linear Predictive Coding filter data from FIG. 3 and the Line Spectral Frequencies derived from the Linear Predictive Coding coefficients used to plot FIG. 3.
Description of a Preferred Embodiment
Referring to FIG. 1, the block diagram shows a Line Spectral Frequency speech recognition system. In this example the person 1 says the word "one" representing the numeric digit "1". This acoustic energy comprised within the analog time domain representation of the utterance is quantized by coupling the output of a pickup transducer such as a microphone 2 to an analog to digital converter 3. The discrete representation of the utterance is then converted by a Linear Predictive Coding converter 4 into a digital filter transfer function characterized by Linear Predictive Coding coefficients. The digital filter transfer function is comprised of a polynomial having constant coefficients which are derived by mapping the discrete time domain data into the frequency domain using linear predictive coding methods. The digital frequency domain plot shown represents the Linear Predictive Coding frequency spectra derived from the discrete Linear Predictive Coding coefficients. A detailed explanation of the Linear Predictive Coding method is described in the publication "Voice And Speech Processing" by Thomas W. Parsons, copyright 1987, McGraw-Hill, Inc., pp. 136-166.
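The conversion performed by the Linear Predictive Coding converter 4 can be illustrated with a minimal Python sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion, which the description does not specify. The function name lpc_coefficients, the Hamming window, and the choice of model order are illustrative and not taken from the patent.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Return [1, a1, ..., ap], the coefficients of A(z).

    The all-pole vocal tract model is then 1 / A(z), with
    A(z) = 1 + a1*z^-1 + ... + ap*z^-p.
    """
    x = np.asarray(frame, dtype=float)
    x = x * np.hamming(len(x))  # taper the frame edges (assumed window)

    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])

    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    if err == 0.0:
        return a  # silent frame: no prediction possible

    # Levinson-Durbin recursion.
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient for stage i
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k  # remaining prediction error energy
        if err <= 0.0:
            break
    return a
```

With a nonzero frame, the autocorrelation method yields a polynomial A(z) whose roots lie inside the unit circle, which is the stability property the Background section attributes to the Linear Predictive Coding model.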
The output of the Linear Predictive Coding converter 4 is converted to Line Spectral Frequencies by the Line Spectral Frequency converter 5. To find the Line Spectral Frequencies, the Linear Predictive Coding coefficients are transformed using a symmetric pair of polynomials. The polynomials are comprised of an even polynomial representing the sum of the Linear Predictive Coding transfer function and its conjugate, and an odd polynomial representing the difference of the Linear Predictive Coding transfer function and its conjugate. The Line Spectral Frequencies are then found by solving the polynomials for their roots. The resultant Line Spectral Frequencies shown in the plot are overlaid on the smoothed digital frequency domain spectra to illustrate the relationship between Line Spectral Frequencies and Linear Predictive Coding transforms. A further explanation of the techniques associated with the Line Spectral Frequencies method is found in "Quantizer Design in LSP Speech Analysis and Synthesis" by Noboru Sugamura and Nariman Farvardin, September 1988, IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 398-401.
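The sum and difference construction can be sketched in Python. This sketch assumes the commonly used formulation in which the "conjugate" of A(z) is its coefficient-reversed, delayed copy z^-(p+1) A(1/z), so that P(z) = A(z) + z^-(p+1) A(1/z) and Q(z) = A(z) - z^-(p+1) A(1/z). The patent text does not spell out this form, and the function name lpc_to_lsf is hypothetical.

```python
import numpy as np

def lpc_to_lsf(a):
    """Convert LPC coefficients [1, a1, ..., ap] to p line spectral
    frequencies, in radians, sorted in ascending order over (0, pi)."""
    a = np.asarray(a, dtype=float)

    # Pad A(z) to degree p+1, then add and subtract its reversed copy.
    a_ext = np.concatenate([a, [0.0]])
    p_poly = a_ext + a_ext[::-1]  # sum (symmetric) polynomial
    q_poly = a_ext - a_ext[::-1]  # difference (antisymmetric) polynomial

    # For a stable A(z) the roots of both polynomials lie on the unit
    # circle, so each root is described by its angle alone.
    roots = np.concatenate([np.roots(p_poly), np.roots(q_poly)])
    angles = np.angle(roots)

    # Keep one angle per conjugate pair and drop the fixed roots at
    # z = +1 and z = -1, which carry no information about the speech.
    eps = 1e-6
    return np.sort(angles[(angles > eps) & (angles < np.pi - eps)])
```

For a single resonance with pole radius 0.95 at 1.0 radian, that is a = [1, -2(0.95)cos(1.0), 0.95^2], the two returned frequencies fall on either side of 1.0 radian, which is the bracketing behavior discussed with FIG. 4.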
In the Sugamura and Farvardin paper, the notation LSP meaning Line Spectrum Pairs is used to represent a narrower use of Line Spectral Frequencies. Finally, the Line Spectral Frequencies are compared using an error function to a predetermined set of Line Spectral Frequencies to determine an optimal solution of the stored template match 6. A technique which can be used for the comparison that is well known by those skilled in the art is discussed in "Speech Recognition Experiments with Linear Prediction, Bandpass Filtering, and Dynamic Programming" by George M. White and Richard B. Neely, April 1976, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-24, pp. 183-188. The result is the recognition of the spoken utterance "one" as the numeric value "1." In this example the numeric value "1" is represented as an ASCII (American Standard Code for Information Interchange) byte which can be interpreted by a digital computer.
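The template match 6 can be illustrated with a minimal Python sketch. It assumes a summed squared difference as the error function and a single Line Spectral Frequency vector per utterance; the description leaves the error function open and cites a dynamic programming technique for the comparison, which this sketch does not implement. The function name match_template and the template values in the usage note are made up for illustration.

```python
import numpy as np

def match_template(lsf, templates):
    """Return (label, error) for the stored template closest to lsf.

    templates maps a label to a reference vector of the same length
    as lsf. Assumed error function: sum of squared differences.
    """
    probe = np.asarray(lsf, dtype=float)
    best_label, best_err = None, float("inf")
    for label, ref in templates.items():
        err = float(np.sum((probe - np.asarray(ref, dtype=float)) ** 2))
        if err < best_err:
            best_label, best_err = label, err
    return best_label, best_err
```

As a usage example, with templates = {"1": [0.3, 0.6, 1.2, 1.9], "2": [0.5, 0.9, 1.5, 2.4]}, the call match_template([0.32, 0.58, 1.25, 1.85], templates) selects the label "1", whose ASCII byte could then be passed to a digital computer as the description suggests.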
Referring to FIG. 2, the graph shows a time domain plot of the amplitude variations of a spoken word. The complex amplitude, frequency, and harmonic content of the utterance can be seen in FIG. 2. This time domain data is captured using an acoustic transducer and converted to an equivalent discrete representation by an analog to digital converter as previously shown in FIG. 1.
Referring to FIG. 3, the graph was generated by plotting the frequency response of the digital filter transfer function using the Linear Predictive Coding coefficients derived from the discrete time domain data in FIG. 2. This figure clearly shows the areas of peak average spectral energy and the smoothing effect of the Linear Predictive Coding transform on the spectrum. Note that in order to identify the formants (areas of peak average spectral energy) a continuous function plot must be made and then the plot must be searched for areas where the inflection changes from positive to negative. This is an error prone process in areas near the peaks of the curve where the slope approaches zero.
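The peak search that the Linear Predictive Coding spectrum requires can be sketched in Python for comparison. The sketch samples the magnitude response of 1/A(z) on a uniform grid and reports the grid points where the slope changes from positive to non-positive. The grid size, the decibel scale, and the name lpc_spectrum_peaks are assumptions made for illustration.

```python
import numpy as np

def lpc_spectrum_peaks(a, n_points=512):
    """Locate peaks of |1 / A(e^{jw})| for w in [0, pi].

    Returns (peak_frequencies_in_radians, magnitude_in_dB).
    """
    a = np.asarray(a, dtype=float)
    w = np.linspace(0.0, np.pi, n_points)

    # Evaluate A(e^{jw}) = sum_k a[k] * e^{-jwk} at every grid point.
    k = np.arange(len(a))
    a_of_w = np.exp(-1j * np.outer(w, k)) @ a
    mag_db = -20.0 * np.log10(np.abs(a_of_w))

    # A peak is a grid point where the slope turns from rising to falling.
    slope = np.diff(mag_db)
    peak_idx = np.where((slope[:-1] > 0) & (slope[1:] <= 0))[0] + 1
    return w[peak_idx], mag_db
```

The result depends on the grid spacing, and near a broad peak the slope values are small, so noise in the coefficients can move or split the detected peak. This is the sensitivity the paragraph above describes.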
Referring to FIG. 4, the graph shows the Line Spectral Frequencies overlaid on the smoothed frequency domain plot from FIG. 3. Because the Line Spectral Frequencies are discrete, the amount of information that must be examined for recognition is decreased relative to Linear Predictive Coding methods. The relationship between the areas of peak spectral energy and the locations of the Line Spectral Frequencies is shown. As can be seen on the overlay, the Line Spectral Frequencies bracket the formants, positively identifying the formants' positioning without the need for a search. Compared with Linear Predictive Coding methods, using Line Spectral Frequencies to identify the formants' positioning greatly reduces the degree to which the more variable non-peak areas degrade the recognition accuracy of an utterance.
By now it should be appreciated that there has been provided a more efficient and accurate method for the recognition of speech which uses Line Spectral Frequencies.

Claims

1. A method of recognizing speech comprising the steps of: deriving a transfer function from said speech; transforming the coefficients of said transfer function into representative line spectral frequencies; and comparing said representative line spectral frequencies to predetermined line spectral frequencies.
2. A method according to claim 1 wherein said deriving step comprises the step of mapping said speech into said transfer function, wherein said transfer function is a polynomial having coefficients derived using linear predictive coding methods.
3. The method according to claim 1 wherein said transforming step comprises the step of forming even and odd symmetric polynomials, an even polynomial comprised of the sum of said transfer function with the conjugate of said transfer function and an odd polynomial comprised of the difference of said transfer function with the conjugate of said transfer function.
4. The method according to claim 3 wherein said transforming step further comprises the step of finding the roots of said even and odd symmetric polynomials wherein said roots correspond to said representative line spectral frequencies.
5. The method according to claim 1 wherein said comparing step comprises the steps of: comparing said representative line spectral frequencies to said predetermined line spectral frequencies using an error function; and determining the optimum selection by choosing said predetermined line spectral frequencies corresponding to the optimum value of said error function associated with said representative line spectral frequencies.
6. An apparatus for recognizing speech comprising: means for deriving a transfer function from said speech; means for transforming the coefficients of said transfer function into representative line spectral frequencies; and means for comparing said representative line spectral frequencies to predetermined line spectral frequencies.
7. An apparatus according to claim 6 wherein said means for deriving comprises mapping said speech into said transfer function, wherein said transfer function is a polynomial having coefficients derived using linear predictive coding methods.
8. An apparatus according to claim 6 wherein said means for transforming comprises forming even and odd symmetric polynomials, the even polynomial comprised of the sum of said transfer function with the conjugate of said transfer function and the odd polynomial comprised of the difference of said transfer function with the conjugate of said transfer function.
9. An apparatus according to claim 8 wherein said means for transforming further comprises finding the roots of said even and odd symmetric polynomials wherein said roots correspond to said representative line spectral frequencies.
10. An apparatus according to claim 6 wherein said means for comparing comprises: means for comparing said representative line spectral frequencies to said predetermined line spectral frequencies using an error function; and means for determining the optimum selection by choosing said predetermined line spectral frequencies corresponding to the optimum value of said error function associated with said representative line spectral frequencies.
PCT/US1990/003844 1989-08-07 1990-07-09 Speech recognition using spectral line frequencies WO1991002348A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US38992589A 1989-08-07 1989-08-07
US389,925 1989-08-07

Publications (1)

Publication Number Publication Date
WO1991002348A1 true WO1991002348A1 (en) 1991-02-21

Family

ID=23540335

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1990/003844 WO1991002348A1 (en) 1989-08-07 1990-07-09 Speech recognition using spectral line frequencies

Country Status (2)

Country Link
AU (1) AU6070490A (en)
WO (1) WO1991002348A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5233659A (en) * 1991-01-14 1993-08-03 Telefonaktiebolaget L M Ericsson Method of quantizing line spectral frequencies when calculating filter parameters in a speech coder

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE4005591A1 (en) * 1990-02-22 1991-09-05 Behringwerke Ag THE HERBAL INHIBITING PEPTIDES, METHOD FOR THEIR PRODUCTION AND THEIR USE

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4519094A (en) * 1982-08-26 1985-05-21 At&T Bell Laboratories LPC Word recognizer utilizing energy features

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, issued September 1988, N. SUGAMURA et al., "Quantizer Design in LSP Speech Analysis and Synthesis", pages 398-401. *
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, Volume 1, issued 1988, (New York, New York, USA), K.K. PALIWAL, "Study of Line Spectrum Pair Frequencies for Speech Recognition". *

Also Published As

Publication number Publication date
AU6070490A (en) 1991-03-11

Similar Documents

Publication Publication Date Title
US5528725A (en) Method and apparatus for recognizing speech by using wavelet transform and transient response therefrom
White et al. Speech recognition experiments with linear prediction, bandpass filtering, and dynamic programming
US7756700B2 (en) Perceptual harmonic cepstral coefficients as the front-end for speech recognition
Vergin et al. Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition
EP0128755B1 (en) Apparatus for speech recognition
US6751595B2 (en) Multi-stage large vocabulary speech recognition system and method
Hernando et al. Linear prediction of the one-sided autocorrelation sequence for noisy speech recognition
US5459815A (en) Speech recognition method using time-frequency masking mechanism
KR0123934B1 (en) Low cost speech recognition system and method
US20050021330A1 (en) Speech recognition apparatus capable of improving recognition rate regardless of average duration of phonemes
Yapanel et al. A new perspective on feature extraction for robust in-vehicle speech recognition.
US4937871A (en) Speech recognition device
US4922539A (en) Method of encoding speech signals involving the extraction of speech formant candidates in real time
US20030036905A1 (en) Information detection apparatus and method, and information search apparatus and method
US6470311B1 (en) Method and apparatus for determining pitch synchronous frames
US5845092A (en) Endpoint detection in a stand-alone real-time voice recognition system
US5704004A (en) Apparatus and method for normalizing and categorizing linear prediction code vectors using Bayesian categorization technique
Christensen et al. A comparison of three methods of extracting resonance information from predictor-coefficient coded speech
JPS6366600A (en) Method and apparatus for obtaining normalized signal for subsequent processing by preprocessing of speaker's voice
JP2779325B2 (en) Pitch search time reduction method using pre-processing correlation equation in vocoder
JP3004023B2 (en) Voice recognition device
JP3354252B2 (en) Voice recognition device
WO1991002348A1 (en) Speech recognition using spectral line frequencies
CN114550741A (en) Semantic recognition method and system
Hernando Pericás et al. A comparative study of parameters and distances for noisy speech recognition

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU CA FI JP KR NO

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB IT LU NL SE

NENP Non-entry into the national phase

Ref country code: CA