WO1991002348A1

WO1991002348A1 - Speech recognition using spectral line frequencies

Info

Publication number: WO1991002348A1
Application number: PCT/US1990/003844
Authority: WO
Inventors: Clifford Allan Wood; Morris Anthony Moore; James Michael Keba
Original assignee: Motorola, Inc.
Priority date: 1989-08-07
Filing date: 1990-07-09
Publication date: 1991-02-21
Also published as: AU6070490A

Abstract

A method for recognizing speech mathematically transforms a particular set of Linear Predictive Coding (4) coefficients derived from a spoken utterance (2) into unique Line Spectral Frequencies (5). Recognition is accomplished by optimally matching (6) the unique Line Spectral Frequencies to one of a set of predetermined Line Spectral Frequencies representing spoken utterances.

Description

SPEECH RECOGNITION USING SPECTRAL LINE FREQUENCIES

Field of the Invention

This invention relates in general to the processing of speech and more particularly to the recognition of speech.

Background of the Invention

The processing of speech for purposes of characterizing the vocal tract to achieve more efficient storage and transmission of speech is well known. One method for coding speech is using the Linear Predictive Coding model. Using Linear Predictive Coding methods, a stable all pole transfer function that linearly predicts future outputs based on the history of inputs can be derived representing the vocal tract.

Linear Predictive Coding coefficients can be used to recognize speech. Since the Linear Predictive Coding speech model is a digital filter, each utterance transformed using Linear Predictive Coding methods results in a unique set of filter coefficients. Each set of Linear Predictive Coding coefficients can be matched to templates containing a set of Linear Predictive Coding coefficients representing target utterances. To create the templates used in speech recognition, the speech recognition system must be trained. A speech recognition system is trained by compiling a large sample of utterances or words spoken by one or more individuals and statistically forming templates representing the characteristic set of Linear Predictive Coding coefficients of each utterance. This method of using Linear Predictive Coding coefficients to recognize speech works well in a system where the recognition hardware has been trained to the average user's voice characteristics. However, the percentage of correctly recognized utterances or words drops dramatically when the Linear Predictive Coding speech recognition system encounters an individual with a foreign accent, nasality, or an uncharacterized dialect. This drop in recognition score is caused by variations in amplitude, timing, and pitch between the same utterances spoken by different individuals or the same individual at different times.

In an attempt to produce a more efficient method for coding speech, the Line Spectral Frequencies speech method, hereafter referred to as the Line Spectral Frequencies method, was developed. The Line Spectral Frequencies method transforms Linear Predictive Coding coefficients to create a lossless digital filter represented by a polynomial transfer function which has roots lying on the unit circle. Previously, the use of Line Spectral Frequencies for speech analysis was limited to the coding and decoding of speech.

In Linear Predictive Coding speech recognition, some of the information derived in the transform is not useful because recognition of an utterance requires primarily the identification of the spectral peaks, hereafter referred to as formants. Linear Predictive Coding speech recognition methods make a decision based on a larger, possibly more variable, volume of data because the information content of an utterance represented by Linear Predictive Coding coefficients is greater than the same utterance represented by Line Spectral Frequencies. This larger, more variable volume of data comprised within the spectrum represented by the Linear Predictive Coding coefficient filter increases the probability of an error during recognition. Thus, what is needed is a more efficient and accurate method for the recognition of speech which uses Line Spectral Frequencies.

Summary of the Invention

Accordingly, it is an object of the present invention to provide an improved method for the recognition of speech.

Another object of the invention is to provide for the recognition of speech where a particular utterance may not exactly match the same utterance spoken by the same or another person.

In carrying out the above and other objects of the invention in one form, there is provided a method for recognizing speech comprising the steps of deriving a transfer function from the speech, transforming the coefficients of the transfer function into representative line spectral frequencies, and matching the representative line spectral frequencies to predetermined line spectral frequencies. The above and other objects, features, and advantages of the present invention will be better understood from the following detailed description taken in conjunction with the accompanying drawings.

Brief Description of the Drawings

FIG. 1 is a block diagram of a particular embodiment of a Line Spectral Frequency speech recognition system.

FIG. 2 is a graph of the amplitude variations of a spoken word plotted in the time domain.

FIG. 3 is a graph of the smoothed Linear Predictive Coding filter data derived from FIG. 2.

FIG. 4 is a graph of the smoothed Linear Predictive Coding filter data from FIG. 3 and the Line Spectral Frequencies derived from the Linear Predictive Coding coefficients used to plot FIG. 3.

Description of a Preferred Embodiment

Referring to FIG. 1, the block diagram shows a Line Spectral Frequency speech recognition system. In this example the person 1 says the word "one" representing the numeric digit "1". This acoustic energy comprised within the analog time domain representation of the utterance is quantized by coupling the output of a pickup transducer such as a microphone 2 to an analog to digital converter 3. The discrete representation of the utterance is then converted by a Linear Predictive Coding converter 4 into a digital filter transfer function characterized by Linear Predictive Coding coefficients. The digital filter transfer function is comprised of a polynomial having constant coefficients which are derived by mapping the discrete time domain data into the frequency domain using linear predictive coding methods. The digital frequency domain plot shown represents the Linear Predictive Coding frequency spectra derived from the discrete Linear Predictive Coding coefficients. A detailed explanation of the Linear Predictive Coding method is described in the publication "Voice And Speech Processing" by Thomas W. Parsons, copyright 1987, McGraw-Hill, Inc., pp. 136-166.
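By way of illustration only, the conversion performed by the Linear Predictive Coding converter 4 might be sketched as follows. The sketch assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common textbook way of deriving the coefficients and is not stated in this description to be the method used. The function names `autocorrelation` and `lpc_coefficients` and the test frame are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return the autocorrelation values r[0..max_lag] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc_coefficients(frame, order):
    """Levinson-Durbin recursion (assumed method, see lead-in).

    Returns [1, a1, ..., ap] for A(z) = 1 + a1*z^-1 + ... + ap*z^-p,
    so that the all-pole model of the vocal tract is H(z) = G / A(z).
    """
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]  # prediction error energy; a silent frame (r[0] == 0) is not handled
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient for this stage
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


# Hypothetical test frame: a decaying exponential, which a first-order
# predictor models almost exactly (a1 comes out close to -0.9).
frame = [0.9 ** n for n in range(200)]
coeffs = lpc_coefficients(frame, 1)
```

In practice the digitized utterance would be split into short frames and each frame would be windowed before this step. Those details are omitted here.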

The output of the Linear Predictive Coding converter 4 is converted to Line Spectral Frequencies by the Line Spectral Frequency converter 5. To find the Line Spectral Frequencies, the Linear Predictive Coding coefficients are transformed using a symmetric pair of polynomials. The polynomials are comprised of an even polynomial representing the sum of the Linear Predictive Coding transfer function and its conjugate, and an odd polynomial representing the difference of the Linear Predictive Coding transfer function and its conjugate. The Line Spectral Frequencies are then found by solving the polynomials for their roots. The resultant Line Spectral Frequencies shown in the plot are overlaid on the smoothed digital frequency domain spectra to illustrate the relationship between Line Spectral Frequencies and Linear Predictive Coding transforms. A further explanation of the techniques associated with the Line Spectral Frequencies method is found in "Quantizer Design in LSP Speech Analysis and Synthesis" by Noboru Sugamura and Nariman Farvardin, September 1988, IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 398-401.
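The transformation carried out by the Line Spectral Frequency converter 5 might be sketched as follows, under stated assumptions. The sketch takes the "conjugate" to be the time-reversed polynomial z^-(p+1) A(1/z), as in the usual textbook formulation, so that the sum polynomial is P(z) = A(z) + z^-(p+1) A(1/z) and the difference polynomial is Q(z) = A(z) - z^-(p+1) A(1/z). Because the roots lie on the unit circle, they are located here by scanning the circle for sign changes and refining by bisection. The function name `lsf_from_lpc` and the grid size are hypothetical choices.

```python
import math


def lsf_from_lpc(a, grid=2048):
    """a = [1, a1, ..., ap]; returns the p line spectral frequencies
    in radians (0 < w < pi), in ascending order."""
    m = len(a)                       # m = p + 1, the degree of P(z) and Q(z)
    ext = list(a) + [0.0]            # A(z) padded to degree p + 1
    rev = ext[::-1]                  # coefficients of z^-(p+1) * A(1/z)
    p_poly = [x + y for x, y in zip(ext, rev)]   # symmetric (sum) polynomial
    q_poly = [x - y for x, y in zip(ext, rev)]   # antisymmetric (difference) polynomial

    # On the unit circle each polynomial equals a linear-phase factor times
    # a real function of w; the zeros of that real function are the roots.
    def f_sum(w):
        return sum(c * math.cos(w * (k - m / 2)) for k, c in enumerate(p_poly))

    def f_diff(w):
        return sum(c * math.sin(w * (k - m / 2)) for k, c in enumerate(q_poly))

    def roots(f):
        found = []
        step = math.pi / grid
        # Sample strictly inside (0, pi) so the trivial roots at z = +1
        # and z = -1 are never picked up.
        ws = [(i + 0.5) * step for i in range(grid)]
        for lo, hi in zip(ws, ws[1:]):
            f_lo = f(lo)
            if f_lo * f(hi) < 0:
                for _ in range(50):           # bisection refinement
                    mid = (lo + hi) / 2
                    f_mid = f(mid)
                    if f_lo * f_mid <= 0:
                        hi = mid
                    else:
                        lo, f_lo = mid, f_mid
                found.append((lo + hi) / 2)
        return found

    return sorted(roots(f_sum) + roots(f_diff))


# Hypothetical second-order filter with poles at 0.5 +/- 0.5j (angle pi/4).
lsf = lsf_from_lpc([1.0, -1.0, 0.5])
```

For this second-order example the two frequencies can be checked by hand: they are arccos(0.75) and arccos(0.25), which lie on either side of the pole angle of pi/4.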

In the Sugamura and Farvardin paper, the notation LSP meaning Line Spectrum Pairs is used to represent a narrower use of Line Spectral Frequencies. Finally, the Line Spectral Frequencies are compared using an error function to a predetermined set of Line Spectral Frequencies to determine an optimal solution of the stored template match 6. A technique which can be used for the comparison that is well known by those skilled in the art is discussed in "Speech Recognition Experiments with Linear Prediction, Bandpass Filtering, and Dynamic Programming" by George M. White and Richard B. Neely, April 1976, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-24, pp. 183-188. The result is the recognition of the spoken utterance "one" as the numeric value "1." In this example the numeric value "1" is represented as an ASCII (American Standard Code for Information Interchange) byte which can be interpreted by a digital computer.
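The stored template match 6 might be sketched as follows. The description does not specify the error function, so a sum of squared differences is assumed here. The sketch also compares a single set of Line Spectral Frequencies per utterance, whereas a practical recognizer would compare sequences of frames, for example with the dynamic programming alignment discussed in the White and Neely reference. The names `squared_error` and `recognize` and all template values are hypothetical.

```python
def squared_error(lsf_a, lsf_b):
    """Assumed error function: sum of squared frequency differences."""
    return sum((x - y) ** 2 for x, y in zip(lsf_a, lsf_b))


def recognize(lsf, templates):
    """Return the label of the stored template with the smallest error.

    templates maps an ASCII byte (the recognized value) to a
    predetermined list of line spectral frequencies.
    """
    return min(templates, key=lambda label: squared_error(lsf, templates[label]))


# Hypothetical predetermined templates for two utterances.
templates = {
    b"1": [0.30, 0.50, 1.20, 1.40],
    b"2": [0.60, 0.90, 1.80, 2.20],
}
result = recognize([0.32, 0.48, 1.25, 1.38], templates)
```

Here `result` is the ASCII byte for "1", since the input frequencies are closer to the first template than to the second.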

Referring to FIG. 2, the graph shows a time domain plot of the amplitude variations of a spoken word. The complex amplitude, frequency, and harmonic content of the utterance can be seen in FIG. 2. This time domain data is captured using an acoustic transducer and converted to an equivalent discrete representation by an analog to digital converter as previously shown in FIG. 1.

Referring to FIG. 3, the graph was generated by plotting the frequency response of the digital filter transfer function using the Linear Predictive Coding coefficients derived from the discrete time domain data in FIG. 2. This figure clearly shows the areas of peak average spectral energy and the smoothing effect of the Linear Predictive Coding transform on the spectrum. Note that in order to identify the formants (areas of peak average spectral energy) a continuous function plot must be made and then the plot must be searched for areas where the slope changes from positive to negative. This is an error-prone process in areas near the peaks of the curve where the slope approaches zero.
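The search described for FIG. 3 might be sketched as follows: the magnitude response of the all-pole filter 1/A(z) is evaluated on a grid of frequencies, and the grid is scanned for points where the response stops rising and starts falling. The function names `lpc_spectrum` and `pick_peaks`, the grid size, and the example filter are hypothetical.

```python
import cmath
import math


def lpc_spectrum(a, n_points=512):
    """Magnitude of 1/A(e^{jw}) at n_points frequencies in [0, pi)."""
    mags = []
    for i in range(n_points):
        w = math.pi * i / n_points
        denom = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(a))
        mags.append(1.0 / abs(denom))
    return mags


def pick_peaks(mags):
    """Indices where the sampled curve changes from rising to falling."""
    return [i for i in range(1, len(mags) - 1)
            if mags[i - 1] < mags[i] >= mags[i + 1]]


# Hypothetical single-resonance filter with poles at 0.95 * e^{+/- j*pi/4}.
a = [1.0, -2 * 0.95 * math.cos(math.pi / 4), 0.95 ** 2]
peaks = pick_peaks(lpc_spectrum(a))
```

The result depends on the grid resolution, and on a broad, flat peak neighboring samples differ very little, which is the weakness the paragraph above points out.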

Referring to FIG. 4, the graph shows the Line Spectral Frequencies overlaid on the smoothed frequency domain plot from FIG. 3. Because the Line Spectral Frequencies are discrete, the amount of information that must be examined for recognition is decreased relative to Linear Predictive Coding methods. The relationship between the areas of peak spectral energy and the locations of the Line Spectral Frequencies is shown. As can be seen on the overlay, the Line Spectral Frequencies bracket the formants, positively identifying the positions of the formants without the need for a search. Compared with Linear Predictive Coding methods, using Line Spectral Frequencies to identify the positions of the formants greatly reduces the degree to which the more variable non-peak areas degrade the recognition accuracy of an utterance.
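One hedged way to picture the bracketing shown in FIG. 4 is the following sketch, which treats two adjacent Line Spectral Frequencies that lie close together as bracketing a formant and reports their midpoint. This pairing rule and the gap threshold are assumptions made for illustration and are not taken from this description. The function name `formant_estimates` and the example frequencies are hypothetical.

```python
def formant_estimates(lsf, max_gap):
    """Midpoints of adjacent line spectral frequencies closer than max_gap.

    lsf must be sorted in ascending order (radians or Hz, as long as
    max_gap uses the same unit).
    """
    return [(lo + hi) / 2 for lo, hi in zip(lsf, lsf[1:]) if hi - lo < max_gap]


# Hypothetical frequencies in radians: two tight pairs and two isolated lines.
estimates = formant_estimates([0.30, 0.36, 0.90, 1.50, 1.55, 2.40], max_gap=0.1)
```

No spectrum has to be plotted or searched here. The estimate is a direct comparison of neighboring values in a short list.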

By now it should be appreciated that there has been provided a more efficient and accurate method for the recognition of speech which uses Line Spectral Frequencies.

Claims

1. A method of recognizing speech comprising the steps of: deriving a transfer function from said speech; transforming the coefficients of said transfer function into representative line spectral frequencies; and comparing said representative line spectral frequencies to predetermined line spectral frequencies.

2. A method according to claim 1 wherein said deriving step comprises the step of mapping said speech into said transfer function, wherein said transfer function is a polynomial having coefficients derived using linear predictive coding methods.

3. The method according to claim 1 wherein said transforming step comprises the step of forming even and odd symmetric polynomials, an even polynomial comprised of the sum of said transfer function with the conjugate of said transfer function and an odd polynomial comprised of the difference of said transfer function with the conjugate of said transfer function.

4. The method according to claim 3 wherein said transforming step further comprises the step of finding the roots of said even and odd symmetric polynomials wherein said roots correspond to said representative line spectral frequencies.

5. The method according to claim 1 wherein said comparing step comprises the steps of: comparing said representative line spectral frequencies to said predetermined line spectral frequencies using an error function; and determining the optimum selection by choosing said predetermined line spectral frequencies corresponding to the optimum value of said error function associated with said representative line spectral frequencies.

6. An apparatus for recognizing speech comprising: means for deriving a transfer function from said speech; means for transforming the coefficients of said transfer function into representative line spectral frequencies; and means for comparing said representative line spectral frequencies to predetermined line spectral frequencies.

7. An apparatus according to claim 6 wherein said means for deriving comprises mapping said speech into said transfer function, wherein said transfer function is a polynomial having coefficients derived using linear predictive coding methods.

8. An apparatus according to claim 6 wherein said means for transforming comprises forming even and odd symmetric polynomials, the even polynomial comprised of the sum of said transfer function with the conjugate of said transfer function and the odd polynomial comprised of the difference of said transfer function with the conjugate of said transfer function.

9. An apparatus according to claim 8 wherein said means for transforming further comprises finding the roots of said even and odd symmetric polynomials wherein said roots correspond to said representative line spectral frequencies.

10. An apparatus according to claim 6 wherein said means for comparing comprises: means for comparing said representative line spectral frequencies to said predetermined line spectral frequencies using an error function; and means for determining the optimum selection by choosing said predetermined line spectral frequencies corresponding to the optimum value of said error function associated with said representative line spectral frequencies.