WO1996010248A1 - A system and method for determining the tone of a syllable of mandarin chinese speech - Google Patents
- Publication number
- WO1996010248A1 (PCT/US1995/012595)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- language
- tone
- syllable
- coefficients
- coefficient
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/15—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information
Definitions
- the present invention also relates to pending U.S. Patent Application Serial No.: , filed , 1994, invented by Hsiao-Wuen Hon and Bao-Sheng Yuan, entitled “A System And Method For Recognizing A Tonal Language,” which is incorporated herein by reference.
- the present invention also relates to pending U.S. Patent Application Serial No.: , filed , 1994, invented by Hsiao-Wuen Hon, Yen-Lu Chow, and Kai-Fu Lee entitled "Continuous Mandarin Chinese Speech Recognition System Having An Integrated Tone Classifier," which is incorporated herein by reference.
- the present invention relates generally to speech recognition systems.
- the present invention relates to a system and method for determining the tone of a syllable of speech.
- the present invention relates to a system and method for identifying the tone of a syllable of Mandarin Chinese speech.
- Mandarin Chinese is a tonal syllabic language where a syllable and tone combine to define the meaning of an utterance.
- Exemplary waveforms for electrical signals representing the tones of Mandarin Chinese are shown in Figures 1A, 1B, 1C, 1D, and 1E.
- the high and level tone, the rising tone, the falling-rising tone, the falling tone, and the neutral tone will each be referred to generically as a language tone in this application.
- the tone is the behavior of the fundamental frequency (pitch) of the audio signal.
- the tone of a syllable affects the meaning of the syllable. Syllables with the same phonetic structure but different tones usually have significantly different meanings and correspond to different written characters. Thus, to recognize accurately an audio signal of Mandarin Chinese, a speech recognition system must recognize both the syllable and the tone of the syllable.
- the present invention overcomes the deficiencies and limitations of the prior art with a system and method for analyzing the tone of a syllable of Mandarin Chinese speech.
- the novel tone recognition system of the present invention comprises a coefficient determinator, a coefficient modeler, and a comparator.
- the coefficient determinator receives a digitized signal of a syllable from the A/D converter.
- the coefficient determinator estimates the pitch of the syllable and describes the pitch by using a least squares method to fit a second order polynomial to the pitch.
- the coefficient determinator can quickly and easily describe the pitch using a second order polynomial.
- the second order polynomial is preferred because of its low computational requirements.
- the output of the coefficient determinator is coupled to an input of the comparator. Through this coupling, the coefficient determinator provides a signal representing the polynomial to the comparator.
- a second input of the comparator is coupled to an output of the coefficient modeler. Through this coupling the comparator receives models of the language tones of the tonal language.
- a model of a language tone comprises expected coefficients for a second order polynomial that describes the language tone, an expected duration of syllables having the language tone, and a co-variance of the coefficients and duration.
- the comparator compares the polynomial, which describes the syllable to the models, determines the model that most closely matches the syllable, and generates a signal indicating the matching language tone.
- the present invention also includes a system and method for training a coefficient modeler.
- the system for training comprises a model generator, a coefficient A training memory, a coefficient B training memory, a coefficient C training memory, and a duration T training memory.
- the model generator receives a large number of uttered syllables all known to have the same language tone.
- the model generator estimates the pitch of each syllable, fits a second order polynomial to each syllable, and stores the resulting coefficients and duration in the appropriate memories.
- the model generator determines the arithmetic mean and co-variance for the coefficients and duration and generates a signal for the coefficient modeler to accept the model.
- Figures 1A, 1B, 1C, 1D and 1E are graphical representations of electrical signals having the behavior of the 4 lexical tones and one neutral tone of Mandarin Chinese speech;
- Figure 2 is a block diagram of a first embodiment of a speech recognition system that includes a tone recognition system according to the present invention;
- Figure 3 is a block diagram of an exemplary embodiment of a system for training a coefficient modeler;
- Figure 4 is a block diagram of a second embodiment of a system for recognizing syllables of Mandarin Chinese speech that includes the tone recognition system of the present invention;
- Figure 5 is a block diagram of the speech recognition system of the second embodiment of a system for recognizing syllables of Mandarin Chinese speech;
- Figures 6A and 6B are flowcharts showing the preferred method for training the tone recognition system of the present invention to identify the tone of a syllable of Mandarin Chinese speech;
- Figure 7 is a flowchart showing the preferred method for identifying the tone of a syllable of Mandarin Chinese speech.
- Referring now to Figures 1A, 1B, 1C, 1D, and 1E, graphical representations of electrical signals that represent the 4 lexical tones and 1 neutral tone of Mandarin Chinese speech are shown.
- the signals shown in Figures 1A, 1B, 1C, 1D, and 1E originate as audio signals generated by a human speaker.
- a speech recognition system receives the audio signals through a microphone.
- the microphone converts an audio signal to an electrical signal.
- Each syllable of Mandarin Chinese speech has a tone.
- The tone is the behavior of the pitch contour, the fundamental frequency contour, of the audio signal of the syllable. The tone is preserved when the audio signal is converted into an electrical signal.
- the tone of a syllable may refer to the behavior of the fundamental frequency of the audio signal or the electrical signal into which the audio signal is converted.
- Figures 1A, 1B, 1C, 1D, and 1E show the tones of Mandarin Chinese speech as fundamental frequencies plotted as a function of time.
- Figure 1A shows a first tone signal 40 having a high and level tone.
- Figure 1B shows a second tone signal 42 having a rising tone.
- a third tone signal 44, shown in Figure 1C, has a falling-rising behavior.
- Figure 1D shows a fourth tone signal 46 having a behavior of the last lexical tone of Mandarin Chinese speech, the falling tone.
- Figure 1E shows a fifth tone signal having the neutral tone.
- the system 10 comprises a microphone 76, an analog-to-digital converter 78, a segmenter 79, a syllable recognition system 80, and a tone recognition system 82.
- the microphone 76 is a conventional microphone for converting analog audio signals into analog electrical signals.
- the microphone 76 has an input for receiving audio signals and an output for outputting electrical signals.
- the output is coupled to a line 88.
- the analog-to-digital converter (“A/D converter”) 78 is a conventional analog-to-digital converter for converting analog, electrical signals to digital, electrical signals.
- the A/D converter 78 has an input, which is coupled to line 88, for receiving an analog, electrical signal and an output for outputting a digital, electrical signal.
- the output of the A/D converter 78 is coupled to a line 90.
- the segmenter 79 is a conventional segmenter for parsing a continuous input into single syllables of Mandarin Chinese speech.
- the segmenter 79 generates output signals of single syllables.
- the segmenter 79 has an input coupled to line 90 and an output coupled to a line 91.
- the syllable recognition system 80 is a system for determining the phonetic structure of a syllable of a tonal language and for combining the identified phonetic structure with an identification of the language tone of the syllable to recognize the syllable.
- the syllable recognition system 80 has a first input, coupled to line 91, for receiving a syllable from the segmenter 79, a second input, coupled to a line 96, for receiving a signal that identifies the language tone of the syllable from the tone recognition system 82, and an output, coupled to a line 98, for outputting a signal representing the recognized syllable.
- the syllable recognition system 80 uses a hidden Markov model and state analysis to identify the phonetic structure of the syllable. The syllable recognition system 80 combines the identification of the phonetic structure with an identification of the language tone of the syllable to generate a signal indicative of the complete meaning of the syllable.
- the syllable recognition system 80 is preferably an Apple PlainTalk Chinese Syllable Recognition System (the "PlainTalk System") from Apple Computer, Inc. of Cupertino, California modified to utilize long-term tonal analysis.
- the PlainTalk System has a first system for determining the phonetic structure of a syllable, a second system for performing short-term tonal analysis on the syllable to identify the language tone of the syllable, and a third system for combining the output of the first system and the second system into a recognition.
- the output of the third system forms an output of the PlainTalk System.
- the PlainTalk System may be modified to utilize long-term tonal analysis by removing the second system and coupling line 96 to an input to the third system so that the third system combines the output of the first system, for determining the phonetic structure of the syllable, and the output of the tone recognition system 82 into an output of the syllable recognition system 80.
- the syllable recognition system 80 may alternately be a syllable based voice typewriter from Star Incorporated of the People's Republic of China similarly modified as described in connection with the preferred PlainTalk System.
- the tone recognition system 82 comprises a coefficient determinator 84, a comparator 86, and a coefficient modeler 38.
- the coefficient determinator 84 has an input coupled to line 91 to receive a syllable from the segmenter 79 and an output coupled to a line 92.
- the coefficient modeler 38 is constructed by the trainer 34, shown in Figure 3, and has a first input and an output, both of which are coupled to line 94.
- the comparator 86 has a first input, coupled to line 92 to receive the output of the coefficient determinator 84, a second input, coupled to line 94 to receive models from the coefficient modeler 38, a first output, and a second output, coupled to line 94, through which the comparator 86 signals for the coefficient modeler 38 to transmit the models.
- the output of the tone recognition system 82 is formed by the first output of the comparator 86 and is coupled to line 96 so that the identification of the language tone is transmitted to the second input of the syllable recognition system 80.
- the coefficient determinator 84 receives a syllable from the segmenter 79 on line 91.
- the coefficient determinator 84 estimates the fundamental frequency, the pitch, of the syllable using Fourier analysis. Alternatively, the coefficient determinator 84 may use a low pass filter or other means to determine the pitch of the input syllable.
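For illustration, pitch estimation can be sketched with a simple autocorrelation method, one of the "other means" the text allows for; the function and parameter names below are hypothetical and not taken from the patent:

```python
import numpy as np

def estimate_pitch(frame, sample_rate=16000, f0_min=60.0, f0_max=400.0):
    """Estimate the fundamental frequency (Hz) of one speech frame by
    picking the strongest autocorrelation peak within the plausible
    pitch range. A stand-in sketch for the Fourier analysis or
    low pass filtering mentioned in the text."""
    frame = frame - np.mean(frame)
    # Autocorrelation for non-negative lags 0 .. len(frame)-1
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sample_rate / lag

# Example: a 200 Hz sinusoid should yield an estimate near 200 Hz
sr = 16000
t = np.arange(int(0.03 * sr)) / sr        # one 30 ms frame
f0 = estimate_pitch(np.sin(2 * np.pi * 200.0 * t), sr)
```

In practice the estimator would be run frame by frame over the syllable to produce the pitch contour that the polynomial is later fitted to.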
- the coefficient determinator 84 adjusts the coefficients of a second order polynomial to match closely the pitch contour.
- the coefficient determinator 84 uses a least squares curve fitting method to adjust the coefficients. Once the coefficients best describe the pitch contour, the coefficient determinator 84 generates a signal to transfer the coefficients along with the duration of the input signal, grouped together in a 4 dimensional vector, to the comparator 86.
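The least squares curve fit described above can be sketched with an off-the-shelf polynomial fit; this is a minimal illustration using NumPy, and the function and variable names are ours, not the patent's:

```python
import numpy as np

def fit_pitch_polynomial(pitch_contour, frame_period=0.01):
    """Fit F0(t) = A*t^2 + B*t + C to a pitch contour by least squares.

    pitch_contour: fundamental-frequency estimates (Hz), one per frame.
    frame_period: seconds between frames (assumed 10 ms here).
    Returns the (A, B, C) coefficients and the duration T in seconds,
    i.e. the 4 dimensional vector passed to the comparator.
    """
    t = np.arange(len(pitch_contour)) * frame_period
    # np.polyfit returns the highest-order coefficient first: [A, B, C]
    A, B, C = np.polyfit(t, pitch_contour, deg=2)
    T = len(pitch_contour) * frame_period
    return A, B, C, T

# Example: a contour rising roughly linearly from 120 Hz to 180 Hz
contour = np.linspace(120.0, 180.0, 30)
A, B, C, T = fit_pitch_polynomial(contour)
```

For a linear contour like this the quadratic term A comes out near zero and C near the starting pitch, which matches the intuition that the fit describes the overall shape of the contour rather than every local fluctuation.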
- the coefficient determinator 84 can quickly and accurately describe the tone of the syllable and thus, facilitate a quick and accurate identification of the language tone of the syllable.
- the coefficient modeler 38 has a first input and a second input that is coupled to a line 74, shown in Figure 3, and an output. The first input and the output are coupled to a line 94.
- the coefficient modeler 38 comprises a tone 1 model memory 58, a tone 2 model memory 60, a tone 3 model memory 62, a tone 4 model memory 64, and a tone 5 model memory 66.
- the memories 58, 60, 62, 64, 66 are shown in Figure 3.
- Each of the tone model memories 58, 60, 62, 64, 66 stores a model of a tone of Mandarin Chinese speech.
- the tone 1 model memory 58 may store a model of the high and level tone
- the tone 2 model memory 60 may store a model of the rising tone
- the tone 3 model memory 62 may store a model of the falling-rising tone
- the tone 4 model memory 64 may store a model of the falling tone
- the tone 5 model memory 66 may store a model of the neutral tone.
- Each model comprises an A coefficient, a B coefficient and a C coefficient of a polynomial of the form:

  F0(t) = A·t² + B·t + C

  where F0 is the fundamental frequency and t is time.
- Each model also includes a time duration T.
- the A coefficient, B coefficient, C coefficient, and duration T are the expected coefficients and duration for the language tone.
- the A, B, and C coefficients and duration T are grouped together into a 4 dimensional vector.
- the language tones are preferably modeled using a second order polynomial because of the simplicity and low computational overhead of the second order polynomial. As can be seen from Figures 1A, 1B, 1C, 1D, and 1E, a second order polynomial is adequate to model accurately the pitch contour of each language tone.
- a model includes an expected duration because the expected coefficients alone do not provide sufficient information to determine accurately the language tone of a syllable.
- the coefficients with the duration are sufficient to determine accurately the language tone of a syllable.
- the present invention overcomes the limitations of the prior art by using a second order equation to model the pitch contour and thereby quickly and accurately determine the language tone of an utterance of Mandarin Chinese speech. While the present invention has been described with respect to five tones, those skilled in the art will appreciate that the number of training data memories 22, 24, 26, 28, 30, shown in Figure 3, and the number of tone model memories 58, 60, 62, 64, 66 may be increased or decreased according to the number of tones in the tonal language being evaluated.
- Each model also includes a co-variance matrix for the A coefficient, the B coefficient, the C coefficient and the duration T.
- the co-variance matrix is a 4x4 matri which describes the dependence of each of the A, B, and C coefficients and the duration T on each other.
- the model may include a data structure for storing the variance of the A, B, and C coefficients and duration T.
- the A, B, and C coefficients and duration T are independent.
- storing the variances is equivalent to using a diagonal co-variance matrix. That is, the diagonal entries of the co-variance matrix describe the variances and all other entries, the entries off the diagonal, are 0.
- the variance is a sub-set of the co-variance and is therefore, included within the co-variance. Those skilled in the art will recognize the variance and its relation to the co-variance.
- if a higher order polynomial is used, the co-variance matrix is enlarged to accommodate the additional coefficients. For example, if a fifth order polynomial were used, the co-variance matrix would be a 6x6 matrix.
- the present invention can also model the tone of utterances of tonal languages other than Mandarin Chinese.
- a higher order polynomial may be needed to model accurately the language tones of a tonal language having language tones that are more complex than the language tones of Mandarin Chinese.
- the comparator 86 receives the coefficients and duration, which describe the pitch of the input syllable, from the coefficient determinator 84 and receives the models of the language tones from the coefficient modeler 38.
- the comparator 86 compares the coefficients and duration to the models to determine the model that the coefficients and duration most closely resemble.
- the comparator 86 uses a multivariate normal density, also known as a Gaussian classifier, to compare the input to the models.
- the multivariate normal density is given by the equation:

  p(x) = exp( -(1/2) (x-μ)ᵀ Σ⁻¹ (x-μ) ) / ( (2π)^(d/2) |Σ|^(1/2) )

  where:
- p(x) is the multivariate normal density
- x is an input vector of the A, B, and C coefficients and duration T of the second order polynomial that the coefficient determinator 84 determined for the input signal
- d is the dimension of the input vector, in this case 4
- |Σ| is the determinant of the co-variance matrix Σ of the model
- Σ⁻¹ is the inverse of the co-variance matrix of the model
- μ is a vector of the arithmetic means of the A, B, and C coefficients and duration T of the model
- (x-μ)ᵀ is the transpose of the difference vector of x and μ.
- the comparator 86 determines the multivariate normal density using the mean vector, ⁇ , of each model.
- the comparator 86 temporarily stores the multivariate normal density determined for each model.
- the comparator 86 recognizes the model having the greatest multivariate normal density as the model of the language tone of the syllable.
- the comparator 86 generates a signal indicating this language tone and impresses the signal on line 96 to transmit it to the second input of the syllable recognition system 80.
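The comparison step described above can be sketched as a Gaussian classifier over the 4 dimensional (A, B, C, T) vector. The model statistics below are made up for illustration only; they are not the patent's trained values:

```python
import numpy as np

def multivariate_normal_density(x, mean, cov):
    """p(x) for a d-dimensional Gaussian with the given mean vector
    and co-variance matrix, as in the comparator's equation."""
    d = len(x)
    diff = x - mean
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def classify_tone(x, models):
    """Return the tone label whose model gives the greatest density."""
    return max(models, key=lambda tone: multivariate_normal_density(
        x, models[tone]["mean"], models[tone]["cov"]))

# Two toy tone models: vectors are (A, B, C, T)
models = {
    "tone1": {"mean": np.array([0.0, 0.0, 200.0, 0.3]), "cov": np.eye(4)},
    "tone2": {"mean": np.array([0.0, 150.0, 120.0, 0.3]), "cov": np.eye(4)},
}
tone = classify_tone(np.array([0.0, 145.0, 125.0, 0.3]), models)
```

Because the input vector lies much closer to the second model's mean, the classifier selects "tone2", mirroring how the comparator picks the model with the greatest density.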
- Referring now to Figure 3, a block diagram of a trainer 34 is shown.
- Figure 3 also shows training data memory 32, and the coefficient modeler 38.
- the trainer 34 uses data stored in the training data memory 32 to train the coefficient modeler 38.
- The training data memory 32 is a conventional random access memory or read only memory for storing data.
- the training data memory 32 comprises a tone 1 training data memory 22, a tone 2 training data memory 24, a tone 3 training data memory 26, a tone 4 training data memory 28, and a tone 5 training data memory 30.
- Each of the tone 1-5 training data memories 22, 24, 26, 28, 30 is associated with a language tone of Mandarin Chinese and stores utterances that have the associated tone.
- the tone 1 training data memory 22 may be associated with the high and level tone, the tone 2 training data memory 24 with the rising tone, the tone 3 training data memory 26 with the falling-rising tone, the tone 4 training data memory 28 with the falling tone, and the tone 5 training data memory 30 with the neutral tone.
- the tone 1 training data memory 22 stores at least 20 utterances of Mandarin Chinese speech having the language tone that is associated with the tone 1 training data memory 22.
- Each utterance is a digitized, single syllable.
- the tone 2 - 5 training data memories 24, 26, 28, 30 store at least 20 digitized, single syllables having the language tone with which each is associated. While 20 is the minimum number of utterances, preferably each of the tone 1 - 5 training memories 22, 24, 26, 28, 30 stores several hundred utterances.
- the tone 1 - 5 training data memories 22, 24, 26, 28, 30 are each coupled to line 68 and impress signals of the digitized utterances on line 68 to output the digitized utterances from the training data memory 32.
- the trainer 34 has a first input that is coupled to line 68 and a first output that is formed by line 74.
- the trainer 34 comprises a model generator 36, an A coefficient memory 50, a B coefficient memory 52, a C coefficient memory 54, and a T duration memory 56.
- the model generator 36 first estimates the fundamental frequency, the pitch, of a syllable using Fourier analysis. Alternately, the model generator 36 may use a low pass filter or other means to determine the pitch of a syllable.
- the model generator 36 adjusts the coefficients of a second order polynomial so that the polynomial best describes the pitch contour of an utterance.
- the model generator 36 has a first input, a second input, a first output, and a second output. The first input is coupled to line 68 to receive data from the tone 1-5 memories 22, 24, 26, 28, 30.
- the first output is coupled to a line 70 which is also coupled to an input of each of memories 50, 52, 54, 56.
- the second input is coupled to a line 72 which is also coupled to an output of each of memories 50, 52, 54, 56.
- the second output is coupled to line 74 which forms an output of the trainer 34.
- the model generator 36 uses a least squares curve fitting method for determining the coefficients of a polynomial that accurately describes a tone. For example, the model generator 36 receives a digitized utterance from the tone 1 training data memory 22.
- the model generator 36 uses fourier analysis, a low pass filter, or other techniques to determine the behavior of the pitch of the utterance.
- the model generator 36 uses the least squares curve fitting method to adjust the coefficients of a second order polynomial so that the polynomial best describes the pitch contour of the utterance.
- the A coefficient memory 50, the B coefficient memory 52, the C coefficient memory 54, and duration T memory 56 are all memories for storing data.
- Each of the memories 50, 52, 54, 56 has an input coupled to the first output of the model generator 36 by a line 70, and an output coupled to the second input of the model generator 36 by line 72.
- After the model generator 36 has determined the coefficients of the polynomial that best describes a tone and has determined the duration of the utterance, the model generator 36 generates a signal and impresses the signal on line 70 to store the value of the A coefficient in the A coefficient memory 50, the value of the B coefficient in the B coefficient memory 52, the value of the C coefficient in the C coefficient memory 54, and the duration of the utterance in the duration T memory 56.
- the model generator 36 preferably determines the coefficients of the polynomials for utterances in the training data memory 32, which have the same language tone, together. Once the model generator 36 has determined the coefficients and durations for all utterances that have the same language tone, the model generator 36 generates the model for that language tone.
- the training data need not be segregated according to language tone.
- Each utterance would have associated with it an indicator of its language tone.
- the trainer would have an A coefficient training memory, a B coefficient training memory, a C coefficient training memory, and a duration T training memory for each language tone of the language.
- the model generator would record the coefficients and duration in the set of memories indicated by the tone indicator.
- the coefficient modeler 38 comprises a tone 1 model memory 58, a tone 2 model memory 60, a tone 3 model memory 62, a tone 4 model memory 64, and a tone 5 model memory 66.
- the coefficient modeler 38 has an input formed by line 74, and each model memory 58, 60, 62, 64, 66 has an input that is coupled to line 74. Furthermore, each model memory 58, 60, 62, 64, 66 has an output coupled to line 94; line 94 forms an output of the coefficient modeler 38.
- Each model memory 58, 60, 62, 64, 66 stores a model of a language tone; that is, there is a model memory 58, 60, 62, 64, 66 for each language tone. If the present invention is being used to recognize a syllable of a tonal language other than Mandarin Chinese, there are model memories for each language tone of the tonal language.
- the model generator 36 determines the arithmetic mean and co-variance matrix of the coefficients and durations stored in the memories 50, 52, 54, 56.
- the average coefficients, average duration, and co-variance matrix of the coefficients and duration constitute a model of a language tone.
- the model generator 36 determines the model and generates a signal on line 74 to transfer the model to the coefficient modeler 38 where the model is stored in the appropriate model memory 58, 60, 62, 64, 66.
- the tone model memories 58, 60, 62, 64, 66 each have an output, which is coupled to line 94, for outputting models of the tones of Mandarin Chinese speech.
- the model generator 36 receives all the utterances stored in the tone 1 training data memory 22.
- the model generator 36 fits a second order polynomial to an utterance and generates signals on line 70 to store the A, B, and C coefficients and duration T in memories 50, 52, 54, 56.
- the model generator 36 repeats this process for each utterance stored in the tone 1 training data memory 22.
- the model generator 36 then receives the A, B, and C coefficients and duration T from the memories 50, 52, 54, 56 on line 72 at the second input.
- the model generator 36 determines the arithmetic means and co-variance matrix to generate a model of the first language tone and then generates a signal to transfer the model to the tone 1 model memory 58.
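Generating a model from the stored training data reduces to computing the arithmetic mean and co-variance of the (A, B, C, T) vectors. A minimal sketch using NumPy, with synthetic training vectors in place of real utterance data:

```python
import numpy as np

def build_tone_model(training_vectors):
    """Build a tone model from the (A, B, C, T) vectors of training
    utterances that share one language tone: the mean vector and the
    4x4 co-variance matrix of the coefficients and duration."""
    data = np.asarray(training_vectors)          # shape (n, 4)
    mean = data.mean(axis=0)                     # arithmetic means
    cov = np.cov(data, rowvar=False)             # 4x4 co-variance matrix
    return {"mean": mean, "cov": cov}

# Synthetic (A, B, C, T) training vectors for one tone, for illustration
rng = np.random.default_rng(0)
vectors = rng.normal([0.0, 150.0, 120.0, 0.3], 0.1, size=(200, 4))
model = build_tone_model(vectors)
```

The resulting mean vector and co-variance matrix are exactly the quantities the comparator's multivariate normal density needs for that tone.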
- Referring now to Figure 4, a block diagram of a second embodiment of the system 10 for analyzing the tone of a syllable of Mandarin Chinese speech, constructed in accordance with the present invention, is shown.
- the system 10 preferably comprises an input device 12, an output device 14, a processor 16, a memory means 18, and a speech recognition system 20.
- the input device 12, output device 14, processor 16, memory means 18, and speech recognition system 20 are coupled in a von Neumann architecture via a bus 22 such as in a personal computer.
- the processor 16 is preferably a microprocessor such as a Motorola 68040; the output device 14 is preferably a video monitor; and the input device 12 is preferably a keyboard and mouse type controller and a microphone for inputting audio signals.
- the input device 12 includes an A/D converter for digitizing analog signals from the microphone.
- the system 10 is a Macintosh Quadra 840AV computer system from Apple Computer, Inc. of Cupertino, California. Those skilled in the art will realize that the system 10 could also be implemented on an IBM personal computer or any other computer system.
- the memory means 18 is constructed with random access memory and read only memory. The memory means 18 stores data and program instruction steps for the system 10.
- the memory means 18 may be a conventional dynamic random access memory and a conventional disk drive.
- the memory means 18 includes a training data memory 32.
- the training data memory 32 is coupled through the bus 22 to transmit and receive signals from the speech recognition system 20.
- the speech recognition system 20 is coupled to the bus 22 to receive a digitized input signal from the input device 12.
- the speech recognition system 20 parses the input signal to obtain a syllable of Mandarin Chinese speech and generates an output of the recognized syllable.
- the speech recognition system 20 determines both the phonetic structure and language tone of the syllable.
- Referring now to Figure 5, a block diagram of the speech recognition system 20 is shown.
- the speech recognition system 20 comprises a segmenter 79, a syllable recognition system 80, a tone recognition system 82, and a trainer 34.
- the tone recognition system 82 comprises a coefficient modeler 38, a coefficient determinator 84, and a comparator 86.
- the segmenter 79, syllable recognition system 80, tone recognition system 82, and trainer 34 are memories storing sets of program instruction steps that, when executed by the processor 16, perform the functions described above with reference to Figure 2 and Figure 3, each device being a set of program instruction steps sharing a single processor 16.
- Each of the syllable recognition system 80, tone recognition 82, and trainer 34 are coupled to the bus 22 to transmit and receive signals to and from the input device 12, the output device 14, the processor 16, and memory means 18 and to transmit and receive signals from each other as shown in Figures 2 and 3 and described above.
- the trainer 34 is coupled to the bus 22 to receive data from the training data memory 32 in the memory means 18.
- the trainer 34 is also coupled to the bus 22 to transmit models of the language tones to the coefficient modeler 38.
- the segmenter 79 receives a digital input signal from the input device 12 and transmits a syllable to the syllable recognition system 80 and to the coefficient determinator 84.
- the syllable recognition system 80 receives a syllable from the segmenter 79 through the bus 22.
- the syllable recognition system 80 also receives a signal indicating the language tone of the syllable from the comparator 86 through the bus 22.
- the syllable recognition system 80 signals the recognition of the syllable to the output device 14, or to a look up table or other memory device, through the bus 22.
- the bus 22 couples the coefficient determinator 84 to the comparator 86 so that the coefficient determinator may transmit the coefficients and duration that describe the pitch contour of a syllable to the comparator 86.
- the comparator 86 receives models of the language tones from the coefficient modeler 38 over the bus 22.
- Referring now to Figures 6A and 6B, a flow chart is shown of a preferred method for training the coefficient modeler 38 for the language tones of a syllable of Mandarin Chinese speech.
- the method begins in step 700 where the trainer 34 receives an utterance or signal representing a syllable of Mandarin Chinese speech having a known tone.
- the method preferably processes the utterances having the same language tone together in one group.
- the model generator 36 processes the utterances in the tone 1 training data 22 together in one group.
- the coefficient determinator 84 uses Fourier analysis to estimate the pitch contour of the input signal.
- the coefficient determinator 84 may alternately use a low pass filter or other techniques to estimate the pitch contour of the input.
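The pitch-contour estimate can be produced in several ways. As one concrete, minimal sketch, the following uses autocorrelation, a common simple technique in the same family as the Fourier analysis and low-pass filtering named above; the frame, sample rate, and pitch search range are all illustrative assumptions, not values from the patent:

```python
import numpy as np

def estimate_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the pitch (Hz) of one voiced frame by autocorrelation."""
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags only.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Search only lags corresponding to the plausible pitch range.
    lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sample_rate / lag

# Synthetic 200 Hz voiced frame sampled at 8 kHz (illustrative values).
sr = 8000
t = np.arange(0, 0.04, 1.0 / sr)
frame = np.sin(2 * np.pi * 200.0 * t)
print(estimate_pitch(frame, sr))  # prints 200.0
```

Running the estimator frame by frame across the syllable yields the pitch contour that the curve-fitting steps below operate on.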
- the model generator 36 fits a second order polynomial to the estimated pitch contour. The model generator 36 uses a least square-error curve fitting method for fitting the second order polynomial to the tone.
- In step 706, the model generator 36 selects the A coefficient of the polynomial and stores it in the coefficient A training memory 50.
- In step 708, the model generator 36 selects the B coefficient and stores it in the B coefficient training memory 52.
- In step 710, the model generator 36 selects the C coefficient and stores it in the C coefficient training memory 54.
- In step 712, the model generator 36 determines the duration of the utterance and stores it in the duration T training memory 56.
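The curve fit and coefficient extraction in the steps above can be sketched with NumPy's least-square-error polynomial fit; the pitch contour values here are illustrative assumptions:

```python
import numpy as np

# Illustrative estimated pitch contour: pitch (Hz) at uniform frame
# times across one syllable (all numeric values are assumptions).
t = np.linspace(0.0, 0.3, 30)               # 30 frames over a 0.3 s syllable
pitch = 220.0 + 80.0 * t - 200.0 * t ** 2   # synthetic contour

# Least-square-error fit of a second order polynomial
# p(t) = A*t**2 + B*t + C to the estimated contour.
A, B, C = np.polyfit(t, pitch, deg=2)

# The duration T completes the (A, B, C, T) description of the syllable.
duration_T = t[-1] - t[0]
print(A, B, C, duration_T)  # A ≈ -200, B ≈ 80, C ≈ 220, T = 0.3
```

The four values (A, B, C, T) are exactly what the training memories 50, 52, 54, and 56 accumulate, one set per training utterance.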
- In step 714, the model generator 36 determines if there is another utterance to be processed that has the current language tone in the training memory 32. If there is another utterance to be processed, the method returns to step 700 where the model generator 36 receives the next utterance from the training memory 32. If in step 714 there are no more utterances having the current language tone to be processed, the model generator 36 uses the data stored in the coefficient A training memory 50, coefficient B training memory 52, coefficient C training memory 54, and duration T training memory 56 to generate a model of the language tone.
- In step 716, the model generator 36 receives the A coefficients stored in the A coefficient training memory 50 and determines the arithmetic mean of the A coefficients.
- In step 718, the model generator 36 receives the B coefficients stored in the B coefficient training memory 52 and determines the arithmetic mean of the B coefficients.
- In step 720, the model generator 36 receives the C coefficients from the C coefficient training memory 54 and determines the arithmetic mean of the C coefficients.
- In step 722, the model generator 36 receives the durations from the duration T training memory 56 and determines the arithmetic mean of the durations.
- In step 724, the model generator 36 uses the arithmetic means and the data in the A coefficient training memory 50, B coefficient training memory 52, C coefficient training memory 54, and duration T training memory 56 to determine a co-variance matrix for random variables A, B, C, and T.
- the model generator 36 gathers the arithmetic means together into a 4 dimensional vector and joins the vector to the co-variance matrix to form a model of the selected language tone.
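The mean vector and co-variance matrix described above can be sketched as follows; the four hypothetical training utterances, one (A, B, C, T) row each, are invented for illustration:

```python
import numpy as np

# Hypothetical training data for one language tone: each row holds the
# (A, B, C, T) values extracted from one training utterance.
samples = np.array([
    [-190.0, 78.0, 218.0, 0.29],
    [-205.0, 83.0, 224.0, 0.31],
    [-198.0, 80.0, 221.0, 0.30],
    [-210.0, 85.0, 226.0, 0.32],
])

# 4-dimensional vector of arithmetic means of A, B, C, and T.
mean_vector = samples.mean(axis=0)

# 4x4 co-variance matrix for the random variables A, B, C, and T.
cov_matrix = np.cov(samples, rowvar=False)

# The pair forms the model of the selected language tone.
tone_model = (mean_vector, cov_matrix)
```

The variance-only alternative mentioned next corresponds to keeping just the diagonal of `cov_matrix` and discarding the cross terms.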
- the model generator 36 may alternately determine the variance of each of the A coefficients, B coefficients, C coefficients, and duration T in steps 716, 718, 720, and 722, respectively, and not determine the co-variance matrix in step 724.
- the variance is a sub-set of the co-variance, and determining the variance alone yields a less accurate model than determining the co-variance matrix. Determining the variance alone, however, saves significant computation time. It is nevertheless preferred to determine the co-variance matrix.
- the model generator 36 signals, in step 726, the coefficient modeler 38 to receive a model and then transfers the model to the coefficient modeler 38 for storage.
- the model corresponds to the training data. For example, if the model generator 36 has just completed processing the utterances of the tone 1 training data 22, then the model generator 36 signals the coefficient modeler 38 to receive the tone 1 model 58. In step 728, the model generator 36 determines if there is more tone training data to be processed.
- If there is more tone training data to be processed, the model generator 36 resets, in step 730, the coefficient A training memory 50, coefficient B training memory 52, coefficient C training memory 54, and duration T training memory 56. The model generator 36 then selects, in step 732, the next tone training data to process and the method returns to step 700. If, in step 728, there are no more sets of tone training data to be processed, the method ends.
- Referring now to Figure 7, a flow chart of a preferred method for identifying the tone of a syllable of Mandarin Chinese speech is shown. The method begins in step 800 where the tone recognition system 82 receives an input of a digitized syllable of Mandarin Chinese speech.
- the coefficient determinator 84 estimates the pitch of the syllable, in step 802.
- the coefficient determinator 84 fits a second order polynomial to the estimated pitch.
- the coefficient determinator 84 preferably uses a least squares curve fitting method to fit a second order polynomial to the pitch.
- the coefficient determinator 84 alternatively may use many other methods to fit a polynomial to the tone. If the coefficient determinator 84 is being used to recognize the tones of a tonal language other than Mandarin Chinese, a higher order polynomial may be used to model the pitch accurately.
- the coefficient determinator 84 then transmits the polynomial to the comparator 86.
- the comparator 86 determines and temporarily stores the multivariate normal density for the tone 1 model 58 using a Gaussian classifier as has been described above with respect to Figure 3.
- the comparator 86 determines the multivariate normal density for the tone 2 model 60 and stores the result.
- the comparator 86 determines the multivariate normal density for the tone 3 model 62 and stores the result.
- the comparator 86 determines the multivariate normal density for the tone 4 model 64 and stores the result.
- In step 814, the comparator 86 determines the multivariate normal density for the tone 5 model 66 and stores the result.
- the comparator 86 selects the tone model that received the highest multivariate normal density as the model that best describes the input.
- the comparator 86 selects the language tone of the selected tone model as the language tone of the syllable.
- the comparator 86 generates a signal to the syllable recognition system 80 that indicates the language tone of the syllable.
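The comparison carried out in steps 806 through 814 can be sketched with an explicit multivariate normal density and an argmax over tone models. The two models below are invented for illustration (the system described holds five, one per Mandarin tone):

```python
import numpy as np

def mvn_density(x, mean, cov):
    """Multivariate normal density of observation x under a (mean, cov) model."""
    d = len(mean)
    diff = x - mean
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

# Hypothetical tone models: tone number -> (mean vector, co-variance matrix).
tone_models = {
    1: (np.array([0.0, 5.0, 220.0, 0.30]), np.diag([25.0, 4.0, 100.0, 0.001])),
    2: (np.array([50.0, 60.0, 180.0, 0.32]), np.diag([25.0, 4.0, 100.0, 0.001])),
}

# (A, B, C, T) description of the input syllable's pitch contour.
observation = np.array([48.0, 58.0, 182.0, 0.31])

# Density of the observation under each model; the tone whose model
# scores the highest density is selected as the language tone.
densities = {tone: mvn_density(observation, m, c)
             for tone, (m, c) in tone_models.items()}
best_tone = max(densities, key=densities.get)
print(best_tone)  # prints 2
```

Using diagonal co-variance matrices here mirrors the variance-only simplification mentioned earlier; a full co-variance matrix plugs into the same formula unchanged.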
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB9706562A GB2308002B (en) | 1994-09-29 | 1995-09-29 | A system and method for determining the tone of a syllable of mandarin chinese speech |
AU37341/95A AU3734195A (en) | 1994-09-29 | 1995-09-29 | A system and method for determining the tone of a syllable of mandarin chinese speech |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US31522294A | 1994-09-29 | 1994-09-29 | |
US08/315,222 | 1994-09-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1996010248A1 true WO1996010248A1 (en) | 1996-04-04 |
Family
ID=23223432
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1995/012595 WO1996010248A1 (en) | 1994-09-29 | 1995-09-29 | A system and method for determining the tone of a syllable of mandarin chinese speech |
Country Status (3)
Country | Link |
---|---|
AU (1) | AU3734195A (en) |
GB (1) | GB2308002B (en) |
WO (1) | WO1996010248A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
SG97998A1 (en) * | 1999-12-10 | 2003-08-20 | Matsushita Electric Ind Co Ltd | Method and apparatus for mandarin chinese speech recognition by using initial/final phoneme similarity vector |
US9620092B2 (en) | 2012-12-21 | 2017-04-11 | The Hong Kong University Of Science And Technology | Composition using correlation between melody and lyrics |
CN111916066A (en) * | 2020-08-13 | 2020-11-10 | 山东大学 | Random forest based voice tone recognition method and system |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0272723A1 (en) * | 1986-11-26 | 1988-06-29 | Philips Patentverwaltung GmbH | Method and arrangement for determining the temporal course of a speech parameter |
- 1995
- 1995-09-29: AU application AU37341/95A, published as AU3734195A, not active (Abandoned)
- 1995-09-29: GB application GB9706562A, published as GB2308002B, not active (Expired - Lifetime)
- 1995-09-29: WO application PCT/US1995/012595, published as WO1996010248A1, not active (Application Discontinuation)
Non-Patent Citations (5)
Title |
---|
CHEN ET AL.: "Vector quantization of pitch information in Mandarin speech", IEEE TRANSACTIONS ON COMMUNICATIONS, vol. 38, no. 9, US, pages 1317 - 1320, XP000173207, DOI: doi:10.1109/26.61370 * |
DATABASE INSPEC INSTITUTE OF ELECTRICAL ENGINEERS, STEVENAGE, GB; LONG ET AL.: "Two-syllabic Chinese tone recognition by a polynomial approximation of pitch contour" * |
GUAN ET AL.: "Speaker-independent tone recognition for Chinese speech", ACTA ACUSTICA, vol. 18, no. 5, CHINA, pages 379 - 385 * |
TRANSACTIONS OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS D-II, JAN. 1990, JAPAN, vol. J73D-II, no. 1, pages 122 - 124 * |
WU ET AL.: "A tone recognition of polysyllabic Chinese words using an approximation model of four tone pitch patterns", IECON '91 INTERNATIONAL CONFERENCE ON INDUSTRIAL ELECTRLNICS, CONTROL AND INSTRUMENTATION, 28 October 1991 (1991-10-28) - 1 November 1991 (1991-11-01), KOBE, JP, pages 2115 - 2119 vol.3 * |
Also Published As
Publication number | Publication date |
---|---|
GB2308002B (en) | 1998-08-19 |
AU3734195A (en) | 1996-04-19 |
GB9706562D0 (en) | 1997-05-21 |
GB2308002A (en) | 1997-06-11 |
Legal Events
- AK Designated states: kind code of ref document: A1; designated state(s): AM AT AU BB BG BR BY CA CH CN CZ DE DK EE ES FI GB GE HU IS JP KE KG KP KR KZ LK LR LT LU LV MD MG MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TT UA UG UZ VN
- AL Designated countries for regional patents: kind code of ref document: A1; designated state(s): KE MW SD SZ UG AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG
- DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
- 121 Ep: the epo has been informed by wipo that ep was designated in this application
- WWE Wipo information: entry into national phase; ref document number: 1019970702045; country of ref document: KR
- WWE Wipo information: entry into national phase; ref document number: 9706562.7; country of ref document: GB
- REG Reference to national code; ref country code: DE; ref legal event code: 8642
- WWP Wipo information: published in national office; ref document number: 1019970702045; country of ref document: KR
- NENP Non-entry into the national phase; ref country code: CA
- 122 Ep: pct application non-entry in european phase
- WWW Wipo information: withdrawn in national office; ref document number: 1019970702045; country of ref document: KR