WO1996010248A1 - A system and method for determining the tone of a syllable of mandarin chinese speech - Google Patents
- Publication number
- WO1996010248A1 (PCT/US1995/012595)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- language
- tone
- syllable
- coefficients
- coefficient
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/15—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information
Definitions
- the present invention also relates to pending U.S. Patent Application Serial No.: , filed , 1994, invented by Hsiao-Wuen Hon and Bao-Sheng Yuan, entitled “A System And Method For Recognizing A Tonal Language,” which is incorporated herein by reference.
- the present invention also relates to pending U.S. Patent Application Serial No.: , filed , 1994, invented by Hsiao-Wuen Hon, Yen-Lu Chow, and Kai-Fu Lee entitled "Continuous Mandarin Chinese Speech Recognition System Having An Integrated Tone Classifier," which is incorporated herein by reference.
- the present invention relates generally to speech recognition systems.
- the present invention relates to a system and method for determining the tone of a syllable of speech.
- the present invention relates to a system and method for identifying the tone of a syllable of Mandarin Chinese speech.
- Mandarin Chinese is a tonal syllabic language where a syllable and tone combine to define the meaning of an utterance.
- Exemplary waveforms for electrical signals representing the tones of Mandarin Chinese are shown in Figures 1A, 1B, 1C, 1D, and 1E.
- the high and level tone, the rising tone, the falling-rising tone, the falling tone, and the neutral tone will each be referred to generically as a language tone in this application.
- the tone is the behavior of the fundamental frequency (pitch) of the audio signal.
- the tone of a syllable affects the meaning of the syllable. Syllables with the same phonetic structure but different tones usually have significantly different meanings and correspond to different written characters. Thus, to recognize accurately an audio signal of Mandarin Chinese, a speech recognition system must recognize both the syllable and the tone of the syllable.
- the present invention overcomes the deficiencies and limitations of the prior art with a system and method for analyzing the tone of a syllable of Mandarin Chinese speech.
- the novel tone recognition system of the present invention comprises a coefficient determinator, a coefficient modeler, and a comparator.
- the coefficient determinator receives a digitized signal of a syllable from the A/D converter.
- the coefficient determinator estimates the pitch of the syllable and describes the pitch by using a least squares method to fit a second order polynomial to the pitch.
- the coefficient determinator can quickly and easily describe the pitch using a second order polynomial.
- the second order polynomial is preferred because of its low computational requirements.
- the output of the coefficient determinator is coupled to an input of the comparator. Through this coupling, the coefficient determinator provides a signal representing the polynomial to the comparator.
- a second input of the comparator is coupled to an output of the coefficient modeler. Through this coupling the comparator receives models of the language tones of the tonal language.
- a model of a language tone comprises expected coefficients for a second order polynomial that describes the language tone, an expected duration of syllables having the language tone, and a co-variance of the coefficients and duration.
- the comparator compares the polynomial, which describes the syllable to the models, determines the model that most closely matches the syllable, and generates a signal indicating the matching language tone.
- the present invention also includes a system and method for training a coefficient modeler.
- the system for training comprises a model generator, a coefficient A training memory, a coefficient B training memory, a coefficient C training memory, and a duration T training memory.
- the model generator receives a large number of uttered syllables all known to have the same language tone.
- the model generator estimates the pitch of each syllable, fits a second order polynomial to each syllable, and stores the resulting coefficients and duration in the appropriate memories.
- the model generator determines the arithmetic mean and co-variance for the coefficients and duration and generates a signal for the coefficient modeler to accept the model.
- Figures 1A, 1B, 1C, 1D and 1E are graphical representations of electrical signals having the behavior of the 4 lexical tones and one neutral tone of Mandarin Chinese speech;
- Figure 2 is a block diagram of a first embodiment of a speech recognition system that includes a tone recognition system according to the present invention;
- Figure 3 is a block diagram of an exemplary embodiment of a system for training a coefficient modeler;
- Figure 4 is a block diagram of a second embodiment of a system for recognizing syllables of Mandarin Chinese speech that includes the tone recognition system of the present invention;
- Figure 5 is a block diagram of the speech recognition system of the second embodiment of a system for recognizing syllables of Mandarin Chinese speech;
- Figures 6A and 6B are flowcharts showing the preferred method for training the tone recognition system of the present invention to identify the tone of a syllable of Mandarin Chinese speech;
- Figure 7 is a flowchart showing the preferred method for identifying the tone of a syllable of Mandarin Chinese speech.
- Referring now to Figures 1A, 1B, 1C, 1D, and 1E, graphical representations of electrical signals that represent the 4 lexical tones and 1 neutral tone of Mandarin Chinese speech are shown.
- the signals shown in Figures 1A, 1B, 1C, 1D, and 1E originate as audio signals generated by a human speaker.
- a speech recognition system receives the audio signals through a microphone.
- the microphone converts an audio signal to an electrical signal.
- Each syllable of Mandarin Chinese speech has a tone.
- The tone is the behavior of the pitch contour, the fundamental frequency contour, of the audio signal of the syllable. The tone is preserved when the audio signal is converted into an electrical signal.
- the tone of a syllable may refer to the behavior of the fundamental frequency of the audio signal or the electrical signal into which the audio signal is converted.
- Figures 1A, 1B, 1C, 1D, and 1E show the tones of Mandarin Chinese speech as fundamental frequencies plotted as a function of time.
- Figure 1A shows a first tone signal 40 having a high and level tone.
- Figure 1B shows a second tone signal 42 having a rising tone.
- a third tone signal 44, shown in Figure 1C, has a falling-rising behavior.
- Figure 1D shows a fourth tone signal 46 having a behavior of the last lexical tone of Mandarin Chinese speech, the falling tone.
- Figure 1E shows a fifth tone signal having the neutral tone.
- the system 10 comprises a microphone 76, an analog-to-digital converter 78, a segmenter 79, a syllable recognition system 80, and a tone recognition system 82.
- the microphone 76 is a conventional microphone for converting analog audio signals into analog electrical signals.
- the microphone 76 has an input for receiving audio signals and an output for outputting electrical signals.
- the output is coupled to a line 88.
- the analog-to-digital converter (“A/D converter”) 78 is a conventional analog-to-digital converter for converting analog, electrical signals to digital, electrical signals.
- the A/D converter 78 has an input, which is coupled to line 88, for receiving an analog, electrical signal and an output for outputting a digital, electrical signal.
- the output of the A/D converter 78 is coupled to a line 90.
- the segmenter 79 is a conventional segmenter for parsing a continuous input into single syllables of Mandarin Chinese speech.
- the segmenter 79 generates output signals of single syllables.
- the segmenter 79 has an input coupled to line 90 and an output coupled to a line 91.
- the syllable recognition system 80 is a system for determining the phonetic structure of a syllable of a tonal language and for combining the identified phonetic structure with an identification of the language tone of the syllable to recognize the syllable.
- the syllable recognition system 80 has a first input, coupled to line 91, for receiving a syllable from the segmenter 79, a second input, coupled to a line 96, for receiving a signal that identifies the language tone of the syllable from the tone recognition system 82, and an output, coupled to a line 98, for outputting a signal representing the recognized syllable.
- the syllable recognition system 80 uses a hidden Markov model and state analysis to identify the phonetic structure of the syllable. The syllable recognition system 80 combines the identification of the phonetic structure with an identification of the language tone of the syllable to generate a signal indicative of the complete meaning of the syllable.
- the syllable recognition system 80 is preferably an Apple PlainTalk Chinese Syllable Recognition System (the "PlainTalk System") from Apple Computer, Inc. of Cupertino, California modified to utilize long-term tonal analysis.
- the PlainTalk System has a first system for determining the phonetic structure of a syllable, a second system for performing short-term tonal analysis on the syllable to identify the language tone of the syllable, and a third system for combining the output of the first system and the second system into a recognition.
- the output of the third system forms an output of the PlainTalk System.
- the PlainTalk System may be modified to utilize long-term tonal analysis by removing the second system and coupling line 96 to an input to the third system so that the third system combines the output of the first system, for determining the phonetic structure of the syllable, and the output of the tone recognition system 82 into an output of the syllable recognition system 80.
- the syllable recognition system 80 may alternately be a syllable based voice typewriter from Star Incorporated of the People's Republic of China similarly modified as described in connection with the preferred PlainTalk System.
- the tone recognition system 82 comprises a coefficient determinator 84, a comparator 86, and a coefficient modeler 38.
- the coefficient determinator 84 has an input coupled to line 91 to receive a syllable from the segmenter 79 and an output coupled to a line 92.
- the coefficient modeler 38 is constructed by the trainer 34, shown in Figure 3, and has a first input and an output, both of which are coupled to line 94.
- the comparator 86 has a first input, coupled to line 92 to receive the output of the coefficient determinator 84, a second input, coupled to line 94 to receive models from the coefficient modeler 38, a first output, and a second output, coupled to line 94, through which the comparator 86 signals for the coefficient modeler 38 to transmit the models.
- the output of the tone recognition system 82 is formed by the first output of the comparator 86 and is coupled to line 96 so that the identification of the language tone is transmitted to the second input of the syllable recognition system 80.
- the coefficient determinator 84 receives a syllable from the segmenter 79 on line 91.
- the coefficient determinator 84 estimates the fundamental frequency, the pitch, of the syllable using Fourier analysis. Alternatively, the coefficient determinator 84 may use a low pass filter or other means to determine the pitch of the input syllable.
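For illustration, pitch estimation can be sketched with a simple autocorrelation method, one of the "other means" the text allows for; the function and parameter names below are hypothetical and not taken from the patent:

```python
import numpy as np

def estimate_pitch(frame, sample_rate=16000, f0_min=60.0, f0_max=400.0):
    """Estimate the fundamental frequency (Hz) of one speech frame by
    picking the strongest autocorrelation peak within the plausible
    pitch range. A stand-in sketch for the Fourier analysis or
    low pass filtering mentioned in the text."""
    frame = frame - np.mean(frame)
    # Autocorrelation for non-negative lags 0 .. len(frame)-1
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sample_rate / lag

# Example: a 200 Hz sinusoid should yield an estimate near 200 Hz
sr = 16000
t = np.arange(int(0.03 * sr)) / sr        # one 30 ms frame
f0 = estimate_pitch(np.sin(2 * np.pi * 200.0 * t), sr)
```

In practice the estimator would be run frame by frame over the syllable to produce the pitch contour that the polynomial is later fitted to.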
- the coefficient determinator 84 adjusts the coefficients of a second order polynomial to match closely the pitch contour.
- the coefficient determinator 84 uses a least squares curve fitting method to adjust the coefficients. Once the coefficients best describe the pitch contour, the coefficient determinator 84 generates a signal to transfer the coefficients along with the duration of the input signal, grouped together in a 4 dimensional vector, to the comparator 86.
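The least squares curve fit described above can be sketched with an off-the-shelf polynomial fit; this is a minimal illustration using NumPy, and the function and variable names are ours, not the patent's:

```python
import numpy as np

def fit_pitch_polynomial(pitch_contour, frame_period=0.01):
    """Fit F0(t) = A*t^2 + B*t + C to a pitch contour by least squares.

    pitch_contour: fundamental-frequency estimates (Hz), one per frame.
    frame_period: seconds between frames (assumed 10 ms here).
    Returns the (A, B, C) coefficients and the duration T in seconds,
    i.e. the 4 dimensional vector passed to the comparator.
    """
    t = np.arange(len(pitch_contour)) * frame_period
    # np.polyfit returns the highest-order coefficient first: [A, B, C]
    A, B, C = np.polyfit(t, pitch_contour, deg=2)
    T = len(pitch_contour) * frame_period
    return A, B, C, T

# Example: a contour rising roughly linearly from 120 Hz to 180 Hz
contour = np.linspace(120.0, 180.0, 30)
A, B, C, T = fit_pitch_polynomial(contour)
```

For a linear contour like this the quadratic term A comes out near zero and C near the starting pitch, which matches the intuition that the fit describes the overall shape of the contour rather than every local fluctuation.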
- the coefficient determinator 84 can quickly and accurately describe the tone of the syllable and thus, facilitate a quick and accurate identification of the language tone of the syllable.
- the coefficient modeler 38 has a first input and a second input that is coupled to a line 74, shown in Figure 3, and an output. The first input and the output are coupled to a line 94.
- the coefficient modeler 38 comprises a tone 1 model memory 58, a tone 2 model memory 60, a tone 3 model memory 62, a tone 4 model memory 64, and a tone 5 model memory 66.
- the memories 58, 60, 62, 64, 66 are shown in Figure 3.
- Each of the tone model memories 58, 60, 62, 64, 66 stores a model of a tone of Mandarin Chinese speech.
- the tone 1 model memory 58 may store a model of the high and level tone
- the tone 2 model memory 60 may store a model of the rising tone
- the tone 3 model memory 62 may store a model of the falling-rising tone
- the tone 4 model memory 64 may store a model of the falling tone
- the tone 5 model memory 66 may store a model of the neutral tone.
- Each model comprises an A coefficient, a B coefficient and a C coefficient of a polynomial of the form:

  F0(t) = A·t² + B·t + C

  where F0 is the fundamental frequency and t is time.
- Each model also includes a time duration T.
- the A coefficient, B coefficient, C coefficient, and duration T are the expected coefficients and duration for the language tone.
- the A, B, and C coefficients and duration T are grouped together into a 4 dimensional vector.
- the language tones are preferably modeled using a second order polynomial because of the simplicity and low computational overhead of the second order polynomial. As can be seen from Figures 1A, 1B, 1C, 1D, and 1E, a second order polynomial is adequate to model accurately the pitch contour of each language tone.
- a model includes an expected duration because the expected coefficients alone do not provide sufficient information to determine accurately the language tone of a syllable.
- the coefficients with the duration are sufficient to determine accurately the language tone of a syllable.
- the present invention overcomes the limitations of the prior art by using a second order equation to model the pitch contour and thereby quickly and accurately determine the language tone of an utterance of Mandarin Chinese speech. While the present invention has been described with respect to five tones, those skilled in the art will appreciate that the number of training data memories 22, 24, 26, 28, 30, shown in Figure 3, and the number of tone model memories 58, 60, 62, 64, 66 may be increased or decreased according to the number of tones in the tonal language being evaluated.
- Each model also includes a co-variance matrix for the A coefficient, the B coefficient, the C coefficient and the duration T.
- the co-variance matrix is a 4x4 matri which describes the dependence of each of the A, B, and C coefficients and the duration T on each other.
- the model may include a data structure for storing the variance of the A, B, and C coefficients and duration T.
- the A, B, and C coefficients and duration T are independent.
- storing the variances is equivalent to using a diagonal co-variance matrix. That is, the diagonal entries of the co-variance matrix describe the variances and all other entries, the entries off the diagonal, are 0.
- the variance is a sub-set of the co-variance and is therefore, included within the co-variance. Those skilled in the art will recognize the variance and its relation to the co-variance.
- if a higher order polynomial is used, the co-variance matrix is enlarged to accommodate the additional coefficients. For example, if a fifth order polynomial were used, the co-variance matrix would be a 6x6 matrix.
- the present invention can also model the tone of utterances of tonal languages other than Mandarin Chinese.
- a higher order polynomial may be needed to model accurately the language tones of a tonal language having language tones that are more complex than the language tones of Mandarin Chinese.
- the comparator 86 receives the coefficients and duration, which describe the pitch of the input syllable, from the coefficient determinator 84 and receives the models of the language tones from the coefficient modeler 38.
- the comparator 86 compares the coefficients and duration to the models to determine the model that the coefficients and duration most closely resemble.
- the comparator 86 uses a multivariate normal density, also known as a Gaussian classifier, to compare the input to the models.
- the multivariate normal density is given by the equation:

  p(x) = exp( -(1/2) (x-μ)ᵀ Σ⁻¹ (x-μ) ) / ( (2π)^(d/2) |Σ|^(1/2) )

  where:
- p(x) is the multivariate normal density
- x is an input vector of the A, B, and C coefficients and duration T of the second order polynomial that the coefficient determinator 84 determined for the input signal
- d is the dimension of the input vector, in this case 4
- |Σ| is the determinant of the co-variance matrix Σ of the model
- Σ⁻¹ is the inverse of the co-variance matrix of the model
- μ is a vector of the arithmetic means of the A, B, and C coefficients and duration T of the model
- (x-μ)ᵀ is the transpose of the difference vector of x and μ.
- the comparator 86 determines the multivariate normal density using the mean vector, ⁇ , of each model.
- the comparator 86 temporarily stores the multivariate normal density determined for each model.
- the comparator 86 recognizes the model having the greatest multivariate normal density as the model of the language tone of the syllable.
- the comparator 86 generates a signal indicating this language tone and impresses the signal on line 96 to transmit it to the second input of the syllable recognition system 80.
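The comparison step described above can be sketched as a Gaussian classifier over the 4 dimensional (A, B, C, T) vector. The model statistics below are made up for illustration only; they are not the patent's trained values:

```python
import numpy as np

def multivariate_normal_density(x, mean, cov):
    """p(x) for a d-dimensional Gaussian with the given mean vector
    and co-variance matrix, as in the comparator's equation."""
    d = len(x)
    diff = x - mean
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def classify_tone(x, models):
    """Return the tone label whose model gives the greatest density."""
    return max(models, key=lambda tone: multivariate_normal_density(
        x, models[tone]["mean"], models[tone]["cov"]))

# Two toy tone models: vectors are (A, B, C, T)
models = {
    "tone1": {"mean": np.array([0.0, 0.0, 200.0, 0.3]), "cov": np.eye(4)},
    "tone2": {"mean": np.array([0.0, 150.0, 120.0, 0.3]), "cov": np.eye(4)},
}
tone = classify_tone(np.array([0.0, 145.0, 125.0, 0.3]), models)
```

Because the input vector lies much closer to the second model's mean, the classifier selects "tone2", mirroring how the comparator picks the model with the greatest density.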
- Referring now to Figure 3, a block diagram of a trainer 34 is shown.
- Figure 3 also shows training data memory 32, and the coefficient modeler 38.
- the trainer 34 uses data stored in the training data memory 32 to train the coefficient modeler 38.
- The training data memory 32 is a conventional random access memory or read only memory for storing data.
- the training data memory 32 comprises a tone 1 training data memory 22, a tone 2 training data memory 24, a tone 3 training data memory 26, a tone 4 training data memory 28, and a tone 5 training data memory 30.
- Each of the tone 1-5 training data memories 22, 24, 26, 28, 30 is associated with a language tone of Mandarin Chinese and stores utterances that have the associated tone.
- the tone 1 training data memory 22 may be associated with the high and level tone, the tone 2 training data memory 24 with the rising tone, the tone 3 training data memory 26 with the falling-rising tone, the tone 4 training data memory 28 with the falling tone, and the tone 5 training data memory 30 with the neutral tone.
- the tone 1 training data memory 22 stores at least 20 utterances of Mandarin Chinese speech having the language tone that is associated with the tone 1 training data memory 22.
- Each utterance is a digitized, single syllable.
- the tone 2 - 5 training data memories 24, 26, 28, 30 store at least 20 digitized, single syllables having the language tone with which each is associated. While 20 is the minimum number of utterances, preferably each of the tone 1 - 5 training memories 22, 24, 26, 28, 30 stores several hundred utterances.
- the tone 1 - 5 training data memories 22, 24, 26, 28, 30 are each coupled to line 68 and impress signals of the digitized utterances on line 68 to output the digitized utterances from the training data memory 32.
- the trainer 34 has a first input that is coupled to line 68 and a first output that is formed by line 74.
- the trainer 34 comprises a model generator 36, an A coefficient memory 50, a B coefficient memory 52, a C coefficient memory 54, and a T duration memory 56.
- the model generator 36 first estimates the fundamental frequency, the pitch, of a syllable using Fourier analysis. Alternately, the model generator 36 may use a low pass filter or other means to determine the pitch of a syllable.
- the model generator 36 adjusts the coefficients of a second order polynomial so that the polynomial best describes the pitch contour of an utterance.
- the model generator 36 has a first input, a second input, a first output, and a second output. The first input is coupled to line 68 to receive data from the tone 1-5 memories 22, 24, 26, 28, 30.
- the first output is coupled to a line 70 which is also coupled to an input of each of memories 50, 52, 54, 56.
- the second input is coupled to a line 72 which is also coupled to an output of each of memories 50, 52, 54, 56.
- the second output is coupled to line 74 which forms an output of the trainer 34.
- the model generator 36 uses a least squares curve fitting method for determining the coefficients of a polynomial that accurately describes a tone. For example, the model generator 36 receives a digitized utterance from the tone 1 training data memory 22.
- the model generator 36 uses fourier analysis, a low pass filter, or other techniques to determine the behavior of the pitch of the utterance.
- the model generator 36 uses the least squares curve fitting method to adjust the coefficients of a second order polynomial so that the polynomial best describes the pitch contour of the utterance.
- the A coefficient memory 50, the B coefficient memory 52, the C coefficient memory 54, and duration T memory 56 are all memories for storing data.
- Each of the memories 50, 52, 54, 56 has an input coupled to the first output of the model generator 36 by a line 70, and an output coupled to the second input of the model generator 36 by line 72.
- After the model generator 36 has determined the coefficients of the polynomial that best describes a tone and has determined the duration of the utterance, the model generator 36 generates a signal and impresses the signal on line 70 to store the value of the A coefficient in the A coefficient memory 50, the value of the B coefficient in the B coefficient memory 52, the value of the C coefficient in the C coefficient memory 54, and the duration of the utterance in the duration T memory 56.
- the model generator 36 preferably determines the coefficients of the polynomials for utterances in the training data memory 32, which have the same language tone, together. Once the model generator 36 has determined the coefficients and durations for all utterances that have the same language tone, the model generator 36 generates the model for that language tone.
- the training data need not be segregated according to language tone.
- Each utterance would have associated with it an indicator of its language tone.
- the trainer would have an A coefficient training memory, a B coefficient training memory, a C coefficient training memory, and a duration T training memory for each language tone of the language.
- the model generator would record the coefficients and duration in the set of memories indicated by the tone indicator.
- the coefficient modeler 38 comprises a tone 1 model memory 58, a tone 2 model memory 60, a tone 3 model memory 62, a tone 4 model memory 64, and a tone 5 model memory 66.
- the coefficient modeler 38 has an input formed by line 74, and each model memory 58, 60, 62, 64, 66 has an input that is coupled to line 74. Furthermore, each model memory 58, 60, 62, 64, 66 has an output coupled to line 94; line 94 forms an output of the coefficient modeler 38.
- Each model memory 58, 60, 62, 64, 66 stores a model of a language tone; that is, there is a model memory 58, 60, 62, 64, 66 for each language tone. If the present invention is being used to recognize a syllable of a tonal language other than Mandarin Chinese, there are model memories for each language tone of the tonal language.
- the model generator 36 determines the arithmetic mean and co-variance matrix of the coefficients and durations stored in the memories 50, 52, 54, 56.
- the average coefficients, average duration, and co-variance matrix of the coefficients and duration constitute a model of a language tone.
- the model generator 36 determines the model and generates a signal on line 74 to transfer the model to the coefficient modeler 38 where the model is stored in the appropriate model memory 58, 60, 62, 64, 66.
- the tone model memories 58, 60, 62, 64, 66 each have an output, which is coupled to line 94, for outputting models of the tones of Mandarin Chinese speech.
- the model generator 36 receives all the utterances stored in the tone 1 training data memory 22.
- the model generator 36 fits a second order polynomial to an utterance and generates signals on line 70 to store the A, B, and C coefficients and duration T in memories 50, 52, 54, 56.
- the model generator 36 repeats this process for each utterance stored in the tone 1 training data memory 22.
- the model generator 36 then receives the A, B, and C coefficients and duration T from the memories 50, 52, 54, 56 on line 72 at the second input.
- the model generator 36 determines the arithmetic means and co-variance matrix to generate a model of the first language tone and then generates a signal to transfer the model to the tone 1 model memory 58.
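Generating a model from the stored training data reduces to computing the arithmetic mean and co-variance of the (A, B, C, T) vectors. A minimal sketch using NumPy, with synthetic training vectors in place of real utterance data:

```python
import numpy as np

def build_tone_model(training_vectors):
    """Build a tone model from the (A, B, C, T) vectors of training
    utterances that share one language tone: the mean vector and the
    4x4 co-variance matrix of the coefficients and duration."""
    data = np.asarray(training_vectors)          # shape (n, 4)
    mean = data.mean(axis=0)                     # arithmetic means
    cov = np.cov(data, rowvar=False)             # 4x4 co-variance matrix
    return {"mean": mean, "cov": cov}

# Synthetic (A, B, C, T) training vectors for one tone, for illustration
rng = np.random.default_rng(0)
vectors = rng.normal([0.0, 150.0, 120.0, 0.3], 0.1, size=(200, 4))
model = build_tone_model(vectors)
```

The resulting mean vector and co-variance matrix are exactly the quantities the comparator's multivariate normal density needs for that tone.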
- Referring now to Figure 4, a block diagram of a second embodiment of the system 10 for analyzing the tone of a syllable of Mandarin Chinese speech, constructed in accordance with the present invention, is shown.
- the system 10 preferably comprises an input device 12, an output device 14, a processor 16, a memory means 18, and a speech recognition system 20.
- the input device 12, output device 14, processor 16, memory means 18, and speech recognition system 20 are coupled in a von Neumann architecture via a bus 22 such as in a personal computer.
- the processor 16 is preferably a microprocessor such as a Motorola 68040; the output device 14 is preferably a video monitor; and the input device 12 is preferably a keyboard and mouse type controller and a microphone for inputting audio signals.
- the input device 12 includes an A/D converter for digitizing analog signals from the microphone.
- the system 10 is a Macintosh Quadra 840AV computer system from Apple Computer, Inc. of Cupertino, California. Those skilled in the art will realize that the system 10 could also be implemented on an IBM personal computer or any other computer system.
- the memory means 18 is constructed with random access memory and read only memory. The memory means 18 stores data and program instruction steps for the system 10.
- the memory means 18 may be a conventional dynamic random access memory and a conventional disk drive.
- the memory means 18 includes a training data memory 32.
- the training data memory 32 is coupled through the bus 22 to transmit and receive signals from the speech recognition system 20.
- the speech recognition system 20 is coupled to the bus 22 to receive a digitized input signal from the input device 12.
- the speech recognition system 20 parses the input signal to obtain a syllable of Mandarin Chinese speech and generates an output of the recognized syllable.
- the speech recognition system 20 determines both the phonetic structure and language tone of the syllable.
- Referring now to Figure 5, a block diagram of the speech recognition system 20 is shown.
- the speech recognition system 20 comprises a segmenter 79, a syllable recognition system 80, a tone recognition system 82, and a trainer 34.
- the tone recognition system 82 comprises a coefficient modeler 38, a coefficient determinator 84, and a comparator 86.
- the segmenter 79, syllable recognition system 80, tone recognition system 82, and trainer 34 are memories storing sets of program instruction steps that, when executed by the processor 16, perform the functions described above with reference to Figure 2 and Figure 3, each device being a set of program instruction steps sharing a single processor 16.
- Each of the syllable recognition system 80, tone recognition 82, and trainer 34 are coupled to the bus 22 to transmit and receive signals to and from the input device 12, the output device 14, the processor 16, and memory means 18 and to transmit and receive signals from each other as shown in Figures 2 and 3 and described above.
- the trainer 34 is coupled to the bus 22 to receive data from the training data memory 32 in the memory means 18.
- the trainer 34 is also coupled to the bus 22 to transmit models of the language tones to the coefficient modeler 38.
- the segmenter 79 receives a digital input signal from the input device 12 and transmits a syllable to the syllable recognition system 80 and to the coefficient determinator 84.
- the syllable recognition system 80 receives a syllable from the segmenter 79 through the bus 22.
- the syllable recognition system 80 also receives a signal indicating the language tone of the syllable from the comparator 86 through the bus 22.
- the syllable recognition system 80 signals the recognition of the syllable to the output device 14, or to a look up table or other memory device, through the bus 22.
- the bus 22 couples the coefficient determinator 84 to the comparator 86 so that the coefficient determinator may transmit the coefficients and duration that describe the pitch contour of a syllable to the comparator 86.
- the comparator 86 receives models of the language tones from the coefficient modeler 38 over the bus 22.
- Referring now to Figures 6A and 6B, a flow chart is shown of a preferred method for training the coefficient modeler 38 for the language tones of a syllable of Mandarin Chinese speech.
- the method begins in step 700 where the trainer 34 receives an utterance or signal representing a syllable of Mandarin Chinese speech having a known tone.
- the method preferably processes the utterances having the same language tone together in one group.
- the model generator 36 processes the utterances in the tone 1 training data 22 together in one group.
- the coefficient determinator 84 uses Fourier analysis to estimate the pitch contour of the input signal.
- the coefficient determinator 84 may alternately use a low pass filter or other techniques to estimate the pitch contour of the input.
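The pitch-contour estimate can be produced in several ways. As one concrete, minimal sketch, the following uses autocorrelation, a common simple technique in the same family as the Fourier analysis and low-pass filtering named above; the frame, sample rate, and pitch search range are all illustrative assumptions, not values from the patent:

```python
import numpy as np

def estimate_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the pitch (Hz) of one voiced frame by autocorrelation."""
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags only.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Search only lags corresponding to the plausible pitch range.
    lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sample_rate / lag

# Synthetic 200 Hz voiced frame sampled at 8 kHz (illustrative values).
sr = 8000
t = np.arange(0, 0.04, 1.0 / sr)
frame = np.sin(2 * np.pi * 200.0 * t)
print(estimate_pitch(frame, sr))  # prints 200.0
```

Running the estimator frame by frame across the syllable yields the pitch contour that the curve-fitting steps below operate on.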
- the model generator 36 fits a second order polynomial to the estimated pitch contour. The model generator 36 uses a least square-error curve fitting method for fitting the second order polynomial to the tone.
- In step 706, the model generator 36 selects the A coefficient of the polynomial and stores it in the coefficient A training memory 50.
- In step 708, the model generator 36 selects the B coefficient and stores it in the B coefficient training memory 52.
- In step 710, the model generator 36 selects the C coefficient and stores it in the C coefficient training memory 54.
- In step 712, the model generator 36 determines the duration of the utterance and stores it in the duration T training memory 56.
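The curve fit and coefficient extraction in the steps above can be sketched with NumPy's least-square-error polynomial fit; the pitch contour values here are illustrative assumptions:

```python
import numpy as np

# Illustrative estimated pitch contour: pitch (Hz) at uniform frame
# times across one syllable (all numeric values are assumptions).
t = np.linspace(0.0, 0.3, 30)               # 30 frames over a 0.3 s syllable
pitch = 220.0 + 80.0 * t - 200.0 * t ** 2   # synthetic contour

# Least-square-error fit of a second order polynomial
# p(t) = A*t**2 + B*t + C to the estimated contour.
A, B, C = np.polyfit(t, pitch, deg=2)

# The duration T completes the (A, B, C, T) description of the syllable.
duration_T = t[-1] - t[0]
print(A, B, C, duration_T)  # A ≈ -200, B ≈ 80, C ≈ 220, T = 0.3
```

The four values (A, B, C, T) are exactly what the training memories 50, 52, 54, and 56 accumulate, one set per training utterance.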
- In step 714, the model generator 36 determines if there is another utterance to be processed that has the current language tone in the training memory 32. If there is another utterance to be processed, the method returns to step 700 where the model generator 36 receives the next utterance from the training memory 32. If in step 714 there are no more utterances having the current language tone to be processed, the model generator 36 uses the data stored in the coefficient A training memory 50, coefficient B training memory 52, coefficient C training memory 54, and duration T training memory 56 to generate a model of the language tone.
- In step 716, the model generator 36 receives the A coefficients stored in the A coefficient training memory 50 and determines the arithmetic mean of the A coefficients.
- In step 718, the model generator 36 receives the B coefficients stored in the B coefficient training memory 52 and determines the arithmetic mean of the B coefficients.
- In step 720, the model generator 36 receives the C coefficients from the C coefficient training memory 54 and determines the arithmetic mean of the C coefficients.
- In step 722, the model generator 36 receives the durations from the duration T training memory 56 and determines the arithmetic mean of the durations.
- In step 724, the model generator 36 uses the arithmetic means and the data in the A coefficient training memory 50, B coefficient training memory 52, C coefficient training memory 54, and duration T training memory 56 to determine a co-variance matrix for random variables A, B, C, and T.
- the model generator 36 gathers the arithmetic means together into a 4 dimensional vector and joins the vector to the co-variance matrix to form a model of the selected language tone.
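The mean vector and co-variance matrix described above can be sketched as follows; the four hypothetical training utterances, one (A, B, C, T) row each, are invented for illustration:

```python
import numpy as np

# Hypothetical training data for one language tone: each row holds the
# (A, B, C, T) values extracted from one training utterance.
samples = np.array([
    [-190.0, 78.0, 218.0, 0.29],
    [-205.0, 83.0, 224.0, 0.31],
    [-198.0, 80.0, 221.0, 0.30],
    [-210.0, 85.0, 226.0, 0.32],
])

# 4-dimensional vector of arithmetic means of A, B, C, and T.
mean_vector = samples.mean(axis=0)

# 4x4 co-variance matrix for the random variables A, B, C, and T.
cov_matrix = np.cov(samples, rowvar=False)

# The pair forms the model of the selected language tone.
tone_model = (mean_vector, cov_matrix)
```

The variance-only alternative mentioned next corresponds to keeping just the diagonal of `cov_matrix` and discarding the cross terms.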
- the model generator 36 may alternately determine the variance of each of the A coefficients, B coefficients, C coefficients, and duration T in steps 716, 718, 720, and 722, respectively, and not determine the co-variance matrix in step 724.
- the variance is a sub-set of the co-variance, and determining the variance alone yields a less accurate model than determining the co-variance matrix. Determining the variance alone, however, saves significant computation time. It is nevertheless preferred to determine the co-variance matrix.
- the model generator 36 signals, in step 726, the coefficient modeler 38 to receive a model and then transfers the model to the coefficient modeler 38 for storage.
- the model corresponds to the training data. For example, if the model generator 36 has just completed processing the utterances of the tone 1 training data 22, then the model generator 36 signals the coefficient modeler 38 to receive the tone 1 model 58. In step 728, the model generator 36 determines if there is more tone training data to be processed.
- If there is more tone training data to be processed, the model generator 36 resets, in step 730, the coefficient A training memory 50, coefficient B training memory 52, coefficient C training memory 54, and duration T training memory 56. The model generator 36 then selects, in step 732, the next tone training data to process and the method returns to step 700. If, in step 728, there are no more sets of tone training data to be processed, the method ends.
- Referring now to Figure 7, a flow chart of a preferred method for identifying the tone of a syllable of Mandarin Chinese speech is shown. The method begins in step 800 where the tone recognition system 82 receives an input of a digitized syllable of Mandarin Chinese speech.
- the coefficient determinator 84 estimates the pitch of the syllable, in step 802.
- the coefficient determinator 84 fits a second order polynomial to the estimated pitch.
- the coefficient determinator 84 preferably uses a least squares curve fitting method to fit a second order polynomial to the pitch.
- the coefficient determinator 84 alternatively may use many other methods to fit a polynomial to the tone. If the coefficient determinator 84 is being used to recognize the tones of a tonal language other than Mandarin Chinese, a higher order polynomial may be used to model the pitch accurately.
- the coefficient determinator 84 then transmits the polynomial to the comparator 86.
- the comparator 86 determines and temporarily stores the multivariate normal density for the tone 1 model 58 using a Gaussian classifier as has been described above with respect to Figure 3.
- the comparator 86 determines the multivariate normal density for the tone 2 model 60 and stores the result.
- the comparator 86 determines the multivariate normal density for the tone 3 model 62 and stores the result.
- the comparator 86 determines the multivariate normal density for the tone 4 model 64 and stores the result.
- In step 814, the comparator 86 determines the multivariate normal density for the tone 5 model 66 and stores the result.
- the comparator 86 selects the tone model that received the highest multivariate normal density as the model that best describes the input.
- the comparator 86 selects the language tone of the selected tone model as the language tone of the syllable.
- the comparator 86 generates a signal to the syllable recognition system 80 that indicates the language tone of the syllable.
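The comparison carried out in steps 806 through 814 can be sketched with an explicit multivariate normal density and an argmax over tone models. The two models below are invented for illustration (the system described holds five, one per Mandarin tone):

```python
import numpy as np

def mvn_density(x, mean, cov):
    """Multivariate normal density of observation x under a (mean, cov) model."""
    d = len(mean)
    diff = x - mean
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

# Hypothetical tone models: tone number -> (mean vector, co-variance matrix).
tone_models = {
    1: (np.array([0.0, 5.0, 220.0, 0.30]), np.diag([25.0, 4.0, 100.0, 0.001])),
    2: (np.array([50.0, 60.0, 180.0, 0.32]), np.diag([25.0, 4.0, 100.0, 0.001])),
}

# (A, B, C, T) description of the input syllable's pitch contour.
observation = np.array([48.0, 58.0, 182.0, 0.31])

# Density of the observation under each model; the tone whose model
# scores the highest density is selected as the language tone.
densities = {tone: mvn_density(observation, m, c)
             for tone, (m, c) in tone_models.items()}
best_tone = max(densities, key=densities.get)
print(best_tone)  # prints 2
```

Using diagonal co-variance matrices here mirrors the variance-only simplification mentioned earlier; a full co-variance matrix plugs into the same formula unchanged.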
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB9706562A GB2308002B (en) | 1994-09-29 | 1995-09-29 | A system and method for determining the tone of a syllable of mandarin chinese speech |
AU37341/95A AU3734195A (en) | 1994-09-29 | 1995-09-29 | A system and method for determining the tone of a syllable of mandarin chinese speech |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US31522294A | 1994-09-29 | 1994-09-29 | |
US08/315,222 | 1994-09-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1996010248A1 true WO1996010248A1 (en) | 1996-04-04 |
Family
ID=23223432
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1995/012595 WO1996010248A1 (en) | 1994-09-29 | 1995-09-29 | A system and method for determining the tone of a syllable of mandarin chinese speech |
Country Status (3)
Country | Link |
---|---|
AU (1) | AU3734195A (en) |
GB (1) | GB2308002B (en) |
WO (1) | WO1996010248A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
SG97998A1 (en) * | 1999-12-10 | 2003-08-20 | Matsushita Electric Ind Co Ltd | Method and apparatus for mandarin chinese speech recognition by using initial/final phoneme similarity vector |
US9620092B2 (en) | 2012-12-21 | 2017-04-11 | The Hong Kong University Of Science And Technology | Composition using correlation between melody and lyrics |
CN111916066A (en) * | 2020-08-13 | 2020-11-10 | 山东大学 | Random forest based voice tone recognition method and system |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0272723A1 (en) * | 1986-11-26 | 1988-06-29 | Philips Patentverwaltung GmbH | Method and arrangement for determining the temporal course of a speech parameter |
- 1995
- 1995-09-29: AU application AU37341/95A, published as AU3734195A, not active (Abandoned)
- 1995-09-29: GB application GB9706562A, published as GB2308002B, not active (Expired - Lifetime)
- 1995-09-29: WO application PCT/US1995/012595, published as WO1996010248A1, not active (Application Discontinuation)
Non-Patent Citations (5)
Title |
---|
CHEN ET AL.: "Vector quantization of pitch information in Mandarin speech", IEEE TRANSACTIONS ON COMMUNICATIONS, vol. 38, no. 9, US, pages 1317 - 1320, XP000173207, DOI: doi:10.1109/26.61370 * |
DATABASE INSPEC INSTITUTE OF ELECTRICAL ENGINEERS, STEVENAGE, GB; LONG ET AL.: "Two-syllabic Chinese tone recognition by a polynomial approximation of pitch contour" * |
GUAN ET AL.: "Speaker-independent tone recognition for Chinese speech", ACTA ACUSTICA, vol. 18, no. 5, CHINA, pages 379 - 385 * |
TRANSACTIONS OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS D-II, JAN. 1990, JAPAN, vol. J73D-II, no. 1, pages 122 - 124 * |
WU ET AL.: "A tone recognition of polysyllabic Chinese words using an approximation model of four tone pitch patterns", IECON '91 INTERNATIONAL CONFERENCE ON INDUSTRIAL ELECTRLNICS, CONTROL AND INSTRUMENTATION, 28 October 1991 (1991-10-28) - 1 November 1991 (1991-11-01), KOBE, JP, pages 2115 - 2119 vol.3 * |
Also Published As
Publication number | Publication date |
---|---|
GB2308002B (en) | 1998-08-19 |
AU3734195A (en) | 1996-04-19 |
GB9706562D0 (en) | 1997-05-21 |
GB2308002A (en) | 1997-06-11 |
Legal Events
- AK Designated states: kind code of ref document: A1; designated state(s): AM AT AU BB BG BR BY CA CH CN CZ DE DK EE ES FI GB GE HU IS JP KE KG KP KR KZ LK LR LT LU LV MD MG MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TT UA UG UZ VN
- AL Designated countries for regional patents: kind code of ref document: A1; designated state(s): KE MW SD SZ UG AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG
- DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
- 121 Ep: the epo has been informed by wipo that ep was designated in this application
- WWE Wipo information: entry into national phase; ref document number: 1019970702045; country of ref document: KR
- WWE Wipo information: entry into national phase; ref document number: 9706562.7; country of ref document: GB
- REG Reference to national code; ref country code: DE; ref legal event code: 8642
- WWP Wipo information: published in national office; ref document number: 1019970702045; country of ref document: KR
- NENP Non-entry into the national phase; ref country code: CA
- 122 Ep: pct application non-entry in european phase
- WWW Wipo information: withdrawn in national office; ref document number: 1019970702045; country of ref document: KR