WO2007129802A1 - Method for selecting training data based on non-uniform sampling for speech recognition vector quantization - Google Patents

Method for selecting training data based on non-uniform sampling for speech recognition vector quantization

Info

Publication number
WO2007129802A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
training data
speech
appearance
feature vector
Prior art date
Application number
PCT/KR2006/005830
Other languages
French (fr)
Inventor
Chang-Sun Ryu
Jae-In Kim
Hong Kook Kim
Yoo Rhee Oh
Gil Ho Lee
Original Assignee
Kt Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kt Corporation filed Critical Kt Corporation
Publication of WO2007129802A1 publication Critical patent/WO2007129802A1/en


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems
    • G10L15/30Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Provided is a method for selecting training data based on non-uniform sampling for speech recognition vector quantization. The method includes the steps of: if sample speech data are received, making the speech data subjected to forced alignment to acquire pronunciation information of each phoneme; creating an appearance list by phoneme based on the acquired pronunciation information; statistically calculating an appearance frequency rate of phoneme for a corresponding language depending on the created appearance list by phoneme; and deducing training data having a minimum error between a total frequency number for each phoneme and an appearance frequency rate by phoneme by referring to the calculated statistics.

Description

DESCRIPTION
METHOD FOR SELECTING TRAINING DATA BASED ON NON-UNIFORM SAMPLING FOR SPEECH RECOGNITION VECTOR QUANTIZATION
TECHNICAL FIELD
The present invention relates to a method for selecting training data for speech recognition vector quantization; and, more particularly, to a method for selecting training data based on non-uniform sampling for speech recognition vector quantization, which selects training data based on non-uniform sampling by considering speech characteristics of language, for example, an appearance frequency by phoneme, in case of selecting training data to be used in training a speech feature vector quantizer.
BACKGROUND ART
Distributed speech recognition (DSR) is a technology which enables a low spec terminal such as a portable telephone to have a speech recognition function. DSR is constituted by dual processing systems in which the low spec terminal extracts the features of a speech signal and a high spec server performs speech recognition based on the features of the speech signal provided from the low spec terminal.
In general, the Mel-Frequency Cepstral Coefficient (MFCC) has been universally used in the speech recognition field. The MFCC represents the form of the frequency spectrum expressed in Mel scale as sine wave components, and refers to a speech feature vector (or speech feature parameter) representing speech received from a user. The MFCC has to be subjected to a quantization process in order to transmit it from the terminal to a server through a communication network.
A speech recognition process will be briefly described below with reference to the DSR system as mentioned above. At the terminal end, a speech feature vector is first extracted through MFCC from speech received from a user and then quantized into a form suitable for transmission over a communication network. In other words, the MFCCs extracted from the user's speech are mapped to the nearest central vectors of a limited-size codebook, and the selected vectors are sent as a bit stream to a server. Therefore, a codebook holding a central value for each group of similar values is required for recognition of speech originated from the user, and is used as a reference in speech feature vector quantization. Meanwhile, the server dequantizes the speech feature vector transmitted from the terminal and recognizes the word corresponding to the speech by using an HMM (Hidden Markov Model) as a speech model. Here, the HMM models a basic unit for speech recognition, for example, a phoneme, and words and sentences are created by combining the phonemes applied to a speech recognition engine with the phonemes registered in the DB of the speech recognition engine.
The above-mentioned speech feature vector quantization process is an important factor in determining the performance of the speech recognition system. The design of the speech feature vector quantizer may be defined by the creation of a codebook. In particular, this codebook should be made by extracting training data (training MFCC vectors) from numerous speech data and then selecting the most representative values out of the extracted data as the training data for speech feature vector quantization.
To be more specific, the codebook stores the central vectors of the speech data, addressed by indexes, and a great amount of training data is needed to enhance the performance of speech recognition. As set forth above, the result of the speech feature vector quantization is a data sequence of codebook indexes, and the quantization assigns each vector the nearest codebook index by distance calculation between vectors.
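To make this nearest-index mapping concrete, the following is a minimal sketch (not taken from the patent; the function names, codebook size and vector dimension are illustrative assumptions) of quantizing a feature vector against a codebook by Euclidean distance.

```python
import numpy as np

def quantize(vector, codebook):
    """Return the index of the nearest codebook central vector (Euclidean distance)."""
    distances = np.linalg.norm(codebook - vector, axis=1)  # distance to every centroid
    return int(np.argmin(distances))

def dequantize(index, codebook):
    """Reconstruct the feature vector as the central vector stored at the index."""
    return codebook[index]

# Illustrative use: a 14-dimensional MFCC vector and a 256-entry codebook.
codebook = np.random.randn(256, 14)
mfcc = np.random.randn(14)
idx = quantize(mfcc, codebook)        # index carried in the bit stream to the server
restored = dequantize(idx, codebook)  # server-side approximation of the original vector
```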
However, most speech recognition technologies to date propose only a processing algorithm for the speech feature vector quantizer, and do not propose an algorithm for selecting the training data actually required for training the speech feature vector quantizer.
For example, in the case of the speech feature vector quantizer, the prior art relating to the selection of training data currently uses only arbitrary, unplanned sample training data selected by an administrator in advance.
Further, as described above, the recognition performance of HMM is excellent, and thus most speech recognition systems adopt HMM. However, with HMM, if the reliability of the training data is low, the server has difficulty estimating a precise acoustic model and also needs an enormous amount of prior knowledge about phoneme context. Conversely, if the reliability of the training data used in quantizing the speech feature vector at the terminal end is high, the server does not require a separate speech recognition process but performs only a dequantization of the speech feature vector through the existing technology.
Therefore, there is an urgent need for a technology capable of quantizing the speech feature vector with a smaller number of bits. It will easily be understood by those skilled in the art that, to do so, the development of an algorithm for selecting the data used as training data in the speech feature vector quantization should come first.
DISCLOSURE
TECHNICAL PROBLEM
An embodiment of the present invention is directed to provide a method for selecting training data based on non-uniform sampling for speech recognition vector quantization, which selects training data based on non-uniform sampling by considering speech characteristics of language, for example, an appearance frequency by phoneme, in case of selecting training data to be used in training a speech feature vector quantizer.
Other objects and advantages of the present invention can be understood by the following description, and become apparent with reference to the embodiments of the present invention. Also, it is obvious to those skilled in the art of the present invention that the objects and advantages of the present invention can be realized by the means as claimed and combinations thereof.
TECHNICAL SOLUTION
In accordance with an aspect of the present invention, there is provided a method for selecting training data based on non-uniform sampling for speech recognition vector quantization, the method including the steps of: if sample speech data are received, making the speech data subjected to forced alignment to acquire pronunciation information of each phoneme; creating an appearance list by phoneme based on the acquired pronunciation information; statistically calculating an appearance frequency rate of phoneme for a corresponding language depending on the created appearance list by phoneme; and deducing training data having a minimum error between a total frequency number for each phoneme and an appearance frequency rate by phoneme by referring to the calculated statistics.
ADVANTAGEOUS EFFECTS
As mentioned above and described below, the present invention can further improve the speech recognition performance for the same number of bits used in the speech feature vector quantization, compared with the existing method using unplanned sample training data.
In addition, the present invention exhibits speech recognition performance equivalent to that of an existing system which does not perform the speech feature vector quantization process, and thus prevents the degradation of speech recognition performance that the quantization process may otherwise cause.
Further, the present invention can enhance the reliability of the training data in speech recognition and thus can guarantee superior speech recognition accuracy by estimating a precise acoustic model through HMM, even with an existing speech recognition device which is not provided with a separate processor or equipment.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a block diagram illustrating a structure of one example of a DSR system to which the present invention is applied.
Fig. 2 is a flowchart describing a training data selection method based on non-uniform sampling in a speech feature vector quantization in accordance with a preferred embodiment of the present invention.
Fig. 3 is an explanatory diagram of one example of a speech recognition process used in the present invention.
Fig. 4 is a flowchart describing one example of a training process of creating an acoustic model in Fig. 3.
Fig. 5 is a graph representing a speech recognition performance of a speech feature vector quantizer in which the present invention is applied to Korean.
Fig. 6 is a graph representing a speech recognition performance of a speech feature vector quantizer in which the present invention is applied to English.
BEST MODE FOR THE INVENTION
The advantages, features and aspects of the invention will become apparent from the following description of the embodiments with reference to the accompanying drawings, which is set forth hereinafter, and thus the invention will easily be carried out by those skilled in the art to which the invention pertains. Further, in the following description, well-known arts will not be described in detail if it seems that they could obscure the invention unnecessarily. Hereinafter, a preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 is a block diagram illustrating a structure of one example of a DSR system to which the present invention is applied.
As shown in Fig. 1, the DSR system to which the present invention is applied includes a low spec terminal (client) such as a portable telephone and a high spec server.
The terminal includes an A/D converter 11, a speech feature vector extractor 12, a speech feature vector quantizer 13 and so on. The server includes a speech feature vector dequantizer 21, a speech recognizer 22 and so on. Other basic components are omitted here for simplicity. In addition, although the DSR system is described below for illustration, it will easily be understood by those skilled in the art that the present invention can also be applied to a single speech recognition system, for example, another speech recognition device comprised of an A/D converter, a speech feature vector extractor, a speech feature vector quantizer/dequantizer, and a speech recognizer.
The present invention is characterized by selecting training data based on non-uniform sampling by considering speech characteristics of languages (e.g., Korean, English, etc.), for example, an appearance frequency by phoneme, in case of selecting training data to be used in training (or designing) the speech feature vector quantizer 13.
That is, the present invention employs statistical characteristics of language in selecting training data to quantize a speech feature vector represented by MFCC. For example, speech characteristics are represented differently depending on languages and training data are selected based on the speech characteristics as mentioned above. On the basis of the training data, a codebook is created and then applied to the speech feature vector quantizer 13 so as to enhance the accuracy of speech recognition.
The components of the DSR system will be roughly described below, prior to explaining the training data selection algorithm of the present invention. The A/D converter 11 converts a speech signal of analog form received from a user into speech data of digital form.
The speech feature vector extractor 12 extracts a speech feature vector representing speech characteristics from the speech data converted by the A/D converter 11. The speech feature vector quantizer 13 quantizes the speech feature vector extracted by the speech feature vector extractor 12 into a bit stream suitable for the communication network, based on a codebook previously created from the training data. In addition, the quantization process applied in the speech feature vector quantizer 13 uses the split vector quantization scheme proposed by the standard "ETSI ES 201 108" of ETSI (European Telecommunications Standards Institute).
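As a rough illustration of split vector quantization of the kind referenced above, the sketch below splits a feature vector into sub-vectors and quantizes each sub-vector against its own codebook. It is an assumption-laden simplification, not the normative ETSI ES 201 108 procedure; the split sizes and codebook sizes are placeholders.

```python
import numpy as np

def split_vq(feature, codebooks, split_sizes):
    """Quantize a feature vector by splitting it into sub-vectors,
    each encoded with its own codebook (split vector quantization)."""
    indices, start = [], 0
    for codebook, size in zip(codebooks, split_sizes):
        sub = feature[start:start + size]
        dists = np.linalg.norm(codebook - sub, axis=1)
        indices.append(int(np.argmin(dists)))
        start += size
    return indices  # one codebook index per sub-vector, packed into the bit stream

# Illustrative configuration: a 14-dimensional vector split into 7 pairs,
# each pair quantized with its own 64-entry codebook.
split_sizes = [2] * 7
codebooks = [np.random.randn(64, 2) for _ in split_sizes]
feature = np.random.randn(14)
print(split_vq(feature, codebooks, split_sizes))
```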
The speech feature vector dequantizer 21 dequantizes the bit stream data transmitted from the terminal end.
The speech recognizer 22 recognizes the data dequantized by the speech feature vector dequantizer 21, that is, the speech feature vector, by using HMM, and outputs speech recognition results such as a word or a sentence. In addition, the server can also reconstruct speech from the speech feature vector and reproduce it.
In this DSR system, the speech feature vector is extracted through MFCC and represented as 14th order coefficients, which are subjected to a quantization process. It should be noted that the MFCC vector is not restricted to one particular order of coefficients, but is represented as 14th or 15th order coefficients because the speech recognition performance is best when it is expressed with such an order of coefficients. Hereinafter, the training data selection method based on non-uniform sampling proposed by the present invention will be described in detail with reference to Fig. 2. Further, the speech recognition process and the acoustic model generation process used in the present invention will be described with reference to Figs. 3 and 4, and the speech recognition performance when the present invention is applied to Korean and English will be described with reference to Figs. 5 and 6.
Fig. 2 is a flowchart illustrating the training data selection method based on non-uniform sampling for speech feature vector quantization in accordance with a preferred embodiment of the present invention.
First of all, when sample speech data used in training the speech feature vector quantizer is received, the speech data are subjected to forced alignment to acquire pronunciation information of each phoneme at step S201. That is, an utterance sentence, a phonetic transcription, a phoneme type, and a start time and an end time by phoneme are obtained by making the speech data subjected to the forced alignment. The following Table 1 represents the result which is obtained by having the sample speech data subjected to the forced alignment.
Table 1
(The content of Table 1 is provided as an image in the original document.)
In Table 1, it can be seen that the sample speech data represent speech which pronounces a Korean sentence, glossed as "and people run out continuously", together with its phonemic transcription, and show a start time and an end time by phoneme (the Korean text itself is not legible in this copy).
Thereafter, an appearance list by phoneme is created on the basis of the pronunciation information acquired by the forced alignment process at step S202. The following Table 2 represents an appearance list by phoneme obtained by executing the forced alignment for speech data stored in a Korean DB used in verifying the present invention. The last two rows represent the frequency by phoneme and the total phoneme number of the training data.
Table 2
(The content of Table 2 is provided as an image in the original document.)
Next, an appearance frequency rate by phoneme for the corresponding language is statistically calculated based on the created appearance list by phoneme at step S203. Here, in statistically calculating the appearance frequency rate by phoneme, the rate at which a certain pronunciation appears among pronunciation sequences is analyzed on the basis of a pronunciation dictionary. For example, in order to analyze the phoneme appearance frequency of Korean, a phoneme appearance frequency for each consonant and vowel is analyzed based on pronunciation dictionary data such as "a study on an appearance frequency of voice appearing in a usual conversation of adults" and "a quantitative linguistic study on function load of Korean phonemes and a phoneme chain". On the other hand, in order to analyze the phoneme appearance frequency of English, a phoneme appearance frequency analysis for each consonant and vowel is performed based on pronunciation dictionary data such as the CMU dictionary.
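To make steps S202 and S203 concrete, the following sketch builds the appearance list by phoneme and its frequency rates from forced-alignment output. It is illustrative only; the record fields and function names are assumptions, since the patent describes these steps in prose.

```python
from collections import defaultdict

def build_appearance_list(alignment_records):
    """alignment_records: (utterance_id, phoneme, start_time, end_time) tuples
    obtained from forced alignment (step S201).
    Returns the appearance list by phoneme (step S202) and the total phoneme count."""
    occurrences = defaultdict(list)
    for utt_id, phoneme, start, end in alignment_records:
        occurrences[phoneme].append((utt_id, start, end))
    total = sum(len(v) for v in occurrences.values())
    return occurrences, total

def frequency_rates(occurrences, total):
    """Appearance frequency rate by phoneme (step S203), in percent."""
    return {p: 100.0 * len(v) / total for p, v in occurrences.items()}

# Illustrative use with made-up alignment output:
records = [("utt1", "a", 0.00, 0.08), ("utt1", "n", 0.08, 0.15), ("utt2", "a", 0.02, 0.09)]
occ, total = build_appearance_list(records)
print(frequency_rates(occ, total))   # -> {'a': 66.66..., 'n': 33.33...}
```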
The following Table 3 represents the appearance frequency rate of Korean phonemes, and the following Table 4 represents the appearance frequency rate of English phonemes.
Table 3
(Table 3 lists each Korean consonant and vowel phoneme with its appearance frequency rate in percent; the individual rates range from 0.01% to 11.22%. The Korean phoneme symbols are not legible in this copy.)
Table 4
(The content of Table 4 is provided as an image in the original document.)
Using Table 2 together with Table 3 (or Table 4 for English), training data having a minimum error between the total frequency for each phoneme (Table 2) and the appearance frequency rate by phoneme (Table 3) are derived at step S204. An example of this training data deriving process (step S204) will be described later.
Through these steps S201 to S204, the selected training data are determined as the training data of the speech feature vector quantizer 13. The training data thus determined are used as parameters for the creation of a codebook. Here, any algorithm such as the K-means algorithm can be used for creating the codebook.
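As an illustration of codebook creation from the selected training vectors, the following is a minimal K-means sketch; K-means is one admissible choice per the paragraph above, and the codebook size, iteration count and data sizes are illustrative assumptions.

```python
import numpy as np

def train_codebook(training_vectors, codebook_size=256, iterations=20, seed=0):
    """Create a codebook of central vectors from training MFCC vectors with K-means."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids from randomly chosen training vectors.
    codebook = training_vectors[rng.choice(len(training_vectors), codebook_size, replace=False)]
    for _ in range(iterations):
        # Assign every training vector to its nearest centroid.
        dists = np.linalg.norm(training_vectors[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the vectors assigned to it.
        for k in range(codebook_size):
            members = training_vectors[labels == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook

# Illustrative use: 2,000 14-dimensional training vectors.
vectors = np.random.randn(2000, 14)
codebook = train_codebook(vectors)   # applied to the speech feature vector quantizer 13
```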
To help understand the training data selection algorithm of the present invention, suppose, for example, that the frequency rate of a phoneme "/a/" in the English pronunciation dictionary is 2%, that the phoneme "/a/" occurs 40 times in the sample training data (sample speech data), and that the total number of phonemes in the sample training data is 1,000. The frequency rate of the phoneme "/a/" in the sample training data is then 4%. After the forced alignment, only the "1st, 3rd, 5th, ..., 35th, 37th and 39th data" in the created sample training data list for the phoneme "/a/" are selected and used (derived) as the training data of the speech feature vector quantizer.
In other words, in order to minimize the difference between the appearance frequency rate of a specific phoneme in the pronunciation dictionary and the appearance frequency rate of that phoneme with respect to the total phoneme number in the sample training data (here, 4% versus 2%, i.e. a factor of 1/2), half of the sample training data for the phoneme (the "1st, 3rd, 5th, ..., 35th, 37th and 39th data" out of the total 40) are selected as the training data for training the speech feature vector quantizer.
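A minimal sketch of this non-uniform selection step (step S204) follows. It is one plausible reading of the worked example above rather than the patent's literal procedure, and all names are hypothetical: for each phoneme, only the fraction of occurrences needed to bring its rate in the selected data close to its dictionary rate is kept, spread evenly over the occurrence list.

```python
def select_training_data(occurrences, total, dictionary_rates):
    """Non-uniform selection of training data by phoneme (cf. step S204).

    occurrences: {phoneme: [occurrence, ...]} from the appearance list (step S202)
    total: total number of phoneme occurrences in the sample data
    dictionary_rates: {phoneme: rate in percent} from the pronunciation dictionary
    """
    selected = {}
    for phoneme, occ in occurrences.items():
        sample_rate = 100.0 * len(occ) / total               # e.g. 40 / 1,000 -> 4%
        target_rate = dictionary_rates.get(phoneme, sample_rate)
        keep_fraction = min(1.0, target_rate / sample_rate)  # e.g. 2% / 4% -> 1/2
        keep_count = max(1, round(keep_fraction * len(occ)))
        step = max(1, len(occ) // keep_count)                # every 2nd item: 1st, 3rd, 5th, ...
        selected[phoneme] = occ[::step][:keep_count]
    return selected

# Worked example from the text: /a/ appears 40 times out of 1,000 phonemes (4%),
# while its dictionary rate is 2%, so 20 of the 40 occurrences are kept.
occ = {"a": [f"occurrence {i}" for i in range(1, 41)]}
print(len(select_training_data(occ, 1000, {"a": 2.0})["a"]))   # -> 20
```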
Fig. 3 is an explanatory diagram of one example of the speech recognition process used in the present invention, and Fig. 4 is a flowchart describing one example of the training process of creating the acoustic model in Fig. 3.
Fig. 3 represents the speech recognition process using HMM, which extracts a speech feature 301 from the user's speech and searches an acoustic model 303, a language model 304 and a pronunciation dictionary 305 for the extracted speech feature, to recognize a word and a sentence corresponding to the speech through a pattern matching 302.
The speech feature extraction 301 employs a scheme proposed by the standard "ETSI ES 201 108" of ETSI. In other words, the speech feature is extracted from the speech data through MFCC and the speech feature vector is formed as 14th order coefficients. For this speech feature vector, a word sequence with a maximum probability value is searched by the pattern matching that uses the acoustic model 303, the language model 304 and the pronunciation dictionary 305.
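Stated as the standard decoding rule (not written out in the patent, but implied by the maximum-probability search described above), the recognized word sequence is

\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W),

where X is the sequence of speech feature vectors, P(X | W) is supplied by the acoustic model 303 through the pronunciation dictionary 305, and P(W) is supplied by the language model 304.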
The acoustic model 303 uses HMM, and particularly, the present invention uses the phoneme model according to language as the acoustic model. A training process for creation of this phoneme model will be described below with reference to Fig. 4.
First, a monophone-based model, which is a phoneme-independent model, is created by using the speech feature vector extracted from the training data selected according to the present invention with reference to Fig. 2 at step S401.
Subsequently, it is subjected to forced alignment on the basis of the created monophone-based model, to newly create a phoneme label file at step S402.
In the meantime, a triphone-based model which is a phoneme-dependent model is created by expanding the created monophone-based model at step S403.
Next, a state-tying is made by considering the fact that the amount of training data for the created triphone-based model is small at step S404.
Then, the model resulting from the state-tying is subjected to a mixture density increase, to finally create an acoustic model at step S405.
Meanwhile, the language model 304 shown in Fig. 3 uses a statistical-based scheme. Here, the statistical-based scheme refers to a scheme which statistically estimates a probability value of a possible word sequence from a database (DB) of speech originated under a given circumstance. A representative statistical-based language model is the n-gram. The n-gram method approximates the probability of a word sequence as a product of conditional probabilities, each conditioned on the n-1 preceding words; a bigram is used in Fig. 3.
The pronunciation dictionary 305 uses the pronunciation dictionary provided by "CleanSent01" of SiTEC (Speech Information Technology & Industry Promotion Center) in the case of Korean, and the "CMU dictionary V.0.6" provided by Carnegie Mellon University in the case of English. On the other hand, the speech DB in Fig. 3 uses the "read sentence speech DB (CleanSent01)" provided by SiTEC in the case of Korean, and the "AURORA 4 DB (Wall Street Journal)" established by ETSI in the case of English.
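As a sketch of the statistical language model described above, the following estimates bigram probabilities from a toy corpus and scores a word sequence as a product of those probabilities. It is illustrative only; a real system would use a much larger corpus and add smoothing for unseen word pairs.

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram model: P(w2 | w1) = count(w1, w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

def sentence_probability(model, sentence):
    """Probability of a word sequence as a product of bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for pair in zip(words[:-1], words[1:]):
        prob *= model.get(pair, 0.0)   # unseen bigrams get probability 0 without smoothing
    return prob

model = train_bigram(["people run out", "people run fast"])
print(sentence_probability(model, "people run out"))   # 1.0 * 1.0 * 0.5 * 1.0 = 0.5
```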
Fig. 5 is a graph representing a speech recognition performance of a speech feature vector quantizer in which the present invention is applied to Korean, and Fig. 6 is a graph representing a speech recognition performance of a speech feature vector quantizer in which the present invention is applied to English.
In Fig. 5, the left graph represents an average word error rate for Korean in case of training a speech feature vector quantizer using unplanned sample training data, and the right graph represents an average word error rate for Korean in case of training a speech feature vector quantizer using training data selected according to an appearance frequency by phoneme presented by the present invention.
In Fig. 6, the left graph represents an average word error rate for English in case of training a speech feature vector quantizer using unplanned sample training data, and the right graph represents an average word error rate for English in case of training a speech feature vector quantizer using training data selected according to an appearance frequency by phoneme proposed by the present invention.
The following is a description of the speech recognition performance experiments shown in Figs. 5 and 6. For English, the multi-condition training data of the "Wall Street Journal DB (WSJ0)" are used as sample training data, and 7,138 speech data recorded with a "Sennheiser short distance microphone" and several types of long distance microphone are used as the speech input condition. These speech data are stored at a sampling rate of 16 kHz. Further, among the evaluation data of "WSJ0" corresponding to the 5,000-word class, "Set1 to Set7", which are part of the 14 data sets for multi-condition evaluation, are used for speech recognition performance evaluation. Here, each set is composed of 330 speech utterances, recorded from 8 speakers with about 40 utterances per speaker.
Also, 39th-order speech feature vectors are used. For this, 12th-order MFCCs and log energy are extracted, and their delta and acceleration coefficients are also used. Here, the DSR front-end standard algorithm proposed by ETSI is used for the extraction of the speech feature vector.
Further, cepstral mean normalization and energy normalization are employed for the speech feature vector used in the training.
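A compact sketch of how such a 39-dimensional vector is typically assembled from the 13 static coefficients (12 MFCCs plus log energy) is given below; it is illustrative, not the ETSI front-end itself, and the regression window size is an assumption.

```python
import numpy as np

def add_deltas(static, window=2):
    """Append delta and acceleration coefficients to static features.

    static: array of shape (frames, 13) holding 12 MFCCs + log energy per frame.
    Returns an array of shape (frames, 39)."""
    def delta(feat):
        padded = np.pad(feat, ((window, window), (0, 0)), mode="edge")
        num = sum(t * (padded[window + t:len(feat) + window + t] -
                       padded[window - t:len(feat) + window - t])
                  for t in range(1, window + 1))
        return num / (2 * sum(t * t for t in range(1, window + 1)))

    d = delta(static)    # first-order (delta) coefficients
    dd = delta(d)        # second-order (acceleration) coefficients
    return np.hstack([static, d, dd])

static = np.random.randn(100, 13)   # 100 frames of 12 MFCCs + log energy
features = add_deltas(static)
print(features.shape)               # (100, 39)
```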
Meanwhile, the acoustic model employs a "left-to-right model" having 3 states, with the context-independent 4-mixture density and a "cross-word triphone model". This acoustic model uses the "HTK version 3.2 toolkit" for training. For example, after extending to a triphone-based acoustic model starting from 41 monophone-based acoustic models in which two pause models are contained, a state-tying using a decision tree is conducted, thereby reducing the number of states. The total number of acoustic models and the total number of states are 8,360 and 5,356, respectively.
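For illustration, the 3-state left-to-right topology mentioned above can be captured by a transition matrix in which each state may only loop on itself or advance to the next state; the probabilities below are placeholders, not trained values.

```python
import numpy as np

# Transition matrix of a 3-state left-to-right HMM (each row sums to 1).
A = np.array([
    [0.6, 0.4, 0.0],   # state 1 -> stay in state 1 or move to state 2
    [0.0, 0.7, 0.3],   # state 2 -> stay in state 2 or move to state 3
    [0.0, 0.0, 1.0],   # state 3 -> stay in state 3 (exit handled separately)
])
assert np.allclose(A.sum(axis=1), 1.0)
```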
After setting up the speech recognition system under the above conditions, the speech recognition performance without the speech feature vector quantization process, that obtained when training the speech feature vector quantizer using the unplanned sample training data, and that obtained when training the speech feature vector quantizer using the training data selected according to the appearance frequency by phoneme proposed by the present invention are compared as follows.
The following Table 5 represents the word error rate of the speech recognition performance without the speech feature vector quantization process. In Table 5, it can be seen that the average word error rate over the 7 sets is 21.8%.
Table 5
(The content of Table 5 is provided as an image in the original document.)
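The word error rate reported in Table 5 (and in Tables 6 and 7 below) is conventionally computed as the word-level edit distance between the recognized and the reference transcriptions divided by the number of reference words; a minimal sketch of that computation follows (illustrative, not the evaluation tool actually used in the experiments).

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("people run out continuously", "people ran out"))  # 2 errors / 4 words = 0.5
```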
Thereafter, a word error rate for a method in which a codebook is created from the unplanned training data and applied to the speech feature vector quantizer and that for a method in which a codebook is created from the training data selected by using the algorithm proposed by the present invention and applied to the speech feature vector quantizer are compared as follows.
The following Table 6 represents a word error rate for a method in which a codebook is created from the unplanned training data and applied to the speech feature vector quantizer, and Table 7 represents a word error rate for a method in which a codebook is created from the training data selected by using the algorithm suggested by the present invention and applied to the speech feature vector quantizer.
Table 6
Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 Avg
17.x 18.x 25.1 29.4 26.5 24.x 25.9 24.1 (an "x" marks a digit that is illegible in this copy)
Table 7
Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 Avg
13.6 16.x 23.0 27.6 24.9 21.x 24.5 21.7

In Tables 6 and 7, it can be seen that the existing method using the unplanned training data shows an average word error rate of 24.1% over the 7 sets, while the training data selection method according to the phoneme appearance frequency provided by the present invention shows an average word error rate of 21.7% over the 7 sets.
In other words, it can be confirmed that the word error rate resulting from the algorithm presented by the present invention is decreased by about 10% in relative terms ((24.1 - 21.7) / 24.1 ≈ 10%), compared with that of the existing unplanned training data selection algorithm. This is almost identical to 21.8%, the average word error rate of the method without the speech feature vector quantization process. It will easily be understood by those skilled in the art that the present invention therefore does not lower the speech recognition performance by going through the speech feature vector quantization process.
As described above, the training data selection method provided by the present invention can also be applied to any device or system having another speech feature vector quantizer, as well as to the DSR system shown in Fig. 1. For example, the training data selection method of the present invention can also be applied to a quantizer of linear prediction coefficients (LPC) used in speech encoding. It will easily be appreciated by those skilled in the art that the training data selection method of the present invention can be applied to multimedia processing systems for image, audio, etc. (commonly called a "speech processing device"), in view of the characteristics of the vector to be quantized.
The method of the present invention as described above may be implemented by a software program that is stored in a computer-readable storage medium such as a CD-ROM, RAM, ROM, floppy disk, hard disk, optical magnetic disk, or the like. This process may readily be carried out by those skilled in the art, and therefore details thereof are omitted here.
While the present invention has been described with respect to particular embodiments, it will be apparent to those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the following claims.

Claims

WHAT IS CLAIMED IS:
1. A method for selecting training data based on non-uniform sampling for speech recognition vector quantization, the method comprising the steps of: if sample speech data is received, making the speech data subjected to forced alignment to acquire pronunciation information of each phoneme; creating an appearance list by phoneme based on the acquired pronunciation information; statistically calculating an appearance frequency rate of phoneme for a corresponding language depending on the created appearance list by phoneme; and deducing training data having a minimum error between a total frequency number for each phoneme and an appearance frequency rate by phoneme by referring to the calculated statistics.
2. The method of claim 1, wherein the pronunciation information contains an utterance sentence, a phonetic transcription, a phoneme type, and a start time and an end time by phoneme.
3. The method of claim 1, wherein the appearance list by phoneme contains a frequency number by phoneme and a total phoneme number of training data.
4. The method of claim 1, wherein the speech feature vector is represented by MFCC (Mel-Frequency Cepstral Coefficient).
5. The method of claim 1, wherein the language is Korean.
6. The method of claim 1, wherein the language is English.
7. The method of claim 1, wherein the training data selected from the training data deduction is used as training data for creating an acoustic model in HMM (Hidden Markov Model) .
8. The method of claim 1, wherein a rate for a target pronunciation out of pronunciation sequences is analyzed based on a pronunciation dictionary of the corresponding language in the phoneme appearance frequency rate calculation.
9. The method of claim 1, wherein, in the training data deduction, the training data are selected from sample training data as many as a number corresponding to a difference between an appearance frequency rate for a specific phoneme in the pronunciation dictionary and an appearance frequency rate of the corresponding phoneme with respect to a total phonemic number in the sample speech data.
10. A computer-readable recording medium for storing a program implementing a method for selecting training data in a speech processing apparatus with a processor, the method comprising the steps of: if sample speech data is received, making the speech data subjected to forced alignment to acquire pronunciation information of each phoneme; creating an appearance list by phoneme based on the acquired pronunciation information; statistically calculating an appearance frequency rate of phoneme for a corresponding language depending on the created appearance list by phoneme; and deducing training data having a minimum error between a total frequency number for each phoneme and an appearance frequency rate by phoneme by referring to the calculated statistics.
PCT/KR2006/005830 2006-05-10 2006-12-28 Method for selecting training data based on non-uniform sampling for speech recognition vector quantization WO2007129802A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020060042109A KR100901640B1 (en) 2006-05-10 2006-05-10 Method of selecting the training data based on non-uniform sampling for the speech recognition vector quantization
KR10-2006-0042109 2006-05-10

Publications (1)

Publication Number Publication Date
WO2007129802A1 true WO2007129802A1 (en) 2007-11-15

Family

ID=38667889

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2006/005830 WO2007129802A1 (en) 2006-05-10 2006-12-28 Method for selecting training data based on non-uniform sampling for speech recognition vector quantization

Country Status (2)

Country Link
KR (1) KR100901640B1 (en)
WO (1) WO2007129802A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103021407A (en) * 2012-12-18 2013-04-03 中国科学院声学研究所 Method and system for recognizing speech of agglutinative language
CN110288976A (en) * 2019-06-21 2019-09-27 北京声智科技有限公司 Data screening method, apparatus and intelligent sound box
CN115938353A (en) * 2022-11-24 2023-04-07 北京数美时代科技有限公司 Voice sample distributed sampling method, system, storage medium and electronic equipment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102198273B1 (en) 2019-02-26 2021-01-04 한미란 Machine learning based voice data analysis method, device and program

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08278794A (en) * 1995-04-07 1996-10-22 Sony Corp Speech recognition device and its method and phonetic translation device
KR19990011915A (en) * 1997-07-25 1999-02-18 구자홍 Voice recognition method and system
KR20010004468A (en) * 1999-06-29 2001-01-15 이계철 Method for generating context-dependent phonelike units for speech recognition
US6182038B1 (en) * 1997-12-01 2001-01-30 Motorola, Inc. Context dependent phoneme networks for encoding speech information
WO2005034086A1 (en) * 2003-10-03 2005-04-14 Asahi Kasei Kabushiki Kaisha Data processing device and data processing device control program


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103021407A (en) * 2012-12-18 2013-04-03 中国科学院声学研究所 Method and system for recognizing speech of agglutinative language
CN110288976A (en) * 2019-06-21 2019-09-27 北京声智科技有限公司 Data screening method, apparatus and intelligent sound box
CN110288976B (en) * 2019-06-21 2021-09-07 北京声智科技有限公司 Data screening method and device and intelligent sound box
CN115938353A (en) * 2022-11-24 2023-04-07 北京数美时代科技有限公司 Voice sample distributed sampling method, system, storage medium and electronic equipment

Also Published As

Publication number Publication date
KR20070109314A (en) 2007-11-15
KR100901640B1 (en) 2009-06-09

Similar Documents

Publication Publication Date Title
JP3126985B2 (en) Method and apparatus for adapting the size of a language model of a speech recognition system
US8831947B2 (en) Method and apparatus for large vocabulary continuous speech recognition using a hybrid phoneme-word lattice
Guglani et al. Continuous Punjabi speech recognition model based on Kaldi ASR toolkit
EP1477966B1 (en) Adaptation of compressed acoustic models
Dua et al. GFCC based discriminatively trained noise robust continuous ASR system for Hindi language
Upadhyaya et al. Continuous Hindi speech recognition model based on Kaldi ASR toolkit
JP2005227758A (en) Automatic identification of telephone caller based on voice characteristic
Gaurav et al. Development of application specific continuous speech recognition system in Hindi
Anoop et al. Automatic speech recognition for Sanskrit
Babu et al. Continuous speech recognition system for malayalam language using kaldi
Sinha et al. Continuous density hidden markov model for context dependent Hindi speech recognition
WO2002103675A1 (en) Client-server based distributed speech recognition system architecture
Hieronymus et al. Spoken language identification using large vocabulary speech recognition
KR100901640B1 (en) Method of selecting the training data based on non-uniform sampling for the speech recognition vector quantization
Hieronymus et al. Robust spoken language identification using large vocabulary speech recognition
Thalengala et al. Study of sub-word acoustical models for Kannada isolated word recognition system
Manasa et al. Comparison of acoustical models of GMM-HMM based for speech recognition in Hindi using PocketSphinx
Ananthakrishna et al. Kannada word recognition system using HTK
Ezzine et al. Moroccan dialect speech recognition system based on cmu sphinxtools
Bouchakour et al. Improving continuous Arabic speech recognition over mobile networks DSR and NSR using MFCCS features transformed
Radha et al. Continuous speech recognition system for Tamil language using monophone-based hidden markov model
Mansikkaniemi Acoustic model and language model adaptation for a mobile dictation service
Kim et al. A keyword spotting approach based on pseudo N-gram language model
Thalengala et al. Effect of time-domain windowing on isolated speech recognition system performance
Kurian et al. Automated Transcription System for MalayalamLanguage

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06835530

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC, EPO FORM 1205A SENT ON 03/03/09.

122 Ep: pct application non-entry in european phase

Ref document number: 06835530

Country of ref document: EP

Kind code of ref document: A1