KR0176788B1

KR0176788B1 - Method for Automatic Model Determination in Speech Recognition

Info

Publication number: KR0176788B1
Application number: KR1019950058739A
Authority: KR
Inventors: 김민성
Original assignee: 구자홍; 엘지전자주식회사
Priority date: 1995-12-27
Filing date: 1995-12-27
Publication date: 1999-04-01
Also published as: KR970050118A

Abstract

The present invention relates to a method for automatic model determination in speech recognition which, when multiple models are used to reflect speakers' voice characteristics, automatically determines the model best suited to those characteristics while reducing the processing time that grows with the number of models. Conventionally, when multiple models are used to represent speakers' voice characteristics in more detail, the model search time increases in proportion to the number of models, and the recognition time increases as a result. In the present invention, the codebook most similar to the feature vector sequence over a predetermined frame interval is selected, a weight is assigned to each codeword, the entire input speech is quantized with the codewords of the selected codebook, and the model corresponding to that codebook is taken as the recognition result. This reduces the search time for codebooks and models and thereby speeds up speech recognition.

Description

Method for Automatic Model Determination in Speech Recognition

FIG. 1 is a diagram showing the configuration of a conventional speech recognizer using a single quantizer.

FIG. 2 is a diagram showing the configuration of a speech recognizer using multiple quantizers and models.

FIG. 3 is a diagram showing the configuration of a speech recognizer according to the present invention.

FIG. 4 is a diagram showing the codebook selection process according to the present invention.

FIG. 5 is a diagram showing the method for automatic model determination in speech recognition according to the present invention.

* Explanation of reference numerals for the main parts of the drawings

200: recognition matching unit
210: codebook selection unit
220: model selection unit
230: codebook
240: model

The present invention relates to a model determination method in speech recognition, and in particular to a method for automatic model determination in speech recognition that, when multiple models are used according to speakers' voice characteristics, automatically determines the model best suited to those characteristics while reducing the processing time that grows with the number of models.

In general, when a speaker's speech is recognized using a single quantizer as shown in FIG. 1, two things are required: a model for each word to be recognized, and, so that the models can be expressed as HMMs (Hidden Markov Models), a quantizer codebook that converts continuous acoustic feature vectors into a finite number of representative vectors.

In operation, when the feature vectors of the speech are input to the recognition matching unit (100), they are quantized into a sequence of codewords from the codebook (110). The recognition matching unit (100) then computes, through Viterbi search and matching, the probability of this codeword sequence under the HMM for each word, and recognizes the speech by reading out the word string of the model (120) with the highest probability.
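The single-codebook flow described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes squared Euclidean distance for quantization and discrete HMMs given as `(start, trans, emit)` probability tables, and the names `quantize`, `viterbi_log_prob`, and `recognize` are hypothetical.

```python
import math


def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codeword.

    Squared Euclidean distance is an assumption; the patent does not
    name a distortion measure.
    """
    def dist(x, c):
        return sum((a - b) ** 2 for a, b in zip(x, c))

    return [min(range(len(codebook)), key=lambda k: dist(x, codebook[k]))
            for x in frames]


def viterbi_log_prob(symbols, start, trans, emit):
    """Log-probability of the best state path of a discrete HMM."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n = len(start)
    score = [log(start[s]) + log(emit[s][symbols[0]]) for s in range(n)]
    for sym in symbols[1:]:
        score = [max(score[p] + log(trans[p][s]) for p in range(n))
                 + log(emit[s][sym])
                 for s in range(n)]
    return max(score)


def recognize(frames, codebook, word_models):
    """Quantize the input, score every word HMM, return the best word."""
    symbols = quantize(frames, codebook)
    return max(word_models,
               key=lambda w: viterbi_log_prob(symbols, *word_models[w]))
```

Every word model is scored against the same codeword sequence, so the cost of this stage grows with the number of models that must be searched.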

FIG. 2, on the other hand, shows a speech recognizer using multiple quantizers and models. In speaker-independent connected speech recognition, voice characteristics differ greatly by gender and age group, so building several models for each word reduces the error in the quantization process and allows speakers' voice characteristics to be represented in more detail.

The operation of a speech recognizer using multiple models in this way is the same as that of the recognizer shown in FIG. 1, except that each word is quantized against multiple codebooks and multiple models are searched.

However, when multiple models are used to represent speakers' voice characteristics in more detail, the model search time increases in proportion to the number of models, and the recognition time consequently increases.

In view of this problem, the object of the present invention is to achieve high-speed speech recognition by reducing the search time for codebooks and models: the codebook most similar to the feature vector sequence over a predetermined frame interval is selected, weights are assigned to the codewords of each codebook, the entire input speech is quantized with the codewords of the selected codebook, and the corresponding model is obtained automatically. The invention having this object is described in detail below.

The method for automatic model determination in speech recognition according to the present invention comprises: a first step of extracting feature vectors from the input speech; a second step of quantizing the vector sequence extracted in the first step, over a predetermined frame interval, with multiple pre-trained codebooks, computing the quantization errors, and computing their cumulative distance; a third step of selecting the codebook for which the cumulative distance computed in the second step is smallest; a fourth step of assigning weights to the codewords of the codebook selected in the third step and then quantizing the entire input speech with the codewords of the selected codebook; and a fifth step of selecting the model corresponding to the codebook used for quantization in the fourth step and outputting it as the speech recognition result.

The invention thus constituted is described in detail with reference to FIGS. 3 and 5.

The codebook selection unit (210) receives the feature vectors for a predetermined interval of T frames of the input speech feature vector sequence (that is, the number of frames corresponding to the speech duration of one word) and quantizes them with every one of the multiple codebooks (230) prepared in advance through training, as shown in FIG. 4.

In this quantization process, the codeword obtained by quantizing the i-th of the T frames with the j-th codebook, and the corresponding quantization error, are defined by Equation (1) below.
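The equation itself does not survive in this text. Assuming ordinary nearest-neighbor vector quantization, a form consistent with the description would be the following, where $x_i$ is the feature vector of frame $i$, $c_{j,k}$ is the $k$-th codeword of codebook $j$, $K_j$ is the size of that codebook, and $d(\cdot,\cdot)$ is a distortion measure such as squared Euclidean distance. All of this notation is an assumption introduced for illustration and is not taken from the patent.

```latex
q_{ij} = \arg\min_{1 \le k \le K_j} d\left(x_i, c_{j,k}\right),
\qquad
e_{ij} = \min_{1 \le k \le K_j} d\left(x_i, c_{j,k}\right),
\qquad 1 \le i \le T,\; 1 \le j \le N
```

Here $q_{ij}$ is the index of the quantized codeword and $e_{ij}$ is the quantization error for frame $i$ under codebook $j$.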

Using Equation (1), the quantization errors of the T frames are computed for all N codebooks and accumulated per codebook, and the codebook with the smallest cumulative quantization error is selected as the most appropriate codebook for the input T frames.
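The selection rule just described can be sketched as below. This is a minimal sketch under stated assumptions: squared Euclidean distance as the quantization error, and the T-frame interval taken from the start of the input. The function names are hypothetical.

```python
def sq_dist(x, c):
    """Squared Euclidean distance (an assumed distortion measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def quantization_error(x, codebook):
    """Error of quantizing one frame with its nearest codeword."""
    return min(sq_dist(x, c) for c in codebook)


def select_codebook(frames, codebooks, T):
    """Pick the codebook with the smallest cumulative error over T frames.

    Returns (index of the selected codebook, list of cumulative errors).
    Using the first T frames is an assumption; the text only says
    "a predetermined interval of T frames".
    """
    head = frames[:T]
    cumulative = [sum(quantization_error(x, cb) for x in head)
                  for cb in codebooks]
    best = min(range(len(codebooks)), key=lambda j: cumulative[j])
    return best, cumulative
```

Only the T-frame interval is quantized with all N codebooks; the rest of the input is handled by the single selected codebook.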

When selecting a codebook by the cumulative quantization error, some codewords in the different codebooks represent well the characteristics of the class of speech they were trained on, whereas other codewords show no difference in voice characteristics between classes.

Therefore, to treat codewords that represent the characteristics of each speech class well differently from those that do not, the present invention assigns a weight to the codewords of each codebook. A codeword whose voice characteristics differ greatly between the classified groups is given a low weight, which makes the cumulative quantization error smaller; a codeword without such a difference is given a high weight, which makes the cumulative error larger.

These weights are obtained from the training speech. When a feature vector of the training speech is quantized with each of the N codebooks, the resulting codewords and quantization errors are denoted C1, d(C1), C2, d(C2), ..., Cn, d(Cn), respectively. If the feature vector comes from the training speech used to build the first codebook, the weight for codeword C1 of the first codebook is defined by Equation (2) below.

Weights can likewise be obtained by quantizing the training speech for all N codebooks. When L training speech feature vectors map to a single codeword, the weight is defined by Equation (3) below.
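Equations (2) and (3) are not reproduced in this text, so the sketch below does not claim to implement them. It uses a stand-in formula chosen only to show the stated behavior: for each training vector, the error under its own codebook is divided by the sum of its errors under all N codebooks, and the ratios of the L vectors that fall on the same codeword are averaged. A codeword that separates the classes well then receives a low weight. The formula, the neutral value `1/N` for unused codewords, and all function names are assumptions.

```python
def sq_dist(x, c):
    """Squared Euclidean distance (an assumed distortion measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def nearest(x, codebook):
    """Return (index, error) of the codeword nearest to x."""
    errs = [sq_dist(x, c) for c in codebook]
    k = min(range(len(errs)), key=errs.__getitem__)
    return k, errs[k]


def codeword_weights(training_sets, codebooks):
    """Illustrative per-codeword weights (NOT the patent's Equation (2)/(3)).

    training_sets[j] holds the training vectors used to build codebooks[j].
    """
    n = len(codebooks)
    weights = []
    for j, cb in enumerate(codebooks):
        ratios = [[] for _ in cb]
        for x in training_sets[j]:
            results = [nearest(x, other) for other in codebooks]
            k, own_err = results[j]
            total = sum(err for _, err in results)
            # Small own error relative to the other codebooks -> low ratio.
            ratios[k].append(own_err / total if total > 0 else 1.0 / n)
        # Average over the L vectors on each codeword; unused codewords
        # keep the neutral value 1/n.
        weights.append([sum(r) / len(r) if r else 1.0 / n for r in ratios])
    return weights
```

With this stand-in, a weight near `1/N` means the codeword fits all classes about equally, while a weight near zero means the other codebooks represent that region poorly.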

After weights have been assigned to the codewords of each codebook by Equation (3), the recognition matching unit (200) quantizes the entire input speech with the codebook selected by the above process, and the model (240) for the selected codebook is used to produce and output the speech recognition result.
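One possible reading of this final stage is sketched below. The text does not say exactly how the weights enter the quantization, so the sketch assumes that each codeword's weight scales its distance to the frame; squared Euclidean distance and the names `weighted_quantize` and `determine_model` are likewise assumptions.

```python
def sq_dist(x, c):
    """Squared Euclidean distance (an assumed distortion measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def weighted_quantize(frames, codebook, weights):
    """Quantize each frame to the codeword minimizing weight * distance.

    Scaling the distance by the codeword weight is an assumption about
    how the weights are applied.
    """
    symbols = []
    for x in frames:
        k = min(range(len(codebook)),
                key=lambda i: weights[i] * sq_dist(x, codebook[i]))
        symbols.append(k)
    return symbols


def determine_model(frames, selected, codebooks, weights, models):
    """Quantize the whole input with the selected codebook only and
    return the model paired with that codebook plus the codeword sequence."""
    symbols = weighted_quantize(frames, codebooks[selected], weights[selected])
    return models[selected], symbols
```

Because only one codebook and its model are used for the whole input, the remaining codebooks and models are never searched at this stage.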

Thus, the present invention selects the codebook most similar to the feature vector sequence over a predetermined frame interval, assigns a weight to each codeword, quantizes the entire input speech with the codewords of the selected codebook, and takes the corresponding model as the recognition result. This reduces the amount of computation needed to select a codebook and model, shortens the codebook and model search time, and thereby speeds up speech recognition.
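The saving can be made concrete by counting codeword distance evaluations. The count below is a rough sketch that follows the description: N codebooks and a T-frame selection interval come from the text, while the codebook size and the total frame count are hypothetical parameters, and the example figures are invented for illustration only.

```python
def distance_evals_exhaustive(num_codebooks, codebook_size, total_frames):
    """Every frame of the input is quantized with every codebook."""
    return num_codebooks * codebook_size * total_frames


def distance_evals_proposed(num_codebooks, codebook_size, total_frames, T):
    """Only T frames are quantized with all codebooks; the entire input
    is then quantized once with the single selected codebook."""
    selection = num_codebooks * codebook_size * T
    final = codebook_size * total_frames
    return selection + final


# Hypothetical figures: 4 codebooks of 256 codewords, 300 frames, T = 30.
exhaustive = distance_evals_exhaustive(4, 256, 300)
proposed = distance_evals_proposed(4, 256, 300, 30)
print(exhaustive, proposed)  # 307200 107520
```

The same reasoning applies to the model search: the exhaustive recognizer scores the models of all N codebooks, whereas the proposed method scores only those of the selected codebook.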

Claims

A method for automatic model determination in speech recognition, comprising: a first step of extracting feature vectors from the input speech; a second step of quantizing the vector sequence extracted in the first step, over a predetermined frame interval, with multiple pre-trained codebooks, computing the quantization errors, and computing their cumulative distance; a third step of selecting the codebook for which the cumulative distance computed in the second step is smallest; a fourth step of assigning weights to the codewords of the codebook selected in the third step and then quantizing the entire input speech with the codewords of the selected codebook; and a fifth step of selecting the model corresponding to the codebook used for quantization in the fourth step and outputting it as the speech recognition result.