KR100340688B1 - Method for extracting the number of optimal allophone in order to speech recognition

Info

Publication number: KR100340688B1
Application number: KR1019990033593A
Authority: KR
Inventors: 이승훈; 이항섭
Original assignee: 오길록; 한국전자통신연구원
Priority date: 1999-08-16
Filing date: 1999-08-16
Publication date: 2002-06-15
Also published as: KR20010017858A

Abstract

1. Technical field of the claimed invention

The present invention relates to a method for extracting the optimal number of allophones for speech recognition.

2. Technical problem to be solved by the invention

The present invention applies an entropy-based clustering technique. It measures the entropy-based information loss incurred when one allophone is merged with another, and repeatedly selects the pair whose merger minimizes the information loss across the entire allophone model, clustering until the desired number remains. The object of the invention is to provide a method that extracts the optimal number of allophones in this way, and a computer-readable recording medium storing a program for realizing the method.

3. Summary of the solution

The present invention comprises: a first step of loading initial distribution information; a second step of computing, for each pair of allophones belonging to one central phoneme, the entropy-based information loss of merging that pair and recording the minimum loss value and the corresponding allophone pair, repeated for all central phonemes; and a third step of selecting, from the per-phoneme minimum-loss pairs obtained in the second step, the pair with the smallest information loss, merging it, and updating the distribution information, this process and the second step being repeated until the model has been clustered to the desired number of phonemes.

4. Important uses of the invention

The present invention is used in speech recognizers and the like.

Description

Method for extracting the optimal number of allophones for speech recognition {METHOD FOR EXTRACTING THE NUMBER OF OPTIMAL ALLOPHONE IN ORDER TO SPEECH RECOGNITION}

The present invention relates to a method for extracting the optimal number of allophones for speech recognition and to a computer-readable recording medium storing a program for realizing the method. In particular, it relates to a method that applies entropy-based clustering to an existing allophone model and extracts the optimal number of allophones while varying the number of allophones, and to a computer-readable recording medium storing a program for realizing that method.

In general, the acoustic model units mainly used in speech recognizers on personal computers (PCs) include words, syllables, phonemes, and allophones. Among these, the phoneme is widely preferred because there are far fewer distinct phonemes than words or syllables. However, even the same phoneme exhibits different phonetic characteristics depending on the influence of its neighboring phonemes. To build a high-performance speech recognizer, the acoustic model must therefore subdivide phonemes so that it captures the influence of each phoneme's surrounding context. Allophone units are the most commonly used way of implementing an acoustic model with phonemes subdivided in this manner.

Prior art in this field includes acoustic classification using a speech database, classification using a match score, and classification using an allophone clustering tree. Most of these techniques are classification methods based on foreign languages such as English. They are therefore difficult to apply, and inefficient, when implementing the acoustic model of a speech recognizer for Korean.

Prior art has also been applied to Korean, but it first fixes the possible combinations of Korean allophones and then applies an allophone clustering tree to them. As a result, the acoustic model includes allophones that are never actually used, and the number of allophones becomes very large. Such prior art therefore causes various problems when actually implemented on a system such as a personal computer (PC).

Specifically, classifying Korean with an allophone-based acoustic model yields more than 3,000 allophones in theory, and more than 1,500 even after low-frequency cases are excluded. Such a finely subdivided allophone model can improve recognizer performance compared with an acoustic model of 40 phonemes, but an actual implementation requires a large amount of memory and has difficulty running in real time. Quantitative optimization of the allophone model is therefore essential.

The present invention, devised to solve the above problems, applies an entropy-based clustering technique. It measures the entropy-based information loss incurred when one allophone is merged with another, and repeatedly selects the pair whose merger minimizes the information loss across the entire allophone model, clustering until the desired number remains. Its object is to provide a method that extracts the optimal number of allophones in this way, and a computer-readable recording medium storing a program for realizing the method.

FIG. 1 is an exemplary block diagram of a variable-vocabulary speech recognizer to which the present invention is applied.

FIG. 2 is a structural diagram of an embodiment of an existing allophone model used in the present invention.

FIG. 3 illustrates applying the information-loss calculation used in the present invention to an existing allophone model.

FIG. 4 is a flowchart of an embodiment of the method for extracting the optimal number of allophones according to the present invention.

* Description of reference numerals for the main parts of the drawings

101: speech input unit 102: pronunciation dictionary generation unit

103: acoustic modeling unit 104: speech recognition unit

105: user interface unit

To achieve the above object, the present invention provides a method for extracting the optimal number of allophones, applied to the acoustic modeling unit of a speech recognizer, comprising: a first step of loading initial distribution information; a second step of computing, for each pair of allophones belonging to one central phoneme, the entropy-based information loss of merging that pair and recording the minimum loss value and the corresponding allophone pair, repeated for all central phonemes; and a third step of selecting, from the per-phoneme minimum-loss pairs obtained in the second step, the pair with the smallest information loss, merging it, and updating the distribution information, this process and the second step being repeated until the model has been clustered to the desired number of phonemes.

The present invention also provides a computer-readable recording medium storing a program that causes a speech recognizer having a processor to realize: a first function of loading initial distribution information; a second function of computing, for each pair of allophones belonging to one central phoneme, the entropy-based information loss of merging that pair and recording the minimum loss value and the corresponding allophone pair, repeated for all central phonemes; and a third function of selecting, from the per-phoneme minimum-loss pairs obtained by the second function, the pair with the smallest information loss, merging it, and updating the distribution information, this process and the second function being repeated until the model has been clustered to the desired number of phonemes.

At every iteration, the present invention maintains the optimal set of models for that step. Moreover, by extracting allophone models of various sizes and evaluating the recognition performance of each, one can obtain an optimal model that achieves high performance with the smallest memory footprint while still running in real time. A further advantage is that, from these models, the user can select the allophone model that best suits the intended purpose in terms of recognition performance, memory usage, and real-time operation.

The above objects, features, and advantages will become clearer from the following detailed description taken together with the accompanying drawings. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is an exemplary block diagram of a variable-vocabulary speech recognizer to which the present invention is applied.

The variable-vocabulary speech recognizer broadly comprises a speech input unit (101), a pronunciation dictionary generation unit (102), an acoustic modeling unit (103) in which the technique of the present invention is implemented, a speech recognition unit (104), and a user interface unit (105).

The speech input unit (101) receives the user's speech from a microphone and converts it into digital data in a form suitable for speech recognition. This module detects the beginning and end of the user's utterance.

The pronunciation dictionary generation unit (102) converts the recognition vocabulary supplied by the user interface unit (105) into the phoneme-string pronunciation dictionary used inside the recognizer. Its input from the user interface unit (105) is vocabulary in 2-byte precomposed (Wansung) Hangul text, and its output is a pronunciation dictionary converted according to pronunciation rules and a predefined ASCII phoneme notation.

The acoustic modeling unit (103) extracts the allophones for the phoneme strings of the pronunciation dictionary generated by the pronunciation dictionary generation unit (102), concatenates them to generate reference patterns in allophone units, and passes these to the speech recognition unit (104).

The speech recognition unit (104) recognizes the input speech signal using the recognition model, the pronunciation dictionary, and the acoustic model, and passes the recognition result to the user interface unit (105).

The user interface unit (105) is the central control part that manages the whole system. Its functions include sequencing the functional modules, providing the graphical user interface (GUI), and displaying recognition results.

FIG. 2 is a structural diagram of an embodiment of an existing allophone model used in the present invention. It shows the structure of the existing allophone model used to generate the acoustic model to which the present invention is applied.

This acoustic model consists of sets of allophones (202) classified by central phoneme (201). The central phonemes (201) are represented by 40 models covering the consonants and vowels of Korean and silence. Each central phoneme has N allophones (N is a natural number), which model the phonemes that can precede and follow it. For example, the set (202) of allophones that can arise around a given central phoneme (201) has many members, such as 'A, B, ...'.

The existing acoustic model was built by classifying, on the basis of phonetic knowledge, the phoneme groups that can affect the formation of Korean allophones into 7, 5, and 5 groups for initial consonants, final consonants, and vowels, respectively. Taking into account allophones that cannot actually occur, the total number of Korean allophones this classification can generate is about 3,381 or fewer. Applying this allophone model to the pronunciation dictionary of a variable-vocabulary speech recognizer and excluding low-frequency allophones reduces it to roughly 1,548 allophone models. Even at this reduced size, however, the memory usage and recognition time remain serious drawbacks for implementing the model in a practical variable-vocabulary speech recognizer.

The method applied in the present invention for extracting the optimal number of allophones measures the entropy-based information loss incurred when one allophone is merged with another, and repeatedly selects the pair whose merger minimizes the loss across the entire allophone model, clustering until the desired number remains. The calculation of entropy and information loss from the distribution information of a hidden Markov model (HMM) is described first. The count of codeword i for distribution d of context a is defined as $N_{a,d}(i)$.

The output probability P and the total count N are then given by Equation 1 below.

$$P_{a,d}(i) = \frac{N_{a,d}(i)}{N_{a,d}}$$

$$N_{a,d} = \sum_{i} N_{a,d}(i)$$

Using Equation 1, the entropy H is given by Equation 2 below.

$$H_{a,d} = -\sum_{i} P_{a,d}(i) \log P_{a,d}(i)$$
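As an illustration of Equations 1 and 2, the following minimal Python sketch computes the output probabilities and the entropy of one distribution from its codeword counts. The function names and the plain list-of-counts representation are assumptions made for this example and do not come from the patent. The natural logarithm is used, and zero counts are taken to contribute nothing to the entropy.

```python
import math


def output_probabilities(counts):
    """Equation 1: P(i) = N(i) / N, where N is the sum of all codeword counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("distribution has no observations")
    return [n / total for n in counts]


def entropy(counts):
    """Equation 2: H = -sum_i P(i) * log P(i); terms with P(i) = 0 are skipped."""
    return sum(-p * math.log(p) for p in output_probabilities(counts) if p > 0)


if __name__ == "__main__":
    # A flat distribution has high entropy; a fully peaked one has none.
    print(entropy([5, 5, 5, 5]))
    print(entropy([20, 0, 0, 0]))
```

A flat count vector over four codewords gives an entropy close to log 4, while a vector concentrated on a single codeword gives an entropy of zero.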

If the model obtained by merging context a and context b is called context m, the merged count $N_{m,d}(i)$ is given by Equation 3 below.

$$N_{m,d}(i) = N_{a,d}(i) + N_{b,d}(i)$$

Once the entropy H has been obtained for contexts a and b and for the merged context m using the equations above, the information loss between the original models and the merged model is obtained from Equation 4 below.

$$L_d(a,b) = N_{m,d} H_{m,d} - N_{a,d} H_{a,d} - N_{b,d} H_{b,d}$$
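The merge in Equation 3 and the loss in Equation 4 can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: the helper names are hypothetical, counts are plain lists indexed by codeword, and the product N·H is computed directly as −Σ N(i) log(N(i)/N), which follows from Equations 1 and 2.

```python
import math


def weighted_entropy(counts):
    """N * H for one distribution, i.e. -sum_i N(i) * log(N(i) / N)."""
    total = sum(counts)
    return -sum(n * math.log(n / total) for n in counts if n > 0)


def merge_counts(counts_a, counts_b):
    """Equation 3: the merged count of codeword i is N_a(i) + N_b(i)."""
    return [na + nb for na, nb in zip(counts_a, counts_b)]


def merge_loss(counts_a, counts_b):
    """Equation 4: L = N_m * H_m - N_a * H_a - N_b * H_b."""
    merged = merge_counts(counts_a, counts_b)
    return (weighted_entropy(merged)
            - weighted_entropy(counts_a)
            - weighted_entropy(counts_b))


if __name__ == "__main__":
    # Same shape, different totals: merging loses essentially nothing.
    print(merge_loss([8, 2], [4, 1]))
    # Disjoint codewords: merging blurs two sharp distributions.
    print(merge_loss([10, 0], [0, 10]))
```

Two distributions with the same proportions merge at a loss of essentially zero, whereas two distributions concentrated on different codewords incur a large loss, which is why low-loss pairs are the natural candidates for merging.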

FIG. 3 illustrates applying the information-loss calculation used in the present invention to an existing allophone model. It shows the case of merging allophone models A and B among the allophone models (202) belonging to one central phoneme (201).

First, the entropy H and the count N are obtained for each allophone. The labels b, m, and e shown in each allophone model (301, 302, 303) correspond to the begin, middle, and end states of the hidden Markov model (HMM). Next, these values are used to compute the information loss for each of the begin, middle, and end states, and finally the loss L(A,B) for merging allophone models A and B is obtained as in Equation 5 below.

$$L_b(A,B) = N_{m,b} H_{m,b} - N_{a,b} H_{a,b} - N_{b,b} H_{b,b}$$

$$L_m(A,B) = N_{m,m} H_{m,m} - N_{a,m} H_{a,m} - N_{b,m} H_{b,m}$$

$$L_e(A,B) = N_{m,e} H_{m,e} - N_{a,e} H_{a,e} - N_{b,e} H_{b,e}$$

$$L(A,B) = L_b(A,B) + L_m(A,B) + L_e(A,B)$$
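Equation 5 sums the per-state losses over the begin, middle, and end states. A minimal sketch follows, assuming each allophone model is stored as a dictionary that maps the state labels "b", "m", and "e" to a list of codeword counts. That layout and all function names are assumptions made for illustration.

```python
import math

STATES = ("b", "m", "e")  # begin, middle, and end states of the HMM


def weighted_entropy(counts):
    """N * H for one state distribution, computed directly from counts."""
    total = sum(counts)
    return -sum(n * math.log(n / total) for n in counts if n > 0)


def state_loss(counts_a, counts_b):
    """Loss for one state: N_m * H_m - N_a * H_a - N_b * H_b."""
    merged = [na + nb for na, nb in zip(counts_a, counts_b)]
    return (weighted_entropy(merged)
            - weighted_entropy(counts_a)
            - weighted_entropy(counts_b))


def allophone_merge_loss(model_a, model_b):
    """Equation 5: L(A,B) = L_b(A,B) + L_m(A,B) + L_e(A,B)."""
    return sum(state_loss(model_a[s], model_b[s]) for s in STATES)


if __name__ == "__main__":
    # Hypothetical counts over a two-codeword codebook.
    model_a = {"b": [8, 2], "m": [5, 5], "e": [1, 9]}
    model_b = {"b": [4, 1], "m": [0, 10], "e": [2, 8]}
    print(allophone_merge_loss(model_a, model_b))
```

In this hypothetical example the begin states have identical proportions and contribute almost nothing, so the total is dominated by the middle state, where the two models disagree most.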

The merging criterion is to merge the pair with the smallest information loss among the losses computed for all pairs within each of the 40 central phonemes described above. Each merged pair becomes a new allophone model, so the number of allophones decreases by one per merge. In this way, allophone models of various sizes can be generated from the existing allophone model.

FIG. 4 is a flowchart of an embodiment of the method for extracting the optimal number of allophones according to the present invention.

First, for the algorithm to run, the initial distribution information from the existing allophone model is loaded (401). This distribution information stores the count values of each allophone. Next, the number of allophones in the current acoustic model is checked against the desired number (402). If the desired number has been reached, the algorithm terminates and outputs the information on the allophones that currently exist (403). This information is the acoustic model with the optimized number of allophones.

If the check (402) finds more allophones than desired, the information loss is first computed for merging each pair of allophones belonging to one central phoneme (404), using the calculation described for FIG. 3. The smallest of these losses and the corresponding allophone pair are recorded (405). Steps 404 and 405 are repeated for all 40 central phonemes (406). Once the minimum-loss pair for each central phoneme has been obtained, these pairs are compared to find the smallest value overall, and that value and its allophone pair are recorded (407). Because this pair minimizes the information loss in the whole allophone model, it is merged, which reduces the number of allophones by one, and the distribution information is updated (408). The process then returns to the count check (402) to test whether the desired number has been reached, and the subsequent steps are repeated.
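The flow of FIG. 4 can be sketched as a greedy loop in Python. This is a minimal sketch under stated assumptions, not the patent's implementation. The inventory is assumed to be a dictionary mapping each central phoneme to its allophones, each allophone being a dictionary of per-state codeword counts. The names `cluster` and `best_pair_for_phoneme`, the "A+B" naming of merged models, and the early exit when no pair is left are all hypothetical choices for this example.

```python
import math
from itertools import combinations

STATES = ("b", "m", "e")  # begin, middle, and end states of the HMM


def weighted_entropy(counts):
    """N * H for one state distribution."""
    total = sum(counts)
    return -sum(n * math.log(n / total) for n in counts if n > 0)


def merge_models(a, b):
    """Add the codeword counts of two allophone models state by state."""
    return {s: [x + y for x, y in zip(a[s], b[s])] for s in STATES}


def merge_loss(a, b):
    """Information loss of merging two allophone models, summed over states."""
    m = merge_models(a, b)
    return sum(weighted_entropy(m[s]) - weighted_entropy(a[s])
               - weighted_entropy(b[s]) for s in STATES)


def best_pair_for_phoneme(allophones):
    """Steps 404-405: lowest-loss pair within one central phoneme, or None."""
    best = None
    for name_a, name_b in combinations(sorted(allophones), 2):
        loss = merge_loss(allophones[name_a], allophones[name_b])
        if best is None or loss < best[0]:
            best = (loss, name_a, name_b)
    return best


def cluster(inventory, target_count):
    """Merge greedily until target_count allophones remain (steps 402-408)."""
    inventory = {ph: dict(allos) for ph, allos in inventory.items()}
    while sum(len(a) for a in inventory.values()) > target_count:  # step 402
        candidates = []
        for phoneme, allophones in inventory.items():  # step 406
            best = best_pair_for_phoneme(allophones)
            if best is not None:
                candidates.append((best[0], phoneme, best[1], best[2]))
        if not candidates:
            break  # every central phoneme is down to a single allophone
        _, phoneme, name_a, name_b = min(candidates)  # step 407
        allophones = inventory[phoneme]
        merged = merge_models(allophones.pop(name_a), allophones.pop(name_b))
        allophones[name_a + "+" + name_b] = merged  # step 408
    return inventory  # step 403


if __name__ == "__main__":
    # Hypothetical inventory: two central phonemes, five allophones in total.
    demo = {
        "a": {"A": {s: [8, 2] for s in STATES},
              "B": {s: [4, 1] for s in STATES},
              "C": {s: [0, 10] for s in STATES}},
        "k": {"X": {s: [10, 0] for s in STATES},
              "Y": {s: [0, 10] for s in STATES}},
    }
    for phoneme, allophones in cluster(demo, 4).items():
        print(phoneme, sorted(allophones))
```

Each pass recomputes every pairwise loss within each central phoneme, which keeps the sketch close to the flowchart. Caching the losses of pairs unaffected by the previous merge would be an obvious optimization, but the source does not describe one.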

The present invention described above is not limited to the foregoing embodiment and the accompanying drawings. It will be apparent to those of ordinary skill in the art to which the invention pertains that various substitutions, modifications, and changes are possible without departing from the technical spirit of the invention.

As described above, the present invention can systematically and quantitatively reduce the number of allophones in an existing acoustic model of roughly 1,500 to 3,000 allophones.

Accordingly, when applied to an existing Korean allophone model, the present invention can generate an acoustic model with as many allophones as the user desires. When an allophone acoustic model produced with the invention is used in a variable-vocabulary speech recognizer, it occupies less memory at run time than the existing acoustic model. In addition, because there are fewer classified allophones, the recognizer processes input speech faster and its complexity is reduced.

These effects have a major impact on the performance and marketability of a variable-vocabulary speech recognizer that is ported to a general-purpose system such as a personal computer (PC) and commercialized. In experiments applying the invention to a small variable-vocabulary speech recognizer aimed at recognizing PC commands, a context-dependent allophone model with high recognition performance could be implemented with only about 200 to 400 allophones.

Therefore, applying the present invention to acoustic model optimization for Korean speech recognizers is highly effective in terms of recognizer speed, memory usage, and complexity, and can greatly accelerate the commercialization of variable-vocabulary speech recognizers.

Claims

1. A method for extracting the optimal number of allophones, applied to the acoustic modeling unit of a speech recognizer, the method comprising:

a first step of loading initial distribution information;

a second step of computing, for each pair of allophones belonging to one central phoneme, the entropy-based information loss of merging that pair and recording the minimum loss value and the corresponding allophone pair, repeated for all central phonemes; and

a third step of selecting, from the per-phoneme minimum-loss allophone pairs obtained in the second step, the pair with the smallest information loss, merging it, and updating the distribution information, this process and the second step being repeated until the model has been clustered to the desired number of phonemes.

2. The method of claim 1, wherein the second step comprises:

a fourth step of computing, for each pair of allophones belonging to one central phoneme, the entropy-based information loss between the allophones to be merged;

a fifth step of recording the smallest loss value among the information losses obtained in the fourth step, together with the corresponding allophone pair; and

a sixth step of repeating the fourth and fifth steps for all central phonemes to obtain the minimum-loss allophone pair for each central phoneme.

3. The method of claim 1 or 2, wherein the third step comprises:

a seventh step of comparing the minimum-loss allophone pairs of the central phonemes, finding the smallest value among them, and recording that value and the corresponding allophone pair;

an eighth step of merging the pair obtained in the seventh step, which minimizes the information loss, thereby reducing the number of allophones, and updating the distribution information; and

a ninth step of checking whether the number of allophones has been reduced to the desired number, returning to the second step if it has not, and outputting information on the currently existing allophones if it has.

4. A computer-readable recording medium storing a program that causes a speech recognizer having a processor to realize:

a first function of loading initial distribution information;

a second function of computing, for each pair of allophones belonging to one central phoneme, the entropy-based information loss of merging that pair and recording the minimum loss value and the corresponding allophone pair, repeated for all central phonemes; and

a third function of selecting, from the per-phoneme minimum-loss allophone pairs obtained by the second function, the pair with the smallest information loss, merging it, and updating the distribution information, this process and the second function being repeated until the model has been clustered to the desired number of phonemes.

5. The computer-readable recording medium of claim 4, wherein the second function comprises:

a fourth function of computing, for each pair of allophones belonging to one central phoneme, the entropy-based information loss between the allophones to be merged;

a fifth function of recording the smallest loss value among the information losses obtained by the fourth function, together with the corresponding allophone pair; and

a sixth function of repeating the fourth and fifth functions for all central phonemes to obtain the minimum-loss allophone pair for each central phoneme.

6. The computer-readable recording medium of claim 4 or 5, wherein the third function comprises:

a seventh function of comparing the minimum-loss allophone pairs of the central phonemes, finding the smallest value among them, and recording that value and the corresponding allophone pair;

an eighth function of merging the pair obtained by the seventh function, which minimizes the information loss, thereby reducing the number of allophones, and updating the distribution information; and

a ninth function of checking whether the number of allophones has been reduced to the desired number, returning to the second function if it has not, and outputting information on the currently existing allophones if it has.