KR100981540B1

KR100981540B1 - Speech recognition method of processing silence model in a continuous speech recognition system

Info

Publication number: KR100981540B1
Application number: KR1020030026055A
Authority: KR
Inventors: 구명완
Original assignee: 주식회사 케이티
Priority date: 2003-04-24
Filing date: 2003-04-24
Publication date: 2010-09-10
Also published as: KR20040092572A

Abstract

The present invention relates to a speech recognition method based on silence model processing in a continuous speech recognition system. It distinguishes several silence models for continuous speech recognition and, in particular, does not treat the silence between words as a word in the language model; instead, it builds a data structure in which that silence is the starting phoneme of the words belonging to each class. The aim is a speech recognition method that improves the recognition rate and allows a simple language model implementation.

To this end, the speech recognition method in a continuous speech recognition system comprises: a first step of defining silence models and generating a language model; a second step of generating a data structure to be used in the recognition process by configuring the intermediate silence in the language model as the starting phoneme of the word group in each class; a third step of performing Viterbi recognition on the words belonging to the start class using the data structure; a fourth step of tracing back only those values in the recognition result that end with the last phoneme of a word belonging to the end class, and segmenting them into phoneme units; and a fifth step of deleting the intermediate silence from the phoneme-unit segmentation information, verifying the result with anti-phone models, and outputting only the sequence of verified words, which is taken to be the recognized sentence.

Silence, Language Model, Speech Recognition, Trace-back, Viterbi

Description

Speech recognition method of processing silence model in a continuous speech recognition system

FIG. 1 is an exemplary configuration diagram of a general speech recognition system.

FIGS. 2A and 2B are diagrams illustrating one embodiment of the types of silence model that may appear in the continuous speech recognition grammar used in the present invention, and the HMM topology of the silence models.

FIGS. 3A and 3B are diagrams illustrating one embodiment of the recognition class data structure created when the statistical language model used in the present invention is employed.

FIG. 4 is a flowchart of one embodiment of the speech recognition method based on silence model processing according to the present invention.

* Explanation of reference numerals for the main parts of the drawings

11: endpoint detector 12: feature extractor

13: Viterbi searcher 14: pronunciation dictionary

15: phoneme model database 16: utterance verifier

17: anti-phone model database

The present invention relates to a speech recognition method based on silence model processing, which can improve the speech recognition rate in a continuous speech recognition system by handling the silence model not through the conventional language model approach but by treating the silence as the first phoneme of each word during recognition.

In general, one widely known speech recognition method uses a hidden Markov model (HMM). In such a method, a Viterbi search is performed as the recognition process: the HMMs built by training in advance on the candidate words to be recognized are compared against the features of the currently input speech, and the most similar candidate word is determined.

To aid understanding, a general speech recognition system is described with reference to FIG. 1.

First, when speech is input, the endpoint detector (11) finds the speech interval, excluding the silent intervals before and after the speech. The feature extractor (12) then extracts speech features from the speech signal of that interval.

Next, the Viterbi searcher (13) uses the speech feature values to select the words with the highest likelihood from among the words registered in the pronunciation dictionary (14), which is built on the phoneme model database (15).

Subsequently, the utterance verifier (16) uses the word selected by the Viterbi searcher (13) to segment the feature sequence into phoneme units, and then uses anti-phone models to obtain a likelihood ratio confidence score for each phoneme unit.

Utterance verification is a scheme that decides, using a confidence value (confidence score or confidence measure), whether to accept or reject a given recognition result. Here, the confidence is a measure of how trustworthy the recognition result is: a high confidence value means the result can be trusted and should be accepted, whereas a low value means the result is hard to trust and should be rejected.

Finally, if the word is rejected, the utterance verifier (16) performs the same utterance verification process on the next candidate word.

When a sentence is recognized, the same utterance verification process applies; only a grammar is added, and the verification is performed on a sentence basis.

This confidence differs in meaning from the Viterbi search score. The Viterbi search score simply expresses the likelihood of a word or phoneme, whereas the confidence is a relative value: the likelihood of the recognized phoneme or word compared with the probability that the utterance came from some other phoneme or word.

Determining the confidence requires phone models and anti-phone models.

A phone model is an HMM generated by extracting the phonemes actually uttered in speech and training on them. This is the model used in ordinary HMM-based speech recognition systems.

An anti-phone model, by contrast, is an HMM trained on phonemes that are very similar to the phoneme actually uttered (this group is called the cohort set).

Thus, a speech recognition system has a phone model and an anti-phone model for every phoneme it uses.

For example, for the phoneme "ㅏ" there is a phone model for "ㅏ" and an anti-phone model for "ㅏ".

The phone model for "ㅏ" is built by extracting only the "ㅏ" phonemes from a speech database and training on them in the usual HMM manner. To build the anti-phone model for "ㅏ", the cohort set for "ㅏ" must first be found. It can be obtained from phoneme recognition results: phoneme recognition is run, the phonemes other than "ㅏ" that were misrecognized as "ㅏ" are identified, and these are collected to form the cohort set for "ㅏ". That is, if phonemes such as "ㅑ, ㅓ, ㅕ" are the ones mainly misrecognized as "ㅏ", they constitute the cohort set, and running HMM training on them yields the anti-phone model for "ㅏ".
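The cohort-set selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format (pairs of spoken and recognized phonemes), the function name `cohort_set`, the `min_share` cutoff used to interpret "mainly misrecognized", and the toy counts are all hypothetical and do not come from the patent.

```python
from collections import Counter

def cohort_set(recognition_pairs, target, min_share=0.1):
    """Collect phonemes that were frequently misrecognized as `target`.

    recognition_pairs: iterable of (spoken, recognized) phoneme labels.
    min_share: hypothetical cutoff; a phoneme joins the cohort set when it
               accounts for at least this share of all confusions with target.
    """
    confusions = Counter(
        spoken
        for spoken, recognized in recognition_pairs
        if recognized == target and spoken != target
    )
    total = sum(confusions.values())
    if total == 0:
        return set()
    return {p for p, n in confusions.items() if n / total >= min_share}

# Toy phoneme recognition results (made-up counts).
pairs = (
    [("ㅏ", "ㅏ")] * 50   # correct recognitions are ignored
    + [("ㅑ", "ㅏ")] * 6
    + [("ㅓ", "ㅏ")] * 5
    + [("ㅕ", "ㅏ")] * 4
    + [("ㅗ", "ㅏ")] * 1  # rare confusion, falls below the cutoff
)
cohort = cohort_set(pairs, "ㅏ")
```

The audio segments of the phonemes in `cohort` would then be pooled as training data for the anti-phone HMM of "ㅏ"; the HMM training itself is outside this sketch.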

Once phone models and anti-phone models have been generated for all phonemes in this way, the confidence for input speech is calculated as follows.

First, the phone models are searched to find the single most similar phoneme.

Then the likelihood of the anti-phone model for the phoneme found is computed.

The final confidence is obtained by taking the difference between the likelihood of the phone model and the likelihood of the anti-phone model, and applying a predetermined function to it to adjust the range of the confidence value.
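A minimal sketch of this confidence calculation follows. The patent does not name the range-adjusting function, so the sigmoid used here is an assumption, as are the per-frame normalization, the `alpha` slope, the acceptance threshold, and the function names.

```python
import math

def confidence(loglik_phone, loglik_anti, n_frames, alpha=1.0):
    """Map a log-likelihood difference to a confidence value in (0, 1).

    loglik_phone: log-likelihood of the segment under the phone model.
    loglik_anti:  log-likelihood of the same segment under the anti-phone model.
    n_frames:     segment length; dividing by it is an assumed normalization.
    alpha:        assumed slope of the sigmoid (the "predetermined function").
    """
    llr = (loglik_phone - loglik_anti) / n_frames
    return 1.0 / (1.0 + math.exp(-alpha * llr))

def accept(conf, threshold=0.5):
    """Accept the recognition result when the confidence reaches the threshold."""
    return conf >= threshold

# The phone model explains the segment better than the anti-phone model.
good = confidence(-100.0, -120.0, n_frames=10)
# The anti-phone model explains the segment better, so the result is suspect.
bad = confidence(-120.0, -100.0, n_frames=10)
```

With these toy numbers `good` lies above 0.5 and `bad` below it, so a 0.5 threshold accepts the first result and rejects the second.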

A continuous speech recognition system using HMMs, however, must be able to recognize words spoken in succession. Recognition must succeed whether the user pronounces the words back to back or pauses briefly between words. To this end, it is assumed that silence may occur between words, and the recognition process is carried out with a silence model.

Conventionally, the silence model has been handled mainly by treating silence as a word within the language model, implemented as either a finite-state model or a statistical language model (bigram, trigram, etc.). With the widely used statistical approach, when a bigram is used and silence is defined as a word, word A and word B are no longer linked directly; the link necessarily becomes "word A, silence, word B". Because a bigram by definition can only describe the relation between two adjacent words, P(word B | word A) has to be derived from two probabilities, P(silence | word A) and P(word B | silence). However, when two or more words are involved, silence can fall between many words A, B, C, ..., so "P(word B | word A) ≠ P(silence | word A) × P(word B | silence)". The reason is that silence can be followed by word B, word C, and so on. In addition, silence can loop several times, and once this is represented, the relation between word A and word B carried by the bigram information can no longer be expressed.
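The inequality above can be checked numerically on a toy corpus. The sketch below is illustrative only: the four training sentences, the token name `sil`, and the helper `bigram_prob` (a plain maximum-likelihood estimate without smoothing) are assumptions made for this example.

```python
from collections import Counter

def bigram_prob(sentences, prev, word):
    """Maximum-likelihood bigram estimate P(word | prev) from token lists."""
    pair_counts = Counter()
    prev_counts = Counter()
    for tokens in sentences:
        for a, b in zip(tokens, tokens[1:]):
            pair_counts[(a, b)] += 1
            prev_counts[a] += 1
    return pair_counts[(prev, word)] / prev_counts[prev]

# Toy corpus: A is always followed by B, and C is always followed by D,
# with a pause ("sil") spoken between the two words each time.
corpus = [["A", "sil", "B"]] * 2 + [["C", "sil", "D"]] * 2

# Silence treated as a word: the A-to-B link is split into two bigrams.
via_silence = bigram_prob(corpus, "A", "sil") * bigram_prob(corpus, "sil", "B")

# Silence kept out of the language model: the A-to-B bigram is direct.
without_sil = [[w for w in tokens if w != "sil"] for tokens in corpus]
direct = bigram_prob(without_sil, "A", "B")
```

Here `direct` is 1.0 while `via_silence` is 0.5, because after `sil` the model can no longer tell whether the preceding word was A or C. This is the loss of bigram information that the passage describes.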

Accordingly, there is an essential need in this technical field for a way of handling the silence model in a continuous speech recognition system that recognizes two or more words in succession: rather than adopting the language model approach that treats the silence model as a word, a language model for continuous speech recognition should be constructed and continuous speech recognition performed with it.

The present invention has been proposed to solve the above problems. It distinguishes several silence models for continuous speech recognition and, in particular, does not treat the silence between words as a word in the language model; instead, it builds a data structure in which that silence is the starting phoneme of the words belonging to each class. Its object is to provide a speech recognition method that improves the recognition rate and allows a simple language model implementation.

To achieve the above object, the present invention provides a speech recognition method in a continuous speech recognition system, comprising: a first step of defining silence models and generating a language model; a second step of generating a data structure to be used in the recognition process by configuring the intermediate silence in the language model as the starting phoneme of the word group in each class; a third step of performing Viterbi recognition on the words belonging to the start class using the data structure; a fourth step of tracing back only those values in the recognition result that end with the last phoneme of a word belonging to the end class, and segmenting them into phoneme units; and a fifth step of deleting the intermediate silence from the phoneme-unit segmentation information, verifying the result with anti-phone models, and outputting only the sequence of verified words, which is taken to be the recognized sentence.


The present invention handles the silence model in a continuous speech recognition system that recognizes two or more words in succession not by adopting the language model approach that treats the silence model as a word, but by assuming that a silence phoneme may exist at the start of every word: when a word becomes a recognition candidate, it is processed so that it begins with either its first phoneme or the silence. In particular, this approach greatly improves performance for continuous digit recognition such as telephone numbers and resident registration numbers, where the speaker inserts silence every three or four words.

The above objects, features, and advantages will become clearer from the following detailed description taken in conjunction with the accompanying drawings. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

When a language model for continuous speech recognition is constructed, the silence models fall broadly into three types, as shown in FIG. 2A (two types if q1 and q2 are assumed to be identical). An example of the HMM topology of these silence models is shown in FIG. 2B.

In FIG. 2A, q1 (start silence) and q2 (end silence) model the silence at the start and end of continuous speech, and q3 (intermediate silence) models the silence between words.

In FIG. 2B, q1 and q2 are handled by the language model, and q3 is defined as an optional phoneme that precedes every word. Because q1 and q2 are the silences at the start and end of a sentence, they are defined as silences with three states, while q3 is defined as a silence with one state.
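The difference between the three-state and one-state silence models can be illustrated with simple transition matrices. FIG. 2B is not reproduced in this text, so the left-to-right topology without skips, the self-loop probabilities, and the helper names below are assumptions chosen only to show why more states suit the longer silences at the sentence boundaries.

```python
def left_to_right(n_states, self_loop):
    """Return an n_states x (n_states + 1) transition matrix.

    Row i holds the probabilities of leaving state i; the extra last column
    is the exit from the model. Each state either repeats (self-loop) or
    advances to the next state.
    """
    rows = []
    for i in range(n_states):
        row = [0.0] * (n_states + 1)
        row[i] = self_loop
        row[i + 1] = 1.0 - self_loop
        rows.append(row)
    return rows

def expected_frames(matrix):
    """Expected frames spent in the model: each state lasts 1 / (1 - self_loop)."""
    return sum(1.0 / (1.0 - row[i]) for i, row in enumerate(matrix))

# Assumed self-loop probabilities; real values would come from HMM training.
q1_hmm = left_to_right(3, self_loop=0.8)  # start silence, three states
q2_hmm = left_to_right(3, self_loop=0.8)  # end silence, three states
q3_hmm = left_to_right(1, self_loop=0.5)  # intermediate silence, one state
```

A three-state model must consume at least three frames, whereas the one-state q3 can pass in a single frame, which fits a brief pause between words.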

The language model for the silences of FIGS. 2A and 2B is created as follows: q1 and q2 are each treated as a word, q3 is defined as the starting phoneme of the following word group, and the probabilities corresponding to the bigram or trigram are computed.

FIGS. 3A and 3B illustrate the recognition process that uses the probabilities obtained with FIGS. 2A and 2B.

First, a data structure is created as in FIG. 3B. In the language model of FIG. 3A, class1 and class2, which represent word groups, are defined, and q1 and q2 are each defined as an independent class consisting of a single word.

FIG. 3B shows the data structure, which is organized class by class. That is, if class1 consists of the words F1, F2, ..., an optional q3 phoneme is added in front of every word belonging to class1. F1 and F2 may each be represented by a tree structure or a linear structure that expresses their phonemes. The relations between classes are expressed as probabilities. In addition, self-looping arcs are added to q1, q2, and q3 so that they can loop.
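One possible in-memory form of this class-based data structure is sketched below. The linear (not tree) phoneme lists, the type and field names, the phoneme spellings of F1, F2, and G1, and the class transition probabilities are all hypothetical; following the description of step 403, q3 is added to every class except the one consisting of q1.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

SELF_LOOPING = {"q1", "q2", "q3"}  # phonemes that carry a self-looping arc

@dataclass
class WordEntry:
    name: str
    phonemes: List[str]                     # linear phoneme sequence
    optional_silence: Optional[str] = None  # "q3" when a leading silence is allowed

    def paths(self) -> List[List[str]]:
        """Phoneme sequences a decoder may follow, without and with the silence."""
        out = [list(self.phonemes)]
        if self.optional_silence is not None:
            out.append([self.optional_silence] + list(self.phonemes))
        return out

@dataclass
class WordClass:
    name: str
    words: List[WordEntry] = field(default_factory=list)

def build_classes(lexicon: Dict[str, Dict[str, List[str]]],
                  no_prefix_classes: Tuple[str, ...] = ("q1",)) -> Dict[str, WordClass]:
    """Attach an optional q3 to every word of every class except no_prefix_classes."""
    classes = {}
    for class_name, words in lexicon.items():
        prefix = None if class_name in no_prefix_classes else "q3"
        entries = [WordEntry(w, ph, prefix) for w, ph in words.items()]
        classes[class_name] = WordClass(class_name, entries)
    return classes

# Hypothetical lexicon and class-to-class probabilities.
lexicon = {
    "q1": {"q1": ["q1"]},
    "class1": {"F1": ["f", "ai"], "F2": ["f", "o"]},
    "class2": {"G1": ["g", "u"]},
    "q2": {"q2": ["q2"]},
}
class_bigram = {("q1", "class1"): 1.0, ("class1", "class2"): 0.7,
                ("class1", "q2"): 0.3, ("class2", "q2"): 1.0}
classes = build_classes(lexicon)
```

Because q3 lives inside each word entry rather than in `class_bigram`, the class-to-class probabilities are stated directly between word classes, with no silence token in between.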

FIG. 4 is a flowchart of one embodiment of the speech recognition method based on silence model processing according to the present invention, showing the procedure for performing recognition from the language model information.

In overview, the continuous speech recognition system divides the silence model into three types (q1, q2, q3); q1 and q2 are configured as independent words to build the language model, and q3 is configured as the starting phoneme of the word group in each class to build the data structure used in recognition. Using this data structure, Viterbi recognition is performed on the words belonging to the start class; from the recognition result, only the words belonging to the last class are taken and segmented into phoneme units, and the segmentation is then used for verification.

The details are as follows.

First, a language model is built by defining three types of silence model. The silences at the start and end of a sentence are defined as q1 and q2; because they generally have some duration, they are composed of three states, and their self-looping is expressed in the language model. The silence between words is defined as q3 and composed of one state, but it is not considered in the language model; only the relations between words are represented by a statistical probability distribution.

Next, the class language model is used to create a data structure, containing the silence models, that suits the recognition process (401 to 403). That is, the data structure is created from the class word grammar so that recognition can proceed efficiently, and it is built such that a self-looping q3 phoneme precedes each word in every class other than the class consisting of q1.

In other words, the class and word information is read from the language model information (401), the classes to which q3 is to be added are selected from the class word grammar (402), and the data structure is built so that the q3 phoneme self-loops (403).

Next, Viterbi recognition is performed using the resulting data structure. Before the Viterbi process begins, the initial class is obtained (404), and the Viterbi process is performed only on the words belonging to that class (405). That is, only the word list of the initial class is set as the initial values for the Viterbi search (404), and then the ordinary Viterbi algorithm is run (405).
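The effect of seeding the search with the initial class only (404) before running an ordinary Viterbi pass (405) can be sketched as follows. This is a simplified state-level Viterbi in the log domain, not the patent's implementation: each word is collapsed to a single state, and the state names, scores, and transition probabilities are made up for illustration.

```python
import math

NEG_INF = float("-inf")

def viterbi(obs_loglik, log_trans, initial_states):
    """Log-domain Viterbi where only `initial_states` may start a path.

    obs_loglik: list over frames of {state: observation log-likelihood}.
    log_trans:  {(prev_state, next_state): log transition probability}.
    Returns (score, back): per-frame best scores and backpointers.
    """
    states = list(obs_loglik[0])
    # Step 404: states outside the initial class start at log(0).
    score = [{s: obs_loglik[0][s] if s in initial_states else NEG_INF
              for s in states}]
    back = [{s: None for s in states}]
    # Step 405: ordinary Viterbi recursion.
    for t in range(1, len(obs_loglik)):
        score.append({})
        back.append({})
        for s in states:
            best_prev, best = None, NEG_INF
            for p in states:
                cand = score[t - 1][p] + log_trans.get((p, s), NEG_INF)
                if cand > best:
                    best_prev, best = p, cand
            score[t][s] = best + obs_loglik[t][s]
            back[t][s] = best_prev
    return score, back

# Toy setup: "q1" is the initial class, "w" stands for a word of class1.
frames = [{"q1": -1.0, "w": -0.5},
          {"q1": -2.0, "w": -0.2},
          {"q1": -3.0, "w": -0.1}]
trans = {("q1", "q1"): math.log(0.5), ("q1", "w"): math.log(0.5), ("w", "w"): 0.0}
score, back = viterbi(frames, trans, initial_states={"q1"})
```

Although "w" has the better acoustic score at frame 0, it cannot start a path; the best path into "w" at frame 1 comes from "q1".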

Subsequently, only the words belonging to the last (end) class are taken from the recognition result (406) and segmented into phoneme units (407). That is, only the values that end with the last phoneme of a word belonging to the end class are traced back phoneme by phoneme. The traced-back sentence yields segmentation information at both the word level and the phoneme level. Because the q3 phoneme is a silence model, it is deleted from the phoneme segmentation information (408), and verification is then performed with the anti-phone models (409). That is, verification uses the anti-phone models and the phoneme segmentation information within the words from which the q3 phonemes have been removed.
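Steps 408 and 409 can be sketched as a filter followed by a per-word decision. The segment record layout, the function names, the rule of averaging phoneme confidences per word, the threshold, and the toy confidence table are all assumptions; the patent does not state how phoneme-level scores are combined into a word-level decision.

```python
def drop_intermediate_silence(segments, silence="q3"):
    """Step 408: remove q3 segments from the phoneme segmentation."""
    return [seg for seg in segments if seg["phoneme"] != silence]

def verify_words(segments, phoneme_confidence, threshold=0.5):
    """Step 409 (assumed rule): accept a word when the mean confidence of its
    phoneme segments reaches the threshold. Returns accepted words in order."""
    by_word = {}
    for seg in segments:
        by_word.setdefault(seg["word_index"], []).append(seg)
    accepted = []
    for index in sorted(by_word):
        segs = by_word[index]
        mean_conf = sum(phoneme_confidence(s) for s in segs) / len(segs)
        if mean_conf >= threshold:
            accepted.append(segs[0]["word"])
    return accepted

# Toy segmentation from trace-back: F1 was preceded by a short pause (q3).
segments = [
    {"word_index": 0, "word": "F1", "phoneme": "q3", "start": 10, "end": 14},
    {"word_index": 0, "word": "F1", "phoneme": "f", "start": 15, "end": 20},
    {"word_index": 0, "word": "F1", "phoneme": "ai", "start": 21, "end": 30},
    {"word_index": 1, "word": "G1", "phoneme": "g", "start": 31, "end": 40},
]
# Stand-in for phone vs. anti-phone scoring; there is no anti-phone model for q3.
toy_conf = {"f": 0.9, "ai": 0.7, "g": 0.2}

cleaned = drop_intermediate_silence(segments)
accepted = verify_words(cleaned, lambda seg: toy_conf[seg["phoneme"]])
```

Removing q3 first matters because the silence has no phone and anti-phone pair to score against; in this toy run F1 is accepted and G1 is rejected.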

The phoneme-unit trace-back process is described in more detail below.

During speech recognition, each time the time axis advances (the frame count increases), the phoneme corresponding to that time is found. Recognition therefore ends at the last frame of the input speech. The phoneme at that point indicates only the last phoneme of the last word, so the recognition result is known only once the phoneme sequence or word sequence from the first frame of the input speech up to the current frame has been obtained. This process is called "trace-back". It is made possible by storing, alongside each frame's search result, the phoneme (or word) sequence recognized up to that frame.

It is called trace-back because the recognition results are obtained from the last frame back to the first frame, that is, traced in reverse. This automatically yields the phoneme sequence recognized frame by frame from the first frame to the last frame of the input speech. In the present invention, q3 is allowed to loop within this phoneme sequence, and verification is performed with q1, q2, and q3 excluded from the verification step.
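The trace-back can be sketched with per-frame backpointers, a common way of storing the history that the text describes; the patent does not specify the storage format, so the backpointer layout, the function names, and the toy values below are assumptions. The start is restricted to states that end a word of the end class, as in step 406.

```python
from itertools import groupby

SILENCES = {"q1", "q2", "q3"}

def trace_back(back, final_scores, end_states):
    """Follow backpointers from the best end-class state to the first frame.

    back:         list over frames of {state: previous state}; frame 0 maps to None.
    final_scores: {state: score at the last frame}.
    end_states:   states that end a word of the end class.
    """
    candidates = {s: v for s, v in final_scores.items() if s in end_states}
    state = max(candidates, key=candidates.get)
    path = [state]
    for t in range(len(back) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    path.reverse()
    return path

def collapse(path):
    """Merge runs of the same label (looping frames) into one phoneme entry.
    Simplification: a phoneme genuinely repeated back to back would also merge."""
    return [label for label, _ in groupby(path)]

# Toy backpointers for five frames; "a" stands for the only phoneme of a word.
back = [{"q1": None},
        {"q3": "q1"},
        {"q3": "q3"},              # q3 loops for a second frame
        {"a": "q3"},
        {"q2": "a", "a": "a"}]
final_scores = {"a": -5.0, "q2": -6.0}  # "a" scores higher but is not an end state

path = trace_back(back, final_scores, end_states={"q2"})
phonemes = collapse(path)
to_verify = [p for p in phonemes if p not in SILENCES]
```

The frame-level path is `q1, q3, q3, a, q2`; after collapsing the q3 loop and dropping the three silence types, only `a` is passed on to verification.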

Finally, only the sequence of words that passed verification is taken to be the recognized sentence and is output (410).

The method of the present invention as described above may be implemented as a program and stored on a computer-readable recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, etc.).

The present invention described above is not limited by the foregoing embodiment and the accompanying drawings; it will be apparent to those of ordinary skill in the art to which the invention pertains that various substitutions, modifications, and changes are possible without departing from the technical spirit of the invention.

As described above, the present invention models the silence within the utterance interval as three types. In particular, the silence between words is defined with one-state HMM parameters and, instead of the conventional language model approach, is defined as a phoneme within each class: an optional silence that may precede every word in the class while that class is being recognized. Recognition is performed with a data structure built on this definition. Consequently, when a bigram or trigram is used, the intermediate silence (q3) is not defined as an independent word, so the constraints carried by the language model information are preserved, which improves the recognition rate and simplifies the language model for silence.

Claims

A speech recognition method in a continuous speech recognition system, comprising:

a first step of defining silence models and generating a language model;

a second step of generating a data structure to be used in the recognition process by configuring the intermediate silence in the language model as the starting phoneme of the word group in each class;

a third step of performing Viterbi recognition on the words belonging to the start class using the data structure;

a fourth step of tracing back only those values in the recognition result that end with the last phoneme of a word belonging to the end class, and segmenting them into phoneme units; and

a fifth step of deleting the intermediate silence from the phoneme-unit segmentation information, verifying the result with anti-phone models, and outputting only the sequence of verified words, which is taken to be the recognized sentence.

The method of claim 1,

wherein, in the first step,

the silence model is divided into three types (q1, q2, q3): the silence at the start and end of a sentence is defined as q1 and q2, is composed of three states because it generally has some duration, and has its self-looping expressed in the language model; and the silence between words is defined as q3 and is composed of one state, but is not considered in the language model, where only the relations between words are represented by a statistical probability distribution.

The method of claim 2,

wherein, in the second step,

a data structure is created from the class word grammar of the first step so that recognition can proceed efficiently, the data structure being built such that a self-looping q3 phoneme precedes each word in every class other than the class consisting of q1.

The method according to any one of claims 1 to 3,

wherein, in the third step,

the Viterbi process is performed using the result of the second step, the initial (start) class being obtained before the Viterbi process begins and the Viterbi process being performed only on the words belonging to that initial (start) class.

The method of claim 4,

wherein, in the fourth step,

phonemes that end a word belonging to the end (last) class are found in the Viterbi result and traced back, word and phoneme segmentation information is obtained, and the q3 phonemes are removed from the segmentation information.

The method of claim 4,

wherein, in the verification process of the fifth step,

verification is performed using the anti-phone models and the phoneme segmentation information within the words from which the q3 phonemes have been removed.

delete