KR20020045960A

KR20020045960A - Method for performance improvement of keyword detection in speech recognition

Info

Publication number: KR20020045960A
Application number: KR1020000075412A
Authority: KR
Inventors: 김회린; 김규홍; 전호현
Original assignee: 이계철; 주식회사 케이티; 전창오; 정보통신연구진흥원; 안병엽; 학교법인 한국정보통신학원
Priority date: 2000-12-12
Filing date: 2000-12-12
Publication date: 2002-06-20

Abstract

PURPOSE: A method of improving keyword detection in speech recognition is provided that reduces the rate at which a word is erroneously recognized as a keyword, thereby improving the quality of speech recognition services. CONSTITUTION: A Viterbi search is performed using a variable-vocabulary word recognizer, so that recognition is performed basically in word units. The recognized word is internally recognized in phoneme units, and each recognized phoneme unit is compared with its anti-phoneme model to obtain a confidence measure. The phoneme-level confidence measures are averaged to convert them into a word-level confidence measure. Keyword recognition performance is improved by verifying the keyword recognition result against the anti-phoneme models.

Description

Method for performance improvement of keyword detection in speech recognition

The present invention relates to a method for improving keyword detection performance in speech recognition. Existing speech recognition systems offer a very natural and convenient interface because users operate them by voice, but they cannot handle utterances that are not registered in the recognizer. To address this, the invention recognizes only the words in the recognition vocabulary and rejects everything else without producing a recognition result, thereby improving system performance.

In general, a two-pass structure implements the verification function as a post-processing stage of the recognizer. Because only the verification stage is added, without major changes to an existing system, the time needed for implementation is reduced.

FIG. 1 shows a variable-vocabulary word recognition system equipped with a conventional utterance verification system. It consists of an endpoint detection unit (1) that detects the input speech segment; a feature extraction unit (2) that extracts features from the output of the endpoint detection unit (1); a variable-vocabulary word recognition system (3) that performs a Viterbi search using the signal from the feature extraction unit (2) and a pronunciation dictionary; and an utterance verification system (4) that takes the word recognized by the variable-vocabulary word recognition system (3) and verifies the utterance with reference to an anti-model (5). In the figure, 6 denotes the phoneme models and 7 the silence model.

When designing such a conventional utterance verification system (4), three problems must be solved.

First, an appropriate confidence measure must be defined, based on a set of verification models that can reliably separate unregistered (out-of-vocabulary) words and misrecognized words.

Second, a training procedure must be chosen that adapts the verification models so as to minimize verification errors on the training data.

Third, the verification system must be designed to be insensitive to variations in likelihood, variations in the verification threshold, and mismatches between training and test conditions.

The anti-phoneme model and confidence measure used in the present invention are intended to satisfy these requirements as far as possible.

The network used for the Viterbi search in the variable-vocabulary word recognition system (3) is shown in FIG. 2, and the recognized result appears as a sequence of registered words and phonemes.

That is, as illustrated, the result takes the form "sil + (sequence of registered words and phonemes) + sil". If the word penalty is tuned well, as shown in FIG. 3, then when the input speech is a registered word the recognized result appears as "sil + (a registered word and a few phonemes) + sil", whereas when the input speech is an unregistered word the result appears as "sil + (a registered word and many phonemes) + sil" or "sil + (many phonemes) + sil".

In the next step, the recognized result is passed to the utterance verification system (4). There, unregistered words can be rejected using the word penalty of the variable-vocabulary word recognition system (3) and the number of inserted phonemes in the recognized result.

In addition, even when the number of inserted phonemes is at or below the predetermined threshold, the result is rejected unconditionally if it contains no registered word or contains two or more. Because this approach rejects unregistered words using the word penalty and the number of inserted phonemes, the threshold depends on two factors: the word penalty and the inserted-phoneme count.

Consequently, both factors must be tuned together to reach the desired performance, which is laborious, and unregistered-word rejection performance also suffers.
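The conventional rejection rule described above can be sketched as follows. This is a minimal illustration, not the prior-art system itself: the function name, the representation of the recognized result as a list of symbols, and the example vocabulary are all hypothetical, and the sketch assumes that no phoneme symbol coincides with a registered word.

```python
def accept_by_insertions(units, registered_words, max_inserted_phonemes):
    """Conventional-style check (illustrative only).

    units: recognized symbols, e.g. ["sil", "a", "seoul", "sil"].
    Accept only if exactly one registered word is present and the number
    of inserted phonemes does not exceed the threshold.
    """
    core = [u for u in units if u != "sil"]          # drop silence markers
    words = [u for u in core if u in registered_words]
    inserted = len(core) - len(words)                # everything else is a phoneme
    if len(words) != 1:
        return False                                 # zero or 2+ registered words: reject
    return inserted <= max_inserted_phonemes


vocab = {"seoul", "busan"}                           # hypothetical registered words
print(accept_by_insertions(["sil", "a", "seoul", "sil"], vocab, 2))
print(accept_by_insertions(["sil", "a", "b", "k", "sil"], vocab, 2))
```

The sketch exposes only the inserted-phoneme threshold; in the system described, the word penalty used during the search also shapes how many phonemes are inserted, which is why the two settings have to be tuned jointly.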

Most speech recognition systems provide a very natural and convenient interface because users operate them by voice.

However, they cannot handle utterances that are not registered in the recognizer, so users are restricted to the predetermined registered words.

The present invention is intended to solve these problems. Its object is to provide a method for improving keyword detection performance in speech recognition that achieves good keyword verification performance by proposing a way to generate anti-phoneme models, which are used as the anti-models, and by proposing, as the keyword verification method, a new confidence measure based on the likelihoods of the regular phoneme model and the anti-model for each phoneme of the input word.

FIG. 1 shows a variable-vocabulary word recognition system equipped with an utterance verification system.

FIG. 2 shows the network of the variable-vocabulary keyword system.

FIG. 3 shows the network of a general variable-vocabulary keyword recognition system.

FIG. 4 shows a variable-vocabulary keyword recognition system equipped with the utterance verification system proposed in the present invention.

<Description of reference numerals for main parts of the drawings>

1: endpoint detection unit

2: feature extraction unit

3A: variable-vocabulary keyword recognition system

4: utterance verification system

5: anti-model

6: phoneme model

7: silence model

To achieve this object, the present invention provides, in a keyword recognition system, a method comprising: a first step of performing a Viterbi search using a variable-vocabulary word recognizer, so that recognition is performed basically in word units; a second step in which the word recognized in the first step is internally recognized in phoneme units and each recognized phoneme unit is compared with its corresponding anti-phoneme model to obtain a confidence measure; and a third step of averaging the phoneme-level confidence measures in order to convert them into a word-level confidence measure.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of the variable-vocabulary word recognition system of the present invention, which has an unregistered-word rejection function. It consists of an endpoint detection unit (1) that detects the input speech segment; a feature extraction unit (2) that extracts features from the output of the endpoint detection unit (1); a variable-vocabulary keyword recognition system (3A) that performs a Viterbi search using the signal from the feature extraction unit (2) and a pronunciation dictionary; and an utterance verification system (4) that takes the keyword recognized by the variable-vocabulary keyword recognition system (3A) and verifies the utterance with reference to an anti-model (5).

Of the reference numerals not yet described, 6 denotes the phoneme models and 7 the silence model.

The operation of the present invention is described below.

The present invention is a two-stage system in which the utterance verification system (4) is attached as a post-processor after the variable-vocabulary keyword recognition system (3A).

In the first stage, the variable-vocabulary keyword recognition system (3A) performs recognition by applying the Viterbi search algorithm to word models built from 39 phoneme models.

The phoneme models are hidden Markov models (HMMs) whose parameters are optimized under the maximum likelihood (ML) criterion.

During recognition, the utterance of each word is segmented into phoneme hypotheses, and the result is passed to the utterance verification system (4).

In the second stage, the utterance verification system (4) computes, for the phoneme sequence of each recognized candidate word, a confidence measure against the anti-model (5) and from it determines the confidence value of the word. If this value is greater than a predetermined threshold, the candidate is accepted as that word; if it is smaller, it is rejected.

FIG. 2 shows the network of the variable-vocabulary keyword recognition system to which the present invention is applied. When a network consisting only of registered words is used, the recognized result takes the form "sil + registered word + sil".

That is, whether the input speech is a registered word or an unregistered word, the variable-vocabulary keyword recognition system (3A) always outputs a registered word.

The recognized result is then passed to the proposed utterance verification system (4), which determines whether it is a registered word or an unregistered word.

An anti-phoneme model represents the set of similar phonemes that excludes the phoneme itself. In general, the larger the similar-phoneme set, the better the anti-phoneme is modeled, but if the set becomes too large, the amount of training data required becomes excessive.

The anti-phoneme model proposed in the present invention is a new anti-phoneme model that satisfies both of these considerations.

That is, given only 39 well-trained phoneme models, anti-phoneme models can be built without any special training for that purpose. In addition, the similar-phoneme set for each phoneme includes every phoneme except the phoneme itself and silence, so that the anti-phoneme is modeled well and verification errors are minimized.
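The construction of the similar-phoneme sets described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the placeholder phoneme labels p0 to p38, and the silence label "sil" are hypothetical, and the sketch only builds the membership of each set, not the acoustic models themselves.

```python
def build_anti_phoneme_sets(phonemes, silence="sil"):
    """Map each phoneme to the list of all other phonemes.

    The phoneme itself and silence are excluded, so no additional
    training is needed: each set simply points at existing models.
    """
    speech = [p for p in phonemes if p != silence]
    return {p: [q for q in speech if q != p] for p in speech}


# Hypothetical inventory: 39 phoneme labels plus silence.
inventory = [f"p{n}" for n in range(39)] + ["sil"]
anti_sets = build_anti_phoneme_sets(inventory)
print(len(anti_sets), len(anti_sets["p0"]))
```

With 39 phonemes, each set contains the 38 remaining phoneme models, which is the largest set available without training anything new.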

A measure of this kind not only detects errors in keyword recognition well but also increases the discriminability of keywords.

Because the present invention performs a Viterbi search using a variable-vocabulary keyword recognizer, recognition is performed basically in word units, but each recognized word is internally recognized in phoneme units.

Accordingly, each recognized phoneme unit is compared with its corresponding anti-phoneme model to obtain a confidence measure, and the phoneme-level confidence measures are averaged to convert them into a word-level confidence measure.
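The conversion from phoneme-level to word-level confidence, followed by the accept/reject decision, can be sketched as follows. This is a minimal illustration that uses a plain arithmetic mean; the function names and the example values are hypothetical, and the per-phoneme confidences are assumed to have been computed already.

```python
def word_confidence(phoneme_confidences):
    """Average the phoneme-level confidences of one recognized word."""
    if not phoneme_confidences:
        raise ValueError("a recognized word must contain at least one phoneme")
    return sum(phoneme_confidences) / len(phoneme_confidences)


def verify_word(phoneme_confidences, threshold):
    """Accept the word only if its confidence exceeds the threshold."""
    return word_confidence(phoneme_confidences) > threshold


# Hypothetical per-phoneme confidences for two candidate words.
print(verify_word([0.5, -0.2, 0.3], threshold=0.0))
print(verify_word([-1.0, 0.2], threshold=0.0))
```

Averaging keeps the score comparable across words of different lengths, so a single threshold can be applied to every candidate.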

First, a confidence measure was chosen that uses utterance verification models corresponding to 39 different patterns, $\Theta = \{\Theta_1, \ldots, \Theta_l, \ldots, \Theta_{39}\}$.

For each pattern $l$, the phoneme model is denoted $\Theta_l^{(k)}$ and the anti-model, that is, the anti-phoneme model, is denoted $\Theta_l^{(a)}$, so that $\Theta_l = \{\Theta_l^{(k)}, \Theta_l^{(a)}\}$.

The word-level confidence, obtained by averaging over the phoneme units, is therefore given by Equation 1.

If this confidence is less than or equal to a predetermined threshold $\tau_s$, the word is rejected. Here $k$ is a negative constant, and the registered word $i$ output by the variable-vocabulary word recognizer consists of $N(i)$ phonemes.
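Equation 1 itself is not reproduced in this text. One form that is consistent with the symbols named above (a negative constant k, N(i) phonemes, and a threshold tau_s) is a log-mean-exp average of the per-phoneme distances; that form is an assumption adopted here purely for illustration, as are the function names.

```python
import math


def soft_average(distances, k=-1.0):
    """Assumed form: (1/k) * log( mean( exp(k * d) ) ), with k < 0.

    For k close to 0 this approaches the arithmetic mean; for very
    negative k it approaches the smallest distance.
    """
    if not distances:
        raise ValueError("at least one phoneme distance is required")
    if k >= 0:
        raise ValueError("k is assumed to be a negative constant")
    scaled = [k * d for d in distances]
    shift = max(scaled)  # subtract the maximum for numerical stability
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return (shift + math.log(mean_exp)) / k


def is_rejected(distances, tau_s, k=-1.0):
    """Reject when the word-level confidence is at or below tau_s."""
    return soft_average(distances, k) <= tau_s


print(soft_average([1.0, 3.0], k=-1.0))
print(is_rejected([1.0, 3.0], tau_s=2.0))
```

Under this assumed form, a negative k weights poorly scoring phonemes more heavily than a plain mean would, so a single badly matched phoneme pulls the word-level confidence down.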

The likelihood-ratio distance of each phoneme with respect to its anti-phoneme model, $Lr_{i(q)}(O_q; \Theta)$, is given by Equation 2.

For a general phoneme of pattern $l$ (that is, $l = i(q)$), Equations 3 and 4 apply.
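Equations 2 to 4 are not reproduced in this text. The sketch below is therefore not the patent's formulation; it only illustrates the general idea of a log-likelihood ratio between a phoneme model and an anti-model assembled from the remaining phoneme models. Combining the competitor scores with a log-mean-exp, the input format (one duration-normalized log-likelihood per phoneme model), and all names are assumptions.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    shift = max(values)
    return shift + math.log(sum(math.exp(v - shift) for v in values) / len(values))


def phoneme_likelihood_ratio(segment_scores, target, silence="sil"):
    """Log-likelihood ratio of one segment: target model vs. anti-model.

    segment_scores maps each phoneme label to the log-likelihood of the
    segment under that phoneme's model. The anti-model score is assumed
    to pool all models except the target and silence.
    """
    g_target = segment_scores[target]
    competitors = [s for p, s in segment_scores.items()
                   if p != target and p != silence]
    return g_target - log_mean_exp(competitors)


# Hypothetical scores for one segment under a tiny 3-phoneme inventory.
scores = {"a": -1.0, "b": -5.0, "c": -5.0, "sil": 0.0}
print(phoneme_likelihood_ratio(scores, "a"))
print(phoneme_likelihood_ratio(scores, "b"))
```

A positive ratio means the segment fits the hypothesized phoneme better than the pooled competitors, and a negative ratio means the opposite, which is the property a verification stage needs from a per-phoneme score.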

The observation probability used in computing Equations 3 and 4 is given by Equation 5.

Here $c_{jk}$ is the mixture weight, $j$ denotes the state of each phoneme, and $k$ denotes the mixture component within each state.

$N$ denotes the Gaussian distribution of each mixture component, $\mu_{jk}$ is the mean vector, and $U_{jk}$ is the covariance matrix.
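Equation 5 is not reproduced in this text. Given the symbols defined above, it presumably has the standard Gaussian-mixture form used for HMM state observation densities. The notation $b_j(o_t)$ for the observation probability of frame $o_t$ in state $j$, and $M$ for the number of mixture components per state, are assumptions introduced here:

```latex
b_j(o_t) \;=\; \sum_{k=1}^{M} c_{jk}\,
  \mathcal{N}\!\left(o_t;\ \mu_{jk},\ U_{jk}\right),
\qquad
\sum_{k=1}^{M} c_{jk} = 1,\quad c_{jk} \ge 0 .
```

In this form, the mixture weights of each state sum to one, so $b_j(o_t)$ is a valid probability density over the feature vectors.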

A confidence measure such as that of Equation 2 not only detects near-miss errors in speech recognition well but also discriminates better between keywords and non-keywords.

In practice, when a keyword is verified, $s_i(O; \Theta)$ is greater than the utterance verification threshold $\tau_s$, and when a non-keyword is verified it is smaller than $\tau_s$, so the overall performance of the keyword recognizer is improved.

As described above, the present invention recognizes only the words in the recognition vocabulary and rejects everything else without producing a recognition result, which reduces misrecognitions. Because performance is improved while the structure of the existing keyword recognition system is kept intact, the quality of speech recognition services can be improved.

Claims

In a keyword recognition system, a method for improving keyword detection performance in speech recognition, comprising:

a first step of performing a Viterbi search using a variable-vocabulary word recognizer, so that recognition is performed basically in word units;

a second step in which the word recognized in the first step is internally recognized in phoneme units, and each recognized phoneme unit is compared with its corresponding anti-phoneme model to obtain a confidence measure; and

a third step of averaging the phoneme-level confidence measures in order to convert them into a word-level confidence measure,

wherein keyword recognition performance is improved by verifying the keyword recognition result against the anti-phoneme models.

The method of claim 1, wherein the normalized likelihood of the anti-phoneme model, and the keyword-level normalized likelihood obtained from it, are used as the keyword verification method.

The method of claim 1, wherein, for the word-level confidence obtained in the third step by averaging over the phoneme units,

the keyword is rejected if the confidence is less than or equal to a predetermined threshold.

The method of claim 3, wherein the likelihood-ratio distance of each phoneme with respect to its anti-phoneme model is

$Lr_{i(q)}(O_q; \Theta)$ = (formula not reproduced in the source text).