KR100608644B1

KR100608644B1 - Method For Recognizing Variable-Length Connected Digit Speech

Info

Publication number: KR100608644B1
Application number: KR1020030074298A
Authority: KR
Inventors: 최인정
Original assignee: 삼성전자주식회사
Priority date: 2003-10-23
Filing date: 2003-10-23
Publication date: 2006-08-02
Also published as: KR20050038973A

Abstract

The present invention relates to a method for recognizing spoken digits.

The method for recognizing variable-length connected digits according to the present invention includes step (a) of generating an acoustic model and a language model for subdivided digit sounds by distinguishing, in a predetermined manner, digit sounds that are pronounced differently depending on the adjacent digits to the left and right and on whether liaison occurs, and step (b) of receiving variable-length connected digit speech and recognizing the digits in a predetermined manner using the acoustic model and the language model.

According to the present invention, a high recognition rate can be achieved for digit sounds that are pronounced differently depending on the adjacent digits to the left and right and on liaison.

Speech recognition, connected digits, subdivided digit sounds, transition penalty

Description

Method for Recognizing Variable-Length Connected Digit Speech

FIG. 1 is a diagram showing an HBT model according to the prior art.

FIG. 2 is a diagram for explaining a connected speech recognition method using the HBT model according to the prior art.

FIG. 3 is a diagram showing a subdivided digit sound model according to an embodiment of the present invention.

FIG. 4 is a functional block diagram of an apparatus for recognizing variable-length connected digits according to an embodiment of the present invention.

FIG. 5 is a functional block diagram of an apparatus for generating and updating an acoustic model and a language model according to an embodiment of the present invention.

FIG. 6 is a flowchart showing a digit recognition process and a process of generating and updating an acoustic model and a language model according to an embodiment of the present invention.

The present invention relates to spoken digit recognition, and more particularly to a method for recognizing variable-length connected digits based on a subdivided digit sound model that distinguishes digit sounds pronounced differently depending on the adjacent digits to the left and right and on whether liaison occurs in the utterance.

The human desire for a convenient life drives technological progress in many fields, and speech recognition is one of them. Speech recognition technology has been advancing from isolated word recognition to continuous speech recognition, and the size of the recognizable vocabulary is steadily growing. It is also moving from speaker-dependent recognition toward speaker-independent recognition. The input speech handled has progressed from deliberately articulated speech, such as reading aloud, to everyday conversational speech, and recognition in low-noise conditions is giving way to recognition in noisy real-world environments.

The accuracy required of speech recognition depends on its purpose. In everyday conversation, for example, meaning can be conveyed without every syllable being recognized correctly, whereas commercial transactions such as banking, and digits in general, demand very high accuracy. Existing methods for digit recognition have several problems. When variable-length connected digits are recognized with simple word models, performance is poor because the models cannot reflect whether liaison occurs between digits or the coarticulation effects caused by liaison. This problem is especially pronounced for Korean connected digits, in which, unlike English, every digit is a single syllable. For example, insertion or deletion errors are very likely between "오오" (o-o, 5-5) and "오" (o, 5), between "이이" (i-i, 2-2) and "이" (i, 2), or between "이일" (i-il, 2-1) and "일" (il, 1). Various methods have therefore been tried to raise the digit recognition rate.

One technique with a comparatively good digit recognition rate is connected digit recognition using word models with a head-body-tail (HBT) structure. FIG. 1 shows the HBT model. In the HBT model each digit has a head, a body, and a tail. Words whose beginnings or endings are pronounced alike share a head or a tail. The body is a context-independent model, while the head and tail are context-dependent models that take into account the words that can be adjacent on the left and right. HBT-based speech recognition applies minimum classification error (MCE) training in addition to maximum likelihood (ML) training. The HBT approach is mainly applied to connected speech recognition of digits, dates, days of the week, times, and the like, and reduces the misrecognition rate by 17 to 24% overall compared with whole-word models.

In the HBT model every word has a head, body, and tail structure. As shown in FIG. 2, when the two digits "일이" (il-i, 1-2) are connected, "일" (1) and "이" (2) each have one body and a plurality of heads and tails. The body of "일" is context-independent, but it has as many tails as there are digits that can follow it. Likewise, the body of "이" is context-independent, but it has as many heads as there are digits that can precede it. For Korean, each digit has 10 to 11 heads and 10 to 11 tails. The connections between the tail of each word model and the head of the next word are governed by a grammar or an n-gram language model.

Because the HBT model takes the left and right adjacent digits into account, it performs better than earlier digit recognition methods. However, like those methods, the HBT model cannot reflect the pronunciation variability caused by the presence or absence of liaison. In particular, unlike in English, in Korean the recognition rate for digits such as "이" (2) or "오" (5) can fall if liaison is not considered. A digit recognition technique is therefore needed that achieves a better recognition rate by considering the effect of liaison as well as the influence of adjacent digits.

The present invention has been devised to meet this need, and its technical object is to provide a method for recognizing variable-length connected digits that achieves a high digit recognition rate by considering both the influence of adjacent digits and the effect of liaison.

To achieve this object, the method for recognizing variable-length connected digits according to the present invention includes step (a) of generating an acoustic model and a language model for subdivided digit sounds by distinguishing, in a predetermined manner, digit sounds that are pronounced differently depending on the adjacent digits to the left and right and on whether liaison occurs, and step (b) of receiving variable-length connected digit speech and recognizing the digits in a predetermined manner using the acoustic model and the language model. In step (a), the acoustic model preferably models subdivided digit sounds obtained by distinguishing, for each digit, the variants pronounced differently depending on the adjacent digits to the left and right and on liaison, and the language model is preferably a bigram language model that defines the connection relationships among the subdivided digit vocabulary items. The method preferably further includes learning a word transition penalty that varies with the digit pair and with whether liaison occurs, and recognizing the digits by applying the learned word transition penalty. The variable word transition penalty may be learned by MCE (Minimum Classification Error) training.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

The digit "일" (il, 1) can be pronounced in various ways depending on the digit that follows it. For example, as in FIG. 3, when it is followed by "구" (gu, 9) it is pronounced "일", but when it is followed by "육" (yuk, 6) it is often pronounced "일ㄹ", and when it is followed by "오" (o, 5) it is likely to be pronounced "이ㄹ". Moreover, the same digit string is not always pronounced the same way: "일구" (1-9) is pronounced differently when "일" and "구" are read with a pause between them than when they are read continuously. To account for pronunciation variation due to liaison, the present invention treats the paused reading and the connected reading as distinct. A paused reading is marked with a trailing #, and a connected reading with a trailing $. In this way the invention subdivides digit sounds according to the influence of the adjacent digits and of liaison. That is, "일#", "일$", "이ㄹ", and "일ㄹ" are all treated as different digit sounds. The subdivided digit sounds are described further below with reference to FIG. 4.

The variable-length connected digit recognition apparatus includes a speech endpoint detection unit (10), a feature extraction unit (20), a Viterbi search unit (30), a re-estimation unit (40), an acoustic model DB (50), a pronunciation dictionary DB (60), and a language model DB (70).

The speech endpoint detection unit (10) determines the start and end points of the input speech. When digits are read with pauses, there are intervals with no speech input even though the connected digit string has not been fully entered, so the unit must judge how long speech has been continuously absent and decide at an appropriate moment that digit input has ended. To decide whether input has ended, the user may be asked to hold a button while speaking, with the end of the connected digits judged from the speech input while the button is held. Alternatively, the end may be determined automatically from the energy level of the input and the duration of intervals without speech. Either way, accurately locating the end of the input digits affects connected digit recognition performance.
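The automatic option described above (energy level plus duration of silence) can be sketched as follows. This is a minimal illustration, not the patent's detector: the function name `detect_endpoints`, the per-frame energy input, and both thresholds are assumptions introduced here.

```python
from typing import List, Optional, Tuple


def detect_endpoints(
    frame_energies: List[float],
    energy_threshold: float,
    max_silence_frames: int,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the utterance, or None if no speech.

    Hypothetical rule: the utterance starts at the first frame whose energy
    reaches the threshold and is declared finished once max_silence_frames
    consecutive low-energy frames follow a speech frame.
    """
    start: Optional[int] = None
    last_speech: Optional[int] = None
    silence_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            if start is None:
                start = i
            last_speech = i
            silence_run = 0  # a pause between digits has ended
        elif start is not None:
            silence_run += 1
            if silence_run >= max_silence_frames:
                break  # silence lasted long enough: input has ended
    if start is None or last_speech is None:
        return None
    return start, last_speech
```

A larger `max_silence_frames` tolerates longer pauses between digits read one at a time, at the cost of a slower decision that input has ended.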

The feature extraction unit (20) extracts features from the input connected digit speech. Feature extraction means extracting from the speech signal the components useful for recognition, and it generally involves information compression and dimensionality reduction. No ideal feature extraction method is known at present. Research focuses mainly on perceptually meaningful feature representations that reflect human hearing, features robust to varying noise environments, speakers, and channels, and features that capture temporal change well. Features commonly used for speech recognition include the LPC (Linear Predictive Coding) cepstrum, the PLP (Perceptual Linear Prediction) cepstrum, MFCCs (Mel Frequency Cepstral Coefficients), the delta cepstrum, filter bank energy, and delta energy. The embodiment uses a 25-dimensional feature vector per frame: 12 MFCCs, 12 delta MFCCs, and 1 delta energy.
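The 25-dimensional layout (12 MFCC + 12 delta MFCC + 1 delta energy) can be assembled as in the sketch below. It assumes the 12 MFCCs and the log energy of each frame are already computed elsewhere. The regression-style delta with a two-frame window is a common choice and an assumption here, since the patent does not state how its delta terms are computed. The names `delta` and `build_features` are hypothetical.

```python
from typing import List


def delta(seq: List[List[float]], window: int = 2) -> List[List[float]]:
    """Regression-based delta coefficients; edge frames are repeated."""
    n_frames = len(seq)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out: List[List[float]] = []
    for t in range(n_frames):
        row: List[float] = []
        for d in range(len(seq[t])):
            num = 0.0
            for n in range(1, window + 1):
                ahead = seq[min(t + n, n_frames - 1)][d]
                behind = seq[max(t - n, 0)][d]
                num += n * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def build_features(mfcc: List[List[float]], log_energy: List[float]) -> List[List[float]]:
    """Stack 12 MFCC + 12 delta MFCC + 1 delta energy into 25-dim vectors."""
    if any(len(frame) != 12 for frame in mfcc):
        raise ValueError("expected 12 MFCCs per frame")
    if len(mfcc) != len(log_energy):
        raise ValueError("mfcc and log_energy must have the same number of frames")
    d_mfcc = delta(mfcc)
    d_energy = delta([[e] for e in log_energy])
    return [m + dm + de for m, dm, de in zip(mfcc, d_mfcc, d_energy)]
```

Note that the energy itself is not part of the vector. Only its delta is kept, matching the dimension count given in the embodiment.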

The Viterbi search unit (30) performs a forward search for the word sequence W that best matches the observed feature vector sequence of the connected digits, based on the models λ (the acoustic model, the pronunciation dictionary, and the language model). This process can be expressed as Equation 1.

Equation applied to the search for the most likely word sequence

In Equation 1, Pr(X|W, λ) is the probability that the sequence of acoustic models specified by the pronunciation dictionary matches the feature vector sequence, and Pr(W|λ) is the language model probability of the word sequence W. The forward search uses the Viterbi search algorithm to produce the full word lattice, and the backward search determines the final recognized digit string through the re-estimation unit (40). The Viterbi search algorithm is well known in speech recognition and is not described in detail here. To reduce search cost, the forward search applies triphone models only within words, and the re-estimation unit (40) applies cross-word triphone models to re-estimate the matching probabilities and obtain the final recognition result.
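The image of Equation 1 is not reproduced in this text. Based only on the two terms defined above, a form consistent with them would be the usual maximum a posteriori decoding rule, shown here as an assumed reconstruction rather than the patent's exact notation (the symbol \hat{W} for the selected word sequence is introduced here):

```latex
\hat{W} = \arg\max_{W} \; \Pr(X \mid W, \lambda)\,\Pr(W \mid \lambda)
```

Here X is the observed feature vector sequence, W ranges over candidate word sequences, and λ denotes the acoustic model, pronunciation dictionary, and language model taken together.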

The acoustic model DB (50) stores the acoustic models for connected digit recognition. The acoustic models used in the invention are based on hidden Markov models (HMMs). The unit of an acoustic model for speech recognition may be the phoneme, diphone, triphone, quinphone, syllable, word, and so on. Because the invention considers the influence of adjacent digits (coarticulation), diphones, triphones, or quinphones could be used. However, since the influence of both the preceding and the following context must be considered, and since the vocabulary is small because only digits are recognized, triphone acoustic models are preferable.

The pronunciation dictionary DB (60) stores pronunciation models for modeling the pronunciation of words, which are the units of recognition. Pronunciation models range from a simple model with one pronunciation per word, using the canonical pronunciation from a standard pronunciation dictionary, to multiple-pronunciation models that list several entries in the recognition lexicon to cover permitted variants, dialects, and accents, statistical pronunciation models that take the probability of each pronunciation into account, and phoneme-based lexical pronunciation models. The embodiment generates a phoneme-based pronunciation dictionary using a lexical pronunciation model and then expands it into a triphone pronunciation dictionary.

The language model DB (70) stores the grammar. Language models are mainly used in continuous speech recognition. Using a language model during search lets the recognizer reduce its search space, and because the language model raises the probability of grammatical sentences, it improves the recognition rate. Grammars include those for formal languages, such as FSN (Finite State Network) and CFG (Context-Free Grammar), and statistical grammars such as n-grams. An n-gram defines the probability of the next word given the previous n-1 words. Common types are the bigram, trigram, and 4-gram. The embodiment uses a bigram: digit sounds that are pronounced differently owing to adjacent digits and liaison are treated as different words, and the language model grammar governs which words may be connected, which raises the recognition rate. For example, for "일" (1) followed by "육" (6), "일ㄹ+육#" and "일#|육#" can be judged grammatical, whereas "이ㄹ+육#" can be judged ungrammatical, because allowing it would make it hard to distinguish "이 육" (2-6) from "일 육" (1-6). A sequence judged ungrammatical is given a small probability, which reduces the chance that the corresponding phoneme string is selected.
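The "small probability for ungrammatical pairs" idea can be illustrated with a toy bigram table over subdivided digit units. Only the three unit pairs come from the example in the text. The probability values, the floor value, and the names `ALLOWED`, `FLOOR`, and `bigram_log_prob` are assumptions made for this sketch.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical bigram probabilities for pairs the text calls grammatical.
ALLOWED: Dict[Tuple[str, str], float] = {
    ("일ㄹ", "육#"): 0.5,  # connected reading of 1-6
    ("일#", "육#"): 0.5,   # paused reading of 1-6
}

# Assumed small probability for pairs judged ungrammatical, e.g. ("이ㄹ", "육#").
FLOOR = 1e-6


def bigram_log_prob(units: List[str]) -> float:
    """Sum of log bigram probabilities over consecutive subdivided units."""
    total = 0.0
    for prev, cur in zip(units, units[1:]):
        total += math.log(ALLOWED.get((prev, cur), FLOOR))
    return total
```

With this table, `bigram_log_prob(["이ㄹ", "육#"])` is far lower than `bigram_log_prob(["일ㄹ", "육#"])`, so a search that adds this score to the acoustic score is unlikely to pick the disallowed sequence.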

A training process is needed to generate (or update) the acoustic model and the language model. First, speech samples actually uttered by a variety of speakers are required. The speech DB (110) collects and stores data covering the varied pronunciations of many speakers, and the word DB (120) stores word-level transcriptions of the speech stored in the speech DB (110). The decoding unit (130) finds the optimal state sequence for the speech samples and their word-level transcriptions. In the embodiment the decoding unit (130) uses the Viterbi algorithm, which is well known in the field and is not described in detail here. Through the decoding unit (130), the speech samples and word-level transcriptions are converted into phoneme-level transcriptions and into word-level transcriptions subdivided according to the invention, which are stored in the phoneme-level transcription DB (150) and the word-level transcription DB (140), respectively.

The phoneme-level transcriptions stored in the phoneme-level transcription DB (150) are used by the acoustic model training tool (160) to generate (or update) the acoustic model, which is stored in the acoustic model DB (180). The word-level transcriptions stored in the word-level transcription DB (140) are used by the language model training tool (170) to generate (or update) the language model, which is stored in the language model DB (190). In addition, after the language model training tool (170), the language model re-estimation unit (200) sets word transition penalties from the word-level transcriptions to generate the pronunciation dictionary (210), using MCE (Minimum Classification Error) training.

When connected digits are recognized by the Viterbi search unit (30) and the re-estimation unit (40) based on the subdivided digit sound model, the recognition error rate is higher when the length of the digit string is unknown than when it is known. This is especially noticeable for digits consisting of a single vowel. Recognition errors are examined in Tables 1 and 2.

[Table 1]

Comparison of errors occurring for each digit

As Table 1 shows, insertion and deletion errors for "일" (1), "이" (2), and "오" (5) are more pronounced than for the other digits.

Meanwhile, the error level differs depending on whether the length of the connected digit string is known. Table 2 shows the recognition error rates with and without length information.

[Table 2]

Error rates with and without length information

Model | Length determination error (length unknown) | String error rate (length unknown) | String error rate (length known)
HBT model | 11.00 | 18.16 | 11.81
Present invention | 11.97 | 17.18 | 7.75

As Table 2 shows, when the length is unknown the error rates of the present invention and of the HBT model are similar, but when the length is known the error rate of the present invention is much lower. That is, in the present invention the proportion of errors caused by misjudging the digit string length is higher than in the earlier method. Misjudging the length is closely tied to insertion and deletion errors, and various methods have been tried to reduce such errors. One is to apply a word insertion penalty according to word length. Equation 2 shows the formula used in conventional speech recognition to find the recognition result with a word insertion penalty.

Speech recognition with a word insertion penalty

However, for Korean digit recognition, whose vocabulary consists of single-syllable words, this method does not consider liaison, so assigning the same word insertion penalty everywhere makes it difficult to control word insertion and deletion errors.

The present invention therefore introduces a word transition penalty. Instead of the fixed word insertion penalty used in conventional speech recognition, a variable penalty is applied at each transition between subdivided digit sounds, so that the effect of liaison is reflected. The formula for the word transition penalty is given in Equation 3.

Speech recognition with a word transition penalty

The word transition penalty values were estimated iteratively using MCE (Minimum Classification Error) training so as to minimize errors on the training data.
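The difference between a fixed word insertion penalty and a pair-dependent word transition penalty can be sketched as below. The images of Equations 2 and 3 are not reproduced in this text, so the additive log-domain form, the function names `score_fixed` and `score_transition`, and every penalty value are assumptions for illustration only. In the patent the penalties are learned by MCE training, which this sketch does not implement.

```python
from typing import Dict, List, Tuple


def score_fixed(
    acoustic_log_lik: float,
    lm_log_prob: float,
    units: List[str],
    insertion_penalty: float,
) -> float:
    """Conventional scoring: the same penalty is added for every word."""
    return acoustic_log_lik + lm_log_prob + insertion_penalty * len(units)


def score_transition(
    acoustic_log_lik: float,
    lm_log_prob: float,
    units: List[str],
    penalties: Dict[Tuple[str, str], float],
    default: float = 0.0,
) -> float:
    """Pair-dependent scoring: each transition between subdivided units
    contributes its own penalty, looked up by the (previous, current) pair."""
    total = acoustic_log_lik + lm_log_prob
    for prev, cur in zip(units, units[1:]):
        total += penalties.get((prev, cur), default)
    return total
```

Because the lookup key includes the # and $ markers, a connected "오$" to "오#" transition can be penalized differently from a paused "오#" to "오#" transition, which a single fixed penalty cannot express.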

Table 3 shows an example of word transition penalty values obtained for each digit pair (word pair).

[Table 3]

Example word transition penalties

FIG. 6 is a flowchart summarizing a digit recognition process and a process of generating and updating an acoustic model and a language model according to an embodiment of the present invention.

First, spoken digit files and their word-level transcriptions are input (S10). The input files and word-level transcriptions go through a decoding process (S20), and the acoustic model and language model are generated (or updated) (S30). Then, when spoken digits are input (S40), they go through feature extraction (S50) and digit string determination (S60).

FIG. 7 is a block diagram showing a digit recognition process and an apparatus for generating and updating an acoustic model and a language model according to an embodiment of the present invention.

In training, the spoken digit DB and the word-level transcriptions are taken as input, and Viterbi decoding yields sequences of subdivided digit sound models and phoneme-level transcriptions. The phoneme-level transcriptions are used with the acoustic model training tool to train the acoustic model, and the sequences of subdivided digit sound models are used with the language model training tool to generate (or update) the language model. The language model is re-estimated by a discriminative training method such as MCE. Using all the knowledge sources trained in this way, input speech passes through speech endpoint detection, feature extraction, Viterbi search, and re-estimation, and the digit string with the highest matching score for the input speech is recognized.

Those of ordinary skill in the art to which the present invention pertains will understand that the invention can be embodied in other specific forms without changing its technical spirit or essential features. The embodiments described above are therefore to be understood as illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents are to be construed as falling within the scope of the invention.

According to the present invention, variation due to the left and right digits and the influence of liaison are taken into account in variable-length connected digit recognition, and the number of digit deletion and insertion errors is reduced, giving a high recognition rate. Table 4 compares the connected digit recognition error rate of the present invention with that of earlier methods.

[Table 4]

As Table 4 shows, using the subdivided digit vocabulary of the present invention gives a better recognition rate than earlier methods for digit strings of known length, and for strings of unknown length performance is improved by introducing the word transition penalty.

Claims

1. (Deleted)

2. A method for recognizing variable-length connected digits, comprising:

(a) generating an acoustic model and a language model for subdivided digit sounds by distinguishing, in a predetermined manner, digit sounds that are pronounced differently depending on the adjacent digits to the left and right and on whether liaison occurs; and

(b) receiving variable-length connected digit speech and recognizing the digits in a predetermined manner using the acoustic model and the language model,

wherein the initial consonant of a digit is pronounced differently depending on the final consonant of the adjacent digit, and

wherein, in step (a), the acoustic model models subdivided digit sounds obtained by distinguishing each digit into digit sounds that are pronounced differently depending on the adjacent digits to the left and right and on whether liaison occurs, and the language model is a bigram language model that defines the connection relationships among the subdivided digit vocabulary items.

3. The method of claim 2, further comprising learning, in a predetermined manner, a word transition penalty that varies with the digit pair and with whether liaison occurs, and recognizing the digits by applying the learned word transition penalty.

4. The method of claim 3, wherein the method of learning the variable word transition penalty is a minimum classification error (MCE) learning method.