KR20130126570A

KR20130126570A - Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof

Info

Publication number: KR20130126570A
Application number: KR1020130127604A
Authority: KR
Inventors: 곽철
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2013-10-25
Filing date: 2013-10-25
Publication date: 2013-11-20
Also published as: KR101483947B1

Abstract

The present invention relates to an apparatus for discriminative acoustic model training that takes phoneme errors in keywords into account, and to a computer-readable recording medium on which a method therefor is recorded. The apparatus comprises: a keyword detection unit that extracts, from an N-best recognition result (the speech recognition result for a word string containing a keyword), a keyword recognition result (the speech recognition result for that keyword); a keyword training unit that derives the phoneme model parameter distributions of correct phonemes and error phonemes according to the keyword recognition result; and a keyword discriminative training unit that applies the log likelihood of the error phonemes in the keyword recognition result as a weight and updates the phoneme model parameter distribution of the error phonemes, in proportion to the count of the error phonemes, so that the mutual information is minimized. [Reference numerals] (100) Feature extraction unit; (200) Search unit; (300) Acoustic model database; (400) Pronunciation dictionary database; (500) Language model database; (600) Text database; (700) Language model training unit; (800) Speech database; (910) Keyword detection unit; (912) Keyword database; (920) N-best alignment unit; (930) Keyword training unit; (940) N-best training unit; (950) Keyword discriminative training unit; (960) N-best discriminative training unit; (AA) Speech signal; (BB) Word string

Description

Apparatus for discriminative training of an acoustic model considering phoneme errors in keywords, and computer-readable recording medium storing the method thereof.

The present invention relates to discriminative training of acoustic models, and more particularly to an apparatus for discriminative acoustic model training that can minimize the mutual information of phonemes in the acoustic model by taking phoneme errors in keywords into account, and to a computer-readable recording medium on which such a discriminative training method is recorded.

Speech recognition is the identification of linguistic meaning from speech by automatic means. Specifically, it is the process of taking a speech waveform as input, identifying words or word strings, and extracting their meaning. Speech recognition can be broadly divided into five stages: speech analysis, phoneme recognition, word recognition, sentence interpretation, and meaning extraction. In a narrow sense, the term sometimes covers only the stages from speech analysis through word recognition.

As one way of improving the human-machine interface, speech recognition (entering information by voice) and speech synthesis (outputting information by voice) have long been the subject of research and development. Speech recognition and speech synthesis devices, which once required large equipment, can now be implemented on integrated circuits measuring a few millimeters on a side thanks to advances in large-scale integrated circuits (LSI), and speech input/output devices have thereby been put to practical use.

Speech recognition is currently used for telephone services such as bank balance inquiries, stock quote inquiries, mail-order applications, credit card inquiries, and hotel or airline seat reservations. These services, however, rely on isolated-word recognizers that recognize a limited number of words spoken one at a time. The ultimate goal of speech recognition is a complete speech or text conversion that recognizes naturally spoken speech and either accepts it as an executable command or enters it into a document as data. This means developing a speech understanding system that not only recognizes words but also accurately extracts the meaning of continuous speech or sentences by using syntactic information, semantic information, and task-related information and knowledge. Research and development of such systems is actively under way.

Korean Laid-Open Patent Publication No. 2013-0067854, published June 25, 2012 (Title: Corpus-based language model discriminative training method and apparatus)

An object of the present invention is to provide an apparatus for discriminative acoustic model training that can minimize the mutual information of phonemes by using keywords in a speech recognition system, and a computer-readable recording medium on which a method therefor is recorded.

To achieve the above object, an apparatus for discriminative acoustic model training according to a preferred embodiment of the present invention includes: a keyword detection unit that extracts, from an N-best recognition result (the speech recognition result for a word string containing a keyword), a keyword recognition result (the speech recognition result for that keyword); a keyword training unit that derives the phoneme model parameter distributions of correct phonemes and error phonemes according to the keyword recognition result; and a keyword discriminative training unit that applies the log likelihood of an error phoneme in the keyword recognition result as a weight and updates the phoneme model parameter distribution of the error phoneme, in proportion to the count of the error phoneme, so that the mutual information is minimized.

To achieve the above object, a computer-readable recording medium according to a preferred embodiment of the present invention has recorded on it a method for discriminative acoustic model training that includes: extracting, from an N-best recognition result (the speech recognition result for a word string containing a keyword), a keyword recognition result (the speech recognition result for that keyword); deriving the phoneme model parameter distributions of correct phonemes and error phonemes according to the keyword recognition result; and applying the log likelihood of an error phoneme in the keyword recognition result as a weight and updating the phoneme model parameter distribution of the error phoneme, in proportion to the count of the error phoneme, so that the mutual information is minimized.

According to an embodiment of the present invention, discriminative training based on keyword recognition results is performed together with discriminative training based on N-best recognition results, so the performance of discriminative training for specific words, that is, keywords, can be improved. Compared with performing discriminative training on N-best recognition results alone, the keyword recognition results make it possible to build a more efficient acoustic model for all users or for a specific user. Furthermore, when the phoneme model parameter distributions of the acoustic model are updated, the log likelihood of phonemes classified as error recognition results is applied as a weight, so that error recognition results also contribute to minimizing the mutual information. As a result, the mutual information can be minimized efficiently in discriminative training.

The accompanying drawings, which are included as part of the detailed description to aid understanding of the invention, provide embodiments of the invention and, together with the detailed description, explain its technical features.
FIG. 1 is a conceptual diagram illustrating the concept of a training method according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a speech recognition system according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the internal configuration of an acoustic model training unit according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating an acoustic model training method according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating the acoustic model training method of the acoustic model training unit according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice it. In describing the operating principles of the preferred embodiments, detailed descriptions of related well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention. Omitting such descriptions conveys the core of the invention more clearly.

When an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to that element, but intervening elements may also be present. The terminology used herein serves only to describe particular embodiments and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. Terms such as "comprise" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

FIGS. 1A to 1D are conceptual diagrams illustrating the concept of discriminative training according to an embodiment of the present invention.

Before turning to FIGS. 1A to 1D, the speech recognition method according to an embodiment of the present invention is outlined. In speech recognition, the input speech is acoustically analyzed and a feature vector of a predetermined dimension representing the features of the speech is extracted. The feature vector is then matched against the acoustic model. According to an embodiment of the present invention, this matching is performed phoneme by phoneme. The phoneme of the acoustic model that matches the feature vector becomes the speech recognition result (recognition result). In the matching process, the probability distributions that make up the acoustic model (the phoneme model parameter distributions) are used to compute, for each of a plurality of candidate recognition results, the log likelihood of observing the feature vector under the corresponding acoustic model. For example, when the input word "특" consists of three phonemes A (ㅌ), B (으), and C (ㄱ), a result such as Table 1 below may be output.

Table 1

Candidate   | Phoneme A recognition result (log likelihood) | Phoneme B recognition result (log likelihood) | Phoneme C recognition result (log likelihood)
Candidate 1 | E (0.1) | V (0.2) | E (0.1)
Candidate 2 | A (1.2) | V (0.2) | C (1.4)
Candidate 3 | A (1.2) | B (2.2) | C (1.4)

In Table 1, the log likelihood is the probability that the recognized phoneme is the corresponding phoneme in the acoustic model. In other words, it is the similarity (probability) between the phoneme model parameter distribution of the acoustic model and the phoneme of the input speech. Once the recognition result of each candidate has been computed, the final speech recognition result is chosen from among the candidates on the basis of the log likelihood. In Table 1, candidate 3 would be selected: the candidate with the highest log likelihood is taken to be closest to the input speech, and the word string corresponding to the acoustic models that make up that candidate is output as the speech recognition result.
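The selection rule described for Table 1 can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: it assumes that a candidate's score is the sum of its per-phoneme log likelihoods (the text says only that the candidate with the highest log likelihood is chosen), and the names `candidates`, `total_score`, and `best_candidate` are hypothetical.

```python
# Per-phoneme (label, log likelihood) pairs copied from Table 1.
candidates = {
    "Candidate 1": [("E", 0.1), ("V", 0.2), ("E", 0.1)],
    "Candidate 2": [("A", 1.2), ("V", 0.2), ("C", 1.4)],
    "Candidate 3": [("A", 1.2), ("B", 2.2), ("C", 1.4)],
}


def total_score(phonemes):
    # Assumption: a candidate's score is the sum of its phoneme scores.
    return sum(score for _, score in phonemes)


def best_candidate(cands):
    # Return the name of the candidate with the highest total score.
    return max(cands, key=lambda name: total_score(cands[name]))


print(best_candidate(candidates))
```

Under this summing assumption the highest-scoring row is the third one, which agrees with the selection stated in the text.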

Discriminative training according to an embodiment of the present invention uses both N-best recognition results and keyword recognition results. Whereas the N-best recognition result is the recognition result for a particular word string, the keyword recognition result is the recognition result for at least one word in that word string that has been designated as a keyword.

For example, suppose the speech signal "나는 이번 달 계좌를 신청했습니다." ("I applied for an account this month.") is input. Assume that, with N = 3, the N-best recognition results are as shown in Table 2 below.

Table 2

N-best recognition result | Correct/Error
나는 이번 달 계좌를 신청했습니다. (I applied for an account this month.) | Correct recognition result
나는 이월 달 계좌를 신청했습니다. | Error recognition result
나는 이월 달 계자를 신설했습니다. | Error recognition result

As shown in Table 2, the correct recognition result is "나는 이번 달 계좌를 신청했습니다." ("I applied for an account this month.").

In addition, when the keyword is "계좌" ("account"), the keyword recognition results may be as shown in Table 3 below.

Table 3

Keyword recognition result | Correct/Error
계좌를 | Correct recognition result
계좌를 | Correct recognition result
계자를 | Error recognition result

As shown in Table 3, the correct keyword recognition result is "계좌" ("account").

As described above, the N-best recognition result is the recognition result for the word string "나는 이번 달 계좌를 신청했습니다.", while the keyword recognition result is the recognition result for the keyword "계좌", so the two may differ. That is, a keyword contained in a hypothesis that counts as an error at the N-best level may itself have been recognized correctly, and vice versa. Therefore, according to an embodiment of the present invention, discriminative training on the word string containing the keyword and discriminative training on the keyword itself are performed together.
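The difference between sentence-level and keyword-level labels can be sketched as follows. This is a simplified illustration, not the keyword detection unit itself: it assumes whitespace tokenization and that every hypothesis aligns one-to-one with the reference, so the keyword sits at the same token index in each. The function name `extract_keyword_results` and the parameter `keyword_index` are hypothetical.

```python
def extract_keyword_results(nbest, reference, keyword_index):
    """Return (token, is_correct) for the keyword position of each hypothesis.

    Assumption: hypotheses are whitespace-tokenized and aligned one-to-one
    with the reference, so the keyword is at the same index everywhere.
    """
    ref_token = reference.split()[keyword_index]
    results = []
    for hyp in nbest:
        token = hyp.split()[keyword_index]
        results.append((token, token == ref_token))
    return results


reference = "나는 이번 달 계좌를 신청했습니다."
nbest = [
    "나는 이번 달 계좌를 신청했습니다.",
    "나는 이월 달 계좌를 신청했습니다.",
    "나는 이월 달 계자를 신설했습니다.",
]

# Sentence-level labels: only an exact match counts as correct.
sentence_labels = [hyp == reference for hyp in nbest]
# Keyword-level labels for the token "계좌를" at index 3.
keyword_labels = extract_keyword_results(nbest, reference, keyword_index=3)
```

The second hypothesis is an error at the sentence level but correct at the keyword level, which is the case the paragraph above describes.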

A discriminative training method according to an embodiment of the present invention is now described with reference to FIGS. 1A to 1D. Note that the discriminative training described below can be applied to both N-best recognition results and keyword recognition results.

Referring to FIG. 1A, assume that phonemes A, B, and C exist in the acoustic model, as in Table 1. Reference numerals 10, 20, and 30 depict the phoneme model parameter distributions of phonemes A, B, and C, respectively. A phoneme model parameter distribution may be the probability distribution of the corresponding phoneme. Reference numeral 40, the region where the phoneme model parameter distributions overlap, represents the mutual information.

In a speech recognition method of the kind described above, a large amount of mutual information degrades the accuracy with which the phonemes of the input speech are distinguished. An embodiment of the present invention therefore updates the phoneme model parameter distributions so as to "minimize the mutual information".

The meaning of "minimizing the mutual information" according to an embodiment of the present invention is as follows. FIG. 1B shows the phoneme model parameter distributions of two phonemes, M and N. For example, phoneme M may be "ㅘ" and phoneme N may be "ㅏ".

Assume that reference numerals 50 and 60 are the phoneme model parameter distributions of phonemes M and N in the current acoustic model, and that reference numerals 70 and 80 are the ideal phoneme model parameter distributions of phonemes M and N, in which the mutual information is minimized.

Reference numeral 90 is the mutual information between the phoneme model parameter distributions (50, 60) of phonemes M and N in the current acoustic model. Because of this mutual information (90), phoneme M may be recognized as phoneme N even though phoneme M was input. This mutual information must therefore be minimized. To minimize the mutual information (90), the phoneme model parameter distribution (50) of phoneme M in the current acoustic model must move to the ideal distribution (70) of phoneme M, or the phoneme model parameter distribution (60) of phoneme N in the current acoustic model must move to the ideal distribution (80) of phoneme N.
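The overlap region (90) can be given a rough numeric form as follows. The patent does not state a formula for it, so this sketch swaps in the Bhattacharyya coefficient between two one-dimensional Gaussians as one common overlap measure. The Gaussian assumption, the function name `gaussian_overlap`, and all numeric values are illustrative and do not come from the source.

```python
import math


def gaussian_overlap(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya coefficient of two 1-D Gaussians (1.0 = identical)."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    scale = math.sqrt(2.0 * sigma1 * sigma2 / var_sum)
    return scale * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))


# Made-up stand-ins for the current distributions (50, 60) of phonemes M and N.
current = gaussian_overlap(0.0, 1.0, 1.0, 1.0)
# Made-up stand-ins for the ideal distributions (70, 80), moved further apart.
ideal = gaussian_overlap(-1.0, 1.0, 2.0, 1.0)

print(current > ideal)
```

Moving either mean away from the other lowers the coefficient, which mirrors the statement that shifting distribution 50 toward 70, or 60 toward 80, reduces the overlap.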

As described above, the speech recognition result is output as the probability (similarity) that a phoneme of the input speech signal is the same as a phoneme stored in the acoustic model. Speech whose content is already known can be used for training. When the recognition result is correct, the corresponding phoneme is called a correct phoneme; when the recognition result is an error, the corresponding phoneme is called an error phoneme. For example, when phoneme M is input and recognized as phoneme M, this is a correct recognition result and phoneme M is called the correct phoneme. When phoneme M is input and recognized as phoneme N, this is an error recognition result and phoneme N is called the error phoneme.
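The labeling rule above can be sketched as a small tally. This is a hypothetical illustration: it assumes the training data is already available as aligned (input phoneme, recognized phoneme) pairs, and the name `count_error_phonemes` is not from the source.

```python
from collections import Counter


def count_error_phonemes(pairs):
    """Tally correct and error phonemes from (input, recognized) pairs."""
    correct = Counter()
    errors = Counter()
    for spoken, recognized in pairs:
        if spoken == recognized:
            # M input, M recognized: M is a correct phoneme.
            correct[recognized] += 1
        else:
            # M input, N recognized: the recognized phoneme N is the error phoneme.
            errors[recognized] += 1
    return correct, errors


pairs = [("M", "M"), ("M", "N"), ("M", "N"), ("N", "N")]
correct, errors = count_error_phonemes(pairs)
```

Here `errors["N"]` is the number of times N was output for an input that was not N, which is the kind of count the abstract refers to when it speaks of updating in proportion to the count of error phonemes.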

Referring to FIG. 1C, assume that the speech recognition result for phoneme M is a correct recognition result. When the recognition result for phoneme M is correct and that result is reflected in the phoneme model parameter distribution of the acoustic model using the maximum mutual information (MMI) estimation method, the phoneme model parameter distribution (50) of phoneme M in the current acoustic model moves toward the ideal phoneme model parameter distribution (70) of phoneme M. FIG. 1C shows the ideal case, in which the mutual information has been minimized to the point that none remains.

Referring to FIG. 1D, assume instead that the speech recognition result for phoneme M is an error recognition result, for example that phoneme M was recognized as N. This error is caused by the mutual information indicated by reference numeral 90. Conventionally, error recognition results were ignored in such a case, so the acoustic model was not updated.

According to an embodiment of the present invention, however, the log likelihood of phoneme N, the erroneous recognition result, is derived. This log likelihood is computed using the maximum likelihood (ML) estimation method. The count of the error phoneme, that is, the number of times it was judged to be an error phoneme in the error recognition results, is also derived. The log likelihood is then applied as a weight, and the phoneme model parameter distribution of the acoustic model is updated in proportion to the count of the error phoneme. As a result, the phoneme model parameter distribution (60) of phoneme N in the current acoustic model moves toward the ideal phoneme model parameter distribution (80) of phoneme N. In other words, the distribution (60) of phoneme N, which appeared as the erroneous result, is shifted so that its mutual information with phoneme M decreases and the error no longer arises from that mutual information. The mutual information (90) between phoneme N and phoneme M is thereby reduced, as indicated by reference numeral 93; that is, the mutual information is minimized. How far the distribution (60) of phoneme N moves depends on the magnitude of the log likelihood and on the count of the error phoneme: the larger the log likelihood and the larger the count, the further it moves.
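The dependence of the shift on the log likelihood and the error count can be sketched as follows. The source gives no update formula, so everything here is an assumption made for illustration: the distribution is reduced to a single one-dimensional mean, the step is `learning_rate * log_likelihood * error_count`, and the mean moves directly away from the phoneme it was confused with. The name `update_error_phoneme_mean` and the parameter `learning_rate` are hypothetical.

```python
def update_error_phoneme_mean(mu_error, mu_confused, log_likelihood,
                              error_count, learning_rate=0.1):
    """Shift the error phoneme's mean away from the confused phoneme's mean.

    Assumptions (not from the source): 1-D means, a positive log-likelihood
    score used directly as the weight, and a step that is linear in the
    error count.
    """
    # Direction that increases the distance between the two means.
    direction = 1.0 if mu_error >= mu_confused else -1.0
    step = learning_rate * log_likelihood * error_count
    return mu_error + direction * step


mu_m, mu_n = 0.0, 1.0  # made-up means for phonemes M and N
small = update_error_phoneme_mean(mu_n, mu_m, log_likelihood=1.2, error_count=1)
large = update_error_phoneme_mean(mu_n, mu_m, log_likelihood=1.2, error_count=3)
```

With these made-up numbers the mean of N moves further from M when the error count is larger, matching the qualitative behavior described above; a real system would update full model parameters rather than a single mean.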

As described above, minimizing the mutual information means reducing the portion (90) of the probability distributions that overlaps between the phoneme model parameter distributions of different phonemes. In this way, the mutual information (90) between phoneme M and phoneme N in the acoustic model can be minimized. In particular, according to the present invention, the performance of discriminative training can be improved by updating the model parameter distributions of the phonemes in correct recognition results and, in addition, updating the model parameter distributions of the phonemes in incorrect recognition results.

Furthermore, according to an embodiment of the present invention, discriminative training based on keyword recognition results is performed together with discriminative training based on N-best recognition results, so the performance of discriminative training for specific words, that is, keywords, can be improved. A keyword may be designated in advance as a word that falls into at least one of the following groups: words that a specific user uses more often than a preset frequency, words designated as difficult to pronounce for all users, words designated as difficult to pronounce for a specific user, words that produce more errors than a preset threshold for all users, and words that produce more errors than a preset threshold for a specific user. Compared with performing discriminative training on N-best recognition results alone, this makes it possible to build a more efficient acoustic model for all users or for a specific user.

FIG. 2 is a diagram illustrating a speech recognition system that includes an apparatus for discriminative acoustic model training according to an embodiment of the present invention.

Referring to FIG. 2, the speech recognition system includes a feature extraction unit (100), a search unit (200), an acoustic model database (300), a pronunciation dictionary database (400), a language model database (500), a text database (600), a language model training unit (700), a speech database (800), a keyword detection unit (910), a keyword database (912), an N-best alignment unit (920), a keyword training unit (930), an N-best training unit (940), a keyword discriminative training unit (950), and an N-best discriminative training unit (960).

The feature extraction unit (100) extracts the features of a speech signal from the input speech signal, which may be supplied through a speech input device or a speech file. The feature extraction unit (100) applies signal processing to the input speech signal to remove noise or otherwise improve recognition performance. It then extracts feature vectors from the processed speech segments and provides them to the search unit (200).

The search unit 200 forms the search space required for speech recognition from the acoustic model stored in the acoustic model database 300, the language model stored in the language model database 500, and the pronunciation dictionary stored in the pronunciation dictionary database 400. It then performs speech recognition using the search space and the feature vector that the feature extraction unit 100 obtained from the input speech.

In an embodiment of the present invention, the search unit 200 may output, as the recognition result, similarity values with respect to previously trained models. The search unit 200 may obtain a recognition result in the form of a lattice through speech recognition, and may obtain an N-best recognition result from the lattice. To this end, the search unit 200 may use a pattern matching algorithm such as the Viterbi algorithm or dynamic time warping (DTW). For example, the search space may include a search space in the form of a finite state network (FSN) for small-vocabulary tasks such as command recognition and digit recognition, and a search space in the form of a tree for large-vocabulary recognition and fast recognition.
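As an illustration of the kind of pattern matching mentioned above, the following is a minimal sketch of classic dynamic time warping over two one-dimensional sequences. It is not taken from the specification: the function name `dtw_distance`, the use of scalar features, and the absolute-difference local cost are all assumptions made for brevity. A real recognizer would compare sequences of feature vectors.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two 1-D sequences (illustrative only)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched to b[j-1] again
                cost[i][j - 1],      # b[j-1] is matched to a[i-1] again
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]
```

A sequence and a time-stretched copy of it, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, align with zero cost under this sketch, which is the property that makes DTW useful for matching utterances spoken at different speeds.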

The acoustic model database 300 stores the acoustic model. The acoustic model captures the characteristics of a speech signal that varies over time. Examples of acoustic modeling methods include the hidden Markov model (HMM), the continuous HMM, and the neural network (NN). The acoustic model database 300 according to an embodiment of the present invention may store a phoneme model parameter distribution for each phoneme.

The pronunciation dictionary database 400 stores the pronunciation dictionary. The pronunciation dictionary stores pronunciations for speech. Linked to the acoustic model, it stores multiple pronunciations for a given utterance.

The language model database 500 stores the language model. The language model improves the recognition rate by weighting recognition candidates according to the grammar between words, so that grammatical sentences receive higher scores. In the search for the best recognized word string, it also reduces the number of candidates that must be compared. The language model may be selected in consideration of the size of the target vocabulary, the recognition speed, and the recognition performance.

The text database 600 stores the texts used to generate the language model. The language model training unit 700 generates or updates the language model from the texts stored in the text database 600.

The speech database 800 may store speech for training and the text (transcription data) corresponding to that speech. The text corresponding to the speech may be omitted.

The search space required for speech recognition is formed using the acoustic model of the acoustic model database 300, the pronunciation dictionary of the pronunciation dictionary database 400, and the language model of the language model database 500 described above. When a speech signal for training is input, the feature extraction unit 100 extracts a feature vector from the speech signal, and the search unit 200 recognizes the speech using the extracted feature vector and outputs an N-best recognition result as the recognition result.

The keyword detection unit 910 refers to the keyword database 912 and extracts, from the N-best recognition result (the recognition result for the word string output by the search unit 200), only the keyword recognition result, that is, the recognition result for the keywords contained in the word string.

The keywords stored in the keyword database 912 are words that correspond to at least one of the following: a word that a specific user uses more often than a preset frequency, a word designated as difficult to pronounce for all users, a word designated as difficult to pronounce for a specific user, a word that produces more errors than a preset reference value for all users, and a word that produces more errors than a preset reference value for a specific user. The keyword detection unit 910 provides the detected keyword recognition result to the keyword training unit 930.
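The keyword detection step described above can be sketched as a simple filter over the N-best list. This is only an illustration under stated assumptions: the names `KeywordResult` and `extract_keyword_results` are hypothetical, a reference transcription is assumed to be available, and each hypothesis is assumed to be already aligned word by word with the reference. The specification does not prescribe any of these details.

```python
from typing import NamedTuple


class KeywordResult(NamedTuple):
    rank: int        # position of the hypothesis in the N-best list (1 = best)
    reference: str   # keyword that was actually spoken
    recognized: str  # word the recognizer produced at that position
    correct: bool    # True for a correct result, False for an error result


def extract_keyword_results(reference, n_best, keywords):
    """Keep only the positions of each hypothesis where a keyword was spoken."""
    results = []
    for rank, hypothesis in enumerate(n_best, start=1):
        # Assumption: hypothesis and reference are aligned position by position.
        for ref_word, hyp_word in zip(reference, hypothesis):
            if ref_word in keywords:
                results.append(
                    KeywordResult(rank, ref_word, hyp_word, ref_word == hyp_word)
                )
    return results
```

With N = 3, a reference such as `["call", "mother", "now"]`, and the single keyword `"mother"`, the sketch returns one entry per hypothesis and flags a hypothesis that produced `"brother"` at that position as an error result.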

The keyword training unit 930 uses the speech stored in the speech database 800 to derive, according to the keyword recognition result, the phoneme model parameter distributions of the phonemes contained in the keywords. For example, when phoneme M is input and is recognized as M, phoneme M is judged to be a correct recognition result and is called a correct phoneme. When phoneme M is input and is recognized as N, phoneme N is judged to be an erroneous recognition result and is called an error phoneme.
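The distinction between correct phonemes and error phonemes, and the count values used later in the update, can be sketched as follows. The function name `count_phoneme_outcomes` and the input format (a list of already aligned `(input, recognized)` phoneme pairs) are assumptions for illustration and do not appear in the specification.

```python
from collections import Counter


def count_phoneme_outcomes(pairs):
    """Count how often each phoneme was judged a correct or an error phoneme.

    pairs: iterable of (input_phoneme, recognized_phoneme) tuples.
    """
    correct_counts = Counter()
    error_counts = Counter()
    for spoken, recognized in pairs:
        if spoken == recognized:
            # M was input and recognized as M: M is a correct phoneme.
            correct_counts[recognized] += 1
        else:
            # M was input but recognized as N: N is the error phoneme.
            error_counts[recognized] += 1
    return correct_counts, error_counts
```

Note that, following the definition above, the error count is accumulated on the recognized phoneme N rather than on the input phoneme M.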

Accordingly, the keyword training unit 930 derives the phoneme model parameters of the correct phonemes and the phoneme model parameters of the error phonemes. Next, the keyword training unit 930 delivers these correct phonemes and error phonemes to the keyword discriminative training unit 950.

The keyword discriminative training unit 950 updates the phoneme model parameter distributions received from the keyword training unit 930 so that the mutual information between each received distribution and the other phoneme model parameters is minimized. In doing so, the keyword discriminative training unit 950 updates both the phoneme model parameter distributions of the correct phonemes and those of the error phonemes in the keyword recognition result.

Meanwhile, the N-best alignment unit 920 sorts the N-best recognition result output by the search unit 200 and provides it to the N-best training unit 940.

The N-best training unit 940 uses the speech stored in the speech database 800 to derive the phoneme model parameter distributions of the phonemes according to the N-best recognition result. In doing so, the N-best training unit 940 derives the phoneme model parameters of the correct phonemes and the phoneme model parameters of the error phonemes. Next, the N-best training unit 940 delivers the phoneme model parameters of these correct and error phonemes to the N-best discriminative training unit 960.

The N-best discriminative training unit 960 updates the phoneme model parameter distributions received from the N-best training unit 940 so that the mutual information between each received distribution and the other phoneme model parameters is minimized. In doing so, the N-best discriminative training unit 960 updates each of the phoneme model parameter distributions of the correct phonemes and of the error phonemes in the N-best recognition result.

As described above, according to an embodiment of the present invention, discriminative training is performed using not only the N-best recognition result but also the keyword recognition result extracted from it. More accurate and more focused discriminative training can therefore be performed for the keywords. Moreover, in each of the N-best recognition result and the keyword recognition result, discriminative training uses the error phonemes as well as the correct phonemes. The mutual information can therefore be minimized more effectively during discriminative training.

The apparatus for discriminative training using the N-best recognition result and the apparatus for discriminative training using the keyword recognition result are each described in more detail below.

FIG. 3 is a block diagram illustrating an apparatus for performing discriminative training on the keyword recognition result in the speech recognition system of FIG. 2.

Referring to FIG. 3, the apparatus required for discriminative training on the keyword recognition result includes the acoustic model database 300, the speech database 800, the keyword detection unit 910, the keyword database 912, the keyword training unit 930, and the keyword discriminative training unit 950.

First, the keyword detection unit 910 extracts a keyword recognition result from each of the N-best recognition results output by the search unit 200, referring to the keywords stored in the keyword database 912. For example, when N = 3, a keyword recognition result such as that of <Table 3> is extracted from an N-best recognition result such as that of <Table 2>. The keywords may be words that correspond to at least one of the following: a word that a specific user uses more often than a preset frequency, a word designated as difficult to pronounce for all users, a word designated as difficult to pronounce for a specific user, a word that produces more errors than a preset reference value for all users, and a word that produces more errors than a preset reference value for a specific user. The keyword recognition result includes both correct and erroneous recognition results. The keyword detection unit 910 then provides the keyword recognition result to the keyword training unit 930.

The keyword training unit 930 derives a phoneme model parameter distribution for each phoneme of the keywords. To do so, the keyword training unit 930 refers to the speech database 800 and derives the phoneme model parameter distributions of the phonemes contained in each keyword. In particular, the keyword training unit 930 derives the distributions separately for correct phonemes and error phonemes. The keyword training unit 930 then provides the phoneme model parameter distributions, divided into correct phonemes and error phonemes, to the keyword discriminative training unit 950.

The keyword discriminative training unit 950 includes an error phoneme processing module 951 and a correct phoneme processing module 953. The phoneme model parameter distribution of an error phoneme, which has an erroneous recognition result, is input to the error phoneme processing module 951. The phoneme model parameter distribution of a correct phoneme, which has a correct recognition result, is input to the correct phoneme processing module 953.

The error phoneme processing module 951 updates, in the acoustic model stored in the acoustic model database 300, the phoneme model parameter distribution corresponding to the error phoneme. The error phoneme processing module 951 applies the log-likelihood of the error phoneme in the keyword recognition result as a weight, and updates the corresponding phoneme model parameter distribution in the acoustic model in proportion to the count value of the error phoneme. The log-likelihood is calculated using the maximum likelihood (ML) estimation method. The count value of an error phoneme is the number of times that phoneme was judged to be an error phoneme. For example, as described with reference to FIG. 1D, updating the phoneme model parameter distribution of the error phoneme means moving the phoneme model parameter distribution 60 of phoneme N, the error phoneme, rather than that of the input phoneme M, in the direction that reduces its mutual information with the phoneme model parameter distribution 50 of phoneme M. The phoneme model parameter distribution 60 of phoneme N thereby moves toward the ideal phoneme model parameter distribution 80 of phoneme N. The degree to which the phoneme model parameter distribution 60 of phoneme N is moved is proportional to the magnitude of the log-likelihood and also proportional to the count value of the error phoneme.
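The specification gives no update formula, so the following is only a toy sketch of the proportionality described above, not the update used by the error phoneme processing module 951. It assumes a one-dimensional Gaussian per phoneme and moves only the mean of the error phoneme N away from the mean of phoneme M. The step size is the product of a learning rate, the magnitude of the log-likelihood weight, and the error count. The function names, the `rate` parameter, and the choice to shift only the mean are all hypothetical.

```python
import math


def gaussian_log_likelihood(x, mean, var):
    """Log-likelihood of observation x under a 1-D Gaussian N(mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def update_error_phoneme_mean(mean_n, mean_m, log_likelihood, error_count, rate=0.01):
    """Shift the mean of error phoneme N away from the mean of phoneme M.

    Assumption: the step grows linearly with |log_likelihood| (the weight)
    and with error_count, mirroring the proportionality in the text.
    """
    direction = 1.0 if mean_n >= mean_m else -1.0  # points away from M
    step = rate * abs(log_likelihood) * error_count
    return mean_n + direction * step
```

In this sketch, doubling the error count doubles the displacement of the mean, which reflects the statement that a phoneme judged erroneous more often is moved further.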

The correct phoneme processing module 953 updates, in the acoustic model stored in the acoustic model database 300, the phoneme model parameter distribution corresponding to the correct phoneme. The correct phoneme processing module 953 may use the maximum mutual information (MMI) estimation method to update the phoneme model parameter distribution of the acoustic model for a correct phoneme in the keyword recognition result, and this update uses the count value of the correct phoneme. The count value of a correct phoneme is the number of times that phoneme was judged to be a correct phoneme. For example, as described with reference to FIG. 1C, updating the phoneme model parameter distribution of the correct phoneme means moving the phoneme model parameter distribution 50 of phoneme M in the current acoustic model toward the ideal phoneme model parameter distribution 70 of phoneme M. In addition, the degree to which the phoneme model parameter distribution 50 of phoneme M is moved is determined by the number of correct phonemes.
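The count-dependent movement toward the ideal distribution can be pictured with a simple interpolation. This sketch is not MMI estimation. It substitutes a count-weighted smoothing of a one-dimensional mean purely to illustrate how the correct-phoneme count could control the size of the move. The function name, the smoothing constant `tau`, and the availability of a target mean are assumptions.

```python
def update_correct_phoneme_mean(current_mean, target_mean, correct_count, tau=10.0):
    """Move the mean of a correct phoneme toward a target mean.

    Assumption: the interpolation weight count / (count + tau) grows with the
    number of times the phoneme was judged correct and stays below 1.
    """
    weight = correct_count / (correct_count + tau)
    return current_mean + weight * (target_mean - current_mean)
```

With `tau=10.0`, a count of 0 leaves the mean unchanged and a count of 10 moves it halfway to the target. Larger counts move it closer without overshooting.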

FIG. 4 is a block diagram illustrating an apparatus for performing discriminative training on the N-best recognition result in the speech recognition system of FIG. 2.

Referring to FIG. 4, the apparatus required for discriminative training on the N-best recognition result includes the acoustic model database 300, the speech database 800, the N-best alignment unit 920, the N-best training unit 940, and the N-best discriminative training unit 960.

First, the N-best alignment unit 920 sorts each of the N-best recognition results output by the search unit 200. For example, when N = 3, it sorts an N-best recognition result such as that of <Table 2>. The N-best recognition result includes both correct and erroneous recognition results. The N-best alignment unit 920 then provides the N-best recognition result to the N-best training unit 940.

The N-best training unit 940 derives a phoneme model parameter distribution for each phoneme of the word string. To do so, the N-best training unit 940 refers to the speech database 800 and derives the phoneme model parameter distributions of the phonemes contained in each word string. In particular, the N-best training unit 940 derives the distributions separately for correct phonemes and error phonemes. The N-best training unit 940 then provides the phoneme model parameter distributions, divided into correct phonemes and error phonemes, to the N-best discriminative training unit 960.

The N-best discriminative training unit 960 includes an error phoneme processing module 961 and a correct phoneme processing module 963. The phoneme model parameter distribution of an error phoneme, which has an erroneous recognition result, is input to the error phoneme processing module 961. The phoneme model parameter distribution of a correct phoneme, which has a correct recognition result, is input to the correct phoneme processing module 963.

The correct phoneme processing module 963 updates, in the acoustic model stored in the acoustic model database 300, the phoneme model parameter distribution corresponding to the correct phoneme. The correct phoneme processing module 963 may use the maximum mutual information (MMI) estimation method to update the phoneme model parameter distribution of the acoustic model for a correct phoneme in the N-best recognition result, and this update uses the count value of the correct phoneme. The count value of a correct phoneme is the number of times that phoneme was judged to be a correct phoneme. For example, as described with reference to FIG. 1C, updating the phoneme model parameter distribution of the correct phoneme means moving the phoneme model parameter distribution 50 of phoneme M in the current acoustic model toward the ideal phoneme model parameter distribution 70 of phoneme M. In addition, the degree to which the phoneme model parameter distribution 50 of phoneme M is moved is determined by the number of correct phonemes.

FIG. 5 is a flowchart illustrating a method for discriminative training of an acoustic model according to an embodiment of the present invention.

Referring to FIG. 5, when a speech signal is input, the feature extraction unit 100 extracts a feature vector of the speech signal in step S110 and provides the extracted feature vector to the search unit 200 in step S115.

In step S120, the search unit 200 performs speech recognition on the input feature vector in the search space formed from the acoustic model, the pronunciation dictionary, and the language model. This speech recognition is performed in units of phonemes, and its result is an N-best recognition result. The N-best recognition result includes correct recognition results and erroneous recognition results. After performing speech recognition, the search unit 200 provides the N-best recognition result to the keyword detection unit 910 in step S125.

The keyword detection unit 910 receives the N-best recognition result and, in step S130, extracts the keyword recognition result from it on the basis of the keywords stored in the keyword database 912. The keywords stored in the keyword database 912 may be words that correspond to at least one of the following: a word that a specific user uses more often than a preset frequency, a word designated as difficult to pronounce for all users, a word designated as difficult to pronounce for a specific user, a word that produces more errors than a preset reference value for all users, and a word that produces more errors than a preset reference value for a specific user. Like the N-best recognition result, the keyword recognition result includes correct recognition results and erroneous recognition results. The keyword detection unit 910 then provides the keyword recognition result to the keyword training unit 930 in step S135.

The keyword training unit 930 refers to the speech in the speech database 800 according to the keyword recognition result and derives the phoneme model parameter distributions of the phonemes contained in the keyword recognition result. The derived distributions include those of the correct phonemes and those of the error phonemes. In step S140, the keyword training unit 930 provides the phoneme model parameter distributions of the correct phonemes and the error phonemes to the keyword discriminative training unit 950.

In step S145, the keyword discriminative training unit 950 then updates, in the acoustic model stored in the acoustic model database 300, each of the phoneme model parameter distributions of the phonemes corresponding to the correct phonemes and the error phonemes.

Referring to FIG. 1D, the keyword discriminative training unit 950 applies the log-likelihood of the error phoneme as a weight and updates the phoneme model parameter distribution of the error phoneme in proportion to the count value of the error phoneme. The log-likelihood may be calculated using the maximum likelihood (ML) estimation method. The count value of an error phoneme is the number of times that phoneme was judged to be an error phoneme.

At the same time, referring to FIG. 1C, the keyword discriminative training unit 950 updates the phoneme model parameter distribution of the correct phoneme. The phoneme model parameter distribution of the acoustic model for the correct phoneme is updated using the maximum mutual information (MMI) estimation method. In particular, this update is proportional to the count value of the correct phoneme, that is, the number of times the phoneme was judged to be a correct phoneme.

Meanwhile, in step S150, the keyword detection unit 910 provides the N-best recognition result previously received from the search unit 200 to the N-best alignment unit 920. The N-best alignment unit 920 then sorts the N-best recognition result and provides the sorted result to the N-best training unit 940 in step S155.

The N-best training unit 940 refers to the speech stored in the speech database 800 according to the N-best recognition result and derives the phoneme model parameter distributions of the phonemes contained in the N-best recognition result. The derived distributions include those of the correct phonemes and those of the error phonemes. In step S160, the N-best training unit 940 provides the phoneme model parameter distributions of the correct phonemes and the error phonemes to the N-best discriminative training unit 960.

In step S165, the N-best discriminative training unit 960 then updates, in the acoustic model stored in the acoustic model database 300, each of the phoneme model parameter distributions of the phonemes corresponding to the correct phonemes and the error phonemes.

Referring to FIG. 1D, the N-best discriminative training unit 960 applies the log-likelihood of the error phoneme as a weight and updates the phoneme model parameter distribution of the error phoneme in proportion to the count value of the error phoneme. The log-likelihood may be calculated using the maximum likelihood (ML) estimation method. The count value of an error phoneme is the number of times that phoneme was judged to be an error phoneme. At the same time, referring to FIG. 1C, the N-best discriminative training unit 960 updates the phoneme model parameter distribution of the correct phoneme. The phoneme model parameter distribution of the acoustic model for the correct phoneme is updated using the maximum mutual information (MMI) estimation method. In particular, this update is proportional to the count value of the correct phoneme, that is, the number of times the phoneme was judged to be a correct phoneme.

The method for discriminative training of an acoustic model according to an embodiment of the present invention, as described above, can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium may contain program instructions, data files, data structures, and the like, alone or in combination, and includes every kind of recording device that stores data readable by a computer system. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM (Read Only Memory), RAM (Random Access Memory), and flash memory.

In addition, the computer-readable recording medium may be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present invention can be readily inferred by programmers in the technical field to which the present invention belongs.

As described above, preferred embodiments of the present invention have been disclosed in this specification and the drawings, but it will be apparent to those of ordinary skill in the art to which the present invention belongs that, besides the embodiments disclosed herein, other modifications based on the technical idea of the present invention can be practiced. In addition, although specific terms are used in the specification and the drawings, they are used only in a general sense to explain the technical content of the present invention easily and to aid understanding of the invention, and are not intended to limit the scope of the present invention. Accordingly, the foregoing detailed description should not be construed as restrictive in any respect and should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

The present invention relates to an apparatus for discriminative learning of an acoustic model that takes into account phoneme error results in keywords, and to a computer-readable recording medium on which a method therefor is recorded. Because the present invention performs discriminative learning according to the keyword recognition result in addition to discriminative learning based on the N-best recognition result, it can improve the performance of discriminative learning for specific words, that is, keywords. For example, the keywords may be designated in advance as words frequently used by a specific user, words that are difficult to pronounce for all users, words that are difficult to pronounce for a specific user, words for which more errors than a preset reference value occur across all users, words for which more errors than a preset reference value occur for a specific user, and the like. In this case, a more efficient acoustic model can be built for all users or for a specific user than when discriminative learning is performed simply according to the N-best recognition result. Furthermore, when updating the phoneme model parameter distribution of the acoustic model, the present invention applies the log likelihood of a phoneme classified as an erroneous recognition result as a weight, so that the erroneous recognition results are also reflected in minimizing the mutual information. As a result, the mutual information can be minimized efficiently in discriminative learning. The present invention is industrially applicable because it has sufficient potential for marketing or sale and can clearly be carried out repeatedly in practice.
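The keyword designation criteria listed above can be sketched as a simple selection step. This is a minimal illustration under assumptions, not the procedure of the invention: the function name `select_keywords`, its parameters, and the representation of usage and error statistics as word-to-count dictionaries are all hypothetical.

```python
def select_keywords(usage_counts, error_counts, hard_to_pronounce,
                    min_usage, min_errors):
    """Collect words that satisfy at least one keyword criterion.

    usage_counts:      word -> number of times the word was used
    error_counts:      word -> number of recognition errors for the word
    hard_to_pronounce: words designated in advance as difficult to pronounce
    min_usage:         preset usage threshold (assumed strict "more than")
    min_errors:        preset error reference value (assumed strict "more than")
    """
    keywords = set(hard_to_pronounce)
    keywords |= {w for w, c in usage_counts.items() if c > min_usage}
    keywords |= {w for w, c in error_counts.items() if c > min_errors}
    return keywords


# Example statistics for one user; all words and counts are made up.
usage = {"navigate": 12, "weather": 3}
errors = {"weather": 7, "call": 1}
selected = select_keywords(usage, errors, {"schedule"},
                           min_usage=10, min_errors=5)
```

The same function could be run on statistics pooled over all users or on statistics for a single user, corresponding to the all-user and specific-user criteria in the text.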

100: feature extraction unit 200: search unit
300: acoustic model database 400: pronunciation dictionary database
500: language model database 600: text database
700: language model learning unit 800: speech database
910: keyword extraction unit 912: keyword database
920: N-best alignment unit 930: keyword learning unit
940: N-best learning unit 950: keyword discrimination learning unit
960: N-best discrimination learning unit

Claims

An apparatus for discriminative learning of an acoustic model, comprising:
a keyword detection unit for extracting a keyword recognition result, which is a speech recognition result for a keyword, from an N-best recognition result, which is a speech recognition result for a word string including the keyword;
a keyword learning unit for deriving phoneme model parameter distributions of a correct answer phoneme and an error phoneme according to the keyword recognition result; and
a keyword discrimination learning unit for applying a log likelihood of the error phoneme in the keyword recognition result as a weight, and updating the phoneme model parameter distribution of the error phoneme, in proportion to a count value of the error phoneme, so that the amount of mutual information is minimized.

The apparatus of claim 1,
further comprising a keyword database for storing the keyword,
wherein the keyword is a word corresponding to at least one of: a word that a specific user uses more than a preset number of times, a word designated as difficult to pronounce for all users, a word designated as difficult to pronounce for a specific user, a word for which more errors than a preset reference value occur across all users, and a word for which more errors than a preset reference value occur for a specific user.

The apparatus of claim 1, further comprising:
an N-best learning unit for deriving a phoneme model parameter distribution according to the N-best recognition result; and
an N-best discrimination learning unit for updating the phoneme model parameter distribution so that the amount of mutual information is minimized.

The apparatus of claim 1,
wherein the keyword discrimination learning unit
calculates the log likelihood using a maximum likelihood (ML) estimation method.

The apparatus of claim 1,
wherein the keyword discrimination learning unit
applies a log likelihood of the correct answer phoneme as a weight and updates the phoneme model parameter distribution for the correct answer phoneme, in proportion to a count value of the correct answer phoneme, so that the amount of mutual information is minimized.

The apparatus of claim 5,
wherein the keyword discrimination learning unit
updates the phoneme model parameter distribution for the correct answer phoneme using a maximum mutual information (MMI) estimation method.

A computer-readable recording medium having recorded thereon a method for discriminative learning of an acoustic model, the method comprising:
extracting a keyword recognition result, which is a speech recognition result for a keyword, from an N-best recognition result, which is a speech recognition result for a word string including the keyword;
deriving phoneme model parameter distributions of a correct answer phoneme and an error phoneme according to the keyword recognition result; and
applying a log likelihood of the error phoneme in the keyword recognition result as a weight, and updating the phoneme model parameter distribution of the error phoneme, in proportion to a count value of the error phoneme, so that the amount of mutual information is minimized.

The computer-readable recording medium of claim 7, wherein the method further comprises:
deriving a phoneme model parameter distribution according to the N-best recognition result; and
updating the phoneme model parameter distribution according to the N-best recognition result so that the amount of mutual information is minimized.