KR100277690B1

KR100277690B1 - Speech Recognition Using Speech Act Information

Info

Publication number: KR100277690B1
Application number: KR1019980052256A
Authority: KR
Inventors: 권오욱; 박준; 황규웅
Original assignee: 정선종; 한국전자통신연구원
Priority date: 1998-12-01
Filing date: 1998-12-01
Publication date: 2001-01-15
Also published as: KR20000037625A

Abstract

1. Technical field to which the invention described in the claims belongs

The present invention relates to a speech recognition method using speech act information.

2. Technical problem to be solved by the invention

An object of the present invention is to provide a speech recognition method that improves the accuracy of a speech recognizer by predicting the speech act of the current utterance from the speech act information of previously recognized dialogue and changing the language model according to that speech act, and a computer-readable recording medium on which a program for realizing the method is recorded.

3. Summary of the solution

The present invention includes: a first step of obtaining speech act estimation parameters for predicting the speech act information of the current utterance from the speech act information of previously recognized dialogue; a second step of reflecting the speech act estimation parameters in a language model; a third step of performing a first recognition pass on the input speech and then estimating the current speech act information from the first-pass recognition result using the speech act estimation parameters; and a fourth step of recalculating the first-pass recognition result according to the estimated speech act information to obtain the recognition result.

4. Important uses of the invention

The present invention is used in speech recognizers and the like.

Description

Speech Recognition Using Speech Act Information

The present invention relates to a speech recognition method for improving recognition performance in conversational speech recognizers and the like, and to a computer-readable recording medium on which a program for realizing the method is recorded.

First, related prior art is reviewed.

U.S. Patent No. 5,384,892 of Apple Computer, Inc. (Dynamic language model for speech recognition; January 24, 1995) relates to a method of determining acoustic features in a speech sample, recognizing words based on a language model that determines the recognizable word sequences, and selecting an appropriate response from the recognized words.

Information about which words to recognize, under what conditions to recognize them, and what response to make when they are recognized is stored in a data structure called a speech rule. These rules are partitioned according to context. When speech is input, the current state of the computer system determines which rules are active and how the language model for word recognition is assembled. An appropriate response to the utterance is generated from the rules that match all or part of the word sequence.

U.S. Patent No. 5,640,487 of IBM (Building scalable n-gram language models using maximum likelihood maximum entropy n-gram models; July 17, 1997) relates to an n-gram modeling method that reduces the memory size and the time the language model takes to converge.

U.S. Patent No. 5,613,034 of U.S. Philips Corp. (Method and apparatus for recognizing spoken words in a speech signal; March 8, 1997) concerns applying a language model in a conventional tree-based search. The language model value must be added at each transition between words, but in a tree-based search the current word is not yet determined at that point, so a copy of the search tree has to be kept for every word that has just ended. In the present invention, however, the language model is applied without copying the search tree.

As described above, a conventional speech recognizer uses a fixed language model regardless of the content of previous utterances when recognizing conversational utterances, and therefore suffers from a low recognition rate.

The present invention, devised to solve the above problem, has the object of providing a speech recognition method that improves the accuracy of a speech recognizer, such as a conversational speech recognizer, by predicting the speech act of the current utterance from the speech act information of previously recognized dialogue and changing the language model used in the search unit of the recognizer according to that speech act, and a computer-readable recording medium on which a program for realizing the method is recorded.

FIG. 1 is an exemplary block diagram of a speech recognizer to which the present invention is applied.

FIG. 2 is an example of a speech act-tagged text corpus according to the present invention.

FIG. 3 is a flowchart of an embodiment of the training process for the speech act information parameters according to the present invention.

FIG. 4 is a flowchart of an embodiment of the speech recognition method using speech act information according to the present invention.

* Description of reference numerals for the main parts of the drawings

102: feature extraction unit    103: search unit

104: post-processing unit    105: acoustic model

106: pronunciation dictionary    107: language model

To achieve the above object, the present invention provides a speech recognition method applied to a speech recognizer, comprising: a first step of obtaining speech act estimation parameters for predicting the speech act information of the current utterance from the speech act information of previously recognized dialogue; a second step of reflecting the speech act estimation parameters in a language model; a third step of performing a first recognition pass on the input speech and then estimating the current speech act information from the first-pass recognition result using the speech act estimation parameters; and a fourth step of recalculating the first-pass recognition result according to the estimated speech act information to obtain the recognition result.

The present invention also provides a computer-readable recording medium on which is recorded a program for realizing, in a speech recognizer having a processor: a first function of obtaining speech act estimation parameters for predicting the speech act information of the current utterance from the speech act information of previously recognized dialogue; a second function of reflecting the speech act estimation parameters in a language model; a third function of performing a first recognition pass on the input speech and then estimating the current speech act information from the first-pass recognition result using the speech act estimation parameters; and a fourth function of recalculating the first-pass recognition result according to the estimated speech act information to obtain the recognition result.

The above objects, features, and advantages will become clearer from the following detailed description taken in conjunction with the accompanying drawings. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is an exemplary block diagram of a speech recognizer to which the present invention is applied.

The feature extraction unit (102) receives speech (101), extracts only the information useful for recognition, and converts it into feature vectors. The search unit (103) uses the acoustic model (105), the pronunciation dictionary (106), and the language model (107), all obtained beforehand in a training process, to find the most probable word sequence with the Viterbi algorithm. For large-vocabulary recognition, the vocabulary to be recognized is organized as a tree, and the search unit (103) searches that tree. The post-processing unit (104) removes noise symbols and the like from the search result and outputs the final recognition result (108).

The search unit (103) multiplies the probabilities obtained from the acoustic model (105) and the language model (107) for every possible word sequence and selects the word sequence with the largest product. The language model (107) predicts the probability of the next word from the preceding words; a trigram, which uses the probability of the next word given the two immediately preceding words, is generally used. The next word can also be predicted from three or more preceding words, but trigrams are widely used because the text corpus available for estimating the probabilities is limited and the language model (107) would require a large amount of storage.

For example, in a system that recognizes 10,000 words, the number of all possible trigrams is 10,000^3 = 1E12. The number of distinct trigrams that actually occur in a text corpus depends on the task but is usually about 200,000. A conventional trigram alone is not enough to raise the performance of a speech recognizer; a higher-quality language model (107) is needed in addition. The language model (107) may be composed of multiple language models arranged hierarchically, or of several separately built language models. Examples of higher-quality language modeling include the use of trigger information, cache language modeling, the use of part-of-speech information, and 4-gram or 5-gram language models.
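The sparsity argument above can be made concrete with a minimal sketch. It assumes a plain maximum-likelihood trigram estimate without smoothing; the function names `trigram_counts` and `trigram_mle` and the padding markers `<s>` and `</s>` are illustrative and do not come from the patent.

```python
from collections import Counter

def trigram_counts(sentences):
    """Count word trigrams, padding each sentence with start and end markers."""
    counts = Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            counts[tuple(padded[i - 2:i + 1])] += 1
    return counts

def trigram_mle(counts, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1, w2); returns 0.0 for an unseen history."""
    history_total = sum(c for (a, b, _), c in counts.items() if (a, b) == (w1, w2))
    return counts[(w1, w2, w3)] / history_total if history_total else 0.0

vocabulary_size = 10_000
possible_trigrams = vocabulary_size ** 3          # 1E12, as stated in the text
observed_trigrams = 200_000                       # typical figure quoted in the text
coverage = observed_trigrams / possible_trigrams  # fraction of trigrams ever observed
```

With these figures only about two in every ten million possible trigrams are observed, which is why the unsmoothed estimate returns zero for almost every query and some form of smoothing is needed.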

In almost all of these higher-quality language models the dimensionality of the modeling parameters increases, so a smoothing method is essential. That is, smoothing is needed to obtain probability values for events that do not appear in the text corpus, and the maximum entropy method or a backoff method is mainly used.

In the present invention, recognition performance is improved by predicting the next word from previously recognized sentences in addition to the two immediately preceding words in the word sequence. Predicting the next word directly from the previously recognized sentences does not work well, because the parameters have too many degrees of freedom. Therefore, only a small number of content words from the previously recognized sentence are used to identify the speaker's intention (speech act information), and a language model that depends on this speech act is applied.

FIG. 2 is an example of a speech act-tagged text corpus according to the present invention.

In FIG. 2, "KS" indicates a Korean sentence, "SA" the speech act, and "ST" the sentence type. The speech act types include opening (opening), providing information (inform), requesting confirmation (ask-confirm), response (response), requesting reference information (ask-ref), and asking whether something exists (ask-if). The sentence types include yes/no question (yn-quest), declarative sentence (decl), and wh-question (wh-quest).
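One tagged corpus entry could be represented as sketched below. FIG. 2 itself is not reproduced in this text, so the "KEY: value" line layout, the class name `TaggedUtterance`, and the function `parse_record` are assumptions made for illustration; only the field names KS, SA, and ST and the tag values come from the description.

```python
from dataclasses import dataclass

# Tag values listed in the description; the description says "and the like",
# so these tuples are not assumed to be exhaustive and are not enforced.
SPEECH_ACTS = ("opening", "inform", "ask-confirm", "response", "ask-ref", "ask-if")
SENTENCE_TYPES = ("yn-quest", "decl", "wh-quest")

@dataclass(frozen=True)
class TaggedUtterance:
    ks: str  # the Korean sentence
    sa: str  # speech act tag
    st: str  # sentence type tag

def parse_record(lines):
    """Parse one entry given as 'KEY: value' lines (hypothetical layout)."""
    fields = {}
    for line in lines:
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return TaggedUtterance(ks=fields["KS"], sa=fields["SA"], st=fields["ST"])

# Illustrative entry, not taken from the patent's corpus.
example = parse_record(["KS: 안녕하세요", "SA: opening", "ST: decl"])
```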

Conventional speech act extraction parses the sentence to extract content words and determine the sentence structure. This requires a parser, which increases system complexity, and the rules needed for parsing have to be written by hand. In the present invention, by contrast, the content words are identified from the first-pass result of the speech recognizer, and the speech act probability, given as a function of the content words, is obtained statistically, which reduces both human effort and the time needed to extract speech act information. Because the speech recognizer attaches a part-of-speech tag to each word in its recognition vocabulary, it is easy to tell which words are content words.

FIG. 3 is a flowchart of an embodiment of the training process for the speech act information parameters according to the present invention.

First, each sentence of the text corpus, which is collected dialogue by dialogue, is tagged with the speech act to which it belongs (301). With only the speech act tags of the sentences listed in order, the probability of a transition from the previous speech acts to the current speech act (bigram or higher order), P(s_t | s_{t-1}, s_{t-2}), is obtained (302). Then the global language model (trigram) P(w_3 | w_1, w_2) is obtained from the entire text corpus (303).
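Step 302 can be sketched as follows for the simplest (bigram) case. This is a minimal illustration that assumes unsmoothed relative-frequency estimates; the function name `speech_act_bigrams` and the `<start>` marker for the beginning of a dialogue are hypothetical.

```python
from collections import Counter, defaultdict

def speech_act_bigrams(dialogues):
    """Estimate P(s_t | s_{t-1}) from speech act tag sequences, one per dialogue."""
    pair_counts = defaultdict(Counter)
    for tags in dialogues:
        previous = "<start>"  # hypothetical marker for the start of a dialogue
        for tag in tags:
            pair_counts[previous][tag] += 1
            previous = tag
    # Normalize the counts for each previous act into a probability distribution.
    return {
        prev: {tag: n / sum(nexts.values()) for tag, n in nexts.items()}
        for prev, nexts in pair_counts.items()
    }

# Illustrative tag sequences, not taken from the patent's corpus.
transitions = speech_act_bigrams([
    ["opening", "ask-ref", "response"],
    ["opening", "inform"],
])
```

A trigram over speech acts, as written in the formula above, would key the counts on the pair of preceding tags instead of a single tag.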

Next, the text corpus is partitioned by speech act, and the sentences of each speech act are used to obtain the speech act-dependent language models (trigrams) P_s(w_3 | w_1, w_2), s = 1, ..., S (304), where S is the number of speech acts. From each sentence of the text classified by speech act, a predetermined number (N) of content words is taken, starting from the end of the sentence, and information on the sentence type (interrogative or declarative) is extracted. Using the maximum entropy method, the speech act estimation parameters A(s, c_1^N) for predicting the speech act from the content words and the sentence type are obtained (305). The probability is computed from the speech act estimation parameters as defined by the maximum entropy method in (Equation 1).
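The extraction of the last N content words and the sentence type can be sketched as below. The set `CONTENT_POS`, the part-of-speech labels, and the function name `extract_context` are assumptions for illustration; the patent only states that part-of-speech tags identify the content words.

```python
# Assumed set of part-of-speech labels treated as content words.
CONTENT_POS = {"noun", "verb", "adjective", "adverb"}

def extract_context(tagged_words, sentence_type, n=5):
    """Collect up to n content words scanning from the sentence end, plus the sentence type.

    tagged_words is a list of (word, part_of_speech) pairs in sentence order.
    The content words are returned in the order found, i.e. last word first.
    """
    content = []
    for word, pos in reversed(tagged_words):
        if pos in CONTENT_POS:
            content.append(word)
            if len(content) == n:
                break
    return tuple(content), sentence_type

# Illustrative sentence, not taken from the patent's corpus.
sentence = [("i", "pronoun"), ("want", "verb"), ("a", "determiner"),
            ("room", "noun"), ("tomorrow", "adverb")]
context = extract_context(sentence, "decl", n=2)
```

The returned tuple plays the role of b = c_1^N in the description and can serve directly as a dictionary key when counting (s, c_1^N) pairs.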

b = c_1^N

Here, c_1^N denotes the content word and sentence type information, Z is a normalization constant, and f_j(s,b) is a feature used in the maximum entropy method, which takes the value 1 if (s,b) is in the event space of j and 0 otherwise. K is the number of features.

The event space consists of (s, c_1^N) pairs, and one feature is assigned to every (s, c_1^N) pair that occurs in the text corpus. The maximum entropy method is used because N is usually about 5, which is large enough that many (s, c_1^N) pairs needed during probability estimation do not occur in the text corpus, so P(s|b) cannot be obtained from the text corpus for every pair. In the maximum entropy method, the speech act probability for a pair that does not occur is given by P(s|b) = 1/Z(b).
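Equation 1 is not reproduced in this text. The sketch below assumes the standard conditional maximum entropy form P(s|b) = exp(sum over j of lambda_j * f_j(s,b)) / Z(b), with one binary feature per (s, b) pair seen in training; this form is consistent with the stated value 1/Z(b) for unseen pairs but is an assumption, as are the names `maxent_speech_act_probs` and `weights`.

```python
import math

def maxent_speech_act_probs(weights, speech_acts, b):
    """Return P(s | b) for every speech act s under the assumed maximum entropy form.

    weights maps a seen (s, b) pair to its parameter lambda. A pair that was
    never seen has no active feature, so its unnormalized score is exp(0) = 1.
    """
    scores = {s: math.exp(weights.get((s, b), 0.0)) for s in speech_acts}
    z = sum(scores.values())  # normalization constant Z(b)
    return {s: score / z for s, score in scores.items()}

# Illustrative parameters, not trained values.
b = (("room", "reserve"), "decl")
weights = {("inform", b): math.log(3.0)}
probs = maxent_speech_act_probs(weights, ["inform", "ask-ref"], b)
```

In this example Z(b) = 3 + 1 = 4, so the seen pair receives probability 3/4 and the unseen pair receives 1/Z(b) = 1/4.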

Besides the maximum entropy method, a backoff smoothing method may also be used to obtain probabilities for all data pairs from insufficient information.

FIG. 4 is a flowchart of an embodiment of the speech recognition method using speech act information according to the present invention.

The search unit of the speech recognizer operates in two passes. In the first pass, a first-pass recognition result in the form of a lattice is obtained using only the global language model (401). The single most probable sentence is extracted from the lattice (402). Starting from the end of this sentence, a predetermined number of content words and the sentence type information are extracted (403). Using the speech act estimation parameters that were obtained in advance and reflected in the language model, the speech act that the current recognition result most probably belongs to is estimated as in (Equation 2) (404). The speech acts of the previously recognized sentences are taken into account in this estimate.
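Equation 2 is not reproduced in this text. The sketch below shows one plausible way to combine the two sources of evidence named in the description, assuming the estimate maximizes the product of the content-word probability P(s | b) and the transition probability P(s | previous act); the actual combination used in the patent may differ, and the function name `estimate_speech_act` is hypothetical.

```python
def estimate_speech_act(context_probs, transition_probs, previous_act):
    """Return the most probable speech act and the normalized scores.

    context_probs maps each speech act s to P(s | b) from the content words.
    transition_probs maps a previous act to a dict of P(s | previous act).
    """
    prior = transition_probs.get(previous_act, {})
    scores = {s: p * prior.get(s, 0.0) for s, p in context_probs.items()}
    total = sum(scores.values())
    if total == 0.0:
        # No transition evidence: fall back to the content-word probabilities alone.
        scores = dict(context_probs)
        total = sum(scores.values())
    posterior = {s: v / total for s, v in scores.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior

# Illustrative values, not taken from the patent.
best_act, posterior = estimate_speech_act(
    {"response": 0.6, "inform": 0.4},
    {"ask-ref": {"response": 0.2, "inform": 0.8}},
    previous_act="ask-ref",
)
```

Here the content words alone favor "response", but the transition probability from "ask-ref" shifts the estimate to "inform" (0.32 against 0.12 before normalization).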

The language model value that depends on the estimated speech act and the global language model value are interpolated as in (Equation 3) to give the final language model value.

The interpolation weight is set in proportion to the speech act probability. The speech act-dependent language model is applied to the lattice given as the first-pass recognition result; that is, over all possible sentences, the sentence whose sentence probability is largest according to (Equation 4) is found (405). This becomes the final recognized sentence (406).

Here, P(X|W) is the probability that the speech feature vector sequence X is observed given the word sequence W, and is computed using the acoustic model. T is the number of words in the sentence.
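Equations 3 and 4 are not reproduced in this text. The second pass can be sketched under several stated assumptions: the lattice is simplified to a list of candidate sentences with acoustic log probabilities, the interpolation weight is taken to be equal to the speech act probability (a proportionality constant of 1), and both language models are passed in as functions of a trigram. The names `interpolated_lm` and `rescore` are hypothetical.

```python
import math

def interpolated_lm(p_act, p_global, act_prob):
    """Linear interpolation of the act-dependent and global language model values."""
    return act_prob * p_act + (1.0 - act_prob) * p_global

def rescore(candidates, act_lm, global_lm, act_prob):
    """Pick the candidate maximizing log P(X|W) plus the summed log language model values.

    candidates is a list of (words, acoustic_log_prob) pairs; act_lm and
    global_lm are callables (w1, w2, w3) -> probability.
    """
    best_words, best_score = None, -math.inf
    for words, acoustic_logp in candidates:
        padded = ["<s>", "<s>"] + list(words)
        score = acoustic_logp
        for i in range(2, len(padded)):
            w1, w2, w3 = padded[i - 2:i + 1]
            p = interpolated_lm(act_lm(w1, w2, w3), global_lm(w1, w2, w3), act_prob)
            score += math.log(p) if p > 0.0 else -math.inf
        if score > best_score:
            best_words, best_score = list(words), score
    return best_words, best_score

# Toy models and candidates for illustration only.
global_lm = lambda w1, w2, w3: 0.1
act_lm = lambda w1, w2, w3: 0.5 if w3 == "room" else 0.1
candidates = [(["a", "room"], -10.0), (["a", "broom"], -9.5)]
with_act, _ = rescore(candidates, act_lm, global_lm, act_prob=0.5)
without_act, _ = rescore(candidates, act_lm, global_lm, act_prob=0.0)
```

In this toy example the acoustically preferred candidate wins when only the global model is used, while the act-dependent model raises the other candidate above it.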

In another embodiment, for each of the M speech acts with the highest speech act probabilities, the new language model defined above is applied to find the sentence with the largest sentence probability; then, as in (Equation 5), the one sentence among these M sentences whose probability multiplied by the a priori speech act probability is largest is selected as the final recognition result.
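Equation 5 is not reproduced in this text. The selection over the M most probable speech acts can be sketched as below, assuming the best sentence and its log sentence probability have already been computed for each act; the function name `select_over_top_acts` and all values are illustrative.

```python
import math

def select_over_top_acts(act_probs, best_per_act, m=2):
    """Choose among the best sentences of the m most probable speech acts.

    act_probs maps a speech act s to its probability P(s); best_per_act maps s
    to (sentence, log sentence probability under the language model for s).
    Each score is the log sentence probability plus log P(s).
    """
    top_acts = sorted(act_probs, key=act_probs.get, reverse=True)[:m]
    scored = [
        (best_per_act[s][1] + math.log(act_probs[s]), s)
        for s in top_acts
        if act_probs[s] > 0.0
    ]
    score, act = max(scored, key=lambda item: item[0])
    return best_per_act[act][0], act, score

# Illustrative values, not taken from the patent.
act_probs = {"inform": 0.6, "ask-ref": 0.3, "response": 0.1}
best_per_act = {
    "inform": ("i need a room", -14.0),
    "ask-ref": ("do you need a room", -12.0),
    "response": ("yes", -5.0),
}
sentence, act, score = select_over_top_acts(act_probs, best_per_act, m=2)
```

With M = 2 the act "response" is never considered, even though its sentence has the highest probability on its own.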

The present invention described above is not limited to the foregoing embodiment and the accompanying drawings; it will be apparent to those of ordinary skill in the art to which the invention belongs that various substitutions, modifications, and changes are possible without departing from the technical spirit of the invention.

As described above, the present invention can improve the accuracy of a speech recognizer, such as a conversational speech recognizer, by predicting the speech act of the current utterance from the speech act information of previously recognized dialogue and changing the language model used in the search unit of the recognizer according to that speech act.

In other words, given only a speech act-tagged text corpus and no prior knowledge of particular sentence structures, the present invention can easily obtain the speech act information parameters by purely statistical means, and by using speech act-dependent language models in recognition it can improve speech recognition performance.

Claims

A speech recognition method applied to a speech recognizer, the method comprising:

a first step of obtaining speech act estimation parameters for predicting the speech act information of currently spoken content from the speech act information of previously recognized dialogue;

a second step of reflecting the speech act estimation parameters in a language model;

a third step of performing a first recognition pass on input speech and then estimating the current speech act information from the first-pass recognition result using the speech act estimation parameters; and

a fourth step of recalculating the first-pass recognition result according to the estimated speech act information to obtain a recognition result.

The method of claim 1, wherein the first step comprises:

a fifth step of tagging each sentence of a text corpus collected dialogue by dialogue with the speech act to which it belongs;

a sixth step of obtaining the probability of a transition from the previous speech act to the current speech act;

a seventh step of obtaining a global language model using the text corpus;

an eighth step of classifying the text corpus by speech act and obtaining speech act-dependent language models using the sentences classified by speech act; and

a ninth step of obtaining speech act estimation parameters for predicting speech act information from a predetermined number of content words extracted from each sentence of the text classified by speech act and from information on the type of the sentence.

The method of claim 2, wherein the speech act estimation parameters of the ninth step are obtained using a maximum entropy method.

The method of claim 2, wherein the speech act estimation parameters of the ninth step are obtained using a backoff smoothing method.

The method according to any one of claims 1 to 4, wherein the third step comprises:

a tenth step of obtaining a first-pass recognition result in lattice form using the global language model;

an eleventh step of extracting the most probable sentence from the obtained lattice;

a twelfth step of extracting a predetermined number of content words and sentence type information from the extracted sentence; and

a thirteenth step of estimating which speech act is most probable using the obtained speech act estimation parameters.

The method of claim 5, wherein the fourth step comprises:

a fourteenth step of interpolating the language model value that depends on the speech act information estimated in the thirteenth step and the global language model value; and

a fifteenth step of applying the speech act-dependent language model to the lattice given as the first-pass recognition result to obtain the sentence with the maximum sentence probability, and outputting it as the final recognition result.

The method of claim 6, further comprising a sixteenth step of selecting, from among the sentences with maximum sentence probability obtained in the fifteenth step, the sentence for which the value multiplied by the a priori speech act probability is largest, and outputting it as the final recognition result.

The method of claim 6, wherein in the interpolation of the fourteenth step the interpolation weight is set in proportion to the speech act probability.

A computer-readable recording medium on which is recorded a program for realizing, in a speech recognizer having a processor:

a first function of obtaining speech act estimation parameters for predicting the speech act information of currently spoken content from the speech act information of previously recognized dialogue;

a second function of reflecting the speech act estimation parameters in a language model;

a third function of performing a first recognition pass on input speech and then estimating the current speech act information from the first-pass recognition result using the speech act estimation parameters; and

a fourth function of recalculating the first-pass recognition result according to the estimated speech act information to obtain a recognition result.