KR20040055292A

KR20040055292A - System and method for improving in-domain training data using out-of-domain data

Info

Publication number: KR20040055292A
Application number: KR1020020081932A
Authority: KR
Inventors: 정의정
Original assignee: 한국전자통신연구원
Priority date: 2002-12-20
Filing date: 2002-12-20
Publication date: 2004-06-26
Also published as: KR100487718B1

Abstract

PURPOSE: A system and method for reinforcing a domain-dependent training corpus with an out-of-domain corpus are provided, improving the performance of a language model for continuous speech recognition by using training corpora from outside the target domain. CONSTITUTION: A domain corpus (205) holds a small amount of language model data for a specific domain. A content-similar corpus (202) holds language model data for a different domain; its content is similar to that of the domain corpus, but its form differs. A form standardization unit (204) converts the form of the content-similar corpus into the form of the domain corpus. An adder (209) adds the corpus standardized by the form standardization unit to the domain corpus, generating a corpus with reinforced language model data.

Description

Domain-dependent training corpus reinforcement system and method using an out-of-domain corpus {System and method for improving in-domain training data using out-of-domain data}

The present invention relates to large vocabulary continuous speech recognition (LVCSR) and, more particularly, to a system and method for reinforcing a training corpus using an out-of-domain training corpus.

The most successful language modeling technique for speech recognition systems is statistical language modeling, which requires a training corpus (text database) large enough to yield reliable and robust statistics. Moreover, a statistical language model built for one domain does not perform properly when the domain changes. In other words, when the training corpus for a new domain is insufficient, the statistical language model not only fails to contribute to recognition performance but actually degrades it. In addition, securing a domain-dependent corpus is by no means easy.

FIG. 1 is a block diagram schematically illustrating a general continuous speech recognition system.

Referring to FIG. 1, the feature extraction unit (101) extracts only the information useful for recognition from the input speech and converts it into feature vectors.

From the feature vectors output by the feature extraction unit (101), the search unit (102) finds the most probable word sequence with the Viterbi algorithm, using the acoustic model database (104), the pronunciation dictionary database (105), and the language model database (106) obtained in advance during training. For large vocabulary recognition, the vocabulary to be recognized is organized as a tree, and the search unit (102) searches this tree.

Finally, the recognition result output unit (103) outputs the recognized text using the output of the search unit (102).

In a conversational continuous speech recognition system such as the one shown in FIG. 1, which aims to recognize dialogue across various domains, a highly reliable and robust statistical language model can be built when a sufficient training corpus is available. Here, the language model can be regarded as the grammar of the speech recognition system. A continuous speech recognition system cannot recognize arbitrary sentences; it recognizes only sentences that conform to a defined grammar. Using the language model in the search process reduces the search space of the recognizer, and because it raises the probability of grammatical sentences, it also contributes to a higher recognition rate.

As noted above, among the various language modeling techniques, statistical language modeling performs best for large vocabulary continuous speech recognition. A statistical language model is a grammar in which the connections between words are expressed as probabilities. The n-gram, widely used in statistical language models, defines the probability of the next word given the preceding n-1 words; the commonly used n-grams are the bigram (n=2) and the trigram (n=3). The advantage of a statistical language model is that, because everything is defined as a probability, it requires little human knowledge and can be built easily given a large corpus. When the corpus is small, however, reliable probability values cannot be obtained, and performance may actually deteriorate.
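The bigram case described above can be illustrated with a minimal sketch. It assumes whitespace tokenization, plain maximum-likelihood estimation without smoothing, and a hypothetical helper named `train_bigram`; none of these come from the patent, which does not specify an estimation procedure.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate maximum-likelihood bigram probabilities P(word | prev)."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Sentence boundary markers are an illustrative convention.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            history_counts[prev] += 1          # times prev is used as a history
            bigram_counts[(prev, word)] += 1   # times word follows prev
    return {
        (prev, word): count / history_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }

# Toy two-sentence corpus (hypothetical data).
corpus = ["where is the exchange office", "where is the duty free shop"]
model = train_bigram(corpus)
p_is_given_where = model[("where", "is")]
p_exchange_given_the = model[("the", "exchange")]
unseen = ("the", "hotel") in model
```

With so little data, any word pair that never occurred, such as ("the", "hotel"), has no estimate at all, which is the small-corpus weakness the paragraph above points to.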

Ultimately, the probability values of a statistical language model are reliable only with an enormous corpus, and collecting one takes a great deal of effort, time, and money. In particular, securing enough conversational sentences of the kind that occur in travel consultation or everyday life is by no means easy, so new techniques are needed for building a robust language model from insufficient text data.

The technical problem addressed by the present invention is to provide a training corpus reinforcement system that improves the performance of a language model for continuous speech recognition by using an out-of-domain training corpus, thereby overcoming the degradation of statistical language models caused by the shortage of domain-dependent training corpora that commonly arises when such models are built.

Another technical problem addressed by the present invention is to provide a corpus reinforcement method performed in the training corpus reinforcement system.

Yet another technical problem addressed by the present invention is to provide a recording medium on which the training corpus reinforcement method is recorded as program code executable on a computer.

FIG. 2 is a block diagram schematically illustrating an embodiment of a system for reinforcing a domain-dependent training corpus using an out-of-domain training corpus according to the present invention.

FIG. 3 is a block diagram illustrating in detail the form standardization unit (204) of the system shown in FIG. 2.

To achieve the above object, the corpus reinforcement system according to the present invention preferably includes: a domain corpus having a small amount of language model data for a specific domain; a content-similar corpus having language model data for a domain different from that of the domain corpus, whose content is similar to that of the domain corpus but whose form differs; a form standardization unit that standardizes the form of the content-similar corpus into the form of the domain corpus; and an adder that adds the corpus standardized by the form standardization unit to the domain corpus to generate a corpus with reinforced language model data.

To achieve the above object, the corpus reinforcement system according to the present invention preferably includes: a domain corpus having a small amount of language model data for a specific domain; a form-similar corpus having language model data for a domain different from that of the domain corpus, whose form is similar to that of the domain corpus but whose content differs; an adder that adds the form-similar corpus to the domain corpus to generate a corpus with reinforced language model data; and a weight adjustment unit that assigns a higher weight to the language model of the domain corpus within the reinforced corpus generated by the adder, raising the probability that the language model of the domain corpus is selected.

To achieve the above object, the corpus reinforcement system according to the present invention preferably includes: a domain corpus having a small amount of language model data for a specific domain; a content-similar corpus having language model data for a domain different from that of the domain corpus, whose content is similar to that of the domain corpus but whose form differs; a form-similar corpus having language model data for a domain different from that of the domain corpus, whose form is similar to that of the domain corpus but whose content differs; a form standardization unit that standardizes the form of the content-similar corpus into the form of the domain corpus; a first adder that adds the corpus standardized by the form standardization unit to the domain corpus to generate a corpus with reinforced language model data; a second adder that adds the form-similar corpus to the domain corpus to generate a corpus with reinforced language model data; and a weight adjustment unit that assigns a higher weight to the language model of the domain corpus within the reinforced corpus generated by the adder, raising the probability that the language model of the domain corpus is selected.

To achieve the other object, the corpus reinforcement method according to the present invention, which reinforces a domain corpus having a small amount of language model data for a specific domain by using an out-of-domain corpus having language model data for a different domain, preferably includes: step (a) of extracting from the out-of-domain corpus a content-similar corpus whose content is similar to that of the domain corpus but whose form differs; step (b) of standardizing the form of the content-similar corpus into the form of the domain corpus; and step (c) of adding the corpus standardized in step (b) to the domain corpus to generate a corpus with reinforced language model data.

To achieve the other object, the corpus reinforcement method according to the present invention, which reinforces a domain corpus having a small amount of language model data for a specific domain by using an out-of-domain corpus having language model data for a different domain, preferably includes: step (a) of extracting from the out-of-domain corpus a form-similar corpus whose form is similar to that of the domain corpus but whose content differs; step (b) of adding the form-similar corpus to the domain corpus to generate a corpus with reinforced language model data; and step (c) of adjusting weights in the corpus generated in step (b) so that the language model of the domain corpus receives a higher weight and is more likely to be selected.

To achieve the other object, the corpus reinforcement method according to the present invention, which reinforces a domain corpus having a small amount of language model data for a specific domain by using an out-of-domain corpus having language model data for a different domain, preferably includes: step (a) of extracting from the out-of-domain corpus both a content-similar corpus, whose content is similar to that of the domain corpus but whose form differs, and a form-similar corpus, whose form is similar to that of the domain corpus but whose content differs; step (b) of standardizing the form of the content-similar corpus into the form of the domain corpus; step (c) of adding the corpus standardized in step (b) and the form-similar corpus, respectively, to the domain corpus to generate corpora with reinforced language model data; and step (d) of adjusting weights in the reinforced corpus generated in step (c) so that the language model of the domain corpus receives a higher weight and is more likely to be selected.
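Steps (a) through (c) of the combined method can be sketched as follows. This is a rough illustration only: the `Doc` record, its `topic` and `style` labels, and the `reinforce` function are hypothetical names, and the patent does not say how content or form similarity is decided, so simple label comparison stands in for that decision. Step (d), the weight adjustment, is left out of this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    text: str
    topic: str   # hypothetical content label, e.g. "airport"
    style: str   # hypothetical form label, e.g. "dialogue" or "read"

def reinforce(domain, out_of_domain, topic, style, standardize):
    # (a) Split the out-of-domain corpus by which property matches the domain.
    content_similar = [d for d in out_of_domain
                       if d.topic == topic and d.style != style]
    form_similar = [d for d in out_of_domain
                    if d.topic != topic and d.style == style]
    # (b) Convert content-similar text into the domain's form.
    standardized = [Doc(standardize(d.text), d.topic, style)
                    for d in content_similar]
    # (c) Build the two reinforced corpora separately.
    first_reinforced = domain + standardized
    second_reinforced = domain + form_similar
    return first_reinforced, second_reinforced

domain = [Doc("i just want to change some money", "airport", "dialogue")]
out_of_domain = [
    Doc("Applications may be submitted at the passport division.", "airport", "read"),
    Doc("where is the restaurant", "hotel", "dialogue"),
    Doc("Stock prices fell.", "finance", "read"),  # matches neither property
]
# str.lower is only a placeholder for a real form standardizer.
first, second = reinforce(domain, out_of_domain, "airport", "dialogue", str.lower)
```

Text that matches neither the content nor the form of the domain corpus is simply not used in this sketch.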

Hereinafter, the system and method for reinforcing a domain-dependent training corpus using an out-of-domain training corpus according to the present invention are described with reference to the accompanying drawings.

FIG. 2 is a block diagram schematically illustrating an embodiment of a system for reinforcing a domain-dependent training corpus using an out-of-domain training corpus according to the present invention. The training corpus reinforcement system according to the present invention comprises an out-of-domain corpus (201), a form standardization unit (204), first and second adders (209, 210), a domain corpus (205), first and second reinforced corpora (206, 207), and a unigram value adjustment unit (208), referred to below as the weight adjustment unit.

Referring to FIG. 2, the domain corpus (205) is a corpus having a small amount of language model data for a specific domain. Hereinafter, for convenience of explanation, the domain corpus (205) is assumed to be a travel-related conversational corpus.

The out-of-domain corpus (201) is a corpus having language model data for domains other than that of the domain corpus (205). It can be broadly divided into a content-similar corpus (202), whose content is similar to that of the domain corpus (205) but whose form differs, and a form-similar corpus (203), whose form is similar but whose content differs.

The form standardization unit (204) extracts the content-similar corpus (202) from the out-of-domain corpus (201) and converts its form into that of the domain corpus (205), that is, into the travel-related conversational form.

The first adder (209) adds the corpus that the form standardization unit (204) has converted into the form of the domain corpus (205) to the domain corpus (205), generating the first reinforced corpus (206). In other words, the scarce language model data of the domain corpus (205) is enriched by converting the content-similar corpus (202) into the form of the domain corpus (205) and adding it.

The second adder (210) extracts the form-similar corpus (203) from the out-of-domain corpus (201) and adds it to the domain corpus (205), generating the second reinforced corpus (207). That is, when the form is similar to that of the domain corpus (205), the text is added to the domain corpus (205) as it is.

The weight adjustment unit (208) assigns a higher weight to the unigram probability values of the language model of the domain corpus (205) than to those of the form-similar corpus (203) added in the second reinforced corpus (207), thereby raising the probability that the language model of the domain corpus (205) is selected. The reason is that the added form-similar corpus (203) may differ in content from the domain corpus (205); adjusting the unigram values of the domain corpus (205) therefore favors its language model over that of the form-similar corpus (203).
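One simple way to realize this kind of weighting is to scale the word counts from the domain corpus before normalizing. The sketch below makes that assumption; the count-scaling scheme, the function name `weighted_unigrams`, and the default `domain_weight` are illustrative choices, since the patent does not give a formula for the adjustment.

```python
from collections import Counter

def weighted_unigrams(domain_sentences, added_sentences, domain_weight=3.0):
    """Unigram probabilities in which domain-corpus counts are scaled up."""
    counts = Counter()
    for sentence in domain_sentences:
        for word in sentence.split():
            counts[word] += domain_weight   # domain words count more
    for sentence in added_sentences:
        for word in sentence.split():
            counts[word] += 1.0             # added form-similar words count once
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical toy data: an airport domain sentence and an added hotel sentence.
probs = weighted_unigrams(["exchange office"], ["fitness club"], domain_weight=3.0)
p_exchange = probs["exchange"]
p_fitness = probs["fitness"]
```

Here each word occurs once, yet the domain word "exchange" ends up three times as probable as the added word "fitness", while the distribution still sums to one.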

As described above, when the language model data of the domain corpus (205) is reinforced using the out-of-domain corpus (201), the reinforced corpora (206, 207) can be built separately depending on whether the added material is a content-similar corpus (202) or a form-similar corpus (203).

For example, consider the following [Travel consultation - read style] and [Travel consultation - conversational style] texts.

[Travel consultation - read style]

The documents required for passport issuance are two passport photographs and an identification card.

Applications may be submitted at the Passport Division of the Gangnam-gu Office on the first floor of the airport terminal. For details, please inquire at the Passport Division of the Gangnam-gu Office.

[Travel consultation - conversational style]

A: Uh, I just want to change some money.

B: Well, the currency exchange is here on the first floor, and the duty-free shop is on the third floor.

As shown above, texts that are about travel and are also conversational in sentence form can naturally be used to build a conversational travel-domain language model. However, [Travel consultation - read style] data with similar content cannot be used as it is, because it differs from conversational travel text in both form and sentence style. Text data that is about travel but not conversational in style must therefore be converted by the form standardization unit (204) so that it takes on conversational stylistic elements.

Converting the [Travel consultation - read style] text into a conversational style through the form standardization unit (204) gives the following.

[Travel consultation - read style, after conversion]

For the passport, the documents are two passport photos and your ID.

As for where to get it, you can apply at the Gangnam-gu Office Passport Division on the first floor of the airport terminal, and for more details you can just ask the Passport Division.

By converting the form through the form standardization unit (204) in this way, [Travel consultation - read style] text can be used to build a conversational travel-domain language model. Specifically, form standardization applies morpheme tagging (marking the part of speech of each morpheme) to the original text, determines in advance which parts of speech are characteristic of conversational style and which are characteristic of read style, and, whenever a read-style part of speech appears in the out-of-domain data, converts it into a conversational one. The form standardization unit (204) is described in more detail with reference to FIG. 3.

Next, consider the following [Conversation at the airport while traveling] and [Conversation at a hotel while traveling] texts, which are similar in form, both being conversational, but differ in content.

[Conversation at the airport while traveling]

A: Uh, I just want to change some money.

B: Well, the currency exchange is here on the first floor, and the duty-free shop is on the third floor.

[Conversation at a hotel while traveling]

A: Um, I'd like to use the health club. Where are the facilities?

B: There is a fitness club. The swimming pool is outdoors, the gym is on basement level 1, the sauna is on basement level 3, the hot spring is on basement level 2, the bowling alley is on basement level 1, and the game room is on basement level 1.

A: I see. And where are the restaurants?

B: There is a karaoke bar on the first floor, and a Japanese restaurant and a Western restaurant on the second floor.

And on the third floor there are a Korean folk restaurant and a coffee shop, so go wherever suits your taste.

Suppose a language model is to be built for [Conversation at the airport while traveling] but the domain corpus is insufficient. The [Conversation at a hotel while traveling] text above can then be used as an out-of-domain corpus. In terms of topic, its keywords differ from those of [Conversation at the airport while traveling], but its speaking patterns and style are similarly conversational, so it is a form-similar corpus. When a language model is built with such a form-similar corpus, whose content differs but whose style is similar, the weight adjustment unit (208) raises the weights on the probability values of the airport vocabulary appearing in [Conversation at the airport while traveling], increasing the probability that those words are selected.

As described above, using an out-of-domain training corpus to address the shortage of domain-dependent training corpora, which degrades the performance of statistical language models for continuous speech recognition, makes the language model for speech recognition more reliable and robust.

FIG. 3 is a block diagram illustrating in detail the form standardization unit (204) of the system shown in FIG. 2. Referring to FIG. 3, the form standardization unit (204) comprises a morpheme analysis unit (302), a part-of-speech information analysis unit (303), a text standardization unit (304), and a converted out-of-domain corpus (305).

Referring to FIGS. 2 and 3, the morpheme analysis unit (302) analyzes the morphemes of the content-similar corpus (202), the part of the out-of-domain corpus (201) whose content is similar to that of the domain corpus (205), and performs morpheme tagging, marking the part of speech of each morpheme.

Using the tagging results from the morpheme analysis unit (302), the part-of-speech information analysis unit (303) determines which parts of speech are characteristic of conversational style and which are characteristic of read style.

Using the part-of-speech list produced by the part-of-speech information analysis unit (303), the text standardization unit (304) transforms parts of speech so that the text takes a form similar to that of the domain corpus (205), generating the form-converted out-of-domain corpus. For example, if the domain corpus (205) is conversational and the content-similar corpus (202) of the out-of-domain corpus (201) is in read style, the text standardization unit (304) transforms the read-style parts of speech identified by the part-of-speech information analysis unit (303) into conversational ones, converting the content-similar corpus (202) into the form of the domain corpus (205).
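The three stages of FIG. 3 can be sketched as a toy pipeline. Everything here is a simplifying assumption: the input is already split into morphemes, the lexicon-based tagger and the tag names `NOUN`, `ENDING`, and `OTHER` are hypothetical, and the single conversion rule (입니다 to 이예요, spelled as in the original Korean passport example) stands in for whatever rule set a real system would use.

```python
def tag_morphemes(morphemes, lexicon):
    """Stage 1 (302): attach a part-of-speech tag to each morpheme."""
    return [(m, lexicon.get(m, "OTHER")) for m in morphemes]

def find_read_style(tagged, conversion_table):
    """Stage 2 (303): list tagged morphemes known to mark read style."""
    return [pair for pair in tagged if pair in conversion_table]

def standardize(tagged, conversion_table):
    """Stage 3 (304): replace read-style morphemes with conversational ones."""
    return [conversion_table.get(pair, pair) for pair in tagged]

# Hypothetical toy lexicon and conversion table.
LEXICON = {"신분증": "NOUN", "입니다": "ENDING"}
READ_TO_DIALOGUE = {("입니다", "ENDING"): ("이예요", "ENDING")}

tagged = tag_morphemes(["신분증", "입니다"], LEXICON)
read_style = find_read_style(tagged, READ_TO_DIALOGUE)
converted = standardize(tagged, READ_TO_DIALOGUE)
converted_text = "".join(morpheme for morpheme, _ in converted)
```

Keying the conversion table on (morpheme, tag) pairs rather than on surface strings alone reflects the idea that the decision is driven by part-of-speech information; a production system would need a real morphological analyzer in place of the dictionary lookup.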

Through the morpheme analysis, part-of-speech information analysis, and text standardization described above, the content-similar corpus (202) becomes similar to the domain corpus (205) not only in content but also in form.

The present invention can also be embodied as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, as well as implementations in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium can also be distributed over networked computer systems so that the computer-readable code is stored and executed in a distributed manner.

Optimal embodiments have been disclosed above in the drawings and the specification. Although specific terms are used herein, they serve only to describe the present invention and are not intended to limit its meaning or the scope of the invention set forth in the claims. Those of ordinary skill in the art will therefore understand that various modifications and other equivalent embodiments are possible. Accordingly, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

As described above, according to the corpus reinforcement system and method of the present invention, using an out-of-domain training corpus to address the shortage of domain-dependent training corpora, which degrades the performance of statistical language models for continuous speech recognition, makes the language model for speech recognition more reliable and robust.

Claims

A corpus reinforcement system comprising:

a domain corpus having a small amount of language model data for a specific domain;

a content-similar corpus having language model data for a domain different from that of the domain corpus, the content being similar to that of the domain corpus but the form being different;

a form standardization unit for standardizing the form of the content-similar corpus into the form of the domain corpus; and

an adder for adding the corpus standardized by the form standardization unit to the domain corpus to generate a corpus with reinforced language model data.

The system of claim 1, wherein the form standardization unit comprises:

a morpheme analysis unit configured to analyze the morphemes of the content-similar corpus and to tag each morpheme with its part of speech;

a part-of-speech information analysis unit for determining, from the morpheme tagging results of the morpheme analysis unit, which parts of speech are characteristic of each form; and

a text standardization unit configured to use the part-of-speech list from the part-of-speech information analysis unit to transform parts of speech so that the text takes a form similar to that of the domain corpus, generating a form-converted out-of-domain corpus.

A domain corpus having a small amount of language model data for a specific domain;

A format-like corpus having language model data for a domain different from that of the domain corpus, its format being similar to the domain corpus but its content being different;

An adder for adding the format-like corpus to the domain corpus to generate a reinforced corpus for the language model; and

A weight adjustment unit for assigning a high weight to the language model of the domain corpus within the reinforced corpus generated by the adder, thereby increasing the probability that the language model of the domain corpus is selected.

The system of claim 3, wherein the weight adjustment unit

assigns a high unigram probability value to the language model of the domain corpus.

A domain corpus having a small amount of language model data for a specific domain;

A format standardization unit for standardizing the format of the content-like corpus into the same format as the domain corpus;

A first adder for adding the corpus standardized by the format standardization unit to the domain corpus to generate a reinforced corpus for the language model;

A second adder for adding the format-like corpus to the domain corpus to generate a reinforced corpus for the language model; and

A weight adjustment unit for assigning a high weight to the language model of the domain corpus within the reinforced corpus generated by the adder, thereby increasing the probability that the language model of the domain corpus is selected.

The system of claim 5, wherein the format standardization unit

The system of claim 5, wherein the weight adjustment unit

A corpus reinforcement method for reinforcing a domain corpus, which has a small amount of language model data for a specific domain, using an out-of-domain corpus having language model data for a domain different from that of the domain corpus, the method comprising:

(a) extracting, from the out-of-domain corpus, a content-like corpus whose content is similar to the domain corpus but whose format is different;

(b) standardizing the format of the content-like corpus into the same format as the domain corpus; and

(c) adding the corpus standardized in step (b) to the domain corpus to generate a reinforced corpus for the language model.

The method of claim 8, wherein step (b) comprises:

(b1) analyzing the morphemes of the content-like corpus and performing morpheme tagging that marks each morpheme with its part of speech;

(b2) determining, from the morpheme tagging result of step (b1), which parts of speech express format features; and

(b3) generating a transformed out-of-domain corpus by converting those parts of speech, based on the analysis result of step (b2), so that the text has a format similar to the domain corpus.

(a) extracting, from the out-of-domain corpus, a format-like corpus whose format is similar to the domain corpus but whose content is different;

(b) adding the format-like corpus to the domain corpus to generate a reinforced corpus for the language model; and

(c) assigning a weight to the language model of the domain corpus within the reinforced corpus generated in step (b), thereby increasing the probability that the language model of the domain corpus is selected.

The method of claim 10, wherein step (c)

assigns a high unigram probability value to the language model of the domain corpus.

(a) extracting, from the out-of-domain corpus, a content-like corpus whose content is similar to the domain corpus but whose format is different, and a format-like corpus whose format is similar to the domain corpus but whose content is different;

(b) standardizing the format of the content-like corpus into the same format as the domain corpus;

(c) adding the corpus standardized in step (b) and the format-like corpus, respectively, to the domain corpus to generate a reinforced corpus for the language model; and

(d) assigning a weight to the language model of the domain corpus within the reinforced corpus generated in step (c), thereby increasing the probability that the language model of the domain corpus is selected.

The method of claim 12, wherein step (b)

The method of claim 12, wherein step (d)

A computer-readable recording medium on which the corpus reinforcement method of any one of claims 8 to 14 is recorded as program code executable by a computer.