KR101250900B1 - Apparatus for text learning based statistical HMM part of speech tagging and method thereof

Info

Publication number: KR101250900B1
Application number: KR1020090075778A
Authority: KR
Inventors: 권오욱; 최승권; 이기영; 노윤형; 김창현; 서영애; 양성일; 김운; 황금하; 오영순; 윤창호; 박은진; 김영길; 박상규
Original assignee: 한국전자통신연구원
Priority date: 2009-08-17
Filing date: 2009-08-17
Publication date: 2013-04-04
Also published as: KR20110018140A

Abstract

The present invention relates to an apparatus and method for statistical HMM part-of-speech tagging based on document information learning. Context probabilities, lexical probabilities, and tagging error correction rules, all of which vary with the input document, are extracted in real time, so that probability information and correction rules that depend on the genre and domain of the input document to be tagged can be obtained. In addition, by using information learned from the input document in real time, the invention can improve tagging accuracy even for documents of genres or domains that do not appear in the pre-trained corpus. In systems such as machine translation and information retrieval, which require linguistic analysis of documents, it can improve the accuracy of language analysis and thereby overall translation performance and accuracy.

Morpheme, part-of-speech tagging, HMM, learning-based

Description

Apparatus and method for statistical HMM part-of-speech tagging based on document information learning {APPARATUS FOR TEXT LEARNING BASED STATISTICAL HMM PART OF SPEECH TAGGING AND METHOD THEREOF}

The present invention relates to an apparatus and method for statistical hidden Markov model (HMM) part-of-speech tagging based on document information learning. More particularly, it relates to an apparatus and method that automatically tag an input document, learn linguistic features from the tagged document in real time, add the linguistic features of the document under analysis to the pre-learned information, and analyze the document again, thereby improving morpheme part-of-speech tagging performance.

The present invention is derived from research conducted as part of the IT Growth Engine Technology Development Project of the Ministry of Knowledge Economy [Project management number: 2009-S-034-01, Project title: Development of automatic translation technology for Korean-Chinese-English conversational speech and corporate documents].

As is well known, information retrieval systems, question answering (Q&A) systems, machine translation systems, and the like require morphological analysis and part-of-speech tagging of the sentences to be processed. For this purpose, conventional part-of-speech tagging devices most widely use an HMM implemented with statistical techniques.

Such an HMM tagging device morphologically analyzes a raw input sentence consisting of n words w_1, w_2, ..., w_n, and then finds the most suitable part-of-speech sequence t_1, t_2, ..., t_n for the words by applying a second-order Markov model, according to Equation 1:

$$\hat{t}_1, \ldots, \hat{t}_n = \underset{t_1, \ldots, t_n}{\arg\max} \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i) \qquad (1)$$

(Here, P(t_i | t_{i-1}, t_{i-2}) denotes the contextual probability, also called the transition probability, and P(w_i | t_i) denotes the lexical probability, also called the output probability. The values of the context probability and the lexical probability can be statistically extracted from a tagged corpus.)
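As an illustration of the search described above, the following minimal sketch scores every candidate tag sequence by the product of the context probabilities P(t_i | t_{i-1}, t_{i-2}) and the lexical probabilities P(w_i | t_i) and keeps the best one. The function names, the "<s>" padding symbol, and all probability values are assumptions made for this example and are not taken from the patent; a practical tagger would replace the exhaustive loop with dynamic programming such as the Viterbi algorithm.

```python
from itertools import product


def best_tag_sequence(words, tags, context_prob, lexical_prob):
    """Exhaustively search for the tag sequence with the highest trigram HMM score.

    context_prob(t, prev1, prev2) -> P(t | prev1, prev2) and
    lexical_prob(word, t) -> P(word | t) are hypothetical callables;
    "<s>" pads the two positions before the first word.
    """
    best_seq, best_score = None, -1.0
    for seq in product(tags, repeat=len(words)):
        padded = ("<s>", "<s>") + seq
        score = 1.0
        for i, word in enumerate(words):
            prev2, prev1, t = padded[i], padded[i + 1], padded[i + 2]
            score *= context_prob(t, prev1, prev2) * lexical_prob(word, t)
        if score > best_score:
            best_seq, best_score = list(seq), score
    return best_seq, best_score


# Toy probabilities, invented for illustration only.
CONTEXT = {
    ("<s>", "<s>", "DET"): 0.8,
    ("<s>", "DET", "NOUN"): 0.9,
    ("<s>", "DET", "VERB"): 0.1,
}
LEXICAL = {("the", "DET"): 1.0, ("call", "NOUN"): 0.3, ("call", "VERB"): 0.7}


def toy_context(t, prev1, prev2):
    # Unlisted tag trigrams get a small default probability.
    return CONTEXT.get((prev2, prev1, t), 0.01)


def toy_lexical(word, t):
    return LEXICAL.get((word, t), 0.0)


sequence, score = best_tag_sequence(
    ["the", "call"], ["DET", "NOUN", "VERB"], toy_context, toy_lexical
)
```

In this toy setting "call" is lexically more likely to be a verb, yet the context after a determiner makes the noun reading win, which is the interaction between the two probabilities that the formula expresses.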

As described above, HMM models are classified by the number of parts of speech used in the context probability, and a tagging device based on an HMM model whose context probability uses n parts of speech is generally called an n-gram tagging device. In Equation 1, the context probability P(t_i | t_{i-1}, t_{i-2}) uses the three parts of speech t_i, t_{i-1}, and t_{i-2}, so n is 3 and the device is a tri-gram (or 3-gram) tagging device.

In theory, a larger n improves the performance of the HMM model, but the training corpus must then be very large in order to yield meaningful statistics and probability values. With currently available training corpora, most experiments report that n = 3 gives the best performance. Even with n = 3, the training corpus often does not provide sufficient n-gram probability values for the language to be tagged, so a method that also uses the context probabilities for n = 1 and n = 2 is preferred. This approach is called the smoothing paradigm and can be expressed as Equation 2:

$$P(t_i \mid t_{i-1}, t_{i-2}) = \lambda_1 P'(t_i) + \lambda_2 P'(t_i \mid t_{i-1}) + \lambda_3 P'(t_i \mid t_{i-1}, t_{i-2}) \qquad (2)$$

(Here, P' is a probability value obtained from the actual training corpus, P'(t_i | t_{i-1}, t_{i-2}) denotes the context probability or transition probability, and λ_1, λ_2, and λ_3, which satisfy λ_1 + λ_2 + λ_3 = 1, are obtained from the training corpus by linear interpolation.)
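The smoothing idea can be sketched as a weighted sum of unigram, bigram, and trigram relative frequencies. In the sketch below the weights (0.1, 0.3, 0.6), the pairing of each weight with an n-gram order, and the toy tag stream are assumptions chosen for illustration; the text only states that the weights sum to 1 and are estimated from the training corpus.

```python
from collections import Counter


def interpolated_context_prob(t, prev1, prev2, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    """Linearly interpolate unigram, bigram, and trigram tag estimates.

    uni, bi, tri are Counters of tag n-grams; lambdas must sum to 1.
    """
    l1, l2, l3 = lambdas
    total = sum(uni.values())
    p_uni = uni[t] / total if total else 0.0
    p_bi = bi[(prev1, t)] / uni[prev1] if uni[prev1] else 0.0
    p_tri = tri[(prev2, prev1, t)] / bi[(prev2, prev1)] if bi[(prev2, prev1)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


# Toy tag stream standing in for a tagged training corpus.
tag_stream = ["DET", "NOUN", "VERB", "DET", "NOUN"]
uni = Counter(tag_stream)
bi = Counter(zip(tag_stream, tag_stream[1:]))
tri = Counter(zip(tag_stream, tag_stream[1:], tag_stream[2:]))

# Trigram (VERB, DET, NOUN) was observed; (VERB, DET, VERB) was not.
seen = interpolated_context_prob("NOUN", "DET", "VERB", uni, bi, tri)
unseen = interpolated_context_prob("VERB", "DET", "VERB", uni, bi, tri)
```

The unseen trigram still receives a small non-zero probability through its unigram term, which is the practical benefit of backing the trigram estimate with lower-order ones.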

The context probability in Equation 1 above is a part-of-speech context probability; more recently, performance has also been improved by using a lexicalized part-of-speech context probability that extends it. The values of the context probability and the lexical probability in Equation 1 can be statistically extracted from a tagged corpus.

Because a statistical HMM tagging device is trained to fit the linguistic and lexical characteristics of the tagged corpus used for training, it can perform very well on documents from the domain of that corpus, but its performance cannot be guaranteed on documents from other domains. To address this drawback, conventional methods either learn from a raw corpus of the new domain, or use information obtained from a raw corpus of the new domain to supplement the pre-learned information so that it can also be applied to the new domain.

Meanwhile, machine translation, information extraction, information retrieval, and question answering systems that use a part-of-speech tagging device process their input in units of documents. Each document has its own characteristics, so if an HMM model is built from the n-grams and lexical probabilities specific to a document, a part-of-speech tagging model dedicated to that document can be realized and high performance achieved.

Existing methods do not learn a language model of the document to be processed. They learn a general language model from a variety of documents and therefore do not take the individual characteristics of each document into account. Within a particular document, even a word with high part-of-speech ambiguity tends to be used frequently with the same part of speech, and the same sentence patterns and phrases tend to be used repeatedly.

In addition, depending on the document, lexical and context probabilities often behave differently from those observed in the pre-trained part-of-speech tagged corpus. For example, the word "call" may be used far more often as a verb than as a noun in the training corpus, yet never be used as a verb in the input document, and an n-gram that has a low probability in the existing corpus may occur with high probability in that document.

As described above, conventional part-of-speech tagging devices use only information learned from a large part-of-speech tagged corpus and tag documents of various genres or styles with the same lexical and context probabilities, so they perform poorly on documents of genres or styles that were not learned. Furthermore, because they accept and tag input sentence by sentence without considering the context and lexical probabilities of the input document as a whole, they may assign different parts of speech to words or expressions that are repeated within a particular document.

Accordingly, the present invention has been devised to solve the problems described above. It provides an apparatus and method for statistical HMM part-of-speech tagging based on document information learning that automatically tag an input document, learn linguistic features from the tagged document in real time, add the linguistic features of the document under analysis to the pre-learned information, and analyze the document again, thereby improving morpheme part-of-speech tagging performance.

A statistical HMM part-of-speech tagging apparatus based on document information learning according to one aspect of the present invention includes: a real-time learning-based statistical part-of-speech tagging unit that performs real-time statistical part-of-speech tagging on sentences obtained by morphologically analyzing an input document, using the context and lexical probabilities stored in a pre-learned context probability information DB and lexical probability information DB together with those learned from the input document and stored in a real-time context probability information DB and a real-time lexical probability information DB, and provides a tagging result; a real-time learning-based tagging error correction unit that performs a first correction of errors in the tagging result using a pre-built tagging error correction rule DB and a second correction using real-time tagging error correction rules stored in a real-time tagging error correction rule DB; and a real-time document information learning unit that part-of-speech tags the input document using the context probability information DB and the lexical probability information DB, extracts the context probability information and lexical probability information appearing in the input document, builds the real-time context probability information DB and the real-time lexical probability information DB from the extracted information, and analyzes vocabulary and part-of-speech patterns in the tagged result of the input document to build the real-time tagging error correction rule DB.

The real-time document information learning unit includes a sentence separation unit and a morphological analysis unit, and further includes: a statistical part-of-speech tagging unit that performs statistical part-of-speech tagging on the morphologically analyzed sentences using the statistical information of the context and lexical probabilities stored in the context probability information DB and the lexical probability information DB, and provides a part-of-speech tagging result; a tagging error correction unit that corrects errors in the part-of-speech tagging result using the tagging error correction rules stored in the tagging error correction rule DB and provides a final tagging result in sentence units; a real-time probability information learning unit that learns real-time probability information from the final tagging result and stores the resulting words and frequencies for obtaining lexical probabilities and the n-gram information for obtaining context probabilities in the respective real-time context probability information DB and real-time lexical probability information DB; and a tagging error correction rule extraction unit that stores, in the real-time tagging error correction rule DB, repeatedly used lexical phrases extracted from the final tagging result through the tagging error correction rules, together with the part-of-speech sequences with which those phrases are tagged.

The real-time probability information learning unit deletes the words and frequencies for obtaining lexical probabilities and the n-gram information for obtaining context probabilities, which are stored in the real-time context probability information DB and the real-time lexical probability information DB, after part-of-speech tagging has been performed.

The tagging error correction rule extraction unit deletes the repeatedly used lexical phrases and the part-of-speech sequences with which they are tagged, which are stored in the real-time tagging error correction rule DB, after part-of-speech tagging has been performed.

The sentences morphologically analyzed by the morphological analysis unit are applied to

$$\hat{t}_1, \ldots, \hat{t}_n = \underset{t_1, \ldots, t_n}{\arg\max} \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}, t_{i-3}) \, P(w_i \mid t_i)$$

(where P(t_i | t_{i-1}, t_{i-2}, t_{i-3}) is the context probability, P(w_i | t_i) is the lexical probability, and the context and lexical probabilities also include information obtained from the input document)

to select the part-of-speech sequence (t_1, t_2, ..., t_n) for the words of the input sentence.

The context probability P(t_i | t_{i-1}, t_{i-2}, t_{i-3}) is obtained through

<Equation>

(where P''(t_i | t_{i-1}, t_{i-2}, t_{i-3}) is a context probability; the probability values of P' are obtained from information pre-learned from the part-of-speech tagged corpus and stored in the context probability information DB and the lexical probability information DB; the probability values of P'' are obtained using the probability information stored in the real-time context probability information DB and the real-time lexical probability information DB; and λ_1, λ_2, λ_3, and λ_4, which satisfy λ_1 + λ_2 + λ_3 + λ_4 = 1, are obtained by linear interpolation from a training corpus that is part-of-speech tagged in document units).

The lexical probability P(w_i | t_i) is obtained through

<Equation>

(where the probability values of P' are obtained from information pre-learned from the part-of-speech tagged corpus and stored in the context probability information DB and the lexical probability information DB; the probability values of P'' are obtained using the probability information stored in the real-time context probability information DB and the real-time lexical probability information DB; and α and β are experimental values obtained from a training corpus that is part-of-speech tagged in document units, relating the linguistic features of the input document to the pre-learned linguistic features).
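The equation itself is not reproduced in this text, so the following is only a sketch of one plausible reading: the pre-learned estimate P' and the document-level estimate P'' are mixed with the weights α and β. The linear form, the default weights of 0.5, the function name, and the toy counts are all assumptions for illustration, not the patented formula.

```python
def combined_lexical_prob(word, tag, corpus_counts, doc_counts, alpha=0.5, beta=0.5):
    """Mix a pre-learned and a document-level estimate of P(word | tag).

    corpus_counts and doc_counts map (word, tag) pairs to frequencies.
    The weighted-sum form and the alpha/beta defaults are assumptions.
    """
    def relative_frequency(counts):
        tag_total = sum(c for (_, t), c in counts.items() if t == tag)
        return counts.get((word, tag), 0) / tag_total if tag_total else 0.0

    return alpha * relative_frequency(corpus_counts) + beta * relative_frequency(doc_counts)


# Toy counts: "call" is mostly a verb in the corpus but only a noun in the document.
corpus_counts = {
    ("call", "VERB"): 80, ("run", "VERB"): 120,
    ("call", "NOUN"): 20, ("phone", "NOUN"): 180,
}
doc_counts = {("call", "NOUN"): 9, ("phone", "NOUN"): 1, ("run", "VERB"): 5}

p_verb = combined_lexical_prob("call", "VERB", corpus_counts, doc_counts)
p_noun = combined_lexical_prob("call", "NOUN", corpus_counts, doc_counts)
```

With these toy counts the corpus alone gives P'(call | VERB) = 0.4 against P'(call | NOUN) = 0.1, but after mixing in the document statistics the noun reading scores 0.5 against 0.2 for the verb reading, mirroring the "call" example given earlier in the description.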

The real-time learning-based tagging error correction unit corrects errors in the tagging result using the tagging error correction rule DB. Where an uncorrected portion matches a rule in the real-time tagging error correction rule DB, it applies as real-time tagging error correction rules only those word sequences stored in the real-time tagging error correction rule DB whose positive probability (positive frequency / (positive frequency + negative probability)) falls within or above the range of 0.71 to 0.79.
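The filtering criterion can be sketched as follows. The sketch assumes that each candidate rule carries a count of occurrences that agree with its tag sequence (positive) and a count that disagree (negative), and it reads the denominator as the sum of those two counts. The threshold of 0.75 is simply one value inside the stated 0.71 to 0.79 range; the function name and the data layout are hypothetical.

```python
def usable_rules(rule_stats, threshold=0.75):
    """Keep only rules whose positive ratio reaches the threshold.

    rule_stats maps a phrase to (tag_sequence, positive_count, negative_count).
    The count-based denominator and the threshold value are assumptions.
    """
    kept = {}
    for phrase, (tags, positive, negative) in rule_stats.items():
        total = positive + negative
        if total and positive / total >= threshold:
            kept[phrase] = tags
    return kept


# Toy rule statistics, invented for illustration.
rule_stats = {
    ("machine", "translation"): (("NOUN", "NOUN"), 9, 1),  # ratio 0.9
    ("call", "center"): (("VERB", "NOUN"), 2, 2),          # ratio 0.5
}
kept_rules = usable_rules(rule_stats)
```

Only the first rule survives, since a phrase tagged consistently nine times out of ten is a much safer basis for overriding the statistical tagger than one tagged two different ways equally often.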

A statistical HMM part-of-speech tagging method based on document information learning according to another aspect of the present invention includes: separating an input document into sentences; morphologically analyzing the sentences using a morphological analysis dictionary database (DB); performing real-time statistical part-of-speech tagging on the morphologically analyzed sentences using the context and lexical probabilities stored in the pre-learned context probability information DB and lexical probability information DB together with those learned from the input document and stored in the real-time context probability information DB and the real-time lexical probability information DB; and outputting a tagging result obtained by performing a first correction of errors in the result of the real-time statistical part-of-speech tagging using a pre-built tagging error correction rule DB and a second correction using real-time tagging error correction rules stored in a real-time tagging error correction rule DB.

Before the separating step, the method further includes part-of-speech tagging the input document using the context probability information DB and the lexical probability information DB, extracting the context probability information and lexical probability information appearing in the input document, building the real-time context probability information DB and the real-time lexical probability information DB from the extracted information, and analyzing vocabulary and part-of-speech patterns in the tagged result of the input document to build the real-time tagging error correction rule DB.
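The overall two-pass flow (tag with pre-learned information, learn from the result, tag again, then discard what was learned) can be sketched as below. To keep the example short, the second pass is replaced by a simple per-word majority vote over the first-pass tags; this stand-in is an assumption for illustration and is not the HMM re-tagging or rule-based correction described in the text. All names are hypothetical.

```python
from collections import Counter, defaultdict


def two_pass_tagging(sentences, base_tagger):
    """Tag, learn document-level statistics, retag, then discard the statistics.

    base_tagger(words) -> list of (word, tag) stands in for the tagger that
    uses only pre-learned information. The second pass here is a majority
    vote per word, a deliberate simplification.
    """
    # Pass 1: tag every sentence with pre-learned information only.
    draft = [base_tagger(words) for words in sentences]

    # Learning step: collect document-specific word/tag statistics.
    votes = defaultdict(Counter)
    for tagged in draft:
        for word, tag in tagged:
            votes[word][tag] += 1

    # Pass 2: retag each word with its most frequent tag in this document.
    final = [[(w, votes[w].most_common(1)[0][0]) for w, _ in tagged] for tagged in draft]

    # The document-specific information is kept only while tagging.
    votes.clear()
    return final


def toy_base_tagger(words):
    # Invented behaviour: "call" is tagged VERB only when sentence-initial.
    return [
        (w, ("VERB" if i == 0 else "NOUN") if w == "call" else "X")
        for i, w in enumerate(words)
    ]


document = [["call", "me"], ["the", "call"], ["a", "call"]]
tagged_document = two_pass_tagging(document, toy_base_tagger)
```

In the toy document the first pass tags "call" inconsistently, and the second pass makes all three occurrences agree on the majority tag, which illustrates why document-level information helps with repeated expressions.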

The building step includes: performing statistical part-of-speech tagging on the morphologically analyzed sentences using the statistical information of the context and lexical probabilities stored in the context probability information DB and the lexical probability information DB; providing a final tagging result in sentence units, generated by correcting errors in the result of the statistical part-of-speech tagging using the tagging error correction rules stored in the tagging error correction rule DB; a first storing step of learning real-time probability information from the final tagging result and storing the resulting words and frequencies for obtaining lexical probabilities and the n-gram information for obtaining context probabilities in the respective real-time context probability information DB and real-time lexical probability information DB; and a second storing step of storing, in the real-time tagging error correction rule DB, repeatedly used lexical phrases extracted from the final tagging result through the tagging error correction rules, together with the part-of-speech sequences with which those phrases are tagged.

In the first storing step, the words and frequencies for obtaining lexical probabilities and the n-gram information for obtaining context probabilities, which are stored in the real-time context probability information DB and the real-time lexical probability information DB, are deleted after part-of-speech tagging has been performed.

In the second storing step, the repeatedly used lexical phrases and the part-of-speech sequences with which they are tagged, which are stored in the real-time tagging error correction rule DB, are deleted after part-of-speech tagging has been performed.

In the outputting step, errors in the tagging result are corrected using the tagging error correction rule DB. Where an uncorrected portion matches a rule in the real-time tagging error correction rule DB, only those word sequences stored in the real-time tagging error correction rule DB whose positive probability (positive frequency / (positive frequency + negative probability)) falls within or above the range of 0.71 to 0.79 are applied as real-time tagging error correction rules, and the corrected result is output as the tagging result.

The present invention automatically tags an input document, learns linguistic features from the tagged document in real time, adds the linguistic features of the document under analysis to the pre-learned information, and analyzes the document again. It can thereby solve the conventional problems of poor performance on documents of genres or styles that were not learned and of assigning different parts of speech to words or expressions that are repeated within a particular document.

In addition, by providing an apparatus and method for statistical HMM part-of-speech tagging based on document information learning, the present invention extracts in real time the context probabilities, lexical probabilities, and tagging error correction rules that vary with the input document, so that probability information and correction rules that depend on the genre and domain of the input document to be tagged can be obtained.

In addition, by using information learned from the input document in real time, the present invention can improve tagging accuracy even for documents of genres or domains that do not appear in the pre-trained corpus.

In addition, in systems such as machine translation and information retrieval, which require linguistic analysis of documents, the present invention has the advantage of improving the accuracy of language analysis and thereby overall translation performance and accuracy.

Hereinafter, the operating principle of the present invention is described in detail with reference to the accompanying drawings. In the following description, detailed descriptions of well-known functions or configurations are omitted where they could unnecessarily obscure the subject matter of the present invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intentions or customs of users and operators. Their definitions should therefore be based on the contents of this specification as a whole.

FIG. 1 is a block diagram of a statistical HMM part-of-speech tagging apparatus 100 based on document information learning according to an embodiment of the present invention. The apparatus may include a real-time document information learning unit 101, a real-time learning-based statistical part-of-speech tagging unit 103, a real-time learning-based tagging error correction unit 105, a sentence separation rule database (hereinafter, DB) 107, a morphological analysis dictionary DB 109, a context probability information DB 111, a lexical probability information DB 113, a tagging error correction rule DB 115, a real-time context probability information DB 117, a real-time lexical probability information DB 119, and a real-time tagging error correction rule DB 121.

As shown in FIG. 2, the real-time document information learning unit 101 may include a sentence separation unit 1011, a morphological analysis unit 1013, a statistical part-of-speech tagging unit 1015, a tagging error correction unit 1017, a real-time probability information learning unit 1019, and a tagging error correction rule extraction unit 1021.

The sentence separation unit 1011 separates an input document S1, whether input for the first time or re-input, into sentences using the sentence separation rules stored in the sentence separation rule DB 107, that is, marks such as punctuation marks and quotation marks, line break characters, and space characters, and provides the sentences to the morphological analysis unit 1013.
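A minimal sketch of rule-based sentence separation along these lines is shown below. It treats each line break as a boundary and also splits after terminal punctuation, optionally followed by a closing quotation mark or bracket, when whitespace follows. The specific regular expression and function name are assumptions for illustration; the actual rules in the sentence separation rule DB are not given in the text, and this simple pattern would, for example, split incorrectly after abbreviations.

```python
import re

# Whitespace that follows ., ! or ?, optionally with one closing quote or bracket.
_BOUNDARY = re.compile(r"""(?:(?<=[.!?])|(?<=[.!?]["')\]]))\s+""")


def split_sentences(document):
    """Split a document into sentences on line breaks and terminal punctuation."""
    sentences = []
    for line in document.splitlines():
        for part in _BOUNDARY.split(line.strip()):
            if part:
                sentences.append(part)
    return sentences


sample = 'He said "Call me." Then he left.\nNew line without a period'
result = split_sentences(sample)
```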

The morphological analysis unit 1013 performs morphological analysis on the separated sentences input from the sentence separation unit 1011 using the morphological analysis dictionary stored in the morphological analysis dictionary DB 109, and provides the morphological analysis result to the real-time learning-based statistical part-of-speech tagging unit 103 and the statistical part-of-speech tagging unit 1015.

The statistical part-of-speech tagging unit 1015 performs statistical part-of-speech tagging on the morphological analysis result input from the morphological analysis unit 1013 by applying to Equation 1 the statistical information of the context and lexical probabilities stored in the pre-learned context probability information DB 111 and lexical probability information DB 113, and provides the statistical part-of-speech tagging result to the tagging error correction unit 1017. Here, the statistical part-of-speech tagging unit 1015 preferably uses the pre-learned statistical information in order to extract the lexical probability information and context probability information appearing in the input document S1.

The tagging error correction unit 1017 corrects errors in the statistical part-of-speech tagging result input from the statistical part-of-speech tagging unit 1015, that is, in the selected parts of speech, using the tagging error correction rules stored in the tagging error correction rule DB 115. It provides the resulting sentence-level final tagging result, which is used for extracting real-time document information, to the real-time probability information learning unit 1019 and the tagging error correction rule extraction unit 1021.

The real-time probability information learning unit 1019 learns real-time probability information from the sentence-level final tagging result input from the tagging error correction unit 1017. It may provide the information generated in this way to the real-time context probability information DB 117 and the real-time lexical probability information DB 119: the words used to obtain lexical probabilities, the n-gram frequencies of the selected parts of speech, the word frequencies, the co-occurrence frequencies of words and their tagged parts of speech, and the n-gram information used to obtain context probabilities. (For example, because only the language model of the input document S1 is considered, a larger (n+1)-gram range can be extracted to capture the linguistic characteristics of the input document S1 in more depth, extending the real-time context probability up to 4-grams.) The information stored in the real-time context probability information DB 117 and the real-time lexical probability information DB 119 needs to be kept only during the part-of-speech tagging process, and may be deleted after part-of-speech tagging has been performed.
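The counting performed by the real-time probability information learning unit 1019 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `collect_realtime_counts`, the in-memory `Counter` objects standing in for DB 117 and DB 119, and the English example tags are all assumptions made for this sketch.

```python
from collections import Counter

def collect_realtime_counts(tagged_sentences, max_n=4):
    """Count tag n-grams (n = 1..max_n), word frequencies, and
    word/tag co-occurrence frequencies from tagged sentences.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    The three Counters are hypothetical stand-ins for the real-time DBs.
    """
    tag_ngrams = Counter()  # keys: tuples of 1..max_n consecutive tags
    word_freq = Counter()   # keys: word
    word_tag = Counter()    # keys: (word, tag)
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for word, tag in sentence:
            word_freq[word] += 1
            word_tag[(word, tag)] += 1
        for n in range(1, max_n + 1):
            for i in range(len(tags) - n + 1):
                tag_ngrams[tuple(tags[i:i + n])] += 1
    return tag_ngrams, word_freq, word_tag

# Toy tagged document (illustrative data only).
doc = [
    [("the", "DT"), ("old", "JJ"), ("man", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("man", "NN"), ("sleeps", "VBZ")],
]
tag_ngrams, word_freq, word_tag = collect_realtime_counts(doc)
```

Because the counts are needed only while the current document is being tagged, the three counters can simply be discarded afterwards, which mirrors the deletion described above.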

The tagging error correction rule extraction unit 1021 extracts, from the sentence-level final tagging result input from the tagging error correction unit 1017 and through the tagging error correction rules, repeatedly used lexical phrases (phrases such as compound nouns and noun phrases) and the part-of-speech sequences with which those phrases were tagged, and may provide them to the real-time tagging error correction rule DB 121. For each lexical phrase stored in the real-time tagging error correction rule DB 121, only the most frequently repeated part-of-speech sequence is kept and used as a tagging error correction rule. This information also needs to be kept only during the part-of-speech tagging process, and may be deleted after part-of-speech tagging has been performed.

The real-time learning-based statistical part-of-speech tagging unit 103 performs statistical part-of-speech tagging in real time on the morphological analysis result input from the morpheme analysis unit 1013. It uses not only the pre-learned context probability information DB 111 and lexical probability information DB 113 but also the context probabilities and lexical probabilities learned from the input document S1 and stored in the real-time context probability information DB 117 and the real-time lexical probability information DB 119, and may provide the statistical part-of-speech tagging result to the real-time learning-based tagging error correction unit 105.

In other words, the real-time learning-based statistical part-of-speech tagging unit 103 applies the morphological analysis result input from the morpheme analysis unit 1013 to <Equation 3>

(where P(t_i|t_i-1,t_i-2,t_i-3) is the context probability and P(w_i|t_i) is the lexical probability; these context and lexical probabilities are obtained not only from the part-of-speech-tagged corpus but also from the input document S1, the two sources being used together)

and can thereby select the optimal part-of-speech sequence t₁, t₂, ..., t_n for the words w₁, w₂, ..., w_n of the input sentence. That is, Equation 3 is an ordinary HMM model that differs from Equation 1 above only in that the context probability looks at one more part of speech; as a third-order HMM model, Equation 3 can use 4-grams in the context probability.
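The image of Equation 3 is not reproduced in this text. A standard third-order HMM tagging objective that is consistent with the description above (a product of a 4-gram context probability and a lexical probability, maximized over tag sequences) would be the following. This is a reconstruction from the surrounding definitions, not a copy of the original equation.

```latex
\hat{t}_1, \dots, \hat{t}_n
  = \operatorname*{arg\,max}_{t_1, \dots, t_n}
    \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}, t_{i-3}) \, P(w_i \mid t_i)
```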

Next, the context probability P(t_i|t_i-1,t_i-2,t_i-3) can be obtained through <Equation 4>,

and the lexical probability P(w_i|t_i) can be obtained through <Equation 5>.

Here, in Equations 4 and 5, the probability values P' are obtained from information pre-learned from the part-of-speech-tagged corpus and stored in the context probability information DB 111 and the lexical probability information DB 113. The probability values P'' are obtained from the probability information that the real-time document information learning unit 101 extracted from the input document S1 and stored in the real-time context probability information DB 117 and the real-time lexical probability information DB 119. The 4-gram information uses only P''(t_i|t_i-1,t_i-2,t_i-3), so that the linguistic features appearing in the input document S1 are applied more strongly. The weights λ₁, λ₂, λ₃, λ₄, with λ₁ + λ₂ + λ₃ + λ₄ = 1, can be obtained experimentally by linear interpolation using a training corpus that is part-of-speech tagged in document units. α and β can likewise be obtained as experimental values from the document-unit part-of-speech-tagged training corpus, by determining how strongly the linguistic features of the input document S1 should be reflected in the pre-learned linguistic features to achieve the best performance.
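The images of Equations 4 and 5 are not reproduced in this text, so their exact form is unknown. The sketch below shows one interpolation scheme that fits the description; it is an assumption, not the patent's equations. It assumes that λ₁ weights the 4-gram term taken from P'' alone, that λ₂, λ₃, λ₄ weight the trigram, bigram, and unigram terms, that α mixes P' and P'' within each lower-order term, and that β mixes P' and P'' for the lexical probability. The function names are hypothetical.

```python
def interpolated_context_prob(p_pre, p_rt, lambdas, alpha):
    """Assumed form of the context probability.

    p_pre:   dict {1, 2, 3} -> pre-learned P' for the unigram/bigram/trigram
    p_rt:    dict {1, 2, 3, 4} -> real-time P'' for orders 1..4
    lambdas: (l1, l2, l3, l4), which must sum to 1
    alpha:   weight of the pre-learned estimate in each lower-order term
    """
    l1, l2, l3, l4 = lambdas
    if abs(l1 + l2 + l3 + l4 - 1.0) > 1e-9:
        raise ValueError("lambda weights must sum to 1")

    def mix(order):
        # Blend the pre-learned and real-time estimates of the same order.
        return alpha * p_pre[order] + (1.0 - alpha) * p_rt[order]

    # The 4-gram term uses the real-time estimate only, as the text states.
    return l1 * p_rt[4] + l2 * mix(3) + l3 * mix(2) + l4 * mix(1)

def interpolated_lexical_prob(p_pre, p_rt, beta):
    """Assumed form of the lexical probability: a beta-weighted blend."""
    return beta * p_pre + (1.0 - beta) * p_rt

# Illustrative numbers only.
context = interpolated_context_prob(
    p_pre={1: 0.1, 2: 0.2, 3: 0.3},
    p_rt={1: 0.2, 2: 0.4, 3: 0.5, 4: 0.6},
    lambdas=(0.4, 0.3, 0.2, 0.1),
    alpha=0.5,
)
lexical = interpolated_lexical_prob(p_pre=0.2, p_rt=0.6, beta=0.7)
```

Under this assumed form, setting α and β to 1 ignores the input document in every term except the 4-gram term, which is one way to see how these parameters control the influence of the real-time statistics.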

The real-time learning-based tagging error correction unit 105 first corrects errors in the statistical part-of-speech tagging result input from the real-time learning-based statistical part-of-speech tagging unit 103, that is, in the selected parts of speech, by referring to the pre-built (for example, built by experts) tagging error correction rule DB 115. It then corrects any part left uncorrected that matches a rule in the real-time tagging error correction rule DB 121. Specifically, among the lexical sequences learned from the input document S1 and stored in the real-time tagging error correction rule DB 121, only those whose positive probability (positive frequency / (positive frequency + negative probability)) is greater than or falls within the range of 0.71 to 0.79 (for example, 0.75) are used as real-time tagging error correction rules. The unit can thus output a part-of-speech tagging result S2 corrected to fit the sentence patterns that occur frequently in the input document S1.
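The thresholded use of real-time rules can be sketched as follows. Several points are assumptions made for this sketch: the denominator is read as positive frequency plus negative frequency, the threshold is fixed at the example value 0.75, rules are kept in a plain dictionary keyed by word sequence, and the earlier pass through the pre-built rule DB 115 is omitted. The names `positive_ratio` and `apply_realtime_rules` are hypothetical.

```python
def positive_ratio(positive, negative):
    """Share of occurrences that agree with the stored tag sequence."""
    total = positive + negative
    return positive / total if total else 0.0

def apply_realtime_rules(words, tags, rules, threshold=0.75):
    """Overwrite tags wherever a sufficiently reliable rule matches.

    rules: dict mapping a tuple of words to (tag_tuple, positive, negative).
    Only phrases of two or more words are tried, longest match first.
    """
    corrected = list(tags)
    for start in range(len(words)):
        for end in range(len(words), start + 1, -1):
            rule = rules.get(tuple(words[start:end]))
            if rule is None:
                continue
            rule_tags, positive, negative = rule
            if positive_ratio(positive, negative) >= threshold:
                corrected[start:end] = rule_tags
            break  # stop at the longest phrase that has a rule
    return corrected

# Illustrative data only.
words = ["machine", "translation", "system", "works"]
tags = ["NN", "VB", "NN", "VBZ"]
rules = {
    ("machine", "translation", "system"): (("NN", "NN", "NN"), 9, 1),
    ("system", "works"): (("NN", "NNS"), 1, 3),
}
result = apply_realtime_rules(words, tags, rules)
```

In this example the first rule has a ratio of 0.9 and is applied, while the second has a ratio of 0.25 and is ignored, so an unreliable pattern observed in the document does not override the statistical tagger.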

The sentence separation rule DB 107 may store the rules for splitting text into sentences, the morpheme analysis dictionary DB 109 may store the dictionary for morphological analysis, the context probability information DB 111 may store pre-learned context probability information, the lexical probability information DB 113 may store pre-learned lexical probability information, and the tagging error correction rule DB 115 may store tagging error correction rules.

The real-time context probability information DB 117 and the real-time lexical probability information DB 119 may store the information input from the real-time probability information learning unit 1019: the words used to obtain lexical probabilities, the n-gram frequencies of the selected parts of speech, the word frequencies, the co-occurrence frequencies of words and their tagged parts of speech, and the n-gram information used to obtain context probabilities.

The real-time tagging error correction rule DB 121 may store the repeatedly used lexical phrases (for example, phrases such as compound nouns and noun phrases) input from the tagging error correction rule extraction unit 1021, together with the part-of-speech sequences with which those phrases were tagged.

Therefore, the present invention automatically tags the input document, learns linguistic features from the tagged document in real time, adds the linguistic features of the document to be analyzed to the pre-learned information, and analyzes the document again. This resolves two problems of conventional approaches: low performance on documents of genres or styles not seen in training, and the tagging of words or expressions repeated within a particular document with different parts of speech at each occurrence. In addition, by extracting in real time the context probabilities, lexical probabilities, and tagging error correction rules that vary with the input document, the invention can extract probability information and correction rules that depend on the genre and domain of the input document to be tagged.

Next, the document information learning-based statistical HMM part-of-speech tagging process in an embodiment of the present invention having the configuration described above will be described.

FIG. 3 is a flowchart sequentially illustrating a document information learning-based statistical HMM part-of-speech tagging method according to an embodiment of the present invention.

That is, in the document information learning-based statistical HMM part-of-speech tagging method according to the present invention, real-time learning is performed on the input document S1 using pre-built analysis information to build real-time document information suited to the linguistic characteristics of the input document S1, and statistical part-of-speech tagging of the input document S1 is then performed using the information obtained by the real-time learning.

First, the process of performing real-time learning on the input document S1 using pre-built analysis information and building (storing) real-time document information suited to the linguistic characteristics of the input document S1 is described in more detail.

The sentence separation unit 1011 in the real-time document information learning unit 101 splits the input document S1 into sentences (S301) according to the sentence separation rules stored in the pre-built sentence separation rule DB 107, that is, using symbols such as punctuation marks and quotation marks together with newline and whitespace characters, and provides the sentences to the morpheme analysis unit 1013 (S303).

The morpheme analysis unit 1013 performs morphological analysis (S305) on the separated sentences input from the sentence separation unit 1011, using the morpheme analysis dictionary stored in the pre-built morpheme analysis dictionary DB 109, and provides the morphological analysis result to the statistical part-of-speech tagging unit 1015 (S307).

The statistical part-of-speech tagging unit 1015 performs statistical part-of-speech tagging (S309) on the morphological analysis result input from the morpheme analysis unit 1013 by applying to Equation 1 the statistical information on context probabilities and lexical probabilities (for example, pre-learned statistical information) stored in the pre-learned context probability information DB 111 and lexical probability information DB 113, and provides the statistical part-of-speech tagging result to the tagging error correction unit 1017 (S311).

The tagging error correction unit 1017 corrects errors (S313) in the statistical part-of-speech tagging result input from the statistical part-of-speech tagging unit 1015, that is, in the selected parts of speech, using the tagging error correction rules stored in the pre-built tagging error correction rule DB 115. It provides the resulting sentence-level final tagging result, which is used for real-time document information extraction, to each of the real-time probability information learning unit 1019 and the tagging error correction rule extraction unit 1021 (S315).

Then, the real-time probability information learning unit 1019 learns real-time probability information (S317) from the sentence-level final tagging result input from the tagging error correction unit 1017, and stores the generated information in each of the real-time context probability information DB 117 and the real-time lexical probability information DB 119 (S319): the words used to obtain lexical probabilities, the n-gram frequencies of the selected parts of speech, the word frequencies, the co-occurrence frequencies of words and their tagged parts of speech, and the n-gram information used to obtain context probabilities. (For example, because only the language model of the input document S1 is considered, a larger (n+1)-gram range is extracted to capture the linguistic characteristics of the input document S1 in more depth, extending the real-time context probability up to 4-grams.)

Subsequently, the tagging error correction rule extraction unit 1021 extracts (S321), from the sentence-level final tagging result input from the tagging error correction unit 1017 and through the tagging error correction rules, repeatedly used lexical phrases (phrases such as compound nouns and noun phrases) and the part-of-speech sequences with which those phrases were tagged, and stores them in the real-time tagging error correction rule DB 121 (S323).

For example, the tagging error correction rule extraction unit 1021 selects, from the sentence-level final tagging result input from the tagging error correction unit 1017, phrases that can be easily identified from the tagged part-of-speech sequence alone, namely compound nouns and noun phrases. Using the word sequence of each selected phrase as an index key, it stores in the real-time tagging error correction rule DB 121 the part-of-speech sequence of the phrase's words together with a positive frequency and a negative frequency. In other words, when a word sequence, from first word to last word, in the sentence-level final tagging result already exists as an index key in the real-time tagging error correction rule DB 121, the unit adds 1 to the positive frequency if the part-of-speech sequence is the one already stored, adds 1 to the negative frequency if it differs from the stored part-of-speech sequence, and stores the result in the real-time tagging error correction rule DB 121.
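The positive and negative frequency bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions: the rule store is a plain dictionary rather than DB 121, a phrase seen for the first time is stored with a positive frequency of 1 (the text does not state the initial value), and phrase selection from the tag sequence is assumed to have happened already. The name `update_rule` is hypothetical.

```python
def update_rule(rules, words, tags):
    """Record one tagged occurrence of a phrase.

    rules: dict keyed by the phrase's word tuple (the index key); each value
           holds the stored tag sequence and its positive/negative counts.
    """
    key = tuple(words)
    tags = tuple(tags)
    entry = rules.get(key)
    if entry is None:
        # First occurrence: assumed to start with a positive frequency of 1.
        rules[key] = {"tags": tags, "positive": 1, "negative": 0}
    elif entry["tags"] == tags:
        entry["positive"] += 1   # same tag sequence as the stored one
    else:
        entry["negative"] += 1   # different tag sequence for the same words
    return rules

# Illustrative data only: the same compound tagged three times.
rules = {}
update_rule(rules, ["translation", "system"], ["NN", "NN"])
update_rule(rules, ["translation", "system"], ["NN", "NN"])
update_rule(rules, ["translation", "system"], ["NN", "VB"])
entry = rules[("translation", "system")]
```

The sketch keeps the first tag sequence seen for a phrase; retaining only the most frequently repeated sequence per phrase, as described for DB 121, would require tracking counts per competing sequence and is left out here.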

As described above, once the real-time context probability information DB 117, the real-time lexical probability information DB 119, and the real-time tagging error correction rule DB 121 have been built (stored) through the real-time learning of the real-time document information learning unit 101, statistical part-of-speech tagging of the input document S1 is performed using the information obtained by the real-time learning.

That is, the sentence separation unit 1011 in the real-time document information learning unit 101 splits the re-input input document S1 into sentences (S325) according to the sentence separation rules stored in the sentence separation rule DB 107, that is, using symbols such as punctuation marks and quotation marks together with newline and whitespace characters, and provides the sentences to the morpheme analysis unit 1013 (S327).

The morpheme analysis unit 1013 performs morphological analysis (S329) on the separated sentences input from the sentence separation unit 1011, using the morpheme analysis dictionary stored in the pre-built morpheme analysis dictionary DB 109, and provides the morphological analysis result to the real-time learning-based statistical part-of-speech tagging unit 103 (S331).

The real-time learning-based statistical part-of-speech tagging unit 103 performs statistical part-of-speech tagging in real time (S333) on the morphological analysis result input from the morpheme analysis unit 1013. It uses not only the pre-learned context probability information DB 111 and lexical probability information DB 113 but also the context probabilities and lexical probabilities learned from the input document S1 and stored in the real-time context probability information DB 117 and the real-time lexical probability information DB 119, and provides the statistical part-of-speech tagging result to the real-time learning-based tagging error correction unit 105 (S335).

In other words, the real-time learning-based statistical part-of-speech tagging unit 103 applies the morphological analysis result input from the morpheme analysis unit 1013 to <Equation 3> described above and selects the optimal part-of-speech sequence t₁, t₂, ..., t_n for the words w₁, w₂, ..., w_n of the input sentence.

Next, the context probability P(t_i|t_i-1,t_i-2,t_i-3) can be obtained through <Equation 4> described above, and the lexical probability P(w_i|t_i) can be obtained through <Equation 5> described above.

The real-time learning-based tagging error correction unit 105 first corrects errors in the statistical part-of-speech tagging result input from the real-time learning-based statistical part-of-speech tagging unit 103, that is, in the selected parts of speech, by referring to the pre-built (for example, built by experts) tagging error correction rule DB 115. It then corrects any part left uncorrected that matches a rule in the real-time tagging error correction rule DB 121. Specifically, among the lexical sequences learned from the input document S1 and stored in the real-time tagging error correction rule DB 121, only those whose positive probability (positive frequency / (positive frequency + negative probability)) is greater than or falls within the range of 0.71 to 0.79 (for example, 0.75) are used as real-time tagging error correction rules. The unit thus corrects the tagging to fit the sentence patterns that occur frequently in the input document S1 (S337) and outputs the part-of-speech tagging result S2 (S339).

Meanwhile, the document information learning-based statistical HMM part-of-speech tagging method of the present invention, for which various embodiments have been presented above, can be implemented as computer-executable code on a computer-readable recording medium. The computer-readable recording medium may include any kind of recording device in which data readable by a computer system is stored. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, optical data storage devices, and carrier waves (for example, transmission over the Internet). The computer-executable code or program may also be distributed across network-connected computer systems and executed there, so that the functions of the present invention are performed in a distributed manner.

As described above, the present invention uses information learned in real time from the input document, even for documents of genres or domains that do not appear in the pre-learned corpus, and can thereby raise tagging accuracy for documents of diverse genres and domains. Furthermore, in systems such as automatic translation and information retrieval that require linguistic analysis of documents, it can improve the accuracy of linguistic analysis and thus overall translation performance and accuracy.

The present invention has been described above with reference to some of its embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is set out in the claims rather than in the foregoing description, and all differences within an equivalent scope should be construed as being included in the present invention.

FIG. 1 is a block diagram of a document information learning-based statistical HMM part-of-speech tagging apparatus according to an embodiment of the present invention;

FIG. 2 is a detailed block diagram of the real-time document information learning unit shown in FIG. 1;

FIG. 3 is a flowchart sequentially illustrating a document information learning-based statistical HMM part-of-speech tagging method according to an embodiment of the present invention.

<Description of the reference numerals for the main parts of the drawings>

100 : statistical HMM part-of-speech tagging apparatus 101 : real-time document information learning unit

1011 : sentence separation unit 1013 : morpheme analysis unit

1015 : statistical part-of-speech tagging unit 1017 : tagging error correction unit

1019 : real-time probability information learning unit 1021 : tagging error correction rule extraction unit

103 : real-time learning-based statistical part-of-speech tagging unit

105 : real-time learning-based tagging error correction unit

107 : sentence separation rule DB

109 : morpheme analysis dictionary DB

111 : context probability information DB

113 : lexical probability information DB

115 : tagging error correction rule DB

117 : real-time context probability information DB

119 : real-time lexical probability information DB

121 : real-time tagging error correction rule DB

Claims

A document information learning-based statistical HMM (hidden Markov model) part-of-speech tagging apparatus, comprising:

a real-time learning-based statistical part-of-speech tagging unit that performs statistical part-of-speech tagging on morphologically analyzed sentences of an input document, using a pre-learned context probability information DB and lexical probability information DB together with context probabilities and lexical probabilities learned from the input document and stored in a real-time context probability information DB and a real-time lexical probability information DB, and provides the tagging result;

a real-time learning-based tagging error correction unit that first corrects errors in the tagging result through a pre-built tagging error correction rule DB, and then performs a second correction using real-time tagging error correction rules stored in a real-time tagging error correction rule DB; and

a real-time document information learning unit that, after part-of-speech tagging the input document using the context probability information DB and the lexical probability information DB, extracts the context probability information and lexical probability information appearing in the input document, builds the real-time context probability information DB and the real-time lexical probability information DB from the extracted information, and analyzes vocabulary and part-of-speech patterns in the tagged result of the input document to build the real-time tagging error correction rule DB,

wherein the real-time learning-based tagging error correction unit

corrects errors in the tagging result through the tagging error correction rule DB and, where an uncorrected part matches a rule in the real-time tagging error correction rule DB, performs correction using as the real-time tagging error correction rules only those lexical sequences stored in the real-time tagging error correction rule DB whose positive probability (positive frequency / (positive frequency + negative probability)) is greater than or falls within the range of 0.71 to 0.79.

The device of claim 1,

wherein the real-time document information learning unit comprises:

a sentence separator that divides the input document into sentences;

a morpheme analysis unit that morphologically analyzes each sentence using a morphological analysis dictionary database (DB);

a statistical part-of-speech tagging unit that performs statistical part-of-speech tagging on the morphologically analyzed sentence using the context probability and lexical probability statistics stored in the context probability information DB and the lexical probability information DB, and provides a part-of-speech tagging result;

a tagging error correcting unit that corrects errors in the part-of-speech tagging result using the tagging error correction rules stored in the tagging error correction rule DB, and provides a final tagging result for each sentence;

a real-time probability information learning unit that learns real-time probability information from the final tagging result and stores the resulting words and frequencies for obtaining the lexical probability, and the n-gram information for obtaining the context probability, in the real-time lexical probability information DB and the real-time context probability information DB, respectively; and

a tagging error correction rule extraction unit that stores, in the real-time tagging error correction rule DB, a repetitive lexical phrase extracted from tagging error correction rules together with the part-of-speech tag assigned to that lexical phrase.
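As an illustration of what the real-time probability information learning unit stores, the following Python sketch collects word/tag frequencies and tag n-gram frequencies from finally tagged sentences. Everything here is an assumption made for illustration: the function name `learn_realtime_counts`, the in-memory `Counter` objects standing in for the two real-time DBs, the `<s>` padding tag, and the choice of 4-grams (one tag plus a three-tag history).

```python
from collections import Counter

BOS = "<s>"  # padding tag for positions before the sentence start (assumption)


def learn_realtime_counts(tagged_sentences, order=4):
    """Collect (word, tag) frequencies and tag n-gram frequencies.

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs.
    Returns two Counters standing in for the real-time lexical and context DBs.
    """
    lexical = Counter()   # (word, tag) -> frequency
    context = Counter()   # tuple of `order` consecutive tags -> frequency
    for sentence in tagged_sentences:
        for word, tag in sentence:
            lexical[(word, tag)] += 1
        # Pad the tag sequence so the first real tag also has a full history.
        tags = [BOS] * (order - 1) + [tag for _, tag in sentence]
        for i in range(order - 1, len(tags)):
            context[tuple(tags[i - order + 1:i + 1])] += 1
    return lexical, context
```

Relative frequencies computed from these counts could then serve as the real-time lexical and context probability estimates.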

The device of claim 2,

wherein the real-time probability information learning unit

deletes the words and frequencies for obtaining the lexical probability and the n-gram information for obtaining the context probability, stored in the real-time context probability information DB and the real-time lexical probability information DB, after the part-of-speech tagging has been performed.

The device of claim 2,

wherein the tagging error correction rule extraction unit

deletes the repetitive lexical phrases stored in the real-time tagging error correction rule DB, and the part-of-speech tags assigned to those lexical phrases, after the part-of-speech tagging has been performed.
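The deletion recited in the two preceding claims amounts to giving the real-time stores a per-document lifetime. A minimal Python sketch of that lifecycle follows; the name `realtime_stores` and the use of in-memory containers in place of the real-time DBs are assumptions for illustration only.

```python
from collections import Counter
from contextlib import contextmanager


@contextmanager
def realtime_stores():
    """Yield fresh per-document stores and clear them once tagging is done."""
    stores = {"lexical": Counter(), "context": Counter(), "rules": {}}
    try:
        yield stores
    finally:
        # Discard everything learned from this document once tagging is finished.
        for store in stores.values():
            store.clear()
```

A caller would fill and use the stores inside a `with realtime_stores() as stores:` block; on leaving the block all three stores are empty again.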

The device of claim 2,

wherein, for the sentence morphologically analyzed by the morpheme analysis unit, the parts of speech $(t_1, t_2, \ldots, t_n)$ for the words in the input sentence are selected through

[Equation]

(where $P(t_i \mid t_{i-1}, t_{i-2}, t_{i-3})$ is the context probability, $P(w_i \mid t_i)$ is the lexical probability, and the context probability and the lexical probability are information obtained from the input document).
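The equation of this claim is not reproduced in the text. Assuming it is the usual HMM tagging objective, that is, choosing the tag sequence that maximizes the product over i of P(t_i | t_{i-1}, t_{i-2}, t_{i-3}) · P(w_i | t_i), the selection could be sketched in Python as a Viterbi search whose state is the last three tags. The names `viterbi`, `candidates`, `context_prob`, and `lexical_prob`, and the `<s>` padding tag, are hypothetical.

```python
import math

BOS = "<s>"  # padding tag for the sentence start (assumption)


def viterbi(words, candidates, context_prob, lexical_prob):
    """Return the tag list maximizing prod_i P(t_i | history_i) * P(w_i | t_i).

    candidates(word)           -> iterable of possible tags for the word
    context_prob(tag, history) -> probability; history = (t_{i-3}, t_{i-2}, t_{i-1})
    lexical_prob(word, tag)    -> probability
    Returns None if every tag sequence has probability zero.
    """
    # Each state is the last three tags; its value is (log score, best path).
    best = {(BOS, BOS, BOS): (0.0, [])}
    for word in words:
        following = {}
        for history, (score, path) in best.items():
            for tag in candidates(word):
                p = context_prob(tag, history) * lexical_prob(word, tag)
                if p <= 0.0:
                    continue  # impossible extension
                new_score = score + math.log(p)
                state = history[1:] + (tag,)
                if state not in following or new_score > following[state][0]:
                    following[state] = (new_score, path + [tag])
        best = following
    if not best:
        return None
    return max(best.values(), key=lambda item: item[0])[1]
```

Working in log space avoids numerical underflow on long sentences; keeping only the best path per three-tag state is what makes the search exact for a model with a three-tag history.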

The device of claim 5,

wherein the context probability $P(t_i \mid t_{i-1}, t_{i-2}, t_{i-3})$ is obtained through

[Equation]

(where $P''(t_i \mid t_{i-1}, t_{i-2}, t_{i-3})$ is the context probability; the probability value of $P'$ is obtained from information pre-learned from the part-of-speech tagged corpus and stored in the context probability information DB and the lexical probability information DB; the probability value of $P''$ is obtained using the probability information stored in the real-time context probability information DB and the real-time lexical probability information DB; and $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$, where $\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1$, are obtained by linear interpolation over a part-of-speech tagged learning corpus in document units).
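The equation of this claim is not reproduced in the text, so which component estimate is paired with which weight cannot be recovered here. The only structure the claim states is a combination of P' and P'' estimates with four weights summing to 1. A generic Python sketch of such a weighted combination follows; the function name `interpolate` and the example numbers are purely illustrative assumptions.

```python
import math


def interpolate(estimates, weights):
    """Weighted sum of probability estimates; the weights must sum to 1."""
    if len(estimates) != len(weights):
        raise ValueError("need exactly one weight per estimate")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * p for w, p in zip(weights, estimates))


# Four hypothetical component estimates of the same context probability,
# combined with hypothetical weights lambda_1..lambda_4.
combined = interpolate([0.30, 0.20, 0.50, 0.40], [0.4, 0.3, 0.2, 0.1])
```

Because the weights sum to 1 and each estimate lies in [0, 1], the combined value is again a valid probability as long as the weights are non-negative.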

The device of claim 6,

wherein the lexical probability $P(w_i \mid t_i)$ is obtained through

[Equation]

(where the probability value of $P'$ is obtained from information pre-learned from the part-of-speech tagged corpus and stored in the context probability information DB and the lexical probability information DB; the probability value of $P''$ is obtained using the probability information stored in the real-time context probability information DB and the real-time lexical probability information DB; and $\alpha$ and $\beta$ are experimentally determined values obtained from linguistic features pre-learned from a part-of-speech tagged learning corpus in document units).
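The equation of this claim is likewise not reproduced in the text. Assuming it has the form α·P'(w_i | t_i) + β·P''(w_i | t_i), and assuming both estimates are plain relative frequencies, the effect can be sketched in Python as follows. The function names, the count layout, and the values 0.7 and 0.3 are hypothetical.

```python
from collections import Counter


def relative_frequency(pair_counts, tag_counts, word, tag):
    """Estimate count(word, tag) / count(tag); 0.0 if the tag was never seen."""
    total = tag_counts[tag]
    return pair_counts[(word, tag)] / total if total else 0.0


def lexical_probability(word, tag, pre, rt, alpha, beta):
    """alpha * P'(word | tag) + beta * P''(word | tag)  (assumed combination).

    pre, rt: (pair_counts, tag_counts) Counter pairs for the pre-learned corpus
    and for the current input document, respectively.
    """
    p_pre = relative_frequency(*pre, word, tag)
    p_rt = relative_frequency(*rt, word, tag)
    return alpha * p_pre + beta * p_rt


pre = (Counter({("bank", "NN"): 2}), Counter({"NN": 10}))
rt = (Counter({("bank", "NN"): 3, ("blockchain", "NN"): 2}), Counter({"NN": 5}))
# "blockchain" never occurs in the pre-learned counts, yet it still receives
# probability mass from the counts learned on the input document.
p_new_word = lexical_probability("blockchain", "NN", pre, rt, alpha=0.7, beta=0.3)
```

The example shows the intended benefit: a word absent from the pre-learned corpus is no longer assigned zero lexical probability once it has been observed in the input document.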

delete

A document information learning-based statistical HMM part-of-speech tagging method comprising:

dividing the input document into sentences;

morphologically analyzing each sentence using a morphological analysis dictionary database (DB);

performing real-time statistical part-of-speech tagging using the context probabilities and lexical probabilities stored in the context probability information DB and the lexical probability information DB, together with the context probabilities and lexical probabilities learned from the input document and stored in the real-time context probability information DB and the real-time lexical probability information DB; and

first correcting errors in the result of the real-time statistical part-of-speech tagging through the previously built tagging error correction rule DB, and then outputting a tagging result corrected a second time using the real-time tagging error correction rules stored in the real-time tagging error correction rule DB,

wherein the outputting step

corrects errors in the tagging result through the tagging error correction rule DB and, where an uncorrected part matches a rule in the real-time tagging error correction rule DB, corrects that part with the real-time tagging error correction rule only if the positive probability (positive frequency / (positive frequency + negative probability)) in the lexical list stored in the real-time tagging error correction rule DB is within the range of 0.71 to 0.79 or greater, and outputs the corrected result as the tagging result.
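The order of the steps in this method can be summarized with a small Python sketch. The components are passed in as callables because their internals are outside the scope of this illustration; `split_sentences` is a deliberately naive stand-in, and all names (`tag_document`, `analyze`, `tag`, `correct_static`, `correct_realtime`) are hypothetical.

```python
import re


def split_sentences(document):
    """Naive splitter on '.', '!' or '?' followed by whitespace (stand-in only)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]


def tag_document(document, analyze, tag, correct_static, correct_realtime):
    """Run the steps in the recited order and return one result per sentence."""
    results = []
    for sentence in split_sentences(document):
        morphemes = analyze(sentence)          # morphological analysis
        tagged = tag(morphemes)                # real-time statistical tagging
        tagged = correct_static(tagged)        # first correction: pre-built rules
        tagged = correct_realtime(tagged)      # second correction: real-time rules
        results.append(tagged)
    return results
```

Injecting the components keeps the sketch independent of any particular analyzer or tagger and makes the fixed ordering of the two correction passes explicit.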

The method of claim 9,

further comprising, prior to the dividing step, a constructing step of:

after part-of-speech tagging the input document using the context probability information DB and the lexical probability information DB, extracting the context probability information and the lexical probability information appearing in the input document, constructing the real-time context probability information DB and the real-time lexical probability information DB from the extracted information, and analyzing the lexical and part-of-speech patterns in the tagged result of the input document to construct the real-time tagging error correction rule DB.

The method of claim 10,

wherein the constructing step comprises:

performing statistical part-of-speech tagging on the morphologically analyzed sentence using the context probability and lexical probability statistics stored in the context probability information DB and the lexical probability information DB;

providing a final tagging result for each sentence by correcting errors in the result of the statistical part-of-speech tagging using the tagging error correction rules stored in the tagging error correction rule DB;

a first storing step of learning real-time probability information from the final tagging result and storing the resulting words and frequencies for obtaining the lexical probability, and the n-gram information for obtaining the context probability, in the real-time lexical probability information DB and the real-time context probability information DB, respectively; and

a second storing step of storing, for the final tagging result, a repetitive lexical phrase extracted from tagging error correction rules, together with the part-of-speech tag assigned to that lexical phrase, in the real-time tagging error correction rule DB.
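One possible reading of the second storing step is sketched below in Python: word sequences that recur in the tagged document are recorded together with their most frequent tagging, and the occurrences with and without that tagging are kept as positive and negative counts. This interpretation, the fixed phrase length, the minimum count, and the name `extract_rules` are all assumptions, not details stated in the claims.

```python
from collections import Counter, defaultdict


def extract_rules(tagged_sentences, n=2, min_count=2):
    """Return {phrase: (tags, positive, negative)} for recurring word n-grams.

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs.
    positive: occurrences of the phrase with its most frequent tag sequence.
    negative: occurrences of the phrase with any other tag sequence.
    """
    seen = defaultdict(Counter)   # phrase -> Counter of observed tag sequences
    for sentence in tagged_sentences:
        for i in range(len(sentence) - n + 1):
            window = sentence[i:i + n]
            phrase = tuple(word for word, _ in window)
            tags = tuple(tag for _, tag in window)
            seen[phrase][tags] += 1
    rules = {}
    for phrase, tag_counts in seen.items():
        total = sum(tag_counts.values())
        if total < min_count:
            continue  # not repetitive enough to become a rule
        tags, positive = tag_counts.most_common(1)[0]
        rules[phrase] = (tags, positive, total - positive)
    return rules
```

The positive and negative counts stored with each rule are what a later reliability check of the form positive / (positive + negative) would consume.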

The method of claim 11,

wherein the first storing step

deletes the words and frequencies for obtaining the lexical probability and the n-gram information for obtaining the context probability, stored in the real-time context probability information DB and the real-time lexical probability information DB, after the part-of-speech tagging has been performed.

The method of claim 11,

wherein the second storing step

deletes the repetitive lexical phrases stored in the real-time tagging error correction rule DB, and the part-of-speech tags assigned to those lexical phrases, after the part-of-speech tagging has been performed.

delete