KR20000019194A

KR20000019194A - K-best part-of-speech tagging apparatus and method based on statistics

Info

Publication number: KR20000019194A
Application number: KR1019980037168A
Authority: KR
Inventors: 임희석
Original assignee: 윤종용; 삼성전자 주식회사
Priority date: 1998-09-09
Filing date: 1998-09-09
Publication date: 2000-04-06
Also published as: KR100284769B1

Abstract

PURPOSE: A statistics-based K-best part-of-speech tagging apparatus and method are provided, which perform part-of-speech tagging using both state-based and path-based part-of-speech tagging. CONSTITUTION: The K-best part-of-speech tagging apparatus comprises: a state-based part-of-speech tagger (100) that tags each word in a sentence according to an instantaneous optimality criterion; a path-based part-of-speech tagger (102) that tags each word in a sentence with the Viterbi algorithm according to a global optimality criterion; and a post-processing unit (104) that combines the tagging results of the state-based tagger (100) and the path-based tagger (102). This reduces the cost and effort required to construct part-of-speech tagging rules.

Description

Statistics-based K-Best POS tagging device and method

The present invention relates to part-of-speech tagging, and more particularly to a statistics-based K-best part-of-speech tagging apparatus and method.

In language processing, part-of-speech tagging, which resolves the lexical ambiguity of each word used in a sentence, is the process of taking the results of morphological analysis as input and determining the correct part of speech for each word. It is indispensable in language-processing fields such as syntactic parsing, document summarization, machine translation, information retrieval, and lexicography.

Thanks to considerable recent effort to improve tagging accuracy, part-of-speech taggers with accuracy of 97% to 99% or higher have been proposed for English, and taggers with 95% to 97% accuracy have been proposed for Korean. In a natural language processing system, the role of the part-of-speech tagging system is to remove the unnecessary parts of speech of each word, thereby reducing the load on syntactic parsing, which demands a great deal of processing time, and supporting correct sentence analysis. Correct sentence analysis therefore requires a tagging system with high sentence-level accuracy. The accuracy of most tagging systems, however, is evaluated at the word level, and such systems perform rather poorly when actually used for parsing. For example, suppose a tagging system with a high word-level accuracy of 99% is used to analyze 100 sentences of 20 words each. About 20 word errors can be expected in total after tagging (1% of 2,000 words). In the worst case, if those 20 errors are distributed one per sentence over 20 sentences, analysis of those 20 sentences fails, and a sentence-level accuracy of only 80% (80 sentences / 100 sentences × 100) is obtained.
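The worst-case arithmetic in this example can be checked with a short sketch. The function name and parameters are illustrative assumptions, not part of the patent; the sketch simply assumes that every word error falls in a different sentence.

```python
def worst_case_sentence_accuracy(num_sentences, words_per_sentence, word_accuracy):
    """Sentence-level accuracy when each word error hits a different sentence."""
    total_words = num_sentences * words_per_sentence
    # Expected number of wrongly tagged words at the given word-level accuracy.
    word_errors = round(total_words * (1 - word_accuracy))
    # At most one failed sentence per error, and no more than all sentences.
    failed_sentences = min(word_errors, num_sentences)
    return (num_sentences - failed_sentences) / num_sentences


# The example from the text: 100 sentences of 20 words, 99% word accuracy.
print(worst_case_sentence_accuracy(100, 20, 0.99))  # 0.8
```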

Natural language contains words whose lexical ambiguity is difficult to resolve using lexical-level information alone. Part-of-speech tagging systems nevertheless force a single part of speech onto each word using only lexical-level linguistic knowledge, such as statistics about neighboring words, word forms, or parts of speech, and this leads to tagging errors. For accurate sentence analysis, therefore, rather than assigning a single part of speech to a word whose ambiguity is hard to resolve and thereby causing an error, it is preferable to defer the decision to the parsing stage, where information that can resolve the ambiguity becomes available. In that case the average ambiguity per word after tagging increases, which may increase the parsing load; this increase should be minimized while ensuring that every word in the sentence retains its correct part of speech.

A tagger that assigns two or more candidate parts of speech when complete disambiguation is difficult is called a K-best part-of-speech tagger. Here, K denotes the average number of parts of speech assigned to a word after tagging. K-best tagging, which leaves an average ambiguity per word of 1 or more, can improve the accuracy of sentence analysis by letting the unresolved ambiguity be resolved at a stage, such as parsing, where the information needed for disambiguation is available. Moreover, when used as a preprocessor for building a part-of-speech tagged corpus, it can guarantee high accuracy with a small amount of manual work. Suppose a corpus of 10 million words is tagged by a tagger with 95% accuracy and the result is post-processed to build a tagged corpus. To correct the 500,000 erroneous words corresponding to the 5% error rate, all 10 million words must be inspected and corrected. If, however, K-best tagging achieves an accuracy of 99% or higher, a tagged corpus with 99% or higher accuracy can be built by post-processing only the words that have more than one part of speech.

K-best tagging requires a way either to remove, from the parts of speech a word can take, those unsuitable in the current context, or to select only the suitable ones. Rule-based tagging methods mainly remove unsuitable parts of speech using negative rules or linguistic constraints. Statistics-based tagging methods, by contrast, mainly compute the probabilities of part-of-speech sequences and perform K-best tagging by selecting only the parts of speech whose probability is at or above a given threshold. The average ambiguity per word and the accuracy therefore vary with the threshold used, and it is very important to determine a threshold that yields the minimum ambiguity per word at the maximum accuracy.
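The effect of the threshold on ambiguity can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the per-word probabilities, and in particular the choice to express the threshold as a ratio of the best tag's probability, which is only one of several ways a threshold could be defined.

```python
def k_best_tags(tag_probs, ratio):
    """Keep every tag whose probability is at least `ratio` times the best one."""
    best = max(tag_probs.values())
    return sorted(tag for tag, p in tag_probs.items() if p >= ratio * best)


def average_ambiguity(tagged_words):
    """Average number of tags kept per word (the K of K-best tagging)."""
    return sum(len(tags) for tags in tagged_words) / len(tagged_words)


# Hypothetical per-word tag probabilities for a three-word sentence.
sentence = [
    {"det": 0.98, "pron": 0.02},
    {"noun": 0.60, "verb": 0.35, "adj": 0.05},
    {"verb": 0.90, "noun": 0.10},
]

for ratio in (1.0, 0.5, 0.1):
    tagged = [k_best_tags(probs, ratio) for probs in sentence]
    print(ratio, tagged, average_ambiguity(tagged))
```

A ratio of 1.0 keeps only the single best tag per word; lowering it keeps more candidates, which raises the chance that the correct tag is retained at the cost of higher ambiguity.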

Representative prior work on K-best part-of-speech tagging includes that of Marcken, Weishedel, Brill, and Voutilainen. Marcken's method is a variant of DeRose's algorithm and works as follows. The sentence is read from left to right, and the optimal path to $w^{i}_{k}$, the k-th part of speech of the current word $w^{i}$, is determined. Each path is then extended to each part of speech of $w^{i+1}$ and a new probability is computed. If the maximum probability is P, the parts of speech of $w^{i}$ that lie on paths to $w^{i+1}$ whose probability is within a given threshold of P are chosen as the parts of speech of $w^{i}$. Depending on the threshold, each word is assigned one or more parts of speech; when the threshold equals P, each word is assigned exactly one part of speech. Marcken experimented on an 80,000-word corpus with an average ambiguity of 2.2 per word, and reported about 97.6% accuracy at an average ambiguity of 1.03 per word and 99.93% accuracy at 1.27.

Weishedel proposed a K-best tagger that uses the forward-backward algorithm to compute, given the observed input sentence, the probability of being at a particular part of speech $w^{i}_{k}$ of the current word $w^{i}$, and selects the parts of speech whose probability falls within a given threshold of the highest probability. Unlike Marcken's method, which uses only information about the word preceding the current word when computing the probabilities for part-of-speech selection, Weishedel's method takes the entire sentence into account. In an experiment on a 10,000-word corpus, Weishedel's K-best tagger achieved 99.3% accuracy at an average ambiguity of 3 per word; at equal accuracy, this is a somewhat higher ambiguity than Marcken's method produces.

Brill modified transformation rules of the form 'change a particular part of speech to another in a particular context' into rules of the form 'add another part of speech to a particular part of speech (or word) in a particular context', and proposed a K-best tagging method based on them. The method achieved 97.4% accuracy when assigning an average of 1.04 parts of speech per word with 100 transformation rules, and 99.1% accuracy when assigning an average of 1.50 parts of speech per word with 250 transformation rules. Brill's method has the advantage that tagging is performed by an initial tagger and the transformation rules applied for error correction can be extracted automatically from a corpus. However, concerns have been raised about whether the rules are as general as hand-built rules and about their portability to corpora other than the training corpus.

Voutilainen's ENGCG2 is a system that adds to and improves on the disambiguation rules of ENGCG. ENGCG2 reduces word ambiguity by removing the inappropriate parts of speech of each word using rules that describe the contexts in which a particular part of speech can be removed. Ambiguity remains for words to which no further rule applies, so the average ambiguity per word is 1 or more. With some 4,000 rules, ENGCG2 achieved 99.7% accuracy while assigning an average of 1.04 to 1.08 parts of speech per word, the best reported performance among the systems described above. Voutilainen further argued that a rule-based tagging method can overcome the limitations of statistics-based tagging, and that a rule-based approach can perform well in part-of-speech tagging just as in other stages of natural language processing. Despite its high accuracy, however, ENGCG2 suffers from all the problems of rule-based approaches: the knowledge acquisition bottleneck in obtaining disambiguation rules, rule conflicts that arise as the number of rules grows, and difficulty in extending and modifying the system.

The technical problem addressed by the present invention, which was conceived to overcome the above problems of ENGCG2, is to provide a statistics-based K-best part-of-speech tagging apparatus and method that tag using both state-based and path-based part-of-speech tagging, and that thereby achieve performance as high as ENGCG2 while being easier to implement, extend, and maintain than a rule-based tagger.

FIG. 1 is a block diagram of a statistics-based K-best part-of-speech tagging apparatus according to the present invention.

FIG. 2 is a flowchart illustrating a statistics-based K-best part-of-speech tagging method according to the present invention.

To achieve the above object, a statistics-based K-best part-of-speech tagging apparatus according to the present invention comprises: a state-based part-of-speech tagger that tags each word of a sentence morphologically analyzed from a raw corpus using a predetermined state-based tagging method; a path-based part-of-speech tagger that tags each word of the morphologically analyzed input sentence using a predetermined path-based tagging method; and a post-processing unit that merges the state-based and path-based tagging results, removes the unsuitable part of speech from, or assigns a suitable part of speech to, particular words that received two parts of speech in the merged result, and corrects erroneous tagging results even for words that received a single part of speech, thereby obtaining a part-of-speech tagged corpus.

To achieve the other object, a statistics-based K-best part-of-speech tagging method according to the present invention, which obtains a part-of-speech tagged corpus from a raw corpus using a statistics-based approach, comprises: a tagging step of tagging each word of an input sentence morphologically analyzed from the raw corpus with each of a state-based tagging method and a path-based tagging method, both of which belong to the statistics-based approach; a merging step of merging the two tagging results; and a post-processing step of removing the unsuitable part of speech from, or assigning a suitable part of speech to, particular words that received two parts of speech in the merged result, and correcting erroneous tagging results even for words that received a single part of speech, thereby obtaining a part-of-speech tagged corpus.

The statistics-based K-best part-of-speech tagging apparatus and method according to the present invention are described below with reference to the accompanying drawings.

The present invention proposes a statistics-based K-best part-of-speech tagging system that achieves high sentence-level accuracy while minimizing the average ambiguity per word. The proposed method performs tagging separately under the instantaneous optimality criterion and the global optimality criterion, the two criteria that HMM-based tagging models use to find the optimal part-of-speech sequence, and assigns both parts of speech to any word on which the two results conflict. In addition, heuristic rules are applied to the output of the statistical model to resolve the ambiguity remaining on words or to assign the correct part of speech.

First, to aid understanding of the invention, the state-based tagging method under the instantaneous optimality criterion and the path-based tagging method under the global optimality criterion, both of which the invention employs, are reviewed.

The present invention began as an effort to develop a K-best tagger that uses a statistics-based approach, overcomes the problems of ENGCG2 described above, and performs as well as ENGCG2. Compared with rule-based taggers, statistics-based taggers are relatively easy to implement, robust to diverse linguistic phenomena, and easy to extend and modify. Developing a statistics-based K-best tagger with performance comparable to ENGCG2 is therefore highly significant.

The work behind the invention started from an analysis of the errors and problems of previously proposed statistics-based K-best tagging. Existing statistics-based K-best tagging methods output the parts of speech or part-of-speech sequences whose probability, computed under either the instantaneous or the global optimality criterion, is at or above a given threshold. Tagging under each of these two criteria is described next.

The part-of-speech tagging model is defined as the problem of taking a sentence $w^{1,n}$ of n words as input and finding the corresponding part-of-speech sequence $t^{1,n}$. State-based tagging computes $\gamma_i(k)$, the probability of generating the sentence by passing through each part of speech that the current word $w^{i}$ can take, and assigns the k-th part of speech with the maximum probability as the tagging result. It is defined by Equation 1.

In Equation 1, $\alpha_i(k)$ is the probability of observing the words from the first word of the sentence through the i-th word with $w^{i}$ having its k-th part of speech; $\beta_i(k)$ is the probability of observing the words from position i+1 to the end of the sentence given that $w^{i}$ has its k-th part of speech; and $\Pr(O \mid \lambda)$ represents the language model. That is, the product $\alpha_i(k)\beta_i(k)$ in the numerator of Equation 1 is the probability of generating the sentence $w^{1,n}$ by passing through the k-th part of speech of $w^{i}$. Tagging under the instantaneous optimality criterion assigns each word its individually most likely part of speech, but because it does not take into account the parts of speech of neighboring words, the length of the whole sentence, and so on, disallowed transitions can occur. For example, in "the fly have wings", 'fly' can be a noun or a verb, and it may be tagged as a verb even though a verb does not follow 'the' used as a determiner.
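The image of Equation 1 is not reproduced in this text. A form consistent with the definitions given here is the standard HMM state posterior; placing $\Pr(O \mid \lambda)$ in the denominator and writing the selected tag index as $\hat{k}_i$ are assumptions based on that standard form, not a transcription of the original equation.

```latex
\gamma_i(k) = \frac{\alpha_i(k)\,\beta_i(k)}{\Pr(O \mid \lambda)},
\qquad
\hat{k}_i = \arg\max_{k} \gamma_i(k)
```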

Path-based tagging, which uses the surrounding context, is a way to resolve this error of state-based tagging. It is defined by Equation 2.

In Equation 2, $T(w^{1,n})$ denotes the tagging that is statistically most probable among the part-of-speech sequences $t^{1,n}$ that the sentence can take. Computing the probabilities in Equation 2 requires a large amount of statistical information, which is practically impossible to obtain. Equations 3 and 4, which approximate Equation 2 using only the lexical information of words or short-range context, are therefore widely used.

Equations 3 and 4 approximate Equation 2 under the Markov assumptions that the occurrence of the current word depends only on its own part of speech, and that the part of speech of the current word depends only on the part of speech of the previous word (or of the previous two words). The probabilities in Equations 3 and 4 are computed with the Viterbi algorithm, which is based on dynamic programming, so the probabilities of the part-of-speech sequences can be computed in linear time.
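The images of Equations 2 to 4 are likewise not reproduced in this text. Under the definitions and Markov assumptions stated above, they would plausibly take the usual bigram and trigram HMM forms below. This is a reconstruction under those assumptions, and the pairing of the bigram form with Equation 3 and the trigram form with Equation 4 is itself a guess.

```latex
% Equation 2 (assumed form): most probable tag sequence for the sentence
T(w^{1,n}) = \arg\max_{t^{1,n}} \Pr(t^{1,n} \mid w^{1,n})

% Equation 3 (assumed form): tag depends on the previous tag only
T(w^{1,n}) \approx \arg\max_{t^{1,n}} \prod_{i=1}^{n} \Pr(w^{i} \mid t^{i})\,\Pr(t^{i} \mid t^{i-1})

% Equation 4 (assumed form): tag depends on the previous two tags
T(w^{1,n}) \approx \arg\max_{t^{1,n}} \prod_{i=1}^{n} \Pr(w^{i} \mid t^{i})\,\Pr(t^{i} \mid t^{i-2}, t^{i-1})
```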

As described above, state-based tagging assigns each word its individually most likely part of speech but has the drawback that disallowed state transitions can occur, and path-based tagging can be used to overcome this drawback. Existing statistics-based K-best tagging methods, however, have used only a suitably modified version of either path-based tagging or state-based tagging alone.
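The two decoding criteria can be compared on a toy bigram HMM. The sketch below is not the patent's implementation: the tag set, the probability tables, and the function names are all hypothetical values chosen for illustration, and the model is a plain first-order HMM.

```python
def forward_backward(words, tags, start, trans, emit):
    """Return gamma[i][tag], the posterior probability of each tag at position i."""
    n = len(words)
    alpha = [{} for _ in range(n)]
    beta = [{} for _ in range(n)]
    for t in tags:
        alpha[0][t] = start[t] * emit[t].get(words[0], 0.0)
    for i in range(1, n):
        for t in tags:
            incoming = sum(alpha[i - 1][s] * trans[s][t] for s in tags)
            alpha[i][t] = incoming * emit[t].get(words[i], 0.0)
    for t in tags:
        beta[n - 1][t] = 1.0
    for i in range(n - 2, -1, -1):
        for s in tags:
            beta[i][s] = sum(
                trans[s][t] * emit[t].get(words[i + 1], 0.0) * beta[i + 1][t]
                for t in tags
            )
    sentence_prob = sum(alpha[n - 1][t] for t in tags)  # plays the role of Pr(O | lambda)
    return [
        {t: alpha[i][t] * beta[i][t] / sentence_prob for t in tags}
        for i in range(n)
    ]


def state_based_tags(words, tags, start, trans, emit):
    """Instantaneous criterion: pick the best tag at each position independently."""
    gamma = forward_backward(words, tags, start, trans, emit)
    return [max(g, key=g.get) for g in gamma]


def path_based_tags(words, tags, start, trans, emit):
    """Global criterion: Viterbi search for the single best tag sequence."""
    n = len(words)
    delta = [{t: start[t] * emit[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, n):
        delta.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda s: delta[i - 1][s] * trans[s][t])
            delta[i][t] = delta[i - 1][prev] * trans[prev][t] * emit[t].get(words[i], 0.0)
            back[i][t] = prev
    path = [max(tags, key=lambda t: delta[n - 1][t])]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Hypothetical toy model; none of these numbers come from the patent.
tags = ["det", "noun", "verb"]
start = {"det": 0.6, "noun": 0.3, "verb": 0.1}
trans = {
    "det": {"det": 0.0, "noun": 1.0, "verb": 0.0},
    "noun": {"det": 0.1, "noun": 0.3, "verb": 0.6},
    "verb": {"det": 0.4, "noun": 0.4, "verb": 0.2},
}
emit = {
    "det": {"the": 1.0},
    "noun": {"fly": 0.4, "wings": 0.5, "have": 0.1},
    "verb": {"fly": 0.5, "have": 0.5},
}
words = ["the", "fly", "have", "wings"]

print(state_based_tags(words, tags, start, trans, emit))
print(path_based_tags(words, tags, start, trans, emit))
```

In this toy model the determiner-to-verb transition has probability zero, so both decoders tag 'fly' as a noun. With smoothed, non-zero transition probabilities the two criteria can disagree on individual words, which is the situation the invention exploits.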

By their nature, state-based and path-based tagging produce different tagging results, and comparing them reveals something very interesting. Most of the differences are concentrated on the words that one of the methods tagged incorrectly, and in most such cases one of the two results is correct. When the two methods assign the same part of speech, the result satisfies both the instantaneous and the global optimality criteria and can therefore be trusted; when they disagree, one of the results is likely to be an error. In practice, the tags of words to which both methods assigned the same part of speech were mostly correct, and for words on which the results differed, the correct part of speech was among the two. This means that the state-based tagger and the path-based tagger are complementary, each able to correct the other's tagging errors.

The statistics-based K-best tagging apparatus and method proposed by the invention exploit this complementarity: the input sentence is tagged by both the state-based tagger and the path-based tagger, both parts of speech are assigned only to words on which the two taggers disagree, and post-processing is then performed.

FIG. 1 is a block diagram of a statistics-based K-best part-of-speech tagging apparatus according to the present invention, which comprises a state-based part-of-speech tagger (100), a path-based part-of-speech tagger (102), and a post-processing unit (104).

In FIG. 1, the state-based tagger (100) tags each word of a sentence morphologically analyzed from a raw corpus using the instantaneous optimality criterion. The path-based tagger (102) likewise tags each word of the morphologically analyzed sentence, using the Viterbi algorithm under the global optimality criterion. The post-processing unit (104) merges the two tagging results and post-processes the merged result to obtain a part-of-speech tagged corpus.

The role of the post-processing unit (104) is to use heuristic rules to remove the unsuitable part of speech from particular words that received two parts of speech from the state-based tagger (100) and the path-based tagger (102), or to add a suitable part of speech, thereby reducing the average ambiguity per word or improving accuracy. The heuristic rules used by the post-processing unit (104) preferably consist of rules built by hand from an analysis of the results of the two taggers, and syntagma rules extracted automatically from an already tagged corpus. A syntagma rule specifies the individual tag of each word within a syntagma, in the following form.

Syntagma rule form:

word1 word2 ... wordn => {word1/POS word2/POS ... wordn/POS}

For example, if a number of phrases in which two words together carry a single meaning, such as "because of", are defined as syntagma rules, 'because' and 'of' are both specified in advance as prepositions.
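A syntagma rule of this form can be applied by simple phrase matching. The sketch below is illustrative only: the rule table, the function name, and the example sentence are assumptions, and the patent does not specify how rules are stored or matched.

```python
# Hypothetical rule table: a word sequence mapped to the tag fixed for each word.
SYNTAGMA_RULES = {
    ("because", "of"): ("prep", "prep"),
}


def apply_syntagma_rules(words, candidate_tags, rules):
    """Overwrite the candidate tag sets of every phrase that matches a rule."""
    result = [set(tags) for tags in candidate_tags]
    for pattern, fixed_tags in rules.items():
        size = len(pattern)
        for i in range(len(words) - size + 1):
            window = tuple(w.lower() for w in words[i:i + size])
            if window == pattern:
                for offset, tag in enumerate(fixed_tags):
                    result[i + offset] = {tag}
    return result


words = ["He", "left", "because", "of", "rain"]
candidates = [{"pron"}, {"verbd", "adj"}, {"conj"}, {"prep"}, {"noun", "verbp"}]
print(apply_syntagma_rules(words, candidates, SYNTAGMA_RULES))
```

Words outside a matched phrase keep whatever candidate tags they already had, so the rule only narrows or corrects the tags it explicitly names.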

The operating principle of the invention is described in detail with reference to FIG. 2. The process by which the state-based and path-based taggers perform tagging, and the post-processing by heuristic rules, are illustrated using the input sentence "One island that was near Java was called Pralape in the old book."

First, a raw corpus is prepared (step 200). Before tagging, a morphological analyzer, which analyzes all possible parts of speech of each word (and which may be built into each of the state-based and path-based taggers), assigns the input sentence the possible parts of speech shown below (step 202).

| Input sentence | Morphological analysis result |
| One | noun, det (determiner), pron (pronoun) |
| islands | noun |
| that | conj (conjunction), det, pron |
| were | verbd (verb: past), behad (past beha) |
| near | verbp (verb: present), adj (adjective), adv (adverb), prep (preposition) |
| Java | noun |
| was | verbd, behad |
| called | verbd, verbn (verb: past perfect) |
| Pralape | noun |
| in | adv, prep |
| the | det |
| old | adj |
| book | noun, verbp |
| PERIOD | . |

In Table 1, the parts of speech in bold are the correct ones for the current sentence. According to the morphological analysis, the average ambiguity per word is 1.86 (26/14).
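The figure of 1.86 can be reproduced by counting the candidate tags per token. The dictionary below transcribes the morphological analysis result; treating PERIOD as carrying a single "." tag, and reading the first candidate of 'near' as verbp, are assumptions made so that the transcription matches the stated total of 26.

```python
# Candidate tags per token, transcribed from the morphological analysis table.
# The "." tag for PERIOD is an assumption that matches the stated total of 26.
candidates = {
    "One": ["noun", "det", "pron"],
    "islands": ["noun"],
    "that": ["conj", "det", "pron"],
    "were": ["verbd", "behad"],
    "near": ["verbp", "adj", "adv", "prep"],
    "Java": ["noun"],
    "was": ["verbd", "behad"],
    "called": ["verbd", "verbn"],
    "Pralape": ["noun"],
    "in": ["adv", "prep"],
    "the": ["det"],
    "old": ["adj"],
    "book": ["noun", "verbp"],
    "PERIOD": ["."],
}

total_tags = sum(len(tags) for tags in candidates.values())
average_ambiguity = total_tags / len(candidates)
print(total_tags, len(candidates), round(average_ambiguity, 2))  # 26 14 1.86
```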

Next, state-based and path-based tagging are each performed on the morphologically analyzed words (step 204). The results are shown in Table 2. An asterisk (*) marks an incorrectly assigned part of speech; the suffix 1 after a tag indicates that it was assigned by path-based tagging, and the suffix 2 indicates that it was assigned by state-based tagging.

Input word    Path-based tagging result    State-based tagging result
One           noun1                        noun2
islands       noun1                        noun2
that          *conj1                       pron2
were          verbd1                       verbd2
near          prep1                        *adj2
Java          noun1                        noun2
was           *verbd1                      *verbd2
called        verbn1                       verbn2
Pralape       noun1                        noun2
in            prep1                        prep2
the           det1                         det2
old           adj1                         adj2
book          noun1                        noun2
PERIOD        .1                           .2

As Table 2 shows, the path-based part-of-speech tagging result incorrectly tags 'that' and 'was', which should be tagged pron and behad, as conj and verbd, respectively. The state-based result correctly tags 'that' as pron, where the path-based result was wrong, but incorrectly assigns adj to 'near'. The error on 'was' made by path-based tagging is also present in the state-based result.

After step 204, the two part-of-speech tagging results are merged (step 206). A word to which path-based and state-based tagging assigned the same part of speech receives that part of speech as its tagging result; a word to which they assigned different parts of speech receives both. Next, the merged result is post-processed by applying predetermined heuristic rules (step 208). Table 3 below shows the result of merging the two tagging results and the final, post-processed K-best part-of-speech tagging result.
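The merging of step 206 can be sketched as follows. This is a minimal illustration rather than the implementation of the specification: the function name `merge_tags` and the list-of-lists representation are assumptions, and the two input sequences are the path-based and state-based results for the example sentence.

```python
def merge_tags(path_tags, state_tags):
    """Merge two single-tag sequences into a K-best sequence.

    Where both taggers agree, the word keeps that one tag; where they
    disagree, the word keeps both (path-based tag first).
    """
    if len(path_tags) != len(state_tags):
        raise ValueError("both taggers must tag the same number of words")
    merged = []
    for path_tag, state_tag in zip(path_tags, state_tags):
        if path_tag == state_tag:
            merged.append([path_tag])
        else:
            merged.append([path_tag, state_tag])
    return merged


# Tagger outputs for "One islands that were near Java was called
# Pralape in the old book PERIOD".
path_tags = ["noun", "noun", "conj", "verbd", "prep", "noun", "verbd",
             "verbn", "noun", "prep", "det", "adj", "noun", "."]
state_tags = ["noun", "noun", "pron", "verbd", "adj", "noun", "verbd",
              "verbn", "noun", "prep", "det", "adj", "noun", "."]

merged = merge_tags(path_tags, state_tags)
print(merged[2], merged[4], merged[6])
```

Only 'that' and 'near' end up with two parts of speech; 'was' keeps the single tag verbd because both taggers made the same error, which is why a post-processing step is still needed.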

Input word    Path-based + state-based tagging result    Final K-best tagging result
One           noun                                       noun
islands       noun                                       noun
that          conj1 pron2                                conj1 pron2
were          verbd                                      verbd
near          prep1 adj2                                 prep1 adj2
Java          noun                                       noun
was           *verbd                                     behad
called        verbn                                      verbn
Pralape       noun                                       noun
in            prep                                       prep
the           det                                        det
old           adj                                        adj
book          noun                                       noun
PERIOD        .                                          .

Post-processing removes an inappropriate part of speech from a word or assigns the correct one. In this example, a heuristic rule such as "Cword:verb* && Next word in 2 position:verbn => change Cword:beha*" is applied to 'was', which was incorrectly tagged verbd, and changes its part of speech to behad. The rule states that if the part of speech of the current word (Cword) is verbp (verbd, verbn) and the next word or the word after it is tagged verbn, the part of speech of the current word is changed to behap (behad, behan). Rules of this kind are constructed manually. In Table 3, the final result after the heuristic rule is applied assigns an average of 1.14 parts of speech per word and is 100% accurate.
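The heuristic rule quoted above can be sketched as follows. This is a hedged illustration under stated assumptions: the mapping verbd to behad and verbn to behan is inferred from the wildcards in the rule, the rule is applied only to words that carry a single tag, and the names `BE_TAG` and `apply_be_rule` are hypothetical.

```python
# Assumed reading of "verb* => beha*": each verb tag maps to the beha
# tag with the same suffix.
BE_TAG = {"verbd": "behad", "verbn": "behan"}


def apply_be_rule(tagged, window=2):
    """Retag a verb as a form of 'be' when a verbn follows within `window` words.

    `tagged` is a list of tag lists, one per word. Conditions are checked
    against the input, so one rewrite never triggers another.
    """
    result = [list(tags) for tags in tagged]
    for i, tags in enumerate(tagged):
        if len(tags) != 1 or tags[0] not in BE_TAG:
            continue
        following = tagged[i + 1:i + 1 + window]
        if any(next_tags == ["verbn"] for next_tags in following):
            result[i] = [BE_TAG[tags[0]]]
    return result


# Merged result for the example sentence (before post-processing).
merged_result = [["noun"], ["noun"], ["conj", "pron"], ["verbd"],
                 ["prep", "adj"], ["noun"], ["verbd"], ["verbn"], ["noun"],
                 ["prep"], ["det"], ["adj"], ["noun"], ["."]]

final_result = apply_be_rule(merged_result)
avg = sum(len(tags) for tags in final_result) / len(final_result)
print(final_result[6], f"{avg:.2f}")  # ['behad'] 1.14
```

In this sketch 'were' and 'called' are left unchanged because no verbn follows them within two words. A practical system would presumably also check that the word can actually be a form of 'be', for example by consulting its morphological analysis; that check is omitted here.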

The statistics-based K-best part-of-speech tagging apparatus and method according to the present invention are now compared with conventional part-of-speech tagging through an example.

The context transition probabilities used in path-based part-of-speech tagging are computed with either a bigram model, which considers only the part of speech of the word preceding the current word, or a trigram model, which considers the parts of speech of the two preceding words. In general, part-of-speech tagging with a trigram model is more accurate. However, because the methods differ somewhat in character, it cannot be known in advance whether combining the trigram result with the state-based tagging result performs better than combining the bigram result with it.
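The difference between the two models can be illustrated by estimating transition probabilities from tag sequences. The sketch below assumes plain relative-frequency estimation without smoothing and a hypothetical start symbol `<s>`; the specification does not state how its probabilities were estimated, and the toy corpus is invented for illustration.

```python
from collections import Counter


def transition_probs(tag_sequences, n):
    """Relative-frequency estimate of P(tag | previous n-1 tags).

    n=2 gives a bigram model, n=3 a trigram model. Keys of the returned
    dict are tuples (context..., tag).
    """
    ngram_counts = Counter()
    context_counts = Counter()
    for tags in tag_sequences:
        padded = ["<s>"] * (n - 1) + list(tags)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return {ngram: count / context_counts[ngram[:-1]]
            for ngram, count in ngram_counts.items()}


# Toy tagged corpus (illustrative only).
corpus = [
    ["det", "noun", "verbd"],
    ["det", "adj", "noun", "verbd"],
]

bigram = transition_probs(corpus, 2)
trigram = transition_probs(corpus, 3)
print(bigram[("det", "noun")])            # P(noun | det)
print(trigram[("adj", "noun", "verbd")])  # P(verbd | adj, noun)
```

The trigram table conditions on more context, so it has more distinct entries and needs more training data to estimate each one reliably.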

Accordingly, path-based part-of-speech taggers using a bigram model and a trigram model were each implemented, and an experiment was performed in which each of the two tagging results was combined with the result of the state-based part-of-speech tagger. The raw corpus used in the experiment consisted of 500 English sentences that had not been used to extract the probability information used for part-of-speech tagging. It contains 6,137 words in total, with an average ambiguity of 2.53 per word.

Table 4 below shows the experimental results for the state-based part-of-speech tagging method, the path-based part-of-speech tagging method, and the part-of-speech tagging method proposed by the present invention. 'State' denotes the state-based part-of-speech tagging result, and 'Path2' and 'Path3' denote the path-based part-of-speech tagging results obtained with the bigram model and the trigram model, respectively. 'Post-processing' means that the heuristic rules used in the present invention were applied.

Tagging method                   Word-level accuracy    Sentence-level accuracy    Ambiguity per word
State                            97.07%                 70.40%                     1
Path2                            96.43%                 65.80%                     1
Path3                            96.35%                 69.80%                     1
Path2 + State                    98.21%                 80.80%                     1.06
Path3 + State                    99.02%                 88.80%                     1.10
Path2 + State + Post-processing  98.57%                 84.20%                     1.06
Path3 + State + Post-processing  99.22%                 91.40%                     1.09
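The three figures reported for each method in the table above could be computed along the following lines. This sketch assumes the usual K-best convention that a word counts as correct when the correct part of speech is among those assigned, and that a sentence counts as correct only when all of its words are; the specification does not spell out these definitions, and the function name `evaluate` is illustrative.

```python
def evaluate(sentences):
    """Compute word accuracy, sentence accuracy and average ambiguity.

    `sentences` is a list of sentences, each a list of
    (assigned_tags, correct_tag) pairs.
    """
    words = correct_words = correct_sentences = tags = 0
    for sentence in sentences:
        hits = [correct in assigned for assigned, correct in sentence]
        words += len(sentence)
        correct_words += sum(hits)
        correct_sentences += all(hits)
        tags += sum(len(assigned) for assigned, _ in sentence)
    return {
        "word_accuracy": correct_words / words,
        "sentence_accuracy": correct_sentences / len(sentences),
        "avg_ambiguity": tags / words,
    }


# The worked example sentence, before and after post-processing.
correct = ["noun", "noun", "pron", "verbd", "prep", "noun", "behad",
           "verbn", "noun", "prep", "det", "adj", "noun", "."]
merged_only = [["noun"], ["noun"], ["conj", "pron"], ["verbd"],
               ["prep", "adj"], ["noun"], ["verbd"], ["verbn"], ["noun"],
               ["prep"], ["det"], ["adj"], ["noun"], ["."]]
post_processed = [list(t) for t in merged_only]
post_processed[6] = ["behad"]

before = evaluate([list(zip(merged_only, correct))])
after = evaluate([list(zip(post_processed, correct))])
print(before)
print(after)
```

On this single sentence, post-processing leaves the average ambiguity unchanged while correcting the one remaining word error, which also turns the sentence from incorrect to correct.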

In the experiment, the part-of-speech tagging methods that combine path-based and state-based part-of-speech tagging were more accurate, at both the word level and the sentence level, than either method used on its own.

The improvement in sentence-level accuracy was particularly large. Comparing 'Path2 + State' with 'Path3 + State', the 'Path3 + State' method was more accurate, but its average ambiguity per word was also slightly higher. The heuristic rules had little effect on lowering the average ambiguity per word but were confirmed to be effective in raising accuracy. Finally, the 'Path3 + State + Post-processing' method, which combines the trigram path-based tagging result with the state-based tagging result and then post-processes it, achieved the highest accuracy at both the word level and the sentence level. Again, the proposed method was especially effective in improving sentence-level accuracy.

The part-of-speech tagging methods of Marcken and Weischedel were implemented with the part-of-speech set used by the proposed method and evaluated on the same experimental corpus for comparison. The results are shown in Table 5 below.

                              Accuracy of each system
Average ambiguity per word    Marcken    Weischedel    Present invention
1.36                          99.41%     99.36%
1.31                          99.33%     99.15%
1.29                          99.07%     98.55%
1.12                          98.71%     97.51%
1.09                          98.45%     97.28%        99.22%

As Table 5 shows, when the same average ambiguity per word is retained after part-of-speech tagging, Weischedel's method was more accurate than Marcken's method, and the proposed method was the most accurate at an average ambiguity of 1.09 per word.

The proposed method currently has no facility for adjusting the average ambiguity per word and the accuracy. This limitation can be resolved simply by applying Marcken's method or Weischedel's method to the path-based or state-based part-of-speech tagger used in the proposed method. ENGCG2 is a rule-based part-of-speech tagging method; because implementing it requires a great deal of manual work and expense to construct the tagging rules, it was excluded from the above evaluation.

Nevertheless, the results in Table 5 confirm that the proposed method can outperform existing K-best part-of-speech tagging systems that use a statistics-based approach. They also show that a statistics-based approach combined with heuristic rules is sufficient to reach the high performance reported for ENGCG2. Compared with the rule-based method of ENGCG2, the proposed method has the following advantages. First, it is easy to implement now that large corpora have become available. Second, it does not require the cost and effort of constructing a large set of part-of-speech tagging rules. Third, the system is easy to modify and extend when the part-of-speech set or the application domain changes.

As described above, the statistics-based K-best part-of-speech tagging apparatus and method according to the present invention achieve performance similar to that of ENGCG2, the rule-based part-of-speech tagger that recently reported the best performance, while being easier to implement, extend, and maintain than a rule-based tagger. They reduce the cost and effort of constructing the large set of part-of-speech tagging rules needed to develop a tagger as accurate as ENGCG2, and they overcome the accuracy limit of statistics-based part-of-speech tagging by applying simple heuristic rules.

Claims

A statistics-based K-best part-of-speech tagging apparatus comprising:

a state-based part-of-speech tagger that performs part-of-speech tagging on each word of an input sentence morphologically analyzed from a raw corpus, using a predetermined state-based part-of-speech tagging method;

a path-based part-of-speech tagger that performs part-of-speech tagging on each word of the morphologically analyzed input sentence, using a predetermined path-based part-of-speech tagging method; and

a post-processing unit that merges the state-based part-of-speech tagging result with the path-based part-of-speech tagging result, removes an inappropriate part of speech from, or assigns an appropriate part of speech to, a specific word that has been assigned two parts of speech in the merged result, and corrects an erroneous tagging result even where only one part of speech has been assigned, thereby obtaining a part-of-speech-tagged corpus.

The statistics-based K-best part-of-speech tagging apparatus of claim 1, wherein the post-processing unit

uses predetermined heuristic rules.

The statistics-based K-best part-of-speech tagging apparatus of claim 2, wherein the heuristic rules

comprise at least one of a rule constructed manually by analyzing the results of the state-based part-of-speech tagger and the path-based part-of-speech tagger, and a new tagma rule extracted automatically by analyzing a part-of-speech-tagged corpus.

A statistics-based K-best part-of-speech tagging method for obtaining a part-of-speech-tagged corpus from a raw corpus using a statistics-based approach, the method comprising:

a part-of-speech tagging step of tagging an input sentence morphologically analyzed from the raw corpus according to each of a state-based part-of-speech tagging method and a path-based part-of-speech tagging method, both belonging to the statistics-based approach;

a merging step of merging the two part-of-speech tagging results; and

a post-processing step of removing an inappropriate part of speech from, or assigning an appropriate part of speech to, a specific word that has been assigned two parts of speech in the merged result, and correcting an erroneous tagging result even where only one part of speech has been assigned, thereby obtaining the part-of-speech-tagged corpus.

The statistics-based K-best part-of-speech tagging method of claim 4, wherein the post-processing step

uses predetermined heuristic rules.

The statistics-based K-best part-of-speech tagging method of claim 5, wherein the heuristic rules

comprise at least one of a rule constructed manually by analyzing the results of the state-based part-of-speech tagger and the path-based part-of-speech tagger, and a new tagma rule extracted automatically by analyzing a part-of-speech-tagged corpus.