KR101997783B1

KR101997783B1 - Syllable-based Korean POS Tagging using POS Distribution and Bidirectional LSTM CRFs and Method of the same

Info

Publication number: KR101997783B1
Application number: KR1020170104916A
Authority: KR
Inventors: 고영중; 김혜민; 양선
Original assignee: 동아대학교 산학협력단
Priority date: 2017-08-18
Filing date: 2017-08-18
Publication date: 2019-07-08
Also published as: KR20190019683A

Abstract

The present invention provides a syllable-based morphological analyzer and analysis method that merge morpheme segmentation within a word (eojeol) and part-of-speech (POS) determination for each morpheme into a single step, using bidirectional LSTM (Long Short-Term Memory) CRFs (bi-LSTM-CRFs), a deep learning model that has recently performed well in many sequence labeling tasks. The analyzer comprises: a syllable separation unit that splits an input sentence into syllables; a classification unit that performs syllable-level POS tagging with bi-LSTM-CRFs on vectors for the separated syllables, assigning to each syllable the POS tag of the morpheme that contains it; an error removal unit that removes errors from the syllable-level tagging result by applying, through a pre-analyzed dictionary, conversions that are unambiguous in the training corpus; and a base-form restoration unit that converts the error-corrected, syllable-tagged result into morpheme units through base-form restoration.

Description

{Syllable-based Korean POS Tagging using POS Distribution and Bidirectional LSTM CRFs and Method of the Same}

The present invention relates to a morphological analyzer, and more particularly to a syllable-based morphological analyzer and analysis method that merge morpheme segmentation within a word and POS determination for each morpheme, using bidirectional LSTM (Long Short-Term Memory) CRFs, a deep learning model that has recently performed well in many sequence labeling tasks.

Accurate analysis is important because inaccurate Korean morphological analysis can severely affect syntactic parsing, semantic role labeling, machine translation, and similar tasks. Morphological analysis is generally divided into two parts: morphological analysis proper and POS tagging. Morphological analysis generates candidate pairs of morphemes, the smallest meaningful units, and their POS tags. POS tagging then selects, from those candidates, the most appropriate morpheme and POS pair by considering the meaning and context of each word.

Analyzing Korean words at the morpheme level in the conventional way requires morpheme restoration together with morpheme segmentation and POS determination for each morpheme. Morphological ambiguity and POS ambiguity arise in each of these steps, so handling them is relatively complex. To address this, research on syllable-level POS tagging has recently been increasing. Compared with word-level POS tagging, syllable-level POS tagging reduces the data sparseness problem, can be combined with functions such as word spacing, ports more readily to other languages, and outperforms earlier work.

Syllable-level morphological analysis splits the input sentence into syllables and uses a machine-learning classifier such as a CRF to determine, for each syllable, a POS label that includes a B or I tag marking the start or the continuation of a morpheme. Because Korean is an agglutinative language in which diverse phonological phenomena occur, a pre-analyzed dictionary is used to make syllable-level analysis effective. A pre-analyzed dictionary stores words that have already been morphologically analyzed, selected by specific criteria, for use during POS tagging. A base-form restoration dictionary is additionally used to handle irregular predicates. The base-form restoration dictionary is used to attach compound tags to compound morphemes through simple rules.

As noted above, syllable-level morphological analysis requires a machine-learning classifier that can handle sequence labeling, and there are studies of syllable-level morphological analysis using structural SVMs and CRFs.

However, structural SVMs and CRFs must use a variety of features to determine the label for a single syllable. In particular, it is important to use as features the information about the syllables or words that precede and follow the current syllable. Accurate analysis is important because inaccurate output from such a morphological analyzer can severely affect syntactic parsing, semantic role labeling, machine translation, and similar tasks.

A syllable-level morphological analysis method using bidirectional LSTM (Long Short-Term Memory) CRFs (bi-LSTM-CRFs), which have recently performed well in many sequence labeling tasks, can be applied. In bi-LSTM-CRFs, the state-layer information for the current input influences later states during the forward pass, and later states influence earlier states during the backward pass. Because the model is trained this way, it can obtain good results with only a small number of features, unlike other machine-learning methods for sequence labeling.

Korean Patent Application No. 10-2010-0077308 (filed August 11, 2010)
Korean Patent Application No. 10-2015-0089121 (filed June 23, 2015)

Accordingly, the present invention has been devised to solve the problems described above. Its object is to provide a syllable-based morphological analyzer and analysis method that merge morpheme segmentation within a word and POS determination for each morpheme, using bidirectional LSTM (Long Short-Term Memory) CRFs (bi-LSTM-CRFs), a deep learning model that has recently performed well in many sequence labeling tasks.

The objects of the present invention are not limited to the one mentioned above, and other objects not mentioned here will be clearly understood by those skilled in the art from the description below.

To achieve the above object, a syllable-based morphological analyzer using POS distribution and bidirectional LSTM CRFs according to the present invention comprises: a syllable separation unit that splits an input sentence into syllables; a classification unit that performs syllable-level POS tagging with bi-LSTM-CRFs on vectors for the syllables separated by the syllable separation unit, assigning the POS tag of the morpheme that contains each syllable; an error removal unit that removes errors from the syllable-level tagging result of the classification unit by applying, through a pre-analyzed dictionary, conversions that are unambiguous in the training corpus; and a base-form restoration unit that converts the error-corrected, syllable-tagged result into morpheme units through base-form restoration.

Preferably, to generate the vector for each syllable, the classification unit uses word2vec, a word embedding algorithm, to learn 64-dimensional syllable-level embedding vectors and uses them as input vectors.

Preferably, the pre-analyzed dictionary of the error removal unit consists of a word (eojeol) dictionary and a noun dictionary.

Preferably, when an irregular conversion is present, the base-form restoration unit corrects and converts it using an irregular conversion dictionary.

To achieve the above object, a syllable-based morphological analysis method using POS distribution and bidirectional LSTM CRFs according to the present invention comprises the steps of: (A) splitting an input sentence into syllables through a syllable separation unit; (B) performing, through a classification unit, syllable-level POS tagging with bi-LSTM-CRFs on vectors for the separated syllables, and assigning the POS tag of the morpheme that contains each syllable; (C) removing errors, through an error removal unit, from the syllable-level tagging result by applying, through a pre-analyzed dictionary, conversions that are unambiguous in the training corpus; and (D) converting, through a base-form restoration unit, the error-corrected, syllable-tagged result into morpheme units through base-form restoration.

Preferably, step (A) uses syllable-unit features and word-unit features for CRF training.

Preferably, the word-unit features are obtained by extracting the unique words from the entire corpus, assigning an ID to each extracted word, and expressing the assigned ID as a feature.

Preferably, to generate the vector for each syllable, step (B) uses word2vec, a word embedding algorithm, to learn 64-dimensional syllable-level embedding vectors and uses them as input vectors.

Preferably, step (C) uses a word dictionary and a noun dictionary as the pre-analyzed dictionary.

Preferably, the noun dictionary is built from nouns that have no ambiguous analysis.

Preferably, the word dictionary comprises word dictionary 1, which does not consider context information, and word dictionary 2, which resolves ambiguity by considering context information.

Preferably, when an irregular conversion is required during base-form restoration, step (D) performs the conversion using an irregular conversion dictionary.

Preferably, step (D) comprises: selecting the highest-frequency result when the irregular conversion dictionary contains more than one entry for the same conversion; and, after the irregular conversion has been applied, merging adjacent morphemes that carry the same POS tag to complete morpheme-level POS tagging.

The syllable-based morphological analyzer and analysis method using POS distribution and bidirectional LSTM CRFs according to the present invention, as described above, have the following effects.

First, by using bi-LSTM-CRFs and feeding in a POS tag distribution that represents each syllable's POS well, the complex features of conventional morphological analyzers are eliminated.

Second, because morphological analysis is performed at the syllable level, morpheme segmentation and POS determination are handled in a single pass rather than as separate steps, which yields better performance than conventional morphological analyzers.

Third, by using bi-LSTM-CRFs, which are well suited to sequence labeling, and adding POS distribution features to the input vector, information about each syllable's POS is supplied to the model, so the analyzer achieves higher performance than conventional morphological analyzers while requiring relatively less human time and effort.

Fourth, training at the syllable level resolves morpheme segmentation and POS determination at once, and building and applying a pre-analyzed dictionary and a base-form restoration dictionary in advance addresses the problems caused by the diverse phonological phenomena of Korean.

FIG. 1 is a block diagram showing the configuration of a syllable-based morphological analyzer using POS distribution and bidirectional LSTM CRFs according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a syllable-based morphological analysis method using POS distribution and bidirectional LSTM CRFs according to an embodiment of the present invention.
FIG. 3 shows an embodiment of bi-LSTM-CRFs that take syllables as input in the present invention.
FIG. 4 compares syllable-level POS tagging results obtained with a conventional CRF and with the bi-LSTM-CRFs of the present invention.

Other objects, features, and advantages of the present invention will become apparent from the detailed description of the embodiments with reference to the accompanying drawings.

A preferred embodiment of the syllable-based morphological analyzer and analysis method using POS distribution and bidirectional LSTM CRFs according to the present invention is described below with reference to the accompanying drawings. The present invention is not limited to the embodiment disclosed below and may be implemented in various different forms; the embodiment is provided only so that the disclosure of the invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art. The embodiment described in this specification and the configurations shown in the drawings are therefore only one most preferred embodiment of the present invention and do not represent all of its technical ideas, and it should be understood that various equivalents and modifications capable of replacing them may exist as of the filing date of this application.

FIG. 1 is a block diagram showing the configuration of a syllable-based morphological analyzer using POS distribution and bidirectional LSTM CRFs according to an embodiment of the present invention.

As shown in FIG. 1, the analyzer consists of: a syllable separation unit (100) that splits an input sentence into syllables; a classification unit (200) that performs syllable-level POS tagging with bi-LSTM-CRFs on vectors for the syllables separated by the syllable separation unit (100), assigning the POS tag of the morpheme that contains each syllable; an error removal unit (300) that removes errors from the syllable-level tagging result of the classification unit (200) by applying, through a pre-analyzed dictionary, conversions that are unambiguous in the training corpus; and a base-form restoration unit (400) that converts the error-corrected, syllable-tagged result from the error removal unit (300) into morpheme units through base-form restoration.

To generate the vector for each syllable, the classification unit (200) uses word2vec, a representative word embedding algorithm, to learn 64-dimensional syllable-level embedding vectors and uses them as input vectors.

The pre-analyzed dictionary of the error removal unit (300) collects the cases in which POS tagging involves no ambiguity, and consists of a word dictionary and a noun dictionary.

When an irregular conversion is present, the base-form restoration unit (400) corrects and converts it using an irregular conversion dictionary.

The operation of the syllable-based morphological analyzer using POS distribution and bidirectional LSTM CRFs configured as above is described in detail below with reference to the accompanying drawings. Reference numerals identical to those in FIG. 1 or FIG. 2 denote the same members performing the same functions.

FIG. 2 is a flowchart illustrating a syllable-based morphological analysis method using POS distribution and bidirectional LSTM CRFs according to an embodiment of the present invention.

Referring to FIG. 2, the input sentence is first split into syllables through the syllable separation unit (100) (S10).

For syllable-based morpheme POS tagging, it is important to determine a POS tag for each separated syllable. For example, the word "세계적인" ("global") in a sentence is split into syllables, and each syllable is then assigned a POS tag as shown in Table 1 below.

B-NNG marks the first syllable of a morpheme whose POS is NNG, and I-NNG marks a continuing syllable of a morpheme whose POS is NNG. To attach POS tags at the syllable level, a machine-learning classifier such as a CRF must be trained. For this purpose, the features shown in Table 2 below are used for each syllable.
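The B/I labeling described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `split_syllables` and `syllable_tags` are hypothetical, and the sketch assumes the text is NFC-normalized so that each Hangul syllable is a single code point. It covers only regular morphemes and does not model the "DIC" suffix for irregular ones.

```python
import unicodedata


def split_syllables(text):
    """Split text into syllables, assuming one code point per syllable after NFC."""
    return list(unicodedata.normalize("NFC", text))


def syllable_tags(morphemes):
    """Turn [(surface, pos), ...] into [(syllable, 'B-POS' or 'I-POS'), ...]."""
    tagged = []
    for surface, pos in morphemes:
        for index, syllable in enumerate(split_syllables(surface)):
            # The first syllable of a morpheme gets B, every later one gets I.
            prefix = "B" if index == 0 else "I"
            tagged.append((syllable, f"{prefix}-{pos}"))
    return tagged


print(syllable_tags([("세계적", "NNG"), ("의상", "NNG")]))
```

Each morpheme restarts the B/I sequence, so two adjacent NNG morphemes remain distinguishable at the syllable level.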

If a morpheme is irregular, a "DIC" tag is appended to the syllable-level POS tag to mark the morpheme as one to which the irregular conversion dictionary must be applied. Syllables carrying the "DIC" tag are resolved in the base-form restoration step using the irregular conversion dictionary. For example, for "세계적인" the POS tagging is "세계적/NNG+이/VCP+ㄴ/ETM 의상/NNG" (the analysis shown includes the following word "의상"), and the syllable "인" is used in CRF training with the tag "B-VCPDIC".

In the present invention, the syllable-unit features listed in Table 2 and word-unit features are used for CRF training. To use word-unit features effectively, the unique words are extracted from the entire corpus, an ID is assigned to each word, and the assigned ID is expressed as a feature. In addition, the start of a sentence is represented by :S, and in word trigrams the word position before :S is also filled with :S. The end of a sentence is represented by :O. For example, at the start and end of a sentence, features are generated as shown in Table 3 below.
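The word-ID features and the :S / :O padding can be sketched as below. The feature templates of Tables 2 and 3 are not reproduced in this text, so the window used here (two preceding words, the current word, one following word) and the names `build_word_ids` and `word_context_features` are assumptions made only for illustration.

```python
def build_word_ids(corpus_sentences):
    """Assign an integer ID to every unique space-delimited word in the corpus."""
    word_ids = {}
    for sentence in corpus_sentences:
        for word in sentence.split():
            word_ids.setdefault(word, len(word_ids))
    return word_ids


def word_context_features(sentence, word_ids):
    """Return one feature dict per word, padding with :S at the start and :O at the end.

    Assumes every word of the sentence was seen when building word_ids.
    """
    ids = [str(word_ids[word]) for word in sentence.split()]
    # Two :S symbols so that a trigram ending at the first word is fully defined.
    padded = [":S", ":S"] + ids + [":O"]
    features = []
    for i in range(2, len(padded) - 1):
        features.append({
            "w-2": padded[i - 2],
            "w-1": padded[i - 1],
            "w0": padded[i],
            "w+1": padded[i + 1],
        })
    return features


ids = build_word_ids(["세계적인 의상", "의상 패션쇼에"])
print(word_context_features("세계적인 의상", ids))
```

Replacing each word with an ID keeps the feature space compact, since the classifier only sees a symbol per distinct word rather than its full string.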

A CRF must use many features to attach a POS tag to a single syllable. By contrast, bi-LSTM-CRFs can obtain good results using only the syllable vector as input.

Accordingly, the classification unit (200) performs syllable-level POS tagging with bi-LSTM-CRFs on vectors for the syllables separated by the syllable separation unit (100), and assigns the POS tag of the morpheme that contains each syllable (S20). To generate the vector for each syllable, the classification unit (200) uses word2vec, a representative word embedding algorithm, to learn 64-dimensional syllable-level embedding vectors and uses them as input vectors.

FIG. 3 shows an embodiment of bi-LSTM-CRFs that take syllables as input in the present invention.

As shown in FIG. 3, bi-LSTM-CRFs learn at the syllable level. When the word "패션쇼에" ("to the fashion show") arrives, the syllable "패" is input first in the forward pass, followed by the syllable "션". When "션" is input, the state of the previous syllable influences the state of the current syllable, so the current state in effect represents "패션". The forward pass continues until every syllable has been input. In the backward pass, the order is reversed: the syllable "에" is input first, followed by the syllable "쇼". After the forward and backward passes over the word, the cost between the combined result of the two passes and the correct answer is computed, and the model is trained with the back-propagation algorithm. Recent work has improved performance by reflecting the transition probabilities between tags; for this purpose the forward algorithm is used, as in a CRF, and the Viterbi search algorithm is used to find the optimal tag sequence.
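The Viterbi search mentioned above can be sketched in plain Python. This is a generic textbook decoder, not code from the patent: `emissions` stands in for the per-syllable tag scores a bi-LSTM layer would produce, `transitions` for the learned tag-to-tag scores, and all numbers below are made up for illustration. Scores are treated as log-space values and added.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions: list of {tag: score}, one dict per syllable.
    transitions: {(previous_tag, current_tag): score}.
    """
    tags = list(emissions[0])
    best = {tag: emissions[0][tag] for tag in tags}
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in tags:
            # Best previous tag for reaching `cur` at this position.
            prev = max(tags, key=lambda p: best[p] + transitions[(p, cur)])
            new_best[cur] = best[prev] + transitions[(prev, cur)] + scores[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


emissions = [
    {"B-NNG": 0.0, "I-NNG": -5.0},
    {"B-NNG": 0.5, "I-NNG": 0.0},
    {"B-NNG": 0.0, "I-NNG": 0.0},
]
transitions = {
    ("B-NNG", "B-NNG"): -1.0, ("B-NNG", "I-NNG"): 0.0,
    ("I-NNG", "B-NNG"): -1.0, ("I-NNG", "I-NNG"): 0.0,
}
print(viterbi(emissions, transitions))  # ['B-NNG', 'I-NNG', 'I-NNG']
```

In this toy input the second syllable's emission scores alone favor B-NNG, but the transition scores make the sequence B, I, I the best overall, which is the effect of modeling tag transitions.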

Because bi-LSTM-CRFs are trained with information from both the preceding and the following syllables reflected in each state, they are effective while using only the syllable itself as input, unlike a CRF.

To further improve syllable-based morpheme POS tagging with bi-LSTM-CRFs, the input vector of each syllable is extended. Specifically, the POS distribution of the morphemes that contain the syllable in the training corpus is expressed as a vector and appended to the input vector. A syllable can carry different POS tags depending on the morpheme that contains it.

For example, in the training corpus the syllable "하" may be part of "하늘" ("sky"), which carries a noun tag, or part of "하얗게" ("whitely"), which carries an adjective tag. The POS distribution of the morphemes in which a syllable appears in the training corpus is expressed as a vector and used as input to the bi-LSTM-CRFs. The vector representing a syllable's POS distribution has 134 dimensions in total: 131 dimensions obtained by combining the 46 POS tags with the B, I, and DIC tags, plus 3 tags representing the start of a sentence, the end of a sentence, and a space. The value of each dimension is derived from the number of times the syllable appears with that POS tag in the corpus. After all frequencies for a syllable have been counted over the corpus, they are converted to probability values through softmax to determine the vector values.
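A minimal sketch of this counting-then-softmax step follows. The function name `pos_distribution` and the three-tag inventory are illustrative assumptions (the text describes a 134-tag inventory), and the source does not say whether the counts are scaled before softmax, so raw counts are used here.

```python
import math
from collections import Counter, defaultdict


def pos_distribution(tagged_syllables, tag_inventory):
    """Map each syllable to a softmax-normalized vector of its tag frequencies."""
    counts = defaultdict(Counter)
    for syllable, tag in tagged_syllables:
        counts[syllable][tag] += 1
    vectors = {}
    for syllable, tag_counts in counts.items():
        freqs = [tag_counts[tag] for tag in tag_inventory]
        # Subtract the maximum before exponentiating for numerical stability.
        peak = max(freqs)
        exps = [math.exp(f - peak) for f in freqs]
        total = sum(exps)
        vectors[syllable] = [e / total for e in exps]
    return vectors


# Toy inventory and toy corpus, for illustration only.
inventory = ["B-NNG", "B-VA", "I-NNG"]
corpus = [("하", "B-NNG"), ("늘", "I-NNG"), ("하", "B-VA")]
print(pos_distribution(corpus, inventory)["하"])
```

Note that softmax over raw counts gives unseen tags a small nonzero value and becomes sharply peaked for frequent syllables; whether that matches the patent's exact computation is not stated in the text.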

Table 4 shows an example of the POS distribution vector for the syllable "랑".

Referring to Table 4, the syllable "랑" appears 435 times in total in the training corpus, with POS tags such as B-NNP, I-NNP, I-NNG, and B-JKB. The resulting vector is concatenated with the syllable embedding vector to form a final 198-dimensional (64 + 134) vector, which is used as the input to the bi-LSTM-CRFs.
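The concatenation into a 198-dimensional input can be sketched as follows. The dimensions come from the text; the function name `build_input_vector` and the use of plain Python lists are assumptions for illustration.

```python
EMBEDDING_DIM = 64   # syllable embedding size stated in the text
POS_DIST_DIM = 134   # POS distribution vector size stated in the text


def build_input_vector(embedding, pos_dist):
    """Concatenate a syllable embedding with its POS distribution vector."""
    if len(embedding) != EMBEDDING_DIM or len(pos_dist) != POS_DIST_DIM:
        raise ValueError("unexpected vector size")
    return list(embedding) + list(pos_dist)


vector = build_input_vector([0.0] * EMBEDDING_DIM, [0.0] * POS_DIST_DIM)
print(len(vector))  # 198
```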

Next, the error removal unit (300) removes errors from the syllable-level tagging result of the classification unit (200) by applying, through the pre-analyzed dictionary, conversions that are unambiguous in the training corpus (S30).

The pre-analyzed dictionary collects the cases in which POS tagging involves no ambiguity. It is needed to convert the output to the POS tags found in the training corpus regardless of the syllable-level tagging result of the CRF. The pre-analyzed dictionary consists of a word dictionary and a noun dictionary. For example, the noun "엔터테이너" ("entertainer") is never tagged with any POS other than "NNG" in the corpus, so it is included in the noun dictionary. Table 5 shows an example of applying the noun dictionary.

The noun dictionary was built from nouns that have no ambiguous analysis. However, if nouns are collected without any length restriction, short nouns cause collisions in which the same character string is tagged with different parts of speech, as with "은/NNG" and "은/JX". To avoid this, the dictionary was built from nouns longer than a certain length and compound nouns, 169,004 entries in total, drawn from the entire data.
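Applying such a noun dictionary to a syllable-tagged word could look like the sketch below. The longest-match scan, the function name `apply_noun_dictionary`, and the deliberately wrong input tags are assumptions for illustration; the patent text does not specify the matching procedure. The sketch assumes one character per syllable.

```python
def apply_noun_dictionary(syllables, tags, noun_dict):
    """Overwrite tags for every span that matches an unambiguous dictionary noun.

    noun_dict maps a noun string to its single POS tag, e.g. {"엔터테이너": "NNG"}.
    """
    tags = list(tags)
    text = "".join(syllables)
    i = 0
    while i < len(text):
        end = None
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in noun_dict:
                end = j
                break
        if end is None:
            i += 1
            continue
        pos = noun_dict[text[i:end]]
        tags[i] = f"B-{pos}"
        for k in range(i + 1, end):
            tags[k] = f"I-{pos}"
        i = end
    return tags


syllables = list("엔터테이너가")
predicted = ["B-NNG", "I-NNG", "B-NNP", "I-NNP", "I-NNP", "B-JKS"]  # made-up classifier output
print(apply_noun_dictionary(syllables, predicted, {"엔터테이너": "NNG"}))
```

Tags outside any dictionary match, such as the particle in this example, are left exactly as the classifier produced them.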

The word dictionary was built in two parts: word dictionary 1, which does not consider context information, and word dictionary 2, which resolves ambiguity by considering context information. Word dictionary 2 addresses words that are contextually ambiguous when viewed in isolation: each entry stores the word together with the last POS tag of the preceding word, which resolves the ambiguity. Word dictionary 1 contains 1,552,635 entries and word dictionary 2 contains 37,233 entries. Table 6 below shows an example of applying the word dictionaries.
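A lookup over the two word dictionaries might be sketched as follows. The key layout, the order in which the dictionaries are consulted, the function name `lookup_word`, and the sample entries are all assumptions for illustration; the text only states that word dictionary 2 stores the last POS tag of the preceding word alongside the word.

```python
def lookup_word(word, prev_last_tag, word_dict1, word_dict2):
    """Return a stored analysis for `word`, or None if neither dictionary has one.

    word_dict2 is keyed by (last POS tag of the preceding word, word).
    word_dict1 is keyed by the word alone.
    """
    analysis = word_dict2.get((prev_last_tag, word))
    if analysis is None:
        analysis = word_dict1.get(word)
    return analysis


# Illustrative entries only, not taken from the patent's dictionaries.
word_dict1 = {"엔터테이너가": "엔터테이너/NNG+가/JKS"}
word_dict2 = {("JKO", "나는"): "날/VV+는/ETM"}

print(lookup_word("나는", "JKO", word_dict1, word_dict2))
print(lookup_word("엔터테이너가", ":S", word_dict1, word_dict2))
```

When the lookup returns None, the classifier's own syllable-level tags would be kept for that word.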

Next, the base-form restoration unit (400) converts the error-corrected, syllable-tagged result from the error removal unit (300) into morpheme units through base-form restoration (S40). During base-form restoration, when the irregular conversion described earlier is required, the conversion is performed using the irregular conversion dictionary. Table 7 shows an example of applying the irregular conversion dictionary.

When the irregular conversion dictionary contains more than one entry for the same conversion, the highest-frequency result is selected. After the irregular conversion has been applied, adjacent morphemes that carry the same POS tag are merged to complete morpheme-level POS tagging. For example, "생산/NNG+노동자/NNG+들/XSN" is rewritten as "생산노동자/NNG+들/XSN" because the two adjacent morphemes share the POS tag NNG.
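Both rules can be sketched briefly. The function names `pick_conversion` and `merge_same_pos` are hypothetical, and the candidate analyses with their frequencies are placeholders, since the contents of the irregular conversion dictionary are not given in the text.

```python
def pick_conversion(candidates):
    """Choose the most frequent analysis from {analysis: corpus frequency}."""
    return max(candidates, key=candidates.get)


def merge_same_pos(morphemes):
    """Merge adjacent (surface, pos) pairs that share the same POS tag."""
    merged = []
    for surface, pos in morphemes:
        if merged and merged[-1][1] == pos:
            merged[-1] = (merged[-1][0] + surface, pos)
        else:
            merged.append((surface, pos))
    return merged


print(pick_conversion({"analysis_a": 12, "analysis_b": 3}))
print(merge_same_pos([("생산", "NNG"), ("노동자", "NNG"), ("들", "XSN")]))
```

The second call reproduces the example in the text, yielding one NNG morpheme "생산노동자" followed by "들/XSN".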

Example

In the present invention, the Sejong corpus was used to evaluate syllable-based morpheme POS tagging. For the final comparison between the CRF-based method and the proposed bi-LSTM-CRFs method, 500,000 words were selected at random; 400,000 words were used for training and 100,000 words for testing. Because training every model on 400,000 words is time-consuming, the best-performing variant of the proposed method was identified on a smaller set: 50,000 words were selected at random, 40,000 were used for training, and 10,000 were used for testing [Proceedings of the 28th Hangul and Korean Information Processing Conference (2016)]. The accuracy formulas in Equations 1 and 2 below were used for evaluation.

FIG. 4 compares the syllable-level POS tagging results of the conventional CRF with those of the bi-LSTM-CRFs of the present invention.

As shown in FIG. 4, the bi-LSTM-CRFs-based syllable POS tagging method of the present invention, which uses the POS distribution vector of each syllable, achieved a syllable-level POS tagging performance of 92.93%, an improvement of 7.65% over the CRF-based method. After the pre-analysis dictionary and the irregular conversion dictionary were applied and the original forms were restored, it achieved a performance of 97.09%, an improvement of 3.01% over CRF.

Although the technical idea of the present invention has been described in detail with reference to preferred embodiments, it should be noted that these embodiments are given for illustration and not for limitation. Those of ordinary skill in the technical field of the present invention will understand that various embodiments are possible within the scope of its technical idea. Accordingly, the true scope of technical protection of the present invention should be determined by the technical idea of the appended claims.

Claims

A syllable-based morphological analyzer using POS distribution and bidirectional LSTM CRFs, comprising:
a syllable separation unit that separates an input sentence into syllables;
a classification unit that performs syllable-level POS tagging using bi-LSTM-CRFs on vectors for the syllables separated by the syllable separation unit, and assigns to each syllable the POS tag of the morpheme that contains it;
an error removal unit that removes errors from the syllable-level POS tagging result of the classification unit by applying, through a pre-analysis dictionary, conversions that are unambiguous in the training corpus; and
an original-form restoration unit that converts the error-removed, syllable-level POS-tagged result into morpheme units through original-form restoration.

The analyzer according to claim 1,
wherein the classification unit generates the vectors for the syllables by learning 64-dimensional syllable-level embedding vectors with the word embedding algorithm word2vec, and uses the embedding vectors as input vectors.

The analyzer according to claim 1,
wherein the pre-analysis dictionary of the error removal unit uses an eojeol (word-phrase) dictionary and a noun dictionary.

The analyzer according to claim 1,
wherein the original-form restoration unit, when irregular conversion is required during original-form restoration, performs the conversion using an irregular conversion dictionary.

A syllable-based morphological analysis method using POS distribution and bidirectional LSTM CRFs, comprising:
(A) separating, in a syllable separation unit, an input sentence into syllables;
(B) performing, in a classification unit, syllable-level POS tagging using bi-LSTM-CRFs on vectors for the syllables separated by the syllable separation unit, and assigning to each syllable the POS tag of the morpheme that contains it;
(C) removing errors, in an error removal unit, from the syllable-level POS tagging result by applying, through a pre-analysis dictionary, conversions that are unambiguous in the training corpus; and
(D) converting, in an original-form restoration unit, the error-removed, syllable-level POS-tagged result into morpheme units through original-form restoration.

The method according to claim 5,
wherein step (A), performed in the syllable separation unit, uses syllable-level features and eojeol-level features for CRF training.

The method according to claim 6,
wherein the eojeol-level features are obtained by:
extracting the unique eojeols from the entire corpus;
assigning an ID to each extracted eojeol; and
representing the assigned ID as a feature.

The method according to claim 5,
wherein step (B), performed in the classification unit, generates the vectors for the syllables by learning 64-dimensional syllable-level embedding vectors with the word embedding algorithm word2vec, and uses the embedding vectors as input vectors.

The method according to claim 5,
wherein the pre-analysis dictionary used in step (C) uses an eojeol (word-phrase) dictionary and a noun dictionary.

The method according to claim 9,
wherein the noun dictionary is built from nouns that have no analysis ambiguity.

The method according to claim 9,
wherein the eojeol dictionary comprises an eojeol dictionary 1, which does not take context information into account, and an eojeol dictionary 2, which resolves ambiguity by taking context information into account.

The method according to claim 5,
wherein step (D), performed in the original-form restoration unit, carries out the conversion using an irregular conversion dictionary when irregular conversion is required during original-form restoration.

The method according to claim 5,
wherein step (D), performed in the original-form restoration unit, comprises:
selecting the result with the highest frequency when the irregular conversion dictionary contains more than one result for the same conversion; and
after the irregular conversion is complete, finally merging the morphemes that have the same POS tag to complete morpheme POS tagging.