KR20200044176A

KR20200044176A - System and Method for Korean POS Tagging Using the Concatenation of Jamo and Syllable Embedding

Info

Publication number: KR20200044176A
Application number: KR1020180119102A
Authority: KR
Inventors: 고영중; 김혜민
Original assignee: 동아대학교 산학협력단
Priority date: 2018-10-05
Filing date: 2018-10-05
Publication date: 2020-04-29
Also published as: KR102109858B1

Abstract

The present invention relates to an apparatus and method for Korean morphological analysis using a combination of jamo and syllable embeddings. The input is constructed in syllable and jamo units and transformed jamo embeddings are used, so that the correct part of speech can be determined even in sentences containing frequently occurring typos. The apparatus includes: a jamo-unit embedding part that performs initial, medial, and final jamo embedding; a syllable-unit embedding part that performs syllable embedding; an input part that concatenates the three initial/medial/final jamo embeddings, further concatenates a syllable embedding to represent one syllable as a vector, and provides it as the input of a Bi-LSTM-CRF; a learning part that performs the forward/backward steps of the Bi-LSTM-CRF and then trains using the backpropagation algorithm; and an output part that uses the Viterbi search algorithm to find the optimal tag sequence and outputs part-of-speech tags with symbols attached to indicate the start, middle, and end of a part of speech.

Description

System and Method for Korean POS Tagging Using the Concatenation of Jamo and Syllable Embedding

The present invention relates to Korean morphological analysis, and more specifically to an apparatus and method for Korean morphological analysis using a combination of jamo and syllable embeddings, in which the input is constructed in syllable and jamo units and transformed jamo embeddings are used so that the correct part of speech can be determined even in sentences containing frequently occurring typos.

A morpheme is the smallest unit within a language that carries meaning.

A morphological analyzer is a device or program that separates the morphemes contained in an eojeol (word) or sentence and analyzes each separated morpheme. It is used in a wide range of fields such as speech recognition, sentiment analysis, natural language processing, data mining, and keyword extraction.

Morphological analysis is the most basic and essential step of natural language processing, and inaccurate part-of-speech tagging can critically degrade the performance of many language processing tasks, including named entity recognition and syntactic parsing.

For this reason, much research has traditionally been devoted to accurate morphological analysis, and recently high performance in morpheme segmentation and part-of-speech tagging has been reported using deep learning models.

However, most existing morphological analysis research has been conducted on corpora composed of fairly well-refined sentences (typically the Sejong corpus).

As big data has grown in importance, large volumes of unrefined documents such as web documents are being used as important language resources. These of course include newspaper articles and documents that have gone through a refinement process, but most are written without any separate refinement.

Consequently, studies have recently been conducted that perform language analysis experiments on informal documents containing grammatical errors such as typos.

As described above, morphological analysis is the first step of natural language processing, and inaccurate part-of-speech tagging can critically affect downstream tasks such as named entity recognition and syntactic parsing.

However, because most prior-art morphological analysis research was trained on newspaper articles and the Sejong corpus, which consist of refined sentences, its results on sentences containing typos are poor.

In addition, with the recent surge in SNS use, large volumes of big-data documents are being used as important language resources, but because this data is unrefined and typos are frequent, morphological analysis of it yields unsuitable part-of-speech tagging results.

Therefore, there is a need for new technology that enables robust morphological analysis of the typos that are frequently confused in everyday use.

Republic of Korea Patent Publication No. 10-2017-0000201
Republic of Korea Patent Publication No. 10-2000-0018924

The present invention is intended to solve the problems of prior-art morphological analysis. Its object is to provide an apparatus and method for Korean morphological analysis using a combination of jamo and syllable embeddings, in which the input is constructed in syllable and jamo units and transformed jamo embeddings are used so that the correct part of speech can be determined even in sentences containing frequently occurring typos.

Another object of the present invention is to provide an apparatus and method for Korean morphological analysis that use a Bi-LSTM-CRF (Bidirectional Long Short-Term Memory CRF) model and represent each input syllable through a combination of jamo and syllable embeddings, so that morphological analysis is performed effectively even on sentences containing typos.

Another object of the present invention is to provide an apparatus and method for Korean morphological analysis that enable improved analysis by surveying the initial, medial, and final jamo that are confused with one another and using merged jamo embedding vectors, for accurate part-of-speech tagging of typos that arise from frequent confusion or input mistakes.

Another object of the present invention is to provide an apparatus and method for Korean morphological analysis that use the combination of jamo and syllable embeddings together with embedding transformation, so that the analysis performs well on both typo-free documents and documents containing typos.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve the above objects, an apparatus for Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention comprises: a jamo-unit embedding part that performs jamo-unit initial, medial, and final embedding; a syllable-unit embedding part that performs syllable-unit embedding; an input part that concatenates the three initial/medial/final jamo embeddings, further concatenates the syllable embedding to represent one syllable as a vector, and provides it as the input of a Bi-LSTM-CRF; a learning part that performs the forward/backward steps of the Bi-LSTM-CRF and then trains using the backpropagation algorithm; and an output part that uses the Viterbi search algorithm to find the optimal tag sequence and outputs part-of-speech tags with symbols attached to indicate the start, middle, and end of a part of speech.

Here, the input part sets the embedding dimension to 64, attaches position markers to initial and final consonants so that the same consonant is distinguished in initial and final position, and, for a syllable with no final consonant, places a separate 'no final' marker in the final position for training.

The input part concatenates the three initial/medial/final jamo embeddings and further concatenates the syllable embedding to represent one syllable as a vector of 256 dimensions in total, and provides it as the input of the Bi-LSTM-CRF.

In combining the three initial/medial/final jamo embeddings and further combining the syllable embedding, one of jamo sum, jamo-syllable sum, jamo concat, or jamo-syllable concat is selectively applied, where 'sum' denotes a vector sum and 'concat' denotes vector concatenation.

In combining the three initial/medial/final jamo embeddings, jamo that are frequently miswritten because of actual grammatical confusion or keyboard input errors are analyzed and converted to the same vector, so that the corresponding typos are handled effectively.

In combining the three initial/medial/final jamo embeddings, the typo types for which one jamo and another are converted to the same vector and merged include: ㄱ/ㄲ, ㅂ/ㅃ, and ㅅ/ㅆ for initials; ㅐ/ㅔ and ㅙ/ㅚ/ㅞ for medials; and ㄱ/ㄲ/ㄳ, ㄴ/ㄶ/ㄵ, ㄹ/ㄺ/ㄻ/ㄼ/ㄽ/ㄾ/ㄿ/ㅀ, ㅂ/ㅄ, and ㅅ/ㅆ for finals.

The performance of the Korean morphological analysis results using the combination of jamo and syllable embeddings is evaluated using eojeol-level (word-unit) accuracy, which is characterized by being defined as:

[equation not reproduced in this text]

To achieve another object, a method for Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention comprises: a jamo-unit and syllable-unit embedding step of performing jamo-unit initial, medial, and final embedding and performing syllable-unit embedding; an input step of concatenating the three initial/medial/final jamo embeddings, further concatenating the syllable embedding to represent one syllable as a vector, and providing it as the input of a Bi-LSTM-CRF; a learning step of performing the forward/backward steps of the Bi-LSTM-CRF and then training using the backpropagation algorithm; and an output step of using the Viterbi search algorithm to find the optimal tag sequence and outputting part-of-speech tags with symbols attached to indicate the start, middle, and end of a part of speech.

Here, the input step sets the embedding dimension to 64, attaches position markers to initial and final consonants so that the same consonant is distinguished in initial and final position, and, for a syllable with no final consonant, places a separate 'no final' marker in the final position for training.

The input step concatenates the three initial/medial/final jamo embeddings and further concatenates the syllable embedding to represent one syllable as a vector of 256 dimensions in total, and provides it as the input of the Bi-LSTM-CRF.

It is characterized by being defined as: [equation not reproduced in this text]

The apparatus and method for Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention, as described above, have the following effects.

First, the input is constructed in syllable and jamo units and transformed jamo embeddings are used, so that the correct part of speech can be determined even in sentences containing frequently occurring typos.

Second, a Bi-LSTM-CRF (Bidirectional Long Short-Term Memory CRF) model is used and each input syllable is represented through a combination of jamo and syllable embeddings, so that morphological analysis is performed effectively even on sentences containing typos.

Third, for accurate part-of-speech tagging of typos that arise from frequent confusion or input mistakes, the initial, medial, and final jamo that are confused with one another are surveyed and merged jamo embedding vectors are used, which enables improved morphological analysis.

Fourth, the combination of jamo and syllable embeddings together with embedding transformation enables morphological analysis that performs well on both typo-free documents and documents containing typos.

FIGS. 1A and 1B are an overall configuration diagram and a detailed configuration diagram of an apparatus for Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention.
FIG. 2 is a detailed configuration diagram of the apparatus for Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention.
FIG. 3 is a diagram showing examples of frequently occurring typos.
FIG. 4 is a diagram showing frequently occurring typo types.
FIG. 5 is a flowchart showing a method for Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention.
FIGS. 6A and 6B are graphs showing the performance of Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention.

Hereinafter, preferred embodiments of the apparatus and method for Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention are described in detail.

The features and advantages of the apparatus and method will become apparent from the detailed description of each embodiment below.

FIGS. 1A and 1B are an overall configuration diagram and a detailed configuration diagram of an apparatus for Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention.

The present invention relates to a system for analyzing a natural language such as Korean, which is the most basic and essential step of natural language processing.

Korean morphological analysis is a technique that, in order to analyze morphemes (the smallest units that carry meaning), separates eojeols composed of combined morphemes into morpheme units and determines the appropriate part of speech for each morpheme. This is referred to as POS tagging.

The present invention includes a configuration that uses deep learning to perform morpheme segmentation and part-of-speech determination together as a single step.

In addition, so that sentences containing typos are also analyzed accurately, the input is constructed in syllable and jamo units and transformed jamo embeddings are used, implementing a morphological analysis technique that determines the correct part of speech even in sentences containing frequently occurring typos.

The apparatus and method according to the present invention analyze typos that arise from frequent confusion, merge the embeddings of the corresponding jamo into a single embedding vector, and use the concatenation of jamo and syllable embedding vectors as input. This gives better performance on frequently confused typos than a method that uses only jamo embeddings.

One example of a deep learning model applied to the apparatus and method is the Bi-LSTM-CRF, but the invention is not limited thereto.

In the following description, jamo-unit initial, medial, and final embedding and syllable-unit embedding according to the present invention have the following meanings.

The input to deep learning must consist of vectors. In the present invention, first, for robustness against typos, syllables are decomposed into jamo units and added to the input; second, to handle typos that arise from frequent confusion, jamo that are confused with one another are merged to produce the jamo-unit embeddings used as input.

For example, 'ㅔ' and 'ㅐ' can both be merged into 'ㅐ'.

By using jamo-unit embeddings in this way, the present invention makes it possible to develop a morphological analyzer that is more robust to typos.

To represent the syllables and jamo that are the input of the Bi-LSTM-CRF, embeddings are built using the word2vec algorithm.

The embeddings are trained on a large news corpus to produce jamo and syllable vector representations. The jamo embedding and the syllable embedding each yield 64-dimensional vectors containing information about jamo and syllables, respectively.

The 64-dimensional vectors of the jamo-unit initial, medial, and final embeddings and the 64-dimensional vector of the syllable-unit embedding are then concatenated, and the resulting 256-dimensional vector is used as the input vector.

As shown in FIGS. 1A and 1B, each syllable is used as input, and the three jamo-unit initial/medial/final embeddings are used to represent the input syllable.

The forward/backward steps of the Bi-LSTM-CRF are then performed, followed by training with the backpropagation algorithm.

The Bi-LSTM-CRF used here is a deep learning method that performs well in many sequence labeling domains. In the forward step, the state-layer information for the current input influences later states; in the backward step, later states influence earlier states.

For example, when the eojeol '학생의' ("student's") is input, in the forward step the syllable '학' is input first, followed by the syllable '생'.

The resulting state therefore effectively carries the meaning of '학생' ("student").

The backward step works the same way in the opposite direction: the syllable '의' is input first, followed by '생'. After the forward and backward steps, the cost between the combined result of the two steps and the correct answer is computed, and training is performed using the backpropagation algorithm.

Finally, the Viterbi search algorithm is used to find the optimal tag sequence, and the final output is a part-of-speech tag with a B/I/E symbol attached to indicate the start, middle, or end of a part of speech.
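The B/I/E output format can be illustrated with a short sketch. It assumes that a one-syllable morpheme receives only a B tag and that the last syllable of a longer morpheme receives E; the source does not spell out these edge cases. The function name `bie_tags` and the POS labels in the example are illustrative, not taken from the patent.

```python
def bie_tags(morphemes):
    """Convert (surface, pos) morphemes into per-syllable B/I/E tags.

    Assumption (not stated in the source): a one-syllable morpheme gets
    only 'B'; longer morphemes get B, then I for inner syllables, then E.
    """
    tags = []
    for surface, pos in morphemes:
        last = len(surface) - 1
        for i, syllable in enumerate(surface):
            if i == 0:
                prefix = "B"
            elif i == last:
                prefix = "E"
            else:
                prefix = "I"
            tags.append((syllable, f"{prefix}-{pos}"))
    return tags


# '학생의' split into '학생' and '의'; the POS labels are illustrative only.
print(bie_tags([("학생", "NNG"), ("의", "JKG")]))
```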

The forward algorithm of the CRF is used to compute the transition probabilities between tags. To find the optimal tag sequence, the Viterbi search algorithm is used, which backtracks from the state with the highest accumulated probability to extract the optimal state sequence.
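The backtracking search described above can be sketched in a few lines of plain Python. This is a generic Viterbi decoder over log-domain scores, not the patent's implementation; the argument names `emissions` and `transitions` are assumptions introduced for illustration.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j at position t (e.g. from the Bi-LSTM)
    transitions[i][j] : score of moving from tag i to tag j (CRF layer)
    Scores are treated as log-domain values, so they are added.
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])   # best accumulated score ending in each tag
    backptr = []                 # backptr[t][j] = best previous tag for tag j
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    # Backtrack from the tag with the highest accumulated score.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

With zero transition scores the decoder simply follows the best emission at each position; a strong penalty on switching tags makes it prefer a constant sequence.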

The embedding dimension is set to 64, and position markers are attached to initial and final consonants so that the same consonant is distinguished in initial and final position.

For a syllable with no final consonant, a separate 'no final' marker is placed in the final position for training.
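A minimal sketch of this decomposition, using the standard Unicode arithmetic for precomposed Hangul syllables (19 initials, 21 medials, 28 finals including "none", starting at U+AC00). The marker strings `I:`, `M:`, `F:` and `<NO_FINAL>` are hypothetical names chosen for illustration; the source only says that position markers and a 'no final' marker are used.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initials
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 medials
JONGSEONG = ["<NO_FINAL>"] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals


def decompose(syllable):
    """Split one precomposed Hangul syllable into position-marked jamo tokens."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    cho, rest = divmod(index, 21 * 28)
    jung, jong = divmod(rest, 28)
    # The prefix keeps initial 'ㄱ' and final 'ㄱ' as distinct vocabulary items.
    final = JONGSEONG[jong] if jong == 0 else f"F:{JONGSEONG[jong]}"
    return (f"I:{CHOSEONG[cho]}", f"M:{JUNGSEONG[jung]}", final)


print(decompose("학"))  # syllable with a final consonant
print(decompose("의"))  # syllable without a final consonant
```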

The three initial/medial/final jamo embeddings are then concatenated, and the syllable embedding is further concatenated, so that one syllable is represented as a vector of 256 dimensions in total and used as the input of the Bi-LSTM-CRF.

In the description below, 'sum' in an expression such as 'the sum of the three initial/medial/final embeddings' means a vector sum and is hereafter written 'sum'. For example, the sum of the three initial/medial/final embeddings is written 'jamo sum'.

Likewise, 'combination' in an expression such as 'the combination of the three initial/medial/final embeddings' means vector concatenation and is hereafter written 'concat'.

For example, this is written 'jamo concat'.

Among the various configurations applied to the apparatus and method according to the present invention, such as jamo sum, jamo-syllable sum, jamo concat, and jamo-syllable concat, using jamo-syllable concat as the input feature of the Bi-LSTM-CRF gives robust results on both typo-free documents and documents containing typos.
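The four input variants can be sketched as follows. Only the jamo-syllable concat case (4 x 64 = 256 dimensions) is confirmed by the text; treating "jamo-syllable sum" as the element-wise sum of all four vectors is an assumption, and the function and mode names are illustrative.

```python
def vec_sum(*vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(values) for values in zip(*vectors)]


def vec_concat(*vectors):
    """Concatenate vectors end to end."""
    return [x for vector in vectors for x in vector]


def combine(initial, medial, final, syllable, mode):
    """Build the per-syllable input vector for one of the four variants."""
    if mode == "jamo_sum":
        return vec_sum(initial, medial, final)
    if mode == "jamo_syllable_sum":       # assumed: sum of all four vectors
        return vec_sum(initial, medial, final, syllable)
    if mode == "jamo_concat":
        return vec_concat(initial, medial, final)
    if mode == "jamo_syllable_concat":    # the 256-dimensional input in the text
        return vec_concat(initial, medial, final, syllable)
    raise ValueError(f"unknown mode: {mode}")


DIM = 64
vectors = [[float(k)] * DIM for k in (1, 2, 3, 4)]  # placeholder embeddings
for mode in ("jamo_sum", "jamo_syllable_sum", "jamo_concat", "jamo_syllable_concat"):
    print(mode, len(combine(*vectors, mode)))
```

Concatenation keeps each jamo and the syllable in separate coordinates, whereas summation mixes them into one 64-dimensional vector.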

In addition, to include eojeol-unit information within a sentence, a delimiter <SP> is inserted into the input at each space, and the final output for a space is set to the B-S tag, as shown in FIGS. 1A and 1B.
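A minimal sketch of how a sentence might be turned into the syllable sequence with `<SP>` delimiters. The function name `to_input_tokens` is hypothetical, and splitting on whitespace is an assumption about preprocessing that the source does not detail.

```python
def to_input_tokens(sentence):
    """Split a sentence into syllables, inserting <SP> between eojeols."""
    tokens = []
    for i, eojeol in enumerate(sentence.split()):
        if i > 0:
            tokens.append("<SP>")   # the model is trained to tag this as B-S
        tokens.extend(eojeol)       # one token per syllable
    return tokens


print(to_input_tokens("학생의 책"))
```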

The detailed configuration of the apparatus for Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention is as follows.

FIG. 2 is a detailed configuration diagram of the apparatus for Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention.

The apparatus according to the present invention includes: a jamo-unit embedding part (10) that performs the three jamo-unit initial/medial/final embeddings; a syllable-unit embedding part (20) that performs syllable-unit embedding; an input part (30) that sets the embedding dimension to 64, attaches position markers to initial and final consonants so that the same consonant is distinguished in initial and final position, places a separate 'no final' marker in the final position of a syllable with no final consonant, concatenates the three initial/medial/final jamo embeddings, further concatenates the syllable embedding to represent one syllable as a vector of 256 dimensions in total, and provides it as the input of the Bi-LSTM-CRF; a learning part (40) that performs the forward/backward steps of the Bi-LSTM-CRF and then trains using the backpropagation algorithm; and an output part (50) that uses the Viterbi search algorithm to find the optimal tag sequence and outputs part-of-speech tags with B/I/E symbols attached to indicate the start, middle, and end of a part of speech.

The results of analyzing typos that actually occur frequently, using the apparatus configured as described above, are as follows.

FIG. 3 is a diagram showing examples of frequently occurring typos.

To determine whether morphological analysis performs well even on sentences containing typos, random jamo typos are artificially generated and the analysis is then performed.
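The source does not say how the random jamo typos are generated. One plausible procedure, sketched here purely as an assumption, is to pick one Hangul syllable at random and replace one of its three jamo slots with a different jamo. The function name `inject_typo` and the choice of one typo per sentence are illustrative.

```python
import random


def inject_typo(sentence, rng):
    """Replace one jamo of one randomly chosen Hangul syllable (assumed scheme)."""
    positions = [i for i, ch in enumerate(sentence) if 0xAC00 <= ord(ch) <= 0xD7A3]
    if not positions:
        return sentence
    pos = rng.choice(positions)
    index = ord(sentence[pos]) - 0xAC00
    cho, rest = divmod(index, 21 * 28)
    jung, jong = divmod(rest, 28)
    slot = rng.choice(["initial", "medial", "final"])
    if slot == "initial":
        cho = rng.choice([c for c in range(19) if c != cho])
    elif slot == "medial":
        jung = rng.choice([j for j in range(21) if j != jung])
    else:
        jong = rng.choice([j for j in range(28) if j != jong])
    # Recompose the syllable from the (possibly changed) jamo indices.
    typo = chr(0xAC00 + (cho * 21 + jung) * 28 + jong)
    return sentence[:pos] + typo + sentence[pos + 1:]


print(inject_typo("학생의 책", random.Random(0)))
```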

Because typos occur in a wide variety of situations, analysis performed on randomly generated typos in this way is very important.

In the present invention, cases in which typos occur especially often in everyday use are also tallied, and analysis tailored to those typo types is performed.

Various materials from the question-and-answer site of the National Institute of Korean Language were analyzed, and for each of the 11 cases tallied as causing the most confusion in practice, the analysis is performed by converting the jamo embeddings to be identical.

Here, in combining the three initial/medial/final jamo embeddings, the typo types for which one jamo and another are converted to the same vector and merged preferably include: ㄱ/ㄲ, ㅂ/ㅃ, and ㅅ/ㅆ for initials; ㅐ/ㅔ and ㅙ/ㅚ/ㅞ for medials; and ㄱ/ㄲ/ㄳ, ㄴ/ㄶ/ㄵ, ㄹ/ㄺ/ㄻ/ㄼ/ㄽ/ㄾ/ㄿ/ㅀ, ㅂ/ㅄ, and ㅅ/ㅆ for finals.

FIG. 3 shows several cases in which vectors are merged; the words in parentheses are examples of frequently confused words.

As in the examples of FIG. 3, the present invention analyzes jamo that are frequently miswritten because of actual grammatical confusion or keyboard input errors and converts them to the same vector (for example, using a single merged vector for the embeddings of ㅐ and ㅔ), so that the system can handle the corresponding typos effectively.
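One simple way to realize the merge is to map every jamo in a confusion group to a single representative before the embedding lookup, so that all members share one vector. The groups below are the ones listed in the text; choosing the first member of each group as the representative is an assumption (it matches the ㅔ to ㅐ example given earlier), and the names are illustrative.

```python
MERGE_GROUPS = {
    "initial": ["ㄱㄲ", "ㅂㅃ", "ㅅㅆ"],
    "medial": ["ㅐㅔ", "ㅙㅚㅞ"],
    "final": ["ㄱㄲㄳ", "ㄴㄶㄵ", "ㄹㄺㄻㄼㄽㄾㄿㅀ", "ㅂㅄ", "ㅅㅆ"],
}


def build_canonical_map(groups):
    """Map (position, jamo) to the representative jamo of its confusion group."""
    return {
        (position, jamo): members[0]      # assumed: first member represents the group
        for position, group_list in groups.items()
        for members in group_list
        for jamo in members
    }


CANONICAL = build_canonical_map(MERGE_GROUPS)


def canonical_jamo(position, jamo):
    """Return the jamo whose embedding should be looked up for this input jamo."""
    return CANONICAL.get((position, jamo), jamo)


print(canonical_jamo("medial", "ㅔ"))
print(canonical_jamo("final", "ㄺ"))
print(canonical_jamo("initial", "ㄴ"))  # not in any group, unchanged
```

Because the mapping is keyed by position, ㄲ is merged with ㄱ in both initial and final position, while ㄳ is merged only as a final.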

Through this embedding conversion, the present invention enables morphological analysis that achieves at least a certain level of performance even on data in which typos appear very frequently (for example, SNS data).

The results of morphological analysis using the apparatus for Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention are as follows.

The corpus used for the morphological analysis is the Sejong corpus; training data of 40,000 randomly selected eojeols (space-delimited words) and test data of 10,000 eojeols are used.

Finely segmented ending information is not used; each ending is combined with its stem to form a single predicate. A total of 43 POS tags were used, and because B/I/E tags are attached, the number of output tags is 130 in total, including B-S, which denotes a space.
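The figure of 130 output tags follows from 43 POS tags times three position symbols, plus the single space tag: 43 × 3 + 1 = 130. A small sketch of that count is given below; the `B-`/`I-`/`E-` prefix format and the placeholder tag names `POS01` to `POS43` are assumptions for illustration, and only `B-S` is taken from the text.

```python
def build_output_tags(pos_tags):
    """Attach B/I/E symbols to every POS tag and add the space tag B-S."""
    tags = [f"{prefix}-{pos}" for pos in pos_tags for prefix in ("B", "I", "E")]
    tags.append("B-S")  # tag for a space, as named in the description
    return tags


# 43 placeholder names standing in for the actual POS tag set.
pos_tags = [f"POS{i:02d}" for i in range(1, 44)]
output_tags = build_output_tags(pos_tags)
print(len(output_tags))  # 130
```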

The embeddings were built with Word2Vec on 11.5 GB of Naver news text; the hidden layer size of the Bi-LSTM-CRF was set to 100, the learning rate to 0.01, and the maximum number of epochs to 150.

Performance is evaluated using eojeol-level (word-level) accuracy.
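The accuracy formula itself does not appear in this text. A common definition, assumed here rather than taken from the source, is the fraction of eojeols whose complete analysis matches the gold analysis exactly. The sketch below computes that quantity; the function name `eojeol_accuracy` and the tag labels in the example are hypothetical.

```python
def eojeol_accuracy(gold, predicted):
    """Fraction of eojeols whose predicted analysis equals the gold analysis.

    Each element of `gold` and `predicted` is the full analysis of one
    eojeol, for example a tuple of (morpheme, tag) pairs. An eojeol counts
    as correct only if the whole analysis matches (ASSUMED definition).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same eojeols")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Two eojeols; the second one is analyzed incorrectly.
gold = [(("학교", "NOUN"), ("에", "PARTICLE")), (("간다", "VERB"),)]
pred = [(("학교", "NOUN"), ("에", "PARTICLE")), (("간다", "NOUN"),)]
print(eojeol_accuracy(gold, pred))  # 0.5
```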

Two baselines are set for this analysis, as shown in Table 1.

In Table 1, "syllable embedding" denotes the case using syllable embeddings, which have recently been reported to yield the highest morphological tagging performance on the Sejong corpus.

In the results of morphological analysis using the apparatus according to the present invention, a system that uses syllable embeddings as the input vectors of the Bi-LSTM-CRF was implemented and achieved a high performance of 97.76% on typo-free documents.

In Table 1, the baseline jamo-sum without SP (자모sum-SP없음) does not include eojeol information.

Table 2 shows the results of the analysis when eojeol information, that is, the <SP> token, is added to the sentence.

To create documents with typos, one jamo typo was forcibly generated in every eojeol of the test data, and the resulting text was analyzed.
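One way to generate such test data is to decompose a Hangul syllable into its initial, medial, and final indices using the standard Unicode arithmetic, change one index at random, and recompose the syllable. The sketch below is an assumption about how this could be done, not the procedure used in the experiments: the text does not say whether typos are substitutions, insertions, or deletions, and here changing the final slot may also add or remove a final. The names `inject_typo` and `inject_typos` are hypothetical.

```python
import random

HANGUL_BASE = 0xAC00          # first precomposed syllable, "가"
HANGUL_LAST = 0xD7A3          # last precomposed syllable
SLOT_SIZES = (19, 21, 28)     # initials, medials, finals (incl. "no final")


def decompose(ch):
    """Return (initial, medial, final) indices of a precomposed syllable."""
    offset = ord(ch) - HANGUL_BASE
    return offset // 588, (offset % 588) // 28, offset % 28


def compose(initial, medial, final):
    """Build a precomposed syllable from its three jamo indices."""
    return chr(HANGUL_BASE + (initial * 21 + medial) * 28 + final)


def inject_typo(eojeol, rng):
    """Replace exactly one jamo of one randomly chosen Hangul syllable."""
    positions = [k for k, ch in enumerate(eojeol)
                 if HANGUL_BASE <= ord(ch) <= HANGUL_LAST]
    if not positions:
        return eojeol  # nothing to corrupt (digits, Latin text, ...)
    k = rng.choice(positions)
    indices = list(decompose(eojeol[k]))
    slot = rng.randrange(3)
    # Draw a new index that is guaranteed to differ from the current one.
    new = rng.randrange(SLOT_SIZES[slot] - 1)
    if new >= indices[slot]:
        new += 1
    indices[slot] = new
    return eojeol[:k] + compose(*indices) + eojeol[k + 1:]


def inject_typos(sentence, seed=0):
    """Inject one jamo typo into every space-delimited eojeol."""
    rng = random.Random(seed)
    return " ".join(inject_typo(e, rng) for e in sentence.split(" "))


print(inject_typos("학교에 간다", seed=1))
```

To produce one typo per n eojeols instead, the same helper could be applied to every n-th eojeol only.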

As Table 2 shows, all configurations that include eojeol information outperform jamo-sum without SP, regardless of whether typos are present.

Overall, the concat variants outperformed the sum variants: on typo-free data, jamo-concat reached 97.34%, and on data with typos, jamo-syllable-concat reached 80.09%, nearly 9%p higher than the jamo-sum without SP baseline.

Next, the results of the performance analysis by typo frequency are as follows.

On the judgment that in practice typos appear intermittently rather than in every eojeol, performance was also analyzed for one typo per 5 eojeols and one typo per 2 eojeols, as shown in Table 3.

As Table 3 shows, jamo-syllable-concat gives the best results at every typo frequency (one typo per n eojeols, n = 1, 2, 5), which indicates that it handles typos well.

Table 4 shows an analysis in which typos were generated separately in nouns, verbs, and particles, to determine which part of speech, when it contains a typo, affects overall performance the most.

When the typo was in a particle, the recognition rate was very low. This is because, when a particle contains a typo, the entire eojeol, including both the nominal preceding the particle and the particle itself, is tagged as a single common noun.

For example, when '학교애' is entered instead of '학교에', the output is a single noun rather than 'noun + particle'.

Verbs show relatively high accuracy even with typos, because the system recognizes verbs comparatively well from contextual information.

In the present invention, data from the question-and-answer site of the National Institute of Korean Language were analyzed, and the jamo embeddings were merged for 11 typo types in which typos occur frequently, either because the spelling is genuinely difficult or because of keyboard input mistakes.

For example, one word that users often mistype is '베개' (pillow), which is likely to be entered incorrectly as 베게, 배게, or 배개.

This is a case of confusing the medial vowels ㅐ and ㅔ; here, the same jamo embedding vector is used for both medials.

Similarly, typos such as 되어/돼어 and 왠지/웬지 are frequent, so the three medials ㅚ/ㅙ/ㅞ are merged into a single jamo embedding vector.

A total of 11 such types were defined, and each was merged into a single embedding before training and analysis; FIG. 4 shows the 11 frequently occurring typo types.

For the analysis, every jamo that is not the representative jamo of its group is converted to the representative jamo (for example, 너한테 -> 너한태) before the analysis is run.
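The conversion in the example (너한테 -> 너한태) can be reproduced at the text level by decomposing each syllable, replacing the medial with its representative, and recomposing. The sketch below is restricted to the two medial groups for brevity and assumes that the first jamo listed in each group (ㅐ and ㅙ) is the representative, which is consistent with the example but not stated for every group. The function name `to_representative` is hypothetical.

```python
# The 21 medial vowels in Unicode order for precomposed Hangul syllables.
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"

# Medial merge groups from the description. ASSUMPTION: the first jamo of
# each group is the representative.
MEDIAL_GROUPS = ["ㅐㅔ", "ㅙㅚㅞ"]

# MEDIAL_MAP[i] is the index of the representative of medial i.
MEDIAL_MAP = list(range(len(MEDIALS)))
for group in MEDIAL_GROUPS:
    representative = MEDIALS.index(group[0])
    for jamo in group[1:]:
        MEDIAL_MAP[MEDIALS.index(jamo)] = representative


def to_representative(text):
    """Rewrite every Hangul syllable so its medial is the representative."""
    out = []
    for ch in text:
        offset = ord(ch) - 0xAC00
        if 0 <= offset <= 0xD7A3 - 0xAC00:
            initial = offset // 588
            medial = (offset % 588) // 28
            final = offset % 28
            ch = chr(0xAC00 + (initial * 21 + MEDIAL_MAP[medial]) * 28 + final)
        out.append(ch)
    return "".join(out)


print(to_representative("너한테"))  # 너한태
```

With this rewriting, the correct spelling 베개 and the misspellings 베게, 배게, and 배개 all become the same string, so the model sees identical input for all four.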

At least one such substitution occurred in almost every sentence of the test data. (For example, sentences containing none of 'ㅔ', 'ㅗ', 'ㅆ', and so on made up less than 1.5% of all test sentences.)

Table 5 shows the change in performance of jamo-syllable-concat before and after the embedding conversion.

As shown in Table 6, when the model is trained with the converted embeddings and typos of the corresponding types occur, it recorded a high performance of 93.05%, an increase of nearly 16%p over the result before conversion.

This shows that the embedding conversion method can be effective on documents containing typos.

However, with the embedding conversion, overall performance on typo-free documents was 94.45%, which is 2.55%p lower than before conversion. This is because merging several jamo vectors reduces the model's ability to distinguish the merged jamo in sentences that contain no typos.

The method for Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention is described in detail as follows.

FIG. 5 is a flowchart showing the method for Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention.

First, jamo-unit embedding of the initial, medial, and final jamo is performed (S501), and syllable-unit embedding is performed (S502).

The embedding dimension is set to 64. To distinguish identical consonants appearing as an initial and as a final, position markers are attached to initials and finals (S503), and for a syllable with no final, a separate marker denoting 'no final' is placed in the final position for training (S504).
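Steps S503 and S504 can be sketched as a function that turns one syllable into three position-marked jamo tokens. This is an illustrative sketch only: the `I:`/`M:`/`F:` prefixes and the `<NOFINAL>` marker are hypothetical spellings, since the text says that position markers and a 'no final' marker exist but does not give their form.

```python
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
# Index 0 means the syllable has no final consonant.
FINALS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ",
          "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ",
          "ㅋ", "ㅌ", "ㅍ", "ㅎ"]

NO_FINAL = "<NOFINAL>"  # hypothetical spelling of the 'no final' marker


def jamo_tokens(syllable):
    """Split one Hangul syllable into position-marked jamo tokens."""
    offset = ord(syllable) - 0xAC00
    if not 0 <= offset <= 0xD7A3 - 0xAC00:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial = offset // 588
    medial = (offset % 588) // 28
    final = offset % 28
    final_jamo = FINALS[final] if final else NO_FINAL
    # The prefixes keep an initial ㄱ and a final ㄱ as distinct tokens.
    return (f"I:{INITIALS[initial]}", f"M:{MEDIALS[medial]}", f"F:{final_jamo}")


print(jamo_tokens("각"))  # ('I:ㄱ', 'M:ㅏ', 'F:ㄱ')
print(jamo_tokens("가"))  # ('I:ㄱ', 'M:ㅏ', 'F:<NOFINAL>')
```

Each token would then index its own 64-dimensional embedding, so every syllable always contributes exactly three jamo vectors regardless of whether it has a final.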

The three jamo embeddings (initial, medial, final) are concatenated, and the syllable embedding is additionally concatenated (S505).

One syllable is thus represented as a vector of 256 dimensions in total, which is provided as the input to the Bi-LSTM-CRF (S506).
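The 256 dimensions follow from concatenating four 64-dimensional vectors (4 × 64 = 256), whereas a vector sum of the same four vectors stays at 64 dimensions. The sketch below shows only the shapes; the random vectors are stand-ins for trained embeddings, and the helper names are hypothetical.

```python
import random

DIM = 64  # embedding dimension stated in the description


def random_vector(rng, dim=DIM):
    """Stand-in for a trained embedding (real ones come from training)."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def concat(vectors):
    """Concatenate vectors end to end ('concat')."""
    return [x for v in vectors for x in v]


def vector_sum(vectors):
    """Add vectors element by element ('sum')."""
    return [sum(xs) for xs in zip(*vectors)]


rng = random.Random(0)
initial, medial, final, syllable = (random_vector(rng) for _ in range(4))

concat_input = concat([initial, medial, final, syllable])
sum_input = vector_sum([initial, medial, final, syllable])

print(len(concat_input))  # 256
print(len(sum_input))     # 64
```

Concatenation keeps each component in its own block of coordinates, so the model can tell which part of the input came from which jamo position, while summation mixes them into one 64-dimensional vector.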

After the forward/backward steps of the Bi-LSTM-CRF, training is performed using the backpropagation algorithm (S507). The Viterbi search algorithm is then used to find the optimal tag sequence, and POS tags carrying B/I/E symbols that mark the beginning, middle, and end of each part of speech are output (S508).
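Viterbi decoding in step S508 can be illustrated with a small dynamic program over per-position emission scores and tag-to-tag transition scores. This is a generic textbook sketch under assumed inputs, not the patent's code: in the actual system the emission scores would come from the Bi-LSTM and the transition scores from the CRF layer, and the tag names and numbers below are made up.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions:   list with one dict per position, mapping tag -> score
    transitions: dict mapping (previous_tag, current_tag) -> score
    """
    # best[t][tag] is the best score of any path ending in `tag` at t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag] is the best previous tag for `tag` at t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["B-NOUN", "E-NOUN", "B-PART"]  # hypothetical tag names
transitions = {(p, c): 0.0 for p in tags for c in tags}
transitions[("B-NOUN", "E-NOUN")] = 2.0
transitions[("E-NOUN", "B-PART")] = 1.0
emissions = [
    {"B-NOUN": 2.0, "E-NOUN": 0.0, "B-PART": 0.0},
    {"B-NOUN": 1.0, "E-NOUN": 0.5, "B-PART": 0.0},
    {"B-NOUN": 0.0, "E-NOUN": 0.5, "B-PART": 1.0},
]
print(viterbi(emissions, transitions, tags))
# ['B-NOUN', 'E-NOUN', 'B-PART']
```

In this toy example, picking the best tag at each position independently would choose B-NOUN at the second position; the transition scores make the globally best sequence B-NOUN, E-NOUN, B-PART instead.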

In this method for Korean morphological analysis using a combination of jamo and syllable embeddings, in order to determine the correct POS even in the presence of typos, the input concatenates not only the syllable embedding but also the jamo-level embeddings of the initial, medial, and final.

In addition, for typos that arise from frequent confusion or input mistakes, the confused jamo are identified and represented by a merged vector, which resolves the incorrect morphological analysis of frequently occurring typos and allows more accurate POS decisions.

For morpheme segmentation and POS tagging that remain effective in the presence of typos, an incoming sentence is split into syllables, and two methods are used to build the input to the Bidirectional Long Short-Term Memory CRF (Bi-LSTM-CRF) model.

In the first method, for accurate morphological analysis even of sentences with typos, the syllable embedding built with word2vec is concatenated with the embeddings of the initial, medial, and final jamo obtained by decomposing the syllable, giving an input of 256 dimensions in total.

If a syllable has no final, a separate marker denoting 'no final' is placed in the final position for training.

In addition, to include eojeol information within the sentence, a space vector representing spacing is added: a <SP> token is inserted as input at every space, and the model is trained on it.
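The insertion of the <SP> token can be sketched as a simple preprocessing step that turns a sentence into a sequence of syllables with a boundary symbol at every space. The function name `to_input_symbols` is hypothetical; only the `<SP>` token itself comes from the text.

```python
def to_input_symbols(sentence, space_token="<SP>"):
    """Split a sentence into syllables, marking each space with <SP>."""
    symbols = []
    for k, eojeol in enumerate(sentence.split()):
        if k > 0:
            symbols.append(space_token)  # boundary between two eojeols
        symbols.extend(eojeol)           # one symbol per syllable
    return symbols


print(to_input_symbols("학교에 간다"))
# ['학', '교', '에', '<SP>', '간', '다']
```

The <SP> symbol would then receive its own input vector (the space vector) and, on the output side, the B-S tag.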

In the second method, frequently confused jamo are analyzed, and converted jamo embeddings are used so that the corresponding typos are analyzed more accurately.

To this end, the frequently confused jamo are merged, and the converted jamo embeddings are built using word2vec.

The converted jamo embeddings and the syllable embedding are then concatenated and fed into the Bi-LSTM-CRF model.

As a result, for frequently confused jamo typos, using the converted jamo embeddings allows more accurate POS decisions than a morphological analyzer that uses only ordinary jamo embeddings.

FIGS. 6A and 6B are graphs showing the performance of Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention.

As shown in FIG. 6A, by running the Bi-LSTM-CRF on the concatenation of the initial/medial/final jamo embeddings and the syllable embedding, the system maintains 97% performance on typo-free sentences while achieving performance 8.77%p higher than the baseline on sentences with typos (at n = 1).

Also, as shown in FIG. 6B, by tallying 11 typo types that occur frequently in real-world use and merging their embeddings, a high performance of 93.05% was achieved even on sentences containing those typos.

This means that it is feasible to implement a morphological analysis system that maintains at least a certain level of performance regardless of whether typos are present.

The apparatus and method for Korean morphological analysis using a combination of jamo and syllable embeddings according to the present invention, as described above, use the combination of jamo and syllable embeddings together with embedding conversion to enable morphological analysis that performs well on both typo-free documents and documents with typos.

As described above, it will be understood that the present invention may be implemented in modified forms without departing from its essential characteristics.

Therefore, the disclosed embodiments should be considered illustrative rather than limiting. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the scope of equivalents should be construed as being included in the present invention.

10. Jamo-unit embedding unit 20. Syllable-unit embedding unit
30. Input unit 40. Learning unit
50. Output unit

Claims

An apparatus for Korean morphological analysis using a combination of jamo and syllable embeddings, comprising:
a jamo-unit embedding unit that performs embedding of the initial, medial, and final jamo;
a syllable-unit embedding unit that performs syllable-unit embedding;
an input unit that combines the three jamo embeddings (initial, medial, final), additionally combines the syllable embedding to represent one syllable as a vector, and provides the vector as the input to a Bi-LSTM-CRF;
a learning unit that performs the forward/backward steps of the Bi-LSTM-CRF and then learns using a backpropagation algorithm; and
an output unit that uses a Viterbi search algorithm to find the optimal tag sequence and outputs POS tags carrying symbols that mark the beginning, middle, and end of each part of speech.

The apparatus of claim 1, wherein the input unit sets the embedding dimension to 64, attaches position markers to initials and finals to distinguish identical consonants appearing as an initial and as a final, and, for a syllable with no final, places a separate marker denoting 'no final' in the final position for training.

The apparatus of claim 1, wherein the input unit combines the three jamo embeddings (initial, medial, final), additionally combines the syllable embedding to represent one syllable as a vector of 256 dimensions in total, and provides the vector as the input to the Bi-LSTM-CRF.

The apparatus of claim 1 or claim 3, wherein, in combining the three jamo embeddings (initial, medial, final) and additionally combining the syllable embedding, one of jamo-sum, jamo-syllable-sum, jamo-concat, and jamo-syllable-concat is selectively performed, where 'sum' denotes vector summation and 'concat' denotes vector concatenation.

The apparatus of claim 1, wherein, in combining the three jamo embeddings (initial, medial, final), jamo that are often written incorrectly, either because the spelling rule is confusing or because of keyboard input errors, are analyzed so that the apparatus can respond effectively to the corresponding typos.

The apparatus of claim 5, wherein, in combining the three jamo embeddings (initial, medial, final), the typo types for which one jamo and another are converted into, and merged as, the same vector include:
initial consonants ㄱ/ㄲ, ㅂ/ㅃ, ㅅ/ㅆ;
medial vowels ㅐ/ㅔ, ㅙ/ㅚ/ㅞ; and
final consonants ㄱ/ㄲ/ㄳ, ㄴ/ㄶ/ㄵ, ㄹ/ㄺ/ㄻ/ㄼ/ㄽ/ㄾ/ㄿ/ㅀ, ㅂ/ㅄ, ㅅ/ㅆ.

The apparatus of claim 1, wherein the performance of the Korean morphological analysis result using the combination of jamo and syllable embeddings is evaluated using eojeol-level accuracy, which is defined by an equation (not reproduced in this text).

A method for Korean morphological analysis using a combination of jamo and syllable embeddings, comprising:
a jamo-unit and syllable-unit embedding step of performing jamo-unit embedding of the initial, medial, and final jamo and performing syllable-unit embedding;
an input step of combining the three jamo embeddings (initial, medial, final), additionally combining the syllable embedding to represent one syllable as a vector, and providing the vector as the input to a Bi-LSTM-CRF;
a learning step of performing the forward/backward steps of the Bi-LSTM-CRF and then learning using a backpropagation algorithm; and
an output step of using a Viterbi search algorithm to find the optimal tag sequence and outputting POS tags carrying symbols that mark the beginning, middle, and end of each part of speech.

The method of claim 8, wherein, in the input step, the embedding dimension is set to 64, position markers are attached to initials and finals to distinguish identical consonants appearing as an initial and as a final, and, for a syllable with no final, a separate marker denoting 'no final' is placed in the final position for training.

The method of claim 8, wherein, in the input step, the three jamo embeddings (initial, medial, final) are combined, the syllable embedding is additionally combined to represent one syllable as a vector of 256 dimensions in total, and the vector is provided as the input to the Bi-LSTM-CRF.

The method of claim 8 or claim 10, wherein, in combining the three jamo embeddings (initial, medial, final) and additionally combining the syllable embedding, one of jamo-sum, jamo-syllable-sum, jamo-concat, and jamo-syllable-concat is selectively performed, where 'sum' denotes vector summation and 'concat' denotes vector concatenation.

The method of claim 8, wherein, in combining the three jamo embeddings (initial, medial, final), jamo that are often written incorrectly, either because the spelling rule is confusing or because of keyboard input errors, are analyzed so that the corresponding typos can be responded to effectively.

The method of claim 12, wherein, in combining the three jamo embeddings (initial, medial, final), the typo types for which one jamo and another are converted into, and merged as, the same vector include:
initial consonants ㄱ/ㄲ, ㅂ/ㅃ, ㅅ/ㅆ;
medial vowels ㅐ/ㅔ, ㅙ/ㅚ/ㅞ; and
final consonants ㄱ/ㄲ/ㄳ, ㄴ/ㄶ/ㄵ, ㄹ/ㄺ/ㄻ/ㄼ/ㄽ/ㄾ/ㄿ/ㅀ, ㅂ/ㅄ, ㅅ/ㅆ.

The method of claim 8, wherein the performance of the Korean morphological analysis result using the combination of jamo and syllable embeddings is evaluated using eojeol-level accuracy, which is defined by an equation (not reproduced in this text).