KR20210084156A

KR20210084156A - Method for noting korean language to be used for deep learning training

Info

Publication number: KR20210084156A
Application number: KR1020190177174A
Authority: KR
Inventors: 조영환
Original assignee: 주식회사 투블럭에이아이
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2021-07-07

Abstract

The present invention relates to a Korean notation method for applying a Korean corpus to deep learning, so that Korean can be trained using open-source deep learning code that is written primarily for English. The method includes: analyzing a Korean sentence and converting it into a form suitable for deep learning; and recombining the output of a model trained by deep learning into a Korean sentence. According to the present invention, Korean words are handled in the same way as English words, so training can be performed with programs released as open source, and because recombination into Korean is simple, sentences are restored without conversion errors.

Description

Korean notation method for deep learning training {METHOD FOR NOTING KOREAN LANGUAGE TO BE USED FOR DEEP LEARNING TRAINING}

The present invention relates to a method for training deep learning models on a Korean corpus, and more particularly, to a method of representing Korean for deep learning.

Because techniques for handling language with deep learning are based on English, a sentence is assumed to be a sequence of words. Korean, however, is an agglutinative language, so the notion of a word is ambiguous, and an eojeol (어절, a unit delimited by spaces) is formed by attaching several words together. Consequently, if every space-delimited unit is assumed to be a word, “특허는”, “특허의”, and “특허를” all contain the noun “특허” (patent), yet this is not recognized and they are treated as different words. For this reason, Korean is usually handled by applying morphological analysis, which splits each eojeol into morphemes, units similar to words, before training. Korean morphological analysis can produce results such as “특허(noun) + 가(particle)” for “특허가”, “특허(noun) + 등록(noun) + 을(particle)” for “특허등록을”, and “아름답다(verb) + ㄴ(transformative ending)” for “아름다운”. Three issues arise here: 1) whether to attach the part of speech to each word, 2) how to determine the spacing between the parts of a compound noun, and 3) the fact that the complex process of restoring altered characters to their original form can produce a result that differs from the original text.

An object of the present invention is to solve the above problems by providing a Korean notation method that takes spacing and grammatical function into account. Open-source deep learning language-processing code based on English represents English with the word as its basic unit. In Korean, however, particles such as “은, 는, 이, 가, 을, 를” and words with a grammatical function, such as the endings “었다, 었었다, 는데, 라고, 고”, are attached to nouns and verb stems to form a single space-delimited unit (an eojeol). Therefore, if each space-delimited unit is treated as one word, as in English, units in which a noun and a particle are joined, such as “특허는, 특허를, 특허와, 특허가, ...”, are used as training input. The meaning of the particles is then hard to learn, and training performance suffers. Korean can also be written by analyzing it and attaching part-of-speech tags, but an eojeol such as “먹인다고” is analyzed as “먹다/verb” + “ㄴ다고/ending”, which alters the Hangul characters, so complex transformation rules are needed to recombine the parts after they have been split. The present invention proposes a Korean notation that distinguishes grammatical function together with spacing, so that training code written for English can be used without modification while training performance is maintained.

To this end, in the present invention, a content word such as a noun that was written attached to the preceding word is separated by a space, and the symbol “~” is inserted in front of the separated word. For a function word such as a particle, the symbol “~~” is inserted in front of the separated word. This makes Korean easy to represent. For example, the sentence “특허출원을 위해서는 기존에는 없는 아이디어가 필요하다.” (“A patent application requires an idea that did not previously exist.”) can be rewritten as follows: “특허 ~출원 ~~을 위해 ~~서 ~~는 기존 ~~에는 없 ~~는 아이디어 ~~가 필요 ~~하다 ~.”

Korean Representation Encoder

If a content word such as a noun or verb is written attached to the preceding word, it is separated and “~” is prefixed to it. If a function word such as a particle or ending is written attached to the preceding word, it is separated and “~~” is prefixed to it.

Korean Representation Decoder

For a word that begins with “~” or “~~”, the symbol is removed and the word is attached to the preceding word. For example, a deep learning output such as “특허 ~출원 ~~이 완료 ~~되 ~~었습니다 ~.” is converted to “특허출원이 완료되었습니다.” (“The patent application has been completed.”).

As described above, the present invention has the following effects.

First, deep learning training methods written for English can be applied as they are.

Second, training can distinguish the case where a word with a given spelling is used as a content word from the case where the same spelling is used as a function word.

Third, results generated by deep learning can easily be restored to Korean sentences.
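The second effect can be illustrated with a small sketch. The marked sentence “이 발명 ~~이” (from “이 발명이”, a determiner meaning “this”, the noun “invention”, and a subject particle) is an illustrative assumption that does not appear in the specification, and `role` is a hypothetical helper. Under the notation, the determiner “이” and the particle “이” become distinct tokens, so an ordinary whitespace tokenizer keeps them apart:

```python
def role(token: str) -> str:
    """Classify a marked token by its prefix (hypothetical helper)."""
    if token.startswith("~~"):
        return "function word"
    if token.startswith("~"):
        return "content word inside an eojeol"
    return "content word at the start of an eojeol"


# Illustrative marked sentence (an assumption, not from the specification).
marked = "이 발명 ~~이"
tokens = marked.split()

for token in tokens:
    print(f"{token!r}: {role(token)}")

# The two uses of the same character are different token strings,
# so they would receive different vocabulary entries.
print(tokens[0] != tokens[2])
```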

The following drawings attached to this specification illustrate preferred embodiments of the present invention and, together with the detailed description, serve to further the understanding of its technical idea. The present invention should therefore not be construed as limited to the matters shown in the drawings.
FIG. 1 is a schematic diagram showing a module configuration for deep learning language processing;
FIG. 2 is a schematic diagram specifically showing general Korean language processing;
FIG. 3 is a schematic diagram specifically showing the Korean notation according to an embodiment of the present invention;
FIG. 4 is a schematic diagram showing the operating method of the Korean encoder of the present invention; and
FIG. 5 shows the operating method of the Korean decoder of the present invention.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present invention pertains can easily carry out the invention. However, in describing the operating principle of the preferred embodiments in detail, detailed descriptions of related known functions or configurations are omitted where they would unnecessarily obscure the gist of the present invention.

In addition, the same reference numerals are used throughout the drawings for parts having similar functions and operations. Throughout the specification, when a part is said to be connected to another part, this includes not only direct connection but also indirect connection with another element interposed between them. Furthermore, stating that a part includes a component means that it may further include other components, not that it excludes them, unless specifically stated otherwise.

FIG. 1 shows the modules that typically perform language processing through deep learning and the flow of that processing. First, given a large training corpus and a vocabulary (Vocab) for mapping the words in the corpus to numbers, the corpus is converted into numbers, producing a corpus that is identical to the original except that it is represented as sequences of numbers. Deep learning training then yields a trained model. FIG. 2 shows where Korean analysis and Korean recombination are performed, and what role they play, when Korean is introduced into this existing approach. Korean analysis separates the words in the initial training corpus and converts their representation according to part of speech. Korean recombination uses the part-of-speech information attached to the deep learning inference output to recombine the words and produce the final inference result. FIG. 3 shows that a corpus that has undergone Korean analysis can be used for training simply by attaching symbols, without changing open-source code written for English, and that the inference result can be produced simply by removing the symbols contained in the output of the model trained in this way.
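The corpus-to-numbers step of FIG. 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of any particular open-source project: `build_vocab`, `to_ids`, and the `<unk>` entry are hypothetical names, and the second corpus line is an illustrative sentence written in the proposed notation. Because the notation separates tokens with spaces, plain whitespace splitting of the kind used for English is enough:

```python
from typing import Dict, List

# Two sentences already written in the proposed notation.
# The second line is an illustrative assumption.
corpus = [
    "특허 ~출원 ~~이 완료 ~~되 ~~었습니다 ~.",
    "특허 ~~가 필요 ~~하다 ~.",
]


def build_vocab(lines: List[str]) -> Dict[str, int]:
    """Assign an integer ID to each whitespace-separated token."""
    vocab = {"<unk>": 0}  # hypothetical entry for unseen tokens
    for line in lines:
        for token in line.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_ids(line: str, vocab: Dict[str, int]) -> List[int]:
    """Convert one marked sentence into a sequence of token IDs."""
    return [vocab.get(token, vocab["<unk>"]) for token in line.split()]


vocab = build_vocab(corpus)
for line in corpus:
    print(to_ids(line, vocab))
```

In this sketch the noun “특허” maps to the same ID in both sentences regardless of which particle follows it, which is the sharing that splitting on eojeols alone would lose.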

1) Korean notation encoder

The output of an ordinary Korean morphological analyzer is used. The analyzer must be a version that segments text character by character, without altering the Hangul, and attaches part-of-speech tags. Morphemes are divided into content words and function words, and each is considered according to whether it appears at the start or in the middle of an eojeol (a space-delimited unit). A content word such as a noun or verb receives no symbol if it is at the start of an eojeol, and has the “~” symbol prefixed if it appears in the middle of an eojeol. A function word such as a particle or ending cannot be located at the start of an eojeol, so the “~~” symbol is always prefixed to it. FIG. 4 is a schematic diagram showing how the Korean encoder operates. As shown in FIG. 4, the present invention receives its input in units of eojeols that have already undergone Korean analysis and inserts the special symbols into the words that make up each eojeol, so that each is handled as a word, as in English.
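The encoder rules above can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the morphological analyzer is not modeled, so its output is written by hand; the `(surface, kind)` pair format and the name `encode` are hypothetical; and the sentence-final period is labeled as a content token only because the specification's example marks it with “~”.

```python
from typing import List, Tuple

# (surface form, "content" or "function"); a hypothetical input format
# standing in for the output of a morphological analyzer.
Morpheme = Tuple[str, str]


def encode(eojeols: List[List[Morpheme]]) -> str:
    """Write analyzed eojeols in the "~" / "~~" notation."""
    tokens = []
    for eojeol in eojeols:
        for position, (surface, kind) in enumerate(eojeol):
            if kind == "function":
                tokens.append("~~" + surface)  # particles, endings
            elif position > 0:
                tokens.append("~" + surface)   # content word mid-eojeol
            else:
                tokens.append(surface)         # content word at eojeol start
    return " ".join(tokens)


# Hand-written analysis of "특허출원이 완료되었습니다."
analysis = [
    [("특허", "content"), ("출원", "content"), ("이", "function")],
    [("완료", "content"), ("되", "function"),
     ("었습니다", "function"), (".", "content")],
]

print(encode(analysis))
```

Every token keeps its original characters, which is why the analyzer is required to segment without altering the Hangul.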

2) Korean notation decoder

The decoder takes the deep learning output as input. Because symbols attached by the encoder may appear at the front of words in the output, each word beginning with “~” or “~~” has the symbol removed and is attached to the preceding word without a space. As shown in FIG. 5, the present invention joins the output of the deep learning model's inference and emits it in units of eojeols (space-delimited units). Since the output carries the symbols that were attached for training, each symbol is simply removed when encountered and the words are merged into eojeols.
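The decoder rule can be sketched in a few lines. The name `decode` is hypothetical, and two behaviors are assumptions the specification does not address: a marked token with no preceding word simply starts a new eojeol, and text that already contains a literal “~” at the start of a token would be ambiguous.

```python
def decode(marked: str) -> str:
    """Strip "~" / "~~" markers and rejoin tokens into eojeols."""
    eojeols = []
    for token in marked.split():
        if token.startswith("~~"):
            piece, attach = token[2:], True
        elif token.startswith("~"):
            piece, attach = token[1:], True
        else:
            piece, attach = token, False
        if attach and eojeols:
            eojeols[-1] += piece   # glue onto the preceding word
        else:
            eojeols.append(piece)  # start a new eojeol
    return " ".join(eojeols)


print(decode("특허 ~출원 ~~이 완료 ~~되 ~~었습니다 ~."))
print(decode("특허 ~출원 ~~을 위해 ~~서 ~~는 기존 ~~에는 없 ~~는 아이디어 ~~가 필요 ~~하다 ~."))
```

Both marked strings come from the specification, and the sketch restores them to the original sentences without any part-of-speech information or transformation rules.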

Claims

In a method of converting Korean for deep learning, a method of writing Korean by attaching simple symbols that indicate attached writing (the absence of a space) and grammatical function.

A method of recombining Korean by simply removing the attached symbols, which indicate attached writing and grammatical function, from symbol-annotated Korean notation generated by a deep learning model.