KR20200137936A

KR20200137936A - Vocabulary list generation method and device for Korean based neural network language model

Info

Publication number: KR20200137936A
Application number: KR1020190159637A
Authority: KR
Inventors: 허의남; 김만수
Original assignee: 경희대학교 산학협력단
Priority date: 2019-05-29
Filing date: 2019-12-04
Publication date: 2020-12-09
Also published as: KR102354898B1

Abstract

The present invention relates to a method and a device for generating a vocabulary list for a Korean-based neural network language model, suited to the characteristics of Korean. According to one embodiment of the present invention, a device for generating a vocabulary list comprises: a data receiving unit which receives Korean data for generating a vocabulary list; a first operation unit which performs a subword segmentation algorithm on the received Korean data and separates the words included in the Korean data into subwords in accordance with the algorithm; and a second operation unit which generates the vocabulary list by performing a regularization algorithm on the separated subwords.

Description

Vocabulary list generation method and device for Korean based neural network language model

The present invention relates to a method and apparatus for generating a vocabulary list for a Korean-based neural network language model, and more particularly, to a method and apparatus for generating a vocabulary list using algorithms that take the linguistic characteristics of Korean into account.

Neural network language models are used in many fields that require language understanding, such as machine translation, question answering, and named entity recognition. Neural network language models can be broadly divided into character-level, word-level, and subword-level models, and depending on the method, these units may be mixed.

The text input to a neural network language model is converted into vectors through a word embedding step; the resulting vectors are called word embedding vectors. Depending on the method, the word embedding step may be performed in one layer of the neural language model or in a separate model.

Here, the dimension of the word embedding vector depends on the size of the input vocabulary list. For example, the total number of Hangul characters such as 가, 나, and 다 is 11,117, and although the total number of words varies by dictionary, about 500,000 words are currently listed in the Standard Korean Language Dictionary. A character-level word embedding vector therefore has 11,117 dimensions, while a word-level word embedding vector has about 500,000 dimensions.

However, the character level has too few dimensions to express linguistic context, while the word level has so many dimensions that it requires an enormous amount of memory, which makes computation difficult. Accordingly, much research has trained neural network language models on subword-level vocabulary lists of a desired size, and subword segmentation, which separates words into subwords, is being actively studied.

Existing subword segmentation methods can be divided into supervised and unsupervised methods. Supervised methods separate words with a morphological analyzer, which splits a word into morphemes, the smallest units of a word. Their drawback is that such an analyzer can be built only when a very large dataset with accurate morphological annotation is available. Unsupervised methods either build the vocabulary list in descending order of subword frequency alone, without considering linguistic characteristics such as Korean postpositional particles, or separate words with linguistic characteristics in mind but do not consider what criterion should then be used to build the vocabulary list.

Subword regularization improves prediction accuracy by reducing the size of the vocabulary list and thus narrowing the range of subword prediction. Existing methods of this kind, however, assume that consecutive subwords are independent and ignore the correlation between subwords, which lowers the prediction rate of the neural network language model.

KR 10-2018-0001889 A
KR 10-2017-0108693 A
KR 10-2019-0046432 A

The present invention is intended to solve the problems described above, and an object of the present invention is to provide a subword segmentation algorithm that takes the linguistic characteristics of Korean into account.

Another object of the present invention is to provide a mutual information based subword regularization algorithm that measures the interdependence between subwords by taking the correlation between subwords into account.

The objects of the present invention are not limited to those described above, and other objects not mentioned will be clearly understood from the following description.

A method for generating a vocabulary list according to an embodiment of the present invention may include: (a) receiving Korean data for generating a vocabulary list; (b) performing a subword segmentation algorithm on the received Korean data to separate the words included in the Korean data into subwords in accordance with the algorithm; and (c) performing a regularization algorithm on the separated subwords to generate the vocabulary list.

In step (b), the subword segmentation algorithm may be defined by the following equation.

Here, $c_{0:n}$ is the sequence of consecutive characters from position 0 to n, $\chi$ is the set of characters that follow $c_{0:n}$ in the Korean data, and $X$ denotes the Korean data.

Step (b) may be a step of separating each word included in the received Korean data into a left subword and a right subword using the subword segmentation algorithm.

Step (b) may further include, when a right subword exists, repeating the subword segmentation algorithm on that right subword.

In step (c), the regularization algorithm may be defined by the following equation.

Here, $w_i$ is the i-th word in the word set $W$, $s_j$ is the j-th subword separated from $w_i$, and $s$ is the subword whose regScore value is to be computed.

Step (c) may further include deleting subwords in descending order of regScore value, up to a preset ratio.

Step (c) may further include repeating the deletion of subwords until the vocabulary list reaches a preset number of words.
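The deletion loop described in the two steps above can be sketched as follows. This is only an illustration under stated assumptions: `prune_vocab`, `score_fn`, `target_size`, and `ratio` are hypothetical names, the score table is made up, and the text does not say whether scores are recomputed after each round (this sketch recomputes them).

```python
import math

def prune_vocab(vocab, score_fn, target_size, ratio=0.2):
    """Repeatedly drop the highest-scoring subwords until target_size is reached."""
    vocab = set(vocab)
    while len(vocab) > target_size:
        # Rank subwords by score, largest first (scores recomputed every round).
        ranked = sorted(vocab, key=lambda s: score_fn(s, vocab), reverse=True)
        # Drop a fixed ratio per round, at least one, never past the target size.
        n_drop = max(1, math.floor(len(vocab) * ratio))
        n_drop = min(n_drop, len(vocab) - target_size)
        vocab -= set(ranked[:n_drop])
    return vocab

# Made-up scores standing in for regScore values (assumption, not from the text).
toy_scores = {"서울": 0.1, "에서": 0.2, "서": 0.5, "울에서": 0.8, "울에": 0.9}
pruned = prune_vocab(toy_scores, lambda s, v: toy_scores[s], target_size=2, ratio=0.4)
print(sorted(pruned))
```

Capping `n_drop` at the distance to `target_size` keeps the final round from removing more subwords than requested.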

A computer-readable recording medium according to an embodiment of the present invention may record a program for executing the above method for generating a vocabulary list on a computer.

An apparatus for generating a vocabulary list according to an embodiment of the present invention may include: a data receiving unit that receives Korean data for generating a vocabulary list; a first operation unit that performs a subword segmentation algorithm on the received Korean data and separates the words included in the Korean data into subwords in accordance with the algorithm; and a second operation unit that generates the vocabulary list by performing a regularization algorithm on the separated subwords.

The subword segmentation algorithm may be defined by the following equation.

Here, $c_{0:n}$ is the sequence of consecutive characters from position 0 to n, $\chi$ is the set of characters that follow $c_{0:n}$ in the Korean data, and $X$ denotes the Korean data.

The subword segmentation algorithm may separate each word included in the received Korean data into a left subword and a right subword.

After separating the words included in the Korean data, the first operation unit may, when a right subword exists, repeatedly perform the subword segmentation algorithm on that right subword.

The regularization algorithm may be defined by the following equation.

Here, $w_i$ is the i-th word in the word set $W$, $s_j$ is the j-th subword separated from $w_i$, and $s$ is the subword whose regScore value is to be computed.

The second operation unit may delete subwords in descending order of regScore value, up to a preset ratio.

The second operation unit may repeat the subword deletion until the vocabulary list reaches a preset number of words.

Because the method and apparatus for generating a vocabulary list according to an embodiment of the present invention separate subwords in consideration of the linguistic characteristics of Korean, they can improve the performance of a Korean-based neural network language model compared with methods and apparatuses that use existing subword segmentation algorithms.

In addition, because the method and apparatus for generating a vocabulary list according to an embodiment of the present invention regularize subwords in consideration of the correlation between subwords, they can improve the performance of a Korean-based neural network language model compared with methods and apparatuses that use existing subword regularization algorithms.

In addition, the method and apparatus for generating a vocabulary list according to an embodiment of the present invention have the advantage of being applicable to Korean language processing services.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a conceptual diagram of a method for generating a vocabulary list according to an embodiment of the present invention and of a use case in which a question answering system based on a neural network language model is operated using the generated vocabulary list.
FIG. 2 is a block diagram schematically illustrating an apparatus for generating a vocabulary list according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method for generating a vocabulary list according to an embodiment of the present invention.
FIG. 4 is a flowchart schematically illustrating a method for generating a vocabulary list according to an embodiment of the present invention.
FIGS. 5A and 5B are graphs comparing the performance of three algorithms in an experimental example.

The specific structural or functional descriptions of the embodiments of the present invention disclosed in this specification or application are given only for the purpose of describing embodiments according to the present invention. Embodiments according to the present invention may be implemented in various forms and should not be construed as limited to the embodiments described in this specification or application.

Since embodiments according to the present invention can be modified in various ways and can take various forms, specific embodiments are illustrated in the drawings and described in detail in this specification or application. This is not intended to limit embodiments according to the concept of the present invention to a specific form of disclosure; they should be understood to include all changes, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

In this specification, terms such as first and/or second are used only to distinguish one component from another. That is, the components are not to be limited by these terms.

In this specification, components, features, and steps referred to with the expression 'include' mean that those components, features, and steps exist; the expression is not intended to exclude one or more other components, features, steps, or their equivalents.

In this specification, a singular form includes the plural unless it is specifically stated to be singular. That is, a component mentioned in this specification may imply the presence or addition of one or more other components.

Unless otherwise defined, all terms used in this specification, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs.

That is, terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless explicitly so defined in this specification.

Hereinafter, the present invention will be described in detail by describing preferred embodiments with reference to the accompanying drawings.

First, to explain the term subword: a subword is a run of consecutive characters that can be separated from a word. For example, the word '서울에서' ('in Seoul') can be represented by subword sets such as {'서울', '에서'}, {'서', '울에', '서'}, and {'서', '울에서'}. In this way, a part of a word, vocabulary item, or eojeol (space-delimited unit) can be defined as a subword.
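To make the definition concrete, the short sketch below enumerates every way of cutting a word into contiguous pieces. The function name `all_segmentations` is hypothetical and is used here only for illustration; it is not part of the described method.

```python
from itertools import combinations

def all_segmentations(word):
    """Return every way to cut `word` into contiguous, non-empty pieces."""
    n = len(word)
    results = []
    # Any subset of the n-1 gaps between characters can serve as cut points.
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            results.append([word[a:b] for a, b in zip(bounds, bounds[1:])])
    return results

segs = all_segmentations("서울에서")
for seg in segs:
    print(seg)
```

A word of n characters has 2^(n-1) such segmentations, which is why a scoring rule is needed to choose among them.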

FIG. 1 is a conceptual diagram of a method for generating a vocabulary list according to an embodiment of the present invention and of a use case in which a question answering system based on a neural network language model is operated using the generated vocabulary list.

Referring to FIG. 1, the vocabulary list generation method is applied to a Korean dataset to separate subwords, which are then regularized to generate the vocabulary list. Here, the Korean dataset contains only plain text on which no preprocessing, such as part-of-speech tagging or named entity recognition, has been performed.

After the vocabulary list has been generated, text input to the neural network language model goes through a tokenization process: it is separated into subwords based on the generated vocabulary list, and each subword is replaced by the number assigned to it in the vocabulary list. In the question answering use case of FIG. 1, the input question and passage text are tokenized and fed to the neural network language model, the model predicts the position of the answer within the passage, and the user obtains the answer from the predicted position.
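The tokenization step can be sketched as follows. The text does not specify how pieces are matched against the vocabulary list, so greedy longest-match from the left is an assumption here, as are the `tokenize` name, the `[UNK]` entry, and the toy vocabulary.

```python
def tokenize(text, vocab, unk="[UNK]"):
    """Map whitespace-separated words to vocabulary ids by greedy longest match."""
    ids = []
    for word in text.split():
        start = 0
        while start < len(word):
            # Try the longest piece first, shrinking until a vocabulary hit.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in vocab:
                    ids.append(vocab[piece])
                    start = end
                    break
            else:
                # No known piece starts here: emit the unknown id for one character.
                ids.append(vocab[unk])
                start += 1
    return ids

# Toy vocabulary list: subword -> assigned number (made up for illustration).
vocab = {"[UNK]": 0, "서울": 1, "에서": 2, "만나요": 3}
print(tokenize("서울에서 만나요", vocab))  # [1, 2, 3]
```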

FIG. 2 is a block diagram schematically illustrating a vocabulary list generating apparatus 100 according to an embodiment of the present invention.

Referring to FIG. 2, the vocabulary list generating apparatus 100 according to an embodiment of the present invention may include a data receiving unit 110, a first operation unit 120, and a second operation unit 130.

The data receiving unit 110 may receive Korean data containing the Korean words from which the vocabulary list is generated. After receiving the Korean data, the data receiving unit 110 may transmit it to the first operation unit 120.

In an embodiment, the first operation unit 120 may take as input the Korean data received by the data receiving unit 110 and apply the subword segmentation algorithm to the words included in the Korean data to separate them into subwords. The subword segmentation algorithm may separate each word into a left subword and a right subword. For example, '아인슈타인' (Einstein) may be separated into 아인슈타 (left) + 인 (right), and '아인슈타인은' (Einstein with the topic particle 은) into 아인슈타 (left) + 인은 (right).

Prior-art subword segmentation algorithms include the cohesion score algorithm and the branching entropy algorithm.

The cohesion score algorithm is a subword segmentation algorithm proposed for Korean. Since various suffixes and postpositional particles can follow a noun, it separates words on the assumption that the characters of the left subword are strongly associated with one another. That is, the stronger the association among consecutive characters, the more likely they are to form a word, and the algorithm exploits this. In Korean, nouns, verbs, and adjectives mostly appear on the left and particles with grammatical roles mostly appear on the right, so this algorithm can be well suited to Korean.

The cohesion score algorithm may be defined as in Equation 1 below.

Here, $c_{0:n}$ denotes the sequence of consecutive characters from position 0 to n. For example, for the word '노란색의' ('of yellow'), $c_{0:1}$ is '노', $c_{0:2}$ is '노란', and $c_{0:3}$ is '노란색'. $P(A \mid B)$ denotes the conditional probability that A occurs given B.
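As an illustration of the cohesion score, the sketch below assumes the commonly used form $\left(\prod_{i=1}^{n-1} P(c_{0:i+1} \mid c_{0:i})\right)^{1/(n-1)}$, with each conditional probability estimated as a ratio of prefix counts. The exact form of Equation 1 is not reproduced in this text, so this formula, the function names, and the toy corpus are all assumptions.

```python
from collections import Counter

def prefix_counts(words):
    """Count how often each prefix occurs at the start of a word."""
    counts = Counter()
    for w in words:
        for n in range(1, len(w) + 1):
            counts[w[:n]] += 1
    return counts

def cohesion(prefix, counts):
    """Geometric mean of P(c[0:i+1] | c[0:i]) over the prefix (assumed form)."""
    n = len(prefix)
    if n < 2 or counts[prefix] == 0:
        return 0.0  # undefined for a single character or an unseen prefix
    prob = 1.0
    for i in range(1, n):
        prob *= counts[prefix[:i + 1]] / counts[prefix[:i]]
    return prob ** (1.0 / (n - 1))

# Toy corpus (made up): '노란색' followed by different particles.
words = ["노란색의", "노란색은", "노란색이", "노란", "노래"]
counts = prefix_counts(words)
for p in ["노란", "노란색", "노란색의"]:
    print(p, round(cohesion(p, counts), 4))
```

Because only the observed continuation is counted, the score says nothing about how many different right subwords can follow a given prefix.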

However, the cohesion score algorithm has the drawback that its accuracy drops sharply as word frequency decreases, and it does not consider the other right subwords that may follow the left subword. For example, given the word '노란색의', it considers only '노란색' and '의', and ignores other right subwords that can follow '노란색', such as the particles '은', '이', '을', and '과'.

The branching entropy algorithm was proposed for Chinese. It exploits the fact that the more uncertain the character following a run of consecutive characters is, the more likely that position is a boundary. For example, in the word 'naturalize', after the subword 'natura' one can easily predict that 'l' will follow, but after 'natural' it is hard to predict what will follow. With this algorithm, 'naturalize' is therefore more likely to be separated into 'natural' and 'ize' than into 'natura' and 'lize'.

The branching entropy algorithm may be defined as in Equation 2 below.

Here, Equation 2 computes the branching entropy at $c_{0:n}$, and $\chi$ denotes the set of characters that follow $c_{0:n}$ in the Korean data. For example, for '아인슈타인' (Einstein) and '아인슈타이늄' (einsteinium), $c_{0:n}$ is '아인슈타' and $\chi$ = {인, 이}. $X$ denotes the Korean data.
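A minimal sketch of branching entropy follows, assuming the usual Shannon form $-\sum_{x \in \chi} P(x \mid c_{0:n}) \log P(x \mid c_{0:n})$ with the natural logarithm. Equation 2 itself is not reproduced in this text, so the formula, the log base, and the function name are assumptions.

```python
import math
from collections import Counter

def branching_entropy(prefix, words):
    """Entropy of the character that follows `prefix` in the corpus words."""
    followers = Counter(
        w[len(prefix)] for w in words
        if w.startswith(prefix) and len(w) > len(prefix)
    )
    total = sum(followers.values())
    if total == 0:
        return 0.0  # the prefix is never followed by anything
    return -sum((c / total) * math.log(c / total) for c in followers.values())

words = ["아인슈타인", "아인슈타이늄"]
print(branching_entropy("아인슈", words))    # only '타' follows: no uncertainty
print(branching_entropy("아인슈타", words))  # '인' or '이' follows: log 2
```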

However, the branching entropy algorithm is suited to languages such as Chinese, in which each character carries meaning and a particular character is therefore likely to follow another. It is difficult to apply to Korean, where meaning arises from combinations of characters rather than from the characters themselves.

Accordingly, to compensate for the drawbacks of the two algorithms, the method for generating a vocabulary list according to an embodiment of the present invention uses a new algorithm that multiplies the two. The proposed algorithm may be defined as in Equation 3 below.

Here, $c_{0:n}$ is the sequence of consecutive characters from position 0 to n, and $\chi$ is the set of characters that follow $c_{0:n}$ in the Korean data. $X$ denotes the Korean data.

Using the algorithm of Equation 3, the values for all prefixes from $c_{0:1}$ to $c_{0:n}$ are computed, and the position with the highest value is taken as the split point. For example, if the value for $c_{0:4}$ is the highest for the word '아인슈타이늄' (einsteinium), the word is split into '아인슈타' and '이늄'.
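The split-point selection can be sketched as below: score every prefix by the product of a cohesion score and a branching entropy, then cut at the highest-scoring position. The two component formulas are the assumed forms (geometric-mean cohesion over prefix counts, Shannon entropy of the next character), and the corpus and the `best_split` name are made up for illustration.

```python
import math
from collections import Counter

def cohesion(prefix, counts):
    # Assumed form: geometric mean of P(c[0:i+1] | c[0:i]) from prefix counts.
    n = len(prefix)
    if n < 2 or counts[prefix] == 0:
        return 0.0
    prob = 1.0
    for i in range(1, n):
        prob *= counts[prefix[:i + 1]] / counts[prefix[:i]]
    return prob ** (1.0 / (n - 1))

def branching_entropy(prefix, words):
    # Assumed form: Shannon entropy of the character following the prefix.
    followers = Counter(w[len(prefix)] for w in words
                        if w.startswith(prefix) and len(w) > len(prefix))
    total = sum(followers.values())
    return -sum((c / total) * math.log(c / total) for c in followers.values())

def best_split(word, words):
    """Return (left, right), cutting where cohesion * entropy is largest."""
    counts = Counter(w[:n] for w in words for n in range(1, len(w) + 1))
    best_n, best_score = len(word), 0.0  # default: leave the word unsplit
    for n in range(1, len(word) + 1):
        prefix = word[:n]
        score = cohesion(prefix, counts) * branching_entropy(prefix, words)
        if score > best_score:
            best_n, best_score = n, score
    return word[:best_n], word[best_n:]

corpus = ["아인슈타인", "아인슈타이늄", "아인슈타인은", "아인슈타인의"]
print(best_split("아인슈타이늄", corpus))  # ('아인슈타', '이늄')
```

In this toy corpus every prefix shorter than '아인슈타' has a single possible continuation, so its entropy, and therefore its product score, is zero.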

By using the product of the cohesion score and the branching entropy, words can be separated by considering not only word frequency but also the number of possible following characters, so subwords can be separated while compensating for the drawbacks of the two existing algorithms.

In an embodiment, when a right subword exists, the subword segmentation algorithm may be performed again on that right subword. That is, the subword segmentation algorithm is repeated until no right subword remains.

For example, if the word '대한민국만세' ('Long live the Republic of Korea') is separated into '대한' and '민국만세' by the subword segmentation algorithm, the algorithm may be performed again on the right subword '민국만세'. Applying the algorithm to '민국만세' separates it into '민국' and '만세', so that '대한민국만세' is finally separated into '대한', '민국', and '만세'.
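The repetition on the right subword can be sketched as a simple loop. Here `split_once` stands in for the score-based split described above; the lookup table of split decisions is hypothetical and only reproduces the '대한민국만세' example.

```python
def segment(word, split_once):
    """Split off left subwords repeatedly until no right subword remains."""
    pieces = []
    rest = word
    while rest:
        left, right = split_once(rest)
        if not left:
            raise ValueError("split_once must return a non-empty left subword")
        pieces.append(left)
        rest = right
    return pieces

# Hypothetical split decisions standing in for the score-based split point.
toy_splits = {"대한민국만세": ("대한", "민국만세"), "민국만세": ("민국", "만세")}

def toy_split_once(word):
    # Words without a recorded split are kept whole (empty right subword).
    return toy_splits.get(word, (word, ""))

print(segment("대한민국만세", toy_split_once))  # ['대한', '민국', '만세']
```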

Through this process, all words included in the Korean data are separated into subwords, from which a single vocabulary list can be built.

In an embodiment, the second operation unit 130 may perform an algorithm that regularizes the subwords separated by the first operation unit 120. That is, it reduces the subword vocabulary list to the size desired by the user. Because the prediction rate falls as the vocabulary list grows, the aim is to reduce the size of the vocabulary list efficiently.

A subword regularization algorithm reduces the size of the vocabulary list by taking the redundancy of subwords into account. The unigram language model, one of the existing regularization algorithms, may be defined as in Equation 4 below.

Here, $V$ is the set of vocabulary items, $x_i$ is the i-th subword separated from $V$, $x$ is the sequence of consecutive subwords $\{x_1, \ldots, x_M\}$, and $p(x_i)$ is the occurrence probability of the subword $x_i$.

The occurrence probability of a subword $x_i$ is the number of occurrences of $x_i$ divided by the total number of subwords in the dataset. For example, if the dataset contains 10,000 subwords and $x_i$ occurs 10 times, $p(x_i)$ is 10/10000 = 0.001.

The existing regularization algorithm performs regularization using the product of subword probabilities, which treats the subwords as independent of one another. That is, the computation assumes independence and takes no account of the correlation between subwords.

In this case, the number of out-of-vocabulary words, that is, words that cannot be expressed with the subwords in the vocabulary list, increases. In actual deep learning training, a vocabulary list storing subwords is used, and since a larger vocabulary list requires more memory, a list of fixed size is used. Consequently, only some of the many possible subwords are used, and a word that cannot be expressed with the subwords in the vocabulary list is called out-of-vocabulary. For example, the word '스크림' ('scream') is out-of-vocabulary when the vocabulary list contains no subwords, such as '스크', '림', '스', '크림', or '스크림', that can together express it.
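Whether a word is out-of-vocabulary in this sense can be checked with a small dynamic program over cut positions. The sketch below is an illustration only; the function name `can_represent` and the toy vocabularies are assumptions.

```python
def can_represent(word, vocab):
    """True if `word` can be written as a concatenation of subwords in `vocab`."""
    n = len(word)
    # reachable[k] is True when word[:k] can be built from vocabulary subwords.
    reachable = [True] + [False] * n
    for end in range(1, n + 1):
        reachable[end] = any(
            reachable[start] and word[start:end] in vocab
            for start in range(end)
        )
    return reachable[n]

print(can_represent("스크림", {"스", "크림"}))    # True: '스' + '크림'
print(can_represent("스크림", {"아이", "크림"}))  # False: out-of-vocabulary
```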

To compensate for this drawback of the existing algorithm, the method for generating a vocabulary list according to an embodiment of the present invention uses an algorithm based on the mutual information formula, which measures the mutual dependence between variables. In information theory, mutual information indicates how closely two events are related. In other words, the method uses an algorithm that takes the association between partial words into account.

Because a word may contain two or more partial words, the normalization algorithm of the method for generating a vocabulary list according to an embodiment of the present invention uses a multivariate mutual information formula. Because the partial words appear in order, the formula can be defined, taking two variables x and y as an example, as in Equation 5 below.

Here, p(x) is the probability of x, and p(x, y) is defined by an expression that is missing from this text.
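Equation 5 is not reproduced in this text, so the following sketch uses the standard two-variable pointwise mutual information, log(p(x, y) / (p(x) p(y))), as an assumed stand-in. It treats p(x, y) as the probability of x and y occurring together, which is also an assumption, and it may differ from the patent's exact formula.

```python
import math


def pointwise_mutual_information(p_x, p_y, p_xy):
    """Standard pointwise mutual information of two events (assumed form, not Equation 5)."""
    return math.log(p_xy / (p_x * p_y))


# If x and y were independent, p(x, y) would equal p(x) * p(y) and the value is about 0.
independent = pointwise_mutual_information(0.01, 0.02, 0.01 * 0.02)

# If x and y co-occur far more often than independence predicts, the value is positive.
associated = pointwise_mutual_information(0.01, 0.02, 0.005)

print(independent, associated)
```

A value near zero corresponds to the independence assumption of the existing algorithm, while a large positive value indicates strongly associated partial words.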

The normalization algorithm of the method for generating a vocabulary list according to an embodiment of the present invention may be defined as in Equation 6 below.

Here, $w_i$ is the i-th word in the word set W, $s_{i,j}$ is the j-th partial word separated from $w_i$, and $s$ is the partial word whose regScore value is to be obtained.

However, in Equation 6 the denominator inside the logarithm multiplies together the occurrence probabilities of partial words, each of which is very small, so the MI (mutual information) value may fall outside the range that a computer can represent as a double. To address this, the normalization algorithm may be defined as in Equation 7 below, which replaces the multiplication with addition and thereby avoids exceeding the range of double values.
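The numerical issue can be seen with a small experiment. This sketch illustrates the general technique of replacing a product of small probabilities with a sum of logarithms; it is not a reconstruction of Equation 7, and the probability values are made up.

```python
import math

# 100 partial-word probabilities, each very small (illustrative values).
probabilities = [1e-5] * 100

# Multiplying them directly underflows: the true value 1e-500 is below the
# smallest positive double, so the product collapses to 0.0.
product = math.prod(probabilities)

# Summing logarithms keeps the same information in a comfortably representable range.
log_sum = sum(math.log(p) for p in probabilities)

print(product)
print(log_sum)
```

Taking the logarithm of `product` would fail because it is zero, whereas `log_sum` remains a finite number that can be compared across words.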

In an embodiment, partial words with large regScore values may be deleted to reduce the size of the vocabulary list. Partial words may be deleted in descending order of regScore value, up to a preset ratio; for example, the top 20% of partial words by regScore value may be deleted. A high regScore value means low mutual information, that is, a weak association between partial words, so deleting partial words with large regScore values reduces the size of the vocabulary list.

The way to obtain the regScore value of a partial word $s$ is now described with a concrete example. In Equation 7, the regScore value of $s$ is computed from the sum of the MI values of the words $w_i$ that contain $s$. For example, when the word set W contains the words '대한민국', '대한민주주의공화국', and '대한사람', regScore(대한) is computed from the sum of the MI values of the words that contain the partial word '대한', that is, from MI(대한민국) + MI(대한민주주의공화국) + MI(대한사람). In contrast, regScore(민국) is computed from MI(대한민국) alone.
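The aggregation step in this example can be sketched as follows. The sketch covers only the summation of MI values over the words that contain a partial word; how Equation 7 turns that sum into the final regScore is not shown. The segmentations and MI values below are made-up placeholders, and `mi_sum_for_partial_word` is a hypothetical helper name.

```python
def mi_sum_for_partial_word(partial_word, segmentations, word_mi):
    """Sum the MI values of every word whose segmentation contains `partial_word`."""
    return sum(
        word_mi[word]
        for word, parts in segmentations.items()
        if partial_word in parts
    )


# Hypothetical segmentations of the three words from the example.
segmentations = {
    "대한민국": ["대한", "민국"],
    "대한민주주의공화국": ["대한", "민주주의", "공화국"],
    "대한사람": ["대한", "사람"],
}

# Placeholder MI values per word (not from the patent).
word_mi = {"대한민국": 1.5, "대한민주주의공화국": 0.5, "대한사람": 2.0}

print(mi_sum_for_partial_word("대한", segmentations, word_mi))
print(mi_sum_for_partial_word("민국", segmentations, word_mi))
```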

In an embodiment, the process of deleting partial words in descending order of regScore value may be repeated until the preset vocabulary list size is reached. For example, if the number of words in the vocabulary list is preset to D, the process of deleting the top 20% of partial words by regScore value may be repeated until the number of words in the vocabulary list becomes D.

If the vocabulary list were reduced to the preset number of words in a single pass, overfitting could occur. Therefore, instead of deleting partial words all at once in descending order of regScore value, the process of deleting a fixed ratio of partial words is repeated.
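The repeated deletion can be sketched as a loop. The sketch assumes that scores are recomputed on the remaining vocabulary in every round through a caller-supplied `score_fn`, and that the last round is capped so the list ends at exactly D entries; both are assumptions, since the text does not spell out these details. All names are hypothetical.

```python
import math


def prune_vocab(vocab, score_fn, target_size, ratio=0.2):
    """Repeatedly delete the highest-scoring share of partial words until target_size remain."""
    vocab = set(vocab)
    while len(vocab) > target_size:
        scores = score_fn(vocab)  # regScore-like value for each remaining partial word
        n_drop = max(1, math.floor(len(vocab) * ratio))
        n_drop = min(n_drop, len(vocab) - target_size)  # do not go below the target
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(vocab, key=lambda word: (-scores[word], word))
        vocab -= set(ranked[:n_drop])
    return vocab


# Toy scores: later letters get larger scores and are therefore deleted first.
toy_scores = {letter: index for index, letter in enumerate("abcdefghij")}
pruned = prune_vocab(toy_scores.keys(), lambda v: {w: toy_scores[w] for w in v}, target_size=5)
print(sorted(pruned))
```

Deleting a fixed share per round, rather than everything at once, mirrors the gradual reduction that the text motivates by the risk of overfitting.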

FIG. 3 is a flowchart illustrating a method of generating a vocabulary list according to an embodiment of the present invention.

Referring to FIG. 3, the method for generating a vocabulary list according to an embodiment of the present invention may include receiving Korean data for generating a vocabulary list (S301), separating the words included in the Korean data into partial words by performing a partial word separation algorithm on the received Korean data (S303), and generating the vocabulary list by performing a normalization algorithm on the separated partial words (S305).

In the step of receiving Korean data (S301), the data receiving unit 110 receives Korean data containing Korean words. After receiving the Korean data, the data receiving unit 110 may transmit it to the first operation unit 120.

In the step of separating the words included in the Korean data into partial words by performing a partial word separation algorithm on the received Korean data (S303), the first operation unit 120 receives the Korean data and then performs the partial word separation algorithm on the words included in it. The first operation unit 120 may separate partial words using the algorithm of Equation 3 above.

In the step of generating the vocabulary list by performing a normalization algorithm on the separated partial words (S305), the second operation unit 130 receives the partial words separated by the first operation unit 120 and then generates the vocabulary list by normalizing the partial words using the normalization algorithm.

The second operation unit 130 may normalize the partial words using the algorithm of Equation 7 above, reducing the size of the vocabulary list to generate the final vocabulary list.

FIG. 4 is a flowchart schematically illustrating a method of generating a vocabulary list according to an embodiment of the present invention.

The first operation unit 120 may use the partial word separation algorithm to separate the words included in the Korean data into a left partial word and a right partial word. For example, the input word '대한독립만세' may be separated into '대한' and '독립만세'.

In an embodiment, when a right partial word exists, the first operation unit 120 may repeat the step of performing the partial word separation algorithm on the right partial word. That is, the right partial word may be separated again into a left partial word and a right partial word until it can no longer be separated. For example, once the word has been separated into '대한' and '독립만세', the right partial word '독립만세' is separated again into '독립' and '만세', so that the word is finally separated into the partial words '대한', '독립', and '만세'.
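The repeated left/right separation can be sketched as a loop over the right partial word. The actual split decision comes from Equation 3, which is not reproduced here, so the sketch takes it as a caller-supplied function; `segment`, `toy_split_once`, and the lookup table are hypothetical stand-ins.

```python
from typing import Callable, List, Optional, Tuple

# A split function returns (left, right), or None when the input cannot be separated further.
SplitFn = Callable[[str], Optional[Tuple[str, str]]]


def segment(word: str, split_once: SplitFn) -> List[str]:
    """Keep the left part of each split and re-split the right part until no split remains."""
    parts: List[str] = []
    rest = word
    while True:
        result = split_once(rest)
        if result is None:
            parts.append(rest)
            return parts
        left, right = result
        parts.append(left)
        rest = right


# Toy stand-in for the separation algorithm: a fixed lookup table.
TOY_SPLITS = {
    "대한독립만세": ("대한", "독립만세"),
    "독립만세": ("독립", "만세"),
}


def toy_split_once(word: str) -> Optional[Tuple[str, str]]:
    return TOY_SPLITS.get(word)


print(segment("대한독립만세", toy_split_once))
```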

The separated partial words are transmitted to the second operation unit 130, which applies the normalization algorithm to them and deletes the partial words with low mutual information to generate the final vocabulary list.

In an embodiment, the second operation unit 130 may use the normalization algorithm of Equation 7 above to delete a preset ratio of the partial words with large regScore values.

The second operation unit 130 may also repeat the process of deleting partial words until the preset vocabulary list size is reached. For example, if the number of words in the vocabulary list is preset to D, the process of deleting the top 20% of partial words by regScore value may be repeated until the number of words in the vocabulary list becomes D.

The method for generating a vocabulary list according to an embodiment of the present invention may also be implemented as a computer-readable recording medium on which a program for executing the method on a computer is recorded. The computer-readable recording medium may be any available medium that can be accessed by a computer, and may include volatile and nonvolatile media as well as removable and non-removable media. The computer-readable recording medium may also include a computer storage medium. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The exemplary modules, steps, or combinations thereof relating to the embodiments described herein may be implemented by electronic hardware (a digital design produced by coding or the like), software (various forms of applications including program instructions), or a combination of these. Whether they are implemented in hardware and/or software may depend on the design constraints imposed on the user terminal.

One or more of the configurations described herein may be stored in a memory as computer program instructions, and these instructions may execute the methods described herein, centered on a digital signal processor. The connections between configurations specified with reference to the accompanying drawings are merely examples; at least some of them may be omitted, and additional configurations may be included alongside them.

Experimental Example: Performance Evaluation Experiment

The performance of the BPE (byte pair encoding) algorithm, the Unigram language model algorithm, and the vocabulary list generation method proposed by the present invention was compared. BERT (Bidirectional Encoder Representations from Transformers) was used as the neural network language model for comparing the three algorithms.

The BPE algorithm separates words based on frequency without considering the structure of the language, and the Unigram algorithm separates words in units of unigrams.

Table 1 below compares performance after a vocabulary list was generated with each algorithm and a BERT model was trained on it.

Algorithm | Vocabulary list size | Masked word prediction accuracy | Fine-Tuning EM (%) | Fine-Tuning F1 (%)
BPE | 30,000 | 0.509 | 42.31 | 80.77
BPE | 40,000 | 0.496 | 41.27 | 80.11
BPE | 50,000 | 0.455 | 40.82 | 80.02
Unigram | 30,000 | 0.507 | 49.77 | 81.18
Unigram | 40,000 | 0.487 | 49.52 | 81.05
Unigram | 50,000 | 0.492 | 49.86 | 81.33
Proposed algorithm | 30,000 | 0.609 | 53.12 | 81.78
Proposed algorithm | 40,000 | 0.606 | 52.73 | 81.54
Proposed algorithm | 50,000 | 0.593 | 52.50 | 81.63

Masked word prediction is the task, in the pre-training stage, of predicting the partial word that fills a blank, and Fine-Tuning (question answering) is the task of finding the answer to a question in a passage.

The EM (exact match) of Fine-Tuning indicates whether the predicted answer exactly matches the actual answer. For example, when the actual answer is '1990년대 말', EM is 0 if the prediction is '1990년대' and 100% if the prediction is '1990년대 말'.

The F1 score of Fine-Tuning indicates how closely the predicted answer matches the actual answer. For example, when the actual answer is '1990년대 말' and the prediction is '1990년대', 6 of the 7 syllables match, giving a match rate of 86%.
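The two worked examples can be reproduced with a few lines. `matched_fraction` below computes the share of the actual answer's characters that also appear in the prediction, which is how the 6-of-7 figure in the example arises; it is a simplification for illustration, and a standard F1 score would additionally take precision into account. The function names are hypothetical, and spaces are ignored when counting syllables.

```python
from collections import Counter


def exact_match(prediction, answer):
    """1.0 when the prediction is identical to the actual answer, otherwise 0.0."""
    return float(prediction == answer)


def matched_fraction(prediction, answer):
    """Share of the answer's characters (spaces ignored) that the prediction also contains."""
    predicted_chars = Counter(prediction.replace(" ", ""))
    answer_chars = Counter(answer.replace(" ", ""))
    overlap = sum((predicted_chars & answer_chars).values())
    return overlap / sum(answer_chars.values())


answer = "1990년대 말"
print(exact_match("1990년대", answer))       # prediction is missing "말"
print(exact_match("1990년대 말", answer))    # identical strings
print(round(matched_fraction("1990년대", answer) * 100))
```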

A Korean Wikipedia dump was used as the pre-training dataset, and the KorQuAD dataset was used as the Fine-Tuning dataset.

FIGS. 5A and 5B are graphs comparing the performance of the three algorithms based on Table 1 above.

Comparing BPE, Unigram, and the algorithm proposed in the present invention, Masked word prediction accuracy improved by about 10 percentage points over the other algorithms (roughly 0.6 versus roughly 0.5).

For Fine-Tuning, EM improved by about 3% and the F1 score by about 1% compared with the other algorithms.
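The stated improvements can be checked against Table 1. The sketch below transcribes the table values and computes, for each vocabulary size, the gap between the proposed algorithm and the better of the two baselines; the data structure and function name are illustrative choices, not part of the patent.

```python
# Table 1 values: algorithm -> vocabulary size -> (accuracy, EM %, F1 %).
TABLE_1 = {
    "BPE": {30000: (0.509, 42.31, 80.77), 40000: (0.496, 41.27, 80.11), 50000: (0.455, 40.82, 80.02)},
    "Unigram": {30000: (0.507, 49.77, 81.18), 40000: (0.487, 49.52, 81.05), 50000: (0.492, 49.86, 81.33)},
    "Proposed": {30000: (0.609, 53.12, 81.78), 40000: (0.606, 52.73, 81.54), 50000: (0.593, 52.50, 81.63)},
}
METRICS = ("accuracy", "EM", "F1")


def gain_over_best_baseline(size, metric_index):
    """Proposed-algorithm value minus the better of the BPE and Unigram values."""
    best_baseline = max(TABLE_1[name][size][metric_index] for name in ("BPE", "Unigram"))
    return TABLE_1["Proposed"][size][metric_index] - best_baseline


for size in (30000, 40000, 50000):
    gains = {metric: round(gain_over_best_baseline(size, i), 3) for i, metric in enumerate(METRICS)}
    print(size, gains)
```

Measuring against the stronger baseline gives the most conservative reading of the gains; the gap to BPE alone on EM is considerably larger.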

It was therefore confirmed that, compared with the other algorithms, the algorithm proposed in the present invention is better at inferring the word that fills a blank from the context of the surrounding partial words, and that it also performs better in question answering. In other words, the prediction accuracy of a neural network language model can be improved solely by improving the vocabulary list generation method, without modifying the structure of the model.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments disclosed in the present invention are intended to describe, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within an equivalent scope should be interpreted as falling within the scope of the present invention.

100: vocabulary list generation device
110: data receiving unit
120: first operation unit
130: second operation unit

Claims

(a) receiving Korean data for generating a vocabulary list;
(b) separating words included in the received Korean data into partial words by performing a partial word separation algorithm on the received Korean data; and
(c) generating the vocabulary list by performing a normalization algorithm on the separated partial words,
Vocabulary list generation method for Korean-based neural network language model.

The method of claim 1,
In step (b),
The partial word separation algorithm is defined by the following equation,
Vocabulary list generation method for Korean-based neural network language model.

Here, $c_{0:n}$ is the consecutive character set from 0 to n, $c_{n+1}$ is the character set that follows $c_{0:n}$ in the Korean data, and X is the Korean data.

The method of claim 1,
The step (b),
Separating words included in the received Korean data into left partial words and right partial words using the partial word separation algorithm,
Vocabulary list generation method for Korean-based neural network language model.

The method of claim 3,
The step (b),
When a right partial word exists,
Further comprising repeating the step of performing the partial word separation algorithm on the right partial word,
Vocabulary list generation method for Korean-based neural network language model.

The method of claim 1,
In step (c),
The normalization algorithm is defined by the following equation,
Vocabulary list generation method for Korean-based neural network language model.

Here, $w_i$ is the i-th word in the word set W, $s_{i,j}$ is the j-th partial word separated from $w_i$, and $s$ is the partial word whose regScore value is to be obtained.

The method of claim 5,
The step (c),
Further comprising the step of deleting the partial words by a preset ratio in descending order of regScore value,
Vocabulary list generation method for Korean-based neural network language model.

The method of claim 6,
The step (c),
Further comprising repeating the step of deleting the partial words until a preset number of words in the vocabulary list is satisfied,
Vocabulary list generation method for Korean-based neural network language model.

A computer-readable recording medium recording a program for performing the method according to any one of claims 1 to 7 on a computer.

A data receiving unit for receiving Korean language data to generate a vocabulary list;
A first operation unit that performs a partial word separation algorithm on the received Korean data and separates words included in the Korean data into partial words according to the algorithm; and
Comprising a second operation unit for generating the vocabulary list by performing a normalization algorithm on the separated partial words,
Vocabulary list generation device for Korean-based neural network language model.

The device of claim 9,
The partial word separation algorithm is defined by the following equation,
Vocabulary list generation device for Korean-based neural network language model.

Here, $c_{0:n}$ is the consecutive character set from 0 to n, $c_{n+1}$ is the character set that follows $c_{0:n}$ in the Korean data, and X is the Korean data.

The device of claim 9,
The partial word separation algorithm,
Separates the words included in the received Korean data into a left partial word and a right partial word,
Vocabulary list generation device for Korean-based neural network language model.

The device of claim 11,
The first operation unit, in separating the words included in the Korean data,
When a right partial word exists,
Repeatedly performs the partial word separation algorithm on the right partial word,
Vocabulary list generation device for Korean-based neural network language model.

The device of claim 9,
The normalization algorithm is defined by the following equation,
Vocabulary list generation device for Korean-based neural network language model.

Here, $w_i$ is the i-th word in the word set W, $s_{i,j}$ is the j-th partial word separated from $w_i$, and $s$ is the partial word whose regScore value is to be obtained.

The device of claim 13,
The second operation unit,
Deletes the partial words by a preset ratio in descending order of regScore value,
Vocabulary list generation device for Korean-based neural network language model.

The device of claim 14,
The second operation unit,
Repeatedly performs the deletion of partial words until a preset number of words in the vocabulary list is reached,
Vocabulary list generation device for Korean-based neural network language model.