KR102354898B1

KR102354898B1 - Vocabulary list generation method and device for Korean based neural network language model

Info

Publication number: KR102354898B1
Application number: KR1020190159637A
Authority: KR
Inventors: 허의남; 김만수
Original assignee: 경희대학교 산학협력단
Priority date: 2019-05-29
Filing date: 2019-12-04
Publication date: 2022-01-24
Also published as: KR20200137936A

Abstract

The present invention relates to a method and apparatus for generating a vocabulary list for a Korean-based neural network language model, using algorithms suited to the characteristics of Korean.
A vocabulary list generating apparatus according to an embodiment of the present invention may include: a data receiving unit that receives Korean data for generating a vocabulary list; a first operation unit that performs a subword segmentation algorithm on the received Korean data and separates the words contained in the Korean data into subwords according to the algorithm; and a second operation unit that generates the vocabulary list by performing a regularization algorithm on the separated subwords.

Description

Vocabulary list generation method and device for Korean based neural network language model

The present invention relates to a method and apparatus for generating a vocabulary list for a Korean-based neural network language model, and more particularly to a method and apparatus for generating a vocabulary list using algorithms that take the linguistic characteristics of Korean into account.

Neural network language models are used in many fields that require language understanding, such as machine translation, question answering, and named entity recognition. Such models can broadly be divided into character-level, word-level, and subword-level models, and depending on the method these units may also be mixed.

Characters input to a neural network language model are converted into vectors through a word embedding step; the resulting vectors are called word embedding vectors. Depending on the method, the word embedding step may be handled in one layer of the neural language model or in a separate model.

Here, the dimension of the word embedding vector depends on the size of the vocabulary list given as input. For example, the total number of Hangul characters such as '가', '나', and '다' is 11,117, and although the total number of words differs from dictionary to dictionary, about 500,000 words are currently listed in the Standard Korean Language Dictionary (표준국어대사전). A character-level word embedding vector therefore has 11,117 dimensions, while a word-level word embedding vector has about 500,000 dimensions.

However, the character level has too few dimensions to express the context of a language, while the word level has so many dimensions that an enormous amount of memory is required, which makes computation difficult. Accordingly, much research is under way on training neural network language models with a vocabulary list of a desired size built from subword units, and subword segmentation, which separates words into subword units, is being actively studied.

Existing subword segmentation methods can be divided into supervised and unsupervised methods. Supervised methods separate words using a morphological analyzer, which splits words into morphemes, the smallest units of a word. Their drawback is that a morphological analyzer can be built only when a very large data set with accurate morphological analysis is available. Unsupervised methods either ignore linguistic characteristics such as Korean postpositional particles (josa) and build the vocabulary list from subwords in descending order of frequency, or separate words with linguistic characteristics in mind but give no consideration to the criteria by which the vocabulary list should then be composed.

Subword regularization improves prediction accuracy by reducing the size of the vocabulary list and thereby narrowing the range of subwords to be predicted. Existing regularization methods, however, assume that consecutive subwords are independent and do not consider the correlation between subwords, which lowers the prediction rate of the neural network language model.

KR 10-2018-0001889 A, KR 10-2017-0108693 A, KR 10-2019-0046432 A

The present invention is intended to solve the problems described above, and an object of the present invention is to provide a subword segmentation algorithm that takes the linguistic characteristics of Korean into account.

Another object of the present invention is to provide a mutual information based subword regularization algorithm that measures the interdependence between subwords by taking the correlation between subwords into account.

The objects of the present invention are not limited to those described above, and other objects not mentioned will be clearly understood from the following description.

A method for generating a vocabulary list according to an embodiment of the present invention may include: (a) receiving Korean data for generating a vocabulary list; (b) performing a subword segmentation algorithm on the received Korean data to separate the words contained in the Korean data into subwords according to the algorithm; and (c) generating a vocabulary list by performing a regularization algorithm on the separated subwords.

In step (b), the subword segmentation algorithm may be defined by an equation (given as Equation 3 in the detailed description below) whose terms are the sequence of consecutive characters from position 0 to n, the set of characters that followed that sequence in the Korean data, and X, which denotes the Korean data.

Step (b) may be a step of separating the words contained in the received Korean data into a left subword and a right subword using the subword segmentation algorithm.

Step (b) may further include, when a right subword exists, repeating the subword segmentation algorithm on that right subword.

In step (c), the regularization algorithm may be defined by an equation (given as Equation 6 in the detailed description below) whose terms are the i-th word in the word set W, the j-th subword separated from that word, and the subword whose regScore value is to be computed.

Step (c) may further include deleting the subwords, by a preset ratio, in descending order of regScore value.

Step (c) may further include repeating the deletion of subwords until the vocabulary list reaches a preset number of words.
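The two deletion steps above can be sketched as a simple loop in Python. This is only an illustration under stated assumptions: the function name `prune_vocabulary`, the default ratio of 0.2, and the toy scores are hypothetical, and the regScore values are treated as fixed inputs rather than being recomputed after each round.

```python
def prune_vocabulary(scores: dict[str, float], target_size: int, ratio: float = 0.2) -> list[str]:
    """Repeatedly drop a fixed ratio of subwords, highest regScore first."""
    vocab = dict(scores)
    while len(vocab) > target_size:
        # Remove at least one subword, but never go below the target size.
        n_drop = min(max(1, int(len(vocab) * ratio)), len(vocab) - target_size)
        ranked = sorted(vocab, key=vocab.get, reverse=True)
        for subword in ranked[:n_drop]:
            del vocab[subword]
    # Return the surviving subwords, lowest score first.
    return sorted(vocab, key=vocab.get)


# Hypothetical regScore values for five subwords.
scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3, "e": 0.7}
print(prune_vocabulary(scores, target_size=2, ratio=0.4))  # ['b', 'd']
```

With these toy values the first round removes 'a' and 'e', and the second removes 'c', leaving the two lowest-scoring subwords.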

A computer-readable recording medium according to an embodiment of the present invention may record a program for executing the vocabulary list generation method on a computer.

A vocabulary list generating apparatus according to an embodiment of the present invention may include: a data receiving unit that receives Korean data for generating a vocabulary list; a first operation unit that performs a subword segmentation algorithm on the received Korean data and separates the words contained in the Korean data into subwords according to the algorithm; and a second operation unit that generates the vocabulary list by performing a regularization algorithm on the separated subwords.

The subword segmentation algorithm may be defined by an equation (given as Equation 3 in the detailed description below) whose terms are the sequence of consecutive characters from position 0 to n, the set of characters that followed that sequence in the Korean data, and X, which denotes the Korean data.

The subword segmentation algorithm may separate the words contained in the received Korean data into a left subword and a right subword.

After separating the words contained in the Korean data, the first operation unit may, when a right subword exists, repeat the subword segmentation algorithm on that right subword.

The regularization algorithm may be defined by an equation (given as Equation 6 in the detailed description below) whose terms are the i-th word in the word set W, the j-th subword separated from that word, and the subword whose regScore value is to be computed.

The second operation unit may delete the subwords, by a preset ratio, in descending order of regScore value.

The second operation unit may repeat the deletion of subwords until the vocabulary list reaches a preset number of words.

The method and apparatus for generating a vocabulary list according to an embodiment of the present invention separate subwords in consideration of the linguistic characteristics of Korean, and therefore have the advantage of improving the performance of a Korean-based neural network language model compared with methods and apparatuses that use existing subword segmentation algorithms.

In addition, the method and apparatus regularize subwords in consideration of the correlation between subwords, and therefore have the advantage of improving the performance of a Korean-based neural network language model compared with methods and apparatuses that use existing subword regularization algorithms.

In addition, the method and apparatus have the advantage that they can be applied to Korean language processing services.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a conceptual diagram of a use case in which a neural network language model based question answering system is operated using a vocabulary list generation method according to an embodiment of the present invention and the vocabulary list it generates.
FIG. 2 is a block diagram schematically illustrating an apparatus for generating a vocabulary list according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method for generating a vocabulary list according to an embodiment of the present invention.
FIG. 4 is a flowchart schematically illustrating a method for generating a vocabulary list according to an embodiment of the present invention.
FIGS. 5A and 5B are graphs comparing the performance of three algorithms in an experimental example.

The specific structural and functional descriptions of the embodiments of the present invention disclosed in this specification or application are given only for the purpose of describing embodiments according to the present invention. Embodiments according to the present invention may be implemented in various forms and should not be construed as being limited to the embodiments described in this specification or application.

Because embodiments according to the present invention can be modified in various ways and can take various forms, specific embodiments are illustrated in the drawings and described in detail in this specification or application. This is not intended, however, to limit embodiments according to the concept of the present invention to a specific disclosed form, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

In this specification, terms such as first and/or second are used only to distinguish one component from another. That is, the components are not intended to be limited by these terms.

Components, features, and steps referred to in this specification with the expression 'include' mean that those components, features, and steps are present; the expression is not intended to exclude one or more other components, features, steps, or their equivalents.

Unless specifically stated to be singular, expressions in this specification include the plural. That is, a component mentioned in this specification may imply the presence or addition of one or more other components.

Unless otherwise defined, all terms used in this specification, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs.

That is, terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in this specification.

Hereinafter, the present invention will be described in detail by describing preferred embodiments with reference to the accompanying drawings.

First, to explain subwords: a subword is a run of consecutive characters that can be separated from a word. For example, the word '서울에서' ("in Seoul") can be segmented into subword sets such as {'서울', '에서'}, {'서', '울에', '서'}, and {'서', '울에서'}. In this way, any part of a word, vocabulary item, or eojeol (space-delimited word unit) may be defined as a subword.
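To make the definition concrete, the following Python sketch enumerates every way of cutting a word into contiguous, non-empty pieces. The function name `segmentations` is hypothetical and the enumeration is purely illustrative; it is not a step of the claimed method.

```python
from itertools import combinations


def segmentations(word: str) -> list[list[str]]:
    """Return every way to cut `word` into contiguous, non-empty pieces."""
    n = len(word)
    results = []
    # Choose any subset of the n-1 gaps between characters as cut points.
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            results.append([word[a:b] for a, b in zip(bounds, bounds[1:])])
    return results


segs = segmentations("서울에서")
print(len(segs))                 # one segmentation per subset of gaps
print(["서울", "에서"] in segs)  # True
```

A word of n characters has 2^(n-1) such segmentations, which is why a scoring rule is needed to pick one of them.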

FIG. 1 is a conceptual diagram of a use case in which a neural network language model based question answering system is operated using a vocabulary list generation method according to an embodiment of the present invention and the vocabulary list it generates.

Referring to FIG. 1, the vocabulary list generation method is applied to a Korean data set to separate subwords, which are then regularized to generate the vocabulary list. Here, the Korean data set contains only raw text on which no preprocessing such as part-of-speech tagging or named entity recognition has been performed.

After the vocabulary list has been generated, text input to the neural network language model goes through a tokenization process: it is separated into subwords based on the generated vocabulary list, and each subword is replaced with the number assigned to it in the vocabulary list. In a question answering use case such as that of FIG. 1, the input question and passage text are tokenized and fed to the neural network language model, the model predicts the position of the answer within the passage, and the user obtains the answer from the predicted position.
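The tokenization step can be sketched in Python as follows. The text does not say how input is matched against the vocabulary list, so greedy longest-match is an assumption here, as are the function name `tokenize`, the `[UNK]` entry, and the toy vocabulary.

```python
def tokenize(text: str, vocab: dict[str, int], unk_id: int = 0) -> list[int]:
    """Greedy longest-match tokenization of each whitespace-separated word."""
    ids = []
    for word in text.split():
        start = 0
        while start < len(word):
            # Try the longest remaining piece first, then shrink it.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in vocab:
                    ids.append(vocab[piece])
                    start = end
                    break
            else:
                # No subword in the list covers this character.
                ids.append(unk_id)
                start += 1
    return ids


# Hypothetical vocabulary list mapping subwords to their numbers.
vocab = {"[UNK]": 0, "서울": 1, "에서": 2, "대한": 3, "민국": 4}
print(tokenize("서울에서 대한민국", vocab))  # [1, 2, 3, 4]
```

Characters that no subword covers fall back to the unknown id, which corresponds to the out-of-vocabulary case.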

FIG. 2 is a block diagram schematically illustrating a vocabulary list generating apparatus 100 according to an embodiment of the present invention.

Referring to FIG. 2, the vocabulary list generating apparatus 100 according to an embodiment of the present invention may include a data receiving unit 110, a first operation unit 120, and a second operation unit 130.

The data receiving unit 110 may receive Korean data containing Korean words for generating a vocabulary list. After receiving the Korean data, the data receiving unit 110 may transmit it to the first operation unit 120.

In an embodiment, the first operation unit 120 may take the Korean data received by the data receiving unit 110 as input and apply the subword segmentation algorithm to the words contained in the Korean data to separate them into subwords. The subword segmentation algorithm may separate each word into a left subword and a right subword. For example, '아인슈타인' (Einstein) may be separated into '아인슈타' (left) + '인' (right), and '아인슈타인은' (Einstein followed by the particle '은') into '아인슈타' (left) + '인은' (right).

Prior-art subword segmentation algorithms include the cohesion score algorithm and the branching entropy algorithm.

The cohesion score algorithm is a subword segmentation algorithm proposed for Korean. Because a noun can be followed by many kinds of suffixes and particles, it assumes that the characters within the left subword are strongly associated with one another and separates the word on that basis. In other words, it exploits the fact that the more strongly consecutive characters are associated, the more likely they are to form a word. In Korean, nouns, verbs, adjectives, and the like mainly appear on the left and particles that play a grammatical role mainly appear on the right, so this algorithm can be well suited to Korean.

The cohesion score algorithm may be defined as in Equation 1 below.

Here, the sequence term of Equation 1 denotes the consecutive characters from position 0 to n. For the word '노란색의' ("of yellow"), for instance, the successive sequences are '노', '노란', and '노란색'. The conditional probability term denotes the probability that A occurs given B.
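As an illustration, the cohesion score can be sketched in Python. Equation 1 itself is not reproduced here, so this sketch assumes the commonly used definition: the geometric mean of the conditional probabilities of each longer prefix given the one before it, estimated from prefix counts. The function names and the toy word list are hypothetical.

```python
from collections import Counter


def prefix_counts(words: list[str]) -> Counter:
    """Count how often each prefix occurs at the start of a word."""
    counts = Counter()
    for word in words:
        for end in range(1, len(word) + 1):
            counts[word[:end]] += 1
    return counts


def cohesion(prefix: str, counts: Counter) -> float:
    """Geometric mean of P(prefix[:i+1] | prefix[:i]) for i = 1 .. len-1."""
    n = len(prefix)
    if n < 2 or counts[prefix] == 0:
        return 0.0
    # The product of conditional probabilities telescopes to a count ratio.
    return (counts[prefix] / counts[prefix[0]]) ** (1 / (n - 1))


# Toy corpus (assumed): '노' starts five words, '노란' four, '노란색' three.
words = ["노란색의", "노란색은", "노란색이", "노란", "노래"]
counts = prefix_counts(words)
print(cohesion("노란", counts))    # 4/5 = 0.8
print(cohesion("노란색", counts))  # sqrt(3/5)
```

Because the score depends only on counts of the left part, it says nothing about which right subwords can follow, which is the weakness discussed next.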

However, the cohesion score algorithm has the drawback that its accuracy drops sharply as word frequency decreases, and it does not consider the other right subwords that can follow a left subword. For example, given the word '노란색의', it considers only '노란색' and '의' and ignores other right subwords that can follow '노란색', such as '은', '이', '을', and '과'.

The branching entropy algorithm was proposed for Chinese and exploits the fact that the more uncertain the character following a sequence of characters is, the more likely the sequence is to be split there. For example, in the word 'naturalize', given the subword 'natura' it is easy to predict that 'l' comes next, but it is hard to predict what follows 'natural'. With this algorithm, therefore, 'naturalize' is more likely to be separated into 'natural' and 'ize' than into 'natura' and 'lize'.

The branching entropy algorithm may be defined as in Equation 2 below.

Here, Equation 2 gives the branching entropy at a given sequence of consecutive characters, and its set term denotes the set of characters that followed that sequence in the Korean data. For example, for '아인슈타인' (Einstein) and '아인슈타이늄' (einsteinium), the sequence is '아인슈타' and the set is {'인', '이'}. X denotes the Korean data.
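The branching entropy of the example can be sketched in Python. Equation 2 is not reproduced here, so this sketch assumes the standard Shannon entropy of the next-character distribution with the natural logarithm; the function names and the two-word corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def next_char_counts(words: list[str]) -> dict[str, Counter]:
    """For every proper prefix, count the characters that follow it."""
    followers = defaultdict(Counter)
    for word in words:
        for end in range(1, len(word)):
            followers[word[:end]][word[end]] += 1
    return followers


def branching_entropy(prefix: str, followers: dict[str, Counter]) -> float:
    """Shannon entropy (natural log) of the next-character distribution."""
    nexts = followers.get(prefix)
    if not nexts:
        return 0.0
    total = sum(nexts.values())
    return -sum((c / total) * math.log(c / total) for c in nexts.values())


words = ["아인슈타인", "아인슈타이늄"]
followers = next_char_counts(words)
print(set(followers["아인슈타"]))                # the set containing '인' and '이'
print(branching_entropy("아인슈타", followers))  # log 2, about 0.693
```

A prefix with a single possible continuation, such as '아인', has entropy 0, so it is not a likely split point.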

However, the branching entropy algorithm is suited to languages such as Chinese, in which each character carries meaning and a particular character is highly likely to follow a given character; it is difficult to apply to Korean, in which meaning arises from combinations of characters rather than from the characters themselves.

Accordingly, to compensate for the shortcomings of the two algorithms, the vocabulary list generation method according to an embodiment of the present invention uses a new algorithm that multiplies the two together. The proposed algorithm may be defined as in Equation 3 below.

Here, the terms of Equation 3 are the sequence of consecutive characters from position 0 to n and the set of characters that followed that sequence in the Korean data. X denotes the Korean data.

Using the algorithm of Equation 3, the value is computed for every sequence in the word, from the shortest to the longest, and the position with the highest value can be taken as the split point. For example, if the value computed for '아인슈타' is the highest in the word '아인슈타이늄' (einsteinium), the word is separated into '아인슈타' and '이늄'.
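The split-point selection can be sketched in Python by multiplying the two scores and taking the best prefix. This is a minimal sketch, assuming the geometric-mean cohesion score and natural-log branching entropy as stand-ins for Equations 1 and 2; all function names and the four-word corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_stats(words):
    """Collect prefix counts and next-character counts from a word list."""
    prefix_count = Counter()
    followers = defaultdict(Counter)
    for word in words:
        for end in range(1, len(word) + 1):
            prefix_count[word[:end]] += 1
            if end < len(word):
                followers[word[:end]][word[end]] += 1
    return prefix_count, followers


def cohesion(prefix, prefix_count):
    n = len(prefix)
    if n < 2 or prefix_count[prefix] == 0:
        return 0.0
    return (prefix_count[prefix] / prefix_count[prefix[0]]) ** (1 / (n - 1))


def branching_entropy(prefix, followers):
    nexts = followers.get(prefix)
    if not nexts:
        return 0.0
    total = sum(nexts.values())
    return -sum((c / total) * math.log(c / total) for c in nexts.values())


def split_word(word, prefix_count, followers):
    """Split at the prefix with the highest cohesion * branching entropy."""
    best_end, best_score = len(word), 0.0
    for end in range(1, len(word)):
        prefix = word[:end]
        score = cohesion(prefix, prefix_count) * branching_entropy(prefix, followers)
        if score > best_score:
            best_end, best_score = end, score
    # If no prefix scores above zero, the word is left unsplit.
    return word[:best_end], word[best_end:]


words = ["아인슈타인", "아인슈타이늄", "아인슈타인은", "아인슈타인의"]
prefix_count, followers = build_stats(words)
print(split_word("아인슈타이늄", prefix_count, followers))  # ('아인슈타', '이늄')
```

On this toy corpus '아인슈타' is the only prefix of '아인슈타이늄' with both a non-zero cohesion score and more than one possible next character, so it wins.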

By using the product of the cohesion score and the branching entropy, words can be separated by considering not only word frequency but also the number of possible following characters, so subwords can be separated while compensating for the shortcomings of the two existing algorithms.

In an embodiment, when a right subword exists, the subword segmentation algorithm may be performed again on the right subword. That is, the algorithm is repeated until no right subword remains.

For example, suppose the word '대한민국만세' ("long live the Republic of Korea") is separated by the subword segmentation algorithm into '대한' and '민국만세'. The algorithm can then be applied again to the right subword '민국만세', which is separated into '민국' and '만세', so that '대한민국만세' is finally separated into '대한', '민국', and '만세'.
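The repetition on the right subword can be sketched in Python as a loop around any single-split function. The names `segment` and `split_after_two` are hypothetical, and the two-character splitter is only a stand-in for the score-based split so that the example is self-contained.

```python
from typing import Callable


def segment(word: str, split_once: Callable[[str], tuple[str, str]]) -> list[str]:
    """Split `word`, then keep splitting the right part until none remains."""
    pieces = []
    rest = word
    while rest:
        left, right = split_once(rest)
        if not left:
            # Guard against a splitter that makes no progress.
            pieces.append(rest)
            break
        pieces.append(left)
        rest = right
    return pieces


# Hypothetical stand-in for the scoring-based splitter: cut after two characters.
def split_after_two(word: str) -> tuple[str, str]:
    return word[:2], word[2:]


print(segment("대한민국만세", split_after_two))  # ['대한', '민국', '만세']
```

Only the right part is split again; each left subword is kept as produced.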

Through this process, all the words contained in the Korean data are separated into subwords, from which a single vocabulary list can be built.

In an embodiment, the second operation unit 130 may perform an algorithm that regularizes the subwords separated by the first operation unit 120. That is, it reduces the subword vocabulary list to the size desired by the user. Because the prediction rate falls as the vocabulary list grows, the aim is to reduce the size of the vocabulary list efficiently.

A subword regularization algorithm reduces the size of the vocabulary list by taking the redundancy of subwords into account. The unigram language model, one of the existing regularization algorithms, may be defined as in Equation 4 below.

Here, V is the set of vocabulary items, and the remaining terms are the i-th subword separated from V, x, which is the sequence of consecutive subwords, and the occurrence probability of each subword.

The occurrence probability of a subword is the number of occurrences of that subword divided by the number of subwords in the data set. For example, if the data set contains 10,000 subwords and a given subword occurs 10 times, its occurrence probability is 10/10,000 = 0.001.

The existing regularization algorithm performs regularization using the product of subword probabilities, which treats the subwords as independent of one another. That is, the calculation assumes independence and takes no account of the correlation between subwords.
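The occurrence probability and the independence assumption can be sketched in Python. This is only an illustration: the function names and the toy data set are hypothetical, and the product form reflects the unigram assumption described above rather than a transcription of Equation 4.

```python
import math
from collections import Counter


def subword_probabilities(subwords: list[str]) -> dict[str, float]:
    """Occurrence probability: count of a subword divided by the total count."""
    counts = Counter(subwords)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


def sequence_probability(sequence: list[str], probs: dict[str, float]) -> float:
    """Unigram assumption: subwords are independent, so probabilities multiply."""
    return math.prod(probs[s] for s in sequence)


# Toy data set: 10,000 subword tokens, 10 of which are '서울'.
data = ["서울"] * 10 + ["에서"] * 90 + ["기타"] * 9900
probs = subword_probabilities(data)
print(probs["서울"])  # 0.001
print(sequence_probability(["서울", "에서"], probs))
```

The second value is the same whether or not '에서' tends to follow '서울' in the data, which is the limitation that the mutual information based approach addresses.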

이 경우, 어휘 목록 내 부분 단어로 표현할 수 없는 단어인 out-of-vocabulary의 개수가 많아지게 된다는 단점이 있다. 실제 딥러닝 학습 시 부분 단어들이 저장된 어휘 목록을 사용하는데 어휘 목록의 크기가 클수록 메모리 자원이 많이 필요하므로 일정 크기의 목록을 사용한다. 따라서, 수많은 부분 단어들 중 일부분의 부분 단어만 이용하게 되는데, 어휘 목록 내에 있는 부분 단어로 단어를 표현할 수 없는 단어를 out-of-vocabulary라고 한다. 예컨대, '스크림'이라는 단어가 있을 때, '스크', '림', '스', '크림', '스크림' 등 해당 단어 '스크림'을 표현할 수 있는 부분 단어가 없는 경우를 의미한다. In this case, there is a disadvantage in that the number of out-of-vocabularies, which are words that cannot be expressed as partial words in the vocabulary list, increases. In actual deep learning learning, a vocabulary list in which partial words are stored is used. As the size of the vocabulary list increases, more memory resources are required, so a list of a certain size is used. Therefore, only partial words are used among numerous partial words. A word that cannot be expressed as a partial word in the vocabulary list is called out-of-vocabulary. For example, when the word 'scream' exists, it means that there is no partial word that can express the corresponding word 'scream', such as 'sk', 'rim', 's', 'cream', and 'scream'.

To compensate for the shortcomings of the existing algorithm, the vocabulary list generation method according to an embodiment of the present invention uses an algorithm based on the mutual information formula, which measures the mutual dependence between variables. In information theory, mutual information indicates how closely two events are related. In other words, the algorithm takes the association between partial words into account.

Because a word may contain two or more partial words, the normalization algorithm of the vocabulary list generation method according to an embodiment of the present invention uses a multivariate mutual information formula. Because the partial words occur in order, the multivariate mutual information formula can be illustrated with two variables x and y, as defined in Equation 5 below.

Here, p(x) is the probability of x, and p(x, y) is the joint probability of x and y.
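For reference, the textbook two-variable (pointwise) mutual information has the form below. It is offered as an assumption about the general shape of Equation 5, not as a transcription of it; it is consistent with the later remark that the denominator inside the logarithm is a product of occurrence probabilities.

```latex
\mathrm{MI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}
```

When x and y are independent, p(x, y) = p(x) p(y) and the value is 0; the more often x and y occur together relative to chance, the larger the value.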

The normalization algorithm of the vocabulary list generation method according to an embodiment of the present invention may be defined as in Equation 6 below.

The symbols in Equation 6 denote, in order, the i-th word in the word set W, the j-th partial word separated from that word, and the partial word whose regScore value is to be obtained.

However, in Equation 6 the denominator inside the logarithm is a product of partial-word occurrence probabilities, each of which is very small, so the MI (mutual information) value may fall outside the range that a double-precision value can represent. A normalization algorithm that addresses this problem may be defined as in Equation 7 below. That is, replacing the multiplication with addition avoids exceeding the range of a double.
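The numerical problem can be reproduced with a short Python sketch. The probabilities are hypothetical, and summing logarithms is shown only as the standard way to turn a product into a sum; it is not claimed to be the exact form of Equation 7.

```python
import math

# Hypothetical occurrence probabilities for 80 partial words.
probs = [1e-5] * 80

# Direct product: the true value is 1e-400, below the smallest positive double.
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 (underflow), so dividing by it is no longer possible

# Sum of logarithms: the same quantity on a log scale stays representable.
log_product = sum(math.log(p) for p in probs)
print(log_product)
```

The second value is roughly -921, comfortably inside the double range, which illustrates why a sum is numerically safer than a long product of small probabilities.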

In an embodiment, partial words with large regScore values may be deleted to reduce the size of the vocabulary list. Partial words may be deleted in descending order of regScore, up to a preset ratio; for example, the top 20% of partial words by regScore may be deleted. A high regScore value means that the mutual information is low, that is, the association between partial words is weak, so deleting partial words with large regScore values reduces the size of the vocabulary list.

The method of obtaining the regScore value of a partial word is described in more detail with an example. In Equation 7, the regScore value of a partial word is computed from the sum of the MI values of the words that contain that partial word. For example, when the word set W contains the words '대한민국', '대한민주주의공화국', and '대한사람', regScore(대한) is computed from the sum of the MI values of the words that contain the partial word '대한', that is, from MI(대한민국) + MI(대한민주주의공화국) + MI(대한사람). In contrast, regScore(민국) is computed from MI(대한민국) alone.
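The aggregation step in this example can be sketched as follows. Only the summation over the words containing a partial word is shown; how regScore is then derived from that sum is left to Equation 7. The segmentations, the MI values, and the name `mi_sum_for_partial_word` are hypothetical.

```python
def mi_sum_for_partial_word(target, segmented_words, mi):
    """Sum MI(w) over every word w whose segmentation contains `target`."""
    return sum(mi[w] for w, parts in segmented_words.items() if target in parts)


# Hypothetical segmentations and MI values, for illustration only.
segmented_words = {
    "대한민국": ["대한", "민국"],
    "대한민주주의공화국": ["대한", "민주주의", "공화국"],
    "대한사람": ["대한", "사람"],
}
mi = {"대한민국": 2.0, "대한민주주의공화국": 1.5, "대한사람": 0.5}

print(mi_sum_for_partial_word("대한", segmented_words, mi))  # 4.0
print(mi_sum_for_partial_word("민국", segmented_words, mi))  # 2.0
```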

In an embodiment, the process of deleting partial words in descending order of regScore may be repeated until the preset vocabulary list size is reached. For example, if the number of words in the vocabulary list is preset to D, the process of deleting the top 20% of partial words by regScore may be repeated until the number of words in the vocabulary list becomes D.

If the vocabulary list were reduced to the preset number of words in a single step, overfitting could occur. Therefore, instead of deleting partial words all at once in descending order of regScore, the process of deleting a fixed ratio of partial words is performed repeatedly.
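The repeated deletion can be sketched as the loop below. It rests on several assumptions that the text does not state: the scoring function `score_fn` is supplied by the caller, scores are recomputed in every round, and the last round is clamped so that the list ends at exactly D entries.

```python
import math


def prune_vocabulary(vocab, score_fn, target_size, ratio=0.2):
    """Repeatedly drop the highest-scoring `ratio` of entries until `target_size` remain."""
    vocab = list(vocab)
    while len(vocab) > target_size:
        scores = score_fn(vocab)  # assumption: scores are recomputed each round
        ranked = sorted(vocab, key=lambda s: scores[s], reverse=True)
        n_drop = max(1, math.floor(len(ranked) * ratio))
        n_drop = min(n_drop, len(ranked) - target_size)  # assumption: do not go below D
        vocab = ranked[n_drop:]
    return vocab


# Hypothetical scores: entry 's7' has score 7, and so on.
toy_vocab = [f"s{i}" for i in range(10)]
toy_scores = lambda v: {s: int(s[1:]) for s in v}
print(sorted(prune_vocabulary(toy_vocab, toy_scores, target_size=6)))
```

With ten entries and D = 6, the rounds remove two, one, and one entries in turn, leaving the six lowest-scoring entries.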

FIG. 3 is a flowchart illustrating a method for generating a vocabulary list according to an embodiment of the present invention.

Referring to FIG. 3, the vocabulary list generation method according to an embodiment of the present invention may include receiving Korean data for generating a vocabulary list (S301), separating the words included in the Korean data into partial words by performing a partial word separation algorithm on the received Korean data (S303), and generating a vocabulary list by performing a normalization algorithm on the separated partial words (S305).

In the step of receiving Korean data (S301), the data receiving unit 110 receives Korean data containing Korean words. After receiving the Korean data, the data receiving unit 110 may transmit it to the first operation unit 120.

In the step of separating the words included in the Korean data into partial words (S303), the first operation unit 120 receives the Korean data and then performs the partial word separation algorithm on the words it contains. The first operation unit 120 may separate the partial words using the algorithm of Equation 3 above.

In the step of generating a vocabulary list by performing a normalization algorithm on the separated partial words (S305), the second operation unit 130 receives the partial words separated by the first operation unit 120 and then normalizes them using the normalization algorithm to generate the vocabulary list.

The second operation unit 130 may normalize the partial words using the algorithm of Equation 7 above, reducing the size of the vocabulary list to produce the final vocabulary list.

FIG. 4 is a flowchart schematically illustrating a method for generating a vocabulary list according to an embodiment of the present invention.

The first operation unit 120 may use the partial word separation algorithm to separate each word included in the Korean data into a left partial word and a right partial word. For example, the input word '대한독립만세' may be separated into '대한' and '독립만세'.

In an embodiment, when a right partial word exists, the first operation unit 120 may repeat the step of performing the partial word separation algorithm on the right partial word. That is, the right partial word may be separated again into a left partial word and a right partial word until it can no longer be separated. For example, once the word has been separated into '대한' and '독립만세', the right partial word '독립만세' is separated again into '독립' and '만세', finally yielding the partial words '대한', '독립', and '만세'.
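The repeated left/right separation can be sketched as a simple loop. The function `split_once` is a hypothetical stand-in that picks the longest known prefix; it is not the separation algorithm of Equation 3, which is defined elsewhere in the specification. The set `known_prefixes` is likewise an assumption made for the example.

```python
def split_once(word, known_prefixes):
    """Hypothetical stand-in for the separation step: split at the longest known prefix."""
    for end in range(len(word) - 1, 0, -1):
        if word[:end] in known_prefixes:
            return word[:end], word[end:]
    return word, ""  # no right partial word: the word cannot be separated further


def segment(word, known_prefixes):
    """Keep separating the right partial word until nothing is left to separate."""
    parts = []
    rest = word
    while rest:
        left, rest = split_once(rest, known_prefixes)
        parts.append(left)
    return parts


print(segment("대한독립만세", {"대한", "독립"}))  # ['대한', '독립', '만세']
```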

The separated partial words are transmitted to the second operation unit 130, which applies the normalization algorithm to them and deletes the partial words with low mutual information to generate the final vocabulary list.

In an embodiment, the second operation unit 130 may use the normalization algorithm of Equation 7 to delete a preset ratio of the partial words with the largest regScore values.

In addition, the second operation unit 130 may repeat the process of deleting partial words until the preset vocabulary list size is reached. For example, if the number of words in the vocabulary list is preset to D, the process of deleting the top 20% of partial words by regScore may be repeated until the number of words in the vocabulary list becomes D.

The vocabulary list generation method according to an embodiment of the present invention may also be implemented as a computer-readable recording medium on which a program for execution on a computer is recorded. The computer-readable recording medium may be any available medium that can be accessed by a computer, and may include volatile and nonvolatile media as well as removable and non-removable media. The computer-readable recording medium may also include a computer storage medium. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

Exemplary modules, steps, or combinations thereof relating to the embodiments described herein may be implemented by electronic hardware (a digital design produced by coding or the like), software (various forms of applications including program instructions), or a combination thereof. Whether they are implemented in hardware and/or software may vary depending on the design constraints imposed on the user terminal.

One or more of the components described herein may be stored in memory as computer program instructions, and these instructions may carry out the methods described herein, primarily on a digital signal processor. The connections between components specified with reference to the accompanying drawings are merely examples; at least some of them may be omitted, and additional components may of course be included alongside them.

Experimental example: Performance evaluation experiment

The performance of the vocabulary list generation method proposed in the present invention was compared with that of the BPE (byte pair encoding) algorithm and the Unigram language model algorithm. BERT (Bidirectional Encoder Representations from Transformers) was used as the neural network language model for comparing the three algorithms.

The BPE algorithm separates words based on frequency without considering the structure of the language, and the Unigram algorithm separates words in units of unigrams.

Table 1 below compares performance after generating a vocabulary list with each algorithm and training a BERT model on it.

Table 1

Algorithm | Vocabulary list size | Masked word prediction (accuracy) | Fine-tuning EM (%) | Fine-tuning F1 (%)
BPE | 30,000 | 0.509 | 42.31 | 80.77
BPE | 40,000 | 0.496 | 41.27 | 80.11
BPE | 50,000 | 0.455 | 40.82 | 80.02
Unigram | 30,000 | 0.507 | 49.77 | 81.18
Unigram | 40,000 | 0.487 | 49.52 | 81.05
Unigram | 50,000 | 0.492 | 49.86 | 81.33
Proposed algorithm | 30,000 | 0.609 | 53.12 | 81.78
Proposed algorithm | 40,000 | 0.606 | 52.73 | 81.54
Proposed algorithm | 50,000 | 0.593 | 52.50 | 81.63

Masked word prediction is the pre-training task of predicting the partial word that fills a blank, and fine-tuning (question answering) is the task of finding the answer to a question in a passage.

The EM (exact match) metric for fine-tuning indicates whether the predicted answer exactly matches the actual answer. For example, when the actual answer is '1990년대 말' (late 1990s), a prediction of '1990년대' (1990s) gives an EM of 0, whereas a prediction of '1990년대 말' gives 100%.

The F1 score for fine-tuning indicates how closely the predicted answer matches the actual answer. For example, when the actual answer is '1990년대 말' and the prediction is '1990년대', 6 of the 7 syllables match, giving a match rate of 86%.
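The two worked examples can be reproduced with the sketch below. The function `syllable_match_rate` mirrors only the ratio described in the text (matched syllables divided by the syllables of the actual answer, spaces ignored); it is an illustrative assumption and not the evaluation script used in the experiment. A standard F1 score combines precision and recall and would generally give a different number.

```python
def exact_match(pred, gold):
    """1.0 when the prediction equals the actual answer exactly, otherwise 0.0."""
    return 1.0 if pred == gold else 0.0


def syllable_match_rate(pred, gold):
    """Share of the actual answer's syllables (spaces ignored) matched position by position."""
    p = pred.replace(" ", "")
    g = gold.replace(" ", "")
    matched = sum(1 for a, b in zip(p, g) if a == b)
    return matched / len(g)


gold = "1990년대 말"
print(exact_match("1990년대", gold))                    # 0.0
print(exact_match("1990년대 말", gold))                 # 1.0
print(round(syllable_match_rate("1990년대", gold), 2))  # 0.86
```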

The Korean Wikipedia dump was used as the pre-training dataset, and the KorQuAD dataset was used as the fine-tuning dataset.

FIGS. 5A and 5B are graphs comparing the performance of the three algorithms based on Table 1 above.

Comparing BPE, Unigram, and the algorithm proposed in the present invention, masked word prediction accuracy improves by roughly 10 percentage points over the other algorithms (0.593 to 0.609 versus 0.455 to 0.509 in Table 1).

For fine-tuning, EM improves by about 3 percentage points and the F1 score by about 1 percentage point compared with the other algorithms; in Table 1 the EM gain is about 3 points over Unigram and about 11 points over BPE.

Accordingly, it was confirmed that, compared with the other algorithms, the algorithm proposed in the present invention is better at inferring the word that fills a blank from the context of the surrounding partial words, and also performs better in question answering. In other words, the prediction accuracy of a neural network language model can be improved solely by improving the vocabulary list generation method, without modifying the structure of the model.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains may make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within an equivalent scope should be construed as falling within the scope of rights of the present invention.

100: vocabulary list generating apparatus
110: data receiving unit
120: first operation unit
130: second operation unit

Claims

A method for generating a vocabulary list for a Korean-based neural network language model, performed by a vocabulary list generating apparatus, the method comprising:
(a) receiving, by the vocabulary list generating apparatus, Korean data for generating a vocabulary list;
(b) performing a partial word separation algorithm on the received Korean data to separate words included in the Korean data into partial words according to the algorithm; and
(c) performing a normalization algorithm according to [Equation 7] below on the separated partial words, and generating a vocabulary list by repeatedly deleting, based on the regScore value of [Equation 7], a preset ratio of the separated partial words:
[Equation 7]

wherein the symbols denote, in order, the i-th word in the word set W, the j-th partial word separated from that word, and the partial word whose regScore value is to be obtained.

The method of claim 1, wherein, in step (b), the partial word separation algorithm is defined by the following equation:

wherein the symbols denote, in order, a set of consecutive characters from position 0 to n and the set of characters that appear after it in the Korean data, and X denotes the Korean data.

The method of claim 1, wherein step (b) comprises separating the words included in the received Korean data into a left partial word and a right partial word using the partial word separation algorithm.

The method of claim 3, wherein step (b) further comprises, when the right partial word exists, repeating the step of performing the partial word separation algorithm on the right partial word.

(Deleted)

The method of claim 1, wherein step (c) further comprises repeating the step of deleting the partial words until a preset number of words in the vocabulary list is reached.

A computer-readable recording medium recording a program for performing the method according to any one of claims 1 to 4 on a computer.

An apparatus for generating a vocabulary list for a Korean-based neural network language model, comprising:
a data receiving unit configured to receive Korean data for generating a vocabulary list;
a first operation unit configured to perform a partial word separation algorithm on the received Korean data and to separate words included in the Korean data into partial words according to the algorithm; and
a second operation unit configured to generate the vocabulary list by performing a normalization algorithm on the separated partial words,
wherein the second operation unit performs the normalization algorithm defined by [Equation 7] below and generates the vocabulary list by repeatedly deleting, based on the regScore value of [Equation 7], a preset ratio of the separated partial words:
[Equation 7]

wherein the symbols denote, in order, the i-th word in the word set W, the j-th partial word separated from that word, and the partial word whose regScore value is to be obtained.

The apparatus of claim 9, wherein the partial word separation algorithm is defined by the following equation:

wherein the symbols denote, in order, a set of consecutive characters from position 0 to n and the set of characters that appear after it in the Korean data, and X denotes the Korean data.

The apparatus of claim 9, wherein the partial word separation algorithm separates the words included in the received Korean data into a left partial word and a right partial word.

The apparatus of claim 11, wherein, after separating the words included in the Korean data, the first operation unit repeats the partial word separation algorithm on the right partial word when the right partial word exists.

(Deleted)

The apparatus of claim 9, wherein the second operation unit repeatedly deletes the partial words until a preset number of words in the vocabulary list is reached.