KR20220096055A - Electronic device for word embedding and method of operating the same

Info

Publication number: KR20220096055A
Application number: KR1020200188182A
Authority: KR
Inventors: 김철연; 주형주
Original assignee: 숙명여자대학교산학협력단
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2022-07-07

Abstract

Disclosed is an electronic device that performs word embedding. The electronic device may comprise a storage unit and a processor. The processor outputs, from a first artificial neural network, a plurality of embedding vector values for a plurality of words included in a sliding window having a predetermined size, obtains the average of the plurality of embedding vector values, inputs the average to a second artificial neural network, and trains the second artificial neural network to output the embedding vector value of the center word within the sliding window.

Description

Electronic device for performing word embedding and operating method thereof

The present invention relates to an electronic device that performs word embedding and an operating method thereof and, more particularly, to an electronic device that performs word embedding reflecting context, and an operating method thereof.

Natural language processing (NLP) is a subfield of computer science, linguistics, and artificial intelligence concerned with methods for analyzing and processing human (natural) language so that computers can understand it.

In natural language processing, syntactic analysis is performed using parsing, part-of-speech tagging, and the like, and semantic analysis is performed using entity recognition, event extraction, document classification, and the like, with the ultimate focus on enabling a computer to understand the content of natural language documents.

Among these tasks, word sense disambiguation (WSD) is a semantic analysis problem in natural language processing that identifies the sense in which each word is used in a sentence. It determines the meaning of a homograph by examining the context in which the word appears, and it is important for improving the performance of communication-oriented natural language processing applications such as chatbots and information retrieval.

The human brain is quite adept at word sense disambiguation, but current automatic solutions fall short of human ability. Much research therefore aims to improve performance, and the methodologies vary widely, from dictionary-based methods that use knowledge encoded in lexical resources to supervised learning methods that are trained on a natural language corpus to distinguish the senses of each individual word. Among these, supervised machine learning approaches are known to be the most successful.

A typical artificial neural network model for a task such as word sense disambiguation takes as input a sequence of word embeddings, that is, real-valued vector representations of natural language. This is because word embeddings, which effectively capture the syntactic and semantic information inherent in a natural language data set in real-valued vectors, work efficiently as features for deep learning models and their associated optimization techniques.

As an example of the background art of the present invention, Korean Patent Application Publication No. 10-2020-0040652 (April 20, 2020) discloses a natural language processing system, and a word representation method in natural language processing, that generate word representations for all words, including out-of-vocabulary (OOV) words, based on supervised learning of pre-trained word embeddings.

However, because embedding values are usually learned from a training data set, an embedding value cannot be derived for a word that does not occur in the training data. This is a problem because, at test time, the model may receive any number of words that it never encountered in the training data.

This out-of-vocabulary (OOV) problem is particularly pronounced in creative communication domains such as chatbots and search engines. A common solution is to replace rare words in the training data set with a special token, <UNK>, which denotes an "unknown word", so that an embedding value for that token can be learned during training.

Then, at test time, every out-of-vocabulary word in an input sentence is replaced with the <UNK> token before the sentence is fed to the model. In effect, this maps all out-of-vocabulary words to the same single vector representation, which has the critical drawback of preventing the model from recognizing the identity of each word.

Another solution constructs word embeddings from subword information by exploiting the morphological characteristics of words, such as morphemes or character n-grams. A representative method is FastText, a model proposed by Facebook that extends the Skip-gram model by assigning an embedding vector to every character n-gram and representing each word as the sum of its n-grams. However, the FastText-based solution also outputs the same word embedding for homographs that share the same form, so it cannot respond to context.

The present invention has been devised to solve the problems described above, and its object is to provide a two-step word embedding algorithm that addresses the two problems presented above, namely the out-of-vocabulary problem and word sense disambiguation.

Specifically, an object of the present invention is to provide a robust learning method that, through the first-step word embedding algorithm, solves the out-of-vocabulary problem so that word embedding values can be obtained even for words never learned from the training data.

Another object of the present invention is to provide a contextual embedding learning method that, through the second-step word embedding algorithm, allows words to take different embeddings depending on the context in which each appears, even when the words share the same form, in order to perform word sense disambiguation.

An electronic device that performs word embedding according to various embodiments of the present invention may include a storage unit and a processor. The processor outputs, from a first artificial neural network, a plurality of embedding vector values for a plurality of words included in a sliding window having a preset size, obtains the average of the plurality of embedding vector values, inputs the average to a second artificial neural network, and trains the second artificial neural network to output the embedding vector value of the center word in the sliding window.

An operating method of an electronic device that performs word embedding according to various embodiments of the present invention may include: outputting, from a first artificial neural network, a plurality of embedding vector values for a plurality of words included in a sliding window having a preset size, and obtaining the average of the plurality of embedding vector values; and inputting the average to a second artificial neural network and training the second artificial neural network to output the embedding vector value of the center word in the sliding window.

In a non-transitory computer-readable medium according to various embodiments of the present invention, which stores computer instructions that cause an electronic device to perform an operation when executed by a processor of the electronic device, the operation may include: outputting, from a first artificial neural network, a plurality of embedding vector values for a plurality of words included in a sliding window having a preset size, and obtaining the average of the plurality of embedding vector values; and inputting the average to a second artificial neural network and training the second artificial neural network to output the embedding vector value of the center word in the sliding window.

According to various embodiments of the present invention, a robust learning method can be provided that solves the out-of-vocabulary problem so that word embedding values can be obtained even for words with no history of being learned from the training data.

In addition, according to various embodiments of the present invention, each homograph can take a different embedding depending on the context of the sentence in which it appears, so that the ambiguity of homographs can be resolved.

FIG. 1 shows the structures of Skip-gram and FastText according to an embodiment of the present invention.
FIG. 2 shows embedding vector values obtained with a sliding window according to an embodiment of the present invention.
FIG. 3 is a block diagram of an electronic device that performs word embedding according to an embodiment of the present invention.
FIG. 4 is a flowchart of a two-step word embedding method according to an embodiment of the present invention.
FIG. 5 shows the structure of a sparse autoencoder according to an embodiment of the present invention.
FIG. 6 is a graph of KL-D and MSE according to an embodiment of the present invention.
FIGS. 7A and 7B show the structure of a document classifier used for proof of concept according to an embodiment of the present invention.
FIG. 8 shows the overall structure of a word embedding system according to an embodiment of the present invention.
FIGS. 9A to 9D show t-SNE two-dimensional projection visualizations of clusters of out-of-vocabulary words according to an embodiment of the present invention.
FIGS. 10A to 10F compare t-SNE two-dimensional projection visualizations of word embeddings by context according to an embodiment of the present invention.
FIGS. 11A to 11D compare t-SNE two-dimensional projection visualizations of word embeddings by context according to another embodiment of the present invention.
FIG. 12 shows the reconstruction error of words according to whether they are registered or out-of-vocabulary, according to an embodiment of the present invention.
FIG. 13 shows the change in performance of a natural language sentence identity classifier according to an embodiment of the present invention.
FIG. 14 shows the change in performance of a word sense disambiguation classifier according to an embodiment of the present invention.
FIG. 15 shows the change in performance of a word sense disambiguation classifier according to another embodiment of the present invention.
FIG. 16 is a block diagram of the detailed configuration of an electronic device according to an embodiment of the present invention.
FIG. 17 is a flowchart of an operating method of an electronic device that performs word embedding according to an embodiment of the present invention.

Hereinafter, the operating principle of preferred embodiments of the present invention is described in detail with reference to the accompanying drawings. In describing the embodiments, a detailed description of a related well-known function or configuration is omitted when it is judged that it could obscure the gist of the present disclosure. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or custom of a user or operator. The definitions of these terms should therefore be interpreted based on the content of this specification as a whole and the corresponding functions.

Hereinafter, a deep neural network structure that combines the FastText word embedding methodology with a sparse autoencoder is proposed as an example, with the aim of solving the out-of-vocabulary problem and the word sense disambiguation problem according to various embodiments of the present invention.

Taking the FastText model and the sparse autoencoder model described above as an example, the system may operate differently when the model is being trained and when the trained model is applied at test time.

First, taking the case of training the FastText model as an example, a natural language corpus data set consisting of a large number of sentences may be given. This data set can be used to train the FastText module for [Step 1: Distributed representation of words].

The FastText model may be trained to increase the accuracy with which the center word in a sliding window predicts the words located in its surrounding context. By the nature of the FastText methodology, each word can be split into character n-grams.

In various embodiments of the present invention, the embedding vector representation of a sentence can be computed as the average of the embedding vector values of all words in the sentence. Consequently, the embedding vector representation of a sentence may amount to the average of the vector values of all character n-grams in the sentence.
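The averaging described above can be illustrated with a minimal Python sketch. The function names and the toy `embed` callable are hypothetical; in the described system the word vectors would come from the trained FastText module.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def sentence_embedding(sentence, embed):
    """Average the embedding vectors of all whitespace-separated words.

    `embed` is any callable mapping a word to a vector (an assumption here;
    it stands in for the Step 1 embedding model).
    """
    return mean_vector([embed(word) for word in sentence.split()])
```

Whitespace tokenization is used only to keep the sketch short; the text does not specify a tokenizer.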

That is, Step 1 above makes it possible to handle out-of-vocabulary words while also capturing more complex meaning in the real-valued vector representation.

Next, the sentence vector values output by Step 1 can be used as input for [Step 2: Contextualization of distributed representations]. Here, the sparse autoencoder module of Step 2 can be trained to take as input the average of the word vectors within a sliding window and to output the single embedding vector value of the center word in that window. In this way, the lexical and syntactic information inherent in the training natural language data set can be effectively captured in the weights of the sparse autoencoder network.
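As a rough illustration of how the Step 2 training examples could be assembled, the following Python sketch pairs the mean of the vectors in each sliding window (the input) with the vector of the center word (the target). The function name, the `half_width` parameter, and the truncation of windows at sentence boundaries are assumptions made for this sketch and are not taken from the text.

```python
def window_training_pairs(word_vectors, half_width=2):
    """Build (window mean, center vector) pairs for one sentence.

    `word_vectors` is the list of Step 1 vectors, one per word.
    Windows are truncated at the sentence boundaries (an assumption).
    """
    pairs = []
    for center in range(len(word_vectors)):
        lo = max(0, center - half_width)
        hi = min(len(word_vectors), center + half_width + 1)
        window = word_vectors[lo:hi]
        dim = len(window[0])
        mean = [sum(v[k] for v in window) / len(window) for k in range(dim)]
        pairs.append((mean, word_vectors[center]))
    return pairs
```

Each pair would then serve as one (input, target) example for the second network.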

Here, it is assumed that half of the words in the vocabulary that have a single meaning are registered (in-vocabulary) words and that the other half are out-of-vocabulary words.

For a registered word, the embedding vector value of the center word is used as-is and is thereby included in the averaging of the input. For an out-of-vocabulary word, the center word is treated as a zero vector, which has the effect of excluding it from the averaging of the input. That is, Step 2 makes it possible to cope with out-of-vocabulary input while also obtaining a vector representation of the center word whenever a context is given.
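The zero-vector treatment can be sketched in Python as follows. The function name and arguments are hypothetical, and keeping the divisor equal to the window size (so that the zero vector simply contributes nothing to the sum) is an assumption; the text does not state how the average is normalized.

```python
def masked_window_mean(window_vectors, center_index, center_is_registered):
    """Mean of the window, with the center replaced by zeros if unregistered."""
    dim = len(window_vectors[0])
    vectors = [list(v) for v in window_vectors]  # copy; leave the input intact
    if not center_is_registered:
        vectors[center_index] = [0.0] * dim      # out-of-vocabulary center word
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
```

With a registered center word the result is the ordinary window mean; with an unregistered one the input is determined by the context words alone.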

Finally, the embedding vector value output after Steps 1 and 2 according to the various embodiments described above have been performed can be obtained as the real-valued vector representation of the word.

Meanwhile, the trained model is applied at test time as follows.

First, a sentence is given, which can be extended to document scale as needed. Because a sentence consists of a sequence of words, those words can be input to the FastText module for [Step 1: Distributed representation of words]. The average of the resulting embedding values can be taken as the embedding vector representation of the sentence.

This embedding vector representation is taken as the feature value and input to the sparse autoencoder module for [Step 2: Contextualization of distributed representations], and the resulting output can be defined as the final contextualized distributed representation of the sentence. Using such a combined system makes it possible to deal effectively with the out-of-vocabulary and homograph problems inherent in natural language data.

Meanwhile, one way of representing a natural language word so that a computer can operate on it is the one-hot encoding vector: a zero vector of vocabulary size in which only the position of the word's index has the value 1.

Because the one-hot encoding vector is a sparse and discrete representation, it is considered poorly suited to computing similarity or inference between words.
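A small Python sketch makes the limitation concrete: any two distinct one-hot vectors have a dot product of zero, so the representation carries no notion of similarity between words. The function names and the toy vocabulary are illustrative only.

```python
def one_hot(word, vocabulary):
    """Zero vector of vocabulary size with a 1 at the word's index."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1
    return vec


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


vocabulary = ["cat", "dog", "apple"]
cat, dog = one_hot("cat", vocabulary), one_hot("dog", vocabulary)
# Distinct words are always orthogonal, however related they are in meaning.
similarity = dot(cat, dog)
```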

One way to compensate for this is to represent words as dense, distributed representations. This technique was initially learned as part of a language model neural network, but more recently it has been established as a standalone model that learns only the distributed representations in order to improve speed and performance.

This distributed representation of words quantifies natural language as real-valued vectors and is now used in almost all natural language processing. In other words, most natural language processing models based on deep learning use pre-trained distributed word representations, called word embeddings, to improve performance.

Word embeddings are expected to provide richer and more efficient word representations that incorporate the semantic and syntactic context among the words in a sentence. Such embedding methods can perform well when they learn to capture the contextual meaning of each word from a large, high-quality data set.

Examples of methods for generating a word embedding model include Word2Vec, FastText, GloVe, ELMo, and BERT.

Word2Vec according to an embodiment of the present invention can be divided into two different models: Continuous Bag-of-Words (CBOW) and Skip-gram. In CBOW, the input to the neural network is the context of the center word, that is, the surrounding words, and the output is the center word. In other words, training proceeds by predicting the center word given its context. Skip-gram is the reverse: a neural network that takes the center word as input and predicts the context words as output. Its structure is shown in FIG. 1.
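The difference between the two training setups can be sketched in Python by generating the examples each model would see. The function names and the `half_width` parameter are hypothetical, and the sketch covers only example generation, not the networks themselves.

```python
def cbow_pairs(tokens, half_width=1):
    """CBOW examples: (list of context words, center word)."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - half_width):i] + tokens[i + 1:i + 1 + half_width]
        if context:
            pairs.append((context, center))
    return pairs


def skipgram_pairs(tokens, half_width=1):
    """Skip-gram examples: (center word, one context word)."""
    return [(center, ctx)
            for context, center in cbow_pairs(tokens, half_width)
            for ctx in context]
```

For the tokens `["the", "cat", "sat"]` with `half_width=1`, CBOW receives `(["the", "sat"], "cat")` as one example, whereas Skip-gram receives `("cat", "the")` and `("cat", "sat")` separately.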

The neural network structure of the FastText model according to another embodiment of the present invention is the same as the Skip-gram structure of Word2Vec. However, it uses a different loss function in order to take the internal structure of words into account, so that each word is treated as a bag of character n-grams.

In addition to whole words, the FastText model also includes subword information in the representations to be learned. It ultimately represents a word as the sum of the vector representations of its internal n-grams, which lets it learn the meaning contained within the word and give rare words a rich representation.

For example, FIG. 2 shows how the FastText embedding vector value for the word 'apple' is composed. In FIG. 2, the size of the sliding window within the word is assumed to be 3. By placing the special tokens '<' and '>' at the beginning and end of 'apple', vector representations in units of 3-grams can be obtained. The sum of these values and the existing vector representation of the whole word is the final vector representation of the word.
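The 'apple' example can be reproduced with a short Python sketch. It is a simplification under stated assumptions: a single n-gram size is used, and `unit_vectors` is a hypothetical lookup table (a plain dict) holding vectors for both n-grams and whole-word tokens; units missing from the table are skipped.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary tokens."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


def compose_word_vector(word, unit_vectors, dim, n=3):
    """Sum the vectors of the word's n-grams and of the whole word itself."""
    units = char_ngrams(word, n) + [f"<{word}>"]
    total = [0.0] * dim
    for unit in units:
        vec = unit_vectors.get(unit)
        if vec is None:
            continue  # unit not in the table; contributes nothing
        for k in range(dim):
            total[k] += vec[k]
    return total
```

Here `char_ngrams("apple")` yields `<ap`, `app`, `ppl`, `ple`, and `le>`. Because a vector can be composed from n-grams alone, a word that never appeared in training can still receive an embedding.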

GloVe (Global Vectors for Word Representation) according to yet another embodiment of the present invention is a log-bilinear regression model that combines the advantages of two approaches: global matrix factorization and local context window methods.

That is, the GloVe model can be optimized so that the dot product of two word vectors expressed as embeddings matches their co-occurrence probability over the entire corpus.
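For reference, the weighted least-squares objective from the original GloVe publication can be written as below. The symbols are not defined in this document and are introduced here only for illustration: $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $V$ is the vocabulary size, and $f$ is a weighting function. Note that in this formulation the dot product is fitted to the logarithm of the co-occurrence count rather than to the raw value.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
```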

ELMo (Embeddings from Language Models) according to yet another embodiment of the present invention is a recurrent neural network (RNN) based language model that splits a given word into characters, passes them through a charCNN module, and feeds the resulting word embeddings into a bidirectional LSTM. Because ELMo obtains embeddings by splitting a word into characters, it shares with FastText the advantage of being able to handle out-of-vocabulary words.

BERT (Bidirectional Encoder Representations from Transformers) according to yet another embodiment of the present invention is a language model based on the Transformer encoder. Because it does not use an RNN structure, it parallelizes well, and because it stacks multiple encoder layers, it can obtain more meaningful representations from sentences.

BERT processes its training input with byte-pair encoding, which has the advantage of yielding rich representations by splitting a word into several subwords rather than treating it as a single unit. However, because the size of BERT's vocabulary is limited, any out-of-vocabulary word not covered by it may be treated as the <UNK> token.

The word embedding models described above may all be applied to various embodiments of the present invention; in the following, however, the embodiments are assumed to use the FastText module.

Unlike Word2Vec and GloVe, FastText has the advantage of being able to handle out-of-vocabulary words. It also has an advantage over BERT in that it does not treat all out-of-vocabulary words identically as the <UNK> token but obtains a vector representation by splitting each word into the n-grams of its subwords. Finally, it takes less time than ELMo, which must pass its input through a neural network composed of a charCNN and an LSTM, so it can be used efficiently as a submodule of the system.

Meanwhile, various embodiments of the present invention use an autoencoder, which is an artificial neural network.

The autoencoder according to an embodiment of the present invention is an artificial neural network that copies its input to its output and consists of two parts, an encoder and a decoder. The encoder is a recognition network that converts the input into an internal representation. The decoder is a generative network that converts the internal representation into the output.

The autoencoder according to an embodiment of the present invention has the same structure as a general multi-layer perceptron (MLP), except that its input and output layers have the same number of neurons. Because an autoencoder reconstructs its input at the output, the output of the network is also called the reconstruction, and by default the loss is computed from the difference between the output and the input.

According to an embodiment of the present invention, an undercomplete autoencoder, which compresses data by setting the number of neurons in the hidden layer smaller than the number of neurons in the input layer and thereby reducing the dimension, may be used for word embedding.

Because the hidden layer of the undercomplete autoencoder has a lower dimension than the input layer, the input cannot simply be copied to the output. An undercomplete autoencoder configured to output the same thing as its input can therefore learn the most important features of the input data set.

In addition, according to various embodiments of the present invention, various types of autoencoder may be used, such as a denoising autoencoder, which is trained to restore the original input after noise has been added to the input data.

Each of the conditions described above prevents the autoencoder from simply copying the input directly to the output, and steers it toward learning how to represent the data efficiently.

Unlike training with a general autoencoder that copies its input to its output, the autoencoder according to an embodiment of the present invention takes as input the average of the embedding vectors of the words in a sliding window, and its input and output are arranged so that it is trained to output the single embedding vector of the central word of that sliding window.

This corresponds to a condition (or constraint) that drives the weights of the artificial neural network to capture meaningful information, so that the autoencoder can ultimately output a distributed representation of a word that contains context information.
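By way of illustration only, the input/output arrangement described above may be sketched as follows. The function names, the toy two-dimensional embeddings, and the assumption that the window is centered on the central word are hypothetical and are not taken from the specification.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def window_pairs(
    tokens: List[str], embeddings: Dict[str, Vector], window: int = 5
) -> List[Tuple[Vector, Vector]]:
    """Build (input, target) pairs: window average -> central word embedding."""
    half = window // 2  # assumption: the window is centered on the central word
    pairs = []
    for i in range(half, len(tokens) - half):
        span = tokens[i - half : i + half + 1]
        average = mean_vector([embeddings[token] for token in span])
        pairs.append((average, embeddings[tokens[i]]))
    return pairs


# Toy embeddings, for illustration only.
toy_embeddings = {"the": [1.0, 0.0], "river": [1.0, 1.0], "bank": [0.0, 1.0]}
pairs = window_pairs(["the", "river", "bank"], toy_embeddings, window=3)
```

Each pair holds the averaged window vector as the network input and the embedding of the central word as the training target, rather than the input itself as in an ordinary autoencoder.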

Meanwhile, large-scale pre-trained machine learning language models have brought remarkable performance gains to transfer learning in natural language processing. However, natural language processing models based on machine learning commonly have difficulty handling the out-of-vocabulary problem.

Here, the out-of-vocabulary problem refers to the problem that arises when a model fails to properly recognize words that did not appear in the training data set. Fundamentally, it occurs because the model is trained using only the limited vocabulary contained in the training data set, whereas in the test environment it may receive words it has never been trained on. The problem arises naturally because the characteristics inherent in a natural language data set are governed by the domain of interest and by the period in which the data set is used, both of which vary.

A common way to handle this is to represent every out-of-vocabulary word with a single special token, <UNK>. Because all out-of-vocabulary words are then given the same representation, this can harm natural language processing tasks; for example, it degrades the performance of a document classification task in which the out-of-vocabulary word plays a decisive role in the classification.
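The loss of information caused by the single <UNK> token can be seen in a minimal sketch such as the following. The vocabulary, the token ids, and the function name are hypothetical examples.

```python
from typing import Dict, List


def encode(tokens: List[str], vocabulary: Dict[str, int], unk: str = "<UNK>") -> List[int]:
    """Map each token to its id; every unseen token falls back to the <UNK> id."""
    return [vocabulary.get(token, vocabulary[unk]) for token in tokens]


# Hypothetical vocabulary built from a training data set.
vocabulary = {"<UNK>": 0, "the": 1, "bank": 2}

first = encode(["the", "riverbank"], vocabulary)
second = encode(["the", "bankroll"], vocabulary)
```

Two different unseen words receive the same id, so a downstream model cannot tell them apart.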

Another way to address the out-of-vocabulary problem is to use sub-words, rather than words, as the basic unit. In this case, a tokenization scheme such as Byte-Pair Encoding, WordPiece, or SentencePiece may be used. Because sub-word tokenization splits a given word into smaller units, words can be taken as input using a vocabulary made up of a small number of entries, which helps in handling the out-of-vocabulary problem. However, when a word is split into sub-words, its original meaning may be difficult to convey.

Meanwhile, when a context (or sentence) with its central word missing is given, a person can infer the central word by reasoning from the meaning of the surrounding context and from prior learning experience. An embodiment of the present invention draws on this natural characteristic of the human neural network: using FastText and a sparse autoencoder, an average value that jointly accounts for the embedding of the word and the embeddings of the tokens in the sliding window of the sentence is given as input, and the system is trained to output the embedding value of the central word. With this system design, an appropriate word representation can be obtained from the context even when the central word is an out-of-vocabulary word.

Meanwhile, according to various embodiments of the present invention, word ambiguity may be resolved.

Word sense disambiguation, which consists of assigning the correct meaning to a word in a given context, is a key task in natural language processing and has many potential applications.

Despite breakthroughs in distributed semantic representations, that is, word embeddings, resolving lexical ambiguity remains a long-standing challenge in natural language processing. The task is difficult enough that even the baseline of simply selecting the most frequent sense (MFS) is hard to beat.

Word sense disambiguation interprets a sentence through its surrounding context, much as humans communicate on the basis of accumulated knowledge and experience.

According to an embodiment of the present invention, a knowledge-based method, which predicts a word appearing in a sentence using predefined lexical knowledge, may be used. Knowledge-based methods include dictionary-definition-based methods and graph-based methods. A dictionary-definition-based method infers meaning from the definition of a word given in a dictionary, and a graph-based method infers meaning by extracting semantic relationships between words from the thesaurus information of the words.

According to another embodiment of the present invention, a supervised learning method may be used, in which a machine learning model is built from a data set annotated with the meanings of the words in each sentence and is then used to predict the meaning of a word. Supervised learning methods show high performance because they use machine learning, but they require a large training data set to be constructed. In addition, because a single meaning can appear in various context patterns, constructing the training data set takes a great deal of time.

Notwithstanding the various methods described above, in various embodiments of the present invention it is preferable to use a system that combines a FastText module and a sparse autoencoder module, so that the representation of even an out-of-vocabulary word can carry syntactic information and semantic relationships.

Hereinafter, the configuration and operating method of an electronic device according to an embodiment of the present invention will be described in detail with reference to FIG. 3.

FIG. 3 is a block diagram of an electronic device that performs word embedding according to an embodiment of the present invention.

Referring to FIG. 3, an electronic device 300 that performs word embedding (hereinafter, the electronic device) may include a storage unit 310 and a processor 320.

The storage unit 310 may store various data used in the electronic device 300.

For example, the storage unit 310 may include volatile random access memory (RAM), non-volatile read-only memory (ROM), non-volatile magnetoresistive RAM (MRAM), and/or other types of memory. The storage unit 310 may include a data store for storing data (e.g., training data) and instructions executable by a controller/processor (e.g., instructions for performing the processes carried out by the electronic device 300 as described herein).

The data store may include one or more types of non-volatile storage, such as magnetic storage, optical storage, and solid-state storage.

The processor 320 may control the electronic device 300 as a whole. As an example, the processor 320 may perform word embedding using an artificial neural network. For example, the processor 320 may output, from a first artificial neural network, a plurality of embedding vector values for a plurality of words included in a sliding window having a preset size, and obtain the average of the plurality of embedding vector values. The processor 320 may then input the average of the plurality of embedding vector values to a second artificial neural network, and train the second artificial neural network to output the embedding vector value of the central word within the sliding window.

As an example, the embedding vector value of one of the plurality of words included in the sliding window may be defined as the sum of the vector value of at least one subword of that word and the vector value of the word itself.

As an example, the first artificial neural network may be a FastText module, and the subword may be an n-gram as defined in the FastText module. Here, the FastText module may take as input the result of one-hot encoding, in which a zero vector whose size equals the vocabulary size has the value 1 only at the index position of the corresponding word.
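The one-hot encoding mentioned above may be illustrated by the following minimal sketch. The three-word vocabulary and the function name are hypothetical.

```python
from typing import List


def one_hot(word: str, vocabulary: List[str]) -> List[int]:
    """Zero vector of vocabulary size with a 1 at the index of the given word."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


vocabulary = ["bank", "river", "money"]  # hypothetical vocabulary
encoded = one_hot("river", vocabulary)
```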

As an example, in the FastText module, the sliding window size may be set to 5 and the size of the embedding vector may be set to 300 dimensions.

Meanwhile, the central word within the sliding window may be either a registered word included in the training data set or an out-of-vocabulary (OOV) word not included in the training data set. Here, when the central word within the sliding window is an out-of-vocabulary word, the embedding vector value of the central word may be set to a zero vector.

As an example, the second artificial neural network may be a sparse autoencoder module, which is an artificial neural network that copies an input to an output.

As an example, the embedding vector value of the central word within the sliding window may be defined as a real-valued vector. In addition, the plurality of words included in the sliding window may belong to a single sentence.

Hereinafter, the two-step word embedding method will be described in detail with reference to FIGS. 4 to 8.

FIG. 4 is a flowchart of a two-step word embedding method according to an embodiment of the present invention.

The word embedding method according to an embodiment of the present invention proceeds in two steps. Here, the two-step word embedding method may include sequentially performing [Step 1: Distributed Representation of Words] and [Step 2: Contextualization of Distributed Representation].

[Step 1: Distributed Representation of Words]

According to an embodiment of the present invention, the processor 320 may train FastText word embeddings using the English Common Crawl data set as the training data set. The Common Crawl data set stores about 6 billion web documents in a publicly accessible and scalable cluster.

According to an embodiment of the present invention, the sliding window size of the trained FastText module was set to 5, and the size of the embedding vector was set to 300 dimensions. In addition, to clean the data before training the FastText module, all text was converted to lowercase, punctuation marks were replaced with spaces, and abbreviations were converted to their corresponding full forms.
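The data cleaning described above may be sketched as follows. This is an illustrative sketch only: the expansion table is a hypothetical example (the specification does not list the abbreviations that were expanded), and the final collapsing of repeated spaces is an added assumption.

```python
import re

# Hypothetical expansion table; the actual list is not given in the specification.
EXPANSIONS = {"can't": "cannot", "i'm": "i am", "it's": "it is"}


def clean_text(text: str) -> str:
    """Lowercase, expand abbreviations, and replace punctuation with spaces."""
    text = text.lower()
    # Expand before stripping punctuation, since the apostrophe is punctuation.
    for short, full in EXPANSIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation -> space
    return " ".join(text.split())  # assumption: collapse runs of whitespace


cleaned = clean_text("It's here, I can't go!")
```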

In addition, the processor 320 may perform Penn Treebank tokenization for tokenization. Using the FastText methodology, the processor 320 may split a word into character n-grams and take the sum of the vector values of the character n-grams and the vector value of the word itself as the embedding vector value of that word.
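As a minimal sketch of this composition, the following example splits a word into character n-grams and sums the n-gram vectors with the word vector. The boundary markers "<" and ">" and the n-gram length range of 3 to 6 follow common FastText practice and are assumptions here; the lookup tables and function names are hypothetical toy examples.

```python
from typing import Dict, List

Vector = List[float]


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """Character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i : i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


def embed_word(
    word: str,
    word_vectors: Dict[str, Vector],
    ngram_vectors: Dict[str, Vector],
    dim: int,
    n_min: int = 3,
    n_max: int = 6,
) -> Vector:
    """Sum of the word's own vector (if any) and its character n-gram vectors."""
    total = list(word_vectors.get(word, [0.0] * dim))
    for gram in char_ngrams(word, n_min, n_max):
        for k, value in enumerate(ngram_vectors.get(gram, [0.0] * dim)):
            total[k] += value
    return total


# Toy tables, for illustration only.
word_vectors = {"cat": [0.5, 0.5]}
ngram_vectors = {"<ca": [1.0, 0.0], "at>": [0.0, 2.0]}

known = embed_word("cat", word_vectors, ngram_vectors, dim=2, n_min=3, n_max=3)
unseen = embed_word("cab", word_vectors, ngram_vectors, dim=2, n_min=3, n_max=3)
```

The word "cab" has no word vector of its own in this toy table, yet it still receives a non-zero vector through the n-gram "<ca" that it shares with "cat".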

In addition, according to an embodiment of the present invention, the processor 320 may treat the vector representation of a sentence as the average of the vector values of all words in that sentence. In this case, the embedding vector representation of a sentence is the average of the character n-gram vector values of all words in the sentence.

Accordingly, out-of-vocabulary words can be handled appropriately in the word embedding process while more complex meaning is captured in the real-valued vector representation. In other words, according to an embodiment of the present invention, the FastText model may be used to cope with out-of-vocabulary words by using the morphological information of words and designating the sub-word as the smallest unit by which data can be distinguished.

Because this character n-gram based methodology uses all of the sub-word information within a word, it has the strength of being able to handle rare words that seldom appear in the training data set as well as out-of-vocabulary words.

The objective function of the FastText model is based on Equation 1 below, and the conditional probability of a word may be defined according to Equation 2 below.

Equation 1 above means that, over all T words, the model is trained to maximize the probability that the words within the same sliding window appear in the surrounding context, given a central word. In other words, the goal of this objective function is to train the central word vector and the context word vectors to be as similar as possible.
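The image of Equation 1 is not reproduced in this text. Assuming that it follows the standard Skip-gram objective on which FastText is built, it may be written as below. The symbols w_t (central word), C_t (set of context positions in the sliding window around position t), and w_c (context word) are assumed notation, not taken from the specification.

```latex
\max \; \sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \log p\left(w_c \mid w_t\right)
```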

When the model is trained with this equation, words that share similar contexts come to have similar vector values, and the distributed representations of the trained words contain the semantic and syntactic information inherent in the natural language data.

In addition, sub-word information can be included in the word embedding by using Equation 2 for the conditional probability of a word. In Equation 2, G denotes the set of character n-grams of all words in the corpus, and each character n-gram belonging to the central word has its own vector. When a word is expressed as a single word vector, it is represented by the sum of these vector values.
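The image of Equation 2 is likewise not reproduced in this text. Assuming the usual FastText scoring function, the conditional probability may be written as below. Here G_{w_t} (the subset of G containing the character n-grams of the central word w_t), z_g (the vector of n-gram g), v_c (the context word vector), and the softmax normalization over the vocabulary of size W are assumed notation.

```latex
s(w_t, w_c) = \sum_{g \in \mathcal{G}_{w_t}} \mathbf{z}_g^{\top} \mathbf{v}_c ,
\qquad
p\left(w_c \mid w_t\right) = \frac{\exp\left(s(w_t, w_c)\right)}{\sum_{j=1}^{W} \exp\left(s(w_t, w_j)\right)}
```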

When measuring the similarity between words, FastText differs from the Skip-gram methodology in that it uses multiple character n-gram vectors rather than a single word vector. That is, because a single word vector is expressed using several character n-gram vectors in this way, FastText can be regarded as an extended Skip-gram.

According to an embodiment of the present invention, after training with Equations 1 and 2, about 2 million word vectors may exist. The accuracy was 90% on semantic word analogy and 82% on syntactic word analogy.

Given that Word2Vec achieves about 76% on the former and about 75% on the latter, and GloVe about 83% on the former and about 76% on the latter, adopting FastText word embeddings for the Step 1 module is a competitive choice.

Using the word representations of the trained FastText model, examples of the nearest neighbor words of five homographs were obtained, as shown in Table 1 below. Table 1 shows the nearest neighbor words of the five homographs under the FastText word representation.

In the embodiment described above, cosine similarity was used as the distance metric. It should be noted that, for a homograph, the list of similar words mixes together, as nearest neighbors, words that appear in different contexts rather than in a single context. Accordingly, to distinguish the contexts in which a word appears, the present embodiment performs [Step 2: Contextualization of Distributed Representation] so that these contexts are mapped to different regions.
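A nearest-neighbor lookup by cosine similarity may be sketched as follows. The toy two-dimensional vectors and the function names are hypothetical and are not the vectors behind Table 1.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query: str, table: Dict[str, Vector], k: int = 3) -> List[Tuple[str, float]]:
    """The k words most similar to the query word, best first."""
    scored = [
        (word, cosine_similarity(table[query], vector))
        for word, vector in table.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors: "bank" sits between a river sense and a money sense.
table = {
    "bank": [1.0, 1.0],
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "far": [-1.0, -1.0],
}
neighbors = nearest_neighbors("bank", table, k=2)
```

In this toy example the single vector for "bank" is equally close to "river" and "money", which mirrors the mixing of contexts observed for homographs.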

[Step 2: Contextualization of Distributed Representation]

FIG. 5 shows the structure of a sparse autoencoder according to an embodiment of the present invention.

Contextualization of a distributed representation according to an embodiment of the present invention may be performed according to the structure of the sparse autoencoder shown in FIG. 5.

The processor 320 may use sparsity as one of several ways of constraining the autoencoder so that meaningful features can be extracted from the training data set.

Specifically, the processor 320 may reduce the number of neurons activated in the hidden layer of the autoencoder by adding a specific term to the loss function. For example, if the processor 320 imposes the condition that on average only 5% of the neurons in the hidden layer are active, the autoencoder must reconstruct the input from combinations of those 5% of neurons, which forces it to learn useful feature representations.

To configure the sparse autoencoder according to an embodiment of the present invention, the processor 320 may first measure the actual sparsity of the hidden layer during training.

To this end, the processor 320 may calculate the average activation of each neuron in the hidden layer over an entire training batch, the size of which is set so as not to be too small. The processor 320 may then add a sparsity loss term to the loss function to penalize neurons that are too active.

A simple way to obtain the sparsity loss is to add a squared error term (mean squared error, MSE). In the sparse autoencoder, however, the processor 320 may use the Kullback-Leibler divergence (KL divergence, KLD), whose gradient is steeper than that of the MSE, as shown in FIG. 6.

The processor 320 may calculate the difference between two probability distributions using the KLD, as shown in Equation 3 below.

That is, in the sparse autoencoder, the processor 320 may use the KLD to measure the divergence between the target probability that a neuron in the hidden layer will be activated and the actual probability, that is, the average activation over the training batch. Equation 4, which expresses this, is as follows.
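The images of Equations 3 and 4 are not reproduced in this text. Assuming the usual definitions, the KL divergence between two discrete distributions P and Q, and its special case for a target activation probability p and an actual mean activation q of a single neuron, may be written as below. The symbols P, Q, p, and q are assumed notation.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}
\qquad \text{(assumed form of Equation 3)}

D_{\mathrm{KL}}(p \,\|\, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}
\qquad \text{(assumed form of Equation 4)}
```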

Based on Equation 4, the processor 320 may obtain the sparsity loss for each neuron of the hidden layer. The processor 320 may sum all of these sparsity losses, multiply the sum by a sparsity weight hyperparameter, and add the result to the MSE, which is the value of the loss function. The result is the final loss, given by Equation 5 below.
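The computation of the final loss described above may be sketched as follows. This is an illustrative sketch under stated assumptions: the hidden activations are assumed to lie strictly between 0 and 1 (for example, sigmoid outputs), the target sparsity of 0.05 follows the 5% example given earlier, and the sparsity weight of 0.2 and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def kl_bernoulli(target: float, actual: float, eps: float = 1e-12) -> float:
    """KL divergence between a target and an actual activation probability."""
    actual = min(max(actual, eps), 1.0 - eps)  # keep the logarithms finite
    return target * math.log(target / actual) + (1.0 - target) * math.log(
        (1.0 - target) / (1.0 - actual)
    )


def mean_activations(hidden_batch: List[Vector]) -> Vector:
    """Average activation of each hidden neuron over the training batch."""
    count = len(hidden_batch)
    return [sum(column) / count for column in zip(*hidden_batch)]


def mean_squared_error(outputs: List[Vector], targets: List[Vector]) -> float:
    """Mean squared difference over all elements of the batch."""
    squared = [
        (o - t) ** 2
        for output, target in zip(outputs, targets)
        for o, t in zip(output, target)
    ]
    return sum(squared) / len(squared)


def total_loss(
    outputs: List[Vector],
    targets: List[Vector],
    hidden_batch: List[Vector],
    target_sparsity: float = 0.05,
    sparsity_weight: float = 0.2,  # hypothetical value
) -> float:
    """Reconstruction MSE plus the weighted sum of per-neuron sparsity losses."""
    sparsity = sum(
        kl_bernoulli(target_sparsity, q) for q in mean_activations(hidden_batch)
    )
    return mean_squared_error(outputs, targets) + sparsity_weight * sparsity
```

The sparsity term vanishes when every hidden neuron's mean activation equals the target and grows as the neurons become more active than the target.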

According to an embodiment of the present invention, the processor 320 may take the output vector obtained in Step 1 as the feature value. The processor 320 may then input it to the sparse autoencoder module and set the resulting output as the contextualized distributed representation of the corresponding sentence.

As a way of coping with out-of-vocabulary words, the processor 320 may perform training by dividing the single-meaning words into two categories.

Here, half of the single-meaning words are assumed to be registered words; for these, the input of the sparse autoencoder may be the average of the embeddings of the context words and the central word, and the output of the autoencoder may be the single embedding vector value of that central word.

The processor 320 may assume that the remaining half of the single-meaning words are out-of-vocabulary words and assign a zero vector to the corresponding central word by default. In this case, the input of the sparse autoencoder may be the average of the embedding vector values of the context words, and the output of the sparse autoencoder may be the single embedding vector value of that central word.
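The two categories of training pairs may be sketched as follows. Several details are assumptions made for illustration: the specification does not state how the two halves are chosen (windows are simply alternated here), and in the out-of-vocabulary case the average is taken over the context words only, which differs from averaging with the zero vector included only by a constant scale factor. The function names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def make_pair(
    window_vectors: List[Vector], center_index: int, treat_as_oov: bool
) -> Tuple[Vector, Vector]:
    """Return (autoencoder input, target) for one sliding window."""
    target = window_vectors[center_index]
    if treat_as_oov:
        # Central word is treated as unseen: average the context words only.
        used = [v for i, v in enumerate(window_vectors) if i != center_index]
    else:
        # Registered word: average the context words and the central word.
        used = window_vectors
    count = len(used)
    average = [sum(column) / count for column in zip(*used)]
    return average, target


def build_pairs(windows: List[Tuple[List[Vector], int]]) -> List[Tuple[Vector, Vector]]:
    """Alternate windows between the registered and out-of-vocabulary cases."""
    return [
        make_pair(vectors, center, treat_as_oov=(k % 2 == 1))
        for k, (vectors, center) in enumerate(windows)
    ]


window = [[1.0, 1.0], [4.0, 4.0], [3.0, 3.0]]  # toy window, center at index 1
pairs = build_pairs([(window, 1), (window, 1)])
```

In both cases the target is the true embedding of the central word, so the network learns to recover it even when the central word contributes nothing to the input.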

Through Step 2 described above, the vector representation of the central word can be obtained on the basis of the context words.

Hereinafter, the effectiveness of the two-step word embedding method described above is evaluated.

First, to evaluate the effectiveness of the two-step word embedding algorithm, the contextual distributed representations obtained by sequentially performing Step 1 and Step 2 are fed as embeddings to a binary classifier, and the performance of the classifier is measured on determining whether two given sentences are semantically identical and whether the same word is used with the same meaning in the two sentences.

The performance metric is accuracy, and the structure of the model is as shown in FIGS. 7A and 7B. FIGS. 7A and 7B show the structure of the document classifier used for the proof of concept according to an embodiment of the present invention. FIGS. 7A and 7B are connected top to bottom and together represent a single structure.

The hyperparameters used to train the document classifier are listed in Table 2 below.

A document classifier model according to an embodiment of the present invention is as follows. First, when a natural language word is given, the processor 320 may obtain its embedding value through the two steps described above. Through a convolution operation, the processor 320 may extract features while preserving the spatial information contained in the context window, rather than relying on narrow local comparisons. The document classifier model according to an embodiment of the present invention may thereby serve as a high-performance document classifier.

Next, a dropout layer is added to guard against overfitting, so that a more general model can be derived. Global max pooling reduces the number of parameters and hence the amount of computation, which saves hardware resources (that is, energy) and improves speed.

In addition, batch normalization and PReLU (Parametric ReLU) activation are used to mitigate the vanishing gradient problem and to make error backpropagation more efficient. Finally, through sigmoid activation, the processor 320 may determine whether the two sentences are semantically identical as a binary output of 0 or 1. The total number of parameters in the model, which performs these operations repeatedly, is about 170 million, of which about 6,000 may be trainable.

In the embodiment described above, the batch size is set to a rather large value of 512 for training stability, and training runs for 200 epochs. Of the data in the training data set, 90% is used as training data and the remaining 10% as validation data, and shuffling is applied for more robust training. Binary cross-entropy is used as the training loss function and Adam as the optimizer.
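The shuffled 90/10 split and the binary cross-entropy loss mentioned above may be sketched as follows. The fixed random seed, the clipping constant, and the function names are assumptions added for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple


def shuffle_split(
    samples: Sequence, validation_fraction: float = 0.1, seed: int = 0
) -> Tuple[List, List]:
    """Shuffle the samples, then split them into (training, validation) lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # assumption: seeded for repeatability
    n_validation = int(round(len(items) * validation_fraction))
    return items[n_validation:], items[:n_validation]


def binary_cross_entropy(
    labels: Sequence[int], probabilities: Sequence[float], eps: float = 1e-12
) -> float:
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


training, validation = shuffle_split(range(100))
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
```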

Finally, the overall system according to an embodiment of the present invention has the structure shown in FIG. 8.

Hereinafter, experimental results showing the efficiency and accuracy of the two-step word embedding method (or algorithm) described above will be described in detail with reference to FIGS. 9 to 15.

First, as an experiment to demonstrate the performance of the two-step word embedding method according to the various embodiments of the present invention described above, visualization using t-SNE (t-Distributed Stochastic Neighbor Embedding) dimensionality reduction is used to confirm that out-of-vocabulary (OOV) words and homographs are mapped to different positions in the vector space according to their contextual meanings.

In addition, registered (in-vocabulary) words are treated as if they were OOV words, and the difference between the output vector value of the two-step word embedding method and the verification value, that is, the reconstruction error, is measured.

In addition, the effect on performance of using the two-step word embedding method as a preprocessing module of the sentence classifier according to the various embodiments of the present invention described above is examined.

Through these evaluations, the two-step word embedding method is shown to be useful for the OOV problem and the word sense disambiguation problem.

[Embedding Learning Result Analysis - Out-of-Vocabulary Case Analysis]

Manifold learning according to an embodiment of the present invention is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high. That is, manifold learning achieves better dimensionality reduction by constructing complex mappings that preserve the interesting structure inherent in the data.

t-SNE according to an embodiment of the present invention is one such method; it seeks a dimensionality reduction that best preserves the distances between data points. In other words, points that are close in the original feature space are mapped so that they remain close, and points that are far apart remain far apart in the reduced space.

Here, t-SNE focuses on preserving information about neighboring data points. This property can be used to visualize the context of out-of-vocabulary (OOV) words.

According to an embodiment of the present invention, t-SNE is used to check whether the two-step word embedding algorithm has learned meaningful real-valued vector representations, such that OOV words are assigned to different clusters according to their semantic context. To this end, three different registered words are treated as OOV words in each example.

FIGS. 9A and 9B show two-dimensional t-SNE mapping visualizations for the words apple, rock, and star. FIGS. 9A and 9B are the results of two-dimensional visualization with t-SNE using the embedding values of the registered words as they are.

FIGS. 9C and 9D show two-dimensional t-SNE mapping visualizations for the words plant, fox, and party. FIGS. 9C and 9D are the results obtained by passing the registered words through the present system as if they were out-of-vocabulary (OOV) words and overlaying the output on the dimensionality reduction results of FIGS. 9A and 9B.

In FIGS. 9A to 9D, the same marker is assigned to the same word.

The results of FIGS. 9A to 9D confirm that the words are still grouped into three clusters rather than six. This means that the two-step word embedding method according to an embodiment of the present invention can handle OOV words while effectively capturing semantic context in the distributed representation of the words.

[Embedding Learning Result Analysis - Homograph Case Analysis]

Using cosine similarity as the measure, Table 3 below lists the nearest-neighbor words of five homographs that were semantically separated by the two-step word embedding method according to various embodiments of the present invention.
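A nearest-neighbor lookup by cosine similarity can be sketched as follows. The three-dimensional vectors and the word list are hypothetical toy values chosen for illustration; they are not the embeddings or the table entries of the embodiment.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, vocab_vectors, k=3):
    """Return the k (word, similarity) pairs most similar to the query vector."""
    scored = [(word, cosine_similarity(query, vec))
              for word, vec in vocab_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "fruit": [0.9, 0.1, 0.0],
    "iphone": [0.1, 0.9, 0.1],
    "orchard": [0.8, 0.0, 0.2],
}
apple_in_food_context = [1.0, 0.1, 0.1]
neighbors = nearest_neighbors(apple_in_food_context, toy_vectors, k=2)
```

Because the query vector here stands for "apple" in a food-related context, its two nearest toy neighbors are "fruit" and "orchard" rather than "iphone".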

Referring to Table 2, it can be seen that even though the surface form of each word is the same, the neighboring words were extracted logically according to the context in which the word appears. Such topical contexts were collected 100 at a time, giving a total of 200 embedding values extracted for each word; applying two-dimensional t-SNE reduction to the mapped embedding values and visualizing them gives the results shown in FIGS. 10A to 10F and FIGS. 11A to 11D.

FIGS. 10A to 10F relate to the words apple, rock, and star. FIGS. 10A, 10C, and 10E show two-dimensional t-SNE mapping visualizations of the context-dependent word embeddings obtained with a conventional word embedding method, and FIGS. 10B, 10D, and 10F show the corresponding visualizations obtained with the two-step word embedding method according to an embodiment of the present invention.

FIGS. 11A to 11D relate to the words plant and cell. FIGS. 11A and 11C show two-dimensional t-SNE mapping visualizations of the context-dependent word embeddings obtained with a conventional word embedding method, and FIGS. 11B and 11D show the corresponding visualizations obtained with the two-step word embedding method according to an embodiment of the present invention.

The above experimental results show that the two-step word embedding method according to an embodiment of the present invention can obtain, from natural language words, distributed word representations that are separated according to semantic context.

[Embedding Learning Result Analysis - Loss Function Analysis]

Anomaly detection using an autoencoder according to an embodiment of the present invention is used to check whether the proposed algorithm can generate effective word embeddings for out-of-vocabulary (OOV) words.

The process of performing anomaly detection with an autoencoder is as follows.

First, the processor 320 compresses an input sample to a lower dimension through the encoder and then passes it through the decoder to reconstruct it in the original dimension. The processor 320 then obtains the reconstruction error between the input sample and the reconstructed sample; this reconstruction error serves as an anomaly score and can be compared with a threshold. As a result of this process, whether or not the sample is anomalous is finally determined.
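The scoring step of this process can be sketched as follows. The text does not state which error measure is used, so mean squared error is an assumption here, and the encoder and decoder are left out: the function only compares an input vector with an already reconstructed vector.

```python
def reconstruction_error(original, restored):
    """Mean squared error between an input vector and its reconstruction.
    MSE is an assumed choice of error measure."""
    if len(original) != len(restored):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original)

def is_anomalous(original, restored, threshold):
    """Treat the reconstruction error as the anomaly score and flag the
    sample when the score exceeds the threshold."""
    return reconstruction_error(original, restored) > threshold
```

A perfectly reconstructed sample has a score of 0 and is never flagged, while a poorly reconstructed one exceeds the threshold and is judged anomalous.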

Through the anomaly detection process described above, the autoencoder is trained to preserve as much meaningful information as possible when passing through the bottleneck where the dimension is reduced, in order to minimize the reconstruction error.

For example, if an abnormal outlier sample is given at test time, the autoencoder will not be able to compress and reconstruct it effectively. In this case, the reconstruction error will be large, and the sample can be judged to be abnormal. Such an autoencoder can thus also be interpreted from the viewpoint of the manifold hypothesis mentioned above.

According to an embodiment of the present invention, the processor 320 may use this property of the autoencoder to calculate the reconstruction error for 500 registered words and 500 out-of-vocabulary (OOV) words from the Common Crawl data set. In other words, the experiment rests on the reasoning that if the semantic and syntactic representations inherent in the natural language data were properly learned in the embedding through the sequential learning according to an embodiment of the present invention, the reconstruction error will be small, and otherwise it will be large.

FIG. 12 shows the experimental results of the loss function analysis. The threshold was calculated from the precision-recall curve and was set to 0.66 in this experiment.

In the experiment, the accuracy was about 89% and the precision was about 90%, confirming that learning was effective. In addition, the visualization shows that the reconstruction error for registered words fell below the threshold, confirming that the syntactic and semantic information of the words was well learned in the embedding. Most OOV words were also reconstructed well, confirming that the model learned a meaningful ability to handle them.
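Accuracy and precision at a fixed threshold can be computed as in the sketch below. The error values and labels are hypothetical toy data, and the convention that label 1 denotes the class predicted when the error exceeds the threshold is an assumption; only the threshold value 0.66 comes from the text.

```python
def confusion_counts(errors, labels, threshold):
    """Count (TP, FP, TN, FN), predicting class 1 when the error exceeds
    the threshold. The meaning of class 1 is an assumed convention."""
    tp = fp = tn = fn = 0
    for err, label in zip(errors, labels):
        predicted = 1 if err > threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Fraction of class-1 predictions that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical toy data, for illustration only.
toy_errors = [0.2, 0.5, 0.7, 0.9, 0.6]
toy_labels = [0, 0, 1, 1, 1]
tp, fp, tn, fn = confusion_counts(toy_errors, toy_labels, threshold=0.66)
```

On this toy data the counts are 2 true positives, 0 false positives, 2 true negatives, and 1 false negative, giving an accuracy of 0.8 and a precision of 1.0.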

Hereinafter, a proof of concept is carried out with sentence classifiers related to word sense disambiguation, using the Quora Question Pairs and WiC data sets. The two data sets are alike in that both target the word sense disambiguation task. The proof of concept using the classifiers shows that the two-step word embedding method according to an embodiment of the present invention improves performance in natural language processing applications.

[Proof of Concept Using a Classifier - Natural Language Sentence Identity Determination]

Quora is a website that serves as a platform on which people share knowledge with each other through questions and answers. Because more than 100 million people visit Quora every month, many similar questions from many people accumulate there.

The Quora Question Pairs data set was built against this background. Each record in the data set consists of two natural language questions paired with a label indicating whether or not they are semantically identical (duplicate).

This can be regarded as a sub-problem of word sense disambiguation. The data set consists of about 400,000 question pairs: about 250,000 non-duplicate (negative) examples and about 150,000 duplicate (positive) examples.

Hereinafter, using the Quora Question Pairs data set, sentence classifiers for the natural language sentence identity determination problem are trained with the conventional out-of-vocabulary (OOV) handling method and with the two-step word embedding method according to an embodiment of the present invention, and their performance is compared. Each experiment was run five times, and the results are shown in FIG. 13.

FIG. 13 shows the change in performance of the natural language sentence identity classifier on the Quora Question Pairs data set.

The conventional OOV handling method has an average accuracy of 82.5%, a minimum of 82.1%, and a maximum of 82.9%, whereas the two-step word embedding method according to an embodiment of the present invention has an average accuracy of 84%, a minimum of 83.9%, and a maximum of 84.4%.

This confirms that natural language sentence identity determination can be performed effectively by using the two-step word embedding method according to an embodiment of the present invention.

[Proof of Concept Using a Classifier - Word Sense Disambiguation]

Depending on the context, an ambiguous word can express several potentially unrelated meanings. Contextualized word embeddings can overcome this limitation by computing dynamic representations of words that adapt to the context.

In this regard, the task of the system for the WiC data set according to an embodiment of the present invention is to identify the intended meaning of a word. WiC is framed as a binary classification task. Each instance of WiC has a target word w, which is either a verb or a noun, and two contexts, that is, sentences, are provided for it.

Each of these contexts triggers a specific meaning of w. The task is to determine whether the occurrences of w in the two contexts correspond to the same meaning. The data set can thus also be viewed as an application of word sense disambiguation in practice.

WiC is suitable for evaluating a wide range of applications, including contextualized word and sense representations and word sense disambiguation. Unlike the Stanford Contextual Word Similarity (SCWS) data set, it pairs identical words with each other, so it is a binary classification data set on which context-insensitive word embedding models perform similarly to a random baseline. It was built using high-quality annotations collected and curated by experts.

Hereinafter, using the WiC data set, sentence classifiers for the word sense disambiguation problem are trained with the conventional out-of-vocabulary (OOV) handling method and with the two-step word embedding method according to an embodiment of the present invention, and their performance is compared. Each experiment was run 10 times, and the results are shown in FIG. 14.

FIG. 14 shows the change in performance of the word sense disambiguation classifier on the WiC data set. Referring to FIG. 14, the conventional OOV handling method has an average accuracy of 60.16%, a minimum of 59%, and a maximum of 60.5%. The two-step word embedding method according to an embodiment of the present invention has an average accuracy of 60.98%, a minimum of 60.7%, and a maximum of 61.4%.

This shows that using the two-step word embedding method according to an embodiment of the present invention makes word sense disambiguation possible and allows the process of finding a high-quality answer to be performed more efficiently.

In addition, FIG. 15 shows the performance of the two methods when the target word w is assumed to be an OOV word. FIG. 15 shows the change in performance of the word sense disambiguation classifier on the WiC data set when the target word is treated as an OOV word.

Referring to FIG. 15, the conventional OOV handling method has an average accuracy of 59.44%, a minimum of 59.2%, and a maximum of 59.9%, whereas the two-step word embedding method according to an embodiment of the present invention has an average accuracy of 61.27%, a minimum of 60.9%, and a maximum of 61.7%. This shows that the two-step word embedding method according to an embodiment of the present invention is effective for the OOV handling problem.

The two-step word embedding method for solving the out-of-vocabulary (OOV) problem and the word sense disambiguation problem has been described in detail above according to various embodiments of the present invention.

In step 1 according to an embodiment of the present invention, the artificial neural network of the FastText methodology was used so that word embedding values can be obtained even for OOV words.

In step 2, which takes the feature values extracted in step 1 as input, not only the features of the central word but also information on the context to which the word belongs were used in order to handle homographs.

With this approach, a word embedding value could be obtained even for a word that is not present in the training data, and different word embedding values could be obtained for words with the same surface form depending on the contexts to which they belong.

To demonstrate the performance of the proposed algorithm, natural language words were mapped into a vector space with the two-step word embedding technique, and a case analysis was performed to check where OOV words and homographs are located in that space according to their contextual meaning.

In addition, the usefulness of the algorithm was demonstrated through two further experiments: one checking the loss function results and one checking the change in performance of the sentence classifier.

As a result of the experiments according to various embodiments of the present invention, word embedding values could be obtained effectively even for OOV words. In addition, dimensionality reduction and two-dimensional visualization with t-SNE confirmed that even when two words are homographs, they are not mapped to the same feature value; instead, feature values that match the meaning are obtained adaptively according to the context to which each word belongs.

In addition, checking the reconstruction error of the loss function confirmed that the two-step word embedding method according to an embodiment of the present invention was properly trained to derive appropriate embedding values for both registered and OOV words.

In addition, on the two data sets it was confirmed that, compared with the conventional approach, using the two-step word embedding method according to an embodiment of the present invention as a preprocessing module improved the average performance of the sentence classifier from about 82.5% to 84% (about 1.5 percentage points) and from about 59% to 61% (about 2 percentage points). This demonstrates that the two-step word embedding method according to an embodiment of the present invention consistently improves performance across multiple data sets.
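The improvements can be checked directly from the average accuracies reported for FIGS. 13 to 15. The sketch below only performs that subtraction; the function name is illustrative.

```python
def gain_in_points(baseline_avg, proposed_avg):
    """Difference between two average accuracies, in percentage points."""
    return round(proposed_avg - baseline_avg, 2)

# Average accuracies (%) as reported in the text for each experiment.
quora_gain = gain_in_points(82.5, 84.0)       # Quora Question Pairs (FIG. 13)
wic_gain = gain_in_points(60.16, 60.98)       # WiC (FIG. 14)
wic_oov_gain = gain_in_points(59.44, 61.27)   # WiC, target word as OOV (FIG. 15)
```

The differences are 1.5, 0.82, and 1.83 percentage points respectively, so the rounded figure of about 2 points corresponds to the WiC experiment in which the target word is treated as out-of-vocabulary.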

Because the two-step word embedding method according to the various embodiments of the present invention described above can efficiently handle the out-of-vocabulary problem and the word sense disambiguation problem inherent in natural language data sets, it may be applied to various natural language processing applications such as chatbots and search engines.

The various components of the two-step word embedding method according to the various embodiments of the present invention described above may be operated or controlled by the processor 320.

FIG. 16 is a block diagram of a detailed configuration of an electronic device according to an embodiment of the present invention.

Referring to FIG. 16, the electronic device 1600 includes a communication unit 1610, a storage unit 1620, and a processor 1630.

The communication unit 1610 performs communication. The communication unit 1610 may communicate with external electronic devices through various communication methods such as BT (Bluetooth), Wi-Fi (Wireless Fidelity), ZigBee, IR (Infrared), and NFC (Near Field Communication).

The storage unit 1620 may store an O/S (Operating System) software module for driving the electronic device 1600, data for configuring various UI screens provided in the display area, and the like. The storage unit 1620 is also readable and writable.

In particular, the storage unit 1620 may store a training data set or data derived in the two-step word embedding process.

The processor 1630 controls the overall operation of the electronic device 1600 using various programs stored in the storage unit 1620.

Specifically, the processor 1630 includes a RAM 1631, a ROM 1632, a main CPU 1633, a graphics processing unit 1634, first to n-th interfaces 1635-1 to 1635-n, and a bus 1636.

Here, the RAM 1631, the ROM 1632, the main CPU 1633, the graphics processing unit 1634, the first to n-th interfaces 1635-1 to 1635-n, and the like may be connected to each other through the bus 1636.

The first to n-th interfaces 1635-1 to 1635-n are connected to the various components described above. One of the interfaces may be a network interface connected to an external device through a network.

The ROM 1632 stores an instruction set for system booting and the like. When a turn-on command is input and power is supplied, the main CPU 1633 copies the O/S stored in the storage unit 1620 to the RAM 1631 according to the instructions stored in the ROM 1632, and executes the O/S to boot the system.

When booting is completed, the main CPU 1633 copies the various stored application programs to the RAM 1631 and executes the application programs copied to the RAM 1631 to perform various operations.

The main CPU 1633 accesses the storage unit 1620 and performs booting using the O/S stored in the storage unit 1620. The main CPU 1633 also performs various operations using the various programs, content, data, and the like stored in the storage unit 1620.

The graphics processing unit 1634 generates a screen including various objects such as icons, images, and text by using a calculation unit and a rendering unit.

FIG. 17 is a flowchart of an operating method of an electronic device that performs word embedding according to an embodiment of the present invention.

Referring to FIG. 17, the operating method of the electronic device that performs word embedding may include: outputting, from a first artificial neural network, a plurality of embedding vector values for a plurality of words included in a sliding window having a preset size, and obtaining the average of the plurality of embedding vector values (S1710); and inputting the average of the plurality of embedding vector values to a second artificial neural network and training the second artificial neural network to output the embedding vector value of the central word in the sliding window (S1720).
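The averaging in step S1710 can be sketched as follows. The way the window is positioned is an assumption: here a window of `window_size` tokens is centered on each word and truncated at the sentence boundaries, and the central word's own vector is included in the average. The first and second neural networks themselves are not modeled.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def window_averages(sentence_vectors, window_size=5):
    """For each position in the sentence, average the embedding vectors of
    the words inside a window centered on that position (assumed layout)."""
    half = window_size // 2
    averages = []
    for center in range(len(sentence_vectors)):
        lo = max(0, center - half)
        hi = min(len(sentence_vectors), center + half + 1)
        averages.append(average_vectors(sentence_vectors[lo:hi]))
    return averages

# Toy sentence of three words with 2-dimensional embeddings.
toy_sentence = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]]
toy_averages = window_averages(toy_sentence, window_size=3)
```

In step S1720, each averaged vector would be paired with the embedding of the corresponding central word as the training target of the second network.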

In the operating method described above, the embedding vector value of any one of the plurality of words included in the sliding window may be the sum of the vector values of at least one subword of that word and the vector value of the word itself.
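The composition of a word vector from its subwords can be sketched as follows. The boundary markers `<` and `>` and the n-gram lengths 3 to 6 follow the usual FastText convention, but the dictionary lookup tables are hypothetical stand-ins: an actual FastText model hashes n-grams into buckets rather than storing them by name.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

def compose_word_vector(word, word_vectors, ngram_vectors, dim):
    """Sum the vectors of all known subwords and, if the word itself is
    registered, its own vector. Both tables are hypothetical dictionaries."""
    total = [0.0] * dim
    parts = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if word in word_vectors:
        parts.append(word_vectors[word])
    for vec in parts:
        for i, x in enumerate(vec):
            total[i] += x
    return total

# Toy lookup tables, for illustration only.
toy_words = {"cat": [1.0, 1.0]}
toy_ngrams = {"<ca": [0.5, 0.0], "at>": [0.0, 0.5]}
```

With these toy tables, the registered word "cat" receives the sum of its own vector and two subword vectors, while the unseen word "cats" still receives a non-zero vector from the shared subword `<ca`.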

For example, the first artificial neural network may be a FastText module, and the subwords may be n-grams defined in the FastText module.

The FastText module described above may take as input the result of one-hot encoding, in which only the index position of the corresponding word has the value 1 in a zero vector whose length is the vocabulary size.
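One-hot encoding as described here can be sketched in a few lines. The vocabulary list is a toy example, and the function name is illustrative.

```python
def one_hot(word, vocabulary):
    """Return a zero vector of vocabulary size with a 1 at the word's index."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1  # raises ValueError for an unknown word
    return vec

toy_vocabulary = ["apple", "rock", "star"]
```

For the toy vocabulary, `one_hot("rock", toy_vocabulary)` gives `[0, 1, 0]`.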

In addition, in the FastText module described above, the sliding window may be set to 5 and the size of the embedding vector may be set to 300 dimensions.

In the embodiment of the present invention described above, the central word in the sliding window may be either a registered word included in the training data set or an out-of-vocabulary (OOV) word not included in the training data set.

In addition, in the embodiment of the present invention described above, when the central word in the sliding window is an OOV word, the embedding vector value of the central word may be set to a zero vector.
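
The handling of registered and OOV central words can be sketched as a lookup with a zero-vector fallback. The dictionary `embeddings` and the function name `center_vector` are hypothetical and used only for illustration.

```python
import numpy as np

def center_vector(word, embeddings, dim):
    """Return the stored vector for a registered word, or a zero vector for an OOV word."""
    if word in embeddings:
        return embeddings[word]
    return np.zeros(dim)

embeddings = {"cat": np.array([0.2, -0.1, 0.7])}   # hypothetical registered vocabulary
known = center_vector("cat", embeddings, dim=3)
unknown = center_vector("zyzzyva", embeddings, dim=3)   # not registered, so all zeros
```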

In addition, in the embodiment of the present invention described above, the second artificial neural network may be a sparse autoencoder module.
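
As a rough illustration of such a module, the sketch below trains a one-hidden-layer network that maps window averages to central-word vectors while penalizing the L1 norm of the hidden activations. The description does not specify the architecture, so the ReLU hidden layer, the L1 form of the sparsity penalty, the layer sizes, the plain gradient-descent optimizer, and the random stand-in data are all assumptions.

```python
import numpy as np

def init_params(dim_in, dim_hidden, dim_out, rng):
    return {
        "W1": rng.normal(scale=0.1, size=(dim_in, dim_hidden)),
        "b1": np.zeros(dim_hidden),
        "W2": rng.normal(scale=0.1, size=(dim_hidden, dim_out)),
        "b2": np.zeros(dim_out),
    }

def forward(params, x):
    z = x @ params["W1"] + params["b1"]
    h = np.maximum(z, 0.0)                 # ReLU hidden code
    y = h @ params["W2"] + params["b2"]    # predicted central-word vector
    return z, h, y

def loss_fn(h, y, target, sparsity_weight):
    n = len(target)
    error = 0.5 * np.sum((y - target) ** 2) / n
    sparsity = sparsity_weight * np.sum(np.abs(h)) / n   # L1 penalty on the code
    return error + sparsity

def train_step(params, x, target, sparsity_weight=1e-3, lr=0.01):
    """One gradient-descent step; returns the loss before the update."""
    n = len(x)
    z, h, y = forward(params, x)
    dy = (y - target) / n
    dh = dy @ params["W2"].T + sparsity_weight * np.sign(h) / n
    dz = dh * (z > 0)                      # ReLU gradient mask
    grads = {"W1": x.T @ dz, "b1": dz.sum(axis=0),
             "W2": h.T @ dy, "b2": dy.sum(axis=0)}
    for name in params:
        params[name] -= lr * grads[name]
    return loss_fn(h, y, target, sparsity_weight)

rng = np.random.default_rng(0)
averages = rng.normal(size=(64, 8))                        # stand-in window averages
centers = averages @ rng.normal(scale=0.3, size=(8, 8))    # stand-in central-word vectors
params = init_params(8, 16, 8, rng)
losses = [train_step(params, averages, centers) for _ in range(500)]
```

The sparsity term pushes most hidden units toward zero for any given input, which is the property that distinguishes a sparse autoencoder from a plain one.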

In addition, in the embodiment of the present invention described above, the embedding vector value of the central word in the sliding window may be a real value.

The plurality of words included in the sliding window described above may be included in a single sentence.
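
One way to keep every window inside a single sentence is to slide the window over each sentence separately, as in the sketch below. Treating `window_size` as the total number of words and skipping sentences shorter than the window are assumptions; the description does not say how short sentences are handled.

```python
def sentence_windows(sentences, window_size=5):
    """Collect full windows that never cross a sentence boundary."""
    windows = []
    for sentence in sentences:
        # range() is empty when the sentence is shorter than the window.
        for start in range(len(sentence) - window_size + 1):
            windows.append(sentence[start:start + window_size])
    return windows

sentences = [["a", "b", "c", "d", "e", "f"], ["g", "h", "i"]]
windows = sentence_windows(sentences)
```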

Meanwhile, the method of operating an electronic device that performs word embedding according to the various embodiments of the present invention described above may be implemented as computer-executable program code, stored in various non-transitory computer-readable media, and provided to servers or devices so as to be executed by a processor.

As an example, a non-transitory computer-readable medium may be provided that stores a program for performing a step of outputting, from a first artificial neural network, a plurality of embedding vector values for a plurality of words included in a sliding window having a preset size and obtaining an average of the plurality of embedding vector values, and a step of inputting the average of the plurality of embedding vector values to a second artificial neural network and training the second artificial neural network to output the embedding vector value of the central word within the sliding window.

A non-transitory readable medium is not a medium that stores data for a short moment, such as a register, cache, or memory, but a medium that stores data semi-permanently and can be read by a device. Specifically, the various applications or programs described above may be stored and provided in a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB, memory card, or ROM.

Although embodiments of the present invention have been shown and described above, those skilled in the art will understand that various changes in form and detail may be made without departing from the spirit and scope of the embodiments as defined by the appended claims and their equivalents.

Electronic device: 300
Storage: 310, 1620
Communication unit: 1610
Processor: 320, 1630

Claims

An electronic device for performing word embedding, comprising:
a storage; and
a processor configured to output, from a first artificial neural network, a plurality of embedding vector values for a plurality of words included in a sliding window having a preset size, to obtain an average of the plurality of embedding vector values,
to input the average of the plurality of embedding vector values to a second artificial neural network, and to train the second artificial neural network to output an embedding vector value of a central word within the sliding window.

The electronic device of claim 1,
wherein an embedding vector value of one of the plurality of words included in the sliding window is
a value obtained by summing a vector value for at least one subword of the one of the plurality of words and a vector value for the one of the plurality of words.

The electronic device of claim 2,
wherein the first artificial neural network is a FastText module, and
the subword is an n-gram defined in the FastText module.

The electronic device of claim 3,
wherein the FastText module
receives as input a result of one-hot encoding in which only the index position of a corresponding word has a value of 1 in a zero vector having the size of the word set.

The electronic device of claim 3,
wherein, in the FastText module,
the sliding window is set to 5 and the size of the embedding vector is set to 300 dimensions.

The electronic device of claim 1,
wherein the central word in the sliding window is
one of a registered word included in a training data set or an out-of-vocabulary (OOV) word not included in the training data set.

The electronic device of claim 6,
wherein, when the central word in the sliding window is the out-of-vocabulary word, the embedding vector value of the central word is set to a zero vector.

The electronic device of claim 1,
wherein the second artificial neural network is a sparse autoencoder module.

The electronic device of claim 1,
wherein the embedding vector value of the central word in the sliding window is a real value.

The electronic device of claim 1,
wherein the plurality of words included in the sliding window are included in a single sentence.

A method of operating an electronic device for performing word embedding, the method comprising:
outputting, from a first artificial neural network, a plurality of embedding vector values for a plurality of words included in a sliding window having a preset size, to obtain an average of the plurality of embedding vector values; and
inputting the average of the plurality of embedding vector values to a second artificial neural network, and training the second artificial neural network to output an embedding vector value of a central word within the sliding window.

The method of claim 11,
wherein an embedding vector value of one of the plurality of words included in the sliding window is
a value obtained by summing a vector value for at least one subword of the one of the plurality of words and a vector value for the one of the plurality of words.

The method of claim 12,
wherein the first artificial neural network is a FastText module, and
the subword is an n-gram defined in the FastText module.

The method of claim 13,
wherein the FastText module
receives as input a result of one-hot encoding in which only the index position of a corresponding word has a value of 1 in a zero vector having the size of the word set.

The method of claim 13,
wherein, in the FastText module,
the sliding window is set to 5 and the size of the embedding vector is set to 300 dimensions.

The method of claim 11,
wherein the central word in the sliding window is
one of a registered word included in a training data set or an out-of-vocabulary (OOV) word not included in the training data set.

The method of claim 16,
wherein, when the central word in the sliding window is the out-of-vocabulary word, the embedding vector value of the central word is set to a zero vector.

The method of claim 11,
wherein the second artificial neural network is a sparse autoencoder module.

The method of claim 11,
wherein the embedding vector value of the central word in the sliding window is a real value.

The method of claim 11,
wherein the plurality of words included in the sliding window are included in a single sentence.

A non-transitory computer-readable medium storing computer instructions that, when executed by a processor of an electronic device, cause the electronic device to perform an operation, the operation comprising:
outputting, from a first artificial neural network, a plurality of embedding vector values for a plurality of words included in a sliding window having a preset size, to obtain an average of the plurality of embedding vector values; and
inputting the average of the plurality of embedding vector values to a second artificial neural network, and training the second artificial neural network to output an embedding vector value of a central word within the sliding window.