KR20230014034A

KR20230014034A - Improving classification accuracy using further pre-training method and device with selective masking

Info

Publication number: KR20230014034A
Application number: KR1020210139364A
Authority: KR
Inventors: 김남규; 서수민
Original assignee: 국민대학교산학협력단
Priority date: 2021-07-20
Filing date: 2021-10-19
Publication date: 2023-01-27

Abstract

The present invention relates to a selective masking-based further pre-training method and device for improving classification accuracy. The method comprises: performing sentiment classification on the training data of a training data population for pre-training; generating an effective word corpus from the words of the training data population according to the result of the sentiment classification; building an extended tokenizer by applying the effective word corpus to a tokenizer built through the pre-training; masking the training data using the extended tokenizer; and performing further pre-training using the masked training data.

Description

Selective masking-based further pre-training method and device for improving classification accuracy

The present invention relates to further pre-training technology, and more particularly to a selective masking-based further pre-training method and device for improving classification accuracy, capable of performing further pre-training specialized for a specific classification task through selective masking.

Recent breakthroughs in natural language processing and text mining have prompted attempts in many fields to apply text analysis techniques to domain-specific problems. Natural language processing is the process of quantifying human language, such as text, and expressing it as vectors so that a computer can process it. Text data carries both linguistic features and domain information, and the core of natural language processing technology is to reflect these features and this information as fully as possible when representing the text in a given vector space; this is called embedding. The vectors extracted through embedding serve as an element technology in many research areas that use text, such as sentiment analysis, which predicts the polarity of opinions, sentiments, and attitudes in text; named entity recognition, which recognizes entity types such as person and place names; and machine translation, which translates languages automatically.

With the advance of deep learning, research on applying deep learning techniques to improve embedding performance has also been active in natural language processing. Recently, text embedding based on a pre-trained language model, that is, a model trained on a large corpus using deep learning, has become the main approach for analysis. As interest has grown in transfer learning, which can deliver high performance through additional training on a small amount of data on top of a pre-trained language model, various pre-trained language models have been released, and BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding) in particular is widely used as a representative model.

BERT improves on the unidirectional structure that limited earlier pre-trained language models by adopting a bidirectional structure. MLM (Masked Language Model), one of BERT's pre-training methods, learns contextual information in an unsupervised manner by randomly selecting and masking words in an input sentence and then predicting the original words. Unlike conventional left-to-right language models, MLM infers the masked word from both the left and the right context and can therefore extract high-quality text representations. A pre-trained BERT can be applied to many natural language processing tasks through fine-tuning, which updates the weights to suit the downstream task.

As BERT's performance has been demonstrated, research on using BERT to build models suited to a specific domain or task has continued in many fields. In particular, further pre-training, in which MLM is used to additionally train on data from the domain or task of interest, has recently attracted attention. This approach can be described as learning the general meaning of words through pre-training and then learning the special meaning or nuance that words carry in the given domain and task through additional training.

Korean Registered Patent No. 10-0766169 (2007.10.04)

An embodiment of the present invention aims to provide a selective masking-based further pre-training method and device for improving classification accuracy, which perform sentiment classification on movie comments with an attention-based LSTM (Long Short-Term Memory), distinguish clue words from surrounding words according to how much each word contributed to the sentiment classification of each comment, and then perform selective masking, in which masking is applied only to surrounding words, in the further pre-training step.

Among the embodiments, a selective masking-based further pre-training method includes: performing sentiment classification with the training data of a training data population for pre-training; generating an effective word corpus from the words of the training data population according to the result of the sentiment classification; building an extended tokenizer by applying the effective word corpus to the tokenizer built through the pre-training; masking the training data using the extended tokenizer; and performing further pre-training using the masked training data.

The performing of the sentiment classification may include performing the sentiment classification using sentences to which sentiment tags are attached as the training data.

The performing of the sentiment classification may include extracting attention weights of the words through an attention-based language model.

The performing of the sentiment classification may include comparing the attention weights with preset thresholds to classify the words into clue terms, surrounding terms, and excluded terms.

The performing of the sentiment classification may include classifying words whose attention weight is higher than one standard deviation (1-sigma) above the mean as clue terms, classifying words whose attention weight is lower than the mean as surrounding terms, and classifying the remaining words as excluded terms.

The performing of the sentiment classification may include setting the thresholds independently for each of the sentences.
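The threshold rule described above can be illustrated with a minimal sketch. It assumes that the attention weights of one sentence are already available as a list of floats and that "1-sigma" refers to the population standard deviation of the weights within that sentence; the function name `classify_terms` and the label strings are hypothetical and do not appear in the original disclosure.

```python
from statistics import mean, pstdev


def classify_terms(tokens, weights):
    """Label each token of ONE sentence as 'clue', 'surrounding', or 'excluded'.

    Thresholds are computed per sentence (assumption: population std. dev.):
      weight > mean + 1 sigma  -> clue term
      weight < mean            -> surrounding term
      otherwise                -> excluded term
    """
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must have the same length")
    mu = mean(weights)
    sigma = pstdev(weights)
    labeled = []
    for token, weight in zip(tokens, weights):
        if weight > mu + sigma:
            label = "clue"
        elif weight < mu:
            label = "surrounding"
        else:
            label = "excluded"
        labeled.append((token, label))
    return labeled


# Illustrative weights only; they are not taken from the disclosure.
example = classify_terms(
    ["this", "movie", "is", "really", "great"],
    [0.05, 0.10, 0.05, 0.25, 0.55],
)
```

Because the mean and standard deviation are recomputed for every call, each sentence receives its own thresholds, which corresponds to setting the thresholds independently per sentence.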

The generating of the effective word corpus may include constructing the effective word corpus from the words classified as clue terms and surrounding terms.
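As a hedged illustration of this step and of the tokenizer extension that follows it in the method, the sketch below collects clue and surrounding terms into an effective word corpus and appends unseen words to a toy vocabulary. The names `build_effective_corpus` and `extend_vocab`, the `(token, label)` input format, and the dictionary-based vocabulary with contiguous ids are assumptions made for illustration; an actual BERT tokenizer would be extended through its own library interface.

```python
def build_effective_corpus(labeled_sentences):
    """Collect unique clue and surrounding terms, preserving first-seen order.

    labeled_sentences: list of sentences, each a list of (token, label) pairs
    where label is 'clue', 'surrounding', or 'excluded' (hypothetical format).
    """
    corpus = []
    seen = set()
    for sentence in labeled_sentences:
        for token, label in sentence:
            if label in ("clue", "surrounding") and token not in seen:
                seen.add(token)
                corpus.append(token)
    return corpus


def extend_vocab(base_vocab, effective_corpus):
    """Return a copy of base_vocab with unseen effective words given new ids.

    Assumes ids in base_vocab are contiguous from 0, so len(vocab) is the
    next free id.
    """
    vocab = dict(base_vocab)
    for word in effective_corpus:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab
```

Keeping the base vocabulary untouched and returning a copy makes it easy to compare the tokenizer before and after extension.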

The masking of the training data may include: segmenting each item of the training data into words using the extended tokenizer; and, for each item of the training data, selectively masking any one of the words classified as surrounding terms.
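The selective masking step can be sketched as follows, under the assumption that each sentence has already been segmented and labeled as a list of `(token, label)` pairs. The function name `selective_mask`, the label strings, and the choice to return the sentence unchanged when no surrounding term exists are illustrative assumptions rather than details taken from the disclosure.

```python
import random

MASK_TOKEN = "[MASK]"


def selective_mask(labeled_tokens, rng=None):
    """Mask exactly one token labeled 'surrounding' in a single sentence.

    Returns (tokens, index_of_masked_token). If the sentence has no
    surrounding term, the tokens are returned unchanged with index None
    (an assumption; the source does not specify this case).
    """
    rng = rng or random.Random()
    tokens = [token for token, _ in labeled_tokens]
    candidates = [
        i for i, (_, label) in enumerate(labeled_tokens) if label == "surrounding"
    ]
    if not candidates:
        return tokens, None
    index = rng.choice(candidates)
    masked = list(tokens)
    masked[index] = MASK_TOKEN
    return masked, index
```

The original token at the returned index serves as the prediction target during further pre-training, while clue terms and excluded terms are never selected for masking.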

The performing of the further pre-training may include applying the masked training data to a pre-trained BERT model to carry out the further pre-training.

The method may further include fine-tuning the learning model built through the pre-training and the further pre-training.

Among the embodiments, a selective masking-based further pre-training device includes: a sentiment classification performing unit that performs sentiment classification with the training data of a training data population for pre-training; a corpus generation unit that generates an effective word corpus from the words of the training data population according to the result of the sentiment classification; a tokenizer extension unit that builds an extended tokenizer by applying the effective word corpus to the tokenizer built through the pre-training; a selective masking processing unit that masks the training data using the extended tokenizer; and a further pre-training performing unit that performs further pre-training using the masked training data.

The sentiment classification performing unit may extract attention weights of the words through an attention-based language model.

The sentiment classification performing unit may compare the attention weights with preset thresholds to classify the words into clue terms, surrounding terms, and excluded terms.

The corpus generation unit may construct the effective word corpus from the words classified as clue terms and surrounding terms.

The masking processing unit may segment each item of the training data into words using the extended tokenizer and, for each item of the training data, selectively mask any one of the words classified as surrounding terms.

The device may further include a transfer learning performing unit that fine-tunes the learning model built through the pre-training and the further pre-training.

The disclosed technology may have the following effects. However, this does not mean that a specific embodiment must include all of the following effects or only the following effects, and the scope of rights of the disclosed technology should not be understood as being limited thereby.

The selective masking-based further pre-training method and device for improving classification accuracy according to an embodiment of the present invention can perform sentiment classification on movie comments with an attention-based LSTM (Long Short-Term Memory), distinguish clue words from surrounding words according to how much each word contributed to the sentiment classification of each comment, and then perform selective masking, in which masking is applied only to surrounding words, in the further pre-training step.

FIG. 1 is a diagram illustrating a further pre-training system according to the present invention.
FIG. 2 is a diagram illustrating the functional configuration of the further pre-training device of FIG. 1.
FIG. 3 is a flowchart illustrating an embodiment of a selective masking-based further pre-training process according to the present invention.
FIG. 4 is a diagram illustrating a selective masking-based further pre-training method according to the present invention.
FIG. 5 is a diagram illustrating sentiment sentences and tags.
FIG. 6 is a diagram illustrating the training process of an attention-based language model.
FIG. 7 is a diagram illustrating classification results according to attention weights and thresholds.
FIG. 8 is a diagram illustrating clue words, surrounding words, and the effective word corpus according to the present invention.
FIG. 9 is a diagram illustrating the BERT pre-training process.
FIG. 10 is a diagram illustrating BERT tokenizer extension.
FIG. 11 is a diagram illustrating an embodiment of MLM training to which selective masking is applied according to the present invention.
FIGS. 12 to 19 are diagrams illustrating experimental results for the selective masking-based further pre-training method according to the present invention.

Since the description of the present invention is merely an embodiment for structural or functional explanation, the scope of rights of the present invention should not be construed as being limited by the embodiments described herein. That is, since the embodiments can be modified in various ways and can take various forms, the scope of rights of the present invention should be understood to include equivalents capable of realizing the technical idea. In addition, the objects or effects presented in the present invention do not mean that a specific embodiment must include all of them or only such effects, and the scope of rights of the present invention should not be understood as being limited thereby.

Meanwhile, the meaning of the terms used in this application should be understood as follows.

Terms such as "first" and "second" are used to distinguish one component from another, and the scope of rights should not be limited by these terms. For example, a first component may be named a second component, and similarly a second component may be named a first component.

When a component is referred to as being "connected" to another component, it may be directly connected to that other component, but it should be understood that other components may exist in between. In contrast, when a component is referred to as being "directly connected" to another component, it should be understood that no other component exists in between. Other expressions describing the relationship between components, such as "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

A singular expression should be understood to include the plural unless the context clearly indicates otherwise, and terms such as "include" or "have" are intended to specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The identification codes of the steps (for example, a, b, c) are used for convenience of explanation and do not describe the order of the steps; unless a specific order is clearly stated in context, the steps may occur in an order different from the one specified. That is, the steps may occur in the specified order, may be performed substantially simultaneously, or may be performed in the reverse order.

The present invention can be implemented as computer-readable code on a computer-readable recording medium, and the computer-readable recording medium includes all types of recording devices that store data readable by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. In addition, the computer-readable recording medium may be distributed over computer systems connected through a network, so that the computer-readable code is stored and executed in a distributed manner.

Unless defined otherwise, all terms used herein have the same meaning as commonly understood by a person of ordinary skill in the field to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and should not be interpreted as having ideal or excessively formal meanings unless explicitly defined in this application.

Deep learning may refer to algorithms that learn the weights connecting the hidden layers of an artificial neural network in which multiple hidden layers are stacked. Natural language processing with deep learning has the advantage that, as vectorized word information passes through the hidden layers, the learned weights express the information and features contained in the text more effectively. Representative deep learning-based word embedding methods include Word2Vec and FastText. Sequence embedding methods that apply deep learning to statistical language models appeared later; typical examples are RNN (Recurrent Neural Network), LSTM, and GRU (Gated Recurrent Unit), language models that learn by reflecting information about previous words in the following word according to the order within a sentence, and these showed excellent performance in Seq2Seq (Sequence-to-Sequence) models for tasks such as machine translation. However, these methods have limitations, such as the out-of-vocabulary problem, in which words absent from the training corpus cannot be handled, and the long-term dependency problem, in which information about earlier words fails to propagate to the end as sentences grow longer.

To overcome these limitations, research on improving performance by combining language models with an attention mechanism has been active. The attention mechanism may refer to a technique that computes an attention weight for each input word, regardless of sentence length, so that the language model learns which word information to focus on. The attention function used in the attention mechanism reveals which parts of the source data the target data concentrates on when reflecting information, and the computed attention weights can be used to interpret the learning process that takes place inside the model. Subsequently, the Transformer model based on self-attention, built using only the attention algorithm without such a language model, was proposed and brought a breakthrough in natural language processing.
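As a minimal sketch of how attention weights are commonly obtained, the example below normalizes raw alignment scores with a softmax and uses the resulting weights to form a weighted sum of hidden states. The source does not specify the attention function used; the softmax normalization, the function names `attention_weights` and `context_vector`, and the list-based vectors are assumptions for illustration only.

```python
import math


def attention_weights(scores):
    """Turn raw alignment scores (one per input word) into weights that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def context_vector(weights, hidden_states):
    """Weighted sum of per-word hidden states (each a list of floats)."""
    dim = len(hidden_states[0])
    return [
        sum(weight * state[d] for weight, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
```

Since one weight is produced per input word, the weights can be read directly as an indication of how strongly each word contributed to the model's output.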

Accordingly, research on building pre-trained language models with the Transformer has recently been active. A pre-trained language model may refer to a model devised to overcome the limitation that the unique information of a text cannot be expressed when data is scarce. A pre-trained language model first learns general text representations from a large text corpus such as Wikipedia and can then, when applied to small-scale data through transfer learning, extract richer text representations than earlier embedding methods. Methods of applying a pre-trained language model to a downstream task can be broadly divided into the feature-based approach and the fine-tuning approach. The feature-based approach trains the task model with pre-trained language representations added as features; ELMo (Embeddings from Language Models) is a representative example. The fine-tuning approach adjusts the weights of the pre-trained language model for the downstream task while introducing only a minimal number of additional parameters, and may include BERT and GPT (Generative Pre-trained Transformer).

Earlier pre-trained language models may be limited by a unidirectional structure that learns by reflecting only information about preceding words. BERT builds a bidirectional model using the encoder structure of the Transformer and overcomes the limitations of earlier unidirectional models through unsupervised learning with NSP (Next Sentence Prediction) and MLM. Of BERT's training methods, NSP takes two sentences as input at the same time and learns text representations by predicting whether the second sentence follows the first. MLM learns by randomly selecting words from a sentence, masking them, and predicting the masked words: only 15% of all tokens are selected, and of those, 80% are replaced with the [MASK] token, 10% are replaced with another word, and 10% are left unchanged. Through these two unsupervised objectives BERT can learn bidirectionally with full use of context, and this advantage allowed it to achieve SOTA (State-of-the-Art) results in most natural language processing tasks.
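The conventional random MLM masking described above (15% of tokens selected; of those, 80% replaced with [MASK], 10% replaced with another word, 10% kept) can be sketched as follows. This is an illustrative approximation, not BERT's reference implementation: it selects each token independently with probability `mask_prob`, so the 15% ratio holds only in expectation, and the function name `mlm_mask` and the list-based vocabulary are assumptions.

```python
import random

MASK_TOKEN = "[MASK]"


def mlm_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """Apply random MLM masking to one tokenized sentence.

    Returns (masked_tokens, labels), where labels[i] holds the original
    token at each selected position and None elsewhere.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original token unchanged
    return masked, labels
```

The contrast with selective masking is that every position here is equally likely to be chosen, regardless of how much the word contributes to the downstream task.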

Of BERT's pre-training methods, MLM has the advantage of learning rich information by predicting masked words bidirectionally from the left and right context, and follow-up studies have therefore used MLM to improve performance on various natural language processing tasks. Specific examples include studies that improved a unidirectional decoder with MLM in machine translation, studies on domain sentiment transfer using MLM, and studies that improved the efficiency of pre-trained models by applying continuous MLM.

Meanwhile, research on improving downstream task performance by fine-tuning a pre-trained BERT has recently been active. Performance gains through fine-tuning have mainly been achieved by adjusting hyperparameters inside BERT or by combining BERT with other language models to extract a context vector. However, the original BERT is limited in that it was trained only on general data such as Wikipedia. Consequently, when domain-specific data is used to fine-tune the original BERT, the distribution of the general corpus learned by BERT differs from the word distribution of the domain corpus, so the information cannot be fully represented. For this reason, many studies have built large-scale domain datasets and rebuilt domain-specific pre-trained language models. Specific examples include studies that trained on biomedical text and studies that trained on scientific knowledge and clinical information.

Attempts to improve BERT by additionally training a pre-trained BERT on domain data are also active, and most of these studies train on the domain data using the same MLM objective used in pre-training. However, because most of these attempts carry out the additional training with conventional MLM, that is, by randomly selecting words to mask, they may be limited in that they do not make sufficient use of information that could improve downstream task performance.

According to the present invention, a selective masking scheme may be disclosed in which words are divided into clue words and surrounding words according to the influence each word has on the performance of the downstream task, and masking is applied only to these words.

FIG. 1 is a diagram illustrating a further pre-training system according to the present invention.

Referring to FIG. 1, the further pre-training system 100 may include a user terminal 110, a further pre-training device 130, and a database 150.

The user terminal 110 may be a computing device that is connected to the further pre-training device 130 to provide or use information. That is, the user terminal 110 may provide the further pre-training device 130 with the training data required for training, and may use the training data collected by the further pre-training device 130 or the learning model built by the further pre-training device 130. The user terminal 110 may be implemented as a smartphone, a laptop, or a computer, but is not limited thereto and may also be implemented as various other devices such as a tablet PC. The user terminal 110 may be connected to the further pre-training device 130 through a network, and a plurality of user terminals 110 may be connected to the further pre-training device 130 at the same time.

The further pre-training device 130 may be implemented as a server corresponding to a computer or program that performs the selective masking-based further pre-training method for improving classification accuracy according to the present invention. The further pre-training device 130 may be connected to the user terminal 110 through a wired or wireless network, and the two may exchange data with each other.

In one embodiment, the further pre-training device 130 may operate in conjunction with various external systems (or servers) while performing the selective masking-based further pre-training method for improving classification accuracy according to the present invention. For example, the further pre-training device 130 may access relevant content through SNS services, portal sites, Wikipedia, blogs, and the like, and may receive the data needed to collect training data and build a learning model.

The database 150 may correspond to a storage device that stores the various information required during operation of the further pre-training device 130. For example, the database 150 may store training data collected from various sources, and may store training algorithms and model information for building a learning model. It is not limited thereto, and may also store information collected or processed in various forms while the further pre-training device 130 performs the selective masking-based further pre-training method for improving classification accuracy according to the present invention.

FIG. 2 is a diagram illustrating the functional configuration of the further pre-training device of FIG. 1.

Referring to FIG. 2, the further pre-training device 130 may include a sentiment classification unit 210, a corpus generation unit 220, a tokenizer extension unit 230, a selective masking unit 240, a further pre-training unit 250, a transfer learning unit 260, and a control unit (not shown in FIG. 2).

The sentiment classification unit 210 may perform sentiment classification on the training data of a training data population for pre-training. Here, the training data population may correspond to a data set composed of training data, and each item of training data may correspond to text used for sentiment classification. For example, the training data may include sentences collected from various sources, such as SNS messages or comments posted on blogs and bulletin boards.

In one embodiment, the sentiment classification unit 210 may preprocess the training data population. For example, the sentiment classification unit 210 may filter the training data population to remove training data that contains specific words, and may also remove certain icons or images. In addition, the sentiment classification unit 210 may classify and cluster the training data in the population according to a predetermined criterion. For example, it may classify the training data by field of expertise.

In one embodiment, the sentiment classification unit 210 may perform sentiment classification using sentences with attached sentiment tags as the training data. The training data may consist of text that includes sentences, and when it consists of sentences, the sentiment classification unit 210 may classify the sentiment of each sentence in advance and assign a sentiment tag to it. That is, a sentiment tag may correspond to a summary of the sentiment conveyed by the sentence, expressed as a predetermined sentiment word or phrase. The sentiment classification unit 210 may perform sentiment classification on the tagged sentences, and may assess the accuracy of the classification by comparing the classification results with the sentiment tags.

In one embodiment, the sentiment classification unit 210 may extract the attention weights of words through an attention-based language model. For example, the sentiment classification unit 210 may use an attention-based LSTM classifier as the attention-based language model. That is, the sentences with attached sentiment tags may be input to the attention-based LSTM classifier, and the classifier may output a sentiment classification result for each sentence. During this classification, the sentiment classification unit 210 may obtain an attention weight for each word in a sentence, and may collect and store the per-word attention weights sentence by sentence.

In one embodiment, the sentiment classification unit 210 may compare the attention weights with preset thresholds to classify words into clue terms, surrounding terms, and excluded terms. The thresholds compared against the attention weights may be set in advance. Based on the attention weight determined for each word, the sentiment classification unit 210 may classify each word in a sentence as one of a clue word, a surrounding word, or an excluded word.

Here, a clue term may correspond to a word that contributes strongly to classifying the sentence in which it appears during sentiment classification, whereas a surrounding term may correspond to a word whose contribution to the classification is low. An excluded term may correspond to a word with an intermediate level of contribution.

In one embodiment, the sentiment classification unit 210 may classify a word as a clue word if its attention weight is higher than the upper 1-sigma of the mean (that is, the mean plus one standard deviation), as a surrounding word if its attention weight is lower than the mean, and as an excluded word otherwise. In other words, the sentiment classification unit 210 may apply the mean of the attention weights and the upper 1-sigma of the mean as the two reference thresholds for classifying words.

In one embodiment, the sentiment classification unit 210 may set the thresholds independently for each sentence. Sentiment classification may be performed per item of training data, that is, per sentence, and when an attention-based LSTM classifier is used, an attention score may be calculated for each word within each sentence. The attention scores therefore depend on the word distribution within the sentence, and the thresholds for word classification may likewise be determined independently for each sentence. As a result, the same word may receive different attention weights depending on the word distribution of the sentence containing it, and its classification may differ from sentence to sentence.
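The per-sentence thresholding described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `classify_words`, the role labels, and the use of the population standard deviation for "1-sigma" are assumptions, and the example weights are made up.

```python
from statistics import mean, pstdev


def classify_words(words, weights):
    """Label every word of ONE sentence as clue / surrounding / excluded.

    The two thresholds are computed from this sentence only:
    the mean of its attention weights, and the mean plus one
    standard deviation (assumed reading of "upper 1-sigma").
    """
    mu = mean(weights)
    upper = mu + pstdev(weights)  # assumption: population std. dev.
    roles = []
    for word, weight in zip(words, weights):
        if weight > upper:
            roles.append((word, "clue"))
        elif weight < mu:
            roles.append((word, "surrounding"))
        else:
            roles.append((word, "excluded"))
    return roles


# Hypothetical attention weights for a four-word sentence.
example = classify_words(["w1", "w2", "w3", "w4"], [0.5, 0.3, 0.1, 0.1])
```

Because the thresholds are recomputed for every sentence, calling the function on two sentences that share a word can give that word two different roles, which matches the behaviour described in the text.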

The corpus generation unit 220 may generate an effective word corpus for the words of the training data population according to the result of the sentiment classification. In one embodiment, the corpus generation unit 220 may build the effective word corpus from the words classified as clue words and surrounding words. That is, the effective word corpus may correspond to a set of specific words selected according to the sentiment classification result. More specifically, the words classified by the sentiment classification unit 210 may be produced as a clue word list, a surrounding word list, and an excluded word list. The corpus generation unit 220 may merge the clue word list and the surrounding word list and then remove duplicates to generate the effective word corpus. Consequently, the effective word corpus may be defined as the set of words classified as clue words or surrounding words by the sentiment classification.
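The merge-and-deduplicate step can be sketched in a few lines. The function name `build_effective_corpus` and the sample words are illustrative assumptions; the excluded word list is simply not passed in.

```python
def build_effective_corpus(clue_words, surrounding_words):
    """Merge the clue and surrounding word lists and drop duplicates.

    dict.fromkeys keeps the first occurrence of each word, so the
    result is deterministic and preserves first-seen order.
    """
    return list(dict.fromkeys([*clue_words, *surrounding_words]))


# Hypothetical per-sentence lists collected over several sentences.
corpus = build_effective_corpus(
    ["지루한", "최악", "지루한"],  # clue words (may repeat across sentences)
    ["영화", "최악"],              # surrounding words
)
```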

The tokenizer extension unit 230 may build an extended tokenizer by applying the effective word corpus to the tokenizer built through pre-training. For example, when a BERT model was used for pre-training, a BERT tokenizer will have been built. Instead of using this existing BERT tokenizer as is, the tokenizer extension unit 230 may extend it by adding the effective word corpus. That is, the tokenizer extension unit 230 may add the effective word corpus to the vocabulary of the pre-trained tokenizer and remove duplicates to build the extended tokenizer.
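As a rough sketch of the vocabulary extension, the tokenizer vocabulary can be modelled as a token-to-id mapping to which unseen corpus words are appended. The function name `extend_vocabulary`, the contiguous-id assumption, and the toy vocabulary are all hypothetical; a real BERT tokenizer also needs its embedding matrix enlarged to cover the new ids.

```python
def extend_vocabulary(base_vocab, effective_corpus):
    """Return a copy of base_vocab with unseen corpus words appended.

    Assumes ids in base_vocab are contiguous from 0, so the next
    free id is always len(vocab). Words already present are skipped,
    which is the duplicate removal described in the text.
    """
    extended = dict(base_vocab)  # leave the pre-trained vocabulary untouched
    for word in effective_corpus:
        if word not in extended:
            extended[word] = len(extended)
    return extended


# Toy pre-trained vocabulary (hypothetical contents).
base = {"[PAD]": 0, "[MASK]": 1, "영화": 2}
extended = extend_vocabulary(base, ["영화", "지루한"])
```

In a library such as Hugging Face Transformers, the comparable operations would be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; the source does not say which toolkit was used.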

The selective masking unit 240 may mask the training data using the extended tokenizer. In one embodiment, the selective masking unit 240 may segment each item of training data into words using the extended tokenizer, and may selectively mask one of the words classified as surrounding words in each item. When each sentence of the training data is segmented into words with the extended tokenizer, the selective masking unit 240 may obtain, for each sentence, a set of tokens that includes the clue words and the surrounding words. The selective masking unit 240 may then exclude the clue words from the masking targets and select at least one of the surrounding words for masking. The masking operation may be performed by replacing the target word with the word 'MASK'.
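A minimal sketch of this masking rule is shown below. The function name `selective_mask`, the role labels, and the sample tokens are assumptions for illustration; the mask string is written as `[MASK]`, the spelling BERT uses for its special token, where the text simply says 'MASK'.

```python
import random

MASK_TOKEN = "[MASK]"  # assumption: BERT-style spelling of the mask token


def selective_mask(tokens, roles, rng=None):
    """Mask exactly one surrounding word; never touch clue words.

    tokens and roles are parallel lists for one sentence.
    Returns (masked_tokens, masked_index); the index is None when
    the sentence contains no surrounding word.
    """
    rng = rng or random.Random()
    candidates = [i for i, role in enumerate(roles) if role == "surrounding"]
    if not candidates:
        return list(tokens), None
    target = rng.choice(candidates)
    masked = list(tokens)
    masked[target] = MASK_TOKEN
    return masked, target


# Hypothetical tokenized sentence with per-sentence role labels.
tokens = ["시간", "낭비", "최악", "영화"]
roles = ["surrounding", "clue", "clue", "surrounding"]
masked, index = selective_mask(tokens, roles, random.Random(0))
```

The original token at the returned index would serve as the prediction label during masked language modelling.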

The further pre-training unit 250 may perform further pre-training using the masked training data. In one embodiment, the further pre-training unit 250 may apply the masked training data to the pre-trained BERT model to carry out further pre-training. For example, the further pre-training unit 250 may perform further pre-training through masked language modeling (MLM) on sentences in which surrounding words have been selectively masked.

The transfer learning unit 260 may fine-tune the learning model built through pre-training and further pre-training. Here, fine-tuning may correspond to the process of training a new model on the basis of an existing pre-trained model. More specifically, in order to repurpose the pre-trained model, fine-tuning may include removing the classifier present in the original model and adding a classifier suited to the new purpose. Fine-tuning may then be performed by applying any of several training strategies to the modified model.

For example, fine-tuning may proceed by training the entire modified model; by training only the remaining layers and the classifier while keeping part of the convolutional base frozen; or by training only the classifier while keeping the whole convolutional base frozen. The transfer learning unit 260 may adjust the learning rate used for fine-tuning as a hyperparameter, and may apply different learning rates depending on the intended purpose.

The control unit (not shown in FIG. 2) may control the overall operation of the further pre-training device 130 and manage the control flow or data flow among the sentiment classification unit 210, the corpus generation unit 220, the tokenizer extension unit 230, the selective masking unit 240, the further pre-training unit 250, and the transfer learning unit 260.

FIG. 3 is a flowchart illustrating an embodiment of the selective masking-based further pre-training process according to the present invention.

Referring to FIG. 3, the further pre-training device 130 may perform sentiment classification, through the sentiment classification unit 210, on the training data of the training data population for pre-training (step S310). The further pre-training device 130 may then generate, through the corpus generation unit 220, an effective word corpus for the words of the training data population according to the result of the sentiment classification (step S330).

In addition, the further pre-training device 130 may build an extended tokenizer, through the tokenizer extension unit 230, by applying the effective word corpus to the tokenizer built through pre-training (step S350). The further pre-training device 130 may mask the training data, through the selective masking unit 240, using the extended tokenizer (step S370). The further pre-training device 130 may then perform further pre-training, through the further pre-training unit 250, using the masked training data (step S390).

Hereinafter, the selective masking-based further pre-training method for improving classification accuracy according to the present invention is described in more detail with reference to FIGS. 4 to 19.

Referring to FIG. 4, the further pre-training method according to the present invention may consist of Phase 1, which divides words into clue words and surrounding words according to how much they contribute to the sentiment classification task, and Phase 2, which uses this word information to carry out selective masking-based further pre-training. More specifically, Phase 1 may take sentences with attached sentiment tags as input, perform sentiment classification through an attention-based LSTM classifier, and divide the words into clue words and surrounding words, from the viewpoint of sentiment classification, according to the attention weight each word receives in its sentence. Phase 2 may build an extended tokenizer by adding the words obtained in Phase 1 to the pre-trained BERT tokenizer, and may use it to tokenize the input sentences. Phase 2 may then perform further pre-training through the MLM objective of the pre-trained BERT, applying selective masking based on the clue word and surrounding word information identified earlier. This process ultimately yields sentence embeddings specialized for sentence classification, that is, sentence embeddings that faithfully reflect the sentiment information in the sentence. Each step is described in detail below.

Referring to FIG. 5, the further pre-training device 130 may carry out Phase 1 of FIG. 4, namely extracting the attention weight of each word in a sentence during sentiment classification with the attention-based LSTM (step 1), and then distinguishing the extracted words to build a corpus according to sentiment information (step 2). The further pre-training device 130 may use sentences with a Positive or Negative tag attached as sentiment-bearing review data. In FIG. 5, a sentence such as '원작의 긴장감을 제대로 살려내지 못했다' ("It failed to properly capture the tension of the original") may carry a Negative tag, and a sentence such as '청춘 영화의 최고봉. 방황과 우울했던 날들의 자화' ("The pinnacle of youth films. A self-portrait of days of wandering and gloom") may carry a Positive tag.

To perform selective masking, it is first necessary to determine which words in a sentence contribute significantly to sentiment classification. For this purpose, the further pre-training device 130 may perform sentiment classification with an attention-based LSTM, one of the attention-based language models. During classification, the attention-based LSTM may compute a distinct attention weight for each word in the input sentence, and this weight may correspond to the degree to which the word contributed to determining the sentiment tag. The general training process of the attention-based LSTM may be represented as shown in FIG. 6.

Referring to FIG. 6, the attention-based LSTM may consist broadly of an LSTM layer and an attention layer. The LSTM layer takes the embedding vector of each token as input, passes it through the hidden layer inside the LSTM, and produces hidden-state vectors. The attention layer applies an attention function to the hidden-state vectors to compute attention scores, and then applies a softmax function to these scores to obtain the attention distribution. Each value of the attention distribution corresponds to an attention weight, and the sentence vector that finally represents the context is obtained as the weighted sum of the hidden-state vectors with the attention weights. The classification layer then learns to predict the sentiment tag from the sentence vector, which updates the attention weights. FIG. 7 visualizes part of the result of classifying sentiment sentences with the attention-based LSTM and extracting the per-word attention weights in the process.
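The attention layer described above (scores, softmax, weighted sum) can be sketched in plain Python. The text does not specify the attention function, so a dot product with a learned query vector is assumed here; `attention_pool`, `query`, and the sample vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Return (attention_weights, sentence_vector) for one sentence.

    hidden_states: one LSTM hidden-state vector per token.
    query: assumed learned vector used to score each hidden state.
    """
    # Attention score per token (assumption: dot-product scoring).
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    # Attention distribution: one weight per token, summing to 1.
    weights = softmax(scores)
    # Sentence vector: weighted sum of the hidden states.
    dim = len(hidden_states[0])
    sentence = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, sentence


# Toy hidden states for a three-token sentence.
weights, sentence = attention_pool(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], query=[1.0, 1.0]
)
```

The returned `weights` list is what the method collects per sentence and later compares against the thresholds.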

Referring to FIG. 7, the attention weight of each word may be derived through training of the attention-based LSTM, and the words may be divided according to preset thresholds. Because the sentences in FIG. 7 are all classified as Negative, the attention weight of each word corresponds to how much that word contributed to the sentence being classified as negative. For example, words such as '낭비' (waste) and '최악' (worst) in the first sentence, and '아까움' (regret) and '지루한' (boring) in the second sentence, are words that strongly influenced the prediction of the Negative tag.

In this way, the attention weights produced during classification with the attention-based LSTM make it possible to distinguish words that matter for judging the sentiment of a sentence from words that do not. In the further pre-training method according to the present invention, a word with a high attention weight in a sentiment sentence is treated as a 'clue word' that contributes strongly to classifying the sentence, and a word with a low attention weight is treated as a 'surrounding word' with a low contribution. Words with an intermediate level of contribution are designated 'excluded terms' and left out of the analysis. The method then uses this role information to perform selective masking.

However, because the attention-based LSTM generates per-word attention weights for each sentence, word roles are assigned per sentence rather than over the whole document. That is, the same word may be a clue word in one sentence and a surrounding word in another. In FIG. 7, the word '낭비' (waste) is classified as a clue word in the first sentence and as a surrounding word in the second. The further pre-training method according to the present invention may also use a different threshold for each sentence when distinguishing word roles. More specifically, within each sentence, a word whose attention weight is higher than the upper 1-sigma of the mean may be selected as a clue word, and a word whose attention weight is lower than the mean may be selected as a surrounding word. A word that falls into neither group, that is, a word whose attention weight is higher than the mean but lower than the upper 1-sigma of the mean, may be classified as an excluded word. In FIG. 7, the weight threshold (1-sigma) for clue words differs between sentences: 0.192 for the first sentence and 0.277 for the second.

The clue words and surrounding words identified by the thresholds are then used for two purposes: extending the BERT tokenizer and performing selective masking. For the first purpose, the main goal is to expand the vocabulary, so all clue words and all surrounding words are merged, duplicates are removed, and the result is used to extend the BERT tokenizer. For the second purpose, the role information is used for selective masking, which is performed sentence by sentence, so words that recur across sentences are stored as they are without deduplication. For example, FIG. 8(a) shows the set of clue words used for selective masking, and FIG. 8(b) shows the set of surrounding words. FIG. 8(c), the corpus obtained by merging the two and removing duplicates, is used to extend the BERT tokenizer. For instance, '지루한' (boring) is selected as a clue word in both the second and fourth sentences in FIG. 8(a), but appears only once in FIG. 8(c) after deduplication.

The further pre-training device 130 may then add the clue/surrounding word corpus built in the previous step to the tokenizer of the pre-trained BERT and tokenize the sentiment sentences (step 3), apply selective masking to the tokenized sentences using the role information of the clue words and surrounding words, and perform further pre-training with the pre-trained BERT (step 4). BERT is a bidirectional model built from only the encoder part of the Transformer, and is typically pre-trained with 12 Transformer layers on a large text corpus. The pre-training process of BERT may be represented as shown in FIG. 9.

Referring to FIG. 9, the general pre-training process of BERT may consist broadly of an embedding layer, a Transformer layer, and a training layer. The embedding layer converts large-scale text data into vectors: for each word, the results of token embedding, segment embedding, and position embedding are summed to form a single vector. As these vectors pass through the hidden layers of the Transformer, the 12 Transformer encoders and the multi-head attention within each encoder learn weights for the vector representations from multiple perspectives. The training layer then trains simultaneously on MLM and NSP, the two pre-training objectives of BERT, and updates the weights of all layers so that BERT represents context better.
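The summation performed by the embedding layer can be illustrated with small lookup tables. This is a toy sketch: `embed_sequence`, the table contents, and the two-dimensional integer vectors are assumptions chosen so the arithmetic is easy to follow.

```python
def embed_sequence(token_ids, segment_ids, token_table, segment_table, position_table):
    """Sum token, segment, and position embeddings for each position.

    Each table maps an index to a vector of the same length; the
    position index is simply the token's place in the sequence.
    """
    vectors = []
    for position, (token_id, segment_id) in enumerate(zip(token_ids, segment_ids)):
        vectors.append([
            t + s + p
            for t, s, p in zip(
                token_table[token_id],
                segment_table[segment_id],
                position_table[position],
            )
        ])
    return vectors


# Toy tables (hypothetical values).
token_table = {0: [1, 2], 1: [3, 4]}
segment_table = {0: [10, 10], 1: [20, 20]}
position_table = [[100, 100], [200, 200]]
vectors = embed_sequence([1, 0], [0, 0], token_table, segment_table, position_table)
```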

To further train on sentiment sentences by applying selective masking to the pre-trained BERT, the text may first be segmented and converted into the input form of the pre-trained BERT embedding layer. Most pre-trained BERT models have their own tokenizer that segments text into tokens, and that tokenizer may segment text based on the BPE (Byte Pair Encoding) algorithm, which learns by splitting a single word into several sub-words. However, if the tokenizer of the pre-trained BERT is applied as it is, the clue words and surrounding words derived in steps 1 and 2 may be split into sub-words and lose their original meaning. To prevent this, a process of extending the BERT tokenizer by adding the corpus of clue/surrounding words to the tokenizer of the pre-trained BERT (step 3) may be required. FIG. 10 may show an example of tokenization results with and without the clue/surrounding word corpus added.

FIG. 10(a) may correspond to an example of the result of applying the BERT tokenizer as it is, without extension; clue words and surrounding words of the original sentence, such as 'detailed' (세밀한) and 'ending' (결말), are split into sub-words and their meaning is damaged. In contrast, FIG. 10(b) may correspond to an example of the result of segmentation with the BERT tokenizer extended through corpus addition; the clue/surrounding words are kept intact without loss of meaning. Therefore, so that the BERT tokenizer recognizes clue words and surrounding words as whole units, the further pre-training device 130 may build and use an extended BERT tokenizer by adding these words to the existing BERT tokenizer.
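
The effect of extending the tokenizer can be illustrated with a toy greedy longest-match sub-word splitter in pure Python. This is only a sketch under stated assumptions: the vocabularies, the `##` continuation prefix, and the function names `wordpiece` and `tokenize` are hypothetical and are not the tokenizer of the disclosure. With a library such as Hugging Face Transformers, a comparable extension is commonly done with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, but the source does not state which API was used.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no sub-word matched
        pieces.append(piece)
        start = end
    return pieces


def tokenize(sentence, vocab):
    return [p for w in sentence.split() for p in wordpiece(w, vocab)]


# Hypothetical base vocabulary that lacks the whole words.
base_vocab = {"세", "##밀", "##한", "결", "##말", "연출"}
# Hypothetical valid word corpus: 'detailed' (세밀한), 'ending' (결말).
valid_words = {"세밀한", "결말"}
extended_vocab = base_vocab | valid_words

before = tokenize("세밀한 연출 결말", base_vocab)
# ['세', '##밀', '##한', '연출', '결', '##말']
after = tokenize("세밀한 연출 결말", extended_vocab)
# ['세밀한', '연출', '결말']
```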

Next, the further pre-training device 130 may use the role information of words to perform selective masking (step 4) only on surrounding words, which have less influence on sentiment tag prediction. Conventional random masking has the limitation that, when a word with a strong influence on the sentiment of a sentence is masked, it is difficult for the other words to sufficiently learn the sentiment of the sentence. In contrast, the further pre-training device 130 may keep clue words from being used for masking, thereby ensuring that other words can sufficiently learn the sentiment of the sentence through the clue words. Furthermore, the further pre-training device 130 may force only surrounding words to be used for masking, maximizing the sentiment learning effect for the other words. More specifically, the further pre-training device 130 may receive the segmented sentiment sentences as input to the pre-trained BERT embedding layer and then perform a modified MLM in which the [MASK] token is applied only to the surrounding words of each sentence. An example of selective masking of surrounding words may be illustrated as in FIG. 11.

Referring to FIG. 11, selective masking masks the surrounding word 'direction' (연출) and keeps the clue words 'detailed' (세밀한) and 'moving' (감동적), thereby guaranteeing the other words an opportunity to learn the positive nuance from the clue words. Similarly, in the next training pass another surrounding word, 'ending' (결말), may be masked, and in that case the clue words 'detailed' and 'moving' are still kept. In other words, the further pre-training device 130 may rule out any possibility that the clue words 'detailed' and 'moving' are masked.
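
The selection rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation of the disclosure: the function name `selective_mask`, the use of a seeded `random.Random`, and the example token list are assumptions. The key property is that mask candidates are drawn only from the surrounding word list, so clue words can never be replaced by `[MASK]`.

```python
import random

MASK = "[MASK]"


def selective_mask(tokens, surrounding_words, rng):
    """Mask one randomly chosen surrounding word; clue words are never candidates."""
    candidates = [i for i, t in enumerate(tokens) if t in surrounding_words]
    if not candidates:
        return list(tokens), None  # nothing eligible, sentence left unchanged
    idx = rng.choice(candidates)
    masked = list(tokens)
    masked[idx] = MASK
    return masked, idx


# Hypothetical segmented sentence: detailed / direction / moving / ending.
tokens = ["세밀한", "연출", "감동적", "결말"]
surrounding = {"연출", "결말"}  # clue words 세밀한 and 감동적 are not listed

masked, idx = selective_mask(tokens, surrounding, random.Random(0))
```

Calling the function again with a different random state may mask the other surrounding word, which mirrors the "next training pass" case described for FIG. 11.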

The further pre-training device 130 may perform MLM training using the pre-trained BERT, but select the targets of the [MASK] token according to the role of each word rather than at random. By guaranteeing that clue words, which are rich in sentiment information, are never removed in the masking process and are always kept, the further pre-training device 130 may not only maximize the influence of clue words on the inference of surrounding words but also ultimately derive sentence embeddings in which sentiment information is faithfully represented.

Hereinafter, the results of an experiment in which the further pre-training method according to the present invention was applied to real data, and the results of its performance evaluation, are described in more detail.

NSMC (Naver Sentiment Movie Corpus), a public Korean movie review data set, may be used as the experimental data. As shown in FIG. 5, NSMC is public data consisting of pairs of a movie comment and a sentiment tag, with 200,000 records in a 1:1 ratio. Here, about 16,000 records out of the entire data may be built for training and validation, attention-based sentiment classification may be performed on them, and further pre-training may then proceed through the further pre-training method according to the present invention. The experimental environment may be built with Python 3.7, and KoBERT, implemented on PyTorch, may be used as the pre-trained model.

First, sentiment classification of about 16,000 sentiment comments may be performed using an attention-based sentiment classification model, and the extracted attention weights may be obtained as a result. More specifically, the LSTM layer vectorizes the tokens in a sentence using a bidirectional LSTM, and the attention layer may compute the attention weights and extract a sentence vector through Bahdanau attention. A fully connected layer may then be stacked on the extracted sentence vector to perform classification training. FIG. 12 shows the model structure used for sentiment classification.
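
For reference, additive (Bahdanau-style) attention over the bidirectional LSTM outputs can be written as follows. The notation is an assumption made for illustration and does not appear in the source: $h_i$ is the hidden state of token $i$, $q$ is a query vector (for example, the final LSTM state), $W_q$, $W_h$, and $v$ are learned parameters, $\alpha_i$ is the attention weight of token $i$, and $s$ is the sentence vector passed to the fully connected layer.

```latex
e_i = v^{\top} \tanh\left(W_q\, q + W_h\, h_i\right), \qquad
\alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{n} \exp(e_j)}, \qquad
s = \sum_{i=1}^{n} \alpha_i\, h_i
```

Under this formulation the weights $\alpha_i$ of one sentence sum to 1, which is what makes a per-sentence threshold on them meaningful.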

Next, the attention weights updated through classification training may be extracted, and a threshold may be computed for each sentence to build a clue word list and a surrounding word list. More specifically, for each sentence the mean of the attention weights of its words is computed; a word whose weight is at least 1-sigma above this mean may be classified as a clue word, and a word whose weight is below the mean may be classified as a surrounding word. Then, to extend the BERT tokenizer, the clue word list and the surrounding word list may be merged and deduplicated to build a valid word corpus of 26,896 words. FIG. 13 may correspond to an example of the attention weights extracted through classification training, and FIG. 14 may correspond to an example of the resulting clue word list (Clue Terms), surrounding word list (Surrounding Terms), and valid word corpus (Valid Corpus).
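
The per-sentence thresholding described above can be sketched as follows. This is a hedged illustration: the function name `classify_words`, the use of the population standard deviation (`statistics.pstdev`) for "1-sigma", and the example weights are assumptions, since the source does not specify them. Words falling between the mean and the 1-sigma threshold are treated here as excluded words, following the three-way split recited in the claims.

```python
from statistics import mean, pstdev


def classify_words(words, weights):
    """Split one sentence's words by attention weight.

    clue:        weight >= mean + 1 sigma
    surrounding: weight <  mean
    excluded:    everything in between
    """
    mu = mean(weights)
    sigma = pstdev(weights)  # assumption: population standard deviation
    clue, surrounding, excluded = [], [], []
    for word, weight in zip(words, weights):
        if weight >= mu + sigma:
            clue.append(word)
        elif weight < mu:
            surrounding.append(word)
        else:
            excluded.append(word)
    return clue, surrounding, excluded


# Made-up attention weights for one sentence (illustration only).
words = ["감동적", "연출", "정말", "결말", "영화"]
weights = [0.50, 0.08, 0.25, 0.07, 0.10]
clue, surrounding, excluded = classify_words(words, weights)
# clue == ['감동적'], surrounding == ['연출', '결말', '영화'], excluded == ['정말']
```

Because the mean and sigma are recomputed for every sentence, the threshold adapts to sentence length and to how sharply the attention is concentrated.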

Next, the further pre-training device 130 may add the valid word corpus to the pre-trained BERT to build an extended BERT tokenizer and then perform selective masking-based further pre-training using the clue/surrounding word list information. Here, the KcBERT-Base model may be used for Korean. KcBERT was pre-trained on large-scale Korean news comment data, using about 15.5 GB of text and more than 89 million sentences. Its tokenizer is a BPE-based WordPiece tokenizer with a vocabulary of 30,000 words. By adding the valid word corpus to this tokenizer and removing duplicates, an extended tokenizer with a vocabulary of 50,672 entries may be built. FIG. 15 shows an example of the result of segmenting a sentiment sentence using the extended BERT tokenizer.
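
As a quick consistency check on these figures, and assuming the extended vocabulary is simply the set union of the base vocabulary and the valid word corpus (an assumption; the source only says duplicates were removed), the number of valid words already present in the base vocabulary follows from inclusion-exclusion. The set symbols below are introduced here for illustration.

```latex
|V_{\text{base}} \cap V_{\text{valid}}|
  = |V_{\text{base}}| + |V_{\text{valid}}| - |V_{\text{ext}}|
  = 30{,}000 + 26{,}896 - 50{,}672
  = 6{,}224
```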

Next, the further pre-training device 130 may perform selective masking on the segmented sentiment sentences using the surrounding word list and then perform further pre-training. That is, clue words are excluded from masking, and selective masking may be performed by selecting up to two words from the surrounding word list and replacing them with the [MASK] token. Further pre-training may then proceed using MLM. The weights of KcBERT-Base may be used as the initial values, and the hyperparameters used for training may be set to 12 Transformer blocks, a hidden dimension of 768, and 12 attention heads. The maximum length of a sentiment comment may be set to 128, and FIG. 16 shows that the embedding vector values of KcBERT-Base changed after further pre-training.
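
A minimal sketch of how one MLM training example could be prepared under the settings above (at most two masked surrounding words, maximum length 128) is shown below. The function name `build_mlm_example`, the token ids, the mask id, and the use of -100 as the "ignore" label are assumptions for illustration; -100 is chosen because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so unmasked positions would not contribute to the loss.

```python
import random

IGNORE = -100  # assumed label for positions that should not be predicted


def build_mlm_example(token_ids, tokens, surrounding_words, mask_id, rng,
                      max_masks=2, max_len=128):
    """Return (input_ids, labels) with up to max_masks surrounding words masked."""
    tokens = tokens[:max_len]              # truncate to the maximum length
    input_ids = list(token_ids[:max_len])
    labels = [IGNORE] * len(input_ids)
    # Only surrounding words are eligible; clue words are never candidates.
    candidates = [i for i, t in enumerate(tokens) if t in surrounding_words]
    for i in rng.sample(candidates, min(max_masks, len(candidates))):
        labels[i] = input_ids[i]           # the model must recover the original id
        input_ids[i] = mask_id
    return input_ids, labels


# Hypothetical ids for: detailed / direction / moving / ending / finale.
tokens = ["세밀한", "연출", "감동적", "결말", "엔딩"]
token_ids = [10, 11, 12, 13, 14]
input_ids, labels = build_mlm_example(
    token_ids, tokens, {"연출", "결말", "엔딩"}, mask_id=4, rng=random.Random(0)
)
```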

This section presents the results of analyzing the performance of the proposed method by comparing the BERT further pre-trained with selective masking against three models built for performance verification. The overall process of the performance evaluation experiment is shown in FIG. 17.

FIG. 17(A) may show the flow of classifying sentiment comments by applying the further pre-training method according to the present invention. Specifically, the method fine-tunes from the weights of the BERT further pre-trained with masking applied only to surrounding words, and then extracts sentence vectors for the sentiment comments to perform sentiment classification training. Models (B), (C), and (D) of FIG. 17 were built for comparison with the present invention. Model (B) is built by performing further pre-training based on random masking (RM), the conventional pre-training method, and model (D) uses only the pre-trained BERT without any further pre-training. Model (C) is a model that can be implemented in the course of the selective masking according to the present invention and may correspond to a model in which, contrary to the present invention's masking of surrounding words only, masking is applied only to clue words.

To perform classification training on sentiment comments, 15,932 sentiment comments were divided into 12,745 for training and 3,187 for validation, and the positive/negative ratio may be set to 1:1 within each data set. In addition, the performance of the models may be evaluated on 5,000 test records. A deep neural network was used as the classification model for performance comparison, and classification accuracy may be measured for each of the four models, including the further pre-training method according to the present invention. The comparison of classification accuracy across the models can be found in FIGS. 18 and 19.
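
The data preparation and metric described above can be sketched as a label-stratified split plus a plain accuracy function. This is an illustrative sketch: the function names, the 0.8 training ratio (chosen because 12,745 of 15,932 is roughly 80%), and the placeholder samples are assumptions, and the source does not describe how its split was actually produced.

```python
import random


def stratified_split(samples, train_ratio, rng):
    """Split (text, label) pairs so each label keeps the same train/validation ratio."""
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)
    train, valid = [], []
    for group in by_label.values():
        group = list(group)
        rng.shuffle(group)
        cut = round(len(group) * train_ratio)
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid


def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Placeholder data: 10 positive (1) and 10 negative (0) comments.
samples = [(f"pos-{i}", 1) for i in range(10)] + [(f"neg-{i}", 0) for i in range(10)]
train, valid = stratified_split(samples, 0.8, random.Random(0))
```

Splitting each label group separately keeps the positive/negative balance of the full set in both the training and validation portions.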

FIG. 18 may correspond to a comparison graph showing, per epoch, the loss of sentiment classification training for the four models introduced in FIG. 17. The validation data was used for the measurement, and all four models reach their lowest loss at the second epoch. Accordingly, for each model the checkpoint from that epoch was adopted to perform inference on the test data, and the resulting classification accuracy may appear as shown in FIG. 19.

FIG. 19 shows the measured classification accuracy of the four models. On both the validation data and the test data, model (A) according to the present invention shows higher classification accuracy than the other comparison models. In particular, model (C), which masks only clue words, showed the lowest classification accuracy, which may be a further result supporting the experimental finding that the present invention, which masks only surrounding words, achieves the highest accuracy.

As a result, the model derived through the selective masking method according to the present invention, that is, distinguishing surrounding words from clue words and performing further pre-training with masking applied only to surrounding words, may show better classification accuracy than models derived by the existing approaches, namely performing no further pre-training and performing further pre-training with random masking.

The further pre-training method according to the present invention may provide a selective masking scheme that uses the sentiment information in sentiment sentences, that is, a scheme that infers sentence vectors specialized for sentiment classification by masking surrounding words that contribute little to sentiment classification. In addition, the method may provide a way to measure the sentiment contribution of the words in a sentiment sentence using attention weights in order to select the surrounding words. The method may be performed with the goal of improving classification accuracy, but with modification it may also contribute to improving the performance of other natural language processing tasks.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention may be variously modified and changed without departing from the spirit and scope of the present invention as set forth in the claims below.

100: further pre-training system
110: user terminal 130: further pre-training device
150: database
210: sentiment classification unit 220: corpus generation unit
230: tokenizer extension unit 240: selective masking unit
250: further pre-training unit 260: transfer learning unit

Claims

A selective masking-based further pre-training method comprising:
performing sentiment classification on training data of a training data population for pre-training;
generating a valid word corpus from the words of the training data population according to a result of the sentiment classification;
constructing an extended tokenizer by applying the valid word corpus to a tokenizer built through the pre-training;
masking the training data using the extended tokenizer; and
performing further pre-training using the masked training data.

The method of claim 1, wherein the performing of the sentiment classification comprises:
performing the sentiment classification using sentences to which sentiment tags are attached as the training data.

The method of claim 1, wherein the performing of the sentiment classification comprises:
extracting attention weights of the words through an attention-based language model.

The method of claim 3, wherein the performing of the sentiment classification comprises:
classifying the words into clue terms, surrounding terms, and excluded terms by comparing the attention weights with a preset threshold.

The method of claim 4, wherein the performing of the sentiment classification comprises:
classifying words whose attention weight is at least 1-sigma above the mean as the clue terms, words whose attention weight is below the mean as the surrounding terms, and the remaining words as the excluded terms.

The method of claim 4, wherein the performing of the sentiment classification comprises:
setting the threshold independently for each sentence.

The method of claim 4, wherein the generating of the valid word corpus comprises:
constructing the valid word corpus from the words classified as the clue terms and the surrounding terms.

The method of claim 4, wherein the masking of the training data comprises:
segmenting each item of the training data into word units using the extended tokenizer; and
selectively masking, for each item of the training data, any one of the words classified as the surrounding terms.

The method of claim 1, wherein the performing of the further pre-training comprises:
applying the masked training data to a pre-trained BERT model to perform the further pre-training.

The method of claim 1, further comprising:
fine-tuning the learning model built through the pre-training and the further pre-training.

A selective masking-based further pre-training device comprising:
a sentiment classification unit that performs sentiment classification on training data of a training data population for pre-training;
a corpus generation unit that generates a valid word corpus from the words of the training data population according to a result of the sentiment classification;
a tokenizer extension unit that constructs an extended tokenizer by applying the valid word corpus to a tokenizer built through the pre-training;
a selective masking unit that masks the training data using the extended tokenizer; and
a further pre-training unit that performs further pre-training using the masked training data.

The device of claim 11, wherein the sentiment classification unit
extracts attention weights of the words through an attention-based language model.

The device of claim 12, wherein the sentiment classification unit
classifies the words into clue terms, surrounding terms, and excluded terms by comparing the attention weights with a preset threshold.

The device of claim 13, wherein the corpus generation unit
constructs the valid word corpus from the words classified as the clue terms and the surrounding terms.

The device of claim 13, wherein the selective masking unit
segments each item of the training data into word units using the extended tokenizer and selectively masks, for each item of the training data, any one of the words classified as the surrounding terms.

The device of claim 11, further comprising:
a transfer learning unit that fine-tunes the learning model built through the pre-training and the further pre-training.