KR20230093765A

KR20230093765A - Method for transfer learning of neural network pre-trained with a corpus

Info

Publication number: KR20230093765A
Application number: KR1020210182651A
Authority: KR
Inventors: 연형석
Original assignee: 연형석
Priority date: 2021-12-20
Filing date: 2021-12-20
Publication date: 2023-06-27

Abstract

The present invention relates to a transfer learning method for a natural language processing model that improves the prediction or classification performance of the model by efficiently fine-tuning, for a given task, a natural language processing model based on a neural network pre-trained on a corpus, such as BERT. According to an embodiment of the present invention, the transfer learning method for a natural language processing model using a neural network pre-trained on a corpus includes: constructing a term set DB; receiving text including a plurality of words, looking up the plurality of words in the term set DB, and marking the words that are included in the term set DB; converting the text into a plurality of tokens based on the corpus; generating an embedding vector from the plurality of tokens; obtaining an attention score matrix by processing the embedding vector; and obtaining a matrix in which an amplification value is assigned to those elements of the attention score matrix that correspond to the tokens obtained from the marked words.

Description

Method for transfer learning of a natural language processing model using a neural network pre-trained with a corpus {Method for transfer learning of neural network pre-trained with a corpus}

The present invention relates to a training method for a natural language processing model using a neural network, and more particularly to a transfer learning method and apparatus for a natural language processing model using a neural network pre-trained on a corpus, such as the BERT (Bidirectional Encoder Representations from Transformers) model.

As machine learning technologies such as deep learning have recently attracted attention, interest in natural language processing (NLP) using machine learning has also grown. Natural language processing is a technology that enables computers to recognize and process language, and it is used in automatic translation, search engines, and the like. Representative algorithms include BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-Training model).

Here, BERT refers to a natural language processing model released by Google in 2018, which first appeared in a paper on language representation using the Transformer neural network architecture. Language representation means expressing human language as multidimensional vectors so that a computer can process it.

BERT is basically a model in which a neural network is pre-trained on large amounts of unlabeled data, such as Wikipedia, other encyclopedias, books, papers, and academic materials, and is then put to use through transfer learning on labeled data for a specific task.

To handle a concrete task with BERT (for example, recognition of a language such as Korean or English, translation, or classification of text documents), the pre-trained model itself often needs to be fine-tuned before transfer learning is performed, but no general methodology for doing so has been presented. To obtain good performance from a natural language processing model for a specific task, fine-tuning and transfer learning methods suited to the characteristics of that task therefore need to be established.

Korean Registered Patent Publication No. 10-2205430

The present invention relates to a transfer learning method for a natural language processing model that improves the prediction or classification performance of the model by efficiently fine-tuning, for a given task, a natural language processing model based on a neural network pre-trained on a corpus, such as BERT.

According to the present invention, a transfer learning method for a natural language processing model using a neural network pre-trained on a corpus may include: constructing a term set DB; receiving text including a plurality of words, looking up the plurality of words in the term set DB, and marking the words that are included in the term set DB; converting the text into a plurality of tokens based on the corpus and generating an embedding vector from the plurality of tokens; obtaining an attention score matrix by processing the embedding vector; and obtaining a matrix in which an amplification value is assigned to those elements of the attention score matrix that correspond to the tokens obtained from the marked words.

The step of obtaining the attention score matrix may include: obtaining a query (Q) matrix, a key (K) matrix, and a value (V) matrix by multiplying the embedding vector by weight matrices; and obtaining the attention score matrix by multiplying the query (Q) matrix and the key (K) matrix.

The transfer learning method may further include obtaining an attention value matrix by applying a softmax function to the matrix to which the amplification value has been assigned and multiplying the result by the value (V) matrix.

The transfer learning method may further include performing, on the attention value matrix, a first residual connection and normalization, FFNN processing, and a second residual connection and normalization.

The transfer learning method may further include inputting, to a second encoder, the output of the step of performing the first residual connection and normalization, the FFNN processing, and the second residual connection and normalization.

The amplification value may be greater than 1 and less than or equal to 1.5.

By applying the transfer learning method of the present invention, a natural language processing model based on a neural network pre-trained on a corpus can be efficiently fine-tuned for a given task and then subjected to transfer learning, so that the prediction or classification performance of the language processing model can be improved.

The effects achievable by the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 is a flowchart illustrating a transfer learning method for a natural language processing model using a neural network pre-trained on a corpus, according to an embodiment of the present invention.
FIG. 2 is a schematic diagram illustrating the overall structure of the BERT model.
FIG. 3 is a schematic diagram showing the structure of the embedding layer 10.
FIG. 4 is a schematic diagram conceptually illustrating the processing performed in the attention layer 20.
FIG. 5 is a schematic diagram illustrating the process of obtaining the query (Q) matrix, key (K) matrix, and value (V) matrix in the attention layer 20.
FIG. 6 is a schematic diagram illustrating the process of obtaining the attention score matrix in the attention layer 20.
FIG. 7 is a schematic diagram illustrating, in an embodiment of the present invention, the step (S50) of obtaining a matrix in which an amplification value is assigned to those elements of the attention score matrix that correspond to tokens obtained from words marked using the term set DB.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the figure in which they appear, and redundant descriptions of them are omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably only for ease of drafting the specification, and do not by themselves have distinct meanings or roles. That is, the term 'unit' as used in the present invention means software or a hardware component such as an FPGA or ASIC, and a 'unit' performs certain roles. However, a 'unit' is not limited to software or hardware. A 'unit' may be configured to reside in an addressable storage medium or may be configured to execute on one or more processors. Thus, as an example, a 'unit' includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided within the components and 'units' may be combined into a smaller number of components and 'units' or further separated into additional components and 'units'.

In addition, in describing the embodiments disclosed in this specification, a detailed description of related known technology is omitted where it is judged that it could obscure the gist of the embodiments. The accompanying drawings are provided only to facilitate understanding of the embodiments disclosed in this specification; the technical idea disclosed herein is not limited by the drawings and should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

FIG. 1 is a flowchart illustrating a transfer learning method for a natural language processing model using a neural network pre-trained on a corpus, according to an embodiment of the present invention.

The transfer learning method for a natural language processing model using a neural network pre-trained on a corpus, such as a BERT model, may include: constructing a term set DB (S10); receiving text including a plurality of words, looking up the plurality of words in the term set DB, and marking the words that are included in the term set DB (S20); converting the text into a plurality of tokens based on the corpus and generating an embedding vector from the plurality of tokens (S30); obtaining an attention score matrix by processing the embedding vector (S40); and obtaining a matrix in which an amplification value is assigned to those elements of the attention score matrix that correspond to the tokens obtained from the marked words (S50).

Here, a corpus is a collection of texts built using computers for the purpose of language research, or a large body of digitized language material drawn from various sources. It may take the form of, for example, large-scale unlabeled data built from Wikipedia, other encyclopedias, books, papers, academic materials, and the like.

The present invention proposes a method for fine-tuning a neural network model for transfer learning so that a concrete task (for example, recognition of a language such as Korean or English, translation, or classification of text documents) can be handled using BERT or the like. In particular, the inventor has confirmed a marked improvement in prediction accuracy by applying the present invention to a task in which the text of a technical document (such as a patent claim) is given as input and the neural network model predicts the technical classification related to its content.

FIG. 2 is a schematic diagram illustrating the overall structure of the BERT model. The BERT model is obtained by taking only the encoder part of the Transformer model and pre-training it on a large volume of documents. As shown in FIG. 2, it generally comprises an embedding layer 10, an encoder 100, and a classifier 60.

Here, the encoder 100 may include an attention layer 20, a first residual connection and normalization layer 30, an FFNN (Feed-Forward Neural Network) processing layer 40, and a second residual connection and normalization layer 50, and this layer structure may be repeated Lx times within the encoder 100. For example, Lx may be 12 or 24.

The output 102 of the embedding layer 10 is input to the encoder 100, and the final output 104 of the encoder 100 is input to the classifier 60.

FIG. 3 is a schematic diagram showing the structure of the embedding layer 10. The embedding layer 10 performs the step (S30) of converting the input text into a plurality of tokens based on the corpus and generating an embedding vector from the plurality of tokens. As shown in FIG. 3, the input text is tokenized into morpheme units and then subjected to token embedding (extracting each token's index within the corpus), segment embedding (for example, if there are two sentences, assigning a bit string such as 0000001111 to the tokens so that each token is associated with the sentence it belongs to), and position embedding (the order information of each token). The sum of the results is input to the attention layer 20 in the form of an embedding vector.
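As a non-limiting illustration, the summation performed by the embedding layer 10 might be sketched in Python as follows. The vocabulary, the embedding width `DIM`, the random lookup tables (standing in for learned embeddings), and the names `make_table` and `embed` are all assumptions made for this sketch and do not appear in the specification.

```python
import random

random.seed(0)  # fixed seed so the illustrative tables are reproducible
DIM = 4         # hypothetical embedding width, kept small for readability

# Hypothetical vocabulary: token -> index
vocab = {"i": 0, "am": 1, "a": 2, "student": 3}

def make_table(rows, dim):
    """Create a random lookup table standing in for learned embeddings."""
    return [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]

token_table = make_table(len(vocab), DIM)
segment_table = make_table(2, DIM)     # sentence 0 or sentence 1
position_table = make_table(16, DIM)   # up to 16 positions

def embed(tokens, segment_ids):
    """Sum token, segment, and position embeddings for each token (S30)."""
    rows = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        parts = zip(token_table[vocab[tok]], segment_table[seg], position_table[pos])
        rows.append([t + s + p for t, s, p in parts])
    return rows

tokens = ["i", "am", "a", "student"]
segment_ids = [0, 0, 0, 0]  # a single sentence, so every token is in segment 0
X = embed(tokens, segment_ids)  # one DIM-wide row per token
```

Each row of `X` is the element-wise sum of three lookups, which corresponds to the "sum of the results" passed to the attention layer 20.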

The basic operation of the embedding layer 10 of the present invention does not differ from that of the prior-art BERT model, but the present invention differs in that it additionally includes the step (S10) of separately constructing a term set DB.

Here, the term set DB is separate from the corpus that the BERT model or the like uses for pre-training, and refers to a separately built collection containing only the terms of the field relevant to a specific task. For example, in the task described above, in which the text of a technical document (such as a patent claim) is given as input and the neural network model predicts the technical classification related to its content, a dictionary of technical terms may be used as the term set DB.

In the present invention, text including a plurality of words is received and passed to the embedding layer 10 via the step (S20) of looking up the plurality of words in the term set DB and marking the words that are included in the term set DB. Here, marking a word may mean separately recording those words in the text that correspond to technical terms.

The step (S20) of receiving text including a plurality of words, looking up the plurality of words in the term set DB, and marking the words that are included in the term set DB may be implemented as preprocessing performed before the embedding layer 10, or may be implemented within the embedding layer 10 itself.
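As a non-limiting illustration, the marking step (S20), implemented as preprocessing, might be sketched in Python as follows. The function name `mark_terms`, the regular-expression word splitting, the case-insensitive lookup, and the sample term set are assumptions made for this sketch; the specification does not prescribe how the lookup or the record of marked words is implemented.

```python
import re

def mark_terms(text, term_set):
    """Split text into words and flag those found in the term set DB (S20).

    Returns the word list and a parallel list of booleans, which is one
    possible way of "separately recording" the marked words.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    flags = [word.lower() in term_set for word in words]
    return words, flags

# Hypothetical term set DB (S10): a set of lower-cased domain terms.
term_set_db = {"student"}

words, flags = mark_terms("I am a student", term_set_db)
```

A set is used for the term set DB here only because it gives constant-time membership tests; a database table or dictionary file would serve the same purpose.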

FIG. 4 is a schematic diagram conceptually illustrating the processing performed in the attention layer 20.

In general, 'attention' in a neural network model refers to the process of computing, for a given query (Q), its similarity to every key (K), applying each similarity as a weight to the value (V) mapped to the corresponding key, and returning the weighted sum of the values.
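As a non-limiting illustration, attention for a single query might be sketched in Python as follows. The use of the dot product as the similarity measure and of softmax to turn similarities into weights follows common Transformer practice and is an assumption here, as are the name `attend` and the sample vectors.

```python
import math

def attend(query, keys, values):
    """Return the similarity-weighted sum of values for one query."""
    # Similarity of the query to every key (dot product, assumed here).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax turns the similarities into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the values mapped to the keys.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attend([5.0, 0.0], keys, values)
```

Because the query points in the same direction as the first key, almost all of the weight falls on the first value.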

In particular, self-attention means that attention is performed on the input itself: in self-attention, the query (Q), key (K), and value (V) are all derived from the same input. Through self-attention, the similarity between the words in the input text can be obtained. In this respect the present invention does not differ from the known BERT model.

FIG. 5 is a schematic diagram illustrating the process of obtaining the query (Q) matrix, key (K) matrix, and value (V) matrix in the attention layer 20.

As shown, for example, the embedding vector (or matrix) generated from the input sentence 'I am a student' is multiplied by the weight matrices (WQ, WK, WV) for the query (Q), key (K), and value (V), respectively, to obtain the query (Q), key (K), and value (V) matrices.
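As a non-limiting illustration, the three projections might be sketched in Python as follows. The helper `matmul`, the matrix sizes (four tokens, model width 3, projection width 2), and all numeric values are assumptions chosen so that the result can be checked by hand; in a real model the weight matrices are learned.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Embedding matrix X: one row per token of "I am a student" (values are made up).
X = [
    [1.0, 0.0, 1.0],  # I
    [0.0, 2.0, 0.0],  # am
    [1.0, 1.0, 0.0],  # a
    [0.0, 1.0, 1.0],  # student
]

# Weight matrices WQ, WK, WV (made-up values).
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
W_V = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

Q = matmul(X, W_Q)  # query matrix, one row per token
K = matmul(X, W_K)  # key matrix
V = matmul(X, W_V)  # value matrix
```

All three matrices are computed from the same `X`, which is what makes this self-attention.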

FIG. 6 is a schematic diagram illustrating the process of obtaining the attention score matrix in the attention layer 20.

As shown, when the query (Q) matrix obtained in the above process is multiplied by the transpose of the key (K) matrix, a matrix is obtained whose elements are the inner products of each word's query (Q) vector with each word's key (K) vector. This matrix is called the attention score matrix. For example, as shown in FIG. 6, the value at row 'I' and column 'student' equals the attention score between the Q vector of 'I' and the K vector of 'student'. Since the process of obtaining the attention score matrix in the present invention does not itself differ from the prior art, a more detailed description is omitted.
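As a non-limiting illustration, the attention score matrix might be computed in Python as follows. The function name `attention_scores` and the numeric values are assumptions for this sketch. The scaling by the square root of the key dimension used in the original Transformer is not mentioned in the text and is therefore left out.

```python
def attention_scores(Q, K):
    """Compute Q times K-transpose: element [i][j] is the inner product
    of token i's query vector with token j's key vector."""
    return [[sum(q * k for q, k in zip(q_row, k_row)) for k_row in K] for q_row in Q]

tokens = ["I", "am", "a", "student"]
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]  # made-up query vectors
K = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0], [3.0, 1.0]]  # made-up key vectors

A1 = attention_scores(Q, K)
row, col = tokens.index("I"), tokens.index("student")
score_I_student = A1[row][col]  # inner product of Q("I") and K("student")
```

The result is a square matrix with one row and one column per token, matching the layout described for FIG. 6.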

FIG. 7 is a schematic diagram illustrating, in an embodiment of the present invention, the step (S50) of obtaining a matrix in which an amplification value is assigned to those elements of the attention score matrix that correspond to tokens obtained from words marked using the term set DB.

As shown, the embodiment of the present invention includes the step (S50) of obtaining a matrix in which an amplification value is assigned to those elements of the attention score matrix (A1) obtained in the above process that correspond to tokens obtained from the marked words. Here, a marked word is a word that was marked in the step (S20) of looking up the plurality of words in the term set DB and marking the words that are included in the term set DB, which is performed before (or in parallel with) generating the embedding vector from the input text.

For example, suppose that, before (or in parallel with) generation of the embedding vector, 'student' was marked as a word included in the term set DB, and that in the attention score matrix (A1) obtained as shown in the upper part of FIG. 7 the elements corresponding to the word 'student' are those in the bottom row and the rightmost column. Then, as shown in the middle part of FIG. 7, the amplification value is assigned to that row and that column. The amplification value is preferably greater than 1 and less than or equal to 1.5, and may be, for example, 1.1, 1.15, or 1.2.
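As a non-limiting illustration, step S50 might be sketched in Python as follows. The sketch assumes that "assigning" the amplification value means multiplying the affected scores by it, and that the element where a marked row and a marked column intersect is multiplied only once; the specification states neither point explicitly. The names `amplify`, `marked`, and `gain` are likewise hypothetical.

```python
def amplify(scores, marked, gain=1.2):
    """Scale every score in the rows and columns of marked tokens (S50).

    scores: square attention score matrix A1 (list of rows).
    marked: one boolean per token, True if the token comes from a marked word.
    gain:   amplification value, expected to satisfy 1 < gain <= 1.5.
    """
    if not 1.0 < gain <= 1.5:
        raise ValueError("gain is expected to satisfy 1 < gain <= 1.5")
    n = len(scores)
    return [
        [scores[i][j] * gain if (marked[i] or marked[j]) else scores[i][j]
         for j in range(n)]
        for i in range(n)
    ]

# Made-up A1 for the tokens I, am, a, student; only 'student' is marked.
A1 = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
marked = [False, False, False, True]
A2 = amplify(A1, marked, gain=1.2)
```

Under the multiplicative assumption, a negative score in a marked row or column would become more negative; the specification does not discuss that case.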

Once the matrix (A2) to which the amplification value has been assigned is obtained, the subsequent steps are, as shown in the lower part of FIG. 7, the same as in the known BERT model: the softmax function is applied to the matrix (A2), and the result is multiplied by the value (V) matrix to obtain the attention value matrix (A3). This processing takes place in the attention layer 20 within the encoder 100 shown in FIG. 2.
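As a non-limiting illustration, obtaining the attention value matrix (A3) from an amplified score matrix (A2) and a value matrix (V) might be sketched in Python as follows. The row-wise application of softmax follows standard Transformer practice; the function names and numeric values are assumptions for this sketch.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_values(A2, V):
    """Apply softmax to each row of A2, then multiply by V to get A3."""
    weights = [softmax(row) for row in A2]
    return [[sum(w * v for w, v in zip(w_row, col)) for col in zip(*V)]
            for w_row in weights]

# Made-up two-token example.
A2 = [
    [0.0, 0.0],   # equal scores: both values are averaged
    [0.0, 50.0],  # second score dominates: output is close to V[1]
]
V = [[1.0, 3.0], [3.0, 5.0]]
A3 = attention_values(A2, V)
```

The second row shows the intended effect of amplification: a larger score for a token shifts the output toward that token's value vector.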

Thereafter, the first residual connection and normalization, the FFNN processing, and the second residual connection and normalization are performed on the attention value matrix (A3). These are performed, respectively, in the first residual connection and normalization layer 30, the FFNN processing layer 40, and the second residual connection and normalization layer 50 of the encoder 100 shown in FIG. 2. As shown in FIG. 2, this layer structure may be repeated Lx times within the encoder 100, so the corresponding processing may also be repeated Lx times. That is, the output of the first encoder is input to the second encoder, and this process is repeated several times.
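As a non-limiting illustration, the remaining encoder processing and its repetition might be sketched in Python as follows. Several simplifications are assumptions of this sketch and not statements of the specification: layer normalization is shown without learned scale and bias, the FFNN uses ReLU and omits biases, the same weights are reused in every repetition, and the attention layer is passed in as a callable (`attention`) so that the block stays short.

```python
import math

def layer_norm(row, eps=1e-5):
    """Normalize one row to zero mean and (approximately) unit variance."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by normalization (layers 30 and 50)."""
    return [layer_norm([a + b for a, b in zip(x_row, s_row)])
            for x_row, s_row in zip(x, sublayer_out)]

def ffnn(x, W1, W2):
    """Position-wise feed-forward network (layer 40), ReLU assumed."""
    hidden = [[max(0.0, sum(a * w for a, w in zip(row, col))) for col in zip(*W1)]
              for row in x]
    return [[sum(h * w for h, w in zip(row, col)) for col in zip(*W2)]
            for row in hidden]

def encoder_block(x, attention, W1, W2):
    h = add_and_norm(x, attention(x))        # first residual connection + norm
    return add_and_norm(h, ffnn(h, W1, W2))  # FFNN, second residual + norm

def encoder(x, attention, W1, W2, num_layers):
    """Feed the output of each block into the next, num_layers (Lx) times."""
    for _ in range(num_layers):
        x = encoder_block(x, attention, W1, W2)
    return x

X = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
W1 = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.1, -0.1, 0.2], [0.2, 0.0, 0.1, -0.3]]
W2 = [[0.1, 0.0, -0.1], [0.0, 0.2, 0.1], [-0.2, 0.1, 0.0], [0.1, -0.1, 0.2]]
# Identity function used as a stand-in for the attention layer 20.
out = encoder(X, lambda x: x, W1, W2, num_layers=2)
```

In a full model, the `attention` callable would be the amplified self-attention described above, and `num_layers` would be the Lx of the encoder 100.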

The present invention described above can be implemented as computer-readable code on a medium on which a program is recorded. The computer-readable medium may store a computer-executable program persistently, or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like configured to store program instructions. Other examples include recording or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software. Accordingly, the above detailed description should not be construed as limiting in any respect and should be considered illustrative. The scope of the present invention should be determined by reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

The present invention is not limited by the foregoing embodiments and the accompanying drawings. It will be apparent to those of ordinary skill in the art to which the present invention pertains that the components according to the present invention may be substituted, modified, and changed without departing from the technical spirit of the present invention.

10: embedding layer 20: attention layer
30: first residual connection and normalization layer 40: FFNN processing layer
50: second residual connection and normalization layer 60: classifier
100: encoder 102: output of the embedding layer
104: final output of the encoder

Claims

1. A transfer learning method for a natural language processing model using a neural network pre-trained on a corpus, the method comprising:
constructing a term set DB;
receiving text including a plurality of words, looking up the plurality of words in the term set DB, and marking the words that are included in the term set DB;
converting the text into a plurality of tokens based on the corpus, and generating an embedding vector from the plurality of tokens;
obtaining an attention score matrix by processing the embedding vector; and
obtaining a matrix in which an amplification value is assigned to those elements of the attention score matrix that correspond to the tokens obtained from the marked words.

2. The transfer learning method of claim 1, wherein obtaining the attention score matrix comprises:
obtaining a query (Q) matrix, a key (K) matrix, and a value (V) matrix by multiplying the embedding vector by weight matrices; and
obtaining the attention score matrix by multiplying the query (Q) matrix and the key (K) matrix.

3. The transfer learning method of claim 2, further comprising:
applying a softmax function to the matrix to which the amplification value is assigned and multiplying the result by the value (V) matrix to obtain an attention value matrix.

4. The transfer learning method of claim 3, further comprising:
performing a first residual connection and normalization, FFNN processing, and a second residual connection and normalization on the attention value matrix.

5. The transfer learning method of claim 4, further comprising:
inputting, to a second encoder, the output of the step of performing the first residual connection and normalization, the FFNN processing, and the second residual connection and normalization.

6. The transfer learning method of claim 1,
wherein the amplification value is greater than 1 and less than or equal to 1.5.