KR20220138960A

KR20220138960A - Apparatus and method for generating named entity recognition model based on knowledge embedding model

Info

Publication number: KR20220138960A
Application number: KR1020210044980A
Authority: KR
Inventors: 김장원; 채수현
Original assignee: 군산대학교산학협력단
Priority date: 2021-04-07
Filing date: 2021-04-07
Publication date: 2022-10-14
Also published as: KR102557380B1

Abstract

Disclosed are an apparatus and a method for generating a named entity recognition model. According to an embodiment of the present invention, the method for generating a named entity recognition model comprises: pre-processing input domain data for pre-training of a language model; generating a pre-trained model of the input domain based on the pre-processed data; generating a target domain knowledge graph including data sets of a subject, a predicate, and an object based on target domain data included in the input domain data; generating a knowledge embedding based on the pre-trained model and the target domain knowledge graph; and fine-tuning the pre-trained model to identify named entities based on the knowledge embedding.

Description

Apparatus and method for generating named entity recognition model based on knowledge embedding model

The following embodiments relate to a technique for generating a named entity recognition model based on a knowledge embedding model.

Most machine learning techniques are effective only when the training dataset and the real-world dataset share the same features and distribution. Therefore, when the target domain or target task changes, a training dataset for the new target domain or target task must be collected or generated again, and a new machine learning model must then be built.

However, in some real-world domains, collecting or generating a new training dataset (e.g., by labeling) is very expensive or impossible. For example, when building a model that predicts the location of a lesion from a patient's radiographic image in the medical domain, large collections of radiographic images tagged with lesion locations hardly exist, so it is impossible to secure a training dataset for the prediction model. In addition, tagging lesion locations on radiographic images requires the help of specialists such as radiologists. Generating the training dataset directly is therefore very costly.

Transfer learning can be used as a way to reduce the cost of collecting or generating a new training dataset.

A method of generating a named entity recognition model according to an embodiment may include: performing pre-processing on input domain data for pre-training of a language model; generating a pre-trained model of the input domain based on the pre-processed data; generating a target domain knowledge graph including data sets of a subject, a predicate, and an object based on target domain data included in the input domain data; generating a knowledge embedding based on the pre-trained model and the target domain knowledge graph; and fine-tuning the pre-trained model to identify named entities based on the knowledge embedding.

The pre-processing may include: extracting text data from the input domain data; deleting stopwords from the extracted text data; generating a vocabulary by tokenizing the text data from which the stopwords have been deleted; and generating pre-training data based on the vocabulary. The generating of the pre-trained model may include generating the pre-trained model of the input domain based on the pre-training data.

The stopwords may include semantic stopwords that are defined differently depending on the input domain.

The knowledge graph may be generated based on a pre-built named entity dictionary that includes named entities of the target domain.

The generating of the knowledge embedding may include: extracting target domain text data from the target domain data; extracting a target domain sentence by deleting stopwords from the extracted target domain text data; tokenizing the target domain sentence using the vocabulary; expanding the tokenized target domain sentence according to a predetermined maximum number of paths and a predetermined maximum depth based on the data sets of the target domain knowledge graph; mapping, to each token included in the expanded target domain sentence, a segment index including information on the depth of the token and a position index including position information of the token; and generating the knowledge embedding based on the tokens included in the expanded target domain sentence and on the segment index and the position index corresponding to those tokens.

The expanding of the tokenized target domain sentence may include expanding the tokenized target domain sentence within a range in which the depth and the number of paths of the tokens included in the expanded target domain sentence do not exceed the maximum depth and the maximum number of paths, respectively.

The expanding of the target domain sentence may include expanding the tokenized target domain sentence by appending the predicate token and the object token of a data set to the token, among the tokens included in the tokenized target domain sentence, that corresponds to the subject of that data set.
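The expansion step can be illustrated with a minimal Python sketch. It is not the patent's implementation: the names `Node` and `expand_sentence`, the tree representation, and the reading of "maximum number of paths" as the number of branches attached per token are all assumptions made for illustration, as is the example data.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One token of the expanded sentence (hypothetical structure)."""
    token: str
    depth: int
    # Each branch is a [predicate_node, object_node] pair taken from a triple.
    branches: list = field(default_factory=list)


def expand_sentence(tokens, triples, max_paths, max_depth):
    """Attach (predicate, object) branches to tokens that match a triple's subject."""
    by_subject = {}
    for subject, predicate, obj in triples:
        by_subject.setdefault(subject, []).append((predicate, obj))

    def attach(node):
        # Stop once the next level would exceed the maximum depth.
        if node.depth >= max_depth:
            return
        # Assumption: at most max_paths branches are attached per token.
        for predicate, obj in by_subject.get(node.token, [])[:max_paths]:
            pred_node = Node(predicate, node.depth + 1)
            obj_node = Node(obj, node.depth + 1)
            node.branches.append([pred_node, obj_node])
            attach(obj_node)  # the object may itself be a subject

    roots = [Node(token, 0) for token in tokens]
    for root in roots:
        attach(root)
    return roots


# Illustrative data only; not taken from the patent.
triples = [
    ("sensor", "measures", "temperature"),
    ("sensor", "includes", "electrode"),
    ("sensor", "outputs", "signal"),
    ("temperature", "has_unit", "kelvin"),
]
expanded = expand_sentence(["sensor", "detects", "heat"], triples, max_paths=2, max_depth=2)
```

With `max_paths=2`, the third triple for "sensor" is dropped, and with `max_depth=2` the object "temperature" is expanded one further level. Lowering either limit shortens the expanded sentence.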

The maximum depth and the maximum number of paths may be determined based on a maximum length of the expanded target domain sentence that is predetermined as usable for the fine-tuning.

The mapping may include: mapping, as the segment index, to each token included in the expanded target domain sentence, the value that corresponds to the depth information of that token among the non-negative integers from 0 up to the value corresponding to the maximum depth; and mapping, as the position index, non-negative integer values starting from 0 sequentially to each token, from the first token of the expanded target domain sentence to the last token of each path. The tokens included in the expanded target domain sentence may be distinguished from one another by the segment index and the position index.
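One possible reading of this index mapping is sketched below in Python. The nested-tuple input format and the function name `index_tokens` are hypothetical. The sketch assumes that a branch continues counting positions from the token it is attached to, while the main sentence also continues from that token, so positions run sequentially along each path.

```python
def index_tokens(sentence):
    """Return (token, segment_index, position_index) rows for an expanded sentence.

    `sentence` is a list of (token, branches) pairs, where `branches` is a list
    of paths and each path is again a list of (token, branches) pairs.
    """
    rows = []

    def walk(path, depth, start):
        for offset, (token, branches) in enumerate(path):
            position = start + offset
            # Segment index = depth of the token; position index = place on its path.
            rows.append((token, depth, position))
            for branch in branches:
                # A branch continues counting right after its anchor token.
                walk(branch, depth + 1, position + 1)

    walk(sentence, 0, 0)
    return rows


# Illustrative expanded sentence: "sensor detects heat" with one branch
# "measures temperature" attached to "sensor".
expanded = [
    ("sensor", [[("measures", []), ("temperature", [])]]),
    ("detects", []),
    ("heat", []),
]
rows = index_tokens(expanded)
# rows == [("sensor", 0, 0), ("measures", 1, 1), ("temperature", 1, 2),
#          ("detects", 0, 1), ("heat", 0, 2)]
```

Here "measures" and "detects" share position 1 but differ in segment index, which is how the pair of indexes separates tokens on the main sentence from tokens on an attached path.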

The precision, recall, and F1-score of the model fine-tuned to identify named entities may be determined according to the maximum depth and the maximum number of paths.

The knowledge embedding may include a segment embedding generated based on the segment index, a position embedding generated based on the position index, and a token embedding generated based on the tokens included in the expanded target domain sentence.
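The following Python sketch shows one way the three embeddings could be combined. The text states only that the knowledge embedding includes a token, a segment, and a position embedding; the element-wise sum used here follows the way BERT builds its input embeddings and is an assumption, as are the function names, table sizes, and random initial values.

```python
import random


def make_table(rows, dim, seed):
    """Build a small lookup table of random vectors (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def knowledge_embedding(token_ids, segment_ids, position_ids,
                        token_table, segment_table, position_table):
    """Sum token, segment, and position vectors per token (assumed combination)."""
    vectors = []
    for t, s, p in zip(token_ids, segment_ids, position_ids):
        vectors.append([
            a + b + c
            for a, b, c in zip(token_table[t], segment_table[s], position_table[p])
        ])
    return vectors


DIM = 4
token_table = make_table(rows=6, dim=DIM, seed=0)     # one row per vocabulary entry
segment_table = make_table(rows=2, dim=DIM, seed=1)   # one row per depth 0..max depth
position_table = make_table(rows=3, dim=DIM, seed=2)  # one row per position index

# Five tokens of an expanded sentence with their segment and position indexes.
embedding = knowledge_embedding(
    token_ids=[0, 1, 2, 3, 4],
    segment_ids=[0, 1, 1, 0, 0],
    position_ids=[0, 1, 2, 1, 2],
    token_table=token_table,
    segment_table=segment_table,
    position_table=position_table,
)
```

Each token ends up with one vector of length `DIM`, so the result can be fed to a transformer encoder in place of the usual input embeddings.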

The target domain knowledge graph may be generated by setting windows of a fixed size before and after a predicate extracted from the target domain data and extracting, from among the words included in the windows, a word corresponding to a subject or an object.

The target domain knowledge graph may be generated by setting windows of a fixed size before and after a predicate extracted from the target domain data, extracting candidate words that may correspond to a subject or an object from among the words included in the windows, determining which of the candidate words correspond to named entities of the target domain, and determining those words as the subject or the object. The words corresponding to named entities may be determined based on the pre-built named entity dictionary that includes named entities of the target domain.
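The window-based extraction of subject, predicate, and object data sets can be sketched as follows. This is a simplified illustration: the function name `extract_triples` and the example data are hypothetical, and the sketch assumes English word order, taking subject candidates from the window before the predicate and object candidates from the window after it. Korean word order would require a different assignment rule.

```python
def extract_triples(tokens, predicates, entity_dict, window):
    """Build (subject, predicate, object) triples around each predicate token."""
    triples = []
    for i, token in enumerate(tokens):
        if token not in predicates:
            continue
        # Candidate words inside the fixed-size windows around the predicate.
        before = tokens[max(0, i - window):i]
        after = tokens[i + 1:i + 1 + window]
        # Keep only candidates listed in the named entity dictionary.
        subjects = [word for word in before if word in entity_dict]
        objects = [word for word in after if word in entity_dict]
        for subject in subjects:
            for obj in objects:
                triples.append((subject, token, obj))
    return triples


# Illustrative data only; not taken from the patent.
tokens = ["sensor", "measures", "temperature", "and", "controller", "adjusts", "valve"]
triples = extract_triples(
    tokens,
    predicates={"measures", "adjusts"},
    entity_dict={"sensor", "temperature", "controller", "valve"},
    window=2,
)
# triples == [("sensor", "measures", "temperature"), ("controller", "adjusts", "valve")]
```

A larger window finds more candidates but also pairs entities that are not actually related, so the window size trades coverage of the knowledge graph against noise.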

The target domain knowledge graph may be generated by further extracting, from outside the windows, candidate words that may serve as the object based on their relationship with the subject and the predicate, and determining as the object those words, among the further extracted candidate words, that correspond to named entities of the target domain according to the named entity dictionary.

An apparatus for generating a named entity recognition model according to an embodiment may include: a preprocessor that performs pre-processing on input domain data for pre-training of a language model; a pre-trainer that generates a pre-trained model of the input domain based on the pre-processed data; and a fine-tuner that generates a knowledge embedding based on the pre-trained model and a target domain knowledge graph and fine-tunes the pre-trained model to identify named entities based on the knowledge embedding. The target domain knowledge graph may include data sets of a subject, a predicate, and an object generated based on target domain data included in the input domain data.

The preprocessor may extract text data from the input domain data, delete stopwords from the extracted text data, generate a vocabulary by tokenizing the text data from which the stopwords have been deleted, and generate pre-training data based on the vocabulary. The pre-trainer may generate the pre-trained model of the input domain based on the pre-training data.

The fine-tuner may extract target domain text data from the target domain data, extract a target domain sentence by deleting stopwords from the extracted target domain text data, tokenize the target domain sentence using the vocabulary, expand the tokenized target domain sentence according to a predetermined maximum number of paths and a predetermined maximum depth based on the data sets of the target domain knowledge graph, map, to each token included in the expanded target domain sentence, a segment index including information on whether the target domain sentence has been expanded and a position index including position information of the token, and generate the knowledge embedding based on the tokens included in the expanded target domain sentence and on the segment index and the position index corresponding to those tokens.

The fine-tuner may expand the tokenized target domain sentence within a range in which the depth and the number of paths of the tokens included in the expanded target domain sentence do not exceed the maximum depth and the maximum number of paths, respectively.

The fine-tuner may expand the tokenized target domain sentence by appending the predicate token and the object token of a data set to the token, among the tokens included in the tokenized target domain sentence, that corresponds to the subject of that data set.

The fine-tuner may map, as the segment index, to each token included in the expanded target domain sentence, the value that corresponds to the depth information of that token among the non-negative integers from 0 up to the value corresponding to the maximum depth, and may map, as the position index, non-negative integer values starting from 0 sequentially to each token, from the first token of the expanded target domain sentence to the last token of each path. The tokens included in the expanded target domain sentence may be distinguished from one another by the segment index and the position index.

FIG. 1 is a diagram illustrating an overview of an apparatus and method for generating a named entity recognition model according to an embodiment.
FIG. 2 is a diagram illustrating the knowledge embedding generation process of the apparatus and method for generating a named entity recognition model according to an embodiment.
FIG. 3 is a flowchart illustrating operations performed by the fine-tuner according to an embodiment.
FIG. 4 is a diagram illustrating the structure of the fine-tuner according to an embodiment.
FIG. 5 is a flowchart illustrating a method of generating a named entity recognition model according to an embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only and may be modified and implemented in various forms. Accordingly, actual implementations are not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, and alternatives that fall within the technical idea described through the embodiments.

Although terms such as first and second may be used to describe various components, these terms should be interpreted only as distinguishing one component from another. For example, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component.

When a component is said to be "connected" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to specify the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood as not precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the relevant art. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this specification.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. In the description referring to the drawings, identical components are given identical reference numerals regardless of the figure number, and redundant descriptions thereof are omitted.

FIG. 1 is a diagram illustrating an overview of an apparatus and method for generating a named entity recognition model according to an embodiment.

To provide accurate, high-recall search results for a user's query over documents covering diverse fields (information, food, chemistry, health, and so on), a search technology that can take the meaning of vocabulary into account is needed. Korean in particular has many homonyms, so a technique for semantically identifying vocabulary in context is essential for improving search performance. A named entity recognition model can identify words in text data that denote entities such as people, places, and organizations, and performing named entity recognition on text data can improve the accuracy of searches over that data.

Specialized fields such as medicine, law, and industry use their own technical terminology, so a named entity recognition model trained on general vocabulary cannot achieve the required level of precision, recall, and F1-score. To perform named entity recognition accurately for a specific domain (or field) or purpose, a model must be trained on data suited to that purpose. However, sufficient data may not exist for a particular field, and because producing a large amount of training data for a named entity recognition model takes considerable time and cost, building training data for every required field may not be realistic.
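For reference, the three metrics named above can be computed as in the short Python sketch below. It uses the standard definitions of precision, recall, and F1-score; the representation of an entity as a `(token_index, label)` pair and the example values are assumptions for illustration, not data from the patent.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 over sets of predicted and gold entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative entities as (token_index, label) pairs.
predicted = {(0, "ORG"), (3, "LOC"), (5, "PER")}
gold = {(0, "ORG"), (3, "LOC"), (7, "PER"), (9, "ORG")}
precision, recall, f1 = precision_recall_f1(predicted, gold)
# Two of three predictions are correct (precision 2/3) and two of four
# gold entities are found (recall 1/2), giving F1 = 4/7.
```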

According to an apparatus and method for training a named entity recognition model according to an embodiment, a pre-trained model of an input domain may be generated by training a language model on input domain data, that is, data of an input domain containing a large amount of data. The language model may be a model based on an attention deep learning network. In an embodiment, the language model may be a BERT (Bidirectional Encoder Representations from Transformers) model.

The apparatus and method for training a named entity recognition model may generate a knowledge graph from target domain data, that is, data of the field in which high precision and recall in named entity recognition are desired, and may fine-tune the parameters of the pre-trained model based on the target domain knowledge graph. By using transfer learning to fine-tune with the target domain knowledge graph and thereby generate the named entity recognition model, the apparatus and method can effectively raise the precision, recall, and F1-score of named entity recognition on target domain data.

Referring to FIG. 1, there are shown a preprocessor that receives and pre-processes input domain data, a pre-trainer that generates a pre-trained model based on the pre-processed data, and a fine-tuner that fine-tunes the pre-trained model for named entity identification.

The preprocessor 110 may receive input domain data 105 and pre-process the input domain data 105 for pre-training and fine-tuning. Here, a domain may mean the range or field of data on which a language model is to be trained, and the input domain may mean the range or field of data on which the language model is to be pre-trained. For example, the input domain may be web articles, social network service (SNS) posts, or blog posts in a specific field, all Korean patent documents, all Korean science and technology papers, or all Korean legal literature, and the input domain data 105 may be data such as text and images contained in those sources. This is only an example; the input domain is not limited thereto and may be set in various ways as needed. For example, the input domain may be set to include all Korean patent documents together with all science and technology papers, or to include only a portion of the Korean patent documents.

The input domain data 105 may include a large volume of data for pre-training the language model. The input domain may be a domain that is broader than the target domain and contains it. Here, the target domain may mean the range or field of data for which higher precision, recall, and F1-score in named entity recognition are desired. For example, to recognize named entities in patent documents in section G of the Cooperative Patent Classification (CPC), the input domain may be all Korean patent documents and the target domain may correspond to the Korean patent documents in CPC section G. In another example, the target domain may be the Korean patent documents that fall under some of the subdivisions of the CPC (for example, its sections, classes, subclasses, main groups, and subgroups) and of the International Patent Classification (IPC) (for example, its sections). The target domain may be determined based on IPC and CPC classification codes for patent documents, and based on classification systems such as the science and technology standard classification system and the industrial technology classification system for papers, reports, legal literature, and the like. However, this is only an example, and the input domain and the target domain may be set in various ways as needed.

The preprocessor 110 may extract text data from the input domain data 105 and delete stopwords from the extracted text data. Stopwords are terms that carry little meaning in a sentence, and deleting them can improve the accuracy of named entity recognition. For example, stopwords may include postpositional particles such as "을" and "를" and conjunctive adverbs such as "그리고" ("and") and "그래서" ("so").

The stopwords may include semantic stopwords, that is, stopwords defined per domain. When the input domain data 105 consists of patent documents, terms such as "상기" ("said"), "장치" ("apparatus"), and "방법" ("method") appear regardless of the specific content of the patent data and carry no particular meaning, so they may be treated as semantic stopwords for the input domain and deleted from the text data.

The preprocessor 110 may divide the text data from which stopwords have been deleted into sentences. The preprocessor 110 may delimit sentences using symbols contained in the text data, such as periods, commas, and semicolons.

The preprocessor 110 may tokenize the sentence-segmented text data. The preprocessor 110 may perform tokenization sentence by sentence, which makes context-aware tokenization possible. By tokenizing, the preprocessor 110 may generate a vocabulary containing the tokens. When tokenizing Korean text data, the preprocessor 110 may use morphological analysis.

The preprocessor 110 may generate pre-training data for pre-training the language model based on the tokenized data. The pre-training data and the vocabulary generated by the preprocessor 110 may be input to the pre-trainer 115.
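The preprocessing flow described above (stopword removal, sentence splitting, tokenization, vocabulary construction) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function name `preprocess` is hypothetical, whitespace splitting stands in for the morphological analysis mentioned for Korean text, and stopwords are filtered per token after splitting rather than before, purely to keep the example short.

```python
import re


def preprocess(text, stopwords):
    """Split text into sentences, tokenize, drop stopwords, and build a vocabulary."""
    # Sentence boundaries: periods, commas, and semicolons, as in the description.
    sentences = [part.strip() for part in re.split(r"[.,;]", text) if part.strip()]

    tokenized = []
    for sentence in sentences:
        # Assumption: whitespace tokenization instead of morphological analysis.
        tokens = [word.lower() for word in sentence.split()]
        tokens = [token for token in tokens if token not in stopwords]
        if tokens:
            tokenized.append(tokens)

    # Vocabulary: each distinct token gets the next free integer id.
    vocabulary = {}
    for tokens in tokenized:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return tokenized, vocabulary


# "apparatus" plays the role of a domain-specific (semantic) stopword here.
stopwords = {"the", "a", "apparatus"}
sentences, vocabulary = preprocess(
    "The apparatus includes a sensor; the sensor measures temperature.", stopwords
)
# sentences  == [["includes", "sensor"], ["sensor", "measures", "temperature"]]
# vocabulary == {"includes": 0, "sensor": 1, "measures": 2, "temperature": 3}
```

The tokenized sentences correspond to the pre-training data and the dictionary to the vocabulary that the description passes on to the next stage.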

The pre-trainer 115 may pre-train the language model. To do so, the pre-trainer 115 may generate sentence embeddings based on the pre-training data and the vocabulary received from the preprocessor 110. The pre-trainer 115 may include a transformer including a plurality of encoders and decoders, and the sentence embeddings may be input to the transformer to perform pre-training. The pre-trainer 115 may pre-train the language model through a Masked Language Model (MLM) or Next Sentence Prediction (NSP) objective and generate a pre-trained model of the input domain trained on the input domain data 105. In an embodiment, the language model may be a BERT language model, and when the input domain is all Korean patent documents, the pre-trained model may be a BERT language model trained on all Korean patent documents.
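As a rough illustration of the masked language model objective mentioned above, the Python sketch below prepares one masked training example. The function name `mask_tokens`, the `[MASK]` string, and the masking probability are assumptions for illustration. BERT's actual procedure additionally replaces some selected tokens with random tokens or leaves them unchanged, which this sketch omits.

```python
import random


def mask_tokens(tokens, mask_prob, rng, mask_token="[MASK]"):
    """Hide some tokens and record the originals as prediction targets."""
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)   # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)    # position is ignored by the training loss
    return masked, labels


rng = random.Random(0)  # fixed seed so the example is repeatable
tokens = ["sensor", "measures", "temperature", "of", "fluid"]
masked, labels = mask_tokens(tokens, mask_prob=0.15, rng=rng)
```

During pre-training, the model receives `masked` as input and is trained to predict the non-`None` entries of `labels`.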

Named entities can be recognized using the pre-trained model trained on the entire input domain. However, the named entities in use may differ among the sub-domains contained in the input domain, which can lower the accuracy, recall, and F1-score for the target domain. To raise these metrics, the model should be trained on data related to the target domain. The fine-tuner 120 may generate a named entity recognition model by fine-tuning the pre-trained model so as to increase the accuracy, recall, and F1-score of named entity recognition on data of the target domain in which named entities are to be recognized.

The fine-tuner 120 may receive the input domain data 105, the pre-trained model from the pre-trainer 115, the vocabulary from the preprocessor 110, and the target domain knowledge graph 125 generated based on the target domain data. The fine-tuner 120 may preprocess the target domain data included in the input domain data 105 by extracting text and removing stopwords, thereby extracting target domain sentences for fine-tuning. The stopwords may include semantic stopwords defined for the input domain or the target domain.

The fine-tuner 120 may generate a knowledge embedding for a target domain sentence based on the target domain sentence extracted from the target domain data, the pre-trained model, and the target domain knowledge graph 125. By fine-tuning the pre-trained model based on the knowledge embedding, the fine-tuner 120 may generate a named entity recognition model that exhibits high accuracy, recall, and F1-score for the target domain.

The target domain knowledge graph 125 may include data sets of subjects, predicates, and objects extracted from the target domain data. For example, the target domain may be the claims of patent data in Section G of the CPC classification, and the target domain knowledge graph 125 may include data sets of subjects, predicates, and objects extracted from the claims of the Section G patent data. Here, knowledge means the set of subjects, predicates, and objects extracted from a specific domain, and the knowledge graph 125 means the data sets formed by associating the subjects, predicates, and objects extracted from that domain with one another. The subjects and objects may consist of named entities of the target domain data.

In the target domain knowledge graph 125, a predicate may be determined by extracting the root of a predicate included in the target domain data. For example, when the target domain data includes the predicate '처리하다' ("to process"), its root '처리' ("process") may be determined as the predicate of the target domain knowledge graph 125.
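The root extraction in the example above can be imitated with a toy suffix rule, shown below. The function name and the suffix list are hypothetical, and suffix stripping is only a stand-in; an actual implementation would more plausibly obtain the root from a morphological analyzer.

```python
def predicate_root(predicate, suffixes=("하다", "되다")):
    """Strip a verbalizing suffix to approximate the root of a Korean predicate.

    The suffix list is an illustrative assumption, not an exhaustive rule.
    """
    for suffix in suffixes:
        if predicate.endswith(suffix) and len(predicate) > len(suffix):
            return predicate[: -len(suffix)]
    return predicate  # no known suffix: keep the predicate unchanged
```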

To determine the subjects and objects of the target domain knowledge graph 125, a window of fixed size may be set before and after each predicate included in the target domain data, and candidate words that could serve as a subject or an object may be extracted from the words within the window. For example, the window may cover the three words before and the three words after the predicate. Among the candidate words, those corresponding to named entities of the target domain may be determined as the subject and the object.
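A minimal sketch of the window-based candidate extraction might look as follows. The function name and the representation of a sentence as a list of words are assumptions; the default window of three words follows the example in the description.

```python
def window_candidates(words, predicate_index, window=3):
    """Return the words inside the window before and after the predicate.

    The window is clipped at the sentence boundaries, so fewer than
    `window` words may be returned on either side.
    """
    start = max(0, predicate_index - window)
    before = words[start:predicate_index]
    after = words[predicate_index + 1 : predicate_index + 1 + window]
    return before, after
```

The returned words are only candidates; deciding which of them are named entities is a separate step.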

To avoid missing an object that lies outside the window set around the predicate but could nevertheless form a subject-predicate-object data set in the target domain, further candidate words for the object may be extracted from beyond the window, taking into account the relationship with the subject and the predicate. For example, candidate words that could serve as an object forming a data set with a given subject and predicate may be extracted from the sentence, the paragraph, or the entire document containing that subject and predicate. Among these candidate words, a word corresponding to a named entity of the target domain may be determined as the object.

In one embodiment, a pre-built named entity dictionary 130 may be used to determine the subject and the object. The named entity dictionary 130 is built to increase recall when identifying named entities of the target domain and may include the named entities of the target domain. Based on the entries of the named entity dictionary 130, named entities may be identified among the candidate words, and the named entities to serve as the subject and the object may be determined. Subjects and objects may be pronouns as well as nouns. By using the named entity dictionary 130 to generate the target domain knowledge graph 125, and by fine-tuning the pre-trained model with the generated target domain knowledge graph 125, a named entity recognition model with relatively high recall can be obtained.
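Dictionary-based identification of named entities among candidate words can be sketched as a simple membership filter. The function name and the use of a Python set for the dictionary are assumptions made for illustration.

```python
def select_entities(candidates, entity_dictionary):
    """Keep candidate words found in the dictionary.

    Order of first appearance is preserved and repeated words are dropped.
    """
    seen = set()
    selected = []
    for word in candidates:
        if word in entity_dictionary and word not in seen:
            seen.add(word)
            selected.append(word)
    return selected
```

How the selected entities are then assigned the roles of subject and object is not specified here and is left open in this sketch.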

In another embodiment, the subject and the object may be determined probabilistically without using the named entity dictionary 130. For example, words likely to be named entities may be identified probabilistically among the extracted candidate words, and the named entities to serve as the subject and the object may be determined from the identified named entities. If a named entity recognition model with higher recall is needed after the target domain knowledge graph 125 has been generated without the named entity dictionary 130, the target domain knowledge graph 125 may be corrected using the named entity dictionary 130.

The fine-tuner 120 may generate a knowledge embedding for a target domain sentence based on the target domain sentence for fine-tuning, the pre-trained model, and the target domain knowledge graph 125. The fine-tuner 120 may include a knowledge layer and an embedding layer for generating the knowledge embedding.

In the knowledge layer, the fine-tuner 120 may tokenize the target domain sentence using the vocabulary input to the fine-tuner 120. The fine-tuner 120 may then expand the tokenized sentence based on the subject-predicate-object data sets included in the target domain knowledge graph 125. Specifically, the fine-tuner 120 may find, in the tokenized sentence, a token that matches the subject of a data set and append the predicate and object tokens associated with that subject to the sentence. The expansion may be performed within a predetermined maximum depth and a predetermined maximum number of paths. The expansion, the depth, and the number of paths are described below with reference to FIG. 2.

By expanding the tokenized target domain sentence with the target domain knowledge graph 125, the fine-tuner 120 can fine-tune the pre-trained model on words that are highly related to one another in the target domain, which can increase the accuracy, recall, and F1-score of the fine-tuned named entity recognition model.

To distinguish the tokens included in the expanded sentence, the fine-tuner 120 may map a segment index and a position index to each token.

In one embodiment, the segment index may carry information about the depth of a token. The segment index may be expressed as a non-negative integer ranging from 0 to the maximum depth. For example, when the maximum depth is 3, the fine-tuner 120 may map 0 to the tokens of the tokenized target domain sentence, 1 to added tokens of depth 1, 2 to tokens of depth 2, and 3 to tokens of depth 3.

The position index may carry the position information of each token. In the expanded sentence, non-negative integer values may be mapped as position indices in order along each path, from the first token of the sentence to the last token of that path. The segment index and position index according to this embodiment make it possible to distinguish different tokens from one another.

In another embodiment, the segment index may indicate whether a token results from sentence expansion. The segment index may be 0 or 1: a segment index of 0 may be mapped to tokens of the tokenized target domain sentence, and a segment index of 1 to tokens added as the tokenized sentence is expanded. Using such a segment index together with the position index, different tokens can be distinguished without information about token depth.
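The two segment index schemes described above differ only in how a token's depth is encoded, which the following sketch makes explicit. The function name and the `binary` flag are assumptions; the input is a list holding the depth of each token in the expanded sentence.

```python
def segment_indices(depths, binary=False):
    """Map token depths to segment indices.

    binary=False: the segment index equals the depth (0 .. max depth).
    binary=True:  0 for original sentence tokens, 1 for any added token.
    """
    return [min(depth, 1) if binary else depth for depth in depths]
```

For depths `[0, 0, 1, 1, 2, 3]`, the depth-based scheme leaves the list unchanged, while the binary scheme yields `[0, 0, 1, 1, 1, 1]`.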

The mapping of segment indices and position indices is described below with reference to FIG. 2.

In the embedding layer, the fine-tuner 120 may generate the knowledge embedding for fine-tuning based on the expanded sentence, the segment indices, and the position indices. The knowledge embedding may include a token embedding, a segment embedding, and a position embedding. The fine-tuner 120 may generate the token embedding from the tokens of the expanded sentence, the segment embedding from the segment indices, and the position embedding from the position indices, and may thereby generate the knowledge embedding.
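One way the three component embeddings could be combined is by element-wise summation, as is done for the input embeddings of BERT. The description states only that the knowledge embedding includes the three components, so the summation, the randomly initialized lookup tables, and all names in the following sketch are assumptions.

```python
import random

def make_table(rows, dim, seed):
    """Create a small randomly initialized embedding lookup table."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]

def knowledge_embedding(token_ids, segment_ids, position_ids,
                        token_table, segment_table, position_table):
    """Sum token, segment and position embeddings for every token."""
    return [
        [t + s + p for t, s, p in zip(token_table[ti],
                                      segment_table[si],
                                      position_table[pi])]
        for ti, si, pi in zip(token_ids, segment_ids, position_ids)
    ]
```

In a trained model the tables would be learned parameters, with the token table initialized from the pre-trained model.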

Because a segment index and a position index are mapped to each token, the meaning of a word can be made clear from its position and its relationship to the words connected before and after it, even when the same word occurs more than once. Constructing the knowledge embedding from the segment indices and position indices therefore yields an embedding that reflects the context of the sentence.

The fine-tuner 120 may include one or more transformers. The generated knowledge embedding may be input to the transformer to fine-tune the pre-trained model for named entity recognition. By fine-tuning with the knowledge embedding generated from the expanded sentence, the segment indices, and the position indices, the fine-tuner 120 may generate a named entity recognition model with high accuracy, recall, and F1-score.

FIG. 2 is a diagram illustrating the knowledge embedding generation process of the apparatus and method for generating a named entity recognition model according to an embodiment.

Referring to FIG. 2, the figure shows a target domain sentence 205 for fine-tuning; a knowledge layer 210 that tokenizes the target domain sentence 205, expands the tokenized target domain sentence 265, and maps segment indices and position indices; an expanded target domain sentence 225; a target domain knowledge graph 230 input to the knowledge layer 210; and an embedding layer 215 that generates the knowledge embedding.

In FIG. 2, the target domain sentence 205 "1. 분산 파일 처리 기반 미디어 시스템" ("1. Distributed file processing-based media system") is input to the knowledge layer 210 of the fine-tuner, and the fine-tuner may tokenize the target domain sentence 205 in the knowledge layer 210 based on the vocabulary.

In the knowledge layer 210, the fine-tuner may expand the tokenized target domain sentence 265 based on the subject-predicate-object data sets included in the target domain knowledge graph 230. The fine-tuner may perform a sentence expansion process in which it finds, in the tokenized target domain sentence 265, a token matching the subject of a data set and appends the predicate and object tokens associated with that subject to the sentence. In FIG. 2, the target domain may be Section G of the CPC classification, and the target domain knowledge graph 230 may include subject-predicate-object data sets generated for the target domain data. The fine-tuner may find "분산" (distributed) and "미디어" (media) in the tokenized target domain sentence 265 as tokens matching subjects of the data sets. The fine-tuner may then append the tokens of the predicate "처리" (process) and the object "클라우드" (cloud), associated with the subject "분산" in the target domain knowledge graph 230, and the tokens of the predicate "처리" (process) and the object "모바일" (mobile), associated with the subject "미디어", thereby expanding the tokenized target domain sentence 265 (240, 260).

In the knowledge layer 210, the fine-tuner may repeat the process of expanding the sentence based on the subject-predicate-object data sets included in the target domain knowledge graph 230. In FIG. 2, the fine-tuner may find, among the added tokens 240 and 260, a token matching the subject of a data set and append the associated predicate and object tokens to the sentence. Here the fine-tuner may find "클라우드" (cloud) among the added tokens 240 and 260 as a token matching a subject. The object "모바일" (mobile) in the added tokens 260 is not a subject in the target domain knowledge graph 230 and therefore may not be expanded further. The fine-tuner may append to the expanded target domain sentence 225 the tokens of the predicate "수행" (perform) and the object "서비스" (service) associated with the subject "클라우드" in the target domain knowledge graph 230. The fine-tuner may further append the tokens of another predicate, "구성" (configure), and its object "서버" (server), which are also associated with the subject "클라우드".

The fine-tuner may continue the sentence expansion process on the added tokens 245 and 255. The fine-tuner may find "서비스" (service) among the added tokens 245 and 255 as a token matching the subject of a data set. The object "서버" (server) in the added tokens 255 is not a subject in the target domain knowledge graph 230 and therefore may not be expanded further. The fine-tuner may append to the expanded target domain sentence 225 the tokens of the predicate "포함" (include) and the object "모바일" (mobile) associated with the subject "서비스" in the target domain knowledge graph 230.

By repeating the sentence expansion process, the fine-tuner may generate the expanded target domain sentence 225. The expanded target domain sentence 225 may be input to a transformer of the fine-tuner (e.g., the transformer 425 of FIG. 4) for fine-tuning. The maximum length of an expanded target domain sentence 225 that can be input to the transformer may be predetermined, where the length of a sentence means the number of tokens it contains. For example, the maximum length may be 512 tokens. However, the maximum length is not limited to this value and may be set in various ways.

Because the maximum length of an expanded target domain sentence 225 usable for fine-tuning in the transformer is fixed, a maximum depth and a maximum number of paths may be set so that the expanded target domain sentence 225 does not exceed that length. Here, the depth may mean the number of times the sentence expansion process has been repeated. The number of paths may indicate the number of data set predicates that can be associated with a single token matching a data set subject in the tokenized sentence. Examples of the depth and the number of paths are described below with reference to FIG. 2.

The fine-tuner may expand the target domain sentence 205 as long as the depth and the number of paths of the tokens in the expanded target domain sentence 225 do not exceed the maximum depth and the maximum number of paths, respectively. For example, if no maximum depth and maximum number of paths were set, the expanded target domain sentence 225 could grow through sentence expansion to contain more than 512 tokens; to prevent this, the maximum depth and the maximum number of paths may be set to 8 and 2, respectively. However, these values are not limiting, and various maximum depths and maximum numbers of paths may be used.
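The bounded expansion can be sketched in Python using the knowledge graph of FIG. 2 as reconstructed from the description. The English token glosses, the tokenization of the example sentence, the node representation `(token, depth, parent index)`, and the choice to check only appended object tokens for further expansion are assumptions made for illustration.

```python
from collections import defaultdict

# Triples of FIG. 2 (subject, predicate, object), glossed in English.
TRIPLES = [
    ("distributed", "process", "cloud"),
    ("media", "process", "mobile"),
    ("cloud", "perform", "service"),
    ("cloud", "configure", "server"),
    ("service", "include", "mobile"),
]
# Hypothetical tokenization of the example sentence.
TOKENS = ["[CLS]", "distributed", "file", "processing", "based", "media", "system"]

def expand_sentence(tokens, triples, max_depth=3, max_paths=2):
    """Return nodes (token, depth, parent_index) of the expanded sentence."""
    by_subject = defaultdict(list)
    for subj, pred, obj in triples:
        by_subject[subj].append((pred, obj))
    # The original sentence is a chain of depth-0 nodes.
    nodes = [(tok, 0, i - 1 if i else None) for i, tok in enumerate(tokens)]
    frontier = list(range(len(nodes)))
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for idx in frontier:
            # At most max_paths predicates per matching subject token.
            for pred, obj in by_subject.get(nodes[idx][0], [])[:max_paths]:
                nodes.append((pred, depth, idx))            # predicate hangs off the subject
                nodes.append((obj, depth, len(nodes) - 1))  # object hangs off the predicate
                next_frontier.append(len(nodes) - 1)        # objects may be expanded next
        frontier = next_frontier
    return nodes
```

With `max_depth=3` and `max_paths=2`, the sketch appends ten tokens to the seven original ones; with `max_paths=1`, only one of the two "cloud" branches is kept.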

The target domain sentence 205 must be expanded without exceeding the maximum length. If the maximum depth is set large, a large maximum number of paths could push the length of the expanded target domain sentence 225 beyond the maximum length, so the maximum number of paths may be set relatively small. Conversely, when the maximum number of paths is set large, the maximum depth may be set relatively small.
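The trade-off between depth and number of paths can be made concrete with a loose worst-case bound. The sketch assumes that every original token and every appended object token matches a subject with the full number of paths, and that each path appends two tokens (a predicate and an object); the function name and these assumptions are illustrative only, and actual expanded sentences would typically be much shorter.

```python
def worst_case_length(base_tokens, max_depth, max_paths):
    """Upper bound on the token count of an expanded sentence.

    At depth d, up to base_tokens * max_paths**d paths are added,
    each contributing one predicate token and one object token.
    """
    added = sum(2 * base_tokens * max_paths ** d for d in range(1, max_depth + 1))
    return base_tokens + added
```

For a sentence of 7 tokens with a maximum depth of 3 and a maximum of 2 paths, the bound is 7 + 14 × (2 + 4 + 8) = 203 tokens. Because the bound grows geometrically with depth, increasing one limit generally calls for decreasing the other.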

The maximum depth and the maximum number of paths may be changed freely, and the accuracy, recall, and F1-score of the named entity recognition model produced by fine-tuning may depend on how they are set. In one embodiment, the maximum depth and the maximum number of paths may be set arbitrarily. In another embodiment, they may be determined experimentally, by finding the maximum depth and maximum number of paths for which fine-tuning yields the highest accuracy, recall, and F1-score.

The expanded target domain sentence 225 of FIG. 2 may have been expanded with the maximum number of paths set to 2 and the maximum depth set to 3. The tokens of the tokenized target domain sentence 265 have not undergone sentence expansion and are tokens of depth 0. The tokens 240 and 260 added to the tokenized target domain sentence 265 have undergone one round of sentence expansion and are tokens of depth 1. The tokens 245 and 255 added from the tokens 240 have undergone two rounds and are tokens of depth 2. The tokens 250 added from the tokens 245 have undergone three rounds and are tokens of depth 3. The (n+1)-th round of sentence expansion (where n is a constant) may be performed on the tokens of depth n.

In FIG. 2, the "클라우드" (cloud) token among the added tokens 240 matches the subject of a data set and is associated with both the predicate "수행" (perform) and the predicate "구성" (configure), so it has two paths. If the maximum number of paths were set to 1, only one of the "수행" token and the "구성" token could be appended for the "클라우드" token during sentence expansion.

The fine-tuner may map a segment index and a position index to each token of the expanded target domain sentence 225. In FIG. 2, the segment index of each token is shown above the token, as with the segment index 220, and the position index of each token is shown below the token, as with the position index 235.

In one embodiment, the segment index may carry information about the depth of a token. The segment index may be expressed as a non-negative integer ranging from 0 to the maximum depth. In the embodiment of FIG. 2, the fine-tuner may map 0 to the tokens of the tokenized target domain sentence 265, 1 to the added tokens 240 and 260 of depth 1, 2 to the tokens 245 and 255 of depth 2, and 3 to the tokens 250 of depth 3.

The position index may carry the position information of each token included in the expanded target domain sentence 225. In the expanded target domain sentence 225, non-negative integer values may be mapped as position indices in order along each path, from the first token of the sentence to the last token of that path. The tokenized target domain sentence 265 alone forms one path, and the tokens 240, 245, 250, 255, and 260 added to it may form three further paths. In FIG. 2, these are a first path formed from the tokenized target domain sentence 265 through the added tokens 240 and 255, a second path formed through the added tokens 240, 245, and 250, and a third path formed through the added tokens 240 and 260. The fine-tuner may map the position indices by assigning non-negative integer values in order to the tokens of each path, starting from the first token ([CLS]) of the tokenized target domain sentence 265. Mapping the segment index and position index of this embodiment to the tokens makes it possible to distinguish tokens at different depths and on different paths.
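Numbering tokens in order along each path amounts to giving every token the position of its predecessor on the path plus one. The sketch below assumes that each token records the index of its predecessor (its parent), that predecessors always precede their successors in the list, and that the first token has no parent; the function name and this representation are hypothetical.

```python
def position_indices(parents):
    """Compute position indices from parent links.

    parents[i] is the index of the token preceding token i on its path,
    or None for the first token of the sentence.
    """
    positions = []
    for parent in parents:
        positions.append(0 if parent is None else positions[parent] + 1)
    return positions
```

Tokens on different branches that leave the sentence at the same point therefore receive the same position value and are told apart by their path, which is why the position index is used together with the segment index.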

In another embodiment, unlike FIG. 2, the segment index may indicate whether a token results from sentence expansion. The segment index may be 0 or 1: a segment index of 0 may be mapped to tokens of the tokenized target domain sentence 265, and a segment index of 1 to tokens added as the tokenized target domain sentence 265 is expanded. When segment indices are mapped in this way, all tokens of the tokenized target domain sentence 265 may receive a segment index of 0, and the tokens 240, 245, 250, 255, and 260 added by expansion may receive a segment index of 1 in place of 2 and 3.

Using a segment index that indicates whether a token results from sentence expansion, together with the position index, different tokens can be distinguished without information about token depth.

The expanded target domain sentence 225, the segment indices, and the position indices produced in the knowledge layer 210 may be passed to the embedding layer 215. Based on them, the fine-tuner may generate a knowledge embedding that includes a token embedding, a segment embedding, and a position embedding. In the embedding layer 215, the fine-tuner may generate the token embedding from the tokens of the expanded target domain sentence 225, the segment embedding from the segment indices, and the position embedding from the position indices.

미세 조정기는 하나 이상의 트랜스포머를 포함할 수 있다. 생성된 지식 임베딩은 트랜스포머로 입력되어 사전 학습 모델의 개체명 인식을 위한 미세 조정이 수행될 수 있다. 미세 조정기는 확장된 타겟 도메인 문장(225), 세그먼트 인덱스 및 포지션 인덱스에 기초하여 생성된 지식 임베딩을 이용하여 미세 조정을 수행함으로써 정확도, 재현율 및 F1-스코어가 높은 개체명 인식 모델을 생성할 수 있다.The fine-tuner may include one or more transformers. The generated knowledge embedding may be input to a transformer to perform fine-tuning for object name recognition of the pre-learning model. The fine-tuner can generate an entity name recognition model with high accuracy, recall, and F1-score by performing fine-tuning using the knowledge embedding generated based on the extended target domain sentence 225, the segment index, and the position index. .

FIG. 3 is a flowchart illustrating operations performed by the fine-tuner according to an embodiment.

In step 305, the fine-tuner may receive input domain data, receive a pre-trained model from the pre-trainer, receive a word set from the preprocessor, and receive a target domain knowledge graph generated based on target domain data. The fine-tuner may then extract target domain sentences for fine-tuning by performing preprocessing that extracts text from the target domain data included in the input domain data and deletes stopwords. The stopwords may include semantic stopwords that are defined differently depending on the input domain.

The received target domain knowledge graph may include data sets of a subject, a predicate, and an object extracted from the target domain data.

In step 310, the fine-tuner may tokenize the target domain sentence based on the word set.

In step 315, a maximum depth and a maximum number of paths to be applied to the expansion of the tokenized sentence may be determined. The maximum depth and the maximum number of paths may be determined so that the target domain sentence expanded in step 320 does not exceed the maximum sentence length that can be used for fine-tuning.
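As a rough illustration of how the two limits might be checked against a length budget, the following sketch computes a worst-case length for an expanded sentence. It assumes, purely for illustration, that the maximum number of paths applies to each subject token, that each path adds one predicate token and one object token, and that every object token can itself be expanded at the next depth. The function names and the example budget of 128 tokens are hypothetical and are not taken from the embodiments.

```python
def worst_case_length(num_tokens: int, max_depth: int, max_paths: int) -> int:
    """Upper bound on the expanded length under the assumptions stated above."""
    added_per_token = 0
    paths_at_depth = 1
    for _ in range(max_depth):
        paths_at_depth *= max_paths            # paths that can open at this depth
        added_per_token += 2 * paths_at_depth  # predicate + object token per path
    return num_tokens * (1 + added_per_token)


def fits_budget(num_tokens: int, max_depth: int, max_paths: int, max_length: int) -> bool:
    """True if even the worst case stays within the usable sentence length."""
    return worst_case_length(num_tokens, max_depth, max_paths) <= max_length


# A 10-token sentence checked against an assumed budget of 128 tokens.
print(fits_budget(10, max_depth=1, max_paths=2, max_length=128))
print(fits_budget(10, max_depth=2, max_paths=2, max_length=128))
```

Under this bound, a candidate pair of limits would be rejected as soon as the worst case exceeds the budget; an actual implementation could equally measure the expanded length directly.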

In step 320, the fine-tuner may expand the tokenized sentence based on the data sets of a subject, a predicate, and an object included in the target domain knowledge graph. The fine-tuner may expand the tokenized sentence by finding, in the tokenized sentence, a token corresponding to the subject of a data set and adding to the sentence the predicate and object tokens that correspond to that subject in the data set. The fine-tuner may expand the tokenized sentence within the maximum depth and the maximum number of paths determined in step 315.
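The expansion of step 320 can be sketched as follows. This is a minimal sketch under assumptions that are not stated in the embodiments: the knowledge graph is held as a dictionary from a subject token to its (predicate, object) pairs, the maximum number of paths is applied per subject token, and added tokens are appended after the original sentence. The name `expand_sentence` and the example data are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical graph layout: subject token -> list of (predicate, object) pairs.
KnowledgeGraph = Dict[str, List[Tuple[str, str]]]


def expand_sentence(tokens: List[str], graph: KnowledgeGraph,
                    max_depth: int, max_paths: int) -> List[Tuple[str, int]]:
    """Return (token, depth) pairs; tokens of the original sentence have depth 0."""
    expanded = [(tok, 0) for tok in tokens]
    frontier = list(tokens)                 # tokens to look up as subjects
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for subject in frontier:
            # Keep at most max_paths (predicate, object) pairs per subject.
            for predicate, obj in graph.get(subject, [])[:max_paths]:
                expanded.append((predicate, depth))
                expanded.append((obj, depth))
                next_frontier.append(obj)   # the object may be a subject one level deeper
        frontier = next_frontier
    return expanded


graph = {
    "battery": [("stores", "energy"), ("has", "anode")],
    "energy": [("measured_in", "joule")],
}
sentence = ["battery", "includes", "electrode"]
expanded = expand_sentence(sentence, graph, max_depth=2, max_paths=1)
print(expanded)
```

With a maximum depth of 2 and one path per subject, the sketch adds "stores" and "energy" at depth 1 and then "measured_in" and "joule" at depth 2, while the second pair for "battery" is dropped by the path limit.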

In step 325, the fine-tuner may map a segment index and a position index to each token included in the expanded target domain sentence.

In an embodiment, the segment index may include information on the depth of each token included in the expanded target domain sentence. As described with reference to FIG. 2, the fine-tuner may map, to each token included in the expanded target domain sentence, a segment index reflecting the depth information of that token.

In another embodiment, the segment index may include information on whether each token was added in step 320. For example, a segment index of 1 may be mapped to tokens added during the sentence expansion in step 320, and a segment index of 0 may be mapped to tokens that were already included in the tokenized target domain sentence.

The position index may include position information of each token. The fine-tuner may map the position index by sequentially mapping non-negative integer values, starting from 0, to the tokens from the first token of the tokenized target domain sentence to the last token of each path.
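The index mapping of step 325 can be illustrated with a short sketch. It uses the second segment index variant (0 for original tokens, 1 for added tokens) and reads the position rule as numbering that continues along each path from the token the path is attached to. That reading, the `parents` list used to describe paths, and the function name `map_indices` are assumptions for illustration only.

```python
from typing import List, Optional, Tuple


def map_indices(parents: List[Optional[int]]) -> Tuple[List[int], List[int]]:
    """Map a segment index and a position index to each expanded token.

    parents[i] is None for a token of the original tokenized sentence;
    otherwise it is the index of the token that precedes token i on its path
    (assumed to appear earlier in the list).
    """
    segment: List[int] = []
    position: List[int] = []
    next_position = 0
    for parent in parents:
        if parent is None:
            segment.append(0)                  # token of the original sentence
            position.append(next_position)
            next_position += 1
        else:
            segment.append(1)                  # token added by expansion
            position.append(position[parent] + 1)
    return segment, position


tokens = ["battery", "includes", "electrode", "stores", "energy"]
parents = [None, None, None, 0, 3]   # "stores" follows "battery", "energy" follows "stores"
segment_index, position_index = map_indices(parents)
print(list(zip(tokens, segment_index, position_index)))
```

In this example the added token "stores" shares position 1 with "includes" but carries a different segment index, so the two tokens remain distinguishable by the pair of indices.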

In step 330, the fine-tuner may generate a knowledge embedding based on the expanded target domain sentence, the segment index, and the position index. The knowledge embedding may include a token embedding generated from the tokens of the expanded target domain sentence, a segment embedding generated from the segment index, and a position embedding generated from the position index.
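One way the three embeddings could be combined is shown below. The embodiments state only that the knowledge embedding includes the token, segment, and position embeddings; the element-wise sum used here follows the usual BERT input convention and is an assumption, as are the table sizes, the random values standing in for learned weights, and the function names.

```python
import random
from typing import List


def make_table(rows: int, dim: int, seed: int) -> List[List[float]]:
    """Random lookup table standing in for learned embedding weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def knowledge_embedding(token_ids, segment_index, position_index,
                        token_table, segment_table, position_table):
    """Sum the three embeddings per token (BERT-style; an assumption here)."""
    return [
        [t + s + p for t, s, p in zip(token_table[tid], segment_table[sid], position_table[pid])]
        for tid, sid, pid in zip(token_ids, segment_index, position_index)
    ]


dim = 4
token_table = make_table(rows=10, dim=dim, seed=0)
segment_table = make_table(rows=2, dim=dim, seed=1)
position_table = make_table(rows=8, dim=dim, seed=2)

embedding = knowledge_embedding(
    token_ids=[3, 5, 7, 1, 2],          # hypothetical ids from the word set
    segment_index=[0, 0, 0, 1, 1],
    position_index=[0, 1, 2, 1, 2],
    token_table=token_table,
    segment_table=segment_table,
    position_table=position_table,
)
print(len(embedding), len(embedding[0]))
```

The result has one vector per token of the expanded sentence, which is the form a transformer encoder would take as input.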

In step 335, the fine-tuner may fine-tune the parameters of the pre-trained model based on the generated knowledge embedding. By performing fine-tuning using the knowledge embedding, the fine-tuner may generate an entity name recognition model with high accuracy, recall, and F1-score.

FIG. 4 is a diagram illustrating the structure of the fine-tuner according to an embodiment.

Referring to FIG. 4, the fine-tuner is shown with a knowledge layer that expands the target domain sentence and maps indices based on the target domain knowledge graph, an embedding layer that generates a knowledge embedding based on the expanded sentence and the indices, and a transformer that fine-tunes the parameters of the pre-trained model using the knowledge embedding.

The fine-tuner may receive a pre-trained model from the pre-trainer, receive a word set from the preprocessor, receive a target domain knowledge graph generated based on target domain data, and receive target domain sentences for fine-tuning.

In the knowledge layer, the fine-tuner may tokenize the target domain sentence based on the received word set and expand the tokenized sentence based on the target domain knowledge graph. In the knowledge layer, the fine-tuner may map a segment index and a position index to each token of the expanded sentence. In an embodiment, the segment index may include information on the depth of the token. In another embodiment, the segment index may include information on whether each token is an added token. The position index may include position information of each token.

Steps 305 to 325 of FIG. 3 may be performed in the knowledge layer of the fine-tuner, and redundant descriptions are omitted.

In the embedding layer, the fine-tuner may generate a knowledge embedding based on the expanded target domain sentence, the segment index, and the position index. The knowledge embedding may include a token embedding generated from the tokens of the expanded target domain sentence, a segment embedding generated from the segment index, and a position embedding generated from the position index.

Step 330 of FIG. 3 may be performed in the embedding layer of the fine-tuner, and redundant descriptions are omitted.

The fine-tuner may include a transformer. The transformer consists of a plurality of encoders and decoders and may use the knowledge embedding to fine-tune the parameters of the pre-trained model so that the pre-trained model can perform entity name recognition on the target domain data with high recall and accuracy. Step 335 of FIG. 3 may be performed in the transformer of the fine-tuner. An entity name recognition result for each token of the expanded sentence may be output through the transformer.

FIG. 5 is a flowchart illustrating a method for generating an entity name recognition model according to an embodiment.

According to the apparatus for training an entity name recognition model according to an embodiment, a pre-trained model of the input domain may be generated by training a language model through steps 505 to 510 using input domain data, which is data of an input domain containing a large amount of data. In an embodiment, the language model may be a BERT (Bidirectional Encoder Representations from Transformers) model.

In step 505, the apparatus for generating an entity name recognition model may receive input domain data and preprocess the received input domain data in order to perform pre-training based on the input domain data.

The apparatus for generating an entity name recognition model may extract text data from the input domain data. When creating pre-training data, the apparatus may extract the text data sentence by sentence so that the context of the text data is taken into account. When the input domain is the entirety of Korean patent documents, the claims of a patent document use various symbols such as semicolons and commas, so sentences may be difficult to separate with a general sentence segmentation method. To address this, the apparatus may extract the text data by segmenting sentences in a manner suited to the characteristics of the input domain.

The apparatus for generating an entity name recognition model may delete stopwords from the extracted text data. For example, when the input domain is the entirety of Korean patent documents, the terms "device" ("장치") and "method" ("방법") are used in most patents, so unnecessarily large weights may be assigned to these terms and the performance of the trained model may be affected. To prevent this, terms that carry no particular meaning in the input domain may be deleted in advance. By deleting stopwords from the text data, the apparatus may increase the accuracy, recall, and F1-score of the trained model.
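The stopword deletion described above can be sketched as a simple filter. The split into a general stopword list and a domain-specific list, the function name, and the English example tokens are assumptions for illustration; only the idea of removing terms such as "device" and "method" for a patent domain comes from the description.

```python
from typing import Iterable, List


def remove_stopwords(tokens: List[str],
                     general_stopwords: Iterable[str],
                     domain_stopwords: Iterable[str]) -> List[str]:
    """Drop tokens that appear in either stopword list, keeping token order."""
    stop = set(general_stopwords) | set(domain_stopwords)
    return [tok for tok in tokens if tok not in stop]


# Hypothetical lists: "device" and "method" carry little meaning in patent text.
general = {"the", "a", "of"}
patent_domain = {"device", "method"}

tokens = ["the", "display", "device", "emits", "light"]
print(remove_stopwords(tokens, general, patent_domain))
```

Keeping the domain list separate makes it possible to swap in a different list when the input domain changes, for example from patent documents to legal documents.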

The apparatus for generating an entity name recognition model may tokenize the text data from which stopwords have been deleted to generate a word set including the tokens. Tokenization may be performed in consideration of conditional probability, word occurrence frequency, and the like. When the language of the input domain data is Korean, tokenization may be performed in a manner suited to the structure of Korean. For example, the tokenization scheme, such as whether a Korean root and its ending are generated as a single token or as separate tokens, may be determined during tokenization. As described with reference to FIG. 2, the word set may be used not only in the pre-training stage but also for expanding the target domain sentence in the fine-tuning stage. Because the tokens included in the tokenized target domain sentence and the target domain knowledge graph must be able to correspond to each other, the target domain knowledge graph may be generated to match the tokenization scheme.
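As a simplified illustration of building a word set, the sketch below keeps tokens by occurrence frequency only. Subword splitting based on conditional probability, and any handling specific to Korean roots and endings, are not shown. The special tokens follow the common BERT convention and, like the function name and the `min_count` threshold, are assumptions rather than details of the embodiments.

```python
from collections import Counter
from typing import Dict, List, Sequence


def build_word_set(tokenized_sentences: List[List[str]],
                   min_count: int = 1,
                   specials: Sequence[str] = ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"),
                   ) -> Dict[str, int]:
    """Map each sufficiently frequent token to an integer id."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Most frequent first; ties broken alphabetically so ids are reproducible.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [tok for tok, count in ordered if count >= min_count]
    return {tok: idx for idx, tok in enumerate(list(specials) + words)}


sentences = [["a", "b", "a"], ["b", "c", "a"]]
word_set = build_word_set(sentences, min_count=2)
print(word_set)
```

Tokens that fall below the threshold, such as "c" in this example, would be represented by the unknown token when a sentence is later converted to ids.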

The apparatus for generating an entity name recognition model may generate pre-training data for the input domain using the text data and the word set.

In step 510, the apparatus for generating an entity name recognition model may generate a pre-trained model of the input domain based on the pre-training data.

To generate the pre-trained model, the apparatus may generate sentence embeddings based on the pre-training data. Based on the sentence embeddings, the apparatus may pre-train the language model using a masked language model (MLM) or next sentence prediction (NSP) scheme and generate a pre-trained model of the input domain that has been trained on the input domain data. In an embodiment, the language model may be a BERT language model, and when the input domain is the entirety of Korean patent documents, the pre-trained model may be a BERT language model trained on the entirety of Korean patent documents. However, the embodiments are not limited thereto, and pre-trained models of various domains may be generated. For example, when the input domain is the entirety of Korean legal documents, the pre-trained model may be a language model trained on the entirety of Korean legal documents.
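The masked language model scheme mentioned above can be illustrated by the way training inputs are prepared: some tokens are hidden and the model is trained to recover them. The following is a minimal sketch of that preparation step only. The default masking rate of 15% is the value commonly used for BERT and is not stated in the embodiments; the function name, the fixed seed, and the example sentence are likewise assumptions.

```python
import random
from typing import Dict, List, Tuple


def mask_tokens(tokens: List[str], mask_rate: float = 0.15,
                mask_token: str = "[MASK]", seed: int = 0
                ) -> Tuple[List[str], Dict[int, str]]:
    """Replace a random subset of tokens with the mask token.

    Returns the masked sentence and a dict from masked position to the
    original token, which serves as the prediction target.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    num_to_mask = max(1, int(len(tokens) * mask_rate))   # mask at least one token
    positions = rng.sample(range(len(tokens)), num_to_mask)
    masked = list(tokens)
    labels: Dict[int, str] = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, labels


sentence = ["the", "battery", "stores", "energy", "in", "the", "cell", "stack"]
masked, labels = mask_tokens(sentence, mask_rate=0.5)
print(masked, labels)
```

Next sentence prediction would instead pair two sentences and label whether the second actually follows the first; that scheme is not shown here.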

Entity names can be recognized using the pre-trained model trained on the entire input domain, but the entity names in use may differ between the sub-domains included in the input domain, which may lower the accuracy, recall, and F1-score for the target domain. To increase the accuracy, recall, and F1-score, the model needs to be trained on data related to the target domain. The apparatus for generating an entity name recognition model may fine-tune the pre-trained model through steps 515 and 520.

In step 515, the apparatus for generating an entity name recognition model may generate a target domain knowledge graph in order to perform fine-tuning using data related to the target domain. The target domain knowledge graph may be generated to correspond to the tokens included in the word set. The target domain knowledge graph has been described in detail with reference to FIG. 2, and redundant descriptions are omitted.

In step 520, the apparatus for generating an entity name recognition model may fine-tune the pre-trained model using the target domain knowledge graph. Step 520 may correspond to steps 305 to 335 of FIG. 3. Fine-tuning has been described in detail with reference to FIGS. 2 to 4, and redundant descriptions are omitted.

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware components and software components. For example, the apparatuses, methods, and components described in the embodiments may be implemented using a general-purpose computer or a special-purpose computer, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and software applications executed on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored in a computer-readable recording medium.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination, and the program instructions recorded on the medium may be specially designed and configured for the embodiment or may be known and available to those skilled in computer software. Examples of the computer-readable recording medium include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on them. For example, an appropriate result may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components such as systems, structures, devices, and circuits are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the following claims.

Claims

A method for training an entity name recognition model, the method comprising:
performing preprocessing on input domain data for pre-training of a language model;
generating a pre-trained model of an input domain based on the preprocessed data;
generating a target domain knowledge graph including data sets of a subject, a predicate, and an object based on target domain data included in the input domain data;
generating a knowledge embedding based on the pre-trained model and the target domain knowledge graph; and
fine-tuning the pre-trained model to identify an entity name based on the knowledge embedding.

The method of claim 1, wherein performing the preprocessing comprises:
extracting text data from the input domain data;
deleting stopwords from the extracted text data;
generating a word set (vocabulary) by tokenizing the text data from which the stopwords have been deleted; and
generating pre-training data based on the word set,
and wherein generating the pre-trained model comprises
generating the pre-trained model of the input domain based on the pre-training data.

The method of claim 2, wherein the stopwords include
semantic stopwords that are defined differently depending on the input domain.

The method of claim 1, wherein the knowledge graph is
generated based on a pre-built entity name dictionary including entity names of the target domain.

The method of claim 2, wherein generating the knowledge embedding comprises:
extracting target domain text data from the target domain data;
extracting a target domain sentence by deleting stopwords from the extracted target domain text data;
tokenizing the target domain sentence using the word set;
expanding the tokenized target domain sentence according to a predetermined maximum number of paths and a predetermined maximum depth based on the data sets of the target domain knowledge graph;
mapping, to each token included in the expanded target domain sentence, a segment index including information on the depth of the token and a position index including position information of the token; and
generating the knowledge embedding based on the tokens included in the expanded target domain sentence and on the segment index and the position index corresponding to the tokens.

The method of claim 5, wherein expanding the tokenized target domain sentence comprises
expanding the tokenized target domain sentence within a range in which the depth and the number of paths of the tokens included in the expanded target domain sentence do not exceed the maximum depth and the maximum number of paths, respectively.

The method of claim 5, wherein expanding the tokenized target domain sentence comprises
expanding the tokenized target domain sentence by adding a predicate token and an object token of the data set to a token that corresponds to the subject of the data set among the tokens included in the tokenized target domain sentence.

The method of claim 5, wherein the maximum depth and the maximum number of paths are
determined based on a maximum length of the expanded target domain sentence that is predetermined to be usable for the fine-tuning.

The method of claim 5, wherein the mapping comprises:
mapping, to the segment index of each token included in the expanded target domain sentence, a value corresponding to the depth information of that token from among the non-negative integers from 0 to the maximum depth; and
sequentially mapping non-negative integer values, starting from 0, to the position index of each token from the first token of the expanded target domain sentence to the last token of each path,
and wherein the tokens included in the expanded target domain sentence are
distinguished from one another by the segment index and the position index.

The method of claim 8, wherein
the accuracy, recall, and F1-score of the model fine-tuned to identify the entity name are determined according to the maximum depth and the maximum number of paths.

The method of claim 5, wherein the knowledge embedding includes
a segment embedding generated based on the segment index, a position embedding generated based on the position index, and a token embedding generated based on the tokens included in the expanded target domain sentence.

The method of claim 1, wherein the target domain knowledge graph is
generated by setting a window of a predetermined size before and after a predicate extracted from the target domain data and extracting, from among the words included in the window, a word corresponding to a subject or an object.

The method of claim 4, wherein the target domain knowledge graph is
generated by setting a window of a predetermined size before and after a predicate extracted from the target domain data, extracting, from among the words included in the window, candidate words that can correspond to a subject or an object, determining, from among the candidate words, words corresponding to an entity name of the target domain, and determining the words corresponding to the entity name as the subject or the object,
and wherein the words corresponding to the entity name are determined based on the pre-built entity name dictionary including entity names of the target domain.

The method of claim 13, wherein the target domain knowledge graph is
generated by further extracting, based on the relationship between the subject and the predicate, candidate words that can become the object from a range outside the window, and determining, based on the entity name dictionary, words corresponding to an entity name of the target domain from among the further extracted candidate words as the object.

An apparatus for training an entity name recognition model, the apparatus comprising:
a preprocessor that performs preprocessing on input domain data for pre-training of a language model;
a pre-trainer that generates a pre-trained model of an input domain based on the preprocessed data; and
a fine-tuner that generates a knowledge embedding based on the pre-trained model and a target domain knowledge graph, and fine-tunes the pre-trained model to identify an entity name based on the knowledge embedding,
wherein the target domain knowledge graph
includes data sets of a subject, a predicate, and an object generated based on target domain data included in the input domain data.

The apparatus of claim 15, wherein the preprocessor
extracts text data from the input domain data, deletes stopwords from the extracted text data, generates a word set (vocabulary) by tokenizing the text data from which the stopwords have been deleted, and generates pre-training data based on the word set,
and wherein the pre-trainer
generates the pre-trained model of the input domain based on the pre-training data.

The apparatus of claim 16, wherein the stopwords include
semantic stopwords that are defined differently depending on the input domain.

The apparatus of claim 15, wherein the knowledge graph is
generated based on a pre-built entity name dictionary including entity names of the target domain.

19. The device of claim 16,
wherein the fine-tuner
extracts target domain text data from the target domain data, extracts a target domain sentence by deleting stopwords from the extracted target domain text data, tokenizes the target domain sentence using the word set, expands the tokenized target domain sentence according to a predetermined maximum number of paths and a predetermined maximum depth based on the data sets of the target domain knowledge graph, maps to each token included in the expanded target domain sentence a segment index including information on whether the target domain sentence has been expanded and a position index including position information of the token within the expanded target domain sentence, and generates the knowledge embedding based on the tokens included in the expanded target domain sentence and the segment index and the position index corresponding to those tokens,
An entity name recognition model training device.

20. The device of claim 19,
wherein the fine-tuner
expands the tokenized target domain sentence within a range in which the depth and the number of paths of the tokens included in the expanded target domain sentence do not exceed the maximum depth and the maximum number of paths, respectively,
An entity name recognition model training device.

21. The device of claim 19,
wherein the fine-tuner
expands the tokenized target domain sentence by adding a predicate token and an object token of the data set to the token that corresponds to the subject of the data set among the tokens included in the tokenized target domain sentence,
An entity name recognition model training device.
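
The expansion recited above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the knowledge graph is held as a plain dictionary from subject to (predicate, object) pairs, "maximum number of paths" is read as the maximum number of triples attached to one subject token, and the name `expand_sentence` is hypothetical.

```python
def expand_sentence(tokens, kg, max_paths, max_depth):
    """Return (token, depth) pairs for the expanded sentence.

    tokens    : tokenized target domain sentence
    kg        : dict mapping subject -> list of (predicate, object) pairs
    max_paths : assumed to cap the triples attached to one subject token
    max_depth : cap on how many times an object may itself be expanded
    """
    expanded = []

    def visit(token, depth):
        expanded.append((token, depth))
        if depth >= max_depth:
            return  # expanding further would exceed the maximum depth
        for predicate, obj in kg.get(token, [])[:max_paths]:
            # Add the predicate token, then the object token, after the subject.
            expanded.append((predicate, depth + 1))
            visit(obj, depth + 1)

    for token in tokens:
        visit(token, 0)
    return expanded
```

Because the recursion stops at `max_depth`, the expansion terminates even if the knowledge graph contains cycles. A caller that must respect a maximum input length could lower `max_paths` or `max_depth` until `len(expand_sentence(...))` fits.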

22. The device of claim 19,
wherein the maximum depth and the maximum number of paths are
determined based on a predetermined maximum length of the expanded target domain sentence that can be used for the fine-tuning,
An entity name recognition model training device.

23. The device of claim 19,
wherein the fine-tuner,
for each token included in the expanded target domain sentence, maps to the segment index the value, among the non-negative integers from 0 to the maximum depth, that corresponds to the depth information of the token, and maps to the position index non-negative integer values starting from 0, assigned sequentially from the first token to the last token in each path,
and the tokens included in the expanded target domain sentence are
distinguished from each other by the segment index and the position index,
An entity name recognition model training device.
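
One possible reading of the index mapping above is sketched below. It assumes that the segment index equals the depth of a token (0 for tokens of the original sentence) and that the position index continues from the subject token along each path, so a branch attached to the token at position i is numbered i+1, i+2, and so on. Both the reading and the name `index_tokens` are assumptions; the claim text admits other numbering schemes.

```python
def index_tokens(tokens, kg, max_paths, max_depth):
    """Return (token, segment_index, position_index) rows for the
    expanded sentence.

    kg maps subject -> list of (predicate, object) pairs (assumed layout).
    """
    rows = []

    def branch(subject, depth, position):
        if depth >= max_depth:
            return
        for predicate, obj in kg.get(subject, [])[:max_paths]:
            # Tokens added by an expansion are one level deeper than the
            # subject and continue the subject's position count.
            rows.append((predicate, depth + 1, position + 1))
            rows.append((obj, depth + 1, position + 2))
            branch(obj, depth + 1, position + 2)

    for position, token in enumerate(tokens):
        rows.append((token, 0, position))  # original sentence: segment 0
        branch(token, 0, position)
    return rows
```

Under this reading, a sentence token and a branch token can share a position index, and the segment index is what tells them apart.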

24. The device of claim 22,
wherein the accuracy, recall, and F1-score of the model fine-tuned to identify the entity name are determined according to the maximum depth and the maximum number of paths,
An entity name recognition model training device.

25. The device of claim 19,
wherein the knowledge embedding
includes a segment embedding generated based on the segment index, a position embedding generated based on the position index, and a token embedding generated based on the tokens included in the expanded target domain sentence,
An entity name recognition model training device.
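
A minimal sketch of combining the three embeddings is given below. It assumes that the embeddings are looked up from tables and summed element-wise per token, as is common in Transformer-style language models; the claim does not state how they are combined. The table sizes, the random initialization, and the names `make_table` and `knowledge_embedding` are illustrative assumptions.

```python
import random


def make_table(num_rows, dim, rng):
    """Create a small randomly initialized embedding table (illustrative)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def knowledge_embedding(token_ids, segment_ids, position_ids,
                        token_table, segment_table, position_table):
    """Sum token, segment, and position embeddings for each token.

    Summation is an assumption; the claim only names the three parts.
    """
    result = []
    for t, s, p in zip(token_ids, segment_ids, position_ids):
        row = [a + b + c for a, b, c in
               zip(token_table[t], segment_table[s], position_table[p])]
        result.append(row)
    return result
```

In a trained model the tables would be learned parameters; the segment table needs one row per depth value from 0 to the maximum depth.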

26. The device of claim 15,
wherein the target domain knowledge graph is
generated by setting a window of a predetermined size before and after a predicate extracted from the target domain data, and extracting a word corresponding to a subject or an object from among the words included in the window,
An entity name recognition model training device.

27. The device of claim 18,
wherein the target domain knowledge graph is
generated by setting a window of a predetermined size before and after a predicate extracted from the target domain data, extracting candidate words that can correspond to a subject or an object from among the words included in the window, determining the words among the candidate words that correspond to an entity name of the target domain, and determining the words corresponding to the entity name as the subject or the object,
and the words corresponding to the entity name are determined based on the pre-built entity name dictionary including entity names of the target domain,
An entity name recognition model training device.
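
The window-based extraction described above can be sketched as follows. The sketch makes a simplifying assumption that the subject appears before the predicate and the object after it, which depends on the word order of the language; the claim itself only states that both are taken from a window around the predicate. The predicate set, the entity name dictionary, and the name `extract_triples` are hypothetical.

```python
def extract_triples(tokens, predicates, entity_dict, window):
    """Return (subject, predicate, object) triples found around predicates.

    tokens      : tokenized target domain sentence
    predicates  : set of tokens treated as predicates (assumed given)
    entity_dict : set of entity names of the target domain
    window      : number of tokens inspected on each side of a predicate
    """
    triples = []
    for i, token in enumerate(tokens):
        if token not in predicates:
            continue
        before = tokens[max(0, i - window):i]
        after = tokens[i + 1:i + 1 + window]
        # Candidate words are kept only if the dictionary lists them.
        subjects = [w for w in before if w in entity_dict]
        objects = [w for w in after if w in entity_dict]
        for subject in subjects:
            for obj in objects:
                triples.append((subject, token, obj))
    return triples
```

With a small window, an entity that lies just outside it is missed, which is the situation the following claim addresses.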

28. The device of claim 27,
wherein the target domain knowledge graph is
generated by further extracting, from a range outside the window, candidate words that can become the object based on the relationship between the subject and the predicate, and by determining as the object, based on the entity name dictionary, the words among the further extracted candidate words that correspond to an entity name of the target domain,
An entity name recognition model training device.
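
A hedged sketch of the further extraction outside the window follows. It assumes that a subject-predicate relationship is established when a dictionary entity is found in the window before the predicate, and that the search for an object then continues past the window only if no object is found inside it. How the relationship actually guides the search is not detailed in the claim, and the name `extract_with_fallback` is hypothetical.

```python
def extract_with_fallback(tokens, predicates, entity_dict, window):
    """Extract triples, looking beyond the window when no object is inside it."""
    triples = []
    for i, token in enumerate(tokens):
        if token not in predicates:
            continue
        subjects = [w for w in tokens[max(0, i - window):i] if w in entity_dict]
        if not subjects:
            continue  # no subject-predicate relationship to build on
        subject = subjects[-1]  # entity nearest to the predicate
        inside = [w for w in tokens[i + 1:i + 1 + window] if w in entity_dict]
        outside = [w for w in tokens[i + 1 + window:] if w in entity_dict]
        candidates = inside or outside  # prefer objects inside the window
        if candidates:
            triples.append((subject, token, candidates[0]))
    return triples
```

Taking the first candidate keeps the object closest to the predicate; a different selection rule could be substituted without changing the overall flow.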