KR20200071689A

KR20200071689A - Apparatus and method for named entity disambiguation based on rdf knowledge base

Info

Publication number: KR20200071689A
Application number: KR1020190164741A
Authority: KR
Inventors: 김홍기; 심용선
Original assignee: 서울대학교산학협력단
Priority date: 2018-12-11
Filing date: 2019-12-11
Publication date: 2020-06-19
Also published as: KR102293071B1

Abstract

A method for named entity disambiguation based on an RDF knowledge base comprises the steps of: selecting words recognized as entities among the words of an input sentence; determining an entity candidate group for each of the selected words based on learning from data that expresses relationships between entities in a knowledge base; constructing combination sets by combining candidates drawn from each of the entity candidate groups so that all possible cases are covered; and determining the entity-name meaning of each of the selected words by ranking the combination sets by similarity.

Description

Method and apparatus for named entity disambiguation based on an RDF knowledge base {APPARATUS AND METHOD FOR NAMED ENTITY DISAMBIGUATION BASED ON RDF KNOWLEDGE BASE}

The present invention relates to a method and apparatus for named entity disambiguation based on a Resource Description Framework (RDF) knowledge base, devised for application to question answering systems. More specifically, it relates to an RDF knowledge base-based method and apparatus that use a pairwise approach based on global interdependence to link an entity name having two or more meanings to the meaning that is related to the meanings of the other words used in the same sentence.

With the recent development of the Internet and computing technologies, the evolution of mobile devices and sensors, and the emergence of networks, the amount of information is growing rapidly. Accordingly, various studies are under way on finding the needed information within this growing volume. Named entity recognition, a subfield of information extraction, and the linking of recognized entity names to specific entities are being actively studied as ways to extract meaningful knowledge from vast amounts of information. Entity linking is the task of mapping an entity name that appears in text to a specific entry in a knowledge base such as Wikipedia.

Natural language expressions use a variety of entity names. The names of entities that exist in the real world, such as people, organizations, places, and products, are called entity names, and a single entity name can have several meanings.

The ambiguity of entity names is therefore a very challenging task, and one of the hard problems, in the field of natural language processing.

The present invention has been devised in view of the need described above. Its object is to provide an RDF knowledge base-based method and apparatus for named entity disambiguation that use a pairwise approach based on global interdependence to link an entity name having two or more meanings to the meaning that is related to the meanings of the other words used in the same sentence.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present invention pertains.

To achieve the above object, according to one aspect of the present invention, there is provided a knowledge base-based method for named entity disambiguation, comprising: selecting words recognized as entities among the words of an input sentence; determining an entity candidate group for each of the selected words based on learning from data that expresses entity-to-entity relationships in a knowledge base; constructing combination sets by combining candidates drawn from each of the entity candidate groups so that all possible cases are covered; and determining the entity-name meaning of each of the selected words by ranking the combination sets by similarity.

According to an embodiment of the present invention, selecting the words recognized as entities among the words of the input sentence may include selecting the words of the input sentence that are nouns, verbs, or adjectives.

According to an embodiment of the present invention, determining the entity candidate group for each of the selected words may include determining, as the candidate group, the entities that match the selected word and whose entity type also matches the entity type of the selected word.

According to an embodiment of the present invention, the entity type may be any one of instance, class, and property.

According to an embodiment of the present invention, the similarity of each combination set is calculated based on the following [Equation]:

[Equation]

Here, V is the number of node sets, where a node set is the set of candidate entities for one word, and the two node terms of the equation denote the i-th node of node set u and the j-th node of node set ν, respectively.
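The equation itself does not survive in this text. The following is only a reconstruction from the definitions above and from the detailed description, which states that the cosine similarities between the candidates in a combination set are summed and divided by the number of words. The symbols S, C, and n are assumed names, and counting each unordered pair of node sets once is also an assumption.

```latex
\[
S(C) \;=\; \frac{1}{V} \sum_{u=1}^{V-1} \sum_{v=u+1}^{V}
\cos\!\left(n_i^{u},\, n_j^{v}\right),
\qquad
\cos(a, b) \;=\; \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
\]
```

In this sketch, C is one combination set, n_i^u is the candidate that C selects from node set u, and n_j^v is the candidate it selects from node set v. Counting ordered pairs instead would double every score and leave the ranking of combination sets unchanged.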

According to another aspect of the present invention, there is provided a knowledge base-based apparatus for named entity disambiguation, comprising one or more processors that: select words recognized as entities among the words of an input sentence; determine an entity candidate group for each of the selected words based on learning from data that expresses entity-to-entity relationships in a knowledge base; construct combination sets by combining candidates drawn from each of the entity candidate groups so that all possible cases are covered; and determine the entity-name meaning of each of the selected words by ranking the combination sets by similarity.

According to an embodiment of the present invention, the one or more processors may select the words of the input sentence that are nouns, verbs, or adjectives.

According to an embodiment of the present invention, the one or more processors may determine, as the candidate group, the entities that match each of the selected words and whose entity type also matches the entity type of that word.

The apparatus and method proposed in the present invention generate a named entity disambiguation model based on a knowledge base-based approach that takes global interdependence into account, and can resolve entity name ambiguity efficiently using the embedded vectors. In addition, by using a global-interdependence approach that draws on all of the words, they can achieve better overall performance than the conventional approach that considers the interdependence of individual words. Moreover, among approaches that consider global interdependence, the pairwise approach can achieve better performance than applying the personalized PageRank algorithm. The method and apparatus proposed in the present invention can thus be applied, as an RDF knowledge base-based named entity disambiguation method, to question answering systems and information extraction systems.

The effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 is a conceptual diagram of a knowledge base-based apparatus for named entity disambiguation according to an embodiment of the present invention.
FIG. 2 illustrates an example application of the Entity Transition Probabilities (ETP) algorithm according to an embodiment of the present invention.
FIG. 3 is a flowchart of a knowledge base-based method for named entity disambiguation according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention, which is defined only by the scope of the claims. Like reference numerals refer to like components throughout the specification.

In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the present invention. The terms used below are defined in consideration of their functions in the embodiments of the present invention and may vary according to the intention or custom of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

Combinations of the blocks of the accompanying block diagrams and the steps of the flowcharts may be performed by computer program instructions (an execution engine). Because these computer program instructions can be loaded onto a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, the instructions executed by that processor create means for performing the functions described in each block of the block diagrams or each step of the flowcharts.

These computer program instructions can also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, so that the instructions stored in that memory can produce an article of manufacture containing instruction means that perform the functions described in each block of the block diagrams or each step of the flowcharts.

The computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on that equipment to produce a computer-executed process, and the instructions that run on the computer or other programmable data processing equipment can provide steps for executing the functions described in each block of the block diagrams and each step of the flowcharts.

In addition, each block or step may represent a module, segment, or portion of code that includes one or more executable instructions for executing the specified logical functions. It should be noted that in some alternative embodiments the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially simultaneously, or may be performed in the reverse order of the corresponding functions as needed.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The embodiments illustrated below may be modified in various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to describe the present invention more completely to those of ordinary skill in the art.

Named entity disambiguation is the task of linking an entity name that appears in text to the appropriate entity in a knowledge base. It is used in fields such as question answering systems and information extraction systems. For example, in the question 'In the film Leon, is the role of Mathilda played by Natalie Portman?', 'Leon' means the film 'Leon'. By contrast, in the question 'Was Leon, the song released in 2015, sung by IU and Park Myeong-su?', 'Leon' means the song 'Leon' released in 2015. In this way, the meaning of an entity name that has two or more meanings is determined as the one related to the meanings of the other words used in the same sentence.

In the present invention, a model was generated from a knowledge base in Resource Description Framework (RDF) form in order to resolve entity name ambiguity. The generated model was then used to rank candidates for named entity disambiguation, and the results were compared.

In the present invention, an Entity2vec model built on Word2vec was generated as the model for named entity disambiguation. Word2vec, research published by Google in 2013, is one of the word embedding methods for turning words into vectors. The present invention used the training technique of the Word2vec model to generate the Entity2vec model. The Entity2vec model is trained on the associations between the entities in the knowledge base and consists of an embedding vector for each entity. Unlike Word2vec, therefore, Entity2vec is trained on an RDF knowledge base rather than on sentences. A natural language sentence contains words needed to grasp the meaning of an entity name, but, being natural language, it may also contain unnecessary words. An RDF knowledge base, by contrast, is organized around each entity's relationships with other entities, which compensates for this drawback of natural language sentences. In the present invention, therefore, the model was trained on the RDF knowledge base rather than on sentences, yielding an Entity2vec model that holds embedding information for each entity name.

The pairwise approach (pairwise-based method), a representative method for named entity disambiguation, resolves ambiguity by using the association between entity names when two or more entity names appear in one sentence. That is, it links two different entity names, assigns a weight to the pair, and selects the pair with the highest weight. This method takes into account the interdependence between the entity names that appear in the same sentence, but because it links only two names at a time, it cannot take global interdependence into account. For example, consider the sentence 'Are pine trees (sonamu) included among the kinds of sipjangsaeng (the ten traditional symbols of longevity)?'. For the ambiguous entity name 'sonamu', the traditional pairwise approach does not use the information from 'sipjangsaeng' but uses the preceding word 'kinds', and therefore returns 'Sonamoo', the seven-member South Korean girl group. To address this limitation, the present invention devised a pairwise approach based on global interdependence, implemented it, and compared it experimentally with the traditional pairwise approach.

Thus, the present invention uses a pairwise approach based on global interdependence to overcome the limitation of the traditional pairwise approach. The algorithm presented in the present invention was also compared with the personalized PageRank algorithm, which likewise borrows the concept of global interdependence. The PageRank algorithm assigns a weight to a page according to the relative importance of the other pages linked to it. The present invention borrowed the conceptual meaning of the PageRank algorithm and used a personalized PageRank algorithm that treats each entity as a page and applies the associations with the entities that appear together with it.
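For orientation, a generic personalized PageRank can be sketched as a power iteration whose restart distribution is concentrated on a set of seed entities. This is the textbook algorithm, not the specific baseline configuration used in the comparison; the graph, the seed choice, the damping factor, and all names below are assumptions made for illustration.

```python
from typing import Dict, List, Set


def personalized_pagerank(
    graph: Dict[str, List[str]],
    seeds: Set[str],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Power iteration; every neighbor must also be a key of `graph`,
    and `seeds` must be a non-empty subset of the graph's nodes."""
    # Restart distribution: uniform over the seed entities only.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in graph}
        for n, out in graph.items():
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for m in graph:
                    nxt[m] += damping * rank[n] * restart[m]
        rank = nxt
    return rank


# Hypothetical entity graph: the film is linked to Mathilda, the song to IU.
graph = {
    "Leon_film": ["Mathilda"],
    "Mathilda": ["Leon_film"],
    "Leon_song": ["IU"],
    "IU": ["Leon_song"],
}
ranks = personalized_pagerank(graph, seeds={"Mathilda"})
```

With 'Mathilda' as the only seed, all rank mass stays in the film's component, so the film candidate outranks the song candidate.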

The present invention confirmed that, for named entity disambiguation, the proposed pairwise approach based on global interdependence outperforms both the traditional pairwise approach and the personalized PageRank algorithm.

FIG. 1 is a conceptual diagram of a knowledge base-based named entity disambiguation apparatus 100 according to an embodiment of the present invention.

Referring to FIG. 1, the named entity disambiguation apparatus 100 includes a candidate group extraction unit 110, an Entity2vec model unit 130, and a named entity disambiguation unit 140. According to an embodiment of the present invention, the RDF knowledge base 120 may be a database stored in a memory included in the named entity disambiguation apparatus 100 or a database stored in an external storage device.

The overall named entity disambiguation process is shown in FIG. 1. This process may be performed by one or more processors of the named entity disambiguation apparatus 100.

First, in step 1, an Entity2vec model is generated using the Word2vec algorithm. In step 2, the words used in the sentence are looked up in the knowledge base to extract a candidate group for each entity. Step 3 is the disambiguation step: one candidate is taken from each of the candidate groups extracted in step 2, and the candidates are combined to build combination sets. The entity ambiguity can then be resolved based on the similarity between the candidates, using the embedded vectors in the Entity2vec model.

Entity2vec model generation

First, the Entity2vec model unit 130 of the named entity disambiguation apparatus 100 generates the Entity2vec model proposed in the present invention from the RDF knowledge base 120 (①).

RDF was proposed by the World Wide Web Consortium (W3C) for modeling metadata. An RDF knowledge base consists of triples of the form subject-property-object. Because an RDF knowledge base expresses the relationship between entities as a property, the relationships between entities are easy to grasp intuitively. A triple of an RDF knowledge base can accordingly be written out as a natural language sentence. For example, the triple 'Leon - release year - 1994' can be written as the sentence 'The release year of Leon is 1994.'
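The correspondence between a triple and a sentence can be sketched as follows. The `Triple` type, the `verbalize` helper, and the sentence template are hypothetical and chosen only to reproduce the example above.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    prop: str  # the property that relates subject and object
    obj: str


def verbalize(t: Triple) -> str:
    """Render a triple with a fixed English template (an assumption)."""
    return f"The {t.prop} of {t.subject} is {t.obj}."


leon = Triple("Leon", "release year", "1994")
print(verbalize(leon))  # The release year of Leon is 1994.
```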

The present invention proposes the Entity2vec model, which applies the Word2vec algorithm to an RDF knowledge base. Word2vec, one of the word embedding methods for turning words into vectors, derives a vector for each word by learning which words appear together with it in a sentence. The Entity2vec model borrows this concept and uses as training data an RDF knowledge base, which can be expressed as natural language sentences. That is, it learns the co-occurrence information between entities and generates a vector for each entity. Of the various kinds of triples in the knowledge base, only those that express a relationship between one entity and another were extracted and used as training data. In other words, training used only the triples whose subject and object are both URIs. Data given as a datatype value rather than a URI is not an entity as defined in the present invention and could interfere with training, so it was excluded.
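A minimal sketch of the filtering step might look like this. It assumes terms are written in N-Triples style, with URIs wrapped in angle brackets and literals in quotes; the `ex:` identifiers are invented, and emitting the property as a token alongside subject and object is an assumption about how the training "sentences" are formed.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def is_uri(term: str) -> bool:
    # Assumption: URIs look like <...>; literals look like "...".
    return term.startswith("<") and term.endswith(">")


def build_corpus(triples: List[Triple]) -> List[List[str]]:
    """Keep only triples whose subject and object are both URIs and
    emit each one as a token list, i.e. one training 'sentence'."""
    return [[s, p, o] for s, p, o in triples if is_uri(s) and is_uri(o)]


triples = [
    ("<ex:Leon_film>", "<ex:starring>", "<ex:Natalie_Portman>"),
    ("<ex:Leon_film>", "<ex:releaseYear>", '"1994"'),  # literal object: dropped
]
corpus = build_corpus(triples)
```

Each token list in `corpus` could then be fed, in place of a natural language sentence, to whichever Word2vec implementation is used for training.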

Candidate group extraction

In step 2, the candidate group extraction unit 110 of the named entity disambiguation apparatus 100 takes the words in the sentence that can be recognized as entities and links them to the RDF knowledge base 120 to select an entity candidate group for each word (②). Here, the words that can be recognized as entities are the nouns, verbs, and adjectives that remain after stopwords such as articles, prepositions, particles, and conjunctions are removed from the words of the sentence. For example, for the sentence 'In the film Leon, is the role of Mathilda played by Natalie Portman?', the input words are 'film', 'Leon', 'Mathilda', 'role', and 'Natalie Portman'. Each word is first assigned, through dependency analysis, the entity types it can take. 'Leon', 'Mathilda', and 'Natalie Portman' may be mapped to instance, 'film' to class or instance, and 'role' to property.
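The word selection described here can be sketched as a filter over part-of-speech tags. The example assumes the sentence has already been tagged by some parser; the tag names and the tagging of the example sentence are hypothetical.

```python
from typing import List, Tuple

# Hypothetical tag names for nouns, verbs, and adjectives.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ"}


def select_entity_words(tagged: List[Tuple[str, str]]) -> List[str]:
    """Keep nouns, verbs, and adjectives; drop everything else
    (articles, prepositions, particles, conjunctions, ...)."""
    return [word for word, tag in tagged if tag in CONTENT_TAGS]


# Assumed tagging of the example question; not the output of a real parser.
tagged = [
    ("in", "ADP"), ("the", "DET"), ("film", "NOUN"), ("Leon", "NOUN"),
    ("the", "DET"), ("role", "NOUN"), ("of", "ADP"), ("Mathilda", "NOUN"),
    ("Natalie Portman", "NOUN"),
]
words = select_entity_words(tagged)
```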

According to an embodiment of the present invention, two pieces of information are used to select an entity candidate group, and an entity is selected as a candidate only when both match. The word is matched 1:1 against the RDF knowledge base 120, and the entities that match the word and whose entity type also matches the entity type of the word are selected as the candidate group.

For example, for the word 'Leon', the RDF knowledge base 120 is searched for 'Leon' and only the entities that are instances are selected as candidates. In this case, the candidate group may consist of the film 'Leon', the character 'Leon' who is the protagonist of the film, and the song titled 'Leon'.
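The two-condition lookup can be sketched over a toy in-memory knowledge base. The `Entity` record, the `ex:` identifiers, and the exact-label comparison are illustrative assumptions; a real RDF store would be queried instead.

```python
from typing import List, NamedTuple, Set


class Entity(NamedTuple):
    uri: str
    label: str
    etype: str  # "instance", "class", or "property"


# Toy knowledge base with invented identifiers.
KB = [
    Entity("ex:Leon_film", "Leon", "instance"),
    Entity("ex:Leon_character", "Leon", "instance"),
    Entity("ex:Leon_song", "Leon", "instance"),
    Entity("ex:Film", "film", "class"),
    Entity("ex:role", "role", "property"),
]


def candidates(word: str, allowed_types: Set[str], kb: List[Entity]) -> List[str]:
    """Return entities whose label equals the word (1:1 match) AND
    whose entity type is one the word is allowed to take."""
    return [e.uri for e in kb if e.label == word and e.etype in allowed_types]


leon_candidates = candidates("Leon", {"instance"}, KB)
```

Both conditions must hold, so searching for 'Leon' as a class returns nothing even though the label matches three instances.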

Named entity disambiguation

In step 3, the named entity disambiguation unit 140 of the named entity disambiguation apparatus 100 ranks the entity candidate groups produced in step 2. For the ranking, candidates are drawn from each candidate group to build combination sets (③).
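Taking one candidate from each candidate group in every possible way is a Cartesian product, which can be sketched with `itertools.product`. The candidate groups and identifiers below are invented for illustration.

```python
from itertools import product
from typing import Dict, List

# Hypothetical candidate groups, one per selected word.
candidate_groups: Dict[str, List[str]] = {
    "Leon": ["ex:Leon_film", "ex:Leon_character", "ex:Leon_song"],
    "Mathilda": ["ex:Mathilda_character", "ex:Matilda_novel"],
    "role": ["ex:role"],
}


def combination_sets(groups: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Build every combination that picks exactly one candidate per word."""
    words = list(groups)
    return [dict(zip(words, combo)) for combo in product(*(groups[w] for w in words))]


combos = combination_sets(candidate_groups)
print(len(combos))  # 6  (3 * 2 * 1)
```

The number of combination sets is the product of the group sizes, so it grows quickly as more ambiguous words appear in a sentence.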

Next, for each combination set, the cosine similarity between the candidates drawn from the different candidate groups is calculated and then averaged (④). Unlike the traditional pairwise approach, the present invention uses the similarity of the whole combination set and thereby resolves entity ambiguity on the basis of global interdependence. The formula of the proposed pairwise approach based on global interdependence is given by <Equation 1> below.

<수학식 1>의 알고리즘은 조합셋의 스코어를 의미한다. <수학식 1>의 알고리즘은 단어의 개수에 따라 V개의 노드셋(node set)을 갖는다. 노드셋은 한 단어에 대한 후보 개체의 집합을 의미한다.

는 u 노드셋의 i 번째 노드를 의미하고,

는 ν 노드셋의 j 번째 노드를 의미한다. 즉, <수학식 1>의 알고리즘은 조합셋에 있는 개체 후보들 간의 코사인 유사도를 계산하여 모두 합한 뒤, 단어의 개수로 나누어 조합셋의 평균값을 구하는 것이다. 이후, 전체 조합셋을 대상으로 값을 구한 뒤, 그 중 가장 큰 값을 가진 조합셋을 선택하여 개체 중의성을 해소할 수 있다.The algorithm of <Equation 1> means the score of the combination set. The algorithm of Equation 1 has V node sets according to the number of words. A nodeset is a set of candidate entities for a word.

Is the i th node of the u node set,

Is the j th node of the ν node set. That is, the algorithm of <Equation 1> calculates the cosine similarity between the individual candidates in the combination set, sums them, and divides them by the number of words to obtain the average value of the combination set. Subsequently, after obtaining a value for the entire combination set, the object neutrality can be resolved by selecting the combination set having the largest value.
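As a hedged illustration, one form of the combination set score that matches the verbal description above (sum the cosine similarities between the selected candidates, then divide by the number of words) is sketched below. The symbols $S(C)$ for the score of a combination set $C$, $n^{u}_{i}$ and $n^{\nu}_{j}$ for the nodes selected from node sets $u$ and $\nu$, and the choice to count each pair of node sets once ($u < \nu$) are assumptions introduced here for illustration; they are not notation confirmed by the source.

```latex
S(C) = \frac{1}{V} \sum_{u=1}^{V-1} \sum_{\nu=u+1}^{V}
       \cos\!\left(n^{u}_{i},\, n^{\nu}_{j}\right),
\qquad
C^{*} = \arg\max_{C} S(C)
```

Under this reading, $C^{*}$ is the combination set selected as the disambiguation result, and every candidate in it is scored jointly with all the others rather than pair by pair.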

According to one embodiment of the present invention, the Adam knowledge base was used to create the training data. The Adam knowledge base is an RDF knowledge base built by Saltlux Inc. for developing a knowledge-base-driven question answering system, and it consists of approximately 17 million instances, 1,000 properties, and 200 million triples.

The named entity disambiguation method of the present invention can be applied to question answering systems and the like, and a question set provided by Saltlux can be used as test data for evaluation. The question set consists of questions whose answer is yes or no, such as 'Is Irene a member of the girl group Red Velvet?' and 'Is the role of Matilda in the film Leon played by Natalie Portman?'. In one embodiment of the present invention, out of 198 questions in total, 51 questions with word extraction errors caused by inaccurate parsing and 53 questions for which no candidate group could be found because the knowledge base lacked the words were excluded, and the experiment was conducted on the remaining 94 questions. Parsing the 94 questions yields 370 words in total, together with the entity type of each word. To evaluate a question answering system using the named entity disambiguation method of the present invention, an answer set was constructed manually for the 370 words. An example of the form of the answer set is shown in Table 1.

Table 1

| Word | Entity type | Answer |
| --- | --- | --- |
| Natalie Portman (내털리포트먼) | Resource | http://kb.adams.ai/resource/0000410364 |
| Leon (레옹) | Resource | http://kb.adams.ai/resource/0000093989 |
| film (영화) | Class | http://kb.adams.ai/schema/class/movie_06205452 |
| role (역할) | Property | http://kb.adams.ai/schema/property/role |
| Matilda (마틸다) | Resource | http://kb.adams.ai/resource/0030163951 |

To evaluate the performance of the named entity disambiguation method based on the pairwise approach with global interdependence proposed in the present invention (Global Interdependence based Pairwise, GIPW), it was tested against a pairwise approach that considers interdependence only (Interdependence based Pairwise, IPW) and against the Personalized PageRank algorithm (PPR), which is also designed around global interdependence.

IPW resolves ambiguity using the cosine similarity between entity candidates. For PPR, the ETP (Entity Transition Probabilities) algorithm from the paper 'DoSeR - a knowledge-base-agnostic framework for entity disambiguation using semantic embeddings' was used to weight the edges traversed when moving from one node to another. This algorithm operates on V node sets, one per word, where a node set is the set of candidate entities for one word and the nodes are the k candidates within a node set. The formula is given by Equation 2.

In Equation 2, the two node terms denote the i-th node of node set u and the j-th node of node set ν, respectively.

An application of the ETP algorithm is illustrated in FIG. 2.

FIG. 2 shows an example in which the ETP algorithm is applied to 'Leon' and 'Natalie Portman' as used in a sentence. As shown in FIG. 2, there is a node set for the word 'Leon' containing three entity candidates as nodes: the film title 'Leon', the character 'Leon', and the song title 'Leon'. There is also a node set for the word 'Natalie Portman' containing a single entity candidate as its node, the actress 'Natalie Portman'. When the vectors of the nodes in the Entity2vec model are fed to the ETP algorithm, the actress 'Natalie Portman' receives 1 point from each of the nodes in the 'Leon' node set. Conversely, the 'Natalie Portman' node distributes the total of 1 point that it can give to other nodes among the nodes in the 'Leon' node set: applying the ETP algorithm in the same way, the film 'Leon' receives 0.4 points, the character 'Leon' 0.4 points, and the song title 'Leon' 0.2 points. In this way, the two node sets accumulate the ETP values between each other's nodes, and the node with the highest score in each node set is finally taken as the answer, resolving the entity ambiguity.
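The "1 point distributed among the nodes of the other node set" behaviour described for FIG. 2 can be sketched as a simple normalization. This is a minimal illustration only, not the ETP formula of the cited paper: the function name `transition_weights`, the assumption that the raw similarities are non-negative, and the raw similarity values (chosen so that the result reproduces the 0.4 / 0.4 / 0.2 split in the text) are all hypothetical.

```python
def transition_weights(similarities):
    """Normalize one source node's similarities so that they sum to 1.

    `similarities` maps each target node to a non-negative similarity
    (an assumption of this sketch).
    """
    total = sum(similarities.values())
    if total == 0:
        return {target: 0.0 for target in similarities}
    return {target: value / total for target, value in similarities.items()}


# Hypothetical similarities from 'Natalie Portman' to each 'Leon' candidate.
portman_to_leon = {"film": 0.8, "character": 0.8, "song": 0.4}
weights = transition_weights(portman_to_leon)

# Each 'Leon' candidate has only one possible target, so it gives its full point.
leon_film_to_portman = transition_weights({"actress": 0.8})
```

Because the 'Natalie Portman' node set has a single node, every 'Leon' candidate passes its whole point to it, which matches the "1 point each" statement in the text.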

The performance of the three methods is shown in Table 2. Table 2 gives the results obtained when the Entity2Vec model is trained with the default options (Baseline) of the Word2Vec model training provided by DL4J.

As Table 2 shows, the recall of all three methods is 100%. Questions containing words for which parsing failed, or for which no candidate group could be found because the knowledge base lacked the words, were excluded from the test question set, so the evaluated data consists only of words that have candidates in the knowledge base, which is why the recall is 100%. Precision was 63.2% for IPW, 65.7% for GIPW, and 64.3% for PPR: GIPW and PPR, which consider global interdependence, both outperformed IPW, and the GIPW of the present invention was 1.4 percentage points higher than PPR.

These results confirm that methods that consider global interdependence resolve named entity ambiguity more effectively than IPW, which considers interdependence only, and that among them the pairwise approach with global interdependence (GIPW) proposed in the present invention is highly effective at resolving entity ambiguity.

Table 2

| Baseline | IPW | GIPW | PPR |
| --- | --- | --- | --- |
| Recall | 100% | 100% | 100% |
| Precision | 63.2% | 65.7% | 64.3% |
| F-score | 77.5% | 79.3% | 78.3% |
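The F-score row of the Baseline results is consistent with the usual harmonic mean of precision and recall. The source does not state which F-score definition was used, so treating it as F1 is an assumption of this sketch; the function name `f_score` is likewise illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Baseline precision values from the table; recall is 100% for every method.
baseline_precision = {"IPW": 0.632, "GIPW": 0.657, "PPR": 0.643}
baseline_f = {
    name: round(100 * f_score(p, 1.0), 1)
    for name, p in baseline_precision.items()
}
```

With recall fixed at 100%, the F-score depends on precision alone, so the ranking of the three methods by F-score is the same as their ranking by precision.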

Table 3 below shows the results of testing an Entity2vec model generated with Iteration increased from 1 to 10 relative to the default options. With this model, GIPW achieves a precision of 70.5% and an F-score of 82.7%, outperforming IPW and PPR.

Table 3

| Iteration increased (1 -> 10) | IPW | GIPW | PPR |
| --- | --- | --- | --- |
| Recall | 100% | 100% | 100% |
| Precision | 67% | 70.5% | 65.6% |
| F-score | 80.2% | 82.7% | 79.2% |

Table 4 shows the results of testing an Entity2vec model generated with Epoch increased from 1 to 10 relative to the default options. With this model, GIPW and PPR perform alike, both reaching a precision of 67.3% and an F-score of 80.5%.

Table 4

| Epoch increased (1 -> 10) | IPW | GIPW | PPR |
| --- | --- | --- | --- |
| Recall | 100% | 100% | 100% |
| Precision | 66.5% | 67.3% | 67.3% |
| F-score | 79.6% | 80.5% | 80.5% |

Table 5 shows the results of testing an Entity2vec model generated with Layersize increased from 100 to 200 relative to the default options. With this model, GIPW achieves a precision of 65.1% and an F-score of 78.9%, outperforming IPW and PPR. Although the results are lower than with the default options, GIPW still performs best overall.

Table 5

| Layersize increased (100 -> 200) | IPW | GIPW | PPR |
| --- | --- | --- | --- |
| Recall | 100% | 100% | 100% |
| Precision | 62.4% | 65.1% | 64.6% |
| F-score | 76.9% | 78.9% | 78.5% |

Among the models trained with Iteration, Epoch, and Layersize each increased, the model with only Iteration increased gave the best results. The models trained with increased Iteration or Epoch also outperformed the model trained with the default options. This indicates that increasing the Iteration and Epoch options matters more for obtaining a better model than increasing Layersize.

FIG. 3 is a flowchart of a knowledge-base-based named entity disambiguation method according to an embodiment of the present invention. For example, the method may be performed by the named entity disambiguation apparatus 100.

Referring to FIG. 3, the knowledge-base-based named entity disambiguation method includes selecting, from among the words of an input sentence, words recognized as entities (S310); determining an entity candidate group for each of the selected words (S320); constructing combination sets in which candidates from each of the entity candidate groups are arbitrarily combined (S330); and determining the named entity meaning of each word by calculating a ranking of the similarity of each combination set (S340).

First, the named entity disambiguation apparatus 100 selects, from among the words of the input sentence, the words recognized as entities (S310). For example, the candidate group extraction unit 110 may select, from the words appearing in the sentence, those that can be recognized as entities. Here, a word that can be recognized as an entity may be a noun, verb, or adjective, that is, any word in the sentence other than stopwords such as articles, prepositions, postpositional particles, and conjunctions.
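Step S310 can be sketched as a filter over part-of-speech-tagged tokens. This is a minimal illustration that assumes the sentence has already been parsed by some tagger; the tag names (`NOUN`, `VERB`, `ADJ`, `ADP`, `AUX`), the function name `select_entity_words`, and the treatment of proper nouns as nouns are assumptions of this sketch, not details given in the source.

```python
# Parts of speech kept as possible entities: nouns, verbs, and adjectives.
CONTENT_POS = {"NOUN", "VERB", "ADJ"}


def select_entity_words(tagged_tokens):
    """Keep the words whose part of speech may denote an entity."""
    return [word for word, pos in tagged_tokens if pos in CONTENT_POS]


# Hypothetical tagger output for the example question about the film Leon.
tagged = [
    ("film", "NOUN"),
    ("Leon", "NOUN"),
    ("in", "ADP"),
    ("Matilda", "NOUN"),
    ("role", "NOUN"),
    ("Natalie Portman", "NOUN"),
    ("is", "AUX"),
]
selected = select_entity_words(tagged)
```

The selected words here are the five words that appear in the example answer set: film, Leon, Matilda, role, and Natalie Portman.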

Next, the named entity disambiguation apparatus 100 determines an entity candidate group for each of the selected words based on learning from data that represents the relationships between entities using the knowledge base (S320). For example, the candidate group extraction unit 110 may link the words that can be recognized as entities to the RDF knowledge base 120 and select an entity candidate group for each word. Each word is first assigned, through parsing, the entity types it can have. For example, 'Leon', 'Matilda', and 'Natalie Portman' may be mapped to instance, 'film' to class or instance, and 'role' to property. The entity candidate group of a word consists of the entities that match the word and whose entity type also matches the entity type of the word.
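The candidate selection rule of step S320 (the label matches the word and the entity type matches one of the types assigned to the word) can be sketched over a small in-memory list. The `Entity` record, the `candidate_group` function, and the `example.org` URIs are hypothetical stand-ins for illustration; an actual system would query the RDF knowledge base instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    uri: str
    label: str
    entity_type: str  # "instance", "class", or "property"


def candidate_group(word, allowed_types, knowledge_base):
    """Return entities whose label equals the word and whose type is allowed."""
    return [
        entity
        for entity in knowledge_base
        if entity.label == word and entity.entity_type in allowed_types
    ]


# Toy knowledge base with made-up URIs.
KB = [
    Entity("http://example.org/resource/leon_film", "Leon", "instance"),
    Entity("http://example.org/resource/leon_character", "Leon", "instance"),
    Entity("http://example.org/resource/leon_song", "Leon", "instance"),
    Entity("http://example.org/schema/class/film", "film", "class"),
    Entity("http://example.org/schema/property/role", "role", "property"),
]

leon_candidates = candidate_group("Leon", {"instance"}, KB)
film_candidates = candidate_group("film", {"class", "instance"}, KB)
role_candidates = candidate_group("role", {"instance"}, KB)
```

In this toy data, 'Leon' yields three instance candidates, while 'role' yields none when only instances are allowed, because its only match is a property.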

Then, the named entity disambiguation apparatus 100 constructs combination sets in which candidates from each of the entity candidate groups are arbitrarily combined, taking all possible cases into account (S330). For example, as shown in FIG. 1, the named entity disambiguation unit 140 may build the combination sets by drawing candidates from each of the entity candidate groups produced in the previous step.
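Taking one candidate from every candidate group in all possible ways amounts to a Cartesian product of the groups. The sketch below uses Python's standard `itertools.product`; the function name `build_combination_sets` and the candidate identifiers are illustrative assumptions.

```python
from itertools import product


def build_combination_sets(candidate_groups):
    """Return every way of picking one candidate per word.

    `candidate_groups` maps each word to its list of candidates; each
    combination set is returned as a dict from word to chosen candidate.
    """
    words = list(candidate_groups)
    pools = [candidate_groups[word] for word in words]
    return [dict(zip(words, choice)) for choice in product(*pools)]


# Hypothetical candidate groups for three words.
groups = {
    "Leon": ["leon_film", "leon_character", "leon_song"],
    "Natalie Portman": ["portman_actress"],
    "Matilda": ["matilda_character", "matilda_musical"],
}
combination_sets = build_combination_sets(groups)
```

The number of combination sets is the product of the group sizes (3 x 1 x 2 = 6 here), so it grows quickly as sentences contain more ambiguous words.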

Finally, the named entity disambiguation apparatus 100 determines the named entity meaning of each word by calculating a ranking of the similarity of each combination set (S340). For example, based on the pairwise approach with global interdependence, the named entity disambiguation unit 140 may compute the cosine similarities between the entity candidates in a combination set, sum them, and divide the sum by the number of words to obtain the average for that combination set. This value is computed for every combination set, and the combination set with the largest value is selected to resolve the entity ambiguity. The formula of the pairwise approach with global interdependence proposed in the present invention is given by Equation 1 described above.
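The ranking in step S340 can be sketched as follows: score each combination set by summing the cosine similarities between its candidates and dividing by the number of words, then keep the highest-scoring set. This is a sketch under stated assumptions rather than the patented formula itself: counting each pair of candidates once, the function names, and the two-dimensional toy vectors (standing in for Entity2vec embeddings) are all introduced here for illustration.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def combination_score(combo, vectors):
    """Sum of pairwise cosine similarities divided by the number of words."""
    total = sum(cosine(vectors[a], vectors[b]) for a, b in combinations(combo, 2))
    return total / len(combo)


def best_combination(combination_sets, vectors):
    """Return the combination set with the largest score."""
    return max(combination_sets, key=lambda combo: combination_score(combo, vectors))


# Toy embeddings: the film-related entities point in a similar direction.
vectors = {
    "leon_film": (1.0, 0.0),
    "leon_song": (0.0, 1.0),
    "portman_actress": (0.9, 0.1),
    "matilda_character": (0.8, 0.2),
}
combos = [
    ("leon_film", "portman_actress", "matilda_character"),
    ("leon_song", "portman_actress", "matilda_character"),
]
best = best_combination(combos, vectors)
```

With these toy vectors the combination containing the film is selected, because the film candidate is similar to both of the other candidates at once, which is the joint effect the global interdependence score is meant to capture.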

In the specific embodiments described above, components included in the invention are expressed in the singular or the plural according to the embodiment presented. However, the singular or plural form was chosen to suit the situation presented for convenience of description; the embodiments are not limited to singular or plural components, and a component expressed in the plural may consist of a single element, while a component expressed in the singular may consist of multiple elements.

Although specific embodiments have been described in the description of the invention, various modifications are of course possible without departing from the scope of the technical idea of the various embodiments. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be determined by the claims below and their equivalents.

Claims

A method for named entity disambiguation based on a knowledge base, the method comprising:
selecting, from among the words of an input sentence, words recognized as entities;
determining an entity candidate group for each of the selected words based on learning from data that represents relationships between entities using the knowledge base;
constructing combination sets in which candidates from each of the entity candidate groups are arbitrarily combined, taking all possible cases into account; and
determining the named entity meaning of each of the selected words by calculating a ranking of the similarity of each of the combination sets.

The method of claim 1,
wherein selecting the words recognized as entities from among the words of the input sentence comprises
selecting words that are nouns, verbs, or adjectives from among the words of the input sentence.

The method of claim 1,
wherein determining the entity candidate group for each of the selected words comprises
determining, as the candidate group, entities that match each of the selected words and also match the entity type of each of the selected words.

The method of claim 3,
wherein the entity type is any one of instance, class, and property.

The method of claim 1,
wherein the similarity of each of the combination sets is calculated based on the following [Equation],
[Equation]

where V is the number of node sets, each node set being the set of candidate entities for one word, and the two node terms denote the i-th node of node set u and the j-th node of node set ν, respectively.

An apparatus for named entity disambiguation based on a knowledge base, the apparatus comprising
one or more processors configured to: select, from among the words of an input sentence, words recognized as entities; determine an entity candidate group for each of the selected words based on learning from data that represents relationships between entities using the knowledge base; construct combination sets in which candidates from each of the entity candidate groups are arbitrarily combined, taking all possible cases into account; and determine the named entity meaning of each of the selected words by calculating a ranking of the similarity of each of the combination sets.

The apparatus of claim 6,
wherein the one or more processors
select words that are nouns, verbs, or adjectives from among the words of the input sentence.

The apparatus of claim 6,
wherein the one or more processors
determine, as the candidate group, entities that match each of the selected words and also match the entity type of each of the selected words.

The apparatus of claim 8,
wherein the entity type is any one of instance, class, and property.

The apparatus of claim 6,
wherein the similarity of each of the combination sets is calculated based on the following [Equation],
[Equation]

where the two node terms denote the i-th node of node set u and the j-th node of node set ν, respectively.