KR20110026039A - Ontology matching method using broader terms

Info

Publication number: KR20110026039A
Application number: KR1020090083751A
Authority: KR
Inventors: 최기선; 최동현
Original assignee: 한국과학기술원 (Korea Advanced Institute of Science and Technology, KAIST)
Priority date: 2009-09-07
Filing date: 2009-09-07
Publication date: 2011-03-15
Also published as: KR101116465B1

Abstract

PURPOSE: An ontology matching method using broader terms is provided that yields the result best suited to ontology extension by deriving the taxonomic relation best suited to extending the taxonomy of a given ontology. CONSTITUTION: The broader terms of a given word are extracted using the category structure of structured data (S10). An attributing graph is derived from the broader terms (S20) and analyzed using the HITS link analysis algorithm (S30). A matching score for each concept of the ontology is generated from the analysis result (S40). The ontology concept matched with the given word is selected on the basis of the matching scores (S50).

Description

Ontology matching method using broader terms

The present invention relates to a method that, given a word, its broader terms, and an ontology, matches the input word to one of the ontology's concepts by analyzing the broader terms. More specifically, it relates to a matching method that builds an attributing graph from the given broader terms, analyzes the graph with a link analysis algorithm, uses the results to assign a matching score to each concept of the ontology, and then selects the concept to which the word is matched on the basis of those scores. A word matched to a concept becomes an instance or a subclass of that concept, and through this process the taxonomy of the given ontology can be extended.

For cases where it is difficult for the user to supply broader terms, the algorithm obtains them from Wikipedia's category structure.

Taxonomies play a very important role in many kinds of intelligent systems. Building a taxonomy manually, however, requires a great deal of time and effort. Many researchers have therefore studied ways to build taxonomies automatically.

To build a taxonomy automatically, the information used to construct it must first be acquired. In general, taxonomic information is acquired from one of two kinds of resource. The first is ordinary unstructured documents. Since Hearst's 1992 paper, "Automatic Acquisition of Hyponyms from Large Text Corpora," defined four lexico-syntactic patterns for extracting taxonomic relations, most subsequent studies have either used those patterns or extended them. The second is structured data. With large-scale structured data such as Wikipedia's category structure recently becoming available, approaches that acquire taxonomic information from structured data have become active. Ponzetto's 2007 paper, "Deriving a Large Scale Taxonomy from Wikipedia," attempted to obtain a large amount of taxonomic information from Wikipedia's category information using several methods.

The present invention proposes a way to extract taxonomic relations that takes a word and its broader terms as input and uses them to extend the taxonomy of an existing ontology. The approach applies directly to the many structured data sources available today, such as Wikipedia's category structure or a thesaurus. Existing methods that extract taxonomic relations from broader terms mainly decide whether a given <word, broader term> pair is a taxonomic relation or not. The method proposed here instead gathers information from the whole set of broader terms to extend the taxonomy, and can therefore obtain taxonomic relations that are more accurate and better suited to extending the taxonomy of the given ontology.

The present invention was devised to solve the problems described above. Its object is to accept an ontology, a word, and the word's broader terms as input and, using the given word, to obtain the taxonomic relation best suited to extending the taxonomy of the given ontology.

To achieve the above object, the method of extracting taxonomic relations using broader terms according to the present invention comprises:

a first step of extracting the broader terms of a given word using the Wikipedia category structure; a second step of deriving an attributing graph from the broader terms obtained; a third step of analyzing the attributing graph using the HITS link analysis algorithm; a fourth step of assigning a matching score to each concept of the ontology using the results of that analysis; and a fifth step of selecting, on the basis of the matching scores, the ontology concept to which the given word is matched.

By deriving the taxonomic relation best suited to extending the taxonomy of a given ontology, the ontology concept matching method using broader terms according to the present invention can be expected to yield a system that is better suited to ontology extension, and produces better results, than other existing algorithms.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In describing the invention, detailed explanations of related well-known functions or configurations are omitted so as not to obscure the gist of the invention.

FIG. 1 is a flowchart showing a method of matching a given word to a concept of the target ontology by means of its broader terms, according to the present invention. Here, a match means the concept that semantically subsumes the given word.

First, the system extracts the broader terms of the given word using the Wikipedia category structure (S10). This is done simply by finding the Wikipedia page or category whose name is the same as the given word and collecting the names of the categories up to three levels above it. This step may be replaced by user input.
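Step S10 can be sketched as a bounded upward walk over a category graph. The sketch below is illustrative only: the function name `broader_terms`, the `parents` mapping, and the toy category data are assumptions, not part of the patent, and a real system would read the parent links from Wikipedia's category structure rather than from an in-memory dictionary.

```python
from collections import deque

def broader_terms(word, parents, max_levels=3):
    """Collect category names up to `max_levels` above `word`.

    `parents` maps a page or category name to its direct parent categories
    (a hypothetical stand-in for the Wikipedia category structure).
    """
    if word not in parents:
        return []  # no page or category with exactly this name
    found, seen = [], {word}
    frontier = deque([(word, 0)])
    while frontier:
        name, depth = frontier.popleft()
        if depth == max_levels:
            continue  # do not climb above the third level
        for parent in parents.get(name, []):
            if parent not in seen:
                seen.add(parent)
                found.append(parent)
                frontier.append((parent, depth + 1))
    return found

# Hypothetical toy category graph, for illustration only.
toy_parents = {
    "Some Album": ["1990s albums", "Albums by artist X"],
    "1990s albums": ["Albums by decade"],
    "Albums by artist X": ["Albums by artist"],
    "Albums by decade": ["Albums"],
    "Albums by artist": ["Albums"],
    "Albums": ["Musical works"],
}
terms = broader_terms("Some Album", toy_parents)
```

Here `"Musical works"` is four levels above the starting page, so it is not collected.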

Next, an attributing graph is built from the set of broader terms obtained through step S10 or through user input (S20). FIG. 2 shows an example input to the system, and FIG. 3 shows the attributing graph built from that input. To build the graph, the head word of each broader term is first extracted, using the method presented in M. Collins's 1999 paper, "Head-Driven Statistical Models for Natural Language Parsing." As FIG. 3 shows, each node of the graph represents a token extracted from the broader terms. Each edge represents the existence and frequency of a modification relation between a modifier and a head word. It is assumed that whenever a head word and a non-head word occur in the same broader term, the non-head word modifies the head word. For example, the edge of weight 4 from 'by' to 'Albums' in FIG. 3 means that, among the broader terms of the given word !!Destroy-Oh-Boy!! whose head word is 'Albums', 'by' co-occurs four times.
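The edge-counting rule of step S20 can be sketched as follows. This is a minimal illustration under stated assumptions: the head word of each broader term is supplied by the caller (the Collins head-finding method is not reproduced here), the function name `build_attributing_graph` is hypothetical, and the sample terms are invented for the example rather than taken from FIG. 2.

```python
from collections import Counter

def build_attributing_graph(terms_with_heads):
    """Return {(modifier, head): count} from (tokens, head) pairs.

    Every non-head token in a broader term is assumed to modify that
    term's head word, so each co-occurrence adds 1 to the edge weight.
    """
    edges = Counter()
    for tokens, head in terms_with_heads:
        for token in tokens:
            if token != head:
                edges[(token, head)] += 1
    return edges

# Hypothetical broader terms with their head words already identified.
example_terms = [
    (["Albums", "by", "artist"], "Albums"),
    (["Albums", "by", "genre"], "Albums"),
    (["Punk", "Albums"], "Albums"),
]
graph = build_attributing_graph(example_terms)
```

In this toy input 'by' occurs in two terms headed by 'Albums', so the edge from 'by' to 'Albums' has weight 2.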

To score each node of the attributing graph generated in step S20, the HITS page-ranking algorithm presented in J. M. Kleinberg's 1999 paper, "Authoritative Sources in a Hyperlinked Environment," is used (S30). HITS assigns two scores: the HITS authority score indicates how heavily a page (node) is referenced (modified) by other pages, and the HITS hub score indicates how heavily a page (node) references (modifies) other pages.
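The authority and hub scores of step S30 can be computed by the usual mutually reinforcing iteration. The sketch below is a plain power-iteration version written for illustration. Using the edge frequencies as weights is an assumption (the text does not say how the frequencies enter the HITS computation), and the function name `hits` and the sample edges are hypothetical.

```python
import math

def hits(edges, iterations=50):
    """Run HITS on a directed graph given as {(source, target): weight}.

    Returns (authority, hub) dictionaries, each L2-normalized.
    """
    nodes = {node for edge in edges for node in edge}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: weighted sum of the hub scores of nodes pointing in.
        new_auth = {n: 0.0 for n in nodes}
        for (src, dst), weight in edges.items():
            new_auth[dst] += weight * hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}
        # Hub: weighted sum of the authority scores of nodes pointed to.
        new_hub = {n: 0.0 for n in nodes}
        for (src, dst), weight in edges.items():
            new_hub[src] += weight * auth[dst]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return auth, hub

# Hypothetical modifier -> head edges with co-occurrence counts.
sample_edges = {
    ("by", "Albums"): 2,
    ("artist", "Albums"): 1,
    ("genre", "Albums"): 1,
    ("Punk", "Albums"): 1,
}
authority, hub_score = hits(sample_edges)
```

In this toy graph only 'Albums' is modified by other tokens, so it receives all of the authority, while 'by' is the strongest hub because its edge carries the largest weight.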

After each node has been scored in step S30, each concept of the given ontology is scored using the formula of FIG. 4 (S40). This score indicates how suitable each concept of the ontology is as the taxonomic parent of the given word. In the formula of FIG. 4, c is the target concept, h is the head word of the term denoting the target concept, and A is the set of tokens in that term, excluding modifiers.

Once each concept of the ontology has been scored in step S40, the concept with the highest score is found and the given word is attached to it as a 'subclass' or 'instance' (S50). Because the present invention focuses on extending the taxonomy, no particular distinction is drawn between subclass and instance. If every concept scores 0, extension of the ontology with the given word is treated as having failed.
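The selection rule of step S50, including the failure case, can be sketched in a few lines. The function name `select_concept` and the score values are hypothetical. The scores themselves would come from the formula of FIG. 4, which is not reproduced here.

```python
def select_concept(scores):
    """Return the highest-scoring concept, or None if matching fails.

    `scores` maps each ontology concept to its matching score from S40.
    Matching is treated as failed when every concept scores 0.
    """
    if not scores or all(score == 0 for score in scores.values()):
        return None
    return max(scores, key=scores.get)

# Hypothetical matching scores, for illustration only.
best = select_concept({"Album": 0.8, "Song": 0.1, "Person": 0.0})
failed = select_concept({"Album": 0.0, "Song": 0.0})
```

A caller would attach the given word under `best` as a subclass or instance, and skip the word when the result is `None`.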

The method of matching a given word to one of the concepts of a given ontology using broader terms according to the present invention has been described above with reference to the illustrative drawings. The invention is not limited by the embodiments and drawings disclosed in this specification, and various modifications may of course be made by those skilled in the art within the scope of the technical idea of the invention.

FIG. 1 is a flowchart of a method, according to the present invention, of matching a given word to one of the concepts of a given ontology using the word's broader terms.

FIG. 2 is an example illustrating the operation of the present invention.

FIG. 3 is the attributing graph that the present invention builds when given the input of FIG. 2.

FIG. 4 is the formula by which the present invention scores each concept of the ontology using the results of analyzing the attributing graph.

Claims

A method of matching a word to an ontology by means of broader terms, comprising the following steps:

A first step of extracting the broader terms of a given word using a category structure of structured data;

A second step of deriving an attributing graph from the broader terms obtained in the first step;

A third step of analyzing the attributing graph obtained in the second step using the HITS link analysis algorithm;

A fourth step of assigning a matching score to each concept of the ontology using the analysis result obtained in the third step; and

A fifth step of selecting the ontology concept to be matched to the given word using the matching scores obtained in the fourth step.

The method of claim 1, wherein each node of the attributing graph of the second step represents a token extracted from the broader terms, and each edge represents the existence and frequency of a modification relation between a modifier and a head word.

The method of claim 1, wherein each node in the third step is scored using the HITS algorithm, which assigns a HITS authority score indicating how heavily a node (page) is referenced (modified) by other nodes and a HITS hub score indicating how heavily a node references (modifies) other nodes.

The method of claim 1, wherein each concept of the ontology in the fourth step is scored using the following equation, where c is the target concept, h is the head word of the term denoting the target concept, and A is the set of tokens in that term, excluding modifiers.

The method of claim 1, wherein the structured data of the first step is a web document, including Wikipedia.

A system for matching a word to an ontology by means of broader terms, comprising the following components:

A first processing unit for extracting the broader terms of a given word using the category structure of a web document;

A second processing unit for deriving an attributing graph from the broader terms obtained by the first processing unit;

A third processing unit for analyzing the attributing graph obtained by the second processing unit using the HITS link analysis algorithm;

A fourth processing unit for assigning a matching score to each concept of the ontology using the analysis result obtained by the third processing unit; and

A fifth processing unit for selecting the ontology concept to be matched to the given word using the matching scores obtained by the fourth processing unit.