KR20060122276A

KR20060122276A - Relation extraction from documents for the automatic construction of ontologies

Info

Publication number: KR20060122276A
Application number: KR1020050044610A
Authority: KR
Inventors: 김민구
Original assignee: 주식회사 다음기술
Priority date: 2005-05-26
Filing date: 2005-05-26
Publication date: 2006-11-30

Abstract

A method is provided for extracting relations between concepts from documents for automatic ontology construction. Ontology construction normally requires high cost and much time; the method automates it by extracting concept relations from a document set with a hybrid approach that combines relation information obtained by a symbolic method with relation information obtained by a statistical method. The symbolic method automatically extracts and applies generalized syntactic patterns. The statistical method clusters association information to build association pattern information and assigns a name to each relation based on it. The hybrid method combines the results of the two.

Description

Relation extraction from documents for the automatic construction of ontologies

FIG. 1 is a diagram illustrating the hybrid relation extraction according to the present invention, which combines a symbolic method and a statistical method.

FIG. 2 is a flowchart of the symbolic method used in the present invention, showing the flow of extracting concepts from documents.

FIG. 3 is a flowchart of the statistical method used in the present invention, showing the flow of extracting concepts from documents.

<Description of reference numerals for the main parts of the drawings>

1: Documents - specialized documents in the field for which the ontology is to be built

2: Candidate concept list - candidate concepts used in ontology construction

10: Symbolic method module

20: Statistical method module

30: Hybrid method module

31: Ontology

101: Seed concepts - examples of the relation, selected by an expert

110: Document preprocessing module - preprocesses the documents using natural language processing techniques

120: Seed concept occurrence context extraction module

121: Tagged context - context information extracted by module 120

130: Generalized syntactic pattern generation module

140: New seed generation module

210: Preprocessing module - reads the documents and prepares the information used by the statistical method

211: Word table - information on the words that occur in the documents

212: Sentence information - the documents output sentence by sentence

220: Context extraction module - extracts context information that contains candidate concepts

221: Context

230: Association rule extraction module - generates association rules between candidate concepts

231: Association rules

240: Pattern clustering module

241: Association pattern information

250: Relation naming module

251: Extracted relation information

The present invention relates to a technique for automatically constructing an ontology from documents, and more particularly to a technique for automatically extracting the relations between concepts. In Korea there is technology for automatically constructing an ontology from an encyclopedia, but none for automatically extracting an ontology from specialized documents. The present invention builds on techniques in this field that are being developed abroad. Two main approaches to automatically extracting relations from documents have been studied so far: the symbolic approach and the statistical approach.

The symbolic approach to building concepts from documents uses grammatical elements such as sentence grammar and patterns that carry a specific meaning. For example, given a phrase such as 'dogs and other animals' in a document, it assumes that 'dog' denotes a sub-concept of 'animal', turns this into the generalized pattern 'As and other Bs', and uses the pattern to find hypernym-hyponym relations between the instances of A and B that it matches.

NP₀ such as {NP, NP, ..., (and|or)} NP -> hyponym(NP_s, NP₀)

such NP₀ as {NP,}* {(or|and)} NP -> hyponym(NP_s, NP₀)

NP {, NP}* {,} (or|and) other NP₀ -> hyponym(NP_s, NP₀)

NP₀ {,} including {NP,}* {(or|and)} NP -> hyponym(NP_s, NP₀)

NP₀ {,} especially {NP,}* {(or|and)} NP -> hyponym(NP_s, NP₀)

Hearst's work defined six patterns for extracting hyponym relations from documents, as shown above (the 'and other' and 'or other' patterns are written as a single line), and integrated the results with WordNet. However, pattern-based relation extraction has the drawback that the expected relations are hard to find, because language is used in so many different ways.
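As an illustration of how such a pattern can be applied, the following minimal sketch implements only the 'and/or other' pattern with a regular expression. It assumes, as a simplification not found in the source, that a noun phrase is a single alphabetic word; the function name `hyponyms_and_other` is hypothetical.

```python
import re

# Simplifying assumption: a noun phrase (NP) is approximated by one word.
NP = r"[A-Za-z]+"

# "NP {, NP}* {,} (and|or) other NP0"  ->  hyponym(NP, NP0)
AND_OTHER = re.compile(rf"((?:{NP}, )*{NP}),? (?:and|or) other ({NP})")


def hyponyms_and_other(sentence):
    """Return (hyponym, hypernym) pairs matched by the 'and/or other' pattern."""
    pairs = []
    for match in AND_OTHER.finditer(sentence):
        hypernym = match.group(2)
        # group(1) holds the comma-separated list of hyponym candidates.
        for hyponym in match.group(1).split(", "):
            pairs.append((hyponym, hypernym))
    return pairs
```

A real system would match over chunked noun phrases rather than single words, which is why part-of-speech tagging precedes pattern matching in approaches of this kind.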

The Snowball system, developed at Columbia University in the United States, presents a method for extracting a specific relation from documents. The system takes as examples (seeds) concept pairs that illustrate the relation the user is interested in, and extracts the context that lies between the two concepts in the documents where a pair appears. It clusters the extracted contexts to find patterns, and then uses those patterns to extract further concept pairs for the relation. The output of Snowball is the set of concept pairs that make up the relation in the documents, together with the patterns that connect them, which can serve as important information for building an ontology. However, the accuracy of the concept pairs chosen as input strongly affects the results, and the classification of each concept must be unambiguous. For this reason Snowball uses a named entity tagger for person names, organizations and locations, and extracts relations only between clearly distinguishable concepts or entities. The system is therefore extremely limited in use, and it has the drawback that only one relation can be extracted for a given concept pair. In addition, the patterns that Snowball generates for a relation are fixed patterns, which limits how well they apply to documents containing many variations.

The statistical approach to building concepts from documents extracts concepts and relations from the statistical distribution of the document set and builds relations on that basis. The statistical distribution of a document set includes word frequency, document frequency, and dependencies between words. A term-weighting scheme is applied to process this distribution, and association information is extracted with clustering and data mining techniques.

The KAON system, developed by the University of Karlsruhe and the FZI research center in Germany, is a tool that assists in generating an ontology from documents and uses a statistical approach to discover relations between the concepts of the ontology. Strictly speaking, KAON uses both approaches to ontology construction, but this document focuses on its function of proposing relations between concepts. KAON builds an ontology as follows. First, it processes the documents with a shallow parser and extracts noun phrases of one or more words to be used as concepts. Next, it finds generalized association rules between the selected concepts and proposes possible relations between them. Finally, if an ontology already exists, it extends that ontology by adding the discovered relations; otherwise it builds an ontology from the concepts and relations. Rather than extracting relations directly, KAON proposes candidate relations between concepts and presents the support and confidence of the association rule in place of a relation name, leaving to the user both the decision whether to define a candidate as a relation and its naming. Generating an ontology with this system requires human participation in much of the creation and maintenance work, and the system provides no intuitive information that explains the extracted relations. In other words, generating an ontology with KAON demands considerable human effort and time. In addition, the association information found by the statistical method is not developed into relations; the system provides only the degree of association between concepts.

The present invention is intended to solve these problems. In the symbolic method, it focuses on automatically building syntactic pattern information, which removes the difficulty of having an administrator specify the patterns. Because syntactic pattern information built in this way may fail to cover the many variant constructions that occur in the documents of the target field, the pattern information is generalized so that more relations can be found.

In the statistical method, to overcome the limitation of existing techniques that provide only the degree of association, the invention builds association pattern information and clusters it so that similar relations are grouped and presented as group information. It also proposes a relation name for each group to give the relations more meaning.

It is an object of the present invention to provide a hybrid method that combines the ontology relation information extracted by the symbolic method with the relation information extracted by the statistical method.

In this specification, a document means any of the various specialized text documents in the field for which an ontology is to be built.

A further object of the present invention is to make it possible to build ontologies automatically in any field rather than only in one specific field.

To achieve the above objects, the present invention consists broadly of a symbolic approach and a statistical approach, and provides a method of combining their results. As shown in FIG. 1, the specialized documents (1) of the field for which the ontology is to be built and the list of candidate concepts (2) that may be used in the ontology are taken as input. They pass through the symbolic method module (10) and the statistical method module (20) respectively, the extracted results are combined by the hybrid method module (30), and the combined result is converted into an ontology (31), so that the ontology is built automatically.

The symbolic approach builds on the existing Snowball system and aims to solve the problems described above. FIG. 2 is a flowchart of the symbolic approach, which consists mainly of a document preprocessing module (110), a seed concept occurrence context extraction module (120), a generalized syntactic pattern generation module (130), and a new seed generation module (140).

The document preprocessing module (110) takes the specialized documents (1) as input, extracts the sentence information of each document and the part-of-speech tag of each word, and passes them to the seed concept occurrence context extraction module (120). The well-known Brill tagger was used for part-of-speech tagging.

The seed concept occurrence context extraction module (120) extracts the contexts in which seeds occur in the documents, based on the output of the document preprocessing module (110) and the seed concepts (101) specified by the administrator. A sentence is judged to carry meaningful information, and is extracted as an occurrence context, only when two or more seeds appear in it together; the underlying assumption is that words appearing in the same sentence stand in some relation to each other. Only the information judged to be a meaningful occurrence context is output separately as tagged context (121) and passed to the generalized syntactic pattern generation module (130).
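The co-occurrence filter described above can be sketched as follows. This is a minimal illustration, assuming that sentences are already tokenized and that every seed concept is a single token; the function name `extract_seed_contexts` and the data layout are hypothetical, not taken from the source.

```python
def extract_seed_contexts(sentences, seeds):
    """Keep only the sentences in which at least two distinct seeds co-occur.

    sentences: list of token lists (one list per sentence).
    seeds:     set of seed concepts (assumed to be single tokens here).
    Returns a list of (seed_tuple, tokens) pairs.
    """
    contexts = []
    for tokens in sentences:
        found = {tok for tok in tokens if tok in seeds}
        if len(found) >= 2:
            # Sort the seeds so that the key is stable across sentences.
            contexts.append((tuple(sorted(found)), tokens))
    return contexts
```

Sentences with fewer than two seeds are dropped, which mirrors the rule that only co-occurring seeds count as a meaningful occurrence context.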

The generalized syntactic pattern generation module (130) generates generalized patterns from the tagged context (121). A pattern is expressed as a 5-tuple that holds the following five pieces of information:

t = <lv, concept1, mv, concept2, rv>

Here t denotes the tuple, and concept1 and concept2 are candidate concepts; the pattern describes the information that relates the two. lv, mv and rv are vectors of words: lv is built from the words that appear to the left of the two concepts, mv from the words that appear between them, and rv from the words that appear to their right. To cover the many forms that can occur in documents, the patterns are generalized while these vectors are built. The generalization rules are given in the table below.

| Word / part-of-speech tag | Category | Example |
| is, are, am, was, were | BE | is -> [BE] |
| Noun, noun phrase | No change | Kodak -> Kodak |
| Adjective | ADJ | pretty women -> [ADJ] women |
| Adverb | Removed | well -> (removed) |
| Article, definite article | DT | the -> [DT] |

Following the generalization rules above, the generalized syntactic pattern generation module (130) generates syntactic patterns and passes them to the new seed generation module (140).
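The generalization table can be read as a small rewriting function over tagged tokens. The sketch below assumes Penn Treebank style tags (`JJ` for adjectives, `RB` for adverbs, `DT` for articles), which is an assumption about the tagger output rather than something the source states, and it leaves every word not covered by the table unchanged.

```python
BE_FORMS = {"is", "are", "am", "was", "were"}


def generalize(tagged_tokens):
    """Apply the generalization table to a list of (word, pos_tag) pairs."""
    out = []
    for word, tag in tagged_tokens:
        if word.lower() in BE_FORMS:
            out.append("[BE]")       # forms of "be" collapse to one symbol
        elif tag.startswith("JJ"):
            out.append("[ADJ]")      # adjectives are replaced by a category
        elif tag.startswith("RB"):
            continue                 # adverbs are removed
        elif tag == "DT":
            out.append("[DT]")       # articles are replaced by a category
        else:
            out.append(word)         # nouns (and, by assumption, the rest) stay
    return out
```

For instance, the tagged phrase "the pretty women" becomes `[DT] [ADJ] women`, so patterns that differ only in the adjective or article collapse into one.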

The new seed generation module (140) receives the syntactic patterns from the previous module and clusters them with a single-pass clustering algorithm. The similarity used for clustering compares the pattern information of ts with that of tp and is given by the value of Match(ts, tp).

If the similarity value exceeds a given threshold, the two patterns are placed in the same cluster. For each cluster produced in this way, a representative syntactic pattern is extracted by summing the vectors of its members. The documents are then searched again with the representative patterns, and the concepts that match a representative pattern are extracted. The concepts obtained in this way are the concept pairs for the relation that the invention seeks. They are merged with the existing seed concepts, and the symbolic approach is repeated to extract further relation information.
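The Match(ts, tp) formula itself is not reproduced in this text. The sketch below is therefore only an assumed stand-in, modeled loosely on Snowball-style matching: it averages the cosine similarity of the lv, mv and rv vectors of two patterns. The dictionary layout and the names `cosine` and `match` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def match(ts, tp):
    """Assumed similarity: mean cosine of the lv, mv and rv vectors.

    ts, tp: dicts with keys "lv", "mv", "rv", each a {term: weight} vector.
    """
    return sum(cosine(ts[k], tp[k]) for k in ("lv", "mv", "rv")) / 3.0
```

Under this assumption, two patterns whose `match` value exceeds the chosen threshold would be assigned to the same cluster.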

The statistical approach consists of a preprocessor (210), a context extractor (220), an association rule extractor (230), a pattern clustering module (240), and a relation naming module (250). To handle large volumes of documents efficiently, the preprocessor (210) splits each document into sentences and represents each sentence as a sequence of word IDs. To improve clustering performance, the words extracted from the documents are reduced to their stems with KStemmer, and a B+-tree is used to maintain the mapping between words and word IDs.
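The sentence-to-ID encoding can be sketched in a few lines. This is a simplified illustration: it uses a plain dictionary where the text uses a B+-tree, omits stemming (KStemmer in the text), and relies on a naive punctuation-based sentence splitter. The function name `encode_sentences` is hypothetical.

```python
import re


def encode_sentences(text):
    """Split text into sentences and encode each one as a list of word IDs."""
    word_ids = {}   # word -> ID; a dict stands in for the B+-tree of the text
    encoded = []
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        ids = []
        for word in re.findall(r"[a-z0-9]+", sentence.lower()):
            # Stemming is omitted in this sketch; new words get the next ID.
            ids.append(word_ids.setdefault(word, len(word_ids)))
        if ids:
            encoded.append(ids)
    return word_ids, encoded
```

Working on integer IDs rather than strings keeps later steps such as concept lookup and rule counting cheap on large document sets.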

Before extracting contexts, the context extractor (220) builds a concept index structure. The structure uses the ID of a concept's first word as the index, and for each index it keeps the number of concepts that begin with that word and the list of their IDs, which shortens the time needed to look up the concepts that occur in a sentence. In addition, to give priority to concepts that consist of more words, the concept ID list under each index is sorted in ascending order of the number of words in the concept.
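A concept index of this kind might look like the sketch below. For readability it indexes by the first word itself rather than by a word ID, and it represents concepts as word tuples; both are simplifications. Each bucket is kept in ascending order of word count as the text states, and the lookup walks the bucket from its long end so that longer concepts are tried first, which is an assumption about how the ordering is used.

```python
from collections import defaultdict


def build_concept_index(concepts):
    """Map the first word of each concept to the concepts that start with it.

    concepts: list of word tuples, e.g. ("digital", "camera").
    """
    index = defaultdict(list)
    for concept in concepts:
        index[concept[0]].append(concept)
    for bucket in index.values():
        bucket.sort(key=len)   # ascending number of words
    return index


def find_concepts(tokens, index):
    """Scan a sentence and return (position, concept) for each match."""
    found, i = [], 0
    while i < len(tokens):
        # Assumption: try the longest candidates first so they take priority.
        for concept in reversed(index.get(tokens[i], [])):
            if tuple(tokens[i:i + len(concept)]) == concept:
                found.append((i, concept))
                i += len(concept) - 1   # skip the words already consumed
                break
        i += 1
    return found
```

Only the concepts sharing the current word as their first word are compared, so the cost per token depends on the bucket size rather than on the full concept list.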

The association rule extractor (230) finds association rules made up of two concepts in the context information (221) produced by the context extractor (220). The rules found in this step represent relations between the important concepts that the document set describes. The apriori algorithm proposed for mining association rules aims to find associations between products in sets of purchased items, so it allows several items in both the condition and the result part of a rule. Because the present invention looks only for rules made up of two concepts, it uses a simplified version of the apriori algorithm.
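Restricting rules to two concepts reduces the mining step to counting single concepts and concept pairs. The following sketch shows that simplified form with the usual definitions of support and confidence; the thresholds, the function name `pair_rules`, and the representation of a context as a set of concepts are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(contexts, min_support, min_confidence):
    """Find rules A -> B between single concepts.

    contexts: list of sets of concepts that co-occur in one context.
    Returns {(A, B): (support, confidence)}.
    """
    n = len(contexts)
    single = Counter(c for ctx in contexts for c in ctx)
    pairs = Counter(
        pair for ctx in contexts for pair in combinations(sorted(ctx), 2)
    )
    rules = {}
    for (a, b), count in pairs.items():
        support = count / n                  # share of contexts with both
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):        # test the rule in both directions
            confidence = count / single[x]   # P(y in context | x in context)
            if confidence >= min_confidence:
                rules[(x, y)] = (support, confidence)
    return rules
```

Because no itemset larger than two is ever generated, the candidate-growing loop of the full apriori algorithm is not needed.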

The pattern clustering module (240) uses clustering to find, from the contexts and association rules produced in the previous steps, patterns that describe the relation between two concepts. A pattern here means the centroid of a cluster. The invention uses a simple single-pass clustering algorithm. Every cluster produced by this process contains one or more contexts, but a cluster must be objective to serve as a pattern that explains the relation between two concepts. Clusters are therefore filtered: a cluster is removed if the number of contexts it contains is below a given threshold. Clustering the contexts of each association rule yields several patterns that describe the relation between the two concepts, which can compensate for the restriction of the existing Snowball system that only a single relation exists between two concepts. The association pattern information (241) output by the pattern clustering module (240) groups related association patterns together and assigns each group a relation name of the form "unnamed relation group 1".
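Single-pass clustering with centroid patterns and a size filter can be sketched as below. The similarity function is passed in as a parameter because the source does not fix one here, and updating the centroid as the running mean of the member vectors is an assumption; the name `single_pass_cluster` is hypothetical.

```python
def single_pass_cluster(vectors, similarity, threshold, min_size):
    """Single-pass clustering followed by removal of small clusters.

    vectors:    list of {term: weight} dicts, one per context.
    similarity: function taking two vectors and returning a number.
    """
    clusters = []   # each cluster: {"centroid": dict, "members": [index, ...]}
    for idx, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = similarity(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # No cluster is similar enough: the vector starts a new cluster.
            clusters.append({"centroid": dict(vec), "members": [idx]})
        else:
            # Assumed update rule: centroid is the running mean of members.
            k = len(best["members"])
            terms = set(best["centroid"]) | set(vec)
            best["centroid"] = {
                t: (best["centroid"].get(t, 0.0) * k + vec.get(t, 0.0)) / (k + 1)
                for t in terms
            }
            best["members"].append(idx)
    # Filtering step: drop clusters with too few contexts.
    return [c for c in clusters if len(c["members"]) >= min_size]
```

Each vector is visited once, so the result depends on input order, which is the usual trade-off of single-pass clustering against its low cost.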

The relation naming module (250) names the relation between two concepts using the patterns and contexts produced by the steps described above. Each pattern found by the pattern clustering module (240) is specific, describing only the relation between one pair of concepts, so many unimportant relations may be generated. To address this, before naming, the relation naming module (250) clusters the patterns to find generalized patterns. This clustering helps to find a general relation for patterns generated from similar contexts. Once the clustering is complete, a name is assigned to the relation that each cluster represents. The module uses the centroid of the cluster to compute a score for each context that belongs to each pattern, and the context with the highest score is used as the name of the relation between the two concepts. Because it uses both the weights of the words in the general pattern and the word sequences of the contexts the pattern contains, this is a good way to provide intuitive information about the relation between the two concepts.
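The naming step can be illustrated with a small scoring function. The source does not give the scoring formula, so the sketch assumes one: the score of a context is the average centroid weight of its words, and the highest-scoring word sequence becomes the relation name. The function name `name_relation` is hypothetical.

```python
def name_relation(centroid, contexts):
    """Pick the context whose words score highest against the centroid.

    centroid: {word: weight} for the general pattern of a cluster.
    contexts: non-empty list of word lists found between two concepts.
    """
    def score(words):
        # Assumed scoring: average centroid weight of the words in the context.
        if not words:
            return 0.0
        return sum(centroid.get(w, 0.0) for w in words) / len(words)

    best = max(contexts, key=score)
    return " ".join(best)
```

Averaging rather than summing keeps long contexts from winning merely because they contain more words; a different normalization would be an equally plausible reading of the text.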

The hybrid method focuses on combining the relation information extracted by the symbolic approach and by the statistical approach. It is customary to give more weight to symbolic results, which incorporate human epistemological knowledge, than to statistical results, and the present invention does the same: the statistical results are merged into the symbolic results, which serve as the base. Unlike the symbolic results, the statistical results do not have fixed relation names; they only propose a candidate relation name for each group. Therefore, for each group in the statistical results, the concepts that belong to the group are compared with the symbolic results, and only when the concepts of a group are highly similar to those of a particular relation in the symbolic results is the group given that relation's name. Groups that match no relation are given a name of the form "unnamed relation group 1", and the candidate group name proposed by the statistical method is recorded alongside it. The final ontology is built in this way.
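The combination rule can be sketched as a labeling pass over the statistical groups. The source does not say how similarity between a group and a symbolic relation is measured, so this sketch assumes Jaccard overlap of concept pairs; the function name `label_groups`, the threshold `min_overlap`, and the data layout are hypothetical.

```python
def label_groups(symbolic, groups, min_overlap):
    """Name statistical groups after the most similar symbolic relation.

    symbolic: {relation_name: set of concept pairs} from the symbolic method.
    groups:   list of sets of concept pairs from the statistical method.
    """
    labeled = {}
    unnamed = 0
    for pairs in groups:
        best_name, best_sim = None, 0.0
        for name, known in symbolic.items():
            union = pairs | known
            # Assumed similarity: Jaccard overlap of the concept pairs.
            sim = len(pairs & known) / len(union) if union else 0.0
            if sim > best_sim:
                best_name, best_sim = name, sim
        if best_name is None or best_sim < min_overlap:
            # No symbolic relation is similar enough: keep a placeholder name.
            unnamed += 1
            best_name = f"unnamed relation group {unnamed}"
        labeled.setdefault(best_name, set()).update(pairs)
    return labeled
```

Groups that adopt a symbolic name also contribute their extra concept pairs to that relation, which is one way the statistical results can extend the symbolic base.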

The spread of the web that followed the growth of the Internet has brought machines into active use for sharing and searching information, but most information on the web was written to be understood by people. This has been a major obstacle to machines handling user requests. The Semantic Web was proposed as a solution, expressing information in a form that machines can understand. However, building and maintaining the ontologies that form the logical foundation of the Semantic Web requires a great deal of time and cost. The present invention therefore provides a method for extracting relations between concepts, one part of the ontology construction process, from a document set.

Furthermore, the present invention allows ontology construction, which has relied on manual work by experts and required high cost and much time, to be automated.

It is further expected to help hasten the arrival of the Semantic Web on the Internet.

Claims

A method for automatically constructing an ontology by automatically extracting ontology relations from documents, the method comprising:

a1) a symbolic method that automatically extracts and uses generalized syntactic patterns;

a2) a statistical method that clusters association information to build association pattern information and names relations based on it; and

a3) a hybrid method that combines the symbolic method and the statistical method.

The method of claim 1, wherein the method a1) extracts, from seeds for various relations, representative syntactic patterns that can represent those relations.

The method of claim 1, wherein the method a1) generalizes the syntactic patterns and applies the generalized syntactic patterns.

The method of claim 1, wherein the method a2) extracts association pattern information by clustering the association patterns of each association rule.

The method of claim 1, wherein the method a2) assigns names to relations using the association pattern information.

The method of claim 1, wherein the method a3) merges the statistical results into the symbolic results, which serve as the base.