KR101335540B1

KR101335540B1 - Method for classifying document by using ontology and apparatus therefor

Info

Publication number: KR101335540B1
Application number: KR1020110062387A
Authority: KR
Inventors: 김평; 정한민; 이미경; 이승우; 서동민; 김진형; 성원경
Original assignee: 한국과학기술정보연구원
Priority date: 2011-06-27
Filing date: 2011-06-27
Publication date: 2013-12-03
Also published as: WO2013002436A1; KR20130001552A

Abstract

Disclosed are an ontology-based document classification method, which uses the class of each named entity modeled in an ontology and the properties of that class for classification, and an apparatus suited to the method.
The ontology-based document classification method includes: an ontology modeling process of modeling an ontology by extracting and abstracting the types, properties, and property relations of entities that commonly occur in documents belonging to the same category, in order to express the characteristics of the documents in each category as an ontology; a named entity recognition process of recognizing named entities in a document to be classified; an entity relation extraction process of extracting relation information between named entities through sentence analysis across all categories; an instance filling process of comparing the named entities and relation properties extracted from the document with the ontology of each category and mapping them to entity-class and entity-property-entity; and a document field determination process of selecting the ontology best suited to the document to be classified in consideration of the filled classes, the relations between instances, and the weights of class properties, and determining the field of that ontology as the field of the document.

Description

Ontology-based document classification method and apparatus {Method for classifying document by using ontology and apparatus therefor}

The present invention relates to a document classification method and apparatus, and more particularly to an ontology-based document classification method and apparatus.

As the volume of information has surged, various techniques have been developed for searching and classifying information effectively. Most existing studies compute the similarity between a document and a category by comparing features selected through part-of-speech (POS) tagging of the documents, and use the result for classification. Search engines likewise use the selected features as index terms for documents, comparing feature values with document relevance to rank search results.

An ontology is a method of knowledge representation used to model various resources through classes and the relations between classes. Unlike general domains, civil complaint documents vary widely in the type of complainant, the content of the complaint, the type of party complained about, and the kinds of agencies involved, so accuracy is not high when general features are used for classification.

The present invention was devised to solve the above problem, and its object is to provide an ontology-based document classification method that uses the class of each named entity modeled in an ontology, and the properties of that class, for classification.

Another object of the present invention is to provide an apparatus suited to the above document classification method.

To achieve the above object, an ontology-based document classification method according to the present invention includes:

an ontology modeling process of modeling an ontology by extracting and abstracting the types, properties, and property relations of entities that commonly occur in documents belonging to the same category, in order to express the characteristics of the documents in each category as an ontology;

a named entity recognition process of recognizing named entities in a document to be classified;

an entity relation extraction process of extracting relation information between named entities through sentence analysis across all categories;

an instance filling process of comparing the named entities and relation properties extracted from the document with the ontology of each category and mapping them to entity-class and entity-property-entity; and

a document field determination process of selecting the ontology best suited to the document to be classified in consideration of the filled classes, the relations between instances, and the weights of class properties, and determining the field of that ontology as the field of the document to be classified.

Here, the document classification method according to the present invention preferably further includes a representative document determination process of determining, among the documents assigned to each field, the document containing the most entities and inter-entity relations that make up the ontology as the representative document.

The document classification method according to the present invention preferably further includes a process of filtering documents by keyword before the named entity recognition process.

Here, the entity relation extraction process preferably extracts entities and the relations between them centering on the requirements expressed in the document.

Here, the entity relation extraction process preferably refers to general-purpose sentence patterns for extracting sentences that express requirements.

The document classification process preferably includes:

a process of calculating a similarity for each ontology based on the types and number of filled classes, the weights of the key properties filled in each class, and the weights of the relations between instances; and

a process of comparing the calculated per-ontology similarities and selecting the ontology with the highest similarity.

The modeled ontologies are preferably designed under the following conditions:

- distinguishing elements for each field/topic/ontology can be derived;

- ontology modeling in which classes/properties are shared is avoided as far as possible; and

- all ontologies can be linked through a top-level ontology.

To achieve the other object, an ontology-based document classification apparatus according to the present invention includes:

a named entity extraction module that reads a document to be classified and extracts the named entities contained in the document;

an entity relation extraction module that extracts relations between the entities extracted by the named entity extraction module;

an instance filling module that compares the named entities and inter-entity relation properties extracted by the named entity extraction module and the entity relation extraction module with the ontologies modeled for each field, and maps them to entity-class and entity-property-entity;

a database that stores the per-field ontologies referenced by the instance filling module; and

a document field determination module that calculates the similarity between the document and each per-field ontology by referring to the results filled by the instance filling module for the per-field ontologies, and outputs the field of the ontology with the greatest similarity as the field of the document.

Here, the ontology-based document classification apparatus according to the present invention preferably further includes an ontology modeling module that models the per-field ontologies based on keywords, features selected through part-of-speech tagging, a thesaurus, and named entities and relation information extracted from sampled documents.

Various studies have classified documents as a way to keep management efficient as the number of documents surges, and various features representing documents have been used for classification.

The present invention proposes a method that expresses, for each category, the structure of the documents belonging to that category as an ontology, extracts entities and inter-entity relation properties from a document, and determines the category by comparison with the ontology structures.

Once the structure of a document is expressed as an ontology, it can be used not only for classification but also for improving search methods that use inter-entity relation properties and hierarchical relation properties, and for selecting a representative document for each category.

As described above, the ontology-based document classification apparatus and method according to the present invention improved document classification accuracy by 10 to 15% by using ontology-based structural information of documents for classification.

FIG. 1 is a flowchart showing the ontology-based document classification method according to the present invention.
FIG. 2 shows an example of a modeled ontology, applied to the national pension field.
FIG. 3 shows an example of use of the ontology-based document classification apparatus according to the present invention.
FIG. 4 shows the detailed configuration of the document classification apparatus according to the present invention shown in FIG. 3.
FIG. 5 illustrates the results filled by the instance filling module.
FIG. 6 shows an example of a civil complaint document in the national pension field.
FIG. 7 shows the main concepts/relations derived from the complaint document of FIG. 6, centering on its requirements.
FIG. 8 shows an example of the ontology built for the national pension field.
FIG. 9 shows an example of the ontology built for the fair trade field.

Hereinafter, the configuration and operation of the present invention are described in detail with reference to the accompanying drawings.

The present invention complements existing classification, which computes the similarity between a document and a category using the terms appearing in the document as features, by using the class of each named entity modeled in an ontology and the properties of that class for classification.

FIG. 1 is a flowchart showing the ontology-based document classification method according to the present invention. Referring to FIG. 1, the document classification method according to the present invention includes a per-field ontology modeling process (s102), a named entity recognition process (s104), an entity relation extraction process (s106), an instance filling process (s108), a field determination process (s110), and a representative document selection process (s112).

The ontology modeling process (s102) extracts and abstracts the types, properties, and property relations of entities that commonly occur in documents belonging to the same field, in order to express the characteristics of the documents in each field as an ontology.

Ontology refers to the field and techniques that deal with human knowledge. In particular, an ontology as a computer-based knowledge representation is "an explicit specification of a conceptualization," that is, something that "describes the knowledge of a field explicitly and logically so that it can be processed by a computer, and enables that knowledge to be shared and reused." In an ontology, a class groups resources having the same nature and provides a means of logically expressing their common nature. The nature of a class can be expressed by specifying conditions on the properties the class has. An instance, meanwhile, is an individual belonging to a concept.

Ontology construction is achieved by analyzing patterns and components in sampled documents from each field and modeling the ontology based on the results of that analysis.

FIG. 2 shows an example of a modeled ontology, applied to the national pension field.

It is important to design the ontology for each field so that it differs from the ontologies of other fields in the types and number of classes and in the types and number of property relations. For example, in the national pension field, classes and properties related to pensions should be reflected in the ontology; in the inter-company fair trade field, classes and properties related to transactions between companies should be reflected.

Once the ontologies are built, they are applied to the document to be classified to analyze its field.

The named entity recognition process (s104) recognizes named entities in the document to be classified.

Identifying the classes that make up the ontology and the instances of those classes is very important. For various entities such as person names, organization names, insurance names, and law names, the entity type is determined during named entity recognition, and authority data and thesaurus information for the entity names are used together to improve recognition accuracy.

Authority data is data on the variant forms that refer to the same entity. Authority data basically has the same record structure as bibliographic data. In the data fields, the various forms of the same personal name, series title, or subject heading are formed into one group, and a representative heading is then selected from among them according to cataloging rules.
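The grouping of variant forms under one representative heading can be sketched as a lookup table. This is only an illustration: the records, the variant spellings, and the function names `build_variant_index` and `normalize_entity` are hypothetical and do not come from the patent.

```python
# Hypothetical authority records: one preferred heading per group of variant forms.
AUTHORITY_RECORDS = {
    "National Pension Service": ["NPS", "National Pension Corporation"],
    "Fair Trade Commission": ["FTC", "KFTC"],
}


def build_variant_index(records):
    """Map every variant form (and the heading itself) to its preferred heading."""
    index = {}
    for heading, variants in records.items():
        index[heading.lower()] = heading
        for variant in variants:
            index[variant.lower()] = heading
    return index


def normalize_entity(name, index):
    """Return the preferred heading for a name, or the name unchanged if unknown."""
    return index.get(name.strip().lower(), name)


VARIANT_INDEX = build_variant_index(AUTHORITY_RECORDS)
print(normalize_entity("nps", VARIANT_INDEX))
```

Normalizing names this way before recognition means that different spellings of one organization count as the same entity in later steps.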

A thesaurus is a vocabulary tool that provides information on the usage of terms and the relations between them. Term relations are generally classified into broader term, narrower term, use-for or synonym, related term, use (preferred term), and so on; a thesaurus is mainly used to broaden the meaning of the terms in a query during searching by means of these relations.
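As a rough illustration of how thesaurus relations can broaden a query, the sketch below expands each query term with its synonyms and narrower terms. The thesaurus entries, the relation keys, and the choice of which relations to follow are all assumptions made for the example.

```python
# Hypothetical thesaurus: term -> relation type -> related terms.
THESAURUS = {
    "pension": {
        "broader": ["social insurance"],
        "narrower": ["old-age pension"],
        "synonym": ["annuity"],
    },
}


def expand_query(terms, thesaurus, relations=("synonym", "narrower")):
    """Return the query terms followed by related terms, without duplicates."""
    expanded = []
    for term in terms:
        candidates = [term]
        for relation in relations:
            candidates.extend(thesaurus.get(term, {}).get(relation, []))
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["pension"], THESAURUS))
```

Following broader terms as well would widen recall further, at the cost of matching documents from neighboring fields.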

As a step preceding the named entity recognition process (s104), keyword-based primary filtering may be performed on the documents. Positive/negative keywords may be used for this. Here, a positive keyword is for retrieving documents that contain the keyword, and a negative keyword is for retrieving documents that do not contain the keyword.
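The positive/negative keyword filter can be sketched as follows. The patent does not say how several keywords combine, so requiring every positive keyword and excluding any negative keyword is an assumption, as are the sample documents and the function name `passes_keyword_filter`.

```python
def passes_keyword_filter(text, positive, negative):
    """Keep a document that has every positive keyword and no negative keyword.

    The all/none combination rule is an assumption, not taken from the patent.
    """
    has_all_positive = all(keyword in text for keyword in positive)
    has_no_negative = not any(keyword in text for keyword in negative)
    return has_all_positive and has_no_negative


# Hypothetical sample documents.
DOCUMENTS = {
    "doc1": "national pension payment is overdue",
    "doc2": "national pension advertisement",
    "doc3": "fair trade dispute between companies",
}

kept = [
    doc_id
    for doc_id, text in DOCUMENTS.items()
    if passes_keyword_filter(text, ["national pension"], ["advertisement"])
]
print(kept)
```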

The entity relation extraction process (s106) extracts relation information between named entities through sentence analysis across all fields. To extract entity relations, relation information is extracted through sentence analysis centered on the named entities recognized in the previous step. This process generates information on the various relations between entities (person name-relation name-person name, person name-relation name-organization name), considering domain and range and centering on the property relations between classes. Because the field of the document has not yet been determined at this stage, every relation that can be extracted for all property names of all fields must be extracted from the document.
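The role of domain and range in forming relation triples can be sketched as a type check. The property names, the schema, and the `Entity` type below are hypothetical; the sentence analysis that proposes candidate relation names is outside the sketch.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    text: str
    etype: str  # entity type assigned by named entity recognition


# Hypothetical schema pooled from all fields: property -> (domain type, range type).
PROPERTY_SCHEMA = {
    "is_child_of": ("Person", "Person"),
    "works_for": ("Person", "Organization"),
}


def candidate_triples(subject, obj, relation_names):
    """Keep only the relations whose domain and range match the entity types."""
    triples = []
    for name in relation_names:
        schema = PROPERTY_SCHEMA.get(name)
        if schema is None:
            continue
        domain, range_ = schema
        if subject.etype == domain and obj.etype == range_:
            triples.append((subject.text, name, obj.text))
    return triples


hong = Entity("Hong Gil-dong", "Person")
nps = Entity("National Pension Service", "Organization")
print(candidate_triples(hong, nps, ["is_child_of", "works_for"]))
```

Because the document's field is still unknown at this step, the schema here is a single pool rather than one table per field.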

The instance filling process (s108) compares the named entities and relation properties extracted from the document with the ontology of each field and maps them to entity-class and entity-property-entity.

To classify civil complaints, in addition to PLO (Person, Location, Organization) tagging through general named entity recognition, it is necessary to tag insurance, pensions, transaction objects, and the like, and to derive the type or additional properties of each organization. For example, in complaints related to medical accidents or health insurance, names of medical facilities, treatments, and diseases appear as named entities in the complaint document, so among organizations the entity must be filled into the medical institution class. Also, in the inter-company fair trade field, complaint disputes arise between companies, so the company-to-company entity relation must be filled as an instance of the corresponding class.
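A minimal sketch of instance filling, assuming each ontology class declares which entity types it accepts. The class names, entity types, and the two small ontologies are hypothetical stand-ins for the per-field ontologies.

```python
# Hypothetical per-field ontologies: class name -> entity types it accepts.
ONTOLOGIES = {
    "national_pension": {
        "Complainant": {"Person"},
        "PensionInfo": {"Pension"},
        "Company": {"Organization"},
    },
    "fair_trade": {
        "Company": {"Organization"},
        "Transaction": {"TransactionObject"},
    },
}


def fill_instances(entities, ontology):
    """Return {class name: [entity texts]} for classes that received an entity."""
    filled = {}
    for text, etype in entities:
        for class_name, accepted_types in ontology.items():
            if etype in accepted_types:
                filled.setdefault(class_name, []).append(text)
    return filled


entities = [("Hong Gil-dong", "Person"), ("old-age pension", "Pension")]
for field, ontology in ONTOLOGIES.items():
    print(field, fill_instances(entities, ontology))
```

The same extracted entities fill two classes of the pension ontology and none of the fair trade ontology, which is the difference the later field determination relies on.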

After the instance filling process (s108) is performed with the named entities and entity relations extracted from the document, the document field determination process (s110) selects the most suitable ontology in consideration of the filled classes, the relations between instances, and the weights of class properties, and determines the field of that ontology as the field of the document to be classified.

Specifically, the final similarity between the document and an ontology is determined in consideration of the types of classes filled with instances, the weights of the key properties of each class, the relations between instances, the weights of the key relations, and so on. In the embodiment of the present invention, all weights for ontology properties were set equal, but the per-property weights may also be adjusted by analyzing the ontology-based classification results.
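The patent does not give a formula for the similarity, so the following is only one plausible reading: a weighted sum over filled classes and filled relations, with every weight defaulting to 1.0 to mirror the equal-weight embodiment. The class and relation names and the function names are hypothetical.

```python
def ontology_similarity(filled_classes, filled_relations,
                        class_weights=None, relation_weights=None):
    """Weighted count of filled classes and relations (assumed formula)."""
    class_weights = class_weights or {}
    relation_weights = relation_weights or {}
    class_score = sum(class_weights.get(name, 1.0) for name in filled_classes)
    relation_score = sum(relation_weights.get(name, 1.0) for name in filled_relations)
    return class_score + relation_score


def best_field(fill_results, class_weights=None, relation_weights=None):
    """Score every field's ontology and return (best field, all scores)."""
    scores = {
        field: ontology_similarity(classes, relations, class_weights, relation_weights)
        for field, (classes, relations) in fill_results.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical filling results: field -> (filled classes, filled relations).
fill_results = {
    "national_pension": ({"Complainant", "PensionInfo", "PersonInCharge"},
                         {"requests_payment"}),
    "fair_trade": ({"Company"}, set()),
}
field, scores = best_field(fill_results)
print(field, scores)
```

Passing a `class_weights` mapping with values other than 1.0 corresponds to the per-property weight adjustment the text mentions.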

The per-field representative document determination process (s112) selects, among the documents assigned to each field, the document with the most entities and inter-entity relations that make up the ontology as the representative document. In the embodiment of the present invention the property name weights were applied equally, but it is also possible to assign different weights to the properties that are important in determining the representative document for each field.

FIG. 3 shows an example of use of the ontology-based document classification apparatus according to the present invention. Referring to FIG. 3, per-field ontologies (308) are modeled based on a keyword and feature dictionary (302) holding keywords and features selected through part-of-speech tagging, a thesaurus dictionary (304), and named entity and relation information (306) extracted from sampled documents. The modeled ontologies (308) are stored in a database and provided to the document classification apparatus (310).

The document classification apparatus (310) performs named entity recognition on a document to extract named entities and the relations between them, and on that basis determines the document field through instance filling and similarity comparison.

FIG. 4 shows the detailed configuration of the document classification apparatus (310) according to the present invention shown in FIG. 3. Referring to FIG. 4, the document classification apparatus (310) includes a named entity extraction module (402), an entity relation extraction module (404), an instance filling module (406), and a document field determination module (408).

The named entity extraction module (402) reads the document to be classified and extracts entities, that is, nouns, verbs, particles, adjectives, and the like, from its sentences.

The entity relation extraction module (404) extracts the relations between entities within a sentence.

The instance filling module (406) compares the named entities and inter-entity relation properties with the ontology of each field and maps them to entity-class and entity-property-entity.

The instance filling module (406) uses the named entities and inter-entity relation properties to fill the extracted entities into each class of the multiple modeled ontologies (e.g., national pension, corporate regulation, fair trade). For example, the result of filling the entities extracted from 'Document No. 1', 'Document No. 2', and 'Document No. 3' into the seven classes of the national pension ontology (complainant, pension information, person in charge, disability information, severance pay information, unpaid wages, and company) can be illustrated as in FIG. 5.

The per-field ontologies referenced by the instance filling module (406) are stored in the database (410). The database (410) stores the per-field ontologies modeled by the ontology modeling module (412).

The ontology modeling module (412) models the per-field ontologies (208) based on, as shown in FIG. 2, a keyword and feature dictionary (202) holding keywords and features selected through part-of-speech tagging, a thesaurus dictionary (204), and named entity and relation information (206) extracted from sampled documents.

The document field determination module (408) calculates the similarity between the document and each per-field ontology by referring to the results filled by the instance filling module (406) for the per-field ontologies, and outputs the field of the ontology with the greatest similarity as the field of the document.

Specifically, the complaint document field determination module (408) determines the field of the input document by referring to the results filled by the instance filling module (406).

FIG. 5 illustrates the results filled by the instance filling module (406).

For 'Document No. 1', 'Document No. 2', and 'Document No. 3' in FIG. 5, the number of classes filled in the national pension ontology is '4', '2', and '6', respectively. Here, because the 'complainant' class takes a person as its value, a fill whose value is a company name such as '(주) 국제상사' is not counted as a filled class.

Assuming that the reference number of classes for document classification is preset to '3' or more, the field of the complaint documents 'Document No. 1' and 'Document No. 3' is determined to be 'national pension'.

Meanwhile, if the weights of the classes filled in the national pension ontology are not equal as above, and the weight of a fill in the 'complainant' class is 10% higher (the weight can be raised according to the importance of the class), the numbers of filled classes become '4.4', '2', and '6.6', respectively. In this case the field determination result is the same as in the equal-weight case above, but the weights can of course change the document field determination result.

In addition, the representative complaint document determination module (412) also refers to the instance filling results, determines 'Document No. 3', the document with the largest number of filled classes, as the representative complaint document of the national pension field, and stores it in the database (410). Thus, when documents related to 'national pension' are searched, the representative complaint document 'Document No. 3' can be displayed at the top. Alternatively, the ontology may be reconstructed using the representative document.
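The threshold decision and the choice of representative document can be traced with the equal-weight counts reported for FIG. 5 (4, 2, and 6 filled classes, threshold of 3). The function names and the ordering helper are illustrative assumptions, not the patent's implementation.

```python
# Equal-weight filled-class counts reported for the national pension ontology.
FILLED_CLASS_COUNTS = {"Document No. 1": 4, "Document No. 2": 2, "Document No. 3": 6}
MIN_FILLED_CLASSES = 3  # preset reference number of classes


def documents_in_field(counts, threshold):
    """Documents whose filled-class count reaches the preset threshold."""
    return [doc for doc, count in counts.items() if count >= threshold]


def representative_document(counts, threshold):
    """Among documents assigned to the field, the one with the most filled classes."""
    members = documents_in_field(counts, threshold)
    if not members:
        return None
    return max(members, key=lambda doc: counts[doc])


def search_order(counts, threshold):
    """List the representative document first, then the remaining members."""
    representative = representative_document(counts, threshold)
    if representative is None:
        return []
    others = [doc for doc in documents_in_field(counts, threshold)
              if doc != representative]
    return [representative] + others


print(documents_in_field(FILLED_CLASS_COUNTS, MIN_FILLED_CLASSES))
print(representative_document(FILLED_CLASS_COUNTS, MIN_FILLED_CLASSES))
```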

Example

이하에서는 본 발명을 민원 서류 그 중에서도 민원 활동이 활발한 "국민 연금", "기업 규제" 그리고 "공정거래"의 3개 분야의 민원 서류에 적용한 예를 기술한다. 민원(民願)은 국민이 행정기관에 어떠한 것을 신청하는 것이다. 이때의 국민을 민원인이라 하고, 신청하는 내용을 민원사항이라 하며, 행정기관이 이를 처리하기 위해 하는 업무를 민원사무라고 한다. 행정기관이 민원사무를 처리하고 그 결과를 민원인에게 제공하는 것을 민원서비스라 하고, 이러한 전체 과정을 민원행정이라 한다. 즉, 민원행정은 국민이 행정기관에 특정한 행위를 요구하는 것에 행정기관이 대응하는 활동에 관한 행정이다The following describes an example in which the present invention is applied to civil complaint documents in three fields, namely, "national pension", "corporate regulation" and "fair trade", in which civil complaint activity is active. A civil complaint is something a citizen applies for an administrative agency. The citizens at this time are called civil complaints, the contents of the application are called civil complaints, and the administrative agencies are responsible for handling them. The administrative agency handles the civil affairs and provides the result to the civil affairs service. This whole process is called civil affairs administration. In other words, a civil administration is an administration of activities in which an administrative body responds to a citizen's request for a specific action from an administrative body.

To speed up the complaint service, the field of each received complaint document needs to be classified automatically.

First, to build an ontology for each field, the body text of complaint documents sampled for each field was analyzed manually.

Among the complaint documents, the analysis covered 94 of the 123 documents containing the keyword "National Pension" (국민연금) for the National Pension field, 21 of the 23 documents matching "corporate & (and) regulation" (기업 & 규제) for the Corporate Regulation field, and 76 of the 115 documents matching "corporate & (and) (fair | (or) tyranny)" (기업 & (공정 | 횡포)) for the Fair Trade field. The two numbers in a figure such as 94/123 differ because duplicates, misclassified documents, and documents with unclear requirements were excluded from the keyword search results.
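The three keyword queries above can be written as simple boolean filters. The following is a minimal sketch, assuming plain substring matching on the Korean body text; the field identifiers and the function name `candidate_fields` are hypothetical and do not come from the patent.

```python
# Hypothetical keyword filters mirroring the three queries quoted above.
# Plain substring matching is an assumption; the text does not say how
# keywords were matched against the document body.
FIELD_QUERIES = {
    "national_pension": lambda text: "국민연금" in text,
    "corporate_regulation": lambda text: "기업" in text and "규제" in text,
    "fair_trade": lambda text: "기업" in text and ("공정" in text or "횡포" in text),
}


def candidate_fields(text: str) -> list:
    """Return the names of all fields whose keyword query matches the text."""
    return [field for field, matches in FIELD_QUERIES.items() if matches(text)]


print(candidate_fields("국민연금 지급이 늦어지고 있습니다."))
```

A document can match more than one query, so a filter like this only narrows the candidates; the manual exclusion of duplicates and misclassified documents described above is not modeled here.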

FIG. 6 shows an example of a complaint document in the National Pension field, and FIG. 7 shows the main concepts and relations derived from the complaint document of FIG. 6, centered on its requirements.

The results of analyzing complaint documents in the National Pension field are as follows.

* Complainant

- Personal name: Hong Gil-dong, ...

- Kinship: father, mother, ...

* Person in charge / subject

- National Pension Service + name + position/rank:

- National Pension Service branch office

* Company

- Hospital, court, National Tax Service

* Content

- Disbursement, arrears, paying in, payment, delinquency

- Amount disbursed, amount paid in

- Pension (old-age pension, disability pension, retirement pension, national pension)

* Time point

- Retirement, loss

Here, a thesaurus built for each field/topic can be used.

An ontology is modeled for each field based on the data analyzed for that field.

From the class perspective, the ontology

- is designed so that a top-level ontology can link all of the ontologies,

and from the instance perspective,

- the topic and the requirement are extracted from the same sentence where possible,

- for a sentence in which no topic can be found, the topic is extracted from the body text based on TF (term frequency),

- instances are processed at the level of entity name extraction, and

- entity names are extracted through the ontology classes.
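The TF-based fallback for sentences without a topic can be sketched as follows. This is an illustration only: the patent does not specify tokenization or the candidate vocabulary, so the pre-tokenized input, the `topic_vocabulary` set, and the function name `topic_by_tf` are all assumptions.

```python
from collections import Counter
from typing import List, Optional, Set


def topic_by_tf(body_tokens: List[str], topic_vocabulary: Set[str]) -> Optional[str]:
    """Pick the candidate topic term that occurs most often in the body text.

    body_tokens: the document body, already split into tokens (assumed).
    topic_vocabulary: terms allowed as topics, e.g. thesaurus entries (assumed).
    Returns None when no candidate term occurs in the body.
    """
    counts = Counter(tok for tok in body_tokens if tok in topic_vocabulary)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


tokens = ["연금", "지급", "연금", "공단"]
print(topic_by_tf(tokens, {"연금", "지급"}))
```

Restricting the count to a known vocabulary keeps frequent function words from being chosen as the topic, which is one plausible reading of extracting entity names through the ontology classes.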

Here, it is desirable to build general-purpose sentence patterns for extracting sentences that contain a requirement. Examples of general-purpose sentence patterns are as follows.

Example 1) "...half of the National Pension amount... I would appreciate a reply." (original: "...국민연금의 금액의 절반을...답변 부탁드리겠습니다.")

Example 2) "Please review this and punish the person in charge." (original: "검토하시고 담당자를 처벌하여 주십시오.")
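One way to encode such sentence patterns is a regular expression over the sentence ending. The sketch below is hypothetical: it covers only the two Korean request endings that appear in the examples above ("부탁드리겠습니다" and "주십시오"), and the names `REQUEST_PATTERN` and `is_requirement_sentence` are not from the patent.

```python
import re

# Hypothetical request endings, taken only from the two example sentences.
# A real pattern set would need many more endings; this list is an assumption.
REQUEST_PATTERN = re.compile(r"(부탁드리겠습니다|주십시오)[.!]?$")


def is_requirement_sentence(sentence: str) -> bool:
    """Return True when the sentence ends with one of the request endings."""
    return REQUEST_PATTERN.search(sentence.strip()) is not None


print(is_requirement_sentence("검토하시고 담당자를 처벌하여 주십시오."))
```

Anchoring the pattern at the end of the sentence reflects that Korean request forms are carried by the final verb ending; sentences that merely mention these words mid-sentence are not matched.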

FIG. 8 shows an example of the ontology built for the "National Pension" field.

FIG. 9 shows an example of the ontology built for the Fair Trade field.

The results of analyzing complaint documents in the Fair Trade field are as follows.

* Complainant

- Individual, association, organization

- Small and medium-sized enterprise

* Person in charge / subject

- Large enterprise, small and medium-sized enterprise

- Committee, business division

- Public institution

- Public institution + name + person in charge

* Target

- Contract, land, apartment, building, service

* Company

- Hospital, court, National Tax Service

* Content

- Profit, exploitation, unfairness, damage, tyranny, cost, payment

- Abuse, repayment, compensation, disaster, guarantee, bond

- Cancellation, default, dismissal, selection, industrial accident

- Evaluation, delivery

* Region

- Place name

After this, the document field determination module 408 can determine the field of the input complaint document based on the structural information of the complaint document, that is, the number of filled classes obtained as the entity filling result.

That is, for 'Document No. 1', 'Document No. 2', and 'Document No. 3' of FIG. 5, the numbers of classes filled into the classes of the National Pension ontology are '4', '2', and '6', respectively. (Because the 'complainant' class has the type person, a filling whose value is a company name, such as '(주) 국제상사', is not counted as a filled class.)

Therefore, assuming the reference count is preset to '3' or more, the complaint document field of 'Document No. 1' and 'Document No. 3' is determined to be 'National Pension' (see FIG. 5).
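The counting rule and the threshold decision can be sketched as below. The counts 4, 2, and 6 and the reference count 3 come from the text; the class-type table, the entity-type labels, and the function names are assumptions made for illustration.

```python
# Assumed class types for illustration; only 'complainant' = person comes
# from the text.
CLASS_TYPES = {"complainant": "person", "company": "organization"}
REFERENCE_COUNT = 3  # the preset "3 or more" reference count from the text


def count_filled(fillings, class_types=CLASS_TYPES):
    """Count fillings whose entity type matches the type of the target class."""
    return sum(1 for cls, entity_type in fillings if class_types.get(cls) == entity_type)


def belongs_to_field(filled_count, reference_count=REFERENCE_COUNT):
    """A document is assigned to the field when its count reaches the threshold."""
    return filled_count >= reference_count


# Filled-class counts quoted in the text for Documents No. 1, 2 and 3.
filled_counts = {"Document No. 1": 4, "Document No. 2": 2, "Document No. 3": 6}
national_pension_docs = [doc for doc, n in filled_counts.items() if belongs_to_field(n)]
print(national_pension_docs)
```

In `count_filled`, a company name proposed for the 'complainant' class carries the type "organization" and is therefore skipped, which mirrors the parenthetical rule in the text.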

Meanwhile, as described above, suppose the weights of the classes filled into the classes of the National Pension ontology are not equal, and the weight of a class filled into the 'complainant' class is 10% higher (a weight can be raised according to the importance of the class). The numbers of filled classes in the National Pension ontology then become '4.4', '2', and '6.6', respectively. Here the complaint document field determination result is the same as in the equal-weight case described above, but the weights can of course change the determination result.
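The quoted figures are consistent with scaling the whole count by 1.1 for Documents No. 1 and No. 3 (4 × 1.1 = 4.4, 6 × 1.1 = 6.6) and leaving Document No. 2 at 2. The text does not spell out this rule, so the sketch below is an interpretation: it assumes the factor applies only when the 'complainant' class counts as filled, and the boolean flags are chosen to reproduce the figures.

```python
COMPLAINANT_WEIGHT = 1.10  # 10% higher weight, as stated in the text

# (raw filled-class count, whether the 'complainant' class counts as filled).
# The boolean flags are assumptions chosen to reproduce the quoted figures.
documents = {
    "Document No. 1": (4, True),
    "Document No. 2": (2, False),
    "Document No. 3": (6, True),
}


def weighted_count(raw_count, complainant_filled):
    """Scale the count by the complainant weight when that class is filled."""
    factor = COMPLAINANT_WEIGHT if complainant_filled else 1.0
    return round(raw_count * factor, 1)


weighted = {doc: weighted_count(*info) for doc, info in documents.items()}
selected = [doc for doc, score in weighted.items() if score >= 3]
print(weighted, selected)
```

With the reference count of 3, the same two documents pass as in the unweighted case, which matches the statement that the determination result is unchanged here.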

In addition, the representative complaint document determination module 130 also uses the structural information of the complaint documents, that is, the entity filling result, to determine 'Document No. 3', the complaint document with the largest number of filled classes, to be the representative complaint document of the National Pension complaint field, and stores it in the database 112. As a result, when complaint documents related to 'National Pension' are searched for in the search box of FIG. 5, the representative complaint document 'Document No. 3' can be displayed at the top.
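Selecting the representative document and placing it first in a result list can be sketched as follows. The counts are those quoted for the two documents assigned to the National Pension field; the function name `rank_results` and the ordering of the remaining results are assumptions, since the text only says that the representative document is shown at the top.

```python
# Filled-class counts of the documents assigned to the National Pension field.
filled_counts = {"Document No. 1": 4, "Document No. 3": 6}

# The representative document is the one with the largest filled-class count.
representative = max(filled_counts, key=filled_counts.get)


def rank_results(doc_ids, representative_id):
    """Move the representative document to the top, keeping the other order."""
    # sorted() is stable, so non-representative documents keep their order.
    return sorted(doc_ids, key=lambda doc: doc != representative_id)


print(rank_results(["Document No. 1", "Document No. 3"], representative))
```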

310 ... Document classification apparatus
402 ... Entity name extraction module 404 ... Entity relation extraction module
406 ... Instance filling module 408 ... Document field determination module

Claims

An ontology-based document classification method comprising:
an ontology modeling process of modeling an ontology by extracting and abstracting the types, properties, and property relations of entities that commonly occur in documents belonging to the same field, in order to express the characteristics of the documents belonging to each field as an ontology;
an entity name recognition process of recognizing entity names from a document to be classified;
an entity relation extraction process of extracting relation information between entity names through sentence analysis for all fields;
an instance filling process of mapping the entity names and relation properties extracted from the document to entity-class and entity-property-entity by comparison with the ontology of each field;
a document field determination process of referring to the result of the instance filling process and determining, as the field of the document, the field of the ontology selected according to the similarity between the document to be classified and the ontology of each field, in consideration of the weights of the filled classes, the relations between instances, and the class properties; and
a representative document determination process of determining, among the documents assigned to each field, the document having the largest number of entities and inter-entity relations constituting the ontology as a representative document.

delete

The method of claim 1, further comprising filtering documents by keyword before the entity name recognition process.

The method of claim 1, wherein the entity relation extraction process extracts the relations between entities based on the requirements expressed in the document.

The method of claim 4, wherein the entity relation extraction process refers to general-purpose sentence patterns for extracting sentences that express a requirement.

The method of claim 1, wherein the document field determination process comprises:
calculating a similarity for each ontology based on the types and number of filled classes, the weights of the key properties filled into each class, and the weights of the relations between instances; and
selecting the ontology having the highest similarity by comparing the similarities calculated for the ontologies.

delete

An ontology-based document classification apparatus comprising:
an entity name extraction module for reading a document to be classified and extracting the entity names included in the document;
an entity relation extraction module for extracting the relations between the entities extracted by the entity name extraction module;
an instance filling module for mapping the entity names extracted by the entity name extraction module and the inter-entity relation properties extracted by the entity relation extraction module to entity-class and entity-property-entity by comparison with the ontology modeled for each field;
a database storing the ontology of each field referred to by the instance filling module;
a document field determination module for calculating the similarity between the document and the ontology of each field by referring to the result of filling the ontology of each field by the instance filling module, and outputting the field of the ontology having the largest similarity as the field of the document; and
a representative document determination module for determining, among the documents assigned to each field by the document field determination module, the document having the largest number of entities and inter-entity relations as a representative document.

9. The apparatus of claim 8, further comprising an ontology modeling module for modeling the ontology of each document field based on features extracted from sampled documents, a thesaurus having features selected through keyword and part-of-speech tagging, and entity information.