KR100416477B1

KR100416477B1 - Apparatus and method for managing document intelligently

Info

Publication number: KR100416477B1
Application number: KR10-2000-0072305A
Authority: KR
Inventors: 이상구; 김한준
Original assignee: (주)코어로직스
Priority date: 2000-12-01
Filing date: 2000-12-01
Publication date: 2004-01-31
Also published as: KR20020042944A

Abstract

The present invention relates to an intelligent document management apparatus and method. An operator defines a hierarchical classification tree whose nodes are preset categories, and documents arriving from outside are automatically classified into that tree. When the tree must be reorganized, for example because of a large influx of documents, the documents in the region to be reorganized are clustered according to the operator's intent, the resulting clusters are defined as new categories, a partial hierarchical tree is formed with those categories as nodes, and the partial tree is merged into the existing hierarchical classification tree according to the inclusion relationships between categories. Through this sequence of steps, the growing volume of documents brought about by the spread of the Internet can be managed systematically by a small number of people.

Description

Apparatus and method for managing document intelligently

The present invention relates to an intelligent document management apparatus and method, and more particularly to an apparatus and method that manage documents systematically through a hierarchical classification tree and automatically classify newly arriving documents according to that hierarchy, so that a large volume of documents can be managed efficiently by a small number of people.

With the recent rapid expansion and spread of the Internet, the volume of documents and knowledge that organizations obtain through the Internet is growing at an ever faster rate. Document structuring techniques, which are a prerequisite for information retrieval tasks such as content-based search, filtering, and routing in large-scale document information systems, have therefore become very important.

To date, the most common way to organize a large volume of document information is to classify the documents hierarchically by topic and index them. Most current information systems rely largely on human labor to build document indexes through such a hierarchical classification tree (hereinafter, the hierarchical classification tree).

That is, once document domain experts provide the basic structure of a hierarchical classification tree organized by category, document classifiers extract attributes from the documents already held in the system or newly arriving, and use those attributes to assign the documents to the categories in the hierarchical classification tree.

The hierarchical classification tree initially provided by the domain experts must also change in structure as documents continue to be assigned to it. To this end, the domain experts closely review the contents of the documents assigned to each category and modify the structure. Specifically, when a set of documents not covered by the existing hierarchical classification tree arrives and a new category is created to hold it, the new category must be merged into the tree at an appropriate position; and when the documents in a category become so heterogeneous that a subset of them can be grouped under a new category, that category must be split into two or more categories.

However, in today's working environment, where these document sets change continuously and the volume of documents grows rapidly, conventional document management methods that depend on human effort for document classification and maintenance of the hierarchical classification tree are limited in their usefulness.

In addition, because each document classifier has different experience and knowledge, there is a greater chance that document classification will not remain consistent over time.

Accordingly, the present invention has been devised to solve the problems of the prior art described above. Its object is to provide an intelligent document management apparatus and method that automatically classify documents arriving from outside according to a hierarchy of categories defined by an operator and manage that hierarchy intelligently, so that a large document set can be managed efficiently by a small number of people.

FIG. 1 is a schematic block diagram of an intelligent document management apparatus according to an embodiment of the present invention;

FIG. 2 is a flowchart of the procedure for managing documents according to an embodiment of the present invention;

FIG. 3 is an example of a hierarchical classification tree according to an embodiment of the present invention;

FIG. 4 is a flowchart of the learning procedure according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating the learning process schematically according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating the learning process in detail according to an embodiment of the present invention;

FIG. 7 is a flowchart of the automatic document classification procedure according to an embodiment of the present invention;

FIG. 8 is a diagram illustrating how a document arriving from outside is classified into a category according to an embodiment of the present invention;

FIG. 9 is a flowchart of the clustering procedure according to an embodiment of the present invention;

FIG. 10 is a diagram illustrating how a hierarchy among newly created categories is defined according to an embodiment of the present invention.

Explanation of reference numerals for the main parts of the drawings

100: Internet
200: intelligent document management apparatus
210: hierarchy management database
220: learning unit
230: automatic document classification unit
240: document source management database
250: hierarchical classification tree management unit
260: document attribute extraction unit
270: document collection unit
280: user interface

To achieve the above object, the intelligent document management apparatus provided by the present invention comprises: a first storage unit that stores and manages information on a hierarchical classification tree defined by an operator, in which each category is a node, according to a hierarchy among preset categories; a learning unit that performs learning for each subtree of the hierarchical classification tree using training documents selected by the operator, performs relearning when the tree is updated, and stores and manages the results; an automatic document classification unit that, using attributes extracted from a document arriving from outside and referring to the learning results stored in the learning unit, follows the hierarchy of the tree stored in the first storage unit, predicts the relevance between each subtree and the document, and automatically classifies the document into the category of the most relevant subtree; a second storage unit that stores the documents classified by the automatic document classification unit together with their indexing information; and a hierarchical classification tree management unit that, when reorganization of any region of the tree stored in the first storage unit is required, either at the operator's request or by the apparatus's own determination, defines new categories by partial clustering of that region and reorganizes the existing hierarchical classification tree accordingly.

Likewise, the intelligent document management method provided by the present invention comprises: a first step of defining, at the operator's discretion, a hierarchy among a plurality of categories and a hierarchical classification tree having those categories as nodes, and then performing learning for each subtree of the tree using training documents selected by the operator; a second step of, when a document arrives from outside, following the hierarchy of the tree defined in the first step using attributes extracted from the document, predicting the relevance between each subtree and the document, and automatically classifying the document into the category of the most relevant subtree; a third step of, when reorganization of any region of the tree defined in the first step is required, clustering the documents in that region to define new categories; a fourth step of defining hierarchical relationships among the new categories according to the inclusion relationships of their attributes and generating a partial hierarchical tree from those relationships; and a fifth step of merging the partial hierarchical tree into the existing hierarchical classification tree to reorganize it and then performing learning on the resulting tree.

Hereinafter, preferred embodiments of the intelligent document management apparatus and method according to the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic block diagram of an intelligent document management apparatus according to an embodiment of the present invention. Referring to FIG. 1, the intelligent document management apparatus (200) comprises a hierarchy management database (210), a learning unit (220), an automatic document classification unit (230), a document source management database (240), a hierarchical classification tree management unit (250), a document attribute extraction unit (260), a document collection unit (270), and a user interface (280).

The hierarchy management database (210) stores and manages information on the hierarchical classification tree defined by the operator. The operator defines the tree from the attributes of preset categories, with each category serving as a node.
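One way to picture the tree held in the hierarchy management database is as a recursive node structure. The following is a minimal sketch, not the patent's storage format: the class name `CategoryNode`, its fields, and the example categories are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CategoryNode:
    """One node of the hierarchical classification tree (a category). Hypothetical layout."""
    name: str
    # IDs of documents assigned directly to this category
    documents: List[str] = field(default_factory=list)
    # Child categories, each the root of a subtree
    children: List["CategoryNode"] = field(default_factory=list)

    def add_child(self, child: "CategoryNode") -> "CategoryNode":
        self.children.append(child)
        return child

    def subtree_doc_count(self) -> int:
        """Documents in this category plus those in all descendant categories."""
        return len(self.documents) + sum(c.subtree_doc_count() for c in self.children)


# Hypothetical example tree; the categories are not taken from the patent's figures.
root = CategoryNode("Computers")
hardware = root.add_child(CategoryNode("Hardware", documents=["d1", "d2"]))
software = root.add_child(CategoryNode("Software", documents=["d3"]))
software.add_child(CategoryNode("Databases", documents=["d4", "d5"]))
```

Keeping the documents on the node where they were assigned, and counting subtree totals on demand, mirrors the description's distinction between a category and the tree rooted at that category.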

The learning unit (220) performs learning for each subtree of the hierarchical classification tree stored in the hierarchy management database (210), using training documents selected by the operator. When the tree is reorganized, it performs relearning, and it stores and manages the learning results separately.

To this end, the learning unit (220) extracts the keywords contained in the training documents selected for each category, calculates the probability that each keyword appears in the tree whose parent node is that category, and stores and manages the resulting probability values per keyword.

The automatic document classification unit (230) uses the attributes extracted from a document arriving from outside and, referring to the learning results stored in the learning unit (220), follows the hierarchy of the hierarchical classification tree, predicts the relevance between each subtree and the document, and automatically classifies the document into the category of the most relevant subtree.

The automatic document classification unit (230) holds a preset relevance threshold for deciding whether a document is related to the categories. When the predicted relevance is at or below this threshold, the document is automatically classified into the category corresponding to the parent node of those subtrees.
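The top-down descent with a threshold described above can be sketched as follows. This is an illustrative sketch only: the function `classify`, the dictionary-based tree, and the toy `overlap` relevance measure are assumptions, since this passage does not specify how relevance is computed from the learned probabilities.

```python
from typing import Callable, Dict, List


def classify(tree: Dict[str, List[str]], root: str, keywords: List[str],
             relevance: Callable[[str, List[str]], float], threshold: float) -> str:
    """Walk down from `root` and return the category the document is assigned to."""
    current = root
    while tree.get(current):
        # Pick the child subtree most relevant to the document's keywords.
        best = max(tree[current], key=lambda child: relevance(child, keywords))
        if relevance(best, keywords) <= threshold:
            return current   # no child is relevant enough: assign to the parent category
        current = best       # otherwise descend into the most relevant subtree
    return current           # reached a category with no children


# Hypothetical tree (parent -> children) and toy per-subtree vocabularies.
tree = {"Computers": ["Hardware", "Software"], "Software": ["Databases"]}
vocab = {
    "Hardware": {"cpu", "memory"},
    "Software": {"sql", "compiler", "index"},
    "Databases": {"sql", "index"},
}


def overlap(category: str, keywords: List[str]) -> float:
    """ASSUMED relevance: fraction of the document's keywords known to the subtree."""
    return sum(k in vocab[category] for k in keywords) / len(keywords)


deep = classify(tree, "Computers", ["sql", "index"], overlap, 0.5)
middle = classify(tree, "Computers", ["compiler", "sql"], overlap, 0.5)
top = classify(tree, "Computers", ["compiler", "cpu"], overlap, 0.5)
```

In this toy run the first document descends to "Databases", the second stops at "Software" because its relevance to "Databases" does not exceed the threshold, and the third stays at "Computers".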

The document source management database (240) stores the documents classified by the automatic document classification unit (230), together with indexing information that identifies where each document belongs in the hierarchical classification tree.

When reorganization of any region of the hierarchical classification tree stored in the hierarchy management database (210) is required, the hierarchical classification tree management unit (250) defines new categories by partial clustering of that region and reorganizes the existing tree accordingly.

At this point the operator defines document sets whose mutual relationships within the region to be reorganized are set by hand. That is, the operator specifies sets of documents that must fall in the same cluster regardless of the inter-document distance function value, and sets of documents that must fall in different clusters regardless of that value.

To enforce this, weights are set for the keywords contained in those documents. Keywords in documents that must share a cluster are given relatively small weights so that the distance function value between those documents decreases, and keywords in documents that must be in different clusters are given relatively large weights so that the distance function value increases. A more specific way of setting these weights is covered later, in the description of the intelligent document management method according to an embodiment of the present invention.
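The effect of per-keyword weights on a distance function can be illustrated with a small sketch. The distance function actually used is not given in this passage, so the weighted Euclidean distance below, the function name `weighted_distance`, and the sample vectors are assumptions chosen only to show the direction of the effect.

```python
import math
from typing import Dict


def weighted_distance(a: Dict[str, float], b: Dict[str, float],
                      weights: Dict[str, float]) -> float:
    """Weighted Euclidean distance between two keyword-frequency vectors (assumed form)."""
    terms = set(a) | set(b)
    return math.sqrt(sum(
        # Keywords without an explicit weight default to 1.0.
        weights.get(t, 1.0) * (a.get(t, 0.0) - b.get(t, 0.0)) ** 2
        for t in terms
    ))


# Hypothetical keyword-frequency vectors for two documents.
doc_a = {"java": 3.0, "coffee": 1.0}
doc_b = {"java": 1.0, "coffee": 4.0}

base = weighted_distance(doc_a, doc_b, {})
# Small weights on shared keywords pull the documents together (same-cluster constraint).
closer = weighted_distance(doc_a, doc_b, {"java": 0.1, "coffee": 0.1})
# Large weights push them apart (different-cluster constraint).
farther = weighted_distance(doc_a, doc_b, {"java": 4.0, "coffee": 4.0})
```

With any distance of this multiplicative form, shrinking a keyword's weight shrinks its contribution to the distance and enlarging it does the opposite, which is the behavior the description relies on.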

The hierarchical classification tree management unit (250) also defines a hierarchy among the categories newly defined by clustering, based on the inclusion relationships between them, generates the corresponding partial hierarchical tree, and merges it into the existing hierarchical classification tree. In doing so, it determines the hierarchy between the nodes of the partial tree and the neighboring nodes of the existing tree from their inclusion relationships, identifies the position where the partial tree belongs, and merges the partial tree at that position.

Further details are given later in the description of the intelligent document management method of the present invention.

The user interface (280) provides the interface with the operator and passes the operator's input to the intelligent document management apparatus.

The document collection unit (270) collects documents arriving from outside, for example from the Internet (WWW) (100), and passes them to the document attribute extraction unit (260).

The document attribute extraction unit (260) extracts the attributes of each document passed from the document collection unit (270) on the basis of the keywords it contains.

A specific method by which the intelligent document management apparatus of the present invention automatically classifies and manages documents is described below with reference to FIGS. 2 to 10.

FIG. 2 is a flowchart of the procedure for managing documents according to an embodiment of the present invention. Referring to FIG. 2, the method first defines, at the operator's discretion, a hierarchy among categories and then a hierarchical classification tree having those categories as nodes (s100).

Learning is then performed for each subtree of the hierarchical classification tree using training documents selected by the operator (s200). That is, the probability values needed to decide whether a document belongs to each subtree are calculated from the training documents and stored per keyword.

These probability values are what is needed to calculate the probability that a document arriving from outside belongs to each subtree. When a document arrives, the method follows the hierarchy of the tree defined in step s100, predicts the relevance between the document and each subtree from the probability values, and automatically classifies the document into the category of the most relevant subtree (s300).

While external documents are being managed continuously in this way, if reorganization of any region of the tree defined in step s100 is required, the documents in that region are clustered to define new categories (s400, s500).

That is, when a set of documents not covered by the existing tree arrives and a new category is created to hold it, the new category must be merged into the tree at an appropriate position; and when the documents in a category become so heterogeneous that a subset can be grouped under a new category, that category must be split into two or more categories. In either case, the region containing the affected categories is set as the reorganization region, the documents in that region are clustered, and the resulting clusters are defined as new categories.

Once the new categories are defined, hierarchical relationships among them are defined according to the inclusion relationships of their attributes, and a partial hierarchical tree is generated from those relationships (s600).

The partial hierarchical tree is then merged into the existing hierarchical classification tree (s700), and relearning is performed on the resulting tree (s800).

FIG. 3 shows an example of a hierarchical classification tree according to an embodiment of the present invention, as produced by the tree definition step (s100) of FIG. 2. Referring to FIG. 3, the hierarchical classification tree has the form of an ordinary tree whose nodes are categories. The initial tree is generated from the hierarchical relationships defined by the operator.

Each node of the tree represents a preset category, and the documents assigned to that category are shown inside the node.

FIG. 4 is a flowchart of the learning procedure according to an embodiment of the present invention. Referring to FIG. 4, learning of the hierarchical classification tree from the training documents that the operator has selected for each category (node) proceeds as follows. First, the keywords contained in the training documents selected for a category are extracted (s210). Next, the probability that each keyword appears in the tree whose parent node is that category is calculated (s220): the probabilities of the keyword appearing in the subtrees of that tree are calculated recursively, and the values are then summed to give the probability of appearing in the tree as a whole.

The probability values are then stored and managed per keyword (s230).

FIG. 5 illustrates the learning process schematically according to an embodiment of the present invention. Here, tree Tr is the tree whose parent node is category Cr; tree Tr1 is the subtree of Tr whose parent node is category Cr1; tree Tr2 is the subtree of Tr whose parent node is category Cr2; and tree TrK is the subtree of Tr whose parent node is category CrK. Documents belonging to the trees Tr1, Tr2, …, TrK also belong to tree Tr.

The step (s220) of FIG. 4 is described in more detail with reference to FIG. 5. Learning a tree Tr means calculating the probability that a keyword w appears in Tr and storing and managing that value per keyword. The probability Pr(w|Tr) is the sum of two parts: the first probability value (P1) that w appears in category Cr, multiplied by the first weight value (W1) expressing the importance of category Cr within tree Tr; and, for each of the K subtrees Tri of Tr, the second probability value (P2) that w appears in Tri multiplied by the second weight value (W2) expressing the importance of Tri within Tr, summed over the K subtrees.

Equations 1 to 5 give the expressions used to derive the values needed for this probability calculation.

Equation 1 gives the first probability value (P1), Equation 2 the second probability value (P2), Equation 3 the first weight value (W1), and Equation 4 the second weight value (W2). Equation 5 combines the results of Equations 1 to 4 to give the final probability that keyword w appears in tree Tr.
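Read together with the description of step s220, the combination performed by Equation 5 can be written out as below. This is a reconstruction from the prose only, not a copy of the patent's equation; the subscripted symbols P_1, P_{2,i}, W_1, and W_{2,i} are notation assumed here to distinguish the per-subtree terms.

```latex
\Pr(w \mid T_r) \;=\; W_1 \, P_1 \;+\; \sum_{i=1}^{K} W_{2,i} \, P_{2,i},
\qquad
P_1 = \Pr(w \mid C_r), \quad
P_{2,i} = \Pr(w \mid T_{ri}) \quad (i = 1, 2, \dots, K)
```

Because each P_{2,i} is itself the probability of w appearing in a tree, the same expression can be applied to every subtree T_{ri} in turn, which matches the recursive calculation mentioned for step s220.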

The first weight value (W1) can be determined by any one of the following: the ratio of the number of documents in category Cr, df(Cr), to the number of documents in tree Tr, df(Tr), that is, df(Cr)/df(Tr), where df is the document frequency; the ratio of the category count for category Cr, cf(Cr), to the total number of categories in tree Tr, cf(Tr), that is, cf(Cr)/cf(Tr), where cf is the category frequency; or the ratio of the number of keywords in category Cr, tf(Cr), to the number of keywords in tree Tr, tf(Tr), that is, tf(Cr)/tf(Tr), where tf is the term frequency.

Likewise, the second weight value (W2) is determined, in the same way as the first weight value (W1), by any one of the following: the ratio of the number of documents in each lower subtree (Tri, i=1,2,…,k), df(Tri), to the number of documents in the subtree (Tr), df(Tr), that is, df(Tri)/df(Tr); the ratio of the number of categories in each lower subtree, cf(Tri), to the total number of categories in the subtree (Tr), cf(Tr), that is, cf(Tri)/cf(Tr); or the ratio of the number of keywords in each lower subtree, tf(Tri), to the number of keywords in the subtree (Tr), tf(Tr), that is, tf(Tri)/tf(Tr).

The same method of obtaining the first and second weight values applies equally to the apparatus of the present invention.

FIG. 6 is an exemplary diagram illustrating a concrete learning process according to an embodiment of the present invention. With reference to FIG. 6, the method of obtaining the probability Pr(w|T5) that a keyword (w) is included in a subtree (T5) configured as shown in FIG. 6 is as follows. The nodes represent the categories (C5, C4, C3, C2, C1), and the number in parentheses at each node is the number of documents belonging to that category.

Substituting these document counts into Equation 5 gives the probability that a keyword (w) appears in the subtree (T5), as shown in Equation 6.

Here, Pr(w|C5) is the probability that the keyword (w) appears in category (C5), and it can be obtained from Equation 7, which is commonly used in document classification.

Here, V is the set of all keywords, |V| is the number of keywords, λ is a control variable that adjusts the influence of |V| on the overall probability value, and tf(w,Ci) is the number of times the word w occurs in the documents of category Ci.
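One common estimate that uses exactly these symbols is additive (Lidstone) smoothing. The sketch below assumes that form, since the typeset Equation 7 is not reproduced in this text; the function name `category_word_prob` and the sample counts are hypothetical.

```python
def category_word_prob(word, term_counts, vocab, lam=1.0):
    """Smoothed estimate of Pr(w | Ci); an assumed form, not the patent's exact Equation 7.

    term_counts maps each keyword to tf(w, Ci); vocab is the keyword set V.
    """
    total = sum(term_counts.get(v, 0) for v in vocab)  # sum of tf(w', Ci) over V
    # lam scales how strongly |V| pulls the estimate toward a uniform value.
    return (term_counts.get(word, 0) + lam) / (total + lam * len(vocab))


vocab = {"tree", "cluster", "merge"}
counts = {"tree": 3, "cluster": 1}  # "merge" never occurs in this category
probs = {w: category_word_prob(w, counts, vocab) for w in vocab}
```

With this form an unseen keyword still receives a small non-zero probability, and the probabilities over V sum to one.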

Here, because no subtree has category (C4) as its parent node, Pr(subtree of T4|T4) is '0', so the summation (Σ) term evaluates to '0'.

In this way, Equation 6 is evaluated recursively until a terminal node with no subtree is reached.

This calculation is an example in which document frequency is adopted as the weight value; term frequency or category frequency may be used instead.
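The recursion described above can be sketched as follows, using the document-frequency weights df(Cr)/df(Tr) and df(Tri)/df(Tr). The `Category` class, the tree shape, the document counts, and the per-category probabilities are all hypothetical and are not the values of FIG. 6.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    name: str
    n_docs: int        # documents assigned directly to this category
    word_prob: dict    # Pr(w | C) per keyword, e.g. from a smoothed estimate
    children: list = field(default_factory=list)


def df_tree(node):
    """df(T): documents in the category plus all of its subtrees."""
    return node.n_docs + sum(df_tree(c) for c in node.children)


def pr_word_in_tree(word, node):
    """Pr(w | T) = W1 * Pr(w | C) + sum_i W2_i * Pr(w | T_i), evaluated recursively."""
    total = df_tree(node)
    if total == 0:
        return 0.0
    p = (node.n_docs / total) * node.word_prob.get(word, 0.0)  # W1 * P1
    for child in node.children:  # empty for a terminal node, so the sum is 0
        p += (df_tree(child) / total) * pr_word_in_tree(word, child)  # W2_i * P2_i
    return p


c1 = Category("C1", 30, {"w": 0.05})
c2 = Category("C2", 30, {"w": 0.04})
c3 = Category("C3", 10, {"w": 0.03}, [c2, c1])
c4 = Category("C4", 20, {"w": 0.02})
t5 = Category("C5", 10, {"w": 0.01}, [c4, c3])
result = pr_word_in_tree("w", t5)
```

With document-frequency weights, W1 and the W2 values at each level sum to one, so the result is a document-weighted average of the per-category probabilities.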

FIG. 7 is a flowchart of the automatic document classification process according to an embodiment of the present invention. Referring to FIG. 7, the process is as follows.

First, when a document arrives from outside, the keywords contained in it are extracted (s310). Using those keywords, the relevance between the document and each of the lower subtrees whose parent node is the root of the hierarchical classification tree is predicted, and the subtree most relevant to the document is selected. The same step is then repeated until the root of the selected subtree is a terminal node: the relevance between the document and each lower subtree whose parent node is the root of the selected subtree is predicted, and the most relevant subtree is selected (s320, s330, s350).

If the relevance is at or below a preset relevance threshold, the document is automatically classified into the category corresponding to the parent node of those subtrees (s340, s370); if the root of the selected subtree is a terminal node, the document is automatically classified into the category corresponding to that node (s350, s360).
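The top-down descent of steps s310 to s370 can be sketched as below. The relevance measure is not specified at this point in the text, so `overlap` is a hypothetical stand-in (the share of the document's keywords found in a node); the `Node` class, the sample tree, and the threshold of 0.5 are likewise assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    keywords: set
    children: list = field(default_factory=list)


def overlap(doc_keywords, node):
    """Hypothetical relevance: share of the document's keywords found in the node."""
    if not doc_keywords:
        return 0.0
    return len(doc_keywords & node.keywords) / len(doc_keywords)


def classify(root, doc_keywords, relevance=overlap, threshold=0.5):
    current = root
    while current.children:  # stop once the selected root is a terminal node
        best = max(current.children, key=lambda c: relevance(doc_keywords, c))
        if relevance(doc_keywords, best) <= threshold:
            return current.name  # at or below the threshold: keep the parent category
        current = best  # descend into the most relevant subtree
    return current.name  # terminal node reached


c21 = Node("C21", {"gpu", "cuda", "kernel"})
c2 = Node("C2", {"gpu", "cuda", "kernel", "cpu"}, [c21])
tree = Node("T", set(), [Node("C1", {"soccer", "goal"}), c2, Node("C3", {"bank", "loan"})])
```

Here `classify(tree, {"gpu", "cuda"})` descends to the terminal category, while a document that matches C2 but only weakly matches C21 stays in C2.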

FIG. 8 is an exemplary diagram illustrating how a document arriving from outside is classified into a category according to an embodiment of the present invention. When a document (d) arrives from outside as in FIG. 8a, the relevance between the document and the categories is analyzed while following the hierarchy of the hierarchical classification tree from the upper level to the lower level.

That is, the relevance between the document and the categories (C1, C2, C3) that make up the subtrees of the tree (T) is analyzed first, and the most relevant category is selected; in the example of FIG. 8b, category (C2) is selected. Next, the relevance between the document and category (C21) of the tree (21), the subtree of that category, is evaluated. If it is at or below the relevance threshold, the document is automatically classified into category (C2); if it exceeds the threshold, the document is automatically classified into category (C21).

FIG. 8b shows the case in which the document is automatically classified into category (C21).

Here, the number of documents that can be assigned to the category of any node of the hierarchical classification tree is preset as a reconstruction threshold. While continuously arriving documents are being classified, if the number of documents assigned to a category reaches or exceeds the reconstruction threshold, a region containing that node, or that node and its neighboring nodes, is set as the reconstruction region. The documents in that region are then clustered, and the hierarchical classification tree is reconstructed using the newly defined categories.
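A minimal sketch of this trigger is shown below. Which nodes count as "neighboring" is not defined in the text, so treating the parent and the direct children as neighbors is an assumption, as are the function name and the dictionary-based tree representation.

```python
def reconstruction_region(tree, doc_counts, threshold):
    """Return the categories to reconstruct.

    tree maps a category to its list of child categories;
    doc_counts maps a category to the number of documents assigned to it.
    """
    parent = {child: p for p, kids in tree.items() for child in kids}
    region = set()
    for cat, count in doc_counts.items():
        if count >= threshold:               # reconstruction threshold reached
            region.add(cat)
            region.update(tree.get(cat, []))  # assumed neighbors: children
            if cat in parent:
                region.add(parent[cat])       # assumed neighbor: parent
    return region


tree = {"T": ["C1", "C2"], "C2": ["C21"]}
counts = {"C1": 3, "C2": 12, "C21": 4}
region = reconstruction_region(tree, counts, threshold=10)
```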

FIG. 9 is a flowchart of this clustering process. Referring to FIG. 9, clustering begins by defining, at the operator's discretion, a document set in which the relationships between documents in the reconstruction region are set manually (s510). Setting the relationships manually means that the operator defines a set (B⁺) of documents that must be placed in the same cluster and a set (B⁻) of documents that must be placed in different clusters.

Once the document set has been defined, a weight is set for each keyword according to the attributes of the document set, so as to reflect the manually set relationships between its documents (s520), and the distance function values between the documents in the reconstruction region are calculated using those weights (s530).

The weights can be set by Equation 10.

Here, D denotes the document set (B⁺∪B⁻) that the operator has defined as having manually set relationships, and I denotes the attribute value of that document set.

That is, according to Equation 10, when the distance between documents (di, dj) belonging to the document set is calculated, the distance function dist_D(di, dj) is multiplied by the attribute value I(di, dj) of the document set; the distance function dist_D(di, dj) is given by Equation 11.

Here, w_k is the k-th component of the weight vector, and d_ik is the k-th component of document d_i.

Meanwhile, the attribute value I(di, dj) of the document set is '+1' when two documents (di, dj) in the document set are defined as belonging to the same cluster, and '-1' when two documents (di, dj) in the document set are defined as belonging to different clusters.

Accordingly, for documents whose relationship is defined so that they fall in different clusters, the weights are set in the direction that increases their distance function value; for documents whose relationship is defined so that they fall in the same cluster, the weights are set in the direction that decreases it.
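The effect described above can be illustrated with a small sketch. It assumes a weighted squared-difference distance and a single gradient step on the sum of I(di, dj) · dist(di, dj); neither the distance form nor the update rule is confirmed by this text, since Equations 10 and 11 are not reproduced, and all names and the learning rate are hypothetical.

```python
def weighted_dist(w, di, dj):
    """Assumed distance: sum over k of w_k * (d_ik - d_jk)^2."""
    return sum(wk * (a - b) ** 2 for wk, a, b in zip(w, di, dj))


def update_weights(w, pairs, lr=0.1):
    """One gradient step that lowers sum(I * dist) over the operator's pairs.

    pairs holds (di, dj, I) with I = +1 for same-cluster and -1 for different-cluster.
    """
    grads = [sum(i * (di[k] - dj[k]) ** 2 for di, dj, i in pairs) for k in range(len(w))]
    return [max(0.0, wk - lr * g) for wk, g in zip(w, grads)]  # keep weights non-negative


a, b, c = [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]
pairs = [(a, b, +1),   # operator: a and b belong together
         (a, c, -1)]   # operator: a and c must be separated
old_w = [1.0, 1.0]
new_w = update_weights(old_w, pairs)
```

After the step, the keyword dimension on which the same-cluster pair differs is down-weighted and the dimension on which the different-cluster pair differs is up-weighted, which shrinks the first distance and grows the second.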

Once the weights have been applied to the documents in the document set and the distance function values between documents calculated, documents whose distance function value falls within a preset range are merged into a single cluster, and this is repeated until a predetermined number (n) of clusters has been produced; the resulting n clusters are then each defined as a category (s540, s550, s560).
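The merge loop of steps s540 to s560 resembles agglomerative clustering. The sketch below swaps the "within a preset range" criterion for a closest-pair (single-link) rule, which is an assumption made to keep the example short; the function name and sample data are hypothetical.

```python
def agglomerate(docs, dist, n):
    """Merge the two closest clusters until only n clusters remain.

    Returns clusters as lists of document indices.
    """
    n = max(1, n)
    clusters = [[i] for i in range(len(docs))]

    def link(x, y):  # single link: distance between the closest members
        return min(dist(docs[i], docs[j]) for i in x for j in y)

    while len(clusters) > n:
        p, q = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: link(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]  # merge q into p
        del clusters[q]
    return clusters


docs = [[0.0], [0.1], [5.0], [5.2], [10.0]]
clusters = agglomerate(docs, lambda a, b: abs(a[0] - b[0]), n=3)
```

Any distance function can be passed as `dist`, including one that applies operator-derived keyword weights.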

FIG. 10 is an exemplary diagram illustrating the process of defining a hierarchy among newly created categories according to an embodiment of the present invention; the sequence of steps is shown in FIGS. 10a to 10f.

FIG. 10a shows the reconstruction region (A) set by the operator when the existing hierarchical classification tree needs to be reconstructed.

FIG. 10b shows the five categories (Ca, Cb, Cc, Cd, Ce) newly defined by clustering the documents contained in the reconstruction region (A).

FIG. 10c shows the initial inclusion matrix built to determine the inclusion relationships among the categories (Ca, Cb, Cc, Cd, Ce).

FIG. 10d shows the inclusion matrix representing the final inclusion relationships when the threshold for establishing an inclusion relationship from the inclusion matrix is 0.8.

FIG. 10e shows the partial hierarchical tree generated by predicting the inclusion relationships among the categories from the final inclusion matrix and defining the hierarchy among them according to those relationships.

FIG. 10f shows the partial hierarchical tree (A) merged into the existing hierarchical classification tree.

First, when the operator sets the reconstruction region (A) as in FIG. 10a, all documents contained in that region are clustered. The operator may freely set the limit on the number of resulting clusters.

FIG. 10b shows the case in which the clustering produced five clusters.

Once the clusters have been produced, they are defined as new categories, all attributes contained in each category are extracted, and the operator assigns each attribute the importance it has within a category.

Then, for any two categories, the importance of each attribute (keyword) in the two categories is compared, and the category that contains more of the high-importance keywords is regarded as including the other category.

That is, the importance each attribute (keyword) has in the two categories is compared, and the category that contains high-importance attributes at or above a preset inclusion threshold is defined as including the other category.

Taking the initial inclusion matrix of FIG. 10c as an example, the inclusion values for category Ca are '0' with respect to Cb, '0.5' with respect to Cc, '0.1' with respect to Cd, and '0.1' with respect to Ce. All the inclusion values between categories are checked in this way, and where a value is at or above the inclusion threshold, one category is defined as including the other.

FIG. 10d is the inclusion matrix rebuilt from the initial inclusion matrix of FIG. 10c when the inclusion threshold is '0.8'. Referring to FIG. 10d, category (Cd) has an inclusion value of '1' for every other category (Ca, Cb, Cc, Ce), so it is determined to include all of them, while category (Ce) is determined to include only category (Cc).

Whether the inclusion relationships are appropriate is decided with the operator's participation; if they are not, the operator changes the inclusion threshold and the inclusion matrix is rebuilt repeatedly until the operator is satisfied with the result.

Applying these determinations yields the partial hierarchical tree shown in FIG. 10e.
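The step from a thresholded inclusion matrix to a partial hierarchical tree can be sketched as follows. Only the Ca row is quoted in the text; the other matrix values are invented for illustration, reading `incl[x][y]` as the degree to which x includes y is an assumption, and so is the rule of attaching each category to its most specific including category.

```python
def binarize(incl, threshold):
    """Keep only the pairs whose inclusion value reaches the threshold."""
    return {x: {y for y, v in row.items() if v >= threshold} for x, row in incl.items()}


def parents_from_inclusion(includes):
    """Attach each category to the most specific category that includes it (None for a root)."""
    parents = {}
    for child in includes:
        holders = [p for p, kids in includes.items() if child in kids]
        parents[child] = min(holders, key=lambda p: len(includes[p])) if holders else None
    return parents


incl = {
    "Ca": {"Cb": 0.0, "Cc": 0.5, "Cd": 0.1, "Ce": 0.1},    # row quoted in the text
    "Cb": {"Ca": 0.2, "Cc": 0.3, "Cd": 0.1, "Ce": 0.0},    # invented values
    "Cc": {"Ca": 0.4, "Cb": 0.1, "Cd": 0.0, "Ce": 0.3},    # invented values
    "Cd": {"Ca": 0.9, "Cb": 0.85, "Cc": 1.0, "Ce": 0.95},  # invented values
    "Ce": {"Ca": 0.1, "Cb": 0.2, "Cc": 0.9, "Cd": 0.3},    # invented values
}
parents = parents_from_inclusion(binarize(incl, 0.8))
```

With a threshold of 0.8 these sample values reproduce the outcome described for FIG. 10d: Cd includes every other category and Ce includes only Cc, so Cd becomes the root and Cc hangs under Ce. The sketch does not handle two categories that include each other.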

The partial hierarchical tree constructed in this way is merged at a certain position in the existing hierarchical classification tree so that it remains consistent with that tree's hierarchy. To do this, the per-attribute inclusion relationships are determined between the root and terminal categories of the partial hierarchical tree and the surrounding categories of the existing hierarchical classification tree where the partial hierarchical tree is to be placed, and the partial hierarchical tree is merged into the existing hierarchical classification tree according to the result.

The inclusion relationships between the nodes of the partial hierarchical tree and those of the existing hierarchical classification tree are identified by the same method used when the partial hierarchical tree was generated.

The intelligent document management apparatus and method of the present invention described above manage documents, whose volume is growing with the spread of the Internet, by means of a hierarchical classification tree. This allows the documents to be managed more systematically and newly arriving documents to be classified automatically according to the hierarchy. As a result, large volumes of documents can be managed efficiently by a small number of personnel.

Furthermore, because the structure of the hierarchical classification tree can be reconstructed automatically through interactive work with the operator, the classification errors that can arise when the volume of documents becomes very large and classification relies solely on an automated device can be prevented.

In addition, customer-related documents such as customer purchase information and customer complaint information, which will be used by e-commerce companies expected to expand substantially, can be managed more efficiently and therefore used for database marketing. The invention can also be applied to data management in any field that requires hierarchical classification and has a large volume of data to manage.

Claims

An intelligent document management method comprising:

a first process of performing learning, using predetermined learning documents, on each subtree constituting a predefined hierarchical classification tree;

a second process of, when a document arrives from outside, following the hierarchical structure of the hierarchical classification tree defined in the first process using the attributes extracted from the document, predicting the relevance between each subtree and the document, and automatically classifying the document into the category of the subtree most relevant to the document;

a third process of defining new categories by clustering the documents included in a reconstruction region when reconstruction of a region of the hierarchical classification tree defined in the first process is required;

a fourth process of defining the hierarchical relationships between the new categories according to their per-attribute inclusion relationships, and applying those relationships to generate a partial hierarchical tree; and

a fifth process of reconstructing the existing hierarchical classification tree by merging the partial hierarchical tree into it, and then learning the reconstructed hierarchical classification tree.

The method of claim 1, wherein the learning of the first process comprises:

a 1-1 process of extracting the keywords contained in the learning documents selected for each category;

a 1-2 process of calculating the probability that the keywords appear in the tree having that category as its parent node; and

a 1-3 process of storing and managing the probability values for each keyword.

The method of claim 2, wherein the 1-2 process comprises:

a 1-2-1 process of calculating a first probability value (P1) that the keywords (w) appear in the corresponding category (Cr);

a 1-2-2 process of calculating a second probability value (P2) that the keywords (w) are included in specific subtrees (Tri) of the tree (Tr) having the category (Cr) as its parent node;

a 1-2-3 process of calculating a first weight value (W1) indicating the importance of the category (Cr) in the subtree (Tr);

a 1-2-4 process of calculating a second weight value (W2) indicating the importance of each lower subtree (Tri) in the subtree (Tr); and

a 1-2-5 process of calculating, from the values (P1, P2, W1, W2) calculated in the 1-2-1 to 1-2-4 processes, the probability value (Pr(w|Tr)) that the keywords (w) are included in the tree (Tr) having the corresponding category (Cr) as its parent node.

The method of claim 3, wherein the 1-2-3 process determines the first weight value (W1) by any one of:

the ratio (df(Cr)/df(Tr)) of the number of documents (df(Cr)) included in the corresponding category (Cr) to the number of documents (df(Tr)) included in the subtree (Tr);

the ratio (cf(Cr)/cf(Tr)) of the corresponding category (cf(Cr)) to the total number of categories (cf(Tr)) included in the subtree (Tr); and

the ratio (tf(Cr)/tf(Tr)) of the number of keywords (tf(Cr)) included in the corresponding category (Cr) to the number of keywords (tf(Tr)) included in the subtree (Tr).

The method of claim 3, wherein the 1-2-4 process determines the second weight value (W2) by any one of:

the ratio (df(Ti)/df(Tr)) of the number of documents (df(Ti), i=1,2,…,k) included in each of the lower subtrees (Ti, i=1,2,…,k) of the subtree to the number of documents (df(Tr)) included in the subtree (Tr);

the ratio (cf(Ti)/cf(Tr)) of the number of categories (cf(Ti), i=1,2,…,k) included in each of the lower subtrees (Ti, i=1,2,…,k) of the subtree to the total number of categories (cf(Tr)) included in the subtree (Tr); and

the ratio (tf(Ti)/tf(Tr)) of the number of keywords (tf(Ti), i=1,2,…,k) included in each of the lower subtrees (Ti, i=1,2,…,k) of the subtree to the number of keywords (tf(Tr)) included in the subtree (Tr).

The method of claim 1, wherein the second process comprises:

a 2-1 process of extracting the keywords contained in a document arriving from outside;

a 2-2 process of predicting the relevance between the document and each of the lower subtrees having the root of the hierarchical classification tree as their parent node, and selecting the subtree most relevant to the document;

a 2-3 process of repeating, until the root of the selected subtree is a terminal node, the steps of predicting the relevance between the document and each of the lower subtrees having the root of the selected subtree as their parent node and selecting the subtree most relevant to the document; and

a 2-4 process of, when the root of the selected subtree is a terminal node, automatically classifying the document into the category corresponding to that node.

The method of claim 6, wherein each of the 2-2 and 2-3 processes,

when the relevance between the document and the lower subtrees is at or below a preset relevance threshold, automatically classifies the document into the category corresponding to the parent node of those subtrees.

The method of claim 1, wherein the third process,

when documents numbering at or above a preset reconstruction threshold are classified into the category constituting any node of the hierarchical classification tree, sets a region containing that node or its neighboring nodes as the reconstruction region to be reconstructed.

The method of claim 1, wherein the clustering of the third process comprises:

a 3-1 process of receiving from an operator a document set in which the relationships between documents included in the reconstruction region are set manually;

a 3-2 process of setting, according to the attributes of the document set, a weight for each keyword that reflects the manually set relationships between the documents in the document set, and calculating the distance function values between the documents included in the reconstruction region by applying the weights;

a 3-3 process of repeatedly merging documents whose distance function values calculated in the 3-2 process fall within a preset range into one cluster until a predetermined number of clusters has been produced; and

a 3-4 process of defining each of the predetermined number of clusters produced in the 3-3 process as a category.

The method of claim 9, wherein the 3-2 process,

for any documents (di, dj) included in the document set (D), sets the weights according to the value obtained by multiplying the distance function value (dist_D(di, dj)) of those documents by the attribute value (I(di, dj)) of the document set.

11. The method of claim 10, wherein the attribute value (I(di, dj)) of the document set

has a value of '+1' when any two documents (di, dj) among the documents included in the document set are defined as belonging to the same cluster, and

has a value of '-1' when any two documents (di, dj) among the documents included in the document set are defined as belonging to different clusters.

The method of claim 1, wherein the fourth process

receives, for each attribute included in the categories, the importance that the attribute has in each category, compares the importance of each attribute in any two categories, and defines the category that contains high-importance attributes at or above a preset inclusion threshold as including the other category.

The method of claim 1 or 12, wherein the fourth process

generates a new partial hierarchical tree each time a new inclusion threshold is received from the operator.

The method of claim 1, wherein the fifth process

identifies the hierarchical relationships between the root and terminal categories of the partial hierarchical tree and the existing hierarchical classification tree, based on the per-attribute inclusion relationships between those root and terminal categories and the surrounding categories of the existing hierarchical classification tree where the partial hierarchical tree is to be placed, and merges the partial hierarchical tree into the existing hierarchical classification tree according to those hierarchical relationships.

An intelligent document management apparatus comprising:

a first storage unit for storing and managing information on a hierarchical classification tree predefined with each category as a node according to a preset hierarchy between categories;

a learning unit for performing learning, using predetermined learning documents, on each subtree constituting the hierarchical classification tree, performing re-learning when the hierarchical classification tree is updated, and storing and managing the results;

an automatic document classification unit for following the hierarchical structure of the hierarchical classification tree stored in the first storage unit, using the attributes extracted from a document arriving from outside and referring to the learning results stored in the learning unit, predicting the relevance between each subtree and the document, and automatically classifying the document into the category of the subtree most relevant to the document;

a second storage unit for storing the documents automatically classified by the automatic document classification unit together with their indexing information; and

a hierarchical classification tree management unit for, when reconstruction of a region of the hierarchical classification tree stored in the first storage unit is required, defining new categories by partial clustering of that region and reconstructing the existing hierarchical classification tree accordingly.

The apparatus of claim 15, wherein the learning unit

extracts the keywords contained in the predetermined learning documents selected for each category, and then performs a learning process of calculating the probability that the keywords appear in the tree having the corresponding category as its parent node, and

stores and manages, for each keyword, the probability values produced by the learning that the keyword is included in an arbitrary subtree.

The apparatus of claim 15, wherein the automatic document classification unit,

when the relevance between the subtrees and the document is at or below a preset relevance threshold, automatically classifies the document into the category corresponding to the parent node of those subtrees.

The apparatus of claim 15, wherein the hierarchical classification tree management unit

receives a document set in which the relationships between the documents included in the region to be reconstructed are set manually, sets a weight for each keyword to reflect the manually set relationships between the documents in the document set, clusters the documents included in the region according to the distance function values between documents calculated by applying the weights, and defines the resulting clusters as new categories.

19. The apparatus of claim 15 or 18, wherein the hierarchical classification tree management unit

generates a partial hierarchical tree by applying the hierarchy between categories defined by the per-attribute inclusion relationships of the new categories, and then merges the partial hierarchical tree into the existing hierarchical classification tree according to the hierarchical relationships between each node of the partial hierarchical tree and the neighboring nodes of the existing hierarchical classification tree where the partial hierarchical tree is to be placed.