KR100816923B1

KR100816923B1 - System and method for classifying document

Info

Publication number: KR100816923B1
Application number: KR1020060033660A
Authority: KR
Inventors: 차완규; 안한준; 김정중
Original assignee: 엘지전자 주식회사
Priority date: 2006-04-13
Filing date: 2006-04-13
Publication date: 2008-03-26
Also published as: KR20070102035A; CN101055581A; CN101055581B

Abstract

A document classification system according to an embodiment of the present invention includes a database in which documents are stored, and a document classification unit for automatically classifying the documents stored in the database. The document classification unit includes a feature extraction unit that derives features of a document and vectorizes them, a similarity determination unit that determines the similarity between documents using the vectors formed by the feature extraction unit, and a classification system unit that classifies the documents stored in the database according to a predetermined classification system. The document classification unit also classifies new documents provided to the database according to the classification system.

Document classification

Description

Document classification system and its method {System and method for classifying document}

FIG. 1 is a block diagram illustrating a document classification system according to the spirit of the present invention.

FIG. 2 illustrates a document vectorized using features extracted from the document.

FIG. 3 is a diagram illustrating a classification code according to an embodiment of the present invention.

FIG. 4 is a flowchart illustrating a document search method according to an embodiment of the present invention.

The present invention relates to a system for classifying documents and, more particularly, to a document classification system and method that organize stored documents into a classification system by referring to features derived from the stored documents and to the similarity between documents, and that perform the same classification on new documents provided to the database.

With the recent rapid expansion and spread of the Internet, the volume of documents and knowledge that organizations obtain through the Internet is growing at an ever faster rate. As a result, document structuring techniques, which must precede information retrieval tasks such as content-based search, filtering, and routing in large-scale document information systems, are becoming very important.

When document domain experts provide a basic hierarchical classification tree organized by category, document classifiers extract attributes from documents that are currently held in the system or newly arriving, and then assign those documents to the categories in the hierarchical classification tree according to the attributes.

The hierarchical classification tree initially given by the domain experts needs to change in structure as documents continue to be assigned to it. To this end, the domain experts closely review the contents of the documents assigned to each category and modify the structure accordingly. That is, when a set of documents not covered by the existing hierarchical classification tree arrives and a new category is created to hold it, the new category must be merged into an appropriate position in the tree. Likewise, when the documents within a category become so heterogeneous that a subset of them can be grouped under a new category, that category must be split into two or more categories.

However, in today's working environment, where such document sets change continuously and the volume of documents grows rapidly, conventional document management methods that rely on human effort to classify documents and maintain the hierarchical classification tree are limited in their usefulness.

In addition, because each document classifier has different experience and knowledge, there is a greater chance that document classification will not remain consistent over time.

The present invention is proposed to solve the problems described above. Its object is to provide a document classification system and method in which features and inter-document similarities are read out from the documents stored in a database, so that the stored documents can be classified automatically according to a predetermined classification system.

Another object is to provide a document management system and method that automatically classify new documents arriving from outside and intelligently manage their hierarchical structure, so that documents are managed efficiently.

To achieve the above objects, a document classification system according to an embodiment of the present invention includes a database in which documents are stored, and a document classification unit for automatically classifying the documents stored in the database. The document classification unit includes a feature extraction unit that derives features of a document and vectorizes them, a similarity determination unit that determines the similarity between documents using the vectors formed by the feature extraction unit, and a classification system unit that classifies the documents stored in the database according to a predetermined classification system. The document classification unit also classifies new documents provided to the database according to the classification system.

According to another aspect, a document classification method of the present invention includes: (a) extracting features from documents stored in a database and determining the similarity between the documents using the extracted features; (b) classifying the documents stored in the database according to a predetermined classification system based on the similarity between the documents; and (c) monitoring whether a new document is provided to the database and, when a new document is provided, performing steps (a) and (b) again for the new document.

With the document classification system and method according to the spirit of the present invention as proposed, features and inter-document similarities are read out from the documents stored in the database, which has the advantage that the stored documents can be classified automatically according to a predetermined classification system.

In addition, new documents arriving from outside are classified automatically and their hierarchical structure is managed intelligently, which has the advantage that documents are managed efficiently.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The spirit of the present invention is not limited to the embodiments presented. A person skilled in the art who understands that spirit could readily propose other embodiments within the same scope by adding, modifying, or deleting components, and such embodiments also fall within the scope of the spirit of the present invention.

FIG. 1 is a block diagram illustrating a document classification system according to the spirit of the present invention, FIG. 2 illustrates a document vectorized using features extracted from the document, and FIG. 3 is a diagram illustrating a classification code according to an embodiment of the present invention.

Referring to FIGS. 1 to 3, the document classification system 100 according to the present invention includes a database 110 in which a plurality of documents are stored, and a document classification unit 120 for classifying the documents stored in the database 110.

The document classification unit 120 may monitor whether a new document is provided to the database 110, either in real time or at intervals set by the user, and the classification performed by the document classification unit 120 is also applied to the new document.
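The interval-based monitoring described above can be sketched in Python as a simple polling loop. This is only an illustration under stated assumptions: the function name `poll_for_new_documents`, the callables `list_ids` and `classify`, and the `max_cycles` limit are hypothetical and do not appear in the patent.

```python
import time
from typing import Callable, Iterable, Set


def poll_for_new_documents(
    list_ids: Callable[[], Iterable[str]],   # hypothetical: returns the ids currently in the database
    classify: Callable[[str], None],         # hypothetical: classifies one document by id
    interval_seconds: float,                 # the user-set monitoring interval
    max_cycles: int,                         # bound on the number of checks, so the sketch terminates
) -> Set[str]:
    """Check the database once per interval and classify any id not seen before."""
    seen: Set[str] = set()
    for cycle in range(max_cycles):
        for doc_id in list_ids():
            if doc_id not in seen:
                classify(doc_id)
                seen.add(doc_id)
        if cycle < max_cycles - 1:
            time.sleep(interval_seconds)
    return seen
```

A real-time variant would replace the sleep with a notification from the database; the patent leaves the mechanism open.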

The document classification unit 120 includes a feature extraction unit 121 that derives features from the documents stored in the database 110 and vectorizes them, a similarity determination unit 122 that determines the similarity between documents from the document vectors formed by the feature extraction unit 121, and a classification system unit 123 that classifies the documents stored in the database 110 according to the similarity determined by the similarity determination unit 122.

In addition to classifying documents according to the inter-document similarity determined by the similarity determination unit 122, the classification system unit 123 may classify the documents stored in the database 110 by referring to classification codes 124 defined separately for each technical field.

In detail, the feature extraction unit 121 derives the features of the documents stored in the database 110 and vectorizes them.

So that the feature extraction unit 121 can vectorize them, the documents stored in the database 110 may be text-bearing files in formats such as doc, hwp, pdf, txt, html, xls, and ppt.

To extract features (for example, keywords or index terms) from a document, the feature extraction unit 121 may perform morphological analysis to separate the text recorded in the document into words.

For example, in languages that put spaces between words, such as English or Korean, words can be identified using the spaces as cues. In languages that place no delimiter between words, such as Japanese and many other Asian languages, processing to identify the words is needed first before they can be used as index terms or keywords.
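For space-delimited text, the word separation step can be sketched in Python as below. The function name and the punctuation set are assumptions made for illustration; text without word delimiters would need a morphological analyzer instead, which is not shown here.

```python
from typing import List


def split_on_whitespace(text: str) -> List[str]:
    """Split space-delimited text into lowercase word tokens.

    Illustrative only: surrounding punctuation is stripped and empty
    tokens are dropped. No stemming or morphological analysis is done.
    """
    tokens: List[str] = []
    for raw in text.split():
        word = raw.strip(".,;:!?()[]{}\"'").lower()
        if word:
            tokens.append(word)
    return tokens
```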

The feature extraction unit 121 may also assign weights to the features extracted from a document. In this case, it assigns the weights so that features combining exhaustivity and specificity receive higher importance.

In the matrix, each row (t₁, t₂, t₃, t₄, t₅, t₆) corresponds to a feature of the documents, and each column (d₁, d₂, d₃, d₄, d₅) corresponds to a document stored in the database 110.

The matrix element a_ij represents the frequency with which index term t_i appears in document d_j.

Each row of this matrix shows how a feature is distributed across the documents, and each column shows the distribution of features within one document.
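A minimal Python sketch of such a matrix follows, with rows as features and columns as documents so that entry `matrix[i][j]` counts how often term i occurs in document j. The function name and the input shape (a mapping from document id to a token list) are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_term_document_matrix(
    docs: Dict[str, List[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (terms, doc_ids, matrix) with one row per term and one column per document."""
    doc_ids = list(docs)                                   # column order
    terms = sorted({t for tokens in docs.values() for t in tokens})  # row order
    counts = {d: Counter(docs[d]) for d in doc_ids}
    # A Counter returns 0 for a term that does not occur in the document.
    matrix = [[counts[d][t] for d in doc_ids] for t in terms]
    return terms, doc_ids, matrix
```

Reading along a row gives the distribution of one term over the documents; reading down a column gives the feature vector of one document.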

In frequency-based weighting of document features, words that occur too frequently do little to characterize a document. A stop-word list, that is, a list of words unsuitable for serving as document features, can therefore be used.
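As a rough illustration, stop-word filtering can be applied to the token list before counting. The list below is a small hypothetical sample, not one given in the patent.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical sample list; a real system would use a curated list per language.
STOP_WORDS: FrozenSet[str] = frozenset({"the", "a", "an", "of", "and", "is", "in", "to"})


def remove_stop_words(tokens: Iterable[str], stop_words: FrozenSet[str] = STOP_WORDS) -> List[str]:
    """Drop tokens that appear in the stop-word list, keeping the original order."""
    return [t for t in tokens if t not in stop_words]
```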

From this standpoint, the feature extraction unit 121 may adopt as the weight the relative frequency obtained by dividing the number of occurrences of a keyword extracted from a document by the total number of occurrences of all keywords in that document.

An embodiment of this may be carried out using the following equation.

Here, tf(t,d) denotes the frequency with which keyword t appears in a particular document d.
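Assuming the weight is the plain relative frequency described above, it can be written as follows. The symbol w(t,d) and the summation index s are notation introduced here for illustration and are not taken from the patent.

```latex
w(t, d) = \frac{tf(t, d)}{\sum_{s \in d} tf(s, d)}
```

Under this form the weights of all keywords in one document sum to 1.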

The feature extraction unit 121 may also vectorize each document, as shown in FIG. 2, using the document features derived as above, such as keywords or index terms.

For example, among the documents stored in the database 110, document 1 contains the first feature 19 times, the second feature 35 times, and the last feature 15 times.

In the same way, a vector composed of features can be formed for each document to be analyzed.
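The step from raw counts to a weighted vector can be sketched in Python as below, assuming the relative-frequency weighting described earlier. The three-element vector is hypothetical: it reuses only the three counts quoted for document 1 and ignores whatever features lie between the second and the last.

```python
from typing import List, Sequence


def relative_frequency_weights(counts: Sequence[int]) -> List[float]:
    """Divide each feature count by the total count in the document."""
    total = sum(counts)
    if total == 0:
        # A document with no counted features gets an all-zero vector.
        return [0.0 for _ in counts]
    return [c / total for c in counts]


# Hypothetical three-feature document built from the counts quoted for document 1.
weights = relative_frequency_weights([19, 35, 15])
```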

Using the vectors formed by the feature extraction unit 121, the similarity determination unit 122 can determine the degree of similarity between documents. In this case, it may use the cosine value between the vectors to determine the similarity.

For example, the similarity determination unit 122 may determine the similarity between documents by applying the following equation to the document vectors formed by the feature extraction unit 121.
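Assuming the equation referred to is the standard cosine measure, cos(d_i, d_j) = (d_i · d_j) / (|d_i| |d_j|), a minimal Python sketch is shown below. The function name and the choice to return 0.0 for a zero vector are assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two document vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: an empty document is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)
```

Because the vectors hold non-negative frequencies, the result lies between 0 (no shared features) and 1 (identical feature proportions).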

The classification system unit 123 organizes the documents stored in the database 110 into a classification system according to the similarity results produced by the similarity determination unit 122.

By classifying the documents stored in the database 110 according to a classification system that serves as a predetermined standard, the classification system unit 123 allows a particular document to be retrieved more quickly from the stored documents and allows the stored documents to be clustered quickly.

Because the similarity determination by the similarity determination unit 122 and the classification by the classification system unit 123 are also performed on documents newly provided to the database 110, the documents stored in the database 110 can be classified automatically.

The classification system unit 123 may refer to the classification code 124 shown in FIG. 3, and the user may, through a predetermined input means, have the documents stored in the database 110 classified according to the classification code 124.

That is, the classification code 124 shown in FIG. 3 contains codes organized by technical field. When the user retrieves a document about OLED from the database 110, the user can have that document classified according to the classification code 124 by selecting a specific code assigned in the classification code 124, for example OLED under DD.

Each of the codes in the classification code 124 (for example, CRT, DTV, OLED, PDP, Projection, TV) has characteristic values that serve as its classification criteria. The features extracted from a document by the feature extraction unit 121 can therefore be used to map the document to the classification code 124.
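One way to picture this mapping is sketched below in Python. The patent does not say what form the characteristic values take, so this sketch assumes each code carries a set of characteristic keywords and picks the code whose keywords occur most often in the document. The keyword sets and the function name are hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical characteristic keywords per code; the real values are not given in the source.
CODE_FEATURES: Dict[str, Set[str]] = {
    "OLED": {"oled", "organic", "emissive"},
    "PDP": {"pdp", "plasma", "discharge"},
    "CRT": {"crt", "cathode", "electron"},
}


def map_to_code(
    term_counts: Dict[str, int],
    code_features: Dict[str, Set[str]] = CODE_FEATURES,
) -> Optional[str]:
    """Return the code whose characteristic keywords occur most often, or None if none occur."""
    best_code: Optional[str] = None
    best_score = 0
    for code, features in code_features.items():
        score = sum(count for term, count in term_counts.items() if term in features)
        if score > best_score:
            best_code, best_score = code, score
    return best_code
```

A document that matches no code is left unassigned here; how such documents are handled is not specified in the source.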

The feature extraction unit 121, similarity determination unit 122, and classification system unit 123 described above can perform the same roles for documents newly provided to the database 110, sparing the user the trouble of classifying new documents by hand.

FIG. 4 is a flowchart illustrating a document search method according to an embodiment of the present invention.

First, the feature extraction unit 121 extracts features, such as keywords or index terms, from each of the documents stored in the database 110 (S101). In this step, the extracted features may additionally be weighted, and vectorization based on the extracted features is performed.

Next, the similarity determination unit 122 determines the similarity between the documents (S103). The similarity may be calculated using the vectors formed for each document by the feature extraction unit 121.

Then, the classification system unit 123 groups similar documents together or classifies the documents according to the classification code 124 (S105).
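Steps S101 to S105 can be tied together in a short Python sketch. The grouping rule used here (join the first group whose first member is at least `threshold` similar, otherwise start a new group) and the threshold value are assumptions; the patent does not specify how similar documents are grouped.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_similar_documents(docs: Dict[str, List[str]], threshold: float = 0.5) -> List[List[str]]:
    """Group document ids by similarity of their term-count vectors."""
    # S101: extract features and vectorize each document over a shared vocabulary.
    vocabulary = sorted({t for tokens in docs.values() for t in tokens})
    vectors = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        vectors[doc_id] = [counts[t] for t in vocabulary]

    groups: List[List[str]] = []
    for doc_id, vec in vectors.items():
        for group in groups:
            # S103: compare with the first member of each existing group.
            if _cosine(vec, vectors[group[0]]) >= threshold:
                group.append(doc_id)       # S105: place with similar documents
                break
        else:
            groups.append([doc_id])        # S105: no similar group, start a new one
    return groups
```

Classification by code, the other branch of S105, would replace the grouping loop with a lookup against the characteristic values of each code.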

As described above, for a new document provided to the database 110, the similarity determination unit 122 can likewise determine its similarity to the previously stored documents, and the document can be classified according to the classification code 124.

Claims

A document classification system comprising:

a database in which documents are stored; and

a document classification unit for automatically classifying the documents stored in the database,

wherein the document classification unit includes a feature extraction unit that weights keywords or index terms derived from a document to create feature values and vectorizes them, a similarity determination unit that determines the similarity between documents using the vectors formed by the feature extraction unit, and a classification system unit that classifies the documents stored in the database according to classification codes for each technical field by using codes having predetermined characteristic values, and

wherein the document classification unit classifies new documents provided to the database according to the classification system at intervals set by the user.

delete

A document classification method comprising:

(a) weighting keywords or index terms derived from documents stored in a database to create feature values, and determining the similarity between documents using the extracted feature values;

(b) classifying the documents stored in the database according to classification codes for each technical field by using the similarity between the documents and codes having predetermined characteristic values; and

(c) monitoring whether a new document is provided to the database and, when a new document is provided, performing steps (a) and (b) again for the new document at intervals set by the user.

delete