KR20090017830A - Apparatus for providing aspect-based documents clustering that raises reliability and method therefor

Info

Publication number: KR20090017830A
Application number: KR1020070082309A
Authority: KR
Inventors: 송준화; 박순일; 강승우; 정상영; 최성원
Original assignee: 한국과학기술원
Priority date: 2007-08-16
Filing date: 2007-08-16
Publication date: 2009-02-19
Also published as: KR100896702B1

Abstract

An apparatus and method for aspect-based document clustering are provided that improve reliability by correcting errors in the clustering result using user feedback. A document clustering unit (100) extracts keywords from documents, calculates the importance of each keyword, performs document clustering using the keywords and their importance, and corrects the clustering result using user feedback. A keyword extracting unit (110) classifies the documents and extracts keywords from the classified documents. An importance calculating unit (120) calculates the importance of each keyword based on the text content of the documents. A similarity calculating unit (130) calculates the similarity between documents. A cluster number decision unit (140) groups the documents, determines the number of clusters, and generates a clustering result according to that number. A post-processing unit (150) corrects errors in the clustering result by cooperative clustering.

Description

Apparatus for providing Aspect-based Documents Clustering that raises Reliability and Method therefor

The present invention relates to a document structure-based clustering apparatus and method with improved reliability, and more particularly to an apparatus and method that improve the reliability of document clustering by performing clustering that reflects the internal structure of documents and by correcting errors that may occur in the clustering result using user feedback.

Document clustering is the discovery, by a clustering algorithm, of groups of documents with similar characteristics within a document set. It is an important document analysis method used in many information retrieval applications, such as organizing data, browsing web search results, and multi-document summarization.

In document clustering, the distribution of the data set, its internal structure, and the form of clustering the user wants all have a significant effect on the clustering result. There have been studies on improving clustering quality while providing only limited information to the user.

In addition, various document clustering techniques have been proposed to make documents easier for users to access, for example techniques for automatically classifying documents that cover the same event or the same subject.

However, these techniques do not take into account that, even within a single subject, documents may be written from different viewpoints and based on different content depending on the author. As a result, it is difficult for a user to understand a subject comprehensively and objectively on the basis of various viewpoints.

An object of the present invention is to solve the above problem by providing a document structure-based clustering apparatus and method with improved reliability that lead the user to recognize that, even among documents on the same event or the same subject, the viewpoints that document authors consider important may vary.

Another object of the present invention is to provide a document structure-based clustering apparatus and method with improved reliability that allow the user to understand a specific subject or event objectively and comprehensively on the basis of various viewpoints.

Yet another object of the present invention is to provide a document structure-based clustering apparatus and method that improve reliability through gradual correction of errors in the clustering result based on user feedback.

The present invention relates to a document structure-based clustering apparatus with improved reliability, comprising: a document clustering unit that classifies input documents by subject, extracts keywords reflecting the core content of the classified documents, calculates the importance of the extracted keywords, performs document clustering using the extracted keywords and the calculated importance, and supplements the document clustering result using cooperative clustering; and an input/output unit that receives information on the documents to be clustered, outputs the document clustering result, and receives error information on the output document clustering result.

Preferably, the document clustering unit comprises: keyword extraction means for classifying the input documents by subject and extracting keywords from the classified documents; importance calculation means for calculating the importance of the extracted keywords based on the body text of the documents; similarity calculation means for calculating the similarity between documents using the extracted keywords and the calculated importance; cluster number determining means for determining a suitable number of document clusters using the similarity calculation result and generating a document clustering result according to the determined number of clusters; and post-processing means for correcting, by cooperative clustering, errors in the clustering result produced by the cluster number determining means.

Also preferably, the similarity is calculated using the cosine measure.

Also preferably, the input documents are news articles or newspaper articles.

Also preferably, the document clustering unit extracts keywords from the title, subtitle, and lead (lede) of a news article or newspaper article.

Also preferably, the importance of a keyword is calculated using one or more of the following variables: the position of the paragraph containing the keyword, the number of occurrences of the keyword, the length of the paragraph containing the keyword, and the length of the sentence containing the keyword.

Also preferably, the importance of a keyword is calculated by the following equation.

(In the above equation, v is the importance value of the keyword, La(i) is the length of the i-th document in the document set, Lp(i,j) is the length of the j-th paragraph in the i-th document, Ls(i,k) is the length of the k-th sentence in the i-th document, d_pj is the diminishing factor of the j-th paragraph in the i-th document, and d_sk is the diminishing factor of the k-th sentence in the i-th document.)

Also preferably, the document clustering is performed using the Hierarchical Agglomerative Clustering method.

Also preferably, a suitable number of document clusters is determined automatically using the Elbow Criterion.

And preferably, the input/output unit has at least one of a tagging interface and a drag-and-drop interface.

The present invention also relates to a document structure-based clustering method with improved reliability, comprising the steps of: (a) receiving information on the documents to be clustered; (b) extracting keywords from the input documents; (c) calculating the importance of the extracted keywords; (d) calculating the similarity between documents using the extracted keywords and the calculated importance; (e) generating a document clustering result using the similarity calculation result; (f) outputting the generated document clustering result; and (g) supplementing the output document clustering result by cooperative clustering.

Preferably, step (b) comprises: (b-1) classifying the input documents by subject.

Also preferably, step (b) comprises: (b-2) removing stop words from the input documents; and (b-3) extracting word roots from the documents from which the stop words have been removed.

Also preferably, step (c) comprises: (c-1) calculating a diminishing factor to take into account the position within the document of the paragraph or sentence containing the keyword; (c-2) calculating the ratio of the length of the paragraph or sentence containing the keyword to the total document length; and (c-3) summing the importance over every occurrence of each keyword in the document.

Also preferably, step (g) comprises: (g-1) receiving a keyword of a document from a user through a tagging interface.

Also preferably, step (g-1) comprises: (g-1-1) searching for documents whose word set does not include the keyword entered through the tagging interface; (g-1-2) adding the keyword entered through the tagging interface to the word sets found; and (g-1-3) calculating the importance of the added keyword.

Also preferably, step (g) comprises: (g-2) receiving, through a drag-and-drop interface, a move signal for a document that a user has judged to be wrongly clustered.

And preferably, step (g-2) comprises: (g-2-1) classifying the words contained in the moved document into words that overlap with the word set of the cluster after the move and words that overlap with the word set of the cluster before the move; and (g-2-2) assigning different weights to the words that overlap with the word set of the cluster after the move and to the words that overlap with the word set of the cluster before the move.
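Steps (g-2-1) and (g-2-2) can be illustrated with a minimal sketch. The text only says that the two groups of words receive different weights; the multipliers `up` and `down`, the function name, and the rule that a word found in both clusters is treated as a destination word are all assumptions made for illustration.

```python
def reweight_moved_document(doc_vector, source_words, target_words,
                            up=1.5, down=0.5):
    """Re-weight the keywords of a document that a user dragged
    from one cluster (source) to another (target).

    doc_vector   : dict mapping keyword -> importance
    source_words : set of words of the cluster before the move
    target_words : set of words of the cluster after the move
    up, down     : hypothetical multipliers, not taken from the patent
    """
    reweighted = {}
    for word, value in doc_vector.items():
        if word in target_words:
            # Word shared with the destination cluster: emphasize it.
            reweighted[word] = value * up
        elif word in source_words:
            # Word shared only with the original cluster: de-emphasize it.
            reweighted[word] = value * down
        else:
            reweighted[word] = value
    return reweighted
```

With weights chosen this way, the moved document ends up closer to its new cluster the next time similarity is computed.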

According to the present invention, the user can recognize that, even among documents on the same event or the same subject, the viewpoints that document authors consider important may vary.

According to the present invention, the user can, with minimal time and effort, understand documents on a subject of interest objectively and comprehensively on the basis of various viewpoints.

According to the present invention, reliability can also be improved through continuous correction of errors in the clustering result based on user feedback.

Before describing the detailed embodiments of the present invention, it should be noted that components not directly related to the technical gist of the invention have been omitted to the extent that doing so does not obscure that gist.

In addition, the terms and words used in this specification and the claims should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way.

The present invention proposes a document structure-based clustering apparatus and method that classify and cluster documents, on the basis of their structure, according to the viewpoint on which each document focuses, and discloses an apparatus and method for correcting errors that may be included in the clustering result using feedback from users.

Hereinafter, a document structure-based clustering apparatus with improved reliability according to a preferred embodiment of the present invention will be described with reference to FIGS. 1 to 3.

FIG. 1 is an overall configuration diagram of a document structure-based clustering apparatus with improved reliability according to a preferred embodiment of the present invention, FIG. 2 is a schematic diagram of a typical article structure, and FIG. 3 is an example of a drag-and-drop interface according to a preferred embodiment of the present invention.

As shown in FIG. 1, the document structure-based clustering apparatus with improved reliability according to the preferred embodiment of the present invention includes a document clustering unit (100) and an input/output unit (200).

The document clustering unit (100) extracts keywords from the input documents, calculates the importance of each extracted keyword, performs document clustering using the extracted keywords and their importance, and corrects the clustering result using user feedback. It includes keyword extraction means (110), importance calculation means (120), similarity calculation means (130), cluster number determining means (140), and post-processing means (150).

The keyword extraction means (110) classifies the input documents by subject and extracts keywords from the classified documents.

In a preferred embodiment of the present invention, the input documents may be Internet newspaper articles, Internet news articles, papers, books, and the like.

Conventional techniques can be used to classify the documents by subject, so a detailed description is omitted.

Keyword extraction by the keyword extraction means (110) relies on the observation that most documents consist of a title, a subtitle, and body text. That is, noun phrases extracted by morphological analysis from the titles or subtitles of the documents classified by subject are selected as the keywords of an article. The selected keywords represent the important content covered in each document.

For example, when the documents to be clustered are news articles or newspaper articles, they generally consist of a title (head), a subtitle (sub-head), a lead (lede), and body text (main text), as shown in FIG. 2. This follows the inverted-pyramid writing convention typically used in news reporting, under which reporters are trained to place the news information or facts they want to convey in the article in descending order of importance. Extracting keywords on this basis therefore makes it possible to identify the viewpoint that an article treats as important.

In a news or newspaper article, the title and subtitle are important factors in a reader's decision whether to read the article, and therefore carry important information indicating the direction of the article. The lead generally means the first one or two sentences of the article and contains the important content that covers the article as a whole. In the body text that follows, the information supporting the lead is arranged in order of importance.

Accordingly, when the documents to be clustered are news articles or newspaper articles, it is preferable to extract keywords from the title, subtitle, and lead in order to improve the accuracy of the document clustering.

Next, the importance calculation means (120) calculates the importance of each extracted keyword based on the body text of the document.

The calculation of keyword importance according to a preferred embodiment of the present invention is described below.

The importance of a keyword may be set to be higher the earlier it appears in the body text, the more often it is repeated in the body text, and the longer the paragraph or sentence in which it appears.

Accordingly, the importance of a keyword is preferably calculated from four variables: the position of the paragraph containing the keyword, the number of occurrences of the keyword, the length of the paragraph containing the keyword, and the length of the sentence containing the keyword. However, the present invention is not limited thereto.

[Table 1] below summarizes the variables used in the importance calculation according to the preferred embodiment of the present invention.

[Table 1]
Symbol: Description
A(i): i-th document in the document set
La(i): length of A(i)
Lp(i,j): length of the j-th paragraph in A(i)
Ls(i,k): length of the k-th sentence in A(i)
W(i): set of keywords of A(i)
w_n: n-th keyword of W(i)
v_n: importance value of w_n
Va(i): set of (keyword, importance) pairs
d_pj: diminishing factor of the j-th paragraph in A(i)
d_sk: diminishing factor of the k-th sentence in A(i)

In this embodiment, the four variables described above are applied to the calculation of keyword importance as follows.

First, a diminishing factor is used to take into account the importance associated with the position of the paragraph or sentence containing the keyword. Second, the ratio of the length of the paragraph containing the keyword to the total article length is used. Third, the ratio of the length of the sentence containing the keyword to the total article length is used. Fourth, to reflect the number of times the keyword appears in the article, the importance values of all occurrences of each keyword are summed.

An example of applying this importance calculation to news articles is described using the two articles below.

As the two examples show, noun phrases are first extracted as keywords from the title and the lead. Examples of the extracted keywords are marked with rectangles.

The importance of each keyword is calculated in proportion to the length of the paragraph or sentence in which the keyword appears relative to the total length of the article. For example, if the paragraph in which the keyword appears is 1/4 of the total article length, an importance of 0.25 is assigned, and if the sentence in which the keyword appears is 1/8 of the total article length, an importance of 0.125 is added.

In addition, the diminishing factor of each paragraph or sentence is applied so that the importance of a keyword varies with the position of the paragraph or sentence containing it.

The diminishing factor may be set to decrease the later a paragraph or sentence is located in the article. For example, if the first paragraph is 1/4 of the total article length, a diminishing factor of 1 may be applied to the first paragraph and a diminishing factor of 0.75 to the second paragraph.
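The four rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the equation of the specification: lengths are measured in characters, the diminishing factor is taken as 1 minus the fraction of the article that precedes the paragraph or sentence (which reproduces the 1 and 0.75 of the example above), and each occurrence contributes the sum of a paragraph term and a sentence term. The function name and data layout are hypothetical.

```python
def keyword_importance(paragraphs, keyword):
    """Sum a position-weighted score over every occurrence of `keyword`.

    paragraphs: list of paragraphs, each a list of sentence strings.
    Assumptions (not from the patent): lengths are character counts and
    the diminishing factor is 1 - (preceding length / article length).
    """
    total_len = sum(len(s) for p in paragraphs for s in p)
    if total_len == 0:
        return 0.0

    score = 0.0
    offset = 0  # number of characters that precede the current sentence
    for para in paragraphs:
        para_len = sum(len(s) for s in para)
        d_p = 1.0 - offset / total_len  # paragraph diminishing factor
        for sent in para:
            d_s = 1.0 - offset / total_len  # sentence diminishing factor
            hits = sent.count(keyword)
            # Each occurrence adds a paragraph term and a sentence term.
            score += hits * (d_p * para_len / total_len
                             + d_s * len(sent) / total_len)
            offset += len(sent)
    return score
```

For an article whose first paragraph is one quarter of its length and consists of a single sentence, a keyword appearing once in that paragraph scores 1 × 0.25 + 1 × 0.25 = 0.5 under these assumptions.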

Although this example of the importance calculation was described for news articles, the calculation is clearly not limited to them and can also be applied to documents such as Internet newspaper articles, papers, and books. It is more accurate when the document is arranged in order of importance of its content.

The importance calculation described above can be expressed precisely by the following [Equation 1].

The mathematical meaning of each variable in [Equation 1] is described in [Table 1] and is therefore omitted here.

Next, the similarity calculation means (130) calculates the similarity between documents using the keywords extracted from each document and their calculated importance.

The set Va(i) of (keyword, importance) pairs extracted from each document can be regarded as a vector in Euclidean space, and the similarity is calculated on the basis of this vector.

The similarity is preferably calculated using the cosine measure, because document similarity is better calculated from the ratios between importance values than from their absolute magnitudes.

The cosine measure is one of the representative methods for measuring the distance between two vectors. It measures similarity using the difference between the directions in which two vectors point, that is, the angle between them.

For example, the two-dimensional vectors (2,2) and (1,1) point in the same direction and are therefore judged to be identical in terms of similarity. The cosine measure is usually contrasted with the method of calculating the Euclidean distance between two vectors.
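As a minimal sketch of the cosine measure, the following Python function treats each document as a sparse vector stored in a dictionary from keyword to importance. The dictionary representation and the function name are assumptions made for illustration.

```python
import math


def cosine_similarity(va, vb):
    """Cosine of the angle between two sparse keyword vectors.

    va, vb: dicts mapping keyword -> importance (hypothetical layout
    for the (keyword, importance) pair set Va(i)).
    """
    # Only keywords present in both documents contribute to the dot product.
    dot = sum(w * vb.get(k, 0.0) for k, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Applied to the vectors (2,2) and (1,1) from the example, written as `{"x": 2, "y": 2}` and `{"x": 1, "y": 1}`, the function returns a value equal to 1 up to floating-point rounding, even though the Euclidean distance between the two vectors is not zero.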

Next, the cluster number determining means (140) groups the documents using the similarity calculation result, determines a suitable number of document clusters, and generates a document clustering result according to the determined number of clusters.

To group the documents, the Hierarchical Agglomerative Clustering method can be used, with which the number of article groups can be determined automatically.

Because it cannot be predicted into how many kinds of document groups the documents to be clustered will fall, the number of document groups cannot be fixed in advance when performing document clustering according to this embodiment.

The Hierarchical Agglomerative Clustering method is one of the representative hierarchical clustering methods and creates document groups by repeatedly merging the closest pair of clusters.

In the Hierarchical Agglomerative Clustering method, every individual element is initially set as its own cluster, and clustering proceeds from there by merging the nearest vectors until all elements are combined into a single cluster. When a cluster consists of several vectors, the mean of those vectors is the representative value of the cluster, and the two nearest clusters are merged by computing distances between representative values.
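The merging procedure can be sketched as follows. This is an unoptimized illustration, assuming dense vectors given as lists of numbers and a Euclidean distance between cluster means by default; a distance derived from the cosine measure could be passed as `distance` instead. All names are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerate(vectors, distance=euclidean):
    """Merge the two closest clusters until only one cluster remains.

    Returns a dict mapping each cluster count k to the clustering that
    has k clusters, each cluster being a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]  # one cluster per element
    levels = {len(clusters): [c[:] for c in clusters]}
    while len(clusters) > 1:
        # The representative value of a cluster is the mean of its vectors.
        reps = [centroid([vectors[i] for i in c]) for c in clusters]
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = distance(reps[a], reps[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (a, b)] + [merged]
        levels[len(clusters)] = [c[:] for c in clusters]
    return levels
```

Keeping the clustering for every cluster count makes it possible to choose the number of clusters afterwards without running the merge again.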

A suitable number of document clusters is determined automatically using the Elbow Criterion.

Applying the Hierarchical Agglomerative Clustering method produces a clustering result for each possible number of clusters. That is, results are available ranging from the case where all elements form a single cluster to the case where every element forms its own cluster.

The Elbow Criterion is one method of choosing the number of clusters from among these possibilities. It uses the degree to which the variance (how far the vectors lie from the center of their cluster) decreases.

For example, when all elements are in one cluster, the vectors are spread widely around the cluster center, so the variance is larger than when the elements form two clusters. Using this property, the number of clusters is increased step by step and fixed at the point where increasing it further no longer reduces the variance.
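The text does not state how the point of no further decrease is detected, so the following sketch assumes a relative threshold: it stops at the first cluster count for which adding one more cluster reduces the variance by less than a fraction `min_gain` of the variance at the smallest cluster count. The threshold, its default value, and the function names are assumptions.

```python
def within_cluster_variance(vectors, clusters):
    """Mean squared distance of each vector from its cluster center.

    clusters: list of clusters, each a list of indices into `vectors`.
    """
    total = 0.0
    for members in clusters:
        points = [vectors[i] for i in members]
        center = [sum(col) / len(points) for col in zip(*points)]
        total += sum(sum((x - c) ** 2 for x, c in zip(p, center))
                     for p in points)
    return total / len(vectors)


def elbow(variance_by_k, min_gain=0.1):
    """Pick the cluster count at which the variance stops dropping.

    variance_by_k: dict mapping cluster count k -> variance.
    min_gain     : hypothetical threshold, as a fraction of the
                   variance at the smallest k.
    """
    ks = sorted(variance_by_k)
    base = variance_by_k[ks[0]]
    for k, next_k in zip(ks, ks[1:]):
        drop = variance_by_k[k] - variance_by_k[next_k]
        if base == 0 or drop / base < min_gain:
            return k  # one more cluster would barely reduce the variance
    return ks[-1]
```

For the variances 100, 40, 10, 9 and 8.5 at one to five clusters, the drop from three to four clusters is only 1 percent of the initial variance, so the sketch selects three clusters.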

Finally, the post-processing means (150) corrects, through cooperative clustering, errors that may occur in the clustering result produced by the cluster number determining means (140).

The cooperative clustering performed by the post-processing means (150) is described below.

When the document structure-based clustering method described above is actually applied to document clustering, unexpected errors may occur. A keyword that the document treats as important may appear only in the latter part of the document, and some documents may use different terms for what is substantially the same subject. In such cases, errors may occur in extracting the main keywords of a document, and the accuracy of the clustering result may decrease.

Because building in semantic knowledge is a very complex and difficult task, the cooperative clustering according to the preferred embodiment of the present invention corrects the errors in the clustering result described above by drawing on the collective intelligence of users. Users can fully understand the meaning of a document and identify its focus. Collecting and using their feedback can therefore be an appropriate way to improve the accuracy of the clustering result.

The cooperative clustering according to this embodiment may be performed using user feedback collected through the tagging interface and the drag-and-drop interface of the input/output unit (200). Users provide words that reflect the focus of a document through the tagging interface, and the provided words are reflected in the extracted word set. It is also preferable to give the words provided through the tagging interface greater weight when calculating similarity.

More specifically, documents whose word set does not include the keyword entered through the tagging interface are searched for, the entered keyword is added to their word set, and its importance is calculated in the same manner as for other keywords. This compensates for cases in which an important keyword could not be extracted automatically because of characteristics of the document structure or limitations of the morphological analyzer.
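The tagging step can be sketched as below. The data layout (a dictionary mapping each document to its keyword weights) and the names `apply_tag` and `importance_fn` are assumptions made for illustration; the text only says that the importance of the added keyword is calculated in the same way as for other keywords.

```python
def apply_tag(word_sets, tagged_doc_ids, keyword, importance_fn):
    """Add a user-supplied keyword to the word sets that lack it.

    word_sets:      dict mapping document id -> dict of keyword -> importance
    tagged_doc_ids: ids of the documents the user tagged (assumed input)
    importance_fn:  hypothetical callable (doc_id, keyword) -> importance,
                    standing in for the usual importance calculation
    Returns the ids of the documents whose word set was extended.
    """
    updated = []
    for doc_id in tagged_doc_ids:
        weights = word_sets[doc_id]
        if keyword not in weights:
            weights[keyword] = importance_fn(doc_id, keyword)
            updated.append(doc_id)
    return updated
```

Documents that already contain the keyword are left untouched, so repeated tagging by several users does not change their weights.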

Preferably, users can select a misclassified document through the drag-and-drop interface and move it to the appropriate cluster. A document moved by drag and drop is given a close semantic relationship with the cluster it is moved to and a distant semantic relationship with the cluster it is moved from. FIG. 3 shows an example of moving a document with the drag-and-drop interface.

This semantic relationship may be reflected in the clustering result by adjusting the weights assigned to words.

Specifically, the words contained in the moved document are divided into words that overlap with the word set of the destination cluster and words that overlap with the word set of the original cluster. Words that overlap with the word set of the destination cluster are given weights similar to the weights of those words in the destination cluster, and words that overlap with the word set of the original cluster have their weights reduced to values close to 0. In this way, in the distance calculation between vectors described above, the document moves away from the original cluster and closer to the destination cluster.
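A minimal sketch of this weight adjustment follows. The text does not say how strongly weights are reduced or how a word that appears in both clusters is treated, so the `shrink` factor and the rule that the destination cluster takes precedence are assumptions; the function name `reweight_moved_document` is hypothetical.

```python
def reweight_moved_document(doc_weights, source_words, target_weights, shrink=0.1):
    """Return adjusted keyword weights for a document moved between clusters.

    doc_weights:    dict keyword -> weight for the moved document
    source_words:   set of keywords of the cluster before the move
    target_weights: dict keyword -> weight in the cluster after the move
    shrink:         assumed factor that pushes source-cluster words toward 0
    """
    adjusted = {}
    for word, weight in doc_weights.items():
        if word in target_weights:
            # Assumption: copy the destination cluster's weight for shared words.
            adjusted[word] = target_weights[word]
        elif word in source_words:
            # Words tied only to the original cluster are pushed toward zero.
            adjusted[word] = weight * shrink
        else:
            adjusted[word] = weight
    return adjusted
```

The function returns a new dictionary rather than modifying the document in place, which makes it easy to undo a move if the user drags the document back.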

The post-processing means (150) post-processes the document clustering result through the cooperative clustering described above and supplements the document clustering result by reflecting the outcome of the post-processing.

In addition, the input/output unit (200) receives the document information to be clustered, outputs the result of the document clustering, and receives error information about the clustering result by way of user feedback.

The document information to be clustered may be received by connecting to a wired Internet network using the TCP/IP protocol or to a wireless Internet network using the WAP protocol or the like, or may be received from a USB memory or the like, but the invention is not limited thereto.

The input/output unit (200) also preferably includes an Internet browser (for example, Netscape or Internet Explorer) capable of displaying document information in HTML (Hyper Text Markup Language) form, and preferably includes a function of displaying the document information or the document clustering result on a web page using that browser, but is not limited thereto.

Furthermore, the input/output unit (200) preferably includes a tagging interface and a drag-and-drop interface.

The document clustering result can be supplemented by reflecting the key words received through the tagging interface in the document clustering process, or by allowing users to correct the document clustering result more directly through the drag-and-drop interface.

Hereinafter, a document structure-based clustering method with improved reliability according to a preferred embodiment of the present invention is described with reference to FIG. 4.

FIG. 4 is an overall flowchart of the document structure-based clustering method with improved reliability according to a preferred embodiment of the present invention.

First, as shown in the figure, the input/output unit (200) receives the document information to be clustered (S10).

Next, the keyword extracting means (110) extracts keywords from the input documents (S20).

Step S20 preferably includes a step of classifying the input documents by subject.

The classification by subject can be carried out with conventional subject classification techniques, so a detailed description is omitted.

Preferably, step S20 may include a step of removing stop words from the input documents and a step of extracting roots. Stop words may be removed using a stop-word list (for example, the Rijsbergen list).

For reference, stop words (also called noise words) are words that are ignored during a search, or words that a search engine excludes from its index when building a database. In Korean they include particles, suffixes, conjunctions, and endings such as '~를', '~을', '~에서', and '~와'; in English they include verbs, auxiliary verbs, personal pronouns, demonstrative pronouns, and prepositions such as 'a', 'an', 'is', 'are', 'if', 'for', 'but', 'this', and 'may'.

A root is the morpheme, among those that make up a word, that carries the substantive meaning of the word. In a compound word it is each of the substantive morphemes that were combined, and in a derived word it is the part that remains after the affixes are removed. For example, '덮' in '덮개' (cover), '사람' in '사람답다' (to be like a decent person), and '넉넉' in '넉넉하다' (to be ample) are roots.
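The two preprocessing steps above can be illustrated with a toy English example. The stop-word set reuses the English examples given in the text; the suffix list and the functions `strip_suffix` and `preprocess` are simplifying assumptions and do not stand in for a real morphological analyzer, which Korean text in particular would require.

```python
# English stop-word examples taken from the text above.
STOP_WORDS = {"a", "an", "is", "are", "if", "for", "but", "this", "may"}

# Assumption: a deliberately tiny suffix list, for illustration only.
SUFFIXES = ("ing", "ed", "s")


def strip_suffix(word):
    """Crude root extraction: drop one known suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip punctuation, remove stop words, then extract roots."""
    tokens = [token.strip(".,!?;:\"'()").lower() for token in text.split()]
    return [strip_suffix(token) for token in tokens if token and token not in STOP_WORDS]
```

Stop words are removed before root extraction, matching the order of the two steps described above.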

Next, the importance calculating means (120) calculates the importance of each extracted keyword based on the body text of the document (S30).

Step S30 preferably includes a step of calculating a diminishing factor to take into account the position within the document of the paragraph or sentence containing the keyword, a step of calculating the ratio of the length of the paragraph or sentence containing the keyword to the length of the whole document, and a step of summing the importance over every occurrence of each keyword in the document.
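The three sub-steps of S30 might be combined as in the sketch below. The exact equation is not reproduced in this text, so everything about the combination is an assumption: the diminishing factor is modeled as a geometric decay over the paragraph index, each occurrence contributes the product of that factor and the paragraph-to-document length ratio, and the contributions are summed. The name `keyword_importance` and the `decay` parameter are hypothetical.

```python
def keyword_importance(doc_length, occurrences, decay=0.8):
    """Illustrative importance score for one keyword in one document.

    doc_length:  length of the whole document (for example, in words)
    occurrences: list of (paragraph_index, paragraph_length) pairs, one per
                 appearance of the keyword; paragraph_index starts at 0
    decay:       assumed base of the diminishing factor; later paragraphs
                 contribute less
    """
    total = 0.0
    for paragraph_index, paragraph_length in occurrences:
        diminishing = decay ** paragraph_index      # position in the document
        ratio = paragraph_length / doc_length       # relative paragraph length
        total += diminishing * ratio                # summed over occurrences
    return total
```

Under these assumptions, a keyword that appears in the opening paragraph scores higher than the same keyword appearing only near the end, which reflects the idea that an article's structure signals what it is about.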

Next, the similarity calculating means (130) calculates the similarity between documents using the keywords extracted from each document and their calculated importance (S40).
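The claims name a cosine measure for this similarity. A minimal sketch, assuming each document is represented as a dictionary from keyword to importance (the representation and the name `cosine_similarity` are illustrative choices):

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse keyword-importance vectors.

    a, b: dicts mapping keyword -> importance. Returns 0.0 if either is empty
    or has zero norm.
    """
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = sqrt(sum(weight * weight for weight in a.values()))
    norm_b = sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Only keywords shared by both documents contribute to the numerator, so documents with no keywords in common have a similarity of 0.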

Next, the cluster number determining means (140) groups the documents using the similarity calculation result, determines the suitable number of document clusters, and generates the document clustering result according to the determined number of clusters (S50).

Next, the input/output unit (200) outputs the generated document clustering result (S60).

Finally, the post-processing means (150) corrects, through cooperative clustering, errors that may occur in the clustering result produced by the cluster number determining means (140) (S70).

Step S70 preferably includes a step of receiving a keyword for a document from a user through the tagging interface, or a step of receiving, through the drag-and-drop interface, a move signal for a document that the user has judged to be wrongly clustered.

Hereinafter, the results of verifying the document structure-based clustering device and method with improved reliability according to a preferred embodiment of the present invention are described.

News articles were used as the documents for verification, and two verifications were performed.

The first verification used sets of news articles, one set taken from each of the politics, economy, and society sections of Google News. Document structure-based clustering was performed on the extracted news articles, and the clustering result for each set of news articles was given to several groups of four people each.

The participants were then asked to correct any misclassified articles in the clustering results provided. Participants moved articles they judged to be misclassified to the group they considered appropriate or, when they judged an article to be based on a viewpoint different from those of the existing groups, created a new group and placed the article in it.

The verification results were compared with the results of the TF (Term Frequency) method and of a structure-based clustering method in which keywords were selected, and their importance calculated, from the title and lead only.

The TF method extracts keywords from the whole article regardless of the article's structure and assigns weights based solely on the number of times each keyword appears. The results of the structure-based clustering method that selects keywords and calculates importance from the title and lead only allow a meaningful comparison with the case in which the body is also used to select keywords and calculate importance.

The verification results are shown in [Table 2] and [Table 3] below.

[Table 2]
| Average error count | Structure-based method (title + lead + body) | Structure-based method (title + lead) | TF method |
| --- | --- | --- | --- |
| Politics (12 articles) | 3 | 4 | 3.5 |
| Economy (19 articles) | 0.25 | 2 | 5 |
| Society (18 articles) | 5.75 | 6 | 8 |

[Table 3]
| Total error count | Structure-based method (title + lead + body) | Structure-based method (title + lead) | TF method |
| --- | --- | --- | --- |
| Politics (12 articles) | 4 | 5 | 4 |
| Economy (19 articles) | 1 | 2 | 7 |
| Society (18 articles) | 7 | 8 | 14 |

[Table 2] and [Table 3] show, for each subject, the average number of articles that the participating users identified as erroneous and the total number of articles identified as erroneous for that subject, respectively.

[Table 2] and [Table 3] show that the structure-based clustering method that selects keywords from the title and lead and calculates importance with the body taken into account performs best, and that its advantage is especially pronounced for the economy and society articles.

The second verification was carried out by dividing 75 participants, aged 25 to 35, into three groups of 25 and providing them with articles on the same subject. The participants had not looked into the given subject in detail before the verification. To keep article-reading habits from affecting the verification results, the participants were distributed fairly among the three groups.

Each group was provided with articles organized by the structure-based classification according to the preferred embodiment of the present invention and with articles from Google News. The average number of articles the participants read and the average number of viewpoints they encountered in those articles are shown in [Table 4] below.

[Table 4]
| | Average number of articles read | Average number of viewpoints read |
| --- | --- | --- |
| Structure-based classification | 4 | 3 |
| Google News | 2.5 | 1.1 |

As [Table 4] shows, when articles were provided through the structure-based classification, participants read relatively more articles and encountered a wider range of viewpoints on the same subject. In particular, for the average number of viewpoints read, the structure-based classification method provided nearly three times as many viewpoints as Google News (3 versus 1.1).

The present invention has been described and illustrated above in connection with a preferred embodiment intended to exemplify its technical idea, but the invention is not limited to the configuration and operation exactly as illustrated and described. Those skilled in the art will appreciate that numerous changes and modifications can be made to the invention without departing from the scope of the technical idea. Accordingly, all such appropriate changes, modifications, and equivalents should be regarded as falling within the scope of the present invention.

FIG. 1 is an overall configuration diagram of a document structure-based clustering device with improved reliability according to a preferred embodiment of the present invention.

FIG. 2 is a schematic diagram of a typical article structure.

FIG. 3 is an exemplary view of a drag-and-drop interface according to a preferred embodiment of the present invention.

FIG. 4 is an overall flowchart of a document structure-based clustering method with improved reliability according to a preferred embodiment of the present invention.

Claims

A document structure-based clustering device with improved reliability, comprising:

a document clustering unit (100) that classifies input documents by subject, extracts keywords reflecting the core content of the classified documents, calculates the importance of the extracted keywords, performs document clustering using the extracted keywords and the calculated importance, and supplements the result of the document clustering using cooperative clustering; and

an input/output unit (200) that receives document information to be clustered, outputs the result of the document clustering, and receives error information about the output document clustering result.

The device of claim 1,

wherein the document clustering unit (100) comprises:

keyword extracting means (110) for classifying the input documents by subject and extracting keywords from the classified documents;

importance calculating means (120) for calculating the importance of each extracted keyword based on the body text of the document;

similarity calculating means (130) for calculating the similarity between documents using the extracted keywords and the calculated importance;

cluster number determining means (140) for determining a suitable number of document clusters using the similarity calculation result and generating a document clustering result according to the determined number of clusters; and

post-processing means (150) for correcting, by cooperative clustering, errors in the clustering result produced by the cluster number determining means (140).

The device of claim 2,

wherein the similarity is calculated using a cosine measure.

The device of claim 1,

wherein the input documents are news articles or newspaper articles.

The device of claim 4,

wherein the document clustering unit (100) extracts keywords from the title, subtitle, and lead of the news article or newspaper article.

The device of claim 1,

wherein the importance of the keyword

is calculated using one or more variables among the position of the paragraph containing the keyword, the frequency of occurrence of the keyword, the length of the paragraph containing the keyword, and the length of the sentence containing the keyword.

The device of claim 1,

wherein the importance of the keyword is calculated by the following equation.

(where v is the importance value of the keyword, La(i) is the length of the i-th document of the document set, Lp(i, j) is the length of the j-th paragraph in the i-th document of the document set, Ls(i, k) is the length of the k-th sentence in the i-th document of the document set, d_Pj is the diminishing factor of the j-th paragraph in the i-th document of the document set, and d_Sk is the diminishing factor of the k-th sentence in the i-th document of the document set.)

The device of claim 1,

wherein the document clustering is performed using a Hierarchical Agglomerative Clustering method.

The device of claim 8,

wherein the suitable number of document clusters is determined automatically using the Elbow Criterion.

The device of claim 1,

wherein the input/output unit (200) includes one or more of a tagging interface and a drag-and-drop interface.

A document structure-based clustering method with improved reliability, comprising:

(a) receiving document information to be clustered;

(b) extracting keywords from the input documents;

(c) calculating the importance of the extracted keywords;

(d) calculating the similarity between documents using the extracted keywords and the calculated importance;

(e) generating a document clustering result using the similarity calculation result;

(f) outputting the generated document clustering result; and

(g) supplementing the output document clustering result by cooperative clustering.

The method of claim 11,

wherein step (b) comprises:

(b-1) classifying the input documents by subject.

The method of claim 11,

wherein step (b) comprises:

(b-2) removing stop words from the input documents; and

(b-3) extracting roots from the documents from which the stop words have been removed.

The method of claim 11,

wherein step (c) comprises:

(c-1) calculating a diminishing factor to take into account the position within the document of the paragraph or sentence containing the keyword;

(c-2) calculating the ratio of the length of the paragraph or sentence containing the keyword to the length of the whole document; and

(c-3) summing the importance over every occurrence of each keyword in the document.

The method of claim 11,

wherein step (g) comprises:

(g-1) receiving a keyword for a document from a user through a tagging interface.

The method of claim 15,

wherein step (g-1) comprises:

(g-1-1) searching for documents whose word set does not include the keyword entered through the tagging interface;

(g-1-2) adding the keyword entered through the tagging interface to the word sets of the documents found; and

(g-1-3) calculating the importance of the added keyword.

The method of claim 11,

wherein step (g) comprises:

(g-2) receiving, through a drag-and-drop interface, a move signal for a document that a user has judged to be wrongly clustered.

The method of claim 17,

wherein step (g-2) comprises:

(g-2-1) dividing the words contained in the moved document into words that overlap with the word set of the cluster after the move and words that overlap with the word set of the cluster before the move; and

(g-2-2) assigning different weights to the words that overlap with the word set of the cluster after the move and the words that overlap with the word set of the cluster before the move.