KR102268570B1

KR102268570B1 - Apparatus and method for generating document cluster

Info

Publication number: KR102268570B1
Application number: KR1020200005113A
Authority: KR
Inventors: 윤병운; 박인채; 노태연; 안재형; 김동하; 전윤수; 송기식; 김송희
Original assignee: 현대엔지비 주식회사
Priority date: 2020-01-15
Filing date: 2020-01-15
Publication date: 2021-06-24

Abstract

Disclosed are a device for generating a document cluster, and a method thereof. The device for generating a document cluster according to one embodiment of the present invention comprises: a document information acquisition unit which acquires bibliographic information and text information for each of the plurality of documents; a citation-based similarity matrix generation unit which generates a citation-based similarity matrix having, as an element, the citation-based similarity between the plurality of documents based on the bibliographic information; a text-based similarity matrix generation unit which generates a text-based similarity matrix having a text-based similarity between the plurality of documents as an element based on the text information; a combination similarity matrix generation unit which generates a combination similarity matrix having a combination similarity between the plurality of documents as an element based on the citation-based similarity matrix and the text-based similarity matrix; and a cluster generation unit which generates one or more clusters by clustering the plurality of documents based on the combination similarity matrix. According to the present invention, it is possible to derive the optimal clustering result based on a similarity between documents.

Description

Apparatus and method for generating document cluster

The disclosed embodiments relate to techniques for generating document clusters.

To manage large volumes of documents efficiently, there have been attempts to calculate the similarity between documents in various ways and to cluster the documents based on that similarity.

However, most of these attempts rely only on the bibliographic information of a document or only on the text information in the document, so the clustering result differs significantly depending on the type of document.

Korean Patent Registration No. 10-1769035 (registered on August 10, 2017)

The disclosed embodiments are intended to effectively cluster various kinds of documents.

An apparatus for generating a document cluster according to an embodiment includes: a document information acquisition unit that acquires bibliographic information and text information for each of a plurality of documents; a citation-based similarity matrix generator that generates, based on the bibliographic information, a citation-based similarity matrix whose elements are the citation-based similarities between the plurality of documents; a text-based similarity matrix generator that generates, based on the text information, a text-based similarity matrix whose elements are the text-based similarities between the plurality of documents; a joint similarity matrix generator that generates, based on the citation-based similarity matrix and the text-based similarity matrix, a joint similarity matrix whose elements are the joint similarities between the plurality of documents; and a cluster generator that generates one or more clusters by clustering the plurality of documents based on the joint similarity matrix.

The citation-based similarity matrix generator may calculate the citation-based similarity based on any one of a co-citation strength and a bibliographic coupling strength, and the text-based similarity matrix generator may calculate the text-based similarity using a vector space-based similarity calculation method.

An apparatus for generating a document cluster according to another embodiment may further include a weight calculator that calculates a first information entropy based on the citation-based similarity and a second information entropy based on the text-based similarity, and calculates a citation weight and a text weight based on the citation-based similarity, the text-based similarity, the first information entropy, and the second information entropy.

The joint similarity matrix generator may calculate the joint similarity based on the citation-based similarity, the text-based similarity, the citation weight, and the text weight.

The cluster generator may generate the one or more clusters using any one of a k-means algorithm, density-based clustering, probability distribution-based clustering, hierarchical clustering, and the Girvan-Newman algorithm.

The cluster generator may generate a plurality of cluster scenarios by varying the number of clusters to be generated.

An apparatus for generating a document cluster according to another embodiment may further include a cluster evaluator that selects an optimal cluster scenario from among the plurality of cluster scenarios.

The cluster evaluator may select any one of a plurality of preset indices as a reference index, and select the cluster scenario for which the reference index has its maximum value as the optimal cluster scenario.

The cluster evaluator may calculate, for each of a plurality of preset indices, a rank of each of the plurality of cluster scenarios, calculate the average of the calculated ranks for each cluster scenario, and select the cluster scenario with the smallest average as the optimal cluster scenario.

A method for generating a document cluster according to an embodiment includes: acquiring bibliographic information and text information for each of a plurality of documents; generating, based on the bibliographic information, a citation-based similarity matrix whose elements are the citation-based similarities between the plurality of documents; generating, based on the text information, a text-based similarity matrix whose elements are the text-based similarities between the plurality of documents; generating, based on the citation-based similarity matrix and the text-based similarity matrix, a joint similarity matrix whose elements are the joint similarities between the plurality of documents; and generating one or more clusters by clustering the plurality of documents based on the joint similarity matrix.

The generating of the citation-based similarity matrix may include calculating the citation-based similarity based on any one of a co-citation strength and a bibliographic coupling strength, and the generating of the text-based similarity matrix may include calculating the text-based similarity using a vector space-based similarity calculation method.

A method for generating a document cluster according to another embodiment may further include: calculating a first information entropy based on the citation-based similarity; calculating a second information entropy based on the text-based similarity; and calculating a citation weight and a text weight based on the citation-based similarity, the text-based similarity, the first information entropy, and the second information entropy.

The generating of the joint similarity matrix may include calculating the joint similarity based on the citation-based similarity, the text-based similarity, the citation weight, and the text weight.

The generating of the clusters may include generating the one or more clusters using any one of a k-means algorithm, density-based clustering, probability distribution-based clustering, hierarchical clustering, and the Girvan-Newman algorithm.

The generating of the clusters may include generating a plurality of cluster scenarios by varying the number of clusters to be generated.

A method for generating a document cluster according to another embodiment may further include selecting an optimal cluster scenario from among the plurality of cluster scenarios.

The selecting may include selecting any one of a plurality of preset indices as a reference index, and selecting the cluster scenario for which the reference index has its maximum value as the optimal cluster scenario.

The selecting may include: calculating, for each of a plurality of preset indices, a rank of each of the plurality of cluster scenarios; calculating the average of the calculated ranks for each cluster scenario; and selecting the cluster scenario with the smallest average as the optimal cluster scenario.
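The rank-averaging selection can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `select_by_average_rank`, the scenario labels, and the index values are hypothetical, and the sketch assumes that a larger value is better for every index, so rank 1 goes to the largest value.

```python
def select_by_average_rank(index_values):
    """Pick the scenario with the smallest average rank.

    index_values maps a scenario name to a list of index values, one per
    preset index. Assumption: for every index, larger is better.
    """
    names = list(index_values)
    n_indices = len(index_values[names[0]])
    rank_sums = {name: 0 for name in names}
    for k in range(n_indices):
        # Rank the scenarios on index k; rank 1 is the best (largest) value.
        # Ties are broken by insertion order in this sketch.
        ordered = sorted(names, key=lambda name: index_values[name][k], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            rank_sums[name] += rank
    average_rank = {name: rank_sums[name] / n_indices for name in names}
    best = min(average_rank, key=average_rank.get)
    return best, average_rank


# Hypothetical values of three indices for three cluster-count scenarios.
index_values = {
    "10 clusters": [0.41, 120.0, 0.80],
    "15 clusters": [0.47, 150.0, 0.70],
    "20 clusters": [0.44, 110.0, 0.90],
}
best, average_rank = select_by_average_rank(index_values)
print(best, average_rank)
```

With these made-up values, "15 clusters" ranks first on two of the three indices and therefore has the smallest average rank.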

According to the disclosed embodiments, clustering documents in consideration of both the citation-based similarity and the text-based similarity makes it possible to use whichever of the bibliographic information and the text information in a document is better suited to cluster classification.

In addition, according to the disclosed embodiments, selecting the optimal cluster scenario from among the cluster scenarios generated for different numbers of clusters makes it possible to derive an optimal clustering result based on the similarity between documents.

FIG. 1 is a block diagram illustrating an apparatus for generating a document cluster according to an embodiment.
FIG. 2 is a block diagram illustrating an apparatus for generating a document cluster according to an additional embodiment.
FIG. 3 is a diagram for explaining a process of calculating a citation weight and a text weight according to an embodiment.
FIG. 4 is a flowchart illustrating a method for generating a document cluster according to an embodiment.
FIG. 5 is a flowchart illustrating a method for generating a document cluster according to an additional embodiment.
FIG. 6 is a block diagram illustrating a computing environment including a computing device suitable for use in exemplary embodiments.

Hereinafter, specific embodiments will be described with reference to the drawings. The following detailed description is provided to aid a comprehensive understanding of the methods, apparatuses, and/or systems described herein. However, this is merely an example, and the disclosed embodiments are not limited thereto.

In describing the embodiments, a detailed description of related known technology is omitted where it could unnecessarily obscure the gist of the disclosed embodiments. The terms used below are defined in consideration of their functions in the disclosed embodiments and may vary according to the intention or custom of a user or operator. Their definitions should therefore be based on the content of this specification as a whole. The terminology used in the detailed description is intended only to describe the embodiments and should in no way be limiting. Unless clearly used otherwise, a singular expression includes the plural meaning. In this description, expressions such as "include" or "have" indicate certain features, numbers, steps, operations, elements, or parts or combinations thereof, and should not be construed as excluding the presence or possibility of one or more other features, numbers, steps, operations, elements, or parts or combinations thereof beyond those described.

Hereinafter, a 'document cluster' means a grouping of a plurality of documents in which similar documents are classified together according to the characteristics of the data in the documents.

FIG. 1 is a block diagram illustrating an apparatus 100 for generating a document cluster according to an embodiment.

Referring to FIG. 1, the apparatus 100 for generating a document cluster according to an embodiment includes a document information acquisition unit 110, a citation-based similarity matrix generator 120, a text-based similarity matrix generator 130, a joint similarity matrix generator 140, and a cluster generator 150.

The document information acquisition unit 110 acquires bibliographic information and text information for each of a plurality of documents.

Hereinafter, 'bibliographic information' refers to a list in which the bibliographic items of a document are collected, arranged, and organized in a predetermined manner. For example, 'bibliographic information' may include personal information about the author of the document, the country in which the document was written, the date on which the document was written, the number assigned to the document, classification information, and reference information on the documents it cites or is cited by. It is not necessarily limited to these, and may further include various other kinds of bibliographic information depending on the type or form of the document.

Hereinafter, 'text information' refers to text including the sentences or words that make up the document itself. For example, 'text information' may include, but is not limited to, the title of the document, the summary of the document, the content of the document, and the titles of the references listed in the document.

According to an embodiment, the document information acquisition unit 110 may acquire the bibliographic information and the text information from a database (not shown) separate from the apparatus 100 for generating a document cluster. However, this is not necessarily the case, and depending on the embodiment the database (not shown) may be provided within the apparatus 100.

The citation-based similarity matrix generator 120 generates, based on the bibliographic information, a citation-based similarity matrix whose elements are the citation-based similarities between the plurality of documents.

According to an embodiment, the citation-based similarity matrix generator 120 may calculate the citation-based similarity based on any one of a co-citation strength and a bibliographic coupling strength.

Specifically, the 'co-citation strength' is a value indicating the number of times two documents are cited together by one or more other documents, and it may change as new documents citing both of them are created.

The 'bibliographic coupling strength' is a value indicating the number of references that the two documents cite in common, and may be normalized by, for example, Equation 1 below.

[Equation 1]

Here, Equation 1 expresses the normalized bibliographic coupling strength in terms of the number of references commonly cited by documents A and B, the number of references cited by document A, and the number of references cited by document B.

According to an embodiment, the 'citation-based similarity' may be a value obtained by normalizing the co-citation strength or the bibliographic coupling strength to the range from 0 to 1 inclusive.

For example, the citation-based similarity between identical documents is 1, and the citation-based similarity between two documents that are not similar at all is 0.

According to an embodiment, the 'matrix having the citation-based similarity as an element' may be a square matrix whose row index and column index are the serial numbers assigned in order to the plurality of documents.

In this case, the elements on the main diagonal of the matrix have the value 1.
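The following Python sketch illustrates how raw co-citation and bibliographic coupling counts could be turned into a citation-based similarity matrix with ones on the main diagonal. It is only an illustration: the function names and the sample reference identifiers are hypothetical, and the cosine-style normalization (dividing the shared-reference count by the square root of the product of the two reference counts) is an assumption chosen for the example, not necessarily the normalization of Equation 1.

```python
import math


def co_citation_count(references, doc_a, doc_b):
    """Number of citing documents whose reference sets contain both doc_a and doc_b."""
    return sum(1 for cited in references if doc_a in cited and doc_b in cited)


def coupling_similarity_matrix(references):
    """Square matrix of normalized bibliographic coupling between documents.

    references[i] is the set of reference ids cited by document i.
    Assumption: cosine-style normalization, shared / sqrt(n_i * n_j).
    """
    n = len(references)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                sim[i][j] = 1.0  # a document is identical to itself
                continue
            shared = len(references[i] & references[j])
            denom = math.sqrt(len(references[i]) * len(references[j]))
            sim[i][j] = shared / denom if denom else 0.0
    return sim


# Hypothetical reference lists of three documents.
references = [
    {"r1", "r2", "r3", "r4"},
    {"r3", "r4", "r5", "r6"},
    {"r7"},
]
citation_sim = coupling_similarity_matrix(references)
print(citation_sim[0][1])                          # documents 0 and 1 share r3 and r4
print(co_citation_count(references, "r3", "r4"))   # r3 and r4 are cited together twice
```

Every value produced this way lies between 0 and 1, which matches the range stated for the citation-based similarity.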

The text-based similarity matrix generator 130 generates, based on the text information, a text-based similarity matrix whose elements are the text-based similarities between the plurality of documents.

According to an embodiment, the text-based similarity matrix generator 130 may calculate the text-based similarity using a vector space-based similarity calculation method.

Specifically, the vector space-based similarity calculation method computes a vector for each document from the text information in the document and calculates the similarity between documents from the computed vectors.

For example, the text-based similarity may be calculated using the Euclidean distance, cosine similarity, the Manhattan distance, Jaccard similarity, or the like.
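As an illustration of a vector space-based calculation, the sketch below builds a term-count vector for each document and compares the vectors with cosine similarity, one of the measures named above. The whitespace tokenization, the function names, and the sample texts are assumptions made for the example; a practical system would likely use a richer text representation.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    sq_a = sum(count * count for count in vec_a.values())
    sq_b = sum(count * count for count in vec_b.values())
    norm = math.sqrt(sq_a * sq_b)
    return dot / norm if norm else 0.0


def text_similarity_matrix(texts):
    """Square matrix whose (i, j) element is the similarity of texts i and j."""
    return [[cosine_similarity(a, b) for b in texts] for a in texts]


# Hypothetical document abstracts.
texts = [
    "battery cell cooling",
    "battery cell housing",
    "steering wheel sensor",
]
text_sim = text_similarity_matrix(texts)
print(text_sim[0][1])  # two of three terms shared
print(text_sim[0][2])  # no shared terms
```

Because term counts are non-negative, the cosine similarity here already falls between 0 and 1, so no further normalization is needed in this particular sketch.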

According to an embodiment, the text-based similarity matrix generator 130 may calculate the text-based similarity using a similarity calculation model that includes a deep learning-based artificial neural network (ANN) structure.

Meanwhile, according to an embodiment, the 'text-based similarity' may be a value obtained by normalizing the result of the vector space-based similarity calculation method or the result of the similarity calculation model to the range from 0 to 1 inclusive.

For example, the text-based similarity between identical documents is 1, and the text-based similarity between two documents that are not similar at all is 0.

According to an embodiment, the 'matrix having the text-based similarity as an element' may be a square matrix whose row index and column index are the serial numbers assigned in order to the plurality of documents.

The joint similarity matrix generator 140 generates, based on the citation-based similarity matrix and the text-based similarity matrix, a joint similarity matrix whose elements are the joint similarities between the plurality of documents.

Specifically, the joint similarity matrix generator 140 may generate a matrix in which each element is the sum, at a preset ratio, of the element of the citation-based similarity matrix and the element of the text-based similarity matrix located at the same index.
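A preset-ratio combination of this kind can be sketched as an element-wise weighted sum. The function name, the default ratio of 0.5, and the sample matrices are hypothetical, and the sketch assumes that the two ratios add up to 1 so that the result stays between 0 and 1.

```python
def combine_fixed_ratio(citation_sim, text_sim, citation_ratio=0.5):
    """Element-wise sum of two similarity matrices at a preset ratio.

    Assumption: the text ratio is 1 - citation_ratio.
    """
    text_ratio = 1.0 - citation_ratio
    return [
        [citation_ratio * c + text_ratio * t for c, t in zip(c_row, t_row)]
        for c_row, t_row in zip(citation_sim, text_sim)
    ]


# Hypothetical 2 x 2 similarity matrices for two documents.
citation_sim = [[1.0, 0.2], [0.2, 1.0]]
text_sim = [[1.0, 0.6], [0.6, 1.0]]
joint_sim = combine_fixed_ratio(citation_sim, text_sim, citation_ratio=0.5)
print(joint_sim)
```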

The cluster generator 150 generates one or more clusters by clustering the plurality of documents based on the joint similarity matrix.

According to an embodiment, the cluster generator 150 may generate the one or more clusters using any one of a k-means algorithm, density-based clustering, probability distribution-based clustering, hierarchical clustering, and the Girvan-Newman algorithm.
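To make the clustering step concrete, the sketch below applies one of the listed options, hierarchical clustering, directly to a similarity matrix. It is a simplified illustration rather than the apparatus's actual procedure: it assumes average linkage and a distance of 1 minus the joint similarity, and the function name and sample matrix are hypothetical.

```python
def agglomerative_clusters(sim, n_clusters):
    """Average-linkage agglomerative clustering on a similarity matrix.

    Assumption: the distance between documents i and j is 1 - sim[i][j].
    Returns a list of clusters, each a list of document indices.
    """
    clusters = [[i] for i in range(len(sim))]

    def average_distance(a, b):
        total = sum(1.0 - sim[i][j] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge y into x
        del clusters[y]
    return clusters


# Hypothetical joint similarity matrix for four documents.
joint_sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
clusters = agglomerative_clusters(joint_sim, n_clusters=2)
print(clusters)
```

Calling the function with different values of `n_clusters` on the same matrix yields different cluster assignments, which is one way to obtain several candidate results for later comparison.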

FIG. 2 is a block diagram illustrating an apparatus 200 for generating a document cluster according to an additional embodiment.

Referring to FIG. 2, the apparatus 200 for generating a document cluster according to an additional embodiment may further include a weight calculator 210 and a cluster evaluator 220. In the example shown in FIG. 2, the document information acquisition unit 110, the citation-based similarity matrix generator 120, the text-based similarity matrix generator 130, the joint similarity matrix generator 140, and the cluster generator 150 are the same as the components shown in FIG. 1, so a redundant description of them is omitted.

The weight calculator 210 calculates a first information entropy based on the citation-based similarities in the citation-based similarity matrix generated by the citation-based similarity matrix generator 120.

Specifically, 'information entropy' is a value representing the uncertainty of specific information. When the information can take a plurality of states, the information entropy is calculated by multiplying the probability of each state by the natural logarithm of that probability, summing the products over all states, and taking the negative of the sum. The 'first information entropy' represents the uncertainty of the citation-based similarity.

According to an embodiment, the weight calculator 210 may calculate the first information entropy by Equation 2 below.

[Equation 2]

$$H_c = -\left(p_c \log p_c + q_c \log q_c\right)$$

Here, $H_c$ is the first information entropy, $p_c$ is the probability corresponding to the citation-based similarity between two documents, $q_c$ is $1 - p_c$, and $\log$ denotes the natural logarithm with base $e$.

For example, if the citation-based similarity between two papers A and B is 80%, the probability $p_c$ corresponding to the similarity is 0.8 and $q_c = 1 - p_c$ is 0.2, so by Equation 2 above the first information entropy $H_c$ is approximately 0.5.
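The arithmetic of this example can be checked with a few lines of Python. The function name `binary_entropy` is hypothetical; the computation follows the description above, summing the probability times its natural logarithm over the two states and negating the result.

```python
import math


def binary_entropy(p):
    """Entropy (natural log) of a two-state variable with probabilities p and 1 - p."""
    q = 1.0 - p
    # Terms with zero probability contribute nothing to the sum.
    return -sum(x * math.log(x) for x in (p, q) if x > 0.0)


entropy = binary_entropy(0.8)
print(round(entropy, 4))  # 0.5004
```

The entropy is largest, at about 0.693 (the natural logarithm of 2), when the similarity is 0.5, and falls to 0 as the similarity approaches 0 or 1.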

The weight calculator 210 also calculates a second information entropy based on the text-based similarities in the text-based similarity matrix generated by the text-based similarity matrix generator 130.

Specifically, the 'second information entropy' represents the uncertainty of the text-based similarity.

According to an embodiment, the weight calculator 210 may calculate the second information entropy by Equation 3 below.

[Equation 3]

$$H_t = -\left(p_t \log p_t + q_t \log q_t\right)$$

Here, $H_t$ is the second information entropy, $p_t$ is the probability corresponding to the text-based similarity between two documents, $q_t$ is $1 - p_t$, and $\log$ denotes the natural logarithm with base $e$.

Next, the weight calculator 210 calculates a citation weight and a text weight based on the citation-based similarity, the text-based similarity, the first information entropy, and the second information entropy.

In this case, the citation weight and the text weight for any given pair of documents sum to 1.

FIG. 3 is a diagram for explaining a process of calculating a citation weight and a text weight according to an embodiment.

Referring to FIG. 3, the weight calculator 210 compares the first information entropy, calculated from the citation-based similarity at row i, column j of the citation-based similarity matrix, with the second information entropy, calculated from the text-based similarity at row i, column j of the text-based similarity matrix.

Then, when […] is greater than […], the weight calculator 210 proceeds as follows. If […] is 0.5 or more, it sets the citation weight equal to […] and the text weight equal to 1 - […]. If […] is less than 0.5, it sets the citation weight equal to 1 - […] and the text weight equal to […].

When […] is not greater than […], the weight calculator 210 proceeds as follows. If […] is 0.5 or more, it sets the citation weight equal to 1 - […] and the text weight equal to […]. If […] is less than 0.5, it sets the citation weight equal to […] and the text weight equal to 1 - […].

Referring back to FIG. 2, the joint similarity matrix generator 140 may calculate the joint similarity based on the citation-based similarity, the text-based similarity, the citation weight, and the text weight.

Specifically, the joint similarity matrix generator 140 may calculate the joint similarity by adding the product of the citation weight and the citation-based similarity to the product of the text weight and the text-based similarity.
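The weighted sum described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `joint_similarity` and the choice to pass one weight per matrix element are assumptions made for the example, and the weights are taken as given rather than derived from the information entropies.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def joint_similarity(
    cit_sim: Sequence[Sequence[float]],
    txt_sim: Sequence[Sequence[float]],
    cit_weight: Sequence[Sequence[float]],
    txt_weight: Sequence[Sequence[float]],
) -> Matrix:
    """Element-wise joint similarity: w_cit * s_cit + w_txt * s_txt.

    All four inputs are assumed to be square matrices of the same size,
    indexed by document (row i, column j).
    """
    n = len(cit_sim)
    return [
        [
            cit_weight[i][j] * cit_sim[i][j] + txt_weight[i][j] * txt_sim[i][j]
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical two-document example.
cit = [[1.0, 0.5], [0.5, 1.0]]
txt = [[1.0, 0.25], [0.25, 1.0]]
w_cit = [[0.5, 0.75], [0.75, 0.5]]
w_txt = [[0.5, 0.25], [0.25, 0.5]]
joint = joint_similarity(cit, txt, w_cit, w_txt)
```

In this example the off-diagonal element is 0.75 * 0.5 + 0.25 * 0.25, so the citation-based similarity dominates the pair because it carries the larger weight.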

The cluster generator 150 may generate a plurality of cluster scenarios by varying the number of clusters to be generated.

For example, the cluster generator 150 may set the number of clusters to be generated to 10, 15, and 20, and generate a first cluster scenario, a second cluster scenario, and a third cluster scenario as the document cluster generation results for the respective cases.
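As a rough sketch of how several cluster scenarios might be produced from one similarity matrix, the example below runs a simple average-linkage hierarchical clustering once per requested cluster count. Hierarchical clustering is only one of the algorithms the document lists; the average-linkage rule and the names `agglomerative_labels` and `cluster_scenarios` are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, List, Sequence


def agglomerative_labels(sim: Sequence[Sequence[float]], num_clusters: int) -> List[int]:
    """Average-linkage agglomerative clustering on a similarity matrix."""
    if not 1 <= num_clusters <= len(sim):
        raise ValueError("num_clusters must be between 1 and the number of documents")
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > num_clusters:
        best, best_pair = float("-inf"), (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Mean pairwise similarity between the two candidate clusters.
                link = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                link /= len(clusters[a]) * len(clusters[b])
                if link > best:
                    best, best_pair = link, (a, b)
        a, b = best_pair
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    labels = [0] * len(sim)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def cluster_scenarios(
    sim: Sequence[Sequence[float]], cluster_counts: Iterable[int]
) -> Dict[int, List[int]]:
    """One scenario (a label per document) for each requested cluster count."""
    return {k: agglomerative_labels(sim, k) for k in cluster_counts}
```

With a real document set, `cluster_counts` would be something like `(10, 15, 20)` as in the example above; each resulting label list is one cluster scenario that can then be scored.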

The cluster evaluation unit 220 selects an optimal cluster scenario from among the plurality of cluster scenarios generated by the cluster generator 150.

According to an embodiment, the cluster evaluation unit 220 may select any one of a plurality of preset indices as a reference index, and select the cluster scenario for which the reference index has its maximum value as the optimal cluster scenario.

For example, the cluster evaluation unit 220 may select a silhouette index as the reference index, and select the cluster scenario for which the silhouette index given by Equation 4 below has its maximum value as the optimal cluster scenario.

[Equation 4]

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

Here, $s(i)$ is the silhouette index, $a(i)$ is the average distance (average dissimilarity) between document i and each of the other documents belonging to the same cluster as document i, and $b(i)$ is the minimum, over all clusters to which document i does not belong, of the per-cluster average distance between document i and the documents in that cluster.

Specifically, the cluster evaluation unit 220 may calculate the distance between documents based on, for example, the Euclidean distance, the cosine distance, or the Manhattan distance, but is not necessarily limited thereto.
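The silhouette-based selection described above might look like the following sketch. It assumes a precomputed document-to-document distance matrix and scores a scenario by the mean silhouette value over all documents; that averaging step, the zero score for single-document clusters, and the names `silhouette_index` and `best_scenario` are illustrative assumptions rather than details stated in the document.

```python
from typing import Dict, List, Sequence


def silhouette_index(dist: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette value over all documents for one cluster scenario."""
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # single-document cluster: contributes 0 by convention
        # a: average distance to the other documents in the same cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest per-cluster average distance to any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


def best_scenario(
    dist: Sequence[Sequence[float]], scenarios: Dict[str, Sequence[int]]
) -> str:
    """Name of the scenario whose silhouette index is largest."""
    return max(scenarios, key=lambda name: silhouette_index(dist, scenarios[name]))
```

Any of the distances mentioned above (Euclidean, cosine, Manhattan) can be used to fill `dist`; the function itself only needs the resulting matrix.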

As another example, the cluster evaluation unit 220 may select a Blurring Mean-Shift (BMS) index as the reference index, and select the cluster scenario for which the BMS index given by Equation 5 below has its maximum value as the optimal cluster scenario.

[Equation 5]

Here, […] is the BMS index of document i, […] is a specific cluster h, […] is a specific cluster h', i and i' are documents belonging to […], j is a document belonging to […], […] is the cosine distance between document i and document i', and […] is the cosine distance between document i and document j.

As another example, when the cluster generator 150 has generated one or more clusters by applying the Girvan-Newman algorithm to the elements of the joint similarity matrix, the cluster evaluation unit 220 generates a network among the plurality of documents based on the joint similarity, and calculates an edge betweenness centrality for each edge in the generated network. The cluster evaluation unit 220 may then repeat a process of removing the edge having the highest edge betweenness centrality and calculating the modularity, and select the cluster scenario for which the calculated modularity is at its maximum as the optimal cluster scenario.

Specifically, the cluster evaluation unit 220 may calculate the edge betweenness centrality by Equation 6 below, and may calculate the modularity by Equation 7 below.

[Equation 6]

$$c_B(e) = \sum_{s,t \in V} \frac{\sigma(s,t \mid e)}{\sigma(s,t)}$$

Here, $e$ is the edge connecting node $u$ and node $v$, $c_B(e)$ is the edge betweenness centrality of edge $e$, V is the network, $s$ and $t$ are nodes belonging to V, $\sigma(s,t)$ is the number of shortest paths between node $s$ and node $t$, and $\sigma(s,t \mid e)$ is the number of shortest paths between node $s$ and node $t$ that pass through edge $e$.
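To make the edge betweenness definition concrete, the sketch below computes it for a small unweighted, undirected graph using breadth-first search and the usual back-propagation of path fractions (Brandes' approach). Several things here are assumptions for illustration: the graph is stored as an adjacency dictionary, edges are unweighted (for example, obtained by thresholding the joint similarity), each unordered node pair is counted once, and the name `edge_betweenness` is hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]


def edge_betweenness(graph: Graph) -> Dict[FrozenSet, float]:
    """Unnormalised edge betweenness for an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in graph for v in graph[u]}
    for source in graph:
        # Breadth-first search: count shortest paths and record predecessors.
        sigma = {node: 0 for node in graph}
        dist = {node: -1 for node in graph}
        preds = {node: [] for node in graph}
        sigma[source], dist[source] = 1, 0
        order = []
        queue = deque([source])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in graph[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Walk back from the farthest nodes, passing path fractions to edges.
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for u in preds[w]:
                share = sigma[u] / sigma[w] * (1.0 + delta[w])
                score[frozenset((u, w))] += share
                delta[u] += share
    # Every unordered pair was visited from both ends, so halve the totals.
    return {edge: value / 2.0 for edge, value in score.items()}


# Two triangles joined by a single bridge edge (3, 4).
example: Graph = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
scores = edge_betweenness(example)
bridge = max(scores, key=scores.get)  # the edge a Girvan-Newman step would remove
```

In the example graph the bridge carries every shortest path between the two triangles, so it receives the highest score and would be the first edge removed.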

[Equation 7]

1) $$Q = \sum_{i} \left( e_{ii} - a_i^{2} \right)$$

2) $$a_i = \sum_{j} e_{ij}$$

Here, Q is the modularity, $e_{ii}$ is the fraction of all edges that connect nodes within the same community i, and $e_{ij}$ is the fraction of all edges that connect a node in community i to a node in community j.
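A modularity calculation along these lines can be sketched as below, using the common Newman-Girvan form in which each community contributes its internal edge fraction minus the square of its share of edge ends. The adjacency-dictionary representation, the unweighted graph, and the name `modularity` are assumptions made for the example.

```python
from typing import Dict, Hashable, Iterable, Set

Graph = Dict[Hashable, Set[Hashable]]


def modularity(graph: Graph, communities: Iterable[Set[Hashable]]) -> float:
    """Sum over communities of (internal edge fraction - squared edge-end share)."""
    num_edges = sum(len(neighbours) for neighbours in graph.values()) / 2
    if num_edges == 0:
        raise ValueError("the graph has no edges")
    q = 0.0
    for members in communities:
        # Each internal edge is seen from both endpoints, hence the division by 2.
        internal = sum(1 for u in members for v in graph[u] if v in members) / 2
        degree_sum = sum(len(graph[u]) for u in members)
        internal_fraction = internal / num_edges
        edge_end_share = degree_sum / (2 * num_edges)
        q += internal_fraction - edge_end_share ** 2
    return q


# Two triangles joined by one bridge edge, split into the two triangles.
example: Graph = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
q_split = modularity(example, [{1, 2, 3}, {4, 5, 6}])
q_whole = modularity(example, [{1, 2, 3, 4, 5, 6}])
```

Splitting the example graph into its two triangles gives a clearly positive modularity, while treating the whole graph as one community gives zero, which is why the split would be preferred when scenarios are compared by modularity.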

According to an embodiment, the cluster evaluation unit 220 may calculate, for each of a plurality of preset indices, a rank of each of the plurality of cluster scenarios, calculate an average of the calculated ranks for each cluster scenario, and select the cluster scenario having the minimum average rank as the optimal cluster scenario.

For example, when the ranks of the first to fourth cluster scenarios calculated with the silhouette index as the reference index are 1, 2, 3, and 4, respectively, and the ranks of the first to fourth cluster scenarios calculated with the BMS index as the reference index are 4, 1, 3, and 2, respectively, the average ranks of the first to fourth cluster scenarios are 2.5, 1.5, 3, and 3, respectively, so the cluster evaluation unit 220 may select the second cluster scenario, which has the minimum average rank, as the optimal cluster scenario.
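The average-rank selection in the example above can be reproduced with a few lines of code. The function names `average_ranks` and `pick_optimal` and the dictionary layout are illustrative assumptions; ties are resolved in favour of the earlier scenario, which the document does not specify.

```python
from typing import Dict, List, Sequence


def average_ranks(ranks_by_index: Dict[str, Sequence[int]]) -> List[float]:
    """Average rank of each scenario across all evaluation indices."""
    per_scenario = zip(*ranks_by_index.values())
    return [sum(ranks) / len(ranks) for ranks in per_scenario]


def pick_optimal(ranks_by_index: Dict[str, Sequence[int]]) -> int:
    """Zero-based position of the scenario with the smallest average rank."""
    averages = average_ranks(ranks_by_index)
    return min(range(len(averages)), key=averages.__getitem__)


ranks = {"silhouette": [1, 2, 3, 4], "bms": [4, 1, 3, 2]}
print(average_ranks(ranks))     # [2.5, 1.5, 3.0, 3.0]
print(pick_optimal(ranks) + 1)  # 2, i.e. the second cluster scenario
```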

FIG. 4 is a flowchart illustrating a method for generating a document cluster according to an embodiment.

The method illustrated in FIG. 4 may be performed, for example, by the document cluster generating apparatus 100 described above.

Referring to FIG. 4, the document cluster generating apparatus 100 first obtains bibliographic information and text information for each of a plurality of documents (410).

Thereafter, the document cluster generating apparatus 100 generates, based on the obtained bibliographic information, a citation-based similarity matrix having citation-based similarities between the plurality of documents as its elements (420).

According to an embodiment, the document cluster generating apparatus 100 may calculate the citation-based similarity based on any one of a co-citation strength and a bibliographic coupling strength.

Thereafter, the document cluster generating apparatus 100 generates, based on the obtained text information, a text-based similarity matrix having text-based similarities between the plurality of documents as its elements (430).

According to an embodiment, the document cluster generating apparatus 100 may calculate the text-based similarity using a vector space-based similarity calculation method.

Thereafter, the document cluster generating apparatus 100 generates, based on the citation-based similarity matrix and the text-based similarity matrix, a joint similarity matrix having joint similarities between the plurality of documents as its elements (440).

Thereafter, the document cluster generating apparatus 100 generates one or more clusters by clustering the plurality of documents based on the joint similarity matrix (450).

In the illustrated flowchart, the method is described as being divided into a plurality of steps, but at least some of the steps may be performed in a different order, performed together in combination with other steps, omitted, performed by being divided into sub-steps, or performed with one or more steps (not shown) added.
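Steps 420 and 430 of the flow above could be sketched as follows. This is only an illustration under stated assumptions: bibliographic coupling is taken as the number of shared references scaled by the geometric mean of the two reference-list sizes, and the vector space method is taken as cosine similarity over raw term-frequency vectors of whitespace tokens. The document does not fix either choice in this passage, and both function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def bibliographic_coupling_matrix(references: Sequence[Set[str]]) -> List[List[float]]:
    """Shared-reference counts scaled to [0, 1] (the scaling is an assumption)."""
    n = len(references)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                sim[i][j] = 1.0  # a document is treated as fully similar to itself
                continue
            shared = len(references[i] & references[j])
            scale = math.sqrt(len(references[i]) * len(references[j]))
            sim[i][j] = shared / scale if scale else 0.0
    return sim


def cosine_similarity_matrix(texts: Sequence[str]) -> List[List[float]]:
    """Cosine similarity between term-frequency vectors of whitespace tokens."""
    vectors = [Counter(text.lower().split()) for text in texts]
    norms = [math.sqrt(sum(c * c for c in vec.values())) for vec in vectors]
    n = len(texts)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Counter returns 0 for terms missing from the other document.
            dot = sum(count * vectors[j][term] for term, count in vectors[i].items())
            denom = norms[i] * norms[j]
            sim[i][j] = dot / denom if denom else 0.0
    return sim


# Hypothetical three-document input.
refs = [{"A", "B"}, {"B", "C"}, {"D"}]
docs = ["engine control unit", "engine control method", "battery pack"]
citation_sim = bibliographic_coupling_matrix(refs)
text_sim = cosine_similarity_matrix(docs)
```

The two resulting matrices have the same shape and index order, so they can be combined element by element into the joint similarity matrix of step 440.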

FIG. 5 is a flowchart illustrating a method for generating a document cluster according to an additional embodiment.

The method illustrated in FIG. 5 may be performed, for example, by the document cluster generating apparatus 200 described above.

Referring to FIG. 5, the document cluster generating apparatus 200 first obtains bibliographic information and text information for each of a plurality of documents (510).

Thereafter, the document cluster generating apparatus 200 generates, based on the obtained bibliographic information, a citation-based similarity matrix having citation-based similarities between the plurality of documents as its elements (520).

Thereafter, the document cluster generating apparatus 200 generates, based on the obtained text information, a text-based similarity matrix having text-based similarities between the plurality of documents as its elements (530).

Thereafter, the document cluster generating apparatus 200 calculates a first information entropy based on the citation-based similarity (540).

Thereafter, the document cluster generating apparatus 200 calculates a second information entropy based on the text-based similarity (550).

Thereafter, the document cluster generating apparatus 200 calculates a citation weight and a text weight based on the citation-based similarity, the text-based similarity, the first information entropy, and the second information entropy (560).

Thereafter, the document cluster generating apparatus 200 generates, based on the citation-based similarity, the text-based similarity, the citation weight, and the text weight, a joint similarity matrix having joint similarities between the plurality of documents as its elements (570).

Thereafter, the document cluster generating apparatus 200 generates one or more clusters by clustering the plurality of documents based on the joint similarity matrix (580).

Thereafter, the document cluster generating apparatus 200 selects an optimal cluster scenario from among the generated plurality of cluster scenarios (590).

According to an embodiment, the document cluster generating apparatus 200 may select any one of a plurality of preset indices as a reference index, and select the cluster scenario for which the reference index has its maximum value as the optimal cluster scenario.

According to an embodiment, the document cluster generating apparatus 200 may calculate, for each of a plurality of preset indices, a rank of each of the plurality of cluster scenarios, calculate an average of the calculated ranks for each cluster scenario, and then select the cluster scenario having the minimum average rank as the optimal cluster scenario.

In the illustrated flowchart, the method is described as being divided into a plurality of steps, but at least some of the steps may be performed in a different order, performed together in combination with other steps, omitted, performed by being divided into sub-steps, or performed with one or more steps (not shown) added.

FIG. 6 is a block diagram illustrating a computing environment 10 including a computing device suitable for use in exemplary embodiments. In the illustrated embodiment, each component may have functions and capabilities different from those described below, and additional components other than those described below may be included.

The illustrated computing environment 10 includes a computing device 12. In an embodiment, the computing device 12 may be the document cluster generating apparatus 100 or 200 illustrated in FIG. 1 or FIG. 2.

The computing device 12 includes at least one processor 14, a computer-readable storage medium 16, and a communication bus 18. The processor 14 may cause the computing device 12 to operate according to the exemplary embodiments described above. For example, the processor 14 may execute one or more programs stored in the computer-readable storage medium 16. The one or more programs may include one or more computer-executable instructions, and the computer-executable instructions may be configured to cause the computing device 12 to perform operations according to the exemplary embodiments when executed by the processor 14.

The computer-readable storage medium 16 is configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. A program 20 stored in the computer-readable storage medium 16 includes a set of instructions executable by the processor 14. In an embodiment, the computer-readable storage medium 16 may be a memory (a volatile memory such as a random access memory, a non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, any other form of storage medium that can be accessed by the computing device 12 and can store desired information, or a suitable combination thereof.

The communication bus 18 interconnects the various other components of the computing device 12, including the processor 14 and the computer-readable storage medium 16.

The computing device 12 may also include one or more input/output interfaces 22 that provide an interface for one or more input/output devices 24, and one or more network communication interfaces 26. The input/output interface 22 and the network communication interface 26 are connected to the communication bus 18. The input/output device 24 may be connected to other components of the computing device 12 through the input/output interface 22. Exemplary input/output devices 24 may include input devices such as a pointing device (a mouse, a trackpad, or the like), a keyboard, a touch input device (a touchpad, a touchscreen, or the like), a voice or sound input device, various types of sensor devices, and/or an imaging device, and/or output devices such as a display device, a printer, a speaker, and/or a network card. The exemplary input/output device 24 may be included inside the computing device 12 as a component of the computing device 12, or may be connected to the computing device 12 as a separate device distinct from the computing device 12.

Meanwhile, an embodiment of the present invention may include a program for performing the methods described in this specification on a computer, and a computer-readable recording medium including the program. The computer-readable recording medium may include program instructions, local data files, local data structures, and the like, alone or in combination. The medium may be specially designed and configured for the present invention, or may be one commonly available in the field of computer software. Examples of the computer-readable recording medium include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of the program may include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although representative embodiments of the present invention have been described in detail above, those of ordinary skill in the art to which the present invention pertains will understand that various modifications can be made to the above-described embodiments without departing from the scope of the present invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be defined by the claims set forth below and their equivalents.

10: computing environment
12: computing device
14: processor
16: computer-readable storage medium
18: communication bus
20: program
22: input/output interface
24: input/output device
26: network communication interface
100, 200: document cluster generating apparatus
110: document information acquisition unit
120: citation-based similarity matrix generator
130: text-based similarity matrix generator
140: joint similarity matrix generator
150: cluster generator
210: weight calculation unit
220: cluster evaluation unit

Claims

1. An apparatus for generating a document cluster, the apparatus comprising:
a document information acquisition unit configured to obtain bibliographic information and text information for each of a plurality of documents;
a citation-based similarity matrix generator configured to generate, based on the bibliographic information, a citation-based similarity matrix having citation-based similarities between the plurality of documents as its elements;
a text-based similarity matrix generator configured to generate, based on the text information, a text-based similarity matrix having text-based similarities between the plurality of documents as its elements;
a joint similarity matrix generator configured to generate, based on the citation-based similarity matrix and the text-based similarity matrix, a joint similarity matrix having joint similarities between the plurality of documents as its elements; and
a cluster generator configured to generate one or more clusters by clustering the plurality of documents based on the joint similarity matrix.

2. The apparatus of claim 1, wherein
the citation-based similarity matrix generator calculates the citation-based similarity based on any one of a co-citation strength and a bibliographic coupling strength, and
the text-based similarity matrix generator calculates the text-based similarity using a vector space-based similarity calculation method.

3. The apparatus of claim 1, further comprising a weight calculation unit configured to calculate a first information entropy based on the citation-based similarity and a second information entropy based on the text-based similarity, and to calculate a citation weight and a text weight based on the citation-based similarity, the text-based similarity, the first information entropy, and the second information entropy.

4. The apparatus of claim 3, wherein the joint similarity matrix generator calculates the joint similarity based on the citation-based similarity, the text-based similarity, the citation weight, and the text weight.

5. The apparatus of claim 1, wherein the cluster generator generates the one or more clusters using any one of a k-means algorithm, density-based clustering, probability distribution-based clustering, hierarchical clustering, and a Girvan-Newman algorithm.

6. The apparatus of claim 1, wherein the cluster generator generates a plurality of cluster scenarios by varying the number of clusters to be generated.

7. The apparatus of claim 6, further comprising a cluster evaluation unit configured to select an optimal cluster scenario from among the plurality of cluster scenarios.

8. The apparatus of claim 7, wherein the cluster evaluation unit selects any one of a plurality of preset indices as a reference index, and selects the cluster scenario for which the reference index has its maximum value as the optimal cluster scenario.

9. The apparatus of claim 7, wherein the cluster evaluation unit calculates, for each of a plurality of preset indices, a rank of each of the plurality of cluster scenarios, calculates an average of the calculated ranks for each of the plurality of cluster scenarios, and selects the cluster scenario for which the calculated average is at its minimum as the optimal cluster scenario.

10. A method for generating a document cluster, the method comprising:
obtaining bibliographic information and text information for each of a plurality of documents;
generating, based on the bibliographic information, a citation-based similarity matrix having citation-based similarities between the plurality of documents as its elements;
generating, based on the text information, a text-based similarity matrix having text-based similarities between the plurality of documents as its elements;
generating, based on the citation-based similarity matrix and the text-based similarity matrix, a joint similarity matrix having joint similarities between the plurality of documents as its elements; and
generating one or more clusters by clustering the plurality of documents based on the joint similarity matrix.

11. The method of claim 10, wherein
the generating of the citation-based similarity matrix includes calculating the citation-based similarity based on any one of a co-citation strength and a bibliographic coupling strength, and
the generating of the text-based similarity matrix includes calculating the text-based similarity using a vector space-based similarity calculation method.

12. The method of claim 10, further comprising:
calculating a first information entropy based on the citation-based similarity;
calculating a second information entropy based on the text-based similarity; and
calculating a citation weight and a text weight based on the citation-based similarity, the text-based similarity, the first information entropy, and the second information entropy.

13. The method of claim 12, wherein the generating of the joint similarity matrix includes calculating the joint similarity based on the citation-based similarity, the text-based similarity, the citation weight, and the text weight.

14. The method of claim 10, wherein the generating of the clusters includes generating the one or more clusters using any one of a k-means algorithm, density-based clustering, probability distribution-based clustering, hierarchical clustering, and a Girvan-Newman algorithm.

15. The method of claim 10, wherein the generating of the clusters includes generating a plurality of cluster scenarios by varying the number of clusters to be generated.

16. The method of claim 15, further comprising selecting an optimal cluster scenario from among the plurality of cluster scenarios.

17. The method of claim 16, wherein the selecting includes selecting any one of a plurality of preset indices as a reference index, and selecting the cluster scenario for which the reference index has its maximum value as the optimal cluster scenario.

18. The method of claim 16, wherein the selecting includes:
calculating, for each of a plurality of preset indices, a rank of each of the plurality of cluster scenarios;
calculating an average of the calculated ranks for each of the plurality of cluster scenarios; and
selecting the cluster scenario for which the calculated average is at its minimum as the optimal cluster scenario.