KR20210121921A

KR20210121921A - Method and device for extracting key keywords based on keyword joint appearance network

Info

Publication number: KR20210121921A
Application number: KR1020200039380A
Authority: KR
Inventors: 유택호; 윤지성; 정우성; 권오현
Original assignee: 포항공과대학교 산학협력단
Priority date: 2020-03-31
Filing date: 2020-03-31
Publication date: 2021-10-08
Also published as: KR102498294B1

Abstract

Disclosed are a method and device for extracting key keywords based on a keyword co-occurrence network. The method comprises: a step of obtaining a subject keyword input by a user; a step of extracting keywords from an entire document set based on the subject keyword; a step of generating a keyword co-occurrence network based on whether the keywords co-occur; a step of classifying the keywords constituting the keyword co-occurrence network into a plurality of clusters; and a step of extracting at least one key keyword for each of the clusters. The key keywords can therefore be extracted quickly and accurately.

Description

METHOD AND DEVICE FOR EXTRACTING KEY KEYWORDS BASED ON KEYWORD JOINT APPEARANCE NETWORK

The present invention relates to a method and apparatus for extracting core keywords based on a keyword co-occurrence network, and more particularly to a method and apparatus for extracting the core keywords of a detailed research field by constructing a keyword co-occurrence network from a plurality of research documents.

With the recent development of information and communication technology, a large number of research documents are published and made publicly available, so researchers can readily draw on a vast body of research documents.

However, making use of such a vast body of research documents requires considerable labor. For example, when a single user analyzes research topics across a large number of documents, there are practical limits on time, effort, and accuracy.

To overcome these limits, papers commonly provide abstracts and keywords. However, abstracts and keywords alone make it difficult to accurately infer the academic content of a research document, and many documents provide neither.

Therefore, to overcome the above problems, there is a need for a way to quickly and accurately extract core keywords from a large number of documents so that research fields and development trends can be identified.

An object of the present invention for solving the above problems is to provide a method of extracting core keywords based on a keyword co-occurrence network.

Another object of the present invention for solving the above problems is to provide an apparatus for extracting core keywords based on a keyword co-occurrence network.

One aspect of the present invention for achieving the above object provides a method of extracting core keywords based on a keyword co-occurrence network.

The method of extracting core keywords based on a keyword co-occurrence network may include: acquiring a topic keyword input by a user; extracting keywords from an entire document set based on the topic keyword; generating a keyword co-occurrence network based on whether the keywords co-occur; classifying the keywords constituting the keyword co-occurrence network into a plurality of clusters; and extracting at least one core keyword for each of the clusters.

The extracting of the at least one core keyword may include calculating an importance for each of the keywords constituting each cluster based on the centrality and appearance frequency of those keywords, and extracting the at least one core keyword according to the importance.

The extracting of the keywords may include: generating a first document set from the entire document set based on the topic keyword; extracting at least one reference keyword from the first document set; generating a second document set from the entire document set based on the at least one reference keyword; and extracting the keywords from the second document set.

The keyword co-occurrence network may have each of the keywords as a node, may have links connecting keywords that appear together in a single document, and may have, as the connection strength of each link, the number of documents in which the two keywords connected by that link appear together.

The keyword co-occurrence network may be reconstructed using only the keywords connected by links whose connection strength is equal to or greater than a preset threshold.

The classifying of the keywords into a plurality of clusters may include optimizing the clusters by evaluating the modularity of the keyword co-occurrence network classified into the clusters and iteratively reclassifying the keywords in the direction that maximizes the modularity.

The extracting of the at least one core keyword may include calculating the importance based on the following equation,

where KR(i) is the importance of the i-th keyword among the keywords (i being a natural number), CR(i) is the centrality of the i-th keyword, N(i) is the appearance frequency with which the i-th keyword appears in the second document set, and d may be an evaluation constant input by the user to set the relative weight between the centrality and the appearance frequency.

The centrality of the i-th keyword may be one of degree centrality, betweenness centrality, and closeness centrality.

The betweenness centrality may be defined as the ratio of the number of shortest paths between two nodes that pass through the node corresponding to the i-th keyword to the total number of shortest paths between those two nodes, the two nodes being nodes other than the node corresponding to the i-th keyword.

The shortest path may be the path that minimizes distance, where the distance of a link is the reciprocal of its connection strength.

The closeness centrality may be defined as the reciprocal of the average distance along the shortest paths from the node corresponding to the i-th keyword to the remaining nodes.

Another aspect of the present invention for achieving the above object provides an apparatus for extracting core keywords based on a keyword co-occurrence network.

The apparatus for extracting core keywords based on a keyword co-occurrence network may include: at least one processor; and a memory (120) that stores instructions directing the at least one processor to perform at least one step.

The at least one step may include: acquiring a topic keyword input by a user; extracting keywords from an entire document set based on the topic keyword; generating a keyword co-occurrence network based on whether the keywords co-occur; classifying the keywords constituting the keyword co-occurrence network into a plurality of clusters; and extracting at least one core keyword for each of the clusters.

When the method and apparatus for extracting core keywords based on a keyword co-occurrence network according to the present invention as described above are used, anyone can easily identify the core keywords of the field related to a topic keyword simply by entering the topic keyword.

In addition, because core keywords can be identified for each cluster, where each cluster indicates a detailed research field, even detailed research fields can be distinguished.

In addition, because core keywords are not limited to words that name the field and may be various kinds of words, such as materials or properties, the user can be helped to understand the field related to the topic keyword from various angles.

FIG. 1 is a conceptual diagram for explaining the target from which a keyword co-occurrence network according to an embodiment of the present invention is derived.
FIG. 2 is a diagram illustrating an example of a keyword co-occurrence network according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating clusters into which the keywords constituting the keyword co-occurrence network of FIG. 2 are classified.
FIGS. 4A and 4B are exemplary diagrams illustrating core keywords derived using the method of extracting core keywords based on a keyword co-occurrence network.
FIG. 5 is a representative flowchart of a method of extracting core keywords based on a keyword co-occurrence network according to an embodiment of the present invention.
FIG. 6 is a block diagram of an apparatus for extracting core keywords based on a keyword co-occurrence network according to an embodiment of the present invention.

Since the present invention may be variously modified and may have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes included in the spirit and technical scope of the present invention. In describing each drawing, like reference numerals are used for like elements.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be termed a second element, and similarly a second element may be termed a first element. The term "and/or" includes a combination of a plurality of related listed items or any one of a plurality of related listed items.

When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, but it should be understood that intervening elements may also be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no intervening elements are present.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present application.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a conceptual diagram for explaining the target from which a keyword co-occurrence network according to an embodiment of the present invention is derived.

In an embodiment, the keyword co-occurrence network may be defined as a network generated by taking the keywords extracted from the entire document set and connecting the keywords that appear together in a single document.

In this specification, the entire document set is defined as a set of documents collected and stored in advance. The entire document set may be received in advance from a user, or may be collected over a wired or wireless network using various crawling algorithms.

Constructing the keyword co-occurrence network over the entire document set may demand an excessive computational load, and documents with very low relevance would be included in the analysis. Accordingly, a first document set relevant to the topic keyword may be generated from the entire document set. For example, the first document set may be generated by collecting, from the entire document set, the documents that contain the topic keyword (or a synonym of the topic keyword).

Next, at least one reference keyword may be extracted from the first document set. Here, a reference keyword may be a keyword whose appearance frequency is equal to or greater than a reference value among the keywords extracted from the first document set. Various semantic network analysis modules based on morphological analysis may be used to extract the keywords from the first document set.

Here, the at least one reference keyword may be selected from the keywords extracted from the first document set that remain after excluding the topic keyword (and its synonyms) and unnecessary keywords. An unnecessary keyword may be a word whose meaning is a general concept (for example, verbs such as "become" and "get", or nouns such as "parameter" and "percent").
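The filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: it assumes each document has already been reduced to a list of keyword strings, it counts appearance frequency as the number of documents containing a keyword (the text does not say whether documents or occurrences are counted), and the names `first_document_set`, `reference_keywords`, `stop_words`, and `min_count` are hypothetical.

```python
from collections import Counter

def first_document_set(documents, topic_terms):
    """Keep the documents that contain the topic keyword or one of its synonyms."""
    topic_terms = set(topic_terms)
    return [doc for doc in documents if topic_terms & set(doc)]

def reference_keywords(first_set, topic_terms, stop_words, min_count):
    """Return keywords found in at least min_count documents of first_set,
    excluding the topic terms and the overly general stop words."""
    excluded = set(topic_terms) | set(stop_words)
    counts = Counter(
        kw for doc in first_set for kw in set(doc) if kw not in excluded
    )
    return sorted(kw for kw, n in counts.items() if n >= min_count)

# Toy corpus (hypothetical data): each document is a list of extracted keywords.
documents = [
    ["perovskite", "solar cell", "efficiency", "percent"],
    ["perovskite", "solar cell", "stability"],
    ["graphene", "solar cell", "electrode"],
    ["perovskite", "led", "efficiency", "become"],
]
first_set = first_document_set(documents, {"perovskite"})
refs = reference_keywords(
    first_set, {"perovskite"}, {"become", "percent"}, min_count=2
)
print(refs)
```

With this toy corpus, three documents mention the topic keyword, and only "efficiency" and "solar cell" reach the reference value of two documents.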

Once the at least one reference keyword has been extracted, a second document set may be generated by collecting, from the entire document set, the documents that contain the at least one reference keyword.

The second document set may be generated by collecting, from the entire document set, documents that contain not only the at least one reference keyword but also abbreviations, synonyms, and near-synonyms (or similar words) of the reference keyword.
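As a rough sketch of this step, the following Python function selects the second document set from the entire document set. It assumes documents are lists of keyword strings and that abbreviations and synonyms are supplied in a lookup table; the function name `second_document_set` and the `variants` mapping are hypothetical and only illustrate the idea.

```python
def second_document_set(documents, reference_keywords, variants=None):
    """Collect documents containing a reference keyword or one of its variants.

    variants maps a reference keyword to its abbreviations and synonyms.
    """
    variants = variants or {}
    terms = set(reference_keywords)
    for kw in reference_keywords:
        terms.update(variants.get(kw, ()))
    return [doc for doc in documents if terms & set(doc)]

# Toy data (hypothetical): the second document lacks "solar cell" itself
# but contains a listed synonym, so it is still selected.
documents = [
    ["perovskite", "solar cell", "efficiency"],
    ["graphene", "photovoltaic cell", "electrode"],
    ["graphene", "transistor"],
]
variants = {"solar cell": ["photovoltaic cell", "PV cell"]}
second_set = second_document_set(documents, ["solar cell"], variants)
print(len(second_set))
```

Note that the selection runs over the entire document set rather than the first document set, so documents that never mention the topic keyword can still enter the analysis.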

Next, the keyword co-occurrence network may be generated by extracting keywords from the second document set and connecting the extracted keywords that appear together in a single document.

Because a reference keyword is a keyword that frequently appears together with the topic keyword, it may describe the topic keyword in more detail or be highly relevant to it. Therefore, generating the second document set from the reference keywords and constructing the keyword co-occurrence network from the keywords extracted from the second document set can be advantageous for extracting core keywords even when the topic keyword is rarely used or is only an incidental keyword.

FIG. 2 is a diagram illustrating an example of a keyword co-occurrence network according to an embodiment of the present invention.

Referring to FIG. 2, a keyword co-occurrence network (KJAN) generated from the keywords (A, B, C, ..., H, I) extracted from the second document set is shown.

The keyword co-occurrence network (KJAN) may have each of the keywords (A, B, C, ..., H, I) as a node of the network, and may have links connecting keywords that appear together in a single document. In FIG. 2, each link is drawn as a straight line (or edge) connecting keywords. Hereinafter, the terms node and keyword may be used interchangeably, and nodes and keywords have a one-to-one correspondence in the keyword co-occurrence network (KJAN).

In addition, the keyword co-occurrence network (KJAN) may have, as the connection strength of a link, the number of documents in which the two keywords connected by the link appear together. For example, the number of documents in the second document set in which keyword C and keyword H appear together may be the connection strength (CS[C,H]) of the link connecting keyword C and keyword H.
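The node, link, and connection-strength definitions above map directly onto a pair counter. The sketch below is a minimal illustration that assumes each document is a list of keyword strings; the function name `build_cooccurrence_network` and the dictionary representation are hypothetical choices, not taken from the patent.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_network(documents):
    """Return {(kw_a, kw_b): connection strength}, with kw_a < kw_b.

    The connection strength of a link is the number of documents in which
    both keywords appear, so each document counts at most once per pair.
    """
    strength = Counter()
    for doc in documents:
        for a, b in combinations(sorted(set(doc)), 2):
            strength[(a, b)] += 1
    return dict(strength)

# Toy second document set (hypothetical data).
documents = [["C", "H", "B"], ["C", "H"], ["B", "C"], ["H", "G"]]
network = build_cooccurrence_network(documents)
print(network[("C", "H")])  # CS[C,H]
```

Here keywords C and H share two documents, so CS[C,H] is 2, while G and H share only one.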

Meanwhile, when keywords that appear together in a single document are connected by links, keywords are linked at the granularity of a whole document, so keywords with little relevance to each other may end up connected. To prevent this, keywords that appear together in a single document may be connected by a link only when at least one of the word gap, sentence gap, and paragraph gap between them is equal to or less than a threshold. For example, when keyword C and keyword H appear in a single document with two words between them, the word gap between keyword C and keyword H may be 2. Likewise, when keyword C and keyword H appear in a single document with two sentences between them, the sentence gap between keyword C and keyword H may be 2.

The threshold may be input in advance by the user. If the threshold is set large, two keywords can be connected by a link merely by appearing together in a single document; if it is set small, two keywords must appear close to each other within a document, so they are less likely to be connected by a link. Accordingly, the threshold may be determined based on the system load, the range and number of core keywords to be extracted, and the like.
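The word-gap test can be sketched as follows. This is an illustration under simplifying assumptions: the document is a flat list of tokens, each keyword is a single token, and only the word gap is checked (sentence and paragraph gaps would follow the same pattern). The names `word_gap`, `should_link`, and `max_gap` are hypothetical.

```python
def word_gap(tokens, kw_a, kw_b):
    """Smallest number of words between any occurrence of kw_a and kw_b,
    or None if either keyword is absent from the document."""
    pos_a = [i for i, t in enumerate(tokens) if t == kw_a]
    pos_b = [i for i, t in enumerate(tokens) if t == kw_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) - 1 for i in pos_a for j in pos_b)

def should_link(tokens, kw_a, kw_b, max_gap):
    """Link two keywords only if their word gap is at most max_gap."""
    gap = word_gap(tokens, kw_a, kw_b)
    return gap is not None and gap <= max_gap

# Toy document (hypothetical): two words separate C and H, seven separate C and G.
tokens = ["C", "is", "doped", "H", "and", "later", "also", "measured", "G"]
print(word_gap(tokens, "C", "H"), should_link(tokens, "C", "G", max_gap=2))
```

With a threshold of 2, C and H are linked but C and G are not, even though all three keywords appear in the same document.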

Meanwhile, in the keyword co-occurrence network (KJAN), keywords may be connected by many links with low connection strength and a few links with high connection strength. Because the keywords extracted from the second document set include terms used only in a particular paper or terms defined by the authors themselves, keywords connected by the many links with low connection strength may be unsuitable for classifying research fields.

Accordingly, in an embodiment, the keyword co-occurrence network (KJAN) may be reconstructed using only the keywords connected by links whose connection strength is equal to or greater than a preset threshold. Reconstructing the network in this way removes keywords that have low relevance or are rarely used, so more accurate core keywords can be extracted.
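A minimal sketch of this reconstruction, assuming the network is stored as a dictionary from keyword pairs to connection strengths (a hypothetical representation), is shown below. The function name `prune_network` is likewise an assumption.

```python
def prune_network(strength, threshold):
    """Keep only links with connection strength >= threshold, and only the
    keywords that still have at least one such link."""
    kept_links = {pair: s for pair, s in strength.items() if s >= threshold}
    kept_nodes = {kw for pair in kept_links for kw in pair}
    return kept_links, kept_nodes

# Toy network (hypothetical): X and Y are connected only by weak links.
strength = {("B", "C"): 5, ("C", "H"): 3, ("H", "X"): 1, ("X", "Y"): 1}
links, nodes = prune_network(strength, threshold=3)
print(sorted(nodes))
```

Keywords X and Y drop out because every link touching them falls below the threshold, which is how paper-specific or self-defined terms would be filtered.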

FIG. 3 is a diagram illustrating clusters into which the keywords constituting the keyword co-occurrence network of FIG. 2 are classified.

Referring to FIG. 3, the keywords (A, B, C, D, ..., H, I) constituting the keyword co-occurrence network (KJAN) may be classified into a plurality of clusters (CLT1, CLT2, CLT3, CLT4). For example, keyword A may be classified into the first cluster (CLT1), and keywords B, C, and D may be classified into the second cluster (CLT2).

Here, the keywords may be classified into clusters using, for example, the CNM (Clauset-Newman-Moore) algorithm (A. Clauset, M. E. J. Newman, and C. Moore, "Finding Community Structure in Very Large Networks," Physical Review E, Vol. 70, 066111, 2004.) or the Louvain algorithm (V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast Unfolding of Communities in Large Networks," Journal of Statistical Mechanics, Vol. 10, P10008, 2008.).

In an embodiment, the clusters may be optimized by evaluating the modularity of the keyword co-occurrence network (KJAN) classified into clusters and iteratively reclassifying the keywords in the direction that maximizes the modularity. Here, the modularity (Q) may be defined as in Equation 1 below.

Referring to Equation 1, A_vw is a value indicating whether a link exists between any two nodes v and w, and may be 1 if nodes v and w are connected by a link and 0 otherwise. C_v and C_w may be the clusters to which node v and node w belong, respectively. The function δ(C_v, C_w) may be a function that is 1 if the cluster (C_v) to which node v belongs and the cluster (C_w) to which node w belongs are the same, and 0 otherwise. In Equation 1, m may be the total number of links.

In Equation 1, k_v and k_w are the degrees of node v and node w, respectively, where the degree (k_v) of node v may be defined as in Equation 2 below.

Referring to Equation 2, the degree (k_v) of node v is the sum of the link indicators (A_vz) between node v and every node z, and may be the number of links directly connected to node v. The degree (k_w) of node w may be defined in the same form as Equation 2.
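The symbol definitions above (A_vw, the cluster-equality function, m, k_v, and k_w) match the standard Newman modularity used in the cited Clauset-Newman-Moore paper. Assuming Equation 1 takes that standard form, and writing the cluster-equality function as a Kronecker delta, Equations 1 and 2 can be written as follows; the form of Equation 1 is an assumption, while Equation 2 follows directly from the description.

```latex
\begin{align}
Q   &= \frac{1}{2m} \sum_{v,w} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(C_v, C_w)
      && \text{(Equation 1, assumed standard form)} \\
k_v &= \sum_{z} A_{vz}
      && \text{(Equation 2)}
\end{align}
```

The term k_v k_w / 2m is the expected number of links between v and w if links were placed at random while preserving node degrees, so Q measures how many more within-cluster links exist than chance would predict.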

The modularity (Q) according to Equation 1 takes a larger value as more links connect nodes within the same cluster and fewer links connect nodes in different clusters. Therefore, by repeatedly evaluating the modularity of the keyword co-occurrence network (KJAN) classified into clusters and reclassifying accordingly, each cluster can come to consist of keywords that represent a detailed research field.
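To make the behavior of Q concrete, the following sketch evaluates modularity for a given cluster assignment. It assumes the standard Newman form of modularity for an unweighted, undirected network (an assumption about Equation 1), and the function name `modularity` and the toy network are hypothetical.

```python
def modularity(links, cluster_of):
    """Newman modularity Q of an unweighted, undirected network.

    links: iterable of (v, w) pairs, each undirected link listed once.
    cluster_of: dict mapping every node to its cluster label.
    """
    links = list(links)
    m = len(links)
    degree = {}
    for v, w in links:
        degree[v] = degree.get(v, 0) + 1
        degree[w] = degree.get(w, 0) + 1
    adjacent = {frozenset(link) for link in links}
    q = 0.0
    for v in cluster_of:
        for w in cluster_of:
            if cluster_of[v] != cluster_of[w]:
                continue  # delta(C_v, C_w) = 0
            a_vw = 1 if v != w and frozenset((v, w)) in adjacent else 0
            q += a_vw - degree.get(v, 0) * degree.get(w, 0) / (2 * m)
    return q / (2 * m)

# Two triangles joined by a single link (hypothetical toy network).
links = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F"), ("C", "D")]
two_clusters = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 2}
one_cluster = {node: 1 for node in two_clusters}
q_two = modularity(links, two_clusters)
q_one = modularity(links, one_cluster)
print(q_two > q_one)
```

Splitting the network along the single bridging link gives Q = 5/14, while lumping everything into one cluster gives Q = 0, so an optimizer that maximizes Q would prefer the two-cluster assignment.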

In addition to the methods described above, the keywords constituting the keyword co-occurrence network (KJAN) may be classified into a plurality of clusters using various community detection methods from the field of network analysis.

In an embodiment, at least one core keyword may be extracted for each of the clusters (CLT1, CLT2, CLT3, CLT4). For example, at least one core keyword may be extracted from among the keywords (B, C, D) belonging to the second cluster (CLT2).

Here, since each cluster may represent a detailed research field, the at least one core keyword extracted from each cluster may be a keyword that represents a specific detailed research field.

In an embodiment, the centrality and appearance frequency of the keywords constituting each cluster may be calculated in order to extract the core keywords. For example, the centrality and appearance frequency may be calculated for each of keywords B, C, and D belonging to the second cluster (CLT2).

Here, the centrality may be one of degree centrality, betweenness centrality, and closeness centrality.

Degree centrality may be the number of all links directly connected to the keyword. For example, the degree centrality of keyword B may be 3, the number of links directly connected to keyword B. Degree centrality may also be normalized by dividing the number of links directly connected to the keyword by the number of keywords other than that keyword.
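Degree centrality and its normalized form can be computed directly from the link list. The sketch below uses a small hypothetical network in which keyword B has three links, matching the example value in the text; it is not a reproduction of FIG. 2, and the function name `degree_centrality` is an assumption.

```python
def degree_centrality(links, keyword, normalized=False):
    """Number of links directly connected to keyword; if normalized,
    divided by the number of other keywords in the network."""
    nodes = {kw for link in links for kw in link}
    degree = sum(1 for link in links if keyword in link)
    if normalized:
        return degree / (len(nodes) - 1)
    return degree

# Hypothetical 7-keyword network; B is linked to C, D, and E.
links = [("B", "C"), ("B", "D"), ("B", "E"), ("C", "F"),
         ("C", "H"), ("E", "F"), ("F", "G"), ("G", "H")]
print(degree_centrality(links, "B"),
      degree_centrality(links, "B", normalized=True))
```

In this network B has degree 3 and there are six other keywords, so the normalized degree centrality is 3/6 = 0.5.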

Betweenness centrality may be an indicator of the extent to which the keyword lies on the routes by which other keywords are connected to one another through at least one link. Specifically, betweenness centrality may mean the proportion, among the shortest paths between two keywords other than the keyword in question, of those shortest paths that pass through that keyword.

For example, assuming that all connection strengths are equal, the only shortest path between keyword B and keyword H is B-C-H. Since no shortest path avoids keyword C, the betweenness centrality of keyword C with respect to the pair of keyword B and keyword H may be defined as 1. Between keyword B and keyword G there may be three shortest paths: B-C-F-G, B-C-H-G, and B-E-F-G. Since two of these three shortest paths pass through keyword C, the betweenness centrality of keyword C with respect to the pair of keyword B and keyword G may be 2/3. The betweenness centrality of keyword C can be derived by computing this value in the same way for the shortest paths between every pair of keywords other than keyword C and summing the results.
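The pairwise fractions in this example can be checked with a short sketch. The edge list below is an assumption: it contains only the links implied by the shortest paths named in the text (B-C, B-E, C-F, C-H, E-F, F-G, G-H), not the full network of the drawings, and all connection strengths are treated as equal. The function names are hypothetical.

```python
from collections import deque
from fractions import Fraction


def count_shortest_paths(adjacency, source):
    """Breadth-first search returning (dist, sigma), where sigma[v] is the
    number of distinct shortest paths from source to v."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma


def pair_fraction(adjacency, s, t, v):
    """Fraction of the shortest s-t paths that pass through v."""
    dist_s, sigma_s = count_shortest_paths(adjacency, s)
    dist_v, sigma_v = count_shortest_paths(adjacency, v)
    if t not in dist_s or v not in dist_s or t not in dist_v:
        return Fraction(0)
    if dist_s[v] + dist_v[t] != dist_s[t]:
        return Fraction(0)  # v is not on any shortest s-t path
    return Fraction(sigma_s[v] * sigma_v[t], sigma_s[t])


def betweenness(adjacency, v):
    """Sum of pair fractions over all unordered pairs of keywords other than v."""
    others = sorted(n for n in adjacency if n != v)
    total = Fraction(0)
    for i, s in enumerate(others):
        for t in others[i + 1:]:
            total += pair_fraction(adjacency, s, t, v)
    return total


# Assumed edges, inferred only from the paths named in the example.
edges = [("B", "C"), ("B", "E"), ("C", "F"), ("C", "H"),
         ("E", "F"), ("F", "G"), ("G", "H")]
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

print(pair_fraction(adjacency, "B", "H", "C"))  # 1
print(pair_fraction(adjacency, "B", "G", "C"))  # 2/3
```

Exact fractions are used so that values such as 2/3 are not blurred by floating-point rounding. Calling `betweenness(adjacency, "C")` sums the fractions over every pair, as the last sentence of the example describes.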

Closeness centrality may be defined as the reciprocal of the average distance along the shortest paths from the keyword to every other keyword. That is, the shorter the distances from the keyword to the other keywords, the larger its closeness centrality.

Both betweenness centrality and closeness centrality, as described above, presuppose shortest paths. Since each link has a connection strength, as described with reference to FIG. 2, the distance of a link may be taken as the reciprocal of its connection strength, and the shortest path may be defined as the path that minimizes the total distance.
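A minimal sketch of closeness centrality under this distance definition follows. It uses Dijkstra's algorithm with each link's distance set to the reciprocal of its connection strength. The function names, the nested-dictionary representation, and the toy strengths are assumptions for illustration, and the sketch averages only over keywords reachable from the starting keyword.

```python
import heapq


def shortest_distances(strength, source):
    """Dijkstra over an undirected graph whose link distance is 1 / strength."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in strength[u].items():
            nd = d + 1.0 / w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def closeness_centrality(strength, node):
    """Reciprocal of the mean shortest-path distance to the reachable keywords."""
    dist = shortest_distances(strength, node)
    others = [d for n, d in dist.items() if n != node]
    if not others:
        return 0.0
    return len(others) / sum(others)


# Hypothetical connection strengths (numbers of co-occurring documents).
strength = {
    "A": {"B": 4, "C": 1},
    "B": {"A": 4, "C": 2},
    "C": {"A": 1, "B": 2},
}
dist_from_a = shortest_distances(strength, "A")
closeness_a = closeness_centrality(strength, "A")
print(dist_from_a["C"], closeness_a)
```

In this toy graph the direct link A-C has distance 1, but the route A-B-C has distance 1/4 + 1/2 = 0.75, so strongly co-occurring keywords are treated as closer than the link count alone would suggest.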

Once the centrality and appearance frequency of the keywords constituting each cluster have been calculated, the importance may be calculated according to Equation 3 below.

Referring to Equation 3, KR(i) is the importance of the i-th keyword among the keywords (i being a natural number not greater than the number of keywords), CR(i) is the centrality of the i-th keyword, N(i) is the frequency with which the i-th keyword appears in the second document set, and d may be an evaluation constant input by the user. The evaluation constant d defines the relative weight given to centrality and appearance frequency, and may be a constant of 0 or more and 1 or less. It may be set to a larger value the more weight appearance frequency is to carry in the importance relative to centrality. The user may give more weight to appearance frequency by entering a value of d close to 1, or more weight to centrality by entering a value of d close to 0. At the extremes, if the user enters d as 1, the importance is evaluated using only the appearance frequency and not the centrality; if the user enters d as 0, the importance is evaluated using only the centrality and not the appearance frequency.
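Equation 3 itself is not reproduced in this text, so its exact form cannot be shown here. The sketch below assumes one simple form that is consistent with the behavior described for d: a linear blend in which d weights the appearance frequency N(i) and (1 - d) weights the centrality CR(i). Dividing each quantity by its sum over the cluster is a further assumption, made only so that the two terms are on a comparable scale. The function name `importance` and the sample values are hypothetical.

```python
def importance(centrality, frequency, d):
    """Assumed blend: KR(i) = (1 - d) * CR(i) / sum(CR) + d * N(i) / sum(N).

    This form is NOT taken from the source; it only matches the stated
    limits (d = 1 uses frequency alone, d = 0 uses centrality alone).
    """
    if not 0.0 <= d <= 1.0:
        raise ValueError("d must be between 0 and 1")
    c_total = sum(centrality.values()) or 1.0
    n_total = sum(frequency.values()) or 1.0
    return {
        k: (1.0 - d) * centrality[k] / c_total + d * frequency[k] / n_total
        for k in centrality
    }


# Hypothetical values for the keywords of one cluster.
centrality = {"B": 3, "C": 2, "D": 1}
frequency = {"B": 10, "C": 30, "D": 10}

only_centrality = importance(centrality, frequency, d=0.0)
only_frequency = importance(centrality, frequency, d=1.0)
balanced = importance(centrality, frequency, d=0.5)
print(balanced)
```

With these values, keyword B ranks first when d is 0 and keyword C ranks first when d is 1, which illustrates how the user's choice of d can change the extracted core keyword.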

Once the importance has been calculated for each keyword according to Equation 3, the keywords in each cluster may be sorted in descending order of importance, and at least one core keyword may be extracted from each cluster.
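The per-cluster sorting step can be sketched as follows. The function name `top_keywords_per_cluster`, the parameter `k` (how many core keywords to keep per cluster), the alphabetical tie-break, and the sample scores are all assumptions for illustration.

```python
def top_keywords_per_cluster(clusters, scores, k=1):
    """Sort each cluster's keywords by descending importance and keep the top k.

    Ties are broken alphabetically so that the result is deterministic.
    """
    return {
        name: sorted(members, key=lambda kw: (-scores[kw], kw))[:k]
        for name, members in clusters.items()
    }


# Hypothetical clusters and importance scores.
clusters = {"CLT1": ["A"], "CLT2": ["B", "C", "D"]}
scores = {"A": 0.9, "B": 0.35, "C": 0.47, "D": 0.18}

core = top_keywords_per_cluster(clusters, scores, k=2)
print(core)
```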

FIGS. 4A and 4B are exemplary diagrams showing core keywords derived using the method of extracting core keywords based on a keyword co-occurrence network.

FIG. 4A shows an example in which a subject keyword relating to the blockchain field is received, a keyword co-occurrence network is constructed, the keywords constituting the keyword co-occurrence network are classified into clusters shown in different colors, and core keywords are derived for each cluster.

FIG. 4B shows an example in which a subject keyword relating to the field of complex systems in physics is received, a keyword co-occurrence network is constructed, the keywords constituting the keyword co-occurrence network are classified into clusters shown in different colors, and core keywords are derived for each cluster.

Referring to FIGS. 4A and 4B, the core keywords extracted according to importance are visualized for each cluster. Because the user can see at a glance the core keywords of each detailed research field derived from the subject keyword, the user can easily understand how the detailed research fields currently being pursued relate to one another and in which direction they are heading.

FIG. 5 is a representative flowchart of a method of extracting core keywords based on a keyword co-occurrence network according to an embodiment of the present invention.

Referring to FIG. 5, the method of extracting core keywords based on a keyword co-occurrence network may include: obtaining a subject keyword input by a user (S100); extracting keywords from an entire document set based on the subject keyword (S110); generating a keyword co-occurrence network based on whether the keywords co-occur (S120); classifying the keywords constituting the keyword co-occurrence network into a plurality of clusters (S130); and extracting at least one core keyword for each of the clusters (S140).

The method of extracting core keywords based on a keyword co-occurrence network may be performed by the apparatus 100 for extracting core keywords, which is described below.

The step of extracting the at least one core keyword (S140) may calculate an importance for each of the keywords based on the centrality and appearance frequency of the keywords constituting each cluster, and may extract the at least one core keyword according to the importance.

The step of extracting the keywords (S110) may include: generating a first document set from the entire document set based on the subject keyword; extracting at least one reference keyword from the first document set; generating a second document set from the entire document set based on the at least one reference keyword; and extracting the keywords from the second document set.
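The two-stage expansion of step S110 can be sketched as follows. Several details are assumptions not stated in this passage: each document is modeled as a set of its keywords, the first document set is taken to be the documents containing the subject keyword, the reference keywords are taken to be the `n_reference` keywords that appear most often in the first document set, and the second document set is taken to be the documents containing any reference keyword. The function name and the sample documents are hypothetical.

```python
from collections import Counter


def extract_keywords(documents, subject_keyword, n_reference=2):
    """Return (reference keywords, second document set, extracted keywords)."""
    # First document set: documents that mention the subject keyword.
    first_set = [doc for doc in documents if subject_keyword in doc]

    # Reference keywords: assumed here to be the most frequent co-occurring ones.
    counts = Counter(
        kw for doc in first_set for kw in doc if kw != subject_keyword
    )
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    reference = [kw for kw, _ in ranked[:n_reference]]

    # Second document set: documents that mention any reference keyword.
    second_set = [doc for doc in documents if any(kw in doc for kw in reference)]

    # Keywords: every keyword that appears in the second document set.
    keywords = set().union(*second_set) if second_set else set()
    return reference, second_set, keywords


# Hypothetical documents, each reduced to its set of keywords.
documents = [
    {"blockchain", "bitcoin"},
    {"blockchain", "bitcoin", "ledger"},
    {"bitcoin", "mining"},
    {"ledger", "database"},
    {"graph"},
]
reference, second_set, keywords = extract_keywords(documents, "blockchain")
print(reference, len(second_set), sorted(keywords))
```

The second document set is larger than the first because it also captures documents that never mention the subject keyword but share its reference keywords, which is the apparent purpose of the two-stage design.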

The step of classifying the keywords into a plurality of clusters (S130) may evaluate the modularity of the keyword co-occurrence network as classified into the clusters, and may optimize the clusters by repeatedly reclassifying the keywords in the direction that maximizes the modularity.
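The evaluate-and-reclassify loop of step S130 can be sketched as follows. The modularity formula used here is the standard weighted Newman modularity, which is an assumption because the modularity equations of this document are not part of this passage. The greedy single-keyword move in `refine` is likewise only one simple strategy and is not claimed to be the disclosed procedure. All names and the toy graph are hypothetical.

```python
def modularity(strength, cluster_of):
    """Weighted Newman modularity: sum over clusters of
    (internal weight / 2m) - (total degree / 2m) ** 2."""
    two_m = sum(w for nbrs in strength.values() for w in nbrs.values())
    if two_m == 0:
        return 0.0
    internal, degree = {}, {}
    for u, nbrs in strength.items():
        c = cluster_of[u]
        degree[c] = degree.get(c, 0.0) + sum(nbrs.values())
        for v, w in nbrs.items():
            if cluster_of[v] == c:
                internal[c] = internal.get(c, 0.0) + w
    return sum(
        internal.get(c, 0.0) / two_m - (degree[c] / two_m) ** 2 for c in degree
    )


def refine(strength, cluster_of, max_rounds=10):
    """Repeatedly move single keywords to a neighboring cluster whenever
    the move increases modularity; stop when no move helps."""
    cluster_of = dict(cluster_of)
    for _ in range(max_rounds):
        improved = False
        for node in sorted(strength):
            best_c = cluster_of[node]
            best_q = modularity(strength, cluster_of)
            for c in {cluster_of[v] for v in strength[node]}:
                trial = dict(cluster_of)
                trial[node] = c
                q = modularity(strength, trial)
                if q > best_q + 1e-12:
                    best_c, best_q = c, q
            if best_c != cluster_of[node]:
                cluster_of[node] = best_c
                improved = True
        if not improved:
            break
    return cluster_of


# Hypothetical network: two triangles joined by the single link c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
strength = {}
for u, v in edges:
    strength.setdefault(u, {})[v] = 1
    strength.setdefault(v, {})[u] = 1

initial = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1, "f": 1}  # c is misplaced
refined = refine(strength, initial)
q_before = modularity(strength, initial)
q_after = modularity(strength, refined)
print(refined, q_before, q_after)
```

Recomputing the modularity from scratch for every trial move keeps the sketch short but is slow on large networks; practical community detection methods update the score incrementally.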

The step of extracting the at least one core keyword (S140) may calculate the importance based on Equation 3 described above.

FIG. 6 is a block diagram of an apparatus for extracting core keywords based on a keyword co-occurrence network according to an embodiment of the present invention.

Referring to FIG. 6, the apparatus 100 for extracting core keywords based on a keyword co-occurrence network may include at least one processor 110, and a memory 120 storing instructions that instruct the at least one processor 110 to perform at least one step.

Here, the at least one processor 110 may be a central processing unit (CPU), a graphics processing unit (GPU), or a dedicated processor on which the methods according to embodiments of the present invention are performed. Each of the memory 120 and the storage device 160 may be composed of at least one of a volatile storage medium and a non-volatile storage medium. For example, the memory 120 may be composed of at least one of a read-only memory (ROM) and a random access memory (RAM).

The apparatus 100 for extracting core keywords based on a keyword co-occurrence network may also include a transceiver 130 that communicates over a wireless network. The apparatus 100 may further include an input interface device 140, an output interface device 150, a storage device 160, and the like. The components included in the apparatus 100 may be connected by a bus 170 and communicate with one another.

Examples of the apparatus 100 for extracting core keywords based on a keyword co-occurrence network include a communication-capable desktop computer, laptop computer, notebook, smartphone, tablet PC, mobile phone, smart watch, smart glasses, e-book reader, portable multimedia player (PMP), portable game console, navigation device, digital camera, digital multimedia broadcasting (DMB) player, digital audio recorder, digital audio player, digital video recorder, digital video player, personal digital assistant (PDA), and the like.

The methods according to the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software.

Examples of the computer-readable medium include hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as at least one software module in order to perform the operations of the present invention, and vice versa.

In addition, all or part of the configuration or functions of the method or apparatus described above may be implemented in combination or implemented separately.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention may be modified and changed in various ways without departing from the spirit and scope of the present invention as set forth in the claims below.

100: Apparatus for extracting core keywords based on a keyword co-occurrence network
110: Processor 120: Memory
130: Transceiver 140: Input interface device
150: Output interface device 160: Storage device
KJAN: Keyword co-occurrence network
CLT1, CLT2, CLT3, CLT4: Clusters

Claims

A method of extracting a core keyword based on a keyword co-occurrence network, performed by an apparatus for extracting a core keyword, the method comprising:
obtaining a subject keyword input by a user;
extracting keywords from an entire document set based on the subject keyword;
generating a keyword co-occurrence network based on whether the keywords co-occur;
classifying the keywords constituting the keyword co-occurrence network into a plurality of clusters; and
extracting at least one core keyword for each of the clusters,
wherein the extracting of the at least one core keyword comprises
calculating an importance of each of the keywords based on a centrality and an appearance frequency of the keywords constituting each cluster, and extracting the at least one core keyword according to the importance.

The method of claim 1,
wherein the extracting of the keywords comprises:
generating a first document set from the entire document set based on the subject keyword;
extracting at least one reference keyword from the first document set;
generating a second document set from the entire document set based on the at least one reference keyword; and
extracting the keywords from the second document set.

The method of claim 1,
wherein the keyword co-occurrence network
has each of the keywords as a node, has links each connecting keywords that appear together in a single document, and has, as a connection strength of each link, the number of documents in which the two keywords connected by the link appear together.

The method of claim 3,
wherein the keyword co-occurrence network
is reconstructed using only keywords connected by links whose connection strength is greater than or equal to a preset threshold.

The method of claim 1,
wherein the classifying of the keywords into the plurality of clusters comprises
evaluating a modularity of the keyword co-occurrence network as classified into the clusters, and optimizing the clusters by iteratively reclassifying the keywords in a direction that maximizes the modularity.

The method of claim 2,
wherein the extracting of the at least one core keyword comprises
calculating the importance based on the following equation,

where KR(i) is the importance of the i-th keyword among the keywords (i being a natural number), CR(i) is the centrality of the i-th keyword, N(i) is the frequency with which the i-th keyword appears in the second document set, and d is an evaluation constant input by the user to set the relative weight of the centrality and the appearance frequency.

The method of claim 6,
wherein the centrality of the i-th keyword
is one of degree centrality, betweenness centrality, and closeness centrality.

The method of claim 7,
wherein the betweenness centrality
is defined as the ratio of the number of shortest paths between two nodes, other than the node corresponding to the i-th keyword, that pass through the node corresponding to the i-th keyword to the total number of shortest paths between the two nodes.

The method of claim 8,
wherein the shortest path
is the path that minimizes the distance when the reciprocal of the connection strength is taken as the distance.

The method of claim 7,
wherein the closeness centrality
is defined as the reciprocal of the average distance along the shortest paths from the node corresponding to the i-th keyword to the remaining nodes.

An apparatus for extracting a core keyword based on a keyword co-occurrence network, the apparatus comprising:
at least one processor; and
a memory storing instructions that instruct the at least one processor to perform at least one step,
wherein the at least one step comprises:
obtaining a subject keyword input by a user;
extracting keywords from an entire document set based on the subject keyword;
generating a keyword co-occurrence network based on whether the keywords co-occur;
classifying the keywords constituting the keyword co-occurrence network into a plurality of clusters; and
extracting at least one core keyword for each of the clusters,
wherein the extracting of the at least one core keyword comprises
calculating an importance of each of the keywords based on a centrality and an appearance frequency of the keywords constituting each cluster, and extracting the at least one core keyword according to the importance.

The apparatus of claim 11,
wherein the extracting of the keywords comprises:
generating a first document set from the entire document set based on the subject keyword;
extracting at least one reference keyword from the first document set;
generating a second document set from the entire document set based on the at least one reference keyword; and
extracting the keywords from the second document set.

The apparatus of claim 11,
wherein the keyword co-occurrence network
has each of the keywords as a node, has links each connecting keywords that appear together in a single document, and has, as a connection strength of each link, the number of documents in which the two keywords connected by the link appear together.

The apparatus of claim 13,
wherein the keyword co-occurrence network
is reconstructed using only keywords connected by links whose connection strength is greater than or equal to a preset threshold.

The apparatus of claim 11,
wherein the classifying of the keywords into the plurality of clusters comprises
evaluating a modularity of the keyword co-occurrence network as classified into the clusters, and optimizing the clusters by iteratively reclassifying the keywords in a direction that maximizes the modularity.

The apparatus of claim 12,
wherein the extracting of the at least one core keyword comprises
calculating the importance based on the following equation,

where KR(i) is the importance of the i-th keyword among the keywords (i being a natural number), CR(i) is the centrality of the i-th keyword, N(i) is the frequency with which the i-th keyword appears in the second document set, and d is an evaluation constant input by the user to set the relative weight of the centrality and the appearance frequency.

The apparatus of claim 16,
wherein the centrality of the i-th keyword
is one of degree centrality, betweenness centrality, and closeness centrality.

The apparatus of claim 17,
wherein the betweenness centrality
is defined as the ratio of the number of shortest paths between two nodes, other than the node corresponding to the i-th keyword, that pass through the node corresponding to the i-th keyword to the total number of shortest paths between the two nodes.

The apparatus of claim 18,
wherein the shortest path
is the path that minimizes the distance when the reciprocal of the connection strength is taken as the distance.

The apparatus of claim 17,
wherein the closeness centrality
is defined as the reciprocal of the average distance along the shortest paths from the node corresponding to the i-th keyword to the remaining nodes.