KR20190050180A - keyword extraction method and apparatus for science document

Info

Publication number: KR20190050180A
Application number: KR1020170145461A
Authority: KR
Inventors: 서정연; 고영중; 염홍선
Original assignee: 서강대학교산학협력단
Priority date: 2017-11-02
Filing date: 2017-11-02
Publication date: 2019-05-10
Also published as: KR102017227B1

Abstract

According to the present invention, a method for extracting keywords from a scientific document comprises the steps of: receiving a scientific document and converting it, through morphological analysis, into a morpheme-analyzed scientific document consisting of words analyzed as nouns, adjectives, and verbs; constructing a document graph by using the words of the scientific document as vertices and assigning edges representing relationships between words; calculating importance scores for the vertices in the document graph; detecting keyword candidates and location information of the extracted keyword candidates from the morpheme-analyzed scientific document; calculating, for each keyword candidate, a score of the candidate from the scores of the words it contains and the length of the candidate; re-ranking the keyword candidates by changing the score of each candidate according to its location information; and determining a predetermined number of top-ranked candidates among the re-ranked keyword candidates as keywords of the scientific document.

Description

Keyword extraction method and apparatus for science document

The present invention relates to a technique for extracting key phrases from scientific documents, and more particularly to a method and apparatus for extracting key phrases from a scientific document that can increase the reliability of the extracted key phrases by reflecting the position of each key phrase when determining the key phrases.

With the growth of the Internet and related technologies, the volume of information has increased enormously, and efficient management and retrieval of document information has become an important problem. Document summarization is needed as an effective way to address this problem.

Document summarization is the task of reducing the length of a document by extracting only the most meaningful content from the original text while preserving its basic content. There are two broad approaches to document summarization: extraction and abstraction.

The extraction approach extracts important sentences or phrases from a given document as they are. Examples include the 2005 paper by Rada Mihalcea and Paul Tarau, "A Language Independent Algorithm for Single and Multiple Document Summarization," which extracts key sentences using a graph structure, and the 2004 paper by Dragomir R. Radev, Hongyan Jing, Malgorzata Stys, and Daniel Tam, "Centroid-based summarization of multiple documents," Information Processing and Management, which extracts key sentences using features such as the position of a sentence in the document and the frequency of the words the sentence contains.

The abstraction approach does not extract parts of the document but instead writes an entirely new summary from the content of the document. With the current state of the art, abstractive summarization still falls short of producing satisfactory results.

In more detail, Mihalcea, R., & Tarau, P. (2004, July), TextRank: Bringing order into texts, Association for Computational Linguistics, presents a method of extracting important keywords from individual documents using a graph-based ranking algorithm. The units of the document, for example its nouns and adjectives, are used as the vertices of the graph, and relationships between the units are defined to connect the vertices. The resulting edges may be directed or undirected, weighted or unweighted. The importance score of each vertex is then calculated with a graph-based ranking algorithm (PageRank), the vertices are sorted by final score, and the top N are extracted as the final keywords.

Wan, X., & Xiao, J. (2008, July), Single Document Keyphrase Extraction Using Neighborhood Knowledge, In AAAI (Vol. 8, pp. 855-860), further develops the model of Mihalcea and Tarau (2004). The overall process is the same as in the earlier model, with two differences. First, the edges of the graph are weighted: the weight is the number of times two document units (words) co-occur within a given distance. Second, candidates that can become keywords are first extracted from the document, the final score of each candidate is computed by adding up the scores produced by the graph ranking algorithm, and the candidates are sorted by final score so that the top N are extracted as the final keywords.

Frantzi, K., Ananiadou, S., & Mima, H. (2000), Automatic recognition of multi-word terms: the C-value/NC-value method, International Journal on Digital Libraries, 3(2), 115-130, is an algorithm designed to extract multi-word terms from medical documents. It uses information from the entire document: multi-word terms are extracted from the input document, the score of every candidate multi-word term is calculated using Equation 1 below, and the top N candidates are extracted as the final multi-word terms appearing in the document.

In Equation 1, a is a candidate multi-word term, and T_a is the set of longer candidate multi-word terms that contain a. f(a) is the frequency with which candidate a appears in the corpus, and P(T_a) is the size of the set T_a.
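For reference, the C-value measure as defined in the cited Frantzi et al. (2000) paper can be written as follows. This is a reconstruction from that paper, not a copy of Equation 1 as printed in this publication; |a|, the number of words in candidate a, is a symbol taken from the paper and is not defined in the surrounding text.

```latex
\text{C-value}(a) =
\begin{cases}
\log_2 |a| \cdot f(a), & \text{if } a \text{ is not nested in a longer candidate} \\[4pt]
\log_2 |a| \left( f(a) - \dfrac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right), & \text{otherwise}
\end{cases}
```

The subtracted term is the average frequency of the longer candidates that contain a, which discounts occurrences of a that are only part of a longer term.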

The problems of the prior art described above are as follows.

First, Mihalcea, R., & Tarau, P. (2004, July), TextRank: Bringing order into texts, Association for Computational Linguistics, uses a constant weight for the edges (relationships) connecting the vertices (document units) on the graph, and it is limited in that keywords can be composed and extracted only from the top candidate keywords produced by the graph-based ranking algorithm.

Wan, X., & Xiao, J. (2008, July), Single Document Keyphrase Extraction Using Neighborhood Knowledge, In AAAI (Vol. 8, pp. 855-860), uses the co-occurrence frequency as the edge weight on the graph, but this frequency suffers from double counting. For example, if the phrases "Korean Information Science Society" and "Information Science Society" both exist, "Information" and "Science Society" are counted in the frequencies of both candidates. In addition, because the score of a candidate keyword is computed as the sum of the scores of its constituent words, the score grows as the candidate gets longer. Given the two candidates "strict Constitution of the Republic of Korea" and "Constitution of the Republic of Korea", summation always assigns the higher score to the former.

Frantzi, K., Ananiadou, S., & Mima, H. (2000), Automatic recognition of multi-word terms: the C-value/NC-value method, International Journal on Digital Libraries, 3(2), 115-130, requires many documents from the same field in order to extract terms, and because it uses only frequency information, it is difficult to judge importance.

Korean Patent Registration No. 101664711
Korean Patent Publication No. 1020100128007
Korean Patent Registration No. 1015365200000

An object of the present invention is to provide a method and apparatus for extracting key phrases from a scientific document that can increase the reliability of the extracted key phrases by reflecting the position of each key phrase when determining the key phrases.

Another object of the present invention is to provide a method and apparatus for extracting key phrases from a scientific document that can increase the reliability of the extracted key phrases by reflecting the length of each key phrase when determining the key phrases.

To achieve the above objects, a method for extracting keywords from a scientific document according to the present invention comprises: receiving a scientific document and converting it, through morphological analysis, into a morpheme-analyzed scientific document composed of words analyzed as nouns, adjectives, and verbs; constructing a document graph by using the words of the scientific document as vertices and assigning edges representing relationships between words; calculating importance scores for the vertices in the document graph; detecting key phrase candidates and position information of the extracted key phrase candidates in the morpheme-analyzed scientific document; calculating, for each key phrase candidate, a score of the candidate according to the scores of the words it contains and the length of the candidate; re-ranking the key phrase candidates by changing the score of each candidate according to its position information; and determining a predetermined number of top-ranked candidates among the re-ranked key phrase candidates as the keywords of the scientific document.

The present invention described above has the effect of increasing the reliability of key phrases by reflecting the position of each key phrase when determining the key phrases of a scientific document.

The present invention also has the effect of increasing the reliability of key phrases by reflecting the length of each key phrase when determining the key phrases.

FIG. 1 is a block diagram of an apparatus for extracting key phrases from a scientific document according to a preferred embodiment of the present invention.
FIG. 2 and FIG. 3 are flowcharts of a method for extracting key phrases from a scientific document according to a preferred embodiment of the present invention.

The present invention can increase the reliability of key phrases by reflecting the position of each key phrase when determining the key phrases of a scientific document.

The present invention can also increase the reliability of key phrases by reflecting the length of each key phrase when determining the key phrases.

Preferred embodiments of the present invention are described in detail below with reference to the drawings.

<Configuration of the apparatus for extracting key phrases from a scientific document>

FIG. 1 shows the configuration of an apparatus for extracting key phrases from a scientific document according to a preferred embodiment of the present invention. Referring to FIG. 1, the apparatus comprises a control device 100, a memory unit 102, a user interface unit 104, a display unit 106, and a communication unit 108.

The control device 100 performs keyword extraction on a scientific document according to a preferred embodiment of the present invention. More specifically, the control device 100 receives a scientific document through the user interface unit 104 or the communication unit 108 and converts it, through morphological analysis, into a morpheme-analyzed scientific document composed of words analyzed as nouns, adjectives, and verbs; constructs a document graph by using the words of the scientific document as vertices and assigning edges representing relationships between words; calculates importance scores for the vertices in the document graph; detects key phrase candidates and the position information of the extracted candidates in the morpheme-analyzed scientific document; calculates, for each key phrase candidate, a score of the candidate according to the scores of the words it contains and the length of the candidate; re-ranks the key phrase candidates by changing the score of each candidate according to its position information; and determines a predetermined number of top-ranked candidates among the re-ranked key phrase candidates as the keywords of the scientific document.

The control device 100 also extracts, from a plurality of corpora, the key phrase position information used for re-ranking the key phrase candidates, and stores it. This information may instead be determined by having the positions of key phrases selected through the user interface unit 104.

The memory unit 102 provides a storage area for storing the various kinds of information used in extracting key phrases from a scientific document according to a preferred embodiment of the present invention.

The user interface unit 104 serves as the interface between the user and the control device 100.

The display unit 106 displays various kinds of information under the control of the control device 100.

The communication unit 108 handles communication between the control device 100 and an external network or external devices.

<Procedure of the method for extracting key phrases from a scientific document>

The procedure of the method for extracting key phrases from a scientific document according to the present invention, which is applicable to the apparatus described above, is now described with reference to FIG. 2.

When a scientific document is input through the user interface unit 104 or the communication unit 108 (step 200), the control device 100 performs preprocessing (step 202). The preprocessing converts the scientific document, through morphological analysis, into a morpheme-analyzed scientific document, which consists of words analyzed as nouns, adjectives, and verbs.

The control device 100 constructs a document graph using the words analyzed as nouns, adjectives, and verbs in the morpheme-analyzed scientific document (step 204). The document graph treats these words as the units of the document, uses the units as vertices, and assigns edges representing relationships between words.

Once the document graph has been constructed, the control device 100 calculates weights for the edges of the document graph according to Equation 2 and assigns them (step 206). The weight is calculated probabilistically from the frequency with which the two words co-occur.

Equation 2 calculates the weight of the edge connecting two vertices, vertex 1 and vertex 2. The co-occurrence count is the number of times vertex 1 and vertex 2 are judged to co-occur, that is, the number of times both fall within a window (a span of preceding and following words) whose size is set by the user or developer. The frequency is the frequency in the input document. The average of the weights is obtained by first calculating all the weights without the average term and then taking their mean.
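The two quantities that Equation 2 takes as input, the co-occurrence count within a window and the per-word frequency, can be gathered as in the following minimal sketch. The function name `cooccurrence_counts`, the `window_size` default, and the toy word list are illustrative assumptions; the sketch deliberately stops short of computing the edge weight itself, because the form of Equation 2 is not reproduced in this text.

```python
from collections import Counter


def cooccurrence_counts(words, window_size=2):
    """Count unordered word pairs that appear within `window_size` positions.

    `words` is the morpheme-analyzed document as a list of tokens.
    Pairs of identical words are skipped, since a vertex has no self-edge here.
    """
    pair_counts = Counter()
    for i, word in enumerate(words):
        # Look only forward so that each co-occurrence is counted once.
        for j in range(i + 1, min(i + window_size + 1, len(words))):
            if words[j] != word:
                pair_counts[frozenset((word, words[j]))] += 1
    return pair_counts


# Hypothetical toy document (already reduced to nouns, adjectives, verbs).
words = ["graph", "ranking", "algorithm", "graph", "ranking", "score"]
pairs = cooccurrence_counts(words, window_size=2)
freqs = Counter(words)  # frequency of each word in the input document

print(pairs[frozenset(("graph", "ranking"))], freqs["graph"])
```

Using `frozenset` keys keeps the counts symmetric, which suits an undirected document graph; a directed variant would use ordered tuples instead.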

The control device 100 then calculates an importance score for each word, that is, each vertex, of the scientific document using a graph-based ranking algorithm (step 210). The importance score of a vertex under the graph-based ranking algorithm is calculated according to Equation 3.

In Equation 3, d is a damping factor, a probability value between 0 and 1; it is a hyperparameter set by the user or developer. i is the identification number of the word being scored and denotes vertex i. j is the identification number of a neighboring word connected to vertex i and denotes vertex j. k is the identification number of a word connected to vertex j and denotes vertex k. Neighboring vertices are defined as follows: taking a given word as the current vertex, any word that falls within a window of predetermined size around it corresponds to a vertex that is a neighbor of the current vertex.
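The roles of d, i, j, and k described for Equation 3 match the weighted ranking formula of the TextRank paper cited earlier, Score(i) = (1 - d) + d * sum over j of [w(j, i) / sum over k of w(j, k)] * Score(j). The sketch below iterates that standard formula; it is an assumption that Equation 3 has exactly this form, and the function name, the default d = 0.85, the iteration count, and the toy graph are all illustrative.

```python
def weighted_textrank(weights, d=0.85, iterations=50):
    """Iterate the weighted TextRank update on an undirected graph.

    `weights` maps each vertex to a dict {neighbor: edge weight}; for an
    undirected graph, weights[a][b] must equal weights[b][a].
    """
    scores = {vertex: 1.0 for vertex in weights}
    for _ in range(iterations):
        updated = {}
        for i in weights:
            total = 0.0
            for j, w_ji in weights[i].items():
                # Sum of the weights of all edges leaving neighbor j (index k).
                out_sum = sum(weights[j].values())
                if out_sum > 0:
                    total += w_ji / out_sum * scores[j]
            updated[i] = (1 - d) + d * total
        scores = updated
    return scores


# Hypothetical three-word graph: "keyword" co-occurs with both other words.
graph = {
    "keyword": {"extraction": 2.0, "graph": 1.0},
    "extraction": {"keyword": 2.0},
    "graph": {"keyword": 1.0},
}
scores = weighted_textrank(graph)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A fixed iteration count keeps the sketch short; a production version would more likely stop when the largest score change falls below a tolerance.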

The control device 100 also extracts key phrase candidates from the morpheme-analyzed scientific document, and detects and stores the position information of the extracted candidates (steps 214 and 216). Candidates may be extracted with an N-gram-based method or a rule-based method using morphemes. The key phrase candidates are noun phrases, and the position information indicates where in the scientific document each candidate is located.

Because the present invention exploits the fact that the summary and core content of a scientific document appear near its beginning, the scientific document is divided in advance into a number of sections by position, and key phrases are identified according to which section of the document each candidate falls in. For this purpose, the scientific document may be divided into sections of 10 words each.
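As a rough illustration of the position information, the sketch below maps each candidate to the index of the 10-word section in which it first occurs. The function name `first_sections`, the representation of candidates as word tuples, and the toy document are assumptions made for the example; candidate extraction itself (N-gram or rule based) is taken as already done.

```python
def first_sections(words, candidates, section_size=10):
    """Return {candidate: section index of its first occurrence}.

    `words` is the tokenized document; each candidate is a tuple of words.
    Candidates that never occur are left out of the result.
    """
    result = {}
    for candidate in candidates:
        length = len(candidate)
        for start in range(len(words) - length + 1):
            if tuple(words[start:start + length]) == candidate:
                # Integer division turns a word offset into a section index.
                result[candidate] = start // section_size
                break
    return result


# Hypothetical 25-token document; "x" stands for any other word.
words = ["x"] * 25
words[3] = "keyword"
words[12], words[13] = "document", "graph"
words[22], words[23] = "document", "graph"

sections = first_sections(words, [("keyword",), ("document", "graph")])
print(sections)
```

Only the first occurrence is recorded, so the later repeat of "document graph" at offset 22 does not change its section.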

Once the key phrase candidates have been extracted as described above, the control device 100 calculates the score of each candidate according to Equation 4, using the importance scores of the words that make up the candidate (step 212).

In Equation 4, score(word i in the key phrase candidate) is the importance score of that word, calculated according to Equation 3.

The score of a key phrase candidate under Equation 4 is calculated using a modified harmonic mean of its words' scores. The modified harmonic mean is used to keep the score of a candidate from depending on its length while also preventing the length information from being lost.
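The length bias that motivates this choice can be seen in a small comparison of summation against a harmonic mean. The sketch uses the plain harmonic mean from Python's standard library, not the modified form of Equation 4, which is not reproduced in this text; the word scores are made-up values for illustration.

```python
from statistics import harmonic_mean

# Hypothetical importance scores for three words.
word_scores = {"strict": 0.4, "Korean": 0.9, "constitution": 1.2}

long_candidate = ["strict", "Korean", "constitution"]
short_candidate = ["Korean", "constitution"]


def sum_score(candidate):
    """Score a candidate by adding its word scores (grows with length)."""
    return sum(word_scores[word] for word in candidate)


def hmean_score(candidate):
    """Score a candidate by the harmonic mean of its word scores."""
    return harmonic_mean([word_scores[word] for word in candidate])


# Summation favors the longer candidate just because it has more words.
print(sum_score(long_candidate) > sum_score(short_candidate))
# The harmonic mean is pulled down by the weak word "strict".
print(hmean_score(short_candidate) > hmean_score(long_candidate))
```

A plain mean discards length entirely, which is presumably why the text describes a modified form that retains some length information.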

The candidate scores calculated in this way are affected by overlapping information on the edges between words that arises when the graph is built. For example, if the candidates "Information Science Society" and "Korean Information Science Society" both appear in a document, measuring the frequency of "Information Science Society" also counts the occurrences contained in "Korean Information Science Society", so the frequency information can be biased.

Therefore, once the scores of the key phrase candidates have been calculated, the control device 100 corrects the scores according to Equation 5 to resolve the overlap of information on the graph edges, and re-ranks the key phrase candidates (step 218).

Equation 5 corrects the scores of key phrase candidates for re-ranking. a is a key phrase candidate, b is the set of longer key phrase candidates that contain a, i ranges over the elements of list b (the longer candidates containing a), and N is the size of list b. The position function(i) uses a corpus of scientific documents to calculate the probability that the position of the current key phrase candidate is that of a key phrase, based on the probability distribution of positions where key phrases actually appear.

More specifically, the control device 100 divides a large-scale scientific document corpus into a number of position sections and obtains the probability that an actual key phrase occurs in each section. That is, using the total number of key phrases occurring in the large-scale scientific document corpus and the number of actual key phrases occurring in each position section, the probability that a key phrase occurs in each position section can be computed and stored in advance.

In this state, when a scientific document is input, the control device 100 uses the position function to look up the precomputed probability for the position section in which a key phrase candidate first occurs in the input document, outputs it as the result of the position function, and reflects it in the key phrase score according to Equation 5.
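The precomputed table and its lookup might be sketched as follows. The sketch assumes the per-section probability is the number of annotated key phrases first occurring in a section divided by the total number of annotated key phrases; the function names, the 10-word section size, and the offsets in `gold_offsets` are illustrative, not data from this publication.

```python
from collections import Counter


def section_probabilities(gold_offsets, section_size=10):
    """Estimate P(section) from word offsets of annotated key phrases.

    `gold_offsets` holds, for a corpus, the word offset at which each
    annotated key phrase first occurs in its document.
    """
    counts = Counter(offset // section_size for offset in gold_offsets)
    total = sum(counts.values())
    return {section: count / total for section, count in counts.items()}


def position_value(first_offset, probabilities, section_size=10):
    """Look up the stored probability for a candidate's first occurrence."""
    return probabilities.get(first_offset // section_size, 0.0)


# Hypothetical offsets: most key phrases appear early in their documents.
gold_offsets = [0, 3, 5, 8, 12, 17, 25, 41]
probabilities = section_probabilities(gold_offsets)

early = position_value(7, probabilities)   # falls in section 0
late = position_value(33, probabilities)   # section 3 has no key phrases
print(early, late)
```

Sections never seen in the corpus fall back to 0.0 here; a smoothed estimate would be a reasonable alternative if unseen sections should not zero out a candidate.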

That is, the present invention goes through a re-ranking process to remove the information duplicated through the graph edges. Because the duplication lies in the frequency information, the C-value formula used by Frantzi et al. (2000) is modified to resolve it. The positions of the key phrase candidates are also taken into account in this calculation.

The control device 100 then determines the top N candidates among the re-ranked key phrase candidates as the final key phrases (step 220).

As described above, the present invention corrects the scores of key phrase candidates according to their position information. To this end, as shown in FIG. 3, the control device 100 receives corpora (step 300) and receives and stores in advance the key phrase position information for the corpora (step 302). This key phrase position information indicates where key phrases are located within a scientific document, and it may be specified by the user through the user interface.

100: Control device
102: Memory unit
104: User interface unit
106: Display unit
108: Communication unit

Claims

1. A method for extracting keywords from a scientific document, the method comprising:
receiving a scientific document and converting it, through morphological analysis, into a morpheme-analyzed scientific document composed of words analyzed as nouns, adjectives, and verbs;
constructing a document graph by using the words of the scientific document as vertices and assigning edges representing relationships between words;
calculating importance scores for the vertices in the document graph;
detecting key phrase candidates and position information of the extracted key phrase candidates in the morpheme-analyzed scientific document;
calculating, for each key phrase candidate, a score of the candidate according to the scores of the words it contains and the length of the candidate;
re-ranking the key phrase candidates by changing the score of each candidate according to its position information; and
determining a predetermined number of top-ranked candidates among the re-ranked key phrase candidates as keywords of the scientific document.

2. The method according to claim 1,
wherein the scores of the key phrase candidates are calculated according to Equation 6.
Equation 6

In Equation 6, Score(vertex i) is the importance score of word i of the scientific document, d is a damping factor, j is the identification number of the neighboring words connected to vertex i, i is the identification number of the word being scored, k is the identification number of the words connected to vertex j, weight(vertex 1, vertex 2) is the weight of the edge connecting vertex 1 and vertex 2, frequency(vertex 1) and frequency(vertex 2) are the numbers of times vertex 1 and vertex 2 appear in the scientific document, and co-occurrence count(vertex 1, vertex 2) is the number of times vertex 1 and vertex 2 appear together within a predetermined window.

3. The method according to claim 1,
wherein the ranking of the key phrase candidates is calculated according to Equation 7.
Equation 7

In Equation 7, a is a key phrase candidate, b is the set of longer key phrase candidates that contain a, i is an element of list b, N is the size of list b, and the position function(i) uses a corpus of scientific documents to calculate the probability that the position of the current key phrase candidate is that of a key phrase, based on the probability distribution of positions where key phrases actually appear.

4. An apparatus for extracting keywords from a scientific document, the apparatus comprising:
a user interface unit for interfacing with a user;
a memory unit providing a storage area for keyword extraction from a scientific document; and
a control device that receives a scientific document and converts it, through morphological analysis, into a morpheme-analyzed scientific document composed of words analyzed as nouns, adjectives, and verbs,
constructs a document graph by using the words of the scientific document as vertices and assigning edges representing relationships between words,
calculates importance scores for the vertices in the document graph,
detects key phrase candidates and position information of the extracted key phrase candidates in the morpheme-analyzed scientific document,
calculates, for each key phrase candidate, a score of the candidate according to the scores of the words it contains and the length of the candidate,
re-ranks the key phrase candidates by changing the score of each candidate according to its position information,
and determines a predetermined number of top-ranked candidates among the re-ranked key phrase candidates as keywords of the scientific document.

5. The apparatus of claim 4,
wherein the scores of the key phrase candidates are calculated according to Equation 8.
Equation 8

In Equation 8, Score(vertex i) is the importance score of word i of the scientific document, d is a damping factor, j is the identification number of the neighboring words connected to vertex i, i is the identification number of the word being scored, k is the identification number of the words connected to vertex j, weight(vertex 1, vertex 2) is the weight of the edge connecting vertex 1 and vertex 2, frequency(vertex 1) and frequency(vertex 2) are the numbers of times vertex 1 and vertex 2 appear in the scientific document, and co-occurrence count(vertex 1, vertex 2) is the number of times vertex 1 and vertex 2 appear together within a predetermined window.

The method according to claim 1,
Wherein the ranking of the key word candidates is calculated according to Equation (9).
Equation 9

In Equation 9, a is a key phrase candidate, b is the set of longer key phrase candidates that contain a, i is an element of list b, N is the size of list b, and the position function(i) uses a corpus of scientific documents to calculate the probability that the position of the current key phrase candidate is that of a key phrase, based on the probability distribution of positions where key phrases actually appear.