KR100703193B1

KR100703193B1 - Apparatus for summarizing generic text summarization using non-negative matrix factorization and method therefor

Info

Publication number: KR100703193B1
Application number: KR1020060037974A
Authority: KR
Inventors: 이주홍; 박선; 안찬민; 김덕환
Original assignee: 인하대학교 산학협력단
Priority date: 2006-04-27
Filing date: 2006-04-27
Publication date: 2007-04-09

Abstract

The present invention relates to a document summarization apparatus using non-negative matrix factorization, comprising: a preprocessing unit that, as a preliminary step to summarizing a document, decomposes the given document into individual sentences, removes unnecessary strings, extracts roots, and calculates weights for the sentences, so that the document can be summarized based on non-negative matrix factorization; and a document summarization unit that summarizes the entire document by performing non-negative matrix factorization on a matrix A.

According to the present invention, document contents can be summarized accurately by using the non-negative semantic variable matrix extracted through non-negative matrix factorization.

Non-negative matrix, factorization, document, summary, non-negative semantic feature matrix, non-negative semantic variable matrix

Description

Apparatus for summarizing generic text summarization using non-negative matrix factorization and method therefor

FIG. 1 is a block diagram of a document summarization apparatus using non-negative matrix factorization according to an embodiment of the present invention.

FIG. 2 is a flowchart of a document summarization method using non-negative matrix factorization according to an embodiment of the present invention.

FIG. 3 is a detailed flowchart of the non-negative matrix factorization step according to an embodiment of the present invention.

FIG. 4 is a detailed flowchart of the sentence selection step according to an embodiment of the present invention.

The present invention relates to a document summarization apparatus and method using non-negative matrix factorization, and more particularly to an apparatus and method that automatically produce a comprehensive summary of the entire contents of a document by using the non-negative semantic variable matrix obtained through non-negative matrix factorization.

Recently, advances in information and communication technology have increased the use of computer networks and spread the use of digitized electronic documents, so the volume of documents that an individual must manage and review keeps growing. As a result, many computer users spend more and more time and effort reviewing and managing documents related to themselves or to the organizations they belong to. Tools for finding documents relevant to an individual, and for reviewing and classifying them, are therefore urgently needed for effective information management.

Information retrieval systems are used to find desired information within this vast body of information, but the search results they present contain too much information for a user to examine in detail. Instead of the common approach of presenting only the introductory part of each document, methods that address this information overload by reducing the time a user needs to find the desired information include 1) document summarization based on latent semantic analysis (LSA) with sentence extraction, and 2) document summarization based on the topics of a document.

First, the LSA-based document summarization method uses the components of singular vectors to select sentences. The singular vectors corresponding to the singular values have both positive and negative components, and the extracted sentences are ranked by the component values of singular vectors that carry little meaning, so the extracted sentences depend only on vector values without regard to meaning.

The topic-based document summarization method finds the various topics in a document and then summarizes the document by selecting sentences that match each topic. It computes saliency scores from the sentences and terms and clusters the sentences into topical groups, which makes the extraction considerably expensive.

An object of the present invention is to solve the above problems by using the non-negative semantic variable matrix extracted through non-negative matrix factorization, so that document contents can be summarized accurately.

Another object of the present invention is to separate a large document into individual sentences so that the document can be summarized easily using the separated sentences.

Yet another object of the present invention is to use a non-negativity constraint, which resembles the human cognitive process.

Yet another object of the present invention is to summarize documents by unsupervised learning, so that no training sentences prepared in advance by experts are needed and extraction can be performed easily at a low computational cost.

To realize this technical idea, a document summarization apparatus using non-negative matrix factorization according to the present invention comprises: a preprocessing unit (120) that, as a preliminary step to summarizing a document (D), decomposes the given document into individual sentences, removes unnecessary strings, extracts roots, and calculates weights for the sentences, so that the document can be summarized based on non-negative matrix factorization; and a document summarization unit (130) that summarizes the entire document by performing non-negative matrix factorization on a matrix A.

Preferably, the apparatus further comprises: an input unit (110) that receives from a user the document to be summarized and a setting signal for that document; and an output unit (140) that outputs the summary produced by the document summarization unit so that the user can check it.

Also preferably, the preprocessing unit (120) comprises: sentence decomposition means (121) for decomposing the document into individual sentences; extraction-sentence setting means (122) for setting the number of sentences to be extracted; stop-word removal means (123) for removing stop words from each sentence produced by the sentence decomposition means; root extraction means (124) for extracting roots from the sentences; term-frequency vector generation means (125) for generating, for each sentence, a term-frequency vector according to the frequency of use of terms other than stop words; weight calculation means (126) for calculating weights for the sentences using the term-frequency vectors; and term-sentence matrix generation means (127) for generating the matrix A composed of the terms and the sentences in the document.

Also preferably, the document summarization unit (130) comprises: factorization means (131) for factorizing the matrix A using non-negative matrices W and H; non-negative matrix calculation means (132) for calculating the non-negative matrices W and H; sentence selection means (133) for selecting and extracting the sentence corresponding to the sentence vector $A_{\cdot q}$ of the matrix A, where q is the column holding the largest element of the row vector $H_{p\cdot}$ in the p-th row of the matrix H, and adding that sentence to a summary (S); and document summary generation means (134) for generating the summary of the document from the set of sentences selected by the sentence selection means.

Also preferably, the weights include a local weight and a global weight.

Also preferably, the non-negative matrix factorization means (131) factorizes the matrix A using a non-negative semantic feature matrix W and a non-negative semantic variable matrix H,

where the matrix W is a matrix whose dimensions are given by the number of terms and the number of semantic features, and the matrix H is a matrix obtained by approximately factorizing the matrix A.

Also preferably, the non-negative matrices W and H are those for which the value of $\lVert A - WH \rVert^2$ is minimized.

Preferably, the sentence selection means (133) selects the row vectors $H_{p\cdot}$ in descending order of the sum of their elements.

Meanwhile, a document summarization method using non-negative matrix factorization according to the present invention comprises the steps of: (a) decomposing a document (D) to be summarized into individual sentences, extracting them, and then setting the number (k) of sentences to be included in the summary; (b) removing meaningless stop words from the sentences and extracting the roots contained in the sentences; (c) generating a term-frequency vector according to the frequency of use of terms in each sentence; (d) calculating weights for the sentences, including a local weight and a global weight, using the term-frequency vectors; (e) generating a term-sentence matrix (A) composed of m terms and n sentences; (f) factorizing the term-sentence matrix using non-negative matrices; (g) selecting the sentence corresponding to the sentence vector $A_{\cdot q}$ of the matrix A, where q is the column holding the largest element of the row vector $H_{p\cdot}$ in the p-th row of a matrix H; and (h) outputting the sentences selected in step (g) as a summary (S) of the document.

Also preferably, the method further comprises, before step (a), the step of: (a-1) receiving from a user the document to be summarized and a setting signal for that document specifying the number of sentences to be included in the summary, the unnecessary strings, and the types of roots.

Also preferably, step (f) comprises the steps of: (f-1) factorizing using a non-negative semantic feature matrix W and a non-negative semantic variable matrix H; (f-2) determining whether the non-negative matrices W and H minimize $\lVert A - WH \rVert^2$; and (f-3) if the determination in step (f-2) is that the value is minimized, calculating the values of the non-negative matrices W and H and then updating them simultaneously.

Also preferably, step (g) comprises the steps of: (g-1) calculating the sum of the elements of each row vector $H_{p\cdot}$; (g-2) selecting and extracting k row vectors $H_{p\cdot}$ in descending order of their element sums; (g-3) determining whether the number of selected and extracted sentences equals the number (k) of sentences to be summarized and, if it equals k, selecting, for each of the k selected row vectors, the sentence corresponding to the sentence vector $A_{\cdot q}$ of the matrix A, where q is the column holding the largest element in that row; and (g-4) if the determination in step (g-3) is that the selected number is less than k, calculating the similarity between the column vectors of the non-negative semantic feature matrix W and the topic, finding a column vector ($W_{\cdot p}$) with high similarity, and repeatedly selecting and extracting sentences.

Preferably, step (g-4) is repeated as many times as the number (k) of the decomposed and extracted sentences.

The features and advantages of the present invention will become more apparent from the following detailed description taken with the accompanying drawings. Before that, the terms and words used in this specification and the claims should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way. Detailed descriptions of known functions and configurations related to the present invention are omitted where they could unnecessarily obscure the gist of the invention.

Hereinafter, the overall technical idea of the present invention is described with reference to the drawings.

A document summarization apparatus using non-negative matrix factorization according to an embodiment of the present invention is described below with reference to FIG. 1.

FIG. 1 is a block diagram of a document summarization apparatus using non-negative matrix factorization according to an embodiment of the present invention.

Referring first to FIG. 1, the document summarization apparatus (100), which can summarize a general document (D) using the non-negative semantic variable matrix by extracting sentences based on non-negative matrix factorization, includes an input unit (110), a preprocessing unit (120), a document summarization unit (130), and an output unit (140).

First, the input unit (110) receives from the user the document (D) to be summarized and a setting signal for that document.

In this embodiment, the setting signal for the document (D) includes the number of sentences to be summarized, the unnecessary strings, the types of roots, and the like, but the present invention is not limited thereto.

As a preliminary step to summarizing a general document (D), the preprocessing unit (120) decomposes the given document (D) into individual sentences, removes unnecessary strings, extracts roots, and calculates weights for the sentences, so that the document can be summarized based on non-negative matrix factorization. To this end it includes sentence decomposition means (121), extraction-sentence setting means (122), stop-word removal means (123), root extraction means (124), term-frequency vector generation means (125), weight calculation means (126), and term-sentence matrix generation means (127).

First, the sentence decomposition means (121) decomposes the given document (D) into individual sentences.

The extraction-sentence setting means (122) sets the number of sentences to be extracted from the document (D). In this embodiment the number of extracted sentences is set to k, but the present invention is not limited thereto.

The stop-word removal means (123) removes stop words, using a stop-word list (A), from each sentence produced by the sentence decomposition means (121).

In this embodiment, the Rijsbergen stop-word list is used as the stop-word list (A), but the present invention is not limited thereto.

For reference, stop words (or noise words) are words that are ignored during a search, or words that a search engine excludes from the index when building its database. In Korean they include particles, suffixes, conjunctions, and endings such as '~를', '~을', '~에서', and '~와'; in English they include verbs, auxiliary verbs, personal pronouns, demonstrative pronouns, and prepositions such as 'a, an, is, are, if, for, but, this, may'.
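The sentence decomposition and stop-word removal described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the punctuation-based sentence splitter, and the tiny stop-word set (taken from the English examples in the text rather than from the full Rijsbergen list) are all assumptions made for the example.

```python
import re

# Illustrative stop-word set built from the English examples in the text.
# A real system would load a complete list (the embodiment names Rijsbergen's).
STOP_WORDS = {"a", "an", "is", "are", "if", "for", "but", "this", "may"}


def split_sentences(document: str) -> list[str]:
    """Split after '.', '!' or '?' followed by whitespace (a simplifying assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def remove_stop_words(sentence: str) -> list[str]:
    """Lower-case the sentence, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


document = "This is a cat. A dog may bark, but a cat is quiet."
sentences = split_sentences(document)
terms_per_sentence = [remove_stop_words(s) for s in sentences]
print(sentences)
print(terms_per_sentence)
```

Root extraction is left out of the sketch because it depends on a language-specific morphological analyzer, which the text does not specify.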

The root extraction means (124) extracts roots from each sentence.

For reference, a root is the morpheme, among the morphemes that make up a word, that carries the word's substantive meaning. In a compound word it is each of the combined substantive morphemes, and in a derived word it is the part that remains after the affixes are removed. For example, '덮' in '덮개' (cover), '사람' in '사람답다' (to be humane), and '넉넉' in '넉넉하다' (to be ample) are roots.

The term-frequency vector generation means (125) generates, for each sentence, a term-frequency vector according to the frequency of use of terms other than stop words. The generated term-frequency vector is given by Equation 1 below.

$$T_i = [t_{1i}, t_{2i}, \ldots, t_{ni}]^T \tag{1}$$

In the above equation, the term-frequency vector $T_i$ represents the frequency of use of terms in the i-th sentence, and $t_{ji}$, an element of $T_i$, is the frequency of the j-th term in the i-th sentence.
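As a rough illustration of Equation 1, the sketch below builds one term-frequency vector per sentence from already-preprocessed term lists. The function name and the choice to order the vocabulary alphabetically are assumptions for the example; the text does not prescribe a term ordering.

```python
from collections import Counter


def term_frequency_vectors(terms_per_sentence):
    """Return (vocabulary, vectors), where vectors[i][j] is the count of
    vocabulary[j] in sentence i, i.e. one vector T_i per sentence."""
    # Assumption: the vocabulary is the sorted set of all terms in the document.
    vocabulary = sorted({t for terms in terms_per_sentence for t in terms})
    vectors = []
    for terms in terms_per_sentence:
        counts = Counter(terms)
        vectors.append([counts[t] for t in vocabulary])
    return vocabulary, vectors


vocabulary, tf_vectors = term_frequency_vectors([["cat", "sat", "cat"], ["dog", "sat"]])
print(vocabulary)
print(tf_vectors)
```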

The weight calculation means (126) uses the generated term-frequency vectors to calculate weights for the sentences, including a local weight and a global weight.

More specifically, the local weight $L(t_{ji})$ is the weight of the j-th term in the i-th sentence, and the global weight $G(t_{ji})$ is the weight of the j-th term in the entire document (D); they are expressed as Equations 2 and 3 below.

$$L(i) = tf(i) \tag{2}$$

$$G(i) = \log\bigl(N / n(i)\bigr) \tag{3}$$

Here, $tf(i)$ is the frequency with which the i-th term occurs, N is the total number of sentences in the document (D), and $n(i)$ is the number of sentences that contain the i-th term.

Using Equations 2 and 3 above, the weight for each sentence, which combines the local and global weights, is expressed as Equation 4 below.

$$a_{ji} = L(t_{ji}) \cdot G(t_{ji}) \tag{4}$$

Here, $a_{ji}$ is an element of the i-th sentence vector $A_i$, which is expressed as Equation 5 below.

$$A_i = [a_{1i}, a_{2i}, \ldots, a_{ni}]^T \tag{5}$$

The term-sentence matrix generation means (127) generates an m×n matrix A composed of the m terms and n sentences in the document (D).

Here, the matrix A (m×n) consists of elements $A_{ij}$, and each element $A_{ij}$ is the weighted frequency of the i-th term in the j-th sentence.
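The weighting of Equations 2 to 4 and the assembly of the matrix A can be sketched as below. The function name and input layout are assumptions for the example, and the natural logarithm is assumed because the text does not state the base. Note that under Equation 3 a term that occurs in every sentence receives a global weight of log(1) = 0.

```python
import math


def build_term_sentence_matrix(tf_vectors):
    """tf_vectors[i][j] is the frequency of term j in sentence i.
    Returns A with one row per term and one column per sentence."""
    n_sentences = len(tf_vectors)
    n_terms = len(tf_vectors[0])
    # n(j): number of sentences containing term j (assumed >= 1 for every term).
    containing = [sum(1 for vec in tf_vectors if vec[j] > 0) for j in range(n_terms)]
    matrix = []
    for j in range(n_terms):
        global_weight = math.log(n_sentences / containing[j])  # Equation 3
        # Equation 4: local weight (the raw frequency, Equation 2) times global weight.
        matrix.append([vec[j] * global_weight for vec in tf_vectors])
    return matrix


A = build_term_sentence_matrix([[2, 0, 1], [0, 1, 1]])
for row in A:
    print(row)
```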

The document summarization unit (130) summarizes the entire given document (D) by performing non-negative matrix factorization (NMF) on the matrix A generated by the term-sentence matrix generation means (127). To this end it includes non-negative matrix factorization means (131), non-negative matrix calculation means (132), sentence selection means (133), and document summary generation means (134).

The non-negative matrix factorization means (131) factorizes the matrix A using a non-negative semantic feature matrix (NSFM) W and a non-negative semantic variable matrix (NSVM) H, as expressed in Equation 6 below.

Here, A is an m×n matrix, W is an m×r matrix, and H is the matrix obtained by approximately factorizing the matrix A (an r×n matrix, so that the product WH is m×n). r is set to the number of semantic features.
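Equation 6 itself does not appear in this text. Given the dimensions stated above and the description of H as an approximate factor of A, it presumably has the standard NMF form below; this is a reconstruction under that assumption, not a quotation of the original equation.

```latex
A \approx W H, \qquad
A \in \mathbb{R}_{\ge 0}^{m \times n},\quad
W \in \mathbb{R}_{\ge 0}^{m \times r},\quad
H \in \mathbb{R}_{\ge 0}^{r \times n}
```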

The non-negative matrix calculation means (132) calculates the non-negative matrices W and H using Equations 7 and 8 below.

Here, the non-negative matrices W and H are calculated and updated simultaneously until the value of $\lVert A - WH \rVert^2$ reaches its minimum.
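Equations 7 and 8 are not reproduced in this text. A common way to minimize the squared error $\lVert A - WH \rVert^2$ under non-negativity is the Lee-Seung multiplicative update rule, and the sketch below assumes that rule; it may differ from the update actually claimed. The helper names, the fixed iteration count used in place of a convergence test, the small `eps` guard against division by zero, and the choice to update H before W within each pass are all assumptions.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frobenius_error(A, W, H):
    """Squared Frobenius norm of A - WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))


def nmf(A, r, iterations=200, eps=1e-9, seed=0):
    """Factor the non-negative m x n matrix A into W (m x r) and H (r x n)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() + eps for _ in range(r)] for _ in range(m)]
    H = [[rng.random() + eps for _ in range(n)] for _ in range(r)]
    for _ in range(iterations):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)] for a in range(r)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)] for i in range(m)]
    return W, H


A_demo = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [1.0, 1.0, 3.0]]
W, H = nmf(A_demo, r=2)
print(frobenius_error(A_demo, W, H))
```

Because every update multiplies a non-negative entry by a non-negative ratio, W and H stay non-negative throughout, which is the property the sentence selection step relies on.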

For reference, a column vector $A_{\cdot j}$ of the matrix A is a linear combination of the column vectors $W_{\cdot l}$, weighted by the elements $H_{lj}$ of the semantic variable matrix H, as expressed in Equation 9 below.

Here, the column vector $A_{\cdot j}$ is the column vector of the j-th sentence of the matrix A, and the column vector $W_{\cdot l}$ is the l-th semantic feature column vector of the matrix W.
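Equation 9 is likewise missing from this text. Reading the factorization of A into W and H column by column gives the linear combination described above, so the equation presumably has the following form, with l running over the r semantic features. The approximation sign is an assumption that follows from H being described as an approximate factor of A.

```latex
A_{\cdot j} \approx \sum_{l=1}^{r} H_{lj}\, W_{\cdot l}
```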

The sentence selection means (133) selects and extracts the sentence corresponding to the sentence vector $A_{\cdot q}$ of the matrix A, where q is the column holding the largest element of the row vector $H_{p\cdot}$ in the p-th row of the matrix H, and adds it to the summary.

More specifically, the sum of the elements of each row vector $H_{p\cdot}$ is calculated using Equation 10 below.

The k row vectors $H_{p\cdot}$ are then selected in descending order of their element sums.

This selection and extraction of sentences is repeated as many times as the number k of sentences to be summarized. If fewer than k sentences have been selected, the column vector ($W_{\cdot p}$) with the next highest similarity is found and sentences are again selected and extracted.

In this embodiment, the similarity between the column vectors of the non-negative semantic feature matrix W and the topic is calculated and the column vector ($W_{\cdot p}$) with the p-th highest similarity is selected, but the present invention is not limited thereto.
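The row-sum ranking and arg-max selection described above can be sketched as follows. Equation 10 is not reproduced in this text; the sketch takes it to be the plain sum of the entries of each row of H, as the prose states. The function name, the skipping of an already-chosen sentence, and the return of sentences in document order are assumptions, and the similarity-based fallback for the case of fewer than k selections is omitted because the text does not define the similarity measure.

```python
def select_sentences(H, sentences, k):
    """Pick up to k sentences using the rows of the semantic variable matrix H."""
    # Sum of the elements of each row vector H_p.
    row_sums = [sum(row) for row in H]
    # Rows ordered by descending element sum.
    order = sorted(range(len(H)), key=lambda p: row_sums[p], reverse=True)
    chosen = []
    for p in order:
        if len(chosen) == k:
            break
        # Column q holding the largest element of row p identifies the sentence.
        q = max(range(len(H[p])), key=lambda j: H[p][j])
        if q not in chosen:  # assumption: do not add the same sentence twice
            chosen.append(q)
    # Assumption: present the summary in original document order.
    return [sentences[q] for q in sorted(chosen)]


H_demo = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.3], [0.2, 0.1, 0.1]]
summary = select_sentences(H_demo, ["s0", "s1", "s2"], k=2)
print(summary)
```

In this toy input the second row has the largest sum and peaks at column 0, and the first row comes next and peaks at column 1, so sentences 0 and 1 are selected.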

The document summary generation means (134) generates the summary (S) of the document (D) from the set of sentences selected by the sentence selection means (133).

Because the semantic features used for sentence extraction have non-negative values, this approach summarizes documents more accurately than latent semantic analysis.

For reference, every sentence can be represented by the semantic variables. Intuitively, it is more meaningful for each sentence to be associated with a small subset of a broad array of topics than with just one topic or with all topics. Accordingly, non-negative matrix factorization clusters semantically related terms into each semantic feature. Because semantically related terms are grouped into semantic features in this way, non-negative matrix factorization can be used to distinguish homonyms by their context.

The output unit (140) outputs the summary (S) produced by the document summarization unit (130) so that the user can check it. In this embodiment it includes a monitor, a printer, or the like, but the present invention is not limited thereto.

In this way, a large document can be summarized as a whole by extracting sentences through non-negative matrix factorization.

이하, 상술한 바와 같은 구성으로 이루어진 응용소프트웨어를 통해 비음수 행렬 인수분해를 이용한 문서요약 방법에 대하여 도 2 내지 도 4 를 참조하여 살펴보면 다음과 같다. Hereinafter, a document summary method using non-negative matrix factorization through application software having the above-described configuration will be described with reference to FIGS. 2 to 4.

도 2 는 본 발명의 일실시예에 따른 비음수 행렬 인수분해를 이용한 문서요약 방법을 나타내는 흐름도이고, 도 3 은 본 발명의 일실시예에 따른 비음수 행렬 인수분해 단계를 나타내는 상세 흐름도이며, 도 4 는 본 발명의 일실시예에 따른 문장선택 단계를 나타내는 상세 흐름도이다.2 is a flowchart illustrating a document summary method using non-negative matrix factorization according to an embodiment of the present invention, and FIG. 3 is a detailed flowchart illustrating a non-negative matrix factorization step according to an embodiment of the present invention. 4 is a detailed flowchart illustrating a sentence selection step according to an embodiment of the present invention.

First, as shown in FIG. 2, the preprocessor (120) decomposes the document (D) to be summarized into individual sentences and extracts them (S10).

At this time, the preprocessor receives from the user the document to be summarized and a setting signal for that document. In this embodiment the setting signal includes the number of sentences to be summarized, the unnecessary strings, the types of roots, and the like, but the present invention is not limited thereto.

The preprocessor (120) sets the number of sentences to be summarized using the decomposed and extracted sentences (S20).

Here, the number of sentences to be summarized is denoted k, but the present invention is not limited thereto.

Next, the preprocessor (120) removes meaningless stopwords from each decomposed sentence and extracts the roots contained in each sentence (S30).
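Steps S10 and S30 can be illustrated with a minimal sketch. The stopword list, the sentence-boundary rule and the suffix-stripping stemmer below are hypothetical stand-ins chosen for illustration; the description does not specify which stopword list or root extractor is used.

```python
import re

# Hypothetical stopword list; the description does not enumerate one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}


def split_sentences(document):
    """S10 (assumed rule): split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def crude_stem(word):
    """Toy suffix stripper standing in for a real root extractor."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(sentence):
    """S30: lowercase, tokenize, drop stopwords, then reduce each token to a root."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]
```

A production system would likely replace `crude_stem` with an established stemmer or a morphological analyzer suited to the document's language.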

The preprocessor (120) generates, from the decomposed sentences, a term-frequency vector based on the frequency of use of each term, as in Equation 1 (S40).
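As a rough illustration of step S40, the sketch below counts how often each vocabulary term occurs in a tokenized sentence. Equation 1 is not reproduced in this part of the description, so the raw-count form and the helper names `build_vocabulary` and `term_frequency_vector` are assumptions.

```python
from collections import Counter


def build_vocabulary(tokenized_sentences):
    """Collect every distinct term across all sentences, in a fixed (sorted) order."""
    return sorted({term for sentence in tokenized_sentences for term in sentence})


def term_frequency_vector(tokens, vocabulary):
    """Return one count per vocabulary term for a single sentence (assumed raw counts)."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```

For example, calling `term_frequency_vector` once per sentence with a shared vocabulary yields vectors of equal length that can later be stacked into a matrix.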

Using the term-frequency vector, the preprocessor (120) calculates a weight, comprising a local weight and a global weight, for each decomposed sentence according to Equation 4 (S50).

The preprocessor (120) generates a term-sentence matrix (A) that consists of m terms and n sentences and reflects the weights calculated through Equation 4 (S60).
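Steps S50 and S60 might look like the sketch below. Equation 4 is not reproduced in this part of the description, so the particular weights are an assumption: the local weight is taken to be the raw term count and the global weight the logarithm of (number of sentences / number of sentences containing the term), which is one common choice and keeps every entry non-negative. The function name is hypothetical.

```python
import math


def term_sentence_matrix(tf_vectors):
    """Build the m x n matrix A from n term-frequency vectors of length m.

    tf_vectors[j][i] is the count of term i in sentence j.
    Assumed weighting: A[i][j] = local(i, j) * global(i),
    with local = raw count and global = log(n / df_i).
    """
    n = len(tf_vectors)
    m = len(tf_vectors[0])
    A = []
    for i in range(m):
        # df: number of sentences in which term i occurs at least once
        df = sum(1 for vec in tf_vectors if vec[i] > 0)
        global_weight = math.log(n / df) if df else 0.0
        A.append([vec[i] * global_weight for vec in tf_vectors])
    return A
```

Under this assumed weighting, a term that occurs in every sentence receives a global weight of zero and so contributes nothing to A.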

Next, the document summary unit (130) factorizes the generated term-sentence matrix (A) using non-negative matrices (S70).

More specifically, as shown in FIG. 3, in step S70 the matrix is factorized using the non-negative semantic feature matrix W and the non-negative semantic variable matrix H (S71), after which the document summary unit (130) determines whether the non-negative matrices W and H minimize the value of ‖A − WH‖² (S72).

If the determination in step S72 finds that the value is minimized, the document summary unit (130) calculates the values of the non-negative matrices W and H (S73) and updates them simultaneously (S74).

If, on the other hand, the determination in step S72 finds that the value is not minimized, the document summary unit (130) applies Equations 7 and 8 so that the non-negative matrices W and H minimize it (S75).
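Steps S71 to S75 can be sketched as follows. Equations 7 and 8 are not reproduced in this part of the description, so the sketch assumes they are the widely used multiplicative update rules for reducing ‖A − WH‖². The function names, the random initialization, the fixed iteration count (used here instead of an explicit convergence test) and the small `eps` guard against division by zero are all illustrative assumptions. Within each pass the sketch updates H first and then W.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of X (p x q) and Y (q x s)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def squared_error(A, W, H):
    """Squared Frobenius norm of A - WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))


def nmf(A, r, iterations=200, eps=1e-9, seed=0):
    """Factor a non-negative m x n matrix A into W (m x r) and H (r x n)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() for _ in range(r)] for _ in range(m)]
    H = [[rng.random() for _ in range(n)] for _ in range(r)]
    for _ in range(iterations):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
             for a in range(r)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)]
             for i in range(m)]
    return W, H
```

Because every update multiplies a non-negative entry by a non-negative ratio, W and H stay non-negative throughout, which is the property the method relies on.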

Next, for the row vector H_p· in the p-th row of the matrix H, the document summary unit (130) finds the column q that holds the largest element of that row and selects the sentence corresponding to the sentence vector A_·q in the same column of the matrix A (S80).

More specifically, referring to FIG. 4, in step S80 the document summary unit (130) calculates the sum of the elements of the row vector H_p· for each row vector using Equation 10 (S81).

The document summary unit (130) then selects and extracts the k row vectors H_p· with the largest element sums (S82).

The document summary unit (130) determines whether the number of selected and extracted sentences equals k, the number of sentences to be summarized (S83). If it does, the document summary unit (130) selects, for each of the k selected row vectors, the sentence corresponding to the sentence vector A_·q of the matrix A in the same column as column q, which holds the largest element of that row (S84).

The document summary unit (130) outputs the selected sentences as the summary (S) of the document (D) (S90).

If the number selected is less than k, the document summary unit (130) finds the column vector W_·p of the non-negative semantic feature matrix W that has the greatest similarity to the topic, and repeatedly selects and extracts sentences (S85).
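The selection in steps S81 to S84 can be sketched as below, assuming Equation 10 is the plain sum of the elements of each row of H (the equation itself is not reproduced in this part of the description). The function name is hypothetical, and returning the indices in document order is a design choice of this sketch. The fallback of step S85, which relies on similarity between the columns of W and the topic, is not implemented here.

```python
def select_sentences(H, k):
    """Pick sentence indices from the r x n matrix H.

    S81: sum the elements of each row.
    S82: keep the k rows with the largest sums.
    S84: for each kept row, take the column holding its largest element.
    """
    row_sums = [sum(row) for row in H]
    top_rows = sorted(range(len(H)), key=lambda p: row_sums[p], reverse=True)[:k]
    selected = []
    for p in top_rows:
        q = max(range(len(H[p])), key=lambda j: H[p][j])
        if q not in selected:  # two rows may point at the same sentence
            selected.append(q)
    return sorted(selected)
```

When two of the chosen rows peak at the same column, this sketch returns fewer than k indices; that is the situation the description handles with step S85.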

[Experimental Example]

To verify the performance of the document summarization method proposed in the present invention, 129 articles were extracted at random and used as experimental data. Performance was evaluated with precision, recall, and F-measure, which are the measures chiefly used in document summarization.

The experimental data actually used for evaluation were summarized manually by three evaluators, each selecting 3.8 sentences; the 81 documents containing sentences selected in common by at least two of the evaluators were used.

The following measures were used for performance evaluation.

<Recall>

<Precision>

<F-measure>

Here, S_man denotes the sentences selected by the human evaluators, and S_sum denotes the sentences selected by the method proposed in the present invention. The following table compares a document summarization method based on latent semantic analysis with the method proposed in the present invention, and shows that the summarization performance of the present invention is superior to that of latent semantic analysis.

Category     LSA     NMF
Precision    0.445   0.700
Recall       0.340   0.650
F-measure    0.405   0.680

According to the F values in Table 1, the non-negative matrix factorization (NMF) summarization method according to the present invention outperforms the method using latent semantic analysis (LSA) by about 0.275.
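The evaluation measures used in this experiment can be sketched as follows. The formulas themselves did not survive in this text, so the sketch assumes the standard set-based definitions: precision is the overlap divided by the number of system-selected sentences, recall is the overlap divided by the number of human-selected sentences, and F-measure is their harmonic mean. The function name `evaluate` is hypothetical.

```python
def evaluate(s_man, s_sum):
    """Return (precision, recall, f_measure) for one document.

    s_man: sentence indices chosen by human evaluators.
    s_sum: sentence indices chosen by the summarizer.
    """
    s_man, s_sum = set(s_man), set(s_sum)
    overlap = len(s_man & s_sum)
    precision = overlap / len(s_sum) if s_sum else 0.0
    recall = overlap / len(s_man) if s_man else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Scores averaged over many documents need not satisfy the harmonic-mean relation exactly, since the average of per-document F values differs from the F value of the averaged precision and recall.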

As the experiment shows, the proposed method outperforms the latent semantic analysis approach because it processes documents in a manner similar to human cognition, which relies on non-negative values and partial information. In addition, as Equations 6, 7, and 8 above show, sentences can be extracted at low cost.

Although the present invention has been described and illustrated with reference to a preferred embodiment that exemplifies its technical idea, the invention is not limited to the configuration and operation exactly as illustrated and described, and those skilled in the art will understand that many changes and modifications can be made to it without departing from the scope of the technical idea. Accordingly, all such appropriate changes, modifications, and equivalents should be regarded as falling within the scope of the present invention.

According to the present invention as described above, using the non-negative semantic variable matrix extracted through non-negative matrix factorization makes it possible to summarize document content accurately.

In addition, because the document is summarized by unsupervised learning under non-negativity constraints similar to the human cognitive process, no training sentences prepared in advance by experts are needed, and the summary can be extracted easily at low computational cost.

Claims

A document summarization apparatus using non-negative matrix factorization, comprising:

a preprocessor (120) which, as a preliminary step to summarizing a document (D), decomposes the given document into individual sentences, removes unnecessary strings, extracts roots, and calculates weights for the sentences, so that the document can be summarized on the basis of non-negative matrix factorization; and

a document summary unit (130) which summarizes the entire document by non-negative matrix factorization of a matrix A.

The apparatus of claim 1, further comprising:

an input unit (110) which receives from a user the document to be summarized and a setting signal for the document to be summarized; and

an output unit (140) which outputs the summary produced by the document summary unit so that the user can check it.

The apparatus of claim 1,

wherein the preprocessor (120) comprises:

sentence decomposition means (121) for decomposing the document into individual sentences;

extraction sentence setting means (122) for setting the number of sentences to be extracted;

stopword removal means (123) for removing stopwords from each of the sentences decomposed by the sentence decomposition means;

root extracting means (124) for extracting roots from the sentences;

term-frequency vector generating means (125) for generating, for each sentence, a term-frequency vector according to the frequency of use of the terms remaining after the stopwords are removed;

weight calculation means (126) for calculating a weight for each sentence using the term-frequency vector; and

term-sentence matrix generating means (127) for generating the matrix A composed of the terms and the sentences of the document.

The apparatus of claim 1,

wherein the document summary unit (130) comprises:

factoring means (131) for factoring the matrix A using non-negative matrices W and H;

non-negative matrix calculation means (132) for calculating the non-negative matrices W and H;

sentence selection means (133) for selecting and extracting, for the row vector H_p· in the p-th row of the matrix H, the sentence corresponding to the sentence vector A_·q of the matrix A in the same column as column q, which holds the largest element of that row vector, and adding it to the summary (S); and

document summary generating means (134) for generating the summary of the document from the set of sentences selected by the sentence selection means.

The apparatus of claim 1 or 3,

wherein the weight comprises a local weight and a global weight.

The apparatus of claim 4,

wherein the factoring means (131) factorizes the matrix A using a non-negative semantic feature matrix W and a non-negative semantic variable matrix H,

the matrix W being a matrix specified by the number of terms and the number of semantic features, and the matrix H being a matrix obtained by approximate factorization of the matrix A.

The apparatus of claim 4,

wherein the non-negative matrices W and H are such that the value of ‖A − WH‖² is minimized.

The apparatus of claim 4,

wherein the sentence selection means (133) selects the row vectors H_p· in descending order of the sum of their elements.

A document summarization method using non-negative matrix factorization, comprising:

(a) decomposing a document (D) to be summarized into individual sentences, extracting them, and setting the number (k) of sentences to be summarized;

(b) removing meaningless stopwords from the sentences and extracting the roots contained in the sentences;

(c) generating a term-frequency vector according to the frequency of use of the terms in the sentences;

(d) calculating, using the term-frequency vector, a weight for each sentence comprising a local weight and a global weight;

(e) generating a term-sentence matrix (A) consisting of m terms and n sentences;

(f) factoring the term-sentence matrix using non-negative matrices;

(g) selecting, for the row vector H_p· in the p-th row of a matrix H, the sentence corresponding to the sentence vector A_·q of the matrix A in the same column as column q, which holds the largest element of that row vector; and

(h) outputting a summary (S) of the document using the sentence selected in step (g).

The method of claim 9, further comprising,

before step (a),

(a-1) receiving, from a user, the document to be summarized and a setting signal that includes the number of sentences to be summarized, the unnecessary strings, and the types of roots.

The method of claim 9,

wherein step (f) comprises:

(f-1) factoring the term-sentence matrix using a non-negative semantic feature matrix W and a non-negative semantic variable matrix H;

(f-2) determining whether the non-negative matrices W and H minimize the value of ‖A − WH‖²; and

(f-3) calculating the values of the non-negative matrices W and H when the determination in step (f-2) finds that the value is minimized, and updating them simultaneously.

The method of claim 9,

wherein step (g) comprises:

(g-1) calculating the sum of the elements of each row vector H_p·;

(g-2) selecting and extracting k row vectors H_p·;

(g-3) determining whether the number of selected and extracted sentences equals the number (k) of sentences to be summarized and, if it does, selecting, for each of the k selected row vectors, the sentence corresponding to the sentence vector A_·q of the matrix A in the same column as column q, which holds the largest element of that row; and

(g-4) when the determination in step (g-3) finds that the number selected is less than k, calculating the similarity between the column vectors of the non-negative semantic feature matrix W and the topic, finding the column vector W_·p with the highest similarity, and repeatedly selecting and extracting sentences.

The method of claim 12,

wherein step (g-4) is repeated as many times as the number (k) of sentences to be extracted.