KR20130127804A

KR20130127804A - Apparatus and method for document summary based on query

Info

Publication number: KR20130127804A
Application number: KR1020120051587A
Authority: KR
Inventors: 박선; 양후열; 조광문; 오일환; 조지우; 정민아; 이성로
Original assignee: 목포대학교산학협력단
Priority date: 2012-05-15
Filing date: 2012-05-15
Publication date: 2013-11-25

Abstract

The present invention discloses a query-based document summarization apparatus and method. The apparatus includes a preprocessor that splits an input document into sentences and generates a term-sentence matrix; a query expander that expands a user query using pseudo-relevance feedback; a weight calculator that recalculates term weights using semantic features obtained by non-negative matrix factorization; and a summarizer that generates a summary of the input document using the expanded user query and the term-sentence matrix to which the recalculated term weights are applied. [Reference numerals] (110) Preprocessor; (120) Query expander; (130) Weight calculator; (140) Sentence summarizer

Description

Apparatus and Method for Document Summary based on Query

The present invention relates to summary generation techniques, and more particularly to a query-based document summarization apparatus and method capable of generating a summary that reflects the user's intent.

Recently, with the growth of the Internet, the amount of information has continued to increase explosively, and advances in information communication and portable terminals have made this information easily accessible from anywhere.

However, the sheer volume of information can also make it hard for users to reach the information they want. The need for summaries that let users easily check the desired information is therefore growing, and a great deal of research on document summarization is under way.

Document summarization can be classified by the purpose of the summary as follows. First, generic (comprehensive) summarization summarizes a document so that its overall content can be understood. Second, query-based summarization summarizes a document with respect to the user's query. Third, multi-document summarization summarizes content on the same topic from several documents, while single-document summarization summarizes the relevant topics from one document. Finally, personalized summarization tailors the summary to an individual's characteristics using the user's Internet logs and specific information related to the user's needs. Document summarization can also be classified by the technique used: statistical methods, graph-based methods, linguistics-based methods, semantic-information-based methods, external-resource-based methods, and other hybrid methods.

The present invention has been devised against the technical background described above. Its object is to provide a query-based document summarization apparatus and method that extract sentences matching the user's intent from an input document, using semantic features extracted by non-negative matrix factorization and pseudo-relevance feedback, and generate a summary from the extracted sentences.

A query-based document summarization apparatus according to one aspect of the present invention comprises: a preprocessor that splits an input document into sentences and generates a term-sentence matrix; a query expander that expands a user query using pseudo-relevance feedback; a weight calculator that recalculates term weights using semantic features obtained by non-negative matrix factorization; and a summarizer that generates a summary of the input document using the expanded user query and the term-sentence matrix to which the recalculated term weights are applied.

According to the present invention, a summary is generated by extracting sentences that match the user's intent from the input document, using a query expanded by pseudo-relevance feedback and term weights recalculated by non-negative matrix factorization. A summary that represents the characteristics of the input document well can therefore be generated, and the quality of the generated summary can be improved.

FIG. 1 is a block diagram showing a query-based document summarization apparatus according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the invention pertains; the invention is defined only by the scope of the claims. The terminology used in this specification is for describing the embodiments and is not intended to limit the invention. In this specification, the singular form includes the plural form unless the wording specifically states otherwise. As used in this specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. FIG. 1 is a block diagram showing a query-based document summarization apparatus according to an embodiment of the present invention.

As shown in FIG. 1, the query-based document summarization apparatus 10 according to an embodiment of the present invention includes a preprocessor 110, a query expander 120, a weight calculator 130, and a sentence summarizer 140.

The preprocessor 110 performs sentence segmentation, stop-word removal, stemming, and term-sentence matrix generation on the input document.

Specifically, the preprocessor 110 may remove stop words from the input document using Rijsbergen's stop-word list and a stop-word removal method based on lexical analysis.

The preprocessor 110 may also extract word stems from the input document using the Porter stemming algorithm.

The preprocessor 110 performs the operations above when it receives an English document; when it receives a Korean document, it may preprocess the input document using a morphological analysis tool.
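
The preprocessing described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the tiny `STOP_WORDS` set stands in for Rijsbergen's list, stemming is omitted, and raw term counts are assumed as matrix entries because the text does not state the weighting used in the term-sentence matrix. All function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the text names Rijsbergen's list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "on", "for"}

def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Lowercase, keep alphabetic tokens, drop stop words (stemming is omitted here)."""
    return [t for t in re.findall(r"[a-z]+", sentence.lower()) if t not in STOP_WORDS]

def term_sentence_matrix(text):
    """Return (terms, sentences, D) where D[i][j] counts term i in sentence j."""
    sentences = split_sentences(text)
    counts = [Counter(tokenize(s)) for s in sentences]
    terms = sorted(set().union(*counts)) if counts else []
    D = [[c[t] for c in counts] for t in terms]
    return terms, sentences, D
```

Rows of `D` correspond to terms and columns to sentences, matching the term-sentence orientation used in the rest of the description.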

The query expander 120 expands the received user query using pseudo-relevance feedback.

Relevance feedback requires the user to keep checking the results, which imposes a heavy processing burden, whereas pseudo-relevance feedback minimizes user intervention and is therefore widely used in recent automated information retrieval systems. Accordingly, the query expander 120 of the present invention also applies pseudo-relevance feedback to reduce the burden on the user.

Specifically, pseudo-relevance feedback expands the query using the top k sentences, ranked in descending order of similarity to the user query.

If the number k of sentences used for query expansion is too large, the summary becomes too broad and vague relative to the topic the user wants; if k is too small, the summary may become too narrow relative to the topic the user requests. Choosing an appropriate number k of sentences is therefore important in pseudo-relevance feedback.
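
As a rough illustration of expanding a query with the top k sentences, the Python sketch below ranks sentence vectors by cosine similarity to the query and adds the mean of the top k vectors to the query. The text does not give the expansion formula, so the averaging step, the use of cosine similarity for ranking, and the names `cosine` and `expand_query` are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_query(q, D, k):
    """q: query term vector; D: term-sentence matrix (rows = terms); k: feedback size.

    Returns the expanded query and the indices of the k sentences used.
    """
    columns = [list(col) for col in zip(*D)]  # one vector per sentence
    ranked = sorted(range(len(columns)),
                    key=lambda j: cosine(q, columns[j]), reverse=True)
    top = ranked[:k]
    if not top:
        return list(q), top
    # Assumed expansion rule: add the mean of the top-k sentence vectors.
    expanded = [qi + sum(columns[j][i] for j in top) / len(top)
                for i, qi in enumerate(q)]
    return expanded, top
```

With a small k the expanded query stays close to the original terms; with a large k it absorbs terms from loosely related sentences, which mirrors the trade-off described above.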

Unlike conventional relevance feedback, however, pseudo-relevance feedback cannot identify non-relevant documents, so it uses the number of relevance-feedback items calculated by Equation 1 below.

The weight calculator 130 applies the weight matrix W to the term-sentence matrix D, as in Equation 2 below, to compute a term-sentence matrix in which the term weights are recalculated.

Here, the weight matrix is a diagonal matrix whose entries are the sums of the non-negative-matrix-factorized semantic features corresponding to each term, and it can be expressed as in Equation 3 below.

In Equation 3, w_i is the sum of semantic features used to calculate the weight of a term, and it can be computed by the non-negative matrix factorization algorithm as in Equation 4 below.
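
The reweighting can be sketched in Python under two stated assumptions, since Equations 2 to 4 are not reproduced in this text: that w_i is the sum of row i of the m×r semantic feature matrix produced by the factorization, and that applying the diagonal weight matrix means left-multiplying D by diag(w_1, ..., w_m). The names `term_weights`, `reweight`, and `W_feat` are hypothetical.

```python
def term_weights(W_feat):
    """Assumed form of w_i: the sum of row i of the m x r semantic feature matrix."""
    return [sum(row) for row in W_feat]

def reweight(D, w):
    """Scale row i of D by w[i]; this equals diag(w) multiplied by D."""
    return [[w_i * d for d in row] for w_i, row in zip(w, D)]

# Example: two terms, two semantic features, two sentences.
W_feat = [[0.5, 0.5], [2.0, 0.0]]
D = [[1, 2], [3, 4]]
w = term_weights(W_feat)      # [1.0, 2.0]
D_reweighted = reweight(D, w) # [[1.0, 2.0], [6.0, 8.0]]
```

Because the weight matrix is diagonal, the operation only rescales each term's row and never mixes terms.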

Specifically, the sum of semantic features w_i can be obtained by updating the values of the semantic feature matrices W and H together, as in Equations 5 and 6, until the Euclidean distance J converges close to 0.

In other words, Equation 5 decomposes the matrix A into a non-negative m×r matrix W and a non-negative r×n matrix H. Here, A is an m×n matrix consisting of m pixels and n images, and r is the number of semantic features, set to the number of training images used for image learning.
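
Equations 5 and 6 are not reproduced in this text. As an illustration only, the Python sketch below uses the widely known Lee and Seung multiplicative update rules for the squared Euclidean distance between A and WH, which may differ in detail from the patent's equations. The function names, the random initialization, the fixed iteration count, and the small `eps` guard against division by zero are assumptions.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of X (p x q) and Y (q x s)."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def squared_distance(A, W, H):
    """Squared Euclidean (Frobenius) distance between A and W @ H."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))

def nmf(A, r, iters=200, seed=0, eps=1e-9):
    """Factor non-negative A (m x n) into W (m x r) and H (r x n)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(r)]
    errors = [squared_distance(A, W, H)]
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * x / (y + eps) for h, x, y in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * x / (y + eps) for w, x, y in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
        errors.append(squared_distance(A, W, H))
    return W, H, errors
```

In practice the loop would stop once the change in distance falls below a tolerance rather than after a fixed number of iterations; the fixed count keeps the sketch short.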

The sentence summarizer 140 generates a summary by extracting sentences using the query expanded by pseudo-relevance feedback and the term-sentence matrix with recalculated term weights. Specifically, it computes the cosine similarity between the expanded query and the reweighted term-sentence matrix, and extracts the sentences with the highest cosine similarity to form the summary.
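
The sentence extraction step can be sketched as follows. The sketch assumes the query and the matrix columns are plain numeric vectors over the same terms, and it returns the selected sentences in document order, which is a design choice not stated in the text. The names `cosine` and `summarize` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def summarize(query, D, sentences, top_n):
    """Pick the top_n sentences whose columns in D are most similar to the query."""
    columns = [list(col) for col in zip(*D)]
    scores = [cosine(query, col) for col in columns]
    best = sorted(range(len(sentences)),
                  key=lambda j: scores[j], reverse=True)[:top_n]
    # Return the chosen sentences in their original document order.
    return [sentences[j] for j in sorted(best)]
```

Here `query` would be the expanded query vector and `D` the reweighted term-sentence matrix, with `top_n` playing the role of the preset number of sentences.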

In summary, the query-based document summarization apparatus 10 according to an embodiment of the present invention performs a preprocessing step, a query expansion step, a term weight recalculation step, and a summary generation step based on sentence extraction.

First, in the preprocessing step, the apparatus 10 splits the input document into sentences and then extracts terms to build the term-sentence matrix.

Second, in the query expansion step, the apparatus 10 expands the input user query using pseudo-relevance feedback.

Third, in the weight calculation step, the apparatus 10 computes a term-sentence matrix in which the term weights are recalculated using the semantic features obtained by non-negative matrix factorization.

Finally, in the document summarization step, the apparatus 10 determines the similarity between the expanded query and the term-sentence matrix to which the recalculated term weights are applied, and generates the summary using a preset number of top-ranked sentences with the highest similarity.

The configuration of the present invention has been described in detail above with reference to the accompanying drawings, but this is merely illustrative, and those of ordinary skill in the art to which the invention pertains can of course make various modifications and changes within the scope of its technical idea. The scope of protection of the present invention should therefore not be limited to the embodiments described above but should be determined by the following claims.

Claims

A query-based document summarization apparatus comprising:
a preprocessor that splits an input document into sentences to generate a term-sentence matrix;
a query expander that expands a user query using pseudo-relevance feedback;
a weight calculator that recalculates term weights using semantic features obtained by non-negative matrix factorization; and
a summarizer that generates a summary of the input document using the expanded user query and the term-sentence matrix to which the recalculated term weights are applied.

The apparatus of claim 1, wherein the summarizer
determines the similarity between the expanded user query and the term-sentence matrix to which the recalculated term weights are applied, and generates the summary using a preset number of top-ranked sentences having the highest similarity.