KR20210146833A

KR20210146833A - Apparatus and method for providing summary of document based on genetic algorithm

Info

Publication number: KR20210146833A
Application number: KR1020210068668A
Authority: KR
Inventors: 정치훈
Original assignee: 정치훈
Priority date: 2020-05-27
Filing date: 2021-05-27
Publication date: 2021-12-06
Also published as: KR102565149B1

Abstract

The present disclosure relates to an apparatus and a method for providing a summary of a document based on a genetic algorithm, which improve the processing speed of the genetic algorithm while enabling appropriate selection of summary solutions. According to an embodiment of the present disclosure, an apparatus for providing a summary of a document based on a genetic algorithm includes: a communication circuit that receives an original document from the outside; a memory that stores the original document; and a processor electrically connected to the communication circuit and the memory. Using the genetic algorithm, the processor generates a solution that includes a plurality of genes, each corresponding to one of the sentences to be analyzed in the original document, and that represents the subset of those sentences selected for generating the summary, such that the solution satisfies a preset summary length condition. The processor calculates a coverage for the solution by using a first objective function associated with the similarity between the sentences to be analyzed and the selected sentences corresponding to the solution, calculates a diversity for the solution by using a second objective function associated with the similarity among the selected sentences, and provides at least one summary corresponding to each of at least one solution that satisfies a specified optimization condition based on the coverage and the diversity of each of a plurality of solutions generated through the genetic algorithm.

Description

APPARATUS AND METHOD FOR PROVIDING SUMMARY OF DOCUMENT BASED ON GENETIC ALGORITHM

Embodiments disclosed in this document relate to an apparatus and method for providing a summary of a document using a genetic algorithm.

With the development of the ICT field, the amount of information shared among people has increased greatly and rapidly. Technological progress has made it easier for people to share information, but, ironically, individuals now spend more time filtering out incorrect or false information and finding appropriate content from high-quality information sources. Various automation technologies have been researched and developed to reduce this wasted time, but they have not sufficiently relieved the burden on users.

A system that provides summaries of textual information may be required in order to quickly filter out unnecessary information and find suitable content. By using summaries produced by an automated method, a user can sift through the flood of information more effectively.

Various algorithms may be used to extract a summary of a document, and various techniques may be used to evaluate the extracted summary. When a genetic algorithm is used to extract a summary, the number of solutions to be considered inevitably becomes excessive, so a very long processing time may be required. In addition, evaluating a summary requires setting factors that clearly reflect the quality of the summary and considering those factors in a balanced way.

Embodiments of the present invention provide an apparatus and method that improve the processing speed of a genetic algorithm for generating a summary and enable appropriate selection of summary solutions.

An apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment disclosed in this document includes a communication circuit that receives an original document from the outside, a memory that stores the original document, and a processor electrically connected to the communication circuit and the memory. Using the genetic algorithm, the processor may generate a solution that includes a plurality of genes corresponding to the respective analysis target sentences included in the original document and that represents the sentences selected from the analysis target sentences for generating a summary, such that the solution satisfies a preset summary length condition; calculate a coverage for the solution by using a first objective function associated with the similarity between the analysis target sentences included in the original document and the selected sentences corresponding to the solution; calculate a diversity for the solution by using a second objective function associated with the similarity among the selected sentences corresponding to the solution; and provide one or more summaries corresponding to each of one or more solutions that satisfy a specified optimization condition, based on the coverage and the diversity of each of a plurality of solutions generated by the genetic algorithm.

According to an embodiment, when initializing a solution using the genetic algorithm, the processor may set the values of the plurality of genes included in the solution sequentially, and may adjust the probability value used to determine the values of the plurality of genes based on the preset summary length condition.

According to an embodiment, when mutating a solution using the genetic algorithm, the processor may determine whether to mutate the plurality of genes included in the solution based on the preset summary length condition and length information of the summary corresponding to the solution.

According to an embodiment, when performing crossover between the solution and another solution using the genetic algorithm, the processor may determine whether to perform the crossover based on length information of the summary corresponding to the genes selected from the solution and length information of the summary corresponding to the genes selected from the other solution.

According to an embodiment, the coverage may be calculated based on the similarity between a vector corresponding to the analysis target sentences and a vector corresponding to the selected sentences, and on the similarity between the vector corresponding to the analysis target sentences and each of a plurality of sentence vectors corresponding to the respective selected sentences.
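
The two similarity terms named above can be illustrated with a minimal sketch. This passage does not give the exact formula, so the following are assumptions made only for illustration: the similarity measure is cosine similarity, the document vector and the summary vector are means of sentence vectors, and the two terms are combined by simple addition. The names `cosine`, `mean_vector`, and `coverage` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def coverage(solution: List[int], sentence_vectors: List[Vector]) -> float:
    """Assumed coverage: document-vs-summary similarity plus the mean
    similarity between the document vector and each selected sentence."""
    selected = [v for gene, v in zip(solution, sentence_vectors) if gene == 1]
    if not selected:
        return 0.0  # assumption: an empty summary covers nothing
    doc_vec = mean_vector(sentence_vectors)
    summary_vec = mean_vector(selected)
    per_sentence = sum(cosine(doc_vec, v) for v in selected) / len(selected)
    return cosine(doc_vec, summary_vec) + per_sentence
```

Under these assumptions the score ranges up to 2.0, and a selection whose vectors point in the same direction as the document vector scores highest.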

According to an embodiment, the coverage may be calculated based on the importance of words included in the selected sentences, where the importance is calculated based on a pre-stored knowledge-based database.

According to an embodiment, the coverage may be calculated based on the number of words included in topic keywords corresponding to the selected sentences and the distribution density of the words included in the topic keywords.

According to an embodiment, the diversity may increase as the similarity among the selected sentences decreases.
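
One simple function with this property is sketched below. It is not taken from the source: it assumes cosine similarity over sentence vectors and defines diversity as one minus the mean pairwise similarity of the selected sentences. Returning 1.0 when fewer than two sentences are selected is also an assumption, and the names `cosine` and `diversity` are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def diversity(solution: List[int], sentence_vectors: List[Vector]) -> float:
    """Assumed diversity: 1 minus the mean pairwise cosine similarity
    among the selected sentences, so lower similarity gives a higher score."""
    selected = [v for gene, v in zip(solution, sentence_vectors) if gene == 1]
    pairs = list(combinations(selected, 2))
    if not pairs:
        return 1.0  # assumption: no pair exists, so nothing is redundant
    mean_similarity = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - mean_similarity
```

With this definition, two identical sentences yield a diversity of 0, and two sentences with orthogonal vectors yield a diversity of 1.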

According to an embodiment, the processor may obtain one or more solutions whose coverage and diversity satisfy a specified condition based on Pareto optimality, and may generate one or more summaries corresponding to each of the one or more solutions.
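
As an illustration of a Pareto-based condition, the sketch below keeps every solution that no other solution dominates. It assumes that both coverage and diversity are to be maximized. The source does not state what the "specified condition" is beyond Pareto optimality, so selecting the whole non-dominated front is an assumption, and the names `dominates` and `pareto_front` are hypothetical.

```python
from typing import List, Tuple

Score = Tuple[float, float]  # (coverage, diversity), both to be maximized


def dominates(a: Score, b: Score) -> bool:
    """True if a is at least as good as b on both objectives
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Score]) -> List[int]:
    """Return the indices of the solutions that no other solution dominates."""
    return [
        i
        for i, candidate in enumerate(scores)
        if not any(dominates(other, candidate)
                   for j, other in enumerate(scores) if j != i)
    ]


scores = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.9, 0.1)]
front = pareto_front(scores)  # indices of the non-dominated solutions
```

Each index in `front` would correspond to one candidate summary offered to the user. This pairwise check is quadratic in the number of solutions, which is acceptable for a small population.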

According to an embodiment, when a solution that does not satisfy the preset summary length condition is generated, the processor may, in determining whether the generated solution satisfies the specified optimization condition, apply a penalty according to the length by which the summary corresponding to the generated solution exceeds the condition.
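
A length penalty of this kind could take many forms. The sketch below assumes one of the simplest: a penalty proportional to the excess length, relative to the limit, subtracted from both objective values. The linear form, the `weight` parameter, and the name `penalized_scores` are all assumptions made for illustration.

```python
from typing import Tuple


def penalized_scores(coverage: float, diversity: float,
                     summary_words: int, max_words: int,
                     weight: float = 1.0) -> Tuple[float, float]:
    """Subtract a penalty proportional to the excess length from both objectives.

    A summary within the limit is returned unchanged.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    excess = max(0, summary_words - max_words)
    penalty = weight * excess / max_words
    return coverage - penalty, diversity - penalty


within_limit = penalized_scores(0.8, 0.6, summary_words=90, max_words=100)
over_limit = penalized_scores(0.8, 0.6, summary_words=120, max_words=100)
```

Because the penalty grows with the excess length, a summary that is slightly too long can still compete with compliant summaries, while a summary that is far too long is unlikely to be selected.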

A method for providing a summary of a document based on a genetic algorithm according to an embodiment disclosed in this document may include: generating, using the genetic algorithm, a solution that includes a plurality of genes corresponding to the respective analysis target sentences included in an original document and that represents the sentences selected from the analysis target sentences for generating a summary, such that the solution satisfies a preset summary length condition; calculating a coverage for the solution by using a first objective function associated with the similarity between the analysis target sentences included in the original document and the selected sentences corresponding to the solution; calculating a diversity for the solution by using a second objective function associated with the similarity among the selected sentences corresponding to the solution; and providing one or more summaries corresponding to each of one or more solutions that satisfy a specified optimization condition, based on the coverage and the diversity of each of a plurality of solutions generated by the genetic algorithm.

In a computer recording medium according to an embodiment disclosed in this document, which stores instructions executable by at least one processor included in a computing device, the instructions may cause the at least one processor to: generate, using a genetic algorithm, a solution that includes a plurality of genes corresponding to the respective analysis target sentences included in an original document and that represents the sentences selected from the analysis target sentences for generating a summary, such that the solution satisfies a preset summary length condition; calculate a coverage for the solution by using a first objective function associated with the similarity between the analysis target sentences included in the original document and the selected sentences corresponding to the solution; calculate a diversity for the solution by using a second objective function associated with the similarity among the selected sentences corresponding to the solution; and provide one or more summaries corresponding to each of one or more solutions that satisfy a specified optimization condition, based on the coverage and the diversity of each of a plurality of solutions generated by the genetic algorithm.

According to the embodiments disclosed in this document, the processing time for generating a summary can be shortened by performing the processes of the genetic algorithm (e.g., initialization, mutation, and crossover) in consideration of the preset summary length condition.

In addition, by performing multi-objective optimization using two objective functions, solutions can be evaluated efficiently and high-quality summaries can be selected.

In addition, by selecting one or more summaries that satisfy at least a certain level of quality, a user can be allowed to choose a suitable summary according to the user's intention.

In addition, various effects that are directly or indirectly identified through this document may be provided.

FIG. 1 illustrates an exemplary summary provided by an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.
FIG. 2 is a block diagram illustrating the configuration of an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.
FIG. 3 is a diagram for explaining an exemplary operation of an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.
FIG. 4 is a diagram for explaining an exemplary operation of an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.
FIG. 5 is a diagram for explaining an exemplary operation of an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.
FIG. 6 is a diagram for explaining an exemplary operation of an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.
FIG. 7 is a diagram for explaining an exemplary operation of an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.
FIG. 8 is a diagram for explaining an exemplary operation of an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.
FIG. 9 is a flowchart illustrating a method for providing a summary of a document based on a genetic algorithm according to an embodiment.
In connection with the description of the drawings, the same or similar reference numerals may be used for the same or similar components.

Hereinafter, some embodiments of the present invention are described in detail with reference to exemplary drawings. However, this is not intended to limit the present invention to specific embodiments, and the present invention should be understood to include various modifications, equivalents, or alternatives of the embodiments. In adding reference numerals to the components of each drawing, it should be noted that the same components are given the same reference numerals wherever possible, even when they appear in different drawings. In addition, in describing the embodiments of the present invention, a detailed description of a related known configuration or function is omitted when it is determined that the description would hinder understanding of the embodiments.

FIG. 1 illustrates an exemplary summary provided by an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.

The summary providing apparatus according to an embodiment may generate a summary 120 by extracting some of the sentences included in a plurality of original documents 111, 112, 113, and 114.

For example, the summary providing apparatus may obtain the plurality of original documents 111, 112, 113, and 114. Using a genetic algorithm, the apparatus may derive a solution corresponding to the sentences, among those included in the plurality of original documents 111, 112, 113, and 114, that are to be included in the summary 120. By inputting the derived solution into two objective functions, the apparatus may calculate a coverage associated with the similarity between the plurality of original documents 111, 112, 113, and 114 and the summary 120, and a diversity associated with the similarity among the sentences included in the summary 120. The coverage may be used to determine whether the summary 120 sufficiently reflects the content of the original documents 111, 112, 113, and 114, and the diversity may be used to determine whether the summary 120 includes unnecessary sentences. The apparatus may derive a solution whose coverage and diversity satisfy an optimization condition, and may provide the summary 120 corresponding to the derived solution.

The example described above is provided for illustration to aid understanding of this document, and the scope of rights of this document is not limited thereto. Specific methods for using the genetic algorithm, evaluating solutions by multi-objective optimization, and providing summaries are described in detail below.

FIG. 2 is a block diagram illustrating the configuration of an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.

Referring to FIG. 2, an apparatus 200 for providing a summary of a document based on a genetic algorithm according to an embodiment may include a communication circuit 210, a memory 220, and a processor 230. The summary providing apparatus 200 may be, for example, a computing device serving as a user terminal, such as a desktop, a laptop, a tablet, or a smartphone, or may be implemented in the form of a server. In the latter case, it may be physically located on-premises or in the cloud. As another example, the summary providing apparatus 200 may be implemented as two or more computing devices in a distributed environment (e.g., one user terminal and one server). The summary providing apparatus 200 is intended to efficiently provide a summary of documents, but is not limited thereto and may also be used to provide a summary of a single document.

The communication circuit 210 may be configured to communicate with the outside wirelessly or by wire. The communication circuit 210 may transmit data to and receive data from an external device. For example, the communication circuit 210 may receive one or more original documents from the outside.

The memory 220 may include volatile memory and/or non-volatile memory. The memory 220 may store various data handled by the summary providing apparatus 200. For example, the memory 220 may store data processed inside the summary providing apparatus 200 and may also store data received from the outside. For example, the memory 220 may store one or more received original documents.

The processor 230 may be electrically connected to the communication circuit 210 and the memory 220. The processor 230 may control the communication circuit 210 and the memory 220 and may perform various data processing and computation. Although FIG. 2 shows the processor 230 as a single component, it may be implemented as a plurality of separate components. The processor 230 may provide a user with a tool for providing a summary of a document. The processor 230 may perform the operations described below by executing software or instructions stored in the memory 220.

According to an embodiment, the processor 230 may use the genetic algorithm to generate a solution that includes a plurality of genes corresponding to the respective analysis target sentences included in the original document and that represents the sentences selected from the analysis target sentences for generating a summary, such that the solution satisfies a preset summary length condition. In this document, the term “sentence” may be interpreted to mean a single sentence, a part of a sentence (such as a phrase or word segment containing two or more words), or a set of two or more sentences. The analysis target sentences may be all of the sentences included in the original document, or the sentences that remain after some sentences are excluded. A solution of the genetic algorithm may be a vector whose genes (components) consist of binary data. The solution vector may indicate which of the analysis target sentences included in the original document are to be included in a candidate summary. For example, if the original document consists of 10 sentences and the candidate summary includes the 1st, 3rd, and 7th sentences, the solution vector may be (1, 0, 1, 0, 0, 0, 1, 0, 0, 0). The processor 230 may generate a plurality of solutions corresponding to a plurality of candidate summaries by performing processes such as initialization, mutation, and crossover using the genetic algorithm. The summary length condition may be set as a number of words (e.g., 100 words or fewer), a number of sentences (e.g., 10 sentences or fewer), or the length of the summary relative to the length of the original document (e.g., 10% of the number of sentences in the original document or 10% of the number of words in the original document). In this document, a “word” may consist of text, or may be data consisting of a number, a vector, a matrix, a tensor, or the like. The term “word” may also be understood as a concept that includes subwords. When performing processes such as initialization, mutation, and crossover of solutions, the processor 230 may generate the solutions corresponding to candidate summaries so that they satisfy the preset summary length condition. The initialization, mutation, and crossover of solutions using the genetic algorithm are described in detail with reference to FIGS. 3 to 7.
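
The binary encoding described above can be sketched in a few lines. The function names `encode_selection` and `decode_solution` and the placeholder sentences are hypothetical and serve only to reproduce the 10-sentence example from the text.

```python
from typing import List


def encode_selection(selected_indices: List[int], num_sentences: int) -> List[int]:
    """Build a binary solution vector from 0-based indices of the selected sentences."""
    chosen = set(selected_indices)
    return [1 if i in chosen else 0 for i in range(num_sentences)]


def decode_solution(solution: List[int], sentences: List[str]) -> List[str]:
    """Return the sentences whose gene is 1, in document order."""
    if len(solution) != len(sentences):
        raise ValueError("one gene is required per analysis target sentence")
    return [s for gene, s in zip(solution, sentences) if gene == 1]


# A document of 10 placeholder sentences; the candidate summary uses the
# 1st, 3rd, and 7th sentences (0-based indices 0, 2, and 6).
sentences = [f"Sentence {i}." for i in range(1, 11)]
solution = encode_selection([0, 2, 6], len(sentences))
summary = decode_solution(solution, sentences)
```

Here `solution` is `[1, 0, 1, 0, 0, 0, 1, 0, 0, 0]`, matching the vector given in the text, and `summary` holds the three selected sentences in their original order.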

According to an embodiment, when initializing a solution using the genetic algorithm, the processor 230 may set the values of the plurality of genes included in the solution sequentially, and may adjust the probability value used to determine those values based on the preset summary length condition. When a solution consisting of binary data is initialized using a genetic algorithm, a new solution may be generated by sequentially setting each gene included in the solution to 0 or 1 with a random probability. When solutions are derived for summarization, a summary that becomes excessively long may lose its purpose. A candidate summary with an excessive length is not only of little use but may also impose unnecessary computation on the system. Therefore, the solutions generated by initialization may be restricted according to a condition that limits the length of the candidate summary. When initializing a solution, the processor 230 may generate the genes to be included in the solution sequentially as 0 or 1 with a random probability, while adjusting the probability value that the genetic algorithm uses to set each gene to 0 or 1 so that the length of the candidate summary satisfies the length condition. As another example, the processor 230 may set the values of the remaining genes based on the preset summary length condition and length information of the summary corresponding to the genes that have already been set. That is, when the length of the candidate summary corresponding to the already generated genes reaches (or exceeds) the length condition, the processor 230 may generate the remaining genes as 0, or may increase the probability that they are generated as 0. The initialization of solutions using the genetic algorithm is described in detail with reference to FIGS. 3 to 5.
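
One possible reading of this length-aware initialization is sketched below. It assumes that the length condition is a word limit, that the selection probability is scaled by the remaining word budget, and that a gene is forced to 0 when its sentence would no longer fit. The scaling rule, the parameter names, and the function name `init_solution` are assumptions, not details from the source.

```python
import random
from typing import List, Optional


def init_solution(sentence_lengths: List[int], max_words: int,
                  base_prob: float = 0.5,
                  rng: Optional[random.Random] = None) -> List[int]:
    """Set genes one by one, lowering the chance of a 1 as the budget fills up."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if rng is None:
        rng = random.Random()
    solution: List[int] = []
    used = 0  # words already committed to the candidate summary
    for length in sentence_lengths:
        if used + length > max_words:
            # The sentence would break the length condition: force the gene to 0.
            solution.append(0)
            continue
        # Assumed rule: scale the selection probability by the remaining budget.
        prob = base_prob * (max_words - used) / max_words
        gene = 1 if rng.random() < prob else 0
        solution.append(gene)
        used += gene * length
    return solution


lengths = [30, 40, 50, 20, 60]  # words per analysis target sentence
candidate = init_solution(lengths, max_words=100, rng=random.Random(0))
```

Every solution produced this way already satisfies the word limit, so no evaluation time is spent on candidates that are too long.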

According to an embodiment, when mutating a solution using the genetic algorithm, the processor 230 may determine whether to mutate the plurality of genes included in the solution based on the preset summary length condition and length information of the summary corresponding to the solution. When a solution consisting of binary data is mutated using a genetic algorithm, another solution may be generated by changing one or more of the genes included in the solution with a random probability. For example, a gene with the value 0 may be mutated to 1, and a gene with the value 1 may be mutated to 0. In this case, particularly when a gene is mutated from 0 to 1, the length of the candidate summary corresponding to the solution generated by the mutation may exceed the length condition. Therefore, mutation may be restricted according to the condition that limits the length of the candidate summary. The processor 230 may mutate at least one of the genes included in the solution to 0 or 1 with a random probability; however, when the length of the candidate summary corresponding to the solution that the mutation would generate exceeds the length condition, the processor 230 may reduce the probability of mutating that gene or may not perform the mutation. The mutation of solutions using the genetic algorithm is described in detail with reference to FIG. 6.
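
The restriction on mutation can be sketched as follows. The sketch takes the stricter of the two options mentioned above, skipping a 0-to-1 flip entirely when it would exceed the limit, rather than lowering its probability. The word-count limit and the names `summary_length` and `mutate` are assumptions for illustration.

```python
import random
from typing import List, Optional


def summary_length(solution: List[int], sentence_lengths: List[int]) -> int:
    """Total word count of the sentences whose gene is 1."""
    return sum(n for gene, n in zip(solution, sentence_lengths) if gene == 1)


def mutate(solution: List[int], sentence_lengths: List[int], max_words: int,
           mutation_prob: float = 0.1,
           rng: Optional[random.Random] = None) -> List[int]:
    """Flip genes at random, skipping 0-to-1 flips that would exceed the limit."""
    if rng is None:
        rng = random.Random()
    child = list(solution)  # leave the parent untouched
    used = summary_length(child, sentence_lengths)
    for i, length in enumerate(sentence_lengths):
        if rng.random() >= mutation_prob:
            continue  # this gene is not chosen for mutation
        if child[i] == 1:
            child[i] = 0  # removing a sentence can only shorten the summary
            used -= length
        elif used + length <= max_words:
            child[i] = 1  # adding the sentence still fits
            used += length
        # otherwise the 0-to-1 mutation is skipped
    return child


lengths = [30, 40, 50, 20, 60]
parent = [1, 0, 1, 0, 0]  # 80 words
child = mutate(parent, lengths, max_words=100, mutation_prob=0.5,
               rng=random.Random(1))
```

If the parent satisfies the length condition, every child produced by this function does too.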

According to an embodiment, when performing crossover between a solution and another solution using the genetic algorithm, the processor 230 may determine whether to perform the crossover based on length information of the summary corresponding to the genes selected from the solution and length information of the summary corresponding to the genes selected from the other solution. When crossover is performed on two solutions made of binary data, a part of the first solution and a part of the second solution may be combined with a random probability. For example, when a first solution and a second solution each having ten components are crossed, the first through fifth components of the first solution and the sixth through tenth components of the second solution may be combined to generate another solution. In this case, the length of the candidate summary corresponding to the solution generated by the crossover may exceed the length condition. Therefore, crossover may be restricted according to the condition limiting the length of the candidate summary. The processor 230 selects some of the genes included in the two solutions and crosses them with a random probability, but when the length of the candidate summary corresponding to the solution generated by the crossover would exceed the length condition, the processor 230 may reduce the probability of performing crossover on the two solutions or may not perform the crossover. Crossover of solutions using the genetic algorithm will be described in detail with reference to FIG. 7.

According to an embodiment, when generating solutions corresponding to candidate summaries using the genetic algorithm, the processor 230 may limit the number of generations. The processor 230 may also stop producing the next generation when the candidate summaries show no further improvement. In addition, the processor 230 may save the current progress, load the saved result later, and then resume producing the next generation.
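The stopping rules described above can be sketched as a small driver loop. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `run_generations`, the callbacks `next_generation` and `best_score`, the `patience` parameter, and the JSON checkpoint format are all hypothetical.

```python
import json


def run_generations(population, next_generation, best_score,
                    max_generations=100, patience=10, start_generation=0):
    """Evolve until the generation cap is hit or the best score stops improving.

    All names are hypothetical; the text only states that the number of
    generations may be limited and that generation may stop when the
    candidate summaries no longer improve.
    """
    best = best_score(population)
    stale = 0  # consecutive generations without improvement
    generation = start_generation
    while generation < max_generations and stale < patience:
        population = next_generation(population)
        generation += 1
        current = best_score(population)
        if current > best:
            best, stale = current, 0
        else:
            stale += 1
    return population, generation, best


def save_checkpoint(population, generation):
    """Serialize progress so a later run can resume (assumed JSON format)."""
    return json.dumps({"generation": generation, "population": population})


def load_checkpoint(text):
    """Restore the population and generation counter from a checkpoint."""
    state = json.loads(text)
    return state["population"], state["generation"]
```

To resume, the generation counter returned by `load_checkpoint` is passed as `start_generation`, so the cap applies to the total number of generations rather than to each run.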

According to an embodiment, the processor 230 may calculate the coverage of a solution using a first objective function associated with the similarity between the analysis-target sentences included in the original document and the selected sentences corresponding to the solution. By using the similarity between the sentences of the original document and the candidate summary corresponding to a specific solution, a coverage indicating how well the candidate summary reflects the content of the original document can be calculated. For example, the coverage may be calculated based on the similarity between the vector corresponding to the analysis-target sentences and the vector corresponding to the selected sentences, and on the similarity between the vector corresponding to the analysis-target sentences and each of the sentence vectors corresponding to the selected sentences. The vector corresponding to the analysis-target sentences of the original document (the vector corresponding to the original document) may be expressed by the following exemplary equation.

[Equation 1]

Here, o is the vector corresponding to the original document, o_k is the value (or vector) corresponding to the k-th sentence of the original document, and w_ik is the weight corresponding to the i-th word of the k-th sentence; w_ik may be calculated by TF-ISF (term frequency-inverse sentence frequency).
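TF-ISF is commonly defined as a term's frequency within a sentence multiplied by the logarithm of the inverse sentence frequency. The sketch below uses that common form, w = tf * log(n / sf), on pre-tokenized sentences. The exact variant behind Equation 1 is not recoverable from this text, so the formula and the name `tf_isf_vectors` are assumptions.

```python
import math
from collections import Counter


def tf_isf_vectors(sentences):
    """Return (vocabulary, weight vectors) for a list of token lists.

    Assumed weighting: tf(term, sentence) * log(n / sf(term)), where n is
    the number of sentences and sf is the number of sentences containing
    the term. This is one common TF-ISF form, not necessarily Equation 1.
    """
    n = len(sentences)
    vocabulary = sorted({term for sentence in sentences for term in sentence})
    sentence_freq = Counter(term for sentence in sentences for term in set(sentence))
    vectors = []
    for sentence in sentences:
        tf = Counter(sentence)  # missing terms count as 0
        vectors.append([tf[term] * math.log(n / sentence_freq[term])
                        for term in vocabulary])
    return vocabulary, vectors
```

A term that appears in every sentence gets weight 0 under this form, which is why common words contribute nothing to the sentence vectors.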

Meanwhile, the vector corresponding to the sentences included in the candidate summary for a solution (the vector corresponding to the candidate summary) and the sentence vector of each sentence included in the candidate summary may be expressed by the following exemplary equations.

[Equation 2]

Here, s_i is the sentence vector of the i-th sentence included in the original document, w_im is the weight corresponding to the m-th word included in the i-th sentence, o^s is the vector corresponding to the candidate summary, o^s_k is the value (or vector) corresponding to the m-th sentence included in the candidate summary, and x_i may be the i-th gene (0 or 1) of the solution vector.

An exemplary equation (the first objective function) for calculating the coverage using the vector corresponding to the original document, the vector corresponding to the candidate summary, and the sentence vector of each sentence included in the candidate summary is as follows.

[Equation 3]

Here, n may be the number of analysis-target sentences included in the original document, and sim is a function for calculating similarity, which may be the cosine similarity function. The above equation yields a coverage that reflects both the similarity between the original document and the candidate summary corresponding to a specific solution and the similarity between the original document and each sentence included in the candidate summary. The first objective function may be normalized by dividing it by the average length of the sentences involved in calculating the coverage (e.g., the sentences included in the candidate summary).
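Because the equation itself is missing from this text, the following is only one plausible reading of the description: the document-to-summary similarity multiplied by the sum of the document-to-sentence similarities over the selected sentences. The product form, the use of mean vectors for o and o^s, and the function names are assumptions; the optional length normalization is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Component-wise mean (assumed construction of o and o^s)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def coverage(sentence_vectors, solution):
    """Assumed form: sim(o, o^s) * sum_i sim(o, s_i) * x_i."""
    selected = [s for s, x in zip(sentence_vectors, solution) if x == 1]
    if not selected:
        return 0.0
    o = mean_vector(sentence_vectors)  # vector for the original document
    o_s = mean_vector(selected)        # vector for the candidate summary
    return cosine(o, o_s) * sum(cosine(o, s) for s in selected)
```

Here `solution` is the binary chromosome, so a gene of 1 at position i includes sentence i in the candidate summary.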

According to an embodiment, the coverage may also be calculated based on the importance of the words included in the selected sentences, where the importance is calculated based on a pre-stored knowledge-based database (which may include data in various forms, such as a dictionary-type file). The importance of a word calculated from the knowledge-based database, and the coverage that takes it into account, may be calculated by the following exemplary equation. By considering the importance of the words included in the candidate summary, the quality of the candidate summary can be judged more accurately.

[Equation 4]

Here, μ^s_k is the importance of the k-th sentence included in the candidate summary, and y_i may be a value (e.g., a binary value) for the i-th word of the k-th sentence obtained from a knowledge-based database (e.g., DBpedia).

According to an embodiment, the coverage may also be calculated based on the number of words included in the topic keywords corresponding to the selected sentences and the distribution density of the words included in the topic keywords. The coverage that takes topic keywords into account may be calculated by the following exemplary equation. By considering the topic keywords of the sentences included in the candidate summary, the quality of the candidate summary can be judged more accurately.

[Equation 5]

Here, I_i is the increase in the number of words included in the topic keywords of the i-th sentence relative to the (i-1)-th sentence, D_i is the decrease in the number of words included in the topic keywords of the i-th sentence relative to the (i-1)-th sentence, and S_i may be the distribution density of the words used in calculating the topic keywords.

According to an embodiment, the processor 230 may calculate the diversity of a solution using a second objective function associated with the similarity among the selected sentences corresponding to the solution. The diversity may be calculated so that it increases as the similarity among the selected sentences decreases. By using the similarity among the sentences included in the candidate summary corresponding to a specific solution, a diversity for judging how many redundant sentences the candidate summary contains can be calculated. For example, the diversity may be calculated based on the similarity among the sentence vectors included in the candidate summary. An exemplary equation (the second objective function) for calculating the diversity using the sentence vectors included in the candidate summary is as follows.

[Equation 6]

Here, s_i may be the i-th sentence included in the original document, and s_j may be the j-th sentence included in the original document. By using x_i and x_j, which are components of the solution vector, only the similarity between sentences included in the candidate summary is taken into account. The above equation yields a diversity that reflects the similarity between one sentence and another sentence included in the candidate summary. The second objective function may be normalized by dividing it by the average length of the sentences involved in calculating the coverage (e.g., the sentences included in the candidate summary).
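Since Equation 6 is missing here, the sketch below shows one form consistent with the description: the sum of (1 - sim(s_i, s_j)) over sentence pairs, gated by x_i * x_j so that only pairs inside the candidate summary contribute. The exact form, the pairwise ordering i < j, and the function names are assumptions; the optional normalization is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def diversity(sentence_vectors, solution):
    """Assumed form: sum over i < j of (1 - sim(s_i, s_j)) * x_i * x_j."""
    total = 0.0
    n = len(solution)
    for i in range(n - 1):
        for j in range(i + 1, n):
            # x_i * x_j is 1 only when both sentences are in the summary.
            gate = solution[i] * solution[j]
            if gate:
                total += 1.0 - cosine(sentence_vectors[i], sentence_vectors[j])
    return total
```

With this form, two identical selected sentences add nothing, while two orthogonal ones add 1, so the value grows as the selected sentences become less similar.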

According to an embodiment, the processor 230 may provide one or more summaries respectively corresponding to one or more solutions that satisfy a specified optimization condition, based on the coverage and the diversity of each of the plurality of solutions generated by the genetic algorithm. For example, the processor 230 may obtain one or more solutions whose coverage and diversity satisfy a specified condition according to Pareto optimality. The processor 230 may generate one or more summaries respectively corresponding to the one or more solutions and provide the generated summaries to the user. When a plurality of solutions satisfying the specified optimization condition are obtained, the processor 230 may derive a single solution through additional processing and provide the user with the one corresponding summary, or may provide the user with a plurality of summaries respectively corresponding to the plurality of solutions so that the user can choose among them.

According to an embodiment, when a solution that does not satisfy the preset summary length condition is generated, the processor 230 may apply a penalty according to the excess length of the corresponding summary when determining whether the generated solution satisfies the specified optimization condition. When the summary length condition is set to a short value, the coverage and diversity of a solution that exceeds the summary length condition may be better than those of a solution that satisfies it. Therefore, in some cases it is necessary to consider solutions that do not satisfy the summary length condition. The penalty may be calculated, for example, as the ratio of the excess length to the specified summary length (overflow/limit).
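The overflow/limit ratio can be computed directly, as sketched below. How the penalty is combined with the objectives is not specified in this text; scaling both objectives by (1 - penalty), clamped at zero, is purely an illustrative assumption, as are the function names.

```python
def length_penalty(summary_length, limit):
    """Ratio of the excess length to the specified limit (overflow/limit)."""
    overflow = max(0, summary_length - limit)
    return overflow / limit


def penalized_objectives(coverage, diversity, summary_length, limit):
    """Hypothetical application: shrink both objectives by the penalty."""
    factor = max(0.0, 1.0 - length_penalty(summary_length, limit))
    return coverage * factor, diversity * factor
```

For a limit of 3 sentences and a 4-sentence candidate, the penalty is 1/3, so under this assumed scheme both objectives are scaled by 2/3 before the optimization condition is checked.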

FIG. 3 is a diagram for explaining an exemplary operation of an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.

Referring to FIG. 3, when initializing a solution using the genetic algorithm, the summary providing apparatus according to an embodiment sequentially sets the values of the plurality of genes included in the solution, and may adjust the probability value used to determine the values of the plurality of genes based on the preset summary length condition.

For example, when the original document has 10 analysis-target sentences, the summary providing apparatus may generate a chromosome (solution) including 10 genes. When the chromosome is initialized, the genes may be generated sequentially from the 1st gene to the 10th gene, each with a value of 0 or 1. The summary length condition may be set to, for example, three sentences. The summary providing apparatus may adjust the probability value used for setting the genes so that the candidate summary is more likely to have a length of 3. By adjusting σ according to the summary length condition, the summary providing apparatus may generate the chromosome such that the length of the candidate summary approaches 3.

As another example, when the 1st gene is set to 1, the 2nd gene to 0, the 3rd gene to 0, the 4th gene to 1, and the 5th gene to 1, the candidate summary corresponding to the genes already set includes three sentences. As the number of sentences included in the candidate summary corresponding to the genes already set increases, the probability that a gene is set to 1 may decrease. In this case, if any of the 6th to 10th genes is set to 1, the generated chromosome no longer satisfies the summary length condition. Accordingly, the 6th to 10th genes may be set to 0. In this way, the chromosome may be initialized to correspond to a candidate summary of length 3 or less.

FIG. 4 is a diagram for explaining an exemplary operation of an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.

Referring to FIG. 4, when initializing a solution using the genetic algorithm, the summary providing apparatus according to an embodiment sequentially sets the values of the plurality of genes included in the solution, and may adjust the probability value used to determine the values of the plurality of genes based on the preset summary length condition.

For example, when the original document has 10 analysis-target sentences, the summary providing apparatus may generate a chromosome (solution) including 10 genes. When the chromosome is initialized, the genes may be generated sequentially from the 1st gene to the 10th gene, each with a value of 0 or 1. The summary length condition may also be set as a range, for example, 3 to 4 sentences. In this case, the chromosome may be generated so that the candidate summary preferably includes at least 3 and at most 4 sentences. The summary providing apparatus may adjust the probability value used for setting the genes so that the candidate summary is more likely to have a length of 3 to 4. By arbitrarily adjusting σ between α (e.g., 3) and β (e.g., 4) according to the summary length condition, the summary providing apparatus may generate the chromosome such that the length of the candidate summary approaches 3 to 4.

As another example, when the length of the candidate summary corresponding to the genes already set is shorter than the minimum length, a gene may be set to 1 with a relatively high probability (e.g., when generating the 1st to 5th genes), and when the length of the candidate summary is longer than the minimum length and shorter than the maximum length, a gene may be set to 1 with a relatively low probability (e.g., when generating the 6th to 9th genes). When the length of the candidate summary corresponding to the genes already set reaches the maximum length, the remaining genes may be set to 0 (e.g., when generating the 10th gene). In this way, the chromosome may be initialized to correspond to a candidate summary of length 3 or more and 4 or less.
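The length-aware initialization of FIGS. 3 and 4 can be sketched as follows, measuring length in sentences as in the examples. The text does not define σ or give concrete probabilities, so `p_high`, `p_low`, and the function name are hypothetical; a fixed limit as in FIG. 3 corresponds to `min_len == max_len`.

```python
import random


def init_chromosome(n_genes, min_len, max_len, rng, p_high=0.6, p_low=0.2):
    """Generate genes in order, steering the number of 1s toward the range.

    p_high and p_low are assumed values: a gene is set to 1 with p_high
    while the summary is shorter than min_len, with p_low while it is
    within the range, and never once max_len has been reached.
    """
    chromosome = []
    selected = 0  # sentences already included in the candidate summary
    for _ in range(n_genes):
        if selected >= max_len:
            p_one = 0.0  # remaining genes are forced to 0
        elif selected < min_len:
            p_one = p_high
        else:
            p_one = p_low
        gene = 1 if rng.random() < p_one else 0
        selected += gene
        chromosome.append(gene)
    return chromosome
```

This sketch guarantees the upper bound but, like the "preferably" wording in the description, only makes the lower bound likely rather than certain.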

FIG. 5 is a diagram for explaining an exemplary operation of an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.

FIG. 5(a) is a graph showing the relationship between the coverage and the diversity of the generated solutions when initialization is performed without applying the summary length condition. FIG. 5(b) is a graph showing the relationship between the coverage and the diversity of the generated solutions when initialization is performed with the summary length condition applied.

When the summary length condition is not applied, candidate summaries may include many redundant sentences, so many candidate summaries with low diversity may be generated. In contrast, when the summary length condition is applied, candidate summaries with widely distributed coverage and diversity values may be generated.

FIG. 6 is a diagram for explaining an exemplary operation of an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.

Referring to FIG. 6, when performing mutation of a solution using the genetic algorithm, the summary providing apparatus according to an embodiment may determine whether to mutate the plurality of genes included in the solution based on the preset summary length condition and length information of the summary corresponding to the solution.

For example, the summary providing apparatus may perform mutation on a chromosome (solution) including 10 genes. When a gene included in the chromosome is mutated, its value may change according to a random probability. The summary providing apparatus may adjust the probability value used for gene mutation based on the ratio of 0s and the ratio of 1s in the chromosome and on the summary length condition. The summary length condition may be set to, for example, three sentences. When the chromosome includes two genes with a value of 1, the probability that a gene with a value of 0 is mutated to 1 may be set relatively high. When the chromosome includes three genes with a value of 1, the probability that a gene with a value of 0 is mutated to 1 may be set relatively low. In this way, the chromosome may be mutated so that the length of the candidate summary approaches 3.

As another example, when the chromosome includes two genes with a value of 1, the 7th gene, whose value is 0, may be mutated to 1, and when the chromosome includes three genes with a value of 1, the 7th gene, whose value is 0, may not be mutated. In this way, the chromosome may be mutated to correspond to a candidate summary of length 3 or less.
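The hard-limit variant of this mutation rule can be sketched as below: a 1 may always flip to 0, while a 0 flips to 1 only if the chromosome stays within the limit. The per-gene `rate` and the function name are hypothetical, and the softer variant (lowering the probability instead of blocking the flip) is not shown.

```python
import random


def mutate(chromosome, limit, rng, rate=0.1):
    """Bit-flip mutation that never raises the number of 1s above `limit`."""
    result = list(chromosome)
    selected = sum(result)  # current number of sentences in the summary
    for i, gene in enumerate(result):
        if rng.random() >= rate:
            continue  # this gene is not chosen for mutation
        if gene == 1:
            result[i] = 0
            selected -= 1
        elif selected < limit:
            result[i] = 1
            selected += 1
        # Otherwise a 0 -> 1 flip would exceed the limit, so it is skipped.
    return result
```

With a limit of 3, a chromosome holding two 1s can still gain one more, whereas a chromosome already holding three 1s leaves its 0 genes unchanged, matching the 7th-gene example above.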

FIG. 7 is a diagram for explaining an exemplary operation of an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.

Referring to FIG. 7, when performing crossover between a solution and another solution using the genetic algorithm, the summary providing apparatus according to an embodiment may determine whether to perform the crossover based on length information of the summary corresponding to the genes selected from the solution and length information of the summary corresponding to the genes selected from the other solution.

For example, the summary providing apparatus may perform crossover on chromosomes (solutions) each including 10 genes. When chromosomes are crossed, the crossover point and whether crossover occurs may be determined according to a random probability. The summary providing apparatus may adjust the probability value used for crossover based on the ratio of 0s and the ratio of 1s in the chromosome and on the summary length condition. The summary length condition may be set to, for example, three sentences. When the chromosome to be crossed includes three genes with a value of 1, the probability of performing the crossover may be set relatively high. When the chromosome to be crossed includes four genes with a value of 1, the probability of performing the crossover may be set relatively low. In this way, chromosomes may be crossed so that the length of the candidate summary approaches 3.

As another example, when the chromosome to be crossed includes three genes with a value of 1, the crossover may be performed, and when the chromosome to be crossed includes four genes with a value of 1, the crossover may not be performed. In this way, chromosomes may be crossed to correspond to a candidate summary of length 3 or less.
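A one-point crossover with the hard length check can be sketched as follows. Reading "the chromosome to be crossed" as the would-be child, returning `None` when the crossover is rejected, and the optional `point` argument are all assumptions made for illustration.

```python
import random


def crossover(parent_a, parent_b, limit, rng, point=None):
    """One-point crossover that is skipped if the child exceeds `limit`.

    The child takes genes before `point` from parent_a and the rest from
    parent_b. Returns None when the crossover is not performed.
    """
    if point is None:
        point = rng.randrange(1, len(parent_a))  # random cut position
    child = parent_a[:point] + parent_b[point:]
    if sum(child) > limit:
        return None  # child would be longer than the length condition
    return child
```

With ten genes and `point=5`, the child combines the first through fifth genes of the first parent with the sixth through tenth genes of the second, as in the ten-component example given earlier.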

FIG. 8 is a diagram for explaining an exemplary operation of an apparatus for providing a summary of a document based on a genetic algorithm according to an embodiment.

Referring to FIG. 8, the summary providing apparatus according to an embodiment may obtain one or more solutions that satisfy a specified optimization condition based on the coverage and the diversity of each of a plurality of solutions. A summary serves its purpose only when it adequately reflects the content of the original document and contains no redundant sentences. Accordingly, a high-quality summary can be derived by performing multi-objective optimization on the first objective function and the second objective function.

The graph shown in FIG. 8 may represent the relationship between the coverage and the diversity of the solutions that satisfy the Pareto optimality condition among the solutions generated by the genetic algorithm. The summary providing apparatus may select one or more solutions whose coverage and diversity satisfy the optimization condition. Among the solutions that satisfy the optimization condition, the summary providing apparatus may select a solution with high coverage, a solution with high diversity, or a solution in which coverage and diversity are balanced. The summary providing apparatus may also present a plurality of solutions to the user and prompt the user to choose.
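The Pareto filtering and the three selection preferences can be sketched as below, treating both coverage and diversity as values to maximize. The function names are hypothetical, and using the larger of the two objectives' minimum as the "balanced" criterion is an assumption; it is only meaningful if both objectives are on comparable scales.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scores):
    """Indices of (coverage, diversity) pairs not dominated by any other."""
    return [i for i, s in enumerate(scores)
            if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)]


def select(scores, front, prefer="balanced"):
    """Pick one index from the front according to an assumed preference."""
    if prefer == "coverage":
        return max(front, key=lambda i: scores[i][0])
    if prefer == "diversity":
        return max(front, key=lambda i: scores[i][1])
    # "balanced": favor the solution whose weaker objective is strongest.
    return max(front, key=lambda i: min(scores[i]))
```

Alternatively, every index in `front` can be turned into a summary and shown to the user, which corresponds to the option of letting the user choose.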

FIG. 9 is a flowchart for explaining a method of providing a summary of a document based on a genetic algorithm according to an embodiment.

Hereinafter, it is assumed that the apparatus 200 for providing a summary of a document of FIG. 2 performs the process of FIG. 9. In the description of FIG. 9, operations described as being performed by the apparatus may be understood as being controlled by the processor 230.

Referring to FIG. 9, in step 910, the apparatus may use the genetic algorithm to generate a solution so that it satisfies a preset summary length condition, where the solution includes a plurality of genes respectively corresponding to the analysis-target sentences included in the original document and represents the sentences selected from the analysis-target sentences for generating the summary.

In step 920, the apparatus may calculate the coverage of the solution using the first objective function associated with the similarity between the analysis-target sentences included in the original document and the selected sentences corresponding to the solution.

In step 930, the apparatus may calculate the diversity of the solution using the second objective function associated with the similarity among the selected sentences corresponding to the solution.

In step 940, the apparatus may provide one or more summaries respectively corresponding to one or more solutions that satisfy the specified optimization condition, based on the coverage and the diversity of each of the plurality of solutions generated by the genetic algorithm.

The embodiments of this document and the terms used therein are not intended to limit the technology described in this document to a specific embodiment, and should be understood to include various modifications, equivalents, and/or alternatives of the embodiments. In connection with the description of the drawings, like reference numerals may be used for like components. A singular expression may include the plural unless the context clearly indicates otherwise. In this document, expressions such as "A or B", "at least one of A and/or B", "A, B or C", or "at least one of A, B and/or C" may include all possible combinations of the items listed together. Expressions such as "first" and "second" may modify the corresponding components regardless of order or importance, and are used only to distinguish one component from another without limiting the components. When a component is referred to as being "(functionally or communicatively) connected" or "coupled" to another component, the component may be connected to the other component directly or through yet another component.

In this document, "adapted to or configured to" may be used interchangeably, depending on the context, with, for example, "suitable for," "having the capacity to," "modified to," "made to," "capable of," or "designed to," in terms of hardware or software. In some circumstances, the expression "a device configured to" may mean that the device is "capable of" operating together with other devices or components. For example, the phrase "a processor adapted (or configured) to perform A, B, and C" may refer to a dedicated processor (e.g., an embedded processor) for performing the corresponding operations, or to a general-purpose processor (e.g., a CPU) capable of performing the corresponding operations by executing one or more programs stored in a memory device.

As used in this document, the term "module" includes a unit composed of hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit. A "module" may be an integrally formed component, or a minimum unit or a part thereof that performs one or more functions. A "module" may be implemented mechanically or electronically and may include, for example, an application-specific integrated circuit (ASIC) chip, field-programmable gate arrays (FPGAs), or a programmable logic device, known or to be developed, that performs certain operations.

At least a part of an apparatus (e.g., modules or functions thereof) or a method (e.g., operations) according to an embodiment may be implemented as instructions stored in a computer-readable storage medium in the form of a program module. When the instructions are executed by a processor, the processor may perform the functions corresponding to the instructions.

Each component (e.g., a module or a program module) according to an embodiment may consist of a single entity or a plurality of entities; some of the aforementioned sub-components may be omitted, or other sub-components may be further included. Alternatively or additionally, some components (e.g., modules or program modules) may be integrated into a single entity that performs, identically or similarly, the functions performed by each of the corresponding components before integration. Operations performed by a module, a program module, or another component according to an embodiment may be executed sequentially, in parallel, repetitively, or heuristically; at least some operations may be executed in a different order or omitted; or other operations may be added.

Claims

An apparatus for providing a summary of a document based on a genetic algorithm, the apparatus comprising:
a communication circuit configured to receive an original document from the outside;
a memory configured to store the original document; and
a processor electrically connected to the communication circuit and the memory,
wherein the processor is configured to:
generate, using the genetic algorithm, a solution that includes a plurality of genes respectively corresponding to sentences to be analyzed that are included in the original document, and that represents some sentences selected from the sentences to be analyzed for generating a summary, such that the solution satisfies a preset summary length condition;
calculate a coverage for the solution by using a first objective function associated with a similarity between the sentences to be analyzed included in the original document and the selected sentences corresponding to the solution;
calculate a diversity for the solution by using a second objective function associated with a similarity among the selected sentences corresponding to the solution; and
provide one or more summaries, each corresponding to one of one or more solutions that satisfy a specified optimization condition, based on the coverage and the diversity of each of a plurality of solutions generated by the genetic algorithm.

The apparatus of claim 1,
wherein the processor is configured to,
when initialization of the solution is performed using the genetic algorithm, sequentially set values of the plurality of genes included in the solution, and adjust a probability value used for determining the values of the plurality of genes based on the preset summary length condition.
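For illustration only, the length-aware initialization recited above could be sketched as follows. The function name `init_solution`, the use of per-sentence lengths, and the specific probability rule (remaining length budget divided by the total length of the sentences not yet visited) are assumptions made for this sketch; the source does not specify how the probability value is adjusted.

```python
import random


def init_solution(sentence_lengths, max_length, rng=None):
    """Set genes one by one, adapting the selection probability to the length budget.

    A gene value of 1 means the sentence is selected for the summary.
    The probability rule below is a hypothetical choice, not the patented rule.
    """
    rng = rng or random.Random()
    genes = []
    used = 0                                  # length of the summary built so far
    remaining_total = sum(sentence_lengths)   # total length of sentences not yet visited
    for length in sentence_lengths:
        budget = max_length - used
        if length > budget or remaining_total == 0:
            p = 0.0                           # selecting this sentence would break the limit
        else:
            # Assumed rule: on average, spread the remaining budget over the remaining text.
            p = min(1.0, budget / remaining_total)
        bit = 1 if rng.random() < p else 0
        genes.append(bit)
        used += bit * length
        remaining_total -= length
    return genes
```

Because a sentence that no longer fits is given probability zero, every solution produced by this sketch already satisfies the length condition, so no repair step is needed after initialization.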

The apparatus of claim 1,
wherein the processor is configured to,
when mutation of the solution is performed using the genetic algorithm, determine whether to mutate the plurality of genes included in the solution based on the preset summary length condition and length information of the summary corresponding to the solution.
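As a rough sketch of a length-aware mutation of this kind, the code below flips a gene from 0 to 1 only when the resulting summary would still fit within the limit, while flips from 1 to 0 are always allowed. The name `mutate`, the per-gene mutation rate, and this particular acceptance rule are assumptions for illustration and are not taken from the source.

```python
import random


def mutate(genes, sentence_lengths, max_length, rate=0.1, rng=None):
    """Return a mutated copy of `genes` that never exceeds `max_length` if the input did not."""
    rng = rng or random.Random()
    child = list(genes)
    # Current length of the summary encoded by the solution.
    current = sum(l for g, l in zip(child, sentence_lengths) if g)
    for i, length in enumerate(sentence_lengths):
        if rng.random() >= rate:
            continue                          # this gene is not a mutation candidate
        if child[i] == 1:
            child[i] = 0                      # dropping a sentence always shortens the summary
            current -= length
        elif current + length <= max_length:
            child[i] = 1                      # add the sentence only if it still fits
            current += length
    return child
```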

The apparatus of claim 1,
wherein the processor is configured to,
when crossover between the solution and another solution is performed using the genetic algorithm, determine whether to perform the crossover based on length information of the summary corresponding to genes selected in the solution and length information of the summary corresponding to genes selected in the other solution.
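One possible reading of this length-based crossover decision is sketched below, assuming a one-point crossover that is carried out only when both offspring would stay within the length limit. The helper names, the one-point scheme, and the fallback of returning unchanged copies of the parents are hypothetical choices for this sketch.

```python
def summary_length(genes, sentence_lengths):
    """Total length of the sentences whose gene is set to 1."""
    return sum(l for g, l in zip(genes, sentence_lengths) if g)


def crossover(parent_a, parent_b, sentence_lengths, max_length, cut):
    """One-point crossover at index `cut`, performed only if both children fit the limit."""
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    fits = (summary_length(child_1, sentence_lengths) <= max_length
            and summary_length(child_2, sentence_lengths) <= max_length)
    if fits:
        return child_1, child_2
    # Crossover is skipped: hand back copies of the parents.
    return list(parent_a), list(parent_b)
```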

The apparatus of claim 1,
wherein the coverage is calculated based on a similarity between a vector corresponding to the sentences to be analyzed and a vector corresponding to the selected sentences, and on a similarity between the vector corresponding to the sentences to be analyzed and each of a plurality of sentence vectors respectively corresponding to the selected sentences.
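A minimal sketch of a coverage score with these two components is given below. It assumes that sentences are already embedded as numeric vectors, that the document and summary vectors are simple means of sentence vectors, that similarity is cosine similarity, and that the two components are added with equal weight. None of these choices is stated in the source.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def coverage(sentence_vectors, genes):
    """Assumed coverage: document-vs-summary similarity plus mean document-vs-sentence similarity."""
    selected = [v for v, g in zip(sentence_vectors, genes) if g]
    if not selected:
        return 0.0
    doc_vec = mean_vector(sentence_vectors)
    summary_vec = mean_vector(selected)
    per_sentence = sum(cosine(doc_vec, v) for v in selected) / len(selected)
    return cosine(doc_vec, summary_vec) + per_sentence
```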

The apparatus of claim 5,
wherein the coverage is calculated based on an importance of words included in the selected sentences, the importance being calculated based on a pre-stored knowledge-based database.

The apparatus of claim 5,
wherein the coverage is calculated based on the number of words included in topic keywords corresponding to the selected sentences and on a distribution density of the words included in the topic keywords.

The apparatus of claim 1,
wherein the diversity increases as the similarity among the selected sentences decreases.
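One simple score with this property is one minus the mean pairwise cosine similarity of the selected sentence vectors, sketched below. The formula, the function names, and the convention of returning 0.0 when fewer than two sentences are selected are assumptions for illustration only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def diversity(sentence_vectors, genes):
    """Assumed diversity: 1 - mean pairwise similarity among the selected sentences."""
    selected = [v for v, g in zip(sentence_vectors, genes) if g]
    pairs = list(combinations(selected, 2))
    if not pairs:
        return 0.0  # assumed convention: no pair of sentences to compare
    mean_similarity = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - mean_similarity
```

With this definition, a solution that selects near-duplicate sentences scores close to 0, while a solution whose sentences point in unrelated directions scores close to 1.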

The apparatus of claim 1,
wherein the processor is configured to:
obtain the one or more solutions for which the coverage and the diversity satisfy a specified condition according to Pareto optimality; and
generate one or more summaries, each corresponding to one of the one or more solutions.
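The Pareto-based selection can be illustrated with a plain non-dominated filter over (coverage, diversity) pairs, where both objectives are maximized. This is a generic sketch of Pareto filtering, with hypothetical function names; the source does not state which multi-objective procedure is actually used.

```python
def dominates(a, b):
    """True if score tuple `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scores):
    """Indices of solutions whose (coverage, diversity) pair is not dominated by any other."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Example: the third solution is dominated by the second and is filtered out.
scores = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]
front = pareto_front(scores)
```

Each index in `front` identifies one solution, so one summary could be generated per surviving solution.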

The apparatus of claim 1,
wherein the processor is configured to,
when a solution that does not satisfy the preset summary length condition is generated, apply a penalty according to the excess length of the summary corresponding to the generated solution when determining whether the generated solution satisfies the specified optimization condition.
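A length penalty of this kind could, for example, be subtracted from both objectives in proportion to the excess length, as sketched below. The linear form, the `weight` parameter, and the function name `penalized_scores` are assumptions for this sketch; the source only states that the penalty depends on the excess length.

```python
def penalized_scores(coverage, diversity, summary_length, max_length, weight=0.01):
    """Reduce both objectives by a penalty proportional to the length overshoot.

    Solutions within the limit are returned unchanged.
    """
    excess = max(0, summary_length - max_length)  # zero when the limit is respected
    penalty = weight * excess
    return coverage - penalty, diversity - penalty
```

Applying the penalty before the optimization condition is evaluated lets over-long solutions stay in the population while making them less likely to be selected.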

A method for providing a summary of a document based on a genetic algorithm, the method comprising:
generating, using the genetic algorithm, a solution that includes a plurality of genes respectively corresponding to sentences to be analyzed that are included in an original document, and that represents some sentences selected from the sentences to be analyzed for generating a summary, such that the solution satisfies a preset summary length condition;
calculating a coverage for the solution by using a first objective function associated with a similarity between the sentences to be analyzed included in the original document and the selected sentences corresponding to the solution;
calculating a diversity for the solution by using a second objective function associated with a similarity among the selected sentences corresponding to the solution; and
providing one or more summaries, each corresponding to one of one or more solutions that satisfy a specified optimization condition, based on the coverage and the diversity of each of a plurality of solutions generated by the genetic algorithm.

A computer recording medium storing instructions executable by at least one processor included in a computing device,
wherein the instructions cause the at least one processor to:
generate, using a genetic algorithm, a solution that includes a plurality of genes respectively corresponding to sentences to be analyzed that are included in an original document, and that represents some sentences selected from the sentences to be analyzed for generating a summary, such that the solution satisfies a preset summary length condition;
calculate a coverage for the solution by using a first objective function associated with a similarity between the sentences to be analyzed included in the original document and the selected sentences corresponding to the solution;
calculate a diversity for the solution by using a second objective function associated with a similarity among the selected sentences corresponding to the solution; and
provide one or more summaries, each corresponding to one of one or more solutions that satisfy a specified optimization condition, based on the coverage and the diversity of each of a plurality of solutions generated by the genetic algorithm.