JP5921457B2 - Document summarization method, apparatus, and program

Info

Publication number: JP5921457B2
Application number: JP2013020697A
Authority: JP
Inventors: 平尾 努; 安田 宜仁; 西野 正彬; 永田 昌明
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-02-05
Filing date: 2013-02-05
Publication date: 2016-05-24
Anticipated expiration: 2033-02-05
Also published as: JP2014153766A

Description

The present invention relates to a document summarization method, apparatus, and program, and more particularly to a document summarization method, apparatus, and program that generate a summary corresponding to an input document.

Conventional computer-based document summarization techniques assign an importance to grammatical elements (sentences, phrases, clauses) in a document and treat summarization as a combinatorial optimization problem: select the combination of elements whose total importance is maximal while the number of characters (or words) stays at or below a value acceptable as the summary size.

For example, Non-Patent Document 1 takes the "sentence" as the grammatical element, treats the selection of a combination of sentences whose total sentence importance is maximal and whose summary length is at most N characters as a knapsack problem, and obtains the optimal solution with a knapsack algorithm (dynamic programming).

FIG. 14 shows the processing flow of the prior art. First, a dividing unit receives a document as input and divides it into grammatical units. Here the unit is the "sentence". The processing below applies unchanged if a smaller or larger unit is used instead. For Japanese, a document can be divided into sentences with simple rules that use the full stop (句点) as a cue; for European languages such as English, the period serves as the cue.
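The punctuation-based splitting described above can be sketched as follows. This is a minimal illustration, not the splitter used by the patent: the function name `split_sentences` and the regular expression are assumptions, and the rule ignores cases such as decimal points and abbreviations.

```python
import re

# Hypothetical helper: split text into sentences, using the Japanese
# full stop "。" or the period "." as the only cue.
def split_sentences(text):
    # Each match is a run of non-terminator characters followed by
    # an optional terminator.
    pieces = re.findall(r"[^。.]+[。.]?", text)
    # Drop surrounding whitespace and empty fragments.
    return [p.strip() for p in pieces if p.strip()]

print(split_sentences("It rains. It pours."))
print(split_sentences("今日は晴れ。明日は雨。"))
```

A production splitter would need additional rules (quotations, numerals, abbreviations); the sketch only shows the cue-based idea.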

Next, an importance assigning unit determines the importance of each sentence from the importance of the words it contains. Word importance may be determined in advance with, for example, the tf-idf method commonly used in information retrieval systems. Using it, the importance of sentence s_i is defined, for example, by equation (1) below. Here w(t) is the importance of word t held in the word importance database.
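Equation (1) is not reproduced in this text. As a hedged illustration only, the sketch below assumes that the importance of a sentence is the sum of the word importances w(t) over its tokens; the function name `sentence_importance`, the zero default for unknown words, and the summation itself are assumptions rather than the patent's stated formula.

```python
# Hypothetical sketch: sentence importance as the sum of word weights w(t).
# `word_weight` plays the role of the word importance database.
def sentence_importance(words, word_weight):
    # Words missing from the database contribute 0.0 (an assumption).
    return sum(word_weight.get(t, 0.0) for t in words)

weights = {"tree": 2.0, "summary": 3.0}
print(sentence_importance(["the", "tree", "summary"], weights))
```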

Next, a combination search unit receives the length permitted for the summary as a parameter and searches for the combination of sentences that does not exceed that length and maximizes the total sentence importance.

In other words, let S be a set of sentences and F a function giving the importance of a sentence set. The problem is then to find the sentence set (combination) that maximizes equation (2) below while its length (number of characters or words) is at most L_max. Naively, finding the combination that maximizes F means examining 2^N candidate combinations, which is not practical. In practice, however, combinations longer than L_max never need to be examined, so the optimal solution can be found efficiently with a knapsack algorithm.
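The knapsack formulation can be sketched with a standard 0/1 knapsack dynamic program. This is an illustrative sketch, not the code of Non-Patent Document 1: the function name `select_sentences` is hypothetical, lengths are assumed to be positive integers, and F is assumed to be the sum of per-sentence importances.

```python
# Hypothetical sketch of sentence selection as a 0/1 knapsack problem.
def select_sentences(lengths, scores, l_max):
    n = len(lengths)
    # best[i][L]: maximum total score using the first i sentences
    # with total length at most L.
    best = [[0] * (l_max + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for L in range(l_max + 1):
            best[i][L] = best[i - 1][L]            # skip sentence i-1
            if lengths[i - 1] <= L:                # or take it if it fits
                take = best[i - 1][L - lengths[i - 1]] + scores[i - 1]
                best[i][L] = max(best[i][L], take)
    # Trace back which sentences were taken.
    chosen, L = [], l_max
    for i in range(n, 0, -1):
        if best[i][L] != best[i - 1][L]:
            chosen.append(i - 1)
            L -= lengths[i - 1]
    return best[n][l_max], sorted(chosen)

print(select_sentences([3, 4, 2], [4, 5, 3], 5))
```

The table has (N + 1) x (L_max + 1) entries, so the running time grows with N times L_max rather than with 2^N.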

Non-Patent Document 1: Tsutomu Hirao, Jun Suzuki, Hideki Isozaki, "Document Summarization as an Optimization Problem" (最適化問題としての文書要約), Transactions of the Japanese Society for Artificial Intelligence, 2009, Vol. 24, No. 2, pp. 223-231

However, conventional summarization techniques treat sentences as independent units, so the sentence set that maximizes the function F under the length constraint L_max does not necessarily capture the logical structure of the input document. For example, the resulting summary may, when read, convey a meaning opposite to that of the input document.

The present invention has been made in view of the above circumstances, and its object is to provide a document summarization method, apparatus, and program capable of creating a summary that correctly reflects the logical structure of an input document.

To achieve the above object, a document summarization method according to the present invention is a method in a document summarization apparatus that includes structure tree creating means and summary generating means and that generates a summary corresponding to an input document by selecting at least one predetermined character string unit from the input document so that the summary does not exceed a predetermined upper limit of length. The method includes: a step in which the structure tree creating means creates, based on the result of dividing the input document into the character string units, a discourse structure tree based on the dependency structure of the character string units of the input document, in which the most important character string unit of the input document is the root node, each character string unit of the input document is a node, and nodes corresponding to character string units that have a modification relationship are connected by edges; and a step in which the summary generating means finds, based on the length and the importance of the character string unit corresponding to each node of the dependency-based discourse structure tree created by the structure tree creating means, the subtree that, among the subtrees including the root node of the dependency-based discourse structure tree, has a total length of the character string units corresponding to its nodes that is at most the upper limit of length and has the maximum total importance, selects from the input document the character string units corresponding to the nodes of the subtree thus found, and generates the summary corresponding to the input document. The step of creating the dependency-based discourse structure tree by the structure tree creating means includes: a step in which rhetorical structure analyzing means creates a discourse structure tree based on the rhetorical structure of the sequences of character string units of the input document, in which the root node represents the entire input document, which represents a hierarchical structure whose nodes are sequences of character string units each consisting of at least one character string unit of the input document, and which represents the modification relationships between the sequences of character string units; and a step in which rhetorical structure tree converting means converts the rhetorical-structure-based discourse structure tree created by the rhetorical structure analyzing means into the dependency-based discourse structure tree.

A document summarization apparatus according to the present invention generates a summary corresponding to an input document by selecting at least one predetermined character string unit from the input document so that the summary does not exceed a predetermined upper limit of length. The apparatus includes: structure tree creating means for creating, based on the result of dividing the input document into the character string units, a discourse structure tree based on the dependency structure of the character string units of the input document, in which the most important character string unit of the input document is the root node, each character string unit of the input document is a node, and nodes corresponding to character string units that have a modification relationship are connected by edges; and summary generating means for finding, based on the length and the importance of the character string unit corresponding to each node of the dependency-based discourse structure tree created by the structure tree creating means, the subtree that, among the subtrees including the root node of the dependency-based discourse structure tree, has a total length of the character string units corresponding to its nodes that is at most the upper limit of length and has the maximum total importance, selecting from the input document the character string units corresponding to the nodes of the subtree thus found, and generating the summary corresponding to the input document. The structure tree creating means includes: rhetorical structure analyzing means for creating a discourse structure tree based on the rhetorical structure of the sequences of character string units of the input document, in which the root node represents the entire input document, which represents a hierarchical structure whose nodes are sequences of character string units each consisting of at least one character string unit of the input document, and which represents the modification relationships between the sequences of character string units; and rhetorical structure tree converting means for converting the rhetorical-structure-based discourse structure tree created by the rhetorical structure analyzing means into the dependency-based discourse structure tree.

According to the document summarization method and the document summarization apparatus of the present invention, the structure tree creating means creates, based on the result of dividing the input document into character string units, a discourse structure tree based on the dependency structure of the character string units of the input document, in which the most important character string unit of the input document is the root node, each character string unit of the input document is a node, and nodes corresponding to character string units that have a modification relationship are connected by edges.

The summary generating means then finds, based on the length and the importance of the character string unit corresponding to each node of the dependency-based discourse structure tree created by the structure tree creating means, the subtree that, among the subtrees including the root node of that tree, has a total length of the character string units corresponding to its nodes that is at most the upper limit of length and has the maximum total importance. It selects from the input document the character string units corresponding to the nodes of the subtree thus found and generates the summary corresponding to the input document.

In the structure tree creating means, the rhetorical structure analyzing means creates a discourse structure tree based on the rhetorical structure of the sequences of character string units of the input document, in which the root node represents the entire input document, which represents a hierarchical structure whose nodes are sequences of character string units each consisting of at least one character string unit of the input document, and which represents the modification relationships between the sequences. The rhetorical structure tree converting means then converts the rhetorical-structure-based discourse structure tree created by the rhetorical structure analyzing means into the dependency-based discourse structure tree.

In this way, a discourse structure tree based on the dependency structure of the character string units of the input document is created; based on the length and the importance of the character string unit corresponding to each node of that tree, the subtree is found that, among the subtrees including the root node, has a total length of the character string units corresponding to its nodes that is at most the upper limit of length and has the maximum total importance; and the character string units corresponding to the nodes of that subtree are selected to generate the summary corresponding to the input document. A summary that correctly reflects the logical structure of the input document can thereby be created.

The summary generating means may also operate as follows. Based on the length and the importance of the character string unit corresponding to each node of the dependency-based discourse structure tree created by the structure tree creating means, it processes the nodes of that tree bottom-up, starting from the leaf nodes. For each node and for each length not exceeding the upper limit of length, it finds, by solving a knapsack problem, the subtree that, among the subtrees formed with that node as their root node, has a total length of the character string units corresponding to its nodes that is at most that length and has the maximum total importance. Among the subtrees thus found that include the root node of the dependency-based discourse structure tree, it takes the subtree with the maximum total importance, selects from the input document the character string units corresponding to the nodes of that subtree, and generates the summary corresponding to the input document.

A program according to the present invention is a program for causing a computer to execute each step of the document summarization method according to the present invention.

As described above, according to the document summarization method, apparatus, and program, a discourse structure tree based on the dependency structure of the character string units of the input document is created; based on the length and the importance of the character string unit corresponding to each node of that tree, the subtree is found that, among the subtrees including the root node, has a total length of the character string units corresponding to its nodes that is at most the upper limit of length and has the maximum total importance; and the character string units corresponding to the nodes of that subtree are selected to generate the summary corresponding to the input document. The effect obtained is that a summary correctly reflecting the logical structure of the input document can be created.

FIG. 1 is a block diagram showing a configuration example of the document summarization apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of a discourse structure tree based on the rhetorical structure of a document (RST-DT).
FIG. 3 is a diagram showing an example of a discourse structure tree based on the dependency structure of a document (DEP-DT), obtained from the RST-DT shown in FIG. 2.
FIG. 4 is a diagram in which a head is defined for each non-terminal symbol of the RST-DT shown in FIG. 2.
FIG. 5 is a flowchart showing the first half of the structure tree conversion processing routine in the document summarization apparatus according to the embodiment of the present invention.
FIG. 6 is a flowchart showing the second half of the structure tree conversion processing routine in the document summarization apparatus according to the embodiment of the present invention.
FIG. 7 is a flowchart showing the first half of the structure tree pruning processing routine in the document summarization apparatus according to the embodiment of the present invention.
FIG. 8 is a flowchart showing the second half of the structure tree pruning processing routine in the document summarization apparatus according to the embodiment of the present invention.
FIG. 9 is a diagram showing an example of a DEP-DT together with an example of the sentence importance and sentence length assigned to each node of that tree.
FIG. 10 is a diagram explaining the change of the importance scores in the arrays corresponding to nodes 8 to 4 (ID = 8 to 4) of the DEP-DT shown in FIG. 9.
FIG. 11 is a diagram explaining the change of the importance scores in the arrays corresponding to nodes 3 to 2 (ID = 3 to 2) of the DEP-DT shown in FIG. 9.
FIG. 12 is a diagram explaining the change of the importance scores in the array corresponding to node 1 (ID = 1) of the DEP-DT shown in FIG. 9.
FIG. 13 is a diagram explaining the change of the importance scores in the array corresponding to node 0 (ID = 0) of the DEP-DT shown in FIG. 9.
FIG. 14 is a diagram for explaining the prior art.

<Overview>
First, an outline of the embodiment of the present invention is described.

The embodiment of the present invention relates to a technique for summarizing a given document. When summarizing, the technique represents the given document as a tree whose nodes are grammatical elements (sentences, phrases, clauses) of the document and generates a summary of the document by pruning that tree. The present embodiment is described using the example in which the document is represented as a tree whose nodes are "sentences" and the summary is generated by pruning that tree.

In the present embodiment, to correctly reflect the logical structure (rhetorical structure) of the source document, the input document is treated as a discourse structure tree based on its rhetorical structure (Rhetorical Structure Theory based Discourse Tree, hereinafter RST-DT), and the summary is generated by pruning the tree without breaking its structure. However, the RST-DT is difficult to prune in its original form, so the RST-DT is first converted into a discourse structure tree based on the dependency structure (Dependency based Discourse Tree, hereinafter DEP-DT), and the summary is generated by pruning the DEP-DT.

<System configuration>
The embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram showing a document summarization apparatus 100 according to the embodiment of the present invention. The document summarization apparatus 100 is a computer that includes a CPU, a RAM, and a ROM storing a program for executing the structure tree conversion processing routine and the structure tree pruning processing routine described later. Functionally, it is configured as follows.

As shown in FIG. 1, the document summarization apparatus 100 according to the present embodiment includes an input unit 1, a computing unit 2, and an output unit 3.

The input unit 1 receives the document (text) to be summarized and an index L_max indicating the upper limit of the length (number of characters or words) of the summary (hereinafter, the length upper limit L_max). The document summarization apparatus 100 selects at least one sentence from the document received by the input unit 1 so that the result does not exceed the received length upper limit L_max, and generates a summary corresponding to the input document.

The computing unit 2 includes a dividing unit 20, a word importance database 21, an importance assigning unit 22, a rhetorical structure analysis unit 23, a rhetorical structure tree conversion unit 24, and a dependency structure tree pruning unit 25. The rhetorical structure analysis unit 23 and the rhetorical structure tree conversion unit 24 are an example of the structure tree creating means. The dependency structure tree pruning unit 25 is an example of the summary generating means.

The dividing unit 20 divides the document received by the input unit 1 into sentences. Any conventionally known technique may be used for the division into sentences, so its description is omitted.

The word importance database 21 stores in advance, for each of a plurality of words, a word importance w(t) indicating the importance of the word t. The word importance w(t) may be determined with, for example, the tf-idf method commonly used in information retrieval systems.
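The text names tf-idf only as one possible choice and does not fix a formula. As a rough sketch under that caveat, the example below assumes w(t) is the term frequency of t in the input document multiplied by log(N / df(t)), where N is the number of documents in a reference corpus and df(t) the number of those documents containing t. The function name `tfidf_weights` and the treatment of unseen words are assumptions.

```python
import math
from collections import Counter

# Hypothetical sketch: tf-idf word weights for one tokenized document,
# with document frequencies taken from a tokenized reference corpus.
def tfidf_weights(document, corpus):
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))          # count each word once per document
    tf = Counter(document)
    # Words unseen in the corpus are treated as df = 1 (an assumption).
    return {t: tf[t] * math.log(n_docs / max(df[t], 1)) for t in tf}

corpus = [["a", "b"], ["a", "c"], ["a", "b", "b"]]
print(tfidf_weights(["b", "b", "a"], corpus))
```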

For each sentence of the input document divided by the dividing unit 20, the importance assigning unit 22 determines the importance of the sentence from the word importance of each word it contains. Specifically, for each sentence s_i of the input document, it calculates the importance of s_i according to equation (1) above, using the word importance w(t) stored in the word importance database 21 for each word t contained in the sentence, and assigns that importance to the sentence.

The rhetorical structure analysis unit 23 analyzes the sentences divided by the dividing unit 20 and creates a discourse structure tree (RST-DT), as shown in FIG. 2, based on Rhetorical Structure Theory (see Reference 1: Mann, W.C. and Thompson, S.A., "Rhetorical Structure Theory: Toward a functional theory of text organization", Text & Talk, 1988, Vol. 8, No. 3, pp. 243-281 (http://www.sfu.ca/rst/)). Here, the RST-DT is a discourse structure tree based on the rhetorical structure of the sentence sequences of the input document: its root node represents the entire input document, it represents a hierarchical structure whose nodes are sentence sequences each consisting of at least one sentence of the input document, and it represents the modification relationships between sentence sequences. In FIG. 2, e is a terminal symbol representing a grammatical element of the document (for example, a sentence), and root is a virtual node representing the entire document. S (satellite) and N (nucleus) are non-terminal symbols indicating the role that a grammatical element, or a sequence of grammatical elements, plays in the document. There is also a rule that S always modifies N. In addition, a label representing the modification relationship is defined between S and N, between S and S, and between N and N. For example, e6 represents "Evidence" for e5.

The rhetorical structure analysis unit 23 creates the RST-DT from the sentences of the input document using, for example, the parsing technique described in Reference 2 (duVerle, D. and Prendinger, H., "A Novel Discourse Parser Based on Support Vector Machine Classification", Proc. of the 47th ACL, 2009, pp. 665-675). Alternatively, a parser may be built in the same manner as in Reference 2 from a corpus annotated with RST-DTs, and the RST-DT created with that parser. Although Rhetorical Structure Theory defines the grammatical elements of a document as clauses, in the present embodiment it does not matter whether they are clauses or sentences, so, as noted earlier, the following description assumes that the grammatical element is the sentence.

When the RST-DT is to be used for generating a summary, a particular problem is that the modification relationships between individual sentences are hard to read off. In the present embodiment, therefore, the rhetorical structure tree conversion unit 24 converts the RST-DT created by the rhetorical structure analysis unit 23 into a tree in which the modification relationships between sentences are explicit, namely a discourse structure tree based on the dependency structure (Dependency based Discourse Tree: DEP-DT). The DEP-DT is a discourse structure tree based on the dependency structure of the sentences of the input document, in which the most important sentence of the input document is the root node, each sentence of the input document is a node, and nodes corresponding to sentences that have a modification relationship are connected by edges. FIG. 3 shows the result of converting the RST-DT of FIG. 2 into a DEP-DT. In the DEP-DT, the relation labels between non-terminal symbols defined in the RST-DT are lost, but the modification relationships between sentences become explicit. The conversion from RST-DT to DEP-DT is performed by the following steps (0) to (2-4).

Step (0)
Define a head for every non-terminal symbol (S or N). The head is the sentence (e) that corresponds to the leftmost N among the sentences descended from that symbol. If no descendant sentence corresponds to N, the head is left undefined. FIG. 4 shows the RST-DT of FIG. 2 with a head defined for each non-terminal symbol.

Step (1-1)
When the parent of a sentence (e) is S, check whether a head is defined for its nearest ancestor. If a head is defined, the sentence (e) modifies that head sentence.

Step (1-2)
If no head is defined, trace further up the ancestors; where (1-1) applies, the sentence modifies that head sentence. If the root is reached, the sentence modifies the sentence defined as the head of the root.

Step (2-1)
If the parent of the sentence (e) is N and the nearest ancestor S has an N as a sibling, check whether a head is defined for that N.

Step (2-2)
If a head is defined, the sentence (e) modifies that head sentence.

Step (2-3)
If no head is defined, trace further up the ancestors to find the next S, and apply (2-1) and (2-2) to it.

Step (2-4)
If the root is reached, the sentence (e) modifies the sentence defined as the head of the root.
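
The conversion steps (0) to (2-4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the `Node` class, the helper names, and the four-sentence example tree are hypothetical, and the head of a non-terminal is read here as the leftmost descendant sentence whose parent is N.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    label: str                         # "N" or "S" for non-terminals, "e" for a sentence
    sent: Optional[int] = None         # sentence id, set only on sentence (leaf) nodes
    kids: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None
    head: Optional[int] = None         # step (0): head sentence id, None if undefined


def leaf(i: int) -> Node:
    return Node("e", sent=i)


def nt(label: str, *kids: Node) -> Node:
    node = Node(label, kids=list(kids))
    for k in kids:
        k.parent = node
    return node


def leaves(node: Node) -> List[Node]:
    if node.sent is not None:
        return [node]
    return [l for k in node.kids for l in leaves(k)]


def set_heads(node: Node) -> None:
    # Step (0): head = leftmost descendant sentence whose parent is N (assumed reading).
    if node.sent is not None:
        return
    for k in node.kids:
        set_heads(k)
    node.head = next((l.sent for l in leaves(node) if l.parent.label == "N"), None)


def to_dep_dt(root: Node) -> Dict[int, Optional[int]]:
    """Map each sentence id to the sentence it modifies (None for the DEP-DT root)."""
    set_heads(root)
    dep: Dict[int, Optional[int]] = {}
    for e in leaves(root):
        target = root.head             # fallback of steps (1-2) and (2-4)
        anc = e.parent.parent
        while anc is not None:
            if e.parent.label == "S":
                # Steps (1-1)/(1-2): nearest ancestor with a defined head.
                if anc.head is not None:
                    target = anc.head
                    break
            elif anc.label == "S" and anc.parent is not None:
                # Steps (2-1) to (2-3): nearest ancestor S with an N sibling that has a head.
                sib = [k for k in anc.parent.kids
                       if k is not anc and k.label == "N" and k.head is not None]
                if sib:
                    target = sib[0].head
                    break
            anc = anc.parent
        dep[e.sent] = None if target == e.sent else target
    return dep


# Hypothetical RST-DT: root( N( N(e1) S(e2) )  S( N(e3) S(e4) ) )
tree = nt("N", nt("N", nt("N", leaf(1)), nt("S", leaf(2))),
               nt("S", nt("N", leaf(3)), nt("S", leaf(4))))
dep = to_dep_dt(tree)
```

For this hypothetical tree, `dep` is `{1: None, 2: 1, 3: 1, 4: 3}`: sentence 1 becomes the DEP-DT root, sentences 2 and 3 modify sentence 1, and sentence 4 modifies sentence 3.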

The dependency structure tree pruning unit 25 prunes the DEP-DT obtained by the rhetorical structure tree conversion unit 24. Specifically, based on the sentence length l and the sentence importance of each node of the DEP-DT, the dependency structure tree pruning unit 25 prunes the DEP-DT so as to obtain, among the subtrees that contain the root node corresponding to the most important sentence, the subtree whose total sentence length is at most the upper limit L_max and whose total importance (importance score) is maximal. It then selects the sentences corresponding to the nodes of the pruned subtree and generates the summary of the input document.

More specifically, based on the sentence length l and the sentence importance of each node of the DEP-DT, the dependency structure tree pruning unit 25 processes the nodes of the DEP-DT bottom-up, starting from the leaf nodes. For each node and for each length L not exceeding the upper limit L_max, it solves a knapsack problem to find, among the subtrees rooted at that node, the subtree whose total sentence length is at most L and whose total importance (importance score) is maximal. Then, among the subtrees obtained in this way that contain the root node of the DEP-DT, it takes the one with the maximum importance score, selects the sentences corresponding to its nodes, and generates the summary of the input document.
For example, the DEP-DT pruning algorithm comprises the following steps [0] to [2-3].

Step [0]
For every node of the DEP-DT converted by the rhetorical structure tree conversion unit 24, prepare an array of length L_max + 1 and initialize the importance scores of all its elements to zero. Element i (0 ≦ i ≦ L_max) of a node's array stores the maximum summary score over the subtrees that contain the node and whose total length is at most i.

Step [1]
Express the DEP-DT converted by the rhetorical structure tree conversion unit 24 as an S-expression. Taking the nodes one at a time from the right-hand side of the S-expression as the target node, determine the importance score of each element of the target node's array by the following steps [2-1] to [2-3].

Step [2-1]
If the target node has no child node and the length l of its sentence satisfies l ≦ L_max, set the elements of the target node's array with subscripts l to L_max to the importance v of the sentence.

Step [2-2]
If the target node has child nodes, select any one child node and extract the elements with subscripts 0 to L_max − l from that child's array, where l is the length of the target node's sentence. The extracted array is called the base array.

Step [2-2-1]
For each of the other child nodes, perform the following steps [2-2-2] to [2-2-4].

Step [2-2-2]
Extract the elements with subscripts 0 to L_max − l from the child node's array, and extract items from the values stored in the extracted array. There are as many items as there are distinct non-zero values in the extracted array. For example, if the extracted array is [0, 1, 1, 2, 3], it contains an item of length 1 and importance score 1, an item of length 3 and importance score 2, and an item of length 4 and importance score 3.
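
The item extraction of step [2-2-2] can be illustrated with a short Python sketch. The function name `extract_items` is hypothetical; each item is taken at the smallest subscript at which a given non-zero score first appears, which matches the example array above.

```python
from typing import List, Tuple


def extract_items(arr: List[int]) -> List[Tuple[int, int]]:
    """Return one (length, score) item per distinct non-zero score in arr."""
    items = []
    seen = set()
    for length, score in enumerate(arr):
        if score != 0 and score not in seen:
            seen.add(score)
            items.append((length, score))   # subscript = total length of the item
    return items


items = extract_items([0, 1, 1, 2, 3])
```

With the example array, `items` is `[(1, 1), (3, 2), (4, 3)]`.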

Step [2-2-3]
For each extracted item, solve the knapsack problem with the base array and that item. This yields as many arrays, each with subscripts 0 to L_max − l, as there are extracted items.

Step [2-2-4]
For each subscript, take the maximum of the elements at that subscript across the arrays obtained, build an array of these maxima, and overwrite the base array with it.

Step [2-3]
Add the length and the importance score of the target node's sentence to the base array; that is, shift its elements by the length l and add the importance v to each.
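
Steps [2-2-3], [2-2-4], and [2-3] are simple array operations, sketched below in Python. The helper names (`merge_item`, `elementwise_max`, `add_self`) are hypothetical; the inputs are the base arrays and the item described for nodes 3 and 6 in the operation example, with L_max = 10.

```python
from typing import List, Tuple


def merge_item(base: List[int], item: Tuple[int, int]) -> List[int]:
    """Step [2-2-3]: knapsack over the base array and one (length, score) item."""
    length, score = item
    return [max(b, base[j - length] + score) if j >= length else b
            for j, b in enumerate(base)]


def elementwise_max(arrays: List[List[int]]) -> List[int]:
    """Step [2-2-4]: per-subscript maximum over the arrays from step [2-2-3]."""
    return [max(column) for column in zip(*arrays)]


def add_self(base: List[int], l: int, v: int, l_max: int) -> List[int]:
    """Step [2-3]: shift the base array by l and add the node's own importance v."""
    out = [0] * (l_max + 1)
    for j, b in enumerate(base):
        out[j + l] = b + v
    return out


# Node 3: base array from node 5 (length 1, importance 2), item (3, 3) from node 4.
base = [0, 2, 2, 2, 2, 2, 2, 2, 2]
merged = elementwise_max([merge_item(base, (3, 3))])

# Node 6 (length 2, importance 1) on top of node 7's base array (length 1, importance 3).
node6 = add_self([0, 3, 3, 3, 3, 3, 3, 3, 3], l=2, v=1, l_max=10)
```

Here `merged` is `[0, 2, 2, 3, 5, 5, 5, 5, 5]` and `node6` is `[0, 0, 1, 4, 4, 4, 4, 4, 4, 4, 4]`, so node 6's array holds the two items (length 2, score 1) and (length 3, score 4).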

With the above algorithm, every node of the DEP-DT obtains an array that stores, for each length i (0 ≦ i ≦ L_max), the maximum summary score of the subtrees whose total length is at most i. The maximum value is stored in element L_max of the root node's array, so tracing back the history of how that value was computed yields the subtree with the maximum sum of sentence importance under the length constraint L_max, and the summary is obtained from that subtree.

<Operation of the document summarization apparatus>
Next, the operation of the document summarization apparatus 100 according to the present embodiment is described. When a document to be summarized and the upper limit L_max of the summary length are input to the document summarization apparatus 100, the apparatus executes the structure tree conversion processing routine shown in FIGS. 5 and 6.

First, in step S100, the document input through the input unit 1 is received. In step S102, the dividing unit 20 divides the document received in step S100 into sentences.

Next, in step S104, the importance assigning unit 22 assigns an importance to each sentence obtained in step S102 according to formula (1) above, based on the word importance of each word stored in the word importance database 21.

Then, in step S106, the rhetorical structure analysis unit 23 analyzes the rhetorical structure of the sentence sequence obtained in step S102 and creates an RST-DT.

In step S108, the rhetorical structure tree conversion unit 24 sets one non-terminal node (corresponding to S or N in FIG. 2) of the RST-DT created in step S106 as the node to be processed.

Next, in step S110, for the node set in step S108, the sentence (corresponding to the terminal symbols e1 to e10 in FIG. 2) that corresponds to the leftmost non-terminal symbol N among its descendants is defined as its head.

Then, in step S112, it is determined whether steps S108 to S110 have been executed for all non-terminal nodes of the RST-DT. If a non-terminal node remains unprocessed, the process returns to step S108 and sets that node as the node to be processed. If all non-terminal nodes have been processed, the process proceeds to step S114.

Next, in step S114, one node corresponding to a sentence of the RST-DT (corresponding to the terminal symbols e1 to e10 in FIG. 2) is set as the sentence node to be processed.

Next, in step S116, the parent node of the sentence node set in step S114 is checked to determine whether it is a node of the non-terminal symbol S. If it is, the process proceeds to step S118. If it is not (that is, if it is a node of the non-terminal symbol N), the process proceeds to step S122.

Next, in step S118, the ancestors of the parent node of the sentence node being processed are searched for the nearest ancestor for which a head is defined.

Next, in step S120, the head defined for the ancestor found in step S118 is set as the sentence that the sentence of the sentence node set in step S114 modifies. If the search in step S118 reaches the root node of the RST-DT, the head defined for the root node is set as the modified sentence.

In step S122, the ancestors of the non-terminal N node that is the parent of the sentence node being processed are traced to search for the nearest ancestor non-terminal S node that has, as a sibling, a non-terminal N node for which a head is defined.

Then, in step S124, it is determined whether such an S node was found in step S122. If it was found, the process proceeds to step S126; if not, the process proceeds to step S128.

In step S126, the head defined for the non-terminal N node that is the sibling of the S node found in step S122 is set as the sentence that the sentence of the sentence node set in step S114 modifies.

In step S128, the head defined for the root node is set as the sentence that the sentence of the sentence node set in step S114 modifies.

In step S130, it is determined whether steps S114 to S128 have been executed for all sentence nodes corresponding to terminal symbols of the RST-DT. If a sentence node remains unprocessed, the process returns to step S114 and sets that sentence node as the node to be processed. If all sentence nodes have been processed, the process proceeds to step S132.

Then, in step S132, a DEP-DT is created by connecting the sentence nodes with edges according to the modification relationships obtained in steps S120, S126, and S128.

Then, in step S134, the DEP-DT created in step S132 is output as the result.

After the structure tree conversion processing routine has converted the RST-DT of the document to be summarized into a DEP-DT, the document summarization apparatus 100 executes the structure tree pruning processing routine shown in FIGS. 7 and 8.

First, in step S200, an array of length L_max + 1 is prepared for every node of the DEP-DT produced by the structure tree conversion processing routine. In each array, the subscript corresponds to the sentence length l, and each element stores an importance score.

Next, in step S202, all the arrays prepared in step S200 are initialized.

Then, in step S204, the DEP-DT produced by the structure tree conversion processing routine is expressed as an S-expression.

Next, in step S206, one node is set as the target node, taking the nodes in order from the right of the S-expression obtained in step S204.

Next, in step S208, it is determined whether the length l of the target node set in step S206 satisfies l ≦ L_max. If it does, the process proceeds to step S210. If it does not, the process returns to step S206 and sets the next node as the target node.

In step S210, the importance v of the sentence corresponding to the target node is stored in each element of the target node's array with subscripts l to L_max.

In step S212, it is determined whether the target node set in step S206 has a child node. If it does, the process proceeds to step S214; if it does not, the process proceeds to step S234.

Next, in step S214, one child node of the target node set in step S206 is selected. In step S216, the elements with subscripts 0 to L_max − l of the array of the child node selected in step S214 are set as the base array.

Then, in step S218, it is determined whether the target node set in step S206 has other child nodes. If it does, the process proceeds to step S220; if it does not, the process proceeds to step S232.

Next, in step S220, one of the other child nodes is selected. Then, in step S222, the elements with subscripts 0 to L_max − l are extracted from the array of the child node selected in step S220.

Then, in step S224, items are extracted from the elements with subscripts 0 to L_max − l extracted in step S222.

Then, in step S226, for each item extracted in step S224, the knapsack problem is solved with the base array set in step S216 and that item, and an array is created for each item.

Next, in step S228, an array holding the maximum value at each subscript across the arrays created in step S226 is created, and the base array is overwritten with it.

In step S230, it is determined whether the target node has yet another child node. If it does, the process proceeds to step S220; if it does not, the process proceeds to step S232.

Next, in step S232, the importance score of each element of the target node's array set in step S210 is added to the base array obtained in step S216 or step S228, and the target node's array is updated with the result.

Then, in step S234, it is determined whether steps S206 to S232 have been executed for all nodes of the S-expression of the DEP-DT obtained in step S204. If a node remains unprocessed, the process returns to step S206 and sets that node as the target node. If all nodes have been processed, the process proceeds to step S236.

Next, in step S236, the history of how the importance score stored in the element with subscript L_max of the root node's array was calculated is traced back. The nodes of the DEP-DT that were used to calculate that importance score are kept and the other nodes are pruned, which yields a subtree.

Then, in step S238, a summary is created based on the subtree obtained in step S236.

Then, in step S240, the output unit 3 outputs the summary created in step S238, and the structure tree pruning processing routine ends.

<Operation example>
Next, an actual operation example of the document summarization apparatus according to the present embodiment is described.
The DEP-DT shown in FIG. 9 is used as the example, and it is assumed to have already been converted from an RST-DT. The table on the right side of FIG. 9 shows the importance (V) and the length l of each node (sentence) of the DEP-DT. The summary length constraint (the upper limit L_max of the length) is L_max = 10.

Expressed as an S-expression, the DEP-DT of FIG. 9 is as follows.

(0 (1 (2) (3 (4) (5))) (6 (7)) (8))

Accordingly, the target node is set in the order of node numbers 8 to 0, and an array of length 10 + 1 is prepared for each node.
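
The processing order 8 to 0 follows mechanically from reading the S-expression from the right. The Python sketch below shows one way to obtain it; `parse_sexpr` is a hypothetical helper, and it assumes that node labels are non-negative integers.

```python
import re
from typing import Dict, List, Tuple


def parse_sexpr(text: str) -> Tuple[Dict[int, List[int]], List[int]]:
    """Return (children map, processing order) for an S-expression of integer nodes."""
    children: Dict[int, List[int]] = {}
    preorder: List[int] = []
    stack: List[int] = []              # path from the root to the current node
    for tok in re.findall(r"\(|\)|\d+", text):
        if tok == "(":
            continue
        if tok == ")":
            stack.pop()                # close the current node
            continue
        node = int(tok)
        children[node] = []
        if stack:
            children[stack[-1]].append(node)
        stack.append(node)
        preorder.append(node)
    # Reading from the right visits every node after all of its descendants.
    return children, preorder[::-1]


children, order = parse_sexpr("(0 (1 (2) (3 (4) (5))) (6 (7)) (8))")
```

For this S-expression, `order` is `[8, 7, 6, 5, 4, 3, 2, 1, 0]`, and `children[0]` is `[1, 6, 8]`, so each node's array is complete before its parent is processed.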

The importance score stored in each element of each array is then determined by the following procedure.

First, node 8 is set as the target node. As shown in FIG. 9, node 8 has no child node, so 2, the importance of node 8, is stored in elements 1 through 10 of its array (step [2-1]; see FIG. 10).

Next, node 7 is set as the target node. As shown in FIG. 9, node 7 has no child node, so 3, the importance of node 7, is stored in elements 1 through 10 of its array (step [2-1]; see FIG. 10).

Next, node 6 is set as the target node. Node 6 has child nodes, so one of them is selected; its only child is node 7, so node 7 is selected. The length l of node 6 is 2, so elements 0 through 8 of node 7's array are extracted as the base array (step [2-2]; see FIG. 10).

Node 6 has no child node other than node 7, so the length l and the importance score of node 6 itself are added to the elements (elements 0 through 8) of the base array extracted in the previous step (step [2-3]; see FIG. 10).
Here, "the importance score of node 6 itself" is an array in which 1, the importance of node 6, is stored in elements 2 through 10 of node 6's array (see the middle row (2) of FIG. 10). "Adding the length l of node 6 itself" means that the extracted elements (elements 0 through 8) are shifted by the length l (= 2) of node 6 before they are added.

Next, node 5 is set as the target node. Node 5 has no child node, so 2, the importance of node 5, is stored in elements 1 through 10 of its array (step [2-1]; see FIG. 10).

Next, node 4 is set as the target node. Node 4 has no child node, so 3, the importance of node 4, is stored in elements 3 through 10 of its array (step [2-1]; see FIG. 10).

Next, node 3 is set as the target node. Node 3 has child nodes, so one of them is selected. Its children are nodes 4 and 5; node 5 is selected here. The length l of node 3 is 2, so elements 0 through 8 of node 5's array are extracted as the base array (step [2-2]; see FIG. 11).

The child of node 3 other than node 5 is node 4, so elements 0 through 8 of node 4's array are extracted. The only non-zero importance score they contain is 3, so the array of node 4 (restricted to subscripts 0 to 8) holds a single item of length 3 and importance score 3 (node 4 itself) (steps [2-2-1] and [2-2-2]; see FIG. 11).

The knapsack problem is solved with the base array and the item extracted in the previous step to create an array, and the base array is overwritten with it (step [2-2-3]; see FIG. 11(3), the fourth row for ID = 3).

The length l and the importance score of node 3 itself are added to the base array (step [2-3]; see FIG. 11(5), the seventh row for ID = 3).

Next, node 2 is set as the target node. Node 2 has no child node, so 4, the importance score of node 2, is stored in elements 2 through 10 of its array (step [2-1]; see FIG. 11).

Next, node 1 is set as the target node. Node 1 has child nodes, so one of them is selected. Its children are nodes 2 and 3; node 3 is selected here. The length l of node 1 is 1, so elements 0 through 9 of node 3's array are extracted as the base array (step [2-2]; see FIG. 12).

The child of node 1 other than node 3 is node 2, so elements 0 through 9 of node 2's array are extracted. The only non-zero importance score they contain is 4, so the array of node 2 (restricted to subscripts 0 to 9) holds a single item of length 2 and importance score 4 (node 2 itself) (steps [2-2-1] and [2-2-2]; see FIG. 12).

The knapsack problem is solved with the base array and the item extracted in the previous step to create an array, and the base array is overwritten with it (step [2-2-3]; see FIG. 12, the fourth row for ID = 1).

The length l and the importance score of node 1 itself are added to the base array (step [2-3]; see FIG. 12, the seventh row for ID = 1).

Next, node 0 is set as the target node. Node 0 has child nodes, so one of them is selected. Its children are nodes 1, 6, and 8; node 1 is selected here. The length of node 0 is 3, so elements 0 through 7 of node 1's array are extracted as the base array (step [2-2]; see FIGS. 13(1) and (4)).

The children of node 0 other than node 1 are nodes 6 and 8. First, elements 0 through 7 of node 6's array are extracted. The non-zero importance scores they contain are 1 and 4, so the array of node 6 (restricted to subscripts 0 to 7) holds two items: one of length 2 and importance score 1, and one of length 3 and importance score 4 (steps [2-2-1] and [2-2-2]; see FIGS. 13(2) and (5)).

The knapsack problem is solved with the base array and the first item extracted above (length 2, importance score 1) to create an array (step [2-2-3]; see FIG. 13(3)).

The knapsack problem is solved with the base array and the other item extracted above (length 3, importance score 4) to create another array (step [2-2-3]; see FIG. 13(6)).

The base array is overwritten with an array that records, for each element, the maximum of the two arrays created in the previous two steps (step [2-2-4]; see FIG. 13(7)).

Elements 0 through 7 of the array of node 8, the remaining child of node 0, are extracted. The only non-zero importance score they contain is 2, so the array of node 8 (restricted to subscripts 0 to 7) holds a single item of length 1 and importance score 2 (node 8 itself) (steps [2-2-1] and [2-2-2]; see FIG. 13(8)).

The knapsack problem is solved with the base array and the item extracted in the previous step to create an array, and the base array is overwritten with it (step [2-2-3]; see FIG. 13(9)).

The length l and the importance score of node 0 itself are added to the base array (step [2-3]; see FIGS. 13(10) and (11)).

The above procedure determines the array elements of every node. The maximum value is always stored at the root node. In this example the value is 14, and the summary is obtained by tracing back the history that produced it. Here, the sentences corresponding to the nodes of the subtree consisting of nodes 0, 1, 2, 6, 7, and 8 (length 10, importance score 14) are selected from the input document, and the summary is generated and output.
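
The whole operation example can be reproduced with a compact tree-knapsack sketch in Python. It is an illustration under assumptions rather than the routine of FIGS. 7 and 8: the lengths of all nodes and the importance of nodes 2, 4, 5, 6, 7, and 8 are taken from the text above, but the importance of nodes 0, 1, and 3 appears only in FIG. 9, so the values 2, 2, and 1 used here are assumed. The sketch also merges each child's array directly instead of extracting items, and it stores the selected nodes in every cell instead of tracing a history.

```python
from typing import Dict, List, Tuple

Entry = Tuple[int, Tuple[int, ...]]    # (importance score, selected node ids)

# Tree given by the S-expression (0 (1 (2) (3 (4) (5))) (6 (7)) (8)).
children: Dict[int, List[int]] = {0: [1, 6, 8], 1: [2, 3], 3: [4, 5], 6: [7]}
length = {0: 3, 1: 1, 2: 2, 3: 2, 4: 3, 5: 1, 6: 2, 7: 1, 8: 1}
# Importance of nodes 0, 1, and 3 is NOT stated in the text; 2, 2, and 1 are assumed.
value = {0: 2, 1: 2, 2: 4, 3: 1, 4: 3, 5: 2, 6: 1, 7: 3, 8: 2}


def node_table(node: int, l_max: int) -> List[Entry]:
    """table[i] = best (score, nodes) over subtrees rooted at node with length <= i."""
    l, v = length[node], value[node]
    empty: Entry = (0, ())
    table = [empty] * (l_max + 1)
    cap = l_max - l                    # budget left for descendants
    if cap < 0:
        return table                   # the node itself does not fit
    base = [empty] * (cap + 1)         # best selection among children processed so far
    for child in children.get(node, []):
        ctab = node_table(child, l_max)
        merged = list(base)
        for j in range(cap + 1):
            for k in range(1, j + 1):  # give k units of length to this child
                score = base[j - k][0] + ctab[k][0]
                if score > merged[j][0]:
                    merged[j] = (score, base[j - k][1] + ctab[k][1])
        base = merged
    for i in range(l, l_max + 1):      # shift by l and add the node's own importance
        score, nodes = base[i - l]
        table[i] = (score + v, (node,) + nodes)
    return table


best_score, best_nodes = node_table(0, 10)[10]
```

Under these assumed values, `best_score` is 14 and `best_nodes` contains nodes 0, 1, 2, 6, 7, and 8, with a total length of 10, which agrees with the result stated above.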

As described above, the document summarization apparatus according to the present embodiment creates a discourse structure tree based on the dependency structure of the sentences of the input document; finds, based on the sentence length and sentence importance of each node, the subtree that contains the root node, whose total sentence length is at most the upper limit, and whose total importance is maximal; and selects the sentences corresponding to the nodes of that subtree to generate the summary of the input document. A summary that correctly reflects the logical structure of the input document can thereby be created.

また、本実施の形態に係る文書要約装置を用いることで、長さ制約（長さの上限Ｌ_ｍａｘ）のもと、文書の論理構造を崩すことなく、文重要度の和が最大とする文の組合せ、すなわち、要約を生成することができるようになる。 In addition, by using the document summarizing apparatus according to the present embodiment, it becomes possible to generate, under the length constraint (upper limit of length L_max) and without breaking the logical structure of the document, the combination of sentences that maximizes the sum of sentence importance, that is, a summary.

なお、本発明は、上述した実施形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 Note that the present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the gist of the present invention.

例えば、本実施の形態では、要約対象を文書とした場合を例に挙げて説明したが、文書ではなく「文」をＤＥＰ−ＤＴとして表せば、同様に一文要約も可能となる。 For example, in the present embodiment, the case where the summarization target is a document has been described as an example. However, if “sentence” is represented as DEP-DT instead of the document, single-sentence summarization is similarly possible.

また、本実施の形態では、文書中の「文」をＤＥＰ−ＤＴの各ノードとした場合を例に説明したが、各ノードを文以外の文字列単位として表わすこともできる。その場合には、分割部２０によって、文書を「文字列単位」（文法的な要素（句、節など））に分割し、当該「文字列単位」をノードとした木としてＤＥＰ−ＤＴを表現する。 In the present embodiment, the case where each “sentence” in the document is a node of the DEP-DT has been described as an example, but each node can also be a character string unit other than a sentence. In that case, the dividing unit 20 divides the document into “character string units” (grammatical elements such as phrases and clauses), and the DEP-DT is expressed as a tree whose nodes are these “character string units”.

また、本実施の形態に係る文書要約装置は、日本語だけでなく英語等の外国語にも適用可能である。その場合には、ピリオドを手がかりとして分割部２０によって分割し、文の長さｌに関しては、単語数を用いれば良い。 Further, the document summarizing apparatus according to the present embodiment is applicable not only to Japanese but also to other languages such as English. In that case, the dividing unit 20 divides the document into sentences using periods as cues, and the number of words may be used as the sentence length l.
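
For English input, the period-based division and the word-count length described above might look like the following Python sketch. The function names `split_sentences` and `sentence_length` are hypothetical, and the rule is deliberately naive: it is an assumption for illustration only, and abbreviations such as "e.g." or decimal numbers followed by a space would be split incorrectly.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split text into sentences at whitespace that follows a period."""
    parts = re.split(r"(?<=\.)\s+", text.strip())
    return [p for p in parts if p]


def sentence_length(sentence: str) -> int:
    """Use the number of whitespace-separated words as the length l."""
    return len(sentence.split())


sentences = split_sentences("The cat sat. It purred loudly. ")
lengths = [sentence_length(s) for s in sentences]
```

The resulting lengths can then be used wherever the description refers to the sentence length l, with L_max expressed in words rather than characters.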

また、単語重要度データベース２１は、外部に設けられ、文書要約装置とネットワークで接続されていてもよい。 The word importance database 21 may be provided outside and connected to the document summarization apparatus via a network.

また、入力部１に入力される文書は、既に文又は文字列単位に分割された形態であってもよい。 The document input to the input unit 1 may be in a form that has already been divided into sentences or character strings.

また、重要度付与部２２は、上記（１）式に基づいて、各文又は各文字列単位に重要度を付与する場合を例に説明したが、これに限定されるものではなく、他の方法によって各文又は各文字列単位に重要度を付与してもよい。 Although the case where the importance assigning unit 22 assigns an importance to each sentence or each character string unit based on formula (1) above has been described as an example, the invention is not limited to this, and the importance may be assigned to each sentence or each character string unit by another method.

また、本実施の形態では、要約対象の入力文書を表すＲＳＴ−ＤＴを変換してＤＥＰ−ＤＴを得る場合を例に説明したが、要約対象の入力文書を表すＤＥＰ−ＤＴを得るために、必ずしも、要約対象の入力文書を表すＲＳＴ−ＤＴが必要ではない。例えば、学習用文書を表すＲＳＴ−ＤＴのアノテーション済みコーパスをＤＥＰ−ＤＴに変換したものを学習データとして、文書を入力として受け取り、直接ＤＥＰ−ＤＴを出力する解析器を構築することも可能である。この場合には、要約対象の入力文書を入力として、解析器を用いて、入力文書を表わすＤＥＰ−ＤＴを作成することができる。 In the present embodiment, the case where the DEP-DT is obtained by converting an RST-DT representing the input document to be summarized has been described as an example. However, an RST-DT representing the input document is not necessarily required in order to obtain a DEP-DT representing it. For example, an annotated RST-DT corpus of training documents can be converted into DEP-DTs and used as training data to build an analyzer that receives a document as input and outputs a DEP-DT directly. In that case, the DEP-DT representing the input document can be created by giving the input document to be summarized to the analyzer.

上述の文書要約装置は、内部にコンピュータシステムを有しているが、「コンピュータシステム」は、ＷＷＷシステムを利用している場合であれば、ホームページ提供環境（あるいは表示環境）も含むものとする。 The document summarization apparatus described above has a computer system inside, but the “computer system” includes a homepage providing environment (or display environment) if a WWW system is used.

また、本願明細書中において、プログラムが予めインストールされている実施形態として説明したが、当該プログラムを、コンピュータ読み取り可能な記録媒体に格納して提供することも可能である。 In the present specification, the embodiment has been described in which the program is installed in advance. However, the program can be provided by being stored in a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
１ 入力部 1 Input unit
２ 演算部 2 Computation unit
３ 出力部 3 Output unit
２０ 分割部 20 Dividing unit
２１ 単語重要度データベース 21 Word importance database
２２ 重要度付与部 22 Importance assigning unit
２３ 修辞構造解析部 23 Rhetorical structure analysis unit
２４ 修辞構造木変換部 24 Rhetorical structure tree conversion unit
２５ 依存構造木刈り込み部 25 Dependency structure tree pruning unit
１００ 文書要約装置 100 Document summarizing apparatus

Claims

A document summarization method in a document summarization apparatus that includes structure tree creating means and summary generating means and that generates a summary corresponding to an input document by selecting at least one predetermined character string unit from the input document such that the total length is equal to or less than a predetermined upper limit of length, the method comprising:
a step in which the structure tree creating means creates, based on the result of dividing the input document into the character string units, a discourse structure tree based on a dependency structure of the character string units of the input document, in which the most important character string unit of the input document is the root node, each character string unit of the input document is a node, and nodes corresponding to character string units having a modification relationship are connected by an edge; and
a step in which the summary generating means obtains, based on the length and the importance of the character string unit corresponding to each node of the discourse structure tree based on the dependency structure created by the structure tree creating means, from among the subtrees that include the root node of the discourse structure tree based on the dependency structure, a subtree for which the total length of the character string units corresponding to its nodes is equal to or less than the upper limit of length and the total importance is maximized, selects from the input document the character string unit corresponding to each node of the obtained subtree, and generates a summary corresponding to the input document,
wherein the step of creating the discourse structure tree based on the dependency structure by the structure tree creating means includes:
a step in which rhetorical structure analyzing means creates a discourse structure tree based on a rhetorical structure of the character string unit sequences of the input document, which represents a hierarchical structure in which the root node represents the whole of the input document and each node is a sequence of character string units consisting of at least one character string unit of the input document, and which represents the modification relationships between the character string unit sequences; and
a step in which rhetorical structure tree converting means converts the discourse structure tree based on the rhetorical structure created by the rhetorical structure analyzing means into the discourse structure tree based on the dependency structure.

The document summarization method according to claim 1, wherein, in the step of generating a summary, the summary generating means, based on the length and the importance of the character string unit corresponding to each node of the discourse structure tree based on the dependency structure created by the structure tree creating means, obtains by solving a knapsack problem, for each node of the discourse structure tree based on the dependency structure in bottom-up order from the leaf nodes and for each length equal to or less than the upper limit of length, the subtree that, among the subtrees formed with that node as the root node, has a total length of the character string units corresponding to its nodes equal to or less than that length and has the maximum total importance, and, for the subtree having the maximum total importance among the obtained subtrees that include the root node of the discourse structure tree based on the dependency structure, selects from the input document the character string unit corresponding to each node of that subtree and generates a summary corresponding to the input document.

A document summarization apparatus that generates a summary corresponding to an input document by selecting at least one predetermined character string unit from the input document such that the total length is equal to or less than a predetermined upper limit of length, the apparatus comprising:
structure tree creating means for creating, based on the result of dividing the input document into the character string units, a discourse structure tree based on a dependency structure of the character string units of the input document, in which the most important character string unit of the input document is the root node, each character string unit of the input document is a node, and nodes corresponding to character string units having a modification relationship are connected by an edge; and
summary generating means for obtaining, based on the length and the importance of the character string unit corresponding to each node of the discourse structure tree based on the dependency structure created by the structure tree creating means, from among the subtrees that include the root node of the discourse structure tree based on the dependency structure, a subtree for which the total length of the character string units corresponding to its nodes is equal to or less than the upper limit of length and the total importance is maximized, selecting from the input document the character string unit corresponding to each node of the obtained subtree, and generating a summary corresponding to the input document,
wherein the structure tree creating means includes:
rhetorical structure analyzing means for creating a discourse structure tree based on a rhetorical structure of the character string unit sequences of the input document, which represents a hierarchical structure in which the root node represents the whole of the input document and each node is a sequence of character string units consisting of at least one character string unit of the input document, and which represents the modification relationships between the character string unit sequences; and
rhetorical structure tree converting means for converting the discourse structure tree based on the rhetorical structure created by the rhetorical structure analyzing means into the discourse structure tree based on the dependency structure.

The document summarization apparatus according to claim 3, wherein the summary generating means, based on the length and the importance of the character string unit corresponding to each node of the discourse structure tree based on the dependency structure created by the structure tree creating means, obtains by solving a knapsack problem, for each node of the discourse structure tree based on the dependency structure in bottom-up order from the leaf nodes and for each length equal to or less than the upper limit of length, the subtree that, among the subtrees formed with that node as the root node, has a total length of the character string units corresponding to its nodes equal to or less than that length and has the maximum total importance, and, for the subtree having the maximum total importance among the obtained subtrees that include the root node of the discourse structure tree based on the dependency structure, selects from the input document the character string unit corresponding to each node of that subtree and generates a summary corresponding to the input document.

A program for causing a computer to execute each step of the document summarization method according to claim 1.