JP6021079B2

JP6021079B2 - Document summarization apparatus, method, and program

Info

Publication number: JP6021079B2
Application number: JP2014045656A
Authority: JP
Inventors: 平尾　努; 悠太菊池; 学奥村; 大也高村
Original assignee: Nippon Telegraph and Telephone Corp; Tokyo Institute of Technology NUC
Current assignee: Nippon Telegraph and Telephone Corp; Tokyo Institute of Technology NUC
Priority date: 2014-03-07
Filing date: 2014-03-07
Publication date: 2016-11-02
Anticipated expiration: 2034-03-07
Also published as: JP2015170224A

Description

The present invention relates to a document summarization apparatus, method, and program for summarizing a given document.

Conventional computer-based document summarization methods extract grammatical elements from a document, for example contiguous word strings such as sentences or the clauses and phrases contained in sentences, so as to maximize an objective function, typically the sum of the importance of the extracted elements. It is also known that summary quality improves when the extracted grammatical elements are not treated as a mere set but the parent-child relations between them, that is, the rhetorical structure, are taken into account (see, for example, Non-Patent Document 1).

In the technique described in Non-Patent Document 1, the grammatical elements of a document are clauses, and the document is represented as a rhetorical structure tree whose nodes are clauses. Document summarization is then formulated as a combinatorial optimization problem that extracts, as the summary, a rooted subtree whose sum of clause importance is maximal and whose length is at most L_max. The summary length is, for example, the number of words or characters in the summary.

Tsutomu Hirao, Yasuhisa Yoshida, Masaaki Nishino, Norihito Yasuda and Masaaki Nagata, "Single-Document Summarization as a Tree Knapsack Problem", Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1515-1520, 2013.

In conventional summarization techniques, grammatical elements such as sentences or clauses are fixed in advance, and the document is represented as a rhetorical structure tree expressing the parent-child relations between those elements, so that a summary capturing the logical structure of the original document can be generated. In this case, the summary is generated by extracting predefined contiguous word strings, such as sentences or clauses, from the document. When the grammatical element is the sentence, the summary is generated by selecting sentences from the original document. However, when the summary length constraint L_max is tight (small), the number of sentences that can be extracted becomes extremely small, and the information coverage of the summary drops. When the grammatical element is the clause, the summary is generated by selecting clauses from the original document. Because a clause is a smaller unit than a sentence, information coverage is higher than with sentences, but readability suffers.

The present invention has been made in view of the above circumstances, and its object is to provide a document summarization apparatus, method, and program that can improve the information coverage of a summary while preserving readability.

To achieve the above object, a document summarization apparatus according to the present invention comprises: a rhetorical structure analysis unit that generates a rhetorical structure tree representing a document containing a plurality of sentences, based on the rhetorical relations between the sentences; a dependency analysis unit that generates, for each of the sentences, a dependency tree representing the sentence, based on the dependency relations between the words in the sentence; and a generation unit that, based on a predetermined maximum summary length, the importance of each word, the rhetorical structure tree, and the dependency trees, extracts at most one subtree from each dependency tree so as to satisfy a predetermined condition, and generates a summary of the document from the extracted subtrees.

In the document summarization apparatus according to the present invention, the rhetorical structure analysis unit generates a rhetorical structure tree representing a document containing a plurality of sentences, based on the rhetorical relations between the sentences. The dependency analysis unit generates, for each of the sentences, a dependency tree representing the sentence, based on the dependency relations between the words in the sentence. The generation unit then, based on a predetermined maximum summary length, the importance of each word, the rhetorical structure tree, and the dependency trees, extracts at most one subtree from each dependency tree so as to satisfy a predetermined condition, and generates a summary of the document from the extracted subtrees.

Thus, by representing the document as a nested structure of a rhetorical structure tree and dependency trees, and generating the summary by extracting at most one subtree from each dependency tree, the information coverage of the summary can be improved while preserving readability.

The predetermined condition can be defined such that the summary length is at most the maximum length, the rhetorical relations represented by the rhetorical structure tree and the dependency relations represented by the dependency trees are preserved, and the sum of the importance of the words in the summary is maximized. Because neither the logical structure of the document nor the grammaticality of its sentences is impaired, a more readable summary can be generated.

The predetermined condition can also be defined as maximizing the objective function given by equation (13) below, subject to the constraints given by equations (1) to (12) below. This formalizes the document summarization problem.

Here, i is the identification number of a sentence; N is the total number of sentences in the document; j is the identification number of a word; M(i) is the total number of words in sentence i; x_i is a decision variable that is 1 when sentence i is included in the summary; z_ij is a decision variable that is 1 when word j of sentence i is included in the summary; w_ij is the importance of word j of sentence i; r_ij is a decision variable that is 1 when word j of sentence i is the root of the subtree; L_max is the maximum length; parent(i) is a function returning the identification number of the parent sentence of sentence i in the rhetorical structure tree; parent(i, j) is a function returning the identification number of the parent word of word j in the dependency tree representing sentence i; a is a predetermined coefficient; R_c(i) is a function returning the set of identification numbers of the words that are root candidates in the dependency tree representing sentence i; root(i) is a function returning the identification number of the word that is the true root of the dependency tree representing sentence i; sub(i) is a function returning the set of identification numbers of the words that are subjects in sentence i; and obj(i) is a function returning the set of identification numbers of the words that are objects in sentence i.

A document summarization method according to the present invention is performed in a document summarization apparatus that includes a rhetorical structure analysis unit, a dependency analysis unit, and a generation unit. The method comprises: a step in which the rhetorical structure analysis unit generates a rhetorical structure tree representing a document containing a plurality of sentences, based on the rhetorical relations between the sentences; a step in which the dependency analysis unit generates, for each of the sentences, a dependency tree representing the sentence, based on the dependency relations between the words in the sentence; and a step in which the generation unit, based on a predetermined maximum summary length, the importance of each word, the rhetorical structure tree, and the dependency trees, extracts at most one subtree from each dependency tree so as to satisfy a predetermined condition, and generates a summary of the document from the extracted subtrees.

A document summarization program according to the present invention is a program for causing a computer to function as each of the units constituting the above document summarization apparatus.

As described above, the document summarization apparatus, method, and program of the present invention generate a summary by extracting at most one subtree from each dependency tree so as to satisfy a predetermined condition, based on a predetermined maximum summary length, the importance of each word, a rhetorical structure tree representing the document according to the rhetorical relations between its sentences, and dependency trees each representing a sentence according to the dependency relations between its words. This improves the information coverage of the summary while preserving readability.

A functional block diagram of the document summarization apparatus according to the present embodiment.
A schematic diagram showing an example of a rhetorical structure tree and a dependency tree.
A flowchart showing an example of the document summarization processing routine in the present embodiment.
A schematic diagram showing an example of document summarization in the present embodiment.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing a document summarization apparatus 10 according to an embodiment of the present invention. The document summarization apparatus 10 can be implemented by a computer comprising a CPU, a RAM, and a ROM storing a program for executing the document summarization processing routine described later.

Functionally, as shown in FIG. 1, the computer constituting the document summarization apparatus 10 can be represented as comprising: a sentence splitting unit 12 that splits an input document into sentences; a word splitting unit 14 that splits each sentence into words; a word importance assignment unit 16 that assigns an importance to each word; a rhetorical structure analysis unit 18 that analyzes the rhetorical relations between sentences and generates a rhetorical structure tree representing the document; a dependency analysis unit 20 that analyzes the dependency relations between words and generates a dependency tree representing each sentence; and a generation unit 22 that generates a summary of the document based on the summary length, the importance of each word, the rhetorical structure tree, and the dependency trees. Each unit is described in detail below.

The sentence splitting unit 12 receives a document (text data) containing a plurality of sentences that has been input to the document summarization apparatus 10, marks sentence boundaries in the document, and splits the document into sentences. An existing sentence splitter can be used to identify the sentence boundaries. Alternatively, sentence boundaries may simply be placed at full stops.

The word splitting unit 14 receives each sentence produced by the sentence splitting unit 12, marks word boundaries in the sentence, and splits it into words. An existing morphological analyzer can be used to identify the word boundaries. When the input document has explicit word boundaries, as in English text, the words may instead be split at those explicit boundaries.
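The two simple fallbacks mentioned above (splitting sentences at full stops and splitting words at explicit boundaries) can be sketched as follows. This is only an illustrative sketch: the function names `split_sentences` and `split_words` are hypothetical, and a practical system would more likely rely on an existing sentence splitter and morphological analyzer, as the description says.

```python
import re

def split_sentences(text):
    """Naive splitter: cut after each sentence-final mark (。 . ! ?).

    Assumption: every such mark ends a sentence, so abbreviations and
    decimal numbers are split incorrectly.
    """
    pieces = re.findall(r"[^。.!?]+[。.!?]?", text)
    return [p.strip() for p in pieces if p.strip()]

def split_words(sentence):
    """Whitespace tokenizer for text with explicit word breaks (e.g. English)."""
    return sentence.split()

sentences = split_sentences("It rained all day. We stayed home.")
words = [split_words(s) for s in sentences]
```

Punctuation stays attached to the neighbouring word here; a morphological analyzer would normally separate it.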

The word importance assignment unit 16 receives the words produced by the word splitting unit 14, refers to a word importance database (DB) 30, and assigns an importance to each word in the input document. The word importance DB 30 stores a plurality of words, each associated with its importance. The importance of each word stored in the word importance DB 30 can be defined using, for example, the tf-idf (term frequency - inverse document frequency) method commonly used in information retrieval systems.
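As a rough illustration of how tf-idf importances could be computed, the sketch below scores the words of one document against a small reference corpus. The function name `tf_idf`, the toy corpus, and the smoothed idf formula log((1 + N) / (1 + df)) are assumptions chosen for the example; the description does not fix a particular tf-idf variant.

```python
import math
from collections import Counter

def tf_idf(doc_words, corpus):
    """Return {word: importance} for doc_words.

    doc_words: list of words of the document to be summarized.
    corpus:    list of word lists used to estimate document frequencies.
    Uses raw term frequency and a smoothed idf (an assumed variant).
    """
    tf = Counter(doc_words)
    n_docs = len(corpus)
    weights = {}
    for word, freq in tf.items():
        df = sum(1 for other in corpus if word in other)  # document frequency
        weights[word] = freq * math.log((1 + n_docs) / (1 + df))
    return weights

corpus = [["a", "b"], ["a", "c"]]
weights = tf_idf(["a", "b", "b"], corpus)
```

With this variant, a word that occurs in every reference document receives importance 0, while a word that is frequent in the target document but rare in the corpus receives a high importance.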

The rhetorical structure analysis unit 18 receives the sentences produced by the sentence splitting unit 12 and analyzes the rhetorical relations between them. For example, an RST (Rhetorical Structure Theory) tree can be generated with a rhetorical structure parser (reference: duVerle, D. and Prendinger, H., "A Novel Discourse Parser Based on Support Vector Machine Classification", Proc. of the 47th ACL, pp. 665-675, 2009), after which the rhetorical relations between sentences can be obtained by applying, for example, the rules described in Non-Patent Document 1. Generating an RST tree is not strictly necessary: the rhetorical relations between sentences may instead be analyzed with a parser trained on rhetorical structure tree data that expresses inter-sentence rhetorical relations.

Based on the analyzed rhetorical relations, the rhetorical structure analysis unit 18 generates a rhetorical structure tree in which each sentence is a node and the nodes of sentences in a rhetorical relation are connected. Of two sentences in a rhetorical relation, one is the parent and the other the child, so the nodes are connected such that the node of the parent sentence is the parent node and the node of the child sentence is the child node. The left side of FIG. 2 shows an example of a rhetorical structure tree. There, the sentences s_1, s_2, ..., s_8 of the input document are represented by nodes s_1, s_2, ..., s_8 (circles in the figure). The nodes of sentences in a parent-child relation are connected by arrows (edges) whose tail is the child node and whose head is the parent node.

The dependency analysis unit 20 receives the words produced by the word splitting unit 14 and analyzes the dependency relations between them. An existing dependency parser can be used for this analysis. Based on the analyzed dependency relations, the dependency analysis unit 20 generates a dependency tree in which each word is a node and the nodes of words in a dependency relation are connected. Of two words in a dependency relation, one is the parent and the other the child, so the nodes are connected such that the node of the parent word is the parent node and the node of the child word is the child node. The right side of FIG. 2 shows an example of a dependency tree. There, the words w_1, w_2, ..., w_12 of sentence s_8, which corresponds to node s_8 of the rhetorical structure tree on the left side of FIG. 2, are represented by nodes w_1, w_2, ..., w_12 (circles in the figure). The nodes of words in a parent-child relation are connected by arrows (edges) whose tail is the child node and whose head is the parent node.
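One way to hold the two kinds of tree described above is a pair of parent maps: one over sentences and, for each sentence, one over words. The sketch below is a minimal illustration; the class names `DependencyTree` and `NestedTree`, the 0-based ids, and the toy sentences are assumptions and do not appear in the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class DependencyTree:
    words: List[str]
    parent: Dict[int, Optional[int]]  # word id -> parent word id (None for the true root)

    def true_root(self) -> int:
        # The single word that has no parent heads the whole sentence.
        return next(j for j, p in self.parent.items() if p is None)

@dataclass
class NestedTree:
    sentences: List[DependencyTree]   # one dependency tree per sentence
    parent: Dict[int, Optional[int]]  # sentence id -> parent sentence id (None for the root)

    def children(self, i: int) -> List[int]:
        return sorted(c for c, p in self.parent.items() if p == i)

s0 = DependencyTree(words=["cats", "sleep"], parent={0: 1, 1: None})
s1 = DependencyTree(words=["dogs", "bark", "loudly"], parent={0: 1, 1: None, 2: 1})
doc = NestedTree(sentences=[s0, s1], parent={0: None, 1: 0})
```

Storing only parent pointers mirrors the child-to-parent arrows of FIG. 2 and is enough to evaluate functions such as parent(i) and parent(i, j) used in the formulation.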

The generation unit 22 receives an input summary length L_max. The summary length is a parameter constraining the length of the summary; here, L_max is the maximum number of words in the summary. The summary length may instead be counted in characters. L_max need not be received as input; a value stored in advance, for example in the ROM, may be read out instead. The generation unit 22 also receives the rhetorical structure tree generated by the rhetorical structure analysis unit 18 and the dependency trees generated by the dependency analysis unit 20. It further receives the importance assigned to each word by the word importance assignment unit 16.

Based on the summary length L_max, the importance of each word, the rhetorical structure tree, and the dependency trees, the generation unit 22 extracts subtrees from the dependency trees such that the number of words in the summary is at most L_max, the parent-child relations between words and between sentences are preserved, and the sum of the importance of the words in the summary is maximized. The generation unit 22 then generates the summary from the set of words corresponding to the nodes of the extracted subtrees.

Here, preserving the parent-child relations between words means that the parent-child relations between the words in the summary retain the structure of the corresponding part of the dependency tree generated by the dependency analysis unit 20. Likewise, preserving the parent-child relations between sentences means that the parent-child relations between the sentences in the summary retain the structure of the corresponding part of the rhetorical structure tree generated by the rhetorical structure analysis unit 18. In other words, viewed at the sentence level, the structure among the summary's sentences appears as part of the rhetorical structure tree; viewed at the word level, the structure among the summary's words appears as part of a dependency tree.

When extracting subtrees, the generation unit 22 extracts at most one subtree from each dependency tree. This means that two or more subtrees are never extracted from a single sentence, and that there may be dependency trees from which no subtree is extracted. For example, if the subtree (node w_1, node w_2, node w_3) is extracted from the dependency tree of sentence s_8 shown on the right side of FIG. 2, no further subtree is extracted from that dependency tree. Multiple subtrees could in principle be extracted from the dependency tree of a complex or compound sentence, but each sentence is treated as carrying a single meaning, so extracting multiple subtrees from one dependency tree is prohibited. As a result, subtrees can be extracted from more sentences within the limit L_max, which improves the information coverage of the summary.
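The "at most one subtree" rule can be checked on a set of selected words by counting how many of them have no selected parent: each such word is the root of a separate extracted subtree. The sketch below is an illustration under assumptions; the helper names `subtree_roots` and `is_single_subtree`, the 0-based word ids, and the toy parent map are hypothetical and are not taken from FIG. 2.

```python
def subtree_roots(selected, parent):
    """Return the selected word ids whose parent is not selected.

    parent maps word id -> parent word id, with None for the true root,
    so a selected true root always counts as a subtree root.
    """
    return sorted(j for j in selected if parent[j] not in selected)

def is_single_subtree(selected, parent):
    """True when the selection forms at most one subtree of the dependency tree."""
    return len(subtree_roots(selected, parent)) <= 1

# Toy dependency tree: words 0 and 1 depend on 2; 3 depends on 4; 2 and 4 depend on 5.
parent = {0: 2, 1: 2, 2: 5, 3: 4, 4: 5, 5: None}
one = subtree_roots({0, 1, 2}, parent)
two = subtree_roots({0, 1, 2, 3, 4}, parent)
```

In the second selection the words 2 and 4 both lack a selected parent, so the selection consists of two subtrees and would be rejected.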

In the rhetorical structure tree, the generation unit 22 also extracts a subtree from the dependency tree corresponding to the parent node of any node whose dependency tree has had a subtree extracted. Because following the parent of each node in the rhetorical structure tree eventually leads to the root node, the set of nodes whose dependency trees have had subtrees extracted forms a rooted subtree of the rhetorical structure tree. By making the set of sentences that contain the summary's words a rooted subtree of the rhetorical structure tree in this way, the logical structure of the document (for example, an introduction, development, turn, and conclusion structure) is reflected in the summary, which improves the information coverage and accuracy of the summary.
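The rooted-subtree property amounts to the selected sentence set being closed under the parent relation: whenever a sentence is selected, so is its parent. A minimal sketch of that check follows; the function name `is_rooted_subtree` and the toy parent map are assumptions made for illustration.

```python
def is_rooted_subtree(selected, parent):
    """True when every selected sentence has its parent selected too.

    parent maps sentence id -> parent sentence id, with None for the root.
    A non-empty selection that passes this check contains the root and is
    connected, i.e. it is a rooted subtree of the rhetorical structure tree.
    """
    return all(parent[i] is None or parent[i] in selected for i in selected)

# Toy rhetorical structure tree: 1 and 2 are children of 0; 3 is a child of 1.
parent = {0: None, 1: 0, 2: 0, 3: 1}
ok = is_rooted_subtree({0, 1, 3}, parent)
bad = is_rooted_subtree({1, 3}, parent)
```

The selection {1, 3} fails because sentence 1 is selected without its parent, sentence 0.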

In short, in the summary generated by the generation unit 22, the set of sentences making up the summary is represented by a rooted subtree of the rhetorical structure tree, and the word string of each sentence in the summary is represented by a subtree of a dependency tree.

This summary generation problem is formulated as an integer programming problem that maximizes the objective function given by equation (13) below, subject to the constraints given by equations (1) to (12) below.
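The equations themselves are not reproduced in this text. One possible reading of the objective (13) and of constraints (1) to (10), reconstructed only from the verbal descriptions of each constraint and of the decision variables in this description, is sketched below. It is an assumption and may differ from the published equations in form, for example in whether an inequality or an equality is used. Constraints (11) and (12), which involve sub(i) and obj(i), and the definition of a(i) are not reconstructed here.

```latex
\begin{align}
\max \quad & \sum_{i=1}^{N} \sum_{j=1}^{M(i)} w_{ij}\, z_{ij} \tag{13} \\
\text{s.t.} \quad
& \sum_{i=1}^{N} \sum_{j=1}^{M(i)} z_{ij} \le L_{\max} \tag{1} \\
& x_{\mathrm{parent}(i)} \ge x_i && \forall i \tag{2} \\
& z_{i\,\mathrm{parent}(i,j)} - z_{ij} + r_{ij} \ge 0 && \forall i, j \tag{3} \\
& x_i \ge z_{ij} && \forall i, j \tag{4} \\
& \sum_{j=1}^{M(i)} z_{ij} \ge a(i)\, x_i && \forall i \tag{5} \\
& \sum_{j=1}^{M(i)} r_{ij} \le 1 && \forall i \tag{6} \\
& \sum_{j \notin R_c(i)} r_{ij} = 0 && \forall i \tag{7} \\
& r_{ij} \le z_{ij} && \forall i, j \tag{8} \\
& r_{ij} + z_{i\,\mathrm{parent}(i,j)} \le 1 && \forall i, j \tag{9} \\
& z_{i\,\mathrm{root}(i)} \le r_{i\,\mathrm{root}(i)} && \forall i \tag{10}
\end{align}
```

In this reading, (3) lets a word be kept without its parent only when the word is marked as a subtree root, and (9) then forbids keeping the parent of a marked root, which matches the description of how the r_ij term and equation (9) work together.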

Here, i is a variable denoting the position of a sentence in the input document (its ordinal position from the start of the document) and serves as the identification number of the sentence (sentence id). N is the total number of sentences in the document. j is a variable denoting the position of a word in the i-th sentence (its ordinal position from the start of the sentence) and serves as the identification number of the word (word id). M(i) is the total number of words in the i-th sentence. x_i is a decision variable that is 1 when the i-th sentence is included in the summary. z_ij is a decision variable that is 1 when the j-th word of the i-th sentence is included in the summary. w_ij is the importance of the j-th word of the i-th sentence, the value assigned by the word importance assignment unit 16. r_ij is a decision variable that is 1 when the node corresponding to the j-th word of the i-th sentence is the root node of the subtree extracted from the dependency tree.

Equation (1) is a constraint guaranteeing that the number of words in the summary is at most L_max. Equation (2) is a constraint on the parent-child relations between sentences: it guarantees that when the i-th sentence is extracted into the summary, its parent sentence is also included in the summary. parent(i) is a function returning the sentence id of the sentence corresponding to the parent node of the node of the i-th sentence in the rhetorical structure tree.

Equation (3) is a constraint on the parent-child relations between words: it guarantees that when the j-th word of the i-th sentence is included in the summary, its parent word is also included in the summary. parent(i, j) is a function returning the word id of the word corresponding to the parent node of the node of the j-th word in the dependency tree representing the i-th sentence. Note that the r_ij term, written together with the constraint of equation (9), guarantees that the word at the parent node is left out of the summary only when the subtree rooted at the node of the j-th word of the i-th sentence is extracted.

Equation (4) is a constraint guaranteeing that when the j-th word of the i-th sentence is included in the summary, the i-th sentence is included in the summary. Equation (5) is a constraint preventing a sentence from being extracted without any of its words being extracted. Here, a(i) is defined by the following equation.

len(i) is a function returning the number of words in the i-th sentence. Introducing a ensures that short sentences (here, 10 words or fewer) are not partially extracted but are included in the summary in their entirety. This is because extracting only some of the words of a short sentence markedly reduces readability.

Expressions (6) to (9) are constraints that make it possible to extract a subtree rooted at an arbitrary node. Expression (6) guarantees that at most one subtree is extracted from the single dependency tree representing a sentence. Expression (7) guarantees that no subtree is extracted whose root node is anything other than a predetermined root node candidate. Here, R_c(i) is a function that returns the set of word ids of the words corresponding to root node candidates in the dependency tree representing the i-th sentence. A root node candidate can be, for example, a node corresponding to a word whose part of speech in the sentence is a verb. Expression (8) guarantees that when the node corresponding to a word is extracted as the root node of a subtree, that word is always included in the summary. Expression (9) guarantees that the parent word of the word corresponding to the root node of the extracted subtree is not included in the summary.
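The four root-related constraints can be read as simple checks on the r variables. The sketch below assumes the same hypothetical dictionary encoding as before (`z`, `r`, `word_parent`) plus a `root_candidates` map standing in for R_c(i); the function name and the exact form of each check are illustrative readings of the prose, not the patent's expressions.

```python
from collections import Counter


def satisfies_root_constraints(z, r, word_parent, root_candidates):
    """Check constraints (6)-(9) as described in prose.

    z[(i, j)], r[(i, j)] -> 0/1 selections for word j of sentence i
    word_parent[(i, j)]  -> parent word id in the dependency tree (None for the true root)
    root_candidates[i]   -> set of word ids allowed to act as a subtree root (R_c(i))
    """
    # (6) At most one subtree root per sentence.
    roots_per_sentence = Counter(i for (i, _), rij in r.items() if rij)
    if any(count > 1 for count in roots_per_sentence.values()):
        return False
    for (i, j), rij in r.items():
        if not rij:
            continue
        # (7) Only predetermined candidates may be subtree roots.
        if j not in root_candidates.get(i, set()):
            return False
        # (8) A subtree root must itself be in the summary.
        if not z.get((i, j), 0):
            return False
        # (9) The parent of a subtree root must not be in the summary.
        p = word_parent.get((i, j))
        if p is not None and z.get((i, p), 0):
            return False
    return True
```

With verbs as root candidates, a selection rooted at an embedded verb passes these checks only if the verb's governing word is left out, which is what allows a clause to be lifted out of a longer sentence.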

Expression (10) is a constraint guaranteeing that, when the true root node of a dependency tree is not extracted as the root node of the subtree, the word corresponding to the true root node is not included in the summary. The true root node is the root node of the entire dependency tree; it is called the "true root node" to distinguish it from the root node of a subtree. In the example on the right of FIG. 2, node w_12 is the true root node. Here, root(i) is a function that returns the word id of the word corresponding to the true root node in the dependency tree representing the i-th sentence.

Expression (11) is a constraint requiring that, when a sentence contains a subject (SUB), at least one of the subjects in the sentence is included in the summary. Here, sub(i) is a function that returns, based on the parent-child relationships between words, the set of word ids of the words that are subjects in the i-th sentence. Expression (12) is a constraint requiring that, when a sentence contains an object (OBJ), at least one of the objects in the sentence is included in the summary. Here, obj(i) is a function that returns, based on the parent-child relationships between words, the set of word ids of the words that are objects in the i-th sentence.
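Constraints (10) to (12) can be sketched in the same style. The function name, the dictionary encoding, and the `true_root`, `subjects`, and `objects` maps (standing in for root(i), sub(i), and obj(i)) are hypothetical. The sketch also assumes that (11) and (12) apply only to sentences actually selected for the summary; the prose does not state this explicitly.

```python
def satisfies_constraints_10_to_12(x, z, r, true_root, subjects, objects):
    """Check constraints (10)-(12) as described in prose.

    x[i]         -> 1 if sentence i is in the summary
    z[(i, j)]    -> 1 if word j of sentence i is in the summary
    r[(i, j)]    -> 1 if word j of sentence i is the root of the extracted subtree
    true_root[i] -> word id of the true root of sentence i (root(i))
    subjects[i]  -> set of subject word ids of sentence i (sub(i)); may be empty
    objects[i]   -> set of object word ids of sentence i (obj(i)); may be empty
    """
    for i, xi in x.items():
        root = true_root[i]
        # (10) The true root word may appear only if it is the subtree root.
        if z.get((i, root), 0) and not r.get((i, root), 0):
            return False
        if not xi:
            continue  # assumption: (11) and (12) bind only selected sentences
        # (11), (12) If the sentence has subjects / objects, keep at least one of each.
        for required in (subjects.get(i, set()), objects.get(i, set())):
            if required and not any(z.get((i, j), 0) for j in required):
                return False
    return True
```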

Next, the operation of the document summarization apparatus 10 according to the present embodiment will be described. When a document to be summarized and the summary length L_max, which is a parameter, are input to the document summarization apparatus 10, the apparatus executes the document summarization processing routine shown in FIG. 3.

In step S10, the sentence dividing unit 12 receives the input document, assigns sentence boundaries to it, and divides the document into sentences. Next, in step S12, the word dividing unit 14 receives each sentence produced by the sentence dividing unit 12, assigns word boundaries to it, and divides each sentence into words. Next, in step S14, the word importance assigning unit 16 receives each word produced by the word dividing unit 14, refers to the word importance DB 30, and assigns an importance to each word.

Next, in step S16, the rhetorical structure analysis unit 18 receives each sentence produced by the sentence dividing unit 12 and analyzes the rhetorical relationships between sentences using, for example, a known rhetorical structure analyzer. The rhetorical structure analysis unit 18 then represents each sentence as a node and, based on the rhetorical relationships between sentences, connects the nodes of sentences that are in a parent-child relationship, thereby generating a rhetorical structure tree representing the document.

Next, in step S18, the dependency analysis unit 20 receives each word produced by the word dividing unit 14 and analyzes the dependency relationships between words using, for example, a known dependency analyzer. The dependency analysis unit 20 then represents each word as a node and, based on the dependency relationships between words, connects the nodes of words that are in a parent-child relationship, thereby generating a dependency tree for each sentence.

Next, in step S20, the generation unit 22 extracts at most one subtree from each dependency tree, based on the input summary length L_max, the importance of each word assigned by the word importance assigning unit 16, the rhetorical structure tree generated by the rhetorical structure analysis unit 18, and each dependency tree generated by the dependency analysis unit 20. The extraction is performed so that the number of words in the summary is at most L_max, the parent-child relationships between words and between sentences are preserved, and the sum of the importance of the words in the summary is maximized. Next, in step S22, the generation unit 22 generates a summary from the set of subtrees extracted from the dependency trees, outputs the generated summary, and ends the summary generation processing routine.
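The selection in step S20 can be illustrated on toy inputs by exhaustive search. The sketch below is not the optimizer used by the generation unit 22: it swaps in brute-force enumeration, which is exponential and only suitable for a handful of short sentences. All names (`rooted_subtrees`, `summarize`, the nested dictionaries) are hypothetical, and the minimum-length constraint (5) and the subject/object constraints (11) and (12) are omitted for brevity.

```python
from itertools import combinations, product


def descendants(root, children):
    """Return all nodes strictly below `root` in the tree given by `children`."""
    out, stack = [], list(children.get(root, []))
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(children.get(node, []))
    return out


def rooted_subtrees(root, word_parent):
    """Yield every word set that contains `root` and, for each other member, its parent."""
    children = {}
    for word, parent in word_parent.items():
        children.setdefault(parent, []).append(word)
    below = descendants(root, children)
    for k in range(len(below) + 1):
        for extra in combinations(below, k):
            selected = {root, *extra}
            if all(word_parent[w] in selected for w in extra):
                yield frozenset(selected)


def summarize(sent_parent, word_parents, importance, root_candidates, l_max):
    """Pick at most one rooted subtree per sentence, maximizing total importance."""
    sent_ids = sorted(sent_parent)
    options = []
    for i in sent_ids:
        opts = [frozenset()]  # the empty set means the sentence is left out
        for root in root_candidates[i]:
            opts.extend(rooted_subtrees(root, word_parents[i]))
        options.append(opts)
    best_score, best = float("-inf"), None
    for choice in product(*options):
        picked = dict(zip(sent_ids, choice))
        if sum(len(words) for words in choice) > l_max:
            continue  # length limit
        if any(
            picked[i] and sent_parent[i] is not None and not picked[sent_parent[i]]
            for i in sent_ids
        ):
            continue  # a selected sentence needs its parent sentence
        score = sum(importance[i][w] for i in sent_ids for w in picked[i])
        if score > best_score:
            best_score, best = score, picked
    return best_score, best
```

Because every candidate is a subtree hanging below its chosen root, the true root of a sentence can appear only when it is itself the chosen root, so the behavior described for constraints (9) and (10) holds by construction.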

Here, an example of a summary generated by the document summarization apparatus 10 according to the present embodiment is described for a document including the following sentences 1 to 4.

Sentence 1: A Japanese apple is cropping up in orchards the way Hondas did on U.S. roads.
Sentence 2: It is called the Fuji.
Sentence 3: Some fruit visionaries say the Fuji could someday tumble the Red Delicious.
Sentence 4: But the apple industry is ripe for change.

As shown in FIG. 4, the document is represented by a rhetorical structure tree in which the node for sentence 1 is the root node, the nodes for sentences 2 and 3 are child nodes of the node for sentence 1, and the node for sentence 4 is a child node of the node for sentence 3. In addition, each sentence is represented by a dependency tree whose nodes are the words in that sentence. In the example of FIG. 4, the dependency tree representing each sentence is drawn inside the node for that sentence.

From this structure, subtrees are extracted under the constraint on the number of words so that the parent-child relationships between sentences and between words are preserved and the sum of the importance of the words is maximized. In FIG. 4, the nodes that make up the extracted subtrees, and the nodes for the sentences containing an extracted subtree, are drawn with bold borders. The following summary is then generated from the extracted subtrees.

A Japanese apple is cropping up in orchards. The Fuji could someday tumble the Red Delicious. But the apple industry is ripe for change.
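The example summary draws on sentences 1, 3, and 4, which is consistent with the rhetorical structure tree of FIG. 4: sentence 4 depends on sentence 3, and sentence 3 on sentence 1. A minimal sketch of that check follows; the parent map is transcribed from the description of FIG. 4, while the name `respects_rhetorical_structure` is hypothetical.

```python
# Parent sentence of each sentence in the FIG. 4 rhetorical structure tree.
SENT_PARENT = {1: None, 2: 1, 3: 1, 4: 3}


def respects_rhetorical_structure(selected, sent_parent):
    """Return True if every selected sentence has its parent sentence selected too."""
    return all(
        sent_parent[i] is None or sent_parent[i] in selected
        for i in selected
    )
```

For instance, selecting sentences 1 and 4 without sentence 3 would fail this check, because sentence 4 would be left without its parent.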

As described above, the document summarization apparatus according to the present embodiment represents a document as a rhetorical structure tree expressing the rhetorical relationships between sentences, and represents each sentence as a dependency tree expressing the dependency relationships between words. The document is thereby treated as a nested structure consisting of a tree whose nodes are sentences and trees whose nodes are words. The apparatus then extracts at most one subtree from each dependency tree, without impairing the rhetorical structure of the document represented by the rhetorical structure tree, so that the number of words in the summary is at most L_max and the sum of the importance of the words in the summary is maximized, and generates a summary from the extracted subtrees. Because neither the logical structure of the document nor the grammaticality of its sentences is impaired, readability is ensured. In addition, extracting at most one subtree from each dependency tree improves the information coverage of the summary.

Note that when a document that already has sentence boundaries and word boundaries is input to the document summarization apparatus, the sentence dividing unit 12 and the word dividing unit 14 may be omitted.

The document summarization apparatus described above contains a computer system; when a WWW system is used, the term "computer system" also includes the homepage providing environment (or display environment).

Further, although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium.

10 Document summarization apparatus
12 Sentence dividing unit
14 Word dividing unit
16 Word importance assigning unit
18 Rhetorical structure analysis unit
20 Dependency analysis unit
22 Generation unit
30 Word importance DB

Claims

A document summarization apparatus comprising:
a rhetorical structure analysis unit that generates a rhetorical structure tree representing a document including a plurality of sentences, based on rhetorical relationships between the sentences in the document;
a dependency analysis unit that generates a dependency tree representing each of the plurality of sentences, based on dependency relationships between words in each of the sentences; and
a generation unit that, based on a predetermined maximum summary length, the importance of each of the words, the rhetorical structure tree, and each of the dependency trees, extracts at most one subtree from each of the dependency trees so as to satisfy a predetermined condition, and generates a summary of the document based on the extracted subtrees.

The document summarization apparatus according to claim 1, wherein the predetermined condition is that the length of the summary is less than or equal to the maximum length, the rhetorical relationships represented by the rhetorical structure tree and the dependency relationships represented by the dependency trees are not impaired, and the sum of the importance of the words included in the summary is maximized.

The document summarization apparatus according to claim 1, wherein the predetermined condition is to maximize an objective function represented by the following equation (13) under the constraints represented by the following equations (1) to (12).

Here, i is the identification number of a sentence; N is the total number of sentences included in the document; j is the identification number of a word; M(i) is the total number of words included in the sentence with identification number i; x_i is a decision variable that is 1 when the sentence with identification number i is included in the summary; z_ij is a decision variable that is 1 when the word with identification number j in the sentence with identification number i is included in the summary; w_ij is the importance of the word with identification number j in the sentence with identification number i; r_ij is a decision variable that is 1 when the word with identification number j in the sentence with identification number i is the root of a subtree; L_max is the maximum length; parent(i) is a function that returns the identification number of the parent sentence of the sentence with identification number i in the rhetorical structure tree; parent(i, j) is a function that returns the identification number of the parent word of the word with identification number j in the dependency tree representing the sentence with identification number i; a is a predetermined coefficient; R_c(i) is a function that returns the set of identification numbers of the words that are root candidates in the dependency tree representing the sentence with identification number i; root(i) is a function that returns the identification number of the word that is the true root in the dependency tree representing the sentence with identification number i; sub(i) is a function that returns the set of identification numbers of the words that are subjects in the sentence with identification number i; and obj(i) is a function that returns the set of identification numbers of the words that are objects in the sentence with identification number i.

A document summarization method in a document summarization apparatus including a rhetorical structure analysis unit, a dependency analysis unit, and a generation unit, the method including:
a step in which the rhetorical structure analysis unit generates a rhetorical structure tree representing a document including a plurality of sentences, based on rhetorical relationships between the sentences in the document;
a step in which the dependency analysis unit generates a dependency tree representing each of the plurality of sentences, based on dependency relationships between words in each of the sentences; and
a step in which the generation unit, based on a predetermined maximum summary length, the importance of each of the words, the rhetorical structure tree, and each of the dependency trees, extracts at most one subtree from each of the dependency trees so as to satisfy a predetermined condition, and generates a summary of the document based on the extracted subtrees.

A document summarization program for causing a computer to function as each part constituting the document summarization apparatus according to any one of claims 1 to 3.