JP2015170224A - Document summarizing device, method and program - Google Patents

Publication number
JP2015170224A
JP2015170224A JP2014045656A JP2014045656A JP2015170224A JP 2015170224 A JP2015170224 A JP 2015170224A JP 2014045656 A JP2014045656 A JP 2014045656A JP 2014045656 A JP2014045656 A JP 2014045656A JP 2015170224 A JP2015170224 A JP 2015170224A
Authority
JP
Japan
Prior art keywords
sentence
word
document
identification number
dependency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2014045656A
Other languages
Japanese (ja)
Other versions
JP6021079B2 (en)
Inventor
Tsutomu Hirao (平尾 努)
Yuta Kikuchi (菊池 悠太)
Manabu Okumura (奥村 学)
Daiya Takamura (高村 大也)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Tokyo Institute of Technology NUC
Original Assignee
Nippon Telegraph and Telephone Corp
Tokyo Institute of Technology NUC
Application filed by Nippon Telegraph and Telephone Corp, Tokyo Institute of Technology NUC
Priority to JP2014045656A
Publication of JP2015170224A
Application granted
Publication of JP6021079B2
Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PROBLEM TO BE SOLVED: To increase the information coverage of a summary while maintaining readability. SOLUTION: A rhetorical structure analysis unit 18 generates a rhetorical structure tree representing a document containing a plurality of sentences, based on the rhetorical relationships between the sentences. A dependency analysis unit 20 generates a dependency tree for each of the sentences, based on the dependency relationships between the words in that sentence. A generation unit 22 extracts at most one subtree from each dependency tree so as to satisfy a predetermined condition, based on a predetermined maximum summary length, the importance of each word assigned by a word importance assigning unit 16, the rhetorical structure tree, and the dependency trees, and generates a summary of the document from the extracted subtrees.

Description

The present invention relates to a document summarization apparatus, method, and program for summarizing a given document.

Conventional computer-based document summarization methods extract grammatical elements from a document, for example continuous word strings such as sentences, or clauses or phrases contained in sentences, so as to maximize an objective function, namely the sum of the importance of the extracted elements. It is also known that summary quality improves when the extracted grammatical elements are not treated as a mere set but the parent-child relationships between them, that is, the rhetorical structure, are taken into account (see, for example, Non-Patent Document 1).

In the technique described in Non-Patent Document 1, the grammatical elements of a document are clauses, and the document is represented as a rhetorical structure tree whose nodes are clauses. Document summarization is then formulated as a combinatorial optimization problem that extracts, as the summary, a rooted subtree for which the sum of clause importance is maximal and the summary length is at most L_max. The summary length is, for example, the number of words or the number of characters in the summary.

Tsutomu Hirao, Yasuhisa Yoshida, Masaaki Nishino, Norihito Yasuda and Masaaki Nagata, "Single-Document Summarization as a Tree Knapsack Problem", Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1515-1520, 2013.

In conventional summarization techniques, grammatical elements such as sentences or clauses are fixed in advance, and the document is represented as a rhetorical structure tree expressing the parent-child relationships between those elements, which makes it possible to generate a summary that captures the logical structure of the original document. In this case, the summary is generated by extracting continuous word strings from the document, namely the predefined sentences or clauses. When the grammatical element is the sentence, the summary is generated by selecting, from the sentences of the original document, the sentences to be included in the summary. However, when the summary length constraint L_max is tight (small), the number of sentences that can be extracted becomes extremely small, and the information coverage of the summary drops. When the grammatical element is the clause, on the other hand, the summary is generated by selecting, from the clauses of the original document, the clauses to be included in the summary. Because a clause is a smaller unit than a sentence, the information coverage of the summary is higher than when sentences are used, but readability suffers.

The present invention has been made in view of the above circumstances, and its object is to provide a document summarization apparatus, method, and program that can improve the information coverage of a summary while preserving readability.

To achieve the above object, a document summarization apparatus according to the present invention includes: a rhetorical structure analysis unit that generates a rhetorical structure tree representing a document containing a plurality of sentences, based on the rhetorical relationships between the sentences in the document; a dependency analysis unit that generates dependency trees, each representing one of the plurality of sentences, based on the dependency relationships between the words in each sentence; and a generation unit that, based on a predetermined maximum summary length, the importance of each word, the rhetorical structure tree, and each of the dependency trees, extracts at most one subtree from each dependency tree so as to satisfy a predetermined condition, and generates a summary of the document based on the extracted subtrees.

In the document summarization apparatus according to the present invention, the rhetorical structure analysis unit generates a rhetorical structure tree representing a document containing a plurality of sentences, based on the rhetorical relationships between the sentences. The dependency analysis unit generates a dependency tree representing each sentence, based on the dependency relationships between the words in that sentence. The generation unit then, based on the predetermined maximum summary length, the importance of each word, the rhetorical structure tree, and each of the dependency trees, extracts at most one subtree from each dependency tree so as to satisfy a predetermined condition, and generates a summary of the document based on the extracted subtrees.

By representing the document as a nested structure of a rhetorical structure tree and dependency trees, and generating the summary by extracting at most one subtree from each dependency tree, the information coverage of the summary can be improved while readability is preserved.

The predetermined condition can be defined such that the summary length is at most the maximum length, the rhetorical relationships represented by the rhetorical structure tree and the dependency relationships represented by the dependency trees are preserved, and the sum of the importance of the words included in the summary is maximized. Because neither the logical structure of the document nor the grammaticality of its sentences is impaired, a more readable summary can be generated.

The predetermined condition can also be defined as maximizing the objective function of equation (13) below under the constraints of equations (1) to (12) below. This formulates the document summarization problem.

[Equations (1) to (13) appear as images in the original publication (Figure 2015170224) and are not reproduced here.]

Here, i is the identification number of a sentence; N is the total number of sentences in the document; j is the identification number of a word; M(i) is the total number of words in sentence i; x_i is a decision variable that is 1 when sentence i is included in the summary; z_ij is a decision variable that is 1 when word j of sentence i is included in the summary; w_ij is the importance of word j of sentence i; r_ij is a decision variable that is 1 when word j of sentence i is the root of the extracted subtree; L_max is the maximum length; parent(i) is a function that returns the identification number of the parent sentence of sentence i in the rhetorical structure tree; parent(i, j) is a function that returns the identification number of the parent word of word j in the dependency tree representing sentence i; a is a predetermined coefficient; R_c(i) is a function that returns the set of identification numbers of the words that are root candidates in the dependency tree representing sentence i; root(i) is a function that returns the identification number of the word that is the true root of the dependency tree representing sentence i; sub(i) is a function that returns the set of identification numbers of the words that are subjects in sentence i; and obj(i) is a function that returns the set of identification numbers of the words that are objects in sentence i.

A document summarization method according to the present invention is a method performed in a document summarization apparatus that includes a rhetorical structure analysis unit, a dependency analysis unit, and a generation unit. The method includes: a step in which the rhetorical structure analysis unit generates a rhetorical structure tree representing a document containing a plurality of sentences, based on the rhetorical relationships between the sentences in the document; a step in which the dependency analysis unit generates dependency trees, each representing one of the plurality of sentences, based on the dependency relationships between the words in each sentence; and a step in which the generation unit, based on a predetermined maximum summary length, the importance of each word, the rhetorical structure tree, and each of the dependency trees, extracts at most one subtree from each dependency tree so as to satisfy a predetermined condition, and generates a summary of the document based on the extracted subtrees.

A document summarization program according to the present invention is a program for causing a computer to function as each of the units constituting the above document summarization apparatus.

As described above, the document summarization apparatus, method, and program of the present invention generate a summary by extracting at most one subtree from each dependency tree so as to satisfy a predetermined condition, based on a predetermined maximum summary length, the importance of each word, a rhetorical structure tree that represents the document according to the rhetorical relationships between sentences, and dependency trees that each represent a sentence according to the dependency relationships between its words. This improves the information coverage of the summary while preserving readability.

FIG. 1 is a functional block diagram of the document summarization apparatus according to the present embodiment.
FIG. 2 is a schematic diagram showing an example of a rhetorical structure tree and a dependency tree.
FIG. 3 is a flowchart showing an example of the document summarization processing routine in the present embodiment.
FIG. 4 is a schematic diagram showing an example of a document summary in the present embodiment.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing a document summarization apparatus 10 according to an embodiment of the present invention. The document summarization apparatus 10 can be implemented as a computer that includes a CPU, a RAM, and a ROM storing a program for executing the document summarization processing routine described later.

Functionally, as shown in FIG. 1, the computer constituting the document summarization apparatus 10 can be represented as including: a sentence division unit 12 that divides an input document into sentences; a word division unit 14 that divides each sentence into words; a word importance assigning unit 16 that assigns an importance to each word; a rhetorical structure analysis unit 18 that analyzes the rhetorical relationships between sentences and generates a rhetorical structure tree representing the document; a dependency analysis unit 20 that analyzes the dependency relationships between words and generates a dependency tree representing each sentence; and a generation unit 22 that generates a summary of the document based on the summary length, the importance of each word, the rhetorical structure tree, and the dependency trees. Each unit is described in detail below.

The sentence division unit 12 receives a document (text data) containing a plurality of sentences that has been input to the document summarization apparatus 10, marks the sentence boundaries, and divides the document into sentences. An existing sentence splitter can be used to identify the boundaries. Alternatively, the boundaries may simply be placed at sentence-final punctuation marks (full stops).
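
The simple punctuation-based option can be sketched as follows. This is a minimal illustration, not the sentence splitter used in the embodiment: the function name `split_sentences` and the choice of terminator characters are assumptions, and the rule is naive (it would also cut inside a decimal number such as "3.14").

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, cutting after each sentence-final mark."""
    # A sentence is taken to be a run of non-terminator characters followed
    # by an optional terminator: the Japanese full stop or ASCII ".", "!", "?".
    pieces = re.findall(r"[^。.!?]+[。.!?]?", text)
    # Drop surrounding whitespace and any empty fragments.
    return [p.strip() for p in pieces if p.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("It rained. We stayed home."):
        print(sentence)
```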

The word division unit 14 receives each sentence produced by the sentence division unit 12 as input, marks the word boundaries, and divides each sentence into words. An existing morphological analyzer can be used to identify the word boundaries. When the input document has explicit word boundaries, as in English text, the words may be divided at those explicit boundaries.

The word importance assigning unit 16 receives each word produced by the word division unit 14 as input, refers to a word importance database (DB) 30, and assigns an importance to each word in the input document. The word importance DB 30 stores a plurality of words, each associated with its importance. The importance of each word stored in the word importance DB 30 can be defined, for example, with the tf-idf (term frequency-inverse document frequency) method commonly used in information retrieval systems.
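
As an illustration of a tf-idf style importance, the sketch below computes a weight for each word of one document against a small background corpus. The text does not specify which tf-idf variant is used; the smoothed idf formula here, the function name `tfidf_weights`, and the toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_weights(doc_words, corpus):
    """Return {word: tf * idf} for the words of one tokenized document.

    corpus is a list of tokenized documents used to count document
    frequencies (an assumed stand-in for the word importance DB).
    """
    tf = Counter(doc_words)
    n_docs = len(corpus)
    weights = {}
    for word, count in tf.items():
        # Number of corpus documents that contain the word.
        df = sum(1 for d in corpus if word in d)
        # Smoothed idf (one common variant, chosen here as an assumption).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[word] = count * idf
    return weights


if __name__ == "__main__":
    corpus = [["a", "b"], ["a", "c"]]
    print(tfidf_weights(["a", "b", "b"], corpus))
```

A word that is frequent in the document but rare in the corpus receives a larger weight than a word that occurs in every corpus document.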

The rhetorical structure analysis unit 18 receives the sentences produced by the sentence division unit 12 as input and analyzes the rhetorical relationships between them. For example, an RST (Rhetorical Structure Theory) tree can be generated with a rhetorical structure analyzer (reference: duVerle, D. and Prendinger, H., "A Novel Discourse Parser Based on Support Vector Machine Classification", Proc. of the 47th ACL, pp. 665-675, 2009), after which the rhetorical relationships between sentences can be obtained by applying, for example, the rules described in Non-Patent Document 1. Generating an RST tree is not mandatory: the rhetorical relationships between sentences may instead be analyzed with an analyzer trained on rhetorical structure tree data that expresses those relationships.

Based on the analyzed rhetorical relationships, the rhetorical structure analysis unit 18 generates a rhetorical structure tree in which each sentence is a node and the nodes of sentences in a rhetorical relationship are connected. Of two sentences in a rhetorical relationship, one is the parent and the other is the child, so the nodes are connected such that the node of the parent sentence is the parent node and the node of the child sentence is the child node. The left part of FIG. 2 shows an example of a rhetorical structure tree. There, the sentences s_1, s_2, ..., s_8 of the input document are represented by nodes s_1, s_2, ..., s_8 (circles in the figure), and the nodes of sentences in a parent-child relationship are connected by arrows (edges) whose start point is the child node and whose end point is the parent node.

The dependency analysis unit 20 receives the words produced by the word division unit 14 as input and analyzes the dependency relationships between them. An existing dependency parser can be used for this analysis. Based on the analyzed dependency relationships, the dependency analysis unit 20 generates a dependency tree in which each word is a node and the nodes of words in a dependency relationship are connected. Of two words in a dependency relationship, one is the parent and the other is the child, so the nodes are connected such that the node of the parent word is the parent node and the node of the child word is the child node. The right part of FIG. 2 shows an example of a dependency tree. There, the words w_1, w_2, ..., w_12 of the sentence s_8, which corresponds to node s_8 of the rhetorical structure tree in the left part of FIG. 2, are represented by nodes w_1, w_2, ..., w_12 (circles in the figure), and the nodes of words in a parent-child relationship are connected by arrows (edges) whose start point is the child node and whose end point is the parent node.

The generation unit 22 receives an input summary length L_max. The summary length is a parameter that constrains the length of the summary; here, L_max is the maximum number of words in the summary. The summary length may instead be a number of characters. L_max need not be supplied as input; a value stored in advance, for example in the ROM, may be read out instead. The generation unit 22 also receives the rhetorical structure tree generated by the rhetorical structure analysis unit 18 and the dependency trees generated by the dependency analysis unit 20, as well as the importance assigned to each word by the word importance assigning unit 16.

Based on the summary length L_max, the importance of each word, the rhetorical structure tree, and the dependency trees, the generation unit 22 extracts a subtree from each dependency tree such that the number of words in the summary is at most L_max, the parent-child relationships between words and between sentences are preserved, and the sum of the importance of the words in the summary is maximized. The generation unit 22 then generates the summary from the set of words corresponding to the nodes of the extracted subtrees.

Here, preserving the parent-child relationships between words means that the parent-child relationships between the words in the summary retain the structure of the corresponding part of the dependency tree generated by the dependency analysis unit 20. Likewise, preserving the parent-child relationships between sentences means that the parent-child relationships between the sentences in the summary retain the structure of the corresponding part of the rhetorical structure tree generated by the rhetorical structure analysis unit 18. In other words, viewed at the level of sentences, the structure among the sentences of the summary appears as part of the rhetorical structure tree generated by the rhetorical structure analysis unit 18; viewed at the level of words, the structure among the words appears as part of a dependency tree generated by the dependency analysis unit 20.

When extracting subtrees from the dependency trees, the generation unit 22 extracts at most one subtree from each dependency tree. This means that two or more subtrees are never extracted from a single sentence, and that there may be dependency trees from which no subtree is extracted. For example, if the subtree (node w_1, node w_2, node w_3) is extracted from the dependency tree representing sentence s_8 shown in the right part of FIG. 2, no further subtree is extracted from that dependency tree. Several subtrees could be extracted from a dependency tree that represents a complex or compound sentence, but each sentence is treated as carrying a single meaning, and extracting more than one subtree from one dependency tree is therefore prohibited. As a result, subtrees can be extracted from more sentences within the limit L_max, which improves the information coverage of the summary.
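
One way to check the "at most one subtree" condition for a single sentence is to count the selected words whose parent is not selected: each such word is the root of a separate extracted subtree. The sketch below assumes a dependency tree stored as a parent map; the map shown is hypothetical and is not the tree of FIG. 2, and `subtree_roots` is an illustrative helper name.

```python
def subtree_roots(parent, selected):
    """Return the selected words whose parent word is not selected.

    parent maps word id -> parent word id (None for the root of the
    dependency tree). Each returned word is the root of one extracted
    subtree, so a valid selection has at most one entry.
    """
    return sorted(j for j in selected if parent.get(j) not in selected)


# Hypothetical five-word dependency tree: word 3 is the root of the sentence.
PARENT = {1: 2, 2: 3, 3: None, 4: 3, 5: 4}

if __name__ == "__main__":
    print(subtree_roots(PARENT, {1, 2, 3}))  # one connected subtree
    print(subtree_roots(PARENT, {1, 2, 5}))  # two disconnected pieces
```

In the second call the words 1 and 2 form one piece and word 5 forms another, so that selection would be rejected.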

In addition, for every dependency tree from which a subtree is extracted, the generation unit 22 also extracts a subtree from the dependency tree corresponding to the parent node, in the rhetorical structure tree, of the node corresponding to that dependency tree. Following the parent of each node of the rhetorical structure tree eventually leads to the root node, so the set of nodes corresponding to the dependency trees from which subtrees are extracted forms a rooted subtree of the rhetorical structure tree. Making the set of sentences that contain the summary words a rooted subtree of the rhetorical structure tree allows the logical structure of the document (for example, a structure such as introduction, development, turn, and conclusion) to be reflected in the summary, which improves the information coverage and accuracy of the summary.
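
The rooted-subtree property can be illustrated by closing a set of chosen sentences under the parent relation of the rhetorical structure tree. This is a sketch under assumptions: the parent map is hypothetical (it is not the tree of FIG. 2) and `close_under_parents` is an illustrative helper, not a function named in the text.

```python
def close_under_parents(parent, chosen):
    """Add every ancestor of the chosen sentences.

    parent maps sentence id -> parent sentence id (None for the root).
    The returned set always contains the root when chosen is non-empty,
    so it forms a rooted subtree of the rhetorical structure tree.
    """
    closed = set()
    for node in chosen:
        # Walk upward until the root or an already visited node is reached.
        while node is not None and node not in closed:
            closed.add(node)
            node = parent[node]
    return closed


# Hypothetical rhetorical structure tree with sentence 1 as the root.
SENT_PARENT = {1: None, 2: 1, 3: 1, 4: 3, 5: 3}

if __name__ == "__main__":
    print(sorted(close_under_parents(SENT_PARENT, {4})))
```

Choosing sentence 4 alone pulls in its parent 3 and the root 1, which is the behaviour the parent constraint enforces.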

In short, in the summary generated by the generation unit 22, the set of sentences making up the summary is represented by a rooted subtree of the rhetorical structure tree, and the word string of each sentence in the summary is represented by a subtree of the corresponding dependency tree.

The summary generation problem described above is formulated as an integer programming problem that maximizes the objective function of equation (13) below under the constraints of equations (1) to (12) below.
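
For a very small document, the structure of this optimization can be illustrated by exhaustive search instead of an integer programming solver. The sketch below is only an illustration under stated assumptions: it enforces the length limit, the parent-sentence requirement, and the single-subtree requirement with a restricted set of root candidates, and it leaves out the coefficient a and the root(i), sub(i), and obj(i) conditions. All names and the toy instance are hypothetical.

```python
from itertools import product


def best_summary(sent_parent, word_parent, weight, root_cands, l_max):
    """Exhaustively search word selections on a tiny nested-tree instance."""
    words = [(i, j) for i in word_parent for j in word_parent[i]]
    best_score, best_sel = 0.0, frozenset()
    for bits in product((0, 1), repeat=len(words)):
        sel = {w for w, b in zip(words, bits) if b}
        if len(sel) > l_max:  # length limit
            continue
        sents = {i for i, _ in sel}
        # Every selected sentence needs its parent sentence selected too.
        if any(sent_parent[i] is not None and sent_parent[i] not in sents
               for i in sents):
            continue
        valid = True
        for i in sents:
            # Roots of the pieces selected in sentence i.
            roots = [j for (s, j) in sel
                     if s == i and (i, word_parent[i][j]) not in sel]
            # Exactly one piece, rooted at an allowed candidate word.
            if len(roots) != 1 or roots[0] not in root_cands[i]:
                valid = False
                break
        if not valid:
            continue
        score = sum(weight[w] for w in sel)
        if score > best_score:
            best_score, best_sel = score, frozenset(sel)
    return best_score, best_sel


if __name__ == "__main__":
    # Hypothetical instance: sentence 2 is a child of sentence 1.
    sent_parent = {1: None, 2: 1}
    word_parent = {1: {1: 2, 2: None, 3: 2}, 2: {1: 2, 2: None}}
    weight = {(1, 1): 1.0, (1, 2): 2.0, (1, 3): 0.5, (2, 1): 3.0, (2, 2): 1.0}
    root_cands = {1: {2}, 2: {2}}
    print(best_summary(sent_parent, word_parent, weight, root_cands, l_max=3))
```

With a limit of three words, the search keeps only the root word of sentence 1 so that both words of the more important sentence 2 fit. Exhaustive search grows exponentially with the number of words, so a real implementation would hand the formulation to an integer programming solver.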

[Equations (1) to (13) appear as images in the original publication (Figure 2015170224) and are not reproduced here.]

Here, i is a variable giving the position of a sentence in the input document (its ordinal position from the start of the document) and serves as the identification number of the sentence (sentence id). N is the total number of sentences in the document. j is a variable giving the position of a word in the i-th sentence (its ordinal position from the start of the sentence) and serves as the identification number of the word (word id). M(i) is the total number of words in the i-th sentence. x_i is a decision variable that is 1 when the i-th sentence is included in the summary. z_ij is a decision variable that is 1 when the j-th word of the i-th sentence is included in the summary. w_ij is the importance of the j-th word of the i-th sentence, the value assigned by the word importance assigning unit 16. r_ij is a decision variable that is 1 when the node corresponding to the j-th word of the i-th sentence is the root node of the subtree extracted from the dependency tree.

Equation (1) is a constraint guaranteeing that the number of words in the summary is at most L_max. Equation (2) is a constraint on the parent-child relationships between sentences: it guarantees that when the i-th sentence is extracted into the summary, its parent sentence is also included in the summary. parent(i) is a function that returns the sentence id of the sentence corresponding to the parent node, in the rhetorical structure tree, of the node corresponding to the i-th sentence.

Expression (3) is a constraint on the parent-child relationship between words: it guarantees that when the j-th word of the i-th sentence is included in the summary, its parent word is also included. parent(i, j) is a function that returns the word id of the word corresponding to the parent node of the node corresponding to the j-th word in the dependency tree representing the i-th sentence. The r_ij term, taken together with the constraint in expression (9), provides one exception: only when the subtree whose root node corresponds to the j-th word of the i-th sentence is extracted, the word corresponding to that node's parent is not included in the summary.
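The effect of the r_ij exception is easiest to see on a concrete tree. The sketch below checks the word-level rule for one sentence, exempting the chosen subtree root. The dependency parse is hand-written and hypothetical (the figures are not reproduced in this text); the sentence is sentence 3 of the worked example given later in the description, and the function name is an assumption.

```python
from typing import Dict, Optional, Set


def word_parents_kept(selected: Set[int],
                      parent: Dict[int, Optional[int]],
                      subtree_root: Optional[int]) -> bool:
    """Check the rule described for constraint (3) within one sentence.

    Every selected word must have its parent selected, except the word
    chosen as the subtree root (the case r_ij = 1).
    """
    for j in selected:
        if j == subtree_root:
            continue  # the subtree root is exempt from the parent rule
        p = parent[j]
        if p is not None and p not in selected:
            return False
    return True


# Hypothetical parse: word id -> word, and word id -> parent word id.
WORDS = {1: "Some", 2: "fruit", 3: "visionaries", 4: "say", 5: "the",
         6: "Fuji", 7: "could", 8: "someday", 9: "tumble", 10: "the",
         11: "Red", 12: "Delicious"}
PARENT = {1: 3, 2: 3, 3: 4, 4: None, 5: 6, 6: 9, 7: 9, 8: 9, 9: 4,
          10: 12, 11: 12, 12: 9}

# Subtree rooted at "tumble" (word 9), leaving out "Some fruit visionaries say".
KEPT = {5, 6, 7, 8, 9, 10, 11, 12}
```

Under this parse, `KEPT` is valid only when word 9 is declared the subtree root; without that declaration the rule would also demand its parent "say" (word 4).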

Expression (4) is a constraint that guarantees that when the j-th word of the i-th sentence is included in the summary, the i-th sentence itself is included in the summary. Expression (5) is a constraint that prevents a sentence from being extracted without any of its words being extracted. Here, a(i) is defined by the following equation.

[Equation image: definition of a(i)]

len(i) is a function that returns the number of words in the i-th sentence. Introducing a(i) ensures that words are not partially extracted from a short sentence (here, a sentence of 10 words or fewer); instead, the original sentence is included in the summary as is. This is because partially extracting words from a short sentence markedly reduces readability.
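The exact definition of a(i) is given only in an equation image that is not reproduced here, so the following sketch mirrors just the behavior described in the text: a selected sentence of 10 words or fewer must be kept whole, and any selected sentence must contribute at least some words. The function names and the minimum of one word for longer sentences (`long_sentence_minimum`) are assumptions, not values taken from the patent.

```python
SHORT_SENTENCE_LIMIT = 10  # "10 words or fewer" is the threshold stated in the text


def min_words_required(sentence_length: int, long_sentence_minimum: int = 1) -> int:
    """Lower bound on the number of words kept from a selected sentence.

    Short sentences must be kept whole. The bound for longer sentences
    (long_sentence_minimum) is an assumed placeholder.
    """
    if sentence_length <= SHORT_SENTENCE_LIMIT:
        return sentence_length
    return long_sentence_minimum


def enough_words(sentence_selected: bool, words_selected: int,
                 sentence_length: int) -> bool:
    """Purpose of constraint (5): no sentence is selected without enough words."""
    if not sentence_selected:
        return True
    return words_selected >= min_words_required(sentence_length)
```

For example, a selected 5-word sentence passes only if all 5 words are kept, whereas a selected 12-word sentence passes as soon as at least one word is kept under the assumed default.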

Expressions (6) to (9) are constraints that make it possible to extract a subtree whose root node is an arbitrary node. Expression (6) guarantees that at most one subtree is extracted from the single dependency tree representing a sentence. Expression (7) guarantees that no subtree is extracted whose root node is a node other than the predetermined root node candidates. Here, R_c(i) is a function that returns the set of word ids of the words corresponding to the root node candidates in the dependency tree representing the i-th sentence. The root node candidates can be, for example, the nodes corresponding to words whose part of speech in the sentence is a verb. Expression (8) guarantees that when the node corresponding to a word is extracted as the root node of a subtree, that word is always included in the summary. Expression (9) guarantees that the parent word of the word corresponding to the root node of the extracted subtree is not included in the summary.
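As an illustration of the four root-related rules, the sketch below validates a proposed root choice for one sentence. It is a hedged example: the function name, the four-word tree, and the candidate set are hypothetical, and the checks follow the verbal descriptions of expressions (6) to (9) rather than the equations themselves.

```python
from typing import Dict, Optional, Set


def root_choice_valid(selected: Set[int],
                      subtree_roots: Set[int],
                      parent: Dict[int, Optional[int]],
                      root_candidates: Set[int]) -> bool:
    """Check the rules described for expressions (6) to (9) in one sentence."""
    if len(subtree_roots) > 1:           # (6) at most one subtree per tree
        return False
    for r in subtree_roots:
        if r not in root_candidates:     # (7) the root must be a candidate
            return False
        if r not in selected:            # (8) the root word is in the summary
            return False
        p = parent[r]
        if p is not None and p in selected:  # (9) the root's parent is left out
            return False
    return True


# Hypothetical four-word dependency tree: word 2 is the true root,
# words 1 and 3 depend on word 2, and word 4 depends on word 3.
PARENT = {1: 2, 2: None, 3: 2, 4: 3}
CANDIDATES = {2, 3}  # e.g. the verbs of the sentence
```

Choosing word 3 as the root and keeping words {3, 4} is accepted. Keeping {2, 3, 4} with root 3 is rejected, because the root's parent (word 2) would be in the summary.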

Expression (10) is a constraint that guarantees that when the true root node of the dependency tree is not extracted as the root node of the subtree, the word corresponding to the true root node is not included in the summary. The true root node is the root node of the entire dependency tree; it is called the "true root node" to distinguish it from the root node of a subtree. In the example on the right side of FIG. 2, node w_12 is the true root node. Here, root(i) is a function that returns the word id of the word corresponding to the true root node in the dependency tree representing the i-th sentence.

Expression (11) is a constraint requiring that when a sentence contains a subject (SUB), at least one of the subjects in the sentence be included in the summary. Here, sub(i) is a function that returns, based on the parent-child relationships between words, the set of word ids of the words that are subjects in the i-th sentence. Expression (12) is a constraint requiring that when a sentence contains an object (OBJ), at least one of the objects in the sentence be included in the summary. Here, obj(i) is a function that returns, based on the parent-child relationships between words, the set of word ids of the words that are objects in the i-th sentence.
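Because the equation images for expressions (1) to (13) are not reproduced in this text, it may help to see one way the verbal descriptions could be written as an integer linear program. The following is only a sketch consistent with the prose, not a transcription of the patent's equations; the exact forms may differ. In particular, the form of (5), the inequality in (6), and the conditioning of (11) and (12) on x_i are assumptions.

```latex
\begin{align}
\text{maximize}\quad & \sum_{i=1}^{N}\sum_{j=1}^{M(i)} w_{ij}\, z_{ij} \tag{13}\\
\text{subject to}\quad
 & \sum_{i=1}^{N}\sum_{j=1}^{M(i)} z_{ij} \le L_{\max} \tag{1}\\
 & x_{\mathrm{parent}(i)} \ge x_i \tag{2}\\
 & z_{i,\mathrm{parent}(i,j)} - z_{ij} + r_{ij} \ge 0 \tag{3}\\
 & x_i \ge z_{ij} \tag{4}\\
 & \sum_{j=1}^{M(i)} z_{ij} \ge a(i)\, x_i \tag{5}\\
 & \sum_{j=1}^{M(i)} r_{ij} \le 1 \tag{6}\\
 & \sum_{j \notin R_c(i)} r_{ij} = 0 \tag{7}\\
 & r_{ij} \le z_{ij} \tag{8}\\
 & r_{ij} + z_{i,\mathrm{parent}(i,j)} \le 1 \tag{9}\\
 & z_{i,\mathrm{root}(i)} \le r_{i,\mathrm{root}(i)} \tag{10}\\
 & \sum_{j \in \mathrm{sub}(i)} z_{ij} \ge x_i \quad \text{if } \mathrm{sub}(i) \ne \emptyset \tag{11}\\
 & \sum_{j \in \mathrm{obj}(i)} z_{ij} \ge x_i \quad \text{if } \mathrm{obj}(i) \ne \emptyset \tag{12}
\end{align}
```

All of x_i, z_ij, and r_ij are binary. In this reading, (3) and (9) work as a pair: when r_ij = 0, (3) forces the parent of a kept word to be kept, and when r_ij = 1, (9) forces the parent of the subtree root to be dropped.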

Next, the operation of the document summarization apparatus 10 according to the present embodiment will be described. When a document to be summarized and the summary length L_max, which is a parameter, are input to the document summarization apparatus 10, the apparatus executes the document summarization processing routine shown in FIG. 3.

In step S10, the sentence dividing unit 12 receives the input document, marks the sentence boundaries, and divides the document into sentences. Next, in step S12, the word dividing unit 14 receives each sentence produced by the sentence dividing unit 12, marks the word boundaries, and divides each sentence into words. Next, in step S14, the word importance assigning unit 16 receives each word produced by the word dividing unit 14, refers to the word importance DB 30, and assigns an importance to each word.
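Steps S10 to S14 can be pictured with a small preprocessing sketch. The description does not specify how sentences and words are split, so the regular expressions below are naive stand-ins (they would, for instance, split wrongly at an abbreviation such as "U.S."), and the dictionary `importance_db` is a hypothetical substitute for the word importance DB 30.

```python
import re
from typing import Dict, List


def split_sentences(document: str) -> List[str]:
    """Step S10 (naive): split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def split_words(sentence: str) -> List[str]:
    """Step S12 (naive): runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def assign_importance(words: List[str],
                      importance_db: Dict[str, float],
                      default: float = 0.0) -> List[float]:
    """Step S14: look up each word's importance; unknown words get `default`."""
    return [importance_db.get(w.lower(), default) for w in words]
```

For instance, `split_words("It is called the Fuji.")` yields the six tokens `It`, `is`, `called`, `the`, `Fuji`, and `.`, each of which can then be looked up in the importance table.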

Next, in step S16, the rhetorical structure analysis unit 18 receives each sentence produced by the sentence dividing unit 12 and analyzes the rhetorical relations between sentences, for example by using a known rhetorical structure analyzer. The rhetorical structure analysis unit 18 then represents each sentence as a node and, based on the rhetorical relations between sentences, connects the nodes corresponding to sentences in a parent-child relationship, thereby generating a rhetorical structure tree representing the document.

Next, in step S18, the dependency analysis unit 20 receives each word produced by the word dividing unit 14 and analyzes the dependency relations between words, for example by using a known dependency analyzer. The dependency analysis unit 20 then represents each word as a node and, based on the dependency relations between words, connects the nodes corresponding to words in a parent-child relationship, thereby generating a dependency tree for each sentence.

Next, in step S20, the generation unit 22 extracts at most one subtree from each dependency tree, based on the input summary length L_max, the importance of each word assigned by the word importance assigning unit 16, the rhetorical structure tree generated by the rhetorical structure analysis unit 18, and the dependency trees generated by the dependency analysis unit 20. The extraction is performed such that the number of words in the summary is L_max or less, the parent-child relationships between words and between sentences are preserved, and the sum of the importance of the words in the summary is maximized. Next, in step S22, the generation unit 22 generates a summary from the set of subtrees extracted from the dependency trees, outputs the generated summary, and ends the routine.
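To make the optimization in step S20 concrete, the sketch below solves a toy instance by exhaustive search instead of integer linear programming. It is an illustration under stated assumptions, not the patented method: it enforces only the length limit, the sentence-level parent rule, and the rooted-subtree rules, and it omits the short-sentence, subject, and object constraints. All names and the two-sentence toy document are hypothetical, and the approach scales only to very small inputs.

```python
from itertools import combinations, product
from typing import Dict, FrozenSet, List, Optional, Set, Tuple


def rooted_subtrees(parent: Dict[int, Optional[int]],
                    candidates: Set[int]) -> List[FrozenSet[int]]:
    """List every word set that forms a subtree rooted at a candidate word.

    A set qualifies when each word other than the root has its parent in
    the set; this also keeps the root's own parent out of the set.
    """
    found = []
    for r in sorted(candidates):
        others = [w for w in sorted(parent) if w != r]
        for k in range(len(others) + 1):
            for extra in combinations(others, k):
                chosen = frozenset((r,) + extra)
                if all(parent[w] in chosen for w in extra):
                    found.append(chosen)
    return found


def best_summary(doc_parent: Dict[int, Optional[int]],
                 trees: Dict[int, Dict[int, Optional[int]]],
                 candidates: Dict[int, Set[int]],
                 weight: Dict[Tuple[int, int], float],
                 l_max: int) -> Tuple[float, Dict[int, FrozenSet[int]]]:
    """Pick at most one subtree per sentence, maximizing total importance."""
    ids = sorted(trees)
    # The empty set means "sentence not selected".
    options = [[frozenset()] + rooted_subtrees(trees[i], candidates[i]) for i in ids]
    best_score, best_choice = 0.0, {i: frozenset() for i in ids}
    for combo in product(*options):
        if sum(len(ws) for ws in combo) > l_max:
            continue  # summary too long
        choice = dict(zip(ids, combo))
        kept = {i for i in ids if choice[i]}
        if any(doc_parent[i] is not None and doc_parent[i] not in kept for i in kept):
            continue  # a sentence would appear without its parent sentence
        score = sum(weight[(i, j)] for i in ids for j in choice[i])
        if score > best_score:
            best_score, best_choice = score, choice
    return best_score, best_choice


# Toy document: sentence 2 depends on sentence 1.
DOC_PARENT = {1: None, 2: 1}
TREES = {1: {1: 2, 2: None, 3: 2},
         2: {1: 2, 2: None, 3: 2, 4: 3}}
CANDIDATES = {1: {2}, 2: {2, 3}}
WEIGHT = {(1, 1): 1.0, (1, 2): 2.0, (1, 3): 0.5,
          (2, 1): 1.0, (2, 2): 0.5, (2, 3): 5.0, (2, 4): 4.0}
```

With `l_max = 4`, the search keeps words {1, 2} of sentence 1 and the subtree {3, 4} of sentence 2, whose root is not the true root of its tree, for a total importance of 12.0.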

An example of a summary generated by the document summarization apparatus 10 according to the present embodiment will now be described for a document containing the following sentences 1 to 4.

Sentence 1: A Japanese apple is cropping up in orchards the way Hondas did on U.S. roads.
Sentence 2: It is called the Fuji.
Sentence 3: Some fruit visionaries say the Fuji could someday tumble the Red Delicious.
Sentence 4: But the apple industry is ripe for change.

As shown in FIG. 4, the document is represented by a rhetorical structure tree in which the node corresponding to sentence 1 is the root node, the nodes corresponding to sentences 2 and 3 are child nodes of the node corresponding to sentence 1, and the node corresponding to sentence 4 is a child node of the node corresponding to sentence 3. In addition, each sentence is represented by a dependency tree whose nodes are the words of that sentence. In the example of FIG. 4, the dependency tree representing each sentence is shown inside the node corresponding to that sentence.
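The rhetorical structure tree described for FIG. 4 can be written down as a simple parent map. The sketch below uses that map to compute which sentences must accompany a given sentence so that no sentence appears without its ancestors. The names `RST_PARENT` and `required_sentences` are hypothetical; the parent relations themselves are those stated in the text.

```python
from typing import Dict, Optional, Set

# Parent relations as described for FIG. 4:
# sentence 1 is the root, sentences 2 and 3 are its children,
# and sentence 4 is a child of sentence 3.
RST_PARENT: Dict[int, Optional[int]] = {1: None, 2: 1, 3: 1, 4: 3}


def required_sentences(target: int, parent: Dict[int, Optional[int]]) -> Set[int]:
    """Return the target sentence together with all of its ancestors."""
    needed = set()
    current: Optional[int] = target
    while current is not None:
        needed.add(current)
        current = parent[current]
    return needed
```

Including sentence 4 therefore requires sentences 3 and 1 as well, that is, the set {1, 3, 4}, while sentence 2 can be left out without breaking any parent-child relationship.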

From this structure, subtrees are extracted under the constraint on the number of words so that the parent-child relationships between sentences and between words are preserved and the sum of the importance of the words is maximized. In FIG. 4, the nodes constituting the extracted subtrees and the nodes corresponding to the sentences containing an extracted subtree are shown with bold frames. A summary such as the following is then generated from the extracted subtrees.

A Japanese apple is cropping up in orchards. The Fuji could someday tumble the Red Delicious. But the apple industry is ripe for change.

As described above, the document summarization apparatus according to the present embodiment represents a document as a rhetorical structure tree expressing the rhetorical relations between sentences and represents each sentence as a dependency tree expressing the dependency relations between words, so that the document is treated as a nested structure of a tree whose nodes are sentences and trees whose nodes are words. The apparatus then extracts at most one subtree from each dependency tree, without impairing the rhetorical structure of the document represented by the rhetorical structure tree, such that the number of words in the summary is L_max or less and the sum of the importance of the words in the summary is maximized, and generates a summary from the extracted subtrees. Because neither the logical structure of the document nor the grammaticality of its sentences is impaired, readability is maintained. In addition, extracting at most one subtree from each dependency tree improves the information coverage of the summary.

When a document in which sentence boundaries and word boundaries are already marked is input to the document summarization apparatus, the sentence dividing unit 12 and the word dividing unit 14 may be omitted.

The document summarization apparatus described above contains a computer system. Where a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment).

Although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium.

Description of Symbols
10 Document summarization apparatus
12 Sentence dividing unit
14 Word dividing unit
16 Word importance assigning unit
18 Rhetorical structure analysis unit
20 Dependency analysis unit
22 Generation unit
30 Word importance DB

Claims (5)

A document summarization apparatus comprising:
a rhetorical structure analysis unit that generates a rhetorical structure tree representing a document containing a plurality of sentences, based on rhetorical relations between the sentences in the document;
a dependency analysis unit that generates dependency trees, each representing one of the sentences, based on dependency relations between words in each of the plurality of sentences; and
a generation unit that extracts at most one subtree from each of the dependency trees so as to satisfy a predetermined condition, based on a predetermined maximum summary length, the importance of each of the words, the rhetorical structure tree, and the dependency trees, and generates a summary of the document based on the extracted subtrees.
The document summarization apparatus according to claim 1, wherein the predetermined condition is that the length of the summary is equal to or less than the maximum length, that the rhetorical relations represented by the rhetorical structure tree and the dependency relations represented by the dependency trees are not impaired, and that the sum of the importance of the words included in the summary is maximized.

The document summarization apparatus according to claim 1, wherein the predetermined condition is to maximize the objective function shown in expression (13) below under the constraints shown in expressions (1) to (12) below.
[Equation image: expressions (1) to (13)]

Here, i is the identification number of a sentence; N is the total number of sentences in the document; j is the identification number of a word; M(i) is the total number of words in the sentence with identification number i; x_i is a decision variable that takes the value 1 when the sentence with identification number i is included in the summary; z_ij is a decision variable that takes the value 1 when the word with identification number j of the sentence with identification number i is included in the summary; w_ij is the importance of the word with identification number j of the sentence with identification number i; r_ij is a decision variable that takes the value 1 when the word with identification number j of the sentence with identification number i is the root of a subtree; L_max is the maximum length; parent(i) is a function that returns the identification number of the parent sentence of the sentence with identification number i in the rhetorical structure tree; parent(i, j) is a function that returns the identification number of the parent word of the word with identification number j in the dependency tree representing the sentence with identification number i; a is a predetermined coefficient; R_c(i) is a function that returns the set of identification numbers of the words that are root candidates in the dependency tree representing the sentence with identification number i; root(i) is a function that returns the identification number of the word that is the true root of the dependency tree representing the sentence with identification number i; sub(i) is a function that returns the set of identification numbers of the words that are subjects in the sentence with identification number i; and obj(i) is a function that returns the set of identification numbers of the words that are objects in the sentence with identification number i.
A document summarization method in a document summarization apparatus including a rhetorical structure analysis unit, a dependency analysis unit, and a generation unit, the method comprising:
a step in which the rhetorical structure analysis unit generates a rhetorical structure tree representing a document containing a plurality of sentences, based on rhetorical relations between the sentences in the document;
a step in which the dependency analysis unit generates dependency trees, each representing one of the sentences, based on dependency relations between words in each of the plurality of sentences; and
a step in which the generation unit extracts at most one subtree from each of the dependency trees so as to satisfy a predetermined condition, based on a predetermined maximum summary length, the importance of each of the words, the rhetorical structure tree, and the dependency trees, and generates a summary of the document based on the extracted subtrees.
A document summarization program for causing a computer to function as each unit constituting the document summarization apparatus according to any one of claims 1 to 3.
JP2014045656A 2014-03-07 2014-03-07 Document summarization apparatus, method, and program Active JP6021079B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2014045656A JP6021079B2 (en) 2014-03-07 2014-03-07 Document summarization apparatus, method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2014045656A JP6021079B2 (en) 2014-03-07 2014-03-07 Document summarization apparatus, method, and program

Publications (2)

Publication Number Publication Date
JP2015170224A true JP2015170224A (en) 2015-09-28
JP6021079B2 JP6021079B2 (en) 2016-11-02

Family

ID=54202875

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2014045656A Active JP6021079B2 (en) 2014-03-07 2014-03-07 Document summarization apparatus, method, and program

Country Status (1)

Country Link
JP (1) JP6021079B2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017134588A (en) * 2016-01-27 2017-08-03 日本電信電話株式会社 Oracle summary search device, method, and program
JP2018097669A (en) * 2016-12-14 2018-06-21 日本電信電話株式会社 Summary generation device, method, and program
JP2020181387A (en) * 2019-04-25 2020-11-05 シャープ株式会社 Document summarization device, document summarization system, document summarization method, and program

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003281164A (en) * 2002-03-20 2003-10-03 Fuji Xerox Co Ltd Document summarizing device, document summarizing method and document summarizing program
JP2004094946A (en) * 2002-08-30 2004-03-25 Fuji Xerox Co Ltd Method for summarizing source text, method for selecting coordinate relationship for compression, system for summarizing source text, and program
JP2010262511A (en) * 2009-05-08 2010-11-18 Nippon Telegr & Teleph Corp <Ntt> Text summarization method, apparatus thereof, and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
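JPN6016033910; 伊藤 潤 and 2 others: "Extraction of Important Parts of Japanese Documents Using Dependency Trees" (係り受け木を用いた日本語文書の重要部分抽出), IPSJ SIG Technical Reports, Vol. 2003, No. 108, 20031107, p.19-24, Information Processing Society of Japan *
JPN6016033911; Tsutomu Hirao (平尾 努) and 3 others: "Single Document Summarization Based on Discourse Structure" (談話構造に基づく単一文書要約), Proceedings of the 19th Annual Meeting of the Association for Natural Language Processing [online], 20130304, p.492-495, The Association for Natural Language Processing *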

Also Published As

Publication number Publication date
JP6021079B2 (en) 2016-11-02

Similar Documents

Publication Publication Date Title
Hasan et al. Stance classification of ideological debates: Data, models, features, and constraints
Fonseca et al. Mac-morpho revisited: Towards robust part-of-speech tagging
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
US10275458B2 (en) Systematic tuning of text analytic annotators with specialized information
KR101948257B1 (en) Multi-classification device and method using lsp
CN109635275A (en) Literature content retrieval and recognition methods and device
JP6614152B2 (en) Text processing system, text processing method, and computer program
JP6021079B2 (en) Document summarization apparatus, method, and program
Kumar et al. Sanskrit compound processor
KR20110017129A (en) Apparatus and method for words sense disambiguation using korean wordnet and its program stored recording medium
Sangati et al. Multiword expression identification with recurring tree fragments and association measures
JP6062829B2 (en) Dependency relationship analysis parameter learning device, dependency relationship analysis device, method, and program
KR101092354B1 (en) Compound noun recognition apparatus and its method
Blessing et al. An end-to-end environment for research question-driven entity extraction and network analysis
Kramer et al. Improvement of a naive Bayes sentiment classifier using MRS-based features
de Carvalho et al. Extracting semantic information from patent claims using phrasal structure annotations
Basit et al. Semantic similarity analysis of urdu documents
Trye et al. A hybrid architecture for labelling bilingual māori-english tweets
JP2008197952A (en) Text segmentation method, its device, its program and computer readable recording medium
KR102203895B1 (en) Embedding based causality detection System and Method and Computer Readable Recording Medium on which program therefor is recorded
Praveena et al. Chunking based malayalam paraphrase identification using unfolding recursive autoencoders
Tammewar et al. Can distributed word embeddings be an alternative to costly linguistic features: A study on parsing hindi
JP6665029B2 (en) Language analysis device, language analysis method, and program
JP6298785B2 (en) Natural language analysis apparatus, method, and program
Chhetri et al. Development of a morph analyser for Nepali noun token

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20151126

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A821

Effective date: 20151126

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20160826

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20160906

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20160926

R150 Certificate of patent or registration of utility model

Ref document number: 6021079

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

S533 Written request for registration of change of name

Free format text: JAPANESE INTERMEDIATE CODE: R313533

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350