JP6291440B2

JP6291440B2 - Parameter learning method, apparatus, and program

Info

Publication number: JP6291440B2
Application number: JP2015040409A
Authority: JP
Inventors: 吉田　康久; 鈴木　潤; 平尾　努; 林　克彦; 永田　昌明
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-03-02
Filing date: 2015-03-02
Publication date: 2018-03-14
Anticipated expiration: 2035-03-02
Also published as: JP2016162198A

Description

The present invention relates to a parameter learning method, apparatus, and program.

Rhetorical Structure Theory (RST) is a theory for capturing the logical structure (rhetorical structure) of a document (see, for example, Non-Patent Document 1). A tree that represents an RST-based rhetorical structure is called a Rhetorical Structure Theory based Discourse Tree (RST-DT). FIG. 9 shows an example of an RST-DT. Assigning a rhetorical structure tree to raw, unannotated text is called discourse structure analysis. A given document is divided into the smallest units of discourse structure, Elementary Discourse Units (EDUs); in FIG. 9 these are e1 through e10. Each EDU is labeled as a satellite (S) or a nucleus (N), and an S always modifies an N. In addition, a label representing the rhetorical relation is assigned between an S and an N, and between an N and an N. For example, the relation label “Background” is assigned between e1 and e2.

In an RST-DT, adjacent nodes are joined into a single node together with the N or S labels and the rhetorical relation label assigned between them, and this labeling and node generation is performed recursively until the entire document becomes a single node. Root is a virtual node that represents the entire document.

HILDA (see, for example, Non-Patent Document 2) is a representative algorithm for parsing a given document into an RST-DT. HILDA is an easiest-first search method that uses a greedy strategy, and it parses the given text into an RST-DT by the following procedure.

(Step 1) Divide the given document into EDUs.

(Step 2) Use a Support Vector Machine to determine which pair of adjacent nodes is most likely to be joined, assign a label, and join the adjacent nodes into a single node.

(Step 3) If the whole has become a single node, return the joined tree; otherwise, return to Step 2.
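The three steps above can be sketched as a short greedy loop. This is a minimal illustration, not HILDA's actual implementation: the tuple-based tree representation, the function name `greedy_parse`, and the `toy_scorer` that stands in for the Support Vector Machine are all assumptions made for this example, and Step 1 (EDU segmentation) is assumed to have been done already.

```python
from typing import Callable, List, Tuple, Union

# A tree is either an EDU string or a (label, left, right) tuple (assumed representation).
Tree = Union[str, Tuple[str, "Tree", "Tree"]]


def greedy_parse(edus: List[str],
                 score_and_label: Callable[[Tree, Tree], Tuple[float, str]]) -> Tree:
    """Repeatedly join the adjacent pair that the scorer rates highest."""
    if not edus:
        raise ValueError("at least one EDU is required")
    nodes: List[Tree] = list(edus)            # Step 1 is assumed done: EDUs are given
    while len(nodes) > 1:                     # Step 3: stop when one node remains
        # Step 2: score every adjacent pair and keep the best one
        scored = [(score_and_label(nodes[i], nodes[i + 1]), i)
                  for i in range(len(nodes) - 1)]
        (_, label), i = max(scored, key=lambda t: t[0][0])
        nodes[i:i + 2] = [(label, nodes[i], nodes[i + 1])]
    return nodes[0]


def size(tree: Tree) -> int:
    """Number of EDUs under a tree."""
    return 1 if isinstance(tree, str) else size(tree[1]) + size(tree[2])


def toy_scorer(left: Tree, right: Tree) -> Tuple[float, str]:
    # Hypothetical stand-in for the SVM: prefer joining small units first.
    return -float(size(left) + size(right)), "Elaboration"


tree = greedy_parse(["e1", "e2", "e3"], toy_scorer)
print(tree)
```

Because each join is final, a single wrong decision in Step 2 is carried into every later step.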

Non-Patent Document 1: William C. Mann and Sandra A. Thompson, “Rhetorical structure theory: Toward a functional theory of text organization”, 1988, Text, 8(3), p. 243-281
Non-Patent Document 2: H. Hernault, H. Prendinger, David A. duVerle, and M. Ishizuka, “HILDA: A Discourse Parser Using Support Vector Machine Classification”, 2010, Dialogue & Discourse, 2010(3), p. 1-33

However, conventional discourse structure analysis techniques, typified by HILDA, are not very robust to search errors. For example, if the relation between e5 and e6 in FIG. 9 is labeled “Elaboration”, the rhetorical structure label “Contrast” may no longer be assigned correctly when the result is subsequently joined with e4.

The present invention has been made in view of the above circumstances, and its object is to provide a parameter learning method, apparatus, and program capable of obtaining parameters for performing discourse structure analysis accurately.

To achieve the above object, the parameter learning method of the present invention is a parameter learning method in a parameter learning apparatus including a learning input unit, a parameter learning unit, and an iteration determination unit. The learning input unit receives, for each of a plurality of learning documents, each of the character string units in the learning document and the correct rhetorical structure tree corresponding to the learning document. Here, the rhetorical structure tree is a tree based on the rhetorical structure of the sequences of character string units of the learning document: its root node represents the entire learning document, it represents a hierarchical structure in which each node is a sequence of at least one character string unit of the learning document, and it represents the modification relations and relation labels between the sequences of character string units. The method includes a step in which, once this input is received, the parameter learning unit, for each of the plurality of learning documents, considers pairs consisting of a sequence of subtrees included in the correct rhetorical structure tree and a sequence of subtrees of a rhetorical structure tree corresponding to the learning document, the latter being selected using a weight vector for the feature vector extracted from a sequence of subtrees, and updates the weight vector based on the feature vectors extracted from each of the subtree sequences in the pair for which the difference in score, calculated using the feature vector and the weight vector, is largest; and a step in which the iteration determination unit repeats the update by the parameter learning unit a predetermined number of times.

The parameter learning apparatus of the present invention includes an input unit, a parameter learning unit, and an iteration determination unit. The input unit receives, for each of a plurality of learning documents, each of the character string units in the learning document and the correct rhetorical structure tree corresponding to the learning document, where the rhetorical structure tree is a tree based on the rhetorical structure of the sequences of character string units of the learning document: its root node represents the entire learning document, it represents a hierarchical structure in which each node is a sequence of at least one character string unit of the learning document, and it represents the modification relations and relation labels between the sequences of character string units. The parameter learning unit, for each of the plurality of learning documents, considers pairs consisting of a sequence of subtrees included in the correct rhetorical structure tree and a sequence of subtrees of a rhetorical structure tree corresponding to the learning document, the latter being selected using a weight vector for the feature vector extracted from a sequence of subtrees, and updates the weight vector based on the feature vectors extracted from each of the subtree sequences in the pair for which the difference in score, calculated using the feature vector and the weight vector, is largest. The iteration determination unit repeats the update by the parameter learning unit a predetermined number of times.

The parameter learning unit of the present invention may also operate as follows. For the sequences of subtrees included in the correct rhetorical structure tree, it selects, from the set of subtree sequences included in the correct rhetorical structure tree that are generated by joining a pair of adjacent subtrees in the previously selected subtree sequence, the subtree sequence whose score, calculated using the feature vector extracted from the subtree sequence and the weight vector, is largest. For the sequences of subtrees of the rhetorical structure tree corresponding to the learning document, it selects, from the set of subtree sequences generated by joining a pair of adjacent subtrees in each of the previously selected subtree sequences, the k subtree sequences with the highest scores calculated using the feature vector extracted from the subtree sequence and the weight vector. It repeatedly generates pairs of the subtree sequence selected for the correct rhetorical structure tree and each of the top k subtree sequences selected for the rhetorical structure tree corresponding to the learning document. Among the generated pairs, it then updates the weight vector based on the feature vectors extracted from each of the subtree sequences in the pair for which the difference in score, calculated using the feature vector and the weight vector, is largest.

The program of the present invention is a program for causing a computer to function as each unit of the discourse structure analysis apparatus of the present invention.

As described above, the parameter learning method, apparatus, and program of the present invention proceed as follows for each of a plurality of learning documents. They consider pairs consisting of a sequence of subtrees included in the correct rhetorical structure tree and a sequence of subtrees of a rhetorical structure tree corresponding to the learning document, the latter being selected using a weight vector for the feature vector extracted from a sequence of subtrees. They then update the weight vector based on the feature vectors extracted from each of the subtree sequences in the pair for which the difference in score, calculated using the feature vector and the weight vector, is largest. This yields parameters for performing discourse structure analysis accurately.

FIG. 1 is a block diagram showing a configuration example of the discourse structure analysis apparatus according to the embodiment of the present invention.
FIG. 2 is a diagram showing an example of the discourse structure analysis algorithm.
FIG. 3 is a block diagram showing a configuration example of the parameter learning apparatus according to the embodiment of the present invention.
FIG. 4 is a diagram showing an example of the algorithm that generates the pair with the maximum score.
FIG. 5 is a diagram showing an example of the algorithm that learns the weight vector w.
FIG. 6 is a flowchart showing the learning processing routine in the parameter learning apparatus according to the embodiment of the present invention.
FIG. 7 is a flowchart showing the maximum pair calculation processing routine in the parameter learning apparatus according to the embodiment of the present invention.
FIG. 8 is a flowchart showing the analysis processing routine in the discourse structure analysis apparatus according to the embodiment of the present invention.
FIG. 9 is an explanatory diagram for describing a rhetorical structure tree (RST-DT) based on a rhetorical structure.

<Overview>
First, an overview of the embodiment of the present invention is given. The embodiment concerns the analysis of the discourse structure between grammatical elements in a given document. It is a technique that parses, as a tree, the discourse structure between the grammatical elements of an entire unannotated document. The key points of the embodiment are that the easiest-first search in discourse structure analysis is extended to a beam search, making it robust to search errors, and that parameters are learned with the easiest-first search extended to a beam search.

The embodiment of the present invention performs discourse structure analysis using beam search in order to capture the discourse structure more accurately. The existing method HILDA is based on a greedy strategy, so a wrong decision at some point (for example, joining e5 and e6 in FIG. 9 and assigning the label “Evidence”) may adversely affect all subsequent decisions. The embodiment therefore reduces search errors by applying beam search to the greedy HILDA approach. The embodiment consists of two stages: a stage that learns the optimal parameters, and a stage that analyzes the discourse structure of an input document using those parameters.

First, the stage that learns the optimal parameters is described. In this stage, the feature vectors extracted from documents annotated with discourse structure and the initial parameters are input to the parameter learning unit. The parameter learning unit learns the parameters best suited to discourse structure analysis and outputs them as learned parameters.

Next, the stage that analyzes the discourse structure of an input document using the optimal parameters is described. In this stage, the input document is divided into EDUs. Feature vectors are then extracted from the sequence of EDUs and passed to the discourse structure analysis unit together with the learned parameters obtained in the previous stage. Based on these, the discourse structure analysis unit outputs an RST-DT for the input document as the result of the discourse structure analysis.

Here, an RST-DT is a rhetorical structure tree based on the rhetorical structure of the sequences of character string units of a document: its root node represents the entire document, it represents a hierarchical structure in which each node is a sequence of at least one character string unit of the document, and it represents the modification relations and relation labels between the sequences of character string units.
A character string unit of an RST-DT corresponds to the smallest unit in the document (Elementary Discourse Unit: EDU).

<System configuration of the discourse structure analysis apparatus>
Hereinafter, the embodiment of the present invention is described in detail with reference to the drawings. FIG. 1 is a block diagram showing the discourse structure analysis apparatus 100 according to the embodiment of the present invention. The discourse structure analysis apparatus 100 is a computer including a CPU, a RAM, and a ROM that stores a program for executing the discourse structure analysis processing routine, and is functionally configured as follows.

As shown in FIG. 1, the discourse structure analysis apparatus 100 of this embodiment includes an input unit 10, a parameter database 20, a computation unit 30, and an output unit 40.

When a document to be analyzed is input, the discourse structure analysis apparatus 100 performs discourse structure analysis on that document.

The input unit 10 receives the document to be analyzed. The input document contains at least one sentence.

The parameter database 20 stores the weight vector w learned by the parameter learning apparatus 200 described later.

The computation unit 30 performs discourse structure analysis on the document to be analyzed that was received by the input unit 10. The computation unit 30 includes an EDU division unit 32, a feature extraction unit 34, and a discourse structure analysis unit 36.

The EDU division unit 32 divides the document received by the input unit 10 into EDUs. For example, the EDU division unit 32 uses a classifier such as an SVM to decide, at each boundary between words in the document, whether or not to split there, and outputs the document divided into EDUs.

The feature extraction unit 34 extracts a feature vector from a sequence of subtrees, which is either the sequence of EDUs obtained by the EDU division unit 32 or a sequence of subtrees generated by the discourse structure analysis unit 36 described later. For example, the feature extraction unit 34 denotes by f(S) the feature vector used when joining two RST-DT subtrees S_i and S_{i+1} from a sequence S of RST-DT subtrees. Representative features are listed below.

(1) Whether the number of words contained in the RST-DT subtree S_i is 5 or fewer.
(2) Whether the RST-DT subtrees S_i and S_{i+1} are contained in the same sentence.
(3) Whether the RST-DT subtree S_i begins with “Because”.
(4) The number of EDUs contained in the RST-DT subtrees S_i and S_{i+1}.
(5) Whether the part of speech of the head word of the RST-DT subtree S_i is a verb.
(6) Whether the rhetorical relation label of the topmost node of the RST-DT subtree S_{i+1} is “Evidence”.

For example, in FIG. 9, let S be the sequence of RST-DT subtrees [e1, e2, e3, e4, e5-6, e7, e8, e9, e10], and consider the features obtained when e4 and e5-6 are joined. Here, e5-6 denotes the node obtained by joining e5 and e6 with the rhetorical structure label “Evidence”. Suppose the actual text of e4 is “Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,” and the actual text of e5-6 is “but any liquid water formed that way would evaporate almost instantly because of the low atmospheric pressure.” The features for joining e4 and e5-6 (with the relation label “Contrast”) are then as follows:

(1) e4 contains 16 words, which is more than 5.
(2) e4 and e5-6 are contained in the same sentence.
(3) e4 does not begin with “Because”.
(4) e4 and e5-6 together contain 3 EDUs.
(5) The part of speech of the head word of e4 is a verb.
(6) The rhetorical relation label of the topmost node of e5-6 is “Evidence”.
Therefore, f(S) = [0, 1, 0, 3, 1, 1].
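The worked example can be reproduced with a small feature extractor. This is a sketch under stated assumptions: the `Subtree` dataclass and its fields (`sentence_id`, `head_pos`, `top_relation`) are hypothetical containers for information that a real system would obtain from a sentence splitter, a parser, and the partial tree, and the `"VERB"` tag name is likewise assumed. Words are counted by whitespace splitting, which may differ from the tokenization behind the count given in the text; feature (1) is 0 either way.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Subtree:
    text: str                    # surface text covered by the subtree
    num_edus: int                # number of EDUs under this subtree
    sentence_id: int             # hypothetical: index of the containing sentence
    head_pos: str                # hypothetical: part-of-speech tag of the head word
    top_relation: Optional[str]  # relation label at the top node (None for a bare EDU)


def pair_features(left: Subtree, right: Subtree) -> List[int]:
    """Features (1) to (6) for joining two adjacent subtrees S_i and S_{i+1}."""
    return [
        int(len(left.text.split()) <= 5),            # (1) S_i has 5 words or fewer
        int(left.sentence_id == right.sentence_id),  # (2) same sentence
        int(left.text.startswith("Because")),        # (3) S_i begins with "Because"
        left.num_edus + right.num_edus,              # (4) number of EDUs covered
        int(left.head_pos == "VERB"),                # (5) head word of S_i is a verb
        int(right.top_relation == "Evidence"),       # (6) top label of S_{i+1}
    ]


e4 = Subtree(
    "Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,",
    num_edus=1, sentence_id=0, head_pos="VERB", top_relation=None)
e5_6 = Subtree(
    "but any liquid water formed that way would evaporate almost instantly "
    "because of the low atmospheric pressure.",
    num_edus=2, sentence_id=0, head_pos="VERB", top_relation="Evidence")

features = pair_features(e4, e5_6)
print(features)  # [0, 1, 0, 3, 1, 1]
```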

For each of a plurality of subtree sequences generated from the sequence of subtrees consisting of the EDUs, or from the previously generated subtree sequences, the discourse structure analysis unit 36 calculates a score for the subtree sequence based on the weight vector w stored in the parameter database 20 and the feature vector f(S) extracted by the feature extraction unit 34. Based on these scores, the discourse structure analysis unit 36 then stores the k highest-scoring subtree sequences in the array beam.

Specifically, the discourse structure analysis unit 36 enumerates new sequences of RST-DT subtrees from a sequence of RST-DT subtrees according to the function expand shown in equation (1) below.

Here, L denotes the set of rhetorical relation labels, S denotes a sequence containing RST-DT subtrees, and the set of nucleus/satellite pairs is defined as NS = {(Nucleus, Satellite), (Satellite, Nucleus), (Nucleus, Nucleus)}. The function build takes as input S, an index i ∈ {1, …, length(S) − 1}, a rhetorical relation label l ∈ L, and a nucleus/satellite pair (ns1, ns2) ∈ NS. As output, build returns a new RST-DT subtree obtained by joining the two RST-DT subtrees S_i and S_{i+1}, assigning the rhetorical relation label l and the nucleus/satellite pair in the process.
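The roles of build and expand described here can be sketched as follows. Equation (1) itself is not reproduced in this text, so the enumeration below (every adjacent position, every label in L, every pair in NS) is an assumption inferred from the description of build's inputs. The tuple representation of a subtree is also an assumption, and the code uses 0-based indices where the text uses 1-based ones.

```python
from itertools import product
from typing import List, Tuple, Union

# Assumed representation: an EDU is a string; a joined subtree is
# (relation label, (ns1, ns2), left subtree, right subtree).
Tree = Union[str, Tuple[str, Tuple[str, str], "Tree", "Tree"]]

NS = [("Nucleus", "Satellite"), ("Satellite", "Nucleus"), ("Nucleus", "Nucleus")]


def build(S: List[Tree], i: int, label: str, ns: Tuple[str, str]) -> Tree:
    """Join S[i] and S[i + 1] into one subtree carrying the label and the N/S pair."""
    return (label, ns, S[i], S[i + 1])


def expand(S: List[Tree], labels: List[str]) -> List[List[Tree]]:
    """Enumerate every sequence reachable from S by joining one adjacent pair."""
    return [S[:i] + [build(S, i, label, ns)] + S[i + 2:]
            for i, label, ns in product(range(len(S) - 1), labels, NS)]


candidates = expand(["e1", "e2", "e3"], ["Background", "Contrast"])
print(len(candidates))  # 2 positions x 2 labels x 3 N/S pairs = 12
```

Each call shortens the sequence by exactly one element, so a document with n EDUs needs n − 1 expansions to reach a single tree.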

FIG. 2 shows pseudocode for the specific processing of the discourse structure analysis unit 36. The function top_k(Z) retains the k highest-scoring candidates from a set Z of sequences of RST-DT subtrees. Each sequence of RST-DT subtrees z ∈ Z holds, as its score, the inner product of the weight vector w and the feature vector extracted from z.
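A minimal sketch of the scoring and top_k selection described here, using the standard-library `heapq.nlargest`. The `extract` callable and the toy feature table are hypothetical placeholders for the feature extraction unit; only the inner-product scoring and the keep-the-best-k behaviour come from the text.

```python
import heapq
from typing import Callable, Dict, List, Sequence


def score(w: Sequence[float], f: Sequence[float]) -> float:
    """Inner product of the weight vector and a feature vector."""
    return sum(wi * fi for wi, fi in zip(w, f))


def top_k(Z: List[str], k: int, w: Sequence[float],
          extract: Callable[[str], Sequence[float]]) -> List[str]:
    """Keep the k candidates in Z with the highest score, best first."""
    return heapq.nlargest(k, Z, key=lambda z: score(w, extract(z)))


# Toy data: candidate names mapped to made-up feature vectors.
toy_features: Dict[str, List[float]] = {
    "a": [1.0, 0.0],   # score  1.0
    "b": [0.0, 1.0],   # score -2.0
    "c": [3.0, 0.0],   # score  3.0
}
w = [1.0, -2.0]
best = top_k(list(toy_features), 2, w, toy_features.__getitem__)
print(best)  # ['c', 'a']
```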

Algorithm 1 shown in FIG. 2 takes as input the weight vector w and a document x = [e1, e2, …, en] divided into EDUs. The discourse structure analysis unit 36 expands candidate subtree sequences with the function expand and, according to equation (2) below, keeps the k highest-scoring candidate subtree sequences in the array beam. The array beam is two-dimensional: beam[i] stores an array holding the subtree sequence with the i-th highest score, so beam[i][j] stores the j-th subtree of that sequence.

The finally obtained beam[0][0] holds the single highest-scoring tree, that is, the RST-DT. The discourse structure analysis unit 36 outputs this beam[0][0] as the RST-DT that constitutes the analysis result.
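The beam search described for Algorithm 1 might look roughly like the sketch below. FIG. 2 and equation (2) are not reproduced in this text, so the loop structure is an assumption based on the prose: expand every sequence in the beam, keep the k best, and stop when each sequence is a single tree. The tuple tree representation, the fixed label set, and `toy_score` (which simply counts “Contrast” labels in place of the inner product w · f(z)) are hypothetical.

```python
import heapq
from itertools import product

LABELS = ["Background", "Contrast"]   # assumed small label set L
NS = [("Nucleus", "Satellite"), ("Satellite", "Nucleus"), ("Nucleus", "Nucleus")]


def expand(S):
    """All sequences obtained from tuple S by joining one adjacent pair."""
    return [S[:i] + ((label, ns, S[i], S[i + 1]),) + S[i + 2:]
            for i, label, ns in product(range(len(S) - 1), LABELS, NS)]


def beam_parse(edus, k, score):
    """Keep the k best subtree sequences at each step; edus must be non-empty."""
    beam = [tuple(edus)]
    for _ in range(len(edus) - 1):        # each step shortens every sequence by one
        candidates = [z for S in beam for z in expand(S)]
        beam = heapq.nlargest(k, candidates, key=score)
    return beam[0][0]                     # best sequence, its single remaining tree


def count_label(tree, label):
    """Number of internal nodes in a tree that carry the given relation label."""
    if isinstance(tree, str):
        return 0
    rel, _ns, left, right = tree
    return int(rel == label) + count_label(left, label) + count_label(right, label)


def toy_score(S):
    # Hypothetical stand-in for the inner product of w and the features of S.
    return sum(count_label(t, "Contrast") for t in S)


result = beam_parse(["e1", "e2", "e3"], k=2, score=toy_score)
print(result[0], count_label(result, "Contrast"))
```

With k = 1 this reduces to the greedy easiest-first search; larger k keeps alternative partial analyses alive so that one early mistake need not determine the final tree.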

The output unit 40 outputs the RST-DT produced by the discourse structure analysis unit 36 as the analysis result.

If the weight vector w were known, discourse structure analysis based on easiest-first search with beam search could be performed using Algorithm 1 shown in FIG. 2. However, w is not known in advance. In this embodiment, w is therefore obtained with a structured perceptron, which is one example of a learning algorithm.

<System configuration of the parameter learning apparatus>
FIG. 3 is a block diagram showing the parameter learning apparatus 200 according to the embodiment of the present invention. The parameter learning apparatus 200 is a computer including a CPU, a RAM, and a ROM that stores a program for executing the parameter learning processing routine described later, and is functionally configured as follows.

The learning input unit 50 receives a plurality of items of learning data. Specifically, for each of a plurality of learning documents, the learning input unit 50 receives, as learning data, the combination of the EDUs in the learning document and the correct RST-DT corresponding to the learning document.

The learning computation unit 60 learns the weight vector w for discourse structure analysis based on the learning data received by the learning input unit 50. The learning computation unit 60 includes a learning database 62, a learning feature extraction unit 64, a parameter learning unit 66, and an iteration determination unit 68.

The learning database 62 stores the learning data received by the learning input unit 50.

The learning feature extraction unit 64 extracts feature vectors in the same way as the feature extraction unit 34, either from the sequence of subtrees consisting of the EDUs included in each item of learning data stored in the learning database 62, or from a sequence of subtrees generated by the parameter learning unit 66 described later.

For an item of learning data, the parameter learning unit 66 generates pairs consisting of a sequence of subtrees included in the correct RST-DT and a sequence of subtrees of an RST-DT corresponding to the learning document, the latter being selected using the weight vector w for the feature vector extracted from a sequence of subtrees. Among the generated pairs, the parameter learning unit 66 then takes the pair for which the difference in score, calculated using the feature vector and the weight vector w, is largest, and updates the weight vector w based on the feature vectors extracted from each of the subtree sequences in that pair. The parameter learning unit 66 repeats this processing for each item of learning data.

The processing of the parameter learning unit 66 is now described in detail. First, following Algorithm 2 shown in FIG. 4, the parameter learning unit 66 selects the pair of a sequence of subtrees included in the correct RST-DT and a sequence of subtrees of the RST-DT corresponding to the learning document for which the difference in score, calculated using the feature vector and the weight vector w, is largest.

Specifically, according to equation (3) below, the parameter learning unit 66 considers the set of subtree sequences included in the correct RST-DT that can be generated by joining exactly one pair of adjacent subtrees in the subtree sequence it selected in the previous step. From this set it selects the subtree sequence whose score, calculated using the feature vector f(o) extracted from the subtree sequence by the learning feature extraction unit 64 and the weight vector w, is largest. A sequence of subtrees included in the correct RST-DT is referred to as an oracle o.

Next, according to equation (2) above, for the RST-DT of the learning document, the parameter learning unit 66 generates a set of sequences of subtrees by merging a pair of adjacent subtrees in each of the previously selected sequences of subtrees. From this set it selects the k sequences with the highest scores, computed from the feature vector f(S) extracted from each sequence and the weight vector w.
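
The top-k selection described in this step can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the sparse dictionaries standing in for f(S) and w, the feature names, the numbers, and the helper names `dot` and `top_k` are all hypothetical.

```python
import heapq

def dot(w, f):
    """Score of a candidate: inner product of weight vector w and feature vector f."""
    return sum(w.get(name, 0.0) * value for name, value in f.items())

def top_k(candidates, w, k):
    """Keep the k candidates with the highest score w . f(S).

    `candidates` is a list of (subtree_sequence, feature_vector) pairs.
    """
    return heapq.nlargest(k, candidates, key=lambda c: dot(w, c[1]))

# Hypothetical weights and candidates (names and numbers are made up).
w = {"merge_len_2": 0.5, "merge_len_3": -1.0}
candidates = [
    ("S_a", {"merge_len_2": 1}),
    ("S_b", {"merge_len_3": 1}),
    ("S_c", {"merge_len_2": 2}),
]
beam = top_k(candidates, w, k=2)
```

With these made-up values the scores are 0.5, -1.0 and 1.0, so the beam keeps `S_c` and `S_a` and drops `S_b`.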

Next, according to equation (4), the parameter learning unit 66 generates pairs, each combining the sequence of subtrees selected for the correct RST-DT with one of the top-k sequences of subtrees selected for the RST-DT of the learning document.

The parameter learning unit 66 repeats the above processing until the sequence of subtrees contained in the correct RST-DT becomes a single tree.

Then, according to equation (5), the parameter learning unit 66 selects, from the generated pairs, the pair of a sequence of subtrees contained in the correct RST-DT and a sequence of subtrees of the RST-DT for the learning document for which the difference between the scores computed from the feature vectors and the weight vector w is largest.

The parameter learning unit 66 then updates the weight vector w based on the feature vectors extracted from each of the two sequences of subtrees in the pair selected according to equation (5) above. The update processing in the parameter learning unit 66 is shown as Algorithm 3 in FIG. 5.

The structured perceptron used in this embodiment updates the weights each time one correct RST-DT (t) is given, based on a pair consisting of an oracle o and a predicted RST-DT (S). This pair is obtained by the function max-violation-pair given in Algorithm 2 of FIG. 4. The function max-violation-pair stores pairs of an oracle and a predicted RST-DT obtained by beam search in pairs, and returns the pair for which the difference between the score of the oracle o and the score of the predicted RST-DT (S) is largest. The function expand-oracle, used inside max-violation-pair, expands candidates in the same way as the function expand but returns only the candidates contained in the correct RST-DT (t). The structured perceptron updates the weight vector w using the pair returned by max-violation-pair: it adds to the weight vector the difference between the feature vectors f(o), f(S) ∈ R^M extracted from the two members of the obtained pair (o, S) of RST-DT subtree sequences.
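
The weight update described above can be sketched as follows. The text states only that the difference of f(o) and f(S) is added to the weight vector; this sketch assumes the usual structured-perceptron direction, w ← w + f(o) − f(S). The sparse dictionary representation, the feature names, and the function name `perceptron_update` are likewise assumptions made for illustration.

```python
def perceptron_update(w, f_oracle, f_pred):
    """Add f(o) - f(S) to the weight vector w in place.

    The direction of the difference (oracle minus prediction) is an
    assumption; it is the conventional structured-perceptron update.
    """
    for name, value in f_oracle.items():
        w[name] = w.get(name, 0.0) + value
    for name, value in f_pred.items():
        w[name] = w.get(name, 0.0) - value
    return w

# Hypothetical feature vectors for an oracle o and a prediction S.
w = {"a": 1.0}
f_o = {"a": 1, "b": 2}
f_s = {"b": 1, "c": 1}
perceptron_update(w, f_o, f_s)
```

Features that appear only in the oracle are rewarded, features that appear only in the prediction are penalized, and features shared by both cancel out to the extent that their counts agree.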

The iteration determination unit 68 causes the update by the parameter learning unit 66 to be repeated a predetermined number of times. Once the update has been repeated the predetermined number of times, the parameter learning unit 66 stores the updated weight vector w in the parameter database 70.

The parameter database 70 stores the weight vector w updated by the parameter learning unit 66.

<Operation of the parameter learning device>
Next, the operation of the parameter learning device 200 of this embodiment is described. First, when a plurality of learning data is input to the parameter learning device 200, the parameter learning device 200 stores the input learning data in the learning database 62. The parameter learning device 200 then executes the learning routine shown in FIG. 6.

First, in step S100, one item of learning data is read from the plurality of learning data stored in the learning database 62 and set as the current item.

Next, in step S102, for the learning data set in step S100, the parameter learning unit 66 generates pairs of a sequence of subtrees contained in the correct RST-DT and a sequence of subtrees of the RST-DT for the learning document. The sequence of subtrees of the RST-DT for the learning document is selected using the weight vector w applied to the feature vectors extracted from sequences of subtrees. Step S102 is carried out by the maximum pair calculation routine shown in FIG. 7.

<Maximum pair calculation routine>
First, in step S200, the array pairs is initialized, and an array holding each of the EDUs contained in the learning data set in step S100 is stored in both oracle and the array beam.

Next, in step S202, it is determined whether the array stored in oracle (as set in step S200 or as updated in the previous execution of step S206) has exactly one element. If it does not, the process proceeds to step S204. If, on the other hand, the processing of steps S204 to S206 described below has produced a single highest-scoring tree, so that the array stored in oracle has exactly one element, the process proceeds to step S214.

In step S204, the function expand-oracle generates, from the sequence of subtrees stored in oracle as updated in the previous execution of step S206, each new sequence of subtrees that is contained in the correct RST-DT of the learning data. The learning feature extraction unit 64 then extracts a feature vector f(o) from each newly generated sequence of subtrees.

In step S206, from the new sequences of subtrees generated in step S204, the sequence whose score, computed according to equation (3) above from the feature vector f(o) extracted in step S204 and the weight vector w, is highest is selected and stored in oracle.

In step S208, for the RST-DT of the learning document in the learning data, the function expand generates a set of sequences of subtrees from each of the sequences of subtrees stored in beam as updated in the previous execution of step S210. The learning feature extraction unit 64 then extracts a feature vector f(S) for each sequence of subtrees S in the newly generated set.

In step S210, from the set of sequences of subtrees generated in step S208, the k sequences with the highest scores, computed according to equation (2) above from the feature vectors f(S) extracted in step S208 and the weight vector w, are selected and stored in the array beam.

In step S212, according to equation (4) above, a pair is generated between the sequence of subtrees stored in oracle as updated in step S206 and each of the sequences of subtrees stored in the array beam as updated in step S210. The pairs are stored in the array pairs, and the process returns to step S202.

In step S214, according to equation (5) above, among the pairs stored in the array pairs in step S212, the pair of a sequence of subtrees contained in the correct RST-DT and a sequence of subtrees of the RST-DT for the learning document for which the difference between the scores computed from the feature vectors and the weight vector w is largest is output, and the maximum pair calculation routine ends.
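
Steps S200 to S214 can be sketched as follows. This is a simplified illustration, not the patented routine. It assumes unlabeled binary trees (a leaf is an EDU index, an internal node is a `(left, right)` tuple), so nuclearity and relation labels are omitted. It uses a toy feature function that only counts span lengths, and it takes the violation as score(S) minus score(o). All function names other than the roles of expand, expand-oracle and max-violation-pair are hypothetical.

```python
from collections import Counter

def span(tree):
    """(first, last) EDU index covered by a tree; a leaf is an int."""
    if isinstance(tree, int):
        return (tree, tree)
    return (span(tree[0])[0], span(tree[1])[1])

def spans(tree):
    """Set of spans of all internal nodes of a tree."""
    if isinstance(tree, int):
        return set()
    return {span(tree)} | spans(tree[0]) | spans(tree[1])

def features(seq):
    """Toy stand-in for f(.): counts internal nodes by span length."""
    feats = Counter()
    def visit(t):
        if not isinstance(t, int):
            lo, hi = span(t)
            feats[("len", hi - lo + 1)] += 1
            visit(t[0])
            visit(t[1])
    for t in seq:
        visit(t)
    return feats

def score(w, seq):
    """Inner product of the weight vector w and f(seq)."""
    return sum(w.get(k, 0.0) * v for k, v in features(seq).items())

def expand(seq):
    """All sequences obtained by merging exactly one adjacent pair."""
    return [seq[:i] + ((seq[i], seq[i + 1]),) + seq[i + 2:]
            for i in range(len(seq) - 1)]

def expand_oracle(seq, gold_spans):
    """Like expand, but keep only merges whose new node is in the gold tree."""
    out = []
    for i in range(len(seq) - 1):
        merged = (seq[i], seq[i + 1])
        if span(merged) in gold_spans:
            out.append(seq[:i] + (merged,) + seq[i + 2:])
    return out

def max_violation_pair(n_edus, gold_tree, w, k):
    gold_spans = spans(gold_tree)
    start = tuple(range(n_edus))
    oracle, beam, pairs = start, [start], []               # S200
    while len(oracle) > 1:                                 # S202
        oracle = max(expand_oracle(oracle, gold_spans),    # S204, S206
                     key=lambda s: score(w, s))
        candidates = list(dict.fromkeys(                   # S208 (deduplicated)
            s for b in beam for s in expand(b)))
        beam = sorted(candidates, key=lambda s: score(w, s),
                      reverse=True)[:k]                    # S210
        pairs.extend((oracle, s) for s in beam)            # S212
    # S214: pair with the largest violation (assumed: score(S) - score(o)).
    return max(pairs, key=lambda p: score(w, p[1]) - score(w, p[0]))

# Hypothetical example: 4 EDUs, gold tree ((e0 e1) (e2 e3)), and weights
# that wrongly favor nodes spanning three EDUs.
gold = ((0, 1), (2, 3))
w = {("len", 3): 1.0}
o, s = max_violation_pair(4, gold, w, k=2)
```

Because the oracle and the beam advance one merge per iteration, both members of the returned pair contain the same number of subtrees, which keeps their feature vectors comparable in the subsequent update.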

Returning to the learning routine, in step S104 the parameter learning unit 66 updates the weight vector w based on the feature vectors extracted from each of the sequences of subtrees in the pair output in step S102.

In step S106, it is determined whether the processing of steps S100 to S104 has been executed for all of the learning data stored in the learning database 62. If it has, the process proceeds to step S108. If any learning data remains for which the processing of steps S100 to S104 has not been executed, the process returns to step S100.

In step S108, it is determined whether the processing of steps S100 to S106 has been repeated a predetermined number of times. If it has, the process proceeds to step S110. If it has not, the process returns to step S100.

Then, in step S110, the weight vector w obtained in step S104 is stored in the parameter database 70, and the learning routine ends.
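
The control flow of steps S100 to S110 can be sketched as a nested loop. This is a minimal sketch under stated assumptions: `find_pair` stands in for the maximum pair calculation routine, `features` stands in for the learning feature extraction unit 64, and returning w stands in for storing it in the parameter database 70. The toy functions below exist only so the loop can be run and are not part of the described method.

```python
def train(data, find_pair, features, epochs):
    """data: list of (document, gold_tree) items.

    find_pair(doc, gold, w) -> (oracle, predicted) is a hypothetical callable.
    """
    w = {}
    for _ in range(epochs):                              # S108: fixed number of passes
        for doc, gold in data:                           # S100, S106: each learning item
            oracle, pred = find_pair(doc, gold, w)       # S102
            for name, v in features(oracle).items():     # S104: w += f(o)
                w[name] = w.get(name, 0.0) + v
            for name, v in features(pred).items():       #       w -= f(S)
                w[name] = w.get(name, 0.0) - v
    return w                                             # S110

# Toy stand-ins (hypothetical): a structure is just a string label.
def toy_features(structure):
    return {structure: 1}

def toy_find_pair(doc, gold, w):
    wrong = "wrong"
    predicted = wrong if w.get(wrong, 0.0) >= w.get(gold, 0.0) else gold
    return gold, predicted

w = train([("doc-1", "right")], toy_find_pair, toy_features, epochs=2)
```

In the toy run the first pass makes a wrong prediction and shifts the weights; in the second pass the prediction equals the oracle, so the two feature vectors cancel and the weights stay unchanged.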

<Operation of the discourse structure analysis device>
Next, the operation of the discourse structure analysis device 100 of this embodiment is described. First, when the weight vector w stored in the parameter database 70 of the parameter learning device 200 is input to the discourse structure analysis device 100, it is stored in the parameter database 20. Then, when an input document to be analyzed is input to the discourse structure analysis device 100, the discourse structure analysis device 100 executes the analysis routine shown in FIG. 8.

First, in step S300, the input unit 10 receives the input document to be analyzed.

Next, in step S302, the EDU division unit 32 divides the input document received in step S300 into EDUs.

In step S304, the discourse structure analysis unit 36 reads the weight vector w stored in the parameter database 20.

In step S306, the discourse structure analysis unit 36 stores in the array beam an array holding each of the EDUs obtained in step S302.

In step S308, the discourse structure analysis unit 36 determines whether the array stored in beam[0] (as set in step S306 or as updated in the previous execution of step S312) has exactly one element. If it does not, the process proceeds to step S310. If, as a result of the processing of steps S310 to S312 described below, the array stored in beam[0] has exactly one element, the process proceeds to step S314.

In step S310, the discourse structure analysis unit 36 uses the function expand to generate a set of sequences of subtrees from each of the sequences of subtrees stored in the array beam (as set in step S306 or as updated in the previous execution of step S312). The feature extraction unit 34 then extracts a feature vector f(S) for each sequence of subtrees S in the generated set.

In step S312, the discourse structure analysis unit 36 selects the k sequences of subtrees with the highest scores, computed according to equation (2) above from the feature vectors f(S) extracted in step S310 and the weight vector w, stores them in the array beam, and returns to step S308.

In step S314, the output unit 40 outputs, as the analysis result, the RST-DT stored in beam[0][0] of the array beam updated in step S312, and the analysis routine ends.
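
The loop of steps S306 to S314 can be sketched as a beam search over sequences of subtrees. This is an illustrative sketch only. It assumes unlabeled binary trees (a leaf is an EDU index, an internal node is a `(left, right)` tuple) and takes the scoring function as a parameter in place of the product of w and f(S). The scorer `toy_score`, which simply prefers right-branching trees, is a made-up stand-in for a learned model.

```python
def expand(seq):
    """All sequences obtained by merging exactly one adjacent pair of subtrees."""
    return [seq[:i] + ((seq[i], seq[i + 1]),) + seq[i + 2:]
            for i in range(len(seq) - 1)]

def parse(n_edus, score, k):
    """Beam-search analysis; returns the best tree found."""
    beam = [tuple(range(n_edus))]                         # S306: one sequence of EDUs
    while len(beam[0]) > 1:                               # S308
        candidates = list(dict.fromkeys(                  # S310 (deduplicated)
            s for b in beam for s in expand(b)))
        beam = sorted(candidates, key=score, reverse=True)[:k]   # S312
    return beam[0][0]                                     # S314: beam[0][0]

def toy_score(seq):
    """Hypothetical scorer: one point per internal node whose left child is a leaf."""
    def visit(t):
        if isinstance(t, int):
            return 0
        return (1 if isinstance(t[0], int) else 0) + visit(t[0]) + visit(t[1])
    return sum(visit(t) for t in seq)

tree = parse(3, toy_score, k=2)
```

Every candidate in the beam loses exactly one subtree per iteration, so all beam entries reach a single tree in the same iteration and checking only `beam[0]` is sufficient to stop the loop.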

As described above, the parameter learning device of this embodiment, for each of a plurality of learning documents, considers pairs of a sequence of subtrees contained in the correct RST-DT and a sequence of subtrees of the RST-DT for the learning document, the latter being selected using the weight vector w applied to the feature vectors extracted from sequences of subtrees. It updates the weight vector w based on the feature vectors extracted from each of the sequences of subtrees in the pair for which the difference between the scores computed from the feature vectors and the weight vector w is largest. This yields a weight vector w that enables accurate discourse structure analysis.

The discourse structure analysis device of this embodiment performs discourse structure analysis using the weight vector w obtained by the parameter learning device, and can therefore analyze discourse structure accurately.

In addition, because the parameter learning device and the discourse structure analysis device of this embodiment perform an easy-first search based on beam search, the analysis is robust to search errors, which enables more accurate discourse structure analysis.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, this embodiment has been described for the case where the EDUs in a document are the nodes of the RST-DT, but each node may instead represent a character string unit other than an EDU. In that case, the EDU division unit 32 divides the document into those character string units, and an RST-DT is constructed with those units as nodes.

The parameter learning device and the discourse structure analysis device of this embodiment are applicable not only to English but also to other languages such as Japanese.

The learning database 62 and the parameter database 70 may be provided outside the parameter learning device and connected to it via a network. Likewise, the parameter database 20 may be provided outside the discourse structure analysis device and connected to it via a network.

The document input to the input unit 10 may already be divided into sentences or EDUs. In that case, the processing of the EDU division unit 32 is omitted.

The above embodiment has been described for the case where the parameter learning device and the discourse structure analysis device are separate devices, but they may be configured as a single device.

The parameter learning device and the discourse structure analysis device described above each contain a computer system. Where a WWW system is used, the term "computer system" also includes the environment for providing (or displaying) web pages.

Although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
10 Input unit
20, 70 Parameter database
30 Arithmetic unit
32 EDU division unit
34 Feature extraction unit
36 Discourse structure analysis unit
40 Output unit
50 Learning input unit
60 Learning arithmetic unit
62 Learning database
64 Learning feature extraction unit
66 Parameter learning unit
68 Iteration determination unit
100 Discourse structure analysis device
200 Parameter learning device

Claims

A parameter learning method in a parameter learning device including a learning input unit, a parameter learning unit, and an iteration determination unit, the method including:
a step in which the learning input unit receives, for each of a plurality of learning documents, a correct rhetorical structure tree corresponding to the learning document, the rhetorical structure tree being based on the rhetorical structure of the sequences of character string units in the learning document, representing a hierarchical structure in which each character string unit in the learning document and each sequence of at least one character string unit of the learning document is a node and the root node represents the whole learning document, and representing the modification relationships and relation labels between the sequences of character string units;
a step in which the parameter learning unit,
for each of the plurality of learning documents,
updates a weight vector based on the feature vectors extracted from each of the sequences of subtrees included in a pair, the pair being, among pairs of a sequence of subtrees contained in the correct rhetorical structure tree and a sequence of subtrees of the rhetorical structure tree corresponding to the learning document that is selected using the weight vector for feature vectors extracted from sequences of subtrees, the pair for which the difference between the scores calculated using the feature vectors and the weight vector is largest; and
a step in which the iteration determination unit repeats the update by the parameter learning unit a predetermined number of times.

The parameter learning method according to claim 1, wherein, in the step of updating the weight vector, the parameter learning unit repeats:
selecting, for the sequence of subtrees contained in the correct rhetorical structure tree, from the set of sequences of subtrees contained in the correct rhetorical structure tree that are generated by merging a pair of adjacent subtrees in the previously selected sequence of subtrees, the sequence of subtrees whose score calculated using the feature vector extracted from the sequence and the weight vector is highest;
selecting, for the sequences of subtrees of the rhetorical structure tree corresponding to the learning document, from the set of sequences of subtrees generated by merging a pair of adjacent subtrees in each of the previously selected sequences of subtrees, the k sequences of subtrees whose scores calculated using the feature vector extracted from each sequence and the weight vector are highest; and
generating pairs of the sequence of subtrees selected for the correct rhetorical structure tree with each of the top-k sequences of subtrees selected for the rhetorical structure tree corresponding to the learning document,
and updates the weight vector based on the feature vectors extracted from each of the sequences of subtrees included in the pair, among the generated pairs, for which the difference between the scores calculated using the feature vectors and the weight vector is largest.

A parameter learning device including:
an input unit that receives, for each of a plurality of learning documents, a correct rhetorical structure tree corresponding to the learning document, the rhetorical structure tree being based on the rhetorical structure of the sequences of character string units in the learning document, representing a hierarchical structure in which each character string unit in the learning document and each sequence of at least one character string unit of the learning document is a node and the root node represents the whole learning document, and representing the modification relationships and relation labels between the sequences of character string units;
a parameter learning unit that, for each of the plurality of learning documents, updates a weight vector based on the feature vectors extracted from each of the sequences of subtrees included in a pair, the pair being, among pairs of a sequence of subtrees contained in the correct rhetorical structure tree and a sequence of subtrees of the rhetorical structure tree corresponding to the learning document that is selected using the weight vector for feature vectors extracted from sequences of subtrees, the pair for which the difference between the scores calculated using the feature vectors and the weight vector is largest; and
an iteration determination unit that repeats the update by the parameter learning unit a predetermined number of times.

A program for causing a computer to function as each unit of the parameter learning device according to claim 3.