JP2020135467A - Discourse structure analysis apparatus, method, and program - Google Patents

Discourse structure analysis apparatus, method, and program Download PDF

Info

Publication number
JP2020135467A
Authority
JP
Japan
Prior art keywords
spans
sentence
paragraph
tree
discourse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2019028629A
Other languages
Japanese (ja)
Other versions
JP7054145B2 (en
Inventor
平尾 努 Tsutomu Hirao
永田 昌明 Masaaki Nagata
小林 尚輝 Naoki Kobayashi
奥村 学 Manabu Okumura
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Tokyo Institute of Technology NUC
Original Assignee
Nippon Telegraph and Telephone Corp
Tokyo Institute of Technology NUC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp, Tokyo Institute of Technology NUC filed Critical Nippon Telegraph and Telephone Corp
Priority to JP2019028629A priority Critical patent/JP7054145B2/en
Publication of JP2020135467A publication Critical patent/JP2020135467A/en
Application granted granted Critical
Publication of JP7054145B2 publication Critical patent/JP7054145B2/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Abstract

To highly accurately construct a discourse structure tree regardless of the number of EDUs. SOLUTION: An in-sentence analysis unit 32, on the basis of parameters of a learned model for estimating EDU vectors each representing each elementary unit of an elementary unit sequence, positions at which the elementary unit sequence is divided into two spans, and combinations of non-terminal symbols given to each of the two spans, divides the elementary unit sequence into two spans, recursively repeats the estimation of the combinations of non-terminal symbols given to each span and the relation labels of the spans, and outputs an in-sentence discourse tree, which is a discourse structure tree using elementary units as units. An in-paragraph analysis unit 34 outputs an in-paragraph discourse tree, which is a discourse structure tree using sentences as units. An in-document analysis unit 36 outputs an in-document discourse tree, which is a discourse structure tree using paragraphs as units. A tree coupling unit 38 outputs a discourse structure tree in which the structures of the elementary units, sentences, and paragraphs of the document are coupled, on the basis of the in-sentence discourse tree, the in-paragraph discourse tree, and the in-document discourse tree. SELECTED DRAWING: Figure 3

Description

本発明は、談話構造解析装置、方法、及びプログラムに係り、特に、文書の談話構造を解析するための談話構造解析装置、方法、及びプログラムに関する。 The present invention relates to a discourse structure analysis device, method, and program, and more particularly to a discourse structure analysis device, method, and program for analyzing the discourse structure of a document.

従来の談話構造解析技術として、文書を、基本単位であるElementary Discourse Unit(EDU)と呼ばれる文よりも小さい、節に相当するテキストユニットの系列データとみなし、EDUをボトムアップに組み上げていくことで文書全体の談話構造木(図1)を構築する手法が提案されている。図1は一般的な談話構造木の一例を示す図である。なお、図1に示すように以下の実施の形態において用いる談話構造木は2分木として表現される(たとえば、非特許文献1など)。図1において、終端記号はEDU(e)であり、非終端記号はそれが支配するスパン(連続したEDUの系列)が核(N)であるか衛星(S)であるかを表す。SからN、NからNをつなぐエッジにはElaboration、Same−Unitなどの関係ラベルが与えられる。 As a conventional discourse structure analysis technique, a method has been proposed in which a document is regarded as a sequence of text units called Elementary Discourse Units (EDUs), basic units that are smaller than sentences and roughly correspond to clauses, and the discourse structure tree for the whole document (FIG. 1) is constructed by combining EDUs bottom-up. FIG. 1 is a diagram showing an example of a general discourse structure tree. As shown in FIG. 1, the discourse structure trees used in the following embodiments are expressed as binary trees (see, for example, Non-Patent Document 1). In FIG. 1, the terminal symbols are EDUs (e), and each non-terminal symbol indicates whether the span it governs (a contiguous sequence of EDUs) is a nucleus (N) or a satellite (S). Relation labels such as Elaboration and Same-Unit are assigned to the edges connecting S to N and N to N.
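For reference only (this is not part of the patent text), such a binary discourse structure tree could be represented by a small data structure like the following Python sketch; the class and field names are illustrative assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass
class DiscourseNode:
    # "N" (nucleus) or "S" (satellite) assigned to the span this node governs
    nuclearity: str
    # text of the EDU if this node is a terminal (leaf) node
    edu: Optional[str] = None
    # relation label (e.g. "Elaboration", "Same-Unit") between the two children
    relation: Optional[str] = None
    left: Optional["DiscourseNode"] = None
    right: Optional["DiscourseNode"] = None

    def is_leaf(self) -> bool:
        return self.edu is not None

# A two-EDU tree: a nucleus elaborated by a satellite.
tree = DiscourseNode(
    nuclearity="N",
    relation="Elaboration",
    left=DiscourseNode(nuclearity="N", edu="e1: The parser builds a tree,"),
    right=DiscourseNode(nuclearity="S", edu="e2: which covers the whole document."),
)
print(tree.relation, tree.left.nuclearity, tree.right.nuclearity)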

duVerle, David and Prendinger, Helmut, "A Novel Discourse Parser Based on Support Vector Machine Classification", Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 665-673, 2009

従来の方法は、文書中の文、段落といった明示的に利用できる構造を利用せずに単にEDUの系列としてとらえている。一般的には文書中のEDUの数が数十におよぶことは珍しくないため、多くのEDUを考慮しつつ木を構築していかなければならず解析性能が劣化する。また、文書中の文、段落といった構造を無視して、ボトムアップに木を構築していくとエラーが累積し、解析性能が劣化するという問題があった。 The conventional method treats a document simply as a sequence of EDUs, without using explicitly available structures such as sentences and paragraphs in the document. Since it is not uncommon for a document to contain several dozen EDUs, the tree must be constructed while taking many EDUs into account, which degrades parsing performance. In addition, when the tree is constructed bottom-up while ignoring structures such as sentences and paragraphs, errors accumulate and parsing performance deteriorates.

本発明は、上記事情を鑑みて成されたものであり、EDUの数に関わらず、精度よく、談話構造木を構築できる談話構造解析装置、方法、及びプログラムを提供することを目的とする。 The present invention has been made in view of the above circumstances, and an object of the present invention is to provide a discourse structure analysis device, a method, and a program capable of constructing a discourse structure tree with high accuracy regardless of the number of EDUs.

上記目的を達成するために、第1の発明に係る談話構造解析装置は、文書について、前記文書の段落の系列への分割と、各段落に含まれる文の系列への分割と、各文に含まれる基本単位の系列への分割とを行う部分構造解析部と、各文について、前記文に含まれる前記基本単位の系列の各基本単位を表すEDU(Elementary Discourse Unit)ベクトルと、前記基本単位の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文に含まれる前記基本単位の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記基本単位の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記基本単位となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、基本単位を単位とした談話構造木である文内談話木を出力する文内解析部と、各段落について、前記段落に含まれる前記文の系列の各文を表す文ベクトルと、前記文の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記段落に含まれる前記文の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記文の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記文となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、文を単位とした談話構造木である段落内談話木を出力する段落内解析部と、前記文書に含まれる前記段落の系列の各段落を表す段落ベクトルと、前記段落の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文書に含まれる前記段落の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記段落の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記段落となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、段落を単位とした談話構造木である文書内談話木を出力する文書内解析部と、前記文内談話木と、前記段落内談話木と、前記文書内談話木とに基づいて、前記文書の前記基本単位と前記文と前記段落との構造を結合した談話構造木を出力する木結合部と、を含んで構成されている。 In order to achieve the above object, a discourse structure analysis apparatus according to a first aspect of the invention includes: a partial structure analysis unit that divides a document into a sequence of paragraphs, divides each paragraph into a sequence of sentences, and divides each sentence into a sequence of elementary units; an in-sentence analysis unit that, for each sentence, on the basis of EDU (Elementary Discourse Unit) vectors each representing an elementary unit of the sequence of elementary units included in the sentence and parameters of a trained model for estimating a position at which the sequence of elementary units is divided into two spans and a combination of non-terminal symbols to be assigned to the two spans, recursively repeats dividing the sequence of elementary units included in the sentence into two spans, estimating the combination of non-terminal symbols to be assigned to each of the two spans, and estimating the relation label of the two spans with respect to the sequence of elementary units, until each span becomes an elementary unit, and outputs an in-sentence discourse tree, which is a discourse structure tree whose units are the elementary units and which is represented as a binary tree in which each span is a node assigned a non-terminal symbol; an in-paragraph analysis unit that, for each paragraph, on the basis of sentence vectors each representing a sentence of the sequence of sentences included in the paragraph and parameters of a trained model for estimating a position at which the sequence of sentences is divided into two spans and a combination of non-terminal symbols to be assigned to the two spans, recursively repeats dividing the sequence of sentences included in the paragraph into two spans, estimating the combination of non-terminal symbols to be assigned to each of the two spans, and estimating the relation label of the two spans with respect to the sequence of sentences, until each span becomes a sentence, and outputs an in-paragraph discourse tree, which is a discourse structure tree whose units are sentences and which is represented as a binary tree in which each span is a node assigned a non-terminal symbol; an in-document analysis unit that, on the basis of paragraph vectors each representing a paragraph of the sequence of paragraphs included in the document and parameters of a trained model for estimating a position at which the sequence of paragraphs is divided into two spans and a combination of non-terminal symbols to be assigned to the two spans, recursively repeats dividing the sequence of paragraphs included in the document into two spans, estimating the combination of non-terminal symbols to be assigned to each of the two spans, and estimating the relation label of the two spans with respect to the sequence of paragraphs, until each span becomes a paragraph, and outputs an in-document discourse tree, which is a discourse structure tree whose units are paragraphs and which is represented as a binary tree in which each span is a node assigned a non-terminal symbol; and a tree combination unit that, on the basis of the in-sentence discourse tree, the in-paragraph discourse tree, and the in-document discourse tree, outputs a discourse structure tree in which the structures of the elementary units, the sentences, and the paragraphs of the document are combined.

また、第1の発明に係る談話構造解析装置において、前記分割する位置は、前記学習済みのモデルのパラメタに基づいて定義される、前記分割する位置で分割したときに得られる前記二つのスパンのもっともらしさを最大にする位置とするようにしてもよい。 Further, in the discourse structure analysis apparatus according to the first aspect, the division position may be a position, defined on the basis of the parameters of the trained model, that maximizes the plausibility of the two spans obtained when dividing at that position.

第2の発明に係る談話構造解析方法は、部分構造解析部が、文書について、前記文書の段落の系列への分割と、各段落に含まれる文の系列への分割と、各文に含まれる基本単位の系列への分割とを行うステップと、文内解析部が、各文について、前記文に含まれる前記基本単位の系列の各基本単位を表すEDU(Elementary Discourse Unit)ベクトルと、前記基本単位の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文に含まれる前記基本単位の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記基本単位の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記基本単位となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、基本単位を単位とした談話構造木である文内談話木を出力するステップと、段落内解析部が、各段落について、前記段落に含まれる前記文の系列の各文を表す文ベクトルと、前記文の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記段落に含まれる前記文の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記文の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記文となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、文を単位とした談話構造木である段落内談話木を出力するステップと、文書内解析部が、前記文書に含まれる前記段落の系列の各段落を表す段落ベクトルと、前記段落の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文書に含まれる前記段落の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記段落の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記段落となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、段落を単位とした談話構造木である文書内談話木を出力するステップと、木結合部が、前記文内談話木と、前記段落内談話木と、前記文書内談話木とに基づいて、前記文書の前記基本単位と前記文と前記段落との構造を結合した談話構造木を出力するステップと、を含んで実行することを特徴とする。 A discourse structure analysis method according to a second aspect of the invention includes: a step in which a partial structure analysis unit divides a document into a sequence of paragraphs, divides each paragraph into a sequence of sentences, and divides each sentence into a sequence of elementary units; a step in which an in-sentence analysis unit, for each sentence, on the basis of EDU (Elementary Discourse Unit) vectors each representing an elementary unit of the sequence of elementary units included in the sentence and parameters of a trained model for estimating a position at which the sequence of elementary units is divided into two spans and a combination of non-terminal symbols to be assigned to the two spans, recursively repeats dividing the sequence of elementary units included in the sentence into two spans, estimating the combination of non-terminal symbols to be assigned to each of the two spans, and estimating the relation label of the two spans with respect to the sequence of elementary units, until each span becomes an elementary unit, and outputs an in-sentence discourse tree, which is a discourse structure tree whose units are the elementary units and which is represented as a binary tree in which each span is a node assigned a non-terminal symbol; a step in which an in-paragraph analysis unit, for each paragraph, on the basis of sentence vectors each representing a sentence of the sequence of sentences included in the paragraph and parameters of a trained model for estimating a position at which the sequence of sentences is divided into two spans and a combination of non-terminal symbols to be assigned to the two spans, recursively repeats dividing the sequence of sentences included in the paragraph into two spans, estimating the combination of non-terminal symbols to be assigned to each of the two spans, and estimating the relation label of the two spans with respect to the sequence of sentences, until each span becomes a sentence, and outputs an in-paragraph discourse tree, which is a discourse structure tree whose units are sentences and which is represented as a binary tree in which each span is a node assigned a non-terminal symbol; a step in which an in-document analysis unit, on the basis of paragraph vectors each representing a paragraph of the sequence of paragraphs included in the document and parameters of a trained model for estimating a position at which the sequence of paragraphs is divided into two spans and a combination of non-terminal symbols to be assigned to the two spans, recursively repeats dividing the sequence of paragraphs included in the document into two spans, estimating the combination of non-terminal symbols to be assigned to each of the two spans, and estimating the relation label of the two spans with respect to the sequence of paragraphs, until each span becomes a paragraph, and outputs an in-document discourse tree, which is a discourse structure tree whose units are paragraphs and which is represented as a binary tree in which each span is a node assigned a non-terminal symbol; and a step in which a tree combination unit, on the basis of the in-sentence discourse tree, the in-paragraph discourse tree, and the in-document discourse tree, outputs a discourse structure tree in which the structures of the elementary units, the sentences, and the paragraphs of the document are combined.

また、第2の発明に係る談話構造解析方法において、前記分割する位置は、前記学習済みのモデルのパラメタに基づいて定義される、前記分割する位置で分割したときに得られる前記二つのスパンのもっともらしさを最大にする位置とするようにしてもよい。 Further, in the discourse structure analysis method according to the second aspect, the division position may be a position, defined on the basis of the parameters of the trained model, that maximizes the plausibility of the two spans obtained when dividing at that position.

第3の発明に係るプログラムは、第1の発明に記載の談話構造解析装置の各部として機能させるためのプログラムである。 A program according to a third aspect of the invention is a program for causing a computer to function as each unit of the discourse structure analysis apparatus according to the first aspect.

本発明の談話構造解析装置、方法、及びプログラムによれば、EDUの数に関わらず、精度よく、談話構造木を構築できる、という効果が得られる。 According to the discourse structure analysis device, method, and program of the present invention, it is possible to obtain the effect that the discourse structure tree can be constructed with high accuracy regardless of the number of EDUs.

一般的な談話構造木の一例を示す図である。 FIG. 1 is a diagram showing an example of a general discourse structure tree.
文、段落、及び文書の部分構造木への分割例を示す図である。 FIG. 2 is a diagram showing an example of dividing a document into partial structure trees for sentences, paragraphs, and the document.
本発明の実施の形態に係る談話構造解析装置の構成を示すブロック図である。 FIG. 3 is a block diagram showing the configuration of the discourse structure analysis apparatus according to the embodiment of the present invention.
EDUを葉とする文内談話木の一例を示す図である。 FIG. 4 is a diagram showing an example of an in-sentence discourse tree whose leaves are EDUs.
文を葉とする文書内談話木の一例を示す図である。 FIG. 5 is a diagram showing an example of a discourse tree whose leaves are sentences.
段落を葉とする文書内談話木の一例を示す図である。 FIG. 6 is a diagram showing an example of an in-document discourse tree whose leaves are paragraphs.
文内解析部、段落内解析部、及び文書内解析部に対応する具体的な内部構成を示す図である。 FIG. 7 is a diagram showing a specific internal configuration corresponding to the in-sentence analysis unit, the in-paragraph analysis unit, and the in-document analysis unit.
分類する関係ラベルの18種の種類の一例を示す図である。 FIG. 8 is a diagram showing an example of the 18 types of relation labels to be classified.
本発明の実施の形態に係る談話構造解析装置の談話構造解析処理ルーチンを示すフローチャートである。 FIG. 9 is a flowchart showing the discourse structure analysis processing routine of the discourse structure analysis apparatus according to the embodiment of the present invention.

以下、図面を参照して本発明の実施の形態を詳細に説明する。 Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

本発明の実施の形態では、上記課題に対して、文書を文、段落、及び文書という3つの部分構造に分割し、それぞれの構造に対して、トップダウンで解析を行う。つまり、EDU系列、文系列、段落系列を2分することを繰り返し、木を構築する。図2は文、段落、及び文書の部分構造木への分割例を示す図である。 In the embodiment of the present invention, the document is divided into three substructures of a sentence, a paragraph, and a document for the above problem, and each structure is analyzed from the top down. That is, the EDU series, the sentence series, and the paragraph series are repeatedly divided into two to construct a tree. FIG. 2 is a diagram showing an example of dividing a sentence, a paragraph, and a document into substructure trees.

<本発明の実施の形態に係る談話構造解析装置の構成> <Structure of Discourse Structure Analysis Device According to the Embodiment of the Present Invention>

次に、本発明の実施の形態に係る談話構造解析装置の構成について説明する。図3に示すように、本発明の実施の形態に係る談話構造解析装置100は、CPUと、RAMと、後述する談話構造解析処理ルーチンを実行するためのプログラム及び各種データを記憶したROMと、を含むコンピュータで構成することが出来る。この談話構造解析装置100は、機能的には図3に示すように入力部10と、演算部20と、出力部50とを備えている。 Next, the configuration of the discourse structure analysis apparatus according to the embodiment of the present invention will be described. As shown in FIG. 3, the discourse structure analysis apparatus 100 according to the embodiment of the present invention can be configured by a computer including a CPU, a RAM, and a ROM that stores a program for executing the discourse structure analysis processing routine described later and various data. Functionally, as shown in FIG. 3, the discourse structure analysis apparatus 100 includes an input unit 10, a calculation unit 20, and an output unit 50.

入力部10は、談話構造を解析する対象となる文書を受け付ける。 The input unit 10 receives a document to be analyzed for the discourse structure.

演算部20は、部分構造解析部30と、文内解析部32と、段落内解析部34と、文書内解析部36と、木結合部38とを含んで構成されている。 The calculation unit 20 includes a partial structure analysis unit 30, an in-sentence analysis unit 32, an in-paragraph analysis unit 34, an in-document analysis unit 36, and a tree connection unit 38.

図4はEDUを葉とする文内談話木の一例を示す図である。図5は文を葉とする文書内談話木の一例を示す図である。図6は段落を葉とする文書内談話木の一例を示す図である。 FIG. 4 is a diagram showing an example of an in-text discourse tree having EDU as a leaf. FIG. 5 is a diagram showing an example of an in-document discourse tree having a sentence as a leaf. FIG. 6 is a diagram showing an example of an in-document discourse tree having paragraphs as leaves.

談話構造解析装置100の処理の概要を説明する。談話構造解析装置100は、入力として文書を受け取ると、文書を、文、段落、文書という構造に分割し、それぞれをEDU系列、文系列、段落系列として扱う。文内解析部32で、EDUを葉とする文内談話木(図4)を構築する。段落内解析部34で、文を葉とする段落内談話木(図5)、文書内解析部36で、段落を葉とする文書内談話木(図6)を構築する。木結合部38は、これらの木を結合し、最終的に談話構造木を出力する。EDUが基本単位の一例である。 The outline of the processing of the discourse structure analysis device 100 will be described. When the discourse structure analysis device 100 receives a document as input, it divides the document into a structure of a sentence, a paragraph, and a document, and treats each as an EDU sequence, a sentence sequence, and a paragraph sequence. The sentence analysis unit 32 constructs a sentence discourse tree (FIG. 4) with EDU as a leaf. The in-paragraph analysis unit 34 constructs an in-paragraph discourse tree with sentences as leaves (FIG. 5), and the in-document analysis unit 36 constructs an in-document discourse tree with paragraphs as leaves (FIG. 6). The tree joining unit 38 joins these trees and finally outputs a discourse structure tree. EDU is an example of a basic unit.

図7は文内解析部32、段落内解析部34、及び文書内解析部36に対応する具体的な内部構成を示す図である。文内解析部32、段落内解析部34、及び文書内解析部36の具体的な内部処理は、図7に示す構成の各処理部によって実現される。内部処理を行う各処理部は、パラメタ学習部220と、ベクトル変換部230と、最適分割部232と、パラメタ記憶部234と、関係分類部236とを含んで構成される。内部処理については後述する。 FIG. 7 is a diagram showing a specific internal configuration corresponding to the in-sentence analysis unit 32, the in-paragraph analysis unit 34, and the in-document analysis unit 36. The specific internal processing of the in-sentence analysis unit 32, the in-paragraph analysis unit 34, and the in-document analysis unit 36 is realized by each processing unit having the configuration shown in FIG. Each processing unit that performs internal processing includes a parameter learning unit 220, a vector conversion unit 230, an optimum division unit 232, a parameter storage unit 234, and a relation classification unit 236. The internal processing will be described later.

以下、談話構造解析装置100の各処理部について説明する。 Hereinafter, each processing unit of the discourse structure analysis device 100 will be described.

部分構造解析部30は、入力部10で受け付けた文書について、文書の段落の系列(段落系列)への分割と、各段落に含まれる文の系列(文系列)への分割と、各文に含まれるEDUの系列(EDU系列)への分割とを行う。 The partial structure analysis unit 30 divides the document received by the input unit 10 into a sequence of paragraphs (paragraph sequence), divides each paragraph into a sequence of sentences (sentence sequence), and divides each sentence into a sequence of EDUs (EDU sequence).

具体的には、部分構造解析部30は、以下に説明するように、文書から、文、段落、文書の3つの構造に分割し、それぞれEDU系列、文系列、段落系列として出力する。EDU系列への分割は、文をEDUへ分割する既存技術が提案されているのでそれを用いればよい。文系列への分割は、句点を手がかりに文を認定すればよい。また、文系列への分割は、既存の文境界認定器を利用することも可能である。段落系列への分割は、空行、字下げなどを手がかりとして分割すればよい。手がかりの情報がない場合には既存技術を用いて段落境界を認定すればよい。 Specifically, as described below, the partial structure analysis unit 30 divides the document into the three structures of sentences, paragraphs, and the document, and outputs them as an EDU sequence, a sentence sequence, and a paragraph sequence, respectively. For the division into the EDU sequence, existing techniques for segmenting a sentence into EDUs have been proposed and may be used. For the division into the sentence sequence, sentences may be identified using sentence-final punctuation as a clue, and an existing sentence boundary detector may also be used. The division into the paragraph sequence may be performed using blank lines, indentation, and the like as clues; when no such clues are available, paragraph boundaries may be identified using an existing technique.
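As an informal illustration only (the patent defers to existing techniques for EDU segmentation and boundary detection), the following Python sketch uses the simple clues mentioned above; the comma-based EDU split is a stand-in for a real segmenter, not the technique actually used.

import re

def split_into_paragraphs(document: str) -> list[str]:
    # Paragraph boundaries are identified using blank lines as the clue.
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

def split_into_sentences(paragraph: str) -> list[str]:
    # Sentence boundaries are identified using sentence-final punctuation
    # (the Japanese full stop "。" or ". ") as the clue.
    parts = re.split(r"(?<=[。.!?])\s+|(?<=。)", paragraph)
    return [s.strip() for s in parts if s.strip()]

def split_into_edus(sentence: str) -> list[str]:
    # EDU segmentation is delegated to an existing technique in the patent;
    # splitting on commas here is for illustration only.
    parts = re.split(r"(?<=[、,])", sentence)
    return [e.strip() for e in parts if e.strip()]

doc = "First sentence, with two clauses. Second sentence.\n\nA new paragraph."
for para in split_into_paragraphs(doc):
    for sent in split_into_sentences(para):
        print(split_into_edus(sent))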

文内解析部32は、各文について、当該文に含まれるEDUの系列を二つのスパンに分割し、かつ、二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共にEDUの系列に対する二つのスパンの関係ラベルを推定することを、スパンの各々がEDUとなるまで再帰的に繰り返す。組み合わせの推定は、当該文に含まれるEDUの系列の各EDUを表すEDUベクトルと、EDUの系列を二つのスパンに分割する位置、及び二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタ(後述するパラメタ記憶部234に記憶)とに基づく。文内解析部32は、再帰的な処理により、スパンの各々をノードとし、非終端記号が付与された二分木で表される、EDUを単位とした談話構造木である文内談話木を出力する。 For each sentence, the in-sentence analysis unit 32 recursively repeats dividing the sequence of EDUs included in the sentence into two spans, estimating the combination of non-terminal symbols to be assigned to the two spans, and estimating the relation label of the two spans with respect to the EDU sequence, until each span consists of a single EDU. This estimation is based on EDU vectors, each representing an EDU of the EDU sequence included in the sentence, and on the parameters of a trained model (stored in the parameter storage unit 234 described later) for estimating the position at which the EDU sequence is divided into two spans and the combination of non-terminal symbols to be assigned to the two spans. Through this recursive processing, the in-sentence analysis unit 32 outputs an in-sentence discourse tree, a discourse structure tree in units of EDUs represented as a binary tree in which each span is a node assigned a non-terminal symbol.

段落内解析部34は、各段落について、当該段落に含まれる文の系列を二つのスパンに分割し、かつ、二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に文の系列に対する二つのスパンの関係ラベルを推定することを、スパンの各々が文となるまで再帰的に繰り返す。組み合わせの推定は、当該段落に含まれる文の系列の各文を表す文ベクトルと、学習済みのモデルのパラメタとに基づく。段落内解析部34は、再帰的な処理により、スパンの各々をノードとし、非終端記号が付与された二分木で表される、文を単位とした談話構造木である段落内談話木を出力する。 For each paragraph, the in-paragraph analysis unit 34 recursively repeats dividing the sequence of sentences included in the paragraph into two spans, estimating the combination of non-terminal symbols to be assigned to the two spans, and estimating the relation label of the two spans with respect to the sentence sequence, until each span consists of a single sentence. This estimation is based on sentence vectors, each representing a sentence of the sentence sequence included in the paragraph, and on the parameters of a trained model. Through this recursive processing, the in-paragraph analysis unit 34 outputs an in-paragraph discourse tree, a discourse structure tree in units of sentences represented as a binary tree in which each span is a node assigned a non-terminal symbol.

文書内解析部36は、文書に含まれる段落の系列を二つのスパンに分割し、かつ、二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に段落の系列に対する二つのスパンの関係ラベルを推定することを、スパンの各々が段落となるまで再帰的に繰り返す。組み合わせの推定は、文書に含まれる段落の系列の各段落を表す段落ベクトルと、学習済みのモデルのパラメタとに基づく。文書内解析部36は、再帰的な処理により、スパンの各々をノードとし、非終端記号が付与された二分木で表される、段落を単位とした談話構造木である文書内談話木を出力する。 The in-document analysis unit 36 recursively repeats dividing the sequence of paragraphs included in the document into two spans, estimating the combination of non-terminal symbols to be assigned to the two spans, and estimating the relation label of the two spans with respect to the paragraph sequence, until each span consists of a single paragraph. This estimation is based on paragraph vectors, each representing a paragraph of the paragraph sequence included in the document, and on the parameters of a trained model. Through this recursive processing, the in-document analysis unit 36 outputs an in-document discourse tree, a discourse structure tree in units of paragraphs represented as a binary tree in which each span is a node assigned a non-terminal symbol.

木結合部38は、文内解析部32が出力した文内談話木と、段落内解析部34が出力した段落内談話木と、文書内解析部36が出力した文書内談話木とに基づいて、文書のEDUと文と段落との構造を結合した談話構造木を出力部50に出力する。 Based on the in-sentence discourse trees output by the in-sentence analysis unit 32, the in-paragraph discourse trees output by the in-paragraph analysis unit 34, and the in-document discourse tree output by the in-document analysis unit 36, the tree combination unit 38 outputs, to the output unit 50, a discourse structure tree in which the structures of the EDUs, sentences, and paragraphs of the document are combined.
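The patent does not spell out the combination procedure beyond this description, so the following Python sketch only illustrates one natural reading: each sentence leaf of an in-paragraph tree is replaced by that sentence's in-sentence tree, and each paragraph leaf of the in-document tree is replaced by the expanded in-paragraph tree. The dict-based node format and the global indexing of leaves are assumptions made for illustration.

def combine_trees(document_tree, paragraph_trees, sentence_trees):
    # document_tree:       tree over paragraphs; leaves are {"leaf": paragraph index}
    # paragraph_trees[i]:  tree over the sentences of paragraph i; leaves are
    #                      {"leaf": global sentence index}
    # sentence_trees[j]:   tree over the EDUs of (global) sentence j
    def substitute(node, subtrees):
        if "leaf" in node:
            return subtrees[node["leaf"]]
        return {"labels": node.get("labels"),
                "left": substitute(node["left"], subtrees),
                "right": substitute(node["right"], subtrees)}

    # Expand sentence leaves into EDU-level trees, then paragraph leaves
    # into the expanded paragraph trees.
    expanded = [substitute(t, sentence_trees) for t in paragraph_trees]
    return substitute(document_tree, expanded)

# Tiny example: one paragraph containing two sentences of two EDUs each.
sentence_trees = [{"labels": ("N", "S"), "left": {"leaf": "e1"}, "right": {"leaf": "e2"}},
                  {"labels": ("N", "S"), "left": {"leaf": "e3"}, "right": {"leaf": "e4"}}]
paragraph_trees = [{"labels": ("N", "N"), "left": {"leaf": 0}, "right": {"leaf": 1}}]
document_tree = {"leaf": 0}
print(combine_trees(document_tree, paragraph_trees, sentence_trees))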

次に、図7の文内解析部32、段落内解析部34、及び文書内解析部36の内部処理について、文内解析部32の場合を例に説明する。 Next, the internal processing of the in-sentence analysis unit 32, the in-paragraph analysis unit 34, and the in-document analysis unit 36 in FIG. 7 will be described by taking the case of the in-sentence analysis unit 32 as an example.

ベクトル変換部230は、入力されたEDU系列をEDUベクトル系列に変換する。ベクトル変換部230は、EDUベクトル系列に基づいて任意のスパン(i番目のEDUからj(i<j)番目のEDUまでの連続したEDU系列)のベクトルを双方向LSTMを用いて構築する。EDUベクトルは、EDUに含まれる単語のベクトルの加重平均として表現される。単語のベクトルとしては、既存技術で得た単語ベクトルを利用すればよい。i番目のEDUからj番目のEDUで構成されるスパンのベクトルは、EDUベクトル系列全体を前向きLSTM、後ろ向きLSTMへ入力し、それぞれの内部状態ベクトルの差分を連結したものとする。つまり、Si,j=fi−fj;bj−biとなる。fは前向きLSTMから得た内部状態ベクトル、bは後ろ向きLSTMから得た内部状態ベクトルである。また、「;」はベクトルを連結することを表す。 The vector conversion unit 230 converts the input EDU sequence into an EDU vector sequence. Based on the EDU vector sequence, the vector conversion unit 230 constructs, using a bidirectional LSTM, a vector for an arbitrary span (the contiguous EDU sequence from the i-th EDU to the j-th EDU, with i < j). Each EDU vector is expressed as a weighted average of the vectors of the words contained in the EDU; word vectors obtained with existing techniques may be used. The vector of the span consisting of the i-th through j-th EDUs is obtained by feeding the entire EDU vector sequence to a forward LSTM and a backward LSTM and concatenating the differences of their internal state vectors, that is, S_{i,j} = f_i − f_j ; b_j − b_i, where f is an internal state vector obtained from the forward LSTM, b is an internal state vector obtained from the backward LSTM, and ";" denotes vector concatenation.
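For reference only, a minimal sketch of this span-vector construction, assuming PyTorch; the dimensions, the 0-based indexing, and the randomly initialized inputs are illustrative assumptions rather than details taken from the patent.

import torch
import torch.nn as nn

emb_dim, hidden = 100, 128
n_edus = 6
edu_vectors = torch.randn(1, n_edus, emb_dim)    # stand-in for weighted averages of word vectors

bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
states, _ = bilstm(edu_vectors)                   # shape (1, n_edus, 2 * hidden)
f = states[0, :, :hidden]                         # forward hidden states f_1..f_n
b = states[0, :, hidden:]                         # backward hidden states b_1..b_n

def span_vector(i: int, j: int) -> torch.Tensor:
    # S_{i,j} = [f_i - f_j ; b_j - b_i], following the difference-and-concatenate
    # construction described above.
    return torch.cat([f[i] - f[j], b[j] - b[i]], dim=-1)

print(span_vector(0, 3).shape)    # torch.Size([256])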

最適分割部232は、i番目のEDUからj番目のEDUで構成されるスパンのベクトルとパラメタを受け取り、入力されたスパンを2つのスパンに分割し、それぞれのスパンのラベルを与える。 The optimum division unit 232 receives the vector and parameters of the span composed of the i-th EDU to the j-th EDU, divides the input span into two spans, and gives a label for each span.

具体的には、最適分割部232は、非終端記号のラベルの組み合わせ(l∈{N−S,S−N,N−N})のもっともらしさを表すスコアを以下の(1)式で定義する。 Specifically, the optimum division unit 232 defines a score representing the plausibility of a combination of non-terminal labels (l ∈ {N−S, S−N, N−N}) by the following equation (1).


・・・(1)

非終端記号のラベルの組み合わせは、i番目のEDUからj番目のEDUで構成されるスパンをあるEDU直後で分割した際の2つのスパンに対して与えるラベルの組み合わせである。なお、S−Sというラベルの組み合わせは談話構造解析の理論上ではありえない。また、Wl、vl、blは学習済みモデルのパラメタ行列であり、パラメタ記憶部234に記憶されている。学習済みモデルのパラメタ行列Wl、vl、blは、ラベル付きのEDU系列を入力として、パラメタ学習部220により予め学習しておけばよい。パラメタ学習部220については後述する。 A combination of non-terminal labels is the pair of labels assigned to the two spans obtained when the span consisting of the i-th through j-th EDUs is divided immediately after some EDU. Note that the label combination S−S is theoretically impossible in discourse structure analysis. W_l, v_l, and b_l are parameter matrices of the trained model and are stored in the parameter storage unit 234. These parameter matrices W_l, v_l, and b_l may be learned in advance by the parameter learning unit 220, using labeled EDU sequences as input. The parameter learning unit 220 will be described later.

最適分割部232は、i番目のEDUからj番目のEDUからなるスパンに対して、k番目のEDU(i≦k<j)の直後でスパンを分割する際のもっともらしさを表すスコアを以下の(2)式で定義する。 For the span consisting of the i-th through j-th EDUs, the optimum division unit 232 defines a score representing the plausibility of dividing the span immediately after the k-th EDU (i ≤ k < j) by the following equation (2).


・・・(2)

また、最適分割部232は、以下の(3)式にてスパンとしてのもっともらしさを最大にする位置kにてスパンを分割し、分割した2つのスパンに対してラベルを付与する。 Further, the optimum division unit 232 divides the span at the position k that maximizes the plausibility as a span according to the following equation (3), and assigns a label to the two divided spans.


・・・(3)

ここで、Sbest()は以下の(4)式で定義する。 Here, S best () is defined by the following equation (4).


・・・(4)

このように、スパンを分割する位置は、パラメタ記憶部234の学習済みのモデルのパラメタに基づいて定義される、分割する位置で分割したときに得られる二つのスパンのもっともらしさを最大にする位置となる。 In this way, the position at which a span is divided is the position, defined on the basis of the parameters of the trained model in the parameter storage unit 234, that maximizes the plausibility of the two spans obtained when dividing at that position.
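Equations (1) through (4) themselves are not reproduced in this text, so the following Python sketch only illustrates one plausible parameterization consistent with the description: a per-label feed-forward scorer built from parameters W_l, v_l, b_l applied to the two candidate span vectors, with the split position and label combination chosen to maximize the score. The exact form of the scorer is an assumption, not the patent's actual equations.

import torch
import torch.nn as nn

LABELS = ["N-S", "S-N", "N-N"]     # S-S is excluded as theoretically impossible

class SplitScorer(nn.Module):
    def __init__(self, span_dim: int):
        super().__init__()
        # One (W_l, b_l) hidden layer and one v_l output vector per label pair.
        self.hidden = nn.ModuleDict({l: nn.Linear(2 * span_dim, span_dim) for l in LABELS})
        self.out = nn.ModuleDict({l: nn.Linear(span_dim, 1) for l in LABELS})

    def label_score(self, left: torch.Tensor, right: torch.Tensor, label: str) -> torch.Tensor:
        # Plausibility of assigning this label pair to the two candidate spans.
        h = torch.relu(self.hidden[label](torch.cat([left, right], dim=-1)))
        return self.out[label](h).squeeze(-1)

    def best_split(self, span_vector, i: int, j: int):
        # Choose the split position k (i <= k < j) and label combination l
        # that maximize the score, i.e. the most plausible division.
        best = None
        for k in range(i, j):
            left, right = span_vector(i, k), span_vector(k + 1, j)
            for label in LABELS:
                s = self.label_score(left, right, label).item()
                if best is None or s > best[0]:
                    best = (s, k, label)
        return best[1], best[2]

scorer = SplitScorer(span_dim=256)
dummy_span_vector = lambda a, b: torch.randn(256)    # stand-in for the BiLSTM span vectors
print(scorer.best_split(dummy_span_vector, 0, 3))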

上述したように、最適分割部232は、i番目のEDUからj番目のEDUで構成されるスパンのベクトルとパラメタを受け取り、以下の(5)式、(6)式に従って、入力されたスパンを位置^kで2つのスパンに分割し、それぞれのスパンのラベルの組み合わせ^lを与える。 As described above, the optimum division unit 232 receives the vectors and parameters for the span consisting of the i-th through j-th EDUs, divides the input span into two spans at the position ^k according to the following equations (5) and (6), and assigns the label combination ^l to the two resulting spans.

・・・(5)

・・・(6)

最適分割部232は、i番目のEDUからj番目のEDUとして文の先頭のEDUから末尾のEDUを与え、2つのスパンに分割する手続きを再帰的に繰り返し、分割されたスパンが単体のEDUになるまで繰り返す。この手続が終了すると、文に対して非終端記号がNかS、終端記号がEDUとなる2分木が構築される。 Starting with the span that runs from the first EDU of the sentence to its last EDU, the optimum division unit 232 recursively repeats the procedure of dividing a span into two spans until every resulting span consists of a single EDU. When this procedure is completed, a binary tree is constructed for the sentence in which the non-terminal symbols are N or S and the terminal symbols are EDUs.
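As an illustration of this recursive top-down construction only, the following Python sketch builds such a binary tree; the node representation and the toy split rule are assumptions, and in practice best_split would be the scored selection of ^k and ^l described above.

def build_tree(best_split, i: int, j: int):
    # Recursively split the span of EDUs e_i..e_j until every span is a
    # single EDU, producing a binary tree; leaves are EDU indices.
    if i == j:                        # a single EDU: terminal node
        return i
    k, label_pair = best_split(i, j)  # e.g. ("N", "S") for the two children
    return {
        "labels": label_pair,
        "left": build_tree(best_split, i, k),
        "right": build_tree(best_split, k + 1, j),
    }

# Toy split rule that always divides a span in the middle, for illustration.
toy_split = lambda i, j: ((i + j) // 2, ("N", "S"))
print(build_tree(toy_split, 0, 3))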

関係分類部236は、ラベル付きの2つのスパンを受け取り関係ラベルを出力する。関係分類部236は、訓練データから正解の2つのラベル付きスパンが与えられたときに正解の関係ラベルを出力するように学習したモデル(図示省略)を用いればよい。図8は、分類する関係ラベルの18種の種類の一例を示す図である。 The relationship classification unit 236 receives two labeled spans and outputs the relationship label. The relationship classification unit 236 may use a model (not shown) learned to output the correct relationship label when two labeled spans of the correct answer are given from the training data. FIG. 8 is a diagram showing an example of 18 types of related labels to be classified.
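The patent only states that a model is trained to output the correct relation label given two labeled spans, so the following sketch uses a simple feed-forward classifier over the two span vectors and their N/S labels as a stand-in for whatever model is actually used; the architecture and the 18-way output are illustrative.

import torch
import torch.nn as nn

N_RELATIONS = 18    # number of relation label classes (FIG. 8)

class RelationClassifier(nn.Module):
    def __init__(self, span_dim: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(2 * span_dim + 2, span_dim),   # +2 for the two N/S flags
            nn.ReLU(),
            nn.Linear(span_dim, N_RELATIONS),
        )

    def forward(self, left, right, left_is_nucleus: bool, right_is_nucleus: bool):
        flags = torch.tensor([float(left_is_nucleus), float(right_is_nucleus)])
        return self.ff(torch.cat([left, right, flags], dim=-1))

clf = RelationClassifier(span_dim=256)
logits = clf(torch.randn(256), torch.randn(256), True, False)
print(logits.argmax().item())    # index of the predicted relation label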

次に、パラメタ学習部220の事前処理を説明する。パラメタ学習部220は、i番目のEDUからj番目のEDUまでのスパンを表すベクトルと正しい分割を表すk、ラベルの組み合わせlが与えられるとする。パラメタ学習部220は、ランダムに初期化したパラメタを以下の(7)式のスコアを最大化するように逐次的に学習する。 Next, the preliminary processing of the parameter learning unit 220 will be described. The parameter learning unit 220 is given vectors representing the span from the i-th EDU to the j-th EDU, the correct split position k, and the correct label combination l. Starting from randomly initialized parameters, the parameter learning unit 220 learns the parameters iteratively so as to maximize the score of the following equation (7).


・・・(7)

ここで、^k、及び^lは、現在のパラメタにおける最良の分割とラベルの組み合わせであり、(5)式、及び(6)式で得る。 Here, ^k and ^l are the best split position and label combination under the current parameters, obtained by equations (5) and (6).
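Equation (7) itself is not reproduced in this text; purely as an illustration, the sketch below pairs the SplitScorer sketched earlier with a common margin-style update in which the gold split (k, l) is pushed above the currently best-scoring split (^k, ^l). The form of this objective is an assumption, not the patent's actual equation. It would be driven by, for example, optimizer = torch.optim.SGD(scorer.parameters(), lr=0.1) over the labeled training spans.

import torch

def training_step(scorer, optimizer, span_vector, i, j, gold_k, gold_label):
    # ^k, ^l: best split and label combination under the current parameters.
    pred_k, pred_label = scorer.best_split(span_vector, i, j)
    if (pred_k, pred_label) == (gold_k, gold_label):
        return 0.0                     # current prediction already matches the gold split
    gold = scorer.label_score(span_vector(i, gold_k), span_vector(gold_k + 1, j), gold_label)
    pred = scorer.label_score(span_vector(i, pred_k), span_vector(pred_k + 1, j), pred_label)
    loss = torch.clamp(1.0 + pred - gold, min=0.0)   # hinge loss with margin 1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()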

以上が文内解析部32を例にした内部処理の説明である。 The above is the description of the internal processing using the in-sentence analysis unit 32 as an example.

段落内解析部34として処理する場合には、上記の内部処理において、EDU系列を文系列に置き換え、EDUベクトルを文ベクトルに置き換えて処理すればよい。ただし、文ベクトルは、文に含まれる単語のベクトルの加重平均として表現される。また、文書内解析部36として処理する場合には、上記の内部処理において、EDU系列を段落系列に置き換え、EDUベクトルを段落ベクトルに置き換えて処理すればよい。ただし、段落ベクトルは、段落に含まれる単語のベクトルの加重平均として表現される。 When the same processing is performed as the in-paragraph analysis unit 34, the EDU sequence is replaced with the sentence sequence and the EDU vectors are replaced with sentence vectors in the internal processing described above, where a sentence vector is expressed as a weighted average of the vectors of the words contained in the sentence. Likewise, when the processing is performed as the in-document analysis unit 36, the EDU sequence is replaced with the paragraph sequence and the EDU vectors are replaced with paragraph vectors, where a paragraph vector is expressed as a weighted average of the vectors of the words contained in the paragraph.
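Since EDU, sentence, and paragraph vectors are all described as weighted averages of word vectors, one small sketch covers all three levels; the uniform weights below are a stand-in for the unspecified weighting scheme.

import numpy as np

def unit_vector(words, word_vectors, weights=None):
    # Vector for an EDU / sentence / paragraph: weighted average of the
    # vectors of the words it contains (uniform weights as a placeholder).
    vecs = np.stack([word_vectors[w] for w in words])
    if weights is None:
        weights = np.ones(len(words)) / len(words)
    return weights @ vecs

word_vectors = {"discourse": np.array([1.0, 0.0]), "tree": np.array([0.0, 1.0])}
print(unit_vector(["discourse", "tree"], word_vectors))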

<本発明の実施の形態に係る談話構造解析装置の作用> <Operation of Discourse Structure Analyst Device According to the Embodiment of the Present Invention>

次に、本発明の実施の形態に係る談話構造解析装置100の作用について説明する。入力部10において文書を受け付けると、談話構造解析装置100は、図9に示す談話構造処理ルーチンを実行する。 Next, the operation of the discourse structure analysis device 100 according to the embodiment of the present invention will be described. When the input unit 10 receives the document, the discourse structure analysis device 100 executes the discourse structure processing routine shown in FIG.

まず、ステップS100では、部分構造解析部30は、入力部10で受け付けた文書について、文書の段落の系列への分割と、各段落に含まれる文の系列への分割と、各文に含まれるEDUの系列への分割とを行う。 First, in step S100, the partial structure analysis unit 30 divides the document received by the input unit 10 into a sequence of paragraphs, divides each paragraph into a sequence of sentences, and divides each sentence into a sequence of EDUs.

次に、ステップS102では、文内解析部32は、各文について、当該文に含まれるEDUの系列を二つのスパンに分割し、かつ、二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共にEDUの系列に対する二つのスパンの関係ラベルを推定することを、スパンの各々がEDUとなるまで再帰的に繰り返す。組み合わせの推定は、当該文に含まれるEDUの系列の各EDUを表すEDUベクトルと、EDUの系列を二つのスパンに分割する位置と、二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づく。文内解析部32は、再帰的な処理により、スパンの各々をノードとし、非終端記号が付与された二分木で表される、EDUを単位とした談話構造木である文内談話木を出力する。 Next, in step S102, for each sentence, the in-sentence analysis unit 32 recursively repeats dividing the sequence of EDUs included in the sentence into two spans, estimating the combination of non-terminal symbols to be assigned to the two spans, and estimating the relation label of the two spans with respect to the EDU sequence, until each span consists of a single EDU. This estimation is based on EDU vectors, each representing an EDU of the EDU sequence included in the sentence, and on the parameters of a trained model for estimating the position at which the EDU sequence is divided into two spans and the combination of non-terminal symbols to be assigned to the two spans. Through this recursive processing, the in-sentence analysis unit 32 outputs an in-sentence discourse tree, a discourse structure tree in units of EDUs represented as a binary tree in which each span is a node assigned a non-terminal symbol.

ステップS104では、段落内解析部34は、各段落について、当該段落に含まれる文の系列を二つのスパンに分割し、かつ、二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に文の系列に対する二つのスパンの関係ラベルを推定することを、スパンの各々が文となるまで再帰的に繰り返す。組み合わせの推定は、当該段落に含まれる文の系列の各文を表す文ベクトルと、学習済みのモデルのパラメタとに基づく。段落内解析部34は、再帰的な処理により、スパンの各々をノードとし、非終端記号が付与された二分木で表される、文を単位とした談話構造木である段落内談話木を出力する。 In step S104, for each paragraph, the in-paragraph analysis unit 34 recursively repeats dividing the sequence of sentences included in the paragraph into two spans, estimating the combination of non-terminal symbols to be assigned to the two spans, and estimating the relation label of the two spans with respect to the sentence sequence, until each span consists of a single sentence. This estimation is based on sentence vectors, each representing a sentence of the sentence sequence included in the paragraph, and on the parameters of a trained model. Through this recursive processing, the in-paragraph analysis unit 34 outputs an in-paragraph discourse tree, a discourse structure tree in units of sentences represented as a binary tree in which each span is a node assigned a non-terminal symbol.

ステップS106では、文書内解析部36は、文書に含まれる段落の系列を二つのスパンに分割し、かつ、二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に段落の系列に対する二つのスパンの関係ラベルを推定することを、スパンの各々が段落となるまで再帰的に繰り返す。組み合わせの推定は、文書に含まれる段落の系列の各段落を表す段落ベクトルと、学習済みのモデルのパラメタとに基づく。文書内解析部36は、再帰的な処理により、スパンの各々をノードとし、非終端記号が付与された二分木で表される、段落を単位とした談話構造木である文書内談話木を出力する。 In step S106, the in-document analysis unit 36 recursively repeats dividing the sequence of paragraphs included in the document into two spans, estimating the combination of non-terminal symbols to be assigned to the two spans, and estimating the relation label of the two spans with respect to the paragraph sequence, until each span consists of a single paragraph. This estimation is based on paragraph vectors, each representing a paragraph of the paragraph sequence included in the document, and on the parameters of a trained model. Through this recursive processing, the in-document analysis unit 36 outputs an in-document discourse tree, a discourse structure tree in units of paragraphs represented as a binary tree in which each span is a node assigned a non-terminal symbol.

ステップS108では、木結合部38は、文内解析部32が出力した文内談話木と、段落内解析部34が出力した段落内談話木と、文書内解析部36が出力した文書内談話木とに基づいて、文書のEDUと文と段落との構造を結合した談話構造木を出力部50に出力する。 In step S108, based on the in-sentence discourse trees output by the in-sentence analysis unit 32, the in-paragraph discourse trees output by the in-paragraph analysis unit 34, and the in-document discourse tree output by the in-document analysis unit 36, the tree combination unit 38 outputs, to the output unit 50, a discourse structure tree in which the structures of the EDUs, sentences, and paragraphs of the document are combined.
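Putting the routine together, the following sketch mirrors steps S100 through S108 using the illustrative helpers sketched earlier (split_into_paragraphs, split_into_sentences, split_into_edus, and combine_trees); parse_spans stands in for the shared top-down splitting performed by the three analysis units, and nothing here is taken verbatim from the patent.

def analyze_document(document, parse_spans, combine):
    # parse_spans(units) is assumed to return a binary tree over the given
    # units (as in the top-down splitting sketched earlier); combine is
    # assumed to merge the three levels of trees (e.g. combine_trees above),
    # and their leaf-indexing conventions must agree.
    paragraphs = split_into_paragraphs(document)                            # S100
    sentences_per_par = [split_into_sentences(p) for p in paragraphs]
    edus_per_sentence = [split_into_edus(s) for sents in sentences_per_par for s in sents]

    sentence_trees = [parse_spans(edus) for edus in edus_per_sentence]      # S102
    paragraph_trees = [parse_spans(sents) for sents in sentences_per_par]   # S104
    document_tree = parse_spans(paragraphs)                                 # S106
    return combine(document_tree, paragraph_trees, sentence_trees)          # S108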

以上説明したように、本発明の実施の形態に係る談話構造解析装置によれば、EDUの数に関わらず、精度よく、談話構造木を構築できる。 As described above, according to the discourse structure analysis device according to the embodiment of the present invention, the discourse structure tree can be constructed with high accuracy regardless of the number of EDUs.

なお、本発明は、上述した実施の形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 The present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the gist of the present invention.

10 入力部
20 演算部
30 部分構造解析部
32 文内解析部
34 段落内解析部
36 文書内解析部
38 木結合部
50 出力部
100 談話構造解析装置
220 パラメタ学習部
230 ベクトル変換部
232 最適分割部
234 パラメタ記憶部
236 関係分類部
10 Input unit 20 Calculation unit 30 Partial structure analysis unit 32 In-sentence analysis unit 34 In-paragraph analysis unit 36 In-document analysis unit 38 Tree connection unit 50 Output unit 100 Discourse structure analysis device 220 Parameter learning unit 230 Vector conversion unit 232 Optimal division unit 234 Parameter storage unit 236 Relationship classification unit

Claims (5)

文書について、前記文書の段落の系列への分割と、各段落に含まれる文の系列への分割と、各文に含まれる基本単位の系列への分割とを行う部分構造解析部と、
各文について、前記文に含まれる前記基本単位の系列の各基本単位を表すEDU(Elementary Discourse Unit)ベクトルと、前記基本単位の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文に含まれる前記基本単位の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記基本単位の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記基本単位となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、基本単位を単位とした談話構造木である文内談話木を出力する文内解析部と、
各段落について、前記段落に含まれる前記文の系列の各文を表す文ベクトルと、前記文の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記段落に含まれる前記文の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記文の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記文となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、文を単位とした談話構造木である段落内談話木を出力する段落内解析部と、
前記文書に含まれる前記段落の系列の各段落を表す段落ベクトルと、前記段落の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文書に含まれる前記段落の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記段落の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記段落となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、段落を単位とした談話構造木である文書内談話木を出力する文書内解析部と、
前記文内談話木と、前記段落内談話木と、前記文書内談話木とに基づいて、前記文書の前記基本単位と前記文と前記段落との構造を結合した談話構造木を出力する木結合部と、
を含む談話構造解析装置。
A substructural analysis unit that divides a document into a series of paragraphs of the document, a series of sentences included in each paragraph, and a series of basic units included in each sentence.
For each sentence, an EDU (Elementary Discourse Unit) vector representing each basic unit of the series of the basic units included in the sentence, a position for dividing the series of the basic units into two spans, and each of the two spans. Based on the parameters of the trained model for estimating the combination of non-terminating symbols given to, the sequence of the basic units contained in the sentence is divided into two spans, and each of the two spans. Estimating the combination of non-terminating symbols given to the base unit and estimating the relational label of the two spans with respect to the series of the basic units is repeated recursively until each of the spans becomes the basic unit, and each of the spans Is a node, and an in-sentence analysis unit that outputs an in-sentence discourse tree, which is a discourse structure tree in units of basic units, represented by a dichotomized tree with a non-terminating symbol.
For each paragraph, a sentence vector representing each sentence of the sentence series included in the paragraph, a position for dividing the sentence series into two spans, and a combination of non-terminating symbols given to each of the two spans. The sequence of sentences contained in the paragraph is divided into two spans, and the combination of non-terminating symbols given to each of the two spans is estimated based on the parameters of the trained model for estimating. And estimating the relational labels of the two spans for the sequence of the sentences is repeated recursively until each of the spans becomes the sentence, and each of the spans is a node, and a non-terminating symbol is given. An in-paragraph analysis unit that outputs an in-paragraph discourse tree, which is a sentence-based discourse structure tree represented by a tree,
To estimate the paragraph vector that represents each paragraph of the paragraph series contained in the document, the position that divides the paragraph series into two spans, and the combination of non-terminating symbols given to each of the two spans. Based on the parameters of the trained model of, the series of paragraphs contained in the document is divided into two spans, and the combination of non-terminated symbols given to each of the two spans is estimated and the paragraph Estimating the relationship label of the two spans for the series of is recursively repeated until each of the spans becomes the paragraph, and each of the spans is represented by a binary tree with a non-terminating symbol. An in-document analysis unit that outputs an in-document discourse tree, which is a paragraph-based discourse structure tree,
A tree combination that outputs a discourse structure tree that combines the structure of the basic unit of the document, the sentence, and the paragraph based on the discourse tree in the sentence, the discourse tree in the paragraph, and the discourse tree in the document. Department and
Discourse structure analyzer including.
前記分割する位置は、前記学習済みのモデルのパラメタに基づいて定義される、前記分割する位置で分割したときに得られる前記二つのスパンのもっともらしさを最大にする位置とする請求項1に記載の談話構造解析装置。 The discourse structure analysis apparatus according to claim 1, wherein the division position is a position, defined based on the parameters of the trained model, that maximizes the plausibility of the two spans obtained when dividing at the division position.
部分構造解析部が、文書について、前記文書の段落の系列への分割と、各段落に含まれる文の系列への分割と、各文に含まれる基本単位の系列への分割とを行うステップと、
文内解析部が、各文について、前記文に含まれる前記基本単位の系列の各基本単位を表すEDU(Elementary Discourse Unit)ベクトルと、前記基本単位の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文に含まれる前記基本単位の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定することを、前記スパンの各々が前記基本単位となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、基本単位を単位とした談話構造木である文内談話木を出力するステップと、
段落内解析部が、各段落について、前記段落に含まれる前記文の系列の各文を表す文ベクトルと、前記文の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記段落に含まれる前記文の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定することを、前記スパンの各々が前記文となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、文を単位とした談話構造木である段落内談話木を出力するステップと、
文書内解析部が、前記文書に含まれる前記段落の系列の各段落を表す段落ベクトルと、前記段落の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文書に含まれる前記段落の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定することを、前記スパンの各々が前記段落となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、段落を単位とした談話構造木である文書内談話木を出力するステップと、
木結合部が、前記文内談話木と、前記段落内談話木と、前記文書内談話木とに基づいて、前記文書の前記基本単位と前記文と前記段落との構造を結合した談話構造木を出力するステップと、
を含む談話構造解析方法。
A step in which the substructural analysis unit divides a document into a series of paragraphs of the document, a series of sentences included in each paragraph, and a series of basic units included in each sentence. ,
For each sentence, the sentence analysis unit divides the EDU (Elementary Unit) vector representing each basic unit of the series of the basic units included in the sentence into two spans, and the position where the series of the basic units is divided into two spans. Based on the parameters of the trained model for estimating the combination of non-terminating symbols given to each of the two spans, the sequence of the basic units contained in the sentence is divided into two spans and Estimating the combination of non-terminating symbols given to each of the two spans is recursively repeated until each of the spans becomes the basic unit, and each of the spans is used as a node, and the non-terminating symbol is given. A step to output an in-sentence discourse tree, which is a discourse structure tree in units of basic units represented by a tree,
For each paragraph, the in-paragraph analysis unit assigns a sentence vector representing each sentence of the sentence series included in the paragraph, a position for dividing the sentence series into two spans, and each of the two spans. The sequence of sentences contained in the paragraph is divided into two spans and assigned to each of the two spans, based on the parameters of the trained model for estimating the combination of non-terminating symbols to be used. Estimating a combination of non-terminating symbols is repeated recursively until each of the spans becomes the sentence, and each of the spans is a node, and the sentence is represented by a dichotomized tree with a non-terminating symbol. Steps to output the in-paragraph discourse tree, which is the discourse structure tree
A paragraph vector representing each paragraph of the paragraph series included in the document, a position for dividing the paragraph series into two spans, and a non-terminal symbol given to each of the two spans by the in-document analysis unit. A combination of non-terminated symbols that divides the sequence of paragraphs contained in the document into two spans and assigns each of the two spans, based on the parameters of the trained model for estimating the combination. Is recursively repeated until each of the spans becomes the paragraph, and each of the spans is a node, and is represented by a binary tree with a non-terminating symbol. The steps to output the in-document discourse tree, which is
A discourse structure tree in which the tree connecting portion combines the basic unit of the document and the structure of the sentence and the paragraph based on the in-sentence discourse tree, the in-paragraph discourse tree, and the in-document discourse tree. And the steps to output
Discourse structure analysis method including.
前記分割する位置は、前記学習済みのモデルのパラメタに基づいて定義される、前記分割する位置で分割したときに得られる前記二つのスパンのもっともらしさを最大にする位置とする請求項3に記載の談話構造解析方法。 The discourse structure analysis method according to claim 3, wherein the division position is a position, defined based on the parameters of the trained model, that maximizes the plausibility of the two spans obtained when dividing at the division position.
コンピュータを、請求項1又は請求項2に記載の談話構造解析装置の各部として機能させるためのプログラム。 A program for causing a computer to function as each part of the discourse structure analysis device according to claim 1 or 2.
JP2019028629A 2019-02-20 2019-02-20 Discourse structure analyzer, method, and program Active JP7054145B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2019028629A JP7054145B2 (en) 2019-02-20 2019-02-20 Discourse structure analyzer, method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2019028629A JP7054145B2 (en) 2019-02-20 2019-02-20 Discourse structure analyzer, method, and program

Publications (2)

Publication Number Publication Date
JP2020135467A true JP2020135467A (en) 2020-08-31
JP7054145B2 JP7054145B2 (en) 2022-04-13

Family

ID=72263252

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2019028629A Active JP7054145B2 (en) 2019-02-20 2019-02-20 Discourse structure analyzer, method, and program

Country Status (1)

Country Link
JP (1) JP7054145B2 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016162198A (en) * 2015-03-02 2016-09-05 日本電信電話株式会社 Parameter learning method, device, and program
US20180365228A1 (en) * 2017-06-15 2018-12-20 Oracle International Corporation Tree kernel learning for text classification into classes of intent

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016162198A (en) * 2015-03-02 2016-09-05 日本電信電話株式会社 Parameter learning method, device, and program
US20180365228A1 (en) * 2017-06-15 2018-12-20 Oracle International Corporation Tree kernel learning for text classification into classes of intent

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徳永 健伸: "自然言語処理技術の最近の動向", 情報処理, vol. 第33巻 第7号, JPN6022007329, 15 July 1992 (1992-07-15), pages 780 - 789, ISSN: 0004715944 *

Also Published As

Publication number Publication date
JP7054145B2 (en) 2022-04-13

Similar Documents

Publication Publication Date Title
CN106847288B (en) Error correction method and device for voice recognition text
US8447589B2 (en) Text paraphrasing method and program, conversion rule computing method and program, and text paraphrasing system
CN111353306B (en) Entity relationship and dependency Tree-LSTM-based combined event extraction method
US20030046073A1 (en) Word predicting method, voice recognition method, and voice recognition apparatus and program using the same methods
CN111274829B (en) Sequence labeling method utilizing cross-language information
CN111767731A (en) Training method and device of grammar error correction model and grammar error correction method and device
JP2001266060A (en) Analysis system questionnaire answer
Makhambetov et al. Data-driven morphological analysis and disambiguation for kazakh
CN109063772B (en) Image personalized semantic analysis method, device and equipment based on deep learning
JP6062829B2 (en) Dependency relationship analysis parameter learning device, dependency relationship analysis device, method, and program
JP7054145B2 (en) Discourse structure analyzer, method, and program
US11386272B2 (en) Learning method and generating apparatus
CN112016299A (en) Method and device for generating dependency syntax tree by using neural network executed by computer
CN114298052B (en) Entity joint annotation relation extraction method and system based on probability graph
Haghdoost et al. Building a morphological network for persian on top of a morpheme-segmented lexicon
KR102569381B1 (en) System and Method for Machine Reading Comprehension to Table-centered Web Documents
CN113641789B (en) Viewpoint retrieval method and system based on hierarchical fusion multi-head attention network and convolution network
JP6590723B2 (en) Word rearrangement learning method, word rearrangement method, apparatus, and program
US20210303802A1 (en) Program storage medium, information processing apparatus and method for encoding sentence
JP5523929B2 (en) Text summarization apparatus, text summarization method, and text summarization program
JP7148077B2 (en) Tree structure analysis device, method, and program
JP5087994B2 (en) Language analysis method and apparatus
CN110909545A (en) Black guide detection method based on gradient lifting algorithm
CN113010717B (en) Image verse description generation method, device and equipment
CN113434760B (en) Construction method recommendation method, device, equipment and storage medium

Legal Events

Date Code Title Description
A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A821

Effective date: 20190221

A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20210212

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20220204

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20220301

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20220323

R150 Certificate of patent or registration of utility model

Ref document number: 7054145

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150