JP2020135467A - Discourse structure analysis apparatus, method, and program - Google Patents

Discourse structure analysis apparatus, method, and program Download PDF

Info

Publication number
JP2020135467A
Authority
JP
Japan
Prior art keywords
spans
sentence
paragraph
tree
discourse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2019028629A
Other languages
Japanese (ja)
Other versions
JP7054145B2 (en
Inventor
平尾 努 Tsutomu Hirao
永田 昌明 Masaaki Nagata
小林 尚輝 Naoki Kobayashi
奥村 学 Manabu Okumura
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Tokyo Institute of Technology NUC
Original Assignee
Nippon Telegraph and Telephone Corp
Tokyo Institute of Technology NUC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp, Tokyo Institute of Technology NUC filed Critical Nippon Telegraph and Telephone Corp
Priority to JP2019028629A priority Critical patent/JP7054145B2/en
Publication of JP2020135467A publication Critical patent/JP2020135467A/en
Application granted granted Critical
Publication of JP7054145B2 publication Critical patent/JP7054145B2/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Abstract

To highly accurately construct a discourse structure tree regardless of the number of EDUs. SOLUTION: An in-sentence analysis unit 32, on the basis of parameters of a learned model for estimating EDU vectors each representing each elementary unit of an elementary unit sequence, positions at which the elementary unit sequence is divided into two spans, and combinations of non-terminal symbols given to each of the two spans, divides the elementary unit sequence into two spans, recursively repeats the estimation of the combinations of non-terminal symbols given to each span and the relation labels of the spans, and outputs an in-sentence discourse tree, which is a discourse structure tree using elementary units as units. An in-paragraph analysis unit 34 outputs an in-paragraph discourse tree, which is a discourse structure tree using sentences as units. An in-document analysis unit 36 outputs an in-document discourse tree, which is a discourse structure tree using paragraphs as units. A tree coupling unit 38 outputs a discourse structure tree in which the structures of the elementary units, sentences, and paragraphs of the document are coupled, on the basis of the in-sentence discourse tree, the in-paragraph discourse tree, and the in-document discourse tree. SELECTED DRAWING: Figure 3

Description

本発明は、談話構造解析装置、方法、及びプログラムに係り、特に、文書の談話構造を解析するための談話構造解析装置、方法、及びプログラムに関する。 The present invention relates to a discourse structure analysis device, method, and program, and more particularly to a discourse structure analysis device, method, and program for analyzing the discourse structure of a document.

従来の談話構造解析技術として、文書を、基本単位であるElementary Discourse Unit(EDU)と呼ばれる文よりも小さい、節に相当するテキストユニットの系列データとみなし、EDUをボトムアップに組み上げていくことで文書全体の談話構造木(図1)を構築する手法が提案されている。図1は一般的な談話構造木の一例を示す図である。なお、図1に示すように以下の実施の形態において用いる談話構造木は2分木として表現される(たとえば、非特許文献1など)。図1において、終端記号はEDU(e)であり、非終端記号はそれが支配するスパン(連続したEDUの系列)が核(N)であるか衛星(S)であるかを表す。SからN、NからNをつなぐエッジにはElaboration、Same−Unitなどの関係ラベルが与えられる。 As a conventional discourse structure analysis technique, a method has been proposed in which a document is regarded as a sequence of text units called Elementary Discourse Units (EDUs), basic units that are smaller than sentences and roughly correspond to clauses, and the discourse structure tree for the whole document (FIG. 1) is constructed by combining EDUs bottom-up. FIG. 1 is a diagram showing an example of a general discourse structure tree. As shown in FIG. 1, the discourse structure trees used in the following embodiments are expressed as binary trees (see, for example, Non-Patent Document 1). In FIG. 1, the terminal symbols are EDUs (e), and each non-terminal symbol indicates whether the span it governs (a contiguous sequence of EDUs) is a nucleus (N) or a satellite (S). Relation labels such as Elaboration and Same-Unit are assigned to the edges connecting S to N and N to N.
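For reference only (this is not part of the patent text), such a binary discourse structure tree could be represented by a small data structure like the following Python sketch; the class and field names are illustrative assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass
class DiscourseNode:
    # "N" (nucleus) or "S" (satellite) assigned to the span this node governs
    nuclearity: str
    # text of the EDU if this node is a terminal (leaf) node
    edu: Optional[str] = None
    # relation label (e.g. "Elaboration", "Same-Unit") between the two children
    relation: Optional[str] = None
    left: Optional["DiscourseNode"] = None
    right: Optional["DiscourseNode"] = None

    def is_leaf(self) -> bool:
        return self.edu is not None

# A two-EDU tree: a nucleus elaborated by a satellite.
tree = DiscourseNode(
    nuclearity="N",
    relation="Elaboration",
    left=DiscourseNode(nuclearity="N", edu="e1: The parser builds a tree,"),
    right=DiscourseNode(nuclearity="S", edu="e2: which covers the whole document."),
)
print(tree.relation, tree.left.nuclearity, tree.right.nuclearity)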

duVerle, David and Prendinger, Helmut, "A Novel Discourse Parser Based on Support Vector Machine Classification", Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 665-673, 2009

従来の方法は、文書中の文、段落といった明示的に利用できる構造を利用せずに単にEDUの系列としてとらえている。一般的には文書中のEDUの数が数十におよぶことは珍しくないため、多くのEDUを考慮しつつ木を構築していかなければならず解析性能が劣化する。また、文書中の文、段落といった構造を無視して、ボトムアップに木を構築していくとエラーが累積し、解析性能が劣化するという問題があった。 The conventional method treats a document simply as a sequence of EDUs, without using explicitly available structures such as sentences and paragraphs in the document. Since it is not uncommon for a document to contain several dozen EDUs, the tree must be constructed while taking many EDUs into account, which degrades parsing performance. In addition, when the tree is constructed bottom-up while ignoring structures such as sentences and paragraphs, errors accumulate and parsing performance deteriorates.

本発明は、上記事情を鑑みて成されたものであり、EDUの数に関わらず、精度よく、談話構造木を構築できる談話構造解析装置、方法、及びプログラムを提供することを目的とする。 The present invention has been made in view of the above circumstances, and an object of the present invention is to provide a discourse structure analysis device, a method, and a program capable of constructing a discourse structure tree with high accuracy regardless of the number of EDUs.

上記目的を達成するために、第1の発明に係る談話構造解析装置は、文書について、前記文書の段落の系列への分割と、各段落に含まれる文の系列への分割と、各文に含まれる基本単位の系列への分割とを行う部分構造解析部と、各文について、前記文に含まれる前記基本単位の系列の各基本単位を表すEDU(Elementary Discourse Unit)ベクトルと、前記基本単位の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文に含まれる前記基本単位の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記基本単位の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記基本単位となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、基本単位を単位とした談話構造木である文内談話木を出力する文内解析部と、各段落について、前記段落に含まれる前記文の系列の各文を表す文ベクトルと、前記文の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記段落に含まれる前記文の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記文の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記文となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、文を単位とした談話構造木である段落内談話木を出力する段落内解析部と、前記文書に含まれる前記段落の系列の各段落を表す段落ベクトルと、前記段落の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文書に含まれる前記段落の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記段落の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記段落となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、段落を単位とした談話構造木である文書内談話木を出力する文書内解析部と、前記文内談話木と、前記段落内談話木と、前記文書内談話木とに基づいて、前記文書の前記基本単位と前記文と前記段落との構造を結合した談話構造木を出力する木結合部と、を含んで構成されている。 In order to achieve the above object, a discourse structure analysis apparatus according to a first aspect of the invention includes: a partial structure analysis unit that divides a document into a sequence of paragraphs, divides each paragraph into a sequence of sentences, and divides each sentence into a sequence of elementary units; an in-sentence analysis unit that, for each sentence, on the basis of EDU (Elementary Discourse Unit) vectors each representing an elementary unit of the sequence of elementary units included in the sentence and parameters of a trained model for estimating a position at which the sequence of elementary units is divided into two spans and a combination of non-terminal symbols to be assigned to the two spans, recursively repeats dividing the sequence of elementary units included in the sentence into two spans, estimating the combination of non-terminal symbols to be assigned to each of the two spans, and estimating the relation label of the two spans with respect to the sequence of elementary units, until each span becomes an elementary unit, and outputs an in-sentence discourse tree, which is a discourse structure tree whose units are the elementary units and which is represented as a binary tree in which each span is a node assigned a non-terminal symbol; an in-paragraph analysis unit that, for each paragraph, on the basis of sentence vectors each representing a sentence of the sequence of sentences included in the paragraph and parameters of a trained model for estimating a position at which the sequence of sentences is divided into two spans and a combination of non-terminal symbols to be assigned to the two spans, recursively repeats dividing the sequence of sentences included in the paragraph into two spans, estimating the combination of non-terminal symbols to be assigned to each of the two spans, and estimating the relation label of the two spans with respect to the sequence of sentences, until each span becomes a sentence, and outputs an in-paragraph discourse tree, which is a discourse structure tree whose units are sentences and which is represented as a binary tree in which each span is a node assigned a non-terminal symbol; an in-document analysis unit that, on the basis of paragraph vectors each representing a paragraph of the sequence of paragraphs included in the document and parameters of a trained model for estimating a position at which the sequence of paragraphs is divided into two spans and a combination of non-terminal symbols to be assigned to the two spans, recursively repeats dividing the sequence of paragraphs included in the document into two spans, estimating the combination of non-terminal symbols to be assigned to each of the two spans, and estimating the relation label of the two spans with respect to the sequence of paragraphs, until each span becomes a paragraph, and outputs an in-document discourse tree, which is a discourse structure tree whose units are paragraphs and which is represented as a binary tree in which each span is a node assigned a non-terminal symbol; and a tree combination unit that, on the basis of the in-sentence discourse tree, the in-paragraph discourse tree, and the in-document discourse tree, outputs a discourse structure tree in which the structures of the elementary units, the sentences, and the paragraphs of the document are combined.

また、第1の発明に係る談話構造解析装置において、前記分割する位置は、前記学習済みのモデルのパラメタに基づいて定義される、前記分割する位置で分割したときに得られる前記二つのスパンのもっともらしさを最大にする位置とするようにしてもよい。 Further, in the discourse structure analysis apparatus according to the first aspect, the division position may be a position, defined on the basis of the parameters of the trained model, that maximizes the plausibility of the two spans obtained when dividing at that position.

第2の発明に係る談話構造解析方法は、部分構造解析部が、文書について、前記文書の段落の系列への分割と、各段落に含まれる文の系列への分割と、各文に含まれる基本単位の系列への分割とを行うステップと、文内解析部が、各文について、前記文に含まれる前記基本単位の系列の各基本単位を表すEDU(Elementary Discourse Unit)ベクトルと、前記基本単位の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文に含まれる前記基本単位の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記基本単位の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記基本単位となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、基本単位を単位とした談話構造木である文内談話木を出力するステップと、段落内解析部が、各段落について、前記段落に含まれる前記文の系列の各文を表す文ベクトルと、前記文の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記段落に含まれる前記文の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記文の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記文となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、文を単位とした談話構造木である段落内談話木を出力するステップと、文書内解析部が、前記文書に含まれる前記段落の系列の各段落を表す段落ベクトルと、前記段落の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文書に含まれる前記段落の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記段落の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記段落となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、段落を単位とした談話構造木である文書内談話木を出力するステップと、木結合部が、前記文内談話木と、前記段落内談話木と、前記文書内談話木とに基づいて、前記文書の前記基本単位と前記文と前記段落との構造を結合した談話構造木を出力するステップと、を含んで実行することを特徴とする。 A discourse structure analysis method according to a second aspect of the invention includes: a step in which a partial structure analysis unit divides a document into a sequence of paragraphs, divides each paragraph into a sequence of sentences, and divides each sentence into a sequence of elementary units; a step in which an in-sentence analysis unit, for each sentence, on the basis of EDU (Elementary Discourse Unit) vectors each representing an elementary unit of the sequence of elementary units included in the sentence and parameters of a trained model for estimating a position at which the sequence of elementary units is divided into two spans and a combination of non-terminal symbols to be assigned to the two spans, recursively repeats dividing the sequence of elementary units included in the sentence into two spans, estimating the combination of non-terminal symbols to be assigned to each of the two spans, and estimating the relation label of the two spans with respect to the sequence of elementary units, until each span becomes an elementary unit, and outputs an in-sentence discourse tree, which is a discourse structure tree whose units are the elementary units and which is represented as a binary tree in which each span is a node assigned a non-terminal symbol; a step in which an in-paragraph analysis unit, for each paragraph, on the basis of sentence vectors each representing a sentence of the sequence of sentences included in the paragraph and parameters of a trained model for estimating a position at which the sequence of sentences is divided into two spans and a combination of non-terminal symbols to be assigned to the two spans, recursively repeats dividing the sequence of sentences included in the paragraph into two spans, estimating the combination of non-terminal symbols to be assigned to each of the two spans, and estimating the relation label of the two spans with respect to the sequence of sentences, until each span becomes a sentence, and outputs an in-paragraph discourse tree, which is a discourse structure tree whose units are sentences and which is represented as a binary tree in which each span is a node assigned a non-terminal symbol; a step in which an in-document analysis unit, on the basis of paragraph vectors each representing a paragraph of the sequence of paragraphs included in the document and parameters of a trained model for estimating a position at which the sequence of paragraphs is divided into two spans and a combination of non-terminal symbols to be assigned to the two spans, recursively repeats dividing the sequence of paragraphs included in the document into two spans, estimating the combination of non-terminal symbols to be assigned to each of the two spans, and estimating the relation label of the two spans with respect to the sequence of paragraphs, until each span becomes a paragraph, and outputs an in-document discourse tree, which is a discourse structure tree whose units are paragraphs and which is represented as a binary tree in which each span is a node assigned a non-terminal symbol; and a step in which a tree combination unit, on the basis of the in-sentence discourse tree, the in-paragraph discourse tree, and the in-document discourse tree, outputs a discourse structure tree in which the structures of the elementary units, the sentences, and the paragraphs of the document are combined.

また、第2の発明に係る談話構造解析方法において、前記分割する位置は、前記学習済みのモデルのパラメタに基づいて定義される、前記分割する位置で分割したときに得られる前記二つのスパンのもっともらしさを最大にする位置とするようにしてもよい。 Further, in the discourse structure analysis method according to the second aspect, the division position may be a position, defined on the basis of the parameters of the trained model, that maximizes the plausibility of the two spans obtained when dividing at that position.

第3の発明に係るプログラムは、第1の発明に記載の談話構造解析装置の各部として機能させるためのプログラムである。 A program according to a third aspect of the invention is a program for causing a computer to function as each unit of the discourse structure analysis apparatus according to the first aspect.

本発明の談話構造解析装置、方法、及びプログラムによれば、EDUの数に関わらず、精度よく、談話構造木を構築できる、という効果が得られる。 According to the discourse structure analysis device, method, and program of the present invention, it is possible to obtain the effect that the discourse structure tree can be constructed with high accuracy regardless of the number of EDUs.

一般的な談話構造木の一例を示す図である。 FIG. 1 is a diagram showing an example of a general discourse structure tree.
文、段落、及び文書の部分構造木への分割例を示す図である。 FIG. 2 is a diagram showing an example of dividing a document into partial structure trees for sentences, paragraphs, and the document.
本発明の実施の形態に係る談話構造解析装置の構成を示すブロック図である。 FIG. 3 is a block diagram showing the configuration of the discourse structure analysis apparatus according to the embodiment of the present invention.
EDUを葉とする文内談話木の一例を示す図である。 FIG. 4 is a diagram showing an example of an in-sentence discourse tree whose leaves are EDUs.
文を葉とする文書内談話木の一例を示す図である。 FIG. 5 is a diagram showing an example of a discourse tree whose leaves are sentences.
段落を葉とする文書内談話木の一例を示す図である。 FIG. 6 is a diagram showing an example of an in-document discourse tree whose leaves are paragraphs.
文内解析部、段落内解析部、及び文書内解析部に対応する具体的な内部構成を示す図である。 FIG. 7 is a diagram showing a specific internal configuration corresponding to the in-sentence analysis unit, the in-paragraph analysis unit, and the in-document analysis unit.
分類する関係ラベルの18種の種類の一例を示す図である。 FIG. 8 is a diagram showing an example of the 18 types of relation labels to be classified.
本発明の実施の形態に係る談話構造解析装置の談話構造解析処理ルーチンを示すフローチャートである。 FIG. 9 is a flowchart showing the discourse structure analysis processing routine of the discourse structure analysis apparatus according to the embodiment of the present invention.

以下、図面を参照して本発明の実施の形態を詳細に説明する。 Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

本発明の実施の形態では、上記課題に対して、文書を文、段落、及び文書という3つの部分構造に分割し、それぞれの構造に対して、トップダウンで解析を行う。つまり、EDU系列、文系列、段落系列を2分することを繰り返し、木を構築する。図2は文、段落、及び文書の部分構造木への分割例を示す図である。 In the embodiment of the present invention, the document is divided into three substructures of a sentence, a paragraph, and a document for the above problem, and each structure is analyzed from the top down. That is, the EDU series, the sentence series, and the paragraph series are repeatedly divided into two to construct a tree. FIG. 2 is a diagram showing an example of dividing a sentence, a paragraph, and a document into substructure trees.

<本発明の実施の形態に係る談話構造解析装置の構成> <Structure of Discourse Structure Analysis Device According to the Embodiment of the Present Invention>

次に、本発明の実施の形態に係る談話構造解析装置の構成について説明する。図3に示すように、本発明の実施の形態に係る談話構造解析装置100は、CPUと、RAMと、後述する談話構造解析処理ルーチンを実行するためのプログラム及び各種データを記憶したROMと、を含むコンピュータで構成することが出来る。この談話構造解析装置100は、機能的には図3に示すように入力部10と、演算部20と、出力部50とを備えている。 Next, the configuration of the discourse structure analysis apparatus according to the embodiment of the present invention will be described. As shown in FIG. 3, the discourse structure analysis apparatus 100 according to the embodiment of the present invention can be configured by a computer including a CPU, a RAM, and a ROM that stores a program for executing the discourse structure analysis processing routine described later and various data. Functionally, as shown in FIG. 3, the discourse structure analysis apparatus 100 includes an input unit 10, a calculation unit 20, and an output unit 50.

入力部10は、談話構造を解析する対象となる文書を受け付ける。 The input unit 10 receives a document to be analyzed for the discourse structure.

演算部20は、部分構造解析部30と、文内解析部32と、段落内解析部34と、文書内解析部36と、木結合部38とを含んで構成されている。 The calculation unit 20 includes a partial structure analysis unit 30, an in-sentence analysis unit 32, an in-paragraph analysis unit 34, an in-document analysis unit 36, and a tree connection unit 38.

図4はEDUを葉とする文内談話木の一例を示す図である。図5は文を葉とする文書内談話木の一例を示す図である。図6は段落を葉とする文書内談話木の一例を示す図である。 FIG. 4 is a diagram showing an example of an in-text discourse tree having EDU as a leaf. FIG. 5 is a diagram showing an example of an in-document discourse tree having a sentence as a leaf. FIG. 6 is a diagram showing an example of an in-document discourse tree having paragraphs as leaves.

談話構造解析装置100の処理の概要を説明する。談話構造解析装置100は、入力として文書を受け取ると、文書を、文、段落、文書という構造に分割し、それぞれをEDU系列、文系列、段落系列として扱う。文内解析部32で、EDUを葉とする文内談話木(図4)を構築する。段落内解析部34で、文を葉とする段落内談話木(図5)、文書内解析部36で、段落を葉とする文書内談話木(図6)を構築する。木結合部38は、これらの木を結合し、最終的に談話構造木を出力する。EDUが基本単位の一例である。 The outline of the processing of the discourse structure analysis device 100 will be described. When the discourse structure analysis device 100 receives a document as input, it divides the document into a structure of a sentence, a paragraph, and a document, and treats each as an EDU sequence, a sentence sequence, and a paragraph sequence. The sentence analysis unit 32 constructs a sentence discourse tree (FIG. 4) with EDU as a leaf. The in-paragraph analysis unit 34 constructs an in-paragraph discourse tree with sentences as leaves (FIG. 5), and the in-document analysis unit 36 constructs an in-document discourse tree with paragraphs as leaves (FIG. 6). The tree joining unit 38 joins these trees and finally outputs a discourse structure tree. EDU is an example of a basic unit.

図7は文内解析部32、段落内解析部34、及び文書内解析部36に対応する具体的な内部構成を示す図である。文内解析部32、段落内解析部34、及び文書内解析部36の具体的な内部処理は、図7に示す構成の各処理部によって実現される。内部処理を行う各処理部は、パラメタ学習部220と、ベクトル変換部230と、最適分割部232と、パラメタ記憶部234と、関係分類部236とを含んで構成される。内部処理については後述する。 FIG. 7 is a diagram showing a specific internal configuration corresponding to the in-sentence analysis unit 32, the in-paragraph analysis unit 34, and the in-document analysis unit 36. The specific internal processing of the in-sentence analysis unit 32, the in-paragraph analysis unit 34, and the in-document analysis unit 36 is realized by each processing unit having the configuration shown in FIG. Each processing unit that performs internal processing includes a parameter learning unit 220, a vector conversion unit 230, an optimum division unit 232, a parameter storage unit 234, and a relation classification unit 236. The internal processing will be described later.

以下、談話構造解析装置100の各処理部について説明する。 Hereinafter, each processing unit of the discourse structure analysis device 100 will be described.

部分構造解析部30は、入力部10で受け付けた文書について、文書の段落の系列(段落系列)への分割と、各段落に含まれる文の系列(文系列)への分割と、各文に含まれるEDUの系列(EDU系列)への分割とを行う。 The partial structure analysis unit 30 divides the document received by the input unit 10 into a sequence of paragraphs (paragraph sequence), divides each paragraph into a sequence of sentences (sentence sequence), and divides each sentence into a sequence of EDUs (EDU sequence).

具体的には、部分構造解析部30は、以下に説明するように、文書から、文、段落、文書の3つの構造に分割し、それぞれEDU系列、文系列、段落系列として出力する。EDU系列への分割は、文をEDUへ分割する既存技術が提案されているのでそれを用いればよい。文系列への分割は、句点を手がかりに文を認定すればよい。また、文系列への分割は、既存の文境界認定器を利用することも可能である。段落系列への分割は、空行、字下げなどを手がかりとして分割すればよい。手がかりの情報がない場合には既存技術を用いて段落境界を認定すればよい。 Specifically, as described below, the partial structure analysis unit 30 divides the document into the three structures of sentences, paragraphs, and the document, and outputs them as an EDU sequence, a sentence sequence, and a paragraph sequence, respectively. For the division into the EDU sequence, existing techniques for segmenting a sentence into EDUs have been proposed and may be used. For the division into the sentence sequence, sentences may be identified using sentence-final punctuation as a clue, and an existing sentence boundary detector may also be used. The division into the paragraph sequence may be performed using blank lines, indentation, and the like as clues; when no such clues are available, paragraph boundaries may be identified using an existing technique.
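As an informal illustration only (the patent defers to existing techniques for EDU segmentation and boundary detection), the following Python sketch uses the simple clues mentioned above; the comma-based EDU split is a stand-in for a real segmenter, not the technique actually used.

import re

def split_into_paragraphs(document: str) -> list[str]:
    # Paragraph boundaries are identified using blank lines as the clue.
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

def split_into_sentences(paragraph: str) -> list[str]:
    # Sentence boundaries are identified using sentence-final punctuation
    # (the Japanese full stop "。" or ". ") as the clue.
    parts = re.split(r"(?<=[。.!?])\s+|(?<=。)", paragraph)
    return [s.strip() for s in parts if s.strip()]

def split_into_edus(sentence: str) -> list[str]:
    # EDU segmentation is delegated to an existing technique in the patent;
    # splitting on commas here is for illustration only.
    parts = re.split(r"(?<=[、,])", sentence)
    return [e.strip() for e in parts if e.strip()]

doc = "First sentence, with two clauses. Second sentence.\n\nA new paragraph."
for para in split_into_paragraphs(doc):
    for sent in split_into_sentences(para):
        print(split_into_edus(sent))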

文内解析部32は、各文について、当該文に含まれるEDUの系列を二つのスパンに分割し、かつ、二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共にEDUの系列に対する二つのスパンの関係ラベルを推定することを、スパンの各々がEDUとなるまで再帰的に繰り返す。組み合わせの推定は、当該文に含まれるEDUの系列の各EDUを表すEDUベクトルと、EDUの系列を二つのスパンに分割する位置、及び二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタ(後述するパラメタ記憶部234に記憶)とに基づく。文内解析部32は、再帰的な処理により、スパンの各々をノードとし、非終端記号が付与された二分木で表される、EDUを単位とした談話構造木である文内談話木を出力する。 For each sentence, the in-sentence analysis unit 32 recursively repeats dividing the sequence of EDUs included in the sentence into two spans, estimating the combination of non-terminal symbols to be assigned to the two spans, and estimating the relation label of the two spans with respect to the EDU sequence, until each span consists of a single EDU. This estimation is based on EDU vectors, each representing an EDU of the EDU sequence included in the sentence, and on the parameters of a trained model (stored in the parameter storage unit 234 described later) for estimating the position at which the EDU sequence is divided into two spans and the combination of non-terminal symbols to be assigned to the two spans. Through this recursive processing, the in-sentence analysis unit 32 outputs an in-sentence discourse tree, a discourse structure tree in units of EDUs represented as a binary tree in which each span is a node assigned a non-terminal symbol.

段落内解析部34は、各段落について、当該段落に含まれる文の系列を二つのスパンに分割し、かつ、二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に文の系列に対する二つのスパンの関係ラベルを推定することを、スパンの各々が文となるまで再帰的に繰り返す。組み合わせの推定は、当該段落に含まれる文の系列の各文を表す文ベクトルと、学習済みのモデルのパラメタとに基づく。段落内解析部34は、再帰的な処理により、スパンの各々をノードとし、非終端記号が付与された二分木で表される、文を単位とした談話構造木である段落内談話木を出力する。 For each paragraph, the in-paragraph analysis unit 34 recursively repeats dividing the sequence of sentences included in the paragraph into two spans, estimating the combination of non-terminal symbols to be assigned to the two spans, and estimating the relation label of the two spans with respect to the sentence sequence, until each span consists of a single sentence. This estimation is based on sentence vectors, each representing a sentence of the sentence sequence included in the paragraph, and on the parameters of a trained model. Through this recursive processing, the in-paragraph analysis unit 34 outputs an in-paragraph discourse tree, a discourse structure tree in units of sentences represented as a binary tree in which each span is a node assigned a non-terminal symbol.

文書内解析部36は、文書に含まれる段落の系列を二つのスパンに分割し、かつ、二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に段落の系列に対する二つのスパンの関係ラベルを推定することを、スパンの各々が段落となるまで再帰的に繰り返す。組み合わせの推定は、文書に含まれる段落の系列の各段落を表す段落ベクトルと、学習済みのモデルのパラメタとに基づく。文書内解析部36は、再帰的な処理により、スパンの各々をノードとし、非終端記号が付与された二分木で表される、段落を単位とした談話構造木である文書内談話木を出力する。 The in-document analysis unit 36 recursively repeats dividing the sequence of paragraphs included in the document into two spans, estimating the combination of non-terminal symbols to be assigned to the two spans, and estimating the relation label of the two spans with respect to the paragraph sequence, until each span consists of a single paragraph. This estimation is based on paragraph vectors, each representing a paragraph of the paragraph sequence included in the document, and on the parameters of a trained model. Through this recursive processing, the in-document analysis unit 36 outputs an in-document discourse tree, a discourse structure tree in units of paragraphs represented as a binary tree in which each span is a node assigned a non-terminal symbol.

木結合部38は、文内解析部32が出力した文内談話木と、段落内解析部34が出力した段落内談話木と、文書内解析部36が出力した文書内談話木とに基づいて、文書のEDUと文と段落との構造を結合した談話構造木を出力部50に出力する。 Based on the in-sentence discourse trees output by the in-sentence analysis unit 32, the in-paragraph discourse trees output by the in-paragraph analysis unit 34, and the in-document discourse tree output by the in-document analysis unit 36, the tree combination unit 38 outputs, to the output unit 50, a discourse structure tree in which the structures of the EDUs, sentences, and paragraphs of the document are combined.
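The patent does not spell out the combination procedure beyond this description, so the following Python sketch only illustrates one natural reading: each sentence leaf of an in-paragraph tree is replaced by that sentence's in-sentence tree, and each paragraph leaf of the in-document tree is replaced by the expanded in-paragraph tree. The dict-based node format and the global indexing of leaves are assumptions made for illustration.

def combine_trees(document_tree, paragraph_trees, sentence_trees):
    # document_tree:       tree over paragraphs; leaves are {"leaf": paragraph index}
    # paragraph_trees[i]:  tree over the sentences of paragraph i; leaves are
    #                      {"leaf": global sentence index}
    # sentence_trees[j]:   tree over the EDUs of (global) sentence j
    def substitute(node, subtrees):
        if "leaf" in node:
            return subtrees[node["leaf"]]
        return {"labels": node.get("labels"),
                "left": substitute(node["left"], subtrees),
                "right": substitute(node["right"], subtrees)}

    # Expand sentence leaves into EDU-level trees, then paragraph leaves
    # into the expanded paragraph trees.
    expanded = [substitute(t, sentence_trees) for t in paragraph_trees]
    return substitute(document_tree, expanded)

# Tiny example: one paragraph containing two sentences of two EDUs each.
sentence_trees = [{"labels": ("N", "S"), "left": {"leaf": "e1"}, "right": {"leaf": "e2"}},
                  {"labels": ("N", "S"), "left": {"leaf": "e3"}, "right": {"leaf": "e4"}}]
paragraph_trees = [{"labels": ("N", "N"), "left": {"leaf": 0}, "right": {"leaf": 1}}]
document_tree = {"leaf": 0}
print(combine_trees(document_tree, paragraph_trees, sentence_trees))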

次に、図7の文内解析部32、段落内解析部34、及び文書内解析部36の内部処理について、文内解析部32の場合を例に説明する。 Next, the internal processing of the in-sentence analysis unit 32, the in-paragraph analysis unit 34, and the in-document analysis unit 36 in FIG. 7 will be described by taking the case of the in-sentence analysis unit 32 as an example.

ベクトル変換部230は、入力されたEDU系列をEDUベクトル系列に変換する。ベクトル変換部230は、EDUベクトル系列に基づいて任意のスパン(i番目のEDUからj(i<j)番目のEDUまでの連続したEDU系列)のベクトルを双方向LSTMを用いて構築する。EDUベクトルは、EDUに含まれる単語のベクトルの加重平均として表現される。単語のベクトルとしては、既存技術で得た単語ベクトルを利用すればよい。i番目のEDUからj番目のEDUで構成されるスパンのベクトルは、EDUベクトル系列全体を前向きLSTM、後ろ向きLSTMへ入力し、それぞれの内部状態ベクトルの差分を連結したものとする。つまり、Si,j=fi−fj;bj−biとなる。fは前向きLSTMから得た内部状態ベクトル、bは後ろ向きLSTMから得た内部状態ベクトルである。また、「;」はベクトルを連結することを表す。 The vector conversion unit 230 converts the input EDU sequence into an EDU vector sequence. Based on the EDU vector sequence, the vector conversion unit 230 constructs, using a bidirectional LSTM, a vector for an arbitrary span (the contiguous EDU sequence from the i-th EDU to the j-th EDU, with i < j). Each EDU vector is expressed as a weighted average of the vectors of the words contained in the EDU; word vectors obtained with existing techniques may be used. The vector of the span consisting of the i-th through j-th EDUs is obtained by feeding the entire EDU vector sequence to a forward LSTM and a backward LSTM and concatenating the differences of their internal state vectors, that is, S_{i,j} = f_i − f_j ; b_j − b_i, where f is an internal state vector obtained from the forward LSTM, b is an internal state vector obtained from the backward LSTM, and ";" denotes vector concatenation.
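For reference only, a minimal sketch of this span-vector construction, assuming PyTorch; the dimensions, the 0-based indexing, and the randomly initialized inputs are illustrative assumptions rather than details taken from the patent.

import torch
import torch.nn as nn

emb_dim, hidden = 100, 128
n_edus = 6
edu_vectors = torch.randn(1, n_edus, emb_dim)    # stand-in for weighted averages of word vectors

bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
states, _ = bilstm(edu_vectors)                   # shape (1, n_edus, 2 * hidden)
f = states[0, :, :hidden]                         # forward hidden states f_1..f_n
b = states[0, :, hidden:]                         # backward hidden states b_1..b_n

def span_vector(i: int, j: int) -> torch.Tensor:
    # S_{i,j} = [f_i - f_j ; b_j - b_i], following the difference-and-concatenate
    # construction described above.
    return torch.cat([f[i] - f[j], b[j] - b[i]], dim=-1)

print(span_vector(0, 3).shape)    # torch.Size([256])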

最適分割部232は、i番目のEDUからj番目のEDUで構成されるスパンのベクトルとパラメタを受け取り、入力されたスパンを2つのスパンに分割し、それぞれのスパンのラベルを与える。 The optimum division unit 232 receives the vector and parameters of the span composed of the i-th EDU to the j-th EDU, divides the input span into two spans, and gives a label for each span.

具体的には、最適分割部232は、非終端記号のラベルの組み合わせ(l∈{N−S,S−N,N−N})のもっともらしさを表すスコアを以下の(1)式で定義する。 Specifically, the optimum division unit 232 defines a score representing the plausibility of a combination of non-terminal labels (l ∈ {N−S, S−N, N−N}) by the following equation (1).


・・・(1)

非終端記号のラベルの組み合わせは、i番目のEDUからj番目のEDUで構成されるスパンをあるEDU直後で分割した際の2つのスパンに対して与えるラベルの組み合わせである。なお、S−Sというラベルの組み合わせは談話構造解析の理論上ではありえない。また、Wl、vl、blは学習済みモデルのパラメタ行列であり、パラメタ記憶部234に記憶されている。学習済みモデルのパラメタ行列Wl、vl、blは、ラベル付きのEDU系列を入力として、パラメタ学習部220により予め学習しておけばよい。パラメタ学習部220については後述する。 A combination of non-terminal labels is the pair of labels assigned to the two spans obtained when the span consisting of the i-th through j-th EDUs is divided immediately after some EDU. Note that the label combination S−S is theoretically impossible in discourse structure analysis. W_l, v_l, and b_l are parameter matrices of the trained model and are stored in the parameter storage unit 234. These parameter matrices W_l, v_l, and b_l may be learned in advance by the parameter learning unit 220, using labeled EDU sequences as input. The parameter learning unit 220 will be described later.

最適分割部232は、i番目のEDUからj番目のEDUからなるスパンに対して、k番目のEDU(i≦k<j)の直後でスパンを分割する際のもっともらしさを表すスコアを以下の(2)式で定義する。 For the span consisting of the i-th through j-th EDUs, the optimum division unit 232 defines a score representing the plausibility of dividing the span immediately after the k-th EDU (i ≤ k < j) by the following equation (2).


・・・(2)

また、最適分割部232は、以下の(3)式にてスパンとしてのもっともらしさを最大にする位置kにてスパンを分割し、分割した2つのスパンに対してラベルを付与する。 Further, the optimum division unit 232 divides the span at the position k that maximizes the plausibility as a span according to the following equation (3), and assigns a label to the two divided spans.


・・・(3)

ここで、Sbest()は以下の(4)式で定義する。 Here, S best () is defined by the following equation (4).


・・・(4)

このように、スパンを分割する位置は、パラメタ記憶部234の学習済みのモデルのパラメタに基づいて定義される、分割する位置で分割したときに得られる二つのスパンのもっともらしさを最大にする位置となる。 In this way, the position at which a span is divided is the position, defined on the basis of the parameters of the trained model in the parameter storage unit 234, that maximizes the plausibility of the two spans obtained when dividing at that position.
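Equations (1) through (4) themselves are not reproduced in this text, so the following Python sketch only illustrates one plausible parameterization consistent with the description: a per-label feed-forward scorer built from parameters W_l, v_l, b_l applied to the two candidate span vectors, with the split position and label combination chosen to maximize the score. The exact form of the scorer is an assumption, not the patent's actual equations.

import torch
import torch.nn as nn

LABELS = ["N-S", "S-N", "N-N"]     # S-S is excluded as theoretically impossible

class SplitScorer(nn.Module):
    def __init__(self, span_dim: int):
        super().__init__()
        # One (W_l, b_l) hidden layer and one v_l output vector per label pair.
        self.hidden = nn.ModuleDict({l: nn.Linear(2 * span_dim, span_dim) for l in LABELS})
        self.out = nn.ModuleDict({l: nn.Linear(span_dim, 1) for l in LABELS})

    def label_score(self, left: torch.Tensor, right: torch.Tensor, label: str) -> torch.Tensor:
        # Plausibility of assigning this label pair to the two candidate spans.
        h = torch.relu(self.hidden[label](torch.cat([left, right], dim=-1)))
        return self.out[label](h).squeeze(-1)

    def best_split(self, span_vector, i: int, j: int):
        # Choose the split position k (i <= k < j) and label combination l
        # that maximize the score, i.e. the most plausible division.
        best = None
        for k in range(i, j):
            left, right = span_vector(i, k), span_vector(k + 1, j)
            for label in LABELS:
                s = self.label_score(left, right, label).item()
                if best is None or s > best[0]:
                    best = (s, k, label)
        return best[1], best[2]

scorer = SplitScorer(span_dim=256)
dummy_span_vector = lambda a, b: torch.randn(256)    # stand-in for the BiLSTM span vectors
print(scorer.best_split(dummy_span_vector, 0, 3))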

上述したように、最適分割部232は、i番目のEDUからj番目のEDUで構成されるスパンのベクトルとパラメタを受け取り、以下の(5)式、(6)式に従って、入力されたスパンを位置^kで2つのスパンに分割し、それぞれのスパンのラベルの組み合わせ^lを与える。 As described above, the optimum division unit 232 receives the vectors and parameters for the span consisting of the i-th through j-th EDUs, divides the input span into two spans at the position ^k according to the following equations (5) and (6), and assigns the label combination ^l to the two resulting spans.

・・・(5)

・・・(6)

最適分割部232は、i番目のEDUからj番目のEDUとして文の先頭のEDUから末尾のEDUを与え、2つのスパンに分割する手続きを再帰的に繰り返し、分割されたスパンが単体のEDUになるまで繰り返す。この手続が終了すると、文に対して非終端記号がNかS、終端記号がEDUとなる2分木が構築される。 Starting with the span that runs from the first EDU of the sentence to its last EDU, the optimum division unit 232 recursively repeats the procedure of dividing a span into two spans until every resulting span consists of a single EDU. When this procedure is completed, a binary tree is constructed for the sentence in which the non-terminal symbols are N or S and the terminal symbols are EDUs.
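As an illustration of this recursive top-down construction only, the following Python sketch builds such a binary tree; the node representation and the toy split rule are assumptions, and in practice best_split would be the scored selection of ^k and ^l described above.

def build_tree(best_split, i: int, j: int):
    # Recursively split the span of EDUs e_i..e_j until every span is a
    # single EDU, producing a binary tree; leaves are EDU indices.
    if i == j:                        # a single EDU: terminal node
        return i
    k, label_pair = best_split(i, j)  # e.g. ("N", "S") for the two children
    return {
        "labels": label_pair,
        "left": build_tree(best_split, i, k),
        "right": build_tree(best_split, k + 1, j),
    }

# Toy split rule that always divides a span in the middle, for illustration.
toy_split = lambda i, j: ((i + j) // 2, ("N", "S"))
print(build_tree(toy_split, 0, 3))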

関係分類部236は、ラベル付きの2つのスパンを受け取り関係ラベルを出力する。関係分類部236は、訓練データから正解の2つのラベル付きスパンが与えられたときに正解の関係ラベルを出力するように学習したモデル(図示省略)を用いればよい。図8は、分類する関係ラベルの18種の種類の一例を示す図である。 The relationship classification unit 236 receives two labeled spans and outputs the relationship label. The relationship classification unit 236 may use a model (not shown) learned to output the correct relationship label when two labeled spans of the correct answer are given from the training data. FIG. 8 is a diagram showing an example of 18 types of related labels to be classified.
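The patent only states that a model is trained to output the correct relation label given two labeled spans, so the following sketch uses a simple feed-forward classifier over the two span vectors and their N/S labels as a stand-in for whatever model is actually used; the architecture and the 18-way output are illustrative.

import torch
import torch.nn as nn

N_RELATIONS = 18    # number of relation label classes (FIG. 8)

class RelationClassifier(nn.Module):
    def __init__(self, span_dim: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(2 * span_dim + 2, span_dim),   # +2 for the two N/S flags
            nn.ReLU(),
            nn.Linear(span_dim, N_RELATIONS),
        )

    def forward(self, left, right, left_is_nucleus: bool, right_is_nucleus: bool):
        flags = torch.tensor([float(left_is_nucleus), float(right_is_nucleus)])
        return self.ff(torch.cat([left, right, flags], dim=-1))

clf = RelationClassifier(span_dim=256)
logits = clf(torch.randn(256), torch.randn(256), True, False)
print(logits.argmax().item())    # index of the predicted relation label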

次に、パラメタ学習部220の事前処理を説明する。パラメタ学習部220は、i番目のEDUからj番目のEDUまでのスパンを表すベクトルと正しい分割を表すk、ラベルの組み合わせlが与えられるとする。パラメタ学習部220は、ランダムに初期化したパラメタを以下の(7)式のスコアを最大化するように逐次的に学習する。 Next, the preliminary processing of the parameter learning unit 220 will be described. The parameter learning unit 220 is given vectors representing the span from the i-th EDU to the j-th EDU, the correct split position k, and the correct label combination l. Starting from randomly initialized parameters, the parameter learning unit 220 learns the parameters iteratively so as to maximize the score of the following equation (7).


・・・(7)

ここで、^k、及び^lは、現在のパラメタにおける最良の分割とラベルの組み合わせであり、(5)式、及び(6)式で得る。 Here, ^k and ^l are the best split position and label combination under the current parameters, obtained by equations (5) and (6).
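Equation (7) itself is not reproduced in this text; purely as an illustration, the sketch below pairs the SplitScorer sketched earlier with a common margin-style update in which the gold split (k, l) is pushed above the currently best-scoring split (^k, ^l). The form of this objective is an assumption, not the patent's actual equation. It would be driven by, for example, optimizer = torch.optim.SGD(scorer.parameters(), lr=0.1) over the labeled training spans.

import torch

def training_step(scorer, optimizer, span_vector, i, j, gold_k, gold_label):
    # ^k, ^l: best split and label combination under the current parameters.
    pred_k, pred_label = scorer.best_split(span_vector, i, j)
    if (pred_k, pred_label) == (gold_k, gold_label):
        return 0.0                     # current prediction already matches the gold split
    gold = scorer.label_score(span_vector(i, gold_k), span_vector(gold_k + 1, j), gold_label)
    pred = scorer.label_score(span_vector(i, pred_k), span_vector(pred_k + 1, j), pred_label)
    loss = torch.clamp(1.0 + pred - gold, min=0.0)   # hinge loss with margin 1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()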

以上が文内解析部32を例にした内部処理の説明である。 The above is the description of the internal processing using the in-sentence analysis unit 32 as an example.

段落内解析部34として処理する場合には、上記の内部処理において、EDU系列を文系列に置き換え、EDUベクトルを文ベクトルに置き換えて処理すればよい。ただし、文ベクトルは、文に含まれる単語のベクトルの加重平均として表現される。また、文書内解析部36として処理する場合には、上記の内部処理において、EDU系列を段落系列に置き換え、EDUベクトルを段落ベクトルに置き換えて処理すればよい。ただし、段落ベクトルは、段落に含まれる単語のベクトルの加重平均として表現される。 When the same processing is performed as the in-paragraph analysis unit 34, the EDU sequence is replaced with the sentence sequence and the EDU vectors are replaced with sentence vectors in the internal processing described above, where a sentence vector is expressed as a weighted average of the vectors of the words contained in the sentence. Likewise, when the processing is performed as the in-document analysis unit 36, the EDU sequence is replaced with the paragraph sequence and the EDU vectors are replaced with paragraph vectors, where a paragraph vector is expressed as a weighted average of the vectors of the words contained in the paragraph.
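Since EDU, sentence, and paragraph vectors are all described as weighted averages of word vectors, one small sketch covers all three levels; the uniform weights below are a stand-in for the unspecified weighting scheme.

import numpy as np

def unit_vector(words, word_vectors, weights=None):
    # Vector for an EDU / sentence / paragraph: weighted average of the
    # vectors of the words it contains (uniform weights as a placeholder).
    vecs = np.stack([word_vectors[w] for w in words])
    if weights is None:
        weights = np.ones(len(words)) / len(words)
    return weights @ vecs

word_vectors = {"discourse": np.array([1.0, 0.0]), "tree": np.array([0.0, 1.0])}
print(unit_vector(["discourse", "tree"], word_vectors))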

<本発明の実施の形態に係る談話構造解析装置の作用> <Operation of Discourse Structure Analyst Device According to the Embodiment of the Present Invention>

次に、本発明の実施の形態に係る談話構造解析装置100の作用について説明する。入力部10において文書を受け付けると、談話構造解析装置100は、図9に示す談話構造処理ルーチンを実行する。 Next, the operation of the discourse structure analysis device 100 according to the embodiment of the present invention will be described. When the input unit 10 receives the document, the discourse structure analysis device 100 executes the discourse structure processing routine shown in FIG.

まず、ステップS100では、部分構造解析部30は、入力部10で受け付けた文書について、文書の段落の系列への分割と、各段落に含まれる文の系列への分割と、各文に含まれるEDUの系列への分割とを行う。 First, in step S100, the partial structure analysis unit 30 divides the document received by the input unit 10 into a sequence of paragraphs, divides each paragraph into a sequence of sentences, and divides each sentence into a sequence of EDUs.

次に、ステップS102では、文内解析部32は、各文について、当該文に含まれるEDUの系列を二つのスパンに分割し、かつ、二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共にEDUの系列に対する二つのスパンの関係ラベルを推定することを、スパンの各々がEDUとなるまで再帰的に繰り返す。組み合わせの推定は、当該文に含まれるEDUの系列の各EDUを表すEDUベクトルと、EDUの系列を二つのスパンに分割する位置と、二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づく。文内解析部32は、再帰的な処理により、スパンの各々をノードとし、非終端記号が付与された二分木で表される、EDUを単位とした談話構造木である文内談話木を出力する。 Next, in step S102, for each sentence, the in-sentence analysis unit 32 recursively repeats dividing the sequence of EDUs included in the sentence into two spans, estimating the combination of non-terminal symbols to be assigned to the two spans, and estimating the relation label of the two spans with respect to the EDU sequence, until each span consists of a single EDU. This estimation is based on EDU vectors, each representing an EDU of the EDU sequence included in the sentence, and on the parameters of a trained model for estimating the position at which the EDU sequence is divided into two spans and the combination of non-terminal symbols to be assigned to the two spans. Through this recursive processing, the in-sentence analysis unit 32 outputs an in-sentence discourse tree, a discourse structure tree in units of EDUs represented as a binary tree in which each span is a node assigned a non-terminal symbol.

ステップS104では、段落内解析部34は、各段落について、当該段落に含まれる文の系列を二つのスパンに分割し、かつ、二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に文の系列に対する二つのスパンの関係ラベルを推定することを、スパンの各々が文となるまで再帰的に繰り返す。組み合わせの推定は、当該段落に含まれる文の系列の各文を表す文ベクトルと、学習済みのモデルのパラメタとに基づく。段落内解析部34は、再帰的な処理により、スパンの各々をノードとし、非終端記号が付与された二分木で表される、文を単位とした談話構造木である段落内談話木を出力する。 In step S104, for each paragraph, the in-paragraph analysis unit 34 recursively repeats dividing the sequence of sentences included in the paragraph into two spans, estimating the combination of non-terminal symbols to be assigned to the two spans, and estimating the relation label of the two spans with respect to the sentence sequence, until each span consists of a single sentence. This estimation is based on sentence vectors, each representing a sentence of the sentence sequence included in the paragraph, and on the parameters of a trained model. Through this recursive processing, the in-paragraph analysis unit 34 outputs an in-paragraph discourse tree, a discourse structure tree in units of sentences represented as a binary tree in which each span is a node assigned a non-terminal symbol.

ステップS106では、文書内解析部36は、文書に含まれる段落の系列を二つのスパンに分割し、かつ、二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に段落の系列に対する二つのスパンの関係ラベルを推定することを、スパンの各々が段落となるまで再帰的に繰り返す。組み合わせの推定は、文書に含まれる段落の系列の各段落を表す段落ベクトルと、学習済みのモデルのパラメタとに基づく。文書内解析部36は、再帰的な処理により、スパンの各々をノードとし、非終端記号が付与された二分木で表される、段落を単位とした談話構造木である文書内談話木を出力する。 In step S106, the in-document analysis unit 36 recursively repeats dividing the sequence of paragraphs included in the document into two spans, estimating the combination of non-terminal symbols to be assigned to the two spans, and estimating the relation label of the two spans with respect to the paragraph sequence, until each span consists of a single paragraph. This estimation is based on paragraph vectors, each representing a paragraph of the paragraph sequence included in the document, and on the parameters of a trained model. Through this recursive processing, the in-document analysis unit 36 outputs an in-document discourse tree, a discourse structure tree in units of paragraphs represented as a binary tree in which each span is a node assigned a non-terminal symbol.

ステップS108では、木結合部38は、文内解析部32が出力した文内談話木と、段落内解析部34が出力した段落内談話木と、文書内解析部36が出力した文書内談話木とに基づいて、文書のEDUと文と段落との構造を結合した談話構造木を出力部50に出力する。 In step S108, based on the in-sentence discourse trees output by the in-sentence analysis unit 32, the in-paragraph discourse trees output by the in-paragraph analysis unit 34, and the in-document discourse tree output by the in-document analysis unit 36, the tree combination unit 38 outputs, to the output unit 50, a discourse structure tree in which the structures of the EDUs, sentences, and paragraphs of the document are combined.
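Putting the routine together, the following sketch mirrors steps S100 through S108 using the illustrative helpers sketched earlier (split_into_paragraphs, split_into_sentences, split_into_edus, and combine_trees); parse_spans stands in for the shared top-down splitting performed by the three analysis units, and nothing here is taken verbatim from the patent.

def analyze_document(document, parse_spans, combine):
    # parse_spans(units) is assumed to return a binary tree over the given
    # units (as in the top-down splitting sketched earlier); combine is
    # assumed to merge the three levels of trees (e.g. combine_trees above),
    # and their leaf-indexing conventions must agree.
    paragraphs = split_into_paragraphs(document)                            # S100
    sentences_per_par = [split_into_sentences(p) for p in paragraphs]
    edus_per_sentence = [split_into_edus(s) for sents in sentences_per_par for s in sents]

    sentence_trees = [parse_spans(edus) for edus in edus_per_sentence]      # S102
    paragraph_trees = [parse_spans(sents) for sents in sentences_per_par]   # S104
    document_tree = parse_spans(paragraphs)                                 # S106
    return combine(document_tree, paragraph_trees, sentence_trees)          # S108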

以上説明したように、本発明の実施の形態に係る談話構造解析装置によれば、EDUの数に関わらず、精度よく、談話構造木を構築できる。 As described above, according to the discourse structure analysis device according to the embodiment of the present invention, the discourse structure tree can be constructed with high accuracy regardless of the number of EDUs.

なお、本発明は、上述した実施の形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 The present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the gist of the present invention.

10 入力部
20 演算部
30 部分構造解析部
32 文内解析部
34 段落内解析部
36 文書内解析部
38 木結合部
50 出力部
100 談話構造解析装置
220 パラメタ学習部
230 ベクトル変換部
232 最適分割部
234 パラメタ記憶部
236 関係分類部
10 Input unit 20 Calculation unit 30 Partial structure analysis unit 32 In-sentence analysis unit 34 In-paragraph analysis unit 36 In-document analysis unit 38 Tree connection unit 50 Output unit 100 Discourse structure analysis device 220 Parameter learning unit 230 Vector conversion unit 232 Optimal division unit 234 Parameter storage unit 236 Relationship classification unit

Claims (5)

文書について、前記文書の段落の系列への分割と、各段落に含まれる文の系列への分割と、各文に含まれる基本単位の系列への分割とを行う部分構造解析部と、
各文について、前記文に含まれる前記基本単位の系列の各基本単位を表すEDU(Elementary Discourse Unit)ベクトルと、前記基本単位の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文に含まれる前記基本単位の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記基本単位の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記基本単位となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、基本単位を単位とした談話構造木である文内談話木を出力する文内解析部と、
各段落について、前記段落に含まれる前記文の系列の各文を表す文ベクトルと、前記文の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記段落に含まれる前記文の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記文の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記文となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、文を単位とした談話構造木である段落内談話木を出力する段落内解析部と、
前記文書に含まれる前記段落の系列の各段落を表す段落ベクトルと、前記段落の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文書に含まれる前記段落の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定すると共に前記段落の系列に対する前記二つのスパンの関係ラベルを推定することを、前記スパンの各々が前記段落となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、段落を単位とした談話構造木である文書内談話木を出力する文書内解析部と、
前記文内談話木と、前記段落内談話木と、前記文書内談話木とに基づいて、前記文書の前記基本単位と前記文と前記段落との構造を結合した談話構造木を出力する木結合部と、
を含む談話構造解析装置。
A substructural analysis unit that divides a document into a series of paragraphs of the document, a series of sentences included in each paragraph, and a series of basic units included in each sentence.
For each sentence, an EDU (Elementary Discourse Unit) vector representing each basic unit of the series of the basic units included in the sentence, a position for dividing the series of the basic units into two spans, and each of the two spans. Based on the parameters of the trained model for estimating the combination of non-terminating symbols given to, the sequence of the basic units contained in the sentence is divided into two spans, and each of the two spans. Estimating the combination of non-terminating symbols given to the base unit and estimating the relational label of the two spans with respect to the series of the basic units is repeated recursively until each of the spans becomes the basic unit, and each of the spans Is a node, and an in-sentence analysis unit that outputs an in-sentence discourse tree, which is a discourse structure tree in units of basic units, represented by a dichotomized tree with a non-terminating symbol.
For each paragraph, a sentence vector representing each sentence of the sentence series included in the paragraph, a position for dividing the sentence series into two spans, and a combination of non-terminating symbols given to each of the two spans. The sequence of sentences contained in the paragraph is divided into two spans, and the combination of non-terminating symbols given to each of the two spans is estimated based on the parameters of the trained model for estimating. And estimating the relational labels of the two spans for the sequence of the sentences is repeated recursively until each of the spans becomes the sentence, and each of the spans is a node, and a non-terminating symbol is given. An in-paragraph analysis unit that outputs an in-paragraph discourse tree, which is a sentence-based discourse structure tree represented by a tree,
To estimate the paragraph vector that represents each paragraph of the paragraph series contained in the document, the position that divides the paragraph series into two spans, and the combination of non-terminating symbols given to each of the two spans. Based on the parameters of the trained model of, the series of paragraphs contained in the document is divided into two spans, and the combination of non-terminated symbols given to each of the two spans is estimated and the paragraph Estimating the relationship label of the two spans for the series of is recursively repeated until each of the spans becomes the paragraph, and each of the spans is represented by a binary tree with a non-terminating symbol. An in-document analysis unit that outputs an in-document discourse tree, which is a paragraph-based discourse structure tree,
A tree combination that outputs a discourse structure tree that combines the structure of the basic unit of the document, the sentence, and the paragraph based on the discourse tree in the sentence, the discourse tree in the paragraph, and the discourse tree in the document. Department and
Discourse structure analyzer including.
前記分割する位置は、前記学習済みのモデルのパラメタに基づいて定義される、前記分割する位置で分割したときに得られる前記二つのスパンのもっともらしさを最大にする位置とする請求項1に記載の談話構造解析装置。 The discourse structure analysis apparatus according to claim 1, wherein the division position is a position, defined based on the parameters of the trained model, that maximizes the plausibility of the two spans obtained when dividing at the division position.
部分構造解析部が、文書について、前記文書の段落の系列への分割と、各段落に含まれる文の系列への分割と、各文に含まれる基本単位の系列への分割とを行うステップと、
文内解析部が、各文について、前記文に含まれる前記基本単位の系列の各基本単位を表すEDU(Elementary Discourse Unit)ベクトルと、前記基本単位の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文に含まれる前記基本単位の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定することを、前記スパンの各々が前記基本単位となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、基本単位を単位とした談話構造木である文内談話木を出力するステップと、
段落内解析部が、各段落について、前記段落に含まれる前記文の系列の各文を表す文ベクトルと、前記文の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記段落に含まれる前記文の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定することを、前記スパンの各々が前記文となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、文を単位とした談話構造木である段落内談話木を出力するステップと、
文書内解析部が、前記文書に含まれる前記段落の系列の各段落を表す段落ベクトルと、前記段落の系列を二つのスパンに分割する位置、及び前記二つのスパンの各々に付与する非終端記号の組み合わせとを推定するための学習済みのモデルのパラメタとに基づいて、前記文書に含まれる前記段落の系列を二つのスパンに分割し、かつ、前記二つのスパンの各々に付与する非終端記号の組み合わせを推定することを、前記スパンの各々が前記段落となるまで再帰的に繰り返し、前記スパンの各々をノードとし、非終端記号が付与された二分木で表される、段落を単位とした談話構造木である文書内談話木を出力するステップと、
木結合部が、前記文内談話木と、前記段落内談話木と、前記文書内談話木とに基づいて、前記文書の前記基本単位と前記文と前記段落との構造を結合した談話構造木を出力するステップと、
を含む談話構造解析方法。
A step in which the substructural analysis unit divides a document into a series of paragraphs of the document, a series of sentences included in each paragraph, and a series of basic units included in each sentence. ,
For each sentence, the sentence analysis unit divides the EDU (Elementary Unit) vector representing each basic unit of the series of the basic units included in the sentence into two spans, and the position where the series of the basic units is divided into two spans. Based on the parameters of the trained model for estimating the combination of non-terminating symbols given to each of the two spans, the sequence of the basic units contained in the sentence is divided into two spans and Estimating the combination of non-terminating symbols given to each of the two spans is recursively repeated until each of the spans becomes the basic unit, and each of the spans is used as a node, and the non-terminating symbol is given. A step to output an in-sentence discourse tree, which is a discourse structure tree in units of basic units represented by a tree,
For each paragraph, the in-paragraph analysis unit assigns a sentence vector representing each sentence of the sentence series included in the paragraph, a position for dividing the sentence series into two spans, and each of the two spans. The sequence of sentences contained in the paragraph is divided into two spans and assigned to each of the two spans, based on the parameters of the trained model for estimating the combination of non-terminating symbols to be used. Estimating a combination of non-terminating symbols is repeated recursively until each of the spans becomes the sentence, and each of the spans is a node, and the sentence is represented by a dichotomized tree with a non-terminating symbol. Steps to output the in-paragraph discourse tree, which is the discourse structure tree
A paragraph vector representing each paragraph of the paragraph series included in the document, a position for dividing the paragraph series into two spans, and a non-terminal symbol given to each of the two spans by the in-document analysis unit. A combination of non-terminated symbols that divides the sequence of paragraphs contained in the document into two spans and assigns each of the two spans, based on the parameters of the trained model for estimating the combination. Is recursively repeated until each of the spans becomes the paragraph, and each of the spans is a node, and is represented by a binary tree with a non-terminating symbol. The steps to output the in-document discourse tree, which is
A discourse structure tree in which the tree connecting portion combines the basic unit of the document and the structure of the sentence and the paragraph based on the in-sentence discourse tree, the in-paragraph discourse tree, and the in-document discourse tree. And the steps to output
Discourse structure analysis method including.
前記分割する位置は、前記学習済みのモデルのパラメタに基づいて定義される、前記分割する位置で分割したときに得られる前記二つのスパンのもっともらしさを最大にする位置とする請求項3に記載の談話構造解析方法。 The discourse structure analysis method according to claim 3, wherein the division position is a position, defined based on the parameters of the trained model, that maximizes the plausibility of the two spans obtained when dividing at the division position.
コンピュータを、請求項1又は請求項2に記載の談話構造解析装置の各部として機能させるためのプログラム。 A program for causing a computer to function as each part of the discourse structure analysis device according to claim 1 or 2.
JP2019028629A 2019-02-20 2019-02-20 Discourse structure analyzer, method, and program Active JP7054145B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2019028629A JP7054145B2 (en) 2019-02-20 2019-02-20 Discourse structure analyzer, method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2019028629A JP7054145B2 (en) 2019-02-20 2019-02-20 Discourse structure analyzer, method, and program

Publications (2)

Publication Number Publication Date
JP2020135467A true JP2020135467A (en) 2020-08-31
JP7054145B2 JP7054145B2 (en) 2022-04-13

Family

ID=72263252

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2019028629A Active JP7054145B2 (en) 2019-02-20 2019-02-20 Discourse structure analyzer, method, and program

Country Status (1)

Country Link
JP (1) JP7054145B2 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016162198A (en) * 2015-03-02 2016-09-05 日本電信電話株式会社 Parameter learning method, device, and program
US20180365228A1 (en) * 2017-06-15 2018-12-20 Oracle International Corporation Tree kernel learning for text classification into classes of intent

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016162198A (en) * 2015-03-02 2016-09-05 日本電信電話株式会社 Parameter learning method, device, and program
US20180365228A1 (en) * 2017-06-15 2018-12-20 Oracle International Corporation Tree kernel learning for text classification into classes of intent

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徳永 健伸: "自然言語処理技術の最近の動向", 情報処理, vol. 第33巻 第7号, JPN6022007329, 15 July 1992 (1992-07-15), pages 780 - 789, ISSN: 0004715944 *

Also Published As

Publication number Publication date
JP7054145B2 (en) 2022-04-13

Similar Documents

Publication Publication Date Title
CN106847288B (en) Error correction method and device for voice recognition text
US8447589B2 (en) Text paraphrasing method and program, conversion rule computing method and program, and text paraphrasing system
CN111353306B (en) Entity relationship and dependency Tree-LSTM-based combined event extraction method
US20030046073A1 (en) Word predicting method, voice recognition method, and voice recognition apparatus and program using the same methods
CN111274829B (en) Sequence labeling method utilizing cross-language information
CN111767731A (en) Training method and device of grammar error correction model and grammar error correction method and device
JP2001266060A (en) Analysis system questionnaire answer
Makhambetov et al. Data-driven morphological analysis and disambiguation for kazakh
CN109063772B (en) Image personalized semantic analysis method, device and equipment based on deep learning
JP6062829B2 (en) Dependency relationship analysis parameter learning device, dependency relationship analysis device, method, and program
JP7054145B2 (en) Discourse structure analyzer, method, and program
US11386272B2 (en) Learning method and generating apparatus
CN112016299A (en) Method and device for generating dependency syntax tree by using neural network executed by computer
CN114298052B (en) Entity joint annotation relation extraction method and system based on probability graph
Haghdoost et al. Building a morphological network for persian on top of a morpheme-segmented lexicon
KR102569381B1 (en) System and Method for Machine Reading Comprehension to Table-centered Web Documents
CN113641789B (en) Viewpoint retrieval method and system based on hierarchical fusion multi-head attention network and convolution network
JP6590723B2 (en) Word rearrangement learning method, word rearrangement method, apparatus, and program
US20210303802A1 (en) Program storage medium, information processing apparatus and method for encoding sentence
JP5523929B2 (en) Text summarization apparatus, text summarization method, and text summarization program
JP7148077B2 (en) Tree structure analysis device, method, and program
JP5087994B2 (en) Language analysis method and apparatus
CN110909545A (en) Black guide detection method based on gradient lifting algorithm
CN113010717B (en) Image verse description generation method, device and equipment
CN113434760B (en) Construction method recommendation method, device, equipment and storage medium

Legal Events

Date Code Title Description
A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A821

Effective date: 20190221

A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20210212

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20220204

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20220301

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20220323

R150 Certificate of patent or registration of utility model

Ref document number: 7054145

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150