JP6042264B2

JP6042264B2 - Grammar rule learning device, method, and program

Info

Publication number: JP6042264B2
Application number: JP2013103557A
Authority: JP
Inventors: 進藤 裕之 (Hiroyuki Shindo); 永田 昌明 (Masaaki Nagata)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-05-15
Filing date: 2013-05-15
Publication date: 2016-12-14
Anticipated expiration: 2033-05-15
Also published as: JP2014225104A

Description

The present invention relates to a grammar rule learning device, method, and program, and in particular to a grammar rule learning device, method, and program that learn grammar rules, based on a sampling method, from a corpus of parse trees annotated with syntactic information.

Conventionally, in the field of natural language processing, grammar rules have been learned from a parse tree corpus annotated with syntactic information, on the basis of a probabilistic grammar model. FIG. 5 shows an example of a parse tree. In the example of FIG. 5, words are assigned to the terminal nodes of the tree structure, and symbols representing syntactic information are assigned to the other nodes. In FIG. 5, "NP" denotes a noun phrase, "DT" a determiner, "N" a noun, and "JJ" an adjective. Examples of probabilistic grammar models include models based on context-free grammars, tree-substitution grammars, and tree-adjoining grammars. FIG. 6 shows an example of the grammar rules obtained from the parse tree of FIG. 5 according to a tree-substitution grammar. Each grammar rule in a tree-substitution grammar is a substructure (subtree) of a parse tree.

As a method for learning grammar rules from a parse tree corpus, a method using Gibbs sampling has been proposed (see, for example, Non-Patent Document 1). Gibbs sampling visits the parse trees one at a time and probabilistically updates the current assignment of the subtrees that make up the target parse tree. Each time the subtree assignment is updated, the likelihood of the probabilistic grammar model is computed, and the subtree assignment with the highest likelihood is output as the final grammar rules.

Trevor Cohn, Sharon Goldwater, and Phil Blunsom, "Inducing compact but accurate tree-substitution grammars," In Proceedings of HLT-NAACL, pages 548-556, Boulder, Colorado, June. Association for Computational Linguistics, 2009.

However, a method that learns grammar rules by updating subtrees one at a time, such as the Gibbs sampling method described in Non-Patent Document 1, has the problem that, when the parse tree corpus is large, it becomes stuck in a local optimum and cannot learn grammar rules with a high likelihood under the probabilistic grammar model.

The present invention has been made to solve the above problem, and its object is to provide a grammar rule learning device, method, and program capable of learning grammar rules with a high likelihood under a probabilistic grammar model.

To achieve the above object, the grammar rule learning device of the present invention includes: an initial setting unit that sets initial values of a plurality of subtrees of a plurality of types and of predetermined parameters in a probabilistic grammar model that represents a plurality of parse trees by the plurality of subtrees and the predetermined parameters; an update unit that, by a technique of probabilistically updating the assignment of the subtrees making up a parse tree, updates the plurality of subtrees and the predetermined parameters set by the initial setting unit, or the plurality of subtrees and the predetermined parameters from the previous update; a replacement update unit that selects, from among the subtree candidates formed by dividing a subtree of a first type included in the plurality of subtrees updated by the update unit, a subtree candidate according to the joint probability, in the probabilistic grammar model expressed with the predetermined parameters, of each subtree candidate and the set of parse trees containing that candidate, takes the selected candidate as a subtree of a second type, replaces every subtree of the first type included in the plurality of parse trees with the subtree of the second type, and updates the predetermined parameters; and an end determination unit that repeats the updating of the plurality of subtrees and the predetermined parameters by the update unit, and the replacement of subtrees and the updating of the predetermined parameters by the replacement update unit, until a predetermined end condition is satisfied, and outputs each of the plurality of subtrees at the time the end condition is satisfied as a grammar rule, together with the predetermined parameters at that time.

According to the grammar rule learning device of the present invention, the initial setting unit sets initial values of the plurality of subtrees and the predetermined parameters in a probabilistic grammar model that represents a plurality of parse trees by a plurality of subtrees of a plurality of types and predetermined parameters. The update unit then updates, by a technique of probabilistically updating the assignment of the subtrees making up a parse tree, the plurality of subtrees and the predetermined parameters set by the initial setting unit, or the plurality of subtrees and the predetermined parameters from the previous update. For this update, a conventional technique such as Gibbs sampling can be used.

The replacement update unit then selects, from among the subtree candidates formed by dividing a subtree of the first type included in the plurality of subtrees updated by the update unit, a subtree candidate according to the joint probability, in the probabilistic grammar model expressed with the predetermined parameters, of each subtree candidate and the set of parse trees containing that candidate, takes the selected candidate as the subtree of the second type, replaces every subtree of the first type included in the plurality of parse trees with the subtree of the second type, and updates the predetermined parameters. The end determination unit repeats the updating of the plurality of subtrees and the predetermined parameters by the update unit, and the replacement of subtrees and the updating of the predetermined parameters by the replacement update unit, until a predetermined end condition is satisfied, and outputs each of the plurality of subtrees at the time the end condition is satisfied as a grammar rule, together with the predetermined parameters at that time.

Thus, the subtrees of the first type included in the plurality of subtrees making up the plurality of parse trees are replaced all at once with the subtree of the second type, and the parameters of the probabilistic grammar model representing the plurality of parse trees are updated. Because this can change the frequency distribution of the subtrees substantially, grammar rules with a high likelihood under the probabilistic grammar model can be learned.

The replacement update unit may also take as the subtree of the second type a subtree candidate selected, from among the subtree candidates formed by dividing the subtree of the first type, according to the joint probability, in the probabilistic grammar model expressed with the predetermined parameters, of each subtree candidate and the set of parse trees containing that candidate. This allows the frequency distribution of the subtrees to be changed in diverse ways, so that grammar rules with an even higher likelihood can be learned.

The grammar rule learning method of the present invention includes: a step in which an initial setting unit sets initial values of a plurality of subtrees of a plurality of types and of predetermined parameters in a probabilistic grammar model that represents a plurality of parse trees by the plurality of subtrees and the predetermined parameters; a step in which an update unit updates the plurality of subtrees and the predetermined parameters set by the initial setting unit, or the plurality of subtrees and the predetermined parameters from the previous update; a step in which a replacement update unit replaces a subtree of a first type included in the plurality of subtrees updated by the update unit with a subtree of a second type different from the first type, and updates the predetermined parameters; and a step in which an end determination unit repeats the updating of the plurality of subtrees and the predetermined parameters by the update unit, and the replacement of subtrees and the updating of the predetermined parameters by the replacement update unit, until a predetermined end condition is satisfied, and outputs each of the plurality of subtrees at the time the end condition is satisfied as a grammar rule, together with the predetermined parameters at that time.

The grammar rule learning program of the present invention is a program for causing a computer to function as each of the units constituting the grammar rule learning device described above.

As described above, according to the grammar rule learning device, method, and program of the present invention, the subtrees of the first type included in the plurality of subtrees making up the plurality of parse trees are replaced all at once with the subtree of the second type, and the parameters of the probabilistic grammar model representing the plurality of parse trees are updated. Because this can change the frequency distribution of the subtrees substantially, the effect is obtained that grammar rules with a high likelihood under the probabilistic grammar model can be learned.

FIG. 1 is a block diagram showing the functional configuration of the grammar rule learning device according to the present embodiment.
FIG. 2 is a schematic diagram for explaining subtree replacement.
FIG. 3 is a flowchart showing the grammar rule learning processing routine of the present embodiment.
FIG. 4 is a diagram showing the result of comparing the likelihood of the probabilistic grammar model between the method of the present embodiment and the conventional method.
FIG. 5 is a diagram showing an example of a parse tree.
FIG. 6 is a diagram showing an example of the grammar rules obtained from the parse tree of FIG. 5 according to a tree-substitution grammar.

An embodiment of the present invention is described in detail below with reference to the drawings.

As shown in FIG. 1, the grammar rule learning device 10 according to the present embodiment takes as input a parse tree corpus {t}, which is a set of parse trees t annotated with syntactic information, and outputs the optimal grammar rules, that is, the set of subtrees, that maximizes the likelihood of the probabilistic grammar model. The present embodiment is described taking as an example the case where a probabilistic grammar model based on a tree-substitution grammar (see, for example, Non-Patent Document 1) is used.

In a tree-substitution grammar, the probabilistic grammar model is given concretely by, for example, equation (1) below.

Here, e_t is the set of subtrees e that make up one parse tree t; {e_t} is the set of subtrees e over the whole parse tree corpus {t}; and P({e_t}, {t} | θ) is the joint probability of the subtree set {e_t} and the parse tree corpus {t} in the probabilistic grammar model with parameter θ. Further, i is an index over the parse trees in the corpus, j is an index over the subtrees making up a parse tree t, I is the total number of parse trees in the corpus, and J is the total number of subtrees e making up the parse tree t. X denotes the symbol of the root node of a subtree e, and n_e is the number of times the subtree e occurs in the parse tree corpus. θ_X is a parameter of the tree-substitution grammar and is defined for each root-node symbol X; the parameters of the probabilistic grammar model are therefore the set of the θ_X, that is, θ = {θ_X}. n_{·,X} is the number of times subtrees whose root node is X occur in the parse tree corpus. P_0(e | X) is a probability called the base probability of the subtree e and is computed, for example, from a uniform distribution.
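Equation (1) itself did not survive in this text. A reconstruction consistent with the symbol definitions above, and with the later reference to a "lower part" of equation (1), might read as follows. The exact form is an assumption modeled on the Dirichlet-process formulation of Non-Patent Document 1 and is not the verbatim equation of this publication; in particular, whether the counts n_e and n_{·,X} exclude the subtree currently being generated is not stated in the text.

```latex
\begin{aligned}
P(\{e_t\}, \{t\} \mid \theta) &= \prod_{i=1}^{I} P(e_{t_i}, t_i \mid \theta), \\
P(e_t, t \mid \theta) &= \prod_{j=1}^{J}
  \frac{n_{e_j} + \theta_X \, P_0(e_j \mid X)}{n_{\cdot,X} + \theta_X},
\end{aligned}
```

where, in each factor of the lower product, X is the root-node symbol of the subtree e_j.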

The grammar rule learning device 10 takes the parse tree corpus {t} as input and outputs the grammar rules {ê_t} and the parameter θ̂ that maximize the likelihood P({e_t}, {t} | θ) of the probabilistic grammar model given by equation (1).

The grammar rule learning device 10 is constituted by a computer that includes a CPU, a RAM, and a ROM storing a grammar rule learning program for executing the grammar rule learning processing routine described later. That is, the computer functions as the grammar rule learning device 10 when the CPU executes the grammar rule learning program stored in the ROM. The computer may also be configured to include an HDD as a storage means.

Functionally, as shown in FIG. 1, this computer can be represented as a configuration including an initial setting unit 11, a Gibbs sampling unit 12, a replacement update unit 13, and an end determination unit 14. The initial setting unit 11 and the Gibbs sampling unit 12 are an example of the setting unit of the present invention.

The initial setting unit 11 receives the parse tree corpus {t} as input and sets the initial subtrees {e_t}^(0) and the initial parameter θ^(0). The initial subtrees {e_t}^(0) can be set, for example, by randomly assigning a variable of 0 or 1 to each node of a parse tree other than the terminal nodes and the root node, setting each node assigned 1 as the root node of a subtree, and setting each node assigned 0 as an internal node of a subtree (a node other than the root node of the subtree). The initial parameter θ^(0) can likewise be set at random. The initial setting unit 11 outputs the initial subtrees {e_t}^(0) and the initial parameter θ^(0) thus set to the Gibbs sampling unit 12.
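The random initialization described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the nested-tuple tree representation, the function names, and the example words are all assumptions introduced here.

```python
import random

# Assumed representation: a parse tree is a nested tuple (label, child, ...);
# a terminal node (word) is a plain str.

def internal_paths(tree, path=()):
    """Yield the path (tuple of child indices) of every non-terminal node
    except the root of the whole parse tree."""
    for i, child in enumerate(tree[1:]):
        if isinstance(child, tuple):              # non-terminal node
            child_path = path + (i,)
            yield child_path
            yield from internal_paths(child, child_path)

def random_initial_assignment(tree, rng):
    """Assign 0 or 1 at random to each eligible node.
    1 = the node is the root of a subtree; 0 = it is internal to a subtree."""
    return {p: rng.randint(0, 1) for p in internal_paths(tree)}

def extract_subtrees(tree, assignment):
    """Cut the parse tree at every node marked 1 and return the resulting
    subtrees, topmost first. A cut node remains in its parent subtree as a
    bare label, i.e. a one-element tuple."""
    subtrees = []

    def build(node, node_path):
        children = []
        for i, child in enumerate(node[1:]):
            child_path = node_path + (i,)
            if not isinstance(child, tuple):
                children.append(child)                # terminal word
            elif assignment[child_path] == 1:
                children.append((child[0],))          # frontier non-terminal
                subtrees.append(build(child, child_path))
            else:
                children.append(build(child, child_path))
        return (node[0],) + tuple(children)

    subtrees.insert(0, build(tree, ()))
    return subtrees
```

For a hypothetical tree `("NP", ("DT", "the"), ("JJ", "big"), ("N", "dog"))`, marking every eligible node with 0 leaves the tree as a single subtree, while marking every node with 1 yields the top subtree `("NP", ("DT",), ("JJ",), ("N",))` plus one subtree per word.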

The Gibbs sampling unit 12 receives as input either the initial subtrees {e_t}^(0) and the initial parameter θ^(0), or the subtrees {e_t}^(u+1) and the parameter θ^(u+1) output from the end determination unit 14 described later. Here, u is a variable representing the number of times the update process has been repeated. As disclosed, for example, in Non-Patent Document 1, the Gibbs sampling unit 12 visits the nodes of the parse trees in random order, taking each node in turn as the target node to be processed. The Gibbs sampling unit 12 then probabilistically decides whether the target node is the root node of a subtree or an internal node, and updates the current setting of the target node accordingly.
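The per-node decision can be illustrated with the sketch below. It assumes that the probability of a subtree has the form (n_e + θ_X P_0(e | X)) / (n_{·,X} + θ_X), in the manner of Non-Patent Document 1, and that the counts passed in already exclude the subtree or subtrees currently covering the target node. The function names, the argument layout, and the `p0` callable are hypothetical, and `p0` is assumed to return a positive value for at least one of the two alternatives.

```python
import random
from collections import Counter

def predictive(e, root, counts, root_totals, theta, p0):
    """(n_e + theta_X * P0(e|X)) / (n_X + theta_X) for subtree e rooted in X."""
    return (counts[e] + theta[root] * p0(e)) / (root_totals[root] + theta[root])

def sample_is_root(merged, upper, lower, counts, root_totals, theta, p0, rng):
    """Decide whether the target node is a subtree root (True) or internal (False).

    merged       : the single subtree obtained when the node is internal
    upper, lower : the two subtrees obtained when the node is a subtree root
    Each of these is a (root_symbol, subtree_id) pair.
    """
    w_internal = predictive(merged[1], merged[0], counts, root_totals, theta, p0)

    w_upper = predictive(upper[1], upper[0], counts, root_totals, theta, p0)
    # Condition the lower subtree on the upper one having just been generated.
    counts_after = counts.copy()
    counts_after[upper[1]] += 1
    totals_after = root_totals.copy()
    totals_after[upper[0]] += 1
    w_lower = predictive(lower[1], lower[0], counts_after, totals_after, theta, p0)
    w_root = w_upper * w_lower

    # Bernoulli draw with probability proportional to the two weights.
    return rng.random() < w_root / (w_root + w_internal)
```

Here `counts` and `root_totals` are `collections.Counter` objects, so a subtree that has not been seen yet simply has count 0.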

When the process of updating the setting has been completed with every node of the parse trees as the target node, the Gibbs sampling unit 12 updates the parameter θ on the basis of equation (1). As the method for updating θ, for example, the sampling-based method disclosed in Non-Patent Document 1 can be used. In this method, the probability distribution P(θ) of the parameter θ is assumed to follow the gamma distribution Gamma(1.0, 1.0), and P({e_t}, {t}, θ) = P({e_t}, {t} | θ) × P(θ) is computed. Then, with {e_t} held entirely fixed, the updated value θ' given by equation (2) below is searched for by the Markov chain Monte Carlo method, and the parameter θ is updated to θ'.
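One way to realize such a parameter update is a random-walk Metropolis-Hastings step on θ_X for a single root symbol X, sketched below. Equation (2) is not reproduced in this text, so the sketch only assumes that θ is drawn in proportion to P({e_t}, {t} | θ) × P(θ) with the subtrees held fixed; the choice of sampler, the step size, and all names are assumptions, and `p0` is assumed to return strictly positive values.

```python
import math
import random

def log_joint(theta, type_counts, p0):
    """log of P({e}, {t} | theta) * P(theta) for one root symbol X.

    type_counts maps each subtree type e rooted in X to its count n_e.
    Likelihood: sequential product of (k + theta * P0(e)) / (m + theta).
    Prior: Gamma(shape=1.0, scale=1.0), whose log density is -theta.
    """
    if theta <= 0.0:
        return -math.inf
    total = sum(type_counts.values())
    ll = 0.0
    for e, n in type_counts.items():
        ll += sum(math.log(k + theta * p0(e)) for k in range(n))
    ll -= sum(math.log(m + theta) for m in range(total))
    return ll - theta

def mh_update_theta(theta, type_counts, p0, rng, steps=100, step_size=0.5):
    """Random-walk Metropolis-Hastings on theta with the subtrees held fixed."""
    current = log_joint(theta, type_counts, p0)
    for _ in range(steps):
        proposal = theta + rng.gauss(0.0, step_size)
        candidate = log_joint(proposal, type_counts, p0)
        delta = candidate - current
        # Symmetric proposal, so the acceptance ratio is the ratio of joints.
        if delta >= 0 or rng.random() < math.exp(delta):
            theta, current = proposal, candidate
    return theta
```

Proposals with θ ≤ 0 receive a log joint of negative infinity and are therefore always rejected, which keeps θ inside the support of the gamma prior.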

The Gibbs sampling unit 12 outputs the updated subtrees {e_t}^(u) and parameter θ^(u).

The replacement update unit 13 receives as input the subtrees {e_t}^(u) and the parameter θ^(u) updated by the Gibbs sampling unit 12. Rather than visiting each node of the parse trees, the replacement update unit 13 visits the subtrees {e_t}^(u) one type at a time in random order, takes each as the target subtree to be processed, and replaces the target subtree e with another subtree e' according to the probability P(e', {t} | θ). Here, {t} is the set of parse trees t that contain the other subtree e'. The probability P(e', {t} | θ) is calculated as in equation (3) below.

Here, k is an index over the parse trees included in {t}, and K is the total number of parse trees included in {t}. P(e', t_k | θ) is the joint probability of the subtree e' and the parse tree t_k in the probabilistic grammar model with parameter θ, as given in the lower part of equation (1).

In this step, every combination of subtrees that can be formed by dividing the target subtree e is considered as a candidate for the subtree e'. When the subtree e shown in FIG. 2 is the target subtree, all combinations of subtrees that can be formed by dividing it (e'_(1) to e'_(4) in FIG. 2) are candidates for e'. Note that e' may be the same subtree as e; in the example of FIG. 2, e'_(4) is identical to the target subtree e. The replacement update unit 13 selects one candidate according to the probability P(e', {t} | θ) of each candidate e', and replaces every occurrence of the subtree e in the parse tree corpus {t} with the selected subtree e'.
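The candidate enumeration and selection can be sketched as follows. The nested-tuple representation, the function names, and the `log_weight` callable are assumptions introduced for illustration; `log_weight` stands in for the logarithm of P(e', {t} | θ), which from the definitions of k and K is presumably a product of P(e', t_k | θ) over the K parse trees, although equation (3) is not reproduced in this text.

```python
import itertools
import math
import random

# Assumed representation: a subtree is a nested tuple (label, child, ...);
# a word is a str, and a frontier non-terminal is a one-element tuple (label,).

def cut_points(subtree, path=()):
    """Paths of the non-root nodes that have children, i.e. the only
    places where the subtree can be divided."""
    for i, child in enumerate(subtree[1:]):
        if isinstance(child, tuple) and len(child) > 1:
            yield path + (i,)
            yield from cut_points(child, path + (i,))

def split(subtree, cuts):
    """Divide `subtree` at the paths in `cuts`; the top piece comes first."""
    pieces = []

    def build(node, node_path):
        children = []
        for i, child in enumerate(node[1:]):
            p = node_path + (i,)
            if not isinstance(child, tuple) or len(child) == 1:
                children.append(child)            # word or frontier non-terminal
            elif p in cuts:
                children.append((child[0],))      # leave a substitution site
                pieces.append(build(child, p))
            else:
                children.append(build(child, p))
        return (node[0],) + tuple(children)

    top = build(subtree, ())
    return (top,) + tuple(pieces)

def candidates(subtree):
    """All ways of dividing `subtree`; the empty cut set returns it unchanged."""
    paths = list(cut_points(subtree))
    result = []
    for r in range(len(paths) + 1):
        for cuts in itertools.combinations(paths, r):
            result.append(split(subtree, set(cuts)))
    return result

def choose_replacement(subtree, log_weight, rng):
    """Sample one candidate with probability proportional to exp(log_weight)."""
    cands = candidates(subtree)
    logs = [log_weight(c) for c in cands]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]   # stabilised weights
    return rng.choices(cands, weights=weights, k=1)[0]
```

A subtree with two divisible internal nodes yields four candidates, including the undivided subtree itself, which matches the count of candidates e'_(1) to e'_(4) described for FIG. 2. Once a candidate is chosen, every occurrence of the original subtree in the corpus would be rewritten to it in one step.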

When the replacement of subtrees is complete, the replacement update unit 13 updates the parameter θ in the same way as the Gibbs sampling unit 12. Because the replacement update unit 13 updates multiple subtrees of the same type in the parse tree corpus {t} at once, it may change the frequency distribution of the subtrees far more than an update by the Gibbs sampling unit 12 does. Compared with simply repeating Gibbs sampling many times, this raises the chance of finding a subtree assignment with a high likelihood and resolves the problem of the probabilistic grammar model remaining at a local optimum. The replacement update unit 13 outputs the updated subtrees {e_t}^(u) and parameter θ^(u).

The end determination unit 14 receives as input the subtrees {e_t}^(u) and the parameter θ^(u) updated by the replacement update unit 13. The end determination unit 14 determines whether the end condition is satisfied, and causes the processing of the Gibbs sampling unit 12 and the replacement update unit 13 to be repeated until it is. For example, the end determination unit 14 can determine that processing has ended when the current iteration count u reaches a number specified in advance (for example, 3000), and that it has not ended while u is below that number. When the end determination unit 14 determines that the end condition is satisfied, it outputs the current subtrees {e_t}^(u) and parameter θ^(u) as the grammar rules {ê_t} and the parameter θ̂ that maximize the likelihood of the probabilistic grammar model.

When the end determination unit 14 determines that the end condition is not satisfied, it increments the iteration count u by one and outputs the current subtrees {e_t}^(u) and parameter θ^(u) to the Gibbs sampling unit 12 as {e_t}^(u+1) and θ^(u+1).

The determination of the end condition by the end determination unit 14 is not limited to checking whether the iteration count has reached the specified number. For example, the end condition may be judged to be satisfied when the difference between the likelihood of the probabilistic grammar model calculated this time, from the {e_t}^(u) and parameter θ^(u) updated by the replacement update unit 13, and the likelihood calculated the previous time is no greater than a predetermined value.
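The two end conditions can be combined in a small helper such as the one below. This is a sketch only: the function name and arguments are hypothetical, and comparing log-likelihoods rather than raw likelihoods is an assumption made here for numerical convenience.

```python
def should_stop(u, max_iterations=None, log_likelihoods=(), tolerance=None):
    """Return True when either end condition holds.

    u               : current iteration count
    max_iterations  : stop once u reaches this count (e.g. 3000), if given
    log_likelihoods : history of model log-likelihoods, most recent last
    tolerance       : stop once the last two values differ by at most this,
                      if given
    """
    if max_iterations is not None and u >= max_iterations:
        return True
    if tolerance is not None and len(log_likelihoods) >= 2:
        return abs(log_likelihoods[-1] - log_likelihoods[-2]) <= tolerance
    return False
```

Passing only `max_iterations` gives the count-based condition, while passing `log_likelihoods` and `tolerance` gives the convergence-based alternative.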

Next, the operation of the grammar rule learning device 10 according to the present embodiment is described. When the parse tree corpus {t} is input to the grammar rule learning device 10, the device executes the grammar rule learning processing routine shown in FIG. 3. The parse tree corpus {t} is input by reading a corpus stored in an external device, an external storage medium, or the like into a storage device inside the grammar rule learning device 10, for example via a network. Alternatively, a parse tree corpus {t} stored in advance in a storage device inside the grammar rule learning device 10 may be read out to start the grammar rule learning processing routine shown in FIG. 3.

In step 100 of the grammar rule learning processing routine shown in FIG. 3, the initial setting unit 11 receives the parse tree corpus {t} as input and sets the initial subtrees {e_t}^(0) and the initial parameter θ^(0). Next, in step 102, the end determination unit 14 sets the variable u, which indicates the iteration count, to 0.

Next, in step 104, the Gibbs sampling unit 12 visits the nodes of the parse trees in random order, as disclosed, for example, in Non-Patent Document 1, taking each node in turn as the target node. It probabilistically decides whether the target node is the root node of a subtree or an internal node, updates the current setting of the target node, and updates the parameter θ on the basis of equation (1).

Next, in step 106, the replacement update unit 13 visits the subtrees {e_t}^(u) one type at a time in random order, takes each as the target subtree, and treats every combination of subtrees formed by dividing the target subtree e as a candidate for the subtree e'. The replacement update unit 13 then selects one candidate according to the probability P(e', {t} | θ) of each candidate e', and replaces every occurrence of the subtree e in the parse tree corpus {t} with the selected subtree e'. The replacement update unit 13 further updates the parameter θ in the same way as the Gibbs sampling unit 12.

Next, in step 108, the end determination unit 14 determines whether the end condition is satisfied by checking whether the current iteration count u has reached the number specified in advance (for example, 3000). If the end determination unit 14 determines that the end condition is not satisfied, processing proceeds to step 110, where the iteration count u is incremented by one and the current subtrees {e_t}^(u) and parameter θ^(u) are output to the Gibbs sampling unit 12 as {e_t}^(u+1) and θ^(u+1); processing then returns to step 104. In step 104 of the repeated processing, the Gibbs sampling unit 12 uses the updated {e_t}^(u+1) and θ^(u+1) to repeat the update of the target node settings and of the parameter θ. In step 106 of the repeated processing, the replacement update unit 13 uses the {e_t}^(u+1) and θ^(u+1) updated by the Gibbs sampling unit 12 to repeat the subtree replacement and the update of the parameter θ.

If, on the other hand, the end determination unit 14 determines that the end condition is satisfied, processing proceeds to step 112, where the current subtrees {e_t}^(u) and parameter θ^(u) are output as the grammar rules {ê_t} and the parameter θ̂ that maximize the likelihood of the probabilistic grammar model, and the grammar rule learning processing routine ends.
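The control flow of steps 100 to 112 can be summarized in the sketch below. The three callables are hypothetical stand-ins for the initial setting unit 11, the Gibbs sampling unit 12, and the replacement update unit 13; their signatures are assumptions made for illustration, and only the count-based end condition is shown.

```python
def learn_grammar(corpus, initialize, gibbs_step, replacement_step,
                  max_iterations=3000):
    """Run the routine of FIG. 3 with each unit passed in as a callable.

    initialize(corpus)                        -> (subtrees, theta)
    gibbs_step(corpus, subtrees, theta)       -> (subtrees, theta)
    replacement_step(corpus, subtrees, theta) -> (subtrees, theta)
    """
    subtrees, theta = initialize(corpus)                              # step 100
    u = 0                                                             # step 102
    while True:
        subtrees, theta = gibbs_step(corpus, subtrees, theta)         # step 104
        subtrees, theta = replacement_step(corpus, subtrees, theta)   # step 106
        if u >= max_iterations:                                       # step 108
            return subtrees, theta                                    # step 112
        u += 1                                                        # step 110
```

Because u starts at 0 and is checked after steps 104 and 106, this literal reading of the flowchart performs `max_iterations + 1` passes; whether the embodiment counts passes this way is not stated in the text.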

As described above, the grammar rule learning device according to the present embodiment replaces a plurality of subtrees of the same type included in the syntax tree corpus with other subtrees all at once, and updates the parameters of the probabilistic grammar model. Compared with updating subtrees locally, as in the conventional method, this allows the frequency distribution of subtrees to change substantially. It therefore becomes more likely that a high-likelihood assignment of subtrees, that is, a set of high-likelihood grammar rules, will be found, and the problem of the probabilistic grammar model remaining stuck at a local optimum can be resolved. In addition, when a subtree is replaced with other subtrees, all combinations of subtrees formed by dividing the original subtree are considered as candidates, so the frequency distribution of subtrees can be changed in diverse ways.
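
The idea of considering every division of a subtree as a candidate and then replacing all occurrences at once can be illustrated with the following sketch. It rests on several assumptions that are not taken from the embodiment: subtrees are written as nested tuples `(label, child, ...)` with strings as leaves, a derivation is simplified to a flat list of subtrees, the undivided subtree is itself kept among the candidates, a candidate is drawn with probability proportional to its joint probability, and `log_joint` is a hypothetical placeholder for the joint probability under the probabilistic grammar model.

```python
import math
import random
from itertools import product


def splits(tree):
    """Return every division of `tree` as (root_fragment, other_fragments)."""
    if isinstance(tree, str):  # a leaf cannot be divided further
        return [(tree, [])]
    label, *children = tree
    per_child = []
    for child in children:
        options = []
        for frag, others in splits(child):
            options.append((frag, others))  # keep the child attached
            if not isinstance(child, str):
                # cut here: the parent keeps only the child's label,
                # and the child becomes a fragment of its own
                options.append((child[0], [frag] + others))
        per_child.append(options)
    results = []
    for combo in product(*per_child):
        root = (label, *(frag for frag, _ in combo))
        others = [f for _, rest in combo for f in rest]
        results.append((root, others))
    return results


def block_replace(derivations, old, log_joint, rng):
    """Sample one division of `old` and apply it to every occurrence."""
    candidates = [[root] + others for root, others in splits(old)]
    scores = [log_joint(c) for c in candidates]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # shift for numerical stability
    chosen = rng.choices(candidates, weights=weights, k=1)[0]
    replaced = [
        [piece for frag in d for piece in (chosen if frag == old else [frag])]
        for d in derivations
    ]
    return replaced, chosen
```

A subtree with k internal nodes below its root yields 2^k candidates under this enumeration, so a practical implementation would likely need to restrict or factor the candidate set for large subtrees.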

To verify the effect of the grammar rule learning device according to the present embodiment, FIG. 4 shows experimental results comparing the likelihood obtained with the replacement update unit (the present embodiment) and without it (the conventional method based on simple Gibbs sampling). In this experiment, the syntax tree corpus consisted of syntax trees for about 40,000 sentences, and the number of iterations u was set to 1000. As shown in FIG. 4, it was confirmed that, owing to the effect of the replacement update unit, the grammar rule learning device according to the present embodiment finds grammar rules with a higher likelihood than the conventional method.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

The grammar rule learning device described above contains a computer system. Where a WWW system is used, the term "computer system" also includes a homepage providing environment (or display environment).

Although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
10 Grammar rule learning device
11 Initial setting unit
12 Gibbs sampling unit
13 Replacement update unit
14 End determination unit

Claims

A grammar rule learning device comprising:
an initial setting unit that sets initial values of a plurality of subtrees and a predetermined parameter in a probabilistic grammar model that represents a plurality of syntax trees by a plurality of types of subtrees and the predetermined parameter;
an update unit that updates the plurality of subtrees and the predetermined parameter set by the initial setting unit, or the plurality of subtrees and the predetermined parameter as previously updated, by a method of probabilistically updating the assignment of the subtrees constituting the syntax trees;
a replacement update unit that sets, as a second type of subtree, a subtree candidate selected from among the subtree candidates formed by dividing a first type of subtree included in the plurality of subtrees updated by the update unit, the selection being made according to the joint probability, in the probabilistic grammar model represented by the predetermined parameter, of each subtree candidate and the set of syntax trees including that subtree candidate, replaces all subtrees of the first type included in the plurality of syntax trees with the second type of subtree, and updates the predetermined parameter; and
an end determination unit that causes the update of the plurality of subtrees and the predetermined parameter by the update unit, and the replacement of subtrees and the update of the predetermined parameter by the replacement update unit, to be repeated until a predetermined end condition is satisfied, and that outputs each of the plurality of subtrees at the time the end condition is satisfied as a grammar rule, together with the predetermined parameter at that time.

A grammar rule learning method comprising:
a step in which an initial setting unit sets initial values of a plurality of subtrees and a predetermined parameter in a probabilistic grammar model that represents a plurality of syntax trees by a plurality of types of subtrees and the predetermined parameter;
a step in which an update unit updates the plurality of subtrees and the predetermined parameter set by the initial setting unit, or the plurality of subtrees and the predetermined parameter as previously updated, by a method of probabilistically updating the assignment of the subtrees constituting the syntax trees;
a step in which a replacement update unit sets, as a second type of subtree, a subtree candidate selected from among the subtree candidates formed by dividing a first type of subtree included in the plurality of subtrees updated by the update unit, the selection being made according to the joint probability, in the probabilistic grammar model represented by the predetermined parameter, of each subtree candidate and the set of syntax trees including that subtree candidate, replaces all subtrees of the first type included in the plurality of syntax trees with the second type of subtree, and updates the predetermined parameter; and
a step in which an end determination unit causes the update of the plurality of subtrees and the predetermined parameter by the update unit, and the replacement of subtrees and the update of the predetermined parameter by the replacement update unit, to be repeated until a predetermined end condition is satisfied, and outputs each of the plurality of subtrees at the time the end condition is satisfied as a grammar rule, together with the predetermined parameter at that time.

A grammar rule learning program for causing a computer to function as each unit constituting the grammar rule learning device of claim 1.