JP5087994B2

JP5087994B2 - Language analysis method and apparatus

Info

Publication number: JP5087994B2
Application number: JP2007135691A
Authority: JP
Inventors: 哲治中川
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2007-05-22
Filing date: 2007-05-22
Publication date: 2012-12-05
Anticipated expiration: 2027-05-22
Also published as: JP2008293119A

Description

The present invention relates to a language analysis method and apparatus for analyzing language.

When a natural language processing apparatus such as a machine translation system performs machine translation, it commonly analyzes, for example, the dependency structure of an input sentence entered by a user and carries out translation processing based on the result of that analysis.

In such natural language processing apparatuses, it is frequently necessary to analyze the dependency structure of an input sentence. A dependency structure is, for example, a structure that represents the modifier/modified relations (dependency relations) between words or between phrases (bunsetsu).

Non-Patent Document 1: Kiyotaka Uchimoto and two others, "Japanese dependency analysis using a model based on the maximum entropy method", IPSJ Journal, Vol. 40, No. 9 (1999), pp. 3397-3407
Non-Patent Document 2: Taku Kudo, Yuji Matsumoto, "Japanese dependency analysis by cascaded chunking", IPSJ Journal, Vol. 43, No. 6 (2002), pp. 1834-1841
Non-Patent Document 3: Collins and Koo, "Discriminative Reranking for Natural Language Parsing", Computational Linguistics, Vol. 31, No. 1 (2004)

As described in Non-Patent Document 1, methods for analyzing dependency structure include rule-based methods and statistical methods. Statistical methods have the advantage that the parameters and other quantities needed for dependency structure analysis can be estimated automatically from training data. Dependency structure analysis based on a statistical method determines the dependency structure of an unseen sentence from informative cues (features) contained in the data, so the accuracy of the analysis depends on which features are used. The technique described in Non-Patent Document 1 obtains the dependency structure of a sentence statistically by using a maximum entropy model to compute the probability that two phrases are in a dependency relation.

The technique described in Non-Patent Document 2 uses a support vector machine to decide, stage by stage, whether or not a dependency holds between two adjacent phrases, and thereby determines the dependency structure of the whole sentence. This method does not assume that the dependency relations in a sentence are independent of one another. When deciding whether two phrases are in a dependency relation, it also considers features obtained from the phrases that depend on the modifier phrase under consideration, the phrases that depend on the head phrase under consideration, and the phrase on which that head phrase depends. Using such features makes it possible to determine the dependency structure of a sentence with higher accuracy.

In Non-Patent Document 3, the phrase structure of a sentence is determined using a reranking-based method.

However, conventional language analysis techniques have the problems described in (A) to (D) below.

(A) Problem with the technique described in Non-Patent Document 1
The dependency structure analysis method of Non-Patent Document 1 assumes that the dependency relations in a sentence are independent of one another, which limits the features that can be used. When this assumption does not hold, the analysis may fail.

(B) Problem with the technique described in Non-Patent Document 2
The dependency structure analysis method of Non-Patent Document 2 can use not only phrase-level features but also some sentence-level features, such as the heads of several phrases in the sentence. Because the dependency relations in the sentence are decided one at a time in order, features obtained from phrases whose dependency relations have already been decided are available. However, features obtained from phrases whose dependency relations have not yet been decided cannot be used, so arbitrary sentence-level features are not available.

(C) Problem with the technique described in Non-Patent Document 3
Non-Patent Document 3 uses a reranking-based method so that phrase structure analysis can use arbitrary sentence-level features. The method first applies a conventional technique to obtain the top x solution candidates that appear most likely to be correct (for example, x = 30). It then uses sentence-level features to decide which of the x candidates is most likely to be correct. Because every phrase structure in the sentence has already been determined for each of the x candidates, arbitrary sentence-level features can be used. However, if the correct solution is not among the x candidates obtained at the start, this analysis method can never output the correct answer.
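The reranking scheme described in (C) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation of Non-Patent Document 3: the candidate list `nbest`, its base scores, and the sentence-level feature `root_is_last` are all hypothetical. The sketch also makes the stated limitation concrete, since the function can only ever return a member of the candidate list.

```python
from typing import Callable, Sequence, Tuple

# A candidate analysis, encoded here (as an assumption) as a tuple of head
# indices: heads[t] is the head of token t, and -1 marks the root.
Tree = Tuple[int, ...]


def rerank(candidates: Sequence[Tuple[Tree, float]],
           sentence_score: Callable[[Tree], float]) -> Tree:
    """Return the candidate with the best base score plus sentence-level score."""
    best_tree, _ = max(candidates, key=lambda c: c[1] + sentence_score(c[0]))
    return best_tree


# Hypothetical 3-best list produced by some conventional first-pass parser.
nbest = [((1, -1, 1), -1.0), ((-1, 0, 1), -1.2), ((2, 2, -1), -2.0)]


def root_is_last(heads: Tree) -> float:
    """Hypothetical sentence-level feature: reward trees rooted at the last token."""
    return 1.5 if heads[-1] == -1 else 0.0


chosen = rerank(nbest, root_is_last)
```

Whatever sentence-level features are supplied, `chosen` is always one of the trees in `nbest`; a correct tree that the first pass failed to propose cannot be recovered.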

(D) The applicant's earlier proposal (Japanese Patent Application No. 2007-10871, not publicly known; hereinafter simply the "earlier proposal")
To solve problems (A) to (C), the applicant previously proposed a method that uses Gibbs sampling to analyze the dependency structure of a sentence statistically while using arbitrary features in the sentence. A characteristic of the earlier proposal is that it can use arbitrary sentence-level features and can analyze the dependency structure without preparing solution candidates in advance.

Specifically, the earlier proposal is a language processing method for analyzing the dependency structure of an input sentence, comprising: a phrase-level model parameter estimation step of estimating phrase-level model parameters, which are the parameters of a probability model for analyzing dependency structure using phrase-level features; a sentence-level model parameter estimation step of estimating sentence-level model parameters, which are the parameters of a probability model for analyzing dependency structure using sentence-level features; a sample generation step of generating samples of the dependency structure of the input sentence from the probability model defined by the sentence-level model parameters and the phrase-level model parameters; and a dependency structure determination step of determining the optimal dependency structure from the samples generated in the sample generation step.
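The sample generation step can be pictured with a small Gibbs sampler over head assignments. The model and sampler of the earlier proposal are not given in this section, so everything below is an assumption made for illustration: the log-linear form of the score, the `local` phrase-level score table, the `sentence_score` feature function, and the choice to resample one head at a time. The sketch does not enforce that a sample is a tree.

```python
import math
import random

ROOT = -1  # head index used for the root of the sentence


def log_score(heads, local, sentence_score):
    """Unnormalized log-probability: phrase-level scores plus a sentence-level score.

    local[t][h + 1] is the score for "token t depends on h"; column 0 is ROOT.
    """
    return sum(local[t][h + 1] for t, h in enumerate(heads)) + sentence_score(heads)


def gibbs_samples(local, sentence_score, n_sweeps, rng):
    """Resample each token's head in turn, conditioned on all other heads."""
    n = len(local)
    heads = [ROOT if t == n - 1 else n - 1 for t in range(n)]  # arbitrary start
    samples = []
    for _ in range(n_sweeps):
        for t in range(n):
            candidates = [h for h in range(ROOT, n) if h != t]
            log_w = []
            for h in candidates:
                heads[t] = h
                log_w.append(log_score(heads, local, sentence_score))
            top = max(log_w)  # subtract the maximum for numerical stability
            weights = [math.exp(x - top) for x in log_w]
            heads[t] = rng.choices(candidates, weights=weights)[0]
        samples.append(tuple(heads))
    return samples


# Hypothetical 3-token sentence whose phrase-level scores strongly favor
# the analysis 0 -> 1, 1 -> 2, 2 -> ROOT.
local = [[0.0, 0.0, 30.0, 0.0],
         [0.0, 0.0, 0.0, 30.0],
         [30.0, 0.0, 0.0, 0.0]]


def single_root(heads):
    """Hypothetical sentence-level feature: reward exactly one root."""
    return 1.0 if list(heads).count(ROOT) == 1 else 0.0


samples = gibbs_samples(local, single_root, n_sweeps=20, rng=random.Random(0))
```

Because `sentence_score` receives the complete head assignment at every step, it can inspect any part of the sentence, which is the property the earlier proposal relies on.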

However, the earlier proposal also has the following drawbacks.
To select a solution for the dependency structure (the head of each word in the sentence), the earlier proposal chooses, for each word, the head that maximizes the marginal probability computed by Gibbs sampling. A dependency tree contains no cycles (sequences of nodes that loop back on themselves in the graph), and in some languages dependency relations also do not cross. A solution determined by the earlier proposal, however, is not guaranteed to be a well-formed dependency tree. In addition, the earlier proposal only determines the head of each word in the sentence, whereas applications of natural language processing often need to know not only which word each word depends on but also the kind of relation involved (a dependency relation such as subject, object, or coordination).
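The first drawback can be made concrete with a small example. The marginal probabilities below are invented for illustration and the helper names are hypothetical; the point is only that choosing each word's most probable head independently can produce a cycle.

```python
def argmax_heads(marginals):
    """Pick, for each token, the head with the highest marginal probability.

    marginals[t] maps a head index (-1 for the root) to an estimate of
    P(head of token t = that index).
    """
    return [max(m, key=m.get) for m in marginals]


def has_cycle(heads):
    """Return True if following head links from some token revisits a token."""
    for start in range(len(heads)):
        seen = set()
        node = start
        while node != -1:
            if node in seen:
                return True
            seen.add(node)
            node = heads[node]
    return False


# Hypothetical marginals for a 3-token sentence.
marginals = [
    {1: 0.6, -1: 0.4},   # token 0 most likely depends on token 1
    {0: 0.55, 2: 0.45},  # token 1 most likely depends on token 0
    {-1: 0.9, 1: 0.1},   # token 2 is most likely the root
]

heads = argmax_heads(marginals)
cyclic = has_cycle(heads)
```

Here tokens 0 and 1 each select the other as head, so the result is not a tree even though every individual choice is locally the most probable one.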

Consequently, it has been difficult to realize a language analysis technique that is fully satisfactory from a technical standpoint.

To solve the above problems, an object of the present invention is to provide a language analysis method and apparatus that apply a maximum spanning tree search technique so that only solutions that are well-formed dependency trees are output, and that identify dependency relation labels in addition to determining heads.

The language analysis method of the first invention is a language analysis method that uses a computer to perform language analysis on an input sentence divided into words or phrases, comprising: a phrase-level model parameter estimation procedure of obtaining phrase-level model parameters of a phrase-level probability distribution based on a parsed corpus stored in storage means; a parameter-estimation sample generation procedure of generating, based on the phrase-level model parameters and the corpus, samples of dependency trees for each sentence in the corpus from the phrase-level probability distribution; a sentence-level model parameter estimation procedure of obtaining sentence-level model parameters of a sentence-level probability distribution based on the samples and the corpus; an input procedure of receiving the input sentence; a dynamic sample generation procedure of generating, for the input sentence, samples of sentence-level dependency trees by Gibbs sampling based on the sentence-level probability distribution over dependency structures defined by the phrase-level model parameters and the sentence-level model parameters; and a solution search procedure of obtaining scores from marginal probabilities for the generated samples and determining the optimal dependency structure using the Eisner method or the Chu-Liu-Edmonds method.
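The solution search procedure can be illustrated in two parts: estimating marginal probabilities from samples, then searching only among well-formed trees. The sketch below is not the Eisner or Chu-Liu-Edmonds method. It substitutes an exhaustive search that is practical only for very short sentences, and it assumes that marginals are estimated as relative frequencies and that a tree is scored by the sum of its edge marginals. The sample data and function names are hypothetical.

```python
from collections import Counter
from itertools import product

ROOT = -1


def edge_marginals(samples, n):
    """Estimate P(head of token t = h) as the relative frequency in the samples."""
    counts = [Counter() for _ in range(n)]
    for heads in samples:
        for t, h in enumerate(heads):
            counts[t][h] += 1
    return [{h: c / len(samples) for h, c in counts[t].items()} for t in range(n)]


def is_tree(heads):
    """A well-formed tree has exactly one root and no cycles."""
    if list(heads).count(ROOT) != 1:
        return False
    for start in range(len(heads)):
        seen = set()
        node = start
        while node != ROOT:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


def best_tree(marginals):
    """Exhaustive stand-in for a maximum spanning tree search (tiny inputs only)."""
    n = len(marginals)
    best, best_score = None, float("-inf")
    for heads in product(range(ROOT, n), repeat=n):
        if not is_tree(heads):
            continue
        score = sum(marginals[t].get(h, 0.0) for t, h in enumerate(heads))
        if score > best_score:
            best, best_score = heads, score
    return best


# Hypothetical samples for a 3-token sentence.
samples = [(1, 0, -1), (1, 0, -1), (1, 2, -1), (2, 0, -1), (1, 2, -1)]
marginals = edge_marginals(samples, 3)
solution = best_tree(marginals)
```

With these samples, choosing each head independently would pair tokens 0 and 1 in a cycle, whereas the tree-constrained search returns a valid analysis. A real implementation would replace `best_tree` with a polynomial-time algorithm.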

The language analysis method of the second invention is the language analysis method of the first invention, further comprising a dependency relation label determination procedure of, when the determined dependency structure is an unlabeled dependency structure, identifying the dependency relation labels for that unlabeled dependency structure using a probability model defined by labeling model parameters.
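Label determination on an already fixed unlabeled structure can be sketched as an independent choice of the most probable label for each dependency. The form of the labeling model is not specified in this section, so the lookup table `table`, the function `label_prob`, and the label set (including a `root` label) are assumptions made purely for illustration.

```python
def assign_labels(heads, label_prob, labels):
    """For each dependency (t, heads[t]), choose the label with the highest probability."""
    return [max(labels, key=lambda lab: label_prob(t, h, lab))
            for t, h in enumerate(heads)]


# Hypothetical label inventory and labeling model for a 3-token sentence.
labels = ["subject", "object", "coordination", "root"]
table = {
    (0, 1, "subject"): 0.7, (0, 1, "object"): 0.3,
    (1, -1, "root"): 1.0,
    (2, 1, "object"): 0.8, (2, 1, "coordination"): 0.2,
}


def label_prob(t, h, lab):
    """Probability of label `lab` for the dependency from token t to head h."""
    return table.get((t, h, lab), 0.0)


unlabeled = (1, -1, 1)  # tokens 0 and 2 depend on token 1, which is the root
labeled = assign_labels(unlabeled, label_prob, labels)
```

Because the tree is fixed before labeling, this step cannot change any head; it only annotates the existing dependencies.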

The language analysis method of the third invention is a language analysis method that uses a computer to perform language analysis on an input sentence divided into words or phrases, comprising: a labeled phrase-level model parameter estimation procedure of obtaining labeled phrase-level model parameters of a phrase-level probability distribution based on a parsed corpus stored in storage means; a parameter-estimation sample generation procedure of generating, based on the labeled phrase-level model parameters and the corpus, samples of labeled dependency trees for each sentence in the corpus from the phrase-level probability distribution; a sentence-level model parameter estimation procedure of obtaining labeled sentence-level model parameters of a sentence-level probability distribution based on the samples and the corpus; an input procedure of receiving the input sentence; a labeled dynamic sample generation procedure of generating samples of sentence-level dependency trees using Gibbs sampling based on the sentence-level probability distribution over dependency structures defined by the labeled phrase-level model parameters and the labeled sentence-level model parameters; and a solution search procedure of determining the optimal dependency structure from the generated samples using a maximum spanning tree search technique.

The language analysis method of the fourth invention is a language analysis method that uses a computer to perform language analysis on an input sentence divided into words or phrases, comprising: a phrase-level model parameter estimation procedure of obtaining label-free phrase-level model parameters of a phrase-level probability distribution based on a parsed corpus stored in storage means; a labeling model parameter estimation procedure of obtaining labeling model parameters using the corpus; a labeled parameter-estimation sample generation procedure of generating, based on the label-free phrase-level model parameters, the corpus, and the labeling model parameters, samples of labeled dependency trees for each sentence in the corpus from the phrase-level probability distribution; a sentence-level model parameter estimation procedure of obtaining labeled sentence-level model parameters based on the samples and the corpus; an input procedure of receiving the input sentence; a labeled dynamic sample generation procedure of generating samples of sentence-level dependency trees using Gibbs sampling based on the sentence-level probability distribution over dependency structures defined by the label-free phrase-level model parameters and the labeled sentence-level model parameters; and a solution search procedure of determining the optimal dependency structure from the generated samples using a maximum spanning tree search technique.

The language analysis apparatus of the fifth invention is a language analysis apparatus that performs language analysis on an input sentence divided into words or phrases, comprising: a phrase-level model parameter estimation unit that obtains phrase-level model parameters of a phrase-level probability distribution based on a parsed corpus stored in storage means; a parameter-estimation sample generation unit that generates, based on the phrase-level model parameters and the corpus, samples of dependency trees for each sentence in the corpus from the phrase-level probability distribution; a sentence-level model parameter estimation unit that obtains sentence-level model parameters of a sentence-level probability distribution based on the samples and the corpus; an input unit that receives the input sentence; a dynamic sample generation unit that generates, for the input sentence, samples of sentence-level dependency trees by Gibbs sampling based on the sentence-level probability distribution over dependency structures defined by the phrase-level model parameters and the sentence-level model parameters; and a solution search unit that obtains scores from marginal probabilities for the generated samples and determines the optimal dependency structure using the Eisner method or the Chu-Liu-Edmonds method.

The language analysis apparatus of the sixth invention is the language analysis apparatus of the fifth invention, further comprising a dependency relation label determination unit that, when the determined dependency structure is an unlabeled dependency structure, identifies the dependency relation labels for that unlabeled dependency structure using a probability model defined by labeling model parameters.

The language analysis apparatus of the seventh invention is a language analysis apparatus that performs language analysis on an input sentence divided into words or phrases, comprising: a labeled phrase-level model parameter estimation unit that obtains labeled phrase-level model parameters of a phrase-level probability distribution based on a parsed corpus stored in storage means; a parameter-estimation sample generation unit that generates, based on the labeled phrase-level model parameters and the corpus, samples of labeled dependency trees for each sentence in the corpus from the phrase-level probability distribution; a sentence-level model parameter estimation unit that obtains labeled sentence-level model parameters of a sentence-level probability distribution based on the samples and the corpus; an input unit that receives the input sentence; a labeled dynamic sample generation unit that generates, for the input sentence, samples of sentence-level dependency trees by Gibbs sampling based on the sentence-level probability distribution over dependency structures defined by the labeled phrase-level model parameters and the labeled sentence-level model parameters; and solution search means that determines the optimal dependency structure from the generated samples using a maximum spanning tree search technique.

The language analysis apparatus of the eighth invention is a language analysis apparatus that performs language analysis on an input sentence divided into words or phrases, comprising: a phrase-level model parameter estimation unit that obtains label-free phrase-level model parameters of a phrase-level probability distribution based on a parsed corpus stored in storage means; a labeling model parameter estimation unit that obtains labeling model parameters using the corpus; a labeled parameter-estimation sample generation unit that generates, based on the label-free phrase-level model parameters, the corpus, and the labeling model parameters, samples of labeled dependency trees for each sentence in the corpus from the phrase-level probability distribution; a sentence-level model parameter estimation unit that obtains labeled sentence-level model parameters of a sentence-level probability distribution based on the samples and the corpus; an input unit that receives the input sentence; a labeled dynamic sample generation unit that generates, for the input sentence, samples of sentence-level dependency trees by Gibbs sampling based on the sentence-level probability distribution over dependency structures defined by the label-free phrase-level model parameters and the labeled sentence-level model parameters; and a solution search unit that determines the optimal dependency structure from the generated samples using a maximum spanning tree search technique.

According to the language analysis method and apparatus of the present invention, in a dependency structure analysis method or system that uses Gibbs sampling and can therefore exploit any feature in a sentence, applying a maximum spanning tree search technique when searching for a solution makes it possible to obtain a solution that is a well-formed dependency tree.

The language analysis apparatus has a dynamic sample generation unit that generates, for an input sentence, samples of dependency structures from the probability model defined by phrase-level model parameters and sentence-level model parameters, and a solution search unit that determines the optimal dependency structure from the generated samples using a maximum spanning tree search technique.

(Configuration of Embodiment 1)
FIG. 1 is a schematic configuration diagram of a language analysis apparatus according to Embodiment 1 of the present invention.
This language analysis apparatus is realized, for example, by executing a language analysis program on a computer having a central processing unit (hereinafter "CPU") and external and internal storage devices. It comprises, among other components, an analysis unit 10 that performs dependency structure analysis on an input sentence to determine its dependency relations, a model storage unit 20 that stores the parameters of the probability models used for dependency structure analysis, and a parameter estimation unit 30 that learns the model parameters from a parsed corpus.

The analysis unit 10 has an input unit 11, such as a keyboard, through which the user enters the sentence to be analyzed; an unlabeled dynamic sample generation unit 12; a solution search unit 13; a dependency relation label determination unit 14; and an output unit 15 such as a display or printer. The unlabeled dynamic sample generation unit 12, the solution search unit 13, and the dependency relation label determination unit 14 are realized, for example, by program control of the CPU.

Of these, the unlabeled dynamic sample generation unit 12 is connected to the input unit 11. For the input sentence, it uses the label-free phrase-level model parameters and the label-free sentence-level model parameters to generate a large number of unlabeled dependency tree samples from the probability distribution defined by those parameters. The solution search unit 13 is connected to its output side. The solution search unit 13 determines the dependency structure of the input sentence using the generated samples, and the dependency relation label determination unit 14 is connected to its output side. The dependency relation label determination unit 14 uses the labeling model parameters to assign dependency relation labels to the sentence whose dependency structure has been determined, and the output unit 15 is connected to its output side. The output unit 15 is a device that outputs to the user the sentence for which the dependency structure and its dependency relations have been determined.
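The data flow through units 12, 13, and 14 can be summarized as a three-stage pipeline. The sketch below only wires the stages together; the class name `AnalysisUnit`, the callable signatures, and the stub implementations are hypothetical placeholders and do not reflect how the embodiment implements each unit.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Heads = Tuple[int, ...]  # heads[t] is the head of token t; -1 marks the root


@dataclass
class AnalysisUnit:
    generate_samples: Callable[[Sequence[str]], List[Heads]]    # role of unit 12
    search_solution: Callable[[List[Heads]], Heads]             # role of unit 13
    decide_labels: Callable[[Sequence[str], Heads], List[str]]  # role of unit 14

    def analyze(self, tokens: Sequence[str]):
        samples = self.generate_samples(tokens)     # unlabeled tree samples
        heads = self.search_solution(samples)       # one unlabeled structure
        labels = self.decide_labels(tokens, heads)  # labels for that structure
        return list(zip(tokens, heads, labels))


# Stub stages, for illustration only.
def stub_samples(tokens):
    return [(1, -1, 1), (1, -1, 1), (2, 2, -1)]


def stub_search(samples):
    # Placeholder: the most frequent sample, not a maximum spanning tree search.
    return Counter(samples).most_common(1)[0][0]


def stub_labels(tokens, heads):
    return ["root" if h == -1 else "dep" for h in heads]


unit = AnalysisUnit(stub_samples, stub_search, stub_labels)
result = unit.analyze(["w1", "w2", "w3"])
```

Keeping the three stages as separate callables mirrors the description, in which each unit consumes only the output of the unit before it.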

The model storage unit 20 has a first storage unit (for example, a label-free phrase-level model parameter storage unit) 21, a second storage unit (for example, a label-free sentence-level model parameter storage unit) 22, and a third storage unit (for example, a labeling model parameter storage unit) 23, each implemented in an external or internal storage device. The label-free phrase-level model parameter storage unit 21 stores the parameters of the label-free phrase-level model used by the unlabeled dynamic sample generation unit 12 and other units. The label-free sentence-level model parameter storage unit 22 stores the parameters of the label-free sentence-level model used by the unlabeled dynamic sample generation unit 12. The labeling model parameter storage unit 23 stores the parameters of the labeling model used by the dependency relation label determination unit 14.

The parameter estimation unit 30 has a parsed corpus storage unit 31, a label-free phrase-level model parameter estimation unit 32, a parameter-estimation unlabeled sample generation unit 33, a label-free sentence-level model parameter estimation unit 34, and a labeling model parameter estimation unit 35, which are connected to one another. Of these, the parsed corpus storage unit 31 stores the parsed corpus used by the parameter estimation units 32, 34, and 35 and by the sample generation unit 33, and is implemented in an external or internal storage device.

The parameter estimation units 32, 34, and 35 and the sample generation unit 33 are realized, for example, by program control of the CPU. The label-free phrase-level model parameter estimation unit 32 obtains the parameters of the label-free phrase-level model using the corpus stored in the parsed corpus storage unit 31 and stores the result in the label-free phrase-level model parameter storage unit 21. The parameter-estimation unlabeled sample generation unit 33 uses the corpus stored in the parsed corpus storage unit 31 and the label-free phrase-level model parameters stored in the storage unit 21 to generate, from the probability distribution defined by those parameters, samples of unlabeled dependency structures for each sentence in the corpus, and passes the generated samples to the label-free sentence-level model parameter estimation unit 34.

The unlabeled sentence-unit model parameter estimation unit 34 uses the corpus stored in the parsed corpus storage unit 31 and the samples generated by the unlabeled sample generation unit 33 for parameter estimation to estimate the parameters of the unlabeled sentence-unit model, and stores the result in the unlabeled sentence-unit model parameter storage unit 22. The labeling model parameter estimation unit 35 uses the corpus stored in the parsed corpus storage unit 31 to estimate the parameters of the labeling model, and stores the result in the labeling model parameter storage unit 23.

(Dependency structure analysis processing in the language analysis method of Embodiment 1)
FIG. 2 is a flowchart showing the dependency structure analysis processing in which the language analysis apparatus of FIG. 1 analyzes the dependency structure of a sentence entered by the user and outputs the result.

First, the user enters the sentence to be analyzed through the input unit 11 (step S1). The input sentence is assumed to have already been segmented into words or phrases, with parts of speech and similar attributes already estimated. The t-th phrase (word) in the input sentence is denoted w_t. The whole input sentence is represented by the set W of its phrases, as in equation (1), and the number of phrases in the sentence is denoted |W|.

W = {w_1, …, w_|W|}   (1)

Similarly, h_t denotes the index of the phrase on which the t-th phrase depends (its head), and the set of all h_t is denoted H, as in equation (2).

H = {h_1, …, h_|W|}   (2)

Because the phrase that is the head of the whole sentence has no head of its own, its head is represented by, for example, zero (0). The label of the dependency relation of the t-th phrase (for example, whether it is a subject) is denoted l_t, and the set of all l_t is denoted L, as in equation (3).

L = {l_1, …, l_|W|}   (3)

The dependency structure analysis that the language analysis method of Embodiment 1 aims at is therefore the following: given a phrase sequence W, find the set H of the heads of its phrases and the set L of their labels.
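The notation above can be made concrete with a small data structure. The following is only an illustrative sketch: the class name `Parse`, the example sentence, and the label strings are hypothetical and do not appear in the patent. It stores W, H, and L as parallel lists, with head indices counted from 1 and 0 reserved for the phrase that has no head.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Parse:
    """Hypothetical container for one analyzed sentence."""
    words: List[str]   # W: w_1 .. w_|W|
    heads: List[int]   # H: h_t is the 1-based index of the head of w_t, 0 = no head
    labels: List[str]  # L: l_t is the dependency relation label of w_t

    def validate(self) -> None:
        n = len(self.words)
        if len(self.heads) != n or len(self.labels) != n:
            raise ValueError("W, H and L must have the same length")
        for t, h in enumerate(self.heads, start=1):
            if not 0 <= h <= n:
                raise ValueError(f"head of phrase {t} is out of range")
            if h == t:
                raise ValueError(f"phrase {t} cannot be its own head")


# Illustrative input only: a three-word sentence whose second word is the root.
example = Parse(words=["I", "saw", "her"],
                heads=[2, 0, 2],
                labels=["SBJ", "ROOT", "OBJ"])
example.validate()
```

In these terms, the analysis takes `words` as input and produces `heads` and `labels`.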

For the phrase sequence W entered through the input unit 11, the unlabeled dynamic sample generation unit 12 uses the unlabeled phrase-unit model parameters stored in the storage unit 21 and the unlabeled sentence-unit model parameters stored in the storage unit 22 to dynamically generate, by Gibbs sampling, samples of unlabeled dependency trees from the sentence-level probability distribution over dependency structures defined by those parameters (step S2).

Gibbs sampling (also called the Gibbs sampler or the heat bath method), which is also described in the earlier proposal mentioned above, is briefly explained here.

Gibbs sampling is a technique that generates samples dynamically by using random numbers to rewrite the values of variables one after another. Each rewrite is performed by sampling from a conditional distribution. Write the variable x as a collection of components, x = {x_i}, i = 1, …, N, where each x_i may be a natural element of the original problem or a regrouping of several such elements.

At each step, one component x_i is selected, its current value is discarded, and a new value is drawn. Gibbs sampling draws the new value of x_i from the conditional probability P(x_i | x_1, …, x_{i-1}, x_{i+1}, …, x_N) with all other components held fixed. A characteristic of the method is that the draw does not refer to the previous value of the same component. For continuous variables, a conditional density is used instead.

Starting from an arbitrary initial state, the operation of redrawing one component from its conditional probability given the others is repeated indefinitely. The component i may be chosen at random each time or visited in an order fixed before the computation begins. Gibbs sampling is thus an approach that splits a problem into parts that can each be handled efficiently and then integrates them. The key point is the manner of integration: the state is updated part by part by sampling from conditional distributions. This keeps the computation consistent as a whole and prevents it from breaking down, even in high dimensions, because of the gap between the parts and the whole.

Let the target distribution be the posterior distribution π* with probability density function π(θ | x), and suppose that θ can be partitioned into blocks as θ = (θ_1, …, θ_p). Let π(θ_i | θ_{-i}, x) denote the density of the conditional posterior distribution π*_i given θ_{-i} = (θ_1, …, θ_{i-1}, θ_{i+1}, …, θ_p) and x, and assume that sampling from this conditional distribution is easy. The Gibbs sampling algorithm is then as follows.
(a) Choose an initial value θ^(0) = (θ_1^(0), θ_2^(0), …, θ_p^(0)) and set t = 1.
(b) For i = 1, …, p, draw
θ_i^(t) ~ π(θ_i | θ_{-i}^(t), x),
where θ_{-i}^(t) = (θ_1^(t), …, θ_{i-1}^(t), θ_{i+1}^(t-1), …, θ_p^(t-1)).
(c) Set t to t + 1 and return to (b).

Steps (b) and (c) are repeated, and for a sufficiently large number N, the values θ^(t) = (θ_1^(t), θ_2^(t), …, θ_p^(t)) with t ≥ N are taken as random samples from the posterior distribution π*.
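Steps (a) to (c) can be sketched in a few lines of Python for a small discrete target. This is a minimal illustration rather than the sampler of the patent: the function name `gibbs_sample`, the toy joint table, and the burn-in length are assumptions made here. The target only needs to be known up to a constant, since each conditional is obtained by renormalizing over the values of one component.

```python
import random


def gibbs_sample(joint, n_vars, n_values, n_sweeps, burn_in, seed=0):
    """Draw samples from an unnormalized discrete joint distribution.

    joint(state) returns a positive weight for a tuple of n_vars values,
    each in range(n_values). Samples from the first burn_in sweeps are dropped.
    """
    rng = random.Random(seed)
    state = [0] * n_vars                      # step (a): arbitrary initial state
    samples = []
    for sweep in range(n_sweeps):             # step (c): repeat the sweep
        for i in range(n_vars):               # step (b): update each component
            # Conditional of component i given all others, up to a constant.
            weights = []
            for v in range(n_values):
                candidate = state[:i] + [v] + state[i + 1:]
                weights.append(joint(tuple(candidate)))
            state[i] = rng.choices(range(n_values), weights=weights)[0]
        if sweep >= burn_in:
            samples.append(tuple(state))
    return samples


# Toy target over two binary variables that prefer to agree (illustrative only).
table = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
samples = gibbs_sample(lambda s: table[s], n_vars=2, n_values=2,
                       n_sweeps=5000, burn_in=500)
share_00 = sum(1 for s in samples if s == (0, 0)) / len(samples)
print(f"empirical share of (0, 0): {share_00:.3f}")
```

Here the components are visited in a fixed order. Choosing the component at random on each step, as the description also allows, would only change the inner loop.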

This concludes the explanation of Gibbs sampling. The Gibbs sampling executed by the unlabeled dynamic sample generation unit 12 is the same as the Gibbs sampling described in Reference 1 below.
Reference 1: Yukito Iba and five others, "Computational Statistics II: Markov Chain Monte Carlo Methods and Related Topics" (2005), Iwanami Shoten

Various probability distributions could be used for the sentence-level dependency structure, that is, for the distribution of the set H given a phrase sequence W. For example, the probability distribution P(H | W) defined by equations (4) to (8) below, as presented in the earlier proposal, can be used.

In equations (4) to (8), Q(H | W) is the phrase-level probability distribution over dependency structures, λ_k is a sentence-unit model parameter, Ω(W) is the set of all possible H for a given phrase set W, and f_k(W, H) is an arbitrary sentence-level feature. Further, μ_i is a phrase-unit model parameter, and g_i(W, t, h) is a phrase-level feature concerning the t-th phrase in the phrase sequence W depending on the h-th phrase. As the features g_i(W, t, h), one can use features such as those described in Non-Patent Document 1, for example the headwords and parts of speech of w_t and w_h, the distance between the phrases, and whether a comma occurs between them.
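Equations (4) to (8) themselves are not reproduced in this text, so the following Python sketch is only one reading consistent with the symbol definitions above. It assumes a log-linear phrase-level model q(h | W, t) proportional to exp(Σ_i μ_i g_i(W, t, h)), takes Q(H | W) as the product of q over all phrases, and assumes a sentence-level model proportional to Q(H | W) exp(Σ_k λ_k f_k(W, H)). The function names and the two toy features are hypothetical.

```python
import math
from typing import Callable, Dict, Sequence

PhraseFeature = Callable[[Sequence[str], int, int], float]        # g_i(W, t, h)
SentenceFeature = Callable[[Sequence[str], Sequence[int]], float]  # f_k(W, H)


def phrase_head_distribution(words: Sequence[str], t: int,
                             mu: Sequence[float],
                             g: Sequence[PhraseFeature]) -> Dict[int, float]:
    """Assumed phrase-level model: q(h | W, t) for h in 0..|W|, h != t."""
    n = len(words)
    scores = {h: math.exp(sum(m * g_i(words, t, h) for m, g_i in zip(mu, g)))
              for h in range(0, n + 1) if h != t}
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}


def unnormalized_sentence_prob(words: Sequence[str], heads: Sequence[int],
                               mu: Sequence[float], g: Sequence[PhraseFeature],
                               lam: Sequence[float],
                               f: Sequence[SentenceFeature]) -> float:
    """Assumed sentence-level model, up to the normalizer over Omega(W)."""
    q = 1.0
    for t, h in enumerate(heads, start=1):    # Q(H | W) as a product over phrases
        q *= phrase_head_distribution(words, t, mu, g)[h]
    return q * math.exp(sum(l * f_k(words, heads) for l, f_k in zip(lam, f)))


# Toy features (illustrative only): prefer nearby heads, prefer exactly one root.
g_toy = [lambda W, t, h: -abs(t - h)]
f_toy = [lambda W, H: 1.0 if list(H).count(0) == 1 else 0.0]
W_toy = ["a", "b", "c"]
one_root = unnormalized_sentence_prob(W_toy, [2, 0, 2], [1.0], g_toy, [2.0], f_toy)
two_roots = unnormalized_sentence_prob(W_toy, [0, 0, 2], [1.0], g_toy, [2.0], f_toy)
print(one_root > two_roots)
```

The normalizer over Ω(W) is never computed in this sketch. Gibbs sampling needs only ratios of such unnormalized values, which is what allows sentence-level features f_k to be arbitrary.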

Non-Patent Document 1 describes features as follows. A feature is, for example, information used to compute the probability of a dependency between two phrases. Concretely, features include surface strings, parts of speech, conjugation forms, the presence or absence of parentheses or punctuation, the distance between phrases, and combinations of these, as listed in Table 2 of Non-Patent Document 1 (p. 3401). Each feature in that table consists of a feature name and a feature value and represents, for a pair of phrases in a sentence, an attribute that either phrase (the preceding phrase or the following phrase) can have or an attribute that can appear between the two phrases. A probability distribution over dependency structures computed from such phrase-level features is therefore computed under the assumption that the dependency relations in the sentence are independent of one another.

For the probability distribution P(H | W) defined in this way, samples can be generated with Gibbs sampling by the method described in the earlier proposal.

Using the S samples {H^(1), …, H^(S)} generated by the unlabeled dynamic sample generation unit 12, the solution search unit 13 searches for and selects the solution considered to be the optimal dependency structure (step S3). The marginal probabilities defined by equations (9) and (10) below are used here.

Here, P_t(h | W) is the probability that the t-th word has h as its head. If the candidate that maximizes this value were simply selected for each word, the resulting structure could be invalid as a dependency tree, for example because it contains a cycle. The optimal dependency tree is therefore found with a maximum spanning tree search. Dependency structure analysis can be solved as a maximum spanning tree search problem, as described, for example, in Reference 2 below.
Reference 2: University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-06-11 (2006), McDonald et al., "Spanning Tree Methods for Discriminative Training of Dependency Parsers"
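Equations (9) and (10) are not reproduced in this text. A natural estimate, assumed here, is the fraction of the S samples in which the t-th word takes head h. The sketch below computes that estimate and shows how picking the most probable head for each word independently can produce a cycle. The function name `head_marginals` and the sample values are illustrative.

```python
from collections import Counter
from typing import Dict, List, Sequence


def head_marginals(samples: Sequence[Sequence[int]], n: int) -> List[Dict[int, float]]:
    """Estimate P_t(h | W) as the relative frequency of head h for word t.

    samples[s][t] is the head (1-based, 0 = no head) of word t+1 in sample s.
    """
    total = len(samples)
    marginals = []
    for t in range(n):
        counts = Counter(heads[t] for heads in samples)
        marginals.append({h: c / total for h, c in counts.items()})
    return marginals


# Six hypothetical samples for a three-word sentence; each one is a valid tree.
samples = [[2, 0, 2], [2, 0, 2], [0, 1, 1], [0, 1, 1], [2, 3, 0], [3, 1, 0]]
marginals = head_marginals(samples, n=3)
greedy = [max(m, key=m.get) for m in marginals[:2]]
print(greedy)  # word 1 picks head 2 and word 2 picks head 1, which is a cycle
```

Although every individual sample is a tree, the per-word maxima combine into a structure that is not, which is the situation the maximum spanning tree search is meant to avoid.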

As described in Reference 2, for languages in which dependency relations do not cross, the Eisner algorithm can find the maximum spanning tree, and for languages in which dependency relations do cross, the Chu-Liu-Edmonds algorithm can. Given a score s(i, j) for the edge from node i to node j of a graph, these algorithms find the spanning tree that maximizes the sum of the edge scores. Accordingly, with i as the head word and j as the dependent word, the score s(i, j) is defined from the marginal probabilities as in equation (11) below.

By using the scores computed in this way with the Eisner algorithm or the Chu-Liu-Edmonds algorithm, the solution search unit 13 can find the optimal dependency structure.
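The effect of the tree constraint can be illustrated without implementing the Eisner or Chu-Liu-Edmonds algorithm. The sketch below substitutes a brute-force search that enumerates every head assignment for a very short sentence and keeps the highest-scoring one that forms a tree rooted at the artificial node 0. Equation (11) is not reproduced in this text, so the edge score is assumed here to be the logarithm of the marginal probability, with a small constant `eps` added to avoid log 0. The function names and marginal values are illustrative, and the approach is exponential in sentence length.

```python
import itertools
import math
from typing import Dict, List, Sequence


def is_tree(heads: Sequence[int]) -> bool:
    """True if every word reaches node 0 by following heads without a cycle."""
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def best_tree_brute_force(marginals: List[Dict[int, float]],
                          eps: float = 1e-6) -> List[int]:
    """Return the tree maximizing the sum of log marginals (assumed score)."""
    n = len(marginals)
    options = [[h for h in range(0, n + 1) if h != t] for t in range(1, n + 1)]
    best, best_score = None, -math.inf
    for heads in itertools.product(*options):
        if not is_tree(heads):
            continue
        score = sum(math.log(marginals[t].get(h, 0.0) + eps)
                    for t, h in enumerate(heads))
        if score > best_score:
            best, best_score = heads, score
    return list(best)


# Hypothetical marginals for three words; the per-word maxima form a cycle.
marginals = [{2: 0.6, 0: 0.3, 3: 0.1},
             {1: 0.5, 0: 0.4, 3: 0.1},
             {2: 0.7, 0: 0.2, 1: 0.1}]
print(is_tree([2, 1, 2]))
print(best_tree_brute_force(marginals))
```

For real sentences, a polynomial-time algorithm such as those named above would replace the enumeration. The scoring and the tree constraint stay the same.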

The dependency relation label determination unit 14 takes the dependency structure selected by the solution search unit 13 as input and determines the dependency relation label of each word (step S4). Because determining dependency relation labels can be treated as a multi-class classification problem, methods for multi-class classification can be applied. For example, labels can be determined with the maximum entropy method or with the support vector machines described in Reference 3 below, using as features the word forms and parts of speech of the word to be labeled, its neighboring words, and its head word.
Reference 3: 金明 et al., "Frontiers of Statistical Science 10: Statistics of Language and Psychology" (2003), Iwanami Shoten

For example, a labeling model expressed by probability distributions such as equations (12) to (14) below can be defined with a maximum entropy model, and the labels L that maximize this probability can be found.

In equations (12) to (14), ν_j is a parameter of the labeling model, and e_j(W, H, t, l) is a feature concerning the t-th phrase of the phrase sequence W, with heads H, having the dependency relation label l.

The output unit 15 outputs to the user the dependency structure analysis result produced by the dependency relation label determination unit 14, in which the dependency structure and the dependency relations of the input sentence have been identified (step S5), and the dependency structure analysis processing ends.
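The labeling step with a maximum entropy model can be sketched as follows. Equations (12) to (14) are not reproduced in this text, so the sketch assumes the usual log-linear form, with the probability of label l for phrase t proportional to exp(Σ_j ν_j e_j(W, H, t, l)), and assumes that the labels of different phrases are chosen independently once H is fixed. The function names, the label set, the features, and the weights are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

LabelFeature = Callable[[Sequence[str], Sequence[int], int, str], float]  # e_j


def label_distribution(words: Sequence[str], heads: Sequence[int], t: int,
                       labels: Sequence[str], nu: Sequence[float],
                       e: Sequence[LabelFeature]) -> Dict[str, float]:
    """Assumed model: P(l | W, H, t) as a softmax over weighted features."""
    scores = [math.exp(sum(n_j * e_j(words, heads, t, l)
                           for n_j, e_j in zip(nu, e)))
              for l in labels]
    z = sum(scores)
    return {l: s / z for l, s in zip(labels, scores)}


def assign_labels(words: Sequence[str], heads: Sequence[int],
                  labels: Sequence[str], nu: Sequence[float],
                  e: Sequence[LabelFeature]) -> List[str]:
    """Pick the most probable label for each phrase given fixed heads."""
    result = []
    for t in range(1, len(words) + 1):
        dist = label_distribution(words, heads, t, labels, nu, e)
        result.append(max(dist, key=dist.get))
    return result


# Toy features (illustrative only), with t and heads counted from 1.
features = [
    lambda W, H, t, l: 1.0 if H[t - 1] == 0 and l == "ROOT" else 0.0,
    lambda W, H, t, l: 1.0 if 0 < t < H[t - 1] and l == "SBJ" else 0.0,
    lambda W, H, t, l: 1.0 if 0 < H[t - 1] < t and l == "OBJ" else 0.0,
]
print(assign_labels(["I", "saw", "her"], [2, 0, 2],
                    ["ROOT", "SBJ", "OBJ"], [2.0, 1.0, 1.0], features))
```

Because the heads are already fixed when this step runs, its cost grows only with the number of phrases times the number of labels, which matches the description of labeling as inexpensive post-processing.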

(Parameter estimation processing in the language analysis method of Embodiment 1)
FIG. 3 is a flowchart showing the parameter estimation processing in which the parameters required for dependency structure analysis in the language analysis method of Embodiment 1 (FIG. 1) are learned from training data and stored.

First, the unlabeled phrase-unit model parameter estimation unit 32 uses the parsed corpus stored in the parsed corpus storage unit 31 to estimate the unlabeled phrase-unit model parameters, that is, {μ_i} in equation (7), and stores them in the unlabeled phrase-unit model parameter storage unit 21 (step S11).

Next, the unlabeled sample generation unit 33 for parameter estimation uses the parsed corpus stored in the parsed corpus storage unit 31 and the unlabeled phrase-unit model parameters stored in the storage unit 21 to generate, for each sentence in the parsed corpus, unlabeled samples for parameter estimation from the phrase-level probability distribution over dependency structures Q(H | W) defined by those parameters, and passes them to the unlabeled sentence-unit model parameter estimation unit 34 (step S12).

The unlabeled sentence-unit model parameter estimation unit 34 uses the parsed corpus stored in the parsed corpus storage unit 31 and the unlabeled samples for parameter estimation generated by the sample generation unit 33 to estimate the unlabeled sentence-unit model parameters {λ_k}, and stores them in the unlabeled sentence-unit model parameter storage unit 22 (step S13).

Then, the labeling model parameter estimation unit 35 uses the parsed corpus stored in the parsed corpus storage unit 31 to estimate the parameters of the labeling model and stores them in the labeling model parameter storage unit 23 (step S14), and the parameter estimation processing ends.

(Effects of Embodiment 1)
According to the language analysis method and apparatus of Embodiment 1, in a dependency structure analysis method and system based on Gibbs sampling, which can use arbitrary features of the sentence, applying a maximum spanning tree search when searching for the solution yields a solution that is a valid dependency tree. In addition, because the dependency labels are determined as post-processing after the head of each word has been decided, the dependency relation labels can also be identified efficiently with a small amount of computation.

In the language analysis method and apparatus of Embodiment 2, the estimation of dependency relation labels, which Embodiment 1 performs as post-processing, is performed at the same time as the heads are determined.

(Configuration of Embodiment 2)
FIG. 4 is a schematic configuration diagram of a language analysis apparatus according to Embodiment 2 of the present invention. Elements in common with FIG. 1 of Embodiment 1 are given the same reference numerals.

The language analysis apparatus of Embodiment 2 differs from the configuration of Embodiment 1 in that it has no dependency relation label determination unit 14 and handles label information in both the phrase-unit model and the sentence-unit model. That is, it consists of an analysis unit 10A, a model storage unit 20A, and a parameter estimation unit 30A, each configured differently from Embodiment 1.

The analysis unit 10A has an input unit 11 identical to that of Embodiment 1; a labeled dynamic sample generation unit 12A, connected to the input unit 11 and configured differently from Embodiment 1; a solution search unit 13, connected to the sample generation unit 12A and identical to that of Embodiment 1; and an output unit 15, connected to the solution search unit 13 and identical to that of Embodiment 1. The model storage unit 20A differs in configuration from Embodiment 1 and has a first storage unit 21A (for example, a labeled phrase-unit model parameter storage unit) and a second storage unit 22A (for example, a labeled sentence-unit model parameter storage unit). The parameter estimation unit 30A has a parsed corpus storage unit 31 identical to that of Embodiment 1, together with a labeled phrase-unit model parameter estimation unit 32A, a labeled sample generation unit 33A for parameter estimation, and a labeled sentence-unit model parameter estimation unit 34A, each configured differently from Embodiment 1. The corpus storage unit 31, the parameter estimation units 32A and 34A, and the sample generation unit 33A are connected to one another.

In the analysis unit 10A, the labeled dynamic sample generation unit 12A uses the information stored in the labeled phrase-unit model parameter storage unit 21A and the labeled sentence-unit model parameter storage unit 22A to generate, for the sentence entered through the input unit 11, a large number of samples of labeled dependency trees from the probability distribution defined by these model parameters. The solution search unit 13, connected to the output side of the labeled dynamic sample generation unit 12A, uses the generated samples to determine the labeled dependency structure of the input sentence and outputs it to the output unit 15. The rest of the configuration is essentially the same as in Embodiment 1.

(Language analysis method of Embodiment 2)
FIG. 5 is a flowchart showing the dependency structure analysis processing in which the language analysis apparatus of FIG. 4 analyzes the dependency structure of a sentence entered by the user and outputs the result. Elements in common with FIG. 2 of Embodiment 1 are given the same reference numerals.

The dependency structure analysis processing of Embodiment 2 consists of input of the sentence to be analyzed (step S1), as in Embodiment 1; dynamic generation of labeled samples for the input sentence (step S2A) and search for the optimal labeled dependency structure (step S3A), both of which differ from Embodiment 1; and output of the result (step S5), as in Embodiment 1.

FIG. 6 is a flowchart showing the parameter estimation processing in which the parameters required for dependency structure analysis in the language analysis method of Embodiment 2 (FIG. 4) are learned from training data and stored. It corresponds to FIG. 3 of Embodiment 1.

Unlike Embodiment 1, the parameter estimation processing of Embodiment 2 consists of estimation of the labeled phrase-unit model parameters (step S11A), generation of labeled samples for parameter estimation (step S12A), and estimation of the labeled sentence-unit model parameters (step S13A).

The processing of FIGS. 5 and 6 in Embodiment 2 differs from the processing of Embodiment 1 in that label information is handled in the phrase-unit model and the sentence-unit model. This difference is explained below.

Embodiment 1 determines only the heads, using phrase-level and sentence-level probability models such as equations (4) to (8), which consider only heads and do not handle label information. In contrast, Embodiment 2 determines heads and labels simultaneously, using the probability models shown in equations (15) to (19) below, which consider dependency relation labels as well as heads.

In equations (15) to (19), Q(H, L | W) is the phrase-level probability distribution over labeled dependency structures, λ_k is a labeled sentence-unit model parameter, Ω(W) is the set of all possible combinations of H and L for a given W, and f_k(W, H, L) is an arbitrary sentence-level feature. Further, μ_i is a labeled phrase-unit model parameter, and g_i(W, t, h, l) is a phrase-level feature concerning the t-th phrase in the phrase sequence W depending on the h-th phrase with the relation labeled l. With this model, sample generation, parameter estimation, and so on can be carried out in the same way as in Embodiment 1 and the earlier proposal.

In FIG. 5, for the phrase sequence W entered through the input unit 11 (step S1), the labeled dynamic sample generation unit 12A uses the labeled phrase-unit model parameters stored in the storage unit 21A and the labeled sentence-unit model parameters stored in the storage unit 22A to dynamically generate, by Gibbs sampling, samples of labeled dependency trees from the sentence-level probability distribution over labeled dependency structures defined by those parameters (step S2A).

Using the S labeled samples {H^(1), L^(1), …, H^(S), L^(S)} generated by the labeled dynamic sample generation unit 12A, the solution search unit 13 searches for and selects the solution considered to be the optimal labeled dependency structure (step S3A). The marginal probabilities defined by equations (20) to (22) below are used here.

Here, P_t(h, l | W) is the probability that the t-th word has h as its head with label l. Using these marginal probabilities, the score s(h, d, l) for the case where the head word is h, the dependent word is d, and the dependency relation label is l is defined as follows.

With the score s(h, d, l) defined in this way, the optimal solution can be found by a maximum spanning tree search, as in Embodiment 1.
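The labeled marginal probabilities can be estimated from samples in the same way as the unlabeled ones. Equations (20) to (22) and the definition of s(h, d, l) are not reproduced in this text, so the sketch below only assumes that P_t(h, l | W) is estimated as the fraction of samples in which word t has head h with label l. It also shows one possible way, an assumption made here, to reduce the labeled case to an unlabeled tree search: keep the best label for each candidate head. The function names and sample values are illustrative.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

LabeledSample = Tuple[Sequence[int], Sequence[str]]  # (H, L)


def labeled_marginals(samples: Sequence[LabeledSample],
                      n: int) -> List[Dict[Tuple[int, str], float]]:
    """Estimate P_t(h, l | W) as the relative frequency of (h, l) for word t."""
    total = len(samples)
    result = []
    for t in range(n):
        counts = Counter((heads[t], labels[t]) for heads, labels in samples)
        result.append({pair: c / total for pair, c in counts.items()})
    return result


def best_label_per_head(marginals_t: Dict[Tuple[int, str], float]
                        ) -> Dict[int, Tuple[str, float]]:
    """For one word, keep the most probable label for each candidate head."""
    best: Dict[int, Tuple[str, float]] = {}
    for (h, l), p in marginals_t.items():
        if h not in best or p > best[h][1]:
            best[h] = (l, p)
    return best


# Four hypothetical labeled samples for a three-word sentence.
samples = [([2, 0, 2], ["SBJ", "ROOT", "OBJ"]),
           ([2, 0, 2], ["SBJ", "ROOT", "ADV"]),
           ([0, 1, 2], ["ROOT", "OBJ", "OBJ"]),
           ([2, 0, 2], ["SBJ", "ROOT", "OBJ"])]
marginals = labeled_marginals(samples, n=3)
print(best_label_per_head(marginals[2]))
```

Once each candidate edge carries its best label and probability, the tree search proceeds over edges as before, and the labels of the selected edges form L.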

(Effects of Embodiment 2)
Embodiment 2 requires more computation than Embodiment 1, but because it determines the heads and identifies the dependency relation labels at the same time, it can use features that combine both kinds of information. That is, the head phrase and the label can be handled in an integrated manner, which enables dependency structure analysis with high accuracy.

The language analysis method and apparatus of Embodiment 3 combine Embodiments 1 and 2: the estimation of label information that Embodiment 2 performs in the phrase-unit model is instead performed as a separate process using a separately prepared labeling model.

(Configuration of Embodiment 3)
FIG. 7 is a schematic configuration diagram of a language analysis apparatus according to Embodiment 3 of the present invention. Elements in common with FIGS. 1 and 4 of Embodiments 1 and 2 are given the same reference numerals.

The language analysis apparatus of Embodiment 3 differs from the configuration of Embodiment 2 in that the phrase-unit model does not handle label information; label information is handled instead by a separately prepared labeling model. That is, the apparatus consists of an analysis unit 10A identical to that of Embodiment 2, and a model storage unit 20B and a parameter estimation unit 30B that differ from the configuration of Embodiment 2.

The model storage unit 20B has a first storage unit 21 similar to that of the first embodiment (for example, a phrase unit model parameter storage unit without label information), a third storage unit 22A similar to that of the second embodiment (for example, a sentence unit model parameter storage unit with label information), and a second storage unit 23 similar to that of the first embodiment (for example, a labeling model parameter storage unit). The parameter estimation unit 30B has a parsed corpus storage unit 31 and a phrase unit model parameter estimation unit 32 without label information, both similar to those of the first embodiment; a labeled sample generation unit 33A for parameter estimation and a sentence unit model parameter estimation unit 34A with label information, both similar to those of the second embodiment; and a labeling model parameter estimation unit 35 similar to that of the first embodiment. The corpus storage unit 31, the parameter estimation units 32, 34A, and 35, and the sample generation unit 33A are connected to one another.

In the analysis unit 10A, the labeled dynamic sample generation unit 12A takes the sentence input from the input unit 11 and, using the model parameter information stored in the phrase unit model parameter storage unit 21 without label information, the sentence unit model parameter storage unit 22A with label information, and the labeling model parameter storage unit 23, generates a large number of samples of labeled dependency structure trees from the probability distribution defined by these model parameters.

In the parameter estimation unit 30B, the labeled sample generation unit 33A for parameter estimation uses the corpus stored in the parsed corpus storage unit 31, the phrase unit model parameters without label information stored in the storage unit 21, and the labeling model parameters stored in the labeling model parameter storage unit 23. From the probability distribution defined by these parameters, it generates samples of labeled dependency structures for each sentence in the corpus and supplies the generated samples to the sentence unit model parameter estimation unit 34A with label information. The labeling model parameter estimation unit 35 obtains the parameters of the labeling model using the corpus stored in the parsed corpus storage unit 31 and stores the result in the labeling model parameter storage unit 23. The rest of the configuration is substantially the same as in the first and second embodiments.

(Language analysis method of Example 3)
In the language analysis method of the third embodiment, the flowchart of the dependency structure analysis process, which runs from analyzing the dependency structure of a sentence input by the user to outputting the result, is identical to FIG. 5 of the second embodiment.

FIG. 8 is a flowchart showing the parameter estimation process in the language analysis method of FIG. 7 of the third embodiment, in which the parameters needed for dependency structure analysis are learned from training data and stored. It corresponds to FIG. 3 of the first embodiment and FIG. 6 of the second embodiment.

The parameter estimation process of the third embodiment executes the following: estimation of the phrase unit model parameters without label information (step S11), as in the first embodiment; estimation of the labeling model parameters (step S14A), which differs from the first embodiment; and generation of labeled samples for parameter estimation (step S12A) and estimation of the sentence unit model parameters with label information (step S13A), as in the second embodiment.

The process of FIG. 8 in the third embodiment differs from the process of the second embodiment in that the phrase unit model does not handle label information and label information is instead handled by a separately prepared labeling model. This point is described below.

The second embodiment uses a phrase unit model that considers the dependency destination and the dependency relation label simultaneously, as in equations (17) and (19). Because such a model can handle the dependency destination and its label in an integrated manner, it can be expected to analyze with high accuracy. On the other hand, every combination of dependency destination and label must be considered, so the amount of calculation increases.

Therefore, in the third embodiment, the sentence unit model considers the dependency destination and the label simultaneously, as in the second embodiment, while the phrase unit model considers only the dependency destination, as in the first embodiment. The probability model is defined by the following equations (23) to (25).

In equations (23) to (25), Q(H|W) is the probability distribution of phrase-unit unlabeled dependency structures defined by equation (6), P(L|W,H) is the labeling model defined by equation (12), λ_k are the sentence unit model parameters with label information, Ω(W) is the set of all possible combinations of H and L given W, and f_k(W,H,L) is an arbitrary sentence-unit feature. When this model is used, sample generation, parameter estimation, and so on can be performed in the same way as in the first embodiment and in the earlier proposal mentioned above.
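As an illustration of how the quantities named above could fit together, the following minimal sketch assumes that the model multiplies Q(H|W), P(L|W,H), and an exponential of the weighted sentence-unit features, and then normalizes over Ω(W). That product form, the per-phrase factorization of Q and P, the toy probabilities, and every name in the code (`Q_HEAD`, `P_LABEL`, `same_label`, `log_weight`, `distribution`) are assumptions made for this example and are not taken from the patent.

```python
import math
from itertools import product

# Toy inputs (all hypothetical). Phrases 0 and 1 choose a head; phrase 2 is the root.
Q_HEAD = {0: {1: 0.6, 2: 0.4}, 1: {2: 1.0}}       # assumed factor q(h_t | W, t) of Q(H|W)
P_LABEL = {(0, 1): {"D": 0.7, "P": 0.3},          # assumed factor p(l_t | W, t, h_t) of P(L|W,H)
           (0, 2): {"D": 0.5, "P": 0.5},
           (1, 2): {"D": 0.9, "P": 0.1}}


def same_label(heads, labels):
    """Example sentence-unit feature f_k(W, H, L): 1 if all labels agree, else 0."""
    return 1.0 if len(set(labels)) == 1 else 0.0


def log_weight(heads, labels, lam, features):
    """Unnormalized log score: log Q(H|W) + log P(L|W,H) + sum_k lam_k * f_k."""
    score = 0.0
    for t, (h, l) in enumerate(zip(heads, labels)):
        score += math.log(Q_HEAD[t][h]) + math.log(P_LABEL[(t, h)][l])
    score += sum(w * f(heads, labels) for w, f in zip(lam, features))
    return score


def distribution(lam, features):
    """Enumerate Omega(W) for the toy sentence and normalize the weights."""
    omega = []
    for heads in product(*(Q_HEAD[t].keys() for t in sorted(Q_HEAD))):
        label_sets = [P_LABEL[(t, h)].keys() for t, h in enumerate(heads)]
        for labels in product(*label_sets):
            omega.append((heads, labels))
    weights = [math.exp(log_weight(h, l, lam, features)) for h, l in omega]
    z = sum(weights)                               # normalization constant over Omega(W)
    return {hl: w / z for hl, w in zip(omega, weights)}


baseline = distribution([0.0], [same_label])       # lambda = 0 reduces to Q * P
reweighted = distribution([0.5], [same_label])     # sentence-unit feature switched on
```

Under these assumptions, setting every λ_k to zero leaves the product Q(H|W)·P(L|W,H) unchanged, and a positive weight shifts probability toward structures for which the sentence-unit feature fires.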

In the analysis unit 10A, for the phrase sequence W input through the input unit 11, the labeled dynamic sample generation unit 12A uses the phrase unit model parameters without label information stored in the storage unit 21, the sentence unit model parameters with label information stored in the storage unit 22A, and the labeling model parameters stored in the labeling model parameter storage unit 23. From the probability distribution of sentence-unit dependency structures defined by these parameters, it dynamically generates samples of labeled dependency structure trees by Gibbs sampling. Based on the generated samples, the solution search unit 13 and the output unit 15 perform the same processing as in the second embodiment.
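The sampling step described above can be sketched roughly as follows. This is a simplified single-site Gibbs sampler, not the patent's procedure: it resamples the (head, label) pair of one phrase at a time while keeping the structure a tree, and then estimates arc marginals from sample frequencies. The scoring function `toy_log_weight`, the label inventory, the fixed sentence-final root, and all function names are hypothetical, and constraints specific to Japanese (such as heads lying to the right) are not imposed.

```python
import math
import random
from collections import Counter


def is_tree(heads):
    """heads[t] is the head of phrase t, or -1 for the root; reject cycles."""
    for t in range(len(heads)):
        seen, node = set(), t
        while node != -1:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


def gibbs_samples(n, label_set, log_weight, num_samples, burn_in=50, seed=0):
    """Draw labeled trees by resampling one phrase's (head, label) at a time."""
    rng = random.Random(seed)
    heads = [t + 1 for t in range(n - 1)] + [-1]   # start: each phrase -> right neighbour
    labs = [label_set[0]] * n
    samples = []
    for sweep in range(burn_in + num_samples):
        for t in range(n - 1):                     # the final phrase stays the root
            cands, weights = [], []
            for h in range(n):
                trial_h = heads[:t] + [h] + heads[t + 1:]
                if h == t or not is_tree(trial_h):
                    continue
                for l in label_set:
                    trial_l = labs[:t] + [l] + labs[t + 1:]
                    cands.append((h, l))
                    weights.append(math.exp(log_weight(trial_h, trial_l)))
            heads[t], labs[t] = rng.choices(cands, weights=weights)[0]
        if sweep >= burn_in:
            samples.append((tuple(heads), tuple(labs)))
    return samples


def arc_marginals(samples):
    """Relative frequency of each (dependent, head, label) arc among the samples."""
    counts = Counter()
    for heads, labs in samples:
        for t, (h, l) in enumerate(zip(heads, labs)):
            if h != -1:
                counts[(t, h, l)] += 1
    return {arc: c / len(samples) for arc, c in counts.items()}


def toy_log_weight(heads, labs):
    """Hypothetical score: short arcs, label "D", and agreeing labels are preferred."""
    score = 0.0
    for t, (h, l) in enumerate(zip(heads, labs)):
        if h != -1:
            score += -0.5 * abs(h - t) + (0.3 if l == "D" else 0.0)
    return score + (0.5 if len(set(labs[:-1])) == 1 else 0.0)


samples = gibbs_samples(4, ["D", "P"], toy_log_weight, num_samples=200)
marginals = arc_marginals(samples)
```

Marginals estimated this way could serve as the arc scores that a solution search (for example, a maximum spanning tree search) consumes; that search is not implemented here.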

(Effect of Example 3)
According to the third embodiment, the phrase unit model does not need to consider label information, so the amount of calculation required is reduced, while the sentence unit model does consider label information, so dependency structure analysis can be performed with high accuracy. In other words, the phrase unit model only determines the dependency destination and label identification is handled by a separate model, whereas the sentence unit model determines the dependency destination and identifies the label simultaneously. This keeps the amount of calculation down while allowing highly accurate dependency structure analysis.
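The saving can be made concrete with a rough count, offered only as an illustration. It assumes that a joint phrase unit model scores every (dependency destination, label) pair for a phrase, while the separated design scores the candidate destinations once and then the labels once for the chosen destination. The function name and this counting scheme are assumptions and do not appear in the patent.

```python
def candidates_per_phrase(num_heads, num_labels):
    """Return (joint, separate) candidate counts for one phrase.

    joint:    every (head, label) combination is scored together.
    separate: heads are scored by the phrase unit model, then labels by the
              labeling model for the selected head.
    """
    joint = num_heads * num_labels
    separate = num_heads + num_labels
    return joint, separate


# Hypothetical sizes: 10 candidate heads and 5 dependency relation labels.
joint, separate = candidates_per_phrase(10, 5)
```

With these hypothetical sizes the joint model scores 50 candidates per phrase and the separated design scores 15, and the gap widens as the label inventory grows.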

(Modification)
The present invention is not limited to the illustrated embodiments. The language analysis apparatus and the language analysis method may be used in various forms and modified with configurations, processing procedures, and the like other than those illustrated.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic configuration diagram of a language analysis apparatus according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the dependency structure analysis process of FIG. 1.
FIG. 3 is a flowchart showing the parameter estimation process of FIG. 1.
FIG. 4 is a schematic configuration diagram of a language analysis apparatus according to the second embodiment of the present invention.
FIG. 5 is a flowchart showing the dependency structure analysis process of FIG. 4.
FIG. 6 is a flowchart showing the parameter estimation process of FIG. 4.
FIG. 7 is a schematic configuration diagram of a language analysis apparatus according to the third embodiment of the present invention.
FIG. 8 is a flowchart showing the parameter estimation process of FIG. 7.

Explanation of symbols

10, 10A: analysis unit
11: input unit
12: unlabeled dynamic sample generation unit
12A: labeled dynamic sample generation unit
13: solution search unit
14: dependency relation label determination unit
15: output unit
20, 20A, 20B: model storage unit
21: phrase unit model parameter storage unit without label information
21A: phrase unit model parameter storage unit with label information
22: sentence unit model parameter storage unit without label information
22A: sentence unit model parameter storage unit with label information
23: labeling model parameter storage unit
30, 30A, 30B: parameter estimation unit
31: parsed corpus storage unit
32: phrase unit model parameter estimation unit without label information
32A: phrase unit model parameter estimation unit with label information
33: unlabeled sample generation unit for parameter estimation
33A: labeled sample generation unit for parameter estimation
34: sentence unit model parameter estimation unit without label information
34A: sentence unit model parameter estimation unit with label information
35: labeling model parameter estimation unit

Claims

A language analysis method in which a computer performs language analysis on an input sentence divided into words or phrases, the method comprising:
a phrase unit model parameter estimation procedure for obtaining phrase unit model parameters of a phrase-unit probability distribution, based on a parsed corpus stored in storage means;
a parameter estimation sample generation procedure for generating samples of a dependency structure tree for each sentence in the corpus from the phrase-unit probability distribution, based on the phrase unit model parameters and the corpus;
a sentence unit model parameter estimation procedure for obtaining sentence unit model parameters of a sentence-unit probability distribution, based on the samples and the corpus;
an input procedure for receiving the input sentence;
a dynamic sample generation procedure for dynamically generating, for the input sentence, samples of sentence-unit dependency structure trees by Gibbs sampling, based on a probability distribution of sentence-unit dependency structures defined by the phrase unit model parameters and the sentence unit model parameters; and
a solution search procedure for obtaining scores based on marginal probabilities from the generated samples and determining an optimal dependency structure using the Eisner method or the Chu-Liu-Edmonds method.

The language analysis method according to claim 1, further comprising:
a dependency relation label determination procedure for, when the determined dependency structure is an unlabeled dependency structure, identifying dependency relation labels for the unlabeled dependency structure using a probability model defined by labeling model parameters.

A language analysis method in which a computer performs language analysis on an input sentence divided into words or phrases, the method comprising:
a phrase unit model parameter estimation procedure with label information for obtaining phrase unit model parameters with label information of a phrase-unit probability distribution, based on a parsed corpus stored in storage means;
a parameter estimation sample generation procedure for generating labeled dependency structure tree samples for each sentence in the corpus from the phrase-unit probability distribution, based on the phrase unit model parameters with label information and the corpus;
a sentence unit model parameter estimation procedure for obtaining sentence unit model parameters with label information of a sentence-unit probability distribution, based on the samples and the corpus;
an input procedure for receiving the input sentence;
a labeled dynamic sample generation procedure for generating samples of sentence-unit dependency structure trees by Gibbs sampling, based on a probability distribution of sentence-unit dependency structures defined by the phrase unit model parameters with label information and the sentence unit model parameters with label information; and
a solution search procedure for determining an optimal dependency structure from the generated samples using a maximum spanning tree search method.

A language analysis method in which a computer performs language analysis on an input sentence divided into words or phrases, the method comprising:
a phrase unit model parameter estimation procedure for obtaining phrase unit model parameters without label information of a phrase-unit probability distribution, based on a parsed corpus stored in storage means;
a labeling model parameter estimation procedure for obtaining labeling model parameters using the corpus;
a labeled sample generation procedure for parameter estimation, for generating labeled dependency structure tree samples for each sentence in the corpus from the phrase-unit probability distribution, based on the phrase unit model parameters without label information, the corpus, and the labeling model parameters;
a sentence unit model parameter estimation procedure for obtaining sentence unit model parameters with label information, based on the samples and the corpus;
an input procedure for receiving the input sentence;
a labeled dynamic sample generation procedure for generating samples of sentence-unit dependency structure trees by Gibbs sampling, based on a probability distribution of sentence-unit dependency structures defined by the phrase unit model parameters without label information and the sentence unit model parameters with label information; and
a solution search procedure for determining an optimal dependency structure from the generated samples using a maximum spanning tree search method.

A language analysis apparatus that performs language analysis on an input sentence divided into words or phrases, the apparatus comprising:
a phrase unit model parameter estimation unit for obtaining phrase unit model parameters of a phrase-unit probability distribution, based on a parsed corpus stored in storage means;
a parameter estimation sample generation unit that generates samples of a dependency structure tree for each sentence in the corpus from the phrase-unit probability distribution, based on the phrase unit model parameters and the corpus;
a sentence unit model parameter estimation unit for obtaining sentence unit model parameters of a sentence-unit probability distribution, based on the samples and the corpus;
an input unit for receiving the input sentence;
a dynamic sample generation unit that dynamically generates, for the input sentence, samples of sentence-unit dependency structure trees by Gibbs sampling, based on a probability distribution of sentence-unit dependency structures defined by the phrase unit model parameters and the sentence unit model parameters; and
a solution search unit that obtains scores based on marginal probabilities from the generated samples and determines an optimal dependency structure using the Eisner method or the Chu-Liu-Edmonds method.

The language analysis apparatus according to claim 5, further comprising:
a first storage unit for storing the phrase unit model parameters; and
a second storage unit for storing the sentence unit model parameters.

The language analysis apparatus according to claim 5 or 6, further comprising:
a dependency relation label determination unit that, when the determined dependency structure is an unlabeled dependency structure, identifies dependency relation labels for the unlabeled dependency structure using a probability model defined by labeling model parameters.

The language analysis apparatus according to claim 7, further comprising:
a third storage unit for storing the labeling model parameters.

A language analysis apparatus that performs language analysis on an input sentence divided into words or phrases, the apparatus comprising:
a phrase unit model parameter estimation unit with label information for obtaining phrase unit model parameters with label information of a phrase-unit probability distribution, based on a parsed corpus stored in storage means;
a parameter estimation sample generation unit that generates labeled dependency structure tree samples for each sentence in the corpus from the phrase-unit probability distribution, based on the phrase unit model parameters with label information and the corpus;
a sentence unit model parameter estimation unit for obtaining sentence unit model parameters with label information of a sentence-unit probability distribution, based on the samples and the corpus;
an input unit for receiving the input sentence;
a labeled dynamic sample generation unit that generates, for the input sentence, samples of sentence-unit dependency structure trees by Gibbs sampling, based on a probability distribution of sentence-unit dependency structures defined by the phrase unit model parameters with label information and the sentence unit model parameters with label information; and
solution search means for determining an optimal dependency structure from the generated samples using a maximum spanning tree search method.

The language analysis apparatus according to claim 9, further comprising:
a first storage unit for storing the phrase unit model parameters with label information; and
a second storage unit for storing the sentence unit model parameters with label information.

A language analysis apparatus that performs language analysis on an input sentence divided into words or phrases, the apparatus comprising:
a phrase unit model parameter estimation unit for obtaining phrase unit model parameters without label information of a phrase-unit probability distribution, based on a parsed corpus stored in a storage unit;
a labeling model parameter estimation unit for obtaining labeling model parameters using the corpus;
a labeled sample generation unit for parameter estimation that generates labeled dependency structure tree samples for each sentence in the corpus from the phrase-unit probability distribution, based on the phrase unit model parameters without label information, the corpus, and the labeling model parameters;
a sentence unit model parameter estimation unit for obtaining sentence unit model parameters with label information of a sentence-unit probability distribution, based on the samples and the corpus;
an input unit for receiving the input sentence;
a labeled dynamic sample generation unit that generates, for the input sentence, samples of sentence-unit dependency structure trees by Gibbs sampling, based on a probability distribution of sentence-unit dependency structures defined by the phrase unit model parameters without label information and the sentence unit model parameters with label information; and
a solution search unit for determining an optimal dependency structure from the generated samples using a maximum spanning tree search method.

The language analysis apparatus according to claim 11, further comprising:
a first storage unit for storing the phrase unit model parameters without label information;
a second storage unit for storing the labeling model parameters; and
a third storage unit for storing the sentence unit model parameters with label information.