JP3908410B2 - Language analysis apparatus and method, and recording medium

Info

Publication number: JP3908410B2
Application number: JP14908299A
Authority: JP
Inventors: 潔山端
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1999-05-28
Filing date: 1999-05-28
Publication date: 2007-04-25
Anticipated expiration: 2019-05-28
Also published as: JP2000339311A

Description

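[0001]
[Technical Field of the Invention]
The present invention relates to a language analysis apparatus, and more particularly to a language analysis apparatus based on a lexicalized grammar, in which grammar rules are held in the dictionary entries of words.
[0002]
[Prior Art]
A grammar is said to be lexicalized when every basic element of the grammar, in general every syntax tree, is associated with some word. That is, associating grammar rules with words is called lexicalization of the grammar, and a grammar lexicalized in this way is called a lexicalized grammar.
[0003]
As a natural language analysis apparatus based on a lexicalized grammar, Japanese Patent Application Laid-Open No. 7-56921 (hereinafter "Document 1"), for example, describes a sentence analysis system that converts a context-free grammar into a lexicalized form called a "lexicalized context-free grammar (LCFG; Lexicalized Context Free Grammar)" and then analyzes sentences.
[0004]
In addition, the paper entitled "Head Automata and Bilingual Tiling: Translation with Minimal Representations" (hereinafter "Document 2"), published in August 1996 in Proceedings of ICSLP 96, the Fourth International Conference on Spoken Language Processing, Philadelphia, Pennsylvania, 1996, describes an analysis apparatus that performs dependency grammar analysis and syntactic transfer between languages by means of automata attached to words.
[0005]
A method that uses lexicalized automata to make the analysis of lexicalized grammars more efficient is also known. For example, the paper entitled "Grammar Compaction and Computation Sharing in Automaton-based Parsing" (hereinafter "Document 3"), published in April 1998 in Proceedings of the First Workshop on Tabulation in Parsing and Deduction (Paris, France, organized by INRIA), shows a technique that improves analysis efficiency by constructing finite automata that accept the sequence of non-terminal symbols at the leaves of the syntax trees associated with a word, and by merging and minimizing the finite automata attached to the same word.
[0006]
In this method, as described in Section 3.4 of Document 3, what is processed is not the syntax tree itself but a flat string of non-terminal symbols obtained by extracting the leaves of the tree and arranging them in a single row.
[0007]
One finite automaton that accepts this string is created for each syntax tree, and the automata are merged and minimized, so that analysis steps common to several trees are shared and processing becomes more efficient.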
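[0008]
[Problems to Be Solved by the Invention]
However, the lexicalized grammar analysis apparatuses described in the above documents have the problem that the grammar formalism they can handle is fixed throughout the entire system, which makes flexible grammar description difficult.
[0009]
For example, in the analysis apparatus described in Document 1, the grammar formalism that can be analyzed is limited to context-free grammar. This is clear from its configuration, in which the grammar is converted into a lexicalized context-free grammar before the analysis proceeds.
[0010]
As is well known, context-free grammar has the problem that it cannot concisely express freedom of word order.
[0011]
For example, in Japanese, both 「私が夕食をゆっくり食べた」 (watashi-ga yuushoku-wo yukkuri tabeta, "I ate dinner slowly") and 「夕食をゆっくり私が食べた」 (yuushoku-wo yukkuri watashi-ga tabeta, the same meaning with the constituents reordered) are correct sentences, and the analysis must succeed for both.
[0012]
In Japanese, the case elements of a verb, such as the 「が」 (ga) case and the 「を」 (wo) case, and optional modifiers such as 「ゆっくり」 (yukkuri, "slowly") are in general free in word order and may appear in any order.
[0013]
In a context-free grammar, however, the order of the constituents on the right-hand side of a rule is fixed. Such freedom of word order therefore cannot be expressed concisely, and every possible word order has to be expanded into rules in advance.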
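The cost of expanding every word order can be made concrete with a small sketch. The category names `NP_ga`, `NP_wo`, `ADV`, and `V`, and the helper `expand_free_order`, are illustrative assumptions and do not appear in the patent; the sketch only shows that n freely ordered constituents require n! fixed-order rules.

```python
from itertools import permutations

def expand_free_order(lhs, free_constituents, head):
    """Return one fixed-order CFG rule per ordering of the free constituents.

    The head (here the verb) is kept in final position, as in the
    Japanese example sentences; this is an assumption of the sketch.
    """
    return [(lhs, list(order) + [head]) for order in permutations(free_constituents)]

# Hypothetical categories for the "ga" phrase, the "wo" phrase and the adverb.
rules = expand_free_order("S", ["NP_ga", "NP_wo", "ADV"], "V")
for lhs, rhs in rules:
    print(lhs, "->", " ".join(rhs))
```

With three free constituents the sketch already produces six rules, and every additional optional modifier multiplies the count again.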
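[0014]
Similarly, in the analysis apparatus described in Document 2, the grammar formalism that can be analyzed is limited to dependency grammar.
[0015]
Document 3 states that the target grammar formalism is not particularly limited. However, given that the technique constructs finite automata that accept the sequence of non-terminal symbols at the leaves of a syntax tree, the grammar formalisms to which it can be applied are clearly limited, because a constraint is imposed on the sequence of non-terminal symbols. Specifically, since this sequence is accepted by a finite automaton, what is allowed, once parsing is finished, as the leaf non-terminal sequence of a syntax tree headed by a given word is more or less limited to what a finite automaton can accept.
[0016]
In parsing, it has been common practice to control the analysis order on the basis of dynamic programming so that partial solutions are shared and the analysis becomes more efficient. In the configurations of Documents 2 and 3, however, each word has an automaton that also controls the analysis order, and there is no description of how these two kinds of analysis-order control are to be integrated so as to enable efficient analysis based on dynamic programming; this remains unclear.
[0017]
The present invention has been made in view of the above problems, and its object is to provide a language analysis apparatus and method capable of handling even a lexicalized grammar in which the class of grammar attached to each word differs substantially from word to word.
[0018]
Another object of the present invention is to provide a language analysis apparatus and method that allow general lexicalized grammars to be described and accepted without being bound by the constraints of a particular grammar formalism, and that allow efficient overall control of the analysis based on dynamic programming. Other objects and features of the present invention will be readily apparent from the following description.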
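[0019]
[Means for Solving the Problems]
To achieve the above object, the language analysis apparatus of the present invention comprises: a dictionary configured so that a word in the dictionary has a set of syntax trees associated with the word and syntax tree sequence accepting means that accepts syntax tree sequences formed by arranging syntax trees belonging to the set; and analysis means that advances the analysis of an input sentence using the syntax tree sequence accepting means.
[0020]
The method according to the present invention uses a dictionary in which a word has a set of syntax trees associated with the word and a syntax tree sequence accepting unit that accepts syntax tree sequences formed by arranging syntax trees belonging to the set, and includes, when analyzing a sentence input from input means: (a) a step of checking, for each word in the dictionary, whether a syntax tree associated with the word holds;
(b) a step of accepting, for each word in the dictionary, the syntactically permitted ones among the syntax tree sequences formed by the syntax trees found to hold in step (a); and
(c) a step of repeating steps (a) and (b) under control based on dynamic programming.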
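[0021]
[Embodiments of the Invention]
Embodiments of the present invention will now be described. In an embodiment of the present invention, a word dictionary is provided in which each word has a set of syntax trees associated with the word, a syntax tree sequence accepting unit that accepts syntax tree sequences formed by arranging syntax trees belonging to that set, and word information. An analysis unit that advances the analysis of an input sentence performs the analysis using the syntax tree sequence accepting unit corresponding to each word.
[0022]
Thus, in the embodiment of the present invention, each word is provided not only with syntax trees but also with a syntax tree sequence accepting unit that controls how the syntax trees are combined, so that the grammar formalism best suited to building up structure around that word can be selected and used.
[0023]
For example, for the way a Japanese verb takes in its case elements and free modifiers, a grammar formalism that concisely describes free word order can be adopted and implemented as the verb's syntax tree sequence accepting unit. For cases where the word order is fixed, such as prenominal modification of a noun, a simple context-free grammar can just as easily be adopted and implemented as the syntax tree sequence accepting unit.
[0024]
Further, in the embodiment of the present invention, the process of building up a single syntax tree and the process of combining syntax trees are controlled by separate automata. By controlling the analysis order on the basis of dynamic programming, the analysis can proceed efficiently, and the optimal overall solution can be obtained from the partial solutions.
[0025]
Specifically, when one syntax tree has been built up, it is registered in the chart as a partial solution indexed by the state of the syntax tree sequence accepting unit, and partial solutions with the same index are packed together. Sharing of partial solutions and control of the analysis order based on dynamic programming can thus proceed without conflict.
[0026]
Further, in the embodiment of the present invention, the syntax trees associated with a word in the dictionary are configured as automata that accept sequences of final states of syntax tree sequence accepting units, and the syntax tree sequence accepting unit is configured as an automaton that accepts sequences of final states of those syntax trees, which are themselves configured as automata.
[0027]
With this configuration, a syntax tree can be configured as an arbitrary automaton, which allows flexible grammar description.
[0028]
Embodiments of the present invention are described below in more detail with reference to the drawings.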
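[0029]
FIG. 1 is a diagram showing the configuration of a first embodiment of the present invention. Referring to FIG. 1, this natural language analysis apparatus comprises an input unit 1 for inputting a source sentence, a dictionary 2, an analysis unit 3 that analyzes the source sentence while referring to the dictionary 2, a chart 4 in which partial analysis results are registered, and an output unit 5 that outputs the analysis result.
[0030]
FIG. 2 is a diagram showing an example of the contents of the dictionary 2. Referring to FIG. 2, each word in the dictionary 2 comprises a syntax tree pool 21 that stores a set of syntax trees, a syntax tree sequence accepting unit 22 that accepts the syntax tree sequences permitted by the grammar, and word information 23 such as the spelling and syntactic properties of the word.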
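One possible in-memory layout for such a dictionary entry is sketched below. The class names `TreeFragment` and `LexicalEntry`, the state name `q0`, and the encoding of the accepting unit as a transition table are all assumptions made for illustration; the patent does not prescribe a data layout. The sample entry follows the description of the word "he" given for FIG. 9 (empty tree pool, an empty-input move to the final state subject).

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TreeFragment:
    """Depth-1 syntax tree: final-state labels required left and right of self."""
    name: str
    left: tuple = ()
    right: tuple = ()

@dataclass
class LexicalEntry:
    """One word of dictionary 2: word information 23, pool 21, accepting unit 22."""
    word_info: dict
    tree_pool: dict = field(default_factory=dict)    # tree name -> TreeFragment
    transitions: dict = field(default_factory=dict)  # (state, tree name or None) -> state
    start: str = "q0"                                # hypothetical start-state name
    finals: frozenset = frozenset()

# Entry for "he": no trees, and an epsilon move (None) straight to "subject".
he = LexicalEntry(
    word_info={"spelling": "he", "base": "he", "pos": "pronoun"},
    transitions={("q0", None): "subject"},
    finals=frozenset({"subject"}),
)
print(he.transitions[(he.start, None)])
```

Keeping the pool and the transition table inside the entry mirrors the point of the design: everything that decides how a word combines with its neighbours travels with the word.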
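[0031]
FIG. 3 is a flowchart for explaining the operation of the first embodiment of the present invention. The operation of the first embodiment is described with reference to FIGS. 1 to 3.
[0032]
When the source sentence input to the input unit 1 is sent to the analysis unit 3 (step 3-1), the analysis unit 3 refers to the dictionary 2, divides the source sentence into words, loads the dictionary contents of each word from the dictionary 2, and registers them in the chart 4 (step 3-2).
[0033]
Under the control of the analysis unit 3, the source sentence is analyzed on the chart 4 according to dynamic programming: a linear order consistent with the inclusion relation is imposed on all sub-spans, and the process of finding all partial analysis results for a sub-span is repeated while the span of interest moves from smaller spans to larger ones.
[0034]
Specifically, one unprocessed span is selected according to the linear order based on the inclusion relation (step 3-3). If such a span remains (Yes in step 3-4), all partial solutions for this span are found (step 3-5), and processing returns to span selection (step 3-3).
[0035]
When the partial solutions for all spans have been found and no unprocessed span remains (No in step 3-4), the partial solutions that span the entire input are sent to the output unit 5 as the analysis result (step 3-6).
[0036]
The output unit 5 outputs the received analysis result to the outside (step 3-7).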
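The span-selection loop of steps 3-3 to 3-5 can be sketched as follows. The function name `spans_in_processing_order` is an assumption, and the concrete order (right endpoint ascending, then left endpoint descending) is one linear order consistent with inclusion; it was chosen because it reproduces the order [0,1], [1,2], [0,2], [2,3], [1,3], [0,3] used in the worked example for "He eats dinner" later in the description.

```python
def spans_in_processing_order(n):
    """List all spans (i, j) with 0 <= i < j <= n so that every span
    appears after all of the spans it contains."""
    return [(i, j) for j in range(1, n + 1) for i in range(j - 1, -1, -1)]

order = spans_in_processing_order(3)
print(order)
```

Any other order that respects inclusion would serve the same purpose, since all partial solutions inside a span are already available when that span is processed.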
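[0037]
FIG. 4 is a flowchart showing the details of step 3-5 in FIG. 3. The process of finding all partial solutions for a given span (step 3-5) is described in detail with reference to FIG. 4.
[0038]
First, attention is turned to a word that is not contained in an already processed span (step 4-1). If no such word can be selected (No in step 4-2), the process ends.
[0039]
If a word not contained in an already processed span has been selected (Yes in step 4-2), processing of that word starts.
[0040]
First, the state of the word's syntax tree sequence accepting unit is examined, and one syntax tree that can be accepted next is selected (step 4-3).
[0041]
If a syntax tree that can be accepted next has been selected (Yes in step 4-3), whether that syntax tree holds is checked.
[0042]
The leaves of a syntax tree correspond to final states of the syntax tree sequence accepting units of words, and the syntax tree itself is implemented as an automaton that accepts sequences of final states registered on the chart 4.
[0043]
Whether the syntax tree holds is determined by whether this automaton reaches its final state.
[0044]
In step 4-5, the automaton of the selected syntax tree is started and attempts to accept the sequence of final states corresponding to its leaves.
[0045]
If the automaton reaches its final state (Yes in step 4-6), the syntax tree sequence accepting unit accepts the syntax tree, a new edge carrying the new state of the accepting unit is registered in the chart 4 (step 4-8), and processing returns to the selection of another syntax tree acceptable for the original word (step 4-3). This is because, in general, more than one syntax tree may be acceptable in a given state.
[0046]
If the sequence could not be accepted (No in step 4-6), the reason is analyzed, and it is determined whether acceptance could become possible once the analysis proceeds and new edges are created (step 4-7).
[0047]
If acceptance may become possible later (Yes in step 4-7), the syntax tree automaton, whose state transitions have progressed partway, is registered in the chart 4 as an active edge (step 4-9).
[0048]
If there is no prospect of the tree ever holding (No in step 4-7), processing returns to the selection of another acceptable syntax tree (step 4-3).
[0049]
In this way, the analysis proceeds by repeating two processes: the syntax tree sequence accepting unit selects the next acceptable syntax tree, and the automaton of the syntax tree itself checks whether that tree actually holds.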
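[0050]
In simple cases the syntax tree sequence accepting unit 22 is configured as a finite automaton, but it is not limited to this and may be a pushdown automaton or an even more complex automaton. For example, to handle free word order, an automaton equipped with an auxiliary table that manages how many times each transition in a set of state transitions has been taken can be used.
[0051]
The syntax trees of one word and those of another word interact only through the final states of the syntax tree sequence accepting unit 22, so the internal processing may be decided independently for each word.
[0052]
Accordingly, the syntax tree sequence accepting unit 22 can be configured with a pushdown automaton for one word, a finite automaton for another word, and an automaton of yet another type for other words. This is a major feature and advantage of the configuration of the present invention, in which the syntax tree sequence accepting unit 22 is made independent of the acceptance processing of the syntax trees themselves and both are lexicalized and stored word by word.
[0053]
Furthermore, as is clear from the above description, the acceptance of syntax trees by the per-word automaton and the check of whether a syntax tree holds are separated into distinct processes. Regardless of the specific configuration of the syntax trees and of the syntax tree sequence accepting unit 22, the whole analysis can therefore be governed by a dynamic programming procedure (algorithm) using the chart 4, and the analysis can proceed efficiently while partial results are shared.
[0054]
This is made possible for the first time by the configuration of the present invention, in which the syntax tree sequence accepting unit 22 is independent of the acceptance processing of the syntax trees themselves, and it is a major feature and advantage of the present invention.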
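[0055]
FIG. 5 is a diagram for explaining a second embodiment of the present invention. Referring to FIG. 5, the second embodiment of the present invention comprises an input device 101, a data processing device 102 consisting of a computer, an output device 103, a storage device 104, and a recording medium 105 on which a natural language analysis program is recorded. The recording medium 105 is a magnetic disk, magnetic tape, optical disk, semiconductor memory, or other recording medium.
[0056]
The natural language analysis program is read from the recording medium 105 into the main memory of the data processing device 102 and controls the operation of the data processing device 102. Under the control of the natural language analysis program, the data processing device 102 performs the following processing.
[0057]
When an input sentence is read from the input device 101, the analysis unit 103 is started. The analysis unit 103 refers to the dictionary 102, identifies the words in the sentence, loads the corresponding word dictionary entries, and registers them in the chart 4.
[0058]
Once the dictionary entries of all words in the input sentence have been registered in the chart 4, the analysis unit 103 starts the analysis under control based on dynamic programming. The analysis unit 103 selects one word of interest according to the control strategy and starts the syntax tree sequence accepting unit 22 in that word.
[0059]
The syntax tree sequence accepting unit 22 selects one syntax tree that can be accepted next and begins checking whether that syntax tree holds on the chart 4. Because the leaves of a syntax tree are expressed as final states of the syntax tree sequence accepting units of words, the syntax tree holds on the chart 4 if the final states corresponding to its leaves are lined up in that order.
[0060]
If the syntax tree holds, the syntax tree sequence accepting unit 22 accepts it.
[0061]
Once a syntax tree has been accepted, the substructure of the accepted syntax tree is built, and the word is registered in the corresponding span of the chart 4.
[0062]
Next, another acceptable syntax tree is selected and the check of its conditions begins. If the conditions of a syntax tree are not satisfied, processing splits in two depending on the situation. If the conditions may come to be satisfied when the analysis proceeds and the syntax tree sequence accepting unit 22 of another word newly reaches a final state, the tree is registered on the chart 4 as an active edge in mid-analysis. If there is no such possibility, the condition check fails.
[0063]
The analysis proceeds in this way, and if an edge is obtained that spans the whole input and whose syntax tree sequence accepting unit 22 is in a final state, that edge is output from the output unit 103 as the analysis result.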
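[0064]
[Examples]
Examples of the present invention are described next with reference to the drawings.
[0065]
FIG. 6 is a diagram showing the configuration of a first example in which the present invention is applied to a parser for English sentences. Referring to FIG. 6, the apparatus comprises a CPU 7 that performs arithmetic processing, a storage device 8 that stores the dictionary and other data, a keyboard 9 for inputting a source sentence, an input unit 1 that takes the source sentence from the keyboard 9 into the CPU 7, a dictionary 2 stored in the storage device, a chart 4 that records intermediate results of the analysis, a dictionary loader 6 that looks up the input sentence in the dictionary 2 and registers the results in the chart 4, an analysis control unit 3 that performs overall control of the analysis, an output unit 5 that outputs the analysis result to the outside of the CPU 7, and a display device 10, such as a CRT or LCD, that displays the output.
[0066]
Further, as shown in FIG. 6, the dictionary entry of each word in the dictionary 2 comprises a syntax tree pool 21, a syntax tree sequence accepting unit 22 that accepts grammatically correct syntax tree sequences, and word information 23, which is various information including the syntactic properties of the word. The components are connected via data and control lines (communication lines).
[0067]
FIG. 7 is a diagram showing part of the contents of the dictionary 2 for the word "eats" in the first example. In the syntax tree pool 21 of the third-person singular present-tense verb "eats", the following syntax trees directly involving the word are registered: a syntax tree 7a representing the taking in of a direct object, a syntax tree 7b representing free post-modification by an adverb or prepositional phrase, and a syntax tree 7c representing the taking in of a subject. The syntax trees 7a, 7b, and 7c associated with the word "eats" in the syntax tree pool 21 are each a syntax tree (fragment) of depth (level) 1.
[0068]
In each of these syntax trees, the root node and one of its child nodes are specially marked as self.
[0069]
When two syntax trees are given, they are connected by superimposing their self nodes (the root node of one and the leaf node of the other).
[0070]
Alternatively, the word itself may appear at the position of the self node of a syntax tree.
[0071]
Note that when the syntax trees 7a, 7b, and 7c are connected into a larger syntax tree, the path that runs downward from the root self node through the self nodes and ends at the word itself is the spine of the syntax tree headed by that word.
[0072]
Symbols such as subject and dob that appear at the other leaf nodes of a syntax tree correspond to final states of the syntax tree sequence accepting units of other words.
[0073]
A syntax tree can hold when words (that is, edges on the chart) having the final states specified to the left and right of self appear.
[0074]
For example, the syntax tree 7a holds only when an edge whose syntax tree sequence accepting unit has reached the final state dob (meaning direct object) appears immediately after the word.
[0075]
The syntax tree 7b satisfies its condition when the final state postmod (meaning an optional post-modifier) appears immediately after self.
[0076]
These syntax trees are implemented as automata that make state transitions with the final states of syntax tree sequence accepting units as input.
[0077]
The syntax tree sequence accepting unit 22 is configured to accept, among the sequences of these three syntax trees, only
the sequence 7a・7b*・7c
[0078]
Here, the symbol "・" denotes concatenation of syntax trees, and "*" denotes repetition any number of times.
[0079]
That is, the syntax tree sequence accepting unit 22 that accepts the sequence 7a・7b*・7c is configured to accept the syntax tree 7a first, then the syntax tree 7b zero or more times, and finally the syntax tree 7c, whereupon it reaches its final state.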
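The accepting unit for the sequence 7a・7b*・7c can be sketched as a small deterministic finite automaton. The state names `q0` and `q1` and the function `accepts` are assumptions for illustration; the final state name `sentence` follows the label the description later gives to the final state reached by "eats". The sketch checks only the order of accepted trees, not whether each tree holds on the chart.

```python
# Transition table of a hypothetical DFA for the tree sequence 7a . 7b* . 7c
EATS_TRANSITIONS = {
    ("q0", "7a"): "q1",        # direct object first
    ("q1", "7b"): "q1",        # any number of free post-modifiers
    ("q1", "7c"): "sentence",  # subject last, reaching the final state
}

def accepts(sequence, transitions=EATS_TRANSITIONS, start="q0", finals=("sentence",)):
    """Return True if the sequence of tree names drives the automaton to a final state."""
    state = start
    for tree in sequence:
        state = transitions.get((state, tree))
        if state is None:      # no transition: the sequence is rejected
            return False
    return state in finals

print(accepts(["7a", "7b", "7b", "7c"]))
print(accepts(["7c", "7a"]))
```

In the apparatus itself each transition would fire only after the corresponding tree automaton has confirmed that the tree holds; the table above isolates the ordering constraint.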
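[0080]
By being connected through self as described above, a syntax tree sequence is identified with a larger syntax tree.
[0081]
For example, the syntax tree corresponding to the sequence 7a・7b*・7c is the one shown in FIG. 8.
[0082]
In other words, the syntax tree sequence accepting unit 22 of "eats" is configured to accept the syntax tree shown in FIG. 8, and the two representations are equivalent.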
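[0083]
FIG. 9 is a diagram showing part of the dictionary contents of the word "he" in the first example. The syntax tree pool of "he" is empty, and the syntax tree sequence accepting unit 22 moves from the start state to the final state subject on empty (ε) input. The word information stores spelling: "he", base form: "he", and part of speech: pronoun.
[0084]
FIG. 10 is a diagram showing part of the dictionary contents of the word "dinner" in the first example. The syntax tree pool of "dinner" consists of Td1, representing pre-modification by an adjective, Td2, representing pre-modification by a noun, Td3, representing modification by a determiner, and Td4, representing post-modification by a prepositional phrase, and the syntax tree sequence accepting unit is configured to accept
the syntax tree sequence (Td1|Td2)*Td3*{Td4}
[0085]
Here, the symbol "|" denotes OR, and {} denotes that the enclosed element is optional.
[0086]
That is, the syntax tree sequence accepting unit accepts the adjective pre-modification tree Td1 or the noun pre-modification tree Td2 any number of times, then accepts the determiner modification tree Td3 any number of times, and finally accepts the prepositional-phrase post-modification tree Td4 zero times or once, whereupon it reaches its final state.
[0087]
The final state is a superposition of subject, dob, and prepobj.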
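As a quick way to experiment with the pattern (Td1|Td2)*Td3*{Td4}, it can be transcribed into a regular expression over space-terminated tree names. Encoding the tree sequence as a string and using Python's `re` module are assumptions of this sketch, not part of the patent; the helper name `dinner_accepts` is likewise hypothetical.

```python
import re

# (Td1|Td2)* Td3* {Td4}: star is any number of times, braces mean optional.
DINNER_PATTERN = re.compile(r"(?:(?:Td1|Td2) )*(?:Td3 )*(?:Td4 )?")

def dinner_accepts(trees):
    """Return True if the list of tree names matches the whole pattern."""
    encoded = "".join(name + " " for name in trees)
    return DINNER_PATTERN.fullmatch(encoded) is not None

print(dinner_accepts(["Td1", "Td2", "Td3"]))
print(dinner_accepts(["Td3", "Td1"]))
print(dinner_accepts([]))
```

The empty sequence is accepted, which corresponds to the bare noun "dinner" reaching its final state without any modifier tree.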
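[0088]
The analysis processing of the first example is now described concretely.
[0089]
When the sentence "He eats dinner" is typed on the keyboard 9, the CPU 7 is started and the sentence is passed to the input unit 1 via the communication line.
[0090]
The input unit 1 converts the input from the keyboard 9 into a form that the CPU 7 can handle and sends it to the dictionary loader 6 via the communication line.
[0091]
The dictionary loader 6 refers to the dictionary 2 stored in the storage device 8, identifies the word boundaries in the input sentence, loads the dictionary contents of each word, and sends them to the chart 4 via the communication line.
[0092]
The chart 4 registers the dictionary contents on itself word by word.
[0093]
FIG. 11 is a diagram showing the chart 4 immediately after the dictionary has been loaded.
[0094]
With the start of the chart 4 as position 0 and its end as position 3, "he" is registered from position 0 to 1 (e1), "eats" from position 1 to 2 (e2), and "dinner" from position 2 to 3 (e3). As shown in FIG. 11, each word is accompanied by its syntax tree sequence accepting unit in the start state. In the following, a double line around a state of an automaton indicates that the state is the current state. The syntax tree pool and the word information are omitted from FIG. 11 for simplicity, but they are attached to each word in the same way as the syntax tree sequence accepting unit.
[0095]
When registration in the chart 4 is finished, the analysis unit 3 is started.
[0096]
Under the control of the analysis unit 3, the analysis is performed basically in the same way as left-to-right bottom-up chart parsing for context-free grammar.
[0097]
The order in which sub-spans are selected for analysis and the use of active and inactive edges to advance the analysis are the same as for context-free grammar. In the context-free case, however, the partial analysis results registered in a sub-span of the chart are distinguished by their non-terminal symbols, whereas in the present invention they are distinguished by the state of the automaton of the word's syntax tree sequence accepting unit.
[0098]
Furthermore, the grammar rule to be applied next to a partial analysis result (inactive edge) is not selected from a global pool of grammar rules on condition that non-terminal symbols match, but is determined by the automaton of the syntax tree sequence accepting unit attached to that edge.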
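[0099]
First, all partial solutions are found for the first unprocessed span [0,1].
[0100]
To this end, attention is turned to the word "he" in this span, and the syntax tree sequence accepting unit of "he" is started.
[0101]
As described above, the syntax tree sequence accepting unit of "he" moves to its final state on empty (ε) syntax tree input, so the automaton reaches the final state subject without any syntax tree check.
[0102]
The final states of a syntax tree sequence accepting unit are named according to the grammatical function that an edge can have once acceptance is complete; subject indicates that an edge with this final state can function as a subject.
[0103]
Since acceptance of the syntax tree sequence is now complete, a new edge whose syntax tree sequence accepting unit is in the final state is created and registered in the span [0,1] (e4).
[0104]
This new edge has the same information as "he" immediately after dictionary loading, except that its syntax tree sequence accepting unit is in the final state.
[0105]
No further analysis is possible in the span [0,1], so the analysis unit 3 moves the span of interest to [1,2] and starts finding all partial solutions for this span.
[0106]
The span [1,2] contains only the word "eats" (strictly, the edge e2), so the syntax tree sequence accepting unit of "eats" is started first.
[0107]
The syntax tree sequence accepting unit of "eats" is configured to accept the sequence 7a・7b*・7c, and the only syntax tree acceptable immediately after the start state is 7a, so whether the syntax tree 7a holds is checked first.
[0108]
The condition for the syntax tree 7a is that an edge with the final state dob exists to the right of self. At this point no such edge has been registered, so the syntax tree 7a does not hold. Such an edge may, however, be registered later as the analysis proceeds to the right, so the analysis unit 3 creates from the syntax tree 7a an active edge indicating that the next final state to be matched is dob, and registers it in the span [1,2].
[0109]
Next, the analysis unit 3 takes [0,2] as the span of interest and tries to find all partial analysis results in it, but no syntax tree newly holds within this span, so it proceeds to the next step.
[0110]
The analysis unit 3 takes [2,3] as the span of interest and finds all partial analysis results in it. It shifts focus to "dinner" (strictly, the edge e3), the only word in this span, and starts accepting syntax trees.
[0111]
The syntax tree sequence accepting unit of e3 is a finite automaton configured to accept (Td1|Td2)*Td3*{Td4}.
[0112]
Any of Td1, Td2, Td3, and Td4 is permitted as the first syntax tree to be accepted, but none of them is in a situation in which it can hold.
[0113]
On the other hand, the syntax tree sequence accepting unit can also accept the empty syntax tree and move to its final state, and by doing so it moves to the final state (dob|sub|prepobj).
[0114]
This final state is a superposition of sub, dob, and prepobj, expressing that the edge can have the grammatical function of a direct object, a subject, or the object of a preposition. An edge with this final state is registered in the span [2,3] as the edge e5.
[0115]
When processing of the span [2,3] is finished, a new partial analysis result has been registered, so processing moves on to completion.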
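[0116]
Completion checks whether a newly registered edge can be matched against a previously registered active edge and, if so, creates a new edge.
[0117]
Here, the active edge created from the syntax tree 7a is registered in the span [1,2] and is waiting for the final state dob, so the analysis unit 3 restarts the automaton of the syntax tree 7a. This automaton reaches the final state of the syntax tree itself by accepting the edge e5, which has the final state dob.
[0118]
As a result, the syntax tree 7a is accepted, and the analysis unit 3 registers in the chart 4 an edge e6 in which the state of the syntax tree sequence accepting unit of "eats" has advanced by one. Its contents are basically the same as those of the word "eats", differing only in that the state of the syntax tree sequence accepting unit is the state immediately after acceptance of the syntax tree 7a and the syntax tree to be accepted next is 7b or 7c. FIG. 12 is a diagram showing the state of the chart at this point.
[0119]
Next, the analysis unit 3 starts processing the span [1,3]. The edge e6 registered by the immediately preceding completion is in this span, so the analysis unit shifts focus to it, starts its syntax tree sequence accepting unit, and begins checking whether the next syntax tree can be accepted.
[0120]
At e6, the syntax tree that can be accepted next is 7b or 7c. The syntax tree 7b satisfies its condition if a word with the final state postmod, a free modifier, is on the right, and 7c satisfies its condition if a word that has reached the final state subj is on the left. Here, the edge e4 with the final state subject exists immediately before, so the condition of the syntax tree 7c is satisfied; the tree is accepted, and a new edge e7 is registered in the span [0,3].
[0121]
The syntax tree sequence accepting unit of e7 is in the final state (sentence). FIG. 13 is a diagram showing the chart in this state.
[0122]
The syntax tree shown in FIG. 13 is assembled from the sequence of syntax trees accepted by the syntax tree sequence accepting unit attached to "eats" in the course from e2 to e7.
[0123]
An edge that spans the entire chart and whose syntax tree sequence accepting unit has reached a final state has thus been created, so parsing has succeeded. The solution is the syntax tree sequence accepted by this edge.
[0124]
As described earlier, a syntax tree sequence can be combined into a single syntax tree by superimposing the self nodes, and the syntax tree so formed is equivalent to the solution. FIG. 13 shows the latter form.
[0125]
The output unit 5 receives this syntax tree and converts its data format into a form that can be displayed on the screen of the display device 10. When the conversion result is sent to the display device 10 via the communication line, the display device 10 shows the received result to the user, and the analysis ends.
[0126]
Although the situation did not arise in the above example, when edges carrying the same information relevant to subsequent analysis are registered in the same span, packing them together allows the analysis to proceed efficiently and to exploit the advantages of dynamic programming.
[0127]
Here, "the same information relevant to subsequent analysis" means that the syntax tree sequence accepting automata attached to the edges are exactly the same, including their current states. That is, when an edge is about to be registered and an edge with an equivalent syntax tree sequence accepting automaton in an equivalent state is already registered in the same span, the new edge is not registered separately in the chart but is packed into the existing edge, and subsequent analysis is shared. In this way, efficient analysis algorithms based on dynamic programming that are well known for context-free grammar, such as chart parsing and the CKY method, can be used as they are. The only difference is that in the context-free case this condition is judged by whether non-terminal symbols match, whereas in the present invention it is judged by whether the states of the syntax tree sequence accepting automata match. Considering that the language analysis apparatus of the present invention can accept a far wider range of grammar formalisms than context-free grammar, this is a major feature of the present invention.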
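The packing rule can be sketched as a chart keyed on span and accepting-automaton state. The class name `Chart`, its method names, and the use of the head word as a stand-in for "the same accepting automaton" are assumptions for illustration only; the derivation labels in the usage lines are hypothetical.

```python
from collections import defaultdict

class Chart:
    """Toy chart that packs edges sharing span, accepting automaton and state."""

    def __init__(self):
        self._cells = defaultdict(list)  # key -> list of packed derivations

    def add(self, start, end, word, state, derivation):
        """Register an edge; return True only if it is new to the chart.

        The word identifies the accepting automaton (an assumption of this
        sketch); edges with the same key are packed instead of duplicated.
        """
        key = (start, end, word, state)
        is_new = key not in self._cells
        self._cells[key].append(derivation)
        return is_new

    def edges(self):
        return list(self._cells)

    def derivations(self, start, end, word, state):
        return list(self._cells.get((start, end, word, state), []))

chart = Chart()
first = chart.add(1, 3, "eats", "after_7a", "derivation A")   # hypothetical labels
second = chart.add(1, 3, "eats", "after_7a", "derivation B")
print(first, second, len(chart.edges()))
```

Only an edge for which `add` returns True needs to trigger further analysis, which is where the saving from dynamic programming comes from.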
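[0128]
FIG. 14 is a diagram showing the contents of a word dictionary in a second example, in which the present invention is embodied as a Japanese analysis apparatus. The configuration of the second example is the same as that of the first example. Referring to FIG. 6, it comprises a CPU 7 that performs arithmetic processing, a storage device 8 that stores the dictionary and other data, a keyboard 9 for inputting a source sentence, an input unit 1 that takes the source sentence from the keyboard 9 into the CPU 7, a dictionary 2 stored in the storage device, a chart 4 that records intermediate results of the analysis, a dictionary loader 6 that looks up the input sentence in the dictionary 2 and registers the results in the chart 4, an analysis unit 3 that performs overall control of the analysis, an output unit 5 that outputs the analysis result to the outside of the CPU 7, and a display device 10 that displays the output.
[0129]
Further, the dictionary entry of each word in the dictionary 2 comprises a syntax tree pool 21, a syntax tree sequence accepting unit 22 that accepts grammatically correct syntax tree sequences, and word information 23, which is various information including the syntactic properties of the word. The components are connected via communication lines.
[0130]
The second example differs from the first example in that, to implement the analysis apparatus as a Japanese analysis apparatus, the word dictionary contents are changed to Japanese ones as shown in FIG. 14. The language is not merely changed; the configuration of the automaton of the syntax tree sequence accepting unit is also changed to suit the analysis of Japanese. The second example of the present invention is described below with reference to FIG. 14.
[0131]
FIG. 14 shows part of the dictionary contents of the verb 「食べる」 (taberu, "eat"). The syntax tree pool consists of a single syntax tree, Td, which takes in a preceding element that modifies the predicate.
[0132]
The fact that 「食べる」 has the obligatory case pattern 「〜が〜を食べる」 ("X-ga Y-wo taberu") is expressed not as a syntax tree but in an auxiliary table (acceptance count constraint table) provided in the syntax tree sequence accepting unit.
[0133]
The syntax tree sequence accepting unit is an automaton whose state is the product of the state of a skeleton finite automaton and the state of the acceptance count constraint table. Each time the finite automaton accepts Td, the corresponding "count" field of the acceptance count constraint table is incremented.
[0134]
When a "count" field no longer satisfies the constraint in its "count constraint" field, processing ends as an acceptance failure.
[0135]
With this mechanism, the structural constraint that the word 「食べる」 can take in the particle 「が」 once, the particle 「を」 once, and other optional modifiers any number of times, in any order, can be expressed concisely.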
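The count-constrained accepting unit can be sketched as follows. The table layout, the labels `ga`, `wo`, and `mod`, and the function name `accepts_taberu` are assumptions; the patent shows the table only in FIG. 14. The sketch rejects as soon as an upper bound is exceeded and, as a further assumption, checks lower bounds once the whole sequence has been consumed.

```python
# Hypothetical acceptance count constraint table: kind -> (minimum, maximum or None)
TABERU_LIMITS = {"ga": (1, 1), "wo": (1, 1), "mod": (0, None)}

def accepts_taberu(kinds, limits=TABERU_LIMITS):
    """Accept a sequence of taken-in elements in any order, subject to counts.

    Every element is taken in by the same tree Td; only the counts differ.
    """
    counts = {kind: 0 for kind in limits}
    for kind in kinds:
        if kind not in limits:
            return False
        counts[kind] += 1                       # increment the "count" field
        upper = limits[kind][1]
        if upper is not None and counts[kind] > upper:
            return False                        # constraint violated: fail at once
    return all(counts[k] >= lower for k, (lower, _) in limits.items())

print(accepts_taberu(["ga", "wo", "mod"]))   # order of 私が夕食をゆっくり食べた
print(accepts_taberu(["wo", "mod", "ga"]))   # order of 夕食をゆっくり私が食べた
print(accepts_taberu(["ga", "ga", "wo"]))
```

A single table row per case element replaces the factorial number of fixed-order rules that a plain context-free grammar would need for the same verb.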
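[0136]
On the other hand, for grammatical phenomena in which the word order is fixed, such as the concatenation of a noun and a particle or an auxiliary verb taking in a verb, it is natural, as in English, to express the order in which several syntax trees are accepted by a finite automaton, and the second example of the present invention follows this approach. That the most suitable grammar formalism can be adopted word by word, according to the nature of the grammatical phenomena involving each word, is a major feature of the present invention.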
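[0137]
Several lexicalized grammar formalisms that allow free word order have been proposed. See, for example, the doctoral dissertation "Formal and Computational Aspects of Natural Language Syntax", submitted in 1994 to the Department of Computer Science of the University of Pennsylvania, USA (Formal and Computational Aspects of Natural Language Syntax, PhD dissertation, University of Pennsylvania, USA, 1994) (referred to as "Document 5").
[0138]
UVG-DL (Unordered Vector Grammar with Dominance Links), given there as one example, is a vector grammar whose unit is a tuple (vector) of syntax trees, and it is characterized by dominance links placed between syntax trees.
[0139]
In this grammar, two constraints must be satisfied: the syntax trees in one tuple must be used the same number of times, and syntax trees joined by a dominance link must stand in an ancestor-descendant relation (a higher-lower relation in the syntax tree) that is consistent with the link.
[0140]
With a slight modification of this formalism it is easy to express concisely, for example, the free word order of the case elements of Japanese verbs. However, such a representation is designed specifically to fit the behavior of case elements in languages such as Japanese and German and is not necessarily appropriate for describing the behavior of other words. If the grammar formalism as a whole is chosen so that one particular grammatical phenomenon is easy to describe, it naturally becomes less suitable for describing other phenomena.
[0141]
In contrast, in the present invention each word has its own syntax tree sequence accepting unit, so the effective grammar formalism that the word follows can be changed flexibly and easily, word by word.
[0142]
In the second example of the present invention, the grammar of verbs is implemented by a finite state automaton extended with an application count constraint table, but it could also be implemented with UVG-DL, for example. Relations with other words arise only through the symbols of the final states of the syntax tree sequence accepting unit, so whatever processing is performed before a final state is reached, its effects are confined to that word.
[0143]
Furthermore, as explained for the first example, the check of whether a syntax tree holds (the acceptance of final states by the syntax tree automaton) is separated from the acceptance of syntax tree sequences, so that efficient analysis based on dynamic programming using a chart or the like can be performed regardless of how the syntax tree sequence accepting unit is implemented. This too is a feature of the present invention.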
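[0144]
A third example of the present invention is now described. FIG. 15 is a diagram for explaining the third example of the present invention. A case in which analysis equivalent to the D-Tree Grammar shown in FIG. 15 is performed is described.
[0145]
For details of D-Tree Grammars, see "Parsing D-Tree Grammars", presented at the International Workshop on Parsing Technologies in 1995 (referred to as "Document 5").
[0146]
In the syntax tree shown in FIG. 15, the links between the S nodes and between the VP nodes are drawn as dotted lines. This expresses that such a link is not a fixed link of length 1 but allows zero or more syntax trees to be inserted in between. However, at the two nodes where an inserted syntax tree is connected to this syntax tree, the category must be S or VP (according to the position of insertion).
[0147]
Such a representation is useful for expressing long-distance dependencies such as WH-movement and correlative constructions such as "so that".
[0148]
In the present invention, the same can be achieved by slightly extending the syntax tree pool so that syntax trees are stored in a pushdown stack and propagated.
[0149]
FIG. 16 shows the contents of the word dictionary for realizing FIG. 15. These contents are described in the dictionary entry of the word corresponding to the V at the bottom of FIG. 15.
[0150]
Only the bottommost tree T1 of FIG. 15 is in the syntax tree pool itself; the other two partial trees, above the broken lines, are stored in the pushdown stack. The syntax tree sequence accepting unit is configured to move to its final state after accepting T1 exactly once.
[0151]
When the syntax tree sequence accepting unit reaches its final state, the pushdown stack is taken out of the syntax tree pool and associated with the final state. In this state, the syntax tree sequence accepting unit may continue acceptance by popping a syntax tree from the stack and attempting to accept it, or it may pass the stack unchanged to the syntax tree sequence accepting unit that has taken it in as a child. In either case, the condition is that the non-terminal symbol of the popped syntax tree matches the symbol of the final state. In the latter case, the stack propagates upward, and at some point acceptance of the syntax trees in the stack is attempted.
[0152]
FIG. 17 shows this schematically: after the bottommost VP has grown as a child of another syntax tree (thick chain line), the upper syntax tree in the stack is popped and accepted.
[0153]
Thus, in the configuration of the present invention, syntax trees are placed on a stack and propagated, and are later popped and accepted at the point when acceptance of a syntax tree sequence has finished. Long-distance dominance relations such as those expressed in FIG. 15 can thereby be handled easily.
[0154]
Although single syntax trees are pushed onto the stack here, the scheme can in general be extended so that pairs of a syntax tree pool and a syntax tree sequence accepting unit are pushed as units.
[0155]
FIG. 18 is a diagram for explaining the processing of the correlative construction "so (adjective) that (sentence)". When the source sentence is
"She is so kind that everybody likes her."
the dictionary entry of "so" is given a syntax tree (FIG. 18(b)) that builds up the "that" clause and what follows. This tree is propagated as shown by the arrows on the syntax tree of FIG. 18(a), and its acceptance is triggered when the whole has been built up as a sentence, so that the "that" clause and what follows are built up correctly.
[0156]
The correlation between "so" and the "that" clause is evident from the origin of the syntax tree of the "that" clause.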
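[0157]
Note that, in general, when a context-free grammar is given, an analysis apparatus according to the present invention can be constructed by the following procedure so as to be strongly equivalent to the original grammar.
[0158]
First, for each word, the syntax tree pool attached to that word is constructed as follows.
[0159]
With the word fixed, all rules that derive the word from a preterminal are put into the syntax tree pool. At this time, the root of each rule, in addition to carrying the preterminal symbol, is provisionally given the symbol self.
[0160]
Next, for each non-terminal symbol that is the root of a rule in the syntax tree pool, all rules that have that symbol as a leaf are collected and added to the syntax tree pool. The leaf in question and the root are provisionally given the symbol self. Even for the same rule, if the leaf in question differs, the position of self differs, so the rule is added as a separate rule.
[0161]
This addition is repeated for the non-terminal symbols that are roots of rules in the syntax tree pool until no more rules are added. This completes the construction of the syntax tree pool.
[0162]
The syntax tree sequence accepting unit is a non-deterministic finite state automaton constructed as follows. Each non-terminal symbol that is the root of a rule in the syntax tree pool becomes a state. From the start state, transitions to the states corresponding to preterminals are allowed, and each such transition is configured so that the rule generating the word from the destination preterminal is accepted.
[0163]
Further, the transition from a state 1 to a state 2 is configured so that, when the syntax tree pool contains a syntax tree with state 2 as its root and state 1 as the leaf marked self, that syntax tree is accepted and the transition is made. In addition, a transition from every state to the final state on acceptance of the empty syntax tree is allowed.
[0164]
With the above configuration, an analysis apparatus according to the present invention that accepts exactly the same set of syntax trees as the original context-free grammar can be constructed.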
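The pool-construction steps can be sketched as a closure computation over the rules of a context-free grammar. The rule encoding (a left-hand side plus a tuple of right-hand-side symbols), the triple `(lhs, rhs, self_index)` used to record which leaf is marked self, and the toy grammar are assumptions for illustration.

```python
def build_tree_pool(rules, word):
    """Collect the rules reachable upward from `word`, each with its self leaf.

    rules: list of (lhs, rhs) with rhs a tuple of symbols.
    Returns a set of (lhs, rhs, self_index) triples.
    """
    # Seed: rules deriving the word from a preterminal (self is position 0).
    pool = {(lhs, rhs, 0) for lhs, rhs in rules if rhs == (word,)}
    agenda = [lhs for lhs, _, _ in pool]
    expanded = set()
    while agenda:                      # repeat until no rule is added
        symbol = agenda.pop()
        if symbol in expanded:
            continue
        expanded.add(symbol)
        for lhs, rhs in rules:
            for index, leaf in enumerate(rhs):
                if leaf == symbol:     # same rule, different leaf: separate entry
                    pool.add((lhs, rhs, index))
                    agenda.append(lhs)
    return pool

# Toy grammar, assumed for the example only.
RULES = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")),
    ("V", ("eats",)),
    ("NP", ("he",)),
    ("NP", ("dinner",)),
]
pool = build_tree_pool(RULES, "eats")
for entry in sorted(pool):
    print(entry)
```

The states of the accepting automaton can then be read off the pool: each root symbol is a state, and an entry with root B and self leaf A licenses a transition from A to B.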
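[0165]
The present invention can be implemented with various modifications other than those described above.
[0166]
In the above examples, a phrase structure grammar (PSG) whose unit is the syntax tree was adopted throughout as the grammar described in each word, but a description based on dependency relations may be adopted instead, and the contents of the syntax tree pool and the syntax tree sequence accepting unit can easily be configured in dependency grammar form. The form may also be varied from word to word.
[0167]
It is also easy to configure the syntax tree sequence accepting unit with a pushdown automaton instead of a finite automaton. Furthermore, by adopting a general automaton for the syntax tree sequence accepting unit, the apparatus can easily be configured to accept grammar formalisms of even higher descriptive power.
[0168]
The language analysis apparatus according to the present invention can of course be incorporated into and applied to translation apparatuses, information retrieval apparatuses, speech recognition, and the like. The present invention is not limited to the analysis of natural language and can clearly also be applied easily to artificial languages (formal languages) such as computer programming languages.
[0169]
As described above, the present invention includes various modifications within the scope of its principle.
[0170]
[Effects of the Invention]
As described above, according to the present invention, the information on how syntax trees are combined, which effectively governs the grammar formalism, is held word by word in the form of a syntax tree sequence accepting unit. The effective grammar formalism that a word follows can therefore be changed flexibly for each word, which has the effect of allowing flexible and concise grammar description suited to the behavior of individual words.
[0171]
In general, a grammar formalism with high descriptive power requires a large amount of computation for acceptance, which causes many problems in practical use. According to the present invention, however, such an accepting unit is placed, and its processing actually performed, only for the small number of words that truly need that descriptive power.
[0172]
For most ordinary words a lightweight acceptance process is sufficient, and the present invention allows exactly such a configuration, so the actual amount of computation is expected to be far smaller than in conventional systems, which unify everything under a single grammar formalism.
[0173]
Furthermore, the present invention has the effect of allowing flexible grammar description that is not bound to a particular grammar formalism while, as with context-free grammar, allowing efficient control of the analysis based on dynamic programming.
[0174]
The above effects of the present invention result from separating the check of whether a syntax tree itself holds from the check of acceptance of syntax tree sequences, and they could not be obtained with conventional systems.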
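[Brief Description of the Drawings]
[FIG. 1] A block diagram showing the configuration of a natural language processing apparatus according to the first embodiment of the present invention.
[FIG. 2] A block diagram showing the configuration of the dictionary of the natural language processing apparatus according to the first embodiment of the present invention.
[FIG. 3] A flowchart showing the processing of the natural language processing apparatus according to the first embodiment of the present invention.
[FIG. 4] A flowchart showing the processing of the natural language processing apparatus according to the first embodiment of the present invention.
[FIG. 5] A block diagram showing the configuration of a natural language processing apparatus according to the second embodiment of the present invention.
[FIG. 6] A block diagram showing the configuration of a natural language processing apparatus according to the first example of the present invention.
[FIG. 7] A block diagram showing the configuration of a word dictionary in the natural language processing apparatus according to the first example of the present invention.
[FIG. 8] A diagram showing an example of a syntax tree accepted by the natural language processing apparatus according to the first example of the present invention.
[FIG. 9] A block diagram showing the configuration of a word dictionary in the natural language processing apparatus according to the first example of the present invention.
[FIG. 10] A block diagram showing the configuration of a word dictionary in the natural language processing apparatus according to the first example of the present invention.
[FIG. 11] A diagram showing the contents of the chart in the natural language processing apparatus according to the first example of the present invention.
[FIG. 12] A diagram showing the contents of the chart in the natural language processing apparatus according to the first example of the present invention.
[FIG. 13] A diagram showing the contents of the chart in the natural language processing apparatus according to the first example of the present invention.
[FIG. 14] A block diagram showing the configuration of a word dictionary in the natural language processing apparatus according to the second example of the present invention.
[FIG. 15] A diagram showing an example of a syntax tree that can be processed by the natural language processing apparatus according to the third example of the present invention.
[FIG. 16] A diagram showing an example of dictionary contents in the natural language processing apparatus according to the third example of the present invention.
[FIG. 17] A diagram for explaining the processing of the natural language processing apparatus according to the third example of the present invention.
[FIG. 18] A diagram for explaining the processing of the natural language processing apparatus according to the third example of the present invention.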
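[Explanation of Reference Numerals]
1 Input unit
2 Dictionary
3 Analysis unit
4 Chart
5 Output unit
6 Dictionary loader
7 CPU
8 Storage device
9 Keyboard
10 CRT
21 Syntax tree pool
22 Syntax tree sequence accepting unit
23 Word information
101 Input device
102 Data processing device
103 Output device
104 Storage device
105 Recording medium
[0001]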
[Technical Field of the Invention]
The present invention relates to a language analysis apparatus, and more particularly to a language analysis apparatus based on a lexicalized grammar, in which grammar rules are held in the dictionary entries of words.
[0002]
[Prior Art]
A grammar is said to be lexicalized when every basic element of the grammar, in general every syntax tree, is associated with some word. That is, associating grammar rules with words is called lexicalization of the grammar, and a grammar lexicalized in this way is called a lexicalized grammar.
[0003]
As a natural language analysis apparatus based on a lexicalized grammar, Japanese Patent Application Laid-Open No. 7-56921 (hereinafter "Document 1"), for example, describes a sentence analysis system that converts a context-free grammar into a lexicalized form called a "lexicalized context-free grammar (LCFG; Lexicalized Context Free Grammar)" and then analyzes sentences.
[0004]
In addition, the paper entitled "Head Automata and Bilingual Tiling: Translation with Minimal Representations" (hereinafter "Document 2"), published in August 1996 in Proceedings of ICSLP 96, the Fourth International Conference on Spoken Language Processing, Philadelphia, Pennsylvania, 1996, describes an analysis apparatus that performs dependency grammar analysis and syntactic transfer between languages by means of automata attached to words.
[0005]
A method that uses lexicalized automata to make the analysis of lexicalized grammars more efficient is also known. For example, the paper entitled "Grammar Compaction and Computation Sharing in Automaton-based Parsing" (hereinafter "Document 3"), published in April 1998 in Proceedings of the First Workshop on Tabulation in Parsing and Deduction (Paris, France, organized by INRIA), shows a technique that improves analysis efficiency by constructing finite automata that accept the sequence of non-terminal symbols at the leaves of the syntax trees associated with a word, and by merging and minimizing the finite automata attached to the same word.
[0006]
In this method, as described in Section 3.4 of Document 3, what is processed is not the syntax tree itself but a flat string of non-terminal symbols obtained by extracting the leaves of the tree and arranging them in a single row.
[0007]
One finite automaton that accepts this string is created for each syntax tree, and the automata are merged and minimized, so that analysis steps common to several trees are shared and processing becomes more efficient.
[0008]
[Problems to Be Solved by the Invention]
However, the lexicalized grammar analysis apparatuses described in the above documents have the problem that the grammar formalism they can handle is fixed throughout the entire system, which makes flexible grammar description difficult.
[0009]
For example, in the analysis apparatus described in Document 1, the grammar formalism that can be analyzed is limited to context-free grammar. This is clear from its configuration, in which the grammar is converted into a lexicalized context-free grammar before the analysis proceeds.
[0010]
As is well known, context-free grammar has the problem that it cannot concisely express freedom of word order.
[0011]
For example, in Japanese, both 「私が夕食をゆっくり食べた」 (watashi-ga yuushoku-wo yukkuri tabeta, "I ate dinner slowly") and 「夕食をゆっくり私が食べた」 (yuushoku-wo yukkuri watashi-ga tabeta, the same meaning with the constituents reordered) are correct sentences, and the analysis must succeed for both.
[0012]
In Japanese, the case elements of a verb, such as the 「が」 (ga) case and the 「を」 (wo) case, and optional modifiers such as 「ゆっくり」 (yukkuri, "slowly") are in general free in word order and may appear in any order.
[0013]
In a context-free grammar, however, the order of the constituents on the right-hand side of a rule is fixed. Such freedom of word order therefore cannot be expressed concisely, and every possible word order has to be expanded into rules in advance.
[0014]
Similarly, in the analysis apparatus described in Document 2, the grammar formalism that can be analyzed is limited to dependency grammar.
[0015]
Document 3 states that the target grammar formalism is not particularly limited. However, given that the technique constructs finite automata that accept the sequence of non-terminal symbols at the leaves of a syntax tree, the grammar formalisms to which it can be applied are clearly limited, because a constraint is imposed on the sequence of non-terminal symbols. Specifically, since this sequence is accepted by a finite automaton, what is allowed, once parsing is finished, as the leaf non-terminal sequence of a syntax tree headed by a given word is more or less limited to what a finite automaton can accept.
[0016]
In parsing, it has been common practice to control the analysis order on the basis of dynamic programming so that partial solutions are shared and the analysis becomes more efficient. In the configurations of Documents 2 and 3, however, each word has an automaton that also controls the analysis order, and there is no description of how these two kinds of analysis-order control are to be integrated so as to enable efficient analysis based on dynamic programming; this remains unclear.
[0017]
The present invention has been made in view of the above problems, and its object is to provide a language analysis apparatus and method capable of handling even a lexicalized grammar in which the class of grammar attached to each word differs substantially from word to word.
[0018]
Another object of the present invention is to provide a language analysis apparatus and method that allow general lexicalized grammars to be described and accepted without being bound by the constraints of a particular grammar formalism, and that allow efficient overall control of the analysis based on dynamic programming. Other objects and features of the present invention will be readily apparent from the following description.
[0019]
[Means for Solving the Problems]
To achieve the above object, the language analysis apparatus of the present invention comprises: a dictionary configured so that a word in the dictionary has a set of syntax trees associated with the word and syntax tree sequence accepting means that accepts syntax tree sequences formed by arranging syntax trees belonging to the set; and analysis means that advances the analysis of an input sentence using the syntax tree sequence accepting means.
[0020]
The method according to the present invention uses a dictionary in which a word has a set of syntax trees associated with the word and a syntax tree sequence accepting unit that accepts syntax tree sequences formed by arranging syntax trees belonging to the set, and includes, when analyzing a sentence input from input means: (a) a step of checking, for each word in the dictionary, whether a syntax tree associated with the word holds;
(b) a step of accepting, for each word in the dictionary, the syntactically permitted ones among the syntax tree sequences formed by the syntax trees found to hold in step (a); and
(c) a step of repeating steps (a) and (b) under control based on dynamic programming.
[0021]
[Embodiments of the Invention]
Embodiments of the present invention will now be described. In an embodiment of the present invention, a word dictionary is provided in which each word has a set of syntax trees associated with the word, a syntax tree sequence accepting unit that accepts syntax tree sequences formed by arranging syntax trees belonging to that set, and word information. An analysis unit that advances the analysis of an input sentence performs the analysis using the syntax tree sequence accepting unit corresponding to each word.
[0022]
Thus, in the embodiment of the present invention, each word is provided not only with syntax trees but also with a syntax tree sequence accepting unit that controls how the syntax trees are combined, so that the grammar formalism best suited to building up structure around that word can be selected and used.
[0023]
For example, for the way a Japanese verb takes in its case elements and free modifiers, a grammar formalism that concisely describes free word order can be adopted and implemented as the verb's syntax tree sequence accepting unit. For cases where the word order is fixed, such as prenominal modification of a noun, a simple context-free grammar can just as easily be adopted and implemented as the syntax tree sequence accepting unit.
[0024]
Further, in the embodiment of the present invention, the process of building up a single syntax tree and the process of combining syntax trees are controlled by separate automata. By controlling the analysis order on the basis of dynamic programming, the analysis can proceed efficiently, and the optimal overall solution can be obtained from the partial solutions.
[0025]
Specifically, when one syntax tree has been built up, it is registered in the chart as a partial solution indexed by the state of the syntax tree sequence accepting unit, and partial solutions with the same index are packed together. Sharing of partial solutions and control of the analysis order based on dynamic programming can thus proceed without conflict.
[0026]
Further, in the embodiment of the present invention, the syntax trees associated with a word in the dictionary are configured as automata that accept sequences of final states of syntax tree sequence accepting units, and the syntax tree sequence accepting unit is configured as an automaton that accepts sequences of final states of those syntax trees, which are themselves configured as automata.
[0027]
With this configuration, a syntax tree can be configured as an arbitrary automaton, which allows flexible grammar description.
[0028]
Embodiments of the present invention are described below in more detail with reference to the drawings.
[0029]
FIG. 1 is a diagram showing the configuration of the first exemplary embodiment of the present invention. Referring to FIG. 1, this natural language analyzing apparatus includes an input unit 1 for inputting an original sentence, a dictionary 2, an analysis part 3 that advances an analysis of the original sentence while referring to the dictionary 2, and a chart 4 for registering a partial analysis result. And an output unit 5 for outputting the analysis result.
[0030]
FIG. 2 is a diagram illustrating an example of the contents of the dictionary 2. Referring to FIG. 2, the entry for each word in the dictionary 2 includes a syntax tree pool 21 that stores a set of syntax trees, a syntax tree sequence accepting unit 22 that accepts the sequences of syntax trees allowed by the grammar, and word information 23 such as the spelling and syntactic properties of the word.
[0031]
FIG. 3 is a flowchart for explaining the operation of the first exemplary embodiment of the present invention. The operation of the first embodiment of the present invention will be described with reference to FIGS.
[0032]
When the original text input to the input unit 1 is sent to the analysis unit 3 (step 3-1), the analysis unit 3 refers to the dictionary 2, divides the original text into words, loads the dictionary contents of each word from the dictionary 2, and registers them in the chart 4 (step 3-2).
[0033]
The analysis of the original text proceeds on the chart 4 under the control of the analysis unit 3. Following the dynamic programming method, a linear order consistent with the inclusion relation is assigned to all partial sections, and the process of obtaining all partial analysis results for a section is repeated while the section of interest moves from small sections to large ones.
[0034]
Specifically, one unprocessed section is selected according to the linear order based on the inclusion relation (step 3-3). If such a section remains (Yes in step 3-4), all partial analysis results for this section are obtained (step 3-5), and the process returns to the section selection process (step 3-3).
[0035]
When the partial analysis results for all sections have been obtained and no unprocessed section remains (No in step 3-4), the partial analysis result spanning the entire section is sent to the output unit 5 as the analysis result (step 3-6).
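The section selection of steps 3-3 to 3-6 can be illustrated with a small sketch. The function name `sections_in_order` and the particular tie-breaking rule (by right end, then shortest first) are assumptions made here for illustration; the description only requires some linear order consistent with the inclusion relation.

```python
from typing import Iterator, Tuple

Section = Tuple[int, int]  # (start, end) positions between words


def sections_in_order(n_words: int) -> Iterator[Section]:
    """Yield every section so that a section contained in another one is
    always yielded first (one linear extension of the inclusion order)."""
    for end in range(1, n_words + 1):
        # For a fixed right end, shorter sections come before longer ones.
        for start in range(end - 1, -1, -1):
            yield (start, end)


order = list(sections_in_order(3))
print(order)
```

For a three-word sentence this visits [0, 1], [1, 2], [0, 2], [2, 3], [1, 3], [0, 3], which is the order followed in the "He eats dinner" example later in this description.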
[0036]
The output unit 5 outputs the sent analysis result to the outside (step 3-7).
[0037]
FIG. 4 is a flowchart showing details of the processing in step 3-5 in FIG. 3. With reference to FIG. 4, the details of the process (step 3-5) for obtaining all partial analysis results for a given section will be described.
[0038]
First, focus on words that are not included in the processed section (step 4-1). If such a word cannot be selected (No in step 4-2), the process ends.
[0039]
When a word that is not included in the processed section can be selected (Yes in step 4-2), processing of the word is started.
[0040]
First, the state of the word syntax tree sequence accepting unit is checked, and then one acceptable syntax tree is selected (step 4-3).
[0041]
Next, when an acceptable syntax tree can be selected (Yes in step 4-3), it is checked whether the syntax tree is established.
[0042]
The leaves of the syntax tree correspond to the final state of the syntax tree sequence accepting unit for words, and the syntax tree itself is implemented as an automaton that accepts the final state sequence registered on the chart 4.
[0043]
Whether or not the syntax tree is established is determined by whether or not this automaton reaches the final state.
[0044]
In step 4-5, the automaton of the selected syntax tree is activated to try to accept the final state sequence corresponding to the leaf.
[0045]
When the automaton reaches the final state (Yes in step 4-6), the syntax tree sequence accepting unit accepts the syntax tree and registers in the chart 4 a new edge whose syntax tree sequence accepting unit is in the new state (step 4-8). The process then returns to selecting another syntax tree that can be accepted for the original word (step 4-3). This is because, in general, more than one syntax tree may be acceptable in a given state.
[0046]
On the other hand, if the syntax tree was not accepted (No in step 4-6), the reason for the failure is analyzed, and it is determined whether the syntax tree could be accepted if the analysis progresses and new edges are created in the future (step 4-7).
[0047]
When it is determined that there is a possibility that it can be accepted in the future (Yes in Step 4-7), the syntax tree automaton whose state transition has progressed halfway is registered in the chart 4 as an active edge (Step 4-9).
[0048]
On the other hand, when it is determined that there is no possibility of establishment in the future (No in step 4-7), the process returns to the process of selecting another acceptable syntax tree (step 4-3).
[0049]
In this way, the analysis proceeds by repeating two steps: the syntax tree sequence accepting unit selects the next acceptable syntax tree, and the automaton of that syntax tree is run to check whether the syntax tree is actually established.
[0050]
The syntax tree sequence accepting unit 22 is configured as a finite automaton in simple cases, but it is not limited to this and may be a pushdown automaton or a more complex automaton. For example, an automaton equipped with an auxiliary table that manages the number of transitions for a set of state transitions can be used so that free word order can be handled.
[0051]
Since the syntax trees of one word and those of another word interact only via the final states of the syntax tree sequence accepting unit 22, the internal processing can be determined independently for each word.
[0052]
Therefore, the syntax tree sequence accepting unit 22 can be configured with a pushdown automaton for one word, a finite automaton for another word, and an automaton of yet another form for a third word. This is a major feature and advantage of the configuration of the present invention, in which the syntax tree sequence accepting unit 22 is made independent of the process of accepting the syntax tree itself, and both are lexicalized and stored for each word.
[0053]
Furthermore, as is clear from the above description, the processing in which the automaton of each word accepts syntax trees and the processing that checks whether a syntax tree is established are separate. Therefore, regardless of the specific configuration of the syntax tree sequence accepting unit 22, the entire analysis can be managed by a dynamic programming procedure (algorithm) using the chart 4, and the analysis can proceed efficiently while partial results are shared.
[0054]
This is made possible for the first time by the configuration of the present invention in which the syntax tree sequence accepting unit 22 is made independent of the process of accepting the syntax tree itself, and is a major feature and advantage of the present invention.
[0055]
FIG. 5 is a diagram for explaining a second embodiment of the present invention. Referring to FIG. 5, the second embodiment of the present invention includes an input device 101, a data processing device 102 composed of a computer, an output device 103, a storage device 104, and a recording medium 105 on which a natural language analysis program is recorded. The recording medium 105 may be a magnetic disk, a magnetic tape, an optical disk, a semiconductor memory, or another recording medium.
[0056]
The natural language analysis program is read from the recording medium 105 into the main storage device of the data processing device 102 and controls the operation of the data processing device 102. The data processing device 102 performs the following processing under the control of a natural language analysis program.
[0057]
When an input sentence is read from the input device 101, the analysis unit 103 is activated. The analysis unit 103 refers to the dictionary 102, recognizes a word in the sentence, loads the corresponding word dictionary, and registers it in the chart 4.
[0058]
When the dictionary of all words in the input sentence is registered in the chart 4, the analysis unit 103 starts analysis according to control based on dynamic programming. The analysis unit 103 determines one word of interest according to the control strategy, and activates the syntax tree sequence reception unit 22 in the word.
[0059]
The syntax tree sequence accepting unit 22 determines one acceptable syntax tree and starts checking whether the syntax tree is established on the chart 4. Since the leaves of a syntax tree are expressed as final states of the syntax tree sequence accepting units of words, the syntax tree is established if edges with the final states corresponding to the leaves of the syntax tree of interest are lined up on the chart 4 in that order.
[0060]
If it is established, the syntax tree sequence accepting unit 22 accepts the syntax tree.
[0061]
When the syntax tree is accepted, the substructure of the accepted syntax tree is structured and the word is registered in the corresponding section of the chart 4.
[0062]
Furthermore, the next acceptable syntax tree is selected, and the condition check for that syntax tree is started. If the conditions for establishing the syntax tree are not satisfied, the processing branches in two depending on the situation. If the conditions may become satisfied once the analysis proceeds and the syntax tree sequence accepting unit 22 of another word reaches a final state, the syntax tree is registered in the chart 4 as an active edge. If there is no such possibility, the condition check fails.
[0063]
When the analysis has proceeded in this way and an edge is obtained that spans the whole sentence and whose syntax tree sequence accepting unit is in a final state, that edge is output from the output device 103 as the analysis result.
[0064]
【Example】
Next, embodiments of the present invention will be described with reference to the drawings.
[0065]
FIG. 6 is a diagram showing a configuration of a first embodiment in which the present invention is applied to an English syntax analysis apparatus. Referring to FIG. 6, the apparatus includes a CPU 7 that performs arithmetic processing, a storage device 8 that stores a dictionary, a keyboard 9 for inputting the original text, an input unit 1 that takes the original text from the keyboard 9 into the CPU 7, a dictionary 2 stored in the storage device 8, a chart 4 for recording intermediate results of the analysis, a dictionary loader 6 that looks up the input sentence in the dictionary 2 and registers the result in the chart 4, an analysis unit 3 that controls the entire analysis process, an output unit 5 that outputs the analysis result to the outside of the CPU 7, and a display device 10 such as a CRT or LCD that displays the output result.
[0066]
Further, as shown in FIG. 6, the dictionary entry of each word in the dictionary 2 includes a syntax tree pool 21, a syntax tree sequence accepting unit 22 that accepts grammatically correct syntax tree sequences, and word information 23, which is various kinds of information including the syntactic properties of the word. The components are connected via data lines and control lines (communication lines).
[0067]
FIG. 7 is a diagram showing a part of the contents of the dictionary 2 for the word “eats” in the first embodiment of the present invention. In the syntax tree pool 21 of the third-person singular present-tense verb “eats”, the following syntax trees directly related to the word itself are registered: a syntax tree 7a that takes a direct object, a syntax tree 7b that represents free post-modification by an adverb or a prepositional phrase, and a syntax tree 7c that takes the subject. The syntax trees 7a, 7b, and 7c associated with the word “eats” in the syntax tree pool 21 are each a syntax tree (element) of depth (level) 1.
[0068]
In these syntax trees, the root node and one of its child nodes are specially marked as self.
[0069]
When two syntax trees are given, self nodes (one is a root node and the other is a leaf node) in the two syntax trees are overlapped with each other to connect the two syntax trees.
[0070]
Alternatively, the word itself may appear at the position of the self node in the syntax tree.
[0071]
Note that when these syntax trees 7a, 7b, and 7c are concatenated into one large syntax tree, the path that follows self nodes from the root self node down to the word itself is the spine of a syntax tree that has that word as its head.
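The concatenation through self nodes can be pictured with a minimal sketch. The class `Tree`, the functions `stack_trees` and `spine`, and the root labels "VP" and "S" are all hypothetical names introduced here; FIG. 7 is not reproduced in this text, so the actual node labels are not known.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union

SELF = "self"


@dataclass
class Tree:
    root: str                           # label of the root (itself a self node)
    children: List[Union["Tree", str]]  # leaf symbols, with the self slot filled
    self_index: int                     # position of the self child


def stack_trees(word: str, elementary: Sequence[Tuple[str, List[str]]]) -> Union[Tree, str]:
    """Join depth-1 trees, innermost first, by placing what has been built so far
    at the self leaf of the next tree."""
    current: Union[Tree, str] = word
    for root, leaves in elementary:
        i = leaves.index(SELF)
        children: List[Union[Tree, str]] = list(leaves)
        children[i] = current
        current = Tree(root, children, i)
    return current


def spine(node: Union[Tree, str]) -> List[str]:
    """Follow self children from the root down to the word itself."""
    path = []
    while isinstance(node, Tree):
        path.append(node.root)
        node = node.children[node.self_index]
    path.append(node)
    return path


# Hypothetical shapes for 7a (object to the right) and 7c (subject to the left).
tree = stack_trees("eats", [("VP", [SELF, "dob"]), ("S", ["subject", SELF])])
print(spine(tree))
```

In this sketch the spine runs from the outermost root through each self position down to the word "eats", while the remaining leaves (`dob`, `subject`) stay open for other words to fill.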
[0072]
Symbols such as subject and dob that appear at the other leaf nodes of the syntax tree correspond to final states of syntax tree sequence accepting units.
[0073]
A syntax tree is established when words (that is, edges on the chart) having the final states designated to the left and right of self appear.
[0074]
For example, the syntax tree 7a is established only when an edge whose syntax tree sequence accepting unit has reached the final state dob (meaning direct object) appears immediately after self.
[0075]
The syntax tree 7b is established when an edge with the final state postmod (meaning an optional post-modifier) appears immediately after self.
[0076]
These syntax trees are implemented as an automaton that performs state transitions with the final state of the syntax tree sequence accepting unit as an input.
[0077]
Also, the syntax tree sequence accepting unit 22 is configured to accept only the following sequence of these three syntax trees:
7a·7b*·7c
[0078]
Here, the symbol “·” represents a concatenation of syntax trees, and “*” represents an arbitrary number of repetitions.
[0079]
That is, the syntax tree sequence accepting unit 22 that accepts the sequence 7a·7b*·7c is configured to first accept the syntax tree 7a, then accept the syntax tree 7b zero or more times, and finally accept the syntax tree 7c to reach the final state.
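As an illustration, an accepting unit of this kind can be written as a three-state finite automaton. The state names `q0`, `q1`, `q2` and the helper names below are assumptions of this sketch, not taken from the figures.

```python
from typing import Dict, List, Optional, Sequence, Tuple

START = "q0"
FINAL = {"q2"}

# q0: start; q1: 7a accepted, 7b may repeat here; q2: 7c accepted (final).
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("q0", "7a"): "q1",
    ("q1", "7b"): "q1",
    ("q1", "7c"): "q2",
}


def step(state: str, tree: str) -> Optional[str]:
    """State after accepting `tree`, or None if it cannot be accepted here."""
    return TRANSITIONS.get((state, tree))


def acceptable(state: str) -> List[str]:
    """Syntax trees that can be accepted next in `state`."""
    return sorted(t for (s, t) in TRANSITIONS if s == state)


def accepts(trees: Sequence[str]) -> bool:
    state: Optional[str] = START
    for tree in trees:
        state = step(state, tree)
        if state is None:
            return False
    return state in FINAL


print(acceptable("q0"), acceptable("q1"))
print(accepts(["7a", "7b", "7b", "7c"]), accepts(["7c"]))
```

Here `acceptable` plays the role of asking the accepting unit which syntax tree to try next: only 7a in the start state, and 7b or 7c after 7a has been accepted.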
[0080]
A syntax tree sequence can be identified with one large syntax tree obtained by concatenating its members through the self nodes described above.
[0081]
For example, the syntax tree corresponding to the sequence 7a·7b*·7c is the syntax tree shown in FIG. 8.
[0082]
In other words, the syntax tree sequence accepting unit 22 of “eats” can equally be regarded as being configured to accept the syntax tree shown in FIG. 8; the two expressions are equivalent.
[0083]
FIG. 9 is a diagram showing a part of the dictionary contents of the word “he” in the first embodiment of the present invention. The syntax tree pool of “he” is empty, and the syntax tree sequence accepting unit 22 transitions from the start state to the final state subject on an empty (ε) input. The word information stores spelling: “he”, base form: “he”, and part of speech: pronoun.
[0084]
FIG. 10 is a diagram showing a part of the dictionary contents of the word “dinner” in the first embodiment of the present invention. The syntax tree pool of “dinner” consists of Td1, which represents prefix modification by an adjective, Td2, which represents prefix modification by a noun, Td3, which represents modification by a qualifier, and Td4, which represents postfix modification by a prepositional phrase. The syntax tree sequence accepting unit is configured to accept the syntax tree sequence
(Td1 | Td2)* Td3* {Td4}
[0085]
Here, the symbol “|” indicates “OR”, and {} indicates that it can be omitted.
[0086]
That is, the syntax tree sequence accepting unit accepts the adjective prefix-modification syntax tree Td1 or the noun prefix-modification syntax tree Td2 any number of times, then accepts the qualifier modification syntax tree Td3 any number of times, and finally accepts the prepositional-phrase postfix-modification syntax tree Td4 zero times or once to reach the final state.
[0087]
The final state is a superposition of subject, dob, and prepobj.
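Because this sequence is a regular expression over syntax tree names, one way to sketch the accepting unit is with Python's `re` module. Encoding each tree as a single character and the names `CODE`, `PATTERN`, and `dinner_accepts` are assumptions made purely for this illustration.

```python
import re
from typing import FrozenSet, Sequence

# One character per syntax tree so that a sequence becomes a plain string.
CODE = {"Td1": "1", "Td2": "2", "Td3": "3", "Td4": "4"}

# (Td1 | Td2)* Td3* {Td4}, where {} marks an element that can be omitted.
PATTERN = re.compile(r"(?:1|2)*3*4?")

# The final state is a superposition of these grammatical functions.
FINAL_STATES: FrozenSet[str] = frozenset({"subject", "dob", "prepobj"})


def dinner_accepts(trees: Sequence[str]) -> bool:
    encoded = "".join(CODE[t] for t in trees)
    return PATTERN.fullmatch(encoded) is not None


print(dinner_accepts([]))                            # bare "dinner"
print(dinner_accepts(["Td2", "Td1", "Td3", "Td4"]))
print(dinner_accepts(["Td3", "Td1"]))                # wrong order
```

Note that the empty sequence is accepted, so a bare noun can reach the final state without taking any modifier.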
[0088]
Next, the analysis processing of the first embodiment of the present invention will be specifically described.
[0089]
When the sentence “He eats dinner” is input from the keyboard 9, the CPU 7 is activated and the sentence is passed to the input unit 1 via the communication line.
[0090]
The input unit 1 converts the input from the keyboard 9 into a format that can be understood by the CPU 7 and sends it to the dictionary loader 6 via a communication line.
[0091]
The dictionary loader 6 refers to the dictionary 2 stored in the storage device 8, recognizes word boundaries in the input sentence, loads the dictionary contents of each word, and sends them to the chart 4 via a communication line.
[0092]
The chart 4 registers the dictionary contents on the chart 4 in units of words.
[0093]
FIG. 11 is a diagram illustrating a state of the chart 4 immediately after loading the dictionary.
[0094]
The chart 4 starts at position 0 and ends at position 3; “he” is registered from position 0 to 1 (e1), “eats” from position 1 to 2 (e2), and “dinner” from position 2 to 3 (e3). As shown in FIG. 11, each word has attached to it a syntax tree sequence accepting unit in its start state. Hereinafter, a double line surrounding a state of an automaton indicates that the state is the current state. In FIG. 11, the syntax tree pool and word information are omitted for simplicity, but they are attached to each word in the same way as the syntax tree sequence accepting unit.
[0095]
When registration to the chart 4 is completed, the analysis unit 3 is activated.
[0096]
The analysis processing is basically performed in the same manner as the bottom-up chart method of context-free grammar from left to right under the control of the analysis unit 3.
[0097]
The order of selecting partial sections to be analyzed and the analysis using active and inactive edges are the same as in the context-free grammar. However, in the case of context-free grammar, the partial analysis results registered in a partial section of the chart are distinguished by non-terminal symbols, whereas in the present invention they are distinguished by the automaton state of the syntax tree sequence accepting unit of the word.
[0098]
In addition, for a partial analysis result (inactive edge), the grammar rule to be applied next is not selected from a global pool of grammar rules on the condition that non-terminal symbols match, but is determined by the automaton of the syntax tree sequence accepting unit associated with that edge.
[0099]
First, for the first unprocessed section [0, 1], all partial analysis results of this section are obtained.
[0100]
Therefore, paying attention to the word “he” existing in this section, the syntax tree string accepting unit for the word “he” is activated.
[0101]
As described above, the syntax tree sequence accepting unit of “he” transitions to the final state on an empty (ε) syntax tree input, so the automaton reaches the final state subject without any check of whether a syntax tree is established.
[0102]
A final state of a syntax tree sequence accepting unit is named according to the grammatical function that the accepted edge can have; subject indicates that an edge having this final state can serve the grammatical function of subject.
[0103]
In this way, since the acceptance of the syntax tree sequence has been completed, the syntax tree sequence acceptance unit newly creates an edge in the final state and registers it in the interval [0, 1] (e4).
[0104]
This new edge has the same information as “he” immediately after loading the dictionary, except that the syntax tree sequence accepting unit is in the final state.
[0105]
Since the analysis cannot proceed any further in the interval [0, 1], the analysis unit 3 moves the interval of interest to [1, 2] and starts the process of obtaining all partial analysis results of this interval.
[0106]
Since there is only one word “eats” (more precisely, edge e2) in the interval [1,2], the syntax tree sequence accepting unit for the word “eats” is first activated.
[0107]
The syntax tree sequence accepting unit of the word “eats” is configured to accept the syntax tree sequence 7a·7b*·7c. Since 7a is the only syntax tree that can be accepted immediately after the start state, it first checks whether the syntax tree 7a is established.
[0108]
The condition for establishing the syntax tree 7a is that an edge having the final state dob exists immediately to the right of self. Since no such edge is registered at this time, the syntax tree 7a is not established. However, such an edge may be newly registered as the analysis proceeds to the right. Therefore, the analysis unit 3 creates from the syntax tree 7a an active edge recording that the next final state to be matched is dob, and registers it in the interval [1, 2].
[0109]
Next, the analysis unit 3 sets the interval of interest to [0, 2] and tries to obtain all partial analysis results existing in this interval, but since no syntax tree is newly established in this interval, it moves on to the next step.
[0110]
The analysis unit 3 sets the interval of interest to [2, 3] and obtains all partial analysis results existing in this interval. It focuses on the only word in this interval, “dinner” (more precisely, edge e3), and starts accepting syntax trees.
[0111]
The e3 syntax tree sequence accepting unit is a finite automaton configured to accept (Td1 | Td2) * Td3 * {Td4}.
[0112]
Td1, Td2, Td3, and Td4 are all candidates for the first syntax tree to be accepted, but none of them can be established in the present situation.
[0113]
On the other hand, the syntax tree sequence accepting unit can also enter the final state by accepting the empty syntax tree sequence, so it transitions to the final state (dob | sub | prepobj).
[0114]
This final state has a state as a superposition of sub, dob, and prepobj, and expresses that it has a grammatical function as a direct object, subject, and preposition object. The edge having this final state is registered in the interval [2, 3] as the edge e5.
[0115]
Since a partial analysis result has been newly registered by the time the processing of section [2, 3] is completed, the process proceeds to completion processing.
[0116]
In the completion process, it is checked whether or not the newly registered edge can be matched with the previously registered active edge. If matching is possible, a new edge is created.
[0117]
Here, since the active edge created from the syntax tree 7a is registered in the interval [1, 2] and is waiting for the final state dob, the analysis unit 3 restarts the automaton of the syntax tree 7a. This automaton reaches the final state of the syntax tree itself by accepting the edge e5, which has the final state dob.
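The matching performed in this completion step can be sketched as follows. The record types `ActiveEdge` and `InactiveEdge`, their field names, and the function `complete` are hypothetical, and the sketch assumes an active edge waits for exactly one final state on one side of self.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class ActiveEdge:
    start: int
    end: int
    tree: str   # syntax tree whose check is suspended, e.g. "7a"
    needs: str  # final state it is waiting for, e.g. "dob"
    side: str   # "right" or "left" of self


@dataclass(frozen=True)
class InactiveEdge:
    start: int
    end: int
    word: str
    finals: FrozenSet[str]  # final states reached (possibly a superposition)


def complete(active: ActiveEdge, new: InactiveEdge) -> Optional[Tuple[int, int]]:
    """Return the section covered by the combined edge if `new` supplies the
    awaited final state in an adjacent position, otherwise None."""
    if active.needs not in new.finals:
        return None
    if active.side == "right" and new.start == active.end:
        return (active.start, new.end)
    if active.side == "left" and new.end == active.start:
        return (new.start, active.end)
    return None


waiting_7a = ActiveEdge(1, 2, "7a", "dob", "right")
e5 = InactiveEdge(2, 3, "dinner", frozenset({"subject", "dob", "prepobj"}))
e4 = InactiveEdge(0, 1, "he", frozenset({"subject"}))
print(complete(waiting_7a, e5), complete(waiting_7a, e4))
```

The superposed final state of "dinner" contains `dob`, so the suspended check of 7a succeeds and yields an edge over [1, 3]; the edge for "he" is neither on the right nor in the awaited state, so it does not match.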
[0118]
As a result, the syntax tree 7a is accepted, and the analysis unit 3 registers in the chart 4 an edge e6 in which the syntax tree sequence accepting unit of “eats” has advanced by one state. Its content is basically the same as that of the word “eats”; it differs in that the syntax tree sequence accepting unit is in the state immediately after accepting the syntax tree 7a, so that the next syntax tree to be accepted is 7b or 7c. FIG. 12 is a diagram showing the state of the chart at this time.
[0119]
Next, the analysis unit 3 enters the processing of section [1, 3]. Since the edge e6 registered in the preceding completion process is in this section, the focus shifts to it, and the syntax tree sequence accepting unit is activated to start checking whether the next syntax tree can be accepted.
[0120]
In e6, the next acceptable syntax tree is 7b or 7c. The syntax tree 7b satisfies the establishment condition if there is a word having the final state postmod, which is a free modification element, on the right side, and 7c satisfies the establishment condition if there is a word that reaches the final state of subj on the left side. Here, since there is an edge e4 having a final state subject immediately before, the condition for establishing the syntax tree 7c is satisfied, and this is accepted, and a new edge e7 is registered in the interval [0, 3].
[0121]
The e7 syntax tree string acceptor is in the final state. FIG. 13 is a diagram showing a chart in this state.
[0122]
The syntax tree shown in FIG. 13 is assembled from the syntax tree sequence received by the syntax tree sequence reception unit attached to “eats” in the process from e2 to e7.
[0123]
Thus, since an edge spanning the entire chart has been obtained and its syntax tree sequence accepting unit has reached the final state, the syntax analysis succeeds. The solution is the syntax tree sequence accepted by this edge.
[0124]
As described above, the syntax tree sequence can be compiled as a single syntax tree by superimposing self nodes, and the syntax tree thus created is equivalent to a solution. FIG. 13 shows the latter form.
[0125]
The output unit 5 receives this syntax tree and converts the data format into a form that can be displayed on the screen of the display device 10. When the conversion result is sent to the display device 10 via the communication line, the result received by the display device 10 is displayed to the user, and the analysis process ends.
[0126]
Although not shown in the above example, when edges whose information relevant to the subsequent analysis processing is identical are registered in the same section, packing them together allows the analysis to proceed efficiently, taking advantage of the dynamic programming method.
[0127]
Here, “the information related to the subsequent analysis processing is the same” means that the syntax tree sequence acceptance automata associated with the edges are exactly the same, including their current states. That is, when an edge is about to be registered, if an edge that has an equivalent syntax tree sequence acceptance automaton in an equivalent state is already registered in the same section, the new edge is not registered separately in the chart; instead it is packed into the already registered edge, and the subsequent analysis processing is shared. In this way, the efficient analysis algorithms based on dynamic programming that are well known for context-free grammars, such as the chart method and the CKY method, can be used as they are. The only difference is that in the context-free grammar this condition is determined by a match of non-terminal symbols, whereas in the present invention it is determined by a match of the states of the syntax tree sequence acceptance automata. This is a major feature of the present invention, given that the language analysis apparatus of the present invention can accept not only context-free grammars but a much wider range of grammar formats.
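A minimal sketch of such packing is shown below. The `Chart` class, the use of a string identifier for the automaton, and the representation of a derivation as an arbitrary object are assumptions of this sketch; the point is only that the packing key is the section together with the automaton and its current state.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (start, end, automaton identifier, current state of that automaton)
Key = Tuple[int, int, str, str]


class Chart:
    def __init__(self) -> None:
        self._packed: Dict[Key, List[object]] = defaultdict(list)

    def register(self, start: int, end: int, automaton: str, state: str,
                 derivation: object) -> bool:
        """Register an edge. Returns True if a new edge was created, or False
        if the derivation was packed into an equivalent existing edge."""
        key = (start, end, automaton, state)
        is_new = key not in self._packed
        self._packed[key].append(derivation)
        return is_new

    def edges(self) -> List[Key]:
        return list(self._packed)

    def derivations(self, key: Key) -> List[object]:
        return list(self._packed[key])


chart = Chart()
first = chart.register(1, 3, "eats", "after-7a", "derivation A")
second = chart.register(1, 3, "eats", "after-7a", "derivation B")
print(first, second, len(chart.edges()))
```

Only an edge for which `register` returns True needs to be processed further; later derivations with the same key ride along with it, which is where the sharing of subsequent analysis comes from.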
[0128]
FIG. 14 is a diagram showing the contents of the word dictionary in the second embodiment, in which the present invention is implemented as a Japanese analyzer. The configuration of the second embodiment of the present invention is the same as that of the first embodiment. Referring to FIG. 6, the apparatus includes a CPU 7 that performs arithmetic processing, a storage device 8 that stores a dictionary and the like, a keyboard 9 for inputting the original text, an input unit 1 that takes the original text from the keyboard 9 into the CPU 7, a dictionary 2 stored in the storage device 8, a chart 4 that records intermediate results of the analysis, a dictionary loader 6 that looks up the input sentence in the dictionary 2 and registers the result in the chart 4, an analysis unit 3 that performs overall control of the analysis process, an output unit 5 that outputs the analysis result to the outside of the CPU 7, and a display device 10 that displays the output result.
[0129]
Furthermore, the dictionary entry of each word in the dictionary 2 includes a syntax tree pool 21, a syntax tree sequence accepting unit 22 that accepts grammatically correct syntax tree sequences, and word information 23, which is various kinds of information including the syntactic properties of the word. The components are connected via communication lines.
[0130]
The second embodiment of the present invention differs from the first embodiment in that the word dictionary contents are changed to Japanese, as shown in FIG. 14, in order to implement the analysis apparatus as a Japanese analyzer. In addition to the change of language, the structure of the automaton in the syntax tree sequence accepting unit is changed to suit the analysis of Japanese. The second embodiment of the present invention will be described below with reference to FIG. 14.
[0131]
FIG. 14 shows a part of the dictionary contents of the verb “eat”. The syntax tree pool consists of only one Td, which is a syntax tree that fetches the preceding consecutive elements.
[0132]
The fact that “eating” has an essential case pattern of “to eat” is expressed not in the syntax tree but in an auxiliary table (acceptance count constraint table) provided in the syntax tree sequence receiving unit.
[0133]
The syntax tree sequence acceptance unit is an automaton whose state is the product of the finite automaton that forms the skeleton and the state of the acceptance number constraint table. Each time Td is accepted on the finite automaton side, the corresponding “number of times” column of the acceptance number restriction table is incremented.
[0134]
If the “number of times” column does not satisfy the constraints of the “number of times constraint” column, the process ends as an acceptance failure.
[0135]
With this mechanism, the structural constraint that the word “eat” can take the particle “ga” once, the particle “ha” once, and any other optional modifiers any number of times, in any order, can be expressed in a simple form.
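A minimal sketch of an accepting unit with such a count table follows. The class name `CountingAcceptor`, the element kinds "ga", "ha", and "other", and the reading of "once" as an upper bound of one are assumptions of this sketch; the skeleton finite automaton is reduced to a single looping state that accepts Td repeatedly.

```python
from typing import Dict, Optional, Sequence, Tuple

# kind of element taken in by Td -> (minimum count, maximum count or None)
Limits = Dict[str, Tuple[int, Optional[int]]]


class CountingAcceptor:
    def __init__(self, limits: Limits) -> None:
        self.limits = limits
        self.counts = {kind: 0 for kind in limits}  # the "number of times" column

    def accept(self, kind: str) -> bool:
        """Accept one more Td taking in an element of `kind`; False means
        acceptance failure because the count constraint would be violated."""
        if kind not in self.limits:
            return False
        _, upper = self.limits[kind]
        if upper is not None and self.counts[kind] + 1 > upper:
            return False
        self.counts[kind] += 1
        return True

    def is_final(self) -> bool:
        # Assumption: lower bounds, if any, are checked when acceptance ends.
        return all(self.counts[k] >= low for k, (low, _) in self.limits.items())


def accepts(sequence: Sequence[str], limits: Limits) -> bool:
    acceptor = CountingAcceptor(limits)
    return all(acceptor.accept(kind) for kind in sequence) and acceptor.is_final()


EAT: Limits = {"ga": (0, 1), "ha": (0, 1), "other": (0, None)}
print(accepts(["other", "ha", "other", "ga"], EAT))  # any order is fine
print(accepts(["ga", "ga"], EAT))                     # "ga" taken twice
```

Because the order of elements never enters the check, free word order comes for free, while the table still rules out taking the same case element twice.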
[0136]
On the other hand, for the description of grammatical phenomena in which the word order is fixed, such as the concatenation of nouns and particles, and the phenomenon that auxiliary verbs take in verbs, the order of acceptance of multiple syntax trees is limited as in English. It is natural to express with an automaton, and this is followed in the second embodiment of the present invention. Thus, it is a great feature of the present invention that the most appropriate grammar format can be adopted for each word according to the nature of the grammatical phenomenon related to the word.
[0137]
Conventionally, several formats have been proposed as lexicalized grammar formats that allow freedom of word order. Examples are described in the Ph.D. dissertation “Formal and Computational Aspects of Natural Language Syntax”, submitted to the Department of Computer Science at the University of Pennsylvania, USA, in 1994 (referred to as “Reference 5”).
[0138]
UVG-DL (Unordered Vector Grammar with Dominance Links), which is given there as one example, is a vector grammar whose unit is a group of multiple syntax trees, with dominance links given between the syntax trees in a group.
[0139]
In this grammar, the syntax trees in a group must be used the same number of times, and syntax trees connected by a dominance link must satisfy the constraint that their ancestor/descendant relationship in the resulting syntax tree does not contradict the link.
[0140]
With a slight modification of this form, it is easy, for example, to express the free word order of Japanese verb case elements concisely. However, such a representation format was devised to suit the description of the behavior of case elements in languages such as Japanese and German, and it cannot be said to be appropriate for describing the behavior of other words. If the form of the entire grammar is chosen to make one particular grammatical phenomenon easy to describe, it naturally becomes ill-suited to describing other phenomena.
[0141]
On the other hand, in the present invention, by having a syntax tree sequence accepting unit for each word, the substantial grammatical form followed by the word can be flexibly and easily changed for each word.
[0142]
In the second embodiment of the present invention, the grammar of the verb is implemented by a finite state automaton extended with the acceptance count constraint table, but it may instead be implemented by, for example, UVG-DL. Since the relationship with other words arises only through the final state symbols of the syntax tree sequence accepting unit, whatever processing is performed before the final state is reached, its effects are confined to that word.
[0143]
Further, as described in the first embodiment, because the check of whether a syntax tree is established (acceptance of final states by the syntax tree automaton) is separated from the processing of accepting a syntax tree sequence, efficient analysis based on a dynamic programming method using a chart or the like can be performed regardless of how the syntax tree sequence accepting unit is implemented. This is a feature of the present invention.
[0144]
A third embodiment of the present invention will be described. FIG. 15 is a diagram for explaining a third embodiment of the present invention. A case where analysis equivalent to the D-Tree Grammar shown in FIG. 3 is performed will be described.
[0145]
For details of D-Tree Grammars, refer to the paper "Parsing D-Tree Grammars" presented at the International Workshop on Parsing Technologies in 1995 (hereinafter referred to as "Document 5").
[0146]
In the syntax tree shown in FIG. 15, the links between S and VP are represented by dotted lines. A dotted line expresses that the link is not a fixed link of length 1, and that zero or more syntax trees can be inserted at that point. However, at the two nodes where the split syntax tree connects to an inserted syntax tree, the part of speech must be S or VP (depending on the split location).
[0147]
Such an expression is useful for describing long-distance dependency phenomena such as WH-movement and correlative constructions such as "so ... that".
[0148]
In the present invention, the same can be realized by slightly expanding the syntax tree pool and storing and propagating the syntax tree in the pushdown stack.
[0149]
FIG. 16 shows the contents of the word dictionary for realizing FIG. This content is described in the word dictionary corresponding to V at the bottom in FIG.
[0150]
Only the lowermost tree T1 in FIG. 15 is included in the syntax tree pool, and the other two trees above the broken line are stored in the pushdown stack. The syntax tree sequence accepting unit is configured to transition to a final state when T1 has been accepted exactly once.
[0151]
When the syntax tree sequence accepting unit reaches the final state, the pushdown stack is taken out of the syntax tree pool and associated with the final state. In this state, the syntax tree sequence accepting unit may either take a syntax tree out of the stack and continue acceptance, or pass the stack to the syntax tree sequence accepting unit that accepts this unit's result as a child. In either case, a syntax tree can be taken out only if its nonterminal symbol matches the symbol of the final state. In the latter case, the stack propagates upward, and at some point an attempt is made to accept the syntax tree in the stack.
[0152]
FIG. 17 schematically shows this situation: after the bottom VP has grown as a child of another syntax tree (thick chain line), the top syntax tree in the stack is popped and accepted.
[0153]
In this way, in the configuration of the present invention, a syntax tree is propagated in the stack, and when the acceptance processing of the syntax tree sequence is completed in subsequent processing, the syntax tree is popped and accepted. This makes it easy to handle long-distance dominance relationships of the kind described above.
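The propagation described above can be illustrated with a minimal sketch. The class names `StackedTree` and `FinalState`, the field names, and the tree labels T2 and T3 are hypothetical and are not taken from the figures; the sketch models only the two choices available in a final state (pop and accept, or pass the stack upward) and the symbol-matching condition.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StackedTree:
    name: str
    attach: str  # nonterminal that must match the final-state symbol
    root: str    # final-state symbol after this tree has been accepted


@dataclass(frozen=True)
class FinalState:
    symbol: str
    stack: Tuple[StackedTree, ...]  # top of the stack is the last element


def pop_and_accept(state: FinalState) -> Optional[FinalState]:
    """Pop the top tree and accept it, or return None if it does not match."""
    if not state.stack:
        return None
    top = state.stack[-1]
    if top.attach != state.symbol:
        return None  # matching condition on the nonterminal symbol fails
    return FinalState(top.root, state.stack[:-1])


def pass_to_parent(state: FinalState, parent_symbol: str) -> FinalState:
    """The result is accepted as a child; the stack propagates unchanged."""
    return FinalState(parent_symbol, state.stack)


# Hypothetical scenario: T1 has been accepted (final symbol VP) and two
# trees wait on the stack, with T2 on top.
upper = StackedTree("T3", attach="S", root="S")
middle = StackedTree("T2", attach="VP", root="S")
after_t1 = FinalState("VP", (upper, middle))

grown = pass_to_parent(after_t1, "VP")  # VP grows as a child of another tree
after_t2 = pop_and_accept(grown)        # T2 matches VP and is accepted
after_t3 = pop_and_accept(after_t2)     # T3 matches S and is accepted
```

In this sketch the stack is an immutable tuple, so each final state carries its own copy; this is a design choice made for the illustration so that alternative analyses on a chart would not interfere with each other.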
[0154]
Here, although a single syntax tree is pushed onto the stack, in general, it can be extended to push onto the stack in units of a syntax tree pool and a syntax tree string accepting unit.
[0155]
FIG. 18 is a diagram for explaining the processing of the correlative relationship "so (adjective) that (sentence)", using the original sentence
"She is so kind that everybody likes her."
In this case, the dictionary entry for "so" has a syntax tree (FIG. 18B) that gathers up the following "that" clause, and this tree is propagated as indicated by the arrow on the syntax tree in the figure. By starting its acceptance once the whole has been assembled as a sentence, the "that" clause and what follows it can be gathered up correctly.
[0156]
The correlative relationship between "so" and the "that" clause is evident from where the syntax tree for the "that" clause originates.
[0157]
It should be noted that, in general, when a context-free grammar is given, the analysis apparatus according to the present invention can be configured so as to be strongly equivalent to the original grammar by the following procedure.
[0158]
First, for each word, a syntax tree pool attached to the word is constructed as follows.
[0159]
With a word fixed, all rules that derive that word from a preterminal are put into the syntax tree pool. At that time, the symbol self is attached to the root of each rule, that is, to the preterminal symbol.
[0160]
Next, for each nonterminal symbol that is the root of a rule in the syntax tree pool, all rules having that symbol as a leaf are collected and added to the syntax tree pool. Here, the symbol self is attached to the leaf of interest and to the root. Even for the same rule, a different leaf of interest means a different position of self, so the rule is added as a separate rule.
[0161]
This additional operation is repeated for the nonterminal symbol that is the root of the rule in the syntax tree pool until the rule addition stops. This completes the creation of the syntax tree pool.
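The pool construction in the preceding paragraphs can be sketched as a fixed-point computation. This is only an illustration under stated assumptions: the names `PoolRule` and `build_syntax_tree_pool`, the toy grammar, and the representation of the self mark as a leaf index (the root is implicitly marked as well) are all hypothetical and do not appear in the original.

```python
from collections import namedtuple

# A pool entry: a grammar rule plus the position of the leaf marked "self".
PoolRule = namedtuple("PoolRule", ["root", "leaves", "self_pos"])


def build_syntax_tree_pool(rules, lexicon, word):
    """rules: list of (lhs, rhs) over nonterminals;
    lexicon: list of (preterminal, word) lexical rules."""
    pool = set()
    # Step 1: rules that derive the fixed word from a preterminal.
    for preterminal, w in lexicon:
        if w == word:
            pool.add(PoolRule(preterminal, (w,), 0))
    # Step 2: repeat the addition until no new rule is added.
    changed = True
    while changed:
        changed = False
        roots = {entry.root for entry in pool}
        for lhs, rhs in rules:
            for i, symbol in enumerate(rhs):
                if symbol in roots:
                    entry = PoolRule(lhs, tuple(rhs), i)
                    if entry not in pool:
                        pool.add(entry)
                        changed = True
    return pool


# Toy context-free grammar (an assumption made for this example).
RULES = [("S", ("NP", "VP")), ("VP", ("V", "NP")), ("NP", ("N",))]
LEXICON = [("V", "likes"), ("N", "dogs")]

pool_likes = build_syntax_tree_pool(RULES, LEXICON, "likes")
pool_dogs = build_syntax_tree_pool(RULES, LEXICON, "dogs")
```

For "dogs", the rule S → NP VP enters the pool twice, once with self on NP and once with self on VP (reached through VP → V NP), which corresponds to the statement that the same rule with a different leaf of interest is added as a separate rule.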
[0162]
The syntax tree sequence accepting unit is a nondeterministic finite state automaton configured as follows. Each nonterminal symbol that is the root of a rule in the syntax tree pool is a state. From the start state, it is possible to make a transition to a state corresponding to the pre-terminal, and in each transition, a rule for generating a word from the pre-terminal of the transition destination is accepted.
[0163]
Furthermore, when the syntax tree pool contains a syntax tree whose root is state 2 and whose leaf marked with the symbol self is state 1, the automaton is configured to make a transition from state 1 to state 2 by accepting this syntax tree. In addition, every state can make a transition to the final state by accepting an empty syntax tree.
[0164]
With the above configuration, the analysis apparatus according to the present invention that accepts the same syntax tree set as the original context-free grammar can be configured.
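The automaton construction above can be sketched as follows. The names `build_acceptor` and `run`, the `START`/`FINAL`/`EMPTY` markers, and the hand-written pool for the word "likes" are assumptions made for illustration. The sketch also assumes that the empty-tree transition to the final state is added only for states corresponding to nonterminal symbols, not for the start state.

```python
from collections import namedtuple

PoolRule = namedtuple("PoolRule", ["root", "leaves", "self_pos"])
START, FINAL = "<start>", "<final>"
EMPTY = None  # stands for the empty syntax tree


def build_acceptor(pool, word):
    """Return the transitions (source, accepted tree, destination)."""
    transitions = set()
    for rule in pool:
        if rule.leaves == (word,):
            # Start state -> preterminal, accepting the lexical rule.
            transitions.add((START, rule, rule.root))
        else:
            # State of the self leaf -> state of the root.
            transitions.add((rule.leaves[rule.self_pos], rule, rule.root))
    for state in {rule.root for rule in pool}:
        transitions.add((state, EMPTY, FINAL))
    return transitions


def run(transitions, trees):
    """Consume a sequence of trees nondeterministically and return the
    states from which the final state can then be entered."""
    current = {START}
    for tree in trees:
        current = {dst for (src, t, dst) in transitions
                   if src in current and t == tree}
    return {s for s in current if (s, EMPTY, FINAL) in transitions}


# Hand-written pool for the word "likes" (toy grammar, hypothetical).
lex = PoolRule("V", ("likes",), 0)
vp = PoolRule("VP", ("V", "NP"), 0)
s = PoolRule("S", ("NP", "VP"), 1)
acceptor = build_acceptor({lex, vp, s}, "likes")

full = run(acceptor, [lex, vp, s])  # V, then VP, then S
partial = run(acceptor, [lex])      # stops at the preterminal V
skipped = run(acceptor, [lex, s])   # S cannot follow V directly
```

The set returned by `run` plays the role of the final state symbol: a sequence is accepted when the set is non-empty, and its elements are the symbols under which the result can be offered to other words.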
[0165]
The present invention can be implemented with various modifications other than those described above.
[0166]
In each of the above-described embodiments, phrase structure grammar (PSG), which uses syntax trees exclusively, is adopted as the grammar described for each word. However, it is also easy to configure the contents of the syntax tree sequence accepting unit for a different form. This form may also be changed for each word.
[0167]
Also, as a matter of course, it is easy to adopt a pushdown automaton instead of a finite automaton as the syntax tree sequence accepting unit. Moreover, by adopting a more general automaton as the syntax tree sequence accepting unit, it is easy to configure the apparatus to accept a grammatical form with higher descriptive power.
[0168]
Of course, the language analysis apparatus according to the present invention can be incorporated into and applied to a translation apparatus, an information retrieval apparatus, a speech recognition apparatus, and the like. Further, the present invention is not limited to natural language analysis, and it is obvious that it can easily be applied to an artificial language (formal language) such as a computer programming language.
[0169]
As described above, the present invention includes various modifications within the scope according to the principle of the present invention.
[0170]
[Effects of the invention]
As described above, according to the present invention, each word holds, in its syntax tree sequence accepting unit, the information on how syntax trees are combined, which substantially governs the grammatical form. The substantial grammatical form that a word follows can therefore be changed flexibly for each word, with the effect that a flexible and concise grammar description can be made according to the behavior of each word.
[0171]
In general, a grammatical form with high descriptive power requires a large amount of computation for acceptance and has many problems from a practical point of view. According to the present invention, however, such an accepting unit is placed, and its processing is actually performed, only for the few words that really need that descriptive power.
[0172]
For many common words, acceptance by lightweight processing is sufficient, and according to the present invention such a configuration can actually be adopted. Compared with a configuration that unifies everything in one grammatical form, the amount of computation can be expected to be greatly reduced.
[0173]
Furthermore, according to the present invention, flexible grammar description that does not depend on a specific grammatical form is possible while, as with context-free grammar, efficient analysis control based on dynamic programming remains possible.
[0174]
The effect of the present invention is due to the separation of the check of the establishment of the syntax tree itself and the check of the acceptance of the syntax tree sequence, and the effect is not obtained by the conventional system.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a natural language processing apparatus according to a first embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration of a dictionary of the natural language processing apparatus according to the first embodiment of the present invention.
FIG. 3 is a flowchart showing processing of the natural language processing apparatus according to the first embodiment of the present invention.
FIG. 4 is a flowchart showing processing of the natural language processing apparatus according to the first embodiment of the present invention.
FIG. 5 is a block diagram showing a configuration of a natural language processing apparatus according to a second embodiment of the present invention.
FIG. 6 is a block diagram showing a configuration of a natural language processing apparatus according to the first embodiment of the present invention.
FIG. 7 is a block diagram showing a configuration of a word dictionary in the natural language processing apparatus according to the first embodiment of the present invention.
FIG. 8 is a diagram illustrating an example of a syntax tree accepted by the natural language processing apparatus according to the first embodiment of this invention.
FIG. 9 is a block diagram showing a configuration of a word dictionary in the natural language processing apparatus according to the first embodiment of the present invention.
FIG. 10 is a diagram showing the contents of a chart in the natural language processing apparatus according to the first embodiment of the present invention.
FIG. 11 is a diagram showing the contents of a chart in the natural language processing apparatus according to the first embodiment of the present invention.
FIG. 12 is a diagram showing the contents of a chart in the natural language processing apparatus according to the first embodiment of the present invention.
FIG. 13 is a diagram showing the contents of a chart in the natural language processing apparatus according to the first embodiment of the present invention.
FIG. 14 is a block diagram showing a configuration of a word dictionary in the natural language processing apparatus according to the second embodiment of the present invention.
FIG. 15 is a diagram illustrating an example of a syntax tree that can be processed by the natural language processing apparatus according to the third embodiment of the present invention.
FIG. 16 is a diagram showing an example of dictionary contents in the natural language processing apparatus according to the third embodiment of the present invention.
FIG. 17 is a diagram for explaining processing of the natural language processing apparatus according to the third embodiment of the present invention.
FIG. 18 is a diagram for explaining processing of the natural language processing apparatus according to the third embodiment of the present invention.
[Explanation of symbols]
1 Input section
2 Dictionary
3 Analysis unit
4 Chart
5 Output section
6 Dictionary loader
7 CPU
8 Storage device
9 Keyboard
10 CRT
21 Syntax tree pool
22 Syntax tree sequence accepting unit
23 Word information
101 Input device
102 Data processing device
103 Output device
104 Storage device
105 Recording medium

Claims

A syntax tree pool storing a set of syntax trees related to a word, and a syntax tree sequence receiving unit describing a sequence of syntax trees belonging to the syntax tree pool in an automaton;
A word dictionary having word information including word spelling and syntactic properties is stored in the storage means for each word;
In proceeding with the analysis of the original text input from the input means,
(A) dividing the original sentence into words, loading the contents of the word dictionary of the divided words from the dictionary, and registering them as an edge in a chart;
(B) determining a target edge from the edges registered in the chart;
(C) checking the state of the automaton corresponding to the target edge determined in step (b), with reference to the syntax tree sequence accepting unit in the target edge, and selecting a syntax tree that can be accepted in a transition from the current state of the automaton to a next state;
(D) determining whether the syntax tree selected in step (c) holds on the chart;
(E) if it is determined in step (d) that the syntax tree holds, accepting the syntax tree, performing the state transition of the automaton, creating a new edge corresponding to the automaton after the state transition, and registering the new edge in the chart;
(F) outputting the language analysis result of the original text by analyzing the contents of the chart;
A language analysis method comprising:

Input means for inputting original text, a dictionary storing word information, analysis means for analyzing the input original text while referring to the dictionary, storage means for holding a chart for registering partial analysis results by the analysis means, and output means for outputting the analysis result,
In the storage means, a syntax tree pool storing a set of syntax trees related to words, a syntax tree sequence receiving unit describing a sequence of syntax trees belonging to the syntax tree pool in an automaton,
A dictionary having a word dictionary for each word with word information including word spelling and syntactic properties,
When the analysis unit proceeds with the analysis process of the original text input from the input unit,
(A) a function of dividing the original text into words, loading the contents of the word dictionary of the divided words from the dictionary, and registering them as an edge in a chart;
(B) a function for determining an edge of interest from among the edges registered in the chart;
(C) a function of checking the state of the automaton corresponding to the target edge determined by the function of (b), with reference to the syntax tree sequence accepting unit in the target edge, and selecting a syntax tree that can be accepted in a transition from the current state of the automaton to a next state;
(D) a function of determining whether or not the syntax tree selected by the function of (c) is established on the chart;
(E) a function of, when it is determined by the function of (d) that the syntax tree holds, accepting the syntax tree, performing the state transition of the automaton, creating a new edge corresponding to the automaton after the state transition, and registering the new edge in the chart;
(F) a function of outputting a language analysis result of the original text by analyzing the contents of the chart;
A language analyzer characterized by comprising:

A syntax tree pool storing a set of syntax trees related to words, and a syntax tree sequence accepting unit describing a sequence of syntax trees belonging to the syntax tree pool in an automaton;
A dictionary having a word dictionary for each word having word information including spelling and syntactic properties of the word is stored in the storage means of the computer,
In causing the computer to analyze the original text input from input means of the computer,
(A) dividing the original sentence into words, loading the contents of the word dictionary of the divided words from the dictionary, and registering them as an edge in a chart;
(B) determining a target edge from the edges registered in the chart;
(C) checking the state of the automaton corresponding to the target edge determined in step (b), with reference to the syntax tree sequence accepting unit in the target edge, and selecting a syntax tree that can be accepted in a transition from the current state of the automaton to a next state;
(D) determining whether the syntax tree selected in step (c) holds on the chart;
(E) if it is determined in step (d) that the syntax tree holds, accepting the syntax tree, performing the state transition of the automaton, creating a new edge corresponding to the automaton after the state transition, and registering the new edge in the chart;
(F) outputting the language analysis result of the original text by analyzing the contents of the chart;
A recording medium on which is recorded a program for causing the computer to execute the above processes (a) to (f).