JP2004094434A - Language processing method, its program, and its device

Info

Publication number: JP2004094434A
Application number: JP2002252475A
Authority: JP
Inventors: Koji Tsukamoto; 塚本　浩司; Manabu Satsusano; 颯々野　学
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2002-08-30
Filing date: 2002-08-30
Publication date: 2004-03-25

Abstract

PROBLEM TO BE SOLVED: To tag an unknown document with high precision by learning the relationship between a correct tag and the features of a target word and of the peripheral words before and after it.

SOLUTION: A learning processing step generates a plurality of learning results from a set of pairs, each pair consisting of (a) the features of a target word and of the peripheral words on both sides of it in tagged correct document data and (b) the correct tag. A decision processing step uses the plurality of learning results to repeatedly decide a predicted tag for each target word in the unknown document data to be tagged, from the features of the target word and of the peripheral words on both sides of it, and thereby decides the tag of each target word.

COPYRIGHT: (C)2004, JPO

Description

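[0001]
[Technical Field of the Invention]
The present invention relates to a language processing method, program, and device that automatically tag an unknown document by using the results of learning from a tagged correct document, and in particular to a language processing method, program, and device that tag an unknown document by using learning results based on the features of target words and their peripheral words in the correct document.
[0002]
[Prior Art]
In recent years, language processing that assigns tags serving as some kind of index to the words in a document has been widely researched and developed, and systems that apply such processing to tagged documents, such as machine translation, information retrieval, question answering, and knowledge discovery, are expected to be put to practical use.
[0003]
Various methods have been proposed for learning rules (learning results) from tagged correct document data and using the learned rules to tag an unknown document automatically. Examples include the following.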
[0004]
(1) D. M. Bikel, S. Miller, R. Schwartz and R. Weischedel. (1997) Nymble: A High-Performance Learning Name-Finder. In 5th Conference on Applied Natural Language Processing. This work treats proper-noun extraction as a tagging problem and performs the tagging by training a hidden Markov model.
[0005]
(2) A. Borthwick, J. Sterling, E. Agichtein and R. Grishman. (1998) Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition. In Proceedings of the 6th Workshop on Very Large Corpora. This work treats the same problem as (1) as learning by the maximum entropy method, using the outputs of multiple systems as features.
[0006]
(3) M. Collins and Y. Singer. (1999) Unsupervised Models for Named Entity Classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. This work improves accuracy by using untagged raw data.
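[0007]
These methods decide a tag from features such as the surface form of the target word and of the surrounding words, but they do not use how the surrounding words were tagged.
[0008]
The tags assigned to the surrounding words are an important clue to the tag that the target word should receive. Consider tagging the word "山口" (Yamaguchi), which may take a person-name tag in one context and a place-name tag in another. If the next word is, for example, "太郎" (Taro) and is known to receive a person-name tag, "山口" is probably a person name. If, on the other hand, a word sequence such as "山口県" (Yamaguchi Prefecture) appears before "山口" and is known to receive a place-name tag, that is a clue that "山口" is a place name.
[0009]
[Problems to Be Solved by the Invention]
Although the tags assigned to the surrounding words are an important clue for deciding the tag of the target word, the surrounding tags of an unknown document are normally not known and therefore cannot be used.
[0010]
The following is known as one technique for overcoming this problem.
(4) Hiroyasu Yamada, Taku Kudo and Yuji Matsumoto. (2001) "Japanese Named Entity Extraction Using Support Vector Machines," IPSJ SIG Notes on Natural Language Processing, 142-17.
[0011]
This Japanese named entity extraction uses a technique called dynamic features so that the surrounding tags can serve as features for deciding the target word. During learning, the correct-data tags on the left (or right) side of the target word are used as the surrounding tags. For an unknown document, no surrounding tag is used for the leftmost (or rightmost) word, and for every other word the results already decided by the learner are used as the surrounding tags.
[0012]
The dynamic-feature method is known to be effective, but its accuracy is still insufficient. There are two causes, both concerning how the surrounding tags are handled.
[0013]
First, only the tags on one side can be used as surrounding tags. Second, the method is inconsistent: correct data are used as the surrounding tags during learning, whereas predicted data, which may or may not be correct, are used during decision.
[0014]
An object of the present invention is to provide a language processing method, program, and device that tag an unknown document with high accuracy by learning the relationship between the correct tag and the features of the target word and of the peripheral words before and after it.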
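[0015]
[Means for Solving the Problems]
FIG. 1 illustrates the principle of the present invention. The language processing method of the invention comprises: a learning processing step in which a learning processing unit 1 generates a plurality of learning results from a set of pairs of (a) the features of a target word and of the peripheral words on both sides of it in tagged correct document data and (b) the correct tag; and a decision processing step in which a decision processing unit 2 uses the plurality of learning results to decide the tag of each target word from the features of the target word and of the peripheral words on both sides of it in the unknown document data to be tagged.
[0016]
The learning processing step comprises:
an input step in which an input device receives tagged correct document data from outside;
a data conversion step in which a data converter converts the correct document data into learning data in the format handled by a learner;
a first learning step in which the learner uses the set of pairs of target-word and peripheral-word features and correct tags in the learning data to generate a first learning result 24-1 that decides a predicted tag from the features;
a first decision step in which a decision unit uses the first learning result 24-1 to decide a predicted tag for each target word from the target-word and peripheral-word features in the learning data;
a data reconstruction step in which a data reconstruction device reconstructs the learning data by adding the predicted tags decided in the first decision step to its features;
a second learning step in which the learner uses the set of pairs of features, now including the predicted tags, and correct tags in the reconstructed learning data to generate a second learning result 24-2 that decides a predicted tag from the features;
a second decision step in which the decision unit uses the second learning result 24-2 to decide the tag of each target word from the features, including the predicted tags, in the learning data; and
a learning iteration step that repeats the data reconstruction step, the second learning step, and the second decision step a plurality of times to generate the further learning results 24-3 to 24-N.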
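The learning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `MemorizingLearner` is a hypothetical stand-in for the learner (it simply memorizes the most frequent tag per feature tuple), and the names `window_features` and `train_iteratively` are assumptions introduced here. It shows only the control flow: learn, re-tag the learning data, fold the predicted tags of both neighbours back into the features, and learn again.

```python
from collections import Counter, defaultdict

UNKNOWN = "?"  # placeholder used before any prediction exists


def window_features(words, pred_tags, i):
    """Target word plus the words and predicted tags on both sides of it."""
    left_word = words[i - 1] if i > 0 else "<s>"
    right_word = words[i + 1] if i + 1 < len(words) else "</s>"
    left_tag = pred_tags[i - 1] if i > 0 else UNKNOWN
    right_tag = pred_tags[i + 1] if i + 1 < len(words) else UNKNOWN
    return (words[i], left_word, right_word, left_tag, right_tag)


class MemorizingLearner:
    """Toy learner (assumption): most frequent tag per feature tuple,
    backing off to the target word alone, then to the overall majority tag."""

    def fit(self, feats, tags):
        self.by_feat = defaultdict(Counter)
        self.by_word = defaultdict(Counter)
        self.default = Counter(tags).most_common(1)[0][0]
        for f, t in zip(feats, tags):
            self.by_feat[f][t] += 1
            self.by_word[f[0]][t] += 1
        return self

    def predict(self, f):
        for table, key in ((self.by_feat, f), (self.by_word, f[0])):
            if key in table:
                return table[key].most_common(1)[0][0]
        return self.default


def train_iteratively(words, gold_tags, n_rounds):
    """Return learning results 1..N, each trained on the previous round's tags."""
    pred = [UNKNOWN] * len(words)          # predicted tags start as "?"
    results = []
    for _ in range(n_rounds):
        feats = [window_features(words, pred, i) for i in range(len(words))]
        model = MemorizingLearner().fit(feats, gold_tags)   # learning step
        results.append(model)
        pred = [model.predict(f) for f in feats]            # decision step
        # the next pass rebuilds the features from `pred` (data reconstruction)
    return results
```

Because each learning result after the first is trained on predicted rather than correct neighbour tags, the features seen during learning match the kind of features available at decision time, which is the inconsistency the text attributes to the dynamic-feature method.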
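[0017]
The decision processing step, which uses the learning results generated in this learning processing step, comprises:
an input step in which the input device receives unknown document data to be tagged from outside;
a data conversion step in which the data converter converts the unknown document data into test data in the format handled by the learner;
a third decision step in which the decision unit uses the first learning result 24-1 to decide a predicted tag for each target word from the target-word and peripheral-word features in the test data;
a second data reconstruction step in which the data reconstruction device reconstructs the test data by adding the predicted tags decided in the third decision step to its features;
a fourth decision step in which the decision unit uses the second learning result 24-2 to decide the tag of each target word from the features, including the predicted tags, in the test data reconstructed in the second data reconstruction step;
a decision iteration step that repeats the second data reconstruction step and the fourth decision step a plurality of times, switching through the learning results 24-3 to 24-N that follow the second learning result, to decide the tags of the unknown document data; and
an output step in which an output device converts the unknown document data back into document data carrying the decided tags and outputs it.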
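The decision steps above amount to applying the stored learning results one after another, each time rebuilding the features from the tags of the previous pass. The sketch below assumes that each learning result can be represented as a plain function from a feature dictionary to a tag. The two hand-written rules `first_result` and `second_result` are hypothetical stand-ins for learned results, and reusing `second_result` as the third result is done only to keep the example short.

```python
UNKNOWN = "?"


def neighbour_features(words, tags, i):
    """Target word plus the predicted tags of the words on both sides."""
    return {
        "word": words[i],
        "prev_tag": tags[i - 1] if i > 0 else UNKNOWN,
        "next_tag": tags[i + 1] if i + 1 < len(words) else UNKNOWN,
    }


def tag_iteratively(words, models):
    """Apply learning results 1..N in order, feeding each pass's tags to the next."""
    tags = [UNKNOWN] * len(words)
    for model in models:
        # data reconstruction: features are rebuilt from the previous tags
        feats = [neighbour_features(words, tags, i) for i in range(len(words))]
        tags = [model(f) for f in feats]
    return tags


def first_result(f):
    # Hypothetical first rule: no neighbour tags are known yet.
    return "I-ORG" if f["word"] == "協会" else "O"


def second_result(f):
    # Hypothetical later rule: a content word followed by I-ORG is also I-ORG.
    if f["next_tag"] == "I-ORG" and f["word"] not in {"の", "は"}:
        return "I-ORG"
    return first_result(f)


words = ["日本", "放送", "協会", "会長", "の", "XX"]
tags = tag_iteratively(words, [first_result, second_result, second_result])
```

In this toy run the "I-ORG" tag of "協会" reaches "放送" in the second pass and "日本" in the third, which illustrates why more than two passes can help when the clue is several words away.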
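[0018]
With this language processing method, nothing is known about the surrounding predicted tags in the first pass, but from the second pass onward the surrounding tags become useful features whenever they are correct. As correct surrounding tags raise the accuracy of the predicted tags, decisions can in turn be made from more accurate surrounding tags, so the tagging accuracy improves.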
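[0019]
In another form of the invention, two learning results are generated from the correct document, one that does not use the tags of the peripheral words as features and one that does, so that the unknown document can be tagged in a simplified way.
[0020]
In this case, the learning processing step comprises:
an input step in which the input device receives tagged correct document data from outside;
a data conversion step in which the data converter converts the correct document data into learning data in a format, handled by the learner, that contains no predicted tags;
a first learning step in which the learner uses the set of pairs of target-word and peripheral-word features and correct tags in the learning data to generate a first learning result that decides a predicted tag from the features;
a data conversion step in which the data converter converts the correct document data into learning data in a format, handled by the learner, that contains predicted tags; and
a second learning step in which the learner uses the set of pairs of (a) the features of the target word and of the peripheral words, with the correct tags of the peripheral words included as their predicted tags, and (b) the correct tag to generate a second learning result that decides a predicted tag from the features.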
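[0021]
To tag an unknown document with the first and second learning results generated in this learning processing step, the decision processing step comprises:
an input step in which the input device receives unknown document data to be tagged from outside;
a data conversion step in which the data converter converts the unknown document data into test data in the format handled by the learner;
a first decision step in which the decision unit uses the first learning result to decide a predicted tag for each target word from the target-word and peripheral-word features in the test data;
a data reconstruction step in which the data reconstruction device reconstructs the test data by adding the predicted tags decided in the first decision step to its features;
a second decision step in which the decision unit uses the second learning result to decide the tag of each target word from the features, including the predicted tags, in the reconstructed test data;
a decision iteration step that repeats the data reconstruction step and the second decision step a plurality of times with the second learning result to decide the tags of the unknown document data; and
an output step in which the output device converts the unknown document data back into document data carrying the decided tags and outputs it.
[0022]
In this simplified language processing method, nothing is known about the surrounding predicted tags in the first pass, so they are not used, and the decision uses a learning result trained without predicted tags. From the second pass onward, a learning result is used that was trained on the assumption that the previously decided surrounding tags are correct. As correct surrounding tags raise the accuracy of the predicted tags, decisions can be made from more accurate surrounding tags, so the tagging accuracy improves.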
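A rough sketch of this simplified variant follows, under stated assumptions. The lookup-table learner `fit_table` and the helper names are hypothetical, and for brevity the first learning result here looks only at the target word, whereas the text also uses peripheral-word features. What the sketch does follow from the text is the training of the second result on the correct tags of both neighbours and its repeated application at decision time.

```python
from collections import Counter, defaultdict


def fit_table(rows):
    """Toy learner (assumption): most frequent tag per exact feature tuple."""
    table = defaultdict(Counter)
    for feats, tag in rows:
        table[feats][tag] += 1

    def predict(feats, default):
        # Fall back to `default` for feature tuples never seen in training.
        return table[feats].most_common(1)[0][0] if feats in table else default

    return predict


def neighbours(seq, i, pad="?"):
    """Items on both sides of position i, padded at the sentence edges."""
    return (seq[i - 1] if i > 0 else pad,
            seq[i + 1] if i + 1 < len(seq) else pad)


def train_simplified(words, gold):
    # First result: no tag features at all.
    r1 = fit_table(((w,), t) for w, t in zip(words, gold))
    # Second result: correct neighbour tags stand in for predicted tags.
    r2 = fit_table(((w,) + neighbours(gold, i), gold[i])
                   for i, w in enumerate(words))
    return r1, r2


def tag_simplified(words, r1, r2, n_rounds):
    tags = [r1((w,), "O") for w in words]              # first pass
    for _ in range(n_rounds - 1):                      # later passes reuse r2
        tags = [r2((w,) + neighbours(tags, i), tags[i])
                for i, w in enumerate(words)]
    return tags


train_words = ["山口", "太郎", "は", "山口", "県", "に", "住む"]
train_gold = ["I-PER", "I-PER", "O", "I-LOC", "I-LOC", "O", "O"]
r1, r2 = train_simplified(train_words, train_gold)
result = tag_simplified(["彼", "は", "山口", "県", "に", "行く"], r1, r2, 3)
```

The design trade-off is the one the text describes: only two learning results have to be trained and stored, at the cost of training the second one on correct neighbour tags rather than on the predicted tags it will actually see.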
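[0023]
The learning processing step and the decision processing step each include a morphological analysis step that analyzes the correct document data or unknown document data received by the input device and identifies its morphemes. In this case, the data conversion step converts the correct document data or unknown document data, together with the identified morphemes, into data in the format handled by the learner.
[0024]
Such morphological analysis is suited to tagging languages, such as Japanese, in which words are not separated by spaces. It is unnecessary for languages, such as English, in which words are separated by spaces.
[0025]
The learning processing step and the decision processing step also include an external tagging step that passes the correct document data or unknown document data received by the input device to an external tagging system and obtains its tagging result. In this case, the data converter converts the correct document data or unknown document data, together with the tagging result obtained from the tagging system, into data in the format handled by the learner.
[0026]
The present invention provides a program for language processing. The program causes a computer to execute: a learning processing step that generates a plurality of learning results from a set of pairs of (a) the features of a target word and of the peripheral words on both sides of it in tagged correct document data and (b) the correct tag; and a decision processing step that uses the plurality of learning results to decide the tag of each target word from the features of the target word and of the peripheral words on both sides of it in the unknown document data to be tagged.
[0027]
The present invention also provides a language processing device comprising: a learning processing unit that generates a plurality of learning results from a set of pairs of target-word and peripheral-word features and correct tags in tagged correct document data; and a decision processing unit that uses the plurality of learning results to decide the tag of each target word from the target-word and peripheral-word features in the unknown document data to be tagged.
[0028]
The details of the program and the device are basically the same as those of the language processing method.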
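[0029]
[Embodiments of the Invention]
FIG. 2 is a block diagram of an embodiment of the language processing device according to the invention. The device functions both as a learning processing unit and as a decision processing unit. As a learning processing unit, it generates a plurality of learning results from the set of pairs of target-word and peripheral-word features and correct tags in tagged correct document data. As a decision processing unit, it uses those learning results to decide the tag of each target word from the target-word and peripheral-word features in the unknown document data to be tagged.
[0030]
FIG. 2 shows, as blocks, the functional configuration of the device as a learning processing unit for Japanese documents. In this role the device consists of an input/output device 10, a morphological analyzer 11, a data converter 12, a learner 14, a learning controller 16, a decision unit 18, a decision controller 20, a learning result database 22, and a data reconstruction device 26.
[0031]
As a decision processing unit, the device has the same configuration without the learner 14 and the learning controller 16, as shown in FIG. 10.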
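[0032]
The language processing device of FIG. 2 is implemented, for example, in the computer hardware environment shown in FIG. 3. In this computer, a RAM 402, a hard disk controller (software) 404, a floppy disk driver (software) 410, a CD-ROM driver (software) 414, a mouse controller 418, a keyboard controller 422, a display controller 426, and a communication board 430 are connected to a bus 401 of a CPU 400.
[0033]
The hard disk controller 404 is connected to a hard disk drive 406, on which the application program that executes the language processing of the invention is loaded. When the computer starts, the necessary programs are read from the hard disk drive 406, expanded in the RAM 402, and executed by the CPU 400.
[0034]
The floppy disk driver 410 is connected to a floppy disk drive (hardware) 412 and can read from and write to a floppy disk (R). The CD-ROM driver 414 is connected to a CD drive (hardware) 416 and can read data and programs stored on a CD.
[0035]
The mouse controller 418 passes input operations of a mouse 420 to the CPU 400. The keyboard controller 422 passes input operations of a keyboard 424 to the CPU 400. The display controller 426 drives a display unit 428. The communication board 430 uses a communication line 432, which may be wireless, to communicate with other computers and servers over a network such as the Internet.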
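[0036]
The principle of the learning processing unit of the device of FIG. 2 is explained next, taking as an example the automatic extraction of proper nouns from a Japanese document. The example text is "前日に開催された通常国会で日本放送協会会長のＸＸ氏は…" ("At the ordinary Diet session held the previous day, Mr. XX, chairman of the Japan Broadcasting Corporation, ...").
[0037]
When the morphological analyzer 11 of FIG. 2 splits this text into words, the result is, for example, the document data 28 of FIG. 4. The document data 28 contains three proper nouns: the proper name 28-1 "通常国会", the organization name 28-2 "日本放送協会", and the person name 28-3 "ＸＸ".
[0038]
Extracting the proper nouns from the document data 28 of FIG. 4 can be defined as the tagging problem shown in FIG. 5, which shows correct document data 30 in which a correct tag 34 is attached to each word 32. In the present invention, correct document data 30 of this kind is prepared and used for learning processing so that the words of an unknown document can be tagged automatically.
[0039]
The learning processing generates a learning result (learned rule) from the correct document data 30, in which the features of each word 32 are associated with its correct tag 34. The generated learning result can predict a tag from the features of a word.
[0040]
For the word "放送", for example, the correct document data 30 uses as features the word itself, the words before and after it, their character counts, and so on, and specifies "I-ORG" as the correct tag. Learning uses the set of such pairs of target-word and peripheral-word features and correct tags to generate a learning result that can decide a tag from the features.
[0041]
If the tags of the peripheral words could be used as one of the features, they would be expected to help predict the tag of the target word, partly because tags denoting proper nouns tend to occur consecutively. In the unknown document data 36 of FIG. 6, for example, if the correct tags 42 of the words 38 were known, the predicted tag 40 of a target word could be predicted effectively from the tags surrounding it.
[0042]
For example, "日本", "放送", and "協会" among the words 38 all have the tag "I-ORG". Knowing that "協会" is tagged "I-ORG" is an important clue that the preceding "日本" and "放送" are also "I-ORG".
[0043]
The goal of the device, however, is to predict unknown tags, so the correct tags 42 of the unknown document data 36 of FIG. 6 are not available and the tags surrounding a target word are unknown. The language processing method of the invention therefore uses, as the surrounding tags, tags assigned by the learning results obtained from the correct document data.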
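[0044]
FIGS. 7 and 8 illustrate the learning procedure executed by the learning processing unit of FIG. 2. The procedure is described below with reference to the configuration of FIG. 2.
[0045]
In FIG. 7, the input function of the input/output device 10 first receives the correct document data 30, which has, for example, the content shown in FIG. 5. The data converter 12 then performs data conversion 44, which converts the correct document data 30 into learning data 46 in a format that the learner 14 can handle.
[0046]
The learning data 46 consists of words 48, features 50, predicted tags 52, and correct tags 54. In this embodiment the features 50 are "surface form", "word length", "character type", and "part of speech". Each predicted tag 52 is initially "?".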
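One possible in-memory layout for a row of the learning data 46 is sketched below. The class name `Row`, the helper `char_type`, and the way the character type is derived from Unicode character names are assumptions made for illustration; the part of speech is taken as given, since in the text it comes from the morphological analyzer.

```python
import unicodedata
from dataclasses import dataclass


def char_type(word):
    """Coarse script class of a word, judged from its first character (assumption)."""
    name = unicodedata.name(word[0], "")
    if "CJK UNIFIED" in name:
        return "kanji"
    if "HIRAGANA" in name:
        return "hiragana"
    if "KATAKANA" in name:
        return "katakana"
    return "alnum" if word[0].isalnum() else "other"


@dataclass
class Row:
    word: str                  # surface form
    length: int                # word length in characters
    char_type: str             # kanji / hiragana / katakana / alnum / other
    pos: str                   # part of speech from the morphological analyzer
    predicted_tag: str = "?"   # unknown until the first decision pass
    correct_tag: str = ""      # present only in learning data


def to_learning_data(tagged_words):
    """tagged_words: iterable of (word, part_of_speech, correct_tag) triples."""
    return [Row(w, len(w), char_type(w), pos, "?", tag)
            for w, pos, tag in tagged_words]


rows = to_learning_data([
    ("日本", "noun", "I-ORG"),
    ("放送", "noun", "I-ORG"),
    ("の", "particle", "O"),
    ("XX", "noun", "I-PER"),
])
```

Test data for an unknown document would use the same row without a correct tag, which is why `correct_tag` has an empty default here.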
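[0047]
Next, the learner 14, under the control of the learning controller 16, uses the set of pairs of target-word and peripheral-word features and correct tags in the learning data 46 to generate a learning result 24-1 (the first learning result, also called a classification rule) that decides a predicted tag from the features.
[0048]
In this learning processing 60, taking "放送" as the target word, the features 50 of the target word and of the peripheral words "日本" and "協会" form feature data 56, which is paired with the tag 58 of the target word, "I-ORG", and set in the learner 14. The learner generates, as the learning result, a rule that yields the correct tag "I-ORG" when the feature data 56 is input.
[0049]
This learning is repeated with each of the second to fifth words, "放送", "協会", "会長", and "の", as the target word, that is, every word except the first word "日本" and the last word "ＸＸ", using all pairs of peripheral-word features and correct tags 54, to generate the learning result 24-1. The learning result 24-1 is stored in the learning result database 22.
[0050]
Next, the decision unit 18, under the control of the decision controller 20, uses the learning result 24-1 generated in the learning processing 60 to decide the predicted tag 52 of each target word from the target-word and peripheral-word features in the learning data 46.
[0051]
In the decision processing 62 of FIG. 7, the second word "放送" is the target word; feature data 56 including the peripheral words "日本" and "協会" is input to the decision unit 18 and decided with the learning result 24-1 to obtain decision data 64. In this case the predicted tag of "放送" is "O". The decision processing 62 is performed for every target word between the first and last words, which yields all the predicted tags 52 of the learning data 46.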
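[0052]
The data reconstruction device 26 then performs the data reconstruction 66 of FIG. 8, which stores the predicted tags decided in the decision processing 62 of FIG. 7 as updates in the predicted tags 52 of the learning data and thereby generates reconstructed learning data 68.
[0053]
Next, the learner 14 uses the set of pairs of features, including the predicted tags 52, and correct tags 54 in the reconstructed learning data 68 to generate a learning result 24-2 (the second learning result) that decides a predicted tag from the features.
[0054]
For the target word "放送", for example, learning processing 72 uses feature data 70, in which the features of the peripheral words "日本" and "協会" include the predicted tags updated by the decision processing 62 of FIG. 7, together with the correct tag 58 of the target word, "I-ORG". The learning processing 72 is repeated in the same way for each of the second to fifth target words of the reconstructed learning data 68, that is, all but the first and last words, to generate the learning result 24-2.
[0055]
Next, the decision unit 18 uses the learning result 24-2 generated in the learning processing 72 to decide the tag of each target word from the features, including the predicted tags, in the reconstructed learning data 68. In the example shown, the feature data 70 of the target word "放送", shown paired with its correct tag 58, "I-ORG", is input to the decision unit 18, and decision data 76 shows the predicted tag decided with the learning result 24-2. This decision processing 74 is performed in the same way for all of the second to fifth words of the reconstructed learning data 68.
[0056]
When the decision processing 74 is finished, decision step 78 checks whether the learning count i is smaller than a predetermined number of learning rounds. If it is, processing returns to the data reconstruction 66, which writes the predicted tags of the decision data 76 into the predicted tags 52 of the reconstructed learning data 68. The third learning processing 72 is then performed on the updated reconstructed learning data 68 to generate a learning result 24-3, and the decision processing 74 uses the learning result 24-3 to decide the predicted tags of the decision data 76 from the reconstructed learning data 68.
[0057]
Data reconstruction with the predicted tags, learning from the reconstructed learning data, and decision with the resulting learning result are repeated until the learning count i reaches N. The learning results 24-1 to 24-N are thus stored in the learning result database 22, and learning ends when decision step 78 finds that the count has reached N.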
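[0058]
FIG. 9 is a flowchart of the learning processing of FIGS. 7 and 8 performed by the device of FIG. 2; it represents the processing of the learning program of the invention.
[0059]
In FIG. 9, step S1 receives the correct document data through the input/output device 10, and step S2 has the data converter 12 convert it into a set of features and tags, with every predicted tag set to "?", to generate the first learning data.
[0060]
In step S3, the learner 14 generates the first learning result from the set of pairs of target-word and peripheral-word features and correct tags in the learning data. In step S4, the decision unit 18 uses the first learning result to predict the tag of each word from the target-word and peripheral-word features in the learning data. Steps S1 to S4 correspond to FIG. 7.
[0061]
In step S5, the data reconstruction device 26 replaces the predicted tags of the learning data with those obtained in step S4 to generate new learning data, that is, reconstructed learning data. In step S6, the learner 14 generates the i-th learning result from the set of pairs of features, including the predicted tags, and correct tags in the i-th learning data. Here i = 2.
[0062]
Step S7 checks whether i is smaller than the predetermined number N. If it is, step S8 increments i by one and the processing from step S5 is repeated until i equals N in step S7. Steps S5 to S8 correspond to FIG. 8.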
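[0063]
FIG. 10 is a block diagram of the decision processing unit of the device, which automatically tags unknown document data by using the learning results 24-1 to 24-N that the learning processing unit of FIG. 2 stored in the learning result database 22. It is the configuration of FIG. 2 without the learning controller 16 and the learner 14.
[0064]
FIGS. 11 and 12 illustrate the procedure of the decision processing unit of FIG. 10, described below with reference to that configuration.
[0065]
In FIG. 11, the input function of the input/output device 10 first receives unknown document data 80, for example the document data 28 of FIG. 4. The data converter 12 then performs data conversion 82.
[0066]
The data conversion 82 converts the unknown document data 80 into test data 84 in the format handled by the learner 14. The test data 84 consists of words 86, features 88, and predicted tags 90; each predicted tag 90 is initially "?".
[0067]
Next, the decision unit 18 uses the first learning result 24-1 in the learning result database 22 to decide the predicted tag of each target word from the target-word and peripheral-word features in the test data 84. Specifically, for each of the second to fifth words of the test data 84, for example the second word "放送", feature data 92 including the peripheral words "日本" and "協会" is input to the decision unit 18, and the learning result 24-1 is applied to decide the predicted tag shown in decision data 96. In this case the predicted tag of "放送" is "O".
[0068]
Moving to FIG. 12, the data reconstruction device 26 updates the predicted tags 90 so that the predicted tags decided in the decision processing 94 of FIG. 11 become part of the features of the test data, and generates reconstructed test data 100.
[0069]
The decision unit 18 then takes the second learning result 24-2 from the learning result database 22 and decides the tag of each target word from the features, including the predicted tags, in the reconstructed test data 100. The decision processing 104 is illustrated for the target word "放送": feature data 102 including its peripheral words is input to the decision unit 18, and decision data 106 containing the decided predicted tag is obtained.
[0070]
When the decision processing 104 has decided the predicted tags from the reconstructed test data 100, decision step 108 checks whether the decision count i is less than the predetermined number N. If it is, processing returns to data reconstruction 98, which writes the predicted tags decided in the decision processing 104 into the predicted tags 90 of the reconstructed test data 100. The decision processing 104 is then applied to the reconstructed test data 100 with the third learning result 24-3 to decide the predicted tags as decision data 106.
[0071]
Data reconstruction and decision with the learning result for the current round are repeated N times, which completes the decision processing. The predicted tags 90 of the reconstructed test data 100 are then fixed as the final tags, and the data converter 12 performs the inverse conversion that attaches the fixed tags to the words of the input unknown document data, so that the tagged document data is output from the input/output device 10.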
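[0072]
FIG. 13 shows the unknown document data 80 with the decision data obtained for the target word "放送" in the decision processing 94 of FIG. 11. At this first decision stage the predicted tag 88 differs from the correct tag 112, but repeating the decision processing with the second and subsequent learning results 24-2 to 24-N, as shown in FIG. 12, updates the predicted tag 88 of "放送" to "I-ORG", which matches the correct tag 112.
[0073]
FIG. 14 is a flowchart of the decision processing of FIGS. 11 and 12; it represents the decision program that decides tags with the learning results 24-1 to 24-N obtained by the learning program of the flowchart of FIG. 9.
[0074]
In FIG. 14, step S1 receives the unknown document data through the input/output device 10, and step S2 has the data converter 12 convert it into a set of features and tags, with every predicted tag set to "?", to generate the first test data. In step S3, the decision unit 18 uses the first learning result to predict the tag of each word from the target-word and peripheral-word features in the test data. Steps S1 to S3 correspond to FIG. 11.
[0075]
In step S4, the data reconstruction device 26 replaces the predicted tags of the test data with those obtained by the decision unit 18 to generate new test data, that is, reconstructed test data. In step S5, the decision unit 18 uses the i-th learning result, here the second, to predict the tag of each word from the features, including the predicted tags, in the reconstructed test data.
[0076]
If the processing count i is less than the predetermined value N in step S6, step S7 increments i by one and the processing from step S4 is repeated. Steps S4 to S7 correspond to FIG. 12. Once the tags have been decided with the N-th learning result, i equals N in step S6 and the decision processing ends. The data converter 12 then performs the inverse conversion that attaches the decided tags to the unknown document, and the output function of the input/output device 10 outputs the tagged document.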
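[0077]
FIG. 15 shows another embodiment of the learning processing unit, in which the correct document data is an English document. An English document does not need the morphological analysis that splits Japanese text into words, so the morphological analyzer 11 of FIG. 2 is omitted; the rest of the configuration is the same.
[0078]
FIGS. 16 and 17 illustrate the learning processing that the learning processing unit of FIG. 15 performs on English correct document data.
[0079]
In FIG. 16, the input function of the input/output device 10 receives English correct document data 114, and the data converter 12 converts it into learning data 118 that the learner 14 can handle. The learning data 118 consists of words 120, features 122, predicted tags 124, and correct tags 125. The features 122 are "surface form", "word length", and "capitalization"; the "part of speech" feature used for the Japanese correct document data of FIG. 4 is omitted.
[0080]
The learner 14 performs learning processing 130 on the learning data 118 to generate the learning result 24-1. In decision processing 132, the decision unit 18 then receives, for each target word of the learning data 118, feature data 128 that includes the peripheral words, as for the target word "Reserve", and uses the learning result 24-1 to decide the predicted tag of the target word as decision data 134.
[0081]
The data reconstruction device 26 then performs the data reconstruction 136 of FIG. 17: the predicted tags decided in the decision processing 132 of FIG. 16 are written into the learning data to generate updated reconstructed learning data 138.
[0082]
The learner 14 then performs learning processing 152, which includes the decided predicted tags 144 in the feature data 148 of each target word of the reconstructed learning data 138 and learns from the set of pairs of this feature data and the correct tags 146 to generate the second learning result 24-2.
[0083]
Using the second learning result 24-2, the decision unit 18 receives the features 142 of the reconstructed learning data 138 and performs decision processing 154 to decide the predicted tags as decision data 156. This is repeated until decision step 158 finds that the processing count i has reached the predetermined number N, so that the learning results 24-1 to 24-N are stored in the learning result database 22.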
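[0084]
FIG. 18 is a block diagram of the decision processing that tags each word of English unknown document data with the learning results 24-1 to 24-N stored in the learning result database 22 by the learning processing unit of FIG. 15. The learner 14 and the learning controller 16 are omitted; the rest of the configuration is the same.
[0085]
FIGS. 19 and 20 illustrate the procedure by which the decision processing of FIG. 18 tags English unknown document data.
[0086]
In FIG. 19, the input function of the input/output device 10 receives English unknown document data 160, and the data converter 12 performs data conversion 162 into test data 164 in a format that the learner 14 can process. The decision unit 18 then uses the first learning result 24-1 to decide the predicted tags from the test data 164 and generates decision data 176.
[0087]
Moving to FIG. 20, the data reconstruction device 26 performs data reconstruction 178 and generates reconstructed test data 180 in which the predicted tags 186 have been updated with the predicted tags decided in the decision processing 174 of FIG. 19.
[0088]
The decision unit 18 then receives the target-word and peripheral-word features, including the predicted tags 186, of the reconstructed test data 180 and applies the second learning result 24-2 to decide the predicted tag of each target word, as in decision data 192.
[0089]
The data reconstruction 178 and decision processing 190 are repeated for the remaining learning results 24-3 to 24-N until i equals N in decision step 194, which fixes the predicted tag of each word. The data converter 12 then performs the inverse conversion that attaches the decided tags to the unknown English document data, which is output externally.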
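[0090]
FIG. 21 shows another embodiment of the learning processing unit, characterized in that the learning results are generated from the correct document data in a simplified way.
[0091]
The learning processing unit of FIG. 21 is the configuration of FIG. 2 without the decision unit 18, the decision controller 20, and the data reconstruction device 26. For the correct document data, it first generates a learning result 240-1 from the features of the target word and peripheral words and the correct tag of the target word, without using any peripheral-word tags. It then generates a learning result 240-2 from the set of the features of the target word and of the peripheral words, with the correct tags of the peripheral words used as their predicted tags, and the correct tag of the target word.
[0092]
FIGS. 22 and 23 illustrate the procedure of the simplified learning processing unit of FIG. 21 for a Japanese document.
[0093]
In FIG. 22, the input function of the input/output device 10 first receives correct document data 196, which is, for example, the same as the correct document data 30 of FIG. 5. The data converter 12 then performs data conversion 198 into learning data 200 that the learner 14 can handle. The learning data 200 has no predicted-tag field.
[0094]
Next, the target-word and peripheral-word features of the learning data 200 are input to the learner 14. For the target word "放送", for example, feature data 202 including the peripheral words "日本" and "協会" is input together with the correct tag 204 of the target word, "I-ORG", and learning processing 206 generates the learning result 240-1 as the first learning result.
[0095]
The learning processing 206 is of course repeated over the set of pairs of features, including the peripheral words, and correct tags, with each of the second to fifth words of the learning data 200 (all but the first and last) as the target word, to generate the learning result 240-1.
[0096]
Moving to FIG. 23, data reconstruction 208 is performed on the correct document data 196 to generate learning data 210, to which a predicted-tag field is newly added.
[0097]
The learner 14 then performs learning processing 218 on the set of pairs of (a) the features of the target word and of the peripheral words, with the correct tags used as predicted tags, and (b) the correct tags in the learning data 210, to generate the learning result 240-2 (the second learning result).
[0098]
For the second target word "放送", for example, the correct tags of the peripheral words "日本" and "協会" are taken into feature data 212 as predicted tag 214, "I-ORG", and predicted tag 216, "I-ORG". The feature data 212 and the correct tag 204 of the target word, "I-ORG", are set in the learner 14, which generates the learning result 240-2 as a rule for obtaining a predicted tag from the feature data 212. The learning processing 218 is of course repeated with each of the second to fifth words of the learning data 210 as the target word.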
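[0099]
FIG. 24 is a flowchart of the simplified learning processing of FIG. 21; it represents the simplified learning program. Step S1 receives a Japanese correct document through the input/output device 10, and step S2 has the data converter 12 convert it into a set of features and tags to generate the first learning data, which has no predicted tags.
[0100]
In step S3, the learner 14 generates the first learning result from the set of pairs of target-word and peripheral-word features and correct tags in the learning data. Steps S1 to S3 correspond to FIG. 22.
[0101]
In step S4, the data converter 12 converts the correct document data into features and tags, adds a predicted-tag field initialized to "?", and generates the second learning data. Finally, in step S5, the learner 14 generates the second learning result from the pairs of (a) the features of the target word and of the peripheral words, with the correct tags included as predicted tags, and (b) the correct tags in the second learning data.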
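[0102]
FIG. 25 is a block diagram of the simplified decision processing unit, which tags Japanese unknown document data with the learning results generated by the simplified learning processing unit of FIG. 21. It is basically the same as the embodiment of FIG. 10 except that the learning result database 22 holds only the two learning results 240-1 and 240-2.
[0103]
FIGS. 26 and 27 illustrate the procedure of the simplified decision processing unit of FIG. 25. In FIG. 26, the input function of the input/output device 10 first receives Japanese unknown document data 220, and the data converter 12 converts it into test data 224 in a format that the decision unit can handle.
[0104]
Next, in decision processing 228, the decision unit 18 uses the first learning result 240-1 to decide the predicted tag of each target word in the test data 224, for example "放送", from feature data 226 of the target word and the peripheral words "日本" and "協会", and generates decision data 230.
[0105]
Moving to FIG. 27, the data reconstruction device 26 generates reconstructed test data 234 in which the predicted tags decided in the decision processing 228 of FIG. 26 are part of feature data 236.
[0106]
The decision unit 18 then performs decision processing 238, which uses the second learning result 240-2 to decide the tag of each target word from the features, including the predicted tags, in the reconstructed test data 234 (the feature data 236 in the case of the target word "放送"), and generates decision data 242.
[0107]
The decision processing 238 of course uses the learning result 240-2 to decide the predicted tag of each of the second to fifth target words of the reconstructed test data 234, that is, all but the first and last.
[0108]
If the processing count i is less than the predetermined value N in the following decision step 244, processing returns to data reconstruction 232, which updates the predicted tags with those obtained in the decision processing 238 and rebuilds the reconstructed test data 234, and the decision processing with the same learning result 240-2 is repeated. When the count reaches N, the predicted tags of the reconstructed test data 234 at that point become the decided tags. The data converter 12 performs the inverse conversion that attaches the decided tags to the words of the input unknown document data, and the output function of the input/output device 10 outputs the tagged document data externally.
[0109]
FIG. 28 is a flowchart of the simplified decision processing unit of FIG. 25; it corresponds to the simplified decision program. Step S1 receives Japanese unknown document data through the input/output device 10, and step S2 has the data converter 12 convert it into a set of features and tags, with every predicted tag set to "?", to generate the first test data.
[0110]
In step S3, the decision unit 18 uses the first learning result to predict the tag of each word from the target-word and peripheral-word features in the test data. Steps S1 to S3 correspond to FIG. 26. In step S4, the data reconstruction device 26 replaces the predicted tags of the test data with those obtained in step S3 to generate reconstructed test data, and in step S5 the decision unit 18 uses the second learning result to predict the tag of each word from the features, including the predicted tags, in the test data.
[0111]
Steps S4 and S5 are repeated, with step S7 incrementing i by one each time, until step S6 finds that the processing count has reached the predetermined number N. When the N rounds of decision are finished, the inverse conversion attaches the fixed tag of each word to the corresponding word of the unknown document data, and the result is output externally.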
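[0112]
FIGS. 29 and 30 show the same simplified learning procedure as the embodiment of FIG. 21, applied to English correct document data. This processing can be implemented by the learning processing unit of FIG. 21 without the morphological analyzer 11, so the procedure of FIGS. 29 and 30 is described with reference to FIG. 25, as in the Japanese case.
[0113]
In FIG. 29, the input function of the input/output device 10 receives English correct document data 246, and the data converter 12 converts it into learning data 250 that the learner 14 can handle.
[0114]
The learner 14 then performs learning processing 256 on the target-word and peripheral-word features of the learning data 250, for example the pair of feature data 252 for the target word "Reserve" and its correct tag 254, "I-ORG", together with the corresponding pairs for the other target words, and generates the learning result 240-1, which decides a predicted tag from the feature data 252.
[0115]
Moving to FIG. 30, the data converter 12 generates second learning data 260, which contains predicted tags, from the English correct document data 246. The learner 14 performs learning processing 268 on the learning data 260 to generate the learning result 240-2.
[0116]
For the target word "Reserve", for example, the learning processing 268 uses the pair of (a) features 262 of the target word and of the peripheral words, with the correct tags included as predicted tags 264 and 265, and (b) the correct tag 266 of the target word, to generate the learning result 240-2, which decides a predicted tag from the feature data 262. This learning is repeated for each of the second to fifth words of the learning data 260, that is, all but the first and last.
[0117]
FIGS. 31 and 32 show the simplified decision procedure that decides tags for English unknown document data with the two learning results 240-1 and 240-2 obtained for English by the simplified learning processing of FIGS. 29 and 30.
[0118]
This procedure can be implemented with the same configuration as the simplified decision processing unit for Japanese documents shown in FIG. 25, so it is described below with reference to FIG. 25.
[0119]
In FIG. 31, the input function of the input/output device 10 receives English unknown document data 270, and the data converter 12 converts it into test data 274 in a format that the decision unit 18 can handle.
[0120]
The decision unit 18 then performs decision processing 278, which uses the first learning result 240-1 to decide the predicted tag of each target word from the target-word and peripheral-word features of the test data 274, and generates decision data 280.
[0121]
Moving to FIG. 32, the data reconstruction device 26 writes the predicted tags obtained in the decision processing 278 of FIG. 31 into the predicted tags of the test data and generates reconstructed test data 284.
[0122]
The decision unit 18 then performs decision processing 288 on feature data 286 of the reconstructed test data 284 with the second learning result 240-2 and generates decision data 290 containing the predicted tag of each word.
[0123]
The data reconstruction 282 and decision processing 288 are repeated with the same learning result 240-2 until decision step 292 finds that the processing count i has reached the predetermined number N, and the predicted tags of the reconstructed test data 284 are then fixed as the final tags.
[0124]
The data converter 12 then performs the inverse conversion that attaches the decided tags to the input unknown English document data, and the output function of the input/output device 10 outputs the tagged English document data externally.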
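[0125]
FIG. 33 shows another embodiment of the language processing device, in which an external tagging system interface 296 is added to the embodiment of FIG. 2.
[0126]
The external tagging system interface 296 passes the Japanese correct document data or unknown document data received by the input function of the input/output device 10 to an external tagging system 298 over a network such as the Internet and obtains its tagging result. The tags obtained from the external tagging system 298 are then used as predicted tags in the learning processing and decision processing.
[0127]
FIG. 33 takes the device of FIG. 2 for Japanese document data as its example, but the interface applies equally to a language processing device for English document data, which has no morphological analyzer 11. The embodiment of FIG. 33 generates and uses the learning results 24-1 to 24-N, but the interface applies equally to the simplified embodiment that generates and uses the two learning results 240-1 and 240-2.
[0128]
Next, the concrete processing of the learner and the decision unit is described, taking as examples the learning processing of the embodiment of FIG. 2 and the decision processing of the embodiment of FIG. 10, which uses the learning results of FIG. 2.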
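[0129]
FIG. 34 illustrates the first round of learning from the correct data, which generates a first classification rule as the first learning result. In the concrete examples below, the learner applies the decision tree generation algorithm described later.
[0130]
In FIG. 34, correct document data 300 has a correct tag 304 attached to each word 302. The example text is "２００２年にイングランドで開催されたウィンブルドンは波乱の幕開けとなった。優勝候補のセレナウィリアムスは１回戦でオーストラリアのマリアグースに敗れ・・" (roughly: "Wimbledon, held in England in 2002, got off to a stormy start. Serena Williams, a favorite to win, lost in the first round to Australia's マリアグース ..."). The text is split into part-of-speech units by morphological analysis and arranged as shown, with a correct tag 304 attached to each word.
[0131]
A decision tree generator 306, which serves as the learner, learns from the correct document data 300 according to the decision tree algorithm and generates a first classification rule 308 as the first decision tree. As shown in FIG. 35, saving 310 stores the first classification rule 308 in the learning result database, which is empty in its initial state 22-1 and holds the rule in its state 22-2.
[0132]
Next, as shown in FIG. 36, decision processing applies the first classification rule 308 to the words 302 of the correct document data 300 and decides the predicted tags 312-1 shown on the right.
[0133]
Next, as in the correct document data 300 of FIG. 37, the data is reconstructed to incorporate the predicted tags 312-1 decided in FIG. 36. The reconstructed correct document data 300 is input to a decision tree generator 314 for the second round of learning, which generates a second classification rule 316 as the second decision tree.
[0134]
Next, as shown in FIG. 38, saving 318 stores the second classification rule 316 generated in FIG. 37 in the learning result database 22-2 alongside the first classification rule 308 already stored there.
[0135]
Next, as shown in FIG. 39, decision processing applies the second classification rule 316 to the correct document data 300, whose words 302 carry the previous predicted tags 312-1, and decides and assigns the predicted tags 312-2 shown on the right.
[0136]
This is the basic learning processing of the embodiment of FIG. 2. In practice the third, fourth, ..., N-th rounds of learning are repeated to generate the third to N-th classification rules as well. At a minimum, of course, the invention only requires the rules up to the second classification rule.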
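[0137]
The algorithm that creates a classification rule, that is, a decision tree, in the learning processing is described next.
[0138]
FIG. 40 shows a function MakeTree 320 that generates a decision tree in the learning processing. Several algorithms can implement this function; the following decision tree generation method is used as an example.
[0139]
The decision tree generation method used here is proposed in, for example, Quinlan, "C4.5: Programs for Machine Learning" (1993) (Japanese translation: 「ＡＩによるデータ解析」). An outline of the algorithm follows. The correlation between a category and a feature can be measured in various ways, such as information gain, mutual information, or the χ² statistic.
[0140]
[Decision tree generation algorithm GenerateTree(learning set, tree)]
Step S1: If the learning set is empty, assign the default category and return from the function.
Step S2: If every item in the learning set has the same category, add to the tree a node that decides that category and return from the function.
Step S3: Select the feature that correlates most strongly with the category.
Step S4: Add a split on the selected feature to the tree.
Step S5: Partition the learning set by the selected feature.
Step S6: Call GenerateTree() with each partition of the learning set as the argument.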
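Steps S1 to S6 can be sketched as a short recursive function. This is an illustrative outline rather than C4.5 itself: it assumes information gain as the correlation measure (one of the options named in the text), treats every feature as categorical, and omits pruning. The dictionary-based tree layout, the `classify` helper, and the extra guard for exhausted features are assumptions added so that the example is complete.

```python
import math
from collections import Counter


def entropy(tags):
    total = len(tags)
    return -sum(c / total * math.log2(c / total) for c in Counter(tags).values())


def information_gain(rows, feature):
    """Reduction in tag entropy obtained by splitting rows on one feature."""
    groups = {}
    for feats, tag in rows:
        groups.setdefault(feats[feature], []).append(tag)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([tag for _, tag in rows]) - remainder


def generate_tree(rows, features, default="O"):
    """rows: list of (feature_dict, tag). Returns a nested dict or a tag (leaf)."""
    if not rows:                                   # S1: empty learning set
        return default
    tags = [tag for _, tag in rows]
    majority = Counter(tags).most_common(1)[0][0]
    if len(set(tags)) == 1 or not features:        # S2 (plus a guard, assumed)
        return majority
    best = max(features, key=lambda f: information_gain(rows, f))   # S3
    node = {"feature": best, "children": {}, "default": majority}   # S4
    partitions = {}                                # S5: split by feature value
    for feats, tag in rows:
        partitions.setdefault(feats[best], []).append((feats, tag))
    rest = [f for f in features if f != best]
    for value, subset in partitions.items():       # S6: recurse on each part
        node["children"][value] = generate_tree(subset, rest, majority)
    return node


def classify(tree, feats):
    """Walk the tree; unseen feature values fall back to the node's majority tag."""
    while isinstance(tree, dict):
        tree = tree["children"].get(feats[tree["feature"]], tree["default"])
    return tree


rows = [
    ({"char_type": "kanji", "length": 2}, "O"),
    ({"char_type": "kanji", "length": 2}, "O"),
    ({"char_type": "katakana", "length": 2}, "I-LOC"),
    ({"char_type": "katakana", "length": 5}, "I-PER"),
    ({"char_type": "hiragana", "length": 1}, "O"),
]
tree = generate_tree(rows, ["char_type", "length"])
```

On this toy learning set the character type has the higher information gain, so it becomes the root split, which mirrors the walk-through in the text where the learning set is first divided by character type.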
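[0141]
When the correct document data 300 is input to this algorithm, the decision tree is generated mainly as follows. First, steps S1 to S3 select the character type of the correct document data 300 as the feature. There are four character types: kanji, hiragana, katakana, and alphanumeric.
[0142]
Next, steps S4 and S5 split the learning set, that is, the correct document data 300, by character type into a kanji set 300-1, a hiragana set 300-2, a katakana set 300-3, and an alphanumeric set 300-4, as shown in FIG. 41.
[0143]
Step S6 then calls the function GenerateTree on each of these sets. Taking the kanji set 300-1 as an example, every kanji word in the correct document data 300 has the category "O" as its correct tag, so a node for "O" is added. Repeating this for the other sets produces, for example, the first classification rule 308 shown in FIG. 34.
[0144]
The first classification rule 308 of FIG. 34 does not incorporate peripheral-word features, whereas the second classification rule 316 of FIG. 37, generated after the predicted tags have been assigned, incorporates the preceding and following predicted tags as features.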
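[0145]
FIG. 42 shows a concrete example of the decision processing that assigns predicted tags to unknown document data by applying the first classification rule 308 and the second classification rule 316 generated by the learning processing of FIGS. 34 to 39.
[0146]
In FIG. 42, the unknown document data 322 is the text "１９９８年のサンマリノでの試合ではミカハッキネンは惜しくも優勝を逃したが" ("In the 1998 event at San Marino, Mika Hakkinen narrowly missed victory, but"), split by morphological analysis into part-of-speech units, the words 324.
[0147]
Decision processing applies the first classification rule 308 stored in the learning result database 22 to the unknown document data 322 and assigns a predicted tag 326-1 to each word 324.
[0148]
The second classification rule 316 in the learning result database 22 is then applied to the unknown document data 322 carrying the predicted tags 326-1, and a predicted tag 326-2 is assigned to each word 324.
[0149]
FIG. 43 shows the assignment of predicted tags by the first classification rule 308 in the first decision pass of FIG. 42 for the fragment "試合ではミカハッキネンは" of the unknown document data.
[0150]
The first classification rule 308 has a tree structure. The target word "試合", for example, follows arrow 308-1 through the nodes "character type = kanji", "word length <= 3", and "part of speech = other" and is given the predicted tag "O", which is correct.
[0151]
"ミカ", on the other hand, follows arrow 308-2 through the nodes "character type = katakana" and "word length < 4" and is given the predicted tag "I-LOC", which is incorrect.
[0152]
FIG. 44 shows a concrete example of decision processing with the second classification rule 316, whose features include the predicted tags decided by the first classification rule 308 of FIG. 43. The target word "試合", already correct under the first classification rule 308, follows arrow 316-2 under the second classification rule 316 to the predicted tag "O", which is again correct.
[0153]
The target word "ミカ" follows arrow 316-3 through the nodes "character type = katakana" and "word length < 4" and then the test "following tag = I-PER", so its predicted tag is decided as "I-PER". The second classification rule 316, which incorporates the predicted tags of the peripheral words as features, thus gives the correct answer.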
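The two passes described for "試合ではミカハッキネンは" can be replayed with hand-written rules. The branches below are taken from the paths named in the text where possible; everything else is an assumption, in particular the tag that the first rule gives to katakana words of four or more characters (assumed here to be "I-PER" so that "ハッキネン" supplies the clue) and the fallback branches. The function names and the word dictionaries are hypothetical.

```python
def first_rule(word):
    """First classification rule: no neighbour tags are available yet."""
    if word["char_type"] == "kanji":
        return "O"                      # e.g. 試合 in the text
    if word["char_type"] == "katakana":
        # "< 4 -> I-LOC" is from the text; the other branch is an assumption.
        return "I-LOC" if word["length"] < 4 else "I-PER"
    return "O"


def second_rule(word, next_tag):
    """Second classification rule: also tests the predicted tag that follows."""
    if word["char_type"] == "katakana" and word["length"] < 4:
        return "I-PER" if next_tag == "I-PER" else "I-LOC"
    return first_rule(word)


def make_word(surface, char_type):
    return {"surface": surface, "char_type": char_type, "length": len(surface)}


sentence = [
    make_word("試合", "kanji"),
    make_word("で", "hiragana"),
    make_word("は", "hiragana"),
    make_word("ミカ", "katakana"),
    make_word("ハッキネン", "katakana"),
    make_word("は", "hiragana"),
]

pass1 = [first_rule(w) for w in sentence]
pass2 = [
    second_rule(w, pass1[i + 1] if i + 1 < len(sentence) else "?")
    for i, w in enumerate(sentence)
]
```

Under these assumptions the first pass tags "ミカ" as "I-LOC", and the second pass corrects it to "I-PER" because the word that follows was tagged "I-PER" in the first pass.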
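[0154]
In the present invention, then, the first classification rule (first learning result) 308 predicts the tags of the words surrounding each target word of the unknown document, and the second classification rule (second learning result) 316 uses those predicted tags as features, so the tag of the target word can be decided with high accuracy from its correlation with the predicted tags of the peripheral words.
[0155]
FIG. 45 compares the feature information used by conventional methods and by the method of the invention when tagging the target word "ハッキネン" in the same fragment "試合ではミカハッキネンは" as in FIGS. 43 and 44.
[0156]
FIG. 45(A) shows the features of a conventional method 328 that does not include the tags of peripheral words; how the preceding and following words will be tagged is not considered at all.
[0157]
FIG. 45(B) shows Yamada's method 330, cited above as prior art (4), which uses the predicted tag of the peripheral word on one side of the target word, for example the preceding word. This works only when the clue that "ハッキネン" should be tagged "I-PER" lies in a word before the target word. If the target word were "ミカ", the clue, the "I-PER" tag of the peripheral word "ハッキネン", would be the predicted tag of a following word and could not be used.
[0158]
FIG. 45(C) shows the features of the method 332 of the invention, which uses the predicted tags at several surrounding positions both before and after the target word "ハッキネン". If a clue exists anywhere around the target word, it can be used to predict the tag accurately.
[0159]
The present invention is not limited to the embodiments above and includes appropriate modifications that do not impair its objects and advantages. Nor is it limited by the numerical values given in the embodiments.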
【０１６０】
ここで本発明の特徴を列挙すると、次の付記のようになる。
（付記）
（付記１）
学習処理装置により、タグのついた正解文書データに含まれる対象語及びその両側の周辺語の特徴と正解タグの対の集合を用いて複数の学習結果を生成する学習処理ステップと、
判定処理装置により、前記複数の学習結果を用いて、タグ付け対象となる未知文書データに含まれる対象語及びその両側の周辺語の特徴から各対象語のタグを判定する判定処理ステップと、
を備えたことを特徴とする言語処理方法。（１）
【０１６１】
（付記２）
付記１記載の言語処理方法に於いて、前記学習処理ステップは、
入力装置により、外部からタグのついた正解文書データを入力する入力ステップと、
データ変換器により、前記正解文書データを学習器で取り扱う形式の学習データに変換するデータ変換ステップと、
学習器により、前記学習データに含まれる対象語及び周辺語の特徴と正解タグの対の集合を用いて、前記特徴から予測タグを判定する第１学習結果を生成する第１学習ステップと、
判定器により、前記第１学習ステップで生成された第１学習結果を用いて、前記学習データに含まれる対象語及び周辺語の特徴から各対象語の予測タグを判定する第１判定ステップと、
データ再構成装置により、前記第１判定ステップで判定された予測タグを前記学習データの特徴の一部に入れて再構成するデータ再構成ステップと、
前記学習器により、データ再構成ステップで再構成された学習データに含まれる前記予測タグを含む対象語及び周辺語の特徴と正解タグの対の集合を用いて、前記特徴から予測タグを判定する第２学習結果を生成する第２学習ステップと、
前記判定器により、前記第２学習ステップで生成された第２学習結果を用いて、前記学習データに含まれる前記予測タグを含む対象語及び周辺語の特徴から各対象語のタグを判定する第２判定ステップと、
前記データ再構成ステップ、第２学習ステップ及び前記第２判定ステップを複数回繰り返させて第２学習結果以降の複数の学習結果を生成する学習反復ステップと、
を備えたことを特徴とする言語処理方法。（２）
【０１６２】
（付記３）
付記２記載の言語処理方法に於いて、前記判定処理ステップは、
前記入力装置により、外部からタグ付け対象となる未知文書データを入力する入力ステップと、
前記データ変換器により、前記未知文書データを前記学習器で取り扱う形式のテストデータに変換するデータ変換ステップと、
前記判定器により、前記第１学習結果を用いて、前記テストデータに含まれる対象語及び周辺語の特徴から各対象語の予測タグを判定する第３判定ステップと、前記データ再構成装置により、前記第３判定ステップで判定された予測タグを前記学習データの特徴の一部に入れて再構成する第２データ再構成ステップと、
前記判定器により、前記第２学習結果を用いて、前記第２データ再構成ステップで再構成されたテストデータに含まれる前記予測タグを含む対象語及び周辺語の特徴から各対象語のタグを判定する第４判定ステップと、
前記データ再構成ステップ及び前記第４判定ステップを第２学習結果以降の学習結果に切替ながら複数回繰り返させて未知文書データのタグを判定させる判定反復ステップと、
出力装置により、前記未知文書データを前記判定されたタグの付いた文書データに逆変換して出力する出力ステップと、
を備えたことを特徴とする言語処理方法。（３）
【０１６３】
（付記４）
付記１記載の言語処理方法に於いて、前記学習処理ステップは、
入力装置により、外部からタグのついた正解文書データを入力する入力ステップと、
データ変換器により、前記正解文書データを学習器で取り扱う予測タグを含まない形式の学習データに変換するデータ変換ステップと、
学習器により、前記学習データに含まれる対象語及び周辺語の特徴と正解タグの対の集合を用いて、前記特徴から予測タグを判定する第１学習結果を生成する第１学習ステップと、
データ変換器により、前記正解文書データを予測タグを含む学習器で取り扱う形式の学習データに変換するデータ変換ステップと、
学習器により、前記学習データに含まれる対象語及び正解タグを予測タグとして含む周辺語の特徴と正解タグの対の集合を用いて、前記特徴から予測タグを判定する第２学習結果を生成する第２学習ステップと、
を備えたことを特徴とする言語処理方法。（４）
【０１６４】
（付記５）
付記４記載の言語処理方法に於いて、前記判定処理ステップは、
前記入力装置により、外部からタグ付け対象となる未知文書データを入力する入力ステップと、
前記データ変換器により、前記未知文書データを前記学習器で取り扱う形式のテストデータに変換するデータ変換ステップと、
前記判定器により、前記第１学習結果を用いて、前記テストデータに含まれる対象語及び周辺語の特徴から各対象語の予測タグを判定する第１判定ステップと、前記データ再構成装置により、前記第１判定ステップで判定された予測タグを前記学習データの特徴の一部に入れて再構成するデータ再構成ステップと、
前記判定器により、前記第２学習結果を用いて、前記データ再構成ステップで再構成されたテストデータに含まれる前記予測タグを含む対象語及び周辺語の特徴から各対象語のタグを判定する第２判定ステップと、
前記データ再構成ステップ及び前記第２判定ステップを第２学習結果を用いて複数回繰り返させて未知文書データのタグを判定させる判定反復ステップと、
出力装置により、前記未知文書データを前記判定されたタグの付いた文書データに逆変換して出力する出力ステップと、
を備えたことを特徴とする言語処理方法。（５）
【０１６５】
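(Appendix 6)
The language processing method according to any one of Appendices 2 to 5, wherein
the learning processing step and the decision processing step each include a morphological analysis step that analyzes the correct document data or unknown document data received by the input device and identifies its morphemes, and
the data conversion step has the data converter convert the correct document data or unknown document data, together with the identified morphemes, into learning data in the format handled by the learner.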
【０１６６】
（付記７）
付記２乃至５のいずれかに記載の言語処理方法に於いて、
前記学習処理ステップおよび判定処理ステップは、前記入力装置により受け取った正解文書データ又は未知文書データを外部のタグ付けシステムに渡してタグ付け結果を取得する外部タグ付けステップを備え、
前記データ変換ステップは、データ変換器により、前記タグ付けシステムから取得したタグ付け結果を含めて前記正解文書データ又は未知文書データを学習器で取り扱う形式の学習データに変換することを特徴とする言語処理方法。
【０１６７】
（付記８）
コンピュータに、
タグのついた正解文書データに含まれる対象語及びその両側の周辺語の特徴と正解タグの対の集合を用いて複数の学習結果を生成する学習処理ステップと、
前記複数の学習結果を用いて、タグ付け対象となる未知文書データに含まれる対象語及びその両側の周辺語の特徴から各対象語のタグを判定する判定処理ステップと、
を実行させることを特徴とするプログラム。（６）
【０１６８】
（付記９）
付記８記載のプログラムに於いて、前記学習処理ステップは、
入力装置により、外部からタグのついた正解文書データを入力する入力ステップと、
データ変換器により、前記正解文書データを学習器で取り扱う形式の学習データに変換するデータ変換ステップと、
学習器により、前記学習データに含まれる対象語及び周辺語の特徴と正解タグの対の集合を用いて、前記特徴から予測タグを判定する第１学習結果を生成する第１学習ステップと、
判定器により、前記第１学習ステップで生成された第１学習結果を用いて、前記学習データに含まれる対象語及び周辺語の特徴から各対象語の予測タグを判定する第１判定ステップと、
データ再構成装置により、前記第１判定ステップで判定された予測タグを前記学習データの特徴の一部に入れて再構成するデータ再構成ステップと、
前記学習器により、データ再構成ステップで再構成された学習データに含まれる前記予測タグを含む対象語及び周辺語の特徴と正解タグの対の集合を用いて、前記特徴から予測タグを判定する第２学習結果を生成する第２学習ステップと、
前記判定器により、前記第２学習ステップで生成された第２学習結果を用いて、前記学習データに含まれる前記予測タグを含む対象語及び周辺語の特徴から各対象語のタグを判定する第２判定ステップと、
前記データ再構成ステップ、第２学習ステップ及び前記第２判定ステップを複数回繰り返させて第２学習結果以降の複数の学習結果を生成する学習反復ステップと、
を備えたことを特徴とするプログラム。（７）
【０１６９】
（付記１０）
付記９記載のプログラムに於いて、前記判定処理ステップは、
外部からタグ付け対象となる未知文書データを入力する入力ステップと、
前記未知文書データを前記学習器で取り扱う形式のテストデータに変換するデータ変換ステップと、
前記第１学習結果を用いて、前記テストデータに含まれる対象語及び周辺語の特徴から各対象語の予測タグを判定する第３判定ステップと、
前記第３判定ステップで判定された予測タグを前記学習データの特徴の一部に入れて再構成する第２データ再構成ステップと、
前記第２学習結果を用いて、前記第２データ再構成ステップで再構成されたテストデータに含まれる前記予測タグを含む対象語及び周辺語の特徴から各対象語のタグを判定する第４判定ステップと、
前記データ再構成ステップ及び前記第４判定ステップを第２学習結果以降の学習結果に切替ながら複数回繰り返させて未知文書データのタグを判定させる判定反復ステップと、
前記未知文書データを前記判定されたタグの付いた文書データに逆変換して出力する出力ステップと、
を備えたことを特徴とするプログラム。（８）
【０１７０】
（付記１１）
付記８記載のプログラムに於いて、前記学習処理ステップは、
外部からタグのついた正解文書データを入力する入力ステップと、
前記正解文書データを学習器で取り扱う予測タグを含まない形式の学習データに変換するデータ変換ステップと、
前記学習データに含まれる対象語及び周辺語の特徴と正解タグの対の集合を用いて、前記特徴から予測タグを判定する第１学習結果を生成する第１学習ステップと、
前記正解文書データを予測タグを含む学習器で取り扱う形式の学習データに変換するデータ変換ステップと、
前記学習データに含まれる対象語及び正解タグを予測タグとして含む周辺語の特徴と正解タグの対の集合を用いて、前記特徴から予測タグを判定する第２学習結果を生成する第２学習ステップと、
を備えたことを特徴とするプログラム。
【０１７１】
（付記１２）
付記１１記載のプログラムに於いて、前記判定処理ステップは、
外部からタグ付け対象となる未知文書データを入力する入力ステップと、
前記未知文書データを前記学習器で取り扱う形式のテストデータに変換するデータ変換ステップと、
前記第１学習結果を用いて、前記テストデータに含まれる対象語及び周辺語の特徴から各対象語の予測タグを判定する第１判定ステップと、
前記データ再構成装置により、前記第１判定ステップで判定された予測タグを前記学習データの特徴の一部に入れて再構成するデータ再構成ステップと、
前記第２学習結果を用いて、前記データ再構成ステップで再構成されたテストデータに含まれる前記予測タグを含む対象語及び周辺語の特徴から各対象語のタグを判定する第２判定ステップと、
前記データ再構成ステップ及び前記第２判定ステップを第２学習結果を用いて複数回繰り返させて未知文書データのタグを判定させる判定反復ステップと、
前記未知文書データを前記判定されたタグの付いた文書データに逆変換して出力する出力ステップと、
を備えたことを特徴とするプログラム。
【０１７２】
（付記１３）
付記９乃至１２のいずれかに記載のプログラムに於いて、
前記学習処理ステップおよび判定処理ステップは、前記入力装置により受け取った正解文書データ又は未知文書データを解析して形態素を判定する形態素解析ステップを備え、
前記データ変換ステップは、データ変換器により、前記解析された前記形態素を含めて前記正解文書データ又は未知文書データを学習器で取り扱う形式の学習データに変換することを特徴とするプログラム。
【０１７３】
（付記１４）
付記９乃至１２のいずれかに記載のプログラムに於いて、
前記学習処理ステップおよび判定処理ステップは、前記入力装置により受け取った正解文書データ又は未知文書データを外部のタグ付けシステムに渡してタグ付け結果を取得する外部タグ付けステップを備え、
前記データ変換ステップは、データ変換器により、前記タグ付けシステムから取得したタグ付け結果を含めて前記正解文書データ又は未知文書データを学習器で取り扱う形式の学習データに変換することを特徴とするプログラム。
【０１７４】
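(Appendix 15)
A language processing device comprising:
a learning processing unit that generates a plurality of learning results from a set of pairs of (a) the features of a target word and of the peripheral words on both sides of it in tagged correct document data and (b) the correct tag; and
a decision processing unit that uses the plurality of learning results to decide the tag of each target word from the features of the target word and of the peripheral words on both sides of it in the unknown document data to be tagged. (9)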
【０１７５】
（付記１６）
付記１５記載の言語処理装置に於いて、前記学習処理部は、
外部からタグのついた正解文書データを入力する入力装置と、
前記正解文書データを学習器で取り扱う形式の学習データに変換するデータ変換装置と、
前記学習データに含まれる対象語及び周辺語の特徴と正解タグの対の集合を用いて、前記特徴から予測タグを判定する第１学習結果を生成する第１学習器と、
判定器により、前記第１学習ステップで生成された学習結果を用いて、前記学習データに含まれる対象語及び周辺語の特徴から各対象語の予測タグを判定する第１判定器と、
データ再構成装置により、前記第１判定器で判定された予測タグを前記学習データの特徴の一部に入れて再構成するデータ再構成装置と、
前記データ再構成装置で再構成された学習データに含まれる前記予測タグを含む対象語及び周辺語の特徴と正解タグの対の集合を用いて、前記特徴から予測タグを判定する第２学習結果を生成する第２学習器と、
前記第２学習器で生成された第２学習結果を用いて、前記学習データに含まれる前記予測タグを含む対象語及び周辺語の特徴から各対象語のタグを判定する第２判定器と、
前記データ再構成装置、第２学習器及び前記第２判定器を複数回繰り返させて第２学習結果以降の複数の学習結果を生成する学習反復部と、
を備えたことを特徴とする言語処理装置。（１０）
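[0176]
(Supplementary Note 17)
In the language processing device according to Supplementary Note 16, the determination processing unit comprises:
an input device that inputs, from outside, unknown document data to be tagged;
a data conversion device that converts the unknown document data into test data in a format handled by the learning device;
a third determiner that determines, using the first learning result, the prediction tag of each target word from the features of the target word and surrounding words contained in the test data;
a second data reconstruction device that reconstructs the data by placing the prediction tags determined by the third determiner into part of the features of the learning data;
a fourth determiner that determines, using the second learning result, the tag of each target word from the features, including the prediction tags, of the target word and surrounding words contained in the test data reconstructed by the second data reconstruction device;
a determination repetition unit that causes the second data reconstruction device and the fourth determiner to operate repeatedly a plurality of times, switching to the learning results following the second learning result, to determine the tags of the unknown document data; and
an output device that converts the unknown document data back into document data bearing the determined tags and outputs it. (11)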
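[0177]
(Supplementary Note 18)
In the language processing device according to Supplementary Note 15, the learning processing unit comprises:
an input device that inputs tagged correct answer document data from outside;
a data conversion unit that converts the correct answer document data into learning data in a format, handled by the learning device, that does not include prediction tags;
a first learning device that generates a first learning result for determining a prediction tag from features, using a set of pairs of correct answer tags and the features of the target word and surrounding words contained in the learning data;
a data conversion device that converts the correct answer document data into learning data in a format, handled by the learning device, that includes prediction tags; and
a second learning device that generates a second learning result for determining a prediction tag from features, using a set of pairs of correct answer tags and the features of the target word and of surrounding words that carry their correct answer tags as prediction tags, contained in the learning data.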
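[0178]
(Supplementary Note 19)
In the language processing device according to Supplementary Note 18, the determination processing unit comprises:
an input device that inputs, from outside, unknown document data to be tagged;
a data conversion device that converts the unknown document data into test data in a format handled by the learning device;
a first determiner that determines, using the first learning result, the prediction tag of each target word from the features of the target word and surrounding words contained in the test data;
a data reconstruction device that reconstructs the data by placing the prediction tags determined by the first determiner into part of the features of the learning data;
a second determiner that determines, using the second learning result, the tag of each target word from the features, including the prediction tags, of the target word and surrounding words contained in the test data reconstructed by the data reconstruction device;
a determination repetition unit that causes the data reconstruction device and the second determiner to operate repeatedly a plurality of times using the second learning result to determine the tags of the unknown document data; and
an output device that converts the unknown document data back into document data bearing the determined tags and outputs it.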
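[0179]
(Supplementary Note 20)
In the language processing device according to any one of Supplementary Notes 16 to 19,
the learning processing unit and the determination processing unit comprise a morphological analysis device that analyzes the correct answer document data or unknown document data received by the input device to determine morphemes, and
the data conversion device converts the correct answer document data or unknown document data, together with the analyzed morphemes, into learning data in a format handled by the learning device.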
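[0180]
(Supplementary Note 21)
In the language processing device according to any one of Supplementary Notes 16 to 19,
the learning processing unit and the determination processing unit comprise an external tagging interface that passes the correct answer document data or unknown document data received by the input device to an external tagging system and obtains a tagging result, and
the data conversion device converts the correct answer document data or unknown document data, together with the tagging result obtained from the tagging system, into learning data in a format handled by the learning device.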
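[0181]
[Effect of the Invention]
As described above, according to the present invention, a plurality of learning results are generated using a set of pairs of correct answer tags and the features of target words and surrounding words contained in tagged correct answer document data, and these learning results are used to determine the tag of each target word from the features of the target word and surrounding words contained in unknown document data. The tags of an unknown document can therefore be determined by exploiting the tags of surrounding words, which are correlated with the tag of the target word. Furthermore, because a plurality of learning results are generated and used to repeat the tag determination for the unknown document, any improvement in the accuracy of the surrounding tags produces a corresponding improvement in the accuracy of the tag determined for the target word, so that unknown documents can be tagged with high accuracy.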
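[Brief Description of the Drawings]
[FIG. 1] Explanatory diagram of the principle of the present invention
[FIG. 2] Block diagram of a language processing device functioning as a learning processing unit for Japanese correct answer documents
[FIG. 3] Explanatory diagram of the hardware environment of a computer to which the language processing device of the present invention is applied
[FIG. 4] Explanatory diagram of morphologically analyzed Japanese document data
[FIG. 5] Explanatory diagram of correct answer document data
[FIG. 6] Explanatory diagram of unknown document data
[FIG. 7] Explanatory diagram of the processing functions in the learning processing unit of FIG. 2
[FIG. 8] Explanatory diagram of the processing functions continuing from FIG. 7
[FIG. 9] Flowchart of the learning processing of FIG. 2
[FIG. 10] Block diagram of a language processing device functioning as a determination processing unit using the learning results of FIG. 2
[FIG. 11] Explanatory diagram of the processing functions in the determination processing unit of FIG. 10
[FIG. 12] Explanatory diagram of the processing functions continuing from FIG. 11
[FIG. 13] Explanatory diagram of determination results for an unknown document
[FIG. 14] Flowchart of the determination processing of FIG. 11
[FIG. 15] Block diagram of a language processing device functioning as a learning processing unit for English correct answer documents
[FIG. 16] Explanatory diagram of the processing functions in the learning processing unit of FIG. 15
[FIG. 17] Explanatory diagram of the processing functions continuing from FIG. 16
[FIG. 18] Block diagram of a language processing device functioning as a determination processing unit using the learning results of FIG. 16
[FIG. 19] Explanatory diagram of the processing functions in the determination processing unit of FIG. 18
[FIG. 20] Explanatory diagram of the processing functions continuing from FIG. 19
[FIG. 21] Block diagram of a language processing device functioning as a simplified learning processing unit for Japanese correct answer documents
[FIG. 22] Explanatory diagram of the processing functions in the learning processing unit of FIG. 21
[FIG. 23] Explanatory diagram of the processing functions continuing from FIG. 22
[FIG. 24] Flowchart of the learning processing of FIG. 21
[FIG. 25] Block diagram of a language processing device functioning as a determination processing unit using the learning results of FIG. 21
[FIG. 26] Explanatory diagram of the processing functions in the determination processing unit of FIG. 25
[FIG. 27] Explanatory diagram of the processing functions continuing from FIG. 26
[FIG. 28] Flowchart of the determination processing of FIG. 25
[FIG. 29] Explanatory diagram of the processing functions in a simplified learning processing unit for English correct answer documents
[FIG. 30] Explanatory diagram of the processing functions continuing from FIG. 29
[FIG. 31] Explanatory diagram of the processing functions in a determination processing unit using the learning results of FIG. 20 and FIG. 30
[FIG. 32] Explanatory diagram of the processing functions continuing from FIG. 31
[FIG. 33] Block diagram of the language processing device of the present invention provided with an interface for accessing an external tagging system
[FIG. 34] Explanatory diagram of learning processing that performs the first learning from correct answer document data to generate a first classification rule
[FIG. 35] Explanatory diagram of the learning result database storing the first classification rule generated in FIG. 34
[FIG. 36] Explanatory diagram of determination processing that applies the first classification rule to the correct answer document data to assign prediction tags
[FIG. 37] Explanatory diagram of learning processing that performs the second learning from the correct answer document data given prediction tags in FIG. 36 to generate a second classification rule
[FIG. 38] Explanatory diagram of the learning result database storing the second classification rule generated in FIG. 37 in addition to the first classification rule
[FIG. 39] Explanatory diagram of determination and learning processing that applies the classification rules to the correct answer document data given prediction tags in FIG. 38 to determine prediction tags
[FIG. 40] Explanatory diagram of the function MakeTree, which generates a decision tree serving as a classification rule
[FIG. 41] Explanatory diagram of decision tree generation processing that splits the learning set by character type
[FIG. 42] Explanatory diagram of determination processing that applies the first and second classification rules in succession to unknown document data to assign prediction tags
[FIG. 43] Explanatory diagram of determination processing that assigns prediction tags by applying the first classification rule, taking part of the unknown document data as an example
[FIG. 44] Explanatory diagram of determination processing that assigns prediction tags by applying the second classification rule to the unknown document data including the prediction tags assigned in FIG. 43
[FIG. 45] Explanatory diagram comparing prediction tag assignment by the conventional method and by the method of the present invention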
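[Explanation of Reference Numerals]
10: Input/output device
11: Morphological analysis device
12: Data converter
14: Learning device
16: Learning control device
18: Determiner
20: Determination control device
22: Learning database
24-1, 240-1: Learning result (first learning result)
24-2, 240-2: Learning result (second learning result)
24-3 to 24-N: Learning results
26: Data reconstruction device
28: Japanese document data
30: Correct answer document data
32, 80, 200: Learning data
48: Word
50: Features
52: Prediction tag
54: Correct answer tag
56, 70: Feature data
64, 76, 96, 106, 134, 156, 176, 192, 230, 242, 280, 290: Determination data
68, 138, 210: Reconstructed learning data
84, 164, 224, 250, 274: Test data
100, 180, 234, 260, 284: Reconstructed test data
300: External tagging system interface
[0001]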
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a language processing method, program, and device that automatically tag an unknown document using the results of learning from tagged correct answer documents, and in particular to a language processing method, program, and device that tag an unknown document using learning results based on the features of target words and surrounding words in the correct answer documents.
[0002]
[Prior art]
In recent years, there has been extensive research and development on language processing that attaches index tags to the words in a document, and systems that apply language processing to tagged documents, such as machine translation, information retrieval, question answering, and knowledge discovery, are expected to be put to practical use.
[0003]
Conventionally, various types of methods have been proposed for learning rules (learning results) from tagged correct answer document data and automatically adding tags to unknown documents using the learned rules. For example, the following are mentioned.
[0004]
(1) D. M. Bikel, S. Miller, R. Schwartz and R. Weischedel. (1997) Nymble: A High-Performance Learning Name-Finder. In 5th Conference on Applied Natural Language Processing. Here, proper noun extraction is treated as a tagging problem, and tagging is performed by learning a hidden Markov model.
[0005]
(2) A. Borthwick, J. Sterling, E. Agichtein and R. Grishman. (1998) Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition. In Proceedings of the 6th Workshop on Very Large Corpora. Here, the same problem as in (1) is treated as learning by the Maximum Entropy method, with the outputs of multiple systems used as features.
[0006]
(3) M. Collins and Y. Singer. (1999) Unsupervised Models for Named Entity Classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. Here, a method of improving accuracy using raw, untagged data is adopted.
[0007]
These methods use features such as notation of a target word and surrounding words to determine a tag, but do not use how surrounding words are tagged.
[0008]
The tags attached to surrounding words are important clues to the tag that should be attached to the target word. For example, when tagging the word "Yamaguchi", the word may take either a personal name tag or a place name tag. If the word following "Yamaguchi" is "Taro" or the like and is known to carry a personal name tag, "Yamaguchi" is likely to be a personal name. Conversely, if the word string preceding "Yamaguchi" is found to be tagged as a place name, that is a clue that "Yamaguchi" should also receive a place name tag.
[0009]
[Problems to be solved by the invention]
However, although the tags attached to the surrounding words are important clues for determining the tag of the target word, they cannot be used for an unknown document, because the surrounding tags are normally not known.
[0010]
The following is known as one of the methods for solving this problem.
(4) Hiroyasu Yamada, Taku Kudo, Yuji Matsumoto. (2001) "Extracting Japanese Named Expressions Using Support Vector Machines", Information Processing Society of Japan, Natural Language Processing Research Group 142-17.
[0011]
In this Japanese named entity extraction method, a method called a dynamic feature is used in order to use surrounding tags as features for determining a target word. In the method using the dynamic feature, at the time of learning, a tag of the correct answer data on the left side (or right side) of the target word is used as a surrounding tag. For an unknown document, the peripheral tag is not used for the leftmost (or rightmost) word, and for other words, the result determined by the learning device is used as the peripheral tag.
[0012]
Although the method using this dynamic feature is known to be effective, it is still insufficient in accuracy. There are two causes, both of which are related to how to handle surrounding tags.
[0013]
One is that only a tag on one side can be used as a peripheral tag. The other is an inconsistency: during learning, the correct answer data is used as the peripheral tag, whereas at determination time a predicted tag, which may or may not be correct, is used as the peripheral tag.
[0014]
An object of the present invention is to provide a language processing method, program, and device that learn the relationship between correct answer tags and the features of a target word and of surrounding words, including the words both before and after the target word, and thereby tag unknown documents with high accuracy.
[0015]
[Means for Solving the Problems]
FIG. 1 is a diagram illustrating the principle of the present invention. The language processing method according to the present invention comprises a learning processing step in which the learning processing unit 1 generates a plurality of learning results using a set of pairs of correct answer tags and the features of a target word and the surrounding words on both sides of it included in tagged correct answer document data, and a determination processing step of determining, using the plurality of learning results, the tag of each target word from the features of the target word and the surrounding words on both sides of it included in unknown document data to be tagged.
[0016]
This learning processing step comprises:
an input step of inputting tagged correct answer document data from the outside by the input device;
a data conversion step of converting, by the data converter, the correct answer document data into learning data in a format handled by the learning device;
a first learning step of generating, by the learning device, a first learning result 24-1 for determining a prediction tag from features, using a set of pairs of correct answer tags and the features of the target word and surrounding words included in the learning data;
a first determination step of determining, by the determiner, the prediction tag of each target word from the features of the target word and surrounding words included in the learning data, using the first learning result 24-1 generated in the first learning step;
a data reconstruction step of reconstructing the learning data, by the data reconstruction device, by placing the prediction tags determined in the first determination step into part of the features of the learning data;
a second learning step of generating, by the learning device, a second learning result 24-2 for determining a prediction tag from features, using a set of pairs of correct answer tags and the features, including the prediction tags, of the target word and surrounding words included in the learning data reconstructed in the data reconstruction step;
a second determination step of determining, by the determiner, the tag of each target word from the features, including the prediction tags, of the target word and surrounding words included in the learning data, using the second learning result 24-2 generated in the second learning step; and
a learning repetition step of generating a plurality of learning results 24-3 to 24-N following the second learning result by repeating the data reconstruction step, the second learning step, and the second determination step a plurality of times.
[0017]
The determination processing step, which uses the learning results generated in this learning processing step, comprises:
an input step of inputting unknown document data to be tagged from the outside by the input device;
a data conversion step of converting, by the data converter, the unknown document data into test data in a format handled by the learning device;
a third determination step of determining, by the determiner, the prediction tag of each target word from the features of the target word and surrounding words included in the test data, using the first learning result 24-1;
a second data reconstruction step of reconstructing the data, by the data reconstruction device, by placing the prediction tags determined in the third determination step into part of the features of the learning data;
a fourth determination step of determining, by the determiner, the tag of each target word from the features, including the prediction tags, of the target word and surrounding words included in the test data reconstructed in the second data reconstruction step, using the second learning result 24-2;
a determination repetition step of determining the tags of the unknown document data by repeating the second data reconstruction step and the fourth determination step a plurality of times while switching to the learning results 24-3 to 24-N following the second learning result; and
an output step of converting, by the output device, the unknown document data back into document data bearing the determined tags and outputting it.
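The determination steps above can be sketched as a loop that applies one learning result per round and feeds each round's tags into the next. This is only an illustrative sketch: the two "learning results" below are hand-written, hypothetical stand-ins for learned rules, the tag names "PERSON" and "O" are assumptions, and every word is tagged (the embodiment described later skips the first and last words). The example reuses the "Yamaguchi" / "Taro" case from the prior-art discussion.

```python
def first_result(words, predicted, i):
    """Hypothetical stand-in for the first learning result: word features only."""
    known_names = {"Taro": "PERSON"}
    return known_names.get(words[i], "O")


def later_result(words, predicted, i):
    """Hypothetical stand-in for a later learning result: it also reads the
    prediction tag of the following word."""
    following = predicted[i + 1] if i + 1 < len(words) else "?"
    if words[i] == "Yamaguchi" and following == "PERSON":
        return "PERSON"
    return first_result(words, predicted, i)


def determine_tags(words, learning_results):
    """Apply the learning results in order, feeding each round's tags to the next."""
    predicted = ["?"] * len(words)       # nothing is known before the first round
    for result in learning_results:      # switch to the next learning result each round
        # Data reconstruction: the comprehension reads the previous round's tags,
        # and the new list replaces them only once the round is complete.
        predicted = [result(words, predicted, i) for i in range(len(words))]
    return list(zip(words, predicted))


one_round = determine_tags(["Yamaguchi", "Taro"], [first_result])
two_rounds = determine_tags(["Yamaguchi", "Taro"], [first_result, later_result])
print(one_round)
print(two_rounds)
```

In this toy run the first round cannot decide on "Yamaguchi", while the second round can, because the tag of "Taro" from the first round is available as a feature.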
[0018]
According to this language processing method, the surrounding prediction tags are entirely unknown in the first round, but from the second round onward they become useful features to the extent that the surrounding tags are correct. As the surrounding tags become correct and the accuracy of the prediction tags improves, determination can use more accurate surrounding tags as features, and the tagging accuracy improves.
[0019]
According to another aspect of the present invention, an unknown document is tagged in a simplified manner by generating just two learning results from the correct answer documents: a learning result that does not use the features of peripheral words, and a learning result that does.
[0020]
In this case, the learning processing step comprises:
an input step of inputting tagged correct answer document data from the outside by the input device;
a data conversion step of converting, by the data converter, the correct answer document data into learning data in a format, handled by the learning device, that does not include prediction tags;
a first learning step of generating, by the learning device, a first learning result for determining a prediction tag from features, using a set of pairs of correct answer tags and the features of the target word and surrounding words included in the learning data;
a data conversion step of converting, by the data converter, the correct answer document data into learning data in a format, handled by the learning device, that includes prediction tags; and
a second learning step of generating, by the learning device, a second learning result for determining a prediction tag from features, using a set of pairs of correct answer tags and the features of the target word and of surrounding words that carry their correct answer tags as prediction tags, included in the learning data.
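The two learning-data formats of this simplified variant can be sketched as follows. This is a minimal illustration, not the patented implementation: words are represented only by their notation (English glosses of the example words), the "O" tag for "chairman" is an assumption, and skipping the first and last words follows the embodiment described later.

```python
def window(rows, i):
    """Return the (word, correct tag) pairs before, at, and after position i."""
    return rows[i - 1], rows[i], rows[i + 1]


def first_examples(rows):
    """Examples for the first learning result: no prediction tags among the features."""
    examples = []
    for i in range(1, len(rows) - 1):
        (prev_w, _), (word, tag), (next_w, _) = window(rows, i)
        examples.append(((prev_w, word, next_w), tag))
    return examples


def second_examples(rows):
    """Examples for the second learning result: the neighbours' correct answer
    tags fill the prediction-tag slots."""
    examples = []
    for i in range(1, len(rows) - 1):
        (prev_w, prev_t), (word, tag), (next_w, next_t) = window(rows, i)
        examples.append(((prev_w, prev_t, word, next_w, next_t), tag))
    return examples


rows = [("Japan", "I-ORG"), ("broadcast", "I-ORG"),
        ("association", "I-ORG"), ("chairman", "O")]
print(first_examples(rows))
print(second_examples(rows))
```

Each returned pair is (features, correct answer tag) and could be handed to any learner; the learner itself is deliberately left out of this sketch.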
[0021]
To tag an unknown document using the first learning result and the second learning result generated in this learning processing step, the determination processing step comprises:
an input step of inputting unknown document data to be tagged from the outside by the input device;
a data conversion step of converting, by the data converter, the unknown document data into test data in a format handled by the learning device;
a first determination step of determining, by the determiner, the prediction tag of each target word from the features of the target word and surrounding words included in the test data, using the first learning result;
a data reconstruction step of reconstructing the data, by the data reconstruction device, by placing the prediction tags determined in the first determination step into part of the features of the learning data;
a second determination step of determining, by the determiner, the tag of each target word from the features, including the prediction tags, of the target word and surrounding words included in the test data reconstructed in the data reconstruction step, using the second learning result;
a determination repetition step of determining the tags of the unknown document data by repeating the data reconstruction step and the second determination step a plurality of times using the second learning result; and
an output step of converting, by the output device, the unknown document data back into document data bearing the determined tags and outputting it.
[0022]
In this simplified language processing method, the surrounding prediction tags are entirely unknown in the first round, so they are not used, and determination relies on the learning result that does not use prediction tags. From the second round onward, the second learning result is applied on the assumption that the surrounding tags determined in the previous round are correct. As a result, as the surrounding tags become correct and the accuracy of the prediction tags improves, determination can rely on more accurate surrounding tags, and the tagging accuracy improves.
[0023]
Here, the learning processing step and the determination processing step include a morphological analysis step of analyzing the correct document data or unknown document data received by the input device to determine a morpheme. In this case, the data conversion step converts the correct document data or the unknown document data including the analyzed morpheme into learning data in a format handled by the learning device.
[0024]
Such morphological analysis is suited to tagging languages such as Japanese, in which words are not separated by spaces. Languages such as English, in which words are separated by spaces, do not require morphological analysis.
[0025]
The learning processing step and the determination processing step include an external tagging step of passing the correct document data or unknown document data received by the input device to an external tagging system and acquiring a tagging result. In this case, in the data conversion step, the data converter converts the correct answer document data or the unknown document data including the tagging result obtained from the tagging system into learning data in a format handled by the learning device.
[0026]
The present invention also provides a program for language processing. The program causes a computer to execute a learning processing step of generating a plurality of learning results using a set of pairs of correct answer tags and the features of a target word and the surrounding words on both sides of it included in tagged correct answer document data, and a determination processing step of determining, using the plurality of learning results, the tag of each target word from the features of the target word and the surrounding words on both sides of it included in unknown document data to be tagged.
[0027]
The present invention also provides a language processing apparatus. The apparatus comprises a learning processing unit that generates a plurality of learning results using a set of pairs of correct answer tags and the features of a target word and surrounding words included in tagged correct answer document data, and a determination processing unit that determines, using the plurality of learning results, the tag of each target word from the features of the target word and surrounding words included in unknown document data to be tagged.
[0028]
The details of the language processing program and the language processing device are basically the same as those of the language processing method.
[0029]
BEST MODE FOR CARRYING OUT THE INVENTION
FIG. 2 is a block diagram showing an embodiment of the language processing device according to the present invention. The language processing device in FIG. 2 functions both as a learning processing unit and as a determination processing unit. As the learning processing unit, it generates a plurality of learning results using a set of pairs of correct answer tags and the features of the target word and surrounding words included in tagged correct answer document data. As the determination processing unit, it uses the plurality of learning results obtained by the learning processing unit to determine the tag of each target word from the features of the target word and surrounding words included in the unknown document data to be tagged.
[0030]
FIG. 2 shows, in block form, the functional configuration of the learning processing unit of the language processing apparatus of the present invention for Japanese documents. The language processing device functioning as the learning processing unit comprises an input/output device 10, a morphological analysis device 11, a data converter 12, a learning device 14, a learning control device 16, a determiner 18, a determination control device 20, a learning result database 22, and a data reconstruction device 26.
[0031]
On the other hand, the language processing device functioning as the determination processing unit has a device configuration that excludes the learning device 14 and the learning control device 16, as shown in FIG. 10.
[0032]
The language processing apparatus of the present invention shown in FIG. 2 is realized by, for example, the computer hardware environment shown in FIG. 3. In the computer of FIG. 3, a RAM 402, a hard disk controller (software) 404, a floppy disk driver (software) 410, a CD-ROM driver (software) 414, a mouse controller 418, a keyboard controller 422, a display controller 426, and a communication board 430 are connected to a bus 401 of a CPU 400.
[0033]
The hard disk controller 404 is connected to a hard disk drive 406, on which the application program that executes the language processing of the present invention is loaded. When the computer is started, the necessary program is called from the hard disk drive 406, expanded in the RAM 402, and executed.
[0034]
A floppy disk drive (hardware) 412 is connected to the floppy disk driver 410, and can perform reading and writing on the floppy disk (R). A CD drive (hardware) 416 is connected to the CD-ROM driver 414, and can read data and programs stored in the CD.
[0035]
The mouse controller 418 transmits an input operation of the mouse 420 to the CPU 400. The keyboard controller 422 transmits an input operation of the keyboard 424 to the CPU 400. The display controller 426 performs display on the display unit 428. The communication board 430 uses a communication line 432 including wireless communication to communicate with another computer or server via a network such as the Internet.
[0036]
Here, the principle of the learning processing unit in the language processing apparatus of the present invention having the configuration shown in FIG. 2 will be described. Consider a method of automatically extracting proper nouns in a Japanese document as an example. As an example of the target Japanese document data, a document "Mr. XX, chairman of the Japan Broadcasting Corporation at the ordinary parliament held the day before ..." is taken as an example.
[0037]
When this document data is decomposed into words by the morphological analysis device 11 of FIG. 2, it becomes, for example, the document data 28 of FIG. 4. The document data 28 contains the following proper nouns: "Normal Diet" as the proper name 28-1, "Japan Broadcasting Corporation" as the organization name 28-2, and "XX" as the personal name 28-3.
[0038]
Extracting the proper nouns in the text of the document data 28 of FIG. 4 can be defined as a tagging problem, as in FIG. 5. FIG. 5 shows the correct answer document data 30, in which a correct answer tag 34 is attached to each word 32. In the present invention, in order to tag the words of unknown document data automatically, learning processing is performed using correct answer document data 30 such as that shown in FIG. 5.
[0039]
The learning process of the present invention generates a learning result (learning rule) from the correct answer document data 30 in which the feature of the word 32 and the correct answer tag 34 are associated. The generated learning result can predict the tag from the characteristics of the word.
[0040]
For example, for the word "broadcast", the correct answer document data 30 takes the word itself, the words before and after it, and their numbers of characters as features, and specifies "I-ORG" as the correct answer tag. In the learning process, a learning result that can determine a tag from features is generated using a set of pairs of correct answer tags and the features of the target word and surrounding words in the correct answer document data 30.
[0041]
Here, if the tags of the words surrounding the target word can be used as features, they can be expected to help predict the correct tag of the target word, because tags representing proper nouns tend to appear consecutively. Taking the unknown document data 36 of FIG. 6 as an example, if the correct tags 42 of the words 38 were known, the tag of a target word could be predicted effectively by using the surrounding tags as prediction tags 40 among its features.
[0042]
For example, the tags of "Japan", "broadcast", and "association" among the words 38 are all "I-ORG"; if the tag of "association" is known to be "I-ORG", that is an important clue that the tags of "Japan" and "broadcast", which precede it, are also "I-ORG".
[0043]
However, the goal of the language processing apparatus of the present invention is to predict unknown tags, so the correct tags 42 in the unknown document data 36 of FIG. 6 are not known. Therefore, in the language processing method of the present invention, tags assigned through learning processing based on the correct answer document data are used as the surrounding tags.
[0044]
FIGS. 7 and 8 are explanatory diagrams showing the procedure of the learning process executed by the learning processing unit of the language processing apparatus of the present invention in FIG. 2. The learning process in FIGS. 7 and 8 is described below with reference to the device configuration in FIG. 2.
[0045]
In FIG. 7, first, the correct answer document data 30 is input by the function of the input/output device 10 as an input unit. The correct answer document data 30 has, for example, the contents shown in FIG. 5. Next, data conversion 44 is performed by the data converter 12. In this data conversion, the correct answer document data 30 is converted into learning data 46 in a format that can be handled by the learning device 14.
[0046]
The learning data 46 includes words 48, features 50, prediction tags 52, and correct answer tags 54. In this embodiment, the feature 50 consists of four items: "notation", "word length", "character type", and "part of speech". The prediction tag 52 is "?" in the initial state.
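One possible in-memory form of such learning data is sketched below. The field names, the character-type labels ("kanji", "hiragana", and so on), and the part-of-speech labels are assumptions made for illustration; only the four feature items and the initial "?" prediction tag come from the text. The Japanese words are assumed originals of the glosses used in this description.

```python
import unicodedata
from dataclasses import dataclass


def char_type(word: str) -> str:
    """Coarse character type of a word, judged from its first character
    (the labels are illustrative, not taken from the patent)."""
    name = unicodedata.name(word[0], "")
    if name.startswith("CJK"):
        return "kanji"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    return "other"


@dataclass
class Row:
    word: str
    notation: str              # feature: surface form
    length: int                # feature: word length in characters
    char_type: str             # feature: character type
    pos: str                   # feature: part of speech
    predicted_tag: str = "?"   # unknown in the initial state
    correct_tag: str = "O"


def to_learning_data(tagged_words):
    """Convert (word, part of speech, correct tag) triples into learning-data rows."""
    return [
        Row(word=w, notation=w, length=len(w), char_type=char_type(w),
            pos=pos, correct_tag=tag)
        for w, pos, tag in tagged_words
    ]


rows = to_learning_data([
    ("日本", "noun", "I-ORG"),
    ("放送", "noun", "I-ORG"),
    ("協会", "noun", "I-ORG"),
    ("会長", "noun", "O"),
    ("の", "particle", "O"),
])
for r in rows:
    print(r)
```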
[0047]
Next, the learning device 14, controlled by the learning control device 16, uses the set of pairs of correct answer tags and the features of the target word and surrounding words included in the learning data 46 to generate a learning result (first learning result, also referred to as a classification rule) 24-1 for determining a prediction tag from features.
[0048]
For example, when the target word is "broadcast", the learning process 60 sets in the learning device 14 a pair consisting of the features 50 of the target word and of the surrounding words "Japan" and "association" before and after it, and the tag 58 of the target word "broadcast", namely "I-ORG", and generates as a learning result a rule that yields the correct answer tag "I-ORG" when the feature data 56 is input.
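A minimal sketch of how the feature data for one target word might be assembled from the word before, the target word, and the word after is shown below. The dictionary keys, the "alpha" and "noun" labels, and the use of English glosses in place of the Japanese words are assumptions for illustration only.

```python
def row(notation, char_type, pos, tag):
    """Build one learning-data row; the length is derived from the notation."""
    return {"notation": notation, "length": len(notation),
            "char_type": char_type, "pos": pos, "tag": tag}


def window_features(rows, i):
    """Features of the word before, the target word, and the word after."""
    if i <= 0 or i >= len(rows) - 1:
        raise ValueError("the first and last words have no two-sided context")
    feats = []
    for j in (i - 1, i, i + 1):
        r = rows[j]
        feats.extend([r["notation"], r["length"], r["char_type"], r["pos"]])
    return tuple(feats)


rows = [
    row("Japan", "alpha", "noun", "I-ORG"),
    row("broadcast", "alpha", "noun", "I-ORG"),
    row("association", "alpha", "noun", "I-ORG"),
    row("chairman", "alpha", "noun", "O"),
]

# One (feature data, correct answer tag) pair per target word with context on both sides.
examples = [(window_features(rows, i), rows[i]["tag"])
            for i in range(1, len(rows) - 1)]
print(examples[0])
```

The first pair printed corresponds to the target word "broadcast" with "Japan" and "association" as its context.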
[0049]
This learning process is performed for each of the second to fifth words, "broadcast", "association", "chairman", and "no", excluding the first and last words "Japan" and "XX" of the learning data 46. Learning is repeated using every pair of the features of the target word and surrounding words and the correct answer tag 54, and the learning result 24-1 is generated. The generated learning result 24-1 is stored in the learning result database 22.
[0050]
Next, the determiner 18, controlled by the determination control device 20, uses the learning result 24-1 generated in the learning process 60 to determine the prediction tag 52 of each target word from the features of the target word and surrounding words included in the learning data 46.
[0051]
In the determination process 62 of FIG. 7, the second word of the learning data 46, "broadcast", is taken as the target word, the feature data 56 including the surrounding words "Japan" and "association" before and after it is input to the determiner 18, and determination using the learning result 24-1 yields the determination data 64. In this case, the prediction tag of the target word "broadcast" is "O". The determination process 62 is performed for all target words in the learning data 46 except the first and last words, yielding the prediction tag of each target word in the determination data 64.
[0052]
Then, the data reconstruction 66 of FIG. 8 is performed. The data reconstruction 66 generates the reconstructed learning data 68 by storing the prediction tags determined in the determination process 62 of FIG. 7 as updated values in the prediction tag 52 of the learning data.
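The data reconstruction amounts to writing the newly determined tags into the prediction-tag column. A minimal sketch, assuming rows are dictionaries with a hypothetical "predicted" key and that words which were not determined (the first and last) keep "?":

```python
def reconstruct(rows, determined_tags):
    """Return new rows whose prediction-tag column holds the latest determined tags.
    The input rows are left unchanged."""
    if len(rows) != len(determined_tags):
        raise ValueError("one determined tag is needed per row")
    return [{**r, "predicted": tag} for r, tag in zip(rows, determined_tags)]


learning_data = [
    {"word": "Japan", "predicted": "?", "correct": "I-ORG"},
    {"word": "broadcast", "predicted": "?", "correct": "I-ORG"},
    {"word": "association", "predicted": "?", "correct": "I-ORG"},
]
# "O" for "broadcast" mirrors the first-round determination described in the text.
reconstructed = reconstruct(learning_data, ["?", "O", "?"])
print(reconstructed)
```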
[0053]
Next, the learning device 14 generates a learning result (second learning result) 24-2 for determining a prediction tag from features, using the set of pairs of the correct answer tag 54 and the features, including the prediction tag 52, of the target word and surrounding words in the reconstructed learning data 68.
[0054]
Taking the target word "broadcast" as an example, the learning process 72 is performed using the feature data 70, which covers the target word and the surrounding words "Japan" and "association" before and after it and includes the updated prediction tags obtained in the determination process 62 of FIG. 7, together with "I-ORG", the correct answer tag 58 of the target word "broadcast". The learning process 72 is repeated in the same way for each of the second to fifth target words, excluding the first and last words of the reconstructed learning data 68, to generate the learning result 24-2 as the second learning result.
[0055]
Next, the determiner 18 uses the learning result 24-2, the second learning result generated in the learning process 72, to determine the tag of each target word from the features, including the prediction tags, of the target word and surrounding words included in the reconstructed learning data 68. In this example, the feature data 70 of the target word "broadcast" in the reconstructed learning data 68, paired with its correct answer tag 58 "I-ORG", is input to the determiner 18, and the determination data 76 shows the prediction tag determined according to the learning result 24-2. This determination process 74 is performed in the same way for all of the second to fifth words, excluding the first and last words of the reconstructed learning data 68.
[0056]
When the determination process 74 is completed, a determination step 78 checks whether the number of learning rounds i is smaller than a predetermined number. If it is, the process returns to the data reconstruction 66, where the prediction tags of the determination data 76 generated in the determination process 74 are placed in the prediction tag 52 of the reconstructed learning data 68; a third learning process 72 is then performed on the updated reconstructed learning data 68 to generate a learning result 24-3. Then, in the determination process 74, the prediction tags of the determination data 76 are determined from the reconstructed learning data 68 using the learning result 24-3.
[0057]
The data reconstruction using the prediction tags, the learning process on the reconstructed learning data, and the determination process using the resulting learning result are repeated until the number of learning rounds i reaches N. The learning results 24-1 to 24-N are thereby stored, and when the determination step 78 finds that the number of learning rounds has reached N, the learning process ends.
[0058]
FIG. 9 is a flowchart of the learning processing by the language processing apparatus of FIG. 2 shown in FIGS. 7 and 8, and this flowchart shows the processing contents of the learning processing program according to the present invention.
[0059]
In FIG. 9, in the learning process, the correct document data is input by the input / output device 10 in step S1, and in step S2 the data converter converts it into a set of features and tags to generate the learning data.
[0060]
Next, in step S3, the learning device 14 generates a first learning result using the set of pairs of the features of the target word and the surrounding words and the correct tag in the learning data. Subsequently, in step S4, the determiner 18 uses the first learning result to predict the tag of each word from the features of the target word and the surrounding words in the correct data. The processing of steps S1 to S4 corresponds to FIG.
[0061]
Subsequently, in step S5, the data reconstruction device 26 replaces the prediction tag of the learning data with the prediction tag acquired in step S4 to generate new learning data, that is, reconstructed learning data. Next, in step S6, the learning device 14 generates an i-th learning result using the set of pairs of the features of the target word and the surrounding words, including the prediction tags, and the correct tag in the i-th learning data. In this case, i = 2.
[0062]
Subsequently, in step S7, it is checked whether or not i is smaller than a predetermined number N. If it is smaller, i is incremented by one in step S8, and the processing from step S5 is repeated until i reaches N in step S7. The processing of steps S5 to S8 corresponds to FIG.
[0063]
FIG. 10 is a block diagram illustrating the configuration of the determination processing unit with which the language processing apparatus of the present invention automatically tags unknown document data using the learning results 24-1 to 24-N of the learning result database 22 generated by the learning processing unit of FIG. The determination processing unit is realized by a configuration that excludes the functions of the learning control unit 16 and the learning device 14 in FIG.
[0064]
FIG. 11 and FIG. 12 are explanatory diagrams of the processing procedure by the determination processing unit of FIG. 10, and the following description will be made according to the configuration of FIG.
[0065]
In FIG. 11, first, unknown document data 80 is input using the function of the input unit of the input / output device 10. The unknown document data 80 is, for example, the document data 28 as shown in FIG. Next, data conversion by the data converter 12 is performed.
[0066]
The data conversion 82 converts the unknown document data 80 into test data 84 in a format handled by the learning device 14. The test data 84 includes a word 86, a feature 88, and a prediction tag 90, and the prediction tag 90 is "?" in the initial state.
[0067]
Next, using the first learning result 24-1 of the learning result database 22, the determiner 18 determines the prediction tag of each target word from the features of the target word and the surrounding words included in the test data 84. Specifically, for the second to fifth target words, excluding the first and last words in the test data 84, and taking the second target word “broadcast” as an example, the feature data 92 including the surrounding words “Japan” and “association” is input to the determiner 18, and the learning result 24-1 is applied to the feature data 92 to determine the prediction tag shown in the determination data 96. In this case, the prediction tag of the target word "broadcast" is "O".
[0068]
Next, proceeding to FIG. 12, the data reconstructing device 26 updates the prediction tag 90 with the prediction tag determined in the determination process 94 of FIG. 11 and generates the reconstructed test data 100.
[0069]
Subsequently, the determiner 18 retrieves the second learning result 24-2 from the learning result database 22 and determines the tag of each target word from the features of the target word and the surrounding words, including the prediction tags, contained in the reconstructed test data 100. For the determination process 104, the figure takes the target word “broadcast” in the reconstructed test data 100 as an example and shows the state in which the feature data 102, including the surrounding words before and after the target word, has been input to the determiner 18 and the determination data 106 for the prediction tag has been obtained.
[0070]
When the determination of the prediction tags based on the reconstructed test data 100 is completed in determination process 104, it is checked in determination step 108 whether or not the determination count i is less than a predetermined number N. If it is less, the prediction tag 90 of the reconstructed test data 100 is updated with the prediction tags determined in determination process 104, determination process 104 is then performed on the reconstructed test data 100 using the third learning result 24-3, and the prediction tags are determined as the determination data 106.
[0071]
The data reconstruction and the determination process based on the learning result corresponding to the determination count are repeated N times, and the series of determination processes ends. Since this determination processing finally fixes the prediction tag 90 in the reconstructed test data 100 as the correct tag, the data converter 12 performs an inverse conversion that attaches the determined tag to each word of the input unknown document data, and the tagged document data is output from the input / output device 10.
[0072]
FIG. 13 shows unknown document data 80 based on the determination data obtained for the target word “broadcast” in the determination process 94 of FIG. 11. At the first determination stage, the prediction tag 88 differs from the correct tag 112. However, by repeating the determination process with the second and subsequent learning results 24-2 to 24-N shown in FIG. 12, the prediction tag 88 of the target word “broadcast” is updated to “I-ORG”, the value of the correct tag 112, so that a tag matching the correct tag can be determined.
[0073]
FIG. 14 is a flowchart of the determination process in FIGS. 11 and 12. It shows the determination processing program, which determines tags using the learning results 24-1 to 24-N obtained by the learning processing program according to the learning process flowchart of FIG. 9.
[0074]
In FIG. 14, first, unknown document data is input by the input / output device 10 in step S1; then, in step S2, the data converter converts it into a set of features and tags, sets the prediction tag to "?", and generates the test data. Next, in step S3, the determiner 18 uses the first learning result to predict the tag of each word from the features of the target word and the surrounding words in the test data. The processing of steps S1 to S3 corresponds to FIG.
[0075]
Next, in step S4, the data reconstruction device 26 replaces the prediction tag of the test data with the prediction tag acquired by the determiner 18 and generates new test data, that is, reconstructed test data. Subsequently, in step S5, using the i-th learning result, in this case the second learning result, the determiner 18 predicts the tag of each word from the features of the target word and the surrounding words, including the prediction tags, in the reconstructed test data.
[0076]
If the number of processes i is less than the predetermined value N in step S6, i is incremented by one in step S7, and the processing from step S4 is repeated. When the determination of the tags based on the N-th learning result is completed, i equals N in step S6, and the series of determination processing ends. The data converter 12 then performs the inverse conversion that attaches the determined tags to the unknown document, and the processed tagged document is output by the function of the input / output device 10 as an output unit. The processing of steps S4 to S7 corresponds to FIG.
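The determination flow above, in which the i-th learning result is applied on the i-th pass and the prediction tags are rebuilt between passes, might look like the following sketch. The two rule functions are hypothetical stand-ins for stored learning results, chosen only so that "broadcast" is first judged "O" and then corrected to "I-ORG" as in the example of the text; the sentence itself is also hypothetical.

```python
def tag_unknown(words, models):
    """Apply the learning results in order, rebuilding the prediction tags after each pass."""
    pred = ["?"] * len(words)           # initial prediction tags
    inner = range(1, len(words) - 1)    # first and last words are not target words
    for model in models:                # i = 1 .. N
        snapshot = list(pred)           # tags from the previous pass
        for i in inner:
            feats = (words[i - 1], snapshot[i - 1], words[i], words[i + 1], snapshot[i + 1])
            pred[i] = model(feats)
    return pred

ORG_WORDS = {"Japan", "association"}    # illustrative assumption

def first_rule(feats):
    """Stand-in for learning result 24-1: looks at the target word only."""
    return "I-ORG" if feats[2] in ORG_WORDS else "O"

def second_rule(feats):
    """Stand-in for learning result 24-2: also uses the neighbours' prediction tags."""
    if feats[1] == "I-ORG" and feats[4] == "I-ORG":
        return "I-ORG"
    return first_rule(feats)

words = ["the", "Japan", "broadcast", "association", "said", "so"]
after_first = tag_unknown(words, [first_rule])
after_second = tag_unknown(words, [first_rule, second_rule])
```

After the first pass "broadcast" carries "O"; after the second pass the tags of its neighbours pull it to "I-ORG".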
[0077]
FIG. 15 shows another embodiment of the learning processing unit in the language processing apparatus according to the present invention. In this embodiment, English document data is targeted as the correct answer document data. Because an English document, unlike a Japanese one, does not require morphological analysis to separate words, the morphological analyzer 11 provided in the embodiment of FIG. 2 is omitted; the rest of the configuration is the same.
[0078]
FIG. 16 and FIG. 17 are explanatory diagrams of the learning processing performed by the learning processing unit of FIG. 15 on the correct English document data.
[0079]
In FIG. 16, the correct answer document data 114 in English is input by a function as an input unit of the input / output device 10, and converted into learning data 118 that can be handled by the learning device 14 by the data converter 12. The learning data 118 is composed of words 120, features 122, prediction tags 124, and correct answer tags 125. The features 122 are "notation", "word length", and "capital letter"; "part of speech", which is used for the Japanese correct answer document data in FIG. 4, is excluded.
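As a rough illustration of this data conversion, the sketch below builds one row per word with the three features named above. The class and function names are hypothetical, the reading of "capital letter" as "the word starts with an upper-case letter" is an assumption, and the sample sentence and its tags are invented apart from the word "Reserve".

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Row:
    word: str
    notation: str                      # surface form of the word
    word_length: int
    capital_letter: bool               # assumption: True if the word starts with an upper-case letter
    prediction_tag: str = "?"          # initial state
    correct_tag: Optional[str] = None  # present only in learning data

def to_learning_data(words: List[str], correct_tags: List[str]) -> List[Row]:
    """Convert a tagged word sequence into rows of features, prediction tag, and correct tag."""
    return [Row(w, w, len(w), w[:1].isupper(), "?", t) for w, t in zip(words, correct_tags)]

# Hypothetical sentence and tags, for illustration only.
rows = to_learning_data(
    ["The", "Federal", "Reserve", "Board", "met", "today"],
    ["O", "I-ORG", "I-ORG", "I-ORG", "O", "O"],
)
```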
[0080]
The learning device 14 performs a learning process 130 on the learning data 118 to generate a learning result 24-1. Subsequently, in determination process 132, the determiner 18 uses the learning result 24-1: for each target word in the learning data 118, for example the target word "Reserve", the feature data 128 including the surrounding words is input, and the prediction tag of the target word is determined as the determination data 134.
[0081]
Next, data reconstruction 136 in FIG. That is, the prediction tag determined in the determination process 132 in FIG. 16 is taken into the learning data, and the reconstructed learning data 138 updated is generated.
[0082]
Next, the learning device 14 performs a learning process 152. In the learning process 152, the determined prediction tags 144 are included in the feature data 148 of the target word in the reconstructed learning data 138, learning is performed using the set of pairs of this feature data and the correct answer tag 146, and learning result 24-2 is generated.
[0083]
Using the second learning result 24-2, the determination unit 18 inputs the feature 142 of the reconstructed learning data 138 and performs a determination process 154, and determines a prediction tag as the determination data 156. This process is repeated until the number of processes i reaches the predetermined number N in the determination step 158, whereby the learning results 24-1 to 24-N are stored in the learning result database 22.
[0084]
FIG. 18 is a block diagram of the determination processing unit that tags each word of unknown English document data using the learning results 24-1 to 24-N of the learning result database 22 generated by the learning processing unit of FIG. The functions of the learning device 14 and the learning controller 16 are omitted, and the rest of the configuration is the same.
[0085]
FIG. 19 and FIG. 20 are explanatory diagrams of the processing procedure of the determination processing for tagging the unknown English document data of FIG.
[0086]
In FIG. 19, the unknown document data 160 in English is input by the function of the input unit of the input / output device 10, and the data converter 12 performs data conversion 162 to convert it into test data 164 that can be processed by the learning device 14. Next, the determiner 18 determines the prediction tags from the test data 164 using the first learning result 24-1, and the determination data 176 is generated.
[0087]
Next, proceeding to FIG. 20, the data reconstruction device 26 performs data reconstruction 178 and generates reconstructed test data 180, in which the prediction tag 186 has been updated by incorporating the prediction tags determined in the determination process 174 of FIG. 19.
[0088]
Next, the features of the target word and the surrounding words including the prediction tag 186 of the reconstructed test data 180 are input by the determiner 18, and the determination is performed using the second learning result 24-2. Thus, the prediction tag of each target word is determined.
[0089]
The data reconstruction 178 and the determination processing 190 are repeated for the remaining learning results 24-3 to 24-N until i reaches N in the determination step 194, and the prediction tag of each word is determined. The determined tags are then attached to the unknown English document data by the inverse conversion of the data converter 12, and the result is output to the outside.
[0090]
FIG. 21 shows another embodiment of the learning processing unit in the language processing apparatus according to the present invention. This embodiment is characterized in that the learning results based on the correct answer document data are generated in a simplified manner.
[0091]
The learning processing unit of the embodiment in FIG. 21 has a configuration in which the determiner 18, the determination control device 20, and the data reconstruction device 26 are removed from the configuration of the learning processing unit in FIG. The learning processing unit of this embodiment first generates a learning result 240-1 from pairs of the features of the target word and the surrounding words in the correct answer document data and the correct answer tag of the target word, without using prediction tags. Next, it generates a learning result 240-2 from pairs of the features of the target word and the surrounding words, in which the correct tags are used as the prediction tags, and the correct tag of the target word.
[0092]
FIG. 22 and FIG. 23 are explanatory diagrams of the processing procedure in the simplified-type learning processing unit in FIG. 21, in which a Japanese document is to be processed.
[0093]
In FIG. 22, first, the correct answer document data 196 is input by the function of the input unit of the input / output device 10. The correct answer document data 196 is the same as the correct answer document data 30 in FIG. 5, for example. Next, the data converter 12 performs data conversion 198 to convert the data into learning data 200 that can be handled by the learning device 14. The learning data 200 does not include the item of the prediction tag.
[0094]
Next, pairs of the features of the target word and the surrounding words in the learning data 200 and the correct answer tag are input to the learning device 14. Taking the target word “broadcast” as an example, the feature data 202, which includes the surrounding words “Japan” and “association”, and the correct answer tag 204 “I-ORG” of the target word are input to the learning device 14, and the learning process 206 generates a learning result 240-1 as the first learning result.
[0095]
Of course, the learning process 206 is performed repeatedly on each pair of features, including the surrounding words before and after, and correct tag, taking the second to fifth words as target words and excluding the first and last words in the learning data 200, to generate the learning result 240-1.
[0096]
Next, the processing proceeds to FIG. 23, in which data reconstruction 208 is performed on the correct answer document data 196 to generate learning data 210. A new prediction tag item is added to the learning data 210.
[0097]
Next, the learning device 14 performs a learning process 218 using the set of pairs of the features of the target word and the surrounding words in the learning data 210, with the correct tags used as the prediction tags, and the correct tag of the target word, and a learning result (second learning result) 240-2 is generated.
[0098]
For example, for the second target word “broadcast”, the feature data 212 contains the correct answer tags of the surrounding words “Japan” and “association”: “I-ORG” as the prediction tag 214 and “I-ORG” as the prediction tag 216. The feature data 212 and the correct tag 204 “I-ORG” of the target word are input to the learning device 14, and the learning result 240-2 is generated as a rule for obtaining a prediction tag from the feature data 212. Of course, in the learning process 218, the learning process is performed repeatedly using the second to fifth words in the learning data 210, excluding the first and last words, as target words.
[0099]
FIG. 24 is a flowchart of the simplified learning process in FIG. 21, which shows the contents of a simple learning processing program. First, in step S1, a Japanese correct answer document is input from the input / output device 10, and in step S2, the data is converted into a set of features and tags by the data converter 12, thereby generating first learning data. No prediction tag is provided in the first learning data.
[0100]
Next, in step S3, a first learning result is generated by the learning device 14 using the set of pairs of the feature and the correct answer tag of the target word and surrounding words in the learning data. The processing of steps S1 to S3 corresponds to FIG.
[0101]
Next, in step S4, the data converter 12 converts the correct answer document data into features and tags, newly provides a prediction tag, and sets it to "?" to generate second learning data. Finally, in step S5, the learning device 14 generates a second learning result using the pairs of the features of the target word and the surrounding words in the second learning data, with the correct tags placed in the prediction tags, and the correct tag of the target word.
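A minimal sketch of this simplified learning is shown below, assuming a toy learner that memorizes the most frequent tag per feature tuple; the learner, the function names, and the six-word sentence are illustrative assumptions. The point it illustrates is that the second learning result is trained with the neighbours' correct tags standing in for prediction tags, so no determination pass is needed during learning.

```python
from collections import Counter, defaultdict

def train(examples):
    """Toy learner (assumption): remember the most frequent correct tag per feature tuple."""
    counts = defaultdict(Counter)
    for feats, tag in examples:
        counts[feats][tag] += 1
    return {feats: c.most_common(1)[0][0] for feats, c in counts.items()}

def simple_learning(words, correct_tags):
    inner = range(1, len(words) - 1)    # first and last words are not target words
    # First learning result: word features only, no prediction tags.
    first = train([((words[i - 1], words[i], words[i + 1]), correct_tags[i]) for i in inner])
    # Second learning result: the neighbours' correct tags are used as prediction tags.
    second = train([
        ((words[i - 1], correct_tags[i - 1], words[i], words[i + 1], correct_tags[i + 1]),
         correct_tags[i])
        for i in inner
    ])
    return first, second

# Hypothetical sentence and tags, for illustration only.
words = ["the", "Japan", "broadcast", "association", "said", "so"]
correct = ["O", "I-ORG", "I-ORG", "I-ORG", "O", "O"]
first, second = simple_learning(words, correct)
```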
[0102]
FIG. 25 is a block diagram of a simple determination processing unit for tagging Japanese unknown document data using the learning result generated by the simple learning processing unit of FIG. This simple determination processing unit is basically the same as the embodiment of FIG. 10, except that the learning result database 22 has only two learning results 240-1 and 240-2.
[0103]
FIGS. 26 and 27 are explanatory diagrams of the processing procedure of the simple determination processing unit in FIG. In FIG. 26, first, unknown document data 220 in Japanese is input through the input / output unit of the input / output device 10, and is converted by the data converter 12 into test data 224 in a format that can be handled by the determiner.
[0104]
Next, using the first learning result 240-1, the determiner 18 determines the prediction tag of each target word in the determination process 228 from the features of the target word and the surrounding words included in the test data 224, for example from the feature data 226 of the target word “broadcast” and the surrounding words “Japan” and “association”, and the determination data 230 is generated.
[0105]
Next, proceeding to FIG. 27, the data reconstructing device 26 generates reconstructed test data 234 in which the prediction tag determined in the determination process 228 of FIG.
[0106]
Next, using the second learning result 240-2, the determiner 18 uses the features of the target word and the peripheral words including the prediction tag included in the reconstructed test data 234, for example, in the case of the target word “broadcast”, The determination processing 238 for determining the tag of the target word from the characteristic data 236 is performed, and the determination data 242 is generated.
[0107]
Of course, the determination step 244 determines a prediction tag using the learning result 240-2 for each of the second to fifth target words excluding the first and last words in the reconstructed test data 234.
[0108]
In the next determination step 244, if the number of processes i is less than the predetermined value N, the process returns to the data reconstruction 232, the prediction tags obtained in the determination process 238 are used to update and regenerate the reconstructed test data 234, and the determination process using the same learning result 240-2 is repeated. When the number of processes reaches N, the prediction tags of the reconstructed test data 234 obtained at that point are taken as the tag determination result. The data converter 12 then performs the inverse conversion that attaches the determined tag to each word of the input unknown document data, and the tagged document data is output to the outside by the function of the input / output device 10 as an output unit.
[0109]
FIG. 28 is a flowchart of the simple determination processing unit of FIG. 25, and this flowchart corresponds to the content of the simple determination processing program. In FIG. 28, in step S1, Japanese unknown document data is input by the input / output device 10, and in step S2, the data converter 12 converts it into a set of features and tags and sets the prediction tag to "?" to generate the test data.
[0110]
Next, in step S3, the determiner 18 uses the first learning result to predict the tag of each word from the features of the target word and the surrounding words in the test data. The processing of steps S1 to S3 corresponds to FIG. Next, in step S4, the data reconstructing device 26 replaces the prediction tag of the test data with the prediction tag acquired in step S3, and new test data is generated as reconstructed test data. Then, using the second learning result, the tag of each word is predicted from the features of the target word and the surrounding words, including the prediction tags, in the reconstructed test data.
[0111]
The processing of steps S4, S5, and S6 is repeated, with i incremented by one in step S7, until the number of processes reaches the predetermined number N. When the N determination passes are completed, the inverse conversion that attaches the determined tag to each word of the unknown document data is performed, and the result is output to the outside.
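The simple determination differs from the full procedure in that only two learning results exist: the first is applied once, and the same second learning result is then reapplied on every later pass. The sketch below illustrates this under the assumption that the count N includes the first pass; the two rule functions and the sentence are hypothetical stand-ins, not the learned rules of the embodiment.

```python
def simple_tagging(words, first_model, second_model, n_rounds):
    """Pass 1 uses the first learning result; passes 2..N reuse the second one."""
    inner = range(1, len(words) - 1)    # first and last words are not target words
    pred = ["?"] * len(words)
    for i in inner:
        pred[i] = first_model((words[i - 1], words[i], words[i + 1]))
    for _ in range(n_rounds - 1):
        snapshot = list(pred)           # data reconstruction: tags from the previous pass
        for i in inner:
            pred[i] = second_model(
                (words[i - 1], snapshot[i - 1], words[i], words[i + 1], snapshot[i + 1])
            )
    return pred

ORG_WORDS = {"Japan", "association"}    # illustrative assumption

def first_model(feats):
    _, word, _ = feats
    return "I-ORG" if word in ORG_WORDS else "O"

def second_model(feats):
    _, left_tag, word, _, right_tag = feats
    if left_tag == "I-ORG" and right_tag == "I-ORG":
        return "I-ORG"
    return "I-ORG" if word in ORG_WORDS else "O"

words = ["the", "Japan", "broadcast", "association", "said", "so"]
one_pass = simple_tagging(words, first_model, second_model, n_rounds=1)
three_passes = simple_tagging(words, first_model, second_model, n_rounds=3)
```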
[0112]
FIGS. 29 and 30 show the same simple learning processing procedure as that of the embodiment of FIG. 21 for English correct answer document data. Since the simple learning process for the correct answer document data of the English sentence can be realized by a configuration as a learning processing unit excluding the morphological analysis device 11 in FIG. 21, the processing procedures in FIG. 29 and FIG. Description will be made with reference to FIG. 25 as in the case of Japanese.
[0113]
In FIG. 29, the correct answer document data 246 in English is input by the function of the input unit of the input / output device 10, and is converted by the data converter 12 into learning data 250 that can be handled by the learning device 14.
[0114]
Subsequently, the learning device 14 performs the learning process on the pairs of the features of the target word and the surrounding words in the learning data 250 and the correct tag, for example the feature data 252 of the target word “Reserve”, which includes the features of the surrounding words, and its correct tag 254, “I-ORG”, and a learning result 240-1 for determining a prediction tag from the feature data 252 is generated.
[0115]
Subsequently, the process proceeds to FIG. 30, where the data converter 12 generates second learning data 260 including a prediction tag from the correct answer document data 246 in English. A learning process 268 is performed on the learning data 260 by the learning device 14 to generate a learning result 240-2.
[0116]
In the learning process 268, taking the target word “Reserve” as an example, a learning result 240-2 for determining the prediction tag is generated from the pair of the feature 262 of the target word and the surrounding words, in which the correct tags are used as the prediction tags 264 and 265, and the correct tag 266 of the target word. Such learning is performed repeatedly for each of the second to fifth words, excluding the first and last words of the learning data 260, to obtain the learning result 240-2.
[0117]
FIGS. 31 and 32 show the processing procedure of the simple determination process, in which tags are determined for unknown English document data using the two learning results 240-1 and 240-2 for English text obtained by the simple learning process of FIGS. 29 and 30.
[0118]
Since this simple judgment processing procedure can be realized with the same configuration as the simple judgment processing unit for Japanese documents shown in FIG. 25, a simple judgment processing for English unknown document data will be described with reference to FIG. The following describes the determination processing.
[0119]
In FIG. 31, unknown document data 270 in English is input by the function of the input unit of the input / output device 10, and is converted by the data converter 12 into test data 274 in a format that can be handled by the determiner 18.
[0120]
Next, the determiner 18 performs a determination process 278 that determines the prediction tag of each target word from the features of the target word and the surrounding words in the test data 274 using the first learning result 240-1, and generates determination data 280.
[0121]
Next, the processing proceeds to FIG. 32, in which the data reconstruction device 26 inserts the prediction tags obtained in the determination processing 278 in FIG. 31 into the prediction tag field of the test data to generate reconstructed test data 284.
[0122]
Subsequently, the determiner 18 performs the determination processing 288 on the feature data 286 of the reconstructed test data 284 using the second learning result 240-2, and the determination data 290, in which the prediction tag of each word is determined, is generated.
[0123]
The data reconstruction 282 and the determination process 288 are repeated using the same learning result 240-2 until the number of processes i reaches the predetermined number N in the determination step 292, and the prediction tags of the reconstructed test data 284 are fixed as the correct tags.
[0124]
Then, the data converter 12 performs the inverse conversion that attaches the determined tags to the input unknown English document data, and the processed tagged English document data is output to the outside by the function of the output unit of the input / output device 10.
[0125]
FIG. 33 shows another embodiment of the document processing apparatus of the present invention. In this embodiment, an external tagging system interface 296 is further provided in the embodiment of FIG.
[0126]
The external tagging system interface 296 passes the Japanese correct document data or unknown document data received by the function of the input unit of the input / output device 10 to the external tagging system 298 via a network such as the Internet and acquires the tagging result. The learning process and the determination process are then performed using the external tags obtained from the external tagging system 298 as prediction tags.
[0127]
FIG. 33 shows an application to the document processing apparatus for Japanese document data shown in FIG. 2, but the interface can be applied in the same way to the language processing apparatus for English document data, which has no morphological analyzer 11. The embodiment of FIG. 33 takes as an example the case where the learning results 24-1 to 24-N are generated and used, but the same applies to the embodiment of the simple language processing device, in which the two learning results 240-1 and 240-2 are generated and used.
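One way to picture the role of the external tags is as a replacement for the initial "?" prediction tags. The sketch below assumes the external system can be wrapped as a callable that returns one tag per word; the function names and the stub tagger are hypothetical and do not describe any real service or the actual interface 296.

```python
def initial_prediction_tags(words, external_tagger=None):
    """Return the starting prediction tags for a word sequence."""
    if external_tagger is None:
        return ["?"] * len(words)          # default initial state
    tags = list(external_tagger(words))    # in practice this could be a call over a network
    if len(tags) != len(words):
        raise ValueError("external tagger must return exactly one tag per word")
    return tags

def stub_external_tagger(words):
    """Stand-in for an external tagging system (purely illustrative rule)."""
    return ["I-ORG" if w[:1].isupper() else "O" for w in words]

default_tags = initial_prediction_tags(["Japan", "broadcast"])
external_tags = initial_prediction_tags(["Japan", "broadcast"], stub_external_tagger)
```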
[0128]
Next, specific processing contents in the learning device and the determination device will be described by taking the learning process in the embodiment in FIG. 2 and the determination process in the embodiment in FIG. 10 using the learning result in FIG. 2 as an example.
[0129]
FIG. 34 is an explanatory diagram of performing the first learning from the correct answer data in the learning processing of the present invention and generating a first classification rule as a first learning result. In the following specific example of the learning process, a case where a decision tree generation algorithm, which will be described later, is used as a learning device is taken as an example.
[0130]
In FIG. 34, the correct answer tag 304 is added to the correct answer document data 300 for the word 302. The correct answer document data 300 exemplifies a document that states, "Wimbledon held in England in 2002 marked the beginning of a turbulence. Winner Selena Williams lost to Maria Goose of Australia in the first round." The document of the correct answer document data 300 is divided into parts of speech by morphological analysis, arranged as shown in the figure, and a correct answer tag 304 is attached to each.
[0131]
The decision tree generator 306, which functions as the learning device, learns the correct answer document data 300 in accordance with the decision tree algorithm and thereby generates a first classification rule 308 serving as a first decision tree. The first classification rule 308 generated in this manner is stored as the first classification rule 310 in the learning result database 22-1, which is empty in the initial state, yielding the learning result database 22-2, as shown in FIG.
[0132]
Next, as shown in FIG. 36, the prediction tag 312-1 of the correct document data 300 shown on the right side is determined by a determination process in which the generated first classification rule 308 is applied to the word 302 of the correct document data 300.
[0133]
Next, as shown for the correct answer document data 300 in FIG. 37, data reconstruction is performed by incorporating the prediction tag 312-1 determined in FIG. 36. The reconstructed correct answer document data 300 is input to the decision tree generator 314, the second learning is performed, and a second classification rule 316 is generated as a second decision tree.
[0134]
Next, as shown in FIG. 38, in addition to the first classification rule 308 already stored in the learning result database 22-2, the second classification rule 316 generated in FIG.
[0135]
Next, as shown in FIG. 39, the second classification rule 316 is applied to the word 302 of the correct document data 300 carrying the previous prediction tag 312-1, and the prediction tag 312-2 of the correct document data 300, shown on the right side, is determined and assigned.
[0136]
The above is the basic processing of the learning process of the present invention shown in the embodiment of FIG. 2. In practice, the third learning, the fourth learning, and so on are performed in the same way to generate the third to N-th classification rules. Needless to say, in the present invention the process may also stop after the second classification rule has been created.
[0137]
Here, an algorithm for creating a classification rule, that is, a decision tree in the learning process of the present invention will be described.
[0138]
FIG. 40 shows a function MakeTree 320 that generates a decision tree in the learning processing of the present invention. There are a plurality of algorithms for the function MakeTree 320 for generating this decision tree. For example, the following decision tree generation method is used.
[0139]
The decision tree generation method used in the present invention is the one proposed in, for example, Quinlan, “C4.5: Programs for Machine Learning” (1993) (Japanese translation “Data analysis by AI”). The outline of the algorithm is shown below. Various measures can be used for the correlation between the category and a feature, such as information gain, mutual information, and the χ² test statistic.
[0140]
[Decision Tree Generation Algorithm GenerateTree (Learning Set, Tree)]
Step S1: If the learning set is empty, assign the default category and exit the function.
Step S2: If all categories in the learning set are the same, add to the tree a node that outputs that category and exit the function.
Step S3: From the features, select the feature having the strongest correlation with the category.
Step S4: Add a division by the selected feature to the tree.
Step S5: Divide the learning set by the selected feature.
Step S6: Call GenerateTree() with each of the divided learning sets as an argument.
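Steps S1 to S6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the function in FIG. 40: information gain is chosen as the correlation measure, the tree is returned as a nested dictionary rather than passed in as an argument, and a majority-vote leaf is added when no features remain (a stopping rule not stated in the text). The names `generate_tree`, `information_gain`, and `classify` are hypothetical.

```python
import math
from collections import Counter

def entropy(tags):
    total = len(tags)
    return -sum(n / total * math.log2(n / total) for n in Counter(tags).values())

def information_gain(examples, feature):
    """Entropy reduction obtained by dividing the examples on one feature."""
    tags = [tag for _, tag in examples]
    groups = {}
    for feats, tag in examples:
        groups.setdefault(feats[feature], []).append(tag)
    remainder = sum(len(g) / len(tags) * entropy(g) for g in groups.values())
    return entropy(tags) - remainder

def generate_tree(examples, features, default="O"):
    if not examples:                       # S1: empty learning set
        return default
    tags = [tag for _, tag in examples]
    if len(set(tags)) == 1:                # S2: single category, make a leaf
        return tags[0]
    majority = Counter(tags).most_common(1)[0][0]
    if not features:                       # assumption: no feature left
        return majority
    best = max(features, key=lambda f: information_gain(examples, f))  # S3
    node = {"feature": best, "branches": {}}                           # S4
    subsets = {}
    for feats, tag in examples:                                        # S5
        subsets.setdefault(feats[best], []).append((feats, tag))
    rest = [f for f in features if f != best]
    for value, subset in subsets.items():                              # S6
        node["branches"][value] = generate_tree(subset, rest, majority)
    return node

def classify(tree, feats, default="O"):
    """Follow the divisions until a leaf (a category string) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(feats[tree["feature"]], default)
    return tree
```

Each example is a pair of a feature dictionary and a correct tag, for instance `({"char_type": "katakana", "length": "short"}, "I-PER")`.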
[0141]
With this decision tree generation algorithm, the main flow of the decision tree generation processing when the correct answer document data 300 is input is as follows. First, after passing through steps S1 and S2, the character type is selected as the dividing feature of the correct answer document data 300. In this case, the character type takes four values: kanji, hiragana, katakana, and alphanumeric characters.
[0142]
Next, in step S4, the learning set, that is, the correct answer document data 300, is decomposed according to the character type. As shown in FIG. 41, it is decomposed into the kanji set 300-1, the hiragana set 300-2, the katakana set 300-3, and the alphanumeric set 300-4.
[0143]
Subsequently, in step S6, the function GenerateTree is called with each decomposed set as input. Taking the kanji set 300-1 as an example, every kanji word in the correct answer document data 300 carries the category “O” as its correct tag, so a node that outputs “O” as the tag is added to the tree. By repeating this for the other sets, the first classification rule 308 shown in FIG. 34, for example, is generated.
[0144]
Here, the first classification rule 308 in FIG. 34 does not incorporate the features of the peripheral words, whereas the second classification rule 316, generated after the prediction tags are added, does incorporate them.
[0145]
FIG. 42 is a specific example of a determination process that assigns prediction tags by applying the first classification rule 308 and the second classification rule 316, generated by the learning processes of FIGS. 34 to 39, to unknown document data.
[0146]
In FIG. 42, "Mika Hakkinen unfortunately missed the victory in the match at San Marino in 1998" is input as the unknown document data 322, and the data is divided into words 324 with parts of speech by morphological analysis.
[0147]
A prediction tag 326-1 is assigned to each of the words 324 by a determination process that applies the first classification rule 308 stored in the learning result database 22 to the unknown document data 322.
[0148]
Next, the second classification rule 316 of the learning result database 22 is applied to the unknown document data 322 to which the prediction tag 326-1 is added, and the prediction tag 326-2 is added to each of the words 324.
[0149]
FIG. 43 shows the assignment of prediction tags by applying the first classification rule 308 in the first determination process of FIG. 42 to “Mika Hakkinen is a match”, a fragment of the unknown document data.
[0150]
Here, the first classification rule 308 has a tree structure. For example, as shown by an arrow 308-1, the target word “match” passes through the nodes “character type = kanji”, “word length <= 3”, and “part of speech = others”, so the prediction tag “O” is determined. In this case, the answer is correct.
[0151]
On the other hand, for “Mika”, the prediction tag “I-LOC” is determined through the nodes “character type = Katakana” and “word length < 4”, as indicated by an arrow 308-2. In this case, the determined prediction tag is incorrect.
[0152]
FIG. 44 is a specific example of the determination processing according to the second classification rule 316, in which the prediction tags determined by the first classification rule 308 in FIG. 43 are added to the features. First, the target word “match”, which was already tagged correctly by the first classification rule 308, again receives the prediction tag “O” from the second classification rule 316, as indicated by an arrow 316-2, which is also correct.
[0153]
On the other hand, as shown by an arrow 316-3, the target word “Mika” passes through the nodes “character type = Katakana” and “word length < 4” and then the node “back tag = I-PER” (the prediction tag of the following word), so its tag is determined to be “I-PER”. The answer is thus correct under the second classification rule 316, which incorporates the prediction tags of the peripheral words into the features.
[0154]
As described above, according to the present invention, the tags of the peripheral words of a target word in the unknown document are first predicted by the first classification rule (first learning result) 308. The second classification rule (second learning result) 316, which uses the prediction tags of the peripheral words as features, then determines the tag of the target word with high accuracy from its correlation with those prediction tags.
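The two-stage determination of FIGS. 43 and 44 can be sketched as follows. The rule bodies are hand-written stand-ins that mirror only the nodes mentioned in the text; everything else is an assumption made for illustration, including the prediction “I-PER” for katakana words of length 4 or more, the field names, and the word lengths (counted in Japanese characters).

```python
def first_rule(w):
    """Hypothetical stand-in for the first classification rule 308."""
    if w["char_type"] == "katakana":
        # Node "word length < 4" leads to "I-LOC"; the other branch is assumed.
        return "I-LOC" if w["length"] < 4 else "I-PER"
    return "O"

def second_rule(w):
    """Hypothetical stand-in for the second classification rule 316."""
    if w["char_type"] == "katakana":
        if w["length"] < 4:
            # Node "back tag = I-PER": look at the following word's prediction tag.
            return "I-PER" if w["next_tag"] == "I-PER" else "I-LOC"
        return "I-PER"
    return "O"

def determine(words):
    """Apply the first rule, add neighbour prediction tags, apply the second rule."""
    first = [first_rule(w) for w in words]
    enriched = []
    for i, w in enumerate(words):
        prev_tag = first[i - 1] if i > 0 else "BOS"
        next_tag = first[i + 1] if i + 1 < len(words) else "EOS"
        enriched.append({**w, "prev_tag": prev_tag, "next_tag": next_tag})
    return first, [second_rule(w) for w in enriched]

words = [
    {"surface": "Mika", "char_type": "katakana", "length": 2},
    {"surface": "Hakkinen", "char_type": "katakana", "length": 5},
    {"surface": "wa", "char_type": "hiragana", "length": 1},
    {"surface": "match", "char_type": "kanji", "length": 2},
]
first_tags, second_tags = determine(words)
```

With these stand-in rules, “Mika” is tagged “I-LOC” in the first pass and corrected to “I-PER” in the second pass, because the following word “Hakkinen” was predicted as “I-PER”.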
[0155]
FIG. 45 compares the feature information used by the conventional methods and by the method of the present invention when tagging the target word “Hakkinen” in “Mika Hakkinen is in the game”, taken from the same unknown document data as in FIGS. 43 and 44.
[0156]
FIG. 45 (A) shows the conventional method 328, in which the tags of the peripheral words are not included in the features; no consideration is given to which tags the preceding and following words are determined to have.
[0157]
FIG. 45 (B) shows the method 330 of Yamada, shown in the prior art (A), in which the prediction tag of a peripheral word on one side only, for example the preceding side, is used for the target word. This works effectively only when a word before the target word provides a clue that the tag of the target word “Hakkinen” is “I-PER”. For example, if the target word is “Mika”, the clue “I-PER” of the peripheral word “Hakkinen” cannot be used, because it is the prediction tag of a following word.
[0158]
FIG. 45 (C) shows the technique 332 of the present invention, in which the target word “Hakkinen” uses as features the prediction tags of a plurality of peripheral words both before and after it. Therefore, if there is a clue anywhere around the target word, it can be used to predict the tag accurately.
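The difference between the three methods of FIG. 45 comes down to which neighbouring prediction tags enter the feature set. A minimal sketch follows, assuming a hypothetical helper `tag_features`, a window of two words on each side, and hypothetical feature names such as `tag[-1]`; none of these are specified in the text.

```python
def tag_features(tags, i, mode, window=2):
    """Return the neighbour-tag features of word i under one of three methods.

    mode "none":      FIG. 45 (A), no peripheral tags are used
    mode "preceding": FIG. 45 (B), only tags of preceding words are used
    mode "both":      FIG. 45 (C), tags on both sides are used
    """
    before = {f"tag[-{k}]": tags[i - k]
              for k in range(1, window + 1) if i - k >= 0}
    after = {f"tag[+{k}]": tags[i + k]
             for k in range(1, window + 1) if i + k < len(tags)}
    if mode == "none":
        return {}
    if mode == "preceding":
        return before
    if mode == "both":
        return {**before, **after}
    raise ValueError(f"unknown mode: {mode}")
```

For the first word of a sentence, the "preceding" mode yields no tag features at all, which is why a clue carried by a following word is lost under method (B).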
[0159]
It should be noted that the present invention is not limited to the above embodiment, but includes appropriate modifications without impairing the objects and advantages thereof. Further, the present invention is not limited by the numerical values shown in the above embodiments.
[0160]
Here, the features of the present invention are listed as follows.
(Supplementary Notes)
(Supplementary Note 1)
A language processing method comprising:
a learning processing step of generating a plurality of learning results by using a set of pairs of a correct tag and the features of a target word and of the peripheral words on both sides thereof included in tagged correct answer document data; and
a determination processing step of determining, by a determination processing device using the plurality of learning results, a tag of each target word from the features of the target word and of the peripheral words on both sides thereof included in unknown document data to be tagged. (1)
[0161]
(Supplementary Note 2)
The language processing method according to Supplementary Note 1, wherein the learning processing step includes:
an input step of inputting, by an input device, the tagged correct answer document data from the outside;
a data conversion step of converting, by a data converter, the correct answer document data into learning data in a format handled by a learning device;
a first learning step of generating a first learning result for determining a prediction tag from the features, by using a set of pairs of a correct tag and the features of a target word and peripheral words included in the learning data;
a first determination step of determining, by using the first learning result generated in the first learning step, a prediction tag of each target word from the features of the target word and the peripheral words included in the learning data;
a data reconstruction step of incorporating, by a data reconstructing device, the prediction tag determined in the first determination step into a part of the features of the learning data, thereby reconstructing the learning data;
a second learning step of generating, by the learning device, a second learning result for determining a prediction tag from the features, by using a set of pairs of a correct tag and the features, including the prediction tags, of the target word and the peripheral words included in the learning data reconstructed in the data reconstruction step;
a second determination step of determining, by a determiner using the second learning result generated in the second learning step, a tag of each target word from the features, including the prediction tags, of the target word and the peripheral words included in the learning data; and
a learning repetition step of generating the plurality of learning results subsequent to the second learning result by repeating the data reconstruction step, the second learning step, and the second determination step a plurality of times. (2)
[0162]
(Supplementary Note 3)
The language processing method according to Supplementary Note 2, wherein the determination processing step includes:
an input step of inputting, by the input device, unknown document data to be tagged from the outside;
a data conversion step of converting, by the data converter, the unknown document data into test data in a format handled by the learning device;
a third determination step of determining, by the determiner using the first learning result, a prediction tag of each target word from the features of the target word and the peripheral words included in the test data;
a second data reconstruction step of incorporating the prediction tag determined in the third determination step into a part of the features of the learning data, thereby reconstructing the data;
a fourth determination step of determining, by the determiner using the second learning result, a tag of each target word from the features, including the prediction tags, of the target word and the peripheral words included in the test data reconstructed in the second data reconstruction step;
a determination repetition step of determining the tag of the unknown document data by repeating the data reconstruction step and the fourth determination step a plurality of times while switching to the learning results subsequent to the second learning result; and
an output step of inversely converting the unknown document data into document data with the determined tags and outputting the document data. (3)
[0163]
(Supplementary Note 4)
The language processing method according to Supplementary Note 1, wherein the learning processing step includes:
an input step of inputting, by an input device, the tagged correct answer document data from the outside;
a data conversion step of converting, by a data converter, the correct answer document data into learning data in a format, handled by a learning device, that does not include prediction tags;
a first learning step of generating a first learning result for determining a prediction tag from the features, by using a set of pairs of a correct tag and the features of a target word and peripheral words included in the learning data;
a data conversion step of converting, by the data converter, the correct answer document data into learning data in a format, handled by the learning device, that includes prediction tags; and
a second learning step of generating, by the learning device, a second learning result for determining a prediction tag from the features, by using a set of pairs of a correct tag and the features of the target word and the peripheral words included in the learning data, the features including the correct tags of the peripheral words as prediction tags. (4)
[0164]
(Supplementary Note 5)
The language processing method according to Supplementary Note 4, wherein the determination processing step includes:
an input step of inputting, by the input device, unknown document data to be tagged from the outside;
a data conversion step of converting, by the data converter, the unknown document data into test data in a format handled by the learning device;
a first determination step of determining, by a determiner using the first learning result, a prediction tag of each target word from the features of the target word and the peripheral words included in the test data;
a data reconstruction step of incorporating the prediction tag determined in the first determination step into a part of the features of the learning data, thereby reconstructing the data;
a second determination step of determining, by the determiner using the second learning result, a tag of each target word from the features, including the prediction tags, of the target word and the peripheral words included in the test data reconstructed in the data reconstruction step;
a determination repetition step of determining the tag of the unknown document data by repeating the data reconstruction step and the second determination step a plurality of times using the second learning result; and
an output step of inversely converting the unknown document data into document data with the determined tags and outputting the document data. (5)
[0165]
(Supplementary Note 6)
The language processing method according to any one of Supplementary Notes 2 to 5, wherein
the learning processing step and the determination processing step each include a morphological analysis step of analyzing the correct answer document data or the unknown document data received by the input device to determine morphemes, and
the data conversion step converts, by the data converter, the correct answer document data or the unknown document data including the analyzed morphemes into learning data in a format handled by the learning device.
[0166]
(Supplementary Note 7)
The language processing method according to any one of Supplementary Notes 2 to 5, wherein
the learning processing step and the determination processing step each include an external tagging step of passing the correct answer document data or the unknown document data received by the input device to an external tagging system to obtain a tagging result, and
the data conversion step converts, by the data converter, the correct answer document data or the unknown document data, together with the tagging result obtained from the tagging system, into learning data in a format handled by the learning device.
[0167]
(Supplementary Note 8)
A program causing a computer to execute:
a learning processing step of generating a plurality of learning results by using a set of pairs of a correct tag and the features of a target word and of the peripheral words on both sides thereof included in tagged correct answer document data; and
a determination processing step of determining, using the plurality of learning results, a tag of each target word from the features of the target word and of the peripheral words on both sides thereof included in unknown document data to be tagged. (6)
[0168]
(Supplementary Note 9)
The program according to Supplementary Note 8, wherein the learning processing step includes:
an input step of inputting, by an input device, the tagged correct answer document data from the outside;
a data conversion step of converting, by a data converter, the correct answer document data into learning data in a format handled by a learning device;
a first learning step of generating a first learning result for determining a prediction tag from the features, by using a set of pairs of a correct tag and the features of a target word and peripheral words included in the learning data;
a first determination step of determining, by using the first learning result generated in the first learning step, a prediction tag of each target word from the features of the target word and the peripheral words included in the learning data;
a data reconstruction step of incorporating, by a data reconstructing device, the prediction tag determined in the first determination step into a part of the features of the learning data, thereby reconstructing the learning data;
a second learning step of generating, by the learning device, a second learning result for determining a prediction tag from the features, by using a set of pairs of a correct tag and the features, including the prediction tags, of the target word and the peripheral words included in the learning data reconstructed in the data reconstruction step;
a second determination step of determining, by a determiner using the second learning result generated in the second learning step, a tag of each target word from the features, including the prediction tags, of the target word and the peripheral words included in the learning data; and
a learning repetition step of generating the plurality of learning results subsequent to the second learning result by repeating the data reconstruction step, the second learning step, and the second determination step a plurality of times. (7)
[0169]
(Supplementary Note 10)
The program according to Supplementary Note 9, wherein the determination processing step includes:
an input step of inputting unknown document data to be tagged from the outside;
a data conversion step of converting the unknown document data into test data in a format handled by the learning device;
a third determination step of determining, using the first learning result, a prediction tag of each target word from the features of the target word and the peripheral words included in the test data;
a second data reconstruction step of incorporating the prediction tag determined in the third determination step into a part of the features of the learning data, thereby reconstructing the data;
a fourth determination step of determining, using the second learning result, a tag of each target word from the features, including the prediction tags, of the target word and the peripheral words included in the test data reconstructed in the second data reconstruction step;
a determination repetition step of determining the tag of the unknown document data by repeating the data reconstruction step and the fourth determination step a plurality of times while switching to the learning results subsequent to the second learning result; and
an output step of inversely converting the unknown document data into document data with the determined tags and outputting the document data. (8)
[0170]
(Supplementary Note 11)
The program according to Supplementary Note 8, wherein the learning processing step includes:
an input step of inputting the tagged correct answer document data from the outside;
a data conversion step of converting the correct answer document data into learning data in a format, handled by a learning device, that does not include prediction tags;
a first learning step of generating a first learning result for determining a prediction tag from the features, by using a set of pairs of a correct tag and the features of the target word and peripheral words included in the learning data;
a data conversion step of converting the correct answer document data into learning data in a format, handled by the learning device, that includes prediction tags; and
a second learning step of generating a second learning result for determining a prediction tag from the features, by using a set of pairs of a correct tag and the features of the target word and the peripheral words included in the learning data, the features including the correct tags of the peripheral words as prediction tags.
[0171]
(Supplementary Note 12)
The program according to Supplementary Note 11, wherein the determination processing step includes:
an input step of inputting unknown document data to be tagged from the outside;
a data conversion step of converting the unknown document data into test data in a format handled by the learning device;
a first determination step of determining, using the first learning result, a prediction tag of each target word from the features of the target word and the peripheral words included in the test data;
a data reconstruction step of incorporating, by the data reconstructing device, the prediction tag determined in the first determination step into a part of the features of the learning data, thereby reconstructing the data;
a second determination step of determining, using the second learning result, a tag of each target word from the features, including the prediction tags, of the target word and the peripheral words included in the test data reconstructed in the data reconstruction step;
a determination repetition step of determining the tag of the unknown document data by repeating the data reconstruction step and the second determination step a plurality of times using the second learning result; and
an output step of inversely converting the unknown document data into document data with the determined tags and outputting the document data.
[0172]
(Supplementary Note 13)
The program according to any one of Supplementary Notes 9 to 12, wherein
the learning processing step and the determination processing step each include a morphological analysis step of analyzing the correct answer document data or the unknown document data received by the input device to determine morphemes, and
the data conversion step converts, by a data converter, the correct answer document data or the unknown document data including the analyzed morphemes into learning data in a format handled by the learning device.
[0173]
(Supplementary Note 14)
The program according to any one of Supplementary Notes 9 to 12, wherein
the learning processing step and the determination processing step each include an external tagging step of passing the correct answer document data or the unknown document data received by the input device to an external tagging system to obtain a tagging result, and
the data conversion step converts, by a data converter, the correct answer document data or the unknown document data, together with the tagging result obtained from the tagging system, into learning data in a format handled by the learning device.
[0174]
(Supplementary Note 15)
A language processing device comprising:
a learning processing unit that generates a plurality of learning results by using a set of pairs of a correct tag and the features of a target word and of the peripheral words on both sides thereof included in tagged correct answer document data; and
a determination processing unit that determines, using the plurality of learning results, a tag of each target word from the features of the target word and of the peripheral words on both sides thereof included in unknown document data to be tagged. (9)
[0175]
(Supplementary Note 16)
The language processing device according to Supplementary Note 15, wherein the learning processing unit includes:
an input device that inputs the tagged correct answer document data from the outside;
a data conversion device that converts the correct answer document data into learning data in a format handled by a learning device;
a first learning device that generates a first learning result for determining a prediction tag from the features, by using a set of pairs of a correct tag and the features of a target word and peripheral words included in the learning data;
a first determiner that determines, using the first learning result generated by the first learning device, a prediction tag of each target word from the features of the target word and the peripheral words included in the learning data;
a data reconstructing device that incorporates the prediction tag determined by the first determiner into a part of the features of the learning data, thereby reconstructing the learning data;
a second learning device that generates a second learning result for determining a prediction tag from the features, by using a set of pairs of a correct tag and the features, including the prediction tags, of the target word and the peripheral words included in the learning data reconstructed by the data reconstructing device;
a second determiner that determines, using the second learning result generated by the second learning device, a tag of each target word from the features, including the prediction tags, of the target word and the peripheral words included in the learning data; and
a learning repetition unit that generates the plurality of learning results subsequent to the second learning result by operating the data reconstructing device, the second learning device, and the second determiner repeatedly a plurality of times. (10)
[0176]
(Supplementary Note 17)
The language processing device according to Supplementary Note 16, wherein the determination processing unit includes:
an input device that inputs unknown document data to be tagged from the outside;
a data conversion device that converts the unknown document data into test data in a format handled by the learning device;
a third determiner that determines, using the first learning result, a prediction tag of each target word from the features of the target word and the peripheral words included in the test data;
a second data reconstructing device that incorporates the prediction tag determined by the third determiner into a part of the features of the learning data, thereby reconstructing the data;
a fourth determiner that determines, using the second learning result, a tag of each target word from the features, including the prediction tags, of the target word and the peripheral words included in the test data reconstructed by the second data reconstructing device;
a determination repetition unit that determines the tag of the unknown document data by operating the second data reconstructing device and the fourth determiner repeatedly a plurality of times while switching to the learning results subsequent to the second learning result; and
an output device that inversely converts the unknown document data into document data with the determined tags and outputs the document data. (11)
[0177]
(Supplementary Note 18)
The language processing device according to Supplementary Note 15, wherein the learning processing unit includes:
an input device that inputs the tagged correct answer document data from the outside;
a data conversion device that converts the correct answer document data into learning data in a format, handled by a learning device, that does not include prediction tags;
a first learning device that generates a first learning result for determining a prediction tag from the features, by using a set of pairs of a correct tag and the features of a target word and peripheral words included in the learning data;
a data conversion device that converts the correct answer document data into learning data in a format, handled by the learning device, that includes prediction tags; and
a second learning device that generates a second learning result for determining a prediction tag from the features, by using a set of pairs of a correct tag and the features of the target word and the peripheral words included in the learning data, the features including the correct tags of the peripheral words as prediction tags.
[0178]
(Supplementary Note 19)
The language processing device according to Supplementary Note 18, wherein the determination processing unit includes:
an input device that inputs unknown document data to be tagged from the outside;
a data conversion device that converts the unknown document data into test data in a format handled by the learning device;
a first determiner that determines, using the first learning result, a prediction tag of each target word from the features of the target word and the peripheral words included in the test data;
a data reconstructing device that incorporates the prediction tag determined by the first determiner into a part of the features of the learning data, thereby reconstructing the data;
a second determiner that determines, using the second learning result, a tag of each target word from the features, including the prediction tags, of the target word and the peripheral words included in the test data reconstructed by the data reconstructing device;
a determination repetition unit that determines the tag of the unknown document data by operating the data reconstructing device and the second determiner repeatedly a plurality of times using the second learning result; and
an output device that inversely converts the unknown document data into document data with the determined tags and outputs the document data.
[0179]
(Supplementary Note 20)
The language processing device according to any one of Supplementary Notes 16 to 19, wherein
the learning processing unit and the determination processing unit each include a morphological analysis device that analyzes the correct answer document data or the unknown document data received by the input device to determine morphemes, and
the data conversion device converts the correct answer document data or the unknown document data including the analyzed morphemes into learning data in a format handled by the learning device.
[0180]
(Supplementary Note 21)
The language processing device according to any one of Supplementary Notes 16 to 19, wherein
the learning processing unit and the determination processing unit each include an external tagging interface that passes the correct answer document data or the unknown document data received by the input device to an external tagging system to obtain a tagging result, and
the data conversion device converts the correct answer document data or the unknown document data, together with the tagging result obtained from the tagging system, into learning data in a format handled by the learning device.
[0181]
[Effect of the invention]
As described above, according to the present invention, a plurality of learning results are generated using a set of pairs of a correct tag and the features of a target word and peripheral words included in tagged correct answer document data, and these learning results are used to determine the tag of each target word from the features of the target word and peripheral words included in unknown document data. The tags in the unknown document can therefore be determined using the tags of peripheral words that correlate with the tag of the target word. Furthermore, because a plurality of learning results are generated and used while the tag determination for the unknown document is repeated, the accuracy of the tag determined for each target word improves, and the unknown document can be tagged with high accuracy.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating the principle of the present invention.
FIG. 2 is a block diagram of a language processing device functioning as a learning processing unit for Japanese correct answer documents;
FIG. 3 is an explanatory diagram of a hardware environment of a computer to which the language processing device of the present invention is applied;
FIG. 4 is an explanatory diagram of Japanese document data subjected to morphological analysis.
FIG. 5 is an explanatory diagram of the correct answer document data.
FIG. 6 is an explanatory diagram of unknown document data.
FIG. 7 is an explanatory diagram of a processing function in a learning processing unit in FIG. 2;
FIG. 8 is an explanatory diagram of a processing function following FIG. 7;
FIG. 9 is a flowchart of a learning process in FIG. 2;
FIG. 10 is a block diagram of a language processing device that functions as a determination processing unit using the learning result of FIG. 2;
FIG. 11 is an explanatory diagram of a processing function in a determination processing unit in FIG. 10;
FIG. 12 is an explanatory diagram of a processing function following FIG. 11;
FIG. 13 is an explanatory diagram of a determination result for an unknown document.
FIG. 14 is a flowchart of a determination process in FIG. 11;
FIG. 15 is a block diagram of a language processing apparatus functioning as a learning processing unit for an English correct answer document;
FIG. 16 is an explanatory diagram of a processing function in the learning processing unit in FIG.
FIG. 17 is an explanatory diagram of a processing function following FIG. 16;
FIG. 18 is a block diagram of a language processing device functioning as a determination processing unit using the learning result of FIG. 16;
FIG. 19 is an explanatory diagram of a processing function in a determination processing unit in FIG. 18;
FIG. 20 is an explanatory diagram of a processing function following FIG. 19;
FIG. 21 is a block diagram of a language processing apparatus functioning as a simple learning processing unit for a Japanese correct answer document;
FIG. 22 is an explanatory diagram of a processing function in the learning processing unit in FIG. 21;
FIG. 23 is an explanatory diagram of a processing function following FIG. 22;
FIG. 24 is a flowchart of a learning process in FIG. 21;
FIG. 25 is a block diagram of a language processing device functioning as a determination processing unit using the learning result of FIG. 21;
FIG. 26 is an explanatory diagram of a processing function in the determination processing unit of FIG. 25;
FIG. 27 is an explanatory diagram of a processing function following FIG. 26;
FIG. 28 is a flowchart of a determination process in FIG. 25;
FIG. 29 is an explanatory diagram of a processing function in a simple learning processing unit for an English correct answer document;
FIG. 30 is an explanatory diagram of a processing function following FIG. 29;
FIG. 31 is an explanatory diagram of a processing function in a determination processing unit using the learning results of FIGS. 20 and 30;
FIG. 32 is an explanatory diagram of a processing function following FIG. 31;
FIG. 33 is a block diagram of the language processing apparatus of the present invention having an interface for accessing an external tagging system;
FIG. 34 is an explanatory diagram of a learning process of generating a first classification rule by performing a first learning from correct answer document data.
FIG. 35 is an explanatory diagram of a learning result database storing the first classification rules generated in FIG. 34;
FIG. 36 is an explanatory diagram of a determination process of applying a first classification rule to a correct answer document data and adding a prediction tag;
FIG. 37 is an explanatory diagram of a learning process of generating a second classification rule by performing a second learning from the correct answer document data to which the prediction tag is added in FIG. 36;
FIG. 38 is an explanatory diagram of the learning result database storing the second classification rules generated in FIG. 37 in addition to the first classification rules.
FIG. 39 is an explanatory diagram of a determination process of applying a classification rule to the correct answer document data to which a prediction tag has been added and determining a prediction tag.
FIG. 40 is an explanatory diagram of a function MakeTree for generating a decision tree serving as a classification rule.
FIG. 41 is an explanatory diagram of a decision tree generation process for dividing a learning set by character types.
FIG. 42 is an explanatory diagram of a determination process of applying a first classification rule and a second classification rule to unknown document data successively to add a prediction tag;
FIG. 43 is an explanatory diagram of a determination process for assigning a prediction tag by applying a first classification rule by using a part of unknown document data as an example;
FIG. 44 is an explanatory diagram of a determination process of adding a prediction tag to unknown document data including the prediction tag added in FIG. 43 by applying a second classification rule.
FIG. 45 is a diagram illustrating a comparison between the conventional method and the method of the present invention in which a prediction tag is added.
[Explanation of symbols]
10: I/O device
11: Morphological analyzer
12: Data converter
14: Learning device
16: Learning controller
18: Determiner
20: Determination controller
22: Learning database
24-1, 240-1: Learning result (first learning result)
24-2, 240-2: Learning result (second learning result)
24-3 to 24-N: Learning result
26: Data reconstruction device
28: Japanese document data
30: Correct answer document data
32, 80, 200: Learning data
48: Word
50: Features
52: Prediction tag
54: Correct answer tag
56, 70: Feature data
64, 76, 96, 106, 134, 156, 176, 192, 230, 242, 280, 290: Determination data
68, 138, 210: Reconstruction learning data
84, 164, 224, 250, 274: Test data
100, 180, 234, 260, 284: Reconstruction test data
300: External tagging system interface

Claims

A learning processing step of generating a plurality of learning results by using a set of pairs of a target word included in a correct answer document data with a tag and features of surrounding words on both sides thereof and a correct answer tag,
A determination processing step of determining, by a determination processing device using the plurality of learning results, a tag of each target word from the features of the target word included in the unknown document data to be tagged and the surrounding words on both sides thereof;
A language processing method comprising:
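
The pairing of "features of a target word and surrounding words on both sides" with a correct answer tag can be illustrated with a minimal sketch. It assumes that the features are simply the word strings in a fixed window; the names `PAD`, `window_features`, and `learning_pairs` are hypothetical and do not appear in the specification, and actual features would also include items such as part of speech and character type.

```python
from typing import List, Sequence, Tuple

PAD = "<pad>"  # hypothetical marker for positions outside the sentence


def window_features(words: Sequence[str], i: int, width: int = 1) -> Tuple[str, ...]:
    """Return the target word at index i with `width` surrounding words on each side."""
    return tuple(
        words[j] if 0 <= j < len(words) else PAD
        for j in range(i - width, i + width + 1)
    )


def learning_pairs(
    words: Sequence[str], tags: Sequence[str], width: int = 1
) -> List[Tuple[Tuple[str, ...], str]]:
    """Build the set of (features, correct answer tag) pairs for one tagged sentence."""
    if len(words) != len(tags):
        raise ValueError("each word needs exactly one correct answer tag")
    return [(window_features(words, i, width), tags[i]) for i in range(len(words))]
```

A learning device would consume the list returned by `learning_pairs`; the window width of one word per side is only an illustrative choice.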

In the language processing method according to claim 1, the learning processing step includes:
An input step of inputting the correct answer document data with a tag from the outside by the input device;
A data conversion step of converting, by a data converter, the correct answer document data into learning data in a format handled by the learning device;
A first learning step of generating a first learning result of determining a prediction tag from the feature by using a set of pairs of a feature and a correct answer tag of a target word and surrounding words included in the learning data,
A first determining step of determining, by using a first learning result generated in the first learning step, a prediction tag of each target word from characteristics of the target word and peripheral words included in the learning data,
A data reconstructing step of reconstructing the prediction tag determined in the first determining step into a part of the features of the learning data by a data reconstructing device;
A second learning step of generating, by the learning device, a second learning result for determining a prediction tag from the features, using a set of pairs of a correct answer tag and the features, including the prediction tag, of the target word and surrounding words included in the learning data reconstructed in the data reconstructing step;
A second determining step of determining, by the determiner using the second learning result generated in the second learning step, a tag of each target word from the features, including the prediction tag, of the target word and surrounding words included in the learning data;
A learning repetition step of generating the plurality of learning results after the second learning result by repeating the data reconstructing step, the second learning step, and the second determination step a plurality of times;
A language processing method comprising:
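
The repeated cycle of learning, determination, and data reconstruction recited above can be sketched as follows. This is a minimal illustration under stated assumptions, not the learning device of the specification: the toy learner `learn` merely memorizes the most frequent correct answer tag per feature tuple (the specification mentions decision trees as classification rules), and the reconstructed features are assumed to carry the prediction tags of the two neighbouring words. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

PAD = "<pad>"  # hypothetical marker for positions outside the sentence
Features = Tuple[str, ...]
Model = Tuple[Dict[Features, str], str]  # (lookup table, fallback tag)


def learn(examples: List[Tuple[Features, str]]) -> Model:
    """Toy learner: remember the most frequent correct answer tag per feature tuple."""
    counts: Dict[Features, Counter] = defaultdict(Counter)
    for feats, tag in examples:
        counts[feats][tag] += 1
    table = {f: c.most_common(1)[0][0] for f, c in counts.items()}
    fallback = Counter(tag for _, tag in examples).most_common(1)[0][0]
    return table, fallback


def determine(model: Model, feats: Features) -> str:
    """Determine a prediction tag; unseen features receive the fallback tag."""
    table, fallback = model
    return table.get(feats, fallback)


def word_window(words: Sequence[str], i: int) -> Features:
    """Target word plus one surrounding word on each side."""
    prev = words[i - 1] if i > 0 else PAD
    nxt = words[i + 1] if i + 1 < len(words) else PAD
    return (prev, words[i], nxt)


def reconstruct(words: Sequence[str], pred: Sequence[str]) -> List[Features]:
    """Data reconstruction: add the neighbours' prediction tags to the features."""
    n = len(words)
    return [
        word_window(words, i)
        + (pred[i - 1] if i > 0 else PAD, pred[i + 1] if i + 1 < n else PAD)
        for i in range(n)
    ]


def learning_process(words: Sequence[str], tags: Sequence[str], rounds: int = 3) -> List[Model]:
    """Generate `rounds` learning results; the first one uses no prediction tags."""
    results: List[Model] = []
    feats = [word_window(words, i) for i in range(len(words))]
    for _ in range(rounds):
        model = learn(list(zip(feats, tags)))        # learning step
        results.append(model)
        pred = [determine(model, f) for f in feats]  # determining step on learning data
        feats = reconstruct(words, pred)             # data reconstructing step
    return results
```

The list returned by `learning_process` plays the role of the plurality of learning results: its first element corresponds to the first learning result, and each later element is learned from data that already contains prediction tags.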

In the language processing method according to claim 1, the learning processing step includes:
An input step of inputting the correct answer document data with a tag from the outside by the input device;
A data conversion step of converting the correct answer document data into learning data in a format not including a prediction tag handled by the learning device by a data converter,
A first learning step of generating a first learning result of determining a prediction tag from the feature by using a set of pairs of a feature and a correct answer tag of a target word and surrounding words included in the learning data,
A data conversion step of converting the correct document data into learning data of a format handled by a learning device including a prediction tag by a data converter;
A second learning step of generating, by the learning device, a second learning result for determining the prediction tag from the features, using a set of pairs of a correct answer tag and the features of the target word and of surrounding words that include, as a prediction tag, the correct answer tag included in the learning data;
A language processing method comprising:
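
The simplified learning recited above, in which the correct answer tags of the surrounding words stand in for prediction tags so that no determination pass over the learning data is needed, might be sketched as follows. The memorizing learner and every name in the block (`learn`, `features_without_tags`, `features_with_tags`, `simple_learning`) are hypothetical stand-ins chosen for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

PAD = "<pad>"  # hypothetical marker for positions outside the sentence
Features = Tuple[str, ...]
Model = Tuple[Dict[Features, str], str]  # (lookup table, fallback tag)


def learn(examples: List[Tuple[Features, str]]) -> Model:
    """Toy learner: remember the most frequent correct answer tag per feature tuple."""
    counts: Dict[Features, Counter] = defaultdict(Counter)
    for feats, tag in examples:
        counts[feats][tag] += 1
    table = {f: c.most_common(1)[0][0] for f, c in counts.items()}
    fallback = Counter(tag for _, tag in examples).most_common(1)[0][0]
    return table, fallback


def features_without_tags(words: Sequence[str]) -> List[Features]:
    """Learning data in the format that contains no prediction tag."""
    n = len(words)
    return [
        (words[i - 1] if i > 0 else PAD, words[i], words[i + 1] if i + 1 < n else PAD)
        for i in range(n)
    ]


def features_with_tags(words: Sequence[str], tags: Sequence[str]) -> List[Features]:
    """Learning data in the format that contains the neighbours' tags as prediction tags."""
    n = len(words)
    base = features_without_tags(words)
    return [
        base[i] + (tags[i - 1] if i > 0 else PAD, tags[i + 1] if i + 1 < n else PAD)
        for i in range(n)
    ]


def simple_learning(words: Sequence[str], tags: Sequence[str]) -> Tuple[Model, Model]:
    """Return (first learning result, second learning result)."""
    first = learn(list(zip(features_without_tags(words), tags)))
    # Assumption: the correct answer tags of the neighbours are used directly
    # in the feature slots that would otherwise hold prediction tags.
    second = learn(list(zip(features_with_tags(words, tags), tags)))
    return first, second
```

Compared with the iterative procedure, this variant trades the cost of repeated determination on the learning data for the assumption that prediction tags at determination time will resemble the correct answer tags.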

In the language processing method according to claim 3, the determination processing step includes:
By the input device, an input step of inputting unknown document data to be tagged externally,
A data conversion step of converting the unknown document data into test data in a format handled by the learning device by the data converter;
A first determining step of determining, by the determiner, a prediction tag of each target word from characteristics of the target word and surrounding words included in the test data, using the first learning result;
A data reconstructing step of reconstructing the prediction tag determined in the first determining step into a part of the features of the learning data by the data reconstructing device;
Using the second learning result, the determiner determines a tag of each target word from features of the target word including the prediction tag and surrounding words included in the test data reconstructed in the data reconstructing step. A second determination step;
A determination repetition step of determining the tag of the unknown document data by repeating the data reconstruction step and the second determination step a plurality of times using a second learning result;
An output step of inversely converting the unknown document data into document data with the determined tag and outputting the document data,
A language processing method comprising:
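
The determination processing recited above (first determination, data reconstruction, then repeated second determination with the same second learning result) can be sketched as below. The learning results are represented here as plain callables, which is an assumption made only to keep the example short; `Rule`, `word_window`, and `determination_process` are hypothetical names, and the reconstructed features are assumed to carry the prediction tags of the two neighbouring words.

```python
from typing import Callable, List, Sequence, Tuple

PAD = "<pad>"  # hypothetical marker for positions outside the sentence
Features = Tuple[str, ...]
Rule = Callable[[Features], str]  # hypothetical stand-in for a stored learning result


def word_window(words: Sequence[str], i: int) -> Features:
    """Target word plus one surrounding word on each side."""
    prev = words[i - 1] if i > 0 else PAD
    nxt = words[i + 1] if i + 1 < len(words) else PAD
    return (prev, words[i], nxt)


def determination_process(
    words: Sequence[str], first: Rule, second: Rule, repeats: int = 2
) -> List[Tuple[str, str]]:
    """Tag an unknown sentence and return (word, tag) pairs."""
    n = len(words)
    base = [word_window(words, i) for i in range(n)]  # test data without prediction tags
    pred = [first(f) for f in base]                   # first determining step
    for _ in range(repeats):
        # Data reconstructing step: add the neighbours' current prediction tags.
        feats = [
            base[i] + (pred[i - 1] if i > 0 else PAD, pred[i + 1] if i + 1 < n else PAD)
            for i in range(n)
        ]
        pred = [second(f) for f in feats]             # second determining step
    return list(zip(words, pred))                     # inverse conversion to tagged text
```

Each pass lets a tag decided for one word influence its neighbours in the next pass, which is why the second determination is repeated rather than applied once.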

In the language processing method according to any one of claims 2 to 4,
The learning processing step and the determination processing step include a morphological analysis step of determining a morpheme by analyzing correct document data or unknown document data received by the input device,
The language processing method, wherein the data conversion step converts, by the data converter, the correct answer document data or the unknown document data including the analyzed morphemes into learning data in a format handled by the learning device.

On the computer,
A learning processing step of generating a plurality of learning results by using a set of pairs of a target word included in a correct answer document data with a tag and features of surrounding words on both sides thereof and a correct answer tag,
A determination processing step of determining the tag of each target word from the characteristics of the target word included in the unknown document data to be tagged and the surrounding words on both sides using the plurality of learning results;
A program characterized by executing

In the program according to claim 6, the learning processing step includes:
An input step of inputting the correct answer document data with a tag from the outside by the input device;
A data conversion step of converting, by a data converter, the correct answer document data into learning data in a format handled by the learning device;
A first learning step of generating a first learning result of determining a prediction tag from the feature by using a set of pairs of a feature and a correct answer tag of a target word and surrounding words included in the learning data,
A first determining step of determining, by using a first learning result generated in the first learning step, a prediction tag of each target word from characteristics of the target word and peripheral words included in the learning data,
A data reconstructing step of reconstructing the prediction tag determined in the first determining step into a part of the features of the learning data by a data reconstructing device;
A second learning step of generating, by the learning device, a second learning result for determining a prediction tag from the features, using a set of pairs of a correct answer tag and the features, including the prediction tag, of the target word and surrounding words included in the learning data reconstructed in the data reconstructing step;
A second determining step of determining, by the determiner using the second learning result generated in the second learning step, a tag of each target word from the features, including the prediction tag, of the target word and surrounding words included in the learning data;
A learning repetition step of generating the plurality of learning results after the second learning result by repeating the data reconstructing step, the second learning step, and the second determination step a plurality of times;
A program characterized by comprising:

In the program according to claim 7, the determination processing step includes:
An input step of inputting unknown document data to be tagged from outside,
A data conversion step of converting the unknown document data into test data in a format handled by the learning device;
A third determining step of using the first learning result to determine a prediction tag of each target word from characteristics of the target word and surrounding words included in the test data;
A second data reconstructing step of reconstructing the prediction tag determined in the third determining step by incorporating the prediction tag into a part of the feature of the learning data;
A fourth determining step of determining, using the second learning result, a tag of each target word from the features, including the prediction tag, of the target word and surrounding words included in the test data reconstructed in the second data reconstructing step;
A determination repetition step of determining the tag of the unknown document data by repeating the data reconstruction step and the fourth determination step a plurality of times while switching to a learning result after a second learning result;
An output step of inversely converting the unknown document data into document data with the determined tag and outputting the document data;
A program characterized by comprising:
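
The determination repetition "while switching to a learning result after a second learning result" differs from reusing one second learning result: each pass applies the next stored learning result in order. A minimal sketch, assuming the learning results are available as an ordered list of callables (the names `Rule` and `determine_with_switching` are hypothetical) and that reconstructed features carry the neighbours' prediction tags:

```python
from typing import Callable, List, Sequence, Tuple

PAD = "<pad>"  # hypothetical marker for positions outside the sentence
Features = Tuple[str, ...]
Rule = Callable[[Features], str]  # hypothetical stand-in for a stored learning result


def determine_with_switching(words: Sequence[str], rules: Sequence[Rule]) -> List[str]:
    """Apply rules[0] to plain features, then rules[1], rules[2], ... in turn."""
    if not rules:
        raise ValueError("at least the first learning result is required")
    n = len(words)
    base = [
        (words[i - 1] if i > 0 else PAD, words[i], words[i + 1] if i + 1 < n else PAD)
        for i in range(n)
    ]
    pred = [rules[0](f) for f in base]  # third determining step (first learning result)
    for rule in rules[1:]:              # switch to the next learning result on each pass
        feats = [
            base[i] + (pred[i - 1] if i > 0 else PAD, pred[i + 1] if i + 1 < n else PAD)
            for i in range(n)
        ]
        pred = [rule(f) for f in feats]
    return pred
```

Pairing the k-th pass with the k-th learning result keeps the determination conditions aligned with those under which that learning result was generated.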

A learning processing unit that generates a plurality of learning results using a set of pairs of a correct answer tag and the features of a target word included in tagged correct answer document data and surrounding words on both sides thereof,
A determination processing unit that determines, using the plurality of learning results, a tag of each target word from the features of the target word included in unknown document data to be tagged and the surrounding words on both sides thereof,
A language processing device comprising:

The language processing apparatus according to claim 9, wherein the learning processing unit comprises:
An input device for inputting tagged document data from outside,
A data conversion device that converts the correct answer document data into learning data of a format handled by a learning device,
A first learning device that generates a first learning result for determining a prediction tag from the features, using a set of pairs of a correct answer tag and the features of the target word and surrounding words included in the learning data,
A first determiner that determines, using the first learning result generated by the first learning device, a prediction tag of each target word from the features of the target word and surrounding words included in the learning data,
A data reconstructing device that reconstructs the learning data by incorporating the prediction tag determined by the first determiner into a part of the features of the learning data,
A second learning device that generates a second learning result for determining a prediction tag from the features, using a set of pairs of a correct answer tag and the features, including the prediction tag, of the target word and surrounding words included in the learning data reconstructed by the data reconstructing device,
A second determiner that determines a tag of each target word from characteristics of the target word and the peripheral words including the prediction tag included in the learning data, using a second learning result generated by the second learning device;
A learning repetition unit that generates the plurality of learning results after the second learning result by repeating the data reconstructing device, the second learning device, and the second determination device a plurality of times;
A language processing device comprising:

In the language processing device according to claim 10, the determination processing unit includes:
An input device for inputting unknown document data to be tagged from outside,
A data conversion device that converts the unknown document data into test data in a format handled by the learning device;
A third determiner that determines a prediction tag of each target word from the characteristics of the target word and surrounding words included in the test data using the first learning result;
A second data reconstructing device configured to reconstruct the prediction tag determined by the third determiner into a part of the feature of the learning data by the data reconstructing device;
A fourth determiner that determines, using the second learning result, a tag of each target word from the features, including the prediction tag, of the target word and surrounding words included in the test data reconstructed by the second data reconstructing device,
A determination repetition unit that determines the tag of the unknown document data by repeating the second data reconstruction device and the fourth determination unit a plurality of times while switching to the learning result after the second learning result;
An output device that inversely converts the unknown document data into document data with the determined tag and outputs the document data;
A language processing device comprising: