JP3939264B2 - Morphological analyzer

Info

Publication number: JP3939264B2
Application number: JP2003080537A
Authority: JP
Inventors: 秀樹山本; さより下畑; 美穂子北村
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2003-03-24
Filing date: 2003-03-24
Publication date: 2007-07-04
Anticipated expiration: 2018-03-04
Also published as: JP2003296323A

Description

【０００１】
【発明の属する技術分野】
本発明は、入力された自然言語文を形態素（例えば単語）に分割する形態素解析装置に関し、特に、解析処理時間及び又は解析精度を従来より向上させようとしたものである。
【０００２】
【従来の技術】
【０００３】
【特許文献１】
特開平５−５２５４３号公報
【０００４】
【非特許文献１】
山本幹雄、増山正和著、「品詞・区切り情報を含む拡張文字の連鎖確率を用いた日本語形態素解析」、言語処理学会第３回年次大会発表論文集、１９９７年３月
ワードプロセッサによるテキスト作成機会の増大や、インターネット対応機器の普及により、大量の電子化された自然言語文が容易に入手可能となってきた。文字認識システム、機械翻訳システム、情報検索システム、情報抽出システム等の大量の自然言語文を扱う自然言語処理システムにとって、形態素解析処理は、各種システムが目的とする専門処理を実施する前に共通して実施され、単語や句等の文中の意味単位である形態素を確定する極めて重要な処理である。
【０００５】
このような形態素解析処理においては、単語分割（形態素分割）の精度の高さが要求されるとともに、大量の自然言語文を高速に処理するという処理速度も要求される。
【０００６】
従来の形態素解析方法としては、形態素辞書（単語辞書）や活用語尾テーブルや品詞別接続テーブル等を備え、これら各種の記憶部をアクセスしながら形態素解析を行うのが一般的であった（特許文献１参照）。
【０００７】
また、最近になって、文字をベースとした確率モデルを利用した形態素解析方法も提案されている（非特許文献１、並びに、特願平９−６８３００号明細書及び図面参照）。
【０００８】
この形態素解析方法は、自然言語テキストが入力文として与えられたときに、この入力文を構成する形態素列として、各文字の直後が形態素境界であるか否かのあらゆる組み合わせの中から最も確からしい形態素列の並びを出力させるものである。
【０００９】
そして、最も確からしい形態素列の並びか否かを判断させるために、大量のテキストデータ（コーパス；学習データ）から学習させた確率モデル（統計データベース；解析実行時データのデータベース）を用いる。統計データベースに格納されている１組の解析実行時データは、例えば、文字数Ｎの拡張文字列、及び、その拡張文字列がコーパス上にどの程度の割合で出現するかを表す連鎖確率のデータである。なお、拡張文字とは、「私」、「は」等の通常の文字とは異なり、このような文字に対して、少なくとも形態素区切り情報（この文字の直後が形態素区切りか否か）を含む拡張情報を付加したものである。
【００１０】
【発明が解決しようとする課題】
（１）形態素辞書を用いる従来では一般的であった形態素解析方法は、入力段階では長さが不明な形態素を定めるように形態素辞書を引くものであるので、形態素辞書を引く回数が非常に多くなって辞書引きにかなりの時間がかかり、大量の文書を短い時間で処理することはできなかった。すなわち、利用者は、形態素解析結果を迅速には得ることができない。以下、場合によっては、この種の形態素解析を低速形態素解析と呼ぶこととする。
【００１１】
（２）これに対して、文字をベースとした確率モデル（統計データベース）を利用した形態素解析方法は、入力文から定まる所定文字数（Ｎ）の拡張文字列を統計データベースの格納内容と照合して形態素解析を行うことを基本とするので、上記の形態素解析方法（低速形態素解析方法）に比較して形態素解析結果を高速に得ることができる。
【００１２】
しかし、この形態素解析方法においては、事前にパラメータ（統計データ；解析実行時データ）を学習して作成しておく必要があり、そのための学習データ（コーパス）を用意するのが大変であった。以下、場合によっては、この種の形態素解析を高速形態素解析と呼ぶこととする。
【００１３】
必要な学習データ（コーパス）は、上述した拡張文字列及びその連鎖確率でなる統計データを算出できるものであるので、形態素の区切り箇所の情報（及びその形態素の品詞情報）等をテキストファイルに付加したものである。テキストファイルは入手し易いが、それに上述した情報を付加したファイルは、現状ではほとんどなく、テキストファイルに人間が上述した情報を一つ一つ付加して学習に用いられる学習データ（コーパス）を作成していた。又は、低速形態素解析の結果に対して、人手で修正を加えて、学習に用いられる学習データ（コーパス）を作成していた。
【００１４】
高速形態素解析において、以上のような学習データを用意して統計データベースを作成しても、事前に用意した学習データにない文字列に対しては、正しく解析することはできない。低速形態素解析においても、勿論、辞書に入っていない形態素（未知語）からなる文字列に対しては、正しく解析できないが、通常形態素解析用の辞書には、数万から数十万語の辞書を用いているので、正しく解析できない文字列（未知語）に出会うことは少ない。仮に、低速形態素解析の辞書にある形態素を全て適当な頻度で含んだ学習データを用意することができて、それを用いて高速形態素解析を学習することができれば原理的には、低速形態素解析と高速形態素解析は、ほぼ同じ精度で解析できる。すなわち、正しく解析できない文字列は同じになるといえる。
【００１５】
しかしながら、低速形態素解析の辞書にある形態素を全て適当な頻度で含んだ学習データを用意することは現実的に不可能である。その結果、高速形態素解析においては、学習データになかった形態素が出現する文章の解析精度は、低速形態素解析よりも劣ってしまう。
【００１６】
（３）高速形態素解析の利用者が、その形態素解析方法が採用している学習データがどのような形態素から構成されていたかを知る方法がない場合は、利用者としては一つ一つの文章の解析結果を見て、高速解析結果の精度が悪いと判断したときには、その文章にだけ低速形態素解析を使うようにするか、あるいはその文章だけ人手で修正するかどちらかの方法をとらざるを得ない。
【００１７】
形態素解析したい文章が様々な分野にわたっている場合は、一つ一つ解析結果をチェックするのは面倒な作業であり、もし、チェックをしないとすると、高速形態素解析を利用した場合の全体としての精度は悪くなってしまう。
【００１８】
形態素解析したい文章が様々な分野にわたっている例としては、インターネット上の様々なＷＷＷサーバ上の文書ファイルを形態素解析して出現する形態素の頻度を調べて、検索サービス用のインデックスファイルを作るために形態素解析を利用する場合などがある。
【００１９】
（４）ところで、精度の悪かった高速形態素解析の結果に対して、人手でチェックした後、そのデータを統計データベースに反映させる（フィードバックさせる）ことも考えられる。
【００２０】
このようにすると、反映処理後は、その分野と同じ分野の文章に対しては同程度の精度で解析することが可能になるが、人手によるチェックという作業はなくなる訳ではないので、面倒である。
【００２１】
そのため、平均的に見て、高精度の形態素解析結果を得られるまでの時間が短い形態素解析装置が求められている。
【００２２】
【課題を解決するための手段】
かかる課題を解決するために、本発明は、自然言語文に現れる所定文字数でなる部分文字列とその絶対的又は相対的な頻度情報と、上記部分文字列の各文字に対して付与されたその文字が形態素の最終文字か否かの区切り位置の情報とを少なくとも含む組データである解析実行時データを多数格納している解析実行時データ格納手段と、未知文章に対して、上記解析実行時データ格納手段の格納内容を参照して形態素解析を実行する第１の形態素解析手段とを有する形態素解析装置において、以下のようにしたことを特徴とする。すなわち、上記未知文章に対して上記第１の形態素解析手段が適用した上記解析実行時データ格納手段に格納されている中の上記解析実行時データにおける頻度情報が閾値より小さいとき、及び又は、上記解析実行時データ格納手段に格納されている上記解析実行時データを適用し得ない部分を上記未知文章が有するときに、上記第１の形態素解析手段からの形態素解析結果の精度を低精度と推測する精度判定手段、又は、上記未知文章の全体又は部分の中で特定文字種の文字が所定文字数以上つながっているときに、上記第１の形態素解析手段からの形態素解析結果の精度を低精度と推測する精度判定手段、又は、上記未知文章の全体又は部分の中で特定文字種の文字が所定文字数以上含まれているときに、上記第１の形態素解析手段からの形態素解析結果の精度を低精度と推測する精度判定手段を有することを特徴とする。
【００２３】
【発明の実施の形態】
（Ａ）第１の実施形態
以下、本発明による形態素解析装置の第１の実施形態を図面を参照しながら詳述する。
【００２４】
この第１の実施形態の形態素解析装置は、基本的には、入力文を高速形態素解析方法で解析するものであり、低速な形態素解析の結果を高速な形態素解析の学習データに自動的に変換する学習機能を持つことによって、これまで学習データとしていなかった文章を容易に学習データとして使用することができるようにしたことを大きな特徴としているものである。
【００２５】
図１は、第１の実施形態の形態素解析装置の構成を示す機能ブロック図である。すなわち、第１の実施形態の形態素解析装置は、実際上、入出力装置や処理装置や記憶装置（や通信装置）等を有するワークステーションやパソコン等の情報処理装置上に実現されるものであるが、機能的には、図１に示す構成を有するものである。
【００２６】
図１において、この第１の実施形態の形態素解析装置１０は、低速形態素解析手段１１、低速形態素解析結果格納手段１２、変換手段１３、学習データ格納手段１４、学習手段１５、解析実行時データ格納手段１６及び高速形態素解析手段１７を有している。これらの構成要素のうち、低速形態素解析手段１１、低速形態素解析結果格納手段１２、変換手段１３、学習データ格納手段１４及び学習手段１５が、第１の実施形態の解析実行時データ作成装置を構成している。
【００２７】
低速形態素解析手段１１は、詳細構成の図示は省略するが、内蔵する形態素辞書を利用して形態素解析を行う従来の低速形態素解析装置と同様な構成を有するものである。すなわち、上述した特許文献１に記載されている形態素解析装置やそれに類似した装置と同様な詳細構成を有する。この第１の実施形態の場合、低速形態素解析手段１１は、未知文書中の各文を形態素解析するものとして設けられているのではなく、解析実行時データ格納手段１６に格納させる解析実行時データを作成する構成中の一要素として設けられている。この低速形態素解析手段１１には、学習用文書が入力される。なお、図１において、学習用文書と記載されているブロックは、学習用文書の入力手段をも意味している。
【００２８】
低速形態素解析結果格納手段１２は、低速形態素解析手段１１が学習用文書の各文に対して実行した低速形態素解析結果を格納するものである。
【００２９】
変換手段１３は、低速形態素解析結果格納手段１２に格納されている低速形態素解析結果のデータ形式を、高速形態素解析装置が必要とする学習データとしてのデータ形式に変換するものである。
【００３０】
学習データ格納手段１４は、変換手段１３が変換して得た学習データ（コーパス）を格納するものである。
【００３１】
学習手段１５は、学習データ格納手段１４に格納されている学習データから、高速形態素解析手段１７が未知文章の形態素解析時に参照する解析実行時データを作成するものである。すなわち、例えば、学習データ上に現れる所定文字数Ｎの拡張文字列、及び、その拡張文字列が学習データ上にどの程度の割合で出現するかを表す連鎖確率でなる解析実行時データ（統計データ）を、学習データから作成するものである。学習データから解析実行時データを作成する方法としては、非特許文献２、並びに、特願平９−３５０６５１号明細書及び図面に記載の方法を適用できる。
【００３２】
【非特許文献２】
長尾眞、森信介著、「大規模日本語テキストのｎグラム統計の作り方と語句の自動抽出」、情報処理学会研究報告自然言語処理９６−１、１９９３年７月
解析実行時データ格納手段（統計データベース）１６は、学習手段１５によって作成された解析実行時データを格納するものである。後述する図１１は、解析実行時データ格納手段（統計データベース）１６に格納された一部の解析実行時データ（Ｎが３の場合）を示している。なお、解析実行時データの連鎖確率は、例えば、先頭側のＮ−１文字が同一の複数の文字数Ｎの拡張文字列の連鎖確率の総和が１になるように定められる。
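A minimal sketch of the normalization just described, assuming the chain probability is estimated by relative frequency (the description defers the actual learning method to Non-Patent Document 2, so this estimator is an assumption). The function name `chain_probabilities` and the `(character, flag)` tuple representation of an extended character are hypothetical.

```python
from collections import Counter

def chain_probabilities(sentences, n=3):
    """Estimate chain probabilities of N-character extended strings.

    sentences: list of extended-character sequences, each a list of
    (character, boundary_flag) tuples. Relative-frequency estimation is an
    assumption made for illustration.
    """
    ngram_counts = Counter()
    for ext in sentences:
        for i in range(len(ext) - n + 1):
            ngram_counts[tuple(ext[i:i + n])] += 1

    # Total count of all N-grams that share the same leading N-1 characters.
    prefix_totals = Counter()
    for gram, count in ngram_counts.items():
        prefix_totals[gram[:-1]] += count

    # Dividing by the prefix total makes the probabilities of all N-grams
    # with the same N-1 prefix sum to 1.
    return {gram: count / prefix_totals[gram[:-1]]
            for gram, count in ngram_counts.items()}
```

With this estimator, every group of extended strings that share the same leading N-1 extended characters receives probabilities that add up to 1, which is the property stated in the paragraph above.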
【００３３】
高速形態素解析手段１７は、形態素解析対象の未知文書又は未知文章が与えられたときに、各文章に対して、解析実行時データ格納手段１６の格納内容を参照して形態素解析を実行し、得られた形態素解析結果を出力するものである。高速形態素解析手段１７は、例えば、上述した非特許文献１や、特願平９−６８３００号明細書及び図面に記載された構成、又はそれに類似した構成により実現される。
【００３４】
図示は省略するが、高速形態素解析手段１７の詳細構成例を挙げると以下の通りである。すなわち、高速形態素解析手段１７は、スコアテーブル１７ａ、拡張文字列生成部１７ｂ、連鎖確率計算部１７ｃ、及び、最適経路探索部１７ｄを有する。
【００３５】
スコアテーブル１７ａは、解析対象の未知文章の文頭から文末までの全ての拡張文字列の経路と、解析実行時データ格納手段１６に格納されている所定文字数の拡張文字列の連鎖確率とに基づき、求められた拡張文字列の経路に対応する連鎖確率を格納するものである。拡張文字列生成部１７ｂは、解析対象の未知文章についての拡張文字を生成し、当該拡張文字の組み合わせ（経路）の全てをスコアテーブル１７ａに格納させるものである。連鎖確率計算部１７ｃは、解析実行時データ格納手段１６に格納されている連鎖確率に基づき、スコアテーブル１７ａに格納されている拡張文字列の各経路に対する連鎖確率を計算するものである。最適経路探索部１７ｄは、連鎖確率計算部１７ｃにより計算された連鎖確率の中から、最適な条件（例えば最大値の連鎖確率を与えるなど）を満たす拡張文字列を、最適拡張文字列（形態素解析結果）として選択するものである。
【００３６】
なお、図１において、未知文書と記載されているブロックは、未知文書の入力手段をも意味しており、形態素解析結果と記載されているブロックは、形態素解析結果の出力手段をも意味している。
【００３７】
次に、この第１の実施形態の形態素解析装置１０の処理の概要を図２のフローチャートを参照しながら詳述する。なお、第１の実施形態の形態素解析装置１０の処理は、未知文書の形態素解析を実行させるための準備段階の処理と、未知文書の形態素解析を実行する処理とに分かれ、図２におけるステップ１００〜１０２が前者の処理に対応し、ステップ１０３が後者の処理に対応している。
【００３８】
学習用文書が当該形態素解析装置１０に入力されると、形態素辞書を利用する低速形態素解析手段１１によって入力された学習用文書が形態素解析され、その形態素解析結果が低速形態素解析結果格納手段１２に書き込まれる（ステップ１００）。
【００３９】
このとき、格納される形態素解析結果のデータ形式は、当然に、低速形態素解析手段１１による出力データ形式である。このような低速形態素解析結果が、変換手段１３によって、高速形態素解析手段１７が利用する解析実行時データを作成させる元となる学習データのデータ形式に変換され、学習データ格納手段１４に格納される（ステップ１０１）。
【００４０】
そして、この学習データが、学習手段１５によって処理されて解析実行時データが作成され、作成された解析実行時データが解析実行時データ格納手段１６に格納される（ステップ１０２）。
【００４１】
以上のような高速形態素解析処理の準備段階の処理が終了した後において、未知文書が入力されると、その未知文書の各文章に対し、高速形態素解析手段１７が、解析実行時データ格納手段１６の格納内容を参照しながら形態素解析し、得られた形態素解析結果を出力する（ステップ１０３）。
【００４２】
図３は、低速形態素解析手段１１に入力される学習用文書の一例を示している。図３に示すように、学習用文書は、拡張情報やタグを伴うことがない自然言語テキストデータになっている。
【００４３】
適用している低速形態素解析手段１１の内部構成によってその出力データ形式（形態素解析結果データ形式）は異なる。図４は、図３の第１文目を低速形態素解析した結果の出力例（出力データ形式例）を示している。図４の各行は一つの単語の情報を示している。１つの単語の情報は空白で区切られた３つの情報からなり、それぞれ品詞、標準形、出現形である。活用しない名詞などの場合は標準形と出現形は同じになる。
【００４４】
図５は、図３に対応した学習データの例を示している。図４に例示したような低速形態素解析結果のデータを、この図５に示すような学習データに変換手段１３は変換する。
【００４５】
図５に示した例は、解析実行時データが、文字数Ｎの拡張文字列と、その拡張文字列がコーパス上にどの程度の割合で出現するかを表す連鎖確率のデータとでなり、しかも、拡張文字が、「私」、「は」等の通常の文字に対して形態素区切り情報を拡張情報として付加したものである場合に対応した例である。なお、拡張情報として、形態素区切り情報に加えて品詞情報を含むものは、図５の形式とは異なるものとなる。
【００４６】
図５（Ａ）は、前接する文字との間が形態素の区切りになる場合を「１」で、そうでない場合を「０」で表した拡張文字列で、形態素の境界（区切り）を表した例を示している。図５（Ｂ）は、図５（Ａ）と同じ内容を、形態素区切りをスラッシュ（／）で表した例である。
【００４７】
図６は、変換手段１３による変換処理の流れの一例を示すフローチャートである。なお、図６は、変換後のデータ形式が図５（Ａ）に示すような場合に対応したものである。
【００４８】
まず、低速形態素解析結果格納手段１２に、変換処理が終了していない低速形態素解析結果が残っているか否かを確認する（ステップ２００）。残っていないならば、一連の変換処理を終了する。
【００４９】
これに対して、低速形態素解析結果格納手段１２に、未処理の低速形態素解析結果が残っているならば、未処理の低速形態素解析結果を１文分だけ読み出す（ステップ２０１）。そして、読み出した低速形態素解析結果から、出現形の項目を抜き出し（ステップ２０２）、各出現形の文字をそれぞれ拡張文字に変換して拡張文字列を作成する（ステップ２０３）。そして、得られた拡張文字列を、学習データ格納手段１４に格納して上述したステップ２００に戻る（ステップ２０４）。
【００５０】
ここで、拡張文字への変換は、出現形の最後の文字だけに形態素区切りであることを表す「１」を付与し、それ以外の文字には、形態素区切りでないことを表す「０」を付与する。例えば、図７に示すように、出現形が「機械翻訳」であれば、それに対する拡張文字列として、＜機，０＞＜械，０＞＜翻，０＞＜訳，１＞が得られる。
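The conversion of steps 202 and 203 can be sketched as follows. This is only an illustration under stated assumptions: the input lines follow the three-field layout described for Fig. 4 (part of speech, base form, surface form, separated by whitespace), and the function names `surface_forms` and `to_extended` are hypothetical.

```python
def surface_forms(result_lines):
    """Step 202: take the surface form (third field) of each result line."""
    return [line.split()[2] for line in result_lines if line.strip()]

def to_extended(forms):
    """Step 203: flag the last character of each surface form with 1
    (morpheme boundary follows) and every other character with 0."""
    extended = []
    for form in forms:
        for i, ch in enumerate(form):
            extended.append((ch, 1 if i == len(form) - 1 else 0))
    return extended

# The example of Fig. 7: one morpheme whose surface form is 機械翻訳.
print(to_extended(["機械翻訳"]))
# [('機', 0), ('械', 0), ('翻', 0), ('訳', 1)]
```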
【００５１】
上記第１の実施形態によれば、低速形態素解析結果を、高速形態素解析方法が解析時に用いる解析実行時データの作成用学習データに自動的に変換する学習機能を持たせたので、これまで学習データとしていなかった文章を容易に学習データとして使用することができる、基本的に高速形態素解析方法に従っている形態素解析装置を実現できる。その結果、学習データの充実を計ることができ、未知文書に対する高速形態素解析結果の精度向上も期待できる。
【００５２】
また、低速形態素解析結果を学習データに自動的に変換する学習機能を持たせたので、利用者は学習用文書を当該装置に入力する操作を行うだけで良く、学習用文書から学習データを作成したり、低速形態素解析結果から学習データを作成したりすることを不要にすることができる。
【００５３】
（Ｂ）第２の実施形態
次に、本発明による形態素解析装置の第２の実施形態を図面を参照しながら詳述する。
【００５４】
この第２の実施形態の形態素解析装置は、高速形態素解析結果の精度の良否を弁別し、良くない場合には、そのことを明らかにした結果を利用者に提示し、利用者に精度が低い場合の判断を委ねるようにしたことを大きな特徴としているものである。
【００５５】
図８は、この第２の実施形態の形態素解析装置１０Ａの機能的構成を示すブロック図であり、上述した第１の実施形態に係る図１との同一、対応部分には同一符号を付して示している。
【００５６】
図８において、第２の実施形態の形態素解析装置１０Ａは、解析実行時データ格納手段１６、高速形態素解析手段１７、精度判定手段１８及び精度・解析結果合成手段１９を備える。
【００５７】
なお、解析実行時データ格納手段１６に格納する解析実行時データの作成方法が第１の実施形態と同様である場合には、図示は省略しているが、低速形態素解析手段１１、低速形態素解析結果格納手段１２、変換手段１３、学習データ格納手段１４及び学習手段１５も備える（図１参照）。これら構成要素についての説明は省略する。
【００５８】
また、解析実行時データ格納手段１６及び高速形態素解析手段１７は、第１の実施形態のものと同様であるので、その機能説明は省略する。
【００５９】
この第２の実施形態で新たに設けられた精度判定手段１８は、高速形態素解析手段１７が解析実行時データ格納手段１６に所望する解析実行時データを検索した際の検索結果に基づいて、高速形態素解析手段１７から得られる形態素解析結果における精度が低いと思われる文字列を判定するものである。このような精度判定結果は、精度・解析結果合成手段１９に与えられる。
【００６０】
解析実行時データは、非特許文献１や、特願平９−６８３００号明細書及び図面にも記載されているように、また、第１の実施形態で説明したように、学習データから作成される。学習データに現れた文字列に対応した解析実行時データは存在するが、当然に、学習データに現れない文字列に対応した解析実行時データは存在しない。学習データに現れた文字列に対応した解析実行時データであっても、その出現頻度によって、連鎖確率の値は変化する。
【００６１】
従って、未知文書を高速形態素解析しようとして解析実行時データ格納手段１６をアクセスした場合において、該当文字列が存在しない部分や存在してもその連鎖確率が低い部分等は、高速形態素解析結果におけるその部分の精度は、他の部分より低いということができる。精度判定手段１８は、解析実行時データ格納手段１６に対するアクセスを通じて、このような低精度部分の判定を行うものである。
【００６２】
精度・解析結果合成手段１９には、精度判定手段１８から精度判定結果が与えられると共に、高速形態素解析手段１７から形態素解析結果が与えられる。精度・解析結果合成手段１９は、これらの入力情報を合成し、精度判定手段１８が精度が不十分であると判断した文字列を明示して形態素解析結果を利用者に提示するものである。
【００６３】
以上のように機能ブロック化できる、第２の実施形態の形態素解析装置１０Ａの全体処理の流れの一例を、図９のフローチャートを参照しながら詳述する。
【００６４】
なお、図９は、未知文書中のある１文に対する処理を示している。また、図９の処理例では、低精度文字列部分の特定を、解析実行時データ格納手段１６に該当文字列が存在しないことを１要件としている。さらに、精度カウンタを装置１０Ａ（精度判定手段１８）が内蔵しているとして説明する。この精度カウンタは、初期値が０である一時メモリである。さらにまた、装置１０Ａ（精度判定手段１８）が、低精度文字列のバッファメモリも内蔵しているとして説明する。
【００６５】
入力文における文字位置ポインタを備え、このポインタが示す文字位置から始まるＮ文字の文字列を読み込む（ステップ３００）。そして、この読み込み処理で文字列が読み込めなかったか否かに基づいて、最終番目の文字列の読み込み、それに続く処理が既に終了しているか否かを判定する（ステップ３０１）。
【００６６】
終了していない場合には、読み込んだ文字列に基づいて検索文字列を作成して解析実行時データ格納手段１６を検索し、検索文字列が解析実行時データ格納手段１６に存在したか否かを判定する（ステップ３０２、３０３）。
【００６７】
ここで、作成される検索文字列は一般に複数組である。例えば、解析実行時データ格納手段１６に図１１に示すような拡張文字列の解析実行時データが格納されているので、Ｎ（例えば３）文字の読み込み文字列のそれぞれの文字を２種類の拡張文字に置き換え、入力文字列の各文字についての２種類の拡張文字の全ての組み合わせがそれぞれ、検索文字列となるので、作成される検索文字列は一般には、２のＮ乗組だけ存在する。ステップ３０３の判定で検索文字列が存在しないとする場合は、「全て」の検索文字列が存在しない場合であり、２のＮ乗組のうちの１組の検索文字列でも解析実行時データ格納手段１６に存在する場合には、ステップ３０３の判定では存在するとする。
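A short sketch of how the 2^N search strings for one N-character window could be generated and checked, assuming the store is keyed by tuples of `(character, flag)` pairs. The names `search_strings` and `exists_in_store` are hypothetical.

```python
from itertools import product

def search_strings(chars):
    """Return all 2**N extended-character variants of an N-character window."""
    return [tuple(zip(chars, flags))
            for flags in product((0, 1), repeat=len(chars))]

def exists_in_store(chars, store):
    """Step 303: the window counts as present if at least one variant exists;
    it is absent only when all 2**N variants are missing."""
    return any(candidate in store for candidate in search_strings(chars))
```

For N = 3 this yields 8 candidates per window, so the existence test stays cheap compared with a dictionary lookup over morphemes of unknown length.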
【００６８】
ステップ３０３の判定結果、検索文字列が存在しないという結果を得たときには、精度カウンタの値を１インクリメントし、今回の読み込み文字列を低精度文字列格納領域に格納して後述するステップ３０９に移行する（ステップ３０４、３０５）。
【００６９】
一方、ステップ３０３の判定結果、検索文字列が解析実行時データ格納手段１６に存在していた場合には、その時点での精度カウンタの値が閾値以下であるか否かを判定する（ステップ３０６）。なお、閾値は、Ｎの値に応じて定められるものであるが、例えば、Ｎが３であれば１ぐらいが適当である。
【００７０】
ここで、肯定結果が得られたときには、精度カウンタの値を０クリアすると共に、低精度文字列格納領域に格納されていた低精度文字列もクリアして後述するステップ３０９に移行する（ステップ３０７）。これに対して、検索文字列が解析実行時データ格納手段１６に存在しており、しかも、その時点での精度カウンタの値が閾値より大きいときには、その時点で低精度文字列格納領域に格納されていた低精度文字列を、形態素解析結果で明示する部分として認識して、後述するステップ３０９に移行する（ステップ３０８）。
【００７１】
ステップ３０９においては、解析実行時データ格納手段１６に存在した１又は複数組の検索文字列についての連鎖確率に基づいて、今回読み込んだ文字列までの入力文の文字列についての複数の形態素解析結果候補の評価値（連鎖確率の積）を更新する。なお、検索文字列が存在しない場合での取り扱いは任意であるが、既存の高速形態素解析手段の方法をそのまま採用すれば良い。例えば、解析実行時データ格納手段１６に格納されている解析実行時データが文字数Ｎの拡張文字列に係るものである場合に、それらから文字数Ｎ−１や文字数Ｎ−２の拡張文字列に係る解析実行時データを形成して処理する。一般に、文字数が長ければ存在しない文字列でも、それより短い文字数の部分ごとに見た場合には、存在することが多い。
【００７２】
このようなステップ３０９の処理が終了すると、文字位置ポインタを１大きくして上述したステップ３００に戻り、入力文中の文字数Ｎの文字列の読み込みを行う。
【００７３】
ステップ３００〜３０９でなる処理ループを繰り返すことにより、入力文中の文末側の文字数Ｎの文字列の読み込み、それに続くステップ３０２からステップ３０９に至る処理も終了し、その後、ステップ３０１に移行してきたときには、最終文字列の処理も終了したと判定される。
【００７４】
このとき、入力文中の各文字を拡張文字に置き換えた組み合わせの中で最も連鎖確率が高いものを形態素解析結果とし、この形態素解析結果を低精度文字列を明示して利用者に提示し、一連の処理を終了する（ステップ３１０）。
【００７５】
以上のような第２の実施形態の形態素解析装置１０Ａの処理を、図１０に示す文「給与計算システム蜃気楼の構成を図１に示す。」が入力されたとして具体的に説明する。なお、解析実行時データ格納手段１６には、学習データに連続して現れた３文字を一つの単位として値（連鎖確率）が割り当てられているものとし、図１１に示す内容が格納されているものとする。図１１に示されていない文字列は値が割り当てられていない存在しないものとする。また、説明を簡単にするために、文頭、文末の処理、及び、解析実行時データには３文字未満の文字列の値はないものとして説明する。さらに、精度カウンタの値に対する閾値を１として説明する。
【００７６】
精度カウンタと低精度文字列格納領域を初期化してから図９の処理を開始する。
【００７７】
まず、ステップ３００で最初の３文字「給与計」を読み込み、読み込み終了でないことがステップ３０１で確認され、その文字列「給与計」について、ステップ３０２で解析実行時データ格納手段１６を検索すると、存在が確認され（連鎖確率０．７１が出力されることが存在を表す）、ステップ３０３、３０６、３０７を経てステップ３０９に至り、その文字列までの拡張文字列候補の評価値（スコア）が計算される。従って、文字列「給与計」に対する処理が終了しても、精度カウンタの値は０であり、低精度文字列格納領域にも何らの文字も格納されない。
【００７８】
文字列「与計算」、「計算シ」、「算シス」、「システ」及び「ステム」についても同様な経路の処理が実行される。従って、文字列「ステム」に対する処理が終了した時点では、精度カウンタの値は０であり、低精度文字列格納領域にも何らの文字も格納されない。
【００７９】
次に、文字列「テム蜃」が読み込まれると、解析実行時データ格納手段１６には対応する解析実行時データがないので、ステップ３０４で精度カウンタの値が１加算され（これにより「１」となる）、ステップ３０５で低精度文字列格納領域に「テム蜃」が格納され、その後、ステップ３０９に移行する。
【００８０】
以下、文字列「ム蜃気」、「蜃気楼」、「気楼の」及び「楼の構」についても同様な処理が実行される。その結果、文字列「楼の構」に対する処理が終了したときには、低精度文字列格納領域には文字列「テム蜃気楼の構」が格納され、精度カウンタの値は「５」となっている。
【００８１】
次の文字列「の構成」は、解析実行時データ格納手段１６に対応する解析実行時データが存在するので、ステップ３０３からステップ３０６に移行する。このときの精度カウンタの値「５」は、閾値「1」よりも大きいので、ステップ３０８で、低精度文字列格納領域に格納されている低精度文字列「テム蜃気楼の構」が精度・解析結果合成手段１９に与えられ、その後、ステップ３０９に移行する。文字列「の構成」に対する処理が終了したときには、その前の文字列「楼の構」に対する処理が終了したときと同様に、低精度文字列格納領域には文字列「テム蜃気楼の構」が格納され、精度カウンタの値は「５」となっている。
【００８２】
その次の文字列「構成を」から最終文字列「示す。」までについてはそれぞれ、ステップ３０３、３０６、３０７、３０９という、対応する解析実行時データが解析実行時データ格納手段１６に存在する場合の一般的な経路での処理が実行される。
【００８３】
最終文字列「示す。」に対する処理が終了すると、次には文字列がないので、ステップ３１０に移行し、図１２に例示するように、形態素解析結果「給与／計算／システム／蜃気楼の／構成／を／図／１／に／示／す／。」と、システムが精度に自信がない低精度文字列「テム蜃気楼の構」とを対比しやすいように利用者に提示する。
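The low-accuracy detection of Fig. 9 (steps 300 to 308) can be traced with the sketch below. Several points are assumptions rather than part of the description: `known` is a plain set of N-character strings standing in for "at least one extended variant exists in the store"; the contents of Fig. 11 are not reproduced here, so the set is built only to match the walkthrough; and the counter and buffer are reset after a string is reported in step 308, whereas the description leaves the counter at 5 at that point. The function name `low_accuracy_strings` is hypothetical.

```python
def low_accuracy_strings(sentence, known, n=3, threshold=1):
    """Return the substrings judged low-accuracy, following Fig. 9."""
    counter = 0      # accuracy counter, initial value 0
    buffer = ""      # low-accuracy string buffer
    flagged = []
    for pos in range(len(sentence) - n + 1):      # steps 300-301
        window = sentence[pos:pos + n]
        if window not in known:                   # steps 302-303: not found
            counter += 1                          # step 304
            # Step 305: consecutive windows overlap by n-1 characters,
            # so only the newly covered character is appended.
            buffer = window if not buffer else buffer + window[-1]
        elif counter <= threshold:                # step 306
            counter, buffer = 0, ""               # step 307
        else:
            flagged.append(buffer)                # step 308
            counter, buffer = 0, ""               # assumption: reset after reporting
    return flagged

sentence = "給与計算システム蜃気楼の構成を図１に示す。"
windows = {sentence[i:i + 3] for i in range(len(sentence) - 2)}
# Stand-in for Fig. 11: every 3-character window except those touching 蜃気楼.
known = {w for w in windows if not set(w) & set("蜃気楼")}
print(low_accuracy_strings(sentence, known))
# ['テム蜃気楼の構']
```

Five consecutive windows are missing, so the counter reaches 5, exceeds the threshold of 1, and the accumulated string is reported when the window 「の構成」 is found again.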
【００８４】
上記第２の実施形態によれば、高速形態素解析の精度が良くないと判断された部分文字列に対しては、その結果を利用者に提示するようにしたので、利用者が必要に応じて正しい形態素解析結果を入力することができる形態素解析装置を実現できる。
【００８５】
形態素解析装置の解析結果は、次の構文解析装置などの入力になるので、その精度が重要であり、正しくない解析結果を次の装置に渡した場合の悪影響の度合は大きい。正しいか正しくないかが明らかでない部分に対しては、利用者に判断させるので、その結果、正しい形態素解析結果を次の装置に入力させることができる。
【００８６】
ここで、精度判定を解析実行時データ格納手段に存在するか否かで行っているので、精度判定機能が処理時間をほとんど長期化させることはない。
【００８７】
（Ｃ）第３の実施形態
次に、本発明による形態素解析装置の第３の実施形態を図面を参照しながら詳述する。
【００８８】
この第３の実施形態の形態素解析装置は、高速形態素解析結果の精度の良否を弁別し、良くない部分に対しては、自動的に低速形態素解析を実行し、常に精度が高い形態素解析結果を出力するようにしたことを大きな特徴としているものである。
【００８９】
図１３は、この第３の実施形態の形態素解析装置１０Ｂの機能的構成を示すブロック図であり、上述した第１の実施形態に係る図１や第２の実施形態に係る図８との同一、対応部分には同一符号を付して示している。
【００９０】
図１３において、第３の実施形態の形態素解析装置１０Ｂは、低速形態素解析手段１１、解析実行時データ格納手段１６、高速形態素解析手段１７、精度判定手段１８及び解析結果合成手段２０を備える。
【００９１】
なお、解析実行時データ格納手段１６に格納する解析実行時データの作成方法が第１の実施形態と同様である場合には、図示は省略しているが、低速形態素解析手段１１、低速形態素解析結果格納手段１２、変換手段１３、学習データ格納手段１４及び学習手段１５も備える（図１参照）。これら構成要素についての説明は省略する。この第３の実施形態の場合、解析実行時データ格納手段１６に格納する解析実行時データの作成方法が第１の実施形態と同様である場合には、低速形態素解析手段１１は、解析実行時データの作成処理のためと、後述する低精度文字列を含む文字列の形態素解析のための双方に利用される。
【００９２】
また、低速形態素解析手段１１、解析実行時データ格納手段１６及び高速形態素解析手段１７の機能自体は、第１の実施形態のものと同様であるので、その機能説明は省略する。さらに、精度判定手段１８の機能自体は、第２の実施形態のものと同様であるので、その機能説明は省略する。
【００９３】
しかし、この第３の実施形態の場合、精度判定手段１８が、高速形態素解析方法では精度に自信がないと判定した、入力文中の低精度文字列は低速形態素解析手段１１に与えられるようになされている。低速形態素解析手段１１は、このような低精度文字列を含む文字列部分に対して低速形態素解析処理を実行する。
【００９４】
この第３の実施形態で新たに設けられた解析結果合成手段２０は、高速形態素解析手段１７からの形態素解析結果における低精度文字列に対応した部分を、低速形態素解析手段１１による低速形態素解析結果に置き換えるものである。
【００９５】
図１４は、第３の実施形態の形態素解析装置１０Ｂの全体処理の流れの一例を示すフローチャートであり、第２の実施形態に係る図９との同一処理ステップには、同一符号を付して示している。
【００９６】
第２の実施形態の場合、確定された低精度文字列はステップ３０８で利用者への提示対象として認識されるが、この第３の実施形態の場合には、確定された低精度文字列は、ステップ３０８ａで低速形態素解析手段１１に与えられる。
【００９７】
また、第２の実施形態の場合、ステップ３１０で、低精度文字列を明示した形で高速形態素解析結果を利用者に提示していたが、この第３の実施形態の場合には、高速形態素解析結果における低精度文字列に対応した部分を、低速形態素解析結果に置き換え、置き換え後の形態素解析結果を利用者に提示する。なお、低速形態素解析は、高速形態素解析結果における、低精度文字列の先頭文字より前の形態素区切り位置と、低精度文字列の最終文字より後の形態素区切り位置とに挟まれた文字列に対して実行される。
【００９８】
以上の２点を除けば、他の処理は第２の実施形態と同様であり、その説明は省略する。
【００９９】
上述した図１０に示す文「給与計算システム蜃気楼の構成を図１に示す。」が、この第３の実施形態の形態素解析装置１０Ｂに入力された場合にも、文字列「テム蜃気楼の構」が低精度文字列として認識されるのは、第２の実施形態と同様である。
【０１００】
今、低速形態素解析手段１１が内蔵する形態素辞書には、「蜃気楼」が一つの形態素（名詞）として登録されているものとする。低速形態素解析手段１１は、低精度文字列「テム蜃気楼の構」と、高速形態素解析結果「給与／計算／システム／蜃気楼の／構成／を／図／１／に／示／す／。」とが与えられると、低精度文字列「テム蜃気楼の構」の先頭文字より前の形態素区切り位置と、低精度文字列の最終文字より後の形態素区切り位置とに挟まれた文字列「システム蜃気楼の構成」が低速形態素解析対象部分として解析を実行する。
【０１０１】
そして、低速形態素解析手段１１は、図１５に示すように、「システム」、「蜃気楼」、「の」、及び「構成」を別々の形態素として解析結果を出力する。高速形態素解析結果「給与／計算／システム／蜃気楼の／構成／を／図／１／に／示／す／。」の該当部分がこの低速形態素解析結果に置き換えられるので、最終的な形態素解析結果は、図１６に示すように、「給与／計算／システム／蜃気楼／の／構成／を／図／１／に／示／す／。」となる。
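The range selection and replacement performed by the result synthesis means can be sketched as below. This is an illustration only: `expand_to_boundaries` and `merge_results` are hypothetical names, the low-accuracy string is located with a simple first-occurrence search, and the low-speed analyzer is replaced by a stand-in that returns the result shown in Fig. 15.

```python
def expand_to_boundaries(morphemes, low):
    """Return (first, stop) indices of the fast-result morphemes that overlap
    the low-accuracy string, i.e. the span between the morpheme boundary
    before its first character and the boundary after its last character."""
    text = "".join(morphemes)
    start = text.find(low)
    if start < 0:
        raise ValueError("low-accuracy string not found in the sentence")
    end = start + len(low)
    first = last = None
    pos = 0
    for i, morpheme in enumerate(morphemes):
        m_start, m_end = pos, pos + len(morpheme)
        if m_end > start and m_start < end:   # morpheme overlaps the span
            if first is None:
                first = i
            last = i
        pos = m_end
    return first, last + 1

def merge_results(morphemes, low, slow_analyzer):
    """Replace the overlapping part of the fast result with the slow result."""
    first, stop = expand_to_boundaries(morphemes, low)
    target = "".join(morphemes[first:stop])
    return morphemes[:first] + slow_analyzer(target) + morphemes[stop:]

fast = "給与/計算/システム/蜃気楼の/構成/を/図/１/に/示/す/。".split("/")
# Stand-in for the dictionary-based analyzer (result taken from Fig. 15).
slow = lambda s: {"システム蜃気楼の構成": ["システム", "蜃気楼", "の", "構成"]}[s]
print("/".join(merge_results(fast, "テム蜃気楼の構", slow)))
# 給与/計算/システム/蜃気楼/の/構成/を/図/１/に/示/す/。
```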
【０１０２】
この第３の実施形態での解析結果は、第２の実施形態の解析結果に比べて、「蜃気楼」と「の」を別の形態素として解析しており、精度が向上している。
【０１０３】
上記第３の実施形態によれば、高速形態素解析の精度が良くないと判断された部分文字列又はその近傍に対しては、自動的に低速形態素解析を実行し、低速形態素解析結果に置き換えるようにしたので、常に精度が良い形態素解析結果を出力する形態素解析装置を実現できる。
【０１０４】
この第３の実施形態においても、高速形態素解析を基本解析処理としているので、入力文を全て低速形態素解析するよりも短い時間で解析を実行できる。
【０１０５】
（Ｄ）第４の実施形態
次に、本発明による形態素解析装置の第４の実施形態を図面を参照しながら詳述する。
【０１０６】
この第４の実施形態の形態素解析装置は、高速形態素解析結果の精度の良否を弁別し、良くない部分に対しては、自動的に低速形態素解析を実行し、常に精度が高い形態素解析結果を出力すると共に、低速形態素解析結果を高速形態素解析の解析実行時データに学習、反映させ、学習後には、精度が良くなかった文章と同じ形態素が含まれる文章に対して精度良くかつ高速に形態素解析できるようにしたことを大きな特徴としているものである。
【０１０７】
図１７は、この第４の実施形態の形態素解析装置１０Ｃの機能的構成を示すブロック図であり、既述した各実施形態に係る図１、図８及び図１３との同一、対応部分には同一符号を付して示している。
【０１０８】
図１７において、第４の実施形態の形態素解析装置１０Ｃは、低速形態素解析手段１１、変換手段１３、学習データ格納手段１４、学習手段１５、解析実行時データ格納手段１６、高速形態素解析手段１７、精度判定手段１８及び解析結果合成手段２０を備える。
【０１０９】
第４の実施形態の形態素解析装置１０Ｃの全ての構成要素はそれぞれ、既述した各実施形態の対応する要素と同一機能を果たすものである。
【０１１０】
しかし、この第４の実施形態の形態素解析装置１０Ｃにおいては、低精度文字列を含む文字列に対して低速形態素解析手段１１が解析して得た結果を、変換手段１３に与えている点が第１や第３の実施形態と異なっている。変換手段１３から解析実行時データ格納手段１６への処理経路上での各手段の機能は、第１の実施形態と同様である。
【０１１１】
なお、低速形態素解析手段１１、変換手段１３、学習データ格納手段１４、学習手段１５及び解析実行時データ格納手段１６が、第１の実施形態と同様な外部から入力された学習用文書に対する処理をも担うものであっても良いことは勿論である。
【０１１２】
図１８は、第４の実施形態の形態素解析装置１０Ｃの全体処理の流れの一例を示すフローチャートであり、第３の実施形態に係る図１４との同一処理ステップには、同一符号を付して示している。
【０１１３】
第４の実施形態の形態素解析装置１０Ｃでは、第３の実施形態の最終処理ステップ３１０ａより後にステップ３１１及び３１２の処理を設けている。
【０１１４】
ステップ３１１は、低速形態素解析結果を、高速形態素解析の学習手段１５への入力用データ（学習データ）に変換して追加格納する処理である。ステップ３１２は、その時点での全ての学習データを用いて、解析実行時データを作成する処理である。
【０１１５】
低精度文字列を含む文字列に対して、例えば、上述した図１５に示すような低速形態素解析結果が得られた場合に、変換手段１３が上述した図６に示すような変換方法で学習データを変換すると、図１９に示すような拡張文字列（学習データ）が得られる。
【０１１６】
このような学習データが、既存の学習データに追加され、追加後の学習データ全体に対して、学習手段１５が学習すると、低速形態素解析結果に対応した部分の解析実行時データとして図２０に示すようなデータが得られて（他の解析実行時データも当然に得られる）、解析実行時データ格納手段１６に格納される。すなわち、図１１に示すようなデータ（連鎖確率は変化する）に加えて、図２０に示すようなデータが新たに加わることになる。
【０１１７】
その結果、学習したデータによって解析可能な文が解析対象として入力された場合には、例えば、「給与計算システム蜃気楼の値段は２０００円です。」が入力された場合には、前回低精度文字列と認定された部分も精度判定手段１８で低精度と判定されなくなり、第３の実施形態と同程度の精度の高速形態素解析結果を、毎回、低速形態素解析手段１１を起動しないで得られるようになる。
【０１１８】
上記第４の実施形態によれば、高速形態素解析の精度が良くない場合には自動的に低速形態素解析を実行し、さらにその結果を高速形態素解析の学習のためのデータとして使用し、学習後には、精度が良くなかった文章と同じ形態素が含まれる文章に対して精度良くかつ高速に形態素解析できる形態素解析装置を実現できる。
【０１１９】
（Ｅ）他の実施形態
上記各実施形態においては、解析実行時データが１カテゴリーのものを示したが、分野別などの複数カテゴリーのものを用意し、未知文書の入力時にカテゴリーを指定させるようにしても良い。この場合、第１の実施形態では、学習用文書を入力させる際に、その学習用文書のカテゴリーも指定することを要する。また、第３や第４の実施形態では、低速形態素解析手段が適用する専門辞書があれば、そのカテゴリーのものとなる。さらに、第４の実施形態では、低速形態素解析結果を、未知文書の入力時に指定されたカテゴリーの解析実行時データに反映させることとなる。
【０１２０】
また、第１の実施形態の説明では、学習データ格納手段１４への格納が追加格納か新規格納（前のものをクリアしての格納）かを明確に示さなかったが、いずれであっても良い。また、外部から、格納方法を変換手段１３にその都度指示できるようにしても良い。
【０１２１】
さらに、第１及び第４の実施形態において、学習手段１５を以下のようにしても良い。学習データ格納手段１４に追加された学習データについてのみ、文字列の出現頻度を計数して解析実行時データを作成する。この場合、解析実行時データ格納手段１６には、連鎖確率だけでなく出現頻度も格納しておき、今回の集計結果と、解析実行時データ格納手段１６に既に格納されている出現頻度とから、学習手段１５は、既存の解析時学習データの文字列や、新規発生の文字列の連鎖確率を決定するようにしても良い。
【０１２２】
さらにまた、第２〜第４の実施形態においては、解析実行時データ格納手段１６に存在しないことを低精度文字列の認定条件にしているものを示したが、存在しても、その値（連鎖確率）が所定閾値より小さいことを低精度文字列の認定条件にするようにしても良い。
【０１２３】
また、第２〜第４の実施形態において、低精度文字列の範囲を上記のように１文の部分文字列とするのではなく、判定文字列を含む１文全てを低精度文字列として扱うようにしても良い。第２の実施形態であれば、文単位に低精度か否かの情報が付随される。第３の実施形態であれば、低精度認定時にその文全体が低速形態素解析手段１１で解析されることになる。第４の実施形態であれば、文全体の低速形態素解析結果が、解析実行時データ格納手段１６の格納内容に反映される。このように文全体で精度推測を行う場合には、最適な高速形態素解析結果での連鎖確率を、入力文の文字数などで正規化し、その値を閾値と比較することなどによって、その文の精度を推測するようにしても良い。
【０１２４】
また、解析実行時データ格納手段１６の格納内容を利用しないで精度を判定する方法を単独で採用したり、解析実行時データ格納手段１６の格納内容を利用して精度を判定する方法と併用したりしても良い。例えば、解析実行時データ格納手段１６の格納内容を利用しないで精度を判定する方法としては、例えば、ひらがなや漢字などのある１種類の文字種が連続して所定文字数以上つながっている部分の中央所定文字数部分を精度が低いと判定するような方法を挙げることができる。また、第２水準の漢字を所定文字数以上含む文の精度を低いと判定するようにしても良い。
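The character-type heuristic mentioned above could look like the following sketch. The Unicode ranges used to classify characters, the default run length and middle width, and the names `char_type` and `low_accuracy_by_char_type` are all assumptions; the description only speaks of "a predetermined number of characters".

```python
from itertools import groupby

def char_type(ch):
    """Classify a character by Unicode block (a simplification)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"

def low_accuracy_by_char_type(text, types=("hiragana", "kanji"),
                              min_run=6, middle=4):
    """Flag the middle `middle` characters of every run of at least `min_run`
    consecutive characters of one of the given types."""
    flagged = []
    for kind, group in groupby(text, key=char_type):
        run = "".join(group)
        if kind in types and len(run) >= min_run:
            offset = (len(run) - middle) // 2
            flagged.append(run[offset:offset + middle])
    return flagged
```

Because this check needs no access to the stored analysis execution time data, it can be used on its own or combined with the store-based check.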
【０１２５】
さらに、第３及び第４の実施形態においては、低精度文字列に対応した低速形態素解析を１文毎に実行するものを示したが、文書全体を高速形態素解析した後でまとめて精度の悪かった部分に対して低速形態素解析を実行するようにしても良い。
【０１２６】
さらにまた、上記各実施形態の具体的説明においては、解析実行時データを構成する拡張文字の拡張情報が、形態素区切り情報だけのものを示したが、これに加えて、品詞情報や単語の発音情報を含むものであっても良い。この場合、当然に、変換手段や学習手段もそれに応じたものとなる。解析実行時データをこのようにした場合には、単語分割と品詞付与を行なう形態素解析や、単語の発音を決定する形態素解析を高速化することができる。
【０１２７】
また、上記各実施形態においては、対象とする自然言語が日本語である形態素解析装置を示したが、他の言語の形態素解析装置に対しても本発明を適用することができる。
【０１２８】
【発明の効果】
以上のように、本発明の形態素解析装置によれば、利用者に負担をかけることなく、形態素解析結果の精度向上や、解析処理時間の短縮化を期待できる。
【図面の簡単な説明】
【図１】第１の実施形態の構成を示すブロック図である。
【図２】第１の実施形態の処理の概要を示すフローチャートである。
【図３】第１の実施形態の学習用文書の一例を示す説明図である。
【図４】図３の第１文目についての低速形態素解析結果を示す説明図である。
【図５】図３の学習用文書に対応した学習データを示す説明図である。
【図６】第１の実施形態の変換手段による詳細処理例を示すフローチャートである。
【図７】図６のステップ２０３の処理の説明図である。
【図８】第２の実施形態の構成を示すブロック図である。
【図９】第２の実施形態の処理を示すフローチャートである。
【図１０】形態素解析対象文を示す説明図である。
【図１１】解析実行時データ格納手段１６の格納内容例を示す説明図である。
【図１２】第２の実施形態で図１０の文を解析した出力内容例を示す説明図である。
【図１３】第３の実施形態の構成を示すブロック図である。
【図１４】第３の実施形態の処理を示すフローチャートである。
【図１５】第３の実施形態の低精度文字列に対する低速形態素解析結果例を示す説明図である。
【図１６】第３の実施形態で図１０の文を解析した最終的な解析結果例を示す説明図である。
【図１７】第４の実施形態の構成を示すブロック図である。
【図１８】第４の実施形態の処理を示すフローチャートである。
【図１９】第４の実施形態での低精度文字列に対する低速形態素解析結果を学習データに変換した例を示す説明図である。
【図２０】第４の実施形態での低精度文字列に対応した学習データから形成された解析実行時データの例を示す説明図である。
【符号の説明】
１０、１０Ａ、１０Ｂ、１０Ｃ…形態素解析装置、
１１…低速形態素解析手段、
１３…変換手段、
１５…学習手段、
１７…高速形態素解析手段、
１８…精度判定手段、
１９…精度・解析結果合成手段、
２０…解析結果合成手段。

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a morpheme analyzer that divides an input natural language sentence into morphemes (for example, words), and in particular aims to improve analysis processing time and/or analysis accuracy over the prior art.
[0002]
[Prior art]
[0003]
[Patent Document 1]
JP-A-5-52543
[0004]
[Non-Patent Document 1]
Mikio Yamamoto and Masakazu Masuyama, “Japanese Morphological Analysis Using Chain Probabilities of Extended Characters Containing Part-of-Speech and Delimiter Information”, Proceedings of the 3rd Annual Meeting of the Association for Natural Language Processing, March 1997
With the growing opportunities to create text on word processors and the spread of Internet-capable devices, large amounts of digitized natural language text have become readily available. For natural language processing systems that handle large volumes of natural language text, such as character recognition systems, machine translation systems, information retrieval systems, and information extraction systems, morphological analysis is a step performed in common before each system carries out its own specialized processing. It is an extremely important process that determines the morphemes, that is, the semantic units in a sentence such as words and phrases.
[0005]
Such morphological analysis processing requires high accuracy in word division (morpheme division) and also the speed to process a large amount of natural language sentences quickly.
[0006]
A conventional morpheme analysis method generally provides a morpheme dictionary (word dictionary), a conjugation ending table, a part-of-speech connection table, and the like, and performs morpheme analysis while accessing these storage units (see Patent Document 1).
[0007]
Recently, a morphological analysis method using a character-based probability model has also been proposed (see Non-Patent Document 1, and the specification and drawings of Japanese Patent Application No. Hei 9-68300).
[0008]
When a natural language text is given as an input sentence, this morpheme analysis method outputs, as the morpheme sequence constituting the input sentence, the most probable sequence among all combinations of whether or not a morpheme boundary immediately follows each character.
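To make the size of this search space concrete, the sketch below enumerates every candidate segmentation of a short string. It is an illustration only: the function name `segmentations` is hypothetical, the input is assumed to be non-empty, and the last character is assumed always to end a morpheme, which leaves 2^(n-1) candidates for n characters. A practical analyzer would search this space with the stored chain probabilities rather than enumerate it.

```python
from itertools import product

def segmentations(text):
    """List every way of placing morpheme boundaries after the characters
    of a non-empty string (the final character always closes a morpheme)."""
    results = []
    for flags in product((0, 1), repeat=len(text) - 1):
        flags = flags + (1,)          # assumption: sentence end is a boundary
        words, current = [], ""
        for ch, is_boundary in zip(text, flags):
            current += ch
            if is_boundary:
                words.append(current)
                current = ""
        results.append(words)
    return results
```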
[0009]
To judge whether a morpheme sequence is the most probable one, a probability model (statistical database; a database of analysis execution time data) learned from a large amount of text data (corpus; learning data) is used. One set of analysis execution time data stored in the statistical database consists of, for example, an extended character string of N characters and chain probability data indicating how frequently that extended character string appears in the corpus. An extended character differs from an ordinary character such as 「私」 or 「は」: it is such a character with extended information added, including at least morpheme delimiter information (whether or not a morpheme boundary immediately follows the character).
[0010]
[Problems to be solved by the invention]
(1) The morpheme analysis method using a morpheme dictionary, which has been the common approach, looks up the morpheme dictionary in order to determine morphemes whose lengths are unknown at the input stage. The number of dictionary lookups therefore becomes very large, the lookups take considerable time, and a large volume of documents cannot be processed in a short time. That is, the user cannot obtain the morpheme analysis result quickly. Hereinafter, this type of morpheme analysis is referred to, where appropriate, as low-speed morpheme analysis.
[0011]
(2) In contrast, the morpheme analysis method using a character-based probability model (statistical database) basically performs morpheme analysis by matching extended character strings of a predetermined number of characters (N), determined from the input sentence, against the stored contents of the statistical database. It can therefore obtain the morpheme analysis result faster than the method described above (the low-speed morpheme analysis method).
[0012]
However, this morphological analysis method requires parameters (statistical data; analysis execution time data) to be learned and created in advance, and preparing the learning data (corpus) for that purpose was laborious. Hereinafter, this type of morphological analysis is referred to, where appropriate, as high-speed morphological analysis.
[0013]
The required learning data (corpus) must allow the statistical data described above, consisting of extended character strings and their chain probabilities, to be calculated, so it is a text file to which information such as morpheme delimiter positions (and the part of speech of each morpheme) has been added. Plain text files are easy to obtain, but files with this information added are at present almost nonexistent, so people created the learning data (corpus) by adding the information to a text file item by item. Alternatively, the learning data (corpus) was created by manually correcting the results of low-speed morphological analysis.
[0014]
In high-speed morphological analysis, even if such learning data is prepared and a statistical database is created, character strings that do not appear in the learning data prepared in advance cannot be analyzed correctly. Low-speed morpheme analysis, of course, also cannot correctly analyze character strings consisting of morphemes that are not in its dictionary (unknown words), but since a dictionary for morpheme analysis normally holds tens of thousands to hundreds of thousands of words, it rarely encounters character strings (unknown words) that it cannot analyze correctly. If learning data containing every morpheme in the low-speed morpheme analysis dictionary at an appropriate frequency could be prepared and used to train high-speed morpheme analysis, then in principle low-speed and high-speed morpheme analysis could analyze with almost the same accuracy. That is, the character strings that cannot be analyzed correctly would be the same for both.
[0015]
However, it is practically impossible to prepare learning data that includes all morphemes in the dictionary for low-speed morpheme analysis at an appropriate frequency. As a result, in high-speed morpheme analysis, the analysis accuracy of sentences in which morphemes that are not in the learning data appear is inferior to that of low-speed morpheme analysis.
[0016]
(3) If the user of high-speed morpheme analysis has no way of knowing which morphemes made up the learning data adopted by that method, the user has to look at the analysis result of each sentence and, on judging that the accuracy of the high-speed result is poor, either apply low-speed morphological analysis to that sentence alone or correct that sentence alone by hand.
[0017]
When the sentences to be analyzed span various fields, checking the analysis results one by one is tedious work, and if no check is made, the overall accuracy obtained with high-speed morphological analysis deteriorates.
[0018]
An example in which the sentences to be analyzed span various fields is the use of morphological analysis to create an index file for a search service, by analyzing document files on various WWW servers on the Internet and examining the frequency of the morphemes that appear.
[0019]
(4) Incidentally, it is also conceivable to manually check the results of a high-speed morphological analysis whose accuracy was poor and then reflect (feed back) that data into the statistical database.
[0020]
In this way, after the reflection processing, sentences in the same field can be analyzed with comparable accuracy, but the manual checking work is not eliminated, so it remains troublesome.
[0021]
Therefore, there is a demand for a morpheme analyzer that takes a short time to obtain a highly accurate morpheme analysis result on average.
[0022]
[Means for Solving the Problems]
In order to solve this problem, the present invention relates to a morpheme analyzer having: analysis execution time data storage means that stores a large number of analysis execution time data, each being a data set that includes at least a partial character string of a predetermined number of characters appearing in natural language sentences, its absolute or relative frequency information, and delimiter position information, assigned to each character of the partial character string, indicating whether or not that character is the last character of a morpheme; and first morpheme analysis means that executes morpheme analysis on an unknown sentence by referring to the stored contents of the analysis execution time data storage means. The morpheme analyzer is characterized as follows. It has accuracy determination means that estimates the accuracy of the morpheme analysis result from the first morpheme analysis means to be low when the frequency information in the analysis execution time data, stored in the analysis execution time data storage means, that the first morpheme analysis means applied to the unknown sentence is smaller than a threshold, and/or when the unknown sentence has a portion to which the stored analysis execution time data cannot be applied; or accuracy determination means that estimates that accuracy to be low when characters of a specific character type run consecutively for a predetermined number of characters or more in the whole or a part of the unknown sentence; or accuracy determination means that estimates that accuracy to be low when a predetermined number or more of characters of a specific character type are contained in the whole or a part of the unknown sentence.
[0023]
DETAILED DESCRIPTION OF THE INVENTION
(A) First embodiment
Hereinafter, a first embodiment of a morphological analyzer according to the present invention will be described in detail with reference to the drawings.
[0024]
The morpheme analyzer of the first embodiment basically analyzes an input sentence by the high-speed morpheme analysis method. Its major feature is a learning function that automatically converts low-speed morpheme analysis results into learning data for high-speed morpheme analysis, so that sentences not previously used as learning data can easily be used as learning data.
[0025]
FIG. 1 is a functional block diagram showing the configuration of the morphological analyzer of the first embodiment. In practice, the morphological analyzer of the first embodiment is realized on an information processing apparatus such as a workstation or a personal computer having an input/output device, a processing device, a storage device (and a communication device), and the like, but functionally it has the configuration shown in FIG. 1.
[0026]
In FIG. 1, the morphological analyzer 10 according to the first embodiment includes low-speed morpheme analysis means 11, low-speed morpheme analysis result storage means 12, conversion means 13, learning data storage means 14, learning means 15, analysis execution time data storage means 16, and high-speed morpheme analysis means 17. Among these components, the low-speed morpheme analysis means 11, the low-speed morpheme analysis result storage means 12, the conversion means 13, the learning data storage means 14, and the learning means 15 constitute the analysis execution time data creation device of the first embodiment.
[0027]
The detailed configuration of the low-speed morpheme analysis unit 11 is not illustrated, but it is the same as that of a conventional low-speed morphological analyzer that performs morphological analysis using a built-in morpheme dictionary, that is, the same as that of the morphological analyzer described in Patent Document 1 above or a similar device. In the first embodiment, the low-speed morpheme analysis unit 11 is provided not to analyze each sentence of an unknown document but as one element of the configuration that creates the analysis execution time data stored in the analysis execution time data storage unit 16. A learning document is input to the low-speed morpheme analysis unit 11. In FIG. 1, the block labeled as the learning document also represents the input unit for the learning document.
[0028]
The low-speed morpheme analysis result storage unit 12 stores the low-speed morpheme analysis result executed by the low-speed morpheme analysis unit 11 for each sentence of the learning document.
[0029]
The conversion unit 13 converts the data format of the low-speed morpheme analysis result stored in the low-speed morpheme analysis result storage unit 12 into a data format as learning data required by the high-speed morpheme analysis device.
[0030]
The learning data storage means 14 stores the learning data (corpus) produced by the conversion means 13.
[0031]
The learning unit 15 creates, from the learning data stored in the learning data storage unit 14, the analysis execution time data that the high-speed morpheme analysis unit 17 refers to when analyzing an unknown sentence. For example, it creates analysis execution time data (statistical data) consisting of extended character strings of a predetermined number of characters N that appear in the learning data, together with chain probabilities indicating how often each extended character string appears in the learning data. As the method of creating the analysis execution time data from the learning data, the methods described in Non-Patent Document 2 and in the specification and drawings of Japanese Patent Application No. 9-350651 can be applied.
[0032]
[Non-Patent Document 2]
Makoto Nagao and Shinsuke Mori, “How to make n-gram statistics and automatic phrase extraction for large-scale Japanese text”, IPSJ Research Report Natural Language Processing 96-1, July 1993
The analysis execution time data storage means (statistical database) 16 stores the analysis execution time data created by the learning means 15. FIG. 11, described later, shows part of the analysis execution time data (for N = 3) stored in the analysis execution time data storage means (statistical database) 16. The chain probabilities of the analysis execution time data are determined such that, for example, the chain probabilities of all extended character strings of N characters that share the same leading N-1 characters sum to 1.
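The normalization described above can be illustrated with a short Python sketch. It is not the method of Non-Patent Document 2; it simply counts extended character N-grams and divides each count by the total for its leading N-1 characters. The function name and the representation of an extended character as a (character, flag) pair are assumptions made for the example.

```python
from collections import Counter


def learn_chain_probabilities(corpus, n=3):
    """corpus: list of sentences, each a list of (char, flag) extended characters.

    Returns a dict mapping each extended character N-gram (a tuple of pairs)
    to its chain probability, normalized over N-grams sharing the same
    leading N-1 extended characters.
    """
    gram_counts = Counter()
    prefix_counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence) - n + 1):
            gram = tuple(sentence[i:i + n])
            gram_counts[gram] += 1
            prefix_counts[gram[:-1]] += 1
    return {gram: count / prefix_counts[gram[:-1]]
            for gram, count in gram_counts.items()}


# Toy corpus (hypothetical data): three sentences of three extended characters.
corpus = [
    [("a", 0), ("b", 1), ("c", 1)],
    [("a", 0), ("b", 1), ("d", 1)],
    [("a", 0), ("b", 1), ("c", 1)],
]
probs = learn_chain_probabilities(corpus)
```

In this toy corpus the two N-grams that share the prefix <a,0><b,1> receive probabilities of 2/3 and 1/3, which sum to 1 as the text requires.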
[0033]
When an unknown document or unknown sentence to be analyzed is given, the high-speed morpheme analysis unit 17 performs morphological analysis on each sentence with reference to the contents of the analysis execution time data storage unit 16 and outputs the resulting morphological analysis result. The high-speed morpheme analysis means 17 is realized by, for example, the configuration described in Non-Patent Document 1 above and in the specification and drawings of Japanese Patent Application No. 9-68300, or a similar configuration.
[0034]
Although illustration is omitted, a detailed configuration example of the high-speed morpheme analysis means 17 is as follows. That is, the high-speed morpheme analysis unit 17 includes a score table 17a, an extended character string generation unit 17b, a chain probability calculation unit 17c, and an optimum route search unit 17d.
[0035]
The score table 17a stores every path of extended character strings from the beginning to the end of the unknown sentence to be analyzed, together with the chain probability of each path, which is obtained from the chain probabilities of the extended character strings of the predetermined number of characters stored in the analysis execution time data storage means 16. The extended character string generation unit 17b generates the extended characters for the unknown sentence to be analyzed and stores all of their combinations (paths) in the score table 17a. The chain probability calculation unit 17c calculates the chain probability of each path stored in the score table 17a based on the chain probabilities stored in the analysis execution time data storage means 16. The optimum path search unit 17d selects, from the paths whose chain probabilities have been calculated by the chain probability calculation unit 17c, the extended character string that satisfies the optimality condition (for example, the one with the maximum chain probability) as the solution (the morphological analysis result).
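The division of labor among units 17a to 17d can be illustrated with a brute-force Python sketch that enumerates every path, scores it as a product of chain probabilities, and keeps the best one. This is only practical for very short inputs and is not the configuration of Non-Patent Document 1. The function names and the small floor probability used for N-grams that are not stored are assumptions made for the example.

```python
from itertools import product
from math import prod


def best_path(sentence, chain_prob, n=3, floor=1e-6):
    """Return (path, score) with the highest product of chain probabilities.

    sentence:   plain string to analyze
    chain_prob: dict mapping tuples of (char, flag) pairs to probabilities
    floor:      assumed score for N-grams absent from chain_prob
    """
    best, best_score = None, -1.0
    # Generate every combination of extended characters (role of unit 17b).
    for flags in product((0, 1), repeat=len(sentence)):
        path = tuple(zip(sentence, flags))
        # Score the path (role of unit 17c).
        score = prod(chain_prob.get(path[i:i + n], floor)
                     for i in range(len(path) - n + 1))
        # Keep the best path seen so far (role of unit 17d).
        if score > best_score:
            best, best_score = path, score
    return best, best_score


def to_slashed(path):
    """Render an extended character string with '/' after each morpheme end."""
    return "".join(ch + ("/" if flag else "") for ch, flag in path)


# Hypothetical analysis execution time data for a three-character input.
chain_prob = {
    (("a", 0), ("b", 1), ("c", 1)): 0.9,
    (("a", 1), ("b", 0), ("c", 1)): 0.1,
}
path, score = best_path("abc", chain_prob)
```

A practical analyzer would avoid enumerating all paths, for example by dynamic programming over positions, but the selection criterion is the same.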
[0036]
In FIG. 1, the block labeled as the unknown document also represents the input unit for the unknown document, and the block labeled as the morphological analysis result also represents the output unit for the morphological analysis result.
[0037]
Next, the outline of the processing of the morphological analyzer 10 of the first embodiment will be described with reference to the flowchart of FIG. The processing of the morphological analyzer 10 of the first embodiment is divided into preparatory processing that precedes morphological analysis of an unknown document and the processing that performs morphological analysis of the unknown document. Steps 100 to 102 of the flowchart correspond to the former, and step 103 corresponds to the latter.
[0038]
When a learning document is input to the morphological analyzer 10, the low-speed morpheme analysis unit 11 analyzes it using the morpheme dictionary and writes the morphological analysis result to the low-speed morpheme analysis result storage unit 12 (step 100).
[0039]
At this point, the stored morphological analysis result is naturally in the output data format of the low-speed morpheme analysis unit 11. The conversion unit 13 converts this low-speed morphological analysis result into the data format of the learning data from which the analysis execution time data used by the high-speed morpheme analysis unit 17 is created, and stores it in the learning data storage unit 14 (step 101).
[0040]
The learning data is processed by the learning means 15 to create analysis execution time data, and the created analysis execution time data is stored in the analysis execution time data storage means 16 (step 102).
[0041]
When an unknown document is input after the above preparatory processing for high-speed morphological analysis is complete, the high-speed morpheme analysis unit 17 performs morphological analysis on each sentence of the unknown document with reference to the contents of the analysis execution time data storage unit 16 and outputs the resulting morphological analysis result (step 103).
[0042]
FIG. 3 shows an example of a learning document input to the low-speed morpheme analyzing unit 11. As shown in FIG. 3, the learning document is natural language text data that is not accompanied by extended information or tags.
[0043]
The output data format (the data format of the morphological analysis result) depends on the internal configuration of the low-speed morpheme analyzer 11 used. FIG. 4 shows an example of the output (an example output data format) obtained by low-speed morphological analysis of the first sentence in FIG. Each line in FIG. 4 gives the information for one word, which consists of three space-separated items: the part of speech, the standard form, and the appearance form. For words that do not conjugate, such as nouns, the standard form and the appearance form are identical.
[0044]
FIG. 5 shows an example of learning data corresponding to FIG. The conversion unit 13 converts the data of the low-speed morpheme analysis result illustrated in FIG. 4 into the learning data as illustrated in FIG.
[0045]
The example shown in FIG. 5 corresponds to the case where the analysis execution time data consists of extended character strings of N characters and chain probability data indicating how often each extended character string appears in the corpus, and where each extended character is an ordinary character such as “私” or “は” with morpheme delimiter information added as its extended information. When the extended information includes part-of-speech information in addition to the morpheme delimiter information, the learning data format differs from that shown in FIG. 5.
[0046]
FIG. 5A shows an example in which morpheme boundaries are represented by an extended character string in which “1” indicates that there is a morpheme delimiter immediately after the preceding character and “0” indicates that there is not. FIG. 5B shows the same content as FIG. 5A with morpheme delimiters represented by slashes (/).
[0047]
FIG. 6 is a flowchart showing an example of the flow of conversion processing by the conversion means 13. FIG. 6 corresponds to the case where the data format after conversion is as shown in FIG.
[0048]
First, it is checked whether any low-speed morphological analysis result that has not yet been converted remains in the low-speed morpheme analysis result storage unit 12 (step 200). If none remains, the series of conversion processes ends.
[0049]
On the other hand, if an unprocessed low-speed morphological analysis result remains in the low-speed morpheme analysis result storage unit 12, the result for one sentence is read (step 201). The appearance forms are then extracted from the result that was read (step 202), and the characters of each appearance form are converted into extended characters to create an extended character string (step 203). The resulting extended character string is stored in the learning data storage means 14 (step 204), and the process returns to step 200 described above.
[0050]
In the conversion to extended characters, “1”, indicating a morpheme delimiter, is assigned only to the last character of each appearance form, and “0”, indicating no morpheme delimiter, is assigned to every other character. For example, as shown in FIG. 7, if the appearance form is “機械翻訳” (“machine translation”), the extended character string <機,0><械,0><翻,0><訳,1> is obtained for it.
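Steps 202 and 203 can be illustrated with the following Python sketch. It assumes the three-item, space-separated line format described for FIG. 4 (part of speech, standard form, appearance form). The function names and the sample lines, including their part-of-speech labels, are hypothetical and are not copied from the figures.

```python
def appearance_forms(lines):
    """Step 202: take the third space-separated item of each non-empty line."""
    return [line.split()[2] for line in lines if line.strip()]


def to_extended(forms):
    """Step 203: flag 1 on the last character of each form, 0 elsewhere."""
    extended = []
    for form in forms:
        for i, ch in enumerate(form):
            extended.append((ch, 1 if i == len(form) - 1 else 0))
    return extended


def to_slashed(extended):
    """Alternative rendering with '/' as the morpheme delimiter."""
    return "".join(ch + ("/" if flag else "") for ch, flag in extended)


# Hypothetical low-speed analysis output for two words.
lines = ["名詞 機械翻訳 機械翻訳", "助詞 の の"]
extended = to_extended(appearance_forms(lines))
```

For the appearance form 機械翻訳 ("machine translation") this yields the pairs (機,0), (械,0), (翻,0), (訳,1), matching the rule that only the last character of a morpheme carries the flag 1.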
[0051]
According to the first embodiment, the analyzer has a learning function that automatically converts low-speed morphological analysis results into the learning data from which the analysis execution time data used by the high-speed morphological analysis method is created. This realizes a morphological analyzer that is based on the high-speed morphological analysis method and can easily use, as learning data, sentences that could not previously be used as such. As a result, the learning data can be enriched, and the accuracy of the high-speed morphological analysis of unknown documents can be expected to improve.
[0052]
In addition, because the analyzer has a learning function that automatically converts low-speed morphological analysis results into learning data, the user only has to input a learning document to the device. The work of creating learning data from the learning document, or from the low-speed morphological analysis result, becomes unnecessary.
[0053]
(B) Second embodiment
Next, a second embodiment of the morphological analyzer according to the present invention will be described in detail with reference to the drawings.
[0054]
The morphological analyzer of the second embodiment determines whether the accuracy of the high-speed morphological analysis result is good and, if it is not, presents that fact to the user and leaves the handling of the low-accuracy portion to the user's judgment. This is its main feature.
[0055]
FIG. 8 is a block diagram showing the functional configuration of the morphological analyzer 10A according to the second embodiment. Components identical or corresponding to those in FIG. 1 of the first embodiment described above are denoted by the same reference numerals.
[0056]
In FIG. 8, a morpheme analyzer 10A according to the second embodiment includes an analysis execution time data storage unit 16, a high-speed morpheme analyzer 17, an accuracy determination unit 18, and an accuracy / analysis result synthesis unit 19.
[0057]
If the analysis execution time data stored in the analysis execution time data storage unit 16 is created in the same way as in the first embodiment, the low-speed morpheme analysis unit 11, the low-speed morpheme analysis result storage means 12, the conversion means 13, the learning data storage means 14, and the learning means 15 are also provided, although they are not illustrated (see FIG. 1). A description of these components is omitted.
[0058]
The analysis execution data storage unit 16 and the high-speed morpheme analysis unit 17 are the same as those in the first embodiment, and thus the description of their functions is omitted.
[0059]
The accuracy determination means 18 newly provided in the second embodiment identifies character strings for which the morphological analysis result from the high-speed morpheme analysis means 17 is considered to have low accuracy, based on the results obtained when the high-speed morpheme analysis means 17 searches the analysis execution time data storage means 16 for the desired analysis execution time data. The accuracy determination result is passed to the accuracy / analysis result synthesis means 19.
[0060]
The analysis execution time data is created from learning data, as described in Non-Patent Document 1 and in the specification and drawings of Japanese Patent Application No. 9-68300, and as described in the first embodiment. Analysis execution time data exists for character strings that appear in the learning data but, naturally, not for character strings that do not. Even for character strings that appear in the learning data, the value of the chain probability varies with the frequency of appearance.
[0061]
Therefore, when the analysis execution time data storage means 16 is accessed during high-speed morphological analysis of an unknown document, a portion for which no corresponding character string exists, or for which the chain probability is low even though the string exists, can be said to be analyzed less accurately than other portions. The accuracy determination unit 18 identifies such low-accuracy portions through the accesses to the analysis execution time data storage unit 16.
[0062]
The accuracy / analysis result synthesizing unit 19 is given the accuracy judgment result from the accuracy judging unit 18 and the morpheme analysis result from the high-speed morpheme analyzing unit 17. The accuracy / analysis result synthesizing unit 19 synthesizes these pieces of input information, clearly indicates the character string determined by the accuracy determining unit 18 to be insufficient in accuracy, and presents the morphological analysis result to the user.
[0063]
An example of the overall processing flow of the morphological analyzer 10A of the second embodiment, which has the functional configuration described above, will be described in detail with reference to the flowchart of FIG.
[0064]
FIG. 9 shows the processing for one sentence of the unknown document. In the processing example of FIG. 9, the sole requirement for identifying a low-precision character string portion is that the corresponding character string does not exist in the analysis execution time data storage unit 16. The description assumes that the apparatus 10A (accuracy determination means 18) has a built-in accuracy counter, which is a temporary memory whose initial value is 0, and a built-in buffer memory for low-precision character strings.
[0065]
A character position pointer into the input sentence is provided, and the character string of N characters starting at the position indicated by the pointer is read (step 300). Whether the final character string has already been read and processed is then determined from whether this read failed to obtain a character string (step 301).
[0066]
If processing is not complete, search character strings are created from the character string that was read, the analysis execution time data storage means 16 is searched, and it is determined whether a search character string exists in the analysis execution time data storage means 16 (steps 302 and 303).
[0067]
In general, a plurality of search character strings are created. For example, when analysis execution time data such as that shown in FIG. 11 is stored in the analysis execution time data storage means 16, each character of the character string of N (for example, 3) characters that was read can be expanded into two kinds of extended character. Since every combination of the two kinds of extended character for each character of the input string is a search character string, 2^N search character strings are generally created. The determination in step 303 is that no search character string exists only when none of them exists. If even one of the 2^N search character strings is stored in the analysis execution time data storage means 16, the determination in step 303 is that a search character string exists.
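The generation of the 2^N search character strings and the existence test of step 303 can be sketched in Python as follows. The function names and the representation of the stored data as a set of tuples of (character, flag) pairs are assumptions made for the example.

```python
from itertools import product


def search_strings(chars):
    """All 2^N extended character strings for an N-character string."""
    return [tuple(zip(chars, flags))
            for flags in product((0, 1), repeat=len(chars))]


def exists_in_store(chars, store):
    """Step 303: 'exists' if at least one of the 2^N variants is stored."""
    return any(candidate in store for candidate in search_strings(chars))


# Hypothetical store holding a single extended character string.
store = {(("a", 0), ("b", 1), ("c", 1))}
```

With N = 3 this produces eight candidates per window, and a window is treated as missing only when none of the eight is found.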
[0068]
If the determination result in step 303 shows that the search character string does not exist, the value of the precision counter is incremented by 1, the current read character string is stored in the low-precision character string storage area, and the process proceeds to step 309 described later. (Steps 304 and 305).
[0069]
On the other hand, if the result of the determination in step 303 is that a search character string exists in the analysis execution time data storage unit 16, it is determined whether the value of the accuracy counter at that time is equal to or less than the threshold (step 306). The threshold is determined according to the value of N. For example, if N is 3, a value of about 1 is appropriate.
[0070]
If the result is affirmative, the accuracy counter is cleared to 0, the low-precision character string stored in the low-precision character string storage area is cleared, and the process proceeds to step 309 described later (step 307). On the other hand, if the search character string exists in the analysis execution time data storage means 16 and the value of the accuracy counter at that time is larger than the threshold, the low-precision character string held in the low-precision character string storage area at that time is confirmed as a portion to be explicitly marked in the morphological analysis result, and the process proceeds to step 309 described later (step 308).
[0071]
In step 309, the evaluation values (products of chain probabilities) of the multiple morphological analysis result candidates for the input sentence up to the character string read this time are updated based on the chain probabilities of the one or more search character strings found in the analysis execution time data storage means 16. How the case in which no search character string exists is handled is arbitrary, and the method of the existing high-speed morpheme analysis means may be adopted as it is. For example, when the analysis execution time data stored in the analysis execution time data storage means 16 relates to extended character strings of N characters, analysis execution time data for extended character strings of N-1 or N-2 characters is formed and used for the processing. In general, a character string that does not exist at a longer length often does exist when viewed in shorter portions.
[0072]
When the processing in step 309 is completed, the character position pointer is incremented by 1, and the process returns to step 300 described above to read a character string of N characters in the input sentence.
[0073]
As the processing loop consisting of steps 300 to 309 is repeated, the character string of N characters at the end of the input sentence is eventually read and steps 302 to 309 are completed for it. It is then determined that processing of the final character string is also complete.
[0074]
At this point, among the combinations in which each character of the input sentence is replaced with an extended character, the combination with the highest chain probability is taken as the morphological analysis result. This result is presented to the user with the low-precision character strings clearly marked, and the processing ends (step 310).
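The loop of steps 300 to 308 can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment. The function names, the representation of the stored data as a set of tuples of (character, flag) pairs, and the reset of the counter and buffer after step 308 are assumptions made for the example. The update of evaluation values in step 309 is omitted, and a low-precision run that reaches the end of the sentence is not reported, since the flow described above confirms a run only when a later window is found.

```python
from itertools import product


def any_variant_stored(window, store):
    """Steps 302-303: is at least one extended variant of the window stored?"""
    return any(tuple(zip(window, flags)) in store
               for flags in product((0, 1), repeat=len(window)))


def find_low_accuracy_strings(sentence, store, n=3, threshold=1):
    counter = 0          # accuracy counter, initially 0
    start = end = None   # buffer for the low-precision string, as offsets
    confirmed = []
    for i in range(len(sentence) - n + 1):          # steps 300-301
        window = sentence[i:i + n]
        if not any_variant_stored(window, store):
            counter += 1                            # step 304
            if start is None:                       # step 305
                start = i
            end = i + n
        elif counter <= threshold:                  # step 306 -> step 307
            counter, start, end = 0, None, None
        else:                                       # step 306 -> step 308
            confirmed.append(sentence[start:end])
            # Assumption: reset so the same run is not confirmed twice.
            counter, start, end = 0, None, None
    return confirmed
```

With a threshold of 1, a single missing window followed by a stored one is discarded, whereas a run of two or more missing windows is confirmed as a low-precision character string.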
[0075]
The processing of the morphological analyzer 10A of the second embodiment will now be described concretely for the case where the sentence “給与計算システム蜃気楼の構成を図１に示す。” (“The configuration of the salary calculation system Mirage is shown in FIG. 1.”) shown in FIG. 10 is input. It is assumed that the analysis execution time data storage means 16 stores the contents shown in FIG. 11, in which a value (chain probability) is assigned to each unit of three characters appearing consecutively in the learning data, and that no value is assigned to character strings not shown in FIG. 11. For simplicity, the description omits sentence-head and sentence-end processing and assumes that the analysis execution time data contains no values for character strings shorter than three characters. The threshold for the value of the accuracy counter is assumed to be 1.
[0076]
The processing shown in FIG. 9 is started after the accuracy counter and the low-precision character string storage area are initialized.
[0077]
First, in step 300, the first three characters “給与計” are read, and step 301 confirms that reading is not complete. When the character string “給与計” is searched for in the analysis execution time data storage unit 16 in step 302, its existence is confirmed (a chain probability of 0.71 is stored for it). The process therefore proceeds through steps 303, 306, and 307 to step 309, where the evaluation value (score) of the extended character string candidates up to that character string is calculated. Thus, when processing of the character string “給与計” is complete, the value of the accuracy counter is still 0 and no characters are stored in the low-precision character string storage area.
[0078]
The same path is followed for the character strings “与計算”, “計算シ”, “算シス”, “システ”, and “ステム”. Therefore, when processing of the character string “ステム” is complete, the value of the accuracy counter is 0 and no characters are stored in the low-precision character string storage area.
[0079]
Next, when the character string “テム蜃” is read, no corresponding analysis execution time data exists in the analysis execution time data storage means 16, so the value of the accuracy counter is incremented by 1 in step 304 (becoming “1”). In step 305, “テム蜃” is stored in the low-precision character string storage area, and the process then proceeds to step 309.
[0080]
Thereafter, the same processing is executed for the character strings “ム蜃気”, “蜃気楼”, “気楼の”, and “楼の構”. As a result, when processing of the character string “楼の構” is complete, the character string “テム蜃気楼の構” is stored in the low-precision character string storage area, and the value of the accuracy counter is “5”.
[0081]
Since analysis execution time data corresponding to the next character string “の構成” exists in the analysis execution time data storage unit 16, the process proceeds from step 303 to step 306. Because the accuracy counter value “5” at this time is larger than the threshold “1”, in step 308 the low-precision character string “テム蜃気楼の構” stored in the low-precision character string storage area is passed to the accuracy / analysis result synthesis means 19, and the process then proceeds to step 309. When processing of the character string “の構成” is complete, the low-precision character string storage area still holds “テム蜃気楼の構” and the value of the accuracy counter is still “5”, just as when processing of the preceding character string “楼の構” was completed.
[0082]
For each character string from the next one, “構成を”, to the final one, corresponding analysis execution time data exists in the analysis execution time data storage unit 16, so the processing follows the normal route through steps 303, 306, 307, and 309.
[0083]
When processing of the final character string is complete, no further character string remains, so the process proceeds to step 310. As illustrated in FIG. 12, the morphological analysis result, which begins “給与/計算/システム/蜃気楼の/構成/”, is presented to the user in a form that allows easy comparison with the low-precision character string “テム蜃気楼の構”, whose accuracy is uncertain.
[0084]
According to the second embodiment, partial character strings judged to have poor high-speed morphological analysis accuracy are presented to the user, so that the user can judge those portions, and a morphological analyzer that can supply correct morphological analysis results can be realized.
[0085]
Because the analysis result of a morphological analyzer is input to a subsequent device such as a syntactic analyzer, its accuracy is important, and passing an incorrect analysis result to the subsequent device has a large adverse effect. Since the user judges the portions whose correctness is uncertain, a correct morphological analysis result can be input to the subsequent device.
[0086]
Here, since the accuracy determination is performed based on whether or not the data is stored in the analysis execution time data storage unit, the accuracy determination function hardly increases the processing time.
[0087]
(C) Third embodiment
Next, a third embodiment of the morphological analyzer according to the present invention will be described in detail with reference to the drawings.
[0088]
The morphological analyzer of the third embodiment determines whether the accuracy of the high-speed morphological analysis result is good and automatically performs low-speed morphological analysis on the portions where it is not, so that a highly accurate morphological analysis result is always output. This is its main feature.
[0089]
FIG. 13 is a block diagram showing the functional configuration of the morphological analyzer 10B according to the third embodiment. Components identical or corresponding to those in FIG. 1 of the first embodiment and FIG. 8 of the second embodiment are denoted by the same reference numerals.
[0090]
In FIG. 13, the morpheme analyzer 10B of the third embodiment includes a low-speed morpheme analysis unit 11, an analysis execution data storage unit 16, a high-speed morpheme analysis unit 17, an accuracy determination unit 18, and an analysis result synthesis unit 20.
[0091]
If the analysis execution time data stored in the analysis execution time data storage unit 16 is created in the same way as in the first embodiment, the low-speed morpheme analysis result storage means 12, the conversion means 13, the learning data storage means 14, and the learning means 15 are also provided, although they are not illustrated (see FIG. 1). A description of these components is omitted. In that case, the low-speed morpheme analysis means 11 of the third embodiment is used both for creating the analysis execution time data and for morphological analysis of character strings that include low-precision character strings, as described later.
[0092]
The functions of the low-speed morpheme analysis unit 11, the analysis execution time data storage unit 16, and the high-speed morpheme analysis unit 17 are the same as those in the first embodiment, and thus description of the functions is omitted. Furthermore, since the function itself of the accuracy determination means 18 is the same as that of the second embodiment, the description of the function is omitted.
[0093]
In the third embodiment, however, the low-precision character strings in the input sentence, that is, those for which the accuracy determination means 18 has judged the accuracy of the high-speed morphological analysis method to be unreliable, are passed to the low-speed morpheme analysis means 11. The low-speed morpheme analysis means 11 performs low-speed morphological analysis on the character string portion that includes each such low-precision character string.
[0094]
The analysis result synthesizing means 20 newly provided in the third embodiment replaces the portion of the morphological analysis result from the high-speed morpheme analysis means 17 that corresponds to a low-precision character string with the low-speed morphological analysis result obtained by the low-speed morpheme analysis means 11.
[0095]
FIG. 14 is a flowchart showing an example of the overall processing flow of the morphological analyzer 10B of the third embodiment. The same processing steps as those in FIG. 9 of the second embodiment are denoted by the same reference numerals.
[0096]
In the second embodiment, the confirmed low-precision character string is recognized in step 308 as an object to be presented to the user. In the third embodiment, the confirmed low-precision character string is instead passed to the low-speed morpheme analysis means 11 in step 308a.
[0097]
In the second embodiment, the high-speed morphological analysis result is presented to the user in step 310 with the low-precision character strings clearly marked. In the third embodiment, the portion of the high-speed morphological analysis result corresponding to the low-precision character string is replaced with the low-speed morphological analysis result, and the morphological analysis result after replacement is presented to the user. Low-speed morphological analysis is performed on the character string that lies, in the high-speed morphological analysis result, between the morpheme delimiter position preceding the first character of the low-precision character string and the morpheme delimiter position following its last character.
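The widening of the low-precision character string to morpheme delimiter positions and the subsequent replacement can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the low-speed analyzer is replaced by a stub that returns a fixed result, and the sample high-speed segmentation (in particular its tail after the fifth morpheme) is made up for the example.

```python
def overlapping_morphemes(morphemes, low_start, low_end):
    """Indices of the first and last morphemes that overlap [low_start, low_end)."""
    first = last = None
    pos = 0
    for idx, morpheme in enumerate(morphemes):
        m_start, m_end = pos, pos + len(morpheme)
        if m_start < low_end and low_start < m_end:
            if first is None:
                first = idx
            last = idx
        pos = m_end
    return first, last


def replace_with_slow_result(morphemes, low_start, low_end, slow_analyze):
    """Re-analyze the widened span with slow_analyze and splice the result in."""
    first, last = overlapping_morphemes(morphemes, low_start, low_end)
    if first is None:
        return list(morphemes)
    target = "".join(morphemes[first:last + 1])
    return morphemes[:first] + slow_analyze(target) + morphemes[last + 1:]


# Assumed high-speed result; the low-precision string occupies offsets 6 to 13.
fast = ["給与", "計算", "システム", "蜃気楼の", "構成", "を", "図", "1", "に", "示す", "。"]


def slow_stub(text):
    # Stub standing in for the dictionary-based low-speed analyzer.
    return {"システム蜃気楼の構成": ["システム", "蜃気楼", "の", "構成"]}[text]


result = replace_with_slow_result(fast, 6, 13, slow_stub)
```

Widening to whole morphemes means the low-speed analyzer receives a string that starts and ends on delimiters already chosen by the high-speed analysis, so the spliced result lines up with the untouched morphemes on either side.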
[0098]
Except for the above two points, the other processes are the same as those in the second embodiment, and a description thereof will be omitted.
[0099]
Even when the sentence “The structure of the payroll system mirage is shown in FIG. 1” shown in FIG. 10 is input to the morpheme analyzer 10B of the third embodiment, the character string “Tem mirage” is recognized as a low-precision character string, as in the second embodiment.
[0100]
Now, assume that “mirage” is registered as one morpheme (noun) in the morpheme dictionary built into the low-speed morpheme analysis means 11. When the low-speed morpheme analysis means 11 is given the low-precision character string and the high-speed morpheme analysis result “salary / calculation / system / mirage no / configuration / ...”, it executes the analysis on the character string “configuration of the system mirage” as the low-speed morpheme analysis target part. This is the character string sandwiched between the morpheme break position before the first character of the low-precision character string and the morpheme break position after its last character.
[0101]
Then, as shown in FIG. 15, the low-speed morpheme analysis means 11 outputs an analysis result in which “system”, “mirage”, “no”, and “configuration” are separate morphemes. The corresponding part of the high-speed morpheme analysis result “salary / calculation / system / mirage no / configuration / ...” is replaced with this low-speed morpheme analysis result, so the final morpheme analysis result becomes, as shown in FIG. 16, “salary / calculation / system / mirage / no / configuration / ...”.
[0102]
Compared with the analysis result of the second embodiment, the analysis result of the third embodiment treats “mirage” and “no” as separate morphemes, so the accuracy is improved.
[0103]
According to the third embodiment, low-speed morpheme analysis is automatically performed on a partial character string for which the accuracy of the high-speed morpheme analysis is judged to be poor, or on its vicinity, and the corresponding part of the high-speed result is replaced with the low-speed morpheme analysis result. It is therefore possible to realize a morpheme analyzer that always outputs highly accurate morpheme analysis results.
[0104]
Also in the third embodiment, since the high-speed morpheme analysis is the basic analysis process, the analysis can be executed in a shorter time than the low-speed morpheme analysis of all input sentences.
[0105]
(D) Fourth embodiment
Next, a fourth embodiment of the morphological analyzer according to the present invention will be described in detail with reference to the drawings.
[0106]
The morpheme analyzer of the fourth embodiment determines whether the accuracy of the high-speed morpheme analysis result is good, automatically executes low-speed morpheme analysis on the poor portions, and always outputs a high-precision morpheme analysis result. Its main feature is that, in addition, the low-speed morpheme analysis results are learned and reflected in the analysis execution time data of the high-speed morpheme analysis.
[0107]
FIG. 17 is a block diagram showing the functional configuration of the morphological analyzer 10C according to the fourth embodiment. Parts identical or corresponding to those in FIGS. 1, 8, and 13 of the above-described embodiments are given the same reference numerals.
[0108]
In FIG. 17, the morpheme analyzer 10C of the fourth embodiment includes a low-speed morpheme analysis means 11, a conversion means 13, a learning data storage means 14, a learning means 15, an analysis execution time data storage means 16, a high-speed morpheme analysis means 17, an accuracy determination means 18, and an analysis result synthesizing means 20.
[0109]
All the constituent elements of the morphological analyzer 10C of the fourth embodiment have the same functions as the corresponding elements of the respective embodiments described above.
[0110]
However, the morpheme analyzer 10C of the fourth embodiment differs from the first and third embodiments in that the result obtained when the low-speed morpheme analysis means 11 analyzes the character string including the low-precision character string is given to the conversion means 13. The function of each means on the processing path from the conversion means 13 to the analysis execution time data storage means 16 is the same as in the first embodiment.
[0111]
Of course, the low-speed morpheme analysis means 11, the conversion means 13, the learning data storage means 14, the learning means 15, and the analysis execution time data storage means 16 may also handle the processing of an input learning document, as in the first embodiment.
[0112]
FIG. 18 is a flowchart showing an example of the overall processing flow of the morpheme analyzer 10C of the fourth embodiment. Processing steps identical to those in FIG. 14 of the third embodiment are denoted by the same reference numerals.
[0113]
In the morphological analyzer 10C of the fourth embodiment, the processing of steps 311 and 312 is provided after the final processing step 310a of the third embodiment.
[0114]
Step 311 is a process of converting the low-speed morpheme analysis result into data (learning data) for input to the learning means 15 for high-speed morpheme analysis and additionally storing it. Step 312 is a process of creating analysis execution data using all the learning data at that time.
[0115]
For example, when a low-speed morpheme analysis result as shown in FIG. 15 is obtained for a character string including a low-precision character string, the conversion means 13 converts it into learning data by the conversion method described for the first embodiment, and an extended character string (learning data) as shown in FIG. 19 is obtained.
[0116]
When this learning data is added to the existing learning data and the learning means 15 learns from the entire learning data after the addition, data such as that shown in FIG. 20 is obtained as analysis execution time data corresponding to the low-speed morpheme analysis result (other analysis execution time data is naturally obtained as well) and stored in the analysis execution time data storage means 16. That is, in addition to data such as that shown in FIG. 11 (whose chain probabilities change), data such as that shown in FIG. 20 is newly added.
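As a rough sketch of steps 311 and 312, the conversion into extended characters and the subsequent learning might look like the following. The data layout is an assumption made for illustration: an extended character is modeled as a `(character, boundary_follows)` pair, and the chain probability is taken to be the share of an extended N-character string among all such strings in the learning data. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed representation: (character, True if a morpheme boundary follows it).
ExtChar = Tuple[str, bool]


def to_extended_chars(morphemes: List[str]) -> List[ExtChar]:
    """Convert a low-speed analysis result into learning data (extended characters)."""
    ext: List[ExtChar] = []
    for morpheme in morphemes:
        for k, ch in enumerate(morpheme):
            ext.append((ch, k == len(morpheme) - 1))
    return ext


def count_ngrams(ext: List[ExtChar], n: int) -> Counter:
    """Cut the learning data into every run of n extended characters and count them."""
    return Counter(tuple(ext[i:i + n]) for i in range(len(ext) - n + 1))


def chain_probabilities(counts: Counter) -> Dict[tuple, float]:
    """Assumed definition: relative frequency of each extended n-character string."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


learning_data = to_extended_chars(["ab", "c"])
analysis_data = chain_probabilities(count_ngrams(learning_data, n=2))
```

Re-running `count_ngrams` over the whole learning data after each addition corresponds to step 312 as described, where the analysis execution time data is rebuilt from all learning data available at that time.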
[0117]
As a result, when a sentence that can be analyzed based on the learned data is subsequently input as an analysis target, for example “Payroll system mirage is 2000 yen”, the accuracy determination means 18 no longer recognizes the former low-precision character string portion as low precision. A high-speed morpheme analysis result with the same degree of accuracy as in the third embodiment can therefore be obtained without starting the low-speed morpheme analysis means 11 each time.
[0118]
According to the fourth embodiment, when the accuracy of the high-speed morpheme analysis is poor, the low-speed morpheme analysis is automatically executed and its result is used as learning data for the high-speed morpheme analysis. This realizes a morpheme analyzer that can accurately and rapidly analyze later sentences containing the same morphemes as a sentence that was previously analyzed with poor accuracy.
[0119]
(E) Other embodiments
In each of the above-described embodiments, the analysis execution time data covers a single category, but analysis execution time data may be prepared for a plurality of categories, such as by field, with the category designated when an unknown document is input. In this case, in the first embodiment, the category of a learning document must be specified when the learning document is input. Further, in the third and fourth embodiments, if the low-speed morpheme analysis means has specialized dictionaries, the dictionary belonging to the designated category is applied. Furthermore, in the fourth embodiment, the low-speed morpheme analysis result is reflected in the analysis execution time data of the category specified when the unknown document is input.
[0120]
In the description of the first embodiment, it was not stated whether storage in the learning data storage means 14 is additional storage or new storage (storage after clearing the previous contents); either may be used. Alternatively, the storage method may be instructed from the outside to the conversion means 13 each time.
[0121]
Furthermore, in the first and fourth embodiments, the learning means 15 may be configured as follows. The appearance frequency of character strings is counted only for the learning data newly added to the learning data storage means 14. In this case, the analysis execution time data storage means 16 stores not only the chain probability but also the appearance frequency. The learning means 15 then determines the chain probability of each character string already in the analysis execution time data, and of each newly generated character string, from the current count and the appearance frequency already stored in the analysis execution time data storage means 16.
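A minimal sketch of this incremental variant is shown below. The class name `FrequencyStore` and its methods are hypothetical, and the chain probability is again assumed to be a simple relative frequency; the point is only that keeping raw appearance counts lets new learning data be merged without recounting the old data.

```python
from collections import Counter


class FrequencyStore:
    """Hypothetical store that keeps appearance frequencies alongside probabilities."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def add(self, new_counts: Counter) -> None:
        """Merge counts taken from newly added learning data only."""
        self.counts.update(new_counts)

    def chain_probability(self, ngram) -> float:
        """Recompute the probability from the merged frequencies (assumed definition)."""
        total = sum(self.counts.values())
        return self.counts[ngram] / total if total else 0.0


store = FrequencyStore()
store.add(Counter({"ab": 3, "bc": 1}))   # counts from the existing learning data
store.add(Counter({"bc": 2, "cd": 2}))   # counts from the newly added learning data
```

After the second `add`, the probability of an existing string such as `"bc"` changes and a new string such as `"cd"` appears, mirroring the two cases named in the paragraph above.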
[0122]
Furthermore, in the second to fourth embodiments, a character string is recognized as a low-precision character string when it does not exist in the analysis execution time data storage means 16. However, the recognition condition may also be that, even if the character string exists, its value (chain probability) is smaller than a predetermined threshold.
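The two recognition conditions can be combined in one check, sketched below under stated assumptions: `best_prob` is a hypothetical lookup from a plain N-character string to the highest chain probability stored for it, and adjacent or overlapping hits are merged into one span (the merging is a choice made for this sketch, not something the text specifies).

```python
from typing import Dict, List, Tuple


def low_precision_spans(
    sentence: str,
    best_prob: Dict[str, float],
    n: int,
    threshold: float = 0.0,
) -> List[Tuple[int, int]]:
    """Return [start, end) character spans judged low precision.

    A window is low precision if it is absent from the stored data or if its
    chain probability is below the threshold.
    """
    spans: List[Tuple[int, int]] = []
    for i in range(len(sentence) - n + 1):
        p = best_prob.get(sentence[i:i + n])
        if p is None or p < threshold:
            if spans and i <= spans[-1][1]:      # touches the previous span: merge
                spans[-1] = (spans[-1][0], i + n)
            else:
                spans.append((i, i + n))
    return spans


stored = {"ab": 0.5, "bc": 0.4, "ef": 0.3}
absent_only = low_precision_spans("abcdef", stored, n=2)
with_threshold = low_precision_spans("abcdef", stored, n=2, threshold=0.35)
```

With the default threshold of 0.0 only missing strings are flagged, which corresponds to the behavior of the second to fourth embodiments as originally described.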
[0123]
In the second to fourth embodiments, the range of the low-precision character string need not be a partial character string of one sentence as described above; the entire sentence containing the determination character string may be handled as the low-precision character string. In that case, in the second embodiment, information on whether or not the accuracy is low is attached to each sentence. In the third embodiment, the whole sentence is analyzed by the low-speed morpheme analysis means 11 when low precision is recognized. In the fourth embodiment, the low-speed morpheme analysis result of the entire sentence is additionally reflected in the stored contents of the analysis execution time data storage means 16. When accuracy is estimated for an entire sentence in this way, it may be estimated by normalizing the chain probability of the optimal high-speed morpheme analysis result by, for example, the number of characters in the input sentence and comparing the normalized value with a threshold value.
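One possible reading of this normalization is a per-character log probability, sketched below. The text does not fix the exact formula, so the use of logarithms, the function names, and the example threshold are all assumptions made for illustration.

```python
import math
from typing import Sequence


def normalized_log_score(chain_probs: Sequence[float], num_chars: int) -> float:
    """Sum of log chain probabilities divided by the sentence length.

    Assumes every probability is greater than zero and num_chars is positive.
    """
    return sum(math.log(p) for p in chain_probs) / num_chars


def is_low_precision_sentence(
    chain_probs: Sequence[float], num_chars: int, threshold: float
) -> bool:
    """Flag the whole sentence when its normalized score falls below the threshold."""
    return normalized_log_score(chain_probs, num_chars) < threshold


confident = is_low_precision_sentence([0.5, 0.5], num_chars=2, threshold=-1.0)
doubtful = is_low_precision_sentence([0.01, 0.01], num_chars=2, threshold=-1.0)
```

Dividing by the character count keeps long sentences from being flagged merely because they multiply more probabilities together.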
[0124]
In addition, a method of judging accuracy without using the stored contents of the analysis execution time data storage means 16 may be adopted, either alone or together with a method that uses those stored contents. One such method is to find a portion in which characters of a single type, such as hiragana or kanji, continue for a predetermined number of characters or more, and to determine that the accuracy is low for a central portion of that run consisting of a predetermined number of characters. Alternatively, a sentence containing a predetermined number or more of kanji may be determined to have low accuracy.
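Both character-type heuristics can be sketched without any stored analysis data. In the sketch below the character types are classified by Unicode block (Hiragana U+3040 to U+309F, Katakana U+30A0 to U+30FF, CJK Unified Ideographs U+4E00 to U+9FFF); the function names and the parameter values in the example are assumptions chosen for illustration.

```python
from itertools import groupby
from typing import List, Tuple


def char_type(ch: str) -> str:
    """Classify a character by Unicode block."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def low_precision_by_run(
    sentence: str,
    min_run: int,
    center_len: int,
    types: Tuple[str, ...] = ("hiragana", "kanji"),
) -> List[Tuple[int, int]]:
    """Return the central [start, end) part of each long single-type run."""
    spans, pos = [], 0
    for ctype, group in groupby(sentence, key=char_type):
        length = len(list(group))
        if ctype in types and length >= min_run:
            width = min(center_len, length)
            start = pos + (length - width) // 2
            spans.append((start, start + width))
        pos += length
    return spans


def has_many_kanji(sentence: str, min_kanji: int) -> bool:
    """Flag a sentence that contains at least min_kanji kanji characters."""
    return sum(char_type(c) == "kanji" for c in sentence) >= min_kanji


run_spans = low_precision_by_run("あいうえおかきくけこ漢字", min_run=8, center_len=4)
```

Here the ten-character hiragana run exceeds `min_run`, so its middle four characters are flagged, while the two-character kanji run is too short to trigger the rule.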
[0125]
Furthermore, in the third and fourth embodiments, the low-speed morpheme analysis of a low-precision character string is executed sentence by sentence. However, the low-speed morpheme analysis may instead be performed on the low-accuracy portions after the high-speed morpheme analysis of the entire document has finished.
[0126]
Furthermore, in the specific description of each of the above embodiments, the extended information of the extended characters constituting the analysis execution time data consists only of morpheme delimiter information, but other information, such as part-of-speech or reading information, may also be included. In that case, the conversion means and the learning means are naturally adapted accordingly. When the analysis execution time data is configured in this way, morpheme analysis that performs part-of-speech assignment in addition to word division, and morpheme analysis that determines the reading of words, can also be sped up.
[0127]
Further, in each of the above embodiments, the morpheme analyzer in which the target natural language is Japanese is shown, but the present invention can also be applied to morpheme analyzers of other languages.
[0128]
[Effect of the invention]
As described above, according to the morpheme analyzer of the present invention, it is possible to expect improvement in accuracy of morpheme analysis results and reduction in analysis processing time without imposing a burden on the user.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a first embodiment.
FIG. 2 is a flowchart illustrating an overview of processing according to the first embodiment.
FIG. 3 is an explanatory diagram illustrating an example of a learning document according to the first embodiment.
FIG. 4 is an explanatory diagram showing a low-speed morpheme analysis result for the first sentence in FIG. 3.
FIG. 5 is an explanatory diagram showing learning data corresponding to the learning document in FIG. 3.
FIG. 6 is a flowchart illustrating an example of detailed processing by a conversion unit according to the first embodiment.
FIG. 7 is an explanatory diagram of processing in step 206 in FIG. 6.
FIG. 8 is a block diagram showing a configuration of a second embodiment.
FIG. 9 is a flowchart showing processing of the second embodiment.
FIG. 10 is an explanatory diagram showing a morphological analysis target sentence.
FIG. 11 is an explanatory diagram showing an example of stored contents of an analysis execution data storage unit 16.
FIG. 12 is an explanatory diagram showing an example of output contents obtained by analyzing the sentence of FIG. 10 in the second embodiment.
FIG. 13 is a block diagram illustrating a configuration of a third embodiment.
FIG. 14 is a flowchart showing processing of the third embodiment.
FIG. 15 is an explanatory diagram illustrating a low-speed morpheme analysis result example for a low-precision character string according to the third embodiment.
FIG. 16 is an explanatory diagram illustrating a final analysis result example obtained by analyzing the sentence of FIG. 10 in the third embodiment.
FIG. 17 is a block diagram showing a configuration of a fourth embodiment.
FIG. 18 is a flowchart showing processing of the fourth embodiment.
FIG. 19 is an explanatory diagram illustrating an example in which a low-speed morpheme analysis result for a low-precision character string in the fourth embodiment is converted into learning data.
FIG. 20 is an explanatory diagram illustrating an example of analysis execution data formed from learning data corresponding to a low-precision character string according to the fourth embodiment.
[Explanation of symbols]
10, 10A, 10B, 10C ... morphological analyzer,
11 ... low-speed morphological analysis means,
13 ... conversion means,
15 ... learning means,
17 ... high-speed morphological analysis means,
18 ... accuracy determination means,
19 ... accuracy / analysis result synthesis means,
20 ... analysis result synthesis means.

Claims

A morpheme analyzer comprising: analysis execution time data storage means for storing a large number of analysis execution time data, each being a data set including at least a partial character string consisting of a first predetermined number of characters appearing in natural language sentences, a chain probability that is the appearance ratio of the partial character string in existing documents, and delimiter position information, assigned to each character of the partial character string, indicating whether that character is the last character of a morpheme; and first morpheme analysis means for executing morpheme analysis on an unknown sentence with reference to the stored contents of the analysis execution time data storage means,
characterized by having accuracy determination means that, when analysis execution time data stored in the analysis execution time data storage means whose chain probability is smaller than a threshold value is applied to the unknown sentence by the first morpheme analysis means, or when the unknown sentence has a portion to which the analysis execution time data stored in the analysis execution time data storage means cannot be applied, estimates that the accuracy of the morpheme analysis result from the first morpheme analysis means is low for the whole unknown sentence, or for a character string portion in the unknown sentence that includes the partial character string of the analysis execution time data whose chain probability is smaller than the threshold value or the portion to which the analysis execution time data cannot be applied.

A morpheme analyzer comprising: analysis execution time data storage means for storing a large number of analysis execution time data, each being a data set including at least a partial character string consisting of a first predetermined number of characters appearing in natural language sentences, a chain probability that is the appearance ratio of the partial character string in existing documents, and delimiter position information, assigned to each character of the partial character string, indicating whether that character is the last character of a morpheme; and first morpheme analysis means for executing morpheme analysis on an unknown sentence with reference to the stored contents of the analysis execution time data storage means,
characterized by having accuracy determination means that, when characters of a specific character type continue for a second predetermined number of characters or more in the whole or a part of the unknown sentence, estimates that the accuracy of the morpheme analysis result from the first morpheme analysis means is low for the whole unknown sentence, or for a character string portion in the unknown sentence that includes the portion in which characters of the specific character type continue for the second predetermined number of characters or more.

A morpheme analyzer comprising: analysis execution time data storage means for storing a large number of analysis execution time data, each being a data set including at least a partial character string consisting of a first predetermined number of characters appearing in natural language sentences, a chain probability that is the appearance ratio of the partial character string in existing documents, and delimiter position information, assigned to each character of the partial character string, indicating whether that character is the last character of a morpheme; and first morpheme analysis means for executing morpheme analysis on an unknown sentence with reference to the stored contents of the analysis execution time data storage means,
characterized by having accuracy determination means that, when the whole or a part of the unknown sentence includes a third predetermined number or more of characters of a specific character type, estimates that the accuracy of the morpheme analysis result from the first morpheme analysis means is low for the whole unknown sentence, or for a character string portion in the unknown sentence that includes the portion containing the third predetermined number or more of characters of the specific character type.

The morpheme analyzer according to any one of claims 1 to 3, further comprising analysis result output means for outputting the morpheme analysis result from the first morpheme analysis means while clearly indicating the unknown sentence, or the character string portion in the unknown sentence, that the accuracy determination means has estimated to be low in accuracy.

The morpheme analyzer according to any one of claims 1 to 4, further comprising second morpheme analysis means for performing morpheme analysis using a morpheme dictionary on the unknown sentence, or the character string portion in the unknown sentence, that the accuracy determination means has estimated to be low in accuracy.

The morpheme analyzer according to claim 5, further comprising analysis result synthesizing means for replacing, in the morpheme analysis result of the first morpheme analysis means, the result corresponding to the unknown sentence or the character string portion in the unknown sentence that the accuracy determination means has estimated to be low in accuracy with the morpheme analysis result of the second morpheme analysis means.

The morpheme analyzer according to claim 5 or 6, further comprising: conversion means for creating, from the morpheme analysis result of the second morpheme analysis means, learning data consisting of sets of each character and delimiter position information indicating whether or not that character is the final character of a morpheme; and
learning means for cutting the learning data into pieces of the first predetermined number of characters, tabulating them, and creating a plurality of analysis execution time data each including at least a partial character string consisting of the first predetermined number of characters, a chain probability, and delimiter position information.