JP3908919B2 - Morphological analysis system and morphological analysis method

Info

Publication number: JP3908919B2
Application number: JP2001114935A
Authority: JP
Inventors: 成一天白
Original assignee: ARCADIA, INC.
Current assignee: ARCADIA, INC.
Priority date: 2000-05-10
Filing date: 2001-04-13
Publication date: 2007-04-25
Anticipated expiration: 2021-04-13
Also published as: JP2002032366A

Description

【０００１】
【発明の技術分野】
この発明は、形態素解析システムと形態素解析方法に関するものであり、特に日本語の漢字やひらがなを含んだ文字列の形態素解析に関する。
【０００２】
【従来の技術】
日本語の正書法では単語を分かち書きしないため、コンピュータ等を利用して形態素解析を行う形態素解析プログラムにおいては単語の境界を求めて形態素を同定することが主な課題となる。
【０００３】
その形態素解析の手法として、入力文字列の先頭から、辞書に記録されている形態素のうち最も長く一致したものを候補としつつ順次解析していく最長一致法や、その他、分割数最小法、学習機能や統計的言語モデルを用いて形態素解析をする手法等がある。
【０００４】
【発明が解決しようとする課題】
ここで、日本語の文字列は多義性や曖昧性を含んでいるものが多いから、形態素解析プログラムにおける形態素の同定においては以下のような問題点がある。
【０００５】
（１）漢字が連なった入力文に対して誤った解析結果が得られることがある。
例えば、入力文字列「中古車両」については、「中古｜車両」（「｜」は、形態素間の区切れ位置を表す。以下同じ）という結果が望ましい。しかし、プログラムが読み出す辞書に「中、古、車、両、中古、車両、中古車」という複数の形態素が記録されている場合に、最長一致法によってこれを解析すると、「中古車｜両」という解析結果が得られてしまう。
【０００６】
（２）ひらがなを含む入力文の解析時間が長い。
【０００７】
ひらがなを含む入力文に対しても、入力文字列を順次分解することによってあらゆる可能性の形態素候補を取得しつつ、それらの形態素の文法的な接合の可能性を考慮して解析するため、プログラムの計算量が非常に多くなってしまい迅速な解析が行われにくい。
【０００８】
この発明は、上記のような問題に鑑みて、形態素解析を正確かつ迅速に行うことのできる形態素解析システムおよび形態素解析方法を提供することを目的とする。
【０００９】
【課題を解決するための手段および発明の効果】
（１）本発明に係る漢字文字列形態素解析方法は、
コンピュータシステムにより、形態素が記録された形態素記録部を参照することによって漢字を含んだ文字列の形態素解析を行う漢字文字列形態素解析方法であって、
前記形態素記録部には、
形態素が、他の任意の文字列を後続して結合すると前記形態素の文字列中で区切られる少なくとも２つの形態素となるような形態素である場合には、前記形態素に関連づけて前記区切れ位置を示す区切位置情報を併せて記録しておき、
前記コンピュータシステムが、入力された文字列の形態素解析を行う際には、
前記形態素記録部から形態素を参照して読み出すことにより文字列の先頭文字から最長一致法によって第１形態素候補を取得し、
前記第１形態素候補に前記区切位置情報が記録されているならば区切れ位置の後続文字を先頭文字として最長一致法によって第１形態素候補の文字列より後続の文字列を含んだ第２形態素候補が取得できるか否かを判断し、
第２形態素候補が取得できるならば、第１形態素候補を前記区切れ位置で区切ったときの前半文字列と第２形態素候補とを形態素解析結果とし、
第２形態素候補が取得できなければ、第１形態素候補を形態素解析結果とすることを特徴としている。
【００１０】
これにより、最長一致法によって取得される１つの形態素であっても、他の後続の文字列と結合するとその形態素の文字列中で区切られる少なくとも２つの形態素となるような形態素である場合には、確実にその区切れ位置で区切られた２つの形態素として判断されることになる。したがって、形態素解析を効率良く、かつ正確に行うことができる。また、区切位置情報を併せて記録しておくだけでよいのであるから、一定の解析結果が得られるような辞書の作成や、その記録内容を保守・管理することが容易になる。
【００１３】
（６）本発明に係る形態素解析システムは、
文字列を入力する文字入力部と、
漢字を含んだ文字列の形態素を記録した漢字形態素記録部および、ひらがなを含んだ文字列の形態素を記録したひらがな形態素記録部、の少なくとも１つと、
形態素解析を行う形態素解析部と、
を備えた形態素解析システムであって、
前記漢字形態素記録部は、
漢字を含んだ文字列の形態素が、文字列を区切った場合の前半文字列または後半文字列の少なくともいずれかに他の任意の文字列を後続して結合すると少なくとも２つの形態素となる文字列を含む場合には、
前記形態素に関連づけて前記形態素の区切れ位置を示す区切位置情報、
が併せて記録されており、
前記区切位置情報は、コンピュータが、
前記形態素に記録された前記区切れ位置の後続文字を先頭文字として、最長一致法によってその後続文字を含んだ形態素が取得できるか否かの判断するために利用する情報であること、
を特徴としており、
前記ひらがな形態素記録部は、
助詞または助動詞または形式名詞または基本動詞を含む２以上のひらがな形態素の組み合わせのうち、文法的に正しく接合されうる形態素の組み合わせであるひらがな形態素接合を少なくとも記録するとともに、
前記ひらがな形態素接合には、
前記形態素ごとの区切れ位置、
が併せて記録されていることを特徴としている。
【００１４】
これにより、解析対象である文字列が漢字やひらがな等の文字が混合されているものに対しても正確かつ迅速に形態素解析が行われることになる。
【００１５】
（７）本発明に係るネットワーク形態素解析システムは、
通信端末は、
日本語文字列をコンピュータサーバーに対して送信する文字列送信手段と、
コンピュータサーバーから送信される形態素解析結果を受信する形態素解析結果受信手段と、
を備え、
通信端末と通信可能に接続されたコンピュータサーバーは、
送信された日本語文字列の情報を受信する受信部と、
漢字を含んだ文字列の形態素を記録した漢字形態素記録部および、ひらがなを含んだ文字列の形態素を記録したひらがな形態素記録部、の少なくとも１つと、
形態素解析を行う形態素解析部と、
前記形態素解析部による形態素解析結果を通信端末に対して送信する形態素解析結果送信手段と、
を備えたネットワーク形態素解析システムであって、
前記漢字形態素記録部は、
漢字を含んだ文字列の形態素が、他の任意の文字列を後続して結合すると前記形態素の文字列中で区切られる少なくとも２つの形態素となるような形態素である場合には、
前記形態素に関連づけて前記区切れ位置を示す区切位置情報、
が併せて記録されており、
前記ひらがな形態素記録部は、少なくとも
文法的に接合されうる２つ以上のひらがな形態素の組み合わせであるひらがな形態素接合、
が記録されるともに、
前記ひらがな形態素接合には、
前記形態素ごとの区切れ位置が併せて記録されており、
漢字を含んだ文字列の形態素解析を行う前記形態素解析部は、
前記漢字形態素記録部から形態素を参照して読み出すことにより文字列の先頭文字から最長一致法によって第１形態素候補を取得する第１形態素候補取得手段と、
前記第１形態素候補に前記区切位置情報が記録されているならば区切れ位置の後続文字を先頭文字として最長一致法によって第１形態素候補の文字列より後続の文字列を含んだ第２形態素候補が取得できるか否かを判断する第２形態素候補取得判断手段と、
第２形態素候補が取得できるならば、第１形態素候補を前記区切れ位置で区切ったときの前半文字列と第２形態素候補とを形態素解析結果とし、第２形態素候補が取得できなければ、第１形態素候補を形態素解析結果とする漢字文字列形態素確定手段と、
を備え、
ひらがなを含んだ文字列の形態素解析を行う前記形態素解析部は、
前記ひらがな形態素記録部からひらがな形態素接合とひらがな形態素とを参照して読み出すことにより文字列の中に前記ひらがな形態素接合と合致するものがあれば前記ひらがな形態素の組み合わせを形態素解析結果とするひらがな文字列形態素確定手段、
を備えたことを特徴としている。
【００１６】
これにより、端末装置側で漢字形態素記録部やひらがな形態素記録部、形態素解析部を備える必要がない。したがって、より多数の端末装置のユーザーが容易に形態素解析結果を得ることができる。漢字を含んだ文字列の形態素解析については、最長一致法によって取得される１つの形態素であっても、他の後続の文字列と結合するとその形態素の文字列中で区切られる少なくとも２つの形態素となるような形態素である場合には、確実にその区切れ位置で区切られた２つの形態素として判断されることになる。したがって、形態素解析を効率良く、かつ正確に行うことができる。また、区切位置情報を併せて記録しておくだけでよいのであるから、一定の解析結果が得られるような辞書の作成や、その記録内容を保守・管理することが容易になる。ひらがなを含んだ文字列の形態素解析については、あらかじめ記録された形態素記録部との照合を行って形態素解析が行われるのであるから、形態素の同定までのプログラムの計算量が大幅に削減される。したがって、形態素解析が非常に迅速に行われることになる。
【００１７】
（８）本発明に係る前記区切位置情報は、
前記形態素の後ろの文字からＮ文字目に位置することを示す情報とされており、
前記Ｎ文字目に位置することを示す情報は、コンピュータが、
前記形態素に記録された前記Ｎ文字目の位置の後続文字を先頭文字として、最長一致法によってその後続文字を含んだ形態素が取得できるか否かの判断するために利用する情報であること、
を特徴としている。
以上
【００１８】
これにより、第１形態素候補を取得したときの判断対象位置からN文字後退する処理が行われるのであるから、第１形態素候補の文字列の先頭に戻ってさらにN文字移動するという処理を行う場合と比べると、プログラムの計算量が削減される。したがって、形態素解析がより迅速に行われる。
【００１９】
用語の定義と実施形態との対応について説明する。
【００２０】
この発明において、
「区切位置情報」とは、ある形態素が、他の任意の文字列を後ろに結合することを想定したときに、その形態素の文字列中で区切られる少なくとも２つの形態素となるような形態素であると判断される場合において、その形態素の文字列中の区切れ位置を示すものをいう。実施形態では、図３AのN文字後退データがこれに該当するものであり、このデータは、そのような区切れ位置が形態素の後ろの文字からN文字目にある場合のそのN（整数）の値を示している。
【００２１】
【発明の実施の形態】
（１）本発明の実施形態による形態素解析システムの構成
本発明に係る形態素解析システムの実施形態を図面に基づいて説明する。本発明における形態素解析システムとしての日本語形態素解析システムは、図２で示すブロック図で表されるような構成によるパーソナルコンピュータに備えられている。このシステムは、メモリ１００、キーボード１０３、ＣＰＵ１０１、ＣＲＴ１０２、ハードディスク１０４を備えている。そして、ハードディスク１０４には、図１に示す、字種区別プログラム１０と、形態素に対する読み付与プログラム２０、本発明における漢字文字列形態素解析方法を行うものとしての漢字文字列形態素解析プログラム１４、本発明におけるひらがな文字列形態素解析方法を行うものとしてのひらがな文字列形態素解析プログラム１６、漢字形態素記録部としての漢字文字列形態素N文字登録辞書１２、ひらがな形態素記録部としてのひらがな形態素接合リスト辞書１８とが記録されている。漢字文字列形態素N文字登録辞書１２とひらがな形態素接合リスト辞書１８は、フロッピーディスク（登録商標）１０５またはＣＤ−ＲＯＭ１０６に記録されており、それらを読みとることによってハードディスク１０４に記録されることになる。なお、漢字文字列形態素N文字登録辞書１２とひらがな形態素接合リスト辞書１８は、本実施形態に限らず、ＩＣカード等のコンピュータ可読の記録媒体からインストールしてもよいし、さらに、通信回線を用いてダウンロードするようにすることもできる。
【００２２】
ＣＰＵ１０１の動作のための上記プログラムについても、本実施形態に限らず、ＩＣカード等のコンピュータ可読の記録媒体からインストールさせるようにしてもよい。さらに、通信回線を用いてプログラムをダウンロードするようにすることもできる。また、ＣＤ−ＲＯＭ１０６等からプログラムをインストールすることにより、ＣＤ−ＲＯＭ１０６等に記憶させたプログラムを間接的にコンピュータに実行させるようにするのではなく、ＣＤ−ＲＯＭ１０６等に記憶させたプログラムを直接的に実行するようにしてもよい。
【００２３】
なお、コンピュータによって実行可能なプログラムとしては、そのままインストールするだけで直接実行可能なものはもちろん、一旦他の形態等に変換が必要なもの（例えば、データ圧縮されているものを解凍する等）、さらには、他のモジュール部分と組合して実行可能なものも含む。
【００２４】
システム全体の形態素解析の流れは、文字入力部としてのキーボード１０３等によって入力された文字列に対して文頭から文末に向けて判断対象位置を移動しながら、字種区別プログラム１０によってひらがなを含んだ文字列と漢字を含んだ文字列とを区別して対応するプログラムにより解析を行い、さらに次の文字列に判断対象を移動することの繰り返しによって解析データを得てメモリ１００に記録する。
【００２５】
また、形態素に対する読み付与プログラム２０については、これに限定されるものではなく、解析結果を利用する目的に応じてプログラムを変更することもでき、例えば、音声合成のためにアクセントを付与したり、ひらがなから漢字への変換、品詞解析を行ったり、あるいは単に形態素解析のみを行ったりする等のプログラムを備える場合が考えられる。
【００２６】
（２）漢字文字列形態素N文字登録辞書１２の構成
図３Aに、漢字文字列形態素N文字登録辞書１２の例の一部を示す。この辞書には、漢字形態素記録部としての漢字辞書のカラムと、区切位置情報としてのN文字後退データのカラムが設けられている。
【００２７】
漢字辞書のカラムには、形態素としての漢字文字列および、漢字とカタカナあるいはひらがなとの混合文字列等が記録されている。
【００２８】
N文字後退データのカラムには、それぞれの形態素に対応づけて、その形態素の後ろの文字からN文字目に区切れ位置がある場合のそのNの値（整数）が記録されている。つまりこのNの値は、その形態素が、他の任意の文字列を後続して結合するとその形態素の文字列中に区切れ位置を持つ少なくとも２つの形態素となるような形態素である場合には、その文字列中の区切れ位置が形態素の後ろの文字からN文字目にあることを示している。
【００２９】
例えば、辞書１２中の「中古車」は１つの形態素であるが、後ろに別の文字列を結合することにより「中古｜車庫」や、「中古｜車輪」、「中古｜車両」等の、それぞれ２つの形態素を得ることができる。よって、この場合は「中古車」の後ろから１文字目が区切れ位置となる可能性があると判断できるため、N文字後退データには”１”が記録されている。同様に「省エネ」は１つの形態素であるが、「省｜エネルギー」という２つの形態素を得ることができ、「省エネ」の後ろから２文字目が区切れ位置となる可能性があると判断できるため、N文字後退データには”２”が記録されている。
【００３０】
つまり、辞書１２には、形態素が、他の文字列を後ろに結合するとその形態素の文字列中で区切られる２以上の形態素となるような形態素であるにもかかわらず１つの形態素として記録されている場合には、１つの形態素ではなく、そのような２つ以上の形態素となる可能性があるものとして、あらかじめその形態素に関連づけてその区切れ位置を記録していることになる。
【００３１】
なお、１つの形態素の文字列中にそのような区切れ位置があることの判断としては、上記のように形態素の後ろに文字列を結合した場合を想定する場合のほか、形態素が、その文字列を区切った場合の前半文字列に他の任意の文字列を後続して結合した場合を想定することも考えられる。例えば、上記の「省エネ」の例においては、その文字列中を区切った場合の前半文字列の「省」に、他の任意の文字列を結合した場合を想定すると、「省｜資源」や、「省｜電力」等のそれぞれ２つの形態素となるのであるから、「省」の後ろに区切れ位置がある、という判断をすることもできる。
【００３２】
ここで、上記の例で示した区切れ位置の判断の根拠は、「中古車」における「車」や、「省エネ」における「省」は、それぞれの形態素におけるいわゆる造語要素であることに起因するものである。
【００３３】
（３）漢字を含む文字列の形態素解析方法
図４に、本実施形態による、漢字文字列形態素解析プログラム１４が漢字を含む文字列の形態素解析をする場合の処理を示す。なお、本形態素解析方法は、従来の最長一致法を利用しつつその欠点を補完するものであり、以下、この方法を「最長一致ーN文字後退法」とする。
【００３４】
まず、漢字文字列形態素解析プログラム１４が漢字文字列形態素N文字登録辞書１２を参照することにより、先頭文字から最長一致法によって第１形態素候補を取得する（ステップＳ１）。そして、取得した文字列のN文字後退データを判断し（ステップＳ２）、N=０であれば第１形態素候補を形態素解析結果として決定する（ステップＳ３）。一方、N=０以外であれば、第１形態素候補の文字列からN文字後退した位置まで判断対象位置を後退し、その位置から再度後続の文字列に対して最長一致法による第２形態素候補の取得を開始する（ステップＳ７）。
【００３５】
次に、第２形態素候補としてN+１以上の文字列の形態素が取得できたか否かを判断し（ステップＳ８）、取得されていなければ最初に取得した第１形態素候補を形態素解析結果として決定する（ステップＳ３）。一方、取得されているならば、第１形態素候補をN文字後退した位置で区切った場合の前半文字列と、第２形態素候補とを形態素解析結果として確定する（ステップＳ９）。
【００３６】
そして、ステップＳ３またはステップＳ９において決定された形態素を記録し（ステップＳ４）、文字列のデータの全ての形態素を記録していなければ、残りの文字列に対して最長一致法による第１形態素候補の取得が開始され（ステップＳ１０）、以降同じ流れを繰り返して形態素解析が進み、全てのデータの形態素を記録したならば解析を終了する（ステップＳ６）。
【００３７】
以上のような形態素解析方法の具体例として、図５に、入力文字列「中古車両」が与えられた例を示す。辞書１２には、「中古」だけでなく「中古車」も形態素として記録されているのであるから、最長一致法によると「中古車」が第１形態素候補として選択される（ステップＳ５０）。そして、「中古車」にはN=１が記録されているから（ステップＳ５１）、その判断対象位置を１文字後退させて、「車」から後続する文字列について最長一致法によって第２形態素候補が取得できるか判断する（ステップＳ５２）。そして、図３Aに例示するように、「車両」が辞書に登録されているから（ステップＳ５３）、「中古」（「中古車」を「車」の位置で区切った前半文字列）と「車両」の両者を形態素とし、「中古｜車両」を解析結果とする（ステップＳ５４、Ｓ５５、Ｓ５６、Ｓ５７）。
【００３８】
（４）漢字を含む文字列の形態素解析方法による効果
以上のような方法によれば、最長一致法によって取得される１つの形態素であっても、他の後続の文字列と結合するとその形態素の文字列中で区切られる少なくとも２つの形態素となるような形態素である場合には、確実にその区切れ位置で区切られた２つの形態素として判断されることになる。したがって、形態素解析を効率良く、かつ正確に行うことができる。また、区切位置情報を併せて記録しておくだけでよいのであるから、一定の解析結果が得られるような辞書の作成や、その記録内容を保守・管理することが容易になる。
【００３９】
以下、そのような本実施形態による効果を、従来の形態素解析方法と比較しつつ図５で示した例を用いて説明する。
【００４０】
まず、従来の最長一致法によると、「中古車」という形態素が辞書に記録されていれば、「中古車｜両」というユーザーにとって望ましくない解析結果になってしまう。一方、「中古車両」を記録しておけば「中古車両」が解析結果とされ、「中古車｜両」という結果よりは好ましい結果が得られることになる。しかしそのような記録をするにしても、それは「中古車両」についてのみ効果が得られるだけで、同様な問題が生じうる「中古車輪」や「中古車庫」等のあらゆる可能性についてまで検討して辞書の記録をすることは多大な労力を要する。また、「中古車」が記録されていなければ、確かに本実施形態で示したのと同様に「中古｜車両」という解析結果が得られる。しかしながら、形態素としての「中古車」を辞書に記録すると望ましい結果が得られなくなるということを辞書の記録段階で予想することは困難であり、仮にそのような不都合が生ずる度に辞書の記録内容を変更することとすれば、他の記録内容やプログラムに影響を及ぼすことにつながり、一定の解析結果が得られるように辞書の記録内容を保守・管理することに多大な労力を要することになる。
【００４１】
なお、その他の解析手法である分割数最小法（入力文字列を構成する形態素の総数が最小になる結果を優先する手法）によっても、「中古車｜両」も「中古｜車両」も同じく２形態素であるからいずれが解析結果であるかが決定されない。
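The tie described in this paragraph can be checked with a small sketch. It enumerates every segmentation of 「中古車両」 over the seven dictionary entries listed in paragraph 【０００５】 and keeps those with the fewest morphemes. The function name `segmentations` and the exhaustive enumeration are illustrative assumptions, not something the patent specifies.

```python
def segmentations(text, dictionary):
    """Yield every way to split text into dictionary morphemes."""
    if not text:
        yield []
        return
    for length in range(1, len(text) + 1):
        head = text[:length]
        if head in dictionary:
            for tail in segmentations(text[length:], dictionary):
                yield [head] + tail

# Entries taken from paragraph [0005] of the description.
DICTIONARY = {"中", "古", "車", "両", "中古", "車両", "中古車"}

all_splits = list(segmentations("中古車両", DICTIONARY))
fewest = min(len(s) for s in all_splits)
# Every segmentation that reaches the minimum morpheme count.
best = [s for s in all_splits if len(s) == fewest]
for s in best:
    print("｜".join(s))
```

Both 「中古｜車両」 and 「中古車｜両」 reach the minimum of two morphemes, so the minimum division number criterion alone cannot choose between them.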
【００４２】
この点、本実施形態による最長一致ーN文字後退法を採用すれば、確実に「中古｜車両」や「中古｜車輪」等のユーザーが要求する形態素の同定が行われる。
【００４３】
また、その他の解析手法として従来、様々な入力文に対する望ましい形態素の組み合わせを学習させたり、統計的に最も可能性の高い解析結果が優先されるような解析プログラムも既存するが、上記のような問題が生じると予想されるものについて全ての例を記録することは多大な労力を要するうえ、学習・統計データの作成コストが大きな問題となる。そのうえ、学習された解析結果や統計的に最も可能性の高い解析結果が、任意の入力文の解析についてユーザーにとって望ましい形態素の同定結果となるか否かは保証されない。
【００４４】
この点、本実施形態の漢字文字列形態素解析プログラム１４によると、上記のような学習・統計プログラムを備える必要がないのであるから、従来のプログラムと比較すると非常に単純なプログラムを採用することができ、プログラム容量を小さくすることができる。
【００４５】
なお、本実施形態における漢字文字列形態素N文字登録辞書１２は、区切位置情報を形態素の後ろの文字からN文字目の位置を示すデータとして記録しているが、これに限定されるものではない。例えば、形態素の前の文字からN文字目であるものとして記録してもよい。ただし、本実施形態のように後ろの文字からN文字目の位置を示すようにすることにより、第１形態素候補を取得したときの判断対象位置からN文字後退する処理が行われるのであるから、第１形態素候補の文字列の先頭に戻ってさらにN文字移動するという処理を行う場合と比べると、プログラムの計算量が削減される。したがって、形態素解析がより迅速に行われる。
【００４６】
また、漢字文字列形態素N文字登録辞書１２のその他の実施形態として、図３Bは、漢字文字列形態素区切位置ポインタ登録辞書の例の一部を示す。この辞書においては、図３AのようなN文字後退データではなく、区切位置情報としての区切位置ポインタを形態素の文字列中に記録しており、この場合、第１形態素候補を取得したときの判断対象位置が、そのポインタの位置に移動して第２形態素候補を取得する処理が行われることになる。
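One way to picture the pointer-based dictionary of FIG. 3B is an entry string that carries a marker at the delimiter position. The sketch below is only an illustration under that assumption: the marker character `^`, the raw entry format and the function `parse_entry` are hypothetical and do not come from the patent figures.

```python
POINTER = "^"  # hypothetical marker standing in for the delimiter position pointer

def parse_entry(raw):
    """Return (surface, split_index); split_index is None when no pointer is recorded."""
    idx = raw.find(POINTER)
    if idx == -1:
        return raw, None
    return raw.replace(POINTER, "", 1), idx

# Hypothetical raw entries: the pointer sits where the morpheme may be divided.
RAW_ENTRIES = ["中古^車", "省^エネ", "車両"]
pointer_dict = dict(parse_entry(raw) for raw in RAW_ENTRIES)

def resume_position(start, surface):
    """Position from which the second morpheme candidate search would begin."""
    split_index = pointer_dict[surface]
    return None if split_index is None else start + split_index

print(pointer_dict)
```

With this encoding the search for the second candidate resumes at the pointer position itself, rather than at a position computed by stepping back N characters.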
【００４７】
（５）ひらがな形態素接合リスト辞書１８の構成
図６に、ひらがな形態素記録部としての、ひらがな形態素接合リスト辞書１８の例の一部を示す。この辞書のカラムには、ひらがなの形態素と、ひらがな形態素接合とを記録した、ひらがな形態素記録部としてのひらがな辞書が記録されている。
【００４８】
ここで、ひらがな形態素接合とは、ひらがな等で表される複数の形態素を文法的な接合の正しさを考慮して接合させるとともにその区切れ位置も併せて記録された文字列である。本実施形態では、ひらがな等で表される文字列において一般的に大部分を占める、助詞、助動詞、基本動詞、形式名詞等についてのひらがな形態素接合を記録している。
【００４９】
例えば、「の」と「ところ」は別々の形態素と判断されうるものであるが、格助詞「の」に、「ところ」が接合されて存在の関係等を表すことが文法的に正しいのであるから、「の｜ところ」という２つの形態素が接合されたものである形態素接合を辞書に記録している。
【００５０】
（６）ひらがなを含む文字列の形態素解析方法
図７に、本実施形態による、ひらがな文字列形態素解析プログラム１６がひらがなを含む文字列の形態素解析をする場合の処理を示す。
【００５１】
まず、ひらがな文字列形態素解析プログラム１６がひらがな文字列形態素接合リスト辞書１８を参照してそのデータに対する照合を行う（ステップＳ２０）。この照合は、一連の文字列のデータの全ての文字について、辞書１８に記録されたいずれかの形態素あるいは形態素接合と合致するまで、いわゆる総当たりによる検索処理を繰り返すことによって行われる（ステップＳ２１）。次に、取得された形態素候補のそれぞれの形態素間の文法的な接合規則が適切か否かを判断し（ステップＳ２２）、適切でなければ再びステップＳ２０に戻って別の組み合わせによる形態素等の候補を取得することになる。なお、この接合規則の判断については形態素解析プログラム一般に用いられているものと同様である。そして、接合規則が適切であれば、取得したものを確定した形態素として決定・記録し（ステップＳ２３）、さらに、残りのひらがな文字列に対しても同様の照合処理が行われ（ステップＳ２４、Ｓ２５）、全てのデータの解析結果を記録したならば解析を終了する（ステップＳ２６）。
【００５２】
以上のようなひらがなの形態素解析方法の具体例として、図８に、入力文字列「ぎりぎりのところまで」が与えられた例を示す。まず、ひらがな文字列形態素解析プログラム１６がひらがな文字列形態素接合リスト辞書１８を参照してそのデータに対する照合を行う（ステップＳ７０）。そして、辞書１８に記録された、「の｜ところ」という形態素接合と、「ぎりぎり」「まで」という形態素を取得すれば一連の文字列の全ての文字に対して形態素が取得されることになるのであるから、これらを形態素候補として取得し（ステップＳ７１）、接合規則は適切であると判断され（ステップＳ７２）、それらを確定した形態素として決定・記録し（ステップＳ７３、Ｓ７４）、形態素解析を終了し、「ぎりぎり｜の｜ところ｜まで」を解析結果とする（ステップＳ７５）。
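The FIG. 8 example can be sketched as a lookup against a junction list. The code below assumes a dictionary that maps each surface string to the morphemes it is divided into, tries longer entries first, and backtracks until the whole string is covered. The grammatical connection check of step S22 is omitted, and the names `cover` and `HIRAGANA_DICT` are illustrative assumptions.

```python
def cover(text, entries):
    """Return the morphemes covering all of text, or None if no cover exists."""
    if not text:
        return []
    # Trying longer entries first is an assumption; the description only
    # says that matching is repeated until every character is covered.
    for surface in sorted(entries, key=len, reverse=True):
        if text.startswith(surface):
            rest = cover(text[len(surface):], entries)
            if rest is not None:
                return list(entries[surface]) + rest
    return None

HIRAGANA_DICT = {
    "のところ": ("の", "ところ"),  # hiragana morpheme junction with its delimiter position
    "ぎりぎり": ("ぎりぎり",),     # single morpheme
    "まで": ("まで",),             # single morpheme
}

morphemes = cover("ぎりぎりのところまで", HIRAGANA_DICT)
print("｜".join(morphemes))
```

Because the junction 「の｜ところ」 is matched as one unit, the competing reading 「のと｜ころ」 mentioned in paragraph 【００５５】 is never generated in this sketch.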
【００５３】
（７）ひらがなを含む文字列の形態素解析方法による効果
以上のような方法によれば、あらかじめ記録した辞書１８との照合を行って形態素解析が行われるのであるから、形態素の確定までのプログラムの計算量が大幅に削減される。したがって、形態素解析が非常に迅速に行われることになる。
【００５４】
以下、そのような本実施形態による効果を、従来の形態素解析方法と比較しつつ図８で示した例を用いて説明する。
【００５５】
従来の形態素解析のプログラムは、入力文字列を順次分解することによってあらゆる可能性の形態素候補を取得しつつ、それらの形態素の接合の可能性を考慮して解析を行っていた。例えば、従来のプログラムであれば図８の例においては、入力文字列を順次分解しつつ形態素候補を判断して取得するのであるから、「ぎりぎり」の後続文字列について、「のと｜ころ｜まで」（「のと」は、格助詞「の」に、「と」が接合されて同格の関係等を表す）という誤った解析結果としてしまう可能性も含んでいるうえに、それらの形態素の取得に至るまでの処理時間が長くなってしまう。
【００５６】
この点、本実施形態による接合リスト辞書との照合処理を採用すれば、迅速かつ確実に「ぎりぎり｜の｜ところ｜まで」というユーザーが要求する形態素の同定が行われる。
【００５７】
なお、辞書１８には、通常の日本語文のひらがな等で表される文字列の大部分を占める、助詞または、助動詞、基本動詞、形式名詞等についてのひらがな形態素接合を記録している。これによれば、ひらがなを含む文字列の形態素のみを辞書に記録していた従来の方法に比べて辞書の容量が大きくなるのが一般的である。しかしながら、本実施形態によれば、プログラムの計算量が大幅に削減されるのであるからプログラム容量を小さくすることができ、結果として迅速な解析が行われることにより、ユーザーにとって好ましい形態素解析システムが実現されることになる。
【００５８】
以上、漢字を含む文字列の形態素解析方法とひらがなを含む文字列の形態素解析方法を説明したが、本実施形態における日本語形態素解析システムはこれらの両者を行うプログラムを備えているのであるから、解析対象である文字列が漢字やひらがな等の文字が混合されているものに対しても正確かつ迅速に形態素解析が行われることになる。特に、日本語の文字列についてはひらがなが含まれる比率と、漢字が含まれる比率がそれぞれ約５０％程度であるから、字種区別プログラム１０によって最初にそれぞれを区別しておけばある程度の分かち書きがされているのに等しく、区別されたそれぞれに対応するプログラムにより解析がなされるのであるから、より一層迅速な解析が行われることになる。
【００５９】
なお、形態素解析結果の良否の判断基準は、その結果の利用目的の差異によって異なる部分もあるため、図３Aや図６の辞書の記録内容については、これに限定されるものではなく、その利用目的に対応した望ましい解析結果が得られるように辞書を記録することになる。これについては、辞書の作成者だけでなくユーザー自身もその目的に応じて辞書の内容を変更できるようにしてもよく、具体的には、N文字データを変更したり、新たなひらがな形態素接合を記録したり、また、誤りがあれば訂正できるようにすることもできる。
【００６０】
（８）本発明に係る形態素解析システムのその他の実施形態
以上の実施形態においては、日本語形態素解析システムとしてパーソナルコンピュータのプログラムが入力文字列を解析するシステムを示したが、これに限定されるものではなく、その他の日本語形態素解析一般の利用、例えば、ワードプロセッサの漢字かな変換用のプログラムおよび辞書の記録等にも採用できる。
【００６１】
また、図９に示すように、ネットワークを利用した形態素解析サービスにも採用することができる。この場合、通信端末としての端末装置５０のユーザーは、解析の対象となる文字列を装置に入力し、文字列送信手段がインターネット５２を介してそのデータをコンピュータサーバーとしての形態素解析サービスサーバー５６に送信する。そして、受信部と、漢字形態素記録部とひらがな形態素記録部の少なくとも１つ、形態素解析部、形態素解析結果送信手段、とを備えた形態素解析サービスサーバー５６は、そのデータの形態素解析を行った後その解析結果を端末装置５０に対して送信することになる。
【００６２】
これにより、端末装置側で漢字形態素記録部やひらがな形態素記録部、形態素解析部を備える必要がない。したがって、より多数の端末装置のユーザーが容易に形態素解析結果を得ることができる。
【図面の簡単な説明】
【図１】図１は、本発明の実施形態による、日本語形態素解析システムのプログラム構成を示す図である。
【図２】図２は、日本語形態素解析システムを備えたパーソナルコンピュータのブロック図である。
【図３】図３Aは、本発明の実施形態による、漢字文字列形態素N文字後退辞書の例の一部を示す図である。図３Bは、本発明のその他の実施形態による、漢字文字列形態素区切位置ポインタ登録辞書の例の一部を示す図である。
【図４】図４は、本発明の実施形態による、漢字を含む文字列の形態素解析方法を示す図である。
【図５】図５は、本発明の実施形態による、漢字を含む文字列の形態素解析方法の具体例を示す図である。
【図６】図６は、本発明の実施形態による、ひらがな形態素接合リスト辞書の例の一部を示す図である。
【図７】図７は、本発明の実施形態による、ひらがなを含む文字列の形態素解析方法を示す図である。
【図８】図８は、本発明の実施形態による、ひらがなを含む文字列の形態素解析方法の具体例を示す図である。
【図９】図９は、本発明のその他の実施形態による、ネットワーク日本語形態素解析システムを示す図である。
【符号の説明】
１０・・・字種区別プログラム
１２・・・漢字文字列形態素N文字登録辞書
１４・・・漢字文字列形態素解析プログラム
１８・・・ひらがな形態素接合リスト辞書
１６・・・ひらがな文字列形態素解析プログラム
２０・・・形態素に対する読み付与プログラム
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a morphological analysis system and a morphological analysis method, and in particular to the morphological analysis of character strings containing Japanese kanji and hiragana.
[0002]
[Prior art]
Japanese orthography does not separate words with spaces, so the main task of a morphological analysis program running on a computer or the like is to find word boundaries and identify the morphemes.
[0003]
Known morphological analysis methods include the longest match method, which analyzes the input character string sequentially from its beginning, taking as the candidate at each step the dictionary morpheme with the longest match; the minimum division number method; and methods that use learning functions or statistical language models.
[0004]
[Problems to be solved by the invention]
Because many Japanese character strings are polysemous or ambiguous, morpheme identification by a morphological analysis program has the following problems.
[0005]
(1) An incorrect analysis result may be obtained for an input sentence made up of consecutive kanji.
For example, for the input character string “中古車両” (used vehicle), the desirable result is “中古｜車両” (used | vehicle), where “｜” marks a boundary between morphemes (the same applies hereinafter). However, if the dictionary read by the program records the morphemes “中”, “古”, “車”, “両”, “中古” (used), “車両” (vehicle) and “中古車” (used car), analysis by the longest match method yields “中古車｜両”.
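The failure described here is easy to reproduce. The following is a minimal sketch of a plain longest match segmenter run over the seven dictionary entries named in the paragraph; the function name `longest_match` and the handling of unknown characters are assumptions made for illustration.

```python
def longest_match(text, dictionary):
    """Segment text greedily, always taking the longest dictionary entry."""
    result, pos = [], 0
    max_len = max(len(w) for w in dictionary)
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in dictionary:
                break
        else:
            piece = text[pos]  # assumption: an unknown character is emitted alone
        result.append(piece)
        pos += len(piece)
    return result

# The seven morphemes listed in the paragraph above.
DICTIONARY = {"中", "古", "車", "両", "中古", "車両", "中古車"}

segmented = longest_match("中古車両", DICTIONARY)
print("｜".join(segmented))
```

Because “中古車” is the longest entry starting at the first character, the segmenter commits to it and is left with the single character “両”.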
[0006]
(2) The analysis time of input sentences including hiragana is long.
[0007]
For input sentences including hiragana as well, the analysis obtains every possible morpheme candidate by successively decomposing the input character string while considering whether those morphemes can be grammatically joined, so the amount of computation becomes very large and quick analysis is difficult.
[0008]
In view of the above problems, an object of the present invention is to provide a morpheme analysis system and a morpheme analysis method capable of performing morpheme analysis accurately and quickly.
[0009]
[Means for Solving the Problems and Effects of the Invention]
(1) A kanji character string morpheme analysis method according to the present invention includes:
a kanji character string morpheme analysis method in which a computer system performs morphological analysis of a character string including kanji by referring to a morpheme recording unit in which morphemes are recorded, wherein
in the morpheme recording unit,
when a morpheme is one that, if any other character string is appended after it, becomes at least two morphemes divided within the character string of that morpheme, delimiter position information indicating that dividing position is recorded in association with the morpheme, and
when performing morphological analysis of an input character string, the computer system
acquires a first morpheme candidate by the longest match method from the first character of the character string, by referring to and reading out morphemes from the morpheme recording unit,
if delimiter position information is recorded for the first morpheme candidate, determines whether a second morpheme candidate that extends beyond the character string of the first morpheme candidate can be acquired by the longest match method, taking the character following the dividing position as the first character,
if the second morpheme candidate can be acquired, takes as the morphological analysis result the first-half character string obtained by dividing the first morpheme candidate at the dividing position, together with the second morpheme candidate, and
if the second morpheme candidate cannot be acquired, takes the first morpheme candidate as the morphological analysis result.
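The method of item (1) can be sketched in a few lines. The code below assumes that the delimiter position information is stored as an integer N counted from the end of the morpheme, with 0 meaning that none is recorded, which is the encoding used in the embodiment. The names `longest_entry`, `analyze` and `KANJI_DICT`, and the treatment of characters missing from the dictionary, are illustrative assumptions rather than the patent's own program.

```python
def longest_entry(text, pos, dictionary):
    """Return the longest dictionary entry starting at pos, or None."""
    max_len = max(len(w) for w in dictionary)
    for length in range(min(max_len, len(text) - pos), 0, -1):
        piece = text[pos:pos + length]
        if piece in dictionary:
            return piece
    return None

def analyze(text, dictionary):
    """Longest match that consults delimiter position information.

    dictionary maps each morpheme to N, the delimiter position counted from
    the end of the morpheme (0 means no delimiter position is recorded).
    """
    result, pos = [], 0
    while pos < len(text):
        # Assumption: a character missing from the dictionary is emitted alone.
        first = longest_entry(text, pos, dictionary) or text[pos]
        n = dictionary.get(first, 0)
        if n > 0:
            retry_pos = pos + len(first) - n
            second = longest_entry(text, retry_pos, dictionary)
            # Accept the second candidate only if it extends past the first one.
            if second is not None and len(second) > n:
                result.append(first[:-n])  # first half of the first candidate
                result.append(second)
                pos = retry_pos + len(second)
                continue
        result.append(first)
        pos += len(first)
    return result

KANJI_DICT = {"中": 0, "古": 0, "車": 0, "両": 0, "中古": 0, "車両": 0, "中古車": 1}

print("｜".join(analyze("中古車両", KANJI_DICT)))
print("｜".join(analyze("中古車", KANJI_DICT)))
```

In the first call the retry from “車” finds “車両”, which extends past “中古車”, so the result is “中古｜車両”. In the second call the retry finds only “車”, so the first candidate “中古車” is kept as a single morpheme.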
[0010]
Thus, even a single morpheme obtained by the longest match method is reliably judged to be two morphemes divided at the dividing position when it is a morpheme that, combined with a following character string, becomes at least two morphemes divided within its own character string. Morphological analysis can therefore be performed efficiently and accurately. Moreover, since it suffices to record the delimiter position information alongside each entry, it becomes easy to create a dictionary that yields consistent analysis results and to maintain and manage its contents.
[0013]
  (6) The morphological analysis system according to the present invention comprises:
  a character input unit for inputting a character string;
  at least one of a kanji morpheme recording unit that records morphemes of character strings including kanji, and a hiragana morpheme recording unit that records morphemes of character strings including hiragana; and
  a morphological analysis unit that performs morphological analysis,
  wherein, in the kanji morpheme recording unit,
  when a morpheme of a character string including kanji contains a character string that becomes at least two morphemes if any other character string is appended after at least one of the first-half character string or the second-half character string obtained by dividing the morpheme,
  delimiter position information indicating the dividing position of the morpheme is also recorded in association with the morpheme,
  the delimiter position information being information that a computer uses to determine whether a morpheme including the following character can be acquired by the longest match method, taking the character following the dividing position recorded for the morpheme as the first character,
  and wherein the hiragana morpheme recording unit
  records at least hiragana morpheme junctions, each being a combination, among combinations of two or more hiragana morphemes including a particle, an auxiliary verb, a formal noun or a basic verb, whose morphemes can be joined in a grammatically correct way,
  and the dividing position of each morpheme is also recorded in each hiragana morpheme junction.
[0014]
As a result, the morphological analysis is performed accurately and quickly even when the character string to be analyzed is a mixture of characters such as kanji and hiragana.
[0015]
  (7) In the network morphological analysis system according to the present invention,
  a communication terminal comprises:
  character string transmission means for transmitting a Japanese character string to a computer server; and
  morphological analysis result receiving means for receiving a morphological analysis result transmitted from the computer server,
  and the computer server, communicably connected to the communication terminal, comprises:
  a receiving unit that receives the transmitted Japanese character string information;
  at least one of a kanji morpheme recording unit that records morphemes of character strings including kanji, and a hiragana morpheme recording unit that records morphemes of character strings including hiragana;
  a morphological analysis unit that performs morphological analysis; and
  morphological analysis result transmission means for transmitting the morphological analysis result produced by the morphological analysis unit to the communication terminal,
  wherein, in the kanji morpheme recording unit,
  when a morpheme of a character string including kanji is one that, if any other character string is appended after it, becomes at least two morphemes divided within the character string of that morpheme,
  delimiter position information indicating the dividing position is also recorded in association with the morpheme,
  the hiragana morpheme recording unit records at least
  hiragana morpheme junctions, each being a combination of two or more hiragana morphemes that can be grammatically joined,
  and the dividing position of each morpheme is also recorded in each hiragana morpheme junction,
  the morphological analysis unit, when analyzing a character string including kanji, comprises:
  first morpheme candidate acquisition means for acquiring a first morpheme candidate by the longest match method from the first character of the character string, by referring to and reading out morphemes from the kanji morpheme recording unit;
  second morpheme candidate acquisition determination means for determining, if delimiter position information is recorded for the first morpheme candidate, whether a second morpheme candidate that extends beyond the character string of the first morpheme candidate can be acquired by the longest match method, taking the character following the dividing position as the first character; and
  kanji character string morpheme determination means that, if the second morpheme candidate can be acquired, takes as the morphological analysis result the first-half character string obtained by dividing the first morpheme candidate at the dividing position, together with the second morpheme candidate, and, if it cannot be acquired, takes the first morpheme candidate as the morphological analysis result,
  and the morphological analysis unit, when analyzing a character string including hiragana, comprises
  hiragana character string morpheme determination means that, by referring to and reading out the hiragana morpheme junctions and hiragana morphemes from the hiragana morpheme recording unit, takes the combination of hiragana morphemes as the morphological analysis result when the character string contains a match for a hiragana morpheme junction.
[0016]
  As a result, the terminal device does not need its own kanji morpheme recording unit, hiragana morpheme recording unit or morphological analysis unit, so a larger number of terminal device users can easily obtain morphological analysis results. For character strings including kanji, even a single morpheme obtained by the longest match method is reliably judged to be two morphemes divided at the dividing position when it is a morpheme that, combined with a following character string, becomes at least two morphemes divided within its own character string. Morphological analysis can therefore be performed efficiently and accurately. Moreover, since it suffices to record the delimiter position information alongside each entry, it becomes easy to create a dictionary that yields consistent analysis results and to maintain and manage its contents. For character strings including hiragana, analysis is performed by collation with the morpheme recording unit recorded in advance, so the amount of computation needed to identify the morphemes is greatly reduced and morphological analysis is performed very quickly.
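  The division of work between terminal and server in item (7) can be pictured with a small sketch. Everything about the wire format here is an assumption: the patent does not specify a message encoding, so JSON is used only for illustration, and `handle_request`, `build_request` and `stub_analyzer` are hypothetical names. The analyzer is passed in as a parameter, and the stub stands in for the dictionaries and analysis unit held on the server.

```python
import json

def build_request(text):
    """Terminal side: wrap the Japanese character string for transmission."""
    return json.dumps({"text": text}, ensure_ascii=False)

def handle_request(body, analyzer):
    """Server side: decode the received string, analyze it, encode the result."""
    text = json.loads(body)["text"]
    return json.dumps({"morphemes": analyzer(text)}, ensure_ascii=False)

def stub_analyzer(text):
    # Stand-in for the server's morphological analysis unit (assumption):
    # a fixed lookup instead of a real dictionary-based analysis.
    known = {"中古車両": ["中古", "車両"]}
    return known.get(text, [text])

request = build_request("中古車両")
response = handle_request(request, stub_analyzer)
received = json.loads(response)["morphemes"]  # terminal side: read the result
print("｜".join(received))
```

The terminal code contains no dictionary and no analysis logic, which is the point made in the paragraph above.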
[0017]
  (8) The delimiter position information according to the present invention
  is information indicating that the dividing position is located at the Nth character counted from the last character of the morpheme,
  and the information indicating the Nth character position is information that a computer uses
  to determine whether a morpheme including the following character can be acquired by the longest match method, taking the character following the Nth character position recorded for the morpheme as the first character.
[0018]
As a result, processing steps back N characters from the determination target position reached when the first morpheme candidate was acquired, so the amount of computation is smaller than when processing returns to the beginning of the first morpheme candidate's character string and then advances N characters. Morphological analysis is therefore performed more quickly.
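The two ways of reaching the dividing position can be written out as index arithmetic. The sketch below only shows that both routes arrive at the same character; it does not measure cost. The function names and the 0-based string positions are assumptions made for illustration.

```python
def resume_from_end(end_pos, n):
    """Step back N characters from the position just past the first candidate."""
    return end_pos - n

def resume_from_start(start_pos, offset_from_front):
    """Return to the start of the first candidate, then advance to the split."""
    return start_pos + offset_from_front

text = "中古車両"
start, first, n = 0, "中古車", 1       # first candidate and its N value
end = start + len(first)               # position reached after acquiring it

back = resume_from_end(end, n)
forward = resume_from_start(start, len(first) - n)
print(back, forward, text[back])
```

With N stored relative to the end, the resume position follows from the current position with one subtraction, without reference to where the first candidate began.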
[0019]
The correspondence between the definition of terms and the embodiments will be described.
[0020]
In this invention,
“delimiter position information” means information indicating the dividing position within the character string of a morpheme when that morpheme is judged to be one that, assuming any other character string is appended after it, becomes at least two morphemes divided within its own character string. In the embodiment, the N-character backward data of FIG. 3A corresponds to this; where such a dividing position lies at the Nth character from the last character of the morpheme, the data gives the value of N (an integer).
[0021]
DETAILED DESCRIPTION OF THE INVENTION
(1) Configuration of a morphological analysis system according to an embodiment of the present invention
An embodiment of the morphological analysis system according to the present invention will be described with reference to the drawings. The Japanese morphological analysis system, serving as the morphological analysis system of the present invention, is provided in a personal computer configured as shown in the block diagram of FIG. 2. This system includes a memory 100, a keyboard 103, a CPU 101, a CRT 102 and a hard disk 104. The hard disk 104 stores the following, shown in FIG. 1: the character type discrimination program 10; the reading assignment program 20 for morphemes; the kanji character string morpheme analysis program 14, which performs the kanji character string morpheme analysis method of the present invention; the hiragana character string morpheme analysis program 16, which performs the hiragana character string morpheme analysis method of the present invention; the kanji character string morpheme N character registration dictionary 12, serving as the kanji morpheme recording unit; and the hiragana morpheme junction list dictionary 18, serving as the hiragana morpheme recording unit. The kanji character string morpheme N character registration dictionary 12 and the hiragana morpheme junction list dictionary 18 are supplied on a floppy disk (registered trademark) 105 or a CD-ROM 106 and are recorded on the hard disk 104 by reading those media. The two dictionaries are not limited to this embodiment: they may be installed from a computer-readable recording medium such as an IC card, or downloaded over a communication line.
[0022]
The programs that operate the CPU 101 are likewise not limited to this embodiment and may be installed from a computer-readable recording medium such as an IC card, or downloaded over a communication line. Further, instead of installing a program from the CD-ROM 106 or the like so that the computer executes the stored program indirectly, the program stored on the CD-ROM 106 or the like may be executed directly.
[0023]
Programs executable by a computer include not only those that can be executed directly once installed, but also those that must first be converted into another form (for example, decompressing compressed data) and those that are executable in combination with other modules.
[0024]
In the overall flow of morphological analysis, the determination target position moves from the beginning to the end of the character string input through the keyboard 103 or other character input unit. The character type discrimination program 10 distinguishes character strings including hiragana from character strings including kanji, each is analyzed by the corresponding program, and the determination target then moves to the next character string. Repeating this yields the analysis data, which is recorded in the memory 100.
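A rough sketch of the character type discrimination step follows. The patent does not say how program 10 classifies characters, so the rule used here is an assumption: characters in the Unicode hiragana block count as hiragana, and everything else (kanji, and also katakana, since mixed entries sit in the kanji dictionary) is routed to the kanji analysis. The names `char_type` and `split_runs` are hypothetical.

```python
from itertools import groupby

def char_type(ch):
    """Classify one character; only the Unicode hiragana block counts as hiragana."""
    return "hiragana" if "\u3041" <= ch <= "\u309f" else "kanji"

def split_runs(text):
    """Group consecutive characters of the same type into (type, string) runs."""
    return [(kind, "".join(chars)) for kind, chars in groupby(text, key=char_type)]

runs = split_runs("中古車両のところまで")
for kind, run in runs:
    print(kind, run)
```

Each run would then be handed to the analysis program for its character type, moving through the input from beginning to end.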
[0025]
The reading assignment program 20 for morphemes is not limiting; the program can be changed according to how the analysis results are to be used. For example, the system may instead include a program that assigns accents for speech synthesis, converts hiragana to kanji, performs part-of-speech analysis, or simply performs morphological analysis alone.
[0026]
(2) Configuration of Kanji character string morpheme N character registration dictionary 12
FIG. 3A shows part of an example of the kanji character string morpheme N character registration dictionary 12. This dictionary has a kanji dictionary column, serving as the kanji morpheme recording unit, and an N-character backward data column, serving as the delimiter position information.
[0027]
In the column of the kanji dictionary, kanji character strings as morphemes, mixed character strings of kanji and katakana or hiragana, and the like are recorded.
[0028]
The N-character backward data column records, for each morpheme, the value of N (an integer) when a dividing position lies at the Nth character from the last character of that morpheme. That is, when a morpheme is one that, if any other character string is appended after it, becomes at least two morphemes with a dividing position inside its character string, N indicates that the dividing position lies at the Nth character from the end of the morpheme.
[0029]
For example, “中古車” (used car) in the dictionary 12 is a single morpheme, but appending another character string gives two morphemes each, as in “中古｜車庫” (used | garage), “中古｜車輪” (used | wheel) and “中古｜車両” (used | vehicle). In this case the first character from the end of “中古車” can be judged a possible dividing position, so “1” is recorded in the N-character backward data. Similarly, “省エネ” (energy saving) is a single morpheme, but the two morphemes “省｜エネルギー” (saving | energy) can be obtained; the second character from the end of “省エネ” can be judged a possible dividing position, so “2” is recorded in the N-character backward data.
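The two columns of the dictionary can be modeled as a small record type. This is a sketch under assumptions: the class name `Entry`, the field names and the `halves` helper are hypothetical, and only the N values for “中古車” and “省エネ” come from the description. The entry “車両” is given N = 0 to stand for a morpheme without delimiter position information.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    surface: str     # kanji dictionary column
    n_back: int = 0  # N-character backward data column (0 = no dividing position)

    def halves(self):
        """Return (first half, second half) split at the Nth character from the end."""
        if self.n_back == 0:
            return (self.surface, "")
        return (self.surface[:-self.n_back], self.surface[-self.n_back:])

ENTRIES = [Entry("中古車", 1), Entry("省エネ", 2), Entry("車両")]
table = {entry.surface: entry.halves() for entry in ENTRIES}
for surface, (front, back) in table.items():
    print(surface, front, back)
```

The second half (“車” or “エネ”) is the part from which a longer following morpheme such as “車両” or “エネルギー” may start.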
[0030]
In other words, when a morpheme is recorded in the dictionary 12 as a single morpheme even though it becomes two or more morphemes divided inside its own character string once another character string is joined after it, that possibility is anticipated and the delimiter position is recorded in advance in association with the morpheme.
[0031]
Note that, in judging whether such a delimiter position exists within a single morpheme, one may consider not only the case described above, in which a character string is joined after the morpheme, but also the case in which another arbitrary character string is joined after the first-half character string obtained when the morpheme is divided. For example, if “energy saving” in the above example is divided so that its first half is “saving” and another arbitrary character string is joined to that first half, “saving | resources” is obtained. It can therefore be judged that a delimiter position exists after “saving”.
[0032]
Here, the basis for judging the delimiter positions in the above examples is that “car” in “used car” and “saving” in “energy saving” are so-called word-forming elements within their respective morphemes.
[0033]
(3) Morphological analysis method for character strings including kanji
FIG. 4 shows the processing performed when the kanji character string morphological analysis program 14 analyzes a character string including kanji according to the present embodiment. This method uses the conventional longest match method while compensating for its shortcomings. Hereinafter, it is referred to as the “longest match-N-character backward method”.
[0034]
First, the kanji character string morphological analysis program 14 refers to the kanji character string morpheme N character registration dictionary 12 and acquires a first morpheme candidate from the first character by the longest match method (step S1). It then checks the N-character backward data of the acquired character string (step S2). If N = 0, the first morpheme candidate is determined as the morphological analysis result (step S3). If N is not 0, the determination target position is moved back N characters from the end of the first morpheme candidate, and acquisition of a second morpheme candidate by the longest match method is started for the character string that follows from that position (step S7).
[0035]
Next, it is determined whether a morpheme of N + 1 or more characters can be acquired as the second morpheme candidate (step S8). If it cannot, the first morpheme candidate acquired initially is determined as the morphological analysis result (step S3). If it can, the first-half character string obtained by dividing the first morpheme candidate at the position N characters back, together with the second morpheme candidate, is determined as the morphological analysis result (step S9).
[0036]
Then the morpheme determined in step S3 or step S9 is recorded (step S4). If morphemes have not yet been recorded for all of the character string data, acquisition of a first morpheme candidate by the longest match method is started for the remaining character string (step S10), and the same flow is repeated to advance the analysis. When morphemes have been recorded for all of the data, the analysis ends (step S6).
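The flow of steps S1 to S10 can be sketched in Python as follows. This is an illustrative reading of the description, not the patented program itself: the function names, the dictionary representation (morpheme mapped to N), and the handling of characters that match no dictionary entry are all assumptions introduced for the example.

```python
from typing import Dict, List, Optional

def longest_match(text: str, start: int, dictionary: Dict[str, int]) -> Optional[str]:
    """Return the longest dictionary entry beginning at text[start], or None."""
    for end in range(len(text), start, -1):
        if text[start:end] in dictionary:
            return text[start:end]
    return None

def analyze(text: str, dictionary: Dict[str, int]) -> List[str]:
    """Longest match-N-character backward method (sketch)."""
    result: List[str] = []
    pos = 0
    while pos < len(text):                                    # repeat for the rest (S10)
        first = longest_match(text, pos, dictionary)          # step S1
        if first is None:
            first = text[pos]  # assumption: pass an unknown character through as-is
        n = dictionary.get(first, 0)                          # step S2
        if 0 < n < len(first):
            back = pos + len(first) - n                       # move back N characters (S7)
            second = longest_match(text, back, dictionary)
            if second is not None and len(second) >= n + 1:   # step S8
                result.extend([text[pos:back], second])       # step S9
                pos = back + len(second)
                continue
        result.append(first)                                  # step S3
        pos += len(first)
    return result

# Entries taken from the 中古車両 example; only 中古車 carries N = 1.
EXAMPLE_DICT = {"中": 0, "古": 0, "車": 0, "両": 0, "中古": 0, "車両": 0, "中古車": 1}
print(analyze("中古車両", EXAMPLE_DICT))  # ['中古', '車両']
print(analyze("中古車", EXAMPLE_DICT))    # ['中古車']
```

The length test `len(second) >= n + 1` is what ensures the second candidate reaches past the end of the first candidate; when only 車 can be matched after backing up, 中古車 is kept whole.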
[0037]
As a specific example of the above method, FIG. 5 shows the case in which the input character string “used vehicle” (中古車両) is given. Since the dictionary 12 records not only “used” but also “used car” as morphemes, “used car” is selected as the first morpheme candidate by the longest match method (step S50). Since “1” is recorded as the N-character backward data for “used car” (step S51), the determination target position is moved back by one character, and it is determined whether a second morpheme candidate can be acquired by the longest match method for the character string beginning with “car” (step S52). As shown in FIG. 3A, “vehicle” is registered in the dictionary (step S53), so “used” (the first-half character string obtained by dividing “used car” just before “car”) and “vehicle” are determined as morphemes, and “used | vehicle” is output as the analysis result (steps S54, S55, S56, S57).
[0038]
(4) Effects of the morphological analysis method for character strings containing kanji
According to the above method, even a single morpheme acquired by the longest match method is reliably judged to be two morphemes divided at the delimiter position when it is a morpheme that becomes at least two morphemes divided within its own character string once a subsequent character string is joined to it. Morphological analysis can therefore be performed efficiently and accurately. Moreover, since only the delimiter position information needs to be recorded alongside each morpheme, it becomes easy to create a dictionary that yields consistent analysis results and to maintain and manage its recorded contents.
[0039]
Hereinafter, the effects of the present embodiment will be described using the example shown in FIG. 5 while comparing with the conventional morphological analysis method.
[0040]
First, with the conventional longest match method, if the morpheme “used car” is recorded in the dictionary, the result “used car | both”, which is undesirable for the user, is obtained. If “used vehicle” is also recorded, “used vehicle” becomes the analysis result, which is preferable to “used car | both”. However, such an entry is effective only for “used vehicle”; examining every case that can cause the same problem, such as “used wheels” and “used garage”, and recording them all in the dictionary requires a great deal of effort. If “used car” is not recorded, the result “used | vehicle” is obtained, as in the present embodiment. However, it is difficult to foresee at the dictionary recording stage that recording “used car” as a morpheme will prevent the desired result from being obtained, and changing the entry afterwards affects other recorded contents and programs. Maintaining and managing the dictionary so that consistent analysis results are obtained therefore takes much effort.
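The three dictionary situations described above can be reproduced with a plain longest match routine. The sketch below is illustrative only; `longest_match_only` is a hypothetical name, and the fallback for characters not found in the dictionary is an assumption.

```python
from typing import List, Set

def longest_match_only(text: str, words: Set[str]) -> List[str]:
    """Conventional longest match method with no delimiter position data."""
    out: List[str] = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in words:
                break
        else:
            end = pos + 1  # assumption: emit an unknown character by itself
        out.append(text[pos:end])
        pos = end
    return out

BASE = {"中", "古", "車", "両", "中古", "車両"}

print(longest_match_only("中古車両", BASE | {"中古車"}))              # 中古車 recorded
print(longest_match_only("中古車両", BASE | {"中古車", "中古車両"}))  # 中古車両 also recorded
print(longest_match_only("中古車両", BASE))                           # 中古車 not recorded
```

With 中古車 recorded, the routine returns 中古車 followed by 両; adding 中古車両 returns the whole string as one morpheme; omitting 中古車 returns 中古 followed by 車両.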
[0041]
Note that with the minimum number of divisions method (a technique that prefers the result minimizing the total number of morphemes making up the input character string), “used car | both” and “used | vehicle” each consist of two morphemes, so the method cannot decide which of them should be the analysis result.
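The tie can be checked by enumerating every segmentation and keeping those with the fewest morphemes. The following Python sketch does this exhaustively for illustration; the function names are hypothetical, and a practical implementation of the minimum number of divisions method would normally use dynamic programming rather than full enumeration.

```python
from typing import List, Set

def all_segmentations(text: str, words: Set[str]) -> List[List[str]]:
    """Enumerate every way to divide text into dictionary entries."""
    if not text:
        return [[]]
    out: List[List[str]] = []
    for i in range(1, len(text) + 1):
        if text[:i] in words:
            out.extend([text[:i]] + rest for rest in all_segmentations(text[i:], words))
    return out

def fewest_morphemes(text: str, words: Set[str]) -> List[List[str]]:
    """Return all segmentations that use the minimum number of morphemes."""
    segs = all_segmentations(text, words)
    best = min((len(s) for s in segs), default=0)
    return [s for s in segs if len(s) == best]

WORDS = {"中", "古", "車", "両", "中古", "車両", "中古車"}
print(fewest_morphemes("中古車両", WORDS))
```

For this dictionary the function returns two candidates of equal length, 中古 | 車両 and 中古車 | 両, which is the ambiguity the paragraph describes.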
[0042]
In this regard, if the longest match-N-character backward method of the present embodiment is adopted, the morphemes the user wants, such as “used | vehicle” and “used | wheels”, are reliably identified.
[0043]
As other analysis methods, there already exist analysis programs that learn the desired morpheme combinations for various input sentences, or that give priority to the statistically most probable analysis result. However, recording every example that is expected to cause problems is labor-intensive, and the cost of creating learning or statistical data is a major problem. Moreover, for an arbitrary input sentence, there is no guarantee that the learned result or the statistically most likely result is the desired morpheme identification.
[0044]
In this regard, the kanji character string morphological analysis program 14 of the present embodiment needs no such learning or statistical program, so a much simpler program than the conventional ones can be adopted and the program size can be reduced.
[0045]
Although the kanji character string morpheme N character registration dictionary 12 in this embodiment records the delimiter position information as data indicating the Nth character from the last character of the morpheme, the invention is not limited to this. For example, the position may be recorded as the Nth character from the first character of the morpheme. However, when the position is expressed as the Nth character from the end, as in the present embodiment, the process only has to move back N characters from the determination target position reached when the first morpheme candidate was acquired. Compared with returning to the beginning of the first morpheme candidate and then advancing N characters, this reduces the amount of computation, so morphological analysis is performed more quickly.
[0046]
As another embodiment of the kanji character string morpheme N character registration dictionary 12, FIG. 3B shows part of an example of a kanji character string morpheme delimiter position pointer registration dictionary. In this dictionary, instead of the N-character backward data shown in FIG. 3A, a delimiter position pointer is recorded within the character string of the morpheme as the delimiter position information. In this case, the second morpheme candidate is acquired by moving the determination target position reached when the first morpheme candidate was acquired to the position of the pointer.
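The relationship between the two dictionary forms can be illustrated with a short Python sketch. The marker character `|` and the function names are assumptions made for the example; the description does not specify how the pointer of FIG. 3B is encoded.

```python
from typing import Optional, Tuple

MARK = "|"  # hypothetical marker standing in for the delimiter position pointer

def parse_pointer_entry(entry: str) -> Tuple[str, Optional[int]]:
    """Split a pointer-style entry into (surface string, offset from the front)."""
    offset = entry.find(MARK)
    surface = entry.replace(MARK, "")
    return surface, (offset if offset >= 0 else None)

def to_n_backward(entry: str) -> Tuple[str, int]:
    """Convert a pointer-style entry to (surface string, N counted from the end)."""
    surface, offset = parse_pointer_entry(entry)
    return surface, (0 if offset is None else len(surface) - offset)

print(parse_pointer_entry("中古|車"))  # ('中古車', 2)
print(to_n_backward("中古|車"))        # ('中古車', 1)
print(to_n_backward("車両"))           # ('車両', 0)
```

Both forms carry the same information; the pointer form gives the restart position directly, while the N form gives it relative to the end of the first morpheme candidate.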
[0047]
(5) Configuration of the hiragana morpheme junction list dictionary 18
FIG. 6 shows part of an example of the hiragana morpheme junction list dictionary 18, which serves as the hiragana morpheme recording unit. Its dictionary column records hiragana morphemes and hiragana morpheme junctions.
[0048]
Here, a hiragana morpheme junction is a character string in which a plurality of morphemes written in hiragana or the like are joined in a grammatically correct way, with the delimiter positions between them recorded as well. In the present embodiment, hiragana morpheme junctions are recorded for particles, auxiliary verbs, basic verbs, formal nouns, and the like, which generally account for most of the character strings written in hiragana.
[0049]
For example, “no” and “place” can each be judged to be a separate morpheme, but it is grammatically correct for “place” to be joined to the case particle “no”, which expresses a relationship of existence. Therefore, the morpheme junction “no | place”, a combination of the two morphemes, is recorded in the dictionary.
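One possible in-memory form of such a junction list is a table keyed by the undivided surface string, with the recorded delimiter positions preserved as a list of morphemes. In the Python sketch below, the `|` notation, the function name `register`, and the Japanese strings (assumed originals of the English glosses “no | place”, “barely”, and “up to”) are all assumptions for illustration.

```python
from typing import Dict, Iterable, List

def register(entries: Iterable[str]) -> Dict[str, List[str]]:
    """Build a lookup table from entries written with '|' at each delimiter position.

    A plain morpheme has no '|' and maps to a one-element list.
    """
    table: Dict[str, List[str]] = {}
    for entry in entries:
        parts = entry.split("|")
        table["".join(parts)] = parts
    return table

# Assumed Japanese forms of the glossed entries; not taken from FIG. 6.
JUNCTION_TABLE = register(["の|ところ", "ぎりぎり", "まで"])
print(JUNCTION_TABLE["のところ"])  # ['の', 'ところ']
```

Because the delimiter positions are stored with the junction, a single lookup of のところ yields both morphemes without any further decomposition.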
[0050]
(6) Morphological analysis method for character string including hiragana
FIG. 7 shows processing when the hiragana character string morphological analysis program 16 performs morphological analysis of a character string including hiragana according to the present embodiment.
[0051]
First, the hiragana character string morphological analysis program 16 refers to the hiragana morpheme junction list dictionary 18 and collates the data against it (step S20). This collation repeats a so-called brute-force search until every character in the run of character string data matches some morpheme or morpheme junction recorded in the dictionary 18, and morpheme candidates are acquired in this way (step S21). Next, it is determined whether the grammatical joining rules between the acquired morpheme candidates are satisfied (step S22). The joining rules are judged in the same way as in morphological analysis programs in general. If the joining rules are satisfied, the candidates are determined and recorded as fixed morphemes (step S23), and the same collation is performed on the remaining hiragana character strings (steps S24 and S25). When the analysis results for all data have been recorded, the analysis ends (step S26).
[0052]
As a specific example of the above hiragana morphological analysis method, FIG. 8 shows the case in which the input character string “up to the limit” is given. First, the hiragana character string morphological analysis program 16 refers to the hiragana morpheme junction list dictionary 18 and collates the data against it (step S70). When the morpheme junction “no | place” and the morphemes “barely” and “up to” recorded in the dictionary 18 are found, morphemes have been obtained for every character in the run of characters. These are therefore acquired as morpheme candidates (step S71), the joining rules are judged to be satisfied (step S72), the candidates are determined and recorded as fixed morphemes (steps S73, S74), the morphological analysis ends, and “barely | no | place | up to” is output as the analysis result (step S75).
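The collation of steps S20 to S23 can be sketched in Python as a backtracking search that tries to cover the whole hiragana run with dictionary entries. This is a simplified illustration: the function names are hypothetical, trying longer entries first is a choice made for the example, the joining-rule check is a placeholder supplied by the caller, and the Japanese strings are assumed originals of the English glosses rather than text taken from FIG. 8.

```python
from typing import Callable, Dict, List, Optional

def cover(text: str, table: Dict[str, List[str]]) -> Optional[List[str]]:
    """Search for dictionary entries that together match every character of text."""
    if not text:
        return []
    for end in range(len(text), 0, -1):   # assumption: try longer entries first
        head = text[:end]
        if head in table:
            rest = cover(text[end:], table)
            if rest is not None:
                return table[head] + rest  # junctions expand into their morphemes
    return None

def analyze_hiragana(
    text: str,
    table: Dict[str, List[str]],
    rule_ok: Callable[[List[str]], bool] = lambda morphemes: True,
) -> Optional[List[str]]:
    """Collate against the junction list, then apply the joining-rule check (stub)."""
    candidates = cover(text, table)                      # steps S20 and S21
    if candidates is None or not rule_ok(candidates):    # step S22
        return None
    return candidates                                    # step S23

# Assumed Japanese forms of "no | place", "barely", and "up to".
TABLE = {"のところ": ["の", "ところ"], "ぎりぎり": ["ぎりぎり"], "まで": ["まで"]}
print(analyze_hiragana("ぎりぎりのところまで", TABLE))
```

With this table the call returns ぎりぎり, の, ところ, まで. The search touches only recorded entries, so combinations that are not in the junction list are never generated.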
[0053]
(7) Effects of the morphological analysis method for character strings including hiragana
According to the above method, since the morpheme analysis is performed by collating with the dictionary 18 recorded in advance, the calculation amount of the program until the morpheme is determined is greatly reduced. Therefore, morphological analysis is performed very quickly.
[0054]
Hereinafter, such an effect of the present embodiment will be described using the example shown in FIG. 8 while comparing with the conventional morphological analysis method.
[0055]
A conventional morphological analysis program obtains every possible morpheme candidate by sequentially decomposing the input character string, and performs the analysis while considering whether those morphemes can be joined. In the example of FIG. 8, for instance, a conventional program determines and acquires morpheme candidates while sequentially decomposing the input character string, and these candidates include possibilities that lead to erroneous analysis results, such as “noto” (a particle indicating a relationship of equality joined to the case particle “no”). The processing time until the morphemes are acquired therefore becomes long.
[0056]
In this regard, if collation against the junction list dictionary of the present embodiment is employed, the morphemes the user wants, “barely | no | place | up to”, are identified quickly and reliably.
[0057]
Note that the dictionary 18 records hiragana morpheme junctions for particles, auxiliary verbs, basic verbs, formal nouns, and the like, which account for most of the character strings written in hiragana in ordinary Japanese sentences. As a result, the dictionary is generally larger than in the conventional approach, in which only the individual morphemes of character strings including hiragana are recorded. However, according to the present embodiment, the amount of computation performed by the program is greatly reduced and the program size can be reduced, so analysis is performed quickly and a morphological analysis system that is preferable for the user is realized.
[0058]
The morphological analysis method for character strings including kanji and the morphological analysis method for character strings including hiragana have each been described above. Because the Japanese morphological analysis system of this embodiment includes programs that perform both, it can analyze accurately and quickly even when the character string to be analyzed mixes kanji, hiragana, and other characters. In particular, hiragana and kanji each make up roughly 50% of Japanese character strings, so distinguishing between them and analyzing each with its corresponding program makes the analysis faster.
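A simplified sketch of the character type distinction is shown below. It splits an input string into runs by Unicode block so that each run could be handed to the corresponding analysis program. The function names are hypothetical, and classifying purely by code point range is an assumption; the description notes that the kanji dictionary may also hold entries mixing kanji with katakana or hiragana, which this sketch does not handle.

```python
import itertools
from typing import List, Tuple

def char_type(ch: str) -> str:
    """Classify one character by Unicode block (simplified)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs block
        return "kanji"
    return "other"

def split_by_type(text: str) -> List[Tuple[str, str]]:
    """Group consecutive characters of the same type into (type, run) pairs."""
    return [(kind, "".join(run)) for kind, run in itertools.groupby(text, key=char_type)]

print(split_by_type("中古車両のところ"))
```

Each kanji run would then go to the kanji analysis and each hiragana run to the junction list collation, so neither program has to consider entries of the other type.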
[0059]
Note that the criteria for judging the quality of morphological analysis results may differ depending on the purpose for which the results are used. The recorded contents of the dictionaries in FIG. 3A and FIG. 6 are therefore not limited to those shown, and the dictionaries are recorded so that the desired analysis result for the intended purpose is obtained. To this end, not only the creator of the dictionary but also the user may be allowed to change its contents according to the purpose. Specifically, the N-character data can be changed, a new hiragana morpheme junction can be recorded, or an entry can be corrected if it is in error.
[0060]
(8) Other embodiments of the morphological analysis system according to the present invention
In the above embodiment, a system in which a program on a personal computer analyzes an input character string is shown as the Japanese morphological analysis system. However, the present invention is not limited to this and can also be used for other general applications of Japanese morphological analysis, for example kana-kanji conversion programs in word processors and dictionary registration.
[0061]
Further, as shown in FIG. 9, the invention can be employed in a morphological analysis service that uses a network. In this case, the user of the terminal device 50, which serves as a communication terminal, inputs the character string to be analyzed into the device, and the character string transmitting means sends the data via the Internet 52 to the morphological analysis service server 56, which serves as a computer server. The morphological analysis service server 56, which includes the receiving unit, at least one of the kanji morpheme recording unit and the hiragana morpheme recording unit, the morphological analysis unit, and the morphological analysis result transmitting means, performs morphological analysis of the data and transmits the analysis result to the terminal device 50.
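The exchange between the terminal device and the server can be sketched as a pair of functions that build and handle a message. The JSON message shape, the field names `text` and `morphemes`, and the function names are assumptions for illustration only; the description does not specify a wire format, and the network transport itself is omitted.

```python
import json
from typing import Callable, List

def build_request(text: str) -> str:
    """Terminal side: package the character string to be analyzed."""
    return json.dumps({"text": text}, ensure_ascii=False)

def handle_request(body: str, analyzer: Callable[[str], List[str]]) -> str:
    """Server side: receive the string, analyze it, and return the result."""
    text = json.loads(body)["text"]
    return json.dumps({"morphemes": analyzer(text)}, ensure_ascii=False)

def read_response(body: str) -> List[str]:
    """Terminal side: extract the morpheme list from the server's reply."""
    return json.loads(body)["morphemes"]

# Stand-in analyzer for demonstration; a real server would consult its dictionaries.
def demo_analyzer(text: str) -> List[str]:
    return [text[:2], text[2:]]

reply = handle_request(build_request("中古車両"), demo_analyzer)
print(read_response(reply))
```

Only the server needs the dictionaries and the analysis logic; the terminal side is limited to sending a string and reading back a list.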
[0062]
Thereby, it is not necessary to provide a kanji morpheme recording unit, a hiragana morpheme recording unit, and a morpheme analysis unit on the terminal device side. Therefore, a larger number of terminal device users can easily obtain the morphological analysis results.
[Brief description of the drawings]
FIG. 1 is a diagram showing a program configuration of a Japanese morphological analysis system according to an embodiment of the present invention.
FIG. 2 is a block diagram of a personal computer equipped with a Japanese morphological analysis system.
FIG. 3A is a diagram showing part of an example of a kanji character string morpheme N character registration dictionary according to an embodiment of the present invention; FIG. 3B is a diagram showing part of an example of a kanji character string morpheme delimiter position pointer registration dictionary according to another embodiment of the present invention.
FIG. 4 is a diagram illustrating a morphological analysis method for a character string including kanji according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating a specific example of a morphological analysis method for a character string including kanji according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating a part of an example of a hiragana morpheme junction list dictionary according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating a morphological analysis method for a character string including hiragana according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating a specific example of a morphological analysis method for a character string including hiragana according to an embodiment of the present invention.
FIG. 9 is a diagram illustrating a network Japanese morphological analysis system according to another embodiment of the present invention.
[Explanation of symbols]
10 ... Character type distinction program
12 ... Kanji character string morpheme N character registration dictionary
14 ... Kanji character string morphological analysis program
16 ... Hiragana character string morphological analysis program
18 ... Hiragana morpheme junction list dictionary
20 ... Morpheme reading-assignment program

Claims

A kanji character string morphological analysis method in which a computer system performs morphological analysis of a character string including kanji by referring to a morpheme recording unit in which morphemes are recorded, wherein:
in the morpheme recording unit,
when a morpheme is one that, once another arbitrary character string is joined after it, becomes at least two morphemes divided within the character string of the morpheme, delimiter position information indicating the delimiter position is also recorded in association with the morpheme;
when the computer system performs morphological analysis of an input character string,
a first morpheme candidate is acquired by the longest match method from the first character of the character string by referring to and reading morphemes from the morpheme recording unit;
if the delimiter position information is recorded for the first morpheme candidate, it is determined whether a second morpheme candidate that includes characters following the character string of the first morpheme candidate can be acquired by the longest match method, taking the character following the delimiter position as the first character;
if the second morpheme candidate can be acquired, the first-half character string obtained by dividing the first morpheme candidate at the delimiter position and the second morpheme candidate are taken as the morphological analysis result; and
if the second morpheme candidate cannot be acquired, the first morpheme candidate is taken as the morphological analysis result.

  The kanji character string morphological analysis method according to claim 1, wherein:
  the delimiter position information is
  information indicating that the delimiter position is located at the Nth character from the last character of the morpheme, and
  the information indicating the position of the Nth character is
  information used to determine whether a morpheme including the subsequent characters can be acquired by the longest match method, taking as the first character the character following the Nth-character position recorded for the morpheme.

A kanji character string morphological analysis system in which a computer system performs morphological analysis of a character string including kanji by referring to a morpheme recording unit in which morphemes are recorded, wherein:
in the morpheme recording unit,
when a morpheme is one that, once another arbitrary character string is joined after it, becomes at least two morphemes divided within the character string of the morpheme, delimiter position information indicating the delimiter position is also recorded in association with the morpheme;
when the computer system performs morphological analysis of an input character string,
a first morpheme candidate is acquired by the longest match method from the first character of the character string by referring to and reading morphemes from the morpheme recording unit;
if the delimiter position information is recorded for the first morpheme candidate, it is determined whether a second morpheme candidate that includes characters following the character string of the first morpheme candidate can be acquired by the longest match method, taking the character following the delimiter position as the first character;
if the second morpheme candidate can be acquired, the first-half character string obtained by dividing the first morpheme candidate at the delimiter position and the second morpheme candidate are taken as the morphological analysis result; and
if the second morpheme candidate cannot be acquired, the first morpheme candidate is taken as the morphological analysis result.

A computer-readable program for causing a computer system to function as a kanji character string morphological analysis system that performs morphological analysis of a character string including kanji, or a recording medium on which the program is recorded, wherein:
in the morpheme recording unit,
when a morpheme is one that, once another arbitrary character string is joined after it, becomes at least two morphemes divided within the character string of the morpheme, delimiter position information indicating the delimiter position is also recorded in association with the morpheme;
when the computer system performs morphological analysis of an input character string,
a first morpheme candidate is acquired by the longest match method from the first character of the character string by referring to and reading morphemes from the morpheme recording unit;
if the delimiter position information is recorded for the first morpheme candidate, it is determined whether a second morpheme candidate that includes characters following the character string of the first morpheme candidate can be acquired by the longest match method, taking the character following the delimiter position as the first character;
if the second morpheme candidate can be acquired, the first-half character string obtained by dividing the first morpheme candidate at the delimiter position and the second morpheme candidate are taken as the morphological analysis result; and
if the second morpheme candidate cannot be acquired, the program causes the computer system to execute a process of taking the first morpheme candidate as the morphological analysis result.

  The program according to claim 4, or the recording medium on which the program is recorded, wherein:
  the delimiter position information is
  information indicating that the delimiter position is located at the Nth character from the last character of the morpheme, and
  the information indicating the position of the Nth character is
  information used to determine whether a morpheme including the subsequent characters can be acquired by the longest match method, taking as the first character the character following the Nth-character position recorded for the morpheme.

A morphological analysis system comprising:
a character input unit for inputting a character string;
at least one of a kanji morpheme recording unit that records morphemes of character strings including kanji and a hiragana morpheme recording unit that records morphemes of character strings including hiragana; and
a morphological analysis unit that performs morphological analysis,
wherein the kanji morpheme recording unit,
when a morpheme of a character string including kanji is a character string that becomes at least two morphemes once another arbitrary character string is joined after at least one of the first-half character string or the second-half character string obtained by dividing that character string,
also records, in association with the morpheme, delimiter position information indicating the delimiter position of the morpheme,
the delimiter position information being information used by a computer
to determine whether a morpheme including the subsequent characters can be acquired by the longest match method, taking as the first character the character following the delimiter position recorded for the morpheme,
and wherein the hiragana morpheme recording unit
records at least hiragana morpheme junctions, each being a grammatically correct combination among combinations of two or more hiragana morphemes that include a particle, an auxiliary verb, a formal noun, or a basic verb,
and in each hiragana morpheme junction,
the delimiter position of each morpheme
is recorded together.

A network morphological analysis system comprising a communication terminal and a computer server, wherein:
the communication terminal comprises
character string transmitting means for transmitting a Japanese character string to the computer server, and
morphological analysis result receiving means for receiving a morphological analysis result transmitted from the computer server;
the computer server, connected so as to be able to communicate with the communication terminal, comprises
a receiving unit that receives the transmitted Japanese character string,
at least one of a kanji morpheme recording unit that records morphemes of character strings including kanji and a hiragana morpheme recording unit that records morphemes of character strings including hiragana,
a morphological analysis unit that performs morphological analysis, and
morphological analysis result transmitting means for transmitting the result of analysis by the morphological analysis unit to the communication terminal;
the kanji morpheme recording unit,
when a morpheme of a character string including kanji is one that, once another arbitrary character string is joined after it, becomes at least two morphemes divided within the character string of the morpheme,
also records, in association with the morpheme, delimiter position information indicating the delimiter position;
the hiragana morpheme recording unit records at least hiragana morpheme junctions, each being a combination of two or more hiragana morphemes that can be joined grammatically correctly,
and in each hiragana morpheme junction,
the delimiter position of each morpheme is recorded together;
the morphological analysis unit that performs morphological analysis of a character string including kanji comprises
first morpheme candidate acquisition means for acquiring a first morpheme candidate by the longest match method from the first character of the character string by referring to and reading morphemes from the kanji morpheme recording unit,
second morpheme candidate acquisition determination means for determining, if the delimiter position information is recorded for the first morpheme candidate, whether a second morpheme candidate that includes characters following the character string of the first morpheme candidate can be acquired by the longest match method, taking the character following the delimiter position as the first character, and
kanji character string morpheme determining means that, if the second morpheme candidate can be acquired, takes as the morphological analysis result the first-half character string obtained by dividing the first morpheme candidate at the delimiter position and the second morpheme candidate, and, if it cannot be acquired, takes the first morpheme candidate as the morphological analysis result;
and the morphological analysis unit that performs morphological analysis of a character string including hiragana comprises
hiragana character string morpheme determining means that reads the hiragana morpheme junctions and hiragana morphemes from the hiragana morpheme recording unit and, if there is a character string matching a hiragana morpheme junction, takes the combination of hiragana morphemes in that junction as the morphological analysis result.

The morphological analysis system according to claim 3, 6 or 7, wherein:
the delimiter position information is
information indicating that the delimiter position is located at the Nth character from the last character of the morpheme, and
the information indicating the position of the Nth character is
information used to determine whether a morpheme including the subsequent characters can be acquired by the longest match method, taking as the first character the character following the Nth-character position recorded for the morpheme.