JP3622841B2

JP3622841B2 - Kana-kanji conversion device and kana-kanji conversion method

Info

Publication number: JP3622841B2
Application number: JP2000304102A
Authority: JP
Inventors: 敏久田代
Original assignee: Microsoft Corp
Current assignee: Microsoft Corp
Priority date: 2000-10-03
Filing date: 2000-10-03
Publication date: 2005-02-23
Anticipated expiration: 2020-10-03
Also published as: JP2002117025A

Description

【０００１】
【発明の属する技術分野】
本発明は、かな漢字変換装置およびかな漢字変換方法に関し、より詳細には、コンピュータ・システムに日本語を入力するために使用されているかな漢字変換装置およびかな漢字変換方法に関する。
【０００２】
【従来の技術】
日本語の文字列を入力する装置として、キーボードから入力したい漢字列に対応するかな文字列を入力し、漢字変換キーの入力に応答して、かな文字列をかな漢字文字列に変換するパーソナル・コンピュータやワード・プロセッサなどのかな漢字変換装置が従来から知られている。この装置に入力したかな文字列をかな漢字文字列に変換する場合は、漢字変換用の特定の１つまたは複数のキーを組み合わせて押下し、かな漢字文字列の候補を表示する。また、連続して候補を表示することも可能であり、この場合直前の候補を呼び出すときは、前候補キーを押下するなどして、文章を入力することができる。かな文字列をカタカナ文字列に変換する場合や、ローマ字文字列に変換する場合も、上記と同様の手順で行われる。
【０００３】
入力された文字列についてかな漢字変換を行うかな漢字変換装置では、文字列に対応する漢字を決定するのに、形態素や格フレームの解析情報を参照することによって、変換精度を高めている。ここで、形態素とは、一つ以上の音素からなる意味をもった最小の言語単位をいい、形態素解析では、文字列に含まれている形態素の切れ目を認識し、および形態素の品詞を認定する。
【０００４】
また、格フレーム解析では、文字列に含まれている単語間の意味的な結合関係を「格文法」の考え方によって表現する。
【０００５】
従来のかな漢字変換装置では、上述のような品詞に基く形態素解析と、格フレーム解析とを用いている場合が多い。
【０００６】
【発明が解決しようとする課題】
しかし、上述した従来のかな漢字変換方式では、正確に変換することが難しい言語現象も存在する。また、従来のかな漢字変換方式の場合、特殊な語彙や表現についてもできる限り楽に変換できるように、文字列に対応する表現を広く認めると、そのような特殊な用語を使用しないユーザにとっては、不可解な単語の連続や、共起関係が薄い同音類義語の誤変換／学習等の副作用が生じるという問題があった。
【０００７】
一方、音声認識システムにおいて使用されている言語モデルとして、トライグラムが知られている。このトライグラムでは、品詞に基く形態素解析や、格フレームでは正確に変換できないような言語現象にも対応出来るというメリットがある。
【０００８】
しかし、トライグラムの計算量は極めて大きいので、かな漢字変換のような高速性が要求されるシステムにトライグラムをそのまま応用することは非常に困難であるという問題があった。
【０００９】
本発明はこのような問題に鑑みてなされたものであり、その目的とするところは、従来のかな漢字変換において生じていた誤変換等の副作用を抑制することができるかな漢字変換装置およびかな漢字変換方法を提供することにある。
【００１０】
【課題を解決するための手段】
本発明は、このような目的を達成するために、請求項１に記載の発明は、かな漢字変換の候補となる文字および該文字の優先度を記述する辞書、２つの品詞の接続の優先度を記述する品詞の接続表、文字列の中に含まれる語句の他の語句に対する意味的関係を記述する格フレーム辞書、および実際のテキストを含むテキストコーパスの中に単語のＮ（Ｎ≧３）個の連鎖が出現する確率を記述するＮグラムデータを記憶する記憶装置と、該記憶装置に記憶された前記辞書、前記品詞の接続表、前記格フレーム辞書、および前記Ｎグラムデータに基づいて、入力装置から入力されたかな文字列をかな漢字文字列に変換するデータ処理装置とを備えたかな漢字変換装置であって、前記データ処理装置は、入力された前記かな文字列にマッチする前記文字を前記辞書から抽出する辞書引き手段と、前記文字の優先度と、前記品詞の接続の優先度とに基づき、前記文字の優先度と、前記品詞の接続の優先度とに基づき、前記辞書引き手段により抽出された前記文字を組み合わせて、可能性が所定の基準以上であるかな漢字文字列の第１の候補を作成する形態素解析手段と、前記格フレーム辞書に基づき、前記第１の候補に含まれる前記かな漢字文字列の候補の優先順位を並べ替えて、第２の候補を作成する格フレーム解析手段と、前記単語のＮ個の連鎖が出現する確率に基づいて、前記第２の候補に含まれる前記かな漢字文字列の候補の優先順位を並べ替えるＮグラム解析手段とを備えたことを特徴とする。
【００１１】
また、請求項２に記載の発明は、請求項１に記載のかな漢字変換装置において、前記記憶装置は、ローマ字に対応するかな文字を記述するローマ字かな変換表を記憶し、前記データ処理装置は、前記入力装置から入力されたローマ字をローマ字かな変換表に基づいて前記かな文字列に変換するローマ字かな変換手段を備え、前記辞書引き手段は、前記ローマ字かな変換手段によって変換された前記かな文字列にマッチする前記文字を前記辞書から抽出することを特徴とする。
【００１２】
また、請求項３に記載の発明は、記憶装置に記憶された、かな漢字変換の候補となる文字および該文字の優先度を記述する辞書、２つの品詞の接続の優先度を記述する品詞の接続表、文字列の中に含まれる語句の他の語句に対する意味的関係を記述する格フレーム辞書、および実際のテキストを含むテキストコーパスの中に単語のＮ（Ｎ≧３）個の連鎖が出現する確率を記述するＮグラムデータに基づいて、入力装置から入力されたかな文字列をかな漢字文字列に変換するかな漢字変換方法であって、入力された前記かな文字列にマッチする前記文字を前記辞書から抽出する辞書引きステップと、前記文字の優先度と、前記品詞の接続の優先度とに基づき、前記辞書引きステップにおいて抽出された前記文字を組み合わせて、可能性が所定の基準以上であるかな漢字文字列の第１の候補を作成する形態素解析ステップと、前記格フレーム辞書に基づき、前記第１の候補に含まれる前記かな漢字文字列の候補の優先順位を並べ替えて、第２の候補を作成する格フレーム解析ステップと、前記単語のＮ個の連鎖が出現する確率に基づいて、前記第２の候補に含まれる前記かな漢字文字列の候補の優先順位を並べ替えるＮグラム解析ステップとを備えることを特徴とする。
【００１３】
さらに、請求項４に記載の発明は、請求項３に記載のかな漢字変換方法であって、前記記憶装置は、ローマ字に対応するかな文字を記述するローマ字かな変換表を記憶し、前記入力装置から入力されたローマ字をローマ字かな変換表に基づいて前記かな文字列に変換するローマ字かな変換ステップを備え、前記辞書引きステップは、前記ローマ字かな変換ステップにおいて変換された前記かな文字列にマッチする前記文字を前記辞書から抽出することを特徴とする。
【００１４】
この方法によれば、極度に長い時間および大きなディスク容量を必要とせずに、かな漢字変換の精度を上げることができる。
【００１５】
【発明の実施の形態】
以下に、図面を参照し、本発明の実施の形態について詳細に説明する。
【００１６】
図１は、本実施形態に係るかな漢字変換装置の機能ブロック図である。
図１の例に示すように、本実施形態のかな漢字変換装置は、入力装置１０１と、表示装置１０３と、データ処理装置１０５と、記憶装置１１７とを備えている。入力装置１０１は、かな漢字変換をするためのかな文字列を入力したり、変換、確定等の各種指示を行うためのキーボード等によって構成される。
【００１７】
表示装置１０３は、具体的にはＣＲＴやＬＣＤ等によって構成され、入力装置１０１によって入力される文字列等が表示される。
【００１８】
データ処理装置１０５は、中央演算処理装置（ＣＰＵ）においてコンピュータ・プログラムを構成する命令の読みだし、および実行を行う基本処理装置（ＢＰＵ）や制御装置によって構成されており、ローマ字かな変換手段１０７と、辞書引き手段１０９と、形態素解析手段１１１と、格フレーム解析手段１１３と、トライグラム解析手段１１５とによって構成されている。
【００１９】
ローマ字かな変換手段１０７は、外部から入力されたローマ字をかな文字、すなわちひらがなまたはカタカナに変換する処理を行なう。辞書引き手段１０９は、かな漢字変換を行うことを目的としてかな漢字変換装置に記憶されている辞書から、ローマ字かな変換手段１０７により変換されたかな文字の読みに対応する漢字を見つけ出す。
【００２０】
なお、入力装置１０１から、かな文字が直接入力された場合には、ローマ字かな変換手段１０７による処理を経ることなく辞書引き手段１０９による処理が行われることとなる。入力装置１０１からローマ字が入力されるか、あるいはかな文字が入力されるかは、かな漢字変換装置における、文字の入力モードによって決定される。
【００２１】
形態素解析手段１１１は、入力された文字列をかな漢字混じりの文字列、すなわちかな漢字文字列に変換した場合に含まれる単語について、品詞の接続情報および語の優先順位を用いて各単語間のつながりやすさを判定する。格フレーム解析手段１１３は、動詞とその主語、目的語との関係に基づいて、かな漢字混じりの文字列の候補についてより意味的に正しいと思われる順に文字列の候補の優先順位を変更する。
【００２２】
トライグラム解析手段１１５は、テキストコーパスから抽出した３つの単語の組み（トライグラム）を用いて、格フレーム解析手段１１３により順位付けされた文字列の候補を改めて並べ替える。
【００２３】
記憶装置１１７は、データを格納するための主記憶装置等によって構成されており、本発明に関係し、ＣＰＵによってアクセスされるデータが記憶されている。本実施形態において、記憶装置１１７には、かな漢字変換において参照されるデータとして、ローマ字かな変換表１１９、辞書１２１、品詞の接続表１２３、格フレーム辞書１２５、およびトライグラムデータ１２７が記憶されている。
【００２４】
また、記憶装置１１７の図示しない領域には、入力された文字列や検索された漢字の候補、データ処理装置１０５を含むＣＰＵによって実行されるコンピュータ・プログラムの実行命令が格納されている。そして、ＣＰＵはこの内容を直接アクセスして命令やデータをレジスタに入れ、プログラムの実行やデータに対する操作、あるいはデータに基づく操作を行うことができる。
【００２５】
図２〜図６は、「ほんをに、さんさつよんだ」という文字列について、本実施形態に係るかな漢字変換装置によるかな漢字変換の流れを説明するための図である。以下、本実施形態に係るかな漢字変換装置の動作について説明する。
【００２６】
まず、図２において、入力装置１０１から「ｈｏｎｗｏｎｉ，ｓａｎｎｓａｔｕｙｏｎｄａ」というローマ字が入力されると、ローマ字かな変換手段１０７は、ローマ字の読みに対応するひらがなを選択する。ローマ字かな変換手段１０７は、ローマ字かな変換表１１９を参照し、
ｈｏ → ほ
ｎ → ん
ｗｏ → を
などの対応関係から、「ほんをに、さんさつよんだ」というひらがな列を作成する。
【００２７】
次に、辞書引き手段１０９が、このようにして変換されたひらがな列の読みにマッチする文字列を辞書１２１から抽出する。具体的には、「ほ」という読みに対応する語として「穂」、「歩」、「帆」、「ほ」を、「ほん」という読みに対応する語として「本」、「翻」という文字が抽出されており、この処理はひらがな列の終端まで続けられる。
【００２８】
続いて、図４に示すように、形態素解析手段１１１が、辞書引き手段１０９によって見つけ出された語の集合について、品詞の接続表１２３に基づき、漢字を含む文字列の候補を作成し、各候補に優先順位を付ける。本実施形態において、文字列の候補は、辞書１２１に含まれている語の各々に付与されている優先度、および品詞の接続表１２３において、品詞の接続の種類毎に付与されている優先度を合計し、その合計点数の低い文字列が優先されるように順位付けがなされている。
【００２９】
たとえば、「本を似、三冊呼んだ」という文字列の場合は、辞書１２１に基づいて、
（本）＋（を）＋（似）＋（、）＋（三）＋（冊）＋（呼）＋（んだ）
という点数計算が行われ、単語の優先度として
１０＋５＋１０＋５＋１０＋２＋２０＋５＝８５点
という点数が算出される。また、この文字列は、
名詞−助詞−一段活用動詞−読点−数詞−助数詞−バ行五段活用−語尾
という品詞の接続からなるため、品詞の接続表１２３に基づいて、
（名詞−助詞）＋（助詞−一段活用動詞）＋（一段活用動詞−読点）＋（読点−数詞）＋（数詞−助数詞）＋（助数詞−バ行五段活用）＋（バ行五段活用−語尾）
という点数計算が行われ、品詞の接続による優先度として、
３０＋２０＋３０＋４０＋１０＋４０＋３０＝２００点
という点数が得られる。そして、単語の優先度と品詞の接続による優先度を合計し、優先度は２８５点と計算される。
【００３０】
同様の計算を行うことにより、「本を似、三冊読んだ」というかな漢字文字列について２９０点、「本を二、三冊呼んだ」について２９５点、「本を二、三冊読んだ」というかな漢字文字列について３００点という点数が算出される。したがって、これら４つの候補の優先順位は、
１．本を似、三冊呼んだ
２．本を似、三冊読んだ
３．本を二、三冊呼んだ
４．本を二、三冊読んだ
となる。そして、優先度を示す点数がある基準値以上の場合は、つながりにくい候補、すなわち可能性の低い候補として除外される。たとえば「翻を似、三冊呼んだ」など、他にも種々の語の組み合わせが可能であるが、このようなかな漢字文字列については、計算の結果優先度の点数が高くなるため、候補から除外される。
【００３１】
続いて、図５に示すように、上述したように優先順位がつけられ絞込みが行われた候補について、格フレーム解析手段１１３が並べ替えを行う。格フレーム解析手段１１３は、格フレーム辞書１２５を参照し、以下のような判断処理を行う。
【００３２】
たとえば、格フレーム辞書１２５によれば、「読」という語の前に「が」という助詞が位置する場合、主格が人であればその文字列は意味的に正しいと判断される。また、「を」という助詞が「読」の前に位置する場合、対象格が「本」であれば意味的に正しいと判断される。同様に、「呼」という語の前に「が」という助詞が位置する場合は主格が「人」である場合、また「を」が位置する場合は対象格が人である場合に意味的に正しいと判断される。従って、格フレーム解析手段１１３では、「本を似、三冊読んだ」および「本を二、三冊読んだ」といった候補の方が「本を似、三冊呼んだ」および「本を二、三冊呼んだ」よりもふさわしい、すなわち意味的に正しいものと判断され、優先順位は高くなる。
【００３３】
このような順位付けによる結果、優先順位は
１．本を似、三冊読んだ
２．本を二、三冊読んだ
３．本を似、三冊呼んだ
４．本を二、三冊呼んだ
となる。
【００３４】
続いて、トライグラム解析手段１１５がトライグラムデータ１２７を参照し、図６に示すように、格フレーム解析手段１１３によって順位付けされた候補の並べ替えを行う。
【００３５】
トライグラムデータ１２７には、３個の単語の連鎖がテキストに出現する確率が記述されており、この確率は、実際のテキストコーパスから作成される。すなわち、トライグラムデータ１２７は、実際のテキストコーパスに含まれている大量のテキストのデータについて、当該テキストを単語毎に区切り、３個の連語が出現する確率を求めることにより作成される。
【００３６】
トライグラムデータ１２７を参照した場合、「二」「、」「三」という語の並びが出現する確率が高い（言い換えれば、現実のテキストの中には、「二」「、」「三」という語の並びが多い）ので、「二」「、」「三」という単語の連鎖を多く含む候補が優先されるように文字列の候補が並べ替えられることとなる。なお、図６に示す例では、たとえば「似」「、」「三」という単語の並び等の、極めて確率の低い単語の連鎖については省略されている。
【００３７】
したがって、トライグラム解析手段１１５による優先順位の並べ替えの結果は、
１．本を二、三冊読んだ
２．本を二、三冊呼んだ
３．本を似、三冊読んだ
４．本を似、三冊呼んだ
となる。
【００３８】
以下、本実施形態に係るかな漢字変換装置を使用した実験の結果を記す。
【００３９】
一回のかな漢字変換処理で文字列に含まれる文字が正しい文字に変換される確率（以下、ｃｈａｒｒａｔｅという）が９４．０９％、一回のかな漢字変換処理で文字列全体が正しい文字列に変換される確率（以下、ｓｅｎｔｅｎｃｅｒａｔｅという）が４６．０５％である従来のかな漢字変換処理装置について、百数十ＭＢの実際のテキストコーパスに基づいて作成した約８０ＭＢのトライグラムデータを使用したトライグラムによる解析処理を適用した結果、ｃｈａｒｒａｔｅは９５．０３％、ｓｅｎｔｅｎｃｅｒａｔｅは５２．６８％であった。すなわち、文字単位においても、文単位においても、一回の変換処理で正しい文字列に変換される確率が上昇することが確認された。
【００４０】
以上、本発明の好適な実施形態について説明したが、本発明はこれに限られず、他の種々の形態で実施することが可能である。
【００４１】
たとえば、上述の実施形態では、実際のテキストコーパスに含まれるテキスト中に、３個の単語の連鎖が出現する確率を記述するトライグラムデータを作成することとしたが、単語の連鎖の数は３個に限定されず、任意のＮ（Ｎ≧２）個の単語の連鎖であってもよい。この場合、かな漢字変換装置の記憶装置には、Ｎ個の単語の連鎖がテキストコーパスに出現する確率を記述するＮグラムデータが記憶される。すなわち、Ｎグラムデータは、実際のテキストコーパスに含まれている大量のテキストのデータについて、当該テキストを単語毎に区切り、Ｎ個の連語が出現する確率を求めることにより作成される。
【００４２】
そして、上述の実施形態におけるトライグラム解析手段に代えて、Ｎグラム解析手段がＮグラムデータを参照し、格フレーム解析手段により並べ替えられた文字列の候補の優先順位をさらに並べ替えることとなる。
【００４３】
【発明の効果】
以上説明したように、本発明によれば、かな漢字変換の候補となる文字および該文字の優先度を記述する辞書、２つの品詞の接続の優先度を記述する品詞の接続表、文字列の中に含まれる語句の他の語句に対する意味的関係を記述する格フレーム辞書、および実際のテキストを含むテキストコーパスの中に単語のＮ（Ｎ≧２）個の連鎖が出現する確率を記述するＮグラムデータを記憶する記憶装置と、該記憶装置に記憶された前記辞書、前記品詞の接続表、前記格フレーム辞書、および前記Ｎグラムデータに基づいて、入力装置から入力されたかな文字列をかな漢字文字列に変換するデータ処理装置とを備えたかな漢字変換装置であって、前記データ処理装置は、入力された前記かな文字列にマッチする前記文字を前記辞書から抽出する辞書引き手段と、前記文字の優先度と、前記品詞の接続の優先度とに基づき、前記辞書引き手段により抽出された前記文字を組み合わせて、可能性が所定の基準以上であるかな漢字文字列の第１の候補を作成する形態素解析手段と、前記格フレーム辞書に基づき、前記第１の候補に含まれる前記かな漢字文字列の候補の優先順位を並べ替えて、第２の候補を作成する格フレーム解析手段と、前記単語のＮ個の連鎖が出現する確率に基づいて、前記第２の候補に含まれる前記かな漢字文字列の候補の優先順位を並べ替えるＮグラム解析手段とを備えたので、従来のかな漢字変換が出力する候補にのみトライグラムを適用することにより、計算量を抑えながら変換精度の向上を図ることができる。
【００４４】
また、前記記憶装置は、ローマ字に対応するかな文字を記述するローマ字かな変換表を記憶し、前記データ処理装置は、前記入力装置から入力されたローマ字をローマ字かな変換表に基づいて前記かな文字列に変換するローマ字かな変換手段を備え、前記辞書引き手段は、前記ローマ字かな変換手段によって変換された前記かな文字列にマッチする前記文字を前記辞書から抽出するので、ローマ字入力モードあるいはかな入力モードのいずれにおいても、極度に長い時間および大きなディスク容量を必要とせずに、かな漢字変換の精度を上げることができる。
【図面の簡単な説明】
【図１】本実施形態に係るかな漢字変換装置の機能ブロック図である。
【図２】本実施形態に係るかな漢字変換装置によるかな漢字変換の流れを説明するための図である。
【図３】本実施形態に係るかな漢字変換装置によるかな漢字変換の流れを説明するための図である。
【図４】本実施形態に係るかな漢字変換装置によるかな漢字変換の流れを説明するための図である。
【図５】本実施形態に係るかな漢字変換装置によるかな漢字変換の流れを説明するための図である。
【図６】本実施形態に係るかな漢字変換装置によるかな漢字変換の流れを説明するための図である。
【符号の説明】
１０１入力装置
１０３表示装置
１０５データ処理装置
１０７ローマ字かな変換手段
１０９辞書引き手段
１１１形態素解析手段
１１３格フレーム解析手段
１１５トライグラム解析手段
１１７記憶装置
１１９ローマ字かな変換表
１２１辞書
１２３品詞の接続表
１２５格フレーム辞書
１２７トライグラムデータ
[0001]
[Technical field of the invention]
The present invention relates to a kana-kanji conversion device and a kana-kanji conversion method, and more particularly to a kana-kanji conversion device and a kana-kanji conversion method used for inputting Japanese into a computer system.
[0002]
[Prior art]
Kana-kanji conversion devices, such as personal computers and word processors, are conventionally known as devices for inputting Japanese character strings: the user types on the keyboard the kana character string corresponding to the desired kanji character string, and the device converts the kana character string into a kana-kanji character string in response to input of a kanji conversion key. To convert a kana character string input to such a device into a kana-kanji character string, the user presses one specific key, or a combination of specific keys, for kanji conversion, and candidates for the kana-kanji character string are displayed. Candidates can also be displayed one after another; in that case, the immediately preceding candidate is recalled by, for example, pressing a previous-candidate key, and a sentence can be input in this way. The same procedure is used to convert a kana character string into a katakana character string or a Roman character string.
[0003]
In a kana-kanji conversion device that performs kana-kanji conversion on an input character string, conversion accuracy is improved by referring to morpheme and case frame analysis information when determining the kanji corresponding to the character string. Here, a morpheme is the smallest meaningful linguistic unit, consisting of one or more phonemes; morphological analysis recognizes the morpheme boundaries in a character string and identifies the part of speech of each morpheme.
[0004]
In case frame analysis, a semantic connection relationship between words included in a character string is expressed by the concept of “case grammar”.
[0005]
Conventional Kana-Kanji conversion devices often use morphological analysis based on the part of speech as described above and case frame analysis.
[0006]
[Problems to be solved by the invention]
However, some linguistic phenomena are difficult to convert accurately with the conventional kana-kanji conversion method described above. Moreover, if a conventional method broadly accepts the expressions that may correspond to a character string so that even special vocabulary and expressions can be converted as easily as possible, users who do not use such special terms suffer side effects such as sequences of incomprehensible words and the erroneous conversion or learning of near-synonymous homophones with weak co-occurrence relations.
[0007]
On the other hand, the trigram is known as a language model used in speech recognition systems. The trigram has the merit of being able to handle linguistic phenomena that cannot be converted accurately by part-of-speech-based morphological analysis or case frames.
[0008]
However, since the calculation amount of the trigram is extremely large, there is a problem that it is very difficult to apply the trigram as it is to a system that requires high speed such as kana-kanji conversion.
[0009]
The present invention has been made in view of such problems, and its object is to provide a kana-kanji conversion device and a kana-kanji conversion method capable of suppressing side effects, such as erroneous conversion, that have occurred in conventional kana-kanji conversion.
[0010]
[Means for Solving the Problems]
In order to achieve this object, the invention according to claim 1 is a kana-kanji conversion device comprising: a storage device that stores a dictionary describing characters that are candidates for kana-kanji conversion and the priorities of the characters, a part-of-speech connection table describing the priority of the connection of two parts of speech, a case frame dictionary describing the semantic relationship of a word or phrase contained in a character string to other words or phrases, and N-gram data describing the probability that a chain of N (N ≧ 3) words appears in a text corpus containing actual text; and a data processing device that converts a kana character string input from an input device into a kana-kanji character string based on the dictionary, the part-of-speech connection table, the case frame dictionary, and the N-gram data stored in the storage device, wherein the data processing device comprises: dictionary lookup means for extracting from the dictionary the characters that match the input kana character string; morphological analysis means for combining the characters extracted by the dictionary lookup means, based on the priorities of the characters and the priorities of the part-of-speech connections, to create first candidates, which are kana-kanji character strings whose likelihood is at or above a predetermined criterion; case frame analysis means for rearranging, based on the case frame dictionary, the priority order of the kana-kanji character string candidates included in the first candidates to create second candidates; and N-gram analysis means for rearranging the priority order of the kana-kanji character string candidates included in the second candidates based on the probability that the chain of N words appears.
[0011]
The invention according to claim 2 is the kana-kanji conversion device according to claim 1, wherein the storage device stores a romaji-kana conversion table describing the kana characters corresponding to Roman characters, the data processing device comprises romaji-kana conversion means for converting Roman characters input from the input device into the kana character string based on the romaji-kana conversion table, and the dictionary lookup means extracts from the dictionary the characters that match the kana character string converted by the romaji-kana conversion means.
[0012]
Further, the invention according to claim 3 is a kana-kanji conversion method for converting a kana character string input from an input device into a kana-kanji character string based on the following, stored in a storage device: a dictionary describing characters that are candidates for kana-kanji conversion and the priorities of the characters, a part-of-speech connection table describing the priority of the connection of two parts of speech, a case frame dictionary describing the semantic relationship of a word or phrase contained in a character string to other words or phrases, and N-gram data describing the probability that a chain of N (N ≧ 3) words appears in a text corpus containing actual text; the method comprising: a dictionary lookup step of extracting from the dictionary the characters that match the input kana character string; a morphological analysis step of combining the characters extracted in the dictionary lookup step, based on the priorities of the characters and the priorities of the part-of-speech connections, to create first candidates, which are kana-kanji character strings whose likelihood is at or above a predetermined criterion; a case frame analysis step of rearranging, based on the case frame dictionary, the priority order of the kana-kanji character string candidates included in the first candidates to create second candidates; and an N-gram analysis step of rearranging the priority order of the kana-kanji character string candidates included in the second candidates based on the probability that the chain of N words appears.
[0013]
The invention according to claim 4 is the kana-kanji conversion method according to claim 3, wherein the storage device stores a romaji-kana conversion table describing the kana characters corresponding to Roman characters, the method comprises a romaji-kana conversion step of converting Roman characters input from the input device into the kana character string based on the romaji-kana conversion table, and the dictionary lookup step extracts from the dictionary the characters that match the kana character string converted in the romaji-kana conversion step.
[0014]
According to this method, the accuracy of kana-kanji conversion can be improved without requiring an extremely long time and a large disk capacity.
[0015]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below in detail with reference to the drawings.
[0016]
FIG. 1 is a functional block diagram of the kana-kanji conversion apparatus according to this embodiment.
As shown in the example of FIG. 1, the kana-kanji conversion device of the present embodiment includes an input device 101, a display device 103, a data processing device 105, and a storage device 117. The input device 101 is constituted by a keyboard or the like for inputting a kana character string for kana-kanji conversion and for giving various instructions such as conversion and confirmation.
[0017]
The display device 103 is specifically composed of a CRT, LCD, or the like, and displays a character string or the like input by the input device 101.
[0018]
The data processing device 105 is constituted by a basic processing unit (BPU), which reads and executes the instructions constituting a computer program in a central processing unit (CPU), and a control unit, and comprises a romaji-kana conversion unit 107, a dictionary lookup unit 109, a morphological analysis unit 111, a case frame analysis unit 113, and a trigram analysis unit 115.
[0019]
The romaji-kana conversion unit 107 converts Roman characters input from outside into kana characters, that is, hiragana or katakana. The dictionary lookup unit 109 finds, in the dictionary stored in the kana-kanji conversion device for the purpose of kana-kanji conversion, the kanji corresponding to the reading of the kana characters converted by the romaji-kana conversion unit 107.
[0020]
When kana characters are input directly from the input device 101, processing by the dictionary lookup unit 109 is performed without processing by the romaji-kana conversion unit 107. Whether Roman characters or kana characters are input from the input device 101 is determined by the character input mode of the kana-kanji conversion device.
[0021]
The morphological analysis unit 111 uses part-of-speech connection information and word priorities to judge how readily the words connect to one another, for the words contained in the result of converting the input character string into a mixed kana-kanji character string, that is, a kana-kanji character string. The case frame analysis unit 113 changes the priority order of the kana-kanji character string candidates, based on the relationship between a verb and its subject and object, so that the candidates are ordered from the most semantically plausible.
[0022]
The trigram analyzing unit 115 rearranges the character string candidates ranked by the case frame analyzing unit 113 using a set of three words (trigram) extracted from the text corpus.
[0023]
The storage device 117 is constituted by a main storage device or the like for storing data, and stores the data that relates to the present invention and is accessed by the CPU. In the present embodiment, the storage device 117 stores a romaji-kana conversion table 119, a dictionary 121, a part-of-speech connection table 123, a case frame dictionary 125, and trigram data 127 as data referred to in kana-kanji conversion.
[0024]
Further, in an area (not shown) of the storage device 117, an input character string, a searched kanji candidate, and an execution instruction of a computer program executed by the CPU including the data processing device 105 are stored. Then, the CPU can directly access the contents and put an instruction or data into a register to execute a program, perform an operation on the data, or perform an operation based on the data.
[0025]
FIGS. 2 to 6 are diagrams for explaining the flow of kana-kanji conversion by the kana-kanji conversion device according to this embodiment for the character string “ほんをに、さんさつよんだ”. The operation of the kana-kanji conversion device according to this embodiment is described below.
[0026]
First, in FIG. 2, when the Roman characters “honwoni,sannsatuyonda” are input from the input device 101, the romaji-kana conversion unit 107 selects the hiragana corresponding to the reading of the Roman characters. The romaji-kana conversion unit 107 refers to the romaji-kana conversion table 119 and, from correspondences such as
ho → ほ
n → ん
wo → を
creates the hiragana string “ほんをに、さんさつよんだ”.
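The table-driven conversion described here can be sketched as a greedy longest-match lookup. This is only an illustration under stated assumptions: the table excerpt, the function name `romaji_to_kana`, the use of ASCII input, and the longest-match strategy are all hypothetical, since the text does not say how the romaji-kana conversion table 119 is searched.

```python
# Hypothetical excerpt of a romaji-kana conversion table; only the entries
# needed for the example input are included.
ROMAJI_TO_KANA = {
    "ho": "ほ", "wo": "を", "ni": "に", "sa": "さ", "tu": "つ",
    "yo": "よ", "da": "だ", "nn": "ん", "n": "ん", ",": "、",
}


def romaji_to_kana(romaji: str, table: dict[str, str] = ROMAJI_TO_KANA) -> str:
    """Convert romaji to hiragana, trying the longest table key first."""
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(romaji):
        for length in range(max_len, 0, -1):
            chunk = romaji[i:i + length]
            if chunk in table:
                out.append(table[chunk])
                i += len(chunk)
                break
        else:
            # No table entry starts at this position.
            raise ValueError(f"no table entry for input at position {i}")
    return "".join(out)


kana = romaji_to_kana("honwoni,sannsatuyonda")
print(kana)
```

Trying the longer key first lets "ni" be read as に while a lone "n" before a consonant still becomes ん.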
[0027]
Next, the dictionary lookup unit 109 extracts from the dictionary 121 the character strings that match the readings in the hiragana string thus converted. Specifically, “穂”, “歩”, “帆”, and “ほ” are extracted as words corresponding to the reading “ほ”, and “本” and “翻” as words corresponding to the reading “ほん”; this process continues to the end of the hiragana string.
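As a rough sketch of this lookup, the following enumerates every dictionary reading that matches at each position of the hiragana string. The two dictionary entries are the ones named in the text; the data layout and the function name `lookup_all` are assumptions made for illustration.

```python
# Readings and words taken from the example in the text; a real dictionary
# would also carry a priority and a part of speech for each word.
DICTIONARY = {
    "ほ": ["穂", "歩", "帆", "ほ"],
    "ほん": ["本", "翻"],
}


def lookup_all(kana: str, dictionary: dict[str, list[str]]):
    """Return (start, reading, word) for every reading matching at each position."""
    max_len = max(len(reading) for reading in dictionary)
    hits = []
    for start in range(len(kana)):
        for length in range(1, max_len + 1):
            reading = kana[start:start + length]
            if len(reading) < length:
                break  # ran past the end of the string
            for word in dictionary.get(reading, []):
                hits.append((start, reading, word))
    return hits


hits = lookup_all("ほん", DICTIONARY)
for hit in hits:
    print(hit)
```

The resulting set of overlapping matches is what the morphological analysis stage then combines into whole-string candidates.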
[0028]
Subsequently, as shown in FIG. 4, the morphological analysis unit 111 creates candidate character strings containing kanji from the set of words found by the dictionary lookup unit 109, based on the part-of-speech connection table 123, and assigns a priority order to the candidates. In this embodiment, the priority given to each word in the dictionary 121 and the priority given to each type of part-of-speech connection in the part-of-speech connection table 123 are summed, and the candidates are ranked so that a character string with a lower total score is given higher priority.
[0029]
For example, for the character string “本を似、三冊呼んだ”, the calculation
(本) + (を) + (似) + (、) + (三) + (冊) + (呼) + (んだ)
is performed based on the dictionary 121, giving a word priority of
10 + 5 + 10 + 5 + 10 + 2 + 20 + 5 = 85 points.
This character string consists of the part-of-speech connections
noun - particle - ichidan verb - comma - numeral - counter - ba-row godan verb - ending,
so the calculation
(noun - particle) + (particle - ichidan verb) + (ichidan verb - comma) + (comma - numeral) + (numeral - counter) + (counter - ba-row godan verb) + (ba-row godan verb - ending)
is performed based on the part-of-speech connection table 123, giving a connection priority of
30 + 20 + 30 + 40 + 10 + 40 + 30 = 200 points.
The word priority and the connection priority are then summed, giving a total priority of 285 points.
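The connection-priority part of this calculation can be reproduced with a small lookup over adjacent part-of-speech pairs. The sketch below is an illustration only: the English part-of-speech labels, the pairing of each listed value with a pair in the order given, and the function name are assumptions. The individual word priorities as printed add up to 67 rather than the stated 85, so the sketch takes the stated word subtotal of 85 as given and recomputes only the connection subtotal.

```python
# Connection priorities paired with part-of-speech pairs in the order they
# are listed in the text (labels are this sketch's own).
CONNECTION_TABLE = {
    ("noun", "particle"): 30,
    ("particle", "ichidan_verb"): 20,
    ("ichidan_verb", "comma"): 30,
    ("comma", "numeral"): 40,
    ("numeral", "counter"): 10,
    ("counter", "ba_godan_verb"): 40,
    ("ba_godan_verb", "ending"): 30,
}

POS_SEQUENCE = [
    "noun", "particle", "ichidan_verb", "comma",
    "numeral", "counter", "ba_godan_verb", "ending",
]

STATED_WORD_PRIORITY = 85  # word subtotal as stated in the text


def connection_priority(pos_sequence, table):
    """Sum the table value for every adjacent pair of parts of speech."""
    return sum(table[pair] for pair in zip(pos_sequence, pos_sequence[1:]))


connection_total = connection_priority(POS_SEQUENCE, CONNECTION_TABLE)
total_priority = STATED_WORD_PRIORITY + connection_total
print(connection_total, total_priority)
```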
[0030]
By the same calculation, scores of 290 points for the kana-kanji character string “本を似、三冊読んだ”, 295 points for “本を二、三冊呼んだ”, and 300 points for “本を二、三冊読んだ” are obtained. The priority order of these four candidates is therefore
1. 本を似、三冊呼んだ
2. 本を似、三冊読んだ
3. 本を二、三冊呼んだ
4. 本を二、三冊読んだ
A candidate whose priority score is at or above a certain reference value is excluded as a candidate that connects poorly, that is, an unlikely candidate. Various other word combinations are possible, for example “翻を似、三冊呼んだ”, but such kana-kanji character strings are excluded from the candidates because the calculation gives them a high priority score.
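A minimal sketch of this ranking and pruning step follows. The four scores are those given in the text; the score of 400 for “翻を似、三冊呼んだ” and the reference value of 350 are hypothetical, since the text states only that such a candidate scores high enough to be excluded.

```python
SCORES = {
    "本を似、三冊呼んだ": 285,
    "本を似、三冊読んだ": 290,
    "本を二、三冊呼んだ": 295,
    "本を二、三冊読んだ": 300,
    "翻を似、三冊呼んだ": 400,  # hypothetical score, for illustration only
}

REFERENCE_VALUE = 350  # hypothetical cut-off


def rank_and_prune(scores: dict[str, int], reference_value: int) -> list[str]:
    """Drop candidates scoring at or above the reference value, then sort
    the rest so that the lowest score comes first."""
    kept = [(score, text) for text, score in scores.items() if score < reference_value]
    kept.sort(key=lambda pair: pair[0])
    return [text for _, text in kept]


first_candidates = rank_and_prune(SCORES, REFERENCE_VALUE)
for rank, text in enumerate(first_candidates, start=1):
    print(rank, text)
```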
[0031]
Subsequently, as shown in FIG. 5, the case frame analysis unit 113 rearranges the candidates that have been prioritized and narrowed down as described above. The case frame analysis unit 113 refers to the case frame dictionary 125 and performs the following determination process.
[0032]
For example, according to the case frame dictionary 125, when the particle “が” precedes the word “読” (read), the character string is judged semantically correct if the nominative case is a person; when the particle “を” precedes “読”, it is judged semantically correct if the object case is “本” (book). Similarly, when the particle “が” precedes the word “呼” (call), the character string is judged semantically correct if the nominative case is a person, and when “を” precedes it, if the object case is a person. The case frame analysis unit 113 therefore judges candidates such as “本を似、三冊読んだ” and “本を二、三冊読んだ” to be more appropriate, that is, more semantically correct, than “本を似、三冊呼んだ” and “本を二、三冊呼んだ”, and gives them higher priority.
[0033]
As a result of this ranking, the priority order becomes
1. 本を似、三冊読んだ
2. 本を二、三冊読んだ
3. 本を似、三冊呼んだ
4. 本を二、三冊呼んだ
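The re-ranking by case frames can be sketched as a stable sort on the number of case-frame violations. This is a deliberately simplified illustration: the frames list literal nouns instead of semantic classes such as "person", only the を case is checked, and the word segmentation and all names are assumptions rather than the structure of the case frame dictionary 125.

```python
# Hypothetical, simplified case frames: the nouns each verb accepts before "を".
CASE_FRAMES = {
    "読": {"を": {"本"}},
    "呼": {"を": {"人"}},
}

# Candidates in the order produced by morphological analysis.
CANDIDATES = [
    ["本", "を", "似", "、", "三", "冊", "呼", "んだ"],
    ["本", "を", "似", "、", "三", "冊", "読", "んだ"],
    ["本", "を", "二", "、", "三", "冊", "呼", "んだ"],
    ["本", "を", "二", "、", "三", "冊", "読", "んだ"],
]


def violations(words: list[str]) -> int:
    """Count verbs whose "を" frame does not accept the noun marked by "を"."""
    count = 0
    for i, word in enumerate(words):
        if word == "を" and i > 0:
            obj = words[i - 1]
            for later in words[i + 1:]:
                frame = CASE_FRAMES.get(later)
                if frame is not None and obj not in frame["を"]:
                    count += 1
    return count


# sorted() is stable, so candidates with equal violation counts keep their
# earlier relative order.
second_candidates = sorted(CANDIDATES, key=violations)
for rank, words in enumerate(second_candidates, start=1):
    print(rank, "".join(words))
```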
[0034]
Subsequently, the trigram analyzing unit 115 refers to the trigram data 127 and rearranges the candidates ranked by the case frame analyzing unit 113 as shown in FIG.
[0035]
The trigram data 127 describes the probability that a chain of three words appears in text; this probability is derived from an actual text corpus. That is, the trigram data 127 is created by segmenting the large amount of text data in the actual text corpus into words and determining the probability with which each sequence of three consecutive words appears.
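One way such data could be built is by counting word sequences and normalising the counts. The sketch below is an assumption-laden illustration: the two-sentence corpus is invented, the sentences are taken as already segmented into words, and the probability is computed as a simple relative frequency, which the text does not specify.

```python
from collections import Counter


def build_ngram_probabilities(sentences, n=3):
    """Return {n-gram: relative frequency} over all n-grams in the corpus."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical toy corpus, already split into words.
CORPUS = [
    ["本", "を", "二", "、", "三", "冊", "読", "んだ"],
    ["二", "、", "三", "日", "休", "んだ"],
]

trigram_data = build_ngram_probabilities(CORPUS, n=3)
print(trigram_data[("二", "、", "三")])
```

Passing a different `n` gives the N-gram generalisation with the same counting procedure.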
[0036]
Referring to the trigram data 127, the sequence of words “二”, “、”, “三” has a high probability of appearing (in other words, the sequence “二”, “、”, “三” occurs frequently in real text), so the character string candidates are rearranged so that candidates containing more occurrences of the word chain “二”, “、”, “三” are given priority. In the example shown in FIG. 6, word chains with extremely low probability, such as the sequence “似”, “、”, “三”, are omitted.
[0037]
Therefore, the result of the rearrangement of the priority order by the trigram analysis unit 115 is
1. 本を二、三冊読んだ
2. 本を二、三冊呼んだ
3. 本を似、三冊読んだ
4. 本を似、三冊呼んだ
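The final reordering can be sketched as a stable sort on a trigram score. The probability value, the scoring rule (summing the probabilities of the trigrams a candidate contains, with unseen trigrams counted as zero), and the word segmentation are assumptions for illustration; the text says only that candidates containing more high-probability chains are preferred.

```python
# Hypothetical trigram data: only the chain named in the text is present.
TRIGRAM_DATA = {("二", "、", "三"): 0.2}

# Candidates in the order produced by case frame analysis.
SECOND_CANDIDATES = [
    ["本", "を", "似", "、", "三", "冊", "読", "んだ"],
    ["本", "を", "二", "、", "三", "冊", "読", "んだ"],
    ["本", "を", "似", "、", "三", "冊", "呼", "んだ"],
    ["本", "を", "二", "、", "三", "冊", "呼", "んだ"],
]


def trigram_score(words, data):
    """Sum the probabilities of all trigrams in the candidate (0 if unseen)."""
    return sum(
        data.get(tuple(words[i:i + 3]), 0.0)
        for i in range(len(words) - 2)
    )


# Higher score first; the stable sort keeps the case-frame order among ties.
final_order = sorted(
    SECOND_CANDIDATES,
    key=lambda words: -trigram_score(words, TRIGRAM_DATA),
)
for rank, words in enumerate(final_order, start=1):
    print(rank, "".join(words))
```

Because only the short list left by the earlier stages is scored, the trigram lookups stay few, which is the efficiency point the description makes.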
[0038]
Hereinafter, the result of the experiment using the kana-kanji conversion apparatus according to this embodiment will be described.
[0039]
The conventional kana-kanji conversion device used as the baseline had a probability of 94.09% that a character in a character string is converted to the correct character by a single kana-kanji conversion (hereinafter, “char rate”) and a probability of 46.05% that the entire character string is converted to the correct character string by a single kana-kanji conversion (hereinafter, “sentence rate”). When trigram analysis using about 80 MB of trigram data, created from an actual text corpus of one hundred and several tens of MB, was applied to this device, the char rate was 95.03% and the sentence rate was 52.68%. That is, it was confirmed that the probability of conversion to the correct character string by a single conversion increases both per character and per sentence.
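The size of the improvement follows directly from the four reported figures. The short calculation below only subtracts them; the variable and function names are this sketch's own.

```python
# Figures as reported in the text (percent).
baseline = {"char_rate": 94.09, "sentence_rate": 46.05}
with_trigram = {"char_rate": 95.03, "sentence_rate": 52.68}


def gain_in_points(before, after):
    """Absolute gain in percentage points, rounded to two decimals."""
    return {key: round(after[key] - before[key], 2) for key in before}


gains = gain_in_points(baseline, with_trigram)
print(gains)
```

The gain is under one point per character but more than six points per sentence, since a sentence counts as correct only when every character in it is correct.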
[0040]
The preferred embodiment of the present invention has been described above, but the present invention is not limited to this and can be implemented in various other forms.
[0041]
For example, in the embodiment described above, trigram data describing the probability that a chain of three words appears in the text of an actual text corpus is created; however, the number of words in a chain is not limited to three, and a chain of any N (N ≧ 2) words may be used. In this case, the storage device of the kana-kanji conversion device stores N-gram data describing the probability that a chain of N words appears in the text corpus. That is, the N-gram data is created by segmenting the large amount of text data in the actual text corpus into words and determining the probability with which each sequence of N consecutive words appears.
[0042]
Then, in place of the trigram analysis unit of the embodiment described above, an N-gram analysis unit refers to the N-gram data and further rearranges the priority order of the character string candidates rearranged by the case frame analysis unit.
[0043]
[Effects of the invention]
As described above, the present invention provides a kana-kanji conversion device comprising: a storage device that stores a dictionary describing characters that are candidates for kana-kanji conversion and the priorities of the characters, a part-of-speech connection table describing the priority of the connection of two parts of speech, a case frame dictionary describing the semantic relationship of a word or phrase contained in a character string to other words or phrases, and N-gram data describing the probability that a chain of N (N ≧ 2) words appears in a text corpus containing actual text; and a data processing device that converts a kana character string input from an input device into a kana-kanji character string based on the dictionary, the part-of-speech connection table, the case frame dictionary, and the N-gram data stored in the storage device, wherein the data processing device comprises: dictionary lookup means for extracting from the dictionary the characters that match the input kana character string; morphological analysis means for combining the characters extracted by the dictionary lookup means, based on the priorities of the characters and the priorities of the part-of-speech connections, to create first candidates, which are kana-kanji character strings whose likelihood is at or above a predetermined criterion; case frame analysis means for rearranging, based on the case frame dictionary, the priority order of the kana-kanji character string candidates included in the first candidates to create second candidates; and N-gram analysis means for rearranging the priority order of the kana-kanji character string candidates included in the second candidates based on the probability that the chain of N words appears. Therefore, by applying the trigram only to the candidates output by conventional kana-kanji conversion, conversion accuracy can be improved while the amount of computation is kept down.
[0044]
In addition, the storage device stores a romaji-kana conversion table describing the kana characters corresponding to Roman characters, the data processing device comprises romaji-kana conversion means for converting Roman characters input from the input device into the kana character string based on the romaji-kana conversion table, and the dictionary lookup means extracts from the dictionary the characters that match the kana character string converted by the romaji-kana conversion means. Therefore, in either the romaji input mode or the kana input mode, the accuracy of kana-kanji conversion can be improved without requiring an extremely long time or a large disk capacity.
[Brief description of the drawings]
FIG. 1 is a functional block diagram of a kana-kanji conversion apparatus according to an embodiment.
FIG. 2 is a diagram for explaining a flow of kana-kanji conversion by the kana-kanji conversion apparatus according to the present embodiment.
FIG. 3 is a diagram for explaining a flow of kana-kanji conversion by the kana-kanji conversion apparatus according to the present embodiment.
FIG. 4 is a diagram for explaining a flow of kana-kanji conversion by the kana-kanji conversion apparatus according to the present embodiment.
FIG. 5 is a diagram for explaining a flow of kana-kanji conversion by the kana-kanji conversion apparatus according to the present embodiment.
FIG. 6 is a diagram for explaining a flow of kana-kanji conversion by the kana-kanji conversion apparatus according to the present embodiment.
[Explanation of symbols]
101 Input device
103 Display device
105 Data processing device
107 Romaji-kana conversion unit
109 Dictionary lookup unit
111 Morphological analysis unit
113 Case frame analysis unit
115 Trigram analysis unit
117 Storage device
119 Romaji-kana conversion table
121 Dictionary
123 Part-of-speech connection table
125 Case frame dictionary
127 Trigram data

Claims

A kana-kanji conversion device comprising:
a storage device that stores a dictionary that describes characters that are candidates for kana-kanji conversion and the priority of the characters, a part-of-speech connection table that describes the priority of a connection between two parts of speech, a case frame dictionary that describes the semantic relationship of a word or phrase contained in a character string to other words or phrases, and N-gram data that describes the probability that a chain of N (N ≧ 3) words appears in a text corpus containing actual text; and
a data processing device that converts a kana character string input from an input device into a kana-kanji character string based on the dictionary, the part-of-speech connection table, the case frame dictionary, and the N-gram data stored in the storage device,
wherein the data processing device comprises:
dictionary lookup means for extracting from the dictionary the characters that match the input kana character string;
morphological analysis means for combining the characters extracted by the dictionary lookup means, based on the priority of the characters and the priority of the part-of-speech connections, to create a first candidate consisting of kana-kanji character strings whose likelihood is equal to or greater than a predetermined criterion;
case frame analysis means for rearranging, based on the case frame dictionary, the priority order of the kana-kanji character string candidates included in the first candidate to create a second candidate; and
N-gram analysis means for rearranging the priority order of the kana-kanji character string candidates included in the second candidate based on the probability that the chain of N words appears.

The kana-kanji conversion device according to claim 1, wherein the storage device stores a romaji-kana conversion table describing the kana characters corresponding to romaji, the data processing device further comprises romaji-kana conversion means for converting romaji input from the input device into the kana character string based on the romaji-kana conversion table, and the dictionary lookup means extracts from the dictionary the characters that match the kana character string converted by the romaji-kana conversion means.

A kana-kanji conversion method for converting a kana character string input from an input device into a kana-kanji character string, based on a dictionary that describes characters that are candidates for kana-kanji conversion and the priority of the characters, a part-of-speech connection table that describes the priority of a connection between two parts of speech, a case frame dictionary that describes the semantic relationship of a word or phrase contained in a character string to other words or phrases, and N-gram data that describes the probability that a chain of N (N ≧ 3) words appears in a text corpus containing actual text, all of which are stored in a storage device, the method comprising:
a dictionary lookup step of extracting from the dictionary the characters that match the input kana character string;
a morphological analysis step of combining the characters extracted in the dictionary lookup step, based on the priority of the characters and the priority of the part-of-speech connections, to create a first candidate consisting of kana-kanji character strings whose likelihood is equal to or greater than a predetermined criterion;
a case frame analysis step of rearranging, based on the case frame dictionary, the priority order of the kana-kanji character string candidates included in the first candidate to create a second candidate; and
an N-gram analysis step of rearranging the priority order of the kana-kanji character string candidates included in the second candidate based on the probability that the chain of N words appears.

The kana-kanji conversion method according to claim 3, wherein the storage device stores a romaji-kana conversion table describing the kana characters corresponding to romaji, the method further comprises a romaji-kana conversion step of converting romaji input from the input device into the kana character string based on the romaji-kana conversion table, and the dictionary lookup step extracts from the dictionary the characters that match the kana character string converted in the romaji-kana conversion step.