JP2007004637A

JP2007004637A - Kana-Kanji conversion

Info

Publication number: JP2007004637A
Application number: JP2005185768A
Authority: JP
Inventors: Hiroaki Kaneki; 宏明鹿子木; Michihide Maeda; 理英前田; Miyuki Seki; 美由紀関
Original assignee: Microsoft Corp
Current assignee: Microsoft Corp
Priority date: 2005-06-24
Filing date: 2005-06-24
Publication date: 2007-01-11
Anticipated expiration: 2025-06-24
Also published as: JP4796341B2

Abstract

PROBLEM TO BE SOLVED: To provide a kana-kanji conversion technique that converts with higher accuracy, without degrading the user experience, even when the user enters text in short pieces and presses the conversion key frequently.
SOLUTION: A first process is executed that corrects a probability obtained from a statistical language model held in a database, namely the probability that a target word of the input sentence follows a specific symbol. For a predetermined symbol among the specific symbols, the first process is executed only when that symbol has been confirmed. In addition, when the input sentence contains no specific symbol and a previously entered character string has been confirmed, a second process is executed that corrects the probability that the next word follows the confirmed character string.
SELECTED DRAWING: Figure 4

Description

The present invention relates to kana-kanji conversion that uses an n-gram model as its language model, and in particular to a conversion method and to a computer-readable recording medium that stores a kana-kanji conversion program.

In general, when Japanese is typed into a computer system or the like, kana-kanji conversion is performed to convert the typed kana characters into kanji. A representative kana-kanji conversion program is the applicant's MS-IME (related technology is disclosed in Patent Document 1 and elsewhere).
When typing Japanese, there is no telling at what point the user will press the conversion key. However, usability tests and misconversion reports show that many users press the conversion key frequently, converting short fragments as they type.

Patent Document 1: JP 2004-118461 A

However, when such fragmented input was simulated with a pure SLM (statistical language model) and the conversion accuracy was measured as CER (character error rate), the accuracy for fragmented input turned out to be low. In practice, with fragmented input, a startling misconversion is sometimes followed by the expected result when the same text is entered again, which easily confuses the user. In particular, the conversion result for a fragment differed depending on whether or not a specific character preceded it.

Even when there is no specific symbol, the conversion result expected for a character string can differ depending on whether or not a previously confirmed character string precedes it.

As noted above, many users enter text in fragments. The low conversion accuracy for fragmented input therefore needs to be resolved, so that these users experience conversion accuracy equal to or better than that of users who enter longer stretches of text.

The present invention has been made in view of the above, and its object is to provide a kana-kanji conversion technique that converts with higher accuracy, without degrading the user experience, even when the user presses the conversion key frequently while typing.

To achieve the above object, the invention according to claim 1 is a computer-readable recording medium storing a database that includes a statistical language model and a kana-kanji conversion program that uses the statistical language model, wherein the kana-kanji conversion program executes a first process that causes a probability obtained from the statistical language model, namely the probability that a word of an input sentence follows a specific symbol, to take a corrected value.

The invention according to claim 2 is the computer-readable recording medium according to claim 1, wherein, for a predetermined symbol among the specific symbols, the first process is executed only when the predetermined symbol has been confirmed.

The invention according to claim 3 is the computer-readable recording medium according to claim 1 or 2, wherein, when the input sentence does not contain the specific symbol and a previously entered character string has been confirmed, the kana-kanji conversion program executes a second process that causes the probability that the word following the confirmed character string follows that string to take a corrected value.

The invention according to claim 4 is the computer-readable recording medium according to claim 1 or 2, wherein the probability obtained by the first process is the probability obtained from the statistical language model when the specific symbol is regarded as a sentence-head mark indicating the beginning of a sentence.

The invention according to claim 5 is the computer-readable recording medium according to claim 3, wherein the probability obtained by the second process is obtained by linearly interpolating (a) the probability, obtained from the statistical language model, that the word following the confirmed character string follows that string, and (b) the corresponding probability obtained from the statistical language model when the words contained in the confirmed character string are regarded as sentence-head marks.

The invention according to claim 6 is the computer-readable recording medium according to any one of claims 1 to 5, wherein the statistical language model is an n-gram model.

The invention according to claim 7 is a kana-kanji conversion method that is executed on a computer system and uses a statistical language model, the method comprising a step of executing a first process that causes a probability obtained from the statistical language model held in a database, namely the probability that a word of an input sentence follows a specific symbol, to take a corrected value.

The invention according to claim 8 is the computer-readable recording medium according to claim 7, wherein, for a predetermined symbol among the specific symbols, the first process is executed only when the predetermined symbol has been confirmed.

The invention according to claim 9 is the method according to claim 7 or 8, further comprising a step of executing, when the input sentence does not contain the specific symbol and a previously entered character string has been confirmed, a second process that causes the probability that the word following the confirmed character string follows that string to take a corrected value.

The invention according to claim 10 is the method according to any one of claims 7 to 9, wherein the probability obtained by the first process is the probability obtained from the statistical language model when the specific symbol is regarded as a sentence-head mark indicating the beginning of a sentence.

The invention according to claim 11 is the method according to claim 9, wherein the probability obtained by the second process is obtained by linearly interpolating (a) the probability, obtained from the statistical language model, that the word following the confirmed character string follows that string, and (b) the corresponding probability obtained from the statistical language model when the words contained in the confirmed character string are regarded as sentence-head marks.

The invention according to claim 12 is the method according to any one of claims 7 to 11, wherein the statistical language model is an n-gram model.

According to the present invention, the accuracy of kana-kanji conversion is improved even for fragmented conversion, in which the user presses the conversion key frequently, and degradation of the user experience is prevented.
Moreover, this effect is obtained solely by a process that corrects the probabilities used to obtain conversion candidates, so no change to the statistical language model itself is required.

FIG. 1 shows an example of a suitable computer system 100 on which the present invention can be implemented. The computer system 100 is only one example of a suitable system; the invention can also be implemented in a distributed computer system in which tasks are performed by remote processing devices linked through a communications network. In a distributed computer system, program modules that cause a computer to execute predetermined processing can be located in the recording media of both local and remote computers.

Referring to FIG. 1, the computer system 100, an exemplary system for implementing the invention, includes a general-purpose computing device shown as a computer 110. Components of the computer 110 include a processing unit 120, a system memory 130, and a system bus 121 that couples various system components, including the system memory, to the processing unit 120.

The computer 110 typically includes a variety of computer-readable recording media. A computer-readable recording medium can be any medium that the computer 110 can access, whether volatile or nonvolatile, removable or fixed. Examples include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage (hard disk drive 141) or other magnetic storage devices, and any other medium that can be used to store the desired information and can be accessed by the computer 110.

The computer 110 also includes drives for reading from and writing to removable media. By way of example, FIG. 1 shows a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD-ROM or other optical medium. The hard disk drive 141 is typically connected to the system bus 121 through a fixed interface such as interface 140, and the magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 through a removable interface such as interface 150.

The system memory 130 consists of volatile or nonvolatile memory such as read-only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help transfer information between elements within the computer 110, such as during start-up, is typically stored in the ROM 131. The RAM 132 typically holds data and program modules that are immediately accessible to, or currently being operated on by, the processing unit 120. By way of example, FIG. 1 shows an operating system 134, application programs 135, other program modules 136, and program data 137.

In FIG. 1, the operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers to indicate that, at a minimum, they are different copies. A user can enter commands and information into the computer 110 through input devices such as a keyboard 162 and a pointing device 161, commonly a mouse, trackball, or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, scanner, and the like. These input devices are connected to the processing unit 120 through a user input interface 160 coupled to the system bus, but they can also be connected through other interfaces such as a parallel port, game port, or universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 through an interface such as a video interface 190. In addition to the monitor, the computer may include other peripheral output devices, such as speakers 197 and a printer 196, which can be connected through an output peripheral interface 195.

The computer 110 can also operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180, as mentioned above. The remote computer 180 may be another personal computer, a server, a router, a network PC, a peer device, or another common network node. Although only a storage device 181 is shown in FIG. 1, the remote computer 180 also includes many or all of the elements described above for the computer 110. The logical connections shown in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in computer networks such as intranets and the Internet.

When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, can be connected to the system bus 121 through the user input interface 160. In a networked environment, the program modules shown for the computer 110, or portions of them, can be stored in a remote storage device. By way of example, FIG. 1 shows a remote application program 185 as residing on the storage device 181.

Unless otherwise indicated, the following description explains the invention in terms of operations that the processing unit 120 can perform, based on the computer-executable instructions of an application program (the kana-kanji conversion program) loaded into the system memory 131. In these operations, the processing unit 120 refers to or updates the program data 137 according to the computer-executable instructions. The kana-kanji conversion program according to the invention can be provided to the user recorded on the computer-readable recording medium described above, or distributed to the user through a communication medium.

FIG. 2 shows the contents of the program data 137 according to this embodiment in more detail, schematically illustrating only the parts relevant to the invention (details are given later).

The program data 137 includes a corpus 202, a dictionary 204, and a user dictionary 206. The corpus 202 is large-scale text data used for natural language processing and the like, in which character strings are segmented into morphemes and a part of speech is assigned to each morpheme (that is, the text is part-of-speech tagged). Data annotated with syntactic information such as dependency relations can also be used as the corpus 202. The dictionary 204 is data that defines an identifier (ID) for each word and each part of speech. Here, a word comprises both its written form and its reading. The user dictionary 206 is one kind of registered dictionary, created by an individual user who registers words and fixed phrases for convenience. Besides the user dictionary, registered dictionaries may also be ones supplied by vendors, such as specialized or field-specific dictionaries.

The n-gram model is briefly described here.
The probability that the n-th word $W_n$ follows a sequence of n-1 words $W_1 W_2 \cdots W_{n-1}$ is given by the following conditional probability:
$$P = P(W_n \mid W_1 W_2 \cdots W_{n-1}) = \frac{P(W_1 W_2 \cdots W_n)}{P(W_1 W_2 \cdots W_{n-1})}$$
This is called an n-gram model. The description below uses a trigram model (n = 3 in the formula above) as an example, but the invention is of course also applicable to other multigram models such as a bigram model (hereinafter collectively referred to as statistical language models).
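As a rough illustration of the trigram case, the ratio above can be estimated from corpus counts. The sketch below is not taken from the specification: the function names, the padding with two sentence-head marks, and the unsmoothed maximum-likelihood estimate are all assumptions made for the example.

```python
from collections import Counter

BOS = "<s>"   # sentence-head mark
EOS = "</s>"  # sentence-end mark


def train_trigram(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences.

    Each sentence is padded with two head marks so that the first word
    also has a full trigram context (an assumption of this sketch).
    """
    tri = Counter()
    ctx = Counter()
    for tokens in sentences:
        padded = [BOS, BOS] + list(tokens) + [EOS]
        for i in range(len(padded) - 2):
            w1, w2, w3 = padded[i:i + 3]
            tri[(w1, w2, w3)] += 1
            ctx[(w1, w2)] += 1
    return tri, ctx


def trigram_prob(tri, ctx, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1, w2); 0.0 for an unseen context."""
    seen = ctx[(w1, w2)]
    return tri[(w1, w2, w3)] / seen if seen else 0.0
```

A production model would add smoothing or back-off for unseen trigrams; that is omitted here to keep the correspondence with the formula visible.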

FIG. 3 outlines the kana-kanji conversion method according to this embodiment.
The computer system shown in FIG. 1 is assumed in the following description.

In step S3002 of FIG. 3, the computer system uses the dictionary 202 and the user dictionary to create, from the input hiragana 704, a set (lattice) of ID combinations (paths) in which word IDs and part-of-speech IDs are mixed. In step S3004, the probability of occurrence of each path is retrieved from the statistical language model 304, and a path-probability correspondence table 708 that associates each path with its probability is generated. In step S3006, the path with the highest probability in the path-probability correspondence table 708 is selected as the conversion candidate 710. Then, in step S3008, the selected path is converted into a kana-kanji character string 712 using the dictionary 202 and the user dictionary 206 (for details of the kana-kanji conversion method, see, for example, Patent Document 1).
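The four steps can be sketched in miniature as follows. This is a simplified illustration under several assumptions: paths hold surface forms directly instead of word and part-of-speech IDs, the dictionary is a plain mapping from reading to candidate words, and the language model is a table of trigram probabilities. The names `build_lattice`, `path_probability`, and `convert` are hypothetical.

```python
BOS = "<s>"  # sentence-head mark


def build_lattice(reading, dictionary):
    """S3002 (simplified): every way to split `reading` into dictionary words."""
    if not reading:
        return [[]]
    paths = []
    for end in range(1, len(reading) + 1):
        for word in dictionary.get(reading[:end], []):
            for rest in build_lattice(reading[end:], dictionary):
                paths.append([word] + rest)
    return paths


def path_probability(path, trigram):
    """S3004 (simplified): product of trigram probabilities along one path."""
    prob = 1.0
    w1, w2 = BOS, BOS
    for w3 in path:
        prob *= trigram.get((w1, w2, w3), 0.0)
        w1, w2 = w2, w3
    return prob


def convert(reading, dictionary, trigram):
    """S3006 and S3008 (simplified): pick the most probable path, return its text."""
    table = {tuple(p): path_probability(p, trigram)
             for p in build_lattice(reading, dictionary)}
    if not table:
        return None
    best = max(table, key=table.get)
    return "".join(best)
```

The dictionary `table` plays the role of the path-probability correspondence table; a real converter would search the lattice with dynamic programming instead of enumerating every path.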

Kana-kanji conversion using a statistical language model is performed on the computer system in this way. As noted earlier, however, with fragmented input the conversion result differs depending on whether or not a specific symbol precedes the input. This point is explained below.

For example, suppose that a space (a blank character, written here as □) is entered at the start of a line (the beginning of a sentence), followed by “きょう” (kyou). The word sequence in this case is
<s>□きょう
Here <s> denotes the sentence-head mark that indicates the beginning of a sentence, and </s> denotes the sentence-end mark. The sentence-head mark <s> is added automatically after a line break.
On the other hand, suppose that “きょう” is entered directly at the beginning of the sentence. The sequence is then
<s>きょう</s>

In these two cases the trigram probabilities for “きょう” differ: P(きょう | <s>, □) in the first and P(きょう | <s>, <s>) in the second (two <s> marks are lined up here for the trigram computation). The kana-kanji conversion results may therefore differ as well. Likewise, when an opening corner bracket “「” comes at the beginning of a sentence, the result can differ from the case without it. For example, “「きょう” may be converted to “「強” rather than “「今日”.

When the user wants “きょう” converted to “今日” (today), the same conversion is desirable when a specific symbol such as a space or a corner bracket appears at the beginning of (or within) the sentence, giving “□今日” or “「今日”.
To solve this problem, in this embodiment, when a specific symbol such as a space or a corner bracket appears at the beginning of or within a sentence, the symbol is regarded, under predetermined conditions, as the sentence-head mark <s>. The trigram probability P(W_3) for the word W_3 is thereby corrected, and the corrected value is used to obtain the path probabilities in the kana-kanji conversion described above (details are given later).

The specific symbols are now described. In this embodiment they are handled in two groups: 1. the open brackets group and 2. the bullets group (a space can be treated in the same way as the open brackets group).

For a group 1 symbol, the symbol is regarded as a sentence-head mark and the corrected probability is obtained regardless of whether the symbol has been confirmed. For a group 2 symbol, the confirmation state is checked, and the symbol is regarded as a sentence-head mark, and the corrected probability obtained, only when it has been confirmed.

For example, when “「AB” is typed, or “「” is confirmed and then “AB” is typed (a group 1 symbol), the input is regarded as “「<s><s>AB” and the corrected probability is obtained. In contrast, when “●AB” is typed (a group 2 symbol), the ordinary probability is obtained; when “●” is confirmed and then “AB” is typed, the input is regarded as “●<s><s>AB” and the corrected probability is obtained. The quotation marks “ ” themselves belong to the open brackets group, but in this specification they are also used simply to set off symbols and character strings (as in “●”); in that use they do not indicate that the quotation mark is typed.
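The group rule can be expressed as a small predicate. The sketch below is illustrative only: the membership of each set is an assumption (the specification names the groups but does not enumerate them here), the space is placed with the open brackets following the option mentioned above, and `treat_as_sentence_head` is a hypothetical name.

```python
# Assumed group membership, for illustration only.
OPEN_BRACKETS = {"「", "『", "（", "(", "“", " ", "\u3000"}  # spaces handled like brackets
BULLETS = {"●", "・", "■"}


def treat_as_sentence_head(symbol, confirmed):
    """Return True if `symbol` should be regarded as the sentence-head mark.

    Group 1 (open brackets): always, whether confirmed or not.
    Group 2 (bullets): only once the symbol has been confirmed.
    """
    if symbol in OPEN_BRACKETS:
        return True
    if symbol in BULLETS:
        return confirmed
    return False
```

With this predicate, “「AB” and “●” confirmed followed by “AB” both take the corrected probability, while unconfirmed “●AB” takes the ordinary one.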

The discussion so far concerned the case where a specific symbol is present. Even without one, however, the conversion expected for a character string can differ depending on whether or not a previously confirmed string precedes it.
For example, suppose that for “きょうはいしゃにいく” the path with the highest probability corresponds to “今日は医者に行く” (today I go to the doctor). If the user first confirms “今日” and then enters “はいしゃにいく”, what the user expects is in most cases “歯医者に行く” (go to the dentist), not “は医者に行く”. The statistical language model, however, gives “は医者に行く” the highest probability. By contrast, for “はいしゃにいく” directly following a sentence-head mark, “歯医者に行く” has the highest probability. Thus, once the user has confirmed a string, it is sometimes better to also take into account the probability that would apply if the new input followed a sentence-head mark. To solve this problem, linear interpolation is performed in such cases, as follows.
(1) When a word/symbol $W_3$ is inserted after confirmed words/symbols $\mathit{PrevW}_1\,\mathit{PrevW}_2$:
$$P(W_3) = P(W_3 \mid \langle s\rangle, \langle s\rangle)\,(1-\alpha) + P(W_3 \mid \mathit{PrevW}_1, \mathit{PrevW}_2)\,\alpha$$
(2) When unconfirmed words/symbols $W_2$, $W_3$ are inserted after a confirmed word/symbol $\mathit{PrevW}_1$:
$$P(W_3) = P(W_3 \mid \langle s\rangle, W_2)\,P(W_2 \mid \langle s\rangle, \langle s\rangle)\,(1-\alpha) + P(W_3 \mid \mathit{PrevW}_1, W_2)\,\alpha$$
Here $0 \le \alpha \le 1$, and $\alpha$ is a fixed value set as appropriate so that kana-kanji conversion works better. For the example above, $\alpha = 0$ is preferable.
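The two interpolation formulas translate directly into code. In the sketch below, `p(w3, w1, w2)` stands for the model probability P(w3 | w1, w2) and is supplied by the caller; the function names and the toy probability values in the usage note are assumptions for illustration, not values from the specification.

```python
BOS = "<s>"  # sentence-head mark


def _check_alpha(alpha):
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha <= 1")


def interpolate_case1(p, prev_w1, prev_w2, w3, alpha):
    """Case (1): W3 follows two confirmed words PrevW1 PrevW2."""
    _check_alpha(alpha)
    head = p(w3, BOS, BOS)             # as if W3 started a sentence
    history = p(w3, prev_w1, prev_w2)  # using the confirmed history
    return head * (1 - alpha) + history * alpha


def interpolate_case2(p, prev_w1, w2, w3, alpha):
    """Case (2): unconfirmed W2 W3 follow one confirmed word PrevW1."""
    _check_alpha(alpha)
    head = p(w3, BOS, w2) * p(w2, BOS, BOS)
    history = p(w3, prev_w1, w2)
    return head * (1 - alpha) + history * alpha
```

For instance, with toy values P(歯医者 | <s>, <s>) = 0.2 and P(歯医者 | <s>, 今日) = 0.01, case (1) with α = 0.5 gives 0.105, while α = 0 reduces to the sentence-head probability 0.2 and α = 1 to the uncorrected 0.01.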

Next, an example of the process for obtaining the corrected probability is described with reference to the conceptual program source shown in FIG. 4.
A trigram gives the probability that the third word W_3 follows in a three-word sequence W_1 W_2 W_3; here the three words are denoted Left, Middle, and Right.

The process shown in FIG. 4 first determines whether a previously confirmed character string (that is, an already confirmed string) exists at Middle and Middle is a bullet, a space, or an open bracket (condition 1). If condition 1 is satisfied, the sentence-head mark is set in both Middle and Left.

Next, if condition 1 is not satisfied, that is, if Middle contains no previously confirmed character string or Middle is none of a bullet, a space, and an open bracket, it is determined whether Left contains a previously confirmed character string and Left is a bullet, a space, or an open bracket (condition 2). If condition 2 is satisfied at this stage, the sentence-head mark <s> is set in Left.

Next, if condition 2 is not satisfied either, that is, if condition 1 fails and, in addition, Left contains no previously confirmed character string or Left is none of a bullet, a space, and an open bracket, it is determined whether Middle is an open bracket (condition 3). If condition 3 is satisfied at this stage, the sentence-head mark <s> is set in both Middle and Left.

Next, if condition 3 is not satisfied either, that is, if conditions 1 and 2 fail and, in addition, Middle is not an open bracket, it is determined whether Left is an open bracket (condition 4). If Left is an open bracket at this stage, the sentence-head mark <s> is set in Left only.

If none of conditions 1, 2, 3, and 4 is satisfied, Middle and Left keep their original settings (the contents of Middle and Left in the original word sequence).
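Conditions 1 to 4 can be read as a cascade that rewrites the trigram context before lookup. The sketch below is one possible reading, not the source of FIG. 4: the `Token` type, the `confirmed` flag, the function name, and the concrete characters chosen for the bullet and the open bracket are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

BOS = "<s>"  # sentence-head mark
# Hypothetical concrete symbols; the text only names the symbol classes.
BULLET, SPACE, OPEN_BRACKET = "\u2022", " ", "\u300c"
SPECIAL = {BULLET, SPACE, OPEN_BRACKET}


@dataclass(frozen=True)
class Token:
    text: str
    confirmed: bool = False  # True if part of a previously confirmed string


def normalize_context(left: Token, middle: Token) -> Tuple[str, str]:
    """Return the (Left, Middle) context to use for the trigram lookup."""
    if middle.confirmed and middle.text in SPECIAL:   # condition 1
        return BOS, BOS
    if left.confirmed and left.text in SPECIAL:       # condition 2
        return BOS, middle.text
    if middle.text == OPEN_BRACKET:                   # condition 3
        return BOS, BOS
    if left.text == OPEN_BRACKET:                     # condition 4
        return BOS, middle.text
    return left.text, middle.text                     # original context
```

Under this reading, a bullet or a space acts as a sentence head only once it has been confirmed, whereas an open bracket does so whether or not it has been confirmed.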

The trigram described above is then calculated using Middle and Left as set above. In FIG. 4 this is expressed by the function GetTrigram(Left, Middle, Right), and the result is set as the “normal probability” (if none of the conditions above is satisfied, this is literally the normal probability; if any condition is satisfied, the corrected probability is stored as this “normal probability”).

Next, if the word sequence consisting of Left and Middle contains none of a space, a bullet, and an open bracket (referred to as SpecialChar in FIG. 4) and contains a previously confirmed character string, the following processing is also performed.

When Middle is a previously confirmed character string (in which case Left is naturally one as well), P(W₃ | <s>, <s>) is obtained as the probability with SentenceStart (the sentence-head mark). When Middle is not a previously confirmed character string, only Left is, so in this case P(W₃ | <s>, W₂) * P(W₂ | <s>, <s>) is obtained as the probability with SentenceStart.
The normal probability is then updated as
normal probability = normal probability * α + probability with SentenceStart * (1 - α)
As stated above, the coefficient α is set within the range 0 ≦ α ≦ 1 so that preferable kana-kanji conversion is performed.

Using the “normal probability” calculated as described above, the final probability (= normal probability * probability up to the previous word) is calculated. This final probability is used as the corrected probability of occurrence of the corresponding path, and kana-kanji conversion is performed as described earlier.
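The final-probability step can be illustrated with a short Python sketch. It assumes, purely for illustration, that each candidate path is given as a list of per-word corrected ("normal") probabilities; the function names and the numbers are hypothetical. The product form mirrors the text; implementations commonly sum log-probabilities instead to avoid underflow.

```python
from typing import Dict, List


def path_probability(word_probs: List[float]) -> float:
    """Apply 'final = normal probability * probability up to the previous word' along a path."""
    prob_so_far = 1.0
    for normal_prob in word_probs:
        prob_so_far = normal_prob * prob_so_far
    return prob_so_far


def best_path(paths: Dict[str, List[float]]) -> str:
    """Return the name of the candidate path with the highest final probability."""
    return max(paths, key=lambda name: path_probability(paths[name]))


# Hypothetical candidates: per-word corrected probabilities for two readings.
candidates = {
    "doctor": [0.5, 0.05, 0.4],
    "dentist": [0.5, 0.2, 0.4],
}
```

Here `best_path(candidates)` selects the "dentist" path, because its product of per-word probabilities is larger.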

With the processing described above, the corrected probability value (trigram value) of a word W₃ when a specific symbol appears at the beginning of or within a sentence becomes equal to the probability of W₃ appearing at the beginning of a sentence. This solves the problem that the conversion result differs depending on whether a specific symbol such as a space or a corner bracket appears at the beginning of or within the sentence.
Even when there is no specific symbol, the conversion expected for the following character string may differ depending on whether a previously confirmed character string exists. Applying linear interpolation with the probability of connecting to the sentence-head mark, as described above, gives a more preferable conversion result.
Therefore, according to this embodiment, the problems described above can be addressed solely by the process of obtaining the corrected probability value, without training (adjusting) the statistical language model itself.

FIG. 1 is a diagram showing a computer system that constitutes an exemplary system for implementing the invention.
FIG. 2 is a block diagram showing the contents of program data in an embodiment of the invention.
FIG. 3 is a diagram showing an overview of the operation of the kana-kanji conversion method in the embodiment.
FIG. 4 is a conceptual program source showing an example of the process for obtaining the corrected probability value (trigram) according to the invention.

Explanation of symbols

100 Computer system
110 Computer
120 Processing unit
121 System bus
130 System memory
131 Read-only memory
132 Random access memory
133 Basic input/output system
134 Operating module
135 Application program
136 Other program modules
137 Program data
140 Non-removable non-volatile memory interface
141 Hard disk drive
144 Operating system
145 Application program
146 Other program modules
147 Program data
150 Removable non-volatile memory interface
151 Magnetic disk drive
152 Removable non-volatile magnetic disk
155 Optical disk drive
156 Removable non-volatile optical disk
160 User input interface
161 Pointing device
162 Keyboard
170 Adapter
171 Local area network (LAN)
172 Modem
173 Wide area network (WAN)
180 Remote computer
181 Storage device
184 Multi-level cache
185 Remote application program
190 Video interface
191 Monitor
195 Output peripheral interface
196 Printer
197 Speaker
202 Corpus
204 Dictionary
206 User dictionary
304 Statistical language model
704 Input hiragana
706 Lattice
708 Path-probability correspondence table
710 Conversion candidate
712 Kana-kanji character string

Claims

1. A computer-readable recording medium storing a database that includes a statistical language model and a kana-kanji conversion program that uses the statistical language model,
wherein the kana-kanji conversion program executes a first process that corrects a probability obtained from the statistical language model, namely the probability that a target word of an input sentence follows a specific symbol.

2. The computer-readable recording medium according to claim 1, wherein, for a predetermined symbol among the specific symbols, the first process is executed only when the predetermined symbol has been confirmed.

3. The computer-readable recording medium according to claim 1, wherein, when the input sentence does not include the specific symbol and a previously input character string has been confirmed, the kana-kanji conversion program executes a second process that corrects the probability that the word following the confirmed character string follows the confirmed character string.

4. The computer-readable recording medium as described above, wherein the probability obtained by the first process is the probability obtained from the statistical language model when the specific symbol is regarded as a sentence-head mark indicating the beginning of a sentence.

5. The computer-readable recording medium according to claim 3, wherein the probability obtained by the second process is obtained by linearly interpolating (a) the probability, obtained from the statistical language model, that the word following the confirmed character string follows the confirmed character string, and (b) the probability, obtained from the statistical language model with the words included in the confirmed character string regarded as sentence-head marks, that the word following the confirmed character string follows the confirmed character string.

6. The computer-readable recording medium according to any one of claims 1 to 5, wherein the statistical language model is an n-gram model.

7. A kana-kanji conversion method using a statistical language model and executed on a computer system,
the method comprising a step of executing a first process that corrects a probability obtained from a statistical language model provided in a database, namely the probability that a target word of an input sentence follows a specific symbol.

8. The method according to claim 7, wherein, for a predetermined symbol among the specific symbols, the first process is executed only when the predetermined symbol has been confirmed.

9. The method according to claim 7, further comprising a step of executing, when the input sentence does not include the specific symbol and a previously input character string has been confirmed, a second process that corrects the probability that the word following the confirmed character string follows the confirmed character string.

10. The method according to any one of the above, wherein the probability obtained by the first process is the probability obtained from the statistical language model when the specific symbol is regarded as a sentence-head mark indicating the beginning of a sentence.

11. The method according to claim 9, wherein the probability obtained by the second process is obtained by linearly interpolating (a) the probability, obtained from the statistical language model, that the word following the confirmed character string follows the confirmed character string, and (b) the probability, obtained from the statistical language model with the words included in the confirmed character string regarded as sentence-head marks, that the word following the confirmed character string follows the confirmed character string.

12. The method according to any one of claims 7 to 11, wherein the statistical language model is an n-gram model.