JPH03168863A

JPH03168863A - Method for constituting connection probability dictionary

Info

Publication number: JPH03168863A
Application number: JP1310244A
Authority: JP
Inventors: Koji Matsuoka; 浩司松岡; Masahiro Oku; 雅博奥
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1989-11-28
Filing date: 1989-11-28
Publication date: 1991-07-22

Abstract

PURPOSE: To eliminate useless retrievals by detecting, from the number of three-character connection probability retrievals, the case where a three-character connection probability is not registered. CONSTITUTION: To find a connection probability, the two-character connection probability registering part 11 is first retrieved using the two-character string $S_k^2$ as the key; when the number of three-character connection probability retrievals in the corresponding record is '0', retrieval of the three-character connection probability is inhibited. The absence of a record corresponding to the three-character string key $S_r^3$ ($= S_k^2 C_{km}$) in the three-character connection probability registering part 12 can thus be detected at the time the registering part 11 is retrieved. Consequently, useless retrievals are eliminated.

Description

[Detailed description of the invention]

[Industrial application field]

The present invention relates to a method of constructing a dictionary that stores concatenation probabilities between characters, for use in correcting erroneous and missing characters in Japanese text input to a computer by replacing them with characters that have a high concatenation probability with their neighbours.

[Prior art]

When Japanese text is entered into a computer with a word processor or a character reader, erroneous or missing characters may be introduced. One method of automatically detecting and correcting such input errors exploits the uneven frequency of occurrence of combinations of consecutive characters in text: characters that have a high concatenation probability with the correct characters surrounding the erroneous character are taken as correction candidates.

The concatenation probability is the probability that a specific character appears next, given the preceding character string. The probability that character $C_n$ appears after the $(n-1)$-character string $C_1 C_2 \cdots C_{n-1}$ is called the n-character concatenation probability for the n-character string $C_1 C_2 \cdots C_n$ and is written $P(C_n \mid C_1 C_2 \cdots C_{n-1})$. It is derived by equation [1] from the frequencies of occurrence of n-character strings and (n-1)-character strings in a large amount of text free of input errors (hereinafter, the original text data). The derived concatenation probabilities are registered in the concatenation probability dictionary.

$$P(C_n \mid C_1 C_2 \cdots C_{n-1}) = \frac{A(C_1 C_2 \cdots C_n)}{A(C_1 C_2 \cdots C_{n-1})} \qquad [1]$$

Here $A(S)$ denotes the frequency of occurrence of the character string $S$.

To correct an erroneous character, candidate characters chosen according to the error characteristics of the human operator or the character reader are inserted at the position of the error. There are usually several candidate characters, and the appropriate one must be selected from among them.
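As a rough illustration of equation [1], the following Python sketch estimates a concatenation probability from raw substring counts. The function names `ngram_counts` and `concatenation_probability` and the toy text are assumptions made for this example and do not appear in the specification; the sketch assumes strings of length n of at least 2.

```python
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count every n-character substring of text, i.e. the frequency A(S)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def concatenation_probability(text: str, ngram: str) -> float:
    """Estimate P(C_n | C_1..C_{n-1}) = A(C_1..C_n) / A(C_1..C_{n-1})."""
    n = len(ngram)
    numerator = ngram_counts(text, n)[ngram]
    denominator = ngram_counts(text, n - 1)[ngram[:-1]]
    # An unseen prefix gives no estimate; return 0.0 in that case.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    toy_text = "abcabdabc"  # stand-in for the original text data
    print(concatenation_probability(toy_text, "abc"))
```

In the toy text the prefix "ab" occurs three times and "abc" twice, so the estimate for "c" following "ab" is 2/3.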

The correct-character probability $F$ of equation [2] is defined as the evaluation function for candidate characters. A candidate with a high correct-character probability has high concatenation probabilities with its neighbouring characters and is therefore likely to be the correct character.

$$F = Q_1 \times Q_2 \times Q_3 \qquad [2]$$

where

$$Q_1 = P(C_i \mid C_{i-2} C_{i-1}), \quad Q_2 = P(C_{i+1} \mid C_{i-1} C_i), \quad Q_3 = P(C_{i+2} \mid C_i C_{i+1}) \qquad [3]$$

$C_i$ is the erroneous character, $C_{i-2} C_{i-1}$ are the two characters immediately before it, and $C_{i+1} C_{i+2}$ are the two characters immediately after it. In other words, the correct-character probability is the product of the three 3-character concatenation probabilities that involve the character $C_i$ within the five characters $C_{i-2} C_{i-1} C_i C_{i+1} C_{i+2}$.

In general, the larger the concatenation length n in equation [1], the more accurately the original text data is approximated, and the better the ability to correct erroneous characters. On the other hand, as n grows, the frequency of occurrence of each individual character string $C_1 C_2 \cdots C_n$ falls. Even a character string that is linguistically possible often does not occur in the limited original text data and is therefore not registered in the concatenation probability dictionary. When, because the original text data is not large enough, any of the 3-character concatenation probabilities of equation [3] is not registered in the dictionary, it is replaced by the corresponding 2-character concatenation probability as follows.

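$$Q_1 = P(C_{i-1} \mid C_{i-2}), \quad Q_2 = P(C_i \mid C_{i-1}), \quad Q_3 = P(C_{i+1} \mid C_i) \qquad [4]$$

Furthermore, when any of the 2-character concatenation probabilities of equation [4] is not registered, it is replaced by the default value $P_d$ (a value sufficiently smaller than the concatenation probabilities registered in the concatenation probability dictionary).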

$$Q_1 = P_d, \quad Q_2 = P_d, \quad Q_3 = P_d \qquad [5]$$

The concatenation probability dictionary registers the concatenation probabilities of equations [3] and [4].
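The fallback chain from equation [3] to [4] to [5] and the product of equation [2] can be sketched as follows. This is a minimal illustration, not the specification's implementation: plain Python dictionaries stand in for the concatenation probability dictionary, and the names `q_value`, `correct_character_probability`, and the value chosen for `P_DEFAULT` are assumptions.

```python
P_DEFAULT = 1e-6  # hypothetical value for the default P_d


def q_value(trigram, tri_probs, bi_probs, p_default=P_DEFAULT):
    """Return one factor Q of equation [2] for a 3-character string."""
    if trigram in tri_probs:
        return tri_probs[trigram]      # equation [3]
    bigram = trigram[:2]               # first two characters of the key
    if bigram in bi_probs:
        return bi_probs[bigram]        # equation [4]
    return p_default                   # equation [5]


def correct_character_probability(window, tri_probs, bi_probs,
                                  p_default=P_DEFAULT):
    """Compute F for the 5 characters C_{i-2} .. C_{i+2} (candidate at index 2)."""
    if len(window) != 5:
        raise ValueError("window must contain exactly five characters")
    f = 1.0
    for start in range(3):  # the three 3-character strings containing C_i
        f *= q_value(window[start:start + 3], tri_probs, bi_probs, p_default)
    return f
```

Evaluating `correct_character_probability` once per candidate character and keeping the candidate with the largest F corresponds to the selection step described above.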

FIG. 4 shows the structure of a conventional concatenation probability dictionary. In FIG. 4, 1 is a 2-character concatenation probability registration section and 2 is a 3-character concatenation probability registration section. 3 is a key part, in which the 2-character strings $S_k^2$ ($k = 1, 2, \ldots, \alpha$) are registered. 4 is a data part, in which the 2-character concatenation probability of each 2-character string $S_k^2$ is registered. The 2-character concatenation probability registration section 1 consists of records, each a pair of a 2-character string $S_k^2$ and its 2-character concatenation probability, and is searched for the 2-character concatenation probability of a 2-character string $S_k^2$. 5 is a key part, in which the 3-character strings $S_r^3$ ($r = 1, 2, \ldots, \beta$) are registered. 6 is a data part, in which the 3-character concatenation probability of each 3-character string $S_r^3$ is registered. The 3-character concatenation probability registration section 2 consists of records, each a pair of a 3-character string $S_r^3$ and its 3-character concatenation probability, and is searched for the 3-character concatenation probability of a 3-character string $S_r^3$.

FIG. 5 shows the procedure for obtaining a concatenation probability.

Here, $Q_1$, $Q_2$, and $Q_3$ are obtained by setting the 3-character string key $S_r^3$ in turn to $C_{i-2} C_{i-1} C_i$, $C_{i-1} C_i C_{i+1}$, and $C_i C_{i+1} C_{i+2}$. The 3-character string key $S_r^3$ is written as the 2-character string $S_k^2$ followed by the character $C_{km}$ ($m = 1, 2, \ldots, \delta_k$, where $\delta_k$ is the number of 3-character string keys whose first two characters are $S_k^2$). The basic idea of the procedure is as follows.

(1) The 3-character concatenation probability registration section 2 is searched using the 3-character string $S_r^3$ as the key (step 501). If there is a record $L_3$ corresponding to the key $S_r^3$, the concatenation probability of the 3-character string $S_r^3$ is read from that record (step 506).

(2) If there is no record $L_3$ corresponding to the key $S_r^3$ (step 502), the 2-character concatenation probability registration section 1 is searched using the 2-character string $S_k^2$ as the key (step 503). If there is a record $L_2$ corresponding to the key $S_k^2$, the concatenation probability of the 2-character string $S_k^2$ is read from that record (step 505).

(3) If there is no record corresponding to the 2-character string $S_k^2$ either (step 504), the default value $P_d$ is set as the concatenation probability (step 507).
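The conventional procedure of FIG. 5 can be sketched as below, with a counter that makes the number of table searches visible. The sketch assumes that each registration section is a sorted list of keys with a parallel list of probabilities searched by binary search; that layout, the helper `binary_search`, and the function name `conventional_lookup` are assumptions for illustration only.

```python
from bisect import bisect_left


def binary_search(keys, key):
    """Return the index of key in the sorted list keys, or -1 if absent."""
    i = bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else -1


def conventional_lookup(trigram, tri_keys, tri_probs, bi_keys, bi_probs,
                        p_default=1e-6):
    """Return (probability, number of table searches) following FIG. 5."""
    searches = 1
    r = binary_search(tri_keys, trigram)      # step 501
    if r >= 0:
        return tri_probs[r], searches         # step 506
    searches += 1                             # step 502: no 3-character record
    k = binary_search(bi_keys, trigram[:2])   # step 503
    if k >= 0:
        return bi_probs[k], searches          # step 505
    return p_default, searches                # steps 504 and 507
```

Whenever the 3-character key is missing, the function reports two searches, which is the overhead the invention sets out to remove.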

[Problems to be solved by the invention]

In the conventional concatenation probability dictionary, however, the 3-character concatenation probability registration section 2 is searched first, using the 3-character string key. In many cases the 3-character concatenation probability is not registered, and a substitute 2-character concatenation probability must then be obtained by searching the 2-character concatenation probability registration section 1 with the 2-character string key. As a result, the number of dictionary searches increases and the time needed to correct erroneous characters becomes long.

The present invention has been made to solve this problem.

An object of the present invention is to provide a method of constructing a concatenation probability dictionary that reduces the number of dictionary searches and can be searched at high speed.

The above and other objects and novel features of the present invention will become apparent from the description in this specification and the accompanying drawings.

[Means to solve the problem]

To achieve the above object, the present invention is a method of constructing a concatenation probability dictionary that has a 2-character concatenation probability registration section and a 3-character concatenation probability registration section. The 2-character concatenation probability registration section consists of records that take a 2-character string as the key and register the concatenation probability of the 2-character string, a 3-character concatenation probability pointer, and a 3-character concatenation probability search count. The 3-character concatenation probability registration section consists of records that take the last character of a 3-character string as the key and register the concatenation probability of the 3-character string, and its records are delimited by the 3-character concatenation probability pointer and the 3-character concatenation probability search count. The principal feature of the method is that the 2-character concatenation probability registration section is searched using the 2-character string as the key; if the 3-character concatenation probability search count of the corresponding record is "0", no search for the 3-character concatenation probability is performed, and if the search count is not "0", the delimited set of records in the 3-character concatenation probability registration section is searched.

[Operation]

With the means described above, a concatenation probability is obtained by first searching the 2-character concatenation probability registration section with the 2-character string key. If the 3-character concatenation probability search count of the corresponding record is "0", no search for the 3-character concatenation probability is performed. The fact that no record corresponding to the 3-character string key exists in the 3-character concatenation probability registration section is thus detected at the time the 2-character concatenation probability registration section is searched, and useless searches are eliminated.

If the 3-character concatenation probability search count is not "0", records with the same first two characters as the 3-character string key exist. In this case only the set of records in the 3-character concatenation probability registration section delimited by the 3-character concatenation probability pointer and the 3-character concatenation probability search count is searched. Because the range of the 3-character concatenation probability registration section to be searched is limited, the search time is reduced.

In other words, the 2-character concatenation probability registration section both registers the 2-character concatenation probabilities and serves as an index that limits the search range in the 3-character concatenation probability registration section.

Together, these make it possible to speed up dictionary searches.

[Embodiments of the invention]

One embodiment of the present invention is described concretely below with reference to the drawings.

FIG. 1 shows the structure of a concatenation probability dictionary, for explaining one embodiment of the concatenation probability dictionary construction method of the present invention.

In FIG. 1, 11 is a 2-character concatenation probability registration section and 12 is a 3-character concatenation probability registration section. 13 is a key part, in which the 2-character strings $S_k^2$ ($k = 1, 2, \ldots, \alpha$) are registered. 14 is a data part, in which the 2-character concatenation probability of each 2-character string $S_k^2$ is registered. 15 is a 3-character concatenation probability pointer part, in which the 3-character concatenation probability pointer is registered. 16 is a 3-character concatenation probability search count part, in which the 3-character concatenation probability search count is registered. The 2-character concatenation probability registration section 11 consists of records that each register a 2-character string $S_k^2$, its 2-character concatenation probability, a 3-character concatenation probability pointer, and a 3-character concatenation probability search count.

17 is a key part, in which the 3-character string keys $S_r^3$ ($= S_k^2 C_{km}$) are registered. In general, if a particular 3-character string occurs in the original text data, the 2-character string formed by its first two characters necessarily occurs in the original text data as well.

That is, if the 3-character string key $S_r^3$ ($= S_k^2 C_{km}$) is registered in the concatenation probability dictionary, the 2-character string key $S_k^2$ formed by its first two characters is always registered too. Therefore only the last character $C_{km}$ of the key $S_r^3$ ($= S_k^2 C_{km}$) is registered in the key part 17, which reduces the amount of memory the key part 17 requires. 18 is a data part, in which the concatenation probability of the 3-character string key $S_r^3$ is registered.
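One way to picture the structure of FIG. 1 is the following Python sketch, which builds both registration sections from two plain mappings. The class `TwoCharRecord`, the function `build_dictionary`, the list-based layout, and the choice of pointer value for records whose search count is 0 are all assumptions made for illustration; the specification does not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class TwoCharRecord:
    key: str            # 2-character string S_k^2 (key part 13)
    probability: float  # 2-character concatenation probability (data part 14)
    pointer: int        # index of the first 3-character record (part 15)
    count: int          # 3-character concatenation probability search count (part 16)


def build_dictionary(bi_probs, tri_probs):
    """Return (section 11, section 12) built from {string: probability} maps."""
    groups = {}
    for trigram, prob in tri_probs.items():
        prefix, last = trigram[:2], trigram[2]
        if prefix not in bi_probs:
            raise ValueError(f"prefix {prefix!r} of {trigram!r} is not registered")
        groups.setdefault(prefix, []).append((last, prob))

    two_char, three_char = [], []
    for key in sorted(bi_probs):
        group = sorted(groups.get(key, []))  # keyed by the last character only
        # For count == 0 the pointer is never followed; its value is arbitrary.
        two_char.append(TwoCharRecord(key, bi_probs[key], len(three_char), len(group)))
        three_char.extend(group)
    return two_char, three_char
```

Each entry of the second list stores only the last character and the probability, which mirrors the memory saving described for key part 17.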

FIG. 2 shows the procedure for searching the concatenation probability dictionary of this embodiment.

First, the 2-character concatenation probability registration section 11 is searched using as the key the first two characters $S_k^2$ of the 3-character string key $S_r^3$ ($= S_k^2 C_{km}$) (step 201). Next, it is determined whether the key $S_k^2$ exists in the 2-character concatenation probability registration section 11 (step 202). If it does not exist, that is, if there is no record $L_2$ corresponding to the key $S_k^2$ (NO), the default value $P_d$ is set as the concatenation probability (step 203). If it exists, that is, if there is a record $L_2$ corresponding to the key $S_k^2$ (YES), the record $L_2$ is read (step 204), and it is determined whether the 3-character concatenation probability search count of the record $L_2$ is "0" (step 205). If the search count is "0" (YES), the concatenation probability of the 2-character string $S_k^2$ is read from the record $L_2$ (step 206). If the search count is not "0" (NO), the records to be searched are those starting at the record indicated by the 3-character concatenation probability pointer and numbering as many as the search count, and the 3-character concatenation probability registration section 12 is searched within that range using the last character $C_{km}$ of the key $S_r^3$ as the key (step 207).

Next, it is determined whether the character $C_{km}$ exists in the 3-character concatenation probability registration section 12 (step 208). If the key $C_{km}$ exists, that is, if there is a record $L_3$ corresponding to the key $C_{km}$ (YES), the record $L_3$ is read and the concatenation probability of the 3-character string $S_r^3$ is read from it (step 209). If the key $C_{km}$ does not exist (NO), there is no record $L_3$ corresponding to it, so the concatenation probability of the 2-character string $S_k^2$ is read from the record $L_2$ instead (step 210).
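Steps 201 to 210 can be sketched as a single lookup function. The sketch assumes that section 11 is a sorted list of 2-character keys with a parallel list of (probability, pointer, count) tuples, and that section 12 is a pair of parallel lists whose entries are sorted by last character within each pointer/count range. The function name `lookup`, this layout, and the use of binary search are illustrative assumptions.

```python
from bisect import bisect_left


def lookup(trigram, two_keys, two_records, three_keys, three_probs,
           p_default=1e-6):
    """Return the concatenation probability for a 3-character string key."""
    prefix, last = trigram[:2], trigram[2]
    k = bisect_left(two_keys, prefix)                 # step 201
    if k == len(two_keys) or two_keys[k] != prefix:   # step 202: no record L2
        return p_default                              # step 203
    prob2, pointer, count = two_records[k]            # step 204
    if count == 0:                                    # step 205
        return prob2                                  # step 206: skip section 12
    lo, hi = pointer, pointer + count                 # range limited by L2
    j = bisect_left(three_keys, last, lo, hi)         # step 207
    if j < hi and three_keys[j] == last:              # step 208: record L3 found
        return three_probs[j]                         # step 209
    return prob2                                      # step 210
```

Only the slice `three_keys[lo:hi]` is ever examined, so a 2-character record with a search count of 0 costs a single table search and any other record restricts the second search to its own group.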

As the above description shows, the basic idea of the procedure for searching the concatenation probability dictionary of the present invention is as follows.

(1) The 2-character concatenation probability registration section 11 is searched using the 2-character string $S_k^2$ as the key.

(2) If there is no record $L_2$ corresponding to the key $S_k^2$, the default value $P_d$ is set as the concatenation probability.

(3) If there is a record $L_2$ corresponding to the key $S_k^2$, then only when the 3-character concatenation probability search count of the record $L_2$ is not "0" is the 3-character concatenation probability registration section 12 searched, using the character $C_{km}$ as the key, over the set of records delimited by the 3-character concatenation probability pointer and the 3-character concatenation probability search count. If the search count is "0", the concatenation probability of the 3-character string $S_r^3$ is not registered in the 3-character concatenation probability registration section 12, so the concatenation probability of the 2-character string $S_k^2$ is read from the record $L_2$ instead.

(4) If there is a record $L_3$ corresponding to the character $C_{km}$, the concatenation probability of the 3-character string $S_r^3$ is read from the record $L_3$. If there is no record corresponding to the key $C_{km}$, the concatenation probability of the 2-character string $S_k^2$ is read from the record $L_2$.

FIG. 3 shows a concrete example of one embodiment of the concatenation probability dictionary to which the present invention applies. In the 2-character concatenation probability registration section 11, the records keyed by the 2-character strings 「交流」 ("exchange"), 「国際」 ("international"), and 「個性」 ("individuality") hold the 2-character concatenation probabilities o.s, o.g, and 0.1, respectively.

In the 3-character concatenation probability registration section 12, the records keyed by the characters 「語」, 「色」, 「的」, and 「法」 hold the concatenation probabilities 0.2, 0.4, 0.3, and 0.1 of the 3-character strings 「国際語」, 「国際色」, 「国際的」, and 「国際法」, respectively. These records are delimited by the 3-character concatenation probability pointer and the 3-character concatenation probability search count (= 4) in the record keyed by the 2-character string 「国際」 in the 2-character concatenation probability registration section 11.

The records keyed by the characters 「化」 and 「派」 hold the concatenation probabilities 0.6 and 0.4 of the 3-character strings 「個性化」 and 「個性派」, respectively. These records are delimited by the 3-character concatenation probability pointer and the 3-character concatenation probability search count (= 2) in the record keyed by the 2-character string 「個性」 in the 2-character concatenation probability registration section 11. The 3-character concatenation probability search count of the record keyed by the 2-character string 「交流」 is "0". It follows that the 3-character concatenation probability registration section 12 contains no record for any 3-character string whose first two characters are 「交流」.

Next, the search times of the conventional concatenation probability dictionary construction method and of the construction method of this embodiment are evaluated. To simplify the evaluation, the following assumptions are made.

(1) Characters of every kind occur with equal frequency.

(2) Every record in the concatenation probability dictionary is equally likely to be searched for.

Let $A$ be the number of distinct characters, $\alpha$ the number of 2-character string keys, and $\beta$ the number of 3-character string keys. Then:

(search time of the conventional construction method) ∝ (search time of the 3-character concatenation probability registration section) + (probability that the key is not registered in the 3-character concatenation probability registration section) × (search time of the 2-character concatenation probability registration section)

$$= \log \beta + (1 - \beta / A^3) \times \log \alpha \qquad [6]$$

(search time of the construction method of this embodiment) ∝ (search time of the 2-character concatenation probability registration section) + (probability that the key is registered in the 3-character concatenation probability registration section) × (search time of the 3-character concatenation probability registration section)

$$= \log \alpha + (\beta / A^3) \times \log(\beta / \alpha) \qquad [7]$$

In the concrete example of FIG. 3, $\alpha = 3$ and $\beta = 6$. Let the number of distinct Japanese characters be $A = 7000$. Substituting these values into equations [6] and [7] gives:

(search time of the conventional construction method) ∝ 2.6

(search time of the construction method of this embodiment) ∝ 1.6

That is, the search time can be reduced by about 40%.
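The two expressions can be evaluated numerically with the short sketch below. The specification does not state the base of the logarithm; base 2 is assumed here, and the function names are hypothetical. Under that assumption, expression [7] comes out at about 1.6, as quoted above. The quoted 2.6 equals the first term $\log_2 \beta$ of expression [6], whereas the full expression [6] as written evaluates to roughly 4.2, because $\beta / A^3$ is negligibly small.

```python
import math


def conventional_search_time(alpha, beta, num_chars):
    """Expression [6]: log(beta) + (1 - beta / A^3) * log(alpha), base 2 assumed."""
    p_registered = beta / num_chars ** 3
    return math.log2(beta) + (1 - p_registered) * math.log2(alpha)


def proposed_search_time(alpha, beta, num_chars):
    """Expression [7]: log(alpha) + (beta / A^3) * log(beta / alpha), base 2 assumed."""
    p_registered = beta / num_chars ** 3
    return math.log2(alpha) + p_registered * math.log2(beta / alpha)


if __name__ == "__main__":
    alpha, beta, num_chars = 3, 6, 7000  # values used for FIG. 3
    print(conventional_search_time(alpha, beta, num_chars))
    print(proposed_search_time(alpha, beta, num_chars))
```

Either way, the comparison shows the same qualitative point: when almost no 3-character key is registered, searching the 2-character section first avoids nearly all searches of the 3-character section.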

As the above description shows, in this embodiment a concatenation probability is obtained by first searching the 2-character concatenation probability registration section 11 with the 2-character string key $S_k^2$. If the 3-character concatenation probability search count of the corresponding record is "0", no search for the 3-character concatenation probability is performed. The fact that no record corresponding to the 3-character string key $S_r^3$ ($= S_k^2 C_{km}$) exists in the 3-character concatenation probability registration section 12 is thus detected at the time the 2-character concatenation probability registration section 11 is searched, and useless searches are eliminated. If the search count is not "0", records with the same first two characters as the key $S_r^3$ ($= S_k^2 C_{km}$) exist. In this case only the set of records in the 3-character concatenation probability registration section 12 delimited by the 3-character concatenation probability pointer and the 3-character concatenation probability search count is searched, which limits the range of section 12 to be searched and reduces the search time. Together, these make it possible to speed up dictionary searches.

The present invention has been described concretely on the basis of the above embodiment. It goes without saying, however, that the present invention is not limited to that embodiment and can be modified in various ways without departing from its gist.

[Effect of the invention]

As described above, the present invention provides the following effects when searching for a 3-character concatenation probability.

(1) The case where the 3-character concatenation probability is not registered can be detected from the 3-character concatenation probability search count, so useless searches are omitted and the search time is shortened.

(2) Even when the 3-character concatenation probability is registered, the search range can be limited by the 3-character concatenation probability pointer and the 3-character concatenation probability search count, so the search time is shortened.

[Brief explanation of the drawing]

FIG. 1 is a diagram showing the structure of a concatenation probability dictionary, for explaining one embodiment of the concatenation probability dictionary construction method of the present invention. FIG. 2 is a diagram showing the procedure for searching the concatenation probability dictionary in one embodiment of the construction method of the present invention. FIG. 3 is a diagram showing a concrete example of one embodiment of the concatenation probability dictionary to which the present invention applies. FIG. 4 is a structural diagram of a conventional concatenation probability dictionary. FIG. 5 is a diagram showing the search procedure for the conventional concatenation probability dictionary.

In the figures: 1 ... 2-character concatenation probability registration section; 2 ... 3-character concatenation probability registration section; 3 ... key part; 4 ... data part; 5 ... key part; 6 ... data part; 11 ... 2-character concatenation probability registration section; 12 ... 3-character concatenation probability registration section; 13 ... key part; 14 ... data part; 15 ... 3-character concatenation probability pointer; 16 ... 3-character concatenation probability search count; 17 ... key part; 18 ... data part.

Claims

[Claims]

(1) A method of constructing a concatenation probability dictionary that has: a 2-character concatenation probability registration section consisting of records that take a 2-character string as the key and register the concatenation probability of the 2-character string, a 3-character concatenation probability pointer, and a 3-character concatenation probability search count; and a 3-character concatenation probability registration section consisting of records that take the last character of a 3-character string as the key and register the concatenation probability of the 3-character string, its records being delimited by the 3-character concatenation probability pointer and the 3-character concatenation probability search count; the method being characterized in that the 2-character concatenation probability registration section is searched using the 2-character string as the key, and, if the 3-character concatenation probability search count of the corresponding record is "0", no search for the 3-character concatenation probability is performed, whereas if the 3-character concatenation probability search count is not "0", the delimited set of records in the 3-character concatenation probability registration section is searched.