JPH03102465A

JPH03102465A - Character combination probability dictionary comprising method

Info

Publication number: JPH03102465A
Application number: JP1240412A
Authority: JP
Inventors: Koji Matsuoka; 浩司松岡; Jinichi Murakami; 村上　仁一
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1989-09-16
Filing date: 1989-09-16
Publication date: 1991-04-26

Abstract

PURPOSE: To reduce the number of character combination probabilities to be registered in a combination probability dictionary by using an appearance frequency registration method. CONSTITUTION: For a character string C1 C2...Cn, the appearance frequency A(C1 C2...Cn-1) of the (n-1)-character string is read out from an (n-1)-character string appearance frequency registration part 11, and the appearance frequency A(C1 C2...Cn) of the n-character string is read out from an n-character string appearance frequency registration part 12. The forward n-character combination probability is derived by a forward n-character combination probability deriving part 17 and output from an output terminal 19. Similarly, the backward n-character combination probability is derived by a backward n-character combination probability deriving part 18 and output from an output terminal 20. The number of records in the combination probability dictionary is thus the sum of the numbers of records in the (n-1)-character appearance frequency registration part 11 and the n-character appearance frequency registration part 12, that is t^(n-1) + t^n, which allows the combination probability dictionary to be made smaller.

Description

[Detailed Description of the Invention]

[Industrial Application Field]

The present invention relates to a method for constructing a dictionary that stores character concatenation probabilities, which are used to correct erroneous and missing characters in Japanese text input to a computer.

[Prior Art]

When Japanese text is input to a computer with a word processor or a character reader, erroneous characters or omissions may be mixed in. One method for automatically detecting and correcting such input errors exploits the uneven frequency with which character sequences occur in text: characters that have a high concatenation probability with the correct characters surrounding the error are taken as correction candidates. Specifically, when the input text contains an erroneous character, a correction candidate character that connects readily to the character strings before and after the error is inserted at the error position to form a tentative character string. The character concatenation probability of this tentative string is calculated, the correction candidates are ranked on that basis, and the top-ranked candidates are selected.

Here, a character concatenation probability is the probability that a given character appears immediately after or immediately before a character string. There are two kinds, the forward n-character concatenation probability and the backward n-character concatenation probability, defined below. Both are derived from the appearance frequencies of the n-character strings and (n-1)-character strings contained in a large volume of error-free text (original text data). In the formulas below, the appearance frequency of a character string S is written $A(S)$.

(1) Forward n-character concatenation probability

This is the probability that character $C_n$ appears next after the (n-1)-character string $C_1 C_2 \cdots C_{n-1}$. It is the appearance frequency of $C_1 C_2 \cdots C_n$ divided by the sum of the appearance frequencies of all strings $C_1 C_2 \cdots C_n$ that are identical except for character $C_n$:

$$P_f(C_n / C_1 C_2 \cdots C_{n-1}) = \frac{A(C_1 C_2 \cdots C_n)}{A(C_1 C_2 \cdots C_{n-1})} \qquad [1]$$

(2) Backward n-character concatenation probability

This is the probability that character $C_1$ appears immediately before the (n-1)-character string $C_2 \cdots C_n$. It is the appearance frequency of $C_1 C_2 \cdots C_n$ divided by the sum of the appearance frequencies of all strings $C_1 C_2 \cdots C_n$ that are identical except for character $C_1$:

$$P_b(C_1 / C_2 C_3 \cdots C_n) = \frac{A(C_1 C_2 \cdots C_n)}{A(C_2 C_3 \cdots C_n)} \qquad [2]$$
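As a minimal sketch of equations [1] and [2], the following Python estimates both probabilities from raw substring counts. The function names and the toy corpus are illustrative assumptions and do not come from the patent; the counting also ignores the small boundary effect at the ends of the text.

```python
from collections import Counter
from fractions import Fraction


def count_strings(text: str, k: int) -> Counter:
    """Appearance frequency A(S) of every k-character string S in text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def forward_prob(s: str, a_n: Counter, a_n1: Counter) -> Fraction:
    """Equation [1]: A(C_1...C_n) / A(C_1...C_{n-1})."""
    return Fraction(a_n[s], a_n1[s[:-1]])


def backward_prob(s: str, a_n: Counter, a_n1: Counter) -> Fraction:
    """Equation [2]: A(C_1...C_n) / A(C_2...C_n)."""
    return Fraction(a_n[s], a_n1[s[1:]])


# Hypothetical stand-in for the error-free original text data.
corpus = "会議会議議会会議"
a1 = count_strings(corpus, 1)
a2 = count_strings(corpus, 2)

print(forward_prob("会議", a2, a1))   # P_f(議/会) = A(会議) / A(会)
print(backward_prob("会議", a2, a1))  # P_b(会/議) = A(会議) / A(議)
```

Exact fractions are used so that the ratios can be compared without floating-point rounding.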
A dictionary in which these concatenation probabilities are registered is called a concatenation probability dictionary.

FIG. 5 shows the structure of a conventional concatenation probability dictionary. In FIG. 5, 1 is a forward n-character concatenation probability section, which registers the forward n-character concatenation probability $P_f(C_n / C_1 C_2 \cdots C_{n-1})$. 3 is its key section, which registers the n-character string $C_1 C_2 \cdots C_n$. 4 is its data section, which registers the forward n-character concatenation probability $P_f(C_n / C_1 C_2 \cdots C_{n-1})$. The forward n-character concatenation probability section 1 registers records, each consisting of a character string $f_i$ ($i = 1, 2, \ldots, \alpha$) registered as the n-character string $C_1 C_2 \cdots C_n$ together with its forward n-character concatenation probability, and is searched for the forward n-character concatenation probability of a character string $f_i$.

2 is a backward n-character concatenation probability section, which registers the backward n-character concatenation probability. 5 is its key section, which registers the n-character string $C_1 C_2 \cdots C_n$. 6 is its data section, which registers the backward n-character concatenation probability $P_b(C_1 / C_2 C_3 \cdots C_n)$. Section 2 registers records, each consisting of a character string $b_i$ ($i = 1, 2, \ldots, \beta$) registered as the n-character string $C_1 C_2 \cdots C_n$ together with its backward n-character concatenation probability, and is searched for the backward n-character concatenation probability of a character string $b_i$.

[Problems to be Solved by the Invention]

The conventional concatenation probability dictionary registers both the forward and the backward n-character concatenation probabilities. Assuming that the concatenation probabilities of the n-character strings formed from all combinations of character types are registered, the number of forward n-character concatenation probabilities and the number of backward n-character concatenation probabilities are each $t^n$, where $t$ is the number of character types (about 7000 for Japanese). The number of records in the concatenation probability dictionary is therefore $2 \times t^n$. As a result, the file capacity needed to register the character concatenation probabilities becomes large.

An object of the present invention is to solve this problem and to provide a dictionary construction method that reduces the number of character concatenation probabilities registered in the concatenation probability dictionary.

The above and other objects and novel features of the present invention will become apparent from the description of this specification and the accompanying drawings.
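The record count of the conventional layout can be checked with a short sketch. The function name and the two-character alphabet are assumptions chosen for illustration; only the formula $2 \times t^n$ and the figure of about 7000 character types come from the text.

```python
from itertools import product


def conventional_record_count(t: int, n: int) -> int:
    """Records in the conventional dictionary: t**n forward plus t**n backward."""
    return 2 * t ** n


# Enumerate the keys explicitly for a tiny alphabet to confirm the formula.
alphabet = ["会", "議"]
n = 2
keys = ["".join(chars) for chars in product(alphabet, repeat=n)]
forward_section = {key: None for key in keys}    # key: n-character string, data: P_f
backward_section = {key: None for key in keys}   # key: n-character string, data: P_b
enumerated = len(forward_section) + len(backward_section)

print(enumerated, conventional_record_count(len(alphabet), n))
print(conventional_record_count(7000, 2))  # two-character case for t = 7000
```

Even for n = 2 the count for t = 7000 is on the order of 10^8 records, which is the size problem the description refers to.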
[Means for Solving the Problems]

To achieve the above object, the invention of claim 1 is a method for constructing a dictionary for registering character concatenation probabilities, whose principal feature is that it comprises:

an (n-1)-character string appearance frequency registration section that registers the appearance frequency of the character string $C_1 C_2 \cdots C_{n-1}$;

an n-character string appearance frequency registration section that registers the appearance frequency of the character string $C_1 C_2 \cdots C_n$;

a forward n-character concatenation probability derivation section that obtains the forward n-character concatenation probability, which is the probability that character $C_n$ appears next after the character string $C_1 C_2 \cdots C_{n-1}$, as the ratio of the appearance frequency of $C_1 C_2 \cdots C_n$ to the appearance frequency of $C_1 C_2 \cdots C_{n-1}$;

a backward n-character concatenation probability derivation section that obtains the backward n-character concatenation probability, which is the probability that character $C_1$ appears immediately before the character string $C_2 \cdots C_n$, as the ratio of the appearance frequency of $C_1 C_2 \cdots C_n$ to the appearance frequency of $C_2 \cdots C_n$; and

means for generating the forward n-character concatenation probability in the forward derivation section, and the backward n-character concatenation probability in the backward derivation section, from the appearance frequencies read out of the (n-1)-character string appearance frequency registration section and the n-character string appearance frequency registration section.

The invention of claim 2 is a method for constructing a dictionary for registering character concatenation probabilities, whose principal feature is that it comprises:

forward k-character concatenation probability registration sections ($k = 2, 3, \ldots, n$) that register the forward k-character concatenation probability, which is the probability that character $C_k$ appears next after the character string $C_1 C_2 \cdots C_{k-1}$;

a one-character appearance probability registration section that registers the one-character appearance probability with which each character appears;

backward i-character concatenation probability derivation sections ($i = 2, 3, \ldots, n$) that derive the backward i-character concatenation probability, which is the probability that character $C_1$ appears immediately before the character string $C_2 C_3 \cdots C_i$; and

means by which each backward i-character concatenation probability derivation section derives the backward i-character concatenation probability using the forward r-character concatenation probabilities read out of the forward r-character concatenation probability registration sections ($r = 2, \ldots, i$), the one-character appearance probabilities read out of the one-character appearance probability registration section, and the backward m-character concatenation probabilities derived by the backward m-character concatenation probability derivation sections ($m = 2, \ldots, i-1$).

[Operation]

With the means described above, using either the appearance frequency registration method or the concatenation probability registration method as the dictionary construction method reduces the number of character concatenation probabilities that must be registered in the concatenation probability dictionary.

[Embodiments of the Invention]

An embodiment of the present invention is described concretely below with reference to the drawings. In all the drawings used to explain the embodiment, parts having the same function are given the same reference numerals, and their repeated explanation is omitted.

The embodiment of the character concatenation probability dictionary construction method of the present invention uses either the appearance frequency registration method or the concatenation probability registration method as the dictionary construction method for reducing the number of concatenation probabilities to be registered in the dictionary.
(1) Appearance frequency registration method

FIG. 1 shows the structure of a concatenation probability dictionary based on the appearance frequency registration method of this embodiment.

In FIG. 1, 11 is an (n-1)-character string appearance frequency registration section, which registers the appearance frequency of (n-1)-character strings. 13 is its key section, which registers the (n-1)-character string $C_1 C_2 \cdots C_{n-1}$. 14 is its data section, which registers the number of occurrences of $C_1 C_2 \cdots C_{n-1}$. Section 11 registers records, each consisting of a character string $u_i$ ($i = 1, 2, \ldots, \gamma$) registered as the (n-1)-character string $C_1 C_2 \cdots C_{n-1}$ together with its appearance frequency, and is searched for the appearance frequency of a character string $u_i$.

12 is an n-character string appearance frequency registration section, which registers the number of occurrences of n-character strings. 15 is its key section, which registers the n-character string $C_1 C_2 \cdots C_n$. 16 is its data section, which registers the number of occurrences of $C_1 C_2 \cdots C_n$. Section 12 registers records, each consisting of a character string $v_i$ ($i = 1, 2, \ldots, \delta$) registered as the n-character string $C_1 C_2 \cdots C_n$ together with its appearance frequency, and is searched for the appearance frequency of a character string $v_i$.

17 is a forward n-character concatenation probability derivation section, which derives the forward n-character concatenation probability from the appearance frequency of the (n-1)-character string and the appearance frequency of the n-character string according to equation [1] and outputs it from output terminal 19. 18 is a backward n-character concatenation probability derivation section, which derives the backward n-character concatenation probability from the appearance frequency of the (n-1)-character string and the appearance frequency of the n-character string according to equation [2] and outputs it from output terminal 20.

That is, in the appearance frequency registration method, for a character string $C_1 C_2 \cdots C_n$, the appearance frequency $A(C_1 C_2 \cdots C_{n-1})$ of the (n-1)-character string is read out of section 11, and the appearance frequency $A(C_1 C_2 \cdots C_n)$ of the n-character string is read out of the n-character string appearance frequency registration section 12. The forward n-character concatenation probability is derived by the forward derivation section 17 according to equation [1] and output from output terminal 19.

Similarly, the backward n-character concatenation probability is derived by the backward derivation section 18 according to equation [2], for which the appearance frequency $A(C_2 \cdots C_n)$ is likewise read out of section 11, and is output from output terminal 20.

Assuming that the character concatenation probabilities of the n-character strings formed from all character types are registered, the number of records in the concatenation probability dictionary of the appearance frequency registration method is the sum of the records in the (n-1)-character and n-character appearance frequency registration sections, that is $t^{n-1} + t^n$. The conventional concatenation probability dictionary, by contrast, holds the sum of the records in the forward and backward concatenation probability sections, that is $2 \times t^n$. The present invention therefore allows the concatenation probability dictionary to be made smaller.
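The arrangement of FIG. 1 can be sketched as two lookup tables and two derivation functions. The class name, method names, and the sample frequencies below are hypothetical and chosen only to illustrate the data flow; the patent describes registration sections and derivation sections, not code.

```python
from fractions import Fraction


class FrequencyDictionary:
    """Sketch of FIG. 1: two frequency tables plus two derivation steps."""

    def __init__(self, freq_n_minus_1: dict, freq_n: dict):
        self.freq_n_minus_1 = freq_n_minus_1  # section 11: string -> A(C_1...C_{n-1})
        self.freq_n = freq_n                  # section 12: string -> A(C_1...C_n)

    def forward(self, s: str) -> Fraction:
        """Derivation section 17, equation [1]."""
        return Fraction(self.freq_n[s], self.freq_n_minus_1[s[:-1]])

    def backward(self, s: str) -> Fraction:
        """Derivation section 18, equation [2]."""
        return Fraction(self.freq_n[s], self.freq_n_minus_1[s[1:]])

    def record_count(self) -> int:
        """Total number of stored records, t**(n-1) + t**n when fully populated."""
        return len(self.freq_n_minus_1) + len(self.freq_n)


# Hypothetical frequencies for a two-letter alphabet with n = 2.
dictionary = FrequencyDictionary(
    {"a": 30, "b": 70},
    {"aa": 6, "ab": 24, "ba": 24, "bb": 46},
)
print(dictionary.forward("ab"))    # A(ab) / A(a)
print(dictionary.backward("ab"))   # A(ab) / A(b)
print(dictionary.record_count())
```

Both probabilities are computed on lookup from the same two tables, which is why neither probability table has to be stored.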
(2) Concatenation probability registration method

FIG. 2 shows the structure of a concatenation probability dictionary based on the concatenation probability registration method of this embodiment.

In FIG. 2, 21, 22, 23, and 24 are the forward n-character, forward (n-1)-character, forward 3-character, and forward 2-character concatenation probability registration sections, respectively. The forward i-character concatenation probability registration section ($i = 2, 3, \ldots, n$) registers the forward i-character concatenation probability.

25 is a one-character appearance probability registration section, which registers the one-character appearance probability (the ratio of the appearance frequency of the character of interest to the number of characters in the original text data).

26, 27, 28, and 29 are the backward n-character, backward (n-1)-character, backward 3-character, and backward 2-character concatenation probability derivation sections, respectively. 19′ is the output of the forward n-character concatenation probability, and 20′ is the output of the backward n-character concatenation probability.

The backward i-character concatenation probability derivation section ($i = 2, 3, \ldots, n$) uses equation [9] below to derive the backward i-character concatenation probability. The derivation of equation [9] is as follows.

The appearance probability $P(C_1 C_2 \cdots C_i)$ of the i-character string $C_1 C_2 \cdots C_i$ is expressed by equation [3]:

$$P(C_1 C_2 \cdots C_i) = P_f(C_i / C_1 C_2 \cdots C_{i-1}) \times P(C_1 C_2 \cdots C_{i-1}) \qquad [3]$$

Similarly, the appearance probability $P(C_1 C_2 \cdots C_k)$ of the k-character string $C_1 C_2 \cdots C_k$ ($k = 2, 3, \ldots, i-1$) is expressed by equation [4]:

$$P(C_1 C_2 \cdots C_k) = P_f(C_k / C_1 C_2 \cdots C_{k-1}) \times P(C_1 C_2 \cdots C_{k-1}) \qquad [4]$$

Substituting equation [4] repeatedly into equation [3] gives equation [5]:

$$P(C_1 C_2 \cdots C_i) = P(C_1) \times \prod_{k=2}^{i} P_f(C_k / C_1 C_2 \cdots C_{k-1}) \qquad [5]$$

The appearance probability $P(C_1 C_2 \cdots C_i)$ of the i-character string is also expressed by equation [6]:

$$P(C_1 C_2 \cdots C_i) = P_b(C_1 / C_2 C_3 \cdots C_i) \times P(C_2 C_3 \cdots C_i) \qquad [6]$$

Similarly, the appearance probability of the m-character string $C_{i-m+1} C_{i-m+2} \cdots C_i$ ($m = 2, 3, \ldots, i-1$) is expressed by equation [7]:

$$P(C_{i-m+1} C_{i-m+2} \cdots C_i) = P_b(C_{i-m+1} / C_{i-m+2} C_{i-m+3} \cdots C_i) \times P(C_{i-m+2} C_{i-m+3} \cdots C_i) \qquad [7]$$

Substituting equation [7] repeatedly into equation [6] gives equation [8]:

$$P(C_1 C_2 \cdots C_i) = P_b(C_1 / C_2 C_3 \cdots C_i) \times P(C_i) \times \prod_{m=2}^{i-1} P_b(C_{i-m+1} / C_{i-m+2} C_{i-m+3} \cdots C_i) \qquad [8]$$

Equations [5] and [8] together give equation [9]:

$$P_b(C_1 / C_2 C_3 \cdots C_i) = \frac{P(C_1) \times \prod_{k=2}^{i} P_f(C_k / C_1 C_2 \cdots C_{k-1})}{P(C_i) \times \prod_{m=2}^{i-1} P_b(C_{i-m+1} / C_{i-m+2} C_{i-m+3} \cdots C_i)} \qquad [9]$$

Based on equation [9], the backward i-character concatenation probability derivation section obtains the backward i-character concatenation probability using the forward k-character concatenation probabilities $P_f(C_k / C_1 C_2 \cdots C_{k-1})$ read out of the forward k-character concatenation probability registration sections ($k = 2, 3, \ldots, i$), the probabilities $P(C_1)$ and $P(C_i)$ read out of the one-character appearance probability registration section, and the backward m-character concatenation probabilities $P_b(C_{i-m+1} / C_{i-m+2} C_{i-m+3} \cdots C_i)$ obtained from the backward m-character concatenation probability derivation sections ($m = 2, 3, \ldots, i-1$).

In this concatenation probability registration method, the forward n-character, forward (n-1)-character, forward 3-character, and forward 2-character concatenation probabilities and the one-character appearance probability are read out of the forward n-character concatenation probability registration section 21, the forward (n-1)-character concatenation probability registration section 22, the forward 3-character concatenation probability registration section 23, the forward 2-character concatenation probability registration section 24, and the one-character appearance probability registration section 25, respectively.

The backward n-character, backward (n-1)-character, backward 3-character, and backward 2-character concatenation probabilities are derived according to equation [9] by the backward n-character concatenation probability derivation section 26, the backward (n-1)-character concatenation probability derivation section 27, the backward 3-character concatenation probability derivation section 28, and the backward 2-character concatenation probability derivation section 29, respectively. The forward n-character concatenation probability is obtained from the forward n-character concatenation probability registration section 21 and output at output terminal 19. The backward n-character concatenation probability is derived by the backward n-character concatenation probability derivation section 26 and output at output terminal 20.

Under the assumption stated above, the number of records in the forward i-character concatenation probability dictionary ($i = 2, 3, \ldots, n$) is $t^i$, and the number of records in the one-character appearance probability dictionary is $t$. The number of records in the concatenation probability dictionary is therefore

$$t^n + t^{n-1} + \cdots + t = \frac{t^{n+1} - t}{t - 1}.$$

Since the conventional concatenation probability dictionary holds $2 \times t^n$ records as stated above, the present invention allows the concatenation probability dictionary to be made smaller.

The concatenation probability registration method requires more records than the appearance frequency registration method, but it has the advantage that the backward m-character concatenation probabilities ($m = 2, 3, \ldots, n-1$) can be derived at the same time as the backward n-character concatenation probability.
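Equation [9] lends itself to a recursive sketch: the backward probability of a string needs the backward probabilities of its shorter suffixes. The code below is an illustration under stated assumptions, not the patent's implementation. The function names and sample text are hypothetical, and the text is counted cyclically (wrapping around at the end) so that the frequencies are exactly consistent and the result can be compared with equation [2].

```python
from collections import Counter
from fractions import Fraction
from math import prod


def cyclic_counts(text: str, k: int) -> Counter:
    """Count k-character strings in text, wrapping around at the end."""
    padded = text + text[:k - 1]
    return Counter(padded[i:i + k] for i in range(len(text)))


def build_registered_tables(text: str, n: int):
    """Build P(C) and the forward k-character probabilities for k = 2..n."""
    counts = {k: cyclic_counts(text, k) for k in range(1, n + 1)}
    p1 = {c: Fraction(a, len(text)) for c, a in counts[1].items()}
    # pf[s] stands for P_f(s[-1] / s[:-1]); one table covers every length k.
    pf = {s: Fraction(a, counts[k - 1][s[:-1]])
          for k in range(2, n + 1) for s, a in counts[k].items()}
    return p1, pf, counts


def backward(s: str, p1: dict, pf: dict) -> Fraction:
    """Equation [9]: P_b(s[0] / s[1:]) for a string s of length i >= 2."""
    i = len(s)
    numerator = p1[s[0]] * prod(pf[s[:k]] for k in range(2, i + 1))
    denominator = p1[s[-1]] * prod(backward(s[i - m:], p1, pf) for m in range(2, i))
    return numerator / denominator


p1, pf, counts = build_registered_tables("abbabaabbb", 3)
for s in sorted(counts[3]):
    print(s, backward(s, p1, pf))
```

Only the one-character probabilities and the forward tables are stored; every backward value, including those for shorter suffixes, is computed on demand.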
Next, a concrete example of a concatenation probability dictionary based on the appearance frequency registration method is described.

FIG. 3 shows the schematic structure of an embodiment of a concatenation probability dictionary based on the appearance frequency registration method of the present invention.

To simplify the explanation, this embodiment uses $t = 2$ character types (the two characters 「会」 and 「議」) and a character concatenation probability of order $n = 2$, so that the forward 2-character and backward 2-character concatenation probabilities are obtained.

In FIG. 3, 31 is a one-character appearance frequency registration section, which registers the appearance frequencies A(会) = 20 and A(議) = 80 of the one-character strings 「会」 and 「議」. 32 is a two-character appearance frequency registration section, which registers the appearance frequencies A(会会) = 2, A(会議) = 16, A(議会) = 8, and A(議議) = 4 of the two-character strings 「会会」, 「会議」, 「議会」, and 「議議」. 33 is a forward 2-character concatenation probability derivation section, and 34 is a backward 2-character concatenation probability derivation section. 35 and 36 are the output terminals of the forward derivation section 33 and the backward derivation section 34, respectively. For example, the forward 2-character concatenation probability $P_f(会/議)$ is derived from the registered frequencies according to equation [1] and output from output terminal 35. Likewise, the backward 2-character concatenation probability $P_b(会/議)$ is derived according to equation [2] and output from output terminal 36.

The conventional concatenation probability dictionary holds $2 \times t^n = 8$ records, whereas the appearance frequency registration method holds $t^{n-1} + t^n = 6$. The number of records in the concatenation probability dictionary can therefore be reduced.
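The worked equations for this example did not survive in the text. Applying equations [1] and [2] to the frequencies registered in FIG. 3 gives the values below; this is a reconstruction under that assumption, not the patent's own typeset equations.

```latex
\begin{align*}
P_f(会/議) &= \frac{A(議会)}{A(議)} = \frac{8}{80} = 0.1 \\
P_b(会/議) &= \frac{A(会議)}{A(議)} = \frac{16}{80} = 0.2
\end{align*}
```

Both values use only the six registered frequencies, and the backward value agrees with the result 0.2 given for equation [12].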
FIG. 4 shows the schematic structure of an embodiment of a concatenation probability dictionary based on the concatenation probability registration method of the present invention. As above, $t = 2$ and $n = 2$.

24 is a forward 2-character concatenation probability registration section, which registers the forward 2-character concatenation probabilities $P_f(会/会) = 0.1$, $P_f(議/会) = 0.8$, $P_f(会/議) = 0.1$, and $P_f(議/議) = 0.05$ for the two-character strings 「会会」, 「会議」, 「議会」, and 「議議」.

25 is a one-character appearance probability registration section, which registers the one-character appearance probabilities $P(会) = 0.2$ and $P(議) = 0.8$ for 「会」 and 「議」. 29 is a backward 2-character concatenation probability derivation section. 37 is the output terminal of the forward 2-character concatenation probability registration section 24, and 38 is the output terminal of the backward 2-character concatenation probability derivation section 29. The forward 2-character concatenation probability is read out of the registration section 24 and output at output terminal 37. The backward 2-character concatenation probability is derived by the derivation section 29 and output at output terminal 38. For example, $P_b(会/議)$ is derived by the derivation section 29 as follows, using equation [9] with $i = 2$:

$$P_b(会/議) = \frac{P(会) \times P_f(議/会)}{P(議)} = \frac{0.2 \times 0.8}{0.8} = 0.2 \qquad [12]$$
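As a consistency check, the probabilities registered in FIG. 4 can be recomputed from the frequencies registered in FIG. 3, and the backward value from equation [9] with i = 2 can be compared with equation [2]. The variable and function names are assumptions for illustration; the numbers are those given in the text.

```python
from fractions import Fraction

# Frequencies registered in FIG. 3.
a1 = {"会": 20, "議": 80}
a2 = {"会会": 2, "会議": 16, "議会": 8, "議議": 4}

# Probabilities registered in FIG. 4, keyed by the two-character string C1C2,
# so pf["会議"] means P_f(議/会).
pf = {"会会": Fraction(1, 10), "会議": Fraction(8, 10),
      "議会": Fraction(1, 10), "議議": Fraction(5, 100)}
p1 = {"会": Fraction(2, 10), "議": Fraction(8, 10)}

total = sum(a1.values())
p1_matches = all(p1[c] == Fraction(a, total) for c, a in a1.items())
pf_matches = all(pf[s] == Fraction(a, a1[s[0]]) for s, a in a2.items())


def backward_2(s: str) -> Fraction:
    """Equation [9] with i = 2: P_b(C1/C2) = P(C1) * P_f(C2/C1) / P(C2)."""
    return p1[s[0]] * pf[s] / p1[s[1]]


pb = backward_2("会議")                       # P_b(会/議) from probabilities
pb_from_counts = Fraction(a2["会議"], a1["議"])  # same value from equation [2]
print(p1_matches, pf_matches, pb, pb_from_counts)
```

The two routes give the same backward probability, which is the point of both registration methods: the backward table never has to be stored.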
The number of records for the concatenation probability registration method is (t^(n+1) - t)/(t - 1) = 6. Therefore, the number of records in the concatenation probability dictionary can be reduced compared with the conventional concatenation probability dictionary. In this manner, in this embodiment, registering the forward two-character concatenation probabilities allows the backward two-character concatenation probabilities to be calculated without being registered. Similarly, the forward two-character concatenation probabilities can be calculated by registering the backward two-character concatenation probabilities. The present invention can be applied to narrowing down correction candidate characters for typographical errors contained in Japanese text. The present invention has been specifically explained above based on embodiments, but it goes without saying that the present invention is not limited to the embodiments described above and can be modified in various ways without departing from the gist thereof.

[Effects of the Invention]

As described above, according to the present invention, the number of records to be registered in the character concatenation probability dictionary can be reduced, and the character concatenation probability dictionary can therefore be made smaller.
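The appearance frequency registration method and the record counts discussed in the description can be illustrated with a short sketch. It assumes that appearance frequencies are simple substring counts over a training text and that each concatenation probability is the ratio of two such counts. The function names and the sample text are hypothetical, and reading the record count of the concatenation probability registration method as the sum t + t^2 + ... + t^n is an interpretation of the description.

```python
from collections import Counter


def count_ngrams(text: str, n: int) -> Counter:
    """Appearance frequency of every n-character string in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def forward_probability(freq_n: Counter, freq_prev: Counter, s: str) -> float:
    """Probability that s[-1] follows s[:-1]: A(s) / A(s[:-1])."""
    denom = freq_prev[s[:-1]]
    return freq_n[s] / denom if denom else 0.0


def backward_probability(freq_n: Counter, freq_prev: Counter, s: str) -> float:
    """Probability that s[0] precedes s[1:]: A(s) / A(s[1:])."""
    denom = freq_prev[s[1:]]
    return freq_n[s] / denom if denom else 0.0


def record_counts(t: int, n: int) -> tuple:
    """Records needed for t distinct characters and n-character strings."""
    conventional = 2 * t ** n                    # forward table + backward table
    appearance_frequency = t ** (n - 1) + t ** n
    concatenation_probability = sum(t ** k for k in range(1, n + 1))
    return conventional, appearance_frequency, concatenation_probability


# Hypothetical training text over a two-character alphabet.
text = "ABABBB"
freq1 = count_ngrams(text, 1)   # one-character appearance frequencies
freq2 = count_ngrams(text, 2)   # two-character appearance frequencies

p_forward = forward_probability(freq2, freq1, "AB")    # A(AB) / A(A)
p_backward = backward_probability(freq2, freq1, "AB")  # A(AB) / A(B)
counts = record_counts(2, 2)
```

Both probabilities come from the same two frequency tables, so the backward table of the conventional dictionary is not stored. Note that a character at the very end (or start) of the text is counted in the one-character table without a following (or preceding) neighbour, so the ratios are approximations on short texts.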

[Brief explanation of drawings]

FIG. 1 is a diagram showing the configuration of a concatenation probability dictionary based on the appearance frequency registration method of this embodiment. FIG. 2 is a diagram showing the configuration of a concatenation probability dictionary based on the concatenation probability registration method of this embodiment. FIG. 3 is a diagram showing the schematic configuration of an embodiment of the concatenation probability dictionary based on the appearance frequency registration method of the present invention. FIG. 4 is a diagram showing the schematic configuration of an embodiment of the concatenation probability dictionary based on the concatenation probability registration method of the present invention. FIG. 5 is a diagram for explaining the problems of the conventional concatenation probability dictionary.

In the figures: 1, forward n-character concatenation probability registration unit; 2, backward n-character concatenation probability registration unit; 3, key part; 4, data part; 5, key part; 6, data part; 11, (n-1)-character string appearance frequency registration unit; 12, n-character string appearance frequency registration unit; 13, key part; 14, data part; 15, key part; 16, data part; 17, forward n-character concatenation probability deriving unit; 18, backward n-character concatenation probability deriving unit; 19, output terminal of the forward n-character concatenation probability deriving unit; 20, output terminal of the backward n-character concatenation probability deriving unit; 19', output of the forward n-character concatenation probability; 20', output of the backward n-character concatenation probability; 21, forward n-character concatenation probability registration unit; 22, forward (n-1)-character concatenation probability registration unit; 23, forward three-character concatenation probability registration unit; 24, forward two-character concatenation probability registration unit; 25, one-character appearance probability registration unit; 26, backward n-character concatenation probability deriving unit; 27, backward (n-1)-character concatenation probability deriving unit; 28, backward three-character concatenation probability deriving unit; 29, backward two-character concatenation probability deriving unit; 31, one-character appearance frequency registration unit; 32, two-character appearance frequency registration unit; 33, forward two-character concatenation probability deriving unit; 34, backward two-character concatenation probability deriving unit; 35, output terminal of the forward two-character concatenation probability deriving unit; 36, output terminal of the backward two-character concatenation probability deriving unit; 37, output terminal of the forward two-character concatenation probability registration unit; 38, output terminal of the backward two-character concatenation probability deriving unit.

Claims

[Claims]

(1) A method of constructing a dictionary in which character concatenation probabilities are registered, comprising:
an (n-1)-character string appearance frequency registration unit that registers the appearance frequency of a character string C_1C_2...C_{n-1};
an n-character string appearance frequency registration unit that registers the appearance frequency of a character string C_1C_2...C_n;
a forward n-character concatenation probability deriving unit that calculates the forward n-character concatenation probability, which is the probability that the character C_n appears immediately after the character string C_1C_2...C_{n-1}, as the ratio of the appearance frequency of the character string C_1C_2...C_n to the appearance frequency of the character string C_1C_2...C_{n-1}; and
a backward n-character concatenation probability deriving unit that calculates the backward n-character concatenation probability, which is the probability that the character C_1 appears immediately before the character string C_2...C_n, as the ratio of the appearance frequency of the character string C_1C_2...C_n to the appearance frequency of the character string C_2...C_n;
the method further comprising means for generating the forward n-character concatenation probability in the forward n-character concatenation probability deriving unit from the appearance frequencies read out from the (n-1)-character string appearance frequency registration unit and the n-character string appearance frequency registration unit, and means for generating the backward n-character concatenation probability in the backward n-character concatenation probability deriving unit from the appearance frequencies read out from the (n-1)-character string appearance frequency registration unit and the n-character string appearance frequency registration unit.

(2) A method of constructing a dictionary in which character concatenation probabilities are registered, comprising:
forward k-character concatenation probability registration units (k = 2, 3, ..., n) that register the forward k-character concatenation probability, which is the probability that the character C_k appears immediately after the character string C_1C_2...C_{k-1};
a one-character appearance probability registration unit that registers the one-character appearance probability of the character C_1; and
backward i-character concatenation probability deriving units (i = 2, 3, ..., n) that derive the backward i-character concatenation probability, which is the probability that the character C_1 appears immediately before the character string C_2C_3...C_i;
the method comprising means for deriving the backward i-character concatenation probability using the forward k-character concatenation probabilities read out from the forward k-character concatenation probability registration units, the one-character appearance probability read out from the one-character appearance probability registration unit, and the backward m-character concatenation probabilities derived by the backward m-character concatenation probability deriving units (m = 2, ..., i-1).