JPH03102465A - Character combination probability dictionary comprising method - Google Patents

Character combination probability dictionary comprising method

Info

Publication number
JPH03102465A
JPH03102465A JP1240412A JP24041289A JPH03102465A JP H03102465 A JPH03102465 A JP H03102465A JP 1240412 A JP1240412 A JP 1240412A JP 24041289 A JP24041289 A JP 24041289A JP H03102465 A JPH03102465 A JP H03102465A
Authority
JP
Japan
Prior art keywords
character
probability
concatenation
appearance frequency
character string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP1240412A
Other languages
Japanese (ja)
Inventor
Koji Matsuoka
浩司 松岡
Jinichi Murakami
村上 仁一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

PURPOSE: To reduce the number of character combination probabilities to be registered in a combination probability dictionary by using an appearance frequency registration method. CONSTITUTION: The appearance frequency A(C1 C2...Cn-1) of an (n-1)-character string is read out from an (n-1)-character string appearance frequency registration part 11, and the appearance frequency A(C1 C2...Cn) of the n-character string C1 C2...Cn is read out from an n-character string appearance frequency registration part 12. The front n-character combination probability is derived by a front n-character combination probability deriving part 17 and is output from an output terminal 19. Similarly, the rear n-character combination probability is derived by a rear n-character combination probability deriving part 18 and is output from an output terminal 20. In this way, the number of records in the combination probability dictionary becomes the sum of the numbers of records in the (n-1)-character appearance frequency registration part 11 and the n-character appearance frequency registration part 12, that is, t^(n-1) + t^n, where t is the number of character types, which allows the combination probability dictionary to be made smaller.

Description

[Detailed Description of the Invention]

[Industrial Application Field]

The present invention relates to a method of constructing a dictionary that stores character concatenation probabilities, which are used to correct wrong or missing characters in Japanese text entered into a computer.

[Prior Art]

When Japanese text is entered into a computer with a word processor or a character reading device, wrong or missing characters may be introduced. One method of automatically detecting and correcting such input errors exploits the uneven appearance frequencies of character sequences in text: characters that have a high concatenation probability with the correct characters surrounding an error are taken as correction candidates. Specifically, when the input text contains a wrong character, candidate characters that readily connect to the character strings before and after it are inserted at the error position. The character concatenation probability of each resulting provisional string is computed, the candidates are ranked on that basis, and the top-ranked candidates are selected.

Here, a character concatenation probability is the probability that a given character appears immediately after or immediately before a character string. There are two kinds, the forward n-character concatenation probability and the backward n-character concatenation probability, defined below. Both are derived from the appearance frequencies of n-character strings and (n-1)-character strings in a large body of error-free text (the original text data). In the formulas below, $A(S)$ denotes the appearance frequency of a character string $S$.

(1) Forward n-character concatenation probability

This is the probability that character $C_n$ appears next after the (n-1)-character string $C_1 C_2 \cdots C_{n-1}$. It is the appearance frequency of $C_1 C_2 \cdots C_n$ divided by the sum of the appearance frequencies of all strings $C_1 C_2 \cdots C_n$ that are identical except for character $C_n$:

$P_f(C_n / C_1 C_2 \cdots C_{n-1}) = A(C_1 C_2 \cdots C_n) / A(C_1 C_2 \cdots C_{n-1})$  [1]

(2) Backward n-character concatenation probability

This is the probability that character $C_1$ appears immediately before the (n-1)-character string $C_2 \cdots C_n$. It is the appearance frequency of $C_1 C_2 \cdots C_n$ divided by the sum of the appearance frequencies of all strings $C_1 C_2 \cdots C_n$ that are identical except for character $C_1$:

$P_b(C_1 / C_2 C_3 \cdots C_n) = A(C_1 C_2 \cdots C_n) / A(C_2 C_3 \cdots C_n)$  [2]
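
Formulas [1] and [2] can be illustrated with a minimal Python sketch that counts character n-grams in error-free text and divides the counts. The function names (`count_ngrams`, `forward_prob`, `backward_prob`) and the short ASCII corpus are illustrative assumptions and do not come from the patent; returning 0.0 for an unseen context is likewise a choice made only for this sketch.

```python
from collections import Counter

def count_ngrams(text: str, n: int) -> Counter:
    """Appearance frequency A(S) of every n-character string S in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def forward_prob(s: str, a_n: Counter, a_n1: Counter) -> float:
    """Formula [1]: Pf(Cn / C1...Cn-1) = A(C1...Cn) / A(C1...Cn-1)."""
    denom = a_n1[s[:-1]]          # frequency of the leading (n-1)-character string
    return a_n[s] / denom if denom else 0.0

def backward_prob(s: str, a_n: Counter, a_n1: Counter) -> float:
    """Formula [2]: Pb(C1 / C2...Cn) = A(C1...Cn) / A(C2...Cn)."""
    denom = a_n1[s[1:]]           # frequency of the trailing (n-1)-character string
    return a_n[s] / denom if denom else 0.0

# Hypothetical error-free corpus (ASCII only for readability).
corpus = "abcabcabd"
a1 = count_ngrams(corpus, 1)      # (n-1)-character frequencies, n = 2
a2 = count_ngrams(corpus, 2)      # n-character frequencies

pf_bc = forward_prob("bc", a2, a1)    # A(bc) / A(b)
pb_bc = backward_prob("bc", a2, a1)   # A(bc) / A(c)
```

Both probabilities are computed from the same two count tables, which is the observation the appearance frequency registration method described later relies on.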

The dictionary in which these concatenation probabilities are registered is the concatenation probability dictionary.

FIG. 5 shows the structure of a conventional concatenation probability dictionary.

In FIG. 5, 1 is a forward n-character concatenation probability part that registers the forward n-character concatenation probability $P_f(C_n / C_1 C_2 \cdots C_{n-1})$. 3 is its key part, which registers the n-character string $C_1 C_2 \cdots C_n$. 4 is its data part, which registers the forward n-character concatenation probability $P_f(C_n / C_1 C_2 \cdots C_{n-1})$. The forward n-character concatenation probability part 1 registers records, each consisting of a character string $f_i$ ($i = 1, 2, \ldots, \alpha$) registered as an n-character string $C_1 C_2 \cdots C_n$ and its forward n-character concatenation probability, and is searched for the forward n-character concatenation probability of a string $f_i$.

2 is a backward n-character concatenation probability part that registers the backward n-character concatenation probability. 5 is its key part, which registers the n-character string $C_1 C_2 \cdots C_n$. 6 is its data part, which registers the backward n-character concatenation probability $P_b(C_1 / C_2 C_3 \cdots C_n)$. Part 2 registers records, each consisting of a character string $b_i$ ($i = 1, 2, \ldots, \beta$) registered as an n-character string $C_1 C_2 \cdots C_n$ and its backward n-character concatenation probability, and is searched for the backward concatenation probability of a string $b_i$.

[Problems to Be Solved by the Invention]

The conventional concatenation probability dictionary registers both the forward and the backward n-character concatenation probabilities. Assuming that the concatenation probabilities of n-character strings made up of every combination of character types are registered, the numbers of forward and of backward n-character concatenation probabilities are both $t^n$, where $t$ is the number of character types (about 7000 for Japanese). The dictionary therefore holds $2 \times t^n$ records, and the file needed to register the character concatenation probabilities becomes large.

An object of the present invention is to solve this problem by providing a dictionary construction method that reduces the number of character concatenation probabilities registered in the concatenation probability dictionary.

The above and other objects and novel features of the present invention will become apparent from the description in this specification and the accompanying drawings.

[Means for Solving the Problems]

To achieve the above object, the invention of claim 1 is a method of constructing a dictionary for registering character concatenation probabilities, characterized principally by comprising: an (n-1)-character string appearance frequency registration unit that registers the appearance frequency of the character string $C_1 C_2 \cdots C_{n-1}$; an n-character string appearance frequency registration unit that registers the appearance frequency of the character string $C_1 C_2 \cdots C_n$; a forward n-character concatenation probability derivation unit that obtains the forward n-character concatenation probability, which is the probability that character $C_n$ appears next after the string $C_1 C_2 \cdots C_{n-1}$, as the ratio of the appearance frequency of $C_1 C_2 \cdots C_n$ to that of $C_1 C_2 \cdots C_{n-1}$; a backward n-character concatenation probability derivation unit that obtains the backward n-character concatenation probability, which is the probability that character $C_1$ appears immediately before the string $C_2 \cdots C_n$, as the ratio of the appearance frequency of $C_1 C_2 \cdots C_n$ to that of $C_2 \cdots C_n$; and means for generating the forward n-character concatenation probability in the forward derivation unit, and the backward n-character concatenation probability in the backward derivation unit, from the appearance frequencies read out from the (n-1)-character string appearance frequency registration unit and the n-character string appearance frequency registration unit.

The invention of claim 2 is a method of constructing a dictionary for registering character concatenation probabilities, characterized principally by comprising: forward k-character concatenation probability registration units ($k = 2, 3, \ldots, n$) that register the forward k-character concatenation probability, which is the probability that character $C_k$ appears next after the string $C_1 C_2 \cdots C_{k-1}$; a one-character appearance probability registration unit that registers the one-character appearance probability with which a character $C_1$ appears; backward i-character concatenation probability derivation units ($i = 2, 3, \ldots, n$) that derive the backward i-character concatenation probability, which is the probability that character $C_1$ appears immediately before the string $C_2 C_3 \cdots C_i$; and means by which each backward i-character concatenation probability derivation unit derives the backward i-character concatenation probability using the forward r-character concatenation probabilities read out from the forward r-character concatenation probability registration units ($r = 2, \ldots, i$), the one-character appearance probabilities read out from the one-character appearance probability registration unit, and the backward m-character concatenation probabilities derived by the backward m-character concatenation probability derivation units ($m = 2, \ldots, i-1$).

[Operation]

With the above means, using either the appearance frequency registration method or the concatenation probability registration method as the dictionary construction method reduces the number of character concatenation probabilities that must be registered in the concatenation probability dictionary.

[Embodiments of the Invention]

An embodiment of the present invention will now be described in detail with reference to the drawings.

In all the drawings used to describe the embodiment, parts having the same function are given the same reference numerals, and repeated description of them is omitted.

The embodiment of the character concatenation probability dictionary construction method of the present invention uses either the appearance frequency registration method or the concatenation probability registration method as the dictionary construction method for reducing the number of concatenation probabilities registered in the character concatenation probability dictionary.

(1) Appearance frequency registration method

FIG. 1 shows the structure of a concatenation probability dictionary based on the appearance frequency registration method of this embodiment.

In FIG. 1, 11 is an (n-1)-character string appearance frequency registration unit, which registers the appearance frequencies of (n-1)-character strings. 13 is its key part, which registers the (n-1)-character string $C_1 C_2 \cdots C_{n-1}$. 14 is its data part, which registers the number of appearances of $C_1 C_2 \cdots C_{n-1}$. The (n-1)-character string appearance frequency registration unit 11 registers records, each consisting of a string $U_i$ ($i = 1, 2, \ldots, \gamma$) registered as an (n-1)-character string $C_1 C_2 \cdots C_{n-1}$ and its appearance frequency, and is searched for the appearance frequency of a string $U_i$.

12 is an n-character string appearance frequency registration unit, which registers the number of appearances of n-character strings. 15 is its key part, which registers the n-character string $C_1 C_2 \cdots C_n$. 16 is its data part, which registers the number of appearances of $C_1 C_2 \cdots C_n$. The n-character string appearance frequency registration unit 12 registers records, each consisting of a string $V_i$ ($i = 1, 2, \ldots, \delta$) registered as an n-character string $C_1 C_2 \cdots C_n$ and its appearance frequency, and is searched for the appearance frequency of a string $V_i$.

17 is a forward n-character concatenation probability derivation unit, which derives the forward n-character concatenation probability from the appearance frequencies of the (n-1)-character string and the n-character string according to formula [1] and outputs it from an output terminal 19. 18 is a backward n-character concatenation probability derivation unit, which derives the backward n-character concatenation probability from the appearance frequencies of the (n-1)-character string and the n-character string according to formula [2] and outputs it from an output terminal 20.

That is, in the appearance frequency registration method, for a character string $C_1 C_2 \cdots C_n$, the appearance frequency $A(C_1 C_2 \cdots C_{n-1})$ of the (n-1)-character string is read out from the (n-1)-character string appearance frequency registration unit 11, and the appearance frequency $A(C_1 C_2 \cdots C_n)$ of the n-character string is read out from the n-character string appearance frequency registration unit 12. The forward n-character concatenation probability is derived by the forward n-character concatenation probability derivation unit 17 according to formula [1] and is output from the output terminal 19.

Similarly, the backward n-character concatenation probability is derived by the backward n-character concatenation probability derivation unit 18 according to formula [2] and is output from the output terminal 20.

Assuming that the concatenation probabilities of n-character strings made up of every character type are registered, the number of records in the concatenation probability dictionary under the appearance frequency registration method is the sum of the numbers of records in the (n-1)-character appearance frequency registration unit and the n-character appearance frequency registration unit, namely $t^{n-1} + t^n$. The conventional concatenation probability dictionary, by contrast, holds the records of the forward and backward concatenation probability parts, $2 \times t^n$ in total. The present invention therefore makes the concatenation probability dictionary smaller.
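
The record counts for the appearance frequency registration method can be checked numerically. The sketch below is illustrative only: the function names are assumptions, and t = 7000 is the approximate figure the text gives for Japanese.

```python
def conventional_records(t: int, n: int) -> int:
    """Conventional dictionary: forward and backward tables of t**n records each."""
    return 2 * t ** n

def frequency_method_records(t: int, n: int) -> int:
    """Appearance frequency registration: one (n-1)-string table plus one n-string table."""
    return t ** (n - 1) + t ** n

t, n = 7000, 2
conventional = conventional_records(t, n)
frequency = frequency_method_records(t, n)
ratio = frequency / conventional   # fraction of the conventional size that remains
```

For large t the (n-1)-string table is negligible next to the n-string table, so under these assumptions the method stores roughly half as many records as the conventional dictionary.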

(2) Concatenation probability registration method

FIG. 2 shows the structure of a concatenation probability dictionary based on the concatenation probability registration method of this embodiment.

In FIG. 2, 21, 22, 23, and 24 are the forward n-character, forward (n-1)-character, forward 3-character, and forward 2-character concatenation probability registration units, respectively. The forward i-character concatenation probability registration unit ($i = 2, 3, \ldots, n$) registers the forward i-character concatenation probability.

25 is a one-character appearance probability registration unit, which registers the one-character appearance probability (the ratio of the appearance frequency of the character of interest to the number of characters in the original text data).

26, 27, 28, and 29 are the backward n-character, backward (n-1)-character, backward 3-character, and backward 2-character concatenation probability derivation units, respectively. 19′ is the output of the forward n-character concatenation probability, and 20′ is the output of the backward n-character concatenation probability.

The backward i-character concatenation probability derivation unit ($i = 2, 3, \ldots, n$) uses formula [9] below to derive the backward i-character concatenation probability. Formula [9] is derived as follows.

The appearance probability $P(C_1 C_2 \cdots C_i)$ of the i-character string $C_1 C_2 \cdots C_i$ is given by formula [3]:

$P(C_1 C_2 \cdots C_i) = P_f(C_i / C_1 C_2 \cdots C_{i-1}) \times P(C_1 C_2 \cdots C_{i-1})$  [3]

Similarly, the appearance probability $P(C_1 C_2 \cdots C_k)$ of the k-character string $C_1 C_2 \cdots C_k$ ($k = 2, 3, \ldots, i-1$) is given by formula [4]:

$P(C_1 C_2 \cdots C_k) = P_f(C_k / C_1 C_2 \cdots C_{k-1}) \times P(C_1 C_2 \cdots C_{k-1})$  [4]

Repeatedly substituting formula [4] into formula [3] gives formula [5]:

$P(C_1 C_2 \cdots C_i) = P(C_1) \times \prod_{k=2}^{i} P_f(C_k / C_1 C_2 \cdots C_{k-1})$  [5]

The appearance probability $P(C_1 C_2 \cdots C_i)$ of the i-character string is also given by formula [6]:

$P(C_1 C_2 \cdots C_i) = P_b(C_1 / C_2 C_3 \cdots C_i) \times P(C_2 C_3 \cdots C_i)$  [6]

Similarly, the appearance probability of the m-character string $C_{i-m+1} C_{i-m+2} \cdots C_i$ ($m = 2, 3, \ldots, i-1$) is given by formula [7]:

$P(C_{i-m+1} C_{i-m+2} \cdots C_i) = P_b(C_{i-m+1} / C_{i-m+2} C_{i-m+3} \cdots C_i) \times P(C_{i-m+2} C_{i-m+3} \cdots C_i)$  [7]

Repeatedly substituting formula [7] into formula [6] gives formula [8]:

$P(C_1 C_2 \cdots C_i) = P(C_i) \times \prod_{m=2}^{i} P_b(C_{i-m+1} / C_{i-m+2} C_{i-m+3} \cdots C_i)$  [8]

Equating the right-hand sides of formulas [5] and [8] and solving for the $m = i$ factor, $P_b(C_1 / C_2 C_3 \cdots C_i)$, gives formula [9]:

$P_b(C_1 / C_2 C_3 \cdots C_i) = \dfrac{P(C_1) \times \prod_{k=2}^{i} P_f(C_k / C_1 C_2 \cdots C_{k-1})}{P(C_i) \times \prod_{m=2}^{i-1} P_b(C_{i-m+1} / C_{i-m+2} C_{i-m+3} \cdots C_i)}$  [9]
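
Formula [9] can be sketched in Python as a short recursion. The layout is an assumption made for this sketch: `pf` maps a k-character string to the forward probability of its last character given the first k-1 characters, `p1` maps a character to its one-character appearance probability, and `backward_from_forward` and `cyclic_counts` are hypothetical names. To make the check exact, the toy text is treated as circular (so every marginal count is consistent) and probabilities are held as `Fraction` values; neither choice comes from the patent.

```python
from collections import Counter
from fractions import Fraction
from math import prod

def backward_from_forward(s, pf, p1):
    """Formula [9]: Pb(C1 / C2...Ci) for the i-character string s."""
    i = len(s)
    # Numerator, formula [5]: P(C1) times the forward probabilities of every prefix.
    numerator = p1[s[0]] * prod(pf[s[:k]] for k in range(2, i + 1))
    # Denominator: P(Ci) times the backward probabilities of the suffixes of
    # length m = 2 .. i-1, each obtained by the same formula.
    denominator = p1[s[-1]] * prod(
        backward_from_forward(s[i - m:], pf, p1) for m in range(2, i)
    )
    return numerator / denominator

def cyclic_counts(text, k):
    """k-gram counts with wrap-around, so prefix and suffix counts agree."""
    doubled = text + text[:k - 1]
    return Counter(doubled[j:j + k] for j in range(len(text)))

text = "abcabd"   # hypothetical corpus, read as a circular string
a = {k: cyclic_counts(text, k) for k in (1, 2, 3)}
p1 = {c: Fraction(cnt, len(text)) for c, cnt in a[1].items()}
pf = {s: Fraction(cnt, a[k - 1][s[:-1]]) for k in (2, 3) for s, cnt in a[k].items()}

pb_cab = backward_from_forward("cab", pf, p1)   # Pb(c / ab)
```

Only forward tables and the one-character table are stored here. Every backward probability, including those of the shorter suffixes, is produced on demand, which mirrors the chain of backward derivation units in FIG. 2.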

Based on formula [9], the backward i-character concatenation probability derivation unit obtains the backward i-character concatenation probability using the forward k-character concatenation probabilities $P_f(C_k / C_1 C_2 \cdots C_{k-1})$ read out from the forward k-character concatenation probability registration units ($k = 2, 3, \ldots, i$), the one-character appearance probabilities $P(C_1)$ and $P(C_i)$ read out from the one-character appearance probability registration unit, and the backward m-character concatenation probabilities $P_b(C_{i-m+1} / C_{i-m+2} C_{i-m+3} \cdots C_i)$ obtained from the backward m-character concatenation probability derivation units ($m = 2, 3, \ldots, i-1$).

In this concatenation probability registration method, the forward n-character, forward (n-1)-character, forward 3-character, and forward 2-character concatenation probabilities and the one-character appearance probability are read out from the forward n-character concatenation probability registration unit 21, the forward (n-1)-character concatenation probability registration unit 22, the forward 3-character concatenation probability registration unit 23, the forward 2-character concatenation probability registration unit 24, and the one-character appearance probability registration unit 25, respectively.

The backward n-character, backward (n-1)-character, backward 3-character, and backward 2-character concatenation probabilities are derived according to formula [9] by the backward n-character concatenation probability derivation unit 26, the backward (n-1)-character concatenation probability derivation unit 27, the backward 3-character concatenation probability derivation unit 28, and the backward 2-character concatenation probability derivation unit 29, respectively. The forward n-character concatenation probability is read out from the forward n-character concatenation probability registration unit 21 and output at the output terminal 19′. The backward n-character concatenation probability is derived by the backward n-character concatenation probability derivation unit 26 and output at the output terminal 20′.

Under the same assumption as before, that every combination of character types is registered, the number of records in the forward i-character concatenation probability dictionary ($i = 2, 3, \ldots, n$) is $t^i$, and the number of records in the one-character appearance probability dictionary is $t$. The number of records in the concatenation probability dictionary is therefore

$t^n + t^{n-1} + \cdots + t = (t^{n+1} - t) / (t - 1)$.

Since the conventional concatenation probability dictionary holds $2 \times t^n$ records, as stated above, the present invention makes the concatenation probability dictionary smaller.

The concatenation probability registration method needs more records than the appearance frequency registration method, but it has the advantage that the backward m-character concatenation probabilities ($m = 2, 3, \ldots, n-1$) can be derived at the same time as the backward n-character concatenation probability.
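
As a quick check of the closed form for the concatenation probability registration method, the sketch below sums the table sizes directly and compares the result with $(t^{n+1} - t)/(t - 1)$ and with the conventional $2 \times t^n$. The function names and the sample values of t and n are assumptions made for illustration; the closed form requires t > 1.

```python
def concat_method_records(t: int, n: int) -> int:
    """One-character table (t records) plus forward i-character tables (t**i each), i = 2..n."""
    return sum(t ** i for i in range(1, n + 1))

def concat_method_closed_form(t: int, n: int) -> int:
    """Closed form of the same geometric sum; valid for t > 1."""
    return (t ** (n + 1) - t) // (t - 1)

def conventional_records(t: int, n: int) -> int:
    """Conventional dictionary: forward and backward tables of t**n records each."""
    return 2 * t ** n

# Compare the schemes for a few sample sizes (7000 is the text's figure for Japanese).
comparison = {
    (t, n): (concat_method_records(t, n), conventional_records(t, n))
    for t in (2, 10, 7000)
    for n in (2, 3, 4)
}
```

For every t of at least 2 the lower-order tables together hold fewer than $t^n$ records, so the total stays below the conventional count, though above the $t^{n-1} + t^n$ of the appearance frequency registration method whenever n exceeds 2.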

Next, a concrete example of a concatenation probability dictionary based on the appearance frequency registration method is described.

FIG. 3 shows the schematic structure of an embodiment of a concatenation probability dictionary based on the appearance frequency registration method of the present invention.

To simplify the explanation, this example takes the number of character types t to be 2 (the two characters 会 and 議) and the order n of the character concatenation probability to be 2, so that the forward and backward two-character concatenation probabilities are obtained.

In FIG. 3, 31 is a one-character appearance frequency registration unit, which registers the appearance frequencies A(会) = 20 and A(議) = 80 of the one-character strings 会 and 議. 32 is a two-character appearance frequency registration unit, which registers the appearance frequencies A(会会) = 2, A(会議) = 16, A(議会) = 8, and A(議議) = 4 of the two-character strings 会会, 会議, 議会, and 議議. 33 is a forward two-character concatenation probability derivation unit, and 34 is a backward two-character concatenation probability derivation unit. 35 and 36 are the output terminals of the forward two-character concatenation probability derivation unit 33 and the backward two-character concatenation probability derivation unit 34, respectively.

For example, the forward two-character concatenation probability $P_f$(会/議) is derived according to formula [1] as follows and output from the output terminal 35:

$P_f$(会/議) = A(議会) / A(議) = 8 / 80 = 0.1  [10]

The backward two-character concatenation probability $P_b$(会/議) is derived according to formula [2] as follows and output from the output terminal 36:

$P_b$(会/議) = A(会議) / A(議) = 16 / 80 = 0.2  [11]

The conventional concatenation probability dictionary holds $2 \times t^n = 8$ records, whereas the appearance frequency registration method holds $t^{n-1} + t^n = 6$. The number of records in the concatenation probability dictionary is therefore reduced.
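
The FIG. 3 example can be reproduced with the frequencies given in the text. In this minimal sketch the two registration units are plain Python dictionaries and the two derivation units are functions; the names `a1`, `a2`, `forward2`, and `backward2` are assumptions made for illustration.

```python
# Appearance frequencies from the FIG. 3 example.
a1 = {"会": 20, "議": 80}                           # one-character registration unit 31
a2 = {"会会": 2, "会議": 16, "議会": 8, "議議": 4}    # two-character registration unit 32

def forward2(prev: str, nxt: str) -> float:
    """Pf(nxt / prev) = A(prev + nxt) / A(prev): formula [1] with n = 2."""
    return a2[prev + nxt] / a1[prev]

def backward2(prev: str, nxt: str) -> float:
    """Pb(prev / nxt) = A(prev + nxt) / A(nxt): formula [2] with n = 2."""
    return a2[prev + nxt] / a1[nxt]

pf_kai_after_gi = forward2("議", "会")     # probability that 会 follows 議
pb_kai_before_gi = backward2("会", "議")   # probability that 会 precedes 議
records = len(a1) + len(a2)               # records held by the two registration units
```

Both derived values come from the same six stored records, where the conventional layout would store four forward and four backward probabilities.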
) = 0.05. Reference numeral 25 is a one-character appearance probability registration unit, which registers the one-character appearance probabilities P (Kai) = 0.2 and P (Kai) = 0.8 for "Kai J" and "Ki". 29 is a backward two character concatenation probability derivation unit. 37 is an output terminal of the forward 2-character concatenation probability registration section Z4, and 38 is an output terminal of the backward 2-character concatenation probability registration section 29. The forward two-character concatenation probability is read by the forward two-character concatenation probability registration unit 24 and outputted from the output terminal 37. The backward two-character concatenation probability is derived by the backward two-character concatenation probability deriving unit 29 and outputted from the output terminal 38. For example, p. (Meeting/
(2) is derived by the backward two character concatenation probability deriving unit 29 as follows. =0.2 ・ ・ ・ ・ ・ [1 2]
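The derivation behind equation (12) can be illustrated with a short sketch. It assumes that the backward probability is obtained from the registered forward probability and the one-character appearance probabilities by Bayes' rule, which reproduces the value 0.2 given in the text. The romanized keys "kai" and "gi", the dictionary layout, and the function name are illustrative and do not come from the patent.

```python
# Forward two-character concatenation probabilities as registered in unit 24 of FIG. 4.
# Key (c1, c2) holds the probability that c2 appears next after c1.
forward_prob = {
    ("kai", "kai"): 0.1,
    ("kai", "gi"): 0.8,
    ("gi", "kai"): 0.1,
    ("gi", "gi"): 0.05,
}

# One-character appearance probabilities as registered in unit 25.
char_prob = {"kai": 0.2, "gi": 0.8}


def backward_prob(c1, c2):
    """Probability that c1 appears immediately before c2.

    Assumed derivation (Bayes' rule): Pf(c2 after c1) * P(c1) / P(c2).
    Nothing is stored for the backward direction; it is computed on demand.
    """
    return forward_prob[(c1, c2)] * char_prob[c1] / char_prob[c2]
```

Under these assumptions, `backward_prob("kai", "gi")` evaluates to approximately 0.2, matching equation (12), while only the four forward probabilities and two appearance probabilities (six records) are stored.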
The number of records for the concatenation probability registration method is (t^(n+1) - t)/(t-1) = 6. Therefore, the number of records in the concatenation probability dictionary can be reduced compared with the conventional concatenation probability dictionary. In this manner, in this embodiment, by registering the forward two-character concatenation probabilities, the backward two-character concatenation probabilities can be calculated without being registered. Similarly, it is also possible to calculate the forward two-character concatenation probabilities by registering the backward two-character concatenation probabilities. The present invention can be applied to narrowing down correction candidate characters for erroneous characters contained in Japanese sentences.

The present invention has been specifically explained above based on embodiments, but it goes without saying that the present invention is not limited to the embodiments described above and can be modified in various ways without departing from the gist thereof.

[Effects of the Invention] As described above, according to the present invention, the number of character concatenation probability records to be registered in the dictionary can be reduced, and the character concatenation probability dictionary can therefore be downsized.
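The appearance frequency registration method and the record counts quoted in the description can be checked with a small sketch. It uses the example frequencies given for FIG. 3 (A = 20 and 80 for the two single characters, and 2, 16, 8, 4 for the four two-character strings) and the three record-count formulas stated in the text. The romanized keys "kai" and "gi" and all function names are illustrative assumptions, not part of the patent.

```python
# Appearance frequencies from the FIG. 3 example (units 31 and 32).
one_char_freq = {"kai": 20, "gi": 80}
two_char_freq = {
    ("kai", "kai"): 2,
    ("kai", "gi"): 16,
    ("gi", "kai"): 8,
    ("gi", "gi"): 4,
}


def derive_forward(c1, c2):
    """Forward probability: c2 appears next after c1, as A(c1 c2) / A(c1)."""
    return two_char_freq[(c1, c2)] / one_char_freq[c1]


def derive_backward(c1, c2):
    """Backward probability: c1 appears immediately before c2, as A(c1 c2) / A(c2)."""
    return two_char_freq[(c1, c2)] / one_char_freq[c2]


def record_counts(t, n):
    """Record counts for t character types and order n.

    Returns (conventional, appearance frequency method,
    concatenation probability method).
    """
    conventional = 2 * t**n                     # forward and backward tables
    frequency_method = t**(n - 1) + t**n        # (n-1)- and n-character counts
    probability_method = sum(t**k for k in range(1, n + 1))  # (t^(n+1) - t)/(t - 1)
    return conventional, frequency_method, probability_method
```

With these values, `derive_forward("kai", "gi")` gives 0.8, `derive_backward("kai", "gi")` gives 0.2, and `record_counts(2, 2)` gives `(8, 6, 6)`, in line with the figures in the description. For larger n the concatenation probability method stores more records than the appearance frequency method, as the description notes.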

[Brief explanation of drawings]

FIG. 1 is a diagram showing the structure of a concatenation probability dictionary based on the appearance frequency registration method of this embodiment. FIG. 2 is a diagram showing the structure of a concatenation probability dictionary based on the concatenation probability registration method of this embodiment. FIG. 3 is a diagram showing the schematic structure of an embodiment of a concatenation probability dictionary based on the appearance frequency registration method of the present invention. FIG. 4 is a diagram showing the schematic structure of an embodiment of a concatenation probability dictionary based on the concatenation probability registration method of the present invention. FIG. 5 is a diagram for explaining the problems of the conventional concatenation probability dictionary.

In the figures: 1, forward n-character concatenation probability registration unit; 2, backward n-character concatenation probability registration unit; 3, key part; 4, data part; 5, key part; 6, data part; 11, (n-1)-character string appearance frequency registration unit; 12, n-character string appearance frequency registration unit; 13, key part; 14, data part; 15, key part; 16, data part; 17, forward n-character concatenation probability deriving unit; 18, backward n-character concatenation probability deriving unit; 19, output terminal of the forward n-character concatenation probability deriving unit; 20, output terminal of the backward n-character concatenation probability deriving unit; 19', output of the forward n-character concatenation probability; 20', output of the backward n-character concatenation probability; 21, forward n-character concatenation probability registration unit; 22, forward (n-1)-character concatenation probability registration unit; 23, forward three-character concatenation probability registration unit; 24, forward two-character concatenation probability registration unit; 25, one-character appearance probability registration unit; 26, backward n-character concatenation probability deriving unit; 27, backward (n-1)-character concatenation probability deriving unit; 28, backward three-character concatenation probability deriving unit; 29, backward two-character concatenation probability deriving unit; 31, one-character appearance frequency registration unit; 32, two-character appearance frequency registration unit; 33, forward two-character concatenation probability deriving unit; 34, backward two-character concatenation probability deriving unit; 35, output terminal of the forward two-character concatenation probability deriving unit; 36, output terminal of the backward two-character concatenation probability deriving unit; 37, output terminal of the forward two-character concatenation probability registration unit; 38, output terminal of the backward two-character concatenation probability deriving unit.

Claims (2)

[Claims]
(1) A method of constructing a dictionary for registering character concatenation probabilities, characterized by comprising: an (n-1)-character string appearance frequency registration unit that registers the appearance frequency of the character string C_1C_2...C_{n-1}; an n-character string appearance frequency registration unit that registers the appearance frequency of the character string C_1C_2...C_n; a forward n-character concatenation probability deriving unit that obtains, as the ratio between the appearance frequency of the character string C_1C_2...C_n and the appearance frequency of the character string C_1C_2...C_{n-1}, the forward n-character concatenation probability, which is the probability that the character C_n appears next after the character string C_1C_2...C_{n-1}; a backward n-character concatenation probability deriving unit that obtains, as the ratio between the appearance frequency of the character string C_1C_2...C_n and the appearance frequency of the character string C_2...C_n, the backward n-character concatenation probability, which is the probability that the character C_1 appears immediately before the character string C_2...C_n; and means for generating the forward n-character concatenation probability in the forward n-character concatenation probability deriving unit from the appearance frequencies read out from the (n-1)-character string appearance frequency registration unit and the n-character string appearance frequency registration unit, and for generating the backward n-character concatenation probability in the backward n-character concatenation probability deriving unit from the appearance frequencies read out from the (n-1)-character string appearance frequency registration unit and the n-character string appearance frequency registration unit.
(2) A method of constructing a dictionary for registering character concatenation probabilities, characterized by comprising: forward k-character concatenation probability registration units (k = 2, 3, ..., n) that register the forward k-character concatenation probability, which is the probability that the character C_k appears next after the character string C_1C_2...C_{k-1}; a one-character appearance probability registration unit that registers the one-character appearance probability with which the character C_1 appears; and backward i-character concatenation probability deriving units (i = 2, 3, ..., n) that derive the backward i-character concatenation probability, which is the probability that the character C_1 appears immediately before the character string C_2C_3...C_i; wherein the backward i-character concatenation probability deriving unit has means for deriving the backward i-character concatenation probability using the forward r-character concatenation probabilities read out from the forward r-character concatenation probability registration units (r = 2, ..., i), the one-character appearance probability read out from the one-character appearance probability registration unit, and the backward m-character concatenation probabilities derived by the backward m-character concatenation probability deriving units (m = 2, ..., i-1).
JP1240412A 1989-09-16 1989-09-16 Character combination probability dictionary comprising method Pending JPH03102465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP1240412A JPH03102465A (en) 1989-09-16 1989-09-16 Character combination probability dictionary comprising method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP1240412A JPH03102465A (en) 1989-09-16 1989-09-16 Character combination probability dictionary comprising method

Publications (1)

Publication Number Publication Date
JPH03102465A (en) 1991-04-26

Family

ID=17059087

Family Applications (1)

Application Number Title Priority Date Filing Date
JP1240412A Pending JPH03102465A (en) 1989-09-16 1989-09-16 Character combination probability dictionary comprising method

Country Status (1)

Country Link
JP (1) JPH03102465A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012533921A (en) * 2009-07-17 2012-12-27 イーストソフト コーポレーション Data compression method

Similar Documents

Publication Publication Date Title
US6587819B1 (en) Chinese character conversion apparatus using syntax information
JPS6262372B2 (en)
JPH03102465A (en) Character combination probability dictionary comprising method
JP4084515B2 (en) Alphabet character / Japanese reading correspondence apparatus and method, alphabetic word transliteration apparatus and method, and recording medium recording the processing program therefor
JPH042254A (en) Telephone system for letter imput using approximate coincidence between input letter range
JP3285149B2 (en) Foreign language electronic dictionary search method and apparatus
JP3353769B2 (en) Character recognition device, character recognition method, and character recognition program recording medium
JPH03168863A (en) Method for constituting connection probability dictionary
JP4423369B2 (en) Kanji kana mixing input device, kanji kana mixing input method, and information recording medium
JPS62121570A (en) Continued clause conversion processing system based on connection probability
JPS6029884A (en) Reading method of word
JPH04270449A (en) Address input device
JPH0554145B2 (en)
JPS60178575A (en) Japanese processor
JP2628775B2 (en) Dictionary creation device
JPH0375865A (en) Kana-kanji conversion device
JPH0131229B2 (en)
JPS61177575A (en) Forming device of japanese document
JPS6140662A (en) Homonym selection system
JPH03179550A (en) Kana/kanji converting device
JPS61285573A (en) Kana-to-kanji converting device
JPH04372047A (en) Kana/kanji converter
JPS59116835A (en) Japanese input device with input abbreviating function
JPS58208846A (en) Priority deciding system for kana (japanese syllabary) letter train
JPH0546593A (en) Character input device