JPH11328318A

JPH11328318A - Probability table generating device, probability system language processor, recognizing device, and record medium

Info

Publication number: JPH11328318A
Application number: JP10127938A
Authority: JP
Inventors: Hideaki Tanaka; 秀明田中
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1998-05-11
Filing date: 1998-05-11
Publication date: 1999-11-30

Abstract

PROBLEM TO BE SOLVED: To enable fast, high-precision probabilistic language processing with a small memory capacity. SOLUTION: A unigram generation part 19 obtains a unigram. A digram generation part 20 obtains a digram. An attribute trigram generation part 21 obtains an attribute trigram. A vector division part 23 divides the digram into vectors whose dimension equals the number of characters to be recognized. A clustering part 24 clusters the vectors to generate a compressed character code conversion table 13 in which characters are associated with cluster codes (compressed character codes). A compressed digram generation part obtains the transition probability between pairs of compressed character codes (a compressed digram). The compressed digram has far fewer elements than the digram while retaining the linguistic transition information that the digram carries. Linguistic information that is lost is complemented by using the unigram and the attribute trigram in combination, so that high-precision language processing can be performed without sacrificing low memory capacity or speed.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a probability table creating device that creates a probability table used when performing language processing of characters or speech with low memory capacity, at high speed, and with a high recognition rate; to a probabilistic language processing device and a recognition device that use this probability table; and to a computer-readable recording medium.

[0002]

2. Description of the Related Art

In recent years, advances in recognition technology have brought to market high-accuracy recognition devices, recognition software, and products applying them, all with fully practical performance, and these have been widely adopted by users. One example is that handwritten character recognition with a pen has become an essential function of personal digital assistants (PDAs).

[0003] The largest factor in the recent improvement in recognition accuracy is, of course, that recognition methods for individual characters and speech sounds have been tuned, improved, and devised to be more accurate. In addition, progress in so-called language processing technology (also called knowledge processing) is considered to be another major factor.

[0004] Language processing technology is important and necessary in character and speech recognition devices for the following two reasons. (1) The first-candidate recognition rate of a recognition method on its own is still low. Even with today's highly accurate recognition methods, the first-candidate recognition rate for a single character in, for example, Japanese printed-character OCR (optical character reader) is only 97% to 98%. However, the recognition rate within roughly the top ten candidates exceeds 99%, so putting a more accurate recognition device into practical use requires some way of selecting the correct answer from among the candidates. The need for language processing technology is all the greater for handwritten characters and speech, whose recognition rates are lower than those of printed characters.

[0005] (2) Accurate segmentation cannot be performed with recognition evaluation values alone. In current recognition devices, methods that integrate segmentation with recognition processing are mainstream. Even so, because the performance of the recognition method is unstable as described above, segmentation that trusts the recognition evaluation values (distance, similarity, and so on) suffers from reduced accuracy. Furthermore, some patterns simply cannot be segmented using the recognition evaluation values alone. For example, for the character "加", the pattern alone does not reveal whether it is the half-width katakana "カ" followed by "ロ" or the full-width kanji "加". Some evaluation that takes the preceding and following characters into account is therefore required.

[0006] For reasons (1) and (2) above, the importance and necessity of language processing have been recognized since the beginning of character recognition research, and various improvements have been made up to the present. Language processing is now an essential process in all kinds of recognition devices.

[0007] Early attempts at language processing in Japan, when recognition devices were first being developed, were mostly based on word matching. This language processing uses a word-dictionary search to determine whether a correct recognition result is contained among the character strings formed by combining recognition candidates. Such word processing is a very powerful form of language processing (it can also be regarded as recognition processing that includes segmentation and recognition) and has since been adopted in many recognition devices.

[0008] However, language processing based on word matching has two problems: (1) character strings that are not registered in the word dictionary cannot be recognized correctly, and (2) particularly when a document has no delimiters between words, as in Japanese, the sections on which word matching is to be performed cannot be segmented accurately (morphological analysis). As a solution, probabilistic language processing has been proposed, as disclosed in [1] Japanese Patent Application Laid-Open No. 5-6464, "Character string recognition method and device," and [2] Japanese Patent Application Laid-Open No. 5-120494, "Character recognition method and device."

[0009] Publications [1] and [2] are basically probabilistic language processing in which a character string is assumed to follow a Markov model. Publication [1] uses a binary digram (a digram is a term denoting the occurrence frequency or probability of a two-character combination), and publication [2] uses a multi-valued modified digram.

[0010] Probabilistic language processing of this kind has long been practiced in Western recognition technology. It is reported, for example, in [3] E. Riseman and A. Hanson, "A contextual postprocessing system for error correction using binary n-grams," IEEE Trans. Comput., pp. 480-493, May 1974, and [4] A. Goshtasby, "Contextual word recognition using probabilistic relaxation labeling," Pattern Recognition, vol. 21, no. 5, pp. 455-462, 1988. Reference [3] uses a binary digram, and reference [4] uses a multi-valued digram.

[0011] In the West, as in Japanese research, spell-check (word matching) methods were studied for language processing from an early stage. Subsequently, however, the field shifted to probabilistic methods in order to solve the unregistered-word problem described above, and probabilistic methods are now mainstream (although at present they are often used in combination with spell checking).

[0012] In the West, the shift to probabilistic methods occurred even though word segmentation (morphological analysis) is comparatively easy relative to Japanese. This shows how many unregistered words are input to a recognition device in ordinary use, and that a practically usable recognition device cannot be realized unless this problem is solved. The advantages and problems of probabilistic language processing are described below.

[0013] (A) Advantages of probabilistic language processing

The probabilistic method evaluates the plausibility of a character string without segmenting it into morphemes. For example, suppose that for the character "日" in "本日は晴天なり" ("It is fine weather today") the recognition results "目" and "日" are obtained, so that the two character strings "本目は晴天なり" and "本日は晴天なり" are generated. If the probability values for these strings are P(本目は晴天なり) = 0.71 and P(本日は晴天なり) = 0.76, this language processing method takes the recognition result "日" as the correct answer. The advantages of language processing by the probabilistic method are as follows.

・Even unregistered words that are not in the dictionary, such as proper nouns, can be judged correct.
・Word segmentation is unnecessary. This is particularly effective for languages such as Japanese, for which morphological analysis is imperfect.
・The method works even when the morphemes have not yet been determined (for example, in the middle of input), so language processing is possible before input is complete in recognition devices that receive input incrementally, such as PDA handwriting recognition.
・It is fast.

[0014] The method of calculating the transition probability of an actual character string is now described. According to "Speech Recognition by Probabilistic Models" (Seiichi Nakagawa, Institute of Electronics, Information and Communication Engineers, Corona Publishing, first edition 1988), when a character string is C = (c₁, c₂, c₃, …, c_n) and F() is the occurrence frequency of the combination of characters in the parentheses, the transition probability P(C) of the character string C is given by Equation (1).

(Equation 1)

[0015] From Equation (1), the transition probability P(C) for a character string C of arbitrary length can ultimately be calculated once occurrence frequency (or probability) tables are prepared for two-character combinations (digrams) and three-character combinations (trigrams). Equation (1) also makes clear that morphological analysis is unnecessary, because the transition probability value can be calculated without segmenting individual words. Moreover, because the transition probability value is a probability value (an evaluation of word-like plausibility), it is large even for an unregistered word as long as the word is linguistically plausible. The correct recognition candidate can therefore be selected on the basis of the transition probability values.
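The image of Equation (1) is not reproduced in this text. As a hedged sketch only, one form consistent with the description above (a product that uses multiplication and division and needs only digram and trigram frequency tables) is the second-order Markov approximation below. The index range and the handling of the first two characters are assumptions of this sketch, not taken from the source.

```latex
P(C) \approx \prod_{i=3}^{n} \frac{F(c_{i-2},\, c_{i-1},\, c_i)}{F(c_{i-2},\, c_{i-1})}
```

Each factor estimates the probability of character c_i given the two characters before it, which is why no word boundaries are needed to evaluate it.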

[0016] Equation (1) uses multiplication and division, but in general the digrams and trigrams are expressed as transition probabilities and logarithmically transformed (compressed), so that the expression becomes one of additions and subtractions as in Equation (2). The transition probability P(C) of the character string C is then calculated by the additions and subtractions of Equation (2). As a result, P(C) can be computed by the very simple processing of looking up the digram and trigram transition probability tables and adding and subtracting, which is far faster than language processing by dictionary search (word matching).

(Equation 2)
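The log-domain evaluation described in paragraph [0016] can be sketched as follows. This is a minimal illustration, not the patent's Equation (2), whose image is not reproduced here. The function names, the toy corpus, and the add-alpha smoothing (used so that unseen combinations still have a defined logarithm) are all assumptions of this sketch.

```python
import math
from collections import Counter

def count_ngrams(text, n):
    """Count every run of n adjacent characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def log_score(candidate, digram_freq, trigram_freq, vocab_size, alpha=1.0):
    """Sum log(F(trigram) + alpha) - log(F(leading digram) + alpha * V).

    Only table lookups, additions, and subtractions are needed per character.
    alpha and vocab_size implement add-alpha smoothing, which is an
    assumption of this sketch and is not described in the source.
    """
    total = 0.0
    for i in range(len(candidate) - 2):
        tri = candidate[i:i + 3]
        total += math.log(trigram_freq[tri] + alpha)
        total -= math.log(digram_freq[tri[:2]] + alpha * vocab_size)
    return total

# Hypothetical two-sentence training corpus.
corpus = "本日は晴天なり。本日は雨天なり。"
digrams = count_ngrams(corpus, 2)
trigrams = count_ngrams(corpus, 3)
vocab = len(set(corpus))

candidates = ["本目は晴天なり", "本日は晴天なり"]
best = max(candidates, key=lambda s: log_score(s, digrams, trigrams, vocab))
```

With this toy corpus the candidate containing "日" scores higher than the one containing "目", mirroring the example in paragraph [0013].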

[0017] (B) Problems of probabilistic language processing

The problems of language processing by the probabilistic method are as follows.

・Because the transition probability value is a (multi-valued) probability, it does not readily serve as a decisive evaluation value. In language processing by word matching, the result is binary ("present" or "absent"), so the outcome is easy to decide. A transition probability value, by contrast, is multi-valued and is hard to use as a decisive evaluation of the result. In particular, for a recognition device whose vocabulary is completely fixed, the word matching method is actually more effective for deciding the result.
・When the number of characters to be recognized is large, the required memory capacity becomes enormous.

[0018] The first problem of probabilistic language processing can be avoided by using word-matching language processing in combination; in fact, this is the mainstream approach in the West. For Japanese in particular, the second problem, memory capacity, is serious.

[0019] The memory capacity of the digram transition probability table grows in proportion to the number of two-character combinations, that is, the square of the number of characters to be recognized, while that of the trigram transition probability table grows in proportion to the number of three-character combinations, that is, the cube of the number of characters to be recognized. Taking Japanese OCR as an example, with 4000 characters to be recognized and 1 byte per element of the transition probability table, the memory capacity is, as shown in FIG. 9, 16 megabytes for the digram and 64 gigabytes for the trigram, which is not realizable. In the following, the probability tables themselves are also referred to as the digram and the trigram.
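The figures quoted in paragraph [0019] follow directly from the table dimensions. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def table_bytes(num_symbols, order, bytes_per_element=1):
    """Size in bytes of a dense n-gram table with num_symbols ** order elements."""
    return num_symbols ** order * bytes_per_element

digram_bytes = table_bytes(4000, 2)   # 4000^2 elements
trigram_bytes = table_bytes(4000, 3)  # 4000^3 elements
```

Here `digram_bytes` is 16,000,000 (16 megabytes) and `trigram_bytes` is 64,000,000,000 (64 gigabytes), using decimal units as the text does.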

[0020] The simplest solution to this second problem is to give up the trigram and calculate the transition probability P(C) using the digram alone (in which case the accuracy of the resulting P(C) naturally falls). Even so, a considerable memory capacity is still required. Consequently, 10 to 20 years ago, when memory was low in capacity and expensive, probabilistic language processing was feasible for Western languages, which have few characters, but was difficult to realize for Japanese. In recent years, as memory has become cheaper and larger, probabilistic language processing has at last begun to be adopted in Japan as well.

[0021] Publications [1] and [2] address the memory capacity problem by giving up the trigram and using only the digram. Publication [1] further reduces the memory capacity by giving up the multi-valued digram and using a binary digram. Publication [2] further reduces the memory capacity by storing probability values only for combinations whose probability is at or above a certain value.

[0022]

Problems to Be Solved by the Invention

However, the conventional probabilistic language processing disclosed in publications [1] and [2] has the following problems. Publication [1] uses a binary digram, so the accuracy of the resulting transition probability P(C) is markedly lower than when a multi-valued digram is used. Moreover, although using a binary digram reduces the memory capacity, the capacity is still large at 2 megabytes (= 4000 × 4000 / 8). In particular, when combined use with word-matching language processing is considered, separate capacity is needed for the language processing dictionary, so the memory capacity of the table must be kept as small as possible.
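The 2-megabyte figure for a binary digram corresponds to one bit per character pair. A minimal sketch of such a bit-packed table is given below; the class name and the character indices are hypothetical and serve only to illustrate the storage arithmetic.

```python
class BinaryDigram:
    """Dense bit table: bit (i, j) is 1 if character j is allowed after character i."""

    def __init__(self, num_chars):
        self.num_chars = num_chars
        # One bit per ordered pair, rounded up to whole bytes.
        self.bits = bytearray((num_chars * num_chars + 7) // 8)

    def set(self, i, j):
        pos = i * self.num_chars + j
        self.bits[pos >> 3] |= 1 << (pos & 7)

    def get(self, i, j):
        pos = i * self.num_chars + j
        return (self.bits[pos >> 3] >> (pos & 7)) & 1

table = BinaryDigram(4000)
table.set(12, 3456)  # hypothetical character indices
```

For 4000 characters the backing array holds 4000 × 4000 / 8 = 2,000,000 bytes, and each lookup is a direct index with no search, but every entry carries only a yes/no value.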

[0023] Publication [2] stores in the transition probability table only the probability values of combinations whose probability is at or above a certain value. Therefore, whenever the transition probability of a given pair of characters is needed, it is necessary to search whether the transition probability of that pair is stored in the table. This impairs speed.

[0024] Accordingly, an object of the present invention is to provide a probability table creating device that creates a multi-valued probability table which obtains a high transition probability and allows a low memory capacity without impairing speed, and to provide a probabilistic language processing device and a recognition device that use this probability table.

[0025]

Means for Solving the Problems

To achieve the above object, the probability table creating device of the invention according to claim 1 comprises: a memory in which character strings of one natural language are stored; a clustering unit that clusters all the characters of the character strings stored in the memory into clusters having similar transition characteristics and assigns a cluster code to each cluster; and a two-cluster-code transition probability table creating unit that obtains the cluster code of each character in the character strings stored in the memory, obtains, for every pair of adjacent characters in the character strings, the transition probability between the cluster codes of the two characters, and creates a two-cluster-code transition probability table.

[0026] With this configuration, the number of characters or syllables to be recognized is compressed to the number of clusters. The two-cluster-code transition probability table, which represents the transition probabilities between the cluster codes of two adjacent characters, therefore has its number of elements compressed by a factor of (number of clusters / number of characters (or syllables) to be recognized)² relative to a two-character transition probability table, which represents the transition probabilities between two adjacent characters. This enables probabilistic language processing with low memory capacity and at high speed. Moreover, because characters or syllables belonging to the same cluster have similar transition characteristics, the two-cluster-code transition probability table retains the linguistic transition information of each character or syllable. Using this multi-valued two-cluster-code transition probability table therefore makes it possible to perform probabilistic language processing with high accuracy.
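The construction described in paragraphs [0025] and [0026] can be sketched as follows. The function names and toy data are hypothetical, and the choice to normalize each row into a conditional probability P(next cluster | current cluster) is an assumption of this sketch; the source does not state the normalization at this point.

```python
from collections import Counter

def cluster_digram_table(text, char_to_cluster, num_clusters):
    """Build a num_clusters x num_clusters table of cluster-code transitions."""
    counts = Counter()
    # Replace each adjacent character pair by its pair of cluster codes.
    for a, b in zip(text, text[1:]):
        counts[(char_to_cluster[a], char_to_cluster[b])] += 1
    table = [[0.0] * num_clusters for _ in range(num_clusters)]
    for i in range(num_clusters):
        row_total = sum(counts[(i, j)] for j in range(num_clusters))
        if row_total:
            for j in range(num_clusters):
                table[i][j] = counts[(i, j)] / row_total
    return table

def compression_ratio(num_clusters, num_chars):
    """Element-count ratio of the cluster table to the full digram table."""
    return (num_clusters / num_chars) ** 2

# Toy example: four characters mapped onto two clusters.
mapping = {"a": 0, "b": 1, "c": 0, "d": 1}
table = cluster_digram_table("abcdab", mapping, 2)
ratio = compression_ratio(1000, 4000)
```

With 1000 clusters for 4000 characters, `ratio` is 0.0625, so the table shrinks to one sixteenth of the full digram table while lookups remain direct indexing.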

[0027] The invention according to claim 2 is the probability table creating device of the invention according to claim 1, further comprising a three-character attribute transition probability table creating unit that obtains the attribute of each character in the character strings stored in the memory, obtains, for every run of three adjacent characters in the character strings, the transition probability between the attributes of the three characters, and creates a three-character attribute transition probability table.

[0028] With this configuration, a three-character attribute transition probability table is created that represents the transition probabilities between the attributes of three adjacent characters in the natural language. By performing probabilistic language processing that uses the three-character attribute transition probability table together with the two-cluster-code transition probability table, the linguistic transition information of each character or syllable that was lost when the number of characters or syllables to be recognized was compressed is supplemented, and the accuracy of the probabilistic language processing can be increased further.

[0029] The invention according to claim 3 is the probability table creating device of the invention according to claim 2, further comprising a one-character occurrence probability table creating unit that obtains the occurrence probability of each single character relative to all the characters in the character strings stored in the memory and creates a one-character occurrence probability table.

[0030] With this configuration, a one-character occurrence probability table is created that represents the occurrence probability of each character in the natural language. By performing probabilistic language processing that uses the one-character occurrence probability table and the three-character attribute transition probability table together with the two-cluster-code transition probability table, the linguistic transition information of each character or syllable that was lost when the number of characters or syllables to be recognized was compressed is supplemented, and the accuracy of the probabilistic language processing can be increased further.
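A one-character occurrence probability table of the kind described in paragraphs [0029] and [0030] amounts to a relative-frequency count. The following is a minimal sketch; the function name and the sample string are hypothetical.

```python
from collections import Counter

def unigram_table(text):
    """Map each character to its share of all characters in text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

probs = unigram_table("本日は晴天なり")
```

Because this table has only one entry per character, it restores per-character information at a cost that is linear, not quadratic, in the number of characters.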

[0031] The invention according to claim 4 is the probability table creating device of the invention according to claim 1, further comprising a logarithmic compression unit that logarithmically compresses each element value of the two-cluster-code transition probability table.

[0032] With this configuration, each element value of the two-cluster-code transition probability table is compressed until it can be represented in 1 byte. The storage capacity is thus reduced further, and the probabilistic language processing can accordingly be made faster still.
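One way to fit a probability into a single byte is to quantize its logarithm. The sketch below illustrates the idea only: the clipping floor `min_log`, the linear mapping onto 0..255, and both function names are assumptions, since the source states only that each element is logarithmically compressed to a value representable in 1 byte.

```python
import math

def log_compress(p, min_log=-16.0):
    """Map a probability p in [0, 1] to an integer code in 0..255.

    min_log is a hypothetical clipping floor (natural log). Probabilities at
    or below exp(min_log), including zero, all map to code 0.
    """
    if p <= 0.0:
        return 0
    log_p = max(math.log(p), min_log)
    return round(255 * (1.0 - log_p / min_log))

def log_expand(code, min_log=-16.0):
    """Approximate log-probability represented by a one-byte code."""
    return min_log * (1.0 - code / 255)

code = log_compress(0.01)
approx_log = log_expand(code)
```

Storing log values also means that scoring a string needs only additions of table entries, consistent with the addition and subtraction form described in paragraph [0016].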

[0033] The invention according to claim 5 is the probability table creating device of the invention according to claim 1, further comprising a two-character transition probability table creating unit that obtains the transition probabilities between all pairs of adjacent characters among all the characters of the character strings stored in the memory and creates a two-character transition probability table. The clustering unit divides the two-character transition probability table into as many vectors as there are characters to be recognized, each vector having a dimension equal to the number of characters to be recognized, and clusters all the resulting vectors into a predetermined number of clusters. The device further comprises a cluster code conversion table creating unit that, based on the result of the clustering, creates a cluster code conversion table used to obtain the cluster code of the cluster to which each of the two characters belongs. The two-cluster-code transition probability table creating unit has a cluster code conversion unit that converts each of the two characters into its cluster code using the cluster code conversion table, and creates the two-cluster-code transition probability table using the converted cluster codes.

[0034] With this configuration, a two-character transition probability table is created that represents the transition probabilities between two adjacent characters in the character strings. Based on this two-character transition probability table, characters or syllables having similar transition characteristics are clustered into the same cluster. By assigning a cluster code to each cluster and treating the cluster code as a character, the number of characters or syllables to be recognized is compressed to the number of clusters without impairing the linguistic transition information of each character or syllable. A cluster code conversion table used to obtain the cluster codes is also created. The cluster codes of two adjacent characters are thus obtained by the simple process of referring to the cluster code conversion table, and the two-cluster-code transition probability table is created.
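The vector division and clustering steps can be sketched as follows. Each row of the two-character transition probability table is treated as one vector, and plain k-means is used here as a stand-in clustering method; the source does not name the clustering algorithm at this point, so k-means, the function name, the initialization, and the toy table are all assumptions of this sketch.

```python
def kmeans_rows(rows, k, iterations=20):
    """Cluster row vectors with plain k-means; returns one cluster code per row.

    Centroids start from the first k rows, a hypothetical choice made only
    to keep the sketch deterministic.
    """
    centroids = [list(r) for r in rows[:k]]
    codes = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for idx, row in enumerate(rows):
            codes[idx] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its member rows.
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if codes[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return codes

# Toy 4-character digram table; row i holds the transitions out of chars[i].
chars = ["a", "b", "c", "d"]
digram = [
    [0.0, 0.9, 0.0, 0.1],
    [0.8, 0.0, 0.2, 0.0],
    [0.0, 0.8, 0.0, 0.2],
    [0.9, 0.0, 0.1, 0.0],
]
codes = kmeans_rows(digram, k=2)
# The cluster code conversion table: character -> cluster code.
char_to_cluster = dict(zip(chars, codes))
```

In this toy table "a" and "c" have nearly identical outgoing transitions, as do "b" and "d", so each pair ends up sharing one cluster code.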

[0035] The invention according to claim 6 is the probability table creating device of the invention according to claim 2, further comprising an attribute conversion table used to obtain the attribute of a character, wherein the three-character attribute transition probability table creating unit has an attribute conversion unit that converts characters into attributes using the attribute conversion table, and creates the three-character attribute transition probability table using the converted attributes.

[0036] With this configuration, the three-character attribute transition probability table is created by the simple operation of referring to the attribute conversion table.

[0037] The invention according to claim 7 is the probability table creating device of the invention according to claim 2, wherein Japanese text is stored in the memory, the characters are Japanese characters, and the attribute is one of hiragana, katakana, symbol, kanji, numeral, uppercase alphabetic letter, and lowercase alphabetic letter.

【００３８】上記構成によれば、認識対象文字数が４０
００程度に達するために、１要素を１バイトとした場合
に１６メガバイトが必要な日本語文字用の２文字遷移確
率テーブルの記憶容量が、例えば上記クラスタ数を１０
００とするクラスタリングを行って文字数圧縮して上記
２クラスタコード遷移確率テーブルを作成することによ
って、１メガバイトまで圧縮される。同様に、認識対象
文字を平仮名,片仮名,記号,漢字,数字,アルファベット
大文字およびアルファベット小文字の何れかの属性に変
換して上記３文字属性遷移確率テーブルを作成すること
によって、６４ギガバイトが必要な３文字遷移確率テー
ブルの記憶容量が最大３４３バイトまで圧縮される。こ
うして、日本語文字認識装置に搭載可能な低記憶容量の
隣接２文字間および隣接３文字間の遷移情報を表す確率
テーブルが作成される。According to the above configuration, the two-character transition probability table for Japanese characters, which requires 16 megabytes at one byte per element because the number of characters to be recognized reaches about 4000, is compressed to 1 megabyte by, for example, clustering the characters into 1000 clusters and creating the two-cluster-code transition probability table. Similarly, by converting each character to be recognized into one of the attributes hiragana, katakana, symbol, kanji, numeral, uppercase alphabet, and lowercase alphabet and creating the three-character attribute transition probability table, the three-character transition probability table, which would require 64 gigabytes, is compressed to at most 343 bytes. In this way, low-storage-capacity probability tables representing transition information between two adjacent characters and among three adjacent characters, small enough to be mounted on a Japanese character recognition device, are created.

【００３９】また、請求項８に係る発明は、請求項２に
記載の確率テーブル作成装置において、上記メモリには
中国語文章が格納されており、文字は中国語文字であ
り、上記属性は、助字,記号,助字以外の漢字,数字,アル
ファベット大文字およびアルファベット小文字の何れか
であることを特徴としている。According to an eighth aspect of the present invention, in the probability table creating apparatus according to the second aspect, the memory stores Chinese text, the characters are Chinese characters, and each attribute is one of auxiliary character (助字), symbol, Chinese character other than an auxiliary character, numeral, uppercase alphabet, and lowercase alphabet.

【００４０】上記構成によれば、日本語の場合と同様に
認識対象文字数が極端に多い中国語の場合にも、認識対
象文字にクラスタリングを行って文字数圧縮して上記２
クラスタコード遷移確率テーブルを作成することによっ
て、２文字遷移確率テーブルの記憶容量が例えば１メガ
バイトまで圧縮される。同様に、認識対象文字を助字,
記号,助字以外の漢字,数字,アルファベット大文字およ
びアルファベット小文字の何れかの属性に変換して上記
３文字属性遷移確率テーブルを作成することによって、
３文字遷移確率テーブルの記憶容量が例えば最大２１６
バイトまで圧縮される。こうして、中国語文字認識装置
に搭載可能な低記憶容量の隣接２文字間および隣接３文
字間の遷移情報を表す確率テーブルが作成される。According to the above configuration, even for Chinese, where the number of characters to be recognized is extremely large as in Japanese, the storage capacity of the two-character transition probability table is compressed to, for example, 1 megabyte by clustering the characters to be recognized and creating the two-cluster-code transition probability table. Similarly, by converting each character to be recognized into one of the attributes auxiliary character, symbol, Chinese character other than an auxiliary character, numeral, uppercase alphabet, and lowercase alphabet and creating the three-character attribute transition probability table, the storage capacity of the three-character transition probability table is compressed to, for example, at most 216 bytes. Thus, low-storage-capacity probability tables representing transition information between two adjacent characters and among three adjacent characters, small enough to be mounted on a Chinese character recognition device, are created.
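
The storage figures quoted for the Japanese and Chinese cases follow directly from the table dimensions. A minimal Python sketch of the arithmetic, assuming one byte per element and decimal megabytes and gigabytes, as the text appears to use (the constant names are illustrative and do not come from the patent):

```python
# Table sizes at one byte per element, using the example figures from the text.
CHARS = 4000      # characters to be recognized
CLUSTERS = 1000   # example cluster count
JA_ATTRS = 7      # Japanese attribute classes
ZH_ATTRS = 6      # Chinese attribute classes

digram_bytes = CHARS ** 2                 # full two-character table
compressed_digram_bytes = CLUSTERS ** 2   # two-cluster-code table
trigram_bytes = CHARS ** 3                # full three-character table
ja_attr_trigram_bytes = JA_ATTRS ** 3     # Japanese three-character attribute table
zh_attr_trigram_bytes = ZH_ATTRS ** 3     # Chinese three-character attribute table

print(digram_bytes // 10**6, "MB")             # 16 MB
print(compressed_digram_bytes // 10**6, "MB")  # 1 MB
print(trigram_bytes // 10**9, "GB")            # 64 GB
print(ja_attr_trigram_bytes, "bytes")          # 343 bytes
print(zh_attr_trigram_bytes, "bytes")          # 216 bytes
```

The two-character table shrinks by a factor of (1000/4000)² = 1/16, while the attribute tables are small because they are indexed by attribute class rather than by character.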

【００４１】また、請求項９に係る発明の確率方式言語
処理装置は、請求項１に係る発明の確率テーブル作成装
置によって作成された上記２クラスタコード遷移確率テ
ーブルと、上記２クラスタコード遷移確率テーブルを用
いて,入力文字列の文字列遷移確率を算出する文字列遷
移確率算出部を備えたことを特徴としている。According to a ninth aspect of the present invention, a stochastic language processing apparatus comprises the two-cluster-code transition probability table created by the probability table creating apparatus according to the first aspect, and a character string transition probability calculation unit that calculates the character string transition probability of an input character string using the two-cluster-code transition probability table.

【００４２】上記構成によれば、確率方式言語処理の際
に使用される上記２クラスタコード遷移確率テーブル
は、その記憶容量が従来の２文字遷移確率テーブルより
も(クラスタ数/認識対象文字(又は音節)数)²だけ圧縮さ
れている。そのため、認識対象文字数が４０００程度に
達する日本語文字が処理対象であっても、１要素を１バ
イトとして１メガバイトの記憶容量であればよい。According to the above configuration, the storage capacity of the two-cluster-code transition probability table used in stochastic language processing is reduced, relative to the conventional two-character transition probability table, by a factor of (number of clusters / number of characters (or syllables) to be recognized)². Therefore, even when Japanese characters, whose number reaches about 4000, are processed, a storage capacity of 1 megabyte at one byte per element is sufficient.

【００４３】また、請求項１０に係る発明は、請求項９
に係る発明の確率方式言語処理装置において、請求項２
および請求項３に係る発明の確率テーブル作成装置によ
って作成された上記１文字出現確率テーブルおよび３文
字属性遷移確率テーブルの少なくとも一方の確率テーブ
ルを備えると共に、上記文字列遷移確率算出部は、上記
確率テーブルをも用いて上記入力文字列の文字列遷移確
率を算出するようになっていることを特徴としている。According to a tenth aspect of the present invention, the stochastic language processing apparatus according to the ninth aspect includes at least one of the one-character appearance probability table and the three-character attribute transition probability table created by the probability table creating apparatus according to the second and third aspects, and the character string transition probability calculation unit also uses that probability table to calculate the character string transition probability of the input character string.

【００４４】上記構成によれば、上記確率方式言語処理
の際に、上記１文字出現確率テーブル、および、隣接３
文字間の遷移情報を表すテーブルである３文字属性遷移
確率テーブルの少なくとも一方が併用されて、精度の高
い文字列遷移確率が算出される。According to the above configuration, at least one of the one-character appearance probability table and the three-character attribute transition probability table, which represents transition information among three adjacent characters, is used together during the stochastic language processing, so that a highly accurate character string transition probability is calculated.

【００４５】また、請求項１１に係る発明は、請求項１
０に記載の確率方式言語処理装置において、上記文字列
遷移確率算出部は、上記文字列遷移確率を上記第１の式
によって算出することを特徴としている。According to an eleventh aspect of the present invention, in the stochastic language processing apparatus according to the tenth aspect, the character string transition probability calculation unit calculates the character string transition probability by the first equation.

【００４６】上記構成によれば、上記２クラスタコード
遷移確率テーブル,３文字属性遷移確率テーブルおよび
１文字出現確率テーブルの参照と、ごく簡単な演算処理
のみで上記文字列遷移確率が算出される。こうして、確
率方式言語処理動作の更なる高速化が図られる。According to the above configuration, the character string transition probability is calculated only by referring to the two-cluster code transition probability table, the three-character attribute transition probability table, and the one-character appearance probability table, and by a very simple operation. Thus, the speed of the stochastic language processing operation is further increased.

【００４７】また、請求項１２に係る発明は、入力され
た言語要素列から個々の言語要素を切り出す切り出し部
と,切り出された言語要素の特徴パターンと標準パター
ンとのマッチングを行って複数の認識候補を得るマッチ
ング部と,上記複数の認識候補を組み合わせて得られた
複数の候補文字列に対して確率方式言語処理を行って所
定数の候補文字列を生成する候補文字列生成部を備えた
認識装置において、請求項１に係る発明の確率テーブル
作成装置によって作成された上記２クラスタコード遷移
確率テーブルを備えると共に、上記候補文字列生成部
は,上記２クラスタコード遷移確率テーブルを用いて上
記候補文字列のスコアを算出するスコア算出部を有し,
上記算出されたスコアに基づいて上記候補文字列の確か
らしさを評価する確率方式言語処理を行うようになって
いることを特徴としている。According to a twelfth aspect of the present invention, there is provided a cut-out unit for cutting out individual language elements from an input language element sequence, and matching between a feature pattern of the cut-out language elements and a standard pattern to generate a plurality of recognition patterns. A matching unit that obtains a candidate, and a candidate character string generation unit that performs a stochastic language process on a plurality of candidate character strings obtained by combining the plurality of recognition candidates and generates a predetermined number of candidate character strings. In the recognition device, the two-cluster code transition probability table created by the probability table creation device according to the first aspect of the present invention is provided, and the candidate character string generation unit uses the two-cluster code transition probability table to create the candidate. It has a score calculation unit that calculates the score of the character string,
A probabilistic language process for evaluating the likelihood of the candidate character string based on the calculated score is performed.

【００４８】上記構成によれば、上記確率方式言語処理
の際に候補文字列の確からしさを評価する際に使用され
るスコアが、従来の２文字遷移確率テーブルよりも要素
数が(クラスタ数/認識対象文字(又は音節)数)²に圧縮さ
れている上記２クラスタコード遷移確率テーブルを使用
して算出される。そのため、認識対象文字数が４０００
程度に達する日本語文字を認識する場合であっても１要
素を１バイトとして１メガバイトの記憶容量であればよ
く、単語照合方式言語処理用の辞書を搭載したとしても
全く弊害は生じない。According to the above configuration, the score used to evaluate the likelihood of a candidate character string during the stochastic language processing is calculated using the two-cluster-code transition probability table, whose number of elements is compressed to (number of clusters / number of characters (or syllables) to be recognized)² of that of the conventional two-character transition probability table. Therefore, even when recognizing Japanese characters, whose number reaches about 4000, a storage capacity of 1 megabyte at one byte per element is sufficient, and no problem arises even if a dictionary for word-matching language processing is also installed.

【００４９】また、請求項１３に係る発明は、請求項１
２に係る発明の認識装置において、請求項２および請求
項３に係る発明の確率テーブル作成装置によって作成さ
れた上記１文字出現確率テーブルおよび３文字属性遷移
確率テーブルの少なくとも一方の確率テーブルを備える
と共に、上記スコア算出部は、上記確率テーブルをも用
いて上記候補文字列のスコアを算出するようになってい
ることを特徴としている。According to a thirteenth aspect of the present invention, the recognition device according to the twelfth aspect includes at least one of the one-character appearance probability table and the three-character attribute transition probability table created by the probability table creating apparatus according to the second and third aspects, and the score calculation unit also uses that probability table to calculate the score of the candidate character string.

【００５０】上記構成によれば、上記確率方式言語処理
の際に、上記１文字出現確率テーブルおよび３文字属性
遷移確率テーブルの少なくとも一方が併用されて、上記
候補文字列の確からしさをより正確に表すスコアが算出
される。According to the above configuration, at least one of the one-character appearance probability table and the three-character attribute transition probability table is used together during the stochastic language processing, so that a score that more accurately represents the likelihood of the candidate character string is calculated.

【００５１】また、請求項１４に係る発明は、請求項１
２に係る発明の認識装置において、上記候補文字列生成
部によって生成された所定数の候補文字列に対して単語
照合方式言語処理を行って、最適な候補文字列を上記入
力言語要素列の認識結果として出力する単語照合言語処
理部を備えたことを特徴としている。According to a fourteenth aspect of the present invention, the recognition device according to the twelfth aspect includes a word-matching language processing unit that performs word-matching language processing on the predetermined number of candidate character strings generated by the candidate character string generation unit and outputs the optimal candidate character string as the recognition result for the input language element sequence.

【００５２】上記構成によれば、上記確率方式言語処理
によって選出された複数の候補文字列の中から、単語辞
書検索による単語照合方式言語処理が行われて、認識結
果として最適な候補文字列が得られる。こうして、より
適確に入力文章や入力音声が認識される。According to the above arrangement, word matching method language processing by word dictionary search is performed from a plurality of candidate character strings selected by the probability method language processing, and an optimum candidate character string is obtained as a recognition result. can get. In this way, the input sentence and the input voice are more accurately recognized.
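
The two-stage flow described here (a stochastic stage that shortlists candidate strings, followed by word-dictionary matching) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the toy scoring function, and the exact-match dictionary lookup are hypothetical stand-ins, and the exhaustive enumeration of combinations is used for brevity rather than because the patent prescribes it.

```python
from itertools import product

def top_candidates(per_char_candidates, score, n):
    """Combine per-character candidates and keep the n best-scoring strings."""
    strings = ("".join(chars) for chars in product(*per_char_candidates))
    return sorted(strings, key=score, reverse=True)[:n]

def pick_by_dictionary(candidates, dictionary):
    """Return the best-scoring candidate that appears in the word dictionary."""
    for s in candidates:
        if s in dictionary:
            return s
    # Fall back to the top stochastic candidate when nothing matches.
    return candidates[0] if candidates else None

# Toy inputs: two recognition candidates for each of three character positions.
per_char = [["c", "e"], ["a", "o"], ["t", "l"]]
toy_scores = {"cat": -1.0, "eat": -2.0, "col": -0.5}

def toy_score(s):
    # Hypothetical stand-in for the table-based score; unknown strings score low.
    return toy_scores.get(s, -9.0)

shortlist = top_candidates(per_char, toy_score, n=3)
result = pick_by_dictionary(shortlist, {"cat", "eat"})
print(shortlist)  # ['col', 'cat', 'eat']
print(result)     # cat
```

In this toy case the stochastic stage ranks "col" first, and the dictionary stage then selects "cat", the best-scoring candidate that is an actual dictionary entry.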

【００５３】また、請求項１５に係る発明は、請求項１
３に係る発明の認識装置において、上記スコア算出部
は、上記スコアを上記第２の式によって算出することを
特徴としている。According to a fifteenth aspect of the present invention, in the recognition device according to the thirteenth aspect, the score calculation unit calculates the score by the second equation.

【００５４】上記構成によれば、上記２クラスタコード
遷移確率テーブル,３文字属性遷移確率テーブルおよび
１文字出現確率テーブルの参照と、ごく簡単な演算処理
のみとで上記スコアが算出される。こうして、認識処理
動作の高速化が図られる。According to the above configuration, the score is calculated only by referring to the two-cluster code transition probability table, the three-character attribute transition probability table, and the one-character appearance probability table, and by a very simple calculation process. Thus, the speed of the recognition processing operation is increased.

【００５５】また、請求項１６に係る発明は、請求項１
２に係る発明の認識装置において、上記入力された言語
要素列は文字列であることを特徴としている。According to a sixteenth aspect of the present invention, in the recognition device according to the twelfth aspect, the input language element sequence is a character string.

【００５６】上記構成によれば、従来の２文字遷移確率
テーブルよりも記憶容量が(クラスタ数/認識対象文字
数)²だけ圧縮されている上記２クラスタコード遷移確率
テーブルを使用することによって、低メモリ容量で高認
識率の文字認識処理が高速に行われる。According to the above configuration, by using the two-cluster-code transition probability table, whose storage capacity is reduced by a factor of (number of clusters / number of characters to be recognized)² relative to the conventional two-character transition probability table, character recognition with a high recognition rate is performed at high speed with a small memory capacity.

【００５７】また、請求項１７に係る発明のコンピュー
タ読み取り可能な記録媒体は、入力された文字列から個
々の文字を切り出す切り出し部、切り出された文字の特
徴パターンと標準パターンとのマッチングを行って複数
の認識候補を得るマッチング部、上記複数の認識候補を
組み合わせて得られた複数の候補文字列に対して,請求
項１に係る発明の確率テーブル作成装置によって作成さ
れた上記２クラスタコード遷移確率テーブルを用いた確
率方式言語処理を行って上記各候補文字列のスコアを算
出し,このスコアに基づいて所定数の候補文字列を生成
する候補文字列生成部として、コンピュータを機能させ
る文字認識プログラムが記録されていることを特徴とし
ている。A computer-readable recording medium according to a seventeenth aspect of the present invention provides a cutout unit for cutting out individual characters from an input character string, and performs matching between a feature pattern of the cutout character and a standard pattern. A matching unit that obtains a plurality of recognition candidates; and a two-cluster code transition probability created by the probability table creation device according to the first aspect of the invention, for a plurality of candidate character strings obtained by combining the plurality of recognition candidates. A character recognition program that causes a computer to function as a candidate character string generation unit that calculates a score for each of the above candidate character strings by performing stochastic language processing using a table, and generates a predetermined number of candidate character strings based on the score. Is recorded.

【００５８】上記構成によれば、請求項１６に係る発明
と同様に、上記２クラスタコード遷移確率テーブルを使
用することによって、低メモリ容量で高認識率の文字認
識処理が高速に行われる。According to the above configuration, the character recognition processing with a low memory capacity and a high recognition rate is performed at high speed by using the two-cluster code transition probability table as in the sixteenth aspect.

【００５９】尚、上記言語要素とは、１つの自然言語の
要素となる文字あるいは音節を表す概念である。The language element is a concept representing a character or a syllable that is an element of one natural language.

【００６０】[0060]

【発明の実施の形態】以下、この発明を図示の実施の形
態により詳細に説明する。本実施の形態の確率テーブル
作成装置は、言語的な遷移関係を損なわない文字数また
は音節数の圧縮を施した圧縮ダイグラム(文字通りダイ
グラムを圧縮するのではなく、認識対象文字(又は認識
対象音節)数を圧縮した際の圧縮文字(又は圧縮音節)コ
ードに関して再度ダイグラムを作成したもの：上記クラ
スタコード遷移確率テーブルに相当)と、１文字(又は１
音節)の出現確率を表すテーブルであるユニグラムと、
３文字(又は３音節)属性の組み合わせの出現確率を表す
テーブルである属性トリグラムとを作成する。そして、
上記圧縮ダイグラムを用いることによって確率方式言語
処理の低容量メモリ性および高速性を実現する。一方、
認識対象文字(又は認識対象音節)数の圧縮によって損失
した情報を上記ユニグラムおよび属性トリグラムの併用
によって補い、低容量メモリ性および高速性を損なうこ
となく、高精度な確率方式言語処理を実現するのであ
る。DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Hereinafter, the present invention will be described in detail with reference to the illustrated embodiments. The probability table creating apparatus of the present embodiment creates three tables: a compressed digram, in which the number of characters or syllables has been compressed without impairing the linguistic transition relationships (the digram itself is not literally compressed; rather, the number of characters (or syllables) to be recognized is compressed and a digram is created again over the resulting compressed character (or compressed syllable) codes, which corresponds to the cluster code transition probability table described above); a unigram, which is a table of the appearance probability of each character (or syllable); and an attribute trigram, which is a table of the appearance probability of each combination of three character (or syllable) attributes. Using the compressed digram gives the stochastic language processing its low memory requirement and high speed. The information lost by compressing the number of characters (or syllables) to be recognized is compensated for by using the unigram and the attribute trigram together, so that highly accurate stochastic language processing is achieved without sacrificing the low memory requirement or the speed.

【００６１】以下、この発明の根幹を成す上記圧縮ダイ
グラムとその作成方法、属性トリグラムとその作成方
法、および、文字列確率計算方法について説明する。
尚、文字の場合も音節の場合も基本原理や基本構成は同
じであるから、以下の説明は文字で代表して行う。The compressed digram and its creation method, the attribute trigram and its creation method, and the character string probability calculation method, which form the basis of the present invention, are described below. Since the basic principle and configuration are the same for characters and syllables, the following description uses characters as the representative case.

【００６２】（ａ）圧縮ダイグラム(２クラスタコード
遷移確率テーブル)について上記ダイグラムのメモリ容量は、認識対象文字数の二乗
に比例して増加する。逆に言えば、認識対象の文字数が
少なくなればメモリ容量は劇的に少なくなる。例えば、
認識対象が１０００文字の場合のダイグラムのメモリ容
量は、１要素当たり１バイトが必要であるとすると１０
００×１０００＝１メガバイトであり、この程度のメモ
リ容量であれば充分実現可能である。従って、ダイグラ
ムそのものを圧縮するのではなく、何らかの方法で、認
識対象文字数を１０００程度に圧縮すれば良いことにな
る。この発明の圧縮ダイグラムは、これに鑑みて考えら
れたものであり、以下のようにして作成される。(a) Compressed digram (two-cluster-code transition probability table): The memory capacity of the digram increases in proportion to the square of the number of characters to be recognized. Conversely, if the number of characters to be recognized is reduced, the memory capacity drops dramatically. For example, when 1000 characters are to be recognized and one byte is required per element, the digram occupies 1000 × 1000 = 1 megabyte, which is entirely feasible. Therefore, instead of compressing the digram itself, it suffices to compress the number of characters to be recognized to about 1000 by some method. The compressed digram of the present invention was conceived with this in mind and is created as follows.

【００６３】先ず、大量の一般文書(認識対象原稿が限
定されている場合はその認識対象原稿)から、１文字の
出現確率(ユニグラム)と、図２(a)に示すような隣接２
文字間の遷移確率(ダイグラム)とを求める。認識対象文
字カテゴリ数を４０００とすると、この時点でのダイグ
ラムは４０００×４０００＝１６メガバイトの容量を有
する。尚、ユニグラムの容量は４０００＝４キロバイト
である。First, the appearance probability of each character (unigram) and the transition probability between two adjacent characters (digram), as shown in FIG. 2(a), are obtained from a large number of general documents (or from the documents to be recognized, if these are restricted). Assuming that the number of character categories to be recognized is 4000, the digram at this point has a capacity of 4000 × 4000 = 16 megabytes. The unigram has 4000 elements, that is, a capacity of 4 kilobytes.
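
The first step, collecting single-character and adjacent-pair statistics from a corpus, can be sketched in Python as below. This is a minimal illustration, assuming that each probability is the count divided by the total number of characters or pairs; the function name and the dictionary-based tables are hypothetical, whereas the patent describes fixed-size tables indexed by internal character code.

```python
from collections import Counter

def build_unigram_and_digram(text):
    """Return (unigram, digram) relative-frequency tables for a string."""
    unigram_counts = Counter(text)
    digram_counts = Counter(zip(text, text[1:]))  # adjacent character pairs
    total_chars = sum(unigram_counts.values())
    total_pairs = sum(digram_counts.values())
    unigram = {c: n / total_chars for c, n in unigram_counts.items()}
    digram = {pair: n / total_pairs for pair, n in digram_counts.items()}
    return unigram, digram

unigram, digram = build_unigram_and_digram("abab")
print(unigram["a"])         # 0.5
print(digram[("a", "b")])   # 2 of the 3 adjacent pairs are ("a", "b")
```

A sparse dictionary is convenient for a sketch; a dense 4000 × 4000 array is what gives the 16-megabyte figure quoted in the text.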

【００６４】次に、上記ダイグラムにおける各行(注目
文字から全文字への遷移確率)を４０００次元ベクトル
データと想定し、ダイグラムを４０００個のベクトルに
分解する。そして、図２(b)に示すように、これら４０
００個のベクトルを所定数（例えば１０００）になるよ
うにクラスタリングする。その場合のクラスタリング手
法としては、従来から知られているｋ−ミーンズ法やウ
ォード法等を利用する。尚、上記クラスタリングとは、
距離が近い複数のベクトルを１つのクラスタに代表する
ことである。Next, each row of the digram (the transition probabilities from the character of interest to all characters) is regarded as a 4000-dimensional vector, and the digram is decomposed into 4000 vectors. Then, as shown in FIG. 2(b), these 4000 vectors are clustered into a predetermined number of clusters (for example, 1000). A conventionally known clustering method such as the k-means method or Ward's method is used. Here, clustering means representing a plurality of vectors that are close to one another by a single cluster.
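
The row-wise clustering can be sketched with a plain k-means loop in Python. The text names k-means only as one possible method, so this is an illustration rather than the patent's procedure: the naive initialization from the first k rows, the fixed iteration count, and the toy 4 × 3 table are assumptions made for brevity.

```python
def kmeans(vectors, k, iterations=20):
    """Plain k-means; returns a cluster index for each input vector."""
    centroids = [list(v) for v in vectors[:k]]  # naive initialization (assumption)
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid (squared Euclidean distance).
        for i, v in enumerate(vectors):
            assignment[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(v, centroids[j])),
            )
        # Move each centroid to the mean of its current members.
        for j in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assignment[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Toy digram: row i holds the transition probabilities from character i.
digram_rows = [
    [0.9, 0.1, 0.0],   # character 0
    [0.0, 0.1, 0.9],   # character 1
    [0.8, 0.2, 0.0],   # character 2, similar to character 0
    [0.1, 0.0, 0.9],   # character 3, similar to character 1
]
cluster_of = kmeans(digram_rows, k=2)
print(cluster_of)  # [0, 1, 0, 1]
```

The returned list plays the role of the compressed character code conversion table: characters 0 and 2 share one cluster code, and characters 1 and 3 share the other.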

【００６５】上記ベクトルは、ある文字を注目文字とし
た場合の注目文字から全文字への遷移確率を要素として
いる。従って、ベクトル間の距離が近い(つまり、クラ
スタリングによって同一クラスタに分類される)夫々の
ベクトル(注目文字)は、言語的な遷移関係が極めて近い
ということになる。つまり、同一クラスタに分類された
複数の注目文字(ベクトル)ｃ1〜ｃ4を、１つの圧縮文字
コード(クラスタコード)cc0で置き換えても言語的な情
報損失は少ないことになる。こうして、言語的な情報の
損失を少なくして４０００個の認識対象文字を１０００
個の圧縮文字コードに圧縮するのである。Each vector has as its elements the transition probabilities from a character of interest to all characters. Therefore, vectors (characters of interest) that are close to one another, that is, classified into the same cluster by the clustering, have very similar linguistic transition relationships. In other words, even if a plurality of characters of interest (vectors) c1 to c4 classified into the same cluster are replaced with a single compressed character code (cluster code) cc0, little linguistic information is lost. In this way, the 4000 characters to be recognized are compressed into 1000 compressed character codes with little loss of linguistic information.

【００６６】上記クラスタリングの終了後、求められた
圧縮文字コードを用いて、今一度大量文書から文字数を
圧縮したダイグラム(圧縮ダイグラム)を作成する。尚、
その場合に、大量文書からの文字を圧縮文字コード(ク
ラスタコード)に変換する必要がある。そこで、上記ク
ラスタリングの結果から圧縮文字コードと文字コードと
の対応テーブル(圧縮文字コード変換テーブル)を作成し
ておき、この圧縮文字コード変換テーブルを用いて、入
力文字を圧縮文字コードに変換するのである。こうして
得られた圧縮ダイグラムは、圧縮文字コード数(クラス
タ数：１０００)の二乗に比例したメモリ容量を有し、
認識対象文字数が４０００の場合には１メガバイトとな
る。すなわち、充分実現可能な程度のメモリ容量を達成
できるのである。また、その場合の要素値は多値である
ため圧縮文字コード間の高い精度で遷移確率を表現する
ことができる。After the clustering is completed, a digram with the compressed number of characters (compressed digram) is created from the large document set once more, using the compressed character codes obtained. To do so, the characters from the documents must be converted into compressed character codes (cluster codes). A correspondence table between compressed character codes and character codes (compressed character code conversion table) is therefore created from the clustering result, and input characters are converted into compressed character codes using this table. The compressed digram thus obtained has a memory capacity proportional to the square of the number of compressed character codes (number of clusters: 1000), which is 1 megabyte even when the number of characters to be recognized is 4000. That is, an entirely feasible memory capacity is achieved. Moreover, because the element values are multi-valued, the transition probabilities between compressed character codes can be expressed with high accuracy.
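
Re-counting the corpus over cluster codes can be sketched as follows. This is a hedged Python illustration: `code_table` stands in for the compressed character code conversion table, its contents are made up, and normalizing by the total pair count is an assumption about how the probabilities are formed.

```python
from collections import Counter

def build_compressed_digram(text, code_table):
    """Relative frequency of adjacent cluster-code pairs in `text`."""
    codes = [code_table[ch] for ch in text]        # character -> compressed code
    pair_counts = Counter(zip(codes, codes[1:]))   # adjacent code pairs
    total = sum(pair_counts.values())
    return {pair: n / total for pair, n in pair_counts.items()}

# Hypothetical conversion table: "a" and "c" share cluster 0, "b" is cluster 1.
code_table = {"a": 0, "b": 1, "c": 0}
compressed = build_compressed_digram("abcb", code_table)
print(compressed[(0, 1)])  # pairs "ab" and "cb" both count toward (0, 1)
```

Because "a" and "c" map to the same code, their transition statistics are pooled, which is exactly where the table size is saved and where some character-level detail is lost.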

【００６７】（ｂ）属性トリグラム(３文字属性遷移確
率テーブル)について上記属性トリグラムは、上記圧縮ダイグラム(正確に
は、認識対象文字数を圧縮したダイグラム)の作成時に
損失した情報を補うものである。本実施の形態における
文字属性は、図３に示すように「属ひ：平仮名」,「属
カ：片仮名」,「属記：記号」,「属漢：漢字」,「属
数：数字」,「属大：アルファベット大文字」,「属小：
アルファベット小文字」の７属性である。したがって、
上記属性を用いて作成された属性トリグラムのメモリ容
量は、７バイトの三乗であり３４３バイトである。この
属性トリグラムは、圧縮ダイグラムの場合と同様に、大
量文書からの文字の文字コードを文字属性変換テーブル
を用いて文字属性コードに変換する。そして、隣接３文
字間の文字属性コード遷移確率を算出することによって
求められる。(b) Attribute trigram (three-character attribute transition probability table): The attribute trigram compensates for the information lost when the compressed digram (more precisely, the digram over the compressed number of characters to be recognized) is created. As shown in FIG. 3, the character attributes in the present embodiment are the seven attributes 「属ひ」 (hiragana), 「属カ」 (katakana), 「属記」 (symbol), 「属漢」 (kanji), 「属数」 (numeral), 「属大」 (uppercase alphabet), and 「属小」 (lowercase alphabet). Therefore, the attribute trigram created with these attributes has 7³ = 343 elements, that is, a memory capacity of 343 bytes. As with the compressed digram, the attribute trigram is obtained by converting the character code of each character from the large document set into a character attribute code using the character attribute conversion table and then calculating the transition probabilities of character attribute codes among three adjacent characters.
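
An attribute trigram can be sketched in Python as below. The patent uses a character attribute conversion table; the Unicode-range test in `attribute_of` is an assumption substituted for that table, and the attribute names and the sample string are illustrative only.

```python
from collections import Counter

ATTRS = ["hiragana", "katakana", "symbol", "kanji", "digit", "upper", "lower"]

def attribute_of(ch):
    """Map a character to one of the seven attribute classes (rough heuristic)."""
    cp = ord(ch)
    if 0x3041 <= cp <= 0x3096:
        return "hiragana"
    if 0x30A1 <= cp <= 0x30FA or cp == 0x30FC:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if ch in "0123456789０１２３４５６７８９":
        return "digit"
    if "A" <= ch <= "Z" or "Ａ" <= ch <= "Ｚ":
        return "upper"
    if "a" <= ch <= "z" or "ａ" <= ch <= "ｚ":
        return "lower"
    return "symbol"  # everything else is treated as a symbol

def build_attribute_trigram(text):
    """Relative frequency of attribute triples over adjacent characters."""
    attrs = [attribute_of(ch) for ch in text]
    counts = Counter(zip(attrs, attrs[1:], attrs[2:]))
    total = sum(counts.values())
    return {triple: n / total for triple, n in counts.items()}

trigram = build_attribute_trigram("本日は晴れ")
print(len(ATTRS) ** 3)  # 343 possible attribute triples
print(trigram[("kanji", "kanji", "hiragana")])
```

With seven classes there are only 343 possible triples, which is why this table stays tiny regardless of how many characters are recognized.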

【００６８】上述のようにして得られたユニグラム,圧
縮ダイグラムおよび属性トリグラムは、それらの合計の
メモリ容量が１メガバイト程度と十分実用的な容量であ
り、且つ、言語情報の損失が少ない高精度な確率テーブ
ルである。したがって、これらの確率テーブルを言語処
理に用いることによって、高精度,高速および低メモリ
な認識処理を行うことができるのである。The unigram, compressed digram and attribute trigram obtained as described above have a sufficiently practical total memory capacity of about 1 megabyte, and have high precision with little loss of linguistic information. It is a probability table. Therefore, by using these probability tables for language processing, high-accuracy, high-speed, and low-memory recognition processing can be performed.

【００６９】（ｃ）文字列遷移確率計算方法についてこの発明における確率方式言語処理では、上記ユニグラ
ム,圧縮ダイグラムおよび属性トリグラムの確率テーブ
ルを用いて、式(１)及び式(２)から求めた以下の近似式
(３)によって文字列ｘの文字列遷移確率Ｐ(ｘ)を算出す
る。尚、この発明で用いる文字列遷移確率は、式(１)を
対数圧縮したものが前提となっている。したがって、以
下、説明上区別を必要とする場合のみ「log{Ｐ(ｘ)}」
と記述するが、単に「Ｐ(ｘ)」と記述した場合も対数圧縮
した遷移確率を表しているものとする。(c) Character string transition probability calculation method: In the stochastic language processing of the present invention, the character string transition probability P(x) of a character string x is calculated by the following approximate expression (3), derived from equations (1) and (2), using the unigram, compressed digram, and attribute trigram probability tables. The character string transition probability used in the present invention is assumed to be the logarithm of equation (1). Therefore, it is written as log{P(x)} only where the distinction is needed for the explanation, and a plain P(x) also denotes the log-compressed transition probability.

【００７０】
\[
P(c_1, c_2, \ldots, c_n) = P(c_1) + P(c_1, c_2) + P(c_1, c_2, c_3) + P(c_2, c_3, c_4) + \cdots + P(c_{n-2}, c_{n-1}, c_n) + P(c_{n-1}, c_n) + P(c_n) \tag{3}
\]

【００７１】ここで、夫々の遷移確率は、上記ユニグラ
ム,圧縮ダイグラムおよび属性トリグラムを用いて、以
下の近似式(４)によって算出する。Here, the individual transition probabilities are calculated by the following approximate expressions (4), using the unigram, the compressed digram, and the attribute trigram.
\[
\begin{aligned}
P(x) &= \mathrm{unigram}(x) \\
P(x, y) &= \frac{\mathrm{unigram}(x) + \mathrm{unigram}(y)}{2} + \frac{\mathrm{compressed\ digram}(x^c, y^c)}{2} \\
P(x, y, z) &= \frac{\mathrm{unigram}(x) + \mathrm{unigram}(y) + \mathrm{unigram}(z)}{3} + \frac{\mathrm{compressed\ digram}(x^c, y^c) + \mathrm{compressed\ digram}(y^c, z^c)}{2} + \frac{\mathrm{attribute\ trigram}(x^a, y^a, z^a)}{3}
\end{aligned}
\tag{4}
\]
where
x, y, z: character codes;
x^c, y^c, z^c: the compressed character codes of character codes x, y, z;
x^a, y^a, z^a: the character attribute codes of character codes x, y, z;
unigram(x): the element value for character code x in the unigram;
compressed digram(x^c, y^c): the element value for the compressed character code pair (x^c, y^c) in the compressed digram;
attribute trigram(x^a, y^a, z^a): the element value for the character attribute code triple (x^a, y^a, z^a) in the attribute trigram.

【００７２】具体的に説明すれば、文字列「本日は××
××へ」の場合は、Ｐ(本日は××××へ)＝Ｐ(本)＋Ｐ(本日)＋Ｐ(本日は)
＋Ｐ(日はシ)＋Ｐ(はシャ) ＋… ＋Ｐ(ープへ）＋Ｐ(プ
へ)＋Ｐ(へ) と展開され、Ｐ(本日は)の部分は、式(４)によって以下
の通りとなる。Ｐ(本日は)＝{ユニグラム(本)＋ユニグラム(日)＋ユニ
グラム(は)}/３＋{圧縮ダイグラム(本^c,日^c)＋圧縮ダイ
グラム(日^c,は^c)}/２＋{属性トリグラム(本^a,日^a,は^a)}
/３More specifically, for the character string 「本日は××××へ」, the probability expands as P(本日は××××へ) = P(本) + P(本日) + P(本日は) + P(日はシ) + P(はシャ) + … + P(ープへ) + P(プへ) + P(へ), and the term P(本日は) is given by equation (4) as P(本日は) = {unigram(本) + unigram(日) + unigram(は)}/3 + {compressed digram(本^c, 日^c) + compressed digram(日^c, は^c)}/2 + {attribute trigram(本^a, 日^a, は^a)}/3.
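
The expansion in expressions (3) and (4) can be sketched in Python as follows. The table contents here are invented log-probability values chosen to make the arithmetic easy to follow, the function and parameter names are hypothetical, and the division of the digram term by 2 in the two-character case follows expression (4) as printed. Strings shorter than three characters are rejected because the text does not define the expansion for them.

```python
def p1(x, uni):
    return uni[x]

def p2(x, y, uni, dig, code):
    return (uni[x] + uni[y]) / 2 + dig[(code[x], code[y])] / 2

def p3(x, y, z, uni, dig, tri, code, attr):
    return ((uni[x] + uni[y] + uni[z]) / 3
            + (dig[(code[x], code[y])] + dig[(code[y], code[z])]) / 2
            + tri[(attr[x], attr[y], attr[z])] / 3)

def string_score(s, uni, dig, tri, code, attr):
    """Expression (3): edge unigram and digram terms plus every trigram term."""
    if len(s) < 3:
        raise ValueError("the expansion is only defined here for 3+ characters")
    score = p1(s[0], uni) + p2(s[0], s[1], uni, dig, code)
    for i in range(len(s) - 2):
        score += p3(s[i], s[i + 1], s[i + 2], uni, dig, tri, code, attr)
    score += p2(s[-2], s[-1], uni, dig, code) + p1(s[-1], uni)
    return score

# Toy log-probability tables (all values are made up for illustration).
uni = {"a": -1.0, "b": -2.0, "c": -3.0}
code = {"a": 0, "b": 1, "c": 0}              # compressed character codes
attr = {"a": "lower", "b": "lower", "c": "lower"}
dig = {(0, 1): -1.0, (1, 0): -2.0}           # compressed digram
tri = {("lower", "lower", "lower"): -3.0}    # attribute trigram

score = string_score("abc", uni, dig, tri, code, attr)
print(score)  # -14.0
```

For "abc" the five terms are -1.0, -2.0, -4.5, -3.5, and -3.0, which sum to -14.0. Because the tables hold logarithms, the terms are added rather than multiplied.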

【００７３】上述の文字列遷移確率の展開式において、
「３」で割る箇所は徐算テーブルで、「２」で割る箇所はビ
ットシフトで実現可能である。したがって、上述の演算
式はテーブルの参照と加算のみで行うことができ、高速
処理が可能である。In the expanded expression for the character string transition probability above, division by 3 can be implemented with a division table and division by 2 with a bit shift. The calculation can therefore be carried out using only table lookups and additions, which enables high-speed processing.
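
The division-free arithmetic can be sketched in Python as below. This assumes that table elements are stored as non-negative one-byte integers (for example, quantized log-probabilities); the text states one byte per element but does not spell out the encoding, so the value range and the names `DIV3`, `mean3`, and `mean2` are assumptions.

```python
# Precomputed table: DIV3[s] == s // 3 for every possible sum of three byte values.
DIV3 = [s // 3 for s in range(3 * 255 + 1)]

def mean3(a, b, c):
    """Average of three byte values via table lookup instead of division."""
    return DIV3[a + b + c]

def mean2(a, b):
    """Average of two byte values via a right shift instead of division."""
    return (a + b) >> 1

print(mean3(10, 20, 33))  # 21
print(mean2(7, 10))       # 8
```

The lookup table needs only 766 entries, so it adds a negligible amount of memory next to the probability tables themselves.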

【００７４】図１は、第１実施の形態の確率テーブル作
成装置の機能ブロック図である。この確率テーブル作成
装置は、確率方式言語処理を行う際に用いる上記ユニグ
ラム,圧縮ダイグラムおよび属性トリグラム等の確率テ
ーブルを作成してメモリに格納するものである。尚、本
確率テーブル作成装置は、あらゆる言語に適用可能であ
るが、説明の便宜上、言語は日本語、処理対象文字数は
４０００文字であるとする。また、総ての漢字コード
は、内部コード(１〜４０００)化されているものとす
る。FIG. 1 is a functional block diagram of the probability table creating apparatus according to the first embodiment. This probability table creation device creates a probability table such as the unigram, the compressed digram, and the attribute trigram used in performing the stochastic language processing, and stores it in a memory. The present probability table creation device can be applied to any language, but for convenience of explanation, it is assumed that the language is Japanese and the number of characters to be processed is 4000 characters. It is assumed that all kanji codes are converted into internal codes (1 to 4000).

【００７５】文書データベース１には大量の日本語の一
般文書が格納されている。現文字バッファ１５,前文字
バファ１６および前々文字バッファ１７には、文書デー
タベース１から読み出され１文字、当該文字の直前の文
字、当該文字の前々文字が格納される。ユニグラム作成
部１９は、文書データベース１中の全文字に対する各１
文字の出現確率を算出してユニグラムバッファ２に格納
する。ダイグラム作成部２０は、文書データベース１中
の全文字に関する２文字組み合わせの出現確率(遷移確
率)を算出してダイグラムバッファ４に格納する。さら
に、属性トリグラム作成部２１は、文書データベース１
中の全文字に関する３文字属性組み合わせの出現確率
(遷移確率)を算出して属性トリグラムバッファ６に格納
する。The document database 1 stores a large number of general Japanese documents. The current character buffer 15, the previous character buffer 16, and the second-previous character buffer 17 store, respectively, one character read from the document database 1, the character immediately before it, and the character two positions before it. The unigram creation unit 19 calculates the appearance probability of each character over all characters in the document database 1 and stores it in the unigram buffer 2. The digram creation unit 20 calculates the appearance probabilities (transition probabilities) of two-character combinations over all characters in the document database 1 and stores them in the digram buffer 4. Further, the attribute trigram creation unit 21 calculates the appearance probabilities (transition probabilities) of three-character attribute combinations over all characters in the document database 1 and stores them in the attribute trigram buffer 6.

【００７６】圧縮ダイグラム作成部２２は、後に詳述す
るようにして認識対象文字数をｋ個のクラスタにクラス
タリングして文字数を圧縮した場合に、文書データベー
ス１中の全文字が属する圧縮文字コード(クラスタコー
ド)に関して２圧縮文字コードの組み合わせの出現確率
を算出して圧縮ダイグラムバッファ８に格納する。ベク
トル分割部２３は、上記ダイグラムの各行をベクトルデ
ータと想定して各ベクトルに分割し、分割された各ベク
トルをベクトルバッファ１０に格納する。クラスタリン
グ部２４は、全文字に関する上記ベクトルをクラスタリ
ングし、クラスタリング結果をクラスタバッファ１１に
格納する。制御部１８は、ユニグラム作成部１９,ダイ
グラム作成部２０,属性トリグラム作成部２１,圧縮ダイ
グラム作成部２２,ベクトル分割部２３およびクラスタ
リング部２４等を制御して上記確率テーブルを作成す
る。When the characters to be recognized have been clustered into k clusters to compress the number of characters, as described in detail later, the compressed digram creation unit 22 calculates the appearance probabilities of combinations of two compressed character codes (cluster codes) to which the characters in the document database 1 belong, and stores them in the compressed digram buffer 8. The vector dividing unit 23 regards each row of the digram as vector data, divides the digram into vectors, and stores each vector in the vector buffer 10. The clustering unit 24 clusters the vectors for all characters and stores the clustering result in the cluster buffer 11. The control unit 18 controls the unigram creation unit 19, the digram creation unit 20, the attribute trigram creation unit 21, the compressed digram creation unit 22, the vector dividing unit 23, the clustering unit 24, and so on to create the probability tables.

【００７７】尚、全文字数カウンタ３は、上記文書デー
タベース１中の全文字数をカウントする。全２文字組み
合わせ数カウンタ５は、文書データベース１中に存在す
る全２文字組み数をカウントする。全３文字属性組み合
わせ数カウンタ７は、文書データベース１中に存在する
全３文字属性組み数をカウントする。全２圧縮文字コー
ド組み合わせ数カウンタ９は、文書データベース１中に
存在する全２圧縮文字コード組み数をカウントする。ま
た、文字属性変換テーブル１２は、文書データベース１
中の文字を文字属性に変換する場合に参照される。圧縮
文字コード変換テーブル１３は、文書データベース１中
の文字を圧縮文字コードに変換する場合(つまり、文書
データベース１中の文字数を圧縮する場合)に参照され
る。一時メモリ１４は、上記各処理の際に得られたデー
タを一時保管するのに使用される。The total character counter 3 counts the total number of characters in the document database 1. The total two-character combination counter 5 counts the total number of two-character combinations in the document database 1. The total three-character attribute combination counter 7 counts the total number of three-character attribute combinations in the document database 1. The total two-compressed-character-code combination counter 9 counts the total number of combinations of two compressed character codes in the document database 1. The character attribute conversion table 12 is referred to when characters in the document database 1 are converted into character attributes. The compressed character code conversion table 13 is referred to when characters in the document database 1 are converted into compressed character codes (that is, when the number of characters in the document database 1 is compressed). The temporary memory 14 is used to temporarily hold data obtained during each of the above processes.

【００７８】上記構成の確率テーブル作成装置は、上記
制御部１８の制御の下に以下のように動作して確率テー
ブルを作成する。図４および図５は、上記制御部１８に
よって実行される確率テーブル作成処理動作のフローチ
ャートである。以下、図４および図５に従って確率テー
ブルの作成手順について説明する。The probability table creating apparatus configured as above operates as follows under the control of the control unit 18 to create the probability tables. FIGS. 4 and 5 are flowcharts of the probability table creation processing executed by the control unit 18. The procedure for creating the probability tables is described below with reference to FIGS. 4 and 5.

【００７９】ステップＳ1で、全バッファおよび全カウ
ンタが「０」クリアされる。ステップＳ2で、文書デー
タベース１から１文字が読み出されて現文字バッファ１
５に格納される。ステップＳ3で、上記文書データベー
ス１に文字が在ったか否かが判定される。その結果、文
字が在ればステップＳ4に進み、文字が無ければステッ
プＳ14に進む。In step S1, all buffers and all counters are cleared to "0". In step S2, one character is read from the document database 1 and stored in the current character buffer 15. In step S3, it is determined whether a character was present in the document database 1. If a character was present, the process proceeds to step S4; if not, it proceeds to step S14.

【００８０】ステップＳ4で、上記全文字数カウンタ３
によって全文字数がカウントアップされる。ステップＳ
5で、ユニグラム作成部１９によって、ユニグラム計算
が行われる。このユニグラムの計算は、現文字バッファ
１５に格納されている現文字に対応するユニグラムバッ
ファ２の要素値(出現数)をインクリメントすることによ
って行われる。In step S4, the total character counter 3 counts up the total number of characters. In step S5, the unigram creation unit 19 performs the unigram calculation, which increments the element value (number of appearances) of the unigram buffer 2 corresponding to the current character stored in the current character buffer 15.

【００８１】ステップＳ6で、上記前文字バッファ１６
に前文字の文字コードが格納されているか否かが判定さ
れる。その結果格納されていればステップＳ7に進み、
そうでなければステップＳ9に進む。ステップＳ7で、全
２文字組み合わせ数カウンタ５によって全２文字組み数
がカウントアップされる。ステップＳ8で、ダイグラム
作成部２０によって、ダイグラム計算が行われる。この
ダイグラムの計算は、前文字バッファ１６と現文字バッ
ファ１５とに格納されている２つの文字の組み合わせに
対応するダイグラムバッファ４の要素値(出現数)をイン
クリメントすることによって行われる。In step S6, it is determined whether the character code of the previous character is stored in the previous character buffer 16. If it is stored, the process proceeds to step S7; otherwise, it proceeds to step S9. In step S7, the total two-character combination counter 5 counts up the total number of two-character combinations. In step S8, the digram creation unit 20 performs the digram calculation, which increments the element value (number of appearances) of the digram buffer 4 corresponding to the combination of the two characters stored in the previous character buffer 16 and the current character buffer 15.

【００８２】ステップＳ9で、上記前々文字バッファ１
７に前々文字の文字コードが格納されているか否かが判
定される。その結果格納されていればステップＳ10に進
み、そうでなければステップＳ13に進む。ステップＳ10
で、現文字バッファ１５,前文字バッファ１６および前
々文字バッファ１７に格納されている文字コードが文字
属性コードに変換される。この文字属性変換は、上記文
字属性変換テーブル１２を用いることによって、文字コ
ード(漢字コードは総て１〜４０００の内部コード化さ
れている)から直接変換される。この文字属性変換テー
ブル１２は、文字コードをオフセットアドレスとして当
該文字コードの文字属性コードが参照できる構造を有し
ており、(認識対象文字数×１＝４０００)バイトの容量
が必要である。尚、この文字属性変換テーブル１２は、
予め作成しておく。In step S9, it is determined whether the character code of the second-previous character is stored in the second-previous character buffer 17. If it is stored, the process proceeds to step S10; otherwise, it proceeds to step S13. In step S10, the character codes stored in the current character buffer 15, the previous character buffer 16, and the second-previous character buffer 17 are converted into character attribute codes. The conversion is made directly from the character code (all kanji codes are converted to internal codes 1 to 4000) by using the character attribute conversion table 12. The character attribute conversion table 12 is structured so that the character attribute code of a character can be looked up using its character code as an offset address, and it requires (number of recognition target characters × 1 =) 4000 bytes. The character attribute conversion table 12 is created in advance.

【００８３】ステップＳ11で、上記全３文字属性組み合
わせ数カウンタ７によって全３文字属性組み数がカウン
トされる。ステップＳ12で、属性トリグラム作成部２１
によって、属性トリグラム計算が行われる。この属性ト
リグラムの計算は、前々文字バッファ１７,前文字バッ
ファ１６及び現文字バッファ１５に格納されている３文
字の文字属性の組み合わせに対応する属性トリグラムバ
ッファ６の要素値(出現数)をインクリメントすることに
よって行われる。In step S11, the total three-character attribute combination counter 7 counts up the total number of three-character attribute combinations. In step S12, the attribute trigram creation unit 21 performs the attribute trigram calculation, which increments the element value (number of appearances) of the attribute trigram buffer 6 corresponding to the combination of the character attributes of the three characters stored in the second-previous character buffer 17, the previous character buffer 16, and the current character buffer 15.

【００８４】ステップＳ13で、上記前文字バッファ１６
及び現文字バッファ１５の内容で、前々文字バッファ１
７および前文字バッファ１６の内容が更新される。そう
した後に、上記ステップＳ2に戻って次の１文字に対す
る処理に移行する。そして、上記ステップＳ3において
もはや文書データベース１に文字は無いと判定される
と、ステップＳ14で、ユニグラム作成部１９,ダイグラ
ム作成部２０および属性トリグラム作成部２１によっ
て、上記ユニグラム,ダイグラムおよび属性トリグラム
の要素値(出現数)が確率値に変換される。この確率値へ
の変換は、上記ユニグラム,ダイグラムまたは属性トリ
グラムの各要素値を、全文字数カウンタ３,全２文字組
み合わせ数カウンタ５または全３文字属性組み合わせ数
カウンタ７のカウント値で除すことによっておこなわれ
る。In step S13, the contents of the second-previous character buffer 17 and the previous character buffer 16 are updated with the contents of the previous character buffer 16 and the current character buffer 15, respectively. The process then returns to step S2 and moves on to the next character. When it is determined in step S3 that no characters remain in the document database 1, in step S14 the unigram creation unit 19, the digram creation unit 20, and the attribute trigram creation unit 21 convert the element values (numbers of appearances) of the unigram, digram, and attribute trigram into probability values. This conversion divides each element value of the unigram, digram, or attribute trigram by the count value of the total character counter 3, the total two-character combination counter 5, or the total three-character attribute combination counter 7, respectively.
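
Steps S2 to S14 amount to a single counting pass followed by normalisation. The following is a minimal Python sketch of that pass, not the patent's implementation: the function name `count_ngrams` and the `attr_of` callback (standing in for the character attribute conversion table 12) are hypothetical, and dictionaries replace the fixed-size buffers.

```python
from collections import Counter

def count_ngrams(text, attr_of):
    """One pass over the corpus, loosely mirroring steps S2 to S14.

    attr_of is a hypothetical callback mapping a character to its
    attribute code (the role of conversion table 12 in the text).
    """
    uni, di, tri = Counter(), Counter(), Counter()
    prev = prev2 = None                      # previous / second-previous buffers
    for cur in text:                         # S2: read one character
        uni[cur] += 1                        # S4, S5: unigram count
        if prev is not None:                 # S6
            di[(prev, cur)] += 1             # S7, S8: digram count
        if prev2 is not None:                # S9
            key = (attr_of(prev2), attr_of(prev), attr_of(cur))  # S10
            tri[key] += 1                    # S11, S12: attribute trigram count
        prev2, prev = prev, cur              # S13: shift the buffers

    def normalise(counts):
        # The sum plays the role of the total counters 3, 5 and 7 (S14).
        total = sum(counts.values())
        return {k: n / total for k, n in counts.items()}

    return normalise(uni), normalise(di), normalise(tri)


uni, di, tri = count_ngrams("abab", lambda c: 0 if c == "a" else 1)
```

With the toy input above, `uni["a"]` is 0.5 and the pair `("a", "b")` accounts for two of the three observed digrams.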

【００８５】ところで、上述のようにして計算されたユ
ニグラム,ダイグラムおよび属性トリグラムの各要素値
(確率値)は、小数点以下の値を取りその有効桁数も大き
い。したがって、そのまま確率方式言語処理用にテーブ
ル化するとメモリ容量が大きくなり、且つ、文字列遷移
確率の計算に乗算を使用するために演算速度も低下して
しまう。そこで、ステップＳ15で、上記ステップＳ14に
おいて確率値に変換された上記ユニグラムバッファ２,
ダイグラムバッファ４または属性トリグラムバッファ６
の要素値に対して対数圧縮が行われ、得られた対数圧縮
値で該当する要素値が更新される。The element values (probability values) of the unigram, digram, and attribute trigram calculated as described above are fractional values with many significant digits. If they were tabulated as they are for probabilistic language processing, the memory capacity would be large, and the calculation speed would also drop because multiplication is needed to calculate character string transition probabilities. Therefore, in step S15, logarithmic compression is applied to the element values of the unigram buffer 2, the digram buffer 4, and the attribute trigram buffer 6 that were converted into probability values in step S14, and each element value is updated with the resulting logarithmically compressed value.

【００８６】上記対数圧縮は以下の式(５)によって行
う。 lnＰ(x)＝min(２５５,(−log(Ｐ(x))−ＭＩＮ))×２５６/(ＭＡＸ−ＭＩＮ) …（５）ここで、Ｐ(x)は任意の文字ｘの確率値であり、Ｐ(x)＞
０なる値を取る。また、Ｐ(x)＝０であればlnＰ(x)＝２
５５である。尚、min(ａ,ｂ)は、ａとｂのうち小さい方
を取る関数である。また、ＭＩＮおよびＭＡＸは上記対
数圧縮後の確率値の最小値および最大値であり、本実施
の形態では「０」および「１８」である。The logarithmic compression is performed by the following equation (5):

$$\mathrm{lnP}(x) = \min\left(255,\ \left(-\log P(x) - \mathrm{MIN}\right) \times \frac{256}{\mathrm{MAX} - \mathrm{MIN}}\right) \qquad (5)$$

Here, P(x) is the probability value of an arbitrary character x and takes a value P(x) > 0. If P(x) = 0, then lnP(x) = 255. min(a, b) is a function that returns the smaller of a and b. MIN and MAX are the minimum and maximum values of the probability value after the logarithmic transformation, and are "0" and "18" in this embodiment.

【００８７】上記式(５)により、「１〜０.０００００
００００００００００００１」の範囲にある確率値Ｐ
(x)が、０〜２５５までの１バイトデータに変換され
る。尚、−log(Ｐ)＞ＭＡＸ(＝１８)以上となる低確率
値(無限大となる−log(０)をも含む)は、１バイトで表
現可能な最大値「２５５」で押さえる。By equation (5), a probability value P(x) in the range from 1 to 0.000000000000000001 is converted into 1-byte data from 0 to 255. Low probability values for which −log(P) exceeds MAX (= 18), including −log(0), which is infinite, are clamped to "255", the maximum value representable in 1 byte.
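
The compression of equation (5) can be sketched in Python as below. Two points are assumptions rather than statements from the text: the logarithm is taken as base 10 (inferred from MAX = 18 matching the stated lower bound of 10^-18), and the scaled value is truncated to an integer. The function name `log_compress` is hypothetical.

```python
import math

def log_compress(p, lo=0.0, hi=18.0):
    """Map a probability in [0, 1] to a 1-byte cost in 0..255.

    lo and hi correspond to MIN and MAX in equation (5). Base-10
    logarithm and integer truncation are assumptions of this sketch.
    """
    if p <= 0.0:
        return 255                       # -log(0) is infinite: use the byte maximum
    scaled = (-math.log10(p) - lo) * 256.0 / (hi - lo)
    return max(0, min(255, int(scaled)))  # clamp into one byte
```

A probability of 1 maps to 0 and anything at or below 10^-18 maps to 255, so a smaller byte value means a more likely event.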

【００８８】通常、確率値は大きい方がより確からしい
が、本実施の形態の確率値は、対数変換しているために
値が小さい方がより確からしいことを表す評価値という
ことになる。Normally, the larger the probability value is, the more likely it is. However, the probability value in the present embodiment is an evaluation value indicating that the smaller the value is, the more likely it is due to logarithmic transformation.

【００８９】こうして、上記ダイグラムが求められる
と、上記圧縮ダイグラムの算出処理に移行する。Once the digram has been obtained in this way, the process moves on to the calculation of the compressed digram.

【００９０】ステップＳ16で、上記ベクトル分割部２３
によって、以下のようなベクトル分割処理が実行され
る。すなわち、ダイグラムバッファ４に格納されたダイ
グラムを４０００×４０００の２次元配列(４０００×
４０００の行列)と想定し、各行を４０００次元ベクト
ルとして４０００個のベクトルに分割してベクトルバッ
ファ１０に格納される。このベクトルは、注目文字から
全文字への遷移確率を要素とする４０００次元ベクトル
である。In step S16, the vector division unit 23 performs the following vector division. The digram stored in the digram buffer 4 is treated as a 4000 × 4000 two-dimensional array (a 4000 × 4000 matrix), and each row is taken as a 4000-dimensional vector, so that the digram is divided into 4000 vectors, which are stored in the vector buffer 10. Each vector is a 4000-dimensional vector whose elements are the transition probabilities from the character of interest to all characters.

【００９１】ステップＳ17で、上記クラスタリング部２
４によって、上記分割された４０００個のベクトルに対
して、ｋ−ミーンズ法やウォード法等のクラスタリング
手法を用いてクラスタリングが行われる。その場合に設
定する最大クラスタ数は、本実施の形態においては「１
０００」とする。得られたクラスタはクラスタバッファ
１１に格納される。In step S17, the clustering unit 24 clusters the 4000 vectors obtained by the division, using a clustering method such as the k-means method or Ward's method. In this embodiment, the maximum number of clusters is set to "1000". The resulting clusters are stored in the cluster buffer 11.
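
The row split and clustering of steps S16 and S17 can be illustrated with a small pure-Python k-means (Lloyd's algorithm). This is only a sketch on a toy 4 × 4 digram: the text names k-means and Ward's method as options without fixing details, so the seeding with the first k rows, the fixed iteration count, and the function name `kmeans` are all assumptions.

```python
def kmeans(vectors, k, iters=20):
    """Plain Lloyd's algorithm; returns one cluster index per vector."""
    centroids = [list(v) for v in vectors[:k]]   # naive deterministic seeding
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy digram: row i holds the transition probabilities from character i.
digram = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.4, 0.6],
]
vectors = [list(row) for row in digram]   # S16: one vector per row
labels = kmeans(vectors, k=2)             # S17: cluster the rows
```

Characters whose rows are similar, meaning they are followed by similar characters, end up sharing a cluster. A production run on 4000 vectors of 4000 dimensions would want better seeding and a vectorised implementation.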

【００９２】ステップＳ18で、上記文字コードを圧縮文
字コードに変換する際に使用される圧縮文字コード変換
テーブル１３が作成される。この圧縮文字コード変換テ
ーブル１３は、次のようにして作成される。すなわち、
上記クラスタリングによって各クラスタ毎に集められた
各ベクトルの各注目文字の文字コードｃ1〜ｃ4,ｃ5〜ｃ
8,…,ｃ3997〜ｃ4000を、当該クラスタのクラスタコー
ド(つまり、圧縮文字コードcc0〜cc999)に対応付けたテ
ーブルを作成することによって、圧縮文字コード変換テ
ーブル１３が作成される。この圧縮文字コード変換テー
ブル１３は、文字属性変換テーブル１２の場合と同様
に、文字コードから圧縮文字コードに直接変換できるよ
うに、文字コード(１〜４０００)をオフセットアドレス
として当該文字コードの圧縮文字コードが参照できる構
造になっている。この圧縮文字コード変換テーブル１３
のメモリ容量は、４０００×２＝８キロバイトとなる。In step S18, the compressed character code conversion table 13, which is used to convert character codes into compressed character codes, is created as follows. The character codes c1 to c4, c5 to c8, ..., c3997 to c4000 of the characters of interest of the vectors gathered into each cluster by the clustering are associated with the cluster code of that cluster (that is, the compressed character codes cc0 to cc999), and the resulting table is the compressed character code conversion table 13. As with the character attribute conversion table 12, the table is structured so that the compressed character code of a character can be looked up using its character code (1 to 4000) as an offset address, allowing direct conversion from character code to compressed character code. The memory capacity of the compressed character code conversion table 13 is 4000 × 2 = 8 kilobytes.
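
An offset-addressed table of this kind is simply an array indexed by character code. The sketch below assumes cluster labels are given in character-code order and uses 2-byte entries, since 1000 cluster codes do not fit in one byte; the name `build_code_table` is hypothetical.

```python
from array import array

def build_code_table(labels):
    """Build an offset-addressed lookup: table[char_code] -> cluster code.

    labels[i] is the cluster index of the character whose internal
    code is i + 1 (internal codes start at 1 in the text).
    """
    # 'H' is an unsigned short; slot 0 is unused padding in this sketch.
    table = array("H", [0] * (len(labels) + 1))
    for char_code, cluster in enumerate(labels, start=1):
        table[char_code] = cluster
    return table


table = build_code_table([0, 0, 1, 1])
```

Conversion is then a single indexed read, `table[char_code]`, with no search.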

【００９３】ステップＳ19で、上記圧縮ダイグラムバッ
ファ８,全２圧縮文字コード組み合わせカウンタ９,現文
字バッファ１５および前文字バッファ１６が「０」クリ
アされる。ステップＳ20で、文書データベース１から１
文字が読み出されて現文字バッファ１５に格納される。
ステップＳ21で、文書データベース１に文字が在ったか
否かが判定される。その結果、文字が在ればステップＳ
22に進み、文字が無ければステップＳ27に進む。In step S19, the compressed digram buffer 8, the total two-compressed-character-code combination counter 9, the current character buffer 15, and the previous character buffer 16 are cleared to "0". In step S20, one character is read from the document database 1 and stored in the current character buffer 15. In step S21, it is determined whether a character was present in the document database 1. If a character was present, the process proceeds to step S22; if not, it proceeds to step S27.

【００９４】ステップＳ22で、上記前文字バッファ１６
に前文字が格納されているか否かが判別される。その結
果、前文字が在ればステップＳ23に進み、前文字が無け
ればステップＳ26に進む。ステップＳ23で、現文字バッ
ファ１５および前文字バッファ１６に格納されている文
字コードが圧縮文字コードに変換される。この圧縮文字
コード変換では、圧縮文字コード変換テーブル１３を用
いて、文字コードから直接圧縮文字コードに変換され
る。In step S22, it is determined whether a previous character is stored in the previous character buffer 16. If there is a previous character, the process proceeds to step S23; if not, it proceeds to step S26. In step S23, the character codes stored in the current character buffer 15 and the previous character buffer 16 are converted into compressed character codes. The conversion is made directly from character code to compressed character code using the compressed character code conversion table 13.

【００９５】ステップＳ24で、上記全２圧縮文字コード
組み合わせ数カウンタ９によって全２圧縮文字コード組
み数がカウントされる。ステップＳ25で、圧縮ダイグラ
ム作成部２２によって、圧縮ダイグラム計算が行われ
る。この圧縮ダイグラムの計算は、上記ステップＳ23に
おいて得られた２つの圧縮文字コードの組み合わせに対
応する圧縮ダイグラムバッファ８の要素値(出現数)をイ
ンクリメントすることによって行われる。In step S24, the total two-compressed-character-code combination counter 9 counts up the total number of combinations of two compressed character codes. In step S25, the compressed digram creation unit 22 performs the compressed digram calculation, which increments the element value (number of appearances) of the compressed digram buffer 8 corresponding to the combination of the two compressed character codes obtained in step S23.

【００９６】ステップＳ26で、上記現文字バッファ１５
の内容で、前文字バッファ１６の内容が更新される。そ
うした後に、上記ステップＳ20に戻って次の１文字に対
する処理に移行する。そして、上記ステップＳ21におい
てもはや文書データベース１に文字は無いと判定される
と、ステップＳ27で、圧縮ダイグラム作成部２２によっ
て、上記圧縮ダイグラムの要素値(出現数)が確率値に変
換される。この確率値への変換は、上記圧縮ダイグラム
の各要素値を、全２圧縮文字コード組み合わせ数カウン
タ９のカウント値で除すことによっておこなわれる。そ
して、上記ユニグラム,ダイグラムおよび属性トリグラ
ムの場合と同様に、ステップＳ28で、上記ステップＳ27
において確率値に変換された圧縮ダイグラムバッファ８
の要素値に対して対数圧縮が行われ、得られた対数圧縮
値で上記圧縮ダイグラムバッファ８の該当する要素値が
更新される。こうして、上記圧縮ダイグラムが作成され
ると確率テーブル作成処理動作を終了する。In step S26, the contents of the previous character buffer 16 are updated with the contents of the current character buffer 15. The process then returns to step S20 and moves on to the next character. When it is determined in step S21 that no characters remain in the document database 1, in step S27 the compressed digram creation unit 22 converts the element values (numbers of appearances) of the compressed digram into probability values by dividing each element value by the count value of the total two-compressed-character-code combination counter 9. Then, as with the unigram, digram, and attribute trigram, in step S28 logarithmic compression is applied to the element values of the compressed digram buffer 8 that were converted into probability values in step S27, and the corresponding element values of the compressed digram buffer 8 are updated with the resulting logarithmically compressed values. When the compressed digram has been created in this way, the probability table creation processing ends.
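
The second pass (steps S19 to S27) repeats the digram count over cluster codes instead of character codes. A minimal sketch, assuming the corpus is already a sequence of internal character codes and `table` maps each code to its compressed code; the name `compressed_digram` is hypothetical and the final logarithmic compression of step S28 is omitted.

```python
from collections import Counter

def compressed_digram(codes, table):
    """Count adjacent pairs of compressed codes and normalise (S19 to S27)."""
    counts = Counter()
    prev = None                                   # previous character buffer
    for cur in codes:                             # S20: read one character
        if prev is not None:                      # S22
            pair = (table[prev], table[cur])      # S23: convert both codes
            counts[pair] += 1                     # S24, S25
        prev = cur                                # S26
    total = sum(counts.values())                  # role of counter 9
    return {pair: n / total for pair, n in counts.items()}   # S27


# Character codes 1 and 2 share cluster 0; codes 3 and 4 share cluster 1.
probs = compressed_digram([1, 3, 2, 4, 1], table=[0, 0, 0, 1, 1])
```

In the toy run, the four character pairs collapse onto two cluster pairs, `(0, 1)` and `(1, 0)`, each with probability 0.5.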

【００９７】こうして作成された各確率テーブルのメモ
リサイズは図６に示す通りであり、上記ユニグラム,圧
縮ダイグラム,属性トリグラム,文字属性変換テーブル１
２および圧縮文字コード変換テーブル１３の合計メモリ
容量は約１メガバイトと非常にコンパクトとなる。The memory size of each probability table created in this way is as shown in FIG. 6. The total memory capacity of the unigram, compressed digram, attribute trigram, character attribute conversion table 12, and compressed character code conversion table 13 is about 1 megabyte, which is very compact.

【００９８】このように、本実施の形態においては、全
文字数カウンタ３によって文書データベース１中の全文
字カテゴリ数をカウントし、全２文字組み合わせ数カウ
ンタ５によって全２文字組み数をカウントし、全３文字
属性組み合わせ数カウンタ７によって全３文字属性組み
数をカウントする。As described above, in the present embodiment, the total number of all character categories in the document database 1 is counted by the total number of characters counter 3, and the total number of two-character combinations is counted by the total two-character combination number counter 5. The three-character attribute combination number counter 7 counts the total number of three-character attribute combinations.

【００９９】そして、上記ユニグラム作成部１９は、全
文字の出現数を算出してユニグラムバッファ２の対応す
る要素値を更新し、文書データベース１中に文字が無く
なると、上記全文字数カウンタ３のカウント値を用いて
ユニグラムバッファ２の要素値(出現数)を確率値に変換
する。また、ダイグラム作成部２０は、全２文字組みの
出現数を算出してダイグラムバッファ４の対応する要素
値を更新し、文書データベース１中に文字が無くなる
と、全２文字組み合わせ数カウンタ５のカウント値を用
いてダイグラムバッファ４の要素値(出現数)を確率値に
変換する。属性トリグラム作成部２１は、前々文字,前
文字及び現文字の文字コードを変換した文字属性コード
に基づいて、全３文字属性組みの出現数を算出して属性
トリグラム６の対応する要素値を更新し、文書データベ
ース１中に文字が無くなると、全３文字属性組み合わせ
数カウンタ７のカウント値を用いて属性トリグラムバッ
ファ６の要素値(出現数)を確率値に変換する。そしてさ
らに、ユニグラムバッファ２,ダイグラムバッファ４お
よび属性トリグラムバッファ６の各要素値を対数圧縮し
て、上記ユニグラム,ダイグラムおよび属性トリグラム
が形成される。The unigram creation unit 19 counts the appearances of every character, updating the corresponding element value in the unigram buffer 2, and, when no characters remain in the document database 1, converts the element values (numbers of appearances) of the unigram buffer 2 into probability values using the count value of the total character counter 3. The digram creation unit 20 counts the appearances of every two-character combination, updating the corresponding element value in the digram buffer 4, and, when no characters remain in the document database 1, converts the element values (numbers of appearances) of the digram buffer 4 into probability values using the count value of the total two-character combination counter 5. The attribute trigram creation unit 21 counts the appearances of every three-character attribute combination, based on the character attribute codes converted from the character codes of the second-previous, previous, and current characters, updating the corresponding element value in the attribute trigram buffer 6, and, when no characters remain in the document database 1, converts the element values (numbers of appearances) of the attribute trigram buffer 6 into probability values using the count value of the total three-character attribute combination counter 7. Finally, the element values of the unigram buffer 2, the digram buffer 4, and the attribute trigram buffer 6 are logarithmically compressed to form the unigram, digram, and attribute trigram.

【０１００】こうして、上記ダイグラムが求められる
と、ベクトル分割部２３によって、４０００×４０００
の行列である上記ダイグラムを４０００個の４０００次
元ベクトルに分割する。そして、クラスタリング部２４
によって、各ベクトルに対してクラスタリングを行い、
各クラスタに属する文字コードとクラスタコード(圧縮
文字コード)とを対応付けた圧縮文字コード変換テーブ
ル１３を作成する。そうした後、文書データベース１か
ら１文字を読み取る毎に現文字バッファ１５及び前文字
バッファ１６を更新する。さらに、現文字と前文字の文
字コードを圧縮文字コード変換テーブル１３を用いて圧
縮文字コードに変換し、全２圧縮文字コード組み合わせ
数カウンタ９によって全２圧縮文字コード組み数をカウ
ントする。そして、圧縮ダイグラム作成部２２によっ
て、上記ダイグラム作成の場合と同様にして圧縮ダイグ
ラムを作成する。Once the digram has been obtained in this way, the vector division unit 23 divides the digram, a 4000 × 4000 matrix, into 4000 vectors of 4000 dimensions each. The clustering unit 24 then clusters the vectors and creates the compressed character code conversion table 13, which associates the character codes belonging to each cluster with the cluster code (compressed character code). After that, each time one character is read from the document database 1, the current character buffer 15 and the previous character buffer 16 are updated. The character codes of the current and previous characters are converted into compressed character codes using the compressed character code conversion table 13, and the total two-compressed-character-code combination counter 9 counts the total number of combinations of two compressed character codes. The compressed digram creation unit 22 then creates the compressed digram in the same manner as the digram.

【０１０１】こうして作成された上記圧縮ダイグラム
は、上記ダイグラムが有している言語的な遷移情報を失
うことなく要素数が１/１６に圧縮されている。したが
って、上記圧縮ダイグラムを用いて言語処理を行うこと
によって、言語情報を損なうことなく低記憶容量性およ
び高速性を実現できるのである。また、上記認識対象文
字数の圧縮によって損失した言語的遷移情報は、１文字
の出現確率であるユニグラムと、隣接３文字間の文字属
性コード遷移確率である属性トリグラムとの併用によっ
て補われて、上記低記憶容量性および高速性を損なうこ
となく高精度な確率方式言語処理を可能にするのであ
る。The compressed digram created in this way has its number of elements reduced to 1/16 without losing the linguistic transition information held by the digram. Language processing that uses the compressed digram can therefore achieve low storage capacity and high speed without impairing the linguistic information. The linguistic transition information that is lost by compressing the number of recognition target characters is supplemented by the combined use of the unigram, which is the appearance probability of a single character, and the attribute trigram, which is the transition probability of character attribute codes across three adjacent characters. This enables high-accuracy probabilistic language processing without sacrificing the low storage capacity or the high speed.
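
The 1/16 figure follows from the table dimensions given earlier: 4000 recognition target characters and a maximum of 1000 clusters. A short check, assuming the clustering actually yields all 1000 clusters:

```latex
\frac{\text{compressed digram elements}}{\text{digram elements}}
  = \frac{1000 \times 1000}{4000 \times 4000}
  = \frac{10^{6}}{1.6 \times 10^{7}}
  = \frac{1}{16}
```

At one byte per logarithmically compressed element, that is roughly 1 megabyte for the compressed digram against roughly 16 megabytes for the uncompressed one.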

【０１０２】すなわち、本実施の形態によれば、日本語
のように認識対象文字数が多い言語に対しても、１メガ
バイト程度の低いメモリ容量で高精度な多値確率テーブ
ルを用いた確率方式言語処理を行うことが可能となる。
また、その場合の文字列確率演算を加算とテーブル参照
との極めてシンプルな演算にでき、高速処理が可能とな
る。したがって、低メモリ容量,高認識率,高速処理が可
能な認識装置を実現できる。That is, according to this embodiment, even for a language with a large number of recognition target characters, such as Japanese, probabilistic language processing using high-precision multi-valued probability tables can be performed with a memory capacity as low as about 1 megabyte. In addition, the character string probability calculation reduces to the very simple operations of addition and table lookup, which enables high-speed processing. A recognition device with low memory capacity, a high recognition rate, and high-speed processing can therefore be realized.
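
Because the tables hold scaled negative logarithms, a product of transition probabilities becomes a sum of table entries. The sketch below illustrates only that general idea with the compressed digram; it is not the scoring formula of this document, and the function name `string_cost`, the dictionary-based table, and the `unseen` default of 255 are assumptions.

```python
def string_cost(codes, table, cdigram, unseen=255):
    """Sum the log-compressed digram costs along a character string.

    codes   : sequence of internal character codes
    table   : table[char_code] -> compressed character code
    cdigram : {(compressed_prev, compressed_cur): cost in 0..255}
    A lower total means a more plausible string.
    """
    cost = 0
    for prev, cur in zip(codes, codes[1:]):
        # One table lookup per character, one addition per transition.
        cost += cdigram.get((table[prev], table[cur]), unseen)
    return cost


table = [0, 0, 0, 1]                      # codes 1, 2 -> cluster 0; code 3 -> cluster 1
cdigram = {(0, 1): 10, (1, 1): 20}
likely = string_cost([1, 3, 3], table, cdigram)
unlikely = string_cost([1, 2, 3, 3], table, cdigram)
```

No multiplication or floating-point arithmetic is needed at recognition time, which is the speed advantage the paragraph above describes.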

【０１０３】図７は、第２実施の形態における認識装置
のブロック図である。この認識装置は、第１実施の形態
において作成された上記確率テーブルを用いる確率方式
言語処理手段を有しており、入力された文書画像データ
から文字を認識するものである。尚、本認識装置はあら
ゆる言語に適用可能であるが、説明の便宜上、言語は日
本語、認識対象文字数は４０００文字、認識評価値とし
て類似度を用いる日本語活字ＯＣＲであるとする。ま
た、総ての漢字コードは、内部コード(１〜４０００)化
されているものとする。FIG. 7 is a block diagram of a recognition device according to the second embodiment. This recognition device has probabilistic language processing means that uses the probability tables created in the first embodiment, and recognizes characters from input document image data. Although the recognition device is applicable to any language, for convenience of explanation it is assumed to be a Japanese printed-character OCR in which the language is Japanese, the number of recognition target characters is 4000, and similarity is used as the recognition evaluation value. It is also assumed that all kanji codes are converted into internal codes (1 to 4000).

【０１０４】画像入力部３１は、スキャナ４７あるいは
画像ファイル４８からの文書画像データを取り込んで画
像メモリ４９に格納する。切り出し部３２は、画像メモ
リ４９に格納された１文書の画像データに基づいて文字
認識を行う領域を切り出す領域切り出し部３３、１行の
文字列を切り出す行切り出し部３４、１文字を切り出す
文字切り出し部３５、および、接触文字を強制分離する
接触文字切り出し部３６を有している。尚、得られた上
記領域,行および文字の画像メモリ４９上の座標は、夫
々領域座標メモリ５０,行座標メモリ５１および(仮)文
字座標メモリ５２に格納される。また、周辺分布特徴バ
ッファ５３には、後に詳述する行方向や列方向の周辺分
布特徴が格納される。特徴抽出部３７は、上記切り出さ
れた１文字の画像データに基づいて特徴パターンを抽出
して特徴メモリ５４に格納する。The image input unit 31 takes in document image data from the scanner 47 or the image file 48 and stores it in the image memory 49. The cutout unit 32 has an area cutout unit 33 that cuts out the areas in which character recognition is to be performed, based on the image data of one document stored in the image memory 49, a line cutout unit 34 that cuts out the character string of one line, a character cutout unit 35 that cuts out one character, and a contact character cutout unit 36 that forcibly separates touching characters. The coordinates on the image memory 49 of the areas, lines, and characters obtained are stored in the area coordinate memory 50, the line coordinate memory 51, and the (provisional) character coordinate memory 52, respectively. The peripheral distribution feature buffer 53 stores the peripheral distribution features in the row direction and the column direction, which are described in detail later. The feature extraction unit 37 extracts a feature pattern based on the image data of the one character that was cut out and stores it in the feature memory 54.

【０１０５】マッチング部３８は、上記切り出された１
文字の特徴パターンとパターン辞書５５に格納された認
識対象文字の標準パターンとのマッチングを行って類似
度を算出する第１類似度計算部３９、および、上記算出
された類似度を降順にソーティングして所定数の文字候
補を選出して認識結果候補を得る第１ソーティング部４
０を有する。尚、得られた認識結果候補は、認識結果候
補バッファ５６に格納される。The matching unit 38 has a first similarity calculation unit 39, which calculates similarities by matching the feature pattern of the one character that was cut out against the standard patterns of the recognition target characters stored in the pattern dictionary 55, and a first sorting unit 40, which sorts the calculated similarities in descending order and selects a predetermined number of character candidates to obtain the recognition result candidates. The recognition result candidates obtained are stored in the recognition result candidate buffer 56.

【０１０６】文字列生成部４１は、上記１行を構成する
１文字の認識結果候補が得られる毎に当該文字までの総
ての認識結果候補を組み合わせて候補文字列に展開する
候補文字列展開部４２、得られた候補文字列のスコアを
上記ユニグラム,圧縮ダイグラム,属性トリグラム等のテ
ーブル５８を用いて計算するスコア計算部４３、およ
び、算出されたスコアを降順にソーティングして所定数
の候補文字列を選出する第２ソーティング部４４を有す
る。尚、展開された候補文字列は候補文字列バッファ５
７に格納される。The character string generation unit 41 has a candidate character string expansion unit 42, which, each time the recognition result candidates for one character of a line are obtained, combines all recognition result candidates up to that character and expands them into candidate character strings; a score calculation unit 43, which calculates the scores of the candidate character strings using the tables 58, such as the unigram, compressed digram, and attribute trigram; and a second sorting unit 44, which sorts the calculated scores in descending order and selects a predetermined number of candidate character strings. The expanded candidate character strings are stored in the candidate character string buffer 57.

【０１０７】単語照合言語処理部４５は、上記候補文字
列バッファ５７に格納されている１行分の上記所定数の
候補文字列に対して単語照合用言語辞書５９を用いた単
語照合式言語処理を行う。そして、この単語照合式言語
処理の結果に基づいて最終的な認識結果を求めて認識結
果バアッファ６０の内容を更新する。制御部４６は、画
像入力部３１,切り出し部３２,特徴抽出部３７,マッチ
ング部３８,文字列生成部４１および単語照合言語処理
部４５を制御して、文字認識処理を実行する。The word collation language processing unit 45 performs word-collation language processing, using the word collation language dictionary 59, on the predetermined number of candidate character strings for one line stored in the candidate character string buffer 57. It then obtains the final recognition result based on the result of this word-collation language processing and updates the contents of the recognition result buffer 60. The control unit 46 controls the image input unit 31, the cutout unit 32, the feature extraction unit 37, the matching unit 38, the character string generation unit 41, and the word collation language processing unit 45 to execute the character recognition processing.

【０１０８】すなわち、本実施の形態においては、上記
スコア計算部４３および第２ソーティング部４４で上記
確率方式言語処理手段を構成するのである。That is, in the present embodiment, the above-mentioned stochastic language processing means is constituted by the above-mentioned score calculation section 43 and the second sorting section 44.

【０１０９】上記構成の認識装置は、上記制御部４６の
制御の下に以下のように動作して文字認識処理を行う。
図８は、上記制御部４６によって実行される文字認識処
理動作のフローチャートである。以下、図８に従って文
字の認識手順について説明する。The recognition device having the above configuration operates as follows under the control of the control section 46 to perform a character recognition process.
FIG. 8 is a flowchart of the character recognition processing operation executed by the control unit 46. Hereinafter, the character recognition procedure will be described with reference to FIG.

【０１１０】ステップＳ31で、上記画像入力部３１によ
って、スキャナ４７または画像ファイル４８から１文書
の画像データが取り込まれて画像メモリ４９に格納され
る。ステップＳ32で、領域切り出し部３３によって、上
記１文書から文字認識を行う全領域が切り出される。ま
たは、上記全領域が指定される。そして、切り出された
領域または指定された領域の画像メモリ４９上の座標が
領域座標メモリ５０に格納される。ステップＳ33で、一
つの領域の画像データが読み出される。ステップＳ34
で、領域が在ったか否かが判別される。その結果、領域
が在ればステップＳ35に進み、無ければステップＳ49に
進む。In step S31, the image input unit 31 takes in the image data of one document from the scanner 47 or the image file 48 and stores it in the image memory 49. In step S32, the area cutout unit 33 cuts out from the document all areas in which character recognition is to be performed. Alternatively, all such areas are specified. The coordinates on the image memory 49 of the areas that were cut out or specified are stored in the area coordinate memory 50. In step S33, the image data of one area is read. In step S34, it is determined whether an area was present. If an area was present, the process proceeds to step S35; if not, it proceeds to step S49.

【０１１１】ステップＳ35で、上記行切り出し部３４に
よって、当該領域から全行が切り出される。この場合の
行の切り出し方法は特に限定するものではない。例え
ば、当該領域の画像データに基づいて行方向への所定の
明るさ以上の画素数を数え上記周辺分布特徴として周辺
分布特徴バッファ５３に格納し、この周辺分布特徴に基
づいて切り出す。ステップＳ36で、上記ステップＳ35に
おいて切り出された行が在るか否かが判別される。その
結果、行が無ければ上記ステップＳ33に戻って次の領域
の処理に移行し、行が在ればステップＳ37に進む。In step S35, the row cutout section 34 cuts out all rows from the area. In this case, the method of cutting out the rows is not particularly limited. For example, the number of pixels having a predetermined brightness or more in the row direction is counted based on the image data of the area, stored as the peripheral distribution feature in the peripheral distribution feature buffer 53, and cut out based on the peripheral distribution feature. In step S36, it is determined whether or not the line cut out in step S35 exists. As a result, if there is no line, the flow returns to step S33 to shift to the processing of the next area, and if there is a line, the flow proceeds to step S37.

【０１１２】ステップＳ37で、一つの行の画像データが
読み出される。ステップＳ38で、上記文字切り出し部３
５によって、上記読み出された１行の画像データに基づ
いて仮の文字切り出しが行われる。尚、上記「仮の文字
切り出し」の処理は必ずしも全てが正解の文字切り出し
を行うのではなく、例えば、漢字「加」の場合には、
「加」又は「カ」と「ロ」と言うように、１文字として
切出せる個所を全て抽出する処理である。この場合の仮
の文字切り出しの方法は特に限定するものではない。例
えば、当該行の画像データに基づいて列方向への所定の
明るさ以上の画素数を数えて上記周辺分布特徴として周
辺分布特徴バッファ５３に格納し、この周辺分布特徴に
基づいて切り出し箇所を抽出する。さらに、接触文字切
り出し部３６によって、接触文字のような文字塊に対し
て、周辺分布特徴バッファ５３に格納された上記列方向
の周辺分布特徴を用いて強制分離できる個所が抽出され
る。In step S37, the image data of one line is read. In step S38, the character cutout unit 35 performs provisional character segmentation based on the image data of the line that was read. The "provisional character segmentation" does not necessarily yield only correct segmentations; rather, it extracts every position that can be cut out as one character. For example, for the kanji "加", it yields both "加" and the pair "カ" and "ロ". The provisional segmentation method is not particularly limited. For example, based on the image data of the line, the number of pixels at or above a predetermined brightness is counted in the column direction and stored in the peripheral distribution feature buffer 53 as the peripheral distribution feature, and the cutout positions are extracted based on this peripheral distribution feature. Further, the contact character cutout unit 36 extracts positions where a character block such as touching characters can be forcibly separated, using the column-direction peripheral distribution feature stored in the peripheral distribution feature buffer 53.

【０１１３】ステップＳ39で、上記候補文字列バッファ
５７が初期化される。ステップＳ40で、当該行中に文字
が存在するか否かが判別される。その結果、文字が在れ
ばステップＳ41に進む一方、文字が無ければステップＳ
47に進む。ステップＳ41で、文字切り出し部３５によっ
て、当該行から上記仮の文字切り出し結果に基づいて実
際に１文字が切り出される。In step S39, the candidate character string buffer 57 is initialized. In step S40, it is determined whether a character exists in the line. If a character exists, the process proceeds to step S41; if not, the process proceeds to step S47. In step S41, the character cutout unit 35 actually cuts out one character from the line based on the provisional character segmentation result.

【０１１４】ステップＳ42で、上記特徴抽出部３７によ
って、上記切り出された当該文字候補の特徴パターンが
抽出される。ステップＳ43で、第１類似度計算部３９に
よって、当該文字候補の特徴パターンとパターン辞書５
５に格納された認識対象文字の標準パターンとのマッチ
ングが行われて類似度が算出される。さらに、第１ソー
ティング部４０によって、上記算出された類似度を降順
にソーティングして所定数(本実施の形態においては
「１０個」)の文字候補が選出される。こうして得られ
た文字候補は、認識結果候補バッファ５６に格納され
る。In step S42, the feature extraction unit 37 extracts the feature pattern of the character candidate that was cut out. In step S43, the first similarity calculator 39 matches the feature pattern of the character candidate against the standard patterns of the recognition target characters stored in the pattern dictionary 55 and calculates the similarities. Further, the first sorting unit 40 sorts the calculated similarities in descending order and selects a predetermined number (10 in this embodiment) of character candidates. The character candidates thus obtained are stored in the recognition result candidate buffer 56.

【０１１５】以後、上記認識結果候補バッファ５６に格
納されている文字候補が組み合わされて文字列に展開さ
れ、候補文字列バッファ５７に格納される。ここで、総
ての文字候補の組み合わせを候補文字列バッファ５７に
格納することは、１０の(１行の全文字数)乗個の候補文
字列分のメモリ容量を必要とするために不可能である。
また、以後の処理速度も著しく低下する。そのため、本
実施の形態においては、総ての候補文字列のスコアを以
下のようにして求め、このスコアに基づいて候補文字列
数の削減を行うのである。Thereafter, the character candidates stored in the recognition result candidate buffer 56 are combined, developed into character strings, and stored in the candidate character string buffer 57. Storing every combination of character candidates in the candidate character string buffer 57 is not feasible, because it would require memory for 10 to the power of (the total number of characters in one line) candidate character strings. The subsequent processing speed would also drop significantly. Therefore, in the present embodiment, the score of every candidate character string is obtained as described below, and the number of candidate character strings is reduced based on this score.

【０１１６】ステップＳ44で、上記候補文字列展開部４
２によって、認識結果候補バッファ５６に格納されてい
る総ての文字候補を組み合わせて候補文字列に展開され
る。ステップＳ45で、上記スコア計算部４３によって、
上記得られた総ての候補文字列のスコアが計算される。
このスコアの計算は、式(６)に従って行われる。スコア(１)＝Ｗ_SＳ₁＋Ｗ_PＰ(ｃ₁) スコア(２)＝Ｗ_S{(Ｓ₁＋Ｓ₂)/２}＋Ｗ_PＰ(ｃ₁,ｃ₂) スコア(ｉ)＝〔スコア(i−1)×(i−1) ＋{Ｗ_SＳ_i＋Ｗ_PＰ(ｃ_i-2,ｃ_i-2,ｃ_i)}〕/ｉ …（６）但し、スコア(Ｉ) ：文字候補数がＩである候補文字列のスコアｃ₁,ｃ₂,ｃ₃,…,ｃ_n：候補文字列Ｓ₁,Ｓ₂,Ｓ₃,…,Ｓ_n：各文字候補の類似度Ｐ(ｘ) ：候補文字列ｘの文字列遷移確率（第１実施の形態の式(４) によって算出) Ｗ_P,Ｗ_S ：重み尚、上記重みＷ_P,Ｗ_Sは、Ｗ_S＞Ｗ_Pなる関係にあること
が望ましい。In step S44, the candidate character string developing unit 42 combines all the character candidates stored in the recognition result candidate buffer 56 and develops them into candidate character strings. In step S45, the score calculation unit 43 calculates the scores of all the candidate character strings thus obtained. The score is calculated according to equation (6):

$$
\begin{aligned}
\mathrm{Score}(1) &= W_S S_1 + W_P P(c_1) \\
\mathrm{Score}(2) &= W_S \frac{S_1 + S_2}{2} + W_P P(c_1, c_2) \\
\mathrm{Score}(i) &= \frac{\mathrm{Score}(i-1)\,(i-1) + \{ W_S S_i + W_P P(c_{i-2}, c_{i-1}, c_i) \}}{i}
\end{aligned}
\qquad (6)
$$

where
$\mathrm{Score}(I)$: score of a candidate character string containing $I$ character candidates
$c_1, c_2, c_3, \ldots, c_n$: the candidate character string
$S_1, S_2, S_3, \ldots, S_n$: similarity of each character candidate
$P(x)$: string transition probability of candidate character string $x$ (calculated by equation (4) of the first embodiment)
$W_P, W_S$: weights
It is desirable that the weights satisfy $W_S > W_P$.
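The running-average form of equation (6) can be sketched in Python as follows. This is only an illustrative sketch: `transition_prob` is a hypothetical stand-in for the string transition probability P of equation (4), which is not reproduced in this passage, the general step is assumed to take P over the three most recent characters, and the default weights 0.7 and 0.3 are assumed values chosen only so that W_S > W_P.

```python
from typing import Callable, Sequence


def candidate_score(
    chars: Sequence[str],
    sims: Sequence[float],
    transition_prob: Callable[[Sequence[str]], float],
    w_s: float = 0.7,  # assumed value for the similarity weight W_S
    w_p: float = 0.3,  # assumed value for the probability weight W_P
) -> float:
    """Score a candidate string following the recursion of equation (6)."""
    n = len(chars)
    if n == 0 or len(sims) != n:
        raise ValueError("chars and sims must be non-empty and equally long")
    # Score(1) = W_S * S_1 + W_P * P(c_1)
    score = w_s * sims[0] + w_p * transition_prob(chars[:1])
    if n == 1:
        return score
    # Score(2) = W_S * (S_1 + S_2) / 2 + W_P * P(c_1, c_2)
    score = w_s * (sims[0] + sims[1]) / 2 + w_p * transition_prob(chars[:2])
    # Score(i) = [Score(i-1) * (i-1) + W_S * S_i + W_P * P(last 3 chars)] / i
    for i in range(3, n + 1):
        term = w_s * sims[i - 1] + w_p * transition_prob(chars[i - 3:i])
        score = (score * (i - 1) + term) / i
    return score
```

Because each step only needs the previous score and the newest character, the score of an extended string can be updated without rescoring the whole string.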

【０１１７】本実施の形態におけるスコアは、上記スコ
ア計算部４３の第２類似度計算部で算出される類似度Ｓ
_nと確率計算部で算出される遷移確率Ｐ(ｘ)との重み付
き和である。したがって、上記類似度Ｓ_nのような認識
評価値と上記遷移確率Ｐ(ｘ)以外の評価尺度を使用する
場合でも、式(６)に、上記評価尺度の値にある重みを乗
じた値を第３項として加えるようにすれば、上記ユニグ
ラム,圧縮ダイグラムおよび属性トリグラムを用いたス
コア計算を実行可能となる。The score in the present embodiment is a weighted sum of the similarity S_n calculated by the second similarity calculator of the score calculation unit 43 and the transition probability P(x) calculated by the probability calculator. Therefore, even when an evaluation measure other than a recognition evaluation value such as the similarity S_n and the transition probability P(x) is used, the score calculation using the unigram, compressed digram, and attribute trigram can still be carried out by adding to equation (6), as a third term, the value of that measure multiplied by a weight.

【０１１８】ステップＳ46で、上記第２ソーティング部
４４によって、上記算出された各候補文字列をスコアの
降順にソーティングして、スコアの大きい順に所定数
(本実施の形態では「１００個」)の候補文字列を選出す
る。こうして選出された１００個の候補文字列が、候補
文字列バッファ５７に格納される。その後、上記ステッ
プＳ40に戻り、次の１文字の切り出しに移行する。以
後、当該行に文字が無くなるまで、当該行からの文字切
り出し、切り出し文字の特徴抽出、当該切り出し文字の
マッチング、既に得られた文字候補と当該切り出し文字
の認識候補とに基づく候補文字列の生成(スコア計算と
候補削減処理)が順次行われる。そして、上記ステップ
Ｓ40において、当該行中にもはや文字が無いと判別され
るとステップＳ47に進む。In step S46, the second sorting unit 44 sorts the candidate character strings in descending order of score and selects a predetermined number (100 in this embodiment) of candidate character strings with the highest scores. The 100 candidate character strings thus selected are stored in the candidate character string buffer 57. The process then returns to step S40 and moves on to cutting out the next character. Thereafter, until no characters remain in the line, the following are performed in sequence: cutting out a character from the line, extracting the features of the cut-out character, matching the cut-out character, and generating candidate character strings (score calculation and candidate reduction) based on the character candidates already obtained and the recognition candidates of the cut-out character. When it is determined in step S40 that no more characters remain in the line, the process proceeds to step S47.
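The loop of steps S40 to S46 behaves like a beam search: each retained string is extended by every character candidate at the next position, all extensions are scored, and only the best strings are kept. A minimal sketch under stated assumptions follows; `expand_and_prune`, `score_fn`, and the data layout are hypothetical names introduced for illustration, and the default beam width of 100 is taken from this embodiment.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[str, float]  # (character candidate, similarity)


def expand_and_prune(
    positions: Sequence[Sequence[Candidate]],
    score_fn: Callable[[str, List[float]], float],
    beam_width: int = 100,
) -> List[Tuple[float, str]]:
    """Extend candidate strings position by position, keeping the best ones."""
    beam: List[Tuple[str, List[float]]] = [("", [])]
    scored: List[Tuple[float, str]] = []
    for candidates in positions:
        # Combine every retained string with every candidate at this position.
        expanded = [
            (text + ch, sims + [sim])
            for text, sims in beam
            for ch, sim in candidates
        ]
        # Score all extensions and keep the beam_width highest, best first.
        ranked = heapq.nlargest(
            beam_width,
            ((score_fn(text, sims), text, sims) for text, sims in expanded),
            key=lambda item: item[0],
        )
        beam = [(text, sims) for _, text, sims in ranked]
        scored = [(score, text) for score, text, _ in ranked]
    return scored
```

With 10 candidates per position and a beam of 100, at most 1000 strings are scored per position, instead of a number that grows as a power of the line length.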

【０１１９】ステップＳ47で、上記単語照合言語処理部
４５によって、候補文字列バッファ５７に格納されてい
る１行分の１００個の候補文字列に対して、単語照合用
言語辞書５９を用いた単語照合式言語処理が実行され
る。ステップＳ48で、候補文字列バッファ５７に格納さ
れている１行分の１００個の候補文字列の中から、上記
スコアと単語照合用言語辞書５９に存在する単語数とに
基づいて、最適な候補文字列が選択されて認識結果バッ
ファ６０に格納される。In step S47, the word collation language processing unit 45 performs word-collation language processing, using the word collation language dictionary 59, on the 100 candidate character strings for one line stored in the candidate character string buffer 57. In step S48, the optimal candidate character string is selected from among those 100 candidate character strings, based on the score and the number of words found in the word collation language dictionary 59, and is stored in the recognition result buffer 60.

【０１２０】その後、上記ステップＳ36に戻り、次の行
の処理に移行する。以後、当該領域の行が無くなるま
で、順次各行からの文字切り出し、切り出し文字の特徴
抽出、当該切り出し文字のマッチング、既に得られた文
字候補と当該切り出し文字の認識候補とに基づく候補文
字列の生成(スコア計算と候補削減処理)、単語照合式言
語処理、および、最適候補文字列の選択が順次行われ
る。そして、上記ステップＳ36において当該領域に行が
無いと判定され、上記ステップＳ34において領域が無い
と判別されると、ステップＳ49に進む。ステップＳ49
で、認識結果バッファ６０に格納された総ての行の最適
候補文字列が認識結果として出力される。その後、文字
認識処理動作を終了する。Thereafter, the process returns to step S36 and moves on to the next line. Until no lines remain in the area, the following are performed in sequence for each line: cutting out characters, extracting the features of each cut-out character, matching the cut-out character, generating candidate character strings (score calculation and candidate reduction) based on the character candidates already obtained and the recognition candidates of the cut-out character, word-collation language processing, and selection of the optimal candidate character string. When it is determined in step S36 that no line remains in the area and in step S34 that no area remains, the process proceeds to step S49. In step S49, the optimal candidate character strings of all the lines stored in the recognition result buffer 60 are output as the recognition result. The character recognition processing operation then ends.

【０１２１】このように、本実施の形態においては、認
識装置(日本語活字ＯＣＲ)に第１実施の形態における確
率テーブル作成装置で作成された確率テーブル(ユニグ
ラム,圧縮ダイグラム及び属性トリグラム)を搭載する。
そして、画像入力部３１で取り込まれた文書の画像デー
タから、切り出し部３２によって１文字を切り出し、特
徴抽出部３７で特徴パターンを抽出し、マッチング部３
８によって１０個の文字候補を得る。こうして、１行の
全文字に関して１０個ずつの文字候補が得られると、候
補文字列展開部４２によって総ての文字候補を組み合わ
せて候補文字列に展開する。As described above, in the present embodiment, the recognition device (a Japanese printed-character OCR) is equipped with the probability tables (unigram, compressed digram, and attribute trigram) created by the probability table creation device of the first embodiment. One character is cut out by the cutout unit 32 from the image data of the document captured by the image input unit 31, its feature pattern is extracted by the feature extraction unit 37, and 10 character candidates are obtained by the matching unit 38. When 10 character candidates have been obtained for every character in one line, the candidate character string developing unit 42 combines all the character candidates and develops them into candidate character strings.

【０１２２】そうすると、上記スコア計算部４３によっ
て、上記得られた総ての候補文字列に関し、式(６)に従
って上記確率テーブルを用いてスコアを算出する。そし
て、第２ソーティング部４４によって、総ての候補文字
列をスコアの降順にソーティングしてスコアの大きい順
に１００個の候補文字列を選出するようにしている。Then, the score calculation section 43 calculates scores for all the obtained candidate character strings using the probability table according to the equation (6). Then, the second sorting unit 44 sorts all the candidate character strings in descending order of the score, and selects 100 candidate character strings in descending order of the score.

【０１２３】こうして、上記確率方式言語処理を行うこ
とによって、１行の全文字から１０個ずつ得られた文字
候補を展開して生成された多数の候補文字列の中から、
単語切り出し(形態素解析)に因らずに、確からしい１０
０個の候補文字列が迅速に選出される。その場合に用い
られる上記確率テーブルは、第１実施の形態における上
記ユニグラム,圧縮ダイグラムおよび属性トリグラムで
構成されている。したがって、本実施の形態において実
行される確率方式言語処理は、低メモリ容量で高速に且
つ高精度で行われる。Thus, by performing the stochastic language processing described above, 100 plausible candidate character strings are quickly selected, without relying on word segmentation (morphological analysis), from among the large number of candidate character strings generated by expanding the 10 character candidates obtained for each character in one line. The probability tables used for this purpose consist of the unigram, compressed digram, and attribute trigram of the first embodiment. Therefore, the stochastic language processing executed in the present embodiment is performed at high speed and with high accuracy using a small memory capacity.

【０１２４】以後、上記単語照合言語処理部４５によっ
て、上記選出された１行分の１００個の候補文字列に対
して、単語照合用言語辞書５９を用いた単語照合式言語
処理を行って最適な候補文字列を選択して認識結果とす
る。Thereafter, the word collation language processing unit 45 performs word-collation language processing, using the word collation language dictionary 59, on the 100 selected candidate character strings for one line, and selects the optimal candidate character string as the recognition result.

【０１２５】すなわち、本実施の形態によれば、日本語
のように認識対象文字数が多い言語の認識に対しても、
１メガバイト程度の低いメモリ容量で高精度な多値確率
テーブルを用いた確率方式言語処理を適用でき、低容量
メモリ,高認識率,高速処理が可能な認識装置を実現でき
る。That is, according to the present embodiment, stochastic language processing using a high-precision multi-valued probability table can be applied, with a memory capacity as low as about 1 megabyte, even to the recognition of a language with a large number of recognition target characters, such as Japanese. A recognition device with low memory capacity, a high recognition rate, and high-speed processing can thus be realized.

【０１２６】尚、第１実施の形態においては、日本語文
字に適用する場合を例に説明しているが、あらゆる言語
に適用可能である。特に、日本語や中国語等のように文
字数の多い言語に対して有効である。尚、中国語認識用
の確率テーブルを作成する場合には、上記文字属性とし
て、例えば助字,記号,助字以外の漢字,数字,アルファベ
ット大文字およびアルファベット小文字等を用いればよ
い。In the first embodiment, the case where the present invention is applied to Japanese characters is described as an example, but the present invention can be applied to any language. In particular, it is effective for languages with a large number of characters, such as Japanese and Chinese. When a probability table for Chinese recognition is created, for example, supplementary characters, symbols, kanji other than supplementary characters, numbers, uppercase letters and lowercase letters may be used as the character attributes.

【０１２７】さらに、第１実施の形態において作成され
た上記確率テーブルを用いた文字列確率演算は、漢字仮
名交じり文字列に対する確からしさの判定に威力を発揮
するため、仮名漢字変換装置等の言語処理装置への適用
も可能である。また、第１実施の形態においては、文書
データベース１に格納された日本語文書に基づいて文字
認識や仮名漢字変換の際に用いる確率テーブルを作成し
ているが、音声認識の際に用いる確率テーブルを作成す
ることも可能である。その場合には、文書データベース
１には日本語の仮名文字列を格納し、この仮名文字列を
用いて上記圧縮ダイグラム,ユニグラムおよび属性トリ
グラムを作成すればよい。Further, the character string probability calculation using the probability tables created in the first embodiment is effective in judging the likelihood of character strings of mixed kanji and kana, so it can also be applied to language processing devices such as kana-kanji conversion devices. In the first embodiment, the probability tables used for character recognition and kana-kanji conversion are created based on the Japanese documents stored in the document database 1, but probability tables for use in speech recognition can also be created. In that case, Japanese kana character strings are stored in the document database 1, and the compressed digram, unigram, and attribute trigram are created from these kana character strings.

【０１２８】また、第２実施の形態における確率テーブ
ルは、第１実施の形態における確率テーブル作成装置を
認識装置に搭載して、この確率テーブル作成装置によっ
て作成してもよい。あるいは、他の確率テーブル作成装
置で作成されたものを本認識装置にインストールしても
よい。また、第２実施の形態は、日本語文字に限らず外
国文字の認識装置にも適用可能であることは言うまでも
ない。さらに、文字認識に限らず音声認識にも適用でき
る。The probability table according to the second embodiment may be created by mounting the probability table creation device according to the first embodiment on a recognition device and using the probability table creation device. Alternatively, one created by another probability table creation device may be installed in the present recognition device. Further, it goes without saying that the second embodiment is applicable not only to Japanese character but also to a foreign character recognition device. Further, the present invention can be applied not only to character recognition but also to voice recognition.

【０１２９】また、第１実施の形態および第２実施の形
態においては、上記圧縮ダイグラム,ユニグラムおよび
属性トリグラムの総てを作成あるいは搭載するようにし
ているが、この発明では、少なくとも上記圧縮ダイグラ
ムを作成あるいは搭載すればよい。Further, in the first and second embodiments, the compressed digram, unigram, and attribute trigram are all created or installed, but in the present invention it is sufficient to create or install at least the compressed digram.

【０１３０】また、第１実施の形態における確率テーブ
ル作成装置においては、上記確率テーブル作成処理のプ
ログラムを以下の何れかの方法によってＲＯＭ(リード・
オンリ・メモリ)(図示せず)あるいはＲＡＭ(ランダム・ア
クセス・メモリ)(図示せず)に記憶している。（ａ）予め上記ＲＯＭに記憶させておく。（ｂ）上記確率テーブル作成処理プログラムの一部ある
いは全部をフロッピーディスクやハードディスク装置等
の記録媒体に格納しておき、必要に応じて上記プログラ
ムを上記ＲＡＭにインストールする。（ｃ）コンピュータネットワークから上記確率テーブル
作成処理プログラムを上記ＲＡＭにインストールする。In the probability table creation device of the first embodiment, the program for the probability table creation processing is stored in a ROM (read-only memory) (not shown) or a RAM (random access memory) (not shown) by any of the following methods. (a) The program is stored in the ROM in advance. (b) Part or all of the probability table creation processing program is stored on a recording medium such as a floppy disk or a hard disk device, and the program is installed into the RAM as needed. (c) The probability table creation processing program is installed into the RAM from a computer network.

【０１３１】同様に、第２実施の形態における認識装置
においては、上記文字認識処理のプログラムを以下の何
れかの方法によってＲＯＭ(リード・オンリ・メモリ)(図
示せず)あるいはＲＡＭ(ランダム・アクセス・メモリ)(図
示せず)に記憶している。（ａ）予め上記ＲＯＭに記憶させておく。（ｂ）上記文字認識処理プログラムの一部あるいは全部
をフロッピーディスクやハードディスク装置等の記録媒
体に格納しておき、必要に応じて上記プログラムを上記
ＲＡＭにインストールする。（ｃ）コンピュータネットワークから上記文字認識処理
プログラムを上記ＲＡＭにインストールする。Similarly, in the recognition device of the second embodiment, the program for the character recognition processing is stored in a ROM (read-only memory) (not shown) or a RAM (random access memory) (not shown) by any of the following methods. (a) The program is stored in the ROM in advance. (b) Part or all of the character recognition processing program is stored on a recording medium such as a floppy disk or a hard disk device, and the program is installed into the RAM as needed. (c) The character recognition processing program is installed into the RAM from a computer network.

【０１３２】[0132]

【発明の効果】以上より明らかなように、請求項１に係
る発明の確率テーブル作成装置は、クラスタリング部に
よって、メモリに格納された文字列の全文字に対して遷
移特性を基準としたクラスタリングを行って各クラスタ
にクラスタコードを付与するので、認識対象の文字や音
節の数が上記クラスタの数に圧縮される。また、２クラ
スタコード遷移確率テーブル(以下、クラスタコード・ダ
イグラムと言う)作成部によって、上記メモリに格納さ
れた文字列における総ての隣接２文字のクラスタコード
間の遷移確率を表すクラスタコード・ダイグラムを作成
するので、上記クラスタコード・ダイグラムは、隣接２
文字間の遷移確率を表す従来のダイグラムに比して要素
数を(クラスタ数/認識対象文字(又は音節)数)²に圧縮で
きる。したがって、低記憶容量の確率テーブルを作成で
き、低メモリ容量および高速での確率方式言語処理を可
能にする。As is apparent from the above, in the probability table creating apparatus of the invention according to claim 1, the clustering unit clusters all the characters of the character strings stored in the memory on the basis of their transition characteristics and assigns a cluster code to each cluster, so the number of recognition target characters or syllables is compressed to the number of clusters. In addition, the two-cluster-code transition probability table (hereinafter, cluster code digram) creation unit creates a cluster code digram representing the transition probabilities between the cluster codes of all pairs of adjacent characters in the character strings stored in the memory, so the number of elements of the cluster code digram can be compressed to (number of clusters / number of recognition target characters (or syllables))² of that of a conventional digram representing the transition probabilities between adjacent characters. A probability table with a small storage requirement can therefore be created, enabling stochastic language processing at high speed with a small memory capacity.
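As a worked illustration of this compression ratio, take the figures quoted elsewhere in this description for Japanese (about 4000 recognition target characters, 1000 clusters, 1 byte per element). The symbols N and K are introduced here only for the illustration.

```latex
\left(\frac{K}{N}\right)^{2}
  = \left(\frac{1000}{4000}\right)^{2}
  = \frac{1}{16},
\qquad
N^{2} = 16{,}000{,}000 \ \text{elements}
\;\longrightarrow\;
K^{2} = 1{,}000{,}000 \ \text{elements}
```

At 1 byte per element this corresponds to the reduction from about 16 megabytes to about 1 megabyte.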

【０１３３】その場合、同一クラスタに属する文字や音
節は同様の遷移特性を有しているため、上記クラスタコ
ード・ダイグラムには各文字や音節の言語的遷移情報が
保持されている。したがって、認識対象の全文字または
全音節の数の圧縮に伴う精度の低下を抑えて、高精度な
多値のクラスタコード・ダイグラムを作成できる。した
がって、高い精度での確率方式言語処理を可能にする。In this case, characters or syllables belonging to the same cluster have similar transition characteristics, so the cluster code digram retains the linguistic transition information of each character or syllable. The loss of accuracy that accompanies compressing the number of recognition target characters or syllables is therefore suppressed, and a high-precision multi-valued cluster code digram can be created. This enables stochastic language processing with high accuracy.

【０１３４】また、請求項２に係る発明の確率テーブル
作成装置は、３文字属性遷移確率テーブル(以下、属性
トリグラムと言う)作成部によって、上記メモリに格納
された文字列における総ての隣接３文字の属性間の遷移
確率を表す属性トリグラムを作成するので、認識対象の
全文字または全音節の数を圧縮した際に失われた各文字
や音節の言語的遷移情報を補うための隣接３文字間の遷
移情報を表す確率テーブルを得ることができる。したが
って、上記クラスタコード・ダイグラムに上記属性トリ
グラムを併用することによって、更に精度の高い確率方
式言語処理を行うことが可能になる。In the probability table creating apparatus of the invention according to claim 2, the three-character attribute transition probability table (hereinafter, attribute trigram) creation unit creates an attribute trigram representing the transition probabilities between the attributes of all sequences of three adjacent characters in the character strings stored in the memory. A probability table representing transition information among three adjacent characters is thus obtained, which compensates for the linguistic transition information of each character or syllable that was lost when the number of recognition target characters or syllables was compressed. Therefore, by using the attribute trigram together with the cluster code digram, stochastic language processing with still higher accuracy becomes possible.

【０１３５】また、請求項３に係る発明の確率テーブル
作成装置は、１文字出現確率テーブル(以下、ユニグラ
ムと言う)作成部によって、上記メモリに格納された文
字列における各文字の出現確率を表すユニグラムを作成
するので、認識対象の全文字または全音節の数を圧縮し
た際時に失われた各文字や音節の言語的情報を補うため
の各文字または音節の出現情報を表す確率テーブルを得
ることができる。したがって、上記クラスタコード・ダ
イグラムに上記ユニグラムおよび属性トリグラムを併用
することによって、更に精度の高い確率方式言語処理を
行うことが可能になる。In the probability table creating apparatus of the invention according to claim 3, the one-character appearance probability table (hereinafter, unigram) creation unit creates a unigram representing the appearance probability of each character in the character strings stored in the memory. A probability table representing the appearance information of each character or syllable is thus obtained, which compensates for the linguistic information of each character or syllable that was lost when the number of recognition target characters or syllables was compressed. Therefore, by using the unigram and the attribute trigram together with the cluster code digram, stochastic language processing with still higher accuracy becomes possible.

【０１３６】また、請求項４に係る発明の確率テーブル
作成装置は、対数圧縮部によって、上記クラスタコード
・ダイグラムの各要素値を対数圧縮するので、上記各要
素値のデータ長を１バイトで表現可能な長さまで圧縮で
きる。したがって、上記クラスタコード・ダイグラムの
更なる記憶容量の低下を図り、それに伴って上記確率方
式言語処理の更なる高速化を可能にできる。In the probability table creating apparatus of the invention according to claim 4, the logarithmic compression unit logarithmically compresses each element value of the cluster code digram, so the data length of each element value can be compressed to a length that can be represented in 1 byte. The storage capacity of the cluster code digram can therefore be reduced further, and the stochastic language processing can be made correspondingly faster.
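One way such a logarithmic compression to 1 byte could work is sketched below. The exact scheme is not given in this passage, so everything here is an assumption made for illustration: the function names, the floor probability `min_prob`, and the choice to map probability 1.0 to code 255 and `min_prob` (or less) to code 0.

```python
import math


def log_compress(p: float, min_prob: float = 1e-8) -> int:
    """Quantize a probability to one byte on a logarithmic scale (assumed scheme)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    p = max(p, min_prob)  # clamp so that log() is defined and bounded
    frac = math.log(p) / math.log(min_prob)  # 0.0 at p = 1, 1.0 at p = min_prob
    return round(255 * (1.0 - frac))


def log_expand(code: int, min_prob: float = 1e-8) -> float:
    """Approximate inverse of log_compress."""
    if not 0 <= code <= 255:
        raise ValueError("code must fit in one byte")
    return min_prob ** (1.0 - code / 255)
```

Storing log-scaled codes also allows products of probabilities to be replaced by sums of codes, which is one plausible reason such a compression speeds up processing.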

【０１３７】また、請求項５に係る発明の確率テーブル
作成装置は、２文字遷移確率テーブル作成部によって、
上記メモリに格納された文字列に基づいてダイグラムを
作成し、上記クラスタリング部は、上記ダイグラムを認
識対象文字数個の認識対象文字数の次元のベクトルに分
割し、得られた全ベクトルを所定数のクラスタにクラス
タリングするので、類似した遷移特性を有する文字や音
節を同一クラスタにクラスタリングできる。したがっ
て、上記各クラスタにクラスタコードを付加してこのク
ラスタコードを文字と見なすことによって、各文字や音
節の言語的遷移情報を損なうことなく認識対象の全文字
あるいは全音節の数を上記クラスタ数に圧縮できる。In the probability table creating apparatus of the invention according to claim 5, the two-character transition probability table creation unit creates a digram based on the character strings stored in the memory, and the clustering unit divides the digram into as many vectors as there are recognition target characters, each having as many dimensions as there are recognition target characters, and clusters all the resulting vectors into a predetermined number of clusters. Characters or syllables with similar transition characteristics can thus be grouped into the same cluster. Therefore, by assigning a cluster code to each cluster and treating the cluster code as a character, the number of recognition target characters or syllables can be compressed to the number of clusters without losing the linguistic transition information of each character or syllable.

【０１３８】さらに、クラスタコード変換テーブル作成
部によって上記クラスタリングの結果に基づいてクラス
タコード変換テーブルを作成し、クラスタコード変換部
によって上記クラスタコード変換テーブルを用いて上記
文字列を上記クラスタコードに変換し、上記クラスタコ
ード・ダイグラム作成部は、変換されたクラスタコード
を用いて上記クラスタコード・ダイグラムを作成するの
で、上記クラスタコード変換テーブルを参照する簡単な
処理で上記クラスタコードを得てクラスタコード・ダイ
グラムを作成できる。Further, the cluster code conversion table creation unit creates a cluster code conversion table based on the clustering result, the cluster code conversion unit converts the character strings into cluster codes using the cluster code conversion table, and the cluster code digram creation unit creates the cluster code digram using the converted cluster codes. The cluster codes can thus be obtained by the simple process of referring to the cluster code conversion table, and the cluster code digram can be created from them.
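The table-lookup step described here can be sketched as follows. The sketch assumes the clustering has already been done and its result is available as a character-to-cluster-code dictionary (`cluster_of`, a hypothetical name); it also assumes that "transition probability" means the conditional probability of the next cluster code given the current one, which this passage does not state explicitly.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def cluster_code_digram(
    corpus: Iterable[str],
    cluster_of: Dict[str, int],
) -> Dict[Tuple[int, int], float]:
    """Estimate P(next code | current code) over adjacent character pairs."""
    pair_counts: Counter = Counter()
    first_counts: Counter = Counter()
    for text in corpus:
        # Cluster code conversion: a plain table lookup per character.
        codes = [cluster_of[ch] for ch in text]
        for current, following in zip(codes, codes[1:]):
            pair_counts[(current, following)] += 1
            first_counts[current] += 1
    return {
        pair: count / first_counts[pair[0]]
        for pair, count in pair_counts.items()
    }
```

For example, with `cluster_of = {"a": 0, "b": 1, "c": 1}` the characters "b" and "c" share one code, so the pairs "ab" and "ac" are counted as the same transition from code 0 to code 1.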

【０１３９】また、請求項６に係る発明の確率テーブル
作成装置は、属性変換部によって属性変換テーブルを用
いて上記文字列を属性に変換し、上記属性トリグラム作
成部は、上記変換された属性を用いて上記属性トリグラ
ムを作成するので、上記属性変換テーブルを参照すると
いう簡単な操作で上記属性を得て属性トリグラムを作成
できる。In the probability table creating apparatus of the invention according to claim 6, the attribute conversion unit converts the character strings into attributes using an attribute conversion table, and the attribute trigram creation unit creates the attribute trigram using the converted attributes. The attributes can thus be obtained by the simple operation of referring to the attribute conversion table, and the attribute trigram can be created from them.

【０１４０】また、請求項７に係る発明の確率テーブル
作成装置は、上記メモリには日本語文章が格納され、上
記文字は日本語文字であり、上記属性は、平仮名,片仮
名,記号,漢字,数字,アルファベット大文字およびアルフ
ァベット小文字の何れかであるので、１要素を１バイト
とした場合に１６メガバイトが必要な日本語文字用のダ
イグラムを、上記クラスタ数１０００に文字数圧縮して
上記クラスタコード・ダイグラムを作成することによっ
て、１メガバイトまで記憶容量を圧縮できる。同様に、
認識対象文字を上記属性に変換して上記属性トリグラム
を作成することによって、６４ギガバイトが必要なトリ
グラムの記憶容量を最大３４３バイトまで圧縮できる。
すなわち、この発明によれば、日本語文字認識装置に搭
載可能な低メモリ容量の隣接２文字間および隣接３文字
間の遷移情報を表す確率テーブルを作成できる。In the probability table creating apparatus of the invention according to claim 7, Japanese text is stored in the memory, the characters are Japanese characters, and each attribute is one of hiragana, katakana, symbol, kanji, digit, uppercase alphabetic letter, and lowercase alphabetic letter. A digram for Japanese characters, which requires 16 megabytes when each element occupies 1 byte, can therefore be compressed to 1 megabyte by compressing the number of characters to 1000 clusters and creating the cluster code digram. Similarly, by converting the recognition target characters into the attributes and creating the attribute trigram, the storage capacity of a trigram, which would otherwise require 64 gigabytes, can be compressed to at most 343 bytes. That is, according to the present invention, probability tables representing transition information between two adjacent characters and among three adjacent characters can be created with a memory capacity small enough to be installed in a Japanese character recognition device.
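The seven-attribute mapping and the resulting table sizes can be illustrated with a short sketch. The Unicode block ranges below are a rough modern approximation chosen for illustration and are not taken from this description; `attribute_of` and `ATTRIBUTES` are hypothetical names, and the figure of 4000 recognition target characters is the approximate count quoted in this description.

```python
ATTRIBUTES = ("hiragana", "katakana", "symbol", "kanji",
              "digit", "upper", "lower")


def attribute_of(ch: str) -> str:
    """Map one character to one of the seven attributes (rough Unicode ranges)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isupper():
        return "upper"
    if ch.islower():
        return "lower"
    return "symbol"                # everything else is treated as a symbol


CHARS = 4000                                     # approximate character count
full_trigram_elements = CHARS ** 3               # 64,000,000,000 elements
attribute_trigram_elements = len(ATTRIBUTES) ** 3  # 7 ** 3 = 343 elements
```

At 1 byte per element, these two element counts correspond to the 64 gigabytes and 343 bytes quoted in the text.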

【０１４１】また、請求項８に係る発明の確率テーブル
作成装置は、上記メモリには中国語文章が格納され、文
字は中国語文字であり、上記属性は、助字,記号,助字以
外の漢字,数字,アルファベット大文字およびアルファベ
ット小文字の何れかであるので、日本語の場合と同様に
認識対象文字数が極端に多い中国語の場合にも、認識対
象文字にクラスタリングを行って文字数圧縮して上記ク
ラスタコード・ダイグラムを作成することによって、ダ
イグラムの記憶容量を例えば１メガバイトまで圧縮でき
る。同様に、認識対象文字を上記属性に変換して上記属
性トリグラムを作成することによって、トリグラムの記
憶容量を例えば最大２１６バイトまで圧縮できる。すな
わち、この発明によれば、中国語文字認識装置に搭載可
能な低メモリ容量の隣接２文字間および隣接３文字間の
遷移情報を表す確率テーブルを作成できる。In the probability table creating apparatus of the invention according to claim 8, Chinese text is stored in the memory, the characters are Chinese characters, and each attribute is one of supplementary character, symbol, kanji other than supplementary characters, digit, uppercase alphabetic letter, and lowercase alphabetic letter. Therefore, even for Chinese, which like Japanese has an extremely large number of recognition target characters, the storage capacity of the digram can be compressed to, for example, 1 megabyte by clustering the recognition target characters to compress the number of characters and creating the cluster code digram. Similarly, by converting the recognition target characters into the attributes and creating the attribute trigram, the storage capacity of the trigram can be compressed to, for example, at most 216 bytes. That is, according to the present invention, probability tables representing transition information between two adjacent characters and among three adjacent characters can be created with a memory capacity small enough to be installed in a Chinese character recognition device.

【０１４２】また、請求項９に係る発明の確率方式言語
処理装置は、請求項１に係る発明の確率テーブル作成装
置によって作成された上記クラスタコード・ダイグラム
を有し、文字列遷移確率算出部によって、従来のダイグ
ラムよりも記憶容量が(クラスタ数/認識対象文字(又は
音節)数)²だけ圧縮された上記クラスタコード・ダイグラ
ムを用いて入力文字列の文字列遷移確率を算出するの
で、認識対象文字数が４０００程度に達する日本語文字
を処理対象とする場合でも、１要素を１バイトとして１
メガバイトのメモリ容量があればよい。The stochastic language processing apparatus of the invention according to claim 9 has the cluster code digram created by the probability table creating apparatus of the invention according to claim 1, and the character string transition probability calculation unit calculates the string transition probability of an input character string using the cluster code digram, whose storage capacity is compressed to (number of clusters / number of recognition target characters (or syllables))² of that of a conventional digram. Therefore, even when Japanese characters, for which the number of recognition target characters reaches about 4000, are to be processed, a memory capacity of 1 megabyte suffices at 1 byte per element.

【０１４３】また、請求項１０に係る発明の確率方式言
語処理装置は、請求項２および請求項３に係る発明の確
率テーブル作成装置によって作成された上記ユニグラム
および属性トリグラムの少なくとも一方の確率テーブル
を有し、上記文字列遷移確率算出部は、上記確率テーブ
ルをも用いて上記入力文字列の文字列遷移確率を算出す
るので、上記ユニグラムおよび属性トリグラムの少なく
とも一方を併用して、高い精度で上記文字列遷移確率を
算出できる。The stochastic language processing apparatus of the invention according to claim 10 has at least one of the unigram and the attribute trigram created by the probability table creating apparatus of the invention according to claims 2 and 3, and the character string transition probability calculation unit also uses this probability table in calculating the string transition probability of the input character string. The string transition probability can therefore be calculated with high accuracy by using at least one of the unigram and the attribute trigram in combination.

【０１４４】また、請求項１１に係る発明の確率方式言
語処理装置における上記文字列遷移確率算出部は、上記
文字列遷移確率を上記第１の式によって算出するので、
上記クラスタコード・ダイグラム,属性トリグラムおよび
ユニグラムの参照と、ごく簡単な演算処理のみで上記文
字列遷移確率を算出できる。こうして、確率方式言語処
理動作の高速化を図ることができる。In the stochastic language processing apparatus of the invention according to claim 11, the character string transition probability calculation unit calculates the string transition probability by the first equation, so the string transition probability can be calculated merely by referring to the cluster code digram, attribute trigram, and unigram and performing very simple arithmetic. The stochastic language processing operation can thus be made faster.

【０１４５】また、請求項１２に係る発明の認識装置
は、入力された言語要素列から切り出された言語要素に
対するマッチングの結果得られた複数の認識候補を組み
合わせて複数の候補文字列を得、スコア算出部によっ
て、請求項１に係る発明の確率テーブル作成装置によっ
て作成された上記クラスタコード・ダイグラムを用いて
上記候補文字列のスコアを算出し、候補文字列生成部に
よって、上記算出されたスコアに基づいて上記候補文字
列の確からしさを評価する確率方式言語処理を行って所
定数の候補文字列を生成するので、従来のダイグラムよ
りも記憶容量が(クラスタ数/認識対象文字(又は音節)
数)²だけ圧縮されている上記クラスタコード・ダイグラ
ムを使用して上記候補文字列に対する確率方式言語処理
を行うことができる。したがって、認識対象文字数が４
０００程度に達する日本語文字を認識する場合でも、１
要素を１バイトとして１メガバイトの記憶容量があれば
よく、単語照合方式言語処理用の辞書を搭載したとして
も全く弊害は生じない。The recognition device of the invention according to claim 12 obtains a plurality of candidate character strings by combining the recognition candidates obtained by matching the language elements cut out from an input language element sequence. The score calculation unit calculates the score of each candidate character string using the cluster code digram created by the probability table creating apparatus of the invention according to claim 1, and the candidate character string generation unit performs stochastic language processing that evaluates the likelihood of the candidate character strings based on the calculated scores and generates a predetermined number of candidate character strings. The stochastic language processing of the candidate character strings can thus be performed using the cluster code digram, whose storage capacity is compressed to (number of clusters / number of recognition target characters (or syllables))² of that of a conventional digram. Therefore, even when recognizing Japanese characters, for which the number of recognition target characters reaches about 4000, a storage capacity of 1 megabyte suffices at 1 byte per element, and no problem arises even if a dictionary for word-collation language processing is also installed.

【０１４６】また、請求項１３に係る発明の認識装置
は、請求項２および請求項３に係る発明の確率テーブル
作成装置によって作成された上記ユニグラムおよび属性
トリグラムの少なくとも一方の確率テーブルを有し、上
記スコア算出部は、上記確率テーブルをも用いて上記候
補文字列のスコアを算出するので、上記ユニグラムおよ
び属性トリグラムの少なくとも一方を併用した確率方式
言語処理を行って、上記候補文字列の確からしさをより
正確に表すスコアを算出できる。The recognition device of the invention according to claim 13 has at least one of the unigram and the attribute trigram created by the probability table creating apparatus of the invention according to claims 2 and 3, and the score calculation unit also uses this probability table in calculating the scores of the candidate character strings. By performing stochastic language processing that uses at least one of the unigram and the attribute trigram in combination, a score that represents the likelihood of each candidate character string more accurately can be calculated.

[0147] In the recognition device according to claim 14, the word collation language processing unit performs word collation language processing on the predetermined number of generated candidate character strings and outputs the optimal candidate character string as the recognition result of the input language element sequence. The optimal candidate character string is thus obtained as the recognition result by word dictionary search, so input text and input speech can be recognized more accurately.

[0148] In the recognition device according to claim 15, the score calculation unit calculates the score by the second equation, so the score can be obtained merely by referring to the cluster code digram, the attribute trigram and the unigram and performing very simple arithmetic. The recognition processing can therefore be made faster.

[0149] In the recognition device according to claim 16, the input language element sequence is a character string. Character recognition with a high recognition rate can therefore be performed at high speed and with low memory capacity, using the cluster code digram, whose storage size is compressed by a factor of (number of clusters / number of recognition target characters)² relative to the conventional two-character transition probability table.

[0150] The computer-readable recording medium according to claim 17 records a character recognition program that causes a computer to function as a cutout unit that cuts out individual characters from an input character string, a matching unit that matches the feature pattern of each cut-out character against standard patterns to obtain a plurality of recognition candidates, and a candidate character string generation unit that generates a predetermined number of candidate character strings based on the score of each candidate character string, calculated using the cluster code digram, for the plurality of candidate character strings obtained by combining the recognition candidates. It therefore provides the same effect as claim 16.

[Brief description of the drawings]

FIG. 1 is a functional block diagram of the probability table creation device of the present invention.

FIG. 2 is an explanatory diagram of the method by which the compressed digram creation unit, the vector division unit and the clustering unit in FIG. 1 create a compressed digram from a digram.

FIG. 3 is a diagram showing an example of character attributes for Japanese characters.

FIG. 4 is a flowchart of the probability table creation processing executed by the control unit in FIG. 1.

FIG. 5 is a flowchart of the probability table creation processing, continued from FIG. 4.

FIG. 6 is a diagram showing the memory sizes of the tables created by the probability table creation device shown in FIG. 1.

FIG. 7 is a functional block diagram of the recognition device of the present invention.

FIG. 8 is a flowchart of the character recognition processing executed by the control unit in FIG. 7.

FIG. 9 is a diagram showing the memory sizes of a conventional digram and trigram.

[Explanation of symbols]

1: Document database
12: Character attribute conversion table
13: Compressed character code conversion table
18, 46: Control unit
19: Unigram creation unit
20: Digram creation unit
21: Attribute trigram creation unit
22: Compressed digram creation unit
23: Vector division unit
24: Clustering unit
31: Image input unit
32: Cutout unit
37: Feature extraction unit
38: Matching unit
42: Candidate character string expansion unit
43: Score calculation unit
45: Word collation language processing unit
58: Table
59: Language dictionary for word collation

Claims

1. A probability table creation device comprising: a memory in which a character string of a natural language is stored; a clustering unit that clusters all characters of the character string stored in the memory into clusters having similar transition characteristics and assigns a cluster code to each cluster; and a two-cluster code transition probability table creation unit that obtains the cluster code of each character in the character string stored in the memory, obtains, for every pair of adjacent characters in the character string, the transition probability between the cluster codes of the two characters, and creates a two-cluster code transition probability table.
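The table creation step of claim 1 can be sketched as follows, assuming the character-to-cluster-code assignment is already available. The function name `cluster_digram` and the sample mapping are hypothetical. The sketch also assumes that "transition probability" means the relative frequency of each cluster code pair among all adjacent pairs; a conditional probability would be an equally plausible reading.

```python
from collections import Counter


def cluster_digram(text, cluster_of):
    """Relative frequency of each (cluster code, cluster code) pair over adjacent characters."""
    pair_counts = Counter(
        (cluster_of[a], cluster_of[b]) for a, b in zip(text, text[1:])
    )
    total = sum(pair_counts.values())
    return {pair: count / total for pair, count in pair_counts.items()}


# Hypothetical assignment: 'a' and 'b' share cluster 0, 'x' and 'y' share cluster 1.
cluster_of = {"a": 0, "b": 0, "x": 1, "y": 1}
table = cluster_digram("axby", cluster_of)
print(table)
```

Because the table is indexed by cluster codes rather than characters, its size depends only on the number of clusters.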

2. The probability table creation device according to claim 1, further comprising a three-character attribute transition probability table creation unit that obtains the attribute of each character in the character string stored in the memory, obtains, for every set of three adjacent characters in the character string, the transition probability between the attributes of the three characters, and creates a three-character attribute transition probability table.

3. The probability table creation device according to claim 2, further comprising a one-character appearance probability table creation unit that obtains the appearance probability of each character for all characters in the character string stored in the memory and creates a one-character appearance probability table.

4. The probability table creation device according to claim 1, further comprising a logarithmic compression unit that logarithmically compresses each element value of the two-cluster code transition probability table.
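One way to picture logarithmic compression is to store each probability as a small integer proportional to its negative logarithm. The claim does not specify a scheme, so the sketch below is an assumption throughout: the function name `log_compress`, the 8-bit range and the `scale` factor are all hypothetical.

```python
import math


def log_compress(p: float, scale: float = 16.0) -> int:
    """Map a probability to a one-byte code proportional to -log(p); 255 is the floor."""
    if p <= 0.0:
        return 255  # unseen transitions get the largest code
    return min(255, round(-math.log(p) * scale))


codes = [log_compress(p) for p in (1.0, 0.5, 0.001, 0.0)]
print(codes)
```

With such a code, multiplying probabilities becomes adding small integers, and each table element fits in one byte.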

5. The probability table creation device according to claim 1, further comprising: a two-character transition probability table creation unit that obtains the transition probabilities between all pairs of adjacent characters among all characters of the character string stored in the memory and creates a two-character transition probability table; a vector division unit that divides the two-character transition probability table into vectors whose dimension equals the number of recognition target characters, the clustering unit clustering the vectors into a predetermined number of clusters; and a cluster code conversion table creation unit that creates, based on the result of the clustering, a cluster code conversion table used to obtain the cluster code of the cluster to which each of the two characters belongs, wherein the two-cluster code transition probability table creation unit has a cluster code conversion unit that converts each of the two characters into a cluster code using the cluster code conversion table, and creates the two-cluster code transition probability table using the converted cluster codes.
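The vector division and clustering steps of claim 5 can be illustrated with a small sketch. The claim does not name a clustering algorithm, so plain k-means with farthest-first seeding is used here as a stand-in. Treating each row of the two-character table as one vector, normalizing rows to conditional frequencies, and the names `digram_rows` and `cluster_rows` are all assumptions.

```python
from collections import Counter


def digram_rows(text, alphabet):
    """One vector per character: relative frequency of each character that follows it."""
    counts = Counter(zip(text, text[1:]))
    rows = {}
    for a in alphabet:
        total = sum(counts[(a, b)] for b in alphabet)
        rows[a] = [counts[(a, b)] / total if total else 0.0 for b in alphabet]
    return rows


def sq_dist(u, v):
    return sum((p - q) ** 2 for p, q in zip(u, v))


def cluster_rows(rows, k, iterations=10):
    """Plain k-means over the row vectors; returns a character -> cluster code table."""
    chars = sorted(rows)
    # Farthest-first seeding keeps identical rows from seeding two clusters.
    centroids = [rows[chars[0]]]
    while len(centroids) < k:
        far = max(chars, key=lambda c: min(sq_dist(rows[c], m) for m in centroids))
        centroids.append(rows[far])
    code_of = {}
    for _ in range(iterations):
        for c in chars:
            code_of[c] = min(range(k), key=lambda j: sq_dist(rows[c], centroids[j]))
        for j in range(k):
            members = [rows[c] for c in chars if code_of[c] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return code_of


rows = digram_rows("axbyaybxa", "abxy")
code_of = cluster_rows(rows, k=2)
print(code_of)
```

In this toy text, 'a' and 'b' are followed by the same characters and so receive the same cluster code, as do 'x' and 'y'. The resulting `code_of` plays the role of the cluster code conversion table.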

6. The probability table creation device according to claim 2, further comprising an attribute conversion table used to obtain the attribute of each character, wherein the three-character attribute transition probability table creation unit has an attribute conversion unit that converts each character into an attribute using the attribute conversion table, and creates the three-character attribute transition probability table using the converted attributes.

7. The probability table creation device according to claim 2, wherein the memory stores Japanese text, the characters are Japanese characters, and the attribute is any one of hiragana, katakana, symbol, kanji, numeral, uppercase letter and lowercase letter.
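The attribute conversion and the three-character attribute table can be sketched as below. The patent uses an attribute conversion table; this sketch substitutes Unicode block ranges as a rough approximation, which is an assumption (for example, punctuation inside the katakana block is classified as katakana). The names `attribute_of` and `attribute_trigram` are hypothetical, and relative frequency is again assumed as the meaning of "transition probability".

```python
from collections import Counter


def attribute_of(ch: str) -> str:
    """Approximate the seven attributes of claim 7 from Unicode properties."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isdigit():
        return "numeral"
    if ch.isupper():
        return "uppercase"
    if ch.islower():
        return "lowercase"
    return "symbol"


def attribute_trigram(text):
    """Relative frequency of each attribute triple over three adjacent characters."""
    attrs = [attribute_of(ch) for ch in text]
    counts = Counter(zip(attrs, attrs[1:], attrs[2:]))
    total = sum(counts.values())
    return {triple: n / total for triple, n in counts.items()}


trigram = attribute_trigram("東京へ行く")
print(trigram)
```

With seven attributes the table has at most 7 × 7 × 7 = 343 elements, regardless of how many characters are recognized.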

8. The probability table creation device according to claim 2, wherein the memory stores Chinese text, the characters are Chinese characters, and the attribute is any one of supplementary character, symbol, kanji other than supplementary characters, numeral, uppercase letter and lowercase letter.

9. A stochastic language processing device comprising: the two-cluster code transition probability table created by the probability table creation device according to claim 1; and a character string transition probability calculation unit that calculates the character string transition probability of an input character string using the two-cluster code transition probability table.

10. The stochastic language processing device according to claim 9, further comprising at least one of the one-character appearance probability table and the three-character attribute transition probability table created by the probability table creation device according to claim 2 or 3, wherein the character string transition probability calculation unit also uses that probability table when calculating the character string transition probability of the input character string.

11. The stochastic language processing device according to claim 10, wherein the character string transition probability calculation unit calculates the character string transition probability by the following first equation.

$$P(c_1, c_2, \ldots, c_n) = P(c_1) + P(c_1, c_2) + P(c_1, c_2, c_3) + P(c_2, c_3, c_4) + \cdots + P(c_{n-2}, c_{n-1}, c_n) + P(c_{n-1}, c_n) + P(c_n)$$

where

$$P(x) = \text{one-character appearance probability table}(x)$$

$$P(x, y) = \frac{\text{one-character appearance probability table}(x) + \text{one-character appearance probability table}(y)}{2} + \frac{\text{two-cluster code transition probability table}(x^c, y^c)}{2}$$

$$P(x, y, z) = \frac{\text{one-character appearance probability table}(x) + \text{one-character appearance probability table}(y) + \text{one-character appearance probability table}(z)}{3} + \frac{\text{two-cluster code transition probability table}(x^c, y^c) + \text{two-cluster code transition probability table}(y^c, z^c)}{2} + \frac{\text{three-character attribute transition probability table}(x^a, y^a, z^a)}{3}$$

and

x, y, z: characters
x^c, y^c, z^c: cluster codes of the characters x, y, z
x^a, y^a, z^a: attributes of the characters x, y, z
one-character appearance probability table(x): element value for the language element x in the one-character appearance probability table
two-cluster code transition probability table(x^c, y^c): element value for the cluster code pair (x^c, y^c) in the two-cluster code transition probability table
three-character attribute transition probability table(x^a, y^a, z^a): element value for the attribute triple (x^a, y^a, z^a) in the three-character attribute transition probability table
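The first equation reduces to table lookups and a few additions, as the following sketch shows. The `Tables` container, the function names and the toy table values are hypothetical. The sketch also assumes the equation applies to strings of at least three characters, since shorter strings are handled separately by P(x) and P(x, y).

```python
from dataclasses import dataclass


@dataclass
class Tables:
    unigram: dict            # one-character appearance probability table
    cluster_digram: dict     # two-cluster code transition probability table
    attribute_trigram: dict  # three-character attribute transition probability table
    cluster_of: dict         # character -> cluster code
    attribute_of: dict       # character -> attribute


def p1(t, x):
    return t.unigram[x]


def p2(t, x, y):
    pair = (t.cluster_of[x], t.cluster_of[y])
    return (t.unigram[x] + t.unigram[y]) / 2 + t.cluster_digram[pair] / 2


def p3(t, x, y, z):
    cx, cy, cz = (t.cluster_of[ch] for ch in (x, y, z))
    attrs = tuple(t.attribute_of[ch] for ch in (x, y, z))
    return ((t.unigram[x] + t.unigram[y] + t.unigram[z]) / 3
            + (t.cluster_digram[(cx, cy)] + t.cluster_digram[(cy, cz)]) / 2
            + t.attribute_trigram[attrs] / 3)


def string_transition_probability(t, chars):
    """Sum the leading, sliding three-character and trailing terms of the first equation."""
    if len(chars) < 3:
        raise ValueError("this sketch assumes at least three characters")
    total = p1(t, chars[0]) + p2(t, chars[0], chars[1])
    for x, y, z in zip(chars, chars[1:], chars[2:]):
        total += p3(t, x, y, z)
    return total + p2(t, chars[-2], chars[-1]) + p1(t, chars[-1])


# Toy values, chosen only so the arithmetic is easy to follow.
tables = Tables(
    unigram={"a": 0.2, "b": 0.4, "c": 0.6},
    cluster_digram={(0, 1): 0.5, (1, 0): 0.3},
    attribute_trigram={("L", "L", "L"): 0.9},
    cluster_of={"a": 0, "b": 1, "c": 0},
    attribute_of={"a": "L", "b": "L", "c": "L"},
)
value = string_transition_probability(tables, "abc")
print(value)
```

For "abc" the five terms are 0.2, 0.55, 1.1, 0.65 and 0.6, which sum to 3.1.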

12. A recognition device comprising: a cutout unit that cuts out individual language elements from an input language element sequence; a matching unit that matches the feature pattern of each cut-out language element against standard patterns to obtain a plurality of recognition candidates; and a candidate character string generation unit that performs stochastic language processing on a plurality of candidate character strings obtained by combining the plurality of recognition candidates and generates a predetermined number of candidate character strings, wherein the candidate character string generation unit has a score calculation unit that calculates the score of each candidate character string using the two-cluster code transition probability table created by the probability table creation device according to claim 1, and performs stochastic language processing that evaluates the likelihood of each candidate character string based on the calculated score.

13. The recognition device according to claim 12, further comprising at least one of the one-character appearance probability table and the three-character attribute transition probability table created by the probability table creation device according to claim 2 or 3, wherein the score calculation unit also uses that probability table when calculating the score of the candidate character string.

14. The recognition device according to claim 12, further comprising a word collation language processing unit that performs word collation language processing on the predetermined number of candidate character strings generated by the candidate character string generation unit and outputs the optimal candidate character string as the recognition result of the input language element sequence.

15. The recognition device according to claim 13, wherein the score calculation unit calculates the score by the following second equation.

$$\mathrm{score}(1) = W_S S_1 + W_P P(c_1)$$

$$\mathrm{score}(2) = W_S \frac{S_1 + S_2}{2} + W_P P(c_1, c_2)$$

$$\mathrm{score}(i) = \frac{\mathrm{score}(i-1) \times (i-1) + \{ W_S S_i + W_P P(c_{i-2}, c_{i-1}, c_i) \}}{i}$$

where

score(i): score of the candidate character string up to the i-th character
c_1, c_2, c_3, ..., c_n: candidate character string
S_1, S_2, S_3, ..., S_n: evaluation value of each recognition candidate at the time of matching
P(y): character string transition probability of the candidate character string y (calculated using the formulas for P(x), P(x, y) and P(x, y, z) according to claim 11)
W_P, W_S: weights
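The second equation is a running average that can be updated one character at a time. The sketch below assumes the three-character term covers the last three characters c_{i-2}, c_{i-1}, c_i. The function name `running_scores`, the default weights and the stub probability functions are hypothetical; in practice `p1`, `p2` and `p3` would be the P(x), P(x, y) and P(x, y, z) lookups.

```python
def running_scores(chars, match_values, p1, p2, p3, w_s=1.0, w_p=1.0):
    """Return score(1)..score(n) for one candidate character string."""
    scores = []
    for i in range(1, len(chars) + 1):
        if i == 1:
            s = w_s * match_values[0] + w_p * p1(chars[0])
        elif i == 2:
            s = (w_s * (match_values[0] + match_values[1]) / 2
                 + w_p * p2(chars[0], chars[1]))
        else:
            # Fold the i-th character into the average of the first i-1 characters.
            term = (w_s * match_values[i - 1]
                    + w_p * p3(chars[i - 3], chars[i - 2], chars[i - 1]))
            s = (scores[-1] * (i - 1) + term) / i
        scores.append(s)
    return scores


# Stub probability functions with constant values, for illustration only.
scores = running_scores(
    "abc", [4.0, 6.0, 8.0],
    p1=lambda x: 1.0, p2=lambda x, y: 2.0, p3=lambda x, y, z: 3.0,
    w_s=0.5, w_p=2.0,
)
print(scores)
```

Each update needs only the previous score, one matching value and one table-derived probability, which is consistent with the speed advantage the description attributes to this equation.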

16. The recognition device according to claim 12, wherein the input language element sequence is a character string.

17. A computer-readable recording medium on which is recorded a character recognition program that causes a computer to function as: a cutout unit that cuts out individual characters from an input character string; a matching unit that matches the feature pattern of each cut-out character against standard patterns to obtain a plurality of recognition candidates; and a candidate character string generation unit that calculates scores by performing stochastic language processing, using the two-cluster code transition probability table created by the probability table creation device according to claim 1, on a plurality of candidate character strings obtained by combining the plurality of recognition candidates, and generates a predetermined number of candidate character strings based on the scores.