JPH08241378A - Recognizing method for low-quality character - Google Patents

Recognizing method for low-quality character

Info

Publication number
JPH08241378A
Authority
JP
Japan
Prior art keywords
character
quality
read
dictionary
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP7043729A
Other languages
Japanese (ja)
Inventor
Atsushi Hidaka
篤 日高
Shinji Matsui
伸二 松井
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuji Electric Co Ltd
Fuji Facom Corp
Original Assignee
Fuji Electric Co Ltd
Fuji Facom Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuji Electric Co Ltd, Fuji Facom Corp
Priority to JP7043729A
Publication of JPH08241378A
Legal status: Pending

Landscapes

  • Character Discrimination (AREA)

Abstract

PURPOSE: To read even a document consisting of degraded characters at a low misread rate by selecting the feature dictionary used for reading according to the character quality of the document input from an image scanner. CONSTITUTION: Before reading, a character recognition processor 22 displays the document image data obtained by scanning with an image scanner 21 unchanged on the display device 31 of a host computer 3, and the reading conditions are set with reference to the displayed contents. Character vectors are then extracted from the document image data. Based on the character quality label selected on the reading condition setting screen, the region corresponding to that label is opened in a feature dictionary that stores feature character vectors indexed by character quality label. Each extracted cut-out character vector is collated, by similarity, with the feature character vectors stored in that quality label region to extract candidate characters, which are then screened and, after specified post-processing, output as the final recognized characters.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a character recognition method for a character reading device, and more particularly to a method for recognizing low-quality characters.

[0002]

[Prior Art] As illustrated in FIG. 6, a character reading device (OCR) 1 comprises a character reading and recognition unit 2, made up of an image scanner 21 and a character recognition processor 22, and a host computer 3. The character recognition processor 22 receives the read data that the image scanner 21 obtains by optically scanning the document to be read and processes it according to the flow outlined in FIG. 7. It first examines the read data and cuts out each character pattern data region to be processed as one character. It then analyzes the cut-out character pattern data to extract its feature parameters, collates these with a dictionary of feature parameters prepared in advance for each character in the range to be read, and recognizes the read character by extracting the character whose feature parameters match. Its basic function is to output the character classification code assigned to that character as character information. The host computer 3 serves as the man-machine interface for specifying the reading conditions of the document and displaying the reading results, and is also used for editing and proofreading documents based on the document information obtained by reading.

[0003] The character recognition processor 22 divides a character pattern data region that has been input from the image scanner 21 and cut out as one character, such as the one illustrated in FIG. 8(a), into a grid of suitably fine cells, L cells vertically by I cells horizontally, as shown in FIG. 8(b). It reads the density value of each cell as a function of the cell coordinates, $f(X_p, Y_q)$ $(1 \le p \le I,\ 1 \le q \le L)$, and treats the set of density pattern values obtained for one character pattern data region as approximate data for the read character. The $K = I \times L$ density pattern values are numbered in cell order and arranged in that order as $x_i = f(X_p, Y_q)$, and the result is interpreted as the vector $X$ in K-dimensional space given by Equation 1.

[0004]

[Equation 1]

$$X = (x_1, x_2, \ldots, x_K), \qquad K = I \times L \tag{1}$$

As described above, one character is represented by a character vector whose dimension equals K, the number of cells into which one character region is divided, which corresponds to the reading resolution. Therefore, if a K-dimensional reference character vector based on the same cell division is obtained in advance for each character in the range to be read, and the collection of these reference character vectors is prepared as a character vector dictionary, then the character vector of a character to be recognized, obtained by scanning the document with the image scanner, can in principle be recognized by collating it with the reference character vectors in the dictionary and selecting the one that matches.
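The cell sampling described in [0003] and [0004] can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `char_vector`, the use of the mean pixel density as the value of each cell, and the row-by-row cell numbering are not specified in the source.

```python
def char_vector(image, cols, rows):
    """Sample a character image into a K = cols * rows density vector.

    image: list of equal-length rows of pixel densities (0 = white, 1 = black).
    cols, rows: number of cells across (I) and down (L).
    Assumes the image has at least `rows` pixel rows and `cols` pixel columns.
    """
    height, width = len(image), len(image[0])
    vector = []
    for q in range(rows):                          # cell row (q in the text)
        y0, y1 = q * height // rows, (q + 1) * height // rows
        for p in range(cols):                      # cell column (p in the text)
            x0, x1 = p * width // cols, (p + 1) * width // cols
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            vector.append(sum(cell) / len(cell))   # mean density of the cell (assumption)
    return vector


# Toy 4x4 image sampled into a 2x2 grid, giving a 4-dimensional vector.
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]
print(char_vector(image, cols=2, rows=2))  # [1.0, 0.0, 0.0, 0.5]
```

A real device would use a much finer grid; the 2x2 grid here only keeps the example readable.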

[0005] Even for standardized characters, such as those printed by a printer or from type, a detailed comparison of the same character printed in the same typeface usually reveals slight differences that stem from differences in character design between printer or type manufacturers. Two instances of the same character printed in different glyph designs have somewhat different density pattern distributions in the character pattern region, so the character vectors obtained by sampling the region into cells are similar but never completely identical. If an exact match between the character vector of the character to be recognized and a reference character vector registered in the dictionary were required for recognition, most characters could not be read and recognized.

[0006] To avoid this problem and recognize characters correctly regardless of slight differences in glyph design, practical methods do not test whether two vectors in character vector space coincide. Instead they base the recognition decision on the distance $d(X, Y)$ given by equation (1), which expresses the separation between two vectors $X$ and $Y$, or on the similarity $s(X, Y)$ given by equation (2).

[0007]

[Equation 2]

[0008]

[Equation 3]

[0009] The distance $d(X, Y)$ takes values in $0 \le d \le \infty$ and is 0 when $X = Y$. The similarity $s(X, Y)$ equals the cosine, $\cos\theta$, of the angle $\theta$ between the two vectors $X$ and $Y$; it takes values in $-1 \le s \le 1$ and is 1 when $X = Y$. In character recognition, the similarity of equation 2, which is less affected by differences in print density, is often used as the decision criterion.
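The following Python sketch illustrates why the similarity is less sensitive to print density than the distance. The equation images are not reproduced in this text, so the Euclidean form of `distance` is an assumption; `similarity` follows the statement above that s equals the cosine of the angle between the two vectors. Both function names and the sample vectors are hypothetical.

```python
import math


def distance(x, y):
    """Euclidean distance between two character vectors (assumed form); 0 when x == y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def similarity(x, y):
    """Cosine of the angle between two non-zero character vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm


dark = [0.8, 0.0, 0.4, 0.8]
light = [0.4, 0.0, 0.2, 0.4]   # same shape printed at half the density

print(distance(dark, light))    # non-zero: the distance reacts to the density change
print(similarity(dark, light))  # approximately 1: the direction of the vector is unchanged
```

Scaling every component by the same factor changes the length of the vector but not its direction, which is why the cosine stays at 1 while the distance grows.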

[0010] The feature dictionary that serves as the reference for recognition decisions is a collection of features of optical read data extracted from standard samples in typefaces such as Mincho, Gothic, italic, and textbook (kyokasho) style, each in several fonts. The recognition rate is therefore low for character types and glyph designs not registered in the feature dictionary. On the other hand, registering character types and glyph designs indiscriminately lowers the relative similarity of other entries during similarity calculation and harms recognition. Moreover, if the feature dictionary is too large, many similarity calculations are required and processing takes longer than necessary.

[0011] For this reason, the reference character vector registered in the feature dictionary for each character is not determined from a single character chosen as the reference. Instead, many sample instances of the same character in glyph designs of the same family are collected, and the average of their character vectors is registered as the reference character vector.
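The averaging step in [0011] amounts to a component-wise mean. A minimal Python sketch, with a hypothetical function name and toy sample vectors:

```python
def mean_vector(samples):
    """Component-wise mean of equal-length character vectors."""
    count = len(samples)
    return [sum(components) / count for components in zip(*samples)]


# Two hypothetical samples of the same character in related glyph designs.
samples = [
    [1.0, 0.0, 0.5],
    [0.5, 0.0, 0.5],
]
reference = mean_vector(samples)
print(reference)  # [0.75, 0.0, 0.5]
```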

[0012]

[Problems to Be Solved by the Invention] When the document to be read is, for example, the result of repeated photocopying, character quality deteriorates and characters become smudged (filled in) or faded. The optical features obtained from such characters differ markedly from those of the corresponding characters in a conventional feature dictionary, the similarity becomes low, and correct recognition results can no longer be obtained.

[0013] FIG. 9 shows an example of how the image of a character in the same typeface changes with deterioration. FIG. 9(a) shows characters of normal quality, FIG. 9(b) characters of poor quality, and FIG. 9(c) characters of very poor quality. Each includes the character '用' ('use'). Compared with FIG. 9(a), the horizontal strokes in FIG. 9(b) are faded and hard to read, and in FIG. 9(c) they have disappeared completely. When a character like that in FIG. 9(c) is recognized by OCR using the conventional method, it is very likely to be recognized as '川' ('river'). Because its similarity to '用' is low, '用' does not rank among the top candidate characters, and it is very difficult to arrive at the correct answer through post-processing. Recognizing characters of such poor quality has therefore been difficult.

[0014] During recognition, several candidates for the recognition result of each character to be read are first selected; these are called candidate characters. If the similarity is too low, the correct result no longer appears among the candidate characters. In that case it is also difficult to arrive at the correct answer through post-processing such as Japanese language processing that uses the patterns of candidate character combinations.

[0015] A feature dictionary registers multiple glyph designs for each character type, so characters that are smudged or faded could also be recognized by adding standard optical feature patterns that carry smudged or faded features. However, an added standard pattern may lie close to the standard pattern of another character type, yield a high similarity to it, and adversely affect the recognition results for other characters.

[0016] An object of the present invention is to provide a method for recognizing degraded characters that, even for a document consisting of characters whose quality has deteriorated, can read the document and recognize each character as the correct character, or as a character close to it, in a short processing sequence, and thereby to improve the recognition rate for degraded characters in a character recognition device.

[0017]

[Means for Solving the Problems] To solve the above problems, the character recognition method of the present invention works as follows. The pattern data of the document to be read, optically read by an image scanner and stored in a buffer, is examined and cut into single-character regions, each yielding a cut-out character vector, which is an ordered set of density pattern values. Reference character vectors are generated in advance from image scanner read data of character samples of different quality and collected in a feature dictionary so that they can be searched by character quality category. Character recognition is performed by collating each cut-out character vector with this feature dictionary under the character quality category selected for the characters of the document when the reading conditions are set.

[0018] The feature dictionary for the character quality categories used in recognition takes one of two forms: a set of quality-level-specific feature dictionaries, in which reference character vectors generated in advance from image scanner read data of character samples of different quality are collected for each character quality category, or a single integrated dictionary in which the vectors are recorded for each character, separated by character quality level. The character quality label, which indicates the character quality level selected for the characters of the document to be read, selects which quality-level-specific dictionary, or which region of the integrated dictionary, is referenced. Alternatively, a feature dictionary for degraded characters may be obtained by multiplying the reference character vector of each character in a standard feature dictionary, generated from high-quality character samples, by a mask that weights specific vector components and is set in advance for each typeface and quality level.

[0019] The character quality label corresponding to the character quality level of the document to be read is set on a reading condition setting screen that has an input image display region, which displays the image scanned and input by the image scanner unchanged. A window on this screen displays example characters prepared in advance for each character quality level, and the quality label of the level that matches the quality of the input character image shown in the input image display region is specified.

[0020]

[Operation] The characters in the document image displayed on the reading condition setting screen are compared with the examples of degraded characters displayed for each character quality label in a window on the same screen. When the character quality label matching the character quality of the document is selected, the feature dictionary corresponding to that label is opened and character recognition is carried out.

[0021]

[Embodiment] FIG. 1 shows the processing flow of one embodiment of a character recognition method using the similar character discrimination method according to the present invention, and the method is described with reference to FIG. 1. The character reading device that executes the method has the same configuration as the device illustrated in FIG. 6, which was used to describe the prior art, and the reference numerals of FIG. 6 are used below where needed.

[0022] The character recognition processor 22 of the character reading device first displays the document image data scanned and input by the image scanner 21 unchanged on the display device 31 of the host computer 3 before reading (S2). With reference to this display, the reading conditions for the document image data, such as the reading direction and reading range, are set through the host computer 3 (S31). In the method of the present invention, these reading condition settings include a step (S32) of selecting the quality level of the input character image according to predefined character quality label categories.

[0023] Once the reading conditions have been set, the character recognition device cuts out each single-character region from the input document image data according to those conditions, analyzes the character pattern data of the cut-out region, and extracts a cut-out character vector (S4, S5). The feature dictionary stores, as feature character vectors indexed by character quality label, character vectors generated in advance from image scanner read data of character samples of different quality levels. Based on the character quality label selected on the reading condition setting screen, the region of the dictionary corresponding to that label is opened (S6). The extracted cut-out character vector is collated by similarity with the feature character vectors in that region, and the candidate characters are sorted and extracted (S7). The candidates extracted by sorting are screened (S8), post-processing consisting of correction by knowledge processing as in the prior art is applied (S9), and the result is output as the final recognized character (S10).
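Steps S6 and S7 can be sketched in Python as below. This is a toy illustration, not the implementation in the patent: the function names, the nested-dict dictionary layout, the use of copy counts 0 and 3 as quality labels, and the two-component vectors (vertical-stroke density, horizontal-stroke density) are all assumptions chosen to mirror the '用' and '川' example of FIG. 9.

```python
import math


def similarity(x, y):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def recognize(cut_vector, feature_dictionary, quality_label, top_n=3):
    """Rank candidate characters using only the region opened by quality_label."""
    region = feature_dictionary[quality_label]             # S6: open the labelled region
    scored = [(similarity(cut_vector, ref), ch) for ch, ref in region.items()]
    scored.sort(reverse=True)                               # S7: sort by similarity
    return [ch for _, ch in scored[:top_n]]                 # candidates passed on to S8 and S9


# Hypothetical vectors: [vertical-stroke density, horizontal-stroke density].
dictionary = {
    0: {"用": [1.0, 1.0], "川": [1.0, 0.0]},   # normal quality
    3: {"用": [1.0, 0.2], "川": [0.8, 0.0]},   # after three copy generations
}
faded_yo = [0.9, 0.15]   # a faded '用' whose horizontal strokes have almost vanished

print(recognize(faded_yo, dictionary, quality_label=0))  # ['川', '用']
print(recognize(faded_yo, dictionary, quality_label=3))  # ['用', '川']
```

With the normal-quality region the faded character ranks '川' first, while the region selected by the matching quality label ranks '用' first, which is the behaviour the method aims for.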

[0024] Next, the method of creating the feature dictionary that is searched by character quality label in the basic recognition flow described above is explained. In a document that contains poor-quality characters such as those in FIGS. 9(b) and 9(c), the other characters generally show similar deterioration. Such documents are often the result of repeated copying. The degree of character quality deterioration can therefore be quantified by the number of copy generations. Accordingly, a character image sample of normal quality is copied repeatedly to produce character image samples for dictionary creation whose quality has deteriorated according to the number of copies.

[0025] The degraded character image samples produced in this way are input to the character reading device described with reference to FIG. 6, and feature parameters are extracted. The degree of deterioration is expressed by the number of copies, which is attached as the character quality label, and a feature dictionary is created for each quality level. FIG. 2 shows the basics of this dictionary creation method.
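A minimal Python sketch of this dictionary creation step follows, assuming each sample arrives as a hypothetical `(copy_count, character, vector)` tuple and that samples sharing a label and character are averaged as in [0011]. The function name and data layout are illustrative only.

```python
from collections import defaultdict


def build_quality_dictionary(samples):
    """Group (copy_count, character, vector) samples and average each group."""
    groups = defaultdict(list)
    for copy_count, ch, vector in samples:
        groups[(copy_count, ch)].append(vector)

    dictionary = defaultdict(dict)
    for (copy_count, ch), vectors in groups.items():
        mean = [sum(components) / len(vectors) for components in zip(*vectors)]
        dictionary[copy_count][ch] = mean      # the copy count serves as the quality label
    return {label: dict(entries) for label, entries in dictionary.items()}


samples = [
    (0, "用", [1.0, 1.0]),
    (3, "用", [1.0, 0.5]),
    (3, "用", [1.0, 0.0]),
]
print(build_quality_dictionary(samples))
# {0: {'用': [1.0, 1.0]}, 3: {'用': [1.0, 0.25]}}
```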

[0026] In the method illustrated in FIG. 2, the feature dictionary consists of volumes that are divided by character quality level and labeled with a character quality label. Alternatively, as illustrated in FIG. 3, the feature character vectors obtained from samples of the same character at different quality levels may be recorded in a single-volume feature dictionary, grouped by character and ordered by quality, so that only the feature character vector at the rank specified by the character quality label is referenced when the dictionary is consulted.
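The two layouts can be compared with a short Python sketch. The nested-dict representation, the function names, and the sample vectors are assumptions for illustration; the source only describes the layouts in terms of FIG. 2 and FIG. 3.

```python
# Layout described for FIG. 2: one volume per quality label.
volumes = {
    0: {"用": [1.0, 1.0], "川": [1.0, 0.0]},
    3: {"用": [1.0, 0.2], "川": [0.8, 0.0]},
}


def to_integrated(volumes):
    """Regroup per-label volumes into one dictionary keyed by character (FIG. 3 layout)."""
    integrated = {}
    for label in sorted(volumes):                  # keep entries in quality order
        for ch, vector in volumes[label].items():
            integrated.setdefault(ch, {})[label] = vector
    return integrated


def references_for(integrated, label):
    """Return only the reference vectors recorded under the given quality label."""
    return {ch: by_label[label] for ch, by_label in integrated.items() if label in by_label}


integrated = to_integrated(volumes)
print(integrated["用"])                 # {0: [1.0, 1.0], 3: [1.0, 0.2]}
print(references_for(integrated, 3))    # {'用': [1.0, 0.2], '川': [0.8, 0.0]}
```

Either layout yields the same set of reference vectors for a given label; the choice mainly affects how the dictionary is stored and maintained.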

[0027] Next, FIG. 4 shows the basics of a method that generates a feature dictionary from a weighting table prepared in advance. If the characters to be read are, for example, in the Mincho typeface, the horizontal strokes are thinner than the vertical strokes, so repeated copying of a document printed in Mincho fades the horizontal strokes more than the vertical ones (see FIGS. 9(a) and 9(b)). A mask is therefore created that weights the vertical and horizontal components so as to emphasize the vertical components that remain unfaded in the similarity calculation, and this weighting mask is applied to each character vector in the feature dictionary to generate the feature vectors used in practice.
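Applying such a mask can be sketched as an element-wise multiplication, consistent with the statement in [0018] that the reference vector is multiplied by the mask. In the Python below, the function name, the vector layout (two vertical-stroke components followed by two horizontal-stroke components), and the weight values are hypothetical.

```python
def apply_mask(reference, mask):
    """Weight each component of a reference vector by the matching mask entry."""
    return [r * w for r, w in zip(reference, mask)]


# Hypothetical 4-component vector: components 0-1 vertical strokes, 2-3 horizontal strokes.
reference = [1.0, 1.0, 0.5, 0.5]
mincho_faded_mask = [1.0, 1.0, 0.5, 0.5]   # keep verticals, attenuate horizontals

print(apply_mask(reference, mincho_faded_mask))  # [1.0, 1.0, 0.25, 0.25]
```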

[0028] The quality-specific masks registered in the weighting table can be produced in either of two ways. One builds a filter that, based on the characteristics of the typeface as described above, identifies the stroke portions that are inherently prone to deterioration and attenuates them, without collecting degraded character samples. The other superimposes degraded character samples collected in advance and builds a filter that extracts the stroke portions that remain in common without deterioration.
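The second approach, superimposing degraded samples, might look like the Python sketch below. The source does not say how the overlay is computed, so the component-wise minimum, the `threshold`, and the `attenuation` weight are assumptions, as is the function name.

```python
def common_stroke_mask(degraded_samples, threshold=0.5, attenuation=0.25):
    """Weight 1.0 where every sample still has ink, `attenuation` elsewhere."""
    # The component-wise minimum keeps only what survives in all samples (assumed overlay rule).
    common = [min(components) for components in zip(*degraded_samples)]
    return [1.0 if value >= threshold else attenuation for value in common]


# Two hypothetical degraded samples of the same character.
degraded_samples = [
    [0.9, 0.8, 0.3, 0.0],
    [1.0, 0.7, 0.1, 0.4],
]
print(common_stroke_mask(degraded_samples))  # [1.0, 1.0, 0.25, 0.25]
```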

[0029] The character dictionary to which the mask is applied may be an individual feature dictionary belonging to a specific quality label and generated from degraded character samples, in which case the mask adapts it to a more advanced state of deterioration. If too few degraded character samples are available and only the standard character dictionary for reading exists, the mask can instead be applied to the standard character dictionary to handle degraded characters.

[0030] Finally, the method of selecting the character quality label when setting the reading conditions is described. As explained at the beginning of the embodiment, in the character reading device 1 the document image data scanned by the image scanner 21 and input to the character recognition processor 22 is first displayed unchanged on the display device of the host computer 3. This display normally appears on the reading condition setting screen.

[0031] As illustrated in FIG. 5, the reading condition setting screen consists of an input document image data display region ID and a setting item region CR, which displays the items to be specified for reading, such as the reading range and reading direction, together with their options. In the present invention, a window WI for displaying character quality level examples is provided in the reading condition setting screen. The window displays each character quality label with representative examples of degraded characters belonging to it. These representative examples are compared with the characters in the document image shown in the input document image data display region, and the code of the character quality label with similar character quality is entered.

[0032] The window for displaying character quality level examples may be placed in part of the setting item region CR, or it may be opened on demand by a command in the input document image data display region ID.

[0033]

[Effects of the Invention] In a document reading device that uses the low-quality character recognition method of the present invention, the feature dictionary used for reading is selected and opened according to the character quality of the document input from the image scanner. The first effect is that even a document consisting of degraded characters can be read with a low misread rate. In addition, because the quality label limits the portion of the feature dictionary that is used, the entire dictionary need not be collated, so processing speed does not drop when degraded characters are read.

[0034] With the method that generates individual feature dictionaries for each character quality level from character vectors obtained by reading character samples collected for each quality category, the feature dictionaries can be generated automatically once the character samples have been collected. With the method that generates the working feature dictionary using a weighting mask, the mask reproduces the quality deterioration that tends to occur for a given typeface, so a feature dictionary suited to reading degraded characters can be obtained even when too few character samples are available to establish quality-level-specific feature dictionaries in advance.

【0035】そうして、読取条件設定画面にウィンドウ
を開いて文字品質ラベルと共にその品質ラベルに属する
文字の事例を表示するようにした方法によれば、読取対
象の文書画像と品質ラベルによって区分された劣化文字
の例が同一画面中に表示されるので読取条件選定指定に
あたっての文字品質ラベルの選定を容易にかつ適切に実
行できるという効果が得られる。
According to the method in which a window is opened on the reading condition setting screen to display each character quality label together with examples of characters belonging to that label, the document image to be read and the examples of degraded characters classified by quality label appear on the same screen, so the character quality label can be selected easily and appropriately when the reading conditions are specified.

【図面の簡単な説明】[Brief description of drawings]

【図1】本発明による低品質文字の認識方法の処理フロ
ー図
FIG. 1 is a process flow chart of a low quality character recognition method according to the present invention.

【図2】品質水準別個別特徴辞書作成方法の説明図
FIG. 2 is an explanatory diagram of a method for creating individual feature dictionaries by quality level.

【図3】品質水準別に文字ベクトルを収録する統合特徴
辞書作成方法の説明図
FIG. 3 is an explanatory diagram of an integrated feature dictionary creating method for recording character vectors by quality level.

【図4】重み付けテーブルのマスクを用いて劣化文字対
応の特徴辞書を生成する方法の説明図
FIG. 4 is an explanatory diagram of a method of generating a feature dictionary for degraded characters by using a weighting table mask.

【図5】本発明による読み取り条件設定画面の構成図
FIG. 5 is a configuration diagram of a reading condition setting screen according to the present invention.

【図6】文字読取装置構成図
FIG. 6 is a block diagram of a character reading device.

【図7】文字読取装置における文字認識処理の基本フロ
ー図
FIG. 7 is a basic flowchart of character recognition processing in the character reading device.

【図8】文字ベクトルの説明図
FIG. 8 is an explanatory diagram of a character vector.

【図9】文字品質の異なる印刷文字の例を示す図
FIG. 9 is a diagram showing examples of printed characters of different character quality.

【符号の説明】[Explanation of symbols]

1 文字読取装置 2 文字読取認識部 21 イメージスキャナ 22 文字認識プロセッサ 3 ホストコンピュータ
1: character reading device; 2: character reading and recognition unit; 21: image scanner; 22: character recognition processor; 3: host computer

Claims (5)

【特許請求の範囲】[Claims] 【請求項1】イメージスキャナが光学走査してバッファ
に格納した読取対象文書のパターンデータを観測して単
一文字域として切り出して得た濃度パターンの値の順序
集合である切出し文字ベクトルを、あらかじめ異なる品
質の文字サンプルのイメージスキャナによる読取データ
をもとに生成した基準文字ベクトルを文字品質区分に対
応して検索可能に収集して用意した特徴辞書と、読取条
件設定にあたって読取対象文書の文字に選定した文字品
質区分のもとで照合して文字認識を行うようにしたこと
を特徴とする低品質文字の認識方法。
1. A method for recognizing low-quality characters, characterized in that a cut-out character vector, which is an ordered set of density pattern values obtained by observing the pattern data of a document to be read that an image scanner has optically scanned and stored in a buffer and cutting out a single-character area, is collated with a feature dictionary, prepared by collecting reference character vectors generated in advance from image scanner read data of character samples of different qualities so that they can be searched by character quality class, under the character quality class selected for the characters of the document to be read when the reading conditions are set, whereby character recognition is performed.
【請求項2】特徴辞書が、あらかじめ異なる品質の文字
サンプルのイメージスキャナ読取データをもとに生成し
た基準文字ベクトルを文字品質区分毎に収集して用意し
た複数の品質水準別特徴辞書からなり、読取対象文書の
文字に選定した文字品質水準に該当する品質水準の特徴
辞書を選択して切出し文字ベクトルとの照合を行って文
字認識を行うことを特徴とする請求項1に記載の低品質
文字の認識方法。
2. The method for recognizing low-quality characters according to claim 1, wherein the feature dictionary comprises a plurality of quality-level-specific feature dictionaries prepared by collecting, for each character quality class, reference character vectors generated in advance from image scanner read data of character samples of different qualities, and the feature dictionary of the quality level corresponding to the character quality level selected for the characters of the document to be read is selected and collated with the cut-out character vector to perform character recognition.
【請求項3】特徴辞書が、あらかじめ異なる品質の文字
サンプルのイメージスキャナ読取データをもとに、同一
文字毎に文字品質の水準に別けて収録した単一の統合辞
書であり、読取対象文書の文字に選定した文字品質水準
に対応の、統合辞書に登録の基準文字ベクトルを選択し
て切出し文字ベクトルとの照合を行って文字認識を行う
ことを特徴とする請求項1に記載の低品質文字の認識方
法。
3. The method for recognizing low-quality characters according to claim 1, wherein the feature dictionary is a single integrated dictionary in which reference character vectors, based on image scanner read data of character samples of different qualities, are recorded in advance for each character separately by character quality level, and the reference character vectors registered in the integrated dictionary that correspond to the character quality level selected for the characters of the document to be read are selected and collated with the cut-out character vector to perform character recognition.
【請求項4】特徴辞書が高品位の文字サンプルをもとに
生成された基準特徴辞書に収録の個別文字に対応の基準
文字ベクトルに対し、字体の種類と品質水準に対応して
あらかじめ設定した特定ベクトル成分の重み付を行うマ
スクを乗じて得られる文字ベクトルによって構成するこ
とを特徴とする請求項1に記載の低品質文字の認識方
法。
4. The method for recognizing low-quality characters according to claim 1, wherein the feature dictionary is composed of character vectors obtained by multiplying the reference character vectors of the individual characters recorded in a reference feature dictionary, which is generated from high-quality character samples, by a mask that weights specific vector components and is set in advance according to the typeface and the quality level.
【請求項5】読取条件の設定を、イメージスキャナが走
査して入力した画像イメージをそのまま表示する入力イ
メージ表示領域を有する読取条件設定画面上で行い、該
読取条件設定画面に設けたウインドウ部に、文字品水準
区分に対応して予め用意した各文字品質水準の文字の事
例が表示され、入力イメージ表示領域に表示された入力
文字画像の品質に相当の文字品質水準を表す品質ラベル
を指定することによって特徴辞書にかかわる文字品質区
分を設定するようにしたことを特徴とする請求項1ない
し4のいずれかの項に記載の低品質文字の認識方法。
5. The method for recognizing low-quality characters according to any one of claims 1 to 4, wherein the reading conditions are set on a reading condition setting screen having an input image display area that displays, as is, the image scanned and input by the image scanner; examples of characters of each character quality level, prepared in advance for each character quality level class, are displayed in a window provided on the reading condition setting screen; and the character quality class applied to the feature dictionary is set by designating the quality label that represents the character quality level matching the quality of the input character image displayed in the input image display area.
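As an informal illustration of the difference between the dictionary layouts in claims 2 and 3, the sketch below models the separate quality-level dictionaries as a mapping from level to character entries and the integrated dictionary as a mapping from character to per-level entries. The layouts, function names, and values are assumptions made for illustration only; the claims do not prescribe a storage format.

```python
# Assumed claim 2 layout: one feature dictionary per quality level.
SEPARATE = {
    "good": {"A": [0, 9, 9, 0], "B": [9, 9, 0, 0]},
    "faint": {"A": [0, 4, 5, 0], "B": [5, 4, 0, 1]},
}

def integrate(separate_dictionaries):
    """Build the assumed claim 3 layout: character -> quality level -> vector."""
    integrated = {}
    for level, entries in separate_dictionaries.items():
        for char, vector in entries.items():
            integrated.setdefault(char, {})[level] = vector
    return integrated

def select_level(integrated, level):
    """Pick, for every character, the reference vector registered for one level."""
    return {
        char: levels[level]
        for char, levels in integrated.items()
        if level in levels
    }

INTEGRATED = integrate(SEPARATE)
print(INTEGRATED["A"])
print(select_level(INTEGRATED, "faint"))
```

Both layouts hold the same reference vectors; they differ only in whether the quality level or the character is the outer key.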
JP7043729A 1995-03-03 1995-03-03 Recognizing method for low-quality character Pending JPH08241378A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP7043729A JPH08241378A (en) 1995-03-03 1995-03-03 Recognizing method for low-quality character

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP7043729A JPH08241378A (en) 1995-03-03 1995-03-03 Recognizing method for low-quality character

Publications (1)

Publication Number Publication Date
JPH08241378A (en) 1996-09-17

Family

ID=12671880

Family Applications (1)

Application Number Title Priority Date Filing Date
JP7043729A Pending JPH08241378A (en) 1995-03-03 1995-03-03 Recognizing method for low-quality character

Country Status (1)

Country Link
JP (1) JPH08241378A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100369051C (en) * 2005-01-11 2008-02-13 富士通株式会社 Grayscale character dictionary generation apparatus
CN100373399C (en) * 2004-08-18 2008-03-05 富士通株式会社 Method and apparatus for establishing degradation dictionary
WO2016068325A1 (en) * 2014-10-31 2016-05-06 オムロン株式会社 Character recognition device, character recognition method, and program

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100373399C (en) * 2004-08-18 2008-03-05 富士通株式会社 Method and apparatus for establishing degradation dictionary
CN100369051C (en) * 2005-01-11 2008-02-13 富士通株式会社 Grayscale character dictionary generation apparatus
WO2016068325A1 (en) * 2014-10-31 2016-05-06 オムロン株式会社 Character recognition device, character recognition method, and program
JP2016091186A (en) * 2014-10-31 2016-05-23 オムロン株式会社 Character recognition apparatus, character recognition method, and program
US10049309B2 (en) 2014-10-31 2018-08-14 Omron Corporation Character recognition device, character recognition method and program

Similar Documents

Publication Publication Date Title
EP0544431B1 (en) Methods and apparatus for selecting semantically significant images in a document image without decoding image content
US5369714A (en) Method and apparatus for determining the frequency of phrases in a document without document image decoding
Guo et al. Separating handwritten material from machine printed text using hidden markov models
Hochberg et al. Script and language identification for handwritten document images
KR100412317B1 (en) Character recognizing/correcting system
US8566349B2 (en) Handwritten document categorizer and method of training
JP3842006B2 (en) Form classification device, form classification method, and computer-readable recording medium storing a program for causing a computer to execute these methods
EP1936536B1 (en) System and method for performing classification through generative models of features occuring in an image
US20090060396A1 (en) Features generation and spotting methods and systems using same
EP2144188A2 (en) Word detection method and system
DE102011079443A1 (en) Learning weights of typed font fonts in handwriting keyword retrieval
JP2001167131A (en) Automatic classifying method for document using document signature
WO1990015386A1 (en) Document identification by characteristics matching
Nagy 29 Optical character recognition—Theory and practice
JP3485020B2 (en) Character recognition method and apparatus, and storage medium
CN110796134B (en) Method for combining words of Chinese characters in strong-noise complex background image
US8340428B2 (en) Unsupervised writer style adaptation for handwritten word spotting
CN111612045B (en) Universal method for acquiring target detection data set
WO2007070010A1 (en) Improvements in electronic document analysis
JP3313272B2 (en) Address reading method and identification function weight vector generation method
JPH08241378A (en) Recognizing method for low-quality character
JPH08272902A (en) Method for recognizing character of different quality and different font
Sturgeon Unsupervised extraction of training data for pre-modern Chinese OCR
JP3374762B2 (en) Character recognition method and apparatus
Ueki et al. Japanese Cursive Character Recognition for Efficient Transcription.