JPH0812683B2 - High speed extraction method for specific character strings

Info

Publication number: JPH0812683B2
Application number: JP61288799A
Authority: JP
Inventors: 弘一本間; 文伸古村; 文男和歌森; 晃加賀美
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1986-12-05
Filing date: 1986-12-05
Publication date: 1996-02-07
Anticipated expiration: 2011-02-07
Also published as: JPS63142487A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to a method for extracting character strings from a document image, and in particular to a high-speed specific-character-string extraction method suited to efficiently extracting, from a document image, only those specific character strings registered in advance in the system.

[Prior art]

A character recognition method that improves reading accuracy by using linguistic information about the input character string is discussed in Sugimura, "Error correction method for kanji recognition using candidate character completion and morphological analysis," IEICE Information and Systems National Convention (1985), pp. 1-307 to 1-308. The character recognition system described there holds a word dictionary, performs character recognition on the full text, and applies morphological analysis to the sequence of candidate characters produced by conventional character-by-character recognition, correcting errors in cases where the correct character is ranked second or lower. Such a method achieves higher recognition accuracy than recognizing each character in isolation.

[Problems to be solved by the invention]

The prior art described above uses a priori character string information from a word dictionary for character recognition. However, because it performs character recognition on the full text, it gives no consideration to extracting only specific character strings efficiently, and it therefore suffers from long processing times.

An object of the present invention is to provide a high-speed specific-character-string extraction method suited to efficiently extracting, from a document image, only those specific character strings registered in advance in the system.

[Means for solving problems]

The above object is achieved as follows. Characters are cut out from the character image on a document (hereinafter, the document image), a feature is calculated for each cut-out character, and the calculated features are collated with character features registered in advance. For this collation, the calculated features are quantized. How each specific character string to be extracted from the document image corresponds to a quantized character feature sequence is determined in advance and stored in a table. By referring to this table, candidate specific character strings are obtained from the quantized character feature sequence of the input character image, and recognition processing that tests whether the character patterns match is performed only on those candidates.

[Action]

In pattern classification generally, when the number of target classes is large, deciding whether a pattern belongs to one of a few specific classes takes far fewer operations than fully identifying the class to which it belongs. This tendency becomes even more pronounced when the task is to detect a specific sequence of patterns within a longer sequence, because sequences normally exhibit regularity. Extracting specific character strings from a document image is exactly such a case.

FIG. 2 illustrates this point. Let x denote the feature parameter of a character pattern, and let P_i(x), i = 1, ..., (number of character types), denote the probability distribution of each character i in the feature space. As the figure shows, the distributions P_i(x) overlap one another. In ordinary printed character recognition, the recognition result for an input pattern x is the character whose distribution P_i(x) is largest at x. In actual computation, however, it is common to choose the character whose mean pattern lies at the smallest distance d from x. Recognizing one character therefore requires as many distance calculations as there are character types. By contrast, deciding whether the input pattern x could be a particular specific character requires only one distance calculation, or as many as the small number of specific characters.

Moreover, a character string is a combination of characters in which completely random combinations do not occur. Consequently, even when the probability distributions of individual characters overlap, the probability distributions of character strings in the product space of the character feature spaces overlap less. Character string recognition therefore attains relatively high accuracy even with a coarse feature whose per-character distributions overlap heavily.
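The operation-count argument above can be illustrated with a minimal sketch. It assumes Euclidean distance to per-character mean patterns and a fixed acceptance radius; the function names `classify` and `could_be`, the radius, and the toy means are hypothetical and do not come from the patent.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def classify(x, means):
    """Full recognition: one distance calculation per character type."""
    best = min(means, key=lambda c: distance(means[c], x))
    return best, len(means)  # (label, number of distance calculations)

def could_be(x, means, targets, radius):
    """Membership test: one distance calculation per target character only."""
    hits = [c for c in targets if distance(means[c], x) <= radius]
    return hits, len(targets)

# Toy data: 1000 hypothetical character types laid out on a 2-D feature grid.
means = {c: (c % 40, c // 40) for c in range(1000)}
x = (5.2, 3.1)
label, n_full = classify(x, means)
hits, n_test = could_be(x, means, targets=[125, 900], radius=1.0)
```

In this toy setting full recognition evaluates 1000 distances, while testing against two registered characters evaluates only two.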

[Example]

An embodiment of the present invention is described below with reference to FIG. 1. Document image data stored on an optical disk or similar medium is read from the optical disk device 1 and input line by line to the character frame clipping device 2, which outputs the minimum circumscribing rectangle of each character, that is, the coordinates of its upper-left and lower-right vertices. These are stored in the circumscribing rectangle information section 4 of the character information table 3. The character frame clipping device is assumed here to be the device detailed in Japanese Patent Application No. 60-184242, "Document character clipping image processing method."

Once the character circumscribing rectangle information has been extracted for the whole document image, it is read from the character information table 3, and the coarse feature calculation device 5 calculates a coarse feature for each character. The coarse feature used is the complexity index explained in detail on pp. 78-79 of the reference "Introduction to Character Recognition" (1982). The complexity index is the total length of the vertical and horizontal components of the contour of the character pattern. FIG. 3 shows the 2 × 2 mesh element patterns used to obtain the total contour length. The element patterns fall into four types: vertical pattern V (FIG. 3(a)), horizontal pattern H (FIG. 3(b)), one-sided diagonal pattern L (FIG. 3(c)), and two-sided diagonal pattern T (FIG. 3(d)). The thick lines in the figure are polygonal-line approximations of the contour of the character pattern. The vertical and horizontal components of the contour are obtained from the total counts n(V), n(H), n(L), and n(T) of each element pattern over the whole character. The horizontal and vertical complexity indices l_x and l_y are accordingly given by the following equations.

[Equations for l_x and l_y are not reproduced in this text.]

The horizontal and vertical complexity indices obtained for each character by the coarse feature calculation device 5 are stored in the coarse feature section 6 of the character information table 3, in correspondence with the character frame information in the circumscribing rectangle information section 4.

Next, the coarse features are read from the character information table 3 in character-string order, encoded by the quantizer 7 into a code m from 0 to 15 (4 bits), and stored in the shift register 8. The quantizer 7 divides each of the two directional complexity indices into four intervals using three thresholds, and determines which of the 16 cells so defined the character falls into. The shift register 8 holds the quantized coarse feature codes of three consecutive characters. The address calculator 9 treats the three codes as 12-bit data and consults a table of about 4K words to obtain the corresponding address in the character string table 10.

The specific character strings to be extracted are classified in advance, by their coarse features, into the roughly 4K cells (the triple product of the 16 cells). They are collected as code strings for each cell and stored in the character string table 10. The cells used for this classification are allowed to overlap one another, so that a single character string may correspond to more than one cell. The address output by the address calculator 9 is the start address of the code string data for the cell, and the start address itself holds the number of code strings in that cell. Thus, for the three characters corresponding to the quantized codes in the shift register 8, the content of the character string table 10 at the output address of the address calculator 9 gives the number of candidate specific character strings. If that number is 0, the determination device 11 judges that there is no candidate specific character string, the coarse feature of the next character is read from the character information table 3, and the character string table 10 is consulted again in the same way.
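The quantization, 12-bit addressing, and table lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions: the threshold values `TX` and `TY`, the function names, and the sample feature values are hypothetical, a Python dictionary stands in for the address table and the character string table, and the overlapping cells that the text permits are omitted for brevity.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical thresholds: three per direction give four intervals each.
TX = (10, 20, 30)
TY = (10, 20, 30)

def quantize(lx, ly):
    """Map a (horizontal, vertical) complexity pair to a 4-bit code 0..15."""
    return bisect_right(TX, lx) * 4 + bisect_right(TY, ly)

def address(codes):
    """Pack three 4-bit codes into one 12-bit address (0..4095)."""
    a, b, c = codes
    return (a << 8) | (b << 4) | c

def build_table(registered):
    """registered: {string: three (lx, ly) pairs} -> {address: [strings]}."""
    table = defaultdict(list)
    for s, feats in registered.items():
        table[address([quantize(*f) for f in feats])].append(s)
    return table

def candidates(features, table):
    """Slide a three-character window over the input and look up each address."""
    out = []
    for pos in range(len(features) - 2):
        addr = address([quantize(*f) for f in features[pos:pos + 3]])
        found = table.get(addr, [])
        if found:  # an empty cell means: shift by one character and continue
            out.append((pos, found))
    return out

# Hypothetical registered string and input line (coarse features per character).
registered = {"ABC": [(5, 5), (15, 25), (35, 12)]}
table = build_table(registered)
page = [(22, 22), (6, 4), (14, 26), (33, 11), (1, 1)]
result = candidates(page, table)
```

Only the window starting at position 1 maps to the same cell as the registered string, so only that window would be passed on to the more expensive fine-feature comparison.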

If the result of consulting the character string table 10 is not 0, the character code strings of the candidate character strings are read one after another from the character string table 10 and converted, via the character fine feature table 12, into the fine features of the character patterns corresponding to the character codes. The character pattern K(x, y) itself is used as the fine feature, where x and y are the horizontal and vertical position coordinates, respectively [the coordinate ranges are not reproduced in this text]. The fine feature sequence register 13 stores the fine features of the candidate character string, {K^i_1(x, y), K^i_2(x, y), K^i_3(x, y)}, where the index i denotes the i-th candidate.

Meanwhile, when the number of candidate character strings is not 0, the determination device 11 starts the fine feature calculation device 14, which calculates the fine features of the character string held in the shift register 8. That is, it cuts the corresponding character patterns out of the optical disk device 1 and stores them in the fine feature sequence register 15 for the string under test. The distance calculation device 16 computes the distance, that is, the dissimilarity, between the fine feature sequences in registers 13 and 15 by the following equation [the equation is not reproduced in this text], and the threshold determiner 17 evaluates the dissimilarity.

A candidate whose dissimilarity exceeds the threshold θ is considered not to match the string under test in the input document image. The candidates whose dissimilarity does not exceed θ are selected, and their character code strings are output together with the dissimilarity. If several candidates have a dissimilarity that does not exceed θ, either the character code string with the smallest dissimilarity is output, or all such character code strings are output in ranked order. If the dissimilarity of every candidate exceeds θ, the string under test is judged to be none of the specific character strings to be extracted. The character information table 3 is then triggered, the coarse feature of the next character is read out, and the above determination is performed on the three-character string shifted by one character.
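The verification step can be sketched as below. Because the distance equation is not reproduced in this text, the sketch assumes a sum of absolute pixel differences over the three character patterns; that measure, the function names `dissimilarity` and `verify`, and the tiny 2 × 2 binary patterns are all hypothetical stand-ins.

```python
def dissimilarity(cand, observed):
    """Sum of absolute pixel differences over all characters (assumed measure)."""
    total = 0
    for k_ref, k_in in zip(cand, observed):
        for row_ref, row_in in zip(k_ref, k_in):
            total += sum(abs(a - b) for a, b in zip(row_ref, row_in))
    return total

def verify(candidates, observed, theta):
    """Return (dissimilarity, code) pairs not exceeding theta, best first."""
    scored = [(dissimilarity(pats, observed), code) for code, pats in candidates]
    return sorted(s for s in scored if s[0] <= theta)

# Hypothetical 2x2 binary character patterns.
A = [[1, 0], [0, 1]]
B = [[1, 1], [0, 0]]
C = [[0, 0], [1, 1]]

observed = [A, B, [[0, 0], [1, 0]]]  # third character differs from C by one pixel
cands = [("ABC", [A, B, C]), ("CBA", [C, B, A])]

accepted = verify(cands, observed, theta=2)
rejected = verify(cands, observed, theta=0)
```

With `theta=2` only "ABC" survives; with `theta=0` no candidate is accepted, which corresponds to the case where the window is shifted by one character and the coarse lookup is repeated.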

The embodiment described above uses the complexity index as the coarse feature of a character pattern, but other coarse features can also be used, such as the marginal distribution or a reduced version of the pattern itself.

[Effect of the invention]

According to the present invention, the specific character strings to be extracted from a document image are classified in advance on the basis of sequences of coarse character features. The registered specific character strings that are candidates for a given input character string can therefore be narrowed down using coarse features that are cheap to compute, so that specific character strings are extracted at high speed. The result is that only the specific character strings registered in the system are extracted efficiently from the document image.

[Brief description of drawings]

FIG. 1 is an overall system configuration diagram of an embodiment of the present invention. FIG. 2 is an explanatory diagram of the probability distributions of character patterns in the feature space. FIG. 3 is a diagram showing an example of the element patterns used to calculate the complexity index employed as the coarse feature of a character pattern.

Claims

1. A high-speed specific-character-string extraction method, characterized by: inputting a character image; cutting out a character string from the input character image; calculating a feature sequence of the cut-out character string; collating the calculated feature sequence, while shifting the character position, against a feature sequence prepared in advance for a specific character string; and outputting only the collation positions and character strings for which the collation result of all characters constituting the character string is equal to or greater than a predetermined threshold.

2. The high-speed specific-character-string extraction method according to claim 1, wherein the collation process comprises: a process of collating with coarse features to eliminate character strings in the input image that cannot be a specific character string to be extracted; and a process of collating with fine features only those character strings in the input image that did not fail the coarse-feature collation.

3. The high-speed specific-character-string extraction method according to claim 2, wherein the collation process comprises: quantizing the coarse features of the input character string to obtain a quantized character feature sequence; determining in advance, and storing in a table, how the specific character strings to be extracted from a document image correspond to quantized character feature sequences; obtaining candidate specific character strings from the quantized character feature sequence by referring to the table; and collating the character patterns of those candidate specific character strings using fine features.