JP2705981B2 - Character reading method

Info

Publication number: JP2705981B2
Application number: JP1159750A
Authority: JP
Inventors: 一郎小倉; 保夫本郷; 収志吉田
Original assignee: Fuji Electric Co Ltd
Current assignee: Fuji Electric Co Ltd
Priority date: 1989-06-23
Filing date: 1989-06-23
Publication date: 1998-01-28
Anticipated expiration: 2013-01-28
Also published as: JPH0325692A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to a character reading method that cuts out blank symbols from a document image and determines the type of each blank, thereby enabling paragraph-by-paragraph reading.

[Conventional technology]

As a method of cutting out blank portions of this kind, the applicant has proposed Japanese Patent Application No. Sho 63-130143, No. 292445, and others (hereinafter simply called the previously proposed method).

In the previously proposed method, a document image is image-processed to cut out character lines or character columns, and the portions where characters exist are extracted from each cut-out line or column. A standard character size is then determined from the line dimension obtained by the line cut-out, and, based on this standard character size, each blank portion where no character exists is cut out as one or more blank symbols.

[Problems to be solved by the invention]

Because the previously proposed method determines the number of blank symbols from the length of the blank portion where no character exists, it has several problems. In documents with a wide pitch between characters, extra blanks are inserted. In documents with paragraphs, such as newspapers and collections of papers, a wide blank portion prevents the number of blanks from being calculated accurately, so the paragraphs become misaligned and the text cannot be read paragraph by paragraph.

Accordingly, the object of the present invention is to enable paragraph reading by calculating the correct number of blanks and determining the type of each blank.

[Means for solving the problem]

A character image is image-processed to cut out character lines or character columns, and the portions where characters exist are extracted from each line or column. Based on the average pitch between characters in that line or column, the portions where no character exists are cut out as one or more blank symbols. Then, based on the degree of positional deviation of the blank symbols across the lines or columns, it is determined whether each blank is a blank between characters or a blank due to a paragraph, which enables paragraph reading.

[Action]

When characters are cut out of a document, blank portions are accurately cut out as blank symbols. When the type of a blank shows it to be a paragraph blank, a symbol indicating this is attached to the characters on either side of the blank so that paragraph reading is also possible. In this way the structure of the text is left unchanged.

[Example]

FIG. 1 is a flowchart showing an embodiment of the present invention, and FIGS. 2 and 3 are explanatory diagrams for describing the invention concretely. The description below uses horizontal writing as the example, but the same applies to vertical writing.

As shown in FIG. 1, a projection is taken in the direction perpendicular to each character line cut out from the document image, and the portions where characters exist are cut out as character candidates (hereinafter also called provisional characters). A standard character size is determined from the line dimension thus obtained, and the provisional characters regarded as full-width characters are selected on the basis of this standard character size. For the remaining provisional characters, merged characters and split characters are created and passed to the OCR (character reader) for recognition; contradictions are resolved according to the properties of the characters, and for merged and split characters the correct character is chosen by similarity. All of this is the same as in the previously proposed method. The distinguishing feature of the present invention is that processes I, II, and III are added in the course of this flow. They are described in order below.
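The projection step described above can be illustrated with a minimal Python sketch. It assumes the character line is already binarized into rows of 0 (background) and 1 (ink), and it treats any column with at least one ink pixel as part of a character; the function name `provisional_chars`, the data layout, and that threshold are illustrative assumptions, not details given in the specification.

```python
from typing import List, Tuple


def provisional_chars(line_img: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (start, end) column spans, end exclusive, that contain ink.

    Assumption: line_img is one horizontal text line, binarized so that
    1 means ink and 0 means background.
    """
    if not line_img:
        return []
    width = len(line_img[0])
    # Projection perpendicular to the line: count ink pixels per column.
    profile = [sum(row[x] for row in line_img) for x in range(width)]

    spans: List[Tuple[int, int]] = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # a run of ink columns begins
        elif count == 0 and start is not None:
            spans.append((start, x))       # the run ends at an empty column
            start = None
    if start is not None:                  # run touching the right edge
        spans.append((start, width))
    return spans
```

Each returned span corresponds to one provisional character; deciding which spans are full-width characters would then compare the span width with the standard character size W.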

First, in preparation for cutting out blank symbols, the inter-character pitch calculation (process I) is performed after the full-width character cut-out. Suppose the input image is as shown in FIG. 2(a). Attention is first placed on provisional character K1 (step I-1), and it is determined whether K1 is a full-width character (step I-2). If it is, attention moves to the next character K2 (step I-3), and it is likewise determined whether K2 is a full-width character (step I-4). If it is, the pitch PT1 between K1 and K2 shown in FIG. 2(a) is obtained (step I-5), and the following test is applied (step I-6).

W ≤ PT1 ≤ 1.5W ... (1)

Here W is the standard character size determined from the line dimension obtained by the line cut-out described above (see FIG. 2(a)). Each PT1 that satisfies expression (1) is added to a running sum A (step I-7). Step I-8 checks whether the whole line has been processed; if not, steps I-1 through I-7 are repeated for the next character. When the line is finished, A is averaged to give the average pitch PT (step I-9).
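Steps I-1 through I-9 can be sketched in Python as follows. The sketch assumes that the pitch PT1 is measured between the centers of two adjacent full-width characters and that the full-width decision has already been made; the `Glyph` type, the function name, and the choice to return `None` when no pair qualifies are illustrative assumptions rather than details stated in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Glyph:
    """A provisional character on one line (hypothetical helper type)."""
    left: float
    right: float
    full_width: bool

    @property
    def center(self) -> float:
        return (self.left + self.right) / 2.0


def average_pitch(glyphs: List[Glyph], w: float) -> Optional[float]:
    """Average the pitches PT1 that satisfy W <= PT1 <= 1.5W (expression (1)).

    Assumption: pitch is the center-to-center distance of adjacent glyphs.
    Returns None when no adjacent full-width pair passes the test.
    """
    total = 0.0   # running sum A (step I-7)
    count = 0
    for k1, k2 in zip(glyphs, glyphs[1:]):
        if not (k1.full_width and k2.full_width):   # steps I-2 and I-4
            continue
        pt1 = k2.center - k1.center                  # step I-5
        if w <= pt1 <= 1.5 * w:                      # step I-6
            total += pt1
            count += 1
    return total / count if count else None          # step I-9
```

Restricting the average to pitches inside the window of expression (1) keeps wide gaps, which are themselves blanks, from inflating PT.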

Next, process II is described.

Process II cuts out blank symbols using the average pitch PT obtained in process I. First, the distance D1 from the line edge S1 to the leading provisional character C1 is obtained (step II-1) and divided by the average pitch PT to give the number of blank symbols B1 (step II-2). Computing the number of blanks from the center position of the provisional character reduces the influence of character width. For an image such as FIG. 2(a), the result is that one character's worth of blank is inserted. Next, blank symbols within the line are cut out. Attention is placed on two adjacent provisional characters Ki and Kj, the distance between them is taken as D2, and the following test is applied (steps II-3, 4, 5).

D2 > 0.3W ... (2)

If expression (2) is satisfied, the number of blank symbols is calculated; otherwise processing moves to step II-22. The threshold 0.3W is used so that half-width blanks are also taken into account, and this value can be varied to suit the document. Next, it is determined whether Ki is a full-width character (step II-6). If it is not, and the character formed by merging Ki with the preceding provisional character satisfies the standard character size, the center position of this merged character is taken as C1 (steps II-7, 8). If Ki is a full-width character, or the above condition is not satisfied, the center position of Ki is taken as C1 (step II-9). The next provisional character Kj is tested in the same way: if it is not a full-width character, Kj is merged with the following provisional character, and if the merged size satisfies the standard character size, the center position of the merged character is taken as C2 (steps II-10, 11, 12). If Kj is a full-width character, or the above condition is not satisfied, the center position of Kj is taken as C2 (step II-13). The difference between C2 and C1 is taken as the inter-character pitch PT2 (step II-14), and the following test is applied (step II-15).

PT2 > PT ... (3)

If expression (3) is not satisfied, the number of blank symbols B2 is set to 0 (step II-17) and processing moves to step II-22. If expression (3) is satisfied, B2 is obtained from the following expression (step II-16).

B2 = PT2/PT − 1 ... (4)

If the number of blanks B2 is 1 or more, a blank-start symbol is attached to Ki and a blank-end symbol is attached to Kj (steps II-18, 21). Attaching only one of the two symbols is also acceptable. Otherwise, the following test is used to detect a half-width blank:

1.25PT ≤ PT2 ≤ 1.75PT ... (4)

If this range test is satisfied, a half-width blank is set (steps II-19, 20). This criterion is also variable according to the document. Step II-22 then checks whether the line is finished; if not, steps II-3 through II-22 are repeated.
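The in-line decision chain of process II (expressions (2), (3), and the two tests numbered (4)) can be sketched in Python as follows. The function name, its return shape, and the keyword parameters are illustrative assumptions. The text does not say how a fractional B2 is converted to a whole number of blank symbols, so the sketch truncates, which is an assumption.

```python
from typing import Tuple


def classify_gap(d2: float, pt2: float, pt: float, w: float,
                 min_gap_ratio: float = 0.3,
                 half_lo: float = 1.25,
                 half_hi: float = 1.75) -> Tuple[int, bool]:
    """Return (number of full-width blanks, whether a half-width blank is set).

    d2  : distance between the adjacent provisional characters Ki and Kj
    pt2 : center-to-center pitch C2 - C1
    pt  : average pitch PT from process I
    w   : standard character size W
    The three ratios are the document-dependent thresholds from the text.
    """
    if d2 <= min_gap_ratio * w:          # expression (2) not satisfied
        return 0, False
    if pt2 <= pt:                        # expression (3) not satisfied
        return 0, False
    b2 = pt2 / pt - 1.0                  # B2 = PT2/PT - 1
    if b2 >= 1.0:
        return int(b2), False            # truncation is an assumption
    if half_lo * pt <= pt2 <= half_hi * pt:
        return 0, True                   # half-width blank
    return 0, False
```

For example, with W = 10 and PT = 12, a pitch PT2 of 24 gives B2 = 1 and so one full-width blank, while a pitch of 18 falls inside the 1.25PT to 1.75PT window and is treated as a half-width blank.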

After the above processing has been carried out for every line, paragraph processing III is executed. For an input image such as FIG. 3(a), the center positions of the characters K1 to Kn that carry a blank-end symbol, taken from the first line onward, are denoted T1 to Tn (steps III-1, 2). If the distances between T1 and the other character positions T2 to Tn are within the average pitch PT of each line, K1 to Kn are aligned at almost the same position, so the blanks preceding K1 to Kn can be regarded as blanks that separate paragraphs. In this way it can be determined whether a blank is merely a blank between characters or a blank that marks a paragraph boundary. A paragraph-break symbol is then attached to each of K1 to Kn (steps III-3 to 8). Because a paragraph begins with a one-character blank, if the difference between any of T1 to Tn and the smallest of them, Tm, is 0.8PT or more (this criterion is variable according to the document), the paragraph-break symbol is moved to the blank character immediately preceding that character (steps III-9 to 13). The result is shown in FIG. 3(b). The blanks on the third and sixth lines of the left half of that figure mark line ends, so they are deleted. In addition, since the characters carrying a paragraph-break symbol should all be at the same position, the number of blank symbols in each line can also be corrected (steps III-15, 16). As a result, the character recognition result can be output in the same layout as the input image of FIG. 3(a), output in a single column arranged according to the paragraph-break symbols, or output separately for the left and right sides.
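The two positional tests of process III can be sketched in Python as follows. The sketch assumes one blank-end character per line, with its center position given as a number, and a single average pitch PT shared by all lines; both function names and these simplifications are illustrative assumptions, and the 0.8 ratio is the document-dependent criterion from the text.

```python
from typing import List


def aligned_with_first(centers: List[float], pt: float) -> List[bool]:
    """For each Ti, report whether |Ti - T1| is within the average pitch PT.

    True entries are candidates for a paragraph-separating blank.
    """
    if not centers:
        return []
    t1 = centers[0]
    return [abs(t - t1) <= pt for t in centers]


def indented_lines(centers: List[float], pt: float,
                   ratio: float = 0.8) -> List[int]:
    """Return indices of lines whose Ti exceeds the smallest Tm by ratio*PT.

    On these lines the paragraph-break symbol would be moved to the blank
    immediately before the character (one-character paragraph indent).
    """
    if not centers:
        return []
    tm = min(centers)
    return [i for i, t in enumerate(centers) if t - tm >= ratio * pt]
```

With centers of 110, 100, 100.5, 110, and 100 and PT = 12, every line passes the alignment test, and the first and fourth lines are reported as indented paragraph starts.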

FIG. 4 is a block diagram showing a character reading apparatus to which the present invention is applied.

It consists of an image input device 1, a CPU 2, a ROM 3, a RAM 4, an image memory 5, a character recognition unit 6, and so on. Its main operation is as described above, so the description is omitted here.

With the above processing, for a document such as FIG. 2(a), where the conventional approach inserted extra blank symbols as in FIG. 2(b), the present invention inserts blank symbols correctly as in FIG. 2(c).

[Effect of the invention]

According to the present invention, blank portions at the head of a line or within a line can be correctly cut out as blank symbols. When the type of a blank shows it to be a paragraph blank, a symbol indicating a paragraph break can be attached to at least one of the characters on either side of the blank, which makes paragraph reading possible. Because the reading results can be organized paragraph by paragraph using these symbols, the original document can be reproduced without changing its structure.

[Brief description of the drawings]

FIG. 1 is a flowchart showing an embodiment of the present invention, FIGS. 2 and 3 are explanatory diagrams for describing the invention concretely, and FIG. 4 is a block diagram showing a character reading apparatus to which the invention is applied.

Description of symbols: 1 ... image input device, 2 ... CPU, 3 ... ROM, 4 ... RAM, 5 ... image memory, 6 ... character recognition unit.

Continuation of the front page
(72) Inventor: Shoji Yoshida, c/o Fuji Facom Control Co., Ltd., 1 Fuji-machi, Hino-shi, Tokyo
(56) References: JP-A-58-56076 (JP, A); JP-A-59-17664 (JP, A)

Claims

(57) [Claims]

1. A character reading method characterized in that: a character image is image-processed to cut out character lines or character columns; portions where characters exist are extracted from each character line or column; based on the average pitch between characters in that character line or column, portions where no character exists are cut out as one or more blank symbols; and, based on the degree of positional deviation of the blank symbols across the lines or columns, it is determined whether each blank is a blank between characters or a blank due to a paragraph, thereby enabling paragraph reading.