JPH0581477A

JPH0581477A - Character segmenting method

Info

Publication number: JPH0581477A
Application number: JP3237928A
Authority: JP
Inventors: Takeshi Furuto; 健古戸
Original assignee: Sumitomo Electric Industries Ltd
Current assignee: Sumitomo Electric Industries Ltd
Priority date: 1991-09-18
Filing date: 1991-09-18
Publication date: 1993-04-02

Abstract

PURPOSE: To perform correct segmentation at all times, even for a document image of indeterminate character size that contains separated characters or contact characters. CONSTITUTION: A temporary character whose size in the line direction is larger than the estimated character size is detected as a contact character; it is decided whether the character preceding the contact character is a separated character or a non-separated character; and the segmentation head position of the contact character is determined accordingly. Further, for the temporary character string thus obtained, the widths of the preceding and following characters are referred to in order to decide whether part of a separated character has been segmented independently, and temporary characters are coupled to one another according to the decision result.

Description

Detailed Description of the Invention

【０００１】[0001]

[Field of Industrial Application] The present invention relates to a character segmentation method for a character recognition device (OCR) that optically reads a document of irregular pitch and converts the image data of the characters contained in the document into character codes. The method can correctly cut out the characters one at a time even when the document contains separated characters or characters that touch an adjacent character.

【０００２】[0002]

[Prior Art] In OCR, converting image data into character codes requires cutting out the image data on the page one character at a time before performing character recognition. One known character segmentation technique first divides a character string with cutout candidate lines. Then, using an average pitch estimated from the shape of the character string (the pitch being the interval from one character to the next), it searches for a pair of candidate lines whose spacing is approximately equal to that average pitch and cuts out the non-separated figure contained between the pair as one character (see Japanese Examined Patent Publication No. Hei 2-23905).

[0003] Furthermore, an error correction technique is also known in which, for a figure contained in an uncut portion adjacent to an already cut-out region, the distances to the preceding and following cut-out regions are measured and the figure is merged into one of those regions (see the same publication). This error correction technique is explained with reference to FIG. 8 (identical to FIG. 6 of that publication). For example, suppose that in cutting out the vertically written characters "東京", "東" was cut out correctly but, for "京", only

【０００４】[0004]

【外１】 [Outside 1]

[0005] was cut out as one character. That is, suppose "京" has been split into

【０００６】[0006]

【外２】 [Outside 2]

[0007] and

【０００８】[0008]

【外３】 [Outside 3]

[0009] as separate pieces. In such a case, the distances p_2 and p_3 from the figure Q sandwiched between the cut-out regions to the upper and lower cut-out regions are measured first. If p_2 is smaller than a fixed value ρ_7, the figure Q is merged into the upper region; if p_3 is smaller than the fixed value ρ_7, the figure Q is merged into the lower region. If p_2 and p_3 do not satisfy the prescribed conditions, reject processing is performed by clearing the cut-out-done flags and the cut-out position flags of the upper and lower regions.
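The prior-art correction rule can be sketched as follows. This is only an illustration, assuming that the lower-region test compares p_3 with ρ_7 and that the upper-region test is tried first; the function name and return labels are hypothetical and do not appear in the cited publication.

```python
def resolve_leftover(p2: float, p3: float, rho7: float) -> str:
    """Decide where a leftover figure Q goes (prior-art rule, sketched).

    p2: distance from Q to the upper (preceding) cut-out region
    p3: distance from Q to the lower (following) cut-out region
    rho7: fixed distance threshold
    """
    if p2 < rho7:
        return "upper"   # merge Q into the upper region
    if p3 < rho7:
        return "lower"   # merge Q into the lower region
    return "reject"      # clear the flags and reject


# The rule looks only at distances, so a figure that sits close to its
# predecessor is merged into it regardless of what the figure actually is.
print(resolve_leftover(p2=2, p3=9, rho7=5))
```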

[0010] Processing in this way is said to allow effective and accurate character segmentation even for character strings whose character pitch is not given, character strings in which the characters are equal in size but are not arranged at a constant pitch, and character strings containing symbols whose pitch differs completely from that of the preceding and following characters.

【００１１】[0011]

[Problem to Be Solved by the Invention] However, Japanese documents contain both separated characters and non-separated characters. (A separated character is a character, peculiar to kanji and kana, that is divided into left and right parts or upper and lower parts. In horizontal writing, characters such as "川" and "い" are examples; in vertical writing, characters such as "京" are examples.) When, in addition, a separated character touches a non-separated character, the segmentation method described above cannot by itself cut out the characters accurately.

[0012] For example, when the character string "さて、準備が整ったところ" is cut out with the above method, the "、" portion remains as a figure sandwiched between cut-out regions. As shown in FIG. 9, let p_2 be the distance between "て" and "、", and p_3 the distance between "、" and "準". If p_2 < ρ_7 < p_3, then "、" is merged with "て", producing the erroneous cutout "て、".

[0013] Moreover, even if "準備" is cut out correctly, consider the small left-hand part of the following "が整",

【００１４】[0014]

【外４】 [Outside 4]

[0015] If the gaps before and after it, p_5 and p_6, satisfy p_5 < ρ_7 < p_6 (see FIG. 10), that part is merged with "備" to give

【００１６】[0016]

【外５】 [Outside 5]

[0017] and the cutout is wrong. Alternatively,

【００１８】[0018]

【外６】 [Outside 6]

[0019] may be cut out as a single character. Accordingly, an object of the present invention is to solve the above technical problem and to provide a character segmentation method that can accurately cut out the individual characters needed for character recognition from a document in which separated and non-separated characters are mixed and, moreover, touch one another, even when the character size is unknown or varies widely.

【００２０】[0020]

[Means for Solving the Problem] To achieve the above object, the character segmentation method of claim 1 comprises: estimating a character size W from the shape of a character string; obtaining a projection of the character string image data and cutting out each cluster of pixels in the projection as a temporary character; for a temporary character C_i whose width is larger than the character size W, determining whether the temporary character C_{i-1} immediately preceding C_i is a non-separated character or part of a separated character, based at least on the width of C_{i-1} and the gaps before and after C_{i-1}; correcting the cut-out position of C_i according to this determination, and cutting at or near the position one character size W after the corrected cut-out position; and, for the temporary character string obtained by the above procedure, determining whether a temporary character C_j is part of a separated character with respect to C_j and the adjacent temporary character C_{j-1} immediately before it or C_{j+1} immediately after it, based on the width of C_{j-1} and the gap between C_{j-1} and C_j, or on the width of C_{j+1} and the gap between C_j and C_{j+1}, and merging temporary characters according to the result.

[0021] The character segmentation method of claim 2 differs from the method of claim 1 only in that, for a temporary character C_i regarded as a contact character, it is determined whether the temporary character C_{i+1} immediately following C_i is a non-separated character or part of a separated character, and the cut is made at or near the position one character size W before the corrected cut-out position.

【００２２】[0022]

[Operation] FIG. 1 is a block diagram explaining the method of the invention of claim 1. First, the character size W is estimated. For example, the character string image data is projected in the line direction, the line heights are obtained from the projection, and the average or the maximum of the line heights over all lines is taken (steps N1 and N2). The line height can be treated as equal to the character size because Japanese characters are approximately square.
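Steps N1 and N2 can be sketched as below. This is a minimal illustration, assuming the page is a binary image stored as a list of rows with 1 for black and 0 for white, and that text lines are separated by at least one all-white row; the function names are hypothetical.

```python
def line_heights(image):
    """Heights of consecutive runs of rows that contain a black pixel."""
    heights, run = [], 0
    for row in image:
        if any(row):          # row has ink: extend the current text line
            run += 1
        elif run:             # blank row closes the current text line
            heights.append(run)
            run = 0
    if run:                   # text line touching the bottom edge
        heights.append(run)
    return heights


def estimate_char_size(image, use_max=True):
    """Estimate the character size W as the max (or mean) line height."""
    heights = line_heights(image)
    if not heights:
        raise ValueError("no text lines found")
    return max(heights) if use_max else sum(heights) / len(heights)


page = [
    [0, 0, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
]
print(line_heights(page), estimate_char_size(page))
```

Taking the maximum rather than the mean makes the estimate less sensitive to lines that contain only small glyphs, which matches the choice of W_max in the embodiment.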

[0023] Next, the character string image data is projected in the direction perpendicular to the line, provisional cut-out positions are determined from the projection, and each cluster of pixels in the projection is extracted as a temporary character (step N3). The width of each temporary character is then examined (step N4), and a temporary character C_i whose width exceeds the character size W is regarded as a contact character (a character that touches another figure and has therefore been cut out together with that figure). Whether the temporary character C_{i-1} adjacent before this contact character is a non-separated character or part of a separated character is determined based at least on the width of C_{i-1} and the gaps before and after C_{i-1} (step N5). If it is part of a separated character, the cut-out position of C_i is corrected according to this determination, and the contact character is correctly separated by cutting, for example, at a local minimum of the projection of the character string image data near the position one character size W after the corrected cut-out position (step N6).
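Steps N3 and N4 can be sketched as follows. The sketch assumes a single horizontally written text line stored as a list of pixel rows (1 = black), so that the projection perpendicular to the line is a per-column count; all names are hypothetical.

```python
def column_projection(line_image):
    """Number of black pixels in each column of one text line."""
    return [sum(col) for col in zip(*line_image)]


def temporary_characters(line_image):
    """Column spans (start, end), end exclusive, of each black-pixel cluster."""
    spans, start = [], None
    proj = column_projection(line_image)
    for x, value in enumerate(proj):
        if value > 0 and start is None:
            start = x                     # a cluster begins
        elif value == 0 and start is not None:
            spans.append((start, x))      # a cluster ends at a white column
            start = None
    if start is not None:                 # cluster touching the right edge
        spans.append((start, len(proj)))
    return spans


def contact_candidates(spans, w):
    """Indices of temporary characters wider than the character size w."""
    return [i for i, (s, e) in enumerate(spans) if e - s > w]


line = [
    [1, 1, 0, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 1, 0, 0, 1, 0, 0],
]
spans = temporary_characters(line)
print(spans, contact_candidates(spans, w=3))
```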

[0024] Next, for the temporary character string obtained by the above procedure, it is determined whether each temporary character C_j is part of a separated character. For example, if C_j and the temporary character C_{j-1} immediately before it have been separated in error, this is determined from the width of C_{j-1} and the gap between C_{j-1} and C_j, and the temporary characters are merged according to the result (step N7).

[0025] The invention of claim 2 is a method applied when characters are cut out from the end of a line. It differs from the method of claim 1 only in that, for a temporary character C_i regarded as a contact character, it is determined whether the temporary character C_{i+1} immediately following C_i is a non-separated character or part of a separated character.

【００２６】[0026]

[Embodiments] Embodiments are described in detail below with reference to the accompanying drawings. FIG. 2 is a block diagram showing the basic configuration of an OCR that implements the character segmentation method of the present invention. This OCR includes a scanner 4 which houses, in a hand-held casing, a light source 2 that illuminates a reading surface 1 bearing the characters to be read, a two-dimensional image sensor 6 that converts the optical image formed on its light receiving surface 6a by light reflected from the reading surface into an electric signal, and other components.

[0027] When the scanner 4 is placed against the reading surface 1, light reflected from the reading surface 1 is reflected by a reflecting mirror 3 and guided through a lens system 5 to the light receiving surface 6a of the image sensor 6, where it forms an optical image of the characters. The image sensor 6 converts this optical image into an electric signal and supplies it to a binarization circuit 7. The binarization circuit 7 discriminates the output signal of the image sensor 6 against a suitable threshold and converts it into a binary signal of "1" or "0".

[0028] The output signal of the binarization circuit 7 is supplied to a main processing unit 8 through a cord 9. The main processing unit 8 temporarily stores the signal from the cord 9 in an image memory 81. Under predetermined processing by a CPU, a one-character cutout circuit 82 cuts characters out of the image stored in the image memory 81, and a character recognition circuit 83 performs character recognition on each image thus cut out. The recognition is performed by, for example, computing the difference from the feature values of character patterns stored in advance in a recognition dictionary. The recognition result is supplied to an editing unit 84, which outputs the result if it is correct and, if it is erroneous, ignores it and starts the next reading.

[0029] Of the above series of processes, the character cutout processing performed by the one-character cutout circuit 82 is described in more detail with reference to flowcharts. The processing is broadly divided into contact character detection with determination of the immediately preceding temporary character, contact character separation, and separated character merging. Contact character detection, the preceding-character determination, and contact character separation follow the procedure of FIG. 3. First, the image data stored in the image memory 81 is projected in the line direction, and each portion delimited by white pixels is taken as one line. The maximum of the line heights is obtained and taken as the character size W_max.

[0030] Next, each line is projected in the direction perpendicular to the line direction, and the set of black pixel clusters is cut out as a temporary character string. Within this temporary character string, any temporary character whose width W in the line direction exceeds the character size W_max is regarded as a contact character (step S1). Taking FIG. 4 as an example, in "さて、準備が整ったところ", "準備" and

【００３１】[0031]

【外７】 [Outside 7]

[0032] are contact characters. A temporary character regarded as a contact character is denoted C_i, and the temporary character immediately before it in the same line is denoted C_{i-1}. It is then determined whether C_{i-1} is a full-width non-separated character (step S2), a half-width character (step S3), or the second half of a separated character (step S4). The determination is based on the following values:
the width W_{i-1} of C_{i-1}, and the gap S_{i-1} between C_{i-1} and the next temporary character C_i;
the width W_{i-2} of C_{i-2}, the temporary character one before C_{i-1}, and the gap S_{i-2} between C_{i-2} and the next temporary character C_{i-1};
the width W_{i-3} of C_{i-3}, the temporary character two before C_{i-1}, and the gap S_{i-3} between C_{i-3} and the next temporary character C_{i-2}.
First, C_{i-1} is judged to be a full-width non-separated character if its width W_{i-1} in the line direction satisfies
    W_{i-1} > W_max × P1    (1)
Here P1 is the parameter for the full-width non-separated determination and is set to a number smaller than 1 and close to 1, for example 0.72.

[0033] If
    W_{i-1} ≤ W_max × P1    (2)
then C_{i-1} is judged not to be a full-width non-separated character, and it is next determined whether it is a half-width character. It is judged to be half-width if any one of the following conditions holds:
    W_{i-1} + S_{i-1} > W_max × P2    (3)
    S_{i-1} > W_{i-1} × P3 and S_{i-1} > W_max × P4    (4)
    W_max × P5 < S_{i-2} + W_{i-1} + S_{i-1} < W_max × P6    (5)
Parameter P2 decides whether the sum of the width of C_{i-1} and the width S_{i-1} of the blank immediately after it is approximately one character size; it is set to a number smaller than 1 and close to 1, for example 0.83. Parameter P3 decides whether the blank width S_{i-1} immediately after C_{i-1} is approximately equal to the width W_{i-1} of the temporary character, as is expected of a half-width character; it is set to a number smaller than 1 and close to 1, for example 0.85. Parameter P4 decides whether the blank width S_{i-1} immediately after C_{i-1} is not extremely small compared with the character size W_max, as is also expected of a half-width character; it is set to 0.18, for example. Parameters P5 and P6 decide whether the sum of the width W_{i-1} of C_{i-1} and the blank widths S_{i-2} and S_{i-1} immediately before and after it is approximately one character size; P5 is set to a number smaller than 1 and close to 1, for example 0.85, and P6 to a number larger than 1 and close to 1, for example 1.2.
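The half-width test of expressions (3) to (5) can be written as a single predicate. The sketch below uses the example parameter values from the text as defaults; the function and argument names are hypothetical, and it assumes the full-width test (1) has already failed.

```python
def is_half_width(w_prev, s_prev, s_before, w_max,
                  p2=0.83, p3=0.85, p4=0.18, p5=0.85, p6=1.2):
    """Half-width test for the temporary character C_{i-1}.

    w_prev:   width W_{i-1} of C_{i-1}
    s_prev:   gap S_{i-1} between C_{i-1} and C_i
    s_before: gap S_{i-2} between C_{i-2} and C_{i-1}
    w_max:    estimated character size W_max
    """
    cond3 = w_prev + s_prev > w_max * p2                             # (3)
    cond4 = s_prev > w_prev * p3 and s_prev > w_max * p4             # (4)
    cond5 = w_max * p5 < s_before + w_prev + s_prev < w_max * p6     # (5)
    return cond3 or cond4 or cond5


# A narrow mark followed by a wide blank (as with a comma) passes via (4).
print(is_half_width(w_prev=20, s_prev=30, s_before=5, w_max=100))
# A mid-width fragment hugging its neighbour fails all three tests.
print(is_half_width(w_prev=40, s_prev=5, s_before=5, w_max=100))
```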

[0034] FIG. 5 shows an example decided by expression (4). In FIG. 5, the width W_2 of "、", which is the temporary character C_2 immediately before the contact character C_3, and the width S_2 of the blank immediately after it are substituted into expression (4). Since S_2 > 0.85 W_2 and S_2 > 0.18 W_max both hold, C_2 is found to be a half-width character.

[0035] If none of expressions (3), (4), and (5) is satisfied, C_{i-1} cannot be judged to be a half-width character, so it is next determined whether it is the second half of a separated character. This is judged by whether the following three expressions hold simultaneously:
    W_max × P7 < W_{i-2} + S_{i-2} + W_{i-1} ≤ W_max × P8    (6)
    W_{i-1} > W_max × P9    (7)
    W_{i-3} + S_{i-3} + W_{i-2} > W_max    (8)

[0036] Here, parameters P7 and P8 decide whether the total width of C_{i-2} (the temporary character one before C_{i-1}), the blank immediately after it, and C_{i-1} is approximately one character size; P7 is set to a number smaller than 1 and close to 1, for example 0.68, and P8 to a number close to 1, for example 1.0. Parameter P9 decides whether C_{i-1} is wide enough to be the second half of a separated character, and is set to 0.18, for example. Expression (8) decides whether the total width of C_{i-3} (the temporary character two before C_{i-1}), the blank after it, and the following temporary character C_{i-2} is approximately one character size.
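Expressions (6) to (8) can be sketched in the same style. The defaults are the example values from the text, with the upper bound of (6) taken as P8 = 1.0; the function and argument names are hypothetical.

```python
def is_second_half(w1, w2, w3, s2, s3, w_max, p7=0.68, p8=1.0, p9=0.18):
    """Test whether C_{i-1} is the second half of a separated character.

    w1, w2, w3: widths W_{i-1}, W_{i-2}, W_{i-3}
    s2: gap S_{i-2} between C_{i-2} and C_{i-1}
    s3: gap S_{i-3} between C_{i-3} and C_{i-2}
    """
    cond6 = w_max * p7 < w2 + s2 + w1 <= w_max * p8   # (6) pair spans ~1 char
    cond7 = w1 > w_max * p9                           # (7) wide enough
    cond8 = w3 + s3 + w2 > w_max                      # (8) C_{i-2} does not
                                                      #     pair with C_{i-3}
    return cond6 and cond7 and cond8


# Two 40-wide halves 10 apart, preceded by a full-size character.
print(is_second_half(w1=40, w2=40, w3=90, s2=10, s3=10, w_max=100))
```

Read this way, expression (8) guards against pairing C_{i-2} with C_{i-1} when C_{i-2} would fit just as well with the character before it; that reading is an interpretation, not something the text states.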

[0037] Based on the above criteria, it can be determined whether the immediately preceding temporary character C_{i-1} is a full-width non-separated character, a half-width character, or the second half of a separated character. If steps S2, S3, and S4 all yield NO, C_{i-1} is the first half of a separated character and should be merged with the temporary character C_i. In the example of FIG. 5, consider the temporary character C_6′,

【００３８】[0038]

【外８】 [Outside 8]

[0039] and the temporary character C_5′ immediately before it,

【００４０】[0040]

【外９】 [Outside 9]

[0041] which meets none of the conditions of steps S2, S3, and S4; C_5′ is therefore the first half of a separated character. Next, the contact character is separated. For this, the cut-out start position is determined (steps S5 and S6 in FIG. 3). If C_{i-1} is a full-width non-separated character, a half-width character, or the second half of a separated character, cutting starts from the head of C_i (step S5). In the example of FIG. 5, C_2 is a half-width character, so cutting starts from the head A of C_3.

[0042] If C_{i-1} is the first half of a separated character, cutting starts from the head of the immediately preceding temporary character C_{i-1} (step S6). In the example of FIG. 5, C_5′ is the first half of a separated character, so the cut-out start is corrected to the head B of C_5′. The cut position is sought in the range from W_max × P10 to W_max, measured from the cut-out start position. Parameter P10 is set to 0.9, for example.
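Steps S5 and S6 together with the search range can be sketched as below. The classification labels and the function name are hypothetical, positions are assumed to be integer pixel columns, and the lower bound is rounded to the nearest pixel.

```python
def cut_window(prev_kind, prev_start, cur_start, w_max, p10=0.9):
    """Return (cut_start, range_low, range_high) for a contact character C_i.

    prev_kind:  classification of C_{i-1}: "full_width", "half_width",
                "second_half" or "first_half" (labels are illustrative)
    prev_start: column where C_{i-1} begins
    cur_start:  column where C_i begins
    """
    # Step S6: a first half belongs with C_i, so start from C_{i-1}.
    # Step S5: otherwise start from the head of C_i itself.
    start = prev_start if prev_kind == "first_half" else cur_start
    low = start + int(round(w_max * p10))
    high = start + w_max
    return start, low, high


print(cut_window("half_width", prev_start=10, cur_start=50, w_max=100))
print(cut_window("first_half", prev_start=10, cur_start=50, w_max=100))
```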

[0043] Once the range of cut positions has been determined as above, the actual cut position is fixed uniquely. For example, the position where the projection value of black pixels is smallest may be chosen as the contact character separation position (step S7). The separation processing of step S7 is described in detail using the flowchart of FIG. 6. Let x_start be the position W_max × P10 from the cut-out start position, and x_end the position W_max from the cut-out start position. Let h_start be the black-pixel projection value at x_start, h_end the black-pixel projection value at x_end, and h_min the minimum black-pixel projection value. In step S71, the parameter x is set to x_start and the parameter h to h_start. In step S72, x is increased by Δx, and in step S74 h_min is compared with the black-pixel projection value h_x at this x. If h_min > h_x, h_x is taken as the new h_min in step S75 and x is recorded as x_ext in step S76. The process then returns to step S72, increases x by Δx again, and repeats the same processing. When x exceeds x_end in step S73, the black pixels at the position x_ext at that time are converted into white pixels and the processing ends (step S77).
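The loop of FIG. 6 amounts to a linear search for the smallest projection value inside the window. The following is a sketch under stated assumptions: the projection is a list of per-column black-pixel counts, h_min is initialised to h_start, ties keep the earliest position, and the window is clipped to the projection length. The function names are hypothetical.

```python
def find_split(projection, cut_start, w_max, p10=0.9, dx=1):
    """Return x_ext, the column with the smallest projection in the window."""
    x_start = cut_start + int(round(w_max * p10))
    x_end = min(cut_start + w_max, len(projection) - 1)
    if x_start > x_end:
        raise ValueError("search window lies outside the projection")
    x_ext, h_min = x_start, projection[x_start]   # S71: initialise
    x = x_start + dx                              # S72: advance
    while x <= x_end:                             # S73: stop past x_end
        if h_min > projection[x]:                 # S74: compare
            h_min, x_ext = projection[x], x       # S75, S76: record minimum
        x += dx
    return x_ext


def clear_column(image, x):
    """S77: turn the black pixels in column x white, splitting the cluster."""
    for row in image:
        row[x] = 0


proj = [5] * 9 + [4, 2, 3, 5, 5]
print(find_split(proj, cut_start=0, w_max=10))
```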

[0044] This processing determines the position at which the forced cut is actually made. In FIG. 5, for example, the minimum position x_ext within the range from W_max × P10 to W_max measured from the head A of the temporary character C_3 becomes a new cut position, and the minimum position x_ext′ within the range from W_max × P10 to W_max measured from the head B of the temporary character C_5′ becomes a new cut position.

[0045] The character string obtained in this way still contains parts of separated characters. Separated characters are therefore merged using the character size W_max. That is, attention is paid to a temporary character C_j whose width W_j in the line direction satisfies
    W_j < W_max × P11
and, if the gap S_j to the next temporary character C_{j+1} satisfies
    S_j < W_max × P12 and W_j + S_j + W_{j+1} ≤ W_max
then C_j and the next temporary character C_{j+1} are merged and determined to be a new character.

[0046] Likewise, if the gap S_{j-1} to the preceding temporary character C_{j-1} satisfies
    S_{j-1} < W_max × P12 and W_{j-1} + S_{j-1} + W_j ≤ W_max
then C_j and the preceding temporary character C_{j-1} are merged and determined to be a new character. Here, parameter P11 is a number smaller than 1 used to decide whether C_j is part of a separated character, and is set to 0.65, for example. Parameter P12 is a number used to decide whether the gap between C_j and the adjacent temporary character C_{j-1} or C_{j+1} is sufficiently small, and is set to 0.26, for example.
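The merging conditions are symmetric in the two directions, so one predicate can cover both. The sketch below uses the example values P11 = 0.65 and P12 = 0.26 as defaults; the function name and the sample widths are hypothetical.

```python
def should_merge(w_j, gap, w_neighbor, w_max, p11=0.65, p12=0.26):
    """Decide whether the narrow temporary character C_j merges with a neighbour.

    w_j:        width W_j of the character under consideration
    gap:        gap to the neighbour (S_j for C_{j+1}, S_{j-1} for C_{j-1})
    w_neighbor: width of that neighbour
    """
    is_narrow = w_j < w_max * p11                 # candidate fragment
    is_close = gap < w_max * p12                  # neighbour is near
    fits = w_j + gap + w_neighbor <= w_max        # result is at most 1 char
    return is_narrow and is_close and fits


# A comma next to a full-size character: close, but the pair is too wide.
print(should_merge(w_j=20, gap=10, w_neighbor=95, w_max=100))
# Two halves of one character: narrow, close, and within one character size.
print(should_merge(w_j=40, gap=10, w_neighbor=45, w_max=100))
```

The third condition is what separates this rule from a purely distance-based merge: a fragment is never attached to a neighbour if the combined width would exceed one character.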

[0047] FIG. 7 shows a specific example. In FIG. 7, consider the temporary character C_3, "、", and suppose the gap S_3 to the next temporary character C_4 satisfies S_3 < 0.26 W_max. Because W_3 + S_3 + W_4 is larger than W_max, however, C_3 ("、") and the next temporary character C_4 ("準") should not be merged.

[0048] to [0058] Next, in FIG. 7, consider the temporary character C6 ([External character 10]), and suppose that the distance S6 to the next temporary character C7 satisfies S6 < 0.26 Wmax and that W6 + S6 + W7 < Wmax. In this case, the temporary character C6 ([External character 11]) and the next temporary character C7 ([External character 12]) can be combined to form "が". Further, in FIG. 7, if attention is instead paid to the temporary character C7 ([External character 13]), the distance S6 from the preceding temporary character C6 satisfies S6 < 0.26 Wmax and W6 + S6 + W7 < Wmax, so combining it with C6 ([External character 14]) to form "が" gives the same result.

The above embodiment dealt with an example in which cutting out is performed from the front end of the line. When cutting out is performed from the rear end of the line, the situation is the same as viewing horizontally written characters from the back, and it can be processed by exactly the same procedure.
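The remark that cutting out from the rear end of a line is equivalent to viewing the characters from the back may be illustrated by the following minimal Python sketch. It assumes that temporary characters are given as (left, right) spans with an exclusive right end, and that some front-to-rear procedure is available as a function; the names `mirror`, `cut_from_rear_end` and `forward_pass` are hypothetical and do not appear in the original text.

```python
def mirror(spans, line_length):
    """Reflect (left, right) spans about the line, as if seen from the back.

    The order is reversed so that the mirrored spans again run front to rear.
    """
    return [(line_length - right, line_length - left)
            for (left, right) in reversed(spans)]


def cut_from_rear_end(spans, line_length, forward_pass):
    """Run a front-to-rear procedure on the mirrored line, then mirror back.

    forward_pass is any function taking and returning a list of spans
    (a hypothetical stand-in for the procedure described in the embodiment).
    """
    mirrored = mirror(spans, line_length)
    return mirror(forward_pass(mirrored), line_length)
```

Mirroring twice returns the original spans, so the wrapper changes only the direction in which the underlying procedure traverses the line.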

[0059]

[Effects of the Invention] As described above, according to the character cutout method of claim 1, a temporary character whose width in the line direction is larger than the estimated character size is detected as a contact character, and the leading cutout position of the contact character is determined after judging whether the character preceding the contact character is a separated character or a non-separated character. The contact character can therefore be separated accurately. In addition, for the temporary character string thus obtained, whether part of a separated character has been cut out independently is determined by referring to the widths and other properties of the preceding and following characters, and temporary characters can be combined with each other according to the result of this determination. Consequently, even for a document in which separated characters or contact characters are included in a document image of indefinite character size, correct cutout can always be performed.

[0060] Further, according to the character cutout method of claim 2, correct cutout can likewise always be performed even when cutting out is performed from the rear end of the line.

[Brief description of drawings]

FIG. 1 is a block diagram illustrating the procedure of the present invention.

FIG. 2 is a block diagram showing the basic configuration of an OCR that implements the character cutout method of the present invention.

FIG. 3 is a flowchart showing an outline of the contact character detection, the determination processing for the immediately preceding temporary character, and the contact character separation processing.

FIG. 4 is a diagram showing a specific example of the contact character detection processing.

FIG. 5 is a diagram showing a specific example of the correction of the cutout position of a contact character and of the contact character separation processing.

FIG. 6 is a flowchart showing the procedure for determining the separation position of a contact character.

FIG. 7 is a diagram showing a specific example explaining the processing for integrating parts of separated characters with each other.

FIG. 8 is a diagram showing a specific example explaining the method of integrating an uncut portion Q in the conventional character cutout method.

FIG. 9 is a diagram showing a specific example explaining a problem of the conventional character cutout method.

FIG. 10 is a diagram showing a specific example explaining a problem of the conventional character cutout method.

[Explanation of symbols]

1 Reading surface
4 Scanner
8 Main processing unit
81 Image memory
82 One-character cutout circuit

Claims

[Claims]

1. A character cutout method in which an image signal representing a read object including characters is acquired, and image data for each character to be recognized is cut out from character string image data corresponding to a character string included in the acquired image signal, the method being characterized by adopting the following steps (a) to (d).
(a) A character size W is estimated from the shape of the character string.
(b) A projection image of the character string image data is obtained, and each block of pixels forming the projection image is cut out as a temporary character.
(c) For a temporary character Ci whose width is larger than the character size W, whether the temporary character Ci-1 immediately preceding and adjacent to the temporary character Ci is a non-separated character or part of a separated character is determined based on at least the width of the temporary character Ci-1 and the distances between the temporary character Ci-1 and the temporary characters before and after it; the cutout position of the temporary character Ci is corrected according to the result of this determination; and a cut is made at a position following the corrected cutout position by the character size W, or in the vicinity thereof.
(d) For a temporary character Cj of the temporary character string obtained as a result of the cutting in step (c), whether the temporary character Cj is part of a separated character is determined, with respect to the temporary character Cj and the temporary character Cj-1 or Cj+1 adjacent immediately before or after it, based on the width of the temporary character Cj-1 and the distance between the temporary characters Cj-1 and Cj, and on the width of the temporary character Cj+1 and the distance between the temporary characters Cj and Cj+1; and temporary characters are combined as necessary according to the result of this determination.

2. A character cutout method in which an image signal representing a read object including characters is acquired, and image data for each character to be recognized is cut out from character string image data corresponding to a character string included in the acquired image signal, the method being characterized by adopting the following steps (a) to (d).
(a) A character size W is estimated from the shape of the character string.
(b) A projection image of the character string image data is obtained, and each block of pixels forming the projection image is cut out as a temporary character.
(c) For a temporary character Ci whose width is larger than the character size W, whether the temporary character Ci+1 immediately following and adjacent to the temporary character Ci is a non-separated character or part of a separated character is determined based on at least the width of the temporary character Ci+1 and the distances between the temporary character Ci+1 and the temporary characters before and after it; the cutout position of the temporary character Ci is corrected according to the result of this determination; and a cut is made at a position preceding the corrected cutout position by the character size W, or in the vicinity thereof.
(d) For a temporary character Cj of the temporary character string obtained as a result of the cutting in step (c), whether the temporary character Cj is part of a separated character is determined, with respect to the temporary character Cj and the temporary character Cj-1 or Cj+1 adjacent immediately before or after it, based on the width of the temporary character Cj-1 and the distance between the temporary characters Cj-1 and Cj, and on the width of the temporary character Cj+1 and the distance between the temporary characters Cj and Cj+1; and temporary characters are combined as necessary according to the result of this determination.
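
For illustration only, step (b) and the width test at the start of step (c) may be sketched in Python as follows. The sketch assumes a binarized line image given as a list of rows of 0/1 pixels with the line running horizontally, and an already estimated character size W; the function names `temporary_characters` and `contact_candidates` are hypothetical, and the sketch covers neither the estimation of W in step (a) nor the cutout-position correction of step (c).

```python
def temporary_characters(image):
    """Cut blocks of the column projection out as (left, right) spans.

    image: list of rows, each a list of 0/1 pixels (assumed representation).
    The right end of each span is exclusive.
    """
    if not image:
        return []
    width = len(image[0])
    # Projection onto the line direction: black-pixel count per column.
    profile = [sum(row[x] for row in image) for x in range(width)]
    spans = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                  # a block of the projection begins
        elif count == 0 and start is not None:
            spans.append((start, x))   # the block ends at a blank column
            start = None
    if start is not None:
        spans.append((start, width))   # block touching the end of the line
    return spans


def contact_candidates(spans, w):
    """Indices of temporary characters wider than the character size W."""
    return [i for i, (left, right) in enumerate(spans) if right - left > w]
```

Each index returned by `contact_candidates` identifies a temporary character Ci to which the determination of step (c) would then be applied.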