JPH02255995A

JPH02255995A - Character segmenting method

Info

Publication number: JPH02255995A
Application number: JP1014416A
Authority: JP
Inventors: Mikio Aoki; 三喜男青木
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1988-04-28
Filing date: 1989-01-24
Publication date: 1990-10-16
Anticipated expiration: 2012-01-08
Also published as: JP2570415B2

Abstract

PURPOSE: To segment characters accurately from a character image of undetermined form by finding a segmenting position from the values of the peripheral distribution only when no candidate segmenting position exists at the time of character extraction. CONSTITUTION: A CPU 101 segments characters one by one from a character image that has been read into a RAM 105 by an image input device 102. First, the standard character height and the line thickness are estimated from the peripheral distribution in the row direction. Next, the standard character interval, the word interval and the standard character width are estimated, and words are extracted. Candidate character segmenting positions are then estimated. To segment the characters in an extracted word, the contour of each connected component is extracted and, at the same time, the height and width of the character are obtained. When the width of the character exceeds the allowable size, the contour is extracted again. When no candidate character segmenting position exists, a character segmenting range is determined from the peripheral distribution in the direction perpendicular to the row direction, and the contour is extracted again. Only the inside of the region enclosed by the contour is then extracted, and the character is segmented. In this way even a character image of undetermined form, such as one in which the peripheral distributions of adjacent characters overlap, can be segmented accurately.

Description

[Detailed description of the invention]

[Field of industrial application]

The present invention relates to a character segmentation method for a character recognition device that takes characters written on paper as an image input, locates the character regions in that image, and converts the characters into code numbers.

[Conventional technology]

In recent years, rapid advances in character recognition devices have made it possible to extract character regions automatically from a wide variety of document images, segment and recognize the individual characters, and create a character file automatically. Various character segmentation methods have been devised.

One widely used method, for example, counts the peripheral distribution of an extracted character line in the vertical direction.

For example, counting the vertical peripheral distribution of an extracted character line such as the one in FIG. 4(a) yields a distribution such as the one in FIG. 4(b). (FIG. 4(b) shows only whether or not character image is present.) The values of this distribution reveal where the characters are located, and the characters were segmented one at a time on that basis. When the peripheral distributions of characters in the extracted line overlap, as in FIG. 4(c), and the overlapping run was judged to contain several characters, the break between characters was estimated from the character pitch and the characters were cut apart forcibly.

[Problem to be solved by the invention]

However, although segmentation by this method is accurate when the target character image has a fixed pitch, as in FIG. 4(a), it cannot segment characters accurately in character images such as those shown in FIG. 4(c), FIG. 5(a) and FIG. 11(a).

The character images in FIG. 4(c), FIG. 5(a) and FIG. 11(a) all consist of proportionally spaced characters with no fixed pitch. The character positions and the character pitch therefore cannot be estimated from the vertical peripheral distribution alone, as the conventional method attempts to do.

Consider an attempt to segment FIG. 4(c) by the conventional method. FIG. 4(c) is a character image in the same font as FIG. 4(a). Whereas FIG. 4(a) has a constant character pitch, FIG. 4(c) is proportionally spaced with no fixed pitch. The vertical peripheral distribution of this character image, FIG. 4(d), is nearly identical over most of its length to FIG. 4(b), the peripheral distribution of FIG. 4(a), but differs in the part corresponding to "Yo".

In FIG. 4(d) the peripheral distribution of Y overlaps that of o. From this distribution, Y and o are therefore either judged to be a single character or separated at a place that differs from the true boundary. Suppose the run is judged to be two characters and is forcibly divided in two. Going by the character pitch, the character image is split at 1201 in FIG. 12; going by the shape of the peripheral distribution, it is split at 1202. At either position, 1201 or 1202, the cut passes through the middle of Y or o, part of one character is extracted together with the other, and accurate extraction is impossible.

Next, consider segmenting the character image shown in FIG. 5(a) by the conventional method. FIG. 5(a) is the word "million", which consists of seven characters. When the vertical peripheral distribution of this word is counted, the peripheral distributions 501, 502, 503, 504 and 505 of the five characters m, i, l, l, i run together into a single block, as shown in FIG. 5(b).

The breaks between characters therefore cannot easily be estimated from this peripheral distribution. If segmentation goes by the character pitch, characters such as i and l are about half the standard character width, so two of them are likely to be judged a single character and cut out together. If the cut is decided from the shape of the peripheral distribution, characters such as i and l may be separated successfully, but the character m is very likely to be split apart, so the reliability of the segmentation is extremely low.

Likewise, for the character image shown in FIG. 11(a), as for FIG. 5, accurate segmentation is impossible from the shape of the peripheral distribution alone.

The present invention solves the problems described above. Its object is to provide a method that accurately segments the individual characters from character images in which adjacent characters touch, character images in which peripheral distributions overlap, and character images in which the character pitch is not constant.

[Means for solving the problem]

The character segmentation method of the present invention is used in a character recognition device that reads a character image written on paper or the like with optical image input means, recognizes the characters in the input image data, and replaces them with code numbers. The method is characterized by the following steps:

(1) estimating the standard character height and the line thickness of the characters from the peripheral distribution in the row direction;
(2) estimating the standard character spacing, the word spacing and the standard character width from the peripheral distribution in the direction perpendicular to the row direction, and extracting words;
(3) estimating character segmentation candidate positions from that peripheral distribution;
(4) segmenting the characters in an extracted word by extracting the contours of the connected components of the characters while also obtaining the character height and character width;
(5) when the character width exceeds the allowable size relative to the standard character width, extracting the contour again within the range of the character segmentation candidate positions;
(6) when no character segmentation candidate position exists, determining the segmentation range from the peripheral distribution in the direction perpendicular to the row direction and extracting the contour again; and
(7) segmenting the character by extracting only the inside of the region enclosed by the contour.

The standard character height is obtained from the shape of the peripheral distribution in the row direction: the width of the part where the peripheral distribution changes abruptly and becomes large is taken as the standard character height, and the minimum line width of the character strokes is estimated from the standard character height.

The standard character spacing and the word spacing are estimated by taking statistics of the sizes of the parts of the peripheral distribution, in the direction perpendicular to the row direction, where no character is present.

The standard character width is estimated by taking statistics of the sizes of the parts of the peripheral distribution, in the direction perpendicular to the row direction, where the value exceeds the minimum line width.

In estimating the standard character width, only the region of those statistics above 75% of the standard character height is considered, and the maximum value of the cluster nearest the standard character height is used.

Clusters in the statistics are classified by treating two clusters as the same cluster when the distance between them is no more than a certain value proportional to the standard character height.

Word positions are extracted by comparing the standard character spacing and the word spacing with the peripheral distribution in the direction perpendicular to the row direction.

The center of each part of the peripheral distribution, in the direction perpendicular to the row direction, where the value is smaller than the minimum line width is taken as a character segmentation candidate position.

When an extracted component is judged from its width to be a run of connected characters, it is cut preferentially at the character segmentation candidate positions.

When a segmentation position is estimated from the values of the peripheral distribution, the point where the peripheral distribution is smallest is sought near 1/2 character width and near 1 character width.
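The search described above can be sketched as follows. This is only an illustrative reading, not the patent's implementation: the function name `find_cut_by_projection`, the `tolerance` parameter that defines "near", and the choice to return the single lowest column across both search windows are all assumptions.

```python
def find_cut_by_projection(profile, std_width, tolerance=0.25):
    """Return the column with the smallest projection value near 1/2 or 1 standard width.

    profile   -- column counts for one run of touching characters (index 0 = left edge)
    std_width -- standard character width in pixels
    tolerance -- hypothetical half-width of each search window, as a fraction of std_width
    """
    window = max(1, int(std_width * tolerance))
    best = None
    for centre in (std_width // 2, std_width):
        lo = max(1, centre - window)                      # never cut at the left edge
        hi = min(len(profile) - 1, centre + window + 1)   # never cut at the right edge
        for x in range(lo, hi):
            if best is None or profile[x] < profile[best]:
                best = x
    return best  # None when the run is too short to search


profile = [5, 6, 7, 6, 5, 4, 3, 1, 4, 6, 7, 6, 5, 4, 5, 6]
cut = find_cut_by_projection(profile, std_width=8)
```

With the sample profile, the lowest value inside the two windows lies at column 7, so that column is returned as the cut position.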

The region enclosed by a character's contour is extracted by preparing an image area of the same size as the original image, drawing the character's contour in that area, filling the inside of the contour, and then taking the intersection with the original image, so that only the target character is extracted.
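A minimal sketch of this masking step is given below, assuming a binary image stored as nested lists (1 = ink) and a closed, 8-connected contour supplied as a list of (row, column) pixels. The function name `extract_inside_contour` is hypothetical, and filling is done here by flood-filling the outside from the image border, which is one of several possible ways to "fill the inside".

```python
from collections import deque


def extract_inside_contour(image, contour):
    """Keep only the ink of `image` that lies on or inside `contour`."""
    rows, cols = len(image), len(image[0])
    # Blank work area of the same size as the original, with the contour drawn in.
    mask = [[0] * cols for _ in range(rows)]
    for r, c in contour:
        mask[r][c] = 1
    # Mark everything reachable from the border without crossing the contour.
    outside = [[False] * cols for _ in range(rows)]
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and mask[r][c] == 0:
                outside[r][c] = True
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # 4-connected background
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                if not outside[nr][nc] and mask[nr][nc] == 0:
                    outside[nr][nc] = True
                    queue.append((nr, nc))
    # Intersection of the filled region (everything not outside) with the original.
    return [[1 if image[r][c] and not outside[r][c] else 0 for c in range(cols)]
            for r in range(rows)]


image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
contour = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2), (3, 3)]
result = extract_inside_contour(image, contour)
```

In the sample, the stroke in the last column belongs to a neighbouring character that shares the same rows; it falls outside the contour and is removed, while the 3x3 block is kept intact.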

[Example]

The present invention is described in detail below on the basis of an example.

As shown in the block diagram of FIG. 1, the character recognition device of which the character segmentation means of the present invention is a component consists of a CPU 101, an image input device 102, a recognized-character display device 103, a ROM 104 and a RAM 105.

The character segmentation method of the present invention, which cuts out the individual characters from a character image read into the RAM 105 by the image input device 102, is described in detail below with reference to the flowchart in FIG. 2.

Let FIG. 3(a) be the character image read into the RAM 105 by the image input device 102. In the present invention, the peripheral distribution in the row direction is counted first. Counting the row-direction peripheral distribution of the character image in FIG. 3(a) yields a distribution shaped like 301 in FIG. 3(b). Western characters are of three kinds: tall characters such as d in FIG. 3(a), short characters such as e, and characters such as y that are tall but sit low. The row-direction peripheral distribution of a character image made up of these three kinds therefore normally has a shape like 301 in FIG. 3(b). The peripheral distribution 301 is the union of three regions, 311, 312 and 313; depending on the character image, region 312 or region 313 may be absent. Region 311, however, is always present, and its width 304 corresponds to the height of the short characters (hereinafter called the standard character height). The standard character height 304 can therefore be found from the shape of the peripheral distribution.

In printed Western type, the relation "standard character height : character line thickness ≧ 16 : 1" normally holds between the standard character height and the thickness of the character strokes. Consequently, in the peripheral distribution of the character image perpendicular to the row direction, any part whose value is below 1/16 of the standard character height can be judged to be a serif ("whisker") of a character or a place where characters touch. This reference value, 1/16 of the standard character height, is computed here as the minimum line width.
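The first step can be sketched as below. The patent only says that the standard character height is the width of the band where the row-direction distribution "changes abruptly and becomes large"; the rule used here (the longest run of rows whose count is at least a fraction of the maximum) and the names `row_profile`, `standard_char_height`, `min_line_width` and `fraction` are assumptions made for illustration. The profile is assumed to be non-empty.

```python
def row_profile(image):
    """Row-direction peripheral distribution: ink pixels counted per pixel row."""
    return [sum(row) for row in image]


def standard_char_height(profile, fraction=0.5):
    """Return (top_row, height) of the longest run of rows at or above fraction * max.

    `fraction` is a hypothetical stand-in for "abruptly becomes large".
    """
    threshold = max(profile) * fraction
    best_start, best_len = 0, 0
    start = None
    for y, value in enumerate(list(profile) + [0]):  # trailing 0 closes the last run
        if value > 0 and value >= threshold:
            if start is None:
                start = y
        elif start is not None:
            if y - start > best_len:
                best_start, best_len = start, y - start
            start = None
    return best_start, best_len


def min_line_width(std_height):
    """Minimum line width, taken in the text as 1/16 of the standard character height."""
    return std_height / 16


profile = [0, 2, 3, 2, 10, 12, 11, 12, 10, 3, 2, 0]
top, height = standard_char_height(profile)
```

In the sample profile the low counts on either side play the role of regions 312 and 313 (ascenders and descenders), and the central band of five rows plays the role of region 311.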

The next step is to obtain the standard character spacing, the word spacing and the standard character width from the peripheral distribution perpendicular to the character line. First, data indicating whether or not character image is present are obtained by projecting in the direction perpendicular to the character line. If the character image is FIG. 4(a), these presence data, that is, the projection perpendicular to the character line, are as shown in FIG. 4(b). Region 401 is a part where character image is present; region 402 is a part where no character image is present, that is, a part corresponding to a character gap. Taking statistics of the parts 402 where no character is present gives FIG. 6(a). Likewise, taking statistics of the blank parts 402 in FIG. 4(d), the projection of the character image in FIG. 4(c), gives FIG. 6(b). The solid lines in FIG. 6(a) and FIG. 6(b) show the data for FIG. 4(b) and FIG. 4(d) respectively; in general, a histogram of character gaps follows the trend shown by the dotted lines in FIG. 6(a) and FIG. 6(b). Each of these two histograms can be divided into two clusters.

One of these clusters can be judged to be the cluster of character spacings and the other the cluster of word spacings. The standard character spacing and the word spacing can therefore be estimated from the character-gap statistics. A character-gap histogram generally looks like FIG. 6, but occasionally a histogram like FIG. 8 is obtained, which contains many clusters of data. When many clusters are present, they are classified as follows. Printed Western characters are normally laid out in a regular arrangement, so the character spacing should be nearly uniform, although it sometimes varies with the shapes of the characters. Even so, the spacings do not scatter by more than a certain value proportional to the standard character height. In the present invention, 1/16 of the standard character height is therefore used as the threshold 805 for the distance between clusters. Comparing the threshold 805 with the distances 801, 802, 803 and 804 between clusters 811, 812, 813, 814 and 815 shows that clusters 811 and 812 are one cluster and that clusters 813, 814 and 815 are one cluster, which allows the standard character spacing and the word spacing to be estimated.
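The gap statistics and the cluster merging can be sketched as follows. The helper names `blank_run_lengths` and `merge_clusters` are hypothetical, and measuring the "distance between clusters" as the difference between neighbouring sorted gap widths is an assumption; the 1/16 threshold is the one stated in the text.

```python
def blank_run_lengths(occupied):
    """Widths of the blank runs that lie between inked columns.

    occupied -- one truthy/falsy flag per pixel column (ink present or not).
    Leading and trailing blanks are ignored because they are margins, not gaps.
    """
    runs = []
    length = 0
    seen_ink = False
    for has_ink in occupied:
        if has_ink:
            if seen_ink and length > 0:
                runs.append(length)
            seen_ink = True
            length = 0
        elif seen_ink:
            length += 1
    return runs


def merge_clusters(widths, std_height):
    """Group gap widths; neighbours closer than std_height / 16 share a cluster."""
    limit = std_height / 16
    clusters = []
    for w in sorted(widths):
        if clusters and w - clusters[-1][-1] <= limit:
            clusters[-1].append(w)
        else:
            clusters.append([w])
    return clusters


gaps = blank_run_lengths([0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0])
clusters = merge_clusters([2, 3, 2, 4, 9, 10, 12], std_height=32)
```

With a standard character height of 32 the merge limit is 2 pixels, so the sample gap widths fall into two groups: a small-gap group that would be read as character spacing and a large-gap group that would be read as word spacing.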

Next, the standard character width of the character image is estimated. Once the standard character width is known, it serves as a cue for judging, during segmentation, whether a component is a run of connected characters and for estimating segmentation candidate positions. As with the standard character spacing and the word spacing, the standard character width is estimated by taking statistics of the character regions. If the character images are FIG. 4(a) and FIG. 4(c), the character-width statistics are as shown by the solid lines in FIG. 7(a) and FIG. 7(b), and in general data following the dotted-line trend are obtained. The data in FIG. 7(a) form a single cluster, so the maximum value of that cluster can be taken as the standard character width. The data in FIG. 7(b) contain two clusters, 702 and 703. In normally printed Western characters the standard character width is very close to the standard character height, so the maximum value of cluster 702, which lies near the standard character height 701, can be taken as the standard character width. As in the estimation of the character spacing, the histogram sometimes consists of many clusters. In FIG. 9, comparing the distances 901, 902 and 903 between clusters with the threshold 904 shows that clusters 911, 912 and 913 can be judged to be the same cluster.

Even after the many clusters have been classified with the inter-cluster threshold, three clusters may remain, as in FIG. 10. The statistics take the form of FIG. 10 because the character image contains characters such as i and l. In the present invention, so that the maximum value of the cluster 1001 of characters such as i and l is not taken as the standard character width, the maximum value of a cluster is sought only in the region above position 1004, which is 75% of the standard character height. This excludes the cluster of i and l, whose size is about half the standard character width, and makes it possible to estimate the standard character width accurately.

The character width of an ordinary character image can be estimated in this way, but some character images are like the one shown in FIG. 5(a). That image consists of the seven characters m, i, l, l, i, o, n, of which the five characters m, i, l, l, i touch one another. Simply counting the vertical peripheral distribution therefore does not allow the standard character width to be estimated. The present invention notes that, in printed Western characters, the stroke width is at least a certain value proportional to the standard character height, whereas the serifs at the base of the characters seen in FIG. 5(a) are lines thinner than that value. After the vertical peripheral distribution, FIG. 5(b), has been counted, a step is therefore provided that truncates the peripheral distribution at a value 508 proportional to the standard character height. In this step only the parts where the peripheral distribution exceeds the value 508 are judged to be character regions; displaying these character regions gives FIG. 5(c). Regions 511, 512, 513, 514, 515, 516 and 517 indicate the positions and widths of the characters m, i, l, l, i, o and n respectively. The seven regions 511 to 517 are somewhat narrower than the actual character widths, but they always contain the part of each character that carries the necessary information, so these regions can be judged to be character regions. Taking statistics of the character regions in FIG. 5(c) therefore makes it possible to estimate the standard character width.
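The width estimation described above can be sketched as follows, under stated assumptions: the function names `ink_run_widths` and `standard_char_width` are hypothetical, the truncation value 508 is taken to be the minimum line width (1/16 of the standard character height), and clusters are merged with the same 1/16 rule used for the gap statistics.

```python
def ink_run_widths(col_profile, cutoff):
    """Widths of the column runs whose projection value exceeds `cutoff`.

    Truncating at the cutoff drops thin serifs that bridge touching characters.
    """
    widths = []
    length = 0
    for value in list(col_profile) + [0]:  # trailing 0 closes the last run
        if value > cutoff:
            length += 1
        elif length:
            widths.append(length)
            length = 0
    return widths


def standard_char_width(widths, std_height):
    """Largest width in the cluster nearest std_height, ignoring narrow widths."""
    floor = 0.75 * std_height   # excludes narrow characters such as i and l
    limit = std_height / 16     # merge distance between neighbouring widths
    clusters = []
    for w in sorted(w for w in widths if w > floor):
        if clusters and w - clusters[-1][-1] <= limit:
            clusters[-1].append(w)
        else:
            clusters.append([w])
    if not clusters:
        return None
    nearest = min(clusters, key=lambda c: min(abs(w - std_height) for w in c))
    return max(nearest)


runs = ink_run_widths([0, 1, 5, 6, 1, 0, 4, 4, 4, 1], cutoff=1)
width = standard_char_width([7, 8, 15, 16, 17, 16, 30, 31], std_height=16)
```

In the sample, widths 7 and 8 (half-width characters) are discarded by the 75% floor, widths 30 and 31 (touching pairs) form a cluster far from the standard height, and the cluster around 16 supplies the estimate.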

Next, words are extracted from the character image. Comparing the standard character spacing and the word spacing with the peripheral distribution perpendicular to the character line (FIG. 4(b) when the character image is FIG. 4(a)) makes it possible to find the word gaps from the sizes of the parts where no character is present, and hence to extract the words from the character image.
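Word extraction from the presence data can be sketched as below. The name `extract_words` and the single `word_gap` threshold (the smallest blank width treated as a word gap, chosen somewhere between the estimated character spacing and word spacing) are assumptions; the patent does not state how the comparison is parameterised.

```python
def extract_words(occupied, word_gap):
    """Return (start, end) column spans of words, with `end` exclusive.

    occupied -- one truthy/falsy flag per pixel column (ink present or not)
    word_gap -- hypothetical threshold: blank runs this wide or wider split words
    """
    words = []
    start = None
    last_ink = None
    for x, has_ink in enumerate(occupied):
        if not has_ink:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= word_gap:
            words.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink + 1))
    return words


words = extract_words([0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0], word_gap=3)
```

The one-column blank in the sample is treated as a character gap and stays inside the first word, while the four-column blank separates the two words.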

Next, the individual characters are extracted from each extracted word by extracting connected components, but here another problem arises. Suppose the extracted word is a character image like FIG. 5(a). Extracting connected components then yields the five characters m, i, l, l, i as a single component. From its width, this extracted component must be judged to be a run of connected characters and forcibly cut apart partway along. The cut is normally made where the value of the peripheral distribution is small near the standard character width. FIG. 5(a), however, contains characters such as m, i and l. Characters such as i and l are half the standard character width, so two of them in succession have the width of one character and are cut out together. In the case of m, the peripheral distribution dips considerably at one point within about one standard character width, so that point is very likely to be judged a break between characters and the character is likely to be cut apart partway through. The present invention therefore takes the character positions 511 to 517 in FIG. 5(c), which are the regions of the peripheral distribution in FIG. 5(b) at or above the value 508 proportional to the standard character height, finds the midpoints 521 to 526 between adjacent characters, and uses these midpoints as segmentation candidate positions. As FIG. 5 makes clear, the segmentation candidate positions 521 to 526 represent the boundaries between characters. During segmentation, when a component is larger than the allowable size relative to the standard character width, using these positions makes segmentation possible and greatly reduces the number of cuts made at wrong positions.
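The candidate positions can be sketched as follows. The name `candidate_cut_positions` is hypothetical, and treating columns whose value equals the minimum line width as "low" is an assumption, since the text only distinguishes values above and below it.

```python
def candidate_cut_positions(col_profile, min_line_width):
    """Return (regions, cuts) for one word.

    regions -- (start, end) spans, end exclusive, where the profile exceeds min_line_width
    cuts    -- centre column of each low stretch between two neighbouring regions
    """
    regions = []
    start = None
    for x, value in enumerate(list(col_profile) + [0]):  # trailing 0 closes the last run
        if value > min_line_width:
            if start is None:
                start = x
        elif start is not None:
            regions.append((start, x))
            start = None
    cuts = []
    for (_, left_end), (right_start, _) in zip(regions, regions[1:]):
        # The low stretch covers columns left_end .. right_start - 1.
        cuts.append((left_end + right_start - 1) // 2)
    return regions, cuts


regions, cuts = candidate_cut_positions([0, 5, 6, 1, 1, 1, 7, 8, 0, 6, 0], min_line_width=1)
```

In the sample, the three columns with value 1 represent a thin serif joining two characters: they are not counted as character region, and their centre column becomes a candidate cut even though the projection never reaches zero there.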

Next, the individual characters are extracted. In the present invention, characters are extracted by extracting connected components. Most Western characters consist of a single connected component, and even characters with more than one connected component can mostly be identified from their main connected component alone, so this method is effective. To extract a connected component, its contour is extracted first. Extracting the contour gives the position and size of the character. Once the contour has been extracted, the extracted character width is compared with the standard character width already obtained.

When the extracted character width is within the allowable size relative to the standard character width, the extracted component is judged to be one character and the character is extracted. When the extracted character width is not within the allowable size relative to the standard character width, the extracted region is first checked for segmentation candidate positions. If a segmentation candidate position exists, it is the most appropriate place to cut, so the contour of the connected component is extracted again within the range bounded by that position. Now suppose the character image is like FIG. 11(a). (In this character image there are only five midpoints, 1121 to 1125, of regions where the peripheral distribution is smaller than the fixed size 1110 relative to the standard character height, yet there are nine characters.) When the contour of character image 1152 is extracted, its width is judged not to be within the allowable size relative to the standard character width.

The existence of a cutting candidate position is therefore checked. However, the peripheral distribution 1142 of character image 1152 has no value smaller than the minimum line width 1110 partway through the characters, so no cutting candidate position exists.
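The candidate search described above can be sketched as follows. This is a minimal Python illustration, not the implementation of the invention. It assumes a binary image stored as rows of 0/1 values with the text line running horizontally; the names `column_profile`, `candidate_positions`, and `min_line_width` are hypothetical. It takes the midpoint of each interior run of columns whose peripheral distribution value is below the minimum line width.

```python
from typing import List


def column_profile(image: List[List[int]]) -> List[int]:
    """Count the black pixels (1s) in each column of a binary image."""
    return [sum(column) for column in zip(*image)]


def candidate_positions(profile: List[int], min_line_width: int) -> List[int]:
    """Return the midpoint of every interior run where profile < min_line_width."""
    candidates = []
    run_start = None
    for x, value in enumerate(profile):
        if value < min_line_width:
            if run_start is None:
                run_start = x
        elif run_start is not None:
            # A run that starts at column 0 is left margin, not a gap between characters.
            if run_start > 0:
                candidates.append((run_start + x - 1) // 2)
            run_start = None
    # A run still open here touches the right edge and is ignored as margin.
    return candidates
```

When characters touch, the profile never drops below the threshold inside the extracted region, so the list is empty there. That corresponds to the situation of character image 1152.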

In this case, attention is paid to the values of the peripheral distribution 1142 of the extracted characters near the standard character width. This makes it possible to find the contact portion 1131 between the letter p and the adjacent letter, and the characters can be cut out one at a time. By repeating the cutting in the same way, in FIG. 11 three cutting positions, 1131 to 1133, are extracted in addition to the five cutting candidate positions 1121 to 1125, and the nine characters can be cut out accurately.

In addition, because the present invention extracts the outline of a connected component and then extracts the character when the extracted width is that of a single character, characters can be extracted without any problem even from a character image such as that in FIG. 12(a). That image consists of three non-touching characters, Y, o, and u. In the peripheral distribution of FIG. 12(b), however, the distributions of Y and o overlap and form a single peripheral distribution 1203. If, as in the conventional approach, one tried to force a separation based on the shape of peripheral distribution 1203, the cut would be made at position 1202; if one tried to separate based on the magnitude of the peripheral distribution, the cut would be made at position 1201. Either way the characters would be cut at an inappropriate position, whereas no such problem arises in the present invention.
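The search near the standard character width, used when no cutting candidate position exists, might look like the following Python sketch. The text says only that values "near" the standard character width are examined, so the `tolerance` window is an assumption, as are the names `fallback_cut`, `left`, and `std_width`.

```python
from typing import List


def fallback_cut(profile: List[int], left: int, std_width: int, tolerance: int) -> int:
    """Return the column with the smallest profile value near left + std_width.

    profile   -- peripheral distribution (black pixels per column)
    left      -- first column of the over-wide extracted region
    std_width -- estimated standard character width
    tolerance -- half-width of the search window (an assumed parameter)
    """
    center = left + std_width
    low = max(left + 1, center - tolerance)
    high = min(len(profile) - 1, center + tolerance)
    if low > high:
        raise ValueError("search window is empty")
    # min() keeps the first minimum it meets, so ties resolve to the leftmost column.
    return min(range(low, high + 1), key=lambda x: profile[x])
```

Calling the function again with `left` set to the returned column repeats the cut for the next character, in the manner of positions 1131 to 1133.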

After the outline of the connected component has been extracted at an appropriate size, the connected component itself is extracted. In the present invention, a second region of the same size as the character image is provided. To extract the character P from character image 1301 in FIG. 13(a), the outline of the character P in FIG. 13(a) is first drawn in the separate region 1302. Next, the region enclosed by the outline is filled in, giving FIG. 13(b). Then the part common to the original image, FIG. 13(a), and the image of the extracted character region, FIG. 13(b), is extracted, giving the image shown in FIG. 13(c). As FIG. 13(c) shows, only the image of P can be cleanly extracted from the images of P and e, which overlap in position.
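The fill-and-intersect step can be illustrated with the Python sketch below. It is a simplified stand-in rather than the procedure of the invention: instead of tracing the outline, it collects the connected component by flood fill from a seed pixel (8-connectivity), treats everything not reachable from outside the image as the filled region, and then takes the part common to that region and the original image. All function names are hypothetical.

```python
from collections import deque
from typing import List, Set, Tuple

Pixel = Tuple[int, int]
Image = List[List[int]]


def connected_component(image: Image, seed: Pixel) -> Set[Pixel]:
    """Collect the 8-connected black pixels reachable from seed."""
    if not image[seed[0]][seed[1]]:
        raise ValueError("seed must be a black pixel")
    height, width = len(image), len(image[0])
    component = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < height and 0 <= nc < width
                        and image[nr][nc] and (nr, nc) not in component):
                    component.add((nr, nc))
                    queue.append((nr, nc))
    return component


def filled_region(height: int, width: int, component: Set[Pixel]) -> Image:
    """Mask of the component plus everything it encloses (second, same-size region)."""
    # Flood the outside from a one-pixel border using 4-connectivity,
    # which is the counterpart of 8-connectivity for the character itself.
    outside = {(-1, -1)}
    queue = deque([(-1, -1)])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            cell = (r + dr, c + dc)
            if (-1 <= cell[0] <= height and -1 <= cell[1] <= width
                    and cell not in component and cell not in outside):
                outside.add(cell)
                queue.append(cell)
    return [[0 if (r, c) in outside else 1 for c in range(width)]
            for r in range(height)]


def extract_character(image: Image, seed: Pixel) -> Image:
    """Keep only the pixels common to the original image and the filled region."""
    height, width = len(image), len(image[0])
    mask = filled_region(height, width, connected_component(image, seed))
    return [[image[r][c] & mask[r][c] for c in range(width)]
            for r in range(height)]
```

Taking the common part with the original image restores the holes inside a character such as P, which the fill step covers, while a neighboring character that overlaps only in position stays outside the mask.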

As described above, according to the present invention the standard character height and minimum line width are determined from the peripheral distribution in the line direction, and the standard character spacing, word spacing, and standard character length are determined with reference to those values, so accurate values of the standard character spacing, word spacing, and standard character length are obtained. As a result, words can be cut out accurately. Furthermore, when cutting out characters, the cut is made with reference to the cutting candidate positions and the peripheral distribution while comparing against the standard character width, so characters can be cut out at accurate positions.

Furthermore, because character cutting extracts the image of the region enclosed by the outline of the character's connected component, exactly one character can be extracted accurately even from a character image whose peripheral distributions overlap so that the cutting position cannot be determined.

As described above, the present invention makes it possible to cut out characters one by one more accurately, which had been considered difficult until now, and so makes it possible to greatly improve the reliability of character recognition devices that use this method as a component.

Because the present invention uses the standard character height and the extraction of connected components, it is particularly suited to printed European and American characters.

[Effect of the invention]

As described above, the present invention has the many effects listed below and greatly improves the reliability of character recognition devices.

By counting the peripheral distribution in the line direction and estimating the standard character height and minimum line width from the shape of that distribution, the character region in the direction perpendicular to the line direction can be determined more accurately, and character cutting candidate positions can be determined.

It also becomes possible to determine the standard character spacing, word spacing, and standard character length accurately.

By taking statistics on the portions of the peripheral distribution in the direction perpendicular to the line direction in which no characters are present, accurate values of the standard character spacing and word spacing are obtained, and as a result words can be extracted accurately.
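One way to take such statistics is sketched below in Python. The text states only that statistics are taken on the empty portions; splitting the sorted gap lengths at their largest jump and using the most frequent value of each group are assumptions made for illustration, and the names `gap_lengths` and `estimate_spacing` are hypothetical. The sketch also assumes the line contains both character gaps and word gaps.

```python
from collections import Counter
from typing import List, Tuple


def gap_lengths(profile: List[int]) -> List[int]:
    """Lengths of the interior runs of empty columns (profile value 0)."""
    gaps = []
    run = 0
    seen_ink = False
    for value in profile:
        if value == 0:
            run += 1
        else:
            # Leading empty columns are margin, so they are not recorded.
            if seen_ink and run > 0:
                gaps.append(run)
            seen_ink = True
            run = 0
    # Trailing empty columns are margin as well and are dropped.
    return gaps


def estimate_spacing(gaps: List[int]) -> Tuple[int, int]:
    """Return (standard character spacing, word spacing) from gap lengths."""
    ordered = sorted(gaps)
    if len(ordered) < 2:
        raise ValueError("need at least two gaps")
    # Assumed rule: the largest jump between neighbouring lengths separates
    # character gaps from word gaps.
    split = max(range(1, len(ordered)), key=lambda i: ordered[i] - ordered[i - 1])
    narrow, wide = ordered[:split], ordered[split:]
    return Counter(narrow).most_common(1)[0][0], Counter(wide).most_common(1)[0][0]
```

A gap that is clearly longer than the standard character spacing can then be treated as a word boundary when words are extracted.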

By taking statistics on the regions of the peripheral distribution in the direction perpendicular to the line direction that are at or below a certain size proportional to the standard character height, an accurate standard character length is obtained, and by taking the center of each such region as a cutting candidate position, characters can be separated at accurate positions.

When the standard character width is estimated, finding the maximum value of the cluster in the region larger than 75% of the standard character height eliminates the possibility of mistakenly selecting a cluster of characters of the wrong width (the width is garbled in the source as 「２１５分」). In addition, clusters whose separation is at or below a certain value proportional to the standard character height are regarded as the same cluster, so the clusters can be classified accurately, and as a result the standard character width, standard character spacing, and word spacing can be estimated accurately.
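The cluster classification and the 75% rule can be sketched as follows. This Python illustration rests on several assumptions that the text does not confirm: `gap_ratio` is a hypothetical proportionality constant, the cluster "closest to the standard character height" is chosen by its mean, and the "maximum value" of a cluster is read as its most frequent width (the histogram peak).

```python
from collections import Counter
from typing import List


def cluster_values(values: List[int], max_gap: float) -> List[List[int]]:
    """Group sorted values; neighbours no more than max_gap apart share a cluster."""
    clusters: List[List[int]] = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def estimate_standard_width(widths: List[int], std_height: int,
                            gap_ratio: float = 0.1) -> int:
    """Estimate the standard character width from measured run widths."""
    # Only widths larger than 75% of the standard character height are considered.
    wide_enough = [w for w in widths if w > 0.75 * std_height]
    if not wide_enough:
        raise ValueError("no width exceeds 75% of the standard character height")
    # The merge threshold is proportional to the standard character height.
    clusters = cluster_values(wide_enough, gap_ratio * std_height)
    # Assumed selection rule: the cluster whose mean is closest to the height.
    best = min(clusters, key=lambda c: abs(sum(c) / len(c) - std_height))
    # Assumed reading of "maximum value": the most frequent width in the cluster.
    return Counter(best).most_common(1)[0][0]
```

In this reading, narrow runs are removed by the 75% threshold and runs made of two touching characters fall into a separate, more distant cluster, so neither is taken as the standard character width.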

Because character cutting extracts only the region inside the outline of the character, only the target character can be extracted accurately even from a character image whose cutting position cannot be determined from the vertical peripheral distribution.

When, during character extraction, the extracted character width exceeds the allowable size relative to the standard character width, the cutting candidate positions are checked first, and the cutting position is determined from the values of the peripheral distribution only when no cutting candidate position exists. Characters can therefore be extracted more accurately even from character images that would be separated at the wrong position if only the values of the peripheral distribution were used.

As described above, the present invention makes it possible to accurately cut out individual characters from character images in which adjacent characters touch, character images whose peripheral distributions overlap, and character images whose character pitch is not constant. As a result, it has the effect of greatly improving the reliability of a character recognition device that includes this character cutting method as a component.

[Brief explanation of drawings]

FIG. 1 is a block diagram of a character recognition device in which the character cutting means of the present invention is used.
FIG. 2 is a flowchart of the character cutting means of the present invention.
FIGS. 3(a) and 3(b) illustrate the standard character height extraction method of the present invention.
FIGS. 4(a) to 4(d), 5(a) to 5(c), 6(a) and 6(b), 7(a) and 7(b), 8, 9, and 10 illustrate how the standard character spacing, word spacing, standard character length, and cutting candidate positions are estimated.
FIGS. 11(a) and 11(b), 12(a) and 12(b), and 13(a) to 13(c) illustrate how characters are extracted.

101: CPU
102: image input device
103: recognized character display device
104: ROM
105: RAM
301: peripheral distribution in the line direction
304: standard character height
401: character region
402: character spacing
501 to 507: character peripheral distribution
511 to 517: character region
521 to 526: character cutting candidate positions
508: minimum character line width
701: standard character height
801 to 804: cluster spacing
805: threshold for cluster spacing
901 to 903: cluster spacing
904: threshold for cluster spacing
1004: 75% of the standard character height
1110: minimum character line width
1131 to 11…: character cutting positions

Applicant: Seiko Epson Corporation
Agent: Patent attorney Masayoshi Kamiyanagi (and one other)

Claims

[Claims]

(1) A character cutting method in a character recognition device that reads a character image written on paper or the like with an optical image input means, recognizes the characters in the input image data, and replaces them with code numbers, the method characterized by: [1] estimating the standard character height and the line thickness of the characters from the peripheral distribution in the line direction; [2] estimating the standard character spacing, word spacing, and standard character width from the peripheral distribution in the direction perpendicular to the line direction, and extracting words; [3] estimating character cutting candidate positions from the peripheral distribution; [4] cutting out the characters in an extracted word by extracting the outlines of the connected components of the characters while simultaneously extracting the character height and character width; [5] when the character width exceeds the allowable size relative to the standard character width, extracting the outline again within the range of the character cutting candidate positions; [6] when no character cutting candidate position exists, determining the range of character cutting from the peripheral distribution in the direction perpendicular to the line direction and extracting the outline again; and [7] cutting out the character by extracting only the inside of the region surrounded by the outline.

(2) The character cutting method according to claim 1, characterized in that the standard character height is determined by examining the shape of the peripheral distribution in the line direction and taking the width of the portion where the peripheral distribution changes abruptly and becomes large as the standard character height, and the minimum line width of the character strokes is estimated from the size of the standard character height.

(3) The character cutting method according to claim 1, characterized in that the standard character spacing and word spacing are estimated by taking statistics on the sizes of the portions of the peripheral distribution in the direction perpendicular to the line direction in which no characters are present.

(4) The character cutting method as described, characterized in that the standard character width is estimated by taking statistics on the sizes of the portions of the peripheral distribution in the direction perpendicular to the line direction in which the value of the peripheral distribution is larger than the minimum line width.

(5) The character cutting method according to claim 1, characterized in that the standard character width is estimated using the maximum value of the cluster closest to the standard character height within the region of the statistics larger than 75% of the standard character height.

(6) The character cutting method according to claim 1, characterized in that, in classifying clusters in the statistics, clusters are regarded as the same cluster if the distance between them is at or below a certain value proportional to the standard character height.

(7) The character cutting method according to claim 1, characterized in that word positions are extracted by comparing the standard character spacing and word spacing with the peripheral distribution in the direction perpendicular to the line direction.

(8) The character cutting method according to claim 1, characterized in that the center of each portion of the peripheral distribution in the direction perpendicular to the line direction in which the value of the peripheral distribution is smaller than the minimum line width is taken as a character cutting candidate position.

(9) The character cutting method according to claim 1, characterized in that, when an extracted character is judged from its width to be a set of connected characters, the cut is made preferentially at a character cutting candidate position.

(10) The character cutting method according to claim 1, characterized in that, when the character cutting position is estimated from the values of the peripheral distribution, the point with the smallest value of the peripheral distribution is sought near 1/2 character width and near 1 character width.

(11) The character cutting method according to claim 1, characterized in that, to extract the region enclosed by the outline of a character, an image region of the same size as the original image is created, the outline of the character is drawn in that image region, the inside of the outline is filled in, and only the target character is extracted by taking the part common to the filled region and the original image.