JP2728085B2

JP2728085B2 - Character extraction method

Info

Publication number: JP2728085B2
Application number: JP8136478A
Authority: JP
Inventors: 三喜男青木
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1988-04-28
Filing date: 1996-05-30
Publication date: 1998-03-18
Anticipated expiration: 2013-03-18
Also published as: JPH096915A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a character segmentation (extraction) method for use in a character recognition apparatus that inputs characters written on paper as an image, locates the character regions in that character image, and converts them into code numbers.

[0002]

[Prior art] In recent years, rapid progress in character recognition apparatus has made it possible to automatically extract character regions from a wide variety of document images, segment and recognize the characters one by one, and automatically create a text file. Various methods of segmenting characters have been devised.

[0003] For example, one widely used method counts the marginal distribution (projection profile) of an extracted character line in the direction perpendicular to the line direction.

[0004] For example, the marginal distribution of an extracted character line such as that shown in FIG. 4(A) is counted in the direction perpendicular to the line direction. If only the presence or absence of a count, that is, whether or not a character image exists, is plotted, the marginal distribution shown in FIG. 4(B) is obtained. The values of this marginal distribution reveal where the characters are located, so the characters can be segmented one by one. However, when the extracted character line contains characters whose marginal distributions overlap, as shown in FIG. 4(C), the overlapping characters, such as "Y" and "o", appear as a single region of larger character width. When such a region was judged to contain multiple characters, the character boundary was estimated from the character pitch and the characters were forcibly cut apart.
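The conventional projection-based approach described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes a binary image stored as a list of rows of 0/1 values, and the function names `column_profile` and `occupied_runs` are hypothetical.

```python
def column_profile(image):
    """Count the black pixels in each column of a binary image (rows of 0/1)."""
    return [sum(column) for column in zip(*image)]


def occupied_runs(profile, threshold=0):
    """Return (start, end) column ranges, end exclusive, where profile > threshold."""
    runs, start = [], None
    for x, value in enumerate(profile):
        if value > threshold and start is None:
            start = x                      # a character region begins
        elif value <= threshold and start is not None:
            runs.append((start, x))        # the region ended at column x - 1
            start = None
    if start is not None:                  # region touching the right edge
        runs.append((start, len(profile)))
    return runs


# Toy text line: three separated glyphs.
image = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 1, 1],
]
profile = column_profile(image)
runs = occupied_runs(profile)
```

Each run is taken as one character. When the profiles of two glyphs overlap, as with "Y" and "o", they merge into a single run, which is the failure the following paragraphs discuss.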

[0005]

[Problem to be solved by the invention] However, when characters are segmented by such a method, accurate segmentation is possible for a fixed-pitch character image such as that shown in FIG. 4(A), but not for character images such as those shown in FIG. 4(C), FIG. 5(A), and FIG. 11(A). The character images of FIG. 4(C), FIG. 5(A), and FIG. 11(A) all consist of proportional characters with a variable character pitch. The character positions and the character pitch therefore cannot be estimated from the marginal distribution in the direction perpendicular to the line direction alone, as in the conventional method.

[0006] Consider segmenting characters such as those in FIG. 4(C) by the conventional method. FIG. 4(C) is a character image in the same font as FIG. 4(A). Whereas FIG. 4(A) has a fixed character pitch, FIG. 4(C) consists of proportional characters with a variable character pitch. FIG. 4(D) shows the marginal distribution of the character image of FIG. 4(C) in the direction perpendicular to the line direction. It is almost identical to FIG. 4(B), the marginal distribution of FIG. 4(A), except for the part corresponding to "Yo". In FIG. 4(D), the marginal distribution of "Y" and that of "o" overlap. From such a marginal distribution, "Y" and "o" are therefore either judged to be a single character or separated at a position different from the actual boundary. Suppose they are judged to be two characters and are forcibly divided in two. Going by the character pitch, the character image is separated at 1201 in FIG. 12; going by the shape of the marginal distribution, it is separated at 1202. Whichever of the positions 1201 and 1202 is used, either "Y" or "o" is cut partway through and extracted together with part of the other character, so accurate character extraction is impossible.

[0007] Next, consider segmenting characters from the character image shown in FIG. 5(A) by the conventional method. FIG. 5(A) is the word "million", which consists of seven characters. When the marginal distribution of this word in the direction perpendicular to the line direction is counted and plotted, as shown in FIG. 5(B), the marginal distributions 501, 502, 503, 504, and 505 of the five characters "m", "i", "l", "l", and "i" run together into a single block. A character boundary therefore cannot easily be estimated from this marginal distribution. If the characters are separated by character pitch, characters such as "i" and "l" are half the standard character width, so two of them are likely to be judged as one character and cut out together. If the characters are instead separated according to the shape of the marginal distribution, characters such as "i" and "l" may separate well, but "m" is very likely to be broken apart, so the reliability of the segmentation is extremely low.

[0008] For the character image shown in FIG. 11(A) as well, as in the case of FIG. 5, the characters cannot be segmented accurately from the shape of the marginal distribution alone.

[0009] The present invention solves the problems described above. Its object is to provide a method of accurately segmenting individual characters from a character image in which adjacent characters touch one another.

[0010]

[Means for solving the problem] The invention according to claim 1 is a character segmentation method for a character recognition apparatus that reads an image of Western characters written on paper or the like by optical image input means and recognizes the characters in the input image data, characterized in that a standard character height is estimated from the marginal distribution in the line direction, a line-width value to be ignored is determined on the basis of that standard character height, and a standard character width is estimated by taking statistics of the portions of the marginal distribution in the direction perpendicular to the line direction whose values are larger than the line-width value to be ignored.

[0011] The invention according to claim 2 is the character segmentation method according to claim 1, characterized in that the standard character width is estimated, from the statistics, in the region where the value of the marginal distribution is larger than approximately half of the standard character height.

[0012] The invention according to claim 3 is the character segmentation method according to claim 2, characterized in that, in the distribution of clusters in the statistics, clusters are regarded as the same cluster if the distance between them is no more than a predetermined value proportional to the standard character height.

[0013]

[Embodiments of the invention] The present invention is described in detail below on the basis of embodiments.

[0014] The method is used in a character recognition apparatus that reads a character image written on paper or the like by optical image input means, recognizes the characters in the input image data, and replaces them with code numbers. In outline, an example of the overall character segmentation process is as follows. The standard character height is estimated from the marginal distribution in the line direction, and the thickness of lines to be ignored is determined. The standard character spacing, the word spacing, and the standard character width are estimated from the marginal distribution in the direction perpendicular to the line direction, and words are extracted. Character cut candidate positions are estimated from the same marginal distribution. To segment the characters in an extracted word, the contour of each connected component is extracted together with its character height and character width. If the character width exceeds the size allowable relative to the standard character width, the contour is extracted again within the range delimited by the cut candidate positions. If no cut candidate position exists, the segmentation range is determined from the marginal distribution in the direction perpendicular to the line direction and the contour is extracted again. The character is then cut out by extracting only the inside of the region enclosed by the contour.

[0015] The standard character height is obtained from the shape of the marginal distribution in the line direction: the width of the portion where the marginal distribution changes abruptly and becomes large is taken as the standard character height. The minimum line width of a character stroke can then be estimated from the standard character height.

[0016] The standard character spacing and the word spacing can be estimated by taking statistics of the sizes of the portions of the marginal distribution in the direction perpendicular to the line direction where no character exists.

[0017] The standard character width can be estimated by taking statistics of the sizes of the portions of the marginal distribution in the direction perpendicular to the line direction whose values are larger than the minimum line width.

[0018] In those statistics, the standard character width can be taken as the maximum of the cluster closest to the standard character height, within the region above 75% of the standard character height.

[0019] When the clusters in the statistics are classified, clusters can be regarded as the same cluster if the distance between them is no more than a certain value proportional to the standard character height.

[0020] Word positions can be extracted by comparing the standard character spacing and the word spacing with the marginal distribution in the direction perpendicular to the line direction.

[0021] The center of each portion of the marginal distribution in the direction perpendicular to the line direction whose value is smaller than the minimum line width can be taken as a character cut candidate position.

[0022] When the extracted character width indicates touching characters, the characters can be cut preferentially at the character cut candidate positions.

[0023] When a cut position is estimated from the values of the marginal distribution, the position with the smallest marginal distribution value can be sought near one half of the character width and near one full character width.

[0024] To extract the region enclosed by a character's contour, an image area of the same size as the original image is prepared, the contour of the character is drawn in that image area, the inside of the contour is filled, and the intersection with the original image is taken. In this way only the target character can be extracted.
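The contour-mask extraction just described can be sketched as below. This is only an illustration under stated assumptions: the contour is assumed to be already available as a binary mask the same size as the original image, and the interior is painted by flood-filling the background from the image border, which is one possible way to fill a contour and is not specified in the text. The names `fill_outline` and `extract_character` are hypothetical.

```python
from collections import deque


def fill_outline(outline):
    """Return a mask that is 1 on the outline and on everything it encloses."""
    height, width = len(outline), len(outline[0])
    outside = [[False] * width for _ in range(height)]
    queue = deque()
    # Seed the flood fill with every border pixel that is not on the outline.
    for y in range(height):
        for x in range(width):
            on_border = y in (0, height - 1) or x in (0, width - 1)
            if on_border and not outline[y][x]:
                outside[y][x] = True
                queue.append((y, x))
    # Spread through 4-connected background pixels reachable from the border.
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < height and 0 <= nx < width
                    and not outside[ny][nx] and not outline[ny][nx]):
                outside[ny][nx] = True
                queue.append((ny, nx))
    return [[0 if outside[y][x] else 1 for x in range(width)]
            for y in range(height)]


def extract_character(original, outline):
    """Intersect the filled outline mask with the original image."""
    mask = fill_outline(outline)
    return [[pixel & keep for pixel, keep in zip(image_row, mask_row)]
            for image_row, mask_row in zip(original, mask)]


# An "o"-like glyph with a neighbouring stroke in column 5.
original = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
outline = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
glyph = extract_character(original, outline)
```

Because the mask is intersected with the original image, holes inside the glyph stay empty while pixels of the neighbouring stroke, which lie outside the contour, are dropped.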

[0025] As shown in the block diagram of FIG. 1, the character recognition apparatus of which the character segmentation means is a component comprises a CPU 101, an image input device 102, a recognized-character display device 103, a ROM 104, and a RAM 105.

[0026] The character segmentation method for cutting out individual characters from the character image read into the RAM 105 by the image input device 102 is described in detail below with reference to the flowchart shown in FIG. 2.

[0027] Let FIG. 3(A) be the character image read into the RAM 105 by the image input device 102. First, the marginal distribution in the line direction is counted. Counting the marginal distribution of the character image of FIG. 3(A) in the line direction yields a marginal distribution shaped like 301 in FIG. 3(B). Western characters consist of three kinds of character: tall characters such as the "d" in FIG. 3(A), short characters such as "e", and characters such as "y" that are tall but sit lower on the line. The marginal distribution in the line direction of a character image made up of these three kinds of character therefore normally has a shape like 301 in FIG. 3(B). The marginal distribution 301 is made up of three regions, 311, 312, and 313; depending on the character image, region 312 or region 313 may be absent. Region 311, however, is always present, and the width 304 between its upper limit 302 and lower limit 303 corresponds to the height of the short characters (hereinafter the "standard character height"). The standard character height can therefore be found from the shape of the marginal distribution in the line direction.
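One way to read the standard character height off the line-direction profile is sketched below. The text only says that the dense central band (region 311) is located from the shape of the distribution; the rule used here, keeping the rows whose count is at least half of the peak, and the parameter `fraction` are assumptions made for illustration, as is the function name.

```python
def standard_char_height(row_profile, fraction=0.5):
    """Estimate the height of the dense central band of a row profile.

    row_profile[y] is the number of black pixels in row y of the text line.
    `fraction` is an assumed cut-off, not a value taken from the patent.
    Returns (height, top_row, bottom_row), or (0, None, None) for a blank line.
    """
    peak = max(row_profile)
    if peak == 0:
        return 0, None, None
    dense_rows = [y for y, count in enumerate(row_profile)
                  if count >= fraction * peak]
    top, bottom = dense_rows[0], dense_rows[-1]
    return bottom - top + 1, top, bottom


# Rows 1-2: ascenders only, rows 3-6: body of every letter, rows 7-8: descenders.
row_profile = [0, 2, 2, 9, 10, 9, 10, 3, 2, 0]
height, top, bottom = standard_char_height(row_profile)
```

In this toy profile the band spans rows 3 to 6, so the estimated standard character height is 4 rows.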

[0028] In printed Western type, the relationship standard character height : stroke thickness ≥ 16 : 1 normally holds between the standard character height and the thickness of the character strokes. Therefore, in the marginal distribution of the character image in the perpendicular direction, a portion whose value is lower than 1/16 of the standard character height can be judged to be a serif or a portion where characters touch. The value of 1/16 of the standard character height that serves as this criterion is obtained here as the line width to be ignored.

[0029] In the next stage, the standard character spacing, the word spacing, and the standard character width are obtained from the marginal distribution in the direction perpendicular to the character line. First, data indicating whether or not a character image exists when projected in the direction perpendicular to the character line is obtained. If the character image is FIG. 4(A), this data, that is, the data projected in the direction perpendicular to the character image, is FIG. 4(B). Region 401 is a portion where a character image exists, and region 402 is a portion where none exists, that is, a portion corresponding to a character gap. Taking statistics of the portions 402 where no character exists gives FIG. 6(A). Similarly, if the character image is FIG. 4(C), taking statistics of the portions 402 where no character image exists in the projected data of FIG. 4(D) gives FIG. 6(B). The data drawn with solid lines in FIG. 6(A) and FIG. 6(B) are the data of FIG. 4(B) and FIG. 4(D) respectively; in general, a histogram of character gaps follows the trend drawn with dotted lines in FIG. 6(A) and FIG. 6(B). Each of these two histograms can be divided into two clusters. One can be judged to be the cluster of character gaps and the other the cluster of word gaps. The standard character spacing and the word spacing can therefore be estimated from the gap statistics. A histogram of character gaps generally looks like FIG. 6, but occasionally a histogram like FIG. 8 is obtained, which contains many clusters of data. When many clusters are present like this, they are classified by the following method. Printed Western characters are normally printed in a regular arrangement.

[0030] The character gaps should therefore be nearly equal, but they sometimes differ depending on the shapes of the characters. Even so, the gaps do not scatter by more than a certain value proportional to the standard character height. For this reason, 1/16 of the standard character height is used as the threshold 805 for the distance between clusters. By comparing the threshold 805 with the intervals 801, 802, 803, and 804 between the clusters 811, 812, 813, 814, and 815, clusters 811 and 812 can be judged to be one cluster and clusters 813, 814, and 815 to be another, which makes it possible to estimate the standard character spacing and the word spacing.
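The cluster merging described here can be sketched as a one-dimensional grouping of gap widths. The sketch assumes that the "distance between clusters" is the difference between neighbouring gap widths after sorting; the function name `cluster_values` and the sample data are hypothetical.

```python
def cluster_values(values, max_distance):
    """Group sorted values; neighbours at most max_distance apart share a cluster."""
    clusters = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= max_distance:
            clusters[-1].append(value)     # close enough: same cluster
        else:
            clusters.append([value])       # too far: start a new cluster
    return clusters


standard_height = 32
threshold = standard_height / 16           # 1/16 of the standard character height
gap_widths = [3, 4, 3, 5, 12, 4, 13, 3, 14]  # widths of blank runs in the profile
clusters = cluster_values(gap_widths, threshold)
character_gaps, word_gaps = clusters[0], clusters[-1]
```

With these sample widths the narrow gaps (3 to 5 columns) form the character-gap cluster and the wide gaps (12 to 14 columns) form the word-gap cluster. How a single representative spacing is then chosen from each cluster is not detailed in this passage.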

[0031] Next, the standard character width of the character image is estimated. Once the standard character width is known, it serves as a clue for judging, during segmentation, whether a region is a single character or touching characters, and for estimating cut candidate positions. As with the standard character spacing and the word spacing, the standard character width is estimated by taking statistics, this time of the character regions. If the character images are FIG. 4(A) and FIG. 4(C), the character-width statistics look like the solid lines in FIG. 7(A) and FIG. 7(B), and in general data with the trend shown by the dotted lines are obtained. The data in FIG. 7(A) contain a single cluster, so the maximum of that cluster can be taken as the standard character width. The data in FIG. 7(B) contain two clusters, 702 and 703. In printed Western characters the standard character width is normally very close to the standard character height, so the maximum of cluster 702, which lies near the standard character height 701, can be taken as the standard character width. As in the estimation of the character spacing, the histogram sometimes consists of many clusters. In FIG. 9, comparing the distances 901, 902, and 903 between clusters with the threshold 904 shows that clusters 911, 912, and 913 can be judged to be the same cluster. Even after many clusters have been classified using the inter-cluster threshold, three clusters may remain, as in FIG. 10. Statistics like FIG. 10 result when the character image contains characters such as "i" and "l".

[0032] To avoid taking the maximum of cluster 1001, which comes from characters such as "i" and "l", as the standard character width, the cluster maximum is searched for only in the region above position 1004, at 75% of the standard character width. This excludes the cluster of "i" and "l", whose width is about half the standard character width, and allows the standard character width to be estimated accurately. The character width of an ordinary character image can be estimated in this way, but some character images are like the one shown in FIG. 5(A). This character image consists of the seven characters "m", "i", "l", "l", "i", "o", and "n", of which the five characters "m", "i", "l", "l", and "i" touch one another. Consequently, the standard character width cannot be estimated merely by counting the marginal distribution in the direction perpendicular to the line direction.
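The width estimate for ordinary (non-touching) text can be sketched as follows. Two readings are assumed here and should be treated as such: the 75% floor is taken relative to the standard character height, as in paragraph [0018], and the "maximum of the cluster" is read as the most frequent width in the cluster. The function name and parameters are hypothetical.

```python
from collections import Counter


def standard_char_width(widths, standard_height, floor_fraction=0.75, ratio=16):
    """Estimate the standard character width from measured region widths."""
    floor = floor_fraction * standard_height
    candidates = sorted(w for w in widths if w > floor)  # drops narrow "i", "l"
    if not candidates:
        return None
    # Merge neighbouring widths no more than height/ratio apart into clusters.
    clusters = [[candidates[0]]]
    for width in candidates[1:]:
        if width - clusters[-1][-1] <= standard_height / ratio:
            clusters[-1].append(width)
        else:
            clusters.append([width])

    def distance_to_height(cluster):
        return min(abs(w - standard_height) for w in cluster)

    best = min(clusters, key=distance_to_height)  # cluster nearest the height
    return Counter(best).most_common(1)[0][0]     # assumed reading of "maximum"


widths = [14, 15, 30, 31, 31, 32, 30, 31, 62]
estimate = standard_char_width(widths, standard_height=32)
```

Here the widths 14 and 15 (narrow letters) fall below the floor of 24, the width 62 (two merged letters) forms a cluster far from the standard character height, and the estimate comes from the cluster around 30 to 32.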

[0033] The present invention therefore exploits the fact that, in printed Western characters, the character strokes have a thickness of at least a certain value proportional to the standard character height, whereas the serifs at the base of the characters seen in FIG. 5(A) are thinner than that value. After the marginal distribution in the direction perpendicular to the line direction, FIG. 5(B), has been counted, a line width to be ignored is set at a certain value 508 proportional to the standard character height, and a step is provided that truncates the marginal distribution at that value. Through this step, only the portions where the marginal distribution exceeds the value 508 are judged to be character regions; displaying these character regions gives FIG. 5(C). Regions 511, 512, 513, 514, 515, 516, and 517 indicate the positions and widths of the characters "m", "i", "l", "l", "i", "o", and "n" respectively. The seven regions 511 to 517 are somewhat narrower than the actual character widths, but they always contain the part of each character that carries the necessary information, so they can be judged to be character regions. The standard character width can therefore be estimated by taking statistics of the character regions in FIG. 5(C).
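The truncation step can be sketched as a thresholded run search over the column profile. The sketch assumes the ignored line width is 1/16 of the standard character height, as stated in paragraph [0028]; the function name `character_regions` and the sample profile are hypothetical.

```python
def character_regions(profile, standard_height, ratio=16):
    """Return (start, end) column ranges, end exclusive, above the ignored width."""
    ignored_width = standard_height / ratio
    regions, start = [], None
    for x, value in enumerate(profile):
        if value > ignored_width and start is None:
            start = x
        elif value <= ignored_width and start is not None:
            regions.append((start, x))
            start = None
    if start is not None:
        regions.append((start, len(profile)))
    return regions


# Four glyphs joined by thin serifs (profile values 1 or 2) at columns 5, 8, 11.
profile = [9, 8, 9, 8, 9, 1, 7, 7, 1, 12, 12, 2, 12, 12, 0, 0]
regions = character_regions(profile, standard_height=32)
widths = [end - start for start, end in regions]
```

A plain non-zero test would see columns 0 to 13 as one block. With the threshold, the thin joins are cut away and four regions remain, whose widths can then feed the width statistics.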

[0034] Next, words are extracted from the character image. Words are extracted by comparing the standard character spacing and the word spacing with the marginal distribution in the direction perpendicular to the character line. If the character image is FIG. 4(A), the marginal distribution is FIG. 4(B), so the word gaps can be found from the sizes of the portions where no character exists, and the words can be extracted from the character image.
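Word extraction from the blank runs can be sketched as below. It assumes character regions are already available as (start, end) column ranges with an exclusive end, and that a single threshold `word_gap`, placed between the estimated character spacing and word spacing, separates the two kinds of gap. Both the function name and the threshold choice are illustrative assumptions.

```python
def group_into_words(regions, word_gap):
    """Merge consecutive character regions into words.

    A blank run narrower than `word_gap` keeps two regions in the same word;
    a wider one starts a new word. Returns one (start, end) range per word.
    """
    words = []
    for region in regions:
        if words and region[0] - words[-1][-1][1] < word_gap:
            words[-1].append(region)
        else:
            words.append([region])
    return [(word[0][0], word[-1][1]) for word in words]


regions = [(0, 4), (6, 10), (12, 15), (25, 29), (31, 35)]
words = group_into_words(regions, word_gap=8)
```

The gaps between regions are 2, 2, 10, and 2 columns, so the 10-column gap splits the line into two words.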

[0035] Next, individual characters are extracted from each extracted word by extracting connected components, but a further problem arises here. Suppose the extracted word is a character image like FIG. 5(A). Extracting connected components then yields the five characters "milli" as a single character. From its width, this extracted region must be judged to consist of touching characters and forcibly cut apart partway along. The cut is normally made where the marginal distribution is small near the standard character width. FIG. 5(A), however, contains characters such as "m", "i", and "l". Characters like "i" and "l" are half the standard character width, so two of them in succession occupy the width of one character and are cut out together. In the case of "m", the marginal distribution drops considerably at one point near the standard character width, so that point is judged to be a character boundary and the character is very likely to be cut in two.

[0036] Therefore, in the marginal distribution of FIG. 5(B), the regions at or above the value 508 proportional to the standard character height are taken, and the midpoints 521 to 526 between adjacent character positions 511 to 517 in FIG. 5(C) are obtained and used as cut candidate positions. As is clear from FIG. 5, the cut candidate positions 521 to 526 mark the boundaries between characters. When, during segmentation, a character is larger than the size allowable relative to the standard character width, using these cut positions makes it possible to segment the characters, and cuts at wrong positions become extremely rare.
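Computing the cut candidates from the thresholded character regions can be sketched as follows. Regions are assumed to be (start, end) column ranges with an exclusive end, and the midpoint is rounded down to an integer column; the function name `cut_candidates` is hypothetical.

```python
def cut_candidates(regions):
    """Center column of each gap between consecutive character regions."""
    return [(left_end + right_start - 1) // 2
            for (_, left_end), (right_start, _) in zip(regions, regions[1:])]


regions = [(0, 5), (7, 9), (11, 13), (14, 16)]
candidates = cut_candidates(regions)
```

The gaps cover columns 5 to 6, 9 to 10, and 13, so the candidates fall at columns 5, 9, and 13.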

[0037] Next, the characters are extracted one by one. Characters are extracted by extracting connected components. Most Western characters consist of a single connected component, and even for characters with more than one, most can be identified from the main connected component alone, so this method is effective. To extract a connected component, its contour is extracted first. Extracting the contour gives the position and size of the character. Once the contour has been extracted, the extracted character width is compared with the standard character width already obtained.

[0038] If the extracted character width is within the size allowable relative to the standard character width, the extracted region is judged to be one character and is extracted as such. If the extracted character width is not within the allowable size, it is first checked whether a cut candidate position exists within the extracted region. If one exists, cutting the character at that position is most appropriate, so the contour of the connected component is extracted again within the range delimited by that position.
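The decision described in this paragraph can be sketched as a small function over column ranges. The text does not quantify the "allowable size", so the factor `tolerance` below is purely an assumed placeholder, and the function name is hypothetical. Re-extracting the contour inside each sub-range is left out of the sketch.

```python
def segment_component(component, standard_width, candidates, tolerance=1.3):
    """Split one connected component's column range at cut candidates.

    component: (start, end) columns of the component, end exclusive.
    tolerance: assumed allowance; the text only speaks of an allowable size.
    Returns a list of ranges, or None when the component is too wide but
    contains no cut candidate (the caller must then fall back to the profile).
    """
    start, end = component
    if end - start <= tolerance * standard_width:
        return [component]                       # accepted as one character
    cuts = [c for c in candidates if start < c < end]
    if not cuts:
        return None
    bounds = [start] + cuts + [end]
    return list(zip(bounds, bounds[1:]))


single = segment_component((0, 9), 8, [5])
touching = segment_component((0, 30), 8, [7, 15, 22, 40])
no_candidate = segment_component((0, 30), 8, [])
```

A component of width 9 is accepted as is, a component of width 30 is split at the three candidates that lie inside it, and a wide component without candidates is signalled with `None`.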

[0039] Consider a character image such as that in FIG. 11(A). (In this character image there are only five midpoints, 1121 to 1125, of regions where the marginal distribution is smaller than the fixed value 1110 determined from the standard character height, although nine characters are present.) When the outline of character image 1152 is extracted, its width is judged to be outside the allowable range relative to the standard character width. The existence of a cutout candidate position is therefore checked. However, the marginal distribution 1142 of character image 1152 never falls below the ignored line width 1110 partway through the image, so no cutout candidate position exists. In this case, attention is paid to the values of the marginal distribution 1142 near the standard character width. This makes it possible to find the contact portion 1131 between the characters "a" and "r", so the characters can be cut out one at a time. By repeating the cutout in the same way, in FIG. 11 the three cutout positions 1131 to 1133 are obtained in addition to the five cutout candidate positions 1121 to 1125, and the nine characters can be cut out accurately.
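One way to "pay attention to the values near the standard character width" is to look for the smallest value of the marginal distribution in a window around that width. The text does not say which criterion is used, so the minimum, the window radius, and the name `split_near_standard_width` are all assumptions of this sketch.

```python
def split_near_standard_width(profile, standard_width, search_radius):
    """Pick a split column for a too-wide region.

    profile is the column-wise marginal distribution of the region,
    indexed from the region's left edge. The column with the smallest
    value within search_radius of standard_width is returned; ties go
    to the column closest to standard_width. Returns None if the
    window does not fit inside the region.
    """
    lo = max(1, standard_width - search_radius)
    hi = min(len(profile) - 2, standard_width + search_radius)
    if lo > hi:
        return None
    return min(range(lo, hi + 1),
               key=lambda x: (profile[x], abs(x - standard_width)))
```

Applying the function again to the remainder of the region corresponds to the repeated cutout described above.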

[0040] Because a character is extracted whenever the width obtained from the outline of its connected component is one character width, a character image such as that in FIG. 12(A) poses no problem. This image consists of three non-touching characters, "Y", "o", and "u". In the marginal distribution of FIG. 12(B), however, the distributions of "Y" and "o" overlap and form a single marginal distribution 1203. If, as in the conventional method, the characters were forcibly separated based on the shape of the marginal distribution 1203, the cut would be made at position 1202. If they were separated based on the magnitude of the marginal distribution, the cut would be made at position 1201. Either position would cut the characters at an inappropriate place, whereas the present method has no such problem.

[0041] Once the outline of the connected component has been extracted at an appropriate size, the connected component itself is extracted. A second area of the same size as the character image is provided. To extract the character "P" from the character image 1301 of FIG. 13(A), the outline of the "P" in FIG. 13(A) is first drawn in this separate area 1302. The region enclosed by the outline is then filled, giving FIG. 13(B). Next, the common part of the original image in FIG. 13(A) and the extracted character region in FIG. 13(B) is taken, giving the image shown in FIG. 13(C). As FIG. 13(C) shows, only the "P" is cleanly extracted from the positionally overlapping "P" and "e".
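The three steps (draw the outline in a second buffer, fill the enclosed region, take the common part with the original) can be sketched as below. The sketch assumes a binary image as a list of rows of 0/1 values, identifies the target character by a seed pixel, and obtains the filled region by flooding the background from outside rather than by explicit outline tracing; `extract_character` is a hypothetical name.

```python
from collections import deque


def extract_character(image, seed):
    """Return an image holding only the ink enclosed by one component's outline."""
    h, w = len(image), len(image[0])
    sx, sy = seed
    if image[sy][sx] != 1:
        raise ValueError("seed must be an ink pixel")
    # Step 1: copy the component containing the seed into a second buffer.
    component = [[0] * w for _ in range(h)]
    component[sy][sx] = 1
    queue = deque([(sx, sy)])
    while queue:
        cx, cy = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                nx, ny = cx + dx, cy + dy
                if (0 <= nx < w and 0 <= ny < h
                        and image[ny][nx] == 1 and not component[ny][nx]):
                    component[ny][nx] = 1
                    queue.append((nx, ny))
    # Step 2: flood the outside of the component in a buffer padded by one
    # pixel; whatever is not reached is the filled, outline-enclosed region.
    outside = [[False] * (w + 2) for _ in range(h + 2)]
    outside[0][0] = True
    queue = deque([(0, 0)])
    while queue:
        cx, cy = queue.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = cx + dx, cy + dy
            if not (0 <= nx < w + 2 and 0 <= ny < h + 2) or outside[ny][nx]:
                continue
            if 1 <= nx <= w and 1 <= ny <= h and component[ny - 1][nx - 1]:
                continue  # blocked by the component itself
            outside[ny][nx] = True
            queue.append((nx, ny))
    # Step 3: common part of the filled region and the original image.
    return [[1 if image[y][x] == 1 and not outside[y + 1][x + 1] else 0
             for x in range(w)] for y in range(h)]
```

Because the filled region is intersected with the original image, ink lying inside a hole of the character is kept, while neighbouring characters that merely overlap in the marginal distribution are dropped.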

[0042] As described above, according to the present invention, the standard character height and the line width to be ignored are obtained from the marginal distribution in the line direction, and the standard character spacing, the word spacing, and the standard character width are obtained with reference to these values, so all three are determined accurately. As a result, words can be cut out accurately. During character cutout, characters are cut out with reference to the cutout candidate positions and the marginal distribution while being compared against the standard character width, so characters can be cut at accurate positions. Furthermore, because a character is extracted as the image of the region enclosed by the outline of its connected component, exactly one character can be extracted even from a character image whose marginal distributions overlap so that the cutout position cannot be determined.

[0043] As described above, the present invention makes it possible to cut out individual characters, which has been considered difficult, more accurately, and thereby greatly improves the reliability of a character recognition device that uses this method as a component.

[0044] Because the present invention relies on the standard character height and on the extraction of connected components of characters, it is particularly suited to printed Western characters and the like.

【００４５】[0045]

[Effects of the Invention] As described above, according to the present invention, the marginal distribution in the line direction is counted, the standard character height is estimated from the shape of that distribution, and the line width to be ignored is determined. By collecting statistics of the portions of the marginal distribution in the direction perpendicular to the line direction whose value exceeds the line width to be ignored, characters joined at their whisker portions can be separated and the standard character width can be estimated.
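The effect of the ignored line width can be shown with a small sketch. It assumes the marginal distribution is a list of per-column ink counts and uses the median run width as the statistic; the text does not specify which statistic is taken, so the median and the names `runs_above` and `estimate_standard_width` are assumptions.

```python
from statistics import median


def column_profile(image):
    """Marginal distribution perpendicular to the line: ink count per column."""
    return [sum(col) for col in zip(*image)]


def runs_above(profile, threshold):
    """Return (left, right) column runs where the profile exceeds threshold."""
    runs, start = [], None
    for x, value in enumerate(profile):
        if value > threshold and start is None:
            start = x
        elif value <= threshold and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


def estimate_standard_width(profile, ignore_width):
    """Estimate the standard character width as the median run width."""
    widths = [right - left + 1 for left, right in runs_above(profile, ignore_width)]
    return median(widths) if widths else None
```

With a threshold of 0, two characters joined by a one-pixel stroke form a single wide run; raising the threshold to the ignored line width splits them, which is why the statistics are taken only over the larger values.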

[0046] In addition, by using the statistics to estimate the standard character width in the regions where the value of the marginal distribution is larger than approximately half of the standard character height, characters can be separated at more accurate positions.

[0047] Furthermore, in the distribution of clumps in the statistics, clumps are regarded as the same clump if the distance between them is equal to or less than a predetermined value proportional to the standard character height, which makes the cutout more reliable.
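The merging rule can be sketched as follows, assuming clumps are given as inclusive (left, right) column ranges. The proportionality factor `ratio` and the name `merge_clumps` are hypothetical; the text states only that the limit is proportional to the standard character height.

```python
def merge_clumps(runs, standard_height, ratio):
    """Merge clumps whose gap is at most ratio * standard_height columns."""
    max_gap = ratio * standard_height
    merged = []
    for left, right in sorted(runs):
        # Gap = number of empty columns between the previous clump and this one.
        if merged and left - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], right))
        else:
            merged.append((left, right))
    return merged
```

For example, with a standard character height of 10 and a ratio of 0.2, clumps separated by one or two columns are merged, while a gap of eleven columns keeps them apart.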

[Brief description of the drawings]

FIG. 1 is a block diagram of a character recognition device that uses the character cutout means of the present invention.

FIG. 2 is a flowchart of the character cutout means of the present invention.

FIGS. 3(A) and 3(B) are diagrams illustrating the method of extracting the standard character height according to the present invention.

FIGS. 4(A) to 4(D) are diagrams illustrating the estimation of the standard character spacing, word spacing, standard character width, and cutout candidate positions according to the present invention.

FIGS. 5(A) to 5(C) are diagrams illustrating the estimation of the standard character spacing, word spacing, standard character width, and cutout candidate positions according to the present invention.

FIGS. 6(A) and 6(B) are diagrams illustrating the estimation of the standard character spacing, word spacing, standard character width, and cutout candidate positions according to the present invention.

FIGS. 7(A) and 7(B) are diagrams illustrating the estimation of the standard character spacing, word spacing, standard character width, and cutout candidate positions according to the present invention.

FIG. 8 is a diagram illustrating the estimation of the standard character spacing, word spacing, standard character width, and cutout candidate positions according to the present invention.

FIG. 9 is a diagram illustrating the estimation of the standard character spacing, word spacing, standard character width, and cutout candidate positions according to the present invention.

FIG. 10 is a diagram illustrating the estimation of the standard character spacing, word spacing, standard character width, and cutout candidate positions according to the present invention.

FIGS. 11(A) and 11(B) are diagrams illustrating character extraction according to the present invention.

FIGS. 12(A) and 12(B) are diagrams illustrating character extraction according to the present invention.

FIGS. 13(A) to 13(C) are diagrams illustrating character extraction according to the present invention.

[Explanation of symbols]

101: CPU, 102: image input device, 103: recognized character display device, 104: ROM, 105: RAM, 301: line-direction marginal distribution, 304: standard character height, 401: character region, 402: character spacing, 501 to 507: character marginal distributions, 511 to 517: character regions, 521 to 526: character cutout candidate positions, 508: minimum character line width, 701: standard character height, 801 to 804: clump intervals, 805: clump interval threshold, 901 to 903: clump intervals, 904: clump interval threshold, 1004: 75% of the standard character height, 1110: minimum character line width, 1131 to 1133: character cutout positions, 1201 to 1202: character cutout positions.

Claims

(57) [Claims]

1. A character cutout method for a character recognition device that reads an image of Western characters written on paper or the like with an optical image input means and recognizes the characters in the input image data, the method characterized by: estimating a standard character height; determining, based on the standard character height, the value of a line width to be ignored; and estimating a standard character width by collecting statistics of the portions of the marginal distribution in the direction perpendicular to the line direction whose value is larger than the value of the line width to be ignored.

2. The character cutout method according to claim 1, wherein, using the statistics, the standard character width is estimated in a region where the value of the marginal distribution is larger than approximately half of the standard character height.

3. A character cutout method in which, in the distribution of clumps in the statistics, clumps are regarded as the same clump if the distance between them is equal to or less than a predetermined value proportional to the standard character height.