JPH04372087A - English character recognition device - Google Patents

English character recognition device

Info

Publication number
JPH04372087A
JPH04372087A (application JP3150020A)
Authority
JP
Japan
Prior art keywords
character
area
recognition
word
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP3150020A
Other languages
Japanese (ja)
Inventor
Michiaki Nobuoka
信岡 道明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd filed Critical Matsushita Electric Industrial Co Ltd
Priority to JP3150020A priority Critical patent/JPH04372087A/en
Publication of JPH04372087A publication Critical patent/JPH04372087A/en
Pending legal-status Critical Current

Landscapes

  • Character Input (AREA)
  • Character Discrimination (AREA)

Abstract

PURPOSE: To make it possible to accurately segment touching characters. CONSTITUTION: After areas are segmented word by word from the input image, the character area extraction unit segments each word area character by character, and the character recognition unit performs recognition character by character. Any character area whose recognition accuracy is low is then re-segmented. In the re-segmentation processing, the character strings recognized with high accuracy are matched against a word dictionary that stores the spellings of words, the character string lying in the low-accuracy area is estimated, segmentation positions are determined from a character width dictionary that stores the width of each character, and segmentation is performed again. By adding the information of the word dictionary and the character width dictionary when a character area is re-segmented, characters can be segmented at more accurate positions and recognition accuracy can be improved.

Description

[Detailed Description of the Invention]

[0001]

[Field of Industrial Application] The present invention relates to an English character recognition device for recognizing English characters.

[0002]

[Prior Art] In recent years, there has been growing demand for character recognition devices to serve as input devices for computers and similar equipment, and a character input device that can efficiently produce stable recognition results has become essential to improving the performance of such systems.

[0003] A conventional recognition device performs recognition by matching the image of a character area against a recognition dictionary. When the recognition accuracy is low, it is assumed that characters are touching within the extracted character area; the area is divided equally by the estimated number of touching characters, a histogram of black pixels in the vertical direction is computed around each division position, the position with the lowest count is taken as the character boundary, the character areas are extracted again, and recognition is repeated to obtain the result.
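
The conventional procedure can be summarized in the following minimal sketch. It assumes the suspect region is available as a binary image (a list of rows, 1 = black pixel) and that the number of touching characters has already been estimated; the helper names are illustrative, not taken from the patent.

```python
def column_histogram(region):
    """Black-pixel count for each column of a binary image given as a list of rows."""
    return [sum(col) for col in zip(*region)]

def conventional_resegment(region, n_chars, search=3):
    """Equal-division re-segmentation of the conventional device: divide the region
    into n_chars equal parts, then shift each cut to the column with the fewest
    black pixels within +/- search columns of the equal-division position."""
    hist = column_histogram(region)
    width = len(hist)
    cuts = []
    for k in range(1, n_chars):
        guess = k * width // n_chars
        lo, hi = max(0, guess - search), min(width, guess + search + 1)
        cuts.append(min(range(lo, hi), key=lambda x: hist[x]))
    return cuts  # column indices at which the region is split again
```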

[0004]

[Problem to Be Solved by the Invention] In English text, however, character spacing is not constant, so it is difficult to estimate the number of touching characters; moreover, the width of an English character varies with the character class, so simply dividing the area equally by the estimated character count does not necessarily place the divisions at or near the actual character boundaries. As a result, character areas may be extracted at incorrect positions, degrading recognition accuracy. An object of the present invention is to provide an English character recognition device that solves these problems.

[0005]

[Means for Solving the Problem] To solve this problem, the present invention extracts the character strings recognized with high accuracy by the character recognition unit, retrieves from a word dictionary, in the character area extraction unit, the words composed of those characters, and selects candidate character strings for the area whose recognition accuracy is low. The character-by-character break positions of the selected candidate strings are then estimated from a character width dictionary, the character areas are extracted again, and recognition is repeated, thereby improving recognition accuracy.

[0006]

[Operation] In the present invention, when a character area with low recognition accuracy is extracted again, candidate character strings are selected with the aid of the word dictionary, and the character-by-character break positions of the selected candidates are estimated with the aid of the character width dictionary. This allows character areas to be extracted at more accurate positions and improves recognition accuracy.

[0007]

[Embodiment] An embodiment of the present invention will be described with reference to the accompanying drawings.

[0008] In FIG. 1, reference numeral 1 denotes an image input unit that inputs a document to be recognized as a document image; 2 denotes a document area extraction unit that finds the collection of character lines in the input document image and outputs a document area; 3 denotes a word area extraction unit that finds word-by-word breaks in the document area and outputs the extent of each word as a word area; 4 denotes a character area extraction unit that finds character-by-character breaks in a word area, using a word dictionary 7 storing the spellings of English words and a character width dictionary 8 storing the width of each English character, and outputs the extent of each character as a character area; 5 denotes a character recognition unit that compares a character area against a recognition dictionary 9 prepared in advance from the graphical features of all characters, obtains the degree of match between them, and performs character recognition; 6 denotes a control unit that controls the character area extraction unit 4 and the character recognition unit 5, determines the recognized character code, and controls an output unit 10; 7 denotes the word dictionary storing the spellings of English words; 8 denotes the character width dictionary storing the width of each English character; 9 denotes the recognition dictionary prepared in advance from the graphical features of all characters; 10 denotes the output unit that outputs the determined recognized character code; 11 denotes an internal bus connecting the image input unit 1, the document area extraction unit 2, the word area extraction unit 3, the character area extraction unit 4, and the character recognition unit 5; 12 and 13 denote internal buses connecting the character area extraction unit 4 with the control unit 6 and the character recognition unit 5 with the control unit 6, respectively; 14 denotes an internal bus connecting the character area extraction unit 4 with the word dictionary 7 and the character width dictionary 8; and 15 and 16 denote internal buses connecting the control unit 6 with the output unit 10 and the character recognition unit 5 with the recognition dictionary 9, respectively.
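
As a rough structural illustration only (all names below are assumptions, not terms from the patent), the division of labour among units 1 through 10 could be organized as a single class whose data members play the role of dictionaries 7-9 and whose methods play the role of units 2-5, with the control unit 6 as the driving loop:

```python
from dataclasses import dataclass, field

@dataclass
class EnglishCharacterRecognizer:
    """Skeleton mirroring FIG. 1; the unit methods are deliberately left as stubs."""
    word_dictionary: set = field(default_factory=set)          # 7: English word spellings
    char_width_dictionary: dict = field(default_factory=dict)  # 8: width of each letter
    recognition_dictionary: dict = field(default_factory=dict) # 9: reference shapes per letter

    def extract_document_area(self, image): ...        # 2: document area extraction unit
    def extract_word_areas(self, document_area): ...   # 3: word area extraction unit
    def extract_character_areas(self, word_area): ...  # 4: may consult 7 and 8 on re-extraction
    def recognize_character(self, char_area): ...      # 5: consults 9, returns (code, similarity)

    def run(self, image):
        """Control unit 6: drive the pipeline; the returned text stands in for output unit 10."""
        words = []
        for word_area in self.extract_word_areas(self.extract_document_area(image)):
            chars = self.extract_character_areas(word_area)
            words.append("".join(self.recognize_character(c)[0] for c in chars))
        return " ".join(words)
```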

[0009] The operation of the English character recognition device configured as described above is explained below with reference to the overall flowchart in FIG. 2 and the character area extraction flowchart in FIG. 3.

[0010] First, the document to be recognized is input to the image input unit 1 as a document image (step s1; hereafter simply s1, and the word "step" is likewise omitted). The input document image is sent to the document area extraction unit 2, which computes histograms of the black pixels in the vertical and horizontal directions of the document image, extracts the document area from them, and stores the position information of the document area as internal data (s2).
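
A minimal sketch of the projection-histogram step in s2, assuming the page is a binary matrix (a list of rows, 1 = black pixel); the threshold parameter and function names are illustrative assumptions.

```python
def projection_profiles(page):
    """Black-pixel counts per row (horizontal profile) and per column (vertical profile)."""
    rows = [sum(r) for r in page]
    cols = [sum(c) for c in zip(*page)]
    return rows, cols

def document_area(page, min_black=1):
    """Bounding box (top, bottom, left, right) of the rows and columns that contain ink."""
    rows, cols = projection_profiles(page)
    text_rows = [i for i, v in enumerate(rows) if v >= min_black]
    text_cols = [j for j, v in enumerate(cols) if v >= min_black]
    return text_rows[0], text_rows[-1], text_cols[0], text_cols[-1]
```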

[0011] The position information of the document area is sent to the word area extraction unit 3, which extracts word areas within the document area. Noting that the blank space before and after a word is wider than the space between characters, the word area extraction unit 3 extracts as a word area any character string bounded by blank spaces wider than a given threshold. The position information of every word area within the document area found by the document area extraction unit 2 is obtained and stored as internal data (s3).
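
The gap rule of s3 can be sketched as follows; the gap threshold is an assumed parameter. The same routine, run with a threshold of a single blank column, also serves the character-by-character split performed later in s5.

```python
def split_by_gaps(col_hist, min_gap):
    """Split a line, given as its column histogram of black pixels, at runs of empty
    columns at least min_gap wide; returns (start, end) column ranges, end exclusive."""
    regions, start, gap = [], None, 0
    for x, v in enumerate(col_hist):
        if v > 0:
            if start is None:
                start = x          # a new region begins at the first inked column
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:     # blank run wide enough to close the region
                regions.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        regions.append((start, len(col_hist)))
    return regions

# Word areas: split_by_gaps(hist, min_gap=wide_threshold); character areas: min_gap=1.
```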

[0012] Based on the position information of the document area and word areas obtained in s1 through s3, character area extraction and character recognition are performed. Extraction and recognition are carried out word by word; one word is recognized in s4 through s8.

[0013] Two kinds of flags are used in the processing, with the following meanings.

[0014] The re-extraction flag indicates whether character areas are being extracted from the word area under processing for the first time or are being re-extracted. The value "0" indicates the first extraction from the word area, and the value "1" indicates re-extraction from the word area.

[0015] The candidate character string flag indicates, during re-extraction, whether candidate character strings exist. The value "0" indicates that no candidate character string exists, and the value "1" indicates that a candidate character string exists.

[0016] First, the re-extraction flag and the candidate character string flag are set to "0" (s4).

[0017] Noting that an English character is not separated into left and right parts, the character area extraction unit 4 treats a region bounded by blank space on both sides as one character, separates the run of characters within the word area into character units, and extracts them as character areas. The position information of every character area within the word area extracted by the word area extraction unit 3 is obtained and stored as internal data (s5 in FIG. 2, s24 in FIG. 3).

[0018] The position information of each character area is sent to the character recognition unit 5, which recognizes the character within the area. The character recognition unit 5 computes, as a similarity value, the similarity in shape between the image of the character area of interest and each recognition target character, and stores as internal data the recognized character code of the target character with the highest similarity together with its similarity value. This processing is performed for every character area within the word area (s6).
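
A toy version of the comparison in s6, assuming the recognition dictionary stores one binary template per character and that the character image has been normalized to the template size. Pixel agreement is used here only as an illustrative stand-in for the graphical features actually intended.

```python
def pixel_similarity(image, template):
    """Fraction of positions at which a binary image and a binary template agree."""
    total = agree = 0
    for row_i, row_t in zip(image, template):
        for a, b in zip(row_i, row_t):
            total += 1
            agree += (a == b)
    return agree / total

def recognize(char_image, recognition_dictionary):
    """Return (best_character_code, similarity) over all dictionary templates."""
    best_char, best_sim = None, -1.0
    for char, template in recognition_dictionary.items():
        sim = pixel_similarity(char_image, template)
        if sim > best_sim:
            best_char, best_sim = char, sim
    return best_char, best_sim
```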

[0019] The recognized character codes and similarity values are sent to the control unit 6, which determines whether character extraction must be performed again. The control unit judges whether the similarity value of each character is high enough to accept as a recognition result (s7). If the similarity values of all characters are acceptable, character area extraction and character recognition end, and the recognized character codes are sent to the output unit. If any character has a similarity value that is not acceptable, the control unit determines whether the character area can be extracted again (s9).

[0020] In this determination, if the re-extraction flag is "0" or the candidate character string flag is "1", the position information and recognized character codes of the acceptable characters are sent to the character area extraction unit 4, and the character areas are extracted again. If the re-extraction flag is "1" and the candidate character string flag is "0", character area extraction and character recognition end, the recognition result is rejected, and a reject code is sent to the output unit (s10). The output unit outputs the recognized character codes or the reject code it receives (s8).
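
The decisions of s7 through s10 amount to the following control flow; the acceptance threshold and the encoding of the flags as plain integers are assumptions made for illustration.

```python
ACCEPT_THRESHOLD = 0.9  # assumed cut-off below which a similarity is not accepted

def decide(results, recut_flag, candidate_flag):
    """results: list of (char_code, similarity) for one word.
    Returns 'output', 'recut', or 'reject' following the flag logic of s7-s10."""
    if all(sim >= ACCEPT_THRESHOLD for _, sim in results):
        return "output"                       # s7: every character accepted -> s8
    if recut_flag == 0 or candidate_flag == 1:
        return "recut"                        # s9/s10: re-extraction is still possible
    return "reject"                           # re-extraction exhausted -> reject code
```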

[0021] The process of extracting character areas when touching characters exist within a word area is illustrated below, taking as an example the word "southwest" in which the "t" and "h" are touching.

[0022] In the character area extraction described above, eight character areas are extracted: "s", "o", "u", "th", "w", "e", "s", and "t". In the character recognition processing, "s", "o", "u", "w", "e", "s", and "t" are judged acceptable as recognition results, but "th" is not.

[0023] The control unit passes the character codes "s", "o", "u", "*", "w", "e", "s", "t" to the character area extraction unit.

[0024] Each of these recognized character codes ("*" denotes a reject) is sent to the character area extraction unit 4 together with its position information.

[0025] The character re-extraction process is described below, following the detailed flowchart of the character area extraction processing in FIG. 3.

[0026] It is judged from the re-extraction flag ("0" or "1") whether this is the first character area extraction or a re-extraction (s13). In the case of re-extraction, it is further judged whether this is the first re-extraction. If it is, the recognized character codes and position information that were sent are stored as internal data and candidate character strings are obtained. Candidate character strings are found by matching the character strings recognized with high accuracy against the word dictionary 7, which stores word spellings in advance; every word whose spelling matches those strings is taken as a candidate character string and stored as internal data, and the candidate character string flag is set to "1". In this example, searching the word dictionary 7 for words whose first three letters are "sou" and whose last four letters are "west" yields the candidate character string "southwest" (s13, s14, s15, s16, s17, s18).
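
The dictionary match of s16 through s18 can be sketched with a regular expression built from the accepted characters, treating each rejected area as a wildcard for one or more letters; the use of the re module here is an illustrative choice, not something stated in the patent.

```python
import re

def candidate_strings(recognized, word_dictionary):
    """recognized: per-area character codes with '*' marking a rejected area,
    e.g. ['s', 'o', 'u', '*', 'w', 'e', 's', 't'].
    Returns the dictionary words consistent with the accepted characters."""
    pattern = "^" + "".join(".+" if c == "*" else re.escape(c) for c in recognized) + "$"
    return [w for w in word_dictionary if re.match(pattern, w)]

# With word_dictionary = {"southwest", "southeast", "sweet"} the call
# candidate_strings(list("sou*west"), word_dictionary) returns ["southwest"].
```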

[0027] If no matching word exists, the candidate character string flag is set to "0" and the character area extraction processing ends (s23).

[0028] Next, one of the candidate character strings is selected, the number and identity of the characters assumed to lie in the character area whose recognition accuracy was low are determined, and the character-by-character break positions are estimated from the character width dictionary 8, which stores character widths in advance. In this example, from the candidate character string "southwest" it is assumed that the two characters "t" and "h" are touching within the low-accuracy character area, and the character widths of "t" and "h" are obtained from the character width dictionary 8 (let these values be wt and wh).

[0029] The character break position is estimated as

  character break position = left edge of the low-accuracy area + width of the low-accuracy area × wt / (wt + wh)

and stored as internal data. When three or more characters are touching, the break position of each character is estimated in the same manner (s19).
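
A sketch of the estimate in s19. Extending it to three or more assumed characters by placing each cut in proportion to the cumulative dictionary widths is an assumption; the two-character case reduces to the formula above.

```python
def estimated_breaks(left, width, chars, char_width_dictionary):
    """left, width: position and width of the low-accuracy area.
    chars: the characters assumed to occupy it, e.g. "th".
    Returns the estimated break positions between consecutive characters."""
    widths = [char_width_dictionary[c] for c in chars]
    total = sum(widths)
    cuts, acc = [], 0
    for w in widths[:-1]:
        acc += w
        cuts.append(left + width * acc / total)  # for "th": left + width * wt / (wt + wh)
    return cuts

# Example: estimated_breaks(120, 30, "th", {"t": 10, "h": 14}) returns [132.5].
```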

[0030] A histogram of the black pixels in the vertical direction is computed around the estimated break position, the position with the lowest count is taken as the character break, the character areas are extracted again at that position, and their position information is stored as internal data (s21).
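
The refinement of s21, snapping the estimated break to the least dense nearby column, could look like this; the search radius is an assumed parameter.

```python
def refine_break(col_hist, estimated, radius=3):
    """Snap an estimated break position to the column with the fewest black pixels
    within +/- radius columns of the estimate."""
    x0 = int(round(estimated))
    lo, hi = max(0, x0 - radius), min(len(col_hist), x0 + radius + 1)
    return min(range(lo, hi), key=lambda x: col_hist[x])
```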

[0031] It is then judged whether further candidate character strings exist; if no more candidate character strings remain, the candidate character string flag is set to "0" (s22, s23).

[0032] The position information of the newly extracted character areas is sent to the character recognition unit 5, and recognition is performed within those character areas.

[0033] Steps s4 through s8 above are repeated until every word in the document has been recognized (s12), thereby completing character recognition of the given document image.

[0034]

[Effect of the Invention] As described above, the present invention makes it possible to locate the positions where characters touch with high accuracy and to extract character areas accordingly, thereby improving recognition accuracy.

[Brief Description of the Drawings]

[FIG. 1] An overall configuration diagram showing the configuration of an English character recognition device in an embodiment of the present invention.

[FIG. 2] An overall flowchart showing the control procedure of the embodiment.


[FIG. 3] A flowchart showing the control procedure of the character area extraction processing in the embodiment.

[Explanation of Symbols]

1  Image input unit
2  Document area extraction unit
3  Word area extraction unit
4  Character area extraction unit
5  Character recognition unit
6  Control unit
7  Word dictionary
8  Character width dictionary
9  Recognition dictionary
10  Output unit

Claims (1)

[Claims]

[Claim 1] An English character recognition device comprising: an image input unit that inputs a document to be recognized; a document area extraction unit that outputs a document area from the input document image; a word area extraction unit that outputs word areas from the document area; a character area extraction unit that outputs character areas from a word area on the basis of an English word dictionary storing the spellings of English words and a character width dictionary storing the width of each English character; a character recognition unit that performs recognition on the images of the extracted character areas on the basis of a recognition dictionary storing all characters; a control unit that controls the character recognition unit and the character area extraction unit and determines the recognized character codes; and an output unit that outputs the determined recognized character codes.
JP3150020A 1991-06-21 1991-06-21 English character recognition device Pending JPH04372087A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP3150020A JPH04372087A (en) 1991-06-21 1991-06-21 English character recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP3150020A JPH04372087A (en) 1991-06-21 1991-06-21 English character recognition device

Publications (1)

Publication Number Publication Date
JPH04372087A true JPH04372087A (en) 1992-12-25

Family

ID=15487737

Family Applications (1)

Application Number Title Priority Date Filing Date
JP3150020A Pending JPH04372087A (en) 1991-06-21 1991-06-21 English character recognition device

Country Status (1)

Country Link
JP (1) JPH04372087A (en)

Similar Documents

Publication Publication Date Title
Wang et al. A study on the document zone content classification problem
CN110807322B (en) Method, device, server and storage medium for identifying new words based on information entropy
JPH04372087A (en) English character recognition device
JPH0528324A (en) English character recognition device
JPH06215184A (en) Labeling device for extracted area
JP2985243B2 (en) Character recognition method
JP3116452B2 (en) English character recognition device
JPH11328315A (en) Character recognizing device
JPH09274645A (en) Method and device for recognizing character
JPH04289989A (en) Roman letter recognizing device
JP2963474B2 (en) Similar character identification method
JP3197441B2 (en) Character recognition device
JP2746345B2 (en) Post-processing method for character recognition
JP3151866B2 (en) English character recognition method
JP2995825B2 (en) Japanese character recognition device
JPH0452783A (en) Graphic reader
JPH0589294A (en) English character recognizing device
JPH04306786A (en) Character recognizing device
JP2001266070A (en) Device and method for recognizing character and storage medium
JP3595081B2 (en) Character recognition method
JPH0728944A (en) English character recognition device
JPH06195508A (en) Character segmenting method
JPH0221630B2 (en)
JPH08202830A (en) Character recognition system
JPH05108880A (en) English character recognition device