JPH06243294A

JPH06243294A - Character recognition postprocessing device

Info

Publication number: JPH06243294A
Application number: JP5047374A
Authority: JP
Inventors: Yoshitaka Hamaguchi; 佳孝濱口; Sadamasa Hirogaki; 節正広垣; Naohiro Amamoto; 直弘天本
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1993-02-12
Filing date: 1993-02-12
Publication date: 1994-09-02

Abstract

PURPOSE: To perform character recognition post-processing properly. CONSTITUTION: A character segmentation unit 3 segments a document image 2 into individual character images, and a character recognition unit 4 outputs character codes by pattern recognition. Meanwhile, a standard character spacing calculation unit 51 calculates the standard spacing between characters in the document image 2 from the segmentation information obtained by the character segmentation unit 3. A character spacing calculation unit 52 divides each character spacing by this standard character spacing to obtain a normalized character spacing. A character spacing collation unit 54 compares the normalized character spacing with a predetermined value stored in character segmentation knowledge data 53; if it is smaller than the predetermined value, the adjacent characters are treated as a single character. Of the character codes input from the character recognition unit 4, a language knowledge collation unit 6 replaces the portion judged to be a segmentation error by a character segmentation error detection unit 5 with the character code from the character segmentation knowledge data 53 supplied through the character spacing collation unit 54. The character codes after replacement are then collated against a word dictionary 7.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a character recognition post-processing device that automatically corrects character recognition results.

[0002]

[Prior Art] Conventionally, there are character recognition devices that segment characters from an optically read document image and recognize the segmented characters. Such a device extracts the recognized characters word by word and compares them with the words in a word dictionary, thereby detecting and correcting character recognition errors. This processing is called character recognition post-processing. FIG. 2 is a block diagram of an example of a conventional character recognition post-processing device.

[0003] The character recognition post-processing device 21 shown in FIG. 2 receives character codes that were segmented from an optically read document image 22 by a character segmentation unit 23 and recognized by a character recognition unit 24. The character segmentation unit 23 first extracts a character line such as the one shown in FIG. 3 from the document image 22, and then segments that line into the individual characters “’”, “a”, “b”, and so on. For each segmented character, the character recognition unit 24 determines by pattern recognition its degree of match with candidate characters, ranks the candidates from first to n-th in descending order of match, and outputs their character codes.

[0004] The character recognition post-processing device 21 consists of a language knowledge collation unit 25 and a word dictionary 26. The language knowledge collation unit 25 divides the sequence of recognized character codes into words and collates each against the candidate words in the word dictionary 26. Known techniques for finding word boundaries include, for English documents, detecting the blank space between words. As a specific example of collation with candidate words, the characters output as first candidates by the character recognition unit 24 are lined up to form a reference word; the words sharing the largest number of matching characters with this reference word are retrieved from the word dictionary 26; and the candidate word with the smallest cost value, defined as the sum of the per-character match degrees between the two words, is output (see, for example, Japanese Examined Patent Publication No. Sho 61-20038).

[0005]

[Problem to Be Solved by the Invention] However, the conventional technique described above has the following problem. The characters segmented by the character segmentation unit 23 are recognized by the character recognition unit 24 and converted into character codes, and the sequence of those codes is used as a reference word for collation with the candidates in the word dictionary. Consequently, when the character segmentation unit 23 segments a character incorrectly, proper post-processing is impossible. For example, as shown in FIG. 4, when a segmentation error splits “"” into “’” and “’”, the error cannot be detected.

[0006] Likewise, as shown in FIGS. 7 and 8, faint or broken print can cause “m” to be segmented as “r” and “n”, or “w” as “v” and “v”, in which case the language knowledge collation unit 25 cannot collate the word correctly. For example, if the document image 22 contains the word “memory” and the character segmentation unit 23 segments it as “rnemory”, the language knowledge collation unit 25 uses that string as the reference word and retrieves the nearest candidate word from the word dictionary 26. Correcting segmentation errors in post-processing therefore requires steps such as collating against the word dictionary while assuming a possible segmentation error at every character. Post-processing then takes a very long time, and the word may even be corrected to something entirely different from “memory”. The present invention was made in view of these points, and its object is to provide a character recognition post-processing device that detects character segmentation errors so that words can be collated properly.

[0007]

[Means for Solving the Problem] The character recognition post-processing device of the present invention is characterized by comprising: a character segmentation error detection unit that detects character segmentation errors on the basis of each character segmented from a document image and the coordinates of each character, and outputs candidate characters; and a language knowledge collation unit that includes the candidate characters output by the character segmentation error detection unit in the character recognition result and collates it against language knowledge.

[0008]

[Operation] In the character recognition post-processing device of the present invention, an optically read document image is segmented into individual character images by the character segmentation unit, and the character recognition unit outputs character codes by pattern recognition. Meanwhile, using the segmentation information obtained by the character segmentation unit, the standard character spacing calculation unit of the character segmentation error detection unit calculates the standard spacing between characters in the document image. The character spacing calculation unit divides each character spacing by this standard character spacing to obtain a normalized character spacing. The character spacing collation unit then compares the normalized character spacing with a predetermined value stored in the character segmentation knowledge data and, if it is smaller than the predetermined value, treats the adjacent characters as a single character. The language knowledge collation unit then receives from the character spacing collation unit the portions, among the character codes input from the character recognition unit, that the character segmentation error detection unit judged to be segmentation errors. Here the portions judged to be segmentation errors may be replaced with character codes from the character segmentation knowledge data. Collation against the word dictionary follows.

[0009]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram of an embodiment of the character recognition post-processing device of the present invention. The character recognition post-processing device 1 shown in FIG. 1 receives the characters segmented from the document image 2 by the character segmentation unit 3, after the character recognition unit 4 has converted them into character codes, together with the character segmentation information from the character segmentation unit 3. The character segmentation unit 3 segments a character line such as the one shown in FIG. 3 into the individual characters “‘”, “a”, “b”, and so on. Characters can be segmented by, for example, finding the X coordinates of the character line at which every pixel in the Y direction is white. In FIG. 3, for instance, all pixels between coordinates Y1 and Y2 are white at each of the X coordinates X1 to X10, so the character line is cut at X1 to X10 into the characters “‘”, “a”, “b”, and so on.
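
The column-scanning idea described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function name `segment_columns`, the 0/1 pixel encoding, and the row-major list representation are assumptions made for the example.

```python
from typing import List, Tuple


def segment_columns(image: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (start_x, end_x) spans of consecutive non-blank columns.

    `image` is a row-major grid for one character line, where 1 is a
    black pixel and 0 is a white pixel. end_x is exclusive.
    """
    if not image:
        return []
    width = len(image[0])
    # A column is blank when every pixel along the Y direction is white.
    blank = [all(row[x] == 0 for row in image) for x in range(width)]

    spans = []
    start = None
    for x in range(width):
        if not blank[x] and start is None:
            start = x                  # a character begins here
        elif blank[x] and start is not None:
            spans.append((start, x))   # a blank column ends the character
            start = None
    if start is not None:
        spans.append((start, width))   # character touching the right edge
    return spans


image = [
    [1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 1],
]
print(segment_columns(image))  # [(0, 2), (4, 5), (6, 7)]
```

Note that a single blank column inside a character, such as the gap between the two strokes of a double-quote mark, is enough to split it into two spans, which is exactly the error the device is meant to detect.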

[0010] The segmentation result for the relevant portion of FIG. 3 is indicated by the broken lines in FIG. 4. That is, “’” is segmented between coordinates X5 and X6. Although “"” is a single character, every pixel in the Y direction is white at coordinate X7, so “’” is segmented between coordinates X6 and X7, and another “’” between coordinates X7 and X8. The character recognition unit 4 therefore recognizes this portion of FIG. 3 as “’”, “’”, “’”. The character recognition post-processing device 1 consists of a character segmentation error detection unit 5, a language knowledge collation unit 6, and other components. The character segmentation error detection unit 5 receives the character segmentation information from the character segmentation unit 3 (for example, a number indicating each segmented character's position counted from the start of the document, and the coordinates of the cut positions) and the character codes from the character recognition unit 4; on the basis of these it detects character segmentation errors and outputs candidate characters. The language knowledge collation unit 6 includes the candidate characters output by the character segmentation error detection unit 5 in the recognition result of the character recognition unit 4 and collates it against language knowledge, for example the candidate words in the word dictionary 7.

[0011] The character segmentation error detection unit 5 consists of a standard character spacing calculation unit 51, a character spacing calculation unit 52, character segmentation knowledge data 53, and a character spacing collation unit 54. The standard character spacing calculation unit 51 uses the character segmentation information from the character segmentation unit 3, for example the X and Y coordinates of the segmented characters, to find the number of pixels between adjacent characters, and takes this as the character spacing. For example, as shown in FIG. 4, the spacing between the “’” cut at coordinate X6 and the “’” cut at coordinate X7 is 75 pixels, and the spacing between the “’” cut at coordinate X7 and the “’” cut at coordinate X8 is 6 pixels. The standard character spacing calculation unit 51 obtains these spacings as each character is segmented. Once all characters have been segmented and all spacings obtained, it sums the spacings and divides the total by the number of spacings (= number of characters - 1) to obtain the average character spacing. The standard character spacing calculation unit 51 takes, for example, this average as the standard character spacing of the document image 2. Alternatively, the average spacing over the first few lines of the document, for example, may be used as the standard character spacing.
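
As a rough illustration of the averaging step, the sketch below derives the gaps from character spans and averages them. The span representation (start and exclusive end X coordinates), the function names, and the sample coordinates are assumptions chosen for the example, not values from the patent.

```python
from typing import List, Tuple


def character_gaps(spans: List[Tuple[int, int]]) -> List[int]:
    """Pixel count between each pair of neighbouring characters.

    Each span is (start_x, end_x) with end_x exclusive, ordered left to right.
    """
    return [right[0] - left[1] for left, right in zip(spans, spans[1:])]


def standard_spacing(gaps: List[int]) -> float:
    """Average gap: the total of all gaps divided by the number of gaps,
    which equals the number of characters minus one."""
    if not gaps:
        raise ValueError("need at least two characters to measure a gap")
    return sum(gaps) / len(gaps)


spans = [(0, 10), (25, 35), (49, 60), (76, 85)]   # hypothetical coordinates
gaps = character_gaps(spans)
print(gaps)                    # [15, 14, 16]
print(standard_spacing(gaps))  # 15.0
```

Restricting `gaps` to the first few lines of the document gives the alternative mentioned at the end of the paragraph.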

[0012] The character spacing collation unit 54 detects character recognition errors from the character codes that are the recognition result of the character recognition unit 4 and the character spacings calculated from the segmentation information of the character segmentation unit 3. First, the average character spacing obtained as described above is taken as the standard character spacing and compared with each character spacing. For convenience, the character spacing calculation unit 52 divides each character spacing by the standard character spacing to obtain a normalized character spacing. For example, if the standard character spacing is 15 pixels, then as shown in FIG. 5 a spacing of 75 pixels gives a normalized spacing of 5, and a spacing of 6 pixels gives a normalized spacing of 0.4. Comparing the normalized spacing with a predetermined value shown in FIG. 6 (for example, 0.5) determines whether what was segmented as two characters is in fact one character. In this way the “’” and “’” segmented as two characters in the relevant portion of FIG. 3 are judged to be an erroneous split of the single character “"”. Segmentation errors are judged from the normalized spacing, the ratio to the standard character spacing, rather than from the raw spacing, because character spacing is narrow in some texts and wide in others.
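
The normalization and threshold test can be written in a few lines. The sketch below reuses the figures from the paragraph above (standard spacing 15, gaps of 75 and 6 pixels, threshold 0.5); the function names are hypothetical.

```python
def normalized_spacing(gap: float, standard: float) -> float:
    """Ratio of one character gap to the document's standard spacing."""
    if standard <= 0:
        raise ValueError("standard spacing must be positive")
    return gap / standard


def looks_like_split(gap: float, standard: float, threshold: float = 0.5) -> bool:
    """True when the gap is small enough that the two neighbouring
    pieces are probably halves of a single character."""
    return normalized_spacing(gap, standard) < threshold


print(normalized_spacing(75, 15))  # 5.0
print(normalized_spacing(6, 15))   # 0.4
print(looks_like_split(75, 15))    # False
print(looks_like_split(6, 15))     # True
```

Because the test works on the ratio, the same threshold applies whether a document is set tightly or loosely.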

[0013] To make error detection more accurate, the predetermined value used to judge the normalized character spacing is varied according to the adjacent characters, as shown in FIG. 6. Character segmentation error knowledge data is configured for this purpose as follows. As shown in FIG. 6, the judgment value for the normalized spacing when the adjacent characters are “’” and “’” is 0.5. That is, when the adjacent characters are “’” and “’” and the normalized spacing is smaller than 0.5, they are treated not as two characters but as the single candidate character “"”. The character segmentation knowledge dictionary shown in FIG. 6 also stores the following data. For example, the “m”, “w”, and “x” in their respective portions of FIG. 3 are each a single character but may be recognized as two. Such misrecognition is caused by a white line in the Y direction appearing across a character, owing to faint ink or the like on the document or to a fault in the reading unit such as a defective pixel in the line sensor.
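
One plausible way to hold this per-pair knowledge is a lookup table keyed by the adjacent characters. The sketch below is an assumption about representation only; the table name and function name are hypothetical, and the entries for the split forms of “m”, “w”, and “x” use the judgment values this description gives for those pairs (0.3, 0.25, and 0.3).

```python
from typing import Dict, Optional, Tuple

# (left character, right character) -> (judgment value, candidate character)
SPLIT_KNOWLEDGE: Dict[Tuple[str, str], Tuple[float, str]] = {
    ("'", "'"): (0.5, '"'),
    ("r", "n"): (0.3, "m"),
    ("v", "v"): (0.25, "w"),
    (">", "<"): (0.3, "x"),
}


def merge_candidate(left: str, right: str, normalized_gap: float) -> Optional[str]:
    """Return the single candidate character for a suspicious pair,
    or None when the pair is unknown or the gap is not small enough."""
    entry = SPLIT_KNOWLEDGE.get((left, right))
    if entry is None:
        return None
    threshold, merged = entry
    return merged if normalized_gap < threshold else None


print(merge_candidate("'", "'", 0.4))   # "
print(merge_candidate("r", "n", 0.26))  # m
print(merge_candidate("r", "n", 1.0))   # None
```

Keying on the pair keeps ordinary sequences such as “rn” in “burn” untouched as long as their spacing is normal.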

[0014] For example, as shown in FIG. 7, when a white line runs through the middle of “m”, the character segmentation unit 3 segments it as two characters, “r” and “n”. As shown in FIG. 8, when a white line runs through the middle of “w”, it is segmented as the two characters “v” and “v”. And as shown in FIG. 9, when a white line runs through the middle of “x”, it is segmented as the two characters “>” and “<”. The character segmentation error knowledge data therefore lists “m” as the candidate character when the adjacent characters are “r” and “n” and the normalized spacing is smaller than the judgment value 0.3; “w” when the adjacent characters are “v” and “v” and the normalized spacing is smaller than 0.25; and “x” when the adjacent characters are “>” and “<” and the normalized spacing is smaller than 0.3.

[0015] When the character segmentation error detection unit 5 detects a segmentation error made by the character segmentation unit 3, what was segmented as two characters is corrected to a single candidate character according to the character segmentation error knowledge data of FIG. 6 and passed to the language knowledge collation unit 6. The language knowledge collation unit 6, for example, collates the character codes received from the character recognition unit 4 against the words in the word dictionary 7; if no word with a sufficiently high degree of match is found, it collates against the words in the word dictionary 7 again using the corrected candidate characters obtained from the character segmentation error detection unit 5, performs post-processing, and outputs the result from the output unit 8.
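
The two-stage collation described here might look like the following sketch. The patent does not specify how the degree of match is computed, so the similarity ratio from Python's standard `difflib.SequenceMatcher` is used purely as a stand-in; the 0.9 cut-off, the function names, and the tiny dictionary are likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def best_match(word: str, dictionary: List[str]) -> Tuple[float, str]:
    """Return (score, entry) for the dictionary entry most similar to `word`.

    The score is difflib's similarity ratio in [0, 1], a stand-in for
    the unspecified degree of match. The dictionary must not be empty.
    """
    scored = [(SequenceMatcher(None, word, entry).ratio(), entry)
              for entry in dictionary]
    return max(scored)


def collate(raw: str, corrected: Optional[str], dictionary: List[str],
            min_score: float = 0.9) -> str:
    """Collate the raw recognition result first; fall back to the
    segmentation-corrected string only when the raw match is too weak."""
    score, entry = best_match(raw, dictionary)
    if score >= min_score or corrected is None:
        return entry
    return best_match(corrected, dictionary)[1]


words = ["memory", "remedy", "money"]
print(collate("rnemory", "memory", words))  # memory
```

Trying the raw string first means the corrected candidates cost extra work only for words that fail the first pass.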

[0016] Next, the operation of the character recognition post-processing device described above is explained. Suppose the document image 2 shown in FIG. 1 contains the character line shown in FIG. 3. When the character segmentation unit 3 segments the character “‘” at coordinate X1, it inputs the coordinate to the character segmentation error detection unit 5. At the same time, it inputs the segmented character image of “‘” to the character recognition unit 4. The character recognition unit 4 searches by pattern matching for the reference pattern closest to the character image of “‘” and outputs the corresponding character code. Normally the character code of “‘” is output.

[0017] Meanwhile, the standard character spacing calculation unit 51 of the character segmentation error detection unit 5 detects and stores the spacing between the character “‘” and the character “a” that the character segmentation unit 3 segments next. It also counts the characters; at this point the count is 1. When the next character “b” is segmented, it detects and stores the spacing between “a” and “b” and increments the count to 2. After all character spacings have been detected and stored, it sums the stored spacings and divides the total by “number of characters - 1” to calculate the standard character spacing.

[0018] Suppose the “m” in the relevant portion of FIG. 3 is cut at coordinate X8′ as shown in FIG. 7. The character recognition unit 4 then recognizes the single character “m” as the two characters “r” and “n”. Meanwhile, the character segmentation error detection unit 5 calculates the normalized character spacing from the standard character spacing of 15 calculated by the standard character spacing calculation unit 51 and the distance of 4 pixels between “r” and “n”. That is, the character spacing calculation unit 52 computes 4 ÷ 15 ≈ 0.26 to obtain a normalized spacing of 0.26. FIG. 10 shows this result. The character spacing collation unit 54 then consults the character segmentation knowledge data 53: it compares the normalized spacing of 0.26 calculated by the character spacing calculation unit 52 with the judgment value 0.3 that the character segmentation knowledge data 53 of FIG. 6 holds for the adjacent characters “r” and “n”. Because the former is smaller than the latter, the segmentation by the character segmentation unit 3 is judged to be an error, as FIG. 10 also shows. “m” is then input to the language knowledge collation unit 6 as the candidate character, and the language knowledge collation unit 6 replaces the “r” and “n” sent from the character recognition unit 4 with the “m” sent from the character segmentation error detection unit 5.

[0019] The replacement character need not be sent from the character segmentation error detection unit 5. The language knowledge collation unit 6 may instead receive only information indicating which character positions constitute a segmentation error, and perform the corresponding processing itself. For example, the language knowledge collation unit 6 may hold the knowledge that “r” + “n” = “m” and perform the replacement using it, or it may perform word collation treating the characters at the erroneous cut as having produced no candidate character. As a result, even when the word “memory” in the document image 2 is segmented by the character segmentation unit 3 as “rnemory” and the language knowledge collation unit 6 cannot retrieve a seven-character word with a sufficiently high degree of match from the word dictionary 7, the character segmentation error detection unit 5 detects the segmentation error, and using the character codes corrected to “memory” the language knowledge collation unit 6 can retrieve the nearest six-character candidate word, “memory”, from the word dictionary 7.
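
To make the “rnemory” example concrete, the sketch below walks a recognized character sequence and merges any known pair whose normalized gap falls below its judgment value. The function name, the argument layout, and the sample gap values (other than the 4-pixel gap and the standard spacing of 15) are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Knowledge = Dict[Tuple[str, str], Tuple[float, str]]


def correct_splits(chars: List[str], gaps: List[float], standard: float,
                   knowledge: Knowledge) -> str:
    """Merge adjacent characters that the knowledge table marks as a
    likely split of one character.

    gaps[i] is the pixel gap between chars[i] and chars[i + 1].
    """
    if len(gaps) != max(len(chars) - 1, 0):
        raise ValueError("need exactly one gap per adjacent character pair")
    out = []
    i = 0
    while i < len(chars):
        if i + 1 < len(chars):
            entry = knowledge.get((chars[i], chars[i + 1]))
            if entry is not None and gaps[i] / standard < entry[0]:
                out.append(entry[1])   # replace the pair with one character
                i += 2
                continue
        out.append(chars[i])
        i += 1
    return "".join(out)


knowledge = {("r", "n"): (0.3, "m")}
print(correct_splits(list("rnemory"), [4, 15, 16, 14, 15, 15], 15, knowledge))
# memory
print(correct_splits(list("burn"), [15, 14, 16], 15, knowledge))
# burn
```

The second call shows why the spacing test matters: a genuine “r” followed by “n” at normal spacing is left alone.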

[0020] Similarly, suppose the “w” in the relevant portion of FIG. 3 is cut at coordinate X9′ as shown in FIG. 8. The character recognition unit 4 then recognizes the single character “w” as the two characters “v” and “v”. Meanwhile, the character segmentation error detection unit 5 calculates the normalized character spacing from the standard character spacing of 15 calculated by the standard character spacing calculation unit 51 and the distance of 3 pixels between “v” and “v”. That is, the character spacing calculation unit 52 computes 3 ÷ 15 = 0.2 to obtain a normalized spacing of 0.2. FIG. 11 shows this result. The character spacing collation unit 54 then consults the character segmentation knowledge data 53: it compares the normalized spacing of 0.2 calculated by the character spacing calculation unit 52 with the judgment value 0.25 that the character segmentation knowledge data 53 of FIG. 6 holds for the adjacent characters “v” and “v”. Because the former is smaller than the latter, the segmentation by the character segmentation unit 3 is judged to be an error, as FIG. 11 also shows. “w” is then input to the language knowledge collation unit 6 as the candidate character, and the language knowledge collation unit 6 replaces the “v” and “v” sent from the character recognition unit 4 with the “w” sent from the character segmentation error detection unit 5.

[0021] Further, suppose the “x” in the relevant portion of FIG. 3 is cut at coordinate X10′ as shown in FIG. 9. The character recognition unit 4 then recognizes the single character “x” as the two characters “>” and “<”. Meanwhile, the character segmentation error detection unit 5 calculates the normalized character spacing from the standard character spacing of 15 calculated by the standard character spacing calculation unit 51 and the distance of 4 pixels between “>” and “<”. That is, the character spacing calculation unit 52 computes 4 ÷ 15 ≈ 0.26 to obtain a normalized spacing of 0.26. FIG. 12 shows this result. The character spacing collation unit 54 then consults the character segmentation knowledge data 53: it compares the normalized spacing of 0.26 calculated by the character spacing calculation unit 52 with the judgment value 0.3 that the character segmentation knowledge data 53 of FIG. 6 holds for the adjacent characters “>” and “<”. Because the former is smaller than the latter, the segmentation by the character segmentation unit 3 is judged to be an error, as FIG. 12 also shows. “x” is then input to the language knowledge collation unit 6 as the candidate character, and the language knowledge collation unit 6 replaces the “>” and “<” sent from the character recognition unit 4 with the “x” sent from the character segmentation error detection unit 5.

[0022] In this way, when the spacing between “’” and “’”, “r” and “n”, “v” and “v”, “>” and “<”, and the like is small compared with the standard character spacing, the segmentation by the character segmentation unit 3 is detected as an error, and the pairs are replaced with “"”, “m”, “w”, “x”, and so on, respectively. The language knowledge collation unit 6 can therefore collate against the candidate words in the word dictionary 7 using appropriate candidate characters, so that character recognition post-processing is performed properly and includes correction of character segmentation errors.

[0023] In the embodiment described above, the character spacing between two adjacent characters is compared with a predetermined value, but the present invention is not limited to this. For example, a cutout error may be detected by making a relative comparison with the preceding and following character spacings. That is, as shown in FIG. 4, when the character spacing at coordinate X7 is 6 pixels, it may be compared with the character spacing of "75 pixels" at the immediately preceding coordinate X6; because the character spacing at coordinate X7 is relatively narrow, a cutout error for the double quotation mark (") may be detected. The embodiment described above also deals with detecting a cutout error in which one character is cut into two characters, but the present invention is not limited to this and can also be applied to detecting a cutout error in which one character is cut into three or more characters. Furthermore, the embodiment described above deals with the recognition of European text, but the present invention is not limited to this and can also be applied to the recognition of Japanese text.
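The relative comparison mentioned in this paragraph could be sketched as follows. The text gives only the two spacings (75 pixels at X6 and 6 pixels at X7) and no decision rule, so the function name `relatively_narrow` and the parameter `ratio_threshold` with its default of 0.5 are assumptions. The sketch compares only against the preceding spacing, as in the FIG. 4 example.

```python
def relatively_narrow(gaps, index, ratio_threshold=0.5):
    """Return True if gaps[index] is narrow relative to the preceding gap.

    ratio_threshold is an assumed parameter; the source gives no value."""
    if index == 0:
        return False  # no preceding spacing to compare with
    previous = gaps[index - 1]
    return previous > 0 and gaps[index] < ratio_threshold * previous


gaps = [75, 6]  # character spacing at X6 and X7 in the FIG. 4 example
print(relatively_narrow(gaps, 1))
```

With the assumed ratio, 6 pixels is far below half of 75 pixels, so the spacing at X7 is flagged as a possible cutout error.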

[0024]

[Effects of the Invention] As described above, according to the character recognition post-processing device of the present invention, the character spacing is calculated from the cutout information, such as the coordinates of the cutout positions output by the character cutout unit, and the character codes output by the character recognition unit, and a plurality of closely spaced characters are treated as one character and collated with the candidate characters in the word dictionary. Appropriate character recognition post-processing that includes a function for correcting character cutout errors can therefore be performed. In addition, detecting cutout errors by comparing each character spacing with a predetermined value allows simple and accurate error detection. Furthermore, the standard character spacing in the document is calculated, the normalized character spacing is obtained as the ratio of each spacing to it, and predetermined values corresponding to this normalized spacing are prepared in advance and compared with the calculated normalized character spacing. As a result, even when the standard character spacing differs from document to document, the same character cutout error knowledge data can be used to perform the same error detection.
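The point about documents with different standard character spacings can be illustrated with a short Python sketch. The values for document A (gap 4 pixels, standard spacing 15, registered value 0.3) come from the embodiment. Document B, with every distance doubled, is a hypothetical case, and the function name `is_split` is likewise an assumption.

```python
THRESHOLD = 0.3  # registered normalized spacing for ">" and "<" in the embodiment


def is_split(gap_px, standard_spacing, threshold=THRESHOLD):
    """Compare the normalized character spacing with the registered value."""
    return gap_px / standard_spacing < threshold


# Document A: values from the embodiment.
# Document B: hypothetical document in which all distances are twice as large.
print(is_split(4, 15), is_split(8, 30))

# A fixed pixel threshold tuned for document A (0.3 * 15 = 4.5 pixels)
# would not flag the same pair in document B.
print(8 < 0.3 * 15)
```

Because both the gap and the standard spacing scale together, the normalized value is unchanged and the same knowledge data applies to both documents.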

[Brief description of drawings]

FIG. 1 is a block diagram of an embodiment of the character recognition post-processing device of the present invention.

FIG. 2 is a block diagram of an example of a conventional character recognition post-processing device.

FIG. 3 is an explanatory diagram of the character cutout procedure.

FIG. 4 is an explanatory diagram of a character cutout processing example (Part 1).

FIG. 5 is an explanatory diagram of a character cutout error detection example (Part 1).

FIG. 6 is an explanatory diagram of an example of the character cutout error knowledge data.

FIG. 7 is an explanatory diagram of a character cutout processing example (Part 2).

FIG. 8 is an explanatory diagram of a character cutout processing example (Part 3).

FIG. 9 is an explanatory diagram of a character cutout processing example (Part 4).

FIG. 10 is an explanatory diagram of a character cutout error detection example (Part 2).

FIG. 11 is an explanatory diagram of a character cutout error detection example (Part 3).

FIG. 12 is an explanatory diagram of a character cutout error detection example (Part 4).

[Explanation of symbols]

1 Character recognition post-processing device
5 Character cutout error detection unit
6 Language knowledge collation unit
7 Word dictionary
51 Standard character spacing calculation unit
52 Character spacing calculation unit
53 Character cutout knowledge data
54 Character spacing collating unit

Claims

1. A character recognition post-processing device comprising: a character cutout error detection unit that detects a character cutout error based on the coordinates of each character cut out from a document image and outputs the detection position; and a language knowledge collation unit that corrects the character at the detection position output by the character cutout error detection unit and collates it with language knowledge.

2. The character recognition post-processing device according to claim 1, wherein the character cutout error detection unit outputs a plurality of adjacent characters as one character when the spacing between the adjacent characters cut out from the document image is smaller than a predetermined value.

3. The character recognition post-processing device according to claim 1, wherein the character cutout error detection unit calculates a standard character spacing from the spacings between adjacent characters cut out from the document image, normalizes the spacing between adjacent characters by the standard character spacing, and outputs a plurality of adjacent characters as one character when the normalized character spacing is smaller than a predetermined value.