JP2891368B2

JP2891368B2 - Post-processing method of character recognition result

Info

Publication number: JP2891368B2
Application number: JP1172722A
Authority: JP
Inventors: 隆邦嶺脇
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1989-07-04
Filing date: 1989-07-04
Publication date: 1999-05-17
Anticipated expiration: 2014-05-17
Also published as: JPH0337781A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to character recognition, and in particular to post-processing for correcting character recognition results.

[Conventional technology]

Because it is extremely difficult to eliminate misrecognition completely with character-by-character recognition alone, kanji OCR systems often correct errors in the recognition results (post-processing) by word matching, morphological analysis, and the like.

Conventionally, post-processing methods for general documents wait until recognition of the target region (for example, a paragraph) has finished and then post-process the entire character string of that region, treating the region as if it contained no line breaks. Alternatively, some methods use the sentence as the unit of post-processing: the character string from the beginning of a sentence up to a period (句点) or comma (読点) is read in, ignoring line breaks, and then post-processed.

Known references on post-processing of character recognition results for Japanese documents by word matching and morphological analysis include, for example, Nishino et al., “Implementation of Post-processing for a Japanese Document Reader” (日本語文書リーダ後処理の実現), Natural Language Processing 64-6 (1987.11.20), pp. 45-52.

[Problems to be solved by the invention]

However, with post-processing methods that use a paragraph or a sentence as the processing unit, when the paragraph or sentence is long the start of post-processing is delayed until its recognition has finished, which lowers processing efficiency. A large memory is also needed to store the recognition results.

An object of the present invention is therefore to provide a post-processing method for character recognition results that solves these problems.

[Means for solving the problem]

The present invention is based on processing that reads character recognition results one line at a time and corrects errors by word matching or morphological analysis. The recognition result character string of the line being processed is divided into a plurality of word or phrase segments, and the segments from the first up to the one before the last are taken as the analysis range. Word matching or morphological analysis is performed on the character string of each segment in the analysis range to correct errors, while the character string of the last segment is moved to the head of the recognition result character string of the next line and processed as part of that line.

In another aspect, the present invention is likewise based on processing that reads character recognition results one line at a time and corrects errors by word matching or morphological analysis. The recognition result character string of the line being processed is subjected to word matching or morphological analysis in order from its head to correct errors, and the character or character string at the end of the line that remains unanalyzable is moved to the head of the recognition result character string of the next line and processed as part of that line.

[Operation]

The character string of one line is divided into words, phrases, and the like for analysis, but a single word may be split between the end of one line and the beginning of the next. A word split by a line break in this way cannot be analyzed correctly unless the split is taken into account, and correction fails if that portion contains a misrecognition.

According to the present invention, although post-processing is performed one line at a time, the word or phrase segment at the end of a line, or the unanalyzable portion at the end of a line, is moved to the head of the next line and handled in the post-processing of that line. Misrecognition in a word split by a line break can therefore also be corrected properly.

In general, a character recognition device proceeds in the order “region extraction or region designation” → “line extraction” → “character extraction” → “character recognition” → “post-processing”. Consequently, when the unit of post-processing is one line, the post-processing of one line and the character recognition of the next line can be executed in parallel, which reduces the time post-processing spends waiting and speeds up the overall process. Furthermore, because the number of characters printed or written on one line is roughly fixed, the amount of memory required to store recognition results can be reduced.
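The overlap between recognition and post-processing can be pictured as a two-stage producer/consumer pipeline. The following is a minimal sketch under stated assumptions, not the patent's implementation: `run_pipeline`, `recognize`, and `postprocess` are hypothetical names, line images are stood in for by arbitrary values, and Python threads are used only to show the structure (in CPython they do not run CPU-bound work truly in parallel).

```python
import queue
import threading
from typing import Callable, Iterable

_DONE = object()  # sentinel telling the consumer that no more lines will arrive


def run_pipeline(line_images: Iterable[str],
                 recognize: Callable[[str], str],
                 postprocess: Callable[[str], str]) -> list[str]:
    """Recognize line n+1 while line n is being post-processed.

    `recognize` and `postprocess` are hypothetical stand-ins for the
    recognition stage and the post-processing stage.
    """
    # A queue of size 1 keeps only a small, bounded number of recognized
    # lines in memory, mirroring the reduced-memory argument in the text.
    handoff: queue.Queue = queue.Queue(maxsize=1)

    def recognizer() -> None:
        try:
            for image in line_images:
                handoff.put(recognize(image))  # blocks while the queue is full
        finally:
            handoff.put(_DONE)  # always release the consumer

    worker = threading.Thread(target=recognizer)
    worker.start()

    results = []
    while True:
        item = handoff.get()
        if item is _DONE:
            break
        results.append(postprocess(item))
    worker.join()
    return results
```

Because the hand-off queue is bounded, the recognizer can run at most a line or two ahead of the post-processor, so memory use does not grow with the length of the paragraph.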

[Embodiments]

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a block diagram of a kanji OCR system according to the present invention. Its overall configuration and processing flow are the same as in conventional systems. The image input unit 1 reads the target image from a scanner (or other input device) 2 and stores it in the image memory 3. The range designation unit 4 designates the range to be recognized within the image held in the image memory 3, either automatically or as specified through a designation means. The extraction unit 5 extracts line images from the designated region and then extracts the image of each individual character from each line image. The recognition unit 6 recognizes each extracted character image using the dictionary stored in the character dictionary memory 7 and stores the results in the recognition result memory 8. The post-processing unit 9 reads the recognition result character string from the recognition result memory 8, analyzes it with reference to the word dictionary and grammar dictionary stored in the word dictionary / grammar dictionary memory 10, and corrects portions that are inappropriate as text. The post-processing unit 9 is the part directly related to the present invention; its details are described below for each embodiment. The output unit 11 outputs the post-processing result to a CRT or an output file 12.

Embodiment 1

FIG. 2 is a block diagram of the post-processing unit 9 in this embodiment. When character recognition for one line is finished, the recognition result reading unit 21 reads one line of recognition results from the recognition result memory 8 and stores it in the line recognition result memory 22. If an unanalyzed portion from the end of the previous line has been moved to the head of the line recognition result memory 22, the new results are written starting at the character position immediately after it.

Next, the provisional phrase segmentation unit 23 divides the one-line recognition result character string in the line recognition result memory 22 into a plurality of segments that serve as units of analysis, using cues such as the positions at which the character type changes (for example, a switch from hiragana to non-hiragana). A segment obtained in this way is called a “provisional phrase” (仮文節).
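Segmentation at character-type changes can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the hiragana test uses the Unicode Hiragana block, and the decision to keep a trailing 。 or 、 attached to the preceding phrase is an assumption made here, since the text only says the cut positions are chosen by character-type changes “and the like”.

```python
PUNCTUATION = "。、"  # assumption: punctuation stays attached to the preceding phrase


def is_hiragana(ch: str) -> bool:
    """True for characters in the Unicode Hiragana block (U+3041 to U+309F)."""
    return "\u3041" <= ch <= "\u309f"


def split_provisional_phrases(line: str) -> list[str]:
    """Cut the line at every hiragana -> non-hiragana change."""
    phrases = []
    start = 0
    for i in range(1, len(line)):
        prev, cur = line[i - 1], line[i]
        if is_hiragana(prev) and not is_hiragana(cur) and cur not in PUNCTUATION:
            phrases.append(line[start:i])
            start = i
    if start < len(line):
        phrases.append(line[start:])  # whatever remains is the last phrase
    return phrases
```

Applied to the string 文字認識技術は現代の情報処, this yields the three provisional phrases 文字認識技術は, 現代の, and 情報処.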

The analysis range determination unit 24 takes the positions of the provisional phrases into account and sets the analysis target range to run from the first character position of the line recognition result memory 22 to the character position just before the last provisional phrase.

For each provisional phrase in the set analysis target range, the text analysis and correction unit 25 performs word matching and morphological analysis using the word dictionary and grammar dictionary in the word dictionary / grammar dictionary memory 10, corrects misrecognitions, and rewrites the recognition results in the line recognition result memory 22 with the corrected text. The corrected result is output by the output unit 11.

When analysis and correction have been completed up to the last provisional phrase in the analysis target range, the final phrase writing unit 26 moves the recognition result of the provisional phrase at the end of the line (the portion excluded from analysis and correction) to the head of the line recognition result memory 22. The provisional phrase at the end of a line may have been split by the line break, so it is treated as the first provisional phrase of the next line and handled in that line's post-processing.

While post-processing of one line is being executed in this way, character recognition of the following lines is executed in parallel.

The post-processing is now explained using an example target text that is broken across two lines.

First, suppose that the following recognition result is obtained for the first line.

“文字認識技術は現代の情報処”

This is divided into provisional phrases at the character-type changes (hiragana → non-hiragana) as follows.

“文字認識技術は／現代の／情報処”

The analysis target range then becomes the following.

“文字認識技術は／現代の”

As a result, the word “情報処（理）” (“information processing”), which was split by the line break, is excluded from the analysis target of the first line.

When analysis and correction of the first line's analysis target range have finished and the result has been output, the unprocessed provisional phrase at the end of the line is moved to the head of the line recognition result memory 22, and the character positions after it are cleared.

Processing then moves to the second line. Its recognition result is written into the line recognition result memory 22 immediately after the portion carried over from the previous line, so the recognition result for the second line becomes the following.

“情報処理の花形技術である。”

The word “情報処理”, which had been split across two lines, is now joined into one and can be analyzed and corrected.

The line is then divided into provisional phrases, the target range is set, and analysis and correction are performed in the same way. However, because the last character of this line is a period (句点), it is judged that no word is split at the end of the line, and the last provisional phrase of the line is also included in the analysis target range. The same treatment applies when the last character of a line is a comma (読点).
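The control flow of Embodiment 1 can be sketched as a loop that carries the last provisional phrase of each line over to the next. This is a hedged sketch, not the patented implementation: `segment`, `postprocess_lines`, and the `correct` callback are hypothetical names, `correct` stands in for the dictionary-based correction of one provisional phrase, and flushing the leftover phrase after the final line is an assumption, because the text does not describe the end of the region.

```python
import re
from typing import Callable, Iterable, Iterator

# Split points: just after a hiragana character and just before a character
# that is neither hiragana nor 。/、 (assumed segmentation rule for this sketch).
_BOUNDARY = re.compile(r"(?<=[\u3041-\u309f])(?=[^\u3041-\u309f。、])")
LINE_END_PUNCTUATION = ("。", "、")


def segment(text: str) -> list[str]:
    """Divide text into provisional phrases at character-type changes."""
    return [part for part in _BOUNDARY.split(text) if part]


def postprocess_lines(lines: Iterable[str],
                      correct: Callable[[str], str]) -> Iterator[str]:
    """Yield the corrected output for each line, carrying the last phrase over."""
    carry = ""
    for recognized in lines:
        buffer = carry + recognized  # the carried-over phrase sits at the head
        phrases = segment(buffer)
        if buffer.endswith(LINE_END_PUNCTUATION):
            analysis, carry = phrases, ""  # no word can be split at this line end
        else:
            analysis, carry = phrases[:-1], "".join(phrases[-1:])
        yield "".join(correct(phrase) for phrase in analysis)
    if carry:
        yield correct(carry)  # assumption: flush the leftover after the last line
```

With the two lines 文字認識技術は現代の情報処 and 理の花形技術である。 and an identity `correct`, the first output is 文字認識技術は現代の and the second is 情報処理の花形技術である。, so the split word reaches the correction step in one piece.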

Embodiment 2

FIG. 3 is a block diagram of the post-processing unit 9 in this embodiment. When character recognition for one line is finished, the recognition result reading unit 31 reads one line of recognition results from the recognition result memory 8 and stores it in the line recognition result memory 32. If an unanalyzable portion from the end of the previous line has been moved to the head of the line recognition result memory 32, the new results are written starting at the character position immediately after it.

Next, the text analysis and correction unit 33 applies correction by word matching and morphological analysis to the contents of the line recognition result memory 32, using the word dictionary and grammar dictionary in the word dictionary / grammar dictionary memory 10, and rewrites the recognition results in the line recognition result memory 32 with the corrected text. The corrected result is output by the output unit 11. If, however, a portion at the end of the line cannot be analyzed (a character string for which no matching word is found), that line-end portion is marked “carry over to next line” (次行もちこし).

The next-line carry-over writing unit 34 moves the line-end portion marked “carry over to next line” to the head of the line recognition result memory 32 and clears the character positions after it. The unanalyzable portion at the end of the current line is thereby moved to the head of the next line and handled in that line's post-processing.

While post-processing of one line is being executed in this way, character recognition of the following lines is executed in parallel.

Consider again the first-line recognition result

“文字認識技術は現代の情報処”

Analyzing this character string from the head by word matching, morphological analysis, or the like gives the following (／ marks a word boundary).

“文字／認識／技術／は／現代／の／情報／処”

The final character “処” has no matching word and cannot be analyzed, so “処” becomes the portion carried over to the next line, and the corrected output is as follows.

“文字認識技術は現代の情報”

Next, because “処” is moved to the head of the next line, the recognition result for the second line becomes

“処理の花形技術である。”

and the word that was split across two lines is joined. Because the last character of this line is a period, no unanalyzable portion remains at the end of the line, and the line is processed to the end.
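The carry-over of Embodiment 2 can be sketched with a toy word matcher. The sketch below makes several assumptions that are not in the patent: it uses exact greedy longest-match lookup in a small hypothetical dictionary as a stand-in for word matching and morphological analysis, it performs no error correction at all, and it treats everything from the first unmatched position onward as the unanalyzable tail. The names `analyze` and `postprocess_lines` are illustrative.

```python
def analyze(buffer: str, dictionary: set[str]) -> tuple[list[str], str]:
    """Greedy longest-match from the head; return (words, unanalyzable tail)."""
    words = []
    pos = 0
    max_len = max(map(len, dictionary), default=0)
    while pos < len(buffer):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(buffer) - pos), 0, -1):
            candidate = buffer[pos:pos + length]
            if candidate in dictionary:
                words.append(candidate)
                pos += length
                break
        else:
            break  # no dictionary word starts here: the rest is carried over
    return words, buffer[pos:]


def postprocess_lines(lines: list[str],
                      dictionary: set[str]) -> tuple[list[str], str]:
    """Process lines in order, moving each unanalyzable tail to the next line."""
    carry = ""
    outputs = []
    for recognized in lines:
        words, carry = analyze(carry + recognized, dictionary)
        outputs.append("".join(words))
    return outputs, carry  # a non-empty carry means the last tail stayed unanalyzed
```

With a dictionary containing 文字, 認識, 技術, は, 現代, の, 情報, 処理, 花形, である, and 。, the first line is output as 文字認識技術は現代の情報 with 処 carried over, and the second line is output as 処理の花形技術である。 with nothing left over.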

[Effects of the invention]

As described above, according to the present invention, recognition results can be corrected even when they include words split by line breaks; post-processing and recognition can be executed in parallel line by line, improving the efficiency of the overall process; and, because the unit of post-processing is one line, the capacity of the memory that holds recognition results for post-processing can be reduced.

[Brief description of the drawings]

FIG. 1 is a block diagram of a kanji OCR system according to the present invention, FIG. 2 is a block diagram of the post-processing unit in one embodiment of the present invention, and FIG. 3 is a block diagram of the post-processing unit in another embodiment of the present invention.

6: recognition unit, 8: recognition result memory, 9: post-processing unit, 10: word dictionary / grammar dictionary memory, 21: recognition result reading unit, 22: line recognition result memory, 23: provisional phrase segmentation unit, 24: analysis range determination unit, 25: text analysis and correction unit, 26: final phrase writing unit, 31: recognition result reading unit, 32: line recognition result memory, 33: text analysis and correction unit, 34: next-line carry-over writing unit.

Claims

(57) [Claims]

1. A post-processing method for character recognition results that analyzes character recognition results one line at a time and corrects errors, characterized in that the recognition result character string of one line to be processed is divided into a plurality of word or phrase segments; the segments from the first up to the one before the last are taken as the analysis range; word matching or morphological analysis is performed on the character string of each segment in the analysis range to correct errors; and the character string of the last segment is moved to the head of the recognition result character string of the next line and newly made a processing target of the next line.

2. A post-processing method for character recognition results that analyzes character recognition results one line at a time and corrects errors, characterized in that the recognition result character string of one line to be processed is subjected to word matching or morphological analysis in order from its head to correct errors, and the character or character string at the end of the line that remains unanalyzable is moved to the head of the recognition result character string of the next line and newly made a processing target of the next line.