JP2968354B2

JP2968354B2 - Post-processing method of character recognition result

Info

Publication number: JP2968354B2
Application number: JP3026844A
Authority: JP
Inventors: 隆邦嶺脇
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1991-01-28
Filing date: 1991-01-28
Publication date: 1999-10-25
Anticipated expiration: 2014-10-25
Also published as: JPH04252390A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Industrial Field of Application] The present invention relates to a post-processing method for a character recognition system, in which errors in the character recognition result are corrected by word collation or morphological analysis.

[0002]

[Prior Art] In a character recognition system, it is extremely difficult to eliminate misrecognition completely by character-by-character recognition based on feature values of the character image. For this reason, Japanese OCR systems that recognize printed or handwritten characters often apply post-processing to the character string obtained as the recognition result, automatically correcting misrecognized characters by word collation, morphological analysis, or the like.

[0003] Many post-processing methods have been proposed for such Japanese OCR systems. For general text, the following methods are known. Methods c and d were filed in the same patent application.
a) After recognition of the whole target area (for example, a paragraph) has finished, all character strings in the area are post-processed in order from the beginning, as if there were no line breaks.
b) The sentence is the processing unit, and the character string from the beginning of the sentence to a period or comma is processed with line breaks ignored.
c) The line is the processing unit. The character string of the line is divided at positions where the character type changes, and the characters of the last division are moved to the head of the next line and processed there.
d) The line is the processing unit. Words are processed from the head of the line, and the characters left at the end of the line as unanalyzable are moved to the head of the next line and processed there.

[0004] Known literature on post-processing of character recognition results for Japanese documents by word collation and morphological analysis includes, for example, Nishino et al., "Realization of Japanese Reader Post-processing", Natural Language Processing 64-6 (1987.11.20), pp. 45-52.

[0005]

[Problems to Be Solved by the Invention] With method a or b, when a paragraph or sentence is long, post-processing cannot start until recognition of that paragraph or sentence has finished, so processing efficiency is poor. A large memory capacity is also needed to store the recognition results.

[0006] A character string is analyzed after being divided into words or phrases, but a single word may be split between the end of one line and the head of the next. When post-processing is performed line by line, a split word cannot be analyzed correctly unless such splitting is taken into account, and correction of misrecognized characters within the split word fails.

[0007] Methods c and d, which use the line as the processing unit, process a word split across the end of one line and the head of the next in the next line, so errors in the split word can be corrected. Moreover, because the unit is a line, recognition of the next line can run in parallel with post-processing of the current line even when only a small memory is available for holding recognition results until post-processing finishes. Problems nevertheless remain.

[0008] In method c, the end-of-line portion is determined by breaks at character-type changes. If all the characters in a line are hiragana, or all are alphanumeric, the character-type information alone cannot identify any break in the string, and every character in the line is moved to the next line. In other words, there is no guarantee that the number of characters moved to the next line will be small. To cope with this, the memory that holds recognition results awaiting post-processing must be given extra capacity, so the goal of saving memory is not fully achieved.
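The weakness of method c can be seen in a small sketch. The code below is not taken from the cited application; it is a minimal illustration that assumes character types are decided from Unicode code-point ranges, and the function names are hypothetical.

```python
from itertools import groupby


def char_type(ch: str) -> str:
    """Classify a character by script using Unicode code-point ranges (an assumption)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalnum():
        return "alnum"
    return "other"


def split_by_char_type(line: str) -> list[str]:
    """Cut a line wherever the character type changes (the idea behind method c)."""
    return ["".join(group) for _, group in groupby(line, key=char_type)]


def carried_by_method_c(line: str) -> str:
    """Method c moves the last character-type segment to the next line."""
    segments = split_by_char_type(line)
    return segments[-1] if segments else ""
```

For a mixed line such as 「漢字とかな」 only the trailing hiragana segment is moved, but for an all-hiragana line such as 「これはひらがなだけ」 the whole line forms a single segment and is moved in full, which is the unbounded case described above.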

[0009] In method d, failure of word analysis is what triggers moving the end-of-line string to the next line. When word collation is performed, however, a split word at the end of a line does not always become unanalyzable: part of the split word may match some word other than the correct one. In that case the string is not judged unanalyzable, it is not moved to the next line, and an erroneous correction is finalized.
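The failure mode of method d can be reproduced with a toy matcher. The sketch below assumes greedy longest-match lookup as a stand-in for word collation. The dictionary contents and the function name are hypothetical and are not taken from the patent.

```python
def longest_match_tail(line: str, dictionary: set[str]) -> str:
    """Greedy longest-match from the line head; return the unparsable tail."""
    max_len = max((len(word) for word in dictionary), default=0)
    pos = 0
    while pos < len(line):
        for length in range(min(max_len, len(line) - pos), 0, -1):
            if line[pos:pos + length] in dictionary:
                pos += length
                break
        else:
            break  # no dictionary word starts here: the rest is unparsable
    return line[pos:]


# Hypothetical dictionary: "デモ" is a word in its own right.
DICTIONARY = {"表示", "の", "デモ", "デモンストレーション"}

# The line break falls inside "デモンストレーション", leaving "デモ" at the line end.
tail = longest_match_tail("表示のデモ", DICTIONARY)
```

Here `tail` is empty because the fragment 「デモ」 matches a different dictionary word, so method d would carry nothing over and the remainder 「ンストレーション」 on the next line could no longer be joined to it.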

[0010] An object of the present invention is to solve these problems of the conventional methods and to provide a post-processing method that can reliably correct even words split by a line break and that, even when character recognition and post-processing run in parallel, needs only a sufficiently small memory for holding the necessary information until post-processing finishes.

[0011]

[Means for Solving the Problems] According to the post-processing method of the present invention, processing is executed line by line, word by word from the head of each line. Each time one word has been processed, the number of unprocessed characters following that word is compared with a fixed value. When the number of unprocessed characters is equal to or less than the fixed value, processing of the current line ends, and the unprocessed characters, or the unprocessed characters together with the characters of the last processed word, are moved to the head of the next line.

[0012] However, a line is processed through to its last character in any of the following cases: when the number of characters in the line, excluding characters moved from the previous line, is equal to or less than a fixed value; when, among the lines in the processing target area, that number is smaller than the maximum number of characters of the processed lines by a fixed value or more, or is equal to or less than a fixed ratio of that maximum; or when the last character of the line is a punctuation mark.
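The exceptions listed in this paragraph amount to a single yes/no test per line. The following is a minimal sketch of that test. The function name, the parameter names, and the default constants (9, 9, and 0.5, borrowed loosely from the worked example later in the description) are assumptions, since the text leaves the actual values open.

```python
PUNCTUATION = ("。", "、")  # Japanese period and comma


def finish_in_this_line(
    own_chars: int,
    last_char: str,
    longest_processed: int,
    short_line_limit: int = 9,
    min_gap: int = 9,
    max_ratio: float = 0.5,
) -> bool:
    """Return True when the line should be processed through to its last character.

    own_chars: characters in the line, excluding those carried over.
    longest_processed: longest already-processed line (0 if there is none yet).
    The three defaults are illustrative assumptions.
    """
    if last_char in PUNCTUATION:
        return True
    if own_chars <= short_line_limit:
        return True
    if longest_processed > 0:
        # Much shorter than earlier lines: probably the end of a paragraph.
        if longest_processed - own_chars >= min_gap:
            return True
        if own_chars <= max_ratio * longest_processed:
            return True
    return False
```

A line that is short in absolute terms, short relative to earlier lines, or that ends in punctuation is unlikely to end in the middle of a word, so carrying its tail over would only add work.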

[0013]

[Operation] Because post-processing is performed line by line, it can be executed each time character recognition of one line finishes, which raises processing efficiency. In addition, a word split across two lines by a line break is analyzed in the next line as a contiguous string and can be corrected appropriately.

[0014] Moving the string of the last processed word to the next line, in addition to the unprocessed string, makes it possible to verify inter-word connections for a word split by a line break even when word analysis includes inter-word connection checks, which ensures reliable analysis.

[0015] Because the number of characters moved to the next line never exceeds a fixed value, the memory needed to hold that string can be made extremely small.

[0016] In addition, the condition checks prevent pointless carry-over of strings to the next line, which reduces wasted processing and makes post-processing efficient.

[0017]

[Embodiments] FIG. 1 is a schematic block diagram of a Japanese character recognition system according to the present invention. In this system, image input unit 10 reads the image of a document original with a scanner or the like and stores the binary image data in document image memory 11. Line/character extraction unit 12 extracts text lines and character images from the image in document image memory 11 and stores the character image data in character image memory 13. It also stores extraction information in extraction information memory 14: the position of the recognition target area (for example, a paragraph, either detected automatically or specified by the operator), the positions of the extracted lines, character positions, character sizes, and so on.

[0018] Character recognition unit 15 reads character image data from character image memory 13, normalizes it, and extracts feature values. It determines recognition result candidates by matching these feature values against the dictionary in character dictionary memory 16 and by using the shape information in the extraction information, and stores their character codes, distance data, and the like in recognition result memory 17.

[0019] Post-processing unit 18 executes the post-processing to which the present invention directly relates. It reads the recognition results from recognition result memory 17 one line at a time and prepends the string carried over from the previous line, which it takes from line-end storage memory 19. It then performs language analysis and correction from the head of the line using the contents of word dictionary memory 20, and rewrites the contents of recognition result memory 17 according to the corrections. When the string at the end of the line being processed needs to be moved to the next line, it writes that string to line-end storage memory 19. Details of this post-processing are given below for each embodiment.

[0020] Result output unit 21 outputs the contents of recognition result memory 17 to an output device such as a display or printer, or to a file device such as a magnetic disk drive.

[0021] Embodiment 1

The post-processing is described step by step with reference to the flowchart in FIG. 2. First, post-processing unit 18 reads from recognition result memory 17 the recognition result data for the one line it is about to process (called the current line) (step 100).

[0022] Next, it checks whether line-end storage memory 19 holds a string carried over from the previous line (step 102). If so, it reads that string from line-end storage memory 19 and prepends it to the recognition result data of the current line (step 104).

[0023] Next, the recognition result data of the current line (including any string carried over from the previous line) is analyzed by word collation, one word at a time from the head of the line (step 106).

[0024] Each time one word has been processed, it is checked whether the number of characters in the unprocessed string following that word is equal to or less than a fixed value (the line-end character threshold) (step 108). This is one step in deciding whether to carry the unprocessed string over to the next line. If the number of unprocessed characters exceeds the line-end character threshold, processing has not yet reached the stage at which carry-over is considered, so the next word is processed. The line-end character threshold may differ from system to system. Here a value one less than the number of characters in the longest word in the word dictionary used by post-processing unit 18 is used. In the following, the maximum word length is taken to be 10 characters and the line-end character threshold to be 9.
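The threshold rule and the step 108 test are simple enough to state as code. This is a minimal sketch with hypothetical function names. One plausible reading of the "longest word minus one" choice, offered here as an interpretation rather than something the text states, is that a word split by a line break can leave at most that many of its characters on the current line.

```python
def line_end_threshold(dictionary: set[str]) -> int:
    """Longest dictionary word length minus one, as chosen in the embodiment."""
    return max(len(word) for word in dictionary) - 1


def reached_line_end_zone(line_length: int, processed_chars: int, threshold: int) -> bool:
    """Step 108: are the unprocessed characters no more than the threshold?"""
    return line_length - processed_chars <= threshold


# Hypothetical dictionary whose longest word has 10 characters.
threshold = line_end_threshold({"画像", "デモンストレーション"})
```

With a 16-character line, six processed characters leave ten unprocessed, which is above a threshold of 9, so analysis simply continues. After eight processed characters the remaining eight fall within the threshold and the later checks come into play.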

[0025] When the number of unprocessed characters is equal to or less than the line-end character threshold, it is checked whether the number of unprocessed characters is 0 (step 110). If it is 0, the current line has been processed through to its last character, so it is checked whether a next line remains (step 112). If none remains, that is, if the current line is the last line of the recognition target area, processing is complete. If there is a next line, processing returns to step 100 and the next line is started.

[0026] If step 110 finds that unprocessed characters remain, it is checked whether the last character of the line is a period or comma (step 114). If it is a punctuation mark, the current line should be processed to its end, so processing returns to step 106 and the next word is processed.

[0027] When the last character is not a punctuation mark, it is checked whether the number of characters in the current line, excluding those carried over from the previous line, is smaller than the maximum number of characters among the processed lines in the recognition target area by a fixed threshold or more, or is equal to or less than a fixed ratio of that maximum (step 116). If this condition is satisfied, it is judged that no carry-over to the next line should be made at this point, and processing returns to step 106. If it is not satisfied, it is checked whether the number of characters in the current line, excluding those carried over from the previous line, is equal to or less than the line-end character threshold (step 118). If that condition is satisfied, processing returns to step 106.

[0028] When the condition of step 118 is not satisfied, it is checked whether the current line is the last line of the recognition target area (step 120). If it is the last line, processing returns to step 106 so that the current line is processed through to its last character without carry-over. If it is not the last line, the final decision is made to carry the unprocessed string of the current line over to the next line: the string is stored in line-end storage memory 19 (step 122), processing of the current line ends, and processing returns to step 100 to start the next line.
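Steps 100 to 122 can be put together in one short routine. The sketch below is an illustration under stated assumptions, not the patented implementation: word collation is replaced by greedy longest-match lookup with no error correction, the constants in `Params` are hypothetical, and the branch directions follow the summary of the means and the worked example (a short line, or a line ending in punctuation, is processed to its end).

```python
from dataclasses import dataclass
from typing import List, Optional, Set

PUNCTUATION = ("。", "、")


@dataclass
class Params:
    threshold: int = 9      # line-end character threshold (steps 108, 118)
    min_gap: int = 9        # step 116: shortfall versus the longest processed line
    max_ratio: float = 0.5  # step 116: ratio to the longest processed line


def next_word_length(text: str, pos: int, dictionary: Set[str]) -> int:
    """Length of the longest dictionary word at pos (1 if nothing matches)."""
    max_len = max((len(word) for word in dictionary), default=1)
    for length in range(min(max_len, len(text) - pos), 0, -1):
        if text[pos:pos + length] in dictionary:
            return length
    return 1  # unknown character: treat it as a one-character word


def post_process(lines: List[str], dictionary: Set[str],
                 params: Optional[Params] = None) -> List[List[str]]:
    """Segment lines one at a time, carrying a bounded tail to the next line."""
    params = params or Params()
    carry = ""    # stands in for line-end storage memory 19
    longest = 0   # longest own length among processed lines
    result = []
    for index, line in enumerate(lines):               # step 100
        text = carry + line                            # steps 102, 104
        carry = ""
        own = len(line)
        is_last = index == len(lines) - 1
        words, pos = [], 0
        while pos < len(text):
            length = next_word_length(text, pos, dictionary)  # step 106
            words.append(text[pos:pos + length])
            pos += length
            remaining = len(text) - pos
            if remaining > params.threshold:           # step 108
                continue
            if remaining == 0:                         # step 110
                break
            if text[-1] in PUNCTUATION:                # step 114
                continue
            if longest and (longest - own >= params.min_gap
                            or own <= params.max_ratio * longest):  # step 116
                continue
            if own <= params.threshold:                # step 118
                continue
            if is_last:                                # step 120
                continue
            carry = text[pos:]                         # step 122
            break
        longest = max(longest, own)
        result.append(words)
    return result
```

Called with the two lines 「画像の入力と表示のデモンストレー」 and 「ションを行なう。」 and a dictionary holding the nine words involved, this sketch returns 画像／の／入力／と／表示 for the first line and の／デモンストレーション／を／行なう／。 for the second, so the split word is analyzed whole on the second line.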

[0029] Post-processing is now described using, as an example, a recognition target area consisting of the following two lines of text:
画像の入力と表示のデモンストレー
ションを行なう。
(Together the two lines read "A demonstration of image input and display is given.")

[0030] When the first line is analyzed from its head by word collation, processing proceeds as follows through the fourth word:
画像／の／入力／と／
(where ／ denotes a word boundary)

[0031] At this point 10 characters remain in the first line, so processing moves on to the next word. Once the next word, 「表示」, has been processed, the number of unprocessed characters is 8, which is no more than the line-end character threshold of 9:
画像／の／入力／と／表示／のデモンストレー
(the unprocessed string is 「のデモンストレー」, 8 characters)

[0032] Furthermore, the last character is not a punctuation mark; because the current line is the first line, the condition of step 116 is naturally not satisfied; the current line has 16 characters, more than the line-end character threshold of 9; and the current line is not the last line. Processing of the current line, that is, the first line, therefore ends at this stage. The unprocessed string 「のデモンストレー」 is saved in line-end storage memory 19, and processing moves to the second line. Since the line-end character threshold here is 9, a capacity of 9 characters suffices for line-end storage memory 19.
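The bounded capacity of line-end storage memory 19 can be modelled as a small fixed-size buffer. The class below is a hypothetical sketch: the name `LineEndBuffer` and its methods are not from the patent, and raising an error on overflow is just one way to make the bound explicit.

```python
class LineEndBuffer:
    """Fixed-capacity store for the text carried over to the next line."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._text = ""

    def store(self, text: str) -> None:
        """Save the carry-over string; it must fit within the capacity."""
        if len(text) > self.capacity:
            raise ValueError(
                f"carry-over of {len(text)} characters exceeds capacity {self.capacity}"
            )
        self._text = text

    def take(self) -> str:
        """Return the stored text and clear the buffer."""
        text, self._text = self._text, ""
        return text


buffer = LineEndBuffer(capacity=9)
buffer.store("のデモンストレー")  # 8 characters, within the 9-character bound
carried = buffer.take()
```

Because carry-over is only decided once the unprocessed string has shrunk to the threshold or below, a string longer than the capacity should never reach `store` in Embodiment 1.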

[0033] Reading the recognition result string for the second line and prepending the contents of line-end storage memory 19 gives the following string, in which the word 「デモンストレーション」, split by the line break after the first line, is now contiguous:
のデモンストレーションを行なう。
(the string carried over from the previous line is 「のデモンストレー」)

[0034] Analysis of this string proceeds as follows:
の／デモンストレーション／を／行なう／。
Here:
a) The last character of the line is a period (step 114).
b) The number of characters excluding those carried over from the previous line is 8, including the punctuation mark, which is smaller than the maximum of 16 characters among processed lines by the line-end character threshold of 9 or more, and its ratio to the maximum, 8/16 = 0.5, is sufficiently small (step 116).
c) The number of characters excluding those carried over from the previous line is equal to or less than the line-end character threshold (step 118).
d) The line is the last line (step 120).
Consequently the line is processed through to its last character, and nothing is carried over to the next line.

[0035] Embodiment 2

This embodiment differs from Embodiment 1 in that, in step 122 of FIG. 2, the string of the last processed word is also stored in line-end storage memory 19, in addition to the unprocessed string, and carried over to the next line.

[0036] The two lines of text used in Embodiment 1 serve as the example again. Processing proceeds as follows up to the word 「表示」 in the first line:
画像／の／入力／と／表示／のデモンストレー

[0037] At this stage processing of the first line ends, and the string from the last processed word 「表示」 onward is carried over to the next line. The second line therefore becomes
表示のデモンストレーションを行なう。
and word analysis yields
表示／の／デモンストレーション／を／行なう／。
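The only difference between the two embodiments is what step 122 writes to the line-end store. A minimal sketch, with a hypothetical function name and flag:

```python
from typing import List


def carried_text(words: List[str], unprocessed: str, include_last_word: bool) -> str:
    """Text to carry to the next line.

    include_last_word=False corresponds to Embodiment 1,
    include_last_word=True to Embodiment 2.
    """
    if include_last_word and words:
        return words[-1] + unprocessed
    return unprocessed


line1_words = ["画像", "の", "入力", "と", "表示"]
embodiment1 = carried_text(line1_words, "のデモンストレー", include_last_word=False)
embodiment2 = carried_text(line1_words, "のデモンストレー", include_last_word=True)
```

For the example, Embodiment 1 carries 「のデモンストレー」 and Embodiment 2 carries 「表示のデモンストレー」. Note that in the second case the carried string can be longer than the line-end character threshold by up to one word, so the store would need correspondingly more room. That sizing remark is an inference from the example, not a statement in the text.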

[0038] The reason for carrying over not only the unprocessed string but also the string of the last processed word is as follows. If word analysis in post-processing verifies the connections between consecutive words as well as performing word collation, carrying over only the unprocessed string leaves a gap: when the next line is processed, the connection between its first word and the word immediately before it (the last word processed in the previous line) cannot be verified, so part of the analysis remains incomplete. This problem is avoided by also carrying the last processed word over to the next line, as in this embodiment.

[0039] The two example lines illustrate this. In Embodiment 1, 「のデモンストレーション」 comes to the head of the second line and the word 「の」 is processed first, but information about the preceding 「表示」 is not carried over, so the connection between 「表示」 and 「の」 cannot be verified. In this embodiment, 「表示」 is also carried over to the second line, so processing of the second line has no discontinuity in verifying the connection between 「表示」 and 「の」.
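The gap in connection checking can be made concrete by listing the adjacent word pairs that each embodiment gets to inspect on the second line. The helper below is a hypothetical illustration and does not model any actual grammar or connection table.

```python
from typing import List, Tuple


def adjacent_pairs(words: List[str]) -> List[Tuple[str, str]]:
    """Pairs of neighbouring words that a connection check could inspect."""
    return list(zip(words, words[1:]))


# Second-line segmentation under each embodiment, as given in the description.
line2_embodiment1 = ["の", "デモンストレーション", "を", "行なう", "。"]
line2_embodiment2 = ["表示", "の", "デモンストレーション", "を", "行なう", "。"]

pairs1 = adjacent_pairs(line2_embodiment1)
pairs2 = adjacent_pairs(line2_embodiment2)
```

The pair (「表示」, 「の」) appears only in `pairs2`, which is why Embodiment 2 suits an analyzer that checks inter-word connections.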

[0040] In other words, the line-end processing can be chosen according to the level of word analysis in the post-processing. If the analysis consists of word collation only, the line-end processing of Embodiment 1 is sufficient. If the analysis also verifies inter-word connections, the line-end processing of Embodiment 2 is appropriate.

[0041] Post-processing unit 18, which executes the processing described above, is implemented in software on a general-purpose processor or with dedicated hardware. In either case, a person skilled in the art can readily implement post-processing unit 18 from the above description, so concrete examples of the software or hardware are omitted.

[0042]

[Effects of the Invention] As described in detail above, the present invention provides the following effects.
1) Because post-processing is performed line by line, it can be executed each time character recognition of one line finishes, which raises processing efficiency.
2) Because line-end processing is performed, that is, the unprocessed string, or the last processed word together with the unprocessed string, is carried over to the next line, a word split across two lines by a line break is analyzed in the next line as a contiguous string and can be corrected appropriately.
3) Carrying over the last processed word as well as the unprocessed string in line-end processing makes it possible to verify inter-word connections for a word split by a line break even when word analysis includes inter-word connection checks, which ensures reliable analysis.
4) Condition checks on the number of unprocessed characters, the number of characters excluding those carried over from the previous line, the comparison of that number with the maximum number of characters among processed lines, the type of the last character, and so on prevent pointless line-end processing, which reduces wasted processing and makes post-processing efficient.
5) Because the number of characters carried over to the next line never exceeds a fixed value, the memory needed to hold that string can be made extremely small.

[Brief description of the drawings]

[FIG. 1] A schematic block diagram of a Japanese character recognition system according to the present invention.

[FIG. 2] A flowchart of the post-processing in one embodiment of the present invention.

[Explanation of symbols]

10 Image input unit
11 Document image memory
12 Line/character extraction unit
13 Character image memory
14 Extraction information memory
15 Character recognition unit
16 Character dictionary memory
17 Recognition result memory
18 Post-processing unit
19 Line-end storage memory
20 Word dictionary memory
21 Result output unit

Claims

(57) [Claims]

1. A post-processing method for a character recognition result, in which errors in the character recognition result are corrected by word analysis processing or the like, characterized in that processing is executed line by line, word by word from the head of each line; each time one word has been processed, the number of unprocessed characters following that word is compared with a fixed value; and, when the number of unprocessed characters is equal to or less than the fixed value, processing of the line being processed is terminated and the unprocessed characters are moved to the head of the next line.

2. A post-processing method for a character recognition result, in which errors in the character recognition result are corrected by word analysis processing or the like, characterized in that processing is executed line by line, word by word from the head of each line; each time one word has been processed, the number of unprocessed characters following that word is compared with a fixed value; and, when the number of unprocessed characters is equal to or less than the fixed value, processing of the line being processed is terminated and the unprocessed characters, together with the characters of the last processed word, are moved to the head of the next line.

3. The described post-processing method for a character recognition result, wherein, in each line, when the number of characters excluding characters moved from the previous line is equal to or less than a fixed value, the line is processed through to its last character.

4. The post-processing method according to claim 1, wherein, in each line in the processing target area, when the number of characters excluding characters moved from the previous line is smaller than the maximum number of characters of the processed lines by a fixed value or more, or is equal to or less than a fixed ratio of that maximum number of characters, the line is processed through to its last character.

5. The post-processing method according to claim 1, wherein, when the last character in each line is a punctuation mark, processing is performed on the line up to the punctuation mark.