JP2003203204A

JP2003203204A - Character recognition method and character recognition device

Info

Publication number: JP2003203204A
Application number: JP2002000189A
Authority: JP
Inventors: Toshifumi Yamaai; 敏文山合
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2002-01-04
Filing date: 2002-01-04
Publication date: 2003-07-18
Anticipated expiration: 2022-01-04
Also published as: JP4056745B2

Abstract

PROBLEM TO BE SOLVED: To provide a character recognition method and device capable of highly precise and reliable character recognition by preventing recognition results for anything other than characters from being output.

SOLUTION: The area on a document is divided into a character area in which characters are written and two or more attribute-specific areas corresponding to other content; the character data in the character area is divided into lines (step S1101); and the average confidence degree of the character recognition results is calculated for each line in the character area (steps S1102 and S1103). Thereafter, first-candidate character data with a low confidence degree is replaced by a space (step S1115) on the basis of the line's overlap with a table area (step S1104), its overlap with a drawing (step S1105), the ratio of the number of high-confidence characters to the number of characters in the line (step S1110), and the like. This prevents recognition results for uncertain characters from being output, improving character recognition precision and reliability.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to character recognition and, more specifically, to a character recognition method and a character recognition device that read an image of a document and output character data, and that achieve more accurate character recognition by preventing recognition results for non-character content from being output.

[0002]

[Prior Art] One conventional character recognition method, such as the technique disclosed in Japanese Patent No. 2991779, uses certainty factor information for individual characters: it evaluates several pieces of information from the character recognition stage and calculates and uses a value corresponding to the reliability of each character.

[0003] Another known way of using the character certainty factor, disclosed in Japanese Patent Laid-Open No. 5-182014, is to display character recognition results that received a low certainty factor in a way that prompts the user to correct them.

[0004] The technique disclosed in Japanese Patent Laid-Open No. 7-220091 goes beyond per-character information: after automatically dividing the page into areas and identifying them, it performs character recognition and uses the result to re-determine the area attributes.

[0005] Further, the technique disclosed in Japanese Patent Laid-Open No. 8-101880 calculates the certainty factor of a specific area from the certainty factors of its characters and changes the display method according to the result. The technique disclosed in Japanese Patent Laid-Open No. 9-282416 is not limited to areas: it obtains the certainty factor of the entire document and applies it to deciding whether to reject the entire result for that image. As these examples show, many methods that use certainty factors in character recognition have been proposed.

[0006]

[Problems to Be Solved by the Invention] However, rejecting on the basis of the certainty factor of the document itself, as in the prior art described above, does not deliver the expected effect when the goal is to maximize character recognition accuracy across the whole document while outputting as few recognition results for non-character content as possible.

[0007] When a certainty factor is obtained per area and used for some processing, suppose that area division extracts a single area that merges a non-character area with a character area. The non-character part naturally has a low certainty factor, so the certainty factor of the area rises or falls with the proportion of non-character content mixed in.

[0008] If an area with a low certainty factor is, for example, converted from a character area to a figure, the character recognition results that were obtained may be lost without being used. Conversely, if every character with a low certainty factor is rejected (excluded), recognizing a poor-quality image yields output dominated by rejects, which is unsightly and unusable. Poor-quality images include not only noisy images but also images so dark that the characters are crushed and, conversely, images so light that the characters are faint.

[0009] Methods for automatically identifying character areas include the one in Japanese Patent Laid-Open No. 7-037036 by the present applicant, which extracts circumscribed rectangles, classifies them by size and internal information, and merges character rectangles with one another to build areas. Once character areas have been obtained by such area division, character recognition can be performed to obtain character codes, coordinates, and their certainty factors, using a technique such as that of Japanese Patent No. 2991779 mentioned above. In this case, however, the result depends on the accuracy of character area identification, and unusable character recognition results are output just as described above.

[0010] To solve the problems of the prior art described above, an object of the present invention is to provide a character recognition method and device that prevent recognition results for non-character content from being output and thereby perform character recognition with higher accuracy and reliability.

[0011]

[Means for Solving the Problems] To solve the problems described above and achieve the object, the character recognition method according to the invention of claim 1 automatically identifies a character area on a document and recognizes the characters in that area as character data, and is characterized by comprising: an area division step of dividing the area on the document into character areas in which characters are written and a plurality of attribute-specific areas corresponding to other content; a character recognition step of recognizing character data line by line within the character area; and a certainty factor calculation step of determining, line by line within the character area, how likely the character recognition result is to be correct and outputting a certainty factor indicating that likelihood.

[0012] According to the invention of claim 1, because a certainty factor is obtained per line, recognition results for non-character portions are not output, and character recognition accuracy can be improved.

[0013] The character recognition method according to the invention of claim 2 is the invention of claim 1, characterized in that the certainty factor calculation step classifies the recognition result obtained for each line as either likely or unlikely and, when a line is judged unlikely, replaces all character recognition results of that line with a predetermined character before output.

[0014] According to the invention of claim 2, replacing characters whose recognition result is unlikely makes the replaced characters easy to handle and simplifies processing after character recognition.

[0015] The character recognition method according to the invention of claim 3 is the invention of claim 1, characterized in that the certainty factor calculation step classifies the recognition result obtained for each line as either likely or unlikely and, when a line is judged unlikely, determines that the line is not a character area and changes its area attribute to that of another area type.

[0016] According to the invention of claim 3, it becomes possible to detect that character recognition was performed on a portion that is not a character area, which improves the accuracy of area division.

[0017] The character recognition method according to the invention of claim 4 is the invention of claim 3, characterized in that the certainty factor calculation step deletes the character area when the per-line recognition result is judged unlikely for every line in that character area.

[0018] According to the invention of claim 4, deleting the character area prevents unnecessary character recognition data from being output and improves the reliability of character recognition results for documents that contain areas with other attributes.

[0019] The character recognition method according to the invention of claim 5 is the invention of claim 3, characterized in that, when the per-line recognition result is judged unlikely for some of the lines in the character area, the certainty factor calculation step resizes the character area so that it is formed only from the likely lines, excluding the unlikely ones.

[0020] According to the invention of claim 5, the character area is resized so that it no longer includes the lines that are actually non-character content, which improves the accuracy of character recognition within the character area.
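As a rough illustration of the resizing described for claim 5, together with the deletion case of claim 4, the sketch below shrinks a character area to the bounding box of the lines judged likely. The `Rect` type, the function name `resize_character_area`, and the representation of a line as a rectangle plus a boolean are assumptions made for this example; the specification does not define any data structures.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int


def resize_character_area(
    lines: Sequence[Tuple[Rect, bool]],
) -> Optional[Rect]:
    """Return the bounding box of the lines judged likely to be text.

    Each entry pairs a line rectangle with its likely/unlikely decision.
    None is returned when no line is likely, which corresponds to
    deleting the character area.
    """
    kept = [rect for rect, is_likely in lines if is_likely]
    if not kept:
        return None
    return Rect(
        left=min(r.left for r in kept),
        top=min(r.top for r in kept),
        right=max(r.right for r in kept),
        bottom=max(r.bottom for r in kept),
    )


area_lines = [
    (Rect(10, 10, 200, 30), True),
    (Rect(10, 35, 180, 55), True),
    (Rect(10, 60, 400, 90), False),  # e.g. part of a figure misread as text
]
print(resize_character_area(area_lines))
```

In this example the third line is dropped, so the resized area covers only the first two lines.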

[0021] The character recognition method according to the invention of claim 6 is the invention of any one of claims 1 to 5, characterized in that the certainty factor calculation step uses, as information for obtaining the certainty factor of a line, the number of characters in the line whose certainty factor is equal to or greater than a predetermined threshold.

[0022] According to the invention of claim 6, the certainty factor of a line is obtained from the number of characters whose certainty factor is at or above a predetermined threshold. When a recognition result is plausible as text, for example, many of its characters have a high certainty factor, so using this count for the line certainty factor improves the accuracy of character recognition.
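The count used in claim 6 can be sketched as follows. This is a minimal illustration, assuming that each character's certainty factor is a number on a common scale; the function names and the threshold value are hypothetical. The ratio helper reflects the "ratio of the number of high-confidence characters to the number of characters in the line" mentioned in the abstract.

```python
from typing import Sequence


def count_high_confidence(confidences: Sequence[float],
                          char_threshold: float) -> int:
    """Count characters whose certainty factor is at or above the threshold."""
    return sum(1 for c in confidences if c >= char_threshold)


def high_confidence_ratio(confidences: Sequence[float],
                          char_threshold: float) -> float:
    """Ratio of high-certainty characters to all characters in the line."""
    if not confidences:
        return 0.0  # an empty line provides no evidence of text
    return count_high_confidence(confidences, char_threshold) / len(confidences)


line = [0.9, 0.8, 0.2, 0.95]  # hypothetical per-character certainty factors
print(count_high_confidence(line, 0.8), high_confidence_ratio(line, 0.8))
```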

[0023] The character recognition method according to the invention of claim 7 is the invention of any one of claims 1 to 5, characterized in that the certainty factor calculation step uses, as information for obtaining the certainty factor of a line, the ratio of the number of alphanumeric characters in the line to the total number of characters in the line.

[0024] According to the invention of claim 7, when a line contains a high proportion of alphanumeric characters, for which the language processing applied during character recognition is ineffective, the criterion for the average certainty factor can, for example, be lowered so that the character recognition accuracy achieved for kana and kanji is maintained.
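One possible reading of claim 7 is sketched below: compute the share of alphanumeric characters in a line and lower the average-certainty threshold when that share is high. The function names, the `ratio_limit` and `relaxation` parameters, and the decision to fold full-width forms with NFKC normalization are all assumptions for illustration; the specification only says that the criterion is lowered "for example".

```python
import unicodedata


def alphanumeric_ratio(text: str) -> float:
    """Share of characters in the line that are ASCII or full-width alphanumerics."""
    if not text:
        return 0.0

    def is_alnum(ch: str) -> bool:
        # NFKC folds full-width forms such as "Ａ" or "１" to "A" or "1".
        folded = unicodedata.normalize("NFKC", ch)
        return folded.isascii() and folded.isalnum()

    return sum(1 for ch in text if is_alnum(ch)) / len(text)


def relaxed_threshold(base_threshold: float, ratio: float,
                      ratio_limit: float, relaxation: float) -> float:
    """Lower the average-certainty threshold for mostly alphanumeric lines."""
    if ratio >= ratio_limit:
        return base_threshold - relaxation
    return base_threshold


ratio = alphanumeric_ratio("AB漢字")
print(ratio, relaxed_threshold(0.75, ratio, ratio_limit=0.5, relaxation=0.25))
```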

[0025] The character recognition method according to the invention of claim 8 is the invention of any one of claims 1 to 5, characterized in that the certainty factor calculation step uses, as information for obtaining the certainty factor of a line, the degree of positional overlap with areas of other attributes.

[0026] According to the invention of claim 8, the certainty factor is obtained from the degree of positional overlap between the coordinates of a character line and areas of other attributes, which improves the accuracy of character recognition.
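The "degree of positional overlap" in claim 8 is not defined further in this passage. One plausible measure, shown here only as an assumption, is the fraction of the line rectangle covered by a rectangle of another attribute such as a figure or table. `Rect` and `overlap_ratio` are hypothetical names.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int


def overlap_ratio(line: Rect, other: Rect) -> float:
    """Fraction of the line rectangle covered by the other rectangle."""
    width = min(line.right, other.right) - max(line.left, other.left)
    height = min(line.bottom, other.bottom) - max(line.top, other.top)
    if width <= 0 or height <= 0:
        return 0.0  # the rectangles do not intersect
    line_area = (line.right - line.left) * (line.bottom - line.top)
    if line_area <= 0:
        return 0.0  # degenerate line rectangle
    return (width * height) / line_area


text_line = Rect(0, 0, 100, 20)
table_area = Rect(50, 0, 200, 100)
print(overlap_ratio(text_line, table_area))
```

A line that lies largely inside a table or figure rectangle would then be a candidate for being treated as non-character content.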

[0027] The character recognition device according to the invention of claim 9 automatically identifies a character area on a document and recognizes the characters in that area as character data, and is characterized by comprising: area division means for dividing the area on the document into character areas in which characters are written and a plurality of attribute-specific areas corresponding to other content; character recognition means for recognizing character data line by line within the character area; and certainty factor calculation means for determining, line by line within the character area, how likely the character recognition result is to be correct and outputting a certainty factor indicating that likelihood.

[0028] According to the invention of claim 9, because a certainty factor is obtained per line, recognition results for non-character portions are not output, and character recognition accuracy can be improved.

[0029]

[Embodiments of the Invention] Preferred embodiments of the character recognition method and character recognition device according to the present invention are described in detail below with reference to the accompanying drawings.

[0030] FIG. 1 is a block diagram showing the overall configuration of the character recognition device of the present invention. The character recognition device 100 performs character recognition on image data read by a scanner 101 and outputs character data such as text to a display 102 and to a printing device 103 such as a printer.

[0031] The character recognition device 100 comprises an image memory 104 that stores the image data from the scanner 101, a CPU 105 that performs character recognition processing on the image data in the image memory 104, a ROM 106 that stores the character recognition processing program for the CPU 105, a RAM 107 used as a data work area during character recognition processing by the CPU 105, and a dictionary (dictionary data storage unit) 108 that the CPU 105 refers to during character recognition processing.

[0032] The character recognition processing program in the ROM 106 consists broadly of function-specific programs: an area division unit that divides the image data into the units used for character recognition processing, an OCR (character recognition) unit, and a certainty factor processing unit.

[0033] FIG. 2 is a flowchart outlining the character recognition processing executed by the character recognition device 100. The CPU 105 uses the character recognition processing program in the ROM 106 to recognize characters in the image data of a document or the like that was read by the scanner 101 and stored in the image memory 104.

[0034] First, the area division unit divides the image data into the units used for character recognition processing (step S201). Next, the OCR unit performs character recognition on each of the divided areas (step S202). The certainty factor processing unit then obtains the certainty factor, that is, the likelihood that the character recognition result is correct, and adjusts the output of the character recognition result according to the certainty factor (step S203).

[0035] FIG. 3 is a flowchart showing the procedure for line certainty factor processing within the character recognition processing executed by the character recognition device 100. First, the CPU 105 has the area division unit divide the image data into the lines on which character recognition is performed (step S301). The subsequent processing is then executed for each of these lines (step S302). Next, the OCR unit performs character recognition on the character areas among the divided areas (step S303).

[0036] Next, the certainty factor processing unit obtains the line certainty factor, that is, the likelihood that the character recognition result for this line is correct (step S304). It is then determined whether character recognition has finished for all of the divided areas (step S305). If not (step S305: No), processing returns to step S302 and character recognition is performed on the next area. When character recognition has finished for all areas (step S305: Yes), the character recognition result and the line certainty factor for each line are output.

[0037] FIG. 4 is a flowchart showing the procedure for character/non-character determination using the average certainty factor, executed by the character recognition device 100. First, the result of character recognition is input to the area division unit, which divides the character recognition result into lines (step S401). Then, with i as the line index, the following processing is repeated until i reaches the total number of lines in the area (step S402). The certainty factor processing unit first calculates the average value of the certainty factor over the line (the in-line average certainty factor), which represents the likelihood that the character recognition result for this line is correct (step S403). The calculated in-line average certainty factor is then compared with a threshold Th1 (step S404).

[0038] If the in-line average certainty factor exceeds the threshold Th1 (step S404: Yes), the line is judged to be characters (step S405). If it does not exceed the threshold Th1 (step S404: No), the line is judged to be non-character (step S406). Processing then returns to step S402 until the character/non-character determination has been made for every line, and the line certainty factor determination results are output (step S407).

[0039] FIG. 5 is a flowchart showing the procedure for character/non-character determination using the average certainty factor, executed by the character recognition device 100. First, the result of character recognition is input to the area division unit, which divides the character recognition result into lines (step S501). Then, with i as the line index, the following processing is repeated until i reaches the total number of lines into which the area was divided (step S502). The certainty factor processing unit first calculates the average value of the certainty factor over the line (the in-line average certainty factor), which represents the likelihood that the character recognition result for this line is correct (step S503). The calculated in-line average certainty factor is then compared with the threshold Th1 (step S504).

[0040] If the in-line average certainty factor exceeds the threshold Th1 (step S504: Yes), the line is judged to be characters (step S505). If it does not exceed the threshold Th1 (step S504: No), the line is judged to be non-character (step S506) and is replaced with a predetermined reject character (step S507). Processing then returns to step S502 until the character/non-character determination, and the reject processing for non-character lines, has been carried out for every line, and the line certainty factor determination results are output (step S508).

[0041] A non-character line is replaced with a reject character (reject code), for which a special character such as "＝" or a character code not normally used in OCR (outside the defined range) is used. Any character may be assigned as the reject character, which allows later steps after character recognition to process it. If a space were assigned as the reject character, it would be mistaken for there having been no character before recognition, so a character that displays something visible is preferable.
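The reject processing of FIG. 5 might look like the sketch below. It assumes that a line is a pair of recognized text and per-character certainty factors and that every character of a rejected line is replaced by the same reject character; the name `reject_unlikely_lines` and the sample values are hypothetical. The default reject character "＝" is the example given in the description.

```python
from typing import List, Sequence, Tuple

REJECT_CHAR = "＝"  # example from the description; any visible character works


def reject_unlikely_lines(
    lines: Sequence[Tuple[str, Sequence[float]]],
    th1: float,
    reject_char: str = REJECT_CHAR,
) -> List[str]:
    """Replace every character of a non-character line with the reject character."""
    output = []
    for text, confidences in lines:
        average = sum(confidences) / len(confidences) if confidences else 0.0
        if average > th1:
            output.append(text)  # line judged to be characters (step S505)
        else:
            output.append(reject_char * len(text))  # steps S506 and S507
    return output


recognized = [
    ("文字認識", [0.9, 0.8, 0.9, 0.7]),
    ("|/_-", [0.1, 0.2, 0.1, 0.3]),
]
print(reject_unlikely_lines(recognized, th1=0.5))
```

Keeping the rejected line at its original length with a visible marker lets a later step locate and handle the rejected span.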

[0042] The certainty factor processing unit may also be configured to compare the line certainty factor with a threshold, classify the line as either likely or unlikely, and, when the line is judged unlikely, determine that the line is not a character area.

[0043] FIG. 6 is a flowchart showing the procedure by which the character recognition device 100 changes the area type according to the character/non-character determination. First, the character recognition result for each area is input to the area division unit. The area count In starts at an initial value of 0 and increases up to n, the number of areas in the whole document. Next, the character recognition result of an input area is divided into lines (step S601). Then, with i as the line index, the following processing is repeated until i reaches the total number of lines into which the area was divided (step S602). The certainty factor processing unit first calculates the average value of the certainty factor over the line (the in-line average certainty factor), which represents the likelihood that the character recognition result for this line is correct (step S603). The calculated in-line average certainty factor is then compared with the threshold Th1 (step S604).

[0044] If the in-line average certainty factor exceeds the threshold Th1 (step S604: Yes), the line is judged to be characters (step S605). If it does not exceed the threshold Th1 (step S604: No), the line is judged to be non-character (step S606), and the count of non-character lines is incremented (In++) (step S607). Processing then returns to step S602 until every line has been judged and the non-character lines have been counted, and the line certainty factor determination results for the area are output.

[0045] When these results are output, it is determined from the line certainty factors whether the number of non-character lines in the area has reached the limit value n (step S608). If it has not (step S608: No), the area is judged to be a character area and the line certainty factor results are output. If it has (step S608: Yes), the area being processed is judged not to be a character area, and its attribute type is changed to another type (step S609).

[0046] FIG. 7 shows the attributes of the areas on a document 700. As illustrated, at area division each area is classified and given an attribute such as character area 701, figure area 702, table area 703, or enclosing-frame area 704. The character areas 701 are assigned a recognition order 1 to 4 (701a to 701e) according to the layout on the document 700.

[0047] When the area attribute is changed in step S609 and, for example, character area 701e is the area concerned, the attribute type of area 701e is changed. For example, the processing of FIG. 6 may be executed again, or area 701e itself may be deleted. Alternatively, area 701e may be changed to a figure area.

[0048] FIG. 8 is a flowchart showing the procedure by which the character recognition device 100 changes the size of a character area according to the character/non-character determination. First, the character recognition result for each area is input to the area division unit. The area count In starts at an initial value of 0 and increases up to n, the number of areas in the whole document. Next, the character recognition result of an input area is divided into lines (step S801). Then, with i as the line index, the following processing is repeated until i reaches the total number of lines into which the area was divided (step S802). The certainty factor processing unit first calculates the average value of the certainty factor over the line (the in-line average certainty factor), which represents the likelihood that the character recognition result for this line is correct (step S803). The calculated in-line average certainty factor is then compared with the threshold Th1 (step S804).

[0049] If the in-line average certainty factor exceeds the threshold Th1 (step S804: Yes), the line is judged to be a character line (step S805). If it does not exceed the threshold Th1 (step S804: No), the line is judged to be a non-character line (step S806), and the non-character line count is incremented (In++) (step S807). The process then returns to step S802, the character/non-character determination and the counting of non-character lines are performed for all lines, and the determination results of the line certainty factors for the area are output.

[0050] When this result is output, it is determined, based on the obtained line certainty factors, whether the number of non-character lines in the area has reached the limit value n (step S808). If the number of non-character lines has not reached the limit value n (step S808: No), the area is judged to be a character area and the line certainty results are output. If the number of non-character lines has reached the limit value n (step S808: Yes), the area being processed is judged not to be a character area, and the size of the area is changed (step S809).
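As a rough illustration of steps S801 to S809, the line-level judgment and the area-level decision might be sketched as follows in Python. The function names, the list-of-lists input format, and the sample values are assumptions made for illustration; the text does not specify data structures.

```python
def count_non_character_lines(lines, th1=50.0):
    """Count lines whose average certainty factor does not exceed th1.

    `lines` is a hypothetical structure: one list of per-character
    certainty factors (0-100) for each recognized line.
    """
    non_char = 0
    for confidences in lines:
        # Assumption: an empty line is treated as having an average of 0.
        average = sum(confidences) / len(confidences) if confidences else 0.0
        if not average > th1:      # S804: No -> non-character line (S806)
            non_char += 1          # S807
    return non_char


def is_character_area(lines, limit_n, th1=50.0):
    """S808: the area stays a character area while the count is below n."""
    return count_non_character_lines(lines, th1) < limit_n


area = [[90, 80], [10, 20], [60, 70]]
print(count_non_character_lines(area))
print(is_character_area(area, limit_n=2))
```

When `is_character_area` returns `False`, the caller would go on to the attribute change of step S609 or the size change of step S809.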

[0051] FIG. 9 is a diagram showing an example of changing the size of a character area. As shown in FIG. 9(a), suppose that high-confidence lines 901a to 901n are found consecutively in the character area 701 and that a low-confidence line 901x is found at the bottom. In this case, as shown in FIG. 9(b), the area size is changed so that the character area 701 is formed only of the high-confidence lines 901a to 901n, excluding the low-confidence line 901x.

[0052] FIG. 10 is a diagram showing another example of changing the character area size. As shown in FIG. 10(a), suppose that a low-confidence line 901x is found between the consecutive high-confidence lines 901a to 901d and 901e to 901g in the character area 701. In the illustrated example, the low-confidence line 901x consists of multiple lines.

[0053] In this case, as shown in FIG. 10(b), one character area 701A is formed from the first group of high-confidence lines 901a to 901d, excluding the low-confidence line 901x. The other character area 701B is formed from the second group of high-confidence lines 901e to 901g.

[0054] In this way, the character area 701 itself can be divided when the area size is changed, depending on the position of the low-confidence line 901x within the character area 701. In doing so, the area attribute of each of the low-confidence lines 901x can also be changed from character area to figure area.
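The resizing and splitting described for FIGS. 9 and 10 amount to keeping each run of consecutive high-confidence lines as its own character area. A minimal Python sketch, assuming each line has already been judged and is represented by a boolean (the function name and the index-pair output are illustrative, not from the patent):

```python
def split_character_area(is_character_line):
    """Return (first, last) line-index pairs for each run of character lines.

    `is_character_line` is a hypothetical list of booleans, one per line,
    where False marks a low-confidence (non-character) line.
    """
    areas = []
    start = None
    for index, ok in enumerate(is_character_line):
        if ok and start is None:
            start = index                      # a new run begins
        elif not ok and start is not None:
            areas.append((start, index - 1))   # the run ended on the previous line
            start = None
    if start is not None:
        areas.append((start, len(is_character_line) - 1))
    return areas


# Low-confidence line at the bottom: the area is shrunk.
print(split_character_area([True, True, True, False]))
# Low-confidence lines in the middle: the area is split in two.
print(split_character_area([True] * 4 + [False] * 2 + [True] * 3))
```

An empty result means that no line survived, in which case the whole area would be deleted or given another attribute.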

[0055] Next, FIG. 11 is a flowchart showing the low confidence processing of the present invention. This processing refers to the certainty factors of the recognized characters and erases the character recognition result when the average certainty factor of a line or area is low. The processing conditions are finely divided so that correct characters are retained as far as possible while recognition results that are unlikely to be characters are deleted.

[0056] All of the low confidence processing described below is performed line by line. The following six features are used in combination to judge whether a line is text or something else.

[0057]
1) In-line average certainty factor
2) Number of high-confidence characters
3) Ratio of high-confidence characters
4) Number of alphanumeric characters
5) User-set threshold
6) Overlap with figures, tables, and the like

[0058] In an actual character recognition result, the number of characters in a line is not known until one line of data has been analyzed (up to the position of the line feed code). Feature extraction is therefore performed while the line feed position is being located, so that all of the above features can be regarded as already collected by the time the area coordinates of the line are known. This is why, in the flowchart of FIG. 11, the feature calculation comes before the step that excludes a character recognition line because it overlaps a table area.

[0059] First, the character recognition result for each area is input to the area dividing unit. Next, the input character recognition result of an area is divided into lines (step S1101). Then, with i as the line counter, the following processing is repeated until i reaches the total number of lines in the area (step S1102). First, the certainty factor processing unit calculates the average line certainty factor (in-line average certainty factor), which expresses how likely the character recognition result of the line is to be correct (step S1103). At the same time, the number of high-confidence characters, for example characters with a certainty factor of 80 or more, is counted.
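The single-pass feature collection described above (features are accumulated while scanning for the line feed) could look roughly like this in Python. The `(character, certainty)` tuple stream, the dictionary keys, and the ASCII-only definition of "alphanumeric" are assumptions for illustration.

```python
def extract_line_features(recognized, high_threshold=80):
    """Collect per-line features in one pass over (char, certainty) pairs.

    A "\n" entry marks the end of a line. Assumption: only ASCII letters
    and digits count as alphanumeric characters.
    """
    features = []
    total = count = high = alnum = 0
    # A trailing sentinel line feed flushes the last line.
    for char, certainty in list(recognized) + [("\n", 0)]:
        if char == "\n":
            if count:
                features.append({
                    "chars": count,
                    "average": total / count,
                    "high_confidence": high,
                    "alphanumeric": alnum,
                })
            total = count = high = alnum = 0
            continue
        total += certainty
        count += 1
        if certainty >= high_threshold:   # e.g. certainty factor of 80 or more
            high += 1
        if char.isascii() and char.isalnum():
            alnum += 1
    return features


stream = [("A", 90), ("1", 70), ("あ", 80), ("\n", 0), ("x", 40)]
for line in extract_line_features(stream):
    print(line)
```

Because every feature is ready when the line feed is reached, later steps such as the table check can run without a second pass over the line.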

[0060] Next, it is determined whether the line is contained in a table, that is, whether it is a character string inside a table (step S1104). If the line is contained in a table (step S1104: Yes), the low confidence processing is not performed and the process returns to step S1102. The reason is that the inside of a table is likely to contain strings of digits, for which the certainty factor tends to come out relatively low, and numerical data with such low certainty factors should not be deleted.

[0061] If the line is not contained in a table (step S1104: No), it is next determined whether the line overlaps a figure that is reasonably small relative to the whole image (step S1105). If it does (step S1105: Yes), the overlap flag is set to ON (step S1106). If it does not (step S1105: No), the overlap flag is set to OFF (step S1107).

[0062] Threshold processing is used to decide whether a figure is a "small figure". The extent covered by the result areas is determined, and half of the shorter of its height and width is used as the threshold. If both the height and the width of a figure are at or below this threshold, the figure is treated as a small figure and used in the low confidence processing. The condition for deleting characters is then varied according to the flag indicating overlap with such a figure.
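The small-figure test above can be written as a short Python check. The parameter names are illustrative, and lengths are assumed to be in the same unit (for example pixels).

```python
def is_small_figure(fig_width, fig_height, extent_width, extent_height):
    """Return True if both figure dimensions are at or below the threshold.

    The threshold is half of the shorter side of the extent covered by
    the result areas, as described in the text.
    """
    threshold = min(extent_width, extent_height) / 2
    return fig_width <= threshold and fig_height <= threshold


# With a 2000 x 3000 extent the threshold is 1000.
print(is_small_figure(800, 900, 2000, 3000))
print(is_small_figure(800, 1200, 2000, 3000))
```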

[0063] This processing is an example for the case where setting the low confidence processing threshold to 50 (Th1 = 50) is most effective. After the overlap flag has been set to OFF (step S1107), it is determined whether the average certainty factor exceeds 60 (step S1108). If it does (step S1108: Yes), the process returns to step S1102. If it does not (step S1108: No), the process proceeds to step S1110.

[0064] After the overlap flag has been set to ON (step S1106), it is determined whether the average certainty factor exceeds 70 (step S1109). If it does (step S1109: Yes), the process returns to step S1102. If it does not (step S1109: No), the process proceeds to step S1110.

[0065] In step S1110, the ratio of the number of high-confidence characters to the number of characters in the line is evaluated, using the condition (number of high-confidence characters / number of characters in the line > 40%). If the ratio exceeds 40% (step S1110: Yes), the process returns to step S1102. If it does not (step S1110: No), it is determined whether at least one high-confidence character exists and the average certainty factor of the characters is equal to or greater than the threshold Th1 (50) (step S1111). If both conditions are satisfied (step S1111: Yes), the process returns to step S1102. If either condition is not satisfied (step S1111: No), the process proceeds to step S1112.

[0066] Next, it is determined whether the line contains at least a predetermined number (for example, four) of alphanumeric characters and the average certainty factor of the characters is equal to or greater than the threshold Th1 (step S1112). If both conditions are satisfied (step S1112: Yes), the process returns to step S1102. If either condition is not satisfied (step S1112: No), the state of the figure overlap flag for the line is examined next (step S1113).

[0067] If the line overlaps a figure (the figure overlap flag is ON) (step S1113: Yes), the process proceeds to step S1115. If the line does not overlap a figure (the figure overlap flag is OFF) (step S1113: No), it is determined whether the high-confidence characters in the line exceed a predetermined ratio (for example, 10%) and the average certainty factor of the characters exceeds the threshold (step S1114). If both conditions are satisfied (step S1114: Yes), the process returns to step S1102. If either condition is not satisfied (step S1114: No), the process proceeds to step S1115. In step S1115, the first candidates of the character data in the line are replaced with spaces, and the process returns to step S1102.
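The decision cascade of steps S1104 to S1115 can be summarized in a short Python sketch. This is one reading of the flowchart description, not the patented implementation: the `LineFeatures` class and its field names are hypothetical, the "average certainty factor" in S1111, S1112, and S1114 is assumed to be the in-line average, the unnamed threshold in S1114 is assumed to be Th1, and the "No" branch of S1108 is assumed to lead to S1110.

```python
from dataclasses import dataclass


@dataclass
class LineFeatures:
    confidences: list            # per-character certainty factors (0-100)
    alnum_count: int = 0         # number of alphanumeric characters
    in_table: bool = False       # S1104: line is inside a table area
    overlaps_small_figure: bool = False   # S1105: overlap flag


def keep_line(f, th1=50.0, high_threshold=80.0):
    """Return True to keep the line, False to blank it (S1115)."""
    n = len(f.confidences)
    if n == 0:
        return True                                   # nothing to blank
    average = sum(f.confidences) / n
    n_high = sum(c >= high_threshold for c in f.confidences)
    ratio = n_high / n

    if f.in_table:                                    # S1104
        return True
    if average > (70 if f.overlaps_small_figure else 60):   # S1109 / S1108
        return True
    if ratio > 0.40:                                  # S1110
        return True
    if n_high > 0 and average >= th1:                 # S1111
        return True
    if f.alnum_count >= 4 and average >= th1:         # S1112
        return True
    if f.overlaps_small_figure:                       # S1113: Yes -> S1115
        return False
    return ratio > 0.10 and average > th1             # S1114


print(keep_line(LineFeatures([65] * 10)))
print(keep_line(LineFeatures([65] * 10, overlaps_small_figure=True)))
```

The two calls differ only in the overlap flag, which shows how overlap with a small figure raises the bar for keeping a line.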

[0068] Language processing is often used as information for obtaining the above certainty factor. First, character lines are cut out of the areas that area identification has classified as character areas. For example, lines are cut out at positions with few black pixels by taking a projection, and characters are then extracted using projections or the circumscribed rectangles of black pixels. In Japanese, one character does not necessarily form a single rectangle, so each character is cut out in several different ways and the segmentation that gives the best result is adopted as the final result.

[0069] Post-processing is then executed. The cut-out character string is split into words by a technique such as morphological analysis, and the words are matched against the words in a language dictionary. A high certainty factor is obtained when the recognition result is grammatically consistent. When the recognition result is plausible as a sentence in this way, many characters have a high certainty factor. In such cases, the number of reliably recognized characters, as used in step S1110 above, is a far more effective feature than the average certainty factor.

[0070] In the post-processing, suppose for example that the document contains the word "出入り口" ("doorway") and that this word is in the word dictionary, but that it was recognized as "出人りロ" (read "dehitoriro"). The second candidates for "人" and "ロ" are "入" and "口", respectively. Swapping the second candidates with the first candidates therefore yields a word that matches the word dictionary, and this is judged to be the correct result. Such post-processing affects the above certainty factor: a recognition result that matches the word dictionary receives a higher certainty factor.
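The candidate swap in this example can be illustrated with a small Python sketch. It is a simplification made for illustration: the candidate lists, the set-based dictionary, and the brute-force search over the top two candidates are assumptions, and a real system would combine this with morphological analysis.

```python
from itertools import product


def correct_with_dictionary(candidates, dictionary, depth=2):
    """Return a dictionary word built from the top `depth` candidates.

    `candidates` holds one ranked candidate list per character position.
    If no combination matches, the first-candidate string is returned.
    Note: the search is exhaustive, so it is only suitable for short words.
    """
    first_choice = "".join(c[0] for c in candidates)
    if first_choice in dictionary:
        return first_choice
    for combo in product(*(c[:depth] for c in candidates)):
        word = "".join(combo)
        if word in dictionary:
            return word
    return first_choice


candidates = [["出"], ["人", "入"], ["り"], ["ロ", "口"]]
print(correct_with_dictionary(candidates, {"出入り口"}))
```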

[0071] The low confidence processing above converts characters to spaces line by line. If every line of an area were converted to spaces, it would be wasteful to output a large number of spaces as the recognition result of a character area. Therefore, when all lines in an area have been replaced with spaces, the character area size changing processing (step S809 in FIG. 8) is executed and the area itself is deleted from the result (see FIGS. 9 and 10). Instead of deletion, the attribute of the area may also be changed to figure or another type.
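A minimal Python sketch of this clean-up step follows. The dictionary-based area representation, the attribute names `"text"` and `"figure"`, and the `mode` switch are assumptions chosen for illustration.

```python
def drop_blank_areas(areas, mode="delete"):
    """Remove (or re-tag) character areas whose lines are all spaces.

    Each area is a hypothetical dict: {"attr": str, "lines": [str, ...]}.
    mode="delete" removes such areas; mode="retag" keeps them as figures.
    """
    result = []
    for area in areas:
        all_blank = area["attr"] == "text" and all(
            line.strip() == "" for line in area["lines"]
        )
        if not all_blank:
            result.append(area)
        elif mode == "retag":
            result.append({**area, "attr": "figure"})
    return result


areas = [
    {"attr": "text", "lines": ["Hello", "   "]},
    {"attr": "text", "lines": ["   ", "  "]},
]
print(drop_blank_areas(areas))
print(drop_blank_areas(areas, mode="retag"))
```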

[0072] In the above processing, when the characters in a line are deleted, a space code is inserted as the first candidate and the previous candidate characters are stored as the second and lower candidates, each moved down in rank. Because the first candidate itself is not overwritten by the space, the candidates can be swapped back in later processing by referring to the intermediate data that is generated for recognition and stored separately.
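The reversible blanking described here can be sketched in Python as follows. The nested-list candidate format and the function names are assumptions; the point is only that the space is inserted ahead of the existing candidates rather than overwriting them.

```python
SPACE = " "


def blank_line(line_candidates):
    """Insert a space as first candidate, demoting the original candidates."""
    return [[SPACE] + list(cands) for cands in line_candidates]


def restore_line(blanked):
    """Undo blank_line by dropping the leading space candidate."""
    return [
        cands[1:] if cands and cands[0] == SPACE else list(cands)
        for cands in blanked
    ]


line = [["a", "o"], ["b"]]
blanked = blank_line(line)
print(blanked)
print(restore_line(blanked) == line)
```

Output code would read only the first candidate of each position, so a blanked line prints as spaces while the original candidates remain available.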

[0073] In addition, according to the above processing, when a character line overlaps a figure (step S1105: Yes), the comparison values (average certainty factors) used in the subsequent steps (step S1109, step S1114) are higher, so the settings cause more characters to be deleted.

[0074] In the above processing, the low confidence processing threshold (Th1) is set to 50. If a threshold of 50 or more is desired, the desired value can be set in Th1 and the same processing performed. If the user-set threshold is lower than 50, on the other hand, the user wants characters with low certainty factors to be output as well. In this case, unlike the steps of the above flowchart, the average certainty factor of the line is compared with the user-set threshold to judge whether it is high or low.
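The switch between the two behaviors might be expressed as in the Python sketch below. The function name, the `cascade` callback (standing in for the flowchart of FIG. 11), and the use of "greater than or equal" for the low-threshold comparison are assumptions, since the text does not state which comparison operator is used.

```python
def judge_line(average, user_threshold, cascade):
    """Decide whether to keep a line under a user-set threshold.

    Below 50, only the in-line average is compared with the threshold.
    At 50 or above, the full cascade is run with Th1 = user_threshold.
    `cascade` is a hypothetical callable taking (average, th1).
    """
    if user_threshold < 50:
        return average >= user_threshold   # assumption: ">=" marks "high"
    return cascade(average, user_threshold)


# A stand-in cascade that only checks the average, for demonstration.
simple_cascade = lambda average, th1: average > th1 + 10
print(judge_line(35, 30, simple_cascade))
print(judge_line(55, 50, simple_cascade))
```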

[0075] In step S1110, the proportion of high-confidence characters in the total number of characters in the line is evaluated. This prevents the reliability from varying between, for example, a 5-character line containing 3 high-confidence characters and a 40-character line containing 3 high-confidence characters.

[0076] The processing that compares the average certainty factor of the characters in the line with the user-set threshold (Th1) (step S1111, step S1114) keeps the procedure simple, with as few parameters as possible, and makes it easier to reflect the user's intent.

[0077] The processing that compares the proportion of alphanumeric characters among the characters in the line (step S1112) reduces the influence of alphanumeric characters on the certainty factor. Specifically, alphanumeric characters include relatively many look-alike pairs, such as b and 6, q and 9, o and 0, and s and S, and language processing has almost no effect on digits, so their certainty factors tend to come out lower than those of kana and kanji. When a line contains many alphanumeric characters, lowering the criterion applied to the average certainty factor is therefore effective, and this is what is done.

[0078] In addition, when determining the certainty of a line, the positional overlap between the coordinates of the character line and areas of other attributes is used instead of the character certainty factors. In particular, the determination of overlap with a table area (step S1104) is used. For character lines inside a table area, it is effective to skip all of the subsequent low confidence processing. If the character recognition result for a table area contains many alphanumeric characters, it is very likely the result of recognizing a numerical table, so it is desirable to exempt it from the subsequent processing that deletes characters with low certainty factors.

[0079] Similarly, by determining overlap with figures and photographs (step S1105), when the average certainty factor of an entire line in a character area overlapping a figure is low, it becomes easier to conclude that part of the figure was recognized as characters. Not every figure area has to be used here. For example, only a figure whose bounding rectangle covers the entire image may be excluded from the determination.

[0080] The features described above for judging whether a line is text or something else in the low confidence processing, namely 1) in-line average certainty factor, 2) number of high-confidence characters, 3) ratio of high-confidence characters, 4) number of alphanumeric characters, 5) user-set threshold, and 6) overlap with figures, tables, and the like, can be used by combining information from at least one of them to obtain the certainty factor of a line.

[0081] In addition, when a feature containing certainty factor information is used together with a feature describing the area in which the character line is contained (or with which it overlaps), the threshold against which the certainty factor information of the character line is compared may be varied according to the type of area containing the character line.
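One way to express an area-dependent threshold is a small lookup table, sketched below in Python. The keys are hypothetical labels; the values 60 and 70 are the example averages used in steps S1108 and S1109, and `None` stands for "skip the check", as is done for lines inside a table.

```python
# Hypothetical mapping from the kind of enclosing/overlapping area
# to the average-certainty threshold applied to a character line.
AVERAGE_THRESHOLDS = {
    "none": 60,      # no overlap (S1108)
    "figure": 70,    # overlaps a small figure (S1109)
    "table": None,   # inside a table: low confidence processing is skipped
}


def passes_average_check(average, area_kind):
    """Return True if the line passes the area-dependent average check."""
    threshold = AVERAGE_THRESHOLDS[area_kind]
    if threshold is None:
        return True
    return average > threshold


for kind in ("none", "figure", "table"):
    print(kind, passes_average_check(65, kind))
```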

[0082] The character recognition method described in this embodiment can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. The program is recorded on a computer-readable recording medium such as a hard disk, floppy (R) disk, CD-ROM, MO, or DVD, and is executed by being read from the recording medium by the computer. The program can also be distributed via such a recording medium or via a network such as the Internet.

【００８３】[0083]

[Effects of the Invention] As described above, the character recognition method according to the invention of claim 1 automatically determines character areas on a document and recognizes the characters in those areas as character data. It comprises an area dividing step of dividing the areas on the document into character areas containing characters and a plurality of attribute-specific areas corresponding to other content, a character recognition step of recognizing character data line by line within a character area, and a certainty factor calculation step of determining, line by line within the character area, the likelihood of the character recognition result and outputting a certainty factor indicating that likelihood. As a result, recognition results for non-character portions are not output, and character recognition accuracy can be improved.

[0084] According to the invention of claim 2, in the invention of claim 1, the certainty factor calculation step classifies the likelihood of the obtained line-by-line character recognition result as either likely or unlikely and, when it is judged unlikely, replaces all character recognition results of that line with a predetermined character before output. This makes the replaced characters easy to process and simplifies the work performed after character recognition.

[0085] According to the invention of claim 3, in the invention of claim 1, the certainty factor calculation step classifies the likelihood of the obtained line-by-line character recognition result as either likely or unlikely and, when it is judged unlikely, determines that the line is not a character area and changes its area attribute to another area type. This makes it possible to detect that character recognition was performed on a portion that is not a character area, which improves the accuracy of area division.

[0086] According to the invention of claim 4, in the invention of claim 3, when the line-by-line character recognition results are judged unlikely for all lines in the character area, the certainty factor calculation step deletes the character area. This prevents unnecessary character recognition data from being output and improves the reliability of character recognition results for documents that contain areas of other attributes.

[0087] According to the invention of claim 5, in the invention of claim 3, when the line-by-line character recognition results are judged unlikely for some of the lines in the character area, the certainty factor calculation step changes the size of the character area so that it is formed only of the likely lines, excluding those lines. This improves the accuracy of character recognition within the character area.

[0088] According to the invention of claim 6, in the invention of any one of claims 1 to 5, the certainty factor calculation step uses, as information for obtaining the certainty factor of a line, the number of characters in the line whose certainty factor is at or above a predetermined threshold. Because a recognition result that is plausible as a sentence contains many characters with high certainty factors, using this count for the line certainty factor improves the accuracy of character recognition.

[0089] According to the invention of claim 7, in the invention of any one of claims 1 to 5, the certainty factor calculation step uses, as information for obtaining the certainty factor of a line, the ratio of the number of alphanumeric characters to the number of characters in the line. When the line contains a high proportion of alphanumeric characters, for which language processing during character recognition is ineffective, the criterion applied to the average certainty factor can be lowered, for example, so that the character recognition accuracy obtained for kana and kanji is maintained.

[0090] According to the invention of claim 8, in the invention of any one of claims 1 to 5, the certainty factor calculation step uses, as information for obtaining the certainty factor of a line, the degree of positional overlap with areas of other attributes. The certainty factor is thus obtained from the degree of positional overlap between the coordinates of the character line and areas of other attributes, which improves the accuracy of character recognition.

[0091] According to the invention of claim 9, a character recognition device that automatically determines character areas on a document and recognizes the characters in those areas as character data comprises area dividing means for dividing the areas on the document into character areas containing characters and a plurality of attribute-specific areas corresponding to other content, character recognition means for recognizing character data line by line within a character area, and certainty factor calculation means for determining, line by line within the character area, the likelihood of the character recognition result and outputting a certainty factor indicating that likelihood. Because a certainty factor is obtained for each line, recognition results for non-character portions are not output, and character recognition accuracy can be improved.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a character recognition device according to an embodiment of the present invention.

FIG. 2 is a flowchart showing an outline of the character recognition processing of the character recognition device according to the embodiment of the present invention.

FIG. 3 is a flowchart showing the procedure for line certainty factor processing in the character recognition processing of the character recognition device according to the embodiment of the present invention.

FIG. 4 is a flowchart showing the procedure for character/non-character determination using the average certainty factor in the character recognition device according to the embodiment of the present invention.

FIG. 5 is a flowchart showing the procedure for character/non-character determination using the average certainty factor in the character recognition device according to the embodiment of the present invention.

FIG. 6 is a flowchart showing the procedure for changing the area type based on character/non-character determination in the character recognition device according to the embodiment of the present invention.

FIG. 7 is a diagram showing the attributes of the areas on a document in the character recognition device according to the embodiment of the present invention.

FIG. 8 is a flowchart showing the procedure for changing the size of a character area based on character/non-character determination in the character recognition device according to the embodiment of the present invention.

FIG. 9 is a diagram showing an example of changing the size of a character area in the character recognition device according to the embodiment of the present invention.

FIG. 10 is a diagram showing another example of changing the size of a character area in the character recognition device according to the embodiment of the present invention.

FIG. 11 is a flowchart showing the low certainty factor processing of the character recognition device according to the embodiment of the present invention.

[Explanation of symbols]

100 Character recognition device
101 Scanner
102 Display
103 Printing device
104 Image memory
105 CPU
106 ROM
107 RAM
108 Dictionary
700 Document
701 (701a to 701e, 701A, 701B) Character area
702 Figure area
703 Table area
704 Enclosing frame area
901a to 901n High-certainty line
901x Low-certainty line

Claims

[Claims]

1. A character recognition method for automatically discriminating a character area on a document and recognizing characters within the character area as character data, the method comprising: an area dividing step of dividing the area on the document into a character area in which characters are written and a plurality of attribute-specific areas corresponding to other content; a character recognition step of recognizing character data line by line within the character area; and a certainty factor calculation step of obtaining, line by line within the character area, the certainty of the character recognition results and outputting a certainty factor indicating that certainty.

2. The character recognition method according to claim 1, wherein the certainty factor calculation step classifies the certainty of the line-by-line character recognition results into two values, certain and uncertain, and, when a line is determined to be uncertain, replaces all character recognition results of that line with a predetermined character and outputs them.

3. The character recognition method according to claim 1, wherein the certainty factor calculation step classifies the certainty of the line-by-line character recognition results into two values, certain and uncertain, and, when a line is determined to be uncertain, determines that the line is not a character area and changes its area attribute to that of another area.

4. The character recognition method according to claim 3, wherein the certainty factor calculation step deletes the character area when the character recognition results are determined to be uncertain for all lines in the character area.

5. The character recognition method according to claim 3, wherein, when the character recognition results are determined to be uncertain for some of the lines in the character area, the certainty factor calculation step changes the size of the character area so that the character area consists only of the lines determined to be certain, excluding the uncertain lines.

6. The character recognition method according to any one of claims 1 to 5, wherein the certainty factor calculation step uses, as information for obtaining the certainty factor of a line, the number of characters in the line whose certainty factor is equal to or greater than a predetermined threshold.

7. The character recognition method according to any one of claims 1 to 5, wherein the certainty factor calculation step uses, as information for obtaining the certainty factor of a line, the ratio of the number of alphanumeric characters in the line to the number of characters in the line.

8. The character recognition method according to any one of claims 1 to 5, wherein the certainty factor calculation step uses, as information for obtaining the certainty factor of a line, the degree of positional overlap with areas of other attributes.

9. A character recognition device for automatically discriminating a character area on a document and recognizing characters within the character area as character data, the device comprising: area dividing means for dividing the area on the document into a character area in which characters are written and a plurality of attribute-specific areas corresponding to other content; character recognition means for recognizing character data line by line within the character area; and certainty factor calculation means for obtaining, line by line within the character area, the certainty of the character recognition results and outputting a certainty factor indicating that certainty.
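
The area handling recited in claims 4 to 6 can be sketched as follows, under stated assumptions. Each line carries its bounding coordinates and per-character certainty factors; a line counts as certain when enough of its characters reach a threshold (claim 6), the character area is rebuilt from the certain lines only (claim 5), and the area is dropped when no line is certain (claim 4). The `Line` type, the function names, and the values 0.7 and 0.5 are hypothetical and not taken from the claims.

```python
from typing import List, NamedTuple, Optional, Tuple


class Line(NamedTuple):
    """A recognized line; coordinates are assumed to be pixels."""
    left: int
    top: int
    right: int
    bottom: int
    certainties: List[float]


def is_certain(line: Line, char_threshold: float = 0.7,
               min_ratio: float = 0.5) -> bool:
    """Count characters at or above the threshold and compare the ratio."""
    if not line.certainties:
        return False
    high = sum(1 for c in line.certainties if c >= char_threshold)
    return high / len(line.certainties) >= min_ratio


def resize_area(lines: List[Line]) -> Optional[Tuple[int, int, int, int]]:
    """Bounding box (left, top, right, bottom) of the certain lines.

    Returns None when every line is uncertain, meaning the area is deleted.
    """
    kept = [line for line in lines if is_certain(line)]
    if not kept:
        return None
    return (min(l.left for l in kept), min(l.top for l in kept),
            max(l.right for l in kept), max(l.bottom for l in kept))
```

This sketch takes a single bounding box, so an uncertain line lying between two certain lines would still fall inside the resized area; handling that case would need the area to be split, which is not attempted here.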