JPH08249425A

JPH08249425A - Document reading method to output character attribute information

Info

Publication number: JPH08249425A
Application number: JP7052238A
Authority: JP
Inventors: Kazuyuki Yoshida; 收志吉田; Tetsuo Kiuchi; 哲夫木内
Original assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Current assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Priority date: 1995-03-13
Filing date: 1995-03-13
Publication date: 1996-09-27

Abstract

PURPOSE: To provide a document reading method that outputs information on character attributes, such as typeface and character size, attached to the recognized character codes. CONSTITUTION: Character image data input from an image scanner is analyzed to extract and label character strings (S2). Within the steps (S3), (S4) that segment single-character regions and perform character recognition, the method provides a process (S5) that measures and recognizes the character size, a subsequent process (S6) that measures and recognizes the stroke thickness, and a process (S7) that determines the typeface type from the results of these two processes. When these processes cannot determine the typeface type, the segmented character is rotated by 90 degrees and the typeface determination is repeated (S8), (S9). The character attribute information recognized in these processes is attached to the character code and output.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a document reading method for a document reading device. In particular, it relates to a document reading method with a function that extracts phrase and sentence regions emphasized by a different typeface or similar means and outputs them together with the character codes, as required in tasks such as building a database from existing typeset printed documents or reprinting out-of-print books.

[0002]

[Prior Art] As illustrated in FIG. 9, a document reading device (OCR) 1 consists of a character reading and recognition unit 2, which comprises an image scanner 21 and a character recognition processor 22, and a host computer 3. The character recognition processor 22 takes as input the character image data that the image scanner 21 obtains by optically scanning the document to be read. Following the flow outlined in FIG. 10, it first examines the read data and cuts out the character image data region to be processed as one character. It then analyzes the cut-out character image data to extract the feature parameters it contains, and matches these feature parameters against a dictionary of feature parameters prepared in advance for each character in the range to be read. By extracting the character whose feature parameters match, it recognizes the character that was read and outputs the character classification code assigned to that character as character information. This is its basic function. The host computer 3 functions as a man-machine interface for setting the reading conditions for the document to be read, displaying the reading results, and so on, and also carries out the construction of a database from the document information obtained by reading, or the editing and proofreading of documents.
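
The dictionary matching described above can be sketched as a nearest-neighbor lookup. This is a minimal illustration only: the source does not say which feature parameters or which matching rule the processor uses, so the grid-density features, the Euclidean distance, and the names `feature_vector`, `recognize`, and `DICTIONARY` are all assumptions.

```python
from math import dist

def feature_vector(glyph, grid=2):
    """Split a binary glyph into grid x grid cells and return each cell's ink density.

    Assumes the glyph is a list of equal-length rows of 0/1 values and is at
    least `grid` pixels wide and high.
    """
    h, w = len(glyph), len(glyph[0])
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            rows = range(gy * h // grid, (gy + 1) * h // grid)
            cols = range(gx * w // grid, (gx + 1) * w // grid)
            cells = [glyph[y][x] for y in rows for x in cols]
            feats.append(sum(cells) / len(cells))
    return feats

def recognize(glyph, dictionary):
    """Return the code whose stored feature vector is closest to the glyph's."""
    feats = feature_vector(glyph)
    return min(dictionary, key=lambda code: dist(feats, dictionary[code]))

# Hypothetical two-entry dictionary: code -> reference feature vector.
DICTIONARY = {
    "A": [1.0, 0.0, 1.0, 0.0],  # ink in the left half
    "B": [1.0, 1.0, 0.0, 0.0],  # ink in the top half
}

# A slightly noisy left-half glyph.
NOISY_LEFT = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]
print(recognize(NOISY_LEFT, DICTIONARY))
```

A real recognizer would use far richer features and a dictionary covering every character in the reading range; the point here is only the extract-then-match structure.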

[0003]

[Problem to Be Solved by the Invention] In the conventional document reading device described above, only the converted character code is output. Information such as the typeface (for example Gothic or Mincho) and the character size is not output, so it cannot be obtained unless a person looks at the input image again. Consequently, when the input document is to be reproduced as it is, or when headings and bold portions are to be extracted for compiling a table of contents or an index, a person must review and process the document, and this takes time.

[0004] Likewise, when keywords are extracted for building a database or for searching filed text, characters emphasized in Gothic or bold type cannot be extracted, so after processing by the document reading device a person must look at the document again and enter this information anew. The present invention addresses the problem that, in prior-art document reading devices, information on character attributes such as typeface and size is lost during character reading and recognition. Its object is to provide a document reading method that processes the document so that the character attribute information is not lost, allows the input document to be reproduced in its original form, and enables a document reading device that extracts, displays, and outputs emphasized words and sentences on the basis of the character attribute information.

[0005]

[Means for Solving the Problem] To solve the above problem, the present invention modifies the document reading method of a document reading device as follows. The basic processing analyzes the character image data that the image scanner obtains by optically scanning the document and stores in a buffer, extracts and labels character strings, cuts out single-character regions, and determines the attributes of each cut-out character by matching against a character pattern dictionary to assign a character code. Into this processing the invention adds a process that measures and recognizes the character size, a subsequent process that measures and recognizes the stroke thickness, and a process that determines and recognizes the typeface type from the results of these two processes. When these processes cannot determine the typeface type, the cut-out character is rotated by 90 degrees and the typeface determination is repeated to recognize the character. The character attribute information recognized in these processes is attached to the character code and output.

[0006] In addition, after the attribute determination of the cut-out characters, a process is provided that groups into blocks the character ranges recognized as having the same typeface, and a process is added that outputs as keywords those character blocks that do not belong to the body-text typeface and are within a predetermined character length. Furthermore, after the blocking process, a process that detects and determines document-specific attributes is provided. It consists of a process that detects major, medium, and minor headings and a process that detects line-head headings, both working from the recognition results of the blocking process and of the process that analyzes the character pattern data to extract and label character strings. On the basis of this document attribute determination, the document attribute information and the character attribute information are attached to the character codes and output.

[0007] Further, after the process that determines document-specific attributes, a process is provided that selectively extracts only the document portions having an attribute specified through input means, so that only the document portions with the specified attribute are output and displayed.

[0008]

[Operation] Character strings such as headings, body text, and decoration lines are extracted and labeled from the character image data scanned by the image scanner and stored in the buffer, and the character segmentation process cuts single-character regions out of the labeled strings. The character attribute determination process that follows determines the typeface from the measured size of the cut-out character and the measured thickness of its strokes. When the typeface cannot be determined, it rotates the cut-out character by 90 degrees and performs the typeface determination again, so the typeface is recognized properly even when alphabetic characters set sideways in vertical text are mixed in. The character string attribute information obtained in the string extraction and labeling process and the typeface information obtained in the character attribute determination process are attached to the character code of the recognition result and output.

[0009] With a document reading method that provides, after the attribute determination of the cut-out characters, a process that groups into blocks the character ranges recognized as having the same typeface, character blocks that do not belong to the body-text typeface and are within a predetermined character length are output as keywords. With a document reading method that provides, after the blocking process, a process that detects and determines document-specific attributes from the recognition results of the character string labeling process, character string attributes such as major heading, medium heading, and minor heading are attached to the character codes of the recognition result and output.

[0010] Further, with a document reading method that provides, after the process that determines document-specific attributes, a process that selectively extracts only the document portions having an attribute specified through the input means, only the document portions with the specified attribute are output and displayed.

[0011]

[Embodiments] FIG. 1 shows the processing flow of one embodiment of a document reading method in which, according to the present invention, a character attribute determination step is placed ahead of the step that analyzes the character image data obtained by the image scanner optically scanning the document and stored in a buffer, cuts out a single-character region, and determines the attributes of the cut-out character by matching against a character pattern dictionary to assign a character code. FIG. 2 shows an example of the result obtained when an input document is processed by the document reading method of the flow of FIG. 1. The method of the present invention is described with reference to these figures.

[0012] In FIG. 1, the steps from image input S1 to character segmentation S3 are equivalent to the document image input and observation processing, A through D, of the prior-art document reading device described with reference to FIG. 10. The processing up to character segmentation S3 is executed hierarchically by way of process S2, which first treats the character image data as character strings or image regions and labels them.
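
The hierarchical order of S2 and S3 (label the strings first, then cut single characters out of each string) can be sketched with projection profiles. The source does not state how the segmentation is done, so the projection-profile technique, the assumption of horizontal text lines, and the names `split_runs` and `segment_characters` are illustrative assumptions.

```python
def split_runs(profile):
    """Return (start, end) index pairs for each run of non-zero entries in a profile."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def segment_characters(page):
    """Find text lines first (in the spirit of S2), then character boxes per line (S3).

    `page` is a list of equal-length rows of 0/1 pixels. Each result records the
    line label and the box as (left, top, right, bottom), with end-exclusive bounds.
    """
    boxes = []
    row_profile = [sum(row) for row in page]
    for line_no, (top, bottom) in enumerate(split_runs(row_profile)):
        col_profile = [sum(page[y][x] for y in range(top, bottom))
                       for x in range(len(page[0]))]
        for left, right in split_runs(col_profile):
            boxes.append({"line": line_no, "box": (left, top, right, bottom)})
    return boxes

# Tiny page with two text lines of two "characters" each.
PAGE = [
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
]
for entry in segment_characters(PAGE):
    print(entry)
```

Vertical Japanese text would need the two profiles swapped, and touching characters would need a more careful split than blank columns.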

[0013] Processes S5 to S9, provided within the character attribute determination process S4, are the steps added according to the present invention. Character size and stroke thickness are normalized so that the character image data cut out as one character can be matched against the dictionary to determine its character code. In the course of this normalization, process S5 measures and recognizes the character size, and process S6 measures and recognizes the thickness of the character's strokes. Process S7 then determines the typeface classification, such as Gothic or Mincho, by comparing the recognized character size with the stroke thickness. If no matching typeface can be detected, process S8 rotates the character by 90 degrees and process S9 repeats the typeface determination and recognition, so that characters set sideways in vertical text, such as English words, are extracted, their typeface is determined, and the characters are recognized.
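
A minimal sketch of the S5 to S9 sequence follows. The source gives no formulas, so everything concrete here is an assumption: the bounding-box size measure, the median horizontal run length as the stroke thickness, the ratio thresholds separating "mincho" from "gothic", and all function names.

```python
from statistics import median

def measure_size(glyph):
    """S5-like step: (width, height) of the ink bounding box. Assumes the glyph has ink."""
    rows = [y for y, row in enumerate(glyph) if any(row)]
    cols = [x for x in range(len(glyph[0])) if any(row[x] for row in glyph)]
    return cols[-1] - cols[0] + 1, rows[-1] - rows[0] + 1

def stroke_thickness(glyph):
    """S6-like step: median length of the horizontal ink runs."""
    runs = []
    for row in glyph:
        length = 0
        for pixel in list(row) + [0]:  # trailing 0 closes a run touching the right edge
            if pixel:
                length += 1
            elif length:
                runs.append(length)
                length = 0
    return median(runs)

def judge_typeface(glyph):
    """S7-like step: compare stroke thickness with character height (made-up thresholds)."""
    _, height = measure_size(glyph)
    ratio = stroke_thickness(glyph) / height
    if 0.04 <= ratio < 0.12:
        return "mincho"
    if 0.12 <= ratio <= 0.30:
        return "gothic"
    return None  # no plausible typeface in this orientation

def rotate90(glyph):
    """S8-like step: rotate the bitmap 90 degrees clockwise."""
    return [list(row) for row in zip(*glyph[::-1])]

def judge_with_rotation(glyph):
    """Judge the upright glyph; if that fails, rotate once and judge again (S8, S9).

    Returns (typeface or None, whether the glyph was rotated).
    """
    typeface = judge_typeface(glyph)
    if typeface is not None:
        return typeface, False
    return judge_typeface(rotate90(glyph)), True

UPRIGHT_BAR = [[0] * 8 + [1] * 4 + [0] * 8 for _ in range(20)]  # thick vertical stroke
SIDEWAYS_BAR = [[1] * 20 for _ in range(4)]                     # same stroke lying on its side
print(judge_with_rotation(UPRIGHT_BAR))
print(judge_with_rotation(SIDEWAYS_BAR))
```

The single retry mirrors the text: one rotation by 90 degrees, then one more determination. A practical classifier would more likely look at stroke contrast and serif shapes than at one thickness ratio.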

[0014] When the documents to be read also include documents in which emphasis is expressed not only by typeface and character size but also by underlines, sidelines, emphasis dots, ruby, and the like, the emphasized character strings can be extracted by reusing the methods for recognizing underlines, sidelines, emphasis dots, and ruby disclosed in Japanese Unexamined Patent Publications No. H1-151396 and No. H1-196685.

[0015] Once the character attributes have been determined along with the character codes by the above processing, process S10 outputs the character codes with the character attributes attached, as illustrated in FIG. 2. Next, FIG. 3 shows the processing flow of one embodiment of a document reading method, based on the second method of the present invention, that extracts and outputs keywords from the document that was read. The method of the second invention is described below.
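
One possible shape for the S10 output is sketched below. FIG. 2 is not reproduced in this text, so the tag format `<typeface,size>` is invented purely for illustration, as are the `RecognizedChar` record and the `emit` function.

```python
from dataclasses import dataclass

@dataclass
class RecognizedChar:
    code: str      # the recognized character, standing in for its character code
    typeface: str  # e.g. "mincho" or "gothic"
    size: int      # character height in pixels

def emit(chars):
    """Serialize characters, writing an attribute tag only when the attributes change."""
    out, current = [], None
    for ch in chars:
        attrs = (ch.typeface, ch.size)
        if attrs != current:
            out.append(f"<{ch.typeface},{ch.size}>")
            current = attrs
        out.append(ch.code)
    return "".join(out)

sample = [
    RecognizedChar("O", "mincho", 12),
    RecognizedChar("C", "gothic", 12),
    RecognizedChar("R", "gothic", 12),
]
print(emit(sample))
```

Writing a tag only on change keeps the output compact while still letting a downstream tool restore the attributes of every character.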

[0016] In FIG. 3, processes S1 to S3 and the typeface determination processes S5 to S9 within process S4 are equivalent to the processes with the same reference numerals in FIG. 1. In the document reading method according to the second invention, the typeface determination and recognition process S9 for the cut-out characters is followed by a character blocking process S11, which detects and groups the ranges in which characters of the same typeface continue, and by a process S12, which outputs as keywords those classified character blocks that do not belong to the body-text typeface.
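
The blocking step S11 and the keyword output S12 can be sketched as follows. The source does not say how the body-text typeface is identified, so taking the most frequent typeface as the body typeface is an assumption, as are the default `max_len` value and the function names. The length limit reflects the "predetermined character length" mentioned in the summary of the invention.

```python
from collections import Counter
from itertools import groupby

def block_by_typeface(chars):
    """S11-like step: merge consecutive (character, typeface) pairs sharing a typeface."""
    return [(typeface, "".join(c for c, _ in group))
            for typeface, group in groupby(chars, key=lambda pair: pair[1])]

def extract_keywords(chars, max_len=10):
    """S12-like step: blocks outside the body typeface and at most max_len characters long."""
    body = Counter(tf for _, tf in chars).most_common(1)[0][0]  # assumed body typeface
    return [text for tf, text in block_by_typeface(chars)
            if tf != body and len(text) <= max_len]

TEXT = ([(c, "mincho") for c in "read the "]
        + [(c, "gothic") for c in "OCR"]
        + [(c, "mincho") for c in " output"])
print(block_by_typeface(TEXT))
print(extract_keywords(TEXT))
```

The length limit keeps long non-body runs, such as a whole heading line, from being reported as keywords.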

[0017] FIG. 4 shows an example of the result obtained when an input document is processed by the document reading method of the flow of FIG. 3 based on the second invention. Next, FIG. 5 shows the processing flow of one embodiment of a document reading method based on the third invention, which provides a text attribute determination process that identifies major, medium, and minor headings from the read character information blocked by typeface type, and which extracts and outputs the major, medium, and minor headings in the document. This invention is described below.

[0018] In FIG. 5 as well, processes S1 to S11 are equivalent to the processes with the same reference numerals in FIG. 3. Through the processing up to S11, headings, body text, underline lines, and ruby lines are obtained as labeling information for the document that was read, and the character codes, the typeface information, and the ranges of characters belonging to the same typeface are obtained as blocking information.

[0019] The character blocking process S11 is followed by a text attribute determination process S13, which consists of a process S14 that uses the labeling information together with the typeface and blocking information to distinguish major, medium, and minor headings, and a process S15 that identifies line-head headings. A process S16 is also provided, which attaches the result of the text attribute determination process S13, together with the character attribute information, to the character codes and outputs them.
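
The source does not give the rules that S14 and S15 apply, so the sketch below invents simple ones for illustration: heading lines are ranked by character size (largest is major, next is medium, the rest are minor), and a body line counts as having a line-head heading when it starts with a block in a non-body typeface followed by body text. The record layout and the name `classify_headings` are also assumptions.

```python
LEVELS = ["major", "medium", "minor"]

def classify_headings(lines, body_typeface="mincho"):
    """Return one text attribute per line.

    Each line is a dict with:
      "label":  "heading" or "body", from the string labeling step
      "size":   character size of the line
      "blocks": list of (typeface, text) pairs from the blocking step
    """
    sizes = sorted({ln["size"] for ln in lines if ln["label"] == "heading"}, reverse=True)
    result = []
    for ln in lines:
        blocks = ln["blocks"]
        if ln["label"] == "heading":
            # S14-like rule: rank by size; anything below the third size is "minor".
            rank = min(sizes.index(ln["size"]), len(LEVELS) - 1)
            result.append(LEVELS[rank] + " heading")
        elif len(blocks) > 1 and blocks[0][0] != body_typeface:
            # S15-like rule: emphasized block at the start of a body line.
            result.append("line-head heading")
        else:
            result.append("body")
    return result

DOC = [
    {"label": "heading", "size": 24, "blocks": [("gothic", "Chapter 1")]},
    {"label": "heading", "size": 18, "blocks": [("gothic", "Section 1.1")]},
    {"label": "heading", "size": 14, "blocks": [("gothic", "Overview")]},
    {"label": "body", "size": 10, "blocks": [("gothic", "Note"), ("mincho", " text follows")]},
    {"label": "body", "size": 10, "blocks": [("mincho", "plain body text")]},
]
print(classify_headings(DOC))
```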

[0020] FIG. 6 shows an example of the output obtained when an input document is processed by the method of the flow of FIG. 5. Finally, FIG. 7 shows the processing flow of one embodiment of a document reading method based on the fourth invention, in which only the text portions having an attribute specified through an input procedure are output and displayed. This invention is described below.

[0021] In FIG. 7, the processes from image input S1 to text attribute determination S13 are equivalent to the processes with the same reference numerals in the flow of FIG. 5 according to the third invention described above. The processing up to S13 yields the attribute information of the text relating to headings and the information relating to typeface and character size. A specified-attribute reading process S17 reads the attribute specification information, entered from a keyboard or the like, for the text and typeface to be extracted. An attribute selection process S17 matches the attribute specification information read into the processing flow against the information obtained by the processing up to S13 and selects the data whose attributes match. The output process S18 then outputs only the characters of the portions selected according to the attribute specification.
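
The selection step amounts to filtering the recognized portions by the attributes the operator specified. A minimal sketch, assuming each portion is a dict of attributes; the field names and the function `select_by_attributes` are hypothetical.

```python
def select_by_attributes(records, **wanted):
    """Keep the records whose attributes equal every specified key/value pair."""
    return [r for r in records
            if all(r.get(key) == value for key, value in wanted.items())]

PORTIONS = [
    {"text": "Chapter 1", "typeface": "gothic", "attribute": "major heading"},
    {"text": "OCR", "typeface": "gothic", "attribute": "body"},
    {"text": "plain body text", "typeface": "mincho", "attribute": "body"},
]

# The operator asks for Gothic portions only.
for portion in select_by_attributes(PORTIONS, typeface="gothic"):
    print(portion["text"])
```

Passing several attributes narrows the selection, and passing none returns every portion.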

[0022] FIG. 8 shows an example of the output obtained when an input document is processed by the method of the flow of FIG. 7.

[0023]

[Effects of the Invention] A document reading device that reads and processes documents according to the document reading method of this invention determines not only the character code of each cut-out character but also character attributes such as size and typeface, and retains them as information. By combining this character attribute information with the results of region extraction, the structure of the input document can be reproduced as it is, without any additional manual input.

[0024] In addition, because specified portions such as Gothic emphasized passages and headings can be selectively extracted and output, editing work such as reading and reprinting a document, including the creation of a table of contents and an index, can be carried out accurately and efficiently without manual effort. When keywords are extracted from multiple documents to build a database, a document reading device that applies the document reading method of the present invention likewise allows the work to be carried out quickly and efficiently.

[Brief description of drawings]

FIG. 1 is a flow chart of the main part of the processing in the character recognition method according to the present invention.

FIG. 2 is a diagram illustrating a result of the processing of the flow of FIG. 1.

FIG. 3 is a flow chart of processing that extracts and outputs keywords.

FIG. 4 is a diagram illustrating a result of the processing of the flow of FIG. 3.

FIG. 5 is a flow chart of processing that extracts and outputs text attributes.

FIG. 6 is a diagram illustrating a result of the processing of the flow of FIG. 5.

FIG. 7 is a flow chart of processing that extracts and outputs text having a specified attribute.

FIG. 8 is a diagram illustrating a result of the processing of the flow of FIG. 7.

FIG. 9 is a basic configuration diagram of a character reading device.

FIG. 10 is a basic flow chart of character recognition processing in the character reading device.

[Explanation of symbols]

1 Character reading device
2 Character recognition device
21 Image scanner
22 Character recognition processor
3 Host computer
4 Input document
5 Processing result

Claims

1. A document reading method for outputting character attribute information, in which character image data obtained by an image scanner optically scanning a document to be read and stored in a buffer is analyzed to extract and label character strings, single-character regions are then cut out, and the attributes of each cut-out character are determined by matching against a character pattern dictionary to assign a character code, characterized in that: ahead of the step of matching against the character pattern dictionary to assign the character code, there are provided a process of measuring and recognizing the character size, a process of measuring and recognizing the stroke thickness, and a process of determining and recognizing the typeface type based on the recognition results of these two processes; when the typeface type cannot be determined by these processes, the cut-out character is rotated by 90 degrees and the process of determining and recognizing the typeface type is repeated; and the information on the typeface recognized by the typeface determination process and the information on the character string attributes recognized by the character string labeling process are attached to the character codes of the recognition result and output.

2. The document reading method according to claim 1, characterized in that the process of extracting and labeling character strings includes a process of recognizing characters marked with underlines, sidelines, emphasis dots, or ruby as a distinct typeface, a process of grouping into blocks the character ranges recognized as having the same typeface is provided after the attribute determination process for the cut-out characters, and character blocks that do not belong to the body-text typeface and are within a predetermined character length are output as keywords.

3. The document reading method according to claim 2, characterized in that, after the process of grouping character ranges of the same typeface into blocks, a process of detecting and determining document-specific attributes is provided, consisting of a process that detects major, medium, and minor headings and a process that detects line-head headings on the basis of the recognition results obtained in the blocking process and in the process of extracting and labeling character strings, and, on the basis of the document attribute determination result, the document attribute information and the character attribute information are attached to the character codes and output.

4. The document reading method according to claim 3, characterized in that, after the process of determining document-specific attributes, a process is provided that selectively extracts only the document portions having an attribute specified through input means, and only the document portions having the specified attribute are output and displayed.