JP2005135041A - Document search/browse method and document search/browse system

Info

Publication number: JP2005135041A
Application number: JP2003368304A
Authority: JP
Inventors: Takeshi Eisaki (健永崎); Katsumi Marukawa (勝美丸川); Sayaka Takeuchi (沙弥香竹内)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2003-10-29
Filing date: 2003-10-29
Publication date: 2005-05-26
Anticipated expiration: 2023-10-29
Also published as: JP4461769B2; CN100351839C; CN1612154A

Abstract

PROBLEM TO BE SOLVED: To provide a method for searching and browsing a group of document images by applying document structure analysis and character recognition techniques as the means of searching and browsing paper documents and document images.

SOLUTION: A highly functional document image search/browse system separates the OCR from the document processing apparatus. As OCR output formats it adopts reading hypothesis data, which holds multiple hypotheses for character line extraction, character segmentation and character recognition, and document structure data, which holds ruled line information, frame information, character line information, browse attribute information and the like for a document image. Using this OCR-added data, the system extracts important keywords and searches documents from printed and handwritten character strings. Using the document structure data, it displays documents in the way the viewer intends.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to an apparatus that applies document analysis technology to a group of paper documents or document images in order to acquire the information needed to search and browse the documents on a computer, and to a recording medium on which the document analysis program is recorded.

Even today, with digital information technology widespread, paper documents remain widely used as a medium for conveying information. Paper documents, however, take up storage space and make it hard to find the information one needs. There is therefore growing demand to store paper documents as electronic images and to search and browse those images (hereinafter, document images) on a computer.

The most basic way to search paper documents is to convert them into text files by OCR (Optical Character Recognition) and search the text files. However, text produced by OCR generally contains errors, so there are cases that simple text search cannot handle. It is of course possible to correct the OCR text by hand and search the corrected result, but correction by human intervention is hardly practical in terms of processing speed and cost.

Japanese Patent Laid-Open No. 05-108891 (Patent Document 1) describes applying morphological analysis to OCR recognition results as a means of improving OCR reading accuracy. Knowledge processing such as morphological analysis can indeed correct misreadings, but it cannot correct all of them. In addition, the dictionaries used in ordinary morphological analysis target general text such as newspapers. To proofread documents for specialized business use accurately, a special dictionary suited to the field must be defined in addition. Problems therefore remain in maintainability and computational cost.

Japanese Patent Laid-Open No. 10-74250 (Patent Document 2) proposes a word search method that uses information about similar characters that OCR tends to confuse, in order to avoid the adverse effect of character misreading on search. Japanese Patent Laid-Open No. 9-134369 (Patent Document 3) proposes a method that allows multiple character identification candidates in the OCR reading result and detects words by selecting character codes from among them. These techniques can indeed avoid the adverse effect that single-character misreadings have on word search.

However, these methods cannot cope with cases in which character patterns are cut out incorrectly because the boundaries between patterns are not clearly determined, owing to characters made of separated parts or contact between characters. For example, if the written string "ハル" is read by the OCR as "ヘル", the methods of the above patents can cope, but if it is read as "ハノレ" (the single character "ル" split into "ノ" and "レ"), they cannot. Furthermore, for documents with intricate figures and tables, or form-style documents containing many ruled lines, detecting and identifying the character lines is often difficult even before any characters are read. The above methods cannot deal with this problem either.
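
The limitation can be illustrated with a minimal sketch of candidate-based word search in the spirit of Patent Document 3. The data layout (one list of candidate characters per segmented pattern) and the function name `find_keyword` are assumptions made for illustration, not the format used by the cited patents.

```python
def find_keyword(candidates, keyword):
    """Return start positions where every keyword character appears among
    the candidates of the corresponding segmented pattern."""
    hits = []
    for start in range(len(candidates) - len(keyword) + 1):
        if all(ch in candidates[start + i] for i, ch in enumerate(keyword)):
            hits.append(start)
    return hits


# Identification error only: the top candidate is wrong ("ヘ"), but the
# correct character "ハ" survives as a lower-ranked candidate.
misread = [["ヘ", "ハ"], ["ル", "レ"]]

# Segmentation error: "ル" was cut into two patterns, "ノ" and "レ",
# so no single pattern can ever offer "ル" as a candidate.
missegmented = [["ハ", "ヘ"], ["ノ", "ソ"], ["レ", "し"]]

print(find_keyword(misread, "ハル"))       # [0]
print(find_keyword(missegmented, "ハル"))  # []
```

Keeping several candidates per pattern recovers the keyword in the first case, but once the pattern boundaries themselves are wrong the keyword is lost, which is the gap the invention addresses.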

There is also demand for document image browsing functions that add value not available with paper documents. For example, a person checking a large number of forms does not normally read the whole of each document but concentrates on the mandatory entry columns. Useful functions for on-screen checking therefore include extracting specific columns of the document image in advance and displaying only those columns, or displaying them with emphasis. Conventional OCR, however, only recognizes what is written in a specific column, so all it can do is display the recognition result on screen. If the recognition result were perfect, displaying it would be enough to support partial browsing of the document image, but in practice this is difficult. It is preferable for the OCR device to output document structure data such as frame structure and ruled line coordinates together with the text recognition result, and for the browsing function to make use of this information.

Formats for handling paper documents converted to electronic images include image formats such as TIFF and GIF and document formats such as PDF. Normally the image is recorded in one file, the OCR recognition result is output to a separate file in a format such as CSV or XML, and the two are managed together. This requires building a system that maintains the links between the files. PDF offers a function for embedding the OCR result in the file as transparent text, but for handwritten characters the recognition result is not always uniquely determined. Moreover, embedding document structure data in an image file is not supported. It is possible to handle document structure data separately from the image file and build viewing software that combines the two, but handling them separately is inefficient for document management. The reason is that document structure data contains the coordinates of ruled lines, frames and character lines in the document image and, unlike text, is therefore only weakly independent of the image file.

When documents are browsed on a computer, they are commonly displayed with effects such as highlight colors and colored lines, but this is generally done for electronically composed document data such as Word or HTML. Applying such effects to document image files is avoided because of constraints such as the time needed to apply the display effect.

Patent Document 1: Japanese Patent Laid-Open No. 05-108891
Patent Document 2: Japanese Patent Laid-Open No. 10-74250
Patent Document 3: Japanese Patent Laid-Open No. 9-134369
Patent Document 4: Japanese Patent Laid-Open No. 09-319824
Patent Document 5: Japanese Patent Laid-Open No. 2000-251012
Patent Document 6: Japanese Patent Laid-Open No. 2001-014311
Patent Document 7: Japanese Patent No. 2886868
Patent Document 8: Japanese Patent Application No. 09-238032

An object of the present invention is to provide a document search/browse system that converts a group of paper documents into electronic images and offers advanced search and browsing functions based on the results of document recognition by an OCR device, together with the corresponding apparatus and a recording medium on which the OCR recognition program and the document browsing system are recorded.

In conventional methods, documents in a group of paper documents are searched through the text files produced by OCR reading. It has been difficult to cope with OCR character identification errors caused by crushed or faint characters, OCR character segmentation errors caused by ambiguous character pattern boundaries, and OCR character line extraction errors caused by the mixture of text, figures and ruled lines. The first object of the present invention is to propose a method that avoids the adverse effect on document search of the character identification, character segmentation and character line extraction errors that can occur in OCR reading.

In conventional methods, when a partial area is displayed during document image browsing, the area is specified by fixed coordinates, which makes the display vulnerable to image misalignment and similar effects. In the present method, by contrast, the OCR device outputs document structure data containing ruled line information, frame information, character line information and the like, and using this data avoids the adverse effect on display. The second object of the present invention is to provide added value such as partial area display, highlighting, important word display and concealment processing when document images are browsed.

Conventional methods also have the problem that applying an effect when displaying a document image takes time to convert the document image data. The present method avoids this problem by using the document structure data output by the OCR device to pseudo-colorize, in advance, the areas and character strings predicted to need a display effect. The third object of the present invention is to reduce the processing time required for document display during browsing.

To achieve the first object, the present invention separates the OCR device from the document image processing device. As the OCR output it adopts the document image (including a pseudo-colored document image) and files holding the reading result text, the reading hypothesis data and the document structure data (together referred to as OCR additional data). By building keyword search and document browsing functions on the document image and the OCR additional data, it provides a system for finding and browsing the required document images.

To achieve the second object, the present invention provides a browsing system that uses the OCR additional data output by the OCR device to realize visual effects such as highlighting a partial area, displaying a cut-out partial area, and highlighting a specific character string.

To achieve the third object, the present invention applies pseudo-colorization to specific areas determined in advance from the OCR additional data, and changes the pseudo-color values when the display mode is switched, thereby providing a fast display function.
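
The idea can be sketched with an indexed (palette) image. This is a minimal illustration under stated assumptions, not the patent's implementation: the index assignment, the palette contents and the names `pseudo_color` and `render` are all hypothetical.

```python
WHITE, BLACK, YELLOW = (255, 255, 255), (0, 0, 0), (255, 255, 0)


def pseudo_color(bitmap, region):
    """Turn a binary bitmap (0 = paper, 1 = ink) into an indexed image.

    Pixels inside `region` (top, left, bottom, right; bottom and right are
    exclusive) get indices 2 (paper) and 3 (ink) instead of 0 and 1.
    This is done once, before browsing.
    """
    top, left, bottom, right = region
    indexed = []
    for y, row in enumerate(bitmap):
        indexed.append([
            px + 2 if (top <= y < bottom and left <= x < right) else px
            for x, px in enumerate(row)
        ])
    return indexed


# One palette per display mode (hypothetical modes). Switching modes only
# swaps the palette; the indexed pixel data is never rewritten.
PALETTES = {
    "normal":    {0: WHITE, 1: BLACK, 2: WHITE,  3: BLACK},
    "highlight": {0: WHITE, 1: BLACK, 2: YELLOW, 3: BLACK},
    "masked":    {0: WHITE, 1: BLACK, 2: BLACK,  3: BLACK},
}


def render(indexed, mode):
    """Resolve palette indices to RGB values for the chosen display mode."""
    palette = PALETTES[mode]
    return [[palette[i] for i in row] for row in indexed]


bitmap = [[0, 1, 0, 0],
          [0, 1, 1, 0]]
indexed = pseudo_color(bitmap, region=(1, 1, 2, 4))
print(indexed)                          # [[0, 1, 0, 0], [0, 3, 3, 2]]
print(render(indexed, "highlight")[1])  # paper inside the region turns yellow
print(render(indexed, "masked")[1])     # region is blacked out
```

Because the region is encoded in the pixel indices ahead of time, a mode change costs a handful of palette entries rather than a pass over the image.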

In conventional methods, documents in a group of document images are searched through the text produced by OCR reading, and it has been difficult to cope with OCR character identification errors caused by the mixture of printed and handwritten characters or by crushed or faint characters, OCR character segmentation errors caused by ambiguous character pattern boundaries, and OCR character line extraction errors caused by the mixture of text, figures and ruled lines. According to the present invention, these problems are avoided by performing word search and document search on OCR additional data that retains the candidates for character identification, character segmentation and character line extraction. In addition, using the document structure data contained in the OCR additional data makes it possible to build a browsing system with added value, such as highlighting the required places and listing multiple documents when document images are browsed.

The difference between the conventional method and the proposed method is outlined with reference to FIG. 1, which schematically contrasts document processing using conventional OCR with document processing using the method proposed in this patent.

In the conventional flow, the group of paper documents shown at 0101 is read by the OCR device shown at 0102. As shown at 0103, the OCR outputs a document image, which is the digitized paper image, and a text file, which is the OCR reading result. Document processing is then performed with the device shown at 0104. Because the OCR output in this flow consists only of the reading result text and the document image, document processing is limited to text search and browsing of the document image.

In the processing flow proposed in this patent, the group of paper documents shown at 0105 is read by the OCR device shown at 0106. As shown at 0107, the OCR outputs a document image, which is the digitized paper image; the reading result text; reading hypothesis data, which holds the candidates for character line extraction, character segmentation and character recognition; and document structure data, which holds the ruled line information, frame information, character line information and browsing attribute information of the document. Alternatively, it outputs a document image with additional information, in which these data are embedded in the document image. Document processing is then performed with the device shown at 0108. Because the OCR output in this flow contains this information in addition to the text and the document image, document processing is not limited to text search and simple browsing of the document image. It also enables searching for handwritten keywords that are hard to recognize, highlighted display that emphasizes important keywords and areas with colored lines or contrast, partial area display (partial inspection) that lines up only the required parts of document images, and display with confidential items partially concealed.

The data output at 0107 carries a document ID code that uniquely identifies the corresponding paper document or document image, and can be stored on a magnetic storage device or the like. Two storage forms are conceivable: keeping the document image, the pseudo-colored document image, the reading result text, the reading hypothesis data and the document structure data separately in a database, or embedding these data as additional information in the document image file. The advantage of the former is that the document image and the data added by the OCR (the reading result text and so on; hereinafter, OCR additional data) are handled separately, so browsing and searching can each be done with independent existing tools. However, to display a document found by text search, or to highlight the place that matched, the correspondence between the OCR additional data and the document image must be resolved through the document ID. Moreover, when only the reading result text is used, a matched search term cannot be highlighted on the document image, because the reading result text has no corresponding coordinate information on the image. The advantage of the latter is that managing the document image file alone gives access to all the information in the image and the OCR additional data. Unlike the former, there is no need to link the OCR additional data (the reading result text and so on) to the document image through a document ID, so document management becomes easier.

FIG. 2 will now be described. In the form recognition apparatus that embodies the present invention, the OCR device first captures an image of the paper document and converts it into electronic image data (0201). This step can be omitted when the original document is already electronic image data. Next, document structure analysis such as ruled line extraction, frame structure analysis and position estimation of the columns to be read is performed on the electronic image data (0202). This analysis uses a document structure dictionary, which contains information such as the ruled line coordinates and frame coordinates of the document image to be read and the attributes of the columns to be read (name entry column, address entry column, browsing attribute information and so on). The recognition processing used here relies on known techniques (Japanese Patent Laid-Open No. 09-319824 (Patent Document 4), Japanese Patent Laid-Open No. 2000-251012 (Patent Document 5) and others).

Based on the result of the document structure analysis, the character lines to be read are extracted (0203). Character pattern candidates are then cut out from each character line image, and each candidate is identified (0204). When the document structure is intricate, several character line hypotheses are set up, and candidate segmentation and character identification are performed for each. Character identification uses a character identification dictionary, which contains the character codes of the patterns to be recognized and their structural information (intensity distribution of contour direction components, various statistics and so on). The character pattern candidates together with their identification results are called a character string hypothesis.

When the character strings that can be written in the document are known in advance, notation analysis is applied to the character string hypothesis (0205). This analysis uses a character string notation knowledge dictionary, which contains the words and numeric string notations that can appear in the document, the orders in which groups of words may appear, and similar information. Through this step a character string hypothesis containing segmentation and identification ambiguity is converted into a character string path and then into character string text. Here a character string path is a sequence of pairs, each consisting of a character code and the character candidate pattern corresponding to that code. If step 0205 fails, or if the notation knowledge for the document is not known in advance, processing moves on with the character string hypothesis unchanged.

The next step receives the character string hypothesis or the text and selects which of them, or both, becomes the OCR output (0206). In general, the character string hypothesis is interpreted as a directed graph. Text output is chosen when there is a path from the start point to the end point of the graph that satisfies the given notation knowledge, the path is uniquely determined, and its reliability as a character string path, determined from the character identification similarities and the arrangement of character patterns, exceeds a certain threshold. In that case step 0207 outputs the character string text as the reading result text. The reading result text may subsequently be corrected by a person. Conversely, when the reliability of the character string path is low, the character string hypothesis is output. Both the reading result text and the reading hypothesis data hold, as needed, the position on the document image where the character string is written.

These steps output the document image file, the document structure data, the reading result text and the reading hypothesis data, and the subsequent document processing is based on them. Document processing is broadly divided into two stages. The first is the data registration unit (0209), which registers the data in a database or in the document image so that the data group can be handled. The second is the document processing itself (0210), which uses these data. When the OCR device and the document processing device are separate, the OCR device covers steps 0201 to 0208, or 0201 to 0209.
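
The selection between text and hypothesis output (0206) can be sketched as a search over a small directed graph. The sketch rests on several assumptions that are not in the source: nodes are cut positions in the line image, each edge carries `(character, similarity)` candidates, the notation knowledge is reduced to a plain set of permitted words, and path reliability is taken to be the mean similarity. The names `enumerate_paths` and `decide_output` are hypothetical.

```python
def enumerate_paths(lattice, start, end):
    """Yield (text, reliability) for every path from start to end.

    `lattice` maps an edge (from_node, to_node) to a list of
    (character, similarity) candidates. Edges must point forward
    (to_node > from_node), so the graph is acyclic.
    """
    def walk(node, text, scores):
        if node == end:
            # Assumption: reliability is the mean character similarity.
            yield text, (sum(scores) / len(scores) if scores else 0.0)
            return
        for (a, b), candidates in lattice.items():
            if a == node:
                for ch, sim in candidates:
                    yield from walk(b, text + ch, scores + [sim])

    yield from walk(start, "", [])


def decide_output(lattice, start, end, lexicon, threshold):
    """Return ("text", string) when exactly one path satisfies the lexicon
    and its reliability exceeds the threshold; otherwise return
    ("hypothesis", lattice) so that no candidate is thrown away."""
    valid = [(t, r) for t, r in enumerate_paths(lattice, start, end)
             if t in lexicon]
    if len(valid) == 1 and valid[0][1] > threshold:
        return ("text", valid[0][0])
    return ("hypothesis", lattice)


# "ハル" with an ambiguous boundary: the span 1-3 may be one pattern ("ル")
# or two patterns ("ノ" then "レ").
lattice = {
    (0, 1): [("ハ", 0.90), ("ヘ", 0.80)],
    (1, 2): [("ノ", 0.60)],
    (2, 3): [("レ", 0.60)],
    (1, 3): [("ル", 0.85)],
}

print(decide_output(lattice, 0, 3, {"ハル", "ナツ"}, threshold=0.8))
# ('text', 'ハル')
print(decide_output(lattice, 0, 3, {"ハル", "ヘル"}, threshold=0.8)[0])
# hypothesis  (two lexicon words fit, so the path is not unique)
```

Returning the whole lattice in the uncertain case is what later allows a keyword search to succeed even though the OCR could not commit to a single reading.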

FIG. 3 will now be described. FIG. 3 shows the processing flow of document processing that uses the document image and the OCR additional data. The data and processing from 0301 to 0307 in FIG. 3 can also be handled on the OCR side. In that case, the OCR side passes to the document processing unit shown at 0308 either a document image or pseudo-colored document image with attached OCR additional data (consisting of the document structure data, the reading result text and the reading hypothesis data), or a database in which the OCR additional data and the document image or pseudo-colored document image are stored.

First, the document image and the corresponding group of OCR additional data (0301) are taken as input and read from file (0302). If necessary, pseudo-color processing is applied to the document image to make later display convenient (0303). Pseudo-color processing is described in detail later. There are two conceivable forms for handling the document image and the OCR additional data: keeping the document image, the reading result text, the reading hypothesis data and the document structure data separately in a database, or embedding the OCR additional data in the document image file. In the former case, database registration processing is performed (0304), and the document image and the OCR additional data are associated with each other and registered in the database (0305). In the latter case, image information embedding processing is performed (0306), and a document image file with additional information is created (0307). These steps correspond to the data registration process 0209 in FIG. 2. After them, document processing is performed (0308).

FIG. 4 will now be described. FIG. 4 shows an example of embedding OCR additional data in a document image file, assuming a tag-format image file such as TIFF. In a tag-format image file, tag information is generally stored in the first block of the file, and the image data body is located at the position linked from the tag. For each tag, the tag information holds the storage position of the corresponding data body and a tag ID number indicating the type of data recorded there. Tag ID numbers are defined in advance by the rules of the image file format, so the tag ID number tells whether the data a tag points to is image data or data such as the creator or the creation date and time. OCR additional data can be added by appending an entry to the tag information block that holds a tag ID for OCR additional data and a pointer to the location where the OCR additional data is stored.
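
The tag mechanism can be sketched with a toy container. The byte layout below is deliberately simplified and is not the real TIFF layout; the tag ID values, the JSON payload and the names `build_file` and `read_tag` are assumptions chosen only to show a directory of (tag ID, offset, length) entries pointing at data bodies.

```python
import json
import struct

HEADER = "<H"    # number of directory entries
ENTRY = "<HII"   # tag ID, offset of the data body, length of the data body

IMAGE_TAG_ID = 1          # hypothetical ID for the image data body
OCR_DATA_TAG_ID = 65000   # hypothetical ID reserved for OCR additional data


def build_file(entries):
    """Serialize {tag_id: bytes} as a tag directory followed by data bodies."""
    items = sorted(entries.items())
    offset = struct.calcsize(HEADER) + len(items) * struct.calcsize(ENTRY)
    directory = struct.pack(HEADER, len(items))
    bodies = b""
    for tag_id, data in items:
        directory += struct.pack(ENTRY, tag_id, offset, len(data))
        bodies += data
        offset += len(data)
    return directory + bodies


def read_tag(blob, wanted_id):
    """Follow the directory entry for `wanted_id` to its data body."""
    (count,) = struct.unpack_from(HEADER, blob, 0)
    pos = struct.calcsize(HEADER)
    for _ in range(count):
        tag_id, offset, length = struct.unpack_from(ENTRY, blob, pos)
        if tag_id == wanted_id:
            return blob[offset:offset + length]
        pos += struct.calcsize(ENTRY)
    return None  # a reader that does not know the tag simply finds nothing


ocr_data = {"doc_id": "D0001", "text": "ハル", "box": [10, 20, 60, 40]}
blob = build_file({
    IMAGE_TAG_ID: bytes([0, 1, 1, 0]),
    OCR_DATA_TAG_ID: json.dumps(ocr_data).encode("utf-8"),
})

print(read_tag(blob, IMAGE_TAG_ID))  # b'\x00\x01\x01\x00'
print(json.loads(read_tag(blob, OCR_DATA_TAG_ID))["text"])  # ハル
```

A viewer that only looks up the image tag still reads the image unchanged, while an OCR-aware viewer finds the additional data in the same file.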

FIG. 5 is an example of a document image to be processed. FIG. 6 shows the result of document structure analysis and line extraction performed on the document image of FIG. 5. In FIG. 6(a), the ruled line information, frame information, and character line information obtained by document structure analysis are drawn as bold lines or circumscribed rectangles. Reference numeral 0601 denotes the injury/disease name field, 0602 the treatment date field, 0603 the summary field, 0604 the treatment days field, and 0605 the points field. Each region enclosed in a thick rectangle was recognized as an analysis target field by document structure analysis. Analysis target fields are the fields that matter for document processing and are specified in advance in the document structure dictionary. The thin rectangles inside the thick frames are regions extracted as character lines. Some frames have character lines extracted (0601, 0603, and so on) and others do not (0602 and 0604) because only some analysis target fields are reading targets. Whether a field is a reading target is also registered in advance in the document structure dictionary.

Character line extraction is easy for printed documents but becomes difficult for handwriting and for mixed printed and handwritten text. In such cases, extraction preserves the ambiguity of the character lines, as shown in FIG. 6(b). That is, multiple hypotheses are formed for clusters that appear to be character lines, and all of them are output as the extraction result, so a character pattern candidate does not necessarily belong to exactly one character line. The extraction result that assumes printed type may also differ from the one that assumes handwritten lines. In that case, too, multiple character line hypotheses are output. This allows both printed and handwritten document images to be handled. Region 0607 was extracted as a printed character line, and region 0608 as an ambiguous handwritten character line.

The document structure analysis uses the document structure dictionary, which contains information such as the ruled line coordinates and frame coordinates of the document image to be read and the attributes of the reading target fields (name field, address field, browsing attribute information, and so on). As a result of this processing, the document structure data in the OCR additional data contains information such as the frame coordinates, the attributes of each field, the coordinates of the character lines in the field, the coordinates of the character pattern candidates in the field, and the browsing attribute information of the field.
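One possible in-memory shape for the document structure data listed above is sketched here. The class names, field names, and the `"printed"`/`"handwritten"` labels are assumptions for illustration; the source does not specify a concrete encoding. The point of the sketch is that a character pattern candidate may be referenced by more than one character line hypothesis.

```python
from dataclasses import dataclass, field

Box = tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class LineHypothesis:
    box: Box
    kind: str                # "printed" or "handwritten" (labels assumed here)
    pattern_ids: list[int]   # indices into Field.patterns


@dataclass
class Field:
    name: str                # field attribute, e.g. "summary"
    box: Box                 # frame coordinates
    reading_target: bool     # taken from the document structure dictionary
    browse_attr: str = "public"
    patterns: list[Box] = field(default_factory=list)          # character pattern candidates
    lines: list[LineHypothesis] = field(default_factory=list)  # competing line hypotheses


def lines_containing(fld: Field, pattern_id: int) -> list[LineHypothesis]:
    """Return every character line hypothesis that claims the given pattern candidate."""
    return [ln for ln in fld.lines if pattern_id in ln.pattern_ids]
```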

The creation of a character string hypothesis and the flow of character string recognition using notation knowledge are described with reference to FIG. 7. FIG. 8 shows a conceptual diagram of the character string hypothesis and the details of its data. From the character line to be read, 7(a), portions presumed to be character patterns are cut out in various ways to form character pattern candidates. The result of applying character identification to each candidate is the character string hypothesis 7(b). At a minimum, a character string hypothesis holds the character pattern candidates, the ranked group of identified character codes obtained by character identification, and the connection relations between character pattern candidates within the hypothesis. This way of representing a character string hypothesis is called the graph-form representation. Next, character string notation knowledge 7(c) is used to compute a character string path 7(d) from the character string hypothesis. A character string path is a uniquely determined character code string (text) together with the sequence of character patterns corresponding to each character code. In the example, the candidate notation strings contained in the character string notation knowledge dictionary are written as words joined by the OR symbol (|). That is, the group of words separated by | is specified as the search target. Besides this representation, character string notation knowledge can be expressed with a trie, a context-free grammar, and so on (described in, for example, Japanese Unexamined Patent Publication No. 2001-014311 (Patent Document 6)).

FIG. 8 shows the character string hypothesis in detail. The character string hypothesis is represented as a directed graph whose arcs (0801) are character pattern candidates and whose nodes (0802) are character pattern boundaries. Each character pattern carries the boundary ID numbers of its left and right nodes (upper and lower nodes for vertical writing), its character identification candidates (0803), and their identification similarities (0804). Knowledge processing takes the character string hypothesis and the character string notation knowledge as input and finds the words, and their pattern sequences, that the hypothesis can contain. For example, the word 血液化学検査 ("blood chemistry test") in the notation knowledge can be found in the character string hypothesis of FIG. 8(b) by following the circled character codes and character pattern candidates (0805). When the notation of the strings written in a field is fixed in advance, this processing determines the character code string. The above processing thus determines either the character string text (character code string) output as the OCR reading result in FIG. 2 or the search result in the document processing of FIG. 3.
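The search of a character string hypothesis for a word from the notation knowledge can be sketched as a walk over the directed graph. This is a simple depth-first search under assumed data shapes (arcs as `(start_node, end_node, {char: similarity})` tuples, integer similarities), not the algorithm of the cited patent documents; `find_word` and `find_any` are hypothetical names.

```python
from collections import defaultdict


def find_word(arcs, word):
    """Return every arc path whose candidate characters spell `word`.

    arcs: list of (start_node, end_node, {char: similarity}), one entry per
    character pattern candidate. Each result is (arc_indices, total_similarity).
    """
    by_start = defaultdict(list)
    for idx, (start, _end, _cands) in enumerate(arcs):
        by_start[start].append(idx)

    results = []

    def extend(node, pos, path, score):
        if pos == len(word):
            results.append((path, score))
            return
        for idx in by_start[node]:
            _start, end, cands = arcs[idx]
            if word[pos] in cands:
                extend(end, pos + 1, path + [idx], score + cands[word[pos]])

    # A word may begin at any pattern boundary, so try every node with outgoing arcs.
    for start in list(by_start):
        extend(start, 0, [], 0)
    return results


def find_any(arcs, knowledge):
    """Search for each word of an OR-separated notation string such as 'A|B|C'."""
    hits = {w: find_word(arcs, w) for w in knowledge.split("|")}
    return {w: paths for w, paths in hits.items() if paths}
```

Because the graph keeps both the over-segmented and the merged pattern candidates, a word can be found even when the top-ranked identification result for an individual pattern is wrong.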

FIGS. 9, 10, 11, 12, and 14 show examples of browsing functions when documents are browsed using the OCR additional data obtained by the above processing together with the document image or the pseudo-colored image. When the OCR additional data is stored in a database separate from the document image file, the document ID is used to access the OCR additional data in the database that corresponds to the document image file, and the browsing function is built on that data. When the OCR additional data is stored in the document image file, the browsing function refers to the OCR additional data stored in the area designated by the tag in the document image file, as shown in FIG. 4.

FIG. 9 shows an example screen of a document browsing system that uses the method proposed in this patent. The example is a browsing system for medical fee claim forms (rezept). First, a paper claim form is read by OCR, which outputs a document image and OCR additional data. The system can switch between full display and partial display of a document image. For partial display, it obtains the coordinate data of the relevant field from the document structure data in the OCR additional data and displays that sub-region. 0901 is a block displaying one document image. 0902 shows the name of the displayed document image, 0903 the injury/disease name field of the claim form, and 0904 the summary field of the claim form. In document inspection it is generally unnecessary to display the whole document image. Inspection becomes more efficient when only the regions needed are shown, with several documents side by side. The document structure data in the OCR additional data could also be used to rearrange the document layout to suit a narrow screen, such as that of a PDA or other portable information terminal. For a two-column document, for example, the document can be split column by column and the pieces stacked vertically, so that it can be browsed with vertical scrolling alone. As a further aid to document processing work, clicking inside a field with the mouse cursor could display help or work know-how specific to that field.
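Partial display reduces to cropping the image by the frame coordinates held in the document structure data. The sketch below assumes an image stored as a list of pixel rows and a hypothetical `fields` mapping from field name to `(left, top, right, bottom)`; neither shape comes from the source.

```python
def crop(image, box):
    """Cut the sub-region (left, top, right, bottom) out of a row-major image."""
    left, top, right, bottom = box  # right and bottom are exclusive
    return [row[left:right] for row in image[top:bottom]]


def partial_view(image, fields, wanted):
    """Return only the fields needed for inspection, keyed by field name."""
    return {name: crop(image, fields[name]) for name in wanted}
```

Calling `partial_view` for each document with the same `wanted` list yields the per-document strips that a side-by-side inspection screen would arrange.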

FIG. 10 shows an example screen of an important-keyword browsing system that uses the method proposed in this patent. Area 1001 specifies the list of important keywords to extract, and area 1002 shows the extracted keywords underlined. FIG. 11 is an example screen of a simple document image inspection system that combines this important-keyword extraction with check rules. First, the check rule to use for inspection is entered in input field 1101. In this figure a check rule is defined as a logical operation on search keywords. Next, the keywords are searched for in the reading result text or the reading hypothesis data of the OCR additional data, and the logical operation is applied. Keyword extraction algorithms include the finite automaton method, top-down parsing, bottom-up parsing, and dynamic programming (described in, for example, Japanese Patent No. 2886868 (Patent Document 7) and Japanese Patent Application No. H09-238032 (Patent Document 8)). Display field 1103 lists the names of the documents found by the search. A document that satisfies the check rule is displayed in display field 1104. Because the OCR additional data carries a document ID code that corresponds uniquely to the original paper document or document image, the document image and the search results can be displayed together. Because the keyword information includes coordinates, each keyword found is marked with an underline, as shown at 1105. Here a document image that satisfies the check rule "特定疾患指導料 AND 特定疾患処方管理加算" (specific disease guidance fee AND specific disease prescription management add-on) is displayed. For handwritten characters that ordinary OCR reads poorly, the OCR additional data holds reading hypothesis data that preserves the ambiguity of character segmentation and character identification, so search and inspection work for printed and handwritten documents alike. Moreover, when the OCR device and document processing are separated, the reading hypothesis data in the OCR additional data allows any keyword to be searched for at any time without rerunning character recognition on the OCR device.
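A check rule defined as a logical operation on keywords can be evaluated with a few lines of code. The sketch below assumes rules written with ASCII ` AND ` and ` OR ` separated by spaces, no parentheses, and AND binding tighter than OR; it searches only the reading result text by substring match. Searching reading hypothesis data instead would mean swapping in a graph search for the `has_keyword` predicate. The function names are hypothetical.

```python
def matches_rule(rule, has_keyword):
    """Evaluate a rule such as 'A AND B OR C' (AND binds tighter than OR)."""
    return any(
        all(has_keyword(term.strip()) for term in clause.split(" AND "))
        for clause in rule.split(" OR ")
    )


def check_documents(rule, documents):
    """documents: {doc_id: reading result text}. Return matching document IDs, sorted."""
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if matches_rule(rule, lambda kw, text=text: kw in text)
    )
```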

FIG. 12 shows an example of a function that restricts the display of confidential items using the method proposed in this patent. FIG. 12(a) shows the concealment target region and the character lines in that region, as obtained by document structure analysis. Here the character line containing a name is taken to be the item to conceal. FIG. 12(b) shows the result of filling the concealment target region with black. This makes it possible to conceal or disclose data as required for each viewer. FIG. 12(c) shows the result of filling the same region with the background color (white) instead. Compared with filling in black, filling with the background color does not draw the viewer's attention to the presence of concealed data, so the data is better hidden. There are several ways to implement the background-color fill. They are described with reference to FIG. 13.

FIG. 13 is a conceptual diagram of the pseudo-colorization processing of a document image. Each pixel has a value that represents a color (a color value). In a black-and-white image, for example, each pixel has the value 0 or 1. Which color a value such as 0 represents is determined by looking it up in a table called the RGB color map. In the RGB color map of FIG. 13(b), 0 represents white and 1 represents black. Pseudo-colorization assigns a different color value to the black pixels in the target character line within the target region (the pixels need not actually be black; "black" here simply means the color of the content to conceal). In FIG. 13(c), color value 2 is assigned to the pixels of the character line in the name field of the document image. If the RGB color map defines color value 2 as white (the background color), the name 日立太郎 (Hitachi Taro) is displayed in white on the screen, as if it had been painted over in white. Internally, however, the image data of the name is not erased: the set of pixels with color value 2 is exactly the image of the name. When pseudo-colorization is performed by the OCR device, the device modifies the original document image to pseudo-color it, stores the color values and attributes of the pseudo-colored information as browsing attribute information in the document structure data of the OCR additional data, and outputs the result.
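The pseudo-colorization step can be sketched with an indexed image and a color map. The image representation (a list of rows of color values), the box convention, and the function names are assumptions for illustration.

```python
WHITE, BLACK = (255, 255, 255), (0, 0, 0)


def pseudo_colorize(image, box, new_value=2, target_value=1):
    """Return a copy in which pixels equal to target_value inside box get new_value."""
    left, top, right, bottom = box  # right and bottom are exclusive
    out = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            if out[y][x] == target_value:
                out[y][x] = new_value
    return out


def render(image, colormap):
    """Look up each color value in the color map to get the displayed RGB pixels."""
    return [[colormap[v] for v in row] for row in image]


# Color value 2 mapped to the background color hides the text; mapping it to
# the foreground color shows it again without touching the pixel data.
HIDDEN = {0: WHITE, 1: BLACK, 2: WHITE}
REVEALED = {0: WHITE, 1: BLACK, 2: BLACK}
```

Rendering the same pseudo-colored image with `HIDDEN` and then with `REVEALED` shows that only the table changes between the two displays.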

The frame position information and frame attribute information obtained from the OCR additional data identify the location of the regions to conceal. Various concealment methods are possible. Once a field is determined to be a concealment target, the character lines in it are extracted and their circumscribed rectangles are obtained. The available methods then include filling the region inside the circumscribed rectangle with black; pseudo-coloring the black (foreground) pixels inside the rectangle and setting the pseudo-color value to white (the background color), so that the region looks as if it were filled with white; and pseudo-coloring the black (foreground) pixels inside the rectangle, setting the pseudo-color value to black (the foreground color), and also filling the rectangle with black. To display the concealed information, the pseudo-color value and its disclosure condition are read from the browsing attribute data in the OCR additional data. If the person viewing the document satisfies the disclosure condition, the information is displayed by changing the pseudo-color value to the foreground color, or to another color that stands out against the background.

The distinguishing feature of concealment by pseudo-colorization is that confidential information can be hidden while the document image stays readable in a general-purpose viewer and the original image information is not destroyed. Two approaches to concealing information in document images are common. One uses a special format such as PDF with a dedicated viewer, so that without passing a check such as a password the document cannot be opened, or the partially blacked-out portions cannot be seen. The other uses a general-purpose format and lets the confidential information be seen only in a special viewer. Pseudo-colorization applies mainly to the latter. Its advantages are that system cost is kept down because a general-purpose viewer is used, and that the data in the image is not actually erased but only disappears visually. Security can be strengthened further by measures such as encrypting the image itself. Because this too can be done by combining common tools, the advantages above are preserved.
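The disclosure check described above amounts to choosing a color map per viewer. The sketch below assumes a hypothetical encoding of the browsing attribute data as a list of entries with a `color_value` and a set of `allowed_roles`; the source does not define how disclosure conditions are stored.

```python
WHITE, BLACK = (255, 255, 255), (0, 0, 0)


def colormap_for_viewer(browse_attrs, viewer_roles):
    """Build a color map in which pseudo-color values stay hidden unless the viewer qualifies.

    browse_attrs: list of {"color_value": int, "allowed_roles": set of str},
    an assumed encoding of the browsing attribute information.
    """
    colormap = {0: WHITE, 1: BLACK}
    for attr in browse_attrs:
        allowed = bool(attr["allowed_roles"] & set(viewer_roles))
        colormap[attr["color_value"]] = BLACK if allowed else WHITE
    return colormap
```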

FIG. 14 shows an example screen in which regions of interest are highlighted using the method proposed in this patent. FIG. 14(a) shows the result of document structure analysis, in which the injury/disease name field (1401) and the summary field (1402) have been extracted. To focus on just these two fields, the frames could be cut out and displayed as in FIG. 9. Here, instead, the frames are highlighted and the gradation of the surroundings is lowered, which emphasizes the fields without breaking up the layout of the actual document image (FIG. 14(b)). The pseudo-colorization described above can be used for this as well. Pseudo-color value 2 is assigned to the pixels of the character lines inside the injury/disease name field and the summary field. Before highlighting, the color for value 2 is set to black. When highlighting is requested, the color for value 1, which is the value of the black pixels outside the regions, is simply set to gray. Other ways of adjusting contrast include scanning the image and changing colors each time, or applying a logical operation between the original image and a mask image. Compared with these, pseudo-coloring in advance means that when the viewer requests contrast enhancement or a similar effect, it can be achieved just by changing values in the RGB color map, so processing is fast. FIG. 14(c) shows the same processing varied as the viewer's work progresses. For example, using OCR additional data and pseudo-colorization, an inspection can concentrate on the injury/disease name field (1405) at the start of the work and on the summary field (1406) in the next work phase.
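Highlighting by color map can be sketched as follows. To support the phase-by-phase inspection of FIG. 14(c), the sketch assumes each field was given its own pseudo-color value in advance (for example 2 for the injury/disease name field and 3 for the summary field); that per-field assignment and the function name are assumptions, since the text assigns value 2 to both fields in the FIG. 14(b) case.

```python
WHITE, BLACK, GRAY = (255, 255, 255), (0, 0, 0), (160, 160, 160)


def highlight_colormap(focus_values, region_values):
    """Keep focused regions black and dim all other foreground pixels to gray."""
    colormap = {0: WHITE, 1: GRAY}  # value 1: foreground outside every region
    for value in region_values:
        colormap[value] = BLACK if value in focus_values else GRAY
    return colormap
```

Moving from one inspection phase to the next then only requires building a new table, for example `highlight_colormap({2}, {2, 3})` followed by `highlight_colormap({3}, {2, 3})`, with no pass over the pixel data.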

FIG. 15 shows an example configuration of a document search system in which the OCR device and the document image processing device are separated according to the method proposed in this patent. The upper part of FIG. 15 shows an example configuration of the OCR device, and the lower part an example configuration of the document image processing device.

In the OCR device in the upper part, the image input device (1501) converts a document into electronic data (a document image), which is stored in the external storage device (1504) and the memory (1505) and read by the central processing unit (1506). The document structure dictionary, character identification dictionary, character string notation knowledge dictionary, and other dictionaries of FIG. 2 are stored in the external storage device (1504), and document structure analysis refers to the definitions stored there. A person can control this processing through the operation terminal device (1502). Processing results are shown on the display terminal device (1503), and data is stored in the external storage device or sent to an external device through the communication device (1507). The OCR reading result can be output as a text file, as in conventional devices, and also as OCR additional data. The OCR additional data, which contains the reading hypothesis data, the reading result text, and the document structure data, is either embedded in the document image file or associated with the document image file and stored in the external storage device, or it is sent to an external device through the communication device. A document ID code corresponding to the document (or image) read by the OCR is assigned to the OCR additional data. This document ID code links the paper document or document image to its OCR additional data.

The document image processing device in the lower part of FIG. 15 performs document search and document browsing using the OCR additional data output by the OCR device. Once OCR additional data has been generated for a document, the document can be searched and browsed any number of times, as long as the OCR additional data exists. The device reads OCR additional data from the communication device (1515) and the external storage device (1512), loads it into the memory (1513), and performs search and browsing processing on the central processing unit (1514). The words to search for and the document search rules are either stored in the external storage device or entered from the operation terminal device (1510). Word search results are shown on the display terminal device (1511) and can also be sent to external equipment through the communication device or stored in the external storage device. These devices are connected by communication buses (1507, 1508, 1509, 1515, 1516).

FIG. 1 is a diagram comparing the processing of this patent with that of the conventional method.
FIG. 2 is a flow diagram of the OCR device that outputs OCR additional data.
FIG. 3 is a flow diagram of document processing using OCR additional data.
FIG. 4 is a conceptual diagram of embedding OCR additional data in an image file.
FIG. 5 is an example of a document image.
FIG. 6 is an example of document structure analysis.
FIG. 7 is a conceptual diagram of notation knowledge processing using a character string hypothesis.
FIG. 8 is a conceptual diagram of a character string hypothesis.
FIG. 9 is an example of a document browsing system (partial view).
FIG. 10 is an example of a document browsing system (important keyword display).
FIG. 11 is an example of a document browsing system (rule inspection).
FIG. 12 is an example of a document browsing system (information concealment).
FIG. 13 is a conceptual diagram of pseudo-colorization.
FIG. 14 is an example of a document browsing system (region highlighting).
FIG. 15 is a configuration example of the OCR device and the document processing device.

Explanation of symbols

0101: paper document input to the conventional document processing system, 0102: OCR unit of the conventional document processing system, 0103: OCR output result of the conventional document processing system, 0104: document processing unit of the conventional document processing system, 0105: paper document input to the document processing system proposed in this patent, 0106: OCR unit of the document processing system proposed in this patent, 0107: OCR output result of the document processing system proposed in this patent, 0108: document processing unit of the document processing system proposed in this patent
0201: image input unit, 0202: document structure analysis unit, 0203: character line extraction unit, 0204: character string hypothesis creation unit, 0205: character string notation analysis unit, 0206: character string hypothesis/text selection unit, 0207: text output unit, 0208: character string hypothesis output unit, 0209: data registration unit, 0210: document processing unit
0301: input data group, 0302: data reading unit, 0303: pseudo-color processing unit, 0304: database registration unit, 0305: additional information database, 0306: image information embedding unit, 0307: document image file with additional information, 0308: document processing unit
0501: example of a document image to be processed
0601: result of document structure analysis (injury/disease name field), 0602: result of document structure analysis (treatment date field), 0603: result of document structure analysis (summary field), 0604: result of document structure analysis (treatment days field), 0605: result of document structure analysis (points field), 0606: result of document structure analysis (line extraction), 0607: line extraction result 1 (example of a printed character line), 0608: line extraction result 2 (example of a handwritten character line)
0801: character pattern in the character string hypothesis, 0802: pattern boundary in the character string hypothesis, 0803: character identification result in the character string hypothesis, 0804: character identification similarity in the character string hypothesis, 0805: word found in the character string hypothesis
0901: group of partial regions of a document image shown in partial view, 0902: name of the document image shown in partial view, 0903: injury/disease name field of the document image shown in partial view, 0904: summary field of the document image shown in partial view, 1001: list of keywords to search for in the document image, 1002: keywords found in the document image (shown underlined)
1101: list of document image search rules, 1102: list of important keywords extracted from the document image, 1103: list of document images that satisfy the specified rule, 1104: location in the document image where the search rule matched, 1105: important keywords that satisfy the search rule (shown underlined)
1401: position of the injury/disease name field obtained by document structure analysis, 1402: position of the summary field obtained by document structure analysis, 1403: result of highlighting the injury/disease name field, 1404: result of highlighting the summary field, 1405: result of highlighting the injury/disease name field first, 1406: result of highlighting the summary field next
1501: image input device of the OCR device, 1502: operation terminal device of the OCR device, 1503: display terminal device of the OCR device, 1504: external storage device of the OCR device, 1505: memory of the OCR device, 1506: CPU of the OCR device, 1507: communication device of the OCR device, 1508: communication bus of the OCR device, 1509: network, 1510: operation terminal device of the document image processing device, 1511: display terminal device of the document image processing device, 1512: external storage device of the document image processing device, 1513: memory of the document image processing device, 1514: CPU of the document image processing device, 1515: communication device of the document image processing device, 1516: communication bus of the document image processing device.

Claims

An OCR device that performs character recognition processing on document image data generated by optically reading a paper document,
A storage device for storing a document structure dictionary used for document structure analysis and a character identification dictionary used for character identification;
An image input unit for inputting the document image data;
An arithmetic unit,
wherein the arithmetic unit generates document structure data by performing a frame structure analysis of the document image data and specifying a reading target frame using the document structure dictionary, performs character recognition processing on the specified reading target frame using the character identification dictionary to generate a reading result text or reading hypothesis data, and outputs OCR additional data including at least one of the document structure data and the reading hypothesis data in association with the document image data.
The reading hypothesis data includes at least a character cutout pattern candidate and an identification result of the character cutout pattern generated in the process of character recognition processing.

2. The OCR apparatus according to claim 1, wherein the OCR additional data is registered in the same file as the document image data.

3. The OCR device according to claim 2, wherein the file is a tag-format image file including a plurality of data blocks and tags respectively corresponding to the plurality of data blocks,
and the file includes at least one data block for storing the OCR additional data and a tag including information indicating that the data stored in that data block is OCR additional data.
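
As an illustration of the tag-format file described in claim 3, the following minimal Python sketch packs data blocks, each preceded by a tag, into one byte string and then picks out the blocks tagged as OCR additional data. The tag values (`TAG_IMAGE`, `TAG_OCR_ADDITIONAL`), the 8-byte block header, and the function names are assumptions made for this example; the claim does not specify a concrete layout.

```python
import struct

# Hypothetical tag values; the claim only requires that a tag identify
# a block as holding OCR additional data.
TAG_IMAGE = 1
TAG_OCR_ADDITIONAL = 2

_HEADER = struct.Struct("<II")  # (tag, payload length), little-endian


def pack_blocks(blocks):
    """Serialize (tag, payload) pairs into one tagged byte string."""
    out = bytearray()
    for tag, payload in blocks:
        out += _HEADER.pack(tag, len(payload))
        out += payload
    return bytes(out)


def unpack_blocks(data):
    """Yield (tag, payload) pairs from a byte string made by pack_blocks."""
    offset = 0
    while offset < len(data):
        tag, length = _HEADER.unpack_from(data, offset)
        offset += _HEADER.size
        yield tag, data[offset:offset + length]
        offset += length


def ocr_additional_data(data):
    """Return the payloads of all blocks tagged as OCR additional data."""
    return [p for tag, p in unpack_blocks(data) if tag == TAG_OCR_ADDITIONAL]
```

A reader that does not know the OCR tag can skip those blocks by length, so in this sketch the image stays readable by software that ignores the additional data.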

The OCR device according to claim 1,
wherein the arithmetic unit identifies, based on the document structure data, a portion of the document image data that needs to be concealed, and performs pseudo-colorization processing that changes the color value of each pixel in that portion to another color value and creates a correspondence between the other color value and the display color used when displaying it,
updates the document image data to include the other color values, and
outputs, in association with the document image data, a color map table including the correspondence between the display color and the other color value, and browsing attribute information including at least a pseudo color value and a viewing permission condition.
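
The pseudo-colorization step of claim 4 can be sketched as follows. This is a minimal illustration, assuming a two-level image (0 for background, 1 for ink), a rectangular concealment region, a fixed shift `OFFSET` to produce the other color values, and a role name as the viewing permission condition; none of these specifics come from the claim.

```python
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
OFFSET = 2  # hypothetical shift that moves values 0/1 to pseudo values 2/3


def pseudo_colorize(image, region, permitted_role):
    """Shift pixel values inside `region` and describe how to display them.

    image: list of rows holding 0 (background) or 1 (ink).
    region: (top, left, bottom, right) with bottom and right exclusive.
    Returns (updated image, color map table, browsing attribute information).
    """
    top, left, bottom, right = region
    updated = [row[:] for row in image]  # leave the input untouched
    for y in range(top, bottom):
        for x in range(left, right):
            updated[y][x] = image[y][x] + OFFSET

    # Each pseudo value maps to the color shown when viewing is permitted;
    # the original value is recoverable as (pseudo value - OFFSET).
    color_map = {0 + OFFSET: WHITE, 1 + OFFSET: BLACK}
    browse_attrs = {
        "pseudo_values": sorted(color_map),
        "region": region,
        "permitted_role": permitted_role,  # assumed form of the permission condition
    }
    return updated, color_map, browse_attrs
```

Because the shifted values still distinguish ink from background, the concealed content is preserved in the image and can be shown or masked purely by the choice of display color.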

The OCR apparatus according to claim 1, wherein the reading hypothesis data includes a character string hypothesis generated in the character recognition process, and the character string hypothesis expresses, in a graph format, information on character extraction pattern candidates and the character identification results of those candidates.
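
One way to picture the graph-format character string hypothesis is as a graph whose nodes are pattern boundaries and whose edges are character pattern candidates, each carrying identification results with similarities (compare reference numerals 0801 to 0804). The Python sketch below is only an assumed representation; the class names, fields, and the `readings` helper are hypothetical and not taken from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class PatternEdge:
    """One character pattern candidate between two pattern boundaries."""
    start: int        # index of the left pattern boundary
    end: int          # index of the right pattern boundary (must exceed start)
    candidates: list  # (character, similarity) identification results


@dataclass
class StringHypothesis:
    """Graph whose nodes are pattern boundaries and whose edges are patterns."""
    edges: list = field(default_factory=list)

    def add(self, start, end, candidates):
        self.edges.append(PatternEdge(start, end, list(candidates)))

    def readings(self, start, goal):
        """Enumerate every string readable along a path from start to goal."""
        if start == goal:
            return [""]
        results = []
        for edge in self.edges:
            if edge.start == start:
                tails = self.readings(edge.end, goal)
                for char, _similarity in edge.candidates:
                    results.extend(char + tail for tail in tails)
        return results
```

For example, if one wide pattern is read as "d" while the same ink split in two is read as "c" followed by "l", both readings stay available to later processing instead of being collapsed into a single recognized text.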

A document processing apparatus that performs document processing using a document reading processing result performed by an OCR apparatus as input information,
An input unit that receives an input of the document reading process result, a display unit that displays the document reading process result, a user input unit that receives a user input, and an arithmetic unit;
wherein the document reading processing result includes document image data generated by optically reading a paper document, and OCR additional data including at least one of document structure data including a frame structure of the document image data and reading hypothesis data from the character recognition processing of a reading target frame among the frames of the document image data, and
the calculation unit selectively displays, on the display unit, information included in the document reading processing result using the OCR additional data, based on an instruction input from the user input unit.

The document processing apparatus according to claim 6,
The document reading processing result includes the reading hypothesis data in the OCR additional data,
The reading hypothesis data is a graph representation of information about character extraction pattern candidates and character identification results of the character extraction pattern candidates.
The calculation unit searches the reading hypothesis data expressed in the graph format using a search keyword input to the user input unit, and displays the document image data included in the document reading processing result on the screen according to the search result.
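
The keyword search over graph-format reading hypothesis data in claim 7 can be sketched as a path search: a keyword matches if its characters can be read, one per edge, along consecutive character pattern candidates. The edge tuple layout, the function name `find_keyword`, and the use of mean similarity as a score are assumptions for this example, not details from the claim.

```python
def find_keyword(edges, keyword):
    """Return (start boundary, end boundary, mean similarity) for each match.

    edges: list of (start, end, {character: similarity}) tuples, one per
    character pattern candidate; end must be greater than start.
    """
    by_start = {}
    for start, end, candidates in edges:
        by_start.setdefault(start, []).append((end, candidates))

    matches = []

    def walk(node, index, total, origin):
        # All keyword characters consumed: record where the match lies.
        if index == len(keyword):
            matches.append((origin, node, total / len(keyword)))
            return
        for end, candidates in by_start.get(node, []):
            similarity = candidates.get(keyword[index])
            if similarity is not None:
                walk(end, index + 1, total + similarity, origin)

    if keyword:
        for origin in sorted(by_start):
            walk(origin, 0, 0.0, origin)
    return matches
```

The returned boundary pair identifies where in the character line the keyword was found, which is the kind of information a viewer would need to underline the hit in the document image.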

The document processing apparatus according to claim 6,
The document reading processing result includes the document structure data in the OCR additional data,
The document structure data has display target frame information indicating which of the frames included in the document image data is a display target frame,
The document processing apparatus, wherein the calculation unit selectively displays a display target frame included in the document image data according to the display target frame information.
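
Selective display based on display target frame information can be sketched as cropping only the flagged frames out of the document image. The frame dictionary keys (`name`, `box`, `display_target`) and the function name are assumptions made for this illustration.

```python
def crop_display_targets(image, frames):
    """Return {frame name: cropped rows} for frames flagged as display targets.

    image: list of pixel rows.
    frames: list of dicts with "name", "box" (top, left, bottom, right;
    bottom and right exclusive) and "display_target" (bool).
    """
    crops = {}
    for frame in frames:
        if not frame["display_target"]:
            continue  # frames not marked for display are skipped entirely
        top, left, bottom, right = frame["box"]
        crops[frame["name"]] = [row[left:right] for row in image[top:bottom]]
    return crops
```

Applying the same function to many document images and stacking the crops would give a side-by-side view of the same fields across documents.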

The document processing apparatus according to claim 6,
The document reading processing result includes the document structure data in the OCR additional data,
The document structure data has display target frame information indicating which of the frames included in the document image data is a display target frame,
The document processing apparatus, wherein the calculation unit highlights and displays a display target frame included in the document image data in accordance with the display target frame information.

The document processing apparatus according to claim 6,
Some areas of the document image data have undergone pseudo-colorization processing,
The OCR additional data includes a color map table including a correspondence relationship between a color value of each pixel and a display color in the area where the pseudo color processing is performed,
The calculation unit determines, with reference to the color map table and according to the viewing state designated by the user, the display color of the area where the pseudo-colorization processing has been performed, and the display unit displays the document image data using the determined display color.
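
The display side of claim 10 can be sketched as a lookup: pixels whose values appear in the color map table are pseudo-colored, and their display color depends on the designated viewing state. The gray `MASK` color, the `BASE_COLORS` table for ordinary pixels, and the boolean `viewing_permitted` flag are assumptions for this example.

```python
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
MASK = (128, 128, 128)  # hypothetical fill color for concealed pixels
BASE_COLORS = {0: WHITE, 1: BLACK}  # assumed colors of ordinary pixels


def render(image, color_map, viewing_permitted):
    """Map pixel values to display colors for the designated viewing state."""
    rendered = []
    for row in image:
        out_row = []
        for value in row:
            if value in color_map:
                # Pseudo-colored pixel: reveal it or mask it.
                out_row.append(color_map[value] if viewing_permitted else MASK)
            else:
                out_row.append(BASE_COLORS[value])
        rendered.append(out_row)
    return rendered
```

With this arrangement the stored image never changes between viewers; only the color lookup does.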

A document processing method in a document processing system having an OCR input device that inputs document image data generated by optically reading a paper document, a storage device that stores a document structure dictionary used for document structure analysis and a character identification dictionary used for character identification, an operation unit that performs operations including OCR processing including the document structure analysis and the character identification, a document reading result storage unit that registers the OCR processing result, and a display unit, the method comprising:
Analyzing the frame structure of the document image data using the document structure dictionary,
Based on the information of the analyzed frame structure, the character recognition processing of the document image data is performed using the character identification dictionary to generate a reading result text or reading hypothesis data,
OCR additional data including at least one of the document structure data and the reading hypothesis data is stored in the document reading result storage unit in association with the document image data,
The document processing method, wherein the reading hypothesis data includes at least a character extraction pattern candidate and an identification result of the character extraction pattern, which are generated in the process of character recognition processing.

12. The document processing method according to claim 11, wherein the OCR additional data is registered in the same file as the document image data.

13. The document processing method according to claim 12, wherein the file is an image file in a tag format including a plurality of data blocks and tags corresponding to each of the plurality of data blocks,
and the file includes at least one data block for storing the OCR additional data and a tag including information indicating that the data stored in that data block is OCR additional data.

A document processing method according to claim 11,
wherein, based on the document structure data, a portion of the document image data that needs to be concealed is specified, and pseudo-colorization processing is performed that changes the color value of each pixel in that portion to another color value and creates a correspondence between the other color value and the display color used to display it,
the document image data is updated to include the other color values, and
a color map table including the correspondence between the display color and the other color value, and browsing attribute information including at least a pseudo color value and a viewing permission condition, are stored in association with the document image data.

15. The document processing method according to claim 14, wherein the OCR additional data includes the browsing attribute information and a color map table including a correspondence relationship between a color value of each pixel and a display color in the area where the pseudo-colorization processing has been performed, and the arithmetic unit discriminates, using the browsing attribute information, a browsing state permitted in the area, determines the display color of the area where the pseudo-colorization processing has been performed with reference to the color map table, and displays the document image data using the determined display color.

16. The document processing method according to claim 11, wherein the reading hypothesis data includes a character string hypothesis generated in the character recognition process, and the character string hypothesis expresses, in a graph format, information about character extraction pattern candidates and the character identification results of those candidates.

The document processing method according to claim 11, wherein the document processing system includes a user input unit that receives a user input,
and the OCR additional data includes the reading hypothesis data, the reading hypothesis data is searched using a search keyword input to the user input unit, and document image data having reading hypothesis data matching the search keyword is returned as the search result.

A document processing method according to claim 11,
The OCR additional data includes the document structure data,
The document structure data has display target frame information indicating which of the frames included in the document image data is a display target frame,
The document processing method, wherein the calculation unit selectively displays a display target frame included in the document image data according to the display target frame information.

A document processing method according to claim 11,
The OCR additional data includes the document structure data,
The document structure data has display target frame information indicating which of the frames included in the document image data is a display target frame, and the display target frame included in the document image data is highlighted and displayed according to the display target frame information.