JP2009009307A

JP2009009307A - Document image processor and processing method

Info

Publication number: JP2009009307A
Application number: JP2007169303A
Authority: JP
Inventors: Toshiaki Yagasaki; 敏明矢ヶ崎
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2007-06-27
Filing date: 2007-06-27
Publication date: 2009-01-15

Abstract

PROBLEM TO BE SOLVED: To improve user convenience in advanced document processing such as retrieval by applying document image analysis to an input document image, detecting character regions, discriminating the attributes of each region (for example, titles, captions, and body text), recognizing the characters themselves, determining the accuracy of the results, and calculating the accuracy per word, per clause, per sentence, and for the whole document.

SOLUTION: The document image processor comprises an image input part such as a scanner, a pre-processing part, a document image structure analyzing part, a character recognition part, an accuracy calculation part, and an electronic document generation part.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a document image processing apparatus and method.

In recent years, the digitization of information has advanced, and demand for converting between paper documents and electronic documents has grown. Digitizing a paper document is no longer limited to photoelectrically converting the page with a scanner or the like into image data. Efforts are also made to recognize the content, divide the document into regions of differing nature such as text, symbols, photographs, and tables, and encode each appropriately: character portions as character code information; figures, lines, and table frames as vector data; and photographs as image data.

At present, this kind of encoding of paper documents is imperfect across its entire technical range because of limitations in recognition technology. This applies to block selection, which recognizes what is on the page; to recognition technology, which encodes characters from the character regions; and to vectorization technology for figures, lines, and the like. Nevertheless, digitization is advancing rapidly as noted above, and the digitization of paper documents is expanding. As a result, encoding technology is being used for paper document retrieval in this imperfect state. Retrieval failures caused by character misrecognition are handled by simply accepting them, by holding multiple candidate characters at recognition time and taking their combinations into account at search time, or by correcting the results manually after character recognition. Consequently, in advanced document processing such as summarization and classification of digitized paper documents, character misrecognition occasionally produces results that are not fit for practical use.

For example, Patent Document 1 and Patent Document 2 can be cited as conventional examples.
Patent Document 1: Japanese Patent Laid-Open No. 04-364593
Patent Document 2: Japanese Patent Laid-Open No. 08-305805

Thus, as information is digitized, demand for advanced processing of electronic documents, such as retrieval, summarization, and classification, is growing, and such processing is becoming practical. Electronic documents obtained by digitizing paper documents, however, have not been practical: the immaturity of digitization technologies such as analysis and recognition has imposed inconvenience on users. The present invention therefore aims to improve user convenience in a technical field that is expanding from retrieval toward advanced document processing, by taking into account the misrecognition that routinely accompanies the digitization of paper documents. One example is avoiding retrieval failures caused by misrecognition when searching paper documents. Search results usually depend heavily on the character recognition results, and even searches that use candidate characters say nothing about the accuracy of the extracted results. Judging the credibility of the results is therefore left to the user, who must grasp the content without knowing whether misrecognition is involved. Likewise, important-sentence extraction for summarization does not take misrecognition into account, so summaries containing misrecognized text are often output. When the goal is to create a new document, including an automatically generated table of contents, by summarizing and combining the contents of multiple documents, misrecognition can render the output itself meaningless. In classification as well, degraded accuracy due to misrecognition can lead to incorrect classification results being selected.

In view of these problems, the present invention provides an easy-to-use document image processing apparatus that, without forcing the errors of digitization technology on the user, promotes adding value to paper documents and thereby encourages the sharing and improved reusability of the flood of available information.

According to the present invention, when a search query is issued to a database, the query is a word, and the system accesses the character recognition results of paper documents, the search is executed with weighting based on word-level accuracy. This word-level accuracy is calculated from the recognition accuracy (reliability) of the individual characters in the character recognition result. That is, even when a word in a search target does not match the query word, the document is output as a search result if the accuracy of that word is low, and is excluded from the results if the accuracy is high.
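The matching rule in the preceding paragraph can be sketched in Python as follows. This is only an illustration under stated assumptions: the `OcrWord` record, the similarity test (same length, at most one differing character), and the 0.8 threshold are all hypothetical, since the source does not specify how a near miss is detected or where the line between low and high accuracy falls.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OcrWord:
    text: str        # word as recognized by character recognition
    accuracy: float  # word-level accuracy in [0, 1]


def differs_by_at_most_one(a: str, b: str) -> bool:
    """Assumed similarity test: same length, at most one differing character."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


def matches(query: str, word: OcrWord, threshold: float = 0.8) -> bool:
    """Exact matches always hit; near misses hit only when accuracy is low."""
    if word.text == query:
        return True
    # A near miss with low accuracy may be a misrecognized instance of the query.
    # A near miss with high accuracy is probably a genuinely different word.
    return differs_by_at_most_one(query, word.text) and word.accuracy < threshold


def search(query: str, documents: Dict[str, List[OcrWord]],
           threshold: float = 0.8) -> List[str]:
    """Return the ids of documents containing at least one matching word."""
    return [doc_id for doc_id, words in documents.items()
            if any(matches(query, w, threshold) for w in words)]


documents = {
    "a": [OcrWord("文学", 0.95)],  # exact match
    "b": [OcrWord("文字", 0.40)],  # near miss, low accuracy: kept
    "c": [OcrWord("文字", 0.95)],  # near miss, high accuracy: excluded
}
print(search("文学", documents))
```

A fixed threshold keeps the sketch short; a real system would more likely fold the accuracy into a ranking score, as the text suggests when it speaks of weighting.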

According to the present invention, when summarization is performed on digitized data obtained by character recognition of a paper document, an important-sentence extraction algorithm ranks and extracts the important sentences in the text. When the ranked important sentences come from an electronic document that is the character recognition result of a paper document, the ranking is corrected by weighting it with the sentence accuracy calculated from the character recognition result. That is, an important sentence is moved down the ranking when the recognition accuracy of its individual characters is low, so that the summary output to the user is easy to read.
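A minimal Python sketch of this re-ranking follows. It assumes each sentence already carries an importance score from some extraction algorithm and a sentence-level accuracy, and it combines the two by simple multiplication. The combination rule and the `RankedSentence` type are assumptions for illustration; the source says only that the ranking is weighted by accuracy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RankedSentence:
    text: str
    importance: float  # score from an important-sentence extraction algorithm
    accuracy: float    # sentence-level recognition accuracy in [0, 1]


def rerank(sentences: List[RankedSentence]) -> List[RankedSentence]:
    """Order sentences by importance weighted by accuracy (assumed: product)."""
    return sorted(sentences, key=lambda s: s.importance * s.accuracy, reverse=True)


candidates = [
    RankedSentence("A", importance=0.9, accuracy=0.5),  # important but poorly recognized
    RankedSentence("B", importance=0.8, accuracy=0.9),
    RankedSentence("C", importance=0.5, accuracy=1.0),
]
print([s.text for s in rerank(candidates)])
```

With these numbers, sentence A has the highest importance but drops below B and C once its low accuracy is taken into account.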

Furthermore, the present invention applies to a system (apparatus) that performs document structure analysis and important-sentence extraction on a plurality of documents, automatically extracts headings from the document structure analysis, and automatically creates a table of contents by collecting the extracted headings. In such a system, the accuracy of each heading is calculated from the character recognition result of the document, and when that accuracy is low, the font or style of the heading text in the table of contents is changed to warn the user. In addition, when a new document is created by summarizing the plurality of documents, the accuracy of the headings and of the important sentences, both based on the character recognition results, is used so that each heading is output in a style that depends on its accuracy and the summary is composed of high-accuracy sentences.

Furthermore, according to the present invention, when classification is performed on digitized data obtained by character recognition of a paper document, the classification is calculated from concept vectors of the words, phrases, sentences, and so on in the recognized data, and the concept vectors are weighted according to character recognition accuracy. That is, the strength of a concept vector is greatest when recognition accuracy is high and the concept elements match; next greatest when recognition accuracy is low but the concept elements are similar as character classes; and lowest when recognition accuracy is high and the concept elements are merely similar as character classes. This suppresses the influence of character recognition on classification.

As described above, the present invention performs document image analysis on an input document image to detect character regions, determine the attributes of those regions (for example, heading, caption, and body text), and recognize the characters themselves. It then determines the accuracy of those results and calculates the accuracy per word, per phrase, per sentence, and for the whole document, thereby improving user convenience in advanced document processing such as retrieval. When digitized data from paper documents is used for retrieval, the character recognition results determine the hit rate: misrecognition can cause a relevant document to be missed or an irrelevant one to be hit. Moreover, if matching conditions are relaxed across multiple words to avoid the drop in hit rate caused by misrecognition, users may be confused about why a result was hit.

The present invention therefore calculates accuracy from character recognition and takes it into account when executing advanced document processing such as retrieval. This reduces the misses caused by misrecognition and, by presenting the accuracy to the user, makes it easy for the user to see why a result was hit. The accuracy can be presented to the user by, for example, changing the color of the hit characters and words, and it can also be taken into account in ranking.

FIG. 1 is a block diagram of the document image processing apparatus according to this embodiment. A document image input from an image input unit 10 such as a scanner is processed by a preprocessing unit 11, which performs skew detection and correction, document orientation detection and correction, noise removal, and in some cases binarization, so that the subsequent stages can operate properly. A document image structure analysis unit 12 then determines the attributes of the document image and analyzes the layout structure of those attributes. Attribute determination detects image attributes such as character image regions, background images, photographic images, and graphic images. Layout structure analysis detects the physical arrangement of those image attributes and, for character image regions, character string attributes such as heading (including title), body text, and caption. The results of the layout structure analysis are output to an electronic document 15 (described later).

Meanwhile, the character image regions are passed to character recognition 13, where they are converted into character codes by ordinary character recognition processing. Accuracy calculation 14 then determines, for each character recognition result, an accuracy value indicating how likely the result is to be correct. Using the technique of the present invention, this accuracy covers units ranging from a single character to a set of multiple characters. The character recognition results, accuracy values, and layout results are input to the electronic document 15. The electronic document preserves the layout of the paper document, and its character image regions are reconstructed from the character recognition results together with their accuracy information. The electronic document can also be switched between showing and hiding the character recognition results; when they are hidden, the image of the character image region itself is displayed.
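One way to picture the electronic document 15 described above is as a list of laid-out text regions, each carrying its attribute, recognized text, and accuracy, plus a switch that selects between the recognized text and the original region image. The Python sketch below is purely illustrative: the class names, fields, and the `image_ref` placeholder are assumptions, not structures defined in the source.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TextRegion:
    attribute: str                   # "heading", "body", or "caption"
    bbox: Tuple[int, int, int, int]  # (x, y, width, height): keeps the paper layout
    text: str                        # character recognition result
    accuracy: float                  # accuracy attached to the recognized text
    image_ref: str                   # hypothetical reference to the cropped region image


@dataclass
class ElectronicDocument:
    regions: List[TextRegion] = field(default_factory=list)
    show_recognized_text: bool = True  # display switch described in the text

    def render(self) -> List[str]:
        """Return, per region, either the recognized text or an image placeholder."""
        if self.show_recognized_text:
            return [r.text for r in self.regions]
        return [f"[image:{r.image_ref}]" for r in self.regions]


doc = ElectronicDocument([TextRegion("heading", (0, 0, 100, 20), "文学", 0.81, "r0.png")])
print(doc.render())
doc.show_recognized_text = False
print(doc.render())
```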

FIG. 2 is a flowchart of the document accuracy extraction process used in this embodiment. Image input 20 corresponds to 10 and 11 in FIG. 1 and performs document image input and preprocessing. Document image portion extraction 11 is the part of the document image structure analysis unit 12 in FIG. 1 that extracts the character image portions relevant to the present invention. Character image analysis 22 is the part of the document image structure analysis unit 12 that detects the character string attributes of the character regions, such as heading (including title), body text, and caption. Document image portion extraction 11 and character image analysis 22 thus generate image data for heading character regions, body character regions, and caption character regions. The character region attributes are input to the electronic document 15, which then waits for the character recognition results and accuracy information described below. Attribute extraction for headings, body text, and captions is not the object of the present invention and has been disclosed in earlier applications, so it is not discussed here.

The character image regions output by character image analysis 22 are input to character recognition 23, where each character is converted into a code by ordinary character recognition processing. Character accuracy calculation 24 then determines a character accuracy indicating how correctly each encoded character was recognized. This value expresses the reliability of recognition for an individual character; various inventions have been proposed for it, so it is also not discussed here. Character recognition 23 and character accuracy calculation 24 thus output the character code and its accuracy for each character.

The recognized character code string is input to word extraction 25, which extracts words from it. Words are detected using a word dictionary stored in the system in advance. Unknown words therefore go undetected unless they are defined by the user or handled by some similar means. To improve the word detection rate broadly in the presence of misrecognition, word extraction also refers to the candidate characters obtained during recognition and to groups of characters similar to the recognized character code, independent of the recognition result. The character code string of each detected word is input to word accuracy calculation 26, which calculates a word-level accuracy from the character accuracies calculated earlier. One example of the calculation follows. For the word "文学" (literature):
文 (90%) 学 (90%): 0.9 × 0.9 = 0.81
That is, the word accuracy can be obtained as the product of the accuracies of its characters.
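The product rule in the example above is easy to state in code. The Python sketch below uses `math.prod` from the standard library (Python 3.8 or later); the function name `word_accuracy` is an illustrative choice, not a name from the source.

```python
import math
from typing import Sequence


def word_accuracy(char_accuracies: Sequence[float]) -> float:
    """Word-level accuracy as the product of per-character accuracies."""
    if not char_accuracies:
        raise ValueError("a word must contain at least one character")
    return math.prod(char_accuracies)


# The two characters of "文学", each recognized with accuracy 0.9,
# give a word accuracy of about 0.81.
print(word_accuracy([0.9, 0.9]))
```

Because every factor is at most 1, the product can only shrink as characters are added, so longer words tend to receive lower accuracy than shorter ones at the same per-character accuracy.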

Word extraction 25 and word accuracy calculation 26 thus yield the words and their accuracies. Based on these results, phrase extraction 27 extracts each phrase from an extracted word and the characters that follow it, and phrase accuracy calculation 28 calculates the accuracy of the extracted phrase. The accuracy may be calculated as
(word accuracy) × (accuracy of the following characters)
or, weighting the word and character accuracies, as
λ × (word accuracy) + (1 - λ) × (accuracy of the following characters), where λ is a weighting coefficient.
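Both phrase accuracy formulas can be written as small Python functions. In this sketch the accuracy of the following characters is passed in as a single value; how several trailing characters would be combined into that value is not specified in the source, and restricting λ to the range 0 to 1 is likewise an assumption.

```python
def phrase_accuracy_product(word_acc: float, following_acc: float) -> float:
    """(word accuracy) x (accuracy of the following characters)."""
    return word_acc * following_acc


def phrase_accuracy_weighted(word_acc: float, following_acc: float,
                             lam: float = 0.5) -> float:
    """lam * (word accuracy) + (1 - lam) * (accuracy of the following characters)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [0, 1]")
    return lam * word_acc + (1.0 - lam) * following_acc


# A word with accuracy 0.81 followed by a particle recognized with accuracy 0.9.
print(phrase_accuracy_product(0.81, 0.9))
print(phrase_accuracy_weighted(0.81, 0.9, lam=0.7))
```

The weighted form always lies between the two input accuracies, whereas the product form is never larger than the smaller of the two.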

In sentence extraction 29, sentences are extracted from the phrases extracted by phrase extraction 27 and from any characters not yet assigned to a phrase. Sentence accuracy calculation 30 calculates the accuracy of each sentence from the phrase and character accuracies already calculated. The sentence accuracy may be calculated as the product of those accuracy values, that is,
(phrase accuracy) × (accuracy of the other characters)
or it may be calculated with weighting, following the weighted phrase accuracy calculation described above. The specific means of calculating the accuracy is not the gist of the present invention and is not described in detail.
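The product form of the sentence accuracy can be sketched in Python as below. Extending the two-factor formula to a product over every phrase and every leftover character in the sentence is an interpretation made for this sketch, and the function name is illustrative.

```python
import math
from itertools import chain
from typing import Sequence


def sentence_accuracy(phrase_accuracies: Sequence[float],
                      other_char_accuracies: Sequence[float]) -> float:
    """Product of all phrase accuracies and all leftover character accuracies."""
    values = list(chain(phrase_accuracies, other_char_accuracies))
    if not values:
        raise ValueError("a sentence must contain at least one phrase or character")
    return math.prod(values)


# Two phrases plus one character (for example, a punctuation mark)
# that was not assigned to any phrase.
print(sentence_accuracy([0.729, 0.9], [0.95]))
```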

Here, a collection of sentences is referred to as an electronic document. Electronic document 31 is generated together with the document structure analysis results obtained by character image analysis, namely the character string attributes of the character regions such as heading (including title), body text, and caption. Document accuracy calculation 32 then calculates the accuracy of the electronic document from the electronic document thus obtained.

FIG. 3 is a flowchart of the document accuracy extraction process used in this embodiment. Image input 33 corresponds to 10 and 11 in FIG. 1 and performs document image input and preprocessing. Document image portion extraction 11 is the part of the document image structure analysis unit 12 in FIG. 1 that extracts the character image portions relevant to the present invention. Character image analysis 35 is the part of the document image structure analysis unit 12 that detects the character string attributes of the character regions, such as heading (including title), body text, and caption. Document image portion extraction 11 and character image analysis 35 thus generate image data for heading character regions, body character regions, and caption character regions. The character region attributes are input to the electronic document 44, which then waits for the character recognition results and accuracy information described below. Attribute extraction for headings, body text, and captions is not the object of the present invention and has been disclosed in earlier applications, so it is not discussed here.

The character image regions output by character image analysis 35 are input to character recognition 36, where each character is converted into a code by ordinary character recognition processing. Character accuracy calculation 37 then determines a character accuracy indicating how correctly each encoded character was recognized. This value expresses the reliability of recognition for an individual character; various inventions have been proposed for it, so it is also not discussed here. Character recognition 36 and character accuracy calculation 37 thus output the character code and its accuracy for each character. In this flow, the character codes and character accuracies obtained in this way are input to word extraction 38, phrase extraction 39, and sentence extraction 40, described below.

The recognized character code string is input to word extraction 38, which extracts words from it. Words are detected using a word dictionary stored in the system in advance. Unknown words therefore go undetected unless they are defined by the user or handled by some similar means. To improve the word detection rate broadly in the presence of misrecognition, word extraction also refers to the candidate characters obtained during recognition and to groups of characters similar to the recognized character code, independent of the recognition result. The character code string of each detected word is input to word accuracy calculation 41, which calculates a word-level accuracy from the character accuracies calculated earlier. One example of the calculation follows. For the word "文学" (literature):
文 (90%) 学 (90%): 0.9 × 0.9 = 0.81
That is, the word accuracy can be obtained as the product of the accuracies of its characters.

Word extraction 38 and word accuracy calculation 41 thus yield the words and their accuracies. The calculated word accuracy is input to the electronic document 44, the final output. Based on the result of word extraction, phrase extraction 39 extracts each phrase from an extracted word and the characters that follow it, and phrase accuracy calculation 42 calculates the accuracy of the extracted phrase. Here the accuracy is calculated from the accuracy of each recognized character code. It may be calculated as
(character accuracy) × (character accuracy) × ... × (character accuracy)
or, weighting the accuracy of the individual characters, as
Σ λ × (character accuracy), where the λ are weighting coefficients with Σλ = 1.
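The two per-character formulas for phrase accuracy can be compared with a short Python sketch. The function names and the default of equal weights are assumptions; the source requires only that the weights sum to 1.

```python
import math
from typing import Optional, Sequence


def accuracy_product(char_accuracies: Sequence[float]) -> float:
    """Product of the accuracies of every character in the phrase."""
    return math.prod(char_accuracies)


def accuracy_weighted_sum(char_accuracies: Sequence[float],
                          weights: Optional[Sequence[float]] = None) -> float:
    """Sum of weight * accuracy over the characters, with weights summing to 1."""
    n = len(char_accuracies)
    if n == 0:
        raise ValueError("at least one character accuracy is required")
    if weights is None:
        weights = [1.0 / n] * n  # assumed default: equal weights
    if len(weights) != n:
        raise ValueError("one weight per character is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * a for w, a in zip(weights, char_accuracies))


accs = [0.9, 0.9, 0.6]
print(accuracy_product(accs))       # penalizes every uncertain character
print(accuracy_weighted_sum(accs))  # with equal weights, the mean accuracy
```

The product falls as more characters are added, while the weighted sum stays between the lowest and highest character accuracy regardless of phrase length. This may be one reason to prefer the weighted form when phrases of different lengths are compared.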

The phrases and phrase accuracies obtained in this way are input to the electronic document 44, the final output. In sentence extraction 40, sentences are extracted from the phrases extracted by phrase extraction 39 and from any characters not yet assigned to a phrase. Sentence accuracy calculation 43 calculates the accuracy of each sentence from the character accuracies already calculated. The sentence accuracy may be calculated as the product of those accuracy values, that is,
(character accuracy) × (character accuracy) × ... × (character accuracy)
or it may be calculated with weighting, following the weighted phrase accuracy calculation described above. The specific means of calculating the accuracy is not the gist of the present invention and is not described in detail.

Here, a collection of sentences is referred to as an electronic document. Electronic document 44 is generated together with the document structure analysis results obtained by character image analysis, namely the character string attributes of the character regions such as heading (including title), body text, and caption, and together with the word-level, phrase-level, and sentence-level accuracies. Document accuracy calculation 32 then calculates the accuracy of the electronic document from the word-level, phrase-level, and sentence-level accuracy values.

FIG. 4 is a processing flowchart of the document accuracy extraction used in this embodiment. Image input 46 corresponds to 10 and 11 in FIG. 1 and performs document image input and preprocessing. Document image portion extraction 11 extracts, within the document image structure analysis unit 12 of FIG. 1, the character image portions relevant to the present invention. Character image analysis 48 is the part of the document image structure analysis unit 12 of FIG. 1 that detects the character string attributes of the character regions relevant to the present invention, such as heading (including title), body text, and caption. Document image portion extraction 11 and character image analysis 48 thus generate image data for heading character regions, body text character regions, and caption character regions. The character region attributes are input to the electronic document 57, where they wait for the character recognition results and accuracy information described below. The extraction of heading, body text, and caption attributes is not the object of the present invention and has already been disclosed in earlier-filed inventions, so it is not discussed here.

Meanwhile, the character image regions output by character image analysis 48 are input to character recognition 49, where each character is encoded one at a time by general character recognition processing means. Character accuracy calculation 50 then obtains, for each encoded character, a character accuracy indicating how correctly that character was recognized. This value indicates the reliability of recognition of each individual character; various inventions have been proposed for it, so it is likewise not discussed here. Character recognition 49 and character accuracy calculation 50 therefore output a character code and its accuracy for each character.

In the present invention, the character codes and character accuracies obtained in this way are input to word extraction. The recognized character code string is input to word extraction 51, which extracts word units from it by detecting words on the basis of a word dictionary stored in the system in advance. For unknown words, in order to avoid missed detections and to broadly improve the word-level detection rate in the presence of misrecognition, word units are extracted with reference to the candidate character group obtained when the recognition result is determined, and to a group of characters similar to the recognized character code that is defined independently of the recognition result. Characters that still cannot be resolved by this processing are detected as unknown words. In particular, particles and the like that occur between words are also assumed to be words, and processing continues. The detected word-unit character code strings are then input to word accuracy calculation 54, which calculates a word-level accuracy from the character accuracies calculated earlier. An example of the calculation follows. To obtain the accuracy of the character codes "文学" (literature):
文 (90%) 学 (90%): 0.9 × 0.9 → 0.81
That is, the word accuracy can be obtained as the product of the accuracies of its characters.
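
The product rule in the example above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name `word_accuracy` and the list-of-pairs input format are assumptions made for the sketch and do not appear in the specification.

```python
from math import prod


def word_accuracy(char_accuracies):
    """Return a word accuracy as the product of per-character accuracies.

    `char_accuracies` is a hypothetical input format: one value in [0, 1]
    per recognized character of the word.
    """
    if not char_accuracies:
        raise ValueError("a word needs at least one character")
    return prod(char_accuracies)


# The example from the text: 文 (90%) and 学 (90%).
recognized = [("文", 0.9), ("学", 0.9)]
accuracy = word_accuracy([acc for _, acc in recognized])
print(f"{accuracy:.2f}")  # prints 0.81
```

Because every factor is at most 1, a single poorly recognized character pulls the accuracy of the whole word down, which is the behaviour the example relies on.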

Word extraction 51 and word accuracy calculation 54 thus yield each word together with its accuracy. The calculated word accuracy is input to the electronic document 44, which is the final output. Next, based on the word extraction result, phrase extraction 52 extracts phrases from the extracted words and from the words obtained as unknown words, and phrase accuracy calculation 55 calculates the accuracy of each extracted phrase. The phrase accuracy is calculated from the word accuracies obtained above; where no word accuracy has been obtained for a word, its accuracy as an unknown word is used. The phrase accuracy may be calculated as
(word accuracy) × (word accuracy) × ... × (word accuracy)
or, taking into account a weighting of the individual character accuracies, as
Σ λ (word accuracy), where λ is a weighting coefficient and Σ λ = 1.
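
Both phrase accuracy variants described above can be sketched as follows. The function names, the sample accuracies, and the sample weights are hypothetical; the text only requires that the weights λ sum to 1.

```python
from math import isclose, prod


def phrase_accuracy_product(word_accuracies):
    """Phrase accuracy as the plain product of its word accuracies."""
    return prod(word_accuracies)


def phrase_accuracy_weighted(word_accuracies, weights):
    """Phrase accuracy as a weighted sum of word accuracies.

    `weights` plays the role of the coefficients λ and must sum to 1.
    """
    if len(word_accuracies) != len(weights):
        raise ValueError("one weight is needed per word")
    if not isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * a for w, a in zip(weights, word_accuracies))


# Hypothetical phrase: a content word (0.81) followed by a particle (0.95).
words = [0.81, 0.95]
print(phrase_accuracy_product(words))
print(phrase_accuracy_weighted(words, [0.7, 0.3]))
```

The weighted form lets a caller give the content word more influence than the particle, which the plain product cannot express.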

The phrases and phrase accuracies obtained by this processing are input to the electronic document 57, which is the final output. Sentence extraction 53 then extracts sentences from the phrases extracted by phrase extraction 52. For each sentence obtained, sentence accuracy calculation 56 calculates the sentence accuracy from the phrase accuracy values already calculated. The sentence accuracy may be the product of those accuracy values, that is,
(phrase accuracy) × (phrase accuracy) × ... × (phrase accuracy)
or it may be calculated by weighting the accuracies obtained above, following the weighted phrase accuracy calculation described earlier. Because the accuracy calculation means itself is not the gist of the present invention, it is not described in detail.
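
The bottom-up chain from character accuracy to sentence accuracy can be sketched as below, using the plain product at every level. The nested-list representation (sentence → phrases → words → character accuracies) and the function name are assumptions made for illustration only.

```python
from math import prod


def sentence_accuracy(sentence):
    """Compute a sentence accuracy bottom-up with the product rule.

    `sentence` is a hypothetical nested list: a list of phrases, each
    phrase a list of words, each word a list of character accuracies.
    """
    phrase_accuracies = []
    for phrase in sentence:
        # Word accuracy: product of character accuracies.
        word_accuracies = [prod(word) for word in phrase]
        # Phrase accuracy: product of word accuracies.
        phrase_accuracies.append(prod(word_accuracies))
    # Sentence accuracy: product of phrase accuracies.
    return prod(phrase_accuracies)


# Two phrases: [two-character word + particle] and [one-character word].
example = [[[0.9, 0.9], [1.0]], [[0.5]]]
print(sentence_accuracy(example))
```

With plain products at every level, the result equals the product of all character accuracies in the sentence, so it can only stay the same or decrease as the sentence gets longer. That may be one reason to prefer the weighted variant for long sentences.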

Here, a group of sentences is referred to as an electronic document. The electronic document 57 is generated together with the document structure analysis results obtained by character image analysis, namely the character string attributes of the character regions such as heading (including title), body text, and caption, and together with the word-level, phrase-level, and sentence-level accuracies. Further, according to the present invention, document accuracy calculation 58 calculates the accuracy of the electronic document from the word-level, phrase-level, and sentence-level accuracy values.
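
One possible in-memory shape for such an electronic document is sketched below. All class and field names are hypothetical. The text does not give a formula for the document accuracy, so the product of sentence accuracies used here is purely an assumption that mirrors the lower levels.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Sentence:
    text: str
    accuracy: float                                        # sentence-level accuracy
    phrase_accuracies: list = field(default_factory=list)  # phrase-level values
    word_accuracies: list = field(default_factory=list)    # word-level values


@dataclass
class Region:
    attribute: str                                         # "heading", "body", or "caption"
    sentences: list = field(default_factory=list)


@dataclass
class ElectronicDocument:
    regions: list = field(default_factory=list)

    def accuracy(self):
        """Assumed rule: product of all sentence accuracies (None if empty)."""
        values = [s.accuracy for r in self.regions for s in r.sentences]
        return prod(values) if values else None


doc = ElectronicDocument(regions=[
    Region("heading", [Sentence("title text", 0.9)]),
    Region("body", [Sentence("first sentence", 0.8),
                    Sentence("second sentence", 0.5)]),
])
print(doc.accuracy())
```

Keeping the word, phrase, and sentence values alongside the text means a later stage such as search can consult them without rerunning recognition.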

Three embodiments have been described. Embodiment 1 (FIG. 2) shows means for obtaining the accuracy, as a document, of the electronic document that is the final output. Embodiment 2 (FIG. 3) obtains the word accuracy, phrase accuracy, and sentence accuracy from the character accuracy, calculates the document accuracy from those results, and stores the word, phrase, and sentence accuracy values in the electronic document that is the final output. Embodiment 3 (FIG. 4) obtains the word accuracy from the character accuracy, the phrase accuracy from the word accuracy, and the sentence accuracy from the phrase accuracy, calculates the document accuracy from those results, and likewise stores the word, phrase, and sentence accuracy values in the final electronic document.

FIG. 5 shows an example in which the present invention is applied to search processing. FIG. 5(a) is an example of full-text search in a conventional document image processing apparatus. Here the apparatus is instructed to search with the word "文書" (document) as the query. In the document image returned as a search result, one occurrence was misrecognized during character recognition, so that "文書" became "交書". Normally, a full-text search with the query "文書" does not hit the misrecognized "交書". In this example the document was hit only because the word "文書" was also recognized correctly elsewhere in the same document, which is rare. FIG. 5(b) shows the same search using the present invention, in which the word accuracy is held as data. In this case the correctly recognized "文書" is naturally hit, and "交書", which was not found in the conventional example, is hit as well. This hit arises because "交書" itself has a high likelihood of being a misrecognition, so the similar characters indicate that it may be "文書", and it is returned as a search result.
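
The accuracy-aware search of FIG. 5(b) could look roughly like the sketch below. The similar-character table `SIMILAR`, the threshold value, the sample accuracies, and the function names are all assumptions made for illustration; the specification does not define them.

```python
# Hypothetical table: recognized character -> characters it is often confused with.
SIMILAR = {"交": {"文"}}


def matches(query, word, char_accuracies, threshold=0.6):
    """Return True if `word` matches `query` exactly, or differs only in
    low-accuracy characters whose similar-character set contains the
    corresponding query character."""
    if len(query) != len(word):
        return False
    for q, c, acc in zip(query, word, char_accuracies):
        if q == c:
            continue
        if acc < threshold and q in SIMILAR.get(c, set()):
            continue  # likely misrecognition of the query character
        return False
    return True


def search(query, indexed_words):
    """`indexed_words` is a list of (word, per-character accuracies) pairs."""
    return [w for w, accs in indexed_words if matches(query, w, accs)]


index = [
    ("文書", [0.9, 0.9]),   # recognized correctly
    ("交書", [0.4, 0.9]),   # first character has low accuracy
    ("交通", [0.9, 0.9]),   # a genuine different word, high accuracy
]
print(search("文書", index))  # prints ['文書', '交書']
```

Gating the similar-character substitution on low accuracy keeps confidently recognized words such as "交通" from producing false hits.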

FIG. 1 is a block diagram of the document image processing apparatus of this embodiment (up to creation of the electronic document).
FIG. 2 is a processing flowchart of document accuracy extraction in this embodiment.
FIG. 3 is a processing flowchart of document accuracy extraction in this embodiment.
FIG. 4 is a processing flowchart of document accuracy extraction in this embodiment.
FIG. 5(a) is a diagram explaining a conventional search (full text), and FIG. 5(b) is a diagram explaining the case in which this embodiment is used for search.

Claims

1. A document image processing apparatus having means for inputting a document image, means for digitizing the input document image, and means for storing the digitized document,
wherein, in the apparatus, a character image area of the digitized electronic document is extracted by means for extracting a character image, and the non-character image area has means for character encoding in character recognition processing,
and wherein a character string composed of at least one encoded character has an accuracy calculated in character recognition,
the apparatus comprising:
means for calculating the accuracy of a word from the calculated accuracy of the character string;
means for calculating the accuracy of a sentence from the calculated word accuracy; and
means for calculating the accuracy of a plurality of sentences from the calculated sentence accuracy,
whereby the document image processing apparatus calculates the accuracy of a sentence.

2. The document image processing apparatus according to claim 1, wherein the accuracy of a sentence is obtained by calculating the accuracy of a phrase and a phrase (word + character).

3. The document image processing apparatus according to claim 1, wherein the plurality of sentences consists of at least two consecutive sentences.

4. The document image processing apparatus according to claim 3, wherein the accuracy of a plurality of sentences includes an accuracy in paragraph units.

5. The document image processing apparatus according to claim 3, wherein the accuracy of a plurality of sentences includes the accuracy of a section and of a chapter.