JP2004038605A - Image processor, image processing method, and program for controlling image processor

Info

Publication number: JP2004038605A
Application number: JP2002195373A
Authority: JP
Inventors: Hiroyuki Masai; 政井　宏之
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2002-07-04
Filing date: 2002-07-04
Publication date: 2004-02-05

Abstract

PROBLEM TO BE SOLVED: To realize a device capable of automatically designating important and non-important areas in a document.

SOLUTION: An item extraction part 20 extracts item areas from a document image of a prescribed format based on that format, and extracts, for each item area, an item kind indicating what kind of item it is and an item content indicating what it contains. An importance imparting reference holding part 300 holds importance imparting references that relate item kind and item content to an importance. An importance imparting part 30 compares the item kind and item content extracted by the item extraction part 20 with the importance imparting references held in the part 300 and imparts an importance to each item. An output part 40 outputs the document information with the importance imparted to each item.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、文書の内容に基づいて重要であると判断される文書の任意の領域を抽出する画像処理装置に関するものであり、例えば、重要であると判断される文書の任意の領域を非公開にする等、文書を公開する場合等に利用されるものである。
【０００２】
【従来の技術】
従来、文書の内容に基づいて重要であると判断される文書の任意の領域は、表示画面上で利用者によって指定されていた。例えば、特開平６−１７６１２７号「電子ファイリング装置」では、登録する文書画像を表示画面上に表示し、利用者によって文書画像中の任意の領域を重要領域として指定している。ここでは、重要領域毎に固有のパスワードを指定することも可能としており、閲覧者が当該画像を表示する場合に、指定した重要領域を非公開にしたり、固有のパスワードを入力することで当該領域を公開にしたりすることを可能としている。
しかしながら、前記の電子ファイリング装置では、文書画像を表示画面上に表示し、重要領域を人手によって指定するため、文書種別が同じである文書を登録する場合に個別に重要領域を指定する必要があったり、表示画面によっては文書画像全体を表示することが出来ず、重要領域を指定するのに表示画面上で文書画像をスクロールさせる必要があったりする等、操作上の煩わしさが問題であった。
【０００３】
前記問題を解決するものとして、特開平７−１１４６１６号「伝票文書情報システム」がある。この伝票文書情報システムでは、伝票文書の表枠位置、文字行位置等の様式データを抽出し、利用者が登録した様式データ（辞書データ）と照合することで伝票文書の様式を識別し、読取領域や重要領域を指定している。重要領域には複数段階のアクセスレベルが設定され、閲覧者のアクセスレベルに応じて当該領域の公開／非公開を設定できるため、きめ細かな文書の公開を実現している。
【０００４】
【発明が解決しようとする課題】
しかしながら、前記伝票文書情報システムでは、重要領域が自動的に抽出されるため、操作上の煩わしさの問題は解決されているものの、辞書データ上に記述される重要領域は固定であるため、文書種別が同じで様式の異なる文書の場合に個別に重要領域を指定する必要があった。また、文書種別は異なるが、同一内容を有する領域を重要領域として指定したい場合にも個別に重要領域を指定する必要があった。
【０００５】
【課題を解決するための手段】
本発明は、前述の課題を解決するため次の構成を採用する。
〈構成１〉
所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書の内容を形成する単位となる項目領域を抽出し、かつ項目領域がどのような種別であるかを示す項目種別と、項目領域の内容を示す項目内容とを抽出する項目抽出部と、予め、項目種別と項目内容とに対する重要度の関係を示す重要度付与基準を保持する重要度付与基準保持部と、項目抽出部で抽出された項目種別と項目内容を、重要度付与基準保持部の重要度付与基準と比較して、各項目毎に重要度を付与する重要度付与部とを備えたことを特徴とする画像処理装置。
【０００６】
〈構成２〉
所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書の内容を形成する単位となる項目領域を抽出すると共に、項目領域が文書中で他の項目領域とどのような関係にあるかを示す論理構造を求め、かつ論理構造がどのような種別であるかを示す論理構造種別と、論理構造の内容を示す論理構造内容とを抽出する論理構造抽出部と、予め、論理構造種別と論理構造内容とに対する重要度の関係を示す重要度付与基準を保持する重要度付与基準保持部と、論理構造抽出部で抽出された論理構造種別と論理構造内容を、重要度付与基準保持部の重要度付与基準と比較して、各論理構造毎に重要度を付与する重要度付与部とを備えたことを特徴とする画像処理装置。
【０００７】
〈構成３〉
所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書がどのような文書であるかを示す文書種別を識別する文書識別部と、所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書の内容を形成する単位となる項目領域を抽出し、かつ項目領域がどのような種別であるかを示す項目種別と、項目領域の内容を示す項目内容とを抽出する項目抽出部と、予め、文書種別と項目種別と項目内容とに対する重要度の関係を示す重要度付与基準を保持する重要度付与基準保持部と、文書識別部で識別された文書種別と、項目抽出部で抽出された項目種別と項目内容とを、重要度付与基準保持部の重要度付与基準と比較して、各項目毎に重要度を付与する重要度付与部とを備えたことを特徴とする画像処理装置。
【０００８】
〈構成４〉
所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書がどのような文書であるかを示す文書種別を識別する文書識別部と、所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書の内容を形成する単位となる項目領域を抽出すると共に、項目領域が文書中で他の項目領域とどのような関係にあるかを示す論理構造を求め、かつ論理構造がどのような種別であるかを示す論理構造種別と、論理構造の内容を示す論理構造内容とを抽出する論理構造抽出部と、予め、文書種別と論理構造種別と論理構造内容とに対する重要度の関係を示す重要度付与基準を保持する重要度付与基準保持部と、文書識別部で識別された文書種別と、論理構造抽出部で抽出された論理構造種別と論理構造内容とを、重要度付与基準保持部の重要度付与基準と比較して、各論理構造毎に重要度を付与する重要度付与部とを備えたことを特徴とする画像処理装置。
【０００９】
〈構成５〉
所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書がどのような用途で使用されるかを示す用途種別を識別する用途識別部と、所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書の内容を形成する単位となる項目領域を抽出し、かつ項目領域がどのような種別であるかを示す項目種別と、項目領域の内容を示す項目内容とを抽出する項目抽出部と、予め、用途種別と項目種別と項目内容とに対する重要度の関係を示す重要度付与基準を保持する重要度付与基準保持部と、用途識別部で識別された用途種別と、項目抽出部で抽出された項目種別と項目内容とを、重要度付与基準保持部の重要度付与基準と比較して、各項目毎に重要度を付与する重要度付与部とを備えたことを特徴とする画像処理装置。
【００１０】
〈構成６〉
所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書がどのような用途で使用されるかを示す用途種別を識別する用途識別部と、所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書の内容を形成する単位となる項目領域を抽出すると共に、項目領域が文書中で他の項目領域とどのような関係にあるかを示す論理構造を求め、かつ論理構造がどのような種別であるかを示す論理構造種別と、論理構造の内容を示す論理構造内容とを抽出する論理構造抽出部と、予め、用途種別と論理構造種別と論理構造内容とに対する重要度の関係を示す重要度付与基準を保持する重要度付与基準保持部と、用途識別部で識別された用途種別と、論理構造抽出部で抽出された論理構造種別と論理構造内容とを、重要度付与基準保持部の重要度付与基準と比較して、各論理構造毎に重要度を付与する重要度付与部とを備えたことを特徴とする画像処理装置。
【００１１】
〈構成７〉
所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書がどのような用途で使用されるかを示す用途種別と、どのような文書であるかを示す文書種別とを識別する用途・文書識別部と、所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書の内容を形成する単位となる項目領域を抽出し、かつ項目領域がどのような種別であるかを示す項目種別と、項目領域の内容を示す項目内容とを抽出する項目抽出部と、予め、用途種別と文書種別と項目種別と項目内容とに対する重要度の関係を示す重要度付与基準を保持する重要度付与基準保持部と、用途・文書識別部で識別された用途種別と文書種別と、項目抽出部で抽出された項目種別と項目内容とを、重要度付与基準保持部の重要度付与基準と比較して、各項目毎に重要度を付与する重要度付与部とを備えたことを特徴とする画像処理装置。
【００１２】
〈構成８〉
所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書がどのような用途で使用されるかを示す用途種別と、どのような文書であるかを示す文書種別とを識別する用途・文書識別部と、所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書の内容を形成する単位となる項目領域を抽出すると共に、項目領域が文書中で他の項目領域とどのような関係にあるかを示す論理構造を求め、かつ論理構造がどのような種別であるかを示す論理構造種別と、論理構造の内容を示す論理構造内容とを抽出する論理構造抽出部と、予め、用途種別と文書種別と論理構造種別と論理構造内容とに対する重要度の関係を示す重要度付与基準を保持する重要度付与基準保持部と、用途・文書識別部で識別された用途種別と文書種別と、論理構造抽出部で抽出された論理構造種別と論理構造内容とを、重要度付与基準保持部の重要度付与基準と比較して、各論理構造毎に重要度を付与する重要度付与部とを備えたことを特徴とする画像処理装置。
【００１３】
〈構成９〉
所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書の内容を形成する単位となる項目領域を抽出すると共に、項目領域がどのような種別であるかを示す項目種別と、項目領域の内容を示す項目内容とを抽出し、次に、項目種別と項目内容を、予め、項目種別と項目内容とに対する重要度の関係を示す重要度付与基準と比較して、各項目毎に重要度を付与することを特徴とする画像処理方法。
【００１４】
〈構成１０〉
画像処理装置を構成するコンピュータを、所定の形式の文書に対して、文書の画像から、文書の形式に基づいて文書の内容を形成する単位となる項目領域を抽出し、かつ項目領域がどのような種別であるかを示す項目種別と、項目領域の内容を示す項目内容とを抽出する項目抽出部と、項目抽出部で抽出された項目種別と項目内容を、予め、項目種別と項目内容とに対する重要度の関係を示す重要度付与基準と比較して、各項目毎に重要度を付与する重要度付与部として機能させるための画像処理装置の制御用プログラム。
【００１５】
【発明の実施の形態】
以下、本発明の実施の形態を具体例を用いて詳細に説明する。
《具体例１》
具体例１では、文書の内容に基づいて重要であると判断される可能性のある文書の領域とその領域の内容を示す種別およびその領域の内容を抽出し、利用者が定義する領域の内容を示す種別および領域の内容と複数段階の重要度との対応に基づいて、前記文書の領域を重要領域として指定することで、従来の問題を解決するようにしたものである。
【００１６】
〈構成〉
図１は、本発明の画像処理装置の具体例１を示す構成図である。
図示の画像処理装置１は、入力装置２および記憶装置３に接続されており、入力部１０、項目抽出部２０、重要度付与部３０、出力部４０、文書画像格納部１００、重要度付与対象情報格納部２００、重要度付与基準保持部３００を備える。また、画像処理装置１は、例えばパーソナルコンピュータから構成されている。
【００１７】
入力装置２は、例えば、文書の画像を光学的に読み取ってデジタルデータとして出力するスキャナや、あるいは文書の画像データを提供する通信インタフェース等からなる。記憶装置３は、画像処理装置１が出力した重要度が付与された文書のデータを記憶するための外部記憶装置や通信インタフェース等からなる。入力部１０は、入力装置２からの文書画像データを入力して文書画像格納部１００に送出するための画像処理装置１の入力インタフェースである。項目抽出部２０は、文書画像格納部１００に格納された所定の形式を持った文書の画像から、その形式に基づいて文書の内容を形成する単位となる項目領域を抽出し、かつ項目領域がどのような種別であるかを示す項目種別と、項目領域の内容を示す項目内容とを抽出する機能を有するものである。重要度付与部３０は、項目抽出部２０で抽出された項目種別と項目内容を、重要度付与基準保持部３００の重要度付与基準と比較して、各項目毎に重要度を付与する機能部である。出力部４０は、文書画像格納部１００に格納された文書画像と、重要度付与対象情報格納部２００に格納された各項目毎に重要度が付与された重要度付与対象情報とを記憶装置３に対して出力する画像処理装置１の出力インタフェースである。
【００１８】
文書画像格納部１００は、入力部１０によって入力された文書画像を格納する記憶部であり、半導体メモリやハードディスク装置等からなる。重要度付与対象情報格納部２００は、項目抽出部２０によって抽出された各項目のデータを格納すると共に、重要度付与部３０によって重要度が付与された項目のデータを保持する記憶部であり、半導体メモリやハードディスク装置等からなる。重要度付与基準保持部３００は、項目種別と項目内容とに対する重要度の関係を示す重要度付与基準を保持する記憶部であり、半導体メモリやハードディスク装置等からなる。
尚、図中、細線矢印はデータの流れ、太線矢印は制御の流れを示している。また、項目抽出部２０、重要度付与部３０は、それぞれの機能に対応したソフトウェアと、これを実行するためのＣＰＵやメモリといったハードウェアから構成されている。
【００１９】
〈動作〉
図２は、具体例１の動作を示すフローチャートである。
［入力部１０の動作］
入力部１０は、画像処理装置１に接続されたスキャナ等の入力装置２から文書画像を取得し、取得した文書画像を文書画像格納部１００に格納する（ステップＳ１０）。
【００２０】
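[Operation of the item extraction part 20]
Next, the item extraction part 20 acquires the document image that the input part 10 stored in the document image storage part 100 and, for all or some of the items in the acquired image, extracts the item kind, the item area (the coordinates of the item's circumscribed rectangle), and the item content (the character recognition result for the item). Here, an item is a unit that forms the content of the document. The part then creates importance imparting target information describing the extracted item kinds, item areas, and item contents, and stores it in the importance imparting target information storage part 200 (step S20).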
【００２１】
図３は、文書画像の一例を示す説明図である。
図４は、構成要素の一例を示す説明図である。
図５は、項目の一例を示す説明図である。
例えば、所定の形式の文書として、入力部１０により文書画像格納部１００に格納された文書画像が図３に示す論文画像の場合、項目抽出部２０は、図５に示す論文画像中の全て、あるいは一部の項目について、「本文」「節タイトル」「図」「キャプション」等の項目種別、「本文領域」「節タイトル領域」「図領域」「キャプション領域」等の項目領域（項目の外接矩形の座標値）および「本文内容」「節タイトル内容」「キャプション内容」等の項目内容（項目の文字認識結果）を抽出する。但し、項目内容を抽出するのは、図４に示す構成要素種別が「文章」の構成要素から項目種別が判定される項目のみである。即ち、項目種別が「本文」「節タイトル」「キャプション」等の項目のみである。また、構成要素とは、文書の紙面を形成する単位のことである。
【００２２】
次に、項目抽出部２０は、抽出した「本文」「節タイトル」「図」「キャプション」等の項目種別と、「本文領域」「節タイトル領域」「図領域」「キャプション領域」等の項目領域および「本文内容」「節タイトル内容」「キャプション内容」等の項目内容とを記述した重要度付与対象情報を作成し、作成した重要度付与対象情報を重要度付与対象情報格納部２００に格納する。
【００２３】
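FIG. 6 is an explanatory diagram showing an example of the importance imparting target information.
In FIG. 6, each item area is given as the circumscribed rectangle of the item, in the form "(upper-left X coordinate, upper-left Y coordinate, lower-right X coordinate, lower-right Y coordinate)". The item kinds, item areas, and item contents are listed in order from the top to the bottom of the left column.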
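As a non-limiting illustration, one row of the importance imparting target information described for FIG. 6 could be held in a small record such as the one sketched below. The class name `ItemRecord`, the helper `reading_order`, and all sample coordinates and strings are hypothetical and do not appear in the specification; only the four-value rectangle layout and the three fields (kind, area, content) are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (upper-left X, upper-left Y, lower-right X, lower-right Y), as described for FIG. 6
BBox = Tuple[int, int, int, int]


@dataclass
class ItemRecord:
    """One row of the importance imparting target information (hypothetical layout)."""
    kind: str                         # e.g. "body text", "section title", "figure", "caption"
    area: BBox                        # circumscribed rectangle of the item
    content: Optional[str] = None     # character recognition result; None for figures
    importance: Optional[str] = None  # filled in later by the importance imparting part


def reading_order(records):
    """Sort records from top to bottom of a column (upper-left Y, then upper-left X)."""
    return sorted(records, key=lambda r: (r.area[1], r.area[0]))


# Illustrative sample data only
records = [
    ItemRecord("figure", (40, 300, 280, 480)),
    ItemRecord("section title", (40, 60, 280, 80), "2.3. Character recognition"),
    ItemRecord("body text", (40, 90, 280, 290), "The OCR engine reads each line."),
]
ordered = reading_order(records)
```

Leaving `content` empty for figures mirrors the statement that item content is extracted only for items derived from text components.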
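The extraction of the item kinds such as "body text", "section title", "figure", and "caption", the item areas such as "body text area", "section title area", "figure area", and "caption area", and the item contents such as "body text content", "section title content", and "caption content" is described below.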
【００２４】
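The item extraction part 20 first extracts sets of black pixels by examining the connectivity of the black pixels in the paper image. If the height and width of the circumscribed rectangle of an extracted black pixel set are larger than a given threshold, the set is judged to be a component whose component kind is "figure", and its circumscribed rectangle is extracted as a "figure area". Because a "figure" component can also be regarded as an item, that is, a unit forming the content of the document, a component of kind "figure" is judged to be an item of kind "figure", and the "figure area", which is the component area, is used as the item area.
If, on the other hand, the height and width of the circumscribed rectangle are smaller than the threshold, the black pixel set is judged to be a component of kind "character", and its circumscribed rectangle is extracted as a "character area". When "character" components are lined up horizontally and the horizontal distance between each two adjacent "character" components is smaller than a given threshold, the set of horizontally aligned "character" components is judged to be a component of kind "character string", and the circumscribed rectangle of that set is extracted as a "character string area".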
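The black-pixel grouping and size test described above can be sketched as follows. This is a minimal sketch under stated assumptions: the image is a list of rows with 1 for black and 0 for white, pixels are treated as 8-connected, and a box counts as a "figure" only when both sides exceed the threshold. The specification fixes none of these details, and the names `connected_components`, `classify_component`, and `max_char_size` are hypothetical.

```python
from collections import deque


def connected_components(grid):
    """Return bounding boxes (x0, y0, x1, y1) of 8-connected black-pixel sets.

    `grid` is a list of rows; 1 means a black pixel, 0 a white one.
    """
    h = len(grid)
    w = len(grid[0]) if grid else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if grid[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search over one connected set of black pixels
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def classify_component(box, max_char_size):
    """Label a box "figure" if both sides exceed the threshold, else "character"."""
    x0, y0, x1, y1 = box
    width, height = x1 - x0 + 1, y1 - y0 + 1
    if width > max_char_size and height > max_char_size:
        return "figure"
    return "character"


# Illustrative 4x6 image: a 4x4 block and one isolated pixel
grid = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
]
boxes = connected_components(grid)
kinds = [classify_component(b, max_char_size=2) for b in boxes]
```

How a box that is wide but short should be treated is not stated in the text; the sketch simply labels it "character".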
【００２５】
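Further, when "character string" components are lined up vertically and the vertical distance between each two adjacent "character string" components is smaller than a given threshold, the set of vertically aligned "character string" components is judged to be a component of kind "text", and the circumscribed rectangle of that set is extracted as a "text area".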
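The horizontal grouping of characters into character strings and the vertical grouping of character strings into text can share one routine, sketched below. It assumes, for simplicity, that the boxes passed in already belong to a single line (horizontal case) or a single column (vertical case); the function names and the gap values are hypothetical.

```python
def union(a, b):
    """Circumscribed rectangle of two boxes given as (x0, y0, x1, y1)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_run(boxes, axis, max_gap):
    """Merge neighbouring boxes along `axis` (0 = horizontal, 1 = vertical).

    Two neighbours are merged when the distance from the end of the previous
    box to the start of the next one is smaller than `max_gap`.
    """
    merged = []
    for box in sorted(boxes, key=lambda b: b[axis]):
        if merged and box[axis] - merged[-1][axis + 2] < max_gap:
            merged[-1] = union(merged[-1], box)
        else:
            merged.append(box)
    return merged


# Illustrative character boxes on two lines
chars_line1 = [(0, 0, 8, 10), (10, 0, 18, 10), (20, 0, 28, 10)]
chars_line2 = [(0, 14, 8, 24), (10, 14, 18, 24)]

# Characters -> character strings (horizontal), then strings -> text (vertical)
strings = merge_run(chars_line1, 0, 5) + merge_run(chars_line2, 0, 5)
text_blocks = merge_run(strings, 1, 6)
```

A fuller implementation would also check that neighbouring boxes overlap on the other axis before merging them.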
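Next, character recognition is performed on the set of "character" components that make up the "text" component. If the recognition result contains punctuation marks, the "text" component is judged to be an item of kind "body text"; the "text area", which is the component area, becomes the "body text area", and the recognition result becomes the "body text content".
If, on the other hand, the recognition result begins with the sequence "digit", ".", "digit", ".", the "text" component is judged to be an item of kind "section title"; the "text area" becomes the "section title area", and the recognition result becomes the "section title content". It is assumed that the item extraction part 20 holds in advance pattern data stating that, for a paper, text beginning with "digit", ".", "digit", "." is a "section title" item.
Likewise, if the first character of the recognition result is "図" ("Fig."), the "text" component is judged to be an item of kind "caption"; the "text area" becomes the "caption area", and the recognition result becomes the "caption content". In this case too, the item extraction part 20 holds such pattern data in advance.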
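The three pattern rules above can be sketched with regular expressions. The order in which the rules are tried is an assumption: the specification does not state a precedence, and the sketch checks the section-title and caption patterns first because a title such as "2.3." would otherwise also satisfy the punctuation test. The function name `item_kind` and the sample strings are hypothetical.

```python
import re
import unicodedata

SECTION_TITLE = re.compile(r"^\d+\.\d+\.")  # "digit", ".", "digit", "."
PUNCTUATION = re.compile(r"[、。,.]")        # Japanese and ASCII punctuation


def item_kind(ocr_text):
    """Classify a recognised text block as section title, caption, or body text."""
    # NFKC folds full-width digits and periods (e.g. "２．３．") to ASCII
    text = unicodedata.normalize("NFKC", ocr_text).strip()
    if SECTION_TITLE.match(text):
        return "section title"
    if text.startswith("図"):
        return "caption"
    if PUNCTUATION.search(text):
        return "body text"
    return None  # no pattern applies; the text leaves this case open


kinds = [
    item_kind("２．３．文字認識"),
    item_kind("図１ システム構成"),
    item_kind("本装置は、文書画像を入力する。"),
    item_kind("概要"),
]
```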
【００２６】
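[Operation of the importance imparting part 30]
The importance imparting part 30 acquires the importance imparting target information that the item extraction part 20 stored in the importance imparting target information storage part 200, and also acquires the importance imparting references stored in advance in the importance imparting reference holding part 300.
FIG. 7 is an explanatory diagram showing an example of the importance imparting references.
Based on the correspondence between item kind, item content, and importance described in the importance imparting references, the importance imparting part 30 then creates importance imparting target information (including importance) describing the item kind, item area, and importance, and stores it in the importance imparting target information storage part 200 (step S30).
FIG. 8 is an explanatory diagram showing an example of the importance imparting target information (including importance).
For example, suppose the importance imparting target information stored by the item extraction part 20 is that shown in FIG. 6 and the importance imparting references stored in advance are those shown in FIG. 7. The importance imparting part 30 then uses the correspondence described in FIG. 7 between the item kinds such as "body text", "section title", "figure", and "caption", the item contents such as "contains the character string 'OCR'" and "not specified", and the importances such as "high", "medium", "low", "high", and "high" to create the importance imparting target information (including importance) shown in FIG. 8, which describes the item kind, item area, and importance, and stores it in the importance imparting target information storage part 200.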
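By way of illustration, the comparison against the importance imparting references might look like the sketch below. FIG. 7 itself is not reproduced in the text, so the five rows are an inferred reading of the listed kinds, contents, and importances, and the first-match-wins evaluation is likewise an assumption. The names `Rule`, `RULES`, and `impart_importance` are hypothetical.

```python
from typing import Callable, NamedTuple, Optional


class Rule(NamedTuple):
    kind: str
    content_test: Optional[Callable[[str], bool]]  # None means "not specified"
    importance: str


# Hypothetical table modelled on the description of FIG. 7
RULES = [
    Rule("body text", lambda text: "OCR" in text, "high"),
    Rule("body text", None, "medium"),
    Rule("section title", None, "low"),
    Rule("figure", None, "high"),
    Rule("caption", None, "high"),
]


def impart_importance(kind, content, rules=RULES):
    """Return the importance of the first rule matching the item, or None."""
    for rule in rules:
        if rule.kind != kind:
            continue
        if rule.content_test is None:
            return rule.importance
        if content is not None and rule.content_test(content):
            return rule.importance
    return None


results = [
    impart_importance("body text", "An OCR engine reads the page."),
    impart_importance("body text", "Layout analysis comes first."),
    impart_importance("figure", None),
    impart_importance("table", None),
]
```

Placing the keyword rule before the "not specified" rule for the same kind is what lets a body text containing "OCR" receive a higher importance than other body text.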
【００２７】
［出力部４０の動作］
出力部４０は、入力部１０により文書画像格納部１００に格納された文書画像を取得し、更に、重要度付与部３０により重要度付与対象情報格納部２００に格納された重要度付与対象情報（重要度含む）を取得する。そして、取得した文書画像および重要度付与対象情報（重要度含む）を画像処理装置１に接続された記憶装置３に格納する（ステップＳ４０）。
【００２８】
以上の動作の説明において、項目種別に「本文」「節タイトル」「図」「キャプション」等の単語を用いたが、文書の内容を形成する単位であることが容易に理解できる単語であれば、いずれの単語を用いても良い。
重要度付与対象情報の項目領域に項目の外接矩形の「（左上Ｘ座標，左上Ｙ座標，右下Ｘ座標，右下Ｙ座標）」を用いたが、項目の外接矩形の座標値であることが容易に理解できる形式であれば、いずれの形式を用いても良い。
「本文」「節タイトル」「図」「キャプション」等の項目種別、「本文領域」「節タイトル領域」「図領域」「キャプション領域」等の項目領域および「本文内容」「節タイトル内容」「キャプション内容」等の項目内容の抽出に上記の方法を用いたが、「本文」「節タイトル」「図」「キャプション」等の項目種別、「本文領域」「節タイトル領域」「図領域」「キャプション領域」等の項目領域および「本文内容」「節タイトル内容」「キャプション内容」等の項目内容が抽出できる方法であれば、他の方法を用いても良い。
重要度付与基準の項目内容に「文字列「ＯＣＲ」を含む」等の指定事項を用いたが、項目内容を特定する事柄であることが容易に理解できる指定事項であれば、どのような指定事項を用いても良い。
重要度付与基準の項目内容に「指定無し」等の単語を用いたが、項目内容を特定していないことが容易に理解できる単語であれば、どのような単語を用いても良い。また、全ての項目内容に「指定無し」を用いても良い。
重要度付与基準の重要度に「高」「中」「低」等の単語を用いたが、重要度の段階、重要度の高低が容易に理解できる単語であれば、どのような単語を用いても良い。また、重要度の段階数は２種類（重要または重要でない）でも良い。
【００２９】
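<Effect>
As described above, Specific Example 1 extracts items from a document image and imparts an importance to each item using importance imparting references registered in advance, which yields the following effects.
Because an importance is automatically imparted to each item according to the content of the item, which is a unit forming the content of the document, when a document is made public it is possible, for example, to hide the relatively important "body text", "figure", and "caption" items from one viewer, while hiding from another viewer only the highly important "figure" and "caption" items plus those "body text" items that contain an important keyword.
In other words, the disclosure or non-disclosure of items according to importance can be adjusted for each item content. This means that even a user who does not know the document's content in depth can set importances that reflect, to some extent, how important the content is; for example, a body text containing an important keyword may be important as a whole and therefore needs to be hidden.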
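The per-viewer adjustment described above amounts to comparing each item's importance with a threshold chosen for the viewer. The sketch below assumes a three-level ordering and a per-viewer setting `hide_from`; both, together with the sample items, are hypothetical and are not defined in the specification.

```python
LEVELS = {"low": 0, "medium": 1, "high": 2}


def visible_items(items, hide_from):
    """Return the items a viewer may see; items at or above `hide_from` are withheld."""
    limit = LEVELS[hide_from]
    return [item for item in items if LEVELS[item["importance"]] < limit]


# Illustrative items with importances already imparted
items = [
    {"kind": "section title", "importance": "low"},
    {"kind": "body text", "importance": "medium"},
    {"kind": "body text", "importance": "high"},  # contains an important keyword
    {"kind": "figure", "importance": "high"},
]

strict_view = visible_items(items, "medium")  # hides everything relatively important
relaxed_view = visible_items(items, "high")   # hides only the highly important items
```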
【００３０】
《具体例２》
具体例２は、項目間の関係を示す論理構造毎に重要度を付与するようにしたものである。
【００３１】
〈構成〉
図９は、具体例２の構成図である。
画像処理装置１ａは、入力装置２および記憶装置３に接続されており、入力部１０、論理構造抽出部２１、重要度付与部３１、出力部４０、文書画像格納部１００、重要度付与対象情報格納部２００、重要度付与基準保持部３００で構成される。
ここで、入力部１０、出力部４０、文書画像格納部１００、重要度付与対象情報格納部２００、重要度付与基準保持部３００は、具体例１と同様であるため、ここでの説明は省略する。但し、具体例２の重要度付与基準保持部３００の重要度付与基準は論理構造種別と論理構造内容とに対する重要度のデータとなっている。
論理構造抽出部２１は、文書画像格納部１００に格納された所定の形式の文書に対して、その文書画像から、文書の形式に基づいて文書の内容を形成する単位となる項目領域を抽出すると共に、項目領域が文書中で他の項目領域とどのような関係にあるかを示す論理構造を求め、かつ論理構造がどのような種別であるかを示す論理構造種別と、論理構造の内容を示す論理構造内容とを抽出する機能を有している。重要度付与部３１は、論理構造抽出部２１で抽出された論理構造種別と論理構造内容を、重要度付与基準保持部３００の重要度付与基準と比較して、各論理構造毎に重要度を付与する機能を有している。
【００３２】
〈動作〉
図１０は、具体例２の動作を示すフローチャートである。
尚、具体例２においても、図３の文書画像の一例、図４の構成要素の一例を用いて説明する。尚、具体例１と同じ動作をする入力部１０、出力部４０の説明は省略する。即ち、図１０において、ステップＳ１０とステップＳ４０は、具体例１と同様であるため、ここでの説明は省略する。
【００３３】
［論理構造抽出部２１の動作］
論理構造抽出部２１は、入力部１０により文書画像格納部１００に格納された文書画像を取得し、取得した文書画像中の全て、あるいは一部の論理構造について、論理構造種別、論理構造領域（論理構造を形成する項目の外接矩形の座標値集合）および論理構造内容（論理構造を形成する項目の文字認識結果集合）を抽出する。そして、抽出した論理構造種別、論理構造領域および論理構造内容を記述した重要度付与対象情報を作成し、作成した重要度付与対象情報を重要度付与対象情報格納部２００に格納する（ステップＳ２１）。
尚、論理構造とは、文書の内容に基づく項目の構造のことである。即ち、複数の項目があった場合に文書中で各項目がどのような関係にあるかといったことを示す情報である。
【００３４】
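FIG. 11 is an explanatory diagram showing an example of the logical structure.
For example, when the document image stored in the document image storage part 100 by the input part 10 is the paper image shown in FIG. 3, the logical structure extraction part 21 extracts, for all or some of the logical structures in the paper image as shown in FIG. 11, the logical structure kinds such as "2.2. section body text" and "2.3. section body text", the logical structure areas such as the "2.2. section body text area" and the "2.3. section body text area" (the sets of circumscribed-rectangle coordinates of the items forming each logical structure), and the logical structure contents such as the "2.2. section body text content" and the "2.3. section body text content" (the sets of character recognition results of the items forming each logical structure). However, logical structure content is extracted only for logical structures formed from items whose item kind is determined from components of kind "text" shown in FIG. 4, that is, only for logical structures of kinds such as "2.2. section body text" and "2.3. section body text". The part then creates importance imparting target information describing the extracted logical structure kinds, logical structure areas, and logical structure contents, and stores it in the importance imparting target information storage part 200.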
【００３５】
図１２は、重要度付与対象情報の一例を示す説明図である。
図１２に示すように、論理構造領域は、論理構造を形成する項目の外接矩形の「（左上Ｘ座標，左上Ｙ座標，右下Ｘ座標，右下Ｙ座標）」の集合で示す。また、左段組の上から下への順番で論理構造を形成する項目の外接矩形の座標値を並べる。更に、同様の順番で論理構造を形成する項目の文字認識結果を並べる。
以下、「２．２．節本文」「２．３．節本文」等の論理構造種別、「２．２．節本文領域」「２．３．節本文領域」等の論理構造領域および「２．２．節本文内容」「２．３．節本文内容」等の論理構造内容の抽出について説明する。
【００３６】
論理構造抽出部２１は、先ず、論文画像中の黒画素の連結性を調べることで黒画素集合を抽出する。抽出した黒画素集合の外接矩形について、縦横の長さが任意の値より大きい場合、抽出した黒画素集合は構成要素種別が「図」の構成要素であると判定し、抽出した黒画素集合の外接矩形を「図領域」として抽出する。「図」の構成要素は、文書の内容を形成する単位である項目と捉えることもできるため、構成要素種別が「図」の構成要素は、項目種別が「図」の項目であると判定し、構成要素領域である「図領域」を項目領域とする。
一方、抽出した黒画素集合の外接矩形について、縦横の長さが任意の値より小さい場合、抽出した黒画素集合は構成要素種別が「文字」の構成要素であると判定し、抽出した黒画素集合の外接矩形を「文字領域」として抽出する。そして、「文字」の構成要素が水平方向に並んでおり、隣接する二つの「文字」の構成要素の水平方向の距離が任意の値より小さい場合、水平方向に並んでいる「文字」の構成要素集合は構成要素種別が「文字列」の構成要素であると判定し、「文字」の構成要素集合の外接矩形を「文字列領域」として抽出する。
【００３７】
更に、「文字列」の構成要素が垂直方向に並んでおり、隣接する二つの「文字列」の構成要素の垂直方向の距離が任意の値より小さい場合、垂直方向に並んでいる「文字列」の構成要素集合は構成要素種別が「文章」の構成要素であると判定し、「文字列」の構成要素集合の外接矩形を「文章領域」として抽出する。続いて、「文章」の構成要素である「文字」の構成要素集合に対して文字認識処理を行い、得られた文字認識結果について、句読点が含まれる場合、「文章」の構成要素は項目種別が「本文」の項目であると判定し、構成要素領域である「文章領域」を「本文領域」として、文字認識結果を「本文内容」とする。
【００３８】
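If, on the other hand, the recognition result begins with the sequence "digit", ".", "digit", ".", the "text" component is judged to be an item of kind "section title"; the "text area", which is the component area, becomes the "section title area", and the recognition result becomes the "section title content". If the first character of the recognition result is "図" ("Fig."), the "text" component is judged to be an item of kind "caption"; the "text area" becomes the "caption area", and the recognition result becomes the "caption content". Here, if the "section title content" begins with the character string "2.3.", the "section title" item is judged to be a logical structure whose logical structure kind is "2.3. section title"; the "section title area", which is the item area, becomes the "2.3. section title area", and the "section title content", which is the item content, becomes the "2.3. section title content". A paper generally has a multi-column layout, and the item order runs from left to right between columns and from top to bottom within a column. The set of "body text" items that precede the "2.3. section title" logical structure is therefore judged to be a logical structure of kind "2.2. section body text", and the set of "body text" items that follow it is judged to be a logical structure of kind "2.3. section body text". The sets of circumscribed-rectangle coordinates of the items forming each logical structure become the logical structure areas, such as the "2.2. section body text area" and the "2.3. section body text area", and the sets of character recognition results of those items become the logical structure contents, such as the "2.2. section body text content" and the "2.3. section body text content".
In this way, Specific Example 2 takes as the document of a prescribed format a paper whose sections are numbered in order, and extracts the logical structure by focusing on those numbers.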
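The grouping of body-text items around numbered section titles can be sketched as below. The sketch assumes the items arrive already sorted in reading order (columns left to right, top to bottom within a column) and that the section open at the start of the page is supplied by the caller as `first_section`; how that preceding number is obtained is not detailed in the text. The function name and sample items are hypothetical.

```python
import re

SECTION_NUMBER = re.compile(r"^(\d+\.\d+\.)")


def group_by_section(items, first_section):
    """Group body-text items under the section title that precedes them.

    `items` are (kind, area, content) tuples in reading order.
    Returns a dict keyed by logical structure kind, in order of appearance.
    """
    structures = {}
    current = first_section
    for kind, area, content in items:
        if kind == "section title":
            match = SECTION_NUMBER.match(content)
            if match:
                current = match.group(1)
                structures[f"{current} section title"] = {
                    "areas": [area], "contents": [content]}
        elif kind == "body text":
            entry = structures.setdefault(
                f"{current} section body text", {"areas": [], "contents": []})
            entry["areas"].append(area)
            entry["contents"].append(content)
    return structures


# Illustrative items in reading order
items = [
    ("body text", (0, 0, 10, 10), "End of the previous section."),
    ("section title", (0, 12, 10, 14), "2.3. Recognition"),
    ("body text", (0, 16, 10, 30), "First paragraph."),
    ("body text", (12, 0, 22, 30), "Second paragraph, next column."),
]
structures = group_by_section(items, first_section="2.2.")
```

Each resulting entry holds a set of rectangles and a set of recognition results, matching the description of logical structure areas and contents.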
【００３９】
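[Operation of the importance imparting part 31]
The importance imparting part 31 acquires the importance imparting target information that the logical structure extraction part 21 stored in the importance imparting target information storage part 200, and also acquires the importance imparting references stored in advance in the importance imparting reference holding part 300.
FIG. 13 is an explanatory diagram showing an example of the importance imparting references.
Based on the correspondence between logical structure kind, logical structure content, and importance described in the importance imparting references, the importance imparting part 31 then creates importance imparting target information (including importance) describing the logical structure kind, logical structure area, and importance, and stores it in the importance imparting target information storage part 200 (step S31).
FIG. 14 is an explanatory diagram showing an example of the importance imparting target information (including importance).
For example, suppose the importance imparting target information stored by the logical structure extraction part 21 is that shown in FIG. 12 and the importance imparting references stored in advance are those shown in FIG. 13. The importance imparting part 31 then uses the correspondence described in FIG. 13 between the logical structure kinds such as "○○ section body text", the logical structure contents such as "contains the character string 'OCR'" and "not specified", and the importances such as "high" and "medium" to create the importance imparting target information (including importance) shown in FIG. 14, which describes the logical structure kind, logical structure area, and importance, and stores it in the importance imparting target information storage part 200. Here, "○○ section" in "○○ section body text" stands for an arbitrary section number.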
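A possible reading of the "○○ section body text" criterion is sketched below, with the placeholder section number modelled as a regular expression. The assignment of "high" to structures containing "OCR" and "medium" to the rest is inferred from the listed values rather than taken from FIG. 13, and the function name is hypothetical.

```python
import re

# "○○" in the criteria stands for any section number
BODY_OF_ANY_SECTION = re.compile(r"^\d+\.\d+\. section body text$")


def structure_importance(kind, contents):
    """Impart one importance to a whole logical structure.

    `contents` is the set of recognition results of the items forming it.
    """
    if not BODY_OF_ANY_SECTION.match(kind):
        return None  # no criterion for this structure kind in this sketch
    if any("OCR" in text for text in contents):
        return "high"
    return "medium"


a = structure_importance("2.3. section body text",
                         ["Layout comes first.", "The OCR step follows."])
b = structure_importance("2.2. section body text", ["Layout comes first."])
c = structure_importance("2.3. section title", ["2.3. OCR"])
```

Because the keyword test runs over every item in the structure, a single matching paragraph raises the importance of the entire section body.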
【００４０】
以上の動作の説明において、論理構造種別に「２．２．節本文」「２．３．節本文」等の単語を用いたが、文書の内容に基づく項目の構造であることが容易に理解できる単語であれば、どのような単語を用いても良い。
重要度付与対象情報の論理構造領域に論理構造を形成する項目の外接矩形の「（左上Ｘ座標，左上Ｙ座標，右下Ｘ座標，右下Ｙ座標）」の集合を用いたが、論理構造を形成する項目の外接矩形の座標値集合であることが容易に理解できる形式であれば、どのような形式を用いても良い。
「２．２．節本文」「２．３．節本文」等の論理構造種別、「２．２．節本文領域」「２．３．節本文領域」等の論理構造領域および「２．２．節本文内容」「２．３．節本文内容」等の論理構造内容の抽出に上記の方法を用いたが、「２．２．節本文」「２．３．節本文」等の論理構造種別、「２．２．節本文領域」「２．３．節本文領域」等の論理構造領域および「２．２．節本文内容」「２．３．節本文内容」等の論理構造内容が抽出できる方法であれば、どのような方法を用いても良い。
重要度付与基準の論理構造内容に「文字列「ＯＣＲ」を含む」等の指定事項を用いたが、論理構造内容を特定する事柄であることが容易に理解できる指定事項であれば、どのような指定事項を用いても良い。
重要度付与基準の論理構造内容に「指定無し」等の単語を用いたが、論理構造内容を特定していないことが容易に理解できる単語であれば、どのような単語を用いても良い。また、全ての論理構造内容に「指定無し」を用いても良い。
重要度付与基準の重要度に「高」「中」等の単語を用いたが、重要度の段階、重要度の高低が容易に理解できる単語であれば、どのような単語を用いても良い。また、重要度の段階数は重要と重要でないという２種類であっても良い。
【００４１】
〈効果〉
以上のように具体例２によれば、文書画像から、文書中で各項目がどのような関係にあるかを示す論理構造を抽出し、この論理構造に対して、予め登録した重要度付与基準を用いて、重要度を付与するようにしたので次のような効果がある。
即ち、文書の内容に基づく項目の構造である論理構造の内容に応じて、論理構造に自動的に重要度が付与されるため、文書を公開するような場合、ある閲覧者には重要度が比較的高い「２．２．節本文」「２．３．節本文」を非公開にしたり、別の閲覧者には重要度が高い「２．３．節本文」のみを非公開にしたりすることができる。即ち、重要度に応じた論理構造の公開／非公開を調整することができる。これは、文書の内容を深く知らない利用者が重要度を設定する場合においても、例えば、重要なキーワードを有する本文は、本文自体が重要である可能性があるため、非公開にする必要がある等、ある程度文書の内容の重要性に応じた重要度の設定ができることを示す。しかも、「２．３．節本文」の論理構造は「本文」の項目集合であるため、具体例１が「２．３．節本文」を構成する「本文」の項目のうち、重要なキーワードを有する「本文」のみに高い重要度を設定する可能性があるのに対して、具体例２は、本文全体に的確な重要度を設定できる。また、重要度付与基準の論理構造内容に、例えば「○○節タイトルの内容に文字列「ＯＣＲ」を含む」等の指定事項を用いることで、「○○節タイトル」の論理構造内容のみを探索して、該当する「○○節本文」を効率良く導き出すこともできる。
【００４２】
《具体例３》
具体例３は、文書中の画像情報から文書種別を求め、この文書種別を考慮して重要度を付与するようにしたものである。
【００４３】
〈構成〉
図１５は、具体例３の構成図である。
図示の画像処理装置１ｂは、入力装置２および記憶装置３に接続されており、入力部１０、文書識別部５２、項目抽出部２２、重要度付与部３２、出力部４０、文書画像格納部１００、重要度付与対象情報格納部２００、重要度付与基準保持部３００を備えている。
ここで、入力部１０、出力部４０、文書画像格納部１００、重要度付与対象情報格納部２００、重要度付与基準保持部３００は、具体例１、２と同様であるため、ここでの説明は省略する。但し、具体例３の重要度付与基準保持部３００の重要度付与基準は、項目種別と項目内容に加えて文書種別も含めた重要度のデータとなっている。
【００４４】
文書識別部５２は、文書画像格納部１００に格納された所定の形式の文書画像から、この文書の形式に基づいてどのような文書であるかを示す文書種別を識別する機能部である。項目抽出部２２は、文書識別部５２の文書識別結果に基づいて、具体例１の項目抽出部２０と同様の処理を行う機能を有している。また、重要度付与部３２は、文書識別部５２で識別された文書種別と、項目抽出部２２で抽出された項目種別と項目内容とを、重要度付与基準保持部３００の重要度付与基準と比較して、各項目毎に重要度を付与する機能部である。
【００４５】
〈動作〉
図１６は、具体例３の動作を示すフローチャートである。
尚、具体例１、２と同じ動作をする入力部１０、出力部４０の説明は省略する。即ち、図１６において、ステップＳ１０とステップＳ４０の説明は省略する。
【００４６】
［文書識別部５２の動作］
文書識別部５２は、入力部１０により文書画像格納部１００に格納された文書画像を取得し、取得した文書画像の文書種別を識別する。そして、識別した文書種別を記述した重要度付与対象情報（文書種別のみ）を作成し、作成した重要度付与対象情報（文書種別のみ）を重要度付与対象情報格納部２００に格納する（ステップＳ５２）。
図１７は、具体例３における文書画像の一例を示す説明図である。
例えば、入力部１０により文書画像格納部１００に格納された文書画像がこのような図１７に示す論文画像の場合（これは図３に示す論文画像の右上端部に「論文Ａ」の文書種別が記載された文書画像である）、文書識別部５２は、論文画像の右上端部に記載された「論文Ａ」の文書種別を抽出し、論文画像の文書種別を識別する。そして、識別した文書種別を記述した重要度付与対象情報（文書種別のみ）を作成し、作成した重要度付与対象情報（文書種別のみ）を重要度付与対象情報格納部２００に格納する。
【００４７】
図１８は、重要度付与対象情報（文書種別のみ）の一例を示す説明図である。
以下、文書種別の抽出について説明する。
文書識別部５２は、先ず、論文画像中の右上端部付近の黒画素の連結性を調べることで黒画素集合を抽出する。抽出した黒画素集合の外接矩形について、縦横の長さが任意の値より大きい場合、抽出した黒画素集合は構成要素種別が「枠」の構成要素であると判定し、抽出した黒画素集合の外接矩形を「枠領域」として抽出する。一方、抽出した黒画素集合の外接矩形について、縦横の長さが任意の値より小さい場合、抽出した黒画素集合は構成要素種別が「文字」の構成要素であると判定し、抽出した黒画素集合の外接矩形を「文字領域」として抽出する。そして、「文字」の構成要素が水平方向に並んでおり、隣接する二つの「文字」の構成要素の水平方向の距離が任意の値より小さい場合、水平方向に並んでいる「文字」の構成要素集合は構成要素種別が「文字列」の構成要素であると判定し、「文字」の構成要素集合の外接矩形を「文字列領域」として抽出する。更に、「文字列」の構成要素である「文字」の構成要素集合に対して文字認識処理を行い、得られた文字認識結果（「論文Ａ」）を文書種別として抽出する。
【００４８】
［項目抽出部２２の動作］
項目抽出部２２は、入力部１０により文書画像格納部１００に格納された文書画像を取得し、取得した文書画像中の全て、あるいは一部の項目について、項目種別、項目領域（項目の外接矩形の座標値）および項目内容（項目の文字認識結果）を抽出する。項目とは、文書の内容を形成する単位のことである。更に、項目抽出部２２は、文書識別部５２により重要度付与対象情報格納部２００に格納された重要度付与対象情報（文書種別のみ）を取得し、文書種別および抽出した項目種別、項目領域および項目内容を記述した重要度付与対象情報を作成し、作成した重要度付与対象情報を重要度付与対象情報格納部２００に格納する（ステップＳ２２）。
例えば、入力部１０により文書画像格納部１００に格納された文書画像が図１７に示す論文画像の場合、項目抽出部２２は、図５に示す論文画像中の全て、あるいは一部の項目について、「本文」「節タイトル」「図」「キャプション」等の項目種別、「本文領域」「節タイトル領域」「図領域」「キャプション領域」等の項目領域（項目の外接矩形の座標値）および「本文内容」「節タイトル内容」「キャプション内容」等の項目内容（項目の文字認識結果）を抽出する。但し、項目内容を抽出するのは、図４に示す構成要素種別が「文章」の構成要素から項目種別が判定される項目のみである。即ち、項目種別が「本文」「節タイトル」「キャプション」等の項目のみである。更に、項目抽出部２２は、文書識別部５２により重要度付与対象情報格納部２００に格納された図１８に示す重要度付与対象情報（文書種別のみ）を取得し、「論文Ａ」の文書種別および抽出した「本文」「節タイトル」「図」「キャプション」等の項目種別、「本文領域」「節タイトル領域」「図領域」「キャプション領域」等の項目領域および「本文内容」「節タイトル内容」「キャプション内容」等の項目内容を記述した重要度付与対象情報を作成し、作成した重要度付与対象情報を重要度付与対象情報格納部２００に格納する。
図１９は、具体例３における重要度付与対象情報の一例を示す説明図である。図１９では、項目領域を項目の外接矩形の「（左上Ｘ座標，左上Ｙ座標，右下Ｘ座標，右下Ｙ座標）」で示す。また、左段組の上から下への順番で項目種別、項目領域および項目内容を並べる。
尚、「本文」「節タイトル」「図」「キャプション」等の項目種別、「本文領域」「節タイトル領域」「図領域」「キャプション領域」等の項目領域および「本文内容」「節タイトル内容」「キャプション内容」等の項目内容の抽出については、具体例１と同様である。
【００４９】
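[Operation of the importance imparting part 32]
The importance imparting part 32 acquires the importance imparting target information that the item extraction part 22 stored in the importance imparting target information storage part 200, and also acquires the importance imparting references stored in advance in the importance imparting reference holding part 300.
FIG. 20 is an explanatory diagram showing an example of the importance imparting references.
Based on the correspondence between document kind, item kind, item content, and importance described in the importance imparting references, the importance imparting part 32 then creates importance imparting target information (including importance) describing the document kind, item kind, item area, and importance, and stores it in the importance imparting target information storage part 200 (step S32).
FIG. 21 is an explanatory diagram showing an example of the importance imparting target information (including importance).
For example, suppose the importance imparting target information stored by the item extraction part 22 is that shown in FIG. 19 and the importance imparting references stored in advance are those shown in FIG. 20. The importance imparting part 32 then uses the correspondence described in FIG. 20 between the document kinds such as "Paper A" and "not specified", the item kinds such as "body text", "section title", "figure", and "caption", the item contents such as "contains the character string 'OCR'" and "not specified", and the importances such as "high", "low", "medium", "low", "high", and "high" to create the importance imparting target information (including importance) shown in FIG. 21, which describes the document kind, item kind, item area, and importance, and stores it in the importance imparting target information storage part 200.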
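When the document kind is added as a further key, "not specified" behaves like a wildcard. The sketch below shows one way to evaluate such references. The six rows are a guess at how the listed importances pair with the listed kinds, since FIG. 20 is not reproduced in the text, and the rule that more specific rows come first is an assumption. `ANY`, `CRITERIA`, and `lookup` are hypothetical names.

```python
ANY = None  # stands for "not specified"

# (document kind, item kind, keyword, importance); hypothetical rows
CRITERIA = [
    ("Paper A", "body text", "OCR", "high"),
    ("Paper A", "body text", ANY, "low"),
    (ANY, "body text", ANY, "medium"),
    (ANY, "section title", ANY, "low"),
    (ANY, "figure", ANY, "high"),
    (ANY, "caption", ANY, "high"),
]


def lookup(doc_kind, item_kind, content, criteria=CRITERIA):
    """Return the importance of the first row matching all specified fields."""
    for rule_doc, rule_item, keyword, importance in criteria:
        if rule_doc is not ANY and rule_doc != doc_kind:
            continue
        if rule_item != item_kind:
            continue
        if keyword is not ANY and (content is None or keyword not in content):
            continue
        return importance
    return None


results = [
    lookup("Paper A", "body text", "This part uses OCR."),
    lookup("Paper A", "body text", "Layout analysis."),
    lookup("Paper B", "body text", "This part uses OCR."),
    lookup("Paper B", "figure", None),
]
```

The same structure extends to a use kind column by adding one more optional field to each row.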
【００５０】
以上の動作の説明において、文書種別に「論文Ａ」等の単語を用いたが、文書種別であることが容易に理解できる単語であれば、どのような単語を用いても良い。また、文書種別の抽出に上記の方法を用いたが、文書種別が抽出できる方法であれば、どのような方法を用いても良い。
更に、重要度付与基準の文書種別に「指定無し」等の単語を用いたが、文書種別を特定していないことが容易に理解できる単語であれば、どのような単語を用いても良い。また、全ての文書種別に「指定無し」を用いても良い。
その他の利用形態については、具体例１と同様である。
【００５１】
〈効果〉
以上のように具体例３によれば、具体例１の構成に加えて、文書の形式に基づいて文書種別を抽出し、この文書種別も考慮して各項目の重要度を付与するようにしたので、次のような効果がある。
即ち、文書種別および文書の内容を形成する単位である項目の内容に応じて、項目に自動的に重要度が付与されるため、文書を公開するような場合、ある文書では重要度が比較的高い「本文」「図」「キャプション」を非公開にしたり、別の文書では重要度が高い「図」「キャプション」に加え、重要なキーワードを有する「本文」のみを非公開にしたりすることができる（ある文書で非公開にした「本文」を、別の文書で公開にできる）。即ち、重要度に応じた項目の公開／非公開を文書種別および項目内容毎に調整することができる。具体例１に比べ、文書種別毎に調整することができる分、きめ細かな文書の公開を実現することができる。
【００５２】
《具体例４》
具体例４は、文書中の画像情報からその文書がどのような用途で使用されるかを示す用途種別を求め、この用途種別を考慮して重要度を付与するようにしたものである。
【００５３】
〈構成〉
図２２は、具体例４の構成図である。
図示の画像処理装置１ｃは、入力装置２および記憶装置３に接続されており、入力部１０、用途識別部５３、項目抽出部２３、重要度付与部３３、出力部４０、文書画像格納部１００、重要度付与対象情報格納部２００、重要度付与基準保持部３００で構成されている。
ここで、入力部１０、出力部４０、文書画像格納部１００、重要度付与対象情報格納部２００、重要度付与基準保持部３００は、具体例１〜３と同様であるため、ここでの説明は省略する。但し、具体例４の重要度付与基準保持部３００の重要度付与基準は、項目種別と項目内容に加えて用途種別も含めた重要度のデータとなっている。
【００５４】
用途識別部５３は、文書画像格納部１００に格納された所定の形式の文書画像から、この文書の形式に基づいてどのような用途で使用される文書であるかを示す用途種別を識別する機能部である。項目抽出部２３は、用途識別部５３の用途識別結果に基づいて、具体例１の項目抽出部２０と同様の処理を行う機能を有している。また、重要度付与部３３は、用途識別部５３で識別された用途種別と、項目抽出部２３で抽出された項目種別と項目内容とを、重要度付与基準保持部３００の重要度付与基準と比較して、各項目毎に重要度を付与する機能部である。
【００５５】
〈動作〉
図２３は、具体例４の動作を示すフローチャートである。
尚、具体例１〜３と同じ動作をする入力部１０、出力部４０の説明は省略する。即ち、図２３において、ステップＳ１０とステップＳ４０の説明は省略する。
【００５６】
［用途識別部５３の動作］
用途識別部５３は、入力部１０により文書画像格納部１００に格納された文書画像を取得し、取得した文書画像の用途種別を識別する。そして、識別した用途種別を記述した重要度付与対象情報（用途種別のみ）を作成し、作成した重要度付与対象情報（用途種別のみ）を重要度付与対象情報格納部２００に格納する（ステップＳ５３）。
図２４は、具体例４における文書画像の一例を示す説明図である。
例えば、入力部１０により文書画像格納部１００に格納された文書画像が図２４に示す論文画像の場合（これは図３に示す論文画像の左上端部に「社内公開」の用途種別が記載された文書画像である）、用途識別部５３は、論文画像の左上端部に記載された「社内公開」の用途種別を抽出し、論文画像の用途種別を識別する。そして、識別した用途種別を記述した重要度付与対象情報（用途種別のみ）を作成し、作成した重要度付与対象情報（用途種別のみ）を重要度付与対象情報格納部２００に格納する。
【００５７】
図２５は、重要度付与対象情報（用途種別のみ）の一例を示す説明図である。
以下、用途種別の抽出について説明する。
用途識別部５３は、先ず、論文画像中の左上端部付近の黒画素の連結性を調べることで黒画素集合を抽出する。抽出した黒画素集合の外接矩形について、縦横の長さが任意の値より大きい場合、抽出した黒画素集合は構成要素種別が「枠」の構成要素であると判定し、抽出した黒画素集合の外接矩形を「枠領域」として抽出する。一方、抽出した黒画素集合の外接矩形について、縦横の長さが任意の値より小さい場合、抽出した黒画素集合は構成要素種別が「文字」の構成要素であると判定し、抽出した黒画素集合の外接矩形を「文字領域」として抽出する。そして、「文字」の構成要素が水平方向に並んでおり、隣接する二つの「文字」の構成要素の水平方向の距離が任意の値より小さい場合、水平方向に並んでいる「文字」の構成要素集合は構成要素種別が「文字列」の構成要素であると判定し、「文字」の構成要素集合の外接矩形を「文字列領域」として抽出する。更に、「文字列」の構成要素である「文字」の構成要素集合に対して文字認識処理を行い、得られた文字認識結果（「社内公開」）を用途種別として抽出する。
【００５８】
［項目抽出部２３の動作］
項目抽出部２３は、入力部１０により文書画像格納部１００に格納された文書画像を取得し、取得した文書画像中の全て、あるいは一部の項目について、項目種別、項目領域（項目の外接矩形の座標値）および項目内容（項目の文字認識結果）を抽出する。項目とは、文書の内容を形成する単位のことである。更に、項目抽出部２３は、用途識別部５３により重要度付与対象情報格納部２００に格納された重要度付与対象情報（用途種別のみ）を取得し、用途種別および抽出した項目種別、項目領域および項目内容を記述した重要度付与対象情報を作成し、作成した重要度付与対象情報を重要度付与対象情報格納部２００に格納する（ステップＳ２３）。
例えば、入力部１０により文書画像格納部１００に格納された文書画像が図２４に示す論文画像の場合、項目抽出部２３は、図５に示す論文画像中の全て、あるいは一部の項目について、「本文」「節タイトル」「図」「キャプション」等の項目種別、「本文領域」「節タイトル領域」「図領域」「キャプション領域」等の項目領域（項目の外接矩形の座標値）および「本文内容」「節タイトル内容」「キャプション内容」等の項目内容（項目の文字認識結果）を抽出する。但し、項目内容を抽出するのは、図４に示す構成要素種別が「文章」の構成要素から項目種別が判定される項目のみである。即ち、項目種別が「本文」「節タイトル」「キャプション」等の項目のみである。更に、項目抽出部２３は、用途識別部５３により重要度付与対象情報格納部２００に格納された重要度付与対象情報（用途種別のみ）を取得し、「社内公開」の用途種別および抽出した「本文」「節タイトル」「図」「キャプション」等の項目種別、「本文領域」「節タイトル領域」「図領域」「キャプション領域」等の項目領域および「本文内容」「節タイトル内容」「キャプション内容」等の項目内容を記述した重要度付与対象情報を作成し、作成した重要度付与対象情報を重要度付与対象情報格納部２００に格納する。
【００５９】
図２６は、重要度付与対象情報の一例を示す説明図である。
図２６では、項目領域を項目の外接矩形の「（左上Ｘ座標，左上Ｙ座標，右下Ｘ座標，右下Ｙ座標）」で示す。また、左段組の上から下への順番で項目種別、項目領域および項目内容を並べる。
「本文」「節タイトル」「図」「キャプション」等の項目種別、「本文領域」「節タイトル領域」「図領域」「キャプション領域」等の項目領域および「本文内容」「節タイトル内容」「キャプション内容」等の項目内容の抽出については、具体例１と同様である。
【００６０】
［重要度付与部３３の動作］
重要度付与部３３は、項目抽出部２３により重要度付与対象情報格納部２００に格納された重要度付与対象情報を取得し、更に、重要度付与基準保持部３００に予め格納された重要度付与基準を取得する。
図２７は、重要度付与基準の一例を示す説明図である。
そして、重要度付与部３３は、重要度付与基準に記述された用途種別、項目種別、項目内容と重要度の対応に基づいて、用途種別、項目種別、項目領域、重要度を記述した重要度付与対象情報（重要度含む）を作成し、作成した重要度付与対象情報（重要度含む）を重要度付与対象情報格納部２００に格納する（ステップＳ３３）。
図２８は、重要度付与対象情報（重要度含む）の一例を示す説明図である。
例えば、項目抽出部２３により重要度付与対象情報格納部２００に格納された重要度付与対象情報が図２６に示す重要度付与対象情報、更に、重要度付与基準保持部３００に予め格納された重要度付与基準が図２７に示す重要度付与基準の場合、重要度付与部３３は、図２７に示す重要度付与基準に記述された「社内公開」「指定無し」等の用途種別、「本文」「節タイトル」「図」「キャプション」等の項目種別、「文字列「ＯＣＲ」を含む」「指定無し」等の項目内容と「高」「低」「中」「低」「高」「高」等の重要度の対応に基づいて、用途種別、項目種別、項目領域、重要度を記述した図２８に示す重要度付与対象情報（重要度含む）を作成し、作成した重要度付与対象情報（重要度含む）を重要度付与対象情報格納部２００に格納する。
【００６１】
以上の動作の説明において、用途種別に「社内公開」等の単語を用いたが、用途種別であることが容易に理解できる単語であれば、どのような単語を用いても良い。また、用途種別の抽出に上記の方法を用いたが、用途種別が抽出できる方法であれば、どのような方法を用いても良い。
更に、重要度付与基準の用途種別に「指定無し」等の単語を用いたが、用途種別を特定していないことが容易に理解できる単語であれば、どのような単語を用いても良い。また、全ての用途種別に「指定無し」を用いても良い。
その他の利用形態については、具体例１と同様である。
【００６２】
〈効果〉
以上のように具体例４によれば、具体例１の構成に加えて、文書の形式に基づいて用途種別を抽出し、この用途種別も考慮して各項目の重要度を付与するようにしたので、次のような効果がある。
即ち、用途種別および文書の内容を形成する単位である項目の内容に応じて、項目に自動的に重要度が付与されるため、文書を公開するような場合、ある用途では重要度が比較的高い「本文」「図」「キャプション」を非公開にしたり、別の用途では重要度が高い「図」「キャプション」に加え、重要なキーワードを有する「本文」のみを非公開にしたりすることができる（ある用途で非公開にした「本文」を、別の用途で公開にできる）。即ち、重要度に応じた項目の公開／非公開を用途種別および項目内容毎に調整することができる。具体例１に比べ、用途種別毎に調整することができる分、きめ細かな文書の公開を実現することができる。
【００６３】
《具体例５》
具体例５は、文書中の画像情報からその文書がどのような用途で使用されるかを示す用途種別と、その文書がどのような文書であるかを示す文書種別を求め、これらの用途種別と文書種別とを考慮して重要度を付与するようにしたものである。
【００６４】
〈構成〉
図２９は、具体例５の構成図である。
図示の画像処理装置１ｄは、入力装置２および記憶装置３に接続されており、入力部１０、用途・文書識別部５４、項目抽出部２４、重要度付与部３４、出力部４０、文書画像格納部１００、重要度付与対象情報格納部２００、重要度付与基準保持部３００で構成されている。
ここで、入力部１０、出力部４０、文書画像格納部１００、重要度付与対象情報格納部２００、重要度付与基準保持部３００は、具体例１〜４と同様であるため、ここでの説明は省略する。但し、具体例５の重要度付与基準保持部３００の重要度付与基準は、項目種別と項目内容に加えて文書種別と用途種別も含めた重要度のデータとなっている。
【００６５】
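The use/document identification part 54 is a functional part that identifies, from a document image of a prescribed format stored in the document image storage part 100 and based on the format of the document, a use kind indicating the use for which the document is intended and a document kind indicating what kind of document it is. The item extraction part 24 has the function of performing the same processing as the item extraction part 20 of Specific Example 1, based on the use kind and document kind identified by the use/document identification part 54. The importance imparting part 34 is a functional part that compares the use kind and document kind identified by the use/document identification part 54, together with the item kind and item content extracted by the item extraction part 24, with the importance imparting references of the importance imparting reference holding part 300, and imparts an importance to each item.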
【００６６】
〈動作〉
図３０は、具体例５の動作を示すフローチャートである。
尚、具体例１〜４と同じ動作をする入力部１０、出力部４０の説明は省略する。即ち、図３０において、ステップＳ１０とステップＳ４０の説明は省略する。
【００６７】
［用途・文書識別部５４の動作］
用途・文書識別部５４は、入力部１０により文書画像格納部１００に格納された文書画像を取得し、取得した文書画像の用途種別および文書種別を識別する。そして、識別した用途種別および文書種別を記述した重要度付与対象情報（用途・文書種別のみ）を作成し、作成した重要度付与対象情報（用途・文書種別のみ）を重要度付与対象情報格納部２００に格納する（ステップＳ５４）。
図３１は、具体例５における文書画像の一例を示す説明図である。
例えば、入力部１０により文書画像格納部１００に格納された文書画像が図３１に示す論文画像の場合（これは図３に示す論文画像の左上端部に「社内公開」の用途種別と、右上端部に「論文Ａ」の文書種別とが記載された文書画像である）、用途・文書識別部５４は、論文画像の左上端部に記載された「社内公開」の用途種別および右上端部に記載された「論文Ａ」の文書種別を抽出し、論文画像の用途種別および文書種別を識別する。そして、識別した用途種別および文書種別を記述した重要度付与対象情報（用途・文書種別のみ）を作成し、作成した重要度付与対象情報（用途・文書種別のみ）を重要度付与対象情報格納部２００に格納する。
【００６８】
図３２は、重要度付与対象情報（用途・文書種別のみ）の一例を示す説明図である。
以下、用途種別および文書種別の抽出について説明する。
用途・文書識別部５４は、先ず、論文画像中の左上端部付近の黒画素の連結性を調べることで黒画素集合を抽出する。抽出した黒画素集合の外接矩形について、縦横の長さが任意の値より大きい場合、抽出した黒画素集合は構成要素種別が「枠」の構成要素であると判定し、抽出した黒画素集合の外接矩形を「枠領域」として抽出する。一方、抽出した黒画素集合の外接矩形について、縦横の長さが任意の値より小さい場合、抽出した黒画素集合は構成要素種別が「文字」の構成要素であると判定し、抽出した黒画素集合の外接矩形を「文字領域」として抽出する。そして、「文字」の構成要素が水平方向に並んでおり、隣接する二つの「文字」の構成要素の水平方向の距離が任意の値より小さい場合、水平方向に並んでいる「文字」の構成要素集合は構成要素種別が「文字列」の構成要素であると判定し、「文字」の構成要素集合の外接矩形を「文字列領域」として抽出する。更に、「文字列」の構成要素である「文字」の構成要素集合に対して文字認識処理を行い、得られた文字認識結果（「社内公開」）を用途種別として抽出する。続いて、論文画像中の右上端部付近の黒画素の連結性を調べることで黒画素集合を抽出し、同様の処理を行うことで得られた文字認識結果（「論文Ａ」）を文書種別として抽出する。
【００６９】
［項目抽出部２４の動作］
項目抽出部２４は、入力部１０により文書画像格納部１００に格納された文書画像を取得し、取得した文書画像中の全て、あるいは一部の項目について、項目種別、項目領域（項目の外接矩形の座標値）および項目内容（項目の文字認識結果）を抽出する。項目とは、文書の内容を形成する単位のことである。更に、項目抽出部２４は、用途・文書識別部５４により重要度付与対象情報格納部２００に格納された重要度付与対象情報（用途・文書種別のみ）を取得し、用途種別、文書種別および抽出した項目種別、項目領域および項目内容を記述した重要度付与対象情報を作成し、作成した重要度付与対象情報を重要度付与対象情報格納部２００に格納する（ステップＳ２４）。
例えば、入力部１０により文書画像格納部１００に格納された文書画像が図３１に示す論文画像の場合、項目抽出部２４は、図５に示す論文画像中の全て、あるいは一部の項目について、「本文」「節タイトル」「図」「キャプション」等の項目種別、「本文領域」「節タイトル領域」「図領域」「キャプション領域」等の項目領域（項目の外接矩形の座標値）および「本文内容」「節タイトル内容」「キャプション内容」等の項目内容（項目の文字認識結果）を抽出する。但し、項目内容を抽出するのは、図４に示す構成要素種別が「文章」の構成要素から項目種別が判定される項目のみである。即ち、項目種別が「本文」「節タイトル」「キャプション」等の項目のみである。更に、項目抽出部２４は、用途・文書識別部５４により重要度付与対象情報格納部２００に格納された図３２に示す重要度付与対象情報（用途・文書種別のみ）を取得し、「社内公開」の用途種別、「論文Ａ」の文書種別および抽出した「本文」「節タイトル」「図」「キャプション」等の項目種別、「本文領域」「節タイトル領域」「図領域」「キャプション領域」等の項目領域および「本文内容」「節タイトル内容」「キャプション内容」等の項目内容を記述した重要度付与対象情報を作成し、作成した重要度付与対象情報を重要度付与対象情報格納部２００に格納する。
【００７０】
図３３は、具体例５における重要度付与対象情報の一例を示す説明図である。図３３では、項目領域を項目の外接矩形の「（左上Ｘ座標，左上Ｙ座標，右下Ｘ座標，右下Ｙ座標）」で示す。また、左段組の上から下への順番で項目種別、項目領域および項目内容を並べる。
「本文」「節タイトル」「図」「キャプション」等の項目種別、「本文領域」「節タイトル領域」「図領域」「キャプション領域」等の項目領域および「本文内容」「節タイトル内容」「キャプション内容」等の項目内容の抽出については、具体例１と同様である。
【００７１】
［重要度付与部３４の動作］
重要度付与部３４は、項目抽出部２４により重要度付与対象情報格納部２００に格納された重要度付与対象情報を取得し、更に、重要度付与基準保持部３００に予め格納された重要度付与基準を取得する。
図３４は、重要度付与基準の一例を示す説明図である。
そして、重要度付与部３４は、重要度付与基準に記述された用途種別、文書種別、項目種別、項目内容と重要度の対応に基づいて、用途種別、文書種別、項目種別、項目領域、重要度を記述した重要度付与対象情報（重要度含む）を作成し、作成した重要度付与対象情報（重要度含む）を重要度付与対象情報格納部２００に格納する（ステップＳ３４）。
図３５は、重要度付与対象情報（重要度含む）の一例を示す説明図である。
例えば、項目抽出部２４により重要度付与対象情報格納部２００に格納された重要度付与対象情報が図３３に示す重要度付与対象情報、更に、重要度付与基準保持部３００に予め格納された重要度付与基準が図３４に示す重要度付与基準の場合、重要度付与部３４は、図３４に示す重要度付与基準に記述された「社内公開」「指定無し」等の用途種別、「論文Ａ」「指定無し」等の文書種別、「本文」「節タイトル」「図」「キャプション」等の項目種別、「文字列「ＯＣＲ」を含む」「指定無し」等の項目内容と「高」「低」「中」「高」「低」「高」「高」等の重要度の対応に基づいて、用途種別、文書種別、項目種別、項目領域、重要度を記述した図３５に示す重要度付与対象情報（重要度含む）を作成し、作成した重要度付与対象情報（重要度含む）を重要度付与対象情報格納部２００に格納する。
【００７２】
以上の動作の説明において、用途種別に「社内公開」等の単語を用いたが、用途種別であることが容易に理解できる単語であれば、どのような単語を用いても良い。文書種別に「論文Ａ」等の単語を用いたが、文書種別であることが容易に理解できる単語であれば、どのような単語を用いても良い。
用途種別および文書種別の抽出に上記の方法を用いたが、用途種別および文書種別が抽出できる方法であれば、どのような方法を用いても良い。
重要度付与基準の用途種別に「指定無し」等の単語を用いたが、用途種別を特定していないことが容易に理解できる単語であれば、どのような単語を用いても良い。また、全ての用途種別に「指定無し」を用いても良い。
重要度付与基準の文書種別に「指定無し」等の単語を用いたが、文書種別を特定していないことが容易に理解できる単語であれば、どのような単語を用いても良い。また、全ての文書種別に「指定無し」を用いても良い。
その他の利用形態については、具体例１と同様である。
【００７３】
〈効果〉
以上のように具体例５によれば、具体例１の構成に加えて、文書の形式に基づいて用途種別と文書種別を抽出し、この用途種別と文書種別も考慮して各項目の重要度を付与するようにしたので、次のような効果がある。
即ち、用途種別、文書種別および文書の内容を形成する単位である項目の内容に応じて、項目に自動的に重要度が付与されるため、文書を社内に公開するような（用途種別が「社内公開」である）場合、ある文書では重要度が比較的高い「本文」「図」「キャプション」を非公開にしたり、別の文書では重要度が高い「図」「キャプション」に加え、重要なキーワードを有する「本文」のみを非公開にしたりすることができる（ある文書で非公開にした「本文」を、別の文書で公開にできる）。
更に、文書を社外に公開するような（用途種別が「社外公開」である）場合の公開／非公開の調整も可能である。即ち、重要度に応じた項目の公開／非公開を用途種別、文書種別および項目内容毎に調整することができる。具体例３、４に比べ、用途種別および文書種別毎に調整することができる分、更にきめ細かな文書の公開を実現することができる。
【００７４】
《利用形態》
上記各具体例では、所定の形式の文書として、例えば図３、図１７、図２４、図３１のような形式の文書を対象としたが、これ以外の形式の文書であってもよい。即ち、文書の形式が予め分かっているものであれば、その形式に対応して各項目領域を抽出することで、上記各具体例と同様に処理することができる。
上記具体例３、４、５は、具体例１の構成に対して、文書識別や用途識別の構成を付加するようにしたが、これら具体例を具体例２の構成に対して適用するようにしてもよい。この場合、具体例３、４、５の項目抽出部２２、２３、２４を、具体例２の論理構造抽出部２１とすることで実現することができる。
【００７５】
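In each of the specific examples above, the importance imparted to each item was used to decide whether to disclose or hide parts of the document, but it can be applied to various other purposes. For example, it may be used as a security level, such as embedding a digital watermark in highly important portions. The differences in importance can also drive processing at various levels of the document data, such as color level (more or fewer colors), compression level (compressed or uncompressed), resolution level (high or low resolution), omission level (omit or keep), and summarization level (summarize or not).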
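As one hedged illustration of such level-dependent processing, the importance of an item could select a processing profile from a lookup table. The profile fields and every value below (watermark flag, resolution in dpi, lossless flag) are invented for the example and are not given in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processing:
    watermark: bool  # embed a digital watermark in this portion
    dpi: int         # output resolution
    lossless: bool   # keep the portion uncompressed or losslessly compressed


# Hypothetical profiles; the specification only names the kinds of level
PROFILES = {
    "high": Processing(watermark=True, dpi=600, lossless=True),
    "medium": Processing(watermark=False, dpi=300, lossless=True),
    "low": Processing(watermark=False, dpi=150, lossless=False),
}


def processing_for(importance):
    """Look up the processing profile for an imparted importance."""
    return PROFILES[importance]


profile = processing_for("high")
```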
【００７６】
また、対象文書としても、学校関係では、内申書、成績表、通知票といったものに対して適用できる。医療機関では、カルテ、処方箋等に適用できる。更に、企業では、機密文書、仕様書、ソースコード、回路図といったものに適用可能である。また、金融機関等では、申請書等、種々の文書に適用可能である。
【図面の簡単な説明】
【図１】本発明の画像処理装置の具体例１を示す構成図である。
【図２】具体例１の動作を示すフローチャートである。
【図３】具体例１の文書画像の一例を示す説明図である。
【図４】具体例１の構成要素の一例を示す説明図である。
【図５】具体例１の項目の一例を示す説明図である。
【図６】具体例１の重要度付与対象情報の一例を示す説明図である。
【図７】具体例１の重要度付与基準の一例を示す説明図である。
【図８】具体例１の重要度付与対象情報（重要度含む）の一例を示す説明図である。
【図９】具体例２の構成図である。
【図１０】具体例２の動作を示すフローチャートである。
【図１１】論理構造の一例を示す説明図である。
【図１２】具体例２の重要度付与対象情報の一例を示す説明図である。
【図１３】具体例２の重要度付与基準の一例を示す説明図である。
【図１４】具体例２の重要度付与対象情報（重要度含む）の一例を示す説明図である。
【図１５】具体例３の構成図である。
【図１６】具体例３の動作を示すフローチャートである。
【図１７】具体例３における文書画像の一例を示す説明図である。
【図１８】具体例３における重要度付与対象情報（文書種別のみ）の一例を示す説明図である。
【図１９】具体例３における重要度付与対象情報の一例を示す説明図である。
【図２０】具体例３における重要度付与基準の一例を示す説明図である。
【図２１】具体例３における重要度付与対象情報（重要度含む）の一例を示す説明図である。
【図２２】具体例４の構成図である。
【図２３】具体例４の動作を示すフローチャートである。
【図２４】具体例４における文書画像の一例を示す説明図である。
【図２５】具体例４における重要度付与対象情報（用途種別のみ）の一例を示す説明図である。
【図２６】具体例４における重要度付与対象情報の一例を示す説明図である。
【図２７】具体例４における重要度付与基準の一例を示す説明図である。
【図２８】具体例４における重要度付与対象情報（重要度含む）の一例を示す説明図である。
【図２９】具体例５の構成図である。
【図３０】具体例５の動作を示すフローチャートである。
【図３１】具体例５における文書画像の一例を示す説明図である。
【図３２】具体例５における重要度付与対象情報（用途・文書種別のみ）の一例を示す説明図である。
【図３３】具体例５における重要度付与対象情報の一例を示す説明図である。
【図３４】具体例５における重要度付与基準の一例を示す説明図である。
【図３５】具体例５における重要度付与対象情報（重要度含む）の一例を示す説明図である。
【符号の説明】
２０、２２、２３、２４　項目抽出部
２１　論理構造抽出部
３０、３１、３２、３３、３４　重要度付与部
１００　文書画像格納部
２００　重要度付与対象情報格納部
３００　重要度付与基準保持部
[0001]
[Technical field of the invention]
The present invention relates to an image processing apparatus that extracts arbitrary areas of a document that are judged to be important based on the content of the document. It is used, for example, when a document is made public and the areas judged to be important are to be kept undisclosed.
[0002]
[Prior art]
Conventionally, an arbitrary region of a document judged to be important based on the content of the document has been designated by a user on a display screen. For example, in Japanese Patent Application Laid-Open No. 6-176127, "Electronic filing device", a document image to be registered is displayed on a display screen, and the user designates an arbitrary region in the document image as an important region. A unique password can also be specified for each important region, so that when a viewer displays the image, the designated important region can be kept private or be made visible by entering its password.
However, because this electronic filing device displays the document image on a display screen and relies on manual designation of important regions, it is operationally cumbersome: important regions must be designated individually even when registering documents of the same document type, and on some display screens the whole document image cannot be shown, so the image must be scrolled to designate an important region.
[0003]
Japanese Patent Application Laid-Open No. 7-114616, "Slip document information system", addresses this problem. The slip document information system extracts form data such as table frame positions and character line positions from a slip document and collates them with form data (dictionary data) registered by the user, thereby identifying the form of the slip document and specifying its reading regions and important regions. Multiple access levels are set for the important regions, and disclosure or non-disclosure of each region can be set according to the access level of the viewer, which allows fine-grained disclosure of documents.
[0004]
[Problems to be solved by the invention]
However, although the slip document information system removes the operational burden by extracting important regions automatically, the important regions described in the dictionary data are fixed. Important regions therefore still had to be specified individually for documents of the same type but different formats. They also had to be specified individually when regions with the same content were to be treated as important across different document types.
[0005]
[Means for Solving the Problems]
The present invention employs the following configuration to solve the above-described problem.
<Configuration 1>
An image processing apparatus comprising: an item extraction unit that, for a document in a predetermined format, extracts from an image of the document an item area serving as a unit forming the content of the document based on the document format, and extracts an item type indicating what type the item area is and item content indicating the content of the item area; an importance assignment criterion holding unit that holds in advance an importance assignment criterion indicating the relationship of importance to item type and item content; and an importance assigning unit that compares the item type and item content extracted by the item extraction unit with the importance assignment criterion of the importance assignment criterion holding unit and assigns an importance to each item.
[0006]
<Configuration 2>
An image processing apparatus comprising: a logical structure extraction unit that, for a document in a predetermined format, extracts from an image of the document item areas serving as units forming the content of the document based on the document format, obtains a logical structure indicating how each item area relates to the other item areas in the document, and extracts a logical structure type indicating what type the logical structure is and logical structure content indicating the content of the logical structure; an importance assignment criterion holding unit that holds in advance an importance assignment criterion indicating the relationship of importance to logical structure type and logical structure content; and an importance assigning unit that compares the logical structure type and logical structure content extracted by the logical structure extraction unit with the importance assignment criterion of the importance assignment criterion holding unit and assigns an importance to each logical structure.
[0007]
<Configuration 3>
An image processing apparatus comprising: a document identification unit that, for a document in a predetermined format, identifies from an image of the document a document type indicating what kind of document it is, based on the document format; an item extraction unit that, for the document in the predetermined format, extracts from the image of the document an item area serving as a unit forming the content of the document based on the document format, and extracts an item type indicating what type the item area is and item content indicating the content of the item area; an importance assignment criterion holding unit that holds in advance an importance assignment criterion indicating the relationship of importance to document type, item type, and item content; and an importance assigning unit that compares the document type identified by the document identification unit and the item type and item content extracted by the item extraction unit with the importance assignment criterion of the importance assignment criterion holding unit and assigns an importance to each item.
[0008]
<Configuration 4>
An image processing apparatus comprising: a document identification unit that, for a document in a predetermined format, identifies from an image of the document a document type indicating what kind of document it is, based on the document format; a logical structure extraction unit that, for the document in the predetermined format, extracts from the image of the document item areas serving as units forming the content of the document based on the document format, obtains a logical structure indicating how each item area relates to the other item areas in the document, and extracts a logical structure type indicating what type the logical structure is and logical structure content indicating the content of the logical structure; an importance assignment criterion holding unit that holds in advance an importance assignment criterion indicating the relationship of importance to document type, logical structure type, and logical structure content; and an importance assigning unit that compares the document type identified by the document identification unit and the logical structure type and logical structure content extracted by the logical structure extraction unit with the importance assignment criterion of the importance assignment criterion holding unit and assigns an importance to each logical structure.
[0009]
<Configuration 5>
An image processing apparatus comprising: a use identification unit that, for a document in a predetermined format, identifies from an image of the document a use type indicating the purpose for which the document is used, based on the document format; an item extraction unit that, for the document in the predetermined format, extracts from the image of the document an item area serving as a unit forming the content of the document based on the document format, and extracts an item type indicating what type the item area is and item content indicating the content of the item area; an importance assignment criterion holding unit that holds in advance an importance assignment criterion indicating the relationship of importance to use type, item type, and item content; and an importance assigning unit that compares the use type identified by the use identification unit and the item type and item content extracted by the item extraction unit with the importance assignment criterion of the importance assignment criterion holding unit and assigns an importance to each item.
[0010]
<Configuration 6>
An image processing apparatus comprising: a use identification unit that, for a document in a predetermined format, identifies from an image of the document a use type indicating the purpose for which the document is used, based on the document format; a logical structure extraction unit that, for the document in the predetermined format, extracts from the image of the document item areas serving as units forming the content of the document based on the document format, obtains a logical structure indicating how each item area relates to the other item areas in the document, and extracts a logical structure type indicating what type the logical structure is and logical structure content indicating the content of the logical structure; an importance assignment criterion holding unit that holds in advance an importance assignment criterion indicating the relationship of importance to use type, logical structure type, and logical structure content; and an importance assigning unit that compares the use type identified by the use identification unit and the logical structure type and logical structure content extracted by the logical structure extraction unit with the importance assignment criterion of the importance assignment criterion holding unit and assigns an importance to each logical structure.
[0011]
<Configuration 7>
An image processing apparatus comprising: a use/document identification unit that, for a document in a predetermined format, identifies from an image of the document, based on the document format, a use type indicating the purpose for which the document is used and a document type indicating what kind of document it is; an item extraction unit that, for the document in the predetermined format, extracts from the image of the document an item area serving as a unit forming the content of the document based on the document format, and extracts an item type indicating what type the item area is and item content indicating the content of the item area; an importance assignment criterion holding unit that holds in advance an importance assignment criterion indicating the relationship of importance to use type, document type, item type, and item content; and an importance assigning unit that compares the use type and document type identified by the use/document identification unit and the item type and item content extracted by the item extraction unit with the importance assignment criterion of the importance assignment criterion holding unit and assigns an importance to each item.
[0012]
<Configuration 8>
An image processing apparatus comprising: a use/document identification unit that, for a document in a predetermined format, identifies from an image of the document, based on the document format, a use type indicating the purpose for which the document is used and a document type indicating what kind of document it is; a logical structure extraction unit that, for the document in the predetermined format, extracts from the image of the document item areas serving as units forming the content of the document based on the document format, obtains a logical structure indicating how each item area relates to the other item areas in the document, and extracts a logical structure type indicating what type the logical structure is and logical structure content indicating the content of the logical structure; an importance assignment criterion holding unit that holds in advance an importance assignment criterion indicating the relationship of importance to use type, document type, logical structure type, and logical structure content; and an importance assigning unit that compares the use type and document type identified by the use/document identification unit and the logical structure type and logical structure content extracted by the logical structure extraction unit with the importance assignment criterion of the importance assignment criterion holding unit and assigns an importance to each logical structure.
[0013]
<Configuration 9>
An image processing method comprising: for a document in a predetermined format, extracting from an image of the document an item area serving as a unit forming the content of the document based on the document format, and extracting an item type indicating what type the item area is and item content indicating the content of the item area; and then comparing the item type and item content with an importance assignment criterion, held in advance, that indicates the relationship of importance to item type and item content, and assigning an importance to each item.
[0014]
<Configuration 10>
A control program for an image processing apparatus that causes a computer constituting the image processing apparatus to function as: an item extraction unit that, for a document in a predetermined format, extracts from an image of the document an item area serving as a unit forming the content of the document based on the document format, and extracts an item type indicating what type the item area is and item content indicating the content of the item area; and an importance assigning unit that compares the item type and item content extracted by the item extraction unit with an importance assignment criterion indicating the relationship of importance to item type and item content and assigns an importance to each item.
[0015]
[Best mode for carrying out the invention]
Hereinafter, embodiments of the present invention will be described in detail using specific examples.
<< Specific Example 1 >>
In specific example 1, regions of a document that may be judged important based on its content are extracted together with a type indicating what each region contains and the content of the region itself. Each region is then designated as an important region, with an importance in multiple levels, based on a user-defined correspondence among region type, region content, and importance. This solves the conventional problem.
[0016]
<Configuration>
FIG. 1 is a configuration diagram showing specific example 1 of the image processing apparatus of the present invention.
The illustrated image processing apparatus 1 is connected to an input device 2 and a storage device 3, and includes an input unit 10, an item extraction unit 20, an importance assigning unit 30, an output unit 40, a document image storage unit 100, an importance assignment target information storage unit 200, and an importance assignment criterion holding unit 300. The image processing apparatus 1 is, for example, a personal computer.
[0017]
The input device 2 is, for example, a scanner that optically reads a document image and outputs it as digital data, or a communication interface that supplies document image data. The storage device 3 is an external storage device, a communication interface, or the like that stores the importance-assigned document data output from the image processing apparatus 1. The input unit 10 is the input interface of the image processing apparatus 1; it receives document image data from the input device 2 and passes it to the document image storage unit 100. The item extraction unit 20 extracts, from the image of a document in a predetermined format stored in the document image storage unit 100, item areas serving as units forming the content of the document based on that format, and extracts an item type indicating what type each item area is and item content indicating the content of the item area. The importance assigning unit 30 compares the item type and item content extracted by the item extraction unit 20 with the importance assignment criterion of the importance assignment criterion holding unit 300 and assigns an importance to each item. The output unit 40 is the output interface of the image processing apparatus 1; it outputs to the storage device 3 the document image stored in the document image storage unit 100 and the importance assignment target information, with an importance assigned to each item, stored in the importance assignment target information storage unit 200.
[0018]
The document image storage unit 100 is a storage unit that stores a document image input by the input unit 10, and includes a semiconductor memory, a hard disk device, and the like. The importance assignment target information storage unit 200 is a storage unit that stores the data of each item extracted by the item extraction unit 20 and holds the data of the item assigned the importance by the importance assignment unit 30. It comprises a semiconductor memory, a hard disk device, and the like. The importance assignment criterion holding unit 300 is a storage unit that holds an importance assignment criterion indicating the relationship between the item type and the item content, and is composed of a semiconductor memory, a hard disk device, or the like.
In the drawing, the thin arrows indicate the flow of data, and the thick arrows indicate the flow of control. The item extracting unit 20 and the importance assigning unit 30 are configured by software corresponding to each function and hardware such as a CPU and a memory for executing the software.
[0019]
<Operation>
FIG. 2 is a flowchart illustrating the operation of specific example 1.
[Operation of Input Unit 10]
The input unit 10 acquires a document image from the input device 2 such as a scanner connected to the image processing apparatus 1, and stores the acquired document image in the document image storage unit 100 (Step S10).
[0020]
[Operation of item extraction unit 20]
Next, the item extraction unit 20 acquires the document image stored in the document image storage unit 100 by the input unit 10 and, for all or some of the items in the acquired document image, extracts the item type, the item area (the coordinate values of the circumscribed rectangle of the item), and the item content (the character recognition result of the item). Here, an item is a unit forming the content of the document. The item extraction unit 20 then creates importance assignment target information describing the extracted item type, item area, and item content, and stores it in the importance assignment target information storage unit 200 (step S20).
[0021]
FIG. 3 is an explanatory diagram illustrating an example of a document image.
FIG. 4 is an explanatory diagram illustrating an example of a component.
FIG. 5 is an explanatory diagram illustrating an example of an item.
For example, when the document image of a predetermined format stored in the document image storage unit 100 by the input unit 10 is the paper image shown in FIG. 3, the item extraction unit 20 extracts, for all or some of the items in that image, item types such as “body”, “section title”, “figure”, and “caption”; item areas (coordinate values of the circumscribed rectangles of the items) such as “body area”, “section title area”, “figure area”, and “caption area”; and item contents (character recognition results of the items) such as “body content”, “section title content”, and “caption content”. However, item contents are extracted only for items whose item type was determined from a component whose component type is “text” in FIG. 4, that is, only for items whose item type is “body”, “section title”, or “caption”. A component is a unit that forms the page of a document.
[0022]
Next, the item extraction unit 20 creates importance assignment target information describing the extracted item types such as “body”, “section title”, “figure”, and “caption”, item areas such as “body area”, “section title area”, “figure area”, and “caption area”, and item contents such as “body content”, “section title content”, and “caption content”, and stores the created importance assignment target information in the importance assignment target information storage unit 200.
[0023]
FIG. 6 is an explanatory diagram illustrating an example of the importance assignment target information.
In FIG. 6, the item area is indicated by “(upper left X coordinate, upper left Y coordinate, lower right X coordinate, lower right Y coordinate)” of the circumscribed rectangle of the item. The item type, the item area, and the item content are arranged in order from top to bottom in the left column.
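As a rough illustration of the record layout described above, the following Python sketch models one row of importance assignment target information. The class name `ItemRecord`, the field names, and the sample values are assumptions made for illustration and are not taken from FIG. 6.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Circumscribed rectangle of an item:
# (upper-left X, upper-left Y, lower-right X, lower-right Y)
Rect = Tuple[int, int, int, int]


@dataclass
class ItemRecord:
    """One row of importance assignment target information (hypothetical layout)."""
    item_type: str                       # e.g. "body", "section title", "figure", "caption"
    item_area: Rect                      # coordinates of the circumscribed rectangle
    item_content: Optional[str] = None   # character recognition result; None for figures
    importance: Optional[str] = None     # filled in later by the importance assigning unit


# Hypothetical sample values, not the contents of FIG. 6.
records: List[ItemRecord] = [
    ItemRecord("section title", (100, 80, 900, 120), "2.2. Example heading"),
    ItemRecord("figure", (150, 400, 850, 900)),
]
```

Leaving `importance` empty at this stage mirrors the flow in the text, where step S20 stores the extracted fields and the importance is added only in step S30.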
The following describes how item types such as "body", "section title", "figure", and "caption", item areas such as "body area", "section title area", "figure area", and "caption area", and item contents such as "body content", "section title content", and "caption content" are extracted.
[0024]
The item extraction unit 20 first extracts sets of black pixels by examining the connectivity of black pixels in the paper image. If the height and width of the circumscribed rectangle of an extracted black pixel set are greater than a given threshold, the set is determined to be a component whose component type is “figure”, and its circumscribed rectangle is extracted as a “figure area”. Because a “figure” component can itself be regarded as an item, that is, a unit forming the content of the document, a component whose component type is “figure” is determined to be an item whose item type is “figure”, and its component area, the “figure area”, becomes the item area.
On the other hand, if the height and width of the circumscribed rectangle of an extracted black pixel set are smaller than the threshold, the set is determined to be a component whose component type is “character”, and its circumscribed rectangle is extracted as a “character area”. If “character” components are lined up in the horizontal direction and the horizontal distance between two adjacent “character” components is smaller than a given threshold, the set of horizontally aligned “character” components is determined to be a component whose component type is “character string”, and the circumscribed rectangle of that set is extracted as a “character string area”.
[0025]
Further, if “character string” components are lined up in the vertical direction and the vertical distance between two adjacent “character string” components is smaller than a given threshold, the set of vertically aligned “character string” components is determined to be a component whose component type is “text”, and the circumscribed rectangle of that set is extracted as a “text area”.
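The size test and the two grouping passes described above can be sketched in Python as follows. This is only a minimal illustration working on rectangles that are assumed to have been obtained already from connected black pixels. The function names, the threshold parameters, the requirement that merged rectangles overlap on the other axis, and the handling of sizes exactly equal to the threshold are all assumptions rather than details given in the text.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom)


def classify_component(rect: Rect, size_threshold: int) -> str:
    """Return "figure" if both width and height exceed the threshold, else "character"."""
    left, top, right, bottom = rect
    width, height = right - left, bottom - top
    if width > size_threshold and height > size_threshold:
        return "figure"
    return "character"


def union(a: Rect, b: Rect) -> Rect:
    """Circumscribed rectangle of two rectangles."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def group(rects: List[Rect], max_gap: int, axis: int) -> List[Rect]:
    """Merge neighbouring rectangles along one axis.

    axis=0 merges horizontal neighbours (characters into character strings);
    axis=1 merges vertical neighbours (character strings into text blocks).
    """
    other = 1 - axis
    groups: List[Rect] = []
    for rect in sorted(rects, key=lambda r: r[axis]):
        for i, g in enumerate(groups):
            # Assumed condition: the rectangles must overlap on the other axis.
            overlaps = rect[other] < g[other + 2] and g[other] < rect[other + 2]
            gap = rect[axis] - g[axis + 2]
            if overlaps and gap < max_gap:
                groups[i] = union(g, rect)
                break
        else:
            groups.append(rect)
    return groups


# Hypothetical character rectangles: two rows plus one isolated character.
chars = [(0, 0, 10, 12), (12, 0, 22, 12), (24, 0, 34, 12),
         (100, 0, 110, 12), (0, 20, 10, 32), (12, 20, 22, 32)]
strings = group(chars, max_gap=5, axis=0)    # characters -> character strings
texts = group(strings, max_gap=10, axis=1)   # character strings -> text blocks
```

Using one `group` function for both passes reflects that the text applies the same distance test twice, first horizontally and then vertically.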
Subsequently, character recognition is performed on the set of “character” components that make up a “text” component. If the resulting character recognition result contains punctuation, the “text” component is determined to be an item whose item type is “body”, its component area, the “text area”, becomes the “body area”, and the character recognition result becomes the “body content”.
On the other hand, if the character recognition result begins with the sequence “number”, “.”, “number”, “.”, the “text” component is determined to be an item whose item type is “section title”, its “text area” becomes the “section title area”, and the character recognition result becomes the “section title content”. It is assumed that the item extraction unit 20 holds, in advance, pattern data indicating that, in the case of a paper, a character string beginning with “number”, “.”, “number”, “.” is a “section title” item.
In addition, if the first character of the character recognition result is “figure”, the “text” component is determined to be an item whose item type is “caption”, its “text area” becomes the “caption area”, and the character recognition result becomes the “caption content”. In this case too, the item extraction unit 20 is assumed to hold such pattern data in advance.
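The three pattern tests above could be expressed with regular expressions along the following lines. This is a sketch under stated assumptions: the order in which the tests are tried, the exact punctuation set, the English caption marker "Fig", and the function name `classify_text` are all hypothetical, since the text only says that the item extraction unit holds such pattern data in advance.

```python
import re
from typing import Optional

# Assumed pattern data; the actual patterns held by the item extraction unit are not given.
SECTION_TITLE = re.compile(r"^\d+\.\d+\.")   # "number", ".", "number", "."
CAPTION = re.compile(r"^(図|Fig)")            # leading "figure" marker
PUNCTUATION = re.compile(r"[。、,.]")          # sentence punctuation


def classify_text(ocr_text: str) -> Optional[str]:
    """Map the character recognition result of a "text" component to an item type."""
    text = ocr_text.strip()
    # The title and caption tests run first because a section number also contains ".".
    if SECTION_TITLE.match(text):
        return "section title"
    if CAPTION.match(text):
        return "caption"
    if PUNCTUATION.search(text):
        return "body"
    return None  # no pattern applies; the text does not cover this case
```

For instance, `classify_text("2.2. Character recognition")` yields `"section title"`, while a sentence containing a comma or period that matches neither leading pattern yields `"body"`.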
[0026]
[Operation of importance assigning unit 30]
The importance assigning unit 30 acquires the importance assignment target information stored in the importance assignment target information storage unit 200 by the item extraction unit 20, and also acquires the importance assignment criterion held in advance in the importance assignment criterion holding unit 300.
FIG. 7 is an explanatory diagram illustrating an example of the importance assignment criterion.
Then, based on the correspondence among item type, item content, and importance described in the importance assignment criterion, the importance assigning unit 30 creates importance assignment target information (including importance) describing the item type, item area, and importance, and stores it in the importance assignment target information storage unit 200 (step S30).
FIG. 8 is an explanatory diagram illustrating an example of importance assignment target information (including importance).
For example, suppose that the importance assignment target information stored in the importance assignment target information storage unit 200 by the item extraction unit 20 is that shown in FIG. 6 and that the importance assignment criterion is that shown in FIG. 7. The importance assigning unit 30 then uses the correspondence described in FIG. 7 among item types such as “body”, “section title”, “figure”, and “caption”, item contents such as “includes the character string ‘OCR’” and “unspecified”, and importances such as “high”, “medium”, “low”, “high”, and “high” to create the importance assignment target information (including importance) shown in FIG. 8, which describes the item type, item area, and importance, and stores it in the importance assignment target information storage unit 200.
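The comparison against the importance assignment criterion can be pictured as a small rule table lookup, sketched below in Python. The rule rows are hypothetical and do not reproduce FIG. 7. The first-match-wins ordering, the `None` result when no rule applies, and the names `CRITERIA` and `assign_importance` are also assumptions.

```python
from typing import List, Optional, Tuple

# (item type, keyword the content must include or None for "unspecified", importance)
Criterion = Tuple[str, Optional[str], str]

# Hypothetical criterion rows for illustration only.
CRITERIA: List[Criterion] = [
    ("body", "OCR", "high"),
    ("body", None, "medium"),
    ("section title", None, "low"),
    ("figure", None, "high"),
    ("caption", None, "high"),
]


def assign_importance(item_type: str,
                      item_content: Optional[str],
                      criteria: List[Criterion] = CRITERIA) -> Optional[str]:
    """Return the importance of the first rule matching the item, or None."""
    for rule_type, keyword, importance in criteria:
        if rule_type != item_type:
            continue
        # "unspecified" content matches anything; otherwise the keyword must appear.
        if keyword is None or (item_content is not None and keyword in item_content):
            return importance
    return None
```

Placing the keyword rule for "body" before the unspecified rule lets a specific content condition override the general one for the same item type.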
[0027]
[Operation of Output Unit 40]
The output unit 40 acquires the document image stored in the document image storage unit 100 by the input unit 10, and also acquires the importance assignment target information (including importance) stored in the importance assignment target information storage unit 200 by the importance assigning unit 30. It then stores the acquired document image and importance assignment target information (including importance) in the storage device 3 connected to the image processing apparatus 1 (step S40).
[0028]
In the above description of the operation, words such as “body”, “section title”, “figure”, and “caption” are used as item types, but any words may be used as long as they are easily understood as units forming the content of a document.
The item area in the importance assignment target information is written as “(upper left X coordinate, upper left Y coordinate, lower right X coordinate, lower right Y coordinate)” of the circumscribed rectangle of the item, but any format may be used as long as it is easily understood as the coordinate values of that rectangle.
The method described above was used to extract item types such as "body", "section title", "figure", and "caption", item areas such as "body area", "section title area", "figure area", and "caption area", and item contents such as "body content", "section title content", and "caption content", but any other method may be used as long as it can extract these item types, item areas, and item contents.
A condition such as "includes the character string 'OCR'" is used as the item content in the importance assignment criterion, but any condition may be used as long as it is easily understood as specifying the item content.
The word "unspecified" is used as the item content in the importance assignment criterion, but any word may be used as long as it is easily understood that the item content is not specified. "Unspecified" may also be used for every item content.
Words such as "high", "medium", and "low" are used as the importance in the importance assignment criterion, but any words may be used as long as the level of importance is easily understood. The number of importance levels may also be two (important or not important).
[0029]
<Effect>
As described above, according to specific example 1, items are extracted from a document image and an importance is assigned to each item using an importance assignment criterion defined in advance, which provides the following effects.
Because an importance is automatically assigned to each item according to the content of the item, which is a unit forming the content of the document, it is possible when publishing a document to keep the relatively important “body”, “figure”, and “caption” items private for one viewer, while for another viewer keeping private only the important “figure” and “caption” items and the “body” items that contain important keywords.
In other words, disclosure or non-disclosure according to importance can be adjusted for each item content. This means that even a user who does not know the content of the document in depth can set the importance to some extent in line with how important that content is, for example by deciding that a body containing an important keyword should be kept private because the body as a whole may be important.
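One way to picture the per-viewer disclosure described above is a filter that hides items whose importance exceeds what a viewer is allowed to see. The Python sketch below is an assumption for illustration only: the text does not define a viewer clearance, and the names `LEVELS` and `visible_items`, the ordering of the levels, and the sample items are hypothetical.

```python
from typing import Dict, List

# Assumed ordering of the importance levels used in specific example 1.
LEVELS: Dict[str, int] = {"low": 0, "medium": 1, "high": 2}


def visible_items(items: List[dict], viewer_clearance: str) -> List[dict]:
    """Return the items whose importance does not exceed the viewer's clearance."""
    limit = LEVELS[viewer_clearance]
    return [item for item in items if LEVELS[item["importance"]] <= limit]


# Hypothetical importance-assigned items.
items = [
    {"item_type": "section title", "importance": "low"},
    {"item_type": "body", "importance": "medium"},
    {"item_type": "figure", "importance": "high"},
]
public_view = visible_items(items, "low")
internal_view = visible_items(items, "medium")
```

Because the decision uses only the stored importance, the same importance-assigned document can serve several viewers without re-running the extraction.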
[0030]
<< Specific Example 2 >>
In the specific example 2, importance is assigned to each logical structure indicating the relationship between items.
[0031]
<Configuration>
FIG. 9 is a configuration diagram of specific example 2.
The image processing apparatus 1a is connected to the input device 2 and the storage device 3, and includes an input unit 10, a logical structure extraction unit 21, an importance assigning unit 31, an output unit 40, a document image storage unit 100, an importance assignment target information storage unit 200, and an importance assignment criterion holding unit 300.
Here, the input unit 10, the output unit 40, the document image storage unit 100, the importance assignment target information storage unit 200, and the importance assignment criterion holding unit 300 are the same as in specific example 1, so their description is omitted. However, the importance assignment criterion held by the importance assignment criterion holding unit 300 in specific example 2 is data giving the importance for each logical structure type and logical structure content.
The logical structure extraction unit 21 extracts, from the image of a document in a predetermined format stored in the document image storage unit 100, item areas serving as units forming the content of the document based on the document format, obtains a logical structure indicating how each item area relates to the other item areas in the document, and extracts a logical structure type indicating what type the logical structure is and logical structure content indicating the content of the logical structure. The importance assigning unit 31 compares the logical structure type and logical structure content extracted by the logical structure extraction unit 21 with the importance assignment criterion of the importance assignment criterion holding unit 300 and assigns an importance to each logical structure.
[0032]
<Operation>
FIG. 10 is a flowchart illustrating the operation of the specific example 2.
As in the specific example 1, the description of the specific example 2 uses the example of the document image of FIG. 3 and the example of the components of FIG. The description of the input unit 10 and the output unit 40, which operate in the same way as in the specific example 1, is omitted. That is, steps S10 and S40 in FIG. 10 are the same as in the specific example 1, and their description is omitted.
[0033]
[Operation of Logical Structure Extraction Unit 21]
The logical structure extraction unit 21 acquires the document image stored in the document image storage unit 100 by the input unit 10 and, for all or some of the logical structures in the acquired document image, extracts a logical structure type, a logical structure area (the set of coordinate values of the circumscribed rectangles of the items forming the logical structure), and logical structure content (the set of character recognition results of the items forming the logical structure). It then creates importance assignment target information describing the extracted logical structure type, logical structure area, and logical structure content, and stores the created information in the importance assignment target information storage unit 200 (step S21).
Note that a logical structure is a structure of items based on the content of the document. That is, when a document contains a plurality of items, it is information indicating how the items relate to one another within the document.
[0034]
FIG. 11 is an explanatory diagram illustrating an example of the logical structure.
For example, when the document image stored in the document image storage unit 100 by the input unit 10 is the paper image shown in FIG. 3, the logical structure extraction unit 21 extracts, for all or some of the logical structures in that paper image, logical structure types such as “2.2. Section text” and “2.3. Section text”, logical structure areas such as “2.2. Section text area” and “2.3. Section text area” (the sets of coordinate values of the circumscribed rectangles of the items forming each logical structure), and logical structure contents such as “2.2. Section text content” and “2.3. Section text content” (the sets of character recognition results of the items forming each logical structure). However, logical structure content is extracted only from logical structures formed by items whose item type is determined from components whose component type is “text” shown in FIG. That is, the logical structure types are limited to those such as “2.2. Section text” and “2.3. Section text”. The unit then creates importance assignment target information describing the extracted logical structure types, logical structure areas, and logical structure contents, and stores the created information in the importance assignment target information storage unit 200.
[0035]
FIG. 12 is an explanatory diagram illustrating an example of the importance assignment target information.
As shown in FIG. 12, the logical structure area is indicated by a set of "(upper left X coordinate, upper left Y coordinate, lower right X coordinate, lower right Y coordinate)" of a circumscribed rectangle of items forming the logical structure. Also, the coordinate values of the circumscribed rectangles of the items forming the logical structure are arranged in order from the top to the bottom of the left column. Further, the character recognition results of the items forming the logical structure are arranged in the same order.
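The layout described above can be pictured as a small record type. The following is only a sketch: the class name `LogicalStructureInfo`, its field names, and the sample coordinates are hypothetical and do not come from FIG. 12. It keeps the rectangles and the character recognition results in two lists with the same ordering, as the text requires.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (upper-left X, upper-left Y, lower-right X, lower-right Y)
Rect = Tuple[int, int, int, int]


@dataclass
class LogicalStructureInfo:
    """One row of importance assignment target information (hypothetical layout)."""
    structure_type: str                                  # e.g. "2.3. Section text"
    areas: List[Rect] = field(default_factory=list)      # one rectangle per item
    contents: List[str] = field(default_factory=list)    # recognition result per item

    def add_item(self, area: Rect, text: str) -> None:
        # Append both together so areas[i] and contents[i] describe the same item.
        self.areas.append(area)
        self.contents.append(text)


# Made-up sample values for illustration only.
row = LogicalStructureInfo("2.3. Section text")
row.add_item((40, 300, 280, 420), "First body block of the section.")
row.add_item((40, 430, 280, 560), "Second body block of the section.")
```

Appending the area and the text in a single call is one simple way to guarantee that the two lists stay in the same order.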
The extraction of logical structure types such as “2.2. Section text” and “2.3. Section text”, logical structure areas such as “2.2. Section text area” and “2.3. Section text area”, and logical structure contents such as “2.2. Section text content” and “2.3. Section text content” is described below.
[0036]
The logical structure extraction unit 21 first extracts sets of black pixels by examining the connectivity of the black pixels in the paper image. If the height and width of the circumscribed rectangle of an extracted black pixel set are greater than a given threshold value, the set is determined to be a component whose component type is “figure”, and its circumscribed rectangle is extracted as a “figure area”. Because a “figure” component can itself be regarded as an item, that is, a unit forming the content of the document, a component whose component type is “figure” is determined to be an item whose item type is “figure”, and its component area, the “figure area”, becomes the item area.
On the other hand, if the height and width of the circumscribed rectangle of an extracted black pixel set are smaller than a given threshold value, the set is determined to be a component whose component type is “character”, and its circumscribed rectangle is extracted as a “character area”. When “character” components are arranged in the horizontal direction and the horizontal distance between two adjacent “character” components is smaller than a given threshold value, the set of horizontally arranged “character” components is determined to be a component whose component type is “character string”, and the circumscribed rectangle of that set is extracted as a “character string area”.
[0037]
Further, when “character string” components are arranged in the vertical direction and the vertical distance between two adjacent “character string” components is smaller than a given threshold value, the set of vertically arranged “character string” components is determined to be a component whose component type is “text”, and the circumscribed rectangle of that set is extracted as a “text area”. Character recognition is then performed on the set of “character” components making up the “text” component. If the obtained character recognition result contains punctuation marks, the “text” component is determined to be an item whose item type is “body”, its component area, the “text area”, becomes the “body area”, and the character recognition result becomes the “body content”.
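The size test and the two grouping passes described above can be sketched as follows. This is a minimal sketch under several assumptions: the circumscribed rectangles of the black pixel sets are taken as given (the connectivity analysis itself is not shown), a rectangle that is not a “figure” is simply treated as a “character”, and the function names and threshold values are hypothetical.

```python
from typing import List, Tuple

# (upper-left X, upper-left Y, lower-right X, lower-right Y)
Rect = Tuple[int, int, int, int]


def classify_component(rect: Rect, size_threshold: int) -> str:
    """Label a rectangle 'figure' if both sides exceed the threshold, else 'character'."""
    width, height = rect[2] - rect[0], rect[3] - rect[1]
    return "figure" if width > size_threshold and height > size_threshold else "character"


def union(a: Rect, b: Rect) -> Rect:
    """Circumscribed rectangle of two rectangles."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def group_along(rects: List[Rect], axis: int, max_gap: int) -> List[Rect]:
    """Merge rectangles lined up along `axis` (0 = horizontal, 1 = vertical).

    A rectangle joins a group when its gap to the group along `axis` is below
    `max_gap` and the two overlap on the other axis (an assumption of this sketch).
    """
    other = 1 - axis
    groups: List[Rect] = []
    for r in sorted(rects, key=lambda rect: rect[axis]):
        for i, g in enumerate(groups):
            gap = r[axis] - g[axis + 2]
            overlaps = r[other] < g[other + 2] and g[other] < r[other + 2]
            if gap < max_gap and overlaps:
                groups[i] = union(g, r)
                break
        else:
            groups.append(r)
    return groups


# Made-up rectangles: two short lines of characters and one large figure.
boxes = [(0, 0, 8, 10), (10, 0, 18, 10), (20, 0, 28, 10),
         (0, 14, 8, 24), (10, 14, 18, 24), (0, 100, 200, 250)]
characters = [b for b in boxes if classify_component(b, 50) == "character"]
strings = group_along(characters, axis=0, max_gap=5)   # "character string" areas
texts = group_along(strings, axis=1, max_gap=6)        # "text" areas
```

The same grouping routine is reused for both passes; only the axis and the gap threshold change.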
[0038]
On the other hand, when the obtained character recognition result begins with the sequence number, “.”, number, “.”, the “text” component is determined to be an item whose item type is “section title”, its component area, the “text area”, becomes the “section title area”, and the character recognition result becomes the “section title content”. When the first character of the obtained character recognition result is “figure”, the “text” component is determined to be an item whose item type is “caption”, its “text area” becomes the “caption area”, and the character recognition result becomes the “caption content”. Here, when the “section title content” begins with “2.3.”, the “section title” item is determined to be a logical structure whose logical structure type is “2.3. Section title”, its item area, the “section title area”, becomes the “2.3. Section title area”, and its item content, the “section title content”, becomes the “2.3. Section title content”. In general, a paper has a column layout, and items are laid out from left to right across columns and from top to bottom within a column. Therefore, the set of “body” items that precedes the logical structure whose type is “2.3. Section title” is determined to be a logical structure whose type is “2.2. Section text”, and the set of “body” items that follows it is determined to be a logical structure whose type is “2.3. Section text”. The sets of coordinate values of the circumscribed rectangles of the items forming these logical structures become logical structure areas such as “2.2. Section text area” and “2.3. Section text area”, and the sets of character recognition results of those items become logical structure contents such as “2.2. Section text content” and “2.3. Section text content”.
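The item typing rules and the grouping of “body” items around a section title can be sketched as follows. This is an illustration only, with several assumptions: the recognized text is English, so the prefix “Fig” stands in for the first character “figure”; the section title and caption tests run before the punctuation test because an English section number itself contains periods; and the names `item_type` and `group_sections` are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# A leading "number . number ." pattern such as "2.3."
SECTION_TITLE = re.compile(r"^(\d+)\.(\d+)\.")


def item_type(text: str) -> str:
    """Decide the item type of a recognized 'text' component."""
    if SECTION_TITLE.match(text):
        return "section title"
    if text.startswith("Fig"):          # stand-in for the first character "figure"
        return "caption"
    if re.search(r"[.,]", text):        # contains punctuation marks
        return "body"
    return "unknown"


def group_sections(items: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group 'body' texts, given in reading order, into 'X.Y. Section text' structures.

    Body items that precede the first title "X.Y." are assigned to the previous
    section "X.(Y-1)." on the assumption that sections are numbered sequentially.
    Body items with no usable title are left out of the result.
    """
    structures: Dict[str, List[str]] = {}
    current = None
    pending: List[str] = []
    for kind, text in items:
        if kind == "section title":
            match = SECTION_TITLE.match(text)
            major, minor = int(match.group(1)), int(match.group(2))
            if current is None and pending and minor > 1:
                structures[f"{major}.{minor - 1}. Section text"] = pending
            pending = []
            current = f"{major}.{minor}. Section text"
            structures.setdefault(current, [])
        elif kind == "body":
            if current is None:
                pending.append(text)
            else:
                structures[current].append(text)
    return structures


# Made-up recognition results in reading order.
recognized = ["Earlier discussion, continued.",
              "2.3. OCR processing",
              "The OCR engine reads the page, then outputs text."]
page = [(item_type(t), t) for t in recognized]
sections = group_sections(page)
```

Here the first body block is attributed to “2.2. Section text” only because the title that follows it is numbered “2.3.”, which mirrors the reasoning in the paragraph above.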
As described above, the specific example 2 takes as the document of a predetermined format a paper whose sections are numbered sequentially, and extracts the logical structure by focusing on those section numbers.
[0039]
[Operation of Importance Assigning Unit 31]
The importance assigning unit 31 acquires the importance assignment target information stored in the importance assignment target information storage unit 200 by the logical structure extraction unit 21, and further acquires the importance assignment criterion stored in advance in the importance assignment reference holding unit 300.
FIG. 13 is an explanatory diagram illustrating an example of the importance assignment criterion.
Then, based on the correspondence among the logical structure type, the logical structure content, and the importance described in the importance assignment criterion, the importance assigning unit 31 creates importance assignment target information (including importance) describing the logical structure type, the logical structure area, and the importance, and stores the created information in the importance assignment target information storage unit 200 (step S31).
FIG. 14 is an explanatory diagram illustrating an example of importance assignment target information (including importance).
For example, when the importance assignment target information stored in the importance assignment target information storage unit 200 by the logical structure extraction unit 21 is the importance assignment target information illustrated in FIG. and the importance assignment criterion stored in advance in the importance assignment reference holding unit 300 is the one illustrated in FIG. 13, the importance assigning unit 31 creates the importance assignment target information (including importance) illustrated in FIG. 14, which describes the logical structure type, the logical structure area, and the importance. It does so based on the correspondence, described in the criterion of FIG. 13, among logical structure types such as “XX. Section text”, logical structure contents such as “includes “OCR”” and “unspecified”, and importances such as “high” and “medium”. The created information is stored in the importance assignment target information storage unit 200. Here, “XX” in “XX. Section text” stands for an arbitrary section number.
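The comparison against the criterion can be sketched as a small rule table. The rows below are hypothetical and are not the contents of FIG. 13; the rule that the first matching row wins and the fallback value "low" are also assumptions of this sketch. “XX” is treated as a wildcard for the section number, and `None` plays the role of “unspecified”.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# (logical structure type pattern, required keyword or None, importance)
Criterion = Tuple[str, Optional[str], str]

# Hypothetical criterion rows for illustration only.
CRITERIA: List[Criterion] = [
    ("XX. Section text", "OCR", "high"),
    ("XX. Section text", None, "medium"),
]


def assign_importance(structure_type: str, contents: List[str],
                      criteria: List[Criterion] = CRITERIA,
                      default: str = "low") -> str:
    """Return the importance of the first criterion row that matches."""
    text = " ".join(contents)
    for pattern, keyword, importance in criteria:
        # "XX" stands for an arbitrary section number.
        if not fnmatchcase(structure_type, pattern.replace("XX", "*")):
            continue
        if keyword is None or keyword in text:
            return importance
    return default


high = assign_importance("2.3. Section text", ["The OCR engine reads the page."])
medium = assign_importance("2.2. Section text", ["Layout analysis is described."])
```

Because the keyword is searched in the joined contents of the whole logical structure, a single importance is returned for the entire section text rather than for each “body” item.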
[0040]
In the above description of the operation, words such as “2.2. Section text” and “2.3. Section text” are used for the logical structure type, but any words may be used as long as they make it easy to understand that they denote a structure of items based on the content of the document.
A set of circumscribed rectangles “(upper left X coordinate, upper left Y coordinate, lower right X coordinate, lower right Y coordinate)” of the items forming the logical structure is used for the logical structure area of the importance assignment target information, but any format may be used as long as it makes it easy to understand that it is a set of coordinate values of the circumscribed rectangles of the items forming the logical structure.
The method described above is used to extract logical structure types such as “2.2. Section text” and “2.3. Section text”, logical structure areas such as “2.2. Section text area” and “2.3. Section text area”, and logical structure contents such as “2.2. Section text content” and “2.3. Section text content”, but any method may be used as long as these can be extracted.
Specification items such as “includes the character string “OCR”” are used for the logical structure content of the importance assignment criterion, but any specification item may be used as long as it is easy to understand as a condition specifying the logical structure content.
A word such as “unspecified” is used for the logical structure content of the importance assignment criterion, but any word may be used as long as it makes it easy to understand that the logical structure content is not specified. In addition, “unspecified” may be used for all logical structure contents.
Words such as “high” and “medium” are used for the importance of the importance assignment criterion, but any words may be used as long as they make the importance and its level easy to understand. Further, there may be only two levels of importance, that is, important and not important.
[0041]
<effect>
As described above, according to the specific example 2, the logical structure indicating the relationship between the items in the document is extracted from the document image, and importance is assigned to each logical structure using the importance assignment criterion registered in advance. The following effects are therefore obtained.
That is, importance is automatically assigned to each logical structure according to the content of that logical structure, which is a structure of items based on the content of the document. Therefore, when the document is made public, “2.2. Section text” and “2.3. Section text”, whose importance is relatively high, can be kept private from certain viewers, while only “2.3. Section text”, whose importance is high, is kept private from other viewers. That is, the disclosure or non-disclosure of each logical structure can be adjusted according to its importance. This shows that even a user who does not know the content of the document in depth can set importance that reflects, to some extent, how important the content is, for example by reasoning that text containing an important keyword should be kept private because the text itself may be important. In addition, because the logical structure “2.3. Section text” is a set of “body” items, the specific example 1 may assign a high importance only to those “body” items within “2.3. Section text” that contain an important keyword, whereas the specific example 2 can assign an appropriate importance to the section text as a whole. Also, by using a specification item such as “the content of the XX. Section title includes the character string “OCR”” as the logical structure content of the importance assignment criterion, the corresponding “XX. Section text” can be derived efficiently by searching only the logical structure content of the “XX. Section title”.
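The adjustment of disclosure per viewer can be pictured with a small filter. This is only a sketch: the ordering of the levels, the function name `visible_structures`, and the idea of giving each viewer a lowest hidden level are assumptions made for illustration and are not taken from the description of the output unit 40.

```python
from typing import Dict, List

# Assumed ordering of importance levels.
LEVELS = {"low": 0, "medium": 1, "high": 2}


def visible_structures(assigned: Dict[str, str], hide_from: str) -> List[str]:
    """Return the logical structures a viewer may see.

    `hide_from` is the lowest importance that is kept private from this viewer.
    """
    limit = LEVELS[hide_from]
    return [name for name, importance in assigned.items()
            if LEVELS[importance] < limit]


# Made-up assignment result for illustration only.
assigned = {"2.2. Section text": "medium", "2.3. Section text": "high"}
strict_view = visible_structures(assigned, hide_from="medium")   # hides both
relaxed_view = visible_structures(assigned, hide_from="high")    # hides only 2.3
```

With these sample values, a viewer restricted from “medium” upward sees neither section text, while a viewer restricted only from “high” still sees “2.2. Section text”.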
[0042]
<< Specific Example 3 >>
In the specific example 3, the document type is obtained from the image information in the document, and the importance is given in consideration of the document type.
[0043]
<Constitution>
FIG. 15 is a configuration diagram of the specific example 3.
The illustrated image processing device 1b is connected to the input device 2 and the storage device 3, and includes an input unit 10, a document identification unit 52, an item extraction unit 22, an importance assigning unit 32, an output unit 40, a document image storage unit 100, an importance assignment target information storage unit 200, and an importance assignment reference holding unit 300.
Here, the input unit 10, the output unit 40, the document image storage unit 100, the importance assignment target information storage unit 200, and the importance assignment reference holding unit 300 are the same as those in the specific examples 1 and 2, and thus their description is omitted. However, the importance assignment criterion held by the importance assignment reference holding unit 300 in the specific example 3 is data that associates an importance with a document type in addition to the item type and the item content.
[0044]
The document identification unit 52 is a functional unit that identifies, from a document image of a predetermined format stored in the document image storage unit 100, a document type indicating the type of the document, based on the format of the document. The item extraction unit 22 has a function of performing the same processing as the item extraction unit 20 of the specific example 1, based on the document identification result of the document identification unit 52. Further, the importance assigning unit 32 is a functional unit that compares the document type identified by the document identification unit 52 and the item type and item content extracted by the item extraction unit 22 against the importance assignment criterion of the importance assignment reference holding unit 300 and assigns an importance to each item.
[0045]
<Operation>
FIG. 16 is a flowchart illustrating the operation of the third example.
The description of the input unit 10 and the output unit 40 that perform the same operations as those of the first and second examples will be omitted. That is, in FIG. 16, the description of step S10 and step S40 is omitted.
[0046]
[Operation of Document Identification Unit 52]
The document identification unit 52 acquires the document image stored in the document image storage unit 100 by the input unit 10 and identifies the document type of the acquired document image. It then creates importance assignment target information (document type only) describing the identified document type, and stores the created information in the importance assignment target information storage unit 200 (step S52).
FIG. 17 is an explanatory diagram illustrating an example of the document image in the specific example 3.
For example, when the document image stored in the document image storage unit 100 by the input unit 10 is the paper image shown in FIG. 17 (the paper image of FIG. 3 with the document type “paper A” written at its upper right end), the document identification unit 52 extracts the document type “paper A” written at the upper right end of the paper image and thereby identifies the document type of the paper image. It then creates importance assignment target information (document type only) describing the identified document type, and stores the created information in the importance assignment target information storage unit 200.
[0047]
FIG. 18 is an explanatory diagram illustrating an example of the importance assignment target information (only the document type).
Hereinafter, extraction of the document type will be described.
The document identification unit 52 first extracts sets of black pixels by examining the connectivity of the black pixels near the upper right end of the paper image. If the height and width of the circumscribed rectangle of an extracted black pixel set are greater than a given threshold value, the set is determined to be a component whose component type is “frame”, and its circumscribed rectangle is extracted as a “frame area”. On the other hand, if the height and width of the circumscribed rectangle are smaller than a given threshold value, the set is determined to be a component whose component type is “character”, and its circumscribed rectangle is extracted as a “character area”. When “character” components are arranged in the horizontal direction and the horizontal distance between two adjacent “character” components is smaller than a given threshold value, the set of horizontally arranged “character” components is determined to be a component whose component type is “character string”, and the circumscribed rectangle of that set is extracted as a “character string area”. Further, character recognition is performed on the set of “character” components making up the “character string” component, and the obtained character recognition result (“paper A”) is extracted as the document type.
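Restricting the search to the neighborhood of a page corner can be sketched as follows. This is a sketch under assumptions: the “character string” areas and their character recognition results are taken as already available, and the function names, the `fraction` parameter that defines how large the corner neighborhood is, and the sample coordinates are all hypothetical.

```python
from typing import List, Optional, Tuple

# (upper-left X, upper-left Y, lower-right X, lower-right Y)
Rect = Tuple[int, int, int, int]


def near_corner(rect: Rect, page_width: int, page_height: int,
                corner: str = "upper-right", fraction: float = 0.2) -> bool:
    """True if the rectangle lies entirely inside the given corner neighborhood."""
    x0, y0, x1, y1 = rect
    in_top = y1 <= page_height * fraction
    if corner == "upper-right":
        return in_top and x0 >= page_width * (1 - fraction)
    if corner == "upper-left":
        return in_top and x1 <= page_width * fraction
    raise ValueError(f"unsupported corner: {corner}")


def label_from_corner(strings: List[Tuple[Rect, str]], page_width: int,
                      page_height: int, corner: str = "upper-right") -> Optional[str]:
    """Return the recognized text of the first character string found near the corner."""
    for rect, text in strings:
        if near_corner(rect, page_width, page_height, corner):
            return text
    return None


# Made-up character string areas with their recognition results.
recognized = [((850, 20, 980, 50), "paper A"),
              ((100, 300, 500, 330), "2.3. OCR processing")]
document_type = label_from_corner(recognized, page_width=1000, page_height=1400)
```

Only the string in the corner neighborhood is considered, so ordinary body text elsewhere on the page cannot be mistaken for the document type.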
[0048]
[Operation of item extraction unit 22]
The item extraction unit 22 acquires the document image stored in the document image storage unit 100 by the input unit 10 and, for all or some of the items in the acquired document image, extracts an item type, an item area (the coordinate values of the circumscribed rectangle of the item), and item content (the character recognition result of the item). An item is a unit that forms the content of a document. Further, the item extraction unit 22 acquires the importance assignment target information (document type only) stored in the importance assignment target information storage unit 200 by the document identification unit 52, creates importance assignment target information describing the document type together with the extracted item type, item area, and item content, and stores the created information in the importance assignment target information storage unit 200 (step S22).
For example, when the document image stored in the document image storage unit 100 by the input unit 10 is the paper image shown in FIG. 17, the item extraction unit 22 extracts, for all or some of the items in that paper image, item types such as “body”, “section title”, “figure”, and “caption”, item areas such as “body area”, “section title area”, “figure area”, and “caption area” (the coordinate values of the circumscribed rectangle of each item), and item contents such as “body content”, “section title content”, and “caption content” (the character recognition result of each item). However, item content is extracted only for items whose item type is determined from components whose component type is “text” shown in FIG. That is, item content is extracted only for item types such as “body”, “section title”, and “caption”. Further, the item extraction unit 22 acquires the importance assignment target information (document type only) shown in FIG. 18, which was stored in the importance assignment target information storage unit 200 by the document identification unit 52, creates importance assignment target information describing the document type “paper A” together with the extracted item types, item areas, and item contents, and stores the created information in the importance assignment target information storage unit 200.
FIG. 19 is an explanatory diagram illustrating an example of the importance level assignment target information in the specific example 3. In FIG. 19, the item area is indicated by the circumscribed rectangle “(upper left X coordinate, upper left Y coordinate, lower right X coordinate, lower right Y coordinate)” of the item. The item type, the item area, and the item content are arranged in order from top to bottom in the left column.
The extraction of item types such as “body”, “section title”, “figure”, and “caption”, item areas such as “body area”, “section title area”, “figure area”, and “caption area”, and item contents such as “body content”, “section title content”, and “caption content” is the same as in the specific example 1.
[0049]
[Operation of Importance Assigning Unit 32]
The importance assigning unit 32 acquires the importance assignment target information stored in the importance assignment target information storage unit 200 by the item extraction unit 22, and further acquires the importance assignment criterion stored in advance in the importance assignment reference holding unit 300.
FIG. 20 is an explanatory diagram illustrating an example of the importance assignment criterion.
Then, based on the correspondence among the document type, the item type, the item content, and the importance described in the importance assignment criterion, the importance assigning unit 32 creates importance assignment target information (including importance) describing the document type, the item type, the item area, and the importance, and stores the created information in the importance assignment target information storage unit 200 (step S32).
FIG. 21 is an explanatory diagram illustrating an example of importance assignment target information (including importance).
For example, when the importance assignment target information stored in the importance assignment target information storage unit 200 by the item extraction unit 22 is the importance assignment target information shown in FIG. and the importance assignment criterion stored in advance in the importance assignment reference holding unit 300 is the one shown in FIG. 20, the importance assigning unit 32 creates the importance assignment target information (including importance) shown in FIG. 21. It does so based on the correspondence, described in the criterion of FIG. 20, among document types such as “paper A” and “unspecified”, item types such as “section title”, “figure”, and “caption”, item contents such as “includes the character string “OCR”” and “unspecified”, and importances such as “high”, “low”, “medium”, “low”, “high”, and “high”. The created information is stored in the importance assignment target information storage unit 200.
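Adding the document type to the criterion amounts to one more column in the rule table, with “unspecified” acting as a wildcard. The rows below are hypothetical and do not reproduce FIG. 20; the first-match rule, the fallback value "low", and the name `importance_for` are likewise assumptions of this sketch.

```python
from typing import List, Optional, Tuple

# (document type or None, item type, required keyword or None, importance)
# None plays the role of "unspecified".
Criterion = Tuple[Optional[str], str, Optional[str], str]

# Hypothetical criterion rows for illustration only.
CRITERIA: List[Criterion] = [
    ("paper A", "body", None, "high"),
    (None, "body", "OCR", "high"),
    (None, "body", None, "low"),
    (None, "figure", None, "high"),
    (None, "caption", None, "high"),
    (None, "section title", None, "low"),
]


def importance_for(document_type: str, item_type: str, content: str,
                   criteria: List[Criterion] = CRITERIA,
                   default: str = "low") -> str:
    """Return the importance of the first row matching all specified columns."""
    for want_document, want_item, keyword, importance in criteria:
        if want_document is not None and want_document != document_type:
            continue
        if want_item != item_type:
            continue
        if keyword is not None and keyword not in content:
            continue
        return importance
    return default


in_paper_a = importance_for("paper A", "body", "Plain introduction.")
in_other = importance_for("paper B", "body", "Plain introduction.")
```

With these sample rows, the same “body” item is rated “high” in “paper A” but “low” in another document unless it contains the keyword, which is the per-document adjustment the criterion is meant to allow.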
[0050]
In the above description of the operation, a word such as “paper A” is used for the document type, but any word may be used as long as it is easy to understand as a document type. Further, although the method described above is used to extract the document type, any method may be used as long as the document type can be extracted.
Further, although a word such as “unspecified” is used for the document type of the importance assignment criterion, any word may be used as long as it makes it easy to understand that the document type is not specified. In addition, “unspecified” may be used for all document types.
The other points are the same as in the specific example 1.
[0051]
<effect>
As described above, according to the specific example 3, in addition to the configuration of the specific example 1, the document type is extracted based on the document format, and the importance of each item is assigned in consideration of the document type. Therefore, the following effects are obtained.
That is, importance is automatically assigned to each item according to the document type and the content of the item, which is a unit forming the content of the document. Therefore, when documents are made public, the “body”, “figure”, and “caption” items, whose importance is relatively high, can be kept private in one document, while in another document only the “body” items containing an important keyword are kept private in addition to the “figure” and “caption” items whose importance is high (a “body” item kept private in one document can be made public in another document). That is, the disclosure or non-disclosure of items according to importance can be adjusted for each document type and item content. Compared with the specific example 1, disclosure can be controlled more finely because it can be adjusted for each document type.
[0052]
<< Specific Example 4 >>
In the specific example 4, a use type indicating the purpose of use of the document is obtained from the image information in the document, and importance is given in consideration of the use type.
[0053]
<Constitution>
FIG. 22 is a configuration diagram of the specific example 4.
The illustrated image processing device 1c is connected to the input device 2 and the storage device 3, and includes an input unit 10, a use identifying unit 53, an item extraction unit 23, an importance assigning unit 33, an output unit 40, a document image storage unit 100, an importance assignment target information storage unit 200, and an importance assignment reference holding unit 300.
Here, the input unit 10, the output unit 40, the document image storage unit 100, the importance assignment target information storage unit 200, and the importance assignment reference holding unit 300 are the same as those in the specific examples 1 to 3, and thus their description is omitted. However, the importance assignment criterion held by the importance assignment reference holding unit 300 in the specific example 4 is data that associates an importance with a use type in addition to the item type and the item content.
[0054]
The use identifying unit 53 is a functional unit that identifies, from a document image of a predetermined format stored in the document image storage unit 100, a use type indicating the purpose for which the document is used, based on the format of the document. The item extraction unit 23 has a function of performing the same processing as the item extraction unit 20 of the specific example 1, based on the use identification result of the use identifying unit 53. Further, the importance assigning unit 33 is a functional unit that compares the use type identified by the use identifying unit 53 and the item type and item content extracted by the item extraction unit 23 against the importance assignment criterion of the importance assignment reference holding unit 300 and assigns an importance to each item.
[0055]
<Operation>
FIG. 23 is a flowchart showing the operation of the specific example 4.
The description of the input unit 10 and the output unit 40 that perform the same operations as those of the first to third examples will be omitted. That is, in FIG. 23, the description of step S10 and step S40 is omitted.
[0056]
[Operation of Use Identifying Unit 53]
The use identifying unit 53 acquires the document image stored in the document image storage unit 100 by the input unit 10 and identifies the use type of the acquired document image. It then creates importance assignment target information (use type only) describing the identified use type, and stores the created information in the importance assignment target information storage unit 200 (step S53).
FIG. 24 is an explanatory diagram illustrating an example of the document image in the specific example 4.
For example, when the document image stored in the document image storage unit 100 by the input unit 10 is the paper image shown in FIG. 24 (the paper image of FIG. 3 with the use type “internal publication” written at its upper left end), the use identifying unit 53 extracts the use type “internal publication” written at the upper left end of the paper image and thereby identifies the use type of the paper image. It then creates importance assignment target information (use type only) describing the identified use type, and stores the created information in the importance assignment target information storage unit 200.
[0057]
FIG. 25 is an explanatory diagram illustrating an example of the importance assignment target information (use type only).
The extraction of the use type is described below.
The use identifying unit 53 first extracts sets of black pixels by examining the connectivity of the black pixels near the upper left end of the paper image. If the height and width of the circumscribed rectangle of an extracted black pixel set are greater than a given threshold value, the set is determined to be a component whose component type is “frame”, and its circumscribed rectangle is extracted as a “frame area”. On the other hand, if the height and width of the circumscribed rectangle are smaller than a given threshold value, the set is determined to be a component whose component type is “character”, and its circumscribed rectangle is extracted as a “character area”. When “character” components are arranged in the horizontal direction and the horizontal distance between two adjacent “character” components is smaller than a given threshold value, the set of horizontally arranged “character” components is determined to be a component whose component type is “character string”, and the circumscribed rectangle of that set is extracted as a “character string area”. Further, character recognition is performed on the set of “character” components making up the “character string” component, and the obtained character recognition result (“internal publication”) is extracted as the use type.
[0058]
[Operation of item extraction unit 23]
The item extraction unit 23 acquires the document image stored in the document image storage unit 100 by the input unit 10 and, for all or some of the items in the acquired document image, extracts the item type, the item area (the circumscribed rectangle of the item), and the item content (the character recognition result of the item). An item is a unit that forms the content of a document. The item extraction unit 23 also acquires the importance assignment target information (use type only) stored in the importance assignment target information storage unit 200 by the use identification unit 53, creates importance assignment target information describing the use type together with the extracted item type, item area, and item content, and stores it in the importance assignment target information storage unit 200 (step S23).
For example, when the document image stored in the document image storage unit 100 by the input unit 10 is the paper image shown in FIG. 24, the item extraction unit 23 extracts, for all or some of the items in that paper image, item types such as “body”, “section title”, “figure”, and “caption”; item areas such as “body area”, “section title area”, “figure area”, and “caption area” (the coordinate values of the circumscribed rectangle of each item); and item contents such as “body content”, “section title content”, and “caption content” (the character recognition result of each item). However, item content is extracted only for items whose item type was determined from components whose component type is “text” shown in FIG. That is, item content is extracted only for items whose item type is “body”, “section title”, “caption”, or the like. The item extraction unit 23 also acquires the importance assignment target information (use type only) stored in the importance assignment target information storage unit 200 by the use identification unit 53, creates importance assignment target information describing the use type “internal publication” together with the extracted item types such as “body”, “section title”, “figure”, and “caption”, item areas such as “body area”, “section title area”, “figure area”, and “caption area”, and item contents such as “body content”, “section title content”, and “caption content”, and stores it in the importance assignment target information storage unit 200.
[0059]
FIG. 26 is an explanatory diagram illustrating an example of importance assignment target information.
In FIG. 26, the item area is indicated by the circumscribed rectangle "(upper left X coordinate, upper left Y coordinate, lower right X coordinate, lower right Y coordinate)" of the item. The item type, the item area, and the item content are arranged in order from top to bottom in the left column.
The extraction of item types such as “body”, “section title”, “figure”, and “caption”, item areas such as “body area”, “section title area”, “figure area”, and “caption area”, and item contents such as “body content”, “section title content”, and “caption content” is the same as in specific example 1.
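As an illustration of the record described above, the following sketch models importance assignment target information as a use type plus a list of items, each with its type, circumscribed rectangle, and recognized content. The class names, field names, coordinates, and text values are all hypothetical and are not taken from FIG. 26.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (upper left X, upper left Y, lower right X, lower right Y)
Rect = Tuple[int, int, int, int]


@dataclass
class Item:
    item_type: str                 # e.g. "body", "section title", "figure", "caption"
    area: Rect                     # circumscribed rectangle of the item
    content: Optional[str] = None  # character recognition result; None for non-text items


@dataclass
class ImportanceTargetInfo:
    use_type: str
    items: List[Item] = field(default_factory=list)


# Made-up sample values for illustration only.
info = ImportanceTargetInfo(use_type="internal publication")
info.items.append(Item("section title", (50, 40, 400, 70), "1. Introduction"))
info.items.append(Item("body", (50, 80, 400, 300), "This paper describes an OCR method."))
info.items.append(Item("figure", (420, 80, 760, 300)))

for item in info.items:
    print(info.use_type, item.item_type, item.area, item.content)
```

Leaving `content` empty for the figure mirrors the statement that item content is extracted only for text items.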
[0060]
[Operation of importance assigning unit 33]
The importance assigning unit 33 acquires the importance assignment target information stored in the importance assignment target information storage unit 200 by the item extraction unit 23, and also acquires the importance assignment criterion held in advance in the importance assignment criterion holding unit 300.
FIG. 27 is an explanatory diagram illustrating an example of the importance assignment criterion.
Then, based on the correspondence described in the importance assignment criterion between use type, item type, and item content on the one hand and importance on the other, the importance assigning unit 33 creates importance assignment target information (including importance) describing the use type, item type, item area, and importance, and stores it in the importance assignment target information storage unit 200 (step S33).
FIG. 28 is an explanatory diagram illustrating an example of importance assignment target information (including importance).
For example, suppose the importance assignment target information stored in the importance assignment target information storage unit 200 by the item extraction unit 23 is the importance assignment target information shown in FIG. When the importance assignment criterion is the one shown in FIG. 27, the importance assigning unit 33 uses the correspondence described in that criterion between use types such as “internal publication” and “no designation”, item types such as “body”, “section title”, “figure”, and “caption”, item contents such as “includes the character string ‘OCR’” and “no designation”, and importance levels such as “high”, “low”, “medium”, “low”, “high”, and “high”. On that basis it creates the importance assignment target information (including importance) shown in FIG., which describes the use type, item type, item area, and importance, and stores it in the importance assignment target information storage unit 200.
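The matching step can be sketched as a table lookup. The criterion rows below are hypothetical and do not reproduce FIG. 27; the first-match-wins order and the treatment of “no designation” as a wildcard are assumptions made for this illustration, as are the names `CRITERION`, `lookup`, and `assign`.

```python
NO_DESIGNATION = "no designation"

# Hypothetical criterion rows: (use type, item type, keyword, importance).
CRITERION = [
    ("internal publication", "body", "OCR", "high"),
    ("internal publication", "body", NO_DESIGNATION, "low"),
    ("internal publication", "section title", NO_DESIGNATION, "medium"),
    (NO_DESIGNATION, "figure", NO_DESIGNATION, "high"),
    (NO_DESIGNATION, "caption", NO_DESIGNATION, "high"),
]


def lookup(use_type, item_type, content):
    """Return the importance of the first matching row, or None if no row matches."""
    for c_use, c_item, c_keyword, importance in CRITERION:
        use_ok = c_use in (NO_DESIGNATION, use_type)
        keyword_ok = c_keyword == NO_DESIGNATION or (
            content is not None and c_keyword in content)
        if use_ok and c_item == item_type and keyword_ok:
            return importance
    return None


def assign(use_type, items):
    """Build output rows of (use type, item type, item area, importance).

    As in the text, the item content is used for matching but not carried
    into the output.
    """
    return [(use_type, item_type, area, lookup(use_type, item_type, content))
            for item_type, area, content in items]


items = [
    ("section title", (50, 40, 400, 70), "2. Method"),
    ("body", (50, 80, 400, 300), "The OCR result is compared with a dictionary."),
    ("body", (50, 310, 400, 500), "Related work is summarized here."),
    ("figure", (420, 80, 760, 300), None),
]
rows = assign("internal publication", items)
for row in rows:
    print(row)
```

Placing the keyword row before the general “body” row is what lets a body paragraph containing the keyword receive a higher importance than other body paragraphs.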
[0061]
In the above description of the operation, a phrase such as “internal publication” is used as the use type, but any wording may be used as long as it is easily understood to be a use type. Likewise, although the method described above is used to extract the use type, any method may be used as long as the use type can be extracted.
Although a phrase such as “no designation” is used as the use type in the importance assignment criterion, any wording may be used as long as it is easily understood that no use type is specified. “No designation” may also be set for every use type.
In other respects, this example is the same as specific example 1.
[0062]
<Effect>
As described above, specific example 4 adds to the configuration of specific example 1 the extraction of a use type based on the document format, and assigns the importance of each item with the use type taken into account. The following effects are therefore obtained.
That is, importance is assigned automatically according to the use type and to the type and content of each item, an item being a unit that forms the content of the document. For one use, the highly important “body”, “figure”, and “caption” can be kept private; for another use, only a “body” containing an important keyword can be kept private in addition to the highly important “figure” and “caption” (a “body” kept private for one use can be made public for another). In other words, the disclosure or non-disclosure of items according to their importance can be adjusted for each use type and item content. Because the adjustment can be made per use type, the document can be disclosed in a finer-grained manner than in specific example 1.
[0063]
<< Specific Example 5 >>
In specific example 5, a use type indicating the purpose for which the document is used and a document type indicating what kind of document it is are obtained from image information in the document, and importance is assigned with the use type and the document type taken into account.
[0064]
<Configuration>
FIG. 29 is a configuration diagram of the specific example 5.
The illustrated image processing device 1d is connected to the input device 2 and the storage device 3, and comprises an input unit 10, a use / document identification unit 54, an item extraction unit 24, an importance assigning unit 34, an output unit 40, a document image storage unit 100, an importance assignment target information storage unit 200, and an importance assignment criterion holding unit 300.
Here, the input unit 10, the output unit 40, the document image storage unit 100, the importance assignment target information storage unit 200, and the importance assignment criterion holding unit 300 are the same as in specific examples 1 to 4, and their description is omitted. However, the importance assignment criterion held in the importance assignment criterion holding unit 300 of specific example 5 is importance data that covers the document type and the use type in addition to the item type and the item content.
[0065]
The use / document identification unit 54 is a functional unit that, from a document image of a predetermined format stored in the document image storage unit 100, identifies, based on the format of the document, a use type indicating what the document is used for and a document type indicating what kind of document it is. The item extraction unit 24 performs the same processing as the item extraction unit 20 of specific example 1, based on the use type and document type identified by the use / document identification unit 54. The importance assigning unit 34 is a functional unit that compares the use type and document type identified by the use / document identification unit 54, and the item type and item content extracted by the item extraction unit 24, with the importance assignment criterion held in the importance assignment criterion holding unit 300, and assigns an importance to each item.
[0066]
<Operation>
FIG. 30 is a flowchart showing the operation of the specific example 5.
The description of the input unit 10 and the output unit 40, which operate in the same way as in specific examples 1 to 4, is omitted. That is, the description of steps S10 and S40 in FIG. 30 is omitted.
[0067]
[Operation of Usage / Document Identification Unit 54]
The use / document identification unit 54 acquires the document image stored in the document image storage unit 100 by the input unit 10 and identifies the use type and the document type of the acquired document image. It then creates importance assignment target information (use / document type only) describing the identified use type and document type, and stores it in the importance assignment target information storage unit 200 (step S54).
FIG. 31 is an explanatory diagram illustrating an example of a document image in the specific example 5.
For example, suppose the document image stored in the document image storage unit 100 by the input unit 10 is the paper image shown in FIG. 31 (the paper image shown in FIG. with the use type “internal publication” written in its upper left corner). The use / document identification unit 54 extracts the use type “internal publication” written in the upper left corner of the paper image and the document type “paper A” written in its upper right corner, and thereby identifies the use type and the document type of the paper image. It then creates importance assignment target information (use / document type only) describing the identified use type and document type, and stores it in the importance assignment target information storage unit 200.
[0068]
FIG. 32 is an explanatory diagram illustrating an example of importance assignment target information (only use / document type).
Hereinafter, extraction of the usage type and the document type will be described.
The use / document identification unit 54 first extracts sets of connected black pixels by examining the connectivity of black pixels near the upper left corner of the paper image. If the height and width of the circumscribed rectangle of an extracted black pixel set are larger than an arbitrary value, the set is determined to be a component whose component type is “frame”, and its circumscribed rectangle is extracted as a “frame area”. Conversely, if the height and width of the circumscribed rectangle are smaller than an arbitrary value, the set is determined to be a component whose component type is “character”, and its circumscribed rectangle is extracted as a “character area”. If “character” components are arranged in the horizontal direction and the horizontal distance between two adjacent “character” components is smaller than an arbitrary value, the set of horizontally arranged “character” components is determined to be a component whose component type is “character string”, and the circumscribed rectangle of that set is extracted as a “character string area”. Character recognition is then performed on the “character” components that make up the “character string”, and the recognition result (“internal publication”) is extracted as the use type. Subsequently, sets of connected black pixels are extracted by examining the connectivity of black pixels near the upper right corner of the paper image, and the character recognition result (“paper A”) obtained by the same processing is extracted as the document type.
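The first step, collecting connected sets of black pixels near each upper corner, can be sketched as a breadth-first search restricted to a corner region. The text does not say how large "near the corner" is or which connectivity is used, so the `fraction` parameter and the 8-neighbour connectivity below are assumptions, and the function names are hypothetical. The image is a plain list of rows in which 1 means a black pixel.

```python
from collections import deque


def corner_regions(width, height, fraction=0.25):
    """Return half-open (x0, y0, x1, y1) regions at the upper left and upper right."""
    w = max(1, int(width * fraction))
    h = max(1, int(height * fraction))
    return {"upper_left": (0, 0, w, h), "upper_right": (width - w, 0, width, h)}


def connected_components(image, region):
    """Return the circumscribed rectangle (left, top, right, bottom), with
    inclusive coordinates, of each 8-connected set of black pixels in region."""
    x0, y0, x1, y1 = region
    seen = set()
    rects = []
    for y in range(y0, y1):
        for x in range(x0, x1):
            if image[y][x] != 1 or (x, y) in seen:
                continue
            seen.add((x, y))
            queue = deque([(x, y)])
            left = right = x
            top = bottom = y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (x0 <= nx < x1 and y0 <= ny < y1
                                and image[ny][nx] == 1 and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            queue.append((nx, ny))
            rects.append((left, top, right, bottom))
    return rects


image = [
    [1, 1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
]
regions = corner_regions(width=8, height=4, fraction=0.5)
for name, region in regions.items():
    print(name, connected_components(image, region))
```

The rectangles found in the upper left region would feed the use type recognition and those in the upper right region the document type recognition; a production system would more likely rely on an image processing library for this labeling step.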
[0069]
[Operation of item extraction unit 24]
The item extraction unit 24 acquires the document image stored in the document image storage unit 100 by the input unit 10 and, for all or some of the items in the acquired document image, extracts the item type, the item area (the circumscribed rectangle of the item), and the item content (the character recognition result of the item). An item is a unit that forms the content of a document. The item extraction unit 24 also acquires the importance assignment target information (use / document type only) stored in the importance assignment target information storage unit 200 by the use / document identification unit 54, creates importance assignment target information describing the use type and document type together with the extracted item type, item area, and item content, and stores it in the importance assignment target information storage unit 200 (step S24).
For example, when the document image stored in the document image storage unit 100 by the input unit 10 is the paper image shown in FIG. 31, the item extraction unit 24 extracts, for all or some of the items in that paper image, item types such as “body”, “section title”, “figure”, and “caption”; item areas such as “body area”, “section title area”, “figure area”, and “caption area”; and item contents such as “body content”, “section title content”, and “caption content” (the character recognition result of each item). However, item content is extracted only for items whose item type was determined from components whose component type is “text” shown in FIG. That is, item content is extracted only for items whose item type is “body”, “section title”, “caption”, or the like. The item extraction unit 24 also acquires the importance assignment target information (use / document type only) shown in FIG. 32, stored in the importance assignment target information storage unit 200 by the use / document identification unit 54, creates importance assignment target information describing the use type “internal publication” and the document type “paper A” together with the extracted item types such as “body”, “section title”, “figure”, and “caption”, item areas such as “body area”, “section title area”, “figure area”, and “caption area”, and item contents such as “body content”, “section title content”, and “caption content”, and stores it in the importance assignment target information storage unit 200.
[0070]
FIG. 33 is an explanatory diagram illustrating an example of the importance level assignment target information in the specific example 5. In FIG. 33, the item area is indicated by the circumscribed rectangle "(upper left X coordinate, upper left Y coordinate, lower right X coordinate, lower right Y coordinate)" of the item. The item type, the item area, and the item content are arranged in order from top to bottom in the left column.
The extraction of item types such as “body”, “section title”, “figure”, and “caption”, item areas such as “body area”, “section title area”, “figure area”, and “caption area”, and item contents such as “body content”, “section title content”, and “caption content” is the same as in specific example 1.
[0071]
[Operation of Importance Assigning Unit 34]
The importance assigning unit 34 acquires the importance assignment target information stored in the importance assignment target information storage unit 200 by the item extraction unit 24, and also acquires the importance assignment criterion held in advance in the importance assignment criterion holding unit 300.
FIG. 34 is an explanatory diagram illustrating an example of the importance assignment criterion.
Then, based on the correspondence described in the importance assignment criterion between use type, document type, item type, and item content on the one hand and importance on the other, the importance assigning unit 34 creates importance assignment target information (including importance) describing the use type, document type, item type, item area, and importance, and stores it in the importance assignment target information storage unit 200 (step S34).
FIG. 35 is an explanatory diagram illustrating an example of importance assignment target information (including importance).
For example, suppose the importance assignment target information stored in the importance assignment target information storage unit 200 by the item extraction unit 24 is the importance assignment target information shown in FIG. When the importance assignment criterion is the one shown in FIG. 34, the importance assigning unit 34 uses the correspondence described in that criterion between use types such as “internal publication” and “no designation”, document types including “no designation”, item types such as “body”, “section title”, “figure”, and “caption”, item contents such as “includes the character string ‘OCR’” and “no designation”, and importance levels such as “low”, “medium”, and “high”. On that basis it creates the importance assignment target information (including importance) shown in FIG. 35, which describes the use type, document type, item type, item area, and importance, and stores it in the importance assignment target information storage unit 200.
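With both a use type and a document type in the criterion, the same item can receive a different importance depending on the document it comes from. The sketch below illustrates this with made-up criterion rows that do not reproduce FIG. 34; “paper B” is a hypothetical second document type, and treating “no designation” as a wildcard with first-match-wins ordering is an assumption.

```python
NO_DESIGNATION = "no designation"

# Hypothetical rows: (use type, document type, item type, keyword, importance).
CRITERIA = [
    ("internal publication", "paper A", "body", "OCR", "high"),
    ("internal publication", "paper A", "body", NO_DESIGNATION, "low"),
    ("internal publication", NO_DESIGNATION, "body", NO_DESIGNATION, "medium"),
    (NO_DESIGNATION, NO_DESIGNATION, "figure", NO_DESIGNATION, "high"),
]


def field_matches(criterion_value, actual):
    """A criterion field matches when it is undesignated or equal to the actual value."""
    return criterion_value == NO_DESIGNATION or criterion_value == actual


def content_matches(keyword, content):
    """A keyword matches when it is undesignated or contained in the item content."""
    return keyword == NO_DESIGNATION or (content is not None and keyword in content)


def importance_for(use_type, document_type, item_type, content):
    """Return the importance of the first matching row, or None if none matches."""
    for c_use, c_doc, c_item, c_keyword, importance in CRITERIA:
        if (field_matches(c_use, use_type)
                and field_matches(c_doc, document_type)
                and c_item == item_type
                and content_matches(c_keyword, content)):
            return importance
    return None


print(importance_for("internal publication", "paper A", "body", "uses OCR"))
print(importance_for("internal publication", "paper B", "body", "uses OCR"))
print(importance_for("external publication", "paper B", "figure", None))
```

Ordering the rows from most specific to least specific is one simple way to let document-specific rules override general ones; the text does not prescribe a precedence rule.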
[0072]
In the above description of the operation, a phrase such as “internal publication” is used as the use type, but any wording may be used as long as it is easily understood to be a use type. Likewise, a phrase such as “paper A” is used as the document type, but any wording may be used as long as it is easily understood to be a document type.
Although the method described above is used to extract the use type and the document type, any method may be used as long as the use type and the document type can be extracted.
Although a phrase such as “no designation” is used as the use type in the importance assignment criterion, any wording may be used as long as it is easily understood that no use type is specified. “No designation” may also be set for every use type.
Although a phrase such as “no designation” is used as the document type in the importance assignment criterion, any wording may be used as long as it is easily understood that no document type is specified. “No designation” may also be set for every document type.
In other respects, this example is the same as specific example 1.
[0073]
<Effect>
As described above, specific example 5 adds to the configuration of specific example 1 the extraction of a use type and a document type based on the document format, and assigns the importance of each item with the use type and the document type taken into account. The following effects are therefore obtained.
That is, importance is assigned automatically according to the use type, the document type, and the type and content of each item, an item being a unit that forms the content of the document. When a document is disclosed within the company (use type “internal publication”), the relatively important “body”, “figure”, and “caption” of one document can be kept private, while for another document only a “body” containing an important keyword can be kept private in addition to the important “figure” and “caption” (a “body” kept private in one document can be made public in another).
Disclosure and non-disclosure can likewise be adjusted when the document is disclosed outside the company (use type “external publication”). In other words, the disclosure or non-disclosure of items according to their importance can be adjusted for each use type, document type, and item content. Because the adjustment can be made per use type and per document type, the document can be disclosed in a finer-grained manner than in specific examples 3 and 4.
[0074]
《Usage form》
In each of the above specific examples, a document in a format as shown in FIG. 3, FIG. 17, FIG. 24, and FIG. 31, for example, is targeted as a document in a predetermined format, but a document in another format may be used. That is, if the format of the document is known in advance, by extracting each item area corresponding to the format, processing can be performed in the same manner as the above specific examples.
In specific examples 3, 4, and 5, the document identification and use identification configurations are added to the configuration of specific example 1, but they may instead be applied to the configuration of specific example 2. In that case, this can be realized by replacing the item extraction units 22, 23, and 24 of specific examples 3, 4, and 5 with the logical structure extraction unit 21 of specific example 2.
[0075]
In each of the above specific examples, the importance assigned to each item is used to decide the disclosure or non-disclosure of the document, but it can be applied to various other uses. For example, it may be used as a security level, such as adding a digital watermark to portions of high importance. The present invention can also be applied to processing that varies with importance at various levels, such as the color level of the document data (increase / decrease), the compression level (compress / do not compress), the resolution level (high resolution / low resolution), the omission level (omit / do not omit), and the summary level (summarize / do not summarize).
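As one possible use of the assigned importance, the sketch below blanks out every item area whose importance is at or above a chosen threshold before the image is disclosed. How non-disclosure is actually rendered is not stated in the text, so the ordering of levels, the blanking with a fill value, and the names `LEVELS`, `areas_to_hide`, and `mask` are assumptions. Rectangles use inclusive coordinates, and the image is a list of pixel rows.

```python
# Assumed ordering of the importance levels used in the examples.
LEVELS = {"low": 0, "medium": 1, "high": 2}


def areas_to_hide(rows, threshold="high"):
    """Return the item areas whose importance is at or above the threshold.

    rows: iterable of (item type, area, importance).
    """
    limit = LEVELS[threshold]
    return [area for _item_type, area, importance in rows
            if importance is not None and LEVELS[importance] >= limit]


def mask(image, areas, fill=0):
    """Return a copy of the image with the given rectangles overwritten by fill."""
    masked = [row[:] for row in image]
    for left, top, right, bottom in areas:
        for y in range(top, bottom + 1):
            for x in range(left, right + 1):
                masked[y][x] = fill
    return masked


image = [[1] * 6 for _ in range(4)]
rows = [("body", (0, 0, 2, 1), "low"), ("figure", (3, 1, 5, 3), "high")]
hidden = areas_to_hide(rows)
masked = mask(image, hidden)
for row in masked:
    print(row)
```

The same selection step could drive the other uses mentioned above, for example choosing which areas receive a digital watermark or a lower resolution instead of being blanked.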
[0076]
As for target documents, in schools the invention can be applied to in-house reports, grade sheets, notification slips, and the like; in medical institutions, to medical records, prescriptions, and the like; in companies, to confidential documents, specifications, source code, circuit diagrams, and the like; and in financial institutions and the like, to various documents such as application forms.
[Brief description of the drawings]
FIG. 1 is a configuration diagram illustrating specific example 1 of an image processing apparatus according to the present invention.
FIG. 2 is a flowchart illustrating the operation of specific example 1.
FIG. 3 is an explanatory diagram illustrating an example of a document image in specific example 1.
FIG. 4 is an explanatory diagram illustrating an example of components in specific example 1.
FIG. 5 is an explanatory diagram illustrating an example of items in specific example 1.
FIG. 6 is an explanatory diagram illustrating an example of importance assignment target information in specific example 1.
FIG. 7 is an explanatory diagram illustrating an example of an importance assignment criterion in specific example 1.
FIG. 8 is an explanatory diagram illustrating an example of importance assignment target information (including importance) in specific example 1.
FIG. 9 is a configuration diagram of specific example 2.
FIG. 10 is a flowchart illustrating the operation of specific example 2.
FIG. 11 is an explanatory diagram illustrating an example of a logical structure.
FIG. 12 is an explanatory diagram illustrating an example of importance assignment target information in specific example 2.
FIG. 13 is an explanatory diagram illustrating an example of an importance assignment criterion in specific example 2.
FIG. 14 is an explanatory diagram illustrating an example of importance assignment target information (including importance) in specific example 2.
FIG. 15 is a configuration diagram of specific example 3.
FIG. 16 is a flowchart illustrating the operation of specific example 3.
FIG. 17 is an explanatory diagram illustrating an example of a document image in specific example 3.
FIG. 18 is an explanatory diagram illustrating an example of importance assignment target information (document type only) in specific example 3.
FIG. 19 is an explanatory diagram illustrating an example of importance assignment target information in specific example 3.
FIG. 20 is an explanatory diagram illustrating an example of an importance assignment criterion in specific example 3.
FIG. 21 is an explanatory diagram illustrating an example of importance assignment target information (including importance) in specific example 3.
FIG. 22 is a configuration diagram of specific example 4.
FIG. 23 is a flowchart illustrating the operation of specific example 4.
FIG. 24 is an explanatory diagram illustrating an example of a document image in specific example 4.
FIG. 25 is an explanatory diagram illustrating an example of importance assignment target information (use type only) in specific example 4.
FIG. 26 is an explanatory diagram illustrating an example of importance assignment target information in specific example 4.
FIG. 27 is an explanatory diagram illustrating an example of an importance assignment criterion in specific example 4.
FIG. 28 is an explanatory diagram illustrating an example of importance assignment target information (including importance) in specific example 4.
FIG. 29 is a configuration diagram of specific example 5.
FIG. 30 is a flowchart illustrating the operation of specific example 5.
FIG. 31 is an explanatory diagram illustrating an example of a document image in specific example 5.
FIG. 32 is an explanatory diagram illustrating an example of importance assignment target information (use / document type only) in specific example 5.
FIG. 33 is an explanatory diagram illustrating an example of importance assignment target information in specific example 5.
FIG. 34 is an explanatory diagram illustrating an example of an importance assignment criterion in specific example 5.
FIG. 35 is an explanatory diagram illustrating an example of importance assignment target information (including importance) in specific example 5.
[Explanation of symbols]
20, 22, 23, 24 Item extraction unit
21 Logical structure extraction unit
30, 31, 32, 33, 34 Importance assigning unit
100 Document image storage unit
200 Importance assignment target information storage unit
300 Importance assignment criterion holding unit

Claims

An item extraction unit that, for a document of a predetermined format, extracts from an image of the document, based on the format of the document, an item area that is a unit forming the content of the document, and extracts an item type indicating what kind of item the item area is and an item content indicating the content of the item area,
an importance assignment criterion holding unit that holds in advance an importance assignment criterion indicating the relationship of importance to the item type and the item content, and
an importance assigning unit that compares the item type and the item content extracted by the item extraction unit with the importance assignment criterion held in the importance assignment criterion holding unit and assigns an importance to each item.
An image processing apparatus comprising the units described above.

A logical structure extraction unit that, for a document of a predetermined format, extracts from an image of the document, based on the format of the document, an item area that is a unit forming the content of the document, obtains a logical structure indicating how the item area is related to other item areas in the document, and extracts a logical structure type indicating the kind of the logical structure and a logical structure content indicating the content of the logical structure,
an importance assignment criterion holding unit that holds in advance an importance assignment criterion indicating the relationship of importance to the logical structure type and the logical structure content, and
an importance assigning unit that compares the logical structure type and the logical structure content extracted by the logical structure extraction unit with the importance assignment criterion held in the importance assignment criterion holding unit and assigns an importance to each logical structure.
An image processing apparatus comprising the units described above.

A document identification unit that, for a document of a predetermined format, identifies from an image of the document, based on the format of the document, a document type indicating what kind of document the document is,
an item extraction unit that, for a document of a predetermined format, extracts from an image of the document, based on the format of the document, an item area that is a unit forming the content of the document, and extracts an item type indicating what kind of item the item area is and an item content indicating the content of the item area,
an importance assignment criterion holding unit that holds in advance an importance assignment criterion indicating the relationship of importance to the document type, the item type, and the item content, and
an importance assigning unit that compares the document type identified by the document identification unit and the item type and the item content extracted by the item extraction unit with the importance assignment criterion held in the importance assignment criterion holding unit and assigns an importance to each item.
An image processing apparatus comprising the units described above.

A document identification unit that, for a document of a predetermined format, identifies from an image of the document, based on the format of the document, a document type indicating what kind of document the document is,
a logical structure extraction unit that, for a document of a predetermined format, extracts from an image of the document, based on the format of the document, an item area that is a unit forming the content of the document, obtains a logical structure indicating how the item area is related to other item areas in the document, and extracts a logical structure type indicating the kind of the logical structure and a logical structure content indicating the content of the logical structure,
an importance assignment criterion holding unit that holds in advance an importance assignment criterion indicating the relationship of importance to the document type, the logical structure type, and the logical structure content, and
an importance assigning unit that compares the document type identified by the document identification unit and the logical structure type and the logical structure content extracted by the logical structure extraction unit with the importance assignment criterion held in the importance assignment criterion holding unit and assigns an importance to each logical structure.
An image processing apparatus comprising the units described above.

An image processing apparatus comprising:
a use identification unit that, for a document in a predetermined format, identifies from an image of the document, based on the format of the document, a use type indicating what purpose the document is used for;
an item extraction unit that, for a document in a predetermined format, extracts from an image of the document, based on the format of the document, an item area serving as a unit forming the content of the document, and extracts an item type indicating what kind of item the item area is and an item content indicating the content of the item area;
an importance assignment criterion holding unit that holds, in advance, an importance assignment criterion indicating the relationship of importance to use type, item type, and item content; and
an importance assigning unit that compares the use type identified by the use identification unit, and the item type and item content extracted by the item extraction unit, with the importance assignment criterion held by the importance assignment criterion holding unit, and assigns an importance to each item.

An image processing apparatus comprising:
a use identification unit that, for a document in a predetermined format, identifies from an image of the document, based on the format of the document, a use type indicating what purpose the document is used for;
a logical structure extraction unit that, for a document in a predetermined format, extracts from an image of the document, based on the format of the document, item areas each serving as a unit forming the content of the document, obtains a logical structure indicating how each item area is related to the other item areas in the document, and extracts a logical structure type indicating the kind of the logical structure and a logical structure content indicating the content of the logical structure;
an importance assignment criterion holding unit that holds, in advance, an importance assignment criterion indicating the relationship of importance to use type, logical structure type, and logical structure content; and
an importance assigning unit that compares the use type identified by the use identification unit, and the logical structure type and logical structure content extracted by the logical structure extraction unit, with the importance assignment criterion held by the importance assignment criterion holding unit, and assigns an importance to each logical structure.

An image processing apparatus comprising:
a use/document identification unit that, for a document in a predetermined format, identifies from an image of the document, based on the format of the document, a use type indicating what purpose the document is used for and a document type indicating what kind of document the document is;
an item extraction unit that, for a document in a predetermined format, extracts from an image of the document, based on the format of the document, an item area serving as a unit forming the content of the document, and extracts an item type indicating what kind of item the item area is and an item content indicating the content of the item area;
an importance assignment criterion holding unit that holds, in advance, an importance assignment criterion indicating the relationship of importance to use type, document type, item type, and item content; and
an importance assigning unit that compares the use type and document type identified by the use/document identification unit, and the item type and item content extracted by the item extraction unit, with the importance assignment criterion held by the importance assignment criterion holding unit, and assigns an importance to each item.
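
To show how a use type and a document type might jointly select an importance, the sketch below keys a criterion table on (use type, document type, item type). It is an assumption-laden illustration: the `"*"` wildcard, the lookup order, the example labels, and the default level are hypothetical, and matching on item content is omitted for brevity even though the claim includes it.

```python
from itertools import product

# Hypothetical criterion keyed by (use type, document type, item type).
# "*" is an assumed wildcard meaning "any value".
CRITERIA = {
    ("public disclosure", "resident record", "address"): 3,
    ("public disclosure", "*", "name"): 3,
    ("internal review", "*", "*"): 1,
}


def lookup_importance(use_type, document_type, item_type, default=2):
    """Try the exact key first, then keys with wildcards, in a fixed order."""
    candidates = product((use_type, "*"), (document_type, "*"), (item_type, "*"))
    for key in candidates:
        if key in CRITERIA:
            return CRITERIA[key]
    return default  # assumed level when nothing matches
```

With this table the same "address" item on the same document type is rated differently depending on whether the use type is "public disclosure" or "internal review", which is the effect of adding the use type to the criterion.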

An image processing apparatus comprising:
a use/document identification unit that, for a document in a predetermined format, identifies from an image of the document, based on the format of the document, a use type indicating what purpose the document is used for and a document type indicating what kind of document the document is;
a logical structure extraction unit that, for a document in a predetermined format, extracts from an image of the document, based on the format of the document, item areas each serving as a unit forming the content of the document, obtains a logical structure indicating how each item area is related to the other item areas in the document, and extracts a logical structure type indicating the kind of the logical structure and a logical structure content indicating the content of the logical structure;
an importance assignment criterion holding unit that holds, in advance, an importance assignment criterion indicating the relationship of importance to use type, document type, logical structure type, and logical structure content; and
an importance assigning unit that compares the use type and document type identified by the use/document identification unit, and the logical structure type and logical structure content extracted by the logical structure extraction unit, with the importance assignment criterion held by the importance assignment criterion holding unit, and assigns an importance to each logical structure.

An image processing method comprising:
extracting, for a document in a predetermined format, an item area serving as a unit forming the content of the document from an image of the document based on the format of the document, and extracting an item type indicating what kind of item the item area is and an item content indicating the content of the item area; and
then comparing the item type and the item content with an importance assignment criterion, defined in advance, that indicates the relationship of importance to item type and item content, and assigning an importance to each item.
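
The two steps of this method can be sketched roughly as follows. This is a hedged illustration, not the claimed implementation: the `FORMAT` table of pixel regions, the `recognize` callable standing in for character recognition, the predicate-based `CRITERION`, and the default level are all hypothetical names and choices.

```python
from typing import Callable, Dict, List, Tuple

Region = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

# Hypothetical format definition: item type -> region of the document image.
FORMAT: Dict[str, Region] = {
    "name": (10, 10, 200, 40),
    "date": (10, 50, 200, 80),
}

# Hypothetical criterion: item type -> (predicate on item content, importance).
CRITERION = {"name": (lambda text: bool(text.strip()), 3)}


def extract_items(image, recognize: Callable[[object, Region], str]) -> List[Tuple[str, str]]:
    """Step 1: read the content of each item area that the format defines."""
    return [(item_type, recognize(image, region)) for item_type, region in FORMAT.items()]


def assign_importance(items, default=1):
    """Step 2: compare each (item type, item content) with the criterion."""
    rated = []
    for item_type, content in items:
        rule = CRITERION.get(item_type)
        level = default
        if rule is not None:
            predicate, importance = rule
            if predicate(content):
                level = importance
        rated.append((item_type, content, level))
    return rated
```

In use, `recognize` would wrap whatever recognition routine is available; for a quick check it can be a function that simply looks the region up in a dictionary of prepared strings.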

A program for controlling an image processing apparatus, the program causing a computer constituting the image processing apparatus to function as:
an item extraction unit that, for a document in a predetermined format, extracts from an image of the document, based on the format of the document, an item area serving as a unit forming the content of the document, and extracts an item type indicating what kind of item the item area is and an item content indicating the content of the item area; and
an importance assigning unit that compares the item type and item content extracted by the item extraction unit with an importance assignment criterion, defined in advance, that indicates the relationship of importance to item type and item content, and assigns an importance to each item.