JP4521942B2 - Document management apparatus and method

Info

Publication number: JP4521942B2
Application number: JP2000222810A
Authority: JP
Inventors: 和之齋藤
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2000-07-24
Filing date: 2000-07-24
Publication date: 2010-08-11
Anticipated expiration: 2020-07-24
Also published as: JP2002041498A

Description

【０００１】
【発明の属する技術分野】
本発明は文書管理装置及び方法に関する。更に詳しくは、例えばコンピュータを中心として、有線・無線のネットワークにより接続されたデジタル機器間でデジタル化された文書データによる情報のやり取りが行われる際の文書データの管理に好適な、文書管理装置及び方法に関するものである。
【０００２】
【従来の技術】
文書を光学的に入力し、その文書画像を文書データとして保管する文書データベースシステムにおいて、あるいは、文書を光学的に入力し、その文書画像からテキスト領域のみを文字認識し、文書画像と認識結果を併せて１つの文書データとして保管する文書データベースシステムが知られている。
【０００３】
一般にこの種の文書データベースシステムは、文書データを保管する際は、▲１▼１つの文書につき１つのファイルをハード・ディスク等の記憶装置に保管していた、あるいは、▲２▼１つの文書につき１つの元ファイルを作成し、それを複数個複製して、別々の記憶装置に保管していた。あるいは、▲３▼１つの文書につきＭ×Ｎのブロックに分割してそれぞれを１つずつファイルとして、別々の記憶装置に保管するということも考えられる。
【０００４】
【発明が解決しようとする課題】
しかしながら、上記▲１▼の方法では、１つの文書につき１つのファイルとしているので、ハード・ディスク等の記憶装置が壊れると文書そのものを全て失うという危険性があった。
【０００５】
また、▲２▼の方法では、１つの記憶装置が壊れても他の記憶装置に保管されているので安全性は高まるものの、文書データがカラー画像であった場合など１つのファイル・サイズが非常に大きく、それを複数個複製して保管しておくには、記憶装置のサイズや個数が大きくなる等、無駄が大きかった。
【０００６】
更に、▲３▼の方法では、分割個数が少ない場合、元の文書内容に関係なく分割されているので、どれか一つのブロックが欠けた場合に、元の文書内容が把握できなくなるという危険性があるし、また分割個数が多い場合はやはり、多数の記憶装置が必要になるという無駄が生じることになろう。
【０００７】
本発明は、上述の課題に鑑みてなされたもので、その目的とするところは、ファイル・サイズの増大を抑え、ファイルが失われる危険性を軽減できる、文書管理装置及び方法を提供することにある。
【０００８】
【課題を解決するための手段】
上記の目的を達成するための本発明による文書管理装置は以下の構成を備える。すなわち、
電子化された文書画像を、レイアウト属性毎の部分領域に分割し、当該分割された各部分領域の前記文書画像におけるレイアウトに関する情報を含むレイアウト解析情報を出力する解析手段と、
前記レイアウト解析情報に基づいて、各部分領域内の文書画像を部分文書画像データとして抽出し、前記部分領域ごとに、当該抽出された部分文書画像データと前記レイアウト解析情報とを含む文書オブジェクトを作成する作成手段と、
前記文書画像におけるレイアウト上で隣接する位置にある文書オブジェクト同士が異なる記憶手段に保管されるように、前記作成手段で作成された複数の文書オブジェクトの各々を複数の記憶手段に割り振り、各文書オブジェクトの保管先を特定するためのリンク情報を付加した文書オブジェクトを当該割り振られたそれぞれの記憶手段に保管する保管処理手段と、
前記リンク情報に基づいて関連する文書オブジェクトを前記複数の記憶手段から取り出し、当該取り出した文書オブジェクトに含まれているレイアウト解析情報に基づいて当該取り出した文書オブジェクトに含まれている部分文書画像データを合成することにより、元の文書画像を再構築する再構築手段とを備える。
【０００９】
また、上記の目的を達成するための本発明による文書管理方法は以下の工程を備える。すなわち、
文書管理装置の解析手段が、電子化された文書画像を、レイアウト属性毎の部分領域に分割し、当該分割された各部分領域の前記文書画像におけるレイアウトに関する情報を含むレイアウト解析情報を出力する解析工程と、
文書管理装置の作成手段が、前記レイアウト解析情報に基づいて、各部分領域内の文書画像を部分文書画像データとして抽出し、前記部分領域ごとに、当該抽出された部分文書画像データと前記レイアウト解析情報とを含む文書オブジェクトを作成する作成工程と、
文書管理装置の保管処理手段が、前記文書画像におけるレイアウト上で隣接する位置にある文書オブジェクト同士が異なる記憶手段に保管されるように、前記作成工程で作成された複数の文書オブジェクトの各々を複数の記憶手段に割り振り、各文書オブジェクトの保管先を特定するためのリンク情報を付加した文書オブジェクトを当該割り振られたそれぞれの記憶手段に保管する保管処理工程と、
文書管理装置の再構築手段が、前記リンク情報に基づいて関連する文書オブジェクトを前記複数の記憶手段から取り出し、当該取り出した文書オブジェクトに含まれている部分文書画像データを合成することにより、元の文書画像を再構築する再構築工程とを備える。
【００１０】
【発明の実施の形態】
以下、添付の図面を参照して本発明の好適な実施形態を説明する。
【００１１】
［第１の実施形態］
第１の実施形態に係るシステムのブロック構成を図１に示す。
【００１２】
図１において、１０１は各デジタル機器間を接続するネットワーク、１０２はデータベースを管理するファイル・サーバ、１０３はデータベースを保管するデータベース記憶装置、１０４は１０２とは別のファイル・サーバ、１０５は１０３とは別のデータベース記憶装置、１０６はデータベースに保管するデータを作成するクライアントＰＣ、１０７はスキャナ、１０８はデータベースに保管されているデータを検索および出力するクライアントＰＣ、１０９はプリンタ、である。
【００１３】
次に処理の流れについて図２、図３、図４、図５、図６のフローチャートと、図１７、図１８、図１９に従って説明する。
【００１４】
まず、ステップＳ２０１で、画像入力手段であるスキャナ（１０７）より文書画像を入力する。
【００１５】
次に、クライアントＰＣ（１０６）において、レイアウト解析処理（ステップＳ２０２）で、文書画像を『図』や『テキスト』や『表』等の各種レイアウト属性毎の領域に分割し、レイアウト解析情報（図１７）を出力する。
【００１６】
なお、レイアウト解析情報は、図１７に示されるように、レイアウト情報メタデータとレイアウト解析の結果得られた各部分領域の情報である部分領域データ（１〜ｎ）を含む。レイアウト情報メタデータは、元の文書画像の全体の大きさ（全体画像の幅及び全体画像の高さ）、画像データの対応する解像度、レイアウト解析の結果得られた部分領域の数を格納する。また、部分領域データ（１）〜（ｎ）には、各部分領域のＩＤ、部分領域の大きさ（幅と高さ）、部分領域の位置(開始座標）、部分領域の終了座標、部分領域のレイアウト属性が格納される。
【００１７】
次に、文書オブジェクトの作成処理（ステップＳ２０３）を行う。文書オブジェクトの作成処理（ステップＳ２０３）の例を、図４のフローチャートに従い、さらに詳細に説明する。
【００１８】
まず、ステップＳ４０１で、レイアウト解析情報を分析し、処理対象とする部分領域のＩＤを抽出し、オブジェクト情報（図１８）として保持する（ステップＳ４０２）。次に、レイアウト解析情報にしたがってその部分領域の画像を抽出し、圧縮する（ステップＳ４０３）。このとき、圧縮情報をオブジェクト情報に追加する。
【００１９】
次に、レイアウト解析情報およびオブジェクト情報および抽出した部分画像を１つにまとめることにより、１つの文書オブジェクトとし、（ステップＳ４０４）文書オブジェクトＩＤを付与する。
【００２０】
ステップＳ４０４終了後、未処理の分割領域が残っているか否かを検査し、残っている場合はステップＳ４０１に戻り、残っていなければ終了する（ステップＳ４０５）。
【００２１】
以上のようにして、文書オブジェクトが生成され、それぞれの文書オブジェクトには図１８に示すオブジェクト情報が付加される。なお、図１８に示されるようにオブジェクト情報は、当該文書オブジェクトのＩＤ（Ｓ４０４で格納）と、対応する部分領域のＩＤ（Ｓ４０１で格納）、圧縮方式（Ｓ４０３で格納）を含む。なお図１８には、重要オブジェクトフラグがしめされているが、これは第３の実施形態で用いられるものであり、第１の実施形態では使用しないので省略してよい。
【００２２】
次に、図２に戻って、作成した文書オブジェクトを、文書オブジェクト保管処理（ステップＳ２０４）で、データベース記憶装置（１０３）や別のデータベース記憶装置（１０５）に振り分けて保存する。
【００２３】
さらに図５に従い、文書オブジェクト保管処理（ステップＳ２０４）の例を詳細に説明する。
【００２４】
まず、ステップＳ５０１で、保管先が未決の文書オブジェクトを選択する。次に、選択した文書オブジェクトを保管するデータベース記憶装置を選択する（ステップＳ５０２）。例えば、前回選択した文書オブジェクトがデータベース記憶装置（１０３）であるならば、今回選択した文書オブジェクトはデータベース記憶装置（１０５）にするなどの方法がある。この時、文書オブジェクトと選択した保管先の関連をリンク情報（図１９）に記録していく。このリンク情報は、後述のステップＳ５０４において、各文書オブジェクトに付加される。なお、ステップＳ５０２におけるデータベース記憶装置の選択方法としては、レイアウト上の位置関係に応じて割り振るようにしてもよい。すなわち、各文書オブジェクトをデータベース記憶装置に割り振る一例として、レイアウト上で近いもの同士が同一の記憶装置に保管されないように割り振るようにしてもよい。具体的には、たとえば、各文書オブジェクトを上下左右方向に隣接するか否かをチェックし、隣接する文書オブジェクトの場合は異なる記憶装置に割り振るなどが挙げられる。
【００２５】
ステップＳ５０２終了後、保管先が未決の文書オブジェクトが残っているか否かを検査し、残っている場合はステップＳ５０１に戻り、残っていなければステップＳ５０４に進む（ステップＳ５０３）。保管先がすべて決定した場合は、ステップＳ５０４で、各文書オブジェクトにリンク情報を付加して、それぞれ選択したデータベース記憶装置に文書オブジェクトを保管する。
【００２６】
リンク情報は図１９に示されるように、リンク情報メタデータとリンクデータを含む。リンク情報メタデータには当該リンク情報に含まれるリンクデータの個数が記録される。各リンクデータには、文書オブジェクトのＩＤ、部分領域のＩＤ、当該文書オブジェクトの保管先アドレス、その保管日時が格納される。このリンク情報は、図５のフローチャートにより保管先が決定されていくに従って形成されていく。
【００２７】
次に、以上のようにして保管した文書データの出力処理について、図３を参照して説明する。
【００２８】
まず、ステップＳ３０１で、文書オブジェクトを検索する。検索については、あらかじめ保管する際に付加したキーワードに対して行う等の方法がある。もちろん、指定された文書ファイル名をもつ文書オブジェクトを検索するようにしてもよい。指定された文書オブジェクトが検索されたならば、次に、文書データの再構築処理を行う（ステップＳ３０２）。
【００２９】
以下、図６に従い、文書データ再構築処理（ステップＳ３０２）の例を詳細に説明する。
【００３０】
ステップＳ６０１で、文書オブジェクトを検索し、文書オブジェクトがあるか否かを検査し、ある場合はステップＳ６０２に進み、ない場合は終了する。ステップＳ６０１で文書オブジェクトが検索に対しヒットしたならば、ヒットした文書オブジェクトをデータベースから取り出す（ステップＳ６０２）。
【００３１】
次に、取り出した文書オブジェクトに付属のリンク情報を分析し（ステップＳ６０３）、関連文書オブジェクトを取り出す（ステップＳ６０４）。取り出した文書オブジェクトや関連文書オブジェクトの部分画像を、それぞれレイアウト解析情報をもとにして合成し、元の文書画像に再構築していく（ステップＳ６０５）。その際、部分画像が圧縮されている場合には解凍も行う。
【００３２】
ステップＳ６０５終了後、関連文書オブジェクトが未だ残っているか否かを検査し、残っている場合はステップＳ６０３に戻り、残っていなければ終了する（ステップＳ６０６）。上記の処理によって、文書データが再構築される。
【００３３】
以上説明したように、第１の実施形態によれば、スキャナ１０７等から読み取った文書画像は、『図』や『テキスト』や『表』といった所定の種類のレイアウト属性毎に部分領域に分割される。このとき、各部分領域毎の大きさ、レイアウト属性及び文書画像中の位置を表すレイアウト解析情報が出力される。このレイアウト解析情報を用いて、各部分領域毎に当該領域中の部分文書データによる文書オブジェクトが作成される。ここで、各文書オブジェクトには、レイアウト解析情報と、元の文書画像を構成する他の文書オブジェクトとのリンクを示すリンク情報が付加される。こうして得られた複数の文書オブジェクトは、複数のデータベース記憶装置に振り分けられて、保管処理される。
【００３４】
また、データベース記憶装置に分けて記憶された複数の文書オブジェクトから元文書データを再構築する場合には、文書オブジェクトに付加されたリンク情報によって元文書画像を構成するのに必要な文書オブジェクトを取得し、これらをレイアウト解析情報に従って配置する。
【００３５】
従って、上記第１の実施形態によれば、一つの文書を構成する複数の文書オブジェクトを複数のデータベース記憶装置に分散記憶させるので、記憶装置の破損等による文書データの消滅のリスクを低減できる。また、元文書を複数コピーするのではなく、元文書を分割して記憶するので、文書画像の保存時における記憶装置のサイズを低減できる。
【００３６】
更に、複数の文書オブジェクトを、それぞれにリンク情報とレイアウト解析情報を付加してデータベース記憶装置へ振り分けるので、データベース記憶装置の破損によって文書データが破損しても、残りのデータベース記憶装置に記憶されている文書オブジェクトによりある程度の文書復元ができ、元の文書内容が把握可能となる。
【００３７】
［第２の実施形態］
上記第１の実施形態では、文書画像から得られた複数の文書オブジェクトを複数のデータベース記憶装置に振り分けて記憶した。第２の実施形態では、特定のレイアウト属性の文書オブジェクトについては、その文書オブジェクトを複製して別々に保管することによって、一方が異常をおこしても、もう一方を代わりに用いて文書データを再構築する（安全性を高める）ことを可能にする。
【００３８】
第２の実施形態による処理の流れについて図２、図３、図７、図８のフローチャート、及び図１７、図１９に従って説明する。
【００３９】
第１の実施形態で説明したように、まず、ステップＳ２０１で、画像入力手段であるスキャナ（１０７）より文書画像を入力する。次に、クライアントＰＣ（１０６）において、レイアウト解析処理（ステップＳ２０２）により、文書画像を『図』や『テキスト』や『表』等の各種属性毎の領域に分割し、レイアウト解析情報（図１７）を出力する。
【００４０】
次に、文書オブジェクトの作成処理（ステップＳ２０３）を行う。そして、作成した文書オブジェクトを文書オブジェクト保管処理（ステップＳ２０４）によって、データベース記憶装置（１０３）や別のデータベース記憶装置（１０５）に振り分けて保存する。
【００４１】
以下、図７を参照して、第２の実施形態による文書オブジェクト保管処理（ステップＳ２０４）の例を詳細に説明する。
【００４２】
まず、ステップＳ７０１で、保管先が未決の文書オブジェクトを選択する。次に、ステップＳ７０２で、選択した文書オブジェクトのレイアウト属性を検査する。本実施形態においては、レイアウト属性が『テキスト』の場合に、文書オブジェクトを複製し（ステップＳ７０６）、元の文書オブジェクトおよび複製文書オブジェクトのそれぞれが異なる装置に保管されるようデータベース記憶装置を選択する（ステップＳ７０７、Ｓ７０８）。なお、このとき、複製文書オブジェクトのオブジェクト情報のオブジェクトＩＤについては、元の文書オブジェクトのオブジェクトＩＤとは異なり、かつユニークな値になるよう変更する。
【００４３】
また、これ以外のレイアウト属性（本例ではテキスト以外のレイアウト属性）を有する文書オブジェクトの場合は、上述のような複製処理は行わず、当該文書オブジェクトのみを保管するデータベース記憶装置を選択する（ステップＳ７０３）。なお、ステップＳ７０８、Ｓ７０３のいずれの場合も、文書オブジェクトと選択した保管先の関連をリンク情報（図１９）として記録していく。
【００４４】
ステップＳ７０３またはステップＳ７０８終了後、保管先が未決の文書オブジェクトが残っているか否かを検査し、残っている場合はステップＳ７０１に戻り、残っていなければステップＳ７０５に進む（ステップＳ７０４）。
【００４５】
以上のようにして保管先がすべて決定したならば、各文書オブジェクトにリンク情報とレイアウト解析情報を付加して、それぞれ選択したデータベース記憶装置に文書オブジェクトを保管する（ステップＳ７０５）。
【００４６】
次に、以上のようにして保管された文書データの再構築処理について説明する。再構築処理の大まかな手順は第１の実施形態（図３）に準ずる。ここでは、第２の実施形態に係る文書データ再構築処理（ステップＳ３０２）の例を図８に従い詳細に説明する。
【００４７】
ステップＳ８０１で、文書オブジェクトを検索し、文書オブジェクトがあるか否かを検査し、ある場合はステップＳ８０２に進み、ない場合は終了する。ステップＳ８０１で文書オブジェクトが検索に対しヒットしたならば、ヒットした文書オブジェクトをデータベースから取り出す（ステップＳ８０２）。次に、取り出した文書オブジェクトに付属するリンク情報を分析し（ステップＳ８０３）、リンク先の関連文書オブジェクト（取出した文書オブジェクトを含む）が正常か否かを判断する（ステップＳ８０４）。そして、正常ならば、関連文書オブジェクトを取り出し（ステップＳ８０５）、そうでなければステップＳ８０８に移る。
【００４８】
リンク先関連文書オブジェクトが正常でない場合は、複製文書オブジェクトが有るか否かを調べる（ステップＳ８０８）。複製文書オブジェクトが存在するか否かの判断は、当該文書オブジェクトのレイアウト属性に従ってなされる。本実施形態においては、関連文書オブジェクトのレイアウト属性が『テキスト』ならば複製文書オブジェクトが有るので、複製文書オブジェクトを取り出す（ステップＳ８０９）。
【００４９】
一方、関連文書オブジェクトのレイアウト属性が『テキスト』でないならば複製文書オブジェクトは無いので、レイアウト解析情報を用いて、本来の部分画像と同じ大きさのダミー画像を作成する（ステップＳ８１０）。例えば、黒一色で塗りつぶした画像をダミー画像にするとか、「部分画像に異常がある」等のメッセージを書き込んだ画像をダミー画像にするなどの方法が挙げられる。
【００５０】
取り出した、関連する文書オブジェクト、複製文書オブジェクト等の部分画像およびダミー画像をそれぞれ、レイアウト解析情報をもとにして合成し、元の文書画像に再構築していく（ステップＳ８０６）。その際、部分画像が圧縮されている場合には解凍も行う。
【００５１】
ステップＳ８０６終了後、関連文書オブジェクトが未だ残っているか否かを検査し、残っている場合はステップＳ８０３に戻り、残っていなければ終了する（ステップＳ８０７）。上記の処理によって、文書データを再構築する。
【００５２】
以上説明したように、第２の実施形態によれば、所定のレイアウト属性を有する文書オブジェクトについては複製文書オブジェクトを生成して、これを別々のデータベース記憶装置に記憶するので、所定のレイアウト属性を有する文書オブジェクトが破損しても、複製文書オブジェクトを用いて正常に文書画像を復元することができる。
【００５３】
また、第２の実施形態によれば、破損した文書オブジェクトの複製が存在しない場合に、ダミー画像を充当するので、文書画像のどの部分が破損しているか容易に判断することができる。なお、このダミー画像の充当は第１の実施形態にも適用することが可能である。この場合、ステップＳ６０４において、文書オブジェクトが異常かどうかを判断し、異常であった場合にダミー画像を作成するようにすればよい。
【００５４】
［第３の実施形態］
第２の実施形態では、所定のレイアウト属性を有する文書オブジェクトについて複製文書オブジェクトを生成し、別個のデータベース記憶装置に記憶した。第３の実施形態では、複数の条件に合致する文書オブジェクトを重要オブジェクトとして設定し、重要オブジェクトについては文書オブジェクトの複製を行って別々に保管するようにする。
【００５５】
第３の実施形態による処理の流れについて図９、図１０、図１１、図８のフローチャート、および図１７、図１８、図１９に従って説明する。
【００５６】
まず、ステップＳ９０１で、重要オブジェクトであるかどうかを判定するための条件を設定する。この条件としては、例えば、特定のレイアウト属性とレイアウト位置の組み合わせを用いたり、部分画像の大きさを用いたりすることが挙げられる。
【００５７】
次に、第１の実施形態と同じように、画像入力手段であるスキャナ（１０７）より文書画像を入力する（ステップＳ９０２）。次に、クライアントＰＣ（１０６）において、レイアウト解析処理（ステップＳ９０３）により、文書画像を『図』や『テキスト』や『表』等の各種属性毎の領域に分割し、レイアウト解析情報（図１７）を出力する。
【００５８】
次に、文書オブジェクトの作成処理（ステップＳ９０４）を行う。そして、作成した文書オブジェクトを文書オブジェクト保管処理（ステップＳ９０５）によって、データベース記憶装置（１０３）や別のデータベース記憶装置（１０５）に振り分けて保存する。
【００５９】
上述の文書オブジェクトの作成処理（ステップＳ９０４）の例を、図１０のフローチャートを参照して詳細に説明する。
【００６０】
まず、ステップＳ１００１で、レイアウト解析情報を分析し、処理対象とする部分領域のＩＤをオブジェクト情報（図１８）として抽出する（ステップＳ１００２）。次に、レイアウト解析情報にしたがってその部分領域の画像を抽出し、圧縮する（ステップＳ１００３）。この時、圧縮情報はオブジェクト情報に追加する。次に、レイアウト解析情報およびオブジェクト情報および抽出した部分画像を１つにまとめることにより、１つの文書オブジェクトとする（ステップＳ１００４）。ここまでは第１の実施形態による文書オブジェクト作成処理（図４のステップＳ４０１〜Ｓ４０４）と同様である。
【００６１】
次に、あらかじめステップＳ９０１で設定した条件に従って、その文書オブジェクトが重要オブジェクトに該当するか否かを検査し（ステップＳ１００５）、重要オブジェクトであるならばオブジェクト情報（図１８）の『重要フラグ』を１にしてＯＮにする（ステップＳ１００６）。一方、重要オブジェクトでないならば、オブジェクト情報の『重要フラグ』を０にしてＯＦＦにする（ステップＳ１００７）。
【００６２】
ステップＳ１００６またはステップＳ１００７終了後、未処理の分割領域が残っているか否かを検査し、残っている場合はステップＳ１００１に戻り、残っていなければ終了する（ステップＳ１００８）。
【００６３】
さらに図１１のフローチャートを参照して、文書オブジェクト保管処理（ステップＳ９０５）を詳細に説明する。第２の実施形態（図７）では、レイアウト属性がテキストか否かに応じて文書オブジェクトの複製を作成するかどうかを決定したが、第３の実施形態ではこれを重要オブジェクトか否かに応じて決定する（図１１）。
【００６４】
まず、ステップＳ１１０１で、保管先が未決の文書オブジェクトを選択する。次に、ステップＳ１１０２で、選択した文書オブジェクトが重要オブジェクトか否かを検査する。重要オブジェクトの場合には、当該文書オブジェクトを複製し（ステップＳ１１０６）、元の文書オブジェクトおよび複製文書オブジェクトのそれぞれが異なる装置に保管されるようデータベース記憶装置を選択する（ステップＳ１１０７、Ｓ１１０８）。このとき、複製文書オブジェクトのオブジェクト情報のオブジェクトＩＤについては元の文書オブジェクトのオブジェクトＩＤとは異なりかつユニークな値になるよう変更する。
【００６５】
重要オブジェクトでない場合は、選択した文書オブジェクトのみを保管するデータベース記憶装置を選択する（ステップＳ１１０３）。なお、ステップＳ１１０８、Ｓ１１０３のいずれの場合も、文書オブジェクトと選択した保管先の関連をリンク情報（図１９）として記録していく。
【００６６】
ステップＳ１１０３またはステップＳ１１０８の終了後、保管先が未決の文書オブジェクトが残っているか否かを検査する。未決の文書オブジェクトが残っている場合はステップＳ１１０１に戻り、残っていなければステップＳ１１０５に進む（ステップＳ１１０４）。
【００６７】
以上のようにして、保管先がすべて決定したならば、各文書オブジェクトにリンク情報を付加して、それぞれ選択したデータベース記憶装置に文書オブジェクトを保管する（ステップＳ１１０５）。
【００６８】
次に、第３の実施形態に係る文書データ再構築処理（ステップＳ３０２）の例を第２の実施形態の説明で用いた図８のフローチャートを流用して詳細に説明する。
【００６９】
ステップＳ８０１で、文書オブジェクトを検索し、文書オブジェクトがあるか否かを検査し、ある場合はステップＳ８０２に進み、ない場合は終了する。ステップＳ８０１で文書オブジェクトが検索に対しヒットしたならば、ヒットした文書オブジェクトをデータベースから取り出す（ステップＳ８０２）。次に、取り出した文書オブジェクトに付属のリンク情報を分析し（ステップＳ８０３）、リンク先の関連文書オブジェクトが正常か否かを判断し（ステップＳ８０４）、正常ならば、関連文書オブジェクトを取り出し（ステップＳ８０５）、そうでなければステップＳ８０８に移る。
【００７０】
リンク先関連文書オブジェクトが正常でない場合は、複製文書オブジェクトが有るか否かを判定する（ステップＳ８０８）。本実施形態では、オブジェクト情報の「重要オブジェクトフラグ」が１か否か（オンか否か）によって複製文書オブジェクトが存在するか否かを判定できる。
【００７１】
関連文書オブジェクトが重要オブジェクトならば複製文書オブジェクトが有るので、複製文書オブジェクトを取り出す（ステップＳ８０９）。関連文書オブジェクトが重要オブジェクトでないならば複製文書オブジェクトは無いので、レイアウト解析情報を用いて、本来の部分画像と同じ大きさのダミー画像を作成する（ステップＳ８１０）。
【００７２】
例えば、黒一色で塗りつぶした画像をダミー画像にするとか、「部分画像に異常がある」等のメッセージを書き込んだ画像をダミー画像にするなどの方法がある。取り出した文書オブジェクト、関連文書オブジェクト、複製文書オブジェクト等の部分画像およびダミー画像をそれぞれレイアウト解析情報をもとにして合成し、元の文書画像に再構築していく（ステップＳ８０６）。その際、部分画像が圧縮されている場合には解凍も行う。
【００７３】
ステップＳ８０６終了後、関連文書オブジェクトが未だ残っているか否かを検査し、残っている場合はステップＳ８０３に戻り、残っていなければ終了する（ステップＳ８０７）。上記の処理によって、文書データを再構築する。
【００７４】
以上説明したように、第３の実施形態によれば、所望の選択条件によって選択された文書オブジェクトについては複製文書オブジェクトを生成して、これを別々のデータベース記憶装置に記憶するので、所定のレイアウト属性を有する文書オブジェクトが破損しても、複製文書オブジェクトを用いて正常に文書画像を復元することができる。とくに、複製を作る文書オブジェクトの条件を所望に設定できるので、第２の実施形態にくらべて柔軟性が高まる。
【００７５】
［第４の実施形態］
部分領域が文字認識対象領域（テキスト）である場合には、さらに文字認識データを含めた文書オブジェクトを作成し、認識文字を検索に使用することも可能である。
【００７６】
さらには、文字認識対象領域の文書オブジェクトの場合に、その文書オブジェクトの部分画像を除いたものを複製して別々に保管することによって、一方が異常をおこしても、もう一方を代わりに用いて文書データを再構築する（安全性を高める）ことが可能である。
【００７７】
第４の実施形態では、上記の処理を実現する構成を説明する。
【００７８】
処理の流れについて図２、図３、図１２、図１３、図１４のフローチャート、および図１７、図１８、図１９、図２０に従って説明する。
【００７９】
本実施形態による文書オブジェクトの保管処理手順は第１の実施形態（図２）と同様である。以下では、ステップＳ２０３の文書オブジェクトの作成処理と、ステップＳ２０４の文書オブジェクト保管処理について説明する。
【００８０】
まず、文書オブジェクトの作成処理（ステップＳ２０３）の例を図１２に従い詳細に説明する。ステップＳ１２０１で、レイアウト解析情報を分析し、処理対象とする部分領域のＩＤをオブジェクト情報（図１８）として抽出する（ステップＳ１２０２）。次に、レイアウト解析情報にしたがってその部分領域の画像を抽出し、圧縮する（ステップＳ１２０３）。この時、圧縮情報はオブジェクト情報に追加する。
【００８１】
次に、部分領域が文字認識対象領域か否かを検査し（ステップＳ１２０４）、文字認識対象領域であればステップＳ１２０５に進み、そうでなければステップＳ１２０６に進む。本実施形態においては、部分領域のレイアウト属性が『テキスト』であれば文字認識対象領域であると判断し、文字認識を実行する（ステップＳ１２０５）。
【００８２】
次に、レイアウト解析情報およびオブジェクト情報および抽出した部分画像、さらに文字認識結果の文字認識データ（図２０）があるならばそれも含めて１つにまとめることにより、１つの文書オブジェクトとする（ステップＳ１２０６）。なお、本実施形態では、文字認識データは図２０に示す形態をとる。
【００８３】
ステップＳ１２０６終了後、未処理の分割領域が残っているか否かを検査し、残っている場合はステップＳ１２０１に戻り、残っていなければ終了する（ステップＳ１２０７）。
【００８４】
作成された文書オブジェクトは文書オブジェクト保管処理（ステップＳ２０４）によって、データベース記憶装置（１０３）や別のデータベース記憶装置（１０５）に振り分けて保存される。以下、図１３に従い、文書オブジェクト保管処理（ステップＳ２０４）の例を詳細に説明する。
【００８５】
ステップＳ１３０１で、保管先が未決の文書オブジェクトを選択する。次に、ステップＳ１３０２で、選択した文書オブジェクトが文字認識データを含んでいるか否かを検査する（ステップＳ１３０２）。
【００８６】
本実施形態においては、レイアウト属性が『テキスト』であれば文字認識データを含んでいると判断し、元の文書オブジェクトから画像データ部分を除いて部分的に複製し（ステップＳ１３０６）、元の文書オブジェクトおよび複製文書オブジェクトのそれぞれが異なる装置に保管されるようデータベース記憶装置を選択する（ステップＳ１３０７、Ｓ１３０８）。このとき、複製文書オブジェクトのオブジェクト情報（図１８）のオブジェクトＩＤについては元の文書オブジェクトのオブジェクトＩＤとは異なりかつユニークな値になるよう変更する。
【００８７】
文字認識データを含んでいない場合（本実施形態においては、レイアウト属性が『テキスト』でない場合）は、選択した文書オブジェクトのみを保管するデータベース記憶装置を選択する（ステップＳ１３０３）。なお、ステップＳ１３０８、ステップＳ１３０３のいずれの場合も、文書オブジェクトと選択した保管先の関連をリンク情報（図１９）として記録していく。ステップＳ１３０３またはステップＳ１３０８終了後、保管先が未決の文書オブジェクトが残っているか否かを検査し、残っている場合はステップＳ１３０１に戻り、残っていなければステップＳ１３０５に進む（ステップＳ１３０４）。
【００８８】
次に、保管先がすべて決定したならば、各文書オブジェクトにリンク情報を付加して、それぞれ選択したデータベース記憶装置に文書オブジェクトを保管する（ステップＳ１３０５）。
【００８９】
次に、以上のようにして保管した文書データの出力について、図３のフローチャートを流用して説明する。
【００９０】
ステップＳ３０１で、文書オブジェクトを検索する。検索については、第４の実施形態のように文書オブジェクトに文字認識データを含んでいる場合には、認識文字を全文検索対象として行う等の方法を用いることも可能である。次に、文書データの再構築処理を行い（ステップＳ３０２）、再構築された文書データを出力する（ステップＳ３０３）。
【００９１】
以下、第４の実施形態に係る文書データ再構築処理（ステップＳ３０２）の例を図１４に従い詳細に説明する。
【００９２】
ステップＳ１４０１で、文書オブジェクトを検索し、文書オブジェクトがあるか否かを検査し、ある場合はステップＳ１４０２に進み、ない場合は終了する。ステップＳ１４０１で文書オブジェクトが検索に対しヒットしたならば、ヒットした文書オブジェクトをデータベースから取り出す（ステップＳ１４０２）。次に、取り出した文書オブジェクトに付属しているリンク情報を分析し（ステップＳ１４０３）、リンク先の関連文書オブジェクトが正常か否かを判断し（ステップＳ１４０４）、正常ならば、関連文書オブジェクトを取り出す（ステップＳ１４０５）。
【００９３】
リンク先関連文書オブジェクトが正常でない場合は、複製文書オブジェクトが有るか否かを調べる（ステップＳ１４０８）。本実施形態においては、関連文書オブジェクトのレイアウト属性が『テキスト』ならば部分的複製文書オブジェクトが有るので、部分的複製文書オブジェクトを取り出す（ステップＳ１４０９）。一方、関連文書オブジェクトのレイアウト属性が『テキスト』でないならば複製文書オブジェクトは無いので、レイアウト解析情報を用いて、本来の部分画像と同じ大きさのダミー画像を作成する（ステップＳ１４１０）。
【００９４】
取り出した関連文書オブジェクトの部分画像および複製文書オブジェクトの認識文字やダミー画像をそれぞれレイアウト解析情報をもとにして合成し、元の文書画像に再構築していく（ステップＳ１４０６）。その際、部分画像が圧縮されている場合には解凍も行う。
【００９５】
ステップＳ１４０６終了後、関連文書オブジェクトが未だ残っているか否かを検査し、残っている場合はステップＳ１４０３に戻り、残っていなければ終了する（ステップＳ１４０７）。上記の処理によって、文書データを再構築する。
【００９６】
以上のように第４の実施形態によれば、レイアウト属性がテキストの文書オブジェクトについては、文字認識結果データを複製文書オブジェクトとしてもつので、当該文書オブジェクトが破損しても、その内容を復元することができる。また、同じ文書オブジェクトの複製をもつ第２の実施形態に比べて、複製文書オブジェクトのデータ量を低減することができる。
【００９７】
［第５の実施形態］
あらかじめ重要キーワードを設定し、文字認識データを含めた文書オブジェクトにその重要キーワードが存在した場合に文書オブジェクトを複製して別々に保管することによって、一方が異常をおこしても、もう一方を代わりに用いて文書データを再構築する（安全性を高める）ことが可能である。第５の実施形態では、このような処理を実現する。
【００９８】
第５の実施形態による処理の流れについて図１５、図１６、図１４のフローチャート、および図１７、図１８、図１９、図２０を参照して説明する。
【００９９】
ステップＳ１５０１で、重要キーワードを設定する。次に、画像入力手段であるスキャナ（１０７）より文書画像を入力する（ステップＳ１５０２）。次に、クライアントＰＣ（１０６）において、レイアウト解析処理（ステップＳ１５０３）で、文書画像を『図』や『テキスト』や『表』等の各種属性毎の領域に分割し、レイアウト解析情報（図１７）を出力する。次に、文書オブジェクトの作成処理（ステップＳ１５０４）を行う。文書オブジェクト作成処理では、第４の実施形態（図１２）によって説明したように、文書オブジェクトのレイアウト属性がテキストであった場合には、文字認識処理を行い、その文字認識データが文書オブジェクトに含まれる。
【０１００】
次に、作成した文書オブジェクトを文書オブジェクト保管処理（ステップＳ１５０５）によって、データベース記憶装置（１０３）や別のデータベース記憶装置（１０５）に振り分けて保存する。
【０１０１】
以下、文書オブジェクト保管処理（ステップＳ１５０５）の例を図１６に従い詳細に説明する。
【０１０２】
まず、ステップＳ１６０１で、保管先が未決の文書オブジェクトを選択する。次に、ステップＳ１６０２で、選択した文書オブジェクトの文字認識データ（図２０）が重要キーワードを含んでいるか否かを検査する（ステップＳ１６０２）。
【０１０３】
本実施形態においては、重要キーワードを含んでいると判断したならば、元の文書オブジェクトから画像データ部分を除いて部分的に複製し（ステップＳ１６０６）、元の文書オブジェクトおよび複製文書オブジェクトのそれぞれが異なる装置に保管されるようデータベース記憶装置を選択する（ステップＳ１６０７、Ｓ１６０８）。このとき、複製文書オブジェクトのオブジェクト情報（図１８）のオブジェクトＩＤについては元の文書オブジェクトのオブジェクトＩＤとは異なりかつユニークな値になるよう変更する。
【０１０４】
また、重要キーワードを含んでいない場合は、選択した文書オブジェクトのみを保管するデータベース記憶装置を選択する（ステップＳ１６０３）。なお、ステップＳ１６０８、Ｓ１６０３のいずれの場合も、文書オブジェクトと選択した保管先の関連をリンク情報（図１９）として記録していく。
【０１０５】
次に、保管先がすべて決定したならば、各文書オブジェクトにリンク情報を付加して、それぞれ選択したデータベース記憶装置に文書オブジェクトを保管する（ステップＳ１６０５）。
【０１０６】
また、第５の実施形態に係る文書データ再構築処理（ステップＳ３０２）の例を図１４のフローチャートを流用して詳細に説明する。
【０１０７】
ステップＳ１４０１で、文書オブジェクトを検索し、文書オブジェクトがあるか否かを検査し、ある場合はステップＳ１４０２に進み、ない場合は終了する。ステップＳ１４０１で文書オブジェクトが検索に対しヒットしたならば、ヒットした文書オブジェクトをデータベースから取り出す（ステップＳ１４０２）。次に、取り出した文書オブジェクトに付属のリンク情報を分析し（ステップＳ１４０３）、リンク先の関連文書オブジェクトが正常か否かを判断し（ステップＳ１４０４）、正常ならば関連文書オブジェクトを取り出す（ステップＳ１４０５）。
【０１０８】
リンク先関連文書オブジェクトが正常でない場合は、複製文書オブジェクトが有るか否かを調べる（ステップＳ１４０８）。第５の実施形態においては、ステップＳ１５０１で設定された重要キーワードを文字認識データが含むならば部分的複製文書オブジェクトが有るので、部分的複製文書オブジェクトを取り出す（ステップＳ１４０９）。関連文書オブジェクトが文字認識データに重要キーワードを含まないならば複製文書オブジェクトは無いので、レイアウト解析情報を用いて、本来の部分画像と同じ大きさのダミー画像を作成する（ステップＳ１４１０）。
【０１０９】
取り出した文書オブジェクトや関連文書オブジェクトの部分画像および複製文書オブジェクトの認識文字やダミー画像をそれぞれレイアウト解析情報をもとにして合成し、元の文書画像に再構築していく（ステップＳ１４０６）。その際、部分画像が圧縮されている場合には解凍も行う。
【０１１０】
ステップＳ１４０６終了後、関連文書オブジェクトが未だ残っているか否かを検査し、残っている場合はステップＳ１４０３に戻り、残っていなければ終了する（ステップＳ１４０７）。上記の処理によって、文書データを再構築する。
【０１１１】
以上のように、第５の実施形態によれば、文字認識結果にキーワードを含む文書オブジェクトについては、その複製を別のデータベース記憶装置に格納するので、文書オブジェクトが破損しても、文書の重要な部分については、その内容を判断できる程度に復元することができる。
【０１１２】
［他の実施形態］
なお、本発明は、複数の機器（例えばホストコンピュータ、インタフェイス機器、リーダ、プリンタなど）から構成されるシステムに適用しても、一つの機器からなる装置（例えば、複写機、ファクシミリ装置など）に適用してもよい。
【０１１３】
また、本発明の目的は、前述した実施形態の機能を実現するソフトウェアのプログラムコードを記録した記憶媒体（または記録媒体）を、システムあるいは装置に供給し、そのシステムあるいは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記憶媒体に格納されたプログラムコードを読み出し実行することによっても、達成されることは言うまでもない。この場合、記憶媒体から読み出されたプログラムコード自体が前述した実施形態の機能を実現することになり、そのプログラムコードを記憶した記憶媒体は本発明を構成することになる。また、コンピュータが読み出したプログラムコードを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのプログラムコードの指示に基づき、コンピュータ上で稼働しているオペレーティングシステム（ＯＳ）などが実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。
【０１１４】
さらに、記憶媒体から読み出されたプログラムコードが、コンピュータに挿入された機能拡張カードやコンピュータに接続された機能拡張ユニットに備わるメモリに書込まれた後、そのプログラムコードの指示に基づき、その機能拡張カードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。
【０１１５】
本発明を上記記憶媒体に適用する場合、その記憶媒体には、先に説明したフローチャートに対応するプログラムコードが格納されることになる。
【０１１６】
【発明の効果】
以上説明したように、本発明によれば、ファイル・サイズの増大を抑え、ファイルが失われる危険性を軽減できる、文書管理に好適なデータベースシステムを提供できる。
【図面の簡単な説明】
【図１】本発明の、第１の実施形態に係るシステムのブロック構成図である。
【図２】第１の実施形態に係る、画像入力処理から文書オブジェクト保管処理までの処理の流れを示すフローチャートである。
【図３】第１の実施形態に係る、文書オブジェクト検索処理から再構築文書出力処理までの処理の流れを示すフローチャートである。
【図４】第１の実施形態に係る、文書オブジェクト作成処理の１例についての処理の流れを示すフローチャートである。
【図５】第１の実施形態に係る、文書オブジェクト保管処理の１例についての処理の流れを示すフローチャートである。
【図６】第１の実施形態に係る、文書データ再構築処理の１例についての処理の流れを示すフローチャートである。
【図７】本発明の、第２の実施形態に係る、文書オブジェクト保管処理の１例についての処理の流れを示すフローチャートである。
【図８】第２の実施形態に係る、文書データ再構築処理の１例についての処理の流れを示すフローチャートである。
【図９】本発明の、第３の実施形態に係る、重要オブジェクト設定処理から文書オブジェクト保管処理までの処理の流れを示すフローチャートである。
【図１０】第３の実施形態に係る、文書オブジェクト作成処理の１例についての処理の流れを示すフローチャートである。
【図１１】第３の実施形態に係る、文書オブジェクト保管処理の１例についての処理の流れを示すフローチャートである。
【図１２】本発明の、第４の実施形態に係る、文書オブジェクト作成処理の１例についての処理の流れを示すフローチャートである。
【図１３】第４の実施形態に係る、文書オブジェクト保管処理の１例についての処理の流れを示すフローチャートである。
【図１４】第４の実施形態に係る、文書データ再構築処理の１例についての処理の流れを示すフローチャートである。
【図１５】本発明の、第５の実施形態に係る、重要キーワード設定処理から文書オブジェクト保管処理までの処理の流れを示すフローチャートである。
【図１６】第５の実施形態に係る、文書オブジェクト保管処理の１例についての処理の流れを示すフローチャートである。
【図１７】本発明の、第１の実施形態に係る、レイアウト解析情報の構造の１例を示す図である。
【図１８】第１の実施形態に係る、オブジェクト情報の構造の１例を示す図である。
【図１９】第１の実施形態に係る、リンク情報の構造の１例を示す図である。
【図２０】第１の実施形態に係る、文字認識データの構造の１例を示す図である。
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a document management apparatus and method. More specifically, it relates to a document management apparatus and method suitable for managing document data when information is exchanged as digitized document data between digital devices, centered on a computer for example, that are connected by a wired or wireless network.
[0002]
[Prior art]
Document database systems are known in which a document is optically input and the document image is stored as document data. Systems are also known in which a document is optically input, character recognition is applied only to the text areas of the document image, and the document image and the recognition result are stored together as a single piece of document data.
[0003]
In general, when this type of document database system stores document data, either (1) one file per document is stored in a storage device such as a hard disk, or (2) one original file is created per document, duplicated several times, and the copies are stored in separate storage devices. Alternatively, (3) a document could be divided into M × N blocks, with each block stored as its own file in a separate storage device.
[0004]
[Problems to be solved by the invention]
However, in the method (1), since one file is created for each document, there is a risk that the entire document itself is lost if a storage device such as a hard disk is broken.
[0005]
In method (2), safety is improved because the document is still held in the other storage devices even if one storage device fails. However, a single file can be very large, for example when the document data is a color image, and duplicating it and storing multiple copies is wasteful because it increases the size and number of storage devices required.
[0006]
Further, in method (3), the document is divided without regard to its content. When the number of divisions is small, there is therefore a risk that the original document content cannot be grasped if any one block is missing. When the number of divisions is large, a large number of storage devices is again required, which is wasteful.
[0007]
The present invention has been made in view of the above-described problems, and its object is to provide a document management apparatus and method capable of suppressing an increase in file size and reducing the risk of losing a file.
[0008]
[Means for Solving the Problems]
In order to achieve the above object, a document management apparatus according to the present invention comprises the following arrangement. That is,
analysis means for dividing a digitized document image into partial areas for each layout attribute, and outputting layout analysis information that includes information on the layout, within the document image, of each of the divided partial areas;
creating means for extracting, based on the layout analysis information, the document image in each partial area as partial document image data, and creating, for each partial area, a document object that includes the extracted partial document image data and the layout analysis information;
storage processing means for allocating each of the plurality of document objects created by the creating means to a plurality of storage means so that document objects at adjacent positions on the layout in the document image are stored in different storage means, and storing, in each allocated storage means, the document object to which link information for identifying the storage destination of each document object has been added; and
reconstructing means for retrieving related document objects from the plurality of storage means based on the link information, and reconstructing the original document image by combining the partial document image data included in the retrieved document objects based on the layout analysis information included in the retrieved document objects.
[0009]
A document management method according to the present invention for achieving the above object includes the following steps. That is,
an analysis step in which the analysis means of the document management apparatus divides a digitized document image into partial areas for each layout attribute, and outputs layout analysis information that includes information on the layout, within the document image, of each of the divided partial areas;
a creation step in which the creating means of the document management apparatus extracts, based on the layout analysis information, the document image in each partial area as partial document image data, and creates, for each partial area, a document object that includes the extracted partial document image data and the layout analysis information;
a storage processing step in which the storage processing means of the document management apparatus allocates each of the plurality of document objects created in the creation step to a plurality of storage means so that document objects at adjacent positions on the layout in the document image are stored in different storage means, and stores, in each allocated storage means, the document object to which link information for identifying the storage destination of each document object has been added; and
a reconstruction step in which the reconstructing means of the document management apparatus retrieves related document objects from the plurality of storage means based on the link information, and reconstructs the original document image by combining the partial document image data included in the retrieved document objects.
[0010]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, preferred embodiments of the present invention will be described with reference to the accompanying drawings.
[0011]
[First Embodiment]
FIG. 1 shows a block configuration of a system according to the first embodiment.
[0012]
In FIG. 1, 101 is a network connecting the digital devices, 102 is a file server that manages a database, 103 is a database storage device that stores the database, 104 is a file server separate from 102, 105 is a database storage device separate from 103, 106 is a client PC that creates data to be stored in the database, 107 is a scanner, 108 is a client PC that searches for and outputs data stored in the database, and 109 is a printer.
[0013]
Next, the flow of processing will be described with reference to the flowcharts of FIGS. 2, 3, 4, 5, and 6, and to FIGS. 17, 18, and 19.
[0014]
First, in step S201, a document image is input from a scanner (107) which is an image input unit.
[0015]
Next, in the client PC (106), the layout analysis process (step S202) divides the document image into areas by layout attribute, such as “figure”, “text”, and “table”, and outputs layout analysis information (FIG. 17).
[0016]
As shown in FIG. 17, the layout analysis information includes layout information metadata and partial area data (1 to n), which is the information on each partial area obtained by the layout analysis. The layout information metadata stores the overall size of the original document image (the width and height of the entire image), the resolution of the image data, and the number of partial areas obtained by the layout analysis. Partial area data (1) to (n) each store the ID of the partial area, its size (width and height), its position (start coordinates), its end coordinates, and its layout attribute.
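The structure of FIG. 17 could be modeled roughly as follows. This is only an illustrative sketch: the class names, field names, and the choice of exclusive end coordinates are assumptions, since the source lists the stored items but not a concrete encoding.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PartialRegion:
    """One entry of partial area data (1..n) in FIG. 17 (field names are hypothetical)."""
    region_id: int
    width: int
    height: int
    start: Tuple[int, int]   # (x, y) start coordinates of the partial area
    end: Tuple[int, int]     # (x, y) end coordinates of the partial area
    attribute: str           # layout attribute, e.g. "text", "figure", "table"


@dataclass
class LayoutAnalysisInfo:
    """Layout information metadata plus the per-region records."""
    image_width: int
    image_height: int
    resolution_dpi: int
    regions: List[PartialRegion] = field(default_factory=list)

    @property
    def region_count(self) -> int:
        # The metadata stores the number of partial areas; here it is derived.
        return len(self.regions)


def make_region(region_id, start, end, attribute):
    """Build a region from its start/end coordinates (end treated as exclusive)."""
    (x0, y0), (x1, y1) = start, end
    return PartialRegion(region_id, x1 - x0, y1 - y0, start, end, attribute)


layout = LayoutAnalysisInfo(image_width=2480, image_height=3508, resolution_dpi=300)
layout.regions.append(make_region(1, (100, 100), (2380, 400), "text"))
layout.regions.append(make_region(2, (100, 500), (1200, 1500), "figure"))
```

Deriving the region count from the list, rather than storing it separately, is a design choice of this sketch that keeps the metadata and the records from drifting apart.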
[0017]
Next, a document object creation process (step S203) is performed. An example of the document object creation process (step S203) will be described in more detail with reference to the flowchart of FIG. 4.
[0018]
First, in step S401, layout analysis information is analyzed, the ID of a partial area to be processed is extracted, and held as object information (FIG. 18) (step S402). Next, the image of the partial area is extracted and compressed according to the layout analysis information (step S403). At this time, the compression information is added to the object information.
[0019]
Next, the layout analysis information, the object information, and the extracted partial images are combined into one document object, and a document object ID is assigned (step S404).
[0020]
After step S404 is completed, it is checked whether or not an unprocessed divided area remains. If it remains, the process returns to step S401, and if not, the process ends (step S405).
[0021]
As described above, document objects are generated, and the object information shown in FIG. 18 is added to each document object. As shown in FIG. 18, the object information includes the ID of the document object (stored in S404), the ID of the corresponding partial area (stored in S401), and the compression method (stored in S403). Although an important object flag is shown in FIG. 18, this is used in the third embodiment and may be omitted because it is not used in the first embodiment.
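The creation loop of steps S401 to S405 might look like the sketch below. It assumes a grayscale image held as one `bytes` object per row and uses zlib as a stand-in compression method; the source does not name a compression scheme, and `DocumentObject`, `crop`, and `create_document_objects` are hypothetical names.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, List, Tuple

Image = List[bytes]  # assumption: one bytes object per row, one byte per pixel


@dataclass
class DocumentObject:
    object_id: int      # assigned in step S404
    region_id: int      # taken from the layout analysis information (S401/S402)
    compression: str    # recorded when the partial image is compressed (S403)
    layout: Dict        # layout analysis information carried with the object
    image_data: bytes   # compressed partial image


def crop(image: Image, start: Tuple[int, int], end: Tuple[int, int]) -> bytes:
    """Extract the pixels of one partial area, row by row (end is exclusive)."""
    (x0, y0), (x1, y1) = start, end
    return b"".join(row[x0:x1] for row in image[y0:y1])


def create_document_objects(image: Image, layout: Dict) -> List[DocumentObject]:
    objects = []
    for next_id, region in enumerate(layout["regions"], start=1):  # loop S401..S405
        raw = crop(image, region["start"], region["end"])          # S403: extract
        packed = zlib.compress(raw)                                 # S403: compress
        objects.append(DocumentObject(next_id, region["id"], "zlib", layout, packed))
    return objects


# A 4x4 test image whose pixel value encodes its row number.
page = [bytes([y] * 4) for y in range(4)]
page_layout = {
    "width": 4, "height": 4,
    "regions": [
        {"id": 1, "start": (0, 0), "end": (4, 2), "attribute": "text"},
        {"id": 2, "start": (0, 2), "end": (4, 4), "attribute": "figure"},
    ],
}
objs = create_document_objects(page, page_layout)
```

Each object carries the full layout analysis information, which is what later allows a partial page to be rebuilt from whichever objects can still be read.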
[0022]
Next, returning to FIG. 2, the document object storage process (step S204) distributes the created document objects between the database storage device (103) and the other database storage device (105) and stores them.
[0023]
Next, an example of the document object storage process (step S204) will be described in detail with reference to FIG. 5.
[0024]
First, in step S501, a document object whose storage destination is undecided is selected. Next, a database storage device in which to store the selected document object is selected (step S502). For example, if the previously selected document object was assigned to the database storage device (103), the currently selected document object may be assigned to the database storage device (105). At this time, the association between the document object and the selected storage destination is recorded in the link information (FIG. 19). This link information is added to each document object in step S504, described later. The database storage device may also be selected in step S502 according to positional relationships on the layout. That is, document objects may be allocated so that objects close to each other on the layout are not stored in the same storage device. Specifically, for example, each document object is checked for whether it is adjacent to another in the vertical or horizontal direction, and adjacent document objects are allocated to different storage devices.
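One way to realize the layout-based allocation of step S502 is a greedy assignment over bounding boxes, sketched below. The adjacency test, the `gap` tolerance, and the storage names `db103` and `db105` are assumptions for illustration; the source only requires that adjacent objects go to different storage devices where possible.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end coordinates exclusive


def is_adjacent(a: Box, b: Box, gap: int = 50) -> bool:
    """True when two boxes lie within `gap` pixels of each other.

    `gap` is an assumed tolerance; the source only says "adjacent in the
    vertical or horizontal direction".
    """
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Distance between the boxes along each axis (0 when they overlap on that axis).
    dx = max(bx0 - ax1, ax0 - bx1, 0)
    dy = max(by0 - ay1, ay0 - by1, 0)
    # Neighbours overlap on one axis and are close on the other.
    return (dx == 0 and dy <= gap) or (dy == 0 and dx <= gap)


def allocate(boxes: Dict[int, Box], storages: List[str]) -> Dict[int, str]:
    """Greedy assignment: pick the storage used by the fewest adjacent objects."""
    assignment: Dict[int, str] = {}
    for obj_id, box in boxes.items():
        conflicts = {name: 0 for name in storages}
        for other_id, name in assignment.items():
            if is_adjacent(box, boxes[other_id]):
                conflicts[name] += 1
        # min() keeps the first storage on ties, so the result is deterministic.
        assignment[obj_id] = min(storages, key=lambda name: conflicts[name])
    return assignment


# Three regions stacked vertically: 1 above 2 above 3.
regions = {
    1: (0, 0, 100, 40),
    2: (0, 50, 100, 90),
    3: (0, 100, 100, 140),
}
placement = allocate(regions, ["db103", "db105"])
```

With only two storage devices, some layouts cannot separate every adjacent pair; the greedy rule then minimizes, rather than eliminates, the number of neighbours sharing a device.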
[0025]
After completion of step S502, it is checked whether or not a document object whose storage destination has not been determined remains. If it remains, the process returns to step S501. If not, the process proceeds to step S504 (step S503). If all the storage destinations are determined, in step S504, link information is added to each document object, and the document object is stored in each selected database storage device.
[0026]
As shown in FIG. 19, the link information includes link information metadata and link data. The link information metadata records the number of link data entries contained in the link information. Each link data entry stores the ID of the document object, the ID of the partial area, the storage destination address of the document object, and its storage date and time. This link information is built up as the storage destinations are determined according to the flowchart of FIG. 5.
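A minimal sketch of the link information of FIG. 19 and of the final storing step S504 follows. The in-memory dictionaries standing in for the database storage devices, and all class, field, and function names, are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class LinkData:
    object_id: int
    region_id: int
    address: str          # storage destination address of the document object
    stored_at: datetime   # storage date and time


@dataclass
class LinkInfo:
    links: List[LinkData] = field(default_factory=list)

    @property
    def count(self) -> int:
        # Link information metadata: number of link data entries.
        return len(self.links)


def store_all(objects: List[dict], placement: Dict[int, str],
              storages: Dict[str, Dict[int, dict]]) -> LinkInfo:
    """Record every destination first (S502), then attach the links and store (S504)."""
    info = LinkInfo()
    now = datetime.now()
    for obj in objects:
        info.links.append(LinkData(obj["object_id"], obj["region_id"],
                                   placement[obj["object_id"]], now))
    for obj in objects:
        stored = dict(obj, link_info=info)   # object with link information added
        storages[placement[obj["object_id"]]][obj["object_id"]] = stored
    return info


storages = {"db103": {}, "db105": {}}
objects = [{"object_id": 1, "region_id": 1}, {"object_id": 2, "region_id": 2}]
link_info = store_all(objects, {1: "db103", 2: "db105"}, storages)
```

Because every stored object carries the complete link information, any single surviving object is enough to locate all the others.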
[0027]
Next, output processing of the document data stored as described above will be described with reference to FIG. 3.
[0028]
First, in step S301, a document object is searched for. The search may be performed, for example, against keywords attached in advance when the document was stored. Of course, a document object having a designated document file name may be searched for instead. Once the designated document object has been found, document data reconstruction processing is performed (step S302).
[0029]
Hereinafter, an example of the document data reconstruction process (step S302) will be described in detail with reference to FIG. 6.
[0030]
In step S601, a search for the document object is performed and it is checked whether a matching document object exists. If one exists, the process proceeds to step S602; if not, the process ends. If the search in step S601 hits a document object, that document object is retrieved from the database (step S602).
[0031]
Next, link information attached to the extracted document object is analyzed (step S603), and a related document object is extracted (step S604). The extracted partial images of the document object and the related document object are synthesized based on the layout analysis information, respectively, and reconstructed into the original document image (step S605). At that time, if the partial image is compressed, decompression is also performed.
[0032]
After step S605 is completed, it is checked whether or not a related document object still remains. If one remains, the process returns to step S603, and if not, the process ends (step S606). The document data is reconstructed by the above processing.
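The loop of steps S603 to S606 can be sketched as below. The sketch makes several assumptions that are not in the source: images are plain nested lists of pixel values, the storage devices are dictionaries keyed by a hypothetical address string, the link information lists every object of the page (including the one found by the search), and decompression is omitted.

```python
def reconstruct(stores, hit, page_width, page_height, background=0):
    """Rebuild a page image from the objects named in the hit object's links."""
    canvas = [[background] * page_width for _ in range(page_height)]
    for link in hit["links"]:                             # S603: analyse link info
        obj = stores[link["address"]][link["object_id"]]  # S604: take out the object
        x, y = obj["x"], obj["y"]                         # position from layout info
        for dy, row in enumerate(obj["pixels"]):          # S605: paste partial image
            canvas[y + dy][x:x + len(row)] = row
    return canvas                                         # S606: nothing remains


if __name__ == "__main__":
    links = [
        {"address": "db103", "object_id": 1},
        {"address": "db105", "object_id": 2},
    ]
    title = {"x": 0, "y": 0, "pixels": [[1, 1, 1]], "links": links}
    body = {"x": 1, "y": 1, "pixels": [[2, 2]], "links": links}
    stores = {"db103": {1: title}, "db105": {2: body}}
    for row in reconstruct(stores, title, page_width=3, page_height=2):
        print(row)
```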
[0033]
As described above, according to the first embodiment, the document image read from the scanner 107 or the like is divided into partial areas for each predetermined type of layout attribute such as “figure”, “text”, and “table”. At this time, layout analysis information representing the size, layout attribute, and position in the document image of each partial area is output. Using this layout analysis information, a document object is created for each partial area from the partial document data in that area. Here, layout analysis information and link information indicating the links to the other document objects constituting the original document image are added to each document object. The plurality of document objects obtained in this way are distributed to a plurality of database storage devices and stored.
[0034]
Also, when reconstructing the original document data from a plurality of document objects stored separately in the database storage device, the document object necessary for composing the original document image is obtained from the link information added to the document object. These are arranged according to the layout analysis information.
[0035]
Therefore, according to the first embodiment, since a plurality of document objects constituting one document are distributed and stored in a plurality of database storage devices, it is possible to reduce the risk of losing document data due to damage to a storage device. Further, since the original document is divided and stored instead of being copied multiple times, the storage capacity required for saving the document image can be reduced.
[0036]
Furthermore, since link information and layout analysis information are added to each of the document objects allocated to the database storage devices, even if part of the document data is lost due to damage to a database storage device, the document can be restored to some extent from the document objects stored in the remaining database storage devices, and the original document content can be grasped.
[0037]
[Second Embodiment]
In the first embodiment, a plurality of document objects obtained from a document image are stored in a plurality of database storage devices. In the second embodiment, a document object having a specific layout attribute is duplicated and the copies are stored separately, so that even if one copy becomes abnormal, the other can be used instead to reconstruct the document data (enhancing safety).
[0038]
The flow of processing according to the second embodiment will be described with reference to the flowcharts of FIGS. 2, 3, 7, and 8, and FIGS.
[0039]
As described in the first embodiment, first, in step S201, a document image is input from the scanner (107) as image input means. Next, in the client PC (106), the document image is divided into areas for various attributes such as “figure”, “text”, and “table” by layout analysis processing (step S202), and layout analysis information (FIG. 17) is output.
[0040]
Next, a document object creation process (step S203) is performed. Then, the created document object is sorted and stored in the database storage device (103) or another database storage device (105) by the document object storage process (step S204).
[0041]
Hereinafter, an example of the document object storage process (step S204) according to the second embodiment will be described in detail with reference to FIG.
[0042]
First, in step S701, a document object whose storage destination is undecided is selected. In step S702, the layout attribute of the selected document object is checked. In the present embodiment, when the layout attribute is “text”, the document object is duplicated (step S706), and the database storage devices are selected so that the original document object and the duplicate document object are stored in different devices (steps S707 and S708). At this time, the object ID in the object information of the duplicate document object is changed so as to have a unique value different from the object ID of the original document object.
[0043]
In the case of a document object having a layout attribute other than this (in this example, a layout attribute other than text), the database storage device that stores only the document object is selected without performing the above-described duplication processing (step S703). In both cases of steps S708 and S703, the relationship between the document object and the selected storage destination is recorded as link information (FIG. 19).
[0044]
After step S703 or step S708 is completed, it is checked whether or not a document object whose storage destination has not been determined remains. If it remains, the process returns to step S701, and if not, the process proceeds to step S705 (step S704).
[0045]
When all the storage destinations are determined as described above, link information and layout analysis information are added to each document object, and the document object is stored in each selected database storage device (step S705).
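The storage processing of steps S701 to S708 could look roughly like the following sketch. The dictionary keys, the round-robin choice of devices, and the caller-supplied ID generator are assumptions made for illustration; the source only requires that the original and its duplicate end up in different devices and that the duplicate receives a unique object ID.

```python
import copy
from itertools import count


def plan_storage(objects, devices, next_id):
    """Return (object, device) pairs; 'text' objects also get a duplicate."""
    if len(devices) < 2:
        raise ValueError("duplication needs at least two storage devices")
    placements = []
    for i, obj in enumerate(objects):                 # S701: next undecided object
        primary = devices[i % len(devices)]
        if obj["attribute"] == "text":                # S702: check layout attribute
            dup = copy.deepcopy(obj)                  # S706: duplicate the object
            dup["object_id"] = next(next_id)          # give the copy a unique ID
            secondary = devices[(i + 1) % len(devices)]
            placements += [(obj, primary), (dup, secondary)]  # S707, S708
        else:
            placements.append((obj, primary))         # S703: single copy only
    link_info = [
        {"object_id": o["object_id"], "area_id": o["area_id"], "address": d}
        for o, d in placements
    ]
    for obj, _ in placements:                         # S705: attach link info
        obj["links"] = link_info
    return placements


if __name__ == "__main__":
    objs = [
        {"object_id": 1, "area_id": 1, "attribute": "text"},
        {"object_id": 2, "area_id": 2, "attribute": "figure"},
    ]
    for obj, device in plan_storage(objs, ["db103", "db105"], count(100)):
        print(obj["object_id"], obj["attribute"], "->", device)
```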
[0046]
Next, a process for reconstructing document data stored as described above will be described. The rough procedure of the reconstruction process is the same as that of the first embodiment (FIG. 3). Here, an example of the document data reconstruction process (step S302) according to the second embodiment will be described in detail with reference to FIG.
[0047]
In step S801, the document object is searched to check whether there is a document object. If there is, the process proceeds to step S802, and if not, the process ends. If the document object hits the search in step S801, the hit document object is taken out from the database (step S802). Next, link information attached to the extracted document object is analyzed (step S803), and it is determined whether or not the linked related document object (including the extracted document object) is normal (step S804). If it is normal, the related document object is taken out (step S805). Otherwise, the process proceeds to step S808.
[0048]
If the link destination related document object is not normal, it is checked whether there is a duplicate document object (step S808). Whether or not a duplicate document object exists is determined according to the layout attribute of the document object. In the present embodiment, if the layout attribute of the related document object is “text”, there is a duplicate document object, so the duplicate document object is extracted (step S809).
[0049]
On the other hand, if the layout attribute of the related document object is not “text”, there is no duplicate document object, so a dummy image having the same size as the original partial image is created using the layout analysis information (step S810). For example, an image painted with a solid black color is used as a dummy image, or an image on which a message such as “There is an abnormality in a partial image” is written as a dummy image.
[0050]
The extracted partial images and dummy images of the related document object, duplicate document object, etc. are synthesized based on the layout analysis information and reconstructed into the original document image (step S806). At that time, if the partial image is compressed, decompression is also performed.
[0051]
After step S806 is completed, it is checked whether or not the related document object still remains. If it remains, the process returns to step S803, and if it does not remain, the process ends (step S807). Document data is reconstructed by the above processing.
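The fallback logic of steps S804 to S810 can be sketched as follows. Note the assumptions: an object counts as "not normal" simply when it cannot be found in its store, a duplicate is recognised as a further link entry with the same partial area ID (the source instead decides this from the layout attribute), and the dummy image is a solid block of zeros standing in for black.

```python
def fetch(stores, address, object_id):
    """Return the stored object, or None if the device or object is missing."""
    return stores.get(address, {}).get(object_id)


def resolve_area(stores, links, area_id, layout):
    """Return (source, pixels) for one partial area of the page."""
    entries = [link for link in links if link["area_id"] == area_id]
    original, duplicates = entries[0], entries[1:]
    obj = fetch(stores, original["address"], original["object_id"])  # S804
    if obj is not None:
        return "original", obj["pixels"]                             # S805
    for dup in duplicates:                                           # S808
        obj = fetch(stores, dup["address"], dup["object_id"])
        if obj is not None:
            return "duplicate", obj["pixels"]                        # S809
    width, height = layout[area_id]["width"], layout[area_id]["height"]
    return "dummy", [[0] * width for _ in range(height)]             # S810


if __name__ == "__main__":
    links = [
        {"object_id": 1, "area_id": 1, "address": "db103"},
        {"object_id": 100, "area_id": 1, "address": "db105"},
        {"object_id": 2, "area_id": 2, "address": "db103"},
    ]
    layout = {1: {"width": 2, "height": 1}, 2: {"width": 3, "height": 2}}
    stores = {"db105": {100: {"pixels": [[7, 7]]}}}  # db103 has been lost
    print(resolve_area(stores, links, 1, layout))
    print(resolve_area(stores, links, 2, layout))
```

The returned partial images would then be composed according to the layout analysis information, as in step S806.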
[0052]
As described above, according to the second embodiment, for a document object having a predetermined layout attribute, a duplicate document object is generated and stored in a separate database storage device. Even if the document object is damaged, it is possible to restore the document image normally using the duplicate document object.
[0053]
Further, according to the second embodiment, when there is no duplicate of the damaged document object, a dummy image is applied, so it is possible to easily determine which part of the document image is damaged. Note that the dummy image can also be applied to the first embodiment. In this case, in step S604, it is determined whether or not the document object is normal, and if it is abnormal, a dummy image may be created.
[0054]
[Third Embodiment]
In the second embodiment, a duplicate document object is generated for a document object having a predetermined layout attribute and stored in a separate database storage device. In the third embodiment, document objects that meet a plurality of conditions are set as important objects, and the important objects are duplicated and stored separately.
[0055]
A processing flow according to the third embodiment will be described with reference to the flowcharts of FIGS. 9, 10, 11, and 8, and FIGS. 17, 18, and 19.
[0056]
First, in step S901, a condition for determining whether the object is an important object is set. Examples of this condition include using a combination of a specific layout attribute and layout position, or using the size of a partial image.
[0057]
Next, as in the first embodiment, a document image is input from the scanner (107) as image input means (step S902). Next, in the client PC (106), the document image is divided into areas for various attributes such as “figure”, “text”, and “table” by layout analysis processing (step S903), and layout analysis information (FIG. 17) is output.
[0058]
Next, a document object creation process (step S904) is performed. Then, the created document object is sorted and stored in the database storage device (103) or another database storage device (105) by the document object storage process (step S905).
[0059]
An example of the document object creation process (step S904) will be described in detail with reference to the flowchart of FIG.
[0060]
First, in step S1001, the layout analysis information is analyzed, and the ID of the partial area to be processed is extracted as object information (FIG. 18) (step S1002). Next, the image of the partial area is extracted and compressed according to the layout analysis information (step S1003). At this time, the compression information is added to the object information. Next, the layout analysis information, the object information, and the extracted partial image are combined into one document object (step S1004). The processing up to this point is the same as the document object creation processing (steps S401 to S404 in FIG. 4) according to the first embodiment.
[0061]
Next, it is checked whether or not the document object corresponds to an important object according to the conditions set in advance in step S901 (step S1005). If it is an important object, the “important flag” in the object information (FIG. 18) is set to 1 and turned on (step S1006). On the other hand, if it is not an important object, the “important flag” in the object information is set to 0 and turned off (step S1007).
[0062]
After step S1006 or step S1007 is completed, it is checked whether or not an unprocessed divided area remains. If it remains, the process returns to step S1001, and if not, the process ends (step S1008).
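The condition setting of step S901 and the flag setting of steps S1005 to S1007 might be expressed as in the sketch below. The `Area` fields, the `make_condition` parameters, and the specific thresholds are hypothetical; the source only says the condition may combine a layout attribute with a layout position, or use the size of the partial image.

```python
from dataclasses import dataclass


@dataclass
class Area:
    """One partial area from the layout analysis (hypothetical field names)."""
    area_id: int
    attribute: str
    x: int
    y: int
    width: int
    height: int


def make_condition(attribute=None, max_y=None, min_pixels=None):
    """S901: build a predicate; criteria left as None are ignored."""
    def is_important(area: Area) -> bool:
        if attribute is not None and area.attribute != attribute:
            return False
        if max_y is not None and area.y > max_y:
            return False
        if min_pixels is not None and area.width * area.height < min_pixels:
            return False
        return True
    return is_important


def object_info(area: Area, condition) -> dict:
    """S1005 to S1007: set the important flag to 1 or 0 in the object information."""
    return {"area_id": area.area_id, "important_flag": 1 if condition(area) else 0}


if __name__ == "__main__":
    near_top_text = make_condition(attribute="text", max_y=100)
    title = Area(1, "text", 0, 10, 500, 40)
    caption = Area(2, "text", 0, 700, 500, 40)
    print(object_info(title, near_top_text))
    print(object_info(caption, near_top_text))
```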
[0063]
Further, the document object storage process (step S905) will be described in detail with reference to the flowchart of FIG. 11. In the second embodiment (FIG. 7), whether to create a copy of the document object is determined according to whether the layout attribute is text. In the third embodiment, it is determined according to whether the document object is an important object (FIG. 11).
[0064]
First, in step S1101, a document object whose storage destination is undecided is selected. In step S1102, it is checked whether the selected document object is an important object. In the case of an important object, the document object is duplicated (step S1106), and the database storage device is selected so that the original document object and the duplicate document object are stored in different devices (steps S1107 and S1108). At this time, the object ID of the object information of the duplicate document object is changed to be a unique value different from the object ID of the original document object.
[0065]
If it is not an important object, a database storage device that stores only the selected document object is selected (step S1103). In both cases of steps S1108 and S1103, the relationship between the document object and the selected storage destination is recorded as link information (FIG. 19).
[0066]
After step S1103 or step S1108, it is checked whether there are any document objects whose storage destinations have not been determined. If an undecided document object remains, the process returns to step S1101, and if not, the process proceeds to step S1105 (step S1104).
[0067]
When all the storage destinations are determined as described above, link information is added to each document object, and the document object is stored in the selected database storage device (step S1105).
[0068]
Next, the example of the document data reconstruction process (step S302) according to the third embodiment will be described in detail with reference to the flowchart of FIG. 8 used in the description of the second embodiment.
[0069]
In step S801, the document object is searched to check whether there is a document object. If there is, the process proceeds to step S802, and if not, the process ends. If the document object hits the search in step S801, the hit document object is taken out from the database (step S802). Next, link information attached to the extracted document object is analyzed (step S803), and it is determined whether or not the linked related document object is normal (step S804). If normal, the related document object is extracted (step S805). If not, the process proceeds to step S808.
[0070]
If the link destination related document object is not normal, it is determined whether or not there is a duplicate document object (step S808). In this embodiment, whether or not a duplicate document object exists can be determined based on whether or not the “important flag” in the object information is 1 (that is, whether it is on).
[0071]
If the related document object is an important object, there is a duplicate document object, and the duplicate document object is extracted (step S809). If the related document object is not an important object, there is no duplicate document object, so a dummy image having the same size as the original partial image is created using the layout analysis information (step S810).
[0072]
For example, there are methods such as making an image painted with black color a dummy image, or making an image with a message such as “partial image abnormal” written as a dummy image. The partial images and dummy images of the extracted document object, related document object, duplicate document object, etc. are synthesized based on the layout analysis information, respectively, and reconstructed into the original document image (step S806). At that time, if the partial image is compressed, decompression is also performed.
[0073]
After step S806 is completed, it is checked whether or not the related document object still remains. If it remains, the process returns to step S803, and if it does not remain, the process ends (step S807). Document data is reconstructed by the above processing.
[0074]
As described above, according to the third embodiment, a duplicate document object is generated for a document object selected according to a desired selection condition and stored in a separate database storage device. Even if the document object having the attribute is damaged, the document image can be normally restored using the duplicate document object. In particular, since the conditions of the document object to be duplicated can be set as desired, the flexibility is enhanced as compared with the second embodiment.
[0075]
[Fourth Embodiment]
When the partial area is a character recognition target area (text), it is also possible to create a document object including character recognition data and use the recognized character for the search.
[0076]
Furthermore, in the case of a document object in a character recognition target area, the document object excluding the partial image is duplicated and stored separately, so that if one of the document objects becomes abnormal, the other can be used instead to reconstruct the document data (enhancing safety).
[0077]
In the fourth embodiment, a configuration for realizing the above processing will be described.
[0078]
The processing flow will be described with reference to the flowcharts of FIGS. 2, 3, 12, 13 and 14 and FIGS. 17, 18, 19 and 20.
[0079]
The document object storage processing procedure according to this embodiment is the same as that of the first embodiment (FIG. 2). Hereinafter, the document object creation process in step S203 and the document object storage process in step S204 will be described.
[0080]
First, an example of document object creation processing (step S203) will be described in detail with reference to FIG. In step S1201, the layout analysis information is analyzed, and the ID of the partial area to be processed is extracted as object information (FIG. 18) (step S1202). Next, the image of the partial area is extracted and compressed according to the layout analysis information (step S1203). At this time, the compression information is added to the object information.
[0081]
Next, it is checked whether or not the partial area is a character recognition target area (step S1204). If the partial area is a character recognition target area, the process proceeds to step S1205; otherwise, the process proceeds to step S1206. In this embodiment, if the layout attribute of the partial region is “text”, it is determined that the region is a character recognition target region, and character recognition is executed (step S1205).
[0082]
Next, the layout analysis information, the object information, the extracted partial image, and, if any, the character recognition data (FIG. 20) obtained as a result of character recognition are integrated into one document object (step S1206). In this embodiment, the character recognition data takes the form shown in FIG.
[0083]
After step S1206 is completed, it is checked whether or not an unprocessed divided area remains. If it remains, the process returns to step S1201, and if not, the process ends (step S1207).
[0084]
The created document objects are sorted and stored in the database storage device (103) or another database storage device (105) by the document object storage process (step S204). Hereinafter, an example of the document object storage process (step S204) will be described in detail with reference to FIG.
[0085]
In step S1301, a document object whose storage destination has not been determined is selected. In step S1302, it is checked whether the selected document object includes character recognition data.
[0086]
In this embodiment, if the layout attribute is “text”, it is determined that character recognition data is included, the original document object is partially duplicated excluding the image data portion (step S1306), and the database storage devices are selected so that the original document object and the duplicate document object are stored in different devices (steps S1307 and S1308). At this time, the object ID in the object information (FIG. 18) of the duplicate document object is changed to a unique value different from the object ID of the original document object.
[0087]
When character recognition data is not included (in this embodiment, when the layout attribute is not “text”), a database storage device that stores only the selected document object is selected (step S1303). In both cases of step S1308 and step S1303, the relationship between the document object and the selected storage destination is recorded as link information (FIG. 19). After step S1303 or step S1308 is completed, it is checked whether or not a document object whose storage destination has not been determined remains. If it remains, the process returns to step S1301, and if not, the process proceeds to step S1305 (step S1304).
[0088]
Next, when all storage destinations are determined, link information is added to each document object, and the document object is stored in the selected database storage device (step S1305).
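The partial duplication of step S1306 can be illustrated with the short sketch below, which copies everything except the image data and assigns a new object ID. The dictionary keys (`image`, `recognized_text`, `layout`) are hypothetical names chosen for this example.

```python
import copy


def partial_duplicate(obj, new_object_id, image_key="image"):
    """S1306: copy a document object without its image data portion."""
    dup = copy.deepcopy({k: v for k, v in obj.items() if k != image_key})
    dup["object_id"] = new_object_id  # unique ID, different from the original
    return dup


if __name__ == "__main__":
    original = {
        "object_id": 1,
        "area_id": 4,
        "attribute": "text",
        "layout": {"x": 0, "y": 0, "width": 200, "height": 50},
        "image": b"\x00" * 10000,
        "recognized_text": "Quarterly report",
    }
    dup = partial_duplicate(original, new_object_id=101)
    print(sorted(dup))
```

Because the duplicate carries only the layout analysis information, the object information, and the character recognition data, it is much smaller than a full copy that includes the image.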
[0089]
Next, the output of the document data stored as described above will be described with reference to the flowchart of FIG.
[0090]
In step S301, a document object is searched for. When character recognition data is included in the document object, as in the fourth embodiment, a method such as full-text search over the recognized characters can be used. Next, document data reconstruction processing is performed (step S302), and the reconstructed document data is output (step S303).
[0091]
Hereinafter, an example of the document data reconstruction process (step S302) according to the fourth embodiment will be described in detail with reference to FIG.
[0092]
In step S1401, the document object is searched to check whether there is a document object. If there is, the process proceeds to step S1402, and if not, the process ends. If the document object hits the search in step S1401, the hit document object is taken out from the database (step S1402). Next, link information attached to the extracted document object is analyzed (step S1403), and it is determined whether the linked related document object is normal (step S1404). If normal, the related document object is extracted (step S1405).
[0093]
If the link destination related document object is not normal, it is checked whether or not there is a duplicate document object (step S1408). In the present embodiment, if the layout attribute of the related document object is “text”, there is a partially duplicated document object, so the partially duplicated document object is extracted (step S1409). On the other hand, if the layout attribute of the related document object is not “text”, there is no duplicate document object, so a dummy image having the same size as the original partial image is created using the layout analysis information (step S1410).
[0094]
The extracted partial image of the related document object and the recognized character and dummy image of the duplicate document object are synthesized based on the layout analysis information and reconstructed into the original document image (step S1406). At that time, if the partial image is compressed, decompression is also performed.
[0095]
After completion of step S1406, it is checked whether or not the related document object still remains. If it remains, the process returns to step S1403, and if not, the process ends (step S1407). Document data is reconstructed by the above processing.
[0096]
As described above, according to the fourth embodiment, a document object whose layout attribute is text has its character recognition result data held in a duplicate document object, so that the content can be restored even if the document object is damaged. Further, the data amount of the duplicate document object can be reduced as compared with the second embodiment, which holds a full duplicate of the same document object.
[0097]
[Fifth Embodiment]
If an important keyword is set in advance and the important keyword exists in a document object including character recognition data, the document object is duplicated and stored separately, so that if one becomes abnormal the other can be used instead to reconstruct the document data (enhancing safety). In the fifth embodiment, such processing is realized.
[0098]
The flow of processing according to the fifth embodiment will be described with reference to the flowcharts of FIGS. 15, 16, and 14, and FIGS. 17, 18, 19, and 20.
[0099]
In step S1501, an important keyword is set. Next, a document image is input from the scanner (107), which is an image input means (step S1502). Next, in the client PC (106), in the layout analysis process (step S1503), the document image is divided into areas for various attributes such as “figure”, “text”, and “table”, and layout analysis information (FIG. 17) is output. Next, a document object creation process (step S1504) is performed. In the document object creation process, as described in the fourth embodiment (FIG. 12), when the layout attribute of the document object is text, the character recognition process is performed, and the character recognition data is included in the document object.
[0100]
Next, the created document object is sorted and stored in the database storage device (103) or another database storage device (105) by document object storage processing (step S1505).
[0101]
Hereinafter, an example of the document object storage process (step S1505) will be described in detail with reference to FIG.
[0102]
First, in step S1601, a document object whose storage destination is undecided is selected. In step S1602, it is checked whether the character recognition data (FIG. 20) of the selected document object includes an important keyword.
[0103]
In this embodiment, if it is determined that an important keyword is included, the original document object is partially duplicated excluding the image data portion (step S1606), and the database storage devices are selected so that the original document object and the duplicate document object are stored in different devices (steps S1607 and S1608). At this time, the object ID in the object information (FIG. 18) of the duplicate document object is changed to a unique value different from the object ID of the original document object.
[0104]
If no important keyword is included, a database storage device that stores only the selected document object is selected (step S1603). In both cases of steps S1608 and S1603, the relationship between the document object and the selected storage destination is recorded as link information (FIG. 19).
[0105]
Next, when all the storage destinations are determined, link information is added to each document object, and the document object is stored in the selected database storage device (step S1605).
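The keyword check of step S1602 could be as simple as the following sketch. It assumes the character recognition data is available as a plain string under a hypothetical `recognized_text` key and uses exact substring matching; the actual structure of FIG. 20 and the matching rule are not specified here.

```python
def needs_duplicate(obj, important_keywords):
    """S1602: does the character recognition data contain an important keyword?"""
    text = obj.get("recognized_text") or ""
    return any(keyword in text for keyword in important_keywords)


def select_for_duplication(objects, important_keywords):
    """Return the IDs of the objects that would be partially duplicated."""
    return [
        obj["object_id"]
        for obj in objects
        if needs_duplicate(obj, important_keywords)
    ]


if __name__ == "__main__":
    objects = [
        {"object_id": 1, "recognized_text": "Contract amount: 1,000,000 yen"},
        {"object_id": 2, "recognized_text": "Thank you for your cooperation."},
        {"object_id": 3},  # a figure: no character recognition data
    ]
    print(select_for_duplication(objects, ["Contract", "Invoice"]))
```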
[0106]
An example of document data reconstruction processing (step S302) according to the fifth embodiment will be described in detail with reference to the flowchart of FIG.
[0107]
In step S1401, the document object is searched to check whether there is a document object. If there is, the process proceeds to step S1402, and if not, the process ends. If the document object hits the search in step S1401, the hit document object is taken out from the database (step S1402). Next, link information attached to the extracted document object is analyzed (step S1403), and it is determined whether the linked related document object is normal (step S1404). If normal, the related document object is extracted (step S1405).
[0108]
If the link destination related document object is not normal, it is checked whether or not there is a duplicate document object (step S1408). In the fifth embodiment, if the important keyword set in step S1501 is included in the character recognition data, there is a partially duplicated document object. Therefore, the partially duplicated document object is extracted (step S1409). If the related document object does not contain an important keyword in the character recognition data, there is no duplicate document object, so a dummy image having the same size as the original partial image is created using the layout analysis information (step S1410).
[0109]
The partial images of the extracted document object and the related document objects, the recognized characters of the duplicate document object, and the dummy images are synthesized based on the layout analysis information and reconstructed into the original document image (step S1406). At that time, if the partial image is compressed, decompression is also performed.
[0110]
After completion of step S1406, it is checked whether or not the related document object still remains. If it remains, the process returns to step S1403, and if not, the process ends (step S1407). Document data is reconstructed by the above processing.
[0111]
As described above, according to the fifth embodiment, since a copy of a document object whose character recognition result includes a keyword is stored in another database storage device, even if the document object is damaged, the important part of the document can be restored to such an extent that its contents can be judged.
[0112]
[Other Embodiments]
Note that the present invention may be applied to a system including a plurality of devices (for example, a host computer, an interface device, a reader, and a printer), or to an apparatus consisting of a single device (for example, a copying machine or a facsimile device).
[0113]
Needless to say, the object of the present invention can also be achieved by supplying a storage medium (or recording medium) on which a program code of software that realizes the functions of the above-described embodiments is recorded to a system or apparatus, and having the computer (or CPU or MPU) of the system or apparatus read and execute the program code stored in the storage medium. In this case, the program code itself read from the storage medium realizes the functions of the above-described embodiments, and the storage medium storing the program code constitutes the present invention. Further, it goes without saying that the invention covers not only the case where the functions of the above-described embodiments are realized by the computer executing the read program code, but also the case where an operating system (OS) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.
[0114]
Furthermore, it goes without saying that the invention also covers the case where, after the program code read from the storage medium is written into a memory provided in a function expansion card inserted into the computer or a function expansion unit connected to the computer, a CPU or the like provided in the function expansion card or function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.
[0115]
When the present invention is applied to the storage medium, the storage medium stores program codes corresponding to the flowcharts described above.
[0116]
[Effects of the Invention]
As described above, according to the present invention, it is possible to provide a database system suitable for document management that can suppress an increase in file size and reduce a risk of losing a file.
[Brief description of the drawings]
FIG. 1 is a block configuration diagram of a system according to a first embodiment of the present invention.
FIG. 2 is a flowchart showing a flow of processing from image input processing to document object storage processing according to the first embodiment.
FIG. 3 is a flowchart showing a flow of processing from document object search processing to reconstructed document output processing according to the first embodiment.
FIG. 4 is a flowchart showing a processing flow for one example of document object creation processing according to the first embodiment;
FIG. 5 is a flowchart showing a processing flow for one example of document object storage processing according to the first embodiment;
FIG. 6 is a flowchart showing a process flow for an example of a document data reconstruction process according to the first embodiment.
FIG. 7 is a flowchart showing a processing flow for one example of document object storage processing according to the second embodiment of the present invention;
FIG. 8 is a flowchart showing a processing flow for one example of document data reconstruction processing according to the second embodiment;
FIG. 9 is a flowchart showing a process flow from an important object setting process to a document object storage process according to the third embodiment of the present invention.
FIG. 10 is a flowchart showing a processing flow for one example of document object creation processing according to the third embodiment;
FIG. 11 is a flowchart showing a processing flow for one example of document object storage processing according to the third embodiment;
FIG. 12 is a flowchart showing a processing flow for one example of document object creation processing according to the fourth embodiment of the present invention;
FIG. 13 is a flowchart showing a processing flow for one example of document object storage processing according to the fourth embodiment;
FIG. 14 is a flowchart showing a process flow of an example of a document data reconstruction process according to the fourth embodiment.
FIG. 15 is a flowchart showing a process flow from an important keyword setting process to a document object storage process according to the fifth embodiment of the present invention;
FIG. 16 is a flowchart showing a processing flow for one example of document object storage processing according to the fifth embodiment;
FIG. 17 is a diagram showing an example of the structure of layout analysis information according to the first embodiment of the present invention.
FIG. 18 is a diagram illustrating an example of a structure of object information according to the first embodiment.
FIG. 19 is a diagram illustrating an example of a structure of link information according to the first embodiment.
FIG. 20 is a diagram showing an example of the structure of character recognition data according to the first embodiment.

Claims

A document management apparatus comprising:
analysis means for dividing a digitized document image into partial areas by layout attribute, and outputting layout analysis information including information on the layout, within the document image, of each of the divided partial areas;
creating means for extracting, based on the layout analysis information, the document image in each partial area as partial document image data, and creating, for each partial area, a document object including the extracted partial document image data and the layout analysis information;
storage processing means for allocating each of the plurality of document objects created by the creating means to a plurality of storage means such that document objects at adjacent positions in the layout of the document image are stored in different storage means, and storing each document object, with link information for specifying the storage destination of each document object added, in the storage means to which it is allocated; and
reconstruction means for retrieving related document objects from the plurality of storage means based on the link information, and reconstructing the original document image by combining the partial document image data included in the retrieved document objects based on the layout analysis information included in those document objects.
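
The allocation rule in this claim (document objects that are adjacent in the layout go to different storage means, and each object carries link information naming the storage destinations) can be illustrated with a minimal Python sketch. Everything named here is a hypothetical illustration rather than the procedure of the embodiments: the `DocumentObject` fields, the `gap` threshold used to decide adjacency, and the greedy least-loaded choice of storage are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DocumentObject:
    # Hypothetical fields; the claim only requires image data, layout info and links.
    object_id: int
    attribute: str                       # e.g. "text", "figure", "table"
    box: Tuple[int, int, int, int]       # (left, top, right, bottom) on the page
    image_data: bytes = b""
    links: Dict[int, int] = field(default_factory=dict)  # object_id -> storage index


def adjacent(a, b, gap=10):
    """Assumed adjacency test: the boxes overlap once each is grown by `gap` pixels."""
    al, at, ar, ab = a.box
    bl, bt, br, bb = b.box
    return al - gap <= br and bl - gap <= ar and at - gap <= bb and bt - gap <= ab


def allocate(objects, num_stores, gap=10):
    """Greedily give each object a storage index unused by its already-placed neighbours."""
    placement = {}
    load = [0] * num_stores
    for obj in objects:
        taken = {placement[other.object_id] for other in objects
                 if other.object_id in placement and adjacent(obj, other, gap)}
        free = [s for s in range(num_stores) if s not in taken]
        if not free:
            raise ValueError("not enough storage units to separate adjacent objects")
        store = min(free, key=lambda s: load[s])   # prefer the least-loaded storage
        placement[obj.object_id] = store
        load[store] += 1
    # Attach link information so any one object can locate all the others.
    for obj in objects:
        obj.links = dict(placement)
    return placement
```

For three boxes stacked vertically (title, body, figure) and two storage units, this sketch alternates the storage index, so losing one storage unit never removes two neighbouring areas of the page.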

The document management apparatus according to claim 1, wherein the layout attribute includes at least a figure, a text, or a table as an attribute.

Each document object created by the creation means includes partial document image data of a corresponding partial area, object information including an identification number of the document object, and the layout analysis information. The document management apparatus described.

The document management apparatus according to any one of claims 1 to 3, wherein, for a document object having a specific layout attribute, the storage processing means further duplicates the document object and allocates and stores the duplicated document object in a storage means different from that of the document object.

The document management apparatus according to claim 4, wherein, for a document object having the specific layout attribute, the reconstruction means reconstructs the original document image using the document object when the document object necessary for the reconstruction is not damaged, and reconstructs the original document image using the duplicated document object when the document object necessary for the reconstruction is damaged.
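
The fallback described in this claim (use the document object if it is intact, otherwise use its duplicate from another storage means) might be sketched as follows. The claims do not say how damage is detected; the CRC-32 check, the record layout, and the function names below are assumptions made only for illustration.

```python
import zlib


def is_intact(record):
    """Assumed damage test: the record exists and its stored CRC-32 still matches."""
    return record is not None and zlib.crc32(record["data"]) == record["crc"]


def fetch_with_fallback(stores, primary_loc, duplicate_loc, object_id):
    """Return the object's data from its primary storage, else from the duplicate.

    `stores` is a list of dicts standing in for separate storage units, and the
    two locations are indexes into it (hypothetical link information).
    Returns None when neither copy is usable.
    """
    for loc in (primary_loc, duplicate_loc):
        record = stores[loc].get(object_id)
        if is_intact(record):
            return record["data"]
    return None
```

Because the duplicate lives in a different storage unit, a single failed or corrupted unit leaves at least one usable copy of each duplicated object.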

The document management apparatus according to claim 1, further comprising setting means for setting a selected document object as an important object,
wherein the storage processing means further duplicates the document object set as an important object by the setting means, and allocates and stores the duplicated document object in a storage means different from that of the document object.

The document management apparatus according to claim 6, wherein the setting means includes:
determination condition setting means for setting a determination condition for determining whether or not a document object is important; and
determination means for determining, based on the layout analysis information, whether or not a document object satisfies the determination condition,
and wherein a document object determined to satisfy the determination condition is set as an important document object.

The document management apparatus according to claim 6, wherein, for the important document object, the reconstruction means reconstructs the original document image using the document object when the document object necessary for the reconstruction is not damaged, and reconstructs the original document image using the duplicated document object when the document object necessary for the reconstruction is damaged.

The document management apparatus according to claim 1, wherein
character recognition is performed on each area that is a character recognition target area based on the layout analysis information, and character recognition data is acquired,
when the partial document image data corresponds to a character recognition target area, the creating means includes the character recognition data in the document object together with the partial document image data, and
for a document object including character recognition data, the storage processing means generates a second document object using the character recognition data included in the document object, and allocates and stores the second document object in a storage means different from that of the original document object.

The document management apparatus according to claim 9, wherein, when the document object corresponds to the character recognition target area, the storage processing means generates the second document object by duplicating the document object with the partial document image data excluded, and allocates and stores the second document object in a storage means different from that of the original document object.
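
Generating the second document object by copying a recognized-text object without its image data could look like the sketch below. The `DocumentObject` fields and the function name are hypothetical; only the idea of dropping the partial document image data while keeping the character recognition data comes from the claim.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class DocumentObject:
    # Hypothetical structure for illustration only.
    object_id: int
    attribute: str
    box: Tuple[int, int, int, int]
    image_data: Optional[bytes]
    recognized_text: Optional[str] = None


def make_second_object(obj):
    """Copy an object that has character recognition data, leaving out the image data.

    Returns None for objects without recognition data (for example figures),
    since this sketch creates no text-only copy for them.
    """
    if obj.recognized_text is None:
        return None
    return replace(obj, image_data=None)
```

The text-only copy is typically far smaller than the image-bearing original, which is consistent with the stated aim of limiting file size while still keeping a fallback.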

The document management apparatus according to claim 9, wherein, for a document object of the character recognition target area, the reconstruction means reconstructs the original document image using the document object when the document object necessary for the reconstruction is not damaged, and reconstructs the original document image using the second document object when the document object necessary for the reconstruction is damaged.

The document management apparatus according to any one of claims 9 to 11, further comprising setting means for setting a keyword,
wherein the storage processing means further duplicates a document object whose character recognition data includes the keyword, and allocates and stores the duplicated document object in a storage means different from that of the original document object.

The document management apparatus according to claim 12, wherein the reconstruction means reconstructs the original document image using the document object when the document object necessary for reconstruction is not damaged, and reconstructs the original document image using the duplicated document object when the document object necessary for reconstruction is damaged.

The document management apparatus according to any one of claims 1 to 13, wherein, when a document object necessary for reconstruction cannot be acquired, the reconstruction means assigns a dummy image to the partial area of that document object.
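
As a rough illustration of the dummy-image behaviour, the sketch below pastes each retrieved partial image into a page at its layout position and fills the area of any object that could not be acquired with a placeholder value. The grayscale list-of-lists page, the `dummy` and `background` values, and the assumption that each area's box is still known (for example from surviving layout information) are all illustrative choices, not details taken from the embodiments.

```python
def reconstruct(page_w, page_h, layout, fetched, background=255, dummy=128):
    """Rebuild a grayscale page from partial images.

    layout  : dict object_id -> (left, top, right, bottom), right/bottom exclusive
    fetched : dict object_id -> 2D list of pixel rows; missing ids were not acquired
    """
    page = [[background] * page_w for _ in range(page_h)]
    for object_id, (left, top, right, bottom) in layout.items():
        pixels = fetched.get(object_id)
        for y in range(top, bottom):
            for x in range(left, right):
                # Use the dummy value wherever the partial image is unavailable.
                page[y][x] = dummy if pixels is None else pixels[y - top][x - left]
    return page
```

With a 2x2 page whose top row was retrieved as black pixels and whose bottom row is missing, the result is `[[0, 0], [128, 128]]`: the surviving area is restored and the lost area is visibly marked.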

A document management method comprising:
an analysis step in which the analysis means of a document management apparatus divides a digitized document image into partial areas by layout attribute and outputs layout analysis information including information on the layout, within the document image, of each of the divided partial areas;
a creating step in which the creating means of the document management apparatus extracts, based on the layout analysis information, the document image in each partial area as partial document image data and creates, for each partial area, a document object including the extracted partial document image data and the layout analysis information;
a storage processing step in which the storage processing means of the document management apparatus allocates each of the plurality of document objects created in the creating step to a plurality of storage means such that document objects at adjacent positions in the layout of the document image are stored in different storage means, and stores each document object, with link information for specifying the storage destination of each document object added, in the storage means to which it is allocated; and
a reconstruction step in which the reconstruction means of the document management apparatus retrieves related document objects from the plurality of storage means based on the link information and reconstructs the document image by combining the partial document image data included in the retrieved document objects.

A computer-readable storage medium storing a program for causing a computer to function as:
analysis means for dividing a digitized document image into partial areas by layout attribute, and outputting layout analysis information including information on the layout, within the document image, of each of the divided partial areas;
creating means for extracting, based on the layout analysis information, the document image in each partial area as partial document image data, and creating, for each partial area, a document object including the extracted partial document image data and the layout analysis information;
storage processing means for allocating each of the plurality of document objects created by the creating means to a plurality of storage means such that document objects at adjacent positions in the layout of the document image are stored in different storage means, and storing each document object, with link information for specifying the storage destination of each document object added, in the storage means to which it is allocated; and
reconstruction means for retrieving related document objects from the plurality of storage means based on the link information, and reconstructing the original document image by combining the partial document image data included in the retrieved document objects based on the layout analysis information included in those document objects.