JP2005043990A - Document processor and document processing method

Info

Publication number: JP2005043990A
Application number: JP2003200463A
Authority: JP
Inventors: Yasuto Ishitani (石谷 康人)
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-07-23
Filing date: 2003-07-23
Publication date: 2005-02-17

Abstract

PROBLEM TO BE SOLVED: To precisely generate a structured document tagged with XML or HTML from printed documents of various forms.
SOLUTION: The document processor comprises a layout analysis part 10 for extracting sentence areas and layout information of the sentence areas, and extracting character line areas for every sentence area; a character recognition part 11 for extracting character patterns from the character line areas and converting them to character codes; a document logic element extraction part 13 for extracting document logic elements from the sentence areas in reference to a database having information for extracting document logic elements and information for the contents of the document logic elements; a document logic element reading order determination part for arranging the document logic elements in the reading order; a document structure analysis part 14 for hierarchically grouping the document logic elements based on information showing the identity of kinds of document logic elements and the layers of document logic elements; and a tree structure extraction part for forming a tree structure showing the logic relation between document logic elements and character codes based on the grouped information.
COPYRIGHT: (C)2005, JPO & NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、文書を処理する文書処理装置及び文書処理方法に関する。
【０００２】
【従来の技術】
新聞記事、雑誌、科学技術文献、書籍、オフィス文書、公文書などの印刷文書をスキャナ等の入力装置から文書画像としてコンピュータに取り込み、この取り込んだ画像情報から文字を認識しコード化することにより、印刷文書の内容を電子化・再利用したいという要求がある。
【０００３】
この場合、印刷文書をイメージスキャナで画像としてコンピュータに取り込み、この取り込んだ画像から「レイアウト構造」と「論理構造」を抽出・対応づけ、さらには文字認識処理を行い、読み順通りに出力するといった処理を行うことが一般的である。例えば、確率文法の枠組を用いて、複数ページに渡る章節構造とリスト構造を抽出する技術が知られている（例えば、非特許文献１参照）。
【０００４】
また、章見出し、パラグラフ、箇条書き、数式、脚注、ヘッダ、フッタなどの汎用的な文書論理要素を自動抽出し、これらに対して読み順を付与した後、文字認識結果を出力する技術が知られている（例えば、特許文献１参照）。
【０００５】
【非特許文献１】
建石：“確率文法を用いた文書論理構造の解釈法”、信学論Ｄ−ＩＩ，Ｖｏｌ．Ｊ７９−Ｄ−ＩＩ，Ｎｏ．５，ｐｐ．６８７−６９７，（１９９６−５）
【０００６】
【特許文献１】
特開平１１−２５００４１号公報
【０００７】
【発明が解決しようとする課題】
従来の技術では、印刷文書をイメージスキャナで画像としてコンピュータに取り込み、この取り込んだ画像から「レイアウト構造」と「論理構造」を抽出・対応づけ、さらには文字認識処理を行い、読み順通りに出力するといった処理が一般的である。しかし、いずれの従来の技術も特定のレイアウト条件下の印刷文書から特定の論理要素を抽出するといった程度にとどまり、多様な印刷文書全般にわたって詳細に解析して、ユーザが所望の論理情報を柔軟に抽出することは困難であった。特に箇条書きが階層的に記述されている文書において、その階層的な箇条書き構造を矛盾無く高精度に抽出し、この結果をＨＴＭＬやＸＭＬで記述された構造化文書に正しく変換することは困難であった。
【０００８】
本発明はこのような課題に着目してなされたものであり、その目的とするところは、多様な形態の印刷文書からＸＭＬやＨＴＭＬ等でタグ付けされた構造化文書を高精度に生成することができる文書処理装置及び文書処理方法を提供することにある。
【０００９】
【課題を解決するための手段】
上記課題を解決するために、本発明は文書処理装置であって、入力された文書画像から、文章が記載されている領域を示す文章領域を抽出すると共に、この文章領域の前記文書画像におけるレイアウトを示すレイアウト情報を抽出するレイアウト解析部と、前記文章領域毎に、当該文章領域を構成する文字行領域を抽出する文字行領域抽出部と、前記文字行領域から文字パターンを抽出して文字コードへ変換する文字認識部と、前記文章領域内の文書を構成する文書論理要素を抽出するための情報と前記文書論理要素の内容に関する情報とを有し、所望の木構造を得るための構造化方法を記載したデータベースを示す第１の文書モデルを参照して、前記レイアウト解析部により抽出された文章領域から文書論理要素を抽出する文書論理要素抽出部と、前記文書論理要素抽出部により抽出された前記文書論理要素を読み順通りに並べる文書論理要素読み順決定部と、文書論理要素の種類の同一性と文書論理要素の階層とを示す情報を有し、文書論理要素を構造化するための第２の文書モデルに基づいて、前記文書論理要素読み順決定部によって読み順通りに並べ替えられた文書論理要素を階層的にグループ化する文書論理要素グループ化部と、前記文書論理要素グループ化部でグループ化された情報に基づいて文書論理要素と文字コードの論理的な関係を表す木構造を作成する木構造作成部とを具備する。
【００１０】
【発明の実施の形態】
以下、図面を参照して本発明の実施形態について詳細に説明する。本実施形態の文書処理装置は、スキャナ等から入力された複数枚の文書画像から情報を抽出・編集して、例えばＸＭＬ技術を用いてタグ付けされた構造化文書を生成する。まず、スキャナ等で構成される文書画像生成部において複数の紙文書を画像に変換して文書画像を生成する。
【００１１】
各画像はスキャナ等で２値化処理が行われて２値画像に変換される。また、スキャニング時に多値画像を生成したあとで、文献「鈴木、窪田：“補間と凸判定に基づくストローク抽出を用いた低解像度文書画像の２値化”、電子情報通信学会、パターン認識・理解研究会、技術報告、ＰＲＭＵ９９−２３１、ｐｐ１−８、２０００」に基づいた公知の技術により多値画像から２値画像に変換してもよい。当該２値画像はさらに特開平０５−１７４１８３号公報「文書画像傾き検出方式」を用いて画像の傾きが補正された２値画像に変換しても良い。以下、これら処理により傾きが補正された２値画像を入力画像と呼ぶことにする。複数の入力画像は、順次、後述する文書処理装置に入力される。
【００１２】
図１は、本発明に係る文書処理装置の構成を示すブロック図である。文書処理装置は、レイアウト解析部１０と、文字認識部１１と、文書論理要素抽出部１３と、文書モデルデータベース１６と、文書論理要素抽出結果編集部１２と、文書構造解析部１４と、データフォーマット変換部１５とから構成される。
【００１３】
このような構成において、まず、入力画像は、レイアウト解析部１０へ入力される。レイアウト解析処理部１０は、外部から入力された入力画像をレイアウト解析し、文章領域、表領域、図領域、写真／絵領域などの性質の異なる部分領域をレイアウト要素群として抽出する。この抽出された各領域は、当該領域を外接する矩形により表現することが可能である。例えば文章領域は、図２に示すように、文章領域ＴＢを外接する矩形５０により表現される。この場合、外接矩形５０は図３に示すように、外接矩形５０の左上端の位置座標（ｘ１，ｙ１）と右下端の位置座標（ｘ２，ｙ２）により表現することができる。
【００１４】
レイアウト解析では、縦書きと横書きの文章領域は異なる領域として分離されて出力される。また一つの文章領域はカラムをまたがって抽出されることがない。ただし、レイアウト解析では、段落（パラグラフ）、箇条書き（リスト）、数式、章見出しなどの文書論理要素に相当する領域が抽出されていなくても良い。
【００１５】
各々の文章領域ＴＢ（あるいは表領域）では図２に示すように文字行領域Ｓｔｒが順序付けられて抽出され（Ｓｔｒ１，Ｓｔｒ２，Ｓｔｒ３，Ｓｔｒ４，Ｓｔｒ５）、各文字行領域Ｓｔｒでは、図２に示すように文字領域Ｃｈが同様に順序付けられて抽出される（Ｃｈ１，Ｃｈ２，Ｃｈ３，Ｃｈ４，Ｃｈ５）。
【００１６】
文章領域ＴＢ（あるいは表領域）と文字行領域Ｓｔｒと文字領域Ｃｈとは、それぞれ階層的に、例えば図４のような木構造により記述（表現）することができる。
【００１７】
レイアウト解析部１０は、例えば特開平９−１６７２３３号公報「画像処理方法および画像処理装置」に開示された構成により実現できる。この場合、レイアウト解析部１０で文字認識処理が行われて、文章領域の各文字がコード化される。あるいは、レイアウト解析の直後に文字認識処理が行われて文章領域ＴＢの各文字がコード化されるようにしてもよい。
【００１８】
文字認識部１１は、例えば、「有吉：“動的な仮説生成・検証による日本語印刷文書からの文字の切り出し”，電子情報通信学会技術報告，ＰＲＵ９３−４７，ｐｐ．３３−４０，１９９３．」に開示された構成により文字の認識を実現する。この場合、レイアウト解析で得られた文字行領域Ｓｔｒから個々の文字領域Ｃｈを切り出したあと、文字領域Ｃｈ内の文字パターンから文字コード情報に変換する。文字認識部１１は、この文字コードへ変換した文字認識結果を読み順通りに並べた状態で文書論理要素抽出部１３へ出力する。なお、この読み順通りに並んだ状態とは、横書き文字行の場合には文字行中において文字コードが左から右に並んでいる状態を意味し、縦書き文字行の場合には文字行中において文字コードが上から下に並んでいる状態を意味する。
【００１９】
更に文字認識部１１は、レイアウト解析部１０でレイアウト解析されたレイアウト解析結果も文書論理要素抽出部１３へ出力する。
【００２０】
文書論理要素抽出部１３は、予め作成されて文書モデルデータベース１６に格納された文書モデル（詳細は後述する）を、文字認識部１１から入力されたレイアウト解析結果と文字認識結果に適用することにより、タイトル、著者、アブストラクト、ヘッダ、フッタ、章見出し、パラグラフ、箇条書き、表の要素・列、図表キャプションなどの文書論理要素を抽出する。ここではこれら抽出された文書論理要素を文書論理要素抽出結果と記す。
【００２１】
文書論理要素抽出結果編集部１２は、文書論理要素抽出部１３で抽出された文書論理要素抽出結果に対して、オペレータが要素名や階層情報について変更や付与を行なうための部分である。
【００２２】
文書構造解析部１４は、文書論理要素に対して読み順を付与したあと文書論理要素を読み順通りに並べ替え、予め作成してある文書モデルを文書モデルデータベース１６から参照して各文書論理要素について階層的なグループ化を行い、文書構造を木構造として抽出する。上記読み順の決定方法は例えば特開平１１−２５００４１号公報「文書処理装置および文書処理方法」に開示された構成により実現することができる。
【００２３】
データフォーマット変換部１５は、木構造として記述されている文書論理要素と文字認識結果をＸＭＬやＨＴＭＬなどで記述した構造化文書に変換し、最終的な出力を得る。
【００２４】
以下、文書モデルデータベース１６について詳細に説明する。
【００２５】
従来のＯＣＲ（ＯｐｔｉｃａｌＣｈａｒａｃｔｅｒＲｅｃｏｇｎｉｔｉｏｎ）では、文字認識結果を読み順通りに出力することにより文書画像からプレーンテキストを生成するものであった。しかし、実際の文書では章、節、箇条書きなどの階層的な構造を有している場合があり、ＸＭＬ／ＨＴＭＬ文書ではこのような文書を階層的な木構造で表現する。したがって、文書画像をＸＭＬ／ＨＴＭＬ文書に変換する場合には、上述した文書論理要素の集合からなんらかの階層構造を抽出する必要がある。この際、構造化方法が異なると処理結果の木構造が大幅に異なってしまうことがあり、オペレータ所望の木構造を得られない場合には文書変換後に大幅な修正を必要とするという問題が生じる。
【００２６】
そこで本実施形態では、オペレータが予め構造化方法を文書モデルという形で文書モデルデータベース１６に指定しておき、システムがこの文書モデルを解釈してオペレータ所望の木構造を生成した後、テキスト化した文書を構造化文書に変換する。
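[0027]
Using such a document model makes it possible to extract the document logical elements in a document image with high accuracy and also to extract the hierarchical structure of the document with high accuracy. As a result, XML/HTML tagged documents can be generated from printed documents efficiently and accurately.
[0028]
FIGS. 5 to 7 show specific examples of the user interface with which the operator specifies the structuring method.
[0029]
FIG. 5 is a user interface for defining the knowledge (model) of the structuring method for document elements (the model contents are described in detail later).
[0030]
FIG. 6 is a user interface for defining the knowledge (model) of the extraction method for logical elements (the model contents are described in detail later).
[0031]
FIG. 7 is a user interface for defining the knowledge (model) of the contents of logical elements (the model contents are described in detail later).
[0032]
The contents defined through these user interfaces are stored in the document model database 16 of FIG. 1 and are used as needed when a document is input.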
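[0033]
The document model used consists of, among other things:
(1) knowledge about the contents of document logical elements
(2) knowledge about the extraction method for document logical elements
(3) knowledge about the structuring method for chapter heading structures and list structures
[0034]
The knowledge about the contents of document logical elements in (1) is defined using the user interface shown in FIG. 5. First, one of the following is selected as the logical element name: "header", "footer", "document title", "chapter heading 1", "chapter heading 2", "chapter heading 3", "chapter heading 4", "chapter heading 5", "chapter heading 6", "chapter heading 7", "chapter heading 8", "chapter heading 9", "paragraph", "ordered list", "unordered list", "definition list", "figure caption", "figure footnote", "table caption", "table footnote", "formula", "footnote".
[0035]
A selection in the user interfaces of FIGS. 5 to 7 is made by clicking the white circle at the head of a logical element name with a pointing device such as a mouse, which changes it to a black circle. An item headed by a white circle is not selected; an item headed by a black circle is selected.
[0036]
In "Hierarchy level", hierarchy information is set using any integer of 0 or greater.
[0037]
In "Language", either "Japanese" or "English" is selected.
[0038]
In "Character line direction", either "horizontal writing" or "vertical writing" is selected.
[0039]
In "Character size", "Block height", and "Block length", the upper and lower limits of the size information are set in mm.
[0040]
In "Number of character lines" and "Number of characters", upper and lower limits are set.
[0041]
In "Indent width", the amount of indentation is set in mm.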
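[0042]
In "Heading information", the information used to recognize a heading is set. Heading information is the identifying information of a logical element that is attached to the beginning of a chapter heading or list item: a number such as "第１編" (Part 1), "第１章" (Chapter 1), "第１条" (Article 1), "１．", "（１）", "①", "Ｉ", "（イ）", "（ａ）", "Ａ．", or "一", or a symbol such as "・", "○", "※", or "＊". These heading descriptions are expressed by two means: "definition by number sequence" and "definition by keyword information".
[0043]
In "definition by number sequence", a heading description is described with 13 kinds of information: "digit", "kanji numeral", "Roman numeral", "circled number", "alphabetic letter", "kana", "Greek letter", "symbol", "period", "hyphen", "opening parenthesis", "closing parenthesis", and "space". Each kind is assigned either 0 (absent) or an integer of 1 or more, and the heading description is expressed as a sequence of 13 numbers arranged in the order given above.
[0044]
For example, the heading description "(1)" assigns 1 to "digit", 1 to "opening parenthesis", and 1 to "closing parenthesis", and 0 to everything else, and is expressed as "1000000000110". The heading description "1.2.1" assigns 3 to "digit", 2 to "period", and 1 to "space", and 0 to everything else, and is expressed as "3000000020001".
[0045]
In this way, the present invention expresses the heading description of a list item or chapter heading (for example, the "1." part of "1. Introduction" is the heading of the chapter heading) as a combination of 13 character types: digit, kanji numeral, Roman numeral, circled number, alphabetic letter, kana, Greek letter, symbol, period, hyphen, opening parenthesis, closing parenthesis, and space. Each character type is assigned either 0 (meaning absent) or an integer of 1 or more (meaning the number of characters of that type), so that a heading description is expressed as a combination of 13 numeric values.
[0046]
Furthermore, by arranging the numeric values in the order digit, kanji numeral, Roman numeral, circled number, alphabetic letter, kana, Greek letter, symbol, period, hyphen, opening parenthesis, closing parenthesis, space, a heading description is expressed as a sequence of 13 numeric values. For example, the heading description "1." assigns 1 to "digit", 0 to "kanji numeral", 0 to "Roman numeral", 0 to "circled number", 0 to "alphabetic letter", 0 to "kana", 0 to "Greek letter", 0 to "symbol", 1 to "period", 0 to "hyphen", 0 to "opening parenthesis", 0 to "closing parenthesis", and 0 to "space"; arranged in order, these give 1000000010000.
[0047]
The present invention uses this method to identify the heading descriptions of chapter headings and list items in the input document. Because "1. Introduction" and "2. Principle" are both expressed as "1000000010000", the variety of heading descriptions of chapter headings and list items can be handled.
[0048]
Introducing this definition method removes the need to enumerate every list heading description that could appear when defining the document model.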
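The 13-value encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names, the Unicode ranges chosen for each category, and the decision to treat only the dedicated Unicode Roman numeral characters as "Roman numeral" (so an ASCII "I" counts as an alphabetic letter) are all assumptions made for the example.

```python
import unicodedata

# Order follows the 13 kinds listed in the text.
CATEGORIES = ("digit", "kanji_numeral", "roman_numeral", "circled_number",
              "alphabet", "kana", "greek", "symbol", "period", "hyphen",
              "open_paren", "close_paren", "space")

KANJI_NUMERALS = set("〇一二三四五六七八九十百千")
OPEN_PARENS = set("([{「『【〔")
CLOSE_PARENS = set(")]}」』】〕")
HYPHENS = set("-‐−")


def classify(ch: str) -> str:
    """Map one character to one of the 13 categories (assumed boundaries)."""
    code = ord(ch)
    # Checked before NFKC folding, which would turn ① into 1 and Ⅰ into I.
    if 0x2460 <= code <= 0x2473:
        return "circled_number"
    if 0x2160 <= code <= 0x217F:
        return "roman_numeral"
    folded = unicodedata.normalize("NFKC", ch)  # full-width -> ASCII forms
    if len(folded) != 1:
        return "symbol"
    code = ord(folded)
    if folded in KANJI_NUMERALS:
        return "kanji_numeral"
    if 0x3041 <= code <= 0x3096 or 0x30A1 <= code <= 0x30FA:
        return "kana"
    if "0" <= folded <= "9":
        return "digit"
    if folded.isascii() and folded.isalpha():
        return "alphabet"
    if 0x0370 <= code <= 0x03FF:
        return "greek"
    if folded == ".":
        return "period"
    if folded in HYPHENS:
        return "hyphen"
    if folded in OPEN_PARENS:
        return "open_paren"
    if folded in CLOSE_PARENS:
        return "close_paren"
    if folded.isspace():
        return "space"
    return "symbol"


def heading_signature(heading: str) -> str:
    """Count characters per category and join the 13 counts in order.

    Counts above 9 would make the joined string ambiguous; a real system
    would keep the counts as a tuple instead.
    """
    counts = dict.fromkeys(CATEGORIES, 0)
    for ch in heading:
        counts[classify(ch)] += 1
    return "".join(str(counts[name]) for name in CATEGORIES)
```

With this sketch, `heading_signature("(1)")` and `heading_signature("（１）")` produce the same sequence, and "1." and "2." collapse to one signature, which is the property the text relies on.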
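[0049]
"Definition by keyword information" in "Heading information" expresses heading descriptions that cannot be expressed by such a number sequence directly as character strings. However, enumerating every heading description, as in "第１条", "第２条", ..., "第９９条" (Article 1, Article 2, ..., Article 99), is cumbersome and also makes keyword matching slow. The part where the number changes can therefore be abbreviated, for example with a blank, as in "第条".
[0050]
Chapter headings and definition list items in which an arbitrary character string is inserted between an opening bracket "[" and a closing bracket "]" can also be identified flexibly by adopting keyword definitions that allow this kind of abbreviation.
[0051]
When keyword matching is performed, conditions can be defined on how the keyword matches within the character line being processed. As the "keyword match position", the distance from the beginning of the character line to the start of the keyword match can be set in mm, and a match beyond this value is treated as invalid. The occupancy of the keyword within the character line can also be set, and a match whose occupancy falls below that value is treated as invalid. The "keyword match position" and "keyword occupancy" conditions can also be combined: "AND" is selected in "keyword match condition" when both conditions must be satisfied, and "OR" is selected when satisfying either one is sufficient.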
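The keyword conditions above (abbreviated keyword, match position, occupancy, AND/OR) might be combined as in the following sketch. Everything concrete here is an assumption for illustration: the `*` placeholder for the abbreviated part, the fixed `char_width_mm` used to turn a character offset into millimetres, and the function name itself. The text only states that the position is measured in mm and that occupancy is a ratio within the line.

```python
import re


def match_keyword(line, keyword, *, gap="*", char_width_mm=3.0,
                  max_offset_mm=None, min_occupancy=None, mode="AND"):
    """Return True if `keyword` matches `line` under the given conditions.

    `gap` marks the abbreviated part of the keyword, which is skipped
    during matching (hypothetical notation).
    """
    # Literal parts are matched exactly; the abbreviated part matches anything.
    pattern = ".+?".join(re.escape(part) for part in keyword.split(gap))
    m = re.search(pattern, line)
    if m is None:
        return False
    checks = []
    if max_offset_mm is not None:
        # Distance from the line start to the start of the match.
        checks.append(m.start() * char_width_mm <= max_offset_mm)
    if min_occupancy is not None:
        # Share of the line covered by the matched text.
        checks.append((m.end() - m.start()) / len(line) >= min_occupancy)
    if not checks:
        return True
    return all(checks) if mode == "AND" else any(checks)
```

A heading such as "第12条 総則" matches the keyword `第*条` at offset 0, while "詳細は第3条を参照" matches too far into the line and is rejected when a small `max_offset_mm` is set.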
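[0052]
The knowledge about the extraction method for document logical elements in (2) is defined using the user interface shown in FIG. 6. In the interface of FIG. 6, an extraction method can be specified for each logical element such as "header", "footer", "document title", "chapter heading", "paragraph", "ordered list", "unordered list", "definition list", "figure", "figure caption", "figure footnote", "table", "table caption", "table footnote", "formula", and "footnote".
[0053]
In the following, when "do not extract" is selected, the logical element is extracted as a paragraph. When "extract and then discard" is selected, the element is extracted once during processing but is not output to the final tree structure. When "extract based on the model" is selected, the logical element is extracted according to the contents defined through the user interface shown in FIG. 5.
[0054]
For the logical element "header", one of "do not extract", "extract at the top of the page and discard", "extract at the right side of the page and discard", or "extract at the left side of the page and discard" can be selected.
[0055]
For the logical element "footer", one of "do not extract", "extract at the bottom of the page and discard", "extract at the right side of the page and discard", or "extract at the left side of the page and discard" can be selected.
[0056]
For the logical element "document title", one of "do not extract", "extract automatically", "extract and then discard", or "extract based on the model" can be selected.
[0057]
For the logical element "chapter/section heading", one of "do not extract", "extract automatically", "extract and then discard", or "extract based on the model" can be selected.
[0058]
For the logical element "paragraph", either "extract automatically" or "extract based on the model" can be selected.
[0059]
For the logical element "ordered list", one of "do not extract", "extract automatically", "extract and then discard", or "extract based on the model" can be selected.
[0060]
For the logical element "unordered list", one of "do not extract", "extract automatically", "extract and then discard", or "extract based on the model" can be selected.
[0061]
For the logical element "definition list", one of "do not extract", "extract automatically", "extract and then discard", or "extract based on the model" can be selected.
[0062]
For the logical element "figure", one of "do not extract", "extract automatically", or "extract and then discard" can be selected.
[0063]
For the logical element "figure caption", one of "do not extract", "extract and then discard", "extract above the figure", "extract below the figure", "extract to the right of the figure", or "extract to the left of the figure" can be selected.
[0064]
For the logical element "figure footnote", one of "do not extract", "extract and then discard", "extract above the figure", "extract below the figure", "extract to the right of the figure", or "extract to the left of the figure" can be selected.
[0065]
For the logical element "table", one of "do not extract", "extract automatically", "extract and then discard", "extract as horizontal text", "extract as mixed vertical/horizontal text", "extract as horizontal text unless it is a grid table", "extract as vertical text unless it is a grid table", or "extract as mixed vertical/horizontal text unless it is a grid table" can be selected. Here, a grid table means a table in which every horizontal ruled line is as long as the width of the table's outer frame and every vertical ruled line is as long as the height of the table's outer frame. Such a table is output according to the hierarchy "table - column - table element", where a column means a group formed by a line-up of table elements.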
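The grid table definition above lends itself to a direct check. The sketch below assumes the ruled lines have already been detected and are given as start/end coordinates; the function name, the argument layout, and the tolerance `tol` are hypothetical and not taken from the source.

```python
def is_grid_table(frame, h_rules, v_rules, tol=1.0):
    """Check the grid table property from the text.

    frame:   (x1, y1, x2, y2) of the table's outer frame
    h_rules: list of (x_start, x_end) for each horizontal ruled line
    v_rules: list of (y_start, y_end) for each vertical ruled line
    tol:     allowed deviation, in the same unit as the coordinates
    """
    width = frame[2] - frame[0]
    height = frame[3] - frame[1]
    # Every horizontal rule must span the full frame width.
    horizontal_ok = all(abs((x_end - x_start) - width) <= tol
                        for x_start, x_end in h_rules)
    # Every vertical rule must span the full frame height.
    vertical_ok = all(abs((y_end - y_start) - height) <= tol
                      for y_start, y_end in v_rules)
    return horizontal_ok and vertical_ok
```

A table with a merged cell, where one ruled line stops short of the frame, fails this check and would fall under the "unless it is a grid table" options.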
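[0066]
For the logical element "table caption", one of "do not extract", "extract and then discard", "extract above the table", "extract below the table", "extract to the right of the table", or "extract to the left of the table" can be selected.
[0067]
For the logical element "table footnote", one of "do not extract", "extract and then discard", "extract above the table", "extract below the table", "extract to the right of the table", or "extract to the left of the table" can be selected.
[0068]
For the logical element "formula", one of "do not extract", "extract as a figure", or "extract and then discard" can be selected.
[0069]
For the logical element "footnote", one of "do not extract", "extract automatically", or "extract and then discard" can be selected.
[0070]
The knowledge about the structuring of logical elements in (3) is defined using the user interface of FIG. 7.
[0071]
For "Document structuring method", either "perform" or "do not perform" can be selected. If "do not perform" is selected, all other definitions become invalid and the logical elements are output flat, in reading order. If "perform" is selected, the logical elements are structured according to the other definitions and output as a tree structure.
[0072]
For "List structuring", "do not structure", "structure based on indentation information", or "structure based on heading information" can be selected. When lists are structured based on indentation information, the indentation amount used to detect indentation can be set in mm. The list structuring operation is described later.
[0073]
For "Structuring of figures inside lists", either "keep inside the list structure" or "move outside the list" can be selected. When a figure is kept inside the list, it is structured in the same way as the list items adjacent to it in the document. When a figure is moved outside the list structure, it is output after the list structure that contains it in the document, at the same hierarchy level as that list structure.
[0074]
For "Structuring of tables inside lists", either "keep inside the list structure" or "move outside the list" can likewise be selected. When a table is kept inside the list, it is structured in the same way as the list items adjacent to it in the document. When a table is moved outside the list structure, it is output after the list structure that contains it in the document, at the same hierarchy level as that list structure.
[0075]
For the extraction of logical elements inside tables, either "do not extract" or "extract" can be selected for "Extraction of chapter headings inside tables" and for "Extraction of lists inside tables". When "do not extract" is selected, the corresponding logical elements are output as paragraphs.
[0076]
For "Reading order of logical elements inside tables", either "order as horizontal layout" or "order as vertical layout" can be selected. With horizontal layout, ordering starts from the top-left logical element and proceeds horizontally. With vertical layout, ordering starts from the top-right logical element and proceeds vertically.
[0077]
For "Ordering of the document title", either "read the document title first" or "order the document title automatically" is chosen.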
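[0078]
The document logical element extraction unit 13 refers to the document model database 16 described above and performs the following processing.
[0079]
First, the layout analysis result, which holds the character recognition results line by line, is decomposed into individual character lines. Each character line is then labeled as a "normal line", "indented line", "centered line", or "hard-return line". Let D1 be the distance between the text block boundary and the start of the character line, and D2 the distance between the text block boundary and the end of the character line. The character line is classified into one of the four categories by applying the following conditions to D1 and D2.
[0080]
Conditions:
If D1 ≤ TH1 and D2 ≤ TH2, the character line is a normal line.
[0081]
If D1 > TH1 and D2 ≤ TH2, the character line is an indented line.
[0082]
If D1 > TH1 and D2 > TH2, the character line is a centered line.
[0083]
If D1 ≤ TH1 and D2 > TH2, the character line is a hard-return line.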
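The four conditions above translate directly into code. In the sketch below, the function names and the bounding-box layout are assumptions; the helper computes D1 and D2 for a horizontally written line only, and the thresholds TH1 and TH2 are left to the caller because the source gives no values for them.

```python
def classify_line(d1, d2, th1, th2):
    """Classify a character line from its start gap D1 and end gap D2."""
    starts_at_edge = d1 <= th1
    ends_at_edge = d2 <= th2
    if starts_at_edge and ends_at_edge:
        return "normal"
    if not starts_at_edge and ends_at_edge:
        return "indented"
    if not starts_at_edge and not ends_at_edge:
        return "centered"
    return "hard_return"  # starts at the edge but ends early


def line_gaps_horizontal(block_box, line_box):
    """Return (D1, D2) for a horizontal line inside a text block.

    Boxes are (x1, y1, x2, y2); D1 is the gap at the left edge and
    D2 the gap at the right edge.
    """
    d1 = line_box[0] - block_box[0]
    d2 = block_box[2] - line_box[2]
    return d1, d2
```

For example, a line that begins flush with the block but stops well before the right edge, as at the end of a paragraph, comes out as a hard-return line.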
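[0084]
Heading descriptions are detected for each line when the document model database 16 specifies that chapter/section heading extraction is performed ("chapter/section heading" in FIG. 6 is selected, and "extract automatically" or "extract based on the model" is selected within it) or that list matching is performed ("ordered list" in FIG. 6 is selected with "extract automatically" or "extract based on the model" selected within it, or "unordered list" is selected with "extract automatically" or "extract based on the model" selected within it). If the document model expresses the heading descriptions of document logical elements as number sequences, the first few characters of each character line are likewise expressed as a number sequence, and a heading description is detected by checking whether this sequence matches a sequence defined in the document model.
[0085]
If the document model expresses the heading descriptions of document logical elements as keywords, heading descriptions are detected by performing keyword matching on each character line. When a keyword is abbreviated as described above, no matching is performed at the abbreviated part; the keyword is matched before and after it, and the desired document logical element is extracted.
[0086]
FIG. 8 shows an example of a document logical element extraction result.
[0087]
When a character line with a heading description and the character lines that follow it are arranged as shown in FIG. 8, they are merged to extract a chapter heading or list item as a document logical element. That is, as shown in FIG. 8, of the three character lines 60 to 62, if the first line 60 is a normal line, the second and subsequent lines (61, 62) are consecutive indented lines or centered lines, and each of the character lines 60 to 62 is as long as or shorter than the line before it, the lines are merged and extracted as document logical element 63.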
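The merging rule above can be sketched as a single pass over classified lines. The dictionary keys (`kind`, `length`, `has_heading`) and the function name are hypothetical, and the extra condition that a continuation line must not itself carry a heading is an assumption added so that a following list item starts a new element.

```python
def merge_heading_lines(lines):
    """Group a heading line with its continuation lines.

    Each line is a dict with 'kind' ('normal', 'indented', 'centered',
    'hard_return'), 'length' (line length), and 'has_heading' (bool).
    Returns a list of groups, each a list of lines.
    """
    groups = []
    i = 0
    while i < len(lines):
        group = [lines[i]]
        if lines[i]["has_heading"] and lines[i]["kind"] == "normal":
            j = i + 1
            # Continue while lines are indented/centered and not longer
            # than the line before them.
            while (j < len(lines)
                   and lines[j]["kind"] in ("indented", "centered")
                   and not lines[j]["has_heading"]
                   and lines[j]["length"] <= lines[j - 1]["length"]):
                group.append(lines[j])
                j += 1
        groups.append(group)
        i += len(group)
    return groups
```

Applied to three lines shaped like lines 60 to 62 followed by a new heading line, this yields one three-line group and one single-line group.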
[0088]
Chapter headings are given element names that distinguish headings at different levels, so that "1." becomes heading 1, "1.1" becomes heading 2, and "1.1.1" becomes heading 3.
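For dotted numeric headings, the level naming above amounts to counting the number components. The sketch below is one hedged way to do that; the regular expression, the `heading{n}` naming scheme, and the function name are assumptions, and a plain leading number such as a year would also be treated as level 1, so a real system would combine this with the other model conditions.

```python
import re
import unicodedata

# One or more dot-separated numbers at the start of the line,
# optionally followed by a trailing period.
_NUMBERED = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?")


def heading_element_name(line):
    """Return 'heading1', 'heading2', ... or None if no number prefix."""
    folded = unicodedata.normalize("NFKC", line)  # full-width -> ASCII
    m = _NUMBERED.match(folded)
    if m is None:
        return None
    return f"heading{m.group(1).count('.') + 1}"
```

The call `heading_element_name("1.1.1 Details")` counts two separators and returns the level-3 name.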
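[0089]
For list items, if the document model defines hierarchy information for the heading description, the hierarchy information is assigned at the same time as the logical element is extracted.
[0090]
When the document contains tables and the document model specifies that tables are extracted automatically ("extract automatically" is selected under "table" in FIG. 6), the elements and the columns of elements are extracted from the table, and the hierarchy "table - column - table element" is extracted.
[0091]
When the document model specifies that tables are extracted as text regions ("extract automatically" or one of the five text extraction options at the bottom is selected under "table" in FIG. 6), text regions are extracted in each cell of the table by the method described above. The text regions are likewise decomposed into character lines, each character line is classified into the four categories, and the heading description detection and character line merging described above are carried out to extract document logical elements.
[0092]
As a result, the same logical elements as in ordinary text regions can be extracted from tables. If "do not extract" is selected under "Extraction of chapter headings inside tables" in FIG. 7, document logical elements inside a table that have the same heading description as chapter headings outside the table are extracted as list items or paragraphs. If "do not extract" is selected under "Extraction of lists inside tables" in FIG. 7, logical elements inside a table that have the same heading description as list items outside the table are extracted as paragraphs.
[0093]
The document logical elements extracted by the above processing are output to a display device such as a display (not shown), and the document logical element extraction result editing unit 12 changes the processing results or adds information to them. So that the subsequent processing in the document structure analysis unit 14 produces accurate results, the editing unit 12 corrects logical element names and hierarchy information that are wrong, and assigns hierarchy information to logical elements for which none was detected. Different hierarchy information can be given to logical elements that have the same heading description or indentation amount, and the document structure analysis unit 14 extracts the tree structure based on this result.
[0094]
The document structure analysis unit 14 refers to the document model database 16 and performs the following processing.
[0095]
First, a reading order is assigned to the document logical elements and the elements are sorted into that order. Grouping is then applied to the sorted document logical elements. Chapter groups and list groups are extracted, together with the inclusion relations between groups, to obtain the hierarchical structure of the document.
[0096]
When groups of document logical elements are extracted, global groups such as chapter groups are extracted first. In chapter group extraction, the logical elements lying between chapter headings with the same element name are grouped.
[0097]
In list group extraction, lower-level grouping is carried out in ascending reading order. If the document model specifies "structure based on indentation", grouping is performed inside each chapter group based on the relative start positions of the document logical elements (the left edge of the text block for horizontally written elements, the top edge for vertically written elements). For example, if the logical element following a list item starts at roughly the same position or is indented, the two are regarded as belonging to the same group. This grouping ends when a logical element appears whose start coordinate is smaller than that of the first element of the group, or when a logical element with shallower hierarchy information appears.
[0098]
Within each group, whenever a logical element appears whose start position is indented by the threshold TH3 or more relative to the preceding logical element, the same grouping is applied recursively, which yields hierarchical grouping. If hierarchy information was assigned from the model definition when the logical elements were extracted, and that information indicates a deeper level than the adjacent logical element, a subgroup is detected within the group regardless of the indentation position.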
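The indentation-based grouping above can be approximated with a stack instead of explicit recursion. This is a sketch under stated assumptions, not the source's algorithm: items are `(start_position, text)` pairs inside one chapter group, an item indented by at least `th3` relative to the previous item becomes its child, and the override by explicit hierarchy information is left out. The node layout and function name are hypothetical.

```python
def group_by_indent(items, th3):
    """Build a nested list structure from start positions.

    items: list of (start_position, text) in reading order
    th3:   minimum indentation that opens a subgroup
    """
    root = {"text": None, "children": []}
    stack = [(float("-inf"), root)]  # (start position, node)
    for pos, text in items:
        node = {"text": text, "children": []}
        # Close groups until the top of the stack is an item that this
        # one is indented against by at least th3.
        while len(stack) > 1 and pos < stack[-1][0] + th3:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((pos, node))
    return root["children"]
```

Items at positions 10, 10, 15, 15, 10 with `th3 = 3` produce three top-level items, the second of which holds the two indented items as children.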
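[0099]
If the document model specifies "structure based on heading description" ("structure based on heading description" is selected under "List structuring" in FIG. 7), list groups are extracted inside each chapter group based on whether the heading descriptions detected by the method described above match.
[0100]
First, everything from the first list item that appears in the chapter group to the last element of the chapter group is extracted as one group. Next, when a list item with a heading description different from that of the first element appears within this list group, grouping continues until a list item with the same heading description as the first element appears, which detects a list group subordinate to the current group. Applying this processing recursively detects hierarchical list groups.
[0101]
As described above, if hierarchy information was assigned from the model definition when the logical elements were extracted, a lower-level group is extracted whenever the hierarchy level differs, even if the heading descriptions are the same. Conversely, if the hierarchy information is the same, the elements are grouped as members of the same group even if their heading descriptions differ.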
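The heading-based grouping above is naturally recursive. The sketch below takes list items as `(signature, text)` pairs, where the signature stands for the heading description (for instance the 13-value sequence), and nests each run of differently headed items under the item that precedes it. Attaching the subgroup to the preceding item is an interpretation chosen for the example, the override by explicit hierarchy information is not modelled, and all names are hypothetical.

```python
def group_by_heading(items):
    """Nest list items by matching heading signatures.

    items: list of (signature, text) in reading order.
    Items sharing the first item's signature are siblings; everything
    between two such siblings forms a subgroup, processed recursively.
    """
    if not items:
        return []
    top_signature = items[0][0]
    result = []
    i = 0
    while i < len(items):
        _, text = items[i]
        # Find the next item with the same signature as the group head.
        j = i + 1
        while j < len(items) and items[j][0] != top_signature:
            j += 1
        result.append({"text": text,
                       "children": group_by_heading(items[i + 1:j])})
        i = j
    return result
```

For items headed "1.", "(1)", "(2)", "2.", the two parenthesised items share a signature and end up as children of the first item, while the last item returns to the top level.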
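[0102]
If the document model specifies "output figures and tables outside the list" ("move outside the list structure" is selected under "Structuring of figures inside lists" in FIG. 7, or under "Structuring of tables inside lists" in FIG. 7), all figures and tables in a list group are output immediately after the top-level list group, keeping their reading order and carrying the same hierarchy information as that group.
[0103]
The document structure analysis unit 14 converts the result of the hierarchical grouping described above into a tree structure. If the document model specifies "do not structure" ("do not perform" is selected under "Document structuring" in FIG. 7), chapter headings and list items are not grouped, so a flat tree structure is obtained.
[0104]
If the document model specifies "structure based on indentation" or "structure based on heading description" ("structure based on indentation information" or "structure based on heading description" is selected under "List structuring" in FIG. 7), various tree structures are obtained depending on the other parameters of the document model.
[0105]
The embodiment described above is now explained in more detail using a concrete example.
[0106]
FIG. 9 is an example of a one-page printed document converted into a document image. Logical elements C to M and N1 to N5 could be regarded as components of a table, but here we consider the case where they are output not as table elements but as logical elements forming the logical structure shown in FIG. 10. Logical element A is regarded as "heading 1", logical element B as "heading 2", logical elements E and I as "heading 3", and logical elements F, G, H, J, K, L, M, and N1 to N5 as "list items". Logical elements C and D are treated as unnecessary and are not output in the final result.
[0107]
First, layout analysis by the layout analysis unit 10 produces the layout analysis result shown in FIG. 11. Next, document logical element extraction by the document logical element extraction unit 13 decomposes the result into the character lines shown in FIG. 12, and then extracts the logical elements shown in FIG. 13 by classifying the character lines and identifying heading descriptions.
[0108]
For example, if the document model stored in the document model database 16 defines the keyword "第章" as the heading description of the "heading 1" logical element, logical element A is identified as "heading 1". Similarly, if the document model defines the keyword "第節" as the heading description of the "heading 2" logical element, logical element B is identified as "heading 2".
[0109]
Furthermore, if the document model defines "1000000010000" and "1000000000110" as heading descriptions of the "list item" logical element, logical elements F, G, H, J, K, L, and N1 to N5 are identified as "list items". In this case the "heading 3" elements E and I and the "list item" elements K and L have the same heading description, so conventional techniques had difficulty distinguishing them as different logical elements.
[0110]
However, logical elements E and I have a smaller block width than logical elements K and L, so in this embodiment they can be told apart as different logical elements by defining a model that combines block width with the heading description.
[0111]
Suppose the document model defines list items that have the heading description "1000000010000" and a character line height of α mm or more as hierarchy level 1, list items that have the heading description "1000000000110" as hierarchy level 2, and list items that have the heading description "1000000010000" and a character line height of β mm or less as hierarchy level 3. Then logical elements F, G, and H can be identified as the level 1 ordered list 70, logical elements J, K, and L as the level 2 ordered list 71, and logical elements N1 to N5 as the level 3 ordered list 72.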
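The level assignment in this example combines a heading signature with a line height condition, which can be written as a small rule table. The source does not give values for α and β, so `ALPHA_MM` and `BETA_MM` below are placeholder numbers chosen only to make the sketch runnable; the rule layout and function name are likewise hypothetical.

```python
ALPHA_MM = 5.0  # placeholder for α (not specified in the source)
BETA_MM = 3.0   # placeholder for β (not specified in the source)

# (heading signature, minimum line height, maximum line height, level)
LEVEL_RULES = [
    ("1000000010000", ALPHA_MM, None, 1),  # "1." with tall characters
    ("1000000000110", None, None, 2),      # "(1)" at any height
    ("1000000010000", None, BETA_MM, 3),   # "1." with small characters
]


def assign_level(signature, line_height_mm, rules=LEVEL_RULES):
    """Return the hierarchy level of the first matching rule, or None."""
    for rule_signature, min_height, max_height, level in rules:
        if rule_signature != signature:
            continue
        if min_height is not None and line_height_mm < min_height:
            continue
        if max_height is not None and line_height_mm > max_height:
            continue
        return level
    return None
```

With these placeholder thresholds, a "1." item 6 mm tall is level 1 and one 2.5 mm tall is level 3, which is how the same heading description ends up at two different depths.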
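[0112]
The document structure analysis unit 14 first assigns a reading order to the document logical elements and then sorts them into that order as shown in FIG. 14. As a result, the logical elements are arranged in the order A, B, E, F, G, H, I, J, K, L, M, N1, N2, N3, N4, N5.
[0113]
Next, grouping is performed to detect the hierarchical groups of logical elements. In the example of FIG. 14, the "heading 1" group A, B, E, F, G, H, I, J, K, L, M, N1, N2, N3, N4, N5 is detected first. Then, within the "heading 1" group, the "heading 2" group B, E, F, G, H, I, J, K, L, M, N1, N2, N3, N4, N5 and the "heading 3" groups E, F, G, H and I, J, K, L, M, N1, N2, N3, N4, N5 are detected.
[0114]
Then, using the identity of heading information and the hierarchy information as clues, the "ordered list" groups (F, G, H), (J, K, L, M, N1, N2, N3, N4, N5), (K, L, M, N1, N2, N3, N4, N5), and (N1, N2, N3, N4, N5) are detected hierarchically from the respective "heading 3" groups.
[0115]
The groups obtained in this way are converted into the tree structure shown in FIG. 15 based on the inclusion relations between groups.
[0116]
As described above, according to the embodiment, document logical elements such as the document title, author, abstract, date, chapter headings, paragraphs, list items, footnotes, captions, formulas, headers, and footers are extracted automatically from document images converted from diverse multi-page printed documents, and the hierarchical structure among the document logical elements is then extracted with high accuracy so that the document image is expressed as a tree of document logical elements. Structured documents tagged with XML, HTML, or the like can therefore be generated from document images with high accuracy, enabling automatic input into computer systems.
[0117]
In addition, complex list structures with multiple levels can be expressed as the tree structure the operator intends, which greatly reduces tedious editing of structured documents and the operator's editing work when large volumes of documents are digitized.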
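[0118]
(Variation 1)
According to the present invention, the printed document shown in FIG. 9 can also be converted into an XML document with a different logical structure.
[0119]
For example, when logical elements A, B, E, and I shown in FIG. 9 are each to be regarded as list items, a document model database 16 with the following model definitions is created and structuring is performed based on it. The tree structure shown in FIG. 16 is then extracted and converted into the XML document shown in FIG. 18.
[0120]
An example of the document model definitions is as follows.
[0121]
Logical elements with the heading description "第１章", "第２章", "第３章", ... (Chapter 1, Chapter 2, Chapter 3, ...) are regarded as list items with hierarchy level 1.
[0122]
Logical elements with the heading description "第１節", "第２節", "第３節", ... (Section 1, Section 2, Section 3, ...) are regarded as list items with hierarchy level 2.
[0123]
Logical elements with the heading description "(1)", "(2)", "(3)", ... are regarded as list items; the hierarchy level is 3 if the character size (character line height) is less than γ mm, and 5 if the character line height is λ mm or more.
[0124]
Logical elements with the heading description "1.", "2.", "3.", ... are regarded as list items; the hierarchy level is 4 if the character line height is α mm or more, and 6 if the character line height is β mm or less.
[0125]
When the document model is defined as above, the grouping of level 1 list items, level 2 list items, level 3 list items, level 4 list items, level 5 list items, and level 6 list items is carried out hierarchically in sequence as described earlier, and the hierarchical groups shown in FIG. 17 are extracted.
[0126]
After this, the document structure analysis unit 14 extracts the tree structure shown in FIG. 16, and the data format conversion unit 15 converts it into the XML document with that tree structure shown in FIG. 18.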
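[0127]
(Variation 2)
For the printed document shown in FIG. 9, a document model database 16 can also be created with model definitions that regard logical element A as a level 1 heading, logical element B as a level 2 heading, and logical elements C through N5 as table elements. Structuring based on this model extracts the tree structure of FIG. 20 through the following procedure, and the result can then be converted into the XML document shown in FIG. 22.
[0128]
For example, horizontal and vertical line segments are extracted from the document image, and the regions enclosed by the extracted segments are then extracted as table elements.
[0129]
As a result, as shown in FIG. 19, the document structure analysis unit 14 extracts logical element C as table element 1, logical element D as table element 2, logical element E as table element 3, logical elements F to H together as table element 4, logical element I as table element 5, and logical elements J to N5 together as table element 6.
[0130]
Next, the document structure analysis unit 14 groups horizontally adjacent table elements. Specifically, table elements 1 and 2 form table element group a, table elements 3 and 4 form table element group b, and table elements 5 and 6 form table element group c.
[0131]
Next, the document structure analysis unit 14 groups table element groups a to c and extracts the table group shown in FIG. 21.
[0132]
As a result, the document structure analysis unit 14 extracts the tree structure shown in FIG. 20. The data format conversion unit 15 converts the extracted tree structure into the XML document shown in FIG. 22.
[0133]
In this way, the XML document shown in FIG. 22 can be generated from the printed document shown in FIG. 9.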
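[0134]
[Effects of the invention]
According to the present invention, structured documents tagged with XML, HTML, or the like can be generated with high accuracy from printed documents of various forms.
[Brief description of the drawings]
[FIG. 1] A block diagram showing the configuration of the document processing apparatus of the present invention.
[FIG. 2] A diagram showing an example of a layout analysis result.
[FIG. 3] A diagram for explaining a circumscribed rectangle.
[FIG. 4] A diagram showing how a text area, character line areas, and character areas are described hierarchically.
[FIG. 5] A diagram showing a user interface for defining the knowledge (model) of the structuring method for document elements.
[FIG. 6] A diagram showing a user interface for defining the knowledge (model) of the contents of logical elements.
[FIG. 7] A diagram showing a user interface for defining the knowledge (model) of the extraction method for logical elements.
[FIG. 8] A diagram showing an example of a document logical element extraction result.
[FIG. 9] A diagram showing an example of a document image to be processed.
[FIG. 10] A diagram showing an example of a document structure analysis result.
[FIG. 11] A diagram showing an example of a layout analysis result.
[FIG. 12] A diagram showing an example of document logical element extraction.
[FIG. 13] A diagram showing an example of a document logical element extraction result.
[FIG. 14] A diagram showing an example of hierarchical grouping of logical elements in document structure analysis.
[FIG. 15] A diagram showing an example of the result of document structure analysis and data format conversion.
[FIG. 16] A diagram showing another example of a document structure analysis result.
[FIG. 17] A diagram showing another example of hierarchical grouping of logical elements in document structure analysis.
[FIG. 18] A diagram showing another example of the result of document structure analysis and data format conversion.
[FIG. 19] A diagram showing another example of a document logical element extraction result.
[FIG. 20] A diagram showing another example of a document structure analysis result.
[FIG. 21] A diagram showing another example of hierarchical grouping of logical elements in document structure analysis.
[FIG. 22] A diagram showing another example of the result of document structure analysis and data format conversion.
[Explanation of reference numerals]
10: layout analysis unit, 11: character recognition unit, 12: document logical element extraction result editing unit, 13: document logical element extraction unit, 14: document structure analysis unit, 15: data format conversion unit, 16: model database.
[0001]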
[Technical field of the invention]
The present invention relates to a document processing apparatus and a document processing method for processing a document.
[0002]
[Prior art]
There is a demand to digitize and reuse the contents of printed documents such as newspaper articles, magazines, scientific and technical literature, books, office documents, and official documents by capturing them into a computer as document images through an input device such as a scanner and then recognizing and encoding the characters in the captured image information.
[0003]
In this case, it is common to capture the printed document into a computer as an image with an image scanner, extract a "layout structure" and a "logical structure" from the captured image and associate them with each other, perform character recognition, and output the result in reading order. For example, a technique is known that uses the framework of a probabilistic grammar to extract chapter/section structures and list structures that span multiple pages (see, for example, Non-Patent Document 1).
[0004]
A technique is also known that automatically extracts general-purpose document logical elements such as chapter headings, paragraphs, lists, formulas, footnotes, headers, and footers, assigns a reading order to them, and then outputs the character recognition results (see, for example, Patent Document 1).
[0005]
[Non-Patent Document 1]
Tateishi: "Interpretation of document logical structure using probabilistic grammar", IEICE Transactions D-II, Vol. J79-D-II, No. 5, pp. 687-697 (1996-5)
[0006]
[Patent Document 1]
Japanese Patent Laid-Open No. 11-250041
[0007]
[Problems to be solved by the invention]
In the conventional techniques, a printed document is typically captured into a computer as an image with an image scanner, a "layout structure" and a "logical structure" are extracted from the captured image and associated with each other, character recognition is performed, and the result is output in reading order. However, every conventional technique goes no further than extracting specific logical elements from printed documents under specific layout conditions, and it has been difficult to analyze the full variety of printed documents in detail and flexibly extract the logical information a user wants. In particular, for documents in which lists are written hierarchically, it has been difficult to extract the hierarchical list structure accurately and without contradiction and to convert the result correctly into a structured document written in HTML or XML.
[0008]
The present invention has been made in view of these problems. Its object is to provide a document processing apparatus and a document processing method that can generate, with high accuracy, structured documents tagged with XML, HTML, or the like from printed documents of various forms.
[0009]
[Means for Solving the Problems]
To solve the above problems, the present invention is a document processing apparatus comprising: a layout analysis unit that extracts, from an input document image, a text area indicating an area in which text is written, together with layout information indicating the layout of the text area in the document image; a character line area extraction unit that extracts, for each text area, the character line areas constituting that text area; a character recognition unit that extracts character patterns from the character line areas and converts them into character codes; a document logical element extraction unit that extracts document logical elements from the text area extracted by the layout analysis unit, with reference to a first document model, which is a database that holds information for extracting the document logical elements constituting the document in the text area and information on the contents of the document logical elements, and that describes a structuring method for obtaining a desired tree structure; a document logical element reading order determination unit that arranges the document logical elements extracted by the document logical element extraction unit in reading order; a document logical element grouping unit that hierarchically groups the document logical elements arranged in reading order by the document logical element reading order determination unit, based on a second document model that holds information indicating the identity of the types of document logical elements and the hierarchy of document logical elements and that is used to structure the document logical elements; and a tree structure creation unit that creates a tree structure representing the logical relationship between the document logical elements and the character codes, based on the information grouped by the document logical element grouping unit.
[0010]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. The document processing apparatus according to the present embodiment extracts and edits information from a plurality of document images input from a scanner or the like, and generates a structured document tagged using, for example, XML technology. First, a document image generation unit, configured with a scanner or the like, converts a plurality of paper documents into images to generate document images.
[0011]
Each image is binarized by the scanner or the like and converted into a binary image. Alternatively, a multi-valued image may be generated at scanning time and then converted into a binary image by the known technique described in Suzuki and Kubota, "Binarization of low-resolution document images using stroke extraction based on interpolation and convexity determination", IEICE, Pattern Recognition and Understanding Study Group, Technical Report, PRMU99-231, pp. 1-8, 2000. The binary image may further be converted into a skew-corrected binary image using the "Document image skew detection method" of Japanese Patent Laid-Open No. 05-174183. Hereinafter, a binary image whose skew has been corrected by this processing is referred to as an input image. The plurality of input images are input sequentially to the document processing apparatus described later.
[0012]
FIG. 1 is a block diagram showing the configuration of a document processing apparatus according to the present invention. The document processing apparatus includes a layout analysis unit 10, a character recognition unit 11, a document logical element extraction unit 13, a document model database 16, a document logical element extraction result editing unit 12, a document structure analysis unit 14, and a data format conversion unit 15.
[0013]
In this configuration, an input image is first input to the layout analysis unit 10. The layout analysis unit 10 analyzes the layout of the externally supplied input image and extracts partial areas with different properties, such as text areas, table areas, figure areas, and photo/picture areas, as a group of layout elements. Each extracted area can be represented by the rectangle that circumscribes it. For example, as shown in FIG. 2, a text area TB is represented by the rectangle 50 that circumscribes it. As shown in FIG. 3, the circumscribed rectangle 50 can in turn be represented by the coordinates (x1, y1) of its upper left corner and the coordinates (x2, y2) of its lower right corner.
[0014]
In layout analysis, vertically written and horizontally written text areas are separated and output as different areas, and a single text area is never extracted across columns. However, layout analysis need not extract areas corresponding to document logical elements such as paragraphs, lists, formulas, and chapter headings.
[0015]
In each text area TB (or table area), the character line areas Str are extracted in order as shown in FIG. 2 (Str1, Str2, Str3, Str4, Str5), and in each character line area Str the character areas Ch are likewise extracted in order as shown in FIG. 2 (Ch1, Ch2, Ch3, Ch4, Ch5).
[0016]
The text area TB (or table area), the character line area Str, and the character area Ch can be described (represented) in a hierarchical manner, for example, by a tree structure as shown in FIG.
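The area hierarchy described here (text area, character lines, characters, each with a circumscribed rectangle) maps naturally onto nested records. The following sketch shows one possible representation; the class and field names are hypothetical and are not taken from FIG. 4.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """Circumscribed rectangle: upper left (x1, y1), lower right (x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass
class CharRegion:
    box: Box
    code: str = ""  # character code filled in by character recognition


@dataclass
class LineRegion:
    box: Box
    chars: List[CharRegion] = field(default_factory=list)


@dataclass
class TextBlock:
    box: Box
    lines: List[LineRegion] = field(default_factory=list)


def bounding_box(boxes):
    """Smallest rectangle that circumscribes all the given boxes."""
    boxes = list(boxes)
    return Box(min(b.x1 for b in boxes), min(b.y1 for b in boxes),
               max(b.x2 for b in boxes), max(b.y2 for b in boxes))


def block_text(block):
    """Concatenate recognized characters line by line, in stored order."""
    return "\n".join("".join(c.code for c in line.chars)
                     for line in block.lines)
```

Keeping the characters inside their line and the lines inside their block preserves the ordering the text describes, so later stages can walk the tree without re-sorting.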
[0017]
The layout analysis unit 10 can be realized by a configuration disclosed in, for example, “Image processing method and image processing apparatus” of Japanese Patent Laid-Open No. 9-167233. In this case, character recognition processing is performed in the layout analysis unit 10 and each character in the text area is coded. Alternatively, the character recognition process may be performed immediately after the layout analysis to code each character in the text area TB.
[0018]
The character recognition unit 11 recognizes characters using, for example, the configuration disclosed in Ariyoshi, "Character segmentation from Japanese printed documents by dynamic hypothesis generation and verification", IEICE Technical Report, PRU93-47, pp. 33-40, 1993. In this case, the individual character areas Ch are cut out from the character line areas Str obtained by layout analysis, and the character pattern in each character area Ch is converted into character code information. The character recognition unit 11 outputs the character recognition results, converted into character codes and arranged in reading order, to the document logical element extraction unit 13. Here, arranged in reading order means that the character codes run from left to right within a horizontally written character line and from top to bottom within a vertically written character line.
[0019]
Further, the character recognition unit 11 also outputs the layout analysis result subjected to the layout analysis by the layout analysis unit 10 to the document logic element extraction unit 13.
[0020]
The document logical element extraction unit 13 applies a document model (described in detail later), created in advance and stored in the document model database 16, to the layout analysis result and the character recognition result received from the character recognition unit 11, and thereby extracts document logical elements such as titles, authors, abstracts, headers, footers, chapter headings, paragraphs, lists, table elements and columns, and figure and table captions. The extracted document logical elements are referred to here as the document logical element extraction result.
[0021]
The document logical element extraction result editing unit 12 allows an operator to change or assign an element name or hierarchy information in the document logical element extraction result produced by the document logical element extraction unit 13.
[0022]
The document structure analysis unit 14 assigns the reading order to the document logical elements, rearranges them in the reading order, and, referring to a document model created in advance and stored in the document model database 16, extracts the document structure as a tree structure of the document logical elements. The reading order determination method can be realized by, for example, a configuration disclosed in Japanese Patent Application Laid-Open No. 11-250041 “Document processing apparatus and document processing method”.
[0023]
The data format conversion unit 15 converts the document logical elements described as a tree structure, together with the character recognition results, into a structured document described in XML or HTML, which is the final output.
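The conversion from a tree of document logical elements to XML can be sketched as follows. This is only an illustration under stated assumptions: the `Node` class, the tag names, and the sample content are hypothetical and are not taken from the embodiment.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node for one document logical element."""
    tag: str                       # element name, e.g. "heading1" (assumed naming)
    text: str = ""                 # character recognition result for this element
    children: List["Node"] = field(default_factory=list)


def to_xml(node: Node) -> ET.Element:
    """Recursively convert a Node tree into an ElementTree element."""
    elem = ET.Element(node.tag)
    if node.text:
        elem.text = node.text
    for child in node.children:
        elem.append(to_xml(child))
    return elem


# Illustrative tree: one chapter heading that contains one paragraph.
tree = Node("document", children=[
    Node("heading1", "1. Introduction", [Node("paragraph", "Body text.")]),
])
xml_string = ET.tostring(to_xml(tree), encoding="unicode")
print(xml_string)
```

Nesting the paragraph under its heading is one possible design; a different document model could place them as siblings.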
[0024]
Hereinafter, the document model database 16 will be described in detail.
[0025]
Conventional OCR (Optical Character Recognition) generates plain text from a document image by outputting character recognition results in the reading order. However, an actual document may have a hierarchical structure such as chapters, sections, and bullets. In an XML / HTML document, such a document is expressed by a hierarchical tree structure. Therefore, when converting a document image into an XML / HTML document, it is necessary to extract some hierarchical structure from the set of document logical elements described above. If the structuring method differs, the tree structure of the processing result may differ greatly, and if the tree structure desired by the operator cannot be obtained, extensive correction is required after document conversion.
[0026]
Therefore, in this embodiment, the operator designates the structuring method in advance in the document model database 16 in the form of a document model, and the system interprets this document model to generate the tree structure desired by the operator and converts the document into a structured document.
[0027]
By using such a document model, document logical elements can be extracted from a document image with high accuracy, and the hierarchical structure of the document can be extracted with high accuracy. As a result, it is possible to efficiently and accurately generate a document with XML / HTML tags from a printed document.
[0028]
FIGS. 5 to 7 are diagrams showing specific examples of the user interface when the operator designates the structuring method.
[0029]
FIG. 5 shows a user interface for defining a knowledge / model of a structuring method related to document elements (details of the model contents will be described later).
[0030]
FIG. 6 shows a user interface for defining a knowledge / model of a method for extracting logical elements (details of the model contents will be described later).
[0031]
FIG. 7 shows a user interface for defining a knowledge / model related to the contents of logical elements (details of the model contents will be described later).
[0032]
The contents defined by using these user interfaces are stored in the document model database 16 of FIG. 1 and used appropriately when inputting a document.
[0033]
The document model used includes, among others:
(1) Knowledge about the contents of document logical elements
(2) Knowledge about the document logical element extraction method
(3) Knowledge about how to structure chapter headings and bullet structures
[0034]
Knowledge about the contents of the document logical elements described in (1) is defined using the user interface shown in FIG. First, one of the following is selected as the logical element name: “header”, “footer”, “document heading”, “chapter heading 1”, “chapter heading 2”, “chapter heading 3”, “chapter heading 4”, “chapter heading 5”, “chapter heading 6”, “chapter heading 7”, “chapter heading 8”, “chapter heading 9”, “paragraph”, “ordered bullet”, “unordered bullet”, “bullet with definition”, “figure caption”, “figure footnote”, “table caption”, “table footnote”, “formula”, and “footnote”.
[0035]
The selection in the user interface of FIGS. 6 to 8 is performed by the user clicking a white circle at the head of each logical element name with a pointing device such as a mouse and changing it to a black circle. When the selection item has a white circle at the beginning, the item is not selected, and when the selection has a black circle, the item is selected.
[0036]
In the “hierarchy level”, hierarchy information is set using an arbitrary integer of 0 or more.
[0037]
In “Language”, select either “Japanese” or “English”.
[0038]
In “Character Line Direction”, select either “horizontal writing” or “vertical writing”.
[0039]
The upper limit and the lower limit of the size information are set in “mm” for “character size”, “block height”, and “block length”.
[0040]
Set upper and lower limits for "number of character lines" and "number of characters".
[0041]
Set the indentation amount in mm in “Indentation Width”.
[0042]
In “heading information”, information for identifying a heading is set. Heading information is the identification information of a logical element that uses numbers, as in “Part 1”, “Chapter 1”, “Article 1”, “1.”, “(1)”, “①”, “I”, “(I)”, “(a)”, “A.”, and “One”, or symbols such as “•”, “O”, and “*”. These heading descriptions are expressed by two means: “definition by numerical sequence” and “definition by keyword information”.
[0043]
In the “definition by numerical sequence” of “heading information”, the heading description is described with 13 types of information: “number”, “Kanji numeral”, “Roman numeral”, “circled number”, “alphabet”, “kana”, “Greek letter”, “symbol”, “period”, “hyphen”, “open parenthesis”, “close parenthesis”, and “space”. By assigning “0 (none)” or “an integer greater than or equal to 1” to each type, the heading description is expressed as a numerical sequence of 13 numbers arranged in the order given above.
[0044]
For example, the heading description “(1)” is expressed as “1000000000110” by assigning 1 to “number”, 1 to “open parenthesis”, 1 to “close parenthesis”, and 0 to the other types. The heading description “1.2.1” is expressed as “3000000020001” by assigning 3 to “number”, 2 to “period”, 1 to “space”, and 0 to the other types.
[0045]
As described above, in the present invention, the heading of a bullet or chapter heading (for example, “1.” in “1. Introduction” is the heading of the chapter heading) is regarded as composed of numbers, Kanji numerals, Roman numerals, circled numbers, alphabetic characters, kana, Greek letters, symbols, periods, hyphens, open parentheses, close parentheses, and spaces. For each character type, either “0 (meaning none)” or “an integer greater than or equal to 1 (meaning the number of characters of that type)” is assigned, so that the heading description is expressed by a combination of 13 numerical values.
[0046]
In addition, the 13 numerical values are arranged in the order of numbers, Kanji numerals, Roman numerals, circled numbers, alphabetic characters, kana, Greek letters, symbols, periods, hyphens, open parentheses, close parentheses, and spaces, and the heading description is expressed by the resulting sequence. For example, the heading description “1.” assigns 1 to “number”, 0 to “Kanji numeral”, 0 to “Roman numeral”, 0 to “circled number”, 0 to “alphabet”, 0 to “kana”, 0 to “Greek letter”, 0 to “symbol”, 1 to “period”, 0 to “hyphen”, 0 to “open parenthesis”, 0 to “close parenthesis”, and 0 to “space”, and can therefore be expressed as “1000000010000”.
[0047]
In the present invention, this method is used to specify the heading descriptions of chapter headings and bullets in an input document. According to the present invention, since the headings of “1. Introduction” and “2. Principle” are both expressed as “1000000010000”, it is possible to cope with the variety of heading descriptions of chapter headings and bullets.
[0048]
By introducing such a definition method, it is not necessary to enumerate every bullet heading description that can appear when defining a document model.
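The numerical-sequence encoding can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the character-class tests (for example, treating only the dedicated Unicode Roman numeral and circled-number code points as those types) and all function names are assumptions.

```python
# The 13 character types, in the order given in the description.
CATEGORIES = [
    "number", "kanji_numeral", "roman_numeral", "circled_number",
    "alphabet", "kana", "greek", "symbol",
    "period", "hyphen", "open_paren", "close_paren", "space",
]


def classify(ch: str) -> str:
    """Map one character to one of the 13 types (assumed classification rules)."""
    code = ord(ch)
    if ch in "0123456789０１２３４５６７８９":
        return "number"
    if ch in "〇一二三四五六七八九十百千":
        return "kanji_numeral"
    if 0x2160 <= code <= 0x217F:          # Unicode Roman numeral characters only
        return "roman_numeral"
    if 0x2460 <= code <= 0x2473:          # circled numbers 1 to 20
        return "circled_number"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if 0x3041 <= code <= 0x3096 or 0x30A1 <= code <= 0x30FA:
        return "kana"
    if 0x0391 <= code <= 0x03C9:
        return "greek"
    if ch in ".．":
        return "period"
    if ch in "-‐－":
        return "hyphen"
    if ch in "(（":
        return "open_paren"
    if ch in ")）":
        return "close_paren"
    if ch in " \u3000":
        return "space"
    return "symbol"


def heading_signature(heading: str) -> str:
    """Count the characters of each type and join the 13 counts into one string."""
    counts = [0] * len(CATEGORIES)
    for ch in heading:
        counts[CATEGORIES.index(classify(ch))] += 1
    return "".join(str(c) for c in counts)


print(heading_signature("1."))    # same signature as "2."
print(heading_signature("(1)"))
```

Joining the counts into a string is unambiguous only while every count is below 10; a tuple of integers would avoid that limit.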
[0049]
In the “definition by keyword information” of “heading information”, a heading description that cannot be expressed by a numerical sequence as described above is expressed directly as a character string. However, it is cumbersome to enumerate all heading descriptions such as “Article 1”, “Article 2”, ..., “Article 99”. Therefore, the portion that changes, such as the number in “Article 1”, can be replaced by a blank or the like so that the heading description is expressed in simplified form.
[0050]
In addition, by adopting keyword definitions that allow such omission, chapter headings and bullets with definitions that take the form of an arbitrary character string enclosed between the open parenthesis “[” and the close parenthesis “]” can also be specified flexibly.
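Keyword definitions with an omitted portion can be sketched with a regular expression. The placeholder character `BLANK` and the function names below are hypothetical; the description only says that the varying portion is replaced by "a blank or the like".

```python
import re

BLANK = "\u25a1"  # hypothetical placeholder marking the omitted, varying portion


def keyword_to_regex(keyword: str) -> "re.Pattern[str]":
    """Escape the fixed parts of the keyword and let each blank match any text."""
    fixed_parts = [re.escape(part) for part in keyword.split(BLANK)]
    return re.compile("(.+?)".join(fixed_parts))


def find_keyword(line: str, keyword: str):
    """Return the (start, end) span of the first keyword match, or None."""
    match = keyword_to_regex(keyword).search(line)
    return match.span() if match else None


# "Article <number>" style heading: only the fixed part "Article " is compared.
print(find_keyword("Article 12 Scope", "Article " + BLANK))
# Arbitrary string enclosed between "[" and "]".
print(find_keyword("[Note] text", "[" + BLANK + "]"))
```

Only the fixed text before and after the blank is compared; the omitted portion itself is accepted whatever it contains.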
[0051]
When keyword matching is performed, conditions on the keyword matching status in the character line being processed can be defined. The distance from the beginning of the character line to the beginning of the keyword matching result can be set in mm units as the “keyword matching position”, and a keyword matching result exceeding this value is regarded as invalid. In addition, the occupancy rate of the keyword in the character line can be set, and a keyword matching result whose occupancy rate is below that value is regarded as invalid. Furthermore, the conditions of “keyword matching position” and “keyword occupancy” can be applied in combination: when both conditions must be satisfied, “AND determination” is selected in “keyword matching condition”, and when it suffices that either one is satisfied, “OR determination” is selected.
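The validity check described above can be sketched as follows. The function and parameter names are hypothetical; the thresholds correspond to the “keyword matching position” and “keyword occupancy” settings.

```python
def keyword_match_valid(offset_mm: float, occupancy: float,
                        max_offset_mm: float, min_occupancy: float,
                        mode: str = "AND") -> bool:
    """Decide whether a keyword match is valid under the two conditions.

    offset_mm: distance from the start of the character line to the match.
    occupancy: fraction of the character line occupied by the keyword (0 to 1).
    mode:      "AND" requires both conditions, "OR" requires either one.
    """
    position_ok = offset_mm <= max_offset_mm     # exceeding the limit is invalid
    occupancy_ok = occupancy >= min_occupancy    # falling below the limit is invalid
    if mode == "AND":
        return position_ok and occupancy_ok
    if mode == "OR":
        return position_ok or occupancy_ok
    raise ValueError(f"unknown mode: {mode!r}")


# A match near the line head that fills only a small part of the line.
print(keyword_match_valid(2.0, 0.1, max_offset_mm=5.0, min_occupancy=0.5, mode="AND"))
print(keyword_match_valid(2.0, 0.1, max_offset_mm=5.0, min_occupancy=0.5, mode="OR"))
```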
[0052]
Knowledge regarding the document logical element extraction method described in (2) is defined using the user interface shown in FIG. In the interface of FIG. 6, the extraction method can be designated for “header”, “footer”, “document heading”, “chapter heading”, “paragraph”, “ordered bullets”, “unordered bullets”, “defined bullets”, “figure”, “figure caption”, “figure footnote”, “table”, “table caption”, “table footnote”, “formula”, “footnote”, and the like.
[0053]
In the following, when “Do not extract” is selected, the logical element is extracted as a paragraph. In addition, when “reject after extraction” is selected, it is extracted once at the time of processing, but is not finally output to the tree structure. Furthermore, when “extract based on model” is selected, logical elements are extracted based on the contents defined using the user interface shown in FIG.
[0054]
For the logical element “header”, one of “Do not extract”, “Extract and reject at the top of the page”, “Extract and reject on the right side of the page”, or “Extract and reject on the left side of the page” can be selected.
[0055]
For the logical element “footer”, one of “Do not extract”, “Extract and reject at the bottom of the page”, “Extract and reject on the right side of the page”, or “Extract and reject on the left side of the page” can be selected.
[0056]
In the logical element “document heading”, one of “Do not extract”, “Auto extract”, “Reject after extraction”, and “Extract based on model” can be selected.
[0057]
In the logical element “Chapter heading”, any of “Do not extract”, “Auto extract”, “Reject after extraction”, and “Extract based on model” can be selected.
[0058]
In the logical element “paragraph”, either “automatic extraction” or “extraction based on the model” can be selected.
[0059]
In the logical element “ordered bullets”, one of “Do not extract”, “Auto extract”, “Discard after extraction”, and “Extract based on model” can be selected.
[0060]
In the logical element “unordered bullets”, one of “Do not extract”, “Auto extract”, “Reject after extraction”, and “Extract based on model” can be selected.
[0061]
In the logical element “bullet with definition”, one of “Do not extract”, “Auto extract”, “Reject after extraction”, and “Extract based on model” can be selected.
[0062]
For the logical element “diagram”, one of “not extract”, “auto extract”, and “reject after extraction” can be selected.
[0063]
For the logical element “figure caption”, one of “Do not extract”, “Discard after extraction”, “Extract at the top of the figure”, “Extract at the bottom of the figure”, “Extract at the right side of the figure”, and “Extract on the left side of the figure” can be selected.
[0064]
For the logical element “figure footnote”, one of “Do not extract”, “Discard after extraction”, “Extract at the top of the figure”, “Extract at the bottom of the figure”, “Extract at the right side of the figure”, and “Extract on the left side of the figure” can be selected.
[0065]
For the logical element “table”, one of “Do not extract”, “Auto extract”, “Discard after extraction”, “Extract as horizontal text”, “Extract as mixed vertical / horizontal text”, “Extract as lattice table”, “Extract as non-grid table”, “Extract non-grid table as vertical text”, or “Extract non-grid table as mixed vertical / horizontal text” can be selected. Here, a grid-like (lattice) table means a table in which every horizontal line constituting the table is as long as the width of the outer frame of the table and every vertical line constituting the table is as long as the height of the outer frame of the table. Such a table is output according to a hierarchical structure of “table-column-table element”, where a column means a group based on the arrangement of table elements.
[0066]
For the logical element “table caption”, one of “Do not extract”, “Reject after extraction”, “Extract at the top of the table”, “Extract at the bottom of the table”, “Extract at the right side of the table”, and “Extract on the left side of the table” can be selected.
[0067]
For the logical element “table footnote”, one of “Do not extract”, “Discard after extraction”, “Extract at the top of the table”, “Extract at the bottom of the table”, “Extract at the right side of the table”, and “Extract on the left side of the table” can be selected.
[0068]
In the logical element “formula”, any one of “not extract”, “extract as a diagram”, and “discard after extraction” can be selected.
[0069]
For the logical element “footnote”, one of “Do not extract”, “Auto extract”, and “Discard after extraction” can be selected.
[0070]
Knowledge regarding the structuring of the logic elements described in (3) is defined using the user interface of FIG.
[0071]
With regard to the “document structuring method”, either “implement” or “do not implement” can be selected. If “do not implement” is selected, all other definitions are invalidated, and logical elements are output flatly in the reading order. On the other hand, when “implement” is selected, logical elements are structured based on the other definitions and output as a tree structure.
[0072]
With regard to “structured bullets”, “not structured”, “structured based on indentation information”, and “structured based on heading information” can be selected. When bullets are structured based on indentation information, it is possible to set the indentation amount for determining indentation in mm units. The bullet structuring operation will be described later.
[0073]
With regard to “structuring the diagram inside the bulleted list”, it is possible to select either “leave in the bulleted structure” or “out of the bulleted list”. When a figure is left in an itemized list, it is structured in the same way as an adjacent itemized list in a document. When a figure is taken out of the bullet structure, it is output at the same level as the bullet structure after the bullet structure containing the figure in the document.
[0074]
Similarly, regarding “structuring the table inside the bulleted list”, it is possible to select either “leave in the bulleted structure” or “out of the bulleted list”. If a table is left in a bulleted list, it is structured in the same way as the adjacent bulleted list in the document. When the table is taken out of the bullet structure, it is output at the same level as the bullet structure, after the bullet structure that contains the table in the document.
[0075]
Regarding the extraction of logical elements in the table, either “Do not extract” or “Extract” can be selected for each of “Extracting chapter headings in the table” and “Extracting bullets in the table”. When “Do not extract” is selected, each such logical element is output as a paragraph.
[0076]
With regard to “giving reading order of logical elements in the table”, either “order as horizontal composition” or “order as vertical composition” can be selected. When ordering as horizontal composition, the logical elements are ordered in the horizontal direction starting from the upper left corner. When ordering as vertical composition, the logical elements are ordered in the vertical direction starting from the upper right corner.
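One possible reading of the two table ordering modes is sketched below. It assumes each table element is identified by a hypothetical `(row, column)` index, that horizontal composition proceeds row by row from the upper left, and that vertical composition proceeds column by column from the upper right; the description does not spell out these details.

```python
def order_cells(cells, mode):
    """Order table cells given as (row, column) index pairs.

    "horizontal": start at the upper left, go left to right, then down.
    "vertical":   start at the upper right, go top to bottom, then left.
    """
    if mode == "horizontal":
        return sorted(cells, key=lambda cell: (cell[0], cell[1]))
    if mode == "vertical":
        return sorted(cells, key=lambda cell: (-cell[1], cell[0]))
    raise ValueError(f"unknown mode: {mode!r}")


cells = [(1, 1), (0, 0), (1, 0), (0, 1)]
print(order_cells(cells, "horizontal"))
print(order_cells(cells, "vertical"))
```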
[0077]
Regarding “ordering of document headings”, either “read document headings first” or “automatically order document headings” can be selected.
[0078]
The document logical element extraction unit 13 performs the following processing with reference to the document model database 16 described above.
[0079]
First, the layout analysis result holding the character recognition result for each line is decomposed for each character line. Each character line is given information of “normal line”, “indentation line”, “centering line”, or “hard return line”. At this time, the character lines are classified into the above four categories by applying the following conditional expressions to the distance D1 between the text block boundary and the character line head position and the distance D2 between the text block boundary and the character line end position.
[0080]
Conditional expression:
When D1 ≦ TH1 and D2 ≦ TH2, the character line is a normal line.
[0081]
When D1> TH1 and D2 ≦ TH2, the character line is set as an indented line.
[0082]
When D1> TH1 and D2> TH2, the character line is set as a centering line.
[0083]
When D1 ≦ TH1 and D2> TH2, the character line is set as a hard return line.
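The four conditional expressions translate directly into code. The sketch below uses hypothetical names; D1, D2, TH1, and TH2 are the distances and thresholds defined above.

```python
def classify_line(d1: float, d2: float, th1: float, th2: float) -> str:
    """Classify a character line from its head distance D1 and end distance D2."""
    if d1 <= th1 and d2 <= th2:
        return "normal"        # touches both text block boundaries
    if d1 > th1 and d2 <= th2:
        return "indented"      # head is set in, end reaches the boundary
    if d1 > th1 and d2 > th2:
        return "centering"     # both head and end are set in
    return "hard_return"       # D1 <= TH1 and D2 > TH2: line ends early


print(classify_line(0.0, 0.0, th1=2.0, th2=2.0))
print(classify_line(5.0, 0.0, th1=2.0, th2=2.0))
print(classify_line(5.0, 5.0, th1=2.0, th2=2.0))
print(classify_line(0.0, 5.0, th1=2.0, th2=2.0))
```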
[0084]
When the document model database 16 specifies that chapter heading extraction is to be performed (“chapter heading” shown in FIG. 6 is selected and “automatic extraction” or “extraction based on model” is selected for it), or that bullet extraction is to be performed (“ordered bullets” shown in FIG. 6 is selected and “automatic extraction” or “extract based on model” is selected for it, or “unordered bullets” is selected and “automatic extraction” or “extract based on model” is selected for it), the “heading description” is detected for each character line. If the heading description of the document logical element is expressed by a numerical sequence in the document model, the first several characters of each character line are likewise expressed as a numerical sequence, and the heading description is detected by determining whether that sequence matches the numerical sequence defined in the document model.
[0085]
When the heading description of a document logical element is expressed by a keyword in the document model, the heading description is detected by performing keyword matching on each character line. If the keyword is expressed in simplified form as described above, the desired document logical element is extracted by matching the portions of the keyword before and after the simplified portion, without matching the simplified portion itself.
[0086]
FIG. 8 is a diagram illustrating an example of a document logical element extraction result.
[0087]
When a character line having a heading description and the character lines following it are in the arrangement shown in FIG. 8, they are integrated to extract a chapter heading or bullet document logical element. That is, as shown in FIG. 8, when the first line 60 of the three character lines 60 to 62 is a normal line, the second and subsequent lines (61, 62) are consecutive indented lines or centering lines, and the length of each of the character lines 60 to 62 is equal to or shorter than that of the preceding line, these lines are integrated to extract the document logical element 63.
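The integration rule can be sketched as below, under the assumption that the condition is checked line by line and that integration stops at the first line that violates it. The `Line` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    kind: str          # "normal", "indented", "centering", or "hard_return"
    length_mm: float   # length of the character line
    text: str


def merge_heading_lines(lines: List[Line], start: int) -> List[Line]:
    """Collect the heading line at `start` and the continuation lines after it.

    A following line is absorbed only if it is an indented or centering line
    and is no longer than the line before it.
    """
    merged = [lines[start]]
    if lines[start].kind != "normal":
        return merged          # the first line must be a normal line
    for line in lines[start + 1:]:
        if line.kind not in ("indented", "centering"):
            break
        if line.length_mm > merged[-1].length_mm:
            break
        merged.append(line)
    return merged


lines = [
    Line("normal", 100.0, "1. A long chapter heading that"),
    Line("indented", 80.0, "wraps onto a second line"),
    Line("indented", 40.0, "and a third"),
    Line("normal", 100.0, "Body text starts here."),
]
element = merge_heading_lines(lines, 0)
print(" ".join(line.text for line in element))
```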
[0088]
As for the chapter headings, element names that distinguish chapter headings at different levels are assigned, such that “1.” is heading 1, “1.1” is heading 2, and “1.1.1” is heading 3.
[0089]
As for itemization, when hierarchical information is defined corresponding to the heading description in the document model, hierarchical information is added simultaneously with extraction of logical elements.
[0090]
If a table is mixed in the document and the document model specifies that the table is to be automatically extracted (“Automatic extraction” is selected in “Figure” shown in FIG. 6), the table elements and columns are extracted from the table, and the hierarchical structure of “table-column-table element” is extracted.
[0091]
When “extract table as text area” is specified in the document model (in “table” shown in FIG. 6, either “automatic extraction” or one of the five lower options for extraction as text is selected), a text area is extracted in each cell of the table by the method described above. The text area is likewise decomposed into character lines, each character line is classified into the four categories, the heading description described above is detected, and document logical elements are extracted by integrating the character lines.
[0092]
As a result, it is possible to extract logical elements from the table in the same way as from a normal text area. When “Do not extract” is selected in “Extract chapter headings inside table” shown in FIG. 7, the document logical elements in the table that have the same heading description as the chapter headings outside the table are extracted as bullets or paragraphs. Also, when “Do not extract” is selected in “Extract bullets inside table” shown in FIG. 7, the logical elements in the table that have the same heading description as the bullets outside the table are extracted as paragraphs.
[0093]
The document logic element extracted by the above-described processing is output to a display device such as a display (not shown), and the document logic element extraction result editing unit 12 changes the processing result or adds information to the processing result. In this document logical element extraction result editing unit 12, in order to obtain a highly accurate processing result in the subsequent processing in the document structure analyzing unit 14, if the logical element name or the hierarchy information is incorrect, these are changed. If no hierarchy information is detected, hierarchy information is assigned to this logical element. At this time, different hierarchical information can be given to logical elements having the same heading description and indentation amount, and the tree structure extraction is performed by the document structure analysis unit 14 based on the result.
[0094]
The document structure analysis unit 14 refers to the document model database 16 and performs the following processing.
[0095]
First, the reading order is assigned to the document logical elements, and the document logical elements are rearranged according to the reading order. Further, a grouping process is performed on the document logical elements rearranged in the reading order. At this time, the hierarchical structure of the document is extracted by extracting the chapter group and the bulleted group and extracting the inclusion relation between the groups.
[0096]
When extracting a group of document logical elements, a global group such as a chapter group is preferentially extracted. In chapter group extraction, logical elements existing between chapter headings having the same element name are grouped.
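Chapter group extraction for one heading level can be sketched as follows. The element representation (pairs of element name and text) and the function name are hypothetical, and the sample labels are only illustrative.

```python
def chapter_groups(elements, heading_name):
    """Split elements, already in reading order, into groups that each start
    at a chapter heading with the given element name.

    Returns (preamble, groups): `preamble` holds elements that precede the
    first such heading, and each group runs up to the next heading of the
    same name.
    """
    preamble, groups = [], []
    for name, text in elements:
        if name == heading_name:
            groups.append([(name, text)])
        elif groups:
            groups[-1].append((name, text))
        else:
            preamble.append((name, text))
    return preamble, groups


elements = [
    ("heading2", "B"),
    ("heading3", "E"), ("bullet", "F"), ("bullet", "G"),
    ("heading3", "I"), ("bullet", "J"),
]
preamble, groups = chapter_groups(elements, "heading3")
print(preamble)
print(groups)
```

Applying the same function first to the highest heading level and then to each resulting group gives the nested chapter groups, with global groups extracted before local ones.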
[0097]
In the extraction of bulleted groups, lower-level grouping is performed in ascending reading order. If the document model specifies “structure based on indentation”, grouping is performed within each chapter group based on the positional relationship of the head positions of the document logical elements (the left end of the text block for horizontally written elements, the upper end for vertically written elements). For example, when the head positions of the logical elements following a certain bulleted logical element are substantially the same or indented, they are considered to belong to the same group and are grouped. This grouping process ends when a logical element appears whose head position coordinate is smaller than that of the starting element of the group, or when a logical element with shallower hierarchical information appears.
[0098]
Within each group, hierarchical grouping can be realized by recursively performing the same grouping process each time a logical element appears whose head position is indented by more than the threshold TH3 relative to the preceding logical element. When hierarchical information has been given based on the model definition at the time of the logical element extraction described above, and that information indicates a deeper hierarchy than the adjacent logical element, a lower group is detected regardless of the indentation position within the group.
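The indentation-based grouping can be sketched with a stack of open groups. This is a simplified illustration: elements are hypothetical `(text, head_position_mm)` pairs, TH3 is also used as the tolerance for "substantially the same" head positions (an assumption), and the override by hierarchical information is not modeled.

```python
def group_by_indent(items, th3):
    """Build nested lists from (text, head_position_mm) pairs in reading order.

    A new lower group opens when an element is indented by more than `th3`
    relative to the preceding element; a group closes when an element appears
    whose head position lies to the left of the group's first element.
    """
    root = []
    first_pos = items[0][1] if items else 0.0
    stack = [(first_pos, root)]            # (head position of first element, group)
    prev_pos = None
    for text, pos in items:
        if prev_pos is not None and pos - prev_pos > th3:
            child = []
            stack[-1][1].append(child)     # nest the new group in the current one
            stack.append((pos, child))
        else:
            while len(stack) > 1 and pos < stack[-1][0] - th3:
                stack.pop()                # leave groups that this element ends
        stack[-1][1].append(text)
        prev_pos = pos
    return root


items = [("a", 0.0), ("b", 0.0), ("b1", 5.0), ("b2", 5.0), ("c", 0.0)]
print(group_by_indent(items, th3=2.0))
```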
[0099]
When “Structure based on heading description” is specified in the document model (“structured based on heading information” is selected in “structured bullets” in FIG. 7), the bulleted groups are extracted within each chapter group based on the consistency of the heading descriptions detected by the method described above.
[0100]
First, the elements from the first bullet that appears in the chapter group through the last element of the chapter group are extracted as one group. Next, if a bullet with a heading description different from that of the head element appears in this bullet group, the elements up to the point where a bullet with the same heading description as the head element appears again are grouped, thereby detecting a bulleted group subordinate to that group. By executing such processing recursively, hierarchical bulleted groups can be detected.
[0101]
At this time, as described above, when the hierarchical information is given based on the model definition at the time of the logical element extraction, the lower group is extracted if the hierarchy is different even if the heading description is the same. On the contrary, even if the heading descriptions are different, if the hierarchical information is the same, the grouping process is performed as belonging to the same group.
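The heading-description-based grouping can be sketched recursively. In this simplified illustration each bullet is a hypothetical `(signature, text)` pair, where the signature stands for its heading description; the override by operator-assigned hierarchical information described above is not modeled.

```python
def group_by_heading(items):
    """Nest bullets given as (signature, text) pairs in reading order.

    Bullets sharing the signature of the first bullet stay at the current
    level; a run of bullets with other signatures, up to the next bullet with
    the head signature, becomes a subordinate group and is processed the same way.
    """
    if not items:
        return []
    head_signature = items[0][0]
    result = []
    i = 0
    while i < len(items):
        signature, text = items[i]
        if signature == head_signature:
            result.append(text)
            i += 1
        else:
            j = i
            while j < len(items) and items[j][0] != head_signature:
                j += 1
            result.append(group_by_heading(items[i:j]))  # subordinate group
            i = j
    return result


bullets = [
    ("n.", "1."), ("(n)", "(1)"), ("(n)", "(2)"),
    ("n.", "2."), ("(n)", "(1)"),
]
print(group_by_heading(bullets))
```

To reflect hierarchical information, the first field of each pair could carry the assigned hierarchy level instead of the heading signature, so that bullets are compared by level.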
[0102]
When the document model specifies that figures and tables are to be output outside the bulleted list (“out of the bulleted list” is selected in “structuring the diagram inside the bulleted list” in FIG. 7, or “out of the bulleted list” is selected in “structuring the table inside the bulleted list” in FIG. 7), all figures / tables in a bulleted group are output immediately after the top-level bulleted group, keeping their reading order and given the same hierarchical information as that group.
[0103]
The document structure analysis unit 14 converts the result obtained by the hierarchical grouping process described above into a tree structure. When “not structured” is specified in the document model (“do not implement” is selected in “document structuring method” in FIG. 7), the grouping of chapter headings and bullets is not performed, and a flat tree structure is obtained.
[0104]
When “Structure based on indentation” or “Structure based on heading description” is specified in the document model (“structured based on indentation information” or “structured based on heading information” is selected in “structured bullets” in FIG. 7), various tree structures are obtained depending on the other parameters of the document model.
[0105]
Hereinafter, the above-described embodiment will be described in more detail using specific examples.
[0106]
FIG. 9 shows an example in which a one-page printed document is converted into a document image. In this case, the logical elements C to M and N1 to N5 can be regarded as constituent elements of a table, but here, as an example, they are output not as elements of a table but as the logical elements constituting the logical structure shown in FIG. Further, the logical element A is regarded as “heading 1”, the logical element B as “heading 2”, the logical elements E and I as “heading 3”, and the logical elements F, G, H, J, K, L, M, and N1 to N5 as “bullet”. The logical elements C and D are unnecessary logical elements and are not output in the final result.
[0107]
First, the layout analysis result shown in FIG. 11 is obtained by the layout analysis process in the layout analysis unit 10. Next, in the document logical element extraction process in the document logical element extraction unit 13, the result is first decomposed into the character lines shown in FIG. 12, and then the logical elements shown in FIG. 13 are extracted by classifying the character lines and identifying the heading descriptions.
[0108]
For example, in the document model stored in the document model database 16, if the keyword “Chapter” is defined as the heading description of the logical element “Heading 1”, the logical element A is identified as “Heading 1”. Similarly, if the keyword “Section” is defined as the heading description of the logical element “Heading 2” in the document model, the logical element B is identified as “Heading 2”.
[0109]
Further, if “10000000100” and “10000000000001” are defined as the heading descriptions of the “bullet” logical element in the document model, the logical elements F, G, H, J, K, L, and N1 to N5 are identified as “bullet”. In this case, since the “heading 3” of the logical elements E and I and the “bullet” of the logical elements K and L have the same heading description, it has been difficult in the prior art to distinguish them as different logical elements.
[0110]
However, since the logical elements E and I have a smaller block width than the logical elements K and L, in this embodiment they can be discriminated as different logical elements by defining a model that combines the block width and the heading description.
[0111]
If the document model defines the hierarchical level of bullets with a heading description of “10000000100000” and a character line height of α mm or more as 1, the hierarchical level of bullets with a heading description of “100000000010” as 2, and the hierarchical level of bullets with a heading description of “100000000100000” and a character line height of β mm or less as 3, the logical elements F, G, and H can be identified as ordered bullets 70 at hierarchical level 1, the logical elements J, K, and L as ordered bullets 71 at hierarchical level 2, and the logical elements N1 to N5 as ordered bullets 72 at hierarchical level 3.
[0112]
The document structure analysis unit 14 first assigns the reading order to the document logical elements, and then rearranges the document logical elements according to the reading order as shown in FIG. As a result, the logic elements A, B, E, F, G, H, I, J, K, L, M, N1, N2, N3, N4, and N5 are arranged in this order.
[0113]
Next, hierarchical groups of logical elements are detected by performing a grouping process. In the example of FIG. 14, the “Heading 1” group A, B, E, F, G, H, I, J, K, L, M, N1, N2, N3, N4, and N5 is detected first. Next, from the “Heading 1” group, the “Heading 2” group B, E, F, G, H, I, J, K, L, M, N1, N2, N3, N4, and N5 and the “Heading 3” groups E, F, G, H and I, J, K, L, M, N1, N2, N3, N4, and N5 are detected.
[0114]
Then, using the identity of heading information and hierarchical information as clues, the “ordered list” groups (F, G, H), (J, K, L, M, N1, N2, N3, N4, N5), (K, L, M, N1, N2, N3, N4, N5), and (N1, N2, N3, N4, N5) are detected hierarchically from each “Heading 3” group.
[0115]
The groups thus obtained are converted into the tree structure shown in FIG. 15 based on the inclusion relationships between the groups.
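As a rough illustration of turning reading-ordered logical elements into a tree, the sketch below nests elements with a stack keyed on a numeric rank. It is a simplified stand-in for the group-inclusion procedure described above, not a reproduction of it or of FIG. 15. The `Node` class, the `build_tree` function, and the rank values (headings 1 to 3 as ranks 1 to 3, level 1 bullets as rank 4, level 2 bullets as rank 5) are assumptions made for this example, and elements M and N1 to N5 are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    rank: int                      # smaller rank = higher in the hierarchy
    children: list = field(default_factory=list)


def build_tree(elements):
    """Nest reading-ordered (name, rank) pairs into a tree."""
    root = Node("document", 0)
    stack = [root]
    for name, rank in elements:
        # Close every open group that cannot contain this element.
        while stack[-1].rank >= rank:
            stack.pop()
        node = Node(name, rank)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def names(node):
    """Return the names of a node's direct children."""
    return [child.name for child in node.children]


# Assumed ranks for a subset of the logical elements in reading order.
ordered = [("A", 1), ("B", 2), ("E", 3), ("F", 4), ("G", 4), ("H", 4),
           ("I", 3), ("J", 5), ("K", 5), ("L", 5)]
root = build_tree(ordered)
heading_2 = root.children[0].children[0]   # node B
```

Under these assumed ranks, B contains the two “Heading 3” nodes E and I, E contains F, G, and H, and I contains J, K, and L.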
[0116]
As described above, according to the above-described embodiment, document logical elements such as document headings, authors, abstracts, dates, chapter headings, paragraphs, bullets, footnotes, captions, mathematical expressions, headers, and footers are automatically extracted from document images obtained from various printed documents composed of a plurality of pages, and the hierarchical structure between the document logical elements is extracted with high precision so that the document image is represented by a tree structure of the document logical elements. As a result, a structured document tagged with XML, HTML, or the like can be generated with high accuracy and automatically input to a computer system.
[0117]
In addition, a complicated bullet structure with a multi-level hierarchy can be expressed as a tree structure as intended by the operator, which greatly reduces both the complexity of editing structured documents and the operator's editing work in digitizing documents.
[0118]
(Modification 1)
According to the present invention, the printed document shown in FIG. 9 can also be converted into an XML document having another logical structure.
[0119]
For example, when each of the logical elements A, B, E, and I shown in FIG. 9 is considered as an itemized list, a document model database 16 having the following model definition is created and a structuring process is performed based on the document model database 16. Thus, the tree structure shown in FIG. 16 can be extracted and then converted into the XML document shown in FIG.
[0120]
An example of the document model definition is as follows.
[0121]
Logical elements having heading descriptions of “Chapter 1”, “Chapter 2”, “Chapter 3”, ... are regarded as bullets, and their hierarchical information is defined as level 1.
[0122]
Logical elements having heading descriptions of “Section 1”, “Section 2”, “Section 3”, ... are regarded as bullets, and their hierarchical information is defined as level 2.
[0123]
Logical elements having heading descriptions of “(1)”, “(2)”, “(3)”, ... are regarded as bullets; if the character size (height of the character line) is less than γ mm, the hierarchical information is level 3, and if the height of the character line is λ mm or more, the hierarchical information is level 5.
[0124]
Logical elements having heading descriptions of “1.”, “2.”, “3.”, ... are regarded as bullets. If the height of the character line is less than β mm, the hierarchical information is set to level 6.
[0125]
When the document model is defined as described above, the hierarchical groups shown in FIG. 17 can be extracted by performing, sequentially and hierarchically as in the embodiment described above, the level 1 bullet grouping process, the level 2 bullet grouping process, the level 3 bullet grouping process, the level 4 bullet grouping process, the level 5 bullet grouping process, and the level 6 bullet grouping process.
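The model definition of Modification 1 can be approximated by a small rule function that maps the first line of a logical element and its character line height to a hierarchical level. This is a hedged sketch: the function name `bullet_level`, the regular expressions, and the numeric thresholds standing in for γ, λ, and β are assumptions, since the source gives only the symbols. The text states no rule for level 4, so the sketch does not assign it.

```python
import re
from typing import Optional

# Hypothetical thresholds in mm; the source gives only the symbols.
GAMMA_MM = 3.5
LAMBDA_MM = 4.5
BETA_MM = 3.0


def bullet_level(first_line: str, line_height_mm: float) -> Optional[int]:
    """Return the hierarchical level of a bullet, or None if no rule applies."""
    if re.match(r"Chapter\s*\d+", first_line):
        return 1
    if re.match(r"Section\s*\d+", first_line):
        return 2
    if re.match(r"\(\d+\)", first_line):
        if line_height_mm < GAMMA_MM:
            return 3
        if line_height_mm >= LAMBDA_MM:
            return 5
        return None  # heights between the two thresholds are not covered
    if re.match(r"\d+\.", first_line):
        return 6 if line_height_mm < BETA_MM else None
    return None


levels = [bullet_level("Chapter 1 Overview", 6.0),
          bullet_level("(1) First item", 3.0),
          bullet_level("(1) First item", 5.0),
          bullet_level("1. Sub-item", 2.5)]
```

Grouping would then proceed level by level over the elements labeled this way.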
[0126]
Thereafter, the tree structure shown in FIG. 16 is extracted by the document structure analysis unit 14, and further converted into an XML document having the tree structure shown in FIG. 18 by the data format conversion unit 15.
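The conversion from a tree structure to an XML document can be illustrated with Python's standard `xml.etree.ElementTree` module. The nested-dictionary representation, the `to_xml` function, and the tag names `list` and `item` are assumptions for this sketch; the source does not specify the tags used in FIG. 18 or the internal format handled by the data format conversion unit 15.

```python
import xml.etree.ElementTree as ET


def to_xml(node: dict) -> ET.Element:
    """Convert a nested dict {"tag", "text", "children"} into an XML element."""
    element = ET.Element(node["tag"])
    element.text = node.get("text")
    for child in node.get("children", []):
        element.append(to_xml(child))
    return element


# Hypothetical two-level bullet tree.
tree = {"tag": "list", "children": [
    {"tag": "item", "text": "Chapter 1", "children": [
        {"tag": "list", "children": [
            {"tag": "item", "text": "Section 1"}]}]}]}

xml_text = ET.tostring(to_xml(tree), encoding="unicode")
```

Each level of the bullet hierarchy becomes a nested `list` element inside its parent `item`.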
[0127]
(Modification 2)
Further, for the printed document shown in FIG. 9, a document model database 16 may be created with a model definition in which the logical element A is regarded as a level 1 heading, the logical element B as a level 2 heading, and the logical elements C to N5 as table elements. By executing the structuring process based on this database, the tree structure of FIG. 20 is extracted according to the following processing procedure and can then be converted into the XML document shown in FIG.
[0128]
For example, after a horizontal line segment and a vertical line segment are extracted from a document image, an area surrounded by the extracted line segment is extracted as a table element.
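Assuming the ruled lines have already been detected and form a complete grid, the enclosed areas can be enumerated from the line coordinates as sketched below. The function `grid_cells` and the coordinate values are hypothetical; line segment detection itself and partial or merged cells are outside the scope of this sketch.

```python
def grid_cells(xs, ys):
    """Return cells (left, top, right, bottom) enclosed by a full grid of ruled lines.

    xs: x coordinates of the vertical line segments
    ys: y coordinates of the horizontal line segments
    """
    xs, ys = sorted(set(xs)), sorted(set(ys))
    return [(left, top, right, bottom)
            for top, bottom in zip(ys, ys[1:])
            for left, right in zip(xs, xs[1:])]


# Three vertical and four horizontal lines enclose 2 x 3 = 6 areas.
cells = grid_cells(xs=[10, 60, 200], ys=[20, 50, 90, 150])
```

Each returned rectangle is a candidate table element, listed row by row from top to bottom.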
[0129]
As a result, as shown in FIG. 19, the document structure analysis unit 14 extracts the logical element C as table element 1, the logical element D as table element 2, the logical element E as table element 3, the logical elements F to H together as table element 4, the logical element I as table element 5, and the logical elements J to N5 together as table element 6.
[0130]
Next, the document structure analysis unit 14 groups the table elements adjacent in the horizontal direction. Specifically, table elements 1 and 2 are defined as a table element group a, table elements 3 and 4 as a table element group b, and table elements 5 and 6 as a table element group c.
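Grouping table elements that are adjacent in the horizontal direction can be sketched by collecting bounding boxes whose vertical ranges overlap. The function `group_rows`, the tuple layout, and the coordinates are assumptions made for illustration; the source does not state how adjacency is tested.

```python
def group_rows(elements):
    """Group (name, left, top, right, bottom) boxes whose vertical ranges overlap."""
    rows = []
    for elem in sorted(elements, key=lambda e: (e[2], e[1])):
        _, _, top, _, bottom = elem
        for row in rows:
            _, _, row_top, _, row_bottom = row[0]
            # Overlapping vertical ranges mean the boxes sit side by side.
            if top < row_bottom and bottom > row_top:
                row.append(elem)
                break
        else:
            rows.append([elem])
    # Order each group from left to right and keep only the names.
    return [[e[0] for e in sorted(row, key=lambda e: e[1])] for row in rows]


# Hypothetical bounding boxes for table elements 1 to 6.
table_elements = [("1", 10, 20, 60, 50), ("2", 60, 20, 200, 50),
                  ("3", 10, 50, 60, 90), ("4", 60, 50, 200, 90),
                  ("5", 10, 90, 60, 150), ("6", 60, 90, 200, 150)]
groups = group_rows(table_elements)
```

With these boxes, the three resulting groups correspond to table element groups a, b, and c.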
[0131]
Next, the document structure analysis unit 14 groups the table element groups a to c and extracts the table group shown in FIG.
[0132]
As a result, the document structure analysis unit 14 extracts the tree structure shown in FIG. The extracted tree structure is converted by the data format conversion unit 15 into an XML document shown in FIG.
[0133]
As described above, the XML document shown in FIG. 22 can be generated from the printed document shown in FIG.
[0134]
[Effect of the invention]
According to the present invention, a structured document tagged with XML, HTML, or the like can be generated with high accuracy from various types of printed documents.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a document processing apparatus according to the present invention.
FIG. 2 is a diagram illustrating an example of a layout analysis result.
FIG. 3 is a diagram for explaining a circumscribed rectangle.
FIG. 4 is a diagram showing a sentence area, a character line area, and a character area described hierarchically.
FIG. 5 is a diagram showing a user interface for defining a knowledge model of a structuring method related to document elements.
FIG. 6 is a diagram showing a user interface for defining a knowledge model related to the contents of logical elements.
FIG. 7 is a diagram showing a user interface for defining a knowledge model of a logical element extraction method.
FIG. 8 is a diagram illustrating an example of a document logical element extraction result.
FIG. 9 is a diagram illustrating an example of a document image to be processed.
FIG. 10 is a diagram illustrating an example of a document structure analysis result.
FIG. 11 is a diagram illustrating an example of a layout analysis result.
FIG. 12 is a diagram illustrating an example of document logical element extraction.
FIG. 13 is a diagram illustrating an example of a document logical element extraction result.
FIG. 14 is a diagram showing an example of hierarchical grouping of logical elements in document structure analysis processing.
FIG. 15 is a diagram illustrating an example of processing results of document structure analysis and data format conversion.
FIG. 16 is a diagram illustrating another example of a document structure analysis result.
FIG. 17 is a diagram illustrating another example of hierarchical grouping of logical elements in document structure analysis processing.
FIG. 18 is a diagram illustrating another example of processing results of document structure analysis and data format conversion.
FIG. 19 is a diagram illustrating another example of a document logical element extraction result.
FIG. 20 is a diagram illustrating another example of a document structure analysis result.
FIG. 21 is a diagram illustrating another example of hierarchical grouping of logical elements in document structure analysis processing.
FIG. 22 is a diagram illustrating another example of processing results of document structure analysis and data format conversion.
[Explanation of symbols]
10 ... Layout analysis unit, 11 ... Character recognition unit, 12 ... Document logical element extraction result editing unit, 13 ... Document logical element extraction unit, 14 ... Document structure analysis unit, 15 ... Data format conversion unit, 16 ... Document model database.

Claims

A document processing apparatus comprising:
a layout analysis unit that extracts, from an input document image, a sentence area indicating an area in which text is described, and extracts layout information indicating the layout of the sentence area in the document image;
a character line area extraction unit that extracts, for each sentence area, the character line areas constituting the sentence area;
a character recognition unit that extracts character patterns from the character line areas and converts them into character codes;
a document logical element extraction unit that extracts document logical elements from the sentence areas extracted by the layout analysis unit with reference to a first document model, the first document model being a database that indicates a structuring method for obtaining a desired tree structure and that has information for extracting the document logical elements constituting a document in the sentence areas and information on the contents of the document logical elements;
a document logical element reading order determination unit that arranges the document logical elements extracted by the document logical element extraction unit in reading order;
a document logical element grouping unit that hierarchically groups the document logical elements rearranged in reading order by the document logical element reading order determination unit, based on a second document model for structuring the document logical elements, the second document model having information indicating the identity of types of the document logical elements and the hierarchy of the document logical elements; and
a tree structure creation unit that creates a tree structure representing the logical relationship between the document logical elements and the character codes based on the information grouped by the document logical element grouping unit.

The document processing apparatus according to claim 1, wherein the document model relating to the method for extracting the document logical elements consists of: no extraction; automatic extraction; extraction based on the document model relating to the contents of the document logical elements; rejection after extraction; extraction based on the position on the page and the arrangement relationship with other logical elements; and extraction of a document logical element from a table element constituting a table.

The document processing apparatus according to claim 1, wherein the document model relating to the contents of the document logical elements is composed of the type of the document logical element, geometric information of the document logical element, keyword information existing in the document logical element, and heading description information of the first line of the document logical element.

The document processing apparatus according to claim 3, wherein the keyword information existing in the document logical element is composed of a character string constituting the keyword, the position of the keyword in the character line including the keyword, and the occupancy ratio of the keyword in the character line including the keyword.

The document processing apparatus according to claim 3, wherein the heading description information of the first line of the document logical element is composed of a number sequence expressed by the number of each character type constituting the heading information.

The document processing apparatus according to claim 1, wherein the document logical element extraction unit recognizes a document logical element by comparing the heading description information of the first line of the document logical element, which is composed of a number sequence expressed by the number of each character type constituting the heading information, with a number sequence expressed by the number of each character type extracted from the character code information of the first line of the document logical element.

The document processing apparatus according to claim 1, wherein the document model relating to the method of structuring the document logical elements is composed of information on the method of structuring bullets, information on the method of structuring figures and tables located within bullets, information on the method of structuring chapter headings and bullets within a table, information on the method of ordering logical elements within a table, and information on the method of ordering document headings.

A document processing method comprising:
a layout analysis step of extracting, from an input document image, a sentence area indicating an area in which text is described, and extracting layout information indicating the layout of the sentence area in the document image;
a character line area extraction step of extracting, for each sentence area, the character line areas constituting the sentence area;
a character recognition step of extracting character patterns from the character line areas and converting them into character codes;
a document logical element extraction step of extracting document logical elements from the sentence areas extracted in the layout analysis step with reference to a first document model, the first document model being a database that indicates a structuring method for obtaining a desired tree structure and that has information for extracting the document logical elements constituting a document in the sentence areas and information on the contents of the document logical elements;
a document logical element reading order determination step of arranging the document logical elements extracted in the document logical element extraction step in reading order;
a document logical element grouping step of hierarchically grouping the document logical elements rearranged in reading order in the document logical element reading order determination step, based on a second document model for structuring the document logical elements, the second document model having information indicating the identity of types of the document logical elements and the hierarchy of the document logical elements; and
a tree structure creation step of creating a tree structure representing the logical relationship between the document logical elements and the character codes based on the information grouped in the document logical element grouping step.