JP2004038756A

JP2004038756A - Document conversion method and document conversion device

Info

Publication number: JP2004038756A
Application number: JP2002197343A
Authority: JP
Inventors: Hideaki Tanaka; 田中　秀明
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2002-07-05
Filing date: 2002-07-05
Publication date: 2004-02-05

Abstract

PROBLEM TO BE SOLVED: To provide a document conversion method capable of realizing highly accurate conversion from static document type data to dynamic document type data.

SOLUTION: Character information of each character is extracted from the inputted static document data (S11), and a plurality of area dividing candidates are prepared on the basis of the extracted character information (S12). Line extraction in each area is carried out per each area dividing candidate, and character strings are prepared by searching extracted lines (S13). A transition probability of all character strings in one area dividing candidate is calculated (S14), and on the basis of the transition probability per each area dividing candidate, an area with the highest transition probability of all character strings is determined as an optimum area (S15). Line extraction in each area is carried out again with respect to the optimum area, and an extracted line is searched to prepare the correct sentence (S16). By this, the static document data is converted into the dynamic document data including the correct sentence.

COPYRIGHT: (C)2004, JPO

Description

【０００１】
【発明の属する技術分野】
この発明は文書変換方法および文書変換装置に関し、特に、静的ドキュメント形式のデータから動的ドキュメント形式のデータへの高精度の変換を実現できる文書変換方法および文書変換装置に関する。
【０００２】
【従来の技術】
昨今の情報関連技術やドキュメント関連技術の発展に伴ない、多種の電子化ドキュメントが氾濫している。これらの電子化ドキュメントファイルは、文字や画像などの各オブジェクトの位置情報を有するものと、位置情報がないものとに大別できる。
【０００３】
前者の代表例としては、ＤＴＰ（Ｄｅｓｋｔｏｐ　Ｐｕｂｌｉｓｈｉｎｇ）ソフトのドキュメントファイルや、ＰＤＦ（Ｐｏｒｔａｂｌｅ　Ｄｏｃｕｍｅｎｔ　Ｆｏｒｍａｔ）ファイルなどがある。また、後者の代表例としては、プレーンのテキストファイル（テキストだけが入っているファイル）やＨＴＭＬ（Ｈｙｐｅｒｔｅｘｔ　Ｍａｒｋｕｐ　Ｌａｎｇｕａｇｅ）ファイルがある。なお、以降の説明においては、前者を静的ドキュメント形式と呼び、後者を動的ドキュメント形式と呼ぶ。
【０００４】
位置情報を持たない動的ドキュメント形式は、そのファイル専用のブラウザ（閲覧ソフト）により、ドキュメント内の文字や画像があるルールに基づき配置され、表示される。あるルールとは、例えばプレーンテキストならば文字を順番に配置し改行コードに基づき改行するルールであり、ＨＴＭＬならば各タグに基づく配置ルールである。
【０００５】
一方、各種電子化ドキュメントを閲覧するためのプラットフォームとして、従来ではパーソナルコンピュータが一般的であったが、今後はＰＤＡ（Ｐｅｒｓｏｎａｌ　Ｄｉｇｉｔａｌ　Ａｓｓｉｓｔａｎｔｓ）やＰＤＣ（Ｐｅｒｓｏｎａｌ　Ｄｉｇｉｔａｌ　Ｃｅｌｌｕｌａｒ）などの携帯端末にシフトすることが予想される。実際、現在においても、携帯端末でのＷｅｂ閲覧が実現されている。
【０００６】
しかしながら、携帯端末はその画面解像度がパーソナルコンピュータに比べ小さく、静的ドキュメント形式の表示には適さないという問題があり、現在もそして今後も、携帯端末で主に利用されるものは動的ドキュメント形式と予想される。
【０００７】
このため、今後は、電子化ドキュメントの閲覧プラットフォームが携帯端末にシフトすることに伴ない、静的ドキュメント形式の情報を高効率に動的ドキュメント形式へ変換するための変換技術が重要となる。
【０００８】
このような電子化ドキュメントの変換技術として、従来では、特開平８−１４７４４６号公報において開示されている電子ファイリングシステム等の技術が提案されている。上述の電子ファイリングシステムは、アプリケーションからプリンタドライバへ伝達される情報により電子化ドキュメントを電子ファイリングシステムのデータに変換するシステムである。すなわち、印刷が可能なアプリケーション上で、該当ドキュメントを、変換用プリンタドライバを用いて「印刷する」という操作を行なうことにより、データ変換を行なうシステムである。
【０００９】
例えば、代表的なＯＳの１つであるＷｉｎｄｏｗｓ（Ｒ）の場合、アプリケーションが印刷する際にプリンタドライバへ引渡すデータは、カーネルモジュールであるＧＤＩ（Ｇｒａｐｈｉｃｓ　Ｄｅｖｉｃｅ　Ｉｎｔｅｒｆａｃｅ）が統一した形式に変換している。したがって、上述の電子ファイリングシステムでは、共通のＧＤＩコマンドを解釈しかつデータ変換を行なうプリンタドライバを用意することにより、多くの異なる電子化ドキュメントフォーマットの違いを吸収し、多対１のデータ変換を実現している。
【００１０】
【発明が解決しようとする課題】
しかしながら、上述の特開平８−１４７４４６号公報において開示されている電子ファイリングシステムのような従来技術では、特にＤＴＰソフト等の特定のアプリケーションにおいて、プリンタドライバへ渡されるデータの文字コードが一連の文章とはならない問題があり、この結果、静的ドキュメント形式のデータを動的ドキュメント形式へ変換することが難しいという問題がある。この理由は、ＤＴＰソフトではＧＤＩコードの表現能力を越えた文章の表現となり、１文字ずつの描画となるためである。
【００１１】
この問題に対して、従来技術でも、文字高さの１／２などの固定割合のしきい値により文字列、単語、行などの判定（切出し）を行なっているが、このようなしきい値処理では、性能が低下するという問題がある。
【００１２】
また、従来技術とは違いプリンタドライバを使用せず、対象ドキュメントのフォーマットを直接解釈する手段を備えた変換技術の場合でも、ＤＴＰやＰＤＦなどの静的ドキュメント形式では、抽出した文字が文章として並んでいないという問題があり、同じく動的ドキュメント形式へ変換が難しいという問題がある。なお、抽出した文字が文章として並んでいない理由としては、ＤＴＰやＰＤＦ等は、表示もしくは印刷すべき個々の位置情報があるがゆえに、保存されているデータ内では文章の順序に文字を格納する必要がなく、ドキュメントを作成するアプリケーション独自の並びで保存されていることが挙げられる。
【００１３】
そこで、プリンタドライバや、直接のフォーマット解釈等において、静的ドキュメント形式のデータを動的ドキュメント形式に変換する場合、得られた個々の文字位置情報を基に正解文章の作成を行なう必要がある。しかし、この文章作成を文字サイズなどのしきい値処理で行った場合、精度が低下するという問題がある。
【００１４】
本発明はこれらの問題に鑑みてなされたものであって、静的ドキュメント形式のデータから動的ドキュメント形式のデータへの高精度の変換を実現できる文書変換方法および文書変換装置を提供することを目的とする。
【００１５】
【課題を解決するための手段】
上記目的を達成するために、本発明のある局面に従うと、文書変換方法は、文字の位置情報を含むドキュメントデータを文字の位置情報を含まないドキュメントデータへ変換する文書変換方法であって、文字の位置情報を含むドキュメントデータより、各文字の位置情報を含む文字情報を抽出する文字情報抽出ステップと、抽出された文字の位置情報に基づいて、ドキュメントデータを複数の領域に分割する分割候補を作成する領域分割候補作成ステップと、作成された領域分割候補ごとに、抽出された各文字の位置情報に基づいて行を抽出し、抽出した行を探索して文字列を作成する文字列作成ステップと、作成された文字列の遷移確率値を計算する文字列確率計算ステップと、計算結果に基づいて、作成された領域分割候補より最適な領域を決定する領域決定ステップと、決定した最適な領域の結果に基づいて、各文字の位置情報を含まないテキスト文章を作成する文章作成ステップとを備える。
【００１６】
本発明の他の局面に従うと、文書変換方法は、文字の位置情報を含むドキュメントデータを文字の位置情報を含まないドキュメントデータへ変換する文書変換方法であって、文字の位置情報を含むドキュメントデータより、各文字の位置情報を含む文字情報を抽出する文字情報抽出ステップと、抽出された文字の位置情報に基づいて、ドキュメントデータを複数の領域に分割する分割候補を作成する領域分割候補作成ステップと、作成された領域分割候補ごとに、抽出された各文字の位置情報に基づいて行を抽出し、抽出結果となる文字の連結が、作成された全ての領域分割候補において１つでも異なる文字を中心として、文字と文字の前後の所定数の文字とからなる部分文字列を作成する文字列作成ステップと、作成された部分文字列の遷移確率値を計算する文字列確率計算ステップと、計算結果に基づいて、作成された領域分割候補より最適な領域を決定する領域決定ステップと、決定した最適な領域の結果に基づいて、各文字の位置情報を含まないテキスト文章を作成する文章作成ステップとを備える。
【００１７】
また、上述の文字情報抽出ステップは、文字の位置情報を含むドキュメントデータに基づいて変換された共通する形式のデータを解釈することと、文字の位置情報を含むドキュメントデータを直接解釈することとの、少なくとも一方の方法によって文字情報を抽出することが望ましい。
【００１８】
本発明の他の局面に従うと、文書変換装置は、文字の位置情報を含むドキュメントデータを文字の位置情報を含まないドキュメントデータへ変換する文書変換装置であって、文字の位置情報を含むドキュメントデータより、各文字の位置情報を含む文字情報を抽出する文字情報抽出手段と、抽出された前記文字の位置情報に基づいて、ドキュメントデータを複数の領域に分割する分割候補を作成する領域分割候補作成手段と、作成された領域分割候補ごとに、抽出された各文字の位置情報に基づいて行を抽出し、抽出した行を探索して文字列を作成する文字列作成手段と、作成された文字列の遷移確率値を計算する文字列確率計算手段と、計算結果に基づいて、作成された領域分割候補より最適な領域を決定する領域決定手段と、決定した最適な領域の結果に基づいて、各文字の位置情報を含まないテキスト文章を作成する文章作成手段とを備える。
【００１９】
本発明のさらに他の局面に従うと、文書変換装置は、文字の位置情報を含むドキュメントデータを文字の位置情報を含まないドキュメントデータへ変換する文書変換装置であって、文字の位置情報を含むドキュメントデータより、各文字の位置情報を含む文字情報を抽出する文字情報抽出手段と、抽出された文字の位置情報に基づいて、ドキュメントデータを複数の領域に分割する分割候補を作成する領域分割候補作成手段と、作成された領域分割候補ごとに、抽出された各文字の位置情報に基づいて行を抽出し、抽出結果となる文字の連結が、作成された全ての領域分割候補において１つでも異なる文字を中心として、文字と文字の前後の所定数の文字とからなる部分文字列を作成する文字列作成手段と、作成された部分文字列の遷移確率値を計算する文字列確率計算手段と、計算結果に基づいて、作成された領域分割候補より最適な領域を決定する領域決定手段と、決定した最適な領域の結果に基づいて、各文字の位置情報を含まないテキスト文章を作成する文章作成手段とを備える。
【００２０】
また、上述の文字情報抽出手段は、文字の位置情報を含むドキュメントデータに基づいて変換された共通する形式のデータを解釈することと、文字の位置情報を含むドキュメントデータを直接解釈することとの、少なくとも一方の方法によって文字情報を抽出することが望ましい。
【００２１】
【発明の実施の形態】
以下に、図面を参照しつつ、本発明の実施の形態について説明する。以下の説明では、同一の部品および構成要素には同一の符号を付してある。それらの名称および機能も同じである。したがってそれらについての詳細な説明は繰返さない。
【００２２】
［第１の実施の形態］
図１は、本実施の形態におけるドキュメント処理装置であるパーソナルコンピュータ（以下、ＰＣと言う）１の構成の具体例を示すブロック図である。
【００２３】
図１を参照して、本実施の形態におけるＰＣ１は、ＣＰＵ（Ｃｅｎｔｒａｌ　Ｐｒｏｃｅｓｓｉｎｇ　Ｕｎｉｔ）等から構成される制御部１０１によって全体的に制御され、各種処理を行なう。制御部１０１で実行されるプログラムは記憶部１０２に記憶される。また、記憶部１０２は、制御部１０１でプログラムを実行する際のバッファや、一時的な作業領域にもなる。入力部１０３は、マウスやキーボード等から構成され、ユーザからの各種指示等を受付ける。また、読取装置から構成されている場合には、他の装置からの情報やフレキシブルディスク等に記憶されているデータ等の受付も行なう。出力部１０４は、図示しないディスプレイにドキュメント等の情報を表示するために出力してもよい。また、プリンタＩ／Ｆ（インタフェース）として、図示されないプリンタにドキュメント等の情報を出力してもよい。
【００２４】
なお、図１に示されているＰＣ１の構成は、一般的なパーソナルコンピュータの構成であって、ＰＣ１の構成は図１に示される構成に限定されない。また、本実施の形態においては、ドキュメント処理装置が図１に構成の具体例を示すパーソナルコンピュータであるものとして説明を行なうが、その他、携帯電話やＰＤＡ（Ｐｅｒｓｏｎａｌ　Ｄｉｇｉｔａｌ　Ａｓｓｉｓｔａｎｔｓ）等の携帯端末であってもよい。
【００２５】
本実施の形態においては、上述のＰＣ１において、静的ドキュメント形式のデータを動的ドキュメント形式のデータに変換する変換処理を行なう。すなわち、文字や画像等の各オブジェクトの位置情報が含まれるドキュメント形式のデータ（以下、静的ドキュメントデータと言う）から、位置情報の含まれないドキュメント形式のデータ（以下、動的ドキュメントデータと言う）に変換する変換処理を行なう。
【００２６】
図２は、上述の変換処理を行なうためのＰＣ１の構成を示すブロック図である。図２を参照して、入力部１０３は、静的ドキュメントデータ１０の入力を受付ける。制御部１０１は、入力部１０３が静的ドキュメントデータ１０の入力を受付けたことを検出することで、制御部１０１に含まれる文字情報抽出部１０１１、領域分割候補作成部１０１２、文字列作成部１０１３、文字列確率計算部１０１４、最適領域決定部１０１５、および正解文書作成部１０１６を、記憶部１０２に含まれる文字情報バッファ１０２１および文字列バッファ１０２２と連動させて制御し、静的ドキュメントデータ１０を動的ドキュメントデータ２０に変換する。そして、変換された動的ドキュメントデータ２０は、出力部１０４から出力される。
【００２７】
図３は、本実施の形態のＰＣ１で行なわれる変換処理について示すフローチャートである。図３に示される処理は、ＰＣ１の制御部１０１が、記憶部１０２に記憶されているプログラムを読出して実行することによって実現される。以下に、図３に示される変換処理について、図２のブロック図を参照しつつ説明する。
【００２８】
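First, the PC 1 accepts input of the static document data 10 at the input unit 103 (S10). As described above, the static document data 10 holds, in addition to position information, attribute information such as font name, underline, italic, and bold.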
【００２９】
次に、文字情報抽出部１０１１において、入力した静的ドキュメントデータ１０から、各文字の文字情報を抽出する（Ｓ１１）。文字情報抽出部１０１１での文字情報の抽出方法は、一般的なプリンタドライバにおいて実施されているような、ＧＤＩ等のカーネルモジュールが変換した共通する形式のデータを解釈することで抽出する方法でもよいし、直接静的ドキュメントデータ１０を解釈することによって文字情報を抽出する方法でもよいし、その他の方法であってもよい。ここでの文字情報の抽出方法は随意に決定できるものとする。そして、ステップＳ１１で抽出した文字情報を、文字情報バッファ１０２１に格納する。
【００３０】
次に、領域分割候補作成部１０１２において、ステップＳ１１で抽出した文字情報に基づいて領域分割を行ない、複数の領域分割候補を作成する（Ｓ１２）。ステップＳ１２における領域分割候補の作成処理については、後にサブルーチンを挙げて詳細な説明を行なう。
【００３１】
さらに、文字列作成部１０１３において、ステップＳ１２で作成した領域分割候補ごとに各領域内の行抽出を行ない、抽出された行を、横書き領域の場合は上から下に、縦書き領域の場合は右から左に行を探索し文字列を作成する（Ｓ１３）。ステップＳ１３における行抽出および文字列の作成処理については、後にサブルーチンを挙げて詳細な説明を行なう。
【００３２】
続いて、文字列確率計算部１０１４において、ステップＳ１３で作成した１つの領域分割候補における全文字列の遷移確率を計算する（Ｓ１４）。そして、最適領域決定部１０１５において、ステップＳ１４で計算した領域分割候補ごとの遷移確率に基づいて、最も全文字列の遷移確率が高い領域を最適な領域として領域決定する（Ｓ１５）。なお、ステップＳ１４で計算される全文字列の遷移確率については、後に詳述する。
【００３３】
次に、正解文書作成部１０１６において、ステップＳ１５で決定した最適領域に対して再び各領域内の行抽出を行ない、抽出された行を横書き領域の場合は上から下に、縦書き領域の場合は右から左に行を探索し、正解の文章を作成する（Ｓ１６）。
【００３４】
最後に、ステップＳ１６で作成した正解の文章を含む動的ドキュメントデータ２０のファイルを出力部１０４より出力する（Ｓ１７）。
【００３５】
以上で、ＰＣ１における変換処理が終了される。上述の処理を実行することによって、ＰＣ１に入力された静的ドキュメントデータ１０が動的ドキュメントデータ２０に変換されて出力される。
【００３６】
さらに、上述のステップＳ１１において静的ドキュメントデータ１０から抽出した文字情報を格納する文字情報バッファ１０２１について具体例を図４に挙げ、説明を行なう。
【００３７】
図４を参照して、文字情報バッファ１０２１には、ステップＳ１１で抽出された文字の順番に、１文字目から最終文字まで各文字の情報が順に格納される。なお、各オブジェクトの位置情報が含まれる静的ドキュメントデータ１０の場合、この図４に示される抽出順に文字を並べても正解の文章とはならない。
【００３８】
文字情報バッファ１０２１に格納された各文字の情報は、文字コード１１、開始のｘ座標（水平軸座標）１２、開始のｙ座標（垂直軸座標）１３、ｘサイズ（水平軸サイズ）１４、ｙサイズ（垂直軸サイズ）１５、フォント情報１６、属性情報１７および領域・行情報リスト１８から構成される。なお、各文字の情報の構成は、上述の構成に限定されるものではない。例えば、フォント情報１６は、フォント情報そのものであってもよいし、フォント情報を示すテーブルのアドレスであってもよい。どちらを採用するかは随意に決定できるものとする。なお、本実施の形態においては、文字ごとにフォント情報を格納する形式であるものとする。
【００３９】
属性情報１７は、各文字の属性に関する情報であって、アンダーライン、イタリック、ボールドなどの属性情報が該当する。
【００４０】
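The area/line information list 18 is allocated for each character, with one entry per area division candidate obtained by the area division in step S12. That is, when m area division candidates are created in step S12, the area/line information list 18 contains first area/line information 181 through m-th area/line information 185. Each area/line information entry consists of an area label 1811, a previous-character information address 1812, and a next-character information address 1813. In the area division candidate creation processing of step S12, data is stored in the area label of the corresponding entry. Specifically, the area label 1811 of the first area/line information 181 stores the area label of the character in the first area division candidate, and the area labels of the character in the subsequent candidates are stored, in order, in the area labels of the corresponding area/line information entries.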
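The per-character record of FIG. 4 can be sketched as a small data structure. This is a minimal illustration, not the layout used in the patent: the class and field names are hypothetical, list indices stand in for the buffer addresses 1812 and 1813, and `None` plays the role of NULL (0).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RegionLineInfo:
    """One entry of the area/line information list 18 (one per division candidate)."""
    region_label: int = -1             # area label 1811 (-1 means "not yet labelled")
    prev_index: Optional[int] = None   # previous-character address 1812 (None = NULL)
    next_index: Optional[int] = None   # next-character address 1813 (None = NULL)


@dataclass
class CharInfo:
    """One record of the character information buffer 1021."""
    code: str                          # character code 11
    x: float                           # start x coordinate 12
    y: float                           # start y coordinate 13
    w: float                           # x size 14
    h: float                           # y size 15
    font: str = ""                     # font information 16
    attrs: frozenset = frozenset()     # attribute information 17 (underline, italic, ...)
    region_lines: List[RegionLineInfo] = field(default_factory=list)  # list 18


def allocate_candidates(chars: List[CharInfo], m: int) -> None:
    """Give every character one area/line entry per division candidate (m in total)."""
    for c in chars:
        c.region_lines = [RegionLineInfo() for _ in range(m)]
```

With m candidates, `chars[i].region_lines[j]` then corresponds to the j-th area/line information of the i-th character.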
【００４１】
次に、上述のステップＳ１２において領域分割候補作成部１０１２で実行される領域分割候補の作成処理について、具体例を挙げて説明する。本実施の形態における領域分割候補の作成処理では、特許第３０１９２８７号において開示されている画像領域分割方法を改良した方法を採用する。
【００４２】
上述の特許第３０１９２８７号において開示されている画像領域分割方法は、主に文字認識装置の前処理として用いられる方法であり、画像より抽出した矩形座標に対して、近傍矩形のラベリングにより領域を抽出する方法である。より具体的には、例えば画像が文字「川」である場合、認識前画像より縦長の３つの矩形座標となっている領域を抽出する。次に、統合すべき距離パラメータδをある範囲で変化させて、これに対して閾値判定で最適な距離パラメータδを決定する。そして、領域を決定（分割）する方法である。すなわち、文字コードなどを使用せず物理的な矩形座標だけで領域を決定する方法であり、基本的にはしきい値処理である。したがって、図５に示すようなエラーが発生する場合もある。
【００４３】
図５においては、領域３１１〜３１３を含む入力画像３１に対する、結果表示画像３２を示す。すなわち、図５に示されるように、上述の画像領域分割方法では、３領域が正解である入力画像３１（例えば文字「川」である場合等）であっても、領域３１１と領域３１２との間の距離が狭いため、採用するしきい値によっては、結果表示画像３２が領域３２１および領域３２２の２領域からなる画像であると判定されてしまう場合もある。そこで、本実施の形態における領域分割候補の作成処理では、最適な距離パラメータδを決定する処理を行なわず、各距離パラメータδで得られたラベリング結果を各領域分割候補として抽出する。この、本実施の形態における領域分割候補の作成処理について説明する。
【００４４】
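FIG. 6 is a flowchart of the area division candidate creation processing executed in step S12. Referring to FIG. 6, character elements that are unsuitable for the determination are first excluded according to a predetermined rule (S121). Unsuitable elements are, for example, those too large to be regarded as an actual character or a component of one.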
【００４５】
次に、距離パラメータδに初期値を設定する（Ｓ１２２）。続いて、全ての２つの文字要素（Ｃ_ｉ、Ｃ_ｊとする。ただし、ｉ≠ｊ）を１回ずつ取出す（Ｓ１２３）。ここでは、黒画素の連結部分を文字要素としてもよいし、黒画素のランを文字要素としてもよい。
【００４６】
次に、ステップＳ１２３で取出した２つの文字要素が近接しているか否かを判定する（Ｓ１２４）。ここでは、Ｒδ（Ｃ_ｉ、Ｃ_ｊ）が成立するか否かを判定することで、２つの文字要素が近接しているか否かを判定する。
【００４７】
ここで、Ｒδ（Ｃ_ｉ、Ｃ_ｊ）とは、「２つの文字要素Ｃ_ｉとＣ_ｊとの距離が距離パラメータδ以下（または未満）」であることを意味する。２つの文字要素ｃとｄとについて、「ある文字要素からなる列｛ｘ_ｉ｝（ｉ＝０，１，．．．ｎ−１）が存在して、Ｒδ（ｃ，ｘ_０）、Ｒδ（ｘ_ｉ，ｘ_ｉ＋１）（０≦ｉ≦ｎ−２）、Ｒδ（ｘ_ｎ−１，ｄ）がすべて成立つ」という関係（以下これをＳδ（ｃ，ｄ）と記載する）は、数学的には同値関係と呼ばれる関係の一種である。同値関係については、それが定義されている集合全体が、互いにその関係が成立つものだけからなる幾つかのグループに分解されるという著しい性質がある。
【００４８】
したがって、ステップＳ１２３で行なうように、画像中の２つの文字要素の全ての組合わせについて、上記の同値関係が成立つかどうかを検査すれば、画像中の全ての文字要素は、そのグループに属する任意の文字要素ａ，ｂについてはＳδ（ａ，ｂ）となるようなグループに分解できる。なお、距離パラメータδがベクトル量（δ_ｘ，δ_ｙ）の場合は、δ_ｘ，δ_ｙを別々に変化させて、同様のことを行なう。この場合、上記の「２つの文字要素ａとｂとの距離が距離パラメータδ以下（または未満）」とは「２つの文字要素のｘ方向の距離がδ_ｘ以下（または未満）、ｙ方向の距離がδ_ｙ以下（または未満）」を意味するものとする（以下同様）。
【００４９】
そして、Ｒδ（Ｃ_ｉ、Ｃ_ｊ）が成立する場合には（Ｓ１２４でＹＥＳ）、その２つの文字要素の各々と１対１に対応する文字要素ラベルと、その２つ文字要素ラベルのいずれかと値が等しい文字要素ラベル全てに、共通な新しい値を代入する（Ｓ１２５）。そして、全てのｉ、ｊの組合わせを処理したか否かを判定する（Ｓ１２６）。Ｒδ（Ｃ_ｉ、Ｃ_ｊ）が成立しない場合には（Ｓ１２４でＮＯ）、ステップＳ１２５の処理を実行せずに、全てのｉ、ｊの組合わせを処理したか否かを判定する（Ｓ１２６）。
【００５０】
全てのｉ、ｊの組合わせを処理していない場合には（Ｓ１２６でＮＯ）、ステップＳ１２３に戻り、全てのｉ、ｊの組合わせを処理している場合には（Ｓ１２６でＹＥＳ）、分割結果を記憶する（Ｓ１２７）。
【００５１】
次に、全ての距離パラメータδと、距離パラメータδがとれる値の集合△とについて処理をしているか否かの判定を行なう（Ｓ１２８）。処理をしていれば（Ｓ１２８でＹＥＳ）、領域分割候補の作成処理を終了し、図３に示されるメインルーチンに処理を戻す。全ての距離パラメータδと、距離パラメータδがとれる値の集合△とについて処理をしていなければ（Ｓ１２８でＮＯ）、距離パラメータδを変化させて（Ｓ１２９）、ステップＳ１２３以降の処理を再び行なう。
【００５２】
以上で領域分割候補の作成処理を終了し、図３に示されるメインルーチンに処理を戻す。
【００５３】
なお、上述のステップＳ１２７において、分割結果は、図４に示される領域・行情報リスト１８に格納される。この場合、領域分割の結果としてｍ個の候補が作成されたならば（すなわち、ｍ個の距離パラメータδで領域分割が行なわれたならば）、第１領域・行情報の領域ラベル１８１１には、第１領域分割候補における該当文字の領域ラベルが格納され、以降、第２領域・行情報１８２〜第ｍ領域・行情報１８５の領域ラベルには、第２領域分割候補〜第ｍ領域分割候補の領域ラベルが順次格納される。
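The labelling loop of FIG. 6 (S122 to S129) amounts to computing the equivalence classes of the closure Sδ once per distance parameter and keeping every result as a candidate. The sketch below uses a union-find structure for the label merging of S125, which is a substitution for illustration; the rectangle format `(x, y, width, height)`, the gap-based distance, and all function names are assumptions.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def gap(a_start: float, a_size: float, b_start: float, b_size: float) -> float:
    """Gap between two 1-D intervals; 0 when they overlap."""
    return max(0.0, max(a_start, b_start) - min(a_start + a_size, b_start + b_size))


def is_near(a: Rect, b: Rect, dx: float, dy: float) -> bool:
    """R_delta: the x gap is at most dx and the y gap is at most dy."""
    return gap(a[0], a[2], b[0], b[2]) <= dx and gap(a[1], a[3], b[1], b[3]) <= dy


def label_regions(rects: Sequence[Rect], dx: float, dy: float) -> List[int]:
    """Group rectangles into the equivalence classes of S_delta (S123 to S127)."""
    parent = list(range(len(rects)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):       # every pair exactly once (S123)
            if is_near(rects[i], rects[j], dx, dy):  # S124
                parent[find(i)] = find(j)            # merge labels (S125)

    roots = {}
    # Renumber the roots as 0, 1, 2, ... in order of first appearance.
    return [roots.setdefault(find(i), len(roots)) for i in range(len(rects))]


def division_candidates(rects: Sequence[Rect],
                        deltas: Sequence[Tuple[float, float]]) -> List[List[int]]:
    """One labelling per distance parameter; all of them are kept (S128, S129)."""
    return [label_regions(rects, dx, dy) for dx, dy in deltas]
```

For three columns separated by gaps of 2 and 8, the parameters (1, 1), (5, 5), and (10, 10) yield three, two, and one region respectively, which mirrors the situation of FIG. 5 where a single threshold could pick the wrong grouping.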
【００５４】
次に、上述のステップＳ１３において文字列作成部１０１３で実行される行抽出処理および文字列の作成処理について、具体例を挙げて説明する。ステップＳ１３では、具体的には、全ての領域分割候補に対して、候補内の領域ごとに行方向判定処理を行ない、判定方向に沿って行抽出処理を行なう。
【００５５】
上述の行方向判定処理には、例えば、本願発明者と他の発明者らとの共同した発明であって、特許第３１２４８５４号にて開示されている文字列方向検出装置が行なう行方向判定方法を適用することができる。すなわち、予め、画像内の文字列がｘ方向かｙ方向かを判定する基準を定めておき、入力された領域長方形の位置座標に基づいて、ｘ方向かｙ方向かを判定する方法を適用することができる。例えば、各領域長方形について、Ｒｘｙ＝（ｙ方向の長さ）／（ｘ方向の長さ）を計算しておき、ＲＡ＝（Ｒｘｙが１．０　未満の領域長方形の数）／（画像中の全領域数）が０．５未満であるとき、その紙面においては文字列はｙ方向とみなし、そうでないときは文字列はｘ方向とすることができる。
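The direction rule quoted above (the ratios Rxy and RA) can be written out directly. This is a sketch of that one example rule only, not of the full method of Japanese Patent No. 3124854; the rectangle format `(x, y, width, height)` and the function name are assumptions, and widths are assumed to be positive.

```python
from typing import Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def line_direction(rects: Sequence[Rect]) -> str:
    """Return "x" for horizontal text lines or "y" for vertical ones."""
    if not rects:
        raise ValueError("at least one region rectangle is required")
    # Rxy = (length in y) / (length in x); count rectangles with Rxy < 1.0
    wide = sum(1 for (_, _, w, h) in rects if h / w < 1.0)
    ra = wide / len(rects)
    # RA < 0.5: text runs in the y direction; otherwise in the x direction
    return "y" if ra < 0.5 else "x"
```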
【００５６】
また、上述の行抽出処理には、例えば、本願発明者と他の発明者らとの共同した発明であって、特公平８−１６９１８号公報において開示されている行抽出方法を適用することができる。すなわち、各矩形のオーバーラップ関係により矩形を（横書きの場合は）左右に連結し行を抽出する行抽出方法を適用することができる。
【００５７】
なお、上述の行抽出処理で第１領域内にて連結された行情報として、各文字ごとに、図４に示される前文字情報アドレス１８１２に、前の文字が格納されている文字情報バッファ１０２１内の該当文字情報の先頭アドレスが格納される。また、次文字情報アドレス１８１３に、次の文字が格納されている文字情報バッファ１０２１内の該当文字情報の先頭アドレスが格納される。これは、第２領域以下、他の領域であっても同様である。なお、ここで前の文字とは、横書きの場合は行内で１つ左の文字、縦書きの場合は行内で１つ上の文字を指し、次の文字とは、横書きの場合は行内で１つ右の文字、縦書きの場合は行内で１つ下の文字を指す。また、文字が行頭にある場合、前文字情報アドレス１８１２には、前行の最終文字の先頭アドレスが格納され、文字が行末にある場合、次文字情報アドレス１８１３には次行の先頭文字の先頭アドレスが格納される。ただし、その領域内の第１行の行頭文字の前文字情報アドレス１８１２はＮＵＬＬ（０）となり、その領域内の最終行の行末文字の次文字情報アドレス１８１３もＮＵＬＬ（０）となる。
【００５８】
次に、図７に、上述の行抽出処理の結果に基づく連結関係の具体例を示す。図７は、横書き文書における行抽出処理の結果の具体例を示す図である。図７において、矩形は文字矩形を示し、右矢印（→）は次文字情報アドレスの参照先を示している。次文字は、行内では次の（右隣の）矩形で示される文字であり、行末では次行の先頭文字となる。また、左矢印（←）は前文字情報アドレスの参照先を示している。前文字は、行内では前の（左隣の）矩形示される文字であり、行頭では前行の最終文字となる。すなわち、領域内の先頭文字４１の場合、前文字情報アドレスにはＮＵＬＬ（０）が格納される。また、領域内の最終文字４２の場合、次文字情報アドレスにはＮＵＬＬ（０）が格納される。
【００５９】
次に、候補内の領域ごとにこのように抽出された行について文字を走査し、文章を作成する。図８は、上述のステップＳ１３において文字列作成部１０１３で実行される文字列作成処理を示すフローチャートである。図８を参照して、まず始めに、領域分割候補カウンタＪを０に初期化する（Ｓ１３１）。この領域分割候補カウンタＪはＣＰＵ内のレジスタを使用する。
【００６０】
次に、領域分割候補カウンタＪが領域分割候補数未満か否かを判定する（Ｓ１３２）。領域分割候補数未満の場合は（Ｓ１３２でＹＥＳ）、続いて、文字カウンタＩを０に初期化する（Ｓ１３３）。この文字カウンタＩもまたＣＰＵ内のレジスタを使用する。
【００６１】
次に、文字カウンタＩが文字情報数未満か否かを判定する（Ｓ１３４）。文字情報数未満の場合は（Ｓ１３４でＹＥＳ）、続いて、文字カウンタＩで示される文字情報の、領域分割候補カウンタＪで示される領域・行情報（すなわち、文字情報バッファ１０２１に格納されているＩ番目の文字情報のＪ番目の領域・行情報）の前文字情報アドレスがＮＵＬＬ（０）かどうかを判断する。このことにより、該当文字情報が先頭文字か否かの判断を行なう（Ｓ１３５）。
【００６２】
該当文字情報が先頭文字である場合は（Ｓ１３５でＹＥＳ）、文字情報ポインタにＳ１３５で発見した（すなわち、文字情報バッファのＩ番目の）文字情報のアドレスを入れる（Ｓ１３６）。この文字情報ポインタはＣＰＵ内のレジスタを使用する。そして、文字情報ポインタで参照される文字コードを文字列バッファ１０２２へ格納する（Ｓ１３７）。
【００６３】
次に、文字情報ポインタで参照される次文字情報アドレスがＮＵＬＬ（０）か否かを判断する。このことにより次の文字（すなわち、行抽出により連結された文字）があるか否かを判断する（Ｓ１３８）。
【００６４】
次の文字がある場合（すなわち、次文字情報アドレスがＮＵＬＬでない場合）は（Ｓ１３８でＹＥＳ）、その次文字情報アドレスを文字情報ポインタにセットし（Ｓ１３９）、ステップＳ１３７に戻る。
【００６５】
次の文字がない場合（すなわち、次文字情報アドレスがＮＵＬＬの場合）（Ｓ１３８でＮＯ）、文字カウンタＩをインクリメントし（Ｓ１４０）、ステップＳ１３４へ戻る。また、ステップＳ１３５において該当文字情報が先頭文字でない場合にも（Ｓ１３５でＮＯ）、同様に文字カウンタＩをインクリメントし（Ｓ１４０）、ステップＳ１３４へ戻る。すなわち、次の行に対してステップＳ１３４〜Ｓ１３９の処理を繰返す。
【００６６】
当該領域での全文字情報に対しての走査を終了した場合、すなわち、文字カウンタＩが文字情報数以上の場合は（Ｓ１３４でＮＯ）、領域分割候補カウンタＪをインクリメントし（Ｓ１４１）、ステップＳ１３２へ戻る。すなわち、次の領域に対してステップＳ１３２〜Ｓ１４０の処理を繰返す。
【００６７】
そして、全領域分割候補に対して処理を終了した場合、すなわち、領域分割候補カウンタＪが領域分割候補数以上になった場合は（Ｓ１３２でＮＯ）、本処理を終了して、図３に示されるメインルーチンに処理を戻す。
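The loop of FIG. 8 can be sketched as follows. The sketch assumes that the previous/next addresses of each candidate are available as lists of indices into the character list (`None` standing for NULL), that the links contain no cycles, and that the function and parameter names are hypothetical.

```python
from typing import List, Optional, Sequence


def build_strings(codes: Sequence[str],
                  prev_links: Sequence[Sequence[Optional[int]]],
                  next_links: Sequence[Sequence[Optional[int]]]) -> List[List[str]]:
    """For each division candidate J, build one string per head character."""
    buffers = []
    for j in range(len(prev_links)):             # candidate counter J (S131, S132, S141)
        strings = []
        for i in range(len(codes)):              # character counter I (S133, S134, S140)
            if prev_links[j][i] is not None:     # S135: not a head character
                continue
            out = []
            p: Optional[int] = i                 # S136: pointer to the head character
            while p is not None:                 # S137 to S139: follow the next links
                out.append(codes[p])
                p = next_links[j][p]
            strings.append("".join(out))
        buffers.append(strings)
    return buffers
```

The result has one list of strings per candidate, which matches the layout of the character string buffer 1022: the number of strings in each list equals the number of head characters found in S135.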
【００６８】
なお、図９に、文字列バッファ１０２２の具体例を示す。図９を参照して、文字列バッファ１０２２は領域分割候補数分作成された第１領域分割候補文字列バッファ５１〜第ｍ領域分割候補文字列バッファ５５を含む（図９では領域分割候補数がｍ個の場合を図示している）。
【００６９】
１つの領域分割候補文字列バッファ内は、さらに文字列数と、前記文字列数分の文字列バッファとを含む。すなわち、第１領域分割候補文字列バッファ５１には、第１領域分割候補に含まれる文字列数５１１と、第１文字列バッファ５１２〜第Ｌ文字列バッファ５１５とが含まれる（図９では第１領域分割候補に含まれる文字列数がＬ個の場合を図示している）。この文字列数Ｌは、図８のステップＳ１３５において先頭文字（すなわち、当該文字情報の前文字情報アドレスがＮＵＬＬ（０）である文字）と判断された個数に該当し、対応する文字列バッファも動的に確保される。
【００７０】
Next, the calculation of the transition probability of all character strings, executed in step S14 of FIG. 3, is described in detail.
【００７１】
文字列の遷移確率値Ｐとは、具体例を考えると、「本日は晴天なり」という文章の確率値を求めることである。すなわち、以下の式に示される通りである。
【００７２】
Ｐ（本日は晴天なり）＝０．７３
この遷移確率値Ｐは、文献「確率モデルによる音声認識」（中川聖一著、電子情報通信学会、コロナ社、初版昭和６３年）によれば、文字列をＣ＝（ｃ１，ｃ２，・・・，ｃｎ）、Ｆ（　）を括弧内の文字の組合わせとするとき、以下の式（１）で与えられる。
【００７３】
【数１】

【００７４】
式（１）より、任意長の文字列Ｃに関する遷移確率値Ｐ（Ｃ）は、２文字組（ｄｉｇｒａｍ）と３文字組（ｔｒｉｇｒａｍ）との出現頻度（または出現確率）テーブルを用意することで得ることができる。ただし、対象文字コードが複数バイトコードの場合（日本語などの場合）、文字のカテゴリ数が数千オーダーとなり、特にｔｒｉｇｒａｍを確保するためのメモリ量の確保が難しい。そのため、式（１）を直接採用することは適当ではない。したがって、本実施の形態においては、ｄｉｇｒａｍと１文字との出現頻度（ｕｎｉｇｒａｍ）を使用した近似式である次の式（２）にて文字列の遷移確率値Ｐを計算する。
【００７５】
【数２】
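The images for equations (1) and (2) are not reproduced in this text. As a reading aid only, and assuming the usual n-gram chain-rule forms with F( ) as the occurrence count defined above, they would look roughly as follows. The boundary terms for the first one or two characters and any normalization are assumptions and may differ from the original equations.

```latex
% Assumed trigram form, cf. equation (1): needs trigram and digram tables
P(C) \approx \prod_{i=3}^{n} \frac{F(c_{i-2}\,c_{i-1}\,c_i)}{F(c_{i-2}\,c_{i-1})}

% Assumed digram/unigram approximation, cf. equation (2)
P(C) \approx \prod_{i=2}^{n} \frac{F(c_{i-1}\,c_i)}{F(c_{i-1})}
```

The second form needs only a digram table and a unigram table, which is the memory saving the text describes for multi-byte character sets.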

【００７６】
図９に示される領域分割候補文字列バッファごとに、その中に存在する文字列数５１１で示される複数の文字列バッファ５１２〜５１５に対して、個々に式（２）で遷移確率値Ｐを計算し、それらの積により該当領域候補内の文字列全体の文字列遷移確率を計算する。具体的には、領域分割候補内の文字列数をＬ、各文字列をＣ１，Ｃ２，・・・ＣＬとすると、次の式（３）により領域分割候補文字列全体の文字列遷移確率を計算する。
【００７７】
【数３】
$$P = \prod_{l=1}^{L} P(C_l)$$

【００７８】
ステップＳ１４では、文字列確率計算部１０１４は、全ての領域分割候補に対して、式（３）により文字列遷移確率を計算し、ステップＳ１５で最適領域決定部１０１５がその文字列遷移確率が最大となる領域分割候補を最適な領域分割結果として採用し、ステップＳ１６で正解文書作成部１０１６が正解の文字列を作成する。なお、ステップＳ１６での正解の文字列を作成する方法は、図８に示される文字列作成処理とほぼ同様の方法である。すなわち、図８のステップＳ１３１，Ｓ１３２、およびＳ１４１をなくし、領域分割候補カウンタＪを最適と判断された領域分割候補番号とすれば、図８に示される処理と同じ処理で正解の文字列を作成することができる。
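Steps S14 and S15 can be sketched as scoring each candidate's strings with a digram/unigram model and taking the maximum. The code assumes the digram approximation P(C) ≈ Π F(c_{i-1} c_i) / F(c_{i-1}), sums logarithms instead of multiplying probabilities to avoid underflow, and assigns unseen pairs a small floor probability; the floor, the tiny training corpus, and all names are illustrative assumptions rather than anything stated in the patent.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def train_counts(corpus: Sequence[str]) -> Tuple[Counter, Counter]:
    """Build unigram and digram occurrence tables from sample text."""
    uni, bi = Counter(), Counter()
    for s in corpus:
        uni.update(s)
        bi.update(s[i:i + 2] for i in range(len(s) - 1))
    return uni, bi


def log_transition_prob(s: str, uni: Counter, bi: Counter, floor: float = 1e-6) -> float:
    """log of prod F(c_{i-1} c_i) / F(c_{i-1}); unseen pairs get the floor value."""
    total = 0.0
    for a, b in zip(s, s[1:]):
        p = bi[a + b] / uni[a] if uni[a] and bi[a + b] else floor
        total += math.log(p)
    return total


def best_candidate(candidates: Sequence[Sequence[str]], uni: Counter, bi: Counter) -> int:
    """Index of the candidate whose strings have the highest joint probability."""
    # A sum of logs corresponds to the product over all strings of a candidate.
    scores = [sum(log_transition_prob(s, uni, bi) for s in strings)
              for strings in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

A candidate that splices characters from different areas into one string produces many unseen pairs, so its score drops sharply, which is the effect the method relies on.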
【００７９】
本実施の形態におけるドキュメント処理装置であるＰＣ１で以上の処理が実行されることで、静的ドキュメント形式から動的ドキュメント形式への、高精度な変換を実現することができる。
【００８０】
［第２の実施の形態］
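FIG. 10 shows a specific example of area division candidates, and of the line extraction for each area division, for a document similar to the input image 31 shown in FIG. 5 in the first embodiment.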
【００８１】
図１０を参照して、領域分割候補６１〜６３は、第１の実施の形態において領域分割に用いる距離パラメータδを変化させた場合に得られる領域分割候補を示す。さらに、分割領域６１１〜６１３は領域分割候補６１における分割領域を示し、分割領域６２１，６２２は領域分割候補６２における分割領域を示し、分割領域６３１は領域分割候補６３における分割領域を示す。また、各分割領域内の太線の横長矩形は行を、その中の細線の矩形は文字を示している。なお、図５に示されるように、領域分割候補６１〜６３は、横書きの３カラムのドキュメントに対する領域分割候補であるため、領域分割は、分割領域６１１〜６１３の３領域からなる領域分割候補６１が正解となる。
【００８２】
第１の実施の形態においては、各領域分割候補内の各領域に対して行抽出を行ない、文字を連結して文字列を作成し、文字列遷移確率により最適な分割候補を決定した後に、再度、その最適な分割候補（図１０では、領域分割候補６１）に対して文字列を抽出し、正解の文章を作成した。このような、第１の実施の形態における処理が有効に動作する理由は、不正解の領域分割候補より抽出した文字列では別領域の文字が連結されるため、文字列としての遷移確率が低下するからである。例えば、図１０においては、分割領域６２１や分割領域６３１に示されるように別領域の文字が連結され、その結果、別領域間をまたがる部分の文字列が言語的に意味不明な文字列となる。そのため、文字列としての遷移確率が低下する。
【００８３】
そこで、第２の実施の形態においてはこのことをさらに利用し、第１の実施の形態におけるドキュメント処理装置と同様のＰＣ１において、より高精度の変換を行なう場合について説明する。
【００８４】
具体的に、図１０に示される全ての領域分割候補６１〜６３での行抽出結果において、他の文字連結と異なる部分の文字連鎖だけを評価した方がより有効に正解の文章が作成できることは、上述の式（２）よりも明らかである。すなわち、図１１に、図１０に示される領域分割候補６１〜６３を比較した場合の、文字連結と異なる部分の文字を暗転して示す。図１１に示されるように、領域間の境界となる文字が連結の異なる文字である。そのため、領域候補ごとにその文字を含めた前後数文字で構成される文字列を作成し、その文字列の遷移確率値Ｐを式（２）で求め、それらを式（３）で評価すればよいことになる。
【００８５】
第２の実施の形態においてＰＣ１で行なわれる変換処理は、第１の実施の形態において図３に示される変換処理とほぼ同様であるため、ここでの、同様の部分についての説明は繰返さない。第２の実施の形態においては、図３のステップＳ１３での文字列作成部１０１３で実行される文字列作成処理が、第１の実施の形態における処理とは異なる。そこで、第２の実施の形態における文字列作成処理について、図１２にフローチャートを示す。なお、図１２にフローチャートが示される文字列作成処理は、図８にフローチャートが示される第１の実施の形態における文字列作成処理とほぼ同様である。すなわち、図１２を参照して、ステップＳ２３１〜Ｓ２３４における処理は、図８のステップＳ１３１〜Ｓ１３４における処理と同様である。そこで、以降においては、ステップＳ２３５の処理以降について説明する。
【００８６】
図１２を参照して、文字カウンタＩで示される文字の連結が、全ての領域候補で同じかどうかを判断することにより、連結が異なる文字を判定する（Ｓ２３５）。具体的には、該当文字情報の領域・行リストの全ての領域・行情報の前文字情報アドレスと次文字情報アドレスとが全て同じ場合でない場合（１つでも異なるアドレスが付与されている場合）に、該当文字を連結が異なる文字と判断する。
【００８７】
１つでも連結が異なる場合は（Ｓ２３５でＹＥＳ）、その文字から所定の文字数であるＮ文字まで前の全文字コードを抽出する（Ｓ２３６）。これは、文字情報の前文字情報アドレスを走査することで可能である。同様に、文字情報の次文字情報アドレスを走査することで、その文字からＮ文字まで次の全文字コードも抽出する（Ｓ２３７）。
【００８８】
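Next, the character codes extracted in steps S236 and S237 are arranged in their connection order to create a character string, which is stored in the character string buffer 1022 (S238). Then, as in step S140 of FIG. 8 and the steps after it, the character counter I is incremented (S239) and processing returns to step S234. Likewise, when the connection of the character indicated by the character counter I is the same in all area candidates in step S235 (NO in S235), the character counter I is incremented (S239) and processing returns to step S234. That is, the processing of steps S234 to S238 is repeated for the next line.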
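The partial-string extraction of FIG. 12 (S235 to S238) can be sketched as below. As in the first embodiment, the previous/next addresses are assumed to be given per candidate as lists of indices with `None` for NULL; the function names and the context length parameter `n_ctx` (the N of the text) are hypothetical.

```python
from typing import List, Optional, Sequence

Links = Sequence[Sequence[Optional[int]]]  # links[j][i]: link of character i in candidate j


def differing_chars(prev_links: Links, next_links: Links) -> List[int]:
    """Characters whose (previous, next) pair is not identical in every candidate (S235)."""
    n = len(prev_links[0])
    return [i for i in range(n)
            if len({(p[i], q[i]) for p, q in zip(prev_links, next_links)}) > 1]


def context_string(codes: Sequence[str], prev_link: Sequence[Optional[int]],
                   next_link: Sequence[Optional[int]], i: int, n_ctx: int) -> str:
    """Up to n_ctx characters before i, character i, and up to n_ctx after it."""
    before: List[str] = []
    p = prev_link[i]
    while p is not None and len(before) < n_ctx:   # S236: walk the previous links
        before.append(codes[p])
        p = prev_link[p]
    after: List[str] = []
    q = next_link[i]
    while q is not None and len(after) < n_ctx:    # S237: walk the next links
        after.append(codes[q])
        q = next_link[q]
    return "".join(reversed(before)) + codes[i] + "".join(after)  # S238


def partial_strings(codes: Sequence[str], prev_links: Links, next_links: Links,
                    n_ctx: int = 2) -> List[List[str]]:
    """Per candidate, the context strings around every differing character."""
    idx = differing_chars(prev_links, next_links)
    return [[context_string(codes, p, q, i, n_ctx) for i in idx]
            for p, q in zip(prev_links, next_links)]
```

Only the strings around area boundaries are produced, so the later probability comparison is not diluted by the long runs of text that every candidate links identically.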
【００８９】
このように、本実施の形態において、上述の、連結が異なる文字の近傍の文字で構成される文字列だけを評価する文字列作成処理を行なうことで、第１の実施の形態における変換処理よりもさらに高精度の領域判定が実現可能となる。
【００９０】
さらに、上述のドキュメント変換装置が行なう変換方法を、プログラムとして提供することもできる。このようなプログラムは、コンピュータに付属するフレキシブルディスク、ＣＤ−ＲＯＭ、ＲＯＭ、ＲＡＭおよびメモリカードなどのコンピュータ読取り可能な記録媒体にて記録させて、プログラム製品として提供することもできる。あるいは、コンピュータに内蔵するハードディスクなどの記録媒体にて記録させて、プログラムを提供することもできる。また、ネットワークを介したダウンロードによって、プログラムを提供することもできる。
【００９１】
提供されるプログラム製品は、ハードディスクなどのプログラム格納部にインストールされて実行される。なお、プログラム製品は、プログラム自体と、プログラムが記録された記録媒体とを含む。
【００９２】
今回開示された実施の形態はすべての点で例示であって制限的なものではないと考えられるべきである。本発明の範囲は上記した説明ではなくて特許請求の範囲によって示され、特許請求の範囲と均等の意味および範囲内でのすべての変更が含まれることが意図される。
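[Brief description of the drawings]
[FIG. 1] A block diagram showing a specific example of the configuration of the PC 1, which is the document processing device in the present embodiment.
[FIG. 2] A block diagram showing the configuration of the PC 1 for performing the conversion processing.
[FIG. 3] A flowchart showing the conversion processing performed by the PC 1 of the present embodiment.
[FIG. 4] A diagram showing a specific example of the character information buffer 1021.
[FIG. 5] A diagram showing the result display image 32 for the input image 31 including areas 311 to 313.
[FIG. 6] A flowchart showing the area division candidate creation processing executed in step S12.
[FIG. 7] A diagram showing a specific example of the connection relationships.
[FIG. 8] A flowchart showing the character string creation processing executed in step S13.
[FIG. 9] A diagram showing a specific example of the character string buffer 1022.
[FIG. 10] A diagram showing a specific example of area division candidates, and of the line extraction for each area division, for a document similar to the input image 31 shown in FIG. 5 in the first embodiment.
[FIG. 11] A diagram in which the characters whose connections differ when the area division candidates 61 to 63 shown in FIG. 10 are compared are shown inverted.
[FIG. 12] A flowchart showing the character string creation processing in the second embodiment.
[Explanation of reference numerals]
1 PC; 10 static document data; 11 character code; 12 start x coordinate; 13 start y coordinate; 14 x size; 15 y size; 16 font; 17 attribute information; 18 area/line information list; 20 dynamic document data; 31 input image; 32 result display image; 41 first character; 42 last character; 51 to 55 area division candidate character string buffers; 61 to 63 area division candidates; 101 control unit; 102 storage unit; 103 input unit; 104 output unit; 181 to 185 area/line information; 311 to 313, 321, 322 areas; 511 number of character strings; 512 to 515 character string buffers; 611 to 613, 621, 622, 631 divided areas; 1011 character information extraction unit; 1012 area division candidate creation unit; 1013 character string creation unit; 1014 character string probability calculation unit; 1015 optimal area determination unit; 1016 correct document creation unit; 1021 character information buffer; 1022 character string buffer; 1811 area label; 1812 previous-character information address; 1813 next-character information address.
[0001]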
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a document conversion method and a document conversion device, and more particularly, to a document conversion method and a document conversion device capable of realizing high-precision conversion from static document format data to dynamic document format data.
[0002]
[Prior art]
With the recent development of information-related and document-related technology, electronic documents of many types have proliferated. These electronic document files can be roughly classified into those that hold position information for each object, such as characters and images, and those that do not.
[0003]
Representative examples of the former include a document file of DTP (Desktop Publishing) software and a PDF (Portable Document Format) file. Representative examples of the latter include a plain text file (a file containing only text) and an HTML (Hypertext Markup Language) file. In the following description, the former is called a static document format, and the latter is called a dynamic document format.
[0004]
In a dynamic document format, which has no position information, a browser (viewing software) dedicated to the file type arranges and displays the characters and images in the document according to certain rules. For plain text, for example, the rule is to place characters in order and break lines at line feed codes; for HTML, the rules are the layout rules defined by each tag.
[0005]
Meanwhile, the personal computer has so far been the usual platform for browsing electronic documents, but a shift to portable terminals such as PDAs (Personal Digital Assistants) and PDC (Personal Digital Cellular) phones is expected. In fact, Web browsing on mobile terminals is already available.
[0006]
However, mobile terminals have a lower screen resolution than personal computers and are therefore poorly suited to displaying static document formats. Now and in the future, the format mainly used on mobile terminals is expected to be the dynamic document format.
[0007]
For this reason, in the future, with the shift of the digitized document browsing platform to mobile terminals, a conversion technique for converting information in a static document format into a dynamic document format with high efficiency will be important.
[0008]
Conventionally, as a conversion technique of such an electronic document, a technique such as an electronic filing system disclosed in Japanese Patent Application Laid-Open No. 8-147446 has been proposed. The above-described electronic filing system is a system that converts an electronic document into data of the electronic filing system based on information transmitted from an application to a printer driver. That is, the system performs data conversion by performing an operation of “printing” a corresponding document on a printable application using a conversion printer driver.
[0009]
For example, in Windows (R), one of the representative operating systems, the data that an application passes to the printer driver when printing is converted into a unified format by GDI (Graphics Device Interface), a kernel module. The electronic filing system described above therefore provides a printer driver that interprets the common GDI commands and performs the data conversion, which absorbs the differences among many electronic document formats and realizes many-to-one data conversion.
[0010]
[Problems to be solved by the invention]
However, with conventional technology such as the electronic filing system disclosed in the above-mentioned Japanese Patent Application Laid-Open No. 8-147446, the character codes in the data passed to the printer driver do not form a continuous run of text in certain applications, DTP software in particular. As a result, it is difficult to convert static document format data into a dynamic document format. The reason is that DTP software lays out text in ways that exceed the expressive capability of GDI code, so the text is drawn one character at a time.
[0011]
To address this problem, the conventional technology determines (cuts out) character strings, words, lines, and the like using a threshold set at a fixed ratio, such as 1/2 of the character height, but such threshold processing degrades performance.
[0012]
Moreover, even a conversion technology that, unlike the conventional technology, uses no printer driver and instead has means for directly interpreting the format of the target document faces the problem that, in static document formats such as DTP files and PDF, the extracted characters are not arranged in reading order, so conversion to a dynamic document format is likewise difficult. The extracted characters are not in reading order because DTP files, PDF files, and the like hold the position at which each item is to be displayed or printed; the characters therefore need not be stored in text order and are saved in an order specific to the application that created the document.
[0013]
Therefore, when converting data in a static document format to a dynamic document format by a printer driver or direct format interpretation, it is necessary to create a correct sentence based on the obtained individual character position information. However, if the text creation is performed by threshold processing such as character size, there is a problem that accuracy is reduced.
[0014]
The present invention has been made in view of these problems, and its object is to provide a document conversion method and a document conversion device capable of realizing high-precision conversion from static document format data to dynamic document format data.
[0015]
[Means for Solving the Problems]
In order to achieve the above object, according to one aspect of the present invention, a document conversion method converts document data that includes character position information into document data that does not include character position information, and comprises: a character information extraction step of extracting, from the document data including character position information, character information including the position information of each character; an area division candidate creation step of creating, based on the extracted character position information, division candidates that divide the document data into a plurality of areas; a character string creation step of, for each created area division candidate, extracting lines based on the position information of each extracted character and searching the extracted lines to create character strings; a character string probability calculation step of calculating transition probability values of the created character strings; an area determination step of determining, based on the calculation result, the optimal area from among the created area division candidates; and a text creation step of creating, based on the determined optimal area, text that does not include the position information of each character.
[0016]
According to another aspect of the present invention, a document conversion method converts document data that includes character position information into document data that does not include character position information, and comprises: a character information extraction step of extracting, from the document data including character position information, character information including the position information of each character; an area division candidate creation step of creating, based on the extracted character position information, division candidates that divide the document data into a plurality of areas; a character string creation step of, for each created area division candidate, extracting lines based on the position information of each extracted character and creating partial character strings, each consisting of a character whose connection in the extraction result differs in at least one of the created area division candidates together with a predetermined number of characters before and after that character; a character string probability calculation step of calculating transition probability values of the created partial character strings; an area determination step of determining, based on the calculation result, the optimal area from among the created area division candidates; and a text creation step of creating, based on the determined optimal area, text that does not include the position information of each character.
[0017]
In addition, the character information extracting step includes interpreting data in a common format converted based on the document data including the character position information, and directly interpreting the document data including the character position information. It is desirable to extract character information by at least one of the methods.
[0018]
According to another aspect of the present invention, a document conversion device converts document data that includes character position information into document data that does not include character position information, and comprises: character information extraction means for extracting, from the document data including character position information, character information including the position information of each character; area division candidate creation means for creating, based on the extracted character position information, division candidates that divide the document data into a plurality of areas; character string creation means for, for each created area division candidate, extracting lines based on the position information of each extracted character and searching the extracted lines to create character strings; character string probability calculation means for calculating transition probability values of the created character strings; area determination means for determining, based on the calculation result, the optimal area from among the created area division candidates; and text creation means for creating, based on the determined optimal area, text that does not include the position information of each character.
[0019]
According to still another aspect of the present invention, a document conversion device converts document data that includes character position information into document data that does not include character position information, and comprises: character information extraction means for extracting, from the document data including character position information, character information including the position information of each character; area division candidate creation means for creating, based on the extracted character position information, division candidates that divide the document data into a plurality of areas; character string creation means for, for each created area division candidate, extracting lines based on the position information of each extracted character and creating partial character strings, each consisting of a character whose connection in the extraction result differs in at least one of the created area division candidates together with a predetermined number of characters before and after that character; character string probability calculation means for calculating transition probability values of the created partial character strings; area determination means for determining, based on the calculation result, the optimal area from among the created area division candidates; and text creation means for creating, based on the determined optimal area, text that does not include the position information of each character.
[0020]
Further, the character information extraction means described above desirably extracts the character information by at least one of two methods: interpreting data in a common format converted from the document data including character position information, or directly interpreting the document data including character position information.
[0021]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the following description, the same parts and components are denoted by the same reference numerals. Their names and functions are the same. Therefore, detailed description thereof will not be repeated.
[0022]
[First Embodiment]
FIG. 1 is a block diagram illustrating a specific example of a configuration of a personal computer (hereinafter, referred to as a PC) 1 which is a document processing apparatus according to the present embodiment.
[0023]
Referring to FIG. 1, the whole of PC 1 in the present embodiment is controlled by a control unit 101, which includes a CPU (Central Processing Unit) and performs various processes. The program executed by the control unit 101 is stored in the storage unit 102. The storage unit 102 also serves as a buffer and a temporary work area when the control unit 101 executes a program. The input unit 103 includes a mouse, a keyboard, and the like, and receives various instructions from the user. When the input unit 103 includes a reading device, it also receives information from other devices and data stored on a flexible disk or the like. The output unit 104 may output information such as a document to a display (not shown). It may also serve as a printer I/F (interface) and output information such as a document to a printer (not shown).
[0024]
The configuration of the PC 1 shown in FIG. 1 is that of a general personal computer, and the PC 1 is not limited to that configuration. In this embodiment, the document processing apparatus is described, as an example, as a personal computer having the configuration shown in FIG. 1. However, the document processing apparatus may instead be a mobile terminal such as a mobile phone or a PDA (Personal Digital Assistants).
[0025]
In the present embodiment, the above-described PC 1 performs a conversion process of converting data in a static document format into data in a dynamic document format. That is, it converts document format data that includes position information of each object such as characters and images (hereinafter referred to as static document data) into document format data that does not include position information (hereinafter referred to as dynamic document data).
[0026]
FIG. 2 is a block diagram showing a configuration of the PC 1 for performing the above-described conversion processing. Referring to FIG. 2, input unit 103 receives an input of static document data 10. On detecting that the input unit 103 has received the static document data 10, the control unit 101 controls the character information extraction unit 1011, the area division candidate creation unit 1012, the character string creation unit 1013, the character string probability calculating unit 1014, the optimal area determining unit 1015, and the correct answer document creating unit 1016 included in the control unit 101, in conjunction with the character information buffer 1021 and the character string buffer 1022 included in the storage unit 102, to convert the static document data 10 into dynamic document data 20. The converted dynamic document data 20 is then output from the output unit 104.
[0027]
FIG. 3 is a flowchart showing a conversion process performed by PC 1 of the present embodiment. The process illustrated in FIG. 3 is realized by the control unit 101 of the PC 1 reading and executing a program stored in the storage unit 102. Hereinafter, the conversion processing shown in FIG. 3 will be described with reference to the block diagram of FIG.
[0028]
First, the PC 1 receives an input of the static document data 10 through the input unit 103 (S10). As described above, the static document data 10 is data in which font information and attribute information such as underline, italic, and bold are stored in addition to the position information.
[0029]
Next, the character information extraction unit 1011 extracts character information of each character from the input static document data 10 (S11). The character information extraction unit 1011 may extract the character information by interpreting data in a common format converted by a kernel module such as GDI, as is done in a general printer driver. Alternatively, it may extract the character information by directly interpreting the static document data 10, or it may use another method. The method of extracting the character information can be chosen arbitrarily. The character information extracted in step S11 is stored in the character information buffer 1021.
[0030]
Next, the area division candidate creating unit 1012 performs area division based on the character information extracted in step S11 to create a plurality of area division candidates (S12). The process of creating the region division candidate in step S12 will be described later in detail with reference to a subroutine.
[0031]
Further, the character string creation unit 1013 extracts the lines in each area for each of the area division candidates created in step S12, and scans the extracted lines from top to bottom in the case of a horizontal writing area, or from right to left in the case of a vertical writing area, to create character strings (S13). The line extraction and character string creation processing in step S13 will be described later in detail with reference to a subroutine.
[0032]
Subsequently, the character string probability calculation unit 1014 calculates the transition probabilities of all character strings in one area division candidate created in step S13 (S14). Then, based on the transition probabilities for each of the region division candidates calculated in step S14, the optimal region determining unit 1015 determines the region having the highest transition probability of all character strings as the optimal region (S15). The transition probabilities of all character strings calculated in step S14 will be described later in detail.
[0033]
Next, the correct answer document creation unit 1016 extracts the lines in each region again for the optimum region determined in step S15, scans the extracted lines from top to bottom in the case of a horizontal writing region, or from right to left in the case of a vertical writing region, and creates a correct sentence (S16).
[0034]
Finally, the file of the dynamic document data 20 including the correct sentence created in step S16 is output from the output unit 104 (S17).
[0035]
Thus, the conversion process in PC1 is completed. By executing the above-described processing, the static document data 10 input to the PC 1 is converted into dynamic document data 20 and output.
[0036]
Further, a specific example of the character information buffer 1021 for storing the character information extracted from the static document data 10 in step S11 described above will be described with reference to FIG.
[0037]
Referring to FIG. 4, character information buffer 1021 stores information on each character from the first character to the last character in the order in which the characters were extracted in step S11. Because the static document data 10 holds position information for each object, arranging the characters in the extraction order shown in FIG. 4 does not necessarily yield the correct text.
[0038]
The information of each character stored in the character information buffer 1021 comprises a character code 11, a starting x coordinate (horizontal axis coordinate) 12, a starting y coordinate (vertical axis coordinate) 13, an x size (horizontal axis size) 14, a y size (vertical axis size) 15, font information 16, attribute information 17, and an area / line information list 18. The configuration of the information of each character is not limited to the above. For example, the font information 16 may be the font information itself or the address of a table holding the font information; which to adopt can be determined arbitrarily. In this embodiment, font information is stored for each character.
[0039]
The attribute information 17 is information on the attribute of each character, and corresponds to attribute information such as underline, italic, and bold.
[0040]
The area / line information list 18 is reserved for each character, with one entry for each area division candidate obtained as a result of the area division in step S12. That is, when m area division candidates are created as a result of the area division in step S12, the area / line information list 18 includes first area / line information 181 to m-th area / line information 185. Each area / line information entry includes an area label 1811, a preceding character information address 1812, and a succeeding character information address 1813. In the process of creating area division candidates in step S12, data is stored in the area label of the corresponding entry. Specifically, the area label of the character in the first area division candidate is stored in the area label 1811 of the first area / line information 181. Thereafter, the area label of the character in each subsequent area division candidate is stored, in order, in the area label of the corresponding area / line information.
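The layout of the character information buffer 1021 described above can be sketched as follows. This is only an illustrative model: the class and field names (`CharInfo`, `AreaLineInfo`, `allocate_area_line_info`) are hypothetical, list indices stand in for the buffer addresses, and `None` stands in for NULL (0).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AreaLineInfo:
    """One entry of the area / line information list 18 (one per division candidate)."""
    area_label: int = -1              # area label 1811
    prev_char: Optional[int] = None   # preceding character information address 1812
    next_char: Optional[int] = None   # succeeding character information address 1813


@dataclass
class CharInfo:
    """One record of the character information buffer 1021."""
    code: str                         # character code 11
    x: float                          # starting x coordinate 12
    y: float                          # starting y coordinate 13
    w: float                          # x size 14
    h: float                          # y size 15
    font: str = ""                    # font information 16
    attrs: frozenset = frozenset()    # attribute information 17 (underline, italic, bold)
    area_line: List[AreaLineInfo] = field(default_factory=list)  # list 18


def allocate_area_line_info(buffer: List[CharInfo], m: int) -> None:
    """Reserve m area / line entries per character, one per division candidate."""
    for ch in buffer:
        ch.area_line = [AreaLineInfo() for _ in range(m)]
```

Using indices instead of raw addresses keeps the sketch portable; the trade-off is that index 0 is a valid character, so NULL has to be represented by `None` rather than 0.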
[0041]
Next, a process of creating a region division candidate executed by the region division candidate creation unit 1012 in step S12 described above will be described using a specific example. In the process of creating a region division candidate according to the present embodiment, an improved method of the image region division method disclosed in Japanese Patent No. 3019287 is adopted.
[0042]
The image region dividing method disclosed in the above-mentioned Japanese Patent No. 3019287 is mainly used as preprocessing for a character recognition device; it extracts areas by labeling neighboring rectangles among the rectangular coordinates extracted from an image. More specifically, when the image is, for example, the character "river" (川), three vertically long rectangles are extracted from the image before recognition. Next, the distance parameter δ used for merging is varied within a certain range, the optimum distance parameter δ is determined by threshold judgment, and the areas are determined (divided) accordingly. That is, this method determines areas only from physical rectangular coordinates without using character codes or the like, and is basically threshold processing. Therefore, an error as shown in FIG. 5 may occur.
[0043]
FIG. 5 shows a result display image 32 for an input image 31 including regions 311 to 313. That is, as shown in FIG. 5, in the above-described image area dividing method, even for an input image 31 for which three areas is the correct answer (for example, the character "river"), the distance between the area 311 and the area 312 is small, so that, depending on the threshold value employed, the result display image 32 may be judged to consist of only two regions, the region 321 and the region 322. Therefore, in the process of creating region division candidates in the present embodiment, the process of determining the optimal distance parameter δ is not performed; instead, the labeling result obtained with each distance parameter δ is extracted as a region division candidate. The process of creating region division candidates according to the present embodiment is described next.
[0044]
FIG. 6 shows a flowchart of the process of creating a region division candidate executed in step S12 described above. Referring to FIG. 6, first, character elements that are inconvenient for the determination are excluded according to a predetermined rule (S121). A character element that is inconvenient for the determination is, for example, one so large that it cannot be regarded as a body-text character or a component of one.
[0045]
Next, an initial value is set for the distance parameter δ (S122). Subsequently, a pair of character elements ($C_i$, $C_j$), where $i \neq j$, is extracted from among all pairs of character elements, each pair being extracted once (S123). Here, a connected portion of black pixels may be used as a character element, or a run of black pixels may be used as a character element.
[0046]
Next, it is determined whether or not the two character elements extracted in step S123 are close to each other (S124). Here, whether the relation $R_\delta(C_i, C_j)$ holds is evaluated to determine whether the two character elements are close to each other.
[0047]
Here, $R_\delta(C_i, C_j)$ means that "the distance between the two character elements $C_i$ and $C_j$ is equal to or less than (or less than) the distance parameter δ". For two character elements $c$ and $d$, the relation "there exists a sequence of character elements $\{x_i\}$ ($i = 0, 1, \ldots, n-1$) such that $R_\delta(c, x_0)$, $R_\delta(x_i, x_{i+1})$ ($0 \le i \le n-2$), and $R_\delta(x_{n-1}, d)$ all hold" (hereinafter referred to as $S_\delta(c, d)$) is a kind of relation mathematically called an equivalence relation. An equivalence relation has the notable property that the entire set on which it is defined is decomposed into several groups, each consisting only of elements between which the relation holds.
[0048]
Therefore, if, as in step S123, it is checked for all combinations of two character elements in the image whether the above-described relation holds, all character elements in the image can be decomposed into groups within which $S_\delta(a, b)$ holds. Note that the distance parameter δ may be treated as a vector quantity $(\delta_x, \delta_y)$, with $\delta_x$ and $\delta_y$ varied separately, and the same processing performed. In this case, "the distance between the two character elements $a$ and $b$ is equal to or less than (or less than) the distance parameter δ" means that "the distance in the x direction between the two character elements is equal to or less than (or less than) $\delta_x$, and the distance in the y direction is equal to or less than (or less than) $\delta_y$" (the same applies hereinafter).
[0049]
When $R_\delta(C_i, C_j)$ holds (YES in S124), a common new value is substituted into the character element labels that correspond one-to-one to the two character elements and into all character element labels having the same value as either of those two labels (S125). Then, it is determined whether or not all combinations of i and j have been processed (S126). When $R_\delta(C_i, C_j)$ does not hold (NO in S124), the processing of step S125 is skipped, and it is determined whether or not all combinations of i and j have been processed (S126).
[0050]
If not all combinations of i and j have been processed (NO in S126), the process returns to step S123; if all combinations of i and j have been processed (YES in S126), the division result is stored (S127).
[0051]
Next, it is determined whether or not the processing has been performed for every value in the set of values that the distance parameter δ can take (S128). If so (YES in S128), the process of creating region division candidates ends, and the process returns to the main routine. If the processing has not yet been performed for every value that the distance parameter δ can take (NO in S128), the distance parameter δ is changed (S129), and the processing from step S123 is performed again.
[0052]
Thus, the process of creating the area division candidates is completed, and the process returns to the main routine shown in FIG.
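The candidate-creation loop of steps S122 to S129 can be sketched roughly as follows. Several points are assumptions made for illustration only: character elements are modelled as axis-aligned rectangles `(x, y, width, height)`, the distance between two elements is taken as the gap between rectangle edges along each axis (zero when they overlap), δ is the vector form (δx, δy), and the exclusion step S121 is omitted. The function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def gap(a0: float, a1: float, b0: float, b1: float) -> float:
    """Gap between intervals [a0, a1] and [b0, b1]; 0 if they overlap."""
    return max(0.0, max(a0, b0) - min(a1, b1))


def is_close(a: Rect, b: Rect, delta: Tuple[float, float]) -> bool:
    """R_delta(a, b): x gap <= delta_x and y gap <= delta_y (S124)."""
    gx = gap(a[0], a[0] + a[2], b[0], b[0] + b[2])
    gy = gap(a[1], a[1] + a[3], b[1], b[1] + b[3])
    return gx <= delta[0] and gy <= delta[1]


def label_regions(rects: Sequence[Rect], delta: Tuple[float, float]) -> List[int]:
    """Steps S123 to S126: visit every pair once and merge labels of close pairs."""
    labels = list(range(len(rects)))
    for i, j in combinations(range(len(rects)), 2):
        if is_close(rects[i], rects[j], delta) and labels[i] != labels[j]:
            old, new = labels[j], labels[i]
            # S125: every element carrying the old label receives the common value.
            labels = [new if lab == old else lab for lab in labels]
    return labels


def division_candidates(rects: Sequence[Rect],
                        deltas: Sequence[Tuple[float, float]]) -> List[List[int]]:
    """Steps S122 and S127 to S129: keep the labelling for every delta as a candidate."""
    return [label_regions(rects, d) for d in deltas]
```

For three elements separated by horizontal gaps of 2 and 8, calling `division_candidates` with δx values of 1, 3, and 10 gives three, two, and one region respectively, which mirrors the way a single threshold can merge regions that should stay separate. Relabelling the whole list on each merge is quadratic per merge; a union-find structure would be the usual optimisation, but the direct form follows the flowchart more closely.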
[0053]
In step S127, the division result is stored in the area / line information list 18. If m candidates are created as a result of the region division (that is, if the region division is performed with m distance parameters δ), the area label 1811 of the first area / line information 181 stores the area label of the corresponding character in the first area division candidate. Likewise, the area labels of the second area / line information 182 to the m-th area / line information 185 store, in order, the area labels of the corresponding character in the second to m-th area division candidates.
[0054]
Next, the line extraction processing and the character string creation processing performed by the character string creation unit 1013 in step S13 described above will be described using specific examples. In step S13, specifically, a line direction determination process is performed for each region of every region division candidate, and a line extraction process is performed along the determined direction.
[0055]
For the above-described line direction determination processing, the character string direction detecting device disclosed in Japanese Patent No. 3124854, a joint invention of the present inventor and other inventors, can be applied, for example. That is, a criterion for determining whether a character string in the image runs in the x direction or the y direction is set in advance, and the direction is determined from the position coordinates of the input region rectangles. For example, Rxy = (length in the y direction) / (length in the x direction) is calculated for each region rectangle, and RA = (the number of region rectangles where Rxy is less than 1.0) / (the total number of regions in the image) is obtained; when RA is less than 0.5, the character strings on the page are regarded as running in the y direction, and otherwise as running in the x direction.
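The Rxy / RA criterion given as an example above can be written as a small function. This is a sketch of that one example rule only, not of the device in the cited patent; the function name and the `(x, y, width, height)` rectangle layout are assumptions.

```python
from typing import Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def string_direction(region_rects: Sequence[Rect]) -> str:
    """Return "y" (vertical writing) if RA < 0.5, otherwise "x" (horizontal writing)."""
    if not region_rects:
        raise ValueError("at least one region rectangle is required")
    # Rxy = (length in y) / (length in x); for a positive width, Rxy < 1.0
    # is the same as height < width, which avoids dividing by zero.
    wide = sum(1 for (_, _, w, h) in region_rects if h < w)
    ra = wide / len(region_rects)
    return "y" if ra < 0.5 else "x"
```

With two wide rectangles and one tall one, RA is 2/3 and the result is `"x"`; with the proportions reversed, RA is 1/3 and the result is `"y"`.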
[0056]
Further, for the above-described line extraction processing, the line extraction method disclosed in Japanese Patent Publication No. 8-16918, a joint invention of the present inventor and other inventors, can be applied, for example. That is, a line extraction method can be applied in which rectangles are connected to the left and right (in the case of horizontal writing) and lines are extracted according to the overlapping relationship between the rectangles.
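The cited publication is not reproduced here, so the following is only a simplified stand-in for overlap-based line extraction in a horizontally written region: character rectangles whose vertical extents overlap are grouped into one line, each line is ordered left to right, and the lines are ordered top to bottom. The grouping rule and the name `extract_lines` are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def extract_lines(rects: Sequence[Rect]) -> List[List[int]]:
    """Group rectangle indices into horizontal lines by vertical overlap."""
    order = sorted(range(len(rects)), key=lambda i: rects[i][1])  # top to bottom
    lines: List[List[int]] = []
    spans: List[Tuple[float, float]] = []  # (top, bottom) of each line so far
    for i in order:
        top, bottom = rects[i][1], rects[i][1] + rects[i][3]
        for k, (lt, lb) in enumerate(spans):
            if top < lb and bottom > lt:          # vertical extents overlap
                lines[k].append(i)
                spans[k] = (min(lt, top), max(lb, bottom))
                break
        else:
            lines.append([i])                     # no overlap: start a new line
            spans.append((top, bottom))
    for line in lines:
        line.sort(key=lambda i: rects[i][0])      # left to right within a line
    lines.sort(key=lambda line: min(rects[i][1] for i in line))  # lines top to bottom
    return lines
```

For vertical writing the same idea would apply with the axes swapped and the lines ordered right to left.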
[0057]
As the line information connected within the first area by the above-described line extraction processing, the previous character information address 1812 shown in FIG. 4 stores, for each character, the start address of the character information in the character information buffer 1021 in which the previous character is stored. Likewise, the next character information address 1813 stores the start address of the character information in the character information buffer 1021 in which the next character is stored. The same applies to the second and subsequent areas. Here, the previous character is the character immediately to the left in the line in the case of horizontal writing, or immediately above in the line in the case of vertical writing; the next character is the character immediately to the right in the line in the case of horizontal writing, or immediately below in the line in the case of vertical writing. If the character is at the beginning of a line, the previous character information address 1812 stores the start address of the last character of the previous line. If the character is at the end of a line, the next character information address 1813 stores the start address of the first character of the next line. However, the previous character information address 1812 of the first character of the first line in the area is NULL (0), and the next character information address 1813 of the last character of the last line in the area is also NULL (0).
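The linking rule described above (line ends chain into the next line, with NULL at both ends of the area) can be sketched as follows. The sketch assumes that the lines of one area are already given in reading order as lists of character indices; indices replace addresses and `None` replaces NULL (0). The name `link_lines` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Links = Dict[int, Tuple[Optional[int], Optional[int]]]  # index -> (prev, next)


def link_lines(lines: List[List[int]]) -> Links:
    """Chain the characters of one area in reading order."""
    # Flattening makes the last character of a line precede the first of the next.
    order = [i for line in lines for i in line]
    links: Links = {}
    for pos, idx in enumerate(order):
        prev_idx = order[pos - 1] if pos > 0 else None               # NULL for first char
        next_idx = order[pos + 1] if pos + 1 < len(order) else None  # NULL for last char
        links[idx] = (prev_idx, next_idx)
    return links
```

For example, `link_lines([[1, 0], [2, 3]])` links character 0 (end of the first line) forward to character 2 (start of the second line), while character 1 has no predecessor and character 3 has no successor.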
[0058]
Next, FIG. 7 shows a specific example of the connection relationship based on the result of the above-described line extraction processing for a horizontally written document. In FIG. 7, a rectangle indicates a character rectangle, and a right arrow (→) indicates the reference destination of the next character information address. The next character is the character indicated by the next (right-adjacent) rectangle in the line or, at the end of a line, the first character of the next line. A left arrow (←) indicates the reference destination of the previous character information address. The previous character is the character indicated by the previous (left-adjacent) rectangle in the line or, at the beginning of a line, the last character of the previous line. For the first character 41 in the area, NULL (0) is stored in the previous character information address. For the last character 42 in the area, NULL (0) is stored in the next character information address.
[0059]
Next, a character is scanned for the line thus extracted for each region in the candidate to create a sentence. FIG. 8 is a flowchart showing the character string creation processing executed by the character string creation unit 1013 in step S13 described above. Referring to FIG. 8, first, a region division candidate counter J is initialized to 0 (S131). This area division candidate counter J uses a register in the CPU.
[0060]
Next, it is determined whether or not the area division candidate counter J is less than the number of area division candidates (S132). If the number is less than the number of area division candidates (YES in S132), the character counter I is subsequently initialized to 0 (S133). This character counter I also uses registers in the CPU.
[0061]
Next, it is determined whether or not the character counter I is less than the number of character information items (S134). If it is (YES in S134), it is then determined whether the previous character information address in the area / line information indicated by the area division candidate counter J, within the character information indicated by the character counter I (that is, the J-th area / line information of the I-th character information stored in the character information buffer 1021), is NULL (0). This determines whether or not the character information corresponds to a first character (S135).
[0062]
If the character information is the first character (YES in S135), the address of the character information found in S135 (that is, the I-th character information buffer) is entered in the character information pointer (S136). This character information pointer uses a register in the CPU. Then, the character code referred to by the character information pointer is stored in the character string buffer 1022 (S137).
[0063]
Next, it is determined whether or not the next character information address referred to by the character information pointer is NULL (0). Accordingly, it is determined whether or not there is a next character (that is, a character connected by line extraction) (S138).
[0064]
If there is a next character (that is, if the next character information address is not NULL) (YES in S138), the next character information address is set in the character information pointer (S139), and the process returns to step S137.
[0065]
When there is no next character (that is, when the next character information address is NULL) (NO in S138), the character counter I is incremented (S140), and the process returns to step S134. If the character information is not the first character in step S135 (NO in S135), the character counter I is similarly incremented (S140), and the process returns to step S134. That is, the processing of steps S134 to S139 is repeated for the next row.
[0066]
When scanning of all character information is completed, that is, when the character counter I is equal to or more than the number of character information items (NO in S134), the area division candidate counter J is incremented (S141), and the process returns to step S132. That is, the processing of steps S132 to S140 is repeated for the next area division candidate.
[0067]
When the processing has been completed for all the area division candidates, that is, when the area division candidate counter J has become equal to or greater than the number of area division candidates (NO in S132), the present processing ends and the process returns to the main routine.
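The flowchart of FIG. 8 walked through above can be sketched as two nested loops plus a pointer walk. As an assumption for illustration, each candidate's connections are held in a dictionary mapping a character index to a `(prev, next)` pair, with `None` in place of NULL (0); the function name `create_strings` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# links[j][i] = (prev, next) indices of character i under division candidate j.
Links = Dict[int, Tuple[Optional[int], Optional[int]]]


def create_strings(codes: List[str], links: List[Links]) -> List[List[str]]:
    """Return, for each division candidate, the list of character strings."""
    result: List[List[str]] = []
    for j in range(len(links)):                 # S131, S132, S141: candidate counter J
        strings: List[str] = []
        for i in range(len(codes)):             # S133, S134, S140: character counter I
            prev_idx, _ = links[j][i]
            if prev_idx is not None:            # S135: not a first character
                continue
            chars: List[str] = []
            ptr: Optional[int] = i              # S136: character information pointer
            while ptr is not None:
                chars.append(codes[ptr])        # S137: store the character code
                ptr = links[j][ptr][1]          # S138, S139: follow the next address
            strings.append("".join(chars))
        result.append(strings)
    return result
```

Each character with no predecessor starts exactly one string, so the number of strings per candidate equals the number of areas in that candidate.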
[0068]
FIG. 9 shows a specific example of the character string buffer 1022. Referring to FIG. 9, character string buffer 1022 includes first area division candidate character string buffer 51 to m-th area division candidate character string buffer 55, one for each area division candidate (FIG. 9 shows the case where the number of area division candidates is m).
[0069]
Each area division candidate character string buffer in turn includes the number of character strings and one character string buffer per character string. That is, the first area division candidate character string buffer 51 includes the number 511 of character strings included in the first area division candidate, and the first character string buffer 512 to the L-th character string buffer 515 (the figure illustrates the case where the number of character strings included in the first area division candidate is L). The number L of character strings corresponds to the number of characters determined to be a first character (that is, a character whose previous character information address is NULL (0)) in step S135, and the buffers are reserved dynamically.
[0070]
Next, the calculation of the transition probabilities of all the character strings performed in step S14 of FIG. 3 will be specifically described.
[0071]
The transition probability value P of a character string is, taking a specific example, the probability value of a sentence such as "Today is fine weather". That is, it is expressed as in the following equation.
[0072]
P("Today is fine weather") = 0.73
According to the document "Speech Recognition by Probability Model" (written by Seiichi Nakagawa, Institute of Electronics, Information and Communication Engineers, Corona, first edition, 1988), the transition probability value P is given by the following equation (1), where C = (c1, c2, ..., cn) is the character string and F() denotes the appearance frequency of the combination of characters in parentheses.
[0073]
(Equation 1)

[0074]
From equation (1), the transition probability value P(C) for a character string C of arbitrary length can be obtained by preparing appearance frequency (or appearance probability) tables of two-character sets (digrams) and three-character sets (trigrams). However, when the target character code is a multi-byte code (for example, in Japanese), the number of character categories is on the order of thousands, and it is particularly difficult to secure enough memory to hold the trigram table. Therefore, it is not appropriate to employ equation (1) directly. In the present embodiment, the transition probability value P of the character string is instead calculated by the following equation (2), an approximation that uses the appearance frequencies of digrams and of single characters (unigrams).
[0075]
(Equation 2)

[0076]
For each region division candidate character string buffer, the transition probability value P is calculated individually by equation (2) for each of the character string buffers 512 to 515, whose number is given by the number of character strings 511, and the character string transition probability of all the character strings in the corresponding region division candidate is calculated as the product of these values. Specifically, assuming that the number of character strings in the region division candidate is L and the character strings are C1, C2, ..., CL, the character string transition probability of all the character strings of the region division candidate is calculated by the following equation (3).
[0077]
[Equation 3]

[0078]
In step S14, the character string probability calculation unit 1014 calculates the character string transition probabilities for all the region division candidates according to equation (3). In step S15, the optimal region determination unit 1015 adopts the region division candidate with the maximum character string transition probability as the optimal area division result, and in step S16, the correct answer document creation unit 1016 creates the correct character string. The method of creating the correct character string in step S16 is almost the same as the character string creation processing shown in FIG. 8. That is, if steps S131, S132, and S141 in FIG. 8 are removed and the area division candidate counter J is set to the number of the area division candidate determined to be optimal, the correct character string can be created by the same processing as that shown in FIG. 8.
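Steps S14 and S15 can be sketched as follows. The bodies of equations (1) to (3) are not reproduced in this text, so the form used here is an assumption: a standard digram chain, P(C) = P(c1) × Π F(c[i-1]c[i]) / F(c[i-1]), with the per-candidate score taken as the product over its strings as described for equation (3). Two further choices are purely illustrative and not from the source: the product is computed as a sum of logarithms to avoid underflow, and unseen digrams are floored at a small constant instead of zero. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def train_counts(corpus: Sequence[str]) -> Tuple[Counter, Counter]:
    """Count single characters (unigrams) and adjacent pairs (digrams)."""
    uni: Counter = Counter()
    di: Counter = Counter()
    for text in corpus:
        uni.update(text)
        di.update(text[k:k + 2] for k in range(len(text) - 1))
    return uni, di


def log_prob(s: str, uni: Counter, di: Counter, floor: float = 1e-6) -> float:
    """log P(C) under the assumed digram chain; `floor` replaces zero probabilities."""
    if not s:
        return 0.0
    total = sum(uni.values()) or 1
    lp = math.log(max(uni[s[0]] / total, floor))
    for a, b in zip(s, s[1:]):
        ratio = di[a + b] / uni[a] if uni[a] else 0.0
        lp += math.log(max(ratio, floor))
    return lp


def best_candidate(candidates: List[List[str]], uni: Counter, di: Counter) -> int:
    """S14 and S15: sum log P over the strings of each candidate, take the maximum."""
    scores = [sum(log_prob(s, uni, di) for s in strings) for strings in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

A candidate whose strings jump between unrelated columns contains digrams that rarely or never occur in the frequency table, so its score drops sharply relative to a correctly divided candidate.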
[0079]
By executing the above processing on the PC 1 which is the document processing device in the present embodiment, highly accurate conversion from a static document format to a dynamic document format can be realized.
[0080]
[Second embodiment]
Next, FIG. 10 is a diagram showing a specific example of the region division candidates obtained in the first embodiment for a document similar to the input image 31 shown in FIG. 5, and of the line extraction for each region division.
[0081]
Referring to FIG. 10, region division candidates 61 to 63 indicate region division candidates obtained when distance parameter δ used for region division is changed in the first embodiment. Further, the divided regions 611 to 613 indicate the divided regions in the region division candidate 61, the divided regions 621 and 622 indicate the divided regions in the region division candidate 62, and the divided region 631 indicates the divided region in the region division candidate 63. The bold horizontal rectangles in each divided region indicate lines, and the thin rectangles inside them indicate characters. As in FIG. 5, the region division candidates 61 to 63 are region division candidates for a horizontally written three-column document. Therefore, the region division candidate 61, composed of the three divided regions 611 to 613, is the correct region division.
[0082]
In the first embodiment, lines are extracted for each region in each region division candidate, character strings are created by connecting characters, and the optimal division candidate is determined based on the character string transition probability. Character strings are then extracted again for the optimal division candidate (the region division candidate 61 in FIG. 10), and a correct sentence is created. The processing according to the first embodiment works effectively because a character string extracted from an incorrectly divided region division candidate connects to characters in another region, so that its transition probability as a character string decreases. For example, in FIG. 10, characters in different regions are connected in the divided regions 621 and 631, and as a result, the part of a character string that spans different regions becomes linguistically meaningless. Therefore, the transition probability as a character string decreases.
[0083]
Therefore, in the second embodiment, a description will be given of a case in which this fact is further utilized and a higher-precision conversion is performed in the PC 1 similar to the document processing apparatus in the first embodiment.
[0084]
Specifically, among the line extraction results of all the area division candidates 61 to 63 shown in FIG. 10, evaluating only the character chains in the parts where the character connection differs between candidates creates a correct sentence more effectively. This is also clear from the above equation (2). FIG. 11 shows, darkened, the characters whose connections differ when the area division candidates 61 to 63 shown in FIG. 10 are compared. As shown in FIG. 11, the characters at the boundaries between regions are the characters whose connections differ. Therefore, it suffices to create, for each region division candidate, a character string consisting of such a character and several characters before and after it, obtain the transition probability value P of that character string by equation (2), and evaluate these values by equation (3).
[0085]
The conversion process performed by the PC 1 in the second embodiment is substantially the same as the conversion process of the first embodiment shown in FIG. 3, so the description of the common portions is not repeated. The second embodiment differs from the first in the character string creation processing executed by the character string creation unit 1013 in step S13 of FIG. 3. FIG. 12 shows a flowchart of the character string creation processing according to the second embodiment. This processing is almost the same as the character string creation processing of the first embodiment shown in the flowchart of FIG. 8: referring to FIG. 12, the processing in steps S231 to S234 is the same as the processing in steps S131 to S134 in FIG. 8. The following description therefore covers the processing from step S235 onward.
[0086]
Referring to FIG. 12, it is determined whether the concatenation of the character indicated by the character counter I is the same in all the area division candidates, thereby identifying characters whose concatenation differs (S235). Specifically, when the previous character information addresses and the next character information addresses in all the area / line information entries of the area / line information list of the corresponding character information are not all the same (that is, when at least one entry holds a different address), the corresponding character is determined to be a character whose concatenation differs.
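The check in step S235 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: `LinkInfo`, `CharInfo`, `build_char_infos`, and `has_different_concatenation` are hypothetical names, addresses are modeled as list indices, the area label 1811 is omitted, and the eight-character page layout is invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class LinkInfo:
    """Hypothetical stand-in for one area / line information entry."""
    prev_addr: Optional[int]  # previous character information address
    next_addr: Optional[int]  # next character information address


@dataclass
class CharInfo:
    code: str
    links: List[LinkInfo] = field(default_factory=list)  # one entry per candidate


def build_char_infos(codes: str, orders: Sequence[Sequence[int]]) -> List[CharInfo]:
    """Fill in prev/next addresses from the reading order of each candidate."""
    chars = [CharInfo(code) for code in codes]
    for order in orders:
        for pos, idx in enumerate(order):
            prev_addr = order[pos - 1] if pos > 0 else None
            next_addr = order[pos + 1] if pos + 1 < len(order) else None
            chars[idx].links.append(LinkInfo(prev_addr, next_addr))
    return chars


def has_different_concatenation(ch: CharInfo) -> bool:
    """True if at least one candidate links this character differently."""
    first = ch.links[0]
    return any(link != first for link in ch.links[1:])


# Assumed page layout (two rows):  A B | C D
#                                  E F | G H
codes = "ABCDEFGH"
single_area = [0, 1, 2, 3, 4, 5, 6, 7]  # read straight across each row
two_columns = [0, 1, 4, 5, 2, 3, 6, 7]  # left column first, then right column
chars = build_char_infos(codes, [single_area, two_columns])
differing = [i for i, ch in enumerate(chars) if has_different_concatenation(ch)]
```

In this layout only the first and last characters are linked identically in both candidates; every character adjacent to a column or row boundary is flagged.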
[0087]
If the concatenation differs in even one candidate (YES in S235), the character codes of the characters preceding that character, up to a predetermined number N of characters, are extracted (S236). This is done by following the previous character information address of the character information. Similarly, by following the next character information address of the character information, the character codes of the characters following that character, up to N characters, are extracted (S237).
[0088]
Next, a character string is created by arranging the character codes extracted in steps S236 and S237 in the order of their concatenation, and is stored in the character string buffer 1022 (S238). Thereafter, as in the processing after step S140 in FIG. 8, the character counter I is incremented (S239), and the process returns to step S234. If the concatenation of the character indicated by the character counter I is the same in all the area division candidates in step S235 (NO in S235), the character counter I is likewise incremented (S239), and the process returns to step S234. That is, the processing of steps S234 to S238 is repeated for the next character.
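Steps S235 to S239 can be sketched as a single loop. The sketch below is only an illustration under assumptions: the names `links_from_order`, `walk`, and `partial_strings` are hypothetical, addresses are modeled as list indices, each candidate is given as a reading order over the same characters, and the character itself is placed at the center of the string, following the wording of the claims.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# character index -> (previous address, next address); a hypothetical layout
Links = Dict[int, Tuple[Optional[int], Optional[int]]]


def links_from_order(order: Sequence[int]) -> Links:
    """Derive prev/next addresses from the reading order of one candidate."""
    links: Links = {}
    for pos, idx in enumerate(order):
        prev_addr = order[pos - 1] if pos > 0 else None
        next_addr = order[pos + 1] if pos + 1 < len(order) else None
        links[idx] = (prev_addr, next_addr)
    return links


def walk(codes: str, links: Links, start: int, direction: int, n: int) -> List[str]:
    """Follow prev (direction 0) or next (direction 1) addresses for up to n characters."""
    out: List[str] = []
    j = links[start][direction]
    while j is not None and len(out) < n:
        out.append(codes[j])
        j = links[j][direction]
    return out


def partial_strings(codes: str, orders: Sequence[Sequence[int]], n: int) -> List[List[str]]:
    """Return one buffer of partial character strings per candidate."""
    all_links = [links_from_order(order) for order in orders]
    buffers: List[List[str]] = [[] for _ in orders]
    for i in range(len(codes)):                      # character counter I
        if all(l[i] == all_links[0][i] for l in all_links[1:]):
            continue                                 # NO in S235: same everywhere
        for cand, links in enumerate(all_links):
            before = walk(codes, links, i, 0, n)[::-1]   # S236
            after = walk(codes, links, i, 1, n)          # S237
            buffers[cand].append("".join(before) + codes[i] + "".join(after))  # S238
    return buffers


# Assumed layout: rows "ABCD" and "EFGH", with a column gap between B|C and F|G.
codes = "ABCDEFGH"
orders = [[0, 1, 2, 3, 4, 5, 6, 7],   # candidate treating the page as one area
          [0, 1, 4, 5, 2, 3, 6, 7]]   # candidate splitting it into two columns
buffers = partial_strings(codes, orders, n=1)
```

With N = 1 each buffer holds six three-character strings, one per character whose concatenation differs, so only the neighborhoods of the area boundaries are passed on for probability evaluation.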
[0089]
As described above, by performing character string creation processing that evaluates only the character strings composed of characters in the vicinity of characters whose concatenation differs, the present embodiment can determine the area more accurately than the conversion processing of the first embodiment.
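How the partial character strings might feed the area determination can be sketched as follows. Expressions (2) and (3) are not reproduced in this part of the specification, so the sketch substitutes its own assumptions: the scorer `toy_log_score` with its table of known character pairs is invented, and averaging the per-string scores and taking the maximum is only one plausible way to compare candidates. The name `select_candidate` is hypothetical as well.

```python
import math
from typing import Callable, Dict, List


def select_candidate(partials: Dict[int, List[str]],
                     log_score: Callable[[str], float]) -> int:
    """Return the candidate whose partial strings have the highest mean score."""
    def mean_score(strings: List[str]) -> float:
        if not strings:
            return float("-inf")
        return sum(log_score(s) for s in strings) / len(strings)

    return max(partials, key=lambda cand: mean_score(partials[cand]))


# Invented stand-in for a transition probability: pairs in KNOWN_PAIRS are
# treated as likely, every other pair as unlikely.
KNOWN_PAIRS = {"AB", "BE", "EF", "FC", "CD", "DG", "GH"}


def toy_log_score(text: str) -> float:
    return sum(math.log(0.5 if text[k:k + 2] in KNOWN_PAIRS else 0.01)
               for k in range(len(text) - 1))


# Partial strings around boundary characters for two hypothetical candidates.
partials = {
    0: ["ABC", "BCD", "CDE", "DEF", "EFG", "FGH"],  # page read as one area
    1: ["ABE", "FCD", "CDG", "BEF", "EFC", "DGH"],  # page read as two columns
}
best = select_candidate(partials, toy_log_score)
```

Because the strings cover only the characters near area boundaries, the scores of the candidates are not diluted by the long stretches of text on which all candidates agree.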
[0090]
Further, the conversion method performed by the above-described document conversion apparatus can be provided as a program. Such a program can be recorded on a computer-readable recording medium attached to the computer, such as a flexible disk, a CD-ROM, a ROM, a RAM, or a memory card, and provided as a program product. Alternatively, the program can be provided by being recorded on a recording medium such as a hard disk incorporated in the computer. The program can also be provided by downloading via a network.
[0091]
The provided program product is installed in a program storage unit such as a hard disk and executed. Note that the program product includes the program itself and the recording medium on which the program is recorded.
[0092]
The embodiments disclosed this time are to be considered in all respects as illustrative and not restrictive. The scope of the present invention is defined by the terms of the claims, rather than the description above, and is intended to include any modifications within the scope and meaning equivalent to the terms of the claims.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a specific example of a configuration of a PC 1 that is a document processing device according to the present embodiment.
FIG. 2 is a block diagram showing a configuration of a PC 1 for performing a conversion process.
FIG. 3 is a flowchart illustrating a conversion process performed by the PC 1 according to the present embodiment.
FIG. 4 is a diagram illustrating a specific example of a character information buffer 1021.
FIG. 5 is a diagram showing a result display image 32 with respect to an input image 31 including regions 311 to 313.
FIG. 6 is a flowchart illustrating a process of creating a region division candidate executed in step S12.
FIG. 7 is a diagram illustrating a specific example of a connection relationship.
FIG. 8 is a flowchart showing a character string creation process executed in step S13.
FIG. 9 is a diagram showing a specific example of a character string buffer 1022.
FIG. 10 is a diagram showing a specific example of area division candidates for a document similar to the input image 31 shown in FIG. 5, and of the line extraction for each area division in the first embodiment.
FIG. 11 is a diagram in which the characters whose concatenation differs when the area division candidates 61 to 63 shown in FIG. 10 are compared are darkened.
FIG. 12 is a flowchart illustrating a character string creation process according to the second embodiment.
[Explanation of symbols]
1 PC, 10 static document data, 11 character code, 12 start x coordinate, 13 start y coordinate, 14 x size, 15 y size, 16 font, 17 attribute information, 18 area / line information list, 20 dynamic document data, 31 input image, 32 result display image, 41 first character, 42 last character, 51 to 55 area division candidate character string buffer, 61 to 63 area division candidate, 101 control unit, 102 storage unit, 103 input unit, 104 output unit, 181 to 185 area / line information, 311 to 313, 321, 322 area, 511 character string number, 512 to 515 character string buffer, 611 to 613, 621, 622, 631 divided area, 1011 character information extraction unit, 1012 area division candidate creation unit, 1013 character string creation unit, 1014 character string probability calculation unit, 1015 optimal area determination unit, 1016 correct answer document creation unit, 1021 character information buffer, 1022 character string buffer, 1811 area label, 1812 previous character information address, 1813 next character information address.

Claims

A document conversion method for converting document data including character position information into document data not including character position information, the method comprising:
a character information extraction step of extracting character information including the position information of each character from the document data including the character position information;
an area division candidate creating step of creating division candidates for dividing the document data into a plurality of areas based on the extracted position information of the characters;
a character string creating step of, for each of the created area division candidates, extracting lines based on the extracted position information of each character and creating a character string by searching the extracted lines;
a character string probability calculation step of calculating a transition probability value of the created character string;
an area determination step of determining an optimal area from the created area division candidates based on the calculation result; and
a text creating step of creating a text sentence that does not include the position information of each character based on the result of the determined optimal area.

A document conversion method for converting document data including character position information into document data not including character position information, the method comprising:
a character information extraction step of extracting character information including the position information of each character from the document data including the character position information;
an area division candidate creating step of creating division candidates for dividing the document data into a plurality of areas based on the extracted position information of the characters;
a character string creating step of, for each of the created area division candidates, extracting lines based on the extracted position information of each character, and creating a partial character string centered on a character whose concatenation in the extraction result differs in at least one of the created area division candidates, the partial character string consisting of that character and a predetermined number of characters before and after it;
a character string probability calculation step of calculating a transition probability value of the created partial character string;
an area determination step of determining an optimal area from the created area division candidates based on the calculation result; and
a text creating step of creating a text sentence that does not include the position information of each character based on the result of the determined optimal area.

The document conversion method according to claim 1, wherein the character information extracting step extracts the character information by at least one of interpreting data in a common format converted based on the document data including the character position information, and directly interpreting the document data including the character position information.

A document conversion device for converting document data including character position information into document data not including character position information, the device comprising:
character information extraction means for extracting character information including the position information of each character from the document data including the character position information;
area division candidate creating means for creating division candidates for dividing the document data into a plurality of areas based on the extracted position information of the characters;
character string creating means for, for each of the created area division candidates, extracting lines based on the extracted position information of each character and creating a character string by searching the extracted lines;
character string probability calculation means for calculating a transition probability value of the created character string;
area determination means for determining an optimal area from the created area division candidates based on the calculation result; and
text creating means for creating a text sentence that does not include the position information of each character based on the result of the determined optimal area.

A document conversion device for converting document data including character position information into document data not including character position information, the device comprising:
character information extraction means for extracting character information including the position information of each character from the document data including the character position information;
area division candidate creating means for creating division candidates for dividing the document data into a plurality of areas based on the extracted position information of the characters;
character string creating means for, for each of the created area division candidates, extracting lines based on the extracted position information of each character, and creating a partial character string centered on a character whose concatenation in the extraction result differs in at least one of the created area division candidates, the partial character string consisting of that character and a predetermined number of characters before and after it;
character string probability calculation means for calculating a transition probability value of the created partial character string;
area determination means for determining an optimal area from the created area division candidates based on the calculation result; and
text creating means for creating a text sentence that does not include the position information of each character based on the result of the determined optimal area.

The document conversion device according to claim 4, wherein the character information extracting means extracts the character information by at least one of interpreting data in a common format converted based on the document data including the character position information, and directly interpreting the document data including the character position information.