JP4265300B2 - Information processing apparatus and information processing method

Info

Publication number: JP4265300B2
Application number: JP2003176094A
Authority: JP
Inventors: 正義榊原; 芳幸内藤; 雅紀佐竹; 直子佐藤; 昌俊田川
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2003-06-20
Filing date: 2003-06-20
Publication date: 2009-05-20
Anticipated expiration: 2023-06-20
Also published as: JP2005011165A

Description

【０００１】
【発明の属する技術分野】
本発明は、情報処理装置及び情報処理方法にかかり、特に、ＨＴＭＬ（Hyper Text Markup Language）やＸＭＬ（eXtensible Markup Language）等で代表されるマークアップ言語で記述された構造化文書を処理する情報処理装置及び情報処理方法に関する。
【０００２】
【従来の技術】
情報を構造化して加工や流通を容易にするために、情報の記述形式としてＸＭＬ等のマークアップ言語が使用され始めている。ＸＭＬで記述された文書を処理する情報処理装置では、通常ＸＭＬデータを解析して情報処理装置で処理可能な内部表現形式に変換した上で、各種の処理を行っている。この際、木構造形式等に変換される。
【０００３】
しかしながら、ＸＭＬ文書が保持する構造情報を変換する際に、文章を構成する要素間の親子関係を表す構造情報等の付加的な情報を保持する必要があるために、元のＸＭＬ文書よりもデータ容量が大きくなってしまい、多大なメモリを必要とする。組み込む機器によっては一時記憶領域が制限され、変換された構造情報を保持しきれなくなってしまう。二次記憶領域に構造情報を待避して保持させることも可能であるが、待避した情報にアクセスする場合に処理時間がかかってしまう。
【０００４】
そこで、元の構造情報を変更して内部表現に要するデータ量を削減することが提案されている。例えば、特許文献１や特許文献２に記載の技術などが提案されている。
【０００５】
特許文献１に記載の技術では、複数の要素を、それらの位置関係を示す情報と共に１つの要素として合成して記憶することにより、データ容量の増加を抑制している。
【０００６】
また、特許文献２に記載の技術では、構造化文書の階層を浅くすることにより動作メモリ量の削減を可能にしてデータアクセス効率を改善している。
【０００７】
【特許文献１】
特開２００２−１０８８５０号公報（第１頁、第２図）
【特許文献２】
特開２００２−２９７５６９号公報（第１頁、第５図）
【０００８】
【発明が解決しようとする課題】
特許文献１や特許文献２に記載の技術は、データベースに登録された個々のデータを表すレコード情報等、類似の構造を持つ多数の要素を扱う用途においてデータ容量を削減することができる。しかしながら、特許文献１や特許文献２に記載の技術では、一般の文章等、構造上の類似性が少ない文章に対しては削減量が少ない、という問題がある。また、文章中には一般のテキスト情報の他に、グラフィックス情報や数式情報等、特別な処理を要するデータが含まれることがある。この場合、一般テキストの処理ルーチン動作中に数式処理ルーチンを呼び出すと複数のルーチンを同時実行することになり処理の負荷が大きくなり、ワークメモリの使用量も増加してしまう、という問題がある。
【０００９】
本発明は、上記問題を解決すべく成されたもので、構造上の類似性が少ない要素を含む構造化文章に対して、構造化文書を解析する際のデータ量の増加を抑制することが可能な情報処理装置及び情報処理方法を提供することを目的とする。
【００１０】
【課題を解決するための手段】
上記目的を達成するために請求項１に記載の発明は、マークアップ言語で記述され、複数の節を有する複数の章で構成された構造化文書を解析して、自装置で処理可能な表現形式に変換する解析手段と、前記構造化文書を構成する要素毎に、解析を保留するか否かを判断するための所定の保留条件に応じて解析の保留可否を判断する判断手段と、前記構造化文書の表紙データ生成または目次データ生成に必要な部分以外を前記所定の保留条件として設定する設定手段と、前記判断手段によって保留可能と判断された場合に、解析を保留する要素及び当該要素の解析時に参照されるべき文脈情報を保存する保存手段と、前記保存手段によって保存された要素及び文脈情報に基づいて、当該要素の解析を継続するように前記解析手段を制御する制御手段と、を備え、前記構造化文書の表紙データ生成または目次データ生成において、表紙データ生成または目次データ生成に必要な部分を前記解析手段によって解析を行った後に、前記制御手段が、前記判断手段によって保留可能と判断された要素について、前記保存手段に保存された文脈情報に基づいて、順次解析を再開するように前記解析手段を制御することを特徴としている。
【００１１】
解析手段では、マークアップ言語で記述され、複数の節を有する複数の章で構成された構造化文書が解析され、自装置で処理可能な表現形式に変換される。例えば、ＸＭＬ等のマークアップ言語で記述された構造化文書は、その内容を字句解析して自装置で処理可能な木構造等による表現形式に変換することができる。ＸＭＬを用いた場合には、ＸＭＬパーサ及びＤＯＭ（Document Object Model）インタフェース等を用いることができる。
【００１２】
判断手段では、構造化文書を構成する要素毎に、解析の保留可否が判断される。例えば、ＸＭＬ要素の名前空間に基づいて、解析の保留可否を判断する。当該名前空間の持つ要素の処理のために、別のルーチンを並列実行させなければならない等、メモリ使用量が大きい場合は解析を保留するといった判断が可能である。
設定手段では、構造化文書の表紙データ生成または目次データ生成に必要な部分以外を所定の保留条件として設定する。
【００１３】
保存手段では、判断手段によって保留可能と判断された要素及び当該要素の解析時に必要となる参照されるべき文脈情報（例えば、要素の付帯情報等）が保存される。
【００１５】
このように、構造化文書を解析する際に、保留可能な要素については、解析せずに保留して、後で保留された要素を解析することによって、複数のルーチンの同時実行やワークメモリの増加を抑制したり、解析後に不必要となったデータの削除等を順次行うことにより、解析時に必要となるデータ量を抑制することが可能となる。すなわち、構造上の類似性が少ない文章に対して、構造化情報を解析する際のデータ量の増加を抑制することが可能となる。
【００１６】
なお、請求項２に記載の発明のように、判断手段により解析の保留可否を判断するための保留条件を設定する設定手段を更に備えるようにしてもよい。
【００１７】
さらに、構造化文書の表紙データ生成または目次データ生成において、表紙データ生成または目次データ生成に必要な部分を解析手段によって解析を行った後に、制御手段が、判断手段によって保留可能と判断された要素について、保存手段に保存された文脈情報に基づいて、順次解析を再開するように解析手段を制御する。すなわち、必要となる部分の解析を行い、その他の部分を保留するため自装置で処理可能な表現形式に変換されるデータ量を抑えて表紙データや目次データを生成できる。
【００１８】
請求項２に記載の発明は、設定手段と、解析手段と、判断手段と、保存手段と、制御手段と、を備えて、マークアップ言語で記述され、複数の節を有する複数の章で構成された構造化文書を解析して、所定の処理を行う情報処理装置の情報処理方法であって、前記設定手段が、前記構造化文書の表紙データ生成または目次データ生成に必要な部分以外を、解析を保留するか否かを判断するための所定の条件として設定し、前記解析手段が、前記構造化文書を解析して、自装置で処理可能な表現形式に変換し、前記判断手段が、前記構造化文書を構成する要素毎に、前記所定の保留条件に応じて解析の保留可否を判断する判断し、前記保存手段が、前記判断手段によって保留可能と判断された場合に、解析を保留する要素及び当該要素の解析時に参照されるべき文脈情報を保存し、前記構造化文書の表紙データ生成または目次データ生成において、表示データ生成または目次データ生成に必要な部分を前記解析手段で解析を行った後に、前記制御手段が、前記判断手段によって保留可能と判断され、前記保存手段によって保存された要素について、前記保存手段によって保存された文脈情報に基づいて、順次解析を再開するように前記解析手段を制御することを特徴としている。
【００１９】
請求項２に記載の発明によれば、設定手段が、構造化文書の表紙データ生成または目次データ生成に必要な部分以外を、解析を保留するか否かを判断するための所定の条件として設定し、解析手段が、マークアップ言語で記述され、複数の節を有する複数の章で構成された構造化文書を解析して、自装置で処理可能な表現形式に変換する。また、判断手段が、構造化文書を構成する要素毎に、所定の保留条件に応じて解析の保留可否を判断する。そして、判断手段によって保留可能と判断された場合に、解析を保留する要素及び当該要素の解析時に参照されるべき文脈情報（例えば、要素の付帯情報等）が保存手段によって保存される。
【００２０】
例えば、ＸＭＬ等のマークアップ言語で記述された構造化文書は、その内容を字句解析して自装置で処理可能な木構造等による表現形式に変換することができる。ＸＭＬを用いた場合には、ＸＭＬパーサ及びＤＯＭインタフェース等を用いることができる。この時、当該解析時に保留可能な要素について保留して、当該要素及び文脈情報を保存する。
【００２２】
このように、構造化文書を解析する際に、保留可能な要素については、解析せずに保留して、後で保留された要素を解析することによって、複数の処理ルーチンの同時実行やワークメモリの増加を抑制したり、解析後に不必要となったデータの削除等を順次行うことが可能となり、解析時に増加するデータ量を抑制することが可能となる。すなわち、構造上の類似性が少ない文章に対して、構造化情報を解析する際に、データ量の増加を抑制することが可能となる。
【００２４】
そして、構造化文書の表紙データ生成または目次データ生成において、表示データ生成または目次データ生成に必要な部分を解析手段で解析を行った後に、制御手段が、判断手段によって保留可能と判断され、保存手段によって保存された要素について、保存手段によって保存された文脈情報に基づいて、順次解析を再開するように解析手段を制御する。これによって、必要となる部分の解析を行い、その他の部分を保留するため自装置で処理可能な表現形式に変換されるデータ量を抑えて表紙データや目次データを生成できる。
【００２５】
【発明の実施の形態】
以下、図面を参照して本発明の実施の形態の一例を詳細に説明する。
【００２６】
図１は、本発明の実施の形態に係わる情報処理装置１０の構成を示すブロック図である。なお、本発明の実施の形態に係わる情報処理装置１０は、プリンタに適用可能なものとして説明する。本発明の実施の形態に係わる情報処理装置１０は、外部より送信されるマークアップ言語で記述された構造化文書を内部表現形式に変換して用紙等の記録媒体に記録可能とされている。なお、本実施の形態では、マークアップ言語としてＸＭＬを用いた例を説明する。
【００２７】
図１に示すように、情報処理装置１０は、主制御手段１２、データ取得手段１８、保留条件設定手段１６、データ解析手段１４、記憶手段２０、解析文脈保存手段２２、及び解析文脈展開手段２４を有している。
【００２８】
主制御手段１２は、本発明の制御手段に相当し、データ取得手段１８、保留条件設定手段１６、データ解析手段１４、記憶手段２０、解析文脈保存手段２２、及び解析文脈展開手段２４をそれぞれ制御して、構造化文書のデータを解析して内部表現形式に変換するための制御を行う。
【００２９】
データ取得手段１８は、処理対象となるデータを外部から取得する。例えば、ユーザがコンピュータから行ったＸＭＬで記述された構造化文書のデータからなる印刷要求をネットワーク（例えば、インターネットやＬＡＮ（Local Area Network）など）を介して取得する。
【００３０】
保留条件設定手段１６は、データ取得手段１８によって外部より取得した構造化文書のデータをデータ解析手段１４によって内部表現形式に変換する際に、各要素の解析を行うデータ処理を保留するか否かを決定するための条件を設定する。例えば、図２に示すように、保留条件、再開条件、や保存文脈条件等を設定することにより、ＸＳＬＴ（Xml Stylesheet Language Transformation）等による処理の際に、変換対象となっている要素情報を元に、保留対象の要素を決定することができる。なお、保留条件設定手段１６は、情報処理装置１０の外部に設けるようにしてもよく、例えば、ＸＭＬ解析を要求する外部システムに設けるようにしてもよい。
【００３１】
データ解析手段１４は、本発明の解析手段に相当し、データ取得手段１８によって外部より取得した構造化文書のデータを字句解析して文章を構成する各要素を木構造等による内部表現形式に変換する。例えば、ＸＭＬパーサ及びＤＯＭインタフェース等を用いて内部表現形式に変換する。
【００３２】
また、データ解析手段１４は、本発明の判断手段にも相当し、データ解析時に、保留条件設定手段１６によって設定された保留条件に応じて各要素の解析を行うデータ処理を保留するように主制御手段１２によって制御される。例えば、名前空間が異なるタグセットの処理を保留したり、部分的に処理を先行させて残りを保留したり、逐次処理する際に未解析の参照を保留したりすることが可能である。
【００３３】
記憶手段２０は、データ取得手段１８によって取得した構造化文書のデータ、保留条件設定手段１６によって設定した保留条件、データ解析手段１４による解析結果及び解析文脈情報等を記憶する。
【００３４】
解析文脈保存手段２２は、本発明の保存手段に相当し、保留条件設定手段１６によって設定された保留条件に従って、データ解析手段１４によって各要素の解析を行うデータ処理を保留する際に、再開時に必要となる解析文脈情報を抽出して記憶手段２０に保存する。
【００３５】
解析文脈保存手段２２は、例えば、ＸＭＬ要素の名前空間を解決する情報や、章・節番号自動付与に関する情報等の解析前後関係に基づく情報を保存するようにしてもよい。
【００３６】
また、解析文脈保存手段２２は、ＸＰａｔｈ（Xml Path Language）による参照を含む処理が発生した際、未解決の要素（ＸＭＬデータの末尾にある等）を参照していた場合は処理の途中経過を記憶手段２０に保存するようにしてもよい。既に解析対象となったが解析保留された情報は、後続の文章情報の解析中都度、或いは解析終了後に参照され、解析可能と判断された時点でデータ解析手段１４によって解析するようにしてもよい。
【００３７】
さらに、解析文脈保存手段２２は、解析を継続する際に使用すべき解析方法や内部表現形式の作成方法等の処理内容を記憶手段２０に保存するようにしてもよい。
【００３８】
解析文脈展開手段２４は、保留された各要素の解析を再開する際に、解析文脈保存手段２２によって保存された解析文脈情報を記憶手段２０より読み出して、展開する。
【００３９】
続いて、主制御手段１２の制御によって実行されるデータ処理について図３のフローチャートを参照して説明する。
【００４０】
ステップ１００では、保留条件設定手段１６によって構造化文書の各要素を解析する際に解析処理を一時的に保留させるための保留条件の設定を行う。例えば、保留条件の設定は、図２に示すように保留条件、処理の再開条件、保留時に保持される文脈情報の組で構成される。図中１行目の例は、名前空間（処理対象データ種）の変更の場合に保留し（保留条件）、名前空間のプレフィックスと名前空間識別子とのマッピング情報や、保留されたデータ種別（保存文脈）を再開時に必要となる解析文脈情報として解析文脈保存手段２２によって記憶手段２０に保存して、現在の処理すなわち現在の名前空間が終了した時点で再開する（再開条件）設定を表している。２行目の例は、未解析部分への参照を行う際に保留し（保留条件）、参照元や参照先等の参照情報（保存文脈）を再開時に必要となる解析文脈情報として解析文脈保存手段２２によって記憶手段２０に保存して、参照先の部分が解析対象となった場合に再開する（再開条件）設定を表している。また、３行目の例は、文章の目次生成処理等において章番号付与中の節要素の処理の場合に保留し（保留条件）、保留された節が属する章の番号（保存文脈）を再開時に必要となる解析文脈情報として解析文脈保存手段２２によって記憶手段２０に保存して、章番号付与終了時に再開するようにする（再開条件）設定を表している。
【００４１】
そして、ステップ１０２では、データ解析手段１４によってデータ解析処理が行われる。処理においてはステップ１００で設定した条件を参照して処理の保留及び再開をしながら解析が行われ、全てのデータが解析されたらデータ処理が終了する。
【００４２】
データ解析手段１４によって行われるデータ解析処理は、例えば、図４に示すフローチャートに従って行われる。
【００４３】
すなわち、ステップ２００では、解析保留中のデータの中に処理再開が可能な再開対象データがあるか否かデータ解析手段１４によって判定される。該判定は、元データに対する現在解析中の箇所や保留中の情報の再開条件を元に、解析を再開可能なデータがあるか否かを判定する。解析処理の開始時など保留中のデータが存在しなかったり、再開可能な保留データがないために該判定が否定された場合には、ステップ２０２へ移行して、未処理のデータがあるか否かデータ解析手段１４によって判定される。
【００４４】
未処理のデータがない場合には、ステップ２０２の判定が否定されてデータ解析処理を終了し、未処理のデータがある場合には、ステップ２０２の判定が肯定されて、ステップ２０４へ移行する。
【００４５】
ステップ２０４では、データ解析手段１４によって元データに対する解析が行われる。元データを字句解析して、個々のタグやテキスト部分といった処理単位を抽出する。
【００４６】
次にステップ２０６では、ステップ２０４で抽出した処理単位に対して解析保留するか否かデータ解析手段１４によって判定される。該判定は、保留条件設定手段１６によって設定された保留条件に該当するか否かを判定することによってなされ、解析保留条件を満たす場合には、判定が肯定されて、ステップ２０８へ移行して、保留対象となる元データの情報及び解析文脈情報が記憶手段２０に保存され、ステップ２００へ戻る。ステップ２０６において解析保留箇所がない場合には、判定が否定されて、ステップ２０７へ移行し、抽出した部分を内部表現形式に変換して保持し、その後ステップ２００へ戻る。すなわち、元データに対する解析を行い、現在対象である箇所が解析保留不要であれば、内部表現形式に変換した後に先頭に戻り順次解析を進行する。
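[0047]
On the other hand, if there is data to be resumed in step 200, that is, if the determination is affirmative, the process proceeds to step 210, where the analysis context expansion means 24 expands the analysis context information stored in the storage means 20.
[0048]
Then, in step 212, the data analysis means 14 analyzes the data based on the expanded analysis context information. When the analysis is finished, the process proceeds to step 214, where the context information expanded for that analysis is deleted and the previous analysis context information is restored, and the process returns to step 200. In other words, when resumable data exists, the stored analysis context information is expanded, analysis is performed based on the expanded analysis context, and the original analysis context is restored after the analysis is finished.
[0049]
As described above, in the data analysis processing of the flowchart shown in FIG. 4, the original data of a structured document described in a markup language such as XML is analyzed sequentially. Any part whose analysis should be suspended is stored in the storage means 20. Each time a subsequent element is analyzed, it is determined whether a suspended analysis can be resumed, and for those that can be resumed, the analysis context information is expanded and the analysis is performed. This makes it possible to suppress the simultaneous execution of multiple processing routines and increases in work memory, and thus to suppress the increase in data volume that occurs when a structured document is converted into the internal representation format.
[0050]
Note that, after the data analysis in step 212, processing may be added that again determines whether the data is a part whose analysis should be suspended. If it is, the process proceeds to step 208, and if it is not, the process proceeds to step 214.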
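[0051]
Here, an example of the data processing performed as described above will be given.
[0052]
FIG. 5 is an example of the data of a structured document described in XML. The structured document shown in FIG. 5 is a book-like structured document composed of a plurality of chapters each having a plurality of sections.
[0053]
For example, consider a case where, when such a structured document is printed by a printer, not only is print data created from the structured document, but cover data and table-of-contents data describing the chapter structure are also created.
[0054]
First, the document structure to be printed in the cover data (a list of chapter titles and the like) is created. At this time, only the elements at the beginning of each chapter are analyzed, and analysis of the sections contained in the chapter is suspended. That is, as shown in FIG. 6, each chapter (chapter) is analyzed and the sections (section) of each chapter are suspended. The content of each chapter is held in the storage means 20 without being analyzed, and information such as the chapter number of each chapter is also held in the storage means 20 as saved context information. Because only the parts needed to create the cover data are analyzed and the other parts are suspended, the amount of data converted into the internal representation can be kept down. In addition, because only the cover data is created at this stage, the work memory needed for creating the cover data can be released once its creation is finished.
[0055]
After the cover data is created, the content of each chapter is analyzed and the tag data representing sections is extracted to create the table-of-contents data. At this time, analysis of data other than the tag data representing sections, such as body text, is suspended again. Finally, the body text of each section is analyzed and all of the print data is created. When section numbers for the table of contents are created, the chapter numbers saved as context information are used. Processing in this way makes it possible to proceed while keeping down the amount of converted data and the amount of work memory.
[0056]
When print processing is performed, print data can be created and printed successively from already-analyzed data in parallel with the data analysis. Data that has finished printing is no longer needed, so by deleting preceding data that has finished printing while subsequent data is being analyzed, the amount of data can be reduced and processing becomes possible even with a small amount of memory.
[0057]
In this way, elements that can be suspended are suspended and the data analysis processing is performed sequentially, so the increase in data volume attributable to the suspended elements can be suppressed and the storage area can be used efficiently.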
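[0058]
Next, another example of the data analysis processing performed by the data analysis means 14 will be described with reference to the flowchart of FIG. 7.
[0059]
In step 300, the data analysis means 14 determines whether there is unprocessed data. If there is, the determination is affirmative and the process proceeds to step 302, where the data analysis means 14 analyzes the original data (data analysis). The original data is lexically analyzed to extract processing units such as individual tags and text portions.
[0060]
Next, in step 304, the data analysis means 14 determines whether the processing unit extracted in step 302 contains a part whose analysis should be suspended. This determination is made by checking whether the hold condition set by the hold condition setting means 16 is met. If the hold condition is satisfied, the determination is affirmative and the process proceeds to step 306, where the information of the original data to be suspended and the analysis context information are stored in the storage means 20, and the process returns to step 300. If there is no part to be suspended in step 304, the determination is negative and the process proceeds to step 305, where the extracted part is converted into the internal representation format and held, after which the process returns to step 300. In other words, the original data is analyzed, and if the part currently being handled does not need to be suspended, it is converted into the internal representation format, after which the process returns to the beginning and the analysis proceeds sequentially.
[0061]
On the other hand, if there is no unprocessed data in step 300, that is, if the determination is negative, the process proceeds to step 308, where the data analysis means 14 determines whether there is suspended data. If this determination is negative, the data analysis processing ends. If it is affirmative, the process proceeds to step 310.
[0062]
In step 310, the analysis context expansion means 24 expands the analysis context information stored in the storage means 20, and analysis is performed. That is, among the suspended items, the analysis context information of a part that can be resumed is expanded, and analysis is performed based on the expanded analysis context information.
[0063]
As described above, in the data analysis processing of the flowchart shown in FIG. 7, the original data is analyzed sequentially and any part whose analysis should be suspended is saved. When analysis of the original data is finished, one of the suspended parts is selected, its analysis context information is restored, and analysis is performed. As in the embodiment described above, this suppresses the increase in data volume that occurs when a structured document is converted into the internal representation format.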
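The FIG. 7 variant can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `analyze_then_resume`, the callback parameters, and the sample units are hypothetical, the input is assumed to be already split into processing units, and every suspended item is assumed to be resumable once the original data has been consumed.

```python
def analyze_then_resume(units, should_hold, convert):
    """FIG. 7 style variant: finish the original data first, then resume."""
    result, pending = [], []
    for position, unit in enumerate(units):        # steps 300 and 302
        context = {"position": position}           # hypothetical context record
        if should_hold(unit, context):             # step 304
            pending.append((unit, context))        # step 306: save unit and context
        else:
            result.append(convert(unit, context))  # step 305
    while pending:                                 # step 308
        unit, context = pending.pop(0)             # step 310: expand context, analyze
        result.append(convert(unit, context))
    return result


# Hypothetical usage: units whose name starts with "math" are suspended.
order = analyze_then_resume(
    ["p1", "math1", "p2"],
    should_hold=lambda unit, ctx: unit.startswith("math"),
    convert=lambda unit, ctx: f"{unit}@{ctx['position']}",
)
```

Compared with the FIG. 4 flow, this variant never checks for resumable items while original data remains, which keeps the loop simple at the cost of holding suspended items until the end.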
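[0064]
[Effects of the invention]
As described above, according to the present invention, when a structured document described in a markup language is analyzed, the apparatus comprises determination means for determining for each element whether analysis can be suspended, storage means for storing, when suspension is possible, the element and the context information to be referred to during its analysis, and control means for controlling so that analysis of the element is continued based on the stored element and context information. Because the document is converted into a representation format that the apparatus itself can process while suspendable elements are suspended, it becomes possible to suppress the simultaneous execution of multiple processing routines and increases in work memory, and to successively delete data that is no longer needed after analysis. This has the effect of suppressing the increase in data volume when structured information is analyzed.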
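[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of an information processing apparatus according to an embodiment of the present invention.
FIG. 2 is a table showing an example of the hold conditions set by the hold condition setting means.
FIG. 3 is a flowchart showing the flow of data processing executed under the control of the main control means in the information processing apparatus according to the embodiment of the present invention.
FIG. 4 is a flowchart showing the flow of the data analysis processing.
FIG. 5 is a diagram showing an example of the data of a structured document described in XML.
FIG. 6 is a schematic diagram showing the flow of the data analysis processing of the present invention.
FIG. 7 is a flowchart showing another example of the data analysis processing.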
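[Explanation of reference numerals]
10 Information processing apparatus
12 Main control means
14 Data analysis means
16 Hold condition setting means
18 Data acquisition means
20 Storage means
22 Analysis context storage means
24 Analysis context expansion means
[0001]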
BACKGROUND OF THE INVENTION
The present invention relates to an information processing apparatus and an information processing method, and in particular to an information processing apparatus and an information processing method for processing structured documents described in a markup language such as HTML (Hyper Text Markup Language) or XML (eXtensible Markup Language).
[0002]
[Prior art]
Markup languages such as XML have begun to be used as an information description format in order to structure information and make it easier to process and distribute. An information processing apparatus that processes a document described in XML usually analyzes the XML data and converts it into an internal representation format that the apparatus can process before performing its various processes. In this conversion, the data is turned into a tree structure or a similar format.
[0003]
However, converting the structure information held in an XML document requires holding additional information, such as structure information representing the parent-child relationships between the elements that make up the document. As a result, the data becomes larger than the original XML document and a large amount of memory is required. Depending on the device in which the apparatus is embedded, the temporary storage area is limited and the converted structure information can no longer be held. The structure information can be saved to a secondary storage area, but accessing the saved information then takes processing time.
[0004]
Thus, it has been proposed to reduce the amount of data required for internal representation by changing the original structure information. For example, techniques described in Patent Document 1 and Patent Document 2 have been proposed.
[0005]
In the technique described in Patent Document 1, an increase in data capacity is suppressed by combining and storing a plurality of elements together with information indicating their positional relationship as one element.
[0006]
In the technique described in Patent Document 2, flattening the hierarchy of the structured document reduces the amount of working memory and improves data access efficiency.
[0007]
[Patent Document 1]
JP 2002-108850 A (first page, FIG. 2)
[Patent Document 2]
JP 2002-297569 A (first page, FIG. 5)
[0008]
[Problems to be solved by the invention]
The techniques described in Patent Document 1 and Patent Document 2 can reduce the data size in applications that handle a large number of elements with similar structures, such as record information representing individual data items registered in a database. However, these techniques have the problem that the reduction is small for text with little structural similarity, such as ordinary prose. Moreover, in addition to ordinary text information, a document may contain data that requires special processing, such as graphics information or mathematical formula information. In this case, calling a formula processing routine while the ordinary text processing routine is running means that multiple routines execute simultaneously, which increases the processing load and the amount of work memory used.
[0009]
The present invention has been made to solve the above problems, and its object is to provide an information processing apparatus and an information processing method capable of suppressing the increase in data volume when a structured document is analyzed, for structured documents that include elements with little structural similarity.
[0010]
[Means for Solving the Problems]
To achieve the above object, the invention described in claim 1 comprises: analysis means for analyzing a structured document that is described in a markup language and composed of a plurality of chapters each having a plurality of sections, and converting it into a representation format that can be processed by the apparatus itself; determination means for determining, for each element constituting the structured document, whether analysis can be suspended, according to a predetermined hold condition for determining whether or not to suspend the analysis; setting means for setting, as the predetermined hold condition, the parts other than those necessary for generating cover data or table-of-contents data of the structured document; storage means for storing, when the determination means determines that suspension is possible, the element whose analysis is suspended and the context information to be referred to when that element is analyzed; and control means for controlling the analysis means so as to continue the analysis of the element based on the element and context information stored by the storage means. In generating the cover data or table-of-contents data of the structured document, after the analysis means has analyzed the parts necessary for generating the cover data or table-of-contents data, the control means controls the analysis means so as to sequentially resume analysis of the elements determined by the determination means to be suspendable, based on the context information stored in the storage means.
[0011]
The analysis means analyzes a structured document that is described in a markup language and composed of a plurality of chapters each having a plurality of sections, and converts it into a representation format that can be processed by the apparatus itself. For example, a structured document described in a markup language such as XML can be converted, by lexically analyzing its content, into a representation format such as a tree structure that the apparatus itself can process. When XML is used, an XML parser, a DOM (Document Object Model) interface, and the like can be used.
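As a rough illustration of converting an XML document into a tree-shaped internal representation through a DOM interface, the following Python sketch uses the standard-library `xml.dom.minidom` parser. The sample document and the helper `outline` are hypothetical and are not taken from the patent; the sketch shows ordinary, non-deferred parsing, which is the baseline that the invention tries to make less memory-hungry.

```python
from xml.dom import minidom

# Hypothetical sample document; the element names are illustrative only.
SAMPLE = "<book><chapter><title>Intro</title><section>Text</section></chapter></book>"


def outline(node, depth=0):
    """Return (depth, tag name) pairs for every element in document order."""
    rows = []
    if node.nodeType == node.ELEMENT_NODE:
        rows.append((depth, node.tagName))
        depth += 1
    for child in node.childNodes:
        rows.extend(outline(child, depth))
    return rows


dom = minidom.parseString(SAMPLE)        # whole document becomes a DOM tree
tree_rows = outline(dom.documentElement)
```

Every element, including text nodes and parent-child links, is materialized at once here, which is why the tree form can be larger than the source text.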
[0012]
The determination means determines, for each element constituting the structured document, whether its analysis can be suspended. For example, the determination is made based on the namespace of the XML element. When memory usage would be large, for instance because a separate routine must be executed in parallel to process the elements belonging to that namespace, it can be determined that the analysis should be suspended.
  The setting means sets a part other than the part necessary for generating cover data or table of contents data of the structured document as a predetermined holding condition.
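A namespace-based decision of this kind might look like the following Python sketch built on the standard-library `xml.etree.ElementTree`. The policy set `DEFERRED_NAMESPACES`, the helper names, and the sample document are assumptions made for illustration; the patent does not prescribe which namespaces are deferred.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
# Hypothetical policy: namespaces whose elements need a separate, costly routine.
DEFERRED_NAMESPACES = {MATHML_NS}


def namespace_of(element):
    """Extract the namespace URI from ElementTree's '{uri}local' tag form."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def can_defer(element):
    """Return True when the element's namespace is on the deferred list."""
    return namespace_of(element) in DEFERRED_NAMESPACES


doc = ET.fromstring(
    '<doc xmlns:m="http://www.w3.org/1998/Math/MathML">'
    "<p>plain text</p><m:math><m:mi>x</m:mi></m:math></doc>"
)
decisions = [(child.tag.split("}")[-1], can_defer(child)) for child in doc]
```

Here the plain paragraph is analyzed immediately, while the formula element is marked as suspendable so that the formula routine does not have to run alongside the text routine.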
[0013]
The storage means stores each element that the determination means has determined to be suspendable, together with the context information that must be referred to when the element is analyzed (for example, incidental information of the element).
[0015]
In this way, when a structured document is analyzed, elements that can be suspended are put on hold without being analyzed and are analyzed later. This suppresses the simultaneous execution of multiple routines and increases in work memory, and allows data that is no longer needed after analysis to be deleted successively, so the amount of data required during analysis can be kept down. In other words, the increase in data volume when structured information is analyzed can be suppressed for text with little structural similarity.
[0016]
Note that, as in the invention described in claim 2, setting means may further be provided for setting the hold condition that the determination means uses to determine whether analysis can be suspended.
[0017]
Furthermore, in generating the cover data or table-of-contents data of the structured document, after the analysis means has analyzed the parts necessary for generating the cover data or table-of-contents data, the control means controls the analysis means so as to sequentially resume analysis of the elements that the determination means has determined to be suspendable, based on the context information stored in the storage means. That is, because only the necessary parts are analyzed and the other parts are suspended, the cover data and table-of-contents data can be generated while keeping down the amount of data converted into the representation format that the apparatus itself can process.
[0018]
The invention described in claim 2 is an information processing method for an information processing apparatus that comprises setting means, analysis means, determination means, storage means, and control means, and that analyzes a structured document described in a markup language and composed of a plurality of chapters each having a plurality of sections and performs predetermined processing. In this method, the setting means sets, as a predetermined condition for determining whether or not to suspend analysis, the parts other than those necessary for generating cover data or table-of-contents data of the structured document; the analysis means analyzes the structured document and converts it into a representation format that can be processed by the apparatus itself; the determination means determines, for each element constituting the structured document, whether analysis can be suspended according to the predetermined hold condition; and the storage means stores, when the determination means determines that suspension is possible, the element whose analysis is suspended and the context information to be referred to when that element is analyzed. In generating the cover data or table-of-contents data of the structured document, after the analysis means has analyzed the parts necessary for generating the cover data or table-of-contents data, the control means controls the analysis means so as to sequentially resume analysis of the elements that were determined by the determination means to be suspendable and stored by the storage means, based on the context information stored by the storage means.
[0019]
According to the invention described in claim 2, the setting means sets, as the predetermined condition for determining whether or not to suspend analysis, the parts other than those necessary for generating the cover data or table-of-contents data of the structured document, and the analysis means analyzes the structured document, which is described in a markup language and composed of a plurality of chapters each having a plurality of sections, and converts it into a representation format that can be processed by the apparatus itself. The determination means determines, for each element constituting the structured document, whether analysis can be suspended according to the predetermined hold condition. When the determination means determines that suspension is possible, the storage means stores the element whose analysis is suspended and the context information to be referred to when that element is analyzed (for example, incidental information of the element).
[0020]
For example, a structured document described in a markup language such as XML can be converted, by lexically analyzing its content, into a representation format such as a tree structure that the apparatus itself can process. When XML is used, an XML parser, a DOM interface, and the like can be used. During this analysis, elements that can be suspended are put on hold, and those elements and their context information are stored.
[0022]
In this way, when a structured document is analyzed, elements that can be suspended are put on hold without being analyzed and are analyzed later. This suppresses the simultaneous execution of multiple processing routines and increases in work memory, and allows data that is no longer needed after analysis to be deleted successively, so the amount of data that accumulates during analysis can be kept down. In other words, the increase in data volume when structured information is analyzed can be suppressed for text with little structural similarity.
[0024]
Then, in generating the cover data or table-of-contents data of the structured document, after the analysis means has analyzed the parts necessary for generating the cover data or table-of-contents data, the control means controls the analysis means so as to sequentially resume analysis of the elements that were determined by the determination means to be suspendable and stored by the storage means, based on the context information stored by the storage means. As a result, because only the necessary parts are analyzed and the other parts are suspended, the cover data and table-of-contents data can be generated while keeping down the amount of data converted into the representation format that the apparatus itself can process.
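The staged flow (cover data first, then resumption of the suspended sections using saved context) can be sketched as follows. This is a bookkeeping model only: the tag names, the `BOOK` sample, and the function `build_cover_then_toc` are hypothetical, and because `ElementTree` has already parsed the whole input, the sketch shows which parts are handled at which stage and what context is saved, not the actual memory savings.

```python
import xml.etree.ElementTree as ET

# Hypothetical book-like input; the tag names are assumptions for illustration.
BOOK = """<book>
  <chapter><title>Basics</title>
    <section><title>Setup</title><p>body A</p></section>
    <section><title>Usage</title><p>body B</p></section>
  </chapter>
  <chapter><title>Advanced</title>
    <section><title>Tuning</title><p>body C</p></section>
  </chapter>
</book>"""


def build_cover_then_toc(xml_text):
    root = ET.fromstring(xml_text)
    cover, deferred = [], []
    # Stage 1: handle only what the cover needs; suspend sections with context.
    for number, chapter in enumerate(root.findall("chapter"), start=1):
        cover.append(f"{number}. {chapter.findtext('title')}")
        for section in chapter.findall("section"):
            deferred.append({"chapter_number": number, "element": section})
    # Stage 2: resume suspended sections in order, using the saved chapter number.
    toc, counters = [], {}
    for item in deferred:
        chap = item["chapter_number"]
        counters[chap] = counters.get(chap, 0) + 1
        toc.append(f"{chap}.{counters[chap]} {item['element'].findtext('title')}")
    return cover, toc


cover, toc = build_cover_then_toc(BOOK)
```

The saved `chapter_number` plays the role of the context information: without it, the resumed section could not be numbered correctly once the chapter loop has moved on.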
[0025]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an example of an embodiment of the present invention will be described in detail with reference to the drawings.
[0026]
FIG. 1 is a block diagram showing a configuration of an information processing apparatus 10 according to an embodiment of the present invention. The information processing apparatus 10 according to the embodiment of the present invention will be described as being applicable to a printer. The information processing apparatus 10 according to the embodiment of the present invention can convert a structured document described in a markup language transmitted from the outside into an internal representation format and record it on a recording medium such as paper. In this embodiment, an example using XML as a markup language will be described.
[0027]
As shown in FIG. 1, the information processing apparatus 10 includes main control means 12, data acquisition means 18, hold condition setting means 16, data analysis means 14, storage means 20, analysis context storage means 22, and analysis context expansion means 24.
[0028]
The main control means 12 corresponds to the control means of the present invention, and controls the data acquisition means 18, holding condition setting means 16, data analysis means 14, storage means 20, analysis context storage means 22, and analysis context expansion means 24, respectively. Then, control for analyzing the data of the structured document and converting it into the internal representation format is performed.
[0029]
The data acquisition means 18 acquires the data to be processed from outside. For example, it acquires, via a network (for example, the Internet or a LAN (Local Area Network)), a print request that a user issues from a computer and that consists of the data of a structured document described in XML.
[0030]
The hold condition setting means 16 sets the conditions for deciding whether to suspend the data processing that analyzes each element when the data of the structured document acquired from outside by the data acquisition means 18 is converted into the internal representation format by the data analysis means 14. For example, as shown in FIG. 2, by setting a hold condition, a resume condition, a saved context condition, and the like, the elements to be suspended can be determined, during processing by XSLT (Extensible Stylesheet Language Transformations) or the like, from the information of the element being transformed. Note that the hold condition setting means 16 may be provided outside the information processing apparatus 10, for example in an external system that requests the XML analysis.
[0031]
The data analysis means 14 corresponds to the analysis means of the present invention. It lexically analyzes the data of the structured document acquired from outside by the data acquisition means 18 and converts each element constituting the document into an internal representation format such as a tree structure. For example, the conversion into the internal representation format uses an XML parser, a DOM interface, and the like.
[0032]
The data analysis means 14 also corresponds to the determination means of the present invention. During data analysis, it is controlled by the main control means 12 so as to suspend the data processing that analyzes each element in accordance with the hold condition set by the hold condition setting means 16. For example, it can suspend the processing of tag sets belonging to a different namespace, process part of the data first and suspend the rest, or suspend references to not-yet-analyzed parts during sequential processing.
[0033]
The storage unit 20 stores the data of the structured document acquired by the data acquisition unit 18, the holding condition set by the holding condition setting unit 16, the analysis result by the data analysis unit 14, the analysis context information, and the like.
[0034]
The analysis context storage means 22 corresponds to the storage means of the present invention. When the data processing in which the data analysis means 14 analyzes each element is suspended in accordance with the hold condition set by the hold condition setting means 16, the analysis context storage means 22 extracts the analysis context information that will be needed at resumption and stores it in the storage means 20.
[0035]
For example, the analysis context storage unit 22 may store information based on the analysis context such as information for resolving the namespace of the XML element, information on automatic assignment of chapter / section numbers, and the like.
[0036]
In addition, when processing that includes a reference by XPath (XML Path Language) occurs and the reference points to an unresolved element (for example, one located at the end of the XML data), the analysis context storage means 22 may store the intermediate state of the processing in the storage means 20. Information that has already become an analysis target but whose analysis was suspended may be referred to each time during the analysis of subsequent document information, or after that analysis is finished, and may be analyzed by the data analysis means 14 at the point when it is determined to be analyzable.
[0037]
Further, the analysis context storage unit 22 may store the processing contents such as the analysis method to be used when the analysis is continued and the internal representation format creation method in the storage unit 20.
[0038]
The analysis context expansion unit 24 reads out the analysis context information stored by the analysis context storage unit 22 from the storage unit 20 and expands it when resuming the analysis of each suspended element.
[0039]
Next, the data processing executed under the control of the main control means 12 will be described with reference to the flowchart of FIG. 3.
[0040]
In step 100, the hold condition setting means 16 sets the hold conditions for temporarily suspending the analysis processing when each element of the structured document is analyzed. For example, as shown in FIG. 2, each setting consists of a set of a hold condition, a condition for resuming processing, and the context information kept while suspended. The example in the first row of the figure represents a setting in which analysis is suspended when the namespace (the type of data being processed) changes (hold condition); the mapping information between namespace prefixes and namespace identifiers and the type of the suspended data (saved context) are stored in the storage means 20 by the analysis context storage means 22 as the analysis context information needed at resumption; and analysis resumes when the current processing, that is, the current namespace, ends (resume condition). The example in the second row represents a setting in which analysis is suspended when a reference is made to a not-yet-analyzed part (hold condition); reference information such as the reference source and reference destination (saved context) is stored in the storage means 20 by the analysis context storage means 22 as the analysis context information needed at resumption; and analysis resumes when the referenced part becomes the analysis target (resume condition). The example in the third row represents a setting in which, in table-of-contents generation processing or the like, analysis is suspended when a section element is encountered while chapter numbers are being assigned (hold condition); the number of the chapter to which the suspended section belongs (saved context) is stored in the storage means 20 by the analysis context storage means 22 as the analysis context information needed at resumption; and analysis resumes when chapter number assignment ends (resume condition).
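The three rows described for FIG. 2 can be written down as a small rule table. The Python sketch below is one possible encoding, not the format used by the apparatus: the class `HoldRule`, its field names, and the wording of each entry are paraphrases introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HoldRule:
    hold_condition: str     # when to suspend analysis
    resume_condition: str   # when the suspended analysis may resume
    saved_context: tuple    # context information kept while suspended


# The three rows described for FIG. 2, paraphrased; all names are assumptions.
HOLD_RULES = (
    HoldRule("namespace changes", "current namespace ends",
             ("prefix-to-namespace mapping", "suspended data type")),
    HoldRule("reference to unanalyzed part",
             "referenced part becomes the analysis target",
             ("reference source", "reference destination")),
    HoldRule("section element during chapter numbering",
             "chapter numbering ends",
             ("chapter number",)),
)


def context_to_save(hold_condition):
    """Look up which context items to save for a given hold condition."""
    for rule in HOLD_RULES:
        if rule.hold_condition == hold_condition:
            return rule.saved_context
    return ()
```

Keeping the hold condition, resume condition, and saved context together in one record mirrors the description that each setting is a set of these three items.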
[0041]
In step 102, the data analysis means 14 performs the data analysis process. In this process, analysis proceeds while being suspended and resumed according to the conditions set in step 100. When all the data has been analyzed, the data processing ends.
[0042]
The data analysis process performed by the data analysis means 14 is performed according to the flowchart shown in FIG. 4, for example.
[0043]
That is, in step 200, the data analysis means 14 determines whether the data whose analysis is pending includes any data whose analysis can be resumed. This determination is made based on the location in the original data currently being analyzed and the restart conditions of the pending information. If the determination is negative, either because there is no pending data (for example, at the start of the analysis process) or because none of the pending data can be resumed, the process proceeds to step 202, where the data analysis means 14 determines whether there is unprocessed data.
[0044]
If there is no unprocessed data, the determination in step 202 is negative and the data analysis process is terminated. If there is unprocessed data, the determination in step 202 is affirmative and the process proceeds to step 204.
[0045]
In step 204, the data analysis means 14 analyzes the original data. The original data is lexically analyzed to extract processing units such as individual tags and text parts.
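As a rough illustration of the lexical step, the sketch below splits a markup string into tag units and text units with a regular expression. The function name `split_units` is hypothetical, and the pattern is a simplification that ignores comments, CDATA sections, and `>` characters inside attribute values; a real analyzer would need a proper XML tokenizer.

```python
import re

# A tag is "<...>"; anything else up to the next "<" is a text part.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def split_units(data):
    """Return the tags and non-blank text parts of `data`, in document order."""
    return [token for token in TOKEN.findall(data) if token.strip()]


units = split_units("<chapter><title>Intro</title>\n<section>Scope</section></chapter>")
print(units)
```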
[0046]
Next, in step 206, the data analysis means 14 determines whether analysis of the processing unit extracted in step 204 should be suspended. The determination is made by checking whether a hold condition set by the hold condition setting means 16 is satisfied. If a hold condition is satisfied, the determination is affirmative and the process proceeds to step 208, where the original data to be held and its analysis context information are stored in the storage means 20, and the process returns to step 200. If no hold condition is satisfied in step 206, the determination is negative and the process proceeds to step 207, where the extracted part is converted into the internal representation format and retained, and the process then returns to step 200. In other words, the original data is analyzed, and if the part currently being processed does not need to be held, it is converted into the internal representation format before the process returns to the beginning to continue the analysis.
[0047]
On the other hand, if there is resumable data in step 200, that is, if the determination is affirmative, the process proceeds to step 210, where the analysis context information stored in the storage means 20 is expanded by the analysis context expansion means 24.
[0048]
In step 212, the data analysis means 14 analyzes the data based on the expanded analysis context information. When the analysis is completed, the process proceeds to step 214, where the expanded context information is deleted and the original analysis context information is restored, and the process returns to step 200. That is, if there is resumable data, the saved analysis context information is expanded, the analysis is performed based on the expanded analysis context, and the original analysis context is restored after the analysis is completed.
[0049]
As described above, in the data analysis process of the flowchart shown in FIG. 4, the original data of a structured document described in a markup language such as XML is analyzed sequentially, and any part whose analysis should be suspended is stored in the storage means 20. Each time a subsequent element is analyzed, it is determined whether any suspended analysis can be resumed, and the parts that can be resumed are analyzed after their analysis context information is restored. This suppresses an increase in work memory and suppresses the growth in data volume that occurs when a structured document is converted into an internal representation format.
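The loop of steps 200 to 214 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of the apparatus: the function name `analyze_interleaved`, the callback parameters, and the dictionary-based units and context are all hypothetical. The sample data uses a held reference that is resumed once its target has been analyzed.

```python
from collections import deque


def analyze_interleaved(units, should_hold, save_context, can_resume, convert):
    """Sketch of the FIG. 4 loop; returns (internal, still_pending)."""
    source = deque(units)           # unprocessed original data
    pending = []                    # held (unit, saved context) pairs
    internal = []                   # internal representation built so far
    context = {"analyzed": set()}   # current analysis context (assumed shape)

    def analyze(unit, ctx):
        internal.append(convert(unit, ctx))
        if "id" in unit:
            context["analyzed"].add(unit["id"])

    while True:
        # Step 200: is any held unit resumable under the current context?
        idx = next((i for i, (_, saved) in enumerate(pending)
                    if can_resume(saved, context)), None)
        if idx is not None:
            unit, saved = pending.pop(idx)
            expanded = {**context, **saved}   # step 210: expand saved context
            analyze(unit, expanded)           # step 212: analyze held unit
            del expanded                      # step 214: drop it; `context` is intact
            continue
        if not source:                        # step 202: nothing left to read
            break
        unit = source.popleft()               # step 204: next processing unit
        if should_hold(unit, context):        # step 206: hold condition met?
            pending.append((unit, save_context(unit, context)))  # step 208
        else:
            analyze(unit, context)            # step 207: convert and keep
    return internal, pending


units = [
    {"kind": "ref", "ref": "n1"},   # refers to a part not yet analyzed
    {"kind": "text"},
    {"kind": "note", "id": "n1"},
    {"kind": "text"},
]
internal, pending = analyze_interleaved(
    units,
    should_hold=lambda u, c: u["kind"] == "ref" and u["ref"] not in c["analyzed"],
    save_context=lambda u, c: {"target": u["ref"]},
    can_resume=lambda saved, c: saved["target"] in c["analyzed"],
    convert=lambda u, c: u["kind"],
)
print(internal)  # ['text', 'note', 'ref', 'text']
```

In this sketch the held reference is resumed as soon as its target has been converted, before the remaining text is read, which is the interleaving that distinguishes this flow from one that drains the held items only at the end.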
[0050]
A step that again determines whether analysis should be suspended may be added after the data analysis in step 212. In that case, the process proceeds to step 208 if analysis should be suspended, and to step 214 if not.
[0051]
The data processing performed as described above will now be explained using an example.
[0052]
FIG. 5 is an example of structured document data described in XML. Note that the structured document described in XML shown in FIG. 5 is a book-like structured document composed of a plurality of chapters having a plurality of sections.
[0053]
For example, when printing such a structured document with a printer, let us consider a case in which not only print data is created from the structured document but also cover data and table of contents data describing the structure of the chapter are created.
[0054]
First, the cover data, which presents the organization of the document to be printed (a list of the titles of the chapters, and so on), is created. At this time, only the element at the head of each chapter is analyzed, and analysis of the sections contained in the chapter is suspended. That is, as shown in FIG. 6, each chapter is analyzed and the sections of each chapter are put on hold. The contents of each chapter are stored in the storage means 20 without being analyzed, and information such as the chapter number of each chapter is also stored in the storage means 20 as saved context information. Because only the part necessary for creating the cover data is analyzed and the other parts are held, the amount of data converted into the internal representation can be suppressed. In addition, because only the cover data is created at this stage, the work memory required for creating it can be released once creation is complete.
[0055]
Then, after the cover data has been created, the contents of each chapter are analyzed to extract the tag data representing the sections, and the table of contents data is created. At this time, analysis of data other than the tag data representing the sections, such as the body text, is suspended again. Finally, all of the print data is created by analyzing the text of each section. When section numbers are created for the table of contents, the chapter numbers saved as context information are used. Processing in this way keeps down both the amount of converted data and the amount of work memory.
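The three stages of this example (cover, table of contents, body text) can be sketched as three passes that each resume only what the previous pass held. The sample book, the tag and attribute names, and the function names are assumptions for illustration; FIG. 5 is not reproduced here, and the regular expression stands in for a real analyzer.

```python
import re

# Hypothetical book: each chapter head is paired with its raw, unanalyzed body.
BOOK = [
    ("Introduction", '<section title="Scope">S1 text</section>'
                     '<section title="Terms">S2 text</section>'),
    ("Method", '<section title="Flow">S3 text</section>'),
]

SECTION = re.compile(r'<section title="([^"]*)">(.*?)</section>', re.S)


def build_cover(book):
    """Pass 1: analyze chapter heads only; hold each body with its chapter number."""
    cover, held = [], []
    for number, (title, raw_body) in enumerate(book, start=1):
        cover.append(f"{number}. {title}")
        held.append(({"chapter_number": number}, raw_body))  # saved context + data
    return cover, held


def build_toc(held):
    """Pass 2: resume held bodies, extract section tags, hold the section text."""
    toc, held_text = [], []
    for ctx, raw_body in held:
        for i, match in enumerate(SECTION.finditer(raw_body), start=1):
            label = f'{ctx["chapter_number"]}.{i}'  # uses the saved chapter number
            toc.append(f"{label} {match.group(1)}")
            held_text.append(({"section_number": label}, match.group(2)))
    return toc, held_text


def build_body(held_text):
    """Pass 3: analyze the held section text into print data."""
    return [f'{ctx["section_number"]}: {text}' for ctx, text in held_text]


cover, held = build_cover(BOOK)
toc, held_text = build_toc(held)
body = build_body(held_text)
print(cover)  # ['1. Introduction', '2. Method']
print(toc)    # ['1.1 Scope', '1.2 Terms', '2.1 Flow']
```

Each pass needs only the saved context from the previous one, so the structures built for an earlier pass could be released before the next pass starts.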
[0056]
When printing is performed, print data can be created and printed sequentially from the analyzed data in parallel with the data analysis. Since data that has already been printed is no longer needed, deleting the preceding, already printed data while the subsequent data is being analyzed reduces the amount of data and allows processing with a small amount of memory.
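The effect of deleting printed data during analysis can be illustrated with a page buffer that is cleared as soon as its page has been output. This is a sketch only: the function name `print_in_pages`, the page-size parameter, and the use of a returned list in place of a printer are assumptions.

```python
def print_in_pages(units, analyze, page_size):
    """Analyze units one by one, flushing and clearing the buffer per page.

    Returns (pages, peak), where `pages` stands in for printer output and
    `peak` is the largest number of analyzed units held at any one time.
    """
    page, pages, peak = [], [], 0
    for unit in units:
        page.append(analyze(unit))        # analysis continues during output
        peak = max(peak, len(page))
        if len(page) == page_size:
            pages.append(" ".join(page))  # "print" the completed page
            page.clear()                  # printed data is no longer needed
    if page:                              # flush the last, partial page
        pages.append(" ".join(page))
        page.clear()
    return pages, peak


pages, peak = print_in_pages(["a", "b", "c", "d", "e"], str.upper, page_size=2)
print(pages, peak)  # ['A B', 'C D', 'E'] 2
```

Here the number of analyzed units held at once never exceeds the page size, regardless of the length of the document.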
[0057]
In this way, because elements that can be held are held and the data analysis process proceeds sequentially, an increase in the amount of data for the held elements can be suppressed and the storage area can be used efficiently.
[0058]
Next, another example of the data analysis process performed by the data analysis means 14 will be described with reference to the flowchart of FIG. 7.
[0059]
In step 300, it is determined by the data analysis means 14 whether there is unprocessed data. If there is unprocessed data, the determination is affirmed and the routine proceeds to step 302 where the data analysis means 14 analyzes the original data (data analysis). The original data is lexically analyzed to extract processing units such as individual tags and text parts.
[0060]
Next, in step 304, the data analysis means 14 determines whether analysis of the processing unit extracted in step 302 should be suspended. The determination is made by checking whether a hold condition set by the hold condition setting means 16 is satisfied. If a hold condition is satisfied, the determination is affirmative and the process proceeds to step 306, where the original data to be held and its analysis context information are stored in the storage means 20, and the process returns to step 300. If no hold condition is satisfied in step 304, the determination is negative and the process proceeds to step 305, where the extracted part is converted into the internal representation format and retained, and the process then returns to step 300. In other words, the original data is analyzed, and if the part currently being processed does not need to be held, it is converted into the internal representation format before the process returns to the beginning to continue the analysis.
[0061]
On the other hand, if there is no unprocessed data in step 300, that is, if the determination is negative, the routine goes to step 308, where it is determined by the data analysis means 14 whether there is pending data. If the determination is negative, the data analysis process is terminated as it is, and if the determination is affirmative, the process proceeds to step 310.
[0062]
In step 310, the analysis context information stored in the storage means 20 is expanded by the analysis context expansion means 24, and the data is analyzed. That is, the analysis context information of a resumable part among the pending ones is expanded, and the analysis is performed based on the expanded analysis context information.
[0063]
As described above, in the data analysis process of the flowchart shown in FIG. 7, the original data is analyzed sequentially, and any part whose analysis should be suspended is stored. A resumable part is then selected from the stored ones, its analysis context information is restored, and it is analyzed. As in the above embodiment, this suppresses the growth in data volume that occurs when a structured document is converted into an internal representation format.
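In contrast to the FIG. 4 flow, the FIG. 7 flow reads all of the original data first and turns to the held parts only afterwards. The sketch below illustrates that ordering. The function name `analyze_then_resume`, the tuple-based units, and the choice to resume the held items in stored order until none remain are assumptions; the text does not specify how a resumable part is selected.

```python
from collections import deque


def analyze_then_resume(units):
    """Sketch of the FIG. 7 ordering: analyze everything, then resume held parts."""
    internal = []
    pending = deque()
    chapter = 0
    for kind, text in units:                 # steps 300 and 302: read next unit
        if kind == "section":                # step 304: hold condition met
            # Step 306: store the unit with the chapter number as saved context.
            pending.append(((kind, text), {"chapter_number": chapter}))
        else:                                # step 305: convert and keep
            chapter += 1
            internal.append(f"chapter {chapter}: {text}")
    while pending:                           # step 308: any held data left?
        (kind, text), saved = pending.popleft()
        # Step 310: expand the saved context and analyze the held unit.
        internal.append(f'section of chapter {saved["chapter_number"]}: {text}')
    return internal


result = analyze_then_resume([
    ("chapter", "Intro"), ("section", "Scope"),
    ("chapter", "Method"), ("section", "Flow"),
])
print(result)
```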
[0064]
[Effect of the invention]
As described above, according to the present invention, when a structured document described in a markup language is analyzed, there are provided a determination means that determines for each element whether its analysis can be held, a storage means that stores, when analysis can be held, the element and the context information to be referred to, and a control means that controls the analysis of the element based on the stored element and context information. Because the document is converted into an expression format that the device itself can process while holdable elements are held, simultaneous execution of multiple processing routines and an increase in work memory can be suppressed, and unnecessary data can be deleted after analysis. This has the effect of suppressing an increase in the amount of data when the structured document is analyzed.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an information processing apparatus according to an embodiment of the present invention.
FIG. 2 is a table showing an example of a holding condition set by a holding condition setting unit.
FIG. 3 is a flowchart showing a flow of data processing executed by control of main control means in the information processing apparatus according to the embodiment of the present invention.
FIG. 4 is a flowchart showing a flow of data analysis processing.
FIG. 5 is a diagram illustrating an example of structured document data described in XML.
FIG. 6 is a schematic diagram showing the flow of data analysis processing of the present invention.
FIG. 7 is a flowchart showing another example of data analysis processing.
[Explanation of symbols]
10 Information processing apparatus
12 Main control means
14 Data analysis means
16 Hold condition setting means
18 Data acquisition means
20 Storage means
22 Analysis context storage means
24 Analysis context expansion means

Claims

An information processing apparatus comprising:
analyzing means for analyzing a structured document that is written in a markup language and composed of a plurality of chapters each having a plurality of sections, and converting the document into an expression format that can be processed by the apparatus itself;
determination means for determining, for each element constituting the structured document, whether analysis can be suspended according to a predetermined hold condition for determining whether or not to suspend analysis;
setting means for setting, as the predetermined hold condition, the part other than the part necessary for generating cover data or table of contents data of the structured document;
storage means for storing, when the determination means determines that suspension is possible, the element whose analysis is suspended and context information to be referred to when the element is analyzed; and
control means for controlling the analyzing means so as to continue analysis of the element based on the element and the context information stored by the storage means,
wherein, in generating the cover data or the table of contents data of the structured document, after the analyzing means has analyzed the part necessary for generating the cover data or the table of contents data, the control means controls the analyzing means so as to sequentially resume analysis of the elements determined by the determination means to be suspendable, based on the context information stored in the storage means.

An information processing method for an information processing apparatus that comprises setting means, analyzing means, determination means, storage means, and control means, and that analyzes a structured document written in a markup language and composed of a plurality of chapters each having a plurality of sections to perform predetermined processing, wherein:
the setting means sets, as a predetermined hold condition for determining whether or not to suspend analysis, the part other than the part necessary for generating cover data or table of contents data of the structured document;
the analyzing means analyzes the structured document and converts it into an expression format that can be processed by the apparatus itself;
the determination means determines, for each element constituting the structured document, whether analysis can be suspended according to the predetermined hold condition;
the storage means stores, when the determination means determines that suspension is possible, the element whose analysis is suspended and the context information to be referred to when the element is analyzed; and
in generating the cover data or the table of contents data of the structured document, after the analyzing means has analyzed the part necessary for generating the cover data or the table of contents data, the control means controls the analyzing means so as to sequentially resume analysis of the elements that were determined by the determination means to be suspendable and stored by the storage means, based on the context information stored by the storage means.