JP2004287978A

JP2004287978A - Method and program for dividing structured document

Info

Publication number: JP2004287978A
Application number: JP2003080747A
Authority: JP
Inventors: Hiroyuki Numano (沼野 宏行)
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-03-24
Filing date: 2003-03-24
Publication date: 2004-10-14
Anticipated expiration: 2023-03-24
Also published as: JP3905851B2

Abstract

PROBLEM TO BE SOLVED: To reduce the memory capacity required to receive and process a structured document and to shorten the processing time.

SOLUTION: An XML (Extensible Markup Language) document reading part 112a in a computer 11 sequentially reads, from the beginning, the data of an XML document 114 stored in a storage device 111 (S1). Operating in parallel with this reading, an XML document dividing part 112b sequentially processes the data read and thereby divides and shapes the XML document 114 into new XML documents 114-i (i = 1, 2, ...), each having an XML structure (S2). Each time one XML document 114-i is divided and shaped, an XML document transmitting part 112c transfers it to a computer 12 (S3). The transferred XML document 114-i is processed by an application 122 without waiting for the next XML document (S4 and S5).

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、タグを用いて論理的に木構造で表現される入れ子構造を持つ構造化文書を処理するのに好適な、構造化文書の分割方法及びプログラムに関する。
【０００２】
【従来の技術】
現在、あらゆるデータを記述する手段として、ＸＭＬ（ＥｘｔｅｎｓｉｂｌｅＭａｒｋｕｐＬａｎｇｕａｇｅ）が広く利用されている。このＸＭＬを用いて記述された文書（データ）はＸＭＬ文書（ＸＭＬデータ）と呼ばれる。ＸＭＬ文書は、タグを用いて論理的に木構造で表現される構造化文書として知られており、以下のような特徴を持つ。第１の特徴は、ＸＭＬ文書が階層構造で表現され、無制限の入れ子構造を許す点にある。第２の特徴は、ＸＭＬ文書が繰り返し構造を持ち、無制限の不定繰り返しを許す点にある（例えば、非特許文献１参照）。
【０００３】
ＸＭＬ文書は、その記述形式から、インターネット上でのデータ可搬性に優れている。このためＸＭＬ文書は、特に異なるアプリケーション間或いは異なるシステム間のデータ交換に広く利用されるようになってきている。ＸＭＬ文書はインターネット上のＨＴＴＰ（ＨｙｐｅｒＴｅｘｔＴｒａｎｓｆｅｒＰｒｏｔｏｃｏｌ）プロトコル、或いはＨＴＴＰプロトコル上のＳＯＡＰ（ＳｉｍｐｌｅＯｂｊｅｃｔＡｃｃｅｓｓＰｒｏｔｏｃｏｌ）プロトコルに従って転送されることが多い。このプロトコルでは、ＸＭＬ文書を構成するデータ（ＸＭＬデータ）は、単純に当該データが読み込まれた順番に転送される。
【０００４】
転送されたＸＭＬ文書はアプリケーションにより処理される。現在ＸＭＬを処理する技術として以下の２種類が広く使われている。第１はＳＡＸ（ＳｉｍｐｌｅＡＰＩｆｏｒＸＭＬ）と呼ばれる技術である。ＳＡＸでは、ＸＭＬ文書が逐次的に処理されるため、階層構造を扱えない。但し、ＳＡＸでは、必要とするメモリ容量が少なくて済む。第２はＤＯＭ（ＤｏｃｕｍｅｎｔＯｂｊｅｃｔＭｏｄｅｌ）と呼ばれる技術である。ＤＯＭでは、ＸＭＬ文書全体の要素を解析してから処理されるため、階層構造を扱いやすい。しかし、必要とするメモリ容量が大きい。
【０００５】
【非特許文献１】
中山幹敏、奥井康弘著，「改訂版標準ＸＭＬ完全解説（上）／（下）」，初版，技術評論社，平成１３年４月２５日
【０００６】
【発明が解決しようとする課題】
上記したように従来技術においては、ＸＭＬ文書（タグを用いて論理的に木構造で表現される構造化文書）を構成するデータは、単純に当該データが読み込まれた順番に転送される。つまり従来技術においては、ＸＭＬ文書を転送するのに、当該ＸＭＬ文書をＸＭＬ構造として分割する手段を有していない。このため、例えばアプリケーションに転送されたＸＭＬ文書を、当該アプリケーションが上記ＤＯＭ技術或いはＳＡＸ技術を利用して処理するには、まずＸＭＬ文書全体の転送が完了している必要がある。したがって、ＸＭＬ文書をアプリケーションに転送して処理を行う場合、当該ＸＭＬ文書の転送が完了するまでアプリケーション側での処理が待たされ、処理の開始が遅れる。つまり従来技術にあっては、ＸＭＬ文書の転送開始から転送されたＸＭＬ文書を処理し終えるまでの全体の処理時間が長くなるという問題がある。また従来技術にあっては、ＸＭＬ文書全体を読み込んでから当該文書が処理されるため、必要となるメモリ容量が増加するという問題もある。
【０００７】
本発明は上記事情を考慮してなされたものでその目的は、入れ子構造を持つ構造化文書を分割して、入れ子構造を持つ複数の構造化文書に整形することにより、構造化文書を受け取って処理するのに必要となるメモリ容量を少なくすると共に処理時間を短縮できる、構造化文書の分割方法及びプログラムを提供することにある。
【０００８】
【課題を解決するための手段】
本発明の１つの観点によれば、タグを用いて論理的に木構造で表現される入れ子構造を持つ構造化文書をコンピュータにより分割する構造化文書の分割方法が提供される。この方法は、分割の対象となる構造化文書のデータを先頭から順にバッファに読み込むステップと、上記バッファに読み込まれたデータを順に処理することにより、上記分割の対象となる構造化文書（原構造化文書）を複数の新たな構造化文書に分割・整形するステップと、原構造化文書から新たな構造化文書が分割・整形される都度、当該新たな構造化文書を文書処理手段に渡すステップとを備えている。
【０００９】
このような構成においては、原構造化文書の分割された文書部分が、構造化された新たな構造化文書に整形されるため、入れ子構造を持つ構造化文書としての特徴を保持している。このため、構造化文書を処理する処理手段は、長大な構造化文書を扱うに当たって、従来技術とは異なって、全体を受け取るまで待たなくても、分割・整形された新たな構造化文書が生成される毎に、その新たな構造化文書を受け取って処理することができる。その結果、構造化文書を処理するのに必要とされるメモリ容量の減少、処理手段を利用する利用者への応答性能向上を図ることができる。
【００１０】
ここで、原構造化文書から新たな構造化文書を分割・整形するのに、上記分割・整形ステップに、以下のステップ、即ち、読み込まれたデータが開始タグの場合に、当該開始タグに対応する終了タグをバッファ上で強制的に挿入するステップと、原構造化文書からのデータの読み込みの途中のために構造が特定できない部分に対応し、構造化文書の分割の途中地点であることを示す区切りタグを上記バッファ上で挿入するステップとを含めるとよい。この区切りタグを含む新たな構造化文書を処理手段が処理する場合、当該区切りタグを検出するまでは当該新たな構造化文書の先頭から処理を行い、当該区切りタグを検出した段階で処理を中断して、次の新たな構造化文書を受け取った際に、上記区切りタグを当該次の新たな構造化文書に置き換えて処理を再開すればよい。
【００１１】
ここで、上記区切りタグを挿入するステップでは、バッファに開始タグが格納されていない状態で当該バッファに開始タグが読み込まれた場合、上記区切りタグが当該開始タグの後に挿入され、上記終了タグを挿入するステップでは、開始タグの後に挿入された区切りタグの後に、当該開始タグに対応する終了タグが挿入される構成とするとよい。
【００１２】
また、原構造化文書から新たな構造化文書を分割・整形するために、区切りタグが挿入された後に新たに開始タグがバッファに読み込まれた場合には、当該新たに読み込まれた開始タグをバッファ上で上記区切りタグの前に移動し、この区切りタグの後に上記新たに読み込まれた開始タグに対応する終了タグを挿入するとよい。同様に原構造化文書から新たな構造化文書を分割・整形するために、テキストが読み込まれた場合には、当該テキストをバッファ上で区切りタグの前に移動するとよい。同様に、読み込まれたデータが終了タグで、且つ当該終了タグと同一の終了タグが強制的に挿入された構造化文書が既に処理手段に渡されている場合には、読み込まれた終了タグを、終了タグが存在することを処理手段に通知するためのダミー終了タグに置き換えるとよい。このダミー終了タグを含む構造化文書で区切りタグが置き換えられた構造化文書を処理手段が処理する場合、そのダミー終了タグを削除し、その位置に、当該ダミー終了タグの数に一致する数の、当該区切りタグが置き換えられた構造化文書の最後尾側の終了タグを移動すればよい。
【００１３】
【発明の実施の形態】
以下、本発明の実施の形態につき図面を参照して説明する。図１は本発明の一実施形態に係るコンピュータシステムの構成を示すブロック図である。図１のシステムは、物理的に異なる２つのコンピュータ１１及び１２と、当該コンピュータ１１及び１２を接続するネットワーク通信路１３とから構成される。
【００１４】
コンピュータ１１は、磁気ディスク装置に代表される記憶装置１１１と当該コンピュータ１１上で動作するＸＭＬ文書分割送信装置１１２と文書バッファ１１３とを備えている。記憶装置１１１には、ＸＭＬ文書（ＸＭＬドキュメント）１１４が格納されている。ＸＭＬ文書１１４は、タグを用いて論理的に木構造で表現される入れ子構造を持つ構造化文書である。つまりＸＭＬ文書１１４は、タグにより構造化された性質を持つ構造化文書である。記憶装置１１１にはまた、コンピュータ１１内のＣＰＵ（図示せず）によって実行されるプログラム（構造化文書の分割送信用プログラム）１１５が格納されている。
【００１５】
ＸＭＬ文書分割送信装置１１２は、コンピュータ１１内のＣＰＵがプログラム１１５を実行することにより実現される機能ブロックである。ＸＭＬ文書分割送信装置１１２は、ＸＭＬ文書読み取り部１１２ａと、ＸＭＬ文書分割部１１２ｂと、ＸＭＬ文書送信部１１２ｃとを有する。ＸＭＬ文書読み取り部１１２ａは、記憶装置１１１に格納されているＸＭＬ文書１１４をコンピュータ１２に転送する必要がある場合、当該ＸＭＬ文書１１４のデータを先頭から順に読み取る機能を有する。ＸＭＬ文書分割部１１２ｂは、ＸＭＬ文書読み取り部１１２ａによるＸＭＬ文書１１４の読み取りと並行して、当該ＸＭＬ文書読み取り部１１２ａにより読み取られたＸＭＬ文書１１４を順次分割する機能を有する。ＸＭＬ文書分割部１１２ｂはまた、分割されたＸＭＬ文書１１４の部分を、ＸＭＬとしての構造を持つ１つの新たなＸＭＬ文書に整形する機能を有する。ＸＭＬ文書送信部１１２ｃは、ＸＭＬ文書分割部１１２ｂにより分割されてＸＭＬ文書に整形される毎に、そのＸＭＬ文書をネットワーク通信路１３を介してコンピュータ１２に送信する機能を有する。文書バッファ１１３は、ＸＭＬ文書読み取り部１１２ａ、ＸＭＬ文書分割部１１２ｂ及びＸＭＬ文書送信部１１２ｃにより処理されるデータを一時格納するのに用いられる。
【００１６】
コンピュータ１２は、受信装置１２１と、アプリケーション（アプリケーションプログラム）１２２とを備えている。受信装置１２１は、コンピュータ１１からネットワーク通信路１３経由で送信されたＸＭＬ文書を受信するＸＭＬ文書受信部１２１ａと、当該ＸＭＬ文書受信部１２１ａにより受信されたＸＭＬ文書をアプリケーション１２２に渡すアプリケーションプログラムインタフェース（以下、ＡＰＩと称する）１２１ｂとを有する。アプリケーション１２２は、ＡＰＩ１２１ｂから渡されたＸＭＬ文書を処理するための処理ルーチンを含む。
【００１７】
次に、図１のシステムにおいて、コンピュータ１１側でＸＭＬ文書を分割・整形してコンピュータ１２に転送し、その分割・整形されたＸＭＬ文書をコンピュータ１２側で処理する動作の概要について、図２の動作説明図を参照して説明する。
【００１８】
コンピュータ１１において、ＸＭＬ文書分割送信装置１１２内のＸＭＬ文書読み取り部１１２ａは、記憶装置１１１にアクセスして、コンピュータ１２に転送すべきＸＭＬ文書１１４のデータを先頭から順に読み取る（ステップＳ１）。ＸＭＬ文書分割送信装置１１２内のＸＭＬ文書分割部１１２ｂは、ＸＭＬ文書読み取り部１１２ａによるＸＭＬ文書１１４の読み取りと並行して動作する。そしてＸＭＬ文書分割部１１２ｂは、ＸＭＬ文書読み取り部１１２ａにより読み取られたＸＭＬ文書１１４のデータを順に処理することにより、当該ＸＭＬ文書１１４を、ＸＭＬとしての構造を持つ複数の新たなＸＭＬ文書１１４−ｉ（ｉ＝１，２…）に分割・整形する（ステップＳ２）。このＸＭＬ文書分割部１１２ｂによる分割・整形処理は文書バッファ１１３上で行われる。ＸＭＬ文書分割送信装置１１２内のＸＭＬ文書送信部１１２ｃは、ＸＭＬ文書１１４から１つのＸＭＬ文書１１４−ｉが分割・整形される都度、当該ＸＭＬ文書１１４−ｉをネットワーク通信路１３経由でコンピュータ１２に転送（送信）する（ステップＳ３）。このようにしてコンピュータ１１からコンピュータ１２へは、ＸＭＬ文書１１４に対応する複数のＸＭＬ文書１１４−ｉが順次転送される。
【００１９】
コンピュータ１１のＸＭＬ文書送信部１１２ｃからネットワーク通信路１３経由で転送されたＸＭＬ文書１１４−ｉは、コンピュータ１２のＸＭＬ文書受信部１２１ａで受信される。ＸＭＬ文書受信部１２１ａで受信されたＸＭＬ文書１１４−ｉはＡＰＩ１２１ｂによりアプリケーション１２２に渡される（ステップＳ４，Ｓ５）。アプリケーション１２２は、ＡＰＩ１２１ｂから渡されたＸＭＬ文書１１４−ｉがＸＭＬとしての構造を持っていることから、本来処理すべきＸＭＬ文書１１４−ｉの全体が転送完了するのを待つことなく、且つＸＭＬ構造の利点を備えたままで当該ＸＭＬ文書１１４−ｉを処理（利用）する。但し、アプリケーション１２２でＸＭＬ文書１１４−ｉを処理するには、当該ＸＭＬ文書１１４−ｉが、元のＸＭＬ文書（原ＸＭＬ文書）１１４を後述する手順で分割・整形することにより取得されたものであることを、当該アプリケーション１２２が認識する必要がある。
【００２０】
最近、ＸＭＬ構造を用いて各種データを保存するシステムの利用の広がりが目覚しい。そこで今後を展望すると、ＸＭＬ文書のデータ構造の複雑さ或いはサイズは増大していくことが想定される。また、利用されるコンピュータ機器も大規模なものからコンパクトでメモリ等の資源の少ないものまで幅広くなることが考えられる。そのため、データ構造の複雑さ或いはサイズの増大したＸＭＬ文書を対象にその構造を生かした処理を行おうとする場合、必要となるメモリ容量が増大する可能性がある。したがって本実施形態で適用される、ＸＭＬ文書を分割・整形し、複数の小規模のＸＭＬ文書として順次アプリケーション１２２側に渡す構成は、近い将来考えられるサイズの大きいＸＭＬ文書をメモリ等資源の少ないコンピュータ機器で処理する利用形態に最適である。
【００２１】
次に、コンピュータ１１側でＸＭＬ文書を分割・整形してコンピュータ１２に転送し、その分割・整形されたＸＭＬ文書をコンピュータ１２側で処理する動作の詳細について、（１）ＸＭＬ文書の特徴、（２）ＸＭＬ文書の分割・整形・送信、（３）ＸＭＬ文書の処理の順で説明する。
【００２２】
（１）ＸＭＬ文書の特徴
従来技術でも述べたように、ＸＭＬ文書は、ＸＭＬ文書が階層構造で表現され、無制限の入れ子構造を許す点と、ＸＭＬ文書が繰り返し構造を持ち、無制限の不定繰り返しを許す点とに特徴がある。このＸＭＬ文書の階層構造は、＜ＴＡＧ＞及び＜／ＴＡＧ＞に代表される、いわゆる開始タグ及び終了タグを用いて記述される。ここで、ＸＭＬ文書中の一部の開始タグと終了タグとで囲われた範囲は、ＸＭＬ文書全体に対する部分ではあるが、この部分もまたＸＭＬとしての構造を持つことに着目する。
【００２３】
図３は、この「ＸＭＬ文書の一部もＸＭＬ文書としての構造を持つ」ことの一例を示す。図３において、ＸＭＬ文書１１４の開始タグ＜ｔｉｔｌｅ＞から終了タグ＜／ｔｉｔｌｅ＞までの部分３０１はＸＭＬ文書としての構造を持ち、したがってＸＭＬ文書といえる。同様に、開始タグ＜ａｂｓｔｒａｃｔ＞から終了タグ＜／ａｂｓｔｒａｃｔ＞までの部分３０２も、ＸＭＬ文書といえる。
【００２４】
またＸＭＬ文書では、開始タグ＜ＴＡＧ＞が当該文書の中に現れた際には、必ずいずれは終了タグ＜／ＴＡＧ＞が現れることが保証される。図３のＸＭＬ文書１１４の例では、まず開始タグ＜ｄｏｃｕｍｅｎｔ＞が現れる。この＜ｄｏｃｕｍｅｎｔ＞が読み込まれた時点ではいつ終了タグ＜／ｄｏｃｕｍｅｎｔ＞が現れるかは分からないが、最終的には当該＜／ｄｏｃｕｍｅｎｔ＞が必ず現れることは分かる。
【００２５】
（２）ＸＭＬ文書の分割・整形・送信
次に、ＸＭＬ文書の分割・整形・送信の手順、即ち上述したＸＭＬ文書の特徴を利用して、当該ＸＭＬ文書を複数の新たなＸＭＬ文書に分割・整形し、その新たなＸＭＬ文書を送信する手順について説明する。ここで、分割・整形されたＸＭＬ文書の送信を分割送信と呼ぶ。
【００２６】
まず、上記手順の概要について述べる。本実施形態において、記憶装置１１１に格納されているＸＭＬ文書１１４のデータは、ＸＭＬ文書読み取り部１１２ａにより先頭から順に読み取られる。この読み取りと並行して、読み取られたＸＭＬ文書１１４のデータがＸＭＬ文書分割部１１２ｂにより順次処理される。このＸＭＬ文書分割部１１２ｂの処理の特徴は、次の通りである。
【００２７】
（２ａ）ＸＭＬ文書（原ＸＭＬ文書）を分割する単位（分割単位）が初期設定される。
（２ｂ）開始タグが現れた時点で終了タグが強制的に挿入される。
（２ｃ）“分割途中地点”であることを示す区切りタグ＜ｘ／＞が定義され、読み込みが完了していない状態では、構造がわからない部分が区切りタグ＜ｘ／＞で置き換えられる。
（２ｄ）ＸＭＬ文書の分割は分割単位（ここでは文字数であり、例えば６４文字）を上限に行われる。本実施形態において、分割単位にカウントされる文字は、開始タグを構成する文字とテキストを構成する文字のみである。また、タグの途中での分割は禁止される。これに対し、テキストの途中での分割は許される。但し、テキスト途中で分割した場合に、次に分割送信される残りのテキスト部分がテキスト分割の印であるテキスト分割開始タグ＜ｙ＞及びテキスト分割終了タグ＜／ｙ＞で囲まれるようにする。これにより、テキスト途中で分割しても、ＸＭＬ整形式のＸＭＬ文書に整形でき、タグ以外で始まる文書、つまりＸＭＬ整形式でない文書が分割送信されるのを防止できる。
（２ｅ）開始タグが現れた時点で終了タグが強制的に挿入されて分割送信されると、その終了タグが本来含まれているＸＭＬ文書の部分は、ＸＭＬの整形式でなくなってしまう。そこで、「終了タグ」が実際に現れた際に、「終了タグ」が存在することを意味するダミー終了タグ＜ｚ／＞を適用することで、分割送信において「本来の終了タグが現れた」ことが受信側に通知されるようにする。
次に、ＸＭＬ文書の分割・整形・送信の手順の詳細について、図３に示すＸＭＬ文書１１４を分割して送信する場合を例に、図４乃至図７のフローチャート並びに図８乃至図１０のＸＭＬ文書分割・整形状態遷移図を参照して説明する。
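The counting rule of item (2d) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `counted_length` and `split_text` are hypothetical, the exclusion of self-closing tags from the count is an assumption (the text only says that start tags and text are counted), and the value 59 used below for the characters already counted is assumed for illustration rather than taken from the figures.

```python
SPLIT_UNIT = 64                                  # division unit of the embodiment
TEXT_SPLIT_OPEN, TEXT_SPLIT_CLOSE = "<y>", "</y>"


def counted_length(tokens):
    """Count only characters of start tags and text, as item (2d) states.

    End tags are not counted. Self-closing tags are also skipped here,
    which is an assumption made for this sketch.
    """
    total = 0
    for tok in tokens:
        is_tag = tok.startswith("<")
        is_start = is_tag and not tok.startswith("</") and not tok.endswith("/>")
        if is_start or not is_tag:
            total += len(tok)
    return total


def split_text(text, used, unit=SPLIT_UNIT):
    """Split text so the head fills the current piece up to `unit`.

    Returns (head, wrapped_rest). wrapped_rest is the remainder enclosed
    in <y>...</y> for the next piece, or "" if the text fits as a whole.
    """
    room = max(unit - used, 0)
    if len(text) <= room:
        return text, ""
    head, rest = text[:room], text[room:]
    return head, f"{TEXT_SPLIT_OPEN}{rest}{TEXT_SPLIT_CLOSE}"


# With 59 counted characters already in the piece (assumed value),
# only five more characters of the date fit.
head, rest = split_text("20030101", used=59)
```

Because a tag is never cut, only the text branch needs a split helper; a start tag that does not fit is deferred to the next piece as a whole.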
【００２８】
まず、コンピュータ１１内のＸＭＬ文書分割部１１２ｂはＸＭＬ文書の分割・整形のための初期設定を行う（ステップＳ１１）。ここでは、ＸＭＬ文書の分割単位＝６４文字、区切りタグ＝＜ｘ／＞、テキスト分割開始タグ＝＜ｙ＞、テキスト分割終了タグ＝＜／ｙ＞、及びダミー終了タグ＝＜ｚ／＞が初期設定される。また２種のフラグＦ１及びＦ２が、例えばＦ１＝０，Ｆ２＝１に初期設定される。フラグＦ１はＦ１＝１のとき、分割されたテキストの残りの部分を処理すべきことを示す。フラグＦ２はＦ２＝１のとき、次に現れる開始タグは新たなＸＭＬ文書の先頭の開始タグとなることを示す。
【００２９】
以上の初期化処理が行われると、コンピュータ１１内のＸＭＬ文書読み取り部１１２ａは、ＸＭＬ文書（原ＸＭＬ文書）１１４のデータを、先頭から順に最後まで、タグについてはタグ単位に、テキストについてはテキスト単位に読み込む（ステップＳ１２〜Ｓ１４）。ＸＭＬ文書読み取り部１１２ａにより読み込まれたデータは、読み込まれた順番に文書バッファ１１３に格納される。図３のＸＭＬ文書１１４の例では、まず、先頭の開始タグ＜ｄｏｃｕｍｅｎｔｉｄ＝”ｄｏｃ０１”＞が読み込まれて文書バッファ１１３に格納される。つまり、開始タグ＜ｄｏｃｕｍｅｎｔｉｄ＝”ｄｏｃ０１”＞が文書バッファ１１３に読み込まれる。この例のように、開始タグが文書バッファ１１３に読み込まれ、且つフラグＦ２が１の場合（ステップＳ１５，Ｓ１６）、コンピュータ１１内のＸＭＬ文書分割部１１２ｂは、文書バッファ１１３内の当該読み込まれた開始タグ（即ち＜ｄｏｃｕｍｅｎｔｉｄ＝”ｄｏｃ０１”＞）の後に、図８（ａ）に示すように区切りタグ＜ｘ／＞を挿入する（ステップＳ１７）。次にＸＭＬ文書分割部１１２ｂは、図８（ｂ）に示すように、文書バッファ１１３に格納されている区切りタグ＜ｘ／＞の後に、開始タグ＜ｄｏｃｕｍｅｎｔｉｄ＝”ｄｏｃ０１”＞に対応する終了タグ＜／ｄｏｃｕｍｅｎｔ＞を強制的に挿入する（ステップＳ１８）。そしてＸＭＬ文書分割部１１２ｂは、この例のようにフラグＦ２が１の場合、当該フラグＦ２を０にした後（ステップＳ１９，Ｓ２０）、ＸＭＬ文書読み取り部１１２ａに制御を渡す。
【００３０】
するとＸＭＬ文書読み取り部１１２ａは、フラグＦ１が１であるかを判定する（ステップＳ１２）。この例のようにＦ１＝１でない場合、ＸＭＬ文書読み取り部１１２ａはＸＭＬ文書１１４の最後まで読み取りが行われたか（つまりＸＭＬ文書１１４の全データを処理し終えたか）を判定する（ステップＳ１３）。この例では、ＸＭＬ文書１１４の最後までは読み取りが行われていない。この場合、ＸＭＬ文書読み取り部１１２ａはＸＭＬ文書１１４中の次のデータ（タグまたはテキスト）を文書バッファ１１３に読み込む（ステップＳ１４）。ここでは、開始タグ＜ｄｏｃｕｍｅｎｔｉｄ＝”ｄｏｃ０１”＞の次の開始タグ＜ｔｉｔｌｅ＞が文書バッファ１１３に読み込まれる。
【００３１】
このように開始タグが読み込まれ、且つフラグＦ２が０の場合（ステップＳ１５，Ｓ１６）、ＸＭＬ文書分割部１１２ｂは、その際に文書バッファ１１３に格納されている、当該開始タグを含む未送信のデータ（但し、強制的に挿入されるタグを除く）が分割単位（６４文字）を超えているかを判定する（ステップＳ２１）。この例のように分割単位を超えていない場合、ＸＭＬ文書分割部１１２ｂは、文書バッファ１１３に読み込まれた開始タグ＜ｔｉｔｌｅ＞を、図８（ｃ）に示すように区切りタグ＜ｘ／＞の前に移動する（ステップＳ２２）。
【００３２】
次にＸＭＬ文書分割部１１２ｂは、図８（ｄ）に示すように、区切りタグ＜ｘ／＞の後に、開始タグ＜ｔｉｔｌｅ＞に対応する終了タグ＜／ｔｉｔｌｅ＞を強制的に挿入する（ステップＳ１８）。そしてＸＭＬ文書分割部１１２ｂは、この例のようにフラグＦ２が０の場合、そのままＸＭＬ文書読み取り部１１２ａに制御を渡す。
【００３３】
これを受けてＸＭＬ文書読み取り部１１２ａは、ＸＭＬ文書１１４中の次のデータ、即ちテキスト「ＳａｍｐｌｅＤｏｃｕｍｅｎｔ」を文書バッファ１１３に読み込む（ステップＳ１２〜Ｓ１４）。このようにテキストが読み込まれた場合（ステップＳ１５）、ＸＭＬ文書分割部１１２ｂは、その際に文書バッファ１１３に格納されている、当該テキストを含む未送信のデータが分割単位を超えているかを判定する（ステップＳ２３）。この例のように、分割単位を超えていない場合、ＸＭＬ文書分割部１１２ｂは、文書バッファ１１３に読み込まれたテキスト「ＳａｍｐｌｅＤｏｃｕｍｅｎｔ」を、図８（ｅ）に示すように区切りタグ＜ｘ／＞の前に移動する（ステップＳ２４）。そしてＸＭＬ文書分割部１１２ｂは、ＸＭＬ文書読み取り部１１２ａに制御を渡す。
【００３４】
するとＸＭＬ文書読み取り部１１２ａは、ＸＭＬ文書１１４中の次のデータ、即ち終了タグ＜／ｔｉｔｌｅ＞を文書バッファ１１３に読み込む（ステップＳ１２〜Ｓ１４）。このように終了タグが文書バッファ１１３に読み込まれた場合（ステップＳ１５）、ＸＭＬ文書分割部１１２ｂは、当該終了タグが既に強制的に挿入されて送信済みであるかを判定する（ステップＳ２５）。この例のように、終了タグ＜／ｔｉｔｌｅ＞が送信済みでない場合、即ち今回読み込まれた終了タグ＜／ｔｉｔｌｅ＞とは別の、強制的に挿入された終了タグ＜／ｔｉｔｌｅ＞が文書バッファ１１３内に存在する場合、ＸＭＬ文書分割部１１２ｂは今回読み込まれた終了タグ＜／ｔｉｔｌｅ＞を削除すると共に、図８（ｆ）に示すように、区切りタグ＜ｘ／＞を既に強制的に挿入されている終了タグ＜／ｔｉｔｌｅ＞の後に移動する（ステップＳ２６）。そしてＸＭＬ文書分割部１１２ｂは、ＸＭＬ文書読み取り部１１２ａに制御を渡す。
【００３５】
以下、同様にして、ＸＭＬ文書読み取り部１１２ａにより、開始タグ＜ａｕｔｈｏｒ＞、テキスト「ＴａｒｏＳｕｚｕｋｉ」、終了タグ＜／ａｕｔｈｏｒ＞、開始タグ＜ｄａｔｅ＞及びテキスト「２００３０１０１」が順に読み込まれたものとする。このときの文書バッファ１１３の内容を図９（ａ）に示す。さて、テキスト「２００３０１０１」が文書バッファ１１３に読み込まれた場合（ステップＳ１４，Ｓ１５）、ＸＭＬ文書分割部１１２ｂは、当該テキストを含む文書バッファ１１３内の未送信のデータが分割単位を超えているかを判定する（ステップＳ２３）。ここでは、テキスト「２００３０１０１」中の「１０１」の部分が分割単位を超えている。この場合、ＸＭＬ文書分割部１１２ｂは、文書バッファ１１３に読み込まれたテキスト「２００３０１０１」のうち先頭から分割単位を超えない範囲で最大長のテキスト部分「２００３０」を当該文書バッファ１１３に残して、そのテキスト部分「２００３０」を、図９（ｂ）に示すように、区切りタグ＜ｘ／＞の前に移動する（ステップＳ２７）。このステップＳ２７において、ＸＭＬ文書分割部１１２ｂは残りのテキスト部分「１０１」を保持する。そしてＸＭＬ文書分割部１１２ｂは、フラグＦ１を１に設定して、既に読み込まれている未処理のテキスト部分が残っていることを示した後（ステップＳ２８）、ＸＭＬ文書読み取り部１１２ａ及びＸＭＬ文書送信部１１２ｃに制御を渡す。
【００３６】
これによりＸＭＬ文書送信部１１２ｃは、その時点において文書バッファ１１３に格納されている図９（ｂ）に示す構造のデータを、ＸＭＬ文書１１４から分割されて整形された新たなＸＭＬ文書１１４−ｉ（＝１１４−１）としてネットワーク通信路１３経由でコンピュータ１２に送信する（ステップＳ２９）。一方、ＸＭＬ文書読み取り部１１２ａは、フラグＦ１が１であることから（ステップＳ１２）、ＸＭＬ文書分割部１１２ｂには既に読み込まれている未処理のテキスト部分が残されていると判断し、そのままＸＭＬ文書分割部１１２ｂに制御を戻す。
【００３７】
これを受けてＸＭＬ文書分割部１１２ｂは、自身が保持している残りのテキスト部分「１０１」を文書バッファ１１３に格納すると共に、図１０（ａ）に示すように、当該テキスト部分「１０１」の前と後に、それぞれ、テキスト分割開始タグ＜ｙ＞とテキスト分割終了タグ＜／ｙ＞とを強制的に挿入する（ステップＳ３０）。即ちＸＭＬ文書分割部１１２ｂは、テキスト部分「１０１」をテキスト分割開始タグ＜ｙ＞とテキスト分割終了タグ＜／ｙ＞とで囲む。そしてＸＭＬ文書分割部１１２ｂは、ＸＭＬ文書読み取り部１１２ａに制御を渡す。
【００３８】
するとＸＭＬ文書読み取り部１１２ａは、ＸＭＬ文書１１４中の次のデータ、即ちテキスト「２００３０１０１」に後続する終了タグ＜／ｄａｔｅ＞を、図１０（ａ）に示すように文書バッファ１１３に読み込む（ステップＳ１２〜Ｓ１４）。この場合、ＸＭＬ文書分割部１１２ｂは、終了タグ＜／ｄａｔｅ＞が既に強制的に挿入されて送信済みであるかを判定する（ステップＳ２５）。この例のように、終了タグ＜／ｄａｔｅ＞が送信済みである場合、ＸＭＬ文書分割部１１２ｂは、分割送信において「本来の終了タグが現れた」ことをコンピュータ１２に通知するために、文書バッファ１１３に読み込まれた終了タグ＜／ｄａｔｅ＞を、図１０（ｃ）に示すようにダミー終了タグ＜ｚ／＞に置き換えて（ステップＳ３１）、ＸＭＬ文書読み取り部１１２ａに制御を渡す。
【００３９】
以下、同様にして、ＸＭＬ文書分割部１１２ｂによるＸＭＬ文書１１４を対象とする分割・整形が続けられ、図１１（ａ）に示す、図９（ｂ）に相当するＸＭＬ文書１１４−１の送信（１回目の送信）に続き、図１１（ｂ）乃至図１１（ｄ）、図１２（ａ）乃至図１２（ｄ）に示すＸＭＬ文書１１４−２乃至１１４−８の送信（２回目乃至８回目の送信）が行われる。ここで、図１１（ｃ）は、開始タグ＜ｓｅｃｔｉｏｎｉｄ＝”ｓｅｃ０１”＞が文書バッファ１１３に読み込まれたために分割単位を超えた場合に送信されるＸＭＬ文書１１４−３を示している。この開始タグの読み込みにより分割単位を超えた場合の動作について説明する。
【００４０】
今、フラグＦ２が０の状態で、ＸＭＬ文書読み取り部１１２ａにより開始タグ＜ｓｅｃｔｉｏｎｉｄ＝”ｓｅｃ０１”＞が文書バッファ１１３に読み込まれ、当該開始タグを含む文書バッファ１１３内の未送信のデータが分割単位を超えたものとする（ステップＳ１５，Ｓ１６，Ｓ２１）。この場合、ＸＭＬ文書分割部１１２ｂは、ＸＭＬ文書分割送信装置１１２の状態を開始タグ読み込み前の状態に戻し、次に処理される開始タグが新たなＸＭＬ文書の先頭の開始タグとなることを示すためにフラグＦ２を１に設定する（ステップＳ３２，Ｓ３３）。そしてＸＭＬ文書分割部１１２ｂは、ＸＭＬ文書読み取り部１１２ａ及びＸＭＬ文書送信部１１２ｃに制御を渡す。
【００４１】
これによりＸＭＬ文書送信部１１２ｃは、その時点において文書バッファ１１３に格納されている図１１（ｃ）に示すデータを新たなＸＭＬ文書１１４−ｉ（＝１１４−３）としてコンピュータ１２に送信する（ステップＳ２９）。一方、ＸＭＬ文書読み取り部１１２ａは、再度開始タグ＜ｓｅｃｔｉｏｎｉｄ＝”ｓｅｃ０１”＞を文書バッファ１１３に読み込む（ステップＳ１２〜Ｓ１４）。なお、ＸＭＬ文書分割部１１２ｂにて開始タグ＜ｓｅｃｔｉｏｎｉｄ＝”ｓｅｃ０１”＞を保持し、ステップＳ３２，Ｓ３３を行わずに、ＸＭＬ文書送信部１１２ｃによるＸＭＬ文書の送信を行わせ、しかる後に当該開始タグ＜ｓｅｃｔｉｏｎｉｄ＝”ｓｅｃ０１”＞を処理するためにステップＳ１７に分岐する構成としてもよい。また、分割単位のカウント対象をテキストだけとしても構わない。
【００４２】
次に、ＸＭＬ文書１１４の全データを処理し終えた場合の動作を説明する。ＸＭＬ文書読み取り部１１２ａは、ＸＭＬ文書１１４の全データを処理し終えた場合（ステップＳ１３）、その旨をＸＭＬ文書分割部１１２ｂに通知する。するとＸＭＬ文書分割部１１２ｂは、文書バッファ１１３から区切りタグ＜ｘ／＞を削除してＸＭＬ文書送信部１１２ｃに制御を渡す（ステップＳ３４）。これを受けてＸＭＬ文書送信部１１２ｃは、その時点において文書バッファ１１３に格納されているデータを新たなＸＭＬ文書１１４−ｉとしてコンピュータ１２に送信する（ステップＳ３５）。ここでは、図１２（ｄ）に示すＸＭＬ文書１１４−８が送信され、コンピュータ１１内のＸＭＬ文書分割送信装置１１２による一連の分割・整形・送信処理が終了する。
【００４３】
以上のＸＭＬ文書１１４を対象とするＸＭＬ文書分割送信装置１１２による分割送信と木構造（階層構造）との関係を図１３に示す。図１３において、逆Ｌ字の記号１３１−１〜１３１−７は、それぞれＸＭＬ文書１１４−１〜１１４−７の送信（１回目乃至７回目の送信）を表す。また、記号１３１−１，１３１−５〜１３１−７のように、記号１３１−ｉが木構造中のノードを表す文字列の後に位置する場合、そのノードに対応する構造のデータの送信がｉ回目の分割送信で完了したことを示す。一方、記号１３１−２〜１３１−４のように、記号１３１−ｉが木構造中のノードを表す文字列の途中に位置する場合、ｉ回目の分割送信は、そのノードに対応する構造のデータの送信途中であることを示す。
【００４４】
（３）分割送信されたＸＭＬ文書の処理
次に、コンピュータ１１からコンピュータ１２に分割送信されたＸＭＬ文書１１４−ｉ（図１１及び図１２の例ではｉ＝１〜８）の処理の概要について説明する。前記したように、コンピュータ１１（内のＸＭＬ文書送信部１１２ｃ）によりネットワーク通信路１３を経由して送信されたＸＭＬ文書１１４−ｉはコンピュータ１２内のＸＭＬ文書受信部１２１ａで受信される。このＸＭＬ文書受信部１２１ａで受信されたＸＭＬ文書１１４−ｉはＡＰＩ１２１ｂを介してアプリケーション１２２に渡され、当該アプリケーション１２２により処理される。ここで、図３に示されるＸＭＬ文書１１４が分割された部分を整形して得られる、図１１（ａ）乃至図１１（ｄ）及び図１２（ａ）乃至図１２（ｄ）に示すＸＭＬ文書１１４−１乃至１１４−８もＸＭＬとしての構造を持つ。したがって、アプリケーション１２２は、ＸＭＬ文書１１４−１乃至１１４−８を少ないメモリ容量で従来技術を用いて処理できる。なお、ＸＭＬ文書１１４−ｉ（ｉ＝１〜８）は、実際には、コンピュータ１２内のＣＰＵ（図示せず）が当該アプリケーション１２２を実行することにより処理されるが、説明の簡略化のためにアプリケーション１２２により処理されるものとする。
【００４５】
次に、分割送信されたＸＭＬ文書１１４−ｉの処理の詳細について、図１４のフローチャートを参照して説明する。アプリケーション１２２は、ＸＭＬ文書受信部１２１ａで受信されたＸＭＬ文書１１４−ｉをＡＰＩ１２１ｂから受け取ると、本来処理すべきＸＭＬ文書１１４の全体が転送完了するのを待つことなく、当該ＸＭＬ文書１１４−ｉの処理を開始する。これにより、コンピュータ１２においてＸＭＬ文書を処理するのに必要となるメモリ容量を低減できると共に、処理時間を短縮できる。
【００４６】
アプリケーション１２２は、ＸＭＬ文書１１４−ｉの処理で区切りタグ＜ｘ／＞を検出した場合（ステップＳ４１）、次のＸＭＬ文書１１４−（ｉ＋１）を受け取るまで処理を中断する（ステップＳ４２）。アプリケーション１２２は、次のＸＭＬ文書１１４−（ｉ＋１）を受け取ると（ステップＳ４２）、処理を中断していた先行するＸＭＬ文書１１４−ｉ中の区切りタグ＜ｘ／＞を、当該次のＸＭＬ文書１１４−（ｉ＋１）に置き換える（ステップＳ４３）。そしてアプリケーション１２２は、中断していたＸＭＬ文書１１４−ｉの処理を再開する（ステップＳ４４）。但し、ここで処理されるＸＭＬ文書１１４−ｉは、中断前のＸＭＬ文書１１４−ｉとは異なり、区切りタグ＜ｘ／＞がＸＭＬ文書１１４−（ｉ＋１）に置き換えられた構造となっている。そこで以下の説明では、区切りタグ＜ｘ／＞がＸＭＬ文書１１４−（ｉ＋１）に置き換えられたＸＭＬ文書１１４−ｉをＸＭＬ文書１１４−ｉ’と表現する。
【００４７】
図１１の例では、アプリケーション１２２がＸＭＬ文書１１４−２を受け取った段階で、ＸＭＬ文書１１４−１中の区切りタグ＜ｘ／＞がＸＭＬ文書１１４−２に置き換えられる。これによりアプリケーション１２２は、ＸＭＬ文書１１４−２を含む新たなＸＭＬ文書１１４−１（即ちＸＭＬ文書１１４−１’）中の新たな区切りタグ＜ｘ／＞を検出するまで（ステップＳ４１）、処理を継続することができる（ステップＳ４４）。
【００４８】
さて、区切りタグ＜ｘ／＞に置き換えられたＸＭＬ文書１１４−（ｉ＋１）に別の区切りタグ＜ｘ／＞が含まれているならば、再開後の処理で当該別の区切りタグ＜ｘ／＞が検出された段階で、当該別の区切りタグ＜ｘ／＞が更に次のＸＭＬ文書１１４−（ｉ＋２）に置き換えられる。つまりアプリケーション１２２では、区切りタグ＜ｘ／＞を分割送信されるＸＭＬ文書に置き換える動作が繰り返される。
【００４９】
またアプリケーション１２２は、ＸＭＬ文書１１４−ｉ’の処理でテキスト分割開始タグ＜ｙ＞を検出した場合（ステップＳ４５）、当該テキスト分割開始タグ＜ｙ＞及びテキスト分割終了タグ＜／ｙ＞の対をＸＭＬ文書１１４−ｉ’から削除する（ステップＳ４６）。次にアプリケーション１２２は、削除したタグ＜／ｙ＞に続くダミー終了タグ＜ｚ／＞の数ｊをカウントする（ステップＳ４７）。ＸＭＬ文書１１４−ｉ’から削除されたタグ＜／ｙ＞に続くダミー終了タグ＜ｚ／＞の数がｊの場合、当該ＸＭＬ文書１１４−ｉ’には、その最後尾側にｊ個の終了タグが含まれている。そこで、アプリケーション１２２は、ＸＭＬ文書１１４−ｉ’から削除されたタグ＜／ｙ＞に続くｊ個のダミー終了タグ＜ｚ／＞を削除し、その位置に、当該ＸＭＬ文書１１４−ｉ’の最後尾側のｊ個の終了タグを移動する（ステップＳ４８）。そしてアプリケーション１２２は、終了タグ移動後のＸＭＬ文書１１４−ｉ’の処理を継続し（ステップＳ４４）、最後まで処理できた段階で（ステップＳ４９）、全ての処理を終了する。
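One possible reading of steps S45 to S48 is sketched below in Python. It is an interpretation under stated assumptions, not the procedure of FIG. 14 itself: the function name `resolve` is hypothetical, the input is assumed to be a document in which the delimiter tag has already been replaced by the next piece, and each dummy end tag `<z/>` is taken to close the innermost element that is still open. Instead of physically moving end tags from the tail, the sketch emits the matching end tag at the position of the dummy and later drops the forced end tag it stood in for, which is assumed here to give the same result.

```python
import re

TOKEN = re.compile(r"<[^>]+>|[^<]+")   # one tag or one run of text


def resolve(merged):
    """Remove <y>/</y>, replace each <z/> by a real end tag, and drop the
    forced end tags at the tail that the dummies have made redundant.

    Limitation of this sketch: directly nested elements with the same
    name are not distinguished.
    """
    out, stack = [], []
    for tok in TOKEN.findall(merged):
        if tok in ("<y>", "</y>"):
            continue                        # rejoin the split text
        if tok == "<z/>":
            out.append(f"</{stack.pop()}>")  # close the innermost open element
        elif tok.startswith("</"):
            name = tok[2:-1]
            if stack and stack[-1] == name:
                stack.pop()
                out.append(tok)
            # otherwise: forced end tag already replaced by a dummy, drop it
        elif tok.endswith("/>"):
            out.append(tok)                 # e.g. a delimiter <x/> still pending
        elif tok.startswith("<"):
            stack.append(tok[1:-1].split()[0])
            out.append(tok)
        else:
            out.append(tok)
    return "".join(out)


# Hypothetical merged document: the first piece ended inside <date>,
# and the second piece supplied the rest of the text plus one dummy.
merged = ('<document id="doc01"><date>20030<y>101</y><z/>'
          '<abstract>Some text</abstract></date></document>')
restored = resolve(merged)
```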
【００５０】
上記実施形態では、コンピュータ１１内のＸＭＬ文書分割送信装置１１２による分割・整形の対象となるＸＭＬ文書１１４が、当該コンピュータ１１の記憶装置１１１に格納されていることを前提としている。しかし、この記憶装置１１１が、ネットワークを介して接続された別のコンピュータの記憶装置である場合にも、当該記憶装置１１１に格納されているＸＭＬ文書１１４を対象に、コンピュータ１１内のＸＭＬ文書分割送信装置１１２にて上記実施形態と同様に分割・整形することが可能である。
【００５１】
また、上記実施形態では、ＸＭＬ文書１１４の分割と分割された文書部分の整形とが全て送信側で行われることを前提としている。しかし、送信側ではＸＭＬ文書１１４を先頭から例えば一定サイズに単純に分割してその分割された文書部分（つまりＸＭＬの構造を持たない文書部分）を順次送信し、受信側で図４乃至図７のフローチャートに相当する処理を行ってアプリケーション１２２に渡すことも可能である。ここでは、分割されたＸＭＬ文書１１４の部分を受信する毎に、その先頭から順にデータの読み込みが行われる。このことは、上記実施形態において、ＸＭＬ文書１１４のデータを先頭から順に読み込むことと等価である。
【００５２】
なお、本発明は、上記実施形態に限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で種々に変形することが可能である。更に、上記実施形態には種々の段階の発明が含まれており、開示される複数の構成要件における適宜な組み合わせにより種々の発明が抽出され得る。例えば、実施形態に示される全構成要件から幾つかの構成要件が削除されても、発明が解決しようとする課題の欄で述べた課題が解決でき、発明の効果の欄で述べられている効果が得られる場合には、この構成要件が削除された構成が発明として抽出され得る。
【００５３】
【発明の効果】
以上詳述したように本発明によれば、原構造化文書のデータを先頭から順にバッファに読み込み、このバッファに読み込まれたデータを順に処理することにより上記原構造化文書を複数の新たな構造化文書に分割・整形し、この新たな構造化文書が分割・整形される都度、当該新たな構造化文書が文書処理手段による処理に供される構成としたので、当該新たな構造化文書に入れ子構造を持つ構造化文書としての特徴を持たせることができる。このため本発明によれば、構造化文書を処理する処理手段側では、長大な構造化文書を扱うに当たって、全体を受け取るまで待たなくても、分割・整形された新たな構造化文書が生成される毎に、その新たな構造化文書を受け取って処理することができる。その結果、構造化文書を処理するのに必要とされるメモリ容量の減少、処理手段を利用する利用者への応答性能向上を図ることができる。
【図面の簡単な説明】
【図１】本発明の一実施形態に係るコンピュータシステムの構成を示すブロック図。
【図２】同実施形態において、コンピュータ１１側でＸＭＬ文書を分割・整形してコンピュータ１２に転送し、その分割・整形されたＸＭＬ文書をコンピュータ１２側で処理する動作の概要を説明するための図。
【図３】図１中のＸＭＬ文書１１４の具体例と、当該ＸＭＬ文書１１４の一部もＸＭＬ文書としての構造を持つことを示す図。
【図４】ＸＭＬ文書の分割・整形・送信の手順を説明するためのフローチャートの一部を示す図。
【図５】ＸＭＬ文書の分割・整形・送信の手順を説明するためのフローチャートの他の一部を示す図。
【図６】ＸＭＬ文書の分割・整形・送信の手順を説明するためのフローチャートの更に他の一部を示す図。
【図７】ＸＭＬ文書の分割・整形・送信の手順を説明するためのフローチャートの残りを示す図。
【図８】図３のＸＭＬ文書１１４を対象とするＸＭＬ文書分割・整形の状態遷移図。
【図９】図３のＸＭＬ文書１１４を対象とするＸＭＬ文書分割・整形の状態遷移図。
【図１０】図３のＸＭＬ文書１１４を対象とするＸＭＬ文書分割・整形の状態遷移図。
【図１１】図４乃至図７のフローチャートに従って分割送信されるＸＭＬ文書１１４−１〜１１４−４の具体例を示す図。
【図１２】図４乃至図７のフローチャートに従って分割送信されるＸＭＬ文書１１４−５〜１１４−８の具体例を示す図。
【図１３】ＸＭＬ文書１１４に対応するＸＭＬ文書の分割送信と木構造との関係を示す図。
【図１４】分割送信されたＸＭＬ文書の処理の手順を説明するためのフローチャート。
【符号の説明】
１１，１２…コンピュータ、１３…ネットワーク通信路、１１１…記憶装置、１１２…ＸＭＬ文書分割送信装置、１１２ａ…ＸＭＬ文書読み取り部、１１２ｂ…ＸＭＬ文書分割部、１１２ｃ…ＸＭＬ文書送信部、１１４…ＸＭＬ文書（原ＸＭＬ文書）、１１４−１〜１１４−８…ＸＭＬ文書、１２１…受信装置、１２１ａ…ＸＭＬ文書受信部、１２１ｂ…ＡＰＩ（アプリケーションプログラムインタフェース）、１２２…アプリケーション（処理手段）。
[0001]
[Technical Field of the Invention]
The present invention relates to a structured document dividing method and a program suitable for processing a structured document having a nested structure logically expressed by a tree structure using tags.
[0002]
[Prior art]
Currently, XML (Extensible Markup Language) is widely used as a means for describing all kinds of data. A document (data) described using this XML is called an XML document (XML data). An XML document is known as a structured document that is logically expressed in a tree structure using tags, and has the following features. The first feature is that the XML document is expressed in a hierarchical structure, and allows an unlimited nested structure. The second feature is that the XML document has a repeating structure and allows unlimited indefinite repetition (for example, see Non-Patent Document 1).
[0003]
Because of its description format, an XML document is highly portable over the Internet. XML documents have therefore come to be widely used, especially for data exchange between different applications or between different systems. XML documents are often transferred over the Internet using HTTP (Hypertext Transfer Protocol), or using SOAP (Simple Object Access Protocol) on top of HTTP. Under these protocols, the data constituting an XML document (XML data) is simply transferred in the order in which it was read.
[0004]
The transferred XML document is processed by an application. Two technologies are currently in wide use for processing XML. The first is SAX (Simple API for XML). Because SAX processes an XML document sequentially, it cannot handle the hierarchical structure; on the other hand, it requires little memory. The second is DOM (Document Object Model). Because DOM analyzes the elements of the entire XML document before processing, the hierarchical structure is easy to handle; however, it requires a large amount of memory.
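The contrast between the two technologies can be illustrated with Python's standard library. This is only a sketch: the sample document and the names `TitleHandler`, `title_via_sax`, and `title_via_dom` are hypothetical, and only the element names follow the example used in this description. The SAX version keeps just the state it needs while events stream past, whereas the DOM version builds the whole tree before any query runs.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document.
SAMPLE = ('<document id="doc01"><title>Sample Document</title>'
          '<author>Taro Suzuki</author></document>')


class TitleHandler(xml.sax.ContentHandler):
    """SAX: events arrive one at a time; nothing else is retained."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True

    def endElement(self, name):
        if name == "title":
            self.in_title = False

    def characters(self, content):
        if self.in_title:
            self.parts.append(content)


def title_via_sax(text):
    handler = TitleHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return "".join(handler.parts)


def title_via_dom(text):
    """DOM: the complete tree is in memory before the lookup."""
    dom = minidom.parseString(text)
    title = dom.getElementsByTagName("title")[0]
    return title.firstChild.data
```

Both functions return the same title; the difference lies in how much of the document must be held in memory at once and in who keeps track of the nesting.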
[0005]
[Non-patent document 1]
Mikitoshi Nakayama and Yasuhiro Okui, "Revised Edition: Complete Guide to Standard XML (Vol. 1) / (Vol. 2)", first edition, Gijutsu-Hyohron Co., Ltd., April 25, 2001
[0006]
[Problems to be solved by the invention]
As described above, in the prior art the data constituting an XML document (a structured document logically expressed as a tree using tags) is simply transferred in the order in which it was read. In other words, the prior art has no means of dividing an XML document into pieces that each retain an XML structure when transferring it. Consequently, before an application can process a transferred XML document using the DOM or SAX technology described above, transfer of the entire XML document must first be complete. When an XML document is transferred to an application for processing, the application therefore has to wait until the transfer finishes, and the start of processing is delayed. That is, the prior art has the problem that the total processing time, from the start of transfer until the transferred XML document has been processed, becomes long. The prior art also has the problem that the required memory capacity increases, because the document is processed only after the entire XML document has been read.
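A short Python check illustrates why a stream cut at an arbitrary offset cannot be handed to an XML parser as a standalone document. The sample text, the cut position, and the helper name `is_well_formed` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = ('<document id="doc01"><title>Sample Document</title>'
          '<author>Taro Suzuki</author></document>')


def is_well_formed(fragment):
    """Return True if the fragment parses as a complete XML document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Cut the character stream at an arbitrary offset, as a plain
# in-order transfer would.
prefix, rest = SAMPLE[:40], SAMPLE[40:]
```

Neither `prefix` nor `rest` parses on its own, while the complete `SAMPLE` does; pieces that are to be processed independently must therefore be shaped so that each is well-formed.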
[0007]
The present invention has been made in view of the above circumstances. Its object is to provide a structured document dividing method and program that, by dividing a structured document having a nested structure and shaping the parts into a plurality of structured documents each having a nested structure, reduce the memory capacity required to receive and process the structured document and shorten the processing time.
[0008]
[Means for Solving the Problems]
According to one aspect of the present invention, there is provided a structured document dividing method in which a computer divides a structured document having a nested structure logically expressed as a tree using tags. The method comprises: a step of reading the data of the structured document to be divided into a buffer in order from the beginning; a step of dividing and shaping the structured document to be divided (the original structured document) into a plurality of new structured documents by processing the data read into the buffer in order; and a step of passing each new structured document to document processing means every time one is divided and shaped from the original structured document.
[0009]
In this configuration, each divided portion of the original structured document is shaped into a new structured document and therefore retains the characteristics of a structured document having a nested structure. Consequently, when handling a very long structured document, the processing means that processes structured documents does not, unlike in the prior art, have to wait until it has received the whole document; it can receive and process each new structured document as soon as it is generated by division and shaping. As a result, the memory capacity required to process the structured document can be reduced, and responsiveness to users of the processing means can be improved.
[0010]
Here, to divide and shape new structured documents from the original structured document, the dividing and shaping step preferably includes the following steps: a step of forcibly inserting into the buffer, when the data read is a start tag, an end tag corresponding to that start tag; and a step of inserting into the buffer a delimiter tag that stands for the portion whose structure cannot yet be determined because reading of the original structured document is still in progress, and that indicates a point partway through the division of the structured document. When the processing means processes a new structured document containing the delimiter tag, it processes the document from the beginning until it detects the delimiter tag, suspends processing at that point, and, on receiving the next new structured document, replaces the delimiter tag with that document and resumes processing.
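The suspend-and-resume behaviour of the processing means can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the class name `Receiver` and its attributes are hypothetical, "processing" is reduced to recording the text that has become available, the delimiter tag is written `<x/>` as in the embodiment, and the three pieces are hand-written examples that leave out dummy end tags and text-split tags.

```python
DELIM = "<x/>"  # delimiter tag used in the embodiment


class Receiver:
    """Processes text up to a delimiter, suspends, and resumes once the
    next piece has been spliced in at the delimiter's position."""

    def __init__(self):
        self.doc = ""         # document assembled so far
        self.pos = 0          # how far processing has advanced
        self.processed = []   # spans already handed to the application

    @property
    def suspended(self):
        return self.doc.startswith(DELIM, self.pos)

    def feed(self, piece):
        if not self.doc:
            self.doc = piece
        elif self.suspended:
            # Replace the delimiter at which processing stopped.
            self.doc = (self.doc[:self.pos] + piece
                        + self.doc[self.pos + len(DELIM):])
        else:
            raise ValueError("no delimiter is waiting for another piece")
        self._run()

    def _run(self):
        stop = self.doc.find(DELIM, self.pos)
        end = len(self.doc) if stop == -1 else stop
        self.processed.append(self.doc[self.pos:end])
        self.pos = end


# Hypothetical pieces, hand-written for this sketch.
receiver = Receiver()
receiver.feed('<document id="doc01"><title>Sample Document</title><x/></document>')
after_first = (receiver.processed[-1], receiver.suspended)
receiver.feed('<author>Taro Suzuki</author><x/>')
receiver.feed('<date>20030101</date>')
```

After the first piece the title is already available to the application even though the forced `</document>` has not been reached; the tail is consumed only once no delimiter remains.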
[0011]
In the step of inserting the delimiter tag, when a start tag is read into the buffer while no start tag is stored in the buffer, the delimiter tag is preferably inserted after that start tag; in the step of inserting the end tag, the end tag corresponding to that start tag is then inserted after the delimiter tag that follows the start tag.
[0012]
Further, to divide and shape new structured documents from the original structured document, when a new start tag is read into the buffer after the delimiter tag has been inserted, the newly read start tag is preferably moved to a position before the delimiter tag in the buffer, and an end tag corresponding to the newly read start tag is inserted after the delimiter tag. Likewise, when text is read, the text is preferably moved to a position before the delimiter tag in the buffer. Likewise, when the data read is an end tag and a structured document into which an identical end tag was forcibly inserted has already been passed to the processing means, the end tag read is preferably replaced with a dummy end tag that notifies the processing means that the end tag exists. When the processing means processes a structured document whose delimiter tag has been replaced by a structured document containing dummy end tags, it deletes the dummy end tags and moves to their position the same number of end tags, taken from the tail end of the structured document whose delimiter tag was replaced.
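The buffer handling on the dividing side can be modelled compactly in Python. The sketch below is a simplified model, not the claimed implementation: the class `SplitBuffer` and its method names are hypothetical, the tags `<x/>` and `<z/>` are those named in the embodiment, and instead of physically moving tokens around a delimiter the model keeps the content before the delimiter in a list and derives the forced end tags from a stack of open elements when a piece is rendered.

```python
DELIM = "<x/>"   # delimiter tag of the embodiment
DUMMY = "<z/>"   # dummy end tag of the embodiment


class SplitBuffer:
    def __init__(self):
        self.head = []       # tokens that sit before the delimiter
        self.open = []       # elements opened in the current piece, still open
        self.sent_open = []  # still-open elements whose forced end tag was sent

    def start_tag(self, tag):
        self.head.append(tag)                   # goes before the delimiter
        self.open.append(tag[1:-1].split()[0])  # forced end tag goes after it

    def text(self, value):
        self.head.append(value)                 # goes before the delimiter

    def end_tag(self, tag):
        name = tag[2:-1]
        pending = self.open if self.open else self.sent_open
        if not pending or pending[-1] != name:
            raise ValueError(f"unexpected end tag {tag}")
        pending.pop()
        # Forced end tag still buffered: it ends up ahead of the delimiter.
        # Forced end tag already sent: a dummy end tag announces the real one.
        self.head.append(tag if pending is self.open else DUMMY)

    def render(self, final=False):
        closers = "".join(f"</{name}>" for name in reversed(self.open))
        return "".join(self.head) + ("" if final else DELIM) + closers

    def flush(self, final=False):
        """Return the current piece and start a new, empty one."""
        piece = self.render(final)
        self.sent_open.extend(self.open)
        self.head, self.open = [], []
        return piece


buf = SplitBuffer()
buf.start_tag('<document id="doc01">')
state_b = buf.render()
buf.start_tag("<title>")
buf.text("Sample Document")
state_e = buf.render()
buf.end_tag("</title>")
state_f = buf.render()
```

Every intermediate state rendered this way is a complete document with balanced tags, which is what allows a piece to be sent at any moment between two tokens.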
[0013]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing a configuration of a computer system according to one embodiment of the present invention. The system in FIG. 1 includes two physically different computers 11 and 12, and a network communication path 13 connecting the computers 11 and 12.
[0014]
The computer 11 includes a storage device 111, typified by a magnetic disk device, an XML document division transmission device 112 that operates on the computer 11, and a document buffer 113. The storage device 111 stores an XML document 114. The XML document 114 is a structured document having a nested structure logically expressed as a tree using tags; that is, it is a document structured by tags. The storage device 111 also stores a program 115 (a program for divided transmission of structured documents) that is executed by a CPU (not shown) in the computer 11.
[0015]
The XML document division transmission device 112 is a functional block realized when the CPU in the computer 11 executes the program 115. It includes an XML document reading unit 112a, an XML document dividing unit 112b, and an XML document transmitting unit 112c. When the XML document 114 stored in the storage device 111 needs to be transferred to the computer 12, the XML document reading unit 112a reads the data of the XML document 114 in order from the beginning. The XML document dividing unit 112b sequentially divides the XML document 114 read by the XML document reading unit 112a, operating in parallel with that reading. The XML document dividing unit 112b also shapes each divided portion of the XML document 114 into one new XML document that has an XML structure. Every time the XML document dividing unit 112b divides off and shapes an XML document, the XML document transmitting unit 112c transmits that XML document to the computer 12 via the network communication path 13. The document buffer 113 temporarily stores the data processed by the XML document reading unit 112a, the XML document dividing unit 112b, and the XML document transmitting unit 112c.
[0016]
The computer 12 includes a receiving device 121 and an application (application program) 122. The receiving device 121 includes an XML document receiving unit 121a, which receives XML documents transmitted from the computer 11 via the network communication path 13, and an application program interface (hereinafter, API) 121b, which passes the XML documents received by the XML document receiving unit 121a to the application 122. The application 122 includes a processing routine for processing the XML documents passed from the API 121b.
[0017]
Next, an outline of the operation in the system of FIG. 1, in which the computer 11 divides and shapes an XML document and transfers it to the computer 12 and the computer 12 processes the divided and shaped XML documents, will be described with reference to the operation explanatory diagram of FIG. 2.
[0018]
In the computer 11, the XML document reading unit 112a in the XML document division transmission device 112 accesses the storage device 111 and reads the data of the XML document 114 to be transferred to the computer 12 in order from the beginning (step S1). The XML document dividing unit 112b in the XML document division transmission device 112 operates in parallel with this reading. By processing the data of the XML document 114 read by the XML document reading unit 112a in order, the XML document dividing unit 112b divides and shapes the XML document 114 into a plurality of new XML documents 114-i (i = 1, 2, ...), each having an XML structure (step S2). This division and shaping is performed in the document buffer 113. Every time one XML document 114-i is divided and shaped from the XML document 114, the XML document transmitting unit 112c in the XML document division transmission device 112 transfers (transmits) that XML document 114-i to the computer 12 via the network communication path 13 (step S3). In this way, a plurality of XML documents 114-i corresponding to the XML document 114 are transferred one after another from the computer 11 to the computer 12.
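The interplay of steps S1 to S3 can be sketched with Python generators. Several things here are assumptions made for illustration: the function names `read_tokens`, `split_pieces`, and `send` are hypothetical; generators interleave the three stages rather than running them truly in parallel; and the splitting rule (one piece per completed child of the root element) is a deliberately simple stand-in, not the size-based rule of this embodiment. The point of the sketch is only that a piece leaves the sender before the reader has reached the end of the document.

```python
import re

TOKEN = re.compile(r"<[^>]+>|[^<]+")   # one tag or one run of text


def read_tokens(source, log):
    """Stand-in for step S1: yield tags and text in document order."""
    for match in TOKEN.finditer(source):
        log.append(("read", match.group()))
        yield match.group()


def split_pieces(tokens):
    """Stand-in for step S2: emit one piece per completed child of the root.

    Assumes input without self-closing tags, comments, or declarations.
    """
    depth, current = 0, []
    for tok in tokens:
        if tok.startswith("</"):
            depth -= 1
            if depth >= 1:
                current.append(tok)
            if depth == 1:
                yield "".join(current)
                current = []
        elif tok.startswith("<"):
            depth += 1
            if depth >= 2:
                current.append(tok)
        elif depth >= 2:
            current.append(tok)


def send(pieces, log):
    """Stand-in for step S3: 'transmit' each piece as soon as it exists."""
    for piece in pieces:
        log.append(("send", piece))


SOURCE = "<document><title>A</title><author>B</author></document>"
log = []
send(split_pieces(read_tokens(SOURCE, log)), log)
```

In the resulting `log`, the first "send" entry appears directly after the read of `</title>`, well before the final `</document>` is read.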
[0019]
The XML document 114-i transferred from the XML document transmitting unit 112c of the computer 11 via the network communication path 13 is received by the XML document receiving unit 121a of the computer 12 and passed to the application 122 by the API 121b (steps S4 and S5). Because the XML document 114-i passed from the API 121b has an XML structure, the application 122 can process (use) it without waiting for transfer of the entire XML document 114 that is ultimately to be processed, while keeping the advantages of the XML structure. To process the XML document 114-i, however, the application 122 must be aware that it was obtained by dividing and shaping the original XML document 114 according to the procedure described later.
[0020]
Recently, systems that store various kinds of data using the XML structure have spread remarkably. Accordingly, the data structures of XML documents are expected to become more complex and larger. In addition, the computer equipment used is expected to range widely, from large-scale machines to compact devices with limited resources such as memory. For this reason, when an XML document with a complicated data structure or an increased size is to be processed using its structure, the required memory capacity may increase. The configuration applied in the present embodiment, in which the XML document is divided and shaped and then passed sequentially to the application 122 as a plurality of small XML documents, is therefore well suited to a usage form anticipated in the near future, in which large XML documents are processed by computer equipment with limited resources such as memory.
[0021]
Next, the details of the operation of dividing and shaping the XML document on the computer 11, transferring it to the computer 12, and processing the divided and shaped XML document on the computer 12 will be described in the following order: (1) features of XML documents, (2) division, shaping, and transmission of XML documents, and (3) processing of XML documents.
[0022]
(1) Features of XML documents
As described in the related art, an XML document is expressed in a hierarchical structure that allows unlimited nesting, and has a repeating structure that allows unlimited, indefinite repetition. The hierarchical structure of an XML document is described using so-called start tags and end tags, written as <TAG> and </TAG>. Note that a range enclosed by a start tag and its end tag is only a part of the entire XML document, but this part also has a structure as XML.
[0023]
FIG. 3 shows an example in which "a part of the XML document also has a structure as an XML document". In FIG. 3, the portion 301 from the start tag <title> to the end tag </title> of the XML document 114 has a structure as an XML document, and can itself be regarded as an XML document. Similarly, the portion 302 from the start tag <abstract> to the end tag </abstract> can be regarded as an XML document.
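The point that an enclosed part is itself parseable as XML can be checked with a minimal Python sketch. The string `doc` below is a hypothetical stand-in for the XML document 114 (the actual content of FIG. 3 is not reproduced here), and `is_well_formed` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XML document 114 of FIG. 3.
doc = ('<document id="doc01"><title>Sample Document</title>'
       '<abstract>Short summary</abstract></document>')


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as a single-rooted XML document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Cut out the part from <title> to </title>, corresponding to portion 301.
title_part = doc[doc.index("<title>"): doc.index("</title>") + len("</title>")]

print(is_well_formed(title_part))       # the enclosed part parses on its own
print(is_well_formed("<title>Sample"))  # a part cut inside an element does not
```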
[0024]
Further, in an XML document, when a start tag <TAG> appears, it is guaranteed that the corresponding end tag </TAG> will always appear later. In the example of the XML document 114 shown in FIG. 2, the start tag <document> appears first. When this <document> is read, it is not known when the end tag </document> will appear, but it is known that </document> will eventually appear.
[0025]
(2) Dividing, shaping, and transmitting XML documents
Next, the procedure for dividing, shaping, and transmitting the XML document will be described, that is, the procedure for dividing and shaping the XML document into a plurality of new XML documents using the above-described features and transmitting the new XML documents. Here, transmission of the divided and shaped XML documents is referred to as divided transmission.
[0026]
First, an outline of the above procedure will be described. In the present embodiment, the data of the XML document 114 stored in the storage device 111 is sequentially read from the top by the XML document reading unit 112a. In parallel with the reading, the data of the read XML document 114 is sequentially processed by the XML document dividing unit 112b. The features of the processing of the XML document division unit 112b are as follows.
[0027]
(2a) A unit (division unit) for dividing an XML document (original XML document) is initialized.
(2b) When the start tag appears, the end tag is forcibly inserted.
(2c) A delimiter tag <x /> indicating “partition midway” is defined, and in a state where reading is not completed, a part whose structure is unknown is replaced with the delimiter tag <x />.
(2d) The XML document is divided into division units (here, a number of characters, for example 64 characters). In the present embodiment, only the characters constituting start tags and the characters constituting text are counted toward the division unit. Division in the middle of a tag is prohibited, whereas division in the middle of text is allowed. When text is divided in the middle, however, the remaining text part to be divided and transmitted next is enclosed by a text division start tag <y> and a text division end tag </y>, which mark the text division. As a result, even if text is divided in the middle, each divided document can be shaped into a well-formed XML document, and a document that starts with something other than a tag, that is, a document that is not well-formed, is prevented from being divided and transmitted.
(2e) If an end tag is forcibly inserted when a start tag appears and the result is divided and transmitted, the later part of the XML document that contains the original end tag would not be well-formed. Therefore, when the original end tag actually appears, it is replaced with a dummy end tag <z />, which means that the end tag exists, so that the receiver is notified in the divided transmission that "the original end tag has appeared".
Next, the details of the procedure for dividing, shaping, and transmitting the XML document will be described with reference to the flowcharts of FIGS. 4 to 7 and the XML document division / shaping state transition diagrams of FIGS. 8 to 10, taking the XML document 114 shown in FIG. 3 as an example.
[0028]
First, the XML document division unit 112b in the computer 11 performs the initial settings for dividing and shaping the XML document (step S11). Here, the settings are: XML document division unit = 64 characters, delimiter tag = <x />, text division start tag = <y>, text division end tag = </y>, and dummy end tag = <z />. The two flags F1 and F2 are initialized to, for example, F1 = 0 and F2 = 1. When F1 = 1, the flag F1 indicates that the remaining part of a divided text is to be processed. When F2 = 1, the flag F2 indicates that the start tag that appears next will be the start tag of a new XML document.
[0029]
When the above-described initialization has been performed, the XML document reading unit 112a in the computer 11 reads the data of the XML document (original XML document) 114 from the beginning to the end, one tag or one text at a time (steps S12 to S14). The data read by the XML document reading unit 112a is stored in the document buffer 113 in the order read. In the example of the XML document 114 shown in FIG. 3, the start tag <document id = "doc01"> at the head is read first and stored in the document buffer 113. That is, the start tag <document id = "doc01"> is read into the document buffer 113. When, as in this example, a start tag is read into the document buffer 113 and the flag F2 is 1 (steps S15 and S16), the XML document division unit 112b in the computer 11 inserts a delimiter tag <x /> after the start tag just read (that is, <document id = "doc01">) in the document buffer 113, as shown in FIG. 8A (step S17). Next, as shown in FIG. 8B, the XML document division unit 112b forcibly inserts the end tag </document> corresponding to the start tag <document id = "doc01"> after the delimiter tag <x /> in the document buffer 113 (step S18). Then, when the flag F2 is 1, as in this example, the XML document division unit 112b sets the flag F2 to 0 (steps S19 and S20) and passes control to the XML document reading unit 112a.
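Reading the document one tag or one text at a time can be sketched in Python as follows. This is only an illustration: the function name `tokenize` and the regular expression are assumptions, and the sketch assumes input without comments, CDATA sections, processing instructions, self-closing tags, or a ">" character inside attribute values.

```python
import re

# One token is either a whole tag or a run of text between tags.
TOKEN = re.compile(r"<[^<>]+>|[^<>]+")


def tokenize(xml: str):
    """Yield (kind, value) pairs, where kind is 'start', 'end', or 'text'."""
    for tok in TOKEN.findall(xml):
        if tok.startswith("</"):
            yield "end", tok
        elif tok.startswith("<"):
            yield "start", tok
        else:
            yield "text", tok


for kind, value in tokenize('<document id="doc01"><title>Sample Document</title></document>'):
    print(kind, value)
```

Each yielded pair corresponds to one unit of data that steps S12 to S14 would place in the document buffer.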
[0030]
Then, the XML document reading unit 112a determines whether the flag F1 is 1 (Step S12). If F1 is not 1 as in this example, the XML document reading unit 112a determines whether reading has been performed up to the end of the XML document 114 (that is, whether all data of the XML document 114 has been processed) (step S13). In this example, reading has not been performed up to the end of the XML document 114. In this case, the XML document reading unit 112a reads the next data (tag or text) in the XML document 114 into the document buffer 113 (Step S14). Here, the start tag <title> next to the start tag <document id = "doc01"> is read into the document buffer 113.
[0031]
When a start tag is read in this way and the flag F2 is 0 (steps S15 and S16), the XML document division unit 112b determines whether the untransmitted data stored in the document buffer 113 at that time, including the start tag but excluding forcibly inserted tags, exceeds the division unit (64 characters) (step S21). When the division unit is not exceeded, as in this example, the XML document division unit 112b moves the start tag <title> read into the document buffer 113 to a position before the delimiter tag <x /> (step S22).
[0032]
Next, the XML document division unit 112b forcibly inserts the end tag </title> corresponding to the start tag <title> after the delimiter tag <x /> (step S18). Then, when the flag F2 is 0, as in this example, the XML document division unit 112b passes control to the XML document reading unit 112a as it is.
[0033]
In response, the XML document reading unit 112a reads the next data in the XML document 114, that is, the text "Sample Document", into the document buffer 113 (steps S12 to S14). When text is read in this way (step S15), the XML document division unit 112b determines whether the untransmitted data stored in the document buffer 113 at that time, including the text, exceeds the division unit (step S23). If the division unit is not exceeded, as in this example, the XML document division unit 112b moves the text "Sample Document" read into the document buffer 113 to a position before the delimiter tag <x /> (step S24). Then, the XML document division unit 112b passes control to the XML document reading unit 112a.
[0034]
Then, the XML document reading unit 112a reads the next data in the XML document 114, that is, the end tag </title>, into the document buffer 113 (steps S12 to S14). When an end tag is read into the document buffer 113 in this way (step S15), the XML document division unit 112b determines whether the corresponding forcibly inserted end tag has already been transmitted (step S25). When, as in this example, the end tag </title> has not been transmitted, that is, when a forcibly inserted end tag </title>, separate from the end tag </title> read this time, exists in the document buffer 113, the XML document division unit 112b deletes the end tag </title> read this time and moves the delimiter tag <x /> to a position after the forcibly inserted end tag </title> (step S26). Then, the XML document division unit 112b passes control to the XML document reading unit 112a.
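The buffer manipulations described so far (steps S17, S18, S22, S24, and S26) can be sketched in Python. The class `DocumentBuffer` and its method names are hypothetical and only model the document buffer 113 as two lists on either side of the delimiter tag; division-unit checks and transmission are left out.

```python
class DocumentBuffer:
    """Illustrative model of the document buffer around the delimiter tag."""

    DELIM = "<x/>"

    def __init__(self) -> None:
        self.before = []  # data placed before the delimiter tag
        self.forced = []  # forcibly inserted end tags, innermost first

    def start_tag(self, tag: str) -> None:
        # Place the start tag before the delimiter and force its end tag after it.
        name = tag[1:-1].split()[0]
        self.before.append(tag)
        self.forced.insert(0, f"</{name}>")

    def text(self, value: str) -> None:
        # Text that fits is simply placed before the delimiter.
        self.before.append(value)

    def end_tag(self, tag: str) -> None:
        # The end tag just read is discarded; the delimiter moves past the forced copy.
        if not self.forced or self.forced[0] != tag:
            raise ValueError("no matching forced end tag in the buffer")
        self.before.append(self.forced.pop(0))

    def render(self) -> str:
        return "".join(self.before) + self.DELIM + "".join(self.forced)


buf = DocumentBuffer()
buf.start_tag('<document id="doc01">')
buf.start_tag("<title>")
buf.text("Sample Document")
buf.end_tag("</title>")
print(buf.render())
# <document id="doc01"><title>Sample Document</title><x/></document>
```

Keeping the forced end tags in a separate list means the rendered buffer is well-formed after every step, which is the property the procedure relies on.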
[0035]
Assume that, in the same manner, the start tag <author>, the text "Taro Suzuki", the end tag </author>, the start tag <date>, and the text "20030101" are then read in sequence by the XML document reading unit 112a. FIG. 9A shows the contents of the document buffer 113 at this time. When the text "20030101" is read into the document buffer 113 (steps S14 and S15), the XML document division unit 112b determines whether the untransmitted data in the document buffer 113, including the text, exceeds the division unit (step S23). Here, the portion "101" of the text "20030101" exceeds the division unit. In this case, the XML document division unit 112b leaves in the document buffer 113 the text part "20030", which is the longest part from the beginning of the text "20030101" that does not exceed the division unit, and moves this text part "20030" before the delimiter tag <x />, as shown in FIG. 9B (step S27). In step S27, the XML document division unit 112b also holds the remaining text part "101". Then, the XML document division unit 112b sets the flag F1 to 1 to indicate that an already-read but unprocessed text part remains (step S28), and passes control to the XML document reading unit 112a and the XML document transmission unit 112c.
[0036]
As a result, the XML document transmission unit 112c transmits the data having the structure shown in FIG. 9B, stored in the document buffer 113 at that time, to the computer 12 via the network communication path 13 as a new XML document 114-i (= 114-1) divided from the XML document 114 (step S29). Meanwhile, since the flag F1 is 1 (step S12), the XML document reading unit 112a determines that an already-read but unprocessed text part remains in the XML document division unit 112b, and returns control to the XML document division unit 112b as it is.
[0037]
In response, the XML document division unit 112b stores the remaining text part "101" that it has been holding in the document buffer 113 and forcibly inserts a text division start tag <y> before it and a text division end tag </y> after it (step S30). That is, the XML document division unit 112b encloses the text part "101" in the text division start tag <y> and the text division end tag </y>. Then, the XML document division unit 112b passes control to the XML document reading unit 112a.
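The text split of steps S27 and S30 can be sketched in Python as below. The function name `split_text` is hypothetical, and the count of 59 characters already used is an assumed value chosen only so that the example reproduces the "20030" / "101" split; the text does not state the actual count.

```python
def split_text(text: str, used: int, unit: int = 64):
    """Split text so that the head fits within the remaining division unit.

    Returns (head, wrapped_tail): head goes before the delimiter tag in the
    current document; wrapped_tail is the remainder enclosed in <y>...</y>
    for the next document, or an empty string if nothing is left over.
    """
    room = max(unit - used, 0)
    head, tail = text[:room], text[room:]
    wrapped_tail = f"<y>{tail}</y>" if tail else ""
    return head, wrapped_tail


# Assumed: 59 counted characters are already in the buffer before the text.
print(split_text("20030101", used=59))  # ('20030', '<y>101</y>')
```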
[0038]
Then, the XML document reading unit 112a reads the next data in the XML document 114, that is, the end tag </date> following the text "20030101", into the document buffer 113, as shown in FIG. 10A (steps S12 to S14). In this case, the XML document division unit 112b determines whether the corresponding forcibly inserted end tag </date> has already been transmitted (step S25). When the end tag </date> has been transmitted, as in this example, the XML document division unit 112b replaces the end tag </date> read into the document buffer 113 with a dummy end tag <z />, as shown in FIG. 10C, in order to notify the computer 12 in the divided transmission that "the original end tag has appeared" (step S31), and passes control to the XML document reading unit 112a.
[0039]
Thereafter, the division and shaping of the XML document 114 by the XML document division unit 112b continues in the same manner. Following the transmission (first transmission) of the XML document 114-1 shown in FIG. 11A, which corresponds to FIG. 9B, the XML documents 114-2 to 114-8 shown in FIGS. 11B to 11D and 12A to 12D are transmitted (second to eighth transmissions). Here, FIG. 11C shows the XML document 114-3, which is transmitted when reading the start tag <section id = "sec01"> into the document buffer 113 causes the division unit to be exceeded. The operation when reading a start tag causes the division unit to be exceeded is described next.
[0040]
Assume that, with the flag F2 set to 0, the start tag <section id = "sec01"> is read into the document buffer 113 by the XML document reading unit 112a, and that the untransmitted data in the document buffer 113, including the start tag, exceeds the division unit (steps S15, S16, and S21). In this case, the XML document division unit 112b returns the state of the XML document division transmission device 112 to the state before the start tag was read, and sets the flag F2 to 1 to indicate that the start tag to be processed next will be the start tag of a new XML document (steps S32 and S33). Then, the XML document division unit 112b passes control to the XML document reading unit 112a and the XML document transmission unit 112c.
[0041]
As a result, the XML document transmission unit 112c transmits the data shown in FIG. 11C, stored in the document buffer 113 at that time, to the computer 12 as a new XML document 114-i (= 114-3) (step S29). Meanwhile, the XML document reading unit 112a reads the start tag <section id = "sec01"> into the document buffer 113 again (steps S12 to S14). Alternatively, instead of performing steps S32 and S33, the start tag <section id = "sec01"> may be held in the XML document division unit 112b and, after the XML document has been transmitted by the XML document transmission unit 112c, the processing may branch to step S17 to process the held start tag. As a further alternative, only the characters of text may be counted toward the division unit.
[0042]
Next, an operation when all data of the XML document 114 has been processed will be described. When all the data of the XML document 114 has been processed (step S13), the XML document reading unit 112a notifies the XML document division unit 112b of the fact. Then, the XML document division unit 112b deletes the delimiter tag <x /> from the document buffer 113 and passes control to the XML document transmission unit 112c (step S34). In response, the XML document transmitting unit 112c transmits the data stored in the document buffer 113 at that time to the computer 12 as a new XML document 114-i (step S35). Here, the XML document 114-8 shown in FIG. 12D is transmitted, and a series of division / shaping / transmission processing by the XML document division / transmission apparatus 112 in the computer 11 ends.
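The overall sender-side flow can be summarized in a simplified Python sketch. It is not the flowchart of FIGS. 4 to 7 itself: the function name `split_document` is hypothetical, text is never split (so steps S27, S28, and S30 are omitted and a long text may overshoot the unit), a non-empty buffer stands in for the flag F2 = 0, and the input is assumed to contain only start tags, end tags, and text.

```python
import re


def split_document(xml: str, unit: int = 64):
    """Return the pieces a sender would transmit, in order (simplified)."""
    pieces = []
    before, forced = [], []  # data before <x/>; forced end tags, innermost first
    used = 0                 # counted characters: start tags and text only
    for tok in re.findall(r"<[^<>]+>|[^<>]+", xml):
        if tok.startswith("</"):
            if forced:
                # Forced copy still in the buffer: the delimiter moves past it.
                before.append(forced.pop(0))
            else:
                # Forced copy was already transmitted: emit the dummy end tag.
                before.append("<z/>")
        elif tok.startswith("<"):
            if before and used + len(tok) > unit:
                # Start tag would exceed the unit: transmit, then start anew.
                pieces.append("".join(before) + "<x/>" + "".join(forced))
                before, forced, used = [], [], 0
            name = tok[1:-1].split()[0]
            before.append(tok)
            forced.insert(0, f"</{name}>")
            used += len(tok)
        else:
            before.append(tok)
            used += len(tok)
    # End of input: the last piece is transmitted without a delimiter tag.
    pieces.append("".join(before) + "".join(forced))
    return pieces


print(split_document("<a><b>hello</b><c>world</c></a>", unit=10))
# ['<a><b>hello</b><x/></a>', '<c>world</c><z/>']
```

In the example, the first piece carries a forced `</a>`, so the real `</a>` arriving later appears as `<z/>` in the second piece.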
[0043]
FIG. 13 shows the relationship between the divided transmission of the XML document 114 by the XML document division transmission device 112 and the tree structure (hierarchical structure). In FIG. 13, the inverted L-shaped symbols 131-1 to 131-7 represent the transmissions (first to seventh transmissions) of the XML documents 114-1 to 114-7, respectively. When the symbol 131-i is located after the character string representing a node in the tree structure, as with the symbols 131-1 and 131-5 to 131-7, it indicates that transmission of the data having the structure corresponding to that node was completed in the i-th divided transmission. On the other hand, when the symbol 131-i is located in the middle of the character string representing a node, as with the symbols 131-2 to 131-4, it indicates that the i-th divided transmission took place while the data having the structure corresponding to that node was still being transmitted.
[0044]
(3) Processing of XML document transmitted by division
Next, an outline of the processing of the XML documents 114-i (i = 1 to 8 in the examples of FIGS. 11 and 12) divided and transmitted from the computer 11 to the computer 12 will be described. As described above, the XML document 114-i transmitted by the computer 11 (the XML document transmission unit 112c therein) via the network communication path 13 is received by the XML document receiving unit 121a in the computer 12. The XML document 114-i received by the XML document receiving unit 121a is passed to the application 122 via the API 121b and processed by the application 122. Here, the XML documents 114-1 to 114-8 shown in FIGS. 11A to 11D and 12A to 12D, which are obtained by dividing the XML document 114 shown in FIG. 3 and shaping the divided parts, also have a structure as XML. Therefore, the application 122 can process the XML documents 114-1 to 114-8 with a small memory capacity using conventional technology. Note that the XML documents 114-i (i = 1 to 8) are actually processed by a CPU (not shown) in the computer 12 executing the application 122, but for simplicity they are described as being processed by the application 122.
[0045]
Next, the details of the processing of the divided and transmitted XML document 114-i will be described with reference to the flowchart of FIG. 14. When the application 122 receives, from the API 121b, the XML document 114-i received by the XML document receiving unit 121a, it starts processing that XML document 114-i without waiting for the transfer of the entire XML document 114 to be completed. As a result, the memory capacity required for processing the XML document in the computer 12 can be reduced, and the processing time can be shortened.
[0046]
When the application 122 detects the delimiter tag <x /> while processing the XML document 114-i (step S41), it suspends the processing until it receives the next XML document 114-(i+1) (step S42). Upon receiving the next XML document 114-(i+1), the application 122 replaces the delimiter tag <x /> in the preceding XML document 114-i, whose processing was suspended, with the next XML document 114-(i+1) (step S43). Then, the application 122 resumes the suspended processing of the XML document 114-i (step S44). Unlike the XML document 114-i before the suspension, however, the XML document 114-i processed here has a structure in which the delimiter tag <x /> has been replaced with the XML document 114-(i+1). In the following description, the XML document 114-i in which the delimiter tag <x /> has been replaced with the XML document 114-(i+1) is therefore written as the XML document 114-i'.
[0047]
In the example of FIG. 11, when the application 122 receives the XML document 114-2, the delimiter tag <x /> in the XML document 114-1 is replaced with the XML document 114-2. Thus, the application 122 performs the processing until it detects a new delimiter tag <x /> in the new XML document 114-1 including the XML document 114-2 (that is, the XML document 114-1 ′) (step S41). It can be continued (step S44).
[0048]
If the XML document 114-(i+1) that replaced the delimiter tag <x /> itself contains another delimiter tag <x />, then, when that delimiter tag is detected, it is in turn replaced with the next XML document 114-(i+2). In other words, the application 122 repeats the operation of replacing a delimiter tag <x /> with the next divided and transmitted XML document.
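The repeated replacement of the delimiter tag can be sketched in Python as follows. The function name `merge_pieces` and the three example pieces are hypothetical. The sketch merges all pieces in one pass instead of suspending and resuming as in steps S41 to S44, and it does not handle the <y>, </y>, or <z /> tags.

```python
import xml.etree.ElementTree as ET


def merge_pieces(pieces):
    """Fold received pieces together by replacing <x/> with the next piece."""
    merged = pieces[0]
    for nxt in pieces[1:]:
        # Only the first (and, by construction, only) delimiter is replaced.
        merged = merged.replace("<x/>", nxt, 1)
    return merged


# Hypothetical pieces; the last one carries no delimiter tag.
pieces = ["<a><b>t</b><x/></a>", "<c><d>u</d><x/></c>", "<e>v</e>"]
merged = merge_pieces(pieces)
print(merged)                      # <a><b>t</b><c><d>u</d><e>v</e></c></a>
print(ET.fromstring(merged).tag)   # a
```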
[0049]
When the application 122 detects the text division start tag <y> while processing the XML document 114-i' (step S45), it deletes the pair consisting of the text division start tag <y> and the text division end tag </y> from the XML document 114-i' (step S46). Next, the application 122 counts the number j of dummy end tags <z /> that follow the deleted tag </y> (step S47). When the number of dummy end tags <z /> following the tag </y> deleted from the XML document 114-i' is j, j end tags belong at that position in the XML document 114-i'. Therefore, the application 122 deletes the j dummy end tags <z /> following the deleted tag </y> from the XML document 114-i' and moves j end tags from the tail of the XML document 114-i' to that position (step S48). Then, the application 122 continues processing the XML document 114-i' after the end tags have been moved (step S44), and ends all processing when it has processed the document to the end (step S49).
[0050]
In the above embodiment, it is assumed that the XML document 114 to be divided and shaped by the XML document division transmission device 112 in the computer 11 is stored in the storage device 111 of the computer 11. However, even when the storage device 111 belongs to another computer connected via a network, the XML document division transmission device 112 in the computer 11 can divide and shape the XML document 114 stored in that storage device 111 in the same manner as in the above embodiment.
[0051]
In the above embodiment, it is assumed that the division of the XML document 114 and the shaping of the divided document parts are all performed on the transmitting side. However, it is also possible for the transmitting side simply to divide the XML document 114 from the beginning into parts of, for example, a fixed size and transmit the divided document parts (that is, document parts that do not have a structure as XML) in sequence, and for the receiving side to perform processing corresponding to the division and shaping flowcharts. In this case, each time a divided part of the XML document 114 is received, its data is read in order from the beginning. This is equivalent to reading the data of the XML document 114 sequentially from the beginning in the above embodiment.
[0052]
Note that the present invention is not limited to the above-described embodiment, and can be modified in various ways at the implementation stage without departing from the gist of the invention. Further, the above embodiment includes inventions at various stages, and various inventions can be extracted by appropriately combining the plurality of disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, as long as the problem described in the section on the problem to be solved by the invention can be solved and the effect described in the section on the effect of the invention is obtained, the configuration from which those constituent elements are deleted can be extracted as an invention.
[0053]
[Effects of the invention]
As described in detail above, according to the present invention, the original structured document is read into a buffer sequentially from the beginning, and the data read into the buffer is processed sequentially, whereby the original structured document is divided and shaped into a plurality of new structured documents. Each time a new structured document is divided and shaped, it is provided for processing by the document processing means, so that each new structured document can be provided as a structured document that retains the feature of a nested structure. Therefore, according to the present invention, when a long structured document is processed, the processing means can receive and process each new structured document as it is divided and shaped, without waiting for the entire structured document to be received. As a result, the memory capacity required for processing the structured document can be reduced, and the response performance for a user of the processing means can be improved.
[Brief description of the drawings]
FIG. 1 is an exemplary block diagram showing the configuration of a computer system according to an embodiment of the present invention.
FIG. 2 is a view for explaining an outline of the operation of dividing and shaping an XML document on the computer 11 side, transferring it to the computer 12, and processing the divided and shaped XML document on the computer 12 side in the embodiment.
FIG. 3 is a view showing a specific example of an XML document 114 in FIG. 1 and that a part of the XML document 114 also has a structure as an XML document.
FIG. 4 is a diagram showing a part of a flowchart for explaining a procedure of dividing, shaping, and transmitting an XML document.
FIG. 5 is a diagram showing another part of the flowchart for explaining the procedure of dividing, shaping, and transmitting the XML document.
FIG. 6 is a diagram showing still another part of the flowchart for explaining the procedure of dividing, shaping, and transmitting an XML document.
FIG. 7 is a view showing the rest of the flowchart for explaining the procedure of dividing, shaping, and transmitting an XML document.
FIG. 8 is a state transition diagram of XML document division and shaping for the XML document 114 of FIG. 3.
FIG. 9 is a state transition diagram of XML document division and shaping for the XML document 114 of FIG. 3.
FIG. 10 is a state transition diagram of XML document division and shaping for the XML document 114 of FIG. 3.
FIG. 11 is a view showing a specific example of XML documents 114-1 to 114-4 divided and transmitted according to the flowcharts of FIGS. 4 to 7.
FIG. 12 is a view showing a specific example of XML documents 114-5 to 114-8 divided and transmitted according to the flowcharts of FIGS. 4 to 7.
FIG. 13 is a view showing a relationship between a divided transmission of an XML document corresponding to the XML document 114 and a tree structure.
FIG. 14 is a flowchart for explaining the procedure of processing an XML document that has been divided and transmitted.
[Explanation of symbols]
11, 12: Computer, 13: Network communication path, 111: Storage device, 112: XML document division transmission device, 112a: XML document reading unit, 112b: XML document division unit, 112c: XML document transmission unit, 114: XML document (Original XML document), 114-1 to 114-8: XML document, 121: receiving device, 121a: XML document receiving unit, 121b: API (application program interface), 122: application (processing means).

Claims

1. A structured document division method for dividing, by a computer, a structured document having a nested structure logically expressed as a tree structure using tags, the method comprising:
a step of reading the data of the structured document to be divided into a buffer sequentially from the beginning;
a step of dividing and shaping the structured document to be divided into a plurality of new structured documents by sequentially processing the data read into the buffer; and
a step of passing the new structured document to document processing means each time a new structured document is divided and shaped from the structured document to be divided.

2. The structured document division method according to claim 1, wherein
the dividing and shaping step includes a step of forcibly inserting, when the read data is a start tag, an end tag corresponding to the start tag into the buffer, and a step of inserting into the buffer, in correspondence with a part whose structure cannot be determined because its data has not yet been read from the structured document, a delimiter tag indicating a midway point in the division of the structured document, and
in the passing step, each time a new structured document including the read start tag, the delimiter tag, and the inserted end tag corresponding to the start tag is generated on the buffer, the new structured document is passed to the document processing means.

3. The structured document division method according to claim 2, wherein
in the step of inserting the delimiter tag, when a start tag is read into the buffer in a state where no start tag is stored in the buffer, the delimiter tag is inserted after the start tag, and
in the step of inserting the end tag, the end tag corresponding to the start tag is inserted after the delimiter tag inserted after the start tag.

4. The structured document division method according to claim 3, wherein
the dividing and shaping step includes a step of moving, when a new start tag is read into the buffer after the delimiter tag has been inserted, the newly read start tag to a position before the delimiter tag on the buffer, and
in the step of inserting the end tag, the end tag corresponding to the newly read start tag is inserted after the delimiter tag.

5. The structured document division method according to claim 4, wherein the dividing and shaping step includes a step of moving, when the read data is text, the text to a position before the delimiter tag on the buffer.

6. The structured document division method according to claim 5, wherein the dividing and shaping step includes a step of replacing, when the read data is an end tag and a structured document into which the same end tag has been forcibly inserted has already been passed to the document processing means, the read end tag with a dummy end tag for notifying the document processing means that the end tag exists.

7. The structured document division method according to claim 6, wherein the dividing and shaping step includes a step of managing the size of the data, including at least text, of each structured document passed to the document processing means so that it does not exceed a predetermined division unit.

8. The structured document division method according to claim 7, wherein the managing step includes:
a step of generating, when the read data is text and the unprocessed data stored in the buffer, including the text, exceeds the division unit, a new structured document on the buffer by leaving in the buffer a text part of the text that does not exceed the division unit and moving that text part to a position before the delimiter tag; and
a step of storing, when the new structured document generated on the buffer has been passed to the document processing means, the remaining text part of the read text in the buffer and enclosing the remaining text part in a text division start tag and a text division end tag that mark the text division.

9. A program for dividing, by a computer, a structured document having a nested structure logically expressed as a tree structure using tags, the program causing the computer to execute:
a step of reading the data of the structured document to be divided into a buffer sequentially from the beginning;
a step of dividing and shaping the structured document to be divided into a plurality of new structured documents by sequentially processing the data read into the buffer; and
a step of passing the new structured document to document processing means each time a new structured document is divided and shaped from the structured document to be divided.

10. The program according to claim 9, wherein
the dividing and shaping step includes a step of forcibly inserting, when the read data is a start tag, an end tag corresponding to the start tag into the buffer, and a step of inserting into the buffer, in correspondence with a part whose structure cannot be determined because its data has not yet been read from the structured document, a delimiter tag indicating a midway point in the division of the structured document, and
each time a new structured document including the read start tag, the delimiter tag, and the inserted end tag corresponding to the start tag is generated on the buffer, the program causes the computer to execute the step of passing the new structured document to the document processing means.