JP2008084341A

JP2008084341A - Structured document compressing method, compressing device, and computer-readable recording medium recording structured document compressing program

Info

Publication number: JP2008084341A
Application number: JP2007311608A
Authority: JP
Inventors: Hironori Yahagi; 裕紀矢作
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1999-06-21
Filing date: 2007-11-30
Publication date: 2008-04-10

Abstract

PROBLEM TO BE SOLVED: To improve the compression ratio of a structured document by compressing its tag portions without impairing the characteristics of the structured document.

SOLUTION: A structured document compressing device comprises a document realization value analyzing part 20 that analyzes the description in the tags of the document realization value forming the structured document; a tag dictionary preparing part 80 that, according to the analysis result of the document realization value analyzing part 20, prepares a tag dictionary associating each character string described in the tags of the document realization value with a shortened character string that is shorter than the character string and from which the character string can be identified; and a document realization value character string substituting part 41 that uses the tag dictionary prepared by the tag dictionary preparing part 80 to substitute the corresponding shortened character string for each character string described in the tags of the document realization value.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a method and an apparatus for compressing structured documents such as HTML (Hyper Text Markup Language), SGML (Standard Generalized Markup Language), and XML (Extensible Markup Language) documents, and to a computer-readable recording medium on which a program for doing so is recorded.

In recent years, with the spread of computers, the Internet, and intranets, electronic documents containing various types of data, such as text, software, numerical values, and image data, have been increasing. When an electronic document is large, its data is compressed by omitting redundant portions, which reduces the amount of data stored in memory and shortens the time needed to transmit the document.

In the following description, following the terminology of information theory, one word unit of data is called a “character”, and any number of connected characters is called a “character string”. So-called universal coding techniques, which regard data as consisting of characters and can therefore compress any kind of data, have been widely studied and are in widespread use.

Meanwhile, documents structured with tags are widely used as electronic documents. Such a structured document is processed as two separate parts, “character data” and “markup”. Here, a “mark” is a general term for tags and other structuring information; specifically, it refers to start tags, end tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, processing instructions, and the like.

Typical examples of structured documents include SGML (Standard Generalized Markup Language), intended for large-scale archival databases; HTML (Hyper Text Markup Language), which has a simple structure intended for the WWW (World Wide Web); and XML (eXtensible Markup Language), a simplification of SGML for the Internet.

SGML is used for viewers of large-scale archived document databases in government offices, companies, and the like, for database retrieval, for concurrent product development work, and for publishing (CAD, electronic books, databases); it is also used as an intermediate language for data exchange. HTML has spread worldwide together with the WWW.
XML has recently attracted particular attention as a complement to HTML. XML is coming into use not only for handling documents on the Internet but also as a medium through which all kinds of information devices, such as mobile phones and car navigation systems, communicate.

An XML document is broadly divided into three parts (components): an XML declaration, a document type definition (DTD), and a document realization value (XML instance). From a processing standpoint, XML documents fall into two classes: well-formed and verified (valid).

The XML declaration is entirely different from the SGML declaration; it is a simple part that merely declares the XML version, the character encoding, and so on. The DTD, as in SGML, is the part that defines the elements, attributes, and entities that appear in the tagged document. The XML instance is the part in which the actual tagged document is written.

A well-formed XML document is one in which every start tag is matched by an end tag and the XML instance is written in accordance with the tagging rules defined by XML. A verified XML document is one whose tagging follows the element hierarchy, attribute types, and so on defined by the element type declarations and attribute list declarations in the DTD. A verified XML document must, of course, also follow the tagging rules of a well-formed XML document.

Table 1 shows how the above components relate to the XML document classes described above, and how they relate to SGML and HTML documents (that is, whether each component is mandatory).

As Table 1 shows, in a well-formed XML document only the document realization value is a mandatory component, whereas in a verified XML document both the document type definition and the document realization value are mandatory. In an SGML document all components are mandatory, and in an HTML document every component other than the HTML declaration is mandatory.

XML treats a document as a set of elements with a hierarchical structure, and the marks used to identify each element are tags. A tag indicating that an element begins is called a start tag, a tag indicating that the element ends is called an end tag, and the portion enclosed by the two tags is the content of the element.

The basic tagging of an element is as shown in FIG. 38(A). The example in FIG. 38(A) shows the tagging of a single element, but an XML instance actually contains many elements, and those elements form a hierarchy. That is, a group of other elements may exist under a given element, and such a hierarchy is expressed by nesting tags, as shown in FIG. 38(B). In FIG. 38(B), element b and element c, which are subordinate to element a, are at the same level of the hierarchy and are siblings with respect to the parent element “a”. For such sibling elements, the start tag of the younger element c is written immediately after the end tag of the elder element b. A subordinate element can also be written within the content of an element, mixed with plain text. In that case, tagging is performed as shown in FIG. 38(C).
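The nesting and mixed-content cases described above can be illustrated with a minimal Python sketch. The figures themselves are not reproduced in this text, so the sample strings below are assumptions; only the element names a, b, and c come from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: element a is the parent of the sibling elements b and c.
nested = ET.fromstring("<a><b>first</b><c>second</c></a>")

# The children appear in document order: the elder sibling b, then the younger sibling c.
children = [child.tag for child in nested]

# Hypothetical mixed content: plain text and a subordinate element inside element a.
mixed = ET.fromstring("<a>plain text <b>child</b> more text</a>")
text_before_child = mixed.text      # text preceding the subordinate element
text_after_child = mixed[0].tail    # text following the end tag of the subordinate element
```

ElementTree stores the text after a child element in that child's `tail`, which is why the trailing plain text is read from `mixed[0]` rather than from `mixed`.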

Furthermore, tags can not only express the structure of elements but also give an element some attached information (attributes). Attributes may be given to an element for reasons such as wanting to distinguish the type of an element, or wanting to give an element a unique identifier so that it can be referenced from elsewhere. As shown in FIG. 39, an attribute is expressed as a pair of an attribute name and an attribute value and is written inside the start tag.
The processing classes of XML documents are further divided, according to differences in configuration, into the five patterns shown in Table 2 (patterns (1) to (5); circled numerals 1 to 5 in Table 2).

As Table 2 shows, XML documents are classified into five patterns: documents with no DTD (pattern (1)); documents with a DTD containing entity declarations (pattern (2)); documents with a DTD containing declarations of external entities (pattern (3)); documents with the DTD written internally (pattern (4); shared with SGML); and documents that use a DTD in an external file (pattern (5); shared with SGML and HTML).

XML documents corresponding to patterns (1) to (3) are called “well-formed XML documents”, and their tags can be set without validation against a DTD. XML documents corresponding to patterns (4) and (5) are called “verified XML documents”.
In XML and SGML, which correspond to patterns (1) to (4), users can define tags freely. In pattern (4), both XML and SGML can define tags in the document's own internal DTD. HTML, on the other hand, corresponds only to pattern (5): it depends solely on the DTD in an external file (issued by the W3C, the World Wide Web Consortium), and users cannot define tags freely.

Pattern (1) has only an XML instance and is not checked as a verified XML document. It is therefore the simplest well-formed XML document, with the DTD removed entirely, and the content of the XML instance can be interpreted even without a DTD.

Pattern (2) has a DTD containing replacement string definitions (entity declarations). It is a well-formed XML document in which shortened character strings are declared by entity declarations in the DTD so that they can be used within the XML instance. In pattern (2), the DTD defines the correspondence between short and long character strings so that, by means of entity references, long character strings in the content of the XML instance can be replaced with short ones.
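As an illustration of pattern (2), the following minimal Python sketch parses a document whose internal DTD declares one entity. The entity name `sdc`, its replacement text, and the `memo` element are hypothetical and do not come from the patent.

```python
import xml.etree.ElementTree as ET

# Hypothetical pattern (2) document: the internal DTD maps the short string "sdc"
# to a longer string, and the instance uses the entity reference &sdc; in its content.
doc = """<!DOCTYPE memo [
  <!ENTITY sdc "structured document compression">
]>
<memo>Notes on &sdc;.</memo>"""

# The parser expands the entity reference while reading the instance.
root = ET.fromstring(doc)
expanded = root.text
```

The instance stores only the short reference, while a processor that reads the DTD recovers the long string, which is the short-to-long correspondence the paragraph describes.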

Pattern (3) is a well-formed XML document in which, in order to build an XML document from multiple files, those files are declared by entity declarations in the DTD. In pattern (3), entity references are used to include external files in the content of the XML instance, and the DTD defines the correspondence between the short character strings used in the XML instance and the information specifying the actual files.

In pattern (4), the DTD (DOCTYPE declaration) attached to the XML instance defines the element type declarations and attribute list declarations needed for checking the document as a verified XML document.
In pattern (5), the element type declarations and attribute list declarations are defined in a DTD residing in an external file, and the DTD (DOCTYPE declaration) attached to the XML instance contains information specifying that external file.

Whether a given XML document is interpreted as a well-formed XML document or as a verified XML document depends on the XML processor (XML parser), the software that interprets XML documents. As shown in FIG. 40, the XML processor parses the XML document, checks it as a well-formed XML document and as a verified XML document, and then passes the checked XML document (represented as a tree structure) to other application software such as a browser.

When structured documents such as those described above are compressed, the universal coding techniques mentioned earlier are used, which compress the structured document by regarding it as a collection of character data.

Conventional methods for compressing structured documents fall broadly into the following two methods, (a) and (b).
(a) A method in which the tags are left in their original positions and only the plain-text portions enclosed by tags are compressed.
(b) A method in which only the tags are moved to the head of the document realization value (instance), so that the tags are compressed together and the plain text is compressed together.

Method (a) does not compress the tags themselves. Since tags alone usually account for around 30% of the size of a structured document, the compression ratio of the structured document suffers if the tags are not compressed.
In method (b), a 2-byte identification code must be placed at each tag's original position so that the tag can be returned there when the compressed document is decompressed and restored, and the compression ratio suffers accordingly.

Furthermore, when an HTML document, for example, is compressed by a universal coding technique, there are methods that compress the content below a predetermined depth in the tree structure of the HTML document, and methods that detect redundant portions of the tag notation of the HTML document and replace them with more concise notation. With the former, however, the document structure below the predetermined depth cannot be known unless the document is decompressed and restored; with the latter, because the document to be compressed is an HTML document, element names, attribute names, and the like cannot be compressed.

As described above, tags in a structured document are needed for searching. If only the tags are kept as they are, as in method (a), the compression ratio of the structured document deteriorates; if the tags alone are compressed, as in method (b), the search capability of the compressed file is lost and the document can no longer be searched in its compressed state.

Since a compressed file can usually be decompressed at about 200 KB/s, an SGML/XML document of about 4 to 6 KB, roughly one page at Japanese Industrial Standard A4 size, is decompressed and becomes viewable in 0.02 to 0.03 seconds. A database search method, meanwhile, performs a search in a processing time of about 0.08 seconds.
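The timing figures above follow from a simple division, sketched here in Python. The constant and function names are illustrative, and adding the decompression time to the search time is an assumption about how the two stages combine, not a model given in the description.

```python
# Figures taken from the description: about 200 KB/s decompression, about 0.08 s per search.
DECOMPRESSION_RATE_KB_PER_S = 200.0
SEARCH_TIME_S = 0.08

def decompression_time_s(size_kb: float) -> float:
    """Time to decompress a document of the given size at the stated rate."""
    return size_kb / DECOMPRESSION_RATE_KB_PER_S

fastest = decompression_time_s(4.0)   # one A4 page at the small end (4 KB)
slowest = decompression_time_s(6.0)   # one A4 page at the large end (6 KB)

# Assumption: decompress first, then search, so the two times simply add.
worst_case_total = slowest + SEARCH_TIME_S
```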

When a compression scheme is applied to a database of structured documents, the time until the search may exceed 0.1 seconds if the search is performed after decompression, although this depends on the size of the unit being compressed. The choice of compression scheme therefore depends on whether a time of more than 0.1 seconds is acceptable.
As described above, if the tags are not compressed, the compression ratio of the structured document is low and the storage efficiency of the document data falls, which is undesirable in systems that handle large-scale databases.

Meanwhile, in parts lists, price lists, and the like written in XML, redundant expressions such as a start tag and end tag pair enclosing a short word or phrase (content) appear frequently [see FIG. 2(A) and FIG. 2(B)]. In such cases it is desirable to be able to compress the tag portions while keeping the document searchable.

In actual data such as XML documents, the “content” enclosed by tags is often short: specifically, about 20 bytes, or about 10 characters in Japanese. Short data is generally difficult to compress. Moreover, when part of the “content” is to be left as a search keyword and the “content” data is short, the “content” ends up being left as it is, uncompressed, and the compression ratio of the structured document falls as a result.

The present invention was devised in view of these problems. Its object is to provide a structured document compression method, a structured document compression apparatus, and a computer-readable recording medium on which a structured document compression program is recorded, which make it possible to compress tag portions without impairing the characteristics of the structured document and thereby improve the compression ratio of the structured document.

To achieve the above object, a structured document compression method according to a technique related to the present invention comprises: a document realization value analysis step of analyzing the tree structure of elements in the document realization value forming a structured document; and a document realization value configuration changing step of moving, according to the analysis result of the document realization value analysis step, information about an element that is a leaf of the tree structure (hereinafter, a leaf element) into the start tag of the parent element of the leaf element as an attribute of the parent element.

In the document realization value configuration changing step, the start tag, end tag, and content of the leaf element may be deleted from the document realization value, and the element name and content, which are information about the leaf element, may be added to the start tag of the parent element as an attribute name and attribute value of the parent element, respectively. In this case, if an attribute that is information about the leaf element is written in the start tag of the leaf element, the attribute name and attribute value of that attribute may be added to the start tag of the parent element as an attribute name and attribute value of the parent element, respectively. Also in the document realization value configuration changing step, the end tag of the parent element may be deleted and the start tag of the parent element changed to an empty-element tag.
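The configuration change described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `fold_leaf_elements` and the sample `part` document are hypothetical, and the sketch leaves a leaf in place whenever moving it would collide with an existing attribute name or discard plain text that follows it, cases the description does not address here.

```python
import xml.etree.ElementTree as ET

def fold_leaf_elements(root):
    """Move each leaf element into the start tag of its parent as attributes."""
    # Decide which elements are leaves before changing anything, so that a parent
    # emptied by this pass is not itself folded into its own parent.
    original_leaves = {id(elem) for elem in root.iter() if len(elem) == 0}
    for parent in list(root.iter()):
        for child in list(parent):
            if id(child) not in original_leaves:
                continue
            # Element name and content become an attribute name and value;
            # the leaf's own attributes are carried over as well.
            new_attrs = {child.tag: child.text or ""}
            new_attrs.update(child.attrib)
            has_trailing_text = bool((child.tail or "").strip())
            collides = any(name in parent.attrib for name in new_attrs)
            if has_trailing_text or collides:
                continue  # assumption: leave such leaves untouched
            parent.attrib.update(new_attrs)
            parent.remove(child)
    return root

# Hypothetical parts-list entry with two leaf elements.
doc = '<part><name>bolt</name><price unit="yen">120</price></part>'
folded = fold_leaf_elements(ET.fromstring(doc))
```

After the call, `folded` has no child elements, so a serializer can write it as a single empty-element tag, which corresponds to deleting the parent's end tag as described above.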

The method may further comprise: a document type definition analysis step of analyzing the tree structure of elements in the document type definition forming the structured document; and a document type definition configuration changing step of, according to the analysis result of the document type definition analysis step, deleting information about an element that is a leaf of the tree structure (hereinafter, a leaf element) from the document type definition and redefining it in the document type definition as an attribute of the parent element of the leaf element. In this case, in the document type definition configuration changing step, the element type declaration of the leaf element may be deleted from the document type definition, the description of the leaf element may be deleted from the element type declaration of the parent element, and the information in the element type declaration of the leaf element may be redefined as an attribute of the parent element in the attribute list declaration of the parent element. Furthermore, if the document type definition defines attributes of the leaf element in an attribute list declaration of the leaf element, that attribute list declaration may be deleted from the document type definition and the attributes of the leaf element redefined as attributes of the parent element in the attribute list declaration of the parent element.

To achieve the above object, the structured document compression method of the present invention (claim 1) comprises: a document realization value analysis step of analyzing the description in the tags of the document realization value forming a structured document; a tag dictionary creation step of creating, according to the analysis result of the document realization value analysis step, a tag dictionary that associates each character string described in the tags of the document realization value with a shortened character string that is shorter than the character string and can identify it; and a document realization value character string replacement step of replacing, using the tag dictionary created in the tag dictionary creation step, each character string described in the tags of the document realization value with the corresponding shortened character string.
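The three steps of this method (analysis, tag dictionary creation, replacement) can be sketched in Python as below. The function names, the a, b, ..., z, aa, ... naming scheme for shortened strings, and the sample catalog are assumptions made for illustration; the description only requires that each shortened string be shorter than the original and identify it uniquely.

```python
import itertools
import string
import xml.etree.ElementTree as ET

def short_names():
    """Yield candidate shortened strings: a, b, ..., z, aa, ab, ..."""
    for length in itertools.count(1):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            yield "".join(letters)

def build_tag_dictionary(root):
    """Analyze the tags and map each element or attribute name to a shorter string."""
    names = []
    for elem in root.iter():
        for name in (elem.tag, *elem.attrib):
            if name not in names:
                names.append(name)
    # Skip candidates that already occur as names so the mapping stays unambiguous.
    candidates = (c for c in short_names() if c not in names)
    dictionary = {}
    candidate = next(candidates)
    for name in names:
        if len(candidate) < len(name):  # only substitute when it actually shortens
            dictionary[name] = candidate
            candidate = next(candidates)
    return dictionary

def apply_tag_dictionary(root, dictionary):
    """Replace element and attribute names in place using the dictionary."""
    for elem in root.iter():
        elem.tag = dictionary.get(elem.tag, elem.tag)
        renamed = {dictionary.get(k, k): v for k, v in elem.attrib.items()}
        elem.attrib.clear()
        elem.attrib.update(renamed)
    return root

# Hypothetical price list with repeated, verbose tags.
doc = ('<catalog><product code="X1"><price>120</price></product>'
       '<product code="X2"><price>95</price></product></catalog>')
root = ET.fromstring(doc)
tag_dictionary = build_tag_dictionary(root)
compressed = ET.tostring(apply_tag_dictionary(root, tag_dictionary), encoding="unicode")

# Restoring the original names only needs the inverse dictionary.
inverse = {short: name for name, short in tag_dictionary.items()}
restored = ET.tostring(apply_tag_dictionary(ET.fromstring(compressed), inverse),
                       encoding="unicode")
```

Because the result is still well-formed markup, the compressed document keeps its element structure and can be searched by translating a query's tag names through the same dictionary.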

The structured document compression method of the present invention (claim 2) comprises: a document realization value analysis step of analyzing the description in the tags of the document realization value forming a structured document; a document type definition analysis step of analyzing the description of the document type definition forming the structured document; a tag dictionary creation step of creating, according to the analysis results of the document realization value analysis step and the document type definition analysis step, a tag dictionary that associates each character string described in the tags of the document realization value and in the document type definition with a shortened character string that is shorter than the character string and can identify it; a document realization value character string replacement step of replacing, using the tag dictionary created in the tag dictionary creation step, each character string described in the tags of the document realization value with the corresponding shortened character string; and a document type definition character string replacement step of replacing, using the tag dictionary created in the tag dictionary creation step, each character string described in the document type definition with the corresponding shortened character string.

In this case, the element names and attribute names described in the tags or in the document type definition may be treated as the character strings, and the element names and attribute names may be replaced with the shortened character strings (claim 3).
The method may also comprise a word character string replacement step of replacing, using a word dictionary that associates each word character string with a shortened character string that is shorter than the word character string and can identify it, each word character string contained in the content of the document realization value with the corresponding shortened character string (claim 4). It may further comprise a variable-length coding step of compressing these character strings by variable-length coding after the character strings described in the tags or in the document type definition and the word character strings have been replaced with shortened character strings (claim 5).
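The word replacement and variable-length coding steps can be sketched in Python as follows. The description does not name a particular variable-length code, so Huffman coding is used here as an assumed example; the word dictionary, its `~` marker (assumed not to occur in the text), and all function names are likewise hypothetical.

```python
import heapq
from collections import Counter

def replace_words(text, word_dictionary):
    """Replace each dictionary word in space-separated text with its shortened string."""
    return " ".join(word_dictionary.get(word, word) for word in text.split(" "))

def huffman_code(text):
    """Build a prefix-free variable-length code (Huffman) for the characters of text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table for that subtree).
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, codes1 = heapq.heappop(heap)
        n2, _, codes2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in codes1.items()}
        merged.update({ch: "1" + code for ch, code in codes2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(text, code):
    return "".join(code[ch] for ch in text)

def decode(bits, code):
    inverse = {bit_string: ch for ch, bit_string in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

# Hypothetical word dictionary and element content.
word_dictionary = {"compression": "~1", "document": "~2"}
content = "structured document compression improves document storage"
shortened = replace_words(content, word_dictionary)

code = huffman_code(shortened)
bits = encode(shortened, code)
```

Applying the dictionary first shortens frequent words, and the variable-length code then spends fewer bits on the characters that remain most common.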

Meanwhile, a structured document compression apparatus according to a technique related to the present invention comprises: a document realization value analysis unit that analyzes the tree structure of elements in the document realization value forming a structured document; and a document realization value configuration changing unit that, according to the analysis result of the document realization value analysis unit, moves information about an element that is a leaf of the tree structure (hereinafter, a leaf element) into the start tag of the parent element of the leaf element as an attribute of the parent element.

In this case, the document realization value configuration changing unit may delete the start tag, end tag, and content of the leaf element from the document realization value and add the element name and content, which are information about the leaf element, to the start tag of the parent element as an attribute name and attribute value of the parent element, respectively. Furthermore, if an attribute that is information about the leaf element is written in the start tag of the leaf element, the unit may add the attribute name and attribute value of that attribute to the start tag of the parent element as an attribute name and attribute value of the parent element, respectively. The unit may also delete the end tag of the parent element and change the start tag of the parent element to an empty-element tag.

The apparatus may further comprise: a document type definition analysis unit that analyzes the tree structure of elements in the document type definition forming the structured document; and a document type definition configuration changing unit that, according to the analysis result of the document type definition analysis unit, deletes information about an element that is a leaf of the tree structure (hereinafter, a leaf element) from the document type definition and redefines it in the document type definition as an attribute of the parent element of the leaf element.

The structured document compression apparatus of the present invention (claim 6) comprises: a document realization value analysis unit that analyzes the description in the tags of the document realization value forming a structured document; a tag dictionary creation unit that, according to the analysis result of the document realization value analysis unit, creates a tag dictionary associating each character string described in the tags of the document realization value with a shortened character string that is shorter than the character string and can identify it; and a document realization value character string replacement unit that, using the tag dictionary created by the tag dictionary creation unit, replaces each character string described in the tags of the document realization value with the corresponding shortened character string.

The structured document compression apparatus of the present invention (claim 7) comprises: a document realization value analysis unit that analyzes the description inside the tags of the document realization value of a structured document; a document type definition analysis unit that analyzes the description of the document type definition of the structured document; a tag dictionary creation unit that, according to the analysis results of the document realization value analysis unit and the document type definition analysis unit, creates a tag dictionary associating each character string described inside the tags of the document realization value and in the document type definition with a shortened character string that is shorter than that character string and can identify it; a document realization value character string replacement unit that uses the tag dictionary to replace each character string described inside the tags of the document realization value with the corresponding shortened character string; and a document type definition character string replacement unit that uses the tag dictionary to replace each character string described in the document type definition with the corresponding shortened character string.
In this case, the element names and attribute names described inside the tags or in the document type definition may be treated as the character strings, and the element names and attribute names may be replaced with the shortened character strings (claim 8).
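The tag dictionary creation and replacement described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the patent: the function names, the choice of "a", "b", ... as shortened strings, and the use of Python's `xml.etree.ElementTree` are all assumptions. Only element names and attribute names are replaced; attribute values and content are left untouched.

```python
import itertools
import string
import xml.etree.ElementTree as ET


def short_names():
    """Yield candidate shortened strings: a, b, ..., z, aa, ab, ... (assumed scheme)."""
    for size in itertools.count(1):
        for letters in itertools.product(string.ascii_lowercase, repeat=size):
            yield "".join(letters)


def build_tag_dictionary(root):
    """Map element/attribute names to shorter strings that identify them."""
    names = []
    for elem in root.iter():
        for name in [elem.tag, *elem.attrib]:
            if name not in names:
                names.append(name)
    # Skip candidates equal to an existing name so that a shortened string
    # can never be confused with a name that is left unreplaced.
    candidates = (c for c in short_names() if c not in names)
    code = next(candidates)
    dictionary = {}
    for name in names:
        if len(code) < len(name):  # replace only when the result is shorter
            dictionary[name] = code
            code = next(candidates)
    return dictionary


def apply_tag_dictionary(root, dictionary):
    """Replace element and attribute names in place; values stay as they are."""
    for elem in root.iter():
        elem.tag = dictionary.get(elem.tag, elem.tag)
        renamed = {dictionary.get(k, k): v for k, v in elem.attrib.items()}
        elem.attrib.clear()
        elem.attrib.update(renamed)


doc = ET.fromstring(
    '<book field="本"><title>XML入門</title>'
    '<author year="1955">佐藤元</author></book>'
)
tag_dictionary = build_tag_dictionary(doc)
apply_tag_dictionary(doc, tag_dictionary)
print(tag_dictionary)
print(ET.tostring(doc, encoding="unicode"))
```

The dictionary has to be stored alongside the compressed document so that the original names can be restored by applying the inverse mapping.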

Furthermore, a recording medium as a related technique of the present invention is a computer-readable recording medium storing a structured document compression program that causes a computer to realize the function of compressing a structured document. The structured document compression program causes the computer to function as a document realization value analysis unit that analyzes the tree structure of elements in the document realization value of the structured document, and as a document realization value configuration changing unit that, according to the analysis result of the document realization value analysis unit, moves the information about each element that is a leaf of the tree structure (hereinafter, a leaf element) into the start tag of the parent element of that leaf element as an attribute of the parent element.

In this case, the structured document compression program may cause the computer to function so that the document realization value configuration changing unit deletes the start tag, end tag, and content of the leaf element from the document realization value and adds the element name and content of the leaf element to the start tag of the parent element as an attribute name and attribute value of the parent element, respectively. Furthermore, when an attribute of the leaf element is described in the start tag of the leaf element, the program may cause the computer to function so that the document realization value configuration changing unit adds the name and value of that attribute to the start tag of the parent element as an attribute name and attribute value of the parent element. The program may also cause the computer to function so that the document realization value configuration changing unit deletes the end tag of the parent element and changes the start tag of the parent element into an empty-element tag.

The structured document compression program may also cause the computer to function as a document type definition analysis unit that analyzes the tree structure of elements in the document type definition of the structured document, and as a document type definition configuration changing unit that, according to the analysis result of the document type definition analysis unit, deletes the information about each element that is a leaf of the tree structure (hereinafter, a leaf element) from the document type definition and redefines it in the document type definition as an attribute of the parent element of that leaf element.

The recording medium of the present invention (claim 9) is a computer-readable recording medium storing a structured document compression program that causes a computer to realize the function of compressing a structured document. The structured document compression program causes the computer to function as: a document realization value analysis unit that analyzes the description inside the tags of the document realization value of the structured document; a tag dictionary creation unit that, according to the analysis result of the document realization value analysis unit, creates a tag dictionary associating each character string described inside the tags of the document realization value with a shortened character string that is shorter than that character string and can identify it; and a document realization value character string replacement unit that uses the tag dictionary to replace each character string described inside the tags of the document realization value with the corresponding shortened character string.

The recording medium of the present invention (claim 10) is a computer-readable recording medium storing a structured document compression program that causes a computer to realize the function of compressing a structured document. The structured document compression program causes the computer to function as: a document realization value analysis unit that analyzes the description inside the tags of the document realization value of the structured document; a document type definition analysis unit that analyzes the description of the document type definition of the structured document; a tag dictionary creation unit that, according to the analysis results of the document realization value analysis unit and the document type definition analysis unit, creates a tag dictionary associating each character string described inside the tags of the document realization value and in the document type definition with a shortened character string that is shorter than that character string and can identify it; a document realization value character string replacement unit that uses the tag dictionary to replace each character string described inside the tags of the document realization value with the corresponding shortened character string; and a document type definition character string replacement unit that uses the tag dictionary to replace each character string described in the document type definition with the corresponding shortened character string.
In this case, the structured document compression program may treat the element names and attribute names described inside the tags or in the document type definition as the character strings, and cause the computer to replace the element names and attribute names with the shortened character strings (claim 11).

The structured document compression method, the compression apparatus, and the computer-readable recording medium storing the structured document compression program described above provide the following effects and advantages.

(1) The tree structure of elements in the document realization value is analyzed and, according to the analysis result, the information about each leaf element in the document realization value is moved into the start tag of its parent element as an attribute of the parent element. More specifically, the start tag, end tag, and content of the leaf element are deleted from the document realization value, and the element name and content of the leaf element are added to the start tag of the parent element as an attribute name and attribute value of the parent element, respectively. The description of the leaf element can thus be handled as an attribute of the parent element, and the start tag and end tag of the leaf element no longer need to be written. The tag description of the leaf element is omitted and compressed without impairing the characteristics of the structured document and while keeping the document searchable.

The compression rate of the structured document can therefore be greatly increased, and consequently the storage efficiency of document data can be greatly increased in a system that handles a large-scale database. In particular, when a parts list, price list, or similar data containing many short words is described as a structured document, the paired start tag and end tag enclosing each short word (content) can be omitted, so the compression rate can be greatly increased.

(2) When an attribute is described in the start tag of a leaf element, the name and value of that attribute are added to the start tag of the parent element as an attribute name and attribute value of the parent element. The description of the leaf element's attribute is thus also handled as an attribute of the parent element, and the compression rate of the structured document can be further increased.
(3) By deleting the end tag of the parent element and changing the start tag of the parent element into an empty-element tag, the end tag of the parent element can also be removed from the description of the structured document, and the compression rate of the structured document can be further increased.

(4) The tree structure of elements in the document type definition is analyzed and, according to the analysis result, the information about each leaf element is deleted from the document type definition and redefined in the document type definition as an attribute of the parent element. More specifically, the element type declaration of the leaf element is deleted from the document type definition, the description of the leaf element is deleted from the element type declaration of the parent element, and the information in the element type declaration of the leaf element is redefined as an attribute of the parent element (claim 6). Compression corresponding to the compression applied to the document realization value is thus also applied to the document type definition, and the description of the leaf element can be handled as an attribute of the parent element. The element type declaration of the leaf element is therefore omitted and the document type definition is compressed without impairing the characteristics of the structured document and while keeping the document searchable, so the compression rate of the structured document can be further increased.

(5) When an attribute of a leaf element is defined in the document type definition by an attribute list declaration for the leaf element, that attribute list declaration is deleted from the document type definition and the attribute of the leaf element is redefined as an attribute of the parent element. The description of the leaf element's attribute can thus also be handled as an attribute of the parent element. The attribute list declaration of the leaf element is therefore omitted and the document type definition is further compressed without impairing the characteristics of the structured document and while keeping the document searchable, so the compression rate of the structured document can be further increased.

(6) The description inside the tags of the document realization value is analyzed, a tag dictionary is created according to the analysis result, and the tag dictionary is used to replace each character string described inside the tags of the document realization value with a shortened character string that is shorter than that character string and can identify it (claims 1, 6, and 9). The character strings inside the tags are thus compressed without impairing the characteristics or structure of the structured document, so the compression rate of the structured document can be greatly increased, and consequently the storage efficiency of document data can be greatly increased in a system that handles a large-scale database.

(7) The description inside the tags of the document realization value and in the document type definition is analyzed, a tag dictionary is created according to the analysis result, and the tag dictionary is used to replace each character string described inside the tags of the document realization value and in the document type definition with a shortened character string that is shorter than that character string and can identify it (claims 2, 7, and 10). Even when the structured document has a document type definition, the character strings of the document type definition are thus compressed without impairing the characteristics or structure of the structured document, so the compression rate of the structured document can be greatly increased, and consequently the storage efficiency of document data can be greatly increased in a system that handles a large-scale database.

(8) By replacing the element names and attribute names described inside the tags and in the document type definition with shortened character strings (claims 3, 8, and 11), the tag portions and the document type definition can be compressed while the document remains searchable. That is, only element names and attribute names are replaced and attribute values are kept in their original form, so the document can be searched and its structure examined in its compressed state, without decompressing the document data. Consequently, when the structure of a compressed document is analyzed for a search, the compressed document does not need to be decompressed, and even when document data is stored in compressed form in a large-scale database, search processing on that data can be performed in a short time.
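The following sketch illustrates why a search can run on the compressed form: the query names are translated through the tag dictionary once, and the untouched attribute values are then compared directly. The dictionary contents, the wrapper element `lib`, the second record, and the function name `find_by_attribute` are hypothetical and are introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag dictionary: original name -> shortened name.
TAG_DICTIONARY = {"book": "a", "field": "b", "title": "c", "author": "d", "year": "e"}

# Hypothetical compressed document; attribute values are stored uncompressed.
COMPRESSED = (
    '<lib>'
    '<a b="本" c="XML入門" d="佐藤元" e="1955" />'
    '<a b="雑誌" c="Example Monthly" d="Example Author" e="1960" />'
    '</lib>'
)


def find_by_attribute(root, element_name, attribute_name, value, dictionary):
    """Search the compressed tree using the original (uncompressed) names."""
    short_element = dictionary.get(element_name, element_name)
    short_attribute = dictionary.get(attribute_name, attribute_name)
    return [e for e in root.iter(short_element) if e.get(short_attribute) == value]


root = ET.fromstring(COMPRESSED)
hits = find_by_attribute(root, "book", "year", "1955", TAG_DICTIONARY)
for hit in hits:
    print(hit.get(TAG_DICTIONARY["title"]))
```

Only the query is mapped through the dictionary; the stored document is never expanded.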

(9) By using a word dictionary to replace each word character string contained in the content of the document realization value with a shortened character string that is shorter than that word character string and can identify it (claim 4), the plain text portion of the structured document (the content of the document realization value) is compressed, so the compression rate of the structured document can be further increased.
(10) By further compressing the character strings after the replacement processing with variable-length coding (claim 5), the compression rate of the structured document can be further increased.
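The text does not name a particular variable-length code. As one common example, the sketch below applies Huffman coding to a character string after replacement; the choice of Huffman coding, the function names, and the sample string are assumptions made for illustration.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return {symbol: bit string}; frequent symbols get shorter codes."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(text, code):
    return "".join(code[s] for s in text)


def decode(bits, code):
    reverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in reverse:  # codes are prefix-free, so the first match is correct
            out.append(reverse[buf])
            buf = ""
    return "".join(out)


sample = '<a b="本" c="XML入門" d="佐藤元" e="1955" />'
code = huffman_code(sample)
bits = encode(sample, code)
print(len(bits), "bits versus", 8 * len(sample.encode("utf-8")), "bits as UTF-8")
```

The code table must be kept with the encoded data so that it can be decoded.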

Embodiments of the present invention will be described below with reference to the drawings.
[1] Description of First Embodiment
First, the compression principle of a structured document in the first embodiment of the present invention will be described with reference to FIGS. 2 to 4. FIGS. 2A to 2D and FIGS. 3A to 3C are diagrams for explaining the compression principle of the structured document (document realization value) in the first embodiment, and FIGS. 4A and 4B are diagrams for explaining the compression principle of the structured document (DTD) in the first embodiment. In the following description of the first embodiment, the structured document is an XML document.

As described above, in parts lists, price lists, and similar data written in XML, redundant expressions such as a start tag and end tag pair enclosing a short word (plain text) as content appear frequently, as shown in FIGS. 2A and 2B.
FIG. 2A shows a general XML description example in which two child elements (element 2 and element 3, neither of which has child elements) exist under a parent element (element 1). FIG. 2B shows a concrete description example corresponding to the general description example of FIG. 2A. Hereinafter, an element having no child elements may be referred to as a leaf element or simply a leaf.

In the general description example shown in FIG. 2A, element 1 is given element name 1 and attribute information (attribute name 1 and attribute value 1). Element 2, a child element of element 1, is given element name 2 and has content 2. Element 3, which like element 2 is a child element of element 1 (a sibling of element 2), is given element name 3 and attribute information (attribute name 3 and attribute value 3) and has content 3.

In the concrete description example shown in FIG. 2B, element name 1 is "book", attribute name 1 is "field", attribute value 1 is "本" ("book"), element name 2 is "title", content 2 is "XML入門" ("Introduction to XML"), element name 3 is "author", attribute name 3 is "year", attribute value 3 is "1955", and content 3 is "佐藤元".
FIG. 2C shows the tree structure of the description examples shown in FIGS. 2A and 2B. FIG. 2D shows the leaf element list for the examples shown in FIGS. 2A to 2C.

In the description example shown in FIG. 2B, the description <book field="本"> on the first line is the start tag of the element "book" and the description </book> on the fifth line is its end tag; the second to fourth lines enclosed by these tags are the content of the element "book". The description field="本" in the start tag on the first line is the attribute information of the element "book" (attribute name "field", attribute value "本").

On the second line, the description <title> is the start tag of the element "title" and the description </title> is its end tag; the description "XML入門" between these tags is the content of the element "title".
Similarly, the description <author year="1955"> on the third line is the start tag of the element "author" and the description </author> on the fourth line is its end tag; the description "佐藤元" between these tags is the content of the element "author". The description year="1955" in the start tag on the third line is the attribute information of the element "author" (attribute name "year", attribute value "1955").
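The description example of FIG. 2B and the leaf element list of FIG. 2D can be illustrated with the short sketch below. The XML text is reconstructed from the description above, so its exact line layout is an assumption, as are the function name `list_leaves` and the row format of the list.

```python
import xml.etree.ElementTree as ET

# Reconstructed from the description of FIG. 2B; line layout is assumed.
FIG_2B = """<book field="本">
  <title>XML入門</title>
  <author year="1955">佐藤元</author>
</book>"""


def list_leaves(root):
    """Return rows of (parent name, leaf name, leaf attributes, leaf content)."""
    rows = []
    for parent in root.iter():
        for child in parent:
            if len(child) == 0:  # no child elements, so this is a leaf
                rows.append((parent.tag, child.tag, dict(child.attrib), child.text or ""))
    return rows


leaf_table = list_leaves(ET.fromstring(FIG_2B))
for row in leaf_table:
    print(row)
```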

In the first embodiment of the present invention, elements that have no child elements and are lined up as leaves of the tree structure (leaf elements), as shown in FIGS. 2A to 2D, are detected, and a file containing a leaf element list such as that shown in FIG. 2D is output. Based on that file, as shown in FIGS. 3A and 3B, the element name and content of each leaf are converted into attributes of the upper element 1 (the parent element), the leaf elements are deleted, and the start tag of element 1 is changed into an empty-element tag. The attribute information of element 3 (attribute name 3 and attribute value 3) is also placed among the attributes of element 1, on an equal footing with element name 3 and content 3 of element 3.

As shown in FIGS. 3A and 3B, a tag written by enclosing an element name and attribute information between "<" and "/>" is an empty-element tag, which has no content. Attribute information does not necessarily have to be written in such a tag.
FIG. 3A shows the description obtained by compressing the general description example of FIG. 2A with the compression method of the first embodiment, and FIG. 3B shows the description obtained by compressing the concrete description example of FIG. 2B with the same method.

As shown in FIGS. 3A and 3B, the start tag (empty-element tag) of element 1 specifies not only element name 1 and its attribute information (attribute name 1 and attribute value 1), but also element name 2 and content 2 as the second attribute name and attribute value of element 1, element name 3 and content 3 as the third attribute name and attribute value of element 1, and attribute name 3 and attribute value 3 as the fourth attribute name and attribute value of element 1.
FIG. 3C is a schematic representation of the structure of element 1 having the start tag (empty-element tag) shown in FIGS. 3A and 3B.
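The transformation from FIG. 2B to FIG. 3B can be sketched as follows. This is a minimal sketch rather than the patented implementation: the function name `fold_leaves` is hypothetical, and the decision to skip folding when attribute names would collide or when the parent has text content of its own is an assumption that the text above does not address.

```python
import xml.etree.ElementTree as ET


def fold_leaves(parent):
    """Move the leaf children of `parent` into attributes of `parent`.

    Returns True when folding was applied. Folding is skipped (an assumption
    of this sketch) if a child is not a leaf, if there is mixed content, or
    if two names would collide.
    """
    children = list(parent)
    if not children or (parent.text or "").strip():
        return False
    new_attrs = {}
    for child in children:
        if len(child) > 0 or (child.tail or "").strip():
            return False
        # Leaf name/content first, then the leaf's own attributes.
        pairs = [(child.tag, child.text or "")] + list(child.attrib.items())
        for name, value in pairs:
            if name in parent.attrib or name in new_attrs:
                return False
            new_attrs[name] = value
    for child in children:
        parent.remove(child)
    parent.attrib.update(new_attrs)
    parent.text = None  # no content left, so the tag serializes as empty
    return True


book = ET.fromstring(
    '<book field="本"><title>XML入門</title>'
    '<author year="1955">佐藤元</author></book>'
)
fold_leaves(book)
compressed = ET.tostring(book, encoding="unicode")
print(compressed)
```

The resulting attribute order (field, title, author, year) follows the order given for FIGS. 3A and 3B: the parent's own attribute, then each leaf's name and content, then the leaf's attributes.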

When the structured document compressed as described above with reference to FIGS. 2 and 3 has a DTD, the DTD is also changed (compressed) in accordance with that compression, as shown in FIGS. 4A and 4B.
FIG. 4A shows the DTD before the change (before compression), which defines the XML document shown in FIG. 2B. In the DTD shown in FIG. 4A, the first line is the document type declaration (DOCTYPE declaration), which declares that the document type name of this document, that is, the element name of the top-level element, is "book".

The description between "[" at the end of the first line and "]>" on the seventh line (the second to sixth lines) defines the structure of this document.
The element name in the element type declaration on the second line must match the document type name; in accordance with this rule, "book" is specified as the element name of the top-level element. The content model "(title,author)" in the element type declaration on the second line declares that two child elements with the element names "title" and "author" exist side by side under the parent element "book". That is, the element "book" is declared to consist of the two child elements "title" and "author".

The element type declaration on the third line specifies "title" as an element name and declares that the content of the element "title" is character data (#PCDATA). Similarly, the element type declaration on the fourth line specifies "author" as an element name and declares that the content of the element "author" is character data (#PCDATA). These element type declarations do not declare any child elements under either element. That is, no child elements have these elements as parents, and these elements are leaves of the tree structure.

The attribute list declaration on the fifth line declares, as an attribute of the element "book", the attribute name "field", three candidate attribute values "本" ("book"), "雑誌" ("magazine"), and "小冊子" ("booklet"), and the default value "本".
The attribute list declaration on the sixth line declares, as an attribute of the element "author", the attribute name "year", which indicates the author's year of birth, and the data type of its attribute value (CDATA).
The description "]>" on the seventh line indicates the end of the internal subset of the document type declaration that begins on the first line.

FIG. 4B shows the DTD after the change (after compression), which defines the XML document shown in FIG. 3B. In the DTD shown in FIG. 4B, the element type declarations for the element "title" and the element "author" that were present in the DTD of FIG. 4A have disappeared, and the definition of child elements (the content model) in the element type declaration of the parent element "book" on the second line has also disappeared. The attribute "year" attached to the element "author" has disappeared as well.

Instead, in the attribute list declarations on the third to sixth lines, the attribute names "title", "author", and "year" are newly added as attributes of the parent element "book". "#PCDATA" is declared as the attribute value candidate for the attribute names "title" and "author", and "CDATA" is declared as the attribute value candidate for the attribute name "year".

Here, "#PCDATA" covers not only character data but also tags and marks called entity references. An entity reference is a mechanism by which, when a structured document is written, a specific character string is represented in the character data by a predetermined mark, and the original character string is substituted for the mark when the character data is passed to application software. When an entity reference is used in an attribute value, however, it may refer to a character string in the same document as the attribute value, but it may not refer to a character string in an external file. CDATA, which is pure character data, is declared for the attribute value representing the year of birth.

In the structured document compression method of the first embodiment, the DTD is also modified and compressed to match the compression of the document realization value, as described above. The XML processor described earlier with reference to FIG. 40 can therefore correctly examine new XML documents such as those shown in FIGS. 3(A) and 3(B), in a manner consistent with that compression.

Also, as described above, information that was conventionally expressed as an "element" is converted into an "attribute" of the parent element, and the DTD is changed to match the resulting change in the structure of the XML document. Information that was conventionally expressed as an "element" thus comes to be detected as an "attribute" of the parent element. The representation of the XML document is therefore compressed (saved) by "the number of bytes in the element name" + "3 bytes".

The first embodiment of the present invention is described below in more detail and more concretely with reference to FIGS. 1 to 13.
FIG. 1 is a block diagram showing the functional configuration of a structured document compression apparatus according to the first embodiment of the present invention. As shown in FIG. 1, the compression apparatus of the first embodiment comprises a document storage unit 10, a document realization value analysis unit 20, a DTD analysis unit 30, a document realization value configuration change unit 40, a DTD configuration change unit 50, a new DTD file creation unit 60, and an old/new DTD correspondence table output unit 70.

The compression apparatus of this embodiment is realized by a computer system, such as a personal computer, in which a CPU, RAM, ROM, and the like are connected by a bus line.
That is, the RAM and ROM serve as the document storage unit 10, and the RAM also stores an application program for realizing the document realization value analysis unit 20, the DTD analysis unit 30, the document realization value configuration change unit 40, the DTD configuration change unit 50, the new DTD file creation unit 60, and the old/new DTD correspondence table output unit 70.

When the CPU executes this application program, the functions of the document realization value analysis unit 20, the DTD analysis unit 30, the document realization value configuration change unit 40, the DTD configuration change unit 50, the new DTD file creation unit 60, and the old/new DTD correspondence table output unit 70 (described in detail later) are realized, and the structured document compression apparatus of the first embodiment is thereby realized.

The program for realizing the compression apparatus of the first embodiment is provided in a form recorded on a computer-readable recording medium such as a flexible disk or a CD-ROM. The computer reads the program from the recording medium, transfers it to an internal or external storage device, and stores it there for use. The program may also be recorded in a storage device (recording medium) such as a magnetic disk, an optical disk, or a magneto-optical disk, and provided from that storage device to the computer via a communication path.

When the functions of the compression apparatus of the first embodiment are realized by a computer, the program stored in the internal storage device (for example, RAM) is executed by the microprocessor (for example, the CPU) of the computer. Alternatively, the microprocessor may read the program directly from the recording medium and execute it.

In this embodiment, a computer is a concept that includes hardware and an operating system, and means hardware that operates under the control of the operating system. When no operating system is needed and the application program operates the hardware by itself, the hardware itself corresponds to the computer. The hardware comprises at least a microprocessor such as a CPU and means for reading a computer program recorded on a recording medium.

The application program contains program code that causes such a computer to realize the functions of the document realization value analysis unit 20, the DTD analysis unit 30, the document realization value configuration change unit 40, the DTD configuration change unit 50, the new DTD file creation unit 60, and the old/new DTD correspondence table output unit 70. Some of these functions may be realized by the operating system rather than by the application program.

In addition to the flexible disk, CD-ROM, magnetic disk, optical disk, and magneto-optical disk mentioned above, the recording medium of this embodiment may be any of various computer-readable media, such as an IC card, a ROM cartridge, a magnetic tape, a punch card, an internal storage device of the computer (memory such as RAM or ROM), an external storage device, or printed matter on which a code such as a barcode is printed.

In the compression apparatus of the first embodiment shown in FIG. 1, the document storage unit 10 stores XML documents, which are structured documents. In this embodiment it stores XML documents both before and after compression, and it also stores the old/new DTD correspondence table described later.

The document realization value analysis unit 20 analyzes the document realization value that makes up an XML document: it analyzes the tree structure (parent-child relationships) of the elements in the document realization value as well as the description of the document realization value. Its analysis procedure is described later with reference to the flowchart of FIG. 6. As the result of analyzing the element tree structure (parent-child relationships), the document realization value analysis unit 20 of this embodiment outputs a leaf element list (file) such as that shown in FIG. 2(D).

The DTD analysis unit (document type definition analysis unit) 30 operates when the XML document is a valid XML document (that is, an XML document of pattern (4) or (5)). It analyzes the DTD: the tree structure (parent-child relationships) of the elements in the DTD as well as the description of the DTD. Its analysis procedure is described later with reference to the flowchart of FIG. 7. As the result of analyzing the element tree structure (parent-child relationships), the DTD analysis unit 30 of this embodiment also outputs a leaf element list (file) such as that shown in FIG. 2(D). Note that when the XML document is of pattern (4), the DTD is read from the document storage unit 10, whereas when the XML document is of pattern (5), the DTD is read from the external file 100.

The leaf element lists output by the document realization value analysis unit 20 and the DTD analysis unit 30 are, as described above, obtained by detecting the elements that have no child elements and therefore sit as leaves of the tree structure, and they record the correspondence between each leaf element and its parent element.
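The leaf element list described above can be illustrated with a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` parser; the function name `leaf_element_list`, the tuple layout, and the sample document are hypothetical and are not taken from FIG. 2(D).

```python
import xml.etree.ElementTree as ET


def leaf_element_list(root):
    """Return (leaf element name, parent element name) pairs in document order."""
    pairs = []
    for parent in root.iter():      # visit every element, including the root
        for child in parent:        # direct children only
            if len(child) == 0:     # no child elements, so it is a leaf of the tree
                pairs.append((child.tag, parent.tag))
    return pairs


# Hypothetical sample document; the values are illustrative only.
sample = ET.fromstring(
    '<book><title>Introduction to XML</title>'
    '<author year="1970">Taro Yamada</author></book>'
)
leaf_pairs = leaf_element_list(sample)
```

For the sample above, both "title" and "author" are reported as leaves whose parent is "book".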
To make the representation of the document realization value more concise, the document realization value configuration change unit 40 follows the analysis result (leaf element list) from the document realization value analysis unit 20 and moves the information about each leaf element in the document realization value into the start tag of that leaf element's parent element, as attributes of the parent element. It executes the following configuration change processing (a1) to (a3).

(a1) The start tag, end tag, and content of the leaf element are deleted from the document realization value, and the element name and content, which constitute the information about the leaf element, are added inside the start tag of the parent element as an attribute name and attribute value of the parent element, respectively.
(a2) When an attribute that carries information about the leaf element is written in the start tag of the leaf element, the attribute name and attribute value of that attribute are added inside the start tag of the parent element as an attribute name and attribute value of the parent element, respectively.

(a3) The end tag of the parent element is deleted, and the start tag of the parent element is changed to an empty-element tag.
The configuration change procedure performed by the document realization value configuration change unit 40 is described later with reference to the flowchart of FIG. 8. In this embodiment, the result of the configuration change by the document realization value configuration change unit 40 (the compressed document realization value) is output to and stored in the document storage unit 10, but it may instead be output to and stored on another recording medium or the like.
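Steps (a1) to (a3) can be sketched as follows. This is an illustration under stated assumptions, not the procedure of FIG. 8: it uses Python's standard `xml.etree.ElementTree`, the function name `fold_leaves` and the sample values are hypothetical, and it assumes that leaf names and leaf attribute names do not collide with each other or with existing attributes of the parent.

```python
import xml.etree.ElementTree as ET


def fold_leaves(parent):
    """Move every leaf child of `parent` into attributes of `parent` (in place)."""
    for child in list(parent):
        if len(child) > 0:
            continue  # not a leaf; left untouched in this sketch
        # (a1) element name becomes the attribute name, content the attribute value
        parent.set(child.tag, child.text or "")
        # (a2) attributes of the leaf become attributes of the parent
        for name, value in child.attrib.items():
            parent.set(name, value)
        parent.remove(child)  # drops the leaf's start tag, end tag, and content
    # (a3) with no children and no text left, ElementTree serializes the
    # parent as an empty-element tag
    if len(parent) == 0 and not (parent.text or "").strip():
        parent.text = None


# Hypothetical sample document; the values are illustrative only.
source = ('<book><title>Introduction to XML</title>'
          '<author year="1970">Taro Yamada</author></book>')
root = ET.fromstring(source)
fold_leaves(root)
compressed = ET.tostring(root, encoding="unicode")
```

After the call, `root` carries the attributes "title", "author", and "year", has no child elements, and serializes as a single empty-element tag that is shorter than `source`.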

The DTD configuration change unit (document type definition configuration change unit) 50 operates when the XML document is a valid XML document (an XML document of pattern (4) or (5)). To make the representation of the DTD more concise in step with the configuration change made by the document realization value configuration change unit 40, it follows the analysis result (leaf element list) from the DTD analysis unit 30, deletes the information about each leaf element from the DTD, and redefines that information in the DTD as attributes of the leaf element's parent element. It executes the following configuration change processing (b1) and (b2).

(b1) The element type declaration of the leaf element is deleted from the DTD, the description relating to the leaf element (the content model description) is deleted from the element type declaration of the parent element, and the information from the leaf element's element type declaration (the information in the deleted part) is redefined in the attribute list declaration of the parent element as an attribute of the parent element. Here, the element name and content (data type) of the leaf element are declared and redefined as an attribute name and attribute value of the parent element, respectively.

(b2) When the DTD defines attributes of the leaf element in an attribute list declaration for the leaf element, that attribute list declaration is deleted from the DTD, and the attributes of the leaf element are redefined in the attribute list declaration of the parent element as attributes of the parent element. Here, the attribute names and attribute values of the leaf element are declared and redefined as attribute names and attribute values of the parent element, respectively.

The configuration change procedure performed by the DTD configuration change unit 50 is described later with reference to the flowchart of FIG. 9. In this embodiment, when the XML document is of pattern (4), the result of the configuration change by the DTD configuration change unit 50 (the compressed DTD) is output to and stored in the document storage unit 10 together with the result of the configuration change by the document realization value configuration change unit 40 (the compressed document realization value), but both may instead be output to and stored on another recording medium or the like.
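The DTD changes (b1) and (b2) can be sketched on a simplified in-memory model rather than on DTD text. Everything here is an assumption made for illustration: the function name `fold_dtd`, the use of a dictionary that maps an element name either to a list of child names or to a data-type string, and the sample declarations, which only loosely follow the "book" example.

```python
def fold_dtd(content_models, attlists, parent):
    """Return new (content_models, attlists) with the leaf children of
    `parent` redefined as attributes of `parent`.

    content_models: element name -> list of child names, or a data-type string
    attlists:       element name -> {attribute name: attribute type}
    """
    models = dict(content_models)
    atts = {name: dict(decls) for name, decls in attlists.items()}
    parent_atts = atts.setdefault(parent, {})
    remaining = []
    for child in models[parent]:
        model = models.get(child)
        if not isinstance(model, str):
            remaining.append(child)  # has children of its own; kept as an element
            continue
        # (b1) delete the leaf's element type declaration and redefine its
        # name and data type as an attribute of the parent
        del models[child]
        parent_atts[child] = model
        # (b2) move the leaf's attribute list declaration to the parent
        parent_atts.update(atts.pop(child, {}))
    models[parent] = remaining       # content model no longer names the leaves
    return models, atts


# Hypothetical declarations, modelled loosely on the "book" example.
old_models = {"book": ["title", "author"], "title": "#PCDATA", "author": "#PCDATA"}
old_atts = {"author": {"year": "CDATA"}}
new_models, new_atts = fold_dtd(old_models, old_atts, "book")
```

With these inputs, the declarations for "title" and "author" are gone, the content model of "book" is empty, and "book" gains the attributes "title" and "author" (declared with "#PCDATA") and "year" (declared with "CDATA"), which mirrors the description of FIG. 4(B).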

When the DTD resides in the external file 100 (when the XML document is of pattern (5)), the new DTD file creation unit 60 creates a file containing the DTD as changed by the DTD configuration change unit 50 (a new DTD file) and outputs it to the external file 100.
When the DTD resides in the external file 100 (when the XML document is of pattern (5)), the old/new DTD correspondence table output unit 70 creates an old/new DTD correspondence table that records the correspondence between the DTD before the configuration change and the new DTD after the configuration change, and outputs it to the document storage unit 10.

An XML document compressed by the compression method of the first embodiment (element-to-attribute conversion) loses none of its characteristics as an XML document and can function as an XML document while still compressed (without decompression). There is therefore no particular need to discuss decompression of the compressed XML document, and only the compression method is described in the first embodiment.

Next, the operation of the compression apparatus of the first embodiment is described with reference to FIGS. 5 to 13.
First, the procedure for compressing a structured document (XML document) in the first embodiment is described following the flowchart of FIG. 5 (steps S11 to S29). In the compression method of the first embodiment, converting elements to attributes as described above makes the leaf elements disappear and turns the parent element of those leaves into an empty element.

Although omitted from FIG. 1, the compression apparatus of the first embodiment has a pattern recognition function for recognizing which of patterns (1) to (5) (see Table 2) the XML document stored in the document storage unit 10 belongs to. The processing performed by this pattern recognition function corresponds to steps S12 to S14 shown in FIG. 5.

When an XML document to be compressed is input and stored in the document storage unit 10 (step S11), it is determined whether "<!DOCTYPE" is written in that XML document (step S12). If it is not (NO route of step S12), the XML document is recognized as a well-formed XML document without a DTD, that is, an XML document of pattern (1), and steps S15, S16, and S29 are executed as described later.

If "<!DOCTYPE" is written in the XML document (YES route of step S12), it is determined whether "[" is written after it (step S13).
If "<!DOCTYPE" is written but "[" is not (NO route of step S13), the XML document is recognized as a valid XML document that has its DTD in the external file 100, that is, an XML document of pattern (5), and steps S21 to S29 are executed as described later.

If "[" is written (YES route of step S13), it is determined whether "<!ELEMENT" (or "<!ATTLIST") is written (step S14).
If "<!DOCTYPE" and "[" are written but "<!ELEMENT" is not (NO route of step S14), the XML document is recognized as a well-formed XML document whose DTD contains entity declarations, that is, an XML document of pattern (2), and steps S15, S16, and S29 are executed in the same way as for pattern (1).

If "<!DOCTYPE", "[", and "<!ELEMENT" are all written (YES route of step S14), the XML document is recognized as a valid XML document that has its DTD inside the XML document, that is, an XML document of pattern (4), and steps S17 to S20 and S29 are executed.
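The checks of steps S12 to S14 described above can be sketched as a small classifier. The function name `classify_pattern` and the sample documents are hypothetical, and looking for "[" only inside the document type declaration is an assumption of this sketch. These three checks yield only patterns (1), (2), (4), and (5); how pattern (3) is told apart is not covered here.

```python
def classify_pattern(xml_text):
    """Classify an XML document as pattern 1, 2, 4, or 5 using the three
    textual checks of steps S12 to S14."""
    start = xml_text.find("<!DOCTYPE")
    if start < 0:
        return 1  # S12 NO: well-formed document without a DTD
    bracket = xml_text.find("[", start)
    first_gt = xml_text.find(">", start)
    # S13: "[" counts only if it opens an internal subset, i.e. it appears
    # before the first ">" that follows "<!DOCTYPE"
    if bracket < 0 or 0 <= first_gt < bracket:
        return 5  # S13 NO: DTD held in an external file
    end = xml_text.find("]>", bracket)
    subset = xml_text[bracket + 1:end if end >= 0 else len(xml_text)]
    if "<!ELEMENT" in subset or "<!ATTLIST" in subset:
        return 4  # S14 YES: valid document with an internal DTD
    return 2      # S14 NO: DTD contains entity declarations only


# Hypothetical sample documents, one per recognizable pattern.
doc1 = '<?xml version="1.0"?>\n<book/>'
doc2 = ('<?xml version="1.0"?>\n<!DOCTYPE book [\n'
        '<!ENTITY XML "Extensible Markup Language">\n]>\n<book/>')
doc4 = ('<?xml version="1.0"?>\n<!DOCTYPE book [\n'
        '<!ELEMENT book EMPTY>\n]>\n<book/>')
doc5 = '<?xml version="1.0"?>\n<!DOCTYPE book SYSTEM "book.dtd">\n<book/>'
patterns = [classify_pattern(d) for d in (doc1, doc2, doc4, doc5)]
```

The returned pattern number then selects the processing path: steps S15, S16, and S29 for patterns (1) and (2), steps S17 to S20 and S29 for pattern (4), and steps S21 to S29 for pattern (5).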
The compression processing for each of patterns (1) to (5) is described below with reference to the specific examples (first to fourth examples) shown in FIGS. 10 to 13.

FIGS. 10(A) and 10(B) are both diagrams for explaining a specific compression process (first example) for an XML document according to the first embodiment.
The XML document before compression shown in FIG. 10(A) is an XML document of pattern (1) described above. Its first line contains an XML declaration indicating that the document is a version 1.0 XML document, and lines 2 to 5 contain the document realization value. The document realization value written here is identical to the description example of the XML document shown in FIG. 2(B), except that the attribute information (field="本") is not written in the start tag of the element "book".

Because "<!DOCTYPE" is not written in the XML document (pattern (1)) shown in FIG. 10(A), processing moves from the NO route of step S12 to step S15, where the document realization value analysis unit 20 analyzes the document realization value. This detects where the leaf elements are written in the document realization value, and the correspondence between the element name of each leaf element and the element name of its parent element is registered and output as a leaf element list. For the XML document shown in FIG. 10(A), a leaf element list similar to that of FIG. 2(D) is obtained.

Then, in accordance with the leaf element list (analysis result) obtained in step S15, the document realization value configuration change unit 40 moves and changes the element name and content of each leaf element into an attribute name and attribute value of the parent element, respectively, and changes the start tag of the parent element to an empty-element tag (step S16). At this time, the attribute "year" that accompanied the leaf element "author" becomes an attribute of the parent element "book".

Through steps S15 and S16, the XML document shown in FIG. 10(A), for example, is changed and compressed into the XML document shown in FIG. 10(B), and is then output to and stored in the document storage unit 10 or the like as a compressed document (step S29).
In the XML document shown in FIG. 10(B), the XML declaration on the first line is unchanged from before compression, but the second line consolidates what was written on lines 2 to 5 of FIG. 10(A). That is, in the XML document shown in FIG. 10(B), as in the example shown in FIG. 3(B), all of the information relating to the two child elements "title" and "author" is written in the start tag (empty-element tag) of the parent element "book" as attributes of the parent element "book".

FIGS. 11(A) and 11(B) are both diagrams for explaining a specific compression process (second example) for an XML document according to the first embodiment.
The XML document before compression shown in FIG. 11(A) is an XML document of pattern (2) described above. Its first line contains an XML declaration indicating that the document is a version 1.0 XML document, lines 2 to 4 contain a DTD that includes a replacement string definition (entity declaration), and lines 5 to 8 contain the document realization value.

In the DTD on lines 2 to 4, the entity declaration (line 3) contained in the document type declaration defines that the entity of the replacement string "XML" used within the document realization value (XML instance) is "Extensible Markup Language".
The document realization value written on lines 5 to 8 is almost identical to the description example on lines 2 to 5 of the XML document shown in FIG. 10(A), but in the example of FIG. 11(A) the content of the element "title" on line 6 is written as 「ＸＭＬ（＆ＸＭＬ；の略称）入門」 ("Introduction to XML (abbreviation of &XML;)").

Here, "&XML;" is a description that instructs a reference to the entity of the replacement string "XML"; in a document actually output by display, printing, or the like, it is rendered as "Extensible Markup Language".
In the XML document (pattern (2)) shown in FIG. 11(A), both "<!DOCTYPE" and "[" are written, but neither "<!ELEMENT" nor "<!ATTLIST" is written. Processing therefore moves from the NO route of step S14 to step S15, and the same processing as for the XML document of pattern (1) described above is executed. At this time, an entity reference written in the content of the document realization value ("&XML;" in FIG. 11(A)) is carried over unchanged into the attribute value of the parent element.

As a result, the XML document shown in FIG. 11(A), for example, is changed and compressed into the XML document shown in FIG. 11(B), and is then output to and stored in the document storage unit 10 or the like as a compressed document (step S29).
In the XML document shown in FIG. 11(B), lines 1 to 4 are unchanged from before compression, but line 5 consolidates what was written on lines 5 to 8 of FIG. 11(A). That is, in the XML document shown in FIG. 11(B), as in the example shown in FIG. 10(B), all of the information relating to the two child elements "title" and "author" is written in the start tag (empty-element tag) of the parent element "book" as attributes of the parent element "book". In FIG. 11(B), however, the attribute value for the attribute name "title" is written unchanged as 「ＸＭＬ（＆ＸＭＬ；の略称）入門」.

An XML document of pattern (3) (see, for example, FIG. 22(A)) uses, as described above, an entity reference to cite an external file within the content of the document realization value (XML instance). The first embodiment handles the content of the document realization value as an attribute value of the parent element, but the XML specification does not allow an external entity reference to be used in an attribute value. The compression method of the first embodiment is therefore not applied to XML documents of pattern (3).

FIGS. 12(A) and 12(B) are both diagrams for explaining a specific compression process (third example) for an XML document according to the first embodiment.
The XML document before compression shown in FIG. 12(A) is an XML document of pattern (4) described above. Its first line contains an XML declaration indicating that the document is a version 1.0 XML document, lines 2 to 8 contain the DTD, and lines 9 to 12 contain the document realization value. The DTD on lines 2 to 8 is identical to the DTD shown in FIG. 4(A), and the document realization value on lines 9 to 12 is identical to the description example of the document realization value shown in FIG. 2(B), so their description is omitted.

In the XML document (pattern (4)) shown in FIG. 12(A), "<!DOCTYPE" and "[" are written, and "<!ELEMENT" or "<!ATTLIST" is written as well. Processing therefore moves from the YES route of step S14 to step S17, where the document realization value analysis unit 20 analyzes the document realization value, and the DTD analysis unit 30 analyzes the DTD (step S18). This detects where the leaf elements are written in the document realization value and in the DTD, and the correspondence between the element name of each leaf element and the element name of its parent element is registered and output as a leaf element list. For the XML document shown in FIG. 12(A), too, a leaf element list similar to that of FIG. 2(D) is obtained.

Then, in accordance with the leaf element list (analysis result) obtained in step S17, the document realization value configuration change unit 40 moves and changes the element name and content of each leaf element into an attribute name and attribute value of the parent element, respectively, and changes the start tag of the parent element to an empty-element tag (step S19). At this time, the attribute "year" that accompanied the leaf element "author" becomes an attribute of the parent element "book".

Because the XML document shown in FIG. 12(A) is of pattern (4) (that is, a valid XML document with the DTD written inside it), the DTD configuration change unit 50 changes the configuration of the DTD as follows, to match the change that the document realization value configuration change unit 40 made to the configuration of the document realization value (step S20).

That is, in the document realization value after the configuration change, the parent element "book" no longer has the child elements "title" and "author", so the content model (title,author) that declared the child elements in the element type declaration of the parent element "book" is deleted. In addition, in the document realization value after the configuration change, the leaf elements "title" and "author" are deleted and turned into attributes of the parent element "book", and the attribute "year" of the element "author" is likewise turned into an attribute of the parent element "book". The element type declarations of the elements "title" and "author" and the attribute list declaration of the element "author" are therefore deleted as well. The parent element "book", on the other hand, newly has "title", "author", and "year" as attributes, so the attribute names and attribute values of the new attributes are listed in the attribute list declaration of the parent element "book".

上述したステップＳ１７〜Ｓ２０によって、図１２（Ａ）に示すＸＭＬ文書は、図１２（Ｂ）に示すようなＸＭＬ文書に変更・圧縮されてから、圧縮文書として文書記憶部１０等へ出力・格納される（ステップＳ２９）。 Through steps S17 to S20 described above, the XML document shown in FIG. 12A is changed and compressed into the XML document shown in FIG. 12B, and is then output to and stored in the document storage unit 10 or the like as a compressed document (step S29).
図１２（Ｂ）に示すＸＭＬ文書において、１行目および２行目の記述は圧縮前と変わらないが、３行目の要素「book」の要素型宣言からは内容モデルの記述が削除されている。また、４〜７行目の記述は、図１２（Ａ）における４〜７行目の記述を、要素「book」の属性リスト宣言内にまとめたものとなっている。 In the XML document shown in FIG. 12B, the descriptions on the first and second lines are the same as before compression, but the content model has been deleted from the element type declaration of the element “book” on the third line. The descriptions on the fourth to seventh lines gather the descriptions on the fourth to seventh lines of FIG. 12A into the attribute list declaration of the element “book”.

さらに、図３（Ｂ）に示した例と同様、９行目には、図１２（Ａ）における９〜１２行目の記述が集約されて記述されている。つまり、図１２（Ｂ）に示すＸＭＬ文書でも、図３（Ｂ）に示した例と同様、２つの子要素（葉要素）「title」および「author」にかかる全ての情報が、親要素「book」の開始タグ（空要素タグ）において、親要素「book」の属性として記述される。 Further, as in the example shown in FIG. 3B, the descriptions of the 9th to 12th lines in FIG. 12A are collectively described in the 9th line. That is, even in the XML document shown in FIG. 12B, as in the example shown in FIG. 3B, all information related to the two child elements (leaf elements) “title” and “author” In the start tag (empty element tag) of “book”, it is described as an attribute of the parent element “book”.

図１３（Ａ）〜図１３（Ｄ）はいずれも第１実施形態によるＸＭＬ文書の具体的な圧縮処理（第４例）を説明するための図である。 FIGS. 13A to 13D are diagrams for explaining a specific compression process (fourth example) for an XML document according to the first embodiment.
図１３（Ａ）に示す圧縮前のＸＭＬ文書は、前述したパターン(5)のＸＭＬ文書であり、１行目に、この文書がバージョン１．０のＸＭＬ文書であることを示すＸＭＬ宣言が記述され、２行目に、外部ファイル１００のＤＴＤを指定するための情報（システム識別子）を含むＤＴＤが記述され、３〜６行目に文書実現値が記述されている。ここで、３〜６行目に記述された文書実現値は、図２（Ｂ）に示した文書実現値の記述例と同一であるので、その説明は省略する。 The XML document before compression shown in FIG. 13A is an XML document of pattern (5) described above. The first line contains an XML declaration indicating that this is a version 1.0 XML document, the second line contains a DTD that includes information (a system identifier) designating the DTD in the external file 100, and the third to sixth lines contain the document realization value. The document realization value on the third to sixth lines is identical to the example shown in FIG. 2B, so its description is omitted.

２行目のＤＴＤの文書型宣言では、システム識別子“ＳＹＳＴＥＭ”により、外部ファイル１００に保持されたＤＴＤ（ファイル名「..\book.dtd」）を用いることが宣言・定義されている。 The document type declaration on the second line declares, by means of the system identifier “SYSTEM”, that the DTD held in the external file 100 (file name “..\book.dtd”) is to be used.
そして、ファイル名「..\book.dtd」のＤＴＤは、図１３（Ａ）における文書実現値の構成に対応して、図１３（Ｂ）に示すように記述されている。この図１３（Ｂ）に示すＤＴＤ（１〜５行目）は、図４（Ａ）に示したＤＴＤにおける２〜６行目の記述例と同一であるので、その説明は省略する。 The DTD with the file name “..\book.dtd” is written as shown in FIG. 13B, corresponding to the structure of the document realization value in FIG. 13A. The DTD shown in FIG. 13B (lines 1 to 5) is identical to lines 2 to 6 of the DTD shown in FIG. 4A, so its description is omitted.

図１３（Ａ）に示すＸＭＬ文書（パターン(5)）には、“＜!ＤＯＣＴＹＰＥ”は記述されているが、その後には“[”が記述されることなく、外部ファイル１００におけるＤＴＤを指定するシステム識別子が記述されているので、処理はステップＳ１３のＮＯルートからステップＳ２１へ移行し、文書実現値解析部２０によって文書実現値が解析されるとともに、ＤＴＤ解析部３０によって、システム識別子に従って外部ファイル１００から読み込まれたＤＴＤ（ファイル名「..\book.dtd」）が解析される（ステップＳ２２）。これにより、文書実現値中やＤＴＤ中において、葉となる要素がどこに記述されているかが検出され、その葉の要素名と、親要素の要素名との対応関係が、葉の要素一覧表として登録・出力される。図１３（Ａ）や図１３（Ｂ）に示すＸＭＬ文書の場合も、図２（Ｄ）と同様の葉の要素一覧表が得られる。 In the XML document (pattern (5)) shown in FIG. 13A, “<!DOCTYPE” is present but is followed not by “[” but by a system identifier designating the DTD in the external file 100. The process therefore moves from the NO route of step S13 to step S21, where the document realization value analysis unit 20 analyzes the document realization value, and the DTD analysis unit 30 analyzes the DTD (file name “..\book.dtd”) read from the external file 100 according to the system identifier (step S22). This detects where leaf elements are described in the document realization value and in the DTD, and the correspondence between each leaf element name and the element name of its parent is registered and output as a leaf element list. For the XML document shown in FIGS. 13A and 13B as well, a leaf element list similar to that of FIG. 2D is obtained.

このとき、図１３（Ｂ）に示すＤＴＤを変更・圧縮して得られる新規のＤＴＤのために、元のファイル名とは異なる新規のファイル名（例えば「..\book2.dtd」）を設定して文書実現値に記入することにより、文書実現値における文書型宣言のシステム識別子“ＳＹＳＴＥＭ”により指定されるファイル名を、旧ファイル名「..\book.dtd」から、新規ファイル名「..\book2.dtd」に書き換える。 At this time, a new file name different from the original one (for example, “..\book2.dtd”) is set for the new DTD obtained by changing and compressing the DTD shown in FIG. 13B, and is written into the document realization value. The file name specified by the system identifier “SYSTEM” in the document type declaration is thereby rewritten from the old file name “..\book.dtd” to the new file name “..\book2.dtd”.
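The rewrite of the system identifier described above can be sketched in Python as follows. This is only an illustration, not the implementation disclosed in the patent: the function name `retarget_doctype`, the regular expression, and the sample document used with it are assumptions, and the sketch handles only a document type declaration with a quoted SYSTEM identifier and no internal subset.

```python
import re

# Matches <!DOCTYPE name SYSTEM "file"> and captures the quoted file name.
DOCTYPE_SYSTEM = re.compile(r'<!DOCTYPE\s+\S+\s+SYSTEM\s+(["\'])(.*?)\1')


def retarget_doctype(xml_text, new_system_id):
    """Point the DOCTYPE at a new DTD file.

    Returns (rewritten_text, (old_name, new_name)); the pair is None when
    no SYSTEM identifier is found, in which case the text is unchanged.
    """
    found = DOCTYPE_SYSTEM.search(xml_text)
    if found is None:
        return xml_text, None
    # Slice instead of re.sub so that backslashes in Windows-style
    # paths such as ..\book2.dtd are inserted literally.
    rewritten = (xml_text[:found.start(2)] + new_system_id
                 + xml_text[found.end(2):])
    return rewritten, (found.group(2), new_system_id)
```

The returned pair of old and new file names is the kind of information that could later be kept alongside the compressed document.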

この後、ステップＳ２１で得られた、葉の要素一覧表（解析結果）に従って、文書実現値構成変更部４０により、葉の要素についての要素名および内容が、それぞれ、親要素の属性名および属性値に移動・変更されるとともに、親要素の開始タグが空要素タグに変更される（ステップＳ２４）。このとき、葉の要素「author」に付随した属性「year」は、親要素「book」の属性に代わる。 Thereafter, according to the leaf element list (analysis result) obtained in step S21, the document realization value configuration changing unit 40 moves the element name and content of each leaf element into the parent element as an attribute name and attribute value, respectively, and changes the start tag of the parent element to an empty-element tag (step S24). At this time, the attribute “year” attached to the leaf element “author” becomes an attribute of the parent element “book”.

これにより、図１３（Ａ）に示すＸＭＬ文書は、図１３（Ｃ）に示すようなＸＭＬ文書に変更・圧縮される。図１３（Ｃ）に示すＸＭＬ文書において、１行目の記述は圧縮前と変わらないが、２行目のシステム識別子“ＳＹＳＴＥＭ”により指定されるファイル名が新規ファイル名「..\book2.dtd」となり、３行目には、図１３（Ａ）における３〜６行目の記述が集約されて記述されている。つまり、図１３（Ｃ）に示すＸＭＬ文書でも、図３（Ｂ）に示した例と同様、２つの子要素（葉要素）「title」および「author」にかかる全ての情報が、親要素「book」の開始タグ（空要素タグ）において、親要素「book」の属性として記述される。 As a result, the XML document shown in FIG. 13A is changed and compressed into an XML document as shown in FIG. In the XML document shown in FIG. 13C, the description on the first line is the same as before compression, but the file name specified by the system identifier “SYSTEM” on the second line is the new file name “.. \ book2.dtd”. In the third line, the descriptions of the third to sixth lines in FIG. 13A are collectively described. That is, in the XML document shown in FIG. 13C, as in the example shown in FIG. 3B, all the information related to the two child elements (leaf elements) “title” and “author” In the start tag (empty element tag) of “book”, it is described as an attribute of the parent element “book”.

この後、新規ＤＴＤファイル作成部６０により、新規のＤＴＤファイルを作成し、そのＤＴＤファイルに、外部ファイル１００から読み込んだ圧縮前のＤＴＤファイルの内容を複写してから（ステップＳ２５）、文書実現値構成変更部４０による文書実現値の構成変更に合わせ、新規ファイルにおけるＤＴＤの構成を、ＤＴＤ構成変更部５０により、前述したステップＳ２０と同様にして変更する（ステップＳ２６）。 Thereafter, the new DTD file creation unit 60 creates a new DTD file, copies the contents of the uncompressed DTD file read from the external file 100 to the DTD file (step S25), and then realizes the document realization value. In accordance with the configuration change of the document realization value by the configuration change unit 40, the DTD configuration in the new file is changed by the DTD configuration change unit 50 in the same manner as step S20 described above (step S26).

これにより、図１３（Ｂ）に示すＤＴＤは、図１３（Ｄ）に示すようなＤＴＤに変更・圧縮される。図１３（Ｄ）に示すＤＴＤにおいて、１行目の要素「book」の要素型宣言からは内容モデルの記述が削除されている。また、２〜５行目の記述は、図１３（Ｂ）における２〜５行目の記述を、要素「book」の属性リスト宣言内にまとめたものとなっている。 As a result, the DTD shown in FIG. 13B is changed and compressed into a DTD as shown in FIG. In the DTD shown in FIG. 13D, the description of the content model is deleted from the element type declaration of the element “book” on the first line. The description on the 2nd to 5th lines is a summary of the description on the 2nd to 5th lines in FIG. 13B in the attribute list declaration of the element “book”.

そして、ＤＴＤ構成変更部５０で変更・圧縮されたＤＴＤのファイル（新規ＤＴＤファイル）は、新規のファイル名「..\book2.dtd」を付与されて、新規ＤＴＤファイル作成部６０から外部ファイル１００へ出力・格納される（ステップＳ２７）。 Then, the DTD file (new DTD file) changed / compressed by the DTD configuration changing unit 50 is given a new file name “.. \ book2.dtd”, and the new DTD file creating unit 60 sends the external file 100. Is output and stored (step S27).

また、新旧ＤＴＤ対応表出力部７０によって、旧ＤＴＤと新規ＤＴＤとの対応関係（具体的には旧ファイル名と新規ファイル名との対応関係）を明記した新旧ＤＴＤ対応表が作成されて文書記憶部１０等へ出力・格納されるとともに（ステップＳ２８）、ステップＳ２４において変更・圧縮されたＸＭＬ文書は、圧縮文書として文書記憶部１０等へ出力・格納される（ステップＳ２９）。その際、新旧ＤＴＤ対応表は、独立したファイルではなく、圧縮文書に注釈の形で付加してもよい。 Also, the old and new DTD correspondence table output unit 70 creates a new and old DTD correspondence table in which the correspondence between the old DTD and the new DTD (specifically, the correspondence between the old file name and the new file name) is created and stored in the document storage. The XML document output / stored in the unit 10 or the like (step S28), and the XML document changed / compressed in step S24 is output / stored as a compressed document in the document storage unit 10 or the like (step S29). At that time, the old and new DTD correspondence table may be added to the compressed document in the form of an annotation instead of an independent file.

なお、第１実施形態では、圧縮したＸＭＬ文書を元の状態に復元（伸長）する必要はないので、必ずしも、元のＤＴＤの保存や新旧ＤＴＤ対応表の作成を実行しなくてもよい。つまり、第１実施形態では、新規ＤＴＤファイル作成部６０やこの新規ＤＴＤファイル作成部６０によるステップＳ２５，Ｓ２７の処理、並びに、新旧ＤＴＤ対応表出力部７０やこの新旧ＤＴＤ対応表出力部７０によるステップＳ２８の処理を省略することも可能である。 In the first embodiment, since it is not necessary to restore (decompress) the compressed XML document to the original state, it is not always necessary to save the original DTD or create the old and new DTD correspondence table. That is, in the first embodiment, the processing of steps S25 and S27 by the new DTD file creation unit 60 and the new DTD file creation unit 60, and the steps by the old and new DTD correspondence table output unit 70 and the old and new DTD correspondence table output unit 70. It is also possible to omit the process of S28.

ただし、元のＤＴＤや新旧ＤＴＤ対応表は、第２実施形態で後述するごとく圧縮されたＸＭＬ文書を復元（伸長）する際に必要になるものである。第１実施形態の圧縮装置は、上述のような、元のＤＴＤの保存機能や新旧ＤＴＤ対応表の作成機能をそなえるとともに、図６を参照しながら後述するタグ辞書作成機能をそなえ、後述する第２実施形態の圧縮手法を実現することもできるように構成されている。 However, the original DTD and the old and new DTD correspondence tables are required when restoring (expanding) an XML document compressed as described later in the second embodiment. The compression apparatus of the first embodiment has the original DTD storage function and the old and new DTD correspondence table creation function as described above, and the tag dictionary creation function described later with reference to FIG. The compression method according to the second embodiment is also configured to be realized.

さて、次に、図６〜図９を参照しながら、第１実施形態の圧縮装置を構成する各部２０，３０，４０および５０の動作について説明する。 Next, operations of the respective units 20, 30, 40, and 50 constituting the compression device of the first embodiment will be described with reference to FIGS. 6 to 9.
まず、図６に示すフローチャート（ステップＳ３１〜Ｓ４３）に従って、第１実施形態の文書実現値解析部２０による解析手順について説明すると、文書実現値解析部２０は、圧縮対象の文書実現値を最後まで走査したか否かを判断しながら（ステップＳ３１）、文書実現値を走査し（ステップＳ３２）、文書実現値の記述を先頭から順次認識し、“＜”が記述されているか否かを調べていく（ステップＳ３３）。なお、“＜”は、ＸＭＬの仕様上、文書実現値の内容には記述されない。 First, the analysis procedure performed by the document realization value analysis unit 20 of the first embodiment will be described with reference to the flowchart shown in FIG. 6 (steps S31 to S43). While checking whether the document realization value to be compressed has been scanned to the end (step S31), the document realization value analysis unit 20 scans it (step S32), recognizes its description sequentially from the beginning, and checks whether “<” is described (step S33). Note that, under the XML specification, “<” is not written literally in the content of the document realization value.

文書実現値の記述として“＜”が検出された場合（ステップＳ３３のＹＥＳルート）、“＜”に続く１バイトの記述に基づいて、この“＜”で始まるタグが開始タグか終了タグかを判定する（ステップＳ３４）。その判定は、“＜”に続く記述が“/”であるか否かによって行なわれる。即ち、“＜”に続く記述が“/”である場合、そのタグは終了タグであると判定され、“＜”に続く記述が“/”ではない場合、そのタグは開始タグであると判定される。 When “<” is detected as the description of the document realization value (YES route in step S33), whether the tag starting with “<” is a start tag or an end tag is determined based on the description of 1 byte following “<”. Determination is made (step S34). The determination is made based on whether or not the description following “<” is “/”. That is, if the description following “<” is “/”, the tag is determined to be an end tag, and if the description following “<” is not “/”, the tag is determined to be a start tag. Is done.

開始タグの場合（ステップＳ３４のＹＥＳルート）、その開始タグ内に記述されている要素名や属性名を検出し、それぞれ、要素名一覧表および属性名一覧表に登録する（ステップＳ３５，Ｓ３６）。その際、要素名や属性名の出現頻度も集計する。この出現頻度は、第２実施形態で必要となるタグ辞書を作成する際に利用されるものである。なお、開始タグ内には、属性名が記述されていない場合があるが、その場合、属性名は検出されないので、ステップＳ３６の処理は省略される。一覧表への登録を終了した後は、ステップＳ３１へ戻る。 In the case of a start tag (YES route in step S34), the element name and attribute name described in the start tag are detected and registered in the element name list and attribute name list, respectively (steps S35 and S36). . At that time, the appearance frequency of element names and attribute names is also totaled. This appearance frequency is used when creating a tag dictionary required in the second embodiment. In some cases, the attribute name is not described in the start tag. In this case, the attribute name is not detected, and thus the process of step S36 is omitted. After completing the registration to the list, the process returns to step S31.

一方、終了タグの場合（ステップＳ３４のＮＯルート）、その終了タグ内に記述されている要素名を検出し（ステップＳ３７）、その要素名が、要素名一覧表において最後に登録された要素名と一致するか否かを判定する（ステップＳ３８）。 On the other hand, in the case of an end tag (NO route in step S34), the element name described in the end tag is detected (step S37), and it is determined whether that element name matches the element name registered last in the element name list (step S38).
このとき、その終了タグで括られた要素（以下、注目要素と呼ぶ）の文書実現値の内容に、子要素の記述が存在する場合、終了タグ内の要素名と要素名一覧表の最後の要素名とは一致しない。また、注目要素の文書実現値の内容に子要素の記述が存在しない場合、即ち、注目要素が葉の要素である場合、終了タグ内の要素名と要素名一覧表の最後の要素名とは一致する。 At this time, if the content of the element enclosed by the end tag (hereinafter referred to as the element of interest) contains a description of a child element, the element name in the end tag does not match the last element name in the element name list. If the content of the element of interest contains no child element, that is, if the element of interest is a leaf element, the element name in the end tag matches the last element name in the element name list.

従って、ステップＳ３８で要素名が一致しないと判定された場合（ＮＯルート）、注目要素は子要素を有するものであって葉の要素ではなく、そのままステップＳ３１へ戻る。 Therefore, if it is determined in step S38 that the element names do not match (NO route), the element of interest has child elements and is not a leaf element, so the process returns directly to step S31.
これに対し、ステップＳ３８で要素名が一致すると判定された場合（ＹＥＳルート）、注目要素は子要素を有しない葉の要素であると判断することができ、続いて、その注目要素の内容中に外部ファイルに対する実体参照が記述されているか否かを判定する（ステップＳ３９）。 In contrast, if it is determined in step S38 that the element names match (YES route), the element of interest can be judged to be a leaf element with no child elements, and it is then determined whether an entity reference to an external file is described in the content of the element of interest (step S39).

葉要素の内容は、第１実施形態の圧縮変換により親要素の属性として取り扱われることになるが、前述した通り、ＸＭＬの仕様上、外部ファイルに対する実体参照を属性値において用いることができない。 The content of a leaf element is handled as an attribute of the parent element after the compression conversion of the first embodiment. However, as described above, the XML specification does not allow an entity reference to an external file to be used in an attribute value.
そこで、ステップＳ３９で実体参照が記述されていると判定された場合（ＹＥＳルート）、そのままステップＳ３１へ戻る。つまり、外部ファイルに対する実体参照をもつ葉の要素は、「葉の要素一覧表」には登録されない。 Therefore, if it is determined in step S39 that such an entity reference is described (YES route), the process returns directly to step S31. That is, a leaf element having an entity reference to an external file is not registered in the “leaf element list”.

一方、ステップＳ３９で注目要素の内容に実体参照が記述されていないことが確認された場合（ＮＯルート）には、「葉の要素一覧表」に、注目要素の要素名が葉の要素名として登録・追加されるとともに、その葉の親の要素名も登録・追加される（ステップＳ４０）。この後、ステップＳ３１に戻る。 On the other hand, if it is confirmed in step S39 that no entity reference is described in the content of the element of interest (NO route), the element name of the element of interest is displayed as the leaf element name in the “leaf element list”. While registering / adding, the element name of the parent of the leaf is also registered / added (step S40). Thereafter, the process returns to step S31.
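The leaf detection of steps S33 to S40 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `find_leaf_elements`, the regular expressions, the use of a separate stack to look up the parent, and the `external_entities` parameter (a caller-supplied set of entity names that refer to external files) are all hypothetical. Empty-element tags in the input are simply skipped, and the input is assumed to be well formed.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)([^<>]*)>")
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")


def find_leaf_elements(instance, external_entities=frozenset()):
    """Return [(leaf_name, parent_name), ...] without duplicates."""
    registered = []   # element name list, appended at every start tag (S35)
    open_stack = []   # currently open elements, used only to find the parent
    leaves = []       # the "leaf element list"
    content_start = 0
    for tag in TAG.finditer(instance):
        is_end, name, rest = tag.group(1) == "/", tag.group(2), tag.group(3)
        if not is_end:                       # S34: start tag
            if rest.rstrip().endswith("/"):  # empty-element tag: skipped here
                continue
            registered.append(name)
            open_stack.append(name)
            content_start = tag.end()
            continue
        open_stack.pop()                     # S34: end tag
        if registered[-1] != name:           # S38 NO: the element has children
            continue
        content = instance[content_start:tag.start()]
        if set(ENTITY_REF.findall(content)) & set(external_entities):
            continue                         # S39 YES: cannot become an attribute
        parent = open_stack[-1] if open_stack else None
        if parent is not None and (name, parent) not in leaves:
            leaves.append((name, parent))    # S40
    return leaves
```

Comparing the end-tag name with the most recently registered name follows the check described for step S38; a production parser would more likely track a per-element "has children" flag, which also copes with an element nested directly inside an element of the same name.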

そして、ステップＳ３１において圧縮対象の文書実現値を最後まで走査したと判定された場合（ＹＥＳルート）、文書実現値の走査中に出現した要素名および属性名の出現頻度に基づいて、出現頻度の高い要素名や属性名を、より短い文字列（例えば１バイト；短縮文字列）に対応させるタグ辞書（図１４の符号９０参照）を作成・出力するとともに（ステップＳ４１，Ｓ４２）、最終的に得られた「葉の要素一覧表」〔例えば図２（Ｄ）参照〕を出力して（ステップＳ４３）、処理を終了する。 If it is determined in step S31 that the document realization value to be compressed has been scanned to the end (YES route), the appearance frequency is determined based on the appearance frequency of the element name and attribute name that appeared during the scan of the document realization value. A tag dictionary (see reference numeral 90 in FIG. 14) that associates a high element name or attribute name with a shorter character string (for example, 1 byte; abbreviated character string) is generated and output (steps S41 and S42), and finally The obtained “leaf element list” [see, for example, FIG. 2D] is output (step S43), and the process ends.
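A frequency-based tag dictionary of the kind described in steps S41 and S42 might be built as in the following Python sketch. The function names and the choice of lowercase letters as shortened strings are assumptions made for illustration; the source only states that frequently occurring names are mapped to shorter strings (for example, one byte).

```python
from collections import Counter
from itertools import count, product
import string


def short_codes():
    """Yield a, b, ..., z, aa, ab, ... (hypothetical shortened strings)."""
    for length in count(1):
        for chars in product(string.ascii_lowercase, repeat=length):
            yield "".join(chars)


def build_tag_dictionary(names):
    """Map each element or attribute name to a short code.

    `names` is every occurrence seen while scanning the start tags, so
    the most frequent names receive the shortest codes. Ties are broken
    alphabetically to keep the result deterministic.
    """
    freq = Counter(names)
    ordered = sorted(freq, key=lambda name: (-freq[name], name))
    return dict(zip(ordered, short_codes()))
```

This sketch does not check that a code is actually shorter than the name it replaces; a real dictionary would presumably leave very short names unmapped.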

なお、第１実施形態では、文書実現値解析部２０によりタグ辞書を作成しているが、このタグ辞書は、第１実施形態の圧縮手法を実行する際には用いられず、後述する第２実施形態において用いられるものである。従って、第１実施形態では、ステップＳ４１およびＳ４２を省略してもよい。また、第２実施形態では、ステップＳ４１およびＳ４２の処理は、文書実現値解析部２０ではなく、タグ辞書作成部８０（図１４参照）により実行されるものとして説明される。 In the first embodiment, a tag dictionary is created by the document realization value analysis unit 20, but this tag dictionary is not used when executing the compression method of the first embodiment, and is described later. It is used in the embodiment. Therefore, in the first embodiment, steps S41 and S42 may be omitted. In the second embodiment, the processes in steps S41 and S42 are described as being performed by the tag dictionary creation unit 80 (see FIG. 14), not the document actual value analysis unit 20.

次に、図７に示すフローチャート（ステップＳ５１〜Ｓ５８）に従って、第１実施形態のＤＴＤ解析部３０による解析手順について説明すると、ＤＴＤ解析部３０は、構成変更対象のＤＴＤを最後まで走査したか否かを判断しながら（ステップＳ５１）、ＤＴＤを走査し（ステップＳ５２）、ＤＴＤの記述を先頭から順次認識し、“＜!ＥＬＥＭＥＮＴ”が記述されているか否かを調べていく（ステップＳ５３）。 Next, the analysis procedure performed by the DTD analysis unit 30 according to the first embodiment will be described with reference to the flowchart shown in FIG. 7 (steps S51 to S58). The DTD analysis unit 30 has scanned the DTD to be changed to the end. (Step S51), the DTD is scanned (step S52), the DTD descriptions are sequentially recognized from the top, and it is checked whether or not “<! ELEMENT” is described (step S53).

例えば図４（Ａ）の２行目に示すごとく、要素型宣言では、“＜!ＥＬＥＭＥＮＴ”の後に要素名および内容モデル（子要素の要素名）が記述される。内容モデル内において、“＃ＰＣＤＡＴＡ”のような予約語のみが記述され、独自の子の要素名が登録されていない場合、その要素型宣言は、葉の要素を対象としたものということになる。 For example, as shown in the second line of FIG. 4A, an element type declaration describes an element name and a content model (the element names of the child elements) after “<!ELEMENT”. If the content model contains only a reserved word such as “#PCDATA” and no child element names of its own, the element type declaration is for a leaf element.

そこで、ステップＳ５３でＤＴＤの記述として“＜!ＥＬＥＭＥＮＴ”が検出された場合（ＹＥＳルート）、“＜!ＥＬＥＭＥＮＴ”に続く要素文字列（要素名）を検出してから（ステップＳ５４）、さらにその後に続いて記述される内容モデルの記述を調査し（ステップＳ５５）、内容モデル内に子の要素名が記述されているか否かを判定する（ステップＳ５６）。 Therefore, when “<!ELEMENT” is detected in the DTD in step S53 (YES route), the element character string (element name) following “<!ELEMENT” is detected (step S54), the content model described after it is examined (step S55), and it is determined whether any child element name is described in the content model (step S56).

内容モデル内に要素名が記述されていない場合（ステップＳ５６のＮＯルート）、今注目している要素型宣言は、葉の要素にかかるものであると判断され、その要素型宣言内の要素名（ステップＳ５４で検出したもの）を、親の要素名とともに「葉の要素一覧表」に登録してから（ステップＳ５７）、ステップＳ５１に戻る。 If the element name is not described in the content model (NO route of step S56), the element type declaration of interest is determined to be related to the leaf element, and the element name in the element type declaration is determined. After registering (detected in step S54) together with the parent element name in the “leaf element list” (step S57), the process returns to step S51.

ステップＳ５３でＤＴＤの記述として“＜!ＥＬＥＭＥＮＴ”が検出されなかった場合（ＮＯルート）や、ステップＳ５６で内容モデル内に子の要素名が記述されていると判定された場合（ＹＥＳルート）には、「葉の要素一覧表」への登録処理を行なうことなく、ステップＳ５１に戻る。 When “<!ELEMENT” is not detected in the DTD in step S53 (NO route), or when it is determined in step S56 that a child element name is described in the content model (YES route), the process returns to step S51 without registering anything in the “leaf element list”.
そして、ステップＳ５１において構成変更対象のＤＴＤを最後まで走査したと判定された場合（ＹＥＳルート）、最終的に得られた「葉の要素一覧表」〔例えば図２（Ｄ）参照〕を出力して（ステップＳ５８）、処理を終了する。 If it is determined in step S51 that the DTD subject to the configuration change has been scanned to the end (YES route), the finally obtained “leaf element list” (see, for example, FIG. 2D) is output (step S58), and the process ends.
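The DTD scan of steps S51 to S58 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name, the regular expressions, and the way the parent is looked up (by searching the other content models for the leaf name) are assumptions. Only `#PCDATA` and `EMPTY` are treated as reserved words here, so an element declared `ANY` is conservatively treated as a non-leaf.

```python
import re

ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([A-Za-z_][\w.\-]*)\s+([^>]*)>")
NAME = re.compile(r"#?[A-Za-z_][\w.\-]*")
RESERVED = {"#PCDATA", "EMPTY"}


def find_leaf_declarations(dtd):
    """Return [(leaf_name, parent_name), ...] derived from a DTD."""
    # S53-S55: collect every element type declaration and the child
    # element names that appear in its content model.
    models = {}
    for decl in ELEMENT_DECL.finditer(dtd):
        children = [t for t in NAME.findall(decl.group(2)) if t not in RESERVED]
        models[decl.group(1)] = children
    leaves = []
    for name, children in models.items():
        if children:          # S56 YES: the content model names child elements
            continue
        for parent, kids in models.items():
            if name in kids:  # S57: register the leaf together with its parent
                leaves.append((name, parent))
    return leaves
```

As the text goes on to note, a DTD alone cannot reveal whether the content of a leaf contains an entity reference, so this list may be broader than the one obtained from the document realization value.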

ただし、ＤＴＤにおいて内容モデルにより内容の型（例えば＃ＰＣＤＡＴＡ）が定義されていても、そのＤＴＤの記述からは、内容に実体参照が含まれるか否かを認識することはできない。つまり、ＤＴＤ解析部３０は、前述した文書実現値解析部２０とは異なり、ＤＴＤを解析しただけでは、そのＤＴＤに従って記述される文書実現値の内容に実体参照が含まれるか否かを認識することはできず、当然、その実体参照が文書内を対象とするものか外部ファイルを対象とするものかを区別することはできない。 However, even if the content type (for example, #PCDATA) is defined by the content model in the DTD, it cannot be recognized from the description of the DTD whether the content includes an entity reference. That is, unlike the document realization value analysis unit 20 described above, the DTD analysis unit 30 recognizes whether or not an entity reference is included in the content of the document realization value described according to the DTD only by analyzing the DTD. Of course, it is not possible to distinguish whether the entity reference is for a document or an external file.

図８に示すフローチャート（ステップＳ６１〜Ｓ７２）に従って、第１実施形態の文書実現値構成変更部４０による構成変更手順について説明すると、文書実現値構成変更部４０は、まず、文書実現値解析部２０やＤＴＤ解析部３０で得られた「葉の要素一覧表」を入力してから（ステップＳ６１）、圧縮対象の文書実現値を最後まで走査したか否かを判断しながら（ステップＳ６２）、文書実現値を走査する（ステップＳ６３）。 The configuration change procedure performed by the document actual value configuration changing unit 40 according to the first embodiment will be described with reference to the flowchart shown in FIG. 8 (steps S61 to S72). Or after inputting the “leaf element list” obtained by the DTD analysis unit 30 (step S61), it is determined whether or not the document realization value to be compressed has been scanned to the end (step S62). The actual value is scanned (step S63).

その際、文書実現値の記述を、「葉の要素一覧表」に登録された葉の要素名と比較しながら、先頭から順次認識し、その文書実現値の記述が、「葉の要素一覧表」に登録された葉の要素であるか否かを判断する（ステップＳ６４）。 At that time, the description of the document realization value is recognized sequentially from the beginning while being compared with the leaf element names registered in the “leaf element list”, and it is determined whether the description is a leaf element registered in the “leaf element list” (step S64).
「葉の要素一覧表」に登録された葉の要素を文書実現値中で検出した場合（ＹＥＳルート）、その葉の要素が属性を有しているか否かを判定し（ステップＳ６５）、属性を有している場合（ＹＥＳルート）には、その属性、つまり属性名および属性値の文字列をそれぞれ属性名一覧および属性値一覧に登録する（ステップＳ６６）。 When a leaf element registered in the “leaf element list” is detected in the document realization value (YES route), it is determined whether the leaf element has an attribute (step S65). If it does (YES route), the attribute, that is, the character strings of the attribute name and attribute value, is registered in the attribute name list and the attribute value list, respectively (step S66).

属性情報を有していない場合（ステップＳ６５のＮＯルート）や、ステップＳ６６での登録処理の終了後には、その葉の要素名および内容の文字列をそれぞれ属性名一覧および属性値一覧に登録する（ステップＳ６７）。 If the leaf element has no attribute information (NO route in step S65), or after the registration process in step S66 ends, the element name and content character string of the leaf are registered in the attribute name list and the attribute value list, respectively (step S67).
そして、ステップＳ６６やＳ６７による登録処理を完了した葉の要素についての、開始タグ，内容および終了タグを、文書実現値から削除してから（ステップＳ６８）、ステップＳ６２へ戻る。 Then, the start tag, content, and end tag of the leaf element for which the registration process in steps S66 and S67 has been completed are deleted from the document realization value (step S68), and the process returns to step S62.

また、ステップＳ６４で葉の要素が検出されなかった場合（ＮＯルート）の場合は、ステップＳ６５〜Ｓ６８の処理を行なうことなくステップＳ６２へ戻る。 If no leaf element is detected in step S64 (NO route), the process returns to step S62 without performing steps S65 to S68.
ステップＳ６２において圧縮対象の文書実現値を最後まで走査したと判定された場合（ＹＥＳルート）、「葉の要素一覧表」から葉の親要素を検出し（ステップＳ６９）、その親要素の開始タグに、属性名一覧および属性値一覧にそれぞれ登録されている属性名および属性値を新たに付加する（ステップＳ７０）。 If it is determined in step S62 that the document realization value to be compressed has been scanned to the end (YES route), the parent element of the leaf is detected from the “leaf element list” (step S69), and the attribute names and attribute values registered in the attribute name list and the attribute value list are newly added to the start tag of that parent element (step S70).

この後、親要素の終了タグ“＜/親の要素名＞”を削除してから（ステップＳ７１）、親の開始タグの最後に記述された“＞”の前に、“/”を記入することにより、葉の親要素の開始タグを空要素タグに変更して（ステップＳ７２）、処理を終了する。 Thereafter, the end tag “</parent element name>” of the parent element is deleted (step S71), and “/” is written before the “>” at the end of the parent's start tag, thereby changing the start tag of the leaf's parent element to an empty-element tag (step S72), and the process ends.
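The effect of steps S61 to S72 on the document realization value can be sketched with Python's standard `xml.etree.ElementTree` module. This is an illustration under assumptions, not the patent's procedure: the patent edits the text directly with attribute name and value lists, whereas this sketch works on a parsed tree. The function name `fold_leaves`, the `leaf_table` parameter (a set of (leaf, parent) pairs), and the rule of leaving a leaf untouched when its name would clash with an existing attribute are all hypothetical.

```python
import xml.etree.ElementTree as ET


def fold_leaves(xml_text, leaf_table):
    """Move listed leaf elements into attributes of their parents."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):
        folded = False
        for child in list(parent):
            if (child.tag, parent.tag) not in leaf_table or len(child) > 0:
                continue
            # S66/S67: the leaf's own attributes plus name -> content.
            new_attrs = {child.tag: child.text or "", **child.attrib}
            if child.tag in child.attrib or any(k in parent.attrib for k in new_attrs):
                continue  # attribute names must be unique (assumed rule)
            parent.attrib.update(new_attrs)  # S70
            parent.remove(child)             # S68
            folded = True
        if folded and len(parent) == 0 and not (parent.text or "").strip():
            # S71/S72: with no content left, the serializer writes the
            # parent as an empty-element tag.
            parent.text = None
    return ET.tostring(root, encoding="unicode")
```

Applied to a hypothetical input such as `<book><title>XML</title><author year="1999">Yahagi</author></book>` with both leaves listed, the result is a single empty-element `book` tag carrying `title`, `author`, and `year` as attributes.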

図９に示すフローチャート（ステップＳ８１〜Ｓ８９）に従って、第１実施形態のＤＴＤ構成変更部５０による構成変更手順について説明すると、ＤＴＤ構成変更部５０は、まず、文書実現値解析部２０やＤＴＤ解析部３０で得られた「葉の要素一覧表」を入力してから（ステップＳ８１）、構成変更対象のＤＴＤを最後まで走査したか否かを判断しながら（ステップＳ８２）、ＤＴＤを走査する（ステップＳ８３）。 The configuration change procedure by the DTD configuration change unit 50 according to the first embodiment will be described with reference to the flowchart shown in FIG. 9 (steps S81 to S89). First, the DTD configuration change unit 50 includes the document realization value analysis unit 20 and the DTD analysis unit. After inputting the “leaf element list” obtained in 30 (step S81), the DTD is scanned while determining whether or not the DTD to be changed is scanned to the end (step S82) (step S82). S83).

その際、「葉の要素一覧表」に登録された葉の要素名を有する要素型宣言、即ち“＜!ＥＬＥＭＥＮＴ葉の要素名”が記述されているか否かを判断する（ステップＳ８４）。 At this time, it is determined whether an element type declaration having a leaf element name registered in the “leaf element list”, that is, “<!ELEMENT leaf element name”, is described (step S84).
そのような葉の要素型宣言が記述されている場合（ステップＳ８４のＹＥＳルート）、その葉の要素型宣言をＤＴＤから削除した後（ステップＳ８５）、その葉の要素名を有する属性リスト宣言、つまり“＜!ＡＴＴＬＩＳＴ葉の要素名属性名”が記述されているか否かを判断する（ステップＳ８６）。 If such a leaf element type declaration is described (YES route of step S84), the leaf element type declaration is deleted from the DTD (step S85), and it is then determined whether an attribute list declaration having the leaf element name, that is, “<!ATTLIST leaf element name attribute name”, is described (step S86).

そのような葉の属性リスト宣言が記述されている場合（ステップＳ８６のＹＥＳルート）、その葉の属性リスト宣言をＤＴＤから削除した後（ステップＳ８７）、その葉についての親要素の要素型宣言における内容モデルの記述から葉（子要素）の記述を削除する（ステップＳ８８）。 If such a leaf attribute list declaration is described (YES route of step S86), after deleting the leaf attribute list declaration from the DTD (step S87), in the element type declaration of the parent element for that leaf The description of the leaf (child element) is deleted from the description of the content model (step S88).

そして、葉の親要素についての属性リスト宣言において、ステップＳ８５で削除した葉の要素についての要素名および内容を、それぞれ新たな属性名および属性値として付加するとともに、ステップＳ８７で削除した葉の要素についての属性名および属性値を、それぞれ新たな属性名および属性値として付加してから（ステップＳ８９）、ステップＳ８２へ戻る。このとき、親要素についての属性リスト宣言が、構成変更前に存在していない場合には、新たに属性リスト宣言を作成する。 Then, in the attribute list declaration for the leaf parent element, the element name and content for the leaf element deleted in step S85 are added as new attribute names and attribute values, respectively, and the leaf element deleted in step S87. The attribute name and the attribute value are added as new attribute names and attribute values (step S89), and the process returns to step S82. At this time, if the attribute list declaration for the parent element does not exist before the configuration change, a new attribute list declaration is created.

なお、ステップＳ８４で葉の要素型宣言が記述されていないと判断された場合（ＮＯルート）、ステップＳ８２へ戻る。また、ステップＳ８６で葉の属性リスト宣言が記述されていないと判断された場合（ＮＯルート）、ステップＳ８８へ移行する。 If it is determined in step S84 that no leaf element type declaration is described (NO route), the process returns to step S82. If it is determined in step S86 that no leaf attribute list declaration is described (NO route), the process proceeds to step S88.
ステップＳ８６のＮＯルートもしくはステップＳ８７からステップＳ８８へ移行した時に、既に内容モデルから葉の記述が削除されている場合には、ステップＳ８８では何ら処理を行なうことなく、ステップＳ８９へ移行する。 When the process reaches step S88 via the NO route of step S86 or from step S87 and the description of the leaf has already been deleted from the content model, step S88 performs no processing and the process proceeds to step S89.

さらに、ステップＳ８６のＮＯルートを経由してステップＳ８８へ移行した場合には、葉の親要素についての属性リスト宣言において、ステップＳ８５で削除した葉の要素についての要素名および内容を、それぞれ新たな属性名および属性値として付加してから、ステップＳ８２へ戻る。このときも、親要素についての属性リスト宣言が、構成変更前に存在していない場合には、新たに属性リスト宣言を作成する。 Further, when the process proceeds to step S88 via the NO route of step S86, in the attribute list declaration for the parent element of the leaf, the element name and content for the leaf element deleted in step S85 are newly set. After adding the attribute name and attribute value, the process returns to step S82. At this time, if the attribute list declaration for the parent element does not exist before the configuration change, a new attribute list declaration is created.
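The DTD rewrite of steps S81 to S89 can be sketched in Python as follows. This is a rough illustration under several assumptions that are not in the source: the function name `restructure_dtd`, the regular expressions, the restriction to simple comma-separated content models such as `(title,author)`, the use of `EMPTY` once no child remains, and the attribute definition `CDATA #REQUIRED` given to each moved leaf are all hypothetical choices.

```python
import re


def restructure_dtd(dtd, leaf_table):
    """Fold each (leaf, parent) pair of leaf_table into the parent's ATTLIST."""
    for leaf, parent in leaf_table:
        leaf_re, parent_re = re.escape(leaf), re.escape(parent)
        # S84/S85: delete the leaf's element type declaration.
        dtd, removed = re.subn(rf"<!ELEMENT\s+{leaf_re}\s[^>]*>\n?", "", dtd)
        if removed == 0:
            continue
        new_defs = [f"{leaf} CDATA #REQUIRED"]  # assumed type and default
        # S86/S87: delete the leaf's attribute list declaration, keeping
        # its attribute definitions so that they can be moved.
        attl = re.search(rf"<!ATTLIST\s+{leaf_re}\s+([^>]*)>\n?", dtd)
        if attl:
            new_defs.append(attl.group(1).strip())
            dtd = dtd[:attl.start()] + dtd[attl.end():]

        # S88: drop the leaf from the parent's content model.
        def drop_leaf(decl):
            kept = [c for c in re.findall(r"[\w.\-]+", decl.group(1)) if c != leaf]
            model = "(" + ",".join(kept) + ")" if kept else "EMPTY"
            return f"<!ELEMENT {parent} {model}>"

        dtd = re.sub(rf"<!ELEMENT\s+{parent_re}\s+\(([^>]*)\)\s*>", drop_leaf, dtd)
        # S89: extend the parent's ATTLIST, or create one if none exists.
        existing = re.search(rf"<!ATTLIST\s+{parent_re}\s+([^>]*)>", dtd)
        if existing:
            body = "\n  ".join([existing.group(1).strip()] + new_defs)
            dtd = (dtd[:existing.start()] + f"<!ATTLIST {parent}\n  {body}>"
                   + dtd[existing.end():])
        else:
            dtd = (dtd.rstrip("\n") + f"\n<!ATTLIST {parent}\n  "
                   + "\n  ".join(new_defs) + ">\n")
    return dtd
```

For a hypothetical DTD declaring `book (title,author)` with `title` and `author` as `#PCDATA` leaves and a `year` attribute on `author`, the result keeps only the declaration of `book` and a single attribute list declaration listing `title`, `author`, and `year`.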

このように、本発明の第１実施形態によれば、文書実現値における要素の木構造を解析し、その解析結果に従って、葉要素についての開始タグ，終了タグおよび内容を文書実現値から削除し、その葉要素の要素名，内容，属性名および属性値を親要素の属性として親要素の開始タグ内に付加することにより、葉要素にかかる記述を親要素の属性として取り扱うことができ、葉要素の開始タグや終了タグを記述する必要がなくなり、ＸＭＬ文書の特徴を損なうことなく、また、検索可能な状態に保持したまま、葉要素にかかるタグの記述が省略・圧縮される。 As described above, according to the first embodiment of the present invention, the tree structure of the element in the document realization value is analyzed, and the start tag, end tag, and content for the leaf element are deleted from the document realization value according to the analysis result. By adding the element name, content, attribute name, and attribute value of the leaf element as the parent element attribute in the start tag of the parent element, the description of the leaf element can be handled as the attribute of the parent element. There is no need to describe the start tag and end tag of the element, and the description of the tag relating to the leaf element is omitted / compressed while maintaining the searchable state without impairing the characteristics of the XML document.

従って、ＸＭＬ文書の圧縮率を大幅に高めることができ、ひいては、大規模なデータベースを取り扱うシステムにおいて文書データの格納効率を大幅に高めることができる。特に、多数の短い語句をもつ部品表や価格表等をＸＭＬ文書で記述するような場合、短い語句（内容）を挟んだ開始タグと終了タグとの対表現を省略することができるので、その圧縮率を大幅に高めることができる。 Therefore, the compression rate of the XML document can be significantly increased, and consequently, the storage efficiency of the document data can be greatly increased in a system that handles a large-scale database. In particular, when a parts list, price list, or the like having a large number of short phrases is described in an XML document, the paired expression between the start tag and the end tag with the short phrase (content) sandwiched can be omitted. The compression rate can be greatly increased.

Even after compression, the plain text of content with a short data length can still be treated as a search target, so the XML document need not be restored (decompressed) to perform a search.
In addition, because the characteristics of the XML document are not impaired, compatibility with application software such as browsers is easy to maintain.
Furthermore, deleting the end tag of the parent element and changing its start tag to an empty-element tag raises the compression ratio of the XML document further.
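The leaf-element folding summarized above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each attribute-free leaf element becomes a parent attribute whose name is the leaf's element name and whose value is its content. The function name `fold_leaves`, the sample document, and the choice to skip leaves that carry attributes or whose name collides with an existing attribute are assumptions, not details taken from the embodiment.

```python
import xml.etree.ElementTree as ET


def fold_leaves(elem):
    """Move attribute-free leaf children of each element into its attributes."""
    for child in list(elem):
        is_plain_leaf = len(child) == 0 and not child.attrib
        if is_plain_leaf and child.tag not in elem.attrib:
            # Assumed encoding: leaf name -> attribute name, content -> value.
            elem.set(child.tag, child.text or "")
            elem.remove(child)  # drops the leaf's start tag, end tag, content
        else:
            fold_leaves(child)
    return elem


# Hypothetical sample; any text that followed a removed leaf is discarded.
root = ET.fromstring("<book><title>XML</title><author>Sato</author></book>")
fold_leaves(root)
```

After the call, `root` has no child elements and carries `title` and `author` as attributes, so it could be written as a single empty-element tag.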

Similarly, the tree structure of the elements in the DTD is analyzed and, according to the analysis result, the element type declarations and attribute list declarations of leaf elements are deleted from the DTD, the descriptions of the leaf elements are deleted from the element type declaration (content model) of the parent element, and the information in the element type declarations and attribute list declarations of the leaf elements is redefined as attributes of the parent element. The DTD thus undergoes compression corresponding to the compression applied to the document instance, and the descriptions of leaf elements can be handled as attributes of the parent element. Accordingly, the element type declarations and attribute list declarations for leaf elements are omitted and the DTD is compressed without impairing the characteristics of the XML document and while keeping it searchable, so the compression ratio of the XML document can be raised further.

[2] Description of the Second Embodiment
Next, a second embodiment of the present invention will be described.
First, the principle of structured document compression in the second embodiment is described with reference to FIGS. 15(A) to 15(D). As with the first embodiment, the second embodiment is described for the case where the structured document is an XML document.

In the second embodiment, each element name and attribute name string in the tags of the document instance and in the DTD is replaced with a 1- or 2-byte string, and the correspondence is recorded in a tag dictionary (see reference numeral 90 in FIG. 14).
Normally, the strings written in tags (element names and attribute names) are defined in the DTD using strings several bytes or more in length so that a person reading them can understand their meaning.

Note, however, that in SGML the first character of an element name or attribute name is limited to a 1-byte alphabetic character (A to Z, a to z). In XML, on the other hand, the first character is limited to a 1-byte alphabetic character, a 2-byte hiragana or katakana character, or the 1-byte character "#" or ":".
In general, the tag portions of the document instance alone account for 60% to 80% of the total document volume. For this reason, the compression ratio of an XML document can be raised substantially simply by converting the strings in the tags into 1- or 2-byte strings, at the cost of their readability.
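The share of a document taken up by markup can be checked with a few lines of Python. The sketch below is only an illustration of the arithmetic: the function name `tag_share` and the sample strings are hypothetical, and it simply counts every byte between "<" and ">" as markup.

```python
import re


def tag_share(document: bytes) -> float:
    """Fraction of the document's bytes that lie inside <...> markup."""
    if not document:
        return 0.0
    tag_bytes = sum(len(m.group(0)) for m in re.finditer(rb"<[^>]*>", document))
    return tag_bytes / len(document)


share_before = tag_share(b"<title>XML</title>")  # 15 of 18 bytes are markup
share_after = tag_share(b"<a>XML</a>")           # 7 of 10 bytes are markup
```

Shortening the element name leaves the content untouched, so the whole saving (8 bytes in this sample) comes out of the markup.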

Therefore, in the second embodiment of the present invention, as shown in FIGS. 15(A) to 15(D), the correspondence between the existing element names and attribute names and newly defined 1- or 2-byte shortened strings is recorded in the tag dictionary, and, based on that tag dictionary, the corresponding strings in the tags of the document instance and in the DTD are replaced with the shorter shortened strings. Each shortened string must, of course, be shorter than the existing name and must identify that name.

FIG. 15(A) shows a specific example of a tag. In the tag shown in FIG. 15(A), the element with the element name "title" is given an attribute with the attribute name "tsprint" and the attribute value "スクールＣＡＩシリーズＮＯ.563 3(3)".
In this case, as shown in FIG. 15(B), "a" is set in advance as the replacement character (shortened string) for the element name "title" and registered in the tag dictionary, and, as shown in FIG. 15(C), "A" is set in advance as the replacement character (shortened string) for the attribute name "tsprint" and registered in the tag dictionary.

Then, using the tag dictionaries shown in FIGS. 15(B) and 15(C), the element name and attribute name strings in the tag shown in FIG. 15(A) are replaced with 1-byte strings, as shown in FIG. 15(D). For a valid XML document that has a DTD (patterns (4) and (5)), the DTD is also converted to match this string replacement.
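The replacement of FIG. 15 can be sketched in Python as follows. The dictionary contents come from FIGS. 15(B) and 15(C); the function name `shorten_names`, the use of `xml.etree.ElementTree`, and the self-closing form of the sample tag (chosen only so that it parses on its own) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

ELEMENT_DICT = {"title": "a"}      # FIG. 15(B): element name -> short string
ATTRIBUTE_DICT = {"tsprint": "A"}  # FIG. 15(C): attribute name -> short string


def shorten_names(root, element_dict, attribute_dict):
    """Rename elements and attributes in place; values and content are untouched."""
    for node in root.iter():
        node.tag = element_dict.get(node.tag, node.tag)
        renamed = [(attribute_dict.get(k, k), v) for k, v in node.attrib.items()]
        node.attrib.clear()
        node.attrib.update(renamed)
    return root


tag = ET.fromstring('<title tsprint="スクールＣＡＩシリーズＮＯ.563 3(3)"/>')
shorten_names(tag, ELEMENT_DICT, ATTRIBUTE_DICT)
```

Only the names change: the attribute value is carried over byte for byte, which is why the content stays searchable as plain text.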

Accordingly, an SGML parser, XML parser (processor), or the like parses the structured document converted as described above on the basis of the DTD converted in the same way. However, when application software searches for elements or attributes, it must specify element names and attribute names using the 1- or 2-byte shortened strings read from the converted DTD.

The second embodiment of the present invention is described below in more detail and more concretely with reference to FIG. 14 and FIGS. 16 to 24.
FIG. 14 is a block diagram showing the functional configuration of a structured document compression apparatus according to the second embodiment of the present invention. As shown in FIG. 14, the compression apparatus of the second embodiment comprises a document storage unit 10, a document instance analysis unit 20, a DTD analysis unit 30, a new DTD file creation unit 60, and an old/new DTD correspondence table output unit 70, all similar to those of the first embodiment, and additionally a tag dictionary creation unit 80, a tag dictionary 90, a document instance character string replacement unit 41, and a DTD character string replacement unit 51.

Like that of the first embodiment, the compression apparatus of the second embodiment is implemented by a computer system such as a personal computer in which a CPU, RAM, ROM, and the like are connected by a bus line.
That is, the RAM and ROM serve as the document storage unit 10, and the RAM also stores an application program for implementing the document instance analysis unit 20, the DTD analysis unit 30, the new DTD file creation unit 60, the old/new DTD correspondence table output unit 70, the tag dictionary creation unit 80, the document instance character string replacement unit 41, and the DTD character string replacement unit 51. The tag dictionary 90 is recorded and held in, for example, the RAM.

The CPU executes this application program to implement the functions (described in detail later) of the document instance analysis unit 20, the DTD analysis unit 30, the new DTD file creation unit 60, the old/new DTD correspondence table output unit 70, the tag dictionary creation unit 80, the document instance character string replacement unit 41, and the DTD character string replacement unit 51, thereby realizing the structured document compression apparatus of the second embodiment.

As in the first embodiment, the program for implementing the compression apparatus of the second embodiment is provided in a form recorded on a computer-readable recording medium such as a flexible disk or CD-ROM. The computer reads the program from the recording medium, transfers it to an internal or external storage device, and stores it there for use. Alternatively, the program may be recorded in a storage device (recording medium) such as a magnetic disk, optical disk, or magneto-optical disk and provided from that storage device to the computer via a communication path.

When the functions of the compression apparatus of the second embodiment are implemented by a computer, the program stored in the internal storage device (for example, RAM) is executed by the computer's microprocessor (for example, CPU). The microprocessor may instead read the program directly from the recording medium and execute it.

In the compression apparatus of the second embodiment shown in FIG. 14, the document storage unit 10, the document instance analysis unit 20, the DTD analysis unit 30, the new DTD file creation unit 60, and the old/new DTD correspondence table output unit 70 perform substantially the same functions as those described in the first embodiment, so their detailed description is omitted.
Note, however, that the document instance analysis unit 20 of the first embodiment also served as the tag dictionary creation unit 80 of the second embodiment. In the second embodiment, the part that performs the processing corresponding to steps S41 and S42 in FIG. 6 is described as the tag dictionary creation unit 80, functionally separated from the document instance analysis unit 20.

According to the analysis results of the document instance analysis unit 20 and the DTD analysis unit 30, the tag dictionary creation unit 80 creates the tag dictionary 90, which associates each string (element name or attribute name) written in the tags of the document instance and in the DTD with a shortened string (the 1- or 2-byte string described above) that is shorter than that string and identifies it.

As described in the first embodiment with reference to FIG. 6, the tag dictionary creation unit 80 of this embodiment creates the tag dictionary 90 using only the analysis result of the document instance analysis unit 20. However, the tag dictionary 90 may instead be created using the analysis result of the DTD analysis unit 30 in place of that of the document instance analysis unit 20, or using both analysis results.
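One way to build such a tag dictionary from the document instance alone is sketched below in Python. The lowercase-for-elements and uppercase-for-attributes convention follows the figures, but the generator `short_names`, the function `build_tag_dictionary`, and the document-order assignment are assumptions. The sketch also does not check that a name is actually longer than its short string, and it supports at most 26 + 26 × 26 names of each kind.

```python
import itertools
import string
import xml.etree.ElementTree as ET


def short_names(alphabet):
    """Yield 1-byte names first, then 2-byte names, from the given alphabet."""
    yield from alphabet
    for first, second in itertools.product(alphabet, repeat=2):
        yield first + second


def build_tag_dictionary(root):
    """Map each element and attribute name to a short string, in document order."""
    element_dict, attribute_dict = {}, {}
    element_codes = short_names(string.ascii_lowercase)
    attribute_codes = short_names(string.ascii_uppercase)
    for node in root.iter():
        if node.tag not in element_dict:
            element_dict[node.tag] = next(element_codes)
        for name in node.attrib:
            if name not in attribute_dict:
                attribute_dict[name] = next(attribute_codes)
    return element_dict, attribute_dict


# Hypothetical sample document; names follow the examples in the figures.
sample = ET.fromstring(
    '<book field="x"><title>XML</title><author>Sato</author></book>'
)
elements, attributes = build_tag_dictionary(sample)
```

Keeping separate dictionaries for element names and attribute names mirrors FIGS. 15(B) and 15(C), where the two are registered separately.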

The document instance character string replacement unit 41 uses the tag dictionary 90 to replace each string (element name or attribute name) written in the tags of the document instance with the corresponding shortened string. The replacement procedure is described later with reference to the flowchart shown in FIG. 17.

When the XML document is a valid XML document (that is, an XML document of pattern (4) or (5)), the DTD character string replacement unit 51 uses the tag dictionary 90 to replace each string (element name or attribute name) written in the DTD with the corresponding shortened string, so that the DTD matches the document instance as rewritten by the document instance character string replacement unit 41. The replacement procedure is described later with reference to the flowchart shown in FIG. 18.

The DTD character string replacement unit 51 reads the DTD from the document storage unit 10 when the XML document is of pattern (4), and from the external file 100 when the XML document is of pattern (5).
Also in the second embodiment, when the XML document is of pattern (4), the replacement result of the DTD character string replacement unit 51 (the compressed DTD) is output to and stored in the document storage unit 10 together with the replacement result of the document instance character string replacement unit 41 (the compressed document instance). It may instead be output to and stored on another recording medium or the like together with the compressed document instance.

When the DTD resides in the external file 100 (that is, when the XML document is of pattern (5)), the new DTD file creation unit 60 creates a file for the DTD processed by the DTD character string replacement unit 51 (a new DTD file) and outputs it to the external file 100.
When the DTD resides in the external file 100 (that is, when the XML document is of pattern (5)), the old/new DTD correspondence table output unit 70 creates an old/new DTD correspondence table (see, for example, FIG. 24(G)) that specifies the correspondence between the DTD before replacement and the new DTD after replacement, and outputs it to the document storage unit 10.

An XML document compressed by the compression technique of the second embodiment (string replacement) as described above loses none of its characteristics as an XML document and can function as an XML document in its compressed state (without decompression).
If the tag dictionary 90 is retained, the correspondence between the original strings and the shortened strings can be looked up in it, so data in the XML document can be searched without decompressing the strings that were replaced and compressed in the document instance and the DTD.
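Searching without decompression amounts to translating the query through the tag dictionary rather than translating the document back. A minimal Python sketch, assuming a hypothetical compressed document and a function name `find_text` that do not appear in the source:

```python
import xml.etree.ElementTree as ET

# Hypothetical compressed document and tag dictionary (names as in FIG. 20).
COMPRESSED = "<a><b>XML</b><c>Sato</c></a>"
TAG_DICTIONARY = {"book": "a", "title": "b", "author": "c"}


def find_text(compressed_xml, tag_dictionary, original_name):
    """Return the content of every element known by its original name."""
    short_name = tag_dictionary[original_name]  # translate the query only
    root = ET.fromstring(compressed_xml)
    return [node.text for node in root.iter(short_name)]


titles = find_text(COMPRESSED, TAG_DICTIONARY, "title")
```

The compressed document is parsed as ordinary XML; only one dictionary lookup is added to the search.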

To decompress and restore an XML document whose strings have been replaced with shortened strings as described above, a decompression apparatus (not shown) comprising document instance character string reverse replacement means and DTD character string reverse replacement means is provided.
The document instance character string reverse replacement means performs the reverse of the replacement performed by the document instance character string replacement unit 41: using the tag dictionary 90, it replaces the shortened strings written in the tags of the document instance with the original strings (element names and attribute names). Likewise, the DTD character string reverse replacement means performs the reverse of the replacement performed by the DTD character string replacement unit 51: using the tag dictionary 90, it replaces the shortened strings written in the DTD with the original strings (element names and attribute names).
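The reverse replacement for the document instance can be sketched in Python by inverting the tag dictionary. The function names `invert` and `restore_names` and the sample input are hypothetical; in particular, placing the attribute on the root element is an assumption made only to exercise the attribute path.

```python
import xml.etree.ElementTree as ET


def invert(dictionary):
    """Swap keys and values; valid because each short string identifies one name."""
    return {short: original for original, short in dictionary.items()}


def restore_names(root, element_dict, attribute_dict):
    """Undo the name shortening in place using the same tag dictionary."""
    elements, attributes = invert(element_dict), invert(attribute_dict)
    for node in root.iter():
        node.tag = elements.get(node.tag, node.tag)
        restored = [(attributes.get(k, k), v) for k, v in node.attrib.items()]
        node.attrib.clear()
        node.attrib.update(restored)
    return root


# Hypothetical compressed input; dictionary entries as in FIGS. 23(C) and 23(D).
doc = ET.fromstring('<a A="x"><b>XML</b><c>Sato</c></a>')
restore_names(doc, {"book": "a", "title": "b", "author": "c"}, {"field": "A"})
```

Because the same dictionary drives both directions, compression followed by restoration returns the original names as long as the dictionary is kept with the document.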

Next, the second embodiment is described with reference to FIGS. 16 to 24.
First, the procedure for compressing a structured document (XML document) in the second embodiment is described following the flowchart shown in FIG. 16 (steps S111 to S129). The circled numerals 1 to 5 in FIG. 16 correspond to patterns (1) to (5) described earlier with reference to Table 2.
Although omitted from FIG. 14, the compression apparatus of the second embodiment also has a pattern recognition function for recognizing which of patterns (1) to (5) (see Table 2) the XML document stored in the document storage unit 10 belongs to. The processing of this pattern recognition function corresponds to steps S112 to S114 shown in FIG. 16.

When the XML document to be compressed is input and stored in the document storage unit 10 (step S111), it is determined whether "<!DOCTYPE" appears in the XML document (step S112). If it does not (NO route of step S112), the XML document is recognized as a well-formed XML document without a DTD, that is, an XML document of pattern (1), and steps S115, S116, and S129 are executed as described later.

If "<!DOCTYPE" appears in the XML document (YES route of step S112), it is then determined whether "[" appears after it (step S113).
If "<!DOCTYPE" appears but "[" does not (NO route of step S113), the XML document is recognized as a valid XML document that has its DTD in the external file 100, that is, an XML document of pattern (5), and steps S121 to S129 are executed as described later.

If "[" appears (YES route of step S113), it is determined whether "<!ELEMENT" (or "<!ATTLIST") appears (step S114).
If "<!DOCTYPE" and "[" appear but "<!ELEMENT" does not (NO route of step S114), the XML document is recognized as a well-formed XML document having a DTD that contains an internal or external entity declaration, that is, an XML document of pattern (2) or (3), and steps S115, S116, and S129 are executed as for pattern (1).

If "<!DOCTYPE", "[", and "<!ELEMENT" all appear (YES route of step S114), the XML document is recognized as a valid XML document that has a DTD inside the XML document, that is, an XML document of pattern (4), and steps S117 to S120 and S129 are executed.
The compression process for each of patterns (1) to (5) is described below with reference to the specific examples (first to fifth examples) shown in FIGS. 20 to 24.
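The pattern recognition of steps S112 to S114 can be sketched as a short Python function. The function name `classify_pattern` and the string return values are hypothetical. As a simplification not stated in the source, the "[" check looks only between "<!DOCTYPE" and the first ">" that follows it, so that a "[" elsewhere in the document is not mistaken for an internal subset.

```python
def classify_pattern(xml_text):
    """Return "1", "2/3", "4" or "5" following steps S112 to S114."""
    start = xml_text.find("<!DOCTYPE")
    if start < 0:  # S112: no document type declaration
        return "1"
    end = xml_text.find(">", start)
    head = xml_text[start:end] if end >= 0 else xml_text[start:]
    if "[" not in head:  # S113: no internal subset, so the DTD is external
        return "5"
    if "<!ELEMENT" in xml_text or "<!ATTLIST" in xml_text:  # S114
        return "4"
    return "2/3"  # only entity declarations in the internal subset
```

Patterns (2) and (3) are returned together because the flowchart sends both down the same route (steps S115, S116, and S129).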

FIGS. 20(A) to 20(C) are diagrams for explaining a specific compression process (first example) of an XML document according to the second embodiment.
The uncompressed XML document shown in FIG. 20(A) is an XML document of pattern (1) and is the same as the one shown in FIG. 10(A). Because "<!DOCTYPE" does not appear in the XML document of FIG. 20(A) (pattern (1)), processing proceeds from the NO route of step S112 to step S115, where the document instance analysis unit 20 analyzes the descriptions in the tags of the document instance and the tag dictionary creation unit 80 registers and creates the tag dictionary 90 shown in FIG. 20(C).

In the example shown in FIG. 20(A), each tag of the document instance contains only an element name and no element has attributes, so only element names are detected, and a tag dictionary 90 that associates a shortened string with each element name is registered and created. The tag dictionary 90 shown in FIG. 20(C) associates the element names "book", "title", and "author" detected and recognized by the document instance analysis unit 20 with short 1-byte shortened strings (hereinafter sometimes called replacement characters), for example "a", "b", and "c", respectively.

When attribute names also appear in the tags, a tag dictionary 90 for attribute names is registered and created as well, as described later with reference to FIGS. 23(D) and 24(E).
Then, using the tag dictionary 90 obtained in step S115, the document instance character string replacement unit 41 replaces the element names "book", "title", and "author" written in the tags of the document instance with the 1-byte replacement characters "a", "b", and "c", respectively (step S116).

Through steps S115 and S116, the XML document shown in FIG. 20(A), for example, is replaced and compressed into the XML document shown in FIG. 20(B), and is then output to and stored in the document storage unit 10 or the like as a compressed document (step S129).
In the XML document shown in FIG. 20(B), the XML declaration on line 1 is unchanged from before compression, but on lines 2 to 5 the element names "book", "title", and "author" in the start and end tags are replaced with the 1-byte replacement characters "a", "b", and "c", respectively.
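The first example can be reproduced at the text level with a regular expression, which leaves the XML declaration untouched. In the Python sketch below, the dictionary matches FIG. 20(C), but the sample document content, the pattern `TAG_NAME`, and the function `compress_tags` are assumptions. The pattern is deliberately naive and would also rewrite tag-like text inside comments or CDATA sections.

```python
import re

TAG_DICTIONARY = {"book": "a", "title": "b", "author": "c"}  # as in FIG. 20(C)

# "<" or "</" followed by an element name; "<?xml" and "<!..." do not match.
TAG_NAME = re.compile(r"<(/?)([A-Za-z_:][\w.:-]*)")


def compress_tags(xml_text, tag_dictionary):
    """Replace element names in start and end tags with their short strings."""
    def swap(match):
        slash, name = match.groups()
        return "<" + slash + tag_dictionary.get(name, name)
    return TAG_NAME.sub(swap, xml_text)


# Hypothetical four-line document instance preceded by an XML declaration.
original = (
    '<?xml version="1.0"?>\n'
    "<book>\n<title>XML</title>\n<author>Sato</author>\n</book>"
)
compressed = compress_tags(original, TAG_DICTIONARY)
saved = len(original) - len(compressed)
```

For this sample, each name occurs once in a start tag and once in an end tag, so the saving is twice the sum of the per-name reductions (3 + 4 + 5 characters).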

FIGS. 21(A) to 21(C) are diagrams for explaining a specific compression process (second example) of an XML document according to the second embodiment.
The uncompressed XML document shown in FIG. 21(A) is an XML document of pattern (2) and is the same as the one shown in FIG. 11(A). In the XML document of FIG. 21(A) (pattern (2)), both "<!DOCTYPE" and "[" appear, but neither "<!ELEMENT" nor "<!ATTLIST" does, so processing proceeds from the NO route of step S114 to step S115, and the same processing as for the XML document of pattern (1) described above is executed. At this time, as shown in FIG. 21(C), the tag dictionary creation unit 80 registers and creates the same tag dictionary 90 as that shown in FIG. 20(C).

As a result, the XML document shown in FIG. 21(A), for example, is replaced and compressed into the XML document shown in FIG. 21(B), and is then output to and stored in the document storage unit 10 or the like as a compressed document (step S129).
In the XML document shown in FIG. 21(B), lines 1, 3, and 4 are unchanged from before compression, but on lines 2 and 5 to 8 the element names "book", "title", and "author" in the tags are replaced with the 1-byte replacement characters "a", "b", and "c", respectively.

FIGS. 22(A) to 22(C) are diagrams for explaining a specific compression process (third example) of an XML document according to the second embodiment.
The uncompressed XML document shown in FIG. 22(A) is an XML document of pattern (3) described above. Line 1 contains the XML declaration indicating that this is a version 1.0 XML document, lines 2 to 4 contain a DTD that includes an external entity declaration, and lines 5 to 8 contain the document instance.

In the DTD on lines 2 to 4, the entity declaration (line 3) contained in the document type declaration uses the system identifier "SYSTEM" to declare and define that the external file specified by the URL "http://www.xml.co.jp" is used as the entity for the string "para" used in the document instance (XML instance).
The document instance on lines 5 to 8 is almost identical to the example on lines 2 to 5 of the XML document shown in FIG. 10(A), except that in the example of FIG. 22(A) the content of the element "author" on line 7 is "佐藤元&para;".

Here, "&para;" is a reference to the entity named "para". In the document actually output by display, printing, or the like, the external file specified by the URL "http://www.xml.co.jp" is read and rendered in its place.

As in the XML document of pattern (2), both "<!DOCTYPE" and "[" appear in the XML document of FIG. 22(A) (pattern (3)), but neither "<!ELEMENT" nor "<!ATTLIST" does, so processing proceeds from the NO route of step S114 to step S115, and the same processing as for the XML documents of patterns (1) and (2) described above is executed. At this time, as shown in FIG. 22(C), the tag dictionary creation unit 80 registers and creates the same tag dictionary 90 as that shown in FIG. 20(C).

As a result, the XML document shown in FIG. 22(A), for example, is replaced and compressed into the XML document shown in FIG. 22(B), and is then output to and stored in the document storage unit 10 or the like as a compressed document (step S129).
In the XML document shown in FIG. 22(B), lines 1, 3, and 4 are unchanged from before compression, but on lines 2 and 5 to 8 the element names "book", "title", and "author" in the tags are replaced with the 1-byte replacement characters "a", "b", and "c", respectively.

FIGS. 23(A) to 23(D) are diagrams for explaining a specific compression process (fourth example) of an XML document according to the second embodiment.
The uncompressed XML document shown in FIG. 23(A) is an XML document of pattern (4) and is almost the same as the one shown in FIG. 12(A). In the XML document of FIG. 23(A), however, the description of the attribute "year" of the element "author" is omitted. That is, the attribute list declaration for the attribute "year" is omitted from the DTD, and the attribute is omitted from the start tag of the element "author".

In the XML document of FIG. 23(A) (pattern (4)), "<!DOCTYPE" and "[" appear, and so does "<!ELEMENT" or "<!ATTLIST". Processing therefore proceeds from the YES route of step S114 to step S117, where the document instance analysis unit 20 analyzes the descriptions in the tags of the document instance, and the DTD analysis unit 30 analyzes the descriptions in the DTD (step S118).

At this time, the tag dictionary creation unit 80 registers and creates a tag dictionary 90 for element names, as shown in FIG. 23C, and a tag dictionary 90 for attribute names, as shown in FIG. 23D.
The tag dictionary 90 shown in FIG. 23C is the same as the one shown in FIG. 20C: it associates the element names “book”, “title”, and “author”, detected and recognized by the document realization value analysis unit 20 and the DTD analysis unit 30, with 1-byte shortened character strings, for example “a”, “b”, and “c”, respectively. The tag dictionary 90 shown in FIG. 23D associates the attribute name “field”, detected and recognized by the same units, with a 1-byte shortened character string, for example “A”.
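The two tag dictionaries can be pictured as small name-to-code mappings. The following Python sketch is only an illustration: the function name `build_tag_dictionaries` and the policy of handing out lower-case letters to element names and upper-case letters to attribute names in order of first appearance are assumptions chosen to reproduce the “a”, “b”, “c”, and “A” examples, not a procedure stated in the specification.

```python
import string

def build_tag_dictionaries(element_names, attribute_names):
    """Map each distinct name to a 1-byte shortened string (hypothetical policy)."""
    def assign(names, alphabet):
        table = {}
        for name in names:
            if name in table:
                continue  # this name is already registered
            if len(table) >= len(alphabet):
                raise ValueError("this sketch only has 26 one-byte codes per table")
            table[name] = alphabet[len(table)]
        return table

    element_dict = assign(element_names, string.ascii_lowercase)
    attribute_dict = assign(attribute_names, string.ascii_uppercase)
    return element_dict, attribute_dict

# Names in the order the analysis units might report them (assumed input).
element_dict, attribute_dict = build_tag_dictionaries(
    ["book", "title", "author", "title"], ["field"])
print(element_dict)
print(attribute_dict)
```

Keeping element names and attribute names in separate tables mirrors the separate dictionaries of FIGS. 23C and 23D.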

Then, using the tag dictionaries 90 shown in FIGS. 23C and 23D, the document realization value character string replacement unit 41 replaces the element names “book”, “title”, and “author” and the attribute name “field” described in the tags of the document realization value with the 1-byte shortened character strings “a”, “b”, “c”, and “A”, respectively (step S119). In step with this replacement, the DTD character string replacement unit 51 replaces the same element names and attribute name described in the DTD with the same shortened character strings (step S120).

As a result, the XML document shown in FIG. 23A, for example, is replaced and compressed into the XML document shown in FIG. 23B, and is then output to and stored in the document storage unit 10 or the like as a compressed document (step S129).
In the XML document shown in FIG. 23B, the first and seventh lines are unchanged from before compression, but the element names “book”, “title”, and “author” and the attribute name “field” on the second to sixth and eighth to eleventh lines are replaced with the 1-byte replacement characters “a”, “b”, “c”, and “A”, respectively.

FIGS. 24A to 24G are diagrams for explaining a specific compression process (fifth example) for an XML document according to the second embodiment.
The uncompressed XML document shown in FIG. 24A is an XML document of pattern (5) and is almost the same as the one shown in FIG. 13A. However, the XML document of FIG. 24A omits the description of the attribute “year” of the element “author”. That is, the attribute list declaration for the attribute “year” is omitted from the DTD, and the attribute description is omitted from the start tag of the element “author”.

The document type declaration on the second line uses the system identifier “SYSTEM” to declare that the DTD held in the external file 100 (file name “..\book.dtd”) is to be used.
The DTD with the file name “..\book.dtd” is written as shown in FIG. 24B, corresponding to the structure of the document realization value in FIG. 24A. The DTD shown in FIG. 24B (lines 1 to 4) is identical to the description on lines 2 to 5 of the DTD shown in FIG. 4A, so its explanation is omitted.

The XML document (pattern (5)) shown in FIG. 24A contains “<!DOCTYPE”, but this is followed not by “[” but by a system identifier that designates the DTD in the external file 100. The process therefore moves from the NO route of step S113 to step S121, where the document realization value analysis unit 20 analyzes the descriptions in the tags of the document realization value, and the DTD analysis unit 30 analyzes the description of the DTD (file name “..\book.dtd”) read from the external file 100 according to the system identifier (step S122).

At this time, the tag dictionary creation unit 80 registers and creates a tag dictionary 90 for element names, as shown in FIG. 24E, and a tag dictionary 90 for attribute names, as shown in FIG. 24F. The tag dictionary 90 of FIG. 24E is the same as that of FIG. 20C, and the tag dictionary 90 of FIG. 24F is the same as that of FIG. 23D.

Next, a new file name different from the original one (for example, “..\book2.dtd”) is set for the new DTD that will be obtained by modifying and compressing the DTD shown in FIG. 24B, and is written into the document realization value (step S123). In other words, the file name designated by the system identifier “SYSTEM” in the document type declaration is rewritten from the old file name “..\book.dtd” to the new file name “..\book2.dtd”.

Then, using the tag dictionaries 90 shown in FIGS. 24E and 24F, the document realization value character string replacement unit 41 replaces the element names “book”, “title”, and “author” and the attribute name “field” described in the tags of the document realization value with the 1-byte shortened character strings “a”, “b”, “c”, and “A”, respectively (step S124).

As a result, the XML document shown in FIG. 24A is replaced and compressed into the XML document shown in FIG. 24C. In the XML document shown in FIG. 24C, the first line is unchanged from before compression, but the file name designated by the system identifier “SYSTEM” on the second line becomes the new file name “..\book2.dtd”, and the element names “book”, “title”, and “author” and the attribute name “field” on the second to sixth lines are replaced with the 1-byte replacement characters “a”, “b”, “c”, and “A”, respectively.

Next, the new DTD file creation unit 60 creates a new DTD file and copies into it the contents of the uncompressed DTD file read from the external file 100 (step S125). Then, in step with the character string replacement performed on the document realization value by the document realization value character string replacement unit 41, the DTD character string replacement unit 51 replaces the element names “book”, “title”, and “author” and the attribute name “field” described in the DTD with the 1-byte shortened character strings “a”, “b”, “c”, and “A”, respectively (step S126).

As a result, the DTD shown in FIG. 24B is modified and compressed into the DTD shown in FIG. 24D. In the DTD of FIG. 24D, the element names “book”, “title”, and “author” and the attribute name “field” on the first to fourth lines are replaced with the 1-byte character strings “a”, “b”, “c”, and “A”, respectively.
The DTD file replaced and compressed by the DTD character string replacement unit 51 (the new DTD file) is given the new file name “..\book2.dtd” and is output from the new DTD file creation unit 60 to the external file 100, where it is stored (step S127).

In addition, the old/new DTD correspondence table output unit 70 creates an old/new DTD correspondence table, as shown in FIG. 24G, that specifies the correspondence between the old DTD and the new DTD (specifically, between the old file name and the new file name), and outputs it to and stores it in the document storage unit 10 or the like (step S128). The XML document replaced and compressed in step S124 is output to and stored in the document storage unit 10 or the like as a compressed document (step S129). The tag dictionary 90 and the old/new DTD correspondence table need not be independent files; they may instead be attached to the compressed document in the form of annotations.
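The handling of the system identifier for pattern (5) can be sketched as follows. This is a minimal illustration, not the implementation of the apparatus: the helper names `rename_external_dtd` and `restore_external_dtd`, the regular expression, and the use of a Python dict as the old/new DTD correspondence table are all assumptions.

```python
import re

# SYSTEM identifier of the document type declaration; group 3 is the file name.
_SYSTEM_ID = re.compile(r'(<!DOCTYPE\s+\S+\s+SYSTEM\s+)(["\'])(.*?)\2')

def rename_external_dtd(document, new_name):
    """Point the declaration at the new DTD file and return the old/new table."""
    match = _SYSTEM_ID.search(document)
    if match is None:
        raise ValueError("no SYSTEM identifier found")
    old_name = match.group(3)
    rewritten = document[:match.start(3)] + new_name + document[match.end(3):]
    return rewritten, {old_name: new_name}

def restore_external_dtd(document, table):
    """Use the old/new table to put the original file name back."""
    match = _SYSTEM_ID.search(document)
    if match is None:
        raise ValueError("no SYSTEM identifier found")
    new_to_old = {new: old for old, new in table.items()}
    old_name = new_to_old[match.group(3)]
    return document[:match.start(3)] + old_name + document[match.end(3):]

doc = r'<?xml version="1.0"?><!DOCTYPE book SYSTEM "..\book.dtd"><book/>'
compressed, table = rename_external_dtd(doc, r"..\book2.dtd")
print(table)
```

String slicing is used instead of `re.sub` so that the backslash in the file name is never interpreted as an escape sequence.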

Next, the operation of the document realization value character string replacement unit 41 and the DTD character string replacement unit 51, which constitute the compression apparatus of the second embodiment, will be described with reference to FIGS. 17 and 18.
First, the replacement procedure of the document realization value character string replacement unit 41 is described following the flowchart of FIG. 17 (steps S151 to S158). The unit first receives the tag dictionary 90 produced by the tag dictionary creation unit 80 (step S151). It then identifies the start tags and end tags in the document realization value (steps S152 to S155) in the same way as the document realization value analysis unit 20 of the first embodiment (see steps S31 to S34 in FIG. 6), and replaces the element names and attribute names in those tags using the tag dictionary 90 (steps S156 to S158).

That is, while checking whether the document realization value to be compressed has been scanned to the end (step S152), the unit scans the document realization value (step S153), recognizes its description sequentially from the beginning, and checks whether “<” appears (step S154). Under the XML specification, “<” is never written in the content of the document realization value.

When “<” is detected in the document realization value (YES route of step S154), the unit determines from the single byte following “<” whether the tag is a start tag or an end tag (step S155). The test is whether the character after “<” is “/”: if it is, the tag is judged to be an end tag; if it is not, the tag is judged to be a start tag.

For a start tag (YES route of step S155), the unit refers to the tag dictionary 90 and replaces the element name written in the start tag with the corresponding shortened character string (step S156). If attribute names are written in the start tag, each is likewise replaced with its corresponding shortened character string by referring to the tag dictionary 90 (step S157). If the start tag contains no attribute name, step S157 is skipped. After the replacement, the process returns to step S152.

For an end tag (NO route of step S155), the unit refers to the tag dictionary 90, replaces the element name written in the end tag with the corresponding shortened character string (step S158), and then returns to step S152.
When it is determined in step S152 that the document realization value to be compressed has been scanned to the end (YES route), the process ends.
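The scan of FIG. 17 can be sketched in Python as follows. The function name `replace_in_instance` and the two dict arguments are hypothetical, and the sketch makes simplifying assumptions that the flowchart does not state: tags beginning with “<?” or “<!” are copied unchanged, and attribute values are assumed not to contain “>”.

```python
import re

# An attribute name followed by "=" and a quoted value; the value (group 2)
# is kept verbatim so that attribute values remain searchable.
_ATTR = re.compile(r'([^\s=]+)(\s*=\s*(?:"[^"]*"|\'[^\']*\'))')

def replace_in_instance(text, element_dict, attribute_dict):
    out = []
    pos = 0
    while pos < len(text):                        # S152: scanned to the end?
        lt = text.find("<", pos)                  # S153/S154: look for "<"
        gt = text.find(">", lt) if lt != -1 else -1
        if lt == -1 or gt == -1:
            out.append(text[pos:])                # no further tag
            break
        out.append(text[pos:lt])                  # content is left untouched
        body = text[lt + 1:gt]
        name = re.match(r"[^\s/]+", body)
        if body.startswith(("?", "!")) or (name is None and not body.startswith("/")):
            out.append(text[lt:gt + 1])           # assumption: copy declarations as-is
        elif body.startswith("/"):                # S155: "/" follows "<", so end tag
            end_name = body[1:].strip()
            out.append("</" + element_dict.get(end_name, end_name) + ">")    # S158
        else:                                     # start tag
            rest = _ATTR.sub(
                lambda a: attribute_dict.get(a.group(1), a.group(1)) + a.group(2),
                body[name.end():])                # S157: attribute names only
            out.append("<" + element_dict.get(name.group(0), name.group(0)) + rest + ">")  # S156
        pos = gt + 1
    return "".join(out)

sample = '<book field="novel"><title>Kokoro</title><author>Soseki</author></book>'
compressed = replace_in_instance(
    sample, {"book": "a", "title": "b", "author": "c"}, {"field": "A"})
print(compressed)
```

The sample document and its content are illustrative and are not taken from the figures.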

Next, the replacement procedure of the DTD character string replacement unit 51 of the second embodiment is described following the flowchart of FIG. 18 (steps S161 to S170). The unit first receives the tag dictionary 90 produced by the tag dictionary creation unit 80 (step S161). Then, while checking whether the DTD to be compressed has been scanned to the end (step S162), it scans the DTD (step S163) and checks whether an element type declaration, that is, “<!ELEMENT”, appears (step S164).

When “<!ELEMENT” is detected in step S164 (YES route), the unit refers to the tag dictionary 90 and replaces the element name in that element type declaration with the corresponding shortened character string (step S165). It then detects the content model written in the declaration (step S166) and, if the content model names child elements, replaces those element names as well by referring to the tag dictionary 90 (step S167). If there is no content model, or if the content model names no child elements, step S167 is skipped.

After this, the unit checks whether an attribute list declaration, that is, “<!ATTLIST”, appears (step S168). When “<!ATTLIST” is detected in step S168 (YES route), the unit refers to the tag dictionary 90, replaces the element name in that attribute list declaration with the corresponding shortened character string (step S169), replaces the attribute name in the declaration likewise (step S170), and then returns to step S162.

When it is determined in step S164 that no element type declaration is written (NO route), or in step S168 that no attribute list declaration is written (NO route), the process returns to step S162.
When it is determined in step S162 that the DTD to be compressed has been scanned to the end (YES route), the process ends.
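The DTD replacement of FIG. 18 can be approximated with regular expressions, as in the sketch below. It is not the step-by-step loop of the flowchart: the function name `replace_in_dtd` is hypothetical, each attribute list declaration is assumed to define a single attribute, and the sample DTD is invented for illustration rather than copied from FIG. 24B.

```python
import re

# A name in a content model; the lookbehind skips keywords such as #PCDATA.
_CHILD = re.compile(r"(?<![#\w])[A-Za-z_][\w.\-]*")
_ELEMENT = re.compile(r"<!ELEMENT\s+(\S+)(\s+)([^>]*)>")
_ATTLIST = re.compile(r"<!ATTLIST\s+(\S+)(\s+)(\S+)")

def replace_in_dtd(dtd, element_dict, attribute_dict):
    def element_decl(m):
        name = element_dict.get(m.group(1), m.group(1))                # S165
        model = _CHILD.sub(                                            # S166, S167
            lambda c: element_dict.get(c.group(0), c.group(0)), m.group(3))
        return "<!ELEMENT " + name + m.group(2) + model + ">"

    def attlist_decl(m):
        element = element_dict.get(m.group(1), m.group(1))             # S169
        attribute = attribute_dict.get(m.group(3), m.group(3))         # S170
        return "<!ATTLIST " + element + m.group(2) + attribute

    dtd = _ELEMENT.sub(element_decl, dtd)
    return _ATTLIST.sub(attlist_decl, dtd)

sample_dtd = (
    "<!ELEMENT book (title, author)>\n"
    "<!ELEMENT title (#PCDATA)>\n"
    "<!ELEMENT author (#PCDATA)>\n"
    "<!ATTLIST book field CDATA #REQUIRED>"
)
compressed_dtd = replace_in_dtd(
    sample_dtd, {"book": "a", "title": "b", "author": "c"}, {"field": "A"})
print(compressed_dtd)
```

Names that are missing from the dictionaries, including the keywords EMPTY and ANY, pass through unchanged because of the `dict.get` fallback.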

As mentioned above, the second embodiment also provides a decompression apparatus (not shown) that decompresses and restores to its original state the description of an XML document whose strings have been replaced with shortened character strings. The document realization value character string reverse replacement means and the DTD character string reverse replacement means that constitute this apparatus perform reverse replacement following the flowcharts of FIGS. 17 and 18, in the same way as the document realization value character string replacement unit 41 and the DTD character string replacement unit 51, respectively. In the reverse replacement, however, the direction of character string conversion in steps S156 to S158 of FIG. 17 and steps S165, S167, S169, and S170 of FIG. 18 is reversed.
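Because only the conversion direction changes, reverse replacement can reuse the same routine with an inverted dictionary. The Python sketch below illustrates this for element names only; `substitute` and `invert` are hypothetical helper names, and the regular expression is a simplification of the tag scan.

```python
import re

# "<" or "</" followed by an element name; declarations ("<!", "<?") do not match.
_TAG_NAME = re.compile(r"(</?)([^\s/>!?]+)")

def substitute(text, mapping):
    """Replace element names in start and end tags according to `mapping`."""
    return _TAG_NAME.sub(
        lambda m: m.group(1) + mapping.get(m.group(2), m.group(2)), text)

def invert(mapping):
    """Swap keys and values; shortened strings must be unique to be restorable."""
    inverse = {short: name for name, short in mapping.items()}
    if len(inverse) != len(mapping):
        raise ValueError("two names share one shortened character string")
    return inverse

tag_dictionary = {"book": "a", "title": "b", "author": "c"}
original = "<book><title>Kokoro</title><author>Soseki</author></book>"
compressed = substitute(original, tag_dictionary)
restored = substitute(compressed, invert(tag_dictionary))
print(compressed)
print(restored)
```

The uniqueness check in `invert` reflects the requirement that each shortened character string identify exactly one original string.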

Next, the decompression procedure for a structured document in the second embodiment, that is, the reverse replacement procedure performed by the decompression apparatus described above, is explained following the flowchart of FIG. 19 (steps S131 to S142).
The decompression apparatus of the second embodiment also has a pattern recognition function for recognizing which of patterns (1) to (5) the compressed document to be decompressed belongs to. The processing of this function corresponds to steps S133 to S135 in FIG. 19.

First, the compressed document to be decompressed is input (step S131), together with the tag dictionary 90 created during compression (step S132). The apparatus then determines whether “<!DOCTYPE” is written in the compressed document (step S133). If it is not (NO route of step S133), the compressed document is recognized as a well-formed XML document with no DTD, that is, an XML document of pattern (1), and steps S136 and S142 are executed as described later.

If “<!DOCTYPE” is written in the compressed document (YES route of step S133), the apparatus determines whether “[” follows it (step S134).
If “<!DOCTYPE” is written but “[” is not (NO route of step S134), the compressed document is recognized as a validated XML document whose DTD is held as the external file 100, that is, an XML document of pattern (5), and steps S139 to S142 are executed as described later.

If “[” is written (YES route of step S134), the apparatus determines whether “<!ELEMENT” (or “<!ATTLIST”) is written (step S135).
If “<!DOCTYPE” and “[” are written but “<!ELEMENT” is not (NO route of step S135), the compressed document is recognized as a well-formed XML document whose DTD contains internal or external entity declarations, that is, an XML document of pattern (2) or (3), and steps S136 and S142 are executed as for pattern (1).

If “<!DOCTYPE”, “[”, and “<!ELEMENT” are all written (YES route of step S135), the compressed document is recognized as a validated XML document that contains its DTD within the XML document, that is, an XML document of pattern (4), and steps S137, S138, and S142 are executed.
The decompression process for each of patterns (1) to (5) is described below with reference to the specific examples (first to fifth examples) shown in FIGS. 20 to 24.
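The pattern recognition of steps S133 to S135 amounts to three string checks. A minimal Python sketch follows; the function name `classify_pattern`, the returned labels, and the sample documents are assumptions made for illustration.

```python
def classify_pattern(document):
    """Return which pattern group a (compressed) XML document belongs to."""
    doctype = document.find("<!DOCTYPE")
    if doctype == -1:                                        # S133, NO route
        return "pattern 1"
    close = document.find(">", doctype)
    bracket = document.find("[", doctype)
    # "[" counts only if it opens an internal subset, i.e. it appears
    # before the first ">" that follows "<!DOCTYPE".
    if bracket == -1 or (close != -1 and close < bracket):   # S134, NO route
        return "pattern 5"
    subset_end = document.find("]>", bracket)
    subset = document[bracket:] if subset_end == -1 else document[bracket:subset_end]
    if "<!ELEMENT" in subset or "<!ATTLIST" in subset:       # S135, YES route
        return "pattern 4"
    return "pattern 2 or 3"                                  # S135, NO route

samples = {
    "no DTD": "<a><b>Kokoro</b></a>",
    "entities only": '<!DOCTYPE a [<!ENTITY pub "Iwanami">]><a><b>&pub;</b></a>',
    "internal DTD": "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b (#PCDATA)>]><a><b>Kokoro</b></a>",
    "external DTD": r'<!DOCTYPE a SYSTEM "..\book2.dtd"><a><b>Kokoro</b></a>',
}
results = {label: classify_pattern(doc) for label, doc in samples.items()}
print(results)
```

Patterns (2) and (3) share one label because the flowchart sends both to the same steps, S136 and S142.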

When the compressed document to be decompressed is a compressed XML document of pattern (1), as shown in FIG. 20B, “<!DOCTYPE” is not written in it, so the process moves from the NO route of step S133 to step S136. Using the tag dictionary 90 shown in FIG. 20C, the document realization value character string reverse replacement means restores the 1-byte shortened character strings “a”, “b”, and “c” written in the tags of the document realization value to the original element names “book”, “title”, and “author”, respectively. As a result, the compressed XML document of FIG. 20B, for example, is reverse-replaced and decompressed into the XML document shown in FIG. 20A, which is output to and stored in the document storage unit 10 or the like as a decompressed document (step S142).

When the compressed document to be decompressed is a compressed XML document of pattern (2), as shown in FIG. 21B, it contains both “<!DOCTYPE” and “[” but neither “<!ELEMENT” nor “<!ATTLIST”, so the process moves from the NO route of step S135 to step S136. Using the tag dictionary 90 shown in FIG. 21C, the document realization value character string reverse replacement means restores the 1-byte shortened character strings “a”, “b”, and “c” written in the tags of the document realization value to the original element names “book”, “title”, and “author”, respectively. As a result, the compressed XML document of FIG. 21B, for example, is reverse-replaced and decompressed into the XML document shown in FIG. 21A, which is output to and stored in the document storage unit 10 or the like as a decompressed document (step S142).

When the compressed document to be decompressed is a compressed XML document of pattern (3), as shown in FIG. 22B, it contains both “<!DOCTYPE” and “[”, as in pattern (2), but neither “<!ELEMENT” nor “<!ATTLIST”. The process therefore moves from the NO route of step S135 to step S136, and the same processing as for the compressed XML documents of patterns (1) and (2) is executed.

That is, using the tag dictionary 90 shown in FIG. 22C, the document realization value character string reverse replacement means restores the 1-byte shortened character strings “a”, “b”, and “c” written in the tags of the document realization value to the original element names “book”, “title”, and “author”, respectively. As a result, the compressed XML document of FIG. 22B, for example, is reverse-replaced and decompressed into the XML document shown in FIG. 22A, which is output to and stored in the document storage unit 10 or the like as a decompressed document (step S142).

When the compressed document to be decompressed is a compressed XML document of pattern (4), as shown in FIG. 23B, it contains “<!DOCTYPE” and “[” as well as “<!ELEMENT” or “<!ATTLIST”, so the process moves from the YES route of step S135 to step S137. Using the tag dictionaries 90 shown in FIGS. 23C and 23D, the document realization value character string reverse replacement means restores the 1-byte shortened character strings “a”, “b”, “c”, and “A” written in the tags of the document realization value to the original element names “book”, “title”, and “author” and the attribute name “field”, respectively.

Furthermore, using the tag dictionaries 90 shown in FIGS. 23C and 23D, the DTD character string reverse replacement means restores the 1-byte shortened character strings “a”, “b”, “c”, and “A” written in the DTD to the original element names “book”, “title”, and “author” and the attribute name “field”, respectively (step S138).
As a result, the compressed XML document of FIG. 23B, for example, is reverse-replaced and decompressed into the XML document shown in FIG. 23A, which is output to and stored in the document storage unit 10 or the like as a decompressed document (step S142).

When the compressed document to be decompressed is a compressed XML document of pattern (5), as shown in FIG. 24C, it contains “<!DOCTYPE”, but this is followed not by “[” but by a system identifier that designates the DTD in the external file 100. The process therefore moves from the NO route of step S134 to step S139, where the old/new DTD correspondence table shown in FIG. 24G is input. Then, according to that table, the file name designated by the system identifier “SYSTEM” in the document type declaration of the document realization value is rewritten from “..\book2.dtd” back to the original file name “..\book.dtd” (step S140).

Then, using the tag dictionaries 90 shown in FIGS. 24E and 24F, the document realization value character string reverse replacement means restores the 1-byte shortened character strings “a”, “b”, “c”, and “A” written in the tags of the document realization value to the original element names “book”, “title”, and “author” and the attribute name “field”, respectively (step S141).
As a result, the compressed XML document of FIG. 24C, for example, is reverse-replaced and decompressed into the XML document shown in FIG. 24A, which is output to and stored in the document storage unit 10 or the like as a decompressed document (step S142).

At this point, the DTD with the file name “..\book.dtd”, that is, the DTD shown in FIG. 24B, is still stored in the external file 100. There is therefore no need for the DTD character string reverse replacement means to reverse-replace and decompress the compressed DTD shown in FIG. 24D in order to obtain the DTD shown in FIG. 24B.

Thus, according to the second embodiment of the present invention, the descriptions in the tags of the document realization value and in the DTD are analyzed, the tag dictionary 90 is created from the analysis result, and the character strings written in the tags and in the DTD are replaced with shortened character strings (1- or 2-byte replacement characters) using the tag dictionary 90. The character strings in the tags and the DTD are thereby compressed without impairing the characteristics or structure of the XML document, so the compression ratio of XML documents can be greatly increased and, in turn, the storage efficiency of document data can be greatly increased in systems that handle large databases.

Because the element names and attribute names written in the tags and the DTD are replaced with shortened character strings, the tag portions and the DTD can be compressed while remaining searchable. That is, by replacing element names and attribute names, retaining the tag dictionary 90, and keeping attribute values in their original form, searches can be run and the document structure can be grasped on the compressed data as it is, without decompression.

Consequently, there is no need to decompress a compressed XML document in order to analyze its structure and search it, and even when document data is stored in compressed form in a large database, searches and similar processing on that data can be carried out in a short time.

Since attribute names and attribute values can each be search targets, if an attribute name and its attribute value are compressed together as a unit, a search cannot be performed unless the data is decompressed and restored. In the second embodiment, therefore, only the attribute names are replaced with shortened character strings and the attribute values are left in their original form, so that application software such as a browser can search the data in the compressed document by referring to the tag dictionary 90, without decompressing the compressed document.
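The search-without-decompression idea can be illustrated with a minimal Python sketch. The dictionary contents, the function name `find_element_text`, and the sample compressed string are assumptions made for illustration only; they are not taken from the patent figures.

```python
import re

# Hypothetical tag dictionary: original element name -> replacement character.
TAG_DICT = {"book": "a", "chapter": "b", "title": "c", "paragraph": "d"}

def find_element_text(compressed, element_name, tag_dict):
    """Search the compressed text directly: translate the query name through
    the tag dictionary instead of decompressing the document."""
    short = tag_dict[element_name]
    pattern = re.compile(
        r"<{0}(?:\s[^>]*)?>(.*?)</{0}>".format(re.escape(short)), re.S
    )
    return pattern.findall(compressed)

# Attribute values (here x="1") stay in their original form, as in the text.
compressed = '<a><b><c x="1">Outline</c><d>What is XML</d></b></a>'
titles = find_element_text(compressed, "title", TAG_DICT)
```

Only the query is mapped through the dictionary; the stored document is never expanded.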

[3] Description of Third Embodiment
Next, a third embodiment of the present invention will be described.
In the third embodiment of the present invention, as in the second embodiment, the character strings described in the regions enclosed by “<” and “>” in the XML document before compression shown in FIG. 30(A) (DTD declarations, and start tags and end tags in the document realization value) are converted into shortened character strings (replacement characters) and compressed using a tag dictionary as shown in FIG. 30(C). When such compression is performed, the description format using “<” and “>” is preserved in the compressed XML document, and the characteristics and structure of the XML document are not impaired even after compression.

In the third embodiment, furthermore, the language used in the plaintext is identified, a dictionary (word dictionary) for that language, for example a Japanese dictionary as shown in FIG. 30(D), is selected, and the words making up the plaintext are converted into fixed-byte-length shortened character strings (word numbers) by the longest-match method. Here, plaintext means the content described between a start tag and an end tag.
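The longest-match conversion can be sketched as follows in Python. The dictionary entries and numbers are hypothetical, and passing unknown characters through unchanged is an assumption, since the text does not say how out-of-dictionary characters are handled.

```python
def longest_match_encode(text, word_dict):
    """Greedy longest-match: at each position, take the longest dictionary
    word that matches and emit its word number."""
    max_len = max(len(word) for word in word_dict)
    out = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in word_dict:
                out.append(word_dict[candidate])
                i += n
                break
        else:
            # Assumption: a character not covered by the dictionary is kept as-is.
            out.append(text[i])
            i += 1
    return out

# Hypothetical Japanese word dictionary (word -> word number).
JA_DICT = {"XML": 1, "の": 2, "概要": 3, "とは": 4, "と": 5}
encoded_title = longest_match_encode("XMLの概要", JA_DICT)
encoded_para = longest_match_encode("XMLとは", JA_DICT)
```

In the second call, "とは" is chosen over the shorter entry "と", which is the point of matching the longest word first.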

In this way, an XML document before compression as shown in FIG. 30(A), for example, can be converted and compressed into an XML document as shown in FIG. 30(B).
In the third and fourth embodiments, the language is identified from among three languages, for example Japanese, English, and Chinese.

In the case of a well-formed XML document having no DTD (pattern (1)), the descriptions in the tags of the document realization value are examined, and the character strings in the tags, such as element names, are associated with shortened character strings (replacement characters) using the tag dictionary. The language of the plaintext (content) is then identified from the language identification attribute defined by the XML specification (xml:lang), a dictionary (word dictionary) for that language is selected, and the same compression as described above is performed. A specific example in which the compression method of the third embodiment is applied to a well-formed XML document having no DTD will be described later with reference to FIGS. 32(A) to 32(D).

Hereinafter, the third embodiment of the present invention will be described with reference to FIGS. 25 to 33.
First, what an XML document looks like before and after compression will be described with reference to FIGS. 30 to 33.

FIGS. 30(A) to 30(D) are diagrams for explaining a specific compression process of an XML document according to the third embodiment (pattern (4): a validated XML document with an internally described DTD). FIG. 30(A) shows an example of the XML document with an internal DTD before compression, FIG. 30(B) shows the same document after compression, FIG. 30(C) shows example registered contents of the tag dictionary used for the compression, and FIG. 30(D) shows example registered contents of the Japanese dictionary used for the compression.

FIGS. 31(A) to 31(G) are diagrams for explaining a specific compression process of an XML document according to the third embodiment (pattern (5): an XML document that specifies and uses a DTD in a separate file). FIG. 31(A) shows an example, before compression, of an XML document that refers to and uses a DTD stored in a separate file; FIG. 31(B) shows the DTD stored in that separate file before compression; FIG. 31(C) shows the XML document after compression; FIG. 31(D) shows the DTD after compression (the DTD of a new file); FIG. 31(E) shows example registered contents of the tag dictionary used for the compression; FIG. 31(F) shows example registered contents of the Japanese dictionary used for the compression; and FIG. 31(G) shows example registered contents of the correspondence table that holds the correspondence between the old and new DTDs.

FIGS. 32(A) to 32(D) are diagrams for explaining a specific compression process of an XML document according to the third embodiment (pattern (1): a well-formed XML document having no DTD). FIG. 32(A) shows an example of a well-formed XML document having no DTD before compression, FIG. 32(B) shows the same document after compression, FIG. 32(C) shows example registered contents of the tag dictionary used for the compression, and FIG. 32(D) shows example registered contents of the Japanese dictionary used for the compression.

FIG. 33 is a diagram for explaining the XML document compression method of the third embodiment. It shows how a character string described in a region enclosed by “<” and “>” in an XML document is converted into replacement characters when the character string contains a blank (space).

First, the compression process for an XML document with an internally described DTD will be described with reference to FIGS. 30(A) to 30(D).
As shown in FIG. 30(A), the XML document before compression contains the XML declaration on line 1, the DTD on lines 2 to 7, and the plaintext content and the like on lines 8 to 12, written together with various symbols.

In FIG. 30(A), line 1 contains an XML declaration indicating that this document is a version 1.0 XML document, and line 2 states that the document type name (the DOCTYPE name) of this document is “book”. The description between the “[” immediately after that name and the “]” on line 7 defines the structure of this document.

The top-level element name “book” on line 3 must match the document type name “book”. Line 3 describes, as a content model, that the element “book” consists of the child element “chapter”. Lines 4 to 6 describe that the element “chapter” in turn consists of two child elements, “title” and “paragraph”. The “]>” on line 7 indicates the end of the internal subset of the document type declaration (DOCTYPE declaration, DTD description) that began on line 2.

The “<book>” on line 8 is the start tag of the element “book”, and the “</book>” on line 12 is its end tag. The description between these tags (lines 9 to 12) is the content of the element “book”. Line 9 indicates that the content of the child element “title” of the element “chapter” is 「ＸＭＬの概要」 (“Overview of XML”), and line 10 indicates that the content of the child element “paragraph” of the element “chapter” is 「ＸＭＬとは…」 (“What is XML…”). On line 9, “<title>” is the start tag marking the beginning of the element “title”, and “</title>” is the end tag marking its end. The “</chapter>” on line 11 is the end tag marking the end of the element “chapter”.

The character strings in the XML document before compression shown in FIG. 30(A) are converted into replacement characters and word numbers using the tag dictionary shown in FIG. 30(C) and the Japanese dictionary shown in FIG. 30(D), whereby the XML document is compressed as shown in FIG. 30(B).
Here, the tag dictionary shown in FIG. 30(C) is created by the method described later with reference to FIG. 29, and the Japanese dictionary shown in FIG. 30(D) is a static dictionary created in advance.

In the example of FIG. 30(A), the tag dictionary described above replaces and compresses “chapter”, “title”, and “paragraph” into the fixed-length replacement characters “b”, “c”, and “d”, respectively, and the Japanese dictionary described above replaces and compresses the plaintext words 「ＸＭＬ」 (“XML”), 「の」 (“of”), 「概要」 (“overview”), and 「とは」 (“what is”) into the fixed-length word numbers “α”, “β”, “γ”, “δ”, and so on, respectively, yielding the compressed XML document shown in FIG. 30(B).
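As a rough illustration of the tag-side replacement, the following Python sketch rewrites only the element names that follow “<” or “</” and leaves the bracket structure intact. The regular expression and the function name `compress_tags` are assumptions for illustration; the dictionary holds only the three mappings named in the text.

```python
import re

# The three replacements named in the text for FIG. 30.
TAG_DICT = {"chapter": "b", "title": "c", "paragraph": "d"}

def compress_tags(xml_text, tag_dict):
    """Replace element names inside start and end tags, keeping '<', '</'
    and '>' in place so the markup structure survives compression."""
    def repl(match):
        prefix, name = match.group(1), match.group(2)
        # Names without a dictionary entry are left unchanged (assumption).
        return prefix + tag_dict.get(name, name)
    return re.sub(r"(</?)([A-Za-z_][\w.-]*)", repl, xml_text)

sample = "<book><chapter><title>XML</title></chapter></book>"
compressed = compress_tags(sample, TAG_DICT)
```

Declarations beginning with “<!” or “<?” are not touched by this pattern, so DTD lines would need separate handling.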

Next, the compression method used when referring to a DTD stored in a file separate from the XML document will be described with reference to FIGS. 31(A) to 31(G).
In the XML document before compression shown in FIG. 31(A), the underlined part of line 2, that is, the system identifier SYSTEM "../book.dtd", specifies the DTD in a separate file named “book.dtd”. The DTD in this separate file is described as shown in FIG. 31(B).

The character strings in the XML document before compression shown in FIG. 31(A) are converted into replacement characters and word numbers using the tag dictionary shown in FIG. 31(E) and the Japanese dictionary shown in FIG. 31(F), whereby the XML document is compressed as shown in FIG. 31(C).

Likewise, the character strings in the DTD shown in FIG. 31(B) are converted into replacement characters using the tag dictionary shown in FIG. 31(E), whereby the DTD is compressed and created as the DTD of a new file, as shown in FIG. 31(D). The new file is given a file name such as “book2.dtd”.

As shown in FIG. 31(C), the file name “book2.dtd” is written on line 2 of the compressed XML document in place of the old file name “book.dtd”, so as to specify the DTD of the new file. The file name of the new DTD and the file name of the old DTD before compression are then entered in the old/new DTD correspondence table, as shown in FIG. 31(G).
Here, the tag dictionary shown in FIG. 31(E), like the tag dictionary shown in FIG. 30(C), is created by the method described later with reference to FIG. 29, and the Japanese dictionary shown in FIG. 31(F) is a static dictionary created in advance.
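The rewriting of the external DTD reference and the bookkeeping in the old/new correspondence table might look like the Python sketch below. The "2" suffix follows the “book2.dtd” example in the text; the function name, the table layout (new name mapped to old name), and the regular expression are assumptions.

```python
import re

def rewrite_dtd_reference(doctype_line, correspondence):
    """Point a DOCTYPE line at the compressed DTD file and record the
    new -> old file name pair in the correspondence table."""
    match = re.search(r'SYSTEM\s+"([^"]+)"', doctype_line)
    if match is None:
        return doctype_line  # no external DTD is referenced
    old_name = match.group(1)
    stem, dot, ext = old_name.rpartition(".")
    # Naming rule assumed from the example: book.dtd -> book2.dtd
    new_name = f"{stem}2.{ext}" if dot else old_name + "2"
    correspondence[new_name] = old_name
    return doctype_line[:match.start(1)] + new_name + doctype_line[match.end(1):]

table = {}
new_line = rewrite_dtd_reference('<!DOCTYPE book SYSTEM "../book.dtd">', table)
```

On decompression, the table can be read in the opposite direction to restore the original file name.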

In the example of FIG. 31(A) as well, the tag dictionary described above replaces and compresses “chapter”, “title”, and “paragraph” into the fixed-length replacement characters “b”, “c”, and “d”, respectively, and the Japanese dictionary described above replaces and compresses the plaintext words 「ＸＭＬ」 (“XML”), 「の」 (“of”), 「概要」 (“overview”), and 「とは」 (“what is”) into the fixed-length word numbers “α”, “β”, “γ”, “δ”, and so on, respectively, yielding the compressed XML document shown in FIG. 31(C).

In the DTD shown in FIG. 31(C), “chapter”, “title”, and “paragraph” are likewise replaced and compressed into the fixed-length replacement characters “b”, “c”, and “d”, respectively, by the tag dictionary described above, and the result is stored as a new DTD in a new file named “book2.dtd”.

Next, the compression process for a well-formed XML document having no DTD will be described with reference to FIGS. 32(A) to 32(D).
In the XML document before compression shown in FIG. 32(A), no DTD (document type definition) beginning with a DOCTYPE declaration is written after the XML declaration on line 1.
In this case as well, the character strings in the XML document before compression shown in FIG. 32(A) are converted into replacement characters and word numbers using the tag dictionary shown in FIG. 32(C) and the Japanese dictionary shown in FIG. 32(D), and the XML document is compressed as shown in FIG. 32(B).

Here, the tag dictionary shown in FIG. 32(C), like the tag dictionaries shown in FIGS. 30(C) and 31(E), is created by the method described later with reference to FIG. 29, and the Japanese dictionary shown in FIG. 32(D) is a static dictionary created in advance.
In the example of FIG. 32(A) as well, the tag dictionary described above replaces and compresses “chapter”, “title”, and “paragraph” into the fixed-length replacement characters “b”, “c”, and “d”, respectively, and the Japanese dictionary described above replaces and compresses the plaintext words 「ＸＭＬ」 (“XML”), 「の」 (“of”), 「概要」 (“overview”), and 「とは」 (“what is”) into the fixed-length word numbers “α”, “β”, “γ”, “δ”, and so on, respectively, yielding the compressed XML document shown in FIG. 32(C).

In this embodiment, as shown in FIG. 33, when a character string in a region enclosed by “<” and “>” in an XML document contains a blank (space), the character string is split at the blank. Each of the character string parts obtained by this splitting is associated with a replacement character (shortened character string).

That is, in the example shown in FIG. 33, there is a start tag “table orient="PORT" tocentry="1"” enclosed by “<” and “>”. Instead of registering the whole of it as a single tag, the character string is split at the blanks within it into three parts: “table”, “orient="PORT"”, and “tocentry="1"”.

Then, as shown in FIG. 33, the replacement characters (shortened character strings) “e”, “f”, and “g” are assigned to the respective parts. Shortening the registered entries in this way reduces the size of the tag dictionary and the amount of data to be searched, and consequently reduces the amount of hardware needed for the search means.

In the example shown in FIG. 33, the parts replaced by the replacement characters “e”, “f”, and “g” and the parts replaced by “h” and “j” are tags. The tag replaced by “h” marks the start of the title, and the tag replaced by “j” marks its end. The part replaced by “i” is the plaintext 「機能一覧」 (“function list”), which is the content of the title. The parts replaced by “e”, “f”, and “g” are, respectively, the element name, the first attribute (attribute name = attribute value), and the second attribute (attribute name = attribute value) of the tag <table orient="PORT" tocentry="1">.
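The blank-delimited splitting of FIG. 33 can be sketched in Python as follows. The helper names and the way replacement characters are drawn from a running sequence starting at “e” are assumptions chosen to mirror the example.

```python
from itertools import count

def split_tag_body(tag_body):
    """Split the text between '<' and '>' at blanks."""
    return tag_body.split()

def assign_replacements(parts, tag_dict, codes):
    """Give each new part the next unused replacement character and
    return the replacement characters for all parts in order."""
    for part in parts:
        if part not in tag_dict:
            tag_dict[part] = next(codes)
    return [tag_dict[part] for part in parts]

# Hypothetical code sequence: e, f, g, ...
codes = (chr(c) for c in count(ord("e")))
tag_dict = {}
parts = split_tag_body('table orient="PORT" tocentry="1"')
replaced = assign_replacements(parts, tag_dict, codes)
```

Splitting purely at blanks would also cut an attribute value that itself contains a space, so a real implementation would have to respect quotation marks.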

Next, the configuration and operation of the compression apparatus and the decompression apparatus of the third embodiment, which perform the compression process described above, will be described with reference to FIGS. 25 to 29.
FIG. 25 is a block diagram showing the functional configuration of the structured document compression apparatus according to the third embodiment of the present invention. FIG. 26 is a block diagram showing the functional configuration of the structured document decompression apparatus according to the third embodiment. FIG. 27 is a flowchart for explaining the tag dictionary creation procedure (tag dictionary registration procedure) in the compression apparatus shown in FIG. 25. FIG. 28 is a flowchart for explaining the compression procedure performed by the compression apparatus shown in FIG. 25. FIG. 29 is a flowchart for explaining the decompression procedure performed by the decompression apparatus shown in FIG. 26.

First, the compression apparatus of the third embodiment will be described with reference to FIG. 25. In FIG. 25, 101 denotes a document storage unit, 102 a DTD condition checking unit, 103 a tag plaintext identification unit, 104 a tag character string registration unit, 105 a tag dictionary, 106 a character string comparison unit, 107 a language identification unit, 108 a Japanese dictionary, 109 a Chinese dictionary, 110 an English dictionary, 111 a tag character string conversion unit, 112 a word number conversion unit, 113 a word number file, and 114 a DTD entry unit.

The document storage unit 101 receives and holds the XML document to be compressed. It is a memory that holds an XML document before compression such as the one shown in FIG. 30(A).
The DTD condition checking unit 102 identifies which of three cases applies to the XML document to be compressed: it contains an internal DTD as shown in FIG. 30(A), it refers to a DTD in a separate file as shown in FIG. 31(A), or it has no DTD as shown in FIG. 32(A). It thus implements the pattern recognition function of the first and second embodiments.

More specifically, if “<!DOCTYPE” is written on line 2 of the XML document to be compressed, the DTD condition checking unit 102 identifies the document as having a DTD. If “[” is written after the document type name (here, “book”), that is, after the start of the document type declaration, the unit identifies the document as an XML document with an internal DTD; if “[” is not written, it identifies the document as an XML document that refers to a DTD in an external file. If “<!DOCTYPE” is not written on line 2 of the XML document, the unit identifies the document as an XML document without a DTD.
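The three-way check performed by the DTD condition checking unit 102 can be sketched in Python as below. The function name and the returned labels are hypothetical, and the sketch assumes, as the text does, that the DOCTYPE declaration (if any) begins on line 2.

```python
def classify_dtd(xml_text):
    """Return 'internal', 'external', or 'none' by inspecting line 2."""
    lines = xml_text.splitlines()
    second = lines[1].lstrip() if len(lines) > 1 else ""
    if not second.startswith("<!DOCTYPE"):
        return "none"          # no DTD
    if "[" in second:
        return "internal"      # DTD written inside the document
    return "external"          # DTD held in a separate file

internal_doc = '<?xml version="1.0"?>\n<!DOCTYPE book [\n<!ELEMENT book (chapter)>\n]>\n<book/>'
external_doc = '<?xml version="1.0"?>\n<!DOCTYPE book SYSTEM "../book.dtd">\n<book/>'
no_dtd_doc = '<?xml version="1.0"?>\n<book/>'
```

A document whose “[” appears on a later line than “<!DOCTYPE” would be misclassified by this simplified check.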

The tag plaintext identification unit 103 identifies whether a character string of interest in the XML document is a tag or plaintext. If the character string is enclosed front and back by “<” and “>”, it is identified as a description inside a tag; otherwise, it is identified as a description in plaintext. For example, 「ＸＭＬの概要」 (“Overview of XML”) on line 9 and 「ＸＭＬとは…」 (“What is XML…”) on line 10 of FIG. 30(A), and “αβγ” on line 9 and “αδ…” on line 10 of FIG. 30(B), are not enclosed by “<” and “>” and are therefore identified as plaintext.
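A minimal Python sketch of this tag/plaintext distinction is given below. The function name `segments` and the regular expression are assumptions; the sketch treats everything from “<” to the next “>” as a tag and any other run of characters as plaintext.

```python
import re

def segments(xml_text):
    """Yield ('tag', text) for spans enclosed by '<' and '>' and
    ('text', text) for everything else."""
    for match in re.finditer(r"<[^>]*>|[^<]+", xml_text):
        token = match.group(0)
        kind = "tag" if token.startswith("<") else "text"
        yield kind, token

parts = list(segments("<c>XML</c>"))
```

An internal DTD subset nests “<” inside the DOCTYPE declaration, so it would need handling beyond this simple pattern.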

The tag character string registration unit 104 registers and creates a tag dictionary 105 such as the one shown in FIG. 30(C). The tag dictionary 105 associates each character string obtained by splitting the inside of a tag as shown in FIG. 33 (that is, an “element name”, an “attribute name = "attribute value"”, and the like) with a shortened character string (replacement character).
The tag plaintext identification unit 103 and the tag character string registration unit 104 described above perform functions corresponding to those of the document realization value analysis unit 20, the DTD analysis unit 30, and the tag dictionary creation unit 80 of the second embodiment.

The tag character string registration unit 104 creates the tag dictionary 105 described above according to the flowchart shown in FIG. 27 (steps S241 to S251). The tag dictionary creation procedure (tag dictionary registration procedure) performed by the tag character string registration unit 104 is described below with reference to FIG. 27.

First, characters are input one at a time to the tag character string registration unit 104 (step S241), and it is determined whether the input character is EOF (End Of File) (step S242). If it is EOF (YES route), the tag dictionary creation operation ends.
If it is determined in step S242 that the input character is not EOF (NO route), it is determined whether the input character is “<” (step S243). If it is not “<” (NO route), the process returns to step S241 and the next character is input.

If it is determined in step S243 that the input character is “<” (YES route), the character string in memory is cleared (#) (step S244), and then the characters following “<” (the characters inside the tag) are input one at a time (step S245). The character string that was input in step S241 and erased from memory in step S244 is the plaintext written before the “<”.

It is determined whether the character input in step S245 is a blank (step S246). If it is a blank (YES route), a suitable replacement character (for example, “b”) is chosen for the character string accumulated in memory up to the point where the blank was recognized, and the replacement character and the character string are registered in the tag dictionary 105 in association with each other (step S247). The process then returns to step S244 and the same processing is repeated.

If it is determined in step S246 that the character is not a blank (NO route), it is determined whether the character input in step S245 is “!” (step S248). If it is “!” (YES route), the character string following the “!” is a comment rather than a tag, so the process returns to step S241.

If it is determined in step S248 that the character is not “!” (NO route), it is determined whether the character input in step S245 is “>” (step S249). If it is “>” (YES route), a suitable replacement character is chosen for the character string accumulated in memory up to the point where the “>” was recognized, and the replacement character and the character string are registered in the tag dictionary 105 in association with each other (step S250).

Recognizing “>” in step S249 means that the tag description has ended, so the process returns to step S241 and the same processing is repeated.
If it is determined in step S249 that the character is not “>” (NO route), the character input in step S245 this time is appended to the character string accumulated in memory to form a new character string (step S251), and the process returns to step S245.
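The flow of steps S241 to S251 can be followed in a short Python sketch. The function name, the choice of consecutive letters as replacement characters, and the skipping of empty or already registered strings are assumptions; the text only says that a "suitable" replacement character is chosen.

```python
from itertools import count

def build_tag_dictionary(text, first_code="e"):
    """Character-by-character scan following steps S241 to S251."""
    tag_dict = {}
    codes = (chr(c) for c in count(ord(first_code)))  # hypothetical code source

    def register(buffer):
        # Assumption: empty strings and duplicates are not registered again.
        if buffer and buffer not in tag_dict:
            tag_dict[buffer] = next(codes)

    chars = iter(text)
    for ch in chars:                 # S241; loop end corresponds to EOF (S242)
        if ch != "<":                # S243: plaintext is skipped
            continue
        buffer = ""                  # S244: clear the string in memory
        for ch in chars:             # S245: read characters inside the tag
            if ch.isspace():         # S246: blank ends one part
                register(buffer)     # S247
                buffer = ""          # back to S244
            elif ch == "!":          # S248: treated as a comment, not a tag
                break
            elif ch == ">":          # S249: end of the tag
                register(buffer)     # S250
                break
            else:
                buffer += ch         # S251: extend the accumulated string
    return tag_dict

table_dict = build_tag_dictionary('<table orient="PORT" tocentry="1">')
```

Because the scan is literal, an end tag such as “</title>” is registered as its own string "/title" with a separate replacement character.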

Accordingly, when a character string such as the one shown in FIG. 33 is input, it is split at each blank, the replacement characters e, f, and g are chosen for the resulting character strings “table”, “orient="PORT"”, and “tocentry="1"”, respectively, and the correspondences between these character strings and the replacement characters are registered in the tag dictionary 105 one after another.
Similarly, when the character string on line 9 of FIG. 30(A) is input, it is split each time “<” and “>” are recognized, the replacement characters b and c are chosen for the resulting character strings “chapter” and “title”, respectively, and the correspondences between these character strings and the replacement characters are registered in the tag dictionary 105 one after another.

In the compression apparatus shown in FIG. 25, the tag dictionary 105 is registered and created by the tag character string registration unit 104 as described above. It is a table, such as those shown in FIGS. 30(C), 31(E), and 32(C), that maps replacement characters to the character strings in tags, and it is held in memory.

The character string comparison unit 106 compares a character string held in the document storage unit 101 with the registered character strings or word character strings in the dictionaries 105 and 108 to 110, and outputs the replacement character for that character string when it matches a registered character string or word character string. Character strings in tags are compared with the registered character strings in the tag dictionary 105, and character strings in the plaintext are compared with the word character strings in one of the word dictionaries 108 to 110 described later. The XML document is compressed by converting the character strings into the replacement characters (shortened character strings) output by the character string comparison unit 106.

The language identification unit 107 identifies which language the content (plaintext) of the XML document to be compressed is written in, by decoding the character code name described in the encoding declaration of the XML declaration (not shown) and the value of the language identification attribute (xml:lang) in any tag. The language identification unit 107 of this example identifies which of three languages, for example Japanese, Chinese, or English, is used, and selects one of the Japanese dictionary 108, the Chinese dictionary 109, and the English dictionary 110 as the word dictionary for plaintext according to the identification result.
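A hedged Python sketch of this kind of language identification follows. The encoding-to-language table, the precedence of xml:lang over the encoding declaration, and the English default are all assumptions; the text states only that both sources of information are consulted.

```python
import re

# Hypothetical mapping from declared encodings to languages.
ENCODING_LANG = {"shift_jis": "ja", "euc-jp": "ja", "gb2312": "zh", "big5": "zh"}

def identify_language(xml_text, default="en"):
    """Prefer an explicit xml:lang value, then fall back to the encoding
    named in the XML declaration (precedence is an assumption)."""
    match = re.search(r'xml:lang\s*=\s*["\']([A-Za-z-]+)["\']', xml_text)
    if match:
        return match.group(1).split("-")[0].lower()
    match = re.search(r'<\?xml[^>]*encoding\s*=\s*["\']([^"\']+)["\']', xml_text)
    if match:
        return ENCODING_LANG.get(match.group(1).lower(), default)
    return default

lang_by_encoding = identify_language('<?xml version="1.0" encoding="Shift_JIS"?><book/>')
lang_by_attribute = identify_language('<p xml:lang="zh-CN">text</p>')
```

The returned code would then select the Japanese, Chinese, or English word dictionary.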

日本語辞書１０８は、ＸＭＬ文書の平文において記述される言語が日本語の場合に、例えば図３０（Ｄ）に示すように、平文を構成する日本語の単語文字列とそれに対応する単語番号（短縮文字列，置換文字）との対応をとるための対照表であり、事前に構成された既知のものである。ここで、単語番号は、平文内の実際の単語文字列よりも短く、且つ、その単語文字列を特定しうる、固定バイト長の短縮文字列（置換文字）である。
同様に、中国語辞書１０９や英語辞書１１０のいずれも、平文を構成する各国語の単語文字列と単語番号（短縮文字列，置換文字）との対応をとるための対照表である。 The Japanese dictionary 108 is a correspondence table, constructed in advance and already known, that is used when the language of the plaintext of the XML document is Japanese. As shown in FIG. 30D, for example, it associates each Japanese word character string that makes up the plaintext with its word number (shortened character string, replacement character). Here, a word number is a shortened character string (replacement character) of fixed byte length that is shorter than the actual word character string in the plaintext and can identify that word character string.
Similarly, each of the Chinese dictionary 109 and the English dictionary 110 is a comparison table for associating a word character string of each country constituting a plain text with a word number (abbreviated character string, replacement character).

タグ文字列変換部１１１は、タグの文字列を、文字列比較部１０６からの一致信号に応じて、この文字列比較部１０６で付与された置換文字（短縮文字列）に変換する、変換処理を行なうものである。そして、このタグ文字列変換部１１１と上述した文字列比較部１０６とが、第２実施形態の文書実現値文字列置換部４１やＤＴＤ文字列置換部５１に対応した機能を果たすものである。 The tag character string conversion unit 111 performs conversion processing that, in response to a match signal from the character string comparison unit 106, converts the character string of a tag into the replacement character (shortened character string) assigned by the character string comparison unit 106. The tag character string conversion unit 111 and the character string comparison unit 106 described above perform functions corresponding to the document realization value character string replacement unit 41 and the DTD character string replacement unit 51 of the second embodiment.

同様に、単語番号変換部１１２は、平文の文字列を、文字列比較部１０６からの一致信号に応じて、この文字列比較部１０６で付与された単語番号（置換文字，短縮文字列）に変換するものであり、最長一致法により単語を固定バイトの単語番号に変換するものである。 Similarly, the word number conversion unit 112 converts the plain text character string to the word number (replacement character, shortened character string) given by the character string comparison unit 106 in response to a match signal from the character string comparison unit 106. The word is converted into a fixed byte word number by the longest match method.
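The longest-match conversion of plaintext into word numbers can be sketched as follows. This is only an illustration under stated assumptions: the function name `encode_plaintext`, the `("W", number)` / `("C", character)` token layout, and the pass-through of characters that match no dictionary entry are hypothetical, since the description does not say how unmatched text is handled.

```python
def encode_plaintext(text, word_dict):
    """Greedy longest-match conversion of plaintext into word numbers.

    word_dict maps word character strings to word numbers.
    Returns a list of ("W", number) tokens for dictionary words and
    ("C", char) tokens for characters with no match (an assumption).
    """
    max_len = max(map(len, word_dict), default=0)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in word_dict:
                tokens.append(("W", word_dict[candidate]))
                i += n
                break
        else:
            # No dictionary word starts here: keep the character as-is.
            tokens.append(("C", text[i]))
            i += 1
    return tokens
```

With a dictionary containing both “概要” and “概”, the input “XMLの概要” yields the token for “概要” rather than the shorter entry, which is the point of longest matching.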

単語番号ファイル１１３は、タグ文字列変換部１１１からの置換文字変換出力と、単語番号変換部１１２からの単語番号変換出力とにより得られた、例えば図３０（Ｂ）に示すようなＸＭＬ文書の圧縮データを保持するものである。
ＤＴＤ記入部１１４は、図３１（Ａ）に示すごとき、別ファイルのＤＴＤを参照するＸＭＬ文書に対して、新規ファイル名を記入するものである。 The word number file 113 is an XML document such as that shown in FIG. 30B obtained by the replacement character conversion output from the tag character string conversion unit 111 and the word number conversion output from the word number conversion unit 112, for example. Holds compressed data.
As shown in FIG. 31A, the DTD entry unit 114 enters a new file name for an XML document that refers to a DTD of another file.

次に、図２６を参照しながら、第３実施形態の伸長装置について説明する。なお、図中、既述の符号と同一の符号は同一もしくはほぼ同一の部分を示しているので、その説明は省略する。
図２６において、１２０は単語番号ファイル記憶部、１２１は置換文字比較部、１２２はタグ文字列逆変換部、１２３は単語番号逆変換部、１２４は文書記憶部、１２５は旧ＤＴＤ記入部である。 Next, an extension device according to a third embodiment will be described with reference to FIG. In the figure, the same reference numerals as those already described indicate the same or substantially the same parts, and the description thereof will be omitted.
In FIG. 26, 120 is a word number file storage unit, 121 is a replacement character comparison unit, 122 is a tag character string reverse conversion unit, 123 is a word number reverse conversion unit, 124 is a document storage unit, and 125 is an old DTD entry unit.

単語番号ファイル記憶部１２０は、伸長対象である、例えば図３０（Ｃ）に示すような圧縮後のＸＭＬ文書を保持するものである。
置換文字比較部１２１は、単語番号ファイル記憶部１２０に保持された置換文字（単語番号）と、辞書１０５，１０８〜１１０の置換文字（単語番号）とを比較し、これらの置換文字が一致した場合には、その置換文字に対して文字列（単語文字列）を出力するものである。このとき、タグ内の置換文字はタグ辞書１０５に登録された置換文字と比較され、平文部分の単語番号（置換文字）は、単語辞書１０８〜１１０のいずれかにおける単語文字列と比較される。 The word number file storage unit 120 holds the compressed XML document to be decompressed, for example, as shown in FIG. 30C.
The replacement character comparison unit 121 compares the replacement characters (word numbers) held in the word number file storage unit 120 with the replacement characters (word numbers) in the dictionaries 105 and 108 to 110, and these replacement characters match. In this case, a character string (word character string) is output for the replacement character. At this time, the replacement character in the tag is compared with the replacement character registered in the tag dictionary 105, and the word number (replacement character) of the plaintext part is compared with the word character string in any of the word dictionaries 108 to 110.

タグ文字列逆変換部１２２は、圧縮後のＸＭＬ文書における置換文字を、置換文字比較部１２１からの信号に基づき、その置換文字に対応した文字列に逆変換するものである。
同様に、単語番号逆変換部１２３は、圧縮後のＸＭＬ文書における単語番号（置換文字）を、番号比較部１２１からの信号に基づき、その単語番号に対応した単語文字列に逆変換するものである。 The tag character string reverse conversion unit 122 performs reverse conversion of the replacement character in the compressed XML document into a character string corresponding to the replacement character based on a signal from the replacement character comparison unit 121.
Similarly, the word number reverse conversion unit 123 reversely converts a word number (replacement character) in the compressed XML document into the word character string corresponding to that word number, based on the signal from the number comparison unit 121.

文書記憶部１２４は、タグ文字列逆変換部１２２から出力された、置換文字に対応したタグ文字列と、単語番号逆変換部１２３から出力された、単語番号に対応した単語文字列とを記述されたＸＭＬ文書（復元された文書）を保持するものである。
旧ＤＴＤ記入部１２５は、ＸＭＬ文書において、前記圧縮時に記入された新ＤＴＤを旧ＤＴＤに復元するものである。 The document storage unit 124 holds the XML document (restored document) in which are written the tag character strings corresponding to the replacement characters, output from the tag character string reverse conversion unit 122, and the word character strings corresponding to the word numbers, output from the word number reverse conversion unit 123.
The old DTD entry unit 125 restores the new DTD entered during the compression to the old DTD in the XML document.

次に、図２５にて説明した第３実施形態の圧縮装置の動作について、図３０〜図３２を参照しながら、図２８に示すフローチャート（ステップＳ２０１〜Ｓ２１７）に従って説明する。
まず、文書記憶部１０１に、例えばオペレータが作成済みの図３０（Ａ）に示すようなＸＭＬ文書を保持させる。ＤＴＤ条件調査部１０２は、文書記憶部１０１に格納された、圧縮対象のＸＭＬ文書を参照して、そのＸＭＬ文書中に“＜！ＤＯＣＴＹＰＥ”が記述されているか否かをチェックし（ステップＳ２０１）、“＜！ＤＯＣＴＹＰＥ”が記述されている場合（ＹＥＳルート）には、そのＸＭＬ文書はＤＴＤを有するものと判別する。なお、ＤＴＤがないと判別された場合（ＮＯルート）、ステップＳ２０３へ移行する。 Next, the operation of the compression apparatus according to the third embodiment described with reference to FIG. 25 will be described according to the flowchart (steps S201 to S217) illustrated in FIG. 28 with reference to FIGS.
First, an XML document such as that shown in FIG. 30A, created by an operator for example, is held in the document storage unit 101. The DTD condition checking unit 102 refers to the XML document to be compressed stored in the document storage unit 101 and checks whether “<!DOCTYPE” is described in the XML document (step S201). If “<!DOCTYPE” is described (YES route), it determines that the XML document has a DTD. If it determines that there is no DTD (NO route), the process proceeds to step S203.

続いて、ＤＴＤ条件調査部１０２は、ＤＴＤを有するＸＭＬ文書に“［”が記述されているか否かを判別する（ステップＳ２０２）。そして、“［”が記述されている場合（ＹＥＳルート）、ＤＴＤがＸＭＬ文書中に内蔵されているものと判別してステップＳ２０３へ移行する一方、“［”が記述されていない場合（ＮＯルート）、そのＸＭＬ文書についてのＤＴＤは外部ファイルに格納されているものと判別してステップＳ２０９へ移行する。 Subsequently, the DTD condition checking unit 102 determines whether “[” is described in the XML document having the DTD (step S202). If “[” is described (YES route), it determines that the DTD is built into the XML document, and the process proceeds to step S203. If “[” is not described (NO route), it determines that the DTD for the XML document is stored in an external file, and the process proceeds to step S209.
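The two checks of steps S201 and S202 can be sketched in Python as below. The function name `classify_dtd` and its return values are hypothetical. As a deliberate refinement that the description does not state, the sketch looks for “[” only inside the DOCTYPE declaration, so that a “[” elsewhere in the document (for example in a CDATA section) is not mistaken for an internal DTD.

```python
def classify_dtd(xml_text):
    """Classify a document as having no DTD, an internal DTD, or an external DTD."""
    start = xml_text.find("<!DOCTYPE")
    if start < 0:
        return "none"        # step S201, NO route
    bracket = xml_text.find("[", start)
    close = xml_text.find(">", start)
    # Assumption: "[" counts only if it appears before the first ">" that
    # follows "<!DOCTYPE", i.e. inside the declaration itself.
    if bracket != -1 and (close == -1 or bracket < close):
        return "internal"    # step S202, YES route
    return "external"        # step S202, NO route
```

In terms of the flowchart, "none" and "internal" both lead to steps S203 to S208, while "external" leads to steps S209 to S217.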

ＤＴＤを内蔵したＸＭＬ文書に対しては、以下のステップＳ２０３〜Ｓ２０８の処理を施して圧縮を行なう。
まず、タグ文字列登録部１０４により、図２７に示すフローチャートに従ってタグ文字列の置換文字登録（タグ辞書作成）処理を行ない、図３０（Ｃ）に示すようなタグ辞書１０５を作成する（ステップＳ２０３）。 The XML document with the built-in DTD is compressed by performing the following steps S203 to S208.
First, the tag character string registration unit 104 performs tag character string replacement character registration (tag dictionary creation) processing according to the flowchart shown in FIG. 27, and creates a tag dictionary 105 as shown in FIG. 30C (step S203).

続いて、タグ平文識別部１０３が、文書記憶部１０１に保持されたＸＭＬ文書から平文を識別して言語識別部１０７に送出する。このとき、言語識別部１０７は、タグ平文識別部１０３からのＸＭＬ文書のエンコーディング宣言において記述された文字コード名（図示省略）、並びに、任意のタグにおける言語識別用の属性（xml：lang）の値を解読することにより、平文の言語が何語であるかを識別し、その言語に応じた単語辞書、例えば日本語の場合は日本語辞書１０８を選択する（ステップＳ２０４）。 Subsequently, the tag plaintext identifying unit 103 identifies the plaintext in the XML document held in the document storage unit 101 and sends it to the language identification unit 107. At this time, the language identification unit 107 identifies which language the plaintext is written in by decoding the character code name (not shown) described in the encoding declaration of the XML document received from the tag plaintext identifying unit 103 and the value of the language identification attribute (xml:lang) in any tag, and selects the word dictionary corresponding to that language, for example, the Japanese dictionary 108 in the case of Japanese (step S204).

そして、文字列比較部１０６は、タグ辞書１０５の登録文字列と、圧縮対象のＸＭＬ文書におけるタグ内の文字列とを比較し、これらの文字列が一致した場合、その登録文字列に対応した置換文字（短縮文字列）をタグ辞書１０５から読み出し、タグ文字列変換部１１１にて、タグ内の文字列を置換文字に変換する（ステップＳ２０５）。 Then, the character string comparison unit 106 compares the registered character string in the tag dictionary 105 with the character string in the tag in the XML document to be compressed, and if these character strings match, the character string comparison unit 106 corresponds to the registered character string. The replacement character (abbreviated character string) is read from the tag dictionary 105, and the tag character string conversion unit 111 converts the character string in the tag into a replacement character (step S205).
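The tag replacement of step S205 can be illustrated by the following Python sketch. The function name `compress_tags` is hypothetical, and the sketch replaces element names only; attribute names and values, which a tag dictionary might also cover, are left untouched. The mapping of “chapter”, “title”, and “paragraph” to b, c, and d follows the example in the description, while “book” to a is an assumption.

```python
import re

def compress_tags(xml_text, tag_dict):
    """Replace element names in start/end tags by their shortened strings."""
    def replace(match):
        slash, name, rest = match.group(1), match.group(2), match.group(3)
        # Names missing from the dictionary are kept unchanged.
        return "<" + slash + tag_dict.get(name, name) + rest + ">"
    # Matches <name ...> and </name>; declarations such as <?xml ...?> and
    # <!DOCTYPE ...> do not match because they do not start with a letter.
    return re.sub(r"<(/?)([A-Za-z_][\w.-]*)([^<>]*)>", replace, xml_text)
```

Because the XML declaration and the attributes survive unchanged, the output remains a structured document, which is the property the description relies on for later processing.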

ついで、文字列比較部１０６は、図３０（Ｄ）に示すような日本語辞書１０８の登録単語文字列と、圧縮対象のＸＭＬ文書における平文内の単語文字列とを比較し、これらの単語文字列が一致した場合、その登録単語文字列に対応した単語番号（置換文字，短縮文字列）を日本語辞書１０８から読み出し、単語番号変換部１１２にて、平文内の単語文字列を単語番号に変換する（ステップＳ２０６）。 Next, the character string comparison unit 106 compares the registered word character string in the Japanese dictionary 108 as shown in FIG. 30D with the word character string in the plain text in the XML document to be compressed, and these word characters. If the strings match, the word number (replacement character, abbreviated character string) corresponding to the registered word character string is read from the Japanese dictionary 108 and the word character string in the plaintext is converted to the word number by the word number conversion unit 112. Conversion is performed (step S206).

この後、タグ辞書１０５は、後述する逆変換処理つまり伸長・復元の際に必要になるため、図示省略のファイルに出力される（ステップＳ２０７）。
また、ステップＳ２０５において置換文字に変換されたタグと、ステップＳ２０６において単語番号に変換された平文とは、図３０（Ｂ）に示すように圧縮されたＸＭＬ文書として、単語番号ファイル１１３に出力され、その単語番号ファイル１１３は、図示省略の記憶装置等で保持される（ステップＳ２０８）。 Thereafter, the tag dictionary 105 is output to a file (not shown) because it is necessary for reverse conversion processing, that is, decompression / restoration described later (step S207).
Further, the tags converted into replacement characters in step S205 and the plaintext converted into word numbers in step S206 are output to the word number file 113 as a compressed XML document as shown in FIG. 30B, and the word number file 113 is held in a storage device (not shown) or the like (step S208).

一方、外部ファイルに格納されたＤＴＤを用いるＸＭＬ文書に対しては、以下のステップＳ２０９〜Ｓ２１７の処理を施して圧縮を行なう。
ステップＳ２０２において“［”が記述されていないと判別された場合、ＤＴＤ条件調査部１０２は、圧縮対象のＸＭＬ文書についてのＤＴＤを外部参照により認識する必要があると判別し、例えば図３１（Ａ）の２行目に示すごとくシステム識別子によって指定されたファイル名“book.dtd”に基づいて、別ファイルのＤＴＤ〔例えば図３１（Ｂ）参照〕が認識・参照される。そして、そのＤＴＤに対してタグ文字列登録部１０４がタグ文字列登録処理を行ない、図３１（Ｅ）に示すようなタグ辞書１０５が作成される（ステップＳ２０９）。 On the other hand, an XML document using DTD stored in an external file is compressed by performing the following steps S209 to S217.
If it is determined in step S202 that “[” is not described, the DTD condition checking unit 102 determines that the DTD for the XML document to be compressed must be recognized by external reference, and the DTD in the separate file (see, for example, FIG. 31B) is recognized and referenced based on the file name “book.dtd” designated by the system identifier, for example, as shown in the second line of FIG. 31A. Then, the tag character string registration unit 104 performs tag character string registration processing on the DTD, and a tag dictionary 105 as shown in FIG. 31E is created (step S209).

続いて、ステップＳ２０４と同様、タグ平文識別部１０３が、文書記憶部１０１に保持されたＸＭＬ文書から平文を識別して言語識別部１０７に送出する。このとき、言語識別部１０７は、タグ平文識別部１０３からのＸＭＬ文書のエンコーディング宣言において記述された文字コード名（図示省略）を解読することにより、平文で記述される言語が何語であるかを識別し、その言語に応じた単語辞書、例えば日本語の場合は日本語辞書１０８を選択する（ステップＳ２１０）。 Subsequently, as in step S204, the tag plaintext identifying unit 103 identifies the plaintext in the XML document held in the document storage unit 101 and sends it to the language identification unit 107. At this time, the language identification unit 107 identifies which language the plaintext is written in by decoding the character code name (not shown) described in the encoding declaration of the XML document received from the tag plaintext identifying unit 103, and selects the word dictionary corresponding to that language, for example, the Japanese dictionary 108 in the case of Japanese (step S210).

また、ＤＴＤ条件調査部１０２は、ＤＴＤ記入部１１４を制御して、例えば図３１（Ｇ）に示すごとく新旧ＤＴＤ表に新規ファイルのＤＴＤ名「book2.dtd」を記入させるとともに、単語番号変換部１１２による処理に際して、圧縮前の旧ＤＴＤ名「book.dtd」を新規ファイルのＤＴＤ名「book2.dtd」に書換・記入させる（ステップＳ２１１）。 Further, the DTD condition checking unit 102 controls the DTD entry unit 114 so that the DTD name “book2.dtd” of the new file is entered in the old/new DTD table, for example, as shown in FIG. 31G, and so that, during processing by the word number conversion unit 112, the old DTD name “book.dtd” from before compression is rewritten to the DTD name “book2.dtd” of the new file (step S211).
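The bookkeeping of the old/new DTD table can be sketched as follows. The function names `rename_dtd` and `restore_dtd` and the use of a plain dictionary keyed by the new file name are assumptions; the description only states that the table records the correspondence between “book.dtd” and “book2.dtd” and that the name is rewritten on compression and restored on decompression.

```python
def rename_dtd(xml_text, old_name, new_name, table):
    """Record new -> old in the table and rewrite the quoted DTD file name."""
    table[new_name] = old_name
    # Only the first quoted occurrence (the system identifier) is rewritten.
    return xml_text.replace('"' + old_name + '"', '"' + new_name + '"', 1)

def restore_dtd(xml_text, table):
    """Rewrite each new DTD file name back to the old one (decompression side)."""
    for new_name, old_name in table.items():
        xml_text = xml_text.replace('"' + new_name + '"', '"' + old_name + '"', 1)
    return xml_text
```

The table must be saved alongside the compressed document, since the decompression side cannot recover the original file name without it.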

そして、文字列比較部１０６は、タグ辞書１０５の登録文字列と、図３１（Ｂ）に示すような別ファイルのＤＴＤにおけるタグ内の文字列とを比較し、これらの文字列が一致した場合、その登録文字列に対応した置換文字（短縮文字列）をタグ辞書１０５から読み出し、タグ文字列変換部１１１にてタグ内の文字列を置換文字に変換することにより、図３１（Ｄ）に示すような新規ファイルのＤＴＤを圧縮作成する（ステップＳ２１２）。 Then, the character string comparison unit 106 compares the registered character strings in the tag dictionary 105 with the character strings in the tags of the DTD in the separate file as shown in FIG. 31B. When these character strings match, the replacement character (shortened character string) corresponding to the registered character string is read from the tag dictionary 105, and the tag character string conversion unit 111 converts the character string in the tag into the replacement character, whereby a compressed DTD is created as a new file as shown in FIG. 31D (step S212).

また、ステップＳ２０５と同様、文字列比較部１０６は、タグ辞書１０５の登録文字列と、文書記憶部１０１に格納された圧縮対象のＸＭＬ文書におけるタグ内の文字列とを比較し、これらの文字列が一致した場合、その登録文字列に対応した置換文字（短縮文字列）をタグ辞書１０５から読み出し、タグ文字列変換部１１１にて、タグ内の文字列を置換文字に変換する。そして、タグ文字列変換部１１１は、上述のごとくタグ内の文字列を置換文字に変換・圧縮したＸＭＬ文書を単語番号変換部１１２に送出する（ステップＳ２１２）。 Similarly to step S205, the character string comparison unit 106 compares the registered character string in the tag dictionary 105 with the character string in the tag in the compression target XML document stored in the document storage unit 101, and these characters. If the strings match, the replacement character (abbreviated character string) corresponding to the registered character string is read from the tag dictionary 105, and the tag character string conversion unit 111 converts the character string in the tag into a replacement character. Then, the tag character string conversion unit 111 sends the XML document obtained by converting and compressing the character string in the tag to the replacement character as described above to the word number conversion unit 112 (step S212).

ついで、文字列比較部１０６は、図３１（Ｆ）に示すような日本語辞書１０８の登録単語文字列と、タグ文字列変換部１１１からのＸＭＬ文書（タグ内文字列が置換文字に変換された文書）における平文内の単語文字列とを比較し、これらの単語文字列が一致した場合、その登録単語文字列に対応した単語番号（置換文字，短縮文字列）を日本語辞書１０８から読み出し、単語番号変換部１１２にて平文内の単語文字列を単語番号に変換する。このとき、ＤＴＤ記入部１１４により、ＸＭＬ文書中において旧ＤＴＤ名「book.dtd」が新しいＤＴＤ名「book2.dtd」に書き換えられる。このようにして、図３１（Ａ）に示すような圧縮前のＸＭＬ文書は図３１（Ｃ）に示すようなＸＭＬ文書に圧縮される（ステップＳ２１３）。 Next, the character string comparison unit 106 converts the registered word character string in the Japanese dictionary 108 as shown in FIG. 31 (F) and the XML document from the tag character string conversion unit 111 (the character string in the tag is converted into a replacement character). If the word character strings match, the word numbers (replacement characters, shortened character strings) corresponding to the registered word character strings are read from the Japanese dictionary 108. Then, the word number conversion unit 112 converts the word character string in the plain text into a word number. At this time, the DTD entry unit 114 rewrites the old DTD name “book.dtd” into the new DTD name “book2.dtd” in the XML document. In this way, the XML document before compression as shown in FIG. 31A is compressed into an XML document as shown in FIG. 31C (step S213).

この後、タグ辞書１０５は、ステップＳ２０７と同様、後述する逆変換処理つまり伸長・復元の際に必要になるため、図示省略のファイルに出力される（ステップＳ２１４）。
また、ステップＳ２１３において変換された、図３１（Ｃ）に示すような圧縮後のＸＭＬ文書は、単語番号変換部１１２から単語番号ファイル１１３に出力される（ステップＳ２１５）。 After that, the tag dictionary 105 is output to a file (not shown) because it is necessary for reverse conversion processing, that is, decompression / restoration described later, as in step S207 (step S214).
Also, the compressed XML document as shown in FIG. 31C converted in step S213 is output from the word number conversion unit 112 to the word number file 113 (step S215).

さらに、ステップＳ２１２で圧縮作成された新規ファイルＤＴＤは図示省略のファイルに出力されるとともに（ステップＳ２１６）、ステップＳ２１１で作成した、図３１（Ｇ）に示すような新旧ＤＴＤ対応表も、後述する逆変換処理つまり伸長・復元の際に必要になるため、図示省略のファイルに出力される（ステップＳ２１７）。
なお、ＤＴＤをもたないＸＭＬ文書に対しても、前述したステップＳ２０３〜Ｓ２０８の処理が施され、例えば図３２（Ａ）に示すＸＭＬ文書が図３２（Ｂ）に示すごとく圧縮される。 Further, the new file DTD compressed and created in step S212 is output to a file not shown (step S216), and the old and new DTD correspondence table shown in FIG. 31G created in step S211 will also be described later. Since it is necessary for reverse conversion processing, that is, decompression / restoration, it is output to a file not shown (step S217).
Note that the XML document having no DTD is also subjected to the processing in steps S203 to S208 described above, and the XML document shown in FIG. 32A is compressed as shown in FIG. 32B, for example.

次に、図２６にて説明した第３実施形態の伸長装置の動作について、図３０〜図３２を参照しながら、図２９に示すフローチャート（ステップＳ２２１〜Ｓ２３２）に従って説明する。
まず、図示省略のファイルからタグ辞書１０５を取り出し、そのタグ辞書１０５を図示省略のメモリに格納する（ステップＳ２２１）。 Next, the operation of the decompression apparatus according to the third embodiment described with reference to FIG. 26 will be described according to the flowchart (steps S221 to S232) illustrated in FIG. 29 with reference to FIGS.
First, the tag dictionary 105 is extracted from a file (not shown), and the tag dictionary 105 is stored in a memory (not shown) (step S221).

そして、そのタグ辞書１０５に対応する、圧縮ＸＭＬ文書の単語番号ファイル１１３を記憶部１２０に格納する（ステップＳ２２２）。
ステップＳ２２１およびＳ２２２の処理により、図３０（Ｃ）に示すタグ辞書に対しては図３０（Ｂ）に示すごとく圧縮されたＸＭＬ文書が記憶部１２０に格納され、図３１（Ｅ）に示すタグ辞書に対しては図３１（Ｃ）に示すごとく圧縮されたＸＭＬ文書が記憶部１２０に格納され、図３２（Ｃ）に示すタグ辞書に対しては図３２（Ｂ）に示すごとく圧縮されたＸＭＬ文書が記憶部１２０格納される。 Then, the compressed XML document word number file 113 corresponding to the tag dictionary 105 is stored in the storage unit 120 (step S222).
Through the processing in steps S221 and S222, the XML document compressed as shown in FIG. 30B is stored in the storage unit 120 for the tag dictionary shown in FIG. 30C, and the tag shown in FIG. The XML document compressed as shown in FIG. 31C is stored in the storage unit 120 for the dictionary, and the tag dictionary shown in FIG. 32C is compressed as shown in FIG. 32B. An XML document is stored in the storage unit 120.

この後、ＤＴＤ条件調査部１０２は、記憶部１２０から伸長復元対象のＸＭＬ文書を読み出し、まず、そのＸＭＬ文書に“＜！ＤＯＣＴＹＰＥ”が記述されているか否かをチェックする（ステップＳ２２３）。
“＜！ＤＯＣＴＹＰＥ”が記述されている場合（ＹＥＳルート）、ＤＴＤ条件調査部１０２は、さらに“［”が記述されているか否かをチェックし（ステップＳ２２４）、“［”が記述されている場合（ＹＥＳルート）、伸長復元対象のＸＭＬ文書にはＤＴＤが内蔵されているものと判断する。 Thereafter, the DTD condition checking unit 102 reads the XML document to be decompressed and restored from the storage unit 120, and first checks whether or not “<! DOCTYPE” is described in the XML document (step S223).
When “<! DOCTYPE” is described (YES route), the DTD condition examining unit 102 further checks whether “[” is described (step S224), and “[” is described. In the case (YES route), it is determined that the XML document to be decompressed / restored contains DTD.

このように伸長復元対象のＸＭＬ文書がＤＴＤを内蔵している場合、言語識別部１０７は、任意のタグ中の言語識別用の属性（xml：lang）の値からそのＸＭＬ文書の平文の言語を識別し、予め作成されている単語辞書群の中から、その識別結果に応じた単語辞書、例えば図３１（Ｄ）に示すような日本語辞書１０８を選択する（ステップＳ２２５）。 When the XML document to be decompressed and restored has a built-in DTD in this way, the language identification unit 107 identifies the language of the plaintext of the XML document from the value of the language identification attribute (xml:lang) in any tag, and selects, from the group of word dictionaries prepared in advance, the word dictionary corresponding to the identification result, for example, the Japanese dictionary 108 as shown in FIG. 31D (step S225).

そして、タグ平文識別部１０３は、圧縮されたＸＭＬ文書のタグ部分を識別し、タグ辞書１０５の置換文字と記憶部１２０から出力したタグ部分の置換文字とを置換文字比較部１２１で比較・照合させ、これらの置換文字が一致した場合、その置換文字に対応するタグ内文字列をタグ辞書１０５から読み出してタグ文字列逆変換部１２２に送出し、このタグ文字列変換部１２２において、圧縮されたＸＭＬ文書の置換文字をタグ内文字列に変換する、タグ文字列逆変換処理が行なわれる（ステップＳ２２６）。これにより、例えば図３０（Ｂ）の置換文字ｂ，ｃ，ｄが、図３０（Ａ）に示すごとく、それぞれ“chapter”，“title”，“paragraph”に変換される。 The tag plaintext identifying unit 103 then identifies the tag portions of the compressed XML document and has the replacement character comparison unit 121 compare and collate the replacement characters in the tag dictionary 105 with the replacement characters of the tag portions output from the storage unit 120. When these replacement characters match, the in-tag character string corresponding to the replacement character is read from the tag dictionary 105 and sent to the tag character string reverse conversion unit 122, and this tag character string reverse conversion unit 122 performs tag character string reverse conversion processing that converts the replacement character in the compressed XML document into the in-tag character string (step S226). Thereby, for example, the replacement characters b, c, and d in FIG. 30B are converted into “chapter”, “title”, and “paragraph”, respectively, as shown in FIG. 30A.

ついで、タグ平文識別部１０３は、圧縮されたＸＭＬ文書の平文部分を識別し、その平文部分の単語番号（置換文字）を、置換文字比較部１２１において、日本語辞書１０８の登録単語番号（登録置換番号）と比較し、これらの単語番号が一致した場合、その単語番号に対応する単語文字列を日本語辞書１０８から読み出して単語番号逆変換部１２３に送出し、この単語番号逆変換部１２３において、圧縮されたＸＭＬ文書の単語番号を単語文字列に変換する、単語番号逆変換処理が行なわれる（ステップＳ２２７）。これにより、例えば図３０（Ｂ）の単語番号α，β，γ，δが、図３０（Ａ）に示すごとく、それぞれ“ＸＭＬ”，“の”，“概要”，“とは”に変換される。 Next, the tag plaintext identifying unit 103 identifies the plaintext portion of the compressed XML document, and the replacement character comparison unit 121 compares each word number (replacement character) in the plaintext portion with the registered word numbers (registered replacement numbers) in the Japanese dictionary 108. When these word numbers match, the word character string corresponding to the word number is read from the Japanese dictionary 108 and sent to the word number reverse conversion unit 123, and this word number reverse conversion unit 123 performs word number reverse conversion processing that converts the word number in the compressed XML document into the word character string (step S227). Thereby, for example, the word numbers α, β, γ, and δ in FIG. 30B are converted into “ＸＭＬ”, “の”, “概要”, and “とは”, respectively, as shown in FIG. 30A.
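The word number reverse conversion amounts to looking each number up in the inverse of the word dictionary. The sketch below is an illustration only: the function names `invert` and `decode_plaintext` and the `("W", number)` / `("C", character)` token layout are hypothetical, and the one-to-one check reflects the requirement that a word number must identify its word character string.

```python
def invert(dictionary):
    """Build the number -> string table used for reverse conversion."""
    inverse = {number: word for word, number in dictionary.items()}
    if len(inverse) != len(dictionary):
        # Two words sharing one number could not be restored unambiguously.
        raise ValueError("dictionary is not one-to-one")
    return inverse

def decode_plaintext(tokens, word_dict):
    """Restore plaintext from ("W", number) and ("C", char) tokens."""
    by_number = invert(word_dict)
    return "".join(
        by_number[value] if kind == "W" else value
        for kind, value in tokens
    )
```

The same inversion applies to the tag dictionary 105, which is why both dictionaries have to be available, unmodified, on the decompression side.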

ステップＳ２２６およびＳ２２７でそれぞれ逆変換されたタグ文字列や平文は、例えば図３０（Ａ）に示すような圧縮前のＸＭＬ文書、即ち復元文書となり、文書記憶部１２４に保持される。
一方、ステップＳ２２４で伸長復元対象のＸＭＬ文書に“［”が記述されていないと判別された場合（ＮＯルート）、ＤＴＤ条件調査部１０２は、そのＸＭＬ文書についてのＤＴＤは外部ファイルに格納されているものと判別し、そのＸＭＬ文書の圧縮処理時に作成・保存された、例えば図３２（Ｇ）に示すような新旧ＤＴＤ対応表を入力する（ステップＳ２２８）。 The tag character string and plaintext that have been inversely converted in steps S226 and S227, respectively, become an uncompressed XML document as shown in FIG. 30A, that is, a restored document, and is held in the document storage unit 124.
On the other hand, if it is determined in step S224 that “[” is not described in the XML document to be decompressed and restored (NO route), the DTD condition checking unit 102 determines that the DTD for the XML document is stored in an external file, and inputs the old/new DTD correspondence table, for example, as shown in FIG. 32G, that was created and saved when the XML document was compressed (step S228).

また、言語識別部１０７は、ステップＳ２２５と同様、タグ中の言語識別用の属性（xml：lang）の値からそのＸＭＬ文書の平文の言語を識別し、予め作成されている単語辞書群の中から、その識別結果に応じた単語辞書、例えば図３２（Ｆ）に示すような日本語辞書１０８を選択する（ステップＳ２２９）。 Also, as in step S225, the language identification unit 107 identifies the language of the plaintext of the XML document from the value of the language identification attribute (xml:lang) in a tag, and selects, from the group of word dictionaries prepared in advance, the word dictionary corresponding to the identification result, for example, the Japanese dictionary 108 as shown in FIG. 32F (step S229).

さらに、ＤＴＤ条件調査部１０２は、旧ＤＴＤ記入部１２５を制御して、Ｓ２２８で読み出した新旧ＤＴＤ対応表から、ＤＴＤの元のファイル名が“book.dtd”であることを認識させるとともに、例えば図３１（Ｃ）に示すような伸長復元対象のＸＭＬ文書において、“book2.dtd”と記述されているＤＴＤ名を旧ＤＴＤ名“book.dtd”に書換・記入させる（ステップＳ２３０）。 Further, the DTD condition investigation unit 102 controls the old DTD entry unit 125 to recognize that the original file name of the DTD is “book.dtd” from the old and new DTD correspondence table read out in S228. In the XML document to be decompressed and restored as shown in FIG. 31C, the DTD name described as “book2.dtd” is rewritten and entered into the old DTD name “book.dtd” (step S230).

そして、ステップＳ２２６と同様に、タグ平文識別部１０３は、圧縮されたＸＭＬ文書のタグ部分を識別し、タグ辞書１０５の置換文字と記憶部１２０から出力したタグ部分の置換文字とを置換文字比較部１２１で比較・照合させ、これらの置換文字が一致した場合、その置換文字に対応するタグ内文字列をタグ辞書１０５から読み出してタグ文字列逆変換部１２２に送出し、このタグ文字列変換部１２２において、圧縮されたＸＭＬ文書の置換文字をタグ内文字列に変換する、タグ文字列逆変換処理が行なわれる（ステップＳ２３１）。 Then, as in step S226, the tag plaintext identifying unit 103 identifies the tag portions of the compressed XML document and has the replacement character comparison unit 121 compare and collate the replacement characters in the tag dictionary 105 with the replacement characters of the tag portions output from the storage unit 120. When these replacement characters match, the in-tag character string corresponding to the replacement character is read from the tag dictionary 105 and sent to the tag character string reverse conversion unit 122, and this tag character string reverse conversion unit 122 performs tag character string reverse conversion processing that converts the replacement character in the compressed XML document into the in-tag character string (step S231).

これにより、例えば図３１（Ｃ）の置換文字ｂ，ｃ，ｄが、図３１（Ａ）に示すごとく、それぞれ“chapter”，“title”，“paragraph”に変換され、その変換結果は、単語番号変換部１２３に出力される。
なお、このとき、ステップＳ２３０の処理により、ＸＭＬ文書中に記述されたＤＴＤ名は、ファイル名“book2.dtd”から旧ファイル名“book.dtd”に変換されている。 Thus, for example, the replacement characters b, c, and d in FIG. 31C are converted into “chapter”, “title”, and “paragraph”, respectively, as shown in FIG. 31A, and the conversion result is output to the word number reverse conversion unit 123.
At this time, the DTD name described in the XML document is converted from the file name “book2.dtd” to the old file name “book.dtd” by the process of step S230.

ついで、ステップＳ２２７と同様に、タグ平文識別部１０３は、圧縮されたＸＭＬ文書の平文部分を識別し、その平文部分の単語番号（置換文字）を、置換文字比較部１２１において、日本語辞書１０８の登録単語番号（登録置換番号）と比較し、これらの単語番号が一致した場合、その単語番号に対応する単語文字列を日本語辞書１０８から読み出して単語番号逆変換部１２３に送出し、単語番号逆変換部１２３において、圧縮されたＸＭＬ文書の単語番号を単語文字列に変換する、単語番号逆変換処理が行なわれる（ステップＳ２３２）。 Next, as in step S227, the tag plaintext identifying unit 103 identifies the plaintext portion of the compressed XML document, and the replacement character comparison unit 121 compares each word number (replacement character) in the plaintext portion with the registered word numbers (registered replacement numbers) in the Japanese dictionary 108. When these word numbers match, the word character string corresponding to the word number is read from the Japanese dictionary 108 and sent to the word number reverse conversion unit 123, and the word number reverse conversion unit 123 performs word number reverse conversion processing that converts the word number in the compressed XML document into the word character string (step S232).

これにより、例えば図３１（Ｃ）の単語番号α，β，γ，δが、図３１（Ａ）に示すごとく、それぞれ“ＸＭＬ”,“の”,“概要”,“とは”に変換される。
このようにして逆変換されたタグ文字列や平文は、例えば図３１（Ａ）に示すような圧縮前のＸＭＬ文書、即ち復元文書となり、文書記憶部１２４に保持される。 As a result, for example, the word numbers α, β, γ, and δ in FIG. 31C are converted into “ＸＭＬ”, “の”, “概要”, and “とは”, respectively, as shown in FIG. 31A.
The tag character string or plaintext reversely converted in this way becomes, for example, an XML document before compression as shown in FIG. 31A, that is, a restored document, and is held in the document storage unit 124.

なお、ステップＳ２２３において伸長復元対象のＸＭＬ文書に“＜！ＤＯＣＴＹＰＥ”が記述されていないと判別された場合（ＮＯルート）、そのＸＭＬ文書はＤＴＤをもたないものと判別され、ステップＳ２２５へ移行し、そのＸＭＬ文書に対しても、前述したステップＳ２２３〜Ｓ２２７の処理が施され、例えば図３２（Ａ）に示すＸＭＬ文書が図３２（Ｂ）に示すごとく伸長・復元されて文書記憶部１２４に保持される。 If it is determined in step S223 that “<!DOCTYPE” is not described in the XML document to be decompressed and restored (NO route), the XML document is determined to have no DTD and the process proceeds to step S225. The XML document is likewise subjected to the processing of steps S223 to S227 described above and, for example, the XML document shown in FIG. 32A is decompressed and restored as shown in FIG. 32B and held in the document storage unit 124.

このように、上述した本発明の第３実施形態では、つまり文字列の置換文字（短縮文字列）への変換処理を行なうことによりＸＭＬ文書の圧縮を行ない、その際に用いられたタグ辞書１０５や各国語毎の単語辞書１０８〜１１０は、いずれも圧縮されることなく保存されるので、ＸＭＬ文書を、伸長することなく圧縮した状態のままで検索することができ、例えば検索時間が０．１秒を超えてはいけない場合（システム）に用いて好適である。 Thus, in the third embodiment of the present invention described above, the XML document is compressed by converting character strings into replacement characters (shortened character strings), and the tag dictionary 105 and the per-language word dictionaries 108 to 110 used for that purpose are all stored without being compressed. The XML document can therefore be searched in its compressed state without being decompressed, which makes the embodiment suitable, for example, for cases (systems) in which the search time must not exceed 0.1 second.
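One way to search without decompression, sketched here under assumptions, is to convert the query words into word numbers with the same uncompressed dictionary and compare numbers directly. The function name `search_compressed` and the representation of the compressed plaintext as a flat list of word numbers are hypothetical; the description does not specify the search procedure itself.

```python
def search_compressed(numbers, query_words, word_dict):
    """Return start positions where the query occurs in a word-number sequence.

    numbers     : compressed plaintext as a list of word numbers
    query_words : query already split into dictionary words
    word_dict   : the same word -> number dictionary used for compression
    """
    try:
        query = [word_dict[word] for word in query_words]
    except KeyError:
        # Assumption: a query word absent from the dictionary cannot occur
        # as a word number, so this sketch reports no hits.
        return []
    n = len(query)
    if n == 0:
        return []
    return [i for i in range(len(numbers) - n + 1) if numbers[i:i + n] == query]
```

Comparing fixed-length numbers avoids restoring any text, which is consistent with the search-time motivation stated above; a production search would also need to handle text that was stored without a word number.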

次に、本発明の第３実施形態の変形例としての圧縮装置および伸長装置の構成および動作について、図３４〜図３７を参照しながら説明する。
ここで、図３４は本発明の第３実施形態の変形例としての構造化文書の圧縮装置の機能構成を示すブロック図、図３５は本発明の第３実施形態の変形例としての構造化文書の伸長装置の機能構成を示すブロック図、図３６は第３実施形態の変形例における構造化文書の圧縮手順を説明するためのフローチャート、図３７は第４実施形態の変形例における構造化文書の伸長手順を説明するためのフローチャートである。 Next, configurations and operations of a compression apparatus and an expansion apparatus as modifications of the third embodiment of the present invention will be described with reference to FIGS. 34 to 37.
Here, FIG. 34 is a block diagram showing the functional configuration of a structured document compression apparatus as a modification of the third embodiment of the present invention, FIG. 35 is a block diagram showing the functional configuration of a structured document decompression apparatus as a modification of the third embodiment of the present invention, FIG. 36 is a flowchart for explaining the structured document compression procedure in the modification of the third embodiment, and FIG. 37 is a flowchart for explaining the structured document decompression procedure in the modification of the fourth embodiment.

まず、図３４を参照しながら、第３実施形態の変形例としての圧縮装置について説明すると、この図３４において、１３１は文書記憶部、１３２はタグ平文識別部、１３３はタグ文字列登録部、１３４はタグ辞書、１３５は文字列比較部、１３６は言語識別部、１３７は日本語辞書、１３８は中国語辞書、１３９は英語辞書、１４０はタグ文字列変換部、１４１は単語番号変換部、１４２は可変長符号化部、１４３は圧縮ファイル記憶部である。 First, a compression apparatus as a modified example of the third embodiment will be described with reference to FIG. 34. In FIG. 34, 131 is a document storage unit, 132 is a tag plaintext identification unit, 133 is a tag character string registration unit, 134 is a tag dictionary, 135 is a character string comparison unit, 136 is a language identification unit, 137 is a Japanese dictionary, 138 is a Chinese dictionary, 139 is an English dictionary, 140 is a tag character string conversion unit, 141 is a word number conversion unit, Reference numeral 142 denotes a variable length encoding unit, and 143 denotes a compressed file storage unit.

文書記憶部１３１は、圧縮すべきＸＭＬ文書を入力保持するもので、例えば図３０（Ａ）に示すような圧縮前のＸＭＬ文書が保持されるメモリであり、図２５に示す文書記憶部１０１に対応するものである。
タグ平文識別部１３２は、ＸＭＬ文書の注目文字列がタグが平文かを識別するものであり、図２５におけるタグ平文識別部１０３に対応するものであり、このタグ平文識別部１０３と同様に動作する。 The document storage unit 131 holds an XML document to be compressed. For example, the document storage unit 131 is a memory that stores an XML document before compression as shown in FIG. Corresponding.
The tag plaintext identifying unit 132 identifies whether the character string of interest in the XML document is a tag or plaintext. It corresponds to the tag plaintext identifying unit 103 in FIG. 25 and operates in the same manner as the tag plaintext identifying unit 103.

The tag character string registration unit 133 creates the tag dictionary 134 used to convert character strings in tags into replacement characters (shortened character strings). It corresponds to the tag character string registration unit 104 in FIG. 25 and performs the tag dictionary registration operation according to the procedure described above with reference to FIG. 27.
The tag dictionary 134 is created by the tag character string registration unit 133. It is a table associating replacement characters with character strings in tags, as shown for example in FIG. 30(C), FIG. 31(E), and FIG. 32(C), and is held in memory. The tag dictionary 134 corresponds to the tag dictionary 105 in FIG. 25.

The character string comparison unit 135 compares a character string held in the document storage unit 131 with the registered character strings or word character strings in the dictionaries 134 and 137 to 139 and, when the character string matches a registered character string or word character string, outputs the replacement character for that character string. Character strings in tags are compared with the registered character strings in the tag dictionary 134, and character strings in the plaintext part are compared with the word character strings in one of the word dictionaries 137 to 139 described later. The XML document is compressed by converting each character string into the replacement character (shortened character string) output from the character string comparison unit 135. The character string comparison unit 135 corresponds to the character string comparison unit 106 in FIG. 25.

The language identification unit 136 identifies the language in which the content (plaintext) of the XML document to be compressed is written, by reading either the character code name described in the encoding declaration (not shown) of the XML declaration or the value of the language identification attribute (xml:lang) of any tag. The language identification unit 136 shown in FIG. 34 likewise identifies which of, for example, Japanese, Chinese, and English the language is and, according to the identification result, selects one of the Japanese dictionary 137, the Chinese dictionary 138, and the English dictionary 139 as the word dictionary for plaintext. It corresponds to the language identification unit 107 in FIG. 25.
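The selection logic of the language identification unit can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `identify_language`, the table `ENCODING_TO_LANG`, the priority given to xml:lang over the encoding declaration, and the fallback to English are all assumptions made for illustration.

```python
import re

# Hypothetical table mapping character code names to a language label.
ENCODING_TO_LANG = {
    "shift_jis": "ja",
    "euc-jp": "ja",
    "gb2312": "zh",
    "big5": "zh",
    "us-ascii": "en",
}

def identify_language(xml_text, default="en"):
    """Guess the plaintext language from xml:lang or the encoding declaration."""
    # Assumption: an explicit xml:lang attribute on any tag takes priority.
    m = re.search(r'xml:lang\s*=\s*["\']([A-Za-z]+)', xml_text)
    if m:
        return m.group(1).lower()
    # Otherwise fall back to the character code name in the XML declaration.
    m = re.search(r'<\?xml[^>]*encoding\s*=\s*["\']([^"\']+)["\']', xml_text)
    if m:
        return ENCODING_TO_LANG.get(m.group(1).lower(), default)
    return default
```

The returned label would then be used to pick one of the prepared word dictionaries (Japanese, Chinese, or English in the example of FIG. 34).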

The Japanese dictionary 137 is used when the language of the plaintext of the XML document is Japanese. As shown for example in FIG. 30(D), it is a table associating the Japanese word character strings that make up the plaintext with their word numbers (shortened character strings, replacement characters). It is a known dictionary prepared in advance and corresponds to the Japanese dictionary 108 in FIG. 25.

Similarly, the Chinese dictionary 138 and the English dictionary 139 are each a table associating the word character strings of the respective language that make up the plaintext with word numbers (shortened character strings, replacement characters). They are known dictionaries prepared in advance and correspond to the Chinese dictionary 109 and the English dictionary 110 in FIG. 25.
The tag character string conversion unit 140 converts a character string of a tag into the replacement character (shortened character string) assigned by the character string comparison unit 135, in response to a match signal from the character string comparison unit 135. It corresponds to the tag character string conversion unit 111 in FIG. 25.

Similarly, the word number conversion unit 141 converts a plaintext character string into the word number (replacement character, shortened character string) assigned by the character string comparison unit 135, in response to a match signal from the character string comparison unit 135. It converts words into fixed-byte word numbers by the longest match method and corresponds to the word number conversion unit 112 in FIG. 25.
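As an illustration of the longest match method mentioned here, the following Python sketch replaces dictionary words with word numbers, always preferring the longest registered word at the current position. The function name `encode_plaintext`, the use of Python integers in place of fixed-byte word numbers, and the handling of unregistered characters as literals are assumptions, not details given in the patent.

```python
def encode_plaintext(text, word_dict):
    """Greedy longest-match replacement of dictionary words by word numbers.

    word_dict maps word character strings to word numbers (integers here,
    standing in for the fixed-byte word numbers of the description).
    """
    max_len = max(map(len, word_dict), default=0)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            number = word_dict.get(text[i:i + length])
            if number is not None:
                out.append(number)
                i += length
                break
        else:
            # Assumption: a character with no dictionary entry is kept as a literal.
            out.append(text[i])
            i += 1
    return out
```

With a dictionary containing both "structure" and "structured", the input "structured" is encoded as the single number of the longer entry rather than as "structure" followed by a literal "d".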

The variable length encoding unit 142 takes the conversion result of the tag character string conversion unit 140 and the word number conversion unit 141 (the character string obtained by replacing the character strings described in tags and in the DTD with replacement characters and replacing the word character strings in the plaintext with word numbers), that is, the compressed file, applies variable length encoding to it by a well-known method, and outputs the variable length encoded compressed file together with the tag dictionary.
The compressed file storage unit 143 holds the tag dictionary and the variable length encoded compressed file output from the variable length encoding unit 142.

Next, the decompression apparatus as a modification of the third embodiment will be described with reference to FIG. 35. In the figure, reference numerals identical to those already described denote identical or substantially identical parts, and their description is omitted.
In FIG. 35, 151 is a compressed file storage unit, 152 is a variable length decoding unit, 153 is a replacement character comparison unit, 154 is a tag character string reverse conversion unit, 155 is a word number reverse conversion unit, and 156 is a document storage unit.

The compressed file storage unit 151 receives the compressed file that has been variable length encoded by the compression apparatus shown in FIG. 34. It may be the compressed file storage unit 143 shown in FIG. 34 itself, or it may be configured to hold the compressed file output from the compressed file storage unit 143.
The variable length decoding unit 152 decodes the compressed file read from the compressed file storage unit 151 from its variable length encoded state into a compressed file such as that shown in FIG. 30(B).

The replacement character comparison unit 153 compares the replacement characters (word numbers) in the compressed file decoded by the variable length decoding unit 152 with the replacement characters (word numbers) in the dictionaries 134 and 137 to 139 and, when they match, outputs the character string (word character string) for that replacement character. Replacement characters in tags are compared with the replacement characters registered in the tag dictionary 134, and word numbers (replacement characters) in the plaintext part are compared against the entries in one of the word dictionaries 137 to 139. The replacement character comparison unit 153 corresponds to the replacement character comparison unit 121 in FIG. 26.

The tag character string reverse conversion unit 154 converts each replacement character in the compressed file (the compressed XML document) decoded by the variable length decoding unit 152 back into the character string corresponding to that replacement character, based on the signal from the replacement character comparison unit 153. It corresponds to the tag character string reverse conversion unit 122 in FIG. 26.
Similarly, the word number reverse conversion unit 155 converts each word number in the compressed file (the compressed XML document) decoded by the variable length decoding unit 152 back into the word character string corresponding to that word number, based on the signal from the replacement character comparison unit 153. It corresponds to the word number reverse conversion unit 123 in FIG. 26.

The document storage unit 156 holds the XML document (the restored document) in which the tag character strings corresponding to the replacement characters, output from the tag character string reverse conversion unit 154, and the word character strings corresponding to the word numbers, output from the word number reverse conversion unit 155, are described. It corresponds to the document storage unit 124 in FIG. 26.

Next, the operation of the compression apparatus described with reference to FIG. 34 will be described according to the flowchart shown in FIG. 36 (steps S261 to S267), with reference to FIG. 30.
First, an XML document such as that shown in FIG. 30(A), for example one already created by an operator, is held in the document storage unit 131. The tag plaintext identification unit 132 then extracts the tag portions from the XML document held in the document storage unit 131 and, based on those tag portions, the tag character string registration unit 133 performs the replacement character registration (tag dictionary creation) processing for tag character strings according to the flowchart shown in FIG. 27, creating a tag dictionary 134 such as that shown in FIG. 30(C) (step S261).

Subsequently, the tag plaintext identification unit 132 identifies the plaintext in the XML document held in the document storage unit 131 and sends it to the language identification unit 136. The language identification unit 136 reads the character code name (not shown) described in the encoding declaration of the XML document received from the tag plaintext identification unit 132, or the value of the language identification attribute (xml:lang) of any tag, thereby identifies the language of the plaintext, and selects the word dictionary for that language, for example the Japanese dictionary 137 in the case of Japanese (step S262).

The character string comparison unit 135 then compares the registered character strings in the tag dictionary 134 with the character strings in the tags of the XML document to be compressed. When they match, the replacement character (shortened character string) corresponding to the registered character string is read from the tag dictionary 134, and the tag character string conversion unit 140 converts the character string in the tag into the replacement character (step S263).

Next, the character string comparison unit 135 compares the registered word character strings in the Japanese dictionary 137, such as that shown in FIG. 30(D), with the word character strings in the plaintext of the XML document to be compressed. When they match, the word number (replacement character, shortened character string) corresponding to the registered word character string is read from the Japanese dictionary 137, and the word number conversion unit 141 converts the word character string in the plaintext into the word number (step S264).

Thereafter, the conversion result of the tag character string conversion unit 140 and the word number conversion unit 141 (the compressed file: an XML document such as that shown in FIG. 30(B)) is variable length encoded by a well-known method in the variable length encoding unit 142 (step S265).
Because the tag dictionary 134 is needed for the reverse conversion processing described later, that is, for decompression and restoration, it is output to a file (not shown) (step S266).
Furthermore, the variable length encoding unit 142 outputs the compressed file that was variable length encoded in step S265 to the compressed file storage unit 143 (step S267).
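The tag-related steps of this procedure (tag dictionary creation in step S261 and tag string replacement in step S263) can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the procedure of FIG. 27 itself: the names `TAG_RE`, `build_tag_dictionary`, and `compress_tags` are hypothetical, replacement strings are simply hexadecimal counters, regions starting with "<!" or "<?" are left untouched, and runs of whitespace inside a tag are normalized to a single space.

```python
import re

# Matches "<...>" tags but skips regions that start with "<!" or "<?".
TAG_RE = re.compile(r"<(?![!?])([^<>]*)>")

def build_tag_dictionary(xml_text):
    """Register each whitespace-separated string found inside a tag (cf. step S261)."""
    tag_dict = {}
    for match in TAG_RE.finditer(xml_text):
        for token in match.group(1).split():
            if token not in tag_dict:
                # Hypothetical code scheme: a hexadecimal counter as the short string.
                tag_dict[token] = format(len(tag_dict), "x")
    return tag_dict

def compress_tags(xml_text, tag_dict):
    """Replace every registered string inside a tag by its short string (cf. step S263)."""
    def replace(match):
        return "<" + " ".join(tag_dict[t] for t in match.group(1).split()) + ">"
    return TAG_RE.sub(replace, xml_text)
```

Because only the strings between "<" and ">" are replaced, the tag delimiters survive in the output, which is what keeps the compressed document searchable by tag.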

Next, the operation of the decompression apparatus described with reference to FIG. 35 will be described according to the flowchart shown in FIG. 37 (steps S271 to S276), with reference to FIG. 30.
First, the tag dictionary 134 is read from a file (not shown) and stored in a memory (not shown) (step S271).
Then, the compressed file corresponding to that tag dictionary 134, encoded by the variable length encoding unit 142 in FIG. 34, is input to and stored in the compressed file storage unit 151 (step S272).

Thereafter, the variable length decoding unit 152 decodes the compressed file input to the compressed file storage unit 151 and restores it to an XML document such as that shown in FIG. 30(B) (step S273).
The language identification unit 136 identifies the plaintext language of the XML document from a language identification tag (not shown) and selects, from the group of word dictionaries prepared in advance, the word dictionary matching the identification result, for example the Japanese dictionary 137 shown in FIG. 30(D) (step S274).

The tag plaintext identification unit 132 then identifies the plaintext portions of the compressed XML document, and the replacement character comparison unit 153 compares the word numbers (replacement characters) in those plaintext portions with the registered word numbers (registered replacement numbers) of the Japanese dictionary 137. When they match, the word character string corresponding to the word number is read from the Japanese dictionary 137 and sent to the word number reverse conversion unit 155, which converts the word number in the compressed XML document into the word character string (step S275).

Next, the tag plaintext identification unit 132 identifies the tag portions of the compressed XML document, and the replacement character comparison unit 153 compares the replacement characters of the tag dictionary 134 with the replacement characters in the tag portions. When they match, the in-tag character string corresponding to the replacement character is read from the tag dictionary 134 and sent to the tag character string reverse conversion unit 154, which converts the replacement character in the compressed XML document into the in-tag character string (step S276). In this way, the XML document is restored as shown in FIG. 30(A) and held in the document storage unit 156.
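The two reverse conversions (steps S275 and S276) amount to looking up each word number or replacement character in the inverted dictionary. The Python sketch below illustrates this under assumptions: the names `restore_words` and `decompress_tags` are hypothetical, word numbers are represented as integers mixed with literal characters, and tags are assumed to contain replacement strings separated by single spaces.

```python
import re

# Matches "<...>" tags but skips regions that start with "<!" or "<?".
TAG_RE = re.compile(r"<(?![!?])([^<>]*)>")

def restore_words(symbols, word_dict):
    """Convert word numbers back into word character strings (cf. step S275)."""
    inverse = {number: word for word, number in word_dict.items()}
    # Integers are word numbers; anything else is passed through as a literal.
    return "".join(inverse[s] if isinstance(s, int) else s for s in symbols)

def decompress_tags(compressed_text, tag_dict):
    """Convert replacement strings inside tags back into tag strings (cf. step S276)."""
    inverse = {short: original for original, short in tag_dict.items()}
    def restore(match):
        return "<" + " ".join(inverse[s] for s in match.group(1).split()) + ">"
    return TAG_RE.sub(restore, compressed_text)
```

Both functions rely on the dictionaries being one-to-one, which is why the tag dictionary has to be stored alongside the compressed file.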

According to the third embodiment of the present invention, the compression ratio and the search speed are as follows.
(i) With the method that searches without decompression, that is, when the search time must not exceed 0.1 seconds, the compression ratios of tags and plaintext are as follows.
For tags, if the length of the character string in a tag is n bytes,
the original length of the tag is (n + 2) bytes,
the length of the compressed tag is 3 (= 1 + 2) bytes,
and the compression ratio is 3 / (n + 2). For n = 3, this is about 0.6.
For plaintext, the compression ratio is about 0.3, because most words are 4 or 6 bytes long.
As for the search speed, the search speed of the original search system (for example, 0.08 seconds on average) is maintained.

(ii) With the method that decompresses before searching, that is, when the search time may exceed 0.1 seconds, the compression ratios of tags and plaintext are as follows.
For tags, if the length of the character string in a tag is n bytes,
the original length of the tag is (n + 2) bytes,
the length of the compressed tag is 2 bytes (fixed-byte code),
and the compression ratio is 2 / (n + 2). For n = 3, this is about 0.4. The compression ratio is about the same when the tags are compressed in the zero-order context by variable length encoding.
For plaintext, the compression ratio is about 0.4 when words are converted into word numbers with a static dictionary and compressed in the zero-order context.
The search speed is increased by 0.02 seconds to 0.03 seconds with respect to the search speed of the search system (e.g., 0.08 seconds on average).
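The tag compression ratios quoted for methods (i) and (ii) follow directly from the stated lengths. The short Python check below reproduces the arithmetic; the function name `tag_compression_ratio` is hypothetical, and reading the "+ 2" as the two delimiter bytes "<" and ">" is an assumption, since the text gives only the total of (n + 2) bytes.

```python
def tag_compression_ratio(n, compressed_len):
    """Compressed length divided by the original tag length of (n + 2) bytes."""
    return compressed_len / (n + 2)

# Method (i): compressed tag of 3 bytes, n = 3.
print(tag_compression_ratio(3, 3))  # 3 / 5 = 0.6
# Method (ii): compressed tag of 2 bytes (fixed-byte code), n = 3.
print(tag_compression_ratio(3, 2))  # 2 / 5 = 0.4
```

For longer tag strings the ratio improves further, because the compressed length stays fixed while (n + 2) grows.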

Incidentally, in the technique disclosed in Japanese Patent Laid-Open No. 11-53349, the tag symbols “<” and “>” and the character string sandwiched between them are encoded together. Unlike the third embodiment, which encodes the character string itself, that technique cannot encode strings such as “(chapter)” and “chapter” shown in the third to sixth lines of FIG. 30(A), so the amount of compression it achieves is smaller than that of the third embodiment.

According to the third embodiment of the present invention, the following advantages are obtained.
A. The tag dictionaries 105 and 134 are created by the tag character string registration units 104 and 133, and the XML document is compressed by converting the tag character strings on the basis of those tag dictionaries 105 and 134. This not only raises the compression ratio but also allows the tags to be searched while still compressed.

B. As described with reference to FIG. 27 and FIG. 33, when tag character strings are registered in the tag dictionary, the character string within the tag area enclosed by “<” and “>” (excluding the comment area enclosed by “<!” and “>”) is split each time a blank character appears, and each piece is associated with a replacement character. The tag portion can therefore be compressed accurately at a high compression ratio. In addition, long tags can be split into short strings, which keeps the tag dictionary small and easy to search, and the tags remain searchable while compressed.

C. Long tag portions can be split into short strings, which not only makes them easier to search but also allows them to be compressed further, raising the compression ratio.
D. The language of the plaintext is recognized from the character code name in the XML document or from the value of the language identification attribute of any tag, and the word dictionary for that language is selected. XML documents written in various languages, such as Japanese, Chinese, and English, can therefore be handled accurately.

E. The plaintext portion is compared with the word character strings registered in the word dictionary of each language, and the word number conversion units 112 and 141 convert the parts that match a registered word character string into fixed-length byte numbers (replacement characters, shortened character strings), so the plaintext portion can be compressed substantially.
F. The tag portion is compared with the registered character strings of the tag dictionaries 105 and 134, and the tag character string conversion units 111 and 140 convert the parts that match a registered character string into fixed-length byte replacement characters (shortened character strings), so the tag portion can be compressed substantially.

G. The word dictionary for the relevant language is selected according to the language identification symbol of the XML document, so the XML document can be restored while supporting multiple languages.
H. During decompression and restoration, the replacement character comparison units 121 and 153 compare the replacement characters in tags with the registered replacement characters of the tag dictionaries 105 and 134, and the tag character string reverse conversion units 122 and 154 convert the replacement characters that match a registered replacement character back into in-tag character strings, so the tags can be restored accurately.

I. During decompression and restoration, the replacement character comparison units 121 and 153 compare the word numbers (replacement characters) in the plaintext with the registered word numbers of each language, and the word number reverse conversion units 123 and 155 convert the word numbers that match a registered word number back into word character strings, so the plaintext word numbers can be restored accurately in each language.
J. The variable length encoding unit 142 applies variable length encoding to the character string obtained after the replacement processing (replacement characters and word numbers), which further raises the compression ratio of the XML document.
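The description leaves the variable length encoding of advantage J to "a well-known method". As one possible illustration only, the Python sketch below builds a Huffman code from zero-order symbol frequencies over the stream of replacement characters and word numbers. Huffman coding is a technique chosen here for the example, not one named in the patent, and the names `huffman_code` and `encode` are hypothetical.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from zero-order frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Prefix the two lightest subtrees with "0" and "1" and merge them.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

def encode(symbols, code):
    """Concatenate the code words of all symbols into one bit string."""
    return "".join(code[s] for s in symbols)
```

Frequent replacement characters receive short code words, so the fixed-byte output of the dictionary stage shrinks further, at the cost of having to decode before searching.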

[4] Others
The present invention is not limited to the embodiments described above and can be implemented with various modifications without departing from the spirit of the present invention.
For example, the embodiments described above deal with the case where the structured document is an XML document, but the present invention is not limited to this. It applies equally to other structured documents, such as HTML documents and SGML documents, and provides the same operation and effects as the embodiments described above.

本発明の第１実施形態としての構造化文書の圧縮装置の機能構成を示すブロック図である。It is a block diagram which shows the function structure of the compression apparatus of the structured document as 1st Embodiment of this invention. （Ａ）〜（Ｄ）はいずれも第１実施形態における構造化文書（文書実現値）の圧縮原理を説明するための図である。(A)-(D) are the figures for demonstrating the compression principle of the structured document (document realization value) in 1st Embodiment. （Ａ）〜（Ｃ）はいずれも第１実施形態における構造化文書（文書実現値）の圧縮原理を説明するための図である。(A)-(C) are the figures for demonstrating the compression principle of the structured document (document realization value) in 1st Embodiment. （Ａ）および（Ｂ）はいずれも第１実施形態における構造化文書（ＤＴＤ）の圧縮原理を説明するための図である。(A) And (B) is a figure for demonstrating the compression principle of the structured document (DTD) in 1st Embodiment. 第１実施形態における構造化文書の圧縮手順を説明するためのフローチャートである。It is a flowchart for demonstrating the compression procedure of the structured document in 1st Embodiment. 第１実施形態における文書実現値解析手順を説明するためのフローチャートである。It is a flowchart for demonstrating the document realization value analysis procedure in 1st Embodiment. 第１実施形態における文書型定義解析手順を説明するためのフローチャートである。It is a flowchart for demonstrating the document type definition analysis procedure in 1st Embodiment. 第１実施形態における文書実現値構成変更手順を説明するためのフローチャートである。It is a flowchart for demonstrating the document realization value structure change procedure in 1st Embodiment. 第１実施形態における文書型定義構成変更手順を説明するためのフローチャートである。It is a flowchart for demonstrating the document type definition structure change procedure in 1st Embodiment. （Ａ）および（Ｂ）はいずれも第１実施形態による構造化文書（ＸＭＬ文書）の具体的な圧縮処理（第１例）を説明するための図である。(A) And (B) is a figure for demonstrating the specific compression process (1st example) of the structured document (XML document) by 1st Embodiment. （Ａ）および（Ｂ）はいずれも第１実施形態による構造化文書（ＸＭＬ文書）の具体的な圧縮処理（第２例）を説明するための図である。(A) And (B) is a figure for demonstrating the specific compression process (2nd example) of the structured document (XML document) by 1st Embodiment. 
（Ａ）および（Ｂ）はいずれも第１実施形態による構造化文書（ＸＭＬ文書）の具体的な圧縮処理（第３例）を説明するための図である。(A) and (B) are diagrams for explaining a specific compression process (third example) of a structured document (XML document) according to the first embodiment. （Ａ）〜（Ｄ）はいずれも第１実施形態による構造化文書（ＸＭＬ文書）の具体的な圧縮処理（第４例）を説明するための図である。(A)-(D) is a figure for demonstrating the specific compression process (4th example) of the structured document (XML document) by 1st Embodiment. 本発明の第２実施形態としての構造化文書の圧縮装置の機能構成を示すブロック図である。It is a block diagram which shows the function structure of the compression apparatus of the structured document as 2nd Embodiment of this invention. （Ａ）〜（Ｄ）はいずれも第２実施形態における構造化文書の圧縮原理を説明するための図である。(A)-(D) are the figures for demonstrating the compression principle of the structured document in 2nd Embodiment. 第２実施形態における構造化文書の圧縮手順を説明するためのフローチャートである。It is a flowchart for demonstrating the compression procedure of the structured document in 2nd Embodiment. 第２実施形態における文書実現値文字列置換手順を説明するためのフローチャートである。It is a flowchart for demonstrating the document realization value character string replacement procedure in 2nd Embodiment. 第２実施形態における文書型定義文字列置換手順を説明するためのフローチャートである。It is a flowchart for demonstrating the document type definition character string replacement | exchange procedure in 2nd Embodiment. 第２実施形態における構造化文書の伸長手順を説明するためのフローチャートである。It is a flowchart for demonstrating the expansion | extension procedure of the structured document in 2nd Embodiment. （Ａ）〜（Ｃ）はいずれも第２実施形態による構造化文書（ＸＭＬ文書）の具体的な圧縮処理（第１例）を説明するための図である。(A)-(C) are the figures for demonstrating the concrete compression process (1st example) of the structured document (XML document) by 2nd Embodiment. （Ａ）〜（Ｃ）はいずれも第２実施形態による構造化文書（ＸＭＬ文書）の具体的な圧縮処理（第２例）を説明するための図である。(A)-(C) are the figures for demonstrating the concrete compression process (2nd example) of the structured document (XML document) by 2nd Embodiment. 
（Ａ）〜（Ｃ）はいずれも第２実施形態による構造化文書（ＸＭＬ文書）の具体的な圧縮処理（第３例）を説明するための図である。(A)-(C) are the figures for demonstrating the concrete compression process (3rd example) of the structured document (XML document) by 2nd Embodiment. （Ａ）〜（Ｄ）はいずれも第２実施形態による構造化文書（ＸＭＬ文書）の具体的な圧縮処理（第４例）を説明するための図である。(A)-(D) is a figure for demonstrating the concrete compression process (4th example) of the structured document (XML document) by 2nd Embodiment. （Ａ）〜（Ｇ）はいずれも第２実施形態による構造化文書（ＸＭＬ文書）の具体的な圧縮処理（第５例）を説明するための図である。(A)-(G) are figures for demonstrating the concrete compression process (5th example) of the structured document (XML document) by 2nd Embodiment. 本発明の第３実施形態としての構造化文書の圧縮装置の機能構成を示すブロック図である。It is a block diagram which shows the function structure of the compression apparatus of the structured document as 3rd Embodiment of this invention. 本発明の第３実施形態としての構造化文書の伸長装置の機能構成を示すブロック図である。It is a block diagram which shows the function structure of the expansion | extension apparatus of the structured document as 3rd Embodiment of this invention. 第３実施形態におけるタグ辞書作成手順（タグ辞書登録手順）を説明するためのフローチャートである。It is a flowchart for demonstrating the tag dictionary creation procedure (tag dictionary registration procedure) in 3rd Embodiment. 第３実施形態における構造化文書の圧縮手順を説明するためのフローチャートである。It is a flowchart for demonstrating the compression procedure of the structured document in 3rd Embodiment. 第３実施形態における構造化文書の伸長手順を説明するためのフローチャートである。It is a flowchart for demonstrating the expansion | extension procedure of the structured document in 3rd Embodiment. （Ａ）〜（Ｄ）はいずれも第３実施形態による構造化文書（ＸＭＬ文書）の具体的な圧縮処理（第１例）を説明するための図である。(A)-(D) is a figure for demonstrating the concrete compression process (1st example) of the structured document (XML document) by 3rd Embodiment. （Ａ）〜（Ｇ）はいずれも第３実施形態による構造化文書（ＸＭＬ文書）の具体的な圧縮処理（第２例）を説明するための図である。(A)-(G) are the figures for demonstrating the concrete compression process (2nd example) of the structured document (XML document) by 3rd Embodiment. 
(A) to (D) are diagrams for explaining a specific compression process (third example) for a structured document (XML document) according to the third embodiment.
A diagram for explaining the structured document compression technique in the third embodiment.
A block diagram showing the functional configuration of a structured document compression apparatus as a modification of the third embodiment of the present invention.
A block diagram showing the functional configuration of a structured document decompression apparatus as a modification of the third embodiment of the present invention.
A flowchart for explaining the structured document compression procedure in the modification of the third embodiment.
A flowchart for explaining the structured document decompression procedure in the modification of the fourth embodiment.
(A) to (C) are diagrams for explaining how tags are generally written in a structured document (XML document).
A diagram for explaining how attributes are generally written in the tags of a structured document (XML document).
A diagram for explaining the processing performed by a general XML processor.

Explanation of symbols

10 Document storage unit
20 Document realization value analysis unit
30 DTD analysis unit (document type definition analysis unit)
40 Document realization value configuration change unit
41 Document realization value character string replacement unit
50 DTD configuration change unit (document type definition configuration change unit)
51 DTD character string replacement unit (document type definition character string replacement unit)
60 New DTD file creation unit
70 Old/new DTD correspondence table output unit
80 Tag dictionary creation unit
90 Tag dictionary
100 External file
101, 124, 131, 156 Document storage unit
102 DTD condition checking unit
103, 132 Tag/plaintext identification unit (document realization value analysis unit)
104, 133 Tag character string registration unit (document realization value analysis unit, document type definition analysis unit, tag dictionary creation unit)
105, 134 Tag dictionary
106, 135 Character string comparison unit (document realization value character string replacement unit, document type definition character string replacement unit)
107, 136 Language identification unit
108, 137 Japanese dictionary (word dictionary)
109, 138 Chinese dictionary (word dictionary)
110, 139 English dictionary (word dictionary)
111, 140 Tag character string conversion unit (document realization value character string replacement unit, document type definition character string replacement unit)
112, 141 Word number conversion unit (word character string replacement unit)
113 Word number file
114 DTD entry unit
120 Word number file storage unit
121, 153 Replacement character comparison unit
122, 154 Tag character string reverse conversion unit
123, 155 Word number reverse conversion unit
125 Old DTD entry unit
142 Variable-length encoding unit
143, 151 Compressed file storage unit
152 Variable-length decoding unit

Claims

A method for compressing a structured document, comprising:
a document realization value analysis step of analyzing the description in the tags of the document realization value forming the structured document;
a tag dictionary creation step of creating, in accordance with the analysis result of the document realization value analysis step, a tag dictionary that associates each character string described in a tag of the document realization value with a shortened character string that is shorter than that character string and from which that character string can be identified; and
a document realization value character string replacement step of replacing, using the tag dictionary created in the tag dictionary creation step, each character string described in a tag of the document realization value with the shortened character string corresponding to that character string.
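
The three steps of this claim (analysis, tag dictionary creation, replacement) can be illustrated with a minimal Python sketch. It is not the implementation described in the embodiments: the function names, the regular expression, and the letter-based code scheme (`a`, `b`, ..., `aa`, ...) are all assumptions made for illustration. Only element names are handled, and markup inside comments or CDATA sections is not treated specially.

```python
import re

# Matches the element name in a start, end, or empty-element tag.
# Declarations (<!...>) and processing instructions (<?...?>) do not match.
TAG_NAME = re.compile(r"<(/?)([^\W\d][\w.:\-]*)")

def short_code(index):
    """Hypothetical code scheme: 0 -> 'a', 25 -> 'z', 26 -> 'aa', ..."""
    letters = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("a") + rem) + letters
    return letters

def build_tag_dictionary(xml_text):
    """Analysis and dictionary creation: one short code per distinct tag name."""
    dictionary = {}
    for _, name in TAG_NAME.findall(xml_text):
        if name not in dictionary:
            dictionary[name] = short_code(len(dictionary))
    return dictionary

def replace_tag_names(xml_text, dictionary):
    """Replacement: substitute every tag name with its short code."""
    return TAG_NAME.sub(lambda m: "<" + m.group(1) + dictionary[m.group(2)], xml_text)

def restore_tag_names(compressed, dictionary):
    """Inverse mapping, showing that the original name can be identified."""
    inverse = {code: name for name, code in dictionary.items()}
    return TAG_NAME.sub(lambda m: "<" + m.group(1) + inverse[m.group(2)], compressed)

xml = '<catalog><product id="1"><name>Pen</name></product></catalog>'
tags = build_tag_dictionary(xml)
print(replace_tag_names(xml, tags))  # <a><b id="1"><c>Pen</c></b></a>
```

Because every tag name receives a code and the codes are distinct, the mapping can be inverted as long as the dictionary is kept with the compressed document. The sketch does not guarantee that a code is strictly shorter than a name that is already one character long.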

A method for compressing a structured document, comprising:
a document realization value analysis step of analyzing the description in the tags of the document realization value forming the structured document;
a document type definition analysis step of analyzing the description of the document type definition forming the structured document;
a tag dictionary creation step of creating, in accordance with the analysis results of the document realization value analysis step and the document type definition analysis step, a tag dictionary that associates each character string described in the tags of the document realization value and in the document type definition with a shortened character string that is shorter than that character string and from which that character string can be identified;
a document realization value character string replacement step of replacing, using the tag dictionary created in the tag dictionary creation step, each character string described in a tag of the document realization value with the shortened character string corresponding to that character string; and
a document type definition character string replacement step of replacing, using the tag dictionary created in the tag dictionary creation step, each character string described in the document type definition with the shortened character string corresponding to that character string.
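
The point of this claim is that one dictionary is applied to both the document type definition and the document realization value, so the shortened document still matches the shortened DTD. The following Python sketch illustrates that idea under stated assumptions: the function names and the `e0`, `e1`, ... code scheme are hypothetical, the DTD is held in a separate string, only element names are collected, and an element whose name happens to equal a DTD keyword (for example `CDATA`) would be mishandled.

```python
import re

NAME = r"[^\W\d][\w.:\-]*"
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(" + NAME + ")")
TAG = re.compile(r"<(/?)(" + NAME + ")")
TOKEN = re.compile(NAME)

def build_shared_dictionary(dtd_text, xml_text):
    """Collect element names from the DTD and the instance into one dictionary."""
    names = ELEMENT_DECL.findall(dtd_text) + [n for _, n in TAG.findall(xml_text)]
    dictionary = {}
    for name in names:
        if name not in dictionary:
            dictionary[name] = f"e{len(dictionary)}"  # hypothetical code scheme
    return dictionary

def shorten_dtd(dtd_text, dictionary):
    """Replace every name token in the DTD that has a dictionary entry."""
    return TOKEN.sub(lambda m: dictionary.get(m.group(0), m.group(0)), dtd_text)

def shorten_instance(xml_text, dictionary):
    """Replace the tag names in the document instance with the same codes."""
    return TAG.sub(lambda m: "<" + m.group(1) + dictionary[m.group(2)], xml_text)

dtd = "<!ELEMENT catalog (product*)>\n<!ELEMENT product (#PCDATA)>"
xml = "<catalog><product>Pen</product></catalog>"
shared = build_shared_dictionary(dtd, xml)
print(shorten_dtd(dtd, shared))
print(shorten_instance(xml, shared))  # <e0><e1>Pen</e1></e0>
```

Keywords such as `ELEMENT` and `PCDATA` pass through unchanged only because they have no dictionary entry; a fuller treatment would parse the declarations rather than substitute tokens.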

The method of compressing a structured document according to claim 2, wherein the element names and attribute names described in the tags or in the document type definition are treated as the character strings, and the element names and attribute names are replaced with the shortened character strings.

The method of compressing a structured document according to claim 1, further comprising a word character string replacement step of replacing, using a word dictionary that associates each word character string with a shortened character string that is shorter than the word character string and from which the word character string can be identified, each word character string included in the content of the document realization value with the shortened character string corresponding to that word character string.
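
As a rough illustration of the word character string replacement step, the Python sketch below replaces dictionary words in the text between tags with a word number. Everything specific in it is an assumption: the `~` escape character (assumed not to occur in the document), the function names, and the restriction to ASCII words. The language identification and the Japanese, Chinese, and English dictionaries listed in the explanation of symbols are not modeled.

```python
import re

SPLIT = re.compile(r"(<[^>]*>)")    # tags are captured and kept verbatim
WORD = re.compile(r"\b[A-Za-z]+\b")
MARK = "~"                          # hypothetical escape character

def replace_words(xml_text, word_dictionary):
    """Replace words in the content with MARK + word number when that is shorter."""
    if MARK in xml_text:
        raise ValueError("escape character already occurs in the document")

    def shorten(match):
        word = match.group(0)
        number = word_dictionary.get(word)
        if number is None:
            return word
        code = f"{MARK}{number}"
        return code if len(code) < len(word) else word

    parts = SPLIT.split(xml_text)
    # With one capture group, re.split returns the tags at odd indices.
    return "".join(p if i % 2 else WORD.sub(shorten, p) for i, p in enumerate(parts))

def restore_words(compressed, word_dictionary):
    """Map each MARK + number back to its word, leaving the tags untouched."""
    inverse = {str(n): w for w, n in word_dictionary.items()}
    parts = SPLIT.split(compressed)
    return "".join(
        p if i % 2 else re.sub(MARK + r"(\d+)", lambda m: inverse[m.group(1)], p)
        for i, p in enumerate(parts)
    )

words = {"structured": 1, "document": 2, "a": 3}
print(replace_words("<p>a structured document</p>", words))  # <p>a ~1 ~2</p>
```

The word `a` is left alone because its code `~3` would be longer than the word itself, which keeps the sketch consistent with the requirement that the shortened string be shorter.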

The structured document compression method according to claim 4, further comprising a variable-length encoding step of compressing the character strings by variable-length encoding after the character strings described in the tags or in the document type definition have been replaced with the shortened character strings and the word character strings have been replaced with the shortened character strings.
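
The claim does not name a particular variable-length code. Purely as an illustration, the Python sketch below assumes Huffman coding over the characters of the already-shortened text; the function names are hypothetical, and the output is a string of `0`/`1` characters rather than packed bytes.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return {symbol: bitstring}; frequent symbols get shorter bitstrings."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text, code):
    return "".join(code[s] for s in text)

def decode(bits, code):
    """Decode by accumulating bits until they match a codeword (prefix-free)."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

shortened = "<a><b>~1 ~2</b></a>"   # e.g. output of the earlier replacement steps
code = huffman_code(shortened)
bits = encode(shortened, code)
assert decode(bits, code) == shortened
```

Running the variable-length coder after the tag and word replacements means it operates on a text whose alphabet is small and whose repeated names have already been collapsed.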

An apparatus for compressing a structured document, comprising:
a document realization value analysis unit that analyzes the description in the tags of the document realization value forming the structured document;
a tag dictionary creation unit that creates, in accordance with the analysis result of the document realization value analysis unit, a tag dictionary that associates each character string described in a tag of the document realization value with a shortened character string that is shorter than that character string and from which that character string can be identified; and
a document realization value character string replacement unit that replaces, using the tag dictionary created by the tag dictionary creation unit, each character string described in a tag of the document realization value with the shortened character string corresponding to that character string.

An apparatus for compressing a structured document, comprising:
a document realization value analysis unit that analyzes the description in the tags of the document realization value forming the structured document;
a document type definition analysis unit that analyzes the description of the document type definition forming the structured document;
a tag dictionary creation unit that creates, in accordance with the analysis results of the document realization value analysis unit and the document type definition analysis unit, a tag dictionary that associates each character string described in the tags of the document realization value and in the document type definition with a shortened character string that is shorter than that character string and from which that character string can be identified;
a document realization value character string replacement unit that replaces, using the tag dictionary created by the tag dictionary creation unit, each character string described in a tag of the document realization value with the shortened character string corresponding to that character string; and
a document type definition character string replacement unit that replaces, using the tag dictionary created by the tag dictionary creation unit, each character string described in the document type definition with the shortened character string corresponding to that character string.

The structured document compression apparatus according to claim 7, wherein the element names and attribute names described in the tags or in the document type definition are treated as the character strings, and the element names and attribute names are replaced with the shortened character strings.

A computer-readable recording medium storing a structured document compression program that causes a computer to realize a function of compressing a structured document, wherein the structured document compression program causes the computer to function as:
a document realization value analysis unit that analyzes the description in the tags of the document realization value forming the structured document;
a tag dictionary creation unit that creates, in accordance with the analysis result of the document realization value analysis unit, a tag dictionary that associates each character string described in a tag of the document realization value with a shortened character string that is shorter than that character string and from which that character string can be identified; and
a document realization value character string replacement unit that replaces, using the tag dictionary created by the tag dictionary creation unit, each character string described in a tag of the document realization value with the shortened character string corresponding to that character string.

A computer-readable recording medium storing a structured document compression program that causes a computer to realize a function of compressing a structured document, wherein the structured document compression program causes the computer to function as:
a document realization value analysis unit that analyzes the description in the tags of the document realization value forming the structured document;
a document type definition analysis unit that analyzes the description of the document type definition forming the structured document;
a tag dictionary creation unit that creates, in accordance with the analysis results of the document realization value analysis unit and the document type definition analysis unit, a tag dictionary that associates each character string described in the tags of the document realization value and in the document type definition with a shortened character string that is shorter than that character string and from which that character string can be identified;
a document realization value character string replacement unit that replaces, using the tag dictionary created by the tag dictionary creation unit, each character string described in a tag of the document realization value with the shortened character string corresponding to that character string; and
a document type definition character string replacement unit that replaces, using the tag dictionary created by the tag dictionary creation unit, each character string described in the document type definition with the shortened character string corresponding to that character string.

The computer-readable recording medium storing the structured document compression program according to claim 9 or 10, wherein the structured document compression program causes the computer to treat the element names and attribute names described in the tags or in the document type definition as the character strings and to replace the element names and attribute names with the shortened character strings.