JP2003044459A

JP2003044459A - Method for compressing and exchanging structured data

Info

Publication number: JP2003044459A
Application number: JP2001235046A
Authority: JP
Inventors: Mutsumi Komuro; 睦小室
Original assignee: Hitachi Software Engineering Co Ltd
Current assignee: Hitachi Software Engineering Co Ltd
Priority date: 2001-08-02
Filing date: 2001-08-02
Publication date: 2003-02-14

Abstract

PROBLEM TO BE SOLVED: To provide a data compression and exchange method that uses the structure information of structured data to reduce the amount of data and, at the same time, encrypt the data before it is exchanged. SOLUTION: A compression (encryption) module uses syntax designation information supplied in advance to separate the internal representation of the structured data into structure information and content, and then compresses (encrypts) both. The compressed (encrypted) data is delivered from the transmitting-side system to the receiving-side system over a network. A decompression (decryption) module uses the syntax designation information to restore the received compressed (encrypted) data to the internal representation of the structured data.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a compression method and a data exchange method for systems that use structured data, for example in data exchange, application integration, system integration, and data storage.

[0002]

[Prior Art] Various lossless compression methods for text data are known, such as run-length encoding, Huffman coding, and LZ (Ziv-Lempel) coding. These are general-purpose compression methods for text data. When a structured document is in text form, such general-purpose methods can be applied to it. In that case, however, the compression makes no effective use of the fact that the document is structured.

[0003] Meanwhile, data exchange using structured data such as XML is becoming widespread. Such data is basically in text format.

[0004]

[Problems to be Solved by the Invention] As described above, structured data such as XML is in text format, so it is becoming common to compress it with the general-purpose text compression methods mentioned above before exchanging it. However, because structured data is basically text-format data, the following problems arise.

[0005]

(1) Adding structure information as tags increases the amount of data.
(2) Because tags and content are readable as text, the data can be stolen by eavesdropping or similar means.
(3) Using the exchanged data requires parsing (lexical analysis, syntax analysis, and so on), which can become a processing overhead.

[0006] An object of the present invention is to provide a compression method and a data exchange method for structured data that use the structure information of the structured data to reduce the amount of data and, at the same time, allow the data to be exchanged in encrypted form.

[0007]

[Means for Solving the Problems] To achieve the above object, the invention according to claim 1 is a data compression method for structured data, characterized in that each grammar rule in syntax designation information, which contains a plurality of grammar rules defining the structure of the structured data to be processed, is made identifiable by identification information, and the syntax tree representing the structure of the structured data is expressed using the identification information that identifies those grammar rules, thereby separating the structure information of the structured data from its content information.

[0008] The invention according to claim 2 is the data compression method for structured data according to claim 1, characterized in that the content information separated from the structured data is further classified according to data type.

[0009] The invention according to claim 3 is the data compression method for structured data according to claim 1, characterized in that the positions at which the content data included in the content information appear are represented by variables.

[0010] The invention according to claim 4 is a data exchange method for structured data, characterized in that the data compression method according to any one of claims 1 to 3 is used to separate the structured data to be exchanged into structure information and content information, the separated structure information and content information are each compressed by a predetermined compression method or encrypted by a predetermined encryption method, and the compressed or encrypted structure information and content information are transmitted.

[0011] The invention according to claim 5 is a data compression method for structured data, comprising: a step of holding, in storage means, syntax designation information that contains a plurality of grammar rules defining the structure of the structured data to be processed, such that each grammar rule can be identified by identification information; and a step of determining the grammar rules that represent the structure of the structured data to be processed, generating structure information by arranging the identification information of all the determined grammar rules in order, storing in content information the content data of those grammar rules that are accompanied by content data, and attaching to the corresponding identification information in the structure information an indicator showing that content data accompanies it.

[0012] The invention according to claim 6 is a data compression or encryption method for structured data, comprising: a step of holding, in storage means, syntax designation information that contains a plurality of grammar rules defining the structure of the structured data to be processed, such that each grammar rule can be identified by identification information; a step of determining the grammar rules that represent the structure of the structured data to be processed, generating structure information by arranging the identification information of all the determined grammar rules in order, storing in content information the content data of those grammar rules that are accompanied by content data, and attaching to the corresponding identification information in the structure information an indicator showing that content data accompanies it; and a step of compressing or encrypting the structure information and the content information.

[0013] The invention according to claim 7 is a data decompression or decryption method for decompressing or decrypting structure information and content information that have been compressed or encrypted by the data compression or encryption method according to claim 6, comprising: a step of decompressing or decrypting the compressed or encrypted structure information and content information; a step of extracting the identification information of grammar rules from the decompressed or decrypted structure information and obtaining, by referring to the syntax designation information, the grammar rules corresponding to the extracted identification information; and a step of restoring the structured data by reconstructing the data structure represented by the obtained grammar rules and, for each grammar rule that is accompanied by content data, extracting the corresponding content data from the content information and setting it at the corresponding position in the structured data.

[0014]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings.

[0015] FIGS. 1 and 2 show the basic system configuration and processing procedure according to one embodiment of the present invention.

[0016] FIG. 1 shows the system configuration and processing procedure for compressing or encrypting structured data. For compression or encryption, the lexical/syntax analysis module (103) first analyzes the input structured data (101) on the basis of syntax designation information (102) supplied in advance, and extracts the partial syntax tree sequence (104) that appears in the structured data (101) together with the content (105) that appears in it. These are passed to the syntax compression (encryption) module (106) and the content compression (encryption) module (107), respectively. The syntax compression (encryption) module (106) compresses or encrypts the partial syntax tree sequence (104), and the content compression (encryption) module (107) compresses or encrypts the content (105). The results from the two modules are combined and output as compressed (encrypted) data (108).
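
The final combining step of FIG. 1 can be sketched as follows. This is only an illustration under stated assumptions: the container layout (a 4-byte length prefix), the function name `pack_streams`, and the use of zlib for both streams are hypothetical choices, not something the publication specifies. The two input strings stand for an already separated partial syntax tree sequence and content stream.

```python
import struct
import zlib


def pack_streams(structure: str, contents: str) -> bytes:
    """Compress the two streams separately and join them in one container.

    Assumed layout: 4-byte big-endian length of the compressed structure
    stream, then that stream, then the compressed content stream.
    """
    packed_structure = zlib.compress(structure.encode("utf-8"))
    packed_contents = zlib.compress(contents.encode("utf-8"))
    header = struct.pack(">I", len(packed_structure))
    return header + packed_structure + packed_contents


# Hypothetical streams: rule numbers with "p" markers, and delimited text.
blob = pack_streams("1,2,3,4p,5p", "John>Smith>")
```

Keeping the two streams in separate compressed segments lets each one be handled by a method suited to it, which is the point of having modules (106) and (107) side by side.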

[0017] FIG. 2 shows the system configuration and processing procedure for decompressing or decrypting compressed (encrypted) data. Decompression or decryption is the reverse of the processing shown in FIG. 1: the compressed (encrypted) data (201) is received as input, and the corresponding portions of it are passed to the partial syntax tree decompression (decryption) module (202) and the content decompression (decryption) module (203), respectively. The partial syntax tree decompression (decryption) module (202) decompresses (decrypts) the compressed (encrypted) partial syntax tree sequence on the basis of the syntax designation information and outputs the partial syntax tree sequence (204). The content decompression (decryption) module (203) decompresses (decrypts) the compressed (encrypted) content and outputs the content (205). Finally, the composition module (206) fits the content (205) into the partial syntax tree sequence (204) to obtain the original structured data (207).
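
The step of handing "the corresponding portions" of the compressed data to modules (202) and (203) might look like the sketch below. It assumes a hypothetical container in which a 4-byte big-endian length precedes the zlib-compressed structure stream and the zlib-compressed content stream follows. The publication does not define a container format, so the layout and the name `unpack_streams` are illustrative only.

```python
import struct
import zlib


def unpack_streams(blob: bytes):
    """Split the assumed container and decompress both streams."""
    (structure_len,) = struct.unpack(">I", blob[:4])
    structure = zlib.decompress(blob[4:4 + structure_len]).decode("utf-8")
    contents = zlib.decompress(blob[4 + structure_len:]).decode("utf-8")
    return structure, contents


# Build a sample container by hand so the sketch is self-contained.
sample_structure = zlib.compress(b"1,2,3,4p,5p")
sample_contents = zlib.compress(b"John>Smith>")
sample_blob = struct.pack(">I", len(sample_structure)) + sample_structure + sample_contents

structure, contents = unpack_streams(sample_blob)
```

The composition module (206) would then take `structure` and `contents` and rebuild the document.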

[0018] FIG. 3 shows the system configuration and processing procedure for exchanging structured data using the systems shown in FIGS. 1 and 2. First, the syntax designation information (301) and (302) for the structured data to be transmitted is shared in advance among the data transmitting-side system (303), the compression (encryption) module (304), the decompression (decryption) module (305), and the data receiving-side system (306).

[0019] In the system shown in this figure, two copies of the same syntax designation information, (301) and (302), are prepared. The figure illustrates the case where the data transmitting-side system (303) and the compression (encryption) module (304) use syntax designation information (301), while the decompression (decryption) module (305) and the lexical/syntax analysis module (313) in the data receiving-side system (306) use syntax designation information (302).

[0020] The data transmitting-side system (303) parses the given input data on the basis of the syntax designation information (301) and holds it as internal data (307), for example a syntax tree. The structured data creation module (308) outputs text-format structured data (309) from this internal data (307). The compression (encryption) module (304) receives the structured data (309), separates it into structure information and content as described for FIG. 1, compresses (encrypts) each to produce compressed (encrypted) data (310), and transmits it to the receiving side via the network (311).

[0021] On the receiving side, the decompression (decryption) module (305) decompresses (decrypts) the received compressed (encrypted) data (310) as described for FIG. 2, restores it to the original structured data (312), and passes it to the data receiving-side system (306). The lexical/syntax analysis module (313) in the data receiving-side system (306) converts the structured data (312) into internal data (314), after which data processing proceeds.

[0022] In this system configuration, the structured data (309) is compressed when it is exchanged, so the amount of communication data passing through the network (311) is reduced.

[0023] FIG. 4 shows an example system configuration that improves on the data exchange system of FIG. 3 by incorporating the parsing process, thereby eliminating the overhead of the data compression (encryption) and decompression (decryption) processing. In the data transmitting-side system (401) of FIG. 4, the compression (encryption) module (304) is built directly into the data transmitting-side system (303) in place of the structured data creation module (308) of FIG. 3. Likewise, in the data receiving-side system (402) of FIG. 4, the decompression (decryption) module (305) is built directly into the data receiving-side system (306) in place of the lexical/syntax analysis module (313) of FIG. 3.

[0024] With this system configuration, the internal representation of the data is turned into compressed (encrypted) data without first being converted into structured data, so the overhead that arose in the data exchange system of FIG. 3 is eliminated. Moreover, beyond merely eliminating that overhead, performance can actually be better than when structured data is sent directly without using the compression (encryption) module (304) and decompression (decryption) module (305) at all. Indeed, because the compressed (encrypted) data (403) transmitted in FIG. 4 already contains the syntax information, the data receiving-side system (402) can obtain internal data (404), such as a syntax tree, without performing lexical or syntax analysis.

[0025] FIGS. 5 and 6 show a concrete example of compression and encryption.

[0026] In FIG. 5, (5.1) is syntax designation information (102, 301, 302, etc. in FIGS. 1 to 4) consisting of a plurality of syntax definitions for creating an address book, written in the definition format for XML documents known as a DTD. (5.2) is XML structured data that uses this DTD (101, 207, 309, 312 in FIGS. 1 to 4) and is assumed to be the result of a search of an address book database. For later reference, the lines of the syntax designation information (5.1) are numbered 1 to 15. Each number identifies one line of the syntax designation information (5.1); the numbers are not part of the syntax designation information itself. As the structured data (5.2) shows, more than half of the data consists of tags that define the syntax, and similar syntax appears repeatedly.

[0027] FIG. 6 shows the result of separating the structured data (5.2) into syntax (104, 204 in FIGS. 1 and 2) and content (105, 205 in FIGS. 1 and 2). (6.1) is the separated syntax information. It records which of the 15 grammar rules numbered in the syntax designation information (5.1) were used to parse the structured data (5.2), together with whether each contains content in the form of PCDATA (ordinary character string data) or CDATA (character string data that may include special characters). That is, in the syntax (6.1), a rule number is suffixed with the variable p when it contains PCDATA and with the variable c when it contains CDATA.

[0028] For example, the leading "1, 2, 3" of the syntax (6.1) indicates that the structure information based on the definition rules on lines 1 to 3 of the syntax designation information (5.1) comes first (specifically, from <address-book> on the first line of (5.2) to <name> on the third line). The following "4p" indicates that structure information based on the definition rule on line 4 of the syntax designation information (5.1) comes next and contains PCDATA. This corresponds to the part <firstname>John</firstname> of (5.2).
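
The separation into a rule-number sequence and a content list can be sketched as below. FIG. 5 is not reproduced in this text, so the rule table `RULE_NUMBERS`, the `entry` and `lastname` elements, and the value "Smith" are hypothetical stand-ins; only `address-book`, `name`, `firstname`, and "John" come from the description. The sketch handles PCDATA in leaf elements only and uses the standard-library XML parser rather than a generated LALR(1) parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical numbering standing in for the numbered DTD lines of (5.1).
RULE_NUMBERS = {"address-book": 1, "entry": 2, "name": 3, "firstname": 4, "lastname": 5}


def separate(xml_text):
    """Return (structure tokens, PCDATA items) in document order."""
    structure, pcdata = [], []

    def visit(element):
        number = str(RULE_NUMBERS[element.tag])
        text = (element.text or "").strip()
        if text:
            structure.append(number + "p")  # "p" marks a rule carrying PCDATA
            pcdata.append(text)
        else:
            structure.append(number)
        for child in element:
            visit(child)

    visit(ET.fromstring(xml_text))
    return structure, pcdata


document = (
    "<address-book><entry><name>"
    "<firstname>John</firstname><lastname>Smith</lastname>"
    "</name></entry></address-book>"
)
structure, pcdata = separate(document)
```

With this table the structure stream begins 1, 2, 3, 4p, which mirrors the "1, 2, 3" and "4p" prefix discussed above.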

[0029] As a comparison of the structured data (5.2) with (6.1) shows, this conversion alone already has a compression effect on the syntax information, because tags are replaced by numbers. (6.2) and (6.3) are the PCDATA and CDATA items, respectively, concatenated in order of appearance with suitable delimiters. Because the XML specification stipulates that PCDATA does not contain '>' and CDATA does not contain ']]>' as a character string, these are adopted here as the delimiters.
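
A minimal sketch of the delimiter-based concatenation follows. The function names are hypothetical, and the sketch appends the delimiter after every item (a terminator) so that an empty list and a list holding one empty string stay distinguishable. It assumes, as the description does, that the delimiter never occurs inside an item, and it raises an error rather than silently corrupting the stream when that assumption fails.

```python
PCDATA_DELIMITER = ">"
CDATA_DELIMITER = "]]>"


def join_items(items, delimiter):
    """Concatenate items in order of appearance, each followed by the delimiter."""
    for item in items:
        if delimiter in item:
            raise ValueError(f"content contains the delimiter {delimiter!r}")
    return "".join(item + delimiter for item in items)


def split_items(stream, delimiter):
    """Inverse of join_items: recover the items in their original order."""
    if not stream:
        return []
    return stream.split(delimiter)[:-1]


pcdata_stream = join_items(["John", "Smith"], PCDATA_DELIMITER)
cdata_stream = join_items(["a > b", "x < y"], CDATA_DELIMITER)
```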

[0030] To compress the structured data (5.2), it suffices to compress each of the syntax (6.1), the PCDATA (6.2), and the CDATA (6.3) separated as shown in FIG. 6. In this example, the syntax of the structured data (5.2) is expected to consist of data in almost the same form repeated many times. In (6.1) the data values have been replaced by variables and only the syntax has been extracted, so this part is expected to be a repetition of identical or nearly identical strings. Even an elementary compression method such as run-length encoding can therefore achieve a sufficient compression effect. The text data of the PCDATA (6.2) and CDATA (6.3) does not show such a pronounced bias, but because it is taken from the parts of the structured data (5.2) that carry the same tags, words of the same category can be expected to cluster together: addresses with addresses, personal names with personal names. A high compression effect can therefore be obtained with a compression method that is effective when the same strings recur at a high rate locally, for example LZ77 (Ziv-Lempel) coding.
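
The two compression choices can be illustrated roughly as follows. The per-entry shape string "2,3,4p,5p" and the sample names are hypothetical. Run-length encoding is applied here to whole per-entry shapes, which is one possible reading of "repetition of identical strings". For the content stream, Python's zlib (DEFLATE, which combines LZ77 with Huffman coding) stands in for the LZ77 coder mentioned above.

```python
from itertools import groupby
import zlib


def run_length_encode(tokens):
    """Collapse consecutive equal tokens into (token, count) pairs."""
    return [(token, len(list(group))) for token, group in groupby(tokens)]


def run_length_decode(pairs):
    """Expand (token, count) pairs back into the original token list."""
    return [token for token, count in pairs for _ in range(count)]


# Structure stream: every entry of the hypothetical address book has the same shape.
entry_shapes = ["2,3,4p,5p"] * 1000
encoded_shapes = run_length_encode(entry_shapes)

# Content stream: similar words cluster, which suits a dictionary coder.
names = ">".join(["John", "Joan", "Johnny", "Jon"] * 250).encode("utf-8")
compressed_names = zlib.compress(names)
```

When every entry has the same shape, the structure stream collapses to a single pair regardless of how many entries the document holds.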

[0031] Although this example uses a DTD as the definition format, newer definition formats that can define data formats in finer detail, such as XML Schema and RELAX, have also been proposed. With such formats the local bias of the data can be increased further, so an even higher compression effect is expected. In addition, because data can be collected by type, compression efficiency can be raised further by adopting an encoding suited to each data type. In fact, the XML Schema Working Draft of April 7, 2000 (http://www.w3.org/TR/2000/WD-xmlschema-0-20000407/) predefines more than 40 simple types as built-in types, including character strings, Boolean values, floating-point numbers, double-precision real numbers, decimal numbers, dates, and durations, and classifying data by these types can markedly increase data locality.
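
As a rough illustration of type-directed encoding, the sketch below groups values by a type label and packs dates as 4-byte integers instead of 10-character strings. The type labels, the function names, and the choice of the proleptic Gregorian ordinal as the binary form are assumptions made for this example and are not taken from the publication or from XML Schema.

```python
import struct
from collections import defaultdict
from datetime import date


def group_by_type(typed_values):
    """Collect (type name, value) pairs into one list per type, keeping order."""
    groups = defaultdict(list)
    for type_name, value in typed_values:
        groups[type_name].append(value)
    return dict(groups)


def encode_dates(iso_dates):
    """Pack ISO 8601 dates as 4-byte big-endian day ordinals."""
    ordinals = [date.fromisoformat(text).toordinal() for text in iso_dates]
    return struct.pack(f">{len(ordinals)}I", *ordinals)


def decode_dates(blob):
    """Inverse of encode_dates."""
    ordinals = struct.unpack(f">{len(blob) // 4}I", blob)
    return [date.fromordinal(n).isoformat() for n in ordinals]


groups = group_by_type([
    ("string", "John"), ("date", "2001-08-02"),
    ("string", "Smith"), ("date", "2003-02-14"),
])
packed_dates = encode_dates(groups["date"])
```

Each date shrinks from 10 characters to 4 bytes here, and the resulting stream can still be passed to a general-purpose compressor afterwards.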

[0032] To decompress structured data compressed in this way, the decompression corresponding to the adopted compression method is performed first to obtain the syntax (6.1), PCDATA (6.2), and CDATA (6.3) streams shown in FIG. 6. The PCDATA (6.2) and CDATA (6.3) data is then split at the delimiters and substituted, in order, into the variable positions of the syntax (6.1). For this substitution, a finite state machine is first constructed for each grammar rule of the syntax designation information (5.1). Because each grammar rule is written as a regular expression, this can be done with a standard algorithm. Furthermore, the XML specification requires that the finite state machine obtained in this way from each grammar rule of a DTD be deterministic.
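
The substitution step might be sketched as follows. Instead of building a finite state machine per rule, this simplified version walks a small hand-written grammar recursively and supports only fixed child sequences and a trailing `*`. That is enough here because the content models are deterministic, so one token of lookahead decides whether a repetition continues. The `GRAMMAR` table and the sample data are hypothetical.

```python
from xml.sax.saxutils import escape

# Hypothetical grammar: rule number -> (tag, content model). A content model is
# "#PCDATA" or a list of child tags; a trailing "*" means zero or more.
GRAMMAR = {
    1: ("address-book", ["entry*"]),
    2: ("entry", ["name"]),
    3: ("name", ["firstname", "lastname"]),
    4: ("firstname", "#PCDATA"),
    5: ("lastname", "#PCDATA"),
}


def restore(structure, pcdata):
    """Rebuild XML text from structure tokens and PCDATA items (both lists)."""
    position = 0   # index of the next unread structure token
    next_item = 0  # index of the next unread PCDATA item

    def peek_tag():
        if position >= len(structure):
            return None
        return GRAMMAR[int(structure[position].rstrip("pc"))][0]

    def element():
        nonlocal position, next_item
        token = structure[position]
        position += 1
        tag, model = GRAMMAR[int(token.rstrip("pc"))]
        body = ""
        if token.endswith("p"):
            body = escape(pcdata[next_item])  # fill the variable slot
            next_item += 1
        elif model != "#PCDATA":
            for child in model:
                if child.endswith("*"):
                    while peek_tag() == child[:-1]:
                        body += element()
                else:
                    body += element()
        return f"<{tag}>{body}</{tag}>"

    return element()


restored = restore(
    ["1", "2", "3", "4p", "5p", "2", "3", "4p", "5p"],
    ["John", "Smith", "Jane", "Doe"],
)
```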

[0033] The procedure for obtaining the data classification of FIG. 6 is described below. Structured data such as XML is designed to be free of syntactic ambiguity, so a parser can be created with an LALR(1) parser generator, of which YACC is a representative example. That is, a parsing program can be generated by writing the parsing rules in BNF and specifying, for each rule, the action to take when the rule is applied. Accordingly, if parsing rules are constructed from the DTD by the following procedure, the analysis program (103 in FIG. 1) for obtaining the data classification of FIG. 6 can be generated.

[0034] Step 1: For each element definition in the DTD of the form <!ELEMENT tag_name body>, create a parsing rule of the form lex_id(tag_name) → trans(body). Here lex_id is the ID that the lexical analyzer returns for tag_name, and trans(body) is the result of converting body as obtained in Step 2 onward. The corresponding action returns the following string: the number assigned to <!ELEMENT tag_name body>, followed by the string obtained as the action of trans(body).

[0035] Step 2: The conversion trans is defined recursively as follows.

[0036] (1) trans(nil) = nil. The action returns the empty string "".

[0037] (2) Let (x . y) denote the list in which the list y follows the head element x, and let trans(x . y) = (trans_each(x) . trans(y)). The action returns the string obtained by concatenating the strings produced by the actions of trans_each(x) and trans(y). Here trans_each is defined as follows.

[0038]

(2-1) trans_each(#PCDATA) = lex_id(#PCDATA); the action returns the string "p".
(2-2) trans_each(#CDATA) = lex_id(#CDATA); the action returns the string "c".
(2-3) trans_each(tag_name) = lex_id(tag_name); the action returns the empty string "".

[0039] (2-4) trans_each(tag_name*) = make_new_symbol(lex_id(tag_name), _list); the action is the identity map. Here make_new_symbol is a function that concatenates the two symbols given as arguments to create a new symbol. In this case, the following new rules are also added for tn = lex_id(tag_name) and tn_list = make_new_symbol(tn, _list).

(2-4-1) tn_list → nil; the action returns the empty string "".
(2-4-2) tn_list → tn_list tn; the action returns the string obtained by concatenating the two strings returned by the actions on the right-hand side.

[0040] (2-5) trans_each(tag_name+) = make_new_symbol(lex_id(tag_name), _list1); the action is the identity map. In this case, the following new rules are added for tn = lex_id(tag_name) and tn_list1 = make_new_symbol(tn, _list1).

(2-5-1) tn_list1 → tn; the action is the identity map.
(2-5-2) tn_list1 → tn_list1 tn; the action returns the string obtained by concatenating the two strings returned by the actions on the right-hand side.

[0041] (2-6) trans_each(tag_name?) = make_new_symbol(lex_id(tag_name), _opt); the action is the identity map. In this case, the following new rules are added for tn = lex_id(tag_name) and tn_opt = make_new_symbol(tn, _opt).

(2-6-1) tn_opt → nil; the action returns the empty string "".
(2-6-2) tn_opt → tn; the action is the identity map.
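
The rule construction of Steps 1 and 2 can be sketched as below. The sketch covers only the grammar side: it turns a content model into BNF-style productions and adds the auxiliary `_list`, `_list1`, and `_opt` symbols, while the string-returning actions are omitted. The data layout (a dict from symbol to a list of alternative right-hand sides), the use of tag names in place of lex_id values, and the sample DTD entry are assumptions of this example.

```python
SUFFIXES = {"*": "_list", "+": "_list1", "?": "_opt"}


def trans_each(item, extra_rules):
    """Convert one content-model item, adding auxiliary rules when needed."""
    operator = item[-1]
    if item in ("#PCDATA", "#CDATA") or operator not in SUFFIXES:
        return item                        # cases (2-1), (2-2), (2-3)
    tn = item[:-1]
    symbol = tn + SUFFIXES[operator]       # make_new_symbol(tn, suffix)
    if operator == "*":                    # case (2-4): zero or more
        extra_rules[symbol] = [[], [symbol, tn]]
    elif operator == "+":                  # case (2-5): one or more
        extra_rules[symbol] = [[tn], [symbol, tn]]
    else:                                  # case (2-6): optional
        extra_rules[symbol] = [[], [tn]]
    return symbol


def trans(body, extra_rules):
    """Convert a whole content model, item by item."""
    return [trans_each(item, extra_rules) for item in body]


def build_grammar(dtd):
    """Map each symbol to its list of alternative right-hand sides."""
    grammar, extra_rules = {}, {}
    for tag_name, body in dtd.items():
        grammar[tag_name] = [trans(body, extra_rules)]
    grammar.update(extra_rules)
    return grammar


grammar = build_grammar({"entry": ["name", "phone*", "email?", "note+"]})
```

An empty right-hand side (`[]`) plays the role of nil in the rules above. The left-recursive form `symbol → symbol tn` matches rules (2-4-2) and (2-5-2) and is the form that LALR(1) generators such as YACC handle efficiently.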

[0042] The description above has focused on compression and decompression, but the same applies to encryption and decryption. That is, after the data has been classified as shown in FIG. 6, the syntax (6.1), PCDATA (6.2), and CDATA (6.3) are each encrypted and transmitted, and the receiving side decrypts them. If the syntax designation information (5.1) is kept secret from everyone other than the parties exchanging the data, then merely converting the data into the syntax form (6.1) can be regarded as already encrypting the structure information.
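
The per-stream processing described here can be sketched in Python as follows. The sketch is only illustrative: the length-prefixed framing, the function names, and the sample byte strings are assumptions, not part of the invention. The transform is passed in as a parameter, and zlib from the standard library stands in for the compression case; for the encryption case, the encrypt and decrypt functions of a cipher would be passed instead.

```python
import struct
import zlib


def pack_streams(streams, transform):
    """Apply `transform` to each stream separately and frame the results.

    Each transformed stream is preceded by its length as a 4-byte
    big-endian integer (a hypothetical framing chosen for this sketch).
    """
    out = bytearray()
    for stream in streams:
        transformed = transform(stream)
        out += struct.pack(">I", len(transformed))
        out += transformed
    return bytes(out)


def unpack_streams(blob, inverse):
    """Split the framed data and apply `inverse` to each stream."""
    streams = []
    pos = 0
    while pos < len(blob):
        (length,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        streams.append(inverse(blob[pos:pos + length]))
        pos += length
    return streams


# Hypothetical sample data for the three streams of FIG. 6.
syntax = b"1 2 3 3"          # structure information (6.1)
pcdata = b"Alice\x00Bob"     # PCDATA (6.2)
cdata = b"id42"              # CDATA (6.3)

wire = pack_streams([syntax, pcdata, cdata], zlib.compress)
restored = unpack_streams(wire, zlib.decompress)
```

Because each stream is transformed on its own, the sending side can choose a different method for the structure information than for the content, as long as the receiving side applies the matching inverse.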

【００４３】[0043]

[Effects of the Invention] As described above, according to the present invention, structured data is separated into syntax and content according to the syntax designation information, and each part is compressed or encrypted, so the amount of communication data can be reduced more effectively than by compressing the structured data as it is. Furthermore, adopting the compression/encryption method of the present invention for data exchange not only reduces the amount of communication data and improves security, but also enables data exchange without the overhead of parsing the structured data.

[Brief description of drawings]

FIG. 1 is a diagram showing the system configuration and processing procedure for compressing or encrypting structured data.

FIG. 2 is a diagram showing the system configuration and processing procedure for decompressing or decrypting compressed (encrypted) data.

FIG. 3 is a diagram showing the system configuration and processing procedure for exchanging data using structured data.

FIG. 4 is a diagram showing an example of a system configuration in which the overhead is eliminated by incorporating the parsing process.

FIG. 5 is a diagram showing an application example that illustrates the outline of the compression/encryption method.

FIG. 6 is a diagram showing the result of separating structured data into syntax and content.

[Explanation of symbols]

101 ... Structured data, 102 ... Syntax designation information, 103 ... Lexical/syntactic analysis, 104 ... Partial syntax tree sequence, 105 ... Content, 106 ... Syntax compression (encryption) module, 107 ... Content compression (encryption) module, 108 ... Compressed (encrypted) data, 201 ... Compressed (encrypted) data, 202 ... Partial syntax tree decompression (decryption) module, 203 ... Content decompression module, 204 ... Partial syntax tree sequence, 205 ... Content, 206 ... Compositing module, 207 ... Structured data, 301, 302 ... Syntax designation information, 303 ... Data transmitting-side system, 304 ... Compression (encryption) module, 305 ... Decompression (decryption) module, 306 ... Data receiving-side system, 307 ... Internal data, 308 ... Structured data creation module, 309 ... Structured data, 310 ... Compressed (encrypted) data, 311 ... Network, 312 ... Structured data, 313 ... Lexical/syntactic analysis module, 314 ... Internal data, 401 ... Data transmitting-side system, 402 ... Data receiving-side system, 403 ... Compressed (encrypted) data, 404 ... Internal data, 5.1 ... Syntax designation information, 5.2 ... Structured data.

Claims

[Claims]

1. A data compression method for structured data, wherein, in syntax designation information that defines the structure of the structured data to be processed and comprises a plurality of grammar rules, each grammar rule is made identifiable by identification information, and the structure information of the structured data is separated from its content information by expressing the syntax tree representing the structure of the structured data using the identification information that identifies the grammar rules.

2. The data compression method for structured data according to claim 1, wherein the content information separated from the structured data is further classified according to data type.

3. The data compression method for structured data according to claim 1, wherein the position at which each item of content data included in the content information appears is represented by a variable.

4. A data exchange method for structured data, wherein the structured data to be exchanged is separated into structure information and content information using the data compression method according to claim 1, the separated structure information and content information are each compressed by a predetermined compression method or encrypted by a predetermined encryption method, and the compressed or encrypted structure information and content information are transmitted.

5. A data compression method for structured data, comprising: a step of holding, in storage means, syntax designation information that defines the structure of the structured data to be processed and comprises a plurality of grammar rules, such that each grammar rule can be identified by identification information; a step of obtaining the grammar rules that represent the structure of the structured data to be processed and generating structure information by arranging the identification information of all the obtained grammar rules; and a step of, for each obtained grammar rule that is accompanied by content data, storing the content data in content information and attaching, to the identification information arranged in the structure information, an indicator showing that content data accompanies it.

6. A data compression or encryption method for structured data, comprising: a step of holding, in storage means, syntax designation information that defines the structure of the structured data to be processed and comprises a plurality of grammar rules, such that each grammar rule can be identified by identification information; a step of obtaining the grammar rules that represent the structure of the structured data to be processed and generating structure information by arranging the identification information of all the obtained grammar rules; a step of, for each obtained grammar rule that is accompanied by content data, storing the content data in content information and attaching, to the identification information arranged in the structure information, an indicator showing that content data accompanies it; and a step of compressing or encrypting the structure information and the content information.

7. A data decompression or decryption method for structured data that decompresses or decrypts structure information and content information compressed or encrypted by the data compression or encryption method according to claim 6, comprising: a step of decompressing or decrypting the compressed or encrypted structure information and content information; a step of extracting the identification information of a grammar rule from the decompressed or decrypted structure information and obtaining, by referring to the syntax designation information, the grammar rule corresponding to the extracted identification information; and a step of restoring structured data having the data structure represented by the obtained grammar rule and, when the grammar rule is accompanied by content data, extracting the corresponding content data from the content information and setting it at the corresponding position in the structured data.
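
The separation recited in claims 5 and 6 and the restoration recited in claim 7 can be illustrated with the following Python sketch. It is an assumption-laden example, not the claimed implementation: the RULES table, the (rule_id, children, content) tree representation, and the use of a Boolean as the indicator that content accompanies a rule are all hypothetical choices made for the illustration.

```python
# Hypothetical syntax designation information:
# rule id -> (left-hand side, right-hand-side symbols).
RULES = {
    0: ("people", ["person", "person"]),
    1: ("person", ["name", "age"]),
    2: ("name", []),  # leaf rule whose nodes carry content data
    3: ("age", []),   # leaf rule whose nodes carry content data
}


def encode(node, structure, contents):
    """Walk the syntax tree in pre-order and separate structure from content.

    A node is (rule_id, children, content); content is None when the rule
    carries no content data. Each rule id is appended to `structure`
    together with a flag saying whether content accompanies it.
    """
    rule_id, children, content = node
    has_content = content is not None
    structure.append((rule_id, has_content))
    if has_content:
        contents.append(content)
    for child in children:
        encode(child, structure, contents)


def decode(structure, contents):
    """Rebuild the syntax tree from the rule ids and the content list."""
    ids = iter(structure)
    items = iter(contents)

    def build():
        rule_id, has_content = next(ids)
        content = next(items) if has_content else None
        _, rhs = RULES[rule_id]  # the rule tells how many children follow
        children = [build() for _ in rhs]
        return (rule_id, children, content)

    return build()


tree = (0, [
    (1, [(2, [], "Alice"), (3, [], "30")], None),
    (1, [(2, [], "Bob"), (3, [], "25")], None),
], None)

structure, contents = [], []
encode(tree, structure, contents)
rebuilt = decode(structure, contents)
```

In this sketch the structure list holds only rule identifiers and flags, so it can be compressed or encrypted separately from the contents list, and the decoder needs the same RULES table to restore the tree.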