JP2005215951A

JP2005215951A - Encoding or decoding method for document data, and program therefor

Info

Publication number: JP2005215951A
Application number: JP2004021240A
Authority: JP
Inventors: Arei Kobayashi; 亜令小林; Takuya Tanaka; 卓弥田中; Kazunori Matsumoto; 一則松本
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2004-01-29
Filing date: 2004-01-29
Publication date: 2005-08-11
Anticipated expiration: 2024-01-29
Also published as: JP4168946B2

Abstract

PROBLEM TO BE SOLVED: To provide an encoding method and program for document data that improve the compression ratio by exploiting grammatical characteristics of XML document data.

SOLUTION: XML original document data is encoded using an XML conversion sheet that specifies codes corresponding to element names and attribute names. First, the element name, the attribute names, and the values corresponding to the attribute names are detected in tag units of the original document data. Next, a first code, obtained by combining the code derived from the element name using the conversion sheet with a code that expresses the presence or absence of each attribute name as a bit, is recorded sequentially in a structure table. Next, a second code, obtained by encoding the values of the attributes that are present according to their appearance frequency, is recorded in an entry of a value table linked to the structure table entry that holds the first code. The structure table and the value table generated by repeating the first to third steps constitute the code data.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a method, and a program therefor, for encoding or decoding original document data written in an extensible, text-format structured description language.

Conventionally, document data is encoded and decoded in order to reduce the amount of data to be transmitted. To do so, the transmitting device and the receiving device must each hold a conversion table, which associates constructs of the structured description language with code data on a one-to-one basis. The transmitting device encodes document data into code data based on the conversion table, and the receiving device decodes the code data back into document data based on the same table. This approach is also effective from the standpoint of Internet security, because a client that does not hold the conversion table cannot decode the code data.

Specifically, there is a method of encoding XML/SGML document data using an encoding table that conforms to XML (eXtensible Markup Language) or SGML (Standard Generalized Markup Language) (see, for example, Patent Documents 1 and 2). In this method, a conversion table written in the XML/SGML structured description language defines a code length and a code for each element name, element value, attribute name, and attribute value, and also defines a code length and a code indicating that a second element name stands in a parent-child relationship to a first element name. Encoding document data with this conversion table reduces the amount of data transmitted. In addition, because the decoding-side device does not restore the original document data from the code data, it does not need a parser.

FIG. 1 is a system configuration diagram including an encoding server.

The server 4 transmits document data in XML format to the encoding server 6. The encoding server 6 encodes the document data using the conversion table received from the conversion table server 7, and the resulting code data is transmitted to the client 5. The client 5 performs document processing using the conversion table received from the conversion table server 7. With the configuration of FIG. 1, the encoding server can be used as a proxy server without modifying an existing server that transmits XML-format document data.

JP 2002-259194 A
Kobayashi, Takagi, Muramatsu, Baba, Matsumoto, Inoue, "XML document general-purpose encoding method 'xeus'," IEICE Technical Report DE2001-9, pp. 65-72
Kobayashi, Matsumoto, Inoue, "Performance evaluation of general-purpose XML document encoding method 'XEUS'," FIT (Information Science and Technology Forum), 2003, LE008, pp. 99-100

With respect to such prior art, further improving the compression ratio has become an issue. The prior art encodes XML document data sequentially, tag by tag, and does not exploit the properties of XML grammar. It does not make full use of properties such as the XML constraint that an attribute of a given name can appear only once per element, correlations in the logical structure of the document to be encoded, and the characteristics of the individual values in that document.

In typical XML document data, the logical structure often consists of runs of identical or very similar tags. Because the prior art nevertheless encodes the tags one after another, the encoded bit string itself often ends up containing runs of the same bit pattern.

Accordingly, an object of the present invention is to provide a code processing method and program for document data that further improve the compression ratio by exploiting the grammatical characteristics of XML document data.

The encoding method of the present invention uses:
a conversion sheet, written in an extensible text-format structured description language, that specifies codes corresponding to element names and to the attribute names that can be specified for each element name;
a structure table that records, in its entries, a structure code for each element; and
a value table that records, in entries corresponding to the entries of the structure table, codes for attribute values or element values.
In tag-unit order of the original document data, the method comprises:
a first step of detecting the element name, the presence or absence of a child element or element value, the presence or absence of each attribute name, and any attribute values or element value that are present;
a second step of recording, in successive entries of the structure table, a structure code that combines the code derived from the element name using the conversion sheet, a code indicating the presence or absence of a child element or element value for that element, and a code indicating the presence or absence of attribute names; and
a third step of recording, in the value table, codes for the attribute values or element value that are present.
The structure table and the value table generated by repeating the first to third steps constitute the code data.
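The three steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the `SHEET` dictionary is a hypothetical stand-in for the conversion sheet (only the codes "00" for svg and "01" for g come from the description), the attribute-presence code is left as raw bits rather than being looked up in an attribute table, and values are stored as raw strings instead of being frequency-coded.

```python
import xml.etree.ElementTree as ET

# Hypothetical conversion sheet: element name -> (element code, declared attribute names).
SHEET = {
    "svg": ("00", []),
    "g": ("01", ["id", "stroke", "stroke-width", "fill"]),
}

def encode(xml_text):
    """Return (structure_table, value_table), one entry per element in document order."""
    structure_table, value_table = [], []
    for elem in ET.fromstring(xml_text).iter():  # root first, then descendants in order
        code, declared = SHEET[elem.tag]
        text = (elem.text or "").strip()
        # Step 1: detect child/element value and which declared attributes are present.
        has_content = "1" if (len(elem) > 0 or text) else "0"
        presence = "".join("1" if a in elem.attrib else "0" for a in declared)
        # Step 2: record the combined structure code.
        structure_table.append(code + has_content + presence)
        # Step 3: record the values in the entry linked to that structure entry.
        values = [elem.attrib[a] for a in declared if a in elem.attrib]
        if text:
            values.append(text)
        value_table.append(values)
    return structure_table, value_table
```

Keeping the two tables as parallel lists mirrors the idea that each value table entry is linked to the structure table entry with the same index.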

According to another embodiment of the encoding method of the present invention, it is also preferable that the method further has a value assignment table in which shorter codes are assigned to attribute values or element values in descending order of appearance frequency, that the codes for attribute values or element values in the third step are derived using the value assignment table, and that the value assignment table is also part of the code data.

According to another embodiment of the encoding method of the present invention, it is also preferable that the conversion sheet defines data type information for each attribute value or element value, that a plurality of value assignment tables exist, one per data type, and that the code for an attribute value or element value in the third step is derived using the value assignment table corresponding to the data type of that value.

According to another embodiment of the encoding method of the present invention, it is also preferable that the structure table assigns shorter codes, as tree codes, to structure codes in descending order of appearance frequency, that the method further has a tree table in which the tree codes are arranged in the order in which the elements appear in the original document data, and that the tree table is also part of the code data.

According to another embodiment of the encoding method of the present invention, it is also preferable that the code indicating the presence or absence of attribute names is a bit string in which one bit is assigned to each attribute, in the order of the attribute names that the conversion sheet allows for the element name, and each bit indicates whether that attribute is present.
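A minimal sketch of such an attribute-presence bit string is shown below. The function names are hypothetical, and the attribute order for the "g" element is the one used in the worked example of the description (id, stroke, stroke-width, fill).

```python
# Attribute names that the conversion sheet allows for "g", in sheet order.
G_ATTRS = ["id", "stroke", "stroke-width", "fill"]

def presence_bits(declared, present):
    """One bit per declared attribute: '1' if it appears on the tag, else '0'."""
    return "".join("1" if name in present else "0" for name in declared)

def present_names(declared, bits):
    """Inverse mapping: recover the names of the attributes that are present."""
    return [name for name, bit in zip(declared, bits) if bit == "1"]
```

Because XML allows each attribute name at most once per element, one bit per declared attribute is sufficient and no attribute name needs to be encoded.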

According to another embodiment of the encoding method of the present invention, it is also preferable that the method further has an attribute table in which shorter codes are assigned to the attribute-presence codes in descending order of appearance frequency, that the third step derives the attribute-presence code using the attribute table, and that the attribute table is also part of the code data.

According to another embodiment of the encoding method of the present invention, it is also preferable that the conversion sheet can define selection type information for an attribute value or element value, and that the code for such a value in the third step is expressed in the minimum number of bits required for the number of options.
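As an illustration of the selection type, the sketch below encodes a value as a fixed-width index into its list of options, using the smallest width that can distinguish all of them. The function name and the option lists are assumptions made for this example; the description only states that the minimum number of bits for the number of options is used.

```python
def selection_code(options, value):
    """Encode `value` as its index in `options`, using the minimum fixed bit width."""
    # (n - 1).bit_length() is the number of bits needed for indices 0 .. n-1.
    width = max(1, (len(options) - 1).bit_length())
    return format(options.index(value), f"0{width}b")
```

With two options a single bit suffices, so a value such as "none" that would occupy 4 bytes as ASCII text is reduced to one bit.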

According to another embodiment of the encoding method of the present invention, it is also preferable that the conversion sheet can define group structure type information indicating that an attribute value or element value consists of a plurality of groups, and that such a value is encoded in the third step group by group: the preceding group is encoded first, and for each following group only the difference from the preceding group is encoded.
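The group-wise difference encoding can be sketched as follows, assuming that each group is a tuple of integers (for example an x, y coordinate pair) and that each difference is taken against the immediately preceding group. Both assumptions, and the function names, are illustrative only.

```python
def delta_encode(groups):
    """Keep the first group as is; replace each later group by its difference
    from the immediately preceding group."""
    if not groups:
        return []
    out = [tuple(groups[0])]
    for prev, cur in zip(groups, groups[1:]):
        out.append(tuple(c - p for p, c in zip(prev, cur)))
    return out

def delta_decode(deltas):
    """Rebuild the groups by adding each difference to the previously decoded group."""
    if not deltas:
        return []
    groups = [tuple(deltas[0])]
    for d in deltas[1:]:
        groups.append(tuple(p + x for p, x in zip(groups[-1], d)))
    return groups
```

When successive coordinates change only slightly, the differences are small numbers that recur often, which suits the frequency-based codes used for values.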

According to another embodiment of the encoding method of the present invention, it is also preferable that the conversion sheet can define a separator for expressing an attribute value or element value as a plurality of groups, and that the code for such a value in the third step is divided into groups by the separator.

According to another embodiment of the encoding method of the present invention, it is also preferable that the conversion sheet can define a fixed value portion for an attribute value or element value, and that the fixed value portion is omitted from the code for that value in the third step.

In the decoding method of the present invention, the code data consists of a structure table in which a structure code for each element is recorded in the entries, and a value table in which codes for attribute values or element values are recorded in entries corresponding to the entries of the structure table.
Using a conversion sheet, written in an extensible text-format structured description language, that specifies codes corresponding to element names and to the attribute names that can be specified for each element name, the method comprises:
a first step of extracting the corresponding structure code from the structure table;
a second step of locating the value table entry corresponding to the structure code and, from the structure code and using the conversion sheet, identifying the element name, the presence or absence of a child element or element value, and the attribute names that are present; and
a third step of identifying, from the value codes in the corresponding value table entry, the attribute values or element value that are present.
The original document data is decoded by repeating the first to third steps.

According to another embodiment of the decoding method of the present invention, it is also preferable that the code data further includes a value assignment table in which shorter codes are assigned to attribute values or element values in descending order of appearance frequency, and that the attribute values or element values in the third step are derived using the value assignment table.

According to another embodiment of the decoding method of the present invention, it is also preferable that the conversion sheet defines data type information for each attribute value or element value, that a plurality of value assignment tables exist, one per data type, and that an attribute value or element value in the third step is derived using the value assignment table corresponding to the data type of that value.

According to another embodiment of the decoding method of the present invention, it is also preferable that, in the code data, the structure table assigns shorter codes, as tree codes, to structure codes in descending order of appearance frequency, and that the code data further includes a tree table in which the tree codes are arranged in the order in which the elements appear in the original document data.

According to another embodiment of the decoding method of the present invention, it is also preferable that the code indicating the presence or absence of attribute names is a bit string in which one bit is assigned to each attribute, in the order of the attribute names that the conversion sheet allows for the element name, and each bit indicates whether that attribute is present.

According to another embodiment of the decoding method of the present invention, it is also preferable that the code table further includes an attribute table in which shorter codes are assigned to the attribute-presence codes in descending order of appearance frequency, and that the third step derives the presence or absence of attribute names using the attribute table.

According to another embodiment of the decoding method of the present invention, it is also preferable that the conversion sheet can define selection type information for an attribute value or element value, and that the code for such a value in the third step is expressed in the minimum number of bits required for the number of options.

According to another embodiment of the decoding method of the present invention, it is also preferable that the conversion sheet can define group structure type information indicating that an attribute value or element value consists of a plurality of groups, that the value code for such a value in the third step was encoded group by group, with the preceding group encoded first and only the difference from the preceding group encoded for each following group, and that, after the preceding group is decoded, each following group is derived by adding its difference value to the decoded value of the preceding group.

According to another embodiment of the decoding method of the present invention, it is also preferable that the conversion sheet can define a separator for expressing an attribute value or element value as a plurality of groups, and that the value code for such a value in the third step is divided into groups by the separator.

According to another embodiment of the decoding method of the present invention, it is also preferable that the conversion sheet can define a fixed value portion for an attribute value or element value, and that the attribute value or element value in the third step is derived by using the conversion sheet to restore the omitted fixed value portion.

The encoding program of the present invention uses:
a conversion sheet, written in an extensible text-format structured description language, that specifies codes corresponding to element names and to the attribute names that can be specified for each element name;
a structure table that records, in its entries, a structure code for each element; and
a value table that records, in entries corresponding to the entries of the structure table, codes for attribute values or element values.
In tag-unit order of the original document data, the program executes:
a first step of detecting the element name, the presence or absence of a child element or element value, the presence or absence of each attribute name, and any attribute values or element value that are present;
a second step of recording, in successive entries of the structure table, a structure code that combines the code derived from the element name using the conversion sheet, a code indicating the presence or absence of a child element or element value for that element, and a code indicating the presence or absence of attribute names; and
a third step of recording, in the value table, codes for the attribute values or element value that are present.
The structure table and the value table generated by repeating the first to third steps constitute the code data.

In the decoding program of the present invention, the code data consists of a structure table in which a structure code for each element is recorded in the entries, and a value table in which codes for attribute values or element values are recorded in entries corresponding to the entries of the structure table.
Using a conversion sheet, written in an extensible text-format structured description language, that specifies codes corresponding to element names and to the attribute names that can be specified for each element name, the program executes:
a first step of extracting the corresponding structure code from the structure table;
a second step of locating the value table entry corresponding to the structure code and, from the structure code and using the conversion sheet, identifying the element name, the presence or absence of a child element or element value, and the attribute names that are present; and
a third step of identifying, from the value codes in the corresponding value table entry, the attribute values or element value that are present.
The original document data is decoded by repeating the first to third steps.

According to the document data code processing method and program of the present invention, the compression ratio can be further improved. In particular, the logical structure portion and the value portion of the document data, which were previously encoded intermixed, are encoded separately, so each portion can be encoded in the way best suited to its characteristics.

In particular, because encoding can exploit the skew in the appearance frequencies of logical structures and of values, the code amount can be reduced by approximately 20% compared with the conventional method. This encoding method is applicable to all XML- and SGML-compliant document data. The reduction is especially large for content, such as the lists of coordinate values in map content, in which the change between successive coordinate values is small and identical or similar tags occur in long runs.

Such a reduction in code amount is highly effective for terminals connected over a narrow-band wireless link, such as mobile phones. Because this encoding method also requires no parsing, it is highly effective for terminals with limited CPU capability, such as mobile phones.

The best mode for carrying out the present invention is described in detail below with reference to the drawings.

FIG. 2 shows the document data code processing method of the present invention.

According to FIG. 2, the document data 12 is extended by a plurality of document data 120 and 121. Correspondingly, the conversion table 11 defines link information to a plurality of conversion tables 110 and 111 that match the extended document data. The XML-format document data 12 is thus encoded (10) using the conversion table 11.

Also according to FIG. 2, the code data undergoes document processing (30) directly, using the conversion table 21, and is displayed in the browser 24. According to the present invention, the code data also contains the logical structure of the elements. There is therefore no need to decode it back into document data, nor to analyze the logical structure with the parser 23.

FIG. 3 is a sample of document data. The following description uses this document data as an example.

FIG. 4 shows the various encoding tables corresponding to the document data of FIG. 3.

According to FIG. 4, a tree table, a structure table, and an attribute table represent the logical structure of the document data, and a value table represents the values. The present invention is thus characterized by encoding the logical structure portion and the value portion separately.

Here, the codes that link the tree table to the structure table, and the attribute-presence codes, are Huffman codes. Huffman coding assigns short codes to items that appear frequently and longer codes to items that appear less often. Such variable-length coding achieves more efficient data compression than assigning a fixed-size code to every item.
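For reference, a generic Huffman code assignment can be sketched as follows. This is the textbook construction, not the specific table layout of FIG. 4; the function name and the tie-breaking rule are choices made for this example, so the exact bit patterns it produces need not match those in the figures.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Map each distinct symbol to a prefix-free bit string; frequent symbols get short codes."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap items: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)   # the two lightest subtrees
        w2, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]
```

Applied to the sequence of structure table entries in document order, the most frequent entry receives the shortest tree code.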

FIG. 5 shows the Huffman tables (value assignment tables) corresponding to the document data of FIG. 3.

According to FIG. 5, there are an integer Huffman table, a character string Huffman table, and an occupied bit length table. As in FIG. 4, the codes in these tables are assigned by Huffman coding. Each table is therefore sorted by appearance frequency, with shorter codes assigned to the more frequent entries. These index codes are used as the codes for values. Assigning the index codes with a frequency-based variable-length method such as Huffman coding reduces the code amount. In the prior art, values were encoded with the codes defined in the encoding rules; in the present invention, a value appearance frequency table is created for each data type, and code lengths are computed dynamically from the appearance frequency of each value.
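The per-data-type frequency tables can be illustrated with the sketch below, which splits a list of values into integers and character strings and ranks each group by descending frequency. The type test (a string of digits with an optional leading minus sign counts as an integer) and the function name are assumptions for this example; in the described method the data type comes from the conversion sheet.

```python
from collections import Counter

def value_rank_tables(values):
    """Return (integer_table, string_table), each sorted by descending frequency."""
    def is_int(v):
        return v.lstrip("-").isdigit()
    ints = Counter(v for v in values if is_int(v))
    strs = Counter(v for v in values if not is_int(v))
    # most_common() orders by count, highest first; ties keep first-seen order.
    return ([v for v, _ in ints.most_common()],
            [v for v, _ in strs.most_common()])
```

Each ranked table can then be given variable-length codes, so that the value with rank 0 in its table receives the shortest code.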

FIGS. 6A and 6B show the conversion sheet.

The conversion sheet is held in advance by the encoding or decoding device, independently of any particular document data, and defines the correspondence between element names or attribute names and codes.

FIG. 7 is an explanatory diagram showing the process of encoding the document data of FIG. 3. The process is described below based on FIG. 7, with reference to each table.

[T1]<svg>
->"101"
First, for "svg", the code "00" is derived from [x1] of the conversion sheet (FIG. 6A). Since svg has a child element, the child flag is "1"; since it has no attributes, the attribute flag is "0".
Thus "0010" is recorded for "svg" in the structure table (entry F). It has no value, so nothing is recorded in the corresponding value table entry. Entry F of the structure table is assigned the code "101", which is recorded in the tree table in association with "svg".

[T2]<g id="300" stroke="#1e90ff" stroke-width="1" fill="none">
->"011","100,1110,111,00,0"
For "g", a child element of svg, the code "01" is derived from [x3] of the conversion sheet. Since g has a child element, the child flag is "1". Conversion sheet entries [x7] to [x10] show that g has four possible attributes, and all four appear in [T2], so the attribute-presence code is "1111", with the bits representing id, stroke, stroke-width, and fill in that order. This representation is possible because the XML specification does not allow the same attribute to appear twice within one element, so each attribute is simply either present or absent. In the attribute table, "1111" corresponds to the code "01". Therefore "g" as a whole is encoded as "01101", which is recorded in the structure table (entry D). Entry D of the structure table is assigned the code "011", which is recorded in the tree table in association with "g".
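The two structure codes derived in [T1] and [T2] can be reproduced with the small sketch below. The element codes and the attribute table entry "1111" -> "01" are taken from the walkthrough; the function name is hypothetical, and emitting a single "0" when no attribute is present is an assumption that simply matches the [T1] result, since this passage does not state how that case relates to the attribute table.

```python
ELEMENT_CODES = {"svg": "00", "g": "01"}   # from conversion sheet entries [x1] and [x3]
ATTRIBUTE_TABLE = {"1111": "01"}           # attribute-presence bits -> attribute table code

def structure_code(name, has_child, presence_bits):
    """Combine element code, child flag, and attribute code into one structure code."""
    # Assumption: an element with no attribute present contributes a single "0".
    attr_code = ATTRIBUTE_TABLE[presence_bits] if "1" in presence_bits else "0"
    return ELEMENT_CODES[name] + ("1" if has_child else "0") + attr_code
```

For example, `structure_code("svg", True, "")` gives "0010" and `structure_code("g", True, "1111")` gives "01101", matching entries F and D of the structure table.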

According to the present invention, instead of assigning a separate code to every element name and attribute name as in the prior art (for example, "g" = "010101" and "id" = "101010"), four bits are allotted to "g" to indicate the presence or absence of the "id, stroke, stroke-width, fill" attributes, so that only the attribute values need to be encoded.

For "id", the attribute value "300" is encoded as "1110" by the integer Huffman table, and its size of 4 bits is encoded as "100" by the occupied bit length table.
For "stroke", the attribute value "#1e90ff" is encoded as "111" by the character string Huffman table.
For "stroke-width", the attribute value "1" is encoded as "00" by the integer Huffman table.
For "fill", the attribute value "none" is of the selection type according to the conversion sheet [x11], so the code "0" is derived.
Finally, the value code "100,1110,111,00,0" of "g" is recorded in entry D of the corresponding value table.

Here, according to the conversion sheet [x11] [x12], the data type called the selection type is specified for "fill". In the prior art, even when the candidate values (options) were limited, the value had to be encoded by a data type such as number or character. According to the present invention, the fill attribute of the g element can only be "none" or a 7-character string, so when the value is "none" the code "0" alone suffices. What would require 4 bytes in ASCII is encoded in 1 bit.
The value can thus be encoded with the index code assigned to the option, which reduces the code amount.

[T3]<polyline points="88,99 96,102 108,109"/>
->"000","11111,11110,0100,0101,0110,0111"
For "polyline", as a child element of "g", the code "10" is derived from the conversion sheet [x14]. Because polyline has no child element, the child-element bit is set to "0". The conversion sheet [x17] to [x21] shows that polyline can take five attributes, and since only the points attribute appears in [T3], the attribute presence code is "10000", where the bits represent points, stroke, stroke-width, fill, and stroke-dasharray in that order. From this code "10000", the code "100" is derived using the attribute table. The resulting bit string "100100" is recorded in the structure table (entry A). The code "000" is assigned to entry A, and this code "000" is recorded in the tree table in association with "polyline".

The value of "points" is known to be of the coordinates type. "88,99 96,102 108,109" is therefore understood as the group of coordinates (88,99) (96,102) (108,109). Rather than encoding "88,99 96,102 108,109" directly as six numerical values, the coordinate value type is used so that each pair is converted into its difference from the preceding pair, that is, "88,99,+8,+3,+12,+7", and then encoded. Using the integer Huffman table, the first values are encoded as "88"->"11111" and "99"->"11110". Next, "96,102" is expressed as the difference "+8,+3" from the preceding pair and encoded as "0100,0101" using the integer Huffman table. Then "108,109" is expressed as the difference "+12,+7" and encoded as "0110,0111". This kind of encoding is highly effective for data in which the change between coordinates is small (for example, map data). Finally, the value code "11111,11110,0100,0101,0110,0111" of "polyline" is recorded in entry A of the corresponding value table.
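The difference conversion applied to "points" can be sketched as below. The function names are illustrative and not taken from the patent; the sketch assumes a non-empty list of integer pairs and stops before the Huffman stage, producing only the numbers that would then be looked up in the integer Huffman table.

```python
def parse_points(text):
    """Split "x,y x,y ..." into integer pairs (space between pairs, comma inside)."""
    return [tuple(int(n) for n in pair.split(",")) for pair in text.split()]


def delta_encode(points):
    """Keep the first pair as-is; replace each later pair by its difference
    from the preceding pair."""
    out = list(points[0])
    for (px, py), (x, y) in zip(points, points[1:]):
        out += [x - px, y - py]
    return out


def delta_decode(values):
    """Inverse of delta_encode: accumulate the differences back into pairs."""
    points = [(values[0], values[1])]
    for i in range(2, len(values), 2):
        px, py = points[-1]
        points.append((px + values[i], py + values[i + 1]))
    return points


pts = parse_points("88,99 96,102 108,109")
deltas = delta_encode(pts)
print(deltas)  # [88, 99, 8, 3, 12, 7]
```

Small differences such as 8, 3, 12, and 7 recur far more often than absolute coordinates, which is what lets a frequency-based table give them short codes.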

The separator between numerical values is usually a space " ", but the conversion sheet [x17] also defines the comma ",". The separator may differ depending on the value. In such a case, instead of encoding the value as a character string as in the prior art, the separators for the attribute can be defined as an encoding rule, and the value can be split at them and encoded piece by piece. Here, in the value of "points", the X and Y coordinates are separated by "," and the pairs are separated by " ". Combining multiple separators according to the type of value in this way achieves high encoding efficiency.

[T4]</g>
->"011"
The logical structure code "011" for the end of element "/" is recorded in the logical structure table.

[T5]<g id="500" fill="#ff4500">
->"100","01,111,1,111"
For "g", as a child element of svg, the code "01" is derived from the conversion sheet [x3]. Because g has a child element, the child-element bit is set to "1". The conversion sheet [x7] to [x10] shows that g can take four attributes, and since only the id and fill attributes appear in [T5], the attribute presence code is "1001", where the bits represent id, stroke, stroke-width, and fill in that order. In the attribute table, "1001" corresponds to the code "11". "g" is therefore encoded as "01111" as a whole, and "01111" is recorded in the structure table (entry E). The code "100" is assigned to entry E, and this code "100" is recorded in the tree table in association with "g".

For "id", the attribute value "500" is encoded as "111" by the integer Huffman table, and its size of 3 bits is encoded as "01".
For "fill", the attribute is of the selection type, so for the 7-character value "#ff4500" the option code "1" is followed immediately by the ASCII-type encoding of "#ff4500". Here, "#ff4500" is assigned the code "111" by the character string Huffman table.
Finally, the value code "01,111,1,111" of "g" is recorded in entry E of the corresponding value table.
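The two cases of the selection-type "fill" attribute, "none" in [T2] and a 7-character colour string in [T5], can be sketched as follows. The function name and the one-entry table are assumptions made for illustration; the table holds only the string code quoted in the text.

```python
# Assumed option list for "fill" on <g>: option "0" is the literal "none",
# option "1" means "a 7-character string follows".
STRING_HUFFMAN = {"#ff4500": "111"}   # only the entry quoted for [T5]


def encode_fill(value):
    """Return the code pieces for a selection-type "fill" value."""
    if value == "none":
        return ["0"]                       # 1 bit instead of 4 ASCII bytes
    return ["1", STRING_HUFFMAN[value]]    # option bit, then the string code


print(encode_fill("none"))      # ['0']
print(encode_fill("#ff4500"))   # ['1', '111']
```

These pieces are the trailing "0" of the [T2] value code and the trailing "1,111" of the [T5] value code.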

[T6]<text x="64" y="22">ABC</text>
->"001","110,01,110,110"
->"010","1000,1001,1010"
For "text", as a child element of g, the code "11" is derived from the conversion sheet [x15]. Because text has no child element, the child-element bit is set to "0". The conversion sheet [x23] to [x26] shows that text can take four attributes, and since only the x and y attributes appear in [T6], the attribute presence code is "1001". Each bit represents id, stroke, stroke-width, and fill. In the attribute table, "1001" corresponds to the code "11". "text" is therefore encoded as "11011" as a whole, and "11011" is recorded in the structure table (entry B). The code "001" is assigned to entry B, and this code "001" is recorded in the tree table in association with "text".

For "x", the attribute value "64" is encoded as "01" by the integer Huffman table, and its size of 2 bits is encoded as "110".
For "y", the attribute value "22" is encoded as "110" by the integer Huffman table, and its size of 3 bits is encoded as "110".
Finally, the value code "110,01,110,110" of "text" is recorded in entry B of the corresponding value table.

The element value "ABC" of the text element is itself a node, and as an element value it has no element name, no child element or element value, and no attribute. It is therefore encoded as "NULL,0,0", which is recorded in the structure table (entry C). The code "010" is assigned to entry C, and this code "010" is recorded in the tree table in association with "[value]".

Furthermore, "ABC" is encoded as "1000,1001,1010" by the character string Huffman table. The tree code "010" is recorded in the tree table for structure table entry C, and the code "1000,1001,1010" is recorded in entry C of the corresponding value table.

</text>
->"011"
The code "011", which is the logical structure code for the end of element "/", is recorded in the tree table.

[T7]</g>
->"011"
The logical structure code "011" for the end of element "/" is recorded in the tree table.

[T8]</svg>
->"011"
The logical structure code "011" for the end of element "/" is recorded in the tree table.

Other functions of the present invention are described below.

First, implicit separators are described. In a value such as "100円" (100 yen), the data type changes from number to characters, and the boundary between "100" and "円" is defined as an implicit separator. The selection type was described above using the conversion sheet [x11] [x12]; to encode a value such as "100円", a conversion sheet like the following can be prepared.
<attr name="price" explicit_separator="true">
<value>
<number data="UI" qt="1"/>
<char data="Shift_JIS" length="implied"/>
<choice>
<list code="0">円</list>
<list code="1">ドル</list>
</choice>
</value>
</attr>
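Splitting a value at an implicit separator can be sketched as follows. This is only an illustration under stated assumptions: the regular expression, the function names, and the mapping of the unit to the choice codes "0" (円, yen) and "1" (ドル, dollar) are a reading of the sample sheet, not a mechanism the patent spells out.

```python
import re

CHOICES = {"円": "0", "ドル": "1"}   # codes taken from the <choice> list


def split_price(value):
    """Split where digits give way to characters (the implicit separator)."""
    match = re.fullmatch(r"(\d+)(\D+)", value)
    if match is None:
        raise ValueError(f"not a number followed by a unit: {value!r}")
    return int(match.group(1)), match.group(2)


def encode_price(value):
    """Return the numeric part and the choice code of the unit."""
    number, unit = split_price(value)
    return number, CHOICES[unit]


print(encode_price("100円"))   # (100, '0')
```

The numeric part would then go through the integer tables, while the unit costs a single bit.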

Second, the case where a value can only ever be a character string is described. Suppose, for example, that a sample document contains the following.
<value>
http://www.kddi.com/
</value>

In this case, http://www.kddi.com/ cannot be anything other than a character string, so defining it in the character string Huffman table allows it to be converted into a very short code. In particular, because "http://" and the final "/" are fixed, encoding of those parts can be omitted as well. A rule such as the following can be written in the conversion sheet for this case.
<value>
http://
<char data="ASCII" length="implied"/>
</value>
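Omitting the fixed parts of such a value, and restoring them on the decoding side, can be sketched as follows. The constants and function names are hypothetical; the sketch assumes the fixed leading "http://" and trailing "/" described in the text and leaves the variable remainder to whatever string coding is in use.

```python
PREFIX = "http://"   # fixed leading part declared in the sheet
SUFFIX = "/"         # fixed trailing part mentioned in the text


def strip_fixed(value):
    """Encoder side: drop the fixed parts; only the remainder needs a code."""
    if not (value.startswith(PREFIX) and value.endswith(SUFFIX)):
        raise ValueError("value does not match the fixed pattern")
    return value[len(PREFIX):-len(SUFFIX)]


def restore_fixed(variable):
    """Decoder side: put the fixed parts back around the decoded text."""
    return PREFIX + variable + SUFFIX


core = strip_fixed("http://www.kddi.com/")
print(core)  # www.kddi.com
```

Since both sides hold the same conversion sheet, the eight fixed characters never have to appear in the code data.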

FIG. 8 is a flowchart of the encoding process in the present invention.

Assume that XML document data is input.
(S801) Starting from the first tag of the document data, all elements and element values are encoded one by one in the loop between this step and S812.
(S802) First, it is determined whether the node is an element value.
(S803) If it is not an element value, the code of the element of the tag is derived using the conversion sheet.
(S804) Next, whether the element has a child element or an element value is encoded in 1 bit.
(S805) Owing to the nature of XML, the conversion sheet defines in advance how many attributes the element can have and in what order. A bit string with one bit per attribute is therefore laid out, and the presence or absence of each attribute is expressed by its bit.
(S806) The code string obtained in S803 to S805 is recorded in the structure table as a structure code. The structure table is Huffman-coded according to appearance frequency, so frequently occurring structure codes are represented by short codes and rare ones by long codes.
(S807) The code designated by the structure table is recorded in the tree table together with the element.
(S808) Next, it is determined whether the element has attributes. If it has none, processing moves on to the next element (S801).
(S809) If there are attributes, all attributes are encoded one by one, starting from the first attribute, in the loop between this step and S811.
(S810) Because the presence of each attribute is already known from the attribute bits, only the attribute value needs to be encoded. The occupied bit length table, the integer Huffman table, and the character string Huffman table are used here. The data type of the attribute is already known from the conversion sheet, so the table can be selected by data type. Huffman coding is used here as well, so frequently occurring values are represented by short codes and rare ones by long codes.
(S811) The loop from S808 is repeated until all attributes have been encoded.
(S812) If the node is determined in S802 to be an element value, the code string "no element name, no child element, no attribute" is recorded in the structure table as a structure code. As described above, the structure table is Huffman-coded according to appearance frequency.
(S813) The code designated by the structure table is recorded in the tree table together with the element.
(S814) The element value is encoded using the occupied bit length table, the integer Huffman table, and the character string Huffman table.
(S815) A code indicating the end of the element, "/", is recorded in the tree table.
(S816) The loop from S801 is repeated for element values until all elements have been encoded.
(S817) The loop from S801 is repeated until all elements have been encoded.
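The overall flow of S801 to S817 can be sketched in a much simplified form as below. Everything in it is an assumption made for illustration: the `SHEET` dictionary stands in for the conversion sheet, the raw presence bits are used instead of attribute-table codes, list indices stand in for the Huffman-assigned tree codes, values are stored unencoded, and the child bit looks at child elements only, following the worked example for [T6].

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified conversion sheet: element -> (code, attribute order).
SHEET = {
    "svg": ("00", []),
    "g": ("01", ["id", "stroke", "stroke-width", "fill"]),
    "text": ("11", ["x", "y"]),
}
END = "/"  # stand-in for the end-of-element code


def encode(xml_text):
    structure_table = []   # distinct structure codes, in order of first use
    tree_table = []        # one structure-table index (or END) per node
    value_table = []       # (structure-table index, raw values) per node

    def record(code, values):
        if code not in structure_table:
            structure_table.append(code)
        index = structure_table.index(code)
        tree_table.append(index)                    # S807 / S813
        if values:
            value_table.append((index, values))     # S810 / S814

    def visit(elem):
        name_code, attr_order = SHEET[elem.tag]     # S803
        child_bit = "1" if len(elem) else "0"       # S804 (child elements only)
        bits = "".join("1" if a in elem.attrib else "0"
                       for a in attr_order)         # S805
        values = [elem.attrib[a] for a in attr_order if a in elem.attrib]
        record(name_code + child_bit + bits, values)  # S806
        text = (elem.text or "").strip()
        if text:
            record("NULL,0,0", [text])              # S812: element value node
        for child in elem:
            visit(child)
        tree_table.append(END)                      # S815

    visit(ET.fromstring(xml_text))
    return structure_table, tree_table, value_table


doc = '<svg><g id="500" fill="#ff4500"><text x="64" y="22">ABC</text></g></svg>'
structure, tree, values = encode(doc)
print(structure)  # ['001', '0111001', '11011', 'NULL,0,0']
print(tree)       # [0, 1, 2, 3, '/', '/', '/']
print(values)
```

In the described method a further pass would replace the indices and raw values with frequency-based codes; the sketch only shows how the three tables are filled in document order.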

FIG. 9 is a flowchart of the decoding process in the present invention.

Assume that code data is input. The code data basically consists of a tree table and a value table. The structure table, attribute table, occupied bit length table, integer Huffman table, and character string Huffman table are naturally also required, but these tables may be held in advance by both the encoding side and the decoding side.
(S901) Starting from the beginning of the code data, all elements are decoded one by one in the loop between this step and S908.
(S902) First, the code of the element structure is extracted by referring to the tree table.
(S903) The structure table is consulted, and the bit strings for the element name, the presence or absence of child elements, and the presence or absence of attributes are looked up in the entry corresponding to that code. The element name is identified from its bit string using the conversion sheet.
(S904) From the attribute presence code, the bit string that concretely indicates which attributes are present is derived using the attribute table. The attribute names that correspond to the element name are known from the conversion sheet, so the presence bits show which attributes exist.
(S905) Next, the value table entry corresponding to that structure table entry is consulted. The bit strings of the values of the existing attributes can thus be extracted using the conversion sheet.
(S906) The values and character strings are derived from the extracted bit strings using the occupied bit length table, the integer Huffman table, and the character string Huffman table.
(S907) The processing from S902 gives the structure of one element, so document processing is performed for that tag.
(S908) The loop from S901 is repeated until all elements have been decoded.
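Steps S903 to S907 can be sketched for a single structure code as follows. The tables are hypothetical and contain only the entries quoted in the worked example; the sketch assumes a fixed 2-bit element code, as in that example, and takes the attribute values as already decoded strings.

```python
# Hypothetical tables holding only entries quoted in the worked example.
ELEMENT_NAMES = {"01": "g"}                                         # sheet [x3]
ATTRIBUTE_ORDER = {"g": ["id", "stroke", "stroke-width", "fill"]}   # [x7]-[x10]
ATTRIBUTE_TABLE = {"01": "1111", "11": "1001"}    # short code -> presence bits


def decode_structure(code):
    """Split a structure code into element name, child flag, and attributes."""
    name = ELEMENT_NAMES[code[:2]]          # S903: 2-bit element code (assumed)
    has_child = code[2] == "1"
    bits = ATTRIBUTE_TABLE[code[3:]]        # S904: expand via attribute table
    present = [a for a, b in zip(ATTRIBUTE_ORDER[name], bits) if b == "1"]
    return name, has_child, present


def rebuild_tag(code, values):
    """S905-S907: pair the present attributes with already-decoded values."""
    name, _, present = decode_structure(code)
    attrs = "".join(f' {a}="{v}"' for a, v in zip(present, values))
    return f"<{name}{attrs}>"


print(rebuild_tag("01111", ["500", "#ff4500"]))
# <g id="500" fill="#ff4500">
```

The attribute names never travel in the code data; they are recovered from the sheet's fixed attribute order and the presence bits.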

Those skilled in the art can readily make various changes, modifications, and omissions to the embodiments of the document data code processing method and system described above, within the scope of the technical idea and viewpoint of the present invention. The foregoing description is merely an example and is not intended to be limiting. The present invention is limited only as defined by the claims and their equivalents.

FIG. 1 is a system configuration diagram including an encoding server.
FIG. 2 shows the document data code processing method according to the present invention.
FIG. 3 is a sample of document data.
FIG. 4 shows various encoding tables corresponding to the document data of FIG. 3.
FIG. 5 shows Huffman tables corresponding to the document data of FIG. 3.
FIG. 6A is the front part of the conversion sheet.
FIG. 6B is the latter part of the conversion sheet.
FIG. 7 is an explanatory diagram showing the encoding process for the document data of FIG. 3.
FIG. 8 is a flowchart of the encoding process in the present invention.
FIG. 9 is a flowchart of the decoding process in the present invention.

Explanation of symbols

10 Encoding
11, 21, 110, 111, 210, 211 Conversion table
12, 22, 120, 121 Text-format document data
23 Parser
24 Browser display screen
30 Document processing
4 Existing server
5 Client
6 Code data
7 Conversion table server
8 Internet

Claims

A document data encoding method for encoding original document data described in an extensible text-format structured description language, the method using:
a conversion sheet that is described in an extensible text-format structured description language and that designates codes corresponding to element names and to the attribute names that can be specified for each element name;
a structure table that records a structure code for each element in its entries; and
a value table that records codes of attribute values or element values in entries corresponding to the entries of the structure table,
the method comprising, for each tag of the original document data in order:
a first step of detecting an element name, the presence or absence of a child element or element value, the presence or absence of each attribute name, and any existing attribute value or element value;
a second step of sequentially recording, in entries of the structure table, the structure code obtained by combining a code derived from the element name using the conversion sheet, a code representing the presence or absence of a child element or element value of the element, and a code representing the presence or absence of the attribute names; and
a third step of recording a code for the existing attribute value or element value in the value table,
wherein the structure table and the value table generated by repeating the first to third steps constitute the code data.

The method according to claim 1, further using a value assignment table in which short codes are assigned in descending order of the appearance frequency of the attribute values or element values, wherein the code for the attribute value or element value in the third step is derived using the value assignment table, and the value assignment table is also a part of the code data.

The conversion sheet defines data type information corresponding to the attribute value or element value,
A plurality of the value allocation tables exist according to the data type information,
3. The code for the attribute value or element value in the third step is derived using a value assignment table corresponding to the data type information in the attribute value or element value. The method described in 1.

The method according to claim 1, wherein, in the structure table, short codes are assigned as tree codes in descending order of the appearance frequency of the structure codes, the method further uses a tree table in which the tree codes are arranged in the order of appearance of the elements of the original document data, and the tree table is also a part of the code data.

The method according to any one of claims 1 to 4, wherein the code representing the presence or absence of the attribute names specifies the presence or absence of each attribute by a bit string in which one bit is assigned to each attribute, in the order of the attribute names that can be specified for the element name in the conversion sheet.

The method according to claim 5, further using an attribute table in which short codes are assigned in descending order of the appearance frequency of the codes representing the presence or absence of the attribute names, wherein the third step derives the code representing the presence or absence of the attribute names using the attribute table, and the attribute table is also a part of the code data.

The method according to claim 1, wherein selection type information corresponding to the attribute value or element value can be defined in the conversion sheet, and the code for the attribute value or element value in the third step is expressed in the minimum number of bits corresponding to the number of options.

The method according to any one of claims 1 to 7, wherein the conversion sheet can specify group structure type information indicating that the attribute value or element value consists of a plurality of groups, and the code for the attribute value or element value in the third step is obtained by encoding the preceding group and then, for each subsequent group, encoding only the difference from the preceding group.

The method according to claim 8, wherein a separator can be defined in the conversion sheet to represent the attribute value or element value as a plurality of groups, and the code for the attribute value or element value in the third step is divided into groups at the separator.

The method according to claim 1, wherein a fixed value portion can be defined for the attribute value or element value in the conversion sheet, and the fixed value portion is omitted from the code for the attribute value or element value in the third step.

A document data decoding method for decoding code data of original document data described in an extensible text-format structured description language, wherein
the code data includes a structure table in which a structure code for each element is recorded in entries, and a value table in which codes of attribute values or element values are recorded in entries corresponding to the entries of the structure table,
the method uses a conversion sheet that is described in an extensible text-format structured description language and that designates codes corresponding to element names and to the attribute names that can be specified for each element name,
and the method comprises:
a first step of extracting a corresponding structure code from the structure table;
a second step of detecting the entry of the value table corresponding to the structure code, and identifying from the structure code, using the conversion sheet, the element name, the presence or absence of a child element or element value, and the existing attribute names; and
a third step of identifying the existing attribute value or element value from the value code of the corresponding entry of the value table,
wherein the code data is decoded into the original document data by repeating the first to third steps.

The method according to claim 11, wherein the code data further includes a value assignment table in which short codes are assigned in descending order of the appearance frequency of the attribute values or element values, and the attribute value or element value in the third step is derived using the value assignment table.

The conversion sheet defines data type information corresponding to the attribute value or element value,
A plurality of the value allocation tables exist according to the data type information,
The attribute value or element value in the third step is derived using a value assignment table corresponding to the data type information in the attribute value or element value. the method of.

The method according to any one of claims 11 to 13, wherein, in the code data, short codes are assigned in the structure table as tree codes in descending order of the appearance frequency of the structure codes, and the code data further includes a tree table in which the tree codes are arranged in the order of appearance of the elements of the original document data.

The method according to any one of claims 11 to 14, wherein the code representing the presence or absence of the attribute names specifies the presence or absence of each attribute by a bit string in which one bit is assigned to each attribute, in the order of the attribute names that can be specified for the element name in the conversion sheet.

The method according to claim 15, wherein the code data further includes an attribute table in which short codes are assigned in descending order of the appearance frequency of the codes representing the presence or absence of the attribute names, and the third step derives the presence or absence of the attribute names using the attribute table.

The method according to any one of claims 11 to 16, wherein selection type information corresponding to the attribute value or element value can be defined in the conversion sheet, and the code for the attribute value or element value in the third step is expressed in the minimum number of bits corresponding to the number of options.

The method according to claim 11, wherein the conversion sheet can specify group structure type information indicating that the attribute value or element value consists of a plurality of groups, the value code of the attribute value or element value in the third step has been obtained, group by group, by encoding the preceding group and then encoding only the difference from the preceding group for each subsequent group, and, after the preceding group is decoded, each subsequent group is obtained by adding its decoded difference value to the value of the preceding group.

The method according to claim 18, wherein a separator can be defined in the conversion sheet to represent the attribute value or element value as a plurality of groups, and the value code of the attribute value or element value in the third step is divided into groups at the separator.

In the conversion sheet, a fixed value part can be defined for the attribute value or the element value,
21. The value code of the attribute value or element value in the third step is derived by restoring the omitted fixed value portion using the conversion sheet. 2. The method according to item 1.

A document data encoding program for encoding original document data described in an extensible text-format structured description language, the program using
a conversion sheet that is itself described in an extensible text-format structured description language and that holds codes corresponding to element names and to the attribute names that can be specified for each element name,
a structure table that records a structure code for each element in an entry, and
a value table that records, in entries corresponding to the entries of the structure table, the codes of attribute values or element values, and executing, in tag-unit order of the original document data:
a first step of detecting an element name, the presence or absence of a child element or element value, the presence or absence of each attribute name, and any existing attribute value or element value;
a second step of sequentially recording, in an entry of the structure table, a structure code obtained by combining a code derived from the element name using the conversion sheet, a code representing the presence or absence of a child element or element value of the element, and a code representing the presence or absence of each attribute name; and
a third step of recording a code for the existing attribute value or element value in the value table, wherein the structure table and the value table generated by repeating the first through third steps constitute the code data.
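
The three encoding steps can be sketched as follows. This is a minimal illustration under several assumptions: the conversion sheet is modelled as a Python dictionary rather than an XML document, element codes are fixed at two bits, values are stored uncompressed, mixed content is not handled, and a reserved code `END` closes a run of child elements. None of these details come from the claim itself, and all names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical conversion sheet: element name -> (code, ordered attribute names).
SHEET = {
    "book": ("00", ["id", "lang"]),
    "title": ("01", []),
    "price": ("10", ["currency"]),
}
END = "11"  # hypothetical reserved code that closes a run of child elements


def encode(xml_text, sheet=SHEET):
    structure_table, value_table = [], []

    def visit(elem):
        # Step 1: detect name, content, attribute presence, and values.
        code, attr_names = sheet[elem.tag]
        text = (elem.text or "").strip()
        content = "1" if (len(elem) or text) else "0"
        presence = "".join("1" if a in elem.attrib else "0" for a in attr_names)
        # Step 2: record the combined structure code.
        structure_table.append(code + content + presence)
        # Step 3: record the values in the linked value-table entry.
        values = [elem.attrib[a] for a in attr_names if a in elem.attrib]
        if len(elem) == 0 and text:
            values.append(text)
        value_table.append(values)
        for child in elem:
            visit(child)
        if len(elem):
            structure_table.append(END)
            value_table.append([])

    visit(ET.fromstring(xml_text))
    return structure_table, value_table


doc = '<book id="b1"><title>XML</title><price currency="JPY">1200</price></book>'
structure_table, value_table = encode(doc)
```

Each structure-table entry is the element code, one content bit, and one presence bit per attribute the sheet allows, so absent attributes cost a single bit and attribute names are never stored.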

A document data decoding program for decoding code data of original document data described in an extensible text-format structured description language, wherein
the code data includes a structure table in which a structure code for each element is recorded in an entry, and a value table in which the code of an attribute value or element value is recorded in an entry corresponding to the entry of the structure table, the program
using a conversion sheet that is described in an extensible text-format structured description language and that holds codes corresponding to element names and to the attribute names that can be specified for each element name, and executing:
a first step of extracting a structure code from the structure table;
a second step of detecting the entry of the value table corresponding to the structure code and identifying, from the structure code using the conversion sheet, the element name, the presence or absence of a child element or element value, and the existing attribute names; and
a third step of identifying the existing attribute value or element value from the value code of the corresponding entry of the value table, wherein the code data is decoded into the original document data by repeating the first through third steps.
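
The three decoding steps can be sketched as follows. The table layout is an assumption made for illustration: two-bit element codes, one content bit, one presence bit per allowed attribute, uncompressed values, no mixed content, and a reserved code `END` that closes a run of child elements. The conversion sheet is modelled as a Python dictionary, and all names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical conversion sheet: element code -> (name, ordered attribute names).
SHEET = {
    "00": ("book", ["id", "lang"]),
    "01": ("title", []),
    "10": ("price", ["currency"]),
}
END = "11"  # hypothetical reserved code that closes a run of child elements


def decode(structure_table, value_table, sheet=SHEET):
    # Step 1: structure codes are taken in order, paired with their values.
    entries = iter(zip(structure_table, value_table))

    def build(structure_code, values):
        # Step 2: identify name, content flag, and present attribute names.
        name, attr_names = sheet[structure_code[:2]]
        has_content = structure_code[2] == "1"
        present = [a for a, bit in zip(attr_names, structure_code[3:]) if bit == "1"]
        # Step 3: take attribute values, then an element value if one remains.
        elem = ET.Element(name, dict(zip(present, values)))
        rest = values[len(present):]
        if has_content and rest:
            elem.text = rest[0]
        elif has_content:
            for code, vals in entries:  # children follow until END
                if code == END:
                    break
                elem.append(build(code, vals))
        return elem

    first_code, first_values = next(entries)
    return build(first_code, first_values)


root = decode(
    ["00110", "011", "1011", "11"],
    [["b1"], ["XML"], ["JPY", "1200"], []],
)
xml_text = ET.tostring(root, encoding="unicode")
```

Because the presence bits say how many attribute values to expect, a leftover value in the entry is read as the element value; otherwise a set content bit means child elements follow.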