WO2008003310A1 - Verfahren zur kompression einer datensequenz eines elektronischen dokuments - Google Patents
Method for compressing a data sequence of an electronic document (Verfahren zur Kompression einer Datensequenz eines elektronischen Dokuments)
- Publication number
- WO2008003310A1 (application PCT/DE2007/001205)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sequence
- data sequence
- data
- compressed
- constants
- Prior art date
Classifications
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
Definitions
- the technical field of the invention is the transmission, storage and evaluation of digital data, in particular in the operation of computer networks.
- the invention relates to the compression of a data sequence of an electronic document, in particular an XML document, by means of a structure definition for the data sequence, in particular a DTD, which comprises, for at least one substructure of the data sequence, at least one declaration comprising unary operators, or unary and binary operators, of a syntax tree representing the substructure, including at least one unary Kleene operator.
- the invention further relates to compressed data or files generated by such a method.
- the invention also relates to methods for decompressing and evaluating requests for such compressed data or files.
- XML: Extensible Markup Language
- a document contains constants such as text components and structure data.
- the structure information contained in the XML file is an advantage of XML since it contains semantic information and navigation aids.
- however, the contained structure information inflates the document relative to the text values it contains, and path expressions addressing constants such as text values, in particular XPath requests, often have to navigate through similar or even identical structures several times.
- Fig. 1a illustrates an XML document. It has textual data as constants, e.g. "John Smith". It has structural data in the form of tags such as <author> or <name>; the tags also serve as start symbols or terminal symbols, such as the start symbol <name> or the terminal symbol </name>.
- the amount of data and the complexity of its processing determine the lifetime of the batteries and the maximum runtime of devices without an external power supply.
- Fig. 1b illustrates, for the XML document of Fig. 1a, a DTD which defines elements of the structure data of the XML document. A DTD categorizes document data into at least three categories: elements, attributes and entities. These categories are optional.
- the XML document of Fig. 1a and the DTD of Fig. 1b contain no attributes.
- FIG. 3a represents an XML document and FIG. 3b represents a DTD corresponding to it, in which the attributes "order number" and "rush" are provided.
- SAX: Simple API for XML
- SAX parsers are available that continuously read document data as a sequential data stream.
- a SAX parser may represent the sequence of events:
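- As an illustration of such an event sequence, the following minimal sketch uses Python's standard `xml.sax` module to record SAX events. The sample document and the names `EventRecorder`, `sax_events` and `SAMPLE` are assumptions made for this illustration and are not taken from Fig. 1a.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Collects SAX callbacks as tuples so the event sequence can be inspected."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("startDocument",))

    def endDocument(self):
        self.events.append(("endDocument",))

    def startElement(self, name, attrs):
        # attrs behaves like a mapping of attribute names to values
        self.events.append(("startElement", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        # a parser may deliver one text node in several chunks
        self.events.append(("characters", content))


def sax_events(xml_bytes):
    """Parse a complete document and return the recorded event list."""
    recorder = EventRecorder()
    xml.sax.parseString(xml_bytes, recorder)
    return recorder.events


# Hypothetical sample document, loosely modelled on the tags named above.
SAMPLE = (b"<paper><title>XML Compression</title>"
          b"<author><name>John Smith</name></author></paper>")

for event in sax_events(SAMPLE):
    print(event)
```

- The events arrive strictly in document order, which is what allows a compressor to work on a stream without holding the whole document in memory.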
- Blocks such as length or number of characters generated and saved.
- for a path query addressing constants, only one block, and not the entire compressed XML document, has to be decompressed.
- the path request is translated into such a signature, which identifies the block by matching it against the block's corresponding signature.
- the disadvantages are the translation and
- a DAG (directed acyclic graph) is a representation of the data structure of an XML document that builds on such a sequence of events (SAX data stream) generated with a SAX parser. DAGs provide elements of a SAX data stream with indices. For compression, repeated data sequences of the SAX data stream can be replaced by pointers to such indices. This creates, for example, the following DAG data stream from the sequence of events obtained above by SAX parsing, with the pointers "pointer(6)" and "pointer(8)":
- the object of the invention is to remove the redundancies mentioned, thereby enabling a smaller data size. It also allows for a lean index on the data so that path requests (e.g., in the form of XPath for XML data or query languages based thereon) can be answered efficiently. Since the method can also compute a partial output for a partial input such as part of an XML document, the method is not limited to XML data that fits completely in main memory, but can also be applied, for example, to very long data streams. The object of the invention is to provide an alternative to the prior art with such advantages.
- the invention improves a method for compressing a data sequence of an electronic document, in particular an XML document, by means of a structure definition for the data sequence, in particular a DTD, which comprises, for at least one substructure of the data sequence, at least one declaration comprising the unary operators, or unary and binary operators, of a syntax tree representing the substructure, including at least one unary Kleene operator having as argument the number of possible repetitions of the substructure in the data sequence, the data sequence comprising:
- structure information of the data sequence, in particular tags, which comprise start and terminal symbols for the constants and for the substructures of the data sequence, and at least some of the constants and the structure information.
- the improvement lies in generating a) from the data sequence a first compressed sequence, the values of the
- the invention provides a method for compressing a data sequence of an electronic document, in particular a partially compressed XML document, by means of a structure definition for the data sequence, in particular a DTD, an XML Schema, a RELAX NG compact schema or a RELAX NG schema, wherein the structure definition defines at least one substructure of the data sequence and the data sequence comprises:
- generations a) and b) are different data-processing operations; they may alternate when carrying out methods according to the invention, in order to generate a data stream in which respective parts of the first and second compressed sequences alternate with each other.
- the data sequence is e.g. an indexed sequence as shown at the beginning, such as the sequence of DAG events.
- the representatives of the values of the constants may be indices of a trie.
- the compression method according to the invention further has features of at least one of claims 2 to 8.
- the invention also relates to a compressed data sequence of an electronic document obtained by a method according to the invention and computer files with such compressed data sequences.
- the invention also relates to the decompression of a data sequence compressed according to the invention, characterized by
- the transformed data sequence is transformed back to the
- Path expression requested constant in the compressed sequence by summing up the products.
- An optional subsequent step is the decompression, according to a method of the invention, of the portion of the compressed sequence that corresponds, in the data sequence, to the substructure containing a constant requested by the path expression.
- a method is also possible wherein the calculated position is that of a pointer in the compressed sequence, characterized by decompressing a portion of the compressed sequence to which the pointer points.
- the invention also includes a computer program product implementing one or more of the methods of the invention.
- the invention thus provides a structure-oriented index, i.e. an index that provides efficient access to the structural information without the need to decompress the compressed XML data stream.
- the goal of the indexing technique is to use the structural information given by the element declarations to determine the index positions of specific textual values.
- a basic idea of our structure-preserving compression method is thus to minimize the structural information that is redundant given the DTD.
- a basic idea of our index access to the compressed XML data is to calculate the size of all XML document subtrees, and to use these computed subtree sizes to determine how much of the compressed XML stream can be skipped to reach an element that is specified, e.g., by an XPath request.
- a compressed XML document is referred to below as a cXML document.
- Fig. 1a exemplifies XML data;
- Fig. 1b exemplifies a DTD associated with the data of Fig. 1a;
- Fig. 2 generally illustrates the syntax trees of the element declarations of the DTD of Fig. 1b;
- Fig. 3a exemplifies XML data containing attributes;
- Fig. 3b exemplifies a DTD associated with the data of Fig. 3a;
- Fig. 4 concretely illustrates the syntax trees of the element declarations of the DTD of Fig. 3b;
- FIG. 5 schematically illustrates generation of grammar rules employed in the invention
- FIG. 7 schematically shows a method according to the invention for data decompression
- FIG. 8 schematically shows a method according to the invention for evaluating path expressions on data compressed according to the invention
- FIG. 9 schematically shows bundles of the data streams of data processing schematized in FIGS. 5 to 8.
- Sequence operator ',': defines an order of the child nodes.
- Choice operator '|': specifies an alternative, i.e. an exclusive selection from a set of possible child elements.
- Kleene operator '*': a repetition operator that defines, for a set of child elements, the number of possible repetitions and can also specify a minimum and/or maximum number of repetitions.
- Option operator '?': determines for a child node that it is optional, i.e. it may occur either 0 times or 1 time.
- the Kleene operator '+' shown in Fig. 4 is the Kleene operator of a subtree of an element declaration
- any right-hand side of an element declaration can be considered a regular expression consisting of 0 or more child elements, the #PCDATA expression or the EMPTY expression, combined using operators such as Sequence (,) and Choice (|).
- the (+) operator can be expressed by the sequence and the Kleene operator, i.e. (E)+ is replaced by (E, (E)*).
- the (?) operator can be replaced by the choice operator and the EMPTY expression, i.e. (E)? is replaced by (E | EMPTY).
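- The two replacements can be sketched as a small tree rewrite. This is a minimal sketch under the assumption that content models are encoded as nested tuples; the tags `"plus"`, `"opt"`, `"seq"`, `"choice"`, `"kleene"`, `"elem"`, `"pcdata"` and `"empty"` are hypothetical names chosen for this illustration.

```python
def normalize(node):
    """Rewrite '+' and '?' so that only sequence, choice and Kleene remain.

    (E)+ becomes (E, (E)*) and (E)? becomes (E | EMPTY).
    """
    kind = node[0]
    if kind in ("elem", "pcdata", "empty"):
        return node  # leaves are already in normal form
    if kind == "plus":
        inner = normalize(node[1])
        return ("seq", inner, ("kleene", inner))
    if kind == "opt":
        return ("choice", normalize(node[1]), ("empty",))
    if kind == "kleene":
        return ("kleene", normalize(node[1]))
    # remaining kinds are the binary operators "seq" and "choice"
    return (kind, normalize(node[1]), normalize(node[2]))


# (title, chapter+, note?) written as a binary tree of hypothetical elements
model = ("seq", ("elem", "title"),
         ("seq", ("plus", ("elem", "chapter")), ("opt", ("elem", "note"))))
print(normalize(model))
```

- After normalization only three operator kinds remain, which matches the later statement that only alternatives, repetitions and options need to be recorded for a given document.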
- the DTD shown in Figure 1 b contains element declarations, not attributes.
- One of the element declarations of an element E1 of the DTD shown in FIG. 1b has the
- for a parser that compares the structure-defining information of the DTD with an associated, i.e. valid, XML document, the structure-defining information is represented by the node types described below, which determine the order and number of child nodes.
- the element declaration of the DTD is parsed in step S1 and converted into a binary syntax tree in step S2, as shown generally in Fig. 2 for the XML document of Fig. 1a and the DTD of Fig. 1b.
- Fig. 4 illustrates the syntax trees for the XML document of Fig. 3a and the DTD of Fig. 3b, respectively.
- the structure-defining information is compared with a valid XML document. It is determined during parsing that much of the information in the structure portion of the XML data is redundant since it is already specified by the DTD. Specifically, the following operators are already defined by each element declaration: The names of all child elements, whether it is PCDATA
- In Figs. 2 and 4, nodes to which operators are assigned are represented by the symbols of the corresponding operators given above.
- ElementDeclaration(Name): a node of this type is created for each element defined by an element declaration. Each element declaration node has exactly one successor node, which is the root node of the subtree that represents the right side of the element declaration.
- the name of the defined element of the element declaration is set as a parameter.
- Child element: a node of this type is created for each occurrence of an element in the right side of an element declaration. The name of this element is set as a parameter.
- a father-child relationship is set between the declared element and the child element.
- the child element node has no successors.
- PCDATA: the PCDATA node is used when the right side of an element declaration contains the string '#PCDATA'.
- the PCDATA node has no parameters and no successors.
- the PCDATA operator assigned to it determines for an element that it contains a text node and not an element child node.
- Sequence: a node of this type is generated for each sequence operator ',' that appears in the right side of an element declaration.
- a sequence node has no parameters.
- Each sequence node has two descendants, which represent the two arguments of the sequence operator in the element declaration.
- Choice: a node of this type is generated for each choice operator '|' that appears in the right side of an element declaration.
- a choice node has no parameters.
- Each choice node has two descendants that represent the two arguments of the choice operator in the element declaration.
- the choice operator determines an alternative or exclusive selection from a set of possible child elements.
- Kleene: a node of this type is generated for every Kleene operator '*' that appears in the right side of an element declaration.
- a Kleene node has no parameters. Every Kleene node has exactly one successor, which represents the argument of this Kleene operator in the element declaration.
- EMPTY: a node of this type is generated for each occurrence of the string 'EMPTY' in an element declaration of the DTD.
- Element declaration of the DTD sets an element-attribute relationship between the declared element and the attribute.
- the DTD of Fig. 3b shows a definition <!ATTLIST order ...>.
- the set of start-terminal symbols for each node is calculated except for the root node of each element declaration syntax tree.
- These sets of start-terminal symbols allow a parser, which is used to compress or process path expressions, to decide which grammatical rule to apply during parsing.
- the set of start-terminal symbols is calculated according to the following relationships for the data and structures shown in FIGS. 1a, 1b and 2:
- STS(Sequence(a, b)) := STS(a), if EMPTY ∉ STS(a)
- STS(Sequence(a, b)) := STS(a) ∪ STS(b), if EMPTY ∈ STS(a)
- STS(Choice(a, b)) := STS1 ∪ STS2
- STS1 := STS(a)
- STS2 := STS(b)
- STS(Kleene(a)) := {EMPTY} ∪ STS(a)
- the set of start-terminal symbols of two sibling nodes that are successors of a choice node are always disjoint, so that the currently considered node of the input XML document always decides which alternative has been chosen.
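- The relationships for Sequence, Choice and Kleene can be sketched as a recursive function, together with a check of the disjointness property of choice nodes. The tuple encoding of syntax-tree nodes and the base cases for child elements, PCDATA and EMPTY are assumptions of this sketch, since they are not spelled out in the relationships listed.

```python
LEAVES = ("elem", "pcdata", "empty")


def sts(node):
    """Set of start-terminal symbols of a syntax-tree node."""
    kind = node[0]
    if kind == "elem":        # assumed base case: the element name itself
        return {node[1]}
    if kind == "pcdata":      # assumed base case
        return {"#PCDATA"}
    if kind == "empty":       # assumed base case
        return {"EMPTY"}
    if kind == "kleene":
        return {"EMPTY"} | sts(node[1])
    if kind == "choice":
        return sts(node[1]) | sts(node[2])
    # "seq": the second argument contributes only if the first may be empty
    first_a = sts(node[1])
    return first_a | sts(node[2]) if "EMPTY" in first_a else first_a


def choices_are_disjoint(node):
    """True if the two successors of every choice node have disjoint sets."""
    if node[0] in LEAVES:
        return True
    ok = all(choices_are_disjoint(child) for child in node[1:])
    if node[0] == "choice":
        ok = ok and sts(node[1]).isdisjoint(sts(node[2]))
    return ok


# hypothetical content model: (name, (email | mailbox))
author = ("seq", ("elem", "name"),
          ("choice", ("elem", "email"), ("elem", "mailbox")))
print(sts(author), choices_are_disjoint(author))
```

- With disjoint sets, a parser needs only the next symbol of the input to decide which alternative of a choice node applies.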
- the grammar rule generated for an element declaration node has the defined element as a parameter, and whenever a child element node appears as a successor of a node P, the grammar rule generated for P contains a child element call for the corresponding element declaration node of the element declaration syntax tree.
- ElementDeclaration(E1) → Kleene(N0, {E2}).
- Kleene(N0, {E2}) → Sequence(N1, {E2}).
- the extended grammar rule can be interpreted as follows: Consume a start tag of the element E1 from the input, dump nothing into the compressed XML file, and then continue with the application of the
- each grammar rule for a Kleene operator has a node ID Ni and a set S of start-terminal symbols as parameters
- Child element calls in a grammar rule result in a call to the grammar rule for the element passed as a parameter.
- the example explains the application of the grammar rules for data compression according to the process scheme of FIG. 6.
- the illustrated scheme has the following sequence of process steps: the above-mentioned step S3 of generating grammar rules, a step S4 of reading in a SAX event, a step S5 of applying the grammar rules to the SAX event, a step S6 of generating a compressed output
- steps S4 to S6 are executed in a loop, so that the SAX events are processed sequentially.
- the DTDs shown in Figs. 1b and 3b contain five different operators: '?', '*', '+', ',' and '|'. The operators ',' and '|' are binary operators.
- the DTD according to the above attribute grammar is needed only for the following three operator types in addition to the given structure-defining information from the present XML document of Fig. 1a, the original structure of that XML document from the
- XML document shown in Fig. 1a and the DTD shown in Fig. 1b together contain the following information:
- the paper has the title “XML Compression” and an author named “John Smith” who can be contacted via mailbox "0815" (second alternative) and has no email address.
- the paper contains 2 chapters with the respective titles and
- the second chapter contains no text (EMPTY), but instead contains two sub-chapters with the respective titles and contents. These subchapters, like the first chapter of the paper, contain no further chapters (as expressed by the number 0 in chapters).
- With the above-explained additional information of the operators of the DTD shown in Fig. 1b and the constants of the XML document shown in Fig. 1a, this information is compressed by step S5 as follows:
- PCDATA calls in step S5 associate each text constant with its respective ID, which is written to the output stream in step S6.
- a getIDfor function determines the ID of a given text constant, and if the text constant occurs for the first time, a new ID for the text constant is generated, and the pair
- EMPTY() → {ID: getIDfor(EMPTY)}, [] / [ID].
- the trie assigns the ID '00' to the constants
- in step S5 the constants are assigned to IDs as follows: Trie: 0815 (02) A (algorithms (07) pproach (06)) EMPTY (11) In ((order to compress ...
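- The behaviour described for getIDfor can be sketched as follows. This sketch swaps the trie for a plain Python dictionary and assumes two-digit IDs handed out in order of first occurrence; the class name `ConstantStore` and the method name `get_id_for` are hypothetical.

```python
class ConstantStore:
    """Maps each distinct text constant to a stable ID (dictionary, not a trie)."""

    def __init__(self):
        self._ids = {}

    def get_id_for(self, text):
        # a constant seen for the first time receives the next free ID
        if text not in self._ids:
            self._ids[text] = f"{len(self._ids):02d}"
        return self._ids[text]

    def constants(self):
        """All stored constants in ID order, as needed for decompression."""
        return list(self._ids)


store = ConstantStore()
for text in ["XML Compression", "John Smith", "0815", "John Smith"]:
    print(text, "->", store.get_id_for(text))
```

- A trie would additionally share common prefixes between constants, as in the listing above where "algorithms" and "approach" share the prefix "A"; the dictionary only reproduces the ID assignment.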
- the above cXML document 001012021103204050061120708009100 is divided into two parts: the above constant stream - 0001020304050607080910 and a document index stream with the structure information - 12112011200
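- The division into two streams can be checked with a short sketch. The token list below is one tokenization that is consistent with the three digit strings quoted above: two-digit tokens are taken to be constant IDs and one-digit tokens to be index entries. The tagging with "C" and "I" and the names `TOKENS` and `split_streams` are assumptions of this illustration.

```python
# ("C", id) = constant ID, ("I", value) = document index entry
TOKENS = [
    ("C", "00"), ("I", "1"), ("C", "01"), ("I", "2"), ("C", "02"),
    ("I", "1"), ("I", "1"), ("C", "03"), ("I", "2"), ("C", "04"),
    ("C", "05"), ("I", "0"), ("C", "06"), ("I", "1"), ("I", "1"),
    ("I", "2"), ("C", "07"), ("C", "08"), ("I", "0"), ("C", "09"),
    ("C", "10"), ("I", "0"),
]


def split_streams(tokens):
    """Separate an interleaved token list into (constant stream, index stream)."""
    constant_stream = "".join(value for kind, value in tokens if kind == "C")
    index_stream = "".join(value for kind, value in tokens if kind == "I")
    return constant_stream, index_stream


interleaved = "".join(value for _, value in TOKENS)
constant_stream, index_stream = split_streams(TOKENS)
print(interleaved)
print(constant_stream)
print(index_stream)
```

- In the method itself the grammar rules decide which kind of token comes next, so no explicit tags need to be stored.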
- This structure information represents the syntax tree, also called Kleene Size Tree, shown in FIG. 2 and contains the following content:
- The constant stream and the document index stream are generated in step S6 according to the following two specifications:
- the structure definition is identified by one of the three operator types (alternative, repetition or option)
- the situation determined in the XML document, i.e. which alternative is present, the number of repetitions, or whether an option has been selected in the present XML document, is written to the document index stream.
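- The following minimal sketch illustrates the two specifications: a small document is walked against its syntax trees, the repetition count of every Kleene node and the number of the chosen alternative of every choice node go to the document index stream, and IDs of text constants go to the constant stream. The tuple encodings, the sample `DTD` and `DOC`, the 1-based numbering of alternatives and the position of the repetition count in the stream are all assumptions of this sketch, not the patent's grammar rules.

```python
# Syntax trees as tuples: ("elem", name) | ("pcdata",) | ("empty",)
#                         ("seq", a, b) | ("choice", a, b) | ("kleene", a)
DTD = {
    "paper": ("seq", ("elem", "title"), ("kleene", ("elem", "chapter"))),
    "chapter": ("choice", ("elem", "text"), ("elem", "note")),
    "title": ("pcdata",), "text": ("pcdata",), "note": ("pcdata",),
}
# Elements as (name, content); content is a text string or a list of elements.
DOC = ("paper", [("title", "XML Compression"),
                 ("chapter", [("text", "Intro")]),
                 ("chapter", [("note", "Intro")])])


def first(node):
    """Start-terminal symbols, used to pick alternatives and count repetitions."""
    kind = node[0]
    if kind == "elem":
        return {node[1]}
    if kind == "pcdata":
        return {"#PCDATA"}
    if kind == "empty":
        return {"EMPTY"}
    if kind == "kleene":
        return {"EMPTY"} | first(node[1])
    if kind == "choice":
        return first(node[1]) | first(node[2])
    fa = first(node[1])
    return fa | first(node[2]) if "EMPTY" in fa else fa


def compress(doc, dtd=DTD):
    index, constants, ids = [], [], {}

    def peek(children, pos):
        if pos >= len(children):
            return "EMPTY"
        child = children[pos]
        return "#PCDATA" if isinstance(child, str) else child[0]

    def element(el):
        name, content = el
        children = [content] if isinstance(content, str) else list(content)
        assert walk(dtd[name], children, 0) == len(children), "invalid document"

    def walk(node, children, pos):
        kind = node[0]
        if kind == "elem":
            element(children[pos])
            return pos + 1
        if kind == "pcdata":
            constants.append(ids.setdefault(children[pos], len(ids)))
            return pos + 1
        if kind == "empty":
            return pos
        if kind == "seq":
            return walk(node[2], children, walk(node[1], children, pos))
        if kind == "choice":
            alt = 1 if peek(children, pos) in first(node[1]) else 2
            index.append(alt)              # which alternative is present
            return walk(node[alt], children, pos)
        slot = len(index)                  # kleene: reserve a slot for the count
        index.append(0)
        while pos < len(children) and peek(children, pos) in first(node[1]):
            pos = walk(node[1], children, pos)
            index[slot] += 1
        return pos

    element(doc)
    return index, constants


print(compress(DOC))
```

- For the sample, the index stream records two chapter repetitions followed by alternative 1 and alternative 2, while tag names never appear in the output because the DTD already fixes them.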
- Example 3 Compression with removal of redundancies in the XML data
- a DAG is constructed from the SAX event stream of the XML data of Fig. 3a, as described in connection with Figs. 3a and 3b and as shown in the publication by Koch et al.
- Incoming SAX events of the type startDocument, character, endElement, and endDocument that belong to a first occurring subtree are passed directly and unchanged as a DAG event to a second preprocess to remove redundancies based on the given schema information.
- Incoming SAX events of type startElement that belong to a first-occurring subtree are split into a startElement event and multiple attribute events.
- an event startElement customer, attList
- Incoming SAX events that belong to a recurring subtree are not forwarded at all. Only for the root of each repeated sub-tree W of a first occurring sub-tree E, where W is not part of a more comprehensive repeated sub-tree, a DAG event is generated and forwarded.
- This DAG event contains a reference to the preceding DAG event startElement of the root element of the first occurring subtree. This reference is implemented as a backward reference with a relative distance, shown in the input DAG as pointer(6) or pointer(8).
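- The replacement of repeated subtrees by relative backward references can be sketched as follows. The sketch works on an in-memory tree of nested tuples rather than on a SAX stream, and it measures the distance in events; both choices, the tuple encoding and the name `to_dag_events` are assumptions made for this illustration.

```python
def to_dag_events(tree):
    """Emit events for a tree, replacing each repeated subtree by one pointer.

    Elements are (name, (child, ...)) tuples; text nodes are plain strings.
    """
    events = []
    first_seen = {}  # subtree -> index of its first startElement event

    def visit(node):
        if isinstance(node, str):
            events.append(("characters", node))
            return
        if node in first_seen:
            # only the root of the repeated subtree produces an event
            events.append(("pointer", len(events) - first_seen[node]))
            return
        first_seen[node] = len(events)
        name, children = node
        events.append(("startElement", name))
        for child in children:
            visit(child)
        events.append(("endElement", name))

    visit(tree)
    return events


item = ("item", ("A",))
for event in to_dag_events(("order", (item, item))):
    print(event)
```

- Because the pointer refers to the root of the outermost repeated subtree, nested repetitions inside it produce no further events.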
- the compression is essentially the same as in Example 2. Unlike in Example 2, however, DAG events are consumed.
- a DAG event is a reference to a previous DAG event
- a reference is written to the compressed data as follows: if there is an entry in the document index stream for the preceding DAG event, a reference to that entry is set; otherwise a reference is set in the constant stream to the position representing the first entry in the constant stream for this DAG event. Since each node n can be uniquely identified by the text node t reachable by a first-child path and the distance from t to n, a reference in the constant stream consists of the two-tuple (distance to node t, position of node t in the constant stream).
- the document index stream and the constant stream are calculated as follows from a set of given syntax trees and a DAG or SAX event stream.
- the operations for traversing the syntax trees are:
- n.numberOfChildNodes() // Returns the number of child nodes of node n of a syntax tree generated from the structure definitions.
- n.childNode(i) // Returns the i-th child node of node n in a syntax tree generated from the structure definitions.
- a source code example for the compression is as follows:
- Event e = readNextEventFromDAGStream(); if (e is a pointer) { addPointer(e.id()); deleteNextEventFromDAGStream();
- Event e = readNextEventFromDAGStream(); if (e is a pointer) { addPointer(e.id()); deleteNextEventFromDAGStream(); } else { /* e is startElement */
- { pos = DocumentIndexStream.write(i); IndexStore.add(e.id(), pos);
- Event e = readNextEventFromDAGStream(); if (e is a pointer) { addPointer(e.id()); deleteNextEventFromDAGStream(); } else { /* e is startElement */
- ChildElement(name): { comp(elementDeclaration(n.label())); }
- Event e = readNextEventFromDAGStream(); deleteNextEventFromDAGStream(); if (e is a pointer) addPointer(e.id()); else { /* e is attribute(id, name, value) */ ConstantStore.add(e.id(), UNKNOWN); path.push(e.id()); /* put node on the stack */ writeConstant(e.value()); path.pop(); }
- Event e = readNextEventFromDAGStream(); writeConstant(e.label()); }
- Table 1 below shows in the middle column the sequence of tokens of the document index stream from the DAG indicated at the beginning of FIGS. 3a and 3b and operators or nodes in the right column.
- Depending on the operator, tokens may be repetitions, alternatives, or the presence of optional values.
- a token is provided as a pointer to the six lines above the preceding token of the table. This represents the two consecutive similar table entries from this previous token and its successor.
- Table 2 below shows in the middle column the sequence of constants of the constant stream from the DAG given at the beginning of Figs. 3a and 3b.
- a token is a pointer to the entry eight lines above; it represents that preceding constant and its subsequent constants, which correspond to the same element declaration in Fig. 3b.
- the line entries are categorized, depending on whether they are pointers or not (ie, constants).
- the corresponding event ID of the DAG events of type startElement or of type attribute is stored in main memory within a given window for the location of each token in the stream. If a reference V to a DAG event D is read as the next event instead of a DAG event of type startElement or of type attribute, it is first checked whether the document index stream contains an entry E that represents D. If there is such an entry E, a reference to E is inserted in the document index stream for V. If there is no entry E representing D in the document index stream, V is represented by a reference in the constant stream to the first constant stored in the constant stream for D.
- Kleene Size Tree: this Kleene Size Tree is structured as follows: whenever a Kleene operator K1 appears within another Kleene operator K2, K2 is contained in the subtree with the root K1.
- Table 1 shows, the following is stored for each Kleene node in the Kleene size tree: How many index positions can be skipped in the cXML data stream if the subtree that has the Kleene node as its root, to answer the
- To use the Kleene Size Tree to access paths in the cXML stream, we need to know the following two things for each node of the Kleene Size Tree: the number of descendants and the position of the descendants in the index stream. In other words, we have to traverse the Kleene Size Tree. Since the compression results of all other node types have fixed sizes, we can determine all relative positions using the Kleene Size Tree and Table 3.
- the compressed data also allow the evaluation of XPath requests on the compressed XML data stream.
- the XPath query to be evaluated cannot contain any backward axes, as the evaluation of XPath requests in Example 5 uses an adaptation of the procedure from P. M. Tolani and J. R. Haritsa, "XGRIND: A Query-friendly XML Compressor", in Proc. ICDE 2002, pages 225-234, IEEE Computer Society, 2002.
- This approach optimizes the evaluation of XPath queries on XML documents, and in particular on XML data streams, so that the entire document (or stream) must be read only once.
- the buffer size needed to temporarily cache elements depends on the use of filters.
- the schema information of the DTD is used to reduce the computation time to decide the satisfiability of filters, and parts of the
- Decompression of compressed data generated according to Example 3 proceeds by reversing the order of the phases executed for compression: 1. decompression into a DAG event stream
- the set of grammar rules, with the roles of input and output streams interchanged, is also used for decompression according to the process scheme of Fig. 7, with a step S3 of reading in the grammar rules and a step S7 in which the syntax trees are traversed as in compression.
- in a step S8 of applying the grammar rules, the tokens of the document index stream and the constant stream required for decompression are consumed, and a decompressed output is generated
- the root node of the referenced subtree is determined based on the reference position and the level information, and a reference to the DAG event of that root node is generated in the DAG event stream.
- a source code example for decompression is as follows:
- decomp root node
- // call for decompression stack path // path from the root of the XML document to
- Root of the subtree is the element that is // removed from the stack end in the // root path level.
- ChildElement(name): { decomp(elementDeclaration(n.label())); } Attribute
- PCDATA: { x = ConstantStream.read(); if (x is a pointer) createPointer(x); else DAG.character(x); }
- the original SAX input stream is calculated from the DAG event stream.
- all endElement and character events are forwarded directly to the SAX stream.
- the SAX events of type startElement are composed of a DAG event of type startElement and the subsequent DAG events of type attribute - without the respective IDs.
- all events including ID are buffered in main memory.
- the referenced DAG event is jumped back and the events of the subtree starting with this event are output.
- An implementation may also be implemented in another language, such as in Java using a SAX parser and JavaCC attribute grammar. Then an extended rule set for compression and a second reverse implementation for decompression should be generated.
- Example 5 Evaluate XML path requests using the set of grammar rules on the compressed data
- the process scheme has the following steps:
- the steps S11 to S14 are looped through to find the information sought.
- the syntax trees are also run through to evaluate path expressions, and the next token of the document index stream is read, if necessary.
- a DAG event stream is not generated, but only the structure of the XML data in main memory is partially reconstructed by decompression.
- not all constants of an XML subtree are reconstructed. Only when the evaluation of the path expression on the partial XML structure index shows that a constant is needed to evaluate a filter or to output the result is the relative position of that constant in the compressed data (more precisely, in the corresponding packet of the constant stream) determined.
- a first element E1, with a corresponding syntax tree schematized as in Fig. 2, has 10 E3 or E4 child nodes, and the second element E1 has 5 E3 or E4 child nodes. Since the syntax tree generated for E1 contains only one Kleene operator, the Kleene Size Tree contains 1 node for every occurrence of a node E1 in the XML input document. Therefore, the Kleene Size Tree contains the following size information:
- E1_1 size = IDsize + 10 * (choice_size + IDsize)
- E1_2 size = IDsize + 5 * (choice_size + IDsize)
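- The two size entries follow one formula with the repetition count as the only variable, which the sketch below evaluates. The concrete unit sizes (`id_size=2`, `choice_size=1`) and the function name `kleene_subtree_size` are assumptions chosen only to make the arithmetic visible.

```python
def kleene_subtree_size(repetitions, id_size, choice_size):
    """Size of one E1 occurrence: IDsize + n * (choice_size + IDsize)."""
    return id_size + repetitions * (choice_size + id_size)


# Assumed unit sizes for illustration only.
ID_SIZE, CHOICE_SIZE = 2, 1

first_e1 = kleene_subtree_size(10, ID_SIZE, CHOICE_SIZE)
second_e1 = kleene_subtree_size(5, ID_SIZE, CHOICE_SIZE)

# Skipping the first E1 means advancing by its size; no decompression needed.
offset_of_second_e1 = first_e1
print(first_e1, second_e1, offset_of_second_e1)
```

- With these assumed sizes the first E1 occupies 32 positions and the second 17, so a path expression that selects the second E1 can skip 32 positions of the compressed stream.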
- the document index stream and the structure definition are usually much smaller than the actual structure portion of the XML data, so that the XML structure index can be built faster, which promises a faster evaluation of the path requests.
- queries that can be answered completely on the XML structure index, i.e. that do not require any access to the constant stream, can be answered many times faster than on the original XML data.
- requests are, for example, count requests on the structure, for example according to the number of orders for an XML document shown in FIG. 3a.
- the relative position can be calculated by traversing the syntax
- Our method is not limited to compressing only fixed size XML documents.
- the method is even capable of compressing continuous (quasi-infinite) XML data streams, since the whole document need not be known before compression can begin, i.e. the extended grammar rules can be used to compress on the fly.
- our method is not limited to non-recursive DTDs, i.e. all ideas can be applied even if, for example, the element declaration of E1 is changed to the following recursive element declaration: <!ELEMENT E1 (E2, (E1
- Calculation and path evaluation are equally applicable to recursive DTDs and XML data streams.
- from the XML stream 1 in Fig. 9 we obtain three different types of output streams of the compressed data 7: the second compressed sequence 5, the first compressed sequence 6 of the constants and IDs, and a conventionally compressed XML data stream 4 containing all remaining values required to reconstruct the XML data sequence.
- these three data streams are bundled into a stream 7 on the sender side.
- a grammar rule generator 3 consumes a DTD 2 and an XML data stream 1 according to the operation illustrated in Example 1 to output the compressed data 7.
- the decompressed data stream 8 is obtained with the method V1 schematized in Fig. 7. With the method V2 schematized in Fig. 8, path requests can be executed and the specifically requested data output in a stream 9, possibly also via a decompression of subtrees V3 as decompressed stream 10.
- Input for the method are the default structure-defining information of the document as well as the XML data, e.g. in the form of a SAX event stream.
- the procedure starts the compression immediately after receiving the first SAX
- Output of the compression method are two data streams, the document index stream and the constant stream.
- the document index stream, together with the structure-defining information, e.g. the DTD, is a complete index on the structure of the original XML data and can therefore be used for a quick evaluation of path queries.
- the constant stream contains finely-granulated
- Both streams may contain references to previous locations in the stream that allow redundant subtrees to be stored only once (and even compressed only once).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Document Processing Apparatus (AREA)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE112007001386T DE112007001386A5 (de) | 2006-07-07 | 2007-07-09 | Verfahren zur Kompression einer Datensequenz eines elektronischen Dokuments |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE102006031805 | 2006-07-07 | ||
DE102006031805.6 | 2006-07-07 | ||
DE102007031431.2 | 2007-07-05 | ||
DE102007031431 | 2007-07-05 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008003310A1 (de) | 2008-01-10 |
Family
ID=38565606
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/DE2007/001205 WO2008003310A1 (de) | 2006-07-07 | 2007-07-09 | Verfahren zur kompression einer datensequenz eines elektronischen dokuments |
Country Status (2)
Country | Link |
---|---|
DE (1) | DE112007001386A5 (de) |
WO (1) | WO2008003310A1 (de) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1376388A2 (de) * | 2002-06-21 | 2004-01-02 | Microsoft Corporation | Verfahren und System zur Kodierung eines Markierungssprachedokumentes |
US6681221B1 (en) * | 2000-10-18 | 2004-01-20 | Docent, Inc. | Method and system for achieving directed acyclic graph (DAG) representations of data in XML |
US20040013307A1 (en) * | 2000-09-06 | 2004-01-22 | Cedric Thienot | Method for compressing/decompressing structure documents |
2007
- 2007-07-09 WO PCT/DE2007/001205 patent/WO2008003310A1/de active Application Filing
- 2007-07-09 DE DE112007001386T patent/DE112007001386A5/de not_active Withdrawn
Non-Patent Citations (7)
Title |
---|
BUNEMAN P ET AL: "Path Queries on Compressed XML", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON VERY LARGE DATA BASES, 12 September 2003 (2003-09-12), pages 1 - 12, XP002320151 * |
DONGWON LEE ET AL: "CPI: constraints-preserving inlining algorithm for mapping XML DTD to relational schema", DATA & KNOWLEDGE ENGINEERING ELSEVIER NETHERLANDS, vol. 39, no. 1, October 2001 (2001-10-01), pages 3 - 25, XP002456705, ISSN: 0169-023X * |
FREDKIN E: "TRIE MEMORY", COMMUNICATIONS OF THE ASSOCIATION FOR COMPUTING MACHINERY, ACM, NEW YORK, NY, US, vol. 3, no. 9, August 1960 (1960-08-01), pages 490 - 499, XP002271883, ISSN: 0001-0782 * |
GIORGIO BUSATTO ET AL: "Efficient Memory Representation of XML Documents", DATABASE PROGRAMMING LANGUAGES LECTURE NOTES IN COMPUTER SCIENCE;;LNCS, SPRINGER-VERLAG, BE, vol. 3774, 2005, pages 199 - 216, XP019026285, ISBN: 3-540-30951-9 * |
LEVENE M ET AL: "XML STRUCTURE COMPRESSION", INTERNATIONAL WORLD WIDE WEB CONFERENCE, XX, XX, 2003, pages 1 - 14, XP009048149 * |
SHINAGAWA N ET AL: "Constructing XML views over binary data", DATABASE ENGINEERING AND APPLICATIONS SYMPOSIUM, 2004. IDEAS '04. PROCEEDINGS. INTERNATIONAL COIMBRA, PORTUGAL JULY 7-9, 2004, PISCATAWAY, NJ, USA,IEEE, 7 July 2004 (2004-07-07), pages 470 - 474, XP010713987, ISBN: 0-7695-2168-1 * |
SUNDARESAN N ET AL: "Algorithms and programming models for efficient representation of XML for Internet applications", COMPUTER NETWORKS, ELSEVIER SCIENCE PUBLISHERS B.V., AMSTERDAM, NL, vol. 39, no. 5, 5 August 2002 (2002-08-05), pages 681 - 697, XP004369439, ISSN: 1389-1286 * |
Also Published As
Publication number | Publication date |
---|---|
DE112007001386A5 (de) | 2009-03-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
DE60213760T2 (de) | Verfahren zur kompression und dekompression eines strukturierten dokuments | |
EP1522028B9 (de) | Verfahren und vorrichtungen zum kodieren/dekodieren von strukturierten dokumenten, insbesondere von xml-dokumenten | |
DE10301362B4 (de) | Blockdatenkompressionssystem, bestehend aus einer Kompressionseinrichtung und einer Dekompressionseinrichtung, und Verfahren zur schnellen Blockdatenkompression mit Multi-Byte-Suche | |
DE60107964T2 (de) | Vorrichtung zur kodierung und dekodierung von strukturierten dokumenten | |
DE69532775T2 (de) | Verfahren zur Datenkomprimierung und -Dekomprimierung und zugehöriges Datenkomprimierungs- und Dekomprimierungsgerät | |
DE69916661T2 (de) | Codebuchkonstruktion für entropiekodierung von variabler zu variabler länge | |
Bruggemann-Klein et al. | Regular tree and regular hedge languages over unranked alphabets | |
EP1766982B1 (de) | Verfahren zum codieren eines xml-dokuments, sowie verfahren zum decodieren, verfahren zum codieren und decodieren, codiervorrichtung, decodiervorrichtung und vorrichtung zum codieren und decodieren | |
WO2007075690A2 (en) | A compressed schema representation object and method for metadata processing | |
US7500184B2 (en) | Determining an acceptance status during document parsing | |
DE60225785T2 (de) | Verfahren zur codierung und decodierung eines pfades in der baumstruktur eines strukturierten dokuments | |
EP1561281B1 (de) | Verfahren zur erzeugung eines bitstroms aus einem indizierungsbaum | |
Levene et al. | XML Structure Compression. | |
WO2016110356A1 (de) | Verfahren zur integration einer semantischen datenverarbeitung | |
US20060184874A1 (en) | System and method for displaying an acceptance status | |
WO2008003310A1 (de) | Verfahren zur kompression einer datensequenz eines elektronischen dokuments | |
US20060212799A1 (en) | Method and system for compiling schema | |
Böttcher et al. | XML index compression by DTD subtraction. | |
DE19653133C2 (de) | System und Verfahren zur pre-entropischen Codierung | |
Müldner et al. | Annotated trees and their applications to XML compression | |
EP2264626A1 (de) | Verfahren und Vorrichtung zum speichereffizienten Suchen mindestens eines Anfragedatenelementes | |
Böttcher et al. | Compressing XML data streams with DAG+ BSBC | |
Böttcher et al. | Using XML Schema Subtraction to Compress Electronic Payment Messages | |
Leighton | Two new approaches for compressing XML | |
Tollefson | Importing and Creating Data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 07785603 Country of ref document: EP Kind code of ref document: A1 |
| WWE | Wipo information: entry into national phase | Ref document number: 1120070013869 Country of ref document: DE |
| NENP | Non-entry into the national phase | Ref country code: RU |
| REF | Corresponds to | Ref document number: 112007001386 Country of ref document: DE Date of ref document: 20090319 Kind code of ref document: P |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 07785603 Country of ref document: EP Kind code of ref document: A1 |