US20110283183A1 - Method for compressing/decompressing structured documents - Google Patents

Method for compressing/decompressing structured documents

Info

Publication number
US20110283183A1
Authority
US
United States
Prior art keywords
document
elements
coded
type
schema
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/190,692
Inventor
Cedric Thienot
Claude Seyrat
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Expway SA
Original Assignee
Expway SA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Expway SA
Priority to US13/190,692
Publication of US20110283183A1
Legal status: Abandoned

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/235 Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H04N21/2353 Processing of additional data, e.g. scrambling of additional data or processing content descriptors specifically adapted to content descriptors, e.g. coding, compressing or processing of metadata
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/20 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • H04N19/25 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding with scene description coding, e.g. binary format for scenes [BIFS] compression

Definitions

  • the present invention relates to a method for compressing/decompressing structured documents.
  • Some compression algorithms are designed to process the document's binary data directly, without taking account of the data type. These algorithms have the advantage of being able to process any document, but are ineffective (low compression rate) in processing bulky documents, which are generally of the sound or image type.
  • a structured document is a collection of data sets each associated with a type, and arranged together according to mainly hierarchical relationships. These documents employ a structuring language such as SGML, HTML, XML, making it possible particularly to distinguish the different data sets composing the document. In contrast, in a so-called linear document, the document's content information is mixed with the presentation and typing information.
  • a structured document thus includes locators or markers separating the different data sets of the document.
  • locators known as “tags” are of the form “<XXXX>” and “</XXXX>”, the first marker indicating the start of the data set “XXXX” and the second the end of this set.
  • a data set may be composed of several lower level data sets.
  • a structured document has a hierarchical or tree structure schema, each node representing a data set and being connected to a higher hierarchical level node representing a data set which contains the lower level data sets.
  • the nodes located at the end of a branch of this tree structure represent data sets containing data of a predefined type, which cannot be broken down into data subsets.
  • a structured document is generally associated with what is called a structure schema setting out in rule form the structure and type of information of each data set of the document.
  • a schema is constituted by nested groups of data set structures, these groups being for example ordered sequences, alternative element groups or necessary element groups, sequenced or non-sequenced.
  • a structured document is thus associated with a structure schema and contains separation markers represented in the form of textual or binary data, these markers delimiting data sets which are themselves able to contain other data sets delimited by markers.
  • a document structured in this way is able to include not only textual data, but also any other type of information (for example sound data, images, etc.). Consequently the specific compression algorithms of one particular type of data are ineffective and ill adapted in respect of processing this type of document.
  • the purpose of the present invention is to remove these drawbacks. To this end, it proposes a method for compressing and decompressing a structured document associated with at least one tree structure schema defining a structure of the document and including nested structure elements representing data sets, the structure elements being distributed in three categories, namely structured root elements broken down into structured or unstructured groups of elements and base elements corresponding to the lowest level elements in the structure, each base element and root element being associated with an information type.
  • At least one information type of the base elements is first associated with an adapted compression algorithm, the method including the following steps:
  • the structure of the document is represented in a very compact way, and given that each data set corresponding to a base structure element is associated with an information type, it may be processed by the compression algorithm that is best adapted to its type.
  • the document contains for example textual data, images, sound data, this data is perfectly located in the structured document, and associated with a low level structure element and a type.
  • When the automata are executed, they detect the presence of data sets having a base type associated with a compression algorithm and successively invoke the algorithms corresponding to this data, so as to obtain corresponding binary information sequences which are inserted as they arise into the binary document resulting from the compression.
  • A finite automaton is to be understood as a set of states, each state being associated with a set of input events, together with a transition function which determines, for each input event, the set of active states of the automaton. Given this definition, a number of representations may be imagined. One uses code conversion tables, with one table per state indicating, for each input event, the table corresponding to the next state. Another uses correspondence tables, with one table per automaton having as many lines and columns as there are states in the automaton, each box of the table containing the description of the transition between the two corresponding states.
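The table-based representation mentioned above can be illustrated with a small sketch. The class name `FiniteAutomaton`, its methods and the sample events "a" and "b" are hypothetical and chosen only for illustration; the sketch keeps one table per automaton, mapping a (state, input event) pair to the set of next active states.

```python
from typing import Dict, Iterable, Set, Tuple

State = int
Event = str


class FiniteAutomaton:
    """Hypothetical sketch: one table per automaton, (state, event) -> next states."""

    def __init__(self, initial: State, finals: Iterable[State]) -> None:
        self.initial = initial
        self.finals: Set[State] = set(finals)
        self.table: Dict[Tuple[State, Event], Set[State]] = {}

    def add_transition(self, src: State, event: Event, dst: State) -> None:
        self.table.setdefault((src, event), set()).add(dst)

    def step(self, active: Set[State], event: Event) -> Set[State]:
        """Transition function: the set of active states after one input event."""
        nxt: Set[State] = set()
        for state in active:
            nxt |= self.table.get((state, event), set())
        return nxt

    def accepts(self, events: Iterable[Event]) -> bool:
        """Run the automaton and report whether it ends in a final state."""
        active = {self.initial}
        for event in events:
            active = self.step(active, event)
            if not active:
                return False
        return bool(active & self.finals)


# Illustrative automaton: event "a" followed by event "b".
fa = FiniteAutomaton(initial=0, finals=[2])
fa.add_transition(0, "a", 1)
fa.add_transition(1, "b", 2)
```

With this sketch, `fa.accepts(["a", "b"])` holds while `fa.accepts(["a"])` does not, since state 1 is not final.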
  • the structure schema is processed in the same way so as to determine the automata used for compression and to analyze the content of the compressed document for the purpose of reconstituting a document in the original format having a structure which is at least equivalent, if not identical, with decompression algorithms corresponding to the compression algorithms used during compression being executed to restore the original data sets from the binary information sequences located in the compressed document.
  • the method according to the invention includes advantageously a step of transmitting the structure schema which may be the original, or that obtained after transformation and normalization or again that obtained after compilation.
  • each data set is located in the compressed document so as to enable direct access to a particular information element, without it being necessary to decompress the whole document, or the data sets preceding the set to be decompressed.
  • each structure schema element is furthermore associated with a set of possible numbers of occurrences, indicating the number of times that a data set having this structure element can appear in the data set of immediately higher level to which it belongs.
  • the process according to the invention may include a step of optimizing the document's structure schema consisting in reducing the number of hierarchical levels of groups of structure elements. This optimization makes it possible to simplify the structure schema but renders the compression process less efficient.
  • FIG. 1 represents in the form of a block diagram the links between the different steps of the method according to the invention;
  • FIGS. 2a, 2b and 2c show graphically a structure schema in the form of a tree;
  • FIG. 3 shows a structure schema obtained by applying a reduction method according to the invention to the structure schema shown in FIG. 2;
  • FIGS. 4a, 4b and 4c show a structure schema obtained by applying another reduction method according to the invention to the structure schema shown in FIG. 2;
  • FIGS. 5a to 5c show respectively three finite automata obtained and used by the method according to the invention;
  • FIG. 6 shows another automaton illustrating an optimization method used by the method according to the invention;
  • FIGS. 7a and 7b show two automata obtained by using the process according to the invention from a particular structure schema;
  • FIG. 8 shows the application of a reduction method to the automata shown in FIGS. 7a and 7b.
  • FIG. 1 shows the links between the different steps of the method according to the invention.
  • This method is designed to process a structured document constituted by a structure schema 1 defining the document's structure and by the document's structured data.
  • a structure schema has for example the following form:
  • This schema shows that the element named “C” has a complex structure constituted by a first element named “a2” of the Boolean type which is optional, a second element named “a1” of the integer type which is always present in the structure, and a group of alternative elements named “A” and “B” of respective types “TA” and “TB”, one of these two elements being present on a single occasion in the structure.
  • Types “TA” and “TB” are defined in the document's structure schema by a similar formulation.
  • this formulation is analyzed and transformed at step 11 of the method so as to obtain syntax trees 4, on the basis of one tree per structure element.
  • the syntax tree corresponding to the structure element TC is symbolized by the following formula:
  • the formula (1) may be represented by the tree shown in FIG. 2 c , this tree including a root element “TC” 43 constituted by a single occurrence of a sequence type group 44 .
  • This group includes a single occurrence of a non-sequenced “ET” type group 45 and a single occurrence of an alternative group 46 .
  • the group 45 being constituted by a single occurrence of an integer named “a1” and of a Boolean named “a2”, and the group 46 including a single occurrence of an element named “A” of the “TA” type and an element “B” of the “TB” type.
  • Types “TA” and “TB” obtained in step 11 are for example given by the following formulae:
  • FIGS. 2 a and 2 b respectively.
  • the type “TA” 31 includes a single sequence type group 32 constituted by two single groups 33, 34, of the ET and SEQ type respectively.
  • the group 33 includes two single occurrences of the integer type, named “a3” and “a4” respectively.
  • the group 34 includes two single occurrences of the type “TC”, named “X” and “Y” respectively.
  • the type “TB” 39 is constituted by a single sequence type group 40 , which includes two Booleans named “a1” and “a5” respectively.
  • the structure elements must be deterministic; in other words, an element must not be able to be interpreted in several different ways.
  • (a, b))”, where “a” appears it is not known whether “b” must appear after it.
  • There are algorithms which can be applied by the method according to the invention so as to convert a non-deterministic schema into a deterministic schema.
  • the schema given heretofore may for example be replaced by “(a, b[0..1])”, where “b” has a number of occurrences in the range 0 to 1.
  • the elements of the structure schema converted into syntax trees may first of all be subjected to a process of reduction or simplification.
  • This reduction process consists in carrying out a general leveling by generating a single syntax tree 51 from all the trees 31, 39 and 43, as is shown in FIG. 3.
  • This tree in fact represents a dictionary of all the element types likely to be encountered in the document, these elements being collected together in an alternative type group 52 appearing at least once (1 . . . *) in the document.
  • the complex type elements “A”, “B”, “X” and “Y” are associated with an “ANY” type
  • the element “a1” which appeared twice (in the elements “TB” and “TC”) with different types is associated with a default “pcdata” type according to the XML language or with the element type in the initial document, for example “text”.
  • the same data set may indeed be represented in several ways: for example, a binary sequence may also be considered as a character string or as an integer number.
  • this reduction process consists in leveling the syntax trees locally to obtain the trees shown as 31 ′, 39 ′ and 43 ′ in FIGS. 4 a to 4 c.
  • the trees “TA”, “TB” and “TC” can be further subjected to an additional process to remove the ambiguities appearing in the structure schema.
  • the trees “TA”, “TB” and “TC” are also subjected to a normalization process consisting in re-sequencing the schema in such a way as to obtain a single sequence of the elements of the schema. This process assigns a binary number to the different nodes of the syntax trees obtained from the previous processes. This number is used when the relevant element is compressed.
  • This normalization process consists in assigning to each group a signature constituted by the concatenation of the group's name with the signature of all the elements and sub-groups of the group, previously sequenced.
  • the group 53 in FIG. 4 is associated with the signature “CHO.a3.a4.X.Y” (or “
  • SEQ sequenced groups
  • the groups to be normalized are therefore groups of the alternative type (“CHO”), and the “ET” and “ET NO ” groups.
  • This process includes the following steps for each group G composed of sub-groups g_i and elements e_i:
  • the pre-defined order for arranging the components of the group may be the alphanumerical order of their respective signatures, or the descending order of their minimum number of occurrences, components having the same minimum number of occurrences then being arranged in alphanumerical order.
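As an illustration of the signature-based normalization described above, the following is a minimal sketch only. The classes `Element` and `Group`, the spelling "ET_NO" for the non-sequenced group, the use of "." as separator and the case-insensitive alphanumerical ordering are all assumptions made for this example; the ordering was chosen so that the sketch reproduces the sample signature “CHO.a3.a4.X.Y”.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    name: str

    def signature(self) -> str:
        # Assumption: the signature of an element is simply its name.
        return self.name


@dataclass
class Group:
    kind: str  # "SEQ", "CHO", "ET" or "ET_NO" (hypothetical spellings)
    children: List[Union["Group", Element]] = field(default_factory=list)

    def normalized_children(self) -> List[Union["Group", Element]]:
        # Sequenced groups keep their order; the other groups are re-sequenced.
        if self.kind == "SEQ":
            return list(self.children)
        return sorted(self.children, key=lambda c: c.signature().lower())

    def signature(self) -> str:
        # Group name concatenated with the signatures of its sequenced components.
        parts = [child.signature() for child in self.normalized_children()]
        return ".".join([self.kind] + parts)


g = Group("CHO", [Element("Y"), Element("a4"), Element("X"), Element("a3")])
```

Whatever the order in which the components are declared, `g.signature()` evaluates to "CHO.a3.a4.X.Y" under these assumptions, which is what lets the compressor and the decompressor agree on a single sequence of elements.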
  • the next step 13 in the method consists in generating finite automata 5 .
  • This process consists in generating for each syntax tree a set of base automata, on the basis of one automaton per group of the syntax tree, then in combining these base automata.
  • the automaton of a sequential group (SEQ) of n elements of signatures m1, m2 to mn, of an immediately lower hierarchical level includes n+2 states numbered from 0 to n+1 symbolized in the figure by circles, each node being connected to its successor by a transition symbolized by an arc corresponding to a group element and called by the element signature, the last transition F (towards the state n+1) marking the end of the group.
  • the automaton of an alternative type group (CHO) of n elements of signatures m1, m2 to mn, of an immediately lower hierarchical level includes an initial state numbered “0” and n final states numbered from 1 to n, the state 0 being connected to the final states 1 to n respectively by n transitions corresponding to the n group elements respectively.
  • the automaton of an ET group of n elements of signatures m1, m2 to mn, of an immediately lower hierarchical level includes 1 + n + n*(n-1) + n*(n-1)*(n-2) + ... + n! states representing all the possible combinations of the n elements.
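The two base automata just described (SEQ and CHO) can be sketched as follows. The function names, the dictionary representation of transitions and the helper `count_states` are hypothetical; the sketch also assumes that no element signature is literally "F", since that label is reserved here for the end-of-group transition.

```python
from typing import Dict, List, Set, Tuple

# (source state, arc label) -> destination state
Transitions = Dict[Tuple[int, str], int]


def seq_automaton(signatures: List[str]) -> Tuple[Transitions, Set[int]]:
    """SEQ group: states 0..n+1, arc i -> i+1 labelled m_i, then a final arc F."""
    n = len(signatures)
    trans: Transitions = {(i, sig): i + 1 for i, sig in enumerate(signatures)}
    trans[(n, "F")] = n + 1  # last transition, marking the end of the group
    return trans, {n + 1}


def cho_automaton(signatures: List[str]) -> Tuple[Transitions, Set[int]]:
    """CHO group: initial state 0 connected to n final states 1..n."""
    trans: Transitions = {(0, sig): i for i, sig in enumerate(signatures, start=1)}
    return trans, set(range(1, len(signatures) + 1))


def count_states(trans: Transitions) -> int:
    """Number of distinct states appearing in a transition table."""
    return len({0} | {src for src, _ in trans} | set(trans.values()))
```

For three elements, `seq_automaton(["m1", "m2", "m3"])` yields 5 states (n+2) with final state 4, and `cho_automaton(["m1", "m2", "m3"])` yields 4 states with final states 1 to 3.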
  • An automaton of this kind may be generated by a simple algorithm such as the one which follows:
  • E is the set of all the possible components of the group.
        Execute Function_1 (E, initial state)
        Function_1 (E, state e):
            For each element mi of E
                Create a state e′ and an arc joining e to e′ denoted mi
                Duplicate E without mi as E′
                Execute Function_1 (E′, state e′)
            End for.
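A runnable version of this recursive construction might look like the sketch below. The names `et_automaton` and `expected_state_count` are hypothetical, the signatures are assumed to be distinct, and the second function merely evaluates the state-count formula 1 + n + n*(n-1) + ... + n! so that the two can be compared.

```python
from math import factorial
from typing import Dict, List, Tuple


def et_automaton(signatures: List[str]) -> Dict[Tuple[int, str], int]:
    """Build the ET automaton: one path per ordering of the group elements."""
    trans: Dict[Tuple[int, str], int] = {}
    last_state = 0  # state 0 is the initial state

    def expand(remaining: List[str], state: int) -> None:
        nonlocal last_state
        for sig in remaining:
            last_state += 1
            new_state = last_state
            trans[(state, sig)] = new_state  # arc from state to new_state, denoted sig
            # Recurse on a copy of the set without the element just consumed.
            expand([s for s in remaining if s != sig], new_state)

    expand(list(signatures), 0)
    return trans


def expected_state_count(n: int) -> int:
    """1 + n + n*(n-1) + ... + n!, i.e. the sum of n!/(n-k)! for k = 0..n."""
    return sum(factorial(n) // factorial(n - k) for k in range(n + 1))
```

Each transition creates exactly one new state, so the number of states is `len(et_automaton(sigs)) + 1`; for three elements this gives 16, matching the formula. The factorial growth is also why such groups are good candidates for the reductions discussed in the text.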
  • the automaton of a group ET NO of n elements of signatures m1, m2 to mn, of an immediately lower hierarchical level may be that of an SEQ so long as it is acceptable to lose the information relating to the order of appearance of the elements in the group or it is fixed.
  • These automata (the case of groups of the type CHO, ET and ET NO ) can be optimized by applying an avoidance process of the optional elements, i.e. those with a total number of possible occurrences in the form (0 . . . k).
  • this process consists in adding a transition between the state “1” located immediately upstream of an optional element “2” and all the states “3” located immediately downstream of this element, this new transition being associated with the signature “m3” of the element corresponding to the state where it ends.
  • This process may be carried out using the following algorithm:
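The following is only a hedged sketch of one way such an avoidance process could be written; it is not claimed to be the algorithm referred to above. The function name `add_bypass`, the dictionary representation of transitions and the `optional` set of signatures are assumptions. For every arc labelled with an optional signature, the sketch copies the arcs leaving its destination state back to its source state, and repeats until nothing changes so that runs of consecutive optional elements are also skipped.

```python
from typing import Dict, Set, Tuple

# (source state, arc label) -> destination state
Transitions = Dict[Tuple[int, str], int]


def add_bypass(trans: Transitions, optional: Set[str]) -> Transitions:
    """Add arcs that jump over optional elements (illustrative sketch)."""
    result = dict(trans)
    changed = True
    while changed:
        changed = False
        for (upstream, label), middle in list(result.items()):
            if label not in optional:
                continue
            # Copy every arc leaving the state after the optional element.
            for (src, next_label), downstream in list(result.items()):
                if src == middle and (upstream, next_label) not in result:
                    result[(upstream, next_label)] = downstream
                    changed = True
    return result


# Illustrative SEQ automaton for (a, b) where "a" is optional.
base = {(0, "a"): 1, (1, "b"): 2, (2, "F"): 3}
optimized = add_bypass(base, optional={"a"})
```

In this example the only arc added is `(0, "b") -> 2`, so that "b" can be read directly from the initial state when "a" is absent.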
  • The automata thus generated for one structure schema are nested in each other. Indeed, in the automata corresponding to the schema example shown in FIG. 2, when the type TC (elements X and Y) is encountered in the automaton corresponding to the type TA 31, the automaton corresponding to the type TC 43 is fully executed before proceeding with the execution of the automaton corresponding to the type TA.
  • the next step 14 of the method according to the invention consists in reducing and converting the automata previously obtained.
  • FIG. 7 a corresponds to the group SEQ (“,”
  • the second automaton corresponds to the alternative group (“
  • reaching state 2 in the first automaton activates the execution of the second automaton and reaching the final state 1 or 2 in the second automaton is followed by the pursuit of the execution of the first automaton, in other words the execution of the transition F between the state 2 and the final state 3 of the first automaton.
  • each alternative of the CHO group is shown by a transition associated with the signature “cho.b1.b2” of the group, concatenated with the signature “b1”, “b2” of the group element appearing in the selected alternative.
  • the automata may also be subjected to a process for minimizing the number of states, for example by applying Hopcroft's algorithm, then a normalization process to obtain normalized automata 6 .
  • transitions from each node are numbered from 0 to n.
  • the next step 15 consists in re-reading the document 2 , in compressing the data which it contains by executing the automata on the structure of the document, in order to obtain a succession of binary sequences in which the compressed value of each element or base data set of the document is found.
  • these binary sequences are in the form (K.N.V_1 ... V_N)_e for each element or group of elements e, N being the number of occurrences of the element e or the number of successive data sets corresponding to the element e, K being the number of the transition having made it possible to reach the element e, and V_1 ... V_N the respective, possibly compressed, values of the N occurrences of the element e.
  • K binary sequences
  • N may be omitted, particularly when this number is fixed.
  • K in the event of there being only a single arc coming from a state, for example in a group of the sequence type.
  • a general heading of the compressed document may first be made which groups together several encoded parameters, useful for the description of the document.
  • Such a heading may thus include a signature of the structure schema or schemas used, and a set of parameters describing the coding used, as for example:
  • Each information element of the document may also be associated with a heading, its presence and its nature being indicated in the document heading.
  • the heading of an element may thus include its encoded length, in such a way as to allow, when the document is decompressed, access to a particular element without decompressing all the previous elements in the document.
  • the element headings are inserted in the document for example just prior to encoding the value of the elements.
  • compressing the document consists in reading the document sequentially, in executing the automata of the schema, which makes it possible additionally to check that the document's structure corresponds to the schema compiled.
  • the coding is broken down into two parts, namely (i, i) and (0 ... j-i); the first part is not encoded since this formulation specifies that i occurrences are necessary.
  • the second part is encoded on
  • a coding technique such as ASN1 is used, according to which the first byte indicates the number of coding bytes, and the following bytes contain the value of the number of occurrences. It is also possible to use the high-order bit of each byte to indicate whether or not it is the last coding byte of the number of occurrences, the next seven bits of each byte being used to encode the number of occurrences.
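The second variant, where the high-order bit of each byte acts as a continuation flag and the remaining seven bits carry the value, can be sketched as below. The function names are hypothetical, and two conventions are assumptions of this sketch rather than statements of the source: a high-order bit of 1 means that further bytes follow, and the most significant group of seven bits is written first.

```python
from typing import Tuple


def encode_occurrences(n: int) -> bytes:
    """Encode a non-negative count, 7 value bits per byte (illustrative sketch)."""
    if n < 0:
        raise ValueError("number of occurrences must be non-negative")
    groups = []
    while True:
        groups.append(n & 0x7F)  # low seven bits
        n >>= 7
        if n == 0:
            break
    groups.reverse()  # most significant group first (assumption)
    out = bytearray()
    for index, group in enumerate(groups):
        more = index < len(groups) - 1
        out.append((0x80 if more else 0x00) | group)  # high bit set: more bytes follow
    return bytes(out)


def decode_occurrences(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, position just after the encoded count)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # high bit clear: this was the last coding byte
            return value, pos
```

Counts below 128 occupy a single byte, and larger counts grow by one byte per additional seven bits, which suits elements without a fixed maximum number of occurrences.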
  • another coding type may be selected wherein it is not necessary to introduce the number of occurrences of the elements of a structure schema.
  • a type called “escape” or “esc” is introduced which indicates the final state of the automata. It is therefore necessary to first apply a conversion to the automata obtained previously.
  • This conversion consists in adding to each state of the automata a return transition to the previous state and in adding an “esc” transition to a final state, marking the end of the execution of the automaton.
  • the coding of the elements is then no more than the form (KV), the coding of an automaton terminating in the number K_esc of the transition “esc”.
  • this coding type is only advantageous for encoding complex forms and for elements which do not have a maximum number of occurrences. It is particularly well suited to encoding alternative type groups including a number of elements different from 2^p, where p is an integer.
  • This coding type may be combined with the previous type. This has only to be indicated in the heading of the compressed document and a bit assigned to the locations in the encoding where there are to be a number of occurrences.
  • At least one base type of the data sets of the document is associated with an external compression module 16 .
  • the respective types of the data sets encountered are analyzed, and when a type of data set is associated with an external compression module 16 , this is applied to the content of the data set and the result of the compression inserted into the compressed document as a value of the corresponding data set.
  • External compression modules may for example apply the “mp3” standard for sound information, “jpeg” for images and “MPEG1” or “MPEG2” for video type data.
  • a default compression module may be used or the data sets having this type recovered as they appear in the initial document.
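The dispatch of data sets to type-specific modules can be sketched as follows. Everything here is illustrative: the registry `CODECS`, the type name "text" and the function names are hypothetical, and Python's standard `zlib` module merely stands in for an external compression module such as an audio or image codec. Types without a registered module are passed through unchanged, which corresponds to recovering the data set as it appears in the initial document.

```python
import zlib
from typing import Callable, Dict, Tuple

# (compress, decompress) pair for one information type
Codec = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]

# Hypothetical registry: information type -> external module (zlib as stand-in).
CODECS: Dict[str, Codec] = {
    "text": (zlib.compress, zlib.decompress),
}

# Fallback for types without a registered module: keep the data as it is.
IDENTITY: Codec = (lambda raw: raw, lambda packed: packed)


def compress_value(info_type: str, raw: bytes) -> bytes:
    """Apply the module registered for this type, if any."""
    compress, _ = CODECS.get(info_type, IDENTITY)
    return compress(raw)


def decompress_value(info_type: str, packed: bytes) -> bytes:
    """Apply the matching decompression module."""
    _, decompress = CODECS.get(info_type, IDENTITY)
    return decompress(packed)
```

Because the automata recover the type of each element during decompression, the same registry lookup can select the matching decompression module without storing the codec choice alongside each value.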
  • the elements are associated with a heading in the compressed document containing the length as a number of bits of the value of the element. This particularity allows direct access to an element of the compressed document without having to decompress the elements located before in the document, by reading by means of automata only the respective lengths of these elements as far as the element being sought.
  • the length of the elements may be encoded in the following way.
  • p represents the number of bytes (in ASN1 coding or using the high-order bits of each byte used to encode this number) used to encode the element length
  • h represents the number of remaining bits of this length (h < 8).
  • the external compression module 16 which is called on to encode the element value can provide this length in return.
  • the value of the first bit corresponding to the element value indicates whether or not the following bits represent the element length.
  • any new types are inserted into an element heading placed in the compressed document just prior to the element value.
  • the first bit indicates whether the element type is different or not from the expected type.
  • the next bits in the element heading contain the code of the new type, this code being determined by numbering all the possible sub-types of the element base type, this numbering being given by encoding the document's structure.
  • the arcs leaving each node are numbered. This step is optional if there is only one arc leaving the node. If there are n arcs leaving, each of these arcs is associated with a number given in the order of the arcs assigned at normalization (step 14). This number is encoded over n′ bits, n′ being such that 2^(n′-1) < n ≤ 2^n′.
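The bit width n′ used for arc numbers is the smallest number of bits able to distinguish n arcs. A minimal sketch, with hypothetical function names and the bit sequence shown as a string of "0" and "1" characters for readability:

```python
def bits_for_arcs(n: int) -> int:
    """Smallest n' with n <= 2**n'; 0 when only a single arc leaves the node."""
    if n < 1:
        raise ValueError("a node must have at least one leaving arc")
    return (n - 1).bit_length()


def encode_arc(k: int, n: int) -> str:
    """Encode arc number k (0-based) among n arcs as a fixed-width bit string."""
    if not 0 <= k < n:
        raise ValueError("arc number out of range")
    width = bits_for_arcs(n)
    return format(k, "0{}b".format(width)) if width else ""
```

For example, three arcs need two bits, so `encode_arc(2, 3)` gives "10", while a node with a single leaving arc produces an empty sequence, consistent with the numbering step being optional in that case.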
  • each transition will be encoded over
  • the number of occurrences of each sub-automaton is encoded as described above.
  • the sub-automaton is encoded. This process may be expressed by the following algorithm:
  • V_a1, V_a2 and V_a3 are the values of the occurrences of a1, a2 and a3 respectively.
  • attribute sequence is not useful (as in the XML language)
  • the process of decompressing a document thus obtained is performed by executing steps 11 to 14 on the document's structure schema to obtain the automata, then by executing step 15′ of decoding and decompressing the document, this step consisting in running through the compressed document executing the automata obtained as a result of steps 11 to 14, in such a way as to be able to determine the type and the name of the compressed information elements encountered in the document.
  • the values of the elements which have been obtained by means of external compression modules are decompressed by means of corresponding decompression modules 16′.
  • steps 11 to 14 are only executed once, only steps 15 and 16 (or 15′ and 16′) having to be applied to each document to be processed.

Abstract

The invention concerns a method for compressing and decompressing a structured document, associated with at least a tree diagram structure defining a document structure and comprising nested structure elements, associated with a type of information, and representing sets of data, the method comprising steps which consist in: performing a syntactic analysis of the structure diagram and standardizing it so as to obtain a single predefined sequence of the elements of the diagram; compiling the standardized diagram to obtain finite automata, each automaton comprising states interconnected by transitions respectively representing the elements of the structure; and compressing the document, and executing at least a compression algorithm associated with a type of information, when a set of data having the type of information is encountered in the document.

Description

  • The present invention relates to a method for compressing/decompressing structured documents.
  • It applies particularly, but not exclusively, to the transmission of documents such as images or image sequences, video or sound data, via digital data transmission networks, and to the storage of such documents.
  • There are currently in existence a number of digital document compression algorithms. Some compression algorithms are designed to process the document's binary data directly, without taking account of the data type. These algorithms have the advantage of being able to process any document, but are ineffective (low compression rate) in processing bulky documents, which are generally of the sound or image type.
  • Furthermore other compression algorithms are known which are more efficient, but specially adapted to one data type, for example image or sound, with the result that they cannot be used or are ineffective if they are applied to documents which do not exclusively contain data for which they are designed.
  • Increasingly however, the documents being used and circulating on data transmission networks contain several information types integrated in one structure.
  • A structured document is a collection of data sets each associated with a type, and arranged together according to mainly hierarchical relationships. These documents employ a structuring language such as SGML, HTML, XML, making it possible particularly to distinguish the different data sets composing the document. In contrast, in a so-called linear document, the document's content information is mixed with the presentation and typing information.
  • A structured document thus includes locators or markers separating the different data sets of the document. In the case of SGML, XML or HTML formats, these locators known as “tags” are of the form “<XXXX>” and “</XXXX>”, the first marker indicating the start of the data set “XXXX” and the second the end of this set. A data set may be composed of several lower level data sets. In this way, a structured document has a hierarchical or tree structure schema, each node representing a data set and being connected to a higher hierarchical level node representing a data set which contains the lower level data sets. The nodes located at the end of a branch of this tree structure represent data sets containing data of a predefined type, which cannot be broken down into data subsets.
  • A structured document is generally associated with what is called a structure schema setting out in rule form the structure and type of information of each data set of the document. A schema is constituted by nested groups of data set structures, these groups being for example ordered sequences, alternative element groups or necessary element groups, sequenced or non-sequenced.
  • A structured document is thus associated with a structure schema and contains separation markers represented in the form of textual or binary data, these markers delimiting data sets which are themselves able to contain other data sets delimited by markers. The result is that a document structured in this way is able to include not only textual data, but also any other type of information (for example sound data, images, etc.). Consequently the specific compression algorithms of one particular type of data are ineffective and ill adapted in respect of processing this type of document.
  • The purpose of the present invention is to remove these drawbacks. To this end, it proposes a method for compressing and decompressing a structured document associated with at least one tree structure schema defining a structure of the document and including nested structure elements representing data sets, the structure elements being distributed in three categories, namely structured root elements broken down into structured or unstructured groups of elements and base elements corresponding to the lowest level elements in the structure, each base element and root element being associated with an information type.
  • According to the invention, at least one information type of the base elements is first associated with an adapted compression algorithm, the method including the following steps:
      • performing a syntactic analysis of the document's structure schema and normalizing it so as to obtain a single predefined sequence of the structure elements of the schema,
      • compiling the normalized structure schema to obtain one finite automaton per root element, each automaton including states interconnected by transitions respectively representing the structure elements, and
      • compressing the structured document including executing the finite automata on the document, and executing the compression algorithm when a data set having an information type associated with said algorithm is encountered in the document to be compressed.
  • By compiling the structure schema, the structure of the document is represented in a very compact way, and given that each data set corresponding to a base structure element is associated with an information type, it may be processed by the compression algorithm that is best adapted to its type. In this way, if the document contains for example textual data, images, sound data, this data is perfectly located in the structured document, and associated with a low level structure element and a type. When the automata are executed, they will detect the presence of data sets having a base type associated with a compression algorithm and invoke successively the algorithms corresponding to this data so as to obtain corresponding binary information sequences which are inserted as they arise into the binary document resulting from the compression.
  • Furthermore, in the case of a data transmission, if the documents transmitted always have the same structure schema, it is not necessary to transmit this at each document transmission, giving an additional gain in terms of the compression rate obtained by using the method according to the invention. It is even pointless to transmit it when the schema is previously known to the document's addressee. For example if it is an HTML document, there is never a need, even the first time, to encode the document schema.
  • A finite automaton is to be understood as a set of states, each state being associated with a set of input events and a transition function which determines, for each input event, the set of active states of the automaton. Given this definition, a number of representations may be imagined, for example code conversion tables, on the basis of one table per state indicating, for each input event, the table corresponding to the next state, or correspondence tables, on the basis of one table per automaton having as many lines and columns as there are states in the automaton, with each box of the table containing the description of the transition between the two corresponding states.
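  • As a rough illustration of the two table representations mentioned above, the following Python sketch stores a small automaton first as one table per state, then converts it into a single states-by-states table. The automaton itself, the names `per_state_tables`, `to_matrix` and `run`, and the restriction to a single active state at a time are assumptions made for the example.

```python
# Hypothetical automaton accepting "a1" followed by "A" or "B".
# Representation 1: one table per state, giving the next state per input event.
per_state_tables = {
    0: {"a1": 1},
    1: {"A": 2, "B": 2},
    2: {},  # final state: no outgoing events
}


def to_matrix(tables):
    """Representation 2: one states-by-states table; each box holds the
    set of events labelling the transition between the two states."""
    states = sorted(tables)
    matrix = {src: {dst: set() for dst in states} for src in states}
    for src, row in tables.items():
        for event, dst in row.items():
            matrix[src][dst].add(event)
    return matrix


def run(tables, events, start=0):
    """Follow the per-state tables; raises KeyError on an unexpected event."""
    state = start
    for event in events:
        state = tables[state][event]
    return state


matrix = to_matrix(per_state_tables)
```

  • The per-state form is convenient for executing the automaton, while the matrix form makes every pair of states directly inspectable.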
  • When decompressing, the structure schema is processed in the same way so as to determine the automata used for compression and to analyze the content of the compressed document for the purpose of reconstituting a document in the original format having a structure which is at least equivalent, if not identical, with decompression algorithms corresponding to the compression algorithms used during compression being executed to restore the original data sets from the binary information sequences located in the compressed document.
  • Where the structure schema is to be transmitted with the document, the method according to the invention includes advantageously a step of transmitting the structure schema which may be the original, or that obtained after transformation and normalization or again that obtained after compilation.
  • According to one particularity of the invention, each data set is located in the compressed document so as to enable direct access to a particular information element, without it being necessary to decompress the whole document, or the data sets preceding the set to be decompressed.
  • According to another particularity of the invention, each structure schema element is furthermore associated with a set of possible numbers of occurrences, indicating the number of times that a data set having this structure element can appear in the data set of immediately higher level to which it belongs.
  • The process according to the invention may include a step of optimizing the document's structure schema consisting in reducing the number of hierarchical levels of groups of structure elements. This optimization makes it possible to simplify the structure schema but renders the compression process less efficient.
  • One preferred mode of implementing the method according to the invention will be described below, in a non-restrictive way, with reference to the appended drawings in which:
  • FIG. 1 represents in the form of a block diagram the links between the different steps of the method according to the invention;
  • FIGS. 2 a, 2 b and 2 c show graphically a structure schema in the form of a tree;
  • FIG. 3 shows a structure schema obtained by applying a reduction method according to the invention to the structure schema shown in FIG. 2;
  • FIGS. 4 a, 4 b and 4 c show a structure schema obtained by applying another reduction method according to the invention to the structure schema shown in FIG. 2;
  • FIGS. 5 a to 5 c show respectively three finite automata obtained and used by the method according to the invention;
  • FIG. 6 shows another automaton illustrating an optimization method used by the method according to the invention;
  • FIGS. 7 a and 7 b show two automata obtained by using the process according to the invention from a particular structure schema; and
  • FIG. 8 shows the application of a reduction method to the automata shown in FIGS. 7 a and 7 b.
  • FIG. 1 shows the links between the different steps of the method according to the invention.
  • This method is designed to process a structured document constituted by a structure schema 1 defining the document's structure and by the document's structured data.
  • In the XML Schema language, a structure schema has for example the following form:
  • <element name=“C”>
    <complexType>
    <attribute name=“a2” required=false type=“boolean”/>
    <attribute name=“a1” required=true type=“integer”/>
    <Group order=choice>
    <element name=“A” type=“TA” minOccurs=1
    maxOccurs=1/>
    <element name=“B” type=“TB” minOccurs=1
    maxOccurs=1/>
    </Group>
    </complexType>
    </element>
  • This schema shows that the element named “C” has a complex structure constituted by a first element named “a2” of the Boolean type which is optional, a second element named “a1” of the integer type which is always present in the structure, and a group of alternative elements named “A” and “B” of respective types “TA” and “TB”, one of these two elements being present on a single occasion in the structure.
  • Types “TA” and “TB” are defined in the document's structure schema by a similar formulation.
  • Generally speaking, the following element groups are used to define a document's structure:
      • SEQ: which defines a list of sequenced elements which must all appear in the document and in the sequence indicated,
      • CHO: which defines a set of alternative elements, a single element of the group having to appear,
      • ET: which defines a set of elements which must all appear in the document in some sequence or other which is not to be modified,
      • ETNO: which defines a set of elements which must all be present in the document in some sequence or other which is of no importance, and
      • ANY: which includes some element or other among all the possible elements which can be found in the document.
  • According to the invention, this formulation is analyzed and transformed at step 11 of the method so as to obtain syntax trees 4, on the basis of one tree per structure element. The syntax tree corresponding to the structure element TC is symbolized by the following formula:

  • TC→((a1{int} 1 . . . 1&no a2{bool} 0 . . . 1)1 . . . 1,(A {TA} 1 . . . 1 |B {TB} 1 . . . 1)1 . . . 1)1 . . . 1  (1)
  • wherein:
    • “→” shows that TC is the name given to the structure defined after this symbol,
    • “( )” shows the priorities with which the groups of elements are to be read,
    • “,” corresponds to a group of elements of the sequence type (SEQ),
    • “|” represents a group of alternative elements (CHO)
    • “&” represents a group of ET type elements
    • “&NO” represents a group of non-sequenced ET type elements
    • “{ }” associates with an element a structure element name or a base type (for example integer or Boolean), and
    • “Ax . . . y” shows that the element A is repeated x to y times in the document, y being able to be equal to “*”, representing an indeterminate value.
  • This formulation also uses the symbol “$” which represents any element (ANY).
  • The formula (1) may be represented by the tree shown in FIG. 2 c, this tree including a root element “TC” 43 constituted by a single occurrence of a sequence type group 44. This group includes a single occurrence of a non-sequenced “ET” type group 45 and a single occurrence of an alternative group 46.
  • The group 45 is constituted by a single occurrence of an integer named “a1” and an optional Boolean named “a2”, and the group 46 includes a single occurrence of either an element named “A” of the “TA” type or an element named “B” of the “TB” type.
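  • One possible in-memory form of such a syntax tree is sketched below in Python, using formula (1) as data. The class names `Element` and `Group`, their fields, the helper `leaf_names`, and the use of `None` for “*” are assumptions made for illustration; the description does not prescribe a data structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Element:
    name: str
    type_name: str                  # base type ("int", "bool") or structure name ("TA")
    min_occurs: int = 1
    max_occurs: Optional[int] = 1   # None stands for "*"


@dataclass
class Group:
    kind: str                       # "SEQ", "CHO", "ET", "ETNO" or "ANY"
    children: List[Union["Group", Element]] = field(default_factory=list)
    min_occurs: int = 1
    max_occurs: Optional[int] = 1


# Formula (1): TC -> ((a1{int} &no a2{bool} 0..1), (A{TA} | B{TB}))
TC = Group("SEQ", [
    Group("ETNO", [Element("a1", "int"), Element("a2", "bool", min_occurs=0)]),
    Group("CHO", [Element("A", "TA"), Element("B", "TB")]),
])


def leaf_names(node):
    """List element names in tree order (depth first)."""
    if isinstance(node, Element):
        return [node.name]
    return [name for child in node.children for name in leaf_names(child)]
```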
  • Types “TA” and “TB” obtained in step 11 are for example given by the following formulae:

  • TA→((a3{int} 1 . . . 1&no a4{int} 0 . . . 1)1 . . . 1,(X {TC} 1 . . . 1 ,Y {TC} 1 . . . 1)1 . . . 1)1 . . . 1  (2)

  • TB→(a1{bool} 1 . . . 1 ,a5{bool} 0 . . . 1)1 . . . 1  (3)
  • and represented by the trees shown in FIGS. 2 a and 2 b respectively.
  • The type “TA” 31 includes a single sequence type group 32 constituted by two single groups 33, 34, of the ET and SEQ type respectively. The group 33 includes two single occurrences of the integer type, named “a3” and “a4” respectively. The group 34 includes two single occurrences of the type “TC”, named “X” and “Y” respectively.
  • The type “TB” 39 is constituted by a single sequence type group 40, which includes two Booleans named “a1” and “a5” respectively.
  • Although in the preceding description the name of each element and its type are distinguished, the method according to the invention applies also to structuring languages which make no such distinction.
  • Furthermore, the structure elements must be deterministic, in other words an element must not be able to be interpreted in several different ways. For example, in the schema “(a|(a, b))”, where “a” appears, it is not known whether “b” must appear after it. To this end, there are algorithms which can be applied by the method according to the invention so as to convert a non-deterministic schema into a deterministic schema. Reference may be made for example to the document [“Regular expressions into finite automata”, Brüggemann-Klein, Anne, Extended Abstract in I. Simon, Hrsg, LATIN 1992, S.97-98. Springer-Verlag, Berlin 1992. Full Version in Theoretical Computer Science 120: 197-213, 1993]. Thus the schema given above may for example be replaced by “(a, b0 . . . 1)”.
  • In the next step 12 of the method according to the invention, the elements of the structure schema converted into syntax trees may first of all be subjected to a process of reduction or simplification.
  • This reduction process consists in carrying out a general leveling by generating a single syntax tree 51 from all the trees 31, 39, and 43, as is shown in FIG. 3.
  • This tree in fact represents a dictionary of all the element types likely to be encountered in the document, these elements being collected together in an alternative type group 52 appearing at least once (1 . . . *) in the document. In this tree, the complex type elements “A”, “B”, “X” and “Y” are associated with an “ANY” type, and the element “a1”, which appeared twice (in the elements “TB” and “TC”) with different types, is associated with a default “pcdata” type according to the XML language or with the element type in the initial document, for example “text”. The same data set may indeed be represented in several ways: for example a binary sequence may also be considered as a character string or as an integer number.
  • Alternatively, this reduction process consists in leveling the syntax trees locally to obtain the trees shown as 31′, 39′ and 43′ in FIGS. 4 a to 4 c.
  • In each of these figures, the groups 32 to 34 (FIG. 2 a), 40 (FIG. 2 b) and 44 to 46 (FIG. 2 c) have been replaced by an alternative type group 53, 54, 55, respectively, which appear at least once (1 . . . *).
  • The trees “TA”, “TB” and “TC” can be further subjected to an additional process to remove the ambiguities appearing in the structure schema.
  • At step 12, the trees “TA”, “TB” and “TC” are also subjected to a normalization process consisting in re-sequencing the schema in such a way as to obtain a single sequence of the elements of the schema. This process assigns a binary number to the different nodes of the syntax trees obtained from the previous processes. This number is used when the relevant element is compressed.
  • This normalization process consists in assigning to each group a signature constituted by the concatenation of the group's name with the signature of all the elements and sub-groups of the group, previously sequenced. Thus, the group 53 in FIG. 4 is associated with the signature “CHO.a3.a4.X.Y” (or “|a3.a4.X.Y”).
  • For this normalization process, it is considered that the sequenced groups (SEQ) are already normalized. The groups to be normalized are therefore groups of the alternative type (“CHO”), and the “ET” and “ETNO” groups. This process includes the following steps for each group G composed of sub-groups gi and elements ei:
      • normalizing any sub-groups gi of the group G before normalizing the group G, the normalization algorithm being recursive,
      • arranging any elements ei of the group G before the sub-groups gi,
      • arranging the elements ei in a pre-defined order,
      • arranging the sub-groups gi in the pre-defined order, and
      • determining the signature of the group G obtained by the concatenation of all the signatures of its components (elements and sub-groups) according to the order established as a result of the previous steps.
  • The pre-defined order for arranging the components of the group may be the alphanumerical order of their respective signatures, or the descending order of their minimum number of occurrences, and the components having the same minimum number of occurrences being then arranged in alphanumerical order.
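  • The normalization steps above can be sketched as a short recursive Python function. This is an illustration under stated assumptions: a group is modelled as a `(kind, components)` tuple and an element as a plain string, signatures are joined with “.”, and the pre-defined order is taken to be case-insensitive alphanumerical order, which reproduces the signature “CHO.a3.a4.X.Y” quoted for group 53. The function name `signature` is hypothetical.

```python
def signature(node):
    """Return the signature of an element (its name) or of a group.

    A group is a tuple (kind, components); an element is a plain string.
    """
    if isinstance(node, str):
        return node
    kind, components = node
    if kind == "SEQ":
        # Sequenced groups are considered already normalized: keep their order.
        ordered = [signature(c) for c in components]
    else:
        # Normalize sub-groups first (recursion), put elements before
        # sub-groups, and sort each part in the pre-defined order.
        elements = sorted((c for c in components if isinstance(c, str)),
                          key=str.lower)
        groups = sorted((signature(c) for c in components
                         if not isinstance(c, str)), key=str.lower)
        ordered = elements + groups
    return ".".join([kind] + ordered)
```

  • Because the result depends only on the content of the group and not on the order in which its components were declared, the coder and the decoder derive the same sequence independently.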
  • It should be noted that this normalization process is not necessary in the method according to the invention. The order in which the components appear in the schema may indeed be retained.
  • The next step 13 in the method consists in generating finite automata 5. This process consists in generating for each syntax tree a set of base automata, on the basis of one automaton per group of the syntax tree, then in combining these base automata.
  • In FIG. 5 a, the automaton of a sequential group (SEQ) of n elements of signatures m1, m2 to mn, of an immediately lower hierarchical level includes n+2 states numbered from 0 to n+1 symbolized in the figure by circles, each node being connected to its successor by a transition symbolized by an arc corresponding to a group element and called by the element signature, the last transition F (towards the state n+1) marking the end of the group.
  • In FIG. 5 b, the automaton of an alternative type group (CHO) of n elements of signatures m1, m2 to mn, of an immediately lower hierarchical level includes an initial state numbered “0” and n final states numbered from 1 to n, the state 0 being connected to the final states 1 to n respectively by n transitions corresponding to the n group elements respectively.
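  • The SEQ and CHO base automata just described can be built mechanically. The Python sketch below is only illustrative: the dictionary layout and the function names `seq_automaton` and `cho_automaton` are assumptions, and arcs are represented as `(source, signature, target)` triples.

```python
def seq_automaton(signatures):
    """SEQ group of n elements: states 0..n+1, chained by one arc per
    element, plus a last arc "F" marking the end of the group."""
    n = len(signatures)
    arcs = [(i, sig, i + 1) for i, sig in enumerate(signatures)]
    arcs.append((n, "F", n + 1))
    return {"states": list(range(n + 2)), "arcs": arcs, "final": [n + 1]}


def cho_automaton(signatures):
    """CHO group of n elements: initial state 0 and final states 1..n,
    with one arc from state 0 per alternative."""
    n = len(signatures)
    arcs = [(0, sig, i + 1) for i, sig in enumerate(signatures)]
    return {"states": list(range(n + 1)), "arcs": arcs,
            "final": list(range(1, n + 1))}
```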
  • In FIG. 5 c, the automaton of an ET group of n elements of signatures m1, m2 to mn, of an immediately lower hierarchical level includes 1+n+n*(n−1)+n*(n−1)*(n−2)+ . . . +n! states representing all the possible combinations of the n elements.
  • An automaton of this kind may be generated by a simple algorithm such as the one which follows:
  • E is the set of all the possible components of the group
    Execute Function_1 (E, initial state)
    Function_1 (E, state e):
      For each element mi of E
        Create a state e′ and an arc joining e to e′ denoted mi
        Duplicate E as E′ and withdraw mi from E′
        Execute Function_1 (E′, state e′)
      End for.
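  • A Python sketch of this generation algorithm follows. It assumes that every component is tried at each level and that the recursion runs on the remaining components, which yields the 1+n+n*(n−1)+ . . . +n! states stated for an ET group; component signatures are assumed to be distinct, and the name `build_et_automaton` is hypothetical.

```python
def build_et_automaton(components):
    """Build the automaton of an ET group as a tree of states.

    Returns (state_count, arcs) where arcs are (source, signature, target).
    Each path from state 0 spells one ordering of the components.
    """
    arcs = []
    next_state = [1]  # state 0 is the initial state

    def expand(remaining, state):
        for sig in remaining:
            target = next_state[0]
            next_state[0] += 1
            arcs.append((state, sig, target))
            # Recurse on a copy of the set without the component just used.
            expand([s for s in remaining if s != sig], target)

    expand(list(components), 0)
    return next_state[0], arcs
```

  • The factorial growth of the state count is the reason why such groups benefit from the reduction and minimization processes applied later.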
  • The automaton of a group ETNO of n elements of signatures m1, m2 to mn, of an immediately lower hierarchical level may be that of an SEQ so long as it is acceptable to lose the information relating to the order of appearance of the elements in the group or it is fixed.
  • These automata (the case of groups of the type CHO, ET and ETNO) can be optimized by applying an avoidance process of the optional elements, i.e. those with a total number of possible occurrences in the form (0 . . . k).
  • This rule reflects the fact that each element associated with a minimum zero number of occurrences is not necessarily encoded.
  • As shown in FIG. 6, this process consists in adding a transition between the state “1” located immediately upstream of an optional element “2” and all the states “3” located immediately downstream of this element, this new transition being associated with the signature “m3” of the element corresponding to the state where it ends.
  • If one of the states located immediately downstream is also associated with an optional element, a transition must also be provided to all the states located immediately downstream of this state.
  • This process may be carried out using the following algorithm:
  • Let Z be the sub-set of the nodes of the automaton for which the associated element has a minimum zero occurrence.
    Repeat (while the automaton is modified by the following procedure):
      For each element X of Z:
        For each incoming transition TEx of X:
          For each outgoing transition TSx of X:
            1. Create a new transition N connecting the source node of the transition TEx and the destination node of the transition TSx. The transition is marked by the value of the arc TSx.
            2. If an identical transition does not already exist in the graph, add it to the graph
          End for
        End for
      End for
    End repeat
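  • The avoidance algorithm above can be sketched in Python as a fixed-point loop. The representation is an assumption: arcs are `(source, label, target)` triples held in a set, and `optional_states` plays the role of Z. The function name `bypass_optional` is hypothetical.

```python
def bypass_optional(arcs, optional_states):
    """Add bypass arcs around states whose element has a minimum of zero
    occurrences. `arcs` is a set of (source, label, target) triples."""
    arcs = set(arcs)
    changed = True
    while changed:                      # repeat while the automaton is modified
        changed = False
        for x in optional_states:
            incoming = [a for a in arcs if a[2] == x]
            outgoing = [a for a in arcs if a[0] == x]
            for src, _, _ in incoming:
                for _, label, dst in outgoing:
                    new_arc = (src, label, dst)   # marked like the outgoing arc
                    if new_arc not in arcs:
                        arcs.add(new_arc)
                        changed = True
    return arcs
```

  • With the chain 0 → 1 → 2 → 3 labelled m1, m2, m3 and state 2 optional, the only arc added is (1, m3, 3), which corresponds to the situation of FIG. 6.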
  • It should be noted that the automata thus generated for one structure schema are nested in each other. Indeed, in the automata corresponding to the schema example shown in FIG. 2, when the type TC (elements X and Y) is encountered in the automaton corresponding to the type TA 31, the automaton corresponding to the type TC 43 is fully executed before proceeding with the execution of the automaton corresponding to the type TA.
  • The next step 14 of the method according to the invention consists in reducing and converting the automata previously obtained.
  • It is thus possible to merge the automata of the same syntax tree (and not automata of different trees which invoke each other) in the manner explained with reference to FIGS. 7 a and 7 b.
  • These figures show the automata which have been generated in accordance with the method according to the invention from the structure element (a1 0 . . . *, (b1|b2)0 . . . *). The first automaton (FIG. 7 a) corresponds to the group SEQ (“,”), whereas the second automaton (FIG. 7 b) corresponds to the alternative group (“|”).
  • When executing these automata, reaching state 2 in the first automaton activates the execution of the second automaton and reaching the final state 1 or 2 in the second automaton is followed by the pursuit of the execution of the first automaton, in other words the execution of the transition F between the state 2 and the final state 3 of the first automaton.
  • The process of merging the two automata makes it possible to obtain the automaton shown in FIG. 8, wherein each alternative of the CHO group is shown by a transition associated with the signature “cho.b1.b2” of the group, concatenated with the signature “b1”, “b2” of the group element appearing in the selected alternative.
  • During this step 14, the automata may also be subjected to a process for minimizing the number of states, for example by applying Hopcroft's algorithm, then a normalization process to obtain normalized automata 6.
  • Following this process, the transitions from each node are numbered from 0 to n.
  • The next step 15 consists in re-reading the document 2, in compressing the data which it contains by executing the automata on the structure of the document, in order to obtain a succession of binary sequences in which the compressed value of each element or base data set of the document is found. According to a first type of encoding, these binary sequences are in the form (K.N.V1 . . . VN)e for each element or group of elements e, N being the number of occurrences of the element e or the number of successive data sets corresponding to the element e, K being the number of the transition having made it possible to reach the element e, and V1 . . . VN the respective, possibly compressed, values of the N occurrences of the element e. If e is a group of elements, its value V is broken down into as many binary sequences (K.N.V) as it contains elements. However, in certain cases, N may be omitted, particularly when this number is fixed. The same is true for K in the event of there being only a single arc coming from a state, for example in a group of the sequence type.
  • A general heading of the compressed document may first be made which groups together several encoded parameters, useful for the description of the document. Such a heading may thus include a signature of the structure schema or schemas used, and a set of parameters describing the coding used, as for example:
      • a parameter indicating whether the coding of the length of each element is mandatory or optional or not present in the document,
      • a parameter indicating whether the elements may or may not be sub-typed, i.e. associated with a more precise type than their base type, and
      • a parameter indicating the type of coding used for the number of occurrences.
  • Each information element of the document may also be associated with a heading, its presence and its nature being indicated in the document heading.
  • The heading of an element may thus include its encoded length, in such a way as to allow, when the document is decompressed, access to a particular element without decompressing all the previous elements in the document. The element headings are inserted in the document for example just prior to encoding the value of the elements.
  • In a general way, compressing the document consists in reading the document sequentially, in executing the automata of the schema, which makes it possible additionally to check that the document's structure corresponds to the schema compiled.
  • During this process, the number of occurrences of each element appearing in the document is encoded. To this end, the following rules are applied.
  • Where the number of occurrences of an element e is defined by (i . . . j), the following cases may be distinguished:
  • If j is different from “*” and i is different from 0, the coding is broken down into two parts, namely (i . . . i) and (0 . . . j−i). The first part is not encoded since this formulation specifies that i occurrences are necessary. The second part is encoded on ⌈log2(j−i+1)⌉ bits.
  • If j is different from “*” and i is equal to 0, the number of occurrences is encoded between 1 and j, in other words on ⌈log2(j)⌉ bits, since if this coding is necessary, it means that there is at least one element e in the document.
  • If j is equal to “*”, a coding technique such as ASN1 is used, according to which the first byte indicates the number of coding bytes, and the following bytes contain the value of the number of occurrences. It is also possible to use the high-order bit of each byte to indicate whether or not it is the last coding byte of the number of occurrences, the next seven bits of each byte being used to encode the number of occurrences.
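  • The three cases can be illustrated by the Python sketch below, which returns the code as a string of “0” and “1” characters. Several points are assumptions rather than statements of the description: for the (0 . . . j) case the stored value is taken to be the count minus one, and for the “*” case the variant using the high-order bit is used, with the most significant 7-bit group first and the high-order bit set to 0 on the last byte only. All function names are hypothetical.

```python
def bits_needed(values):
    """Smallest number of bits able to distinguish `values` codes,
    i.e. log2(values) rounded up."""
    return (values - 1).bit_length()


def encode_unbounded(count):
    """Variable-length coding: 7 value bits per byte, most significant
    group first; the high-order bit is 0 only on the last byte (assumed)."""
    groups = [count & 0x7F]
    count >>= 7
    while count:
        groups.append(count & 0x7F)
        count >>= 7
    groups.reverse()
    return "".join(format((0 if k == len(groups) - 1 else 0x80) | g, "08b")
                   for k, g in enumerate(groups))


def encode_occurrences(count, i, j):
    """Return the bit string for `count` occurrences of an element (i..j);
    j is None for "*"."""
    if j is None:
        return encode_unbounded(count)
    if i != 0:
        width, value = bits_needed(j - i + 1), count - i
    else:
        width, value = bits_needed(j), count - 1   # counts 1..j (assumed offset)
    return format(value, "0{}b".format(width)) if width else ""
```

  • An element declared (1 . . . 1) therefore costs no bits at all, whereas an element declared (2 . . . 5) costs two bits whatever its actual count.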
  • Alternatively, another coding type may be selected wherein it is not necessary to introduce the number of occurrences of the elements of a structure schema. According to this coding type, a type called “escape” or “esc” is introduced which indicates the final state of the automata. It is therefore necessary to first apply a conversion to the automata obtained previously.
  • This conversion consists in adding to each state of the automata a return transition to the previous state and in adding an “esc” transition to a final state, marking the end of the execution of the automaton. The coding of the elements is then no more than the form (KV), the coding of an automaton terminating in the number Kesc of the transition “esc”.
  • In fact, this coding type is only advantageous for encoding complex forms and for elements which do not have a maximum number of occurrences. It is in particular well adapted to encoding alternative type groups including a number of elements different from 2^p, where p is an integer number.
  • This coding type may be combined with the previous type. This has only to be indicated in the heading of the compressed document and a bit assigned to the locations in the encoding where there are to be a number of occurrences.
  • According to the invention, at least one base type of the data sets of the document is associated with an external compression module 16. In this way, when reading the document, the respective types of the data sets encountered are analyzed, and when a type of data set is associated with an external compression module 16, this is applied to the content of the data set and the result of the compression inserted into the compressed document as a value of the corresponding data set.
  • External compression modules may for example apply the “mp3” standard for sound information, “jpeg” for images and “MPEG1” or “MPEG2” for video type data.
  • If no compression module is associated with a type of data set, a default compression module may be used or the data sets having this type recovered as they appear in the initial document.
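  • The association between base types and external compression modules can be pictured as a simple registry, as in the following Python sketch. The registry `COMPRESSORS`, the choice of zlib for a “text” type and the function names are illustrative assumptions; the description only names mp3, jpeg and MPEG as examples of external modules.

```python
import zlib

# Registry of compression modules keyed by base type. The "text" entry using
# zlib is an illustrative choice, not a module named in the description.
COMPRESSORS = {
    "text": (zlib.compress, zlib.decompress),
}


def identity(data):
    """Default behaviour: copy the data set as it appears in the document."""
    return data


def compress_value(type_name, data):
    """Apply the module registered for the type, or copy the data as is."""
    compress, _ = COMPRESSORS.get(type_name, (identity, identity))
    return compress(data)


def decompress_value(type_name, data):
    """Apply the matching decompression module for the type."""
    _, decompress = COMPRESSORS.get(type_name, (identity, identity))
    return decompress(data)
```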
  • If in the heading of the document it is indicated that encoding the length is optional or mandatory, the elements are associated with a heading in the compressed document containing the length as a number of bits of the value of the element. This particularity allows direct access to an element of the compressed document without having to decompress the elements located before in the document, by reading by means of automata only the respective lengths of these elements as far as the element being sought.
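  • The benefit of the length headings can be shown with a deliberately simplified Python sketch in which every element is preceded by a fixed 8-bit length field. The fixed width, the constant `LENGTH_BITS` and the function names are assumptions for the example; the description encodes the length differently.

```python
LENGTH_BITS = 8  # hypothetical fixed-width length field, for illustration only


def pack(values):
    """Concatenate elements, each preceded by its length in bits."""
    return "".join(format(len(v), "0{}b".format(LENGTH_BITS)) + v
                   for v in values)


def read_element(stream, index):
    """Return the value bits of element `index`, skipping earlier elements
    by reading only their length headers."""
    pos = 0
    for _ in range(index):
        length = int(stream[pos:pos + LENGTH_BITS], 2)
        pos += LENGTH_BITS + length          # jump over the undecoded value
    length = int(stream[pos:pos + LENGTH_BITS], 2)
    start = pos + LENGTH_BITS
    return stream[start:start + length]
```

  • Reaching the sought element costs one header read per preceding element, and none of the preceding values is decompressed.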
  • The length of the elements may be encoded in the following way.
  • Where it is indicated in the heading of the document that encoding the length of the elements is mandatory, the length L of the elements as a number of bits is calculated using the following formula:

  • L=8*p+h
  • where p represents the number of bytes (in ASN1 coding or using the high-order bits of each byte used to encode this number) used to encode the element length, and h represents the number of remaining bits of this length (h<8).
  • It should be noted that the external compression module 16 which is called on to encode the element value can provide this length in return.
  • Where encoding the length of the elements is not mandatory, the value of the first bit corresponding to the element value indicates whether or not the following bits represent the element length.
  • If the elements can be sub-typed (indicated in the document's heading), any new types are inserted into an element heading placed in the compressed document just prior to the element value. The first bit indicates whether the element type is different or not from the expected type. In the first case, the next bits in the element heading contain the code of the new type, this code being determined by numbering all the possible sub-types of the element base type, this numbering being given by encoding the document's structure.
  • More precisely, a document is encoded in three main steps.
  • In the first step, the arcs leaving each node are numbered. This step is optional if there is only one arc leaving the node. If there are n arcs leaving, each of these arcs is associated with a number given in the order of the arcs assigned at normalization (step 14). This number is encoded over n′ bits, n′ being such that 2^(n′−1) < n ≤ 2^n′.
  • Thus, if n transitions are issued from the state E, each transition will be encoded over ⌊log2(n−1)⌋+1 bits.
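  • The width of an arc number can be computed with integer arithmetic only, as in this small Python sketch. The function names `arc_code_width` and `arc_code` are hypothetical, and arcs are assumed to be numbered from 0.

```python
def arc_code_width(n):
    """Number of bits n' such that 2**(n' - 1) < n <= 2**n' (0 when n == 1)."""
    return (n - 1).bit_length()


def arc_code(k, n):
    """Bit string for arc number k (0-based) among n outgoing arcs."""
    width = arc_code_width(n)
    return format(k, "0{}b".format(width)) if width else ""
```

  • For a state with a single outgoing arc the width is zero, which matches the remark that numbering is optional in that case.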
  • In the second step, the number of occurrences of each sub-automaton is encoded as described above.
  • In the third step, the sub-automaton is encoded. This process may be expressed by the following algorithm:
  • Get into position at the start of the automaton,
    While the active state is not final
      The arc being crossed is encoded, if necessary
      The number n of occurrences is encoded, if necessary
      Move around in the sub-automaton corresponding to the node reached,
      This sub-automaton is encoded n times.
      Go back to the initial automaton.
    End While.
  • For example, to encode the occurrence “a2 a3 a1 a1 a3” of the automaton (a1|a2|a3)(0 . . . *), there are three outgoing arcs. The arcs are therefore numbered on two bits. Consequently, the result of the coding is as follows in the case where the number of occurrences is encoded:

  • 0000 0101 01 Va2 10 Va3 00 Va1 00 Va1 10 Va3
  • where “0000 0101” represents the binary value of the number of occurrences i.e. 5, and Va1, Va2, and Va3 are the values of the occurrences of a1, a2 and a3 respectively.
  • Where the number of occurrences is not encoded:

  • 01 Va2 10 Va3 00 Va1 00 Va1 10 Va3 11
  • 11 corresponding to the number of the outgoing transition “esc”.
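  • Both codings of this example can be reproduced with the Python sketch below. It is a simplified illustration: the count is written on a single byte (which assumes fewer than 128 occurrences), values are shown as symbolic tokens such as “Va2”, arcs are numbered in the order a1, a2, a3, and the “esc” arc is assumed to take the next free number. The function name `encode_choice` is hypothetical.

```python
def encode_choice(occurrences, alternatives, with_count):
    """Encode a repeated CHO group as a space-separated string of tokens.

    With a count: one byte for the number of occurrences, then (K, V) pairs.
    Without: (K, V) pairs closed by the code of an extra "esc" arc.
    """
    n_arcs = len(alternatives) + (0 if with_count else 1)   # "esc" adds an arc
    width = (n_arcs - 1).bit_length()
    code = {name: format(k, "0{}b".format(width))
            for k, name in enumerate(alternatives)}
    tokens = []
    if with_count:
        count = format(len(occurrences), "08b")
        tokens.append(count[:4] + " " + count[4:])
    for name in occurrences:
        tokens += [code[name], "V" + name]
    if not with_count:
        tokens.append(format(len(alternatives), "0{}b".format(width)))
    return " ".join(tokens)
```

  • With three alternatives the “esc” arc fits in the two bits already needed for the arc numbers, which is why this coding suits groups whose number of elements is not a power of two.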
  • In the example in FIGS. 7 a, 7 b, encoding the occurrence “b2 b1 a1” of the automaton (a1 0 . . . *, (b1|b2)0 . . . *) leads to the following result (where the states are not merged):
      • 0000 0010 number of occurrences of the sequence (here twice)
      • 1 encoding the arc “cho.b1.b2”
      • 0000 0010 number of occurrences of the group “cho.b1.b2” (here twice).
      • 1 encoding the arc b2 in the group “cho.b1.b2”
      • Vb2 encoding the value b2
      • 0 encoding the arc b1 in the group “cho.b1.b2”
      • Vb1 encoding the value b1
      • 0 encoding the arc a1
      • 0000 0001 number of occurrences of a1
      • Va1 encoding the value of a1
      • 1 encoding the outgoing arc F
  • Where the states are merged (FIG. 8):
      • 0000 0010 Number of occurrences of the sequence
      • 10 Encoding the arc “cho.b1.b2-b2”
      • 0000 0010 Number of occurrences of the group “cho.b1.b2”.
      • Vb2 Encoding the value b2
      • 0 Encoding the arc b1 in the group “cho.b1.b2”
      • Vb1 Encoding the value b1
      • 00 Encoding the arc a1
      • 0000 0001 Number of occurrences of a1
      • Va1 Encoding the value of a1
      • 10 Encoding the outgoing arc F
  • It may be necessary to re-arrange the automaton, particularly if the schema has been interpreted and re-ordered in such a way as to optimize coding in the case of the group ETNO.
  • If the attribute sequence is not useful (as in the XML language), it is possible to encode so as to re-order the element attributes in a pre-determined sequence, for example in an alphanumerical sequence, then according to whether they are required or not. This arrangement makes it possible to reduce the size of the compressed description accordingly.
  • A document compressed in this way is decompressed by executing steps 11 to 14 on the document's structure schema to obtain the automata, then by executing step 15′ of decoding and decompressing the document. This step consists in running through the compressed document while executing the automata obtained as a result of steps 11 to 14, so as to determine the type and the name of each compressed information element encountered in the document. The values of the elements which have been compressed by means of external compression modules are decompressed by means of the corresponding decompression modules 16′.
  • It should be noted that if several documents having the same structure schema are to be processed (compressed or decompressed), steps 11 to 14 are executed only once; only steps 15 and 16 (or 15′ and 16′) have to be applied to each document to be processed.
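  • The split between schema-dependent work done once and per-document decoding can be sketched as follows. This is a simplified illustration and not the method of the patent: a single state stands in for the automata, the transitions are numbered in declaration order with the last number reserved for “esc”, and every value is assumed to be a fixed 8-bit integer. The names compile_transitions and decode_group are hypothetical.

```python
def compile_transitions(element_names):
    """Stand-in for the schema-dependent steps: number the transitions once."""
    names = list(element_names) + ["esc"]
    width = max(1, (len(names) - 1).bit_length())  # bits per transition code
    table = {format(i, "0{}b".format(width)): name for i, name in enumerate(names)}
    return table, width


def decode_group(bits, table, width, value_bits=8):
    """Per-document step: follow transition codes until 'esc' is read."""
    pos, decoded = 0, []
    while True:
        name = table[bits[pos:pos + width]]
        pos += width
        if name == "esc":
            return decoded
        # Assumption: each value is a fixed-width unsigned integer.
        decoded.append((name, int(bits[pos:pos + value_bits], 2)))
        pos += value_bits


# Compiled once for the schema ...
table, width = compile_transitions(["a1", "a2", "a3"])

# ... then reused for every document that follows the same schema.
doc1 = "01" + "00000111" + "00" + "00000001" + "11"
doc2 = "11"
print(decode_group(doc1, table, width))  # [('a2', 7), ('a1', 1)]
print(decode_group(doc2, table, width))  # []
```

  • In this sketch the transition table plays the role of the automata: it is the only state shared between documents, which is why it needs to be built only once per schema.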

Claims (15)

1-10. (canceled)
11. Method for decoding a binary document for recovering a structured document of the XML type;
a. said structured document
i. being associated with at least one tree structure schema defining a structure of the structured document and
ii. including nested structure elements delimiting data sets of the structured document,
b. said binary document being coded using the tree structure schema to represent this document in a compact way,
c. the decoding comprising the step of modelling a decoding process which uses a knowledge of said tree structure schema, and the step of triggering the decoding using the tree structure schema by decoding a first data set of one type in said structured document with a decompression algorithm and decoding a second data set of another type in the said structured document with another decompression algorithm for recovering the structured document.
12. The method according to claim 11, further comprising a step of normalizing said structure schema to obtain a single sequence order of the structure elements of the structure schema.
13. The method according to claim 11, further comprising a step of receiving the structure schema with the coded document.
14. The method according to claim 11, further comprising tagging each coded data set in the coded document, to enable direct access to a particular coded data set, without decoding data sets preceding said particular data set in said coded document.
15. The method according to claim 11, further comprising a step of optimizing said structure schema, said step comprising reducing a number of hierarchical levels of groups of structure elements.
16. The method according to claim 11, further comprising associating each structure element in said structure schema with a set of possible numbers of occurrences, indicating the number of times that a coded data set of said coded document having said structure element can appear in a coded data set of immediately higher level.
17. The method according to claim 11, further comprising providing the coded document, for each data set of said original structured document, with a transition number referring to the structure element defining the structure of said data set, and a coded binary value of the data set.
18. The method according to claim 11, further comprising associating each structured element of at least a part of the structure elements of the structure schema, in the coded document, with a number of occurrences of data sets having a structure defined by said structure element, in a data set of immediately higher level.
19. The method according to claim 11, further comprising tagging the end of a group of several occurrences of coded data sets of a same type in the coded document by a binary sequence representing a transition to a final state.
20. The method according to claim 11, further comprising distributing the structured elements into three categories, including structured root elements, groups of elements, and structured, or unstructured base elements corresponding to lowest level elements, and associating each base element with an information type, at least one information type of the base elements being associated with an adapted decoding algorithm.
21. The method according to claim 20 comprising the step of executing the decoding algorithm when a coded data set having an information type associated with said algorithm is encountered in the coded document.
22. The method according to claim 11 comprising the step of performing a syntactic analysis of the structure schema.
23. The method according to claim 11, wherein modelling a decoding process comprises compiling the structure schema to generate at least one finite state automaton, each automaton including states interconnected by transitions respectively representing the structure elements, and during decoding, executing the finite state automata on coded data sets of the coded binary document.
24. A method according to claim 11, wherein, the coding comprising a compressing step, the decoding comprises a decompressing step.
US13/190,692 2000-09-06 2011-07-26 Method for compressing/decompressing structured documents Abandoned US20110283183A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/190,692 US20110283183A1 (en) 2000-09-06 2011-07-26 Method for compressing/decompressing structured documents

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
FR00/11356 2000-09-06
FR0011356A FR2813743B1 (en) 2000-09-06 2000-09-06 COMPRESSION / DECOMPRESSION PROCESS FOR STRUCTURED DOCUMENTS
PCT/FR2001/002719 WO2002021848A1 (en) 2000-09-06 2001-08-31 Method for compressing/decompressing structured documents
FRPCT/FR2001/002719 2001-08-31
US36333003A 2003-08-04 2003-08-04
US13/190,692 US20110283183A1 (en) 2000-09-06 2011-07-26 Method for compressing/decompressing structured documents

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US36333003A Continuation 2000-09-06 2003-08-04

Publications (1)

Publication Number Publication Date
US20110283183A1 true US20110283183A1 (en) 2011-11-17

Family

ID=8854020

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/363,330 Expired - Fee Related US8015218B2 (en) 2000-09-06 2001-08-31 Method for compressing/decompressing structure documents
US13/190,692 Abandoned US20110283183A1 (en) 2000-09-06 2011-07-26 Method for compressing/decompressing structured documents

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US10/363,330 Expired - Fee Related US8015218B2 (en) 2000-09-06 2001-08-31 Method for compressing/decompressing structure documents

Country Status (11)

Country Link
US (2) US8015218B2 (en)
EP (1) EP1316220B1 (en)
JP (1) JP4653381B2 (en)
AT (1) ATE285656T1 (en)
AU (1) AU2001287796A1 (en)
DE (1) DE60107964T2 (en)
DK (1) DK1316220T3 (en)
ES (1) ES2234878T3 (en)
FR (1) FR2813743B1 (en)
PT (1) PT1316220E (en)
WO (1) WO2002021848A1 (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3368883B2 (en) * 2000-02-04 2003-01-20 インターナショナル・ビジネス・マシーンズ・コーポレーション Data compression device, database system, data communication system, data compression method, storage medium, and program transmission device
ES2272429T3 (en) * 2001-07-13 2007-05-01 France Telecom METHOD TO COMPRESS A HIERARCHIC TREE, CORRESPONDING SIGN AND METHOD TO DECODE A SIGNAL.
US20030188265A1 (en) * 2002-04-02 2003-10-02 Murata Kikai Kabushiki Kaisha Structured document processing device and recording medium recording structured document processing program
KR100968083B1 (en) * 2002-07-15 2010-07-05 지멘스 악티엔게젤샤프트 Method and devices for encoding/decoding structured documents, especially xml documents
JP4197320B2 (en) * 2002-07-15 2008-12-17 シーメンス アクチエンゲゼルシヤフト Method and apparatus for encoding / decoding structured text, especially XML text
US7603654B2 (en) * 2004-03-01 2009-10-13 Microsoft Corporation Determining XML schema type equivalence
US8977859B2 (en) * 2004-05-04 2015-03-10 Elsevier, Inc. Systems and methods for data compression and decompression
US7509631B2 (en) * 2004-05-21 2009-03-24 Bea Systems, Inc. Systems and methods for implementing a computer language type system
US8111694B2 (en) 2005-03-23 2012-02-07 Nokia Corporation Implicit signaling for split-toi for service guide
US20070143664A1 (en) * 2005-12-21 2007-06-21 Motorola, Inc. A compressed schema representation object and method for metadata processing
WO2007134407A1 (en) * 2006-05-24 2007-11-29 National Ict Australia Limited Selectivity estimation
DE112007001386A5 (en) * 2006-07-07 2009-03-19 Universität Paderborn Method for compressing a data sequence of an electronic document
US7747558B2 (en) 2007-06-07 2010-06-29 Motorola, Inc. Method and apparatus to bind media with metadata using standard metadata headers
US7925643B2 (en) * 2008-06-08 2011-04-12 International Business Machines Corporation Encoding and decoding of XML document using statistical tree representing XSD defining XML document
EP2161667A1 (en) * 2008-09-08 2010-03-10 Thomson Licensing, Inc. Method and device for encoding elements
EP2219117A1 (en) * 2009-02-13 2010-08-18 Siemens Aktiengesellschaft A processing module, a device, and a method for processing of XML data
JP5570202B2 (en) * 2009-12-16 2014-08-13 キヤノン株式会社 Structured document analysis apparatus, structured document analysis method, and computer program
EP2388701A1 (en) * 2010-05-17 2011-11-23 Siemens Aktiengesellschaft Method and apparatus for providing a service implementation
US20150312298A1 (en) * 2011-03-24 2015-10-29 Kevin J. O'Keefe Method and system for information exchange and processing
US9543980B2 (en) 2014-10-10 2017-01-10 Massachusettes Institute Of Technology Systems and methods for model-free compression and model-based decompression
US10733237B2 (en) 2015-09-22 2020-08-04 International Business Machines Corporation Creating data objects to separately store common data included in documents
JP6903892B2 (en) * 2016-10-12 2021-07-14 富士通株式会社 Verification program, verification device, verification method, coding program, coding device and coding method
US10467275B2 (en) * 2016-12-09 2019-11-05 International Business Machines Corporation Storage efficiency
CN108763379B (en) * 2018-05-18 2022-06-03 北京奇艺世纪科技有限公司 Data compression method, data decompression method, device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6121903A (en) * 1998-01-27 2000-09-19 Infit Communications Ltd. On-the-fly data re-compression
US6266419B1 (en) * 1997-07-03 2001-07-24 At&T Corp. Custom character-coding compression for encoding and watermarking media content
US20020029229A1 (en) * 2000-06-30 2002-03-07 Jakopac David E. Systems and methods for data compression
US20030005001A1 (en) * 2001-06-28 2003-01-02 International Business Machines Corporation Data processing method, and encoder, decoder and XML parser for encoding and decoding an XML document
US7707154B2 (en) * 2002-07-15 2010-04-27 Siemens Aktiengesellschaft Method and devices for encoding/decoding structured documents, particularly XML documents

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5438512A (en) * 1993-10-22 1995-08-01 Xerox Corporation Method and apparatus for specifying layout processing of structured documents
WO1997034240A1 (en) * 1996-03-15 1997-09-18 University Of Massachusetts Compact tree for storage and retrieval of structured hypermedia documents
JPH10187680A (en) * 1996-12-20 1998-07-21 Nec Corp Document repository device managed by word, sentence and grain degree of part
US6363381B1 (en) * 1998-11-03 2002-03-26 Ricoh Co., Ltd. Compressed document matching
US6553141B1 (en) * 2000-01-21 2003-04-22 Stentor, Inc. Methods and apparatus for compression of transform data
JP3368883B2 (en) * 2000-02-04 2003-01-20 インターナショナル・ビジネス・マシーンズ・コーポレーション Data compression device, database system, data communication system, data compression method, storage medium, and program transmission device
US6883137B1 (en) * 2000-04-17 2005-04-19 International Business Machines Corporation System and method for schema-driven compression of extensible mark-up language (XML) documents
IT1314626B1 (en) * 2000-04-21 2002-12-20 Ik Multimedia Production Srl PROCEDURE FOR THE CODING AND DECODING OF DATA FLOWS, SOUND REPRESENTATIVES IN DIGITAL FORM, WITHIN A
BR0107329A (en) * 2000-10-17 2002-08-27 Koninkl Philips Electronics Nv Encoding process for encoding an element of description of an instance of an xml type scheme, decoding process for decoding a fragment comprising a content and a sequence of identification information, encoder for encoding an element of description of an instance of a type scheme xml, decoder to decode a fragment comprising a content and a sequence of identification information, transmission system, signal for transmission over a transmission network, and, table intended for use in an encoder
US6850948B1 (en) * 2000-10-30 2005-02-01 Koninklijke Philips Electronics N.V. Method and apparatus for compressing textual documents
JP4774145B2 (en) * 2000-11-24 2011-09-14 富士通株式会社 Structured document compression apparatus, structured document restoration apparatus, and structured document processing system
WO2002056478A1 (en) * 2001-01-11 2002-07-18 Koninklijke Philips Electronics N.V. Data compression method with identifier of regressive string reference
FR2820563B1 (en) * 2001-02-02 2003-05-16 Expway COMPRESSION / DECOMPRESSION PROCESS FOR A STRUCTURED DOCUMENT
US7246177B2 (en) * 2001-05-17 2007-07-17 Cyber Ops, Llc System and method for encoding and decoding data files
US20030028673A1 (en) * 2001-08-01 2003-02-06 Intel Corporation System and method for compressing and decompressing browser cache in portable, handheld and wireless communication devices
US6667700B1 (en) * 2002-10-30 2003-12-23 Nbt Technology, Inc. Content-based segmentation scheme for data compression in storage and transmission including hierarchical segment representation
US7509574B2 (en) * 2005-02-11 2009-03-24 Fujitsu Limited Method and system for reducing delimiters

Also Published As

Publication number Publication date
EP1316220A1 (en) 2003-06-04
EP1316220B1 (en) 2004-12-22
FR2813743A1 (en) 2002-03-08
DE60107964D1 (en) 2005-01-27
JP2004508647A (en) 2004-03-18
AU2001287796A1 (en) 2002-03-22
JP4653381B2 (en) 2011-03-16
US8015218B2 (en) 2011-09-06
WO2002021848A1 (en) 2002-03-14
ES2234878T3 (en) 2005-07-01
US20040013307A1 (en) 2004-01-22
DK1316220T3 (en) 2005-01-24
PT1316220E (en) 2005-02-28
DE60107964T2 (en) 2005-05-25
FR2813743B1 (en) 2003-01-03
ATE285656T1 (en) 2005-01-15

Similar Documents

Publication Publication Date Title
US20110283183A1 (en) Method for compressing/decompressing structured documents
KR100614677B1 (en) Method for compressing/decompressing a structured document
US7043686B1 (en) Data compression apparatus, database system, data communication system, data compression method, storage medium and program transmission apparatus
US6825781B2 (en) Method and system for compressing structured descriptions of documents
US7707154B2 (en) Method and devices for encoding/decoding structured documents, particularly XML documents
US20050120031A1 (en) Structured document encoder, method for encoding structured document and program therefor
US20070143664A1 (en) A compressed schema representation object and method for metadata processing
AU2002253002A1 (en) Method and system for compressing structured descriptions of documents
US20070112810A1 (en) Method for compressing markup languages files, by replacing a long word with a shorter word
EP1519279B1 (en) Document transformation system
JP2006323821A (en) Method and system for sequentially accessing compiled schema
US7676742B2 (en) System and method for processing of markup language information
US20040268239A1 (en) Computer system suitable for communications of structured documents
JP4776389B2 (en) Encoded document decoding method and system
US7240285B2 (en) Encoding and distribution of schema for multimedia content descriptions
JP2007148751A (en) Encoding method, encoding device, encoding program and decoding device for structured document and data structure for encoded structured document
US6118391A (en) Compression into arbitrary character sets
KR100968083B1 (en) Method and devices for encoding/decoding structured documents, especially xml documents
EP2327028B1 (en) Method and device for encoding elements
KR20060123197A (en) Method for compressing and decompressing structured documents
JP2004342029A (en) Method and device for compressing structured document
EP2039009A1 (en) Methods and devices for compressing structured documents

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION