EP1344151A1

EP1344151A1 - Method for dividing structured documents into several parts

Info

Publication number: EP1344151A1
Application number: EP01271587A
Authority: EP
Inventors: Claude Seyrat; Cédric Thienot
Original assignee: Expway SA
Current assignee: Expway SA
Priority date: 2000-12-18
Filing date: 2001-12-14
Publication date: 2003-09-17
Also published as: AU2002219311A1; JP4145144B2; FR2818409B1; US20040054669A1; US7275060B2; US20070277096A1; JP2004524606A; WO2002050708A1; FR2818409A1

Abstract

The invention concerns a method applicable to a structured document (D) having a hierarchical structure defined by a structural schema, and assembling a main data set (1) including data subsets (1.1, 1.2, 1.3, ..., 1.2.2.2), which themselves can include data subsets of lower hierarchical level, each data subset being associated with a respective type of data. Said method comprises steps which consist in: dividing the document into parts (P1, P2, P3) capable of being separately handled, namely a main part (P1) and at least one secondary part (P2, P3), the main part containing at least the main data set (1), and the secondary part containing a data subset (1.2.1, 1.2.2) which is removed from the main data set, each secondary part being attached to the main part or to another secondary part; and assigning a predefined value to the type of data of each data subset (1.2.1, 1.2.2) removed from a data set of higher hierarchical level (1.2).

Description

METHOD FOR DIVIDING STRUCTURED DOCUMENTS INTO MULTIPLE PARTS.

The present invention relates to a method for dividing structured documents into several parts.

It applies in particular, but not exclusively, to the manipulation, transmission, storage and playback of structured multimedia documents, images or sequences of video or digital images, cinematographic works or video programs, and more generally to any transfer of such documents between processing units interconnected by data transmission networks, or between a processing unit and a storage unit, or even between a processing unit and a reproduction unit such as a television set in the case of video programs.

More and more frequently, the documents thus manipulated and transmitted contain several types of information integrated into a structure.

A structured document is a collection of sets of information, each associated with a type and attributes, and composed together according to mainly hierarchical relationships. These documents use a structuring language such as SGML, HTML, XML, making it possible in particular to distinguish the different subsets of information composing the document. In contrast, in a so-called linear document, the content information of the document is mixed with the presentation and typing information.

A structured document includes separation marks for the different sets of information in the document. In the case of SGML, XML or HTML formats, these marks called "tags" are of the form "<XXXX>" and "</XXXX>", the first mark indicating the start of a set of information "XXXX" and the second the end of this set. A set of information can be composed of several sets of lower level information. Thus, a structured document presents a hierarchical or tree structure diagram, each node representing a set of information and being connected to a node of higher hierarchical level representing a set of information which contains the sets of information of lower level. The nodes located at the end of the branch of this tree structure represent sets of information containing data of a predefined type, which cannot be broken down into subsets of information.

Thus, a structured document contains separation marks represented in the form of textual or binary data, these marks delimiting sets or subsets of information which may themselves contain other subsets of information delimited by marks.
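By way of illustration only, the tree structure implied by such marks can be sketched as follows in Python, using the standard `xml.etree.ElementTree` parser. The document `DOC` and its tag names are hypothetical and do not come from the description; the sketch merely shows that each pair of marks yields a node and that the nodes without children are the ends of the branches.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured document; the tag names are illustrative only.
DOC = "<film><title>Example</title><cast><actor>A</actor><actor>B</actor></cast></film>"

def outline(node, depth=0):
    """Return one (depth, tag, is_leaf) tuple per node, in document order."""
    rows = [(depth, node.tag, len(node) == 0)]
    for child in node:
        rows.extend(outline(child, depth + 1))
    return rows

rows = outline(ET.fromstring(DOC))
# Nodes at the ends of the branches hold content that is not decomposed further.
leaves = [tag for _, tag, is_leaf in rows if is_leaf]
```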

A structured document is associated with what is called a structure diagram (or schema), which defines, in the form of rules, the structure and the type of information of each set of information in the document. A schema is made up of nested groups of structures of information sets; these groups can be ordered sequences, groups of alternative elements, or groups of necessary elements, ordered or unordered.

At present, when a structured document must be transmitted, it is compressed beforehand, so as to minimize the volume of data to be transmitted. For greater efficiency of such compression processing, the document structuring data are also compressed, given that the recipient of the document is supposed to know the structure diagram of the document beforehand and can use it to determine at every instant which set of information it is about to receive. It is therefore essential that the structure of the transmitted document exactly match the structure diagram that the recipient plans to use for receiving and decoding the document, failing which the recipient cannot determine, in particular, the type of the data transmitted, and is therefore unable to decode them and reconstruct the original document.

However, the structured documents to be transmitted tend to become more and more voluminous. It is envisaged, for example, to transmit or broadcast in this way complete descriptions of cinematographic works or television programs.

In this context, if a transmission error occurs during the transmission of a document, the recipient of the document may no longer be able to determine which subset is being transmitted, so that the entire document must be retransmitted. In addition, if one wishes to transmit and simultaneously display a cinematographic sequence on a screen, it may be necessary to respect the time slots for transmission of the various elements of the sequence. Some elements of the sequence must also be able to be transmitted several times, to allow a recipient who was not connected at the start of the transmission to receive and display the end of the sequence.

It may also be necessary to replace one document part with another, these two parts having the same structure diagram.

The solution consisting in retransmitting the entire document would considerably increase the volume of information to be transmitted. It is therefore desirable to be able to divide a document into several parts which are transmitted separately. It turns out that current transmission methods do not make it possible to partially transmit a document.

The object of the present invention is to eliminate this drawback. This objective is achieved by providing a method for dividing a structured document having a hierarchical structure defined by a structure diagram, this document grouping together a main set of information including subsets of information, at least some of which may themselves include information subsets of lower hierarchical level, each information subset being associated with a respective type of information.

According to the invention, this method comprises the steps consisting in:

- dividing the document into parts which can be handled separately, namely a main part and at least one secondary part, the main part containing at least the main information set, and the secondary part containing a subset of information which is removed from the main information set, each secondary part being attached to the main part or to another secondary part, and

- assigning a predefined value to the type of information of each subset of information removed from a set of information of higher hierarchical level.
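The two steps above can be sketched as follows, purely by way of non-limiting illustration. The `Node` class, the `split` function and the type values 1 to 5 are hypothetical; only the node numbering (taken from the tree of Figure 1) and the idea of a predefined type value, here `REMOVED = 0`, come from the description. The placeholder keeps the node name only to make the result easy to inspect.

```python
from dataclasses import dataclass, field

REMOVED = 0  # predefined type value marking a subset removed from its parent set

@dataclass
class Node:
    name: str
    type: int
    children: list = field(default_factory=list)

def split(node, cut, parts):
    """Return a copy of `node` in which each child named in `cut` is replaced by
    a placeholder of type REMOVED; the removed subtrees are stored in `parts`."""
    kept = []
    for child in node.children:
        if child.name in cut:
            # A secondary part may itself lose subsets to further secondary parts.
            parts[child.name] = split(child, cut, parts)
            kept.append(Node(child.name, REMOVED))
        else:
            kept.append(split(child, cut, parts))
    return Node(node.name, node.type, kept)

# Tree of Figure 1; the non-zero type values are arbitrary.
doc = Node("1", 1, [
    Node("1.1", 2),
    Node("1.2", 3, [
        Node("1.2.1", 4, [Node("1.2.1.1", 2)]),
        Node("1.2.2", 4, [Node("1.2.2.1", 2), Node("1.2.2.2", 2)]),
    ]),
    Node("1.3", 5, [Node("1.3.1", 2)]),
])

secondary = {}
main = split(doc, {"1.2.1", "1.2.2"}, secondary)
```

In this sketch `main` plays the role of the main part and `secondary` holds the two secondary parts, keyed by the node they were removed from; the original `doc` is left unchanged.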

In this way, each part is understandable in itself and can be decoded, regardless of the division chosen. Furthermore, when such a part is transmitted and the transmission fails, the rest of the document remains valid and the part which is not transmitted correctly can be retransmitted without the need to retransmit the entire document. Furthermore, it is not necessary to have the main and secondary parts upstream of a part in order to be able to decode the latter, since each part is valid and understandable in itself. Thanks to these provisions, a transmitted document can be enriched and modified over time.

Advantageously, the document comprises a header which is inserted into each part, this header comprising an indicator whose value indicates whether the document is complete or not.

According to a feature of the invention, each part comprises a header comprising information giving the location of the part in the hierarchical structure of the document.

Said location information of the secondary part in the hierarchical structure of the document advantageously describes a path in this structure, defining the position of the secondary part in the document.

Said path can be defined absolutely relative to the main information set of the document. It can also be defined relatively with respect to the position of a last transmitted secondary part.

Alternatively, each type of information assigned to the predefined value is followed by a reference to the secondary part containing the subset of information associated with the type of information, said information on the location of the secondary part in the hierarchical structure of the document being the reference of said secondary part.

This method can further comprise the transmission of several parts of the document associated with the same location in the structure. In this case, the last part transmitted replaces the previous one which is associated with the same location.

Provision can also be made for the header of each part to include information specifying a method of processing the part with respect to a part associated with the same location in the structure.

The structured document is for example of the SGML, XML or HTML type.

A preferred embodiment of the invention will be described below, by way of nonlimiting example, with reference to the appended drawings in which:

FIG. 1 represents a tree structure in which each node symbolizes a set or subset of information from a structured document which is normally transmitted in one go;

Figure 2 shows the structured document shown in Figure 1 cut into several parts, each of which can be transmitted separately according to the invention;

Figure 3 shows in more detail the structure of the information contained in a structured document;

FIG. 4 represents another tree structure illustrating a method of defining the position of a part of the structure, transmitted separately from the rest of the structure.

FIG. 1 represents a tree structure comprising a root node 1 decomposed into three nodes of lower rank, of which the first node 1.1 is not decomposed into nodes of lower rank, the second node 1.2 consists of two nodes 1.2.1 and 1.2.2, and the third node 1.3 consists of a single node 1.3.1. The two nodes 1.2.1 and 1.2.2 of the second node 1.2 are attached respectively to one node 1.2.1.1 and to two nodes 1.2.2.1 and 1.2.2.2 of lower rank.

This structure represents a structured document D comprising a header H in which are defined a certain number of parameters defining the coding and representation format of the document, and a main body B gathering the information and sets of information constituting the document.

According to the invention, a structured document can be transmitted in several separate parts P1, P2, P3, namely a main part P1 and secondary parts P2, P3 which are attached to the main part (Figure 2). Such a transmission is preferably carried out after each part to be transmitted separately has been compressed in an appropriate manner. Each part of the document, whether compressed or not, includes a header H, H2, H3 and a main body B1, B2, B3. As shown in FIG. 3, a main document body B comprises a data header DH and one or more data bodies DB, each gathering the information from one subset of information in the document. The data header DH may include a field K making it possible to remove any ambiguity when decoding the document, in particular by giving a number that identifies the set of information which follows, and/or a field containing the number N of occurrences of the data body DB. Depending on the format used, each data body DB can include a field T indicating the type of information it contains, a field L giving the length of this information in bits or bytes, a field A gathering the attributes of the information subset, and a field Val containing the value or content of the information subset. As the document is structured as a tree, the field Val can itself contain a data header DH and one or more data bodies DB.
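The nesting of the fields K, N, T, L, A and Val described above may be pictured, as a rough sketch only, with the following Python data classes. The class names `DataBody` and `Body`, the helper functions and the rule used to compute L for a structured subset are assumptions made for the example; the description leaves the exact format open.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

@dataclass
class DataBody:
    T: int                      # type of information
    L: int                      # length of the information, counted here in bytes
    A: dict                     # attributes of the information subset
    Val: Union[bytes, "Body"]   # leaf content, or a nested data header and bodies

@dataclass
class Body:
    K: Optional[int]            # optional disambiguation field of the data header DH
    N: int                      # number of DB occurrences announced by DH
    DB: List[DataBody] = field(default_factory=list)

def leaf(T, data, **attrs):
    """Build a data body whose Val field cannot be decomposed further."""
    return DataBody(T, len(data), attrs, data)

def group(T, children, **attrs):
    """Build a data body whose Val field holds a nested DH and several DB.
    Assumption: L is taken as the sum of the children's lengths."""
    nested = Body(K=None, N=len(children), DB=list(children))
    return DataBody(T, sum(c.L for c in children), attrs, nested)

def leaf_values(db):
    """Collect the undecomposable Val fields in document order."""
    if isinstance(db.Val, Body):
        return [v for child in db.Val.DB for v in leaf_values(child)]
    return [db.Val]

node = group(3, [leaf(2, b"abc", lang="en"), leaf(2, b"de")])
```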

It should be noted in this respect that, in the structure diagram represented in Figure 1, the information contained in the document is gathered in the nodes 1.1, 1.2.1.1, 1.2.2.1, 1.2.2.2 and 1.3.1 located at the ends of the branches, as well as in the attribute fields A of the subsets symbolized by all the nodes of the document.

According to the invention, when it is desired to partially transmit such a document, whether it is compressed beforehand or not, the field T containing the type of information of a data body DB that is not transmitted or is removed from the document receives a predefined value indicating that the following subset of information is not transmitted. This particular predefined value of the type of information is for example chosen to be equal to 0 in the case of a document in compressed form, the values of the other types of information being different from 0. If this predefined value appears in the transmitted document, the length field L and the fields A and Val which normally follow the type of information do not appear in the transmitted data. Consequently, a type of information equal to the predefined value is followed directly in the document by the data header DH of the next set of information, or by an end-of-document indicator.
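A toy byte format, given here only as a hedged sketch, shows the effect of the predefined value on the transmitted data. Everything about the format is an assumption made for the example: one byte for N, one byte for T, one byte for L, no attribute field A, and a hypothetical schema `STRUCTURED` telling the decoder which types contain nested bodies. Only the rule that T equal to 0 is followed by neither L nor Val is taken from the description.

```python
REMOVED = 0        # predefined type value: the subset is not transmitted
STRUCTURED = {1}   # hypothetical schema: type 1 holds nested bodies, others are leaves

def encode(bodies):
    """bodies: list of (T, payload); payload is bytes, a nested list,
    or None when T == REMOVED."""
    out = bytearray([len(bodies)])       # data header: occurrence count N
    for T, payload in bodies:
        out.append(T)
        if T == REMOVED:
            continue                     # neither L nor Val follows
        val = encode(payload) if T in STRUCTURED else payload
        out.append(len(val))             # L, a single byte in this toy format
        out += val
    return bytes(out)

def decode(data, pos=0):
    """Inverse of encode; returns (bodies, position after the decoded bytes)."""
    n = data[pos]
    pos += 1
    bodies = []
    for _ in range(n):
        T = data[pos]
        pos += 1
        if T == REMOVED:
            bodies.append((T, None))     # the next byte already belongs to the next body
            continue
        length = data[pos]
        pos += 1
        if T in STRUCTURED:
            payload, _ = decode(data, pos)
        else:
            payload = data[pos:pos + length]
        pos += length
        bodies.append((T, payload))
    return bodies, pos

full = [(2, b"hi"), (1, [(2, b"a"), (2, b"bc")])]
partial = [(2, b"hi"), (REMOVED, None)]
```

Encoding `partial` instead of `full` drops the whole nested subset and leaves a single zero byte in its place, which the decoder recognizes without needing the removed data.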

A parameter indicating whether or not the document is completely transmitted can be added to the header H of the document, so as to tell the recipient whether the document it is receiving is transmitted in its entirety.

Parts P1, P2 and P3 can be transmitted separately one or more times. They have for this purpose a header H, H2, H3 comprising first of all a parameter indicating that the document is not complete, followed by a definition of the location of the part transmitted in the tree structure of the complete document.

In this way, a structured document can be enriched and modified over time.

It should be noted that the transmission of the main part P1 is not necessary since, thanks to the definition of the location appearing in the header of the secondary parts, the processing unit which receives the transmitted secondary parts can determine the location of the received part in the document structure and thus decode it. In addition, the document can be split up so that the main part does not contain any useful data, and so that the whole document can be reconstructed from the secondary parts and their location in the document structure.

In addition, the header H, H2, H3 of the parts P1, P2, P3 can include information specifying a method of processing the part with respect to a part already transmitted and associated with the same location in the structure, indicating for example whether the transmitted part must replace the part already transmitted for that location, must be disregarded if it already appears in the document received, or must be merged with the part already transmitted for that location.
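The three processing methods mentioned above (replace, disregard, merge) can be sketched on the receiving side as follows. This is a minimal illustration under stated assumptions: the `Part` and `Mode` names, the use of a dictionary as the part body, and the merge rule (union of keys, the newer value winning) are all hypothetical, since the description does not define how a merge is carried out.

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    REPLACE = "replace"   # the new part takes the place of the one already received
    IGNORE = "ignore"     # the new part is disregarded if the location is already filled
    MERGE = "merge"       # the new part is combined with the one already received

@dataclass
class Part:
    complete: bool        # header indicator: is the document complete?
    location: str         # header: location of the part in the document structure
    mode: Mode            # header: processing method for a repeated location
    body: dict            # simplified part content

def receive(store, part):
    """Apply `part` to `store`, a dict mapping each location to its content."""
    old = store.get(part.location)
    if old is None or part.mode is Mode.REPLACE:
        store[part.location] = dict(part.body)
    elif part.mode is Mode.MERGE:
        # Assumption: merging is a key-wise union in which the newer value wins.
        store[part.location] = {**old, **part.body}
    # Mode.IGNORE with existing content: nothing is changed.
    return store

store = {}
LOC = "/1.2/1.2.1"
receive(store, Part(False, LOC, Mode.REPLACE, {"a": 1}))
receive(store, Part(False, LOC, Mode.IGNORE, {"b": 2}))
after_ignore = dict(store[LOC])
receive(store, Part(False, LOC, Mode.MERGE, {"b": 2}))
after_merge = dict(store[LOC])
receive(store, Part(False, LOC, Mode.REPLACE, {"c": 3}))
```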

As illustrated in FIG. 4, this definition of location can include the names of all the upper nodes up to the root node R, each possibly associated with a sequence number relative to the upper node. For example, the first node of the first node of the third node of the first node attached to the root node (identified in Figure 4 by a succession of arrows from the root node R) can be referenced as follows:

/c/a[last]/b[1]/d

This notation designates the node of type "d" connected to the first node of type "b" connected to the last node of type "a" connected to the node of type "c" which is connected directly to the root node R. Other parts of the document can then be transmitted either by using the absolute definition method (relative to the root node R) described above, or, advantageously, by using a relative definition method. So, for example, the third node connected to the same immediately superior node as the previous node can be referenced as follows:

../e[2]

This notation indicates that one refers to the second node, which must be of type "e", connected to the same node of immediately higher level, which is referenced by the notation "..". It appears that this second method is more compact than the first.
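The absolute and relative notations can be resolved, by way of a hedged sketch, with the small Python function below. The tree is a guess at the shape of Figure 4 that is merely consistent with the text, and the interpretation of `e[2]` as "second child of type e" and of `a[last]` as "last child of type a" is an assumption modelled on XPath-like conventions; the `Node` class and `resolve` function are hypothetical.

```python
import re

class Node:
    def __init__(self, tag, *children):
        self.tag = tag
        self.children = list(children)

STEP = re.compile(r"(\w+)(?:\[(\d+|last)\])?")

def resolve(root, path, previous=None):
    """Return the chain of nodes from the root down to the node designated by `path`.
    A path starting with "/" is absolute; otherwise it is relative to `previous`,
    the chain returned for the last part located."""
    chain = [root] if path.startswith("/") else list(previous)
    for step in filter(None, path.split("/")):
        if step == "..":
            chain.pop()                      # climb to the immediately higher node
            continue
        tag, index = STEP.fullmatch(step).groups()
        same = [c for c in chain[-1].children if c.tag == tag]
        chosen = same[-1] if index == "last" else same[int(index or 1) - 1]
        chain.append(chosen)
    return chain

# Hypothetical tree consistent with the example paths.
d, e1, e2 = Node("d"), Node("e"), Node("e")
b = Node("b", d, e1, e2)
a3 = Node("a", b)
c = Node("c", Node("a"), Node("a"), a3)
R = Node("R", c)

first = resolve(R, "/c/a[last]/b[1]/d")
second = resolve(R, "../e[2]", previous=first)
```

The relative form only has to name the last step, which is why it is shorter than repeating the whole path from the root.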

Alternatively, the definition of the location of the transmitted document part P2, P3 can simply include a reference to the document part, this reference having been transmitted beforehand in the main part P1 of the document, for example following the predefined value indicating that the following subset of information is not transmitted.

Preferably, the document or the parts P1, P2, P3 of the document to be transmitted are compressed beforehand. To this end, advantageously, in each part of the document, the structure information and the content information are distinguished, certain parts of the document possibly comprising no content information. Thus in the example of FIGS. 2 and 3, the structure information consists of all the fields except the Val value fields, when these are not structured, that is to say are not decomposable into structured information subsets. In the example in FIG. 2, these are the Val fields of the information subsets 1.1, 1.2.1.1, 1.2.2.1, 1.2.2.2, and 1.3.1, located at the lower ends of the branches of the tree structure of the document.

The compression processing proper consists, for example, in sequentially reading the part of the document to be compressed, applying an appropriate compression algorithm to the structure information, and applying a compression algorithm adapted to the type of information whenever a non-decomposable Val field appears during the reading of the document part. It should be noted that in the compressed document or document part, the structure information and the content information appear in the same order as in the original uncompressed document. A statistical compression algorithm, such as Zip, can also be applied.
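The sequential processing just described may be illustrated, without any claim to reproduce an actual codec, by the following Python sketch. The schema `SCHEMA`, the one-byte tag code used as structure information, and the two content codecs (`zlib` for text, a fixed 4-byte encoding for integers) are all assumptions chosen for the example; the description does not specify which algorithms are used.

```python
import struct
import zlib

# Hypothetical schema known to both sender and recipient: tag -> content type.
SCHEMA = {"title": "text", "year": "int"}
TAGS = list(SCHEMA)

# Hypothetical content codecs, one per type of information.
ENCODERS = {
    "text": lambda v: zlib.compress(v.encode("utf-8")),
    "int": lambda v: struct.pack(">i", v),
}
DECODERS = {
    "text": lambda b: zlib.decompress(b).decode("utf-8"),
    "int": lambda b: struct.unpack(">i", b)[0],
}

def compress(items):
    """items: (tag, value) pairs in document order.
    Each output pair holds the coded structure followed by the coded content."""
    out = []
    for tag, value in items:
        structure = bytes([TAGS.index(tag)])      # structure information
        content = ENCODERS[SCHEMA[tag]](value)    # content information
        out.append((structure, content))
    return out

def decompress(chunks):
    """Rebuild the (tag, value) pairs, relying on the shared schema for the types."""
    items = []
    for structure, content in chunks:
        tag = TAGS[structure[0]]
        items.append((tag, DECODERS[SCHEMA[tag]](content)))
    return items

items = [("title", "Example"), ("year", 2001), ("title", "Other")]
chunks = compress(items)
```

Because the chunks are produced in reading order, the structure and content information keep the order they had in the uncompressed part.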

Claims

1. Method for dividing a structured document (D) having a hierarchical structure defined by a structure diagram, this document gathering a structured main set of information (1) including subsets of information (1.1, 1.2, 1.3, ..., 1.2.2.2), at least some of the information subsets being structured and possibly including information subsets of lower hierarchical level, each information subset being associated, in the information set of higher level, with a respective type of information (T), characterized in that it comprises the steps consisting in:

- dividing the document into structured parts (P1, P2, P3) which can be handled individually, namely a main part (P1) and at least one secondary part (P2, P3), the main part containing at least the main information set (1), and the secondary part containing a subset of information (1.2.1, 1.2.2) which is removed from the main information set, each secondary part being attached to the main part or to another secondary part, and

- assigning, in the information sets (1.2) from which at least one subset of information has been removed, a predefined value to the type of information (T) of each subset of information (1.2.1, 1.2.2) removed.

2. Method according to claim 1, characterized in that the document (D) comprises a header (H) which is inserted in each part (P1, P2, P3) removed from the document, this header comprising an indicator whose value indicates whether the document is complete or not.

3. Method according to claim 1 or 2, characterized in that each part (P1, P2, P3) removed from the document comprises a header (H, H2, H3) comprising information giving the location of the part in the hierarchical structure of the document.

4. Method according to claim 3, characterized in that said location information of the secondary part in the hierarchical structure of the document describes a path in this structure, defining the position of the secondary part in the document.

5. Method according to claim 4, characterized in that said path is defined in an absolute manner with respect to the main set of information of the document.

6. Method according to claim 4, characterized in that, each secondary part removed from the main document being transmitted separately from the main part of the document, said path is defined in a relative manner with respect to the position of a last transmitted secondary part.

7. Method according to claim 3, characterized in that each type of information (T) assigned to the predefined value, appearing in a set of information, is followed by a reference to the secondary part (P2, P3) containing the information subset removed from the information set, said location information of the secondary part in the hierarchical structure of the document being the reference of said secondary part.

8. Method according to one of claims 1 to 7, characterized in that it further comprises the transmission of several parts of the document associated with the same location in the structure, the last transmitted part replacing the part of the document previously transmitted and associated with the same location in the structure.

9. Method according to one of claims 1 to 7, characterized in that it further comprises the transmission of several parts of the document associated with the same location in the structure, the header of each part comprising information indicating the mode of processing of the part with respect to a part already transmitted associated with the same location in the structure.

10. Method according to one of claims 1 to 9, characterized in that the main part and the secondary parts removed from the main part are compressed, then transmitted separately.

11. Method according to claim 10, characterized in that, each set and subset of information comprising structure information and content information, the structure information is compressed using a structure information compression algorithm, and the content information is compressed using an algorithm suited to the type of information (T) of the content, the structure information and the content information appearing in the compressed document part in the same order as in the corresponding uncompressed document part.

12. Method according to one of claims 1 to 11, characterized in that the document is of SGML, XML or HTML type.