FR2901036A1

FR2901036A1 - Method for generating reference structural patterns for coding hierarchical data, in which one reference structural pattern is determined per group of primary structural patterns and represents the patterns of its group

Info

Publication number: FR2901036A1
Application number: FR0604208A
Authority: FR
Inventors: Herve Ruellan; Romain Bellessort
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2006-05-11
Filing date: 2006-05-11
Publication date: 2007-11-16
Anticipated expiration: 2026-05-11
Also published as: FR2901036B1

Abstract

The method involves extracting primary structural patterns associated with hierarchical data, where each pattern represents a set of structural information and the data is organized into a set of items. An item represents a node if it contains at least one child item. Primary patterns that lie, with respect to at least one of them, at a distance less than or equal to a predetermined value are grouped together. One reference structural pattern is determined per group of patterns; it represents the patterns of the group with which it is associated. Independent claims are also included for the following: (1) a method for coding hierarchical data, comprising a step of generating reference structural patterns that represent the hierarchical data according to the method for generating reference structural patterns; (2) a device for generating reference structural patterns that represent hierarchical data; (3) a device for coding hierarchical data; (4) a computer program product comprising a sequence of instructions for implementing the method for generating reference structural patterns; (5) a computer program product comprising a sequence of instructions for implementing the method for coding hierarchical data; (6) a computer-readable data storage medium implementing the method for generating reference structural patterns; (7) a computer-readable data storage medium implementing the method for coding hierarchical data.

Description


The present invention relates to a method, a device and a computer program for generating reference structural patterns capable of representing hierarchical data. The invention also relates to a method, a device and a computer program for coding this type of data using the reference structural patterns thus generated.

Many applications manipulate hierarchically structured data, also known as "hierarchical data". A hierarchical data document contains two types of information: a first type that describes the structure of the document and a second type that describes the actual content of the data. Information of the first type, called "structural information", is all the information used to organize the data hierarchically. Information of the second type, called "content information", represents the values or instances taken by the data in the document.

The link between structural information and content information depends on the language used to organize the data hierarchically. In general, however, a document containing hierarchical data can be seen as a set of "items" organized into a "tree". An item represents a node if it contains at least one other item, and a leaf if it contains none. The node at the highest hierarchical level is the root node. The root node may therefore contain subnodes, also called "child nodes", which may themselves contain other subnodes, and so on. A node is identified in the data structure by an opening tag and, most often, a closing tag. All data between the opening tag and the closing tag is part of the node and represents the item or items at the lower hierarchical level relative to that node. Thus, if one of the items is a child node, the tags defining this child node and the data associated with it are contained between the two tags of the parent node.

There are several ways to describe a hierarchical data structure. The most common uses XML, an acronym for eXtensible Markup Language. This language is standardized by the W3C (a description of the language can be found at http://www.w3.org/TR/REC-xml). XML is increasingly used for storing and transmitting digital data.

XML defines a particular syntax for mixing structural information and content information. In this syntax, a node, called an "element", is defined by an opening tag, a closing tag and an identifier. A leaf item, that is, an item other than an element, most often represents content and may be, for example, text, a comment, a processing instruction or an attribute. An attribute is an item located in the opening tag of an element; in addition to its own content, it carries an identifier that names it.

Figure 1 shows a simple example of a document containing hierarchical data written in XML. This document contains eleven elements. The root element, which has the identifier "liste" (list), is delimited by the opening tag <liste> and the closing tag </liste>, and includes three "employé" (employee) elements. The "employé" elements themselves contain other elements such as "prénom" (first name), "nom" (last name) or "ville" (city).

XML has many advantages. Its syntax is textual, so it can easily be read or written by a user. It is also generic, so it can serve as a basis for building new, more complex languages. However, describing data in XML has a number of disadvantages.
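Figure 1 is not reproduced in this text. As a rough illustration only, the sketch below builds a hypothetical document that matches the description above (eleven elements, a "liste" root, three "employé" elements, an optional "ville") and counts its elements with Python's standard `xml.etree.ElementTree` module. The names and the city are invented placeholders, not the content of the actual figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for Figure 1: the values (Anne, Martin, ...) are invented.
DOCUMENT = """\
<liste>
  <employé><prénom>Anne</prénom><nom>Martin</nom><ville>Rennes</ville></employé>
  <employé><prénom>Paul</prénom><nom>Durand</nom></employé>
  <employé><prénom>Luc</prénom><nom>Petit</nom></employé>
</liste>
"""

root = ET.fromstring(DOCUMENT)

# Every element of the tree, root included.
elements = list(root.iter())

# One element type per distinct identifier.
element_types = {element.tag for element in elements}

# Child identifiers of each "employé" element, in document order.
employee_children = [
    [child.tag for child in element]
    for element in root
    if element.tag == "employé"
]

print(len(elements))           # 11 elements in total
print(sorted(element_types))   # the five distinct identifiers
print(employee_children)
```

Only the first "employé" carries a "ville" child here, which is one way of reading the later remark that this sub-element is optional.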

First, a document written in XML is large, since it includes not only content information, which is the actual payload, but also structural information. Handling such a document is therefore difficult in terms of storage, exchange and processing.

In addition, structural information, whose initial role is to organize the document's data hierarchically, is not optimized for the various kinds of processing that may be applied to that data, such as searching for a particular item in the document or compressing the document for storage or transmission.

A first approach to solving these problems is to base the processing not on the structural information of the XML document itself, but on that of a document associated with it or from which it derives. This is typically the case of an XML Schema (a description of which can be found at http://www.w3.org/TR/xmlschema-1/ and http://www.w3.org/TR/xmlschema-2/). XML Schema is a language that defines the types of data present in an XML document. A document written in XML Schema is in effect a directory of the allowed data types and a structural model for all XML documents that conform to the schema. These types cover both the types of leaf items, which are most often simple types (integer, text, etc.), and the types of XML elements. According to the XML Schema syntax, the type of an element is defined by its identifier. Consequently, there are as many element types as there are distinct element identifiers. In the example of Figure 1, there are five element types: liste, employé, nom, prénom and ville. The employé type comprises two sub-elements of type nom and prénom and, optionally, an element of type ville.

For example, document US 2004/172591 from Microsoft, entitled "Method and system for inferring a schema from a hierarchical data structure for use in a spreadsheet", discloses a method for generating an XML schema from an XML document. The schema is generated from the identifiers of the XML elements: while the document is traversed, if an identifier is encountered several times, the same structure is used to define the type of that element in the XML Schema sense.

Using a schema can improve the performance of certain processing operations, because the schema describes the element structures that the XML document must necessarily satisfy. For example, it is known to use the XML schema document associated with an XML document to be compressed so as to encode only the instance values. Decompression then consists simply of applying the instance values to the XML schema document to recover the original XML document.

However, using an XML schema is not optimal for applying processing to the XML document. The XML Schema syntax serves primarily to check whether the types of the various items of an XML document are correct and whether the document satisfies the constraints imposed by the schema model. It would therefore be useful to be able to perform a new decomposition of a structured document that provides more information about its structure, so that many applications can process the document more efficiently.

To this end, the invention first provides a method of generating reference structural patterns capable of representing hierarchical data. The method comprises the following steps:

- extracting primary structural patterns associated with the hierarchical data, each primary structural pattern representing a set of structural information;
- grouping the primary structural patterns located, with respect to at least one of them, at a distance less than or equal to a predetermined value; and
- determining one reference structural pattern per group of structural patterns, said reference structural pattern being capable of representing the primary structural patterns of the group associated with it.

The invention analyzes hierarchical data in order to extract structural patterns from it, called primary structural patterns. A structural pattern is the description of a part of the structure of the hierarchical data. The aim of the invention is to find structural patterns that recur in the hierarchical data. To do so, the method according to the invention determines, from the primary structural patterns extracted from the data, reference structural patterns capable of representing the hierarchical data.

Subsequently, the reference structural patterns make it possible, in particular, to code the data so as to reduce its size. The method relies, in particular, on a step of grouping primary structural patterns that are similar, the similarity being determined by computing a distance between primary structural patterns.

According to one feature, the hierarchical data being organized into a plurality of items, an item representing a node if it contains at least one other item, called a child item, the structural information of a primary structural pattern relates to a node and its direct child items only.
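The feature above can be sketched as follows. This is a minimal illustration under stated assumptions, not the extraction algorithm of the invention: a primary structural pattern is modeled here as the pair (node identifier, identifiers of its direct child elements), and text, attributes and other leaf items are ignored for brevity. The function name `extract_primary_patterns` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def extract_primary_patterns(root):
    """Return one (identifier, child identifiers) pair per element that has child elements.

    Assumption: only child *elements* are recorded; text and attributes are skipped.
    """
    patterns = []
    for element in root.iter():
        children = tuple(child.tag for child in element)
        if children:  # elements without child elements yield no pattern here
            patterns.append((element.tag, children))
    return patterns

# Hypothetical document in the spirit of Figure 1 (content omitted).
doc = ET.fromstring(
    "<liste>"
    "<employé><prénom/><nom/><ville/></employé>"
    "<employé><prénom/><nom/></employé>"
    "<employé><prénom/><nom/></employé>"
    "</liste>"
)

patterns = extract_primary_patterns(doc)
counts = Counter(patterns)  # how often each primary pattern recurs

for pattern, count in counts.items():
    print(count, pattern)
```

Counting identical patterns shows which structures recur: in this sample the pattern for an "employé" with only "prénom" and "nom" occurs twice.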

According to this feature, the structural information of a primary structural pattern is determined from the structural information contained in a node and in the child items of that node.

According to a variant embodiment, the structural information of a primary structural pattern relates to a plurality of nodes having a hierarchical relationship with one another. In this variant, the structural information of a primary structural pattern is determined from the structural information contained in a plurality of nodes that are hierarchically related to one another.

According to another feature, the hierarchical data is described in a markup language that structures the data, in particular XML.

According to yet another feature, the groups resulting from the grouping step comprise primary structural patterns that are located, pairwise, at a distance less than or equal to the predetermined value. According to this feature, a group is determined such that all of its members are similar to one another, the similarity being determined by computing the distance between primary structural patterns, pair by pair.
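To make the pairwise criterion concrete, here is a small sketch that forms groups in which every pair of members lies within the threshold. Several things are assumptions made for illustration: the greedy, order-dependent strategy, the stand-in `distance` function (a count of child identifiers present in only one of the two patterns) and the sample identifiers "pays", "titre" and "auteur". It is not the grouping procedure of Figure 5.

```python
def distance(a, b):
    """Assumed stand-in metric: number of child identifiers found in only one pattern."""
    return len(set(a) ^ set(b))

def group_pairwise(patterns, threshold, dist=distance):
    """Greedy grouping: a pattern joins the first group whose members are ALL within threshold."""
    groups = []
    for pattern in patterns:
        for group in groups:
            if all(dist(pattern, member) <= threshold for member in group):
                group.append(pattern)
                break
        else:
            # No existing group accepts the pattern: start a new one.
            groups.append([pattern])
    return groups

# Patterns reduced to their child identifiers (hypothetical examples).
patterns = [
    ("prénom", "nom"),
    ("prénom", "nom", "ville"),
    ("prénom", "nom", "ville", "pays"),
    ("titre", "auteur"),
]

groups = group_pairwise(patterns, threshold=1)
for group in groups:
    print(group)
```

With a threshold of 1, the third pattern is within distance 1 of the second but at distance 2 from the first, so the pairwise criterion keeps it out of their group. Under the weaker criterion stated earlier (within the threshold of at least one member), it would have been admitted.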

According to a particular embodiment, the distance between a first and a second primary structural pattern is defined as the number of items of structural information to be added, deleted and/or modified in the first primary structural pattern to obtain the second. In this embodiment, the distance is computed by determining the items to be added, deleted and/or modified to go from the first primary structural pattern to the second. It is understood that any combination of additions, deletions and modifications may be considered when computing the distance between the two primary structural patterns.

According to one embodiment, the reference structural pattern is the primary structural pattern of a group with respect to which all the primary structural patterns of the group are located at a distance less than or equal to the predetermined value. In this embodiment, the reference structural pattern is thus a primary structural pattern belonging to the group.
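One way to read the distance above is as a classic edit distance (Levenshtein distance) over the sequence of child identifiers, with additions, deletions and modifications each costing 1. That reading, the unit costs, the tie-breaking rule in `pick_reference` and the identifier "pays" are all assumptions made for this sketch; the text does not prescribe them.

```python
def edit_distance(a, b):
    """Minimum number of additions, deletions and modifications turning sequence a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # add y
                previous[j - 1] + (x != y),  # modify x into y (free if equal)
            ))
        previous = current
    return previous[-1]

def pick_reference(group, threshold):
    """Return a group member within `threshold` of every member, or None if there is none.

    Assumed tie-break: among valid candidates, prefer the smallest total distance.
    """
    candidates = [
        p for p in group
        if all(edit_distance(p, q) <= threshold for q in group)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda p: sum(edit_distance(p, q) for q in group))

# Hypothetical group of patterns, reduced to their child identifiers.
group = [
    ("prénom", "nom"),
    ("prénom", "nom", "ville"),
    ("prénom", "nom", "ville", "pays"),
]

reference = pick_reference(group, threshold=1)
print(reference)
```

In this group only the middle pattern is within distance 1 of both others, so it is the one selected as the reference pattern.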

According to another embodiment, the reference structural pattern associated with a group is constructed by combining the structural information of all the primary structural patterns of the group; the reference structural pattern thus constructed is called an encompassing reference structural pattern. In this embodiment, the reference structural pattern comprises all the structural information of all the primary structural patterns of the group.

The invention also relates to a method of coding hierarchical data, characterized in that it comprises the following steps:

- generating reference structural patterns capable of representing the hierarchical data according to the method of generating reference structural patterns briefly described above;
- determining difference information between the reference structural patterns and the associated hierarchical data; and
- coding the hierarchical data as a function of the reference structural patterns and the difference information.

According to this method, reference structural patterns are generated by the method described above so that the hierarchical data can be recoded at a smaller coded size. Once the structures of the hierarchical data have been determined (by means of the reference structural patterns), the data is recoded on the basis of those reference structural patterns. This avoids coding the structural information for each data item and thus significantly reduces the coded size of the hierarchical data.

According to one feature, the difference information between the reference structural patterns and the associated hierarchical data comprises structural information and content information.
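The encompassing pattern and the difference-based coding described above can be sketched together. This is a deliberately simplified illustration: it assumes each child identifier occurs at most once per pattern and that all patterns of a group list their children in a compatible order, and it represents the structural difference as one presence flag per item of the reference. Content information, which the text also includes in the difference information, is left out. The function names are hypothetical.

```python
def encompassing_pattern(group):
    """Union of the child identifiers of all patterns, in order of first appearance."""
    merged = []
    for pattern in group:
        for name in pattern:
            if name not in merged:
                merged.append(name)
    return tuple(merged)

def encode(pattern, reference):
    """Structural difference: for each item of the reference, is it present in the pattern?"""
    return tuple(name in pattern for name in reference)

def decode(flags, reference):
    """Rebuild a pattern from the reference and its presence flags."""
    return tuple(name for name, present in zip(reference, flags) if present)

# Hypothetical group of primary patterns, reduced to their child identifiers.
group = [
    ("prénom", "nom"),
    ("prénom", "nom", "ville"),
]

reference = encompassing_pattern(group)
encoded = [encode(pattern, reference) for pattern in group]

print(reference)
for flags in encoded:
    print(flags)
```

Because the reference pattern is stored once, each occurrence only needs its flags (plus its content values) rather than a full copy of its tags, which is the source of the size reduction the method aims for.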
The invention also relates to a device for generating reference structural patterns capable of representing hierarchical data, characterized in that it comprises:

- extraction means for extracting primary structural patterns associated with the hierarchical data, each primary structural pattern representing a set of structural information;
- grouping means for grouping primary structural patterns located, with respect to at least one of them, at a distance less than or equal to a predetermined value; and
- determining means for determining one reference structural pattern per group of primary structural patterns, said reference structural pattern being capable of representing the primary structural patterns of the group associated with it.

Likewise, the invention proposes a device for coding hierarchical data, characterized in that it comprises:

- a device for generating reference structural patterns capable of representing the hierarchical data, in accordance with the device briefly described above;
- determining means for determining difference information between the reference structural patterns and the associated hierarchical data; and
- coding means for coding the hierarchical data as a function of the reference structural patterns and the difference information.

These devices have the same advantages as the methods they implement, which have been briefly described above.

The present invention also relates to a storage means, possibly partially or totally removable, readable by a computer or a microprocessor and holding instructions of a computer program, allowing the methods set out above to be implemented.

Finally, the present invention relates to a computer program product that can be loaded into a programmable apparatus, comprising sequences of instructions for implementing the methods set out above when the program is loaded and executed by the programmable apparatus.
Other aspects and advantages of the present invention will appear more clearly on reading the description which follows, given solely by way of nonlimiting example and with reference to the appended drawings, in which:
- FIG. 1 represents a simple example of a hierarchical data document written in XML;
- FIG. 2 illustrates an algorithm for generating reference structural patterns according to the invention;
- FIG. 3 represents an example of an algorithm for extracting primary structural patterns of order 1 from the data of a document written in the XML markup language;
- FIG. 4 represents an algorithm for extending a primary structural pattern of order 1 by including the descriptions of the primary structural patterns it references;
- FIG. 5 represents an example implementation of the grouping of similar primary structural patterns;
- FIG. 6 represents an algorithm for constructing an encompassing reference structural pattern;
- FIG. 7 represents an algorithm implementing the coding method using the reference structural patterns;
- FIG. 8 is a block diagram illustrating a device suitable for implementing the present invention.

FIG. 2 illustrates an algorithm for generating reference structural patterns according to the invention. The reference structural patterns serve to represent the hierarchical data in an optimized manner. It is thus not necessary to maintain primary structural patterns that exactly represent each of the different structures present in the hierarchical data, but only a few reference structural patterns that make it possible to represent all the structures at reduced cost.

The hierarchical data may be located in one or more documents, depending on the application. When the steps of the method according to the invention are applied to the hierarchical data of several documents, the reference structural patterns are optimized to represent the data of all these documents. This obviously does not prevent reference structural patterns generated from the data of one document from being used to represent data belonging to other documents.

In step 210, structural patterns are extracted from the hierarchical data. These structural patterns are called primary because they reproduce the different structures of the hierarchical data. A primary structural pattern represents a set of structural information relating to a portion of the data tree. The structural information may be, for example: the nature of an item (node or leaf), the number of child items contained in a node, the order of these child items, the type of an item, etc. By analogy with the tree representation of hierarchical data, a structural pattern can also be represented as a tree of structural information. In this case, the nodes and their associated child items represent the structure of the structural pattern without any content information. In the remainder of the description, the tree representation is used both for the hierarchical data and for the associated structural patterns.

A structural pattern can span one or more hierarchical levels. If it spans a single hierarchical level, the structural pattern is said to be of order 1; the structural information then relates to a node and to its direct child items only. If it spans n hierarchical levels, the structural pattern is said to be of order n; the structural information then relates to a node and to its child items down to generation (level) n-1. A pattern of order n does not necessarily include all the possible branches deriving from the root node, but at least one that extends down to level n-1. Details on the choice of the order and the associated benefits are given later. An example implementation of this step of extracting primary structural patterns from hierarchical data written in XML is given in FIGS. 3 and 4.

In step 220, the primary structural patterns that are structurally similar to one another are grouped together. The degree of similarity is measured quantitatively by evaluating a distance indicative of the number of items of structural information by which two primary structural patterns differ. When two primary structural patterns are similar, the data that they represent are structurally close. It is therefore advantageous, in order to reduce the size of the document, to replace the description of a pattern by a reference to another known pattern (which is similar to it) together with a description of the difference between the actual structural pattern and the reference structural pattern. This way of relating primary structural patterns to one another allows optimizations in the description and, consequently, in the processing applied to the data, such as coding or searching. These optimizations are not possible with the methods of the state of the art.

Indeed, according to the methods of the state of the art, an element of type "manager" (not shown in FIG. 1) containing, for example, exactly the same child items ("surname" and "first name") as the "employee" element would not be associated with the latter, because the two elements are of different types. Yet the only structural difference between the two elements is the type of the element, that is to say the identifier of the node ("employee" / "manager"). According to the invention, these two elements can be considered similar. Several variants can be envisaged for forming the groups of similar structural patterns. Examples of implementation of this step are described later, in particular with reference to the algorithm of FIG. 5.

In step 230, a reference structural pattern is determined to represent the primary structural patterns of each group. Because of the similarities that exist between the primary structural patterns of a given group, a single structural pattern can be used to represent all of them. This pattern is called the reference structural pattern. The reference structural pattern may be chosen from among the primary structural patterns of the group or constructed from these primary structural patterns.

Examples of implementation of this step are described later, in particular with reference to the algorithm of FIG. 6.

FIG. 3 illustrates an example of an algorithm for extracting primary structural patterns of order 1 from the data of a document written in the XML markup language.

This algorithm is executed recursively, starting from the root element of the XML document. The procedure is therefore started in step 310 with this root element. In step 320, a child item is selected, and a test is then performed in step 330 to determine whether this item is itself an XML element. If so, the pattern extraction procedure is called again for this child item (step 331). The extraction procedure returns a reference to the pattern thus created for the child item, and this reference is added to the pattern of the XML element in step 333. Optionally, if the child item is an element that contains no child item (positive outcome of the test of step 332), no reference to the (empty) pattern associated with the child element is added to the pattern of the parent element. In the negative outcome of the test of step 330, that is, if the selected child item is not an element, the type of this item is added to the pattern associated with the parent element.

Optionally, a condition (step 335) may be added concerning the inclusion of the type of the child item in the pattern. This condition can also apply (case not shown in the algorithm) to the inclusion of a reference to a pattern. The condition on the inclusion of a sub-item in the primary structural pattern makes it possible to adapt this pattern to particular processing operations. For example, if the purpose of creating the patterns is to identify all the elements having the identifier "first name", the inclusion condition is that the item must be an element and that this element either has "first name" as its identifier or is non-empty. Reduced primary structural patterns are thus obtained that are particularly well suited to the intended processing.

Step 340 consists of checking whether another child item s remains to be processed. If at least one other child item remains, the algorithm returns to step 320, described above, so as to process the next child item. Otherwise, the algorithm ends at step 350 and returns the primary structural pattern m_e created for the XML element e. When all the child items of a parent element are themselves elements, the order-1 structural pattern associated with the parent element contains, for its child items, only references to patterns. When several of the primary structural patterns identified are identical, it suffices to keep a single copy for subsequent processing.
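By way of illustration only, the recursive extraction of FIG. 3 might be sketched as follows in Python. The tuple representation of a pattern, the function name extract_pattern, the tag names of the sample document and the choice to ignore XML attributes are assumptions made for this sketch and are not taken from the description; the optional tests of steps 332 and 335 are left out.

```python
import xml.etree.ElementTree as ET

def extract_pattern(element, patterns):
    """Return the index, in `patterns`, of the order-1 pattern of `element`.

    A pattern is represented (assumption of this sketch) as
    (identifier, child items), where a child item is either
    ("type", "text") or ("ref", index of another pattern).
    """
    children = []
    # Non-element child item: non-blank text directly inside the element.
    if element.text and element.text.strip():
        children.append(("type", "text"))
    for child in element:                        # steps 320 / 340: each child item
        ref = extract_pattern(child, patterns)   # step 331: recursive call
        children.append(("ref", ref))            # step 333: add the reference
        if child.tail and child.tail.strip():    # text following the child element
            children.append(("type", "text"))
    pattern = (element.tag, tuple(children))
    if pattern not in patterns:                  # keep a single copy of identical patterns
        patterns.append(pattern)
    return patterns.index(pattern)

# Hypothetical document, loosely modelled on the employee list discussed in the text.
DOC = ("<list>"
       "<employee><firstname>A</firstname><surname>B</surname></employee>"
       "<employee><firstname>C</firstname><surname>D</surname><city>E</city></employee>"
       "</list>")
patterns = []
root_id = extract_pattern(ET.fromstring(DOC), patterns)
```

In this sketch the two "employee" elements yield two distinct order-1 patterns, whereas the "firstname" and "surname" patterns are each stored only once.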

An example of determining order-2 patterns, in a variant embodiment, consists of first extracting primary structural patterns of order 1 according to the algorithm of FIG. 3, and then extending these primary structural patterns by one additional hierarchical level.
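A minimal sketch of such a one-level extension is given below in Python, assuming that order-1 patterns are stored as (identifier, child items) tuples in which a child item is either ("type", ...) or ("ref", index). The helper names, the "pattern" marker used for an inlined description and the example condition (inline only patterns whose own children are all of simple type) are hypothetical choices for illustration, not the patented procedure itself.

```python
def extend_pattern(pattern, patterns, condition):
    """Return a copy of `pattern` in which each reference that satisfies
    `condition` is replaced by the description of the referenced pattern."""
    tag, children = pattern
    extended = []
    for kind, value in children:
        if kind == "ref" and condition(patterns[value]):
            # Replace the reference by the referenced pattern's own description.
            extended.append(("pattern", patterns[value]))
        else:
            # Simple types, and references failing the condition, are kept as-is.
            extended.append((kind, value))
    return (tag, tuple(extended))

def only_simple_children(pattern):
    """Example condition: none of the pattern's child items is a reference."""
    return all(kind == "type" for kind, _ in pattern[1])

# Hypothetical order-1 patterns.
PATTERNS = [
    ("firstname", (("type", "text"),)),
    ("surname", (("type", "text"),)),
    ("employee", (("ref", 0), ("ref", 1))),
    ("list", (("ref", 2), ("ref", 2))),
]
employee2 = extend_pattern(PATTERNS[2], PATTERNS, only_simple_children)
list2 = extend_pattern(PATTERNS[3], PATTERNS, only_simple_children)
```

Here the "employee" pattern becomes an order-2 pattern, while the "list" pattern is left unchanged because the pattern it references itself contains references.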

This embodiment is illustrated with reference to FIG. 4, which presents an algorithm for replacing the reference to a primary structural pattern with the description itself of that pattern. The procedure is started in step 410 with a primary structural pattern m. In step 420, a child item s of the pattern m is selected, and a first test is performed in step 430 to check whether this child item s references a primary structural pattern. If not, the algorithm moves on to the next child item, if any (test of step 460 and step 420). If so, a second test is performed in step 440 to determine whether a predetermined condition is satisfied.

The predetermined condition depends on the intended application. For example, to reduce the number of patterns, it may be worthwhile to extend only those patterns whose second-generation child items (the children of the child items) are of simple type, that is, are not nodes. In the example data of FIG. 1, the patterns associated with the "surname" and "first name" elements can be inserted directly into the primary structural pattern associated with the "employee" element, since the "surname" and "first name" elements are items that contain only leaf sub-items of text type. Another example consists of including in a pattern all the patterns to which it refers and which are not used by any other pattern.

Thus, following the test of step 440, if the condition is not satisfied, the algorithm moves on to the next child item, if any (test of step 460 and step 420). If the condition is satisfied, the reference carried by the child item s is replaced in step 450 by the description of the referenced pattern. Step 460 consists of checking whether another child item remains unprocessed. If so, the algorithm continues at step 420, described above. Otherwise, the algorithm ends at step 470.

Other variants can be envisaged for extending a pattern of order 1 into a pattern of order n. On the same principle as that of FIG. 4, the child items of each pattern that reference another pattern can be replaced by the description of the referenced pattern, recursively, until the desired depth is reached. Using a primary structural pattern of order n is particularly advantageous when the item at the lowest level of the pattern represents a simple type rather than a reference to a pattern.

FIG. 5 presents an example implementation of step 220 of the method of FIG. 2 for grouping primary structural patterns that are similar to one another.

The algorithm starts at step 500. A primary structural pattern m is then selected arbitrarily (step 510) from the set M of primary structural patterns extracted previously, for example according to the algorithms of FIGS. 3 and 4. In step 520, a group g(m) is initialized to the empty set.

At the end of the execution of the algorithm, this group g(m) contains the pattern m and, possibly, patterns similar to it. If no similar pattern is found, the group g(m) contains a single element, the pattern m itself. In step 530, a pattern m_s is selected from the patterns of the set M, and the distance d(m, m_s) between this pattern m_s and the pattern m is computed. This distance is then compared with a predetermined threshold, called the limit value, in step 540. If the primary structural patterns m and m_s are sufficiently close, that is, if the distance d(m, m_s) is less than or equal to the limit value, the algorithm continues at step 550; otherwise it goes directly to step 560. In step 550, the pattern m_s, identified by the distance computation as sufficiently close to the pattern m, is added to the group g(m) associated with m. The pattern m_s is also removed from the set M so that it is not included in other groups. Step 560 consists of testing whether the set M contains another primary structural pattern m_s that has not yet been selected. If so, the algorithm returns to step 530 to select a new pattern m_s. Otherwise, the algorithm continues at step 570. Step 570 consists of testing whether patterns remain in the set M to be grouped. If so, the algorithm continues at step 510 to form a new group. Otherwise, the algorithm ends at step 580.

Any primary structural pattern belonging to the group g(m) is necessarily at a distance from the primary structural pattern m that does not exceed the limit value (the pattern m is then called the pattern associated with the group g(m)). Note, however, that any two patterns of the group may be at a distance from each other greater than this limit value. In a variant of the pattern grouping method, groups can be chosen such that the primary structural patterns constituting each group all lie, pairwise, at a distance less than or equal to the predetermined limit value.

The distance d(m, m_s) measures quantitatively the similarity between two structural descriptions. When two primary structural patterns are close, one can be replaced by the other in the description of the data associated with them. The distance between two primary structural patterns is very large, or even infinite, if they do not correspond to XML elements having a similar structure. The distance used depends on the intended application. Likewise, the limit against which this distance is compared depends on the intended application and on the distance used.

One example of a method for calculating the distance between two primary structural patterns consists of determining the number of items of structural information that must be added to, removed from or modified in one pattern to obtain the other. As indicated above, the structural information may be the type of an item (the identifier of the element in the case of a compound type), the number of child items contained in a node, the order of these child items, etc.
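The grouping loop of FIG. 5 (steps 500 to 580) might be sketched as follows in Python. The function name group_patterns, the use of a list for the set M and the toy integer "patterns" with an absolute-difference distance are assumptions made purely for illustration; the sketch also assumes that d(m, m) = 0 and that the limit value is not negative, so that m always joins its own group.

```python
def group_patterns(patterns, distance, limit):
    """Greedy grouping: returns a list of (m, g(m)) pairs."""
    remaining = list(patterns)                # the set M
    groups = []
    while remaining:                          # step 570: patterns left to group?
        m = remaining[0]                      # step 510: select some pattern m
        group = []                            # step 520: g(m) starts empty
        for ms in list(remaining):            # steps 530 / 560: scan the patterns of M
            if distance(m, ms) <= limit:      # step 540: compare with the limit value
                group.append(ms)              # step 550: add m_s to g(m) ...
                remaining.remove(ms)          # ... and remove it from M
        groups.append((m, group))
    return groups

# Toy data: integers stand in for patterns, |a - b| for the distance.
groups = group_patterns([5, 4, 6, 10], lambda a, b: abs(a - b), limit=1)
```

With these toy values the first group is built around 5 and also contains 4 and 6, which lie at distance 2 from each other: every member is within the limit of the associated pattern, but not necessarily of the other members.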

Most often, when the primary structural patterns are very similar (same type of node, same order of items, or order not taken into account), the distance calculation reduces to determining the minimum number of item additions and/or deletions and/or modifications needed to go from one primary structural pattern to the other. Note that a child item of a pattern represents the type of the corresponding child item of the node that the pattern represents.

Consider, for example, the two primary structural patterns m1 and m2 associated respectively with the second and third child items ("employee") of the "list" element of FIG. 1. On the one hand, the primary structural pattern m1 corresponds to a node identified by the identifier "employee" and comprises two child items. The first child item is a primary structural pattern corresponding to a node identified by the identifier "first name" and having one child item of text type. The second child item is a primary structural pattern corresponding to a node identified by the identifier "surname" and having one child item of text type. On the other hand, the primary structural pattern m2 corresponds to a node identified by the identifier "employee" and comprises three child items. The first child item is a primary structural pattern corresponding to a node identified by the identifier "first name" and having one child item of text type. The second child item is a primary structural pattern corresponding to a node identified by the identifier "surname" and having one child item of text type. The third child item is a primary structural pattern corresponding to a node identified by the identifier "city" and having one child item of text type.

According to this first calculation method, the distance between the two primary structural patterns m1 and m2 is 1: the two patterns differ only by the addition of one item of primary-structural-pattern type, namely the "city" item. However, this method of calculating the distance may not be representative if the item added between m1 and m2 is the description of a large primary structural pattern.
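As an illustration only, the first calculation method can be sketched in Python as a unit-cost edit distance over the child items of two patterns. The `Pattern` type, the function name `distance_v1`, and the English item names are assumptions made for this sketch and do not appear in the patent; the sketch also assumes that both patterns describe nodes of the same type and that child items are compared as whole values.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    """Hypothetical representation of a structural pattern (sketch only)."""
    name: str              # node identifier, or "#text" for a text item
    children: tuple = ()   # child items, each itself a Pattern


def distance_v1(p, q):
    """Minimum number of child items to add, delete or modify to turn p into q."""
    a, b = p.children, q.children
    prev = list(range(len(b) + 1))      # turning an empty list into b[:j] takes j additions
    for i, x in enumerate(a, start=1):
        cur = [i]                       # turning a[:i] into an empty list takes i deletions
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # add y
                           prev[j - 1] + (x != y)))  # keep x, or modify x into y
        prev = cur
    return prev[-1]


TEXT = Pattern("#text")
m1 = Pattern("employee", (Pattern("firstname", (TEXT,)), Pattern("name", (TEXT,))))
m2 = Pattern("employee", (Pattern("firstname", (TEXT,)), Pattern("name", (TEXT,)),
                          Pattern("city", (TEXT,))))
print(distance_v1(m1, m2))  # 1: only the "city" item has to be added
```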

Alternatively, according to a second calculation method, the calculation of the distance separating the primary structural patterns m1 and m2 can be modified to take better account of nested primary structural patterns, in particular by bringing in the size of the primary structural pattern concerned. Thus, if the item added, deleted, or modified is the description of a primary structural pattern, its cost in terms of distance may be, for example, the number of child items of that primary structural pattern plus 1. The value 1 accounts for the identifier of the node described by the primary structural pattern. According to this second method, the distance between the primary structural patterns m1 and m2 is then 2.

According to a third calculation method, the distance can also weight additions, deletions, and modifications of items differently. With this third method, however, the distance between the primary structural pattern m1 and the primary structural pattern m2 may differ from the distance between m2 and m1 (d(m1, m2) ≠ d(m2, m1)).
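The second calculation method described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: `Pattern`, `item_cost`, and `distance_v2` are hypothetical names, and charging a modification the larger of the two item costs is a choice made here because the text does not specify it.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    """Hypothetical representation of a structural pattern (sketch only)."""
    name: str
    children: tuple = ()


def item_cost(item):
    """Cost of an item: its number of child items, plus 1 for its identifier."""
    return len(item.children) + 1


def distance_v2(p, q):
    """Size-aware edit distance between the child items of p and q."""
    a, b = p.children, q.children
    prev = [0]
    for y in b:                          # cost of adding b[:j] to an empty list
        prev.append(prev[-1] + item_cost(y))
    for x in a:
        cur = [prev[0] + item_cost(x)]   # cost of deleting everything seen so far
        for j, y in enumerate(b, start=1):
            # Assumption: a modification costs the larger of the two item costs.
            modify = 0 if x == y else max(item_cost(x), item_cost(y))
            cur.append(min(prev[j] + item_cost(x),     # delete x
                           cur[j - 1] + item_cost(y),  # add y
                           prev[j - 1] + modify))      # keep or modify
        prev = cur
    return prev[-1]


TEXT = Pattern("#text")
m1 = Pattern("employee", (Pattern("firstname", (TEXT,)), Pattern("name", (TEXT,))))
m2 = Pattern("employee", (Pattern("firstname", (TEXT,)), Pattern("name", (TEXT,)),
                          Pattern("city", (TEXT,))))
print(distance_v2(m1, m2))  # 2: "city" has one child item, plus 1 for its identifier
```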

In this case, the associations created between pairs of primary structural patterns are asymmetric. Indeed, a primary structural pattern m1 may be close to a primary structural pattern m2 (d(m1, m2) < limit) without m2 being close to m1 (d(m2, m1) > limit).
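The asymmetry described above can be illustrated with a weighted edit distance. The weights (`w_add`, `w_del`, `w_mod`) and the threshold `LIMIT` below are hypothetical values chosen only to make the effect visible; the patent does not give numerical weights.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    """Hypothetical representation of a structural pattern (sketch only)."""
    name: str
    children: tuple = ()


def weighted_distance(p, q, w_add=1.0, w_del=3.0, w_mod=2.0):
    """Cost of turning p into q when additions, deletions and modifications
    carry different (assumed) weights."""
    a, b = p.children, q.children
    prev = [j * w_add for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [i * w_del]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + w_del,                            # delete x
                           cur[j - 1] + w_add,                         # add y
                           prev[j - 1] + (0.0 if x == y else w_mod)))  # keep or modify
        prev = cur
    return prev[-1]


TEXT = Pattern("#text")
m1 = Pattern("employee", (Pattern("firstname", (TEXT,)), Pattern("name", (TEXT,))))
m2 = Pattern("employee", (Pattern("firstname", (TEXT,)), Pattern("name", (TEXT,)),
                          Pattern("city", (TEXT,))))
LIMIT = 2.0  # hypothetical threshold

print(weighted_distance(m1, m2))  # 1.0: one addition, so m1 is close to m2
print(weighted_distance(m2, m1))  # 3.0: one deletion, so m2 is not close to m1
```

With these weights, m1 can be replaced by m2 (the larger pattern), but not the reverse, which matches the one-way replacement the text describes.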

This means that the primary structural pattern m1 can be replaced by the primary structural pattern m2 in the description of the structure of the document, whereas m2 cannot be replaced by m1. It should be noted that the algorithm of Figure 5 for grouping the primary structural patterns also works with asymmetric distances. Indeed, each group is associated with a given primary structural pattern. The distances are calculated between each primary structural pattern of the group and the pattern associated with the group; that is, the associated pattern can replace each primary structural pattern of the group, without the associated pattern necessarily being replaceable by the various primary structural patterns of the group.

For each identified group, a structural pattern is determined to serve as a reference for all the patterns of the group. The reference structural pattern is the pattern used in place of all the other patterns of the group in the processing applied to the hierarchical data. In a first embodiment of the invention, the reference structural pattern of a group is chosen to be the pattern associated with the group.

The advantage of this embodiment is that the associated pattern is, by construction, the pattern closest to all the other patterns of the group.

In a second embodiment, the reference structural pattern of a given group is constructed from the primary structural patterns of the group by taking the union of all the structural information of these patterns. The pattern thus constructed is called an encompassing reference structural pattern. A structural pattern m1 is considered to be encompassed by another structural pattern m2 when m1 can be obtained from m2 by deleting certain structural information from m2. This means that all the structural information of m1 is found in m2, in the order in which it appears in m1. For two structural patterns m1 and m2, it is possible to construct their union, that is, the smallest possible structural pattern encompassing both m1 and m2. To do so, it suffices to take all the structural information of m1 and add to it the structural information of m2 that is not found in m1, while respecting the order of this structural information.

It should be noted that for some applications the order of the structural information does not matter. In this case, the structural information in the structural patterns can be sorted by category and in lexicographic order, in order to simplify comparisons of, and calculations on, structural patterns.

The use of encompassing reference structural patterns is particularly advantageous for coding applications because it allows higher compression rates. It is also advantageous for search applications because encompassing reference structural patterns simplify the search for items. An example algorithm for constructing an encompassing reference structural pattern is described with reference to Figure 6.
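The notions of an encompassed pattern and of the union of two patterns defined above can be sketched as follows. This is an illustrative sketch, not the patent's own procedure: the names `Pattern`, `encompasses`, `union`, and `encompassing_reference` are hypothetical, child items are compared as whole values, all patterns of a group are assumed to describe nodes with the same identifier, and the union is computed here as a shortest common supersequence, which is one way to obtain the smallest order-preserving pattern.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    """Hypothetical representation of a structural pattern (sketch only)."""
    name: str
    children: tuple = ()


def encompasses(big, small):
    """True if small's child items all appear in big, in the same order."""
    remaining = iter(big.children)
    # "c in remaining" consumes the iterator, which enforces the ordering.
    return big.name == small.name and all(c in remaining for c in small.children)


def union(a, b):
    """Smallest pattern whose ordered child items encompass those of a and b."""
    x, y = a.children, b.children
    # lcs[i][j] = length of the longest common subsequence of x[i:] and y[j:]
    lcs = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) - 1, -1, -1):
        for j in range(len(y) - 1, -1, -1):
            if x[i] == y[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    merged, i, j = [], 0, 0
    while i < len(x) and j < len(y):
        if x[i] == y[j]:                       # shared item: emit it once
            merged.append(x[i]); i += 1; j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:   # emit the item found only in a
            merged.append(x[i]); i += 1
        else:                                  # emit the item found only in b
            merged.append(y[j]); j += 1
    merged.extend(x[i:])
    merged.extend(y[j:])
    return Pattern(a.name, tuple(merged))


def encompassing_reference(group):
    """Fold the union over a non-empty group of patterns."""
    ref = group[0]
    for m in group[1:]:
        if not encompasses(ref, m):
            ref = union(ref, m)
    return ref


TEXT = Pattern("#text")
first, name, city = (Pattern(n, (TEXT,)) for n in ("firstname", "name", "city"))
m1 = Pattern("employee", (first, name))
m2 = Pattern("employee", (first, name, city))
m3 = Pattern("employee", (first, city))   # hypothetical third pattern
print(encompassing_reference([m1, m3]) == m2)  # True
```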

The algorithm starts at step 600. This step is followed by step 610, which consists in choosing a first primary structural pattern mr belonging to the group g. The next step (step 620) consists in selecting a second primary structural pattern m of the group g.

The algorithm continues, at step 630, by comparing the selected primary structural pattern m with the primary structural pattern mr. If mr encompasses m, that is, if mr comprises all the child items of m in the same order, the algorithm continues at step 650, described below. Otherwise, that is, if mr does not comprise all the child items of m, the algorithm continues at step 640, during which mr is updated by generating a primary structural pattern that comprises the child items of both m and mr. This step is followed by step 650, which tests whether another primary structural pattern remains that has not yet been processed. If so, the algorithm returns to step 620 described above.

Otherwise, the algorithm continues at step 660, which consists in generating the reference structural pattern ms(g) representing the group g from the pattern mr. The reference structural pattern ms(g) may, for example, be equal to mr.

Once the reference structural patterns have been determined, it is possible to describe the structure of the document containing the hierarchical data by associating these reference structural patterns with the various structures of the document, and advantageously with the elements of the document. To do this, for each XML element of the document, the primary structural pattern that was associated with it during the extraction of the primary structural patterns (carried out in particular according to the algorithm described with reference to Figure 3) is obtained; the group to which this primary structural pattern belongs is then looked up among the set M of groups of primary structural patterns, and the reference structural pattern representative of that group is determined. The reference structural pattern thus obtained is used to represent the structure of this XML element. The same principle can be applied when a more complex structure of elements is to be represented by an associated reference structural pattern. The association between the reference structural pattern and the corresponding data structure can be stored either in a specific file or directly in the document by means of a particular attribute.

In a variant embodiment, and for certain applications such as compression, it may be advantageous to apply the concept of reference structural patterns to only part of the document. This is the case, for example, when a group contains only one primary structural pattern, that is, when no other pattern has been associated with it because it is too distant from all the others.

The invention also relates to a method of coding hierarchical data based on the concept of reference structural patterns. Coding is used, in particular, for data compression. This method comprises in particular the steps of generating reference structural patterns capable of representing hierarchical data, determining difference information between the reference structural patterns and the hierarchical data, and coding the hierarchical data. The purpose of the coding is to reduce the size of the document containing the hierarchical data for exchange between a coding unit and a decoding unit. The coding unit and the decoding unit are located, for example, in remote devices connected by a communication network. However, they may be located in the same device when the aim is to reduce the size of the data document for storage on a disk. The reference structural patterns used for coding are recorded, beforehand or during coding, in the same document that contains the coded hierarchical data, which allows the decoding unit to use them during decoding. However, these reference structural patterns may instead be recorded in a separate document or exchanged between the two units by any other means.

An example implementation of the coding method using reference structural patterns of order 1 is given in Figure 7. The first step (step 710) consists in applying one of the implementations of the method for generating reference structural patterns to the hierarchical data to be coded.

The second step (step 720) consists in determining structural difference information between the reference structural patterns generated in the previous step and the hierarchical data associated with them. Structural differences may exist because the reference structural patterns are not necessarily the primary structural patterns associated with these data. The next step (step 730) consists in determining content information relating to the structural information associated with the hierarchical data. This content information is also regarded as difference information because it represents the information that must be added to the structural information to recover the complete hierarchical data.

The last step (step 740) is the coding step, which uses the structural and content difference information, together with either the description of the structure of the reference structural pattern itself or a reference to it, to code the hierarchical data.
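Steps 720 to 740 can be sketched for order-1 patterns as follows, assuming that step 710 has already produced encompassing reference patterns. Everything in this sketch is an assumption made for illustration: the tuple-based representation, the function names `encode` and `decode`, the presence-flag form of the structural difference, the single reference pattern per node name, and the sample values. The patent does not prescribe a concrete record format.

```python
# Hypothetical order-1 representation (an assumption of this sketch):
#   reference pattern: (node_name, [child_name, ...])
#   element:           (node_name, [(child_name, text), ...])

def encode(element, references):
    """Code one element against its encompassing reference pattern."""
    name, children = element
    # Assumption: one reference pattern per node name (the output of step 710).
    # Raises StopIteration if no reference pattern exists for this name.
    ref_id = next(i for i, (ref_name, _) in enumerate(references) if ref_name == name)
    presence, contents = [], []
    remaining = iter(children)
    pending = next(remaining, None)
    for ref_child in references[ref_id][1]:
        if pending is not None and pending[0] == ref_child:
            presence.append(True)         # step 720: structural difference information
            contents.append(pending[1])   # step 730: content information
            pending = next(remaining, None)
        else:
            presence.append(False)        # reference child absent from this element
    if pending is not None:
        raise ValueError("element is not encompassed by its reference pattern")
    return ref_id, presence, contents     # step 740: reference plus difference information


def decode(record, references):
    """Rebuild the element from a coded record and the shared reference patterns."""
    ref_id, presence, contents = record
    name, ref_children = references[ref_id]
    values = iter(contents)
    children = [(child, next(values))
                for child, present in zip(ref_children, presence) if present]
    return name, children


references = [("employee", ["firstname", "name", "city"])]   # hypothetical table
element = ("employee", [("firstname", "Ada"), ("city", "Paris")])  # hypothetical data
record = encode(element, references)
print(record)                                  # (0, [True, False, True], ['Ada', 'Paris'])
print(decode(record, references) == element)   # True
```

The record refers to the reference pattern by index rather than repeating its description, which corresponds to the "reference to it" option mentioned for step 740.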

In order to implement the methods for generating reference structural patterns capable of representing hierarchical data and for coding these data using these reference structural patterns, a device for generating reference structural patterns comprises in particular means for extracting primary structural patterns, means for grouping the primary structural patterns, and means for determining a reference structural pattern for each identified group; and a coding device comprises in particular the means of the device for generating reference structural patterns, means for determining difference information between the reference structural patterns and the associated hierarchical data, and means for coding the hierarchical data according to the reference structural patterns and the difference information. These generation and coding devices can be incorporated in a computer 800 as illustrated in Figure 8.

In particular, the various means identified above can be incorporated in a read-only memory (ROM) 805 adapted to store a pattern generation and/or coding program according to the invention. The random access memory (RAM) 810 is adapted to store in registers the values modified during execution of the generation and coding program. The microprocessor 820 is integrated in a computer 800, which can be connected to various peripherals and to other computers of a communication network.

This computer comprises, in a known manner, a communication interface 830 connected to the communication network 835 in order to receive or transmit messages. The computer further comprises document storage means, such as a hard disk 870, or is adapted to cooperate, by means of a disk drive 880 (for floppy disks, compact discs, or computer cards), with removable document storage means such as disks 885. These fixed or removable storage means may hold the code of the method for generating structural patterns or of the coding method according to the invention. They are also adapted to store an electronic document containing hierarchical data as defined by the present invention. As a variant, the program enabling the pattern generation or coding device to implement the invention can be stored in the read-only memory 805. As a second variant, the program can be received via the communication network 835 in order to be stored as described above.

The computer 800 also has a screen 840 serving, for example, as an interface with an operator by means of the keyboard 850, the mouse 860, or any other means. The central processing unit (CPU) 820 then executes the instructions relating to the implementation of the invention. On power-up, the programs and methods relating to the invention stored in a non-volatile memory, for example the memory 805, are transferred into the memory 810, which then contains the executable code of the invention as well as the variables necessary for its implementation. The communication bus 890 allows communication between the various sub-elements of the computer or those linked to it.

The representation of this bus 890 is not limiting; in particular, the microprocessor 820 is able to communicate instructions to any sub-element directly or via another sub-element. Of course, many modifications can be made to the embodiments described above without departing from the scope of the invention.

Claims

1. A method of generating reference structural patterns capable of representing hierarchical data, characterized in that it comprises the following steps: extracting primary structural patterns associated with the hierarchical data (210), each of the primary structural patterns representing a set of structural information; grouping the primary structural patterns located, with respect to at least one of them, at a distance less than or equal to a predetermined value (220); and determining one reference structural pattern per group of primary structural patterns (230), said reference structural pattern being capable of representing the primary structural patterns of the group with which it is associated.

2. Method according to claim 1, characterized in that, the hierarchical data being organized into a plurality of items, an item representing a node if it contains at least one other item, called a child item, the structural information of a primary structural pattern relates only to a node and its direct child items.

3. Method according to claim 1, characterized in that, the hierarchical data being organized into a plurality of items, an item representing a node if it contains at least one other item, called a child item, the structural information of a primary structural pattern relates to a plurality of nodes having a hierarchical relationship with one another.

4. Method according to any one of the preceding claims, characterized in that the hierarchical data are described in a markup language structuring the data.

5. Method according to any one of the preceding claims, characterized in that the groups resulting from the grouping step comprise primary structural patterns which are situated, pairwise, at a distance less than or equal to the predetermined value.

6. Method according to any one of the preceding claims, characterized in that the distance between a first and a second primary structural pattern is defined by the number of items of structural information to be added and/or deleted and/or modified relative to the first primary structural pattern to obtain the second primary structural pattern.

7. Method according to any one of the preceding claims, characterized in that the reference structural pattern is the primary structural pattern of a group with respect to which all the primary structural patterns of the group are located at a distance less than or equal to the predetermined value.

8. Method according to any one of claims 1 to 6, characterized in that the reference structural pattern associated with a group is constructed by combining the structural information of all the primary structural patterns of the group, the reference structural pattern so constructed being called an encompassing reference structural pattern.

9. Method for encoding hierarchical data, characterized in that it comprises the following steps: generating reference structural patterns (710) able to represent the hierarchical data according to the method for generating reference structural patterns in accordance with any one of claims 1 to 8; determining the difference information (720, 730) between the reference structural patterns and the associated hierarchical data; and encoding the hierarchical data according to the reference structural patterns and the difference information (740).

10. An encoding method according to claim 9, characterized in that the difference information between the reference structural patterns and the associated hierarchical data comprises structural information and content information.

11. A device for generating reference structural patterns capable of representing hierarchical data, characterized in that it comprises: extraction means for extracting primary structural patterns associated with the hierarchical data, each of the primary structural patterns representing a set of structural information; grouping means for grouping primary structural patterns located, with respect to at least one of them, at a distance less than or equal to a predetermined value; and determination means for determining a reference structural pattern per group of primary structural patterns, said reference structural pattern being able to represent the primary structural patterns of the group associated with it.

12. Device for encoding hierarchical data, characterized in that it comprises: a device for generating reference structural patterns capable of representing the hierarchical data according to claim 11; determining means for determining difference information between the reference structural patterns and the associated hierarchical data; and encoding means for encoding hierarchical data according to the reference structural patterns and the difference information.

13. Computer program product that can be loaded into a programmable device, characterized in that it comprises sequences of instructions for implementing a method of generating reference structural patterns according to any one of claims 1 to 8, when this program is loaded and executed by the programmable device.

14. Computer program product that can be loaded into a programmable device, characterized in that it comprises sequences of instructions for implementing a hierarchical data coding method according to claim 9 or 10, when this program is loaded and executed by the programmable device.

15. An information storage medium readable by a computer or a microprocessor retaining instructions from a computer program, characterized in that it allows the implementation of a method of generating reference structural patterns according to any of claims 1 to 8.

16. Information storage medium, readable by a computer or a microprocessor, holding instructions of a computer program, characterized in that it allows the implementation of a hierarchical data coding method according to claim 9 or 10.
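
By way of illustration only, the generation steps of claim 1 (with the distance of claim 6, the pairwise grouping of claim 5 and the encompassing reference pattern of claim 8) may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: a primary structural pattern is modelled here as a set of child item names, the distance counts additions and deletions only (modifications are not modelled), the grouping is a simple greedy pass whose result depends on the order of the patterns, and the difference information covers structure only. All function names and item names are hypothetical.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical simplification: a primary structural pattern is modelled as the
# set of child item names found under one node.
Pattern = FrozenSet[str]


def distance(first: Pattern, second: Pattern) -> int:
    """Count the items to add or delete to turn `first` into `second`."""
    return len(first ^ second)


def group_patterns(patterns: Iterable[Pattern], max_distance: int) -> List[List[Pattern]]:
    """Greedy grouping: a pattern joins the first group whose members are all
    within `max_distance` of it; otherwise it starts a new group."""
    groups: List[List[Pattern]] = []
    for pattern in patterns:
        for group in groups:
            if all(distance(pattern, member) <= max_distance for member in group):
                group.append(pattern)
                break
        else:
            groups.append([pattern])
    return groups


def encompassing_reference(group: List[Pattern]) -> Pattern:
    """Combine the structural information of every pattern in the group."""
    return frozenset().union(*group)


def difference_information(reference: Pattern, pattern: Pattern) -> Pattern:
    """Items of the encompassing reference that are absent from `pattern`."""
    return reference - pattern


# Hypothetical primary patterns extracted from four nodes of a document.
patterns = [
    frozenset({"title", "author", "year"}),
    frozenset({"title", "author"}),
    frozenset({"title", "author", "year", "isbn"}),
    frozenset({"name", "email"}),
]

groups = group_patterns(patterns, max_distance=2)
references = [encompassing_reference(group) for group in groups]
missing = difference_information(references[0], patterns[1])
```

With a predetermined value of 2, the first three patterns fall into one group whose encompassing reference pattern contains `title`, `author`, `year` and `isbn`, and the fourth pattern forms a group of its own. The second pattern can then be described by the reference pattern together with the difference information, namely the absence of `year` and `isbn`.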