WO2005008521A1

WO2005008521A1 - Method for the indexation of structured documents

Info

Publication number: WO2005008521A1
Application number: PCT/EP2004/051346
Authority: WO
Inventors: Jörg Heuer; Andreas Hutter; Andrea Kofler-Vogt
Original assignee: Siemens Aktiengesellschaft
Priority date: 2003-07-15
Filing date: 2004-07-02
Publication date: 2005-01-27

Abstract

The invention relates to a method for the indexation of a structured document containing a plurality of instances. At least one path and a textual path description is allocated to an instance in the document. At least one path comprises a position number or a position number sequence consisting of several position numbers as a differentiating characteristic for instances having different paths and the same textual path descriptions. The invention is characterized in that an indexing tree or an indexing list comprising a plurality of entries is produced; instances with allocated paths which exclusively differ in terms of position numbers or position number sequences, are allocated to the same entry and the entries respectively comprise the position numbers or position number sequences of the paths which are allocated to the instances. The indexing tree or the indexing list is combined with a data flow containing the structured document, is more particularly introduced into said data flow or allocated to said data flow and is separately transmitted or stored.

Description

Procedure for indexing structured documents

As multimedia applications grow in importance, so does access to digitized data sources in the form of audio, video or images. For example, the Moving Picture Experts Group (MPEG) has developed a standard for describing the most varied types of digitized material in a uniform manner. This was achieved with the MPEG-7 standard [1], whose goal is to simplify searching for and accessing such data.

An MPEG-7 description is created using the Extensible Markup Language (XML) and can be divided into several units (access units), which in turn consist of several fragments (fragment update units). These units can be encoded using an encoder and sent to one or more receivers if required. The units can then be decoded again in the receivers using a corresponding decoder.

It often happens that a receiver receives several MPEG-7 descriptions of which only certain content is of interest to a user. With the help of a query language such as XPath, a user can check whether a description contains the required information and access it if necessary. So that a description does not first have to be decoded before a filter process can determine the required information, an index system is often used with which paths in the transmitted document structure are indexed. Such a system should, however, offer sufficient functionality. In particular, it is not enough to support only simple path expressions; more complex structures should also be supported. XPath supports wildcards ('//' and '*'), for example, which can be used to formulate queries for which the exact document structure is not known.

In addition to wildcards, an index system should also support so-called multiple-key queries, i.e. queries that contain several conditions, such as: "Give me every author who is called Müller and who was born in 1960". Such a query consists of a so-called prefix path (everything up to the author) and several conditions (last name = Müller and date of birth = 1960).
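The structure of such a multiple-key query can be pictured as a prefix path plus a set of conditions. The following is only a minimal sketch: the class name `MultiKeyQuery` and the element names `library`, `article`, `lastname` and `date_of_birth` are illustrative assumptions and are not prescribed by the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiKeyQuery:
    """Hypothetical container: a prefix path plus (relative element, value) conditions."""
    prefix: tuple[str, ...]
    conditions: tuple[tuple[str, str], ...]

    def full_paths(self) -> list[tuple[tuple[str, ...], str]]:
        # Expand every condition into a complete path from the root,
        # paired with the value the instance at that path must have.
        return [(self.prefix + (rel,), value) for rel, value in self.conditions]


# Assumed element names; the prefix ends at the author element.
query = MultiKeyQuery(
    prefix=("library", "article", "author"),
    conditions=(("lastname", "Müller"), ("date_of_birth", "1960")),
)
for path, value in query.full_paths():
    print("/".join(path), "=", value)
```

Each condition becomes one complete path query, and the results of these queries then have to be combined so that only authors satisfying all conditions remain.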

The problems described do not only occur in connection with MPEG-7 descriptions. The present invention can be used in particular in all areas that require an indexing of structured documents.

Numerous index systems for XML documents are already known from the prior art. Many of them only support simple path queries, some enable more complex queries, but still offer only limited functionality. In the following, well-known index systems are discussed that support both wildcards and multiple-key queries, whereby the processing of multiple-key queries is particularly discussed.

In XISS (XML Indexing and Storage System), elements and attributes are indexed with so-called B-trees. In addition, a so-called "numbering scheme" is set up for an XML document. Two numbers are saved for each node in the document tree: an order number and a weighting. With the help of these numbers, a "successor function" can determine whether a certain element is a direct or an indirect successor of another element, i.e. a child, grandchild, great-grandchild, etc. In the case of a query, a search is carried out in the B-tree for every element specified in the query path. A partial list is returned for each different element name. This partial list contains a number of pairs of numbers consisting of order number and weighting. With the help of the "successor function", complete paths are reassembled from the partial lists of the element names. In a multiple-key query, the partial lists for the elements of the prefix path and a first result list of all successors are created first. In the search query explained above for authors with the surname Müller and year of birth 1960, for example, the following lists are created: XX/author_1, XX/author_2, XX/author_3, ..., where "XX" stands for any partial path in the XML document. Then a list is created with all surnames that have the value Müller (surname_1, surname_2, ...). With the help of the numbering scheme, those partial paths that do not have a successor in the list of surnames are then sorted out. Finally, a list is generated that contains all dates of birth with the value '1960', and the list of authors is reduced again accordingly. As a result, the result list contains all paths to author elements that have a surname element with the value Müller and a date of birth element with the value 1960 as successors.
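The role of the two numbers can be illustrated with a small sketch. It assumes that the weighting is the size of an interval of order numbers reserved for a node's descendants, so that the successor test becomes an interval check. This interpretation, the names `Label`, `is_successor` and `join`, and the sample numbers are assumptions made for illustration.

```python
from typing import NamedTuple


class Label(NamedTuple):
    order: int  # order number of the node
    size: int   # "weighting": assumed to bound the order numbers of all descendants


def is_successor(ancestor: Label, node: Label) -> bool:
    """True if `node` lies in the interval reserved for descendants of `ancestor`."""
    return ancestor.order < node.order <= ancestor.order + ancestor.size


def join(ancestors: list[Label], descendants: list[Label]) -> list[tuple[Label, Label]]:
    # Reassemble path fragments from two partial lists (naive nested loop for clarity).
    return [(a, d) for a in ancestors for d in descendants if is_successor(a, d)]


# Made-up labels: two author nodes and one surname node with the value Müller.
authors = [Label(10, 5), Label(20, 5)]
surnames_mueller = [Label(12, 0)]
print(join(authors, surnames_mueller))
```

Only the first author survives the join, because the surname node falls inside its interval. Repeating the join with the date-of-birth list would reduce the author list further.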

In a technical report by Bremer and Gertz, an index system for XML data is specified in which a so-called "data guide" is first derived from the XML document. This is a schema summary in which each distinct path is contained exactly once. Each node in this data guide is given a unique identifier (ID), which identifies not only an element but also the path from the root to the identified node. In addition to the identifiers, position numbers are stored if an element occurs more than once. The position numbers are saved with as few bits as possible and are stored in series after the node identifier, i.e. ID, pos_1 ... pos_n. This combination is called a PID (path identifier). In addition to the data guide, two B-trees are set up: the T index and the P index. The T index (term index) provides a list of PIDs for a given node identifier (ID). The P index (path index) translates such PIDs into the physical address of the indexed element. A query begins in the data guide. In the case of a multiple-key query, the node identifier of the element at the end of the prefix of the search query is first determined using the data guide. In the search query shown above, the node identifier for "author" is thus determined. The list of the associated PIDs is then generated from the T index using this node identifier. In line with the above-mentioned query, the ID of the element 'last name' with the value Müller is then looked up in the data guide, starting from the previously identified author element. All PIDs for this ID are filtered from the B-tree. The same process is repeated for the successor 'date of birth' with the value 1960. In a final step, the list of author PIDs is reduced. For this purpose, all PIDs whose position numbers do not occur as a prefix of a position number sequence in the last name list are deleted from the author list. By comparing the position numbers from the date of birth list with those of the list of authors, the latter is reduced again and then provides the final PIDs, which are used to obtain the physical address from the P index.
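The final reduction step of this PID scheme amounts to a prefix test on position number sequences. The sketch below shows only that step. The function names and the sample position sequences are hypothetical, and the node identifier part of each PID is left out.

```python
def has_prefix(prefix: tuple[int, ...], sequence: tuple[int, ...]) -> bool:
    """True if `sequence` starts with `prefix`."""
    return sequence[:len(prefix)] == prefix


def reduce_by_prefix(candidates, matches):
    """Keep candidate position sequences that are a prefix of at least one match."""
    return [c for c in candidates if any(has_prefix(c, m) for m in matches)]


# Made-up position number sequences (article, author[, child]).
author_positions = [(1, 1), (1, 2), (2, 1)]
lastname_mueller = [(1, 2, 1), (2, 1, 1)]
born_1960 = [(2, 1, 1)]

authors = reduce_by_prefix(author_positions, lastname_mueller)
authors = reduce_by_prefix(authors, born_1960)
print(authors)  # [(2, 1)]
```

The remaining sequences would then be combined with the author node identifier again and resolved to physical addresses via the P index.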

All of the above-mentioned methods for generating an index structure have the disadvantage that the creation of the index structure is very complex and the index requires a considerable amount of data. The object of the invention is therefore to create a method for indexing a structured document with which an index structure for search queries in the document is generated in a simple manner.

This problem is solved by the independent claims. Preferred embodiments of the invention are defined in the dependent claims.

The method according to the invention indexes a structured document which contains a multiplicity of instances, with at least one instance in the document being assigned a path and a textual path description, and at least one path having a position number or a position number sequence comprising a plurality of position numbers as a distinguishing feature for instances with different paths and the same textual path descriptions. With the method according to the invention, an indexing tree or an indexing list which comprises a multiplicity of entries is generated in such a way that instances to which paths are assigned which differ only in their position numbers or position number sequences are assigned to the same entry, and the entries each include the position numbers or the position number sequences of the paths assigned to the instances. The indexing tree or the indexing list is finally linked to a data stream containing the structured document, in particular inserted into the data stream or assigned to the data stream and transmitted or stored separately.

By linking entries in the indexing tree or in the indexing list with position numbers or position number sequences, queries for the content of the structured document can be carried out quickly. In a preferred embodiment, instance values assigned to the instances, or references to these instance values, are assigned to the entries in the indexing tree or the indexing list. The existence of position numbers is preferably determined with the aid of a schema definition, which is preferably an XML schema definition that is used for coding structured XML documents.

In a preferred embodiment of the method according to the invention, the indexing tree is a B-tree, with which a logarithmic search for contents in the document is made possible.
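The logarithmic lookup can be illustrated without a full B-tree implementation. In the sketch below a sorted key list searched with `bisect` stands in for the B-tree. This is a substitute technique chosen for brevity, not the structure used by the invention, and the key strings are assumed textual path descriptions.

```python
from bisect import bisect_left
from typing import Optional

# Sorted keys stand in for the key order of a B-tree; a real B-tree spreads
# the keys over nodes but likewise needs only O(log n) comparisons per lookup.
keys = sorted([
    "library/article/author/firstname",
    "library/article/author/lastname",
    "library/article/title",
    "library/article/year",
])


def lookup(key: str) -> Optional[int]:
    """Return the slot of `key` in the sorted key list, or None if it is absent."""
    i = bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else None


print(lookup("library/article/year"))
print(lookup("library/article/conference"))
```

Because position numbers are not part of the key, the number of keys grows with the number of distinct textual path descriptions rather than with the number of instances.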

In a particularly preferred embodiment of the invention, the structured document used for the indexing is an XML document.

In addition to an indexing method, the invention also relates to a method for coding a structured document, the document being indexed using the indexing method according to the invention and then being coded using a coding method, in particular an MPEG-7 method, which produces an encoded data stream. The coded data stream preferably contains offsets with which the positions of entries in the indexing tree or indexing list in the data stream are signaled. A special embodiment of such offsets can be found in German patent application 102 53 275.3; these offsets ensure that, when searching for content, not all coded information has to be read from the data stream.

In addition to the coding methods described above, the invention further comprises a decoding method for decoding a structured document, the method being designed in such a way that a document coded according to the coding method according to the invention is decoded. Furthermore, the invention relates to a method for coding and decoding a structured document, which comprises the coding and decoding methods according to the invention described above.

The invention further relates to a coding device or a decoding device with which the coding method or decoding method according to the invention can be carried out. The invention further relates to a combination of this coding device and this decoding device.

Exemplary embodiments of the invention are illustrated and explained below with reference to the drawing.

It shows:

FIG. 1: an example of the schematic structure of a structured XML document that can be indexed with the method according to the invention.

In the embodiment of the invention described below, the indexing and coding of an XML document is described. An XML document is described by a document tree as shown in FIG. 1. Such a tree contains a large number of nodes, the contents of the document being stored in the leaf nodes, which are referred to below as instances. The remaining nodes provide the textual description of the path to the individual instances, although different paths can have the same textual description. FIG. 1 relates to the information in a library. It contains two articles for which the title, the authors and the year of publication are specified. It can be seen from FIG. 1 that a unique path leads to each leaf node; in the embodiment described here this path is encoded using the MPEG-7 coding method. So-called XML schema definitions are used to create binary coded paths, such schema definitions being well known to the person skilled in the art and being used by the encoder and decoder to encode or decode the XML document. The schema definition on which the tree of FIG. 1 is based is as follows: There may be several articles in the library. Each article has exactly one title and at least one author. The conference at which the article was submitted may be recorded. The year of publication is also given. An author has exactly one last name and can have a maximum of three first names. A first name does not necessarily have to be specified.
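The occurrence constraints listed above determine which nodes need a position number. The sketch below writes them down as a plain table of minimum and maximum occurrences. The table layout, the element names and the lower bounds assumed for `article` and `year` are illustrative assumptions; a real system would read these constraints from the XML schema definition.

```python
import math

# (parent, child) -> (min_occurs, max_occurs); math.inf stands for "unbounded".
# The lower bounds for "article" and "year" are assumptions.
SCHEMA = {
    ("library", "article"): (0, math.inf),
    ("article", "title"): (1, 1),
    ("article", "author"): (1, math.inf),
    ("article", "conference"): (0, 1),
    ("article", "year"): (1, 1),
    ("author", "lastname"): (1, 1),
    ("author", "firstname"): (0, 3),
}


def needs_position(parent: str, child: str) -> bool:
    """A position number is only required if the child may occur more than once."""
    _, max_occurs = SCHEMA[(parent, child)]
    return max_occurs > 1


def positioned_nodes(path: list[str]) -> list[str]:
    # The root has no parent in SCHEMA and occurs once, so it is skipped.
    return [c for p, c in zip(path, path[1:]) if needs_position(p, c)]


print(positioned_nodes(["library", "article", "title"]))
print(positioned_nodes(["library", "article", "author", "firstname"]))
```

For the title path only `article` carries a position number, whereas the first-name path carries three: one each for the article, the author and the first name.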

A detailed account of how the coded paths are generated is omitted here, since this coding is not essential for understanding the invention.

A binary path coded with the method according to the invention contains a binary code word for each node in the path, which code word is derived from the schema information. For nodes that can occur more than once, a position number or a sequence of position numbers is also coded at the end, which serves as a distinguishing feature for instances with different paths and the same textual path descriptions. Similar to the Bremer and Gertz solution, these numbers are used to merge partial lists.

So that the further explanations are understandable, the concept of binary paths is explained first.

The schema information is used to create a binary path: a code word is determined for each element and is represented by the smallest possible binary number. More precise explanations of this determination are omitted here since it is not part of the invention. In the embodiment described here, the code words obtained are denoted C_library, C_article, etc. After all the code words in a path have been determined, the positions of the path's nodes in the document tree are added to the binary path. Positions are only coded for those nodes that can occur more than once. The binary coding of paths is illustrated with the following example.

The following is a partial list of the complete paths that can be generated from the document tree of FIG. 1:

library[1]/article[1]/title[1]
library[1]/article[1]/author[1]/firstname[1]
library[1]/article[1]/author[1]/lastname[1]
library[1]/article[1]/author[2]/firstname[1]
library[1]/article[2]/year[1]

The numbers in square brackets represent the positions of the nodes, i.e. they distinguish sibling nodes with the same name at the same hierarchy level of the document tree and thus serve as a distinguishing feature of instances with the same textual path description.

When translating the textual paths into binary, all necessary code words are first determined and then the positions are coded in the form of position numbers. In binary, the path library[1]/article[1]/title[1] would look something like this: C_library / C_article / C_title / 1. A position number is only used for the node name "article", since the other two node names can only occur once in this context. Finally, an indexing tree is created whose entries, i.e. the key values that can be searched for in the tree, are not differentiated with regard to position numbers or, if several position numbers occur in the path, with regard to position number sequences. As a result, there is only one entry in the indexing tree for paths that differ only in their position number or position number sequence. A search for this entry may return a list of instances consisting of position numbers and values. For example, the search for the key value "C_library / C_article / C_author / C_firstname" would yield the following results:

1,1,1 = Brain W. (1st article, 1st author, 1st first name)
1,2,1 = Christopher (1st article, 2nd author, 1st first name)
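The grouping of instances under one entry per textual path description can be sketched as follows. This is a simplified illustration that works on textual paths instead of binary code words. The names `REPEATABLE`, `split_path` and `build_index` are hypothetical, and the set of repeatable nodes is assumed to have been derived from the schema definition.

```python
import re
from collections import defaultdict

# Nodes that may occur more than once (assumed to come from the schema definition).
REPEATABLE = {"article", "author", "firstname"}
STEP = re.compile(r"(\w+)\[(\d+)\]")


def split_path(path: str) -> tuple[str, tuple[int, ...]]:
    """Separate a textual path into its description and its position number sequence."""
    steps = STEP.findall(path)
    key = "/".join(name for name, _ in steps)
    positions = tuple(int(pos) for name, pos in steps if name in REPEATABLE)
    return key, positions


def build_index(instances: dict[str, str]) -> dict[str, list[tuple[tuple[int, ...], str]]]:
    index = defaultdict(list)
    for path, value in instances.items():
        key, positions = split_path(path)
        index[key].append((positions, value))  # one entry per textual description
    return dict(index)


# The two first-name instances from the example in the text.
instances = {
    "library[1]/article[1]/author[1]/firstname[1]": "Brain W.",
    "library[1]/article[1]/author[2]/firstname[1]": "Christopher",
}
index = build_index(instances)
print(index["library/article/author/firstname"])
```

Both instances end up under the same key and are told apart only by their position number sequences (1, 1, 1) and (1, 2, 1).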

The invention is explained using a simple example. We are looking for all articles that appeared in 1996 and that were written or co-written by an author with the last name Kaufmann. Since only complete paths are indexed, a query in the form of a complete path is first generated for each condition. The following two queries are created for the specific example: C_library / C_article / C_year = 1996 and C_library / C_article / C_author / C_lastname = Kaufmann. The first query searches for the search term "article" and the second query searches for the search term "author".

The queries are executed separately, and each outputs the position numbers or position number sequences of the respective keywords, so that the two queries yield two partial lists (List_year and List_lastname):

1st query:

List_year = {1; 2;} (1st article and 2nd article -> 2 entries)

2nd query:

List_lastname = {2,3;} (3rd author of the 2nd article -> 1 entry)

The position numbers of interest (shown with a gray background in the original lists) are in each case the single position number found or the first position number of a position number sequence found. A certain article is to be found as the search result. Therefore, the position numbers for the articles must match in the entries of the partial lists. Comparing the above partial lists shows that only the number "2" occurs as an article position number in both. The final result is therefore only the second article.
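The comparison of the two partial lists can be sketched as an intersection over the first position number of each entry. The function name `matching_articles` is hypothetical, and the two lists simply restate the results given in the text.

```python
def matching_articles(*partial_lists: list[tuple[int, ...]]) -> set[int]:
    """Intersect partial lists on their first position number (the article position)."""
    article_sets = [{seq[0] for seq in plist} for plist in partial_lists]
    return set.intersection(*article_sets)


list_year = [(1,), (2,)]    # articles 1 and 2 match year = 1996
list_lastname = [(2, 3)]    # the 3rd author of article 2 is Kaufmann
print(matching_articles(list_year, list_lastname))  # {2}
```

Further conditions could be added as additional partial lists, since the intersection works for any number of lists.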

The solution described has several advantages over the systems according to the prior art. Although these systems have sufficient functionality, the index requires a considerable amount of data and can often only be created with considerable effort. XISS, for example, needs a B-tree for element and attribute names and another B-tree with which values are indexed. In addition, the numbering scheme must be generated and transferred. The solution presented by Bremer and Gertz first requires the generation of a data guide. In addition, the T index and the P index must be created and, if necessary, transferred.

In the solution according to the invention, only the B-tree, which contains all the necessary information, is transmitted. By using byte offsets, only those nodes that influence a search are read from the stream.
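How byte offsets allow a reader to skip everything except the entry it needs can be shown with a toy stream. The layout used here (a 4-byte length followed by the payload, with offsets kept in a dictionary) and the payload format are purely hypothetical. The text does not specify the actual offset format, which is described in the cited German patent application.

```python
import io
import struct

# Hypothetical serialized index entries (payload format is made up).
entries = {
    "library/article/author/lastname": b"2,3=Kaufmann",
    "library/article/year": b"1=1996;2=1996",
}

# Assumed layout: each payload is preceded by a 4-byte big-endian length.
stream = io.BytesIO()
offsets = {}
for key, payload in entries.items():
    offsets[key] = stream.tell()  # byte offset signaled for this entry
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_entry(key: str) -> bytes:
    """Jump straight to one entry; the other entries are never read."""
    stream.seek(offsets[key])
    (length,) = struct.unpack(">I", stream.read(4))
    return stream.read(length)


print(read_entry("library/article/year"))
```

A decoder that knows the offset of an entry reads only the bytes belonging to that entry, which is the effect the byte offsets are meant to achieve.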

The PID scheme may need fewer bits than the solution described here for specifying the positions, because only elements actually present in the document are taken into account, while in the solution according to the invention the length of the positions depends on the potentially possible number of elements. For example, in the PID representation one bit is enough to specify the position number of the "article" element because there are only two such elements. In the invention, at least five bits are used for this, because the element can occur as often as the schema definition allows. However, since an MPEG-7 description is divided into access units before sending, in the case of the PID scheme this would mean that the entire information must first be transmitted and decoded in order to be able to read out the position numbers correctly. This would make an index superfluous, because an index should offer random access to the desired information and make it possible to decode only that information while ignoring uninteresting parts. In MPEG-7 it often happens that not an entire document is transmitted, but only subtrees of the document tree. In this case it is impossible to completely decode the position numbers of the PID scheme because not all the required information has been transmitted. In the solution according to the invention, the schema must be available to the decoder in any case, since it is also required for decoding the description. With the help of the schema, all position numbers can be unambiguously assigned to the respective elements.
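The bit counts compared at the start of the preceding paragraph follow from how many positions a field has to distinguish. The sketch below assumes a plain fixed-width binary field, which is a simplification; the text does not state how the position numbers are actually coded or how the figure of five bits is derived.

```python
def position_bits(max_count: int) -> int:
    """Bits needed for a fixed-width field that distinguishes `max_count` positions."""
    if max_count < 1:
        raise ValueError("max_count must be at least 1")
    return (max_count - 1).bit_length()


print(position_bits(2))   # two articles actually present: 1 bit
print(position_bits(32))  # a 5-bit field distinguishes up to 32 positions
```

Sizing the field by the elements actually present gives the shorter code, but the decoder then needs to know the whole document, whereas sizing it from the schema makes every position number decodable on its own.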

Bibliography:

[1] ISO/IEC 15938, "Multimedia Content Description Interface", Geneva, 2001-2003.

Claims

1. Method for indexing a structured document that contains a large number of instances, wherein at least one instance in the document is assigned a path and a textual path description and at least one path has a position number or a position number sequence comprising several position numbers as a distinguishing feature for instances with different paths and the same textual path descriptions, in which an indexing tree or an indexing list comprising a large number of entries is generated in such a way that instances to which paths are assigned which differ only in their position numbers or position number sequences are assigned to the same entry and the entries each comprise the position numbers or position number sequences of the paths assigned to the instances; the indexing tree or the indexing list is linked to a data stream containing the structured document, in particular inserted into the data stream or assigned to the data stream and transmitted or stored separately.

2. The method according to claim 1, in which the entries in the indexing tree or the indexing list are assigned instance values or references to these instance values.

3. The method of claim 1 or 2, wherein the existence of position numbers is determined with the help of a schema definition.

4. The method according to any one of the preceding claims, wherein the indexing tree is a B-tree.

5. The method according to any one of the preceding claims, wherein the structured document is an XML document.

6. A method for encoding a structured document, in which the structured document is indexed with a method according to one of the preceding claims and is encoded with a coding method, in particular an MPEG-7 method, whereby an encoded data stream is generated.

7. The method according to claim 6, wherein the coded data stream contains offsets with which the positions of entries of the indexing tree or the indexing list in the data stream are signaled.

8. A method for decoding a structured document, the method being designed such that the document encoded according to claim 6 or claim 7 is decoded.

9. A method for encoding and decoding a structured document, comprising a method according to claim 6 or 7 and the method according to claim 8.

10. Coding device with which a method according to claim 6 or 7 can be carried out.

11. Decoding device with which a method according to claim 8 can be carried out.

12. Device for coding and decoding a structured document, with which a method according to claim 9 can be carried out.