KR100612376B1 - A index system and method for xml documents using node-range of integration path

Info

Publication number: KR100612376B1
Application number: KR1020050008841A
Authority: KR
Inventors: 김영; 박상호; 박선; 이주홍
Original assignee: 인하대학교 산학협력단
Priority date: 2005-01-31
Filing date: 2005-01-31
Publication date: 2006-08-16
Also published as: KR20060087947A

Abstract

The present invention relates to an XML index system and method using the node range of an integrated path. Duplicate paths are merged into an integrated path, and each node of the integrated path is assigned numbers such as a preorder, a postorder, and a node range. A dynamic array called the pre-order list stores the names of the actual data tables in the preorder sequence of the integrated path. When a query is executed, the query is analyzed to obtain the preorder and node range from the integrated path, the corresponding data table names are obtained from the pre-order list, and those tables are opened directly without number comparisons. As a result, XPath queries are analyzed quickly, the index storage space for paths is managed efficiently, the pre-order list narrows the range of data tables that must actually be searched, and maintaining a data table per lowest-level node makes insertion and deletion easy.

XML, Indexing, Numbering, Node Range, Integrated Path, Preorder, Postorder

Description

XML index system and method using node range of integrated path {A INDEX SYSTEM AND METHOD FOR XML DOCUMENTS USING NODE-RANGE OF INTEGRATION PATH}

FIG. 1 is a diagram showing an example of the structure of a typical XML document represented as a tree.

FIG. 2 is a diagram showing the summarized simple paths of FIG. 1.

FIG. 3 is a diagram showing an example of the DataGuide indexing technique.

FIG. 4 is a diagram showing an example of a Patricia Trie.

FIG. 5 is a diagram showing a Layered Index.

FIG. 6 is an example of an index structure using Index Fabric.

FIG. 7 is an example of Dietz's numbering scheme.

FIG. 8 is a diagram showing an example of an SQL query executed under conventional numbering-based indexing.

FIG. 9 is an example of the integrated path of XML documents in the present invention.

FIG. 10 is a diagram of the index structure after XML documents are inserted in the present invention.

FIG. 11 is an example of an SQL query under the existing numbering-based approach.

FIG. 12 is a block diagram of the index system according to the present invention.

FIGS. 13A to 13E are explanatory diagrams of the data tables and fields in the present invention.

FIG. 14 is a diagram showing the algorithm for calculating the node range of the name index in the present invention.

FIGS. 15A and 15B are diagrams of the index structure before and after insertion of an XML document with a different structure.

FIG. 16 is a table showing the experimental data used for the present invention.

FIGS. 17A to 17C are tables and a graph showing the queries, per-query execution speed, and performance comparison for experimental data (A).

FIGS. 18A to 18C are tables and a graph showing the queries, per-query execution speed, and performance comparison for experimental data (B).

FIGS. 19A to 19C are tables and a graph showing the queries, per-query execution speed, and performance comparison for experimental data (C).

FIG. 20 is a graph comparing the index sizes of XISS and the present invention.

<Description of reference numerals for the main parts of the drawings>

100: XML database    200: DOM parser

300: raw data storage unit

The present invention relates to an XML (eXtensible Markup Language) index system and method, and more particularly to an XML index system and method using the node range of an integrated path, which manages the paths of all XML documents in integrated form and improves search performance by adding a node range and a pre-order list.

XML has recently become recognized as a standard for representing and exchanging data on the Internet, and research on it is growing in importance. Demand is increasing for storing, retrieving, and exchanging important Internet information as XML, and XML is used in nearly every area of Internet activity, such as e-commerce and messengers, as well as in specialized fields such as medicine, defense, law, and taxation. Specialized XML documents in particular have grown very large, and user queries have become more complex. Many indexing techniques have therefore been proposed to store large numbers of XML documents and to process complex XML queries quickly.

Representative indexing techniques include path-based indexing techniques, indexing techniques based on numbering schemes, and many others. Of these, path-based indexing and numbering-based indexing, the latter of which underlies much recent research, are reviewed below.

1) Path-based indexing techniques

An XML document can be represented as a tree structure, as in FIG. 1, and by its nature contains many duplicated nodes. To retrieve the XML document, or the data within it, that matches a user's query, every path, including those through duplicated nodes, must be traversed. To solve this problem, an indexing technique called the DataGuide was proposed (R. Goldman, J. Widom, "DataGuides: enabling query formulation and optimization in semistructured databases", In Proc. of the VLDB, pp. 436-445, 1997).

In the XML document of FIG. 1, nodes such as "Restaurant" and "Entree" appear in duplicate, and (1) through (11) denote the locations of the data stored in the actual database. Given the user query "/Restaurant/Name", all data whose root node carries the "Restaurant" tag and whose child node carries the "Name" tag must be searched. To address this, the DataGuide defines every possible query form from the root as summarized simple paths, as in the table of FIG. 2.

As shown in FIG. 3, each node in the DataGuide index can be linked to a set of stored data. When the query "/Restaurant/Name" is executed, the DataGuide looks up Restaurant and then Name in turn and returns the data storage locations (5) and (9). To merge duplicated paths as proposed in the DataGuide, the indexing techniques 1-Index/2-Index/T-Index (T. Milo, D. Suciu, "Index Structures for Path Expressions", In Proc. of the ICDT, pp. 277-295, 1997) and Index Fabric (B. Cooper, N. Sample, M. Franklin, G. Hjaltason, M. Shadmon, "A Fast Index For Semistructured Data", In Proc. of the VLDB, pp. 341-350, 2001) were proposed. Index Fabric in particular adapts the Patricia Trie (D. Knuth, "The Art of Computer Programming, Vol. III, Sorting and Searching", Third Edition, Addison Wesley, Reading, MA, 1998), which encodes strings for searching, in order to reduce path operations on complex XML documents.
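The path-summary lookup described above can be illustrated with a minimal sketch. It is not the DataGuide implementation itself: it merely groups storage locations by label path so that a simple-path query becomes one lookup. The function name `build_path_summary` and the "/Restaurant/Entree" records are hypothetical; only the locations (5) and (9) for "/Restaurant/Name" come from the text.

```python
from collections import defaultdict


def build_path_summary(records):
    """Group storage locations by their root-to-node label path."""
    summary = defaultdict(list)
    for path, location in records:
        summary[path].append(location)
    return dict(summary)


# Hypothetical (path, location) pairs; only the two /Restaurant/Name
# locations (5) and (9) are taken from the description.
records = [
    ("/Restaurant/Name", 5),
    ("/Restaurant/Entree", 6),
    ("/Restaurant/Name", 9),
    ("/Restaurant/Entree", 10),
]

summary = build_path_summary(records)
name_locations = summary["/Restaurant/Name"]
print(name_locations)  # [5, 9]
```

Each distinct path appears once in the summary regardless of how many times it occurs in the document, which is the property the integrated path of the present invention also relies on.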

The Patricia Trie of FIG. 4 is implemented as an encoding scheme that, when a string is inserted, eliminates the duplicated characters and stores only the positions at which strings differ together with the differing characters.

Index Fabric defines the path from the root to a leaf of an XML document, together with its data, as raw data and encodes it in a Patricia Trie, with the emphasis on narrowing the range of path operations. To keep the tree balanced, it also divides the index into layers, as shown in FIG. 5.

Every path is encoded using designators. Each node is managed as an abbreviated character through a mapping to its designator in a designator dictionary. The abbreviated characters form summarized simple paths, and the raw data is formed by listing the XML data at the end of each path. The index is then built from the raw data using a Patricia Trie, as shown in FIG. 6. For simple-path queries, Index Fabric performs very well because it searches the encoded path from the root node to the leaf node, as in FIG. 6. However, it is less efficient for searches on intermediate or lowest-level nodes and for joins on the ancestor-descendant relationship ('//'). To compensate, the concept of a refined path was used to add frequently used paths to the index, but because the DBA must define these manually, the approach is ill-suited to handling varied queries, and when there are many refined paths the benefit of encoding diminishes.

2) Numbering-based indexing techniques

XPath queries frequently involve join operations on the ancestor-descendant relationship ('//'). One foundational method for processing such operations is Dietz's numbering scheme (P. Dietz, "Maintaining order in a linked list", In Proc. of the Fourteenth Annual ACM Symposium on Theory of Computing, pp. 122-127, 1982), and many recent indexing techniques use it for ancestor-descendant joins.

Under the numbering scheme, every node in an XML document carries a number <d₁,d₂,d₃>. Suppose two distinct nodes m and n exist. If m is a descendant of n, then n.d₁ < m.d₁ and n.d₂ > m.d₂ hold; if m is an ancestor of n, then n.d₁ > m.d₁ and n.d₂ < m.d₂ hold; and if m is a child of n, then n.d₃ + 1 = m.d₃ holds. If m and n are not in an ancestor-descendant relationship, then n.d₂ < m.d₁ or n.d₁ > m.d₂ holds. Here <d₁,d₂,d₃> denotes <preorder, postorder, level>.

For example, under this numbering scheme node a and node b of FIG. 7 have the numbers <1,7,1> and <2,4,2>, respectively. Node b is a descendant of node a, so 1 < 2 and 7 > 4 hold; node a is an ancestor of node b, so 2 > 1 and 4 < 7 hold; and node b is a child of node a, so a.level + 1 = b.level holds. Using this structure, an index is built by assigning a number <d₁,d₂,d₃> to every node of the XML document, and user queries are processed against it. However, when an intermediate or lower-level node is inserted, the labels of all affected nodes must be readjusted. XISS (P. Harding, Q. Li, B. Moon, "XISS/R: XML Indexing and Storage System Using RDBMS", In Proc. of the VLDB, pp. 1073-1076, 2003), (Q. Li, B. Moon, "Indexing and Querying XML Data For Regular path expression", In Proc. of VLDB, pp. 361-370, 2001) extended the numbering scheme to solve this problem.
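The comparisons in this example can be written out as a short sketch. The `Label` type and the predicate names are illustrative assumptions, not part of the cited scheme; the two labels are the ones given above for node a and node b.

```python
from typing import NamedTuple


class Label(NamedTuple):
    """<d1, d2, d3> = <preorder, postorder, level>."""
    pre: int
    post: int
    level: int


def is_ancestor(a: Label, d: Label) -> bool:
    # a is an ancestor of d if a is entered before d and left after d.
    return a.pre < d.pre and a.post > d.post


def is_parent(p: Label, c: Label) -> bool:
    # A parent is an ancestor exactly one level above the child.
    return is_ancestor(p, c) and p.level + 1 == c.level


node_a = Label(pre=1, post=7, level=1)
node_b = Label(pre=2, post=4, level=2)

print(is_ancestor(node_a, node_b))  # True
print(is_ancestor(node_b, node_a))  # False
print(is_parent(node_a, node_b))    # True
```

Both tests are constant-time integer comparisons, which is why the scheme is attractive for '//' joins; the cost lies in relabeling on insertion, as noted above.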

XISS manages every node with an order and a size instead of a preorder and a postorder, where the size holds the number of data items contained in the node's descendants plus room for expansion. That is, when nodes m and n exist and m is the parent node of n, the following relationship holds: if n is a descendant of m, then order(m) < order(n) and order(n) + size(n) <= order(m) + size(m). On this basis, XISS maintains, for every XML document, an element index and an attribute index for nodes and attributes, and a value index for data.
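The containment test stated above can be sketched as follows. The `Extent` type and all numeric values are hypothetical and chosen only to show the inequality; they are not taken from XISS or from the figures.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    """<order, size> pair assigned to a node."""
    order: int
    size: int


def is_descendant(m: Extent, n: Extent) -> bool:
    """True if n falls inside the interval reserved for m."""
    return m.order < n.order and n.order + n.size <= m.order + m.size


# Hypothetical values: m reserves the interval (1, 101], n sits inside it.
m = Extent(order=1, size=100)
n = Extent(order=10, size=30)
outside = Extent(order=120, size=5)

print(is_descendant(m, n))        # True
print(is_descendant(m, outside))  # False
```

Because size may exceed the actual number of descendants, new nodes can be given unused order values inside the parent's interval without relabeling existing nodes.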

Because <Order, Size> is assigned per XML document, a document with a different structure can be indexed easily when it is inserted. However, the document ID and the order and size of every node in every document must all be maintained, so index space is used inefficiently. At search time, joins must be performed per document ID, and as many self-joins are performed as there are nodes in the query, so the search overhead is very high. For example, for a query such as "/Book/Authors//Name", numbering-based indexing techniques execute an SQL query like the one in FIG. 8.

Because indexing is done per XML document, a join per document ID (doc_id) is needed for each node in the XPath query, and the data, too, must be compared against the numbers assigned to the nodes of each document. The larger the data set and the more nodes the query contains, the worse the search performance, and performance degrades further when the query includes comparisons on data values or attribute values.

Thus, existing numbering-based methods all divide each XML document into separate indexes for paths (or elements), data, and attributes. The path index assigns a unique document ID and a number to each node, and the data and attribute indexes are in turn assigned values that fall between the document ID and node numbers assigned in the path index. The user's query is analyzed, and search results are extracted by joining the document IDs and node numbers of the path, data, and attribute indexes. Consequently, the more XML documents there are and the larger the data, the larger the path index becomes, and the more nodes a query contains, the more self-joins on the path index are required, which degrades search performance.

The present invention has been proposed to solve the above problems of the prior art. Its object is to provide an XML index system and method using the node range of an integrated path that manages the paths of XML documents in integrated form and assigns no numbers to the data and attribute indexes within documents, so that the index is small and search performance is excellent.

To achieve the above object, an XML index system using the node range of an integrated path according to the present invention comprises: a DOM parser that decomposes an input XML document into simple paths and data; a raw data storage unit in which the output of the DOM parser is temporarily stored; and an XML database comprising a name index that builds the integrated path by comparing the data from the raw data storage unit with the existing paths and that, after comparison with previously stored information, assigns and stores a preorder, a postorder, and a node range indicating the extent of the corresponding data; a pre-order list that arranges all nodes of the integrated path in preorder sequence and stores the names of the data tables; an XML document ID table that stores, for each data table name, the IDs and names of XML documents; and data tables that store the XML document ID, the data values in the document, and the sequence of the data values.

To achieve the above object, an XML indexing method using the node range of an integrated path according to the present invention builds an integrated path by merging a plurality of XML documents into single paths that pass through the same nodes, and maintains a separate index that holds the XML data per document ID for each lowest-level node of the simple paths produced by the DOM parser.

In addition, each node of the integrated path is assigned a preorder, a postorder, and a node range indicating the extent of the corresponding data; all nodes of the integrated path are arranged in preorder sequence, and the names of the actual data tables are stored in that sequence; and when a user query is executed, the query is analyzed to obtain the preorder and node range from the integrated path, and the corresponding data table names stored in preorder sequence are obtained so that the tables are opened directly without number comparisons.

Hereinafter, the present invention is described in more detail with reference to the accompanying drawings. The following embodiments merely illustrate the present invention, and the scope of the present invention is not limited to them.

As shown in FIG. 9, the present invention merges XML documents of various structures into single paths that pass through the same nodes. It then maintains a separate index that holds the XML data per document ID for each lowest-level node of simple paths such as "Book/Titles/Title", "Book/Authors/Author/Name", and "Book/Store/Name". In other words, only the paths are numbered, and at query time the result of the path search is used only to obtain the range of tables in which the data is stored.

FIG. 10 shows the index structure according to the present invention after all XML documents have been inserted. Each node of the integrated XML path tree is assigned, using the numbering scheme, a preorder, a postorder, and a node range indicating the extent of the corresponding data. The pre-order list arranges the nodes of the integrated path as an array in preorder sequence, and each entry holds the name of the data table in which the data of the corresponding lowest-level node is stored.

When a user's XPath query is given, the integrated path is used only to determine the data range, and the data tables in that range are opened directly.

For example, if the XPath query "Book//Author" is given to an index system with the structure of FIG. 10, an implementation using the existing numbering-based method would look like FIG. 11.

The range obtained by joining the document ID (doc_id) of the element index (Element_ix), which stores path information per XML document, with the node numbers (order, size) of each document is then searched in the value index (value_ix), which stores data and attributes. The more nodes the query contains, the greater the search overhead caused by the additional self-joins.

The present invention, by contrast, manages the paths of all XML documents in integrated form, so the path index is very small, which makes comparing the numbers assigned to nodes during an XPath query very efficient. Specifically, for the query "/Book//Author" of FIG. 11, the present invention obtains <1,15,15> and <7,8,10> for "<Book>" and "<Author>", and if their preorders and postorders satisfy the ancestor-descendant relationship, it obtains the data range <preorder, node range>, that is, <7,10>. It then moves directly to the 7th through 10th blocks of the pre-order list and returns the data table names held in those blocks. The query result is obtained by directly opening the returned data tables. Because no number comparison within a data index is performed, unlike in existing numbering schemes, search performance is excellent.
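The "/Book//Author" lookup just described can be sketched as follows, as an illustration only. The two name-index entries <1,15,15> and <7,8,10> are taken from the text; everything else is assumed: the name index is simplified to one entry per tag name, the table names (`T08` and so on) are invented, and interior nodes are assumed to hold no table name in the pre-order list.

```python
def resolve(name_index, pre_order_list, ancestor, descendant):
    """Return the <preorder, node range> of the descendant and the data
    table names in that slice of the pre-order list, or (None, []) if the
    ancestor-descendant test fails."""
    a_pre, a_post, _ = name_index[ancestor]
    d_pre, d_post, d_range = name_index[descendant]
    if not (a_pre < d_pre and a_post > d_post):
        return None, []
    # Index i of the list holds the entry whose preorder is i.
    tables = [t for t in pre_order_list[d_pre:d_range + 1] if t is not None]
    return (d_pre, d_range), tables


# <preorder, postorder, node range>; both entries come from the description.
name_index = {"Book": (1, 15, 15), "Author": (7, 8, 10)}

# Hypothetical pre-order list with 15 slots (slot 0 is unused padding).
pre_order_list = [None] * 16
pre_order_list[8] = "T08"
pre_order_list[9] = "T09"
pre_order_list[10] = "T10"
pre_order_list[12] = "T12"

data_range, tables = resolve(name_index, pre_order_list, "Book", "Author")
print(data_range)  # (7, 10)
print(tables)      # ['T08', 'T09', 'T10']
```

The only number comparison is the single ancestor-descendant test on the small name index; the data tables themselves are reached by array position.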

The configuration of the index system according to the present invention is shown in FIG. 12.

First, as shown in FIG. 12B, the XML database (100) of the present invention comprises a name index (110) that builds the integrated path and is used to analyze XPath queries; a pre-order list (120) that arranges all nodes of the integrated path in preorder sequence and stores the names of the data tables; an XML document ID table (Doc_id_TB) (130) that stores the IDs of the XML documents contained in each data table; and data tables (140) that store the actual data. The details are shown in FIGS. 13A to 13E.

In addition, as shown in FIG. 12A, when an XML document is inserted, the index system of the present invention uses the DOM parser (200) to build the simple path of every data item. For example, the XML document of FIG. 9 is decomposed into simple paths and data such as:

/Book/Titles/Title : C++

/Book/Store/Name : Young,Kim

Raw data in this form, consisting of simple paths and data, is temporarily stored in the raw data storage unit (300). The stored data is compared with the information already held in the name index (110), which holds the XML path tree, and is stored with <preorder, postorder, node range> values assigned. That is, numbers are not assigned per XML document; instead, each path is compared with the existing paths and managed as part of the integrated path. The node range in the name index (110) is used to move quickly to the pre-order list (120) with the data range obtained by analyzing an XPath query.
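The decomposition into simple paths and data can be sketched with Python's standard DOM parser. This is an assumption-laden illustration, not the disclosed DOM parser (200): the function name `simple_paths` is invented, and the sample document is a hypothetical fragment constructed to yield the two rows listed above, since FIG. 9 is not reproduced here.

```python
from xml.dom import minidom


def simple_paths(xml_text):
    """Return (simple path, data) rows for every element that has text
    content and no child elements."""
    doc = minidom.parseString(xml_text)
    rows = []

    def walk(elem, prefix):
        path = prefix + "/" + elem.tagName
        children = [c for c in elem.childNodes
                    if c.nodeType == c.ELEMENT_NODE]
        text = "".join(c.data for c in elem.childNodes
                       if c.nodeType == c.TEXT_NODE).strip()
        if not children and text:
            rows.append((path, text))
        for child in children:
            walk(child, path)

    walk(doc.documentElement, "")
    return rows


# Hypothetical document fragment.
xml_text = (
    "<Book>"
    "<Titles><Title>C++</Title></Titles>"
    "<Store><Name>Young,Kim</Name></Store>"
    "</Book>"
)

raw_rows = simple_paths(xml_text)
for path, value in raw_rows:
    print(path, ":", value)
```

Each row corresponds to one entry of the raw data held temporarily before it is merged into the name index.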

FIG. 14 shows the method of calculating the node range.

The calculation of the node range falls into two cases. If the current node is the root node, its node range is the postorder of the root node. If it is not the root node, then when a right sibling node exists, the node range of the current node is the preorder of the right sibling minus 1; when no right sibling exists, the node range of the parent node becomes the node range of the current node.
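The two cases above can be sketched as a top-down pass over a numbered tree. This is a minimal illustration and not the algorithm of FIG. 14 itself: the `Node` class, the helper names, and the eight-node example tree are assumptions made for the sketch.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.pre = self.post = self.node_range = 0
        if parent is not None:
            parent.children.append(self)


def number(root):
    """Assign 1-based preorder and postorder numbers."""
    counter = {"pre": 0, "post": 0}

    def visit(node):
        counter["pre"] += 1
        node.pre = counter["pre"]
        for child in node.children:
            visit(child)
        counter["post"] += 1
        node.post = counter["post"]

    visit(root)


def assign_ranges(node, right_sibling=None):
    """Apply the rules in the text; parents are processed before children."""
    if node.parent is None:
        node.node_range = node.post               # root case
    elif right_sibling is not None:
        node.node_range = right_sibling.pre - 1   # right sibling exists
    else:
        node.node_range = node.parent.node_range  # inherit from parent
    for i, child in enumerate(node.children):
        nxt = node.children[i + 1] if i + 1 < len(node.children) else None
        assign_ranges(child, nxt)


# Hypothetical integrated path tree.
book = Node("Book")
titles = Node("Titles", book)
title = Node("Title", titles)
authors = Node("Authors", book)
author = Node("Author", authors)
author_name = Node("Name", author)
store = Node("Store", book)
store_name = Node("Name", store)

number(book)
assign_ranges(book)
print([(n.name, n.pre, n.post, n.node_range)
       for n in (book, titles, title, authors, author, author_name,
                 store, store_name)])
```

In this sketch the node range of every node equals the largest preorder in its subtree, so <preorder, node range> delimits a contiguous slice of the pre-order list.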

The pre-order list (120) of FIG. 13B is a single dynamic array stored in the preorder sequence of the name index (110) of FIG. 13A. Each entry holds a preorder from the name index (110) and the name of a data table. Because it holds only data table names in the preorder sequence of the integrated path, it is very small.

Data table names are assigned automatically, one unique name for each lowest-level node added when an XML document is inserted. When an XPath query is processed, the preorder and node range of the lowest-level node are obtained through the name index (110). They are then used to move directly to the corresponding position in the pre-order list (120) and obtain the data table names.

The XML document ID table (130) stores, for each data table name in the pre-order list (120), the unique numbers and names of XML documents. It is used when query results are returned as the stored XML documents.

A data table (140) is created for each data table name held in the pre-order list (120), and stores the document unique number (Doc_id) of the XML document, the data value within the document (Xml_value), and the sequence of the data value (Seq_value). The sequence number is assigned when the same tag (node) name occurs more than once within an XML document, and is used to process XPath queries such as "/Book/Authors/Name". Because a data table (140) is created per lowest-level node, a wide XPath result range <preorder, node range>, or the presence of duplicated nodes, can require opening many data tables and impose considerable search overhead on the XML database (100). In that case, as with the data table (150) of FIG. 13E, a single data table can be used instead, with the data table name designated as a cluster index.
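The single-table alternative can be sketched with SQLite, purely as an illustration. The patent does not name a database product; here a WITHOUT ROWID table whose primary key begins with the table name stands in for the cluster index, and the table name `data_table`, the column `table_name`, the names `T_title` and `T_store_name`, and the row for document 2 are all hypothetical. The columns doc_id, seq_value, and xml_value mirror the fields named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# WITHOUT ROWID stores rows ordered by the primary key, so rows that share
# a table_name are kept together (used here as a stand-in for a cluster index).
conn.execute(
    """
    CREATE TABLE data_table (
        table_name TEXT    NOT NULL,
        doc_id     INTEGER NOT NULL,
        seq_value  INTEGER NOT NULL,
        xml_value  TEXT,
        PRIMARY KEY (table_name, doc_id, seq_value)
    ) WITHOUT ROWID
    """
)
conn.executemany(
    "INSERT INTO data_table VALUES (?, ?, ?, ?)",
    [
        ("T_title", 1, 1, "C++"),
        ("T_store_name", 1, 1, "Young,Kim"),
        ("T_title", 2, 1, "XML"),  # hypothetical second document
    ],
)


def fetch(connection, table_names):
    """Read all rows for the table names returned by the pre-order list."""
    marks = ",".join("?" for _ in table_names)
    sql = (
        "SELECT doc_id, seq_value, xml_value FROM data_table "
        f"WHERE table_name IN ({marks}) "
        "ORDER BY table_name, doc_id, seq_value"
    )
    return connection.execute(sql, list(table_names)).fetchall()


title_rows = fetch(conn, ["T_title"])
print(title_rows)  # [(1, 1, 'C++'), (2, 1, 'XML')]
```

A wide <preorder, node range> result then becomes one query over a list of names rather than many separate table opens.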

Next, the insertion of XML documents is described.

Because the present invention keeps a separate data table for each lowest-level node and operates on an integrated path, no path adjustment is needed when an XML document with the same structure is inserted. Processing is completed by following the path and adding the data to the data tables registered in the corresponding entries of the pre-order list (120). Since no numbers are assigned to the data tables, no further processing is required.
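
A minimal sketch of this same-structure insertion follows. All names here (`insert_same_structure`, the path-to-preorder mapping `name_index`, the in-memory `tables` dictionary standing in for database tables) are hypothetical; the point is only that the name index and pre-order list are read but never modified.

```python
def insert_same_structure(doc_id, leaf_values, name_index, pre_order_list, tables):
    """Append one document's leaf data to existing data tables.

    leaf_values    : list of (leaf_path, value) pairs in document order
    name_index     : dict mapping leaf_path -> preorder (read only)
    pre_order_list : list mapping preorder -> data table name (read only)
    tables         : dict mapping table name -> list of (Doc_id, Xml_value, Seq_value)
    """
    for path, value in leaf_values:
        table_name = pre_order_list[name_index[path]]
        rows = tables.setdefault(table_name, [])
        # Sequence number among this document's values in this table
        seq = 1 + sum(1 for row in rows if row[0] == doc_id)
        rows.append((doc_id, value, seq))
```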

If an XML document with a different structure is inserted, the name index (110) and the pre-order list (120) must be adjusted. That is, nodes are added to the name index (110) and the numbers of the changed portion are recalculated, but because the name index (110) is small, the overhead is not large. The number of entries in the pre-order list (120) increases, but the existing per-node table names do not change. Only the table names for the added nodes are assigned automatically, and the insertion is completed by storing the data in the corresponding data tables.

FIGS. 15A and 15B illustrate an example of inserting an XML document with a different structure. FIG. 15A shows the index structure before insertion, and FIG. 15B shows it after insertion.

As shown in FIG. 15B, even when an XML document with a different structure is inserted, processing is completed simply by adjusting the <preorder, postorder, noderange> values at the affected positions in the name index (110) and automatically assigning names only to the data tables added to the pre-order list (120).
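
The adjustment can be sketched with a small in-memory model of the integrated path. This is an illustrative assumption-laden sketch: the `Node` and `IntegratedPath` classes, the `T1`, `T2`, ... naming scheme, and the node range as a descendant count are hypothetical. For simplicity the sketch renumbers the whole path after every insertion, whereas the text above recalculates only the changed portion.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.children = {}  # dicts keep insertion order
        self.table = None   # data table name, set for leaf paths only
        self.preorder = self.postorder = self.node_range = 0


class IntegratedPath:
    def __init__(self):
        self.root = Node("")       # virtual root, not numbered
        self.next_table = 1
        self.pre_order_list = []   # preorder -> table name (None for internal nodes)

    def insert_path(self, leaf_path):
        """Merge a leaf path into the integrated path and return its table name."""
        node = self.root
        for name in leaf_path.strip("/").split("/"):
            node = node.children.setdefault(name, Node(name))
        if node.table is None:     # only newly added leaves get a new name
            node.table = f"T{self.next_table}"
            self.next_table += 1
        self.renumber()
        return node.table

    def renumber(self):
        """Recompute <preorder, postorder, node_range> and rebuild the list."""
        self.pre_order_list = []
        post = 0

        def visit(n):
            nonlocal post
            n.preorder = len(self.pre_order_list)
            self.pre_order_list.append(n.table)
            for child in n.children.values():
                visit(child)
            # ASSUMPTION: node range = number of descendants
            n.node_range = len(self.pre_order_list) - 1 - n.preorder
            n.postorder = post
            post += 1

        for child in self.root.children.values():
            visit(child)
```

Inserting a path that already exists returns the existing table name, and inserting a new leaf path extends the pre-order list while leaving previously assigned names untouched.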

The present invention is compared experimentally with an existing indexing technique to demonstrate that it improves index space usage and search performance.

First, because the indexes considered for comparison are very inefficient for intermediate nodes, ancestor-descendant join operations, and the like, the present invention is evaluated against XISS, one of the representative indexing techniques based on existing numbering schemes. In addition, although the present invention implements the pre-order list as a dynamic array, memory may become a limitation when many XML documents with different structures are inserted, so a disk-based version was also built and included in the comparison.

The index system of the present invention was implemented on a PC with a Pentium 4 2.0 GHz CPU, 512 MB of RAM, and a 60 GB HDD, running MS-Windows 2000 Server as the operating system. MS-SQL Server 2000 was used as the relational database, and Visual Basic 6.0 and T-SQL were used as the development languages. Microsoft's MSXML (DOM) library was used as the parser for extracting the internal data and structure of the experimental XML documents. The experimental data were divided into three groups, (A), (B), and (C). Experimental data (A) consists of legal XML documents currently in use on an actual legal site; each document contains few nodes and little data, but the number of documents is very large. Experimental data (B) consists of the Shakespeare play XML documents used as experimental data in XISS-R (R. Goldman, J. Widom. "DataGuides: enabling query formulation and optimization in semistructured databases", In Proc. of the VLDB, pp. 436-445, 1997.), the system used for comparison; the number of documents is small, but each document contains a very large number of nodes and data, so the individual documents are very large. Experimental data (C) is an arbitrary synthetic dataset consisting of large XML documents that contain many duplicate nodes and data and are very deeply nested.

Using the differently structured experimental data (A), (B), and (C), the index system of the present invention was compared with the existing XISS/XISS-R system. The results were analyzed in terms of search speed for several query patterns and the size of the index data.

FIG. 16 shows the experimental data: (A) actual XML documents from a legal site, (B) the XISS-R data (Shakespeare plays), and (C) the synthetic dataset.

The experimental data sets (A), (B), and (C) differ considerably in the average number of elements per document, the average number of data items per document, and the depth of the documents.

Search time was analyzed using a total of 12 queries for experimental data (A) and a total of 9 queries each for experimental data (B) and (C). The queries fall into three broad types: simple path queries, ancestor-descendant join queries, and ancestor-descendant conditional queries. Search time was compared across four index methods: XISS; pre-order (1) and pre-order (2), the methods proposed in the present invention using the data tables of FIG. 13A and FIG. 13E; and a variant in which the pre-order list is implemented on disk rather than as a dynamic array. The index size for each experimental data set was also compared and analyzed.

Experimental data (A) is characterized by a small number of nodes and data items in each XML document but a very large number of documents. As shown in FIG. 17A, a total of 12 queries were used for experimental data (A). Q1 to Q3 are simple paths from the root node to a leaf node, or abbreviated forms of such paths; Q4 to Q6 are ancestor-descendant join operations; and Q7 combines an ancestor-descendant relationship with a simple path. Q8 to Q12 are the query structures of Q1 to Q7 with conditions added.

FIG. 17B is a table of the results of executing the 12 queries of FIG. 17A. It compares XISS with the methods according to the present invention (using the data tables of FIG. 13D and FIG. 13E, and implementing the pre-order list on disk) and shows, for each query, the search time and the number of result records returned. FIG. 17C presents the results as a graph. The proposed methods outperform the existing method on every query. The integrated path itself is smaller than the XISS index, which is built for each node in each document. For Q1 to Q7, search is also fast because results are returned by opening the data tables in the pre-order list without comparing numbers. In addition, the execution time of XISS varied widely with the amount of data, whereas the present method gave nearly uniform results on average. Conditional queries Q8 to Q12 must search data as well as nodes, so search performance degraded with the existing method, while the present method still performed well. The disk-based implementation of the pre-order list also outperformed the existing method.

Experimental data (B) has few XML documents, but each document contains a very large number of nodes and data items and has a complex structure. As with experimental data (A), Q1 to Q3 are simple paths or abbreviated simple paths, Q4 to Q7 are ancestor-descendant join operations, and Q8 and Q9 are conditional queries (FIG. 18A).

The difference in search performance was not as large as in the results for experimental data (A), and both methods showed good search speed. However, for conditional query Q8, which involves a large amount of data, the proposed method performed better. For the conditional queries on experimental data (A), the method using the data table of FIG. 13E (cluster index) performed slightly better, but on experimental data (B) its performance was actually worse, as the result for Q9 shows. In other words, when the number of nodes and the amount of data in a document are very large, the data table of FIG. 13D performs better. As with experimental data (A), the disk-based pre-order list method showed good performance on average (FIGS. 18B and 18C).

Experimental data (C) is a synthetic dataset whose documents are deeply nested, contain many duplicate nodes and data items, and are large. Because of the deep nesting, Q1 and Q2 are queries with abbreviated simple paths, Q3 to Q7 are ancestor-descendant join queries, and Q8 and Q9 are ancestor-descendant queries with conditions (FIG. 19A).

Because the documents in experimental data (C) contain many nodes and are deeply nested, the example XPath queries were also constructed to include many nodes. With XISS, the more nodes a query includes, the more self-join operations occur, so its search performance is lower than that of the present invention. For queries Q8 and Q9, XISS also compares the numbers in the data index while evaluating the condition, which degrades performance further. The present method showed good performance on average even when queries included many nodes, and for the conditional queries Q8 and Q9 it improved performance more than twofold. Comparative experiments against the existing method using the differently structured experimental data (A), (B), and (C) showed that the method proposed in the present invention delivers good search performance on average regardless of the structure of the data. In addition, because the pre-order list becomes very large when many documents with different structures are inserted, a disk-based implementation was also evaluated in place of the dynamic array. It performed worse than the dynamic array but still outperformed the existing method. In summary, the method performs well when the stored XML documents are very large, and it performs even better when the documents are small and numerous. Furthermore, the search performance of the existing method depends on the number of nodes included in the query, whereas the present method is not significantly affected by it (FIGS. 19B and 19C).

FIG. 20 is a graph comparing the index sizes of XISS and the present invention. Because the present invention manages paths in integrated form and assigns no numbers to the data tables, it uses index space more efficiently than the existing method.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art may modify or change the present invention in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

As described above, the XML index system and method using the node range of an integrated path according to the present invention merges duplicate paths into an integrated path, assigns numbers such as preorder, postorder, and node range to the integrated path, and stores the names of the actual data tables in a dynamic array called the pre-order list, in the preorder order of the integrated path. When a query is executed, the query is analyzed to obtain the preorder and node range from the integrated path, and the corresponding data table names are obtained from the pre-order list dynamic array and opened directly without number comparison. As a result, XPath queries are analyzed quickly, index storage space for paths is managed efficiently, the range of data tables that must actually be searched is reduced by means of the pre-order list, and insertion and deletion are made easier by maintaining a data table for each lowest-level node.

Claims

An XML index system comprising:

a DOM parser that takes XML documents as input and decomposes them into simple paths and data;

a raw data storage unit that temporarily stores the output of the DOM parser; and

an XML database comprising: a name index that builds an integrated path from the data in the raw data storage unit by comparing it with the existing paths, and that assigns and stores a preorder, a postorder, and a node range indicating the range of the corresponding data; a pre-order list that arranges all nodes of the integrated path in preorder order and stores the names of the data tables; an XML document ID table that stores the ID and the name of the XML document for each data table name; and a data table that stores the ID of the XML document, the data values in the document, and the order of the data values.

The XML index system using the node range of an integrated path according to claim 1, wherein the data table names stored in the pre-order list are unique names automatically assigned to each lowest-level node added when an XML document is inserted.

The XML index system using the node range of an integrated path according to claim 1, wherein the data table is created for each data table name stored in the pre-order list.

The XML index system using the node range of an integrated path according to claim 1, wherein the order of the data values stored in the data table is a sequence number assigned when the same tag (node) name occurs more than once in an XML document.

The XML index system using the node range of an integrated path according to claim 1, wherein, when an XML document having the same structure is inserted, the data need only be added, along the path, to the data tables registered in the corresponding pre-order list.

The XML index system using the node range of an integrated path according to claim 1, wherein, when an XML document having a different structure is inserted, a node is added to the name index, a data table name is automatically assigned to the added node, and the insertion is completed by storing the data in the corresponding data table.

An XML indexing method using the node range of an integrated path, wherein a plurality of XML documents are integrated into a single path through their identical nodes to form an integrated path, and, for the lowest-level node of each simple path produced by a DOM parser, the XML data are managed in a separate index organized by document ID.

8. The XML indexing method according to claim 7, wherein a preorder, a postorder, and a node range indicating the range of the corresponding data are assigned to each node of the integrated path.

9. The method of claim 8, wherein all nodes of the integrated path are arranged in preorder order.

10. The method of claim 9, wherein the names of the actual data tables are stored in the preorder order.

The XML indexing method using the node range of an integrated path according to claim 10, wherein a user query is analyzed to obtain the preorder and node range from the integrated path, the corresponding data table name stored in the preorder order is obtained, and the data table is opened directly without number comparison.