KR100558765B1

KR100558765B1 - Method for executing xml query using adaptive path index

Info

Publication number: KR100558765B1
Application number: KR1020020070634A
Authority: KR
Inventors: 민준기; 심규석; 정진완
Original assignee: 한국과학기술원
Priority date: 2002-11-14
Filing date: 2002-11-14
Publication date: 2006-03-10
Also published as: US20040098384A1; US7260572B2; KR20040042358A

Abstract

The present invention relates to an XML query processing method using an adaptive path index, which exploits search patterns frequently applied to XML data to substantially improve the search speed of XML data while requiring less storage space than existing indexes. The invention provides a method for executing an XML query using an adaptive path index, comprising: a first step of representing XML data as a graph, referred to as an XML graph; a second step of generating and updating an adaptive path index from the graph and from frequently used paths extracted from previously executed XML queries; and a third step of processing an XML query using the adaptive path index. When XML data is searched, the adaptive path index exploits the frequently used paths so that XML query results can be found more efficiently, and when the pattern of XML queries changes, the burden of updating the index to reflect the change is reduced.

XML, Adaptive Path Index, Hash Tree, Extent, IDREF

Description

Method for executing XML query using adaptive path index {METHOD FOR EXECUTING XML QUERY USING ADAPTIVE PATH INDEX}

FIG. 1 is a diagram illustrating the management structure of an adaptive path index according to the present invention;

FIG. 2 is a diagram illustrating an example of XML data;

FIG. 3 is a diagram representing the structure of the XML data;

FIG. 4 is a diagram illustrating an example of an adaptive path index;

FIG. 5 is a diagram illustrating the structure of a hash table in the hash tree for an adaptive path index;

FIG. 6 is a diagram illustrating the algorithm for generating the initial adaptive path index;

FIG. 7 is a diagram illustrating an example of an initial adaptive path index;

FIG. 8 is a diagram illustrating the algorithm for extracting frequently used paths;

FIG. 9 is a diagram illustrating the algorithm for updating an adaptive path index;

FIG. 10 is a graph showing the performance of single path expressions on the Play data;

FIG. 11 is a graph showing the performance of single path expressions on the FlixML data;

FIG. 12 is a graph showing the performance of single path expressions on the GedML data;

FIG. 13 is a graph comparing the performance of complex path queries;

FIG. 14 is a graph comparing the performance of mixed value-path queries.

The present invention relates to an index structure and an index management scheme for efficiently searching data represented in XML (Extensible Markup Language; reference: T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler. Extensible markup language (XML) 1.0 (second edition). W3C), the standard for data representation and exchange on the Internet. In particular, it relates to an XML query processing method using an adaptive path index, which exploits search patterns frequently applied to XML data to substantially improve the search speed of XML data while requiring less storage space than existing indexes.

The advent of the Internet has brought about a dramatic increase in the variety of information available in electronic form. XML was announced by the W3C in 1998 as a way to overcome the limitations of HTML, which had been in use on the Internet, and to resolve the complexity of SGML. XML can represent data of many forms, from regular to irregular structures and from flat to deeply nested structures. Because of this flexibility, XML is attracting attention as the standard language for data representation and exchange on the next-generation Web.

Various query languages have been proposed for retrieving information expressed in XML. XML query languages such as XPath [reference: J. Clark and S. DeRose. XML path language (XPath) version 1.0. W3C] and XQuery [reference: S. Boag, D. Chamberlin, M. Fernandez, D. Florescu, J. Robie, J. Simeon, and M. Stefanescu. XQuery 1.0: An XML query language. W3C] use path expressions to navigate the irregular structure formed by XML elements. Navigating irregularly structured XML data according to a given path expression is therefore a key part of XML query processing. However, because the elements that make up the XML data may be scattered across different locations on disk, the performance of XML query processing degrades severely. Moreover, processing an XML query that consists of a partial matching path expression requires examining every element of the XML data, which is very inefficient.

A structural summary or a path index, on the other hand, improves the search speed of XML data by allowing only the relevant portions of the XML data to be examined for a given path expression. Methods for extracting structural summaries and generating path indexes to speed up XML data retrieval have therefore received much attention recently, and several indexes have been proposed to improve search performance over data with irregular characteristics.

Goldman and Widom developed a path index called the strong DataGuide [reference: R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. VLDB 1997]. This method was proposed as a way to extract structural information from semistructured data such as XML, and it represents the simple paths starting from the root element of the XML data so that no path appears more than once. A DataGuide is built in a manner similar to the algorithm that converts a non-deterministic finite automaton into a deterministic finite automaton [reference: J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. 1979]. Consequently, when the given XML data has a very complex graph structure, the strong DataGuide can become larger than the original XML data.

Milo and Suciu proposed the 1-Index, which differs from the DataGuide [reference: T. Milo and D. Suciu. Index structures for path expressions. ICDT 1999]. The 1-Index is the same as the strong DataGuide in that it also maintains information on all paths starting from the root element. The 1-Index builds its index using backward simulation and backward bisimulation, which are based on graph theory. Given two nodes of the graph, this approach merges them into a single node when the sets of paths starting from each node are identical. Unlike the strong DataGuide, this approach yields a graph that resembles a non-deterministic finite automaton. When the input graph is a tree, however, the 1-Index and the strong DataGuide are identical. The 1-Index can therefore be regarded as a non-deterministic variant of the strong DataGuide.

In the field of object-oriented databases, the access support relation (ASR) has been used to support frequently used references between any two objects [reference: A. Kemper and G. Moerkotte. Access support relations: An indexing method for object bases. Information Systems, 17(2):117-145, 1992]. However, an access support relation materializes the reference chains of arbitrary queries, and it therefore has the problem that it can support only the partial paths that have been defined in advance.

Cooper et al. proposed the Index Fabric, which is conceptually similar to the strong DataGuide in that it maintains all paths starting from the root element [reference: B. Cooper, N. Sample, M. J. Franklin, G. R. Hjaltason, and M. Shadmon. A fast index for semistructured data. VLDB 2001]. The Index Fabric encodes the paths of elements that carry data values, together with those data values, into strings and maintains them using a string index such as a Patricia trie. However, the Index Fabric has the drawback that it cannot maintain the parent-child relationships between XML elements. The Index Fabric is therefore not effective for partial matching path expressions, which must rely on parent-child relationships.

Users of XML data often do not take the structure of the data into account and deliberately use partial matching path expressions to obtain the results they want. However, the XML path indexes mentioned above, such as the strong DataGuide, the 1-Index, and the Index Fabric, represent all paths starting from the root element, so processing a partial matching path expression requires an exhaustive traversal of the index, which degrades performance. Moreover, because these path indexes are built from the data alone, they make no use of the query workload, that is, the set of past query information that could be used to process frequently used path expressions effectively.

To solve the above problems, the present invention provides an adaptive path index called APEX (Adaptive Path indEx for XML data). An object of the present invention is to provide a method of processing queries on irregularly structured XML data using a path index, and more specifically an XML query processing method using an adaptive path index that improves query performance by extracting frequently used paths from the path expressions used in past queries on the XML data and updating the path index accordingly.

To achieve the above object, the present invention provides, in an XML query execution method that uses a path index to execute queries on XML data, a method for executing an XML query using an adaptive path index, comprising: a first step of representing the XML data as a graph, referred to as an XML graph; a second step of generating and updating an adaptive path index from the graph and from frequently used paths extracted from previously executed XML queries; and a third step of processing an XML query using the adaptive path index.

Unlike existing path indexes such as the DataGuide and the 1-Index, the adaptive path index proposed in the present invention does not maintain all paths starting from the root element and instead exploits frequently used paths to improve query performance. Furthermore, whereas existing path indexes are built from the XML data alone, the present invention uses data mining concepts to identify the paths frequently used by past queries and puts them to use.

Specifically, the adaptive path index and the query processing method according to the present invention are characterized by the following features.

(1) The adaptive path index consists of a hash tree and a graph structure. Because the hash tree maintains information on the incoming paths of the nodes in the graph structure, partial matching path expressions can be processed efficiently by a lookup in the hash tree.

(2) Concepts from sequential pattern mining are used to extract information on frequently used paths.

(3) By maintaining the index structure around the frequently used paths, the index is smaller than existing path indexes.

(4) When the frequently used paths change, the adaptive path index identifies only the changed paths and modifies the index accordingly, which reduces the burden of creating and maintaining the index.

The above object and the various advantages of the present invention will become more apparent to those skilled in the art from the preferred embodiments described below with reference to the accompanying drawings. Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

FIG. 1 illustrates the management structure of an adaptive path index according to the present invention. As shown in FIG. 1, the adaptive path index management module consists of three main modules: an initialization module (11), a frequently used path extraction module (13), and an update module (15). The initialization module (11) is used when the adaptive path index is created for the first time, with no information about past queries. This module generates the initial adaptive path index APEX⁰ (12). APEX⁰ is the simplest form of the adaptive path index and serves as the seed from which more complex adaptive indexes are built. The path extraction module (13) and the update module (15) use the query workload (14), which is the set of queries issued against the current path index, and they are used to generate a more efficient path index (16) by exploiting the frequently used paths. The query workload (14) is a set of single path expressions of the form $l_i.l_{i+1}.\ldots.l_{j-1}.l_j$. The path extraction module (13) and the update module (15) are run again each time the query workload changes.

The structure of the XML data used in the present invention is first described briefly, and the structure of the adaptive path index proposed by the present invention is then explained on that basis. Next, the adaptive path index management method mentioned above is described.

FIG. 2 shows an example of XML data, whose structure can be represented as a label-edge graph as in FIG. 3. This is called an XML graph. The ID-IDREF relationships appearing in FIG. 3 are represented as edges from a node of IDREF type to the node representing the element that has the corresponding attribute of ID type. The label of such an edge is the label of that element. In addition, the label of an edge pointing to a node for an IDREF-type element begins with "@".

An XML graph contains paths made up of sequences of labels, and these are used as the path expressions of XML queries. The following relationships can hold between paths. First, for any two paths A and B, if the relationship in Definition 1 below holds, we say that "path B contains path A".

Definition 1

For paths $A = a_1.a_2.\ldots.a_n$ and $B = b_1.b_2.\ldots.b_m$, if $a_1 = b_i,\ a_2 = b_{i+1},\ \ldots,\ a_n = b_{i+n-1}$ with $1 \le i$ and $i+n-1 \le m$, we say that "path A is contained in path B", or that A is a subpath of B. Furthermore, if A is a subpath of B and $m = i+n-1$, A is called a suffix of B.
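Definition 1 can be checked mechanically. The following is a minimal Python sketch, not part of the patent text; the helper names `parse`, `find_subpath`, `is_subpath`, and `is_suffix` are hypothetical, and paths are assumed to be written as dot-separated label strings.

```python
from typing import List, Optional, Sequence


def parse(path: str) -> List[str]:
    """Split a dot-separated path such as 'actor.name' into its labels."""
    return path.split(".")


def find_subpath(a: Sequence[str], b: Sequence[str]) -> Optional[int]:
    """Return the smallest 1-based position i with a_1 = b_i, ..., a_n = b_{i+n-1}."""
    n, m = len(a), len(b)
    if n == 0:
        return None  # Definition 1 is assumed to apply to non-empty paths only
    for i in range(1, m - n + 2):  # the range bound guarantees i + n - 1 <= m
        if list(b[i - 1:i - 1 + n]) == list(a):
            return i
    return None


def is_subpath(a: Sequence[str], b: Sequence[str]) -> bool:
    """True when path a is contained in path b."""
    return find_subpath(a, b) is not None


def is_suffix(a: Sequence[str], b: Sequence[str]) -> bool:
    """True when a is a subpath of b that ends at b's last label (m = i + n - 1)."""
    n, m = len(a), len(b)
    return 0 < n <= m and list(b[m - n:]) == list(a)
```

For instance, `is_suffix(parse("name"), parse("actor.name"))` holds, while `is_subpath(parse("A.C"), parse("A.B.C"))` does not, because the labels of a subpath must be contiguous.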

In addition, the support of an arbitrary path p on the XML graph, denoted sup(p), measures how often p is used among the given XML path expressions and is expressed by Equation 1.

$$\mathrm{sup}(p) = \frac{|Q|}{|W|} \qquad (1)$$

Here, $|W|$ is the number of given XML path expressions and $|Q|$ is the number of those XML path expressions that contain p.

The adaptive path index management method according to the present invention identifies the required paths defined in Definition 2 below and uses them to manage the path index.

Definition 2

A path p in the XML graph is called a required path when its support sup(p) is greater than a given user-specified minimum support, minSup, or when the length of p is 1. In particular, a path whose support is greater than the user-specified minimum support minSup is called a frequently used path.
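As an illustration of Equation 1 and Definition 2, the sketch below computes sup(p) over a small workload and tests whether a path is required. The workload contents and the function names `contains`, `support`, and `is_required` are assumptions made for this example only; they do not come from the patent.

```python
from typing import List


def contains(query: List[str], p: List[str]) -> bool:
    """True when path p occurs as a contiguous subpath of the query path."""
    n = len(p)
    return any(query[i:i + n] == p for i in range(len(query) - n + 1))


def support(p: List[str], workload: List[List[str]]) -> float:
    """sup(p) = |Q| / |W|, where Q is the set of workload paths containing p."""
    q = sum(1 for query in workload if contains(query, p))
    return q / len(workload)


def is_required(p: List[str], workload: List[List[str]], min_sup: float) -> bool:
    """Definition 2: p is required if its length is 1 or sup(p) > minSup."""
    return len(p) == 1 or support(p, workload) > min_sup


# Hypothetical workload of four single path expressions.
WORKLOAD = [q.split(".") for q in
            ["actor.name", "movie.title", "actor.name", "director.name"]]
```

With this workload, `support(["actor", "name"], WORKLOAD)` is 2/4 and `support(["name"], WORKLOAD)` is 3/4, so with minSup = 0.4 the path actor.name is required, whereas movie.title (support 1/4) is not.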

The edge set T(p) of an arbitrary path p is defined as follows.

Definition 3

For $p = l_i.\ldots.l_{j-1}.l_j$, $T(p) = \{\langle o_{j-1}, o_j \rangle \mid o_{i-1}.l_i.o_i.\ldots.l_{j-1}.o_{j-1}.l_j.o_j\}$

Here, for each label $l_k$ appearing in p, $o_{k-1}$ and $o_k$ are nodes in the XML graph, and the XML graph contains an edge labeled $l_k$ from $o_{k-1}$ to $o_k$.

It follows from Definition 3 that, for arbitrary paths p and q, if p is a suffix of q then $T(p) \supseteq T(q)$. For example, in the graph of FIG. 3 the path name is a suffix of actor.name. Here T(name) = {<2,3>, <4,5>, <7,11>, <12,13>} and T(actor.name) = {<2,3>, <4,5>}, so $T(\text{name}) \supseteq T(\text{actor.name})$ holds. Maintaining the edge set T(p) for every path in the set of required paths would impose a heavy storage burden. The present invention therefore defines and maintains the extent E(p) of a path p as follows.
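The edge set of Definition 3 can be computed by following the labels of p one at a time. The sketch below is only an illustration: FIG. 3 is not reproduced in the text, so the edge list is a hypothetical graph in which the four name edges match the values quoted above, while node 1 and the "actor" and "director" edges leading to nodes 2, 4, 7, and 12 are assumptions.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, str, int]  # (source node, label, target node)

# Hypothetical XML graph fragment; only the 'name' edges are taken from the text.
EDGES: List[Edge] = [
    (1, "actor", 2), (2, "name", 3),
    (1, "actor", 4), (4, "name", 5),
    (1, "director", 7), (7, "name", 11),
    (1, "director", 12), (12, "name", 13),
]


def edge_set(path: str, edges: List[Edge]) -> Set[Tuple[int, int]]:
    """Return T(p): the final edges <o_{j-1}, o_j> of every label path matching p."""
    labels = path.split(".")
    # Edges that match the first label of p.
    current = {(s, t) for (s, label, t) in edges if label == labels[0]}
    for next_label in labels[1:]:
        # Extend each partial match by one edge carrying the next label.
        reached = {t for (_, t) in current}
        current = {(s, t) for (s, label, t) in edges
                   if label == next_label and s in reached}
    return current
```

On this graph, `edge_set("name", EDGES)` yields the four name edges and `edge_set("actor.name", EDGES)` yields {(2, 3), (4, 5)}, which is a subset of the former, as the suffix property predicts.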

Definition 4

Let $Q_{XML}$ be the set of paths in the XML graph that start from the root, and let R be the set of required paths. For a path p in R, let $Q_G(p) = \{l \mid l \in Q_{XML} \text{ and } p \text{ is a suffix of } l\}$ and $Q_A(p) = \{l \mid l \in Q_{XML} \text{ and every path } q\ (\neq p) \in R \text{ having } p \text{ as a suffix is a suffix of } l\}$. With $Q(p) = Q_G(p) - Q_A(p)$, the extent is defined as $E(p) = \bigcup_{r \in Q(p)} T(r)$.

In Definition 4, if q ranges over the paths in R that have p as a suffix, it can be seen that $T(p) = \bigcup_{\forall q} E(q)$.

For example, for the XML graph of FIG. 3, when actor.name is a frequently used path, E(name) = {<7,11>, <12,13>} and E(actor.name) = {<2,3>, <4,5>}, and it can be seen that $T(\text{name}) = E(\text{name}) \cup E(\text{actor.name})$.
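The worked example can be reproduced with set arithmetic. The sketch below does not implement Definition 4 literally; as a simplifying assumption it derives E(p) by removing from T(p) the edges already covered by longer required paths that have p as a suffix, which agrees with the example above. The function names and the dictionary `T` are hypothetical, with the edge sets copied from the text.

```python
from typing import Dict, List, Set, Tuple

EdgePair = Tuple[int, int]

# Edge sets quoted in the text for the graph of FIG. 3.
T: Dict[str, Set[EdgePair]] = {
    "name": {(2, 3), (4, 5), (7, 11), (12, 13)},
    "actor.name": {(2, 3), (4, 5)},
}


def is_proper_suffix(p: List[str], q: List[str]) -> bool:
    """True when p is a suffix of q and strictly shorter than q."""
    return len(p) < len(q) and q[-len(p):] == p


def extents(required: List[str],
            t: Dict[str, Set[EdgePair]]) -> Dict[str, Set[EdgePair]]:
    """Simplified extent: T(p) minus the edges of longer required paths ending in p."""
    result = {}
    for p in required:
        covered: Set[EdgePair] = set()
        for q in required:
            if is_proper_suffix(p.split("."), q.split(".")):
                covered |= t[q]
        result[p] = t[p] - covered
    return result


E = extents(["name", "actor.name"], T)
```

Here `E["name"]` is {(7, 11), (12, 13)} and `E["actor.name"]` is {(2, 3), (4, 5)}, and their union equals T(name). Each edge is therefore stored in exactly one extent, which is the storage saving the extents are meant to provide.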

FIG. 4 shows the adaptive path index for FIG. 3 when the required paths are all paths of length 1 together with direct.movie, @movie.movie, and actor.name. This is the graph-structure part of the index; a hash tree may additionally be provided for more efficient lookup.

The hash tree maintains information about the required paths and associates each node of the graph structure with a required path. Each node of the hash tree holds a hash table, and each entry of the hash table consists of five fields, as shown in detail in FIG. 5: label, count, new, xnode, and next. The label field is the key of the entry. The count field represents the usage frequency, that is, the support, of the path represented by the entry. The new field indicates whether or not the entry has been newly created. The xnode field points to a node in the graph structure of the adaptive path index. Finally, the next field points to another (next) node of the hash tree.
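The five-field entry described above can be sketched as a pair of Python dataclasses. This is an illustrative layout only: the class names `XNode`, `HashEntry`, and `HashNode` and the helper `get_or_create` are assumptions, and FIG. 5 may organize the table differently.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class XNode:
    """A node of the index graph structure, holding the extent of one required path."""
    extent: Set[Tuple[Optional[int], int]] = field(default_factory=set)


@dataclass
class HashEntry:
    label: str                         # key of the entry
    count: int = 0                     # usage frequency (support) of the path
    new: bool = False                  # whether the entry was newly created
    xnode: Optional[XNode] = None      # node in the index graph structure
    next: Optional["HashNode"] = None  # child node of the hash tree


@dataclass
class HashNode:
    """One hash-tree node: a hash table keyed by label."""
    entries: Dict[str, HashEntry] = field(default_factory=dict)

    def get_or_create(self, label: str) -> HashEntry:
        """Return the entry for label, creating it with new=True if it is absent."""
        if label not in self.entries:
            self.entries[label] = HashEntry(label=label, new=True)
        return self.entries[label]
```

A Python dict stands in for the hash table here, so the label doubles as the dictionary key and as the entry's own `label` field.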

The graph structure provides structural summary information for the given XML graph. Each node of the graph structure corresponds one-to-one with an entry of the hash tree, and each node of the graph structure holds the extent of the required path represented by its hash tree entry. In addition, for any two nodes x and y of the graph structure, an edge $\langle s_x, t_x \rangle$ in the extent of x, and an edge $\langle s_y, t_y \rangle$ in the extent of y, if $t_x = s_y$ then an edge from node x to node y is placed in the graph structure. The label of that edge is the label under which the edges belonging to y are grouped. For the root node of the XML graph in particular, one node is created in the initial adaptive path index, and a virtual edge <null, root node> is stored as the extent of that node. For convenience, that node is denoted xroot. Furthermore, for an arbitrary path p, when a path q having p as a suffix is a frequently used path and E(p) is not empty, a remainder entry is created for this case, and a corresponding node is created to hold E(p). For example, in FIG. 4, E(name) is stored in node &12, which is registered in the remainder entry of the child node under the name entry of the hash tree's root node.

In FIG. 4, information about the required path direct.movie can be found by looking up the hash tree in reverse order of the path's labels. For example, information about direct.movie is retrieved by looking up direct in the child node pointed to by the movie entry of the hash tree's root node.
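The reverse lookup can be sketched with nested dictionaries standing in for hash-tree nodes. The tree below is a hypothetical fragment loosely modeled on the description of FIG. 4: the identifier "&12" for E(name) comes from the text, whereas the other xnode identifiers (such as "x_direct_movie") and the function name `lookup` are invented for this example.

```python
from typing import Any, Dict, Optional

Entry = Dict[str, Any]       # {"xnode": identifier or None, "next": child table or None}
Table = Dict[str, Entry]     # one hash-tree node: label -> entry


def leaf(xnode: str) -> Entry:
    """An entry with an index-graph node and no child table."""
    return {"xnode": xnode, "next": None}


ROOT: Table = {
    "actor": leaf("x_actor"),
    "movie": {"xnode": None, "next": {
        "direct": leaf("x_direct_movie"),
        "@movie": leaf("x_atmovie_movie"),
        "remainder": leaf("x_movie_rest"),
    }},
    "name": {"xnode": None, "next": {
        "actor": leaf("x_actor_name"),
        "remainder": leaf("&12"),
    }},
}


def lookup(root: Table, path: str) -> Optional[Entry]:
    """Walk the hash tree from the last label of the path towards the first."""
    table: Optional[Table] = root
    entry: Optional[Entry] = None
    for label in reversed(path.split(".")):
        if table is None or label not in table:
            return None  # the path is not indexed as a required path in this sketch
        entry = table[label]
        table = entry["next"]
    return entry
```

`lookup(ROOT, "direct.movie")` first finds the movie entry in the root table and then the direct entry in its child table. A lookup of name alone stops at an entry that has a child table, whose remainder entry refers to node &12.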

FIG. 6 shows the algorithm that generates APEX⁰, the initial adaptive path index, and FIG. 7 shows APEX⁰ for FIG. 3. As described above, the adaptive path index consists of a hash tree and a graph structure, and APEX⁰ accordingly has this structure as well.

The APEX⁰ generation algorithm of FIG. 6 essentially traverses the nodes of the XML graph in depth-first order, groups the edges by label to form extents, creates a node in the graph structure for each extent, and registers it in the hash tree. That is, a node is first created in the initial adaptive path index for each set of edges, and this node and its label are registered in the hash tree. After APEX⁰ has been generated, information on frequently used paths is extracted from the query workload and the adaptive path index is updated continually. In this way, the XML queries currently in use can be supported more efficiently.
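The grouping step can be sketched as follows. This is not the algorithm of FIG. 6, which is not reproduced in the text; it is a minimal illustration that performs an iterative depth-first traversal, groups edges by label into extents, and registers one hash-tree entry per label. The edge list, the function name `build_apex0`, and the use of the label itself as the xnode identifier are assumptions, and the edges between index-graph nodes are omitted.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[int, str, int]  # (source node, label, target node)

# Hypothetical XML graph rooted at node 0.
EDGES: List[Edge] = [
    (0, "MovieDB", 1),
    (1, "actor", 2), (2, "name", 3),
    (1, "actor", 4), (4, "name", 5),
]


def build_apex0(edges: List[Edge], root: int):
    """Group edges by label during a depth-first traversal starting at root."""
    outgoing = defaultdict(list)
    for source, label, target in edges:
        outgoing[source].append((label, target))

    extents: Dict[str, Set[Tuple[Optional[int], int]]] = defaultdict(set)
    extents["xroot"].add((None, root))  # virtual edge <null, root node>

    visited: Set[int] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in visited:
            continue  # IDREF edges can introduce cycles, so visit each node once
        visited.add(node)
        for label, target in outgoing[node]:
            extents[label].add((node, target))
            stack.append(target)

    # One hash-tree root entry per label, each pointing at its index-graph node.
    hash_tree = {label: {"count": 0, "new": False, "xnode": label, "next": None}
                 for label in extents if label != "xroot"}
    return dict(extents), hash_tree


EXTENTS, HASH_TREE = build_apex0(EDGES, 0)
```

Since every required path in APEX⁰ has length 1, the resulting hash tree has a single level and each extent is simply the set of all edges carrying one label.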

The frequently used path extraction module is described next.

Sequential pattern mining techniques that use the anti-monotonicity property in the pruning step [reference: M. N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential pattern mining with regular expression constraints. VLDB 1999] could be applied to the extraction of frequently used paths. However, the sequences targeted by existing sequential pattern mining techniques have different properties from paths, so modification is needed. For example, if the sequence (A, B, C) occurs frequently, then its subsequences (A, B), (B, C), and (A, C) also occur frequently. However, the fact that the path A.B.C is frequently used does not imply that A.C is frequently used, because A.C is not a subpath of A.B.C.
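The difference between subsequences and subpaths can be made concrete with two small predicates. This is an illustrative sketch with hypothetical function names, not code from the patent.

```python
from typing import List


def is_subsequence(a: List[str], b: List[str]) -> bool:
    """True when the items of a occur in b in order, with gaps allowed."""
    remaining = iter(b)
    # Each membership test consumes the iterator up to the first match.
    return all(item in remaining for item in a)


def is_subpath(a: List[str], b: List[str]) -> bool:
    """True when the labels of a occur in b contiguously and in order."""
    n = len(a)
    return any(b[i:i + n] == a for i in range(len(b) - n + 1))


ABC = ["A", "B", "C"]
```

`is_subsequence(["A", "C"], ABC)` holds but `is_subpath(["A", "C"], ABC)` does not, which is why support counted for A.B.C carries over to A.B and B.C but not to A.C.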

Therefore, the present invention provides a new frequently used path extraction algorithm, shown in FIG. 8. As shown in FIG. 8, the algorithm consists of a support-counting part and a pruning part. The hash tree is used to track changes in the usage pattern of XML queries; once the extraction of frequently used paths has finished, the hash tree holds information only on the required paths among the given XML path expressions.

First, the support-counting part uses the hash tree, a key data structure of the adaptive path index, to compute the frequency of use (support) of every path that appears in the set of path expressions, from which the frequently used paths are then extracted. To count how often the current XML path expressions are used, the count and new fields of every entry in the hash tree are first reset to 0 and false, respectively. The frequency of use of every path that appears in the given set of XML path expressions, $Q_{workload}$, is then counted. During this step, whenever a new entry is created in a hash table, its new field is set to true.
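The counting step can be sketched as below. For brevity the sketch keeps the entries in a flat dictionary keyed by label tuples instead of a hash tree, and it counts each contiguous subpath at most once per query; both choices are assumptions of this example, not details taken from FIG. 8.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    count: int = 0     # how many queries of the workload use the path
    new: bool = False  # True if the entry was created during this pass

def count_support(table, workload):
    """Reset all entries, then count every subpath used in the workload."""
    for entry in table.values():
        entry.count, entry.new = 0, False
    for query in workload:
        labels = tuple(query.split("."))
        subpaths = {labels[i:j]
                    for i in range(len(labels))
                    for j in range(i + 1, len(labels) + 1)}
        for path in subpaths:
            if path not in table:
                table[path] = Entry(new=True)
            table[path].count += 1
    return table

# Hypothetical state left over from an earlier workload.
table = {("actor",): Entry(count=7)}
count_support(table, ["actor.name", "movie.title", "name"])
```

After the call, the stale count of `actor` has been replaced by its count in the current workload, and entries such as `actor.name` are flagged as new.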

The pruning step removes every entry whose support is less than the user-defined minimum support, minSup. As in sequential pattern mining, it exploits the anti-monotonicity property, namely that if sup(p) of a path p is less than minSup, then the support of every path that contains p is also less than minSup, to remove information on infrequently used paths from the hash tree more efficiently.
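A minimal sketch of the pruning rule follows. It assumes that support is the path count divided by the number of queries, and that paths of length 1 are always kept because the claims define them as required paths; how FIG. 8 combines these two rules is not shown in this section, so treat the combination as an assumption.

```python
def prune(counts, n_queries, min_sup):
    """Keep length-1 paths and paths whose support reaches min_sup."""
    kept = {}
    for path, count in counts.items():
        if len(path) == 1 or count / n_queries >= min_sup:
            kept[path] = count
    return kept

# Hypothetical counts over a workload of 10 queries. A path never has a
# higher count than any of its subpaths, which is the anti-monotonicity
# property that lets a real implementation skip whole subtrees.
counts = {
    ("actor",): 6, ("name",): 7, ("actor", "name"): 5,
    ("movie",): 1, ("title",): 1, ("movie", "title"): 1,
}
kept = prune(counts, n_queries=10, min_sup=0.3)
```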

Once information on the frequently used paths has been extracted from the given path expressions, it must be used to update the current adaptive path index. One option is to rebuild the adaptive path index from scratch by exhaustively traversing the XML graph again. Because that takes considerable time and cost, the present invention reduces the index update cost with the update algorithm of FIG. 9. The algorithm traverses the graph structure of the adaptive path index, checks whether each node is still valid with respect to the hash tree, removes the invalid nodes, and creates nodes for the newly created entries. It then builds the extent of each new node.

XML queries are processed with the adaptive path index according to the present invention as follows. When an arbitrary path $p^i_n = l_i.\cdots.l_{n-1}.l_n$ is given as a query, the hash tree is searched with $p^i_n$. If the hash tree contains only the path $p^k_n = l_k.\cdots.l_n$ $(i < k \le n)$, the edge set $T(p^k_n)$ is obtained from it, and this step is repeated until information on a subpath $p^i_j = l_i.\cdots.l_j$ $(i \le j < n)$ is found in the hash tree, that is, until $T(p^i_j)$ is obtained. $T(p^i_n)$ is then obtained by joining the resulting edge sets $T(p^i_j), \ldots, T(p^k_n)$.
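The join over edge sets can be illustrated with a simplified decomposition. The sketch below takes the longest indexed prefix of the query and then joins single-label edge sets onto it, because that join is easy to verify by hand; the procedure described above starts from the longest suffix instead, and its join for longer pieces is not spelled out in this section. The `index` dictionary and all node ids are hypothetical.

```python
def join(left, right):
    """Edges of right whose parent is the end node of some edge in left."""
    ends = {child for _, child in left}
    return {(parent, child) for parent, child in right if parent in ends}

def evaluate(index, query):
    """Return the edge set of query; edges are (parent, child) pairs."""
    labels = tuple(query.split("."))
    j = len(labels)
    while j > 1 and labels[:j] not in index:   # longest indexed prefix
        j -= 1
    result = index.get(labels[:j], set())
    for label in labels[j:]:                   # extend one label at a time
        result = join(result, index.get((label,), set()))
    return result

index = {
    ("movie",): {(1, 3)},
    ("actor",): {(1, 2), (3, 6)},
    ("name",): {(2, 4), (6, 7), (8, 9)},
    ("movie", "actor"): {(3, 6)},              # a frequently used path
}
movie_actor_names = evaluate(index, "movie.actor.name")
actor_names = evaluate(index, "actor.name")
```

Here `movie.actor.name` is answered from the indexed path `movie.actor` with a single join, whereas `actor.name` needs a join between two single-label edge sets.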

A complex path query such as $l_i.*.l_j$ can be answered by first applying query pruning and rewriting techniques [M. F. Fernandez and D. Suciu. Optimizing regular path expressions using graph schemas. ICDE 1998] to convert it into a set of paths of the form $l_i.l_{i+1}.\cdots.l_j$. A mixed value-path query such as "among the nodes that satisfy a given path expression, extract those that satisfy a given condition" can be answered in several ways: for example, by using the adaptive path index to extract the nodes that satisfy the path and then checking the condition on each of them, or by computing the set of nodes in the XML data that satisfy the condition and the set of nodes that satisfy the path expression and then intersecting the two sets.
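The two strategies for value-path queries give the same answer, as the following sketch shows. The node ids, the `values` mapping from node id to text content, and the function names are hypothetical; `path_nodes` stands for the nodes already obtained from the adaptive path index.

```python
def filter_then_check(path_nodes, values, wanted):
    """Strategy 1: take the nodes matching the path, then test the value."""
    return {node for node in path_nodes if values.get(node) == wanted}

def intersect(path_nodes, values, wanted):
    """Strategy 2: find the nodes with the value, then intersect the sets."""
    value_nodes = {node for node, text in values.items() if text == wanted}
    return path_nodes & value_nodes

path_nodes = {4, 7}                       # nodes reached by actor.name
values = {4: "Kevin", 7: "Jack", 9: "Kevin"}
by_check = filter_then_check(path_nodes, values, "Kevin")
by_intersection = intersect(path_nodes, values, "Kevin")
```

Which strategy is cheaper depends on how selective the path and the value condition are; the section leaves that choice open.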

If $p^i_n$ is a required path, the hash tree contains paths $q$ that have $p^i_n$ as a suffix, and the result is obtained from the extents $E(q)$ of the graph-structure nodes pointed to by the node pointers of the entries corresponding to each $q$: $T(p^i_n) = \bigcup_{\forall q} E(q)$.
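The union over extents can be sketched as follows, assuming a flat dictionary `extents` that maps each required path (a tuple of labels) to the extent of its index node. The data are hypothetical: the extent of `name` holds only the edges not already covered by the longer required path `actor.name`.

```python
def target_edges(extents, path):
    """Union of E(q) over every required path q that has path as a suffix."""
    p = tuple(path.split("."))
    result = set()
    for q, edges in extents.items():
        if len(q) >= len(p) and q[len(q) - len(p):] == p:
            result |= edges
    return result

extents = {
    ("name",): {(8, 9)},
    ("actor", "name"): {(2, 4), (6, 7)},
    ("actor",): {(1, 2), (3, 6)},
}
all_names = target_edges(extents, "name")
actor_names_only = target_edges(extents, "actor.name")
```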

To measure the efficiency of the present invention, experiments were run with several types of XML queries on several XML data sets. The experimental data consisted of real XML data and data generated from real DTDs, which define the structure of XML data, giving the following three kinds of XML data.

(1) Play: XML documents produced by converting Shakespeare's plays to XML [source: http://www.oasis-open.org/cover/xml.html].

(2) FlixML: XML documents used for annotated B-movie guides, with a moderately complex structure [source: http://www.xml.com].

(3) GedML: XML documents that represent genealogical data, with a very complex structure [source: http://www.oasis-open.org/cover/xml.html, 2001].

To measure query performance on XML data of various sizes, data sets with different numbers of elements were generated, as shown in Table 1.

Data type    File name           Number of elements
Play         four_tragedy.xml    22791
Play         shakes_11.xml       48818
Play         shakes_all.xml      179691
FlixML       Flix01.xml          14734
FlixML       Flix02.xml          41691
FlixML       Flix03.xml          335401
GedML        Ged01.xml           8259
GedML        Ged02.xml           36228
GedML        Ged03.xml           381046

The queries consisted of 5000 single path expressions (single label path expressions), which are sequences of labels of the form $l_i.l_{i+1}.\cdots.l_n$; 500 complex path expressions of the form $l_i.*.l_j$; and 1000 value-path expressions, which combine a path expression with a value search. In XQuery, the XML query standard, a single path expression is written //l_i/l_{i+1}/…/l_n, a complex path expression is written //l_i//l_j, and a value-path expression is written //l_i/l_{i+1}/…/l_j[text()=value]. The present invention was compared with two existing path indexes, DataGuide (SDG) and Index Fabric, and with the initial adaptive path index (APEX⁰). Because the shape of the adaptive path index depends on the user-defined minimum support minSup, the experiments were run with minSup varied from 0.002 to 0.05, and 20% of the single path expressions were used as the query workload for extracting frequently used paths.
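The three query forms can be generated mechanically from label lists. The helper names below are hypothetical, and quoting the value as a string literal is an assumption of this sketch; the text above writes the predicate only as [text()=value].

```python
def single_path(labels):
    """//l_i/l_{i+1}/.../l_n"""
    return "//" + "/".join(labels)

def complex_path(first, last):
    """//l_i//l_j, where any labels may occur in between."""
    return f"//{first}//{last}"

def value_path(labels, value):
    """Single path expression followed by a test on the text content."""
    return f'{single_path(labels)}[text()="{value}"]'

print(single_path(["PLAY", "ACT", "SCENE"]))
print(complex_path("PLAY", "SPEAKER"))
print(value_path(["SPEECH", "SPEAKER"], "HAMLET"))
```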

FIGS. 10, 11, and 12 show the processing times for the 5000 single path expressions, and FIGS. 13 and 14 show the processing times for the complex path expressions and the value-path expressions, respectively, on shakes_11.xml, Flix02.xml, and Ged02.xml. In FIG. 12, the SDG time for Ged03.xml is 125000 seconds, and in FIG. 13, the APEX⁰ time for Ged02.xml is about 210000 seconds.

The results show that the more complex the structure of the XML data, the greater the performance advantage of the adaptive path index. DataGuide and Index Fabric maintain every path that starts from the root of the XML data, so as the structure grows more complex, the number of such paths grows, and with it the amount of information that must be searched during query processing. The adaptive path index, in contrast, keeps information only on frequently used paths and can therefore process queries more efficiently.

The user-defined minimum support minSup also affects query processing performance. The smaller minSup is, the more paths count as frequently used. However, as FIGS. 10 to 12 show, a larger number of frequently used paths does not always improve query processing performance. As noted above, the more complex the structure of the path index becomes, the larger the area that must be searched, which can degrade performance. In these experiments, the adaptive path index built with a minSup of 0.005 performed best on average.

From the above description, those skilled in the art will appreciate that various changes and modifications can be made without departing from the technical spirit of the present invention.

As described above, according to the present invention, XML query results can be found more efficiently when searching XML data, because the adaptive path index exploits frequently used paths. The present invention also provides a way to reduce the burden of updating the index when the pattern of XML queries changes. The present invention is therefore expected to contribute substantially to XML data applications such as e-commerce and Internet document retrieval.

Claims

In an adaptive path index management module stored in a web application server that stores and manages XML data, the module comprising: an initialization module (11) that is used when an adaptive path index is first created, without information on previously used queries, and that generates an initial adaptive path index (APEX⁰) (12), which is the initial index of the adaptive path index; a frequently used path extraction module (13) that, in order to generate a more efficient path index (16) from frequently used paths by using a query workload (14), which is the set of queries issued against the current path index, extracts from the query workload, which is a set of path expressions of XML queries, the frequently used paths whose support is at least a user-defined minimum support; and an update module (15) that updates the initial adaptive path index by an adaptive path index update algorithm so that it includes only the frequently used paths, a method of executing an XML query that queries the XML data by using a path index, the method comprising:

A first step of representing the XML data as a label-edge graph, which is a graph-type data structure in which paths are formed by sequences of labels and are used as path expressions of XML queries, and displaying the resulting XML graph on a display of a computer;

A second step of generating and updating the adaptive path index (APEX⁰) (12) from the XML graph and from the frequently used paths extracted from XML queries previously issued from a browser to the web application server that stores and manages the XML data; and

A third step of processing an XML query using the adaptive path index (APEX⁰) (12) and displaying the result on the display of the computer.

The method of claim 1,

In the first step, the XML graph is generated by representing the relationship between an ID-type attribute and an IDREF-type attribute of the XML data as an edge from the IDREF-type node to the node that represents the element having the corresponding ID-type attribute, the edge carrying the label of that element, and by marking the label of an edge that points to an IDREF-type node so that it is distinguished from the labels of other edges.

The method of claim 1,

In the second step, the initialization module (11) of the adaptive path index management module generates an initial adaptive path index (APEX⁰) (12) that represents the structural summary of the XML graph; the frequently used path extraction module (13) extracts the frequently used paths whose support is at least the user-defined minimum support from the query workload, which is the set of path expressions of XML queries previously issued from a browser over the Internet to the web application server that stores and manages the XML data; the update module (15) updates the initial adaptive path index so that it includes only the frequently used paths, thereby generating the adaptive path index; and the adaptive path index is continuously updated to reflect changes in the query workload.

The method of claim 2,

In the second step, the initialization module (11) of the adaptive path index management module generates an initial adaptive path index that represents the structural summary of the XML graph; the frequently used path extraction module (13) extracts the frequently used paths whose support is at least the user-defined minimum support from the query workload (14), which is the set of path expressions of XML queries previously issued from a browser over the Internet to the web application server that stores and manages the XML data; the update module (15) updates the initial adaptive path index so that it includes only the frequently used paths, thereby generating the adaptive path index; and the adaptive path index is updated to reflect changes in the query workload.

The method according to any one of claims 1 to 4,

The adaptive path index (APEX⁰) (12) has a graph structure whose nodes hold extents instead of the full edge sets of the required paths.

Here, let $Q_{XML}$ be the set of paths of the XML graph that start from the root, and let $R$ be the set of required paths, where a required path is a path of length 1 or a path whose support is at least the user-defined minimum support. The extent of a path $p$ is defined as $E(p) = \bigcup_{r \in Q(p)} T(r)$, where $T(r)$ is the edge set of path $r$ and $Q(p) = Q_G(p) - Q_A(p)$, with, for a path $p$ in $R$, $Q_G(p) = \{\, l \mid l \in Q_{XML} \text{ and } p \text{ is a suffix of } l \,\}$ and $Q_A(p) = \{\, l \mid l \in Q_{XML} \text{ and every path } q\,(\neq p) \in R \text{ having } p \text{ as a suffix is a suffix of } l \,\}$.
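As a worked illustration of these definitions, the sketch below computes $E(p)$ for a small hypothetical data set. It reads the condition in $Q_A(p)$ as "$l$ has as a suffix some path $q \neq p$ in $R$ that itself has $p$ as a suffix"; this reading is an assumption of the example, chosen because a literal universal reading would make $Q_A(p)$ equal to $Q_{XML}$ whenever no such $q$ exists. The dictionary `T` and all node ids are hypothetical.

```python
def is_suffix(p, l):
    """True if label path p is a suffix of label path l."""
    return len(p) <= len(l) and l[len(l) - len(p):] == p

def extent(p, R, T):
    """E(p) as the union of T(r) over Q(p) = Q_G(p) - Q_A(p).

    T maps every root label path (an element of Q_XML) to its edge set.
    """
    q_xml = set(T)
    q_g = {l for l in q_xml if is_suffix(p, l)}
    longer = [q for q in R if q != p and is_suffix(p, q)]
    # Assumed reading: l is already covered by some longer required path.
    q_a = {l for l in q_xml if any(is_suffix(q, l) for q in longer)}
    edges = set()
    for r in q_g - q_a:
        edges |= T[r]
    return edges

T = {
    ("MovieDB",): {(0, 1)},
    ("MovieDB", "actor"): {(1, 2)},
    ("MovieDB", "actor", "name"): {(2, 4)},
    ("MovieDB", "movie"): {(1, 3)},
    ("MovieDB", "movie", "actor"): {(3, 6)},
    ("MovieDB", "movie", "actor", "name"): {(6, 7)},
    ("MovieDB", "movie", "director"): {(3, 8)},
    ("MovieDB", "movie", "director", "name"): {(8, 9)},
}
R = {("MovieDB",), ("actor",), ("movie",), ("director",), ("name",),
     ("actor", "name")}
e_name = extent(("name",), R, T)
e_actor_name = extent(("actor", "name"), R, T)
```

Under this reading the extent of `name` keeps only the edge reached through `director`, because the edges reached through `actor` already belong to the extent of the longer required path `actor.name`.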

The method of claim 5,

The adaptive path index further includes a hash tree that maintains information on the required paths; each node of the hash tree has a hash table; and the entry structure of each hash table stores a label field, a count field indicating at least the frequency of use of the required paths, a new field indicating whether the node is new, an xnode field pointing to a node of the graph structure, and a next field pointing to the next node, the fields being continuously updated to reflect changes in the query workload.

The method of claim 6,

The adaptive path index is updated by traversing its graph structure and looking up the hash tree entry for each node, so that nodes that are no longer frequently used are removed from the graph structure and new nodes are added to it.