KR20140076010A

KR20140076010A - A system for simultaneous and parallel processing of many twig pattern queries for massive XML data and method thereof

Info

Publication number: KR20140076010A
Application number: KR1020120144057A
Authority: KR
Inventors: 이윤준; 최혜봉; 이경하; 김수형
Original assignee: 한국과학기술원
Priority date: 2012-12-12
Filing date: 2012-12-12
Publication date: 2014-06-20
Also published as: KR101450239B1

Abstract

The present invention relates to a system and method for processing multiple twig pattern queries on massive XML data simultaneously and in parallel. The system includes: an input unit that receives twig pattern queries and a massive XML file from users; a preprocessing and data loading unit that decomposes the twig pattern queries received through the input unit into linear path patterns, creates a query index from those linear path patterns, creates multiple XML data blocks from the massive XML file, and loads the XML data blocks into a distributed file system; a first MapReduce job unit that receives the XML data blocks, obtains the solutions to the linear path patterns, and calculates the size of those solutions; and a second MapReduce job unit that performs twig pattern join operations on the solutions to the linear path patterns and outputs the final results.

Description

A system and method for simultaneous parallel processing of multiple twig pattern queries on large-scale XML data

The present invention relates to a query processing system for XML data and a method thereof. More specifically, it relates to a method for processing multiple twig pattern queries simultaneously, in a distributed and parallel manner, over very large XML data that is difficult to process within a reasonable time on a single node. It also relates to a multi-query optimization method for efficient distributed/parallel processing of multiple twig pattern queries, and to a system and method for partitioning XML documents.

XML (eXtensible Markup Language) is a data storage format with a semi-structured data model; it uses tags to provide not only content but also the structure of the data. Data described in XML expresses relationships through the containment of tags. For example, when XML is used to represent book information, the data may contain an element named "book" whose child elements are "title", "author", "publisher" and "year", with "author" in turn containing child elements named "last name" and "first name". The items needed to describe a book can be structured in this way. To query XML data with these relationships taken into account, an XML query language may include structural information. In the example above, the relationships between elements can be expressed as path patterns such as "/book/author" or "/book//publisher". Extracting information that has a desired relationship using such a path expression is called a linear path query. Several linear path queries can also be combined to query a more complex pattern. For example, a query that lists the titles of the books published by Minumsa combines the conditional path expression /book[./publisher="Minumsa"] with the path expression /book/title and is written as /book[./publisher="Minumsa"]/title. Because a query of this type has a relationship structure that branches like a twig, it is called a twig pattern query. The twig pattern is the most frequent form of query pattern for XML data. Twig pattern query processing is generally performed in three steps: 1) decompose the twig pattern into linear path patterns, 2) process each linear path pattern query to obtain its results, and 3) join the results of the linear path pattern queries.
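The difference between a linear path query and a twig pattern query can be tried out with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The following is only an illustrative sketch: the sample data, the English tag names, and the `<catalog>` wrapper element are assumptions made for this example and do not come from the patent.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the <catalog> wrapper exists only so that
# several <book> elements can share one root.
doc = """
<catalog>
  <book>
    <title>Book One</title>
    <author><last>Kim</last><first>Minsu</first></author>
    <publisher>Minumsa</publisher>
    <year>2010</year>
  </book>
  <book>
    <title>Book Two</title>
    <author><last>Lee</last><first>Jia</first></author>
    <publisher>Other Press</publisher>
    <year>2011</year>
  </book>
</catalog>
"""
root = ET.fromstring(doc)

# Linear path query: a single path with no branching.
authors = root.findall("book/author")

# Twig pattern query: the predicate branch (publisher) and the output
# branch (title) both hang off the same <book> node.
titles = [t.text for t in root.findall("book[publisher='Minumsa']/title")]
print(titles)
```

An in-memory evaluator like this one is exactly what stops scaling for a single file of 100 GB or more, which is the case the invention targets.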

The holistic twig pattern join is a join method designed to process twig pattern queries in an I/O-optimal way. When the results of the linear path pattern queries are computed, this method excludes in advance those path results that cannot actually be joined into a final result. It can also produce the final result while reading the list of each element sequentially, only once, and is therefore optimized for input/output. In the present invention, this holistic twig pattern join technique is used to process each twig pattern query.

XML stream processing is the real-time processing of a continuous sequence of small XML documents (a few KB to tens of KB each). Stream processing has the following characteristics: because data is generated continuously from many data sources, the number of data items is unbounded; the queries can be prepared in advance; and query processing on newly generated stream data must take place in real time.

YFilter is a technique published in Y. Diao et al., "Path sharing and predicate evaluation for high-performance XML filtering," ACM Transactions on Database Systems, 28(4): 467-516, 2003. To process XML stream data, it converts multiple linear path patterns into a nondeterministic finite automaton (NFA) so that partial paths common to several patterns are shared, and filters with that automaton. The technique targets a stream environment in which small XML documents arrive as continuous input, and it guarantees fast, memory-based filtering. However, it processes multiple XML path queries over stream data, consisting of consecutively arriving small XML documents of tens to hundreds of KB, in the memory of a single node. It cannot process the large XML documents targeted by the present invention, where a single file is 100 GB or more.

Accordingly, there is a need for an invention that provides a method of partitioning a large XML document into equal-size blocks without losing the document's structural information, a function for running multiple path pattern queries on those blocks, and an efficient way of distributing the query processing work when multiple twig pattern queries are executed on multiple computers.

U.S. Patent No. 7,756,919 (filed June 18, 2004), "Large-scale data processing in a distributed and parallel processing environment," discloses an invention, named MapReduce, for processing data records in parallel. It concerns a programming model that divides an input file into multiple blocks, reads the data in each block as key-value pairs, and processes them with Map and Reduce functions, together with a distributed/parallel processing method that runs the Map and Reduce functions on distributed computers.

MapReduce is a software framework originally devised to improve the performance of web search systems by efficiently analyzing large volumes of web page information. It allows large data stored in a distributed file system to be processed in parallel in a distributed computing environment. In MapReduce, each computing node reads the fixed-size, distributed file fragments as key-value pairs and runs the Map function, which produces intermediate results in the form of new key-value pairs. The intermediate results are then grouped by key and aggregated by the Reduce function to produce the final result. Hadoop is open-source software written according to this paradigm, and its overall system architecture is the same as that of the original MapReduce.
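The map, group-by-key, and reduce flow described above can be sketched in a few lines of single-process Python. This is only a toy model of the programming model, not Hadoop or the patented system; the function names `run_mapreduce`, `count_tags`, and `total`, and the sample blocks, are hypothetical.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to each (key, value) record, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)   # shuffle step: group by key
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

def count_tags(block_id, text):
    # Map: emit (token, 1) for each whitespace-separated token in the block.
    for token in text.split():
        yield token, 1

def total(token, counts):
    # Reduce: aggregate all values that share one key.
    return sum(counts)

# Two input "blocks", each read as a (key, value) pair.
blocks = [(0, "A B C"), (1, "B C C")]
result = run_mapreduce(blocks, count_tags, total)
print(result)
```

In a real cluster the map calls run on the nodes that hold each block and the grouping step moves data over the network, which is why the size of the intermediate results matters for performance.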

However, that patented invention reads data only as key-value pairs, splits files into fixed-size pieces without considering the document's structural information, and was designed only for single-query processing. It is therefore limited when it comes to processing multiple twig pattern queries simultaneously and in parallel over large XML data, which carries tree-model structural information.

Therefore, there is a need for an invention that partitions a large XML file without loss of structural information and processes it in parallel, that greatly reduces I/O cost by sharing the single-path solutions common to several twig patterns, and that raises the efficiency of data analysis by executing many twig pattern queries, which would otherwise have to be run repeatedly, in parallel in a distributed environment as a two-stage MapReduce job.

The present invention provides a MapReduce-based distributed/parallel processing technique for processing multiple twig pattern queries. However, simply using MapReduce as it is to process twig pattern queries over XML raises the following problems.

1) MapReduce splits and stores a file as fixed-size blocks and reads the data in each block as key-value pairs. If an XML file is simply cut into fixed-size fragments, however, the relationships between elements, which are the structural information specific to XML, are lost, and query processing becomes impossible.

2) If each twig pattern query is implemented as one MapReduce job, processing n queries requires n MapReduce jobs. These jobs repeat similar operations and read the same data again and again, which is inefficient. A method is therefore needed that processes many twig pattern queries at once with as few MapReduce jobs as possible.

3) Many twig pattern queries share large parts with one another. For fast processing, a method is needed that shares intermediate results among the queries and so eliminates repeated processing of common patterns.

4) Processing many twig pattern queries means that many twig pattern join operations must be spread over several nodes. The fastest query processing time is obtained when the join operations are distributed over the (fewer) nodes in a balanced way, so that every node has the same amount of work and all nodes finish at the same time. A solution is needed for distributing the twig pattern queries evenly across the nodes.

U.S. Patent No. 7,756,919 (filed June 18, 2004)

Y. Diao et al., "Path sharing and predicate evaluation for high-performance XML filtering," ACM Transactions on Database Systems, 28(4): 467-516, 2003.

The present invention is intended to solve the problems of the prior art described above. Its object is to provide a method for efficiently processing multiple twig pattern queries over large XML data. In particular, XML data of hundreds of gigabytes or more, generated for the analysis of scientific data or for logging in various systems, is produced continuously and updated periodically, and therefore requires fast query processing. The invention provides a system and method for processing many twig pattern queries over such large XML data simultaneously, in a distributed and parallel manner.

As a technical means of achieving this object, a first aspect of the present invention provides a system for simultaneous parallel processing of multiple twig pattern queries on large XML data, comprising: an input unit for receiving twig pattern queries and a large XML file from users; a preprocessing and data loading unit for decomposing the twig pattern queries received by the input unit into linear path patterns, generating a query index from them, partitioning the large XML file into fixed-size blocks so that it can be processed in parallel on multiple computers, thereby creating multiple XML data blocks, and loading these into a distributed file system; a first MapReduce job unit for receiving the partitioned XML data blocks, obtaining the solutions to the linear path patterns in the map phase using the query index, and calculating the size of the solution to each linear path pattern in the reduce phase; and a second MapReduce job unit for distributing the twig pattern join operations across multiple computing nodes at run time so that the workloads are equal, transmitting at run time, in the map phase, the linear path solutions that the join operations assigned to each node take as input, and, in the reduce phase, performing the twig pattern join operations assigned to each reducer on the linear path solutions received from the map phase and outputting the final result (104).

A second aspect of the present invention provides a method for simultaneous parallel processing of multiple twig pattern queries on large XML data, comprising the steps of: receiving twig pattern queries and a large XML file from users; decomposing the received twig pattern queries into linear path patterns; generating a query index from the twig pattern queries decomposed into linear path patterns; partitioning the large XML file into fixed-size blocks so that it can be processed in parallel on multiple computers, thereby creating multiple XML data blocks; loading the XML data blocks into a distributed file system; receiving the partitioned XML data blocks and obtaining the solutions to the linear path patterns in the map phase of the first MapReduce job unit using the query index; calculating the size of the solution to each linear path pattern in the reduce phase of the first MapReduce job unit; distributing the twig pattern join operations across multiple computing nodes at run time so that the workloads are equal; transmitting at run time, in the map phase of the second MapReduce job unit, the linear path solutions that the join operations assigned to each node take as input; and, in the reduce phase of the second MapReduce job unit, performing the twig pattern join operations assigned to each reducer on the linear path solutions received from the map phase and outputting the final result.

The present invention concerns the efficient processing, in a distributed environment, of many queries over large XML data, and in particular over a single large XML file; it allows multiple twig pattern queries to be processed simultaneously in a distributed and parallel manner on top of MapReduce. The inputs and results common to several twig pattern queries are shared during query processing, and path expressions containing double slashes and wildcards are replaced with the unique path patterns that occur in the document. This removes duplicate processing, reduces the total I/O cost, and enables efficient query processing. In addition, a technique that balances the workload of the join operations for twig pattern query processing among the distributed nodes prevents the bottleneck that arises when work is skewed toward one node and degrades the performance of the whole system. This task assignment technique is particularly effective because it is not performed statically before query processing; it operates at run time, after the XML data has been read, by measuring the size of the solution to each path pattern, that is, the size of the intermediate results of query processing.
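The replacement of path expressions that contain `//` and `*` with the concrete paths occurring in the document can be illustrated as follows. This passage does not spell out the procedure, so the regex-based matching below is an assumption made for illustration, and `pattern_to_regex`, `expand`, and the sample path set are hypothetical names and data.

```python
import re

def pattern_to_regex(pattern):
    """Translate a linear path with '//' and '*' into a regex over concrete paths."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("//", i):
            out.append(r"/(?:[^/]+/)*")      # zero or more intermediate steps
            i += 2
        elif pattern[i] == "/":
            out.append("/")
            i += 1
        else:
            j = pattern.find("/", i)
            j = len(pattern) if j == -1 else j
            step = pattern[i:j]
            out.append(r"[^/]+" if step == "*" else re.escape(step))
            i = j
    return re.compile("".join(out))

def expand(pattern, document_paths):
    """Return the distinct document paths that the pattern stands for."""
    rx = pattern_to_regex(pattern)
    return sorted(p for p in document_paths if rx.fullmatch(p))

# Assumed set of distinct root-to-node paths found in some XML document.
paths = {"/A", "/A/B", "/A/B/C", "/A/B/C/D", "/A/X/C", "/A/B/E"}
print(expand("/A//C", paths))
print(expand("//D", paths))
```

Once every pattern has been rewritten into concrete paths, two queries that looked different (for example `/A//C` and `/A/*/C`) may turn out to need the same path solutions, which can then be computed once and shared.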

FIG. 1 is a schematic block diagram of an embodiment of the system for simultaneous parallel processing of multiple twig pattern queries on large XML data according to the present invention.
FIG. 2 is a schematic block diagram of an embodiment of the preprocessing and data loading unit, a main part of the embodiment of the present invention.
FIG. 3 is a schematic diagram of an embodiment of the function that converts the large input XML file into fixed-size blocks without loss of the structural information between XML elements and places the corresponding label files alongside them in the distributed file system.
FIG. 4 is a schematic explanatory diagram of an embodiment of the operation of the first MapReduce job unit.
FIG. 5 is a schematic explanatory diagram of an embodiment of the operation of the second MapReduce job unit.
FIG. 6 is a flowchart illustrating an embodiment of the detailed operation of the multi-query optimization module.
FIG. 7 is an explanatory diagram of the process of converting a linear path pattern that contains double slashes and wildcards into the unique path patterns present in the input XML document.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. FIG. 1 is a schematic block diagram of an embodiment of the system for simultaneous parallel processing of multiple twig pattern queries on large XML data. As shown in FIG. 1, the system comprises: an input unit (100) for receiving twig pattern queries and a large XML file from users; a preprocessing and data loading unit (110) for decomposing the twig pattern queries received by the input unit (100) into linear path patterns, generating a query index (101) from them, partitioning the large XML file into fixed-size blocks so that it can be processed in parallel on multiple computers, thereby creating multiple XML data blocks (102), and loading these into a distributed file system; a first MapReduce job unit (120) for receiving the partitioned XML data blocks (102), obtaining the solutions (103) to the linear path patterns in the map phase using the query index (101), and calculating the size of the solution to each linear path pattern in the reduce phase; and a second MapReduce job unit (130) for distributing the twig pattern join operations across multiple computing nodes at run time so that the workloads are equal, transmitting at run time, in the map phase, the linear path solutions (103) that the join operations assigned to each node take as input, and, in the reduce phase, performing the twig pattern join operations assigned to each reducer on the linear path solutions (103) received from the map phase and outputting the final result (104).
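One way to picture how the solution sizes computed by the first job could drive an even distribution of join operations in the second job is a greedy assignment. The sketch below is an assumption, not the patented allocation algorithm: it estimates the cost of a twig join as the sum of the sizes of its input path solutions and uses a longest-processing-time-first heuristic. The names `assign_joins`, `twig_inputs`, and `path_sizes` and all numbers are hypothetical.

```python
import heapq

def assign_joins(twig_inputs, path_sizes, num_reducers):
    """Greedy longest-processing-time assignment of twig joins to reducers."""
    # Assumed cost model: a join costs the total size of its input solutions.
    cost = {t: sum(path_sizes[p] for p in ps) for t, ps in twig_inputs.items()}
    heap = [(0, r) for r in range(num_reducers)]      # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}
    # Place the most expensive joins first, always on the least loaded reducer.
    for twig in sorted(cost, key=lambda t: (-cost[t], t)):
        load, r = heapq.heappop(heap)
        assignment[r].append(twig)
        heapq.heappush(heap, (load + cost[twig], r))
    return assignment

# Solution sizes as they might be reported by the reduce phase of the first job.
path_sizes = {"P1": 50, "P2": 30, "P3": 20, "P4": 10}
twig_inputs = {
    "T1": ["P1", "P2", "P3"],
    "T2": ["P1", "P4"],
    "T3": ["P2", "P3"],
    "T4": ["P4"],
}
assignment = assign_joins(twig_inputs, path_sizes, 2)
print(assignment)
```

Because the sizes are only known after the first job has scanned the data, an assignment of this kind has to be made at run time rather than when the queries are registered.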

FIG. 2 is a schematic block diagram of an embodiment of the preprocessing and data loading unit. As shown in FIG. 2, the unit comprises: a query partitioning module (200) for splitting each twig pattern query into the linear path patterns that make it up and generating the correspondence information between the linear path patterns and the original twig pattern query; a query index creation module (202) for generating, from the linear path patterns, a query index in the form of a nondeterministic finite automaton; and an XML data partitioning and labeling module (201) for receiving the large XML file, partitioning it into fixed-size XML data blocks without loss of structural information, creating label blocks that record the label values assigned by a region-based numbering scheme to the XML elements and attributes in each data block, and loading both into the distributed file system.

The preprocessing can be divided into two parts: the preprocessing of the twig pattern queries, submitted in advance by multiple users before query processing, by the query partitioning module (200) and the query index creation module (202); and the preprocessing of the large XML data by the XML data partitioning and labeling module (201).

For example, given the twig pattern query T1: /A/B[C/D]/E/F[G]/H, to which the ID T1 has been assigned, the query partitioning module (200) converts it into three linear path patterns, P1: /A/B/E/F/H, P2: /A/B/C/D, and P3: /A/B/E/F/G. It also extracts the correspondence information T1 :== {P1, P2, P3} and stores it separately. All twig pattern queries are split into linear path patterns in the same way.
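The decomposition of T1 into P1, P2, and P3 can be reproduced with a short recursive routine. This is a minimal sketch under stated assumptions: it handles only child steps and bracketed branches written as in the example (no `./` prefixes, value comparisons, `//`, or `*`), and the helper names `decompose` and `build_mapping`, as well as the second query T2, are hypothetical.

```python
def decompose(twig):
    """Split a twig pattern with [..] branches into root-to-leaf linear paths."""
    branches = []      # linear paths contributed by bracketed branches
    trunk = ""         # path built so far along the main branch
    i = 0
    while i < len(twig):
        if twig[i] == "[":
            # Find the matching closing bracket (nesting is allowed).
            depth, j = 1, i + 1
            while depth:
                depth += {"[": 1, "]": -1}.get(twig[j], 0)
                j += 1
            inner = twig[i + 1 : j - 1]
            # A branch is a relative twig rooted at the current trunk node.
            branches.extend(decompose(trunk + "/" + inner))
            i = j
        else:
            trunk += twig[i]
            i += 1
    return [trunk] + branches

def build_mapping(twigs):
    """Give each distinct linear path one ID and record twig -> path IDs."""
    path_ids = {}      # linear path -> "P<n>", shared across all twigs
    mapping = {}
    for tid, twig in twigs.items():
        mapping[tid] = [
            path_ids.setdefault(p, f"P{len(path_ids) + 1}") for p in decompose(twig)
        ]
    return path_ids, mapping

twigs = {"T1": "/A/B[C/D]/E/F[G]/H", "T2": "/A/B[C/D]/E"}   # T2 is made up
path_ids, mapping = build_mapping(twigs)
print(decompose(twigs["T1"]))
print(mapping)
```

In this toy input, T1 and T2 both need /A/B/C/D, so it receives a single ID and only has to be evaluated once.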

The query index generated by the query index creation module (202) is used by the first MapReduce job unit (120) to filter the XML data blocks against the linear path patterns. In the present invention the existing YFilter technique is used to generate this query index, but the invention is not limited to it; for example, another XML stream processing technique that performs XML filtering may be used. For query processing, filtering is first performed with the linear path patterns, and the results of these linear path patterns are then combined by a twig pattern join. The twig pattern queries are split into linear path patterns, and these are processed first, because different twig pattern queries contain linear path patterns in common. If the queries were processed separately, the common linear path patterns would be evaluated repeatedly, which would increase the I/O volume and the processing time. Processing only the unique linear path patterns across all twig pattern queries reduces the size of the intermediate results passed between the first MapReduce job unit (120) and the second MapReduce job unit (130), and thereby greatly reduces the network and disk I/O incurred by the MapReduce jobs.
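The idea of sharing common partial paths in one index can be sketched with a prefix tree over path steps. This is a deliberate simplification and not YFilter itself: with child steps only, the automaton is deterministic, whereas YFilter builds an NFA that also handles `//` and `*`. The names `build_index` and `filter_events` and the event format are assumptions made for this example.

```python
def build_index(paths):
    """paths: {path_id: '/A/B/C'} -> prefix tree; shared prefixes share nodes."""
    root = {"next": {}, "accept": []}
    for pid, path in paths.items():
        node = root
        for step in path.strip("/").split("/"):
            node = node["next"].setdefault(step, {"next": {}, "accept": []})
        node["accept"].append(pid)
    return root

def filter_events(index, events):
    """events: ('start', tag) / ('end', tag). Returns [(path_id, event position)]."""
    stack = [index]          # index node per open element; None means no match
    matches = []
    for pos, (kind, tag) in enumerate(events):
        if kind == "start":
            top = stack[-1]
            node = top["next"].get(tag) if top else None
            stack.append(node)
            if node:
                matches.extend((pid, pos) for pid in node["accept"])
        else:
            stack.pop()
    return matches

paths = {"P1": "/A/B/E/F/H", "P2": "/A/B/C/D", "P3": "/A/B/E/F/G"}
index = build_index(paths)

# Event stream for <A><B><C><D/></C><E><F><H/><G/></F></E></B></A>
events = [
    ("start", "A"), ("start", "B"), ("start", "C"), ("start", "D"),
    ("end", "D"), ("end", "C"), ("start", "E"), ("start", "F"),
    ("start", "H"), ("end", "H"), ("start", "G"), ("end", "G"),
    ("end", "F"), ("end", "E"), ("end", "B"), ("end", "A"),
]
matches = filter_events(index, events)
print(matches)
```

All three patterns are checked in a single pass over the events, and the shared prefixes /A/B and /A/B/E/F are each walked only once.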

In this embodiment of the XML data partitioning and labeling module (201), the reference size of an XML data block is 64 MB, but the user may change this size freely to optimize performance. When loading the XML blocks and label blocks, the module (201) also places each pair of corresponding blocks on the same distributed node. The first MapReduce job unit (120) has to read an XML block and its label block together, so this placement secures spatial locality, which can be expected to reduce network I/O cost.
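A size-bounded split that never cuts a tag in half, and that records for each block the path of the elements still open at its start (as the description of FIG. 3 explains), might look like the sketch below. It is only an illustration under simplifying assumptions: sizes are counted in characters rather than bytes, text runs are treated as indivisible, and comments, CDATA sections, and processing instructions are not handled. The name `split_blocks` and the tiny block size are hypothetical.

```python
import re

TOKEN = re.compile(r"<[^>]+>|[^<]+")   # a whole tag, or a run of text

def split_blocks(xml_text, block_size):
    """Return (blocks, start_paths); a tag is never split across two blocks."""
    blocks, start_paths = [], []
    open_tags = []                      # names of elements open at this point
    current, current_path = "", "/"
    for token in TOKEN.findall(xml_text):
        if current and len(current) + len(token) > block_size:
            blocks.append(current)
            start_paths.append(current_path)
            current = ""
            # Path of the elements that are still open where the new block begins.
            current_path = "/" + "/".join(open_tags)
        current += token
        if token.startswith("</"):
            open_tags.pop()
        elif token.startswith("<") and not token.endswith("/>"):
            open_tags.append(token[1:-1].split()[0])
    if current:
        blocks.append(current)
        start_paths.append(current_path)
    return blocks, start_paths

xml = "<A><B><C><D>text</D></C><E>more</E></B></A>"
blocks, start_paths = split_blocks(xml, block_size=16)   # 64 MB in the embodiment
for block, path in zip(blocks, start_paths):
    print(path, block)
```

With this input the second block begins with `</D>` and carries the start path /A/B/C/D, mirroring the situation described for label block 2.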

FIG. 3 is a schematic diagram of an embodiment of the function that converts the large input XML file into fixed-size blocks without loss of the structural information between XML elements and places the corresponding label files alongside them in the distributed file system.

When the original XML file 300 is divided into XML data blocks 301, it is basically divided according to a given fixed size. However, if an element in the XML file lies on the boundary between two blocks and is cut in the middle, that element can no longer be identified by its tag name. The XML data segmentation of the present invention therefore divides the file into multiple XML data blocks based on the given block size only to the extent that no element is cut. In addition, because a block produced by this segmentation loses the structural information of the preceding blocks, the path information of the preceding blocks is stored in the label block 302 corresponding to the XML data block. Label block 1 (302) corresponds to the first block and therefore shows /, which denotes the root. Label block 2 corresponds to the second XML data block and holds /A/B/C/D, the path information of </D>, the first element of the second XML block. During query processing, when an XML data block is read, the path information at the beginning of the corresponding label block is used to restore the structural information of the first element of the XML data block. The three numbers that follow in the label block 302 are assigned to each element and attribute according to a region-based numbering scheme; they denote the start and the end of the region the element represents, and the level at which the element is located in the XML tree. Under the region-based numbering scheme, if the region start value of element A is smaller than that of element B and the region end value of A is larger than that of B, then A is a parent or ancestor of B; a level difference of 1 indicates a parent-child relationship, and a difference of 2 or more indicates an ancestor-descendant relationship.
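The region-based labels described above can be illustrated with a minimal Python sketch. It assumes a small in-memory document, labels only elements (not attributes), and uses a simple counter for the start and end values; the function names `assign_region_labels`, `is_ancestor`, and `is_parent` are hypothetical and not taken from the patent.

```python
import xml.etree.ElementTree as ET


def assign_region_labels(xml_text):
    """Return (tag, start, end, level) tuples in document order."""
    root = ET.fromstring(xml_text)
    labels = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        index = len(labels)
        labels.append(None)  # reserve the slot so document order is kept
        for child in elem:
            visit(child, level + 1)
        counter += 1  # the end value is taken after all descendants
        labels[index] = (elem.tag, start, counter, level)

    visit(root, 1)
    return labels


def is_ancestor(a, b):
    """A contains B when A starts before B and ends after B."""
    return a[1] < b[1] and a[2] > b[2]


def is_parent(a, b):
    """A parent is an ancestor exactly one level above."""
    return is_ancestor(a, b) and b[3] - a[3] == 1


labels = assign_region_labels("<A><B><C/></B><D/></A>")
a, b, c, d = labels
```

With these labels, structural relationships are decided from the three numbers alone, so a block can be joined without re-parsing the blocks that precede it.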

FIG. 4 is a schematic diagram of an embodiment of the first MapReduce work unit 120. As shown in FIG. 4, the first MapReduce work unit 120 takes as input the XML data blocks and label blocks that were divided and stored by the preprocessing and data loading unit 110, and processes the linear path patterns using the query index 204. Each map node 400 of the first MapReduce work unit 120 first loads the query index 402, and then reads its input XML block and the corresponding label block from the distributed file system. The query indexes used in the map steps all have the same structure. In each map step, the XML block is filtered with the query index, and the label values that are answers to the linear path patterns contained in the query index are emitted as output. Conceptually, each output record has the form (linear path pattern ID, label value). The reduce nodes 401 collect the linear path answer labels grouped by linear path pattern ID and separately record the size information 403 of the linear path answers.
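The data flow of this first job can be sketched in plain Python without a MapReduce framework. The sketch assumes each block has already been parsed into (absolute path, label) records and replaces the YFilter-based query index with a simple dictionary from path to pattern ID; `map_block`, `reduce_answers`, and the sample data are hypothetical, and "size" is taken here to mean the number of answer labels.

```python
from collections import defaultdict


def map_block(records, query_index):
    """Emit (linear path pattern ID, label) for every record that matches."""
    for path, label in records:
        pattern_id = query_index.get(path)
        if pattern_id is not None:
            yield pattern_id, label


def reduce_answers(pairs):
    """Group labels by pattern ID and record the size of each answer."""
    grouped = defaultdict(list)
    for pattern_id, label in pairs:
        grouped[pattern_id].append(label)
    sizes = {pid: len(found) for pid, found in grouped.items()}
    return dict(grouped), sizes


# Stand-in for the query index: absolute path -> linear path pattern ID.
query_index = {"/A/B": "p1", "/A/B/C": "p2"}

# Two hypothetical blocks; each label is (start, end, level).
blocks = [
    [("/A", (1, 10, 1)), ("/A/B", (2, 5, 2)), ("/A/B/C", (3, 4, 3))],
    [("/A/B", (6, 9, 2)), ("/A/B/C", (7, 8, 3))],
]

pairs = [pair for block in blocks for pair in map_block(block, query_index)]
answers, sizes = reduce_answers(pairs)
```

The `sizes` dictionary plays the role of the size information 403 that the later optimization step reads.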

FIG. 5 is a schematic diagram of an embodiment of the second MapReduce work unit 130. As shown in FIG. 5, the second MapReduce work unit 130 takes the linear path answers 404 obtained by the first MapReduce work unit 120 as input, performs the twig pattern join operations, and computes and outputs the final results. After the first MapReduce job finishes and before the second MapReduce job starts, however, the multi-query optimization module 502 is run to distribute the twig pattern join operations across the reduce nodes so that their workloads are even. The detailed operation of the multi-query optimization module 502 is described below with reference to FIG. 6. Based on the information about which twig pattern join operation has been assigned to which reduce node, each map node 500 forwards each incoming linear path pattern answer to the reduce node 501 to which a twig pattern join operation that takes it as input has been assigned. This is done by replacing the key of the map output with the ID of the destination reduce node.

When the answer of one linear path pattern is forwarded to several reduce nodes, the answer must be replicated, which increases the network input/output cost; the multi-query optimization module includes a technique for reducing this. Each reduce node takes the linear path pattern answers it receives as input and performs the twig pattern join operations assigned to it, thereby computing and outputting the final results.
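The routing performed in the map step of the second job can be sketched as follows. The sketch assumes the assignment of queries to reduce nodes is already known and that each query is described by the set of linear path pattern IDs it needs; `build_routing`, `route`, and the sample data are hypothetical names for illustration.

```python
def build_routing(assignment, query_paths):
    """Map each linear path pattern ID to the set of reduce nodes that need it."""
    routing = {}
    for reducer_id, queries in assignment.items():
        for query in queries:
            for pattern_id in query_paths[query]:
                routing.setdefault(pattern_id, set()).add(reducer_id)
    return routing


def route(pairs, routing):
    """Replace the key of each answer with the ID of the destination reduce node."""
    for pattern_id, label in pairs:
        for reducer_id in sorted(routing.get(pattern_id, ())):
            yield reducer_id, (pattern_id, label)


# Hypothetical assignment: reduce node ID -> twig pattern queries.
assignment = {0: ["q1", "q2"], 1: ["q3"]}
query_paths = {"q1": {"p1", "p2"}, "q2": {"p1", "p3"}, "q3": {"p3", "p4"}}

routing = build_routing(assignment, query_paths)
pairs = [("p1", (2, 5, 2)), ("p3", (6, 7, 2))]
routed = list(route(pairs, routing))
```

Because the routing table stores a set of reduce nodes per pattern, an answer needed by two queries on the same node (p1 above) is sent to that node once, while an answer needed on two different nodes (p3) is replicated.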

FIG. 6 is a flowchart illustrating an embodiment of the multi-query optimization module. As shown in FIG. 6, the procedure comprises: reading the correspondence information between the twig pattern queries and the linear path patterns (S100); reading the size information of the answers of those linear path patterns (S101); calculating the cost of each twig pattern join operation based on the sizes of the linear path pattern answers (S102); finding the twig pattern query with the largest calculated cost and, if it shares a linear path with an already assigned twig pattern, excluding the size of that linear path from the cost (S103); assigning the query to a reduce node that has no assigned twig pattern query (S104); determining whether any reduce node still has no assigned twig pattern query (S105); returning to step S103 if such a reduce node exists, and otherwise finding the reduce node whose assigned twig pattern queries have the lowest cost (S106); finding the twig pattern query whose linear path pattern answers shared with the queries already assigned to that reduce node are largest in size, and assigning it to that reduce node (S107); and determining whether any twig pattern query remains unassigned, returning to step S107 if so and terminating otherwise (S108).

In this embodiment, the multi-query optimization module uses a greedy approach so that the reduce nodes receive evenly divided shares of the twig pattern join operations while minimizing the cases in which the answer of one linear path pattern is assigned to several reduce nodes and thereby increases the network transfer cost.

The multi-query optimization module first calculates the expected execution cost of the twig pattern queries from the collected size information of each linear path pattern's answer and the correspondence information between linear path patterns and twig pattern queries. This embodiment uses the holistic twig pattern join (hereinafter HTJ) technique to process twig pattern queries. In HTJ, the cost of a join operation is the sum of the sizes of its input element lists. In this embodiment, therefore, the cost of a twig join operation is the sum of the sizes of the answers of the linear path patterns contained in the twig pattern query. Other techniques capable of performing twig pattern joins may also be used, however; query processing neither depends on nor is limited to HTJ itself. Once the cost of each twig pattern join operation has been calculated, the twig pattern queries are assigned to the reduce nodes one at a time in descending order of join cost. If a query takes as input a linear path pattern that is also an input of a twig pattern join already assigned to a reduce node, the size of that linear path answer is subtracted from its cost.
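The cost rule above can be written as a short Python sketch. It assumes the answer sizes come from the first job and that each query is described by a set of linear path pattern IDs; the function name `join_cost`, the parameter `already_on_node`, and the sample values are hypothetical.

```python
def join_cost(query, query_paths, sizes, already_on_node=frozenset()):
    """Cost of one twig pattern join: the summed answer sizes of its linear paths.

    Paths listed in already_on_node are skipped, mirroring the rule that a
    linear path answer shared with an already assigned join is not counted again.
    """
    return sum(sizes[p] for p in query_paths[query] if p not in already_on_node)


# Hypothetical inputs: answer sizes and the query-to-path correspondence.
sizes = {"p1": 100, "p2": 40, "p3": 60}
query_paths = {"q1": {"p1", "p2"}, "q2": {"p1", "p3"}}

full_cost = join_cost("q1", query_paths, sizes)
discounted = join_cost("q2", query_paths, sizes, already_on_node=query_paths["q1"])
```

Here q2 shares p1 with q1, so once q1 has been assigned only the answer of p3 counts toward the cost of q2.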

After one twig pattern join operation has been assigned to every reduce node, the reduce node with the lowest assigned twig pattern join cost is selected. The twig pattern query whose linear path answers shared with the queries assigned to that reduce node are largest in size is then found and assigned to that node. Only twig pattern queries that have not yet been assigned are considered. If unassigned twig pattern queries remain, this operation is repeated.
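The two-phase greedy assignment can be sketched in Python as below. This is one possible reading of the procedure, not a definitive implementation: in the first phase the discount is applied against paths already placed on any node, ties are broken by query name order, and a node's load grows only by the answers it does not already hold. The function `greedy_assign` and the sample data are hypothetical.

```python
def greedy_assign(query_paths, sizes, num_nodes):
    """Assign twig pattern queries to reduce nodes; return (queries, loads) per node."""
    unassigned = sorted(query_paths)  # fixed order makes tie-breaking deterministic
    node_queries = [[] for _ in range(num_nodes)]
    node_paths = [set() for _ in range(num_nodes)]
    node_load = [0] * num_nodes
    placed_paths = set()

    def size_of(paths):
        return sum(sizes[p] for p in paths)

    def place(query, node):
        # Only answers not yet present on the node add to its load.
        node_load[node] += size_of(query_paths[query] - node_paths[node])
        node_paths[node] |= query_paths[query]
        placed_paths.update(query_paths[query])
        node_queries[node].append(query)
        unassigned.remove(query)

    # Phase 1: give every node one query, most expensive (after discount) first.
    for node in range(num_nodes):
        if not unassigned:
            break
        best = max(unassigned, key=lambda q: size_of(query_paths[q] - placed_paths))
        place(best, node)

    # Phase 2: the least loaded node takes the query that shares the most with it.
    while unassigned:
        node = min(range(num_nodes), key=lambda n: node_load[n])
        best = max(unassigned, key=lambda q: size_of(query_paths[q] & node_paths[node]))
        place(best, node)

    return node_queries, node_load


sizes = {"p1": 100, "p2": 40, "p3": 60, "p4": 10}
query_paths = {
    "q1": {"p1", "p2"},
    "q2": {"p1", "p3"},
    "q3": {"p3", "p4"},
    "q4": {"p2", "p4"},
}
node_queries, node_load = greedy_assign(query_paths, sizes, num_nodes=2)
```

Choosing the query with the largest shared answers for the least loaded node keeps replication low, at the price of a balance that is only approximate, as is usual for a greedy heuristic.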

FIG. 7 is a diagram illustrating an embodiment of the operation that converts linear path patterns written in XPath with double slashes (//) and wildcards (*) into the unique linear path patterns, free of double slashes and wildcards, that actually exist in the document. This conversion is performed because, when a linear path pattern contains a double slash or a wildcard, an element of the XML document can appear repeatedly as an answer to several linear path patterns; the intermediate results made up of linear path answers then grow, and the network and disk input/output cost of processing them repeatedly increases greatly. To solve this, when the query division module 200 divides the twig pattern queries into linear path patterns, the present invention converts linear path patterns containing double slashes or wildcards into unique path patterns existing in the document, so that in the first MapReduce work unit 120 every element of the XML document is the processing result of at most one linear path pattern.

The path pattern conversion proceeds as follows. First, during preprocessing, the path extraction unit 704 extracts the unique path patterns 701 that exist in the original XML file. The extracted unique path patterns are stored in a hash table in reverse order for ease of lookup. The unique path conversion unit 705 then compares the linear path patterns 702 containing double slashes and wildcards, which the query division module 200 obtained by dividing the twig pattern queries, against this hash table, converts them into unique path patterns (703), and establishes the correspondence between each original linear path pattern and the unique linear path patterns in the document. This correspondence is used during the map execution 500 of the second MapReduce work unit 130 to determine the linear path pattern answers on which each twig pattern join operation is to be performed.
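The conversion can be illustrated with a minimal Python sketch. It departs from the text in one respect, stated plainly: instead of a hash table of reversed paths, it keeps the unique paths in a plain set and matches them with a regular expression built from the pattern. It also assumes patterns start with / or // and contain only element names, /, //, and *. The names `unique_paths`, `pattern_to_regex`, and `expand` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def unique_paths(xml_text):
    """Collect every distinct root-to-element path that occurs in the document."""
    root = ET.fromstring(xml_text)
    found = set()

    def visit(elem, prefix):
        path = prefix + "/" + elem.tag
        found.add(path)
        for child in elem:
            visit(child, path)

    visit(root, "")
    return found


def pattern_to_regex(pattern):
    """Translate a linear XPath pattern with // and * into a regular expression."""
    regex = ""
    for token in re.findall(r"//|/|\*|[^/*]+", pattern):
        if token == "//":
            regex += r"/(?:[^/]+/)*"  # any number of intermediate steps
        elif token == "/":
            regex += "/"
        elif token == "*":
            regex += r"[^/]+"  # exactly one step with any name
        else:
            regex += re.escape(token)
    return re.compile(regex)


def expand(pattern, paths):
    """Return the unique document paths that the pattern stands for."""
    rx = pattern_to_regex(pattern)
    return sorted(p for p in paths if rx.fullmatch(p))


paths = unique_paths("<A><B><C/></B><D><C/></D></A>")
mapping = {p: expand(p, paths) for p in ["//C", "//B/C", "/A/*/C", "/A/B/*"]}
```

The resulting `mapping` corresponds to the recorded relationship between each original pattern and the unique paths; several patterns that expand to the same path can then share a single answer.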

In addition, the syntax of XPath, an XML query language, includes the double slash (//) and the wildcard (*); when these symbols are used in queries, a single linear path can match several path patterns written in XPath. For example, C, the last element on the linear path /A/B/C, can be an answer to all of the XPath path expressions //C, //B/C, A/B/C, A//C, A/*/C, *//C, and A/B/*. If each of these path expressions is treated as a distinct query pattern and its query results are handled separately, the same C elements end up being extracted repeatedly, wasting processing time and I/O.

To solve this problem, the present invention replaces each linear path pattern in which a double slash or wildcard appears with the unique path patterns that actually exist in the XML document. This eliminates the duplicated processing and output of XML elements as answers to several linear path patterns, and thereby greatly reduces the input/output cost incurred between the MapReduce jobs.

The embodiments described above are only some of the various embodiments of the present invention. Various embodiments that fall within the technical idea of the present invention naturally fall within its scope of protection.

100: Input unit
110: Preprocessing and data loading unit
120: First MapReduce work unit
130: Second MapReduce work unit
200: Query division module
201: XML data segmentation and labeling module
202: Query index creation module

Claims

A system for simultaneous parallel processing of multiple twig pattern queries on large-capacity XML data, comprising: an input unit for receiving twig pattern queries and a large-capacity XML file from users; a preprocessing and data loading unit for generating a query index using the linear path patterns in the twig pattern queries received through the input unit, generating a plurality of XML data blocks from the large-capacity XML file, and loading the XML data blocks into a distributed file system; a first MapReduce work unit for receiving the XML data blocks, obtaining answers to the linear path patterns, and calculating the sizes of the answers to the linear path patterns; and a second MapReduce work unit for performing twig pattern join operations on the answers to the linear path patterns and outputting a final result.

The system according to claim 1,
wherein the preprocessing and data loading unit divides the large-capacity XML file into blocks of a fixed size and stores the blocks in the distributed file system; assigns label numbers to the elements and attributes of the divided XML data blocks according to a region-based numbering scheme; writes these label numbers into separate blocks that are stored on the same computing node as the corresponding XML data block; and, so that the structural information of the XML is not lost when the large-capacity XML file is divided into blocks and stored, divides the blocks at the last text value located within the designated block size and stores, as a string value together with the label numbers, the linear path pattern from the root to the element that appears first in the next newly divided block.

The system according to claim 1,
wherein the first MapReduce work unit receives the XML data blocks, finds the answers to the linear path patterns in each XML data block using the query index in the map step, and calculates the size of the answer to each linear path pattern in the reduce step.

The system of claim 3,
wherein the second MapReduce work unit receives the answers to the linear path patterns; in the map step, identifies the twig pattern queries placed on each distributed node and forwards the linear path pattern answers needed to process those twig pattern queries to the corresponding node; and, in the reduce step, performs a holistic twig pattern join using the linear path pattern answers actually received and the correspondence information between linear path patterns and twig pattern queries, and then stores the final result in the distributed file system.

The system according to claim 1,
further comprising a multi-query optimization module that, after the operation of the first MapReduce work unit has finished and before the operation of the second MapReduce work unit starts, distributes the twig pattern join operations across the nodes so that their workloads are even.

A method for simultaneous parallel processing of multiple twig pattern queries on large-capacity XML data, comprising: receiving, at an input unit, twig pattern queries and a large-capacity XML file from users; dividing, in a preprocessing and data loading unit, the received twig pattern queries into linear path patterns; generating, in the preprocessing and data loading unit, a query index from the linear path patterns into which the twig pattern queries were divided; dividing, in the preprocessing and data loading unit, the large-capacity XML file into blocks of a fixed size to enable parallel processing on a plurality of computers, thereby generating a plurality of XML data blocks; loading, in the preprocessing and data loading unit, the plurality of XML data blocks into a distributed file system; receiving the divided XML data blocks and obtaining answers to the linear path patterns using the query index in the map step of a first MapReduce work unit; calculating the size of the answer to each linear path pattern in the reduce step of the first MapReduce work unit; distributing, in a multi-query optimization module, the twig pattern join operations across a plurality of computing nodes in real time so that their workloads are even; forwarding, at execution time in the map step of a second MapReduce work unit, the linear path answers that the twig pattern join operations assigned to each node take as input; and performing, in the reduce step of the second MapReduce work unit, the twig pattern join operations assigned to each reducer on the linear path pattern answers received from the map step, and outputting the final result.

A method for simultaneous parallel processing of multiple twig pattern queries on large-capacity XML data, comprising: receiving, at an input unit, twig pattern queries and a large-capacity XML file from users; generating, by a preprocessing and data loading unit, a query index using the linear path patterns of the twig pattern queries received through the input unit, generating a plurality of XML data blocks from the large-capacity XML file, and loading the XML data blocks into a distributed file system; receiving, by a first MapReduce work unit, the XML data blocks, obtaining answers to the linear path patterns, and calculating the sizes of the answers to the linear path patterns; and performing, by a second MapReduce work unit, twig pattern join operations on the answers to the linear path patterns and outputting the final result.

The method of claim 7,
wherein the step in which the first MapReduce work unit receives the XML data blocks, obtains answers to the linear path patterns, and calculates the sizes of the answers to the linear path patterns comprises:
creating a query index from the linear path patterns that are common among a plurality of twig pattern queries, and filtering the XML elements in parallel with the query index; converting, when the query index is created from the linear path patterns, each linear path pattern containing // or * into linear path patterns unique to the input XML document, to solve the problem of the same element being duplicated because of // and *; and collecting the sizes of the linear path answers for even distribution of the workload to the distributed nodes.

The method of claim 7,
further comprising, after the step in which the first MapReduce work unit receives the XML data blocks, obtains answers to the linear path patterns, and calculates the sizes of the answers to the linear path patterns, and before the step of performing twig pattern join operations on the answers to the linear path patterns and outputting the final result,
a step in which a multi-query optimization module distributes the twig pattern join operations across a plurality of computing nodes in real time so that their workloads are even.

The method of claim 9,
wherein the step in which the multi-query optimization module distributes the twig pattern join operations across the plurality of computing nodes in real time so that their workloads are even comprises:
a. reading the correspondence information between linear path patterns and twig pattern queries;
b. reading the size information of the linear path answers, and calculating the cost of each twig pattern join operation from the correspondence information and size information thus read;
c. arranging the calculated costs in ascending order, and assigning a twig pattern query to an arbitrary reduce node that has no assigned twig pattern query;
d. for a linear path pattern shared with an already assigned twig pattern query, removing its size from the calculated cost;
e. repeating steps c and d until no reduce node is without an assigned twig pattern join operation;
f. finding the reduce node whose assigned twig pattern queries have the lowest cost;
g. finding the twig pattern query whose linear path pattern answers shared with the twig pattern queries assigned to the reduce node found in step f are largest in size, and assigning that query to that reduce node; and
h. repeating steps f and g until no unassigned twig pattern query remains.

The method of claim 9,
wherein the step of performing twig pattern join operations on the answers to the linear path patterns and outputting the final result comprises:
forwarding to each reduce node the linear path answers it requires, with reference to the correspondence information between reduce nodes and twig pattern queries assigned by the multi-query optimization module; using the region start value of the region-based numbering scheme as the key of each (key, value) pair sent from the map step, so that the label values can be used as input to the joins when the twig joins are performed in the reduce step; processing a plurality of twig pattern joins with the forwarded linear path pattern answers as input and outputting the final result; and, when twig pattern queries sharing a common linear path pattern are assigned to the same reduce node, sending the corresponding linear path answer to that reduce node only once rather than repeatedly.