KR101801091B1 - System of Multi-Dimensional Hierarchical Data Cube and Evaluation Method thereof

Info

Publication number: KR101801091B1
Application number: KR1020160075275A
Authority: KR
Inventors: 이원석; 조윤호; 박유신
Original assignee: (주)이더블유비엠
Priority date: 2016-06-16
Filing date: 2016-06-16
Publication date: 2017-11-27

Abstract

The present invention relates to a multidimensional hierarchical data cube system and a processing method thereof, which build a hyper data cube that processes hierarchical data on the basis of continuous queries, and in which each cuboid aggregates from the previously aggregated cuboid with the smallest computation cost. The method comprises: (a) expressing multidimensional hierarchical data with Structured Query Language (SQL) statements; (b) generating a data cube, on the basis of continuous queries, for the multidimensional hierarchical data expressed in step (a); (c) monitoring, for the data cube generated in step (b), the number of generated tuples, which is the cost of each cuboid; and (d) aggregating the minimum-cost parent cuboid according to the monitoring in step (c) and having each child cuboid aggregate at minimum cost from that result. With this configuration, complex queries can be processed at high speed.

Description

The present invention relates to a multidimensional hierarchical data cube system and a processing method thereof and, more particularly, to a multidimensional hierarchical data cube system and processing method that build a hyper data cube for processing hierarchical data on a continuous-query basis, in which each cuboid aggregates from the previously aggregated cuboid having the smallest computation cost.

The advent of global communication networks such as the Internet has sustained the exchange of enormous amounts of information. The cost of storing and maintaining this information has also fallen, creating a need to access large-scale data storage structures. Enormous amounts of data can be stored as a data warehouse, a database that generally represents an organization's business history. Historical data is used in analyses that support business decisions at many levels, from strategic planning to the performance evaluation of individual organizational units. This can also include taking data stored in a relational database and processing it into a more effective tool for querying and analysis. To manage data warehousing effectively at a small scale, the concept of a data mart, in which only a targeted subset of the data is managed, is employed.

OLAP (On-Line Analytical Processing), a data analysis technology in database and data warehouse systems that processes complex queries at high speed using a multidimensional data structure called a data cube or cube, has advanced considerably. OLAP is characterized by summarizing, consolidating, viewing, applying formulas to, and synthesizing multidimensional data. In the multidimensional data model used by OLAP systems, a data cube represents the various characteristics of data items through two elements: the 'dimension', which is the perspective from which the user wants to analyze, and the 'measure', which is the object of analysis. OLAP is now a basic tool for data analysts and decision makers, and the data cube, OLAP's multidimensional data model, has been applied successfully to many multidimensional data analyses.

OLAP is one kind of decision support system: a tool that processes complex queries at high speed and enables multidimensional analysis through an interactive approach. The multidimensional data model, OLAP's most distinctive feature, represents the characteristics of data using the concept of a data cube. Data cuboids together form a data cube lattice.
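As a rough illustration of the data cube lattice mentioned above, the following Python sketch enumerates every cuboid (every group-by subset) of a small set of dimensions and lists the parents of a cuboid, taken here to be the cuboids with exactly one more dimension. The dimension names and the helper names `cuboids` and `parents` are hypothetical and do not come from the patent.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return all 2^n cuboids as frozensets of dimension names."""
    dims = list(dimensions)
    return [frozenset(c)
            for r in range(len(dims) + 1)
            for c in combinations(dims, r)]

def parents(cuboid, dimensions):
    """Parents in the lattice: cuboids with exactly one extra dimension."""
    return [cuboid | {d} for d in dimensions if d not in cuboid]

dims = ["time", "region", "product"]  # hypothetical dimensions
lattice = cuboids(dims)
time_parents = parents(frozenset({"time"}), dims)
```

With three dimensions the lattice has eight cuboids, from the empty (apex) cuboid to the full base cuboid.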

With these advances, and with the arrival of the Internet of Things era in which diverse devices are connected through the Internet, methods for analyzing and processing data in data stream environments that generate vast amounts of data in real time have become more important.

Examples of such technology are disclosed in Documents 1 and 2 below, among others.

For example, Patent Document 1 below discloses a computer-implemented method for data mining on a multidimensional data cube in a computer system including a processor and system memory. The method comprises: accessing multidimensional data; accessing multidimensional expressions that define how to view the multidimensional data as a multidimensional data cube; accessing data mining extensions for performing data mining on data residing in the multidimensional data cube; integrating the multidimensional expressions and the data mining extensions into an input for generating a data mining model; generating, by the processor, a data mining model trained on the multidimensional data from the input; storing the data mining model; and data mining the multidimensional data cube according to the data mining model to perform data predictions on the data contained in the cube, wherein the data mining includes performing data mining operations on portions of the data contained in the multidimensional data cube according to a multidimensional query element.

Patent Document 2 below discloses a method for selecting a data analysis object in an OLAP-based three-dimensional space, comprising: displaying a rectangular parallelepiped model composed of a plurality of data cubes in the three-dimensional space of an OLAP system; receiving a selection of a first data cube from among the data cubes constituting a first outer face of the model; displaying as active a first cube column that includes the first data cube; receiving a selection of a second data cube from among the data cubes constituting a second outer face perpendicular to the first outer face; displaying as active a second cube column that includes the selected second data cube; setting the data cube included in both the first and second cube columns as the analysis target cube; and outputting the data corresponding to the analysis target cube.

In addition, Non-Patent Document 1 below proposes a cuboid prefix tree, based on the prefix tree structure, for processing iceberg queries in a data stream environment. Because the cuboid prefix tree manages only the itemsets composed of the grouping items used in the iceberg queries, it uses less memory than a prefix tree. By managing 1-items it removes infrequent items from transactions and so reduces the time spent unnecessarily on updates, and by sharing nodes according to the grouping attributes common to multiple iceberg queries it can process multiple iceberg queries efficiently with little memory.

Korean Registered Patent Publication No. 10-1034428 (registered May 3, 2011)
Korean Registered Patent Publication No. 10-1117649 (registered February 10, 2012)

한상길, 양우석, 이원석, Journal of the Korean Institute of Information Scientists and Engineers: Databases, Vol. 36, No. 3 (June 2009), pp. 226-234, ISSN 1229-7739

However, conventional OLAP processing has the following limitations in a real-time data stream environment.

First, because queries are executed by accessing disk-based data, repeated disk scans delay processing in a data stream environment where data arrives continuously.

Second, a traditional OLAP query is a one-time query, which makes it difficult to query multidimensional hierarchical data concurrently.

The present invention has been made to solve the problems described above, and an object of the present invention is to provide a multidimensional hierarchical data cube system and processing method that process multidimensional data on a continuous-query basis in a real-time data stream environment.

Another object of the present invention is to provide a multidimensional hierarchical data cube system and processing method that build a multidimensional hierarchical data cube on a continuous-query basis and perform aggregations concurrently, thereby providing users with a variety of information.

Yet another object of the present invention is to provide a multidimensional hierarchical data cube system and processing method that monitor the cost of each cuboid and process a minimum-cost cube tree by having each cuboid aggregate the result of the lowest-cost cuboid among the previously aggregated cuboids.

To achieve the above objects, a multidimensional hierarchical data cube processing method according to the present invention comprises: (a) expressing multi-level hierarchical data with SQL (Structured Query Language) statements; (b) generating a data cube, on a continuous-query basis, for the multi-level hierarchical data expressed in step (a); (c) monitoring, for the data cube generated in step (b), the number of generated tuples, which is the cost of each cuboid; and (d) aggregating the minimum-cost parent cuboid according to the monitoring in step (c), and having the child cuboid aggregate at minimum cost from that result.

In the multidimensional hierarchical data cube processing method according to the present invention, the expression in step (a) applies a closure table data structure that represents every path of the tree.

In the multidimensional hierarchical data cube processing method according to the present invention, the closure table data structure defines each hierarchy path as a 'Path', defines a path ID (pathID) to identify a hierarchy path among multiple hierarchy paths, and applies the level difference (depth) and the hierarchy path ID (pathID) represented in the closure table in order to manage the multidimensional hierarchical data stream.
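To make the closure table concrete, here is a minimal Python sketch, under stated assumptions, that turns a single hierarchy into closure rows carrying a level difference (`depth`) and a path ID. The row layout `(ancestor, descendant, depth, path_id)`, the function name `closure_rows`, and the sample location hierarchy are assumptions for illustration; the patent does not spell out a concrete schema here.

```python
def closure_rows(parent_of, path_id):
    """Build closure-table rows (ancestor, descendant, depth, path_id).

    parent_of maps each node to its parent (None for the root).
    depth is the level difference between ancestor and descendant,
    so every node also appears as its own ancestor with depth 0.
    """
    rows = []
    for node in parent_of:
        ancestor, depth = node, 0
        while ancestor is not None:
            rows.append((ancestor, node, depth, path_id))
            ancestor = parent_of[ancestor]
            depth += 1
    return rows

# Hypothetical location hierarchy: country > city > district.
parent_of = {"Korea": None, "Seoul": "Korea", "Gangnam": "Seoul"}
rows = closure_rows(parent_of, path_id=1)
```

Because every ancestor-descendant pair is stored explicitly, any level of the hierarchy can be reached from a leaf value with one lookup on `descendant` rather than a recursive walk.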

In the multidimensional hierarchical data cube processing method according to the present invention, the data cube in step (b) builds a hierarchical data stream model with a single join according to Equation (1) below:

$$D_i = \{\, a_{j,k} \,\} \qquad (1)$$

where each element $a_{j,k}$ of dimension $D_i$ is the attribute value at hierarchy level $j$ with hierarchy path ID $k$.
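One way to read the single join described above is as a join between an incoming stream tuple and the closure table on the tuple's leaf attribute value, producing one row per hierarchy level and path. The Python sketch below illustrates that reading only; the row layout, the sample values, and the function name `expand` are assumptions, not the patent's implementation.

```python
# Closure rows: (ancestor, descendant, depth, path_id); values are hypothetical.
CLOSURE = [
    ("Gangnam", "Gangnam", 0, 1),
    ("Seoul", "Gangnam", 1, 1),
    ("Korea", "Gangnam", 2, 1),
]

def expand(stream_tuple, closure):
    """Join one (leaf_value, measure) tuple with the closure table.

    Returns one (attribute_value, depth, path_id, measure) row for each
    hierarchy level reachable from the leaf value.
    """
    leaf, measure = stream_tuple
    return [(ancestor, depth, path_id, measure)
            for ancestor, descendant, depth, path_id in closure
            if descendant == leaf]

expanded = expand(("Gangnam", 10), CLOSURE)
```

Each expanded row can then feed the aggregation for the cuboid at its own hierarchy level.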

In the multidimensional hierarchical data cube processing method according to the present invention, step (d) comprises: (d1) when there is no hierarchy information, forming a minimum-cost cube tree by considering only the connections between hyper cuboids and selecting the lowest-cost parent cuboid among the parent cuboids; (d2) when attribute hierarchy information exists for each dimension, forming hierarchical cuboids inside the hyper cuboids and selecting the cuboid with the lowest cost among them; and (d3) when a user-interest cuboid has no adjacent parent cuboid, repeatedly searching one level higher until a parent cuboid is found, and constructing a partial data cube tree that connects the cuboid to that ancestor cuboid.
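Steps (d1) and (d3) can be sketched in Python as follows. This is an illustrative sketch under assumptions: cuboids are modeled as frozensets of dimension names, the cost is a dictionary of monitored tuple counts, and the function name `min_cost_tree` and the sample costs are invented for the example. Step (d2), which adds hierarchical cuboids inside each hyper cuboid, is not modeled.

```python
def min_cost_tree(cuboids, cost):
    """Map each cuboid to its cheapest parent or ancestor.

    cuboids: iterable of frozensets of dimension names.
    cost: dict mapping cuboid -> monitored number of generated tuples.
    A parent has exactly one more dimension. If none of the given cuboids
    is an adjacent parent, the search climbs one level at a time until an
    ancestor (strict superset) is found, as in step (d3).
    """
    cuboids = list(cuboids)
    max_size = max(len(c) for c in cuboids)
    tree = {}
    for c in cuboids:
        for size in range(len(c) + 1, max_size + 1):
            candidates = [p for p in cuboids if len(p) == size and c < p]
            if candidates:
                tree[c] = min(candidates, key=lambda p: cost[p])
                break
        else:
            tree[c] = None  # no ancestor: aggregate from the source stream
    return tree

ABC, AB, AC, A, B = (frozenset(s) for s in ("ABC", "AB", "AC", "A", "B"))
cost = {ABC: 100, AB: 40, AC: 25, A: 5, B: 8}  # hypothetical tuple counts
full_tree = min_cost_tree(cost, cost)
partial_tree = min_cost_tree([ABC, A], cost)   # only user-interest cuboids
```

In the full tree, cuboid A aggregates from AC (25 tuples) rather than AB (40 tuples); in the partial tree, A has no adjacent parent and is connected to its ancestor ABC.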

To achieve the above objects, a recording medium according to the present invention is a computer-readable recording medium storing a program that, when executed in a multidimensional hierarchical data cube processing system including a client terminal, a database, and a database management system (DBMS), performs the multidimensional hierarchical data cube processing method of any one of claims 1 to 5.

To achieve the above objects, a multidimensional hierarchical data cube processing system according to the present invention includes a client terminal, a database that is connected to the client terminal through a network and stores the information requested by the client terminal, and a database management system (DBMS) that manages the database through the network. The database management system includes: a closure table data structure forming unit that applies the level difference (depth) and the hierarchy path ID (pathID) represented in a closure table in order to manage a multidimensional hierarchical data stream; a multidimensional hierarchical data cube construction unit that generates a data cube on a continuous-query basis for multi-level hierarchical data; and a minimum cost setting unit that forms a minimum-cost cube tree by selecting the lowest-cost parent cuboid or ancestor cuboid among the parent cuboids or ancestor cuboids.

As described above, the multidimensional hierarchical data cube system and processing method according to the present invention express hierarchical data with existing SQL (Structured Query Language) statements, generate a data cube on a continuous-query basis, monitor the number of generated tuples that is the cost of each cuboid, and have each child cuboid aggregate from the aggregation result of its minimum-cost parent cuboid, thereby improving the performance of OLAP (On-Line Analytical Processing) in database and data warehouse systems.

Furthermore, the multidimensional hierarchical data cube system and processing method according to the present invention build a multidimensional hierarchical data cube on a continuous-query basis and perform aggregations concurrently, so that a variety of information can be provided to users at high speed.

FIG. 1 is a block diagram of the configuration of a multidimensional hierarchical data cube system according to the present invention;
FIG. 2 is a process diagram illustrating the processing of a multidimensional hierarchical data cube according to the present invention;
FIG. 3 is a diagram showing an example of data represented as a closure table;
FIG. 4 is a conceptual diagram of a hyper data cube;
FIG. 5 is a diagram showing how hierarchical cuboids are connected;
FIG. 6 is a diagram showing an example of a partial data cube configuration;
FIG. 7 is a graph comparing performance by window size;
FIG. 8 is a graph comparing performance by Zipf distribution;
FIG. 9 is a graph comparing performance by data set;
FIG. 10 is a graph comparing maximum-cost and minimum-cost performance (case 1, case 2);
FIG. 11 is a graph comparing maximum-cost and minimum-cost performance (case 3, case 4);
FIG. 12 is a graph comparing maximum-cost and minimum-cost performance by hierarchy level;
FIG. 13 is a graph comparing performance by interval difference.

The above and other objects and novel features of the present invention will become clearer from the description in this specification and the accompanying drawings.

First, the concept of the present invention is described.

The present invention uses a closure table to implement the hierarchy operators of conventional OLAP on a continuous-query basis. A join is performed for each attribute so that a single tuple is expanded across its hierarchy levels before aggregation. In addition, to achieve fast processing in a real-time data stream environment, the cost of each cuboid is monitored and the data cube tree is constructed on a minimum-cost basis. Experiments on real data demonstrated that the performance gap between the minimum-cost-based cube tree and the maximum-cost-based cube tree grows as the cost difference between cuboids grows. Performance is further improved by constructing the cube partially, connecting user-interest cuboids through ancestor-child links.

That is, by providing the configuration described above, the present invention solves the problems of the prior art, in which conventional OLAP processing in a real-time data stream environment suffers from delayed processing and has difficulty querying multidimensional hierarchical data concurrently.

Existing OLAP processing approaches for data stream environments are either tree-based or continuous-query-based. The tree-based approach manages aggregate data in tree form, and in a data stream environment where data with diverse attributes arrives without end, the tree's memory usage grows enormously. The continuous-query-based approach, by contrast, registers queries in advance and keeps evaluating them over the data within the window size to obtain aggregate results, so it needs relatively little memory.

The stream cube is a structure proposed for multidimensional data analysis in a data stream environment. It has the following characteristics.

First, it computes a partially materialized cube using a tilted time frame. This structure stores time at different levels of granularity: the most recent data is stored in fine units, and older data is stored in increasingly summarized form. This saves memory and disk cost compared with processing all data at the same granularity, but it has the drawback that past data cannot be queried in detail.
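The idea of a tilted time frame can be illustrated with a small two-level Python sketch. It assumes sum aggregates and a block size of four fine-grained slots (for example, four quarter-hours summarized into one hour); the class name `TiltedTimeFrame` and these parameters are illustrative assumptions and not the stream cube's actual layout.

```python
class TiltedTimeFrame:
    """Keep recent values at fine granularity and older ones summarized."""

    def __init__(self, fine_slots=4):
        self.fine_slots = fine_slots  # size of one summarized block (assumed)
        self.fine = []                # most recent fine-grained aggregates
        self.coarse = []              # one sum per summarized block of fine slots

    def add(self, value):
        self.fine.append(value)
        if len(self.fine) == 2 * self.fine_slots:
            # Summarize the oldest block into a single coarse aggregate.
            oldest = self.fine[:self.fine_slots]
            self.coarse.append(sum(oldest))
            self.fine = self.fine[self.fine_slots:]

frame = TiltedTimeFrame(fine_slots=4)
for value in range(1, 10):
    frame.add(value)
```

After nine values, the four oldest have been folded into one coarse sum, which shows both the saving and the drawback: the individual values 1 to 4 can no longer be queried.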

Second, it uses the concept of critical layers to materialize only the cuboids of interest, which addresses the cost of materializing cuboids at every level. A minimal interest layer (m-layer) and an observation layer (o-layer) are designated, and the data cube is built only from the cuboids located between the two layers, reducing memory usage. Once the stream cube has been built, however, cuboids outside these layers cannot be observed, which can be a drawback in a data stream environment that changes in real time.
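As a hedged illustration of the critical layers, the Python sketch below lists the cuboids that lie between an o-layer and an m-layer. It assumes that each layer is given as a tuple with one hierarchy level per dimension and that a larger level number means finer granularity; both conventions and the function name `cuboids_between` are assumptions made for the example.

```python
from itertools import product

def cuboids_between(o_layer, m_layer):
    """Enumerate level combinations from the o-layer to the m-layer.

    o_layer and m_layer hold one hierarchy level per dimension, with a
    larger number meaning a finer level (an assumed convention).
    """
    ranges = [range(o, m + 1) for o, m in zip(o_layer, m_layer)]
    return list(product(*ranges))

# Two dimensions: levels 1..2 for the first, 0..2 for the second.
kept = cuboids_between(o_layer=(1, 0), m_layer=(2, 2))
```

Only these level combinations would be materialized; any cuboid outside the two layers is not available for observation.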

Third, among the cuboids from the minimal interest layer to the observation layer, the stream cube stores the cuboids on the popular path that users query frequently. The path is stored and managed in a data structure called an H-tree, and a query off the stored path is computed by moving to the nearest cuboid using the information stored in the H-tree, so response time is short. However, the popular path is a fixed choice determined by statistical analysis or experience and cannot be changed along the way, so it cannot respond actively to a data stream environment that is sensitive to abrupt change.

The dynamic data cube was proposed to address these drawbacks. It designates a user region of interest according to the support of attribute values and manages attribute values by grouping them dynamically, saving memory and processing time. The user region of interest is the region whose support is at least the minimum support $S_{\min}$ and at most the maximum support $S_{\max}$.

A group whose support is below the minimum support $S_{\min}$ is collapsed to raise its support so that it falls within the user region of interest, and a group whose support is above the maximum support $S_{\max}$ is expanded to lower its support so that it falls within the region. This steers the dimension attribute value groups into the user region of interest as far as possible, which not only manages the dynamic data cube's memory efficiently but also provides fine-grained dimension attribute group ranges for the user region of interest.
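The support-based decision described above can be sketched as a small Python function. This is only an illustration: the function name `group_action`, the action labels, and the choice to treat a group exactly at a threshold as inside the region of interest are assumptions, and the actual collapse and expand operations on the group tree are not modeled.

```python
def group_action(count, total, s_min, s_max):
    """Decide what to do with an attribute-value group based on its support.

    support = count / total. Groups inside [s_min, s_max] are kept as is
    (boundary handling is an assumption of this sketch).
    """
    support = count / total
    if support < s_min:
        return "collapse"  # merge into a coarser group to raise support
    if support > s_max:
        return "expand"    # split into finer groups to lower support
    return "keep"

actions = [group_action(c, 100, s_min=0.05, s_max=0.30) for c in (1, 10, 50)]
```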

However, building a dynamic group tree for every dimension attribute of a multidimensional data stream means grouping all attribute values, which consumes a great deal of memory and takes a long time to process.

The continuous query applied in the present invention is a technique for analyzing data streams in a DSMS (Data Stream Management System). A traditional query is a one-time query that obtains its result by repeatedly scanning finite data. A data stream, which arrives continuously and without end, cannot be stored in limited memory, so queries are instead registered in advance and evaluated continuously against the stream to obtain results.
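The contrast between a one-time query and a continuous query can be sketched in Python as follows. The sketch assumes a tuple-based window and a simple COUNT per key; the class name `ContinuousCount` and its interface are hypothetical and are not the API of any particular DSMS.

```python
from collections import deque

class ContinuousCount:
    """A registered query: COUNT per key over the last `window` tuples."""

    def __init__(self, window):
        # Bounded buffer: the oldest tuple is dropped automatically.
        self.buffer = deque(maxlen=window)

    def push(self, key):
        """Accept one arriving tuple and return the current result."""
        self.buffer.append(key)
        counts = {}
        for k in self.buffer:
            counts[k] = counts.get(k, 0) + 1
        return counts

query = ContinuousCount(window=3)  # registered once, before data arrives
results = [query.push(k) for k in ("a", "b", "a", "b")]
```

The query object is created once and produces a fresh result for every arriving tuple, while memory stays bounded by the window size.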

Sliding-window aggregation divides the stream data by a fixed time span or a fixed number of tuples, evaluates the query on each window, and repeats this continuously as the window slides. A window whose size is determined by time is called a time-based sliding window, and a window whose size is determined by a number of tuples is called a tuple-based sliding window. One technique for sliding-window aggregate queries is the pane, which uses a linear data structure. Panes partially aggregate the input stream, which reduces the buffer needed for a window and lowers computation cost by sharing partial aggregates between windows. The stream is divided into units whose size is the greatest common divisor of the slide and the window size, and the aggregate of each unit is stored in place of its tuples.
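The pane technique described above can be sketched in Python for a tuple-based window and a SUM aggregate. The function name `pane_window_sums` and the sample stream are illustrative assumptions; the pane size follows the greatest-common-divisor rule stated in the text.

```python
from math import gcd

def pane_window_sums(values, window, slide):
    """Tuple-based sliding-window SUM computed from pane partial sums."""
    pane = gcd(window, slide)
    # One partial aggregate per complete pane, stored instead of the tuples.
    pane_sums = [sum(values[i:i + pane])
                 for i in range(0, len(values) - pane + 1, pane)]
    per_window, per_slide = window // pane, slide // pane
    # Each window result is assembled from shared pane aggregates.
    return [sum(pane_sums[s:s + per_window])
            for s in range(0, len(pane_sums) - per_window + 1, per_slide)]

stream = list(range(1, 13))  # hypothetical measures 1..12
sums = pane_window_sums(stream, window=6, slide=4)
```

With a window of 6 and a slide of 4 the pane size is 2, so each window is the sum of three pane aggregates, and the pane covering tuples 5 and 6 is shared by both windows.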

Recent research on continuous-query-based OLAP in data stream environments maps the data within a time-based window into a multidimensional structure and applies cube-level operators. A data stream resembles a fact table and is converted into multidimensional data stream form, written $S(D_1.A_{11}, \ldots, D_{n-1}.A_{n-1}, T.t, m_1, \ldots, m_m)$. Here $D_1.A_{11}, \ldots, D_{n-1}.A_{n-1}$ denote the dimensions and the hierarchy levels of their attributes, $T.t$ is the time at which the tuple occurred (its timestamp), and $m_1, \ldots, m_m$ are the measures. The source data stream is expressed at the lowest hierarchy level and can be moved to a higher level (rolled up) using a mapping function $g$. The 'Multi-Dimensional Stream Query Language', a study of query statements for multidimensional data streams, completed a query language for multidimensional data streams in a real-time data stream environment using modes named PERIODIC and CONTINUOUS, and used a hash table to support hierarchical data.
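As a small illustration of rolling up with a mapping function $g$, the Python sketch below aggregates stream tuples from a lower hierarchy level to the next higher one. The dictionary `G`, the tuple layout `(attribute_value, timestamp, measure)`, and the function name `roll_up` are assumptions chosen for the example; using a dictionary loosely mirrors the hash-table approach mentioned above.

```python
from collections import defaultdict

# Hypothetical mapping g from a lower level (city) to a higher level (country).
G = {"Seoul": "Korea", "Busan": "Korea", "Tokyo": "Japan"}

def roll_up(tuples, g):
    """Sum the measure of (value, timestamp, measure) tuples one level up."""
    totals = defaultdict(int)
    for value, _timestamp, measure in tuples:
        totals[g[value]] += measure
    return dict(totals)

stream = [("Seoul", 1, 3), ("Busan", 2, 4), ("Tokyo", 3, 5)]
rolled = roll_up(stream, G)
```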

Such prior art approaches new data-cube-level operators for querying the multidimensional data model effectively, but does not consider processing techniques across data cubes.

The configuration of a multidimensional hierarchical data cube system according to the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram of the configuration of a multidimensional hierarchical data cube system according to the present invention.

As shown in FIG. 1, the multidimensional hierarchical data cube system according to the present invention includes a client terminal (100), a database (200) that is connected to the client terminal (100) through a network and stores the information requested by the client terminal (100), and a database management system (DBMS, 300) that manages the database (200) through the network.

클라이언트 단말기(100)는 사용자가 이용하는 PC, 노트북, 넷북, PDA, 모바일, 스마트폰 등의 통상의 컴퓨팅 단말기이다. 사용자는 클라이언트 단말기(100)를 이용하여 데이터베이스관리시스템(300)에 접속하여 필요한 정보를 검색할 수 있다. 이를 위해 클라이언트 단말기(100)에는 필요한 정보를 검색하기 위하여 웹브라우저가 설치될 수 있다. The client terminal 100 is a conventional computing terminal such as a PC, a notebook, a netbook, a PDA, a mobile phone, a smart phone, and the like. The user can access the database management system 300 using the client terminal 100 to retrieve necessary information. To this end, the client terminal 100 may be provided with a web browser to retrieve necessary information.

데이터베이스(200)는 클라이언트 단말기(100) 또는 사용데이터베이스관리시스템(300)에서 필요한 데이터를 저장하는 통상의 저장매체로서, 그 구성은 접근 및 검색의 용이성 및 효율성 등을 감안하여 데이터베이스 구축이론에 의한 구조로 마련되는 것이 바람직하다.The database 200 is a conventional storage medium for storing necessary data in the client terminal 100 or the use database management system 300. The configuration is structured by the database construction theory in view of ease of access and search, .

The network need only be capable of interconnecting the client terminal (100), the database (200), and the database management system (300), by wire or wirelessly, and is not limited to any particular means of communication.

The database management system (300) may include one or more servers, and each server may comprise hardware and/or software (for example, threads, processes, or computing devices). For example, the database management system (300) may function as a server for the algorithms that perform data warehousing and OLAP and for the corresponding system. When the database management system (300) functions as an OLAP server, it may include software designed to facilitate data presentation, together with various representations of warehouse data, including metadata and data relationship tables.

As shown in FIG. 1, the database management system (300) according to the present invention comprises: a closure table data structure forming unit (310) that applies the level difference (depth) and the hierarchy path ID (pathID) expressed in a closure table in order to manage multidimensional hierarchical data streams; a multidimensional hierarchical data cube construction unit (320) that generates a data cube from multi-hierarchy data on the basis of continuous queries; and a minimum cost setting unit (330) that forms a minimum-cost cube tree by selecting the least expensive parent cuboid or ancestor cuboid.

The closure table data structure forming unit (310), the multidimensional hierarchical data cube construction unit (320), and the minimum cost setting unit (330) may be provided as components executed by a program, or may be realized by a computer-readable recording medium on which the program is recorded.

The functions of the closure table data structure forming unit (310), the multidimensional hierarchical data cube construction unit (320), and the minimum cost setting unit (330) are described in detail with reference to FIG. 2.

FIG. 2 is a flowchart illustrating the processing of the multidimensional hierarchical data cube according to the present invention.

As shown in FIG. 2, the method of processing a multidimensional hierarchical data cube according to the present invention proceeds as follows. First, the closure table data structure forming unit (310) expresses multi-hierarchy data in SQL (Structured Query Language) statements (S10). The multidimensional hierarchical data cube construction unit (320) then generates a data cube, on the basis of continuous queries, from the multi-hierarchy data expressed in step S10 (S20). For the data cube generated in step S20, the number of generated tuples, which is the cost of each cuboid, is monitored (S30). According to the monitoring in step S30, the minimum cost setting unit (330) aggregates the minimum-cost parent cuboid (S40), and according to that result the child cuboid aggregates at minimum cost (S50).

In other words, the multidimensional hierarchical data cube processing method according to the present invention expresses hierarchical data with conventional SQL statements and generates data cubes on the basis of continuous queries. It then monitors the number of generated tuples, which is the cost of each cuboid, and has each child cuboid aggregate the aggregation result of its minimum-cost parent cuboid, with the aim of improving performance.

The representation of the multiple hierarchy structure of the multidimensional hierarchical data cube in step S10 is described next.

For example, data warehouse systems have supported decision-making by defining hierarchies and providing drill-down/roll-up functions. Drill-down is used in two broad senses: moving along the hierarchy of an attribute such as "Region" to perform aggregation, or adding and removing dimensions. For example, drilling down on the "Region" attribute aggregates sales separately for "Korea", "France", and "UK". Roll-up, by contrast, raises the conceptual hierarchy level, moving "Korea" up to "Asia" and "France" and "UK" up to "Europe", and aggregates sales at that level.
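By way of non-limiting illustration, the drill-down and roll-up described above can be sketched in Python. The sales figures, the `CONTINENT` mapping, and the `aggregate` helper are assumptions made only for this sketch and are not taken from the embodiment.

```python
from collections import defaultdict

# Hypothetical sales facts: (country, units sold).
SALES = [("Korea", 5), ("France", 3), ("UK", 4), ("Korea", 2)]

# Hypothetical concept hierarchy for the "Region" attribute: country -> continent.
CONTINENT = {"Korea": "Asia", "France": "Europe", "UK": "Europe"}


def aggregate(rows, key=lambda country: country):
    """Sum units per group, where `key` maps a country to its group label."""
    totals = defaultdict(int)
    for country, units in rows:
        totals[key(country)] += units
    return dict(totals)


by_country = aggregate(SALES)                       # drill-down level
by_continent = aggregate(SALES, key=CONTINENT.get)  # roll-up level
print(by_country)    # {'Korea': 7, 'France': 3, 'UK': 4}
print(by_continent)  # {'Asia': 7, 'Europe': 7}
```

The same facts are aggregated twice; only the grouping key changes, which is what moving along the "Region" hierarchy amounts to.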

The method of expressing hierarchical data in step S10 is now described in detail. Various modeling approaches have been proposed for expressing hierarchical data in ordinary SQL, including the adjacency list, path enumeration, nested sets, and the closure table.

Existing models have focused on representing a single hierarchy. In general, however, a dimension can have multiple hierarchies, which can be classified as follows according to how the hierarchies share levels.

FIG. 3 shows an example of data expressed as a closure table. As indicated in FIG. 3(a), there are 'crossing multiple hierarchies', which share an upper level, and 'branching multiple hierarchies', which share lower levels and then branch as they ascend to upper levels.

To reflect this, the closure table data structure forming unit (310) of the present invention uses a closure table data structure. A closure table represents every path in the tree, not only the paths of direct parent-child relationships. Every pair of nodes in a parent/child relationship is stored as one row of the table. The closure table is the only model that allows a node to belong to several trees, and it is the simplest way to store a hierarchy.

In FIG. 3, (a) is a crossing multiple hierarchy and (b) shows it in the form of a relational table. The lower level 'Seoul' belongs to two hierarchies, and each hierarchy path is defined as a 'Path'. FIG. 3(c) is a conventional closure table, which records the depth difference between levels for each parent-child relationship. The conventional approach can represent every hierarchy path, but when a single data item has multiple hierarchy paths, it is difficult to identify which path a given row came from. In the present invention, therefore, as shown in FIG. 3(d), an ID is assigned to identify the hierarchy path of each parent-child node relationship, and this ID is defined as the path ID (pathID). The level difference (depth) and the hierarchy path ID (pathID) of the parent-child relationships expressed in the closure table are used to manage multidimensional hierarchical data streams.
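By way of non-limiting illustration, a closure table carrying both depth and pathID can be sketched in Python with an in-memory SQLite table. The table name `closure`, the helper `ancestors`, and the node names on the second path are assumptions made for this sketch; the actual contents of FIG. 3 are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE closure (parent TEXT, child TEXT, depth INTEGER, pathID INTEGER)"
)

# Two hypothetical hierarchy paths that share the lowest level 'Seoul'.
PATHS = {
    0: ["Seoul", "Korea", "Asia"],
    1: ["Seoul", "Capital Area", "East Zone"],
}

for path_id, levels in PATHS.items():
    # Store every (ancestor, descendant) pair on the path, including self-pairs,
    # together with the level difference and the path the pair belongs to.
    for lo, child in enumerate(levels):
        for hi in range(lo, len(levels)):
            conn.execute(
                "INSERT INTO closure VALUES (?, ?, ?, ?)",
                (levels[hi], child, hi - lo, path_id),
            )


def ancestors(node, path_id):
    """Return (parent, depth) pairs for `node` along one hierarchy path."""
    rows = conn.execute(
        "SELECT parent, depth FROM closure "
        "WHERE child = ? AND pathID = ? ORDER BY depth",
        (node, path_id),
    )
    return rows.fetchall()


print(ancestors("Seoul", 0))  # [('Seoul', 0), ('Korea', 1), ('Asia', 2)]
print(ancestors("Seoul", 1))
```

Without the pathID column, the rows for 'Seoul' from both paths would be indistinguishable, which is the ambiguity the path ID is introduced to remove.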

In step S20, taking advantage of the closure table structure's ability to represent the data of every hierarchy level, the multidimensional hierarchical data cube construction unit (320) builds a hierarchical data stream model with a single join, which can be expressed as follows.

Dimension $D_i = \{a_{0}^{0}, \ldots, a_{j}^{k}\}$

($a_{j}^{k}$: the attribute value at hierarchy level $j$ with hierarchy path ID $k$)

$S_{stream} = \{timestamp, \ldots\}$

$H_{stream} = \{timestamp, \ldots\},\ \{timestamp, \ldots\},\ \ldots$

A single data stream is expressed with attribute values of the lowest hierarchy level, and dimension D_i consists of k hierarchy paths and at most j levels. |D_i| denotes the number of elements that all hierarchy levels and hierarchy paths of the attribute can represent. For example, if D₁ = {day⁰, month⁰, year⁰, week¹, month¹, year¹}, there are two hierarchy paths, the maximum hierarchy level is level 4, and |D₁| equals 7. Accordingly, when |D_i| = k for every dimension of an n-dimensional data stream, a single data stream can be hierarchized into a total of n^k hierarchical data streams.
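By way of non-limiting illustration, the single join that turns a lowest-level stream into hierarchical tuples can be sketched in Python with SQLite. The table names `closure` and `s_stream`, their columns, and the sample rows are assumptions made for this sketch; a real continuous-query engine would apply the join to arriving tuples rather than to a stored table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE closure (parent TEXT, child TEXT, depth INTEGER, pathID INTEGER);
    CREATE TABLE s_stream (ts INTEGER, region TEXT, amount INTEGER);
    """
)
# Hypothetical closure rows: each city maps to itself (depth 0) and its country.
conn.executemany(
    "INSERT INTO closure VALUES (?, ?, ?, ?)",
    [
        ("Seoul", "Seoul", 0, 0), ("Korea", "Seoul", 1, 0),
        ("Paris", "Paris", 0, 0), ("France", "Paris", 1, 0),
    ],
)
# Hypothetical source tuples carrying only the lowest-level attribute value.
conn.executemany(
    "INSERT INTO s_stream VALUES (?, ?, ?)",
    [(1, "Seoul", 10), (2, "Paris", 20), (3, "Seoul", 5)],
)

# One join attaches every hierarchy level (and its pathID) to each source tuple.
H_STREAM = conn.execute(
    """
    SELECT s.ts, c.parent, c.depth, c.pathID, s.amount
    FROM s_stream AS s JOIN closure AS c ON c.child = s.region
    ORDER BY s.ts, c.depth
    """
).fetchall()

for row in H_STREAM:
    print(row)
```

Each source tuple appears once per hierarchy level it can be lifted to, so a downstream cuboid can filter on depth and pathID to pick the level it aggregates.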

Next, the minimum cost setting unit (330) sets up the cuboid cost model.

A multidimensional data cube is formed as a lattice of cuboids. Given these connections, the computational load can be reduced by computing the aggregate of one cuboid from the aggregation result of another cuboid. For example, the (Company, Region, Color) cuboid and the (Company, *, *) cuboid can form a parent-child relationship. The cuboid cost model is the metric used to establish such connections. The cost of a cuboid is defined as follows.

Cost(C) = the number of distinct groups within a given window size (W)

That is, the cost corresponds to the number of comparison operations performed when a cuboid receives results from its parent cuboid and carries out its aggregation. For example, if the cost of (Company, Region, *) is 100 and the cost of (Company, *, Color) is 50, the cuboid (Company, *, *) benefits from processing the aggregation result of the latter.
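By way of non-limiting illustration, the cost definition above can be sketched in Python as a count of distinct groups in one window. The function name `cuboid_cost` and the four sample tuples are assumptions made for this sketch.

```python
def cuboid_cost(window, dims):
    """Cost(C): number of distinct groups over `dims` inside one window."""
    return len({tuple(row[d] for d in dims) for row in window})


# Hypothetical window of W = 4 tuples.
WINDOW = [
    {"Company": "A", "Region": "Seoul", "Color": "red"},
    {"Company": "A", "Region": "Paris", "Color": "red"},
    {"Company": "B", "Region": "Seoul", "Color": "red"},
    {"Company": "B", "Region": "London", "Color": "blue"},
]

# Candidate parents of the (Company, *, *) cuboid and their costs.
costs = {
    ("Company", "Region"): cuboid_cost(WINDOW, ("Company", "Region")),
    ("Company", "Color"): cuboid_cost(WINDOW, ("Company", "Color")),
}
cheapest_parent = min(costs, key=costs.get)
print(costs)            # {('Company', 'Region'): 4, ('Company', 'Color'): 3}
print(cheapest_parent)  # ('Company', 'Color')
```

The parent that emits fewer groups hands fewer tuples to the child, so the child performs fewer comparisons when it aggregates.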

Next, the data cube model that processes hierarchical data, and the technique for forming a data cube tree that connects its cuboids on a cost basis, are described in detail.

First, to process hierarchical data, a 'hyper data cube' is defined by applying the hyper-node concept to the multidimensional data model of traditional OLAP.

A hyper node is expressed as (G, N, E) and is a data structure that contains sub-nodes forming relationships. G is the label of the hyper node, N is the finite set of basic nodes that the hyper node contains, and E is the set of relationships (edges) between the nodes in N.

A hyper cuboid can be expressed as (G, H, P), where G denotes the analysis dimensions of the cuboid, H the hierarchy levels of the dimensions, and P the hierarchy path IDs of the dimensions. In other words, a hyper cuboid contains the cuboids that process hierarchical data.

FIG. 4 is a conceptual diagram of a hyper data cube. In the three-dimensional base hyper cuboid, the first dimension, Company, has a single hierarchy level; the second dimension, Region, has two hierarchy levels; and the third dimension, Color, has a single hierarchy level. Under these conditions, the base hyper cuboid contains a total of two data cuboids that process hierarchical data.
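By way of non-limiting illustration, the (G, H, P) description of a hyper cuboid can be sketched in Python. The class name `HierCuboid`, the function `hyper_cuboid`, and the encoding of levels and path IDs as small integers are assumptions made for this sketch.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class HierCuboid:
    """One hierarchical cuboid inside a hyper cuboid, written (G, H, P)."""
    dims: tuple    # G: analysis dimensions
    levels: tuple  # H: hierarchy level chosen for each dimension
    paths: tuple   # P: hierarchy path ID chosen for each dimension


def hyper_cuboid(dims, options):
    """Enumerate the hierarchical cuboids of one hyper cuboid.

    `options[d]` lists the (level, pathID) pairs available for dimension d.
    """
    members = []
    for combo in product(*(options[d] for d in dims)):
        levels = tuple(level for level, _ in combo)
        paths = tuple(path for _, path in combo)
        members.append(HierCuboid(dims, levels, paths))
    return members


# Assumed encoding of the example above: Region has two levels on one path,
# while Company and Color each have a single level.
OPTIONS = {
    "Company": [(0, 0)],
    "Region": [(0, 0), (1, 0)],
    "Color": [(0, 0)],
}
BASE = hyper_cuboid(("Company", "Region", "Color"), OPTIONS)
print(len(BASE))  # 2
for member in BASE:
    print(member.levels, member.paths)
```

The count of two follows from taking one (level, path) choice per dimension: 1 × 2 × 1.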

To identify the hierarchical cuboids that process such hierarchical data, each is assigned a unique ID that expresses its hierarchy levels and hierarchy paths as integers; this ID is called a 'control flag'. For example, h₀₀₀ means that every dimension is at its lowest hierarchy level, that is, source data, and p₀₀₀ gives the hierarchy path ID of each dimension. The cuboids (Company, City, Color) and (Company, Country, Color) contained in the base cuboid of FIG. 4 can be written as h₀₀₀p₀₀₀ and h₀₁₀p₀₀₀, respectively. Control flags are used to connect hyper data cuboids that stand in a parent-child relationship.
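By way of non-limiting illustration, a control flag can be sketched in Python as a short string built from per-dimension levels and path IDs. The string layout, the single-digit assumption, and the helper names `control_flag` and `parse_flag` are assumptions made for this sketch; the embodiment only states that the flag expresses levels and paths as integers.

```python
def control_flag(levels, paths):
    """Build a flag such as 'h010p000' from per-dimension levels and path IDs.

    Assumes every level and path ID is a single decimal digit.
    """
    if len(levels) != len(paths):
        raise ValueError("levels and paths must cover the same dimensions")
    return "h" + "".join(map(str, levels)) + "p" + "".join(map(str, paths))


def parse_flag(flag):
    """Inverse of control_flag: return (levels, paths) as integer tuples."""
    level_part, path_part = flag[1:].split("p")
    return (
        tuple(int(ch) for ch in level_part),
        tuple(int(ch) for ch in path_part),
    )


# (Company, City, Color): every dimension at its lowest level.
print(control_flag((0, 0, 0), (0, 0, 0)))  # h000p000
# (Company, Country, Color): Region raised by one level.
print(control_flag((0, 1, 0), (0, 0, 0)))  # h010p000
print(parse_flag("h010p000"))              # ((0, 1, 0), (0, 0, 0))
```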

Next, the minimal cost data cube tree applied in steps S40 and S50 of the present invention is described.

A hyper data cube consists of 2ⁿ hyper cuboids, and every hyper cuboid other than the base hyper cuboid and the apex hyper cuboid stands in parent-child relationships. A child node may have one or more parent nodes. For example, hyper cuboid (A) has hyper cuboid (AB) and hyper cuboid (AC) as parents. To process the data cube efficiently, a 'minimal cost data cube tree' is built that connects all nodes without forming a cycle. Here, cost means the number of result tuples produced by processing at a single node.
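By way of non-limiting illustration, the 2ⁿ cuboids and the parents of a given cuboid can be enumerated in Python. Representing a cuboid as a tuple of its retained dimension names, and the helper names `cuboids` and `parents`, are assumptions made for this sketch.

```python
from itertools import combinations

DIMS = ("A", "B", "C")


def cuboids(dims):
    """All 2**n subsets of `dims`, from the base cuboid down to the apex ()."""
    return [c for r in range(len(dims), -1, -1) for c in combinations(dims, r)]


def parents(cuboid, dims):
    """Cuboids that keep exactly one more dimension than `cuboid`."""
    missing = [d for d in dims if d not in cuboid]
    return [tuple(x for x in dims if x in cuboid or x == d) for d in missing]


print(len(cuboids(DIMS)))     # 8
print(parents(("A",), DIMS))  # [('A', 'B'), ('A', 'C')]
```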

Starting from the n-dimensional base cuboid, adjacent lower-dimensional data cuboids are explored top-down, and parent and child cuboids are linked in 1:1 relationships. A base hyper data cube comes to have multiple relationships in two cases, each of which is defined as follows.

Case 1. Non-hierarchical cuboid connection

When there is no hierarchy information, only the connections between hyper cuboids are considered. In FIG. 4, the cuboid (Company, *, *) has parent-child relationships with the cuboids (Company, Region, *) and (Company, *, Color). In a non-hierarchical data cube, when multiple (n) parent cuboids exist in this way, the parent cuboid with the lowest cost is selected to form the minimum-cost cube tree.
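By way of non-limiting illustration, Case 1 can be sketched in Python by linking every non-base cuboid to its cheapest parent. The costs 100 and 50 follow the earlier example; all other cost values and the helper names `parents` and `min_cost_tree` are assumptions made for this sketch.

```python
from itertools import combinations


def parents(child, dims):
    """Cuboids obtained by adding exactly one missing dimension to `child`."""
    return [
        tuple(x for x in dims if x in child or x == d)
        for d in dims
        if d not in child
    ]


def min_cost_tree(dims, cost):
    """Link every non-base cuboid to its cheapest parent (child -> parent)."""
    tree = {}
    for size in range(len(dims) - 1, -1, -1):
        for child in combinations(dims, size):
            tree[child] = min(parents(child, dims), key=cost.__getitem__)
    return tree


DIMS = ("Company", "Region", "Color")
# Cost of each cuboid as the number of result tuples it produces (hypothetical).
COST = {
    ("Company", "Region", "Color"): 200,
    ("Company", "Region"): 100,
    ("Company", "Color"): 50,
    ("Region", "Color"): 80,
    ("Company",): 10,
    ("Region",): 20,
    ("Color",): 5,
}
TREE = min_cost_tree(DIMS, COST)
print(TREE[("Company",)])  # ('Company', 'Color')
print(TREE[()])            # ('Color',)
```

Because each child receives exactly one parent, and a parent always has more dimensions than its child, the links form a tree rooted at the base cuboid with no cycles.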

Case 2. Hierarchical cuboid connection

When attribute hierarchy information exists for a dimension, hierarchical cuboids are formed inside the hyper cuboid.

FIG. 5 shows how hierarchical cuboids are connected. FIG. 5(a) illustrates a hierarchical data cube configuration in which, among the dimensions Company, Region, and Color, only Region has a maximum hierarchy level of 2 and two hierarchy paths. The base cuboid (Company, Region, Color) and the cuboid (Company, *, Color) have a 1:1 relationship. The hierarchical cuboids contained in the base cuboid, however, have an n:1 relationship, and the minimum-cost hierarchical cuboid is selected from among them. FIG. 5(b) illustrates the connections between hierarchical cuboids when the hyper cuboids themselves have an n:1 relationship. The cuboid (Company, *, *) has overlapping relationships with two parent cuboids, which contain four hierarchical cuboids and one hierarchical cuboid, respectively. Among these, the cuboid that incurs the lowest cost is selected.
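By way of non-limiting illustration, the selection in the FIG. 5(b) situation can be sketched in Python: two parent hyper cuboids hold four and one hierarchical cuboids, and the cheapest of the five is chosen. The control flags and cost values are hypothetical, as is the helper name `cheapest_hier_parent`.

```python
# Hypothetical costs (distinct groups per window), keyed by parent hyper cuboid
# and then by the control flag of each hierarchical cuboid it contains.
PARENTS = {
    ("Company", "Region"): {"h00p00": 40, "h01p00": 25, "h01p01": 30, "h02p01": 12},
    ("Company", "Color"): {"h00p00": 18},
}


def cheapest_hier_parent(parents):
    """Return (hyper cuboid, control flag, cost) of the cheapest candidate."""
    candidates = [
        (hyper, flag, cost)
        for hyper, members in parents.items()
        for flag, cost in members.items()
    ]
    return min(candidates, key=lambda c: c[2])


print(cheapest_hier_parent(PARENTS))  # (('Company', 'Region'), 'h02p01', 12)
```

In this sketch a coarse-level hierarchical cuboid of (Company, Region) beats the single cuboid of (Company, Color), which a comparison at the hyper-cuboid level alone would not reveal.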

Next, the partial data cube tree applied in the present invention is described.

The methodology for building a data cube tree limited to the cuboids of user interest, that is, a partial data cube tree, is as follows. The more dimensions the analyzed data has, the larger the full data cube becomes, so it must be possible to build the data cube tree partially. In a full cube tree, however, cuboids at every level must be selected, so even when only part of the cube tree is executed, cuboids at levels the user does not want must also be executed.

FIG. 6 illustrates a partial data cube configuration. For example, when the cuboids of user interest are the base cuboids (Company, Country, Color) and (Company, City, Color) and the one-dimensional cuboid (*, City, *), the full cube configuration (parent-child connection) must execute the cuboid (Company, City, *), which is not of user interest, before the aggregation of (*, City, *) can be performed. To reduce this unnecessary computational load, the cuboid is connected to an ancestor cuboid instead; this is called an 'ancestor-child connection'. When a cuboid of user interest has no adjacent parent among the cuboids of user interest, the search is repeated one level higher at a time until a parent cuboid is found, and a partial data cube tree connected to that ancestor cuboid is built, with the aim of improving performance.
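By way of non-limiting illustration, the ancestor-child connection can be sketched in Python. Cuboids are written as tuples of the attributes they keep, and the level-by-level upward search is simplified to picking the supersets of interest with the fewest dimensions; both choices, and the helper names, are assumptions made for this sketch.

```python
def nearest_ancestors(target, interest):
    """Cuboids of interest that contain `target` and have the fewest dimensions."""
    supersets = [c for c in interest if set(target) < set(c)]
    if not supersets:
        return []
    fewest = min(len(c) for c in supersets)
    return [c for c in supersets if len(c) == fewest]


def partial_tree(interest):
    """Map each cuboid of interest to its nearest ancestors of interest."""
    return {c: nearest_ancestors(c, interest) for c in interest}


# Cuboids of user interest from the example above; '*' positions are omitted.
INTEREST = [
    ("Company", "City", "Color"),
    ("Company", "Country", "Color"),
    ("City",),
]
TREE = partial_tree(INTEREST)
print(TREE[("City",)])  # [('Company', 'City', 'Color')]
```

The intermediate cuboid (Company, City, *) never appears in the result, so it is never executed. When several nearest ancestors exist, the cheapest of them could be chosen with the same cost comparison used for the full tree.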

Next, the performance evaluation of the system according to the present invention is described, beginning with the experimental environment.

To verify the performance of the hierarchical multidimensional data model, two data sets were used: Zipf-distributed data and real purchase data. The Zipf-distributed data become more skewed as the Zipf value increases. The real purchase data log covers purchases from August 1 to August 27, 2014, and consists of 49,871 tuples with four dimensions and two measures. The experiments were run on an Intel Core2Duo E6600 @ 2.4 GHz with 4 GB of memory under CentOS 6.5 or later.

Next, experiments on the full data cube were performed.

Specifically, the factors affecting the performance of the non-hierarchical data cube were analyzed.

FIG. 7 is a graph comparing performance by window size. The purchase data were organized into a four-dimensional data cube and transmitted at 2,000 to 10,000 tuples per second for 10 seconds, and performance was compared for window sizes of 10, 100, and 1,000. A window size of 10 gave the fastest processing time and the lowest memory usage, and a window size of 1,000 gave the slowest processing time and the highest memory usage. This is because a larger window contains more distinct groups, which increases the number of comparison operations.

FIG. 8 is a graph comparing performance by Zipf skew. It shows the result of analyzing how the distributional characteristics of the data and the number of dimensions affect the aggregation load. This experiment used the setting that was fastest and used the least memory above, a window size of 10, and compared performance while varying the Zipf skew at a transmission rate of 10,000 tuples per second. A larger Zipf value means a more skewed data distribution; because the number of distinct groups within the window decreases, memory usage can be seen to fall.
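By way of non-limiting illustration, the effect of skew on the number of distinct groups in a window can be sketched in Python. The cardinality of 50, the window of 100 tuples, the seed, and the helper names are assumptions made for this sketch and do not reproduce the experimental settings.

```python
import random


def zipf_sample(rng, cardinality, skew, size):
    """Draw `size` values from 1..cardinality with weight 1 / rank**skew."""
    values = range(1, cardinality + 1)
    weights = [1.0 / (v ** skew) for v in values]
    return rng.choices(values, weights=weights, k=size)


def distinct_groups(window):
    """Number of distinct group keys in one window (the cuboid cost)."""
    return len(set(window))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
uniform = distinct_groups(zipf_sample(rng, 50, 0.0, 100))  # skew 0: uniform
skewed = distinct_groups(zipf_sample(rng, 50, 2.0, 100))   # strongly skewed
print(uniform, skewed)
```

With strong skew most tuples fall into a few heavy groups, so the window holds far fewer distinct groups than under a uniform distribution, which matches the lower memory usage reported above.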

Next, experiments on the cost-based data cube tree were performed.

The non-hierarchical data cube tree is described first.

That is, the performance obtained when the data cube tree is built on a cost basis was verified. In the Zipf-distributed data, every attribute has the same cardinality, so the cost difference between cuboids of the same dimensionality was negligible. The real online shopping mall purchase data were therefore used. The cardinalities of the data attributes were adjusted for the cost-based data cube tree experiment, and Table 1 shows the cost-adjusted data.

FIG. 9 is a graph comparing performance by data set. In this performance experiment on the adjusted data sets, the case 4 data, which were adjusted most heavily, showed the fastest processing time and the lowest memory usage. As confirmed in the earlier data distribution experiment, this is because the number of distinct groups within a fixed window size became smaller.

FIG. 10 is a graph of the maximum/minimum cost performance comparison (case 1, case 2), comparing the performance of the maximum-cost and minimum-cost cube trees built for case 1 and case 2. Because case 2 has fewer attribute groups, it showed slightly faster processing time and lower memory usage than case 1. With the cost-based cube tree, a slight performance difference between the two costs emerged as the number of processed tuples increased.

FIG. 11 is a graph of the maximum/minimum cost performance comparison (case 3, case 4), showing the performance difference when trees are built on a cost basis for data cases 3 and 4. In this experiment the number of groups in the data was markedly smaller, so a wider gap than in the previous experiment could be observed. The gap in memory usage and processing time between the minimum-cost and maximum-cost trees tended to grow slightly as the transmission rate per second increased. For case 4, the cost gap was largest at the low initial transmission rates. This appears to be because, with fewer groups, the number of aggregation operations performed by the lower cuboids drops sharply when few tuples are processed.

In the cost-based non-hierarchical data cube experiments, when the difference between the maximum and minimum cost among cuboids of the same dimensionality was compared, the cost difference was largest for case 3, followed by case 4, case 2, and case 1. That is, the larger the cost difference between cuboids at the same level, the better the algorithm according to the present invention performs.

Next, experiments on the hierarchical data cube tree were performed.

The performance of the cost-based hyper data cube tree was measured while increasing the hierarchy levels of the dimensions. A four-dimensional data cube was built, and performance was compared while increasing the number (n) of hierarchical data cubes. Specifically, hyper data cube trees were built for adjusted data set case 2, where the cost difference had been small in the previous experiment, and for case 3, where the cost difference began to grow.

FIG. 12 is a graph comparing maximum/minimum cost performance by hierarchy level, showing the performance difference when the cost-based tree is built. The larger the number (n) of hierarchical data cuboids contained in the base cuboid, the larger the cost difference between cuboids of the same dimensionality. That is, at n = 4 the difference in memory usage between the minimum-cost and maximum-cost trees can be seen to widen.

In non-hierarchical data cuboids, no maximum/minimum cost exists between hyper cuboids in a 1:1 relationship, but hierarchical data cuboids have an n:1 relationship under the same conditions, so a maximum/minimum cost difference arises. Accordingly, the gap in memory usage gradually widened, while the gap in processing time remained nearly constant.

In addition, experiments on the cost-based partial data cube tree were performed.

The performance of the parent-child connection technique and the ancestor-child connection technique was tested when two cuboids, each a vertex of a cube tree within the full cube tree configuration, were selected. The two techniques were compared under the assumption that the user had set the base cuboid and a one-dimensional cuboid as the cuboids of interest.

FIG. 13 is a graph comparing performance by gap size. Using 10-dimensional data with a Zipf value of 1.8, performance was compared when executing the cuboids between one vertex and the other. Here, the gap was varied by keeping the one-dimensional cuboid fixed and reducing the number of dimensions of the base cuboid by two at a time. The larger the gap, the more unnecessary cuboids on the path must be executed, so the differences in processing time and memory usage can be seen to widen.

As described above, the present invention provides a continuous-query-based hierarchical multidimensional data processing technique. That is, by expressing hierarchical data with conventional SQL (Structured Query Language) statements, generating data cubes on the basis of continuous queries, aggregating the minimum-cost parent cuboid, and having the child cuboid aggregate at minimum cost according to that result, a performance improvement for OLAP (On-Line Analytical Processing) in database and data warehouse systems was achieved.

Although the invention made by the present inventors has been described in detail with reference to the above embodiments, the present invention is not limited to those embodiments and may of course be modified in various ways without departing from its gist.

By using the multidimensional hierarchical data cube system and processing method according to the present invention, complex queries can be processed at high speed and performance can be improved.

100: Client terminal
200: Database
300: Database management system

Claims

A multidimensional hierarchical data cube processing method, being a data processing method in a multidimensional hierarchical data cube processing system comprising a client terminal, a database connected to the client terminal through a network and storing information requested by the client terminal, and a database management system (DBMS) managing the database through the network,
wherein the data processing method performed by the database management system comprises:
(a) expressing multidimensional hierarchical data in an SQL (Structured Query Language) statement;
(b) generating a data cube based on a continuous query for the multidimensional hierarchical data expressed in step (a);
(c) monitoring the number of generated tuples, which is the cost of each cuboid, for the data cube generated in step (b); and
(d) aggregating the minimum-cost parent cuboid according to the monitoring in step (c), and having a child cuboid aggregate at minimum cost according to the result.

The method of claim 1,
wherein the expressing in step (a) applies a closure table data structure that represents all paths of a tree.

3. The method of claim 2,
wherein the closure table data structure defines each hierarchy path as 'Path', defines a path ID for identifying a hierarchy path among multiple hierarchy paths as 'pathID', and applies the level difference (depth) expressed in the closure table and the hierarchy path ID (pathID) in order to manage a multidimensional hierarchical data stream.
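For illustration, a closure table carrying a level difference and a path ID can be sketched with Python's built-in sqlite3 module. The table name `closure`, the column names `ancestor` and `descendant`, and the sample hierarchy are assumptions made for this sketch; only `depth` and `pathID` are terms used in the claims. Each hierarchy path is stored as every (ancestor, descendant) pair along it, so one lookup returns all ancestors of a node within a chosen path.

```python
import sqlite3


def build_closure(paths):
    """Store every (ancestor, descendant) pair of each hierarchy path,
    together with the level difference and the path ID."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE closure ("
        "ancestor TEXT, descendant TEXT, depth INTEGER, pathID INTEGER)"
    )
    for path_id, path in paths.items():
        for i, ancestor in enumerate(path):
            for j in range(i, len(path)):
                conn.execute(
                    "INSERT INTO closure VALUES (?, ?, ?, ?)",
                    (ancestor, path[j], j - i, path_id),
                )
    return conn


def ancestors(conn, node, path_id):
    """All ancestors of `node` on one hierarchy path, nearest first."""
    return conn.execute(
        "SELECT ancestor, depth FROM closure "
        "WHERE descendant = ? AND pathID = ? ORDER BY depth",
        (node, path_id),
    ).fetchall()


# Hypothetical example: 'Seoul' belongs to two different hierarchy paths.
conn = build_closure({
    1: ["Asia", "Korea", "Seoul"],
    2: ["Capital cities", "Seoul"],
})
geo = ancestors(conn, "Seoul", 1)
cap = ancestors(conn, "Seoul", 2)
```

Here the path ID is what keeps the two hierarchies of the same attribute value apart: `geo` lists Seoul, Korea and Asia with depths 0, 1 and 2, while `cap` lists only Seoul and Capital cities.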

The method of claim 1,
wherein, for the data cube in step (b),
Dimension D_i = { [attribute value], [attribute value], ..., [attribute value] } ... (1)
(where each attribute value is that of the j-th hierarchy level and the k-th hierarchy path ID),
and a hierarchical data stream model is constructed with a single join.

The method of claim 1,
wherein step (d) comprises:
(d1) when there is no hierarchy information, forming a minimum-cost cube tree by considering only the connection relationships between hyper cuboids and selecting the lowest-cost parent cuboid among the parent cuboids;
(d2) when attribute-hierarchy information exists for each dimension, forming hierarchical cuboids within a hyper cuboid and selecting the cuboid with the lowest cost; and
(d3) when there is no adjacent parent cuboid among the user's cuboids of interest, constructing a partial data cube tree by connecting to an ancestor cuboid, which is repeated until a parent cuboid is found at a higher level.

A computer-readable recording medium storing a program that, when executed in a multidimensional hierarchical data cube processing system including a client terminal, a database and a database management system (DBMS), performs the multidimensional hierarchical data cube processing method of any one of claims 1 to 5.

A multidimensional hierarchical data cube system comprising:
a client terminal;
a database which is connected to the client terminal through a network and stores information requested by the client terminal; and
a database management system (DBMS) which manages the database through the network,
wherein the database management system comprises:
a closure table data structure forming unit which applies the level difference (depth) and hierarchy path ID (pathID) expressed in a closure table in order to manage a multidimensional hierarchical data stream;
a multidimensional hierarchical data cube construction unit which generates a data cube based on a continuous query for multidimensional hierarchical data; and
a minimum cost setting unit which selects the lowest-cost cuboid among the parent cuboids or ancestor cuboids to form a minimum-cost cube tree.