KR20140009860A

KR20140009860A - Method and apparatus for constructing histogram based on multi-dimensional index structure

Info

Publication number: KR20140009860A
Application number: KR1020120076825A
Authority: KR
Inventors: 노요한; 황혜수
Original assignee: 삼성전자주식회사
Priority date: 2012-07-13
Filing date: 2012-07-13
Publication date: 2014-01-23

Abstract

A method and apparatus are disclosed for constructing a histogram, based on a multi-dimensional index structure, for point data that lie in a multi-dimensional space and have a highly skewed data distribution. Under a given limit on the number of buckets, the method and apparatus are robust to skew in the data distribution within each bucket and yield reliable estimates even for highly skewed data sets. [Reference numerals] (S10) Receiving an index tree structure; (S11) Constructing a new index tree structure; (S12) Constructing a multi-dimensional histogram

Description

METHOD AND APPARATUS FOR CONSTRUCTING HISTOGRAM BASED ON MULTI-DIMENSIONAL INDEX STRUCTURE

The following relates to a method and apparatus for constructing a histogram based on a multi-dimensional index structure and, more particularly, to a method and apparatus for constructing histogram buckets for point data that lie in a multi-dimensional space and have a highly skewed data distribution.

Multi-dimensional data arise in many fields, from traditional finance to medicine. Multi-dimensional data are data with two or more attributes.

For example, in the medical field, information about a patient includes age, sex, diagnosis results and so on, so it can be represented as multi-dimensional data. Methods for managing large volumes of patient information efficiently and for processing queries effectively are therefore important. The approximate aggregate query is one such important query method.

A multi-dimensional histogram is a summary of a multi-dimensional distribution and is widely used to estimate the number of data items that fall within a range query.

A multi-dimensional histogram consists of a set of buckets. Because all buckets can reside in the main memory of the database system, the number of data items within a range query can be estimated from the buckets that overlap the query, without any disk access to the full data set.

The assumption that data are distributed evenly over the region of each bucket is called the uniform data distribution assumption. Under this assumption, the estimated number of data items that a bucket contributes to a given range query is proportional to the area of the overlap between the bucket region and the query region. The estimate for the range query is the sum of these per-bucket estimates.
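As an illustration of the uniform data distribution assumption, the following Python sketch estimates the count for a range query by scaling each bucket's count by the fraction of the bucket region that the query overlaps. The names `Bucket`, `volume`, `overlap` and `estimate_count` are hypothetical and do not appear in the specification, and axis-aligned rectangular bucket regions are assumed.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# One (low, high) interval per dimension; an assumed representation.
Box = Tuple[Tuple[float, float], ...]


@dataclass(frozen=True)
class Bucket:
    box: Box    # region covered by the bucket
    count: int  # number of points summarized by the bucket


def volume(box: Box) -> float:
    """Area (2-D) or volume (n-D) of an axis-aligned box; 0 if it is empty."""
    v = 1.0
    for lo, hi in box:
        v *= max(0.0, hi - lo)
    return v


def overlap(a: Box, b: Box) -> Box:
    """Intersection of two boxes; an inverted interval means no overlap."""
    return tuple((max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b))


def estimate_count(buckets: Sequence[Bucket], query: Box) -> float:
    """Sum over buckets of count * (overlap area / bucket area)."""
    total = 0.0
    for b in buckets:
        area = volume(b.box)
        if area > 0.0:
            total += b.count * volume(overlap(b.box, query)) / area
    return total
```

For two adjacent 10 x 10 buckets holding 100 and 40 points, a query that covers half of each would be estimated at 50 + 20 = 70 points, regardless of where the points actually lie inside the buckets. That is exactly why skew inside a bucket degrades the estimate.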

However, when the data distribution within a bucket is highly skewed, the accuracy of the estimate for a range query drops sharply.

Therefore, in a database system based on a multi-dimensional index structure, it is important to construct the histogram buckets so that the data in every bucket are as uniform as possible.

An object is to provide a method and apparatus for constructing a histogram, based on a multi-dimensional index structure, for point data that lie in a multi-dimensional space and have a highly skewed data distribution.

According to one aspect, a method of constructing a histogram based on a multi-dimensional index structure includes: receiving an index tree structure; splitting the leaf node having the highest data skew among the leaf nodes of the received index tree structure into at least two new leaf nodes; constructing a new index tree structure by making the new leaf nodes child nodes of the split leaf node; and constructing a multi-dimensional histogram using the new index tree structure.

According to another aspect, a method of constructing a histogram based on a multi-dimensional index structure includes: receiving an index tree structure that includes data partition information of a target data set; setting up a bucket set for the histogram, the bucket set being larger than a preset number of buckets and including the bucket corresponding to the root node of the received index tree structure; and selecting and merging the two buckets in the bucket set for which the change in skew before and after merging is smallest.

According to yet another aspect, an apparatus for constructing a histogram based on a multi-dimensional index structure includes: an index receiving unit that receives an index tree structure including data partition information of a target data set; a histogram bucket setting unit that sets up a bucket set for the histogram, the bucket set being larger than a preset number of buckets and including the bucket corresponding to the root node of the received index tree structure; a skew measuring unit that measures the skew of each bucket in the bucket set; and a bucket merging unit that selects and merges the two buckets in the bucket set for which the change in skew before and after merging is smallest.

As a result, a method and apparatus for constructing a histogram based on a multi-dimensional index structure can be implemented that are robust, under a given limit on the number of buckets, to skew in the data distribution within each bucket and that yield reliable estimates even for highly skewed data sets.

FIG. 1 is a flowchart illustrating an example of a method for generating a skew-reducing index tree structure.
FIG. 2 is a flowchart illustrating an example of a method for generating an index tree structure that takes skew-gain into account.
FIG. 3 is a diagram illustrating an example of constructing a multi-dimensional histogram using a skew-reducing index tree in a quad-tree index structure.
FIG. 4 is a diagram illustrating how buckets to be merged are selected in consideration of skew-gain.
FIG. 5 is a diagram illustrating an example of an apparatus, according to another aspect, for constructing a histogram based on a multi-dimensional index structure.

The key considerations in constructing a multi-dimensional histogram based on an index tree are as follows.

First, when one or more leaf nodes of the index tree are used as the basic unit for building buckets, the estimation performance of the histogram depends heavily on how uniform the data distribution of the leaf nodes is. An index tree is generally built mainly from data partitioning information, whereas the skew of the data distribution is not taken into account.

Second, when the input data set is a static data set that is rarely updated, the estimation performance of the histogram matters more than the speed of constructing it. In general, when the input data set is static or updated infrequently, it is important that the resulting multi-dimensional histogram provide excellent estimation performance, even if its construction takes a long time.

To improve the accuracy of estimates for range queries, the method of constructing a multi-dimensional histogram based on an index structure according to an embodiment proposes the following technical concepts.

(1) As a way of extending the index tree, a method for generating a skew-reducing index tree structure is proposed.

(2) Skew-gain is proposed as a new metric for generating histogram buckets that mitigate skew.

Hereinafter, specific examples are described in detail with reference to the accompanying drawings.

FIG. 1 is a flowchart illustrating an example of a method for generating a skew-reducing index tree structure.

According to an example, a skew-reducing index tree is generated by adding, to a multi-dimensional index tree structure, new nodes obtained by splitting highly skewed leaf nodes, and this tree is used to construct the histogram.

The method for generating the skew-reducing index tree structure proceeds as follows.

First, an index tree structure is received (S10).

Then, the leaf node having the highest data skew among the leaf nodes of the skew-reducing index tree is selected.

Next, the selected leaf node is split into two or more new leaf nodes such that the sum of their skews is minimized. The new leaf nodes become child nodes of the selected leaf node. FIG. 2 shows the case of splitting into two new leaf nodes. A new index tree structure is thereby constructed (S11).

This leaf node splitting may be repeated for a given time or a given number of times.

The skew-reducing index tree generation only needs to process the portion of the data set that lies in highly skewed leaf nodes. Moreover, because the structure is an index tree, that portion of the data can be accessed efficiently through the index. The computational cost is therefore not higher than that of conventional histogram generation techniques.

As the splitting of the most skewed leaf node into new leaf nodes with minimal total skew is repeated, more detailed data partitioning information is obtained, but more computation is required. A multi-dimensional histogram is constructed through this process (S12).
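The splitting loop of steps S10 to S11 can be sketched in Python as follows. This is only a sketch under stated assumptions: the specification does not give a skew formula (it defers to the known Q-histogram technique), so `skew` below is a hypothetical stand-in that measures how unevenly points fall into a fixed `GRID` x `GRID` partition of a node's region. The two-way axis-aligned cut in `split`, and the names `Node`, `leaves` and `reduce_skew`, are likewise illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_lo, y_lo, x_hi, y_hi)
GRID = 4  # cells per axis used by the illustrative skew measure


@dataclass
class Node:
    box: Box
    points: List[Point]
    children: List["Node"] = field(default_factory=list)


def skew(node: Node) -> float:
    """Hypothetical skew: squared deviation of per-cell counts from their mean."""
    x0, y0, x1, y1 = node.box
    counts = [0] * (GRID * GRID)
    for x, y in node.points:
        i = min(GRID - 1, int((x - x0) / (x1 - x0) * GRID))
        j = min(GRID - 1, int((y - y0) / (y1 - y0) * GRID))
        counts[i * GRID + j] += 1
    mean = len(node.points) / len(counts)
    return sum((c - mean) ** 2 for c in counts)


def leaves(root: Node) -> List[Node]:
    """Leaf nodes below `root`, left to right."""
    if not root.children:
        return [root]
    return [leaf for child in root.children for leaf in leaves(child)]


def split(node: Node) -> List[Node]:
    """Cut the node in two at the cell boundary with the least total child skew."""
    x0, y0, x1, y1 = node.box
    best = None
    for k in range(1, GRID):
        cx = x0 + (x1 - x0) * k / GRID
        cy = y0 + (y1 - y0) * k / GRID
        for boxes, is_low in (
            (((x0, y0, cx, y1), (cx, y0, x1, y1)), lambda p, c=cx: p[0] < c),
            (((x0, y0, x1, cy), (x0, cy, x1, y1)), lambda p, c=cy: p[1] < c),
        ):
            low = Node(boxes[0], [p for p in node.points if is_low(p)])
            high = Node(boxes[1], [p for p in node.points if not is_low(p)])
            cost = skew(low) + skew(high)
            if best is None or cost < best[0]:
                best = (cost, [low, high])
    return best[1]


def reduce_skew(root: Node, rounds: int) -> Node:
    """Repeatedly split the most skewed leaf; the new leaves become its children."""
    for _ in range(rounds):
        worst = max(leaves(root), key=skew)
        if skew(worst) == 0.0:
            break  # every leaf is already uniform under this measure
        worst.children = split(worst)
    return root
```

Only the points of the leaf being split are touched in each round, which mirrors the observation above that the procedure works on a portion of the data set rather than on all of it.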

The number of buckets is generally specified by the database administrator.

An analysis of histogram accuracy and computational cost for the skew-reducing index tree generation method according to an example shows that the trade-off between accuracy and cost is best when the leaf nodes are split B log B times. That is, for a given number of buckets B, it is preferable to consider log B candidate buckets.
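To make the B log B figure concrete, the small Python helper below computes the number of leaf splits for a given bucket budget. The specification does not state the base of the logarithm, so base 2 is an assumption here, as is rounding up to a whole number; `split_budget` is a hypothetical name.

```python
import math


def split_budget(num_buckets: int) -> int:
    """Leaf splits suggested by the B log B rule (base-2 log is an assumption)."""
    if num_buckets < 2:
        return 0  # log B is zero or undefined, so no splitting is suggested
    return math.ceil(num_buckets * math.log2(num_buckets))
```

Under these assumptions, a budget of 8 buckets gives 8 x 3 = 24 splits and a budget of 16 buckets gives 16 x 4 = 64 splits.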

FIG. 2 is a flowchart illustrating an example of a method for generating an index tree structure that takes skew-gain into account.

The example of FIG. 2 concerns a method that uses skew-gain as a new metric for range queries when constructing buckets from the index tree.

Ideally, a histogram achieves its best estimation performance when the data distribution within each bucket is uniform. If the skew-reducing index tree corresponding to the B buckets of the histogram has K leaf nodes, the K leaf nodes need to be divided into B groups such that the data distribution within each group is as uniform as possible.

In the example of FIG. 2, the K leaf nodes are merged sequentially to form B buckets.

The method of FIG. 2 for generating an index tree structure in consideration of skew-gain proceeds as follows.

First, an index tree structure is received (S20). The received index tree structure may be any of a wide range of index tree structures, such as a Q-histogram tree structure, an R-tree structure, a quad-tree structure, or a KD-tree structure.

Next, starting from a bucket set G initialized with a single bucket corresponding to the root node, the size of the bucket set is brought to B using the skew-reducing index tree generation method described with reference to FIG. 1. At most (B-1) iterations are performed. In each iteration, the bucket with the highest data skew is split into at least two buckets (S21).

Then, while pairs of the buckets constructed so far are merged successively, the skew-gain immediately after the split, the skew-gain after one merge following the split, the skew-gain after two merges following the split, and so on are each calculated.

The skew of the bucket set G is denoted skew(G).

The skew-gain can then be defined as in Equation 1 below.

[Equation 1]

(Skew Gain) = {(sum of bucket skews immediately after the split) - (sum of bucket skews immediately before the split)} / (number of buckets added)

That is, the skew-gain can be interpreted as the amount by which skew decreases per additional bucket created. With this sign convention, a split that reduces skew has a negative skew-gain, and a smaller value means a larger reduction.
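Equation 1 translates directly into a few lines of Python. The function below is a sketch that takes the per-bucket skew values before and after a split as plain lists; how those skew values are measured is left open, and the name `skew_gain` and the list-based interface are assumptions made for illustration.

```python
from typing import Sequence


def skew_gain(skews_before: Sequence[float], skews_after: Sequence[float]) -> float:
    """(sum of skews after the split - sum before) / (number of buckets added)."""
    added = len(skews_after) - len(skews_before)
    if added <= 0:
        raise ValueError("a split must increase the number of buckets")
    return (sum(skews_after) - sum(skews_before)) / added
```

For instance, splitting one bucket of skew 12 into two buckets of skew 3 and 1 adds one bucket and gives (4 - 12) / 1 = -8, whereas splitting it into three buckets of skew 2 each adds two buckets and gives (6 - 12) / 2 = -3. The two-way split lowers skew more per added bucket.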

Replacing the split bucket in the bucket set G with the buckets for which the skew-gain is smallest increases the size of the bucket set. The size of the bucket set may therefore exceed the number of buckets B set by the database administrator.

In that case, two buckets in the bucket set are merged repeatedly until the size of the bucket set G again equals the set number of buckets B (S22).

Through this process, the buckets can be constructed effectively so that the data in each bucket are as uniform as possible.
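The merge step (S22) can be sketched as a greedy loop that, at each iteration, merges the pair of buckets whose merge changes skew the least. The Python below is a minimal sketch, not the specification's procedure in full: each bucket is modelled as a list of leaf cells `(area, count)`, the `skew` function is a hypothetical density-based stand-in, and the sketch ignores any geometric constraint on which buckets may be merged (for example adjacency).

```python
from itertools import combinations
from typing import List, Tuple

Cell = Tuple[float, int]  # (area, point count) of one leaf node; areas must be > 0
Bucket = List[Cell]       # a bucket groups one or more leaf nodes


def skew(bucket: Bucket) -> float:
    """Hypothetical skew: area-weighted squared deviation of leaf densities."""
    area = sum(a for a, _ in bucket)
    density = sum(c for _, c in bucket) / area
    return sum(a * (c / a - density) ** 2 for a, c in bucket)


def merge_until(buckets: List[Bucket], target: int) -> List[Bucket]:
    """Merge the pair with the smallest skew change until `target` buckets remain."""
    if target < 1:
        raise ValueError("target must be at least 1")
    buckets = [list(b) for b in buckets]  # work on a copy
    while len(buckets) > target:
        i, j = min(
            combinations(range(len(buckets)), 2),
            key=lambda ij: skew(buckets[ij[0]] + buckets[ij[1]])
            - skew(buckets[ij[0]])
            - skew(buckets[ij[1]]),
        )
        merged = buckets[i] + buckets[j]
        buckets = [b for k, b in enumerate(buckets) if k not in (i, j)] + [merged]
    return buckets
```

Evaluating every pair at every iteration is quadratic in the number of buckets. That is acceptable for a sketch, and it fits the earlier remark that construction time matters less than estimation quality for static data sets.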

FIG. 3 is a diagram illustrating an example of constructing a multi-dimensional histogram using a skew-reducing index tree in a quad-tree index structure.

If leaf node 13 has the highest skew among the leaf nodes 11, 12, 13 and 14 of node 10, leaf node 13 is split into new leaf nodes 130 and 131. (The embodiment of FIG. 3 shows the leaf node with the highest data skew being split into two new leaf nodes, but it may also be split into more than two.) A new index tree containing the root node 10 and the leaf nodes 11, 12, 14, 130 and 131 is thereby generated and replaces the existing index tree.
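The node bookkeeping in the FIG. 3 example can be traced with a few lines of Python. The dictionary representation of the tree and the helper names `leaves` and `split_leaf` are assumptions made for illustration; the node numbers are the reference numerals used in the text.

```python
from typing import Dict, List

Tree = Dict[int, List[int]]  # node id -> child ids; ids without an entry are leaves


def leaves(tree: Tree, node: int) -> List[int]:
    """Collect the leaf ids below `node` in left-to-right order."""
    children = tree.get(node, [])
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves(tree, child)]


def split_leaf(tree: Tree, leaf: int, new_leaves: List[int]) -> None:
    """Attach the new leaves as children of the leaf being split."""
    if tree.get(leaf):
        raise ValueError("only a leaf node can be split")
    tree[leaf] = list(new_leaves)


tree: Tree = {10: [11, 12, 13, 14]}  # root 10 with four leaf nodes
split_leaf(tree, 13, [130, 131])     # node 13 is assumed to be the most skewed
print(leaves(tree, 10))              # [11, 12, 130, 131, 14]
```

Node 13 stays in the tree as an internal node, and its two children take its place in the set of leaves, matching the leaf set 11, 12, 14, 130 and 131 described above.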

Node splitting may then be performed again on the leaf node having the highest skew among the leaf nodes 11, 12, 14, 130 and 131 of the new index tree. In that case, a further index tree is generated and replaces the current one.

This skew-reducing node splitting is performed, at each iteration, on the leaf node that has the highest data skew at that point.

When the number of buckets corresponding to the received index tree structure is B, B log B iterations are appropriate considering the cost of the computation and the benefit of the reduced skew.

FIG. 4 is a diagram illustrating how buckets to be merged are selected in consideration of skew-gain.

Suppose that a histogram contains buckets 40, 41, 42 and 43 as shown in FIG. 4, and that point data 400, 410, 420, 430 and 440 are distributed within the buckets.

It can be seen intuitively that merging buckets 40 and 42 into bucket 44 changes the distribution of the point data less than merging buckets 42 and 43 into bucket 45 does. In other words, the magnitude of the skew-gain of merged bucket 44 is smaller than that of merged bucket 45. Accordingly, the merge that produces bucket 44 is selected.

Because the number of buckets B is generally a pre-determined value, this merging is repeated until the size of the bucket set G is equal to or smaller than B.

FIG. 5 is a diagram illustrating an example of an apparatus, according to another aspect, for constructing a histogram based on a multi-dimensional index structure.

As shown in FIG. 5, the apparatus 50 for constructing a histogram based on a multi-dimensional index structure may include an index receiving unit 500, a histogram bucket setting unit 510, a skew measuring unit 520, and a bucket merging unit 530.

The index receiving unit 500 receives an index tree structure that includes data partition information of the target data set. The received index tree structure may be any of a wide range of index tree structures, such as a Q-histogram tree structure, an R-tree structure, a quad-tree structure, or a KD-tree structure.

The histogram bucket setting unit 510 sets up the histogram buckets. The size of the bucket set may be made larger than the preset number of buckets B. The bucket set includes the bucket corresponding to the root node of the received index tree structure.

The skew measuring unit 520 measures the skew of each bucket in the bucket set. The specific method and formulas for measuring skew can follow those of the known Q-histogram tree structure, so a detailed description is omitted here.

The bucket merging unit 530 selects and merges the two buckets in the bucket set for which the change in skew before and after merging is smallest.

The bucket merging unit 530 may repeat the merging until the size of the bucket set equals the preset number of buckets. At each iteration, it evaluates the change in skew before and after merging to select the buckets to merge.

With the apparatus illustrated in FIG. 5, a histogram based on a multi-dimensional index structure can be constructed in consideration of skew-gain.

Although not shown in FIG. 5, the apparatus may further include a node splitting unit that splits the leaf node having the highest data skew among the leaf nodes of the received index tree structure into at least two new leaf nodes to construct a new index tree structure. Because nodes are then formed in the direction of decreasing skew, the skew of the histogram based on the multi-dimensional index structure is reduced further and the performance of the apparatus can be improved.

The embodiments of the present invention can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system.

Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include implementation in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium may also be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present invention can be readily inferred by programmers skilled in the art to which the present invention pertains.

Furthermore, the embodiments described above are intended to illustrate the present invention, and the scope of the present invention is not limited to any specific embodiment.

40, 41, 42, 43, 44, 45: bucket
400, 410, 420, 430: point data
50: apparatus for constructing a histogram based on a multi-dimensional index structure
500: index receiving unit
510: histogram bucket setting unit
520: skew measuring unit
530: bucket merging unit

Claims

1. A method of constructing a histogram based on a multi-dimensional index structure, the method comprising:
receiving an index tree structure;
splitting the leaf node having the highest data skew among the leaf nodes of the received index tree structure into at least two new leaf nodes;
constructing a new index tree structure by making the new leaf nodes child nodes of the split leaf node; and
constructing a multi-dimensional histogram using the new index tree structure.

2. The method of claim 1,
wherein the constructing of the new index tree structure is repeated at least twice, and
at each iteration, node splitting is performed on the leaf node having the highest data skew at that iteration.

3. The method of claim 2,
wherein the constructing of the new index tree structure is repeated B times when the number of buckets for constructing the multi-dimensional histogram is B.

4. A method of constructing a histogram based on a multi-dimensional index structure, the method comprising:
receiving an index tree structure that includes data partition information of a target data set;
setting up a bucket set for the histogram, the bucket set being larger than a preset number of buckets and including the bucket corresponding to the root node of the received index tree structure; and
selecting and merging the two buckets in the bucket set for which the change in skew before and after merging is smallest.

5. The method of claim 4,
wherein the merging is repeated until the size of the bucket set equals the preset number of buckets, and
at each iteration, the two buckets for which the change in skew before and after merging is smallest are selected and merged.

6. The method of claim 5,
wherein the receiving of the index tree structure further comprises:
splitting the leaf node having the highest data skew among the leaf nodes of the received index tree structure into at least two new leaf nodes to construct a new index tree structure, and replacing the received index tree structure with the new index tree structure; and
constructing a multi-dimensional histogram using the new index tree structure.

7. The method of claim 6,
wherein the constructing of the new index tree structure is repeated at least twice, and
at each iteration, node splitting is performed on the leaf node having the highest data skew at that iteration.

8. The method of claim 7,
wherein the constructing of the new index tree structure is repeated B log B times when the number of buckets corresponding to the received index tree structure is B.

9. An apparatus for constructing a histogram based on a multi-dimensional index structure, the apparatus comprising:
an index receiving unit configured to receive an index tree structure that includes data partition information of a target data set;
a histogram bucket setting unit configured to set up a bucket set for the histogram, the bucket set being larger than a preset number of buckets and including the bucket corresponding to the root node of the received index tree structure;
a skew measuring unit configured to measure the skew of each bucket in the bucket set; and
a bucket merging unit configured to select and merge the two buckets in the bucket set for which the change in skew before and after merging is smallest.

10. The apparatus of claim 9,
wherein the bucket merging unit repeats the merging until the size of the bucket set equals the preset number of buckets and,
at each iteration, selects the two buckets for which the change in skew before and after merging is smallest.

11. The apparatus of claim 10,
further comprising a node splitting unit configured to split the leaf node having the highest data skew among the leaf nodes of the index tree structure into at least two new leaf nodes to construct a new index tree structure.

12. The apparatus of claim 11,
wherein the node splitting unit repeats, at least twice, node splitting on the leaf node having the highest data skew at the time of each iteration to construct a new index tree structure, and replaces the received index tree structure with the new index tree structure.