KR20000032772A

KR20000032772A - Space index composition method for similarity search

Info

Publication number: KR20000032772A
Application number: KR1019980049332A
Authority: KR
Inventors: 최완; 이병선; 김상욱; 김진호; 한병일
Original assignee: 정선종; 한국전자통신연구원
Priority date: 1998-11-17
Filing date: 1998-11-17
Publication date: 2000-06-15
Also published as: KR100282608B1

Abstract

PURPOSE: A spatial index construction method for similarity search is provided that analyzes the characteristics of a time-series database in advance, estimates the overall search cost (the spatial index search cost plus the candidate object access cost), and derives the number of dimensions of the spatial index that minimizes that overall cost. CONSTITUTION: The method comprises a first step of obtaining the standard deviations of the discrete Fourier transform (DFT) coefficients of a time-series database and sorting them, thereby analyzing the characteristics of the database in advance; a second step of hypothesizing, with reference to those characteristics, the generation of a multidimensional spatial index for every case that can be formed from the DFT coefficients, and calculating the search cost of a query set on each such index; and a third step of selecting the DFT coefficients applied to the spatial index with the minimum search cost as the index entries of the multidimensional spatial index.

Description

Method of Constructing a Spatial Index to Support Similarity Search

The present invention relates to a method of constructing a spatial index that supports efficient similarity search in a time-series database composed of a set of data sequences. In particular, it relates to a method that analyzes the characteristics of the target time-series database so as to minimize the total search cost, which consists of the spatial index search cost and the candidate object search cost.

The most commonly used measure of similarity between two different data sequences X (= <x1, x2, ..., xn>) and Y (= <y1, y2, ..., yn>) is the Euclidean distance D(X, Y), defined in Equation (1).
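The image of Equation (1) did not survive in this text. For reference, the standard definition of the Euclidean distance between the two sequences named above is the following; the notation is assumed to match the original equation.

```latex
D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
```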

That is, a similarity search is defined as the operation of finding, in the database, the data sequences that lie within a Euclidean distance of at most a specified ε from a given query sequence.

To support this operation effectively, traditional similarity search techniques use a multidimensional spatial index. A data sequence T (= <t1, t2, ..., tn>) consisting of n real values in the time domain is converted by the discrete Fourier transform (DFT) into a new data sequence F (= <f1, f2, ..., fn>) consisting of n complex values in the frequency domain, called DFT coefficients. A 2k-dimensional spatial index is then built using only k (<< n) of the n DFT coefficients. The index has 2k dimensions because each frequency-domain value is a complex number with a real part and an imaginary part.
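The mapping from a sequence to a 2k-dimensional index point can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the names `dft` and `index_point` are hypothetical, the 1/sqrt(n) scaling is an assumption, and the naive O(n²) transform is used only to keep the example dependency-free.

```python
import cmath
import math


def dft(seq):
    """Unitary DFT (assumed 1/sqrt(n) scaling) of a real-valued sequence."""
    n = len(seq)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x * cmath.exp(-2j * math.pi * f * t / n) for t, x in enumerate(seq))
        for f in range(n)
    ]


def index_point(seq, k):
    """Keep the first k DFT coefficients and flatten them into 2k real values."""
    point = []
    for c in dft(seq)[:k]:
        point.extend((c.real, c.imag))  # real part, then imaginary part
    return point


seq = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
point = index_point(seq, k=2)
print(len(point))  # 4, i.e. 2k dimensions for k = 2
```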

Building the spatial index with a value of k much smaller than n greatly reduces the number of dimensions of the index, which has the advantage of reducing both the overhead of the spatial index and its search cost.

On the other hand, because the DFT coefficients other than the k selected ones are ignored, the objects retrieved through the spatial index include false matches, whose Euclidean distance to the query sequence is not within ε. An additional step is therefore required that removes the false matches by actually accessing the candidate objects retrieved through the spatial index.
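The two-phase search described above (filter on the reduced representation, then remove false matches by checking the full sequences) can be sketched as below. The sketch makes two loud simplifications that are not in the source: a linear scan stands in for the spatial index, and the hypothetical projection `first_two`, which truncates raw samples, stands in for keeping k DFT coefficients. Both projections share the property that the reduced distance never exceeds the full distance.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_search(database, query, eps, project):
    """Return (candidates, answers) for a range query of radius eps."""
    q_feat = project(query)
    # Filter step: compare only the reduced representations (index stand-in).
    candidates = [s for s in database if euclidean(project(s), q_feat) <= eps]
    # Refine step: access each candidate in full and drop the false matches.
    answers = [s for s in candidates if euclidean(s, query) <= eps]
    return candidates, answers


def first_two(seq):
    """Hypothetical projection: keep only the first two values."""
    return seq[:2]


database = [
    [0.0, 0.0, 0.0, 1.0],  # true match (distance 1)
    [0.0, 0.0, 0.0, 5.0],  # false match (reduced distance 0, full distance 5)
    [9.0, 9.0, 9.0, 9.0],  # rejected by the filter step
]
query = [0.0, 0.0, 0.0, 0.0]
candidates, answers = range_search(database, query, eps=2.0, project=first_two)
print(len(candidates), len(answers))  # 2 1
```

The gap between the number of candidates and the number of answers is the false-match overhead that the cost model in this document tries to balance against index size.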

The usefulness of the existing technique, which builds the spatial index from only k DFT coefficients, rests on two properties of the DFT. The energy of the original data sequence in the time domain (its Euclidean distance from the origin in the multidimensional space) is preserved after transformation into the frequency domain, and, for most data sequences that are subject to similarity search, most of that energy is concentrated in the leading DFT coefficients of the transformed sequence.
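Stated as formulas, and assuming a DFT normalized by 1/sqrt(n) (the normalization is not given in the text), energy preservation is Parseval's relation. Applied to the difference of two sequences T and T' with DFT coefficients f_i and f'_i, it shows that a distance computed from only the first k coefficients can never exceed the true distance, which is why truncation produces false matches but no missed answers.

```latex
\sum_{i=1}^{n} |t_i|^2 = \sum_{i=1}^{n} |f_i|^2
\qquad\Longrightarrow\qquad
D(T, T') = \sqrt{\sum_{i=1}^{n} |f_i - f'_i|^2}
\;\ge\; \sqrt{\sum_{i=1}^{k} |f_i - f'_i|^2}
```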

Consequently, a large k increases the number of dimensions of the spatial index and therefore the storage space and the index search cost. A small k is quite favorable in terms of spatial index search cost, but the loss of energy increases the number of candidate objects remaining after the index search, which incurs the overhead of reading many candidate objects from disk in order to resolve false matches.

To mitigate this drawback, the existing technique recommends simply using a k of 2 or 3. However, the appropriate k differs according to the characteristics of the set of data sequences managed by the application, and unconditionally selecting the leading DFT coefficients is also an obstacle to building an optimal spatial index.

Accordingly, to solve the above problems and support similarity search efficiently, the present invention aims to provide a systematic method that analyzes in advance the characteristics of the time-series database to be indexed, estimates the total search cost consisting of the spatial index search cost and the candidate object search cost, and on that basis derives the number of dimensions of the spatial index that minimizes the total cost.

To achieve this object, the present invention is characterized in that the criterion for selecting the DFT coefficients that participate in the spatial index is not simply the magnitude of their energy, as in the existing method, but their standard deviation. Each DFT coefficient is separated into a real part and an imaginary part, and for each of the resulting 2n coefficients the standard deviation is computed over all data sequences in the database. Because a coefficient with a larger standard deviation discriminates better among the data sequences in the database, the spatial index is built from the coefficients with the largest standard deviations.

FIG. 1 is a flowchart of a method of constructing a spatial index according to an embodiment of the present invention.

Hereinafter, the method of constructing a spatial index of the present invention is described in more detail with reference to the accompanying drawing.

FIG. 1 is a flowchart of a method of constructing a spatial index according to an embodiment of the present invention. Referring to FIG. 1, one data sequence S(i) of length n is first selected from the given set of N data sequences (1 ≤ i ≤ N) (101). A discrete Fourier transform (DFT) is performed on the data sequence S(i) to generate DFT coefficients consisting of real parts R(i, j) and imaginary parts I(i, j) (1 ≤ j ≤ n) (102). Because of the symmetry of the DFT, only the first n/2 of the generated DFT coefficients are taken (103), and these are stored in the array c(k) (1 ≤ k ≤ n) (104).

Steps 101 to 104 are repeated until the selected data sequence S(i) is the last data sequence (105).

Next, the standard deviation of every DFT coefficient stored in the array c(k) is computed (106 to 108). The array c(k) is sorted in descending order of standard deviation, and the corresponding addresses (k values) are stored in sorted order in the array D(p) (1 ≤ p ≤ n) (109). For example, if D(3) contains 5, the DFT coefficient stored in c(5) has the third-largest standard deviation.
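Steps 101 to 109 can be sketched roughly as follows. The function names (`dft`, `rank_by_std`) are hypothetical, indices are 0-based rather than the 1-based c(k) and D(p) of the text, and the population standard deviation is assumed because the text does not say which estimator is used.

```python
import cmath
import math
import statistics


def dft(seq):
    """Unitary DFT (assumed 1/sqrt(n) scaling) of a real-valued sequence."""
    n = len(seq)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x * cmath.exp(-2j * math.pi * f * t / n) for t, x in enumerate(seq))
        for f in range(n)
    ]


def rank_by_std(sequences):
    """Return (order, stds): component indices sorted by descending std deviation."""
    n = len(sequences[0])
    half = n // 2  # DFT symmetry: keep only the first n/2 coefficients (103)
    # columns[2*j] collects real parts and columns[2*j + 1] imaginary parts
    # of coefficient j across all sequences; this plays the role of c(k).
    columns = [[] for _ in range(2 * half)]
    for seq in sequences:  # loop over all sequences (101 to 105)
        for j, c in enumerate(dft(seq)[:half]):
            columns[2 * j].append(c.real)
            columns[2 * j + 1].append(c.imag)
    stds = [statistics.pstdev(col) for col in columns]  # (106 to 108)
    # order plays the role of D(p): order[0] has the largest deviation (109).
    order = sorted(range(len(stds)), key=lambda k: stds[k], reverse=True)
    return order, stds


# Constant sequences differ only in their DC component, so index 0 ranks first.
sequences = [[1.0] * 4, [2.0] * 4, [3.0] * 4]
order, stds = rank_by_std(sequences)
print(order[0])  # 0
```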

Once this series of steps has been completed, all data sequences in the data set are accessed to compute the fractal dimension (D0) and the correlation fractal dimension (D2) (110). These two dimensions are used later to compute the search time of a query on a hypothesized spatial index.

The fractal dimension (D0) and the correlation fractal dimension (D2) are obtained from Equation (2), in which q = 0 gives the fractal dimension and q = 2 gives the correlation fractal dimension.

Here, r is the side length, in each dimension, of one cell of the multidimensional space, and

Pi is the number of objects in the i-th cell of the multidimensional space obtained by dividing each dimension by r.
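Equation (2) itself did not survive in this text. The sketch below assumes the usual generalized (box-counting) form, D_q = (1/(q-1)) · ∂ log Σ p_i^q / ∂ log r, which matches the description that q = 0 and q = 2 give the two dimensions; whether the patent uses exactly this form is an assumption. The slope is estimated from just two cell sizes, and cell occupancies are normalized to fractions, which does not change the slope. The function names are hypothetical.

```python
import math
from collections import Counter


def cell_sum(points, r, q):
    """Sum of p_i**q over occupied grid cells of side r (p_i = fraction in cell i)."""
    counts = Counter(tuple(math.floor(x / r) for x in p) for p in points)
    total = len(points)
    return sum((c / total) ** q for c in counts.values())


def fractal_dimension(points, q, r_small, r_large):
    """Two-scale slope estimate of D_q; q must not be 1."""
    s_small = cell_sum(points, r_small, q)
    s_large = cell_sum(points, r_large, q)
    slope = (math.log(s_small) - math.log(s_large)) / (
        math.log(r_small) - math.log(r_large)
    )
    return slope / (q - 1)


# 64 points on the diagonal of the unit square behave like a 1-dimensional set.
line = [((i + 0.5) / 64, (i + 0.5) / 64) for i in range(64)]
d0 = fractal_dimension(line, q=0, r_small=1 / 8, r_large=1 / 4)
d2 = fractal_dimension(line, q=2, r_small=1 / 8, r_large=1 / 4)
print(round(d0, 6), round(d2, 6))  # 1.0 1.0
```

A practical estimate would fit the slope over many cell sizes instead of two; the two-scale version is used here only to keep the arithmetic easy to check.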

Once the fractal dimension (D0) and the correlation fractal dimension (D2) have been computed, the entries of D(p) are selected one at a time in order, and the total cost per query (TotalCost(p,m)) and the average cost (TotalCost(p)) are computed.

Here, D(1) holds the address in the array c(k) of the DFT coefficient with the largest standard deviation, D(2) holds the address of the DFT coefficient with the second-largest standard deviation, and D(n) holds the address of the DFT coefficient with the smallest standard deviation.

Because a DFT coefficient with a larger standard deviation discriminates better among the data sequences in the database, the coefficients with large standard deviations are given priority when the generation of a multidimensional spatial index is hypothesized. First, the DFT coefficient stored at the address indicated by D(1) is selected (111), and the generation of a multidimensional spatial index is hypothesized (112). The total search cost on that spatial index, TotalCost(1,m), where m denotes the m-th query in the query set, is then computed for each query in the query set (113 to 117). Once the search cost on the spatial index has been computed for every query in the query set, the average of those costs, TotalCost(1), is obtained.

When the search cost of the spatial index based on the DFT coefficient at the address indicated by D(1) has been computed in this way, the DFT coefficient stored at the address indicated by D(2) is selected. The generation of a multidimensional spatial index based on the DFT coefficients at the addresses indicated by D(1) and D(2) is hypothesized (112), and the search cost computation (113 to 118) is repeated to obtain TotalCost(2).

This process is repeated up to D(n). When all n average search costs (TotalCost(1), ..., TotalCost(n)) have been computed, the case with the smallest average, TotalCost(w), is selected (120), and the w DFT coefficients indicated by D(1), D(2), ..., D(w) are chosen as the indexing attributes of the multidimensional spatial index (121).

The search cost computation for TotalCost(p) (113 to 119) is now described in detail with reference to the drawing. One query Q(m) (1 ≤ m ≤ q) is selected from the given set of q queries (113). The search cost of the spatial index for that query, TreeSearchCost(p,m), is computed (114), and the candidate object search cost for the selected query, CandidateAccessCost(p,m), is computed (115). The two are added to give the total search cost TotalCost(p,m) (116). This is done for every query in the query set (117), the per-query total costs TotalCost(p,m) are averaged over the number of queries in the query set to give TotalCost(p) (118), and the whole process is repeated until D(p) is the last entry (119).
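The selection loop of steps 111 to 121 can be sketched as below. Because Equations (3) and (4) are not reproduced in this text, the two cost functions are passed in as parameters; `toy_tree_cost` and `toy_candidate_cost` are invented stand-ins chosen only to show the trade-off and are not the patent's formulas. Queries are represented here by their radius ε, which is also an assumption.

```python
def choose_dimension(n_candidates, queries, tree_search_cost, candidate_access_cost):
    """Return (w, averages), where averages[p - 1] corresponds to TotalCost(p)."""
    averages = []
    for p in range(1, n_candidates + 1):  # hypothesize an index on D(1)..D(p)
        totals = [
            tree_search_cost(p, q) + candidate_access_cost(p, q)  # TotalCost(p, m)
            for q in queries
        ]
        averages.append(sum(totals) / len(totals))  # TotalCost(p)
    # Pick the p whose average total cost is smallest (120).
    w = min(range(1, n_candidates + 1), key=lambda p: averages[p - 1])
    return w, averages


def toy_tree_cost(p, eps):
    """Invented stand-in: index search cost grows with the dimension count."""
    return float(p * p)


def toy_candidate_cost(p, eps):
    """Invented stand-in: fewer dimensions leave more false matches to check."""
    return 100.0 * eps / p


w, averages = choose_dimension(8, [0.5, 1.0, 1.5], toy_tree_cost, toy_candidate_cost)
print(w, averages[w - 1])  # 4 41.0
```

With these stand-ins the index cost rises and the candidate cost falls as p grows, so the average total cost has an interior minimum, which is the situation the method is designed to locate.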

The spatial index search cost for a query, TreeSearchCost(p,m), is obtained from Equation (3) below, and the candidate object search cost, CandidateAccessCost(p,m), is obtained from Equation (4).

Here, N is the number of sequences in the entire data set,

D is the number of coefficients used in the spatial index (that is, its number of dimensions),

Ceff is the average number of entries held by each node of the R*-tree,

ε is the radius of the query region, which expresses the degree of similarity,

h is the height of the R*-tree (= log_Ceff N), and

∂_j is the side length, in each dimension, of the region represented by each entry at the j-th level of the R*-tree.

Here, D2 is the correlation fractal dimension,

E is the original number of dimensions n of a sequence,

α is the sum of the standard deviations of the coefficients that do not participate in the multidimensional spatial index,

Vol(ε,□) is the volume of a square object with side length ε, and

Vol(ε,○) is the volume of a spherical object with radius ε.

By analyzing in advance the characteristics of the time-series database to be indexed, the method of the present invention can effectively construct an optimal spatial index for a given data set and query set. As a result, when similarity search is processed using the spatial index, the search cost can be minimized and the overall system response time is improved.

Claims

A first process of obtaining the standard deviations of the discrete Fourier transform (DFT) coefficients of a time-series database that is the target of spatial index construction, sorting them in descending order, and thereby analyzing the characteristics of the time-series database in advance;

A second process of hypothesizing, with reference to the characteristics of the time-series database, the generation of a multidimensional spatial index for every case that can be formed from arbitrary DFT coefficients, and then calculating the search cost of a query set on each such spatial index; and

A third process of selecting the DFT coefficients applied to the spatial index with the minimum search cost as the indexing attributes of the multidimensional spatial index.

The method of claim 1, wherein the first process comprises:

A first step of performing a discrete Fourier transform (DFT) on every data sequence of the time-series database to generate DFT coefficients consisting of real and imaginary parts, and selecting only the first n/2 DFT coefficients among them;

Obtaining, over all data sequences of the time-series database, the standard deviation of each selected DFT coefficient and sorting the coefficients in descending order of standard deviation; and accessing all the data sequences to calculate the fractal dimension (D0) and the correlation fractal dimension (D2).

The method of claim 1, wherein the second process comprises:

A first step of selecting the DFT coefficients one at a time, starting from the first of the DFT coefficients sorted in descending order, and hypothesizing the generation of a multidimensional spatial index to which the selected DFT coefficients are applied; and

A second step of calculating, for each hypothesized multidimensional spatial index, the total search cost of every query included in a query set, and calculating for each multidimensional spatial index the average of the per-query total search costs.

The method of claim 3, wherein the second step comprises:

Obtaining the spatial index search cost for a query using (Equation 1) below;

Obtaining the candidate object search cost for the query using (Equation 2) below; and

Calculating the total search cost of the query on the hypothesized spatial index by adding the spatial index search cost and the candidate object search cost.

(Equation 1)

Where N is the number of sequences in the entire data set,

D is the number of coefficients used in the spatial index (that is, its number of dimensions),

Ceff is the average number of entries held by each node of the R*-tree,

ε is the radius of the query region, which expresses the degree of similarity,

h is the height of the R*-tree (= log_Ceff N),

∂_j is the side length, in each dimension, of the region represented by each entry at the j-th level of the R*-tree,

(Equation 2)

Where D2 is the correlation fractal dimension,

E is the original number of dimensions n of a sequence,

ε is the radius of the query region, which expresses the degree of similarity,

α is the sum of the standard deviations of the coefficients that do not participate in the multidimensional spatial index,

Vol(ε,□) is the volume of a square object with side length ε,

Vol(ε,○) is the volume of a spherical object with radius ε