KR20110115281A

KR20110115281A - Partitioning method for high dimensional data

Info

Publication number: KR20110115281A
Application number: KR1020100034696A
Authority: KR
Inventors: 이지형; 이승훈; 김보경; 김재광
Original assignee: 성균관대학교산학협력단
Priority date: 2010-04-15
Filing date: 2010-04-15
Publication date: 2011-10-21
Also published as: KR101116663B1

Abstract

The disclosed technique provides a method by which a partitioning device partitions a plurality of high-dimensional data, the method comprising: (a) calculating a principal component vector through principal component analysis of the plurality of high-dimensional data; (b) calculating a center point of the calculated principal component vector; and (c) partitioning the plurality of high-dimensional data with respect to a hyperplane passing through the calculated center point.
Also provided is a computer-readable recording medium storing a program for executing the above method on a computer.

Description

Partitioning method for similarity search of high-dimensional data {Partitioning Method for High Dimensional Data}

The disclosed technique relates to a data partitioning method for similarity search of data.

With advances in computer and multimedia technology, information is represented not only as text but also in multimedia forms including images, audio, and video. The main problem in handling such multimedia information is retrieval efficiency, that is, how quickly and accurately the multimedia data containing the information a user wants can be found. A common way to search multimedia objects such as images, audio, and video is content-based retrieval, which extracts high-dimensional feature vectors from the objects and searches over those vectors.

To guarantee fast search over such high-dimensional data, it is important to reduce the number of similarity computations and data reads. Indexing techniques for high-dimensional data are used for this purpose, and the proposed techniques fall broadly into tree-based index construction methods and filtering-based methods.

However, search performance degrades exponentially as the dimensionality of the data increases, so similarity search over high-dimensional data incurs a high computational cost. To reduce this cost, the entire data set is partitioned through an index structure so that fewer items are subject to similarity comparison. Most similarity search techniques that use an index structure are non-overlapping partitioning techniques based on the distance between a reference point and the data. Because these existing methods restrict the search to a limited set of candidates, they have the problem that items outside the compared set are never considered.

The technical problem addressed by the disclosed technique is to provide a data partitioning method for similarity search of high-dimensional data that can minimize the errors arising during such a search.

To solve the above technical problem, the disclosed technique provides a method by which a partitioning device partitions a plurality of high-dimensional data, the method comprising: (a) calculating a principal component vector through principal component analysis of the plurality of high-dimensional data; (b) calculating a center point of the calculated principal component vector; and (c) partitioning the plurality of high-dimensional data with respect to a hyperplane passing through the calculated center point.

Also provided is a computer-readable recording medium storing a program for executing the above method on a computer.

Embodiments of the disclosed technique may have effects including the following advantages. This does not mean, however, that every embodiment must include all of them, so the scope of the disclosed technique should not be understood as being limited by them.

According to one embodiment of the disclosed technique, when the partitioning device partitions a plurality of high-dimensional data, it uses principal component analysis to perform a partition that can minimize errors. The partitioned data form a tree structure, and similarity search is performed through the resulting index tree. Through these steps, similarity search can be performed more accurately than with existing methods.

FIG. 1 is a flowchart illustrating a data partitioning method according to an embodiment of the disclosed technique.
FIG. 2 shows a method of partitioning data by a principal component vector, a center point, and a hyperplane according to an embodiment of the disclosed technique.
FIG. 3 is a graph showing results obtained according to an embodiment of the disclosed technique alongside results obtained by an existing method.

The description of the disclosed technique is merely an embodiment for structural or functional explanation, so the scope of the disclosed technique should not be construed as limited by the embodiments described herein. That is, because the embodiments can be modified in various ways and can take various forms, the scope of the disclosed technique should be understood to include equivalents capable of realizing its technical idea.

Meanwhile, the terms used in this application should be understood as follows.

Terms such as "first" and "second" are intended to distinguish one component from another, and the scope of rights should not be limited by these terms. For example, a first component may be named a second component, and similarly a second component may be named a first component.

When a component is described as being "connected" to another component, it may be connected directly to that component, but it should be understood that other components may exist in between. In contrast, when a component is described as being "directly connected" to another component, it should be understood that no other component exists in between. Other expressions describing the relationship between components, such as "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

A singular expression should be understood to include the plural unless the context clearly indicates otherwise. Terms such as "include" or "have" are intended to specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless the context clearly specifies a particular order, the steps may occur in an order different from the one stated. That is, the steps may occur in the stated order, may be performed substantially simultaneously, or may be performed in the reverse order.

Unless otherwise defined, all terms used herein have the same meaning as commonly understood by a person of ordinary skill in the field to which the disclosed technique belongs. Terms defined in commonly used dictionaries should be interpreted as having the meaning they have in the context of the related art, and cannot be interpreted as having an ideal or excessively formal meaning unless expressly so defined in this application.

FIG. 1 is a block diagram illustrating a data partitioning method for similarity search of high-dimensional data according to an embodiment of the disclosed technique. In conventional similarity search methods, exact matches of input values become relatively rarer as the dimensionality of the data increases, so finding the answer that exactly matches or is most similar to a query requires a high computational cost. Because this slows the search, various index structures are needed to reduce the cost. Most existing similarity search techniques that use an index structure employ non-overlapping partitioning based on the distance between a reference point and the data, and because they restrict the search to a limited set of candidates, they are fast. However, because data outside the compared set are not considered, they may fail to find the data that are actually most similar.

The data partitioning method according to an embodiment of the disclosed technique uses principal component analysis to take the characteristics of the data into account and provides a non-overlapping partitioning technique that can minimize the errors that may occur during similarity search, which has the advantage of enabling more accurate similarity search than existing methods.

Referring to FIG. 1, the method by which the partitioning device partitions a plurality of input high-dimensional data consists of: calculating a principal component vector through principal component analysis of the data; calculating a center point of the calculated principal component vector; partitioning the data with respect to a hyperplane passing through the calculated center point; and storing the result.

A hyperplane is a generalization of a plane to other dimensions. The term usually refers to a plane in a space of three or more dimensions, but it can also include a line in two dimensions and a point in one dimension. In this embodiment in particular, the hyperplane mainly means a plane whose dimension is one less than that of the plurality of high-dimensional data and that can divide the data in two, but it is not limited to this.

The calculated center point is a value that can represent the distribution characteristics or tendency of the data along the principal component vector. Depending on the embodiment, the center point may be calculated based on the mean, the median, the midpoint, or the like. The following embodiments describe, as an example, the case where the center point is calculated based on the median. The median is the value located in the middle when the data are arranged in order of magnitude, so the same number of data lie on either side of the median.
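The three center definitions named above can be compared on a small example. The following is a minimal sketch, assuming the data have already been projected onto the principal component vector as scalar values; the function name `center_value` and the sample values are hypothetical and not part of the disclosure.

```python
import numpy as np

def center_value(projections, kind="median"):
    """Return a scalar center of 1-D projections onto the principal axis."""
    p = np.asarray(projections, dtype=float)
    if kind == "mean":
        return float(p.mean())
    if kind == "median":
        return float(np.median(p))
    if kind == "midpoint":
        # Halfway between the smallest and largest projection.
        return float((p.min() + p.max()) / 2.0)
    raise ValueError(f"unknown kind: {kind}")

# Illustrative values with one outlier (not data from the disclosure).
values = [1.0, 2.0, 3.0, 4.0, 100.0]
print(center_value(values, "mean"))      # 22.0
print(center_value(values, "median"))    # 3.0
print(center_value(values, "midpoint"))  # 50.5
```

Only the median leaves the same number of values on each side here, which is the property the embodiment relies on for balanced splits.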

According to one embodiment, high-dimensional data are received in the data input step (S110). The data are high-dimensional data for multimedia, image processing, and the like, and similarity search is an important, frequently used function in the domains that use such data. So that the high-dimensional data can be searched quickly and accurately when a query arises in fields such as pattern recognition or image retrieval, the data are analyzed using principal component analysis and the center point of the data is calculated from that analysis.

The step of calculating the principal component vector (S120) is characterized by the use of principal component analysis. Principal component analysis is a data processing method that, for data made up of multidimensional feature vectors, reduces the dimensionality while retaining the information of the higher-dimensional representation. The principal component analysis is needed for the subsequent step of finding the center point of the data (S130).

More specifically, the data are linearly transformed into a new coordinate system such that the axis along which the variance of the projected data is largest becomes the first coordinate axis, the axis with the second-largest variance becomes the second axis, and so on in order. In this way the most important components of the data are placed on the respective axes.
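The coordinate change described above can be sketched with an eigendecomposition of the sample covariance matrix. This is one common way to carry out principal component analysis, assumed here for illustration; the disclosure does not specify the numerical procedure, and the function name `principal_axes` and the synthetic data are hypothetical.

```python
import numpy as np

def principal_axes(data):
    """Return (mean, axes, variances); axes are rows sorted by decreasing variance."""
    x = np.asarray(data, dtype=float)
    mean = x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x - mean, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder so the largest variance comes first
    return mean, eigvecs[:, order].T, eigvals[order]

# Synthetic data stretched along the first coordinate (illustrative only).
rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 0.2])

mean, axes, variances = principal_axes(pts)
coords = (pts - mean) @ axes.T   # the data expressed in the new coordinate system
print(variances)                 # variance along each new axis, largest first
```

The first row of `axes` plays the role of the principal component vector used in the later steps.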

Because principal component analysis finds the vectors that best express the characteristics of the data, it is used to find a partitioning criterion that takes the distribution of the data into account and can minimize errors.

To calculate the center point of the data, the point along the principal component vector obtained through principal component analysis that divides the data into two halves of equal count is taken as the center point of the data.

According to one embodiment, the partitioning criterion for the data is a hyperplane that is perpendicular to the principal component vector and passes through the center point, chosen to minimize errors in similarity search. In similarity search with non-overlapping partitioning, most errors occur for data near the partition boundary. Reducing the amount of data near the boundary therefore reduces the errors that can occur in similarity search. The principal component vector is the line that best represents the characteristics of the data and along which the variance is largest, so the variance within a hyperplane perpendicular to it is relatively small. This means that relatively few data lie near the partition boundary defined by that hyperplane.
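A single split of this kind can be sketched as follows, assuming the hyperplane is perpendicular to the first principal component and placed at the median projection. The principal direction is taken from a singular value decomposition of the centered data, which is an implementation choice made here rather than something stated in the disclosure; the function names are hypothetical.

```python
import numpy as np

def first_principal_axis(x):
    """Unit vector along the direction of largest variance."""
    centered = x - x.mean(axis=0)
    # Rows of vt are right singular vectors; the first has the largest variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def split_by_hyperplane(data):
    """Split data by the hyperplane {x : axis . x = center}."""
    x = np.asarray(data, dtype=float)
    axis = first_principal_axis(x)
    proj = x @ axis                 # scalar position of each point along the axis
    center = np.median(proj)        # center point along the principal component vector
    left_mask = proj <= center
    return axis, center, x[left_mask], x[~left_mask]

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
sample = rng.normal(size=(101, 4))
axis, center, left, right = split_by_hyperplane(sample)
print(len(left), len(right))
```

Points whose projection equals the median are assigned to the left side in this sketch; how ties are handled is not addressed in the source.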

FIG. 2 shows a method of partitioning data by a principal component vector (210), a center point (220), and a hyperplane (230) according to an embodiment of the disclosed technique.

Referring to FIG. 2, the relationship between the principal component vector (210) obtained in step S120 and the center point (220) that was found can be seen clearly. Once the center point (220) of the data has been found by the above process, the step of partitioning the data (S150) is performed.

According to one embodiment, the data are partitioned by repeating, a plurality of times, step S120 of calculating the principal component vector (210) through principal component analysis of the plurality of high-dimensional data, step S130 of calculating the center point (220) of the calculated principal component vector (210), and step S150 of partitioning the plurality of high-dimensional data with respect to the hyperplane (230) passing through the calculated center point (220). The partitioning criterion is determined using the center point (220) found in step S130, and the data are partitioned accordingly (240a, 240b) and stored in an index tree structure.

More specifically, a tree structure is generated in which the index assigned to the plurality of high-dimensional data is the parent node and the indexes assigned to each of the partitioned sets of high-dimensional data are the child nodes. Because of this property, the index takes the form of a tree whose left and right sides are similar in size.

The repetition of steps S120 to S150 may continue, for example, until the number of data in a partition is at or below a predetermined fraction of the number of original high-dimensional data. As another example, the steps may be repeated a preset number of times. In one embodiment, partitioning may be repeated until the number of data stored at each leaf of the tree is about 10% of the total. Afterwards, when a similarity search query is given, it is compared against the index at each node and moved down the tree; when it reaches a leaf where data are stored, a similarity search is performed there and the data best matching the query are returned.
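The repeated partitioning and the query descent described above can be sketched as a recursive tree build. This is a minimal illustration under several assumptions that are not in the source: the `Node` class, the function names, the use of Euclidean distance at the leaf, and the rule that a query goes left when its projection is at or below the stored threshold.

```python
import numpy as np

class Node:
    """Internal nodes store a split (axis, threshold); leaves store points."""
    def __init__(self, axis=None, threshold=None, left=None, right=None, points=None):
        self.axis, self.threshold = axis, threshold
        self.left, self.right, self.points = left, right, points

def build_tree(points, max_leaf_size):
    points = np.asarray(points, dtype=float)
    if len(points) <= max_leaf_size:
        return Node(points=points)
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]                          # principal component vector (S120)
    proj = points @ axis
    threshold = np.median(proj)           # center point along that vector (S130)
    mask = proj <= threshold              # split by the hyperplane (S150)
    if mask.all() or not mask.any():      # degenerate split, e.g. duplicate points
        return Node(points=points)
    return Node(axis, threshold,
                build_tree(points[mask], max_leaf_size),
                build_tree(points[~mask], max_leaf_size))

def query(node, q):
    """Descend to one leaf, then return its point closest to q."""
    q = np.asarray(q, dtype=float)
    while node.points is None:
        node = node.left if q @ node.axis <= node.threshold else node.right
    dists = np.linalg.norm(node.points - q, axis=1)
    return node.points[np.argmin(dists)]

def leaf_sizes(node):
    if node.points is not None:
        return [len(node.points)]
    return leaf_sizes(node.left) + leaf_sizes(node.right)

# Synthetic data; leaves hold at most 10% of the items, as in the embodiment.
rng = np.random.default_rng(2)
data = rng.normal(size=(200, 8))
tree = build_tree(data, max_leaf_size=len(data) // 10)
result = query(tree, rng.normal(size=8))
print(leaf_sizes(tree))
```

Because the query visits only one leaf, the returned point is not guaranteed to be the true nearest neighbor, which is the kind of error the partitioning criterion aims to reduce.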

FIG. 3 is a graph comparing the results of the existing VP-tree technique with those of the present technique, in order to validate the method according to one embodiment. A tree was built from real data and queries were issued; the similarity search accuracy of each technique was measured by comparing how often the answer it returned matched the answer that was actually most similar. The Iris, Cloud, and Corel Image Features data sets from the UC Irvine Machine Learning Repository were used, with 150, 1024, and 68040 data items of 4, 10, and 32 dimensions, respectively. Each tree was built by dividing the data into subsets until the number of data stored at a leaf reached 10% of the total.

10-fold cross validation was performed 100 times on each data set. In FIG. 3, the x-axis shows the data set used and the y-axis shows the average error rate. As the graph shows, the method according to this embodiment finds answers matching the query with higher accuracy than the existing VP-tree in similarity search.
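The error measure described above, whether an index returns the item that is actually most similar, can be sketched as a comparison against brute-force search. The sketch below is not the experimental protocol of the disclosure: it omits cross validation, uses synthetic data, assumes Euclidean distance, and substitutes a deliberately crude stand-in for the index. All names are hypothetical.

```python
import numpy as np

def exact_nearest(data, q):
    """Index of the true nearest neighbor of q by brute-force Euclidean search."""
    return int(np.argmin(np.linalg.norm(data - q, axis=1)))

def error_rate(approx_nearest, data, queries):
    """Fraction of queries where the approximate answer differs from the exact one."""
    misses = sum(approx_nearest(q) != exact_nearest(data, q) for q in queries)
    return misses / len(queries)

rng = np.random.default_rng(3)
data = rng.normal(size=(300, 4))
queries = rng.normal(size=(50, 4))

# Hypothetical stand-in for an index: it only ever looks at the first half of the data.
half = data[:150]
rate = error_rate(lambda q: exact_nearest(half, q), data, queries)
print(f"error rate of the stand-in index: {rate:.2f}")
```

Any index that maps a query to an item position, such as a tree built by the partitioning method, could be passed in place of the stand-in.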

The disclosed method and apparatus have been described with reference to the embodiments shown in the drawings to aid understanding, but these are merely examples, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. Therefore, the true scope of technical protection of the disclosed technique should be determined by the appended claims.

210: principal component vector
220: center point
230: hyperplane
240a, 240b: partitioned data

Claims

A method by which a partitioning device partitions a plurality of high-dimensional data, the method comprising:
(a) calculating a principal component vector through principal component analysis of the plurality of high-dimensional data;
(b) calculating a center point of the calculated principal component vector; and
(c) partitioning the plurality of high-dimensional data with respect to a hyperplane passing through the calculated center point.

The method of claim 1, further comprising:
(d) repeating steps (a) to (c) a plurality of times for each of the partitioned sets of high-dimensional data.

The method of claim 2, wherein step (c) comprises:
generating, in a current iteration, a tree structure in which an index assigned to the plurality of high-dimensional data is a parent node and an index assigned to each partitioned set of high-dimensional data is a child node.

The method of claim 2, wherein step (d)
is repeated until the ratio of the number of high-dimensional data in each set partitioned in the current iteration to the number of the plurality of high-dimensional data in the first iteration is equal to or less than a preset ratio.

The method of claim 1, wherein the center point is
a point that divides the data located on the principal component vector into two halves of equal count.

The method of claim 1, wherein the hyperplane is
perpendicular to the principal component vector.

The method of claim 3, wherein
the index tree structure takes the form of a tree whose left and right sides are similar in size.

A computer-readable recording medium containing a program for executing the method of claim 1 on a computer.