KR20060025225A

KR20060025225A - Data indexing and similar vector searching method in high dimensional vector set based on hierarchical bitmap indexing for multimedia database

Info

Publication number: KR20060025225A
Application number: KR1020060019260A
Authority: KR
Inventors: 유혁; 김형철; 정영민; 낭종호; 박주현
Original assignee: 주식회사 씬멀티미디어
Priority date: 2006-02-28
Filing date: 2006-02-28
Publication date: 2006-03-20
Also published as: KR100786675B1

Abstract

Disclosed is a method of data indexing and similar vector search in a high-dimensional vector set based on a hierarchical bitmap index in a multimedia database. The method indexes a data set so that a user can quickly find vectors whose distance from a query vector is small in a high-dimensional vector set, and searches for similar vectors using that index. It comprises: (a) constructing a bitmap index by generating bitmaps for the vectors in a multimedia database consisting of a high-dimensional vector set; (b) performing an XOR operation between the bitmap indexes generated in step (a), computing an approximate distance as the product of the number of "11" patterns in the XOR result and the difference between the upper and lower intervals of the bitmap index, and excluding vectors whose approximate distance from the query vector is greater than or equal to a predetermined threshold as vectors of markedly low similarity; and (c) computing the distance between the query vector and each vector not excluded in step (b), and finally selecting the vectors similar to the query vector. According to the present invention, a high-dimensional vector set is indexed using bitmaps, and with such an index the result vector set of a similar vector search can be retrieved faster than with conventional methods.

Multimedia database, hierarchical bitmap indexing, data indexing, similar vector search

Description

Data indexing and similar vector searching method in high dimensional vector set based on hierarchical bitmap indexing for multimedia database

FIG. 1 is a flowchart illustrating the main steps of the method of data indexing and similar vector search in a high-dimensional vector set based on a hierarchical bitmap index in a multimedia database according to an embodiment of the present invention; and

FIGS. 2 to 4 are examples of pseudocode for implementing the method of FIG. 1.

The present invention relates to an MPEG image processing method, and more particularly to an MPEG image processing method that supports data indexing and similar vector search in a high-dimensional vector set based on a hierarchical bitmap indexing method.

Nearest neighbor queries are frequently used in multimedia databases. A nearest neighbor query is defined as "a query that, given a set of object points and a query point in a multidimensional vector space, finds the object with the minimum Euclidean distance from the query point." To process such queries efficiently, most existing techniques use a multidimensional index.

However, existing multidimensional indexes perform very well in low-dimensional applications such as GIS but are known to perform much worse in high-dimensional applications such as multimedia. The reason is that as the dimensionality increases, the search space grows exponentially and the objects become relatively sparse in the high-dimensional space. As a result, a cluster may contain no object or only a single object. This problem is commonly called the curse of dimensionality, and many studies on new index structures have been conducted to overcome it.

Existing indexing and search methods for high-dimensional vector spaces fall into two broad categories. The first is the data partitioning method, which builds a tree-shaped index by clustering the objects in the database according to their spatial location. The R-tree, R*-tree, and X-tree are representative indexing methods of this type. When a query arrives, these methods search only the clusters that are spatially close to the query object, which can reduce search time dramatically.

However, a probabilistic analysis shows that in a high-dimensional space, clustering by spatial location has almost no effect in reducing the search space. On the contrary, because the time needed to read and process the indexing information is added, a data partitioning method such as the R-tree may take longer than a sequential search. In summary, data partitioning methods perform well in multidimensional spaces of 10 dimensions or fewer, but their performance degrades sharply in high-dimensional spaces because of the curse of dimensionality.

The second is the object approximation method, which builds a small index by approximating the objects in the database and quickly compares that index sequentially with the query object to exclude objects of low similarity. Because approximate comparison alone cannot yield the exact neighbors, the objects that were not excluded are compared with the query object once more. For this reason the method is also called the filtering approach. The VA-file (Vector Approximation file) and the LPC-file are representative methods that use vector approximation.

The VA-file divides the data space into a predefined number of cells and assigns a bit string to each cell as index information. The vectors within a cell are approximated by that cell. For example, if each of the two dimensions of a two-dimensional data space is divided into four intervals whose index information is 00, 01, 10, and 11, a data point lying in the second interval of both the first and the second dimension has the index 0101. When a query is given, approximate minimum and maximum distances between the query object and each object can be obtained, and these are used to exclude objects that cannot belong to the result set. Because the stored index is not large, this algorithm reduces the time needed to read the index information, but the amount of Euclidean distance computation required to extract the objects that may belong to the result set becomes large, so the search takes a long time. The LPC-file finds neighbor vectors by the same process as the VA-file, but by using a polar coordinate representation it has a smaller index and better filtering performance than the VA-file, and therefore performs comparatively better. However, index comparison is time-consuming in both cases, because the VA-file computes per dimension and the LPC-file performs complex polar coordinate calculations. Moreover, since a filtering method compares the index information of every vector in the set with the query vector, a slow index comparison is a serious obstacle to fast search.
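The VA-file cell encoding described above can be illustrated with a minimal sketch. The function name `va_cell_code`, the equal-width grid, and the value range [0, 1] are assumptions made for illustration only; the source states only that each dimension is split into four intervals labeled 00, 01, 10, and 11.

```python
def va_cell_code(point, bits_per_dim=2, lo=0.0, hi=1.0):
    """Return a VA-file style bit string for a point (hypothetical helper).

    Assumes an equal-width grid over [lo, hi] in every dimension.
    """
    cells = 1 << bits_per_dim  # 4 intervals per dimension for 2 bits
    parts = []
    for x in point:
        # Interval number, clamped so that x == hi falls in the last interval.
        idx = min(int((x - lo) / (hi - lo) * cells), cells - 1)
        parts.append(format(idx, "0{}b".format(bits_per_dim)))
    return "".join(parts)


# A point in the second interval of both dimensions, as in the text's example.
print(va_cell_code([0.3, 0.4]))  # 0101
```

The approximation keeps only `bits_per_dim` bits per dimension, which is why the index stays small while every vector still has to be compared dimension by dimension.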

An object of the present invention is to provide a method of data indexing and similar vector search in a high-dimensional vector set, based on a hierarchical bitmap indexing method, for quickly retrieving desired data in a multimedia database.

To achieve the above technical object, the method of data indexing and similar vector search in a high-dimensional vector set based on a hierarchical bitmap index in a multimedia database according to the present invention is a method of indexing a data set so that a user can quickly find vectors whose distance from a query vector is small in a high-dimensional vector set, and a similar vector search method using the same, the method comprising:

(a) constructing a bitmap index by generating bitmaps for the vectors in a multimedia database consisting of a high-dimensional vector set;

(b) computing, using the bitmap index generated in step (a), an approximate distance between the query vector and each object by performing an XOR operation between the bitmap indexes and multiplying the number of "11" patterns in the XOR result by the difference between the upper and lower intervals of the bitmap index, and excluding vectors whose approximate distance from the query vector is greater than or equal to a predetermined threshold as vectors of markedly low similarity; and

(c) computing the distance between the query vector and each vector not excluded in step (b), and finally selecting the vectors similar to the query vector.

Preferably, step (a) comprises:

(a-1) dividing the space of each dimension of the high-dimensional vector sets into a predetermined number of regions; and

(a-2) constructing a bitmap index by assigning the vectors of the vector set to be indexed, according to their values, to the regions divided in step (a-1).

Preferably, step (b) comprises:

(b-1) obtaining an approximate distance using an XOR operation between the bitmaps of the query vector and the bitmaps of the target vector;

(b-2) including the target vector in a candidate set if the approximate distance is smaller than a given distance; and

(b-3) repeating steps (b-1) and (b-2) for all vectors in the vector set.

Preferably, step (c) comprises:

(c-1) selecting a vector from the candidate set;

(c-2) computing the distance between the query vector and the selected vector and including the selected vector in an answer set if that distance is smaller than a given distance; and

(c-3) repeating steps (c-1) and (c-2) for all vectors in the candidate set.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the accompanying drawings.

FIG. 1 is a flowchart showing the main steps of the method of data indexing and similar vector search in a high-dimensional vector set based on a hierarchical bitmap index in a multimedia database according to an embodiment of the present invention. FIGS. 2 to 4 show an example of pseudocode for implementing the method of FIG. 1.

Referring to FIG. 1, in the method of data indexing and similar vector search in a high-dimensional vector set based on a hierarchical bitmap index in a multimedia database according to an embodiment of the present invention, first, (a) a bitmap index is constructed by generating bitmaps for the vectors in a multimedia database consisting of a high-dimensional vector set (step S10). To explain step (a), assume the following vector set. Let N be a positive natural number denoting the number of vectors, and assume a set containing N vectors as its elements. Let d be the dimension of the vectors, so that each vector is expressed by its d element values, and assume that every element belongs to a given set of values. An index consisting of multiple bitmaps for an arbitrary vector of the set can be created by the following process. Initially, two parameters are set (their expressions are missing from this text), and the bitmap number is set to 1.

First, the space of each dimension is divided into three regions. The two values that delimit the regions are arbitrary values satisfying a given condition (the expressions are missing from this text).

Next, each dimension of the vector is encoded with 2 bits according to the region in which its value lies: '00' if the value belongs to the first region, '01' if it belongs to the second region, and '11' if it belongs to the third region. The codes of all dimensions are then concatenated to form the bitmap carrying the current bitmap number as its ID.

Next, the bitmap number is compared with a limit value, and if the bitmap number is greater than that limit, index generation for the vector is terminated.

Next, one value is taken as a new parameter and the above steps are repeated to generate a further bitmap, and likewise another value is taken as a new parameter and the above steps are repeated to generate another bitmap.

Finally, by constructing a bitmap index for each of the N vectors through the above process, bitmap indexes are built for all vectors in the set.

That is, step (a) described above can be summarized as (a-1) dividing the space of each dimension of the high-dimensional vector sets into a predetermined number of regions, and (a-2) constructing a bitmap index by assigning the vectors of the vector set to be indexed, according to their values, to the regions divided in step (a-1). The bitmap generation step (step S10) is performed only once, initially, for the vector set.
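The encoding in steps (a-1) and (a-2) can be sketched as follows. This is only an illustration under stated assumptions: the names `encode_bitmap` and `build_index`, the half-open region boundaries, and the idea of passing one (lower, upper) pair per bitmap level in `cut_pairs` are hypothetical, since the exact interval expressions and the rule for deriving the next level are not legible in this text.

```python
def encode_bitmap(vector, lower, upper):
    """Encode each dimension with 2 bits using two cut points.

    Assumed convention: value < lower -> '00', lower <= value < upper -> '01',
    value >= upper -> '11'.
    """
    if not lower < upper:
        raise ValueError("lower must be smaller than upper")
    codes = []
    for value in vector:
        if value < lower:
            codes.append("00")
        elif value < upper:
            codes.append("01")
        else:
            codes.append("11")
    return "".join(codes)


def build_index(vector, cut_pairs):
    """Build one bitmap per (lower, upper) pair.

    The list of pairs stands in for the hierarchical refinement of the
    source, whose exact rule is an open assumption here.
    """
    return [encode_bitmap(vector, lo, up) for lo, up in cut_pairs]


index = build_index([0.1, 0.5, 0.7], [(0.33, 0.66), (0.2, 0.8)])
print(index)  # ['000111', '000101']
```

Because the index is built once per vector set, the cost of encoding is paid up front and only the bitmaps need to be read at query time.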

Next, (b) using the bitmap index generated in step (a), the approximate distance between the query vector and each object is computed by performing an XOR operation between the bitmap indexes and multiplying the number of "11" patterns in the XOR result by the difference between the upper and lower intervals of the bitmap index, and vectors whose approximate distance from the query vector is greater than or equal to a predetermined threshold are regarded as vectors of markedly low similarity and excluded (step S12).

An embodiment implementing step (b) is now described. To implement step (b), the first stage of the similarity search using the generated bitmaps, a d-dimensional vector (the query vector) and the previously defined vector set are taken as inputs. Through step (a), multiple bitmap indexes are created for the query vector. Multiple bitmap indexes are created in order to improve the accuracy of the approximation.

Initially, an index variable is set to 1, and a candidate set is defined as the set of vectors that are not excluded in the first-stage search using the approximate distance.

Next, an XOR operation is performed between the bitmaps of the query vector and the corresponding bitmaps of the current object vector, the number of '11' patterns is obtained at each bitmap level, and the approximate distance for the distance between the two vectors is computed from these counts (the formula itself is missing from this text). That is, the XOR operation is performed between the bitmap indexes generated in step (a), and the approximate distance is computed as the product of the number of "11" patterns in the XOR result and the difference between the upper and lower intervals of the bitmap index. Note that, because multiple bitmap indexes are used to improve the accuracy of the approximation, the sum of the approximate distances obtained from the individual bitmap indexes is taken as the final approximate distance.
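The XOR-based approximation can be sketched as below. The sketch assumes that bitmaps are stored as Python integers with 2 bits per dimension, and that each bitmap level has its own (lower, upper) cut pair; the names `count_11_pairs` and `approx_distance` are hypothetical and do not come from the source.

```python
def count_11_pairs(xor_value, dims):
    """Count the 2-bit groups equal to binary 11 in an XOR result."""
    count = 0
    for _ in range(dims):
        if xor_value & 0b11 == 0b11:
            count += 1
        xor_value >>= 2  # move to the next dimension's 2-bit group
    return count


def approx_distance(query_bitmaps, object_bitmaps, cut_pairs, dims):
    """Sum, over all bitmap levels, of (#'11' groups) * (upper - lower)."""
    total = 0.0
    for q, o, (lower, upper) in zip(query_bitmaps, object_bitmaps, cut_pairs):
        total += count_11_pairs(q ^ o, dims) * (upper - lower)
    return total


# Level 1: query codes 00|01|11, object codes 11|01|00 -> XOR 11|00|11.
# Level 2: identical bitmaps -> XOR is 0.
q = [0b000111, 0b010101]
o = [0b110100, 0b010101]
print(approx_distance(q, o, [(0.25, 0.75), (0.4, 0.6)], dims=3))  # 1.0
```

With the codes '00', '01', and '11', the XOR of two codes is '11' only when one value lies in the lowest region and the other in the highest, so under this assumed encoding the two values differ by at least `upper - lower` in that dimension. That is the reason the count of '11' groups can serve as a cheap distance estimate.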

Next, if the approximate distance is smaller than the predetermined distance, the object vector is added to the candidate set; in this way the candidate vectors that enter the candidate set are selected.

Next, if the index variable is greater than or equal to its limit, step (b) ends; otherwise, the index variable is incremented by one and the above steps are repeated.

In summary, step (b) consists of (b-1) obtaining an approximate distance using an XOR operation between the bitmaps of the query vector and the bitmaps of the target vector, (b-2) including the target vector in the candidate set if the approximate distance is smaller than a given distance, and (b-3) repeating steps (b-1) and (b-2) for all vectors in the vector set.
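Steps (b-1) to (b-3) amount to a single filtering pass over the bitmaps. A minimal sketch follows, assuming one bitmap level stored as an integer with 2 bits per dimension, a single region width `gap` (upper minus lower), and a search radius `r`; all names are hypothetical.

```python
def approx_distance(q_bits, o_bits, dims, gap):
    """(b-1): count the '11' groups in the XOR and scale by the region gap."""
    x = q_bits ^ o_bits
    hits = sum(1 for i in range(dims) if (x >> (2 * i)) & 0b11 == 0b11)
    return hits * gap


def filter_candidates(query_bits, object_bits, dims, gap, r):
    """(b-2), (b-3): keep the positions of objects whose estimate is below r."""
    return [
        i
        for i, bits in enumerate(object_bits)
        if approx_distance(query_bits, bits, dims, gap) < r
    ]


objects = [0b000111, 0b110100, 0b000101]
print(filter_candidates(0b000111, objects, dims=3, gap=0.5, r=0.6))  # [0, 2]
```

Only integer XOR, shifts, and comparisons are needed per object, which is the property the filtering stage relies on for speed.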

Now, (c) the distance between the query vector and each vector not excluded in step (b) is computed, and the vectors similar to the query vector are finally selected (step S14). The distance is computed by a method known to those of ordinary skill in the art, such as the L1 distance, called the Manhattan distance, or the L2 distance, called the Euclidean distance, and is therefore not described further.

An embodiment implementing step (c) is now described. Step (c) produces the final search result by computing the distance between the query vector and each vector contained in the candidate set created in step (b). First, a search result set is defined. Then an index variable is initially set to 1, and the first vector is taken out of the candidate set and set as the current vector.

Next, the distance between the current vector and the query vector is computed, and if it is smaller than the given distance, the current vector is added to the answer set.

Next, if the candidate set is empty, the process ends; otherwise, the index variable is incremented by one and step (c) above is repeated. The vectors contained in the answer set are the vectors of the vector set that lie within the given distance of the query vector.

In summary, step (c) consists of (c-1) selecting a vector from the candidate set, (c-2) computing the distance between the query vector and the selected vector and including the selected vector in the answer set if that distance is smaller than a given distance, and (c-3) repeating steps (c-1) and (c-2) for all vectors in the candidate set.
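The refinement in steps (c-1) to (c-3) can be sketched as an exact distance check over the surviving candidates. The sketch assumes a Minkowski-type distance with parameter `p` (p = 1 gives the Manhattan distance, p = 2 the Euclidean distance); the names `minkowski_distance` and `refine` are hypothetical.

```python
def minkowski_distance(a, b, p=2):
    """Exact distance between two vectors; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def refine(query, vectors, candidate_ids, r, p=2):
    """Keep the candidates whose exact distance to the query is below r."""
    return [
        i for i in candidate_ids
        if minkowski_distance(query, vectors[i], p) < r
    ]


vectors = [[3, 4], [1, 1], [0.1, 0.1]]
# Vector 1 was already dropped by the filter; vector 0 is rejected here.
print(refine([0, 0], vectors, candidate_ids=[0, 2], r=1.0))  # [2]
```

The expensive per-dimension arithmetic runs only on the candidate set, so the overall cost depends on how many vectors the bitmap filter lets through.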

Each step of the method of data indexing and similar vector search in a high-dimensional vector set based on a hierarchical bitmap index in a multimedia database according to the present invention, as described above, can be written as a computer program that is read and executed by a computer. The program codes and code segments constituting the computer program can be easily inferred by a computer programmer skilled in the art.

FIG. 2 shows an example of pseudocode for implementing the method of FIG. 1. The pseudocode shown in FIG. 2 corresponds to the overall function that finds vectors similar to a query object vector within a set of n object vectors. A candidate vector set and an answer object vector set are defined (202).

A function (204) named CreatHBIndex() is defined to generate bitmap indexes for the high-dimensional vector sets. This corresponds to step (a) described above. Next, a function (206) is defined that filters out the object vectors whose similarity is markedly low by comparison with a predetermined threshold r.

FIG. 3 shows an example of pseudocode implementing the function (206). Referring to FIG. 3, the bitmap indexes constructed for the query vector and the bitmap indexes constructed for the object vectors are taken as inputs, and filtering is performed for the given search range.

That is, for each index value in the loop range, a function (302) that compares the bitmap indexes with each other is called to obtain the approximate distance. Then, using the comparison routine (304), only the object vectors whose approximate distance is smaller than the search range are stored as candidate vectors.

FIG. 4 shows an example of pseudocode implementing the function (208) that computes the distance for the candidate vectors and finally determines the answer object vectors. Referring to FIG. 4, for each index number of the candidate vectors, starting from 1 (402), the distance between the query vector and the object vector under examination is computed (404). Finally, the object vectors whose computed distance is smaller than the predetermined threshold are included in the answer set (406), which completes the final selection of the vectors similar to the query vector.

With the method of indexing a data set so that vectors whose distance from a query vector is small can be found quickly in a high-dimensional vector set, and the similarity search method using it, according to the present invention as described above, a user can find the result vector set at least six times faster than without the index.

As described above, according to the present invention, a high-dimensional vector set is indexed using bitmaps, and with such an index the result vector set of a search for vectors similar to a query vector can be retrieved faster than with conventional methods.

Claims

A method of indexing a data set so that a user can quickly find vectors whose distance from a query vector is small in a high-dimensional vector set, and of searching for similar vectors using the index, the method comprising:

(a) constructing a bitmap index by generating bitmaps for the vectors in a multimedia database consisting of a high-dimensional vector set;

(b) computing, using the bitmap index generated in step (a), an approximate distance between the query vector and each object by performing an XOR operation between the bitmap indexes and multiplying the number of "11" patterns in the XOR result by the difference between the upper and lower intervals of the bitmap index, and excluding vectors whose approximate distance from the query vector is greater than or equal to a predetermined threshold as vectors of markedly low similarity; and

(c) computing the distance between the query vector and each vector not excluded in step (b), and finally selecting the vectors similar to the query vector;

the method being a method of data indexing and similar vector search in a high-dimensional vector set based on a hierarchical bitmap index in a multimedia database.

The method of claim 1, wherein step (a) comprises:

(a-1) dividing the space of each dimension of the high-dimensional vector sets into a predetermined number of regions; and

(a-2) constructing a bitmap index by assigning the vectors of the vector set to be indexed, according to their values, to the regions divided in step (a-1).

The method of claim 1, wherein step (b) comprises:

(b-1) obtaining an approximate distance using an XOR operation between the bitmaps of the query vector and the bitmaps of the target vector;

(b-2) including the target vector in a candidate set if the approximate distance is smaller than a given distance; and

(b-3) repeating steps (b-1) and (b-2) for all vectors in the vector set.

The method of claim 1, wherein step (c) comprises:

(c-1) selecting a vector from the candidate set;

(c-2) computing the distance between the query vector and the selected vector and including the selected vector in an answer set if that distance is smaller than a given distance; and

(c-3) repeating steps (c-1) and (c-2) for all vectors in the candidate set.