KR101052220B1 - Apparatus and Method for Executing Skyline Queries That Include Search Terms

Info

Publication number: KR101052220B1
Application number: KR20090003651A
Authority: KR
Inventors: 정연돈; 최현식; 정하림; 박준표
Original assignee: 고려대학교 산학협력단
Priority date: 2009-01-16
Filing date: 2009-01-16
Publication date: 2011-07-27
Also published as: KR20100084266A

Abstract

An apparatus and method for performing a skyline query that includes search terms are disclosed. A data search tree generation unit generates a data search tree by hierarchically arranging terminal nodes and non-terminal nodes. Each terminal node consists of terminal node information that includes a feature code, which represents the characteristics of a data object and encodes the search terms used to retrieve it; coordinate information, which gives the coordinates of mutually adjacent data objects in a multidimensional coordinate space representing the numerical attributes of the data objects; and the identification number of the data object. Each non-terminal node consists of non-terminal node information that includes a node code, computed as the logical sum (OR) of the feature codes contained in its child nodes; the coordinate information of a minimum bounding rectangle set to enclose the data objects contained in the terminal nodes of the lower layers; and the identification number of that minimum bounding rectangle. An identification number storage unit stores the identification numbers of the data objects and of the minimum bounding rectangles contained in the nodes of the data search tree, sorted in order of increasing distance from the origin of the coordinate space. If the identification number at the top of the identification number storage unit is that of a data object contained in the terminal node information of a terminal node of the data search tree, a skyline generation unit generates a skyline that includes that identification number. According to the present invention, accurate search results can be obtained while increasing the speed of data retrieval.

Description

Apparatus and method for processing skyline queries including keywords

The present invention relates to an apparatus and method for performing a skyline query that includes search terms and, more particularly, to an apparatus and method for retrieving optimal data based on the data attributes represented by each dimension of a multidimensional tuple.

Skyline queries have recently become recognized as an important query method in various fields, including multi-criteria decision making. A skyline query can return a large number of results, depending on the distribution and cardinality of the data. For example, skyline queries can be used in online shops to find the best products for the conditions a user wants, or to recommend products expected to suit the user's taste. The skyline is obtained from a set of d-dimensional tuples and consists of the tuples that are not dominated by any other tuple. Here, a tuple tp is said to dominate a tuple tp' if tp is no worse than tp' in every dimension and better than tp' in at least one dimension. Being better than another tuple in a given dimension means being closer to the origin in that dimension. The tuples included in the skyline are called 'skyline tuples'.
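The dominance relation and the skyline defined above can be illustrated with a minimal Python sketch. It assumes that smaller attribute values are better (closer to the origin). The names `dominates` and `naive_skyline` and the sample points are illustrative assumptions and do not appear in the patent.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """Return True if p dominates q: p is no worse in every dimension
    and strictly better (smaller) in at least one."""
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def naive_skyline(points: List[Point]) -> List[Point]:
    """Quadratic-time skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical sample data (smaller is better in both dimensions).
sample = [(3, 2), (4, 3), (1, 9), (6, 5)]
result = naive_skyline(sample)
```

The quadratic scan is only meant to make the definition concrete; index-based methods such as BBS exist precisely to avoid comparing every pair of tuples.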

FIG. 1 shows four two-dimensional tuples located in the xy plane. Referring to FIG. 1, tp₁ is no worse than tp₂ to tp₄ in terms of both condition x and condition y, and is better than tp₂ to tp₄ in at least one of condition x and condition y; tp₁ can therefore be regarded as dominating tp₂ to tp₄. FIG. 2 shows a skyline obtained from a group of tuples. Referring to FIG. 2, among tuples a to f, tuples a to c dominate the other tuples, and no dominance relation holds among a, b, and c themselves. The skyline is therefore formed by tuples a to c.

One search method widely used for skyline queries is the branch-and-bound skyline (BBS) search. This algorithm computes the skyline using an R-tree index. It is based on best-first search, which selects the data closest to the result first, and on the branch-and-bound method, which first obtains a number of candidate values for the result and then discards data whose conditions fall short of those candidates. In this method, the preference of each tuple is ranked by a particular monotonically increasing function M, and the tuple with the smallest value of M, that is, the tuple closest to the origin, is always included in the skyline.

When information is retrieved with such a skyline query and the user has additional search conditions beyond those corresponding to the dimensions of the d-dimensional tuples, the user must perform a further search on the skyline to obtain the information finally wanted. Consequently, most of the results returned by the skyline query are meaningless to the user. A new search method that takes the user's preferences into account is therefore needed to provide effective information retrieval with skyline queries.

One method that extends the conventional skyline query so that data is retrieved using search terms as well is the inverted-index-based keyword skyline search (INKS). FIG. 3 shows an algorithm that performs data retrieval using the INKS method. This method combines the conventional inverted index with the block-nested-loop (BNL) algorithm and consists of two stages: keyword search and skyline search. In the keyword search stage, keyword matching is performed using the inverted index to extract all data corresponding to each query keyword. In the skyline search stage, BNL is then applied according to the given query conditions to obtain the skyline as the final result.
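The two stages described above can be sketched in Python as follows. This is not the algorithm of FIG. 3, which is not reproduced in this text. It is a simplified in-memory illustration that assumes query keywords are combined with AND and that the whole BNL window fits in memory. The function names, object ids, keywords, and coordinates are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p is no worse than q everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def build_inverted_index(keywords: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each keyword to the set of object ids that carry it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for oid, words in keywords.items():
        for word in words:
            index[word].add(oid)
    return index


def inks(points: Dict[str, Point], index: Dict[str, Set[str]],
         query: Iterable[str]) -> List[str]:
    # Stage 1 (keyword search): ids that match every query keyword.
    # Assumption: the per-keyword matches are intersected (AND semantics).
    postings = [index.get(word, set()) for word in query]
    candidates = set.intersection(*postings) if postings else set(points)

    # Stage 2 (skyline search): block-nested-loop over the candidates,
    # simplified to a single in-memory window.
    window: List[str] = []
    for oid in sorted(candidates):
        if any(dominates(points[w], points[oid]) for w in window):
            continue  # the candidate is dominated by a window entry
        # Drop window entries that the new candidate dominates.
        window = [w for w in window if not dominates(points[oid], points[w])]
        window.append(oid)
    return window


# Hypothetical data: ids, coordinates, and keywords are illustrative only.
points = {"a": (1, 9), "h": (4, 3), "i": (3, 2), "x": (2, 1)}
keywords = {"a": {"wifi"}, "h": {"wifi"}, "i": {"wifi", "pool"}, "x": {"pool"}}
index = build_inverted_index(keywords)
```

The sketch makes the separation visible: dominance tests begin only after the keyword stage has finished, which is the drawback the following paragraph attributes to INKS.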

Although the INKS method includes keyword-based data retrieval in its execution, the keyword search and the skyline search are in practice separate, so it has the same drawback as the BBS method: the search time becomes long.

The technical problem to be solved by the present invention is to provide an apparatus and method for performing a skyline query including search terms that can obtain accurate search results while increasing the speed of data retrieval.

Another technical problem to be solved by the present invention is to provide a computer-readable recording medium storing a program for causing a computer to execute a method for performing a skyline query including search terms that can obtain accurate search results while increasing the speed of data retrieval.

To achieve the above technical problem, an apparatus for performing a skyline query including search terms according to the present invention comprises: a data search tree generation unit that generates a data search tree by hierarchically arranging terminal nodes and non-terminal nodes, each terminal node consisting of terminal node information that includes a feature code representing the characteristics of a data object and encoding the search terms used to retrieve the data object, coordinate information giving the coordinates of mutually adjacent data objects in a multidimensional coordinate space representing the numerical attributes of the data objects, and the identification number of the data object, and each non-terminal node consisting of non-terminal node information that includes a node code computed as the logical sum (OR) of the feature codes contained in the child nodes below it, the coordinate information of a minimum bounding rectangle set to enclose the data objects contained in the terminal nodes of the lower layers in the coordinate space, and the identification number of the minimum bounding rectangle; an identification number storage unit that stores the identification numbers of the data objects and of the minimum bounding rectangles contained in the nodes of the data search tree, sorted in order of increasing distance from the origin of the coordinate space; and a skyline generation unit that, if the identification number at the top of the identification number storage unit is that of a data object contained in the terminal node information of a terminal node of the data search tree, generates a skyline including that identification number. The identification number storage unit stores the identification number of the minimum bounding rectangle corresponding to a node code if the logical sum of the node code and a query code, obtained by encoding the query terms entered by the user, matches the node code, and stores the identification number of the data object corresponding to a feature code if the logical sum of the feature code and the query code matches the feature code.
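The storage condition stated above, that the logical sum of a code and the query code must equal the code itself, can be illustrated with integer bit masks. This is a minimal sketch: the one-bit-per-term encoding and the term names are assumptions made for illustration, since this passage does not specify how search terms are encoded.

```python
def contains_query(code: int, query_code: int) -> bool:
    """True if OR-ing the query code into `code` leaves it unchanged,
    i.e. every bit set in the query code is already set in `code`."""
    return (code | query_code) == code


# Hypothetical 4-bit encoding: one bit per search term.
WIFI, POOL, PARKING, SPA = 0b0001, 0b0010, 0b0100, 0b1000

query_code = WIFI | POOL               # the user asks for both terms
feature_code = WIFI | POOL | SPA       # an object that has both queried terms
other_code = WIFI | PARKING            # an object that lacks POOL
node_code = feature_code | other_code  # node code: OR of the child codes
```

Because a node code is the OR of its children's codes, passing the test at a non-terminal node only shows that the subtree may contain a matching object; the same test on each feature code decides the individual objects.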

To achieve the above other technical problem, a method for performing a skyline query including search terms according to the present invention comprises: a data search tree generation step of generating a data search tree by hierarchically arranging terminal nodes and non-terminal nodes, each terminal node consisting of terminal node information that includes a feature code representing the characteristics of a data object and encoding the search terms used to retrieve the data object, coordinate information giving the coordinates of mutually adjacent data objects in a multidimensional coordinate space representing the numerical attributes of the data objects, and the identification number of the data object, and each non-terminal node consisting of non-terminal node information that includes a node code computed as the logical sum (OR) of the feature codes contained in the child nodes below it, the coordinate information of a minimum bounding rectangle set to enclose the data objects contained in the terminal nodes of the lower layers in the coordinate space, and the identification number of the minimum bounding rectangle; an identification number storage step of storing the identification numbers of the data objects and of the minimum bounding rectangles contained in the nodes of the data search tree, sorted in order of increasing distance from the origin of the coordinate space; and a skyline generation step of generating, if the identification number at the top of the identification numbers stored in the identification number storage step is that of a data object contained in the terminal node information of a terminal node of the data search tree, a skyline including that identification number. In the identification number storage step, the identification number of the minimum bounding rectangle corresponding to a node code is stored if the logical sum of the node code and a query code, obtained by encoding the query terms entered by the user, matches the node code, and the identification number of the data object corresponding to a feature code is stored if the logical sum of the feature code and the query code matches the feature code.

According to the apparatus and method for performing a skyline query including search terms according to the present invention, information about the characteristics of the data objects is included in each node of the data search tree, so accurate search results can be obtained without an additional search process. In addition, because the dominance test and the search-term test are performed at the same time during data retrieval, the time required for data retrieval can be reduced.

Hereinafter, preferred embodiments of the apparatus and method for performing a skyline query including search terms according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 4 is a block diagram showing the configuration of a preferred embodiment of the apparatus for performing a skyline query including search terms according to the present invention.

Referring to FIG. 4, the skyline query execution apparatus according to the present invention includes a data search tree generation unit 410, an identification number storage unit 420, and a skyline generation unit 430.

The data search tree generation unit 410 generates a data search tree by hierarchically arranging terminal nodes and non-terminal nodes. Each terminal node consists of terminal node information that includes a feature code, which represents the characteristics of a data object and encodes the search terms used to retrieve it; coordinate information, which gives the coordinates of mutually adjacent data objects in a multidimensional coordinate space representing the numerical attributes of the data objects; and the identification number of the data object. Each non-terminal node consists of non-terminal node information that includes a node code, computed as the logical sum (OR) of the feature codes contained in the child nodes below it; the coordinate information of a minimum bounding rectangle set to enclose the data objects contained in the terminal nodes of the lower layers in the coordinate space; and the identification number of that minimum bounding rectangle.
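The node layout described above can be sketched with two small Python classes. This is an illustration under assumptions rather than the patent's implementation: the class and field names are hypothetical, feature codes are modelled as integer bit masks, and the coordinates of g and all code values are made up (h and i use the coordinates quoted later in the walk-through).

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Point = Tuple[float, ...]


@dataclass
class LeafEntry:
    """Terminal node information: identification number, coordinates, feature code."""
    ident: str
    point: Point
    code: int


@dataclass
class Node:
    """Non-terminal node information: identification number of the minimum
    bounding rectangle, its corners, and the OR of the child codes."""
    ident: str
    children: List[Union["Node", LeafEntry]]
    code: int = field(init=False)
    low: Point = field(init=False)   # corner nearest the origin
    high: Point = field(init=False)  # opposite corner

    def __post_init__(self) -> None:
        # Node code: logical sum (OR) of the codes of all children.
        self.code = 0
        for child in self.children:
            self.code |= child.code
        # Minimum bounding rectangle: componentwise min and max over the children.
        lows = [c.point if isinstance(c, LeafEntry) else c.low for c in self.children]
        highs = [c.point if isinstance(c, LeafEntry) else c.high for c in self.children]
        self.low = tuple(min(v) for v in zip(*lows))
        self.high = tuple(max(v) for v in zip(*highs))


# Hypothetical entries: codes and the coordinates of g are illustrative.
n3 = Node("e3", [
    LeafEntry("g", (5, 6), 0b001),
    LeafEntry("h", (4, 3), 0b010),
    LeafEntry("i", (3, 2), 0b100),
])
parent = Node("ex", [n3, Node("ey", [LeafEntry("m", (6, 2), 0b001)])])
```

Computing the node code bottom-up means a single comparison at a non-terminal node can rule out an entire subtree whose objects cannot contain all of the query terms.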

As mentioned above, a conventional skyline query covers only the numerical attributes of the data objects, that is, the tuples. With the BBS search method, a user who wants to search by specifying characteristics of the data objects must therefore perform an additional search on the skyline returned by the skyline query. The INKS search method has the same problem as BBS, because its keyword search and skyline search are in practice separate. To solve this problem, the data search tree generation unit 410 of the skyline query execution apparatus according to the present invention generates a data search tree that includes not only the numerical attributes of the data objects but also the characteristics that can be expressed as search terms.

The following first describes the process of performing a skyline query using the BBS method, the conventional search method. As described above, this method uses an R-tree index for data retrieval. The R-tree has the same form as the tree structure generated by the data search tree generation unit of the skyline query execution apparatus according to the present invention, but without the information related to search terms.

FIG. 5 shows a plurality of data objects in an xy plane whose x and y axes each represent a numerical attribute, together with a plurality of minimum bounding rectangles set to enclose the data objects. FIG. 6 shows an R-tree structure containing the coordinate information of the data objects and minimum bounding rectangles shown in FIG. 5. Referring to FIGS. 5 and 6, a to n are the identification numbers of the data objects shown in the coordinate plane of FIG. 5, and e1 to e7 are the identification numbers of the minimum bounding rectangles N1 to N7 shown in the same plane. A minimum bounding rectangle is set to enclose two or three mutually adjacent data objects, and a higher-level minimum bounding rectangle enclosing two or three minimum bounding rectangles may also be set. The terminal nodes of the R-tree shown in FIG. 6 contain the identification number and coordinate information of each data object. The non-terminal nodes contain the identification number and coordinate information of a minimum bounding rectangle set to enclose either the data objects or the minimum bounding rectangles contained in the lower nodes.

The BBS method creates a heap that stores the entries of the R-tree nodes, that is, the identification numbers of minimum bounding rectangles or of data objects. First, e6 and e7, the identification numbers of the minimum bounding rectangles N6 and N7, are stored in the heap as the entries of the root node. The entries in the heap are sorted in order of increasing distance from the origin of the coordinate space. This is because being closer to the origin means that the attribute values of a data object are smaller, which corresponds to the case of not being dominated by other data objects. The distance from the origin can be determined as the sum of the coordinate values that make up the coordinate information. Because the coordinate information of a data object is a single point in the coordinate space, its distance from the origin can be calculated directly, but a minimum bounding rectangle requires a convention. For a minimum bounding rectangle, the distance from the origin is taken from the vertex closest to the origin, that is, the vertex whose coordinate values have the smallest sum. Accordingly, the distance of the minimum bounding rectangle N6 from the origin is 6, the sum of the coordinate values of its lower-left vertex, and the distance of the minimum bounding rectangle N7 from the origin is 4, the sum of the coordinate values of its lower-left vertex. The heap holding the root node's entries in order of increasing distance from the origin therefore takes the form {(e7,4),(e6,6)}, in which each identification number of a minimum bounding rectangle is stored together with its distance from the origin.
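The distance convention described above can be written as two small functions. This is a minimal two-dimensional sketch: the function names are hypothetical, and the upper-right corner used for N1 is an assumption, since the text quotes only the vertex of N1 nearest the origin, (1,8), and the coordinates of data object i, (3,2).

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, ...]


def point_distance(p: Point) -> float:
    """Distance of a data object from the origin: the sum of its coordinates."""
    return sum(p)


def mbr_distance(vertices: Iterable[Point]) -> float:
    """Distance of a minimum bounding rectangle from the origin: the smallest
    coordinate sum among its vertices."""
    return min(sum(v) for v in vertices)


def rectangle_vertices(low: Point, high: Point) -> List[Point]:
    """The four vertices of a 2-D rectangle given its lower-left and
    upper-right corners."""
    (x1, y1), (x2, y2) = low, high
    return [(x1, y1), (x1, y2), (x2, y1), (x2, y2)]


# N1: lower-left vertex (1, 8) as quoted in the text; the upper-right
# corner (4, 10) is an assumed value used only to build the rectangle.
n1_vertices = rectangle_vertices((1, 8), (4, 10))
n1_distance = mbr_distance(n1_vertices)
i_distance = point_distance((3, 2))
```

With non-negative coordinates the minimising vertex is always the lower-left one, which is why the walk-through only ever quotes that vertex.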

Next, the search moves to the R-tree node corresponding to e7, the entry at the top of the heap. The minimum bounding rectangle N7 encloses the minimum bounding rectangles N3, N4, and N5, so the R-tree node corresponding to e7 contains e3, e4, and e5 as its entries. The identification number e7 is now deleted from the heap, and e3 to e5, the entries of node e7, are added to the heap. As before, e6, which is already in the heap, and the newly stored e3 to e5 are sorted in order of increasing distance from the origin. The distances from the origin of the minimum bounding rectangles N3 to N5, which have the identification numbers e3 to e5, are 5, 10, and 8, respectively, so the identification numbers are stored in the heap in the form {(e3,5),(e6,6),(e5,8),(e4,10)}.

The entry now at the top of the heap is e3, so the identification numbers g, h, and i of the data objects contained in node e3 are stored in the heap. As before, e3 is deleted from the heap when g, h, and i are stored. In the heap, the identification numbers of data objects and of minimum bounding rectangles are not distinguished and are sorted solely by distance from the origin. Because a data object is a single point in the coordinate space, the distances of g, h, and i from the origin are calculated as the sums of their coordinate values: 11, 7, and 5, respectively. The identification numbers of the data objects and minimum bounding rectangles are therefore stored in the heap in the form {(i,5),(e6,6),(h,7),(e5,8),(e4,10),(g,11)}.

The entry now at the top of the heap is i, which is the identification number of a data object contained in a terminal node of the R-tree. The BBS search method therefore creates the skyline, which is the result of the data retrieval, and includes in it the identification number i of the data object at the top of the heap. At the same time, the identification number i is deleted from the heap. The inclusion of i in the skyline means that the corresponding data object is not dominated by any other data object. Now that a skyline holding results exists, the rest of the search includes determining the dominance relation between each data object or minimum bounding rectangle to be added to the heap and the data objects already in the skyline. This is because the skyline contains, as the result of the data retrieval, only data objects that are not dominated by other data objects.
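The first three steps of the walk-through can be replayed with Python's `heapq` module. This is a partial sketch, not a full BBS implementation: the distances and parent-child relations are transcribed from the text, the variable names are hypothetical, and the loop stops at the first data object because no skyline yet exists to test against.

```python
import heapq

# (distance from origin, identification number) for the root entries.
heap = [(4, "e7"), (6, "e6")]
heapq.heapify(heap)

# Entries of the two nodes expanded so far, with the distances quoted in the text.
entries = {
    "e7": [(5, "e3"), (10, "e4"), (8, "e5")],
    "e3": [(11, "g"), (7, "h"), (5, "i")],
}
data_objects = {"g", "h", "i"}
skyline = []

while heap:
    dist, ident = heapq.heappop(heap)  # entry at the top of the heap
    if ident in data_objects:
        skyline.append(ident)          # first data object reached: joins the skyline
        break
    for child in entries[ident]:       # replace the node by its entries
        heapq.heappush(heap, child)

remaining = sorted(heap)               # heap contents in priority order
```

After the loop, `skyline` holds only i and the heap holds e6, h, e5, e4, and g in that priority order, which is the state from which the dominance checks of the following paragraphs proceed.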

After the identification number i is removed, the entry at the top of the heap is e6, the identification number of a minimum bounding rectangle. Before the identification numbers e1 and e2 contained in node e6 are added to the heap, the dominance relation between the corresponding minimum bounding rectangles N1 and N2 and the data object i in the skyline is determined. Because determining dominance involves comparing coordinate values, a minimum bounding rectangle is tested using the coordinate information of its vertex closest to the origin, just as when its distance from the origin is determined.

First, for the minimum bounding rectangle N1, the coordinate information of the vertex closest to the origin is (1,8), and the coordinate information of the data object i in the skyline is (3,2). The x-coordinate of N1, which is 1, is smaller than the x-coordinate of i, which is 3, whereas the y-coordinate of N1, which is 8, is larger than the y-coordinate of i, which is 2. N1 and i therefore do not dominate each other. This is because the condition for dominance between two data objects described earlier is not met, namely that 'tuple tp is no worse than tuple tp' in every dimension (its attribute values are not larger) and tp is better than tp' in at least one dimension (its attribute value is smaller)'. Here, a tuple has the same meaning as a data object, and a dimension is a dimension of the multidimensional coordinate space, each of which represents one attribute of the data objects.

Because the minimum bounding rectangle N1 is not dominated by the data object i, e1, the identification number of N1, can be added to the heap. Next, for the minimum bounding rectangle N2, the coordinate information of the vertex closest to the origin is (6,5), and the coordinate information of the data object i is (3,2). Every coordinate value of N2 is larger than the corresponding coordinate value of i. N2 is therefore dominated by i, and e2, the identification number of N2, cannot be stored in the heap. This is because, if a minimum bounding rectangle is dominated by some data object, every data object enclosed by that rectangle is also dominated by the same data object. The heap after e1 is added is {(h,7),(e5,8),(e1,9),(e4,10),(g,11)}.
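The pruning test applied to N1 and N2 can be expressed as a short Python sketch. It uses the coordinates quoted in the text: (1,8) and (6,5) for the vertices of N1 and N2 nearest the origin, and (3,2) for data object i. The function names are illustrative assumptions.

```python
from typing import Iterable, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p is no worse than q in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def is_pruned(nearest_vertex: Point, skyline_points: Iterable[Point]) -> bool:
    """A rectangle is discarded if some skyline point dominates its vertex
    nearest the origin, because that point then dominates everything inside."""
    return any(dominates(s, nearest_vertex) for s in skyline_points)


skyline_points = [(3, 2)]  # data object i
n1_vertex = (1, 8)         # nearest vertex of N1
n2_vertex = (6, 5)         # nearest vertex of N2

n1_pruned = is_pruned(n1_vertex, skyline_points)
n2_pruned = is_pruned(n2_vertex, skyline_points)
```

Testing only the nearest vertex is sufficient because every point inside the rectangle is at least as large as that vertex in each dimension.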

Next, h, now at the top of the heap, is examined to see whether it can be added to the skyline. Data object h has coordinates (4,3), both of which are larger than those of data object i, (3,2), so h is dominated by i. h is therefore not included in the skyline and is deleted from the heap. After h is deleted, e5 is at the top of the heap; before the entries m and n of node e5 are added to the heap, their dominance relationship with i is checked. Data object m has the same y-coordinate as i but a larger x-coordinate, so it is dominated by i. Every coordinate of data object n is larger than the corresponding coordinate of i, so n is also dominated by i. Neither m nor n is added to the heap, and the identification number e5 is deleted from the heap.

After e5 is deleted, e1 is at the top of the heap. The entries of node e1, data objects a, b and c, have no dominance relationship with data object i. All three are therefore added to the heap and sorted by distance from the origin, giving the heap {(a,10),(e4,10),(g,11),(b,12),(c,12)}. The entry now at the top, a, is the identification number of a data object, so its dominance relationship with i must be checked to decide whether it can be added to the skyline. a has coordinates (1,9) and i has coordinates (3,2). a has the smaller x-coordinate, but i has the smaller y-coordinate, so no dominance relationship can be established between a and i. Since a is judged not to be dominated by other data objects, it is added to the skyline after i, and the identification number a is deleted from the heap.

After a is deleted, the identification number e4 is at the top of the heap. It corresponds to the minimum bounding rectangle N4, which contains data objects l and k. The skyline currently contains data objects i and a, so the dominance relationship with both i and a must be checked to decide whether the identification number of a data object can be added to the heap. A candidate data object is kept out of the heap only when it is dominated by every data object in the skyline. Data object l has coordinates (10,4); all of its coordinates are larger than those of i, so it is dominated by i. However, no dominance relationship can be established between l and data object a, whose coordinates are (1,9). The identification number l can therefore be added to the heap. Data object k has coordinates (9,1); its x-coordinate is larger than those of a and i and its y-coordinate is smaller than those of a and i, so no dominance relationship can be established with either i or a. The identification number k is therefore also added to the heap, and the identification number e4 is deleted. The heap now contains only identification numbers of data objects: {(k,10),(g,11),(b,12),(c,12),(l,14)}.

Next, the identification number at the top of the heap can be added to the skyline only if no dominance relationship can be established between its data object and any data object currently in the skyline. For the data object with identification number k, as examined above, no dominance relationship can be established with either i or a. Data object k is therefore added to the skyline, and the identification number k is deleted from the heap.

After k is added to the skyline, the data objects g, b, c and l remaining in the heap do not satisfy the condition that no dominance relationship can be established with all of i, a and k. They therefore cannot be added to the skyline and are deleted from the heap. When no entries remain in the heap, the data search ends. Applying the BBS search method described so far to the data objects shown in FIG. 5 yields the skyline {i,a,k}.

The skyline query execution apparatus according to the present invention performs a data search in which the characteristics of data objects are added to the search condition of the BBS search method described above. The data search tree generation unit 410 therefore generates a data search tree in which keywords representing the characteristics of the data objects are added to each node of the R-tree built by the BBS search method. In the present invention this tree structure is defined as an IR²-tree. That is, the IR²-tree is an index structure that allows data objects located in the multidimensional coordinate space representing their numerical attributes to be searched simultaneously by attribute and by keywords representing their characteristics.

FIG. 7 shows, on an xy plane whose x-axis and y-axis each represent a numerical attribute, a plurality of data objects that include keywords representing their characteristics, together with a plurality of minimum bounding rectangles set to enclose the data objects. FIG. 8 shows the IR²-tree structure containing the coordinate information and feature codes of the data objects and minimum bounding rectangles shown in FIG. 7. Although the data objects shown in FIG. 7 have attributes represented in a two-dimensional coordinate space, the skyline query execution apparatus according to the present invention can be applied in the same way to any data objects with attributes in two or more dimensions.

The data objects shown in FIG. 7 represent information about used cars sold on a website. On the xy plane, the x-axis represents mileage and the y-axis represents price. Mileage and price are attributes that can be quantified, so each can be represented by an axis of the multidimensional coordinate space. The coordinate information of each data object on the xy plane of FIG. 7 therefore gives the mileage and price of a used car. Each data object additionally includes characteristics expressed as text. These characteristics cannot be expressed numerically; for a car they include, for example, air bag, cruiser control and sunroof. The keywords representing these characteristics are preferably chosen so that they apply universally to all users and data objects. Each data object shown in FIG. 7 includes the keywords for the characteristics that apply to it. For example, data object a has coordinate information (2,2), so it represents a used car with a mileage of 20,000 km and a price of 20 million won, and this car has a sunroof and leather seats as additional characteristics.

A minimum bounding rectangle enclosing mutually adjacent data objects is set in the same way as in the BBS search method, and its coordinate information is given by the coordinates of one of its vertices, preferably the vertex closest to the origin. Because the attributes of a data object can be expressed as numerical coordinate information in this way, information about the attributes can be included in each node of the IR²-tree just as in the BBS search method. The characteristics of a data object, however, are keywords, that is, information expressed as text; storing them in the IR²-tree as they are would take up a large amount of memory and reduce the efficiency of the data search. It is therefore preferable to convert this text information into numbers so that the characteristics of the data objects are included in the IR²-tree in numerical form.

A conventional encoding method can be used to encode the search terms representing the characteristics of a data object as numbers. In the conventional method, each word of a search term is converted by a hash function into a code consisting of a fixed-length number. For example, the search terms t1, t2 and t3 can be converted into 4-bit codes and expressed as 0101, 1010 and 0011, respectively. The same encoding method can be applied to the search term that a user enters to find data with particular characteristics, that is, the query term.
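The specification does not name a particular hash function, so the following is only one possible sketch of mapping a keyword to a fixed-length code. The function `keyword_code`, the use of SHA-256, the 16-bit width and the choice of three set bits per keyword are all assumptions made for illustration; the codes it produces will not match the 4-bit examples above.

```python
import hashlib


def keyword_code(word: str, bits: int = 16, k: int = 3) -> int:
    """Map a keyword to a fixed-length bit code (hypothetical scheme)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    code = 0
    # Use the first k digest bytes to choose up to k bit positions to set.
    for j in range(k):
        code |= 1 << (digest[j] % bits)
    return code


# The same function encodes both stored search terms and query terms,
# so equal keywords always yield equal codes.
for term in ("air bag", "sunroof"):
    print(term, format(keyword_code(term), "016b"))
```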

With this method, the search terms representing the characteristics of the data objects are converted by a fixed rule into numerical feature codes and included in each node of the data search tree. Because a terminal node of the data search tree contains the identification numbers and coordinate information of data objects, the terminal node information can be formed by including the feature code of each data object as it is. When a data object has several characteristics, each characteristic is encoded to produce a search code, and the feature code is determined as the logical sum (bitwise OR) of these search codes. A non-terminal node of the data search tree must consist of non-terminal node information that covers all the data objects contained in the terminal nodes beneath it. A non-terminal node therefore contains a node code calculated as the logical sum of the feature codes contained in its child nodes. For example, when a child node contains the search terms t1 and t2, the corresponding feature codes are 0101 and 1010, as described above, and the node code calculated as their logical sum is 1111. When text search terms are encoded into feature codes or node codes with such a small data size, and the query term entered by the user is encoded in the same way into a query code, whether a data object has the characteristic named by the query term can be determined simply by comparing the feature code or node code with the query code.
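The logical-sum step can be illustrated with the 4-bit codes from the example above. The helper name `superimpose` is an assumption used only for this sketch.

```python
from functools import reduce
from operator import or_
from typing import Iterable


def superimpose(codes: Iterable[int]) -> int:
    """Combine codes by bitwise OR (the logical sum described in the text)."""
    return reduce(or_, codes, 0)


t1, t2 = 0b0101, 0b1010

# Node code of a parent whose child holds the codes for t1 and t2.
node_code = superimpose([t1, t2])
print(format(node_code, "04b"))  # 1111
```

The same helper could form a feature code from the search codes of a single data object, since both steps are described as a logical sum.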

The terminal nodes of the IR²-tree shown in FIG. 8 contain the feature codes obtained by encoding the search terms of each data object shown in FIG. 7, and the non-terminal nodes contain the node codes calculated as the logical sum of the feature codes in their child nodes. Each data object here includes two search terms, so its feature code is the logical sum of the search codes obtained by encoding the two search terms. The identifiers a to k in the terminal nodes are the identification numbers of the data objects shown in FIG. 7, and e1 to e7 in the non-terminal nodes are the identification numbers of the minimum bounding rectangles N1 to N7 shown in FIG. 7.

The identification number storage unit 420 stores the identification numbers of the data objects and minimum bounding rectangles contained in the nodes of the data search tree generated by the data search tree generation unit 410, sorted in order of increasing distance from the origin of the coordinate space. The skyline generation unit 430 generates a skyline that includes the identification number of a data object when the identification number at the top of the identification number storage unit 420 is that of a data object.

As described above, the conventional BBS search method also creates a heap that stores the identification numbers of the data objects and minimum bounding rectangles contained in the nodes of the R-tree, and, as the final result of the data search, generates a skyline consisting of the identification numbers of the data objects at the top of the heap that pass the dominance check. The identification number storage unit 420 of the skyline query execution apparatus according to the present invention, like the heap in the BBS search method, stores identification numbers in order of increasing distance of the data objects and minimum bounding rectangles from the origin. Unlike the heap used in the BBS search method, however, the identification number storage unit 420 performs a search-term check in addition to the dominance check when deciding whether the identification number of a data object or minimum bounding rectangle can be added. The search-term check is performed by comparing the feature code or node code with the query code, and is described in detail below.

FIG. 9 shows the algorithm by which the identification number storage unit 420 and the skyline generation unit 430 perform the data search. The way in which the identification numbers of data objects and minimum bounding rectangles are stored in the identification number storage unit 420 is described with reference to FIGS. 7 to 9. The dominance check and the sorting of identification numbers by distance from the origin within the identification number storage unit 420 are the same as in the BBS search method, so a detailed description is omitted.

When the data search tree has been generated by the data search tree generation unit 410, the identification number storage unit 420 creates a heap (H) in which the identification numbers of data objects and minimum bounding rectangles are stored (line 2), and the skyline generation unit 430 creates a skyline (S) in which the identification numbers of the data objects resulting from the data search are stored (line 3). A query code, obtained by encoding the query term, is also received from the user in order to search for data objects with the characteristics the user specifies (line 4). In the following, the query term is assumed to be 'air bag' and its query code to be '0001 0001 0100 1000'.

First, the identification number storage unit 420 stores e6 and e7, the identification numbers contained in the root node of the data search tree, in the heap (line 5). The minimum bounding rectangle N6 corresponding to e6 has coordinates (2,1), so its distance from the origin is 3, and the minimum bounding rectangle N7 corresponding to e7 has coordinates (1,6), so its distance from the origin is 7. Because identification numbers are stored in order of increasing distance from the origin, the heap is {(e6,3),(e7,7)}.
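The distances quoted here equal the sum of the coordinates of each rectangle's closest vertex, so the ordering can be sketched with Python's `heapq` module. The names `mindist` and `root_entries` are illustrative assumptions, and treating the distance as a coordinate sum is inferred from the numbers in the text rather than stated as a formula.

```python
import heapq


def mindist(corner):
    """Distance from the origin, taken here as the sum of the coordinates."""
    return sum(corner)


# Entries of the root node: identification number -> closest vertex.
root_entries = {"e7": (1, 6), "e6": (2, 1)}

heap = []
for ident, corner in root_entries.items():
    # Store (distance, id) so the closest entry is always at the top.
    heapq.heappush(heap, (mindist(corner), ident))

print(heap[0])       # (3, 'e6')
print(sorted(heap))  # [(3, 'e6'), (7, 'e7')]
```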

Next, e6, at the top of the identification number storage unit 420, has e1, e2 and e3 as the entries of its child node. The identification number storage unit 420 therefore deletes e6 (line 7) and adds e1, e2 and e3. Here, unlike in the BBS search method, a search-term check is performed that compares the node codes of the minimum bounding rectangles N1, N2 and N3 with the query code received from the user (line 8). The identification number of a minimum bounding rectangle is stored in the identification number storage unit 420 only when the logical sum of its node code and the query code equals the node code (lines 9 to 14). This search-term check is performed according to Equation 1 below.

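[Equation 1]
$$\mathrm{sig\_chk}(\sigma,\tau)=\begin{cases}\text{true}, & \text{if } \sigma \lor \tau = \tau\\ \text{false}, & \text{otherwise}\end{cases}$$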

Here, sig_chk(σ,τ) is the search-term check function, σ is the query code, and τ is the node code.

The node code of the minimum bounding rectangle N1 is '0111 0101 1110 1011', and the query code received from the user is '0001 0001 0100 1000'. Their logical sum is '0111 0101 1110 1011', which equals the node code, so e1, the identification number of N1, can be stored in the identification number storage unit 420. Next, the node code of the minimum bounding rectangle N2 is '0101 1101 0111 0011'; its logical sum with the query code is '0101 1101 0111 1011', which does not equal the node code, so e2, the identification number of N2, cannot be stored in the identification number storage unit 420. Finally, the node code of the minimum bounding rectangle N3 is '0101 1101 0111 1001'; its logical sum with the query code is '0101 1101 0111 1001', which equals the node code, so e3, the identification number of N3, is stored in the identification number storage unit 420.
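The three checks above can be reproduced with a short sketch of the search-term check. The helper `parse` and the dictionary `node_codes` are assumptions introduced for the example; `sig_chk` follows the condition stated in the text, namely that the logical sum of the query code and the node code must equal the node code.

```python
def parse(bits: str) -> int:
    """Convert a spaced bit string such as '0001 0001' to an integer."""
    return int(bits.replace(" ", ""), 2)


def sig_chk(query: int, code: int) -> bool:
    """True when OR-ing the query code into the code leaves it unchanged."""
    return (query | code) == code


query = parse("0001 0001 0100 1000")  # query code for 'air bag'

node_codes = {
    "e1": "0111 0101 1110 1011",
    "e2": "0101 1101 0111 0011",
    "e3": "0101 1101 0111 1001",
}

kept = [ident for ident, code in node_codes.items()
        if sig_chk(query, parse(code))]
print(kept)  # ['e1', 'e3']
```

An equivalent formulation is `(query & code) == query`, which tests that every bit set in the query code is also set in the node code.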

The distance from the origin to the minimum bounding rectangle N1 is 4 and the distance to N3 is 10, so the identification number storage unit 420 holds the identification numbers as {(e1,4),(e7,7),(e3,10)}.

Next, e1, the identification number at the top of the identification number storage unit 420, corresponds to the minimum bounding rectangle N1, which contains data objects a and b. e1 is therefore deleted from the identification number storage unit 420, and a search-term check is performed to decide whether its entries a and b can be added. The feature code of data object a is '0011 0001 1110 0010'; its logical sum with the query code '0001 0001 0100 1000' is '0011 0001 1110 1010', which does not equal the feature code of a, so the identification number a cannot be stored. The feature code of data object b is '0101 0101 0110 1001'; its logical sum with the query code is '0101 0101 0110 1001', which equals the feature code of b, so the identification number b can be stored in the identification number storage unit 420. After sorting by distance from the origin, the identification number storage unit 420 holds {(b,5),(e7,7),(e3,10)}.
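The expansion of e1 can be traced with a small heap sketch. The feature codes and the distance of b are taken from the text; the distance 4 for a is an assumption derived from its coordinates (2,2), and the helper names are illustrative only.

```python
import heapq


def parse(bits: str) -> int:
    return int(bits.replace(" ", ""), 2)


def sig_chk(query: int, code: int) -> bool:
    return (query | code) == code


query = parse("0001 0001 0100 1000")

# State of the store before e1 is expanded: (distance, id) pairs.
heap = [(4, "e1"), (7, "e7"), (10, "e3")]
heapq.heapify(heap)

heapq.heappop(heap)  # delete e1, the entry closest to the origin

# Entries of node e1: id -> (feature code, distance from the origin).
children = {
    "a": ("0011 0001 1110 0010", 4),  # distance assumed from (2,2)
    "b": ("0101 0101 0110 1001", 5),
}
for ident, (code, dist) in children.items():
    # Only entries that pass the search-term check are stored.
    if sig_chk(query, parse(code)):
        heapq.heappush(heap, (dist, ident))

print(sorted(heap))  # [(5, 'b'), (7, 'e7'), (10, 'e3')]
```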

The identification number b belongs to a data object contained in a terminal node of the data search tree, so it must be decided whether b can be included in the skyline (lines 15 to 18). Because b is at the top of the identification number storage unit 420, it could be included in the skyline automatically. To obtain a more accurate search result, however, it is preferable to compare the search terms and the query term themselves, rather than the feature code and the query code, and to include the identification number of the data object in the skyline only after confirming that they match. In this case the search terms of data object b are 'air bag' and 'cruiser control', and the query term is 'air bag'. The search terms match the query term, so b is included in the skyline, and the identification number b is deleted from the identification number storage unit 420.
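Because different keywords can set overlapping bits, a code match alone may admit an object that lacks the queried keyword. The confirmation step on the actual terms could look like the sketch below. The function name `keywords_match` and the exact keyword spellings for data object a are assumptions; the keywords of b are those given in the text.

```python
def keywords_match(object_keywords, query_keywords) -> bool:
    """True if every query term appears among the object's search terms."""
    return set(query_keywords) <= set(object_keywords)


b_keywords = {"air bag", "cruiser control"}
a_keywords = {"sunroof", "leather seats"}  # spelling assumed for illustration
query_terms = {"air bag"}

print(keywords_match(b_keywords, query_terms))  # True
print(keywords_match(a_keywords, query_terms))  # False
```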

After b is included in the skyline, the identification number storage unit 420 holds {(e7,7),(e3,10)} and the skyline is {b}. Because the skyline now contains an identification number, the dominance check must be performed whenever an identification number is subsequently added to the identification number storage unit 420.

The identification number e7, at the top of the identification number storage unit 420, has e4 and e5 as the entries of its node. To decide whether e7 can be deleted from the identification number storage unit 420 and e4 and e5 added, the dominance check is performed first, in the same way as in the BBS search method. The minimum bounding rectangle N4 has coordinates (1,6), and data object b, which is in the skyline, has coordinates (3,4). N4 has the smaller x-coordinate and b has the smaller y-coordinate, so no dominance relationship can be established between N4 and b. That is, e4 is not dominated by b and can be stored in the identification number storage unit 420. Next, for e5 and b, the minimum bounding rectangle N5 has coordinates (6,6), and both values are larger than the corresponding coordinates of b. e5 is therefore dominated by b and cannot be stored in the identification number storage unit 420. The search-term check is then performed for e4: the logical sum of the node code of e4, '0111 0101 0111 1111', and the query code '0001 0001 0100 1000' is '0111 0101 0111 1111', which equals the node code, so e4 is finally stored in the identification number storage unit 420, which then holds {(e4,7),(e3,10)}.

The entries of the node corresponding to e4, the identification number now at the top, are g, h and i. The dominance check first shows that no dominance relationship can be established between data objects g and i and data object b. The search-term check then shows that data object i can finally be stored in the identification number storage unit 420. The identification number e4 is therefore deleted from the identification number storage unit 420 and i is added, giving {(e3,10),(i,11)}.

Next, the dominance check and the keyword check are performed on e and f, the entries of the node corresponding to e3, the identification number now at the top of the identification number storage unit 420. The dominance check shows that neither data object e nor data object f is in a dominance relationship with data object b. The keyword check shows that, for both e and f, the logical OR of the feature code and the query code matches the feature code, so both can be stored in the identification number storage unit 420. The identification number storage unit 420 therefore holds {(e,9),(i,11),(f,13)}.

All identification numbers now stored in the identification number storage unit 420 are identification numbers of data objects. Dominance checks therefore decide whether each of them can be added to the skyline. For the identification number e at the top, the dominance check against b, which is currently in the skyline, finds no dominance relationship. Since e is not dominated by b, it is added to the skyline. Next, to add i, dominance checks are performed against b and e. Data object i is in no dominance relationship with either b or e, so i is also added to the skyline. Finally, f is checked against b, e, and i in the skyline. No dominance relationship exists with b or i, but f is dominated by data object e, so f cannot be added to the skyline. As described above, additionally checking at this point whether the search word matches the query word yields a more accurate search result. The skyline finally obtained by the skyline query is {b,e,i}. By analogy with the 'skyline tuple' defined earlier, every data object in this skyline may be defined as a 'keyword-matched skyline tuple'.

FIG. 10 shows, for the data objects of FIG. 7, the data objects that make up the skyline obtained by the skyline query execution apparatus according to the present invention and the data objects that make up the skyline obtained by BBS, a conventional search method. The skyline obtained by the skyline query execution apparatus consists of the data objects b, e, and i, which are connected by dotted lines. The skyline obtained by the BBS search method consists of a, e, and i, which are shown in a dark color. If a keyword search were carried out with the BBS search method, {a,e,i} would have to be searched again, and the result would be {e,i}. With the skyline query execution apparatus according to the present invention, a more accurate and more varied result is obtained in a single search.
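The difference between filtering a finished skyline by keyword and computing the skyline over keyword-matching objects only can be illustrated with a brute-force sketch. The three objects below (`p1`, `p2`, `p3`), their coordinates, and their keywords are hypothetical toy data, not the objects of FIG. 7; smaller coordinate values are assumed to be preferable.

```python
def dominates(p, q):
    """p dominates q: no coordinate of p is larger and at least one is smaller."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(objects):
    """Brute-force skyline: keep every object that no other object dominates."""
    return {
        name
        for name, (point, _) in objects.items()
        if not any(dominates(other, point)
                   for other_name, (other, _) in objects.items()
                   if other_name != name)
    }

# Hypothetical toy objects: name -> ((x, y), keywords)
objects = {
    "p1": ((1, 3), {"leather seat"}),   # dominates p2 but lacks the query keyword
    "p2": ((2, 4), {"airbag"}),
    "p3": ((5, 1), {"airbag", "leather seat"}),
}
query = {"airbag"}

# Conventional approach: skyline first, keyword filter afterwards.
post_filtered = {n for n in skyline(objects) if query <= objects[n][1]}

# Keyword-matched skyline: only matching objects compete with each other.
matching = {n: v for n, v in objects.items() if query <= v[1]}
keyword_matched = skyline(matching)

print(post_filtered, keyword_matched)
```

In this toy case p2 is lost by post-filtering because it is dominated by p1, which does not match the query, in the same way that b is missing from the re-searched BBS result {e,i}.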

FIG. 11 is a flowchart showing the procedure of a preferred embodiment of the skyline query execution method according to the present invention.

Referring to FIG. 11, the data retrieval tree generation unit 410 generates a data retrieval tree by hierarchically arranging terminal nodes and non-terminal nodes (S1110). A terminal node consists of terminal node information that includes a feature code, which represents the characteristics of a data object and encodes the search words used to retrieve it; coordinate information giving the coordinates of mutually adjacent data objects in a multidimensional coordinate space that represents the numerical attributes of the data objects; and the identification numbers of the data objects. A non-terminal node consists of non-terminal node information that includes a node code computed as the logical OR of the feature codes contained in its child nodes; the coordinate information of a minimum bounding rectangle set to enclose the data objects of the terminal nodes beneath it in the coordinate space; and the identification number of that minimum bounding rectangle.

Next, as described above, when the skyline generated by the skyline generation unit 430 already contains identification numbers, the identification number storage unit 420 performs a dominance check (S1120) and a keyword check (S1130). The dominance check determines whether a dominance relationship holds between a data object or minimum bounding rectangle contained in a node of the data retrieval tree and the data objects corresponding to the identification numbers in the skyline. The keyword check determines whether the feature code of the data object or the node code of the minimum bounding rectangle contains the query code. The identification number storage unit 420 then stores the identification number of every data object or minimum bounding rectangle that is not dominated and whose code contains the query code. The identification numbers are stored sorted in order of increasing distance from the origin of the coordinate space (S1140). When the data object corresponding to the identification number at the top of the identification number storage unit 420 is not dominated by any data object stored in the skyline (S1150), the skyline generation unit 430 includes that data object's identification number in the skyline (S1160). At this point, whether the search word matches the query word may additionally be checked.
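Steps S1110 to S1160 can be sketched as a best-first traversal over a priority queue. This is a simplified illustration under stated assumptions, not the claimed apparatus: the classes `Leaf` and `Inner`, the function `kms_skyline`, the 4-bit codes, and the toy tree are all hypothetical; the distance from the origin is taken as the sum of the coordinates of a point or of the lower corner of a rectangle; and smaller coordinate values are assumed to be preferable.

```python
import heapq
from dataclasses import dataclass
from itertools import count

def dominates(p, q):
    """p dominates q: no coordinate of p is larger and at least one is smaller."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

@dataclass
class Leaf:
    ident: str
    corner: tuple   # coordinates of the data object
    code: int       # feature code (OR of its keyword codes)

class Inner:
    """Non-terminal node: node code is the OR of the child codes (S1110)."""
    def __init__(self, ident, children):
        self.ident = ident
        self.children = children
        self.code = 0
        for child in children:
            self.code |= child.code
        # Lower corner of the minimum bounding rectangle of the children.
        self.corner = tuple(min(v) for v in zip(*(c.corner for c in children)))

def kms_skyline(root, query_code):
    skyline, heap, order = [], [], count()

    def offer(entry):
        if any(dominates(s.corner, entry.corner) for s in skyline):   # S1120
            return
        if (entry.code | query_code) != entry.code:                   # S1130
            return
        # S1140: keep entries ordered by distance from the origin.
        heapq.heappush(heap, (sum(entry.corner), next(order), entry))

    offer(root)
    while heap:
        _, _, entry = heapq.heappop(heap)
        if isinstance(entry, Inner):
            for child in entry.children:   # replace a rectangle by its children
                offer(child)
        elif not any(dominates(s.corner, entry.corner) for s in skyline):  # S1150
            skyline.append(entry)                                          # S1160
    return [s.ident for s in skyline]

# Hypothetical toy tree; the query keyword is encoded as bit 0b0001.
p1 = Leaf("p1", (1, 1), 0b0010)   # best coordinates, but no keyword match
p2 = Leaf("p2", (2, 5), 0b0001)
p3 = Leaf("p3", (4, 2), 0b0011)
p4 = Leaf("p4", (5, 6), 0b0001)   # matches, but dominated by p2 and p3
root = Inner("root", [Inner("n1", [p1, p2]), Inner("n2", [p3, p4])])

print(kms_skyline(root, 0b0001))
```

A counter is pushed alongside each distance so that entries with equal distances are never compared directly; this is an implementation convenience of the sketch, not something the text prescribes.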

Experiments were carried out to evaluate the performance of the skyline query execution apparatus and method according to the present invention in comparison with the conventional search methods INKS and BBS. Each data object has d-dimensional attributes, and when these are expressed as coordinate information every coordinate value lies between 0 and 1. The lengths of the feature code, node code, and query code were set in the range of 32 to 768 bits, and the number of search words was varied between 1 and 5.

The first experiment used data objects collected from a real website (http://www.motors.ebay.com) rather than artificially generated data objects. The data objects describe cars; their attributes are price and mileage, and the search words describing their characteristics include airbags, leather seats, and so on. Specifically, there are 44,528 data objects in total and 412 search words with an average length of 6.

FIG. 12 is a graph showing the distribution of search words and the distribution of attribute values for the data objects collected from the website. Referring to FIG. 12(a), the larger the number of search words, the lower the frequency with which they appear for each data object. Referring to FIG. 12(b), the data objects, that is, the cars, are concentrated in the region of low price and low mileage.

FIG. 13 is a graph comparing the skyline query execution time and the number of memory accesses, as a function of the number of search words, with those of the INKS and BBS search methods. KMS denotes the search method performed by the skyline query execution apparatus and method according to the present invention. Referring to FIG. 13, the search method according to the present invention outperforms INKS and BBS in both query execution time and number of memory accesses. FIG. 14 is a graph comparing the query execution time and the number of memory accesses, as a function of the length of the codes (feature code, node code, and query code), with those of the INKS and BBS search methods. Referring to FIG. 14, the search method according to the present invention again outperforms the conventional search methods INKS and BBS. In FIGS. 13 and 14, however, the search performance is hardly affected by the number of search words or the code length, because the number of data objects used was too small for a proper evaluation of search performance.

Next, to evaluate the effect of various parameters on search performance, (1) an independent dataset and (2) an anti-correlated dataset, both commonly used in skyline query evaluation, were generated and the skyline query according to the present invention was run on them. The ranges of the parameters varied for the performance evaluation are shown in Table 1 below.

Table 1
Parameter | Values
Number of data objects (N) | 100k, 200k, 400k, 600k, 800k, 1M
Dimension (number of attributes) (d) | 2, 3, 4, 5
Number of search words (k) | 1, 2, 3, 4, 5
Code length (l) | 32, 128, 256, 384, 512, 640, 768
Skewness of the search word distribution (θ) | 0.0, 0.2, 0.4, 0.6, 0.8, 1.0
Distribution of data objects | Independent, anti-correlated
Number of search words per data object | 6

The performance of the skyline query execution apparatus and method according to the present invention was evaluated by varying the parameters of Table 1 one at a time. Among the parameter values, the value shown in bold is the one at which a parameter is held fixed while another parameter is varied.
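The text does not say how the two synthetic distributions were generated. One simple way to produce data of each kind, with every coordinate kept between 0 and 1, is sketched below. The function names, the `spread` parameter, and the construction of the anti-correlated points (shifting uniform points towards the hyperplane where the coordinates sum to d/2) are assumptions of this sketch and not necessarily the generator used in the experiments.

```python
import random

def independent(n, d, rng):
    """Each attribute is drawn uniformly and independently from [0, 1)."""
    return [tuple(rng.random() for _ in range(d)) for _ in range(n)]

def anti_correlated(n, d, rng, spread=0.05):
    """Points close to the hyperplane sum(x) = d/2.

    A point that is good (small) in one attribute therefore tends to be
    bad (large) in another. Points falling outside [0, 1] are redrawn.
    """
    points = []
    while len(points) < n:
        x = [rng.random() for _ in range(d)]
        # Move the point onto the hyperplane, then add a little noise.
        shift = (d / 2 - sum(x)) / d + rng.gauss(0, spread)
        p = tuple(v + shift for v in x)
        if all(0.0 <= v <= 1.0 for v in p):
            points.append(p)
    return points

rng = random.Random(0)
ind = independent(1000, 2, rng)
anti = anti_correlated(1000, 2, rng)
print(len(ind), len(anti))
```

Anti-correlated data typically yields far more skyline points than independent data of the same size, which is why it is a common stress test for skyline algorithms.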

First, an experiment was carried out to evaluate the effect of the number of data objects (N) on the performance of the present invention. FIG. 15 is a graph comparing the query execution time as a function of the number of data objects with that of INKS and BBS, for the independent dataset and the anti-correlated dataset. Referring to FIG. 15, the present invention shows superior search performance as the number of data objects grows. The gap between the curves of the present invention and INKS is narrower for the anti-correlated dataset than for the independent dataset, because the heap used with the anti-correlated dataset is much larger than with the independent dataset, which increases the query execution time.

FIG. 16 is a graph comparing the number of memory accesses as a function of the number of data objects with that of INKS and BBS, for the independent dataset and the anti-correlated dataset. Referring to FIG. 16, the search method according to the present invention requires fewer memory accesses than INKS and BBS and therefore shows superior search performance. For BBS, no results are shown for more than 200,000 data objects, because the search did not terminate and the values could not be measured. The experimental results presented below therefore exclude BBS and compare performance with INKS only.

FIG. 17 is a graph comparing the query execution time as a function of the number of attributes of a data object with that of INKS, for the independent dataset and the anti-correlated dataset, and FIG. 18 is a graph comparing the number of memory accesses as a function of the number of attributes with that of INKS, for the same datasets. Referring to FIGS. 17 and 18, the present invention generally outperforms INKS, but its performance degrades as the number of attributes, that is, the dimension, increases. This is because the quality of the IR²-tree used for the search deteriorates as the dimension grows and the cost of the dominance checks increases.

FIG. 19 is a graph comparing the query execution time as a function of the number of search words with that of INKS, for the independent dataset and the anti-correlated dataset. Referring to FIG. 19, the gap between the two curves widens quickly as the number of search words increases. With more search words, more identification numbers fail the keyword check and are never stored in the identification number storage unit 420, which speeds up the search. FIG. 20 is a graph comparing the number of memory accesses as a function of the number of search words with that of INKS, for the same datasets. Referring to FIG. 20, the present invention generally outperforms INKS, and for INKS the number of memory accesses is lowest when k is 3.

FIG. 21 is a graph comparing the query execution time as a function of code length with that of INKS, for the independent dataset and the anti-correlated dataset, and FIG. 22 is a graph comparing the number of memory accesses as a function of code length with that of INKS, for the same datasets. Referring to FIGS. 21 and 22, the search performance of the present invention degrades when the code length is shorter than 256 bits, but is superior when the code length is 256 bits or more. This reflects a trade-off between the code length and the size of the index used for the search: a longer code improves the effectiveness of the keyword check but enlarges the index and can increase memory usage. The data search according to the present invention performs best when the code length is set between 256 and 768 bits. The index size for each code length is shown in Table 2 below.

Table 2
Code length (bits) | Index size
32 to 384 | 408 MB
512 | 712 MB
640 | 1.2 GB
768 | 1.5 GB
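Why short codes weaken the keyword check can be seen from how superimposed codes behave. The sketch below is illustrative only: the text does not specify how a search word is encoded, so the hash-based `keyword_code` (three hash-selected bits per word) is an assumption, and the vocabulary and objects are synthetic toy data. Feature codes are formed as the OR of the keyword codes, and a false positive is an object whose feature code passes the OR test although the object does not contain the query word.

```python
import hashlib

def keyword_code(word, length, bits_per_word=3):
    """Hypothetical encoder: set a few hash-selected bits in a code of `length` bits."""
    code = 0
    for i in range(bits_per_word):
        digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
        code |= 1 << (int.from_bytes(digest[:8], "big") % length)
    return code

def feature_code(words, length):
    """Feature code of an object: OR of the codes of its search words."""
    code = 0
    for word in words:
        code |= keyword_code(word, length)
    return code

def passes(code, query_code):
    """Keyword check: (code OR query code) must equal code."""
    return (code | query_code) == code

def false_positives(objects, query_word, length):
    """Count objects that pass the check without containing the query word."""
    query_code = keyword_code(query_word, length)
    return sum(
        1
        for words in objects
        if query_word not in words
        and passes(feature_code(words, length), query_code)
    )

# Synthetic toy data: 412 words, 6 words per object.
vocabulary = [f"kw{i}" for i in range(412)]
objects = [vocabulary[i:i + 6] for i in range(0, 402, 3)]

for length in (32, 256, 768):
    print(length, false_positives(objects, "kw0", length))
```

The test never rejects an object that really contains the query word. With short codes, however, the OR of several keyword codes fills a large share of the bits, so unrelated objects tend to pass more often and fewer entries are pruned; longer codes reduce this at the cost of a larger index, which is the trade-off described above.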

FIG. 23 is a graph comparing the query execution time as a function of the skewness of the search word distribution with that of INKS, for the independent dataset and the anti-correlated dataset. Referring to FIG. 23, the data search according to the present invention outperforms INKS almost regardless of the skewness. FIG. 24 is a graph comparing the number of memory accesses as a function of the skewness of the search word distribution with that of INKS, for the same datasets. Referring to FIG. 24, when the skewness is below 0.4 the search performance is inferior to that of INKS, because the search must also visit non-terminal nodes of the data retrieval tree that yield almost no results. When the skewness is 0.4 or more, the present invention shows superior search performance, because the keyword check and the dominance check reduce the number of identification numbers stored in the identification number storage unit 420.

The present invention can also be embodied as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, as well as implementations in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium can also be distributed over computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to those specific embodiments. Anyone with ordinary skill in the art to which the invention pertains can make various modifications without departing from the gist of the invention as claimed, and such modifications fall within the scope of the claims.

FIG. 1 shows four two-dimensional tuples located in the xy plane;

FIG. 2 shows a skyline obtained from a group of tuples;

FIG. 3 shows an algorithm that performs a data search using the INKS method;

FIG. 4 is a block diagram showing the configuration of a preferred embodiment of the skyline query execution apparatus including search words according to the present invention;

FIG. 5 shows a plurality of data objects in an xy plane whose x and y axes each represent a numerical attribute, together with a plurality of minimum bounding rectangles set to enclose the data objects;

FIG. 6 shows an R-tree structure containing the coordinate information of the data objects and minimum bounding rectangles of FIG. 5;

FIG. 7 shows a plurality of data objects carrying keywords that describe their characteristics, in an xy plane whose x and y axes each represent a numerical attribute, together with a plurality of minimum bounding rectangles set to enclose the data objects;

FIG. 8 shows an IR²-tree structure containing the coordinate information and feature codes of the data objects and minimum bounding rectangles of FIG. 7;

FIG. 9 shows an algorithm by which the identification number storage unit and the skyline generation unit perform a data search;

FIG. 10 shows, for the data objects of FIG. 7, the data objects that make up the skyline obtained by the skyline query execution apparatus according to the present invention and those that make up the skyline obtained by BBS, a conventional search method;

FIG. 11 is a flowchart showing the procedure of a preferred embodiment of the skyline query execution method according to the present invention;

FIG. 12 is a graph showing the distribution of search words and the distribution of attribute values for data objects collected from a website;

FIG. 13 is a graph comparing the skyline query execution time and the number of memory accesses, as a function of the number of search words, with the INKS and BBS search methods;

FIG. 14 is a graph comparing the query execution time and the number of memory accesses, as a function of the length of the codes (feature code, node code, and query code), with the INKS and BBS search methods;

FIG. 15 is a graph comparing the query execution time as a function of the number of data objects with INKS and BBS, for the independent dataset and the anti-correlated dataset;

FIG. 16 is a graph comparing the number of memory accesses as a function of the number of data objects with INKS and BBS, for the independent dataset and the anti-correlated dataset;

FIG. 17 is a graph comparing the query execution time as a function of the number of attributes of a data object with INKS, for the independent dataset and the anti-correlated dataset;

FIG. 18 is a graph comparing the number of memory accesses as a function of the number of attributes of a data object with INKS, for the independent dataset and the anti-correlated dataset;

FIG. 19 is a graph comparing the query execution time as a function of the number of search words with INKS, for the independent dataset and the anti-correlated dataset;

FIG. 20 is a graph comparing the number of memory accesses as a function of the number of search words with INKS, for the independent dataset and the anti-correlated dataset;

FIG. 21 is a graph comparing the query execution time as a function of code length with INKS, for the independent dataset and the anti-correlated dataset;

FIG. 22 is a graph comparing the number of memory accesses as a function of code length with INKS, for the independent dataset and the anti-correlated dataset;

FIG. 23 is a graph comparing the query execution time as a function of the skewness of the search word distribution with INKS, for the independent dataset and the anti-correlated dataset; and

FIG. 24 is a graph comparing the number of memory accesses as a function of the skewness of the search word distribution with INKS, for the independent dataset and the anti-correlated dataset.

Claims

A skyline query execution apparatus comprising:

a data retrieval tree generation unit that generates a data retrieval tree by hierarchically arranging terminal nodes and non-terminal nodes, each terminal node consisting of terminal node information that includes a feature code representing characteristics of a plurality of data objects and encoding a search word used to search for the data object, coordinate information indicating the coordinates of data objects adjacent to each other in a multidimensional coordinate space for representing a plurality of numerical attributes of the data object, and an identification number of the data object, and each non-terminal node consisting of non-terminal node information that includes a node code calculated as the logical OR of the feature codes included in the child nodes located below it, coordinate information of a minimum bounding rectangle set to contain the data objects included in the terminal nodes existing in the lower layer in the coordinate space, and an identification number of the minimum bounding rectangle;

an identification number storage unit that stores the identification numbers of the data objects and the identification numbers of the minimum bounding rectangles included in the nodes of the data retrieval tree, sorted in order of increasing distance from the origin of the coordinate space; and

a skyline generation unit that generates a skyline including the identification number of a data object if the identification number located at the top of the identification numbers stored in the identification number storage unit is the identification number of a data object included in the terminal node information corresponding to a terminal node of the data retrieval tree,

wherein the identification number storage unit stores the identification number of the minimum bounding rectangle corresponding to the node code when the logical OR of the node code and a query code, which encodes a query word input by a user, matches the node code, and stores the identification number of the data object corresponding to the feature code when the logical OR of the feature code and the query code matches the feature code.

The apparatus of claim 1,

wherein the feature code is calculated as the logical OR of a plurality of search codes generated by respectively encoding a plurality of search words corresponding to the data object.

The apparatus of claim 1,

wherein the identification number storage unit stores the identification numbers of the data object and the minimum bounding rectangle when the plurality of coordinate values constituting the coordinate information of the data object and the minimum bounding rectangle in the coordinate space are not larger than the corresponding coordinate values in the coordinate information of a data object included in the skyline, and at least one of the plurality of coordinate values is smaller than the corresponding coordinate value in the coordinate information of the data object included in the skyline.

The apparatus of any one of claims 1 to 3,

wherein the identification number storage unit takes the sum of the coordinate values of the data object in the coordinate space as the distance from the origin of the coordinate space to the data object, and takes the minimum of the sums of the coordinate values of the vertices of the minimum bounding rectangle as the distance from the origin of the coordinate space to the minimum bounding rectangle.

The apparatus according to any one of claims 1 to 3,

wherein, when the identification number located at the top of the stored identification numbers corresponds to a minimum bounding rectangle, the identification number storage unit deletes the identification number of that minimum bounding rectangle and adds the identification numbers corresponding to the child nodes of the non-terminal node that corresponds to the minimum bounding rectangle in the data search tree, based on their distance from the origin of the coordinate space.
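One way to picture this behavior is with a priority queue ordered by distance from the origin. The sketch below uses Python's `heapq` for that purpose; the `Entry` class, its fields, and the function names are hypothetical, and the query-code filtering of claim 1 is left out to keep the example short.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Entry:
    ident: str            # identification number (assumed unique)
    lower: tuple          # lower corner; the point itself for a data object
    upper: tuple          # upper corner; equals lower for a data object
    children: list = field(default_factory=list)  # empty for a data object


def distance(entry):
    # Sum of the lower-corner coordinates (the minimum vertex sum).
    return sum(entry.lower)


def push(heap, entry):
    # The unique ident breaks ties so Entry objects are never compared.
    heapq.heappush(heap, (distance(entry), entry.ident, entry))


def expand_top(heap):
    """If the top entry is a rectangle, delete it and add its children.

    Returns True when an expansion took place, False otherwise.
    """
    if not heap:
        return False
    top = heap[0][2]
    if not top.children:
        return False  # the top entry is a data object
    heapq.heappop(heap)
    for child in top.children:
        push(heap, child)
    return True


a = Entry("a", (1, 5), (1, 5))
b = Entry("b", (4, 1), (4, 1))
r1 = Entry("R1", (1, 1), (4, 5), [a, b])

heap = []
push(heap, r1)
expanded = expand_top(heap)
order = [ident for _, ident, _ in sorted(heap)]
```

After the expansion, the rectangle `R1` is gone and its children are queued in order of increasing distance.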

The apparatus according to any one of claims 1 to 3,

wherein the skyline generation unit includes in the skyline the identification number of the data object located at the top of the identification numbers stored in the identification number storage unit when none of the plurality of coordinate values constituting the coordinate information corresponding to that identification number is larger than the corresponding coordinate value in the coordinate information of a data object included in the skyline, and at least one of the plurality of coordinate values is smaller than the corresponding coordinate value in the coordinate information of the data object included in the skyline.
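The coordinate comparison described in this claim is the usual dominance relation of skyline processing, and a sketch may help. The function `dominates` mirrors the wording of the condition. The function `admit` applies it in the conventional way, rejecting a candidate that some skyline point dominates; that usage is an assumption about how the condition is meant to be applied, not something the claim text states. Both names are hypothetical.

```python
def dominates(p, q):
    """True when no coordinate of p is larger than the corresponding
    coordinate of q and at least one coordinate of p is smaller."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def admit(candidate, skyline):
    """Append candidate to skyline unless an existing skyline point
    dominates it (conventional skyline rule, assumed here)."""
    if any(dominates(s, candidate) for s in skyline):
        return False
    skyline.append(candidate)
    return True


skyline = []
results = [admit(p, skyline) for p in [(1, 5), (4, 1), (4, 6), (2, 2)]]
```

In this example `(4, 6)` is rejected because `(1, 5)` is no larger in either coordinate and smaller in both.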

The apparatus according to any one of claims 1 to 3,

wherein the skyline generation unit includes the identification number of the data object in the skyline when a search word corresponding to the identification number of the data object located at the top of the identification numbers stored in the identification number storage unit matches the query word input by the user.
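A word-level check of this kind is useful because codes built by a logical sum can pass the code test even when the query word is absent: different words may share bits. The sketch below shows such a false positive using hypothetical 4-bit search codes; all names and values are assumptions made for illustration.

```python
# Hypothetical 4-bit search codes, chosen so that a false positive occurs.
SEARCH_CODES = {"pool": 0b0011, "wifi": 0b0101, "spa": 0b0110}


def feature_code(words):
    """Bitwise OR of the search codes of the given words."""
    code = 0
    for w in words:
        code |= SEARCH_CODES[w]
    return code


def passes_code_filter(code, query_code):
    """Code-level test: (code OR query_code) == code."""
    return (code | query_code) == code


def matches_query(object_words, query_words):
    """Word-level test: every query word is one of the object's words."""
    return set(query_words) <= set(object_words)


object_words = ["pool", "wifi"]
query_words = ["spa"]

code_hit = passes_code_filter(feature_code(object_words), feature_code(query_words))
word_hit = matches_query(object_words, query_words)
```

Here the object passes the code filter although it has no "spa" search word, and the word comparison is what keeps it out of the skyline.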

A skyline query execution method performed by a skyline query system that searches for data objects based on search words representing characteristics of a plurality of data objects, the method comprising:

a data search tree generation step of generating a data search tree by hierarchically arranging terminal nodes, each composed of terminal node information including a feature code obtained by encoding the search words, coordinate information indicating the coordinates of mutually adjacent data objects in a multidimensional coordinate space for representing a plurality of numerical attributes of the data objects, and the identification numbers of the data objects, and non-terminal nodes, each composed of non-terminal node information including a node code calculated as the logical sum of the feature codes included in the child nodes positioned below it, coordinate information of a minimum bounding rectangle set to contain the data objects included in the terminal nodes existing in the lower layers in the coordinate space, and the identification number of the minimum bounding rectangle;

an identification number storage step of storing the identification numbers of the data objects and the identification numbers of the minimum bounding rectangles included in each node of the data search tree in order of increasing distance from the origin of the coordinate space; and

a skyline generation step of generating a skyline including the identification number of a data object if the identification number located at the top of the identification numbers stored in the identification number storage step is the identification number of a data object included in the terminal node information corresponding to a terminal node of the data search tree,

wherein, in the identification number storage step, the identification number of the minimum bounding rectangle corresponding to a node code is stored when the logical sum of the node code and a query code, obtained by encoding a query word input by a user, matches the node code, and the identification number of the data object corresponding to a feature code is stored when the logical sum of the feature code and the query code matches the feature code.
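The data search tree generation step of this claim can be sketched as follows. As a simplification, each data object is modeled as its own entry rather than being grouped inside a terminal node, and the grouping of entries into nodes is supplied by hand. The class names `ObjectEntry` and `NodeEntry` and the example codes are hypothetical.

```python
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class ObjectEntry:
    ident: str    # identification number of the data object
    point: tuple  # coordinates in the attribute space
    code: int     # feature code of the data object

    @property
    def lower(self):
        return self.point

    @property
    def upper(self):
        return self.point


@dataclass
class NodeEntry:
    ident: str      # identification number of the bounding rectangle
    children: list
    code: int = field(init=False)
    lower: tuple = field(init=False)
    upper: tuple = field(init=False)

    def __post_init__(self):
        # Node code: logical sum (bitwise OR) of the children's codes.
        self.code = reduce(lambda acc, c: acc | c.code, self.children, 0)
        # Bounding rectangle: per-dimension minimum and maximum of the children.
        self.lower = tuple(min(v) for v in zip(*(c.lower for c in self.children)))
        self.upper = tuple(max(v) for v in zip(*(c.upper for c in self.children)))


a = ObjectEntry("a", (1, 5), 0b0011)
b = ObjectEntry("b", (4, 1), 0b0100)
c = ObjectEntry("c", (6, 0), 0b1000)
r1 = NodeEntry("R1", [a, b])
root = NodeEntry("R0", [r1, c])
```

Each non-terminal entry ends up with a code that covers every bit set below it and a rectangle that contains every point below it, which is what allows whole subtrees to be skipped during the storage step.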

The method of claim 8,

wherein, in the identification number storage step, the identification numbers of the data object and the minimum bounding rectangle are stored when none of the plurality of coordinate values constituting their coordinate information in the coordinate space is larger than the corresponding coordinate value in the coordinate information of a data object included in the skyline, and at least one of the plurality of coordinate values is smaller than the corresponding coordinate value in the coordinate information of the data object included in the skyline.

The method according to any one of claims 8 to 10,

wherein, in the identification number storage step, the sum of the coordinate values of the data object in the coordinate space is used as the distance from the origin of the coordinate space to the data object, and the minimum among the sums of the coordinate values of the respective vertices of the minimum bounding rectangle is used as the distance from the origin of the coordinate space to the minimum bounding rectangle.

The method according to any one of claims 8 to 10,

wherein, in the identification number storage step, if the identification number located at the top of the stored identification numbers is the identification number of a minimum bounding rectangle, the identification number of that minimum bounding rectangle is deleted and the identification numbers corresponding to the child nodes of the non-terminal node that corresponds to the minimum bounding rectangle in the data search tree are added based on their distance from the origin of the coordinate space.

The method according to any one of claims 8 to 10,

wherein, in the skyline generation step, the identification number of the data object located at the top of the identification numbers stored in the identification number storage step is included in the skyline when none of the plurality of coordinate values constituting the coordinate information corresponding to that identification number is larger than the corresponding coordinate value in the coordinate information of a data object included in the skyline, and at least one of the plurality of coordinate values is smaller than the corresponding coordinate value in the coordinate information of the data object included in the skyline.

The method according to any one of claims 8 to 10,

wherein, in the skyline generation step, the identification number of the data object is included in the skyline if a search word corresponding to the identification number of the data object located at the top of the identification numbers stored in the identification number storage step matches the query word input by the user.

A computer-readable recording medium having recorded thereon a program for causing a computer to execute the skyline query execution method according to any one of claims 8 to 10.