KR20190116032A

KR20190116032A - A novel join technique for similar-trend searches supporting normalization on time-series databases

Info

Publication number: KR20190116032A
Application number: KR1020180097619A
Authority: KR
Inventors: 김상욱; 송준호
Original assignee: 한양대학교 산학협력단
Priority date: 2018-04-04
Filing date: 2018-08-21
Publication date: 2019-10-14
Also published as: KR102121304B1

Abstract

The present invention relates to a subsequence matching method of a time series database, which matches a query subsequence from an index generated for a data sequence. The subsequence matching method of a time series database comprises the steps of: dividing a given query sequence into a plurality of query windows, and determining the number of query windows corresponding to the query processing from the query sequence among the divided query windows; selecting a query window by calculating a query cost with respect to the determined number of query windows; recalculating a similarity tolerance range by using the selected query window and a preset similarity tolerance value; and retrieving an answer candidate subsequence matching the query sequence from the index using a range query constructed based on the recalculated similarity tolerance range.

Description

Subsequence matching system and method thereof of time series database {A NOVEL JOIN TECHNIQUE FOR SIMILAR-TREND SEARCHES SUPPORTING NORMALIZATION ON TIME-SERIES DATABASES}

The present invention relates to a subsequence matching system for a time-series database and a method thereof, in which the data sequences constituting an index and a given query sequence are normalized, similarity is measured using the Euclidean distance to improve search accuracy, and false alarms are reduced in post-processing by using the intersection of several correct-answer candidate sets.

A method of searching for an answer sequence for a query sequence retrieves, from among data sequences, the answer sequence that matches the query sequence according to the similarity between the query sequence and each data sequence.

Conventional methods search for the answer sequence by calculating a similarity measure between the data sequence and the query sequence.

However, because the length of the query sequence varies with the user's input, the database system cannot predict it. Conventional methods therefore cannot build and search an index for every possible query sequence length.

Accordingly, the prior art has relied on sequential search, optimized in various ways, for subsequence matching that supports normalization transformation and time warping.

However, sequential search must compare every possible subsequence of the data sequence with the query sequence one by one, so it is much slower than a search that uses an index.

Korean Patent Laid-Open Publication No. 10-2004-0095802

The present invention seeks to provide a subsequence matching system for a time-series database and a method thereof that normalize the data sequences constituting an index and a given query sequence, measure similarity using the Euclidean distance to improve search accuracy, and use the intersection of several correct-answer candidate sets in post-processing to significantly reduce false alarms while improving performance.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

To solve the above problem, the subsequence matching method for a time-series database provided by the present invention is a method of matching a query subsequence from an index generated for data sequences, the method comprising: dividing a given query sequence into a plurality of query windows and determining, from the query sequence, the number of query windows among the divided query windows to be used for query processing; selecting query windows by calculating a query cost for the determined number of query windows; recalculating a similarity tolerance range using the selected query windows and a preset similarity tolerance value; and retrieving, from the index, answer candidate subsequences matching the query sequence using a range query constructed based on the recalculated similarity tolerance range.

Preferably, the method further comprises, after the selecting of the query window, normalizing the selected query window.

Preferably, the method further comprises, after the recalculating of the similarity tolerance range, constructing a range query using the normalized query window and the recalculated similarity tolerance range.

Preferably, in the selecting of the query window, the query cost is calculated based on a query cost function, reflecting the query window similarity and the search range calculated from the query window.

Preferably, the method further comprises, after the retrieving of the answer candidate subsequences, verifying the answer candidate subsequences using the query sequence.

Preferably, the verifying of the answer candidate subsequences comprises: extracting a correct-answer candidate set according to the intersection of the sets of answer candidate subsequences; and verifying the correct-answer candidate set using the query sequence.

Preferably, the verifying of the correct-answer candidate set comprises calculating the Euclidean distance between the query sequence and each data sequence included in the correct-answer candidate set and, when the calculated distance is equal to or less than the similarity tolerance range, including that data sequence in the correct-answer candidate set of answer sequences to be matched with the query sequence.
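The intersection-then-verify post-processing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the representation of candidates as `(sequence_id, offset)` pairs, the function names, and the premise that each offset has already been aligned to the start of the full-length subsequence are all hypothetical. Whether the final distance is taken on raw or normalized values is left open here; the sketch compares the sequences as given.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def post_process(candidate_sets, data, query, eps):
    """Intersect per-window candidate sets, then verify the survivors.

    candidate_sets: list of sets of (sequence_id, offset) pairs, one set per
        query window; offsets are assumed to be aligned to the subsequence
        start (an assumption of this sketch).
    data: dict mapping sequence_id to a list of floats.
    eps: similarity tolerance; a survivor is kept if its distance is <= eps.
    """
    if not candidate_sets:
        return []
    # Only candidates reported by every query window survive.
    survivors = set.intersection(*candidate_sets)
    answers = []
    for seq_id, offset in sorted(survivors):
        sub = data[seq_id][offset:offset + len(query)]
        # Verification on the full-length subsequence removes false alarms.
        if len(sub) == len(query) and euclidean(sub, query) <= eps:
            answers.append((seq_id, offset))
    return answers
```

Intersecting first keeps the number of full-length distance computations small, since only candidates that every window agrees on reach the verification loop.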

Preferably, after the verifying of the answer candidate subsequences, it is determined whether the current query window is the last of the selected query windows, and the steps following the selecting of the query window are repeated accordingly.

Preferably, the indexing step for the generated index comprises: dividing a data sequence into a plurality of data windows using a sliding window; normalizing the data windows so that their scales are identical; reducing the dimension of the normalized data windows; generating records of the dimension-reduced data windows; and inserting the records into a multidimensional index structure.

Preferably, the multidimensional index structure is a hierarchical tree structure generated by organizing the dimension-reduced data windows into minimum bounding rectangles (MBRs) such that a higher-level minimum bounding rectangle contains a plurality of lower-level minimum bounding rectangles.

Further, to solve the above problem, the subsequence matching system for a time-series database provided by the present invention is a system for matching a query subsequence using an index from an indexing means that generates an index of data sequences, the system comprising: a sequence division unit that divides a given query sequence into a plurality of query windows using a sliding window; a window calculation unit that calculates, from the query sequence, the optimal number of query windows among the divided query windows for query processing; a window selection unit that selects the query windows having the minimum query cost for the calculated number of query windows; a recalculation unit that recalculates a similarity tolerance range using the normalized query window and a preset similarity tolerance value; a query construction unit that constructs a range query using the normalized query window and the recalculated similarity tolerance range; and a query search unit that searches the index for answer candidate subsequences to be matched with the query sequence using the constructed range query.

Preferably, the system further comprises a window normalization unit that normalizes the query window selected by the window selection unit.

Preferably, the query cost is calculated based on a query cost function, reflecting the query window similarity and the search range calculated from the query window.

Preferably, the system further comprises a query post-processing unit including an extraction unit that extracts a correct-answer candidate set according to the intersection of the sets of answer candidate subsequences retrieved by the query search unit, and a verification unit that verifies the correct-answer candidate set using the query sequence.

Preferably, the verification unit calculates the Euclidean distance between the query sequence and each data sequence included in the correct-answer candidate set and, when the calculated distance is equal to or less than the similarity tolerance range, includes that data sequence in the correct-answer candidate set of answer sequences to be matched with the query sequence.

According to the subsequence matching system for a time-series database and the method thereof of the present invention, the data sequences constituting the index and the given query sequence are normalized, similarity is measured using the Euclidean distance to improve search accuracy, and the intersection of several correct-answer candidate sets is used in post-processing to significantly reduce false alarms while improving performance.

FIG. 1 is a diagram schematically illustrating the configuration of a subsequence matching system according to an embodiment of the present invention.
FIG. 2 is a diagram schematically illustrating the configuration of the data sequence indexing means of FIG. 1.
FIG. 3 is a diagram schematically illustrating the configuration of the query sequence matching means of FIG. 1.
FIG. 4 is a diagram schematically illustrating the configuration of the query post-processing unit of FIG. 2.
FIG. 5 is a flowchart illustrating an indexing process that generates an index for a data sequence according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a matching process that performs a query against the index and derives a result according to an embodiment of the present invention.
FIGS. 7A and 7B are diagrams for explaining a sliding window and a disjoint window according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. In describing the drawings, similar reference numerals are used for similar elements.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be termed a second element, and similarly a second element may be termed a first element. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to that other element, but it should be understood that intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no intervening elements are present.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram schematically illustrating the configuration of a subsequence matching system according to an embodiment of the present invention, FIG. 2 is a diagram schematically illustrating the configuration of the data sequence indexing means of FIG. 1, FIG. 3 is a diagram schematically illustrating the configuration of the query sequence matching means of FIG. 1, and FIG. 4 is a diagram schematically illustrating the configuration of the query post-processing unit of FIG. 2.

In the present invention, a "sequence" means an array of n entries, "time-series data" means a sequence of entries each holding a real value measured at a point in time, and a "time-series database" means a database storing time-series data.

In addition, a "data sequence" means time-series data stored in the time-series database, and a "query sequence" means a sequence given by the user. A "subsequence" means a part of a data sequence, and "subsequence matching" means a method of finding the positions, within the data sequences, of subsequences similar to the query sequence. If the similarity (distance) between two sequences is equal to or less than the user-specified "tolerance" ε, the two sequences are defined to ε-match and are said to be "similar".

As shown in FIG. 1, the subsequence matching system according to the present invention comprises a data sequence indexing means 110, an index database 120, and a query sequence matching means 130. The data sequence indexing means 110 divides a data sequence into data sequence windows using a sliding window, applies the normalization transformation, and generates an index of the data sequence for searching the data sequence, which is then stored in the index database 120. Here, a sequence means an array of n entries, and the time-series data may be a sequence of entries each holding a real value measured at a point in time.

The index database 120 may store and manage the data sequences and the index of the data sequences generated by the data sequence indexing means 110. The data sequences may be time-series data. For example, the index database 120 may be a time-series database storing time-series data.

The query sequence matching means 130 may search for and output an answer sequence matching the query sequence using the index of the data sequences stored in the index database 120. Here, the query sequence is a sequence input by the user, and a subsequence may be a part of a data sequence.

The query sequence matching means 130 may search for subsequences similar to the query sequence.

For example, stock trend data may be input as the query sequence. The query sequence matching means 130 may then retrieve, as the answer sequence, a data sequence whose trend is similar to that of the input stock trend data from among the data sequences, which are past stock trend data. The user may predict changes in the stock's trend based on the stock trend data output as the answer sequence.

When the similarity, or distance, between the query sequence and an answer sequence of the data sequence is equal to or less than the user-set tolerance ε, the query sequence matching means 130 may determine that the subsequence is similar to the query sequence. The matching between the query sequence and the data sequence containing the subsequence retrieved by the query sequence matching means 130 may be defined as ε-matching. The query sequence matching means 130 may also find the position of that subsequence within the data sequence.

Here, similarity is a measure of how alike two sequences are, and its value may be expressed as a distance. The query sequence matching means 130 may calculate the similarity using the Euclidean distance.

For example, the query sequence matching means may calculate the Euclidean distance between two sequences of length n using Equation 1 below.

[Equation 1]
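As a minimal sketch of the distance computation and the ε-match test, assuming the standard definition of the Euclidean distance (the square root of the summed squared element-wise differences), the following may help. The function and parameter names `euclidean_distance`, `epsilon_match`, `s`, `q`, and `eps` are illustrative and do not come from the source.

```python
import math


def euclidean_distance(s, q):
    """Standard Euclidean distance between two sequences of equal length n."""
    if len(s) != len(q):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((si - qi) ** 2 for si, qi in zip(s, q)))


def epsilon_match(s, q, eps):
    """True when the distance between s and q is at most the tolerance eps."""
    return euclidean_distance(s, q) <= eps
```

With this definition, two sequences that differ by 1 in a single position are at distance 1, so they ε-match for any tolerance of at least 1.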

The specific configuration and operation of the data sequence indexing means and the query sequence matching means are described in detail below with reference to FIGS. 2 and 3.

As shown in FIG. 2, the data sequence indexing means 110 may include a data sequence division unit 210, a data sequence window normalization unit 220, and an index generation unit 230.

The data sequence division unit 210 may divide a data sequence into a plurality of data windows using a sliding window.

For example, given a data sequence, the data sequence division unit 210 may use the sliding-window technique to divide it into data windows of length w.
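The sliding-window division can be sketched as below. This is an illustration only: the function name `sliding_windows` is hypothetical, and the stride of one position between consecutive windows is an assumption, since the source does not state the step explicitly.

```python
def sliding_windows(sequence, w):
    """Return every length-w window of the sequence with its start offset.

    Consecutive windows are shifted by one position (assumed stride of 1).
    """
    if w <= 0:
        raise ValueError("window length must be positive")
    return [
        (start, sequence[start:start + w])
        for start in range(len(sequence) - w + 1)
    ]
```

A sequence shorter than w yields no windows, and a sequence of length m yields m - w + 1 overlapping windows.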

The data sequence window normalization unit 220 may normalize the data windows divided by the data sequence division unit 210 so that their scales are identical while their characteristics are preserved. To do so, the data sequence window normalization unit 220 may transform each data window by applying a normalization technique to each of its elements.

For example, the data sequence window normalization unit 220 may normalize the data windows using the Z-Score technique defined in Equation 2 below.

[Equation 2]

However, the normalization technique used by the data sequence window normalization unit 220 is not limited to the Z-Score technique, and any one of various normalization techniques may be used to normalize the data windows.
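A minimal sketch of Z-Score normalization of a single window follows, assuming the usual definition in which each element has the window mean subtracted and is divided by the window standard deviation. The use of the population standard deviation and the handling of a constant window (returning all zeros) are assumptions of this sketch, and the name `z_score` is illustrative.

```python
import statistics


def z_score(window):
    """Normalize a window to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)
    if std == 0:
        # Assumption: a constant window carries no trend, so map it to zeros.
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]
```

Because the result depends only on the shape of the window, two windows that differ by a constant offset or a positive scale factor normalize to the same values, which is what lets trend similarity be compared across different price levels.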

The index generation unit 230 may generate an index of the data sequence using the data windows normalized by the data sequence window normalization unit 220.

More specifically, the index generation unit 230 may reduce the dimension of the normalized data windows. By doing so, it can avoid the curse of dimensionality, in which the performance of a multidimensional index structure degrades sharply once the dimension of the indexed data exceeds a certain level.
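The source does not name the dimension-reduction technique. Purely as a stand-in, the sketch below uses piecewise aggregate approximation (PAA), which replaces a length-w window with the means of f equal-length segments; the function name `paa` and the requirement that f divide the window length are assumptions of this example.

```python
def paa(window, f):
    """Reduce a window to f values by averaging equal-length segments.

    PAA is an assumed stand-in; the source does not specify the technique.
    """
    w = len(window)
    if f <= 0 or w % f != 0:
        raise ValueError("f must be a positive divisor of the window length")
    seg = w // f
    return [sum(window[i * seg:(i + 1) * seg]) / seg for i in range(f)]
```

For instance, a 6-element window reduced with f = 3 becomes a 3-dimensional point, which is small enough to index in a multidimensional structure without the degradation described above.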

Next, the index generation unit 230 may generate records of the dimension-reduced data windows. For each data window, the index generation unit 230 may generate a record including at least one of the dimension-reduced data window, the data sequence identifier, and the window start position.

Finally, the index generation unit 230 may generate the index of the data sequence by inserting the records of the data windows into a multidimensional index structure. For each data window, the record may be inserted into the multidimensional index structure with the dimension-reduced data window as its key.

The multidimensional index structure used by the index generation unit 230 may be a hierarchical tree structure generated by organizing the dimension-reduced data windows into minimum bounding rectangles (MBRs) such that a higher-level minimum bounding rectangle contains a plurality of lower-level minimum bounding rectangles.
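The record layout and the MBR hierarchy can be illustrated with the sketch below. It is not a full index implementation: the `WindowRecord` type and the helper names `mbr` and `enclose` are hypothetical, and node splitting, insertion, and balancing of the tree are deliberately omitted.

```python
from typing import NamedTuple


class WindowRecord(NamedTuple):
    key: tuple        # dimension-reduced data window, used as the index key
    sequence_id: str  # identifier of the data sequence
    start: int        # start position of the window in that sequence


def mbr(points):
    """Minimum bounding rectangle of points as (lower corner, upper corner)."""
    lows = tuple(min(coords) for coords in zip(*points))
    highs = tuple(max(coords) for coords in zip(*points))
    return lows, highs


def enclose(mbrs):
    """Higher-level MBR that contains several lower-level MBRs."""
    lows = tuple(min(coords) for coords in zip(*(lo for lo, _ in mbrs)))
    highs = tuple(max(coords) for coords in zip(*(hi for _, hi in mbrs)))
    return lows, highs
```

A leaf MBR is built from the keys of the records it holds, and each parent MBR is obtained by enclosing its children, which yields the hierarchical containment the text describes.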

The data sequence indexing means 110 may repeat the above operations until an index has been generated for all data sequences included in the index database 120.

As shown in FIG. 3, the query sequence matching means 130 may include a sequence division unit 310, a window calculation unit 320, a window selection unit 330, a window normalization unit 340, a recalculation unit 350, a query construction unit 360, a query search unit 370, and a query post-processing unit 380.

The sequence division unit 310 divides a given query sequence into a plurality of query windows using a sliding window. For example, given a query sequence, the sequence division unit 310 may use the sliding-window technique to divide it into windows of the length w used for the index.

The window calculation unit 320 calculates, from the query sequence, the optimal number of query windows among the divided query windows for query processing. Specifically, the optimal number of windows n for query processing is calculated from the given query sequence using Equation 3 below.

[Equation 3]

The window selection unit 330 selects the query windows having the minimum query cost for the calculated number of query windows.

More specifically, for the calculated number of query windows n, the n windows are selected by predicting the optimal performance based on a query cost function. The query cost function calculates the query cost according to Equation 4 below, taking into account the similarity of the window and the search range calculated from the window.

[Equation 4]

상기 윈도우 선택부(330)는 시퀀스 분할부(310)가 분할한 질의 윈도우들 중에서 질의 시퀀스의 특징을 가장 많이 포함한 질의 윈도우를 선택할 수 있다.The window selector 330 may select a query window including the feature of the query sequence most among the query windows divided by the sequence divider 310.

이때, 윈도우 선택부(330)는 시퀀스 분할부(310)가 분할한 질의 윈도우들 각각의 표준편차를 계산할 수 있다. 그리고, 윈도우 선택부(330)는 계산한 표준 편차의 크기 순으로 서로 디스조인트한 n개의 질의 윈도우를 선택할 수 있다 이때, n은 사용자가 정의한 양의 정수일 수 있다. In this case, the window selector 330 may calculate a standard deviation of each of the query windows divided by the sequence divider 310. The window selector 330 may select n query windows that are disjoint from each other in the order of the calculated standard deviation. In this case, n may be a positive integer defined by a user.

In addition, by selecting windows in order of standard deviation, the window selector 330 selects the windows that best capture the features of the query sequence, which can reduce false alarms in the search results.
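
The standard-deviation-based selection can be sketched as a greedy procedure. This is only an illustration under stated assumptions: "mutually disjoint" is read as non-overlapping, the population standard deviation is used, and the function name `select_windows` and the sample data are hypothetical. It does not implement the query cost function of Equation 4.

```python
from statistics import pstdev

def select_windows(sequence, w, n):
    """Greedily pick up to n non-overlapping length-w windows, largest std dev first.

    Returns the start positions of the chosen windows in ascending order.
    """
    starts = range(len(sequence) - w + 1)
    # Rank all sliding-window start positions by standard deviation, descending.
    ranked = sorted(starts, key=lambda s: pstdev(sequence[s:s + w]), reverse=True)
    chosen = []
    for s in ranked:
        # Two length-w windows overlap unless their starts differ by at least w.
        if all(abs(s - c) >= w for c in chosen):
            chosen.append(s)
        if len(chosen) == n:
            break
    return sorted(chosen)

query = [0, 0, 0, 0, 1, 9, 1, 9, 5, 5, 5, 6]   # hypothetical query sequence
picked = select_windows(query, w=4, n=2)
```

In this example the window starting at position 4 has the largest standard deviation, and the only remaining non-overlapping window with non-zero spread starts at position 8.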

The window normalizer 340 normalizes the query windows selected by the window selector 330.

More specifically, the window normalizer 340 may transform the n query windows selected by the window selector 330 into normalized query windows, using the same normalization technique as the one applied to the data windows during indexing.
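
The passage above does not name the normalization technique. The following sketch assumes z-normalization (zero mean, unit standard deviation), a common choice for bringing windows to the same scale; this choice, the function name `z_normalize`, and the sample values are assumptions rather than something the text confirms.

```python
from statistics import mean, pstdev

def z_normalize(window):
    """Assumed normalization: subtract the mean, divide by the standard deviation."""
    mu = mean(window)
    sigma = pstdev(window)
    if sigma == 0:
        # A constant window has no shape; map it to all zeros.
        return [0.0 for _ in window]
    return [(x - mu) / sigma for x in window]

a = [2, 4, 4, 4, 5, 5, 7, 9]                 # hypothetical window
z = z_normalize(a)
scaled = z_normalize([10 * x for x in a])    # same shape at ten times the scale
```

Under this assumption, two windows with the same trend but different scales map to the same normalized window, which is why the query side and the index side must use the same technique.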

The recalculation unit 350 recalculates the similarity tolerance range using Equation 5 below, based on the selected query window as normalized by the window normalizer 340 and a preset similarity tolerance value. Here, the similarity tolerance range may be the range of similarity values within which an answer window can be judged similar to the query window it is matched against.

[Equation 5]

The query constructor 360 constructs a range query using the normalized query window and the recalculated similarity tolerance range.

Here, the query constructor 360 may reduce the dimensionality of the n multidimensional query windows and generate range queries using the n dimension-reduced query windows and the similarity tolerance range.

Here, a range query may be a query that uses the index to find the data windows lying within the recalculated similarity tolerance range of a query window, where that range is recalculated from the selected query window and the user-defined similarity tolerance value. For example, a range query generated by the query constructor 360 may take the form <selected query window, recalculated similarity tolerance range>, pairing each selected query window with its recalculated range. The range query generated by the query constructor 360 may be used as the input to a search through the index.
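
The <selected query window, recalculated similarity tolerance range> pair can be modeled as a small record. In the sketch below, a linear scan stands in for the index lookup, and the tolerance is supplied directly rather than derived from Equation 5; the names `RangeQuery` and `run_range_query`, the (offset, window) layout, and the sample data are all assumptions for illustration.

```python
import math
from typing import NamedTuple, Sequence

class RangeQuery(NamedTuple):
    window: Sequence[float]   # selected (normalized, dimension-reduced) query window
    tolerance: float          # recalculated similarity tolerance range

def euclidean(a, b):
    """Euclidean distance between two equal-length windows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def run_range_query(query, data_windows):
    """Return offsets of data windows within query.tolerance of query.window.

    `data_windows` is a list of (offset, window) pairs; a real system would
    consult the multidimensional index instead of scanning every pair.
    """
    return [off for off, win in data_windows
            if euclidean(query.window, win) <= query.tolerance]

data = [(0, [0.0, 0.0]), (1, [3.0, 4.0]), (2, [1.0, 0.0])]   # hypothetical windows
hits = run_range_query(RangeQuery([0.0, 0.0], 1.0), data)
```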

The query searcher 370 uses the range query constructed by the query constructor 360 to search the index for answer candidate subsequences that match the query sequence.

Here, the query searcher 370 may use the normalized query window and the similarity tolerance range to search the index for answer candidate subsequences that match the query sequence, and may generate a set of the retrieved answer candidate subsequences. The query searcher 370 may repeat this process until a set of answer candidate subsequences has been generated for each of the n windows selected by the window selector 330.

As shown in FIG. 4, the query post-processor 380 may include an extractor 381 and a verifier 382.

The extractor 381 extracts a correct-answer candidate set by taking the intersection of the sets of answer candidate subsequences retrieved by the query searcher 370. The verifier 382 verifies the correct-answer candidate set using the query sequence. Specifically, the verifier 382 calculates the Euclidean distance between the query sequence and each data sequence in the correct-answer candidate set and, when the calculated distance is less than or equal to the similarity tolerance range, includes the data sequence in the set of answer sequences matching the query sequence.
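
The extraction and verification steps can be sketched as a set intersection followed by a distance check. This is an illustration under assumptions: candidates are identified by their start offset in the data sequence, the sequences are compared as given (any normalization is assumed to have been applied beforehand), and the function names and sample data are hypothetical.

```python
import math

def intersect_candidates(candidate_sets):
    """Keep only candidates that appear in every per-window candidate set."""
    if not candidate_sets:
        return set()
    result = set(candidate_sets[0])
    for s in candidate_sets[1:]:
        result &= set(s)
    return result

def verify(query, data, candidates, tolerance):
    """Return the candidate start offsets whose subsequence is within `tolerance`."""
    answers = []
    for start in sorted(candidates):
        sub = data[start:start + len(query)]
        # Skip candidates that run past the end of the data sequence.
        if len(sub) == len(query) and math.dist(query, sub) <= tolerance:
            answers.append(start)
    return answers

data = [1, 2, 3, 4, 5, 6]                       # hypothetical data sequence
query = [3, 4, 5]                               # hypothetical query sequence
per_window = [{0, 2, 3}, {2, 3}, {1, 2, 3}]     # one candidate set per selected window
candidates = intersect_candidates(per_window)
answers = verify(query, data, candidates, tolerance=0.5)
```

The intersection discards offsets 0 and 1, which only some windows matched, before any full-length distance is computed; the distance check then removes the remaining false alarm at offset 3.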

FIGS. 7A and 7B illustrate a sliding window and a disjoint window according to an embodiment of the present invention. As shown in FIGS. 7A and 7B, a window is the unit into which a sequence is divided, and windows are classified as sliding windows or disjoint windows according to the division method. A sliding window is a window that starts at any possible position in the sequence, whereas a disjoint window is a window that starts at a position that is a multiple of the window length. FIG. 7A shows an example in which a sequence 710 is divided into sliding windows 720 of length 4, and FIG. 7B shows an example of disjoint windows 730 of length 4.
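
The difference between the two window types comes down to their start positions. The sketch below lists them for a sequence of length 10 and window length 4; the zero-based positions, the function names, and the sequence length are assumptions chosen for this example (the figures themselves are not reproduced here).

```python
def sliding_starts(length, w):
    """Start positions of sliding windows: every position where a full window fits."""
    return list(range(0, length - w + 1))

def disjoint_starts(length, w):
    """Start positions of disjoint windows: multiples of the window length w."""
    return list(range(0, length - w + 1, w))

sliding = sliding_starts(10, 4)     # [0, 1, 2, 3, 4, 5, 6]
disjoint = disjoint_starts(10, 4)   # [0, 4]
```

With these parameters the last two elements of the sequence are not covered by any disjoint window, because a third full window of length 4 does not fit.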

Indexing process

FIG. 5 is a flowchart illustrating an indexing process that generates an index for a data sequence according to an embodiment of the present invention.

As shown in FIG. 5, the index generator 230 first creates and initializes a multidimensional index structure (S510). Here, the index generator 230 may create the multidimensional index structure by organizing the dimension-reduced data windows into minimum bounding rectangles (MBRs), such that a higher-level MBR contains a plurality of lower-level MBRs.
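
The containment relationship between MBR levels can be illustrated with a few helper functions. This is a sketch of the geometric idea only, not of a complete index structure; the function names, the (lows, highs) representation, and the two-dimensional sample points are assumptions.

```python
def mbr(points):
    """Minimum bounding rectangle of a set of points, as (lows, highs) per dimension."""
    lows = tuple(min(coords) for coords in zip(*points))
    highs = tuple(max(coords) for coords in zip(*points))
    return lows, highs

def enclose(rects):
    """Higher-level MBR that contains every lower-level MBR in `rects`."""
    lows = tuple(min(coords) for coords in zip(*(r[0] for r in rects)))
    highs = tuple(max(coords) for coords in zip(*(r[1] for r in rects)))
    return lows, highs

def contains(outer, inner):
    """True if rectangle `inner` lies entirely inside rectangle `outer`."""
    return all(ol <= il and ih <= oh
               for ol, il, ih, oh in zip(outer[0], inner[0], inner[1], outer[1]))

leaf_a = mbr([(1, 2), (3, 1)])      # hypothetical dimension-reduced windows
leaf_b = mbr([(5, 5), (4, 7)])
parent = enclose([leaf_a, leaf_b])
```

A range query can skip an entire subtree whenever its search region does not intersect the higher-level MBR, which is what makes the hierarchy useful for search.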

Next, the data sequence is divided into a plurality of data windows using a sliding window (S520).

Then, the data windows are normalized so that their scales are the same (S530). That is, the data windows may be normalized to a common scale while preserving the characteristics of each window.

Next, the index generator 230 reduces the dimensionality of the normalized data windows (S540) and may generate records for the dimension-reduced data windows (S550).

The index generator 230 may then generate the index of the data sequence by inserting the records of the data windows into the multidimensional index structure created and initialized in step S510. Here, for each data window, the index generator 230 inserts the record into the multidimensional index structure using the dimension-reduced data window as the key (S560).

Finally, the index generator 230 may check whether an index has been generated for every data sequence in the index database 120. If so, the index generator 230 may end the index generation method (S570).

If, on the other hand, the index database 120 contains data sequences that have not yet been indexed, the index generator 230 may select one of them and repeat the process from step S520.
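
Steps S510 to S570 can be sketched end to end as follows. Several parts are assumptions, because the text does not fix them: z-normalization stands in for the normalization step, segment means stand in for the dimension reduction, a flat list stands in for the multidimensional index structure, and each record is assumed to hold the key together with a sequence identifier and a window offset. All function names are hypothetical.

```python
from statistics import mean, pstdev

def normalize(window):
    # Assumption: z-normalization; the text only requires a common scale.
    mu, sigma = mean(window), pstdev(window)
    return [0.0] * len(window) if sigma == 0 else [(x - mu) / sigma for x in window]

def reduce_dimension(window, k):
    # Assumption: mean of k equal-length segments; len(window) must be divisible by k.
    seg = len(window) // k
    return tuple(mean(window[i * seg:(i + 1) * seg]) for i in range(k))

def build_index(sequences, w, k):
    index = []                                          # S510: flat list as a stand-in
    for seq_id, seq in enumerate(sequences):            # S570: until all sequences are indexed
        for offset in range(len(seq) - w + 1):          # S520: sliding windows
            window = normalize(seq[offset:offset + w])  # S530: normalize
            key = reduce_dimension(window, k)           # S540: reduce dimension
            record = (key, seq_id, offset)              # S550: build the record
            index.append(record)                        # S560: insert under the key
    return index

index = build_index([[1, 2, 3, 4, 5], [2, 2, 2, 2]], w=4, k=2)
```

In this example the two windows of the first sequence share the same trend at different offsets, so under the assumed normalization they receive identical keys.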

Matching process

FIG. 5 is a flowchart illustrating a matching process that performs a query against the index and derives a result according to an embodiment of the present invention. Descriptions that duplicate those given above with reference to FIGS. 3 and 4 are omitted.

As shown in FIG. 5, a given query sequence is first divided into a plurality of query windows using a sliding window (S610). Here, the sequence divider 310 divides the given query sequence into a plurality of query windows using a sliding window. For example, the sequence divider 310 may use the sliding-window technique to divide the query sequence into windows of length w, the window length used for the index.

Next, the number of query windows to be used for query processing is calculated from the query sequence among the divided query windows (S620). Here, the window calculator 320 calculates, from the query sequence, the optimal number of query windows for query processing among the divided query windows.

Then, query windows are selected according to their query cost for the calculated number of query windows (S630). More specifically, the window selector 330 selects the query windows having the minimum query cost for the calculated number of query windows. For the calculated number n of query windows, n windows are selected by predicting the optimal performance based on a query cost function. The query cost function calculates the query cost taking into account the similarity of a window and the search range calculated from the window.

Next, the selected query window is normalized (S640).

Then, the similarity tolerance range is recalculated using the normalized query window and a preset similarity tolerance value (S650). Here, the recalculation unit 350 recalculates the similarity tolerance range based on the selected query window as normalized by the window normalizer 340 and the preset similarity tolerance value. The similarity tolerance range may be the range of similarity values within which an answer window can be judged similar to the query window it is matched against.

Then, a range query is constructed using the normalized query window and the recalculated similarity tolerance range (S660). Here, the query constructor 360 may reduce the dimensionality of the n multidimensional query windows and generate range queries using the n dimension-reduced query windows and the similarity tolerance range.

Next, the constructed range query is used to search the index for answer candidate subsequences that match the query sequence (S670).

Then, the answer candidate subsequences are verified using the query sequence (S680).

More specifically, the extractor 381 extracts a correct-answer candidate set by taking the intersection of the sets of answer candidate subsequences retrieved by the query searcher 370. The verifier 382 verifies the correct-answer candidate set using the query sequence. Specifically, the verifier 382 calculates the Euclidean distance between the query sequence and each data sequence in the correct-answer candidate set and, when the calculated distance is less than or equal to the similarity tolerance range, includes the data sequence in the set of answer sequences matching the query sequence.

Meanwhile, it is determined whether the current query window is the last of the selected query windows (S690). If any of the n selected windows remain, the steps following the query window selection step may be repeated.

The present invention has been described above with reference to its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

110 --- Data sequence indexing means 120 --- Index database
130 --- Query sequence matching means 210 --- Sequence divider
220 --- Window calculator 230 --- Window selector
240 --- Window normalizer 250 --- Recalculation unit
260 --- Query constructor 270 --- Query searcher
280 --- Query post-processor 310 --- Extractor
320 --- Verifier 610 --- Sequence
620 --- Sliding window 630 --- Disjoint window

Claims

1. A subsequence matching method of a time-series database, for matching a query subsequence against an index generated for a data sequence, the method comprising:
dividing a given query sequence into a plurality of query windows, and determining, from the query sequence, the number of query windows to be used for query processing among the divided query windows;
selecting a query window by calculating a query cost for the determined number of query windows;
recalculating a similarity tolerance range using the selected query window and a preset similarity tolerance value; and
retrieving, from the index, an answer candidate subsequence matching the query sequence using a range query constructed based on the recalculated similarity tolerance range.

2. The method of claim 1, further comprising, after selecting the query window, normalizing the selected query window.

3. The method of claim 1, further comprising, after recalculating the similarity tolerance range, constructing a range query using the normalized query window and the recalculated similarity tolerance range.

4. The method of claim 1, wherein, in selecting the query window, the query cost is calculated based on a query cost function that reflects the similarity of the query window and the search range calculated from the query window.

5. The method of claim 1, further comprising, after retrieving the answer candidate subsequence, verifying the answer candidate subsequence using the query sequence.

6. The method of claim 5, wherein verifying the answer candidate subsequence comprises:
extracting a correct-answer candidate set from the intersection of the sets of answer candidate subsequences; and
verifying the correct-answer candidate set using the query sequence.

7. The method of claim 6, wherein verifying the correct-answer candidate set comprises calculating the Euclidean distance between the query sequence and each data sequence in the correct-answer candidate set and, when the calculated distance is less than or equal to the similarity tolerance range, including the data sequence in the set of answer sequences matching the query sequence.

8. The method of claim 1, further comprising, after verifying the answer candidate subsequence, determining whether the query window is the last of the selected query windows and, if not, repeating each step following the selection of the query window.

9. The method of claim 1, wherein an indexing step for generating the index comprises:
dividing a data sequence into a plurality of data windows using a sliding window;
normalizing the data windows such that the scales of the data windows are the same;
reducing the dimensionality of the normalized data windows;
creating records of the dimension-reduced data windows; and
inserting the records into a multidimensional index structure.

10. The method of claim 9, wherein the multidimensional index structure is a hierarchical tree structure created by organizing the dimension-reduced data windows into minimum bounding rectangles (MBRs) such that a higher-level minimum bounding rectangle contains a plurality of lower-level minimum bounding rectangles.

11. A subsequence matching system of a time-series database, for matching a query subsequence against an index generated by indexing means for a data sequence, the system comprising:
a sequence divider that divides a given query sequence into a plurality of query windows using a sliding window;
a window calculator that calculates, from the query sequence, an optimal number of query windows for query processing among the divided query windows;
a window selector that selects a query window having a minimum query cost for the calculated number of query windows;
a recalculation unit that recalculates a similarity tolerance range using the selected query window, as normalized by a window normalizer, and a preset similarity tolerance value;
a query constructor that constructs a range query using the normalized query window and the recalculated similarity tolerance range; and
a query searcher that retrieves, from the index, an answer candidate subsequence matching the query sequence using the constructed range query.

12. The system of claim 11, further comprising a window normalizer that normalizes the query window selected by the window selector.

13. The system of claim 11, wherein the window selector calculates the query cost based on a query cost function that reflects the similarity of the query window and the search range calculated from the query window.

14. The system of claim 11, further comprising a query post-processor that includes an extractor that extracts a correct-answer candidate set from the intersection of the sets of answer candidate subsequences retrieved by the query searcher, and a verifier that verifies the correct-answer candidate set using the query sequence.

15. The system of claim 14, wherein the verifier calculates the Euclidean distance between the query sequence and each data sequence in the correct-answer candidate set and, when the calculated distance is less than or equal to the similarity tolerance range, includes the data sequence in the set of answer sequences matching the query sequence.