KR20120101925A

KR20120101925A - Apparatus for discovering sequential patterns over data stream using dual tree and method thereof

Info

Publication number: KR20120101925A
Application number: KR1020110020051A
Authority: KR
Inventors: 이원석
Original assignee: 연세대학교 산학협력단
Priority date: 2011-03-07
Filing date: 2011-03-07
Publication date: 2012-09-17
Also published as: KR101238014B1

Abstract

PURPOSE: An apparatus and method for discovering sequential patterns in a data stream using a dual tree structure are provided. Sequential patterns are found with the MTeISeq structure, which supports sequential pattern mining over 1-itemset sequences, so that data analysis can be performed effectively. CONSTITUTION: A first extraction unit extracts sequence candidate itemsets from a data stream (110) by using a monitoring tree. A mapping unit maps the sequence candidate itemsets into the form of a 1-itemset sequence. A second extraction unit extracts 1-itemset frequent sequences from the 1-itemset sequence by using a monitoring tree. A generation unit generates extended sequences composed of itemsets. A restoration unit restores the 1-itemset frequent sequences, with the extended sequences added, to n-itemset frequent sequences. [Reference numerals] (110) Data stream; (150) Generating an extended sequence; (170) n-itemset frequent sequence; (AA) Input; (BB) Output; (CC) Mapping table; (DD) n-itemset ▶ 1-itemset; (EE,160) 1-itemset frequent sequence; (FF) Extracting a sequence candidate itemset

Description

Apparatus and method for discovering sequential patterns over a data stream using a dual tree structure {APPARATUS FOR DISCOVERING SEQUENTIAL PATTERNS OVER DATA STREAM USING DUAL TREE AND METHOD THEREOF}

The present invention relates to sequential pattern mining. More particularly, it relates to an apparatus and method for discovering sequential patterns in a data stream using a dual tree structure: the MTestDec structure is used to find the sequence candidate itemsets, that is, the itemsets that are frequent in each transaction; a mapping table maps each of these itemsets to a specific ID so that the input becomes a 1-itemset sequence; and the MTeISeq structure, which supports sequential pattern mining over 1-itemset sequences, is then used to find the sequential patterns.

A data stream that arrives continuously through various input channels cannot be processed within limited memory by mining methods designed for finite data sets. Extracting knowledge from a data stream therefore imposes three conditions: meaningful knowledge must be obtained in a short time with a single scan of the data, processing must fit within limited memory, and the knowledge extracted from the stream must be available for query whenever the user wants it. Consequently, unlike mining over a finite data set, mining over a data stream yields results that contain some error.

Research on data stream environments has focused on frequent pattern mining, which finds frequent items within a single transaction, and on association rule mining, which finds frequent items by considering the associations between items. In contrast, sequential pattern mining must find frequent patterns while taking into account the temporal order across multiple transactions. It is necessary not only to check that every itemset in a frequent sequence is itself frequent but also to consider the order among those itemsets, which requires many operations and a large amount of memory. Because of these difficulties, sequential pattern mining algorithms for data stream environments have received relatively little study worldwide.

Among previously studied sequential pattern mining methods for data stream environments, algorithms that target 1-itemset sequences cannot handle sequences in the original sense of the term, that is, sequences whose elements contain several items. For example, they cannot handle the case in which a customer buys more than one item on each visit to a store. Research aimed at supporting this case has therefore begun.

Sequential pattern mining algorithms that target n-itemset sequences, which have only recently begun to be studied, still leave room for improvement in memory usage and accuracy.

The sliding-window IncSPAM algorithm and the time-tilted window approach can reduce memory usage, but because they do not consistently maintain information about all sequences, their accuracy is poor and they cannot handle long sequences.

The landmark approach manages all pattern information generated so far, so its accuracy is high, but it suffers from high memory usage and long execution time. SPAMS is a landmark algorithm with these characteristics. However, it does not separately manage data that may potentially become frequent, and it does not remove infrequent items across the entire SPA (Sequential Patterns Automaton) structure, so its accuracy also suffers. In addition, the extra information needed to maintain the SPA structure must be kept for every state, so it requires a large amount of memory relative to the number of nodes.

Accordingly, to solve these problems of the prior art, an object of the present invention is to provide an apparatus and method for discovering sequential patterns in a data stream using a dual tree structure, in which the MTestDec structure is used to find the sequence candidate itemsets that are frequent in each transaction, a mapping table maps these itemsets to specific IDs to form a 1-itemset sequence, and the MTeISeq structure, which supports sequential pattern mining over 1-itemset sequences, is used to find the sequential patterns.

However, the objects of the present invention are not limited to the one mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above objects, an apparatus for discovering sequential patterns in a data stream according to one aspect of the present invention comprises: a first extraction unit that, upon receiving a data stream, which is an unbounded data set composed of continuously generated transactions, extracts from the data stream the sequence candidate itemsets that are frequent in each transaction by using the MTestDec structure, which is a monitoring tree; a mapping unit that maps the sequence candidate itemsets into the form of a 1-itemset sequence; a second extraction unit that extracts 1-itemset frequent sequences from the 1-itemset sequence by using the MTeISeq structure, which is a monitoring tree; a generation unit that generates extended sequences composed of itemsets having k or more items; and a restoration unit that restores the 1-itemset frequent sequences, with the extended sequences added, to the original n-itemset frequent sequences.

Preferably, the first extraction unit extracts sequence candidate itemsets whose support is greater than or equal to a user-defined support value. Here, the support is the ratio of the number of customers whose sequences contain the given sequence to the total number of customers, and a sequence is a list in which the continuously generated transactions are grouped by customer identifier and ordered by the time at which they occurred.

If necessary, the support is maintained with a weight that differs for each item as time passes, so that frequent patterns are extracted with emphasis on recently frequent itemsets.

Preferably, the mapping unit maps each sequence candidate itemset to a single ID in order to convert the sequence candidate itemsets into a 1-itemset sequence, and maps only the sequence candidate itemsets having fewer than k items.

Preferably, the generation unit generates extended sequences composed of sequence candidate itemsets having k items for which every subset having k-1 items exists as a sequence candidate itemset.

A method for discovering sequential patterns in a data stream according to another aspect of the present invention comprises: (a) upon receiving a data stream, which is an unbounded data set composed of continuously generated transactions, extracting from the data stream the sequence candidate itemsets that are frequent in each transaction by using the MTestDec structure, which is a monitoring tree; (b) mapping the sequence candidate itemsets into the form of a 1-itemset sequence; (c) extracting 1-itemset frequent sequences from the 1-itemset sequence by using the MTeISeq structure, which is a monitoring tree; (d) generating extended sequences composed of itemsets having k or more items; and (e) restoring the 1-itemset frequent sequences, with the extended sequences added, to the original n-itemset frequent sequences.

Preferably, step (a) extracts sequence candidate itemsets whose support is greater than or equal to a user-defined support value. Here, the support is the ratio of the number of customers whose sequences contain the given sequence to the total number of customers, and a sequence is a list in which the continuously generated transactions are grouped by customer identifier and ordered by the time at which they occurred.

Preferably, step (b) maps each sequence candidate itemset to a single ID in order to convert the sequence candidate itemsets into a 1-itemset sequence, and maps only the sequence candidate itemsets having fewer than k items.

Preferably, step (d) generates extended sequences composed of sequence candidate itemsets having k items for which every subset having k-1 items exists as a sequence candidate itemset.

Accordingly, the present invention finds the sequence candidate itemsets that are frequent in each transaction using the MTestDec structure, maps them to specific IDs using a mapping table to form a 1-itemset sequence, and then finds sequential patterns using the MTeISeq structure, which supports sequential pattern mining over 1-itemset sequences. This makes the analysis of data such as per-customer usage patterns faster and more effective.

In addition, the present invention finds the sequence candidate itemsets that are frequent in each transaction using the MTestDec structure, maps them to specific IDs using a mapping table to form a 1-itemset sequence, and finds sequential patterns using the MTeISeq structure, while excluding from the mining operation the candidate sequences composed of itemsets of length k or more. This reduces memory usage.

In addition, the present invention performs the mining operation while excluding the candidate sequences composed of itemsets of length k or more, and then performs a restoration process using extended sequences. This compensates for the accuracy lost due to the limit on itemset length.

FIG. 1 is an exemplary view showing an apparatus for discovering sequential patterns according to an embodiment of the present invention.
FIG. 2 is an exemplary view for explaining the preprocessing of sequence transactions according to an embodiment of the present invention.
FIG. 3 is an exemplary view showing the mapping table of estDuT according to an embodiment of the present invention.
FIG. 4 is an exemplary view for explaining the mapping of sequence candidate itemsets according to an embodiment of the present invention.
FIG. 5 is an exemplary view for explaining the 1-itemset sequence mapping technique according to an embodiment of the present invention.
FIG. 6 is a first exemplary view for explaining the 1-itemset sequence conversion technique according to an embodiment of the present invention.
FIG. 7 is a second exemplary view for explaining the 1-itemset sequence conversion technique according to an embodiment of the present invention.
FIG. 8 is an exemplary view for explaining the principle of generating extended sequences according to an embodiment of the present invention.
FIG. 9 is a graph showing memory usage as the length limit k of the sequence candidate itemsets varies according to an embodiment of the present invention.
FIG. 10 is a graph showing execution time as the length limit k of the sequence candidate itemsets varies according to an embodiment of the present invention.
FIG. 11 is a graph showing precision and recall as the length limit k of the sequence candidate itemsets varies according to an embodiment of the present invention.
FIG. 12 is a graph showing ASE as the length limit k of the sequence candidate itemsets varies according to an embodiment of the present invention.
FIG. 13 is a graph showing the performance of estDuT as the S_sig value varies according to an embodiment of the present invention.
FIG. 14 is a graph showing performance when the optimization (decay) is applied according to an embodiment of the present invention.
FIG. 15 is a graph showing the results of an experiment comparing the accuracy (precision, recall) of estDuT and SPAMS according to an embodiment of the present invention.
FIG. 16 is a graph showing the results of an experiment comparing the accuracy (ASE) of estDuT and SPAMS according to an embodiment of the present invention.
FIG. 17 is a graph showing performance evaluation results on long sequences according to an embodiment of the present invention.
FIG. 18 is an exemplary view showing a method for discovering sequential patterns according to an embodiment of the present invention.

Hereinafter, an apparatus and method for discovering sequential patterns in a data stream using a dual tree structure according to an embodiment of the present invention will be described with reference to the accompanying FIGS. 1 to 18. The description focuses on the parts needed to understand the operation and effect of the present invention. Throughout the specification, the same reference numerals in the drawings denote the same components.

To address the problems of the existing SPAMS algorithm, the present invention proposes a new sequential pattern mining structure, estDuT (estimation Dual Trees for sequential pattern mining). estDuT consists of three parts: MTestDec, the monitoring tree of the estDec algorithm, which finds the frequent itemsets of a sequence; a mapping table used to convert n-itemsets into 1-itemsets; and MTeISeq, the monitoring tree of the eISeq algorithm, which finds frequent sequences in 1-itemset sequences. The method reduces memory usage by excluding from the mining operation the candidate sequences composed of itemsets of length k or more, and it compensates for the accuracy lost due to the limit on itemset length by generating extended sequences.

FIG. 1 is an exemplary view showing an apparatus for discovering sequential patterns according to an embodiment of the present invention.

As shown in FIG. 1, the apparatus for discovering sequential patterns according to the present invention may include an input unit 110, a first extraction unit 120, a mapping unit 130, a second extraction unit 140, a generation unit 150, a restoration unit 160, and an output unit 170.

The input unit 110 may receive a data stream, which is an unbounded data set composed of continuously generated transactions. In a data set that is the target of data mining, every piece of unit information that appears in the application domain is defined as an item, and a group of unit information that has semantic co-occurrence in that domain (that is, information that semantically occurs together) is defined as a transaction.

The data stream used to find frequent item information consists of a transaction ID (TID) and item information, and may be defined as follows.

i) The itemset I = {i₁, i₂, …, i_n} is the set of items that have occurred in transactions so far, and an item represents a piece of unit information in the application domain.

ii) Let P(I) denote the power set of the itemset I. An element e satisfying e ∈ P(I) − {∅} is called an itemset. The length |e| of an itemset is the number of items that make up e, and an itemset e is called an |e|-itemset according to its length. In general, the 3-itemset {a,b,c} may be written simply as abc.

iii) A transaction is a non-empty subset of I, and each transaction may have a transaction identifier TID and a customer identifier CID. The transaction added to the data set in the k-th position is denoted T_k, and the TID of T_k is k.

iv) When a new transaction T_k is added, the current data set D_k consists of all transactions that have occurred and been added so far, that is, D_k = <T₁, T₂, …, T_k>. Thus |D_k| is the total number of transactions contained in the current data set D_k.

v) When T_k is the current transaction, the current occurrence count of an itemset e is denoted C_k(e); it is the number of transactions, among the k transactions so far, that contain e. Likewise, the support S_k(e) of an itemset is defined as the ratio of the occurrence count C_k(e) to the total number of transactions so far |D_k|, that is, S_k(e) = C_k(e) / |D_k|.
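The count C_k(e) and support S_k(e) defined in item v) can be illustrated with a short Python sketch. This is only a batch illustration over an in-memory list; the function name `itemset_support` and the sample data are assumptions made for this example, and the sketch does not reproduce the estimation performed by estDec over a stream.

```python
def itemset_support(transactions, e):
    """Return (C_k(e), S_k(e)) for itemset e over the transactions seen so far.

    transactions: list of item collections, in arrival order (D_k).
    e: the itemset whose count and support are wanted.
    """
    e = frozenset(e)
    # C_k(e): number of transactions that contain every item of e.
    count = sum(1 for t in transactions if e <= frozenset(t))
    # S_k(e) = C_k(e) / |D_k|; defined as 0.0 for an empty data set.
    support = count / len(transactions) if transactions else 0.0
    return count, support


# Hypothetical data set D_4 with four transactions.
D = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
count, support = itemset_support(D, {"a", "b"})
```

With a minimum support S_min of 0.5, the itemset ab in this sample would be frequent, since two of the four transactions contain it.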

The data stream used for sequential pattern mining is an unbounded set of continuously generated transactions, each consisting of a customer ID (CID), the time at which the transaction occurred, and item information. It may be defined as follows.

i) A sequence s = <s₁, s₂, …, s_m> is a list in which the continuously generated transactions are grouped by customer identifier CID and ordered by the time at which they occurred. Each s_j is one itemset e, and a customer cannot have more than one transaction within the same transaction unit time.

ii) The length |s| of a sequence is the number of itemsets e that make up the sequence, and a sequence s is called an |s|-sequence according to its length. In general, the 3-sequence <(a,b), (b), (a,c,d)> may be written simply as <(ab)(b)(acd)>.

iii) Over the k sequences that have occurred so far, the current count C_k(s) of a sequence s is the number of customers whose sequences contain s. The sequence support S_k(s) is the ratio of C_k(s) to the total number of customers generated so far.

iv) For sequences s_a = <a₁, a₂, …, a_n> and s_b = <b₁, b₂, …, b_m>, s_a is called a subsequence of s_b if there exist integers 1 ≤ j₁ < j₂ < … < j_n ≤ m such that a₁ ⊆ b_j₁, a₂ ⊆ b_j₂, …, a_n ⊆ b_j_n. For example, <(1 2)(3)(4 5)> is a subsequence of <(1 2 5)(3 4)(4 5)>, but <(1)(2)> is not a subsequence of <(1 2)>. Conversely, s_b is an extended sequence (supersequence) of s_a. For example, <(1 2 5)(3 4)(4 5)> is an extended sequence of <(1 2)(3)(4 5)>.
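The subsequence relation in item iv) and the sequence support in item iii) can be sketched in Python as follows. The function names `is_subsequence` and `sequence_support`, and the representation of customer sequences as a dictionary keyed by CID, are assumptions made for this illustration; the greedy left-to-right matching is one standard way to test the definition and is not taken from the specification.

```python
def is_subsequence(s_a, s_b):
    """True if s_a is a subsequence of s_b under itemset containment.

    Each itemset of s_a is matched greedily to the earliest remaining
    itemset of s_b that contains it, which preserves the required order
    j_1 < j_2 < ... < j_n.
    """
    j = 0
    for a in s_a:
        a = frozenset(a)
        # Advance until an itemset of s_b contains a.
        while j < len(s_b) and not a <= frozenset(s_b[j]):
            j += 1
        if j == len(s_b):
            return False
        j += 1  # the next itemset of s_a must match strictly later
    return True


def sequence_support(customer_sequences, s):
    """S_k(s): fraction of customers whose sequence contains s."""
    if not customer_sequences:
        return 0.0
    count = sum(1 for seq in customer_sequences.values()
                if is_subsequence(s, seq))
    return count / len(customer_sequences)


# The two examples given in item iv).
assert is_subsequence([(1, 2), (3,), (4, 5)], [(1, 2, 5), (3, 4), (4, 5)])
assert not is_subsequence([(1,), (2,)], [(1, 2)])
```

Because each customer contributes at most once to the count, a sequence that occurs several times within one customer's history still adds only 1 to C_k(s).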

Further, according to the above definitions, when the support S_k(e) of an itemset e is greater than or equal to a predefined minimum support S_min, e is called a frequent itemset in the current data stream D_k. Likewise, when the support S_k(s) of a sequence s is greater than or equal to S_min, s is called a frequent sequence in the current data stream D_k.

The first extraction unit 120 may extract, from the received data stream, the sequence candidate itemsets that are frequent in each transaction by using the MTestDec structure, the monitoring tree of the estDec algorithm. Here, MTestDec is a monitoring prefix tree built with the estDec algorithm to maintain frequent-itemset information about the sequence transactions. Traversing this tree identifies the sequence candidate itemsets e_c of the transaction that has just occurred.

The process of extracting the sequence candidate itemsets is described in detail below.

First, because the present invention searches for frequent sequences that appear within a specific period across an unspecified number of customers, it performs a preprocessing step that converts sequence transactions into unit sequence transactions.

FIG. 2 is an exemplary view for explaining the preprocessing of sequence transactions according to an embodiment of the present invention.

As shown in FIG. 2, when transactions occur, they are collected for a user-defined period, after which transactions with the same customer identifier (CID) are grouped and each customer's transactions are sorted in the order in which they occurred. The sorted per-customer transaction information is then used to build the sequence for that customer.
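The grouping and ordering step described for FIG. 2 can be sketched in Python as follows. The tuple layout `(cid, timestamp, items)` and the function name `build_customer_sequences` are assumptions made for this illustration; the collection window itself is represented simply as a batch of transactions that has already been gathered.

```python
from collections import defaultdict


def build_customer_sequences(batch):
    """Group one collected batch of transactions into per-customer sequences.

    batch: iterable of (cid, timestamp, items) tuples gathered during the
    user-defined collection period, in any order.
    Returns {cid: [itemset, itemset, ...]} with itemsets in time order.
    """
    grouped = defaultdict(list)
    for cid, timestamp, items in batch:
        grouped[cid].append((timestamp, frozenset(items)))
    return {
        # Sort each customer's transactions by occurrence time only.
        cid: [items for _, items in sorted(rows, key=lambda row: row[0])]
        for cid, rows in grouped.items()
    }


# Hypothetical batch: customer 1 has two transactions, customer 2 has one.
batch = [(1, 3, {"c"}), (2, 1, {"a"}), (1, 1, {"a", "b"})]
sequences = build_customer_sequences(batch)
```

In this sample, customer 1 yields the sequence <(ab)(c)> and customer 2 yields <(a)>.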

The sequences obtained after this preprocessing step are then mined for sequential patterns using the estDuT method proposed by the present invention.

Finding frequent sequences does not require checking the frequency of sequences over every itemset. If even one of the itemsets that make up a sequence is not frequent, the sequence cannot be a frequent sequence. This work therefore removes such infrequent itemsets first in order to reduce the cost of sequential pattern mining. The MTestDec structure is used for this purpose: it monitors, in real time, information about the itemsets that will make up sequences, so that sequential pattern mining is performed only on the itemsets that are currently occurring frequently. In the present invention, an itemset that is currently occurring frequently is called a sequence candidate itemset e_c and is defined as follows.

Definition 1. Sequence candidate itemset e_c

Let P(e) denote the power set of an itemset e. An itemset e_c satisfying e_c ∈ P(e) − {∅} and S_k(e_c) ≥ S_sig is called a sequence candidate itemset. The length |e_c| of a sequence candidate itemset is the number of items that make up e_c, and a sequence candidate itemset e_c is called an |e_c|-sequence candidate itemset according to its length.

For example, if among the subsets of the item set {1,2,3} the subsets {1}, {2}, and {1,2} have a support greater than or equal to the minimum support, then the sequence candidate item sets of {1,2,3} are {{1},{2},{1,2}}.
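Definition 1 and the example above can be sketched as a filter over the non-empty subsets of an item set. This is a minimal illustration only: the function name, the support table, and the threshold value 0.3 are hypothetical, and a plain dictionary stands in for the supports that MTestDec would maintain.

```python
from itertools import combinations

def sequence_candidate_item_sets(item_set, support, s_sig):
    """Return the non-empty subsets of item_set whose support is at least s_sig."""
    items = sorted(item_set)
    candidates = []
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            # Subsets missing from the support table are treated as support 0.
            if support.get(frozenset(subset), 0.0) >= s_sig:
                candidates.append(frozenset(subset))
    return candidates

# Hypothetical supports for the subsets of the item set {1, 2, 3}.
support = {
    frozenset({1}): 0.6,
    frozenset({2}): 0.5,
    frozenset({1, 2}): 0.4,
    frozenset({3}): 0.1,
}
result = sequence_candidate_item_sets({1, 2, 3}, support, s_sig=0.3)
print(result)
```

With these assumed supports, only {1}, {2}, and {1,2} pass the threshold, which matches the example in the text.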

As mentioned above, the present invention finds the sequence candidate item sets e_c using MTestDec, the monitoring tree of the estDec algorithm, which is a frequent pattern mining method for data streams. However, support for a sequence is defined differently from the support used in conventional frequent pattern mining. If frequent items were found using the Tid of transactions that have not been preprocessed, the sequence candidate item sets e_c that will become the item sets of frequent sequences could not be obtained. The transaction is therefore redefined as follows.

Definition 2. Transaction T_F

For the sequence s_k that occurs when the CID is k, let s_k have n transactions <T_k1, T_k2, …, T_kn>. The transaction T_F used to find the sequence candidate item sets e_c whose support is at least the significant support S_sig is defined as follows.

T_F = <T_F1.tid, T_F2.tid, …, T_Fn.tid>

T_F1.tid = k, T_F1.itemset = T_k1.itemset

T_F2.tid = k, T_F2.itemset = T_k2.itemset

…

T_Fn.tid = k, T_Fn.itemset = T_kn.itemset

The T_F obtained through Definition 2 is organized into MTestDec, a prefix tree over T_F, using the frequent pattern mining method estDec. With MTestDec built this way, all sequence candidate item sets whose support is greater than or equal to the significant support S_sig ∈ (0, S_min), set in advance as a user-defined value of estDec, can be obtained.
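The re-tagging in Definition 2 can be sketched as follows. The `Transaction` type and the function `to_tf_transactions` are hypothetical names introduced only for illustration; the sketch shows nothing more than each transaction of the sequence s_k receiving tid = k while keeping its item set.

```python
from typing import NamedTuple

class Transaction(NamedTuple):
    tid: int
    itemset: frozenset

def to_tf_transactions(cid, sequence):
    """Re-tag every transaction of the sequence s_k with tid = k (the CID)."""
    return [Transaction(tid=cid, itemset=frozenset(t)) for t in sequence]

# Hypothetical sequence for CID 7 with three transactions.
tf = to_tf_transactions(7, [{1, 2}, {3}, {1, 2, 3}])
print(tf)
```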

In a static database, the count of each transaction does not change and the total number of transactions is fixed. Because transaction support does not change in such a database, the set of frequent item sets also stays constant. In a data stream, however, the total number of transactions changes continuously, so support is never fixed. Consequently, even for a frequent item set, if the interval between transactions containing it grows longer, its count stops increasing while the total transaction count keeps growing. Its support may then temporarily fall below the minimum support, and the item set is excluded from the frequent candidate item sets. Such an item set is missing from the sequence candidate item sets, so no sequence transaction is generated for it in the later step that generates candidate sequences, which are the transactions of MTeISeq. To handle item sets that may potentially become frequent, the present invention therefore defines as sequence candidate item sets not only the frequent item sets of MTestDec but also its significant item sets.

The mapping unit 130 may map or convert the sequence candidate item sets into the form of a 1-item set sequence using a mapping table.

The process of mapping into the form of a 1-item set sequence is described in detail below.

To find frequent sequences with the MTeISeq structure in a sequence composed of n-item sets, the sequence must be converted into a 1-item set sequence. Accordingly, the present invention maps each sequence candidate item set, obtained from the frequent item information held in the MTestDec structure, to a unique ID that represents an item set of the original transaction, and generates candidate sequences from the mapped IDs. Through this process, an n-item set sequence can be converted into the form of a 1-item set sequence.

i) Sequence candidate item set mapping

FIG. 3 is an exemplary diagram illustrating a mapping table of estDuT according to an embodiment of the present invention.

As shown in FIG. 3, the sequence candidate item sets from the previous step, that is, those of FIG. 2, are each mapped to a unique ID representing the item set, so that an n-item set sequence can be changed into the form of a 1-item set sequence. The sequence candidate item sets stored in the mapping table are therefore all frequent item sets.

FIG. 4 is an exemplary diagram for describing the mapping process of a sequence candidate item set according to an embodiment of the present invention.

As shown in FIG. 4, mapping IDs are assigned sequentially each time a new sequence candidate item set occurs; if the sequence candidate item set already exists in the mapping table, its existing mapping ID is used. Because the cost of searching the mapping table grows with the number of sequence candidate item sets, a hash table data structure can be used to reduce the search cost.
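A hash-based mapping table of this kind might look like the sketch below. The class name `MappingTable`, its methods, and the choice of starting IDs at 0 are assumptions made for illustration; the actual layout of the table in FIG. 3 is not reproduced here.

```python
class MappingTable:
    """Assigns sequential IDs to item sets, with hash-table lookup in both directions."""

    def __init__(self):
        self._ids = {}        # item set -> mapping ID (hash lookup, average O(1))
        self._item_sets = []  # mapping ID -> item set, used when restoring sequences

    def get_or_assign(self, item_set):
        key = frozenset(item_set)
        if key not in self._ids:
            # A new sequence candidate item set receives the next sequential ID.
            self._ids[key] = len(self._item_sets)
            self._item_sets.append(key)
        return self._ids[key]

    def item_set_of(self, mapping_id):
        return self._item_sets[mapping_id]

table = MappingTable()
ids = [table.get_or_assign(s) for s in ({1}, {2}, {1, 2}, {2})]
print(ids)
```

The second occurrence of {2} reuses the ID already in the table rather than receiving a new one.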

ii) Length-limited sequence candidate item sets

For an item set with n items, the number of candidate item sets is at most 2ⁿ, assuming all candidate item sets are frequent. A sequence with m item sets can therefore have up to (2ⁿ)^m combinations of candidate sequences. The number of candidate sequences to process for a single sequence thus grows exponentially and demands a high processing cost. To reduce this cost, this work limits the number of items in a sequence candidate item set and thereby reduces the number of sequence candidate item sets generated.

Definition 3. k-sequence candidate item set e_c.k

When k is the number of items to which the user wants to limit item sets, a sequence candidate item set e_c with fewer than k items is called a k-sequence candidate item set e_c.k. The number of k-sequence candidate item sets, |e_c.k|, follows from the way the size of a power set is counted and is defined as follows.

|e_c.k| = nC(k-1) + nC(k-2) + … + nC1

According to Definition 3, a sequence with m item sets can have at most |e_c.k|^m combinations of candidate sequences. Consequently, when the number of items per item set is limited to k (k < n), 2ⁿ − (nC(k-1) + nC(k-2) + … + nC1) fewer subsets are generated than when subsets are generated over all items (k = n), which effectively reduces the number of candidate sequences the algorithm must process. However, because mining is not performed over the complete set of sequence candidate item sets, this approach can introduce errors in accuracy.
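The effect of the length limit on the size of the candidate space can be checked numerically. The sketch below assumes that |e_c.k| counts the non-empty subsets with fewer than k items, which is an interpretation of Definition 3; the function names and the values n = 5, m = 3 are hypothetical.

```python
from math import comb

def num_k_sequence_candidates(n, k):
    """Count the non-empty subsets of n items that have fewer than k items."""
    return sum(comb(n, i) for i in range(1, k))

def max_candidate_sequences(n, k, m):
    """Upper bound on candidate sequences for a sequence with m item sets."""
    return num_k_sequence_candidates(n, k) ** m

n, m = 5, 3
unlimited = num_k_sequence_candidates(n, n + 1)  # every non-empty subset
limited = num_k_sequence_candidates(n, 3)        # subsets with 1 or 2 items only
print(unlimited, limited)
print(max_candidate_sequences(n, n + 1, m), max_candidate_sequences(n, 3, m))
```

Under these assumptions the per-item-set count drops from 31 to 15, and the bound on candidate sequences for m = 3 drops from 29791 to 3375.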

Definition 4. Potentially frequent k-item set PLI_k

For a k-item set I_k, the Apriori principle states that if I_k is frequent, then its sub-item sets I_(k-1) are also frequent. The converse does not hold: if all I_(k-1) are frequent, I_k may or may not be frequent. Such an I_k is an item set that is potentially frequent. An I_k whose I_(k-1) are all frequent is therefore defined as a potentially frequent k-item set, PLI_k.

According to Definition 4, PLI_k can be obtained from the (k-1)-item sets. As the definition notes, however, a k-item set is not necessarily frequent just because all of its (k-1)-item sets are frequent. PLI_k is only assumed to be frequent by inference, so it may not actually be frequent. The frequent sequences finally obtained by the algorithm of the present invention may therefore contain false positive errors.
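Deriving PLI_k from the frequent (k-1)-item sets resembles the join-and-check step of Apriori candidate generation. The following is a minimal sketch under that reading of Definition 4; the function name and the sample item sets are hypothetical.

```python
from itertools import combinations

def potentially_frequent_item_sets(frequent_km1):
    """Return every k-item set whose (k-1)-subsets are all in frequent_km1."""
    frequent = {frozenset(s) for s in frequent_km1}
    plis = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        # Two (k-1)-item sets join into a k-item set only if they differ in one item.
        if len(a) != len(b) or len(union) != len(a) + 1:
            continue
        # Definition 4: every (k-1)-subset of the joined set must be frequent.
        if all(frozenset(sub) in frequent
               for sub in combinations(union, len(union) - 1)):
            plis.add(union)
    return plis

pli = potentially_frequent_item_sets([{2, 3}, {2, 4}, {3, 4}, {1, 2}])
print(pli)
```

Here {2,3,4} qualifies because {2,3}, {2,4}, and {3,4} are all present, whereas {1,2,3} does not because {1,3} is missing. Whether {2,3,4} is actually frequent remains unknown, which is the source of the false positives noted in the text.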

By using k-sequence candidate item sets, the algorithm of the present invention reduces processing cost at the expense of introducing errors for frequent items. Sequence candidate item sets of length k or more can be restored, after the algorithm has finished, by a simple post-processing step that derives PLI_k from the k-sequence candidate item sets. This restoration process is described in detail later.

FIG. 5 is an exemplary diagram for explaining the 1-item set sequence mapping technique according to an embodiment of the present invention.

FIG. 5 expresses the 1-item set sequence mapping technique according to the present invention as an algorithm. Because MTestDec is a prefix tree, a depth-first search is used, as in panel (a), to attempt matching against every sub-item set of the current transaction, and, as in panel (b), each match is checked to see whether it is an existing k-sequence candidate item set.

iii) 1-item set sequence conversion technique

Because MTeISeq can mine frequent sequences only over 1-item sets, a method for converting n-item set sequences into 1-item set sequences is required. The present invention therefore maps the k-sequence candidate item sets found in MTestDec to single IDs and combines them to produce sequences in 1-item set form.

Sequence candidate item sets that were not frequent in MTestDec, and those of length k or more, are excluded in ii) above, so no mapping ID exists for them. Only the k-sequence candidate item sets that have a mapping ID are therefore used in the conversion to 1-item set sequence form. An example of this conversion technique follows.

FIG. 6 is a first exemplary diagram for explaining the 1-item set sequence conversion technique according to an embodiment of the present invention, and FIG. 7 is a second exemplary diagram for explaining the same technique.

FIGS. 6 and 7 show how an original sequence containing k-sequence candidate item sets is converted, using the mapping IDs in the mapping table, into 1-item set sequence form for use in MTeISeq. The process starts when a sequence occurs and repeats until every k-sequence candidate item set of that sequence has been mapped and the mapped IDs have been combined into all possible candidate 1-item set sequences.

FIG. 7 expresses the 1-item set sequence conversion technique as an algorithm.
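The conversion can be pictured as taking, for each transaction of a sequence, the mapping IDs of its k-sequence candidate item sets and forming every ordered combination across transactions. The sketch below is one possible reading, not the algorithm of FIG. 7 itself: the function name and the `mapping` dictionary are hypothetical, and skipping a transaction that has no mapped candidate is an assumption.

```python
from itertools import combinations, product

def to_one_item_set_sequences(sequence, mapping, k):
    """Expand a sequence of item sets into every candidate sequence of mapping IDs."""
    per_transaction = []
    for item_set in sequence:
        items = sorted(item_set)
        # Only subsets with fewer than k items that have a mapping ID are used.
        ids = [mapping[frozenset(sub)]
               for size in range(1, min(k - 1, len(items)) + 1)
               for sub in combinations(items, size)
               if frozenset(sub) in mapping]
        if ids:  # assumption: a transaction with no mapped candidate is skipped
            per_transaction.append(ids)
    if not per_transaction:
        return []
    # One ID per transaction, in the original temporal order.
    return [list(seq) for seq in product(*per_transaction)]

# Hypothetical mapping table contents.
mapping = {frozenset({1}): 0, frozenset({2}): 1,
           frozenset({1, 2}): 2, frozenset({3}): 3}
converted = to_one_item_set_sequences([{1, 2}, {3}], mapping, k=3)
print(converted)
```

With k = 3 the first transaction contributes the IDs of {1}, {2}, and {1,2}; with k = 2 the ID of {1,2} would no longer be used.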

The second extraction unit 140 may extract 1-item set frequent sequences from the 1-item set sequences using the MTeISeq structure, which is a monitoring tree.

The process of extracting 1-item set frequent sequences is described in detail below.

To find frequent sequences from the candidate sequences converted into 1-item set sequence form in the previous step, the eISeq method, which finds frequent sequences of 1-item set sequences in a data stream, is used. Because the k-sequence candidate item sets have been mapped to IDs, each node in MTeISeq represents the k-sequence candidate item set corresponding to a mapping ID, and frequent sequences can be found by traversing the paths of MTeISeq. In estDuT, MTeISeq thus holds the temporal order information among k-sequence candidate item sets.

Even when the number of items per item set is limited, many items must still be monitored, so considerable memory and execution time are still required. In a data stream, however, recently frequent items may be more valuable than items that were frequent in the past. In that case, memory is consumed unnecessarily to retain information about items that were frequent only in the past. An ordinary landmark-type mining method keeps the same support for items that appeared with the same frequency, whether they occurred long ago or recently, so it cannot single out items that were frequent only in the past.

To address this problem, the estDec and eISeq methods apply a decay rate that weights item support differently as time passes, which makes it possible to extract frequent patterns centered on recently frequent item sets. Applying a decay rate to each tree of estDuT therefore tunes the algorithm toward recently frequent items, so that frequent sequences can be obtained faster and with less memory.

In this case, items that occurred frequently only in the past are pruned from the monitoring tree as a result of the decay rate, which saves memory. The present invention likewise applies a decay rate to both MTestDec and MTeISeq, reducing the support of long-past items at a fixed rate to minimize memory usage.
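The effect of a decay rate on support can be illustrated with a generic exponentially decayed counter. This section does not give the exact bookkeeping used by estDec or eISeq, so the sketch below is an assumption: the class `DecayedCounter`, the function `decayed_total`, and the decay value 0.5 are all hypothetical.

```python
class DecayedCounter:
    """Occurrence count in which older occurrences weigh less (generic sketch)."""

    def __init__(self, decay):
        self.decay = decay   # hypothetical decay rate in (0, 1]
        self.count = 0.0
        self.last_tid = 0

    def observe(self, tid):
        # Age the stored count by the elapsed transactions, then add the new occurrence.
        self.count = self.count * self.decay ** (tid - self.last_tid) + 1.0
        self.last_tid = tid

    def value_at(self, tid):
        # Decayed count as seen at transaction tid, without recording an occurrence.
        return self.count * self.decay ** (tid - self.last_tid)

def decayed_total(tid, decay):
    """Decayed number of transactions 1..tid (sum of decay**j for j = 0..tid-1)."""
    if decay == 1.0:
        return float(tid)
    return (1.0 - decay ** tid) / (1.0 - decay)

counter = DecayedCounter(decay=0.5)
counter.observe(1)
counter.observe(2)
support_at_4 = counter.value_at(4) / decayed_total(4, 0.5)
print(support_at_4)
```

An item set seen only at transactions 1 and 2 has its decayed support shrink as later transactions arrive, so it eventually falls below a pruning threshold, whereas with a decay of 1.0 its count would stay at 2.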

The generation unit 150 may generate extended sequences composed of item sets of length k or more and add the generated extended sequences to the 1-item set frequent sequences.

The process of generating and adding extended sequences is described in detail below.

Using the method of inferring PLI_k according to Definition 4, an extended sequence es composed of k-sequence candidate item sets that are PLI_k, and thus potentially frequent, can be inferred solely from the frequent sequences composed of (k-1)-sequence candidate item sets. A PLI_k is derived from the first (k-1)-sequence candidate item set of a frequent sequence to form a new extended sequence, and a new extended sequence is likewise formed for each of the following (k-1)-sequence candidate item sets. Repeating this procedure restores all potentially frequent extended sequences, up to those composed of n-sequence candidate item sets. The restoration proceeds through the following steps.

i) From the set of n frequent sequences FS = {f_s1, f_s2, …, f_sn} found by sequential pattern mining with the estDuT method, select an arbitrary sequence f_si (1 ≤ i ≤ n).

ii) For the sequence f_si = <s_i1, s_i2, …, s_im>, select s_ia (1 ≤ a ≤ m).

iii) Compare the selected sequence f_si with the other sequences in FS. According to Definition 4, for an extended sequence of f_si to exist, every (k-1)-sequence candidate item set that is a sub-item set of the k-sequence candidate item set in the extended sequence must be frequent. Therefore, if f_si' denotes a sequence to be compared with f_si, f_si' must be composed of (k-1)-sequence candidate item sets with s_ij = s_i'j (j ≠ a, 1 ≤ j ≤ m).

iv) Compare the (k-1)-sequence candidate item set f_sia with every f_si'a to check whether all (k-1)-sequence candidate item sets needed to generate a k-sequence candidate item set exist. If they do, generate the extended sequence in which f_sia is the k-sequence candidate item set and add it to FS.

v) Repeat steps ii) to iv) for every sequence candidate item set of s_i.

vi) Repeat steps i) to v) to generate all possible extended sequences. These become candidates for potentially frequent sequences.
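Steps i) to vi) can be sketched for a single level, going from (k-1)-item sets to k-item sets. This is an illustrative reading of the procedure rather than the implementation of the invention: sequences are modeled as tuples of frozensets, the function name is hypothetical, and repeating the step for larger item sets is left out.

```python
from itertools import combinations

def extend_sequences(frequent_sequences, k):
    """Generate extended sequences whose item set at one position grows to k items."""
    fs = {tuple(frozenset(s) for s in seq) for seq in frequent_sequences}
    extended = set()
    for seq in fs:                                    # step i)
        for a, s_a in enumerate(seq):                 # step ii)
            if len(s_a) != k - 1:
                continue
            # step iii): (k-1)-item sets at position a of sequences that
            # agree with seq at every other position
            siblings = {other[a] for other in fs
                        if len(other) == len(seq)
                        and len(other[a]) == k - 1
                        and other[:a] == seq[:a]
                        and other[a + 1:] == seq[a + 1:]}
            # step iv): join s_a with a sibling and require every (k-1)-subset
            for t in siblings:
                union = s_a | t
                if len(union) != k:
                    continue
                if all(frozenset(sub) in siblings
                       for sub in combinations(union, k - 1)):
                    extended.add(seq[:a] + (union,) + seq[a + 1:])
    return extended                                   # steps v) and vi) are the loops

fs_example = [({2, 3}, {2}), ({3, 4}, {2}), ({2, 4}, {2})]
es = extend_sequences(fs_example, k=3)
print(es)
```

For the three frequent sequences above with k = 3, the single extended sequence <(234)(2)> is produced; if any one of the three is removed, no extended sequence results.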

FIG. 8 is an exemplary diagram for explaining the principle of generating an extended sequence according to an embodiment of the present invention.

FIG. 8 shows, for k = 3, how an extended sequence composed of item sets of length k or more is generated from (k-1)-sequence candidate item sets.

Because the extended sequences generated this way are only potentially frequent, they must be verified using MTestDec. Whether an extended sequence is actually frequent is checked by whether all of its sequence candidate item sets exist in MTestDec.

i) Select an arbitrary sequence s (s ∈ FS) and select a sequence candidate item set in it whose length is k or more.

ii) Check whether that sequence candidate item set exists in MTestDec.

iii) If it does not exist in MTestDec, the sequence is not frequent and is a false positive error, so delete the sequence.

iv) Repeat steps i) to iii) to remove the extended sequences that are false positive errors.
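The verification steps above amount to a filter over FS. In the sketch below a plain set of item sets stands in for MTestDec, which is an assumption made for illustration; the function name and sample data are hypothetical.

```python
def remove_unverified(sequences, monitored_item_sets, k):
    """Drop sequences containing an item set of length >= k that is not monitored."""
    monitored = {frozenset(s) for s in monitored_item_sets}
    return [seq for seq in sequences
            if all(len(s) < k or frozenset(s) in monitored for s in seq)]

# Hypothetical frequent and extended sequences, and item sets present in the tree.
candidates = [({2, 3, 4}, {2}), ({1, 2, 3}, {2}), ({2, 3}, {2})]
in_tree = [{2, 3, 4}, {2, 3}, {2}]
verified = remove_unverified(candidates, in_tree, k=3)
print(verified)
```

The sequence containing {1,2,3} is removed because that item set is not in the tree. Sequences whose item sets all have fewer than k items are left untouched, since they were counted directly.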

Verification against MTestDec does not, however, remove every false positive error. For example, suppose {23}, {24}, {34}, and {2} all exist in MTestDec as sequence candidate item sets, and the sequences <(23)(2)>, <(34)(2)>, and <(24)(2)> exist in MTeISeq as frequent sequences; the extended sequence <(234)(2)> can then be generated. Even if {234} exists in MTestDec as a frequent item set, it does not follow that the sequence candidate item set {2} frequently occurs after the sequence candidate item set {234}. In a sequence, not only each individual item set but also the temporal order among item sets must be frequent, and this is a source of false positive errors.

Because the occurrences of an extended sequence are not counted directly with MTeISeq, its exact support cannot be obtained; however, using the Apriori principle mentioned in Definition 4, a range for the support can be estimated.

Definition 5. Support of an extended sequence es

The support of an extended sequence es is the minimum of (a) the supports of the sequences s_i (s_i ∈ FS, 1 ≤ i ≤ n, n = kC(k-1)) that contain the (k-1)-sequence candidate item sets of a PLI_k (∈ es) among the sequence candidate item sets of es, and (b) the support of the sequence candidate item set e_c.k of the MTestDec tree that matches PLI_k.

S(es) = min(S(s₁), S(s₂), …, S(s_n), S(e_c.k))

Definition 5 yields the support of an extended sequence es. By the Apriori principle of Definition 4, if the k-sequence candidate item set PLI_k is frequent, every (k-1)-sequence candidate item set that generates PLI_k must also be frequent. The support of the extended sequence es containing PLI_k therefore can never exceed the support of any sequence composed of the (k-1)-sequence candidate item sets from which PLI_k is generated. In addition, the support of PLI_k must equal the support of the sequence candidate item set e_c.k in the MTestDec tree, so the support of an extended sequence es containing PLI_k cannot exceed the support of e_c.k. The support of es can thus be inferred as the minimum of the support of e_c.k and the supports of the sequences composed of the (k-1)-sequence candidate item sets of PLI_k.
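Definition 5 reduces to taking a minimum over known supports. The sketch below uses made-up support values for the example <(234)(2)>; the function name and all numbers are hypothetical.

```python
def extended_sequence_support(sub_sequence_supports, item_set_support):
    """Estimate S(es) as min(S(s_1), ..., S(s_n), S(e_c.k)) per Definition 5."""
    return min(min(sub_sequence_supports), item_set_support)

# Hypothetical supports of <(23)(2)>, <(24)(2)>, <(34)(2)> and of the item set {2,3,4}.
estimate = extended_sequence_support([0.12, 0.10, 0.15], 0.11)
print(estimate)
```

Because each term bounds the true support from above, the result is best read as an upper estimate rather than an exact count.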

The restoration unit 160 may restore the 1-item set frequent sequences, to which the extended sequences have been added, to the original n-item set frequent sequences.

The output unit 170 may output the restored original n-item set frequent sequences in real time.

The results of experiments that verify the performance of the estDuT method proposed in the present invention are described below.

The data used in the experiments were generated with the IBM dataset generator. In each experiment, the data set was assumed to occur repeatedly, preprocessing of sequences was assumed to take place once per repetition of the data set, and the preprocessing step was not reflected in the experimental data. In all experiments, the significant support Ssig is defined relative to the minimum support Smin: when the user-defined threshold Ssig is given as p%, the actual value of Ssig is Smin*(p/100). Forced pruning of the trees was performed whenever the CID was a multiple of 100, and settings such as the minimum support Smin and the significant support Ssig were kept identical for MTestDec and MTeISeq unless otherwise noted. All experiments were run on a quad-core 2.4 GHz computer with 4 GB of memory under Linux kernel 2.6.26, and the estDuT algorithm was implemented entirely in C.

Memory usage and execution time were compared against SPAMS, a sequential pattern mining algorithm for data streams that, like estDuT, uses the landmark model, and accuracy was measured against the result set of SPAM, a sequential pattern mining method for finite data sets. A mining method for finite data sets such as SPAM always produces exact results because the complete set of transactions is bounded, so it is suitable as the ground truth for comparing the accuracy of data stream mining.

In the present invention, experiments to evaluate the performance and accuracy of the algorithm were carried out using the data sets listed in [Table 1] below.

[Table 1]

Each data set was configured with parameter values chosen to generate data suited to the purpose of the experiment, and the parameters are defined as shown in [Table 2] below.

[Table 2]

The data set D0.1C3T1S3I1N0.03 has 100 customers, an average of 3 transactions per customer, and an average of 3 items per transaction. It consists of 30 distinct items; the average length of the frequent sequences is 3, and the size of the item sets in the frequent sequences is 1.

The density of a generated data set is defined as follows.

Therefore, the density ρ(R) of the data set R = D0.1C3T1S3I1N0.03, expressed as a percentage, is 3.33 (%).

FIG. 9 is a graph showing memory usage as the length limit k of the sequence candidate item sets changes, according to an embodiment of the present invention.

FIG. 9 shows, for the data sets D0.1C3T1S3I1N0.03, D0.1C3T1.5S3I1.5N0.03, D0.1C3T2S3I2N0.03, D0.1C3T2.5S3I2.5N0.03, and D0.1C3T3S3I3N0.03 with Smin = 0.05 and Ssig = 90%, how the maximum memory usage changes with the length limit k of the sequence candidate item sets. Panel (a) shows the maximum memory usage of MTestDec. MTestDec keeps every item set whose support is at least Ssig regardless of the length limit k, so it is unaffected by changes in k. Panel (b) shows the maximum memory usage of MTeISeq. Limiting k to a small value reduces the number of sequence candidate item sets, so memory usage drops considerably. Panel (c) shows the memory usage of the mapping table. Sequence candidate item sets that are not generated because of the limit k are not recorded in the mapping table, so its memory usage also decreases as k decreases.

FIG. 10 is a graph showing execution time as the length limit value k of the sequence candidate item sets changes, according to an embodiment of the present invention.

FIG. 10 compares the average execution time for the same experiment as in FIG. 9. Figure (a) shows the average execution time of MTestDec. Unlike its memory usage, the execution time of MTestDec increases as k increases. This cost arises in the matching process used to find the k-sequence candidate item sets; the algorithm's execution time was reduced by not attempting matching for item sets whose length is k or more. Figures (b) and (c) show the average execution times of MTeISeq and the mapping table, which vary greatly with k. Overall, execution time grows with k, and the increase is especially large for dense data sets.

FIG. 11 is a graph showing precision and recall as the length limit value k of the sequence candidate item sets changes, according to an embodiment of the present invention.

As shown in FIG. 11, using the same data sets as in FIG. 9 with Smin = 0.05 and Ssig = 90%, the graphs plot precision and recall against the length limit value k of the sequence candidate item sets, before and after the post-processing step that restores frequent sequences using enlarged sequences. SPAM was used as the reference algorithm for comparing accuracy. As Figures (a) and (b) show, the smaller the limit on k, the lower the precision after post-processing. When k is 1, the number of enlarged sequences generated grows without bound, so memory was exhausted and measurement was impossible. In that case, however, enlarged sequences are generated for every frequent item of length 1, so by Definition 4 no frequent sequence is missed. Recall therefore always stays at 100%, but precision drops sharply because many false positives are generated. Since this is clearly a very inefficient setting, k must be chosen to be greater than 1.
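The two measures used here can be illustrated with a short sketch. It assumes, as the text does, that the SPAM result is treated as the reference set; the function name `precision_recall` and the toy sequences are hypothetical and are not experimental data.

```python
def precision_recall(mined, reference):
    """Precision and recall of a mined result set against a reference set.

    Both arguments are collections of hashable sequences, for example
    tuples of frozensets. The reference plays the role of the SPAM result.
    """
    mined, reference = set(mined), set(reference)
    true_positives = len(mined & reference)
    precision = true_positives / len(mined) if mined else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    a = (frozenset("a"), frozenset("b"))
    b = (frozenset("ab"),)
    c = (frozenset("a"), frozenset("bc"))
    d = (frozenset("c"), frozenset("d"))  # stands in for a false positive
    # Every reference sequence is found, but one extra sequence is reported.
    print(precision_recall([a, b, c, d], [a, b, c]))
```

In the toy call, nothing in the reference is missed, so recall is 1.0, while the one spurious sequence lowers precision to 0.75. This mirrors the pattern the text describes for small k.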

Comparing Figures (c) and (d) shows that, for this data set, recall at k = 3 and k = 4 is not much better than at k = 2. The limited recall at k = 3 and k = 4 is due to the error that arises in a data stream environment. In other words, the post-processing restoration using enlarged sequences can also reduce the error introduced by the data stream environment.

Therefore, for this data set the most suitable value of k is 2, at which precision does not drop much, memory usage is low, and execution time is short. Precision falls after the post-processing restoration because the smaller k is, the more enlarged sequences es must be inferred from the potentially frequent item sets PLIk. In this process, false positives, that is, sequences judged frequent although they are not actually frequent, are generated, and these affect precision.

FIG. 12 is a graph showing ASE as the length limit value k of the sequence candidate item sets changes, according to an embodiment of the present invention.

As shown in FIG. 12, the graph plots the ASE value against the length limit value k of the sequence candidate item sets for the same data sets as in FIG. 9, with Smin = 0.05 and Ssig = 90%. SPAM [7] was used as the reference algorithm for comparing accuracy. ASE (Average Support Error), as defined in [8], measures the relative accuracy of the proposed method; here it is used in a form modified to compare sequences instead of item sets. Accordingly, for two frequent sequence result sets R1 and R2, ASE(R2|R1) is defined as follows.

|R1| denotes the number of sequences in the result set R1. The smaller ASE(R2|R1) is, the more similar the mining result set R2 is to the result set R1. In Figure 12, ASE(RestDuT|RSPAM) is used to express the accuracy of the sequential pattern mining results, where RestDuT and RSPAM denote the mining results obtained by the estDuT algorithm and the SPAM algorithm, respectively. Because precision decreased in Figure (b) owing to the generation of enlarged sequences using PLIk, the ASE value relative to the SPAM mining result is also relatively large in Figures (a) and (b) when k = 2. The ASE value of the data set D0.1C3T3S3I3N0.03 does not decrease much after post-processing because of the error inherent in mining in a data stream environment. Thus, limiting the length of the sequence candidate item sets lets estDuT reduce memory usage and execution time, at the cost of somewhat lower accuracy.

FIG. 13 is a graph showing the performance of estDuT as the Ssig value changes, according to an embodiment of the present invention.

As shown in FIG. 13, the graphs show the performance of the estDuT algorithm as Ssig changes, with k = 2 and Smin = 0.1, on the data sets D0.1C3T4S3I1.25N4, D0.1C3T4S3I1.25N0.8, D0.1C3T4S3I1.25N0.4, D0.1C3T4S3I1.25N0.08, and D0.1C3T4S3I1.25N0.04. As Figure 13 shows, memory usage rises and execution time lengthens as the density of the data set increases. This is because denser data sets yield more frequent item sets and frequent sequences, so more item sets have support greater than Ssig and each tree must maintain more item set information. Increasing Ssig therefore reduces the number of nodes kept in MTestDec and MTeISeq and shrinks the mapping table, so memory usage and execution time decrease as shown in Figure 13. The Ssig value can thus be tuned to the characteristics of the data to control memory usage and execution time.

FIG. 14 is a graph showing performance when the decay optimization is applied, according to an embodiment of the present invention.

As shown in FIG. 14, the graphs compare the performance of estDuT and SPAMS across the same data sets as in FIG. 13, with Smin = 0.05 and Ssig = 90%. For comparison with an optimized estDuT, additional runs were made with the decay half-life parameter set to 10000, 30000, and 50000. estDuT uses less memory than SPAMS, but its execution time becomes longer than that of SPAMS as the density of the data set grows. This is because in a denser data set each item occurs more often, which increases the number of sequence candidate item sets that MTestDec must monitor. SPAMS likewise has more frequent sequences to manage as density grows, so it too uses more memory at higher density. Moreover, SPAMS must keep a large amount of auxiliary information to maintain its SPA structure, so it uses considerably more memory than estDuT, as in Figure (a). estDuT, on the other hand, requires a step that generates sequence candidate item sets, so when higher density increases the item information to be maintained, its execution time becomes long, as in Figure (b). Applying the decay concept mentioned earlier reduced the memory usage and execution time of the algorithm to some extent, as shown in FIG. 14.
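The role of the half-life parameter can be sketched as follows. This is an illustrative assumption rather than the disclosed procedure: it supposes an exponential decay in which an occurrence loses half of its weight after `half_life` further transactions, and the names `decay_rate` and `DecayedCounter` are hypothetical.

```python
import math


def decay_rate(half_life):
    """Per-transaction decay factor d such that d ** half_life == 0.5 (assumed form)."""
    return 2.0 ** (-1.0 / half_life)


class DecayedCounter:
    """Occurrence count in which older occurrences weigh less (illustrative only)."""

    def __init__(self, half_life):
        self.d = decay_rate(half_life)
        self.count = 0.0
        self.last_tid = 0

    def value_at(self, tid):
        # Decay the stored count for the transactions elapsed since the last update.
        return self.count * self.d ** (tid - self.last_tid)

    def observe(self, tid):
        # Decay first, then add the new occurrence with full weight 1.
        self.count = self.value_at(tid) + 1.0
        self.last_tid = tid


if __name__ == "__main__":
    counter = DecayedCounter(half_life=10000)
    counter.observe(tid=1)
    print(math.isclose(counter.value_at(10001), 0.5, rel_tol=1e-9))
```

Under this assumption a shorter half-life forgets old occurrences faster, so fewer item sets stay above the thresholds. That is consistent with the reduced memory usage and execution time reported in the text.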

FIG. 15 is a graph showing the results of an experiment comparing the accuracy (precision and recall) of estDuT and SPAMS, according to an embodiment of the present invention.

As shown in FIG. 15, the graphs compare, by data set, the precision and recall of SPAMS and of estDuT with frequent sequences restored using enlarged sequences, for the same data sets as in FIG. 14 with Smin = 0.05 and Ssig = 90%. SPAM [7] was used as the reference algorithm for comparing accuracy. As Figure 15 shows, because SPAMS does not manage frequent items precisely, its accuracy tends to be erratic even though the data sets are small. For estDuT, precision drops somewhat as density increases, but recall remains high.

FIG. 16 is a graph showing the results of an experiment comparing the accuracy (ASE) of estDuT and SPAMS, according to an embodiment of the present invention.

As shown in FIG. 16, the graph compares, by data set, the ASE values of SPAMS and of estDuT with frequent sequences restored using enlarged sequences, for the same data sets as in FIG. 14 with Smin = 0.05 and Ssig = 90%. SPAM [7] was used as the reference algorithm for comparing accuracy. As in Figure 15, estDuT shows relatively high accuracy compared with SPAMS.

FIG. 17 is a graph showing performance evaluation results for long sequences, according to an embodiment of the present invention.

As shown in FIG. 17, the graphs compare the performance of estDuT and SPAMS as the transaction length changes, for the relatively low-density data set D1C10T1-1.3S10I1-1.3N1 with Smin = 0.05. The significant support of estDuT is 0.05, and SPAMS uses the statistical support θ′ = θ − ε (θ = S) with δ = 0.01. In a relatively low-density data set, the items are widely distributed, so there are few frequent items. In estDuT, where |e| is the length of an item set in the data set, the maximum number of sequence candidate item sets generated from e is a sum of combination terms over |e|. The number of MTeISeq nodes to be processed grows accordingly, so, as in Figure 12, memory usage and execution time increase as the item set length |e| grows. SPAMS, unlike estDuT, uses an automaton called SPA that manages the subsets of an item set e by transitions alone rather than by nodes, so it can manage all subsequences of a sequence without keeping many nodes. However, the SPA structure requires a great deal of separately maintained information and therefore much additional memory. Deletion is not performed over all nodes; only the nodes related to an arriving sequence are pruned, so residual nodes that are not actually frequent remain. Each state must also maintain additional information, such as the set of ingoing transitions and the set of customer identifiers (CIDs) that occurred in that state, and the SPA structure itself must maintain information such as the set of reachable states for each customer identifier. Consequently, on relatively low-density data sets SPAMS uses more memory and runs slightly slower than the estDuT algorithm, as shown in FIG. 17.
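To give a feel for how quickly the number of candidates can grow with |e|, the sketch below counts the non-empty subsets of an item set that have at most k items. Treating this count as the bound is an assumption made for illustration, since the exact expression is not legible in this section; `max_candidates` is a hypothetical name.

```python
from math import comb


def max_candidates(length, k):
    """Number of non-empty subsets with at most k items of an item set
    that has `length` items (assumed form of the bound, for illustration)."""
    return sum(comb(length, size) for size in range(1, min(k, length) + 1))


if __name__ == "__main__":
    for length in (2, 4, 8, 16):
        print(length, max_candidates(length, k=2))
```

Even with k fixed at 2, the count grows quadratically in the item set length. That is in line with the rising memory usage and execution time described for longer item sets.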

FIG. 18 is an exemplary diagram showing a method for exploring sequential patterns according to an embodiment of the present invention.

As shown in FIG. 18, when the sequential pattern exploration apparatus according to the present invention receives a data stream, which is an unbounded data set composed of continuously generated transactions, it may extract sequence candidate item sets, which are the frequent items in each transaction, using the MTestDec structure, the monitoring tree of the estDec algorithm (S1810).

Next, the sequential pattern exploration apparatus may convert, or map, the extracted sequence candidate item sets into the form of a 1-item set sequence using a mapping table (S1820). Here, the apparatus maps each k-sequence candidate item set to a single ID, but does not map sequence candidate item sets whose length is k or more.
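The mapping step can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: `MappingTable` and `to_id_sequence` are hypothetical names, the set of candidate item sets is passed in directly instead of coming from MTestDec, `max_len` stands for the largest item set length that is still mapped, and keeping the IDs of one transaction as a sorted tuple is a choice made only for this sketch.

```python
from itertools import combinations


class MappingTable:
    """Assigns one integer ID per item set, in order of first appearance."""

    def __init__(self):
        self._ids = {}

    def id_for(self, itemset):
        key = frozenset(itemset)
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]


def to_id_sequence(transactions, candidates, max_len, table):
    """Replace each transaction by the IDs of the candidate item sets it contains.

    Only sub-item-sets with at most `max_len` items are considered, so longer
    item sets never enter the mapping table.
    """
    sequence = []
    for transaction in transactions:
        items = sorted(transaction)
        ids = []
        for size in range(1, min(max_len, len(items)) + 1):
            for combo in combinations(items, size):
                if frozenset(combo) in candidates:
                    ids.append(table.id_for(combo))
        sequence.append(tuple(sorted(ids)))
    return sequence


if __name__ == "__main__":
    candidates = {frozenset("a"), frozenset("b"), frozenset("ab")}
    table = MappingTable()
    print(to_id_sequence([{"a", "b"}, {"b", "c"}], candidates, 2, table))
```

Because an item set that is never mapped never receives an ID, lowering `max_len` directly shrinks the mapping table, which matches the memory behaviour discussed for FIG. 9.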

Next, the sequential pattern exploration apparatus may extract a 1-item set frequent sequence from the converted 1-item set sequence using the MTeISeq structure, the monitoring tree of the eISeq algorithm (S1830).

Next, the sequential pattern exploration apparatus may generate an enlarged sequence composed of item sets having k or more items (S1840) and add the generated enlarged sequence to the 1-item set frequent sequence (S1850).
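One ingredient of this step, building larger item sets from smaller ones that are already known, can be sketched in the style of Apriori candidate generation. The sketch follows the condition stated in the claims, namely that an item set with k items qualifies when all of its (k-1)-item subsets are present. It covers item sets only: how they are spliced into sequences and how their support is estimated is not shown, and `expand_itemsets` is a hypothetical name.

```python
from itertools import combinations


def expand_itemsets(known, size):
    """Return the item sets with `size` items whose (size-1)-item subsets
    are all contained in `known` (a collection of frozensets); size >= 2."""
    base = {s for s in known if len(s) == size - 1}
    items = sorted(set().union(*base)) if base else []
    expanded = set()
    for combo in combinations(items, size):
        if all(frozenset(sub) in base for sub in combinations(combo, size - 1)):
            expanded.add(frozenset(combo))
    return expanded


if __name__ == "__main__":
    known = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")}
    print(expand_itemsets(known, 3))
```

In the toy call only {a, b, c} qualifies, because {a, b, d} lacks {a, d} and {b, c, d} lacks {c, d}. Inferring larger item sets this way cannot rule out item sets that are not actually frequent, which is one way to see where the false positives discussed for FIG. 11 come from.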

Next, the sequential pattern exploration apparatus may restore the 1-item set frequent sequence, to which the enlarged sequence has been added, to the original n-item set frequent sequence (S1860) and output the restored n-item set frequent sequence in real time.

estDuT according to the present invention is an algorithm that finds frequent n-item set sequences in a data stream environment by using two structures: MTestDec, which detects in real time the k-sequence candidate item sets, that is, the frequent item sets of length k or less among the item sets making up a sequence, and MTeISeq, which detects frequent 1-item set sequences in real time. By limiting the length of the frequent item sets, the estDuT method improves the performance of frequent sequence detection, and by using enlarged sequences to restore, at the moment the user requests the frequent sequences, those frequent sequences composed of item sets longer than k, it was able to produce frequent sequence results with high accuracy.

Compared with the recently studied SPAMS algorithm, estDuT according to the present invention has the advantage of lower memory usage and shorter execution time on relatively low-density data. In addition, unlike SPAMS, it can manage frequent items and sequences by estimating the support of items that were pruned because they had not occurred for a long time; it manages items efficiently by periodically removing infrequent items and sequences through forced pruning; and, by limiting the length of the sequence candidate item sets, it can mine sequential patterns over n-item sets with relatively little memory.

Those of ordinary skill in the art to which the apparatus and method for exploring sequential patterns in a data stream using a dual tree structure according to the present invention pertain will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed herein are intended to illustrate, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within their equivalent scope should be construed as falling within the scope of the present invention.

110: input unit
120: first extraction unit
130: mapping unit
140: second extraction unit
150: generator
160: restoration unit
170: output unit

Claims

An apparatus for exploring sequential patterns in a data stream, comprising:
a first extracting unit that, upon receiving a data stream, which is an infinite data set composed of continuously occurring transactions, extracts sequence candidate item sets, which are frequent items in each transaction, from the data stream by using an MTestDec structure, which is a monitoring tree;
a mapping unit that maps the sequence candidate item sets into the form of a 1-item set sequence;
a second extracting unit that extracts a 1-item set frequent sequence from the 1-item set sequence by using an MTeISeq structure, which is a monitoring tree;
a generating unit that generates an enlarged sequence composed of item sets having k or more items; and
a restoring unit that restores the 1-item set frequent sequence, to which the enlarged sequence has been added, to the original n-item set frequent sequence.

The apparatus according to claim 1,
wherein the first extracting unit
extracts sequence candidate item sets whose support is greater than or equal to a support value set to a user-defined value, wherein the support represents the ratio of the number of customers whose sequences contain the sequence to the total number of customers, and the sequence is a list in which the continuously generated transactions are grouped by customer identifier in the order of the times at which the transactions occurred.

The apparatus according to claim 2,
wherein the support
is maintained by assigning each item a weight that differs over time, in order to extract frequent patterns centered on frequent item sets.

The apparatus according to claim 1,
wherein the mapping unit,
in order to convert the sequence candidate item sets into a 1-item set sequence, maps each of the sequence candidate item sets to a single ID, sequence candidate item sets having fewer than k items being mapped.

The apparatus according to claim 1,
wherein the generating unit
generates an enlarged sequence consisting of sequence candidate item sets having k items, all of whose (k-1)-item subsets are present.

A method for exploring sequential patterns in a data stream, comprising:
(a) receiving a data stream, which is an infinite data set composed of continuously occurring transactions, and extracting sequence candidate item sets, which are frequent items in each transaction, from the data stream by using an MTestDec structure, which is a monitoring tree;
(b) mapping the sequence candidate item sets into the form of a 1-item set sequence;
(c) extracting a 1-item set frequent sequence from the 1-item set sequence by using an MTeISeq structure, which is a monitoring tree;
(d) generating an enlarged sequence composed of item sets having k or more items; and
(e) restoring the 1-item set frequent sequence, to which the enlarged sequence has been added, to the original n-item set frequent sequence.

The method of claim 6,
wherein step (a)
extracts sequence candidate item sets whose support is greater than or equal to a support value set to a user-defined value, wherein the support represents the ratio of the number of customers whose sequences contain the sequence to the total number of customers, and the sequence is a list in which the continuously generated transactions are grouped by customer identifier in the order of the times at which the transactions occurred.

The method of claim 7,
wherein the support
is maintained by assigning each item a weight that differs over time, in order to extract frequent patterns centered on frequent item sets.

The method of claim 6,
wherein step (b),
in order to convert the sequence candidate item sets into a 1-item set sequence, maps each of the sequence candidate item sets to a single ID, sequence candidate item sets having fewer than k items being mapped.

The method of claim 6,
wherein step (d)
generates an enlarged sequence consisting of sequence candidate item sets having k items, all of whose (k-1)-item subsets are present.