KR101376444B1

KR101376444B1 - Pattern mining method for searching tree on top-down traversal for considering weight in data stream

Info

Publication number: KR101376444B1
Application number: KR1020120009913A
Authority: KR
Inventors: 윤은일; 양흥모; 편광범; 이강인; 김지원
Original assignee: 충북대학교 산학협력단
Priority date: 2012-01-31
Filing date: 2012-01-31
Publication date: 2014-03-19
Also published as: KR20130088584A

Abstract

The present invention relates to a pattern mining method that searches a tree in a top-down manner in consideration of weights in a data stream. In a pattern mining apparatus that performs such pattern mining, the method comprises: generating, by the pattern mining apparatus, a first transaction in which the data stream is sorted in descending order of weight; inserting the first transaction into a WP-tree (Weighed Pattern-tree); searching the WP-tree in a top-down manner; generating a second transaction using the searched path; generating a conditional database by using the second transaction as a database; and performing a projection operation using the conditional database.

Description

Pattern mining method for searching tree on top-down traversal for considering weight in data stream

The present invention relates to a pattern mining method that searches a tree in a top-down manner in consideration of weights in a data stream.

Data mining uses a variety of techniques to find needed information in large volumes of data. The techniques fall broadly into two categories, predictive and descriptive. Predictive techniques are subdivided into classification, regression, time series analysis, and prediction; descriptive techniques include clustering, summarization, association rules, and sequence discovery.

In descriptive data mining, currently the most active research area, an association rule is characterized by two measures: the rule "when pattern A is present, pattern B frequently occurs" is called confidence, and the rule "pattern A and pattern B occur together" is called support. Of these, pattern mining is the technique used to find support, that is, to discover association rules from the data stored in a database.
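As a non-limiting illustration, the two measures can be computed as in the following Python sketch. The function names and the toy database are hypothetical and do not appear in the specification; support is taken here as an absolute transaction count, and confidence as the usual ratio of the support of A and B together to the support of A.

```python
from typing import Iterable, List, Sequence


def support(transactions: Sequence[Iterable[str]], pattern: Iterable[str]) -> int:
    """Count the transactions that contain every item of `pattern`."""
    wanted = frozenset(pattern)
    return sum(1 for t in transactions if wanted <= frozenset(t))


def confidence(transactions: Sequence[Iterable[str]],
               a: Iterable[str], b: Iterable[str]) -> float:
    """Fraction of the transactions containing A that also contain B."""
    a_items, b_items = set(a), set(b)
    support_a = support(transactions, a_items)
    if support_a == 0:
        return 0.0
    return support(transactions, a_items | b_items) / support_a


# Hypothetical toy database, not taken from the specification.
db: List[List[str]] = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]]
support_ab = support(db, ["a", "b"])           # A and B occur together
confidence_a_b = confidence(db, ["a"], ["b"])  # B occurs when A is present
```

In this toy database, items a and b occur together in two transactions, and two of the three transactions containing a also contain b.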

Many pattern mining techniques have been developed, and they are usually classified into three types.

The first finds all patterns whose support exceeds a threshold set on transaction support. Most pattern mining work belongs to this category, and many methods have been developed for it.

The second finds closed patterns: among the patterns whose support exceeds the threshold, those that have no superset with the same support. The advantage of this approach is that it finds the largest pattern among patterns of equal support, and all patterns can be recovered from the result.

The last finds maximal patterns: the longest patterns whose support exceeds the threshold, without reporting their subsets. Maximal patterns lend themselves to applications similar to outlier detection and can be used in many fields.
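The difference between the three types can be seen on a small example. The following Python sketch is an illustration only: it enumerates patterns by brute force, which is not the method of the invention, and it treats a pattern as frequent when its support is greater than or equal to the threshold, which is an assumption. All names and the toy database are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def frequent_patterns(transactions: List[List[str]],
                      min_support: int) -> Dict[FrozenSet[str], int]:
    """Brute-force enumeration of every pattern with support >= min_support."""
    items = sorted({i for t in transactions for i in t})
    result: Dict[FrozenSet[str], int] = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(1 for t in transactions if set(combo) <= set(t))
            if count >= min_support:
                result[frozenset(combo)] = count
    return result


def closed_patterns(freq: Dict[FrozenSet[str], int]) -> Dict[FrozenSet[str], int]:
    """Keep patterns that have no proper superset with the same support."""
    return {p: s for p, s in freq.items()
            if not any(p < q and s == sq for q, sq in freq.items())}


def maximal_patterns(freq: Dict[FrozenSet[str], int]) -> Dict[FrozenSet[str], int]:
    """Keep patterns that have no frequent proper superset at all."""
    return {p: s for p, s in freq.items() if not any(p < q for q in freq)}


# Hypothetical toy database, not taken from the specification.
db = [["a", "b", "c"], ["a", "b", "c"], ["a", "b"], ["c"]]
freq = frequent_patterns(db, min_support=2)
closed = closed_patterns(freq)
maximal = maximal_patterns(freq)
```

Here all seven non-empty patterns over a, b, and c are frequent, three of them ({c}, {a, b}, {a, b, c}) are closed, and only {a, b, c} is maximal.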

Pattern mining of stream data has recently become an important issue as applications and system software trend toward real-time services. Conventional pattern mining uses data in data warehouse form and is therefore very difficult to apply to real-time data. To overcome this, methods were developed that scan the database only once, instead of twice as in the conventional approach, to find frequent patterns, that is, patterns whose support exceeds a threshold.

In weighted pattern mining, each item in the database is assigned a weight: important items receive high weights and items of little concern receive low weights. Methods were developed to find weighted frequent patterns, that is, patterns whose weight-adjusted support exceeds a threshold. Instead of scanning the database, sorting the items in frequency order, and inserting them into the corresponding nodes of a compressed tree, these methods sort the items in weight order before inserting them into the tree.

Conventional weighted pattern mining of stream data searches a weight-ascending tree in a bottom-up manner. That is, items are selected starting from the one with the largest weight, and the nodes holding the selected item are located by following the node links attached to the header table. From each located node, the tree is searched upward to the root to obtain the conditional database. Because a single subtree can contain several nodes corresponding to the selected item, this conventional approach visits the same nodes many times.

The present invention has been devised to solve the above problem. Its object is to provide a top-down tree search technique that finds the conditional database without revisiting the duplicate nodes that the bottom-up search method visits repeatedly.

Specifically, an object of the present invention is to provide a method for efficient node search that builds a tree whose items are ordered by descending weight, selects the item with the largest weight, searches the tree top-down using BFS and DFS, and uses a dynamic programming technique to remember nodes already visited so that they are not revisited.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above objects, the present invention provides a pattern mining method in a pattern mining apparatus that performs pattern mining by searching a tree top-down in consideration of weights in a data stream, the method comprising: generating, by the pattern mining apparatus, a first transaction in which the data stream is sorted in descending order of weight; inserting, by the pattern mining apparatus, the first transaction into a WP-tree (Weighed Pattern-tree); searching, by the pattern mining apparatus, the WP-tree in a top-down manner; generating, by the pattern mining apparatus, a second transaction using the searched path; generating, by the pattern mining apparatus, a conditional database by using the second transaction as a database; and performing, by the pattern mining apparatus, a projection operation using the conditional database.

In the step of inserting the first transaction into the WP-tree, the first transaction is inserted into a WP-tree whose data structure is an array, and the structure may be organized as the relationship between the header table of the first transaction and the WP-tree.

The step of searching the WP-tree in a top-down manner may be a top-down search from the root to the leaves, using a dynamic programming method, of the WP-tree built by applying the stream data and the weights.

Here, the top-down search from the root to the leaves may use a BFS (Breadth First Search) method or a DFS (Depth First Search) method.

The step of performing the projection operation may include generating a local tree from the WP-tree using the conditional database, and extracting patterns when the local tree becomes a single path.

According to the present invention, a top-down tree search technique is provided that finds the conditional database without revisiting the duplicate nodes that the bottom-up search method visits repeatedly, which prevents unnecessary searching and improves search efficiency.

Specifically, the present invention builds a tree whose items are ordered by descending weight, selects the item with the largest weight, searches the tree top-down using BFS and DFS, and uses a dynamic programming technique to remember nodes already visited so that they are not revisited, which enables efficient node search.

FIG. 1 shows the structure of an FP-tree.
FIG. 2 shows the structure of a WP-tree according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of searching a WP-tree top-down using DFS according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example of searching a WP-tree top-down using BFS according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a method of generating a global tree by applying stream data to a WP-tree according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a method of inserting a sorted transaction into a WP-tree according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating a method of searching a global WP-tree using the BFS and DFS methods and a method of generating a conditional database according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating a pattern mining method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In assigning reference numerals to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the present invention, detailed descriptions of related well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention. Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, not that it excludes them, unless specifically stated otherwise.

The present invention proposes a pattern mining method in a pattern mining apparatus that performs pattern mining by searching a tree top-down in consideration of weights in a data stream.

In the present invention, weighted patterns are mined from a data stream of real data rather than from a static database.

The present invention proposes the WP-tree (Weighed Pattern-tree), a storage structure that improves performance by storing the transactions of a database in the structure instead of reading them repeatedly. The existing Apriori technique and the FP-tree structure also offer efficient storage, but finding weighted patterns in continuously arriving stream data requires considering both the frequency (support) and the weight of candidate patterns at the same time.

The present invention proposes a tree that keeps an efficient storage form while scanning the incoming stream data only once. The WP-tree proposed in the present invention is a tree form specialized for item weights.

A characteristic of stream mining is that, because the database is scanned only once, the entire database is stored in the tree in compressed form, without pruning, when the global tree is built. The structure of the tree therefore never changes, so only the advantages of an array structure come into play. In the present invention, the array is a dynamic array sized with reference to the length of the incoming transactions, which reduces memory waste.

FIG. 1 shows the structure of an FP-tree, and FIG. 2 shows the structure of a WP-tree according to an embodiment of the present invention.

FIG. 1 shows the header table (110) and the tree structure (120) of the FP-tree.

Referring to FIG. 2, the WP-tree, which uses an array structure, includes the entire structure of the general FP-tree of FIG. 1: a header table (210) sits beside the tree structure (220), and the header table (210) holds the frequency (support) and a tree link number for every item. The tree link number stores the hash number of the tree node so that the node can be accessed directly.

FIG. 8 is a flowchart illustrating a pattern mining method according to an embodiment of the present invention.

Referring to FIG. 8, the pattern mining apparatus generates a first transaction in which the data stream is sorted in descending order of weight (S801).

The pattern mining apparatus then inserts the first transaction into the WP-tree (Weighed Pattern-tree) (S803). In an embodiment of the present invention, in step S803 the first transaction is inserted into a WP-tree whose data structure is an array, and the structure may be organized as the relationship between the header table of the first transaction and the WP-tree.

Next, the pattern mining apparatus searches the WP-tree in a top-down manner (S805).

Step S805 may be a top-down search from the root to the leaves, using a dynamic programming method, of the WP-tree built by applying the stream data and the weights.

In an embodiment of the present invention, the top-down search from the root to the leaves may use the BFS (Breadth First Search) method or the DFS (Depth First Search) method. The BFS and DFS methods are described in detail below.

Next, the pattern mining apparatus generates a second transaction using the searched path (S807).

The pattern mining apparatus generates a conditional database by using the second transaction as a database (S809).

Next, the pattern mining apparatus performs a projection operation using the conditional database (S811). In the present invention, the projection operation may include generating a local tree from the WP-tree using the conditional database, and extracting patterns when the local tree becomes a single path.

FIG. 5 is a flowchart illustrating a method of generating a global tree by applying stream data to a WP-tree according to an embodiment of the present invention.

Referring to FIG. 5, a one-scan tree scans the database only once and builds the header and the tree at the same time. There are two ways to build a tree in one scan. The first reads a file containing the item weights to fix the order, then reads the database transactions one by one and inserts them into the tree in that order. The second, when no order has been fixed, builds the header table and the tree in alphabetical or item-name order as each transaction arrives.

In data mining that considers weights, items can be ordered in weight-ascending order or in weight-descending order. The present invention proposes a technique for designing a weight-descending tree.

FIG. 5 concerns the method of generating the global tree. Mining is performed on a tree structured in descending order of weight, in which the items are arranged so that the weight decreases with distance from the root.

A weight table is generated by reading the item weights (S510). The tree follows the FP-tree method as is; although the items are in weight order, by the anti-monotone property the frequency of a child is never greater than the frequency of its parent.

A header table for the WP-tree is generated (S520).

The transaction database (S560) is scanned and read (S530), and the items within each transaction of the transaction database are sorted in descending order of weight (S540).

The sorted items of the transaction are inserted one after another into the global WP-tree along the path from the root of the tree to the corresponding node (S550).

It is then checked whether the transaction is the last one; if it is, the global WP-tree is complete (S570, S580). Otherwise, the process returns to step S530.

FIG. 6 is a flowchart illustrating a method of inserting a sorted transaction into the WP-tree according to an embodiment of the present invention.

Referring to FIG. 6, for the i-th item of the sorted transaction (S610), the corresponding item is looked up in the header table (S620).

The support of this item in the header table is incremented (S630), and the children of the current node in the tree are checked for a corresponding node (S635). If none of the children holds the i-th item and a new node must be inserted, a new node is created at the very end of the WP-tree (S640), and the frequency of the item in that node is incremented (S650).

If one of the children holds the i-th item and the existing node is used, the item frequency in that node is incremented by 1 (S660). In this way, the data of the original transaction database is stored in the tree in compressed form. Finally, the number of the current node is added to the link nodes of the corresponding item in the header table (S670). The next item of the transaction is then read and the same work is repeated until the end of the transaction (S680). Here, given a conditional pattern P and a pattern Q generated from the conditional database of P, the frequency (support) of pattern P is greater than or equal to the frequency of pattern Q generated from the conditional database.

The WP-tree traversal of the present invention is an improved scheme that incorporates the FP-growth algorithm. The conventional tree search is bottom-up: it follows the node links and searches from the leaves of the tree to the root. Searching in this way causes some nodes to be searched repeatedly. To solve this problem of duplicate searching, the present invention uses a top-down method that searches a tree built in descending order of weight from the root to the leaves. Depending on how the tree is traversed, the method is either DFS (Depth First Search) or BFS (Breadth First Search). The search is implemented without recursion, using an iterator and applying dynamic programming.

Dynamic programming is a programming approach that minimizes unnecessary work, such as revisiting nodes already visited, which is a drawback of recursive algorithms; it requires a small amount of auxiliary storage. It prevents repeated searching by storing information about the nodes already searched and the nodes still to be searched.

FIG. 3 is a diagram showing an example of searching a WP-tree top-down using DFS according to an embodiment of the present invention.

FIG. 3 shows the DFS method implemented with dynamic programming. The search starts at the root and examines the child nodes. It determines which child is to be searched next and visits it if it has not been visited. At this point, information about the second child to visit among the current node's children is pushed onto the stack, so that the next place to visit is stored in advance for when the DFS reaches a leaf. The item of each visited node is recorded in a temporary storage array.

When a leaf is reached, the items in the temporary storage array, which records the current path, form a transaction, and the conditional database is built from such transactions. In addition, if the sum of the frequencies of the child nodes is smaller than the frequency of the current node, the same work as for a childless node is performed, and the frequency of the resulting transaction is the current node's frequency minus the sum of the child nodes' frequencies. For example, if the current node has frequency 7 and its children's frequencies sum to 4, the difference is 7 - 4 = 3, so the path to the current node enters the conditional database as a transaction with frequency 3.
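A minimal Python sketch of this non-recursive DFS is given below. It is an illustration under stated assumptions: the `Node` class and the example tree are hypothetical, and a leaf is treated as the special case in which the sum of the children's frequencies is zero, so that a single rule (emit the path whenever the node's frequency exceeds that of its children combined) covers both cases described above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    item: str
    count: int
    children: List["Node"] = field(default_factory=list)


def dfs_path_transactions(start: Node) -> List[Tuple[List[str], int]]:
    """Walk the subtree below `start` depth-first with an explicit stack.

    Returns (path, frequency) pairs; the item of `start` itself is not
    part of the paths.
    """
    out: List[Tuple[List[str], int]] = []
    # Each stack entry remembers where to go next and the path taken so far.
    stack = [(child, [child.item]) for child in reversed(start.children)]
    while stack:
        node, path = stack.pop()
        residual = node.count - sum(c.count for c in node.children)
        if residual > 0:  # leaf, or transactions that end at this node
            out.append((path, residual))
        for child in reversed(node.children):
            stack.append((child, path + [child.item]))
    return out


# Hypothetical tree reproducing the 7 - 4 = 3 example: node a has
# frequency 7 and children b (3) and c (1), whose frequencies sum to 4.
d = Node("d", 3)
b = Node("b", 3, [d])
c = Node("c", 1)
a = Node("a", 7, [b, c])
root = Node("<root>", 0, [a])
transactions = dfs_path_transactions(root)
```

Every node is pushed and popped exactly once, so no node is visited twice; the path ending at node a is emitted with frequency 3, as in the example above.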

Given a frequent pattern Pn (p1 p2 ... pn), the Pn-conditional database is derived from the Pn-1 (p1 p2 ... pn-1)-conditional database by collecting the transactions that contain item pn and pruning the items that satisfy any of the following conditions.

Three kinds of items are pruned: items that are not weighted frequent; weighted frequent items that come after item pn; and, finally, item pi itself. The pattern Pn (p1 p2 ... pn) is called the conditional pattern or prefix, and the set of transactions containing the conditional pattern is called the Pn-conditional database or the Pn-projected database.
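The derivation can be sketched in Python as follows, reading the three pruning conditions literally. The function name, its parameters, and the toy data are hypothetical; in particular, the set of weighted frequent items is supplied by the caller because this passage does not define how it is computed, and the item removed by the third condition is taken to be pn, which is an assumption.

```python
from typing import List, Sequence, Set


def conditional_database(db: Sequence[Sequence[str]],
                         pn: str,
                         order: Sequence[str],
                         weighted_frequent: Set[str]) -> List[List[str]]:
    """Collect the transactions containing `pn` and prune their items.

    `order` is the global item order; `weighted_frequent` is assumed to be
    computed elsewhere.
    """
    rank = {item: position for position, item in enumerate(order)}
    out: List[List[str]] = []
    for transaction in db:
        if pn not in transaction:
            continue
        kept = [item for item in transaction
                if item in weighted_frequent   # drop non weighted frequent items
                and rank[item] < rank[pn]]     # drop pn itself and items after pn
        if kept:
            out.append(kept)
    return out


# Hypothetical data, not taken from the specification.
order = ["a", "b", "c", "d"]
db = [["a", "b", "c"], ["a", "c", "d"], ["b", "d"], ["a", "b", "c", "d"]]
cond_c = conditional_database(db, "c", order, weighted_frequent={"a", "b", "c"})
```

In this example the transaction without item c is skipped, and item d is removed from the others both because it is not weighted frequent and because it comes after c in the order.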

FIG. 4 is a diagram showing an example of searching a WP-tree top-down using BFS according to an embodiment of the present invention.

FIG. 4 shows the BFS search, which, like DFS, starts from the root. The addresses of the visited node's children and the path searched so far are placed in a queue. An address is popped from the queue and the node at that address is visited.

If the visited node has children, the child nodes are inserted into the queue together with the path. The next address is then taken from the queue and the node at that address is visited.

Repeating in this way, when a leaf node is reached, the items of the path stored so far, as retrieved from the queue, become one transaction and make up the conditional database.

Infrequent items are then removed, and the remainder is inserted into a new header and tree. Also, when child nodes are stored in the queue, the frequencies of the current node and its child nodes are checked and handled exactly as described for the DFS method.
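The BFS variant can be sketched in Python with a queue in place of the stack. The `Node` class and the example tree are hypothetical, and, as an assumption, a path is emitted whenever a node's frequency exceeds the combined frequency of its children, which treats a leaf as the case where that combined frequency is zero.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple


@dataclass
class Node:
    item: str
    count: int
    children: List["Node"] = field(default_factory=list)


def bfs_path_transactions(start: Node) -> List[Tuple[List[str], int]]:
    """Walk the subtree below `start` level by level.

    The queue stores each pending node together with the path that leads
    to it, so no node has to be reached twice.
    """
    out: List[Tuple[List[str], int]] = []
    queue: Deque[Tuple[Node, List[str]]] = deque(
        (child, [child.item]) for child in start.children)
    while queue:
        node, path = queue.popleft()
        residual = node.count - sum(c.count for c in node.children)
        if residual > 0:  # leaf, or transactions that end at this node
            out.append((path, residual))
        for child in node.children:
            queue.append((child, path + [child.item]))
    return out


# Hypothetical tree: a (7) has children b (3) and c (1); b has child d (3).
d = Node("d", 3)
b = Node("b", 3, [d])
c = Node("c", 1)
a = Node("a", 7, [b, c])
root = Node("<root>", 0, [a])
transactions = bfs_path_transactions(root)
```

The same path transactions are produced as with a depth-first walk of this tree; only the order in which they are emitted differs.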

도 7은 본 발명의 일 실시예에 따른 전역 WP-트리에서 BFS와 DFS방식을 이용하여 탐색하는 방법과 조건적 데이터베이스를 생성하는 방법을 보여주는 흐름도이다. 도 7은 BFS와 DFS의 탐색방법을 이용하여 WP-트리를 탐색하는 방법이다. 7 is a flowchart illustrating a method of searching using a BFS and a DFS scheme in a global WP-tree according to an embodiment of the present invention, and a method of generating a conditional database. 7 is a method of searching a WP-tree using a search method of BFS and DFS.

도 7을 참조하면, 가중치 내림차순이기 때문에 헤더 테이블에서 가장 먼저 나오는 아이템 순서대로 선택한다(S710). Referring to FIG. 7, since the weights are in descending order, the selection is made in order of the first item in the header table (S710).

현재 아이템의 링크노드를 탐색한다(S720). The link node of the current item is searched for (S720).

다음, BFS와 DFS방식중 어느 것을 사용할지 선택한다(S725). Next, select whether to use BFS or DFS (S725).

BFS를 선택한 경우에는 탐색하면서 자식노드를 모두 큐에 저장한다(S730) 큐에 저장된 노드를 탐색하여 생긴 경로를 트랜잭션으로 변환한다(S740, S770). When BFS is selected, all child nodes are stored in the queue while searching (S730). Paths generated by searching for nodes stored in the queue are converted into transactions (S740 and S770).

When DFS is selected, the node to be visited next is stored on a stack (S750). When a node with no children is reached during the search, the search continues from a node stored on the stack (S760). Each searched path is converted into a transaction (S770).

The transactions produced by BFS or DFS are collected into a database, which forms the conditional database (S780).
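As a rough sketch of steps S720 through S780, the two traversal modes can share one loop that differs only in which end of the pending container is read. The `Node` class, the dictionary-based `header_table` (item mapped to its link nodes), and the function names are hypothetical, and the sketch assumes that the conditional database for an item is built from the paths below that item's link nodes.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical tree node holding an item label and its children.
    item: str
    children: list = field(default_factory=list)


def paths_below(link_node, method):
    """Return the item paths from below `link_node` down to each leaf."""
    if method not in ("bfs", "dfs"):
        raise ValueError("method must be 'bfs' or 'dfs'")
    pending = deque((child, [child.item]) for child in link_node.children)
    paths = []
    while pending:
        # BFS reads the oldest entry (queue); DFS reads the newest (stack).
        node, path = pending.popleft() if method == "bfs" else pending.pop()
        if not node.children:
            paths.append(path)  # the searched path becomes a transaction
        for child in node.children:
            pending.append((child, path + [child.item]))
    return paths


def conditional_database(header_table, item, method="bfs"):
    """Gather the paths below every link node of `item` into one database."""
    database = []
    for link_node in header_table[item]:
        database.extend(paths_below(link_node, method))
    return database


# Usage: item "a" has a single link node with two branches below it.
a1 = Node("a", [Node("b", [Node("c"), Node("d")]), Node("c")])
header_table = {"a": [a1]}
print(conditional_database(header_table, "a", method="bfs"))
print(conditional_database(header_table, "a", method="dfs"))
```

Both modes yield the same set of transactions; only the order in which the paths are discovered differs.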

Projection is performed using the conditional database and is repeated until a single path is obtained. For a single path, all subsets of the single-path items are generated and combined with the prefixes produced during projection; an average weight is computed for each combination and used to obtain its weighted frequency. The weighted frequency is compared with a threshold, and patterns that exceed the threshold are output.
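The single-path step might look like the sketch below. The text states only that an average weight is computed per combination to obtain a weighted frequency; this example assumes the common formulation in which the weighted frequency is the average item weight multiplied by the pattern's support, and that the support of a subset on a single path is the smallest count among its items. The function name, the argument layout, and the sample weights are illustrative.

```python
from itertools import combinations


def single_path_patterns(prefix, single_path, weights, threshold):
    """Enumerate prefix + subset patterns whose weighted frequency exceeds
    `threshold`.

    prefix      -- list of items accumulated during projection
    single_path -- list of (item, count) pairs along the single path
    weights     -- dict mapping each item to its weight
    """
    patterns = []
    for size in range(1, len(single_path) + 1):
        for subset in combinations(single_path, size):
            items = list(prefix) + [item for item, _ in subset]
            # Assumption: support is the smallest count in the subset.
            support = min(count for _, count in subset)
            average_weight = sum(weights[i] for i in items) / len(items)
            # Assumption: weighted frequency = average weight * support.
            weighted_frequency = average_weight * support
            if weighted_frequency > threshold:
                patterns.append((items, weighted_frequency))
    return patterns


# Usage with made-up weights and counts.
weights = {"a": 0.75, "b": 0.5, "c": 0.25}
result = single_path_patterns(["a"], [("b", 3), ("c", 2)], weights, threshold=1.5)
print(result)
```

In this example only the pattern `a, b` passes: its average weight is 0.625 and its support is 3, giving 1.875, while the two patterns containing `c` each reach only 1.0.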

Meanwhile, the pattern mining method according to an embodiment of the present invention can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system.

Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, hard disks, floppy disks, removable storage devices, nonvolatile memory (flash memory), and optical data storage devices, and also include implementations in the form of a carrier wave (for example, transmission over the Internet).

In addition, the computer-readable recording medium may be distributed across computer systems connected by a computer network, so that the code is stored and executed in a distributed manner.

While the present invention has been described using several preferred embodiments, these embodiments are illustrative and not restrictive. Those of ordinary skill in the art to which the invention pertains will understand that various changes and modifications may be made without departing from the spirit of the invention and the scope of rights set forth in the appended claims.

210: header table, 220: tree structure

Claims

A pattern mining method performed by a pattern mining apparatus that performs pattern mining by searching a tree top-down in consideration of weights in a data stream, the method comprising:
generating, by the pattern mining apparatus, a first transaction in which the data stream is sorted in descending order of weight;
inserting, by the pattern mining apparatus, the first transaction into a WP-tree;
searching, by the pattern mining apparatus, the WP-tree top-down;
converting, by the pattern mining apparatus, the searched path into a second transaction;
generating, by the pattern mining apparatus, a conditional database using the second transaction as a database; and
performing, by the pattern mining apparatus, projection using the conditional database.

The method of claim 1,
wherein inserting the first transaction into the WP-tree comprises
inserting the first transaction into a WP-tree having an array as its data structure, wherein the WP-tree after insertion comprises a relationship between a header table of the first transaction and the WP-tree.

The method of claim 1,
wherein searching the WP-tree top-down
is a top-down search from the root to the leaves, using a dynamic programming method, in a WP-tree constructed by applying the stream data and the weights.

The method of claim 3,
wherein the top-down search from the root to the leaves is performed using a breadth-first search (BFS) method.

The method of claim 3,
wherein the top-down search from the root to the leaves is performed using a depth-first search (DFS) method.

The method of claim 1,
wherein performing the projection comprises:
creating a local tree in the WP-tree using the conditional database; and
extracting patterns when the local tree is a single path.

A computer-readable recording medium having recorded thereon a program capable of executing the method of any one of claims 1 to 6.