KR20040026178A - Data Mining Method and Data Mining System

Info

Publication number: KR20040026178A
Application number: KR1020020057539A
Authority: KR
Inventors: 이원석
Original assignee: 이원석
Priority date: 2002-09-23
Filing date: 2002-09-23
Publication date: 2004-03-30
Also published as: KR100913027B1

Abstract

PURPOSE: A data mining system and method are provided that quickly obtain frequent itemsets or frequent sequential patterns by using an open data mining method to analyze an unbounded data set made up of transactions generated continuously over time. CONSTITUTION: A data server module (10) manages, in a database system or a file system, the transaction data generated continuously by an application domain. A data mining kernel (20), which analyzes the data set to find the mining result, comprises a transaction data read module (21) that reads data from the data server module and a data mining module (22) that finds the mining result by analyzing the frequency information of the items appearing in the transactions. A mining result presentation module (30) presents the result obtained from the data mining kernel to the user. The mining result is stored in a storage (40), and the stored result is provided to the user through the mining result presentation module.

Description

Data Mining Method and Data Mining System

The present invention relates to a data mining method that analyzes a large data set to find characteristic information, and to a data mining system implemented using that method. More specifically, it relates to a data mining method that can analyze, in real time and at every point at which the data set changes, the correlations among the data in an unbounded data set, that is, a data set in which new data is generated continuously over time and which keeps absorbing the newly generated data. It covers a method for finding frequent itemsets and frequent sequential patterns in real time, and a data mining system implemented by applying that method.

In general, in a data set subject to data mining, every unit of information that appears in the application domain is defined as a unit item (item), and a group of unit items that have semantic concurrency in the application domain (that is, that semantically occur together) is defined as a transaction. A transaction holds the unit items that occur together, and the data set to be analyzed is defined as the set of transactions generated in the application domain.

Conventional data mining techniques are designed to work effectively when the target data set is fixed before mining begins and basic preprocessing of that data set is possible. That is, conventional methods define a fixed data set at the time of analysis and aim to extract the knowledge present in that fixed data set. Because the data set is fixed, these techniques can extract only the knowledge generated during a particular period. However, the knowledge embedded in newly generated transactions can change over time, and even a mining result for a recent data set soon loses validity if new data keeps arriving. To obtain an accurate result that covers the newly generated data, mining must be performed again over the entire data set, comprising the previous mining target and all transactions generated since. In general, the larger the data set, the more time and computing power conventional mining requires, so results cannot be provided in real time.

Some earlier studies and inventions call a bundle of several units of information an item set and a single unit of information an item, and so speak of frequent item sets. This invention uses the shorter term itemset for a bundle of units of information and calls an individual unit of information a unit item. Section 2.1, which defines the mining target data set, explains this in detail.

Data mining techniques are broadly classified into frequent itemset mining, frequent sequential pattern mining, and clustering. The collection of itemsets that appear together in more transactions than a predefined support threshold is defined as the set of frequent itemsets, and frequent itemset mining is the task of finding this set in a data set. Here, support is the relative frequency of an itemset in the entire data set, generally defined as the ratio of the number of transactions in which the itemset appears to the total number of transactions in the data set.

In the prior art, frequent itemset mining follows one of two approaches. The first, which is the basic and most widely used, analyzes all transactions of the data set together; the second analyzes the transactions sequentially, one at a time. The representative method of the first kind is Apriori, proposed in "R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 207-216, May 1993." and "R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th Int'l Conference on Very Large Databases, Sept. 1994." Apriori scans the data set multiple times: it first finds the frequent itemsets consisting of a single unit item, then uses that result to find the frequent itemsets consisting of two unit items in the same way. The process repeats, increasing the itemset size by one each time, until no more frequent itemsets exist. Other methods, such as partitioning the data set before scanning it, have also been proposed, but once memory usage during processing is taken into account they too require one or more scans of the data set.
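The support definition and the level-wise search described above can be sketched as follows. This is a minimal illustration, not the implementation from the cited papers: the function name `apriori`, the use of `frozenset` for itemsets, and the choice of `>=` for the threshold comparison are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    Hypothetical helper for illustration; support is the fraction of
    transactions that contain every unit item of the itemset.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every unit item seen in the data set.
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    frequent = {}
    k = 1
    while level:
        # One scan of the data set counts all candidates of size k.
        counts = dict.fromkeys(level, 0)
        for t in transactions:
            for c in level:
                if c <= t:
                    counts[c] += 1
        current = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(current)
        # Build candidates of size k + 1 and keep only those whose
        # k-subsets are all frequent.
        k += 1
        joined = {a | b for a in current for b in current if len(a | b) == k}
        level = [c for c in joined
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent
```

Each pass of the `while` loop is one scan of the data set, which is why the number of scans grows with the size of the largest frequent itemset.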

A method that considers the transactions of the data set one at a time is Carma, proposed in "C. Hidber. Online Association Rule Mining. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 145-156, May 1999." Carma scans the data set at most twice. In the first scan it identifies the itemsets of interest and maintains their occurrence counts in memory using a lattice structure. An itemset can become an itemset of interest only if information on all of its subsets is already maintained in memory. In general, an itemset can be frequent only if all of its subsets are frequent, so only itemsets whose subsets are all maintained in memory are regarded as candidates for becoming frequent. Methods that maintain occurrence counts for every itemset use rapidly increasing amounts of memory as the data set grows. By maintaining only the itemsets expected to become frequent, the number of itemsets tracked, and thus the memory needed during mining, can be reduced. In the second scan, itemsets that cannot be frequent are pruned and the exact support of the frequent itemsets is determined.
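The admission condition described above, under which an itemset may be tracked only when all of its subsets are already tracked, can be sketched as below. This shows only that one condition and is not the Carma algorithm itself; the function name `can_start_tracking` and the representation of the lattice as a plain set of `frozenset` objects are assumptions for illustration.

```python
from itertools import combinations

def can_start_tracking(itemset, tracked):
    """Return True if every non-empty proper subset of `itemset` is in `tracked`.

    `tracked` is a hypothetical stand-in for the in-memory lattice: a set of
    frozensets whose occurrence counts are currently maintained.
    """
    itemset = frozenset(itemset)
    for size in range(1, len(itemset)):
        for subset in combinations(itemset, size):
            if frozenset(subset) not in tracked:
                return False
    # A single unit item has no non-empty proper subsets, so it is always admissible.
    return True
```

With `tracked = {frozenset("a"), frozenset("b")}`, the itemset `{"a", "b"}` may be admitted, while `{"a", "c"}` may not because `{"c"}` is not yet tracked.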

Existing frequent itemset mining methods, including the two approaches above, require every transaction in the data set to be fixed before mining starts and require at least one scan of the data set. They therefore cannot analyze, in real time, how the mining result changes when new transactions are added.

Frequent sequential pattern mining extracts frequently occurring ordering information about frequent itemsets from a data set whose transactions record the order in which itemsets occur. It analyzes the order in which different itemsets occur together and finds the sequential patterns whose occurrence frequency exceeds a given threshold. In other words, it measures how often related itemsets appear in the same order across the transactions of the data set.

Sequential pattern mining therefore first performs the frequent itemset mining described above and then adds an analysis of the ordering relationships among the frequent itemsets found. A method for mining sequential patterns of frequent itemsets was proposed in "R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of the Int'l Conference on Data Engineering, pages 3-14, Mar. 1995." It consists of a frequent itemset mining step over the data set and a sequential pattern search step that uses the information obtained. It also scans the data set multiple times, depending on the number of frequent itemsets in a pattern: finding a sequential pattern made up of n itemsets requires n scans of the data set. More recently, PrefixSpan was proposed in "J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In Proc. of the Int'l Conference on Data Engineering, pages 215-224, April 2001." as a way to shrink the data set examined in each repeated scan and to reduce the number of candidates generated at intermediate stages. To find sequential patterns of length 1, PrefixSpan finds the frequent 1-itemsets in the entire data set and, for each of them, extracts the patterns that can follow it into a separate data set, which is smaller than the entire data set. It then searches for the frequent itemsets that follow each 1-itemset in a sequential pattern. This process repeats until no longer sequential pattern is produced, extracting every sequential pattern in the data set that satisfies the given extraction condition.
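The prefix-projection idea described for PrefixSpan can be sketched as follows. This is a simplified illustration, not the published algorithm in full: it assumes each element of a sequence is a single unit item rather than an itemset, and the function name `prefixspan` and the `min_count` parameter (an absolute count rather than a ratio) are assumptions for the example.

```python
def prefixspan(sequences, min_count):
    """Return {pattern: count} for sequential patterns found in >= min_count sequences.

    Simplified sketch: every sequence element is one unit item.
    """
    results = {}

    def grow(prefix, projected):
        # Count, per item, how many projected suffixes contain it.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, n in sorted(counts.items()):
            if n < min_count:
                continue
            pattern = prefix + (item,)
            results[pattern] = n
            # Projected data set: what follows the first occurrence of `item`.
            # It is never larger than the data set it was projected from.
            next_db = [s[s.index(item) + 1:] for s in projected if item in s]
            grow(pattern, next_db)

    grow((), [list(s) for s in sequences])
    return results
```

For the sequences `"abc"`, `"acb"`, `"ab"`, and `"bc"` with `min_count=3`, the only pattern longer than one item is `('a', 'b')`, since `b` follows `a` in three of the four sequences.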

[ References ]

R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 207-216, May 1993.

R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th Int'l Conference on Very Large Databases, Sept. 1994.

C. Hidber. Online Association Rule Mining. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 145-156, May 1999.

R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of the Int'l Conference on Data Engineering, pages 3-14, Mar. 1995.

S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 255-264, Tucson, AZ, May 1997.

M. J. Zaki. Generating Non-Redundant Association Rules. In Proc. of the 6th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, pages 34-43, Boston, MA, Sept. 2000.

J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In Proc. of the Int'l Conference on Data Engineering, pages 215-224, Heidelberg, Germany, April 2001.

Because many earlier studies have shown data mining to be effective for information analysis, mining techniques are now used in a wide range of application domains to obtain needed knowledge by analyzing data sets. The prior art, however, has the following technical limitations.

◈ Limitations of the basic mining approach

- The prior art is designed to produce results efficiently when the target data set is defined before mining and basic statistical preprocessing of it is possible. In an environment where the unit items making up the data set can change and the data set grows continuously, however, the unit items and their transactions cannot be defined as a finite set, so basic statistical preprocessing of the data set is also impossible.

- Conventional mining systems aim to provide analysis information for a fixed data set. They therefore cannot show the user how the analysis information changes as new data is added to the data set.

- In conventional mining systems, the thresholds for the indicators used to judge the usefulness of mined itemsets, such as support, confidence, correlation, and interestingness, are generally fixed, and the systems offer no way to track how these values change as new data is added.

◈ Limits on reducing mining time and on real-time processing

- The prior art needs a great deal of processing time to obtain analysis results that include recently generated information from a continuously growing data set. When new transactions keep arriving and the data set is extended, the previous analysis result becomes historical and loses value as up-to-date information covering the entire data set. To obtain a result that includes the newly generated data, mining must be performed again over part or all of the previous data set together with all newly generated transactions. Mining time therefore grows long.

- The prior art is limited in its ability to produce mining results in real time. Real-time capability for a mining task means the ability to obtain analysis results within a short, bounded time. The prior art considers only the accuracy of the analysis of the target data set and so is limited in supporting fast processing. In particular, because each transaction in the data set must be read repeatedly, all earlier transactions must be stored separately when the data set keeps growing, and the time needed to obtain a result that includes the latest data lengthens, which limits real-time results. In other words, these techniques are designed to obtain the effect of an added transaction by analyzing the entire data set, so they cannot provide results reflecting newly added transactions in real time.

- In sequential pattern mining, the prior art is divided into a step that finds the set of frequent itemsets in the data set and a step that searches for sequential patterns based on the frequent itemsets found. As in frequent itemset mining, the data set is scanned two or more times, depending on the number of frequent itemsets making up a pattern, so the same problems described above apply. The prior art therefore takes a long time to scan the data set and does not support real-time processing.

◈ Poor differentiation of information importance and low adaptability to information change

- The prior art cannot differentiate the importance of the information changes that appear in newly generated transactions over time. That is, information generated long ago and information generated recently carry equal weight in the analysis. It therefore cannot distinguish information that occurred in the past but no longer occurs from information that never occurred in the past but has recently become frequent, so it cannot produce a mining result centered on the data generated most recently.

- The prior art adapts poorly to information change. It cannot analyze the rate of change of new information appearing in a continuously growing data set, or of information whose occurrence count rises or falls sharply within a given period. For a recent change to be reflected in the analysis result for the entire data set, the change must persist for a long time.

◈ Limitations in analyzing changes in usefulness indicators

- Because a single mining run takes so much processing time, the prior art is limited in analyzing how knowledge changes over time in a data set that is generated continuously. When it analyzes indicators such as support and confidence, it extracts the ratio of occurrences to the total number of defined transactions, so it analyzes only static information. Yet the same value can be interpreted differently depending on how it is changing over a time interval. The same support or confidence value can represent different knowledge depending on whether the value is rising or falling, but the prior art cannot distinguish the two cases. To analyze how knowledge changes over time, the prior art must run a separate, independent mining task for each time period and then compare the results.

To solve the above problems of the prior art and to meet changing application environments and requirements, the present invention aims to develop the following techniques for frequent itemset mining and sequential pattern mining, two representative data mining techniques.

◈ Improved basic structure and method for mining

- The invention proposes a basic structure for mining an unbounded data set in application domains where new transactions are generated continuously. Because continuously arriving transactions enlarge the data set, the prior art either takes too long to mine or needs ever more working memory. The invention devises a basic structure that overcomes these limits and supports real-time processing by letting the analyst freely define, relative to the current time, the range of the data set to analyze, and by continuously discarding data that falls outside that range.

◈ Shorter mining time and support for real-time processing

- The invention designs a mining method capable of real-time processing by reducing the processing time for the entire data set, including the newly added data. Real-time processing first requires reducing the data set to be analyzed. The entire data set can be defined flexibly at any given time, so that the number of transactions that must be analyzed is minimized while the quality of the analysis result for the range of transactions originally requested by the user is preserved as far as possible. The method thus minimizes mining time while maintaining result quality.

◈ Support for analyzing how usefulness indicators change over time

- The indicators of the usefulness of an itemset obtained by data mining change continuously as the data set grows with additional transactions. The invention devises a method for efficiently determining the usefulness indicators of mined itemsets in such an environment.

- The invention also devises a method for analyzing how the usefulness indicators, which change whenever the data set changes, evolve over time. By analyzing the rate of change and the acceleration of a usefulness indicator, the method provides additional knowledge about the mining result.

- In a mining environment where the data set keeps growing, the invention allows the usefulness constraints on the mining result set to be changed dynamically. That is, the minimum support, minimum confidence, minimum correlation, and minimum interestingness applied to each itemset in the result set can be changed dynamically based on the mining result at each point in time.
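The rate of change and acceleration of a usefulness indicator mentioned above can be approximated with finite differences. The sketch below is one simple way to do this and is not taken from the patent text: it assumes the indicator (for example, an occurrence count or a support value) is sampled at equally spaced points in time, and the function name `change_rates` is hypothetical.

```python
def change_rates(values):
    """Return (velocity, acceleration) lists for equally spaced samples.

    velocity[i]     = values[i + 1] - values[i]      (first difference)
    acceleration[i] = velocity[i + 1] - velocity[i]  (second difference)
    """
    velocity = [b - a for a, b in zip(values, values[1:])]
    acceleration = [b - a for a, b in zip(velocity, velocity[1:])]
    return velocity, acceleration
```

Two itemsets can have the same current indicator value while one has a positive velocity and the other a negative one, which is the distinction that a single static support value cannot express.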

◈ Differentiating information importance and improving adaptability to information change

- The invention presents a method for differentiating the importance of information within the analysis range according to when the information was generated. To differentiate importance by generation time, to improve adaptability to information change in a data set generated over time, and to detect changes in recent data efficiently, it devises an information differentiation method that assigns different weights according to generation time.

- For the new mining method, which supports real-time processing and time-based differentiation of importance, the invention devises a way to minimize the degree of differentiation (that is, to give all transactions equal importance regardless of when they occurred) so that the method yields the same result as the prior art. This shows that the invention subsumes the conventional methods.
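One way to weight transactions by when they occurred is exponential decay, sketched below. The passage above does not specify the weighting function, so the decay scheme, the class name `DecayedSupport`, and the `decay` parameter are all assumptions made for illustration. With `decay=1.0` every transaction has equal weight and the result matches the conventional support ratio.

```python
class DecayedSupport:
    """Maintain time-weighted support for a fixed set of watched itemsets.

    Hypothetical sketch: each new transaction multiplies all earlier
    weights by `decay` (0 < decay <= 1), so older transactions count less.
    """

    def __init__(self, watched, decay=1.0):
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.decay = decay
        self.total = 0.0  # decayed number of transactions seen so far
        self.counts = {frozenset(w): 0.0 for w in watched}

    def add(self, transaction):
        """Process one transaction; earlier transactions are not re-read."""
        t = frozenset(transaction)
        self.total = self.total * self.decay + 1.0
        for itemset in self.counts:
            hit = 1.0 if itemset <= t else 0.0
            self.counts[itemset] = self.counts[itemset] * self.decay + hit

    def support(self, itemset):
        return self.counts[frozenset(itemset)] / self.total
```

For the stream `{"a"}, {"a"}, {"b"}, {"b"}`, a decay of 1.0 gives both unit items a support of 0.5, while a decay of 0.5 gives the more recent item `b` the higher support.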

◈ Optimizing the elements required for mining and the result set

- The invention devises a method for reducing the memory required to mine a continuously growing data set. Unlike conventional methods, which maintain occurrence counts for every itemset that appears in a transaction, it estimates and maintains occurrence counts only for itemsets that are likely to become frequent, which reduces the number of itemsets tracked and therefore memory usage.

- When mining a continuously growing data set, minimizing the space and time needed can introduce error into the mining result. Conversely, keeping the result exact increases the working space and time required. The invention therefore devises an optimized method that can trade off result accuracy against the space and time needed for mining according to the user's requirements.

◈ 통합 시스템◈ Integrated system

- 시간 변화에 따라 지속적으로 증가되는 데이터 집합에 대한 빈발항목 및 빈발항목 순차패턴 탐색이 실시간으로 가능하도록 본 발명에서 고안한 마이닝 방법을 기반으로 하는 통합 데이터 마이닝 시스템을 고안한다.The present invention devises an integrated data mining system based on the mining method devised in the present invention so that frequent items and frequent item sequential patterns can be searched for a data set continuously increasing with time.

FIG. 1 is a workflow diagram of applying the open data mining method in a data mining environment whose analysis target is a continuously generated set of transactions.

FIG. 2 shows the approaches of the open data mining method that can be applied depending on the application domain.

FIG. 3 is a flowchart of the simple-accumulation frequent itemset search method.

FIG. 4 is a flowchart of the step of updating itemset occurrence counts and adding new itemsets in the simple-accumulation frequent itemset search method.

FIG. 5 is a flowchart of the count update and pruning step in the simple-accumulation frequent itemset search method.

FIG. 6 is a flowchart of the delayed insertion step in the simple-accumulation frequent itemset search method.

FIG. 7 is a flowchart of the step of determining whether an itemset can be added in the simple-accumulation frequent itemset search method.

FIG. 8 is a flowchart of the forced pruning step in the simple-accumulation frequent itemset search method.

FIG. 9 is a flowchart of the frequent itemset selection step in the simple-accumulation frequent itemset search method.

FIG. 10 is a flowchart of the simple-accumulation sequential pattern search method.

FIG. 11 is a curve of transaction weight against time of occurrence in the information differentiation method.

FIG. 12 is a flowchart of the frequent itemset search method based on the window method.

FIG. 13 is a flowchart of the step of updating occurrence counts and adding new itemsets in the window-based frequent itemset search method.

FIG. 14 is a flowchart of the transaction window update step in the window-based frequent itemset search method.

FIG. 15 is a flowchart of the frequent itemset search method based on the decay-rate method.

FIG. 16 is a flowchart of the count update step in the decay-rate frequent itemset search method.

FIG. 17 is a flowchart of the forced pruning step in the decay-rate frequent itemset search method.

FIG. 18 is a flowchart of the frequent itemset selection step in the decay-rate frequent itemset search method.

FIG. 19 is a flowchart of the sequential pattern search method based on the window method.

FIG. 20 is an example of differentiating the occurrence counts of itemsets according to the rate of change of their support.

FIG. 21 is a flowchart of the step of updating occurrence counts and adding new itemsets for the method that maintains mining result information separately.

FIG. 22 is a flowchart of the modified transaction read step for processing simultaneously occurring transactions at once.

FIG. 23 is a flowchart of the step of updating occurrence counts and adding new itemsets for analyzing how the usefulness measures change over time.

FIG. 24 is a flowchart of the frequent itemset selection step for analyzing how the usefulness measures change over time.

FIG. 25 is a flowchart of the simple-accumulation frequent itemset search method with transaction refinement through analysis of characteristic indicator values.

FIG. 26 is a flowchart of the transaction refinement step.

The present invention is a mining method and system that can extract, in real time, the result of frequent itemset and frequent sequential pattern search over a data set in which new transactions are continuously generated, with each newly added transaction included in the analysis. As shown in FIG. 1, a data mining system that analyzes a target data set to obtain a mining result can be divided into three main parts. First, the data server module (10) stores and manages the transaction data set continuously generated in the field where the data mining method and system of the present invention are applied; it manages the transaction data generated in the application domain using a database system or a file system. Second, the data mining kernel (20), which analyzes the data set to obtain the mining result, consists of a transaction data read module (21) that reads data from the data server module and a data mining module (22) that obtains the mining result by analyzing the occurrence counts of the itemsets appearing in the transactions. Third, the mining result presentation module (30) presents the results obtained by the data mining kernel to the user in various ways. The mining result obtained by the data mining kernel may also be kept in a separate storage device (40), and the stored result can likewise be provided to the user through the mining result presentation module (30). The mining method and system devised in the present invention concern the data mining kernel among the modules shown in FIG. 1 and consist of the following seven parts.

1. Basic Concepts of Open Data Mining

Conventional data mining methods, which analyze a finite data set that is clearly defined before mining begins, are severely limited when it comes to obtaining mining results for a data set that grows continuously over time. Conventional methods are designed so that the target data set is fixed before mining starts and only the knowledge embedded in that data set is extracted. In the present invention, a mining method that analyzes such a predefined, finite data set is called closed data mining. In contrast, the mining method devised in the present invention takes as its target a data set in which new transactions are continuously generated and obtains, in real time, the mining result for the portion of the data set that falls within an analysis range the analyst defines relative to the current time. The current time that anchors this range is always defined as the time at which the latest transaction occurred, so the analyst always obtains the most recent result. A data mining method whose analysis target is such a continuously generated data set is called open data mining.

Open data mining has the following characteristics. First, because the target data set is unbounded, the kinds and number of unit items that make up the data set can change continuously. Second, a result reflecting the data generated within the specified analysis range as of the current time must be extracted quickly. In an environment where transactions are generated continuously, obtaining up-to-date information requires a real-time mining capability that can produce results within the required time. Third, in transaction data generated continuously over time in an application domain, recent information is more significant than old information. A way is therefore needed to value recent information more highly than past information, and the range of past transactions included in the analysis must be flexibly adjustable. Fourth, the mining requirements of the application domain may change, and to meet the changed requirements the minimum support used to determine frequent itemsets must be dynamically adjustable. Fifth, when the target data set is large, a bounded error is allowed in the mining result set in order to reduce memory usage and processing time, so that an appropriate trade-off can be made between speed and memory requirements on one side and the accuracy of the result set on the other.

FIG. 2 illustrates the approaches that can be applied in an open data mining environment, classified by how strongly information is differentiated according to when each piece of data was generated: the simple accumulation method (200), which treats all information as equally important; the window method (300), which analyzes only a fixed range of recently generated information; and the decay-rate method (400), which differentiates data generated over time at a fixed rate.
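The three approaches can be read as three ways of weighting a transaction by its age. The following Python sketch is illustrative only: the function name, the choice to measure age in number of transactions, and the exponential form used for the decay-rate case are assumptions made for this example and are not taken from this passage.

```python
def transaction_weight(age, method, window_size=None, decay=None):
    """Weight of a transaction that occurred `age` transactions before the newest one.

    `age` is 0 for the newest transaction. All names and the exponential
    decay form are assumptions made for illustration.
    """
    if method == "accumulate":
        # Simple accumulation: every transaction counts equally.
        return 1.0
    if method == "window":
        # Window: only the most recent `window_size` transactions count.
        return 1.0 if age < window_size else 0.0
    if method == "decay":
        # Decay rate (assumed exponential): older transactions count less.
        return decay ** age
    raise ValueError(f"unknown method: {method}")
```

With these definitions, simple accumulation is the limiting case of the other two: a window that covers every transaction, or a decay factor of 1, gives every transaction a weight of 1.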

2. Open Data Mining by Simple Accumulation

In open data mining, the method that obtains the mining result by simply accumulating occurrence counts, on the assumption that information generated at different times is equally important, is called data mining by the simple accumulation method (200). Depending on the mining technique applied, it is divided into the simple-accumulation frequent itemset search method (210) and the simple-accumulation sequential pattern search method (220). The analysis target of the simple accumulation method is the entire data set containing all transactions generated up to the current time. Because the information generated at each point in time carries the same weight, everything from the first transaction up to the time at which the mining result is requested is analyzed.

2.1 Frequent Itemset Search

As in frequent itemset search for closed data mining, the method that finds frequent itemsets in open data mining by simple accumulation, treating the information in all transactions as equally important regardless of when each transaction occurred, is called the simple-accumulation frequent itemset search method (210). To manage occurrence counts for frequent itemset search in an open data mining environment, the present invention uses an adjacency lattice built on a prefix tree as its basic structure. Because mining with a prefix-tree-based adjacency lattice reads each transaction only once, the structure is well suited to open data mining environments that require real-time mining. Each itemset is represented in the lattice by a prefix path, and its occurrence count is stored at the last node of that path. Each node of the lattice therefore holds the unit item it represents on the prefix path and the occurrence count of the itemset formed by the unit items on the path from the root to that node. When a new transaction is added, the occurrence counts of the itemsets appearing in it are applied to the corresponding nodes of the lattice. In the present invention, the lattice used to manage the occurrence counts of itemsets appearing in transactions is called the monitoring lattice.
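The prefix-path representation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the structure claimed in the patent: the class and method names are hypothetical, and the lexicographic ordering of unit items along a path is an assumption.

```python
class Node:
    """One node of the prefix tree: a unit item plus the count of the itemset
    formed by the items on the path from the root to this node."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


class MonitoringLattice:
    """Hypothetical minimal prefix-tree store for itemset occurrence counts."""

    def __init__(self):
        self.root = Node()

    def find(self, itemset):
        """Return the node at the end of the itemset's prefix path, or None."""
        node = self.root
        for item in sorted(itemset):  # assumed fixed item order
            node = node.children.get(item)
            if node is None:
                return None
        return node

    def insert(self, itemset, count):
        """Create the prefix path for `itemset` and store `count` at its last node.

        Missing prefix nodes are created with a count of 0; in the scheme
        described in the text they would already exist before a longer
        itemset is inserted.
        """
        node = self.root
        for item in sorted(itemset):
            node = node.children.setdefault(item, Node(item))
        node.count = count
        return node
```

Because an itemset is always looked up along a single path in a fixed item order, `find({"b", "a"})` and `find({"a", "b"})` reach the same node.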

In open data mining, the data set for frequent itemset search is defined as follows.

The transaction identifier TID may be defined as the time at which the transaction occurred; by dividing time into sufficiently small units, it can be assumed that only one transaction occurs at any given time.

The basic concepts defined for conventional closed data mining are redefined for open data mining as shown in Table 1. Because the number of transactions is not finite and grows continuously over time, the mining result at each point in time is the result for the analysis target as of that time. Any basic concept of the conventional methods that is not defined in the table below has the same meaning as in conventional mining.

Table 1. Basic concepts in open data mining

FIG. 3 shows the procedure of the frequent itemset search method (210) in data mining by simple accumulation. The transaction read step (211) reads the transaction data newly generated in the application domain (100), and the basic mining information update step (212) updates the basic information required for mining, such as the total number of transactions and the thresholds for itemset insertion and pruning. Itemset insertion and pruning are described in detail in the corresponding steps. The count update and itemset addition step (213) analyzes the newly read transaction and either updates the occurrence counts of the itemsets currently managed in the monitoring lattice or decides whether a newly appearing itemset should be managed and, if so, adds it to the monitoring lattice. It is then determined whether forced pruning is needed (214); if so, the forced pruning step (215) prunes those managed itemsets that no longer need to be managed. The frequent itemset selection step (216) traverses the monitoring lattice and identifies as frequent every itemset whose support is at least the minimum support.
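The final selection step can be illustrated with a short Python sketch. For brevity it stands in for the monitoring lattice with a plain dictionary from itemsets to occurrence counts; that simplification and the function name are assumptions made for this example.

```python
def select_frequent(counts, total, min_support):
    """Return the itemsets whose support (count / total) is at least min_support.

    `counts` maps frozenset itemsets to occurrence counts and stands in for
    the monitoring lattice (an illustrative simplification).
    """
    min_count = min_support * total
    return {itemset for itemset, count in counts.items() if count >= min_count}
```

Comparing counts against `min_support * total` rather than dividing each count by the total avoids one division per itemset and makes it explicit that the same support corresponds to a larger count as more transactions arrive.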

Because the simple accumulation method gives equal weight to the information generated at each point in time, when one new transaction is added the basic mining information update step obtains the total number of transactions at the current time, $|D|_k$, from the total at the previous time, $|D|_{k-1}$, as $|D|_k = |D|_{k-1} + 1$. In conventional mining the whole data set is fixed, so the insertion and pruning thresholds and the minimum support, all expressed as support values, always correspond to constant occurrence counts. In open data mining, however, new transactions can be added indefinitely, so the total number of transactions keeps growing and the occurrence count that corresponds to a given support keeps growing with it. The thresholds for adding an itemset to or removing it from the monitoring lattice, and the minimum occurrence count for determining frequent itemsets, must therefore be recomputed as the number of transactions increases. A value expressed as a support is converted to an occurrence count using the total number of transactions through the relation 'occurrence count' = 'support' × $|D|_k$. Using this conversion, the thresholds and the minimum support are converted into occurrence counts that reflect the total number of transactions at each point in time.
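The bookkeeping of the basic mining information update step might look like the following Python sketch. The class and field names are hypothetical; the only relations taken from the text are that the total grows by one per transaction and that an occurrence count equals the support multiplied by the current total.

```python
from dataclasses import dataclass


@dataclass
class MiningInfo:
    """Hypothetical container for the basic mining information of step 212."""

    min_support: float     # minimum support for frequent itemsets
    insert_support: float  # threshold for adding an itemset, as a support
    prune_support: float   # threshold for pruning an itemset, as a support
    total: int = 0         # number of transactions seen so far

    def add_transaction(self):
        # Simple accumulation: each new transaction raises the total by one.
        self.total += 1

    def min_count(self):
        # occurrence count = support x total number of transactions
        return self.min_support * self.total

    def insert_count(self):
        return self.insert_support * self.total

    def prune_count(self):
        return self.prune_support * self.total
```

Keeping the thresholds as supports and converting on demand means nothing has to be rewritten when the total changes; after 16 transactions a minimum support of 0.25 corresponds to 4 occurrences, and after 32 transactions to 8.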

Fundamentally, open data mining obtains the mining result at a given time by analyzing the itemsets that appeared in all transactions generated up to that time, so information about the itemsets in every transaction must be managed. Because the data set grows continuously, the amount of information to be monitored can increase sharply; this raises memory usage and lengthens the mining time, and if transactions keep arriving, the information the monitoring lattice must manage can grow beyond what even a high-performance system can handle. In open data mining, an itemset has very low support when it first appears in a transaction: the total number of transactions in a continuously growing data set is generally large, whereas the occurrence count of a first-time itemset is 1. Its support at first appearance is therefore far below the minimum support, and leaving it unmonitored at that point does not affect the frequent itemset determination. In the present invention, the occurrence count of an itemset is estimated, and the itemset is added to the monitoring lattice only when the estimate becomes large enough to affect the frequent itemset determination, which minimizes the number of itemsets managed during open data mining. This keeps the monitoring lattice from growing without bound and reduces the memory usage and running time of mining. The technique of adding only itemsets with sufficiently large occurrence counts to the monitoring lattice is called delayed insertion and is described in detail under the delayed insertion step (234) of FIG. 6. The threshold occurrence count used as the criterion is called the delayed-insertion threshold, and it is defined as a value no greater than the minimum support for the data set.

Conventional closed data mining performs pruning, which removes itemsets that are certain not to be frequent in the whole data set. In open data mining the data set grows continuously, so no itemset can be ruled out for certain. However, the occurrence count of an itemset that does not appear for some time stays the same while the total number of transactions keeps growing, so its support decreases. An itemset whose support has fallen below a given threshold over time can therefore be removed from the lattice. Removing such an itemset from the monitoring lattice is called itemset pruning; the detailed procedure is described under the count update and pruning step (233) of FIG. 5. The threshold occurrence count used as the criterion is called the pruning threshold. Pruning reduces the number of itemsets managed in the monitoring lattice and thus the memory used during mining. The pruning threshold is defined as a value no greater than the delayed-insertion threshold for the data set, because if it were set higher, itemsets that should be managed in the monitoring lattice could be pruned, or delayed insertion and pruning could alternate needlessly and repeatedly.
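Two points from the preceding paragraphs lend themselves to a small Python illustration: the support of an itemset that stops appearing shrinks as the total grows, and the three thresholds are ordered (pruning threshold ≤ delayed-insertion threshold ≤ minimum support). The function names below are hypothetical, and expressing all three thresholds as supports is an assumption made for the example.

```python
def support(count, total):
    """Support of an itemset: its occurrence count divided by the total."""
    return count / total


def check_thresholds(prune_support, insert_support, min_support):
    """Raise ValueError unless prune <= insert <= min, all within [0, 1].

    A pruning threshold above the insertion threshold would let an itemset
    be inserted and pruned again in alternation, as the text warns.
    """
    if not (0.0 <= prune_support <= insert_support <= min_support <= 1.0):
        raise ValueError(
            "expected 0 <= prune_support <= insert_support <= min_support <= 1"
        )
```

For instance, an itemset seen 5 times has support 0.05 after 100 transactions but only 0.005 after 1000 if it never appears again, which is what eventually makes it a candidate for pruning.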

As shown in FIG. 4, the count update and itemset addition step (213) begins with the itemset read step (231), which reads the itemsets of the newly added transaction one at a time. Each itemset read passes through a check of whether it exists in the monitoring lattice (232), and processing then follows one of two paths. If the itemset exists in the monitoring lattice, the count update and pruning step (233) is performed; if it does not, it goes through the delayed insertion step (234) and is added to the lattice if warranted. It is then determined whether any itemset of the new transaction remains unanalyzed (235), and if so the above process is repeated.

FIG. 5 shows the count update and pruning step (233). It consists of the count update step (241), which updates the occurrence count of the target itemset at the current time, $C_k$, from its count at the previous update, $C_{k-1}$, to reflect the new occurrence as $C_k = C_{k-1} + 1$, and the itemset pruning step (243), which removes the itemset from the monitoring lattice when the updated count is below the pruning threshold. In the simple accumulation method, information generated at every point in time is equally important, so 1 is added to the count accumulated so far. The increment is exactly 1 because the count update and pruning step (233) updates counts only for itemsets that appear in the new transaction. In the itemset pruning step (243), an itemset whose count is at or below the pruning threshold is removed from the monitoring lattice. A 1-itemset, however, is kept in the lattice even when its count is at or below the pruning threshold. Since a 1-itemset has no sub-itemsets, its count cannot be estimated from other itemsets, so its actual count must always be maintained; if a pruned 1-itemset were later re-added, the occurrences before pruning would be lost and its count would be in error. On the other hand, if a 1-itemset does not appear in any transaction for a long period, the total number of transactions grows large relative to its count and its support becomes very low. Under the basic pruning method the 1-itemset is still not pruned, but a 1-itemset whose count is very small relative to the total number of transactions may also be pruned from the lattice. For this purpose a separate threshold, called the 1-itemset pruning threshold, is defined, and a 1-itemset whose count falls below it is pruned from the monitoring lattice. The 1-itemset pruning threshold is smaller than the pruning threshold for n-itemsets (n ≥ 2) and is set to a support low enough to be negligible over the whole data set. If it is set too high, the accuracy of the information mined from the whole data set may suffer.
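The update-then-prune logic for a single monitored itemset can be sketched as follows. The sketch uses a dictionary from itemsets to counts in place of the monitoring lattice, and the function and parameter names are hypothetical; it also assumes that both pruning thresholds are given as supports and converted with the current total.

```python
def update_and_prune(counts, itemset, total, prune_support,
                     one_item_prune_support=None):
    """Add 1 to the count of a monitored itemset, then prune it if it is too rare.

    Returns True if the itemset is still monitored afterwards.
    `counts` (frozenset -> int) stands in for the monitoring lattice.
    """
    itemset = frozenset(itemset)
    counts[itemset] += 1  # the itemset appeared in the new transaction

    if len(itemset) == 1:
        # 1-itemsets are kept by default; they are pruned only when the
        # optional, lower 1-itemset pruning threshold is supplied.
        if (one_item_prune_support is not None
                and counts[itemset] < one_item_prune_support * total):
            del counts[itemset]
            return False
        return True

    # n-itemsets (n >= 2): prune when the count is at or below the threshold.
    if counts[itemset] <= prune_support * total:
        del counts[itemset]
        return False
    return True
```

A caller would invoke this only for itemsets that are already present in `counts`; itemsets that are not yet monitored go through delayed insertion instead.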

FIG. 6 shows the delayed insertion step (234), which adds to the monitoring lattice those items that appear in a new transaction but are not yet in the lattice. Depending on its length, an item is either inserted immediately or inserted after its occurrence count has been estimated. In the 1-item insertion step (251), every new 1-item that appears in the transaction is inserted immediately, without any estimation of its occurrence count; its occurrence count at first insertion is therefore 1. An n-item (n≥2), which consists of two or more unit items, is not inserted as soon as it appears. It can be inserted only when all of its sub-items are present in the monitoring lattice and its estimated occurrence count exceeds the delayed insertion threshold. The occurrence count of such an item is estimated from the occurrence counts of its sub-items in the insertability determination step (253). If the estimate exceeds the user-defined delayed insertion threshold (254), the item is inserted into the monitoring lattice with the estimate as its occurrence count. Delayed insertion keeps the number of items managed in the monitoring lattice small and thereby reduces the memory required during mining.

FIG. 7 shows the insertability determination step (253), which decides whether an item can be inserted. When an item's occurrence count is to be estimated, the monitoring lattice is first searched for the sub-items of that item to determine whether estimation is possible. If one or more sub-items are not present in the monitoring lattice, the occurrence count is not estimated: when some sub-item is not being managed in the lattice, the estimated count clearly cannot reach the delayed insertion threshold, so the item can be rejected without carrying out the estimation. If all sub-items are present, the occurrence count estimation step (253c) estimates the maximum and minimum possible occurrence counts. If the maximum is greater than or equal to the delayed insertion threshold, the item is inserted into the monitoring lattice and monitoring of it begins (253f). If some sub-item is missing from the lattice or the maximum estimate is less than the delayed insertion threshold, the item is judged not insertable (253e).
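The decision flow of FIG. 7 might look like the following Python sketch. It assumes a dictionary lattice keyed by `frozenset`, and it takes the maximum possible count to be the smallest count among the sub-items, which is the bound the description derives further on. The function name `can_insert` and its return shape are illustrative only.

```python
from itertools import combinations


def can_insert(item, lattice, insert_threshold):
    """Decide whether an n-item (n >= 2) may be inserted.

    Returns (insertable, estimated_max_count).
    """
    item = frozenset(item)
    if len(item) < 2:
        raise ValueError("1-items are inserted immediately, without estimation")
    # Step 1: every non-empty proper sub-item must already be monitored.
    sub_items = [frozenset(c)
                 for k in range(1, len(item))
                 for c in combinations(sorted(item), k)]
    if any(s not in lattice for s in sub_items):
        return False, 0  # estimation is skipped entirely
    # Step 2: assumed upper bound on the count: the smallest sub-item count.
    max_count = min(lattice[s] for s in sub_items)
    # Step 3: compare the maximum with the delayed insertion threshold.
    return max_count >= insert_threshold, max_count


monitored = {frozenset("a"): 8, frozenset("b"): 6, frozenset("c"): 2}
print(can_insert("ab", monitored, 5))  # (True, 6)
```

Rejecting early when a sub-item is missing avoids the estimation work for most newly seen combinations, which is the point of the check.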

Let e be the item whose occurrence count is to be estimated. Its maximum and minimum possible occurrence counts are estimated as follows. First, the set of sub-items of e, the set of its m-sub-items, and the set of occurrence counts of its m-sub-items are defined as follows.

Given two items e1 and e2, their merged item e1∪e2 and common item e1∩e2 are defined as follows.

ⅰ. The merged item e1∪e2 is the item consisting of all unit items that belong to e1 or e2.

ⅱ. The common item e1∩e2 is the item consisting of all unit items that belong to both e1 and e2.

For any item, each of its sub-items has an occurrence count at least as large as that of the item itself. For example, if all unit items of an item always appear together, the item has the same occurrence count as its sub-items. The occurrence count of an item therefore depends on how often its unit items appear together. Based on this observation, two extreme distributions that determine an item's occurrence count are defined. For any two items appearing in a data set of transactions, the case where the two items appear together in as many transactions as possible is called the minimum exclusive distribution, and the case where they appear as exclusively of each other as possible is called the maximum exclusive distribution. For example, in a data set of 10 transactions in which items e1 and e2 appear in 6 and 7 transactions respectively, the two items are in the minimum exclusive distribution if they appear as the merged item in every transaction where that is possible (that is, if the occurrence count of the merged item is 6). If the merged item appears in only 3 transactions, they are in the maximum exclusive distribution.

When two items e1 and e2 are in the minimum or maximum exclusive distribution, the occurrence count of the merged item e1∪e2 can be obtained as follows. Let D be a data set of transactions, and let |D| denote the total number of transactions in it. Let TS(e) denote the set of all transactions in D in which item e appears, and let C(e) denote the number of transactions in TS(e). By the definition of the merged item, the following holds for the two items e1 and e2.

TS(e1∪e2) = TS(e1) ∩ TS(e2)

If the two items are in the minimum exclusive distribution, either TS(e1)⊆TS(e2) or TS(e1)⊇TS(e2) holds, so TS(e1∪e2) is the same set as TS(e1) or TS(e2). C(e1∪e2) therefore equals the smaller of the two occurrence counts C(e1) and C(e2). That is, the following relation holds.

C(e1∪e2) = min({C(e1), C(e2)})

Here min(V) is a function that returns the smallest value in a set V of numbers. If the two items have a common item e1∩e2, the following relation holds.

TS(e1) ∪ TS(e2) ⊆ TS(e1∩e2)

Counting the elements of these sets gives the following.

C(e1∪e2) ≥ C(e1) + C(e2) − C(e1∩e2)

If e1∩e2 does not exist, the occurrence count of the merged item can be obtained from the total number of transactions in the data set. Since TS(e1) and TS(e2) are subsets of D, the following relation holds.

TS(e1) ∪ TS(e2) ⊆ D

Counting the elements of these sets gives the following.

C(e1∪e2) ≥ C(e1) + C(e2) − |D|

If the sum of C(e1) and C(e2) is less than |D|, the right-hand side is less than 0. Because C(e1∪e2) is an occurrence count, however, it must be 0 or more. When the two items are in the maximum exclusive distribution, the merged item takes the smallest value it can, which is obtained as follows.

C(e1∪e2) = max({0, C(e1) + C(e2) − C(e1∩e2)}) if e1∩e2 exists
C(e1∪e2) = max({0, C(e1) + C(e2) − |D|}) otherwise

Here max(V) is a function that returns the largest value in a set V of numbers.

When the two items are in the minimum exclusive distribution, the occurrence count of the merged item takes its maximum value; when they are in the maximum exclusive distribution, it takes its minimum value. The occurrence count of the merged item thus lies in the range bounded below by the count under the maximum exclusive distribution and above by the count under the minimum exclusive distribution.
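The two bounds can be checked against the 10-transaction example given earlier (counts 6 and 7). The sketch below assumes that the upper bound is the smaller of the two counts and that the lower bound is the sum of the counts minus either the count of the common item or, when there is no common item, the total number of transactions, clamped at 0. The function name `merged_count_bounds` is illustrative.

```python
def merged_count_bounds(c1, c2, total, c_common=None):
    """Return (lowest, highest) possible count of the merged item.

    c1, c2:   occurrence counts of the two items.
    total:    |D|, the number of transactions in the data set.
    c_common: count of the common item, or None when no common item exists.
    """
    highest = min(c1, c2)                # minimum exclusive distribution
    ceiling = total if c_common is None else c_common
    lowest = max(0, c1 + c2 - ceiling)   # maximum exclusive distribution
    return lowest, highest


print(merged_count_bounds(6, 7, 10))  # (3, 6)
```

The result matches the example: the merged item appears in at most 6 and at least 3 of the 10 transactions.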

As described above, the occurrence count of an item subject to delayed insertion can be obtained from the occurrence counts of its sub-items, and both the minimum and the maximum possible counts are obtained. In the remainder of this description, Cmin(e) and Cmax(e) denote the minimum and maximum possible occurrence counts of item e. The maximum possible occurrence count is computed on the assumption that all sub-items are in the minimum exclusive distribution, that is, that the sub-items appear together in as many transactions as possible. When two sub-items e1 and e2 of item e whose merged item is e itself are in the minimum exclusive distribution, the occurrence count of the merged item is min({C(e1), C(e2)}), as described above. When all sub-items of e are in the minimum exclusive distribution, Cmax(e) can therefore be defined as the smallest of the occurrence counts of all its sub-items. Among all sub-items, the longest ones have the smallest occurrence counts, so Cmax(e) can be obtained from the occurrence counts of the (n-1)-sub-items alone. That is, when the set of (n-1)-sub-items of item e is not empty, the maximum possible occurrence count is obtained as follows.

Cmax(e) = min({C(α) | α is an (n-1)-sub-item of e})

The minimum possible occurrence count Cmin(e) of item e can likewise be obtained from the occurrence counts of all sub-items, or from those of the (n-1)-sub-items alone, but it is computed for the case where the sub-items are in the maximum exclusive distribution. For each pair of distinct (n-1)-sub-items αi and αj, the minimum possible occurrence count of their merged item under the maximum exclusive distribution is obtained as described above. The largest of the minimum possible counts over all pairs is the minimum possible occurrence count of item e. That is, when the set of (n-1)-sub-items is not empty and the sub-items are in the maximum exclusive distribution, the minimum possible occurrence count is obtained as follows.

Cmin(e) = max({C(αi∪αj) | αi and αj are distinct (n-1)-sub-items of e})

where C(αi∪αj) is the occurrence count of the merged item under the maximum exclusive distribution.
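A possible way to compute both estimates from the (n-1)-sub-items is sketched below. It assumes that the maximum possible count is the smallest (n-1)-sub-item count, and that the minimum possible count is the largest pairwise lower bound, where each pair's bound is the sum of the two counts minus the count of their common item (or minus the total number of transactions when the pair shares nothing), never below 0. The dictionary layout and the name `estimate_bounds` are hypothetical.

```python
from itertools import combinations


def estimate_bounds(item, counts, total):
    """Return (c_min, c_max) for an n-item (n >= 2) from its (n-1)-sub-items.

    counts: dict frozenset -> occurrence count of monitored items (assumed layout).
    total:  |D|, the number of transactions seen so far.
    """
    item = frozenset(item)
    subs = [item - {unit} for unit in item]   # all (n-1)-sub-items
    if any(s not in counts for s in subs):
        return 0, 0                            # a missing sub-item forces the maximum to 0
    c_max = min(counts[s] for s in subs)
    c_min = 0
    for a, b in combinations(subs, 2):
        # The common item of the pair caps how far the two can spread apart;
        # with no (monitored) common item, fall back to the looser cap |D|.
        ceiling = counts.get(a & b, total)
        c_min = max(c_min, counts[a] + counts[b] - ceiling)
    return c_min, c_max


counts = {frozenset("a"): 6, frozenset("b"): 7, frozenset("c"): 5,
          frozenset("ab"): 5, frozenset("bc"): 4, frozenset("ac"): 4}
print(estimate_bounds("abc", counts, total=10))  # (3, 4)
```

For the 3-item {a, b, c} the pairwise lower bounds are 3, 2 and 3, so the minimum possible count is 3, while the smallest 2-sub-item count gives a maximum of 4.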

For an item to be inserted into the lattice, all of its sub-items must be present in the lattice. If one or more of its sub-items are missing, the maximum possible occurrence count of the item, being the minimum over the sub-item counts, evaluates to 0, so it is known without computing the occurrence count that the item will not be inserted. In other words, whether to insert an item needs to be decided only when all of its sub-items are in the lattice; if one or more sub-items are missing, the item is not inserted.

For item e, Cmax(e) is assigned as the item's actual occurrence count. Because Cmax(e) is an estimate, however, the occurrence count of each item may contain an estimation error. The difference between the maximum and minimum possible occurrence counts is called the estimation error ε(e) of the item and represents the largest error that the estimation can introduce. The error, expressed as an occurrence count, stays fixed at its value at the time of estimation, but the total number of transactions grows as new transactions are added, so the difference in support caused by the estimation error becomes very small. After some time, the occurrence count of an item inserted by delayed insertion therefore carries only a negligible error, and the mining result is reliable.
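The decay of the error in support terms can be illustrated numerically. The sketch assumes that support is the occurrence count divided by the total number of transactions, so a fixed count error ε(e) translates into a support error of ε(e)/|D|; `support_error` is a hypothetical helper name.

```python
def support_error(estimation_error, total_transactions):
    """Largest support error caused by a fixed error in the occurrence count."""
    return estimation_error / total_transactions


# The count error stays at 3 while the data set keeps growing.
for total in (100, 1_000, 10_000):
    print(total, support_error(3, total))
```

The count error never changes, but each tenfold growth in the number of transactions shrinks its effect on support by the same factor.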

After the occurrence counts of the items in the transaction have been updated and the newly appearing items have been inserted, frequent items are searched for on that basis. The count update and pruning operations are applied only to items that appear in the new transaction. An item that appeared long ago and has not appeared since therefore has very low support yet may remain in the monitoring lattice without being removed. Such removable, unnecessary items waste memory during mining. To minimize this waste, all items in the lattice can be scanned and every prunable item removed; this operation is called forced pruning. After forced pruning, the lattice holds only items at or above the pruning threshold, which minimizes memory usage. Forced pruning takes relatively long, however, because it must traverse the entire monitoring lattice, so it is performed periodically at fixed intervals or only when memory runs short. FIG. 8 shows the forced pruning step (215). One item is read from the lattice and its occurrence count is compared with the pruning threshold count (262). If the occurrence count is less than the pruning threshold, the item is removed from the lattice in the item removal step (263). As long as unvisited items remain in the lattice, forced pruning continues.
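Forced pruning amounts to a full sweep of the lattice, as in the following sketch. The dictionary lattice, count-based threshold and the name `force_prune` are assumptions for illustration. The `keep_one_items` flag carries over the rule from the item pruning step that 1-items are retained; the description of FIG. 8 does not say whether that rule also applies here, so the flag can be turned off.

```python
def force_prune(lattice, prune_threshold, keep_one_items=True):
    """Scan the whole lattice and drop every item whose count is below the threshold.

    lattice: dict mapping frozenset (item) -> occurrence count (assumed layout).
    Returns the number of items removed.
    """
    doomed = [item for item, count in lattice.items()
              if count < prune_threshold
              and not (keep_one_items and len(item) == 1)]
    for item in doomed:
        del lattice[item]
    return len(doomed)


lattice = {frozenset("a"): 1, frozenset("ab"): 2, frozenset("bc"): 9}
removed = force_prune(lattice, prune_threshold=5)
```

Because every entry is visited, the cost grows with the size of the lattice, which is why the sweep is run only periodically or under memory pressure rather than per transaction.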

In the frequent item selection step (216), described in FIG. 9, every item in the monitoring lattice is examined, and an item is judged frequent when its occurrence count is at or above the minimum support for frequent items. The step repeats while unexamined items remain in the lattice and ends once all items have been examined. This step is the same as frequent item selection in conventional lattice-based mining methods and, like forced pruning, traverses the whole lattice. Running it only when frequent items are actually requested, rather than every time a new transaction is added, therefore reduces execution time. To avoid the long execution time of traversing the entire monitoring lattice, a method that separately maintains the frequent items obtained from previous mining results can be considered. This is described in detail in Chapter 5.

The error in an occurrence count caused by delayed insertion shrinks to a negligible value as new transactions are added. The error introduced by estimation is a difference in occurrence count at the time the item is inserted into the lattice and always keeps the same value, but in terms of support it decays as the total number of transactions grows. If, however, a mining result is requested at a time when some items still have a large support difference due to estimation error, the result can differ considerably from that of mining without delayed insertion: the support of a particular item may be larger, or more frequent items may be reported. To compensate, the monitoring lattice can manage, along with the occurrence count of each item, the estimation error recorded at the time of delayed insertion. The current support of a frequent item and the size of its error can then be determined accurately, giving a more reliable mining result. In general, the higher the delayed insertion threshold, the fewer items are managed in the lattice, but the frequent items in the mining result may carry larger errors, because an item can be judged frequent before its estimation error has decayed to a negligible level.

Table 2. Changes to the procedure for managing error information

Change location | Without estimation error management | With estimation error management
Delayed item insertion (234), 1-item | Lattice node holds only the occurrence count | Lattice node holds the occurrence count and the estimation error; estimation error = 0
Delayed item insertion (234), n-item | Lattice node holds only the occurrence count | Lattice node holds the occurrence count and the estimation error; the estimation error is the difference between the maximum and minimum estimated occurrence counts
Item pruning (233, 215) | Only the occurrence count is considered | Only the occurrence count is considered
Frequent item selection (216) | Frequent if the occurrence count is at or above the minimum support | Frequent if the occurrence count is at or above the minimum support and the estimation error is less than the tolerance

To provide this error information to the analyst, each item in the monitoring lattice must hold its estimation error as well as its occurrence count. The delayed insertion of FIG. 6, including the occurrence count estimation, proceeds as before, except that the estimation error is stored in the monitoring lattice together with the occurrence count. For a 1-item the occurrence count is an actual value rather than an estimate, so its estimation error is 0. Pruning does not take the error information into account: the occurrence count of an item subject to pruning already reflects the maximum possible value from the estimation step for delayed insertion, so if the count is at or below the pruning threshold it remains so even when the error is considered. The pruning threshold is set by the user and must not exceed the minimum support. The higher the threshold, the more items are removed, but the accuracy of the mining result may decrease. Frequent item selection considers the estimation error as well as the occurrence count. Items whose occurrence count exceeds the minimum support are candidates for frequent items; for such an item, the estimation error, as decayed up to the current time, is the error the item can currently have. Selection of frequent items can therefore be restricted to items whose estimation error is at or below a user-defined threshold, called the tolerance. The tolerance is set by the user and must not exceed the minimum support. Setting the tolerance equal to the minimum support gives the same result as selecting frequent items by occurrence count alone. Table 2 summarizes what a mining method that manages error information must consider in addition to one that does not.
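Selection with and without the tolerance check can be sketched as follows. The sketch assumes that each lattice node stores a pair (occurrence count, estimation error), that support is the count divided by the total number of transactions, and that the tolerance is compared with the error expressed in the same support units. The name `select_frequent` and the node layout are hypothetical.

```python
def select_frequent(lattice, total, min_support, tolerance=None):
    """Return the items judged frequent.

    lattice:   dict item -> (count, estimation_error); assumed node layout.
    total:     number of transactions seen so far.
    tolerance: largest accepted support error; None ignores the error,
               which corresponds to the method without error management.
    """
    frequent = []
    for item, (count, error) in lattice.items():
        if count / total < min_support:
            continue  # not frequent on count alone
        if tolerance is not None and error / total > tolerance:
            continue  # frequent on count, but the estimate is still too uncertain
        frequent.append(item)
    return frequent


lattice = {"a": (50, 0), "ab": (30, 20), "bc": (5, 0)}
print(select_frequent(lattice, total=100, min_support=0.2))                 # ['a', 'ab']
print(select_frequent(lattice, total=100, min_support=0.2, tolerance=0.1))  # ['a']
```

The item "ab" passes the support test in both calls but is held back in the second one because its estimation error, 20 out of 100 transactions, exceeds the tolerance.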

2.2 Unit-item sequential pattern search

The data set for unit-item sequential pattern search represents each transaction as a collection of unit items that carries ordering information, so that the order in which unit items occur can be analyzed. In the data set for frequent item search, a unit item may not be repeated within a transaction; in a transaction for unit-item sequential pattern search, repetition is allowed, and because the order of occurrence is meaningful, each transaction is defined as an ordered collection of unit items as shown below. In the following transaction, → only expresses the ordering and does not appear in the actual transaction set.

Unit-item sequential pattern transaction = { 100→200→100→300→400→500 }

As in sequential pattern search in closed data mining, the simple accumulation unit-item sequential pattern search method (230) in open data mining disregards differences in when transactions occurred, treats the information of all transactions as equally important, and simply accumulates it in order to find sequential patterns in the order in which frequent items frequently occur. The method (230) consists of the same steps as the simple accumulation frequent item search method described in Section 2.1: a transaction reading step, a basic mining information update step, a step that updates the occurrence counts of sequential patterns and inserts newly appearing sequential patterns, a forced pruning step, and a unit-item sequential pattern selection step. Because unit items may be repeated within a transaction, however, the estimation of the occurrence count of an n-sequential pattern (n≥2) proceeds differently from that in frequent item search.

As in frequent itemset search, the occurrence count of a 1-sequential pattern, which consists of a single unit item, is never estimated: every unit item is added to the lattice on its first appearance and its occurrence count is maintained from then on. The occurrence count of an n-sequential pattern (n≥2), by contrast, is estimated from its sub-sequential patterns, in the same way as the occurrence count estimation described for frequent itemset search in Section 2.1. Unlike frequent itemset search, which analyzes only whether items occur together and ignores their order, unit-item sequential pattern search allows a unit item to appear more than once in a transaction and analyzes the order in which the unit items appear. Order information therefore has to enter the estimate that is made when a new n-sequential pattern (n≥2) appears, as follows.

For example, consider the sequential pattern "abc" made up of the three unit items a, b and c. The frequent itemset search of Section 2.1 ignores order and considers co-occurrence only, so it estimates the occurrence count from the occurrence counts of the three sub-itemsets "ab", "bc" and "ac". In unit-item sequential pattern search, the sub-sequential patterns used for the estimate must respect order as well as co-occurrence, so they have to be selected with their start position in mind. When the occurrence count of "abc" is estimated, the sub-sequential patterns "ab" and "ac" contribute the same occurrence counts even when order is taken into account. For "bc", however, the occurrence count is split by the start position of the pattern. Depending on where it occurs in an ordered transaction, "bc" either starts at the first unit item of the transaction or starts at the second or a later unit item. Only the second case is relevant to "abc", because in any occurrence of "abc" the sub-sequential pattern "bc" is preceded by "a". Accordingly, only the occurrences of "bc" that do not start at the beginning of a transaction are used to estimate the occurrence count of "abc".

To apply occurrence counts that depend on position, each sequential pattern in unit-item sequential pattern search maintains two occurrence counts: one that counts every occurrence of the pattern, and one that counts only the occurrences that start at the second or a later unit item of a transaction. The two must be managed separately. This improves the accuracy of the occurrence count estimates in unit-item sequential pattern search.
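The two counts kept per sequential pattern, and the way the count of "abc" is bounded from its sub-sequential patterns, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in this document: the helper names (`occurs`, `count_pattern`, `upper_bound_abc`) are hypothetical, patterns are matched as order-preserving subsequences, and taking the minimum of the sub-pattern counts is an assumed stand-in for the estimation procedure of Section 2.1.

```python
def occurs(pattern, transaction):
    """True if pattern is an order-preserving subsequence of transaction."""
    it = iter(transaction)
    # `item in it` advances the iterator, so order is enforced.
    return all(item in it for item in pattern)


def count_pattern(pattern, transactions):
    """Return (total, non_initial) counts for one sequential pattern.

    total       : transactions in which the pattern occurs at all
    non_initial : transactions in which it occurs starting at the
                  second or a later unit item
    """
    total = sum(occurs(pattern, t) for t in transactions)
    non_initial = sum(occurs(pattern, t[1:]) for t in transactions)
    return total, non_initial


def upper_bound_abc(transactions):
    """Assumed bound on count("abc"): minimum over its 2-item sub-patterns.

    "ab" and "ac" use their total counts; "bc" uses only the count of
    occurrences that do not start at the first unit item.
    """
    ab_total, _ = count_pattern("ab", transactions)
    ac_total, _ = count_pattern("ac", transactions)
    _, bc_non_initial = count_pattern("bc", transactions)
    return min(ab_total, ac_total, bc_non_initial)


# Hypothetical transactions, each a string of unit items in order.
transactions = ["abc", "bc", "bcd", "acb"]
bc_total, bc_non_initial = count_pattern("bc", transactions)
bound = upper_bound_abc(transactions)
```

In this toy data "bc" occurs in three transactions but only once after another unit item, so using the position-restricted count gives a tighter bound for "abc" than the total count would.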

In unit-item sequential pattern search, the occurrence count of an n-sequential pattern (n≥2) is an estimated value immediately after the pattern has been added to the monitoring lattice, so it cannot be split by start position within the transaction. In that case both counts, the one maintained regardless of start position and the one for occurrences that start at the second or a later unit item, are set to the same estimated value. For transactions that arrive afterwards, the two counts are maintained separately. Every sequential pattern in the monitoring lattice for unit-item sequential pattern search thus carries two occurrence counts.

As in frequent itemset search, unit-item sequential pattern search decides whether to add a sequential pattern by checking whether its sub-sequential patterns are in the lattice. A sequential pattern can be added to the lattice only if all of its sub-sequential patterns are already there. If one or more of them is missing, the maximum possible occurrence count of the pattern evaluates to 0, so the pattern can be rejected without computing its estimated occurrence count. In other words, the decision to add a sequential pattern is evaluated only when every one of its sub-sequential patterns is present in the lattice; otherwise the pattern is not added.
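The admission check described above can be sketched as follows. The names `sub_patterns` and `may_insert` are hypothetical, and the sketch assumes that the lattice is downward closed, so that checking the sub-patterns one element shorter is enough to cover all sub-sequential patterns.

```python
def sub_patterns(pattern):
    """All patterns obtained by dropping exactly one element of `pattern`."""
    return {pattern[:i] + pattern[i + 1:] for i in range(len(pattern))}


def may_insert(pattern, lattice):
    """Decide whether `pattern` is even a candidate for insertion.

    1-sequential patterns are always admitted on first appearance.
    Longer patterns are admitted only if every sub-pattern that is one
    element shorter is already in the lattice (assumed downward closed).
    """
    if len(pattern) == 1:
        return True
    return all(sub in lattice for sub in sub_patterns(pattern))


# The lattice is modelled here as a plain set of tuples.
lattice = {("a",), ("b",), ("c",), ("a", "b"), ("a", "c")}
before = may_insert(("a", "b", "c"), lattice)   # ("b", "c") is missing
lattice.add(("b", "c"))
after = may_insert(("a", "b", "c"), lattice)
```

Only patterns that pass this check need their occurrence count estimated at all.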

2.3 Itemset Sequential Pattern Search

In unit-item sequential pattern search, a transaction is expressed purely as an ordering of unit items. In itemset sequential pattern search, by contrast, a transaction carries two kinds of information at once, as shown below: the association among unit items that occur together, and the order of the itemsets formed by those co-occurring unit items. A unit item is never repeated within one unit of co-occurrence (one itemset), but the same itemset may appear more than once along the sequence.

Itemset sequential pattern transaction = { (100, 200, 300) → (100, 200) → (300, 400, 500) }

In open data mining, the simple accumulation itemset sequential pattern search method (220) finds sequential patterns in the order in which frequent itemsets occur by simply accumulating the information found in the transactions under analysis, treating all of it as equally important regardless of when each transaction occurred. FIG. 10 shows the simple accumulation frequent itemset sequential pattern search method (220), which consists of two main phases, a frequent itemset search phase (221a) and a sequential pattern search phase (222), performed one after the other. The frequent itemset search phase first updates the occurrence counts of the itemsets that appear in the transaction and obtains the newly appearing itemsets and the frequent itemsets. The sequential pattern search phase then analyzes the order information of the transaction using only these frequent itemsets, and in this way finds frequent itemset sequential patterns. The transaction is kept in memory until this order analysis is finished.

The frequent itemset search phase runs in the same way as the simple accumulation frequent itemset search of Section 2.1, and its result selects the itemsets from which sequential patterns are built. An itemset that makes up a sequential pattern must itself be frequent in the same data set; an itemset that is not frequent cannot form a sequential pattern. The frequent itemset search phase therefore first finds every frequent itemset that satisfies the given minimum support among the itemsets occurring in the transactions. Each frequent itemset found this way is a basic element of sequential pattern search and becomes a single sequential pattern unit item in the sequential pattern search phase.

The sequential pattern search phase (222) has the same steps (222b-222f) as the frequent itemset search phase, except that it additionally has a unit item mapping step (222a) and a unit item mapping interpretation step (222g). The unit item mapping step maps each itemset obtained in the frequent itemset search phase to a separate identifier, so that every sequential pattern unit item is represented by one identifier. An itemset obtained in the frequent itemset search phase consists of several unit items, and in the sequential pattern search phase it becomes one sequential pattern unit item; representing it by a single identifier lets the sequential pattern search phase manage its unit item information efficiently. The unit item mapping interpretation step interprets each sequential pattern unit item produced by the sequential pattern analysis and expresses it in terms of the unit items of the original data set.

The frequent itemset search phase and the sequential pattern search phase each maintain their own lattice: an association lattice, which searches for frequent itemsets based on the association among the items that appear in a transaction, and a sequence lattice, which searches for sequential patterns based on the result of the frequent itemset analysis. Keeping this dual-lattice structure minimizes the data that the sequential pattern phase has to consider when it analyzes the order information of the frequent itemsets obtained in the frequent itemset search phase. The association lattice has the same structure as the monitoring lattice for frequent itemset search presented in Section 2.1, and the sequence lattice has basically the same structure as the monitoring lattice for unit-item sequential pattern search described in Section 2.2.

Itemset mapping and interpretation proceed as follows. A separate link table is kept to hold the mapping between the itemsets produced in the frequent itemset search phase and the sequential pattern unit items of the sequential pattern search phase. Each sequential pattern unit item is an itemset produced by the frequent itemset search phase, represented by one integer identifier.

In the sequential pattern search phase, a unit item that fails to appear across many newly generated transactions may be deleted from the lattice. When a sequential pattern unit item is deleted, no sequential pattern containing it remains in the sequence lattice; the unit item no longer exists in the lattice. Its mapping entry is therefore deleted from the link table that represents the relationship between the association lattice and the sequence lattice. The identifier of a sequential pattern unit item that is no longer used is recycled the next time a new itemset is added by frequent itemset search and has to be mapped to a sequential pattern unit item identifier.

By default, when the frequent itemset search phase produces a new itemset, the itemset receives an identifier one greater than the identifier given to the most recently created sequential pattern unit item. If sequential pattern unit items have been deleted in the sequential pattern search phase, however, their identifiers are reused: the smallest of all available identifiers, including the freed ones, is assigned to the newly created itemset.
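The link table and the identifier recycling rule can be sketched as follows. The class name `LinkTable`, its method names, and the choice to start identifiers at 1 are assumptions made for illustration; a min-heap is used here simply as one convenient way to obtain the smallest freed identifier.

```python
import heapq


class LinkTable:
    """Maps frequent itemsets to integer sequential pattern unit item ids."""

    def __init__(self):
        self._next_id = 1      # assumption: identifiers start at 1
        self._free = []        # min-heap of identifiers freed by deletion
        self._to_id = {}       # itemset -> identifier (mapping step)
        self._to_itemset = {}  # identifier -> itemset (interpretation step)

    def assign(self, itemset):
        """Return the identifier of `itemset`, allocating one if needed."""
        key = frozenset(itemset)
        if key in self._to_id:
            return self._to_id[key]
        if self._free:
            # Freed ids are always smaller than _next_id, so the heap
            # minimum is the smallest available identifier overall.
            ident = heapq.heappop(self._free)
        else:
            ident = self._next_id
            self._next_id += 1
        self._to_id[key] = ident
        self._to_itemset[ident] = key
        return ident

    def release(self, itemset):
        """Drop the mapping of a deleted unit item and free its identifier."""
        ident = self._to_id.pop(frozenset(itemset))
        del self._to_itemset[ident]
        heapq.heappush(self._free, ident)

    def interpret(self, ident):
        """Translate an identifier back into the original unit items."""
        return self._to_itemset[ident]


table = LinkTable()
first = table.assign({100, 200})        # gets 1
second = table.assign({300})            # gets 2
third = table.assign({100, 200, 300})   # gets 3
table.release({100, 200})
table.release({300})
reused_a = table.assign({400})          # smallest free id
reused_b = table.assign({500})          # next smallest free id
fresh = table.assign({600})             # no free ids left
```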

Adding a newly occurring itemset sequential pattern to the lattice in the sequential pattern search phase follows the same procedure as adding a unit-item sequential pattern in the unit-item sequential pattern search method of Section 2.2. The occurrence count estimation needed in this procedure is also the same as in unit-item sequential pattern search. That is, when the sub-sequential patterns of the itemset sequential pattern are considered, the start position of each sub-sequential pattern within the transaction must be taken into account, and repeated itemsets within the pattern must be taken into account when the number of sub-sequential patterns is determined.

The sequential pattern search phase (222) does not update the basic mining information. This information has already been updated once in the frequent itemset search phase (221a), so no separate update is performed in the sequential pattern search phase.

3. Open Data Mining Method by Information Differentiation

The importance of a piece of information to be analyzed by a mining method is expressed by its occurrence count in the data set under analysis. Data mining aims to find information of high importance (hereinafter "significant information") and to use it for prediction or decisions about data sets generated in the same application domain. In a data set that is generated continuously over time, information that is significant for the whole data set will generally keep recurring over time in proportion to its importance, unless the application domain changes substantially. That is, information that is significant in the whole data set will also recur in the recent data. Likewise, if information that did not occur in the past becomes frequent in recent transactions, that too is significant information (or a significant change) in the data set.

Open data mining therefore does not have to treat the information from all transactions so far as equally important when it analyzes the whole data set; it can differentiate importance by time of occurrence. An information differentiation method gives recently generated information greater importance than information generated in the past. When the focus is on recent change in a continuously generated data set, such a method keeps the data actually analyzed to a bounded range preceding the current point in time. This reduces the amount of information that has to be maintained during mining and concentrates the analysis on recent information.

Depending on how the mining parameters that define the scope of the analysis are set, the range to be analyzed can be defined variably within the whole data set. Within that range, the transactions can further be given different weights according to their time or order of occurrence, with higher weights for more recent information. The following methods of differentiating weights by time or order of occurrence can be considered.

An information differentiation method can define the degree of differentiation either by the time at which each transaction of the data set occurred or by the order in which the transactions were generated. This description explains each information differentiation method in terms of the TID. When the TID is defined as the occurrence time of a transaction, the degree of differentiation is defined by time; when the TID indicates the order of occurrence, it is defined by the number of transactions.

3.1 Frequent Itemset Search

Frequent itemset search under information differentiation is similar to the simple accumulation frequent itemset search of Section 2.1 but differs as follows. First, the data set under analysis is limited to a fixed range relative to the current point in time. The simple accumulation method always keeps the analysis information of the whole data set in memory and derives the frequent itemsets at each point in time from it. Information differentiation instead treats only a fixed range preceding the current point in time as the target of analysis and keeps only the key information of that range in memory. Depending on how this range is determined, the approach divides into the window method (300) and the decay rate method (400). Each method differentiates the information of a transaction in its own way according to the TID of the transaction.

FIG. 11 shows how the weight of each transaction is differentiated by occurrence time. Weight curve (a) is the transaction weight under the simple accumulation method described above: every transaction has the same importance regardless of when it occurred. Weight curve (b) is the transaction weight under the window method, which analyzes only the transactions that occurred within a recent fixed time range and takes the result as the analysis of the application domain. Weight curve (c) is the transaction weight under the decay rate method, which takes the weight of the transaction occurring at the current time as 1 and treats the weights of earlier transactions as decayed at a fixed rate relative to it.
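The three weight curves can be written as functions of a transaction's age, that is, the difference between the current TID and the TID of the transaction. This is a sketch under assumptions: the function names are hypothetical, the window is taken to contain the current transaction (age 0) and the `window_size - 1` transactions before it, and curve (c) is modelled as multiplying the weight by a retained fraction `d` (0 < d ≤ 1) for every unit of age.

```python
def weight_simple_accumulation(age):
    """Curve (a): every transaction counts fully, whatever its age."""
    return 1.0


def weight_window(age, window_size):
    """Curve (b): full weight inside the window, none outside it."""
    return 1.0 if age < window_size else 0.0


def weight_decay(age, d):
    """Curve (c): weight 1 now, multiplied by d for each unit of age."""
    return d ** age


# Weights of the five most recent transactions, newest first.
ages = range(5)
curve_a = [weight_simple_accumulation(a) for a in ages]
curve_b = [weight_window(a, window_size=3) for a in ages]
curve_c = [weight_decay(a, d=0.5) for a in ages]
```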

a. Frequent Itemset Search by the Window Method

Frequent itemset search with the window method analyzes the data set made up of the transactions that occurred within a fixed range back from the current point in time. This range, which defines the data set under analysis, is called the transaction window and is denoted TW. The size of the window is defined by the user, either as the transactions that occurred within a recent fixed period of time or as a fixed number of the most recent transactions. The description below defines the window size in terms of the TID. When the TID is defined as the occurrence time of a transaction, the window size is a time span, namely the maximum recent period over which a transaction remains meaningful. When the TID indicates the order of occurrence, the window size is a number of transactions, namely the maximum number of recent transactions that remain meaningful.

In the window method the set of target transactions changes over time: only the most recent transactions, up to the window size counted from the current transaction, are analyzed. The occurrence counts of the items that appeared in transactions dropping out of the analysis must therefore be decreased. To manage the change of the transaction window over time, the window method must also maintain a separate log of the transactions that lie within the window size of the current point in time.

FIG. 12 shows the procedure of the frequent itemset search method by the window method (310), which consists of a transaction read step (311), a basic mining information update step (312), an occurrence count update and item addition step (313), a transaction window update step (314) and a frequent itemset selection step (315). The transaction read step and the frequent itemset selection step are the same as described for the simple accumulation method in Section 2.1. Like the simple accumulation method, the window method uses a lattice as its basic structure.

The basic mining information update step (312) updates the minimum occurrence count used for frequent itemset selection from the user-defined minimum support and the total number of transactions. The total number of transactions and the minimum occurrence count are determined in one of two ways depending on the current transaction identifier TID. While the TID is smaller than the window size, which happens only in the initial stage of the analysis, the current total number of transactions is the previous total plus 1, and the minimum occurrence count is updated from the new total. Once the number of transactions to be analyzed exceeds the defined window size, the total number of transactions and the minimum occurrence count keep the values they had at the previous point in time.
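The update of step (312) can be sketched for a count-based window as follows. The function name `update_basic_info` is hypothetical, and two details are assumptions: the total is allowed to grow until it equals the window size, and the minimum occurrence count is taken as the plain product of minimum support and total, with no rounding.

```python
def update_basic_info(total, window_size, min_support):
    """Return the new (total, minimum occurrence count) for one transaction.

    total       : number of transactions counted before this one
    window_size : size of the transaction window, in transactions
    min_support : user-defined minimum support, between 0 and 1
    """
    if total < window_size:
        # Initial stage: the window is still filling up.
        total += 1
    # Once the window is full, total (and hence the threshold) stays fixed.
    return total, min_support * total


total = 0
history = []
for _ in range(5):
    total, min_count = update_basic_info(total, window_size=3, min_support=0.5)
    history.append((total, min_count))
```

With a window of three transactions the threshold stops changing from the third transaction onward.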

FIG. 13 shows the occurrence count update and item addition step (313) for the items that appear in a transaction. The item read step (331) reads an item that occurred in the transaction. If the item is already in the lattice, its occurrence count is incremented (333); if it is not, it is added to the lattice (334). Unlike the simple accumulation method, the item is added to the monitoring lattice without a separate occurrence count estimation. These steps are repeated while the transaction still contains items that have not been processed.

FIG. 14 shows the transaction window update step (314), that is, how the transaction window changes and how the occurrence counts are updated accordingly. First, in step 314, the transaction that occurred at the current point in time is added to the transaction window. The information for the items in the added transaction was already updated in the previous step, so nothing further is done for them. Next, the transaction window information is updated to reflect the new transaction.

In the initial stage, when the total number of transactions so far is less than or equal to the transaction window size (or, if the window is defined by occurrence time, when the current transaction occurs before the base time that defines the window), the update ends without further work. That is, if the current transaction identifier TID does not exceed the window size, the transaction window update ends.

When the total number of transactions so far is larger than the transaction window size (or, if the window is defined by occurrence time, when the current transaction occurs later than the base time that defines the window), that is, when the TID is larger than the window size, the transactions that occurred more than the window size before the current transaction are removed and the occurrence counts of the items that appeared in them are decreased, as follows. For one of the transactions to be removed from the previous transaction window, an item that appeared in it is read (343) and its occurrence count is decreased by 1. If the decreased occurrence count reaches 0 (that is, the item has not occurred within the most recent window), the item is removed from the lattice (346). While the transaction being removed still contains unprocessed items, steps 344 to 347 are repeated; when all of its items have been processed, the transaction is removed from the transaction window. While the transaction window still holds other transactions to be removed, steps 343 to 348 are repeated. The monitoring lattice therefore always holds the occurrence frequency information of the items of the transactions that occurred within the window.
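Steps (313) and (314) together can be sketched as a small sliding-window counter. This is a simplified illustration, not the lattice structure itself: the class name `WindowCounter` is hypothetical, the window is count-based, and a plain dictionary of single-item counts stands in for the monitoring lattice.

```python
from collections import deque


class WindowCounter:
    """Occurrence counts over the most recent `window_size` transactions."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()  # log of the transactions inside the window
        self.counts = {}       # stand-in for the monitoring lattice

    def add(self, transaction):
        items = set(transaction)
        # Step 313: increment known items, insert new ones without estimation.
        for item in items:
            self.counts[item] = self.counts.get(item, 0) + 1
        # Step 314: append to the window, then expire what fell out of it.
        self.window.append(items)
        while len(self.window) > self.window_size:
            expired = self.window.popleft()
            for item in expired:
                self.counts[item] -= 1
                if self.counts[item] == 0:
                    # The item did not occur anywhere in the current window.
                    del self.counts[item]


counter = WindowCounter(window_size=2)
for transaction in ["ab", "bc", "cd"]:
    counter.add(transaction)
```

After the third transaction the first one has left the window, so item "a" disappears and the count of "b" drops to 1.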

b. Frequent Itemset Search by the Decay Rate Method

Among the information differentiation methods that weight each transaction of the data set by its occurrence time in order to find frequent itemsets in an open data mining environment, the decay rate method is one that reduces the importance of each point in time at a fixed rate.

The decay rate is a value that represents how much a weight decays per fixed amount of change, in an information differentiation method that distinguishes the importance of past and recent information in a data set generated over time. The amount of change may be defined either by the occurrence time of a transaction or by the number of transactions generated. The description below is given in terms of the TID: when the TID is the occurrence time of a transaction, the decay rate is defined with respect to time, and when the TID denotes the order of occurrence, the decay rate is defined with respect to the number of transactions. The decay rate is defined by a decay base and a decay base life. A fixed length of time defined by the user is called the decay unit time. The decay base, denoted b, is the base value that defines the decay rate. The decay base life, denoted h, is the number of unit times that elapse until the weight of the current transaction (= 1) decays to $b^{-1}$. For a decay base b, the strongest possible decay over one unit time occurs when the decay base life is 1, in which case the weight decays to $b^{-1}$. Given a decay base b and a decay base life h, the decay rate d, which represents the decay of a weight over one unit time, is defined as follows:

$d = b^{-1/h}$

Once the decay rate d is defined, the actual amount by which a weight is reduced over one unit time is (1 - d).
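The relationship between the decay base, the decay base life, and the decay rate can be illustrated with a minimal sketch. It assumes the decay rate has the form d = b^(-1/h), which matches the description above (a weight of 1 decays to 1/b after h unit times); the function names and the range checks b > 1 and h >= 1 are illustrative assumptions, not part of the specification.

```python
def decay_rate(b: float, h: float) -> float:
    """Decay rate d for decay base b and decay base life h.

    Assumed form: d = b ** (-1 / h), so that d ** h == 1 / b.
    """
    if b <= 1 or h < 1:
        raise ValueError("this sketch expects b > 1 and h >= 1")
    return b ** (-1.0 / h)


def weight_after(d: float, units: float) -> float:
    """Weight of a transaction that is `units` decay unit times old."""
    return d ** units


if __name__ == "__main__":
    for h in (1, 10, 100, 1000):
        d = decay_rate(2.0, h)
        # As h grows, d moves toward 1 and the per-unit reduction (1 - d) shrinks.
        print(h, d, 1.0 - d, weight_after(d, h))
```

With b fixed, a larger h gives a value of d closer to 1, so older transactions lose weight more slowly.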

When the occurrence time of a transaction (that is, its TID) in the data set analyzed by open data mining is not a multiple of the decay unit time, the degree of decay is determined from the amount of elapsed time. For example, in a data set with decay rate d whose decay unit time is one hour, if the weight of the current transaction is 1, a transaction that occurred 30 minutes earlier (a time change of 30 minutes) has its weight decayed to $d^{0.5}$. Hereinafter, the time difference is defined as the actual elapsed time divided by the decay unit time (that is, the change measured in decay unit times). In general, for a given decay base, the decay rate rapidly approaches its maximum value of 1 as the decay base life increases. In particular, the smaller the decay base, the faster the decay rate approaches 1 as the decay base life increases.
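The time difference and the resulting fractional decay can be sketched as follows. The function names and the sample value d = 0.81 are illustrative assumptions; the sketch only shows that a transaction half a decay unit time old carries weight d raised to the power 0.5.

```python
from datetime import datetime, timedelta


def time_difference(earlier: datetime, later: datetime, unit: timedelta) -> float:
    """Elapsed time expressed in decay unit times (may be fractional)."""
    return (later - earlier) / unit


def decayed_weight(d: float, earlier: datetime, later: datetime, unit: timedelta) -> float:
    """Weight at `later` of a transaction that had weight 1 at `earlier`."""
    return d ** time_difference(earlier, later, unit)


if __name__ == "__main__":
    now = datetime(2002, 9, 23, 12, 0)
    half_hour_ago = now - timedelta(minutes=30)
    unit = timedelta(hours=1)          # decay unit time of one hour
    print(time_difference(half_hour_ago, now, unit))
    print(decayed_weight(0.81, half_hour_ago, now, unit))
```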

Overall, the decay rate method follows a procedure similar to the simple accumulation method described in Section 2.1. However, information such as each occurrence count and the total number of transactions decays over time under the decay rate, and this must be taken into account at every step.

FIG. 15 shows the frequent itemset search method (410) based on the decay rate method. The overall structure, from the transaction data reading step (411) to the frequent itemset selection step (416), is the same as in frequent itemset search by the simple accumulation method; that is, each step matches the corresponding step described in Section 2.1. However, because the decay rate method differentiates the importance of transactions by the time at which they occurred, it differs from the simple accumulation method in the following respects.

First, in the mining basic information update step (412), information generated at each point in time carries a weight that depends on when it occurred, so the total number of transactions at the current time, $|D|_k$, is obtained from the total number of transactions at the previous time, $|D|_{k-1}$, and the decay rate as $|D|_k = |D|_{k-1} \times d^{\Delta t} + 1$. Here Δt is the time difference between the newly generated transaction and the most recent previous transaction, expressed in multiples of the decay unit time. As in the simple accumulation method, the threshold counts for itemset insertion and pruning and the minimum count for deciding frequent itemsets are recomputed to reflect the change in the number of transactions. Values expressed as a support are converted to occurrence counts using the total number of transactions.
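The update of the decayed transaction total can be sketched as below. The sketch assumes the update has the form total_k = total_(k-1) * d^Δt + 1, where the new transaction enters with weight 1; the function names are hypothetical.

```python
def update_total(prev_total: float, d: float, dt: float) -> float:
    """Decay the previous total over dt unit times, then add the new transaction (weight 1)."""
    return prev_total * d ** dt + 1.0


def to_count(support: float, total: float) -> float:
    """Convert a threshold expressed as a support into an occurrence count."""
    return support * total


if __name__ == "__main__":
    total = 0.0
    for dt in (0.0, 1.0, 1.0):         # time differences between consecutive transactions
        total = update_total(total, d=0.5, dt=dt)
    print(total)                        # decayed total after three transactions
    print(to_count(0.1, total))         # minimum count for a minimum support of 0.1
```

Because older transactions keep losing weight, the decayed total stays bounded when transactions arrive at regular intervals, and every threshold derived from it has to be recomputed after each update.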

FIG. 16 shows the count update and pruning step (413) of the decay rate method. In the count update step (431), the count of an itemset at the current time, $C_k$, is updated from its count at the previous time, $C_{k-1}$, and the decay rate as $C_k = C_{k-1} \times d^{\Delta t} + 1$. Because the update applies to itemsets that appear in the newly generated transaction, Δt here is the difference between the current time and the time at which the itemset most recently appeared, expressed in multiples of the decay unit time. In the itemset pruning step (433), an itemset whose updated count is below its pruning threshold is removed from the monitoring lattice. By default, 1-itemsets are not pruned. However, as with 1-itemset pruning in the simple accumulation method of Section 2.1, if a 1-itemset does not appear in the transactions generated over a long period, the total number of transactions grows large relative to its count and its support becomes very low. The basic pruning method keeps the 1-itemset even in this case, but a 1-itemset whose count is very small relative to the total number of transactions may also be pruned from the lattice. For this purpose a separate threshold, called the 1-itemset pruning threshold, is defined, and a 1-itemset whose count falls below it is pruned from the monitoring lattice. The 1-itemset pruning threshold is smaller than the pruning threshold for n-itemsets (n ≥ 2) and is set to a support low enough to be negligible in the entire data set.
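A minimal sketch of the count update and pruning step follows. It makes several assumptions that are not in the specification: the monitoring lattice is flattened into a dictionary keyed by itemset, the count update has the form C_k = C_(k-1) * d^Δt + 1, and the names `Entry` and `update_and_prune` are hypothetical. Insertion of new itemsets is outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    count: float      # decayed occurrence count
    last_tid: float   # TID at which the itemset last appeared


def update_and_prune(lattice, transaction, tid, d, total, prune_sup, prune_sup_1):
    """Update tracked itemsets contained in the transaction and prune weak ones.

    `total` is the decayed transaction total; `prune_sup_1` is the separate,
    smaller pruning threshold (as a support) used for 1-itemsets.
    """
    items = frozenset(transaction)
    for itemset in list(lattice):              # copy keys: entries may be deleted
        if not itemset <= items:
            continue                           # itemset does not appear in this transaction
        entry = lattice[itemset]
        entry.count = entry.count * d ** (tid - entry.last_tid) + 1.0
        entry.last_tid = tid
        support = prune_sup_1 if len(itemset) == 1 else prune_sup
        if entry.count < support * total:
            del lattice[itemset]


if __name__ == "__main__":
    lattice = {
        frozenset("a"): Entry(4.0, 1),
        frozenset("ab"): Entry(0.2, 1),
    }
    update_and_prune(lattice, ["a", "b"], tid=3, d=0.5, total=10.0,
                     prune_sup=0.2, prune_sup_1=0.01)
    print(lattice)
```

Storing the TID of the last appearance with each entry lets the decay be applied lazily, only when the itemset is touched again.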

In frequent itemset search by the decay rate method, the delayed-insertion and pruning thresholds can be set to a value that does not affect the mining result at all. If an itemset does not appear in the transactions generated during some period, its support becomes very low: as time passes or transactions keep arriving, its count decays to a very small value. After a long time or many additional transactions, the decayed support is so small that it no longer influences any later decision on whether the itemset is frequent. Therefore, if the delayed-insertion and pruning thresholds are set to a value small enough not to affect that decision, performing delayed insertion and pruning introduces no error into the mining result set. Such a value is very small in practice, however, so delayed insertion and pruning then do not greatly reduce the number of itemsets managed in the monitoring lattice.

FIGS. 17 and 18 show the forced pruning step (415) and the frequent itemset selection step (416) of frequent itemset search by the decay rate method. The overall procedure of each step is the same as in frequent itemset search by the simple accumulation method described in Section 2.1. However, before the count of an itemset is used to decide whether the itemset can be pruned or whether it is frequent, the count must be brought up to date by applying the decay rate.
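Bringing a stored count up to date before it is compared with a threshold can be sketched as follows. The sketch assumes each lattice entry is a pair (count, TID of last appearance) and that a stale count is decayed by d^Δt without adding 1, since the itemset did not occur at the current time; the function names are hypothetical.

```python
def current_count(count: float, last_tid: float, now: float, d: float) -> float:
    """Decay a stored count from its last update time to the current time."""
    return count * d ** (now - last_tid)


def select_frequent(lattice, now, d, total, min_sup):
    """Return the itemsets whose up-to-date count reaches the minimum count."""
    need = min_sup * total
    return {
        itemset
        for itemset, (count, last_tid) in lattice.items()
        if current_count(count, last_tid, now, d) >= need
    }


if __name__ == "__main__":
    lattice = {frozenset("a"): (2.0, 3), frozenset("b"): (2.0, 1)}
    print(select_frequent(lattice, now=3, d=0.5, total=4.0, min_sup=0.25))
```

The same refreshed count can be compared with the pruning threshold in the forced pruning step.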

3.2 Unit-Item Sequential Pattern Search

In the information differentiation methods, unit-item sequential pattern search analyzes the same kind of data set as the unit-item sequential pattern search of the simple accumulation method described in Section 2.2 and finds the unit-item sequential patterns it contains. It is divided into the following two methods according to how the importance of the transaction information in the data set is differentiated.

a. Unit-item sequential pattern search by the window method

Unit-item sequential pattern search by the window method, like frequent itemset search by the window method described in Section 3.1, searches for unit-item sequential patterns based on a fixed range of the most recent transactions relative to the current time. This range is called the transaction window. The unit-item sequential pattern search method (330) by the window method follows the same procedure as frequent itemset search by the window method described in Section 3.1. However, because duplicates are allowed among the unit items that make up a transaction, the same unit item may appear more than once in a sequential pattern represented on the monitoring lattice. In addition, newly appearing unit items are added to the monitoring lattice without delayed insertion, so no count estimation is performed; each sequential pattern on the monitoring lattice therefore maintains a single occurrence count with no further distinction.
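The idea of keeping one exact count per pattern over a transaction window can be sketched as below. This is only an illustration under stated assumptions: the window is defined by a number of transactions, and the hypothetical helper `patterns_of` treats a pattern as a contiguous run of at most two unit items, which may differ from the pattern definition used in Section 2.2.

```python
from collections import Counter, deque


def patterns_of(transaction, max_len=2):
    """Hypothetical extractor: distinct contiguous runs of up to max_len unit items."""
    seq = tuple(transaction)
    return {
        seq[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(seq) - n + 1)
    }


class WindowCounter:
    """Exact pattern counts over the most recent `size` transactions."""

    def __init__(self, size: int):
        self.size = size
        self.window = deque()      # pattern sets of the transactions in the window
        self.counts = Counter()    # one occurrence count per pattern

    def add(self, transaction):
        pats = patterns_of(transaction)
        self.window.append(pats)
        self.counts.update(pats)               # each pattern counted once per transaction
        if len(self.window) > self.size:
            for p in self.window.popleft():    # undo the expired transaction
                self.counts[p] -= 1
                if self.counts[p] == 0:
                    del self.counts[p]

    def frequent(self, min_sup: float):
        need = min_sup * len(self.window)
        return {p for p, c in self.counts.items() if c >= need}


if __name__ == "__main__":
    w = WindowCounter(size=2)
    for t in ("aab", "ab", "bc"):
        w.add(t)
    print(w.frequent(1.0))
```

Because every count is exact for the current window, no estimated count has to be kept alongside it. A repeated unit item, as in the transaction "aab", yields a pattern such as ('a', 'a').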

b. Unit-item sequential pattern search by the decay rate method

Unit-item sequential pattern search by the decay rate method, like frequent itemset search by the decay rate method described in Section 3.1, searches for unit-item sequential patterns while reducing the importance of each transaction in the data set at a fixed rate according to its occurrence time. The unit-item sequential pattern search method (430) by the decay rate method goes through the same steps as the simple accumulation unit-item sequential pattern search described in Section 2.2. It differs in that the importance of each piece of information is differentiated by the decay rate according to when it occurred. That is, among the detailed steps of the search, the count update and sequential pattern insertion step, the forced pruning step, and the unit-item sequential pattern selection step update occurrence counts by applying the decay rate. The details of applying the decay rate are the same as in frequent itemset search by the decay rate method described in Section 3.1.

3.3 Itemset Sequential Pattern Search

In the information differentiation methods, itemset sequential pattern search analyzes the same kind of data set as the itemset sequential pattern search of the simple accumulation method described in Section 2.3 and finds the itemset sequential patterns it contains. It is divided into the following two methods according to how the importance of the transaction information in the data set is differentiated.

a. Itemset sequential pattern search by the window method

Itemset sequential pattern search by the window method, like frequent itemset search by the window method described in Section 3.1, searches for itemset sequential patterns based on a fixed range of the most recent information relative to the current time. This range is called the transaction window. As shown in FIG. 19, the itemset sequential pattern search method (320) by the window method is divided into a frequent itemset search step (321a) and a sequential pattern search step (322). The frequent itemset search step is the same as the frequent itemset search method (310) by the window method described in Section 3.1. As in itemset sequential pattern search by the simple accumulation method, the frequent itemsets obtained in this step become the sequential-pattern unit items that define each transaction in the next step, the sequential pattern search step.

Before the sequential pattern search begins, the sequential pattern search step (322) performs an item mapping step (322a) that maps the itemsets obtained in the frequent itemset search step to sequential-pattern unit items. This is the same as the mapping process of the simple accumulation itemset sequential pattern search described in Section 2.3. The transaction reading step for sequential pattern analysis (322b), the step of updating counts and adding newly appearing sequential patterns (322c), and the transaction window update step (322d) are the same as in frequent itemset search by the window method. The itemset sequential pattern selection step (322e) selects the itemset sequential patterns whose support is at or above the threshold. The item mapping information interpretation step (322f) uses the mapping information from the item mapping step (322a) to reinterpret the items of each itemset sequential pattern as the unit items that make up the actual data set of the application domain. The sequential pattern search step (322) does not update the mining basic information: it has already been updated once in the frequent itemset search step (321a), so no separate update is performed.

As in the simple accumulation itemset sequential pattern search described in Section 2.3, itemset sequential pattern search by the window method uses a dual lattice structure consisting of an association-lattice and a sequence-lattice. The itemsets of the frequent itemset search step are kept in the association-lattice, and the sequential patterns of the sequential pattern search are managed in the sequence-lattice. However, when a sequential pattern is added to the sequence-lattice, its count is not estimated; the actual count is maintained. Unlike the simple accumulation method, the sequence-lattice therefore does not keep two separate counts per sequential pattern according to the starting position but only a single count. The way the counts of itemsets and sequential patterns are updated as time passes or additional transactions arrive, and the way the transaction window is managed, are the same as in frequent itemset search by the window method described in Section 3.1.

When a transaction to be processed is generated, the frequent itemset search step and the sequential pattern search step described above run one after the other, as shown in FIG. 19. First, the frequent itemset search step updates the counts of the itemsets that appear in the transaction and, in doing so, obtains the itemsets newly added to the lattice and the frequent itemsets. Next, the transaction is read again and the sequential patterns made up only of these frequent itemsets are analyzed to find frequent sequential patterns. The transaction is kept in memory until this sequential analysis is finished. The processing that follows the sequential analysis of a transaction is the same as in frequent itemset search by the window method described in Section 3.1.

b. Itemset sequential pattern search by the decay rate method

Itemset sequential pattern search by the decay rate method, like frequent itemset search by the decay rate method described in Section 3.1, searches for sequential patterns while reducing the importance of each transaction in the data set at a fixed rate according to its occurrence time. It goes through the same steps as itemset sequential pattern search by the simple accumulation method described in Section 2.3. That is, it follows the procedure shown in FIG. 10 and is divided into a frequent itemset search step and a sequential pattern search step. The frequent itemset search step follows the same procedure as the frequent itemset search method (410) by the decay rate method described in Section 3.1. The frequent itemsets obtained in this step become the sequential-pattern unit items that define each transaction in the next step, the sequential pattern search step.

The sequential pattern search step follows the same procedure as in the simple accumulation itemset sequential pattern search described in Section 2.3 and matches the sequential pattern search step shown in FIG. 10. That is, every process from the item mapping step for constructing sequential patterns to the item mapping information interpretation step is the same as in the simple accumulation method. It differs in that the importance of each piece of information is differentiated by the decay rate according to when it occurred. Among the detailed steps of the sequential pattern search step, the count update and sequential pattern insertion step, the forced pruning step, and the itemset sequential pattern selection step update occurrence counts by applying the decay rate. The details of applying the decay rate are the same as in frequent itemset search by the decay rate method described in Section 3.1. In sequential pattern search by the decay rate method, the importance of information is reduced by a single decay rate in both the frequent itemset search step and the sequential pattern search step as time passes or new transactions are added. That is, both steps differentiate the importance of information using one decay rate defined by the same decay unit time, decay base, and decay base life.

4. Open Data Mining with a Separately Maintained Mining Result Set

In each of the methods described above, selecting the frequent itemset set or the frequent sequential pattern set requires analyzing all of the information managed in the monitoring lattice at each point in time to decide which entries are frequent. This approach keeps only the minimum information needed to obtain the frequent itemsets in memory and so minimizes memory use in an open data mining environment. However, because the whole monitoring lattice must be traversed each time frequent itemsets are requested, the time needed to obtain them can increase considerably as the data set grows over time and the information managed in the monitoring lattice grows with it. This chapter describes a method that reduces this time: instead of traversing the whole monitoring lattice to obtain the frequent itemsets at the current time, the frequent itemset set is tracked and maintained separately at each point in time.

Let $MS_k$ be the information managed in the monitoring lattice at an arbitrary time t = k, and let $LS_k$ be the frequent itemset set obtained at that time. When time passes and the time changes to t = k+1, the information managed in the monitoring lattice, $MS_{k+1}$, and the frequent itemset set, $LS_{k+1}$, can be defined as follows:

$MS_{k+1} = MS_k + \Delta MS$, $LS_{k+1} = LS_k + \Delta LS$

Here $MS_1$ is the monitoring lattice information made up of the itemsets that appear in the first transaction, and $LS_1$ is the set of frequent itemsets among the itemsets that appear in the first transaction. ΔMS denotes the itemsets that appear in the transaction newly generated at t = k+1; by adding the changes from this one transaction to the existing information in the monitoring lattice, the lattice comes to hold the information of all transactions generated up to t = k+1. ΔLS denotes the change in the frequent itemset set caused by the itemsets contained in the transaction generated at t = k+1, and handling it involves two processes. The first process handles the itemsets contained in the new transaction and runs together with the update of the information managed in the monitoring lattice. While $MS_{k+1}$ is being built from $MS_k$ and the itemsets of the new transaction, any itemset found to be frequent updates the frequent itemset set as follows: if the itemset is not in $LS_k$ and is newly found to be frequent, it is added, and if it is already in $LS_k$, its count is updated, which yields the new frequent itemset set. The second process handles the itemsets in $LS_k$ that do not appear in the new transaction. Their counts do not change, but the total number of transactions in the data set does, so their supports change. Each such itemset is therefore checked against the minimum support of the data set, and any itemset found to be no longer frequent is removed from $LS_k$, which yields $LS_{k+1}$.
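The two processes can be sketched for the simple accumulation case as follows. The sketch assumes that the monitoring lattice (MS) and the result set (LS) are both flattened into dictionaries from itemset to count, and that the caller supplies the itemsets to be counted for the new transaction; the function name `process` is hypothetical.

```python
def process(ms, ls, itemsets, total, min_sup):
    """Apply one transaction to MS and LS and return the new transaction total.

    ms: itemset -> count for everything tracked (monitoring lattice, flattened)
    ls: itemset -> count for the current frequent itemsets (result set)
    itemsets: tracked itemsets contained in the new transaction
    """
    total += 1
    need = min_sup * total
    # First process: itemsets that appear in the new transaction.
    for s in itemsets:
        ms[s] = ms.get(s, 0) + 1
        if s in ls or ms[s] >= need:
            ls[s] = ms[s]                  # add a new frequent itemset or refresh its count
    # Second process: entries of LS whose support has dropped below the minimum.
    for s in [s for s, c in ls.items() if c < need]:
        del ls[s]
    return total


if __name__ == "__main__":
    a, b, ab = frozenset("a"), frozenset("b"), frozenset("ab")
    ms, ls, total = {}, {}, 0
    for itemsets in ([a, b, ab], [a], [a]):
        total = process(ms, ls, itemsets, total, min_sup=0.5)
    print(ls)
```

After every call, selecting the frequent itemsets amounts to reading `ls` directly; the full `ms` dictionary is never scanned.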

When the mining result information is maintained separately, the mining procedure is similar to the mining methods described in Chapters 2 and 3. However, when a new transaction occurs, the step of updating counts and adding new itemsets must also update the separately managed mining result set. FIG. 21 shows the step of updating counts and adding new itemsets (213) for the simple accumulation frequent itemset search when the mining result set is maintained separately. If an itemset that appears in the newly generated transaction exists in the mining result set, its count is updated (236b). When the count update for one transaction is finished, steps 236c to 236e remove from the mining result set the itemsets that can no longer be frequent because the number of transactions has increased. When the mining result set is maintained separately, the frequent itemset selection step (216) becomes a very simple process. Unlike FIG. 9, the supports of all itemsets in the monitoring lattice are not compared; every itemset in LS is selected as a frequent itemset without further checks. Selecting frequent itemsets by traversing the whole monitoring lattice takes a long time, whereas here only LS, which is much smaller, is traversed, so the execution time drops sharply. The LS information can be managed in a prefix-tree lattice structure, as in the monitoring lattice, or represented as an array. In the other methods devised in the present invention as well, performing the same operations in the step of updating counts and adding new itemsets and in the frequent itemset selection step makes it possible to maintain the mining result set separately and thereby minimize the cost of obtaining it.

To reuse the frequent itemset set obtained at the previous time and so minimize the time needed to obtain the frequent itemset set at each point in time, the frequent itemset set obtained at the current time must be managed separately, in addition to the information of the data set managed in the monitoring lattice. Maintaining the frequent itemset set separately increases the memory used for mining, but it reduces the time needed to obtain the frequent itemset set, so mining can be carried out efficiently.

The method above updates the frequent item set every time a new transaction occurs. If the frequent item set does not need to be obtained at every new transaction, LS need not be updated for each transaction; the update can instead be performed only at the time points at which the frequent item set is actually required. In that case, the change information of the items that appeared in transactions between the last update of the frequent item set and the current time point must be kept separately. This requires additional memory, and as the amount of separately kept information grows, updating the frequent item set from it can take longer. The choice between updating LS at every transaction and updating the frequent item set only when needed can therefore be made to suit the actual application environment. The approach of maintaining the mining result set separately applies not only to the frequent item search described in this chapter but equally to the frequent unit item sequential pattern search method and the frequent item sequential pattern search method, and it applies regardless of whether the simple accumulation method, the window method, or the decay rate method is used.

5. Open Data Mining That Processes Multiple Transactions Simultaneously

The open data mining methods devised in Chapters 2, 3, and 4 assume that one transaction occurs at each time point; they update the occurrence counts of items transaction by transaction and select the frequent items from the result. In typical applications, however, multiple transactions can occur simultaneously in one application domain. In this case, a preliminary analysis of the simultaneous transactions identifies the items and transactions whose occurrence counts can be updated together, and each mining step uses this information. The transactions can then be processed faster than if each were handled independently, one at a time.

To process multiple transactions simultaneously, the transaction read step of each data mining method devised in the present invention (described in Chapters 2 and 3 and shown in FIGS. 3, 10, 12, 15, and 19) is changed, as in FIG. 22, to include a step that analyzes the simultaneously generated transactions for duplicated items and duplicated transactions. That is, rather than reading a single transaction and updating the occurrence counts of its items, the method handles the occurrence counts of the items appearing in multiple transactions all at once. Some other steps of the mining process change as well. First, in the mining basic information update step, the previously described methods increase the total number of transactions by 1, because they handle one new transaction at a time. When multiple transactions are processed simultaneously, the total increases by the number of those transactions: if k transactions are processed at once, the total number of transactions in the updated data set is the total at the previous time point plus k. Second, in the step that updates item occurrence counts and adds newly appearing items, the previously described methods change the occurrence count of an item by 1. When multiple transactions are processed simultaneously, the occurrence count of an item is updated by the number of times the item actually appears in the transactions being processed together. This approach needs additional space to analyze the duplication of items and transactions in the batch, but it shortens the processing time compared with analyzing each transaction independently.
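The batch update described above can be sketched as follows. This is a minimal illustration restricted to single items; the function name `process_batch` and the use of a `Counter` for the running counts are assumptions of the example and do not reflect the lattice structure used by the methods themselves.

```python
from collections import Counter


def process_batch(counts, total, batch):
    """Apply k simultaneous transactions to the running counts in one pass.

    counts: Counter of item occurrence counts so far (modified in place)
    total:  number of transactions processed so far
    batch:  list of k transactions, each an iterable of items
    Returns the new total number of transactions.
    """
    batch_counts = Counter()
    for transaction in batch:
        # An item counts once per transaction, however often it repeats inside it.
        batch_counts.update(set(transaction))
    # One update per distinct item, by the number of batch transactions containing it.
    counts.update(batch_counts)
    return total + len(batch)   # the total grows by k, not by 1


counts = Counter({"a": 3, "b": 1})
total = process_batch(counts, 4, [["a", "b"], ["a", "c"], ["a"]])
```

Aggregating the batch first means each distinct item is touched once in the shared structure, at the price of the temporary `batch_counts` table; this mirrors the space-for-time trade-off noted in the text.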

6. Analysis of Changes in Usefulness Indicators over Time

In conventional closed data mining methods, the support, confidence, and similar measures that express the usefulness of each item are static values observed in the data set under analysis at the time of mining. Open data mining, in contrast, can analyze how this information changes over time. By analyzing how the usefulness indicators have changed over the transactions generated up to the current time point, changes in future transactions can be predicted more accurately; that is, it can be judged whether a usefulness indicator will increase or decrease as time passes. Here, a usefulness indicator is a value that expresses the usefulness of the items obtained by mining, and refers to the support, confidence, conviction, interestingness, surprisingness, variance, correlation, and similarity presented in "윤종필, 김희숙 및 최옥주. 데이터 마이닝의 유용성. 정보과학회지. 제 16권, 제 9호. 1998년 9월".

In FIG. 20, curve (a) and curve (b) each show how the support of one item changes over time. At time t = k the two items have the same support, but the support of the item of curve (a) is rising while that of the item of curve (b) is falling. Judged by support alone, the two items are considered equally important. In the transactions generated afterwards, however, the support of the item of curve (a) is likely to rise above its current value and the support of the item of curve (b) is likely to fall.

If the primary purpose of data mining is to extract characteristic information by analyzing past data, its secondary purpose is to use the extracted results as a basis for predicting, or making decisions about, future data sets. In the example of FIG. 20, treating the two items as equally important satisfies the primary purpose, but for the secondary purpose the item of curve (a) is more important than the item of curve (b). In other words, predictions about the future cannot be correct if only the support value at the current time point is considered. Computing the degree of support change at each time point, and in particular reporting whether that change is increasing or decreasing relative to the earlier rate of support change, supports more accurate prediction of changes in future data sets. Implementing such an analysis with conventional closed data mining requires running a separate mining task on each data set of interest and then analyzing the differences. In open data mining, the mining result that changes as transactions accumulate is available in real time, so the trend of the mining result set is easy to track.

6.1 Velocity and Acceleration in the Simple Accumulation Method

The velocity and acceleration of a usefulness indicator can serve as measures for predicting how the indicator changes over time. The description below is given for the support of an item, which is one of the usefulness indicators; it applies equally to other usefulness indicators such as confidence.

a. Velocity

When the support of an item e changes over time, the velocity of the support S of item e is defined as the amount by which the support changes per fixed unit of time. The unit of time used to define the velocity is called the velocity base time. It could be defined independently of the minimum unit of time in which a transaction can occur, but in the present invention the velocity base time is defined as a multiple of that minimum time. For example, suppose the velocity base time is defined as m times the minimum time in which a transaction can occur. The support velocity is then the change in support while time advances by m minimum time units, defined as the sum of the support changes that occurred during that period. The defining equation for the support velocity at an arbitrary time point t = k, and the symbols it uses, are not reproduced in this text. If k ≤ m, a value required by the definition cannot be obtained, and it is taken to be 0.

The velocity at an arbitrary time point t = k+1 can be obtained recursively from the velocity at t = k. The recursion and the auxiliary term it uses are given by equations that are not reproduced in this text.
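Because the original equations are unavailable here, the following is only a sketch of one way to maintain such a velocity incrementally, assuming the velocity is the sum of the per-step support changes over the last m time units. The class name `SupportVelocity` and the convention that the support before the first observation is 0 are assumptions of this example.

```python
from collections import deque


class SupportVelocity:
    """Running sum of the last m per-step support changes of one item."""

    def __init__(self, m):
        self.m = m                  # velocity base time, in minimum time units
        self.deltas = deque()       # the most recent per-step support changes
        self.prev_support = 0.0     # assumption: support is 0 before any data
        self.velocity = 0.0

    def update(self, support):
        """Feed the support at the next time point; return the new velocity."""
        delta = support - self.prev_support
        self.prev_support = support
        self.deltas.append(delta)
        self.velocity += delta                       # newest change enters the sum
        if len(self.deltas) > self.m:
            self.velocity -= self.deltas.popleft()   # oldest change leaves it
        return self.velocity


sv = SupportVelocity(m=2)
velocities = [sv.update(s) for s in (0.5, 0.75, 0.25, 1.0)]
```

Each update adds the newest change and drops the one that has fallen out of the base time, so the cost per time step is constant regardless of m.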

b. Acceleration

For an item e, the acceleration of the support S is defined as the amount by which the velocity changes per velocity base time. As in the definition of velocity, assume that the velocity base time is m times the minimum time in which a transaction can occur; the support acceleration is then defined as the sum of the velocity changes that occurred during that period. The defining equation for the support acceleration at an arbitrary time point t = k is not reproduced in this text. If k ≤ m, a value required by the definition cannot be obtained, and it is taken to be 0.

As with the support velocity, the acceleration at an arbitrary time point t = k+1 can be obtained recursively from the acceleration at t = k. The recursion and the auxiliary term it uses are given by equations that are not reproduced in this text.
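A direct (non-recursive) reading of the two definitions is sketched below: the sum of the per-step changes of a series over the last m steps telescopes to the difference between the current value and the value m steps earlier. The helper name `windowed_change` and the choice to treat values before the start of the series as 0 are assumptions for illustration, not the original formulas.

```python
def windowed_change(series, m):
    """Sum of per-step changes over the last m steps, i.e. x[k] - x[k - m].

    Values before the start of the series are taken to be 0 (an assumption).
    """
    return [x - (series[k - m] if k >= m else 0.0)
            for k, x in enumerate(series)]


supports = [0.5, 0.75, 0.25, 1.0]
velocity = windowed_change(supports, 2)       # change in support per base time
acceleration = windowed_change(velocity, 2)   # change in velocity per base time
```

Applying the same helper twice shows that, under this reading, acceleration stands to velocity as velocity stands to support.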

6.2 Velocity and Acceleration in Information Differentiation Methods

In the simple accumulation method, which gives every transaction the same weight regardless of when it occurred, the velocity (or acceleration) is defined as the plain arithmetic sum of the support (or velocity) changes. The window method uses the same arithmetic sum, but if the unit time that defines the velocity and acceleration is larger than the window size, the changes in velocity and acceleration can become very large. Because the window method defines the support velocity and acceleration in the same way as the simple accumulation method, it is not described separately. In the decay rate method, the velocity and acceleration are defined with a separate decay rate introduced for the velocity, as follows.

a. Velocity

In the decay rate method, as in the simple accumulation method, the support velocity is basically defined as the change in support per velocity base time, where the change in support is computed with the decay rate taken into account. Defined this way, however, the method has the overhead of keeping the individual support changes that occurred during the velocity base time. To avoid this, the present invention defines a separate decay rate for the velocity and defines the support velocity in terms of it. This velocity decay rate dv is defined by its own decay base value, decay unit time, and decay base period, which are distinct from those of the basic decay rate of the data set and are called the velocity decay base value, the velocity decay unit time, and the velocity decay base period. The velocity decay rate is chosen so that information from the most recent short interval carries the most weight; the support change computed over the whole history with the velocity decay rate is therefore very close to the support change within that recent interval. The defining equation for the support velocity at an arbitrary time point t = k under the velocity decay rate is not reproduced in this text.

The velocity at an arbitrary time point t = k+1 can be defined recursively from the velocity at t = k. The recursion and the auxiliary term it uses are given by equations that are not reproduced in this text.
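One plausible form of such a recursion, shown purely as an assumption because the original equation is unavailable, is an exponentially weighted sum: the previous velocity is multiplied by the velocity decay rate and the newest support change is added. The function name `decayed_velocity` and the parameter name `dv` (taken to lie between 0 and 1) are illustrative.

```python
def decayed_velocity(supports, dv):
    """Exponentially weighted sum of support changes.

    supports: support of one item at successive time points
    dv:       velocity decay rate, assumed to lie in (0, 1)
    Assumed recursion: v[k+1] = v[k] * dv + (S[k+1] - S[k]), with S = 0 before the data.
    """
    velocity, prev = 0.0, 0.0
    result = []
    for s in supports:
        velocity = velocity * dv + (s - prev)   # old changes fade, new one enters
        prev = s
        result.append(velocity)
    return result


trace = decayed_velocity([0.5, 0.75, 0.25], dv=0.5)
```

Only the previous velocity and the previous support are stored per item, which is the saving over keeping every change within the base time that the text describes.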

b. Acceleration

In the decay rate method, as in the simple accumulation method, the support acceleration is basically defined as the change in velocity per velocity base time, where the change in velocity is computed with the decay rate taken into account. As with the velocity, this has the overhead of keeping the individual velocity changes that occurred during the velocity base time. To avoid this, the present invention defines the support acceleration using a separate decay rate, which is the same velocity decay rate used in the definition of velocity. The defining equation for the support acceleration at an arbitrary time point t = k under the velocity decay rate is not reproduced in this text.

As with the support velocity, the acceleration at an arbitrary time point t = k+1 can be defined recursively from the acceleration at t = k. The recursion and the auxiliary term it uses are given by equations that are not reproduced in this text.

6.3 Data Mining by Analyzing Changes in Item Usefulness Indicators

This section describes a mining method that analyzes the velocity and acceleration of usefulness indicators in order to produce mining results that are more useful in practice. The change analysis described here can be applied to each of the mining methods described in Chapters 2 and 3 and to the methods derived from them. This section, however, describes it for the simple accumulation frequent item search method of Section 2.1, analyzing the support velocity of the mined items. The description applies equally to the other methods devised in the present invention and to usefulness indicators other than support, and data mining by acceleration analysis can be implemented in the same way as the velocity-based mining described in this section.

To analyze the support velocity of items in the simple accumulation frequent item search method, two things are added to or changed in the frequent item search method (210) by simple accumulation shown in FIG. 3. First, in the occurrence count update and new item addition step (213), the velocity is updated for every item that already exists in the monitoring lattice or is newly added to it. The velocity of an item at the current time point can be obtained recursively from its velocity at the previous time point by the method given in Section 6.2. FIG. 23 shows the occurrence count update and new item addition step with the item velocity update module (237) added. Second, the support velocity is analyzed in step 273, whose detailed process is shown in FIG. 24. Step 273 is the sub-step of the frequent item selection step at which an item has been judged frequent because its support is at least the minimum support for the data set.

The support velocity analysis in FIG. 24 falls into two main methods: frequent item refinement by time-averaged velocity and frequent item refinement by the average velocity of all frequent items. Refinement by time-averaged velocity compares the support velocity of each item, after the current transaction has been processed, with the average of that item's own support velocity over time (or over the growth in transactions); if the ratio is at least a threshold predefined by the user, the item is selected as a refined frequent item that takes the support velocity into account. Refinement by the average velocity of all frequent items compares the support velocity of each item with the average support velocity of the items in the frequent item set obtained from support alone after the current transaction has been processed; if the ratio is at least a threshold predefined by the user, the item is selected as a refined frequent item. Alternatively, without any relative comparison, an item can be selected as a refined frequent item simply when its support velocity is at least a threshold predefined by the user.
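The two ratio-based refinements can be sketched as follows, assuming the velocities are already available as dictionaries keyed by item. The function names, the dictionary inputs, and the decision to skip items whose reference average is not positive are assumptions of this example; the description does not say how a zero or negative average should be handled.

```python
def refine_by_group_average(velocities, ratio_threshold):
    """Keep items whose velocity is at least ratio_threshold times the
    mean velocity of all currently frequent items.

    velocities: {frequent item: current support velocity}
    """
    if not velocities:
        return set()
    mean_v = sum(velocities.values()) / len(velocities)
    if mean_v <= 0:
        return set()   # assumption: the ratio is only used for a positive mean
    return {item for item, v in velocities.items()
            if v / mean_v >= ratio_threshold}


def refine_by_time_average(current, history_mean, ratio_threshold):
    """Keep items whose current velocity is at least ratio_threshold times
    their own average velocity so far.

    current:      {item: support velocity now}
    history_mean: {item: average support velocity of that item over time}
    """
    return {item for item, v in current.items()
            if history_mean[item] > 0 and v / history_mean[item] >= ratio_threshold}


by_group = refine_by_group_average({"a": 0.25, "b": 0.125, "c": 0.0}, 1.0)
by_time = refine_by_time_average({"a": 0.25, "b": 0.125},
                                 {"a": 0.125, "b": 0.25}, 1.5)
```

In this example item `b` passes the group comparison but not the time comparison, because its velocity is average for the group yet only half of its own historical average.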

7. Data Set Preprocessing Techniques for Open Data Mining

In conventional closed data mining, it is very important to choose, from the various data sets generated in an application domain, the data set that best reflects the characteristics of that domain as the analysis target. Preprocessing the data set excludes transactions that deviate strongly from the characteristics of the domain and includes those that reflect them well, which yields mining results that express the characteristics of the domain more clearly. Likewise, open data mining preprocesses the continuously generated transactions so that only those that reflect the characteristics of the application domain well are included in the data set under analysis.

In open data mining, the transaction information that makes up the data set under analysis can vary widely. Normalizing the transaction set according to fixed criteria increases the reliability of the mining results. For example, the lengths of the transactions in the current data set can be analyzed statistically (the length of a transaction is the number of unit items it contains), and a new transaction that is excessively long or short can be excluded from the data set under analysis, giving mining results that better reflect the common characteristics of the application domain. Because open data mining must analyze the mining result in real time as each new transaction is added, how atypical a new transaction is relative to the past transaction set must be determined before the transaction is added to the monitoring lattice. In conventional closed data mining this is done in a pre-processing stage before mining begins, but in open data mining it must be done at the moment each transaction is added. The characteristic analysis indicators presented below, and the data set characteristic analysis method based on them, are suited to open data mining, but they can also be used effectively to analyze the characteristics of a data set in conventional closed data mining. The present invention uses the following indicators to analyze the characteristics of the data set under analysis.

7.1 Data Set Characteristic Analysis Indicators

a. Average Length

The length of the transaction generated at an arbitrary time point t = k is the number of unit items that make up the transaction. The length of every transaction in the data set is obtained and analyzed statistically, using the mean transaction length and measures of its spread such as the standard deviation or variance, to characterize the transaction lengths common to the whole data set.
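A length-based filter of the kind described can be sketched with running statistics, so that each arriving transaction is tested before it would be added to the monitoring lattice. The class name `LengthFilter`, the z-score style rule, the warm-up period, and the choice to update the statistics only with accepted transactions are all assumptions of this example; the description does not fix a particular criterion.

```python
import math


class LengthFilter:
    """Reject transactions whose length is far from the running mean length."""

    def __init__(self, z=2.0, warmup=5):
        self.z = z              # allowed deviation, in standard deviations
        self.warmup = warmup    # transactions accepted before filtering starts
        self.n = 0              # number of accepted transactions
        self.mean = 0.0         # running mean length
        self.m2 = 0.0           # running sum of squared deviations (Welford)

    def accept(self, transaction):
        length = len(transaction)
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / self.n)
            if abs(length - self.mean) > self.z * std:
                return False    # atypical length: exclude from the analysis
        # Assumption: only accepted transactions update the statistics.
        self.n += 1
        delta = length - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (length - self.mean)
        return True


f = LengthFilter(z=2.0, warmup=4)
warm = [f.accept(t) for t in (["a", "b"], ["a", "b", "c"], ["c", "d"], ["a", "c", "d"])]
long_ok = f.accept(list("abcdefghij"))   # length 10 against a mean of 2.5
short_ok = f.accept(["a", "b", "c"])     # length 3
```

Welford's update keeps the mean and variance in constant memory, which suits a stream in which the full history of lengths is not retained.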

In a continuously changing data set in which the transactions generated at every time point have the same weight, the average transaction length of the whole data set at t = k+1 can be obtained from the average transaction length of the whole data set up to t = k (the equation is not reproduced in this text).

In the decay rate method, in which transactions generated at different time points have different weights, the average transaction length at t = k+1 is obtained by taking the decay rate d into account (the equation is not reproduced in this text); one of its terms is the same as in the simple accumulation method.
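Since the update equations are missing from this text, the sketch below shows a standard incremental weighted mean that is consistent with the verbal description: with equal weights it is the ordinary running mean, and with a decay rate d every earlier transaction is down-weighted by d at each step. The function name `update_average` and the representation of the decayed transaction total as `weight` are assumptions of the example.

```python
def update_average(avg, weight, value, d=1.0):
    """Fold one new observation into a running (optionally decayed) mean.

    avg:    mean up to time k
    weight: (decayed) total number of transactions up to time k
    value:  observation for the transaction at time k + 1, e.g. its length
    d:      decay rate; d = 1.0 gives the equal-weight running mean
    Returns (new_avg, new_weight).
    """
    new_weight = weight * d + 1.0
    new_avg = (avg * weight * d + value) / new_weight
    return new_avg, new_weight


# Equal weights: lengths 2, 4, 6 average to 4.
avg, w = 0.0, 0.0
for length in (2, 4, 6):
    avg, w = update_average(avg, w, length)

# Decay rate 0.5: recent transactions dominate the mean.
davg, dw = 0.0, 0.0
for length in (2, 4, 6):
    davg, dw = update_average(davg, dw, length, d=0.5)
```

The same update could be reused for any per-transaction quantity whose running or decayed mean is needed.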

b. Average Coverage

The coverage of a transaction is a value that expresses how well the transaction reflects the characteristics of the application domain under analysis. It is expressed as the proportion of the items making up the transaction that belong to the frequent item set found so far. A transaction with high coverage can be considered to reflect the characteristics of the application domain in which it occurred well; a transaction with low coverage can be considered to reflect them poorly. Coverage is expressed as 1-item coverage, 2-item coverage, ..., n-item coverage according to the length of the items on which it is computed: coverage computed over items of length n (n-items) is called n-item coverage. The defining equation for the n-item coverage of a transaction is not reproduced in this text.
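One way to read the n-item coverage, given here as an assumption because the defining equation is unavailable, is the fraction of the n-item subsets of a transaction that are currently frequent. The function name `coverage` and the representation of the frequent item set as a set of `frozenset` objects are illustrative.

```python
from itertools import combinations


def coverage(transaction, frequent, n=1):
    """Fraction of the n-item subsets of a transaction that are frequent.

    transaction: iterable of unit items
    frequent:    set of frozensets, the current frequent item set
    n:           length of the items on which coverage is computed
    """
    subsets = [frozenset(c) for c in combinations(sorted(set(transaction)), n)]
    if not subsets:
        return 0.0   # transaction shorter than n: nothing to cover
    return sum(s in frequent for s in subsets) / len(subsets)


frequent = {frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})}
one_item = coverage(["a", "b", "c", "d"], frequent, n=1)   # 2 of 4 items
two_item = coverage(["a", "b", "c", "d"], frequent, n=2)   # 1 of 6 pairs
```

Enumerating subsets grows combinatorially with n, so a sketch like this is practical only for small n or short transactions.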

The average coverage is the mean of the coverage values of the transactions that make up the data set under analysis, and indicates how well the data set as a whole reflects the characteristics of the application domain. When the transactions generated at every time point have the same weight, the average coverage at t = k+1 can be obtained from the average coverage up to t = k and the coverage of the transaction generated at t = k+1 (the equation is not reproduced in this text).

In the decay rate method, in which transactions generated at different time points have different weights, the average coverage at t = k+1 is obtained by taking the decay rate d into account (the equation is not reproduced in this text); one of its terms is the same as in the simple accumulation method.

c. Frequent Item Average Length (Average Length of Frequent Itemset)

Among the items appearing in a transaction, the average length of the frequent items that satisfy the minimum support of the data set is called the frequent item average length. When the minimum support of a data set is lowered, the number of frequent items increases and so does their average length. The frequent item average length of a transaction is defined in terms of the following quantities (the equation is not reproduced in this text).

LS: the frequent item set; |LS|: the number of elements in the frequent item set

The frequent item average length of the whole data set can be obtained from the frequent item average lengths of the individual transactions. When the transactions generated at every time point have the same weight, the frequent item average length at t = k+1 can be obtained from the frequent item average length of the whole data set up to t = k (the equation is not reproduced in this text). Here $TL_{k+1}$ denotes the frequent item average length of transaction $T_{k+1}$.

In the decay rate method, where transactions generated at different times carry different weights, the average length at t = k + 1 is obtained with the decay rate d taken into account. The per-transaction value TL_{k+1} is the same as in the simple accumulation method.

d. Average Support of Frequent Itemset

Among the itemsets appearing in a single transaction, the average support of those that satisfy the minimum support of the data set is called the frequent-itemset average support. It is defined for each individual transaction.

The frequent-itemset average support of the entire data set can be obtained from the frequent-itemset average supports of the individual transactions. When all transactions in the data set carry the same weight regardless of when they occurred, the frequent-itemset average support at t = k + 1 can be obtained from the value for the entire data set up to an arbitrary time t = k and from the frequent-itemset average support of transaction T_{k+1}.

In the decay rate method, where transactions generated at different times carry different weights, the average support at t = k + 1 is obtained with the decay rate d taken into account. The per-transaction value is the same as in the simple accumulation method.

e. Average Rule-Point of Frequent Itemset

Average length and average support were introduced above as values that characterize the frequent itemsets, among the itemsets appearing in a transaction, that satisfy the minimum support of the data set. To express these two indicators as a single composite indicator, a frequent-itemset rule point is considered. In the present invention, the frequent-itemset rule point of a transaction is defined so that the length and the support of the frequent itemsets are taken into account together.

The frequent-itemset average rule point of the entire data set can be obtained from the frequent-itemset rule points of the individual transactions. When all transactions in the data set carry the same weight regardless of when they occurred, the frequent-itemset average rule point at t = k + 1 can be obtained from the value for the entire data set up to an arbitrary time t = k and from TRP_{k+1}, the frequent-itemset average rule point of transaction T_{k+1}.

In the decay rate method, where transactions generated at different times carry different weights, the average rule point at t = k + 1 is obtained with the decay rate d taken into account. The per-transaction value TRP_{k+1} is the same as in the simple accumulation method.

f. Distribution Ratio of Frequent Itemsets by Support Interval (Percentage of Reflection at an Interval in a Set of Frequent Itemset)

When the itemsets in the set of frequent itemsets obtained by mining the transactions generated up to a given time are divided into intervals by support, the proportion of the frequent itemsets in each interval that occur in the current transaction is called the distribution ratio of frequent itemsets by support interval. This ratio makes it possible to analyze, interval by interval, not only the number and support of the frequent itemsets occurring in each transaction but also the degree to which each transaction reflects the characteristics of the set of frequent itemsets. The distribution ratio is defined for a single transaction and an arbitrary support interval u whose bounds are greater than the minimum support (the interval is assumed to be defined by a lower support bound and an upper support bound).

The distribution ratio by support interval for the entire data set can be obtained from the distribution ratios of the individual transactions. For an arbitrary support interval u whose bounds are greater than the minimum support (the interval is assumed to be defined by a lower support bound and an upper support bound), and when all transactions in the data set carry the same weight regardless of when they occurred, the distribution ratio at t = k + 1 can be obtained from the distribution ratio up to an arbitrary time t = k and from the distribution ratio of transaction T_{k+1} in interval u.

In the decay rate method, where transactions generated at different times carry different weights, the distribution ratio by support interval at t = k + 1 is obtained with the decay rate d taken into account. The per-transaction value is the same as in the simple accumulation method.
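The per-transaction distribution ratio described in this subsection can be sketched as follows. The sketch assumes that the current frequent itemsets are available as a mapping from itemset to support, and that interval u is half-open, [low, high). Both are illustrative choices, since the text does not say how the interval bounds are treated. The function name `interval_distribution_ratio` is hypothetical.

```python
def interval_distribution_ratio(transaction, frequent_supports, low, high):
    """Share of the frequent itemsets in a support interval that occur
    in the given transaction.

    frequent_supports: dict mapping frozenset itemsets to their support.
    Assumption: the interval is half-open, low <= support < high, and the
    ratio is 0.0 when no frequent itemset falls in the interval.
    """
    items = set(transaction)
    in_interval = [fs for fs, support in frequent_supports.items()
                   if low <= support < high]
    if not in_interval:
        return 0.0
    hits = sum(1 for fs in in_interval if fs <= items)
    return hits / len(in_interval)
```

Computing this ratio for several adjacent intervals shows whether a transaction mostly contains high-support itemsets, low-support itemsets, or a mix of both.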

7.2 Use of the Characteristic Analysis Indicators

The data set characteristic analysis indicators defined in Section 7.1 can be used in the actual mining process in two ways: to refine transactions through regularity analysis, and to find information in the data set that meets given conditions by setting minimum requirements as mining parameters. This section describes how the simple accumulation frequent-itemset search method of Section 2.1 can use the characteristic analysis indicators to support open data mining. The same approach applies to the other data mining methods devised in the present invention.

a. Transaction Refinement Through Regularity Analysis of the Characteristic Analysis Indicators

Among the data set characteristic analysis indicators, the average reflection rate, the frequent-itemset average length, the frequent-itemset average support, and the frequent-itemset rule point each allow a regularity to be defined from the distribution of the values given by the formulas defined above. That is, when these indicator values are computed for each transaction and their mean and variance (or standard deviation) across transactions are calculated, the position of each transaction within the data set with respect to a given indicator becomes known. Considering a transaction's position in the distribution over the whole transaction set shows how well that transaction reflects the characteristics of the whole set. On this basis, a transaction refinement step can decide whether to include the transaction in the data set to be analyzed.

FIG. 25 shows a simple accumulation frequent-itemset search method that refines the transactions to be included in the data set under analysis by analyzing characteristic indicator values. Unlike the simple accumulation method of Section 2.1, which does not refine transactions, this method does not analyze every transaction that occurs. Only transactions that satisfy given conditions in the characteristic indicator analysis are included in the data set under analysis. The characteristic analysis indicator values of each transaction are computed against the mining result set obtained from processing up to the previous transaction. This process is called the transaction refinement step by characteristic indicator analysis (217). The mining basic information update step (212) and the step of updating item occurrence counts and adding newly appearing items (213) are the same as in the simple accumulation frequent-itemset search method of Section 2.1. Next, in step 218, the regularity of the characteristic analysis indicator values over the entire data set is computed. The result serves as the criterion for refining the next transaction when it arrives. The subsequent forced pruning and frequent-itemset selection steps are the same as in the existing data mining method.
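The refinement loop of steps 217 and 218 can be sketched as below. The text does not state the acceptance condition, so this sketch assumes one simple rule: a transaction is kept when its indicator value lies within `tolerance` standard deviations of the running mean. The running mean and variance are maintained with Welford's online update, which is a swapped-in technique and not taken from the source. The class name `IndicatorProfile`, the `warmup` parameter and the function `refine` are hypothetical.

```python
import math


class IndicatorProfile:
    """Running mean and spread of one characteristic indicator."""

    def __init__(self, tolerance=2.0, warmup=5):
        self.tolerance = tolerance  # allowed distance in standard deviations
        self.warmup = warmup        # accept everything until this many values
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # sum of squared deviations (Welford)

    def std(self):
        return math.sqrt(self.m2 / self.n) if self.n else 0.0

    def accepts(self, value):
        """Refinement check (step 217): is the value close to the norm?"""
        if self.n < self.warmup:
            return True
        return abs(value - self.mean) <= self.tolerance * self.std()

    def update(self, value):
        """Regularity update (step 218) after a transaction is accepted."""
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)


def refine(values, profile):
    """Return the indicator values of the transactions that are kept."""
    kept = []
    for value in values:
        if profile.accepts(value):
            profile.update(value)
            kept.append(value)
    return kept
```

In a full system the accepted transaction would also pass through the count update steps (212 and 213) before the profile is updated. Here only the indicator value is tracked to keep the sketch short.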

b. Mining Method Based on Setting Minimum Values for the Characteristic Analysis Indicators

The data set characteristic analysis indicators devised in the present invention are not independent of one another; the value of each indicator influences, and is influenced by, the values of the others. For example, raising the reflection rate of a data set requires lowering its minimum support, which in turn changes the frequent-itemset average length and the frequent-itemset average support. Because the indicators interact in this way, setting a required value for one indicator during mining causes the other indicator values to adapt to it. Most of the mining methods in the present invention define the minimum support, which determines the frequent itemsets of a data set, as the mining parameter and search for frequent itemsets on that basis. However, in the mining methods described in the present invention, mining can also be performed and a result set obtained by defining as the mining parameter a reference threshold on another characteristic analysis indicator, such as the average reflection rate, the frequent-itemset average length, the frequent-itemset average support, or the frequent-itemset rule point.

8. Mining System Capable of Searching for Frequent Itemsets and Frequent Sequential Patterns

An open data mining system built on the open data mining concepts and methods described above makes it possible to obtain, within a limited working time, the characteristic information that appears in the data set under analysis even as that data set grows continuously.

The open data mining system allows the constraints on the usefulness indicators of the mining result set to be changed dynamically. That is, the minimum support, minimum confidence, minimum correlation, minimum interest, and so on for each item in the result set can be changed dynamically on the basis of the mining result at each point in time. Such dynamic changes to the usefulness indicators support more efficient analysis of application domain characteristics that change over time.

In the simple accumulation method and the decay rate method, occurrence counts are updated, and additions to the monitoring lattice are decided, only for the information that appears in the current transaction. Likewise, whether an item can be pruned from the monitoring lattice is judged only for the items that appear in the transaction. This can unnecessarily increase the number of items held in the monitoring lattice, so each method is designed to perform a forced pruning operation. However, performing forced pruning at every point in time can waste considerable time. In particular, when forced pruning is performed by the same processor that performs the mining operation, the time needed to obtain a result set can increase sharply. The open data mining system therefore provides a separate processor that removes prunable items from among those present in the monitoring lattice. This separate processor, which removes unnecessary items from the monitoring lattice, is defined as the idle item collector. The idle item collector runs separately from the mining processor and is executed repeatedly at a fixed time interval. In this way, memory usage can be reduced by removing unnecessary items from the monitoring lattice without increasing the working time required for the actual mining.
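The division of work between the mining task and the idle item collector can be sketched with a background thread. Everything here is illustrative: the class names `MonitoringLattice` and `IdleItemCollector`, the flat dictionary used in place of a real lattice, and the pruning rule (remove entries whose support has fallen below a `prune_support` threshold) are assumptions, since the text does not specify them.

```python
import threading


class MonitoringLattice:
    """Flat stand-in for the monitoring lattice: itemset -> occurrence count."""

    def __init__(self, prune_support):
        self.counts = {}
        self.total = 0                      # transactions processed so far
        self.prune_support = prune_support  # assumed pruning threshold
        self._lock = threading.Lock()

    def add_transaction(self, itemsets):
        """Mining side: count the itemsets seen in one transaction."""
        with self._lock:
            self.total += 1
            for itemset in itemsets:
                self.counts[itemset] = self.counts.get(itemset, 0) + 1

    def prune_idle(self):
        """Collector side: drop every entry whose support is too low."""
        with self._lock:
            if self.total == 0:
                return 0
            stale = [s for s, c in self.counts.items()
                     if c / self.total < self.prune_support]
            for itemset in stale:
                del self.counts[itemset]
            return len(stale)


class IdleItemCollector(threading.Thread):
    """Runs prune_idle() at a fixed interval, apart from the mining task."""

    def __init__(self, lattice, interval_seconds):
        super().__init__(daemon=True)
        self.lattice = lattice
        self.interval = interval_seconds
        self._stop_event = threading.Event()

    def run(self):
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop_event.wait(self.interval):
            self.lattice.prune_idle()

    def stop(self):
        self._stop_event.set()
```

Unlike pruning that is tied to the current transaction, `prune_idle` scans every entry, which is why it is kept off the mining path. The lock keeps the two sides from modifying the counts at the same time.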

The open data mining method devised in the present invention quickly obtains frequent itemsets or frequent sequential patterns from an unbounded data set composed of transactions generated continuously over time. That is, it reduces the mining time for a data set that is not clearly defined before mining starts, is unbounded, and changes continuously. This reduction makes it possible to obtain analysis results for data sets generated continuously in an application domain within a short time, in real time. In particular, the time required to search for frequent sequential patterns is reduced significantly. The method minimizes the analysis time needed to obtain mining results not only in open data mining, where the data set grows continuously, but also in a conventional closed data mining environment, where the data set to be analyzed is clearly defined in advance.

In general, when the data set to be analyzed grows continuously, the memory required during mining increases sharply. The delayed insertion method devised in the present invention reduces the memory needed for mining by reducing the amount of information managed in memory during the mining process, while still yielding the same mining result as conventional mining methods. In addition, by estimating the occurrence count of a delay-inserted itemset from the counts of its sub-itemsets, it allows that occurrence count to approximate the actual value.
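The idea of estimating a delay-inserted itemset's count from its sub-itemsets can be sketched as follows. The estimator used by the invention is not given in this section. The sketch substitutes a standard bound: an itemset cannot occur more often than any of its sub-itemsets, so the smallest count among its sub-itemsets of length n - 1 is an upper bound on its count. The function names `estimate_count_upper_bound` and `should_insert`, the `min_insert_support` threshold, and treating an unmonitored sub-itemset as count 0 are all assumptions.

```python
from itertools import combinations


def estimate_count_upper_bound(itemset, counts):
    """Upper bound on an unmonitored itemset's occurrence count.

    counts: dict mapping frozenset itemsets to occurrence counts.
    Assumption: if any (n-1)-item sub-itemset is not monitored, the
    itemset is treated as insignificant and the estimate is 0.
    """
    items = tuple(sorted(itemset))
    if len(items) < 2:
        raise ValueError("itemset must contain at least two items")
    subset_counts = []
    for sub in combinations(items, len(items) - 1):
        key = frozenset(sub)
        if key not in counts:
            return 0
        subset_counts.append(counts[key])
    return min(subset_counts)


def should_insert(itemset, counts, total, min_insert_support):
    """Delay insertion until the estimated support is high enough."""
    if total <= 0:
        return False
    estimate = estimate_count_upper_bound(itemset, counts)
    return estimate / total >= min_insert_support
```

Because the estimate is an upper bound, the true count may be lower. Tracking the possible estimation error for each inserted entry would allow it to be taken into account when frequent itemsets are selected.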

The information differentiation method devised in the present invention weights each piece of information in a continuously generated data set according to when it occurred, so that recent information is emphasized while the features common to the whole data set can still be extracted efficiently. This has the effect of reducing the size of the data set to be analyzed, and it improves adaptability to changes in the information contained in the data set once that data set has grown large.

The present invention analyzes the usefulness indicators, which express the usefulness of the items appearing in a continuously growing data set, not only as static values at the particular time when a mining result is requested but also in terms of how they change over time. By analyzing the change of each usefulness indicator over time, it can be predicted that items with the same value at a given time will take different usefulness indicator values in later transactions if their rates of change differ. This allows each item in the data set to be analyzed in finer detail. In addition, from the transaction data generated in the application domain, the present invention selects for analysis only the transactions whose characteristic values satisfy given conditions. Mining is thus performed only on data that reflects the characteristics of the application domain well, which increases the reliability of the mining result.

Claims

A data mining system for analyzing an unbounded data set composed of transactions generated continuously in an application domain, in which each newly generated transaction is read only once and data mining is performed to obtain the information embedded in the enlarged data set that includes the transaction, the data mining system comprising:

a) a transaction data reading sub-module that reads newly generated transaction data to be analyzed and analyzes the item information contained in the transaction;

b) an occurrence count management sub-module that incrementally manages, using a monitoring lattice, the occurrence count information of the items appearing in a newly generated transaction;

c) a mining result selection sub-module that analyzes the occurrence counts of the items managed in the monitoring lattice to find frequent itemsets or frequent sequential patterns; and

d) a mining result presentation sub-module that presents the frequent itemsets or frequent sequential patterns to a user.

The method of claim 1,

A data mining system characterized in that each sub-module is executed to analyze the information inherent in the data set on the premise that the transactions to be analyzed always have the same importance regardless of the passage of time or the increase in transactions.

The method of claim 2,

A data mining system characterized in that each sub-module is executed to analyze the occurrence count information of items in order to search for items whose support is equal to or greater than a predetermined threshold defined by the user.

The method of claim 3,

A data mining system characterized in that, in order to provide a mining result within a short time whenever time passes or new transactions are added, the occurrence count management sub-module and the mining result selection sub-module maintain the occurrence count information of all items that appeared in all previously generated transactions, so that a mining result is obtained by analyzing only the newly added transaction without re-analyzing the transactions already included in the data set.

The method of claim 4, wherein

a) a delayed insertion module that reduces memory space usage, wherein, because the number of items to be managed grows rapidly as transactions increase, managing the occurrence count information of the items appearing in all transactions is infeasible owing to physical limits such as the available memory space and the growth of mining processing time, and the module therefore adds an item to the monitoring lattice and begins managing its occurrence count only once the item has occurred repeatedly in transactions and has support high enough to affect the determination of frequent itemsets; and

b) an occurrence count estimation module that estimates the occurrence count of an item subject to delayed insertion from the occurrence counts of the sub-items of that item that appeared in previously generated transactions.

The method of claim 4, wherein

A data mining system characterized in that the occurrence count management sub-module comprises a pruning module that reduces the memory space required for the mining process by performing a pruning operation, which removes an item from the monitoring lattice when, among the items whose occurrence counts are managed in the monitoring lattice, the item has not appeared in the transactions generated during a certain period and its support has therefore become sufficiently low.

The method of claim 4,

A data mining system characterized in that the reliability of the mining result is increased by including an item occurrence count and estimation error management structure, which manages separately for each item not only its occurrence count information but also the estimation error introduced when its occurrence count was estimated for delayed insertion in the delayed insertion step of the occurrence count management sub-module, and by taking the estimation error information into account in the mining result selection step so that the error contained in the frequent itemsets is limited to a predetermined range defined by the user.

The method of claim 2,

A data mining system characterized in that each sub-module is executed to analyze the occurrence order information among unit items in order to search for sequential patterns of unit items that appear at or above a user-defined threshold.

The method of claim 8,

A data mining system characterized in that, in order to provide a mining result within a short time whenever time passes or new transactions are added, the occurrence count management sub-module and the mining result selection sub-module maintain the occurrence counts of all sequential patterns that appeared in all previously generated transactions, so that a mining result is obtained by analyzing only the newly added transaction without re-analyzing the transactions already included in the data set.

The method of claim 9,

a) a delayed insertion module that reduces memory space usage, wherein, because the number of sequential patterns to be managed grows rapidly as transactions increase, managing the occurrence count information of the sequential patterns appearing in all transactions is infeasible owing to physical limits such as the available memory space and the growth of mining processing time, and the module therefore adds a sequential pattern to the monitoring lattice and begins managing its occurrence count only once the pattern has occurred repeatedly in transactions and has support high enough to affect the determination of unit item sequential patterns; and

b) an occurrence count estimation module that estimates the occurrence count of a sequential pattern subject to delayed insertion from the occurrence counts of the partial sequential patterns of that pattern that appeared in previously generated transactions.

The method of claim 9,

A data mining system characterized in that the occurrence count management sub-module comprises a pruning module that reduces the memory space required for the mining process by performing a pruning operation, which removes a sequential pattern from the monitoring lattice when, among the sequential patterns managed in the monitoring lattice, the pattern has not appeared in the transactions generated during a long period and its support has therefore become sufficiently low.

The method of claim 9,

A data mining system characterized in that, so that partial sequential patterns can be selected with the order information taken into account when they are used to estimate the occurrence count of a sequential pattern, each sequential pattern entry comprises a dual occurrence count management structure that manages two pieces of occurrence count information according to the start position of the sequential pattern.

The method of claim 9,

A data mining system characterized in that the reliability of the mining result is increased by including a sequential pattern occurrence count and estimation error management structure, which manages separately for each sequential pattern not only its occurrence count information but also the estimation error introduced when its occurrence count was estimated for delayed insertion in the delayed insertion step of the occurrence count management sub-module, and by taking the estimation error information into account in the mining result selection step so that the error contained in the unit item sequential patterns is limited to a predetermined range defined by the user.

The method of claim 2,

A data mining system characterized in that each sub-module is executed to analyze the occurrence order information among items in order to search for item sequential patterns that appear at or above a user-defined threshold.

The method of claim 14,

A data mining system characterized in that the occurrence count management sub-module comprises a dual lattice structure in which one lattice manages the occurrence count information of items in order to search for frequent itemsets whose occurrence count is at or above a certain threshold, and the other lattice manages the occurrence counts of sequential patterns, analyzing the order information of the items obtained as frequent itemsets in order to search for sequential patterns whose occurrence count is at or above the corresponding threshold.

The method of claim 16,

A data mining system characterized in that, in order to make efficient use in the sequential pattern search of the items whose support is higher than the predetermined threshold, the occurrence count management sub-module comprises a sequential pattern unit item identifier mapping module that newly assigns a separate sequential pattern unit item identifier to an item when the item is created, and releases the association between the item and its sequential pattern unit item identifier when the unit item is removed in the sequential pattern search step, so that the items appearing in a new transaction can be mapped to sequential pattern unit items.

The method of claim 15,

A data mining system comprising:

a) a delayed insertion module that reduces memory space usage, wherein, because the number of sequential patterns to be managed grows rapidly as transactions increase, managing the occurrence count information of the sequential patterns appearing in all transactions is infeasible owing to physical limits such as the available memory space and the growth of mining processing time, and the module therefore adds a sequential pattern to the monitoring lattice and begins managing its occurrence count only once the pattern has occurred repeatedly in transactions and has support high enough to affect the determination of item sequential patterns; and

b) an occurrence count estimation module that estimates the occurrence count of a sequential pattern subject to delayed insertion from the occurrence counts of the partial sequential patterns of that pattern that appeared in previously generated transactions.

The method of claim 15,

A data mining system characterized in that, so that partial sequential patterns can be selected with the order information taken into account when they are used to estimate the occurrence count of a sequential pattern, each sequential pattern entry in the monitoring lattice that manages the occurrence counts of sequential patterns in the occurrence count management sub-module comprises a dual occurrence count management structure that manages two pieces of occurrence count information according to the start position of the sequential pattern.

The method of claim 15,

A data mining system characterized in that the reliability of the mining result is increased by including a sequential pattern occurrence count and estimation error management structure, which manages separately for each sequential pattern not only its occurrence count information but also the estimation error introduced when its occurrence count was estimated for delayed insertion in the delayed insertion step of the occurrence count management sub-module, and by taking the estimation error information into account in the mining result selection step so that the error contained in the item sequential patterns is limited to a predetermined range defined by the user.

The method of claim 1,

A data mining system characterized in that the data mining system, operating by a window method, comprises a transaction window update module which, when the current time changes because time passes or an additional transaction occurs, redefines the transaction window that delimits the search range relative to the current time and, for the items appearing in transactions that were included in the previous window but are not included in the changed window, updates the occurrence count information managed in the monitoring lattice; and in that, with the importance of each transaction differentiated according to the passage of time or the increase in transactions, each sub-module is executed on the transactions generated within a fixed time, or on a fixed number of transactions, defined by the user, so that the information inherent in the data set within the specified range is analyzed.

The method of claim 22,

The method of claim 23, wherein

The method of claim 22,

A data mining system, characterized in that each detailed module is performed to analyze the order information between unit items and to search for unit item sequential patterns that appear at or above a user-defined threshold.

The method of claim 25,

A data mining system, characterized in that, to provide a mining result in a short time when time passes or new transactions are added, the occurrence frequency management detailed module and the mining result selection detailed module maintain the occurrence frequencies of all unit sequential patterns that appeared in all previously occurring transactions, so that a mining result is obtained by analyzing only the newly added transactions without re-analyzing the transactions of the entire data set.

The method of claim 22,

A data mining system, characterized in that each detailed module is performed to analyze occurrence order information between items to search for an item sequential pattern that appears above a user-defined threshold.

The method of claim 27,

A data mining system, characterized in that the occurrence frequency management detailed module comprises a dual lattice structure in which one lattice manages the occurrence frequency information of items, in order to find items whose frequency is above a certain threshold, and the other lattice manages the occurrence frequencies of sequential patterns, analyzing the order information of the items found to be frequent in order to find sequential patterns whose occurrence frequency is above the threshold.

The method of claim 27,

A data mining system, characterized in that, in order to make efficient use of frequent items whose support is above a certain threshold during the sequential pattern search, the occurrence frequency management detailed module includes a sequential pattern unit item identifier mapping module that assigns a separate sequential pattern unit item identifier when an item is created and, when a sequential pattern unit item is removed in the sequential pattern search step, terminates the connection between the item and its sequential pattern unit item identifier, so that items appearing in new transactions can be mapped to sequential pattern unit items.

The method of claim 27,

The method of claim 1,

A data mining system according to a decay rate method, comprising:

a) a decay rate definition module that defines, from a decay unit time, a decay default value, and a decay fundamental period defined by the user, a decay rate indicating how much the importance of a transaction decreases each time a unit time or a unit number of transactions elapses; and

b) an item occurrence frequency update module that, when the current time point changes because time passes or additional transactions occur, updates the occurrence frequencies of previously generated items by applying the decay rate,

characterized in that the importance of each transaction is differentiated so that it decays by the decay rate per predetermined unit as time passes or transactions increase, and each detailed module is performed to analyze the information inherent in the data set within a specified range.
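For illustration only, the time-weighted counting in this claim can be sketched in Python. The claim does not state how the rate is computed from the user-defined values, so the formula `rate = base ** (-1 / period)` is an assumption of this sketch, as are the names `DecayedCounter`, `decay_base`, and `decay_period` and the choice of one transaction as the unit.

```python
class DecayedCounter:
    """Counts items so that older transactions weigh less than recent ones."""

    def __init__(self, decay_base=2.0, decay_period=1000):
        # Assumed formula: a transaction's weight shrinks by a factor of
        # 1/decay_base every decay_period transactions.
        self.rate = decay_base ** (-1.0 / decay_period)
        self.now = 0          # number of transactions seen so far
        self.total = 0.0      # decayed number of transactions
        self.entries = {}     # item -> (decayed count, tick of last update)

    def add(self, transaction):
        self.now += 1
        self.total = self.total * self.rate + 1.0
        for item in set(transaction):
            count, last = self.entries.get(item, (0.0, self.now))
            # Apply the decay for the ticks missed since the last update.
            count = count * self.rate ** (self.now - last) + 1.0
            self.entries[item] = (count, self.now)

    def support(self, item):
        if item not in self.entries or self.total == 0:
            return 0.0
        count, last = self.entries[item]
        return count * self.rate ** (self.now - last) / self.total
```

Storing the tick of the last update lets each count be decayed lazily, so an arriving transaction only touches the items it contains.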

The method of claim 31, wherein

A data mining system, characterized in that each detailed module is performed to analyze the frequency count information of items so as to search for an item whose degree of support is equal to or greater than a predetermined threshold defined by the user.

The method of claim 32,

The method of claim 33,

A data mining system, characterized in that the occurrence frequency management detailed module includes a pruning operation module that, among the items whose occurrence frequency is managed in the monitoring lattice, removes from the monitoring lattice any item whose support has become sufficiently small because the item has not appeared in the transactions of a certain period, thereby reducing the memory space required for the mining process.
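For illustration only, removing low-support items from the monitoring lattice can be sketched as below. The function name `prune`, the dictionary standing in for the lattice, and the `prune_threshold` parameter are assumptions of this sketch; the claim only requires that the support be "sufficiently small".

```python
def prune(counts, total_transactions, prune_threshold):
    """Remove items whose support fell below prune_threshold.

    counts: dict mapping item -> (possibly decayed) occurrence count.
    total_transactions: positive number of transactions the counts cover.
    Returns the list of removed items.
    """
    removed = [item for item, c in counts.items()
               if c / total_transactions < prune_threshold]
    for item in removed:
        del counts[item]
    return removed
```

Choosing a pruning threshold below the mining threshold leaves a margin, so that an item close to becoming frequent is not discarded prematurely.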

The method of claim 33,

A data mining system, characterized in that it includes an item occurrence frequency and estimation error management structure that manages, separately from the occurrence frequency information of the items, the estimation errors produced while estimating occurrence frequencies for delayed addition, the errors being recorded in the delayed addition step of the occurrence frequency management detailed module and considered in the mining result selection step, so that the error included in the frequent items is limited to a predetermined range defined by the user, thereby increasing the reliability of the mining result.

The method of claim 31, wherein

A data mining system, characterized in that each detailed module is performed to analyze the order of occurrence between unit items and to search for unit item sequential patterns that appear at or above a user-defined threshold.

The method of claim 37, wherein

The method of claim 38,

b) an occurrence frequency estimation module that estimates the occurrence frequency of a sequential pattern to be added with delay from the occurrence frequencies of the partial sequential patterns of that sequential pattern that appeared in previously generated transactions.

The method of claim 38,

A data mining system, characterized in that, so that partial sequential patterns can be selected in consideration of order information when they are used to estimate the occurrence frequency of a sequential pattern, each sequential pattern entry in the monitoring lattice of the occurrence frequency management detailed module has a dual occurrence frequency management structure that manages two pieces of occurrence frequency information, divided according to the start position of the corresponding sequential pattern.

The method of claim 38,

A data mining system, characterized in that it includes a sequential pattern occurrence frequency and estimation error management structure that manages, separately from the occurrence frequency information of the sequential patterns, the estimation errors produced while estimating occurrence frequencies for delayed addition, and in that the estimation error information is considered in the mining result selection step so that the error included in an item sequential pattern is limited to a predetermined range defined by the user, thereby increasing the reliability of the mining result.

The method of claim 31, wherein

A data mining system, characterized in that each detailed module unit performs an analysis of order information between items to search for an item sequential pattern that appears above a user-defined threshold.

The method of claim 43,

The method of claim 44,

A data mining system, characterized in that the occurrence frequency management detailed module comprises a dual lattice structure in which one lattice manages the occurrence frequency information of items, in order to find items whose frequency is above a certain threshold, and the other lattice manages the occurrence frequencies of sequential patterns, analyzing the order information of the items found to be frequent in order to find sequential patterns whose occurrence frequency is above the threshold.

The method of claim 44,

A data mining system, characterized in that, in order to make efficient use of items whose support is above the predetermined threshold during the sequential pattern search, the occurrence frequency management detailed module includes a sequential pattern unit item identifier mapping module that assigns a new sequential pattern unit item identifier when an item is created and, when the item is removed in the sequential pattern search step, terminates the connection between the item and its sequential pattern unit item identifier, so that items appearing in new transactions can be mapped to sequential pattern unit items.

The method of claim 44,

A data mining system comprising:

a) a delayed addition module that reduces memory space usage, in which, because the number of sequential patterns to be managed grows rapidly as transactions increase, and managing occurrence frequency information for every sequential pattern in every transaction is impossible under physical limits such as the available memory space and the growth in mining processing time, a sequential pattern is added to the monitoring lattice, and management of its occurrence frequency begins, only once the pattern has appeared repeatedly in transactions and has support large enough to affect the determination of item sequential patterns.

The method of claim 44,

A data mining system, characterized in that, so that partial sequential patterns can be selected in consideration of order information when they are used to estimate the occurrence frequency of a sequential pattern, each sequential pattern entry in the monitoring lattice that manages sequential pattern occurrence frequencies in the detailed module has a dual occurrence frequency management structure that manages two pieces of occurrence frequency information, divided according to the start position of the corresponding sequential pattern.

The method of claim 44,

The method of claim 1,

A data mining system that manages stage-by-stage information separately, characterized in that the mining result set selection detailed module maintains, as a separate set, the mining result set obtained by mining up to the previous time point or the previous transaction, so that the mining result for the entire data set is obtained, without searching the entire monitoring lattice, by analyzing only the occurrence information of the items appearing in newly added transactions and reflecting it in the maintained mining result set.

The method of claim 1,

An open data mining system that processes multiple transactions simultaneously, characterized in that basic analysis is performed on multiple transactions occurring at the same time and the occurrence frequencies of the items appearing in those transactions are updated in the monitoring lattice at once, thereby reducing the time needed to perform mining and obtain a mining result.
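For illustration only, updating the lattice once for a group of simultaneous transactions can be sketched as below. The function name `apply_batch` and the dictionary standing in for the monitoring lattice are assumptions of this sketch.

```python
from collections import Counter


def apply_batch(lattice, batch):
    """Aggregate a batch of transactions first, then update the lattice once.

    lattice: dict mapping item -> occurrence count (stands in for the
    monitoring lattice). Returns the number of transactions applied.
    """
    delta = Counter()
    for transaction in batch:
        delta.update(set(transaction))   # each item counts once per transaction
    for item, c in delta.items():
        lattice[item] = lattice.get(item, 0) + c
    return len(batch)
```

Each lattice entry is written once per batch rather than once per transaction, which is where the time saving would come from.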

The method of claim 1,

A data mining system based on analysis of changes in usefulness indicators, comprising a) a mining result selection module based on support change rate analysis, which selects mining results in consideration of the rate of change of the current support of an item or sequential pattern, characterized in that the rate of change of the indicators of usefulness of the items obtained as mining results (support, reliability, persuasion, interest, severity, distribution, correlation, and similarity) is analyzed, so that, in addition to analyzing the information in the data set up to the present, the mining result set is obtained in a way that supports predicting how the corresponding usefulness indicator of an item will change in future data sets.

The method of claim 1,

A data mining system comprising a) a mining result selection module based on support change acceleration analysis, which selects mining results in consideration of the current acceleration of change of the support of an item or sequential pattern, characterized in that the acceleration of change of the indicators of usefulness of the items obtained as mining results (support, reliability, persuasion, interest, hardness, distribution, correlation, and similarity) is analyzed, so that, in addition to analyzing the information in the data set up to the present, the mining result set is obtained in a way that supports predicting how the rate of change of the corresponding usefulness indicator of an item will change in future data sets.
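For illustration only, the change rate and change acceleration of an indicator such as support can be pictured as first and second finite differences over readings taken at equal intervals. The claims do not specify how these quantities are computed, so the finite-difference choice and the function names below are assumptions of this sketch.

```python
def change_rate(history):
    """First difference of the last two readings (change per interval)."""
    return history[-1] - history[-2]


def change_acceleration(history):
    """Second difference of the last three readings."""
    return history[-1] - 2 * history[-2] + history[-3]
```

A positive rate with a positive acceleration would suggest an item whose support is rising faster and faster, which is the kind of forward-looking signal the claims describe.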

The method of claim 1,

A data mining system based on data set characteristic analysis, comprising a data set characteristic analysis and refinement module that refines the data set to be mined using data set characteristic analysis indicator values, so that the data mining result better reflects the characteristics of the application domain.

The method of claim 1,

A data mining system comprising a mining result set selection detailed module that obtains a mining result in consideration of a user-defined minimum reference value for a data set characteristic analysis indicator value.

The method of claim 1,

An integrated system including an idle item collector, characterized by comprising an idle item collection module that, for an item or sequential pattern whose occurrence frequency no longer needs to be managed during the mining process, makes the memory space it occupied available for reuse.
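For illustration only, reusing the storage of entries that are no longer monitored can be sketched with a free list. The class name `NodePool` and its fields are assumptions of this sketch and say nothing about how the claimed collector is actually built.

```python
class NodePool:
    """Slot storage with a free list, so released slots are reused."""

    def __init__(self):
        self.slots = []    # one count per slot
        self.free = []     # indices of slots released by removed items
        self.index = {}    # item -> slot index

    def acquire(self, item):
        """Return the slot of `item`, reusing a released slot if one exists."""
        if item in self.index:
            return self.index[item]
        if self.free:
            i = self.free.pop()
            self.slots[i] = 0          # reset the recycled slot
        else:
            i = len(self.slots)
            self.slots.append(0)
        self.index[item] = i
        return i

    def release(self, item):
        """Stop monitoring `item` and mark its slot as reusable."""
        self.free.append(self.index.pop(item))
```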

A data mining method in which, for an unbounded data set consisting of transactions that occur continuously in an application domain, each newly generated transaction is read only once and data mining is performed to obtain the information inherent in the enlarged data set that includes the transaction, the method comprising:

a) a transaction data reading step of reading the newly generated transaction to be analyzed and analyzing the item information it contains;

b) an occurrence frequency management step of incrementally managing, using a monitoring lattice, the occurrence frequency information of the items appearing in the newly generated transaction;

c) a mining result selection step of analyzing the occurrence frequencies of the items managed in the monitoring lattice to find frequent items or frequent patterns; and

d) a mining result presentation step of presenting the frequent items or frequent sequential patterns to the user.
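For illustration only, steps a) to d) above can be sketched as a single-pass loop in Python. The class name `StreamMiner`, the dictionary standing in for the monitoring lattice, and the `max_size` cap on enumerated itemsets are assumptions of this sketch; exhaustive enumeration is used here only for brevity.

```python
from itertools import combinations


class StreamMiner:
    def __init__(self, min_support, max_size=3):
        self.min_support = min_support
        self.max_size = max_size   # hypothetical cap to keep enumeration small
        self.lattice = {}          # frozenset(itemset) -> occurrence count
        self.n = 0                 # transactions processed so far

    def read(self, raw):
        """(a) Parse one transaction into a sorted tuple of distinct items."""
        return tuple(sorted(set(raw)))

    def update(self, items):
        """(b) Incrementally update the count of every itemset in the transaction."""
        self.n += 1
        for k in range(1, min(self.max_size, len(items)) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                self.lattice[key] = self.lattice.get(key, 0) + 1

    def select(self):
        """(c) Keep the itemsets whose support reaches the threshold."""
        return {s: c / self.n for s, c in self.lattice.items()
                if c / self.n >= self.min_support}

    def present(self):
        """(d) Return the result as sorted rows of (items, support)."""
        return sorted((sorted(s), round(sup, 3))
                      for s, sup in self.select().items())
```

Each transaction is read once and then discarded; the result at any moment comes from the maintained counts alone.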

The method of claim 58,

A data mining method, characterized in that, according to a simple accumulation method, the transactions to be analyzed are treated as always having the same importance regardless of the passage of time or the increase in transactions, and each detailed step is performed to analyze the information inherent in the data set.

The method of claim 59,

A data mining method, characterized in that each step is performed to analyze the occurrence frequency information of items and to search for items whose support is greater than or equal to a predetermined threshold defined by the user.

The method of claim 60,

In order to provide a mining result in a short time when time changes or new transactions are increased, it maintains the frequency information of all items that appeared in all transactions that occurred before, without re-analyzing the transactions included in the entire data set. A data mining method, characterized in that the method of managing the frequency of occurrence and the method of selecting a mining result are performed to obtain a mining result by analyzing only the added transaction.

The method of claim 61,

A data mining method, characterized in that the occurrence frequency management step comprises:

a) an occurrence frequency estimation step of estimating the occurrence frequency of an item to be added with delay from the occurrence frequencies of all sub-items of that item that appeared in previously generated transactions; or

b) an occurrence frequency estimation step of estimating the occurrence frequency of an item to be added with delay from only the occurrence frequencies of the (n-1)-sub-items of that item that appeared in previously generated transactions; and

c) a delayed addition step that reduces memory space usage, in which, because the number of items to be managed grows rapidly as transactions increase, and managing occurrence frequency information for every item in every transaction is impossible under physical limits such as the available memory space and the growth in mining processing time, an item is added to the monitoring lattice, and management of its occurrence frequency begins, only once the item has appeared repeatedly in transactions and has support high enough to affect the determination of frequent items.
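For illustration only, estimating from (n-1)-subsets and then adding with delay can be sketched as below. The claim does not give the estimator; this sketch uses the minimum count among the (n-1)-subsets, which is a standard upper bound on an itemset's count, and that choice, together with the names `estimate_count`, `maybe_insert`, and `insert_threshold`, is an assumption. The sketch applies to itemsets of two or more items; single items would need separate handling.

```python
from itertools import combinations


def estimate_count(itemset, lattice):
    """Upper bound on the count of `itemset` from its (n-1)-subsets.

    lattice: dict mapping frozenset -> count. Returns None if any
    (n-1)-subset is not currently monitored.
    """
    items = tuple(sorted(itemset))
    subsets = [frozenset(c) for c in combinations(items, len(items) - 1)]
    if any(s not in lattice for s in subsets):
        return None
    return min(lattice[s] for s in subsets)


def maybe_insert(itemset, lattice, total, insert_threshold):
    """Delayed addition: start monitoring only if the estimate is large enough."""
    key = frozenset(itemset)
    if key in lattice:
        return True
    est = estimate_count(key, lattice)
    if est is not None and est / total >= insert_threshold:
        lattice[key] = est     # monitoring starts from the estimated count
        return True
    return False
```

An itemset can never occur more often than its least frequent subset, so an itemset whose bound is below the threshold can safely stay out of the lattice.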

The method of claim 61,

A data mining method, characterized in that the occurrence frequency management step comprises a pruning operation step of removing an item from the monitoring lattice when, among the items managed in the monitoring lattice, the item has not appeared in transactions for a considerable period and its support has therefore become sufficiently small, thereby reducing the memory space required for the mining process.

The method of claim 61,

A data mining method, characterized in that, in addition to the occurrence frequency information of the items, the estimation error produced while estimating occurrence frequencies for delayed addition is managed separately for each item, starting from the delayed addition step, and the estimation error information is considered in the mining result selection step so that the error included in the frequent items is limited to a predetermined range defined by the user, thereby increasing confidence in the mining result.
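For illustration only, keeping the estimated part of a count next to the count and filtering on it at selection time can be sketched as below. How the error is measured is not stated in the claim; the ratio of the estimated part to the whole count, and the names `select_reliable` and `max_error`, are assumptions of this sketch.

```python
def select_reliable(entries, total, min_support, max_error):
    """Select frequent itemsets whose estimated share is within max_error.

    entries: dict mapping itemset -> (count, estimated_part), where
    estimated_part is the portion of count that came from estimation at
    delayed addition rather than from observed transactions.
    """
    result = {}
    for key, (count, estimated_part) in entries.items():
        if (count > 0
                and count / total >= min_support
                and estimated_part / count <= max_error):
            result[key] = count / total
    return result
```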

The method of claim 59,

A method of mining data, characterized in that each detailed method is performed to analyze the order of occurrence between unit items and to search for a sequential pattern of unit items that appears above a user-defined threshold.
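For illustration only, counting order-preserving patterns of unit items can be sketched as below. The function name `count_sequences` and the restriction to patterns of a fixed length are assumptions of this sketch; a pattern is counted at most once per transaction.

```python
from itertools import combinations


def count_sequences(transactions, length=2):
    """Count ordered subsequences of the given length across transactions.

    Each transaction is a list whose order is the order of occurrence.
    """
    counts = {}
    for t in transactions:
        # combinations() keeps positional order, so ("a", "b") is only
        # produced when "a" occurs before "b" in the transaction.
        for seq in set(combinations(t, length)):
            counts[seq] = counts.get(seq, 0) + 1
    return counts
```

Unlike itemset counting, `("a", "b")` and `("b", "a")` are separate entries here, which is what distinguishes a sequential pattern from a frequent item set.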

The method of claim 65,

In order to provide mining results in a short time when time changes or new transactions are increased, the frequency counts of all the sequential patterns that appeared in all previously occurring transactions are maintained, without reanalyzing the transactions included in the entire data set. A data mining method, characterized in that a method of managing the frequency of occurrence and a method of selecting a mining result are performed to analyze only newly added transactions and provide a mining result.

The method of claim 66,

A data mining method, characterized in that the occurrence frequency management step comprises:

a) an occurrence frequency estimation step of estimating the occurrence frequency of a sequential pattern to be added with delay from the occurrence frequencies of all partial sequential patterns of that sequential pattern that appeared in previously generated transactions; or

b) an occurrence frequency estimation step of estimating the occurrence frequency of a sequential pattern to be added with delay from only the occurrence frequencies of the (n-1)-partial sequential patterns of that sequential pattern that appeared in previously generated transactions; and

c) a delayed addition step that reduces memory space usage, in which, because the number of sequential patterns to be managed grows rapidly as transactions increase, and managing occurrence frequency information for every sequential pattern in every transaction is impossible under physical limits such as the available memory space and the growth in mining processing time, a sequential pattern is added to the monitoring lattice, and management of its occurrence frequency begins, only once the pattern has appeared repeatedly in transactions and has support large enough to affect the determination of unit sequential patterns.

The method of claim 66,

A data mining method, characterized in that the occurrence frequency management step comprises a pruning operation step of removing a sequential pattern from the monitoring lattice when, among the sequential patterns managed in the monitoring lattice, the pattern has not appeared in transactions for a long period and its support has therefore become sufficiently small, thereby reducing the memory space required for the mining process.

The method of claim 66,

A data mining method for simple accumulation unit item sequential pattern search considering a maximum allowable error, characterized in that, in addition to the occurrence frequency information of the sequential patterns, the estimation error produced while estimating occurrence frequencies for delayed addition is managed separately for each sequential pattern, starting from the delayed addition step, so that the included error is limited to a predetermined range defined by the user, thereby increasing the reliability of the mining result.

The method of claim 59,

A data mining method for simple accumulation item sequential pattern search, characterized in that, based on a dual lattice structure, each step is performed to analyze the occurrence order information of items and to search for item sequential patterns that appear at or above a user-defined threshold.

The method of claim 70,

In order to provide mining results in a short time when time changes or new transactions are increased, the frequency counts of all the sequential patterns that appeared in all previously occurring transactions are maintained, without reanalyzing the transactions included in the entire data set. A data mining method comprising a simple accumulated real-time item sequential pattern searching method, characterized in that a frequency count management method and a mining result selection method are performed to obtain a mining result by analyzing only a newly added transaction.

The method of claim 71, wherein

A data mining method, characterized in that, in order to make efficient use of items whose support is above a certain threshold, the occurrence frequency management step comprises a sequential pattern unit item identifier mapping step of assigning a separate sequential pattern unit item identifier when an item is created and, when a sequential pattern unit item is removed in the sequential pattern search step, terminating the connection between the item and its sequential pattern unit item identifier, so that the identifiers can be used when items appearing in new transactions are mapped to sequential pattern unit items.

The method of claim 71, wherein

b) an occurrence frequency estimation step of estimating the occurrence frequency of a sequential pattern to be added with delay from only the occurrence frequencies of the (n-1)-partial sequential patterns of that sequential pattern that appeared in previously generated transactions; and

c) a delayed addition step that reduces memory space usage, in which, because the number of sequential patterns to be managed grows rapidly as transactions increase, and managing occurrence frequency information for every sequential pattern in every transaction is impossible under physical limits such as the available memory space and the growth in mining processing time, a sequential pattern is added to the monitoring lattice, and management of its occurrence frequency begins, only once the pattern has appeared repeatedly in transactions and has support large enough to affect the determination of sequential patterns.

The method of claim 71, wherein

A data mining method, characterized in that, in addition to the occurrence frequency information of the sequential patterns, the estimation error produced while estimating occurrence frequencies for delayed addition is managed separately for each sequential pattern, starting from the delayed addition step, and the estimation error information is considered in the mining result selection step so that the error included in an item sequential pattern is limited to a certain range defined by the user, thereby increasing the reliability of the mining result.

The method of claim 58,

A data mining method according to a window method, comprising a) a step of differentiating the importance of transactions by their time of occurrence, in which, when only the transactions of a recent period or a recent number of transactions are significant, the information in earlier transactions is ignored, characterized in that the importance of each transaction is differentiated according to the passage of time or the increase in transactions, and each step is performed on the transactions that occurred within the most recent period defined by the user, or on the most recent user-defined number of transactions, to analyze the information inherent in the data set within the specified range.

The method of claim 76,

A data mining method, characterized in that each step is performed to analyze the occurrence frequency information of items and to search for items whose support is at or above a certain threshold defined by the user.

The method of claim 77, wherein

A data mining method, characterized by including a transaction window update step that, when the current time point changes because time passes or additional transactions occur, redefines the transaction window defining the search range relative to the present and updates the occurrence frequency information managed in the monitoring lattice for the items that appeared in transactions which were in the previous window but are not in the changed window, and in that, to provide a mining result in a short time when time passes or new transactions are added, the occurrence frequency management step and the mining result selection step maintain the occurrence frequency information of all items that appeared in previously occurring transactions, so that a mining result is obtained by analyzing only the newly added transactions without re-analyzing the transactions of the entire data set.

The method of claim 76,

A data mining method, characterized in that each detailed step is performed to analyze the order of occurrence of unit items and to search for unit item sequential patterns that appear at or above a user-defined threshold.

The method of claim 79,

A data mining method for real-time unit item sequential pattern search by the window method, characterized by including a transaction window update step that, when the current time point changes because time passes or additional transactions occur, redefines the transaction window defining the search range relative to the present and updates the occurrence frequency information managed in the monitoring lattice for the items that appeared in transactions which were in the previous window but are not in the changed window, and in that, to provide a mining result in a short time when time passes or new transactions are added, the occurrence frequency management step and the mining result selection step maintain the occurrence frequency information of all sequential patterns that appeared in previously occurring transactions, so that a mining result is obtained by analyzing only the newly added transactions without re-analyzing the transactions of the entire data set.

The method of claim 76,

A data mining method, characterized in that, based on a dual lattice structure, each step is performed to analyze the order information between items and to search for item sequential patterns that appear at or above a user-defined threshold.

The method of claim 81, wherein

A data mining method, characterized in that, in order to make efficient use of items whose support is above a certain threshold during the sequential pattern search, the occurrence frequency management step comprises a sequential pattern unit item identifier mapping step of assigning a separate sequential pattern unit item identifier when an item is created and, when a sequential pattern unit item is removed in the sequential pattern search step, terminating the connection between the item and its sequential pattern unit item identifier, so that the identifiers can be used when items appearing in new transactions are mapped to sequential pattern unit items.

The method of claim 81, wherein

A data mining method for real-time item sequential pattern search by the window method, characterized by including a transaction window update step that, when the current time point changes because time passes or additional transactions occur, redefines the transaction window defining the search range relative to the present and updates the occurrence frequency information managed in the monitoring lattice for the items that appeared in transactions which were in the previous window but are not in the changed window, and in that, to provide a mining result in a short time when time passes or new transactions are added, the occurrence frequency management step and the mining result selection step maintain the occurrence frequency information of all sequential patterns that appeared in previously occurring transactions, so that a mining result is obtained by analyzing only the newly added transactions without re-analyzing the transactions of the entire data set.

The method of claim 58,

A data mining method according to a decay rate method, comprising:

a) a step of differentiating the importance of transactions as unit times or unit numbers of transactions elapse, using a decay rate that indicates how much importance decreases per unit time or unit number of transactions and that is based on a decay unit time, a decay default value, and a decay fundamental period defined by the user; and

b) an item occurrence frequency update step of, when the current time point changes because time passes or additional transactions occur, updating the occurrence frequencies of previously generated items by applying the decay rate to them,

characterized in that the importance of each transaction is differentiated so that it decays by the decay rate per predetermined unit as time passes, and each step is performed to analyze the information inherent in the data set within a specified range.

The method of claim 84,

A data mining method, characterized in that each step is performed to analyze the occurrence frequency information of items and to search for items whose support is at or above a certain threshold defined by the user.

The method of claim 85,

The method of claim 86,

b) an occurrence frequency estimation step of estimating the occurrence frequency of an item to be added with delay from only the occurrence frequencies of the (n-1)-sub-items of that item that appeared in previously generated transactions; and

The method of claim 86,

A data mining method, characterized in that the occurrence frequency management step comprises a pruning operation step of removing an item from the monitoring lattice when, among the items whose occurrence frequency is managed in the monitoring lattice, the item has not appeared in transactions for a considerable period and its support has therefore become sufficiently small, thereby reducing the memory space required for the mining process.

The method of claim 86,

The method of claim 84,

A data mining method, characterized in that each detailed step is performed to analyze the order of occurrence of unit items and to search for unit item sequential patterns that appear at or above a user-defined threshold.

The method of claim 90,

In order to provide mining results in a short time when time changes or new transactions are increased, the frequency counts of all the sequential patterns that appeared in all previously occurring transactions are maintained, without reanalyzing the transactions included in the entire data set. A data mining method, characterized in that the method of managing the frequency of occurrence and the method of selecting a mining result are performed to obtain a mining result by analyzing only a newly added transaction.

The method of claim 91, wherein

b) an occurrence frequency estimation step of estimating the occurrence frequency of a sequential pattern to be added with delay from the occurrence frequencies of the (n-1)-partial sequential patterns of that sequential pattern that appeared in previously generated transactions.

92. The method of claim 91 wherein

A data mining method wherein the occurrence frequency management method includes a pruning operation method that reduces the memory space required for the mining process by removing from the monitoring lattice any sequential pattern, among the sequential patterns managed in the monitoring lattice, whose support has become sufficiently small because the pattern has not appeared in the transactions generated over a long period of time.

92. The method of claim 91 wherein

A data mining method characterized in that, in addition to the occurrence count information of each sequential pattern, the estimation error arising when the occurrence count is estimated for delayed addition is managed separately for each sequential pattern from the delayed addition step, and confidence in the mining results is increased by limiting the error they contain to a range defined by the user.

85. The method of claim 84,

A data mining method characterized in that each detailed step is performed so as to search for item sequential patterns by the decay rate (attenuation rate) method, which analyzes the occurrence order information of the items based on the double lattice structure and searches for the sequential patterns whose support is at or above a user-defined threshold.

96. The method of claim 95,

97. The method of claim 96,

A data mining method wherein, so that items whose support exceeds a certain threshold can be used efficiently in the sequential pattern search process, the occurrence frequency management method includes a sequential pattern unit item identifier mapping method that newly assigns a separate sequential pattern unit item identifier when such an item is created and, when a sequential pattern unit item is removed in the sequential pattern search step, terminates the association between the item and its sequential pattern unit item identifier, so that items appearing in a new transaction can be used once they are mapped to sequential pattern unit items.

97. The method of claim 96,

A data mining method wherein the occurrence frequency management method includes: a) an occurrence-count estimation method that estimates the occurrence count of a delay-added sequential pattern from the occurrence counts of all partial sequential patterns of that pattern that appeared in previously generated transactions; or

b) an occurrence-count estimation method that estimates the occurrence count of a delay-added sequential pattern from the occurrence counts of the (n-1)-partial sequential patterns of that pattern that appeared in previously generated transactions; and

c) a delayed addition method for reducing memory space usage, in which, because the number of managed sequential patterns grows rapidly as transactions increase and managing occurrence counts for the sequential patterns of all transactions is constrained by physical limits such as available memory space and mining processing time, a sequential pattern is added to the monitoring lattice, and management of its occurrence frequency begins, only once it has appeared repeatedly in transactions and its support is large enough to affect the determination of unit sequential patterns.
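Options b) and c) can be illustrated together in a small sketch. The claims do not state the estimation formula, so the sketch assumes one simple choice: the estimate is the minimum count among the (n-1)-partial sequential patterns, which can never be lower than the true count because a pattern cannot occur more often than any of its sub-patterns. The function names and the `insert_support` threshold are hypothetical.

```python
def sub_patterns(pattern):
    """All (n-1)-length sub-patterns obtained by dropping one element."""
    return {pattern[:i] + pattern[i + 1:] for i in range(len(pattern))}


def try_delayed_insert(counts, pattern, total, insert_support):
    """Start monitoring `pattern` only if its estimated support is large enough.

    counts maps tuples (sequential patterns) to occurrence counts.
    Single-item patterns are assumed to be inserted by a separate path,
    so this function returns False for them.
    """
    subs = sub_patterns(pattern)
    if any(s not in counts for s in subs):
        return False  # some sub-pattern is not monitored yet
    estimate = min(counts[s] for s in subs)  # assumed estimator (upper bound)
    if estimate / total < insert_support:
        return False  # too rare to be worth the memory
    counts[pattern] = estimate
    return True
```

Patterns that fail the test cost no memory at all, which is the point of delaying the addition.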

97. The method of claim 96,

A data mining method characterized in that, in addition to the occurrence count information of each sequential pattern, the estimation error arising when the occurrence count is estimated for delayed addition is managed separately for each sequential pattern from the delayed addition step, and, in the mining result selection step, the estimation error information is taken into account so that the error included in the item sequential patterns is limited to a predetermined range defined by the user, thereby increasing the reliability of the mining results.
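One way to picture the separate error bookkeeping is a record that stores the possible overestimate next to the count, and a selection step that filters on both. This is a sketch under assumptions: the error is taken to be the largest amount by which the count may exceed the true count, and the user-defined range is expressed as a ratio of the transaction total. `Monitored`, `select_results` and `max_error_ratio` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Monitored:
    count: float  # estimated prior count plus occurrences counted since insertion
    error: float  # largest possible overestimate carried over from the estimate


def select_results(lattice, total, min_support, max_error_ratio):
    """Return patterns that are frequent and whose error is within the user bound."""
    return sorted(
        pattern
        for pattern, m in lattice.items()
        if m.count / total >= min_support and m.error / total <= max_error_ratio
    )


lattice = {
    ("a",): Monitored(count=40, error=0),
    ("a", "b"): Monitored(count=30, error=25),
    ("b",): Monitored(count=5, error=0),
}
strict = select_results(lattice, 100, min_support=0.2, max_error_ratio=0.1)
loose = select_results(lattice, 100, min_support=0.2, max_error_ratio=0.3)
```

With the strict bound, the pattern whose count is mostly an estimate is withheld even though its support passes the threshold.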

The method of claim 58,

A data mining method characterized in that, in order to obtain mining results step by step, the mining results obtained up to the previous time point or the previous transaction are maintained in a separate management step, and, without searching the entire monitoring lattice, only the occurrence information of the items appearing in the newly added transaction is analyzed and reflected in the previous mining results, the mining result information being managed separately in the mining result set selection method so that the mining result for the entire data set is obtained.

The method of claim 58,

A data mining method wherein the step of open data mining by simultaneous processing of multiple transactions comprises: a) analyzing the redundancy of items or transactions across the multiple transactions at once, performing a basic analysis on the multiple transactions that occurred, and updating the occurrence frequencies of the items appearing in those transactions in the monitoring lattice in a single pass, thereby reducing mining execution time.
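A minimal sketch of the batch idea follows. It assumes that redundancy analysis means grouping identical transactions in the batch, so that each distinct transaction is expanded once and its itemsets are credited with the group size. The function name and the `max_len` cap are hypothetical.

```python
from collections import Counter
from itertools import combinations


def update_batch(counts, batch, max_len=2):
    """Update itemset counts for a whole batch; returns the number of distinct transactions."""
    # Redundancy analysis: identical transactions collapse into one entry.
    distinct = Counter(frozenset(t) for t in batch)
    for items, multiplicity in distinct.items():
        ordered = sorted(items)
        for k in range(1, min(max_len, len(ordered)) + 1):
            for subset in combinations(ordered, k):
                counts[subset] += multiplicity  # one update per distinct itemset
    return len(distinct)


counts = Counter()
n_distinct = update_batch(counts, [["a", "b"], ["b", "a"], ["a"], ["a", "b"]])
```

Here four transactions cause only two expansions, which is where the saving in execution time would come from when a batch contains many repeats.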

The method of claim 58,

The step of analyzing changes in the usefulness indicators includes: a) a method of defining the rate of change of a usefulness indicator, in which a fixed unit time is defined, the change in the indicator during that unit time is defined as the velocity of the indicator, and, in the decay rate method, the amount of change is defined in consideration of the decay rate; and

b) a method of selecting a mining result through support change rate analysis, in which the mining result is selected in consideration of the current rate of change of the support of an item or a sequential pattern,

A data mining method characterized in that the rate of change of support, reliability, persuasion, interest, hardness, distribution, correlation, and similarity, which are the usefulness indicators of the items obtained as mining results, is analyzed, and mining results are selected in consideration of that rate of change, so as to predict how the corresponding indicator of each item will change not only in the data set accumulated so far but also in the data to be generated later.
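The velocity of an indicator and its use in selection can be sketched as below. The sketch assumes the indicator (support here) is sampled once per unit time, ignores the decay rate for brevity, and uses a linear extrapolation as the selection rule. That rule, like the names `velocity`, `select_by_velocity` and `horizon`, is an assumption and is not taken from the claims.

```python
def velocity(history):
    """Change of the indicator over the last unit time (one sample per unit)."""
    return history[-1] - history[-2]


def select_by_velocity(histories, min_support, horizon=1):
    """Keep patterns whose extrapolated support reaches min_support."""
    return sorted(
        pattern
        for pattern, h in histories.items()
        if h[-1] + horizon * velocity(h) >= min_support
    )


histories = {
    "ab": [0.10, 0.16],  # below the threshold now, but rising
    "cd": [0.30, 0.21],  # above the threshold now, but falling
    "ef": [0.05, 0.06],
}
selected = select_by_velocity(histories, min_support=0.2)
```

In this example the rising pattern is selected and the falling one is not, which differs from what a test on the current support alone would return.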

The method of claim 58,

a) a method of defining the acceleration of change of a usefulness indicator, in which a fixed unit time is defined, the change in the rate of change of the indicator during that unit time is defined as the acceleration of the indicator, and the change in the rate of change is defined in consideration of the decay rate; and

b) a method of selecting a mining result through support change acceleration analysis, in which the mining result is selected in consideration of the current acceleration of the support change of an item or a sequential pattern;

A data mining method characterized in that the acceleration of change of support, reliability, persuasion, interest, hardness, distribution, correlation, and similarity, which are the usefulness indicators of the items obtained as mining results, is analyzed, and mining results are selected in consideration of that acceleration, so as to predict how the rate of change of the corresponding indicator of each item will change not only in the data set accumulated so far but also in the data to be generated later.
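The acceleration of an indicator can be sketched as a second difference of its sampled values. The sketch assumes one sample per unit time and leaves out the decay rate; `predicted_velocity` extrapolates under constant acceleration, which is an assumption made only for illustration, and both function names are hypothetical.

```python
def acceleration(history):
    """Change of the velocity over the last unit time (second difference)."""
    v_now = history[-1] - history[-2]
    v_prev = history[-2] - history[-3]
    return v_now - v_prev


def predicted_velocity(history, units_ahead=1):
    """Velocity expected after units_ahead unit times, assuming constant acceleration."""
    current_velocity = history[-1] - history[-2]
    return current_velocity + units_ahead * acceleration(history)


support_history = [0.10, 0.12, 0.18]  # three consecutive unit-time samples
```

A positive acceleration on a rising indicator suggests that the rate of change itself is still growing, which is the kind of forward-looking signal the claim describes.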

The method of claim 58,

A data mining method characterized in that, in the step related to data set feature segmentation, a data set feature analysis and refinement method is used to refine the data set to be mined by using data set feature analysis indicator values.

The method of claim 58,

A data mining method characterized in that a minimum reference value is set for a data set feature analysis indicator value and a mining result set that satisfies it is obtained.

107. The method of claim 105 or 106,

A data mining method comprising a method of analyzing and refining the characteristics of a single transaction by obtaining the length of the transaction to be analyzed, and of obtaining the average transaction length for the entire data set.
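The transaction length indicator can be maintained as a running average, as in the sketch below. The refinement rule, which drops a transaction when it is much longer than the data set average, is a hypothetical example of how the indicator might be used; the claims do not state the rule. `LengthProfile`, `keep` and `max_ratio` are illustrative names.

```python
class LengthProfile:
    """Running average of transaction length over the data set seen so far."""

    def __init__(self):
        self.n = 0
        self.total_len = 0

    def observe(self, transaction):
        self.n += 1
        self.total_len += len(transaction)

    @property
    def average(self):
        return self.total_len / self.n if self.n else 0.0


def keep(transaction, profile, max_ratio=3.0):
    """Hypothetical refinement rule: reject transactions far longer than average."""
    return profile.n == 0 or len(transaction) <= max_ratio * profile.average
```

The running totals mean the average for the entire data set is available after every transaction without rescanning earlier ones.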

107. The method of claim 105 or 106,

A data mining method comprising a method of analyzing and refining the characteristics of a single transaction by analyzing the reflection rate of the items appearing in the transaction to be analyzed, and of obtaining the average item reflection rate for the entire data set.

107. The method of claim 105 or 106,

A data mining method comprising a method of analyzing and refining the characteristics of a single transaction by analyzing the average length of the frequent items among the items appearing in the transaction to be analyzed, and of obtaining the average length of the frequent items for the entire data set.

107. The method of claim 105 or 106,

A data mining method comprising a method of analyzing and refining the characteristics of a single transaction by analyzing the average map of the frequent items among the items appearing in the transaction to be analyzed, and of obtaining the average map of the frequent items for the entire data set.

107. The method of claim 105 or 106,

A data mining method comprising a method of analyzing and refining the characteristics of a single transaction by analyzing the average rule score of the frequent items among the items appearing in the transaction to be analyzed, and of obtaining the average rule score of the frequent items for the entire data set.

107. The method of claim 105 or 106,

A data mining method comprising a method of analyzing and refining the characteristics of a single transaction by analyzing the distribution ratio of the support of the frequent items among the items appearing in the transaction to be analyzed, and of obtaining the distribution ratio of the support of the frequent items for the entire data set.

The method of claim 58,

A data mining method comprising an idle item collection method in which, at the idle item collector stage of the integrated system, the memory space allocated to an item or a sequential pattern whose occurrence frequency no longer needs to be managed in the mining process is collected so that the space can be reused for another item or sequential pattern.
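Collecting idle entries for reuse resembles a free-list pool, sketched below. This is an analogy under assumptions: lattice entries are modeled as dictionaries, and `NodePool`, `acquire` and `release` are hypothetical names. The actual memory layout of the claimed system is not described in this section.

```python
class NodePool:
    """Free-list pool that hands released lattice entries back out."""

    def __init__(self):
        self._free = []      # entries collected from patterns no longer monitored
        self.allocated = 0   # number of entries ever created

    def acquire(self, pattern):
        """Return an entry for `pattern`, reusing a collected one when available."""
        if self._free:
            node = self._free.pop()
        else:
            node = {}
            self.allocated += 1
        node.update(pattern=pattern, count=0.0)
        return node

    def release(self, node):
        """Collect an entry whose pattern no longer needs frequency management."""
        node.clear()
        self._free.append(node)
```

As long as removals keep pace with insertions, `allocated` stops growing, which reflects the bounded memory use the claim aims for.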