KR100776640B1

KR100776640B1 - System and method for finding the time sensitive frequent itemsets

Info

Publication number: KR100776640B1
Application number: KR1020050089986A
Authority: KR
Inventors: 이주홍; 안찬민; 박태수
Original assignee: 인하대학교 산학협력단
Priority date: 2005-09-27
Filing date: 2005-09-27
Publication date: 2007-11-16
Also published as: KR20070035300A

Abstract

The present invention provides a system and method for finding relative frequent items in a data mining system by using time differences. The system comprises an input module; a search module that finds frequent items in the data stream received through the input module by counting the items that appear within given transactions, and finds relative frequent items by using the time differences between occurrences of items over those transactions; and a storage module that stores the frequent items found by the search module. The invention can find time-sensitive frequent items, improves the accuracy of frequent-item search, and uses limited memory efficiently.

Data mining, transaction, relative frequent item, time difference, search, data stream

Description

System and method for searching relative frequent items using time difference {SYSTEM AND METHOD FOR FINDING THE TIME SENSITIVE FREQUENT ITEMSETS}

FIG. 1 is a system configuration diagram according to the present invention.

FIG. 2 is a table of the basic elements used to search for relative frequent items according to the present invention.

FIG. 3 is a diagram illustrating relative frequent items according to the present invention.

FIG. 4 shows the algorithm of the method for searching relative frequent items using time differences according to the present invention.

FIG. 5 is a flowchart of FIG. 4.

FIG. 6 is a graph showing, as an experimental example of the present invention, the accuracy for frequent items and relative frequent items.

FIG. 7 is a graph showing, as an experimental example of the present invention, the average execution time as a function of minimum support.

FIG. 8 is a graph showing, as an experimental example of the present invention, FP-Tree memory usage.

<Explanation of reference numerals for the main parts of the drawings>

10: input module 20: search module

30: storage module

The present invention relates to a data mining system, and more particularly to a system and method for searching relative frequent items using time differences, which finds relative frequent items in a data stream by taking the temporal aspect into account.

Recently, advances in storage devices and networks have caused data volumes to grow rapidly and continuously. For example, many applications such as network intrusion detection, ubiquitous computing, and e-commerce generate large volumes of data, and considerable effort is being made across many fields to extract valuable information from these environments.

In general, in a data set subject to data mining, every unit of information that appears in the application domain is defined as an item, and a group of items that have semantic concurrency (that is, that semantically occur together) in the application domain is defined as a transaction. A transaction holds the items that share this semantic concurrency, and the data set to be analyzed is defined as the set of transactions generated in the application domain.

Extracting valuable information from data streams with data mining techniques is a major research area. A data stream is characterized by data that grows continuously and very quickly, so processing data streams in data mining must satisfy the following conditions.

First, because it is impossible to store all of the rapidly growing data in limited storage, a method is needed that uses memory flexibly and stores the data efficiently without loss of information.

Second, because data in a stream is generated very quickly and the mining result at the current moment is what matters, mining results must be produced as soon as they are requested. To do so, each transaction in the data stream must be read exactly once, as soon as it is generated, and the mining result must be produced immediately (Jang Joong-hyuk, & Lee Won-seok (2003). Searching for frequent items based on open data mining in data streams. Journal of the Information Processing Society, 10-D(3)).

The most basic problem in data streams is finding frequent items (Charikar, M., Chen, K., & Farach-Colton, M. (2002). Finding Frequent Items in Data Streams. International Colloquium on Automata, Languages, and Programming, 508-515).

Conventional data mining techniques operate on static transactions: they build a set of candidate frequent items in one scan and then find the frequent items whose support exceeds a predefined threshold, so they consume a lot of memory and take a long time.

In addition, because a data stream delivers a very large amount of data without interruption, not every item can be stored, so applying conventional data mining techniques to data streams as they are is not appropriate. New approaches to finding frequent items in data streams are therefore being studied (Jang Joong-hyuk, & Lee Won-seok (2003). Searching for frequent items based on open data mining in data streams. Journal of the Information Processing Society, 10-D(3); Manku, G. S., & Motwani, R. (2002). Approximate frequency counts over data streams. In Proc. of the 28th Conference on Very Large Databases; Giannella, C., Han, J., Pei, J., Yan, X., & Yu, P. S. (2003). Mining Frequent Patterns in Data Streams at Multiple Time Granularities. Next Generation Data Mining, AAAI/MIT; and others).

However, existing methods for finding frequent items in data streams either simply count occurrences or set a sliding window of an arbitrary fixed size and find only the frequent items within that window. Because they aggregate frequent items cumulatively as time passes, they overlook how the frequent items are changing at the current moment and which items are relatively frequent over time, and so cannot guarantee accurate frequent-item search.

Meanwhile, most previously studied frequent-item algorithms are based on the Apriori principle (Agrawal, R., & Srikant, R. (1994). Fast Algorithms for Mining Association Rules. Proc. 20th Int. Conf. Very Large Data Bases, 487-499).

This principle states that every subset of a frequent itemset must itself be frequent. If the maximum length of a frequent itemset is n, the Apriori algorithm scans up to length n+1, generating candidate sets and then searching for frequent itemsets, so it uses a lot of memory and, because of the repeated database scans, takes a long time to find frequent items.
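The Apriori principle described above can be illustrated with a minimal Python sketch. It is not taken from the patent or from the cited paper; the function name `has_infrequent_subset` and the example itemsets are hypothetical. It shows only the pruning test: a candidate of length k can be discarded when any of its (k-1)-item subsets is not already known to be frequent.

```python
from itertools import combinations


def has_infrequent_subset(candidate, frequent_prev):
    """Return True if some (k-1)-item subset of `candidate` is not in `frequent_prev`."""
    k = len(candidate)
    return any(
        frozenset(subset) not in frequent_prev
        for subset in combinations(sorted(candidate), k - 1)
    )


# Hypothetical frequent 2-itemsets found in an earlier pass.
frequent_2 = {frozenset({"a", "b"}), frozenset({"a", "c"})}

# {b, c} is missing, so {a, b, c} cannot be frequent and is pruned.
pruned = has_infrequent_subset({"a", "b", "c"}, frequent_2)
```

Every candidate that survives this test still has to be counted against the data, which is the source of the repeated scans mentioned above.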

In contrast, FP-growth, which uses a divide-and-conquer technique, does not generate candidate sets (Han, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proc. IEEE Symposium on Foundations of Computer Science, 359-366).

FP-growth has been shown to be efficient and scalable for mining both long and short frequent patterns, and to be about an order of magnitude faster than the Apriori algorithm.

However, both techniques must scan the data set multiple times and must rescan all of it whenever a new transaction arrives. In addition, when the data set keeps growing rapidly, their performance degrades because available memory is limited.

The Count Sketch algorithm focuses on the occurrence counts of items in a data stream (Charikar, M., Chen, K., & Farach-Colton, M. (2002). Finding Frequent Items in Data Streams. International Colloquium on Automata, Languages, and Programming, 508-515), and the Lossy Counting algorithm finds the set of frequent items in a data stream given a minimum support and a maximum allowable error (Manku, G. S., & Motwani, R. (2002). Approximate frequency counts over data streams. In Proc. of the 28th Conference on Very Large Databases).

However, these algorithms concentrate on simply finding frequent items without considering time, and therefore cannot find frequent items accurately.

The present invention was made in view of these points. Its object is to provide a system and method for searching relative frequent items using time differences that, by taking time into account in a data stream and using the time differences between occurrences of items over given transactions, finds relative frequent items that would otherwise go undetected, thereby improving the accuracy of frequent-item search and allowing limited memory resources to be used and managed efficiently.

To achieve this object, a system for searching relative frequent items using time differences according to the present invention, which searches for frequent items in a data mining system, comprises: an input module that processes an incoming data stream; a search module that finds frequent items in the data stream received through the input module by counting the items that appear within given transactions, and finds relative frequent items by using the time differences between occurrences of items over those transactions; and a storage module that stores the frequent items found by the search module.

A relative frequent item is characterized in that its relative occurrence frequency, which is the flexible number of transactions divided by the sum of the differences between its occurrence times, is greater than the relative occurrence frequency of the frequent item, and in that it holds from the point at which the previous relative occurrence frequency minus the current relative occurrence frequency becomes less than 0 until the point at which that difference becomes greater than 0.

To achieve this object, a method for searching relative frequent items using time differences according to the present invention, which is a frequent-item search method in a data mining system, comprises: scanning a transaction newly added to the data stream and updating the occurrence frequency of each item and the transaction ID information; adding, as partial frequent items, those items appearing in the added transaction whose value is below the predefined minimum support but above the maximum support error; outputting, at the user's request, information on the item that has the largest total occurrence count so far and a support of at least the minimum support; and finding relative frequent items by using the time differences between occurrences of items over the transactions.

In addition, a method for searching relative frequent items using time differences according to the present invention that uses the FP-Tree algorithm, which is a frequent-item search method using the FP-Tree algorithm in a data mining system, comprises: scanning a transaction newly added to the data stream and, if the items appearing in the transaction exist in nodes of the FP-Tree, updating the occurrence frequency of each item and the transaction ID information; if the items appearing in the added transaction are not in any existing node of the FP-Tree, classifying those whose value is below the minimum support but above the maximum support error as partial frequent items and inserting them into nodes of the FP-Tree; outputting, at the user's request, information on the item that has the largest total occurrence count so far and a support of at least the minimum support; and finding relative frequent items by using the time differences between occurrences of items over the transactions.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. The following embodiments merely illustrate the invention, and the invention is not limited to them.

FIG. 1 shows the configuration of a system for implementing the present invention. The system broadly consists of an input module (10) that processes an incoming data stream, a search module (20) that finds frequent items in the data stream, and a storage module (30) that stores the frequent items found.

The search module (20) finds frequent items in the data stream received through the input module (10) by counting the items that appear within given transactions, and finds relative frequent items by using the time differences between occurrences of items over those transactions. All frequent-item search in the present invention is performed by the search module (20).

The present invention first defines the data set in which frequent items are to be searched for in the data stream.

S_N = {T₁, T₂, T₃, ..., T_N} denotes the set of transactions generated so far, that is, the entire data set. The current transaction is written T_t = {I₁, I₂, I₃, ..., I_N} (t = 1, ..., N), and each transaction has a unique identifier called the TID. The set of items that make up the transactions is written I = {i₁, i₂, i₃, ..., i_N}.

As in the FP-stream algorithm (Giannella, C., Han, J., Pei, J., Yan, X., & Yu, P. S. (2003). Mining Frequent Patterns in Data Streams at Multiple Time Granularities. Next Generation Data Mining, AAAI/MIT), single items are divided into frequent items, partial frequent items, and infrequent items by using a minimum support (S_min ∈ (0,1)) and a maximum support error (e ∈ (0, S_min)) defined in advance by the user.

An item whose occurrence frequency is at or above the minimum support is treated as a frequent item. An item whose occurrence frequency is below the minimum support but at or above the maximum support error is defined as a partial frequent item. An item below the maximum support error is treated as infrequent and is not processed.
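The three-way split above can be sketched in a few lines of Python. This is an illustrative sketch, not the patent's implementation: the function name `classify_item` and the returned labels are hypothetical, and `support` is assumed to be the item's occurrence count divided by the number of transactions seen so far.

```python
def classify_item(support, s_min, e):
    """Classify one item by its support, given S_min and the maximum support error e."""
    if not 0.0 < e < s_min < 1.0:
        raise ValueError("expected 0 < e < s_min < 1")
    if support >= s_min:
        return "frequent"
    if support >= e:          # below S_min but at or above e
        return "partial frequent"
    return "infrequent"       # not tracked at all


labels = [classify_item(s, s_min=0.2, e=0.05) for s in (0.30, 0.10, 0.01)]
```

The boundary at e is taken as inclusive here, following the sentence above; the description of step S120 words it as "larger than", so the exact boundary handling is an assumption.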

Moreover, existing work finds frequent items merely by comparing the aggregate counts of frequent and partial frequent items, and so can overlook time-sensitive relative frequent items. Items whose count so far is smaller than that of the current frequent item, but whose relative occurrence frequency is larger, need to be considered. The present invention therefore defines relative frequent items as follows.

The set of relative frequent items R is defined as the set of items whose occurrence frequency is lower than that of the current frequent item but whose relative occurrence frequency is higher than that of the current frequent item. Here f is the current frequent item and c is the current point in time.

The relative occurrence frequency is the flexible number of transactions m (m < N) divided by the sum of the differences between occurrence times, and it is the criterion used to search for relative frequent items.

The occurrence frequency is the count obtained by checking, for each continuously arriving transaction, whether or not the item appears in it.

The sum of the differences between occurrence times is computed in order to identify when a relative frequent item arises. It is defined as the sum of the time differences between each occurrence and the occurrence before it. Here y_t denotes the time at which the transaction T_t containing item x occurred.
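The three quantities just defined can be written out as a small Python sketch. It is one reading of the prose only, since the original formulas do not survive in this text. The function names are hypothetical, the transaction index is assumed to serve as the time y_t, and which occurrences are counted against m is left to the caller because the text does not spell it out.

```python
def occurrence_times(transactions, item):
    """Times y_t (1-based transaction indices) at which `item` appears, i.e. A(x) = 1."""
    return [t for t, tx in enumerate(transactions, start=1) if item in tx]


def occurrence_frequency(transactions, item):
    """F(x): number of transactions seen so far that contain `item`."""
    return len(occurrence_times(transactions, item))


def gap_sum(times):
    """C(x): sum of the differences between each occurrence time and the one before it."""
    return sum(later - earlier for earlier, later in zip(times, times[1:]))


def relative_frequency(times, m):
    """E_m(x) = m / C(x); None when fewer than two occurrences give no gap."""
    c = gap_sum(times)
    return m / c if c > 0 else None


stream = [{"a"}, {"a", "b"}, {"b"}, {"a", "b"}, {"b"}, {"b"}]
times_b = occurrence_times(stream, "b")   # item "b" appears in transactions 2 to 6
```

Returning `None` for an item with fewer than two occurrences is a design choice of this sketch; the source does not say how that case is handled.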

As shown in the table of FIG. 2, the basic elements used to search for relative frequent items are: S_min, the minimum support; e, the maximum support error; f, the frequent item; R, the relative frequent items; A(x), the occurrence indicator function for item x; F(x), the occurrence frequency of item x; C(x), the sum of the occurrence gaps of item x; and E_m(x), the relative occurrence frequency of item x.

Based on the above, relative frequent items are described as follows.

FIG. 3 is a conceptual diagram of relative frequent items. Here I₁, I₂, and I₃ are each either a frequent item, whose support is at or above the minimum support, or a partial frequent item, whose support is below the minimum support but above the maximum support error, and the dots on each arrow mark occurrences.

T_N denotes the transactions up to the present. The first row, I₁, has the highest occurrence frequency so far, 28, which makes it the frequent item. The second and third rows are partial frequent items.

In FIG. 3, the first item, the frequent item, appears often early on, and the gaps between its occurrences widen toward the middle. The second row, in contrast, shows a sharp rise in occurrence frequency in the middle section.

Because the second item appears more often there than the first item and its occurrence gaps are narrower, it is more accurate to designate the second item as the frequent item in the middle section. In other words, in the middle section the second item is a relative frequent item with respect to the first.

The key point is to find a criterion for when an item becomes a relative frequent item. The present invention therefore proposes the following criteria for relative frequent items.

1. The item's relative occurrence frequency must be greater than the relative occurrence frequency of the frequent item.

2. The period during which the item is a relative frequent item runs from the point at which the previous relative occurrence frequency minus the current relative occurrence frequency becomes less than 0 until the point at which that difference becomes greater than 0.

That is, it is the period in which the item's relative occurrence frequency rises relative to the frequent item and its occurrence gaps narrow sharply compared with those of the frequent item.
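The two criteria can be sketched as a scan over per-step relative occurrence frequencies. This is a hedged illustration, not the patent's algorithm: the function name `relative_frequent_intervals` is hypothetical, the inputs are assumed to be equally long lists of E_m values for the candidate item and for the current frequent item, and closing an interval when criterion 1 stops holding is an assumption about how the two criteria combine.

```python
def relative_frequent_intervals(e_x, e_f):
    """Return (start, end) index pairs during which item x is relatively frequent.

    e_x[i] and e_f[i] are the relative occurrence frequencies of the candidate
    item x and of the current frequent item f at step i.
    """
    intervals = []
    start = None
    for i in range(1, len(e_x)):
        diff = e_x[i - 1] - e_x[i]               # previous minus current
        if start is None:
            # Criterion 2 (onset): diff < 0. Criterion 1: x exceeds f.
            if diff < 0 and e_x[i] > e_f[i]:
                start = i
        elif diff > 0 or e_x[i] <= e_f[i]:
            # Criterion 2 (end): diff > 0, or (assumed) criterion 1 fails.
            intervals.append((start, i))
            start = None
    if start is not None:                        # still open at the end of the data
        intervals.append((start, len(e_x) - 1))
    return intervals


e_x = [0.2, 0.5, 0.8, 0.8, 0.6, 0.3]
e_f = [0.4] * 6
found = relative_frequent_intervals(e_x, e_f)
```

In this example the candidate's relative frequency starts rising at step 1 and first falls at step 4, so a single interval from 1 to 4 is reported.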

Based on the above, the present invention stores relative frequent items by using an FP-Tree algorithm with a prefix-tree structure, which efficiently maintains and manages all frequent items and relative frequent items in memory, i.e., the storage module (30).

Data in a data stream is regarded as an infinite set, so storing every item is practically impossible. To maintain and manage frequent items and relative frequent items efficiently, the FP-Tree therefore preferably stores only three pieces of information: items, occurrence frequency, and TID.

Here, items denotes the frequent or partial frequent items, and the occurrence frequency is the total number of times the items have appeared so far. TID is the ID of the current transaction and serves as a measure of when a relative frequent item arises.
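A node holding just these three fields might look like the following Python sketch. The class name `ItemNode`, the `children` map, and the `record` method are hypothetical; the text specifies only that items, occurrence frequency, and TID are stored, and the child map is added here merely because a prefix tree needs links between nodes.

```python
from dataclasses import dataclass, field


@dataclass
class ItemNode:
    """One prefix-tree node: the item, its occurrence count, and the latest TID."""
    item: str
    count: int = 0
    tid: int = 0                                    # ID of the last transaction seen
    children: dict = field(default_factory=dict)    # item -> ItemNode

    def record(self, tid):
        """Register one more occurrence of this item in transaction `tid`."""
        self.count += 1
        self.tid = tid


root = ItemNode(item="")                 # empty root of the prefix tree
node = root.children.setdefault("a", ItemNode("a"))
node.record(tid=7)
node.record(tid=9)
```

Keeping only the latest TID, rather than every occurrence time, is what bounds the per-item memory in this sketch.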

FIG. 4 shows the FP-Tree algorithm according to the present invention, and FIG. 5 is its flowchart. The FP-Tree algorithm consists of four main steps, which are repeated each time a transaction is added.

The first step (S110) scans the transactions of the data stream and updates the occurrence frequency of each item. When a new transaction is added, the size of the entire data set |S_N| increases by 1.

If an item that appears in the newly added transaction already exists in a node of the FP-Tree, its occurrence frequency and transaction ID are updated.

The second step (S120) adds partial frequent items. If an item that appears in the newly added transaction is not in any existing node of the FP-Tree, and its value is below the minimum support but above the maximum support error, it is classified as a partial frequent item and inserted as a node of the FP-Tree.

Items below the maximum support error are removed here because they are very unlikely to become frequent. This reduces memory usage, and because no node insertion is performed for them, it also reduces execution time.

The third step (S130) finds the current frequent item: at the user's request, it outputs the items, occurrence frequency, and TID information of the item that has the largest total occurrence count so far and a support of at least the minimum support (S_min).

The fourth step (S140) finds relative frequent items: when an item's relative occurrence frequency has increased, that is, when its occurrence gaps have narrowed, the step finds the items that appear relatively more often than the current frequent item and writes their information to a text file.
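Steps S110 to S130 can be sketched in Python under several loudly stated assumptions. The class `FlatItemTable` is hypothetical and uses a flat dictionary in place of the FP-Tree; the support of an item that is not yet tracked is estimated as 1/n, which the source does not specify; and any untracked item whose estimate is at or above e is admitted. Step S140 is omitted. Note that under the 1/n estimate no new item is admitted once n exceeds 1/e, a point the surviving text does not address.

```python
class FlatItemTable:
    """Flat stand-in for the FP-Tree: item -> [occurrence count, last TID]."""

    def __init__(self, s_min, e):
        if not 0 < e < s_min < 1:
            raise ValueError("expected 0 < e < s_min < 1")
        self.s_min = s_min
        self.e = e
        self.n = 0            # |S_N|, number of transactions seen so far
        self.table = {}

    def add_transaction(self, tid, items):
        self.n += 1                               # S110: |S_N| grows by one
        for item in set(items):
            entry = self.table.get(item)
            if entry is not None:                 # S110: item already tracked
                entry[0] += 1
                entry[1] = tid
            elif 1 / self.n >= self.e:            # S120: assumed support estimate 1/n
                self.table[item] = [1, tid]
            # otherwise the item is below e and is not stored

    def frequent_items(self):
        """S130: (item, count, tid) rows with support >= s_min, highest count first."""
        rows = [
            (item, count, tid)
            for item, (count, tid) in self.table.items()
            if count / self.n >= self.s_min
        ]
        return sorted(rows, key=lambda row: row[1], reverse=True)


table = FlatItemTable(s_min=0.5, e=0.2)
stream = [{"a", "b"}, {"a"}, {"a", "c"}, {"b"}, {"a"}, {"d"}]
for tid, items in enumerate(stream, start=1):
    table.add_transaction(tid, items)
```

After these six transactions only item "a" meets the minimum support, and item "d", first seen at transaction 6, is never stored because 1/6 is below e.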

In this way, the present invention finds not only the current frequent items but also relative frequent items that could otherwise be overlooked, which ensures reliability and accuracy, and because it manages only the items, occurrence frequency, and TID of frequent and partial frequent items, it uses limited memory efficiently.

Next, experimental examples of the present invention are described. They verify relative frequent-item search and the performance of the FP-Tree through a variety of experiments.

The data sets T10.I4.D1000K, T20.I4.D1000K, T10.I4.D100K, and T15.I6.D1000K were generated and used, following the data generation method proposed in Agrawal, R., & Srikant, R. (1994). Fast Algorithms for Mining Association Rules. Proc. 20th Int. Conf. Very Large Data Bases, 487-499.

In each data set name, T is the average transaction length, I is the average length of the maximal potentially frequent itemsets, and D is the total number of transactions in the data set. All experiments were run on a computer with an AMD XP 2600+ processor and 512 MB of RAM, and the implementation was written in C.
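The naming convention can be made concrete with a short Python sketch that decodes a data set name into its three parameters. The function name `parse_dataset_name` and the dictionary keys are hypothetical, and the sketch assumes the `K` suffix means thousands of transactions, as in the cited data generation method.

```python
import re


def parse_dataset_name(name):
    """Decode a name such as 'T10.I4.D1000K' into its three parameters."""
    match = re.fullmatch(r"T(\d+)\.I(\d+)\.D(\d+)K", name)
    if match is None:
        raise ValueError(f"unrecognized data set name: {name!r}")
    t, i, d = (int(group) for group in match.groups())
    return {
        "avg_transaction_length": t,      # T
        "avg_pattern_length": i,          # I
        "num_transactions": d * 1000,     # D, given in thousands
    }


params = parse_dataset_name("T10.I4.D1000K")
```

For T10.I4.D1000K this gives an average transaction length of 10, an average pattern length of 4, and one million transactions.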

The experiments are divided into three parts.

The first verifies the accuracy of the search for frequent items and relative frequent items.

The second verifies the execution time of the search for frequent items and relative frequent items.

The third verifies the memory usage of the FP-Tree that manages the frequent items and partial frequent items.

FIG. 6 shows the accuracy for frequent items and relative frequent items on the T10.I4.D1000K data set. Frequent items show somewhat higher accuracy than relative frequent items.

This is because relative frequent items are affected by time more strongly than frequent items and change more often, which makes them harder to find. Even so, the accuracy for relative frequent items remains high.

FIG. 7 compares the average execution time for each interval as the minimum support is varied. The average execution time is the average time needed to search for frequent items and relative frequent items when a new transaction is added. The lower the minimum support, the longer the average execution time.

This is because a lower minimum support widens the range of items admitted as frequent items or partial frequent items, which increases the number of comparisons needed to find the frequent items. In other words, the average execution time is inversely related to the minimum support.
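The effect of the thresholds can be sketched as a three-way classification. This is a minimal sketch, not the patented implementation: the claims state that an item whose support is below the minimum support but above the maximum support error is kept as a partial frequent item, and the sketch assumes that support is the occurrence count divided by the number of transactions seen so far. The names `ItemClass` and `classify_item` are hypothetical.

```c
/* Hypothetical classification of an item by its support. */
typedef enum {
    ITEM_DISCARD = 0,   /* support at or below the maximum support error */
    ITEM_PARTIAL = 1,   /* above the maximum support error, below the minimum support */
    ITEM_FREQUENT = 2   /* at or above the minimum support */
} ItemClass;

/* ASSUMPTION: support = count / total, where total is the number of
 * transactions processed so far. */
ItemClass classify_item(unsigned long count, unsigned long total,
                        double min_support, double max_error)
{
    double support;

    if (total == 0)
        return ITEM_DISCARD;
    support = (double)count / (double)total;
    if (support >= min_support)
        return ITEM_FREQUENT;
    if (support > max_error)
        return ITEM_PARTIAL;
    return ITEM_DISCARD;
}
```

With 1,000 transactions and a maximum support error of 0.001, an item seen 5 times is a partial frequent item at a minimum support of 0.01 but becomes a frequent item when the minimum support is lowered to 0.004, so lowering the threshold enlarges the set of items that must be compared.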

FIG. 8 shows the memory usage of the FP-Tree for each data set. The differences in memory usage among the data sets are small.

This is because the FP-Tree algorithm uses the minimum support and the maximum support error to manage only the frequent items and partial frequent items, so memory usage stays small. Among the data sets, T15.I6.D1000K uses the most memory, because the average length of its potential maximal frequent itemsets is larger than that of the other data sets.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art may modify or alter it in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

As described above, the present invention computes the total occurrence frequency and the relative occurrence frequency based on the occurrence interval, and compares the relative occurrence frequencies of the frequent items and partial frequent items to find relative frequent items that might otherwise be overlooked. In addition, to manage frequent items and partial frequent items efficiently in the FP-Tree, it stores only three pieces of information: the unit item (a frequent item or partial frequent item), its occurrence frequency, and the transaction ID. As a result, it can find time-sensitive frequent items, improve the accuracy of frequent item search, and use limited memory efficiently.

Claims

A relative frequent item search system using a time difference, for searching for frequent items in a data mining system that mines an input data stream, the system comprising:

an input module for processing the input data stream;

a search module that searches the data stream input through the input module for frequent items by aggregating the items that appear in predetermined transactions, and that searches for a relative frequent item, namely an item whose relative occurrence frequency (the number of transactions divided by the sum of the differences between occurrence intervals) is greater than the relative occurrence frequency of the frequent items, over the period from the time at which the previous relative occurrence frequency minus the current relative occurrence frequency becomes less than zero until the time at which that result becomes greater than zero; and

a storage module for storing the frequent items found by the search module.

(Deleted)

The system of claim 1, wherein the storage module stores search results according to an FP-Tree algorithm.

The system of claim 4, wherein the storage module stores a unit item that is a frequent item or a partial frequent item, an occurrence frequency that is the total number of times the unit item has appeared, and a transaction ID.

A relative frequent item search method using a time difference, for searching for frequent items in a data mining system that mines an input data stream, the method comprising:

searching a transaction newly added to the data stream and updating the occurrence frequency and transaction ID information of each item;

adding, as a partial frequent item, any item appearing in the added transaction whose support is smaller than a predetermined minimum support but greater than a maximum support error;

outputting, at the request of the user, information on the item that has the highest total occurrence frequency and whose support to date is at least the minimum support; and

searching for a relative frequent item, namely an item whose relative occurrence frequency (the number of transactions divided by the sum of the differences between occurrence intervals) is greater than the relative occurrence frequency of the frequent items, over the period from the time at which the previous relative occurrence frequency minus the current relative occurrence frequency becomes less than zero until the time at which that result becomes greater than zero.

(Deleted)

The method of claim 6, wherein the output information includes a unit item that is a frequent item, an occurrence frequency that is the total number of times the unit item has appeared so far, a transaction ID, and the like.

A frequent item search method using the FP-Tree algorithm in a data mining system that mines an input data stream, the method comprising:

searching a transaction newly added to the data stream and, if the items appearing in the transaction exist in a node of the FP-Tree, updating the occurrence frequency of each item and the ID information of the transaction;

if the items appearing in the added transaction do not exist in any node of the FP-Tree, classifying an item whose support is smaller than the minimum support but greater than the maximum support error as a partial frequent item and inserting the item into a node of the FP-Tree;

The method of claim 9, wherein the output information is