KR20230037830A

KR20230037830A - Method and system for compressing graph stream based on incremental frequent patterns

Info

Publication number: KR20230037830A
Application number: KR1020210120873A
Authority: KR
Inventors: 신보경; 이현병; 최도진; 임종태; 복경수; 유재수
Original assignee: 충북대학교 산학협력단
Priority date: 2021-09-10
Filing date: 2021-09-10
Publication date: 2023-03-17
Also published as: KR102662267B1

Abstract

Disclosed are a method and system for compressing a graph stream based on incremental frequent patterns, which can improve the compression processing speed of the graph stream. The method according to an embodiment of the present invention comprises: extracting, in response to a compression command for a graph stream input in real time, a first subgraph from a substream of the graph stream input during a first time; storing the first subgraph in a pattern dictionary as an initial frequent pattern; when a second subgraph different from the first subgraph is extracted from a substream input during a second time following the lapse of the first time, using the second subgraph to update the first subgraph in the pattern dictionary into an extended subgraph and maintaining the extended subgraph; and, when the sum of the first and second times reaches a predetermined time, compressing the extended subgraph maintained in the pattern dictionary as a reference frequent pattern.

Description

Method and system for compressing graph stream based on incremental frequent patterns

The present invention relates to a method for compressing a graph stream and, more particularly, to improving the compression efficiency of a graph stream by using incremental frequent patterns.

Graph data is used to represent complex connection relationships such as those in social networks, the Internet of Things, mobile devices, and living organisms.

Graph data generated by such services expands and changes in real time and carries an enormous amount of information.

Accordingly, a 'graph stream', which represents irregular changes to a graph such as the addition, deletion, or modification of vertices representing objects or of edges representing relationships between objects, is generated in real time.

To store such a graph stream (graph data), which changes in real time, efficiently in limited memory (in-memory), various graph mining techniques for compressing and storing graph streams have been proposed.

However, because of the high-dimensional and complex structure and form of a graph stream, it is difficult to apply the previously proposed graph mining techniques directly.

Graph mining algorithms such as 'StarZIP' (see Non-Patent Document 1) and 'GraphZIP' (see Non-Patent Document 2), proposed as examples of structural graph compression techniques, achieve a high compression ratio. However, the preprocessing step that searches for graph structures (e.g., star, clique, bipartite) takes a long time, which makes these algorithms difficult to apply in environments where data is generated in real time. In addition, if the values to be converted become excessively large, the compression effect may be worse than storing the data uncompressed.

In addition, graph mining algorithms such as 'GraphZip' (see Non-Patent Document 3), proposed as an example of an existing dictionary-based graph compression technique, set a graph that appears in common across vertices and edges as a reference pattern and describe the changes relative to that reference pattern. However, once the reference patterns grow beyond a certain size, searching them takes a long time. Moreover, when a new pattern completely different from the reference patterns is input, few or no reference patterns are applicable and the reference patterns must be searched for again, so the compression ratio may decrease.

Therefore, there is a need for a technique that increases the compression efficiency of a graph stream by applying graph mining that takes into account, in addition to the compression ratio and compression time emphasized by existing compression techniques, a graph stream environment in which vertices and edges change in real time as time passes.

[Non-Patent Document 1] B. Dolgorsuren, Khan K, Rasel MK, Lee Y, “StarZIP: Streaming Graph Compression Technique for Data Archiving”, IEEE Access, pp. 38020-38034, 2019.
[Non-Patent Document 2] R. A. Rossi and R. Zhou, “GraphZIP: A clique-based sparse graph compression method”, J. Big Data, vol. 5, no. 1, pp. 1-14, 2018.
[Non-Patent Document 3] P. Charles and H. B. Lawrence, “GraphZip: Mining graph streams using dictionary-based compression”, in Proc. SIGKDD Workshop Mining Learn. Graphs, 2017.

An object of an embodiment of the present invention is to propose a graph stream compression technique that, in order to compress a graph stream efficiently, performs compression in consideration of changes in vertices and edges over time.

An object of an embodiment of the present invention is to increase the compression processing speed of a graph stream by finding frequently occurring patterns in a graph stream input in real time, maintaining them in a pattern dictionary as reference frequent patterns, and excluding infrequent patterns from the pattern dictionary, thereby reducing the data applied to compression of the graph stream.

Existing graph mining techniques store in the pattern dictionary both the patterns that are frequent in the part of the graph stream input at a previous time and the patterns that are frequent in the part input at the current time, so the pattern dictionary grows each time a new stream is input and the compression efficiency of the graph stream falls accordingly. An object of an embodiment of the present invention is to improve on these techniques and increase the compression efficiency of the graph stream by leaving in the pattern dictionary only a minimal set of reference frequent patterns that are frequently used across the entire graph stream as it changes over time.

An object of an embodiment of the present invention is to repeat, each time a stream is input over time, a process of extending and updating a frequent pattern found at a previous time and stored in the pattern dictionary by using a frequent pattern found at the current time. The frequent patterns in the pattern dictionary are thereby updated incrementally so that, when a predetermined end time is reached, the pattern dictionary retains the frequent patterns that serve as references for the entire time-varying graph stream.

A method for compressing a graph stream based on incremental frequent patterns according to an embodiment of the present invention may include: extracting, in response to a compression command for a graph stream input in real time, a first subgraph from a substream of the graph stream input during a first time; storing the first subgraph in a pattern dictionary as an initial frequent pattern; when a second subgraph different from the first subgraph is extracted from a substream of the graph stream input during a second time starting from the lapse of the first time, updating the first subgraph into an extended subgraph within the pattern dictionary by using the second subgraph, and maintaining the extended subgraph; and, when the sum of the first and second times reaches a predetermined time, compressing the extended subgraph maintained in the pattern dictionary as a reference frequent pattern.

In addition, a system for compressing a graph stream based on incremental frequent patterns according to an embodiment of the present invention may include: an extraction unit that, in response to a compression command for a graph stream input in real time, extracts a first subgraph from a substream of the graph stream input during a first time; a storage unit that stores the first subgraph in a pattern dictionary as an initial frequent pattern; an update unit that, when the extraction unit extracts a second subgraph different from the first subgraph from a substream of the graph stream input during a second time starting from the lapse of the first time, updates the first subgraph into an extended subgraph within the pattern dictionary by using the second subgraph and maintains the extended subgraph; and a processing unit that, when the sum of the first and second times reaches a predetermined time, compresses the extended subgraph maintained in the pattern dictionary as a reference frequent pattern.

According to the present invention, the compression processing speed of a graph stream can be improved by finding frequently occurring patterns in a graph stream input in real time, maintaining them in a pattern dictionary as reference frequent patterns, and excluding infrequent patterns from the pattern dictionary, thereby reducing the data applied to compression of the graph stream.

According to the incremental frequent pattern-based compression technique for graph stream processing proposed in the present invention, frequent patterns are extracted by graph pattern mining, a pattern score is calculated for each according to its frequency and edge size, and the frequent patterns selected in descending order of pattern score are left in the pattern dictionary as reference frequent patterns. As a result, the more important frequent patterns in a graph stream that changes in real time can be detected faster than with existing compression techniques, and the compression efficiency and processing speed of the graph stream can be improved.

According to the present invention, changes in vertices and edges over time are recorded using provenance information so that the graph stream can be processed in an in-memory environment, and the provenance information is applied to the reference frequent patterns maintained in the pattern dictionary, so that an administrator can easily grasp the change history of the graph stream input in real time.

According to the present invention, the reference frequent patterns and the provenance information make it easy to manage the history of the latest patterns and to identify changes. As a result, the size of the graph data is reduced, so that more graph data can be kept in memory and processed quickly.

FIG. 1 is a block diagram showing the configuration of a graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.
FIG. 2 is a diagram showing the overall structure of the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the operation of the reference pattern generator 210 shown in FIG. 2 in the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a process of assigning indices to the vertices of a subgraph and storing the subgraph in the pattern dictionary in the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating the operation of the graph manager 220 shown in FIG. 2 in the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining a process of determining an isomorphic graph according to the result of a similarity determination between subgraphs in the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining a time window that changes over time in the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.
FIG. 8 is a diagram for explaining a process of incrementally updating frequent patterns in the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.
FIG. 9 is a diagram for explaining a pruning process to which the FP-Growth algorithm is applied in the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.
FIG. 10 is a diagram illustrating a pattern dictionary (b) built according to a pattern tree (a) in the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.
FIG. 11 is a diagram illustrating provenance information stored in the pattern dictionary in the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.
FIG. 12 is a diagram illustrating a compression process in the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.
FIG. 13 is a flowchart illustrating the sequence of a method for compressing a graph stream based on incremental frequent patterns according to an embodiment of the present invention.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. However, because various changes may be made to the embodiments, the scope of the patent application is not restricted or limited by these embodiments. All changes, equivalents, and substitutes for the embodiments should be understood to fall within the scope of rights.

The terms used in the embodiments are used for descriptive purposes only and should not be construed as limiting. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art and, unless explicitly defined in the present application, are not to be interpreted in an idealized or excessively formal sense.

In addition, in the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the drawing, and duplicate descriptions of them are omitted. In describing the embodiments, a detailed description of related known technology is omitted where it is judged that it could unnecessarily obscure the gist of the embodiments.

The present invention proposes a technique for increasing compression efficiency and improving processing speed. In response to a compression command for a graph stream, the graph stream input in real time is processed incrementally to build a pattern dictionary that holds, at a minimum, the patterns frequent in the graph stream and their provenance information, and this pattern dictionary is applied to compression.

To this end, the present invention incrementally finds the frequent patterns that serve as references in a graph stream changing in real time, keeps the pattern dictionary as small as possible, and thereby minimizes the data to be compressed.

As an example of incrementally detecting the reference frequent patterns that maximize the compression ratio, in the present invention a frequent pattern found at the initial point of the graph stream and already stored in the pattern dictionary (an initial frequent pattern) is compared with a frequent pattern found at the next point of the graph stream to check isomorphism, and is incrementally extended and updated using the frequent pattern whose isomorphism has been confirmed. Frequent patterns that, after being stored in the pattern dictionary, are not updated by the end of the graph stream are deleted, and the frequent patterns finally remaining in the pattern dictionary are detected as the reference frequent patterns.

Here, isomorphism needs to be checked before the update proceeds because a graph stream does not always produce the same patterns: patterns change over time, so the pattern at the current time may differ from the pattern after some time has elapsed.

Therefore, in the present invention, when the subgraph of a frequent pattern already stored in the pattern dictionary is denoted G₁ and the subgraph of a frequent pattern found at the next point in time is denoted G₂, G₁ and G₂ are determined to be isomorphic graphs if G₁ ⊆ G₂, and an update that extends G₁ using G₂ is performed so that the extended subgraph is maintained in the pattern dictionary.
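As a minimal illustration of the rule G₁ ⊆ G₂, the following Python sketch assumes that both subgraphs share vertex labels, so that containment reduces to an edge-set comparison. The edge-set representation and the names `edges`, `is_contained`, and `extend` are hypothetical and do not appear in the specification.

```python
# Hypothetical representation: a subgraph is a set of undirected edges,
# each edge stored as a frozenset of two vertex labels.
def edges(pairs):
    return {frozenset(p) for p in pairs}


def is_contained(g1, g2):
    """True when every edge of g1 also appears in g2 (G1 is contained in G2)."""
    return g1 <= g2


def extend(g1, g2):
    """Extend g1 with g2 only when g1 is contained in g2; otherwise keep g1."""
    return g1 | g2 if is_contained(g1, g2) else g1


g1 = edges([("a", "b"), ("b", "c")])              # stored frequent pattern
g2 = edges([("a", "b"), ("b", "c"), ("c", "d")])  # pattern found at the next point
g3 = edges([("x", "y")])                          # unrelated pattern
extended = extend(g1, g2)
unchanged = extend(g1, g3)
```

In this sketch the stored pattern grows to G₂ when it is contained in G₂ and is left untouched otherwise.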

In this way, the present invention leaves in the pattern dictionary only the frequent patterns that are used often in a time-varying graph stream and deletes those that are rarely used. By incrementally finding and compressing reference frequent patterns in this manner, the size of the pattern dictionary is reduced, which increases the compression efficiency of the graph stream and allows a relatively larger amount of data to be kept in memory, improving graph processing speed.

FIG. 1 is a block diagram showing the configuration of a graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.

Referring to FIG. 1, a graph stream compression system 100 based on incremental frequent patterns according to an embodiment of the present invention may include an extraction unit 110, a storage unit 120, an update unit 130, a processing unit 140, and a pattern dictionary 170. Depending on the embodiment, the graph stream compression system 100 may further include a determination unit 150 and a calculation unit 160.

In response to a compression command for a graph stream, the extraction unit 110 extracts a first subgraph from a substream of the graph stream input during a first time (T).

To this end, the extraction unit 110 may apply the time window W_n shown in FIG. 7, which changes as time passes, to the graph stream input in real time to divide the graph stream into a plurality of substreams, and may extract the first subgraph from the first substream obtained by applying W₀, the first of the windows W_n, to the graph stream.

As shown in FIG. 7, one window W_n may consist of several batches (B_n); when W₀ = {B₁, B₂, B₃}, the extraction unit 110 may extract a plurality of first subgraphs from the first substream.
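The division of a stream into windows made up of batches can be sketched as follows. This is only an illustration: the batch size, the number of batches per window, and the function names `make_batches` and `make_windows` are assumptions, and the specification does not state how batch boundaries are chosen.

```python
from itertools import islice


def make_batches(stream, batch_size):
    """Group a stream of edge events into consecutive batches B1, B2, ..."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def make_windows(batches, window_sizes):
    """Group batches into windows W0, W1, ...; window_sizes gives batches per window."""
    it = iter(batches)
    for size in window_sizes:
        window = list(islice(it, size))
        if not window:
            return
        yield window


stream = [(i, i + 1) for i in range(10)]            # ten hypothetical edge events
batches = list(make_batches(stream, batch_size=2))  # B1 .. B5
windows = list(make_windows(batches, [3, 2]))       # W0 = {B1, B2, B3}, W1 = {B4, B5}
```

Each window then plays the role of one substream from which subgraphs are extracted.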

The storage unit 120 stores the first subgraph in the pattern dictionary 170 as an initial frequent pattern.

For example, the storage unit 120 may store, in the pattern dictionary 170, the plurality of vertices constituting the first subgraph and the edges connecting those vertices.

For example, referring to (a) of FIG. 8, the storage unit 120 may store, in the pattern dictionary 170, the plurality of vertices constituting the first subgraph extracted from the first stream using W₀ and the edges connecting them.
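One possible in-memory layout for such a dictionary entry is sketched below. The `Pattern` class, its fields, the initial frequency of 1, and the identifier "P0" are all assumptions made for illustration; the specification only states that vertices and edges are stored.

```python
from dataclasses import dataclass, field


@dataclass
class Pattern:
    """Hypothetical dictionary entry: the vertices and edges of one subgraph."""
    vertices: set = field(default_factory=set)
    edge_freq: dict = field(default_factory=dict)  # edge -> frequency


def store_initial(dictionary, pattern_id, edge_list):
    """Store a first subgraph as an initial frequent pattern."""
    pattern = Pattern()
    for u, v in edge_list:
        pattern.vertices.update((u, v))
        pattern.edge_freq[frozenset((u, v))] = 1  # assumed starting frequency
    dictionary[pattern_id] = pattern
    return pattern


pattern_dictionary = {}
p0 = store_initial(pattern_dictionary, "P0", [("a", "b"), ("b", "c")])
```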

When the first time (T) has elapsed, the extraction unit 110 may extract a second subgraph different from the first subgraph from a substream of the graph stream input during a second time (T) starting from the lapse of the first time (T).

Referring to (b) of FIG. 8, the extraction unit 110 may apply W₁ = {B₄, B₅}, the window following W₀, to the graph stream and extract the second subgraph from the second substream.

The update unit 130 uses the second subgraph extracted by the extraction unit 110 to update the first subgraph into an extended subgraph within the pattern dictionary 170, and maintains the extended subgraph.

That is, instead of simply storing the second subgraph extracted after the first subgraph in the pattern dictionary 170, the update unit 130 extends the first subgraph, which is already stored in the pattern dictionary 170 when the second subgraph is extracted, with the second subgraph. Through this update, changes in the vertices and edges constituting the second subgraph are reflected in the pattern dictionary 170.

For example, with the plurality of vertices constituting the second subgraph classified into existing vertices, which are already stored in the pattern dictionary 170, and new vertices, which are not, the update unit 130 may search the second subgraph for a first edge connecting an existing vertex and a new vertex.

Because the first edge does not exist in the first subgraph in the pattern dictionary 170, the update unit 130 may additionally connect the first edge to the first subgraph (a 'change'), thereby updating the first subgraph into the extended subgraph.

The update unit 130 may also search the second substream for a second edge that is connected to the first edge and connects new vertices to each other.

Because the second edge likewise does not exist in the first subgraph in the pattern dictionary 170, the update unit 130 may additionally connect the second edge to the first subgraph (an 'insertion') as part of the update to the extended subgraph. In this case, the update unit 130 may additionally connect a second edge connecting new vertices to the first subgraph even if that second edge is not connected to the first edge.

The update unit 130 may also search the second subgraph for a third edge connecting existing vertices to each other.

Because the third edge already exists in the first subgraph in the pattern dictionary 170, the update unit 130 may increase the frequency set for the corresponding edge in the first subgraph instead of adding the third edge to the first subgraph.
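The three edge cases described above can be sketched in Python as follows. The function name `update_pattern`, the data layout, and the starting frequency of 1 for added edges are assumptions. The "other" branch (an edge between two existing vertices that is not yet stored) is not covered by the text; this sketch simply adds such an edge.

```python
def update_pattern(vertices, edge_freq, second_edges):
    """Extend a stored first subgraph in place with the edges of a second subgraph.

    vertices  -- set of vertex labels already stored for the pattern
    edge_freq -- dict mapping frozenset({u, v}) to its frequency
    Returns how many edges fell into each case.
    """
    existing = set(vertices)  # snapshot of the vertices stored before this update
    counts = {"first": 0, "second": 0, "third": 0, "other": 0}
    for u, v in second_edges:
        edge = frozenset((u, v))
        known = (u in existing) + (v in existing)  # number of existing endpoints
        if edge in edge_freq:
            edge_freq[edge] += 1          # third edge: only the frequency grows
            counts["third"] += 1
        elif known == 1:
            edge_freq[edge] = 1           # first edge: existing vertex to new vertex
            counts["first"] += 1
        elif known == 0:
            edge_freq[edge] = 1           # second edge: new vertex to new vertex
            counts["second"] += 1
        else:
            edge_freq[edge] = 1           # assumption: case not described in the text
            counts["other"] += 1
        vertices.update((u, v))
    return counts


vertices = {"a", "b", "c"}
edge_freq = {frozenset(("a", "b")): 1, frozenset(("b", "c")): 1}
counts = update_pattern(vertices, edge_freq, [("a", "b"), ("c", "d"), ("d", "e")])
```

Here ("a", "b") is treated as a third edge, ("c", "d") as a first edge, and ("d", "e") as a second edge.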

To check the isomorphism of the first subgraph and the second subgraph before performing the update, the graph stream compression system 100 based on incremental frequent patterns may further include the determination unit 150.

When the second subgraph is extracted, the determination unit 150 uses the 'VF2 algorithm' to calculate the similarity between the first subgraph and the second subgraph and, according to the calculated similarity, determines whether the first subgraph is an isomorphic graph that matches at least part of the second subgraph.

For example, referring to (a) and (b) of FIG. 6, when an existing subgraph pattern G₁ and a newly input subgraph pattern G₂ exist, the determination unit 150 does not determine G₁ and G₂ to be isomorphic graphs only when G₁ = G₂; it also determines graphs G₁ and G₂ satisfying G₁ ⊆ G₂ to be isomorphic graphs, and may judge G₁ and G₂ to be patterns with high similarity.
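
The relaxed criterion G₁ ⊆ G₂ may be illustrated by the following Python sketch. It is a plain backtracking search and is not the VF2 algorithm named above; the function name `is_subgraph_match` and the representation of a graph as a set of directed (source, target) edges are assumptions made only for illustration.

```python
def is_subgraph_match(g1_edges, g2_edges):
    """Return True if G1 can be embedded in G2 after renaming vertices.

    Brute-force stand-in for a subgraph isomorphism test; suitable only
    for small patterns.
    """
    g1_edges, g2_edges = set(g1_edges), set(g2_edges)
    v1 = list(dict.fromkeys(v for e in g1_edges for v in e))
    v2 = list(dict.fromkeys(v for e in g2_edges for v in e))

    def extend(mapping):
        if len(mapping) == len(v1):
            return True  # every G1 vertex is mapped and every edge was checked
        u = v1[len(mapping)]
        for candidate in v2:
            if candidate in mapping.values():
                continue  # keep the mapping one-to-one
            mapping[u] = candidate
            edges_ok = all(
                (mapping[a], mapping[b]) in g2_edges
                for a, b in g1_edges
                if a in mapping and b in mapping
            )
            if edges_ok and extend(mapping):
                return True
            del mapping[u]
        return False

    return extend({})


g1 = {(1, 2)}
g2 = {(10, 20), (20, 30)}
match = is_subgraph_match(g1, g2)  # G1 is contained in G2
```

The test is asymmetric: G₁ may be contained in G₂ without G₂ being contained in G₁, which corresponds to the case in which a stored pattern is extended by a larger, newly input pattern.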

When the first subgraph is determined to be an isomorphic graph of the second subgraph, the update unit 130 may update the first subgraph using the second subgraph.

In this way, the update unit 130 performs the update only when the previously stored first subgraph is determined to be an isomorphic graph of the currently extracted second subgraph, and can thereby incrementally detect frequent patterns in the graph stream input in real time.

Depending on the embodiment, the update unit 130 may set the frequency of the extended subgraph to the value obtained by increasing the frequency set for the first subgraph before the update by the number of second subgraphs used for the update.

When the sum of the first time and the second time (2T) reaches the determined time τ (2T = τ), the processing unit 140 compresses the extended subgraph maintained in the pattern dictionary 170 as the reference frequent pattern.

Here, when a plurality of first subgraphs are stored in the pattern dictionary 170 as initial frequent patterns, the processing unit 140 may delete from the pattern dictionary 170, before the compression processing, any first subgraph among them that was not involved in the update to the extended subgraph.

In this way, the processing unit 140 keeps frequently occurring patterns in the pattern dictionary 170 as reference frequent patterns for use in compression and deletes infrequently occurring patterns from the pattern dictionary 170 to reduce its size, thereby increasing the compression efficiency of the graph stream and shortening the time required for compression processing.

As another example, when the sum of the first time and the second time (2T) has not yet reached the determined time τ (2T < τ), the processing unit 140 may defer the compression processing until the summed time reaches the determined time τ, and subgraph extraction by the extraction unit 110 and updating by the update unit 130 may be repeated each time a new sub-stream of the graph stream is input.

That is, the extraction unit 110 may extract, from the sub-stream of the graph stream input during a third time starting from the point at which the second time has elapsed, a third subgraph different from the first subgraph and the second subgraph, and the update unit 130 may use the third subgraph to update the extended subgraph into a further-extended subgraph.

Through this process, the update unit 130 can incrementally extend and update the first subgraph of the initial frequent pattern.
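
The deferral of compression until the accumulated time reaches τ may be sketched as a simple loop. The following Python sketch assumes, purely for illustration, that each sub-stream is already reduced to a list of edges, that every sub-stream covers the same duration T, and that the stored pattern is a mapping from edge to frequency; the name `extend_until_deadline` is hypothetical.

```python
def extend_until_deadline(substreams, T, tau):
    """Keep extending the stored pattern until the elapsed time reaches tau.

    substreams: iterable of edge lists, one per sub-stream (assumed input form).
    Returns the extended pattern and the elapsed time at which the loop stopped.
    """
    pattern = {}   # edge -> frequency
    elapsed = 0
    for edges in substreams:
        for edge in edges:
            # New edges are inserted; known edges have their frequency raised.
            pattern[edge] = pattern.get(edge, 0) + 1
        elapsed += T
        if elapsed >= tau:
            break  # the determined time is reached: compression may start
    return pattern, elapsed


pattern, elapsed = extend_until_deadline(
    [[("a", "b")], [("a", "b"), ("b", "c")], [("c", "d")], [("d", "e")]],
    T=1,
    tau=3,
)
```

With T = 1 and τ = 3, the fourth sub-stream in this example is not consumed, because the deadline is reached after the third.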

In this case as well, when a plurality of first subgraphs are stored in the pattern dictionary 170 as initial frequent patterns, the processing unit 140 may delete from the pattern dictionary 170, before the compression processing, any first subgraph among them that was not involved in the update to the extended subgraph or the further-extended subgraph.

In this way, the processing unit 140 keeps frequently used patterns in the pattern dictionary 170 and deletes infrequently used patterns to shrink the pattern dictionary 170, and can thereby maximize the compression ratio of the graph stream.

In other words, in the present invention, only the minimum set of reference frequent patterns that are frequently used throughout the time-varying graph stream is left in the pattern dictionary 170, so that the compression efficiency of the graph stream can be increased and the compression processing can proceed quickly.

Depending on the embodiment, the incremental frequent pattern-based graph stream compression system 100 may further include a calculation unit 160 that calculates a pattern score (RPS) so that patterns of higher importance, in view of pattern frequency and graph connectivity, are maintained in the pattern dictionary 170 as reference frequent patterns.

When there are a plurality of extended subgraphs, the calculation unit 160 calculates a pattern score for each of the plurality of extended subgraphs in consideration of its frequency and edge size.

To adopt as reference patterns those subgraph patterns that occur frequently throughout the graph stream and have many graph connections, the calculation unit 160 may calculate the pattern score (RPS) according to Equation (1), described later, in consideration of the frequency (Frequent) and edge size (Edge Size) of each subgraph pattern.

The processing unit 140 may select the top k extended subgraphs (k being a natural number) in order of pattern score and designate the selected k extended subgraphs as reference frequent patterns.

Here, the processing unit 140 may reduce the size of the pattern dictionary 170 by deleting from the pattern dictionary 170 the remaining extended subgraphs that were not selected from among the plurality of extended subgraphs according to the pattern score order.

For example, referring to FIG. 10, when a pattern tree such as that of (a) of FIG. 10 is input, the pattern dictionary 230 records, as in table 1010 shown in (b) of FIG. 10, indices 1 to 8 assigned in the order in which the patterns were input as pattern identifiers P_id, records timestamp T for patterns P_id 1 to 4 and timestamp T+1 for patterns P_id 5 to 8, and also records the frequency (Frequent) of each pattern.

Thereafter, the processing unit 140 may apply Top-k (k = 6): when the patterns recorded in table 1010 are sorted in descending order of frequency, the patterns P_id 1, 4, 2, 3, 8, and 5, corresponding to the top six, are selected as reference patterns, and the remaining unselected patterns may be deleted.

When patterns have equal frequency during reference pattern selection, the processing unit 140 may preferentially select as a reference pattern the pattern that has more edges or that was input earlier and therefore has a lower P_id.

Accordingly, the pattern dictionary 230 can finally retain, as in table 1020 shown in (b) of FIG. 10, reference frequent patterns that occur frequently and, having a large edge size, exhibit high graph connectivity.
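
The Top-k selection with tie-breaking may be sketched in Python as follows. The record fields (`p_id`, `frequent`, `edges`), the function name `select_top_k`, and the sample values are hypothetical and do not reproduce table 1010; the sketch also assumes that the edge count is compared before P_id when frequencies are equal, an ordering the description does not fix.

```python
def select_top_k(patterns, k):
    """Split patterns into the k kept as reference patterns and the rest.

    Ranking: higher frequency first, then more edges, then lower P_id
    (the order of the two tie-breakers is an assumption).
    """
    ranked = sorted(
        patterns,
        key=lambda p: (-p["frequent"], -p["edges"], p["p_id"]),
    )
    return ranked[:k], ranked[k:]


# Hypothetical dictionary contents, not the values of table 1010.
table = [
    {"p_id": 1, "frequent": 5, "edges": 2},
    {"p_id": 2, "frequent": 3, "edges": 1},
    {"p_id": 3, "frequent": 3, "edges": 2},
    {"p_id": 4, "frequent": 3, "edges": 1},
]
kept, deleted = select_top_k(table, k=3)
```

Bounding the dictionary at k entries in this way is what keeps the comparison cost per incoming pattern from growing with the length of the stream.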

In this way, in the present invention, applying the Top-K Pattern policy makes it possible to select reference patterns of high importance and keep them in the pattern dictionary 170, and limiting the maximum size (Max-size) of the pattern dictionary 170 makes it possible to prevent overload caused by an increased amount of computation.

The storage unit 120 may store, in the pattern dictionary 170, provenance information that records, relative to the extended subgraph designated as the reference frequent pattern, the changes to the graph stream, which changes over time as it is input.

Here, the provenance information may include at least one of: a pattern score calculated, for the extended subgraph designated as the reference frequent pattern, according to its set frequency and edge size; a timestamp recording the first time and the second time, which are the times at which the sub-streams from which the first subgraph and the second subgraph were extracted were input; the number of batches constituting each sub-stream; and a pattern ID for identifying the extended subgraph.

For example, referring to FIG. 11, the storage unit 120 may store in the pattern dictionary 170 provenance information generated for five patterns (pattern IDs 1 to 5), including frequency, edge size, pattern score (RPS), timestamp, Prov(State), and batch (Batch).

The processing unit 140 may perform the compression processing by applying a dictionary encoding technique to the pattern dictionary 170 in which the reference frequent patterns and their provenance information are stored.

That is, once the processing unit 140 has found repeating patterns in the data stream and built the pattern dictionary 170, which represents the repeating patterns with shorter binary codes, it can store the compressed data using the binary codes and the pattern dictionary 170.

When the processing unit 140 dictionary-encodes a graph into binary data according to a dictionary-based compression technique, which operates by finding matches between the text to be compressed and the set of strings held in a data structure, the graph can be compressed and stored at a much smaller size than when the graph patterns are stored as they are, greatly reducing the memory space occupied.
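
Dictionary encoding of pattern occurrences may be illustrated by the following Python sketch. The fixed-width binary codes, the function names `build_codebook`, `encode`, and `decode`, and the representation of a pattern as a tuple of edges are assumptions made for this sketch; the disclosure does not specify the code format.

```python
def build_codebook(patterns):
    """Assign each pattern a fixed-width binary code (assumed code format)."""
    width = max(1, (len(patterns) - 1).bit_length())
    return {p: format(i, f"0{width}b") for i, p in enumerate(patterns)}


def encode(stream, codebook):
    """Replace every pattern occurrence in the stream with its binary code."""
    return "".join(codebook[p] for p in stream)


def decode(bits, codebook):
    """Recover the pattern sequence from the code string and the codebook."""
    width = len(next(iter(codebook.values())))
    reverse = {code: p for p, code in codebook.items()}
    return [reverse[bits[i:i + width]] for i in range(0, len(bits), width)]


p0 = (("a", "b"),)
p1 = (("a", "b"), ("b", "c"))
p2 = (("c", "d"),)
codebook = build_codebook([p0, p1, p2])
bits = encode([p0, p1, p1, p2], codebook)
restored = decode(bits, codebook)
```

Each occurrence of a multi-edge pattern costs only two bits here, while the full edge list is stored once in the codebook; the saving grows with how often the reference patterns recur.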

According to the present invention, the process of extending and updating a frequent pattern that was found at a previous point in time and stored in the pattern dictionary, using a frequent pattern found at the current point in time, is repeated each time a stream is input over time, so that the frequent patterns in the pattern dictionary are updated incrementally. As a result, when the determined end point is reached, the pattern dictionary retains the frequent patterns that serve as the reference for the entire time-varying graph stream, which improves the compression efficiency and processing speed of the graph stream.

FIG. 2 is a diagram showing the overall structure of an incremental frequent pattern-based graph stream compression system according to an embodiment of the present invention.

Referring to FIG. 2, the incremental frequent pattern-based graph stream compression system 200 according to an embodiment of the present invention may include a reference pattern generator 210, a graph manager 220, a pattern dictionary 230 in in-memory storage 201, and a compression processing unit 240.

The reference pattern generator (Reference Pattern) 210 extracts frequently occurring subgraph patterns when a graph stream G[t] is input, calculates a pattern score according to the frequency and edge size of each extracted subgraph pattern, and determines a certain number of reference patterns from among the extracted subgraph patterns in descending order of pattern score.

The reference pattern generator 210 may also determine, as the reference pattern, the pattern with the highest compression ratio among the extracted subgraph patterns.

By repeating this process, the reference pattern generator 210 passes the reference patterns to the graph manager 220 and, at the same time, stores them in the pattern dictionary 230 so that they can be used in the next window.

The graph manager (Graph Manager) 220 manages patterns using a pattern management policy and determines the similarity between patterns when changes occur over time in the stream environment.

The pattern dictionary (Pattern Dictionary) 230 stores the subgraph patterns generated by the reference pattern generator 210 together with the related provenance information.

As shown in FIG. 11, the provenance information may include the pattern ID, frequency, edge size, pattern score, timestamp, number of batches, and the like of each pattern.

All data stored in the pattern dictionary 230 is converted by dictionary encoding before being stored, in order to use the storage space efficiently.

The reference pattern generator 210 and the graph manager 220 use the data stored in the pattern dictionary 230 to determine the reference patterns.

The compression processing unit 240 finally compresses the graph data by applying the reference patterns created by the reference pattern generator 210.

FIG. 3 is a diagram showing the process performed by the reference pattern generator 210 shown in FIG. 2 in the incremental frequent pattern-based graph stream compression system according to an embodiment of the present invention.

Referring to FIG. 3, the reference pattern generator 210 may go through initial frequent pattern search (step 1), incremental frequent pattern search (step 2), and reference frequent pattern search (step 3), and determine the pattern with the highest calculated pattern score as the reference pattern.

(step 1) The initial frequent pattern search is performed when no graph exists yet or when there is no pattern in the pattern dictionary 230. In this case, no reference pattern is stored in the pattern dictionary 230. The reference pattern generator 210 therefore searches the graph in units of the batch size and initially finds single-edge patterns.

[Table 1] shows the algorithm for searching for initial frequent patterns. An edge e of a batch graph G_B is (v_source, v_target, l). After creating a single-edge graph, the reference pattern generator 210 sets, in list order, the vertex that comes first as v_source and the vertex that comes after it as v_target, adds the edge as (v_source, v_target, l), and then stores the result in the pattern dictionary 230.
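
The following Python sketch illustrates one way single-edge patterns might be collected from a batch. It does not reproduce the algorithm of [Table 1]; the function name `initial_patterns`, the dictionary layout, and the sequential assignment of `p_id` in order of first appearance are assumptions made for illustration.

```python
def initial_patterns(batch_edges):
    """Collect single-edge patterns (v_source, v_target, l) from one batch.

    Returns a dict mapping each distinct edge to a record holding a
    pattern ID and a frequency (hypothetical dictionary layout).
    """
    dictionary = {}
    for v_source, v_target, label in batch_edges:
        key = (v_source, v_target, label)
        # A new edge gets the next pattern ID; a known edge keeps its record.
        entry = dictionary.setdefault(
            key, {"p_id": len(dictionary) + 1, "frequent": 0}
        )
        entry["frequent"] += 1
    return dictionary


batch = [(1, 2, "x"), (2, 3, "y"), (1, 2, "x")]
pattern_dictionary = initial_patterns(batch)
```

In this example the edge (1, 2, "x") occurs twice in the batch and is therefore stored once with a frequency of 2.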

(step 2) The incremental frequent pattern search is performed repeatedly on the subgraphs belonging to the patterns found in the graph; it is not performed while step 1 is being carried out, because no pattern exists yet.

[Table 2] shows the incremental frequent pattern algorithm. In this step, the single-edge patterns found in the initial frequent pattern search are updated incrementally.

Each subgraph in the set of subgraphs found in the graph is examined, and vertices are matched to indices by repeatedly applying G(v) → P(v); each subgraph then has its vertices matched to indices in the order in which they arrive, which reduces the size of the graph.

FIG. 4 is a diagram for explaining the process of assigning indices to the vertices of a subgraph and storing them in the pattern dictionary 230 in the incremental frequent pattern-based graph stream compression system according to an embodiment of the present invention.

Referring to FIG. 4, when vertices arrive in order in the graph stream as G(v), P(v) is matched (assigned) to indices (0, 1, ...) in order of arrival and stored in the pattern dictionary 230. That is, when G(v) = [132, 12, 51, 43, 21, 19, 12, 3, …] arrives, the vertices of the graph are matched in order of arrival as P(v) = [0, 1, 2, 3, 4, 5, 1, 6, …].

Here, as indicated by reference numeral 410, the vertex G(12), previously assigned index '1', and the vertex G(3), assigned index '6', are matched to their existing indices (1, 6).
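
The mapping G(v) → P(v) described with reference to FIG. 4 may be sketched in Python as follows. The function name `to_pattern_indices` is hypothetical; the input and output values are those given in the description above.

```python
def to_pattern_indices(vertex_stream):
    """Map each vertex ID to an index assigned in order of first arrival.

    A vertex that has been seen before reuses its existing index.
    """
    index_of = {}
    return [index_of.setdefault(v, len(index_of)) for v in vertex_stream]


g_v = [132, 12, 51, 43, 21, 19, 12, 3]
p_v = to_pattern_indices(g_v)  # [0, 1, 2, 3, 4, 5, 1, 6]
```

The second occurrence of vertex 12 is mapped to the existing index 1 rather than to a new index, so structurally identical subgraphs with different vertex IDs reduce to the same compact index sequence.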

Connectivity is checked to determine whether the vertices stored in G(v) and P(v) are connected to the edges stored in the pattern, and if the G(v) value is higher, the pattern is extended.

For example, if an edge e of g(e) does not exist in the pattern, the pattern is extended; if it does exist, it is added to the pattern dictionary 230.

(step 3) In the reference frequent pattern search, reference patterns are adopted from among the extracted subgraph patterns according to their pattern scores.

In this step, to adopt as reference patterns those subgraph patterns that occur frequently throughout the graph stream and have many graph connections, the pattern score (RPS) is calculated according to Equation (1) in consideration of the frequency and edge size of each subgraph pattern, and reference patterns may be adopted in descending order of pattern score (RPS).

For example, when a pattern already stored in the pattern dictionary 230 is determined to be an isomorphic graph of a subgraph g extracted from a newly input graph G_B, the pattern score (RPS) calculated according to Equation (1) is updated in the pattern dictionary 230. In Equation (1), α and β are constants.

FIG. 5 is a diagram showing the process performed by the graph manager 220 shown in FIG. 2 in the incremental frequent pattern-based graph stream compression system according to an embodiment of the present invention.

When the newly input graph stream G[t] changes over time, the graph manager 220 manages patterns using a pattern management policy, and also judges graph similarity and causes the reference pattern generator 210 to run again when a pattern different from the existing ones appears.

Referring to FIG. 5, the graph manager 220 includes a similarity verifier 221 and a pattern manager 222 in order to manage patterns by determining a pattern management policy according to the importance of a newly input pattern and its similarity to existing patterns.

The similarity verifier (Similarity Verification) 221 verifies similarity by comparing the existing patterns stored in the pattern dictionary 230 with a pattern newly input at time t.

The similarity verifier 221 may calculate the similarity according to the VF2 algorithm, and may also perform a subgraph isomorphism test to calculate the similarity between a reference pattern and the newly input pattern G_t.

FIG. 6 is a diagram for explaining the process of determining isomorphic graphs according to the result of determining the similarity between subgraphs in the incremental frequent pattern-based graph stream compression system according to an embodiment of the present invention.

Referring to (a) and (b) of FIG. 6, when an existing subgraph pattern G₁ and a newly input subgraph pattern G₂ exist, the similarity verifier 221 does not determine G₁ and G₂ to be isomorphic graphs only when G₁ = G₂; it also determines graphs G₁ and G₂ satisfying G₁ ⊆ G₂ to be isomorphic graphs, and may judge G₁ and G₂ to be patterns with high similarity.

This takes into account that most conventional compression algorithms determine, as the reference pattern, a graph whose vertices and edges appear in common and then describe the changes relative to that reference pattern, but that if only exactly matching graph patterns are searched for, the number of patterns applicable to compression becomes too small for them to be useful. The similarity verifier 221 therefore also judges patterns that do not match exactly to be isomorphic, and can verify pattern similarity more flexibly than other algorithms.

The similarity verifier 221 can reduce the amount of comparison computation by determining the similarity between the newly input pattern and the reference patterns selected by the reference pattern generator 210 through the pattern score (RPS), rather than the entire graph.

The pattern manager (Pattern Manager) 222 determines and manages the importance of patterns through timestamps (Time-Stamp), pruning using an FP-Tree, and a limit on the size of the patterns stored in the pattern dictionary 230.

In a graph stream, the same patterns do not always appear; patterns change over time, so the patterns at the current point in time may differ from those at a later point.

In view of this, the pattern manager 222 applies the policies of timestamp trimming, pruning using an FP-Tree, and Top-K Pattern when detecting frequent patterns.

Timestamp trimming is processing that deletes patterns once the window of a determined threshold τ has passed, in order to secure memory space for processing the graph stream to be input next (see FIG. 7).

Pruning using an FP-Tree is processing that applies the FP-Growth algorithm to the patterns stored in the pattern dictionary 230 and prunes infrequent patterns, in order to use the limited storage space efficiently (see FIG. 8).

Top-K Pattern is processing that stores only the K most frequently occurring reference patterns, in order to limit the maximum size (Max-size) of the patterns stored in the pattern dictionary 230 (see FIG. 9).

Accordingly, patterns may finally be stored in the pattern dictionary 230 in descending order of the pattern score (Pattern Score) obtained by combining and normalizing the frequency, similarity, and importance of the patterns described above.

FIG. 7 is a diagram for explaining time windows that change over time in the incremental frequent pattern-based graph stream compression system according to an embodiment of the present invention.

FIG. 7 shows time windows W₀ to W₃ that change over time, and the timestamp is updated each time a time window elapses.

One window (W_n) contains several batches (B_n), and the number of batches can be specified by the user. For example, if there is a window W₀ for which the number of batches is specified as '3', it is expressed as W₀ = {B₁, B₂, B₃}.

When the threshold is set to τ = 4, the pattern manager 222 keeps the patterns of W₀ to W₃ and then, before the patterns of W₄ arrive, deletes the patterns of W₀ to W₃ from the in-memory storage 201, thereby securing memory space for processing the next graph stream.
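
The τ = 4 example may be sketched in Python as follows. The function name `add_window` and the representation of the held windows as a list are assumptions made for illustration only.

```python
def add_window(held_windows, new_window, tau):
    """Admit a new window, first trimming memory if tau windows are held.

    Returns the list of windows held after the new window is admitted.
    """
    if len(held_windows) >= tau:
        # Threshold reached: drop the held windows before the next one arrives.
        held_windows = []
    return held_windows + [new_window]


held = []
for name in ["W0", "W1", "W2", "W3", "W4"]:
    held = add_window(held, name, tau=4)
# After W4 arrives, W0 to W3 have been dropped and only W4 is held.
```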

FIG. 8 is a diagram for explaining the process of incrementally updating frequent patterns in the incremental frequent pattern-based graph stream compression system according to an embodiment of the present invention.

(a) of FIG. 8 shows a time-aware pattern update in the first stream, and (b) of FIG. 8 shows a time-aware pattern update in the second stream.

As shown in (a) and (b) of FIG. 8, the pattern manager 222 manages the pattern dictionary 230 by using the windows (W₀, W₁) and batches (Batch 1 to 5) of the graph stream to detect pattern changes (changes in vertices and edges) over time.

The reference patterns stored in the pattern dictionary 230 are managed by updating and deleting them. For example, when there is a window with a batch size of 3 as in (a) of FIG. 8, the timestamp of a pattern is kept at the current point in time T if all of its vertices are appearing for the first time.

When the window size is 3, patterns P₁, P₂, and P₃ appear in Batch 1. In Batch 2, P₁₊₂, an extension of the patterns P₁ and P₂ that appeared in Batch 1, is added, so a total of four patterns, P₁, P₂, P₃, and P₁₊₂, appear. In Batch 3, the patterns P₂, P₃, P₁₊₂, and P₄ appear, with the P₁ pattern from Batch 1 and Batch 2 excluded. Therefore, the patterns P₁, P₂, P₃, P₁₊₂, and P₄ exist in W₀.

Afterwards, the patterns are updated when time advances from W₀ to W₁, as shown in FIG. 8(b). If a pattern from W₀ continues to appear in W₁, its timestamp is kept at the current time (T). The patterns P₁ and P₁₊₂, however, do not appear in W₁ and exist only at the previous time (T-1), so their timestamps are changed to T-1. The patterns P₅ and P₂₊₃, which did not appear at the previous time but are present now, are added. This process is repeated to revise the timestamps, and patterns that have not appeared in a window for the threshold duration T = τ are deleted.
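The timestamp bookkeeping described for FIG. 8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the dictionary layout, the function names `update_timestamps` and `evict_stale`, the use of an integer window index as the timestamp, the exact eviction rule, and the contents of W₁ (inferred from the patterns the text says disappear and appear) are all hypothetical.

```python
def update_timestamps(dictionary, window_patterns, now):
    """Set the timestamp of every pattern seen in the current window to `now`.

    dictionary: maps pattern id -> index of the last window it appeared in
                (hypothetical layout).
    Patterns absent from the window keep their older timestamp.
    """
    for pid in window_patterns:
        dictionary[pid] = now  # adds new patterns, refreshes recurring ones
    return dictionary


def evict_stale(dictionary, now, tau):
    """Delete patterns not seen for `tau` windows (assumed eviction rule)."""
    stale = sorted(pid for pid, last_seen in dictionary.items()
                   if now - last_seen >= tau)
    for pid in stale:
        del dictionary[pid]
    return stale


# W0 is taken from the text; W1 is an assumption consistent with it.
w0 = {"P1", "P2", "P3", "P1+2", "P4"}
w1 = {"P2", "P3", "P4", "P5", "P2+3"}

pattern_dict = {}
update_timestamps(pattern_dict, w0, now=0)
update_timestamps(pattern_dict, w1, now=1)
removed_at_1 = evict_stale(pattern_dict, now=1, tau=2)  # nothing is stale yet
# Suppose one more window passes without P1 or P1+2 reappearing.
removed_at_2 = evict_stale(pattern_dict, now=2, tau=2)
```

After W₁, P₁ and P₁₊₂ still carry the timestamp of W₀, which corresponds to T-1 in the text; they are removed only once the assumed threshold of two windows has elapsed.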

In the present invention, by updating and deleting patterns in this way, frequently occurring patterns are retained and used for compression while infrequently occurring patterns are deleted, which secures memory space and increases the processing speed of the graph stream.

FIG. 9 is a diagram illustrating a pruning process that applies the FP-Growth algorithm in the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.

FIG. 9 illustrates a pruning process that applies the FP-Growth algorithm. Each pattern P stores the information (pid, B, T, Frequent), where pid is the pattern ID, B is the number of batches per window, T is the timestamp, and Frequent is the frequency of occurrence.

With a frequency threshold (Frequent Threshold) of 3 and a time threshold (Time Threshold) of 2, a pattern whose value at a leaf node is at or below the threshold is judged to be infrequent and is pruned.

In window W₀, the patterns P₁ and P₁₊₂, which do not satisfy the thresholds, are pruned.

In window W₁, the pattern P₅, which does not satisfy the thresholds, is also pruned.
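The threshold test can be illustrated with a short Python sketch. It does not build an FP-tree and is not FP-Growth itself; it only filters (pid, B, T, Frequent) records against the two thresholds. The frequency values, the `Pattern` type, the `prune` function, and the reading of the time threshold as "windows since last seen" are assumptions, since the actual values appear only in FIG. 9.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    pid: str        # pattern ID
    batches: int    # B: number of batches per window
    timestamp: int  # T: last window in which the pattern appeared
    frequent: int   # Frequent: occurrence count


def prune(patterns, now, freq_threshold=3, time_threshold=2):
    """Split patterns into (kept, pruned) using both thresholds."""
    kept, pruned = [], []
    for p in patterns:
        # "At or below the threshold" in the text is read as <= here.
        too_rare = p.frequent <= freq_threshold
        # Assumed meaning of the time threshold: windows since last seen.
        too_old = now - p.timestamp >= time_threshold
        if too_rare or too_old:
            pruned.append(p)
        else:
            kept.append(p)
    return kept, pruned


# Hypothetical frequencies chosen so that P1 and P1+2 fail the test in W0.
w0 = [
    Pattern("P1", 3, 0, 2),
    Pattern("P2", 3, 0, 6),
    Pattern("P3", 3, 0, 5),
    Pattern("P1+2", 3, 0, 3),
    Pattern("P4", 3, 0, 4),
]
kept, pruned = prune(w0, now=0)
```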

Through this process, the present invention reduces the cost of maintaining patterns and frees memory space for the data arriving in the next stream.

Applying the FP-Growth algorithm to the patterns stored in the pattern dictionary 230 prunes infrequent patterns and reduces the data size, so that a large amount of graph data can be maintained and processed in an in-memory environment. In addition, updating or deleting rarely used patterns in the pattern dictionary keeps the patterns current, which improves compression efficiency for graph streams that change in real time.

FIG. 10 is a diagram illustrating a pattern dictionary (b) built from a pattern tree (a) in the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.

Referring to FIG. 10, when a pattern tree such as that in FIG. 10(a) exists, data may be stored in the pattern dictionary 230 in the form of the tables 1010 and 1020 shown in FIG. 10(b).

The pattern tree in FIG. 10(a) shows that the patterns (0, 1), (0, 2), (1, 3), and (2, 3) are input first at time T, and that the patterns (0, 1, 2), (0, 1, 3), (0, 2, 3), and (1, 3, 2) are input at the later time T+1.

According to this pattern tree, table 1010 records indexes 1 to 8, assigned in the order in which the patterns were input, as the pattern identifiers P_id. The timestamp is recorded as T for patterns P_id 1 to 4 and as T+1 for patterns P_id 5 to 8, and the frequency (Frequent) of each pattern is recorded alongside.

When the time threshold (τ) is set to 2, the reference patterns are selected by applying Top-k to the patterns that were input at times T and T+1 and recorded in table 1010.

When the patterns recorded in table 1010 are sorted in descending order of frequency and k is set to 6, the six most frequent patterns, P_id 1, 4, 2, 3, 8, and 5, are selected as reference patterns, the remaining patterns are deleted, and table 1020 is finally stored in the pattern dictionary 230.

If patterns have equal frequency, the pattern with more edges is selected first; if the number of edges is also equal, patterns are selected in input order, that is, in ascending order of P_id.
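The Top-k selection with its two tie-breaking rules maps naturally onto a composite sort key. The sketch below is illustrative only: the function name `select_top_k`, the tuple layout, the edge counts, and all frequency values are assumptions (the real values are shown only in FIG. 10), chosen so that the result reproduces the order 1, 4, 2, 3, 8, 5 given in the text.

```python
def select_top_k(patterns, k):
    """Return the k highest-ranked patterns.

    patterns: list of (p_id, edge_count, frequency) tuples (assumed layout).
    Ranking: higher frequency first, then more edges, then lower p_id.
    """
    ranked = sorted(patterns, key=lambda p: (-p[2], -p[1], p[0]))
    return ranked[:k]


# Hypothetical contents of table 1010: (p_id, edge_count, frequency).
table_1010 = [
    (1, 1, 9), (2, 1, 6), (3, 1, 6), (4, 1, 7),
    (5, 2, 3), (6, 2, 2), (7, 2, 1), (8, 2, 4),
]

table_1020 = select_top_k(table_1010, k=6)
selected_ids = [p_id for p_id, _, _ in table_1020]
```

Patterns 2 and 3 tie on both frequency and edge count in this made-up data, so the lower P_id is ranked first.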

In this way, by applying the Top-K Pattern, the present invention can select the k patterns that occur frequently and have a large edge size as high-importance reference patterns and store them in the pattern dictionary 230. It can also limit the Max-size of the pattern dictionary 230 to prevent overload caused by an increased amount of computation.

The pattern dictionary 230 is stored in the in-memory 201 and holds the reference patterns generated by the reference pattern generator 210 together with their provenance information. The provenance information includes the frequency, pattern ID, timestamp, pattern score, and the like.

For example, when a graph pattern is stored in the pattern dictionary 230, its frequency is stored based on how often the vertices and edges of the pattern appear in the graph, together with the pattern ID that identifies the pattern, the timestamp, the pattern score, and the like.

Using this provenance information, the change history of a graph stream that changes in real time can be readily traced from the reference patterns extracted by graph pattern mining.

In addition, all data stored in the pattern dictionary 230 is converted by dictionary encoding before being stored. When the converted data is subsequently encoded again, it occupies less space than data that was not dictionary-encoded, which allows the memory space to be used efficiently.
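As a rough illustration of dictionary encoding in general, the following Python sketch replaces each distinct value with a small integer code and keeps the mapping for decoding. The function names and the sample vertex labels are assumptions; the text does not specify how the encoding is applied to the pattern dictionary 230.

```python
def dictionary_encode(values):
    """Map each distinct value to an integer code in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)  # next unused code
        encoded.append(codes[value])
    return codes, encoded


def dictionary_decode(codes, encoded):
    """Invert dictionary_encode using the stored mapping."""
    reverse = {code: value for value, code in codes.items()}
    return [reverse[code] for code in encoded]


# Hypothetical vertex labels occurring in stored patterns.
labels = ["alice", "bob", "alice", "carol", "bob"]
codes, encoded = dictionary_encode(labels)
restored = dictionary_decode(codes, encoded)
```

Repeated long labels are stored once in the mapping and referenced by short codes afterwards, which is where the space saving comes from.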

[Table 3] defines the elements of the provenance information. The provenance information consists of the time at which a graph change occurred (Time, T_n) and the state of a graph vertex or edge (State), and is written Prov(Time, State). State is one of insertion (I_n), deletion (D_n), or update (U_n) of a vertex or edge. FIG. 11 illustrates the provenance information of each pattern stored in the pattern dictionary 230 according to [Table 3].
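One possible in-memory representation of Prov(Time, State) is sketched below in Python. The class names, the tagging of elements as ("v", id) or ("e", (u, v)), and the sample log are assumptions made for illustration; only the Time/State pairing and the three state kinds come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    INSERT = "I"  # vertex or edge inserted
    DELETE = "D"  # vertex or edge deleted
    UPDATE = "U"  # vertex or edge changed


@dataclass(frozen=True)
class Prov:
    time: int     # when the graph change occurred
    state: State  # what kind of change it was


def history(log, element):
    """Return the provenance records of one vertex or edge, in log order."""
    return [prov for target, prov in log if target == element]


# Hypothetical change log: (element, Prov) pairs.
log = [
    (("v", 0), Prov(1, State.INSERT)),
    (("e", (0, 1)), Prov(1, State.INSERT)),
    (("e", (0, 1)), Prov(2, State.UPDATE)),
    (("v", 0), Prov(3, State.DELETE)),
]
edge_history = history(log, ("e", (0, 1)))
```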

FIG. 11 is a diagram showing the provenance information stored in the pattern dictionary in the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.

Referring to FIG. 11, when five patterns (pattern IDs 1 to 5) are stored in the pattern dictionary 230, the provenance information generated for each pattern, including the frequency, edge size, pattern score (RPS), timestamp, Prov(State), and batch (Batch), may be stored with it in the pattern dictionary 230.

The reference pattern generator 210 and the graph manager 220 use the provenance information stored in the pattern dictionary 230 to select, from the stored patterns, those that are frequent, have a large edge size and therefore high graph connectivity, and have high similarity and importance as reference patterns. Patterns that are not selected are deleted from the pattern dictionary 230, so that only the reference patterns finally remain.

FIG. 12 is a diagram showing the compression process in the graph stream compression system based on incremental frequent patterns according to an embodiment of the present invention.

Referring to FIG. 12, the reference pattern generator 210 may incrementally update the initial frequent patterns arising in the graph stream each time a stream is input.

When the time designated as the threshold is then reached, the reference pattern generator 210 may use the provenance information (frequency, edge size, pattern score, timestamp) obtained by the graph manager 220 to select frequent, high-importance patterns as reference frequent patterns and delete the unselected patterns from the pattern dictionary 230.

The compression processing unit 240 may compress the reference frequent patterns remaining in the pattern dictionary 230 using a dictionary-based compression technique.

A dictionary-based compression technique operates by finding matches between the text to be compressed and a set of strings held in a data structure (the dictionary).

That is, the compression processing unit 240 may find repeating patterns in the data stream, build a dictionary that represents each repeating pattern with a shorter binary code, and then store the compressed data using only the binary codes and the dictionary.
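The idea of replacing repeated patterns with short codes can be sketched for a single batch of edges as follows. This is a simplified illustration, not the compression processing unit 240 itself: patterns are treated as exact edge sets with fixed vertex IDs (no isomorphism matching), codes are plain integers rather than binary codes, and the names `compress`, `decompress`, and `pattern_codes` are hypothetical.

```python
def compress(batch_edges, pattern_codes):
    """Replace edges covered by a dictionary pattern with that pattern's code.

    batch_edges: set of (u, v) edges in one batch.
    pattern_codes: dict mapping frozenset of edges -> short integer code.
    Returns (codes, residual_edges).
    """
    remaining = set(batch_edges)
    codes = []
    # Try larger patterns first so sub-patterns do not shadow them.
    for pattern, code in sorted(pattern_codes.items(), key=lambda kv: -len(kv[0])):
        if pattern <= remaining:
            codes.append(code)
            remaining -= pattern
    return codes, remaining


def decompress(codes, residual_edges, pattern_codes):
    """Rebuild the original edge set from codes plus uncovered edges."""
    by_code = {code: pattern for pattern, code in pattern_codes.items()}
    edges = set(residual_edges)
    for code in codes:
        edges |= by_code[code]
    return edges


pattern_codes = {
    frozenset({(0, 1), (0, 2)}): 0,
    frozenset({(1, 3)}): 1,
}
batch = {(0, 1), (0, 2), (1, 3), (2, 3)}
codes, residual = compress(batch, pattern_codes)
restored = decompress(codes, residual, pattern_codes)
```

Only the codes, the uncovered edges, and the dictionary need to be stored; the round trip through `decompress` recovers the original batch.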

By encoding the graph as binary data in this way, the compression processing unit 240 can store it in far less space than storing the graph patterns themselves.

FIG. 13 is a flowchart showing the sequence of the graph stream compression method based on incremental frequent patterns according to an embodiment of the present invention.

The graph stream compression method based on incremental frequent patterns according to this embodiment may be performed by the graph stream compression system 100 described above.

Referring to FIG. 13, when a graph stream that changes in real time is input in step 1310, the graph stream compression system 100, in step 1320, extracts a first subgraph from the substream input during a first time (T) of the graph stream and stores it in the pattern dictionary as an initial frequent pattern.

For example, the graph stream compression system 100 may store, in the pattern dictionary, the vertices constituting the first subgraph and the edges connecting them.

In step 1330, the graph stream compression system 100 extracts a second subgraph, different from the first subgraph, from the substream input during a second time (T) following the first time (T).

In step 1340, the graph stream compression system 100 updates the first subgraph in the pattern dictionary by extending it with the second subgraph.

For example, the graph stream compression system 100 may divide the vertices constituting the second subgraph into existing vertices, which are already stored in the pattern dictionary, and new vertices, which are not. It may then search the second subgraph for first edges, which connect an existing vertex to a new vertex, and second edges, which connect new vertices to each other, and attach them to the first subgraph, thereby updating it into the extended subgraph.
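The extension step can be sketched in Python with plain sets of vertices and edges. The function name `extend_subgraph` and the set-based graph representation are assumptions, and the sketch simplifies one point: it accepts every edge between two new vertices as a second edge without checking that it is reachable through a first edge.

```python
def extend_subgraph(first_vertices, first_edges, second_edges):
    """Extend the stored first subgraph with edges found in the second subgraph.

    first_vertices: set of vertices already in the pattern dictionary.
    first_edges: set of (u, v) edges of the stored first subgraph.
    second_edges: set of (u, v) edges of the newly extracted second subgraph.
    Returns (extended_vertices, extended_edges).
    """
    second_vertices = {v for edge in second_edges for v in edge}
    new_vertices = second_vertices - first_vertices

    first_kind = set()   # edges joining an existing vertex and a new vertex
    second_kind = set()  # edges joining two new vertices
    for u, v in second_edges:
        u_new, v_new = u in new_vertices, v in new_vertices
        if u_new and v_new:
            second_kind.add((u, v))
        elif u_new or v_new:
            first_kind.add((u, v))

    extended_edges = set(first_edges) | first_kind | second_kind
    extended_vertices = set(first_vertices) | new_vertices
    return extended_vertices, extended_edges


vertices, edges = extend_subgraph(
    first_vertices={0, 1, 2},
    first_edges={(0, 1), (0, 2)},
    second_edges={(0, 1), (1, 3), (3, 4)},
)
```

In this example, (1, 3) is a first edge because vertex 3 is new, and (3, 4) is a second edge because both of its endpoints are new.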

In step 1350, the graph stream compression system 100 checks whether the sum of the first time and the second time (2T) has reached the predetermined time (τ).

If the predetermined time (τ) has not been reached, the graph stream compression system 100 extracts a third subgraph, different from the first and second subgraphs, from the substream input during a third time (T) starting when the second time (T) elapses, and uses the third subgraph to update the extended subgraph into an additional extended subgraph. This process is repeated until the sum of the first, second, and third times (3T) reaches the predetermined time (τ).

When the predetermined time (τ) has been reached, the graph stream compression system 100 designates the extended subgraph in the pattern dictionary as the reference frequent pattern in step 1360 and compresses the pattern dictionary in step 1370.
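The control flow of steps 1320 to 1360 can be summarized by the loop below. It is a sketch under assumptions: `extract` and `extend` stand in for the subgraph extraction and extension steps and are passed in as hypothetical callables, every substream is taken to span the same period T, and the compression of step 1370 is left out.

```python
def run_until_threshold(substreams, period, tau, extract, extend):
    """Accumulate an extended pattern until the elapsed time reaches tau.

    substreams: iterable of substreams, one per period.
    period: duration T covered by each substream.
    tau: predetermined time at which the pattern becomes the reference pattern.
    Returns (pattern, elapsed_time).
    """
    elapsed = 0
    pattern = None
    for substream in substreams:
        subgraph = extract(substream)
        # The first subgraph is stored as is; later ones extend it.
        pattern = subgraph if pattern is None else extend(pattern, subgraph)
        elapsed += period
        if elapsed >= tau:
            break  # step 1360: pattern is designated the reference pattern
    return pattern, elapsed


# Toy run: each substream is already an edge set, and extension is set union.
substreams = [{(0, 1)}, {(1, 2)}, {(2, 3)}, {(3, 4)}]
reference, elapsed = run_until_threshold(
    substreams, period=1, tau=3,
    extract=lambda s: set(s),
    extend=lambda pattern, subgraph: pattern | subgraph,
)
```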

In this way, the present invention finds frequent patterns in a graph stream input in real time, keeps them in the pattern dictionary as reference frequent patterns, and excludes infrequent patterns from the pattern dictionary. This reduces the data to which graph stream compression is applied and thereby improves both the compression processing speed and the compression efficiency of the graph stream.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiment, and vice versa.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may also be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

Although the embodiments have been described above with reference to a limited set of drawings, those of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from that described, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from that described, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

100: graph stream compression system based on incremental frequent patterns
110: extraction unit
120: storage unit
130: update unit
140: processing unit
150: determination unit
160: calculation unit
170: pattern dictionary

Claims

A method for compressing a graph stream based on incremental frequent patterns, the method comprising, in response to a compression command for a graph stream input in real time:
extracting a first subgraph from a substream of the graph stream input during a first time;
storing the first subgraph in a pattern dictionary as an initial frequent pattern;
when a second subgraph different from the first subgraph is extracted from a substream of the graph stream input during a second time following the lapse of the first time, updating the first subgraph into an extended subgraph within the pattern dictionary using the second subgraph, and maintaining the extended subgraph; and
when the sum of the first and second times reaches a predetermined time, compressing the extended subgraph maintained in the pattern dictionary as a reference frequent pattern.

The method of claim 1, further comprising, when the sum of the first and second times does not reach the predetermined time:
extracting a third subgraph different from the first and second subgraphs from a substream of the graph stream input during a third time following the lapse of the second time; and
updating the extended subgraph into an additional extended subgraph using the third subgraph.

The method of claim 2, further comprising, when a plurality of first subgraphs are stored in the pattern dictionary as the initial frequent pattern:
deleting from the pattern dictionary any first subgraph, among the plurality of first subgraphs, that is not involved in the update to the extended subgraph or the additional extended subgraph.

The method of claim 1, further comprising:
upon extraction of the second subgraph, determining a degree of similarity between the first subgraph and the second subgraph using a VF2 algorithm;
determining, according to the degree of similarity, whether the first subgraph is an isomorphic graph that matches at least a part of the second subgraph; and
when the first subgraph is determined to be an isomorphic graph of the second subgraph, updating the first subgraph with the second subgraph.

The method of claim 1, wherein the storing in the pattern dictionary comprises:
storing, in the pattern dictionary, a plurality of vertices constituting the first subgraph and edges connecting the plurality of vertices,
and wherein the updating and maintaining comprises:
dividing a plurality of vertices constituting the second subgraph into existing vertices already stored in the pattern dictionary and new vertices not stored in the pattern dictionary;
searching the second subgraph for a first edge connecting an existing vertex and a new vertex;
searching the second sub-stream for a second edge that is connected to the first edge and connects the new vertices; and
updating the first subgraph into the extended subgraph by additionally connecting the first edge and the second edge to the first subgraph.

The method of claim 1, further comprising, when there are a plurality of extended subgraphs:
calculating a pattern score for each of the plurality of extended subgraphs in consideration of frequency and edge size;
selecting the top k extended subgraphs (k being a natural number) in order of pattern score; and
designating the k extended subgraphs as the reference frequent pattern.

The method of claim 6, further comprising:
deleting from the pattern dictionary any extended subgraph, among the plurality of extended subgraphs, that is not selected.

The method of claim 1, further comprising:
storing, in the pattern dictionary, provenance information that records, on the basis of the extended subgraph designated as the reference frequent pattern, changes to the graph stream, which is input while changing over time,
wherein the compressing comprises:
compressing the pattern dictionary, in which the reference frequent pattern and the provenance information are stored, by applying a dictionary encoding technique.

The method of claim 8, wherein the provenance information includes, for the extended subgraph designated as the reference frequent pattern, at least one of:
a pattern score calculated according to a set frequency and edge size;
a timestamp recording the first and second times at which the substreams from which the first and second subgraphs were extracted were input from the graph stream;
the number of batches constituting each substream; and
a pattern ID for identifying the extended subgraph.

A system for compressing a graph stream based on incremental frequent patterns, the system comprising:
an extraction unit that, in response to a compression command for a graph stream input in real time, extracts a first subgraph from a substream of the graph stream input during a first time;
a storage unit that stores the first subgraph in a pattern dictionary as an initial frequent pattern;
an update unit that, when the extraction unit extracts a second subgraph different from the first subgraph from a substream of the graph stream input during a second time following the lapse of the first time, updates the first subgraph into an extended subgraph within the pattern dictionary using the second subgraph and maintains the extended subgraph; and
a processing unit that, when the sum of the first and second times reaches a predetermined time, compresses the extended subgraph maintained in the pattern dictionary as a reference frequent pattern.

The system of claim 10, wherein, when the sum of the first and second times does not reach the predetermined time:
the extraction unit extracts a third subgraph different from the first and second subgraphs from a substream of the graph stream input during a third time following the lapse of the second time; and
the update unit updates the extended subgraph into an additional extended subgraph using the third subgraph.

The system of claim 10, further comprising:
a determination unit that, upon extraction of the second subgraph, determines a degree of similarity between the first subgraph and the second subgraph using a VF2 algorithm and determines, according to the degree of similarity, whether the first subgraph is an isomorphic graph that matches at least a part of the second subgraph,
wherein the update unit updates the first subgraph with the second subgraph when the first subgraph is determined to be an isomorphic graph of the second subgraph.

The system of claim 10, further comprising:
a calculation unit that, when there are a plurality of extended subgraphs, calculates a pattern score for each of the plurality of extended subgraphs in consideration of frequency and edge size,
wherein the processing unit selects the top k extended subgraphs (k being a natural number) in order of pattern score and designates the k extended subgraphs as the reference frequent pattern.

The system of claim 10, wherein:
the storage unit stores, in the pattern dictionary, provenance information that records, on the basis of the extended subgraph designated as the reference frequent pattern, changes to the graph stream, which is input while changing over time; and
the processing unit compresses the pattern dictionary, in which the reference frequent pattern and the provenance information are stored, by applying a dictionary encoding technique.

A computer-readable recording medium recording a program for executing the method of any one of claims 1 to 9.