KR102054068B1

KR102054068B1 - Partitioning method and partitioning device for real-time distributed storage of graph stream

Info

Publication number: KR102054068B1
Application number: KR1020180048571A
Authority: KR
Inventors: 한진수; 김민수; 복경수; 유재수
Original assignee: 충북대학교 산학협력단
Priority date: 2018-04-26
Filing date: 2018-04-26
Publication date: 2019-12-09
Also published as: KR20190124512A

Abstract

Disclosed is a distributed storage management technique for graph streams that improves query processing performance in a cluster composed of one master node and several slave nodes, where a large graph has been partitioned by vertex-cut.
A partitioning method for real-time distributed storage of a graph stream may include: calculating a score for each of a plurality of nodes that individually process the subgraphs into which a graph has been partitioned; determining, among the plurality of nodes, the selected node with the highest calculated score; and connecting a vertex associated with a new graph stream by an edge to the subgraph belonging to the selected node, so that the selected node processes the new graph stream.

Description

PARTITIONING METHOD AND PARTITIONING DEVICE FOR REAL-TIME DISTRIBUTED STORAGE OF GRAPH STREAM

The present invention relates to a distributed storage management technique for graph streams that improves query processing performance in a cluster composed of one master node and several slave nodes, where a large graph has been partitioned by vertex-cut.

The background art of the present invention is disclosed in the following documents.
1) Registration No. 10-1285078 (2013-07-05), "Incremental MapReduce-Based Distributed Parallel Processing System and Method for Stream Data"
2) Registration No. 10-1678149 (2016-11-15), "Data Search Method for a Database, Apparatus Therefor, and Computer Program Therefor"

Recently, as services such as social networks, the semantic web, and the Internet of Things have come into wide use, graph data is increasingly used to represent relationships and interactions between users or things. In addition, the spread of mobile devices, together with the rapid growth of social networking services such as Twitter and Facebook, is further increasing the volume of graph data.

To store and manage massive graphs efficiently, techniques are needed that partition a large graph into smaller subgraphs.

Metis and PowerGraph are representative conventional graph partitioning techniques. Metis uses edge-cut partitioning and, to minimize communication cost between subgraphs, minimizes the number of cut edges that connect the partitioned subgraphs. PowerGraph, by contrast, performs vertex-cut partitioning, which eliminates edges that span subgraphs, and proposes a programming model for it.

Changes to a graph, such as the addition or deletion of edges representing relationships between objects or of vertices representing objects, occur irregularly; such changes can be referred to as a graph stream.

Existing partitioning techniques for static graphs do not handle such irregular changes and therefore need a management policy for dynamic graphs whose shape changes over time, that is, for graph streams.

To manage a graph stream when a large static graph is already stored in a distributed manner, one must consider both the criterion for choosing where to store added data and the problems that can arise when existing data is deleted.

Setting the criterion for storing a graph stream requires considering the current state of the nodes that make up the cluster. In addition, if the graph has been partitioned by vertex-cut, the criterion must account for the vertex replication ratio that results from placing new data; if it has been partitioned by edge-cut, the criterion must minimize the number of cut edges.
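
The vertex replication ratio is not defined in this passage. As a minimal sketch, assuming the definition commonly used for vertex-cut partitioning (the average number of nodes on which each vertex appears) and a hypothetical edge-to-node assignment format, it could be computed as follows:

```python
from collections import defaultdict

def replication_ratio(edge_to_node):
    """Average number of nodes holding a copy of each vertex.

    edge_to_node maps an edge (u, v) to the node that stores it
    (hypothetical input format, not taken from the patent).
    """
    holders = defaultdict(set)  # vertex -> set of nodes storing a copy
    for (u, v), node in edge_to_node.items():
        holders[u].add(node)
        holders[v].add(node)
    if not holders:
        return 0.0
    return sum(len(nodes) for nodes in holders.values()) / len(holders)

# Vertex "b" is cut: its two edges live on different nodes.
assignment = {("a", "b"): 0, ("b", "c"): 1, ("c", "d"): 1}
ratio = replication_ratio(assignment)
```

A ratio of 1.0 would mean no vertex is replicated; placing a new edge on a node that already holds its endpoints keeps the ratio from growing.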

If a particular node's throughput becomes too high or its memory space runs short, the performance of the whole system can degrade, so the variation in throughput and the memory utilization must always be kept at an appropriate level.

Accordingly, a method is needed that manages state information for the whole cluster and, whenever data arrives in real time, uses that information to compute the partitioning criterion for new data or to detect and resolve problems caused by deletion.

Existing approaches to graph stream partitioning are divided into vertex-cut and edge-cut according to the partitioning criterion, and they take various factors into account.

S-PowerGraph applies the programming model proposed by PowerGraph to real-time graph stream partitioning and uses a vertex-cut-based partitioning technique.

Another conventional approach to graph streams proposed an edge-cut scheme that minimizes the number of cut edges created by adding new data. At the same time, it takes the memory space of each node in the cluster into account and places more data on nodes with low memory utilization than on nodes with high utilization, which reduces the variation in storage.

Yet another conventional approach uses vertex-cut and considers both storage and throughput as the state of the nodes in the cluster.

These existing techniques have limitations, however. One considered only the vertex replication ratio in its criterion for partitioning the added stream and did not consider the state of the servers in the cluster. Another considered storage in order to reflect server state but did not consider throughput.

A further technique considered both storage and processing performance, but defined processing performance as the amount of data stored on a node over a given period. This can be effective for computations such as PageRank, but it is not suitable for queries that search for a specific graph pattern.

In PageRank, a value must be calculated for every vertex, so the number of vertices stored on a node relates directly to its throughput. When only specific vertices take part in a computation, however, treating the total storage as the throughput is inappropriate.

For an algorithm such as PageRank, the connections of all vertices and the value of each vertex must be computed, so the number of vertices translates directly into throughput. For a subgraph search query, by contrast, throughput depends less on how many vertices a node stores than on how many of the vertices in the requested subgraph reside on that node. The notion of load used in existing techniques is therefore not general.
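
The difference between stored volume and query load can be illustrated with a small sketch. The function name and the data layout are assumptions made for illustration; the point is only that a node holding few vertices may still carry most of a subgraph query's work.

```python
def query_load_per_node(query_vertices, node_vertices):
    """Count how many vertices of a requested subgraph each node holds."""
    wanted = set(query_vertices)
    return {node: len(wanted & set(vs)) for node, vs in node_vertices.items()}

node_vertices = {
    "slave1": ["a", "b", "c", "d", "e", "f"],  # stores many vertices
    "slave2": ["x", "y", "z"],                 # stores few vertices
}
load = query_load_per_node(["x", "y", "a"], node_vertices)
```

Here slave2 stores half as many vertices as slave1 but holds two of the three queried vertices, so a storage-based estimate would understate its load.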

Accordingly, a new graph stream partitioning technique is needed to improve query processing performance on large graphs.

The present invention was devised to solve the above problems. Using a vertex-cut-based graph stream partitioning technique that takes load balancing into account, it proposes a partitioning criterion for the distributed storage management of graph streams that weights each of three factors: the vertex replication ratio caused by added data, and the storage and throughput of the cluster nodes on which the added data may be placed. Its object is to improve query processing performance on large volumes of graph data.

Another object of the present invention is to balance the load of the cluster appropriately by means of this weighted partitioning criterion, and thereby to improve the processing performance of the whole system.

Another object of the present invention is to define throughput directly as the number of query processing operations, so that load balancing is adjusted more efficiently and overall system performance improves.

Another object of the present invention is to prevent load from concentrating on a single node: when new data arrives in the graph stream that is connected to existing hot data, the new data is treated as hot data and the throughput factor is given a high weight.

To achieve the above objects, a partitioning method for real-time distributed storage of a graph stream may include: calculating a score for each of a plurality of nodes that individually process the subgraphs into which a graph has been partitioned; determining, among the plurality of nodes, the selected node with the highest calculated score; and connecting a vertex associated with a new graph stream by an edge to the subgraph belonging to the selected node, so that the selected node processes the new graph stream.

As a technical configuration for achieving the above objects, a partitioning device for real-time distributed storage of a graph stream may include: a score calculator that, when a new graph stream is generated in a social network by interaction between things, calculates a score for each of a plurality of nodes that individually process the subgraphs into which a graph has been partitioned; a node determiner that determines, among the plurality of nodes, the selected node with the highest calculated score; and a partition connector that connects a vertex associated with the new graph stream by an edge to the subgraph belonging to the selected node, so that the selected node processes the new graph stream.

According to the present invention, a vertex-cut-based graph stream partitioning technique can be provided that takes both query processing performance and load balancing into account.

According to the present invention, a criterion can be proposed for placing a new subgraph when a large graph is already stored.

According to the present invention, the state of the servers in the cluster, such as memory utilization and throughput, and the vertex replication ratio can be considered when partitioning and storing a graph stream on servers in real time.

According to the present invention, when hot data is inserted, load balancing gives more weight to throughput, which resolves the problem of load concentrating on a particular node.

FIG. 1 is a diagram showing the detailed configuration of a partitioning device for real-time distributed storage of a graph stream according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining the overall processing structure of the partitioning technique for real-time distributed storage of a graph stream according to the present invention.
FIG. 3 is a diagram for explaining how throughput and memory utilization are calculated according to the present invention.
FIG. 4 is a diagram for explaining an example in which system performance degrades because of hot data.
FIG. 5 is a diagram showing an example of a subgraph query according to the present invention.
FIG. 6 is a diagram for explaining the partitioning process of a graph stream according to the present invention.
FIG. 7 is a diagram for explaining neighboring edges according to the present invention.
FIGS. 8 and 9 are diagrams for explaining examples of partitioning an added graph stream according to the present invention.
FIG. 10 is a flowchart showing in detail a partitioning method for real-time distributed storage of a graph stream according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The present invention is not, however, restricted or limited by these embodiments. Like reference numerals in the drawings denote like elements.

FIG. 1 is a diagram showing the detailed configuration of a partitioning device for real-time distributed storage of a graph stream according to an embodiment of the present invention.

The partitioning device 100 for real-time distributed storage of a graph stream according to the present invention may include a score calculator 110, a node determiner 120, and a partition connector 130.

First, when a new graph stream is generated in a social network by interaction between things, the score calculator 110 calculates a score for each of a plurality of nodes that individually process the subgraphs into which the graph has been partitioned. That is, the score calculator 110 expresses the state of each node as a number so that the node best suited to processing the newly generated graph stream can be selected.

Parameters that the score calculator 110 uses to calculate the score include, for example, the amount of graph data each node stores and the amount of graph data it processed in previous work.

In calculating the per-node scores, the score calculator 110 may refer to the load table 140 to check the storage and throughput of each of the plurality of nodes, and may calculate a higher score the lower the storage or the lower the throughput.

The load table 140 records the amount of subgraph data each node holds and the per-period throughput of the graph data in a given subgraph. As in Table 1, described later, the load table 140 may record a node's throughput as the number of vertices in its subgraph, for example, and may record the node's storage as its memory utilization.

That is, based on the information in the load table 140, the score calculator 110 may give a high score to a node whose current storage is low, so that it has enough capacity to process graph data even after a new graph stream is attached, and may likewise give a high score to a node whose throughput up to that point has been low.
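
A minimal sketch of such a load table and of the "lower load, higher score" rule follows. The field names, the 0-to-1 scaling, and the scoring expression are illustrative assumptions and are not prescribed by the text.

```python
from dataclasses import dataclass

@dataclass
class LoadEntry:
    throughput: float    # assumed: share of recent query work handled (0..1)
    memory_usage: float  # assumed: fraction of memory in use (0..1)

def base_score(entry):
    """Lower storage and lower throughput both raise the score."""
    return (1.0 - entry.memory_usage) + (1.0 - entry.throughput)

# Hypothetical load table keyed by node name.
load_table = {
    "slave1": LoadEntry(throughput=0.6, memory_usage=0.7),
    "slave2": LoadEntry(throughput=0.1, memory_usage=0.3),
}
scores = {node: base_score(e) for node, e in load_table.items()}
```

With these hypothetical values, the lightly loaded slave2 receives the higher score.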

As an example, the score calculator 110 may apply Equation 1 to calculate a total score TSi for each of the plurality of nodes.

Equation 1 can be expressed as

$$TS_i = \alpha \cdot RS_i + \beta \cdot US_i + \gamma \cdot CS_i$$

where RSi (Replication Score) is defined as the vertex replication ratio score of node i, USi (Storage Utilization Score) as the per-node storage score, and CSi (Computation Size Score) as the per-node throughput score. The symbols α, β, and γ are the weights assigned to the respective scores.

RSi is calculated to be higher the more neighboring edges exist on the node, which minimizes the replication ratio of vertices in the graph.
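
The following sketch shows one way RSi could be derived from neighboring edges. It assumes that a neighboring edge is an edge already stored on the node that shares an endpoint with the incoming edge, and it normalizes by the largest count; both choices are assumptions, since the passage does not give a formula.

```python
def replication_scores(new_edge, node_edges):
    """RS_i per node: more neighboring edges on the node gives a higher score."""
    u, v = new_edge
    counts = {
        node: sum(1 for (a, b) in edges if {a, b} & {u, v})
        for node, edges in node_edges.items()
    }
    top = max(counts.values(), default=0)
    # Normalize to [0, 1]; all zeros if no node holds a neighboring edge.
    return {n: (c / top if top else 0.0) for n, c in counts.items()}

# Hypothetical subgraphs held by three slave nodes.
node_edges = {
    "slave1": [("a", "b"), ("b", "c")],
    "slave2": [("c", "d")],
    "slave3": [("x", "y")],
}
rs = replication_scores(("b", "d"), node_edges)
```

Placing the new edge on the node with the most neighboring edges reuses vertex copies that already exist there, so fewer new replicas are created.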

USi is calculated to be higher the less data the node stores, so that the new graph stream is placed on a node with relatively low storage.

CSi is calculated to be higher the lower the node's throughput during previous processing, which minimizes the chance that the new graph stream is placed on a node with a relatively high processing load.

That is, the score calculator 110 calculates TSi by multiplying the values of RSi, USi, and CSi by the weights α, β, and γ and adding the results, and thereby obtains the per-node score that serves as the criterion for placing the new graph stream.
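
The weighted sum of Equation 1 can be written directly in code. The component values and the default weights below are hypothetical and serve only to show the arithmetic.

```python
def total_score(rs, us, cs, alpha=1.0, beta=1.0, gamma=1.0):
    """TS_i = alpha * RS_i + beta * US_i + gamma * CS_i (Equation 1)."""
    return alpha * rs + beta * us + gamma * cs

# Hypothetical per-node component scores, each already scaled to [0, 1].
components = {
    "slave1": {"rs": 1.0, "us": 0.25, "cs": 0.5},
    "slave2": {"rs": 0.5, "us": 0.75, "cs": 1.0},
}
ts = {n: total_score(**c) for n, c in components.items()}
```

With equal weights, slave2 scores higher despite its lower replication score, because its storage and throughput scores are better.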

In particular, for γ, the weight of CSi, the score calculator 110 may identify whether the data contained in a first vertex associated with the new graph stream is hot data, meaning data used more often than a set frequency, and, if so, may adjust γ to a positive value. That is, when the new graph stream contains frequently used hot data, the score calculator 110 may raise γ to a larger value so that CSi counts for more in the total score.
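
A minimal sketch of this adjustment, assuming the hot data is tracked as a set of vertex IDs and that γ is raised by a fixed multiplier (the multiplier is hypothetical; the text only says γ is increased):

```python
def adjust_gamma(gamma, vertex_id, hot_vertices, boost=2.0):
    """Return a larger gamma when the incoming vertex is hot data.

    boost is a hypothetical multiplier; the source only says gamma is raised.
    """
    return gamma * boost if vertex_id in hot_vertices else gamma

hot_vertices = {"v7", "v9"}  # vertices used more often than a set frequency
gamma_hot = adjust_gamma(1.0, "v7", hot_vertices)
gamma_cold = adjust_gamma(1.0, "v3", hot_vertices)
```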

In another example of calculating per-node scores, the score calculator 110 may identify, from the graph, a second vertex that shares a common pattern with a first vertex associated with the new graph stream, and may give a weight to the node that processes the subgraph containing the second vertex so that its score is calculated higher. That is, the score calculator 110 recognizes the arrangement or configuration of the vertices in the new graph stream as a pattern, finds vertices in the existing graph that have a similar pattern, and gives the related node a relatively high score.

In yet another example of calculating per-node scores, the score calculator 110 may identify whether the data contained in a first vertex associated with the new graph stream is hot data used more often than a set frequency and, if so, may refer to the load table 140 and give the highest score to the node with the lowest throughput among the plurality of nodes. That is, when a vertex in the new graph stream is related to frequently used hot data, the score calculator 110 gives more points to the node that has recorded the lowest throughput so far, so that this node is chosen to process the stream.
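
For the hot-data case, the rule reduces to choosing the node with the lowest recorded throughput. A minimal sketch, with hypothetical throughput values standing in for the load table:

```python
def pick_node_for_hot_data(throughput_by_node):
    """Return the node with the lowest recorded throughput."""
    return min(throughput_by_node, key=throughput_by_node.get)

# Hypothetical throughput shares read from the load table.
throughput_by_node = {"slave1": 0.5, "slave2": 0.2, "slave3": 0.3}
target = pick_node_for_hot_data(throughput_by_node)
```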

The node determiner 120 determines, among the plurality of nodes, the selected node with the highest calculated score. That is, the node determiner 120 judges the node whose state received the highest numerical score to be the node that can best process the new graph stream, and determines it as the selected node.

The partition connector 130 connects a vertex associated with the new graph stream by an edge to the subgraph belonging to the selected node. That is, the partition connector 130 attaches the new graph stream to the subgraph maintained by the selected node and updates it into a single subgraph, so that the selected node processes the new graph stream.
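
The roles of the node determiner and the partition connector can be sketched together. The function name and the edge-list representation of a subgraph are assumptions made for illustration.

```python
def place_edge(new_edge, scores, subgraphs):
    """Pick the highest-scoring node and append the edge to its subgraph."""
    selected = max(scores, key=scores.get)   # node determiner
    subgraphs[selected].append(new_edge)     # partition connector
    return selected

# Hypothetical subgraphs and previously calculated scores.
subgraphs = {"slave1": [("a", "b")], "slave2": [("c", "d")]}
scores = {"slave1": 1.75, "slave2": 2.25}
selected = place_edge(("d", "e"), scores, subgraphs)
```

After the call, the subgraph on the selected node contains the new edge, and the other subgraph is unchanged.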

In an environment where graph data changes rapidly in real time, for example because of the rapid spread of social networking services that followed the widespread adoption of mobile devices, an improved approach to the distributed storage management of graph streams is needed.

Such an improvement must address, for static graph partitioning techniques such as Metis and PowerGraph, where in the partitioned graph a new vertex or edge should be added and what problems arise when deletions keep occurring.

The present invention proposes a distributed storage management technique for graph streams that improves query processing performance in a cluster composed of one master node and several slave nodes, where a large graph has been partitioned by vertex-cut.

The present invention also aims to improve the processing performance of the whole system by providing a placement criterion for added data and by resolving the problems caused by deleted data.

When added data is placed, a partitioning criterion is applied that considers the storage and throughput of each node in the cluster. For example, the criterion may place more data on nodes with low storage or low throughput.

An additional factor in considering throughput is hot data, that is, data frequently used in queries; this hot data is managed and used as statistical information. In the present invention, the partitioning criterion changes depending on whether the added data is hot data, so that, for example, hot data can be placed on a node with low throughput.

FIG. 2 is a diagram for explaining the overall processing structure of the partitioning technique for real-time distributed storage of a graph stream according to the present invention.

As shown in FIG. 2, each slave node stores the subgraph data produced by the vertex-cut partitioning technique proposed in PowerGraph.

The statistics management module 210 stores information on the per-node throughput and storage resulting from query processing in the load table 212. The statistics management module 210 also stores information on data that queries request frequently in the hot data table 214.

Then, when a graph stream arrives, the stream partitioning module 220 first refers to the hot data table 214 and analyzes whether the arriving graph stream is hot data (Stream Analysis). For example, if the new graph stream is data connected to a vertex that queries request frequently, the stream partitioning module 220 may judge it to be hot data that is likely to be used together with that vertex in later query requests.

Once it has been determined whether the graph stream is hot data, the stream partitioning module 220 refers to the load table 212 to check the state of each node (Node Analysis). The stream partitioning module 220 then calculates a score for each node from its state according to a predefined criterion and determines the node with the highest score. Finally, the stream partitioning module 220 transmits the new graph stream to the determined node (Data Transmission).
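
The three stages (Stream Analysis, Node Analysis, Data Transmission) can be sketched as one function. The scoring expression, the weights 1.0 and 3.0, the table layouts, and the `send` callback are all illustrative assumptions rather than the module's actual interface.

```python
def partition_stream(edge, hot_table, load_table, send):
    """Stream Analysis -> Node Analysis -> Data Transmission for one edge."""
    # Stream Analysis: an edge touching a frequently queried vertex is hot.
    is_hot = any(v in hot_table for v in edge)
    # Node Analysis: weight throughput more heavily for hot data
    # (the weights 1.0 and 3.0 are illustrative assumptions).
    gamma = 3.0 if is_hot else 1.0
    scores = {
        node: (1.0 - mem) + gamma * (1.0 - tput)
        for node, (tput, mem) in load_table.items()
    }
    target = max(scores, key=scores.get)
    # Data Transmission: hand the edge to the chosen node.
    send(target, edge)
    return target

sent = []
load_table = {"slave1": (0.6, 0.1), "slave2": (0.3, 0.6)}  # (throughput, memory)
hot_table = {"v1"}
cold_target = partition_stream(("v5", "v6"), hot_table, load_table,
                               lambda n, e: sent.append((n, e)))
hot_target = partition_stream(("v1", "v8"), hot_table, load_table,
                              lambda n, e: sent.append((n, e)))
```

In this example the ordinary edge goes to slave1, which has more free memory, while the hot edge goes to slave2, which has the lower throughput.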

To raise system performance through proper load balancing, the graph stream must be partitioned with the state of each node in the cluster taken into account. To this end, the present invention monitors the load of each node in the cluster through load management and uses the collected information to partition the graph stream.

In the present invention, the statistics management module (210) of the master node in FIG. 2 may be configured to manage the storage amount and throughput of each node. By avoiding the storage of additional data on a node that already holds a large amount of data, or on a node whose throughput is high relative to other nodes, the statistics management module (210) can suppress degradation of overall system performance. That is, the statistics management module (210) may reflect the results of load management in the processing of the graph stream in order to prevent such degradation.

When a query that searches for a specific subgraph pattern is submitted to the master node, the query processing module (230) determines which nodes must perform the search and transmits the pattern to be found to them.

Each slave node that is searched then returns its query result to the query processing module (230) of the master node and, at the same time, may send to the statistics management module (210) the number and IDs of the vertices of the requested pattern that the node holds, together with the node's memory utilization. The number of matched vertices and the memory utilization are maintained in the load table (212).
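The reporting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the patent: the `SlaveReport` and `LoadTable` names and their fields are hypothetical, and accumulating matched-vertex counts across queries is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SlaveReport:
    """What a slave node sends after a query (hypothetical structure)."""
    node_id: int
    matched_vertex_ids: list      # IDs of pattern vertices held by this node
    memory_utilization: float     # fraction of memory in use, 0.0 to 1.0


@dataclass
class LoadTable:
    """Per-node load kept by the master node (hypothetical structure)."""
    matched_vertices: dict = field(default_factory=dict)    # node_id -> count
    memory_utilization: dict = field(default_factory=dict)  # node_id -> fraction

    def update(self, report: SlaveReport) -> None:
        # Assumption: matched-vertex counts accumulate as raw throughput.
        previous = self.matched_vertices.get(report.node_id, 0)
        self.matched_vertices[report.node_id] = previous + len(report.matched_vertex_ids)
        # Only the latest memory utilization is kept as the storage amount.
        self.memory_utilization[report.node_id] = report.memory_utilization


table = LoadTable()
table.update(SlaveReport(1, ["a", "b", "c", "e", "d"], 0.40))
table.update(SlaveReport(2, ["f", "g"], 0.25))
```

Keeping the two quantities in separate maps mirrors the text's distinction between throughput (matched vertices) and storage amount (memory utilization).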

Table 1 is an example of a load table.

Table 1 illustrates keeping the number of vertices as a node's throughput and its memory utilization as its storage amount. Throughput is computed as a ratio relative to the amount processed by the whole system.

FIG. 3 is a diagram illustrating how throughput and memory utilization are calculated according to the present invention.

FIG. 3(a) shows the data stored in each node and, among that data, the data included in a query.

Throughput may refer to the amount of data, among the data stored in each node, that is included in the query.

In FIG. 3(a), Node #1 holds eight vertices, of which five ('a', 'b', 'c', 'e', 'd') are included in the query, so its throughput is 5. By the same calculation, the throughput of Node #2 is 2, that of Node #3 is 5, and that of Node #4 is 2.

Converted for use as a partitioning criterion, these throughputs become 5/14, 2/14, 5/14, and 2/14 for the respective nodes, where 14 is the total over all four nodes.
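The conversion above is a plain normalization by the system-wide total. A short sketch using the FIG. 3(a) counts follows; the function name `relative_throughput` is illustrative and does not come from the patent.

```python
from fractions import Fraction


def relative_throughput(matched_counts):
    """Express each node's matched-vertex count as a share of the total."""
    total = sum(matched_counts.values())
    if total == 0:
        # No query has touched any node yet; every share is zero.
        return {node: Fraction(0) for node in matched_counts}
    return {node: Fraction(count, total) for node, count in matched_counts.items()}


counts = {"Node #1": 5, "Node #2": 2, "Node #3": 5, "Node #4": 2}
ratios = relative_throughput(counts)
```

`Fraction` keeps the values exact; note that it stores 2/14 in lowest terms as 1/7.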

FIG. 3(b) shows randomly generated subgraph search queries.

As shown in FIG. 3(b), a subgraph query can be derived, for each node, as a path connecting vertices included in the query. For Node #1, for example, the subgraph queries b-c-e and a-b-c-d can be formed around vertex 'c'.

In the present invention, hot data means a subgraph that is frequently used in query processing. Given two nodes that store the same amount of graph data, the node that holds hot data needs relatively more processing during query handling, which can degrade overall system performance.

Accordingly, a newly arriving subgraph that is connected to hot data is itself treated as hot data. That is, when an added subgraph is joined to existing hot data by an edge, the connectivity of the graph means that query processing will also be performed more often on the newly connected subgraph.

Therefore, when a new subgraph is associated with hot data, a high weight is given to throughput in deciding where to store it.
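The connectivity rule can be written as a one-line test. The sketch below assumes that a new subgraph arrives as a single edge `(u, v)` and that hot data is tracked as a set of vertex IDs; both representations, and the name `is_hot_edge`, are assumptions made for illustration.

```python
def is_hot_edge(edge, hot_vertices):
    """Treat a new edge as hot data if either endpoint is already hot."""
    u, v = edge
    return u in hot_vertices or v in hot_vertices


# Hot vertices taken from the FIG. 4 example (s, t, u).
hot = {"s", "t", "u"}
connected = is_hot_edge(("u", "x"), hot)     # touches hot vertex u
unconnected = is_hot_edge(("x", "y"), hot)   # touches no hot vertex
```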

FIG. 4 is a diagram illustrating an example in which hot data degrades system performance.

FIG. 4 shows how hot data degrades the performance of the whole system. In FIG. 4(a), the whole graph is divided into three subgraphs and stored, and subgraph 1 (partition #1) contains the hot-data vertices s, t, and u.

Assuming that each subgraph is processed by one slave node, FIG. 4(b) plots the execution time for each partition. In FIG. 4(b), all three slave nodes process the same amount of subgraph data, but slave node 1, which holds subgraph 1, takes relatively longer. As a result, the other slave nodes (partition #2 and #3) must wait for slave node 1 to finish even after they have completed their own processing.

To resolve the increase in load that hot data places on a particular node, hot-data information must be managed and measures are needed to keep hot data from concentrating on a single node.

To this end, the statistics management module (210) of the present invention manages the subgraphs used in query processing in the hot data table (214), and the stream partitioning module (220) determines, based on the information stored in the hot data table (214), whether an incoming graph stream is hot data.

In the present invention, information about hot data is collected together with load information during the load monitoring stage. The query processing module (230) in the master node may maintain the hot data table (214) by ranking, as in Table 2, the number of times each vertex appears in the randomly generated subgraph search queries.

Table 2 records, for the subgraph queries of FIG. 5 described below, the number of times each vertex is requested, counted with reference to the central vertex of each query.

For example, in Table 2, vertex b, which has the highest request count of 4, may be recorded at rank 1.

FIG. 5 is a diagram showing examples of subgraph queries according to the present invention.

FIG. 5 illustrates four subgraph queries: Query 1, formed by connecting vertices a, d, and c around vertex b; Query 2, formed by connecting vertices b, f, and i around vertex h; Query 3, formed by connecting vertices g, n, and m around vertex b; and Query 4, formed by connecting vertices b, x, and z around vertex q.

Summarizing the example of FIG. 5 in Table 2: vertex b is requested four times in total in FIG. 5, so the request count for vertex ID b in Table 2 is recorded as 4. By the same counting method, the request count for vertex ID a, and likewise for the other vertices, is recorded as 1. Table 2 also assigns the highest rank to vertex ID b, which has the highest request count.
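The counting that produces Table 2 can be reproduced from the four queries of FIG. 5. The sketch below is illustrative: `build_hot_data_table` is a hypothetical name, and breaking ties by vertex ID (so that every row gets a distinct rank) is an assumption, since the text only specifies the rank of vertex b.

```python
from collections import Counter

# Vertex sets of the four subgraph queries in FIG. 5.
queries = {
    "Query 1": {"b", "a", "d", "c"},
    "Query 2": {"h", "b", "f", "i"},
    "Query 3": {"b", "g", "n", "m"},
    "Query 4": {"q", "b", "x", "z"},
}


def build_hot_data_table(queries):
    """Return (rank, vertex_id, request_count) rows, most requested first."""
    counts = Counter()
    for vertices in queries.values():
        counts.update(vertices)  # each vertex counts once per query
    # Assumption: ties in request count are broken by vertex ID.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, vid, n) for rank, (vid, n) in enumerate(ordered, start=1)]


hot_table = build_hot_data_table(queries)
request_count = {vid: n for _, vid, n in hot_table}
```

The first row is vertex b with a request count of 4, and vertex a has a count of 1, matching the description of Table 2.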

In the present invention, the storage amount, the throughput, and the vertex replication ratio may be considered in deciding on which node to store an added graph.

The vertex replication ratio is considered because communication between partitions takes place through the split vertices; minimizing the replication ratio reduces the communication cost of the whole system.

The storage amount is a factor for evening out the amount of subgraph data stored per node, because a node that stores many subgraphs inevitably does more processing in computations that require an operation on every vertex.

Throughput is considered because load concentrates on a node whose throughput value grows high; distributing the load evenly improves the performance of the whole system.

To apply these considerations, the present invention combines the three into a single formula. When a subgraph to be added arrives in the graph stream, the value of the formula is computed for each node and the subgraph is placed on the node with the highest value.

FIG. 6 is a diagram illustrating the graph stream partitioning process according to the present invention.

When a new graph stream arrives, the statistics management module (610) checks the hot data table (614) to determine whether the graph stream corresponds to hot data. The statistics management module (610) may also adjust the weights used in computing the partitioning criterion according to whether the graph stream is hot data.

The statistics management module (610) uses the information in the load table (612) to compute the partitioning criterion, and the stream partitioning module (620) transmits the data to the node with the highest of the scores computed for each node.

In the per-node score calculation by the stream partitioning module (620), given k nodes in the cluster, the score of each node is computed as TSi (Total Score), the score of the i-th node (i = 1, 2, ..., k).

Equation 1 is the formula for computing TSi.

$$TS_i = \alpha \cdot RS_i + \beta \cdot US_i + \gamma \cdot CS_i$$

Equation 1 assigns the weights α, β, and γ to the values RSi, USi, and CSi, respectively, and sums them. The value computed by Equation 1 is the score of each node and serves as the criterion for placing the added subgraph. Using Equation 1, the present invention computes the value for every node in the cluster and places the subgraph on the node with the highest value.
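The weighted sum and the highest-score selection can be sketched in a few lines. The component scores and the default weights of 1.0 below are made-up values for illustration; the patent does not give numeric values at this point.

```python
def total_score(rs, us, cs, alpha, beta, gamma):
    """Weighted sum of the three component scores (TS_i)."""
    return alpha * rs + beta * us + gamma * cs


def select_node(component_scores, alpha=1.0, beta=1.0, gamma=1.0):
    """Compute TS_i for every node and return the node with the highest value."""
    totals = {
        node: total_score(rs, us, cs, alpha, beta, gamma)
        for node, (rs, us, cs) in component_scores.items()
    }
    return max(totals, key=totals.get), totals


# node_id -> (RS_i, US_i, CS_i); hypothetical values in the 0 to 1 range.
scores = {1: (1.0, 0.5, 0.25), 2: (0.5, 0.75, 0.75), 3: (0.0, 0.25, 0.5)}
best, totals = select_node(scores)
```

With equal weights, node 2 wins here (2.0 against 1.75 and 0.75) even though node 1 has the best replication score, because node 2 is stronger on the other two terms.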

RSi (Replication Score), the first term of Equation 1, serves to minimize vertex replication and is the vertex replication ratio score of the i-th node.

The vertex replication ratio, one of the key factors in vertex-cut partitioning, indicates across how many nodes a single vertex is split and stored. A higher replication ratio means that one vertex is stored on more nodes, which in turn means more inter-node communication to synchronize vertex values when running vertex-centric algorithms such as PageRank.

Accordingly, to minimize the vertex replication ratio, when a new subgraph e is added, the present invention may give a larger score to a node that holds more neighbor edges of e.

Equation 2 computes RSi (Replication Score), the vertex replication ratio score of a node. In Equation 2, Pi denotes the i-th partition; because each node stores one partition in the present invention, Pi also denotes the i-th node. N(e) denotes the set of neighbor edges of edge e.

FIG. 7 is a diagram illustrating neighbor edges according to the present invention.

FIG. 7 shows the concept of a neighbor edge. In FIG. 7, e_kl denotes the edge connecting vertex k and vertex l, and the neighbor edges of e_kl are e_ij, e_lm, and e_ln.

Consequently, RSi is the factor that accounts for which node holds the most neighbor edges of the newly added subgraph e, that is, on which node placement minimizes the vertex replication ratio.

The denominator of Equation 2 is the value of the node with the largest value; dividing by it normalizes the term to a value between 0 and 1.
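One way to compute the replication score is sketched below. It assumes that Equation 2 has the form RS_i = |P_i ∩ N(e)| / max_j |P_j ∩ N(e)|, which is inferred from the surrounding description rather than quoted from the patent, and it assumes that a neighbor edge is any other edge sharing an endpoint with e. The partition contents are invented for the example.

```python
def neighbor_edges(edge, edges):
    """Edges in `edges` that share an endpoint with `edge` (assumed definition)."""
    u, v = edge
    return {e for e in edges if set(e) != {u, v} and (u in e or v in e)}


def replication_scores(new_edge, partitions):
    """Neighbor-edge count per node, normalized by the largest count."""
    counts = {
        node: len(neighbor_edges(new_edge, edges))
        for node, edges in partitions.items()
    }
    largest = max(counts.values())
    if largest == 0:
        # No node holds a neighbor edge, so the term cannot discriminate.
        return {node: 0.0 for node in counts}
    return {node: count / largest for node, count in counts.items()}


# node_id -> set of undirected edges stored on that node (hypothetical data).
partitions = {
    1: {("k", "j"), ("l", "m"), ("l", "n")},
    2: {("l", "p"), ("a", "b")},
    3: {("c", "d")},
}
rs = replication_scores(("k", "l"), partitions)
```

Node 1 holds three neighbor edges of the new edge, node 2 holds one, and node 3 holds none, so node 1 receives the maximum score of 1.0.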

USi (Storage Utilization Score) in Equation 1 is the term that accounts for the storage amount of each node. If a node stores a large amount of subgraph data, query processing on it must perform relatively many operations.

To prevent this, an added subgraph should be placed on a node with relatively little stored data; for this purpose the present invention computes a storage score using Equation 3.

Equation 3 gives a high score to a node with little stored data.

Pi in Equation 3 denotes the i-th node as described above, and the numerator of Equation 3 is the size of the subgraph stored on the i-th node.

Ci denotes the total memory size of the i-th node; consequently, USi reflects the memory utilization of the i-th node.
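The storage score can be sketched as follows. The text says both that USi reflects memory utilization and that a node with little stored data should score high; this sketch reconciles the two by assuming US_i = 1 − (stored size / C_i). That form is an assumption, and the function name and the units (any consistent size unit) are illustrative.

```python
def storage_score(stored_size, capacity):
    """Score that rises as memory utilization falls (assumed form of Equation 3)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    # Clamp so that an over-full node scores 0 rather than going negative.
    utilization = min(stored_size / capacity, 1.0)
    return 1.0 - utilization


lightly_loaded = storage_score(25, 100)    # 25% of memory used
fully_loaded = storage_score(100, 100)     # memory completely used
```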

CSi (Computation Size Score) in Equation 1 is the term that accounts for throughput. If subgraphs keep being added to a node whose throughput is already high, throughput rises on that one node alone, the load becomes unbalanced, and system performance degrades.

To prevent this, the present invention computes a throughput score as in Equation 4 so that a node whose throughput was relatively low during previous query processing receives a high score.

In Equation 1, γ, the weight of the throughput score, takes a different value depending on whether the added subgraph is hot data. When the subgraph is judged to be hot data, γ is adjusted to a relatively large value, which raises the contribution of the throughput score so that the hot data is steered toward a node with low throughput.

Si in Equation 4 denotes the number of vertices held by the i-th node among the vertices included in the query. The rightmost term of Equation 4 is obtained by summing, over all nodes, the number of query vertices each node holds and dividing each node's count by that total.

The present invention then subtracts the rightmost term of Equation 4 from 1, so that a node with high throughput receives a smaller share of new graph data.
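Read together, the two paragraphs above suggest the form CS_i = 1 − S_i / Σ_j S_j. That form is inferred from the description, not quoted from the patent. The sketch below applies it to the FIG. 3(a) counts (5, 2, 5, 2); the function name is illustrative.

```python
from fractions import Fraction


def computation_scores(matched_counts):
    """1 minus each node's share of matched query vertices (assumed Equation 4)."""
    total = sum(matched_counts.values())
    if total == 0:
        # No query history: every node is equally idle.
        return {node: Fraction(1) for node in matched_counts}
    return {node: 1 - Fraction(s, total) for node, s in matched_counts.items()}


cs = computation_scores({1: 5, 2: 2, 3: 5, 4: 2})
```

Nodes 2 and 4, which handled fewer query vertices, score 12/14 (stored as 6/7), while the busier nodes 1 and 3 score 9/14.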

Each term in the partitioning formulas (Equations 3 and 4) is constrained to a value between 0 and 1 so that no single term dominates the overall partitioning criterion. Appropriate values of α, β, and γ are determined by comparing query processing performance in the performance evaluation.

If subgraph deletions keep occurring on a particular node, the whole system becomes unbalanced in terms of storage. The variation in throughput may grow as a result, and when hot data is deleted, even in small amounts, an imbalance in terms of throughput may arise.

The storage imbalance is handled by setting a threshold for each node: when the storage amount of any node falls to or below the threshold, β in Equation 1, which is applied at insertion time, is set to a high value so that incoming subgraphs are placed preferentially on nodes with little stored data.

Likewise for throughput: when the throughput of any node falls to or below its threshold, γ is set to a high value for subsequent input so that nodes with low throughput take priority.
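The threshold rule for deletions can be sketched as a small weight-adjustment function. Every number below (base weights, thresholds, and the boost factor) is a hypothetical placeholder; the patent states only that β or γ is set to a high value when a node falls to or below its threshold.

```python
def adjust_weights(storage, throughput, base=(1.0, 1.0, 1.0),
                   storage_threshold=0.2, throughput_threshold=0.1, boost=2.0):
    """Raise beta and/or gamma when any node drops to or below a threshold.

    storage and throughput map node_id -> value in the 0 to 1 range.
    """
    alpha, beta, gamma = base
    if min(storage.values()) <= storage_threshold:
        beta *= boost    # favor nodes with little stored data
    if min(throughput.values()) <= throughput_threshold:
        gamma *= boost   # favor nodes with low throughput
    return alpha, beta, gamma


# Node 2 has dropped to 10% storage after repeated deletions.
weights = adjust_weights({1: 0.5, 2: 0.1}, {1: 0.6, 2: 0.4})
```

A single global threshold is used here for brevity; per-node thresholds, as the text describes, would replace the two scalar parameters with dictionaries.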

FIGS. 8 and 9 are diagrams illustrating an example of partitioning an added graph stream according to the present invention.

FIG. 8 illustrates a situation in which subgraphs are stored on slave nodes 1 through 4.

When an edge connecting vertex 3 and vertex 7 is newly inserted at slave node 1, the present invention first determines whether either vertex 3 or vertex 7 belongs to hot data. For example, because vertex 7 is ranked first in the hot data table of FIG. 9, vertex 7 may be judged to be hot data.

Because the added subgraph has been identified as hot data, the present invention adjusts the weights of Equation 1 before computing it.

As FIG. 9(a) and (b) show, the score computed for each node is higher the lower its memory utilization (its storage amount) and the lower its throughput.

The last column in FIG. 9(c) gives the final computed score for each node. Because hot data occurred, the weight on throughput was raised; node 1 therefore has the highest value, and the new subgraph is placed on node 1. Note, however, that this outcome arose only because the replication ratio and storage scores did not differ greatly; even when hot data occurs, throughput is not the only factor considered.
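The caveat in the last sentence can be made concrete with made-up numbers (the values of FIG. 9 are not reproduced in the text, so none of the figures below come from the patent). The sketch applies the weighted sum TS = α·RS + β·US + γ·CS with γ raised from 1 to 3 for hot data; the value 3 is an arbitrary illustration.

```python
def pick(scores, alpha, beta, gamma):
    """Return the node with the highest weighted sum of (RS, US, CS)."""
    totals = {
        node: alpha * rs + beta * us + gamma * cs
        for node, (rs, us, cs) in scores.items()
    }
    return max(totals, key=totals.get), totals


# node_id -> (RS, US, CS). Node 1 has the lower throughput (higher CS).
similar = {1: (0.5, 0.5, 0.75), 2: (1.0, 0.5, 0.5)}
normal_choice, _ = pick(similar, 1.0, 1.0, 1.0)   # ordinary data
hot_choice, _ = pick(similar, 1.0, 1.0, 3.0)      # hot data: gamma raised

# Same throughput picture, but node 1 holds no neighbor edges at all.
skewed = {1: (0.0, 0.5, 0.75), 2: (1.0, 0.5, 0.5)}
skewed_choice, _ = pick(skewed, 1.0, 1.0, 3.0)
```

With similar replication scores, raising γ moves the choice from node 2 to node 1. When the replication scores differ sharply, node 2 still wins despite the raised γ, which is the point the text makes: throughput is weighted more heavily for hot data but is never the sole criterion.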

The present invention presents an efficient graph stream partitioning technique that takes the load of the nodes in the cluster into account in order to improve query processing.

The vertex replication ratio, storage amount, and throughput are the factors considered in the stream partitioning criterion, and throughput is defined as the amount of data used in query processing so that it is clearly distinguished from storage amount.

In addition, the partitioning criterion is changed depending on whether hot data occurs, which prevents hot data from concentrating on a single node.

The present invention can be used in techniques for distributed processing or analysis of graph data generated in real time. It can also be used as a method for effectively storing large volumes of data in services that store and manage relationships between objects in real time, such as social networks and the Internet of Things.

Hereinafter, the workflow of a partitioning method for real-time distributed storage of a graph stream according to an embodiment of the present invention is described in detail.

FIG. 10 is a flowchart showing in detail a partitioning method for real-time distributed storage of a graph stream according to an embodiment of the present invention.

The partitioning method for real-time distributed storage of a graph stream according to this embodiment may be performed by the partitioning device (100) described above.

First, when a new graph stream arises in a social network from interactions between objects, the partitioning device (100) calculates a score for each of a plurality of nodes that individually process the plurality of subgraphs into which the graph has been partitioned (1010). Step 1010 may be a process of expressing the state of each node as a number in order to select the node best suited to processing the newly arrived graph stream.

Parameters that the partitioning device (100) uses in calculating the score include, for example, the amount of graph data an individual node stores and the amount of graph data it processed in previous work.

In calculating the per-node scores, the partitioning device (100) may consult the load table to check the storage amount and throughput of each of the plurality of nodes, and may calculate a higher score the lower the storage amount or the lower the throughput.

The load table records the amount of subgraph data each node holds and the per-period throughput of the graph data of a given subgraph. For example, the load table may record a node's throughput as the number of vertices its subgraph holds, and a node's storage amount as its memory utilization.

In step 1010, based on the information in the load table, the partitioning device (100) may give a high score to a node that currently stores little data and therefore has ample capacity to process graph data even if the new graph stream is attached to it, and may likewise give a high score to a node whose throughput up to the previous point in time has been low.

As one example, the partitioning device (100) may apply Equation 1 to calculate, as the score, the TSi (Total Score) computed for each of the plurality of nodes.

Equation 1 can be expressed as

$$TS_i = \alpha \cdot RS_i + \beta \cdot US_i + \gamma \cdot CS_i$$

That is, the partitioning device (100) calculates TSi by weighting the values RSi, USi, and CSi by α, β, and γ, respectively, and summing them, and thereby obtains the score of each node that serves as the criterion for placing the new graph stream.

In particular, for γ, the weight of CSi, the partitioning device (100) may identify whether the data included in a first vertex associated with the new graph stream is hot data, that is, data used more often than a set frequency, and, if it is hot data, may adjust γ to a positive value. In other words, because the new graph stream includes frequently used hot data, the partitioning device (100) may adjust γ to a larger value so that CSi contributes more to the score.

In another example of step 1010, the partitioning device (100) may identify, from the graph, a second vertex that shares a common pattern with the first vertex associated with the new graph stream, and may calculate a higher score by giving a weight to the node that processes the subgraph containing the second vertex. That is, the partitioning device (100) may recognize the arrangement or configuration of the vertices in the new graph stream as a pattern, find vertices in the existing graph that have a similar pattern, and cause a relatively high score to be given to the associated node.

In yet another example of step 1010, the partitioning device (100) may identify whether the data included in the first vertex associated with the new graph stream is hot data used more often than a set frequency and, if so, may consult the load table and calculate the highest score for the node with the lowest throughput among the plurality of nodes. That is, when a vertex in the new graph stream is associated with frequently used hot data, the partitioning device (100) may cause a higher score to be calculated for the node that has recorded the lowest throughput so far, so that this node is selected to process the stream.

The partitioning device (100) then determines, among the plurality of nodes, the selected node with the highest calculated score (1020). Step 1020 may be a process of judging the node that received the highest score, which quantifies node state, to be the node that can best process the new graph stream, and determining it as the selected node.

The partitioning device (100) then connects, by an edge, the vertex associated with the new graph stream to the subgraph belonging to the selected node (1030). Step 1030 may be a process of attaching the new graph stream to the subgraph maintained by the selected node and updating it into a single subgraph, so that the selected node processes the new graph stream.
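Steps 1010 through 1030 can be put together in one compact sketch. It is an illustration under several assumptions, none of which is stated in the patent: a new graph stream is a single edge, stored size is measured as an edge count against a `capacity` in edges, α and β are fixed at 1, γ is raised to a hypothetical `hot_gamma` of 3.0 for hot data, and the component scores take the normalized forms inferred from the description.

```python
def partition_edge(new_edge, nodes, hot_vertices, hot_gamma=3.0):
    """Score every node, pick the best, and attach the new edge to it.

    nodes: node_id -> {"edges": set of tuples, "capacity": int, "matched": int}
    """
    u, v = new_edge
    # Hot-data check: raise the throughput weight if an endpoint is hot.
    gamma = hot_gamma if (u in hot_vertices or v in hot_vertices) else 1.0

    total_matched = sum(n["matched"] for n in nodes.values())
    neighbor_counts = {
        nid: sum(1 for e in n["edges"] if u in e or v in e)
        for nid, n in nodes.items()
    }
    max_neighbors = max(neighbor_counts.values())

    scores = {}
    for nid, n in nodes.items():  # step 1010: per-node score
        rs = neighbor_counts[nid] / max_neighbors if max_neighbors else 0.0
        us = 1.0 - len(n["edges"]) / n["capacity"]
        cs = 1.0 - (n["matched"] / total_matched if total_matched else 0.0)
        scores[nid] = rs + us + gamma * cs

    selected = max(scores, key=scores.get)    # step 1020: highest score
    nodes[selected]["edges"].add(new_edge)    # step 1030: attach the edge
    return selected, scores


nodes = {
    1: {"edges": {(1, 2), (2, 3)}, "capacity": 4, "matched": 6},
    2: {"edges": {(7, 8)}, "capacity": 4, "matched": 2},
}
selected, scores = partition_edge((3, 7), nodes, hot_vertices={7})
```

In this example both nodes hold one neighbor edge of (3, 7), so the replication term ties; node 2 stores less and has handled fewer query vertices, and the raised γ widens its lead, so the edge is attached to node 2.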

The method according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited set of embodiments and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

100: partitioning device for real-time distributed storage of a graph stream
110: score calculation unit
120: node determination unit
130: partition connection unit

Claims

A partitioning method for real-time distributed storage of a graph stream, implemented by a partitioning device for real-time distributed storage of a graph stream, the method comprising,
when a new graph stream occurs in a social network due to interaction between objects:
calculating, by a score calculation unit in the partitioning device, a score for each of a plurality of nodes that individually process respective ones of a plurality of subgraphs obtained by partitioning a graph;
determining, by a node determination unit in the partitioning device, a selected node having the highest calculated score among the plurality of nodes; and
connecting, by a partition connection unit in the partitioning device, a vertex associated with the new graph stream by an edge to the subgraph belonging to the selected node, so that the new graph stream is processed at the selected node,
wherein the calculating of the score comprises:
[Equation]
calculating, as the score, a total score (TSi) computed for each of the plurality of nodes by applying the equation above,
where the Replication Score (RSi) is the vertex replication ratio score of a node, the Storage Utilization Score (USi) is the storage amount score of each node, the Computation Size Score (CSi) is the throughput score of each node, α is the weight of RSi, β is the weight of USi, and γ is the weight of CSi;
identifying whether data included in a first vertex associated with the new graph stream is hot data that is used at or above a predetermined frequency; and
when the data is hot data, adjusting γ to a positive value so that CSi is calculated to be high.
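
For illustration only, the total-score computation recited in the claim above might be sketched as follows. The equation itself is not reproduced in this text, so the weighted sum TSi = α·RSi + β·USi + γ·CSi used here is an assumption inferred from the description of α, β, and γ as weights. The names `NodeScores`, `is_hot`, `total_score`, and `score_nodes`, the default weight values, and the choice of γ = 0 for data that is not hot are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeScores:
    rs: float  # Replication Score (RSi)
    us: float  # Storage Utilization Score (USi)
    cs: float  # Computation Size Score (CSi)


def is_hot(access_count, threshold):
    """Hot data: used at or above a predetermined frequency."""
    return access_count >= threshold


def total_score(s, alpha, beta, gamma):
    """Assumed form of TSi: a weighted sum of the three component scores."""
    return alpha * s.rs + beta * s.us + gamma * s.cs


def score_nodes(per_node, hot, alpha=1.0, beta=1.0,
                cold_gamma=0.0, hot_gamma=1.0):
    """Compute TSi for every node; gamma is positive only for hot data."""
    gamma = hot_gamma if hot else cold_gamma
    return {n: total_score(s, alpha, beta, gamma)
            for n, s in per_node.items()}


per_node = {
    "n1": NodeScores(rs=0.5, us=0.5, cs=0.0),
    "n2": NodeScores(rs=0.25, us=0.5, cs=1.0),
}
cold = score_nodes(per_node, hot=is_hot(access_count=2, threshold=10))
hot = score_nodes(per_node, hot=is_hot(access_count=15, threshold=10))
```

With these sample values, the lightly loaded node n2 overtakes n1 only when the data is hot, because only then does the throughput score CSi contribute to the total.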

The method of claim 1,
wherein the calculating of the score further comprises:
identifying, with reference to a load table, a storage amount and a throughput for each of the plurality of nodes; and
calculating the score to be higher as the storage amount is lower or the throughput is lower.

The method of claim 1,
wherein the calculating of the score further comprises:
identifying, from the graph, a second vertex having a pattern in common with a first vertex associated with the new graph stream; and
calculating the score to be high by assigning a weight to a node processing the subgraph that contains the second vertex.

The method of claim 1,
wherein the calculating of the score further comprises:
when the data is hot data, calculating, with reference to a load table, the highest score for the node having the lowest throughput among the plurality of nodes.
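
As a minimal sketch of the hot-data rule in the claim above, the node with the lowest throughput can be looked up in a load table. The table layout (a dictionary with `storage` and `throughput` fields per node) and the function name `hot_data_target` are assumptions for illustration; the text does not define the format of the load table.

```python
def hot_data_target(load_table):
    """Return the id of the node with the lowest recorded throughput.

    `load_table` maps a node id to a record with "storage" and
    "throughput" entries (hypothetical layout).
    """
    return min(load_table, key=lambda n: load_table[n]["throughput"])


load_table = {
    "n1": {"storage": 10, "throughput": 7},
    "n2": {"storage": 30, "throughput": 2},
    "n3": {"storage": 20, "throughput": 5},
}
target = hot_data_target(load_table)
```

In this sample table the hot data would be directed to n2, which stores the most data but currently processes the least.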

(deleted)

The method of claim 1,
wherein the calculating of the score further comprises:
calculating RSi to be higher as the number of neighboring edges present in a node increases, thereby minimizing the replication ratio of vertices in the graph.

The method of claim 1,
wherein the calculating of the score further comprises:
calculating USi to be higher as the amount of data stored by a node is smaller, so that the new graph stream is placed on a node having a relatively small storage amount.

The method of claim 1,
wherein the calculating of the score further comprises:
calculating CSi to be higher as the throughput during previous processing is lower, thereby minimizing placement of the new graph stream on a node having a relatively large amount of storage.
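
The three dependent claims above state only the direction in which each component score moves: RSi rises with the number of neighboring edges on a node, while USi and CSi rise as stored data and past throughput fall. One possible way to obtain such scores is sketched below. The normalizations (share of neighboring edges, and one minus the load relative to the busiest node) are assumptions chosen for this sketch, not formulas taken from the text, and the names `replication_score` and `inverse_load_score` are hypothetical.

```python
def replication_score(neighbor_edges, node_id):
    """RSi sketch: share of the new vertex's neighbouring edges that
    already reside on `node_id` (more neighbours -> higher score)."""
    total = sum(neighbor_edges.values())
    return neighbor_edges[node_id] / total if total else 0.0


def inverse_load_score(load, node_id):
    """USi / CSi sketch: 1 minus the node's load relative to the most
    loaded node (less stored data or throughput -> higher score).
    `load` must be non-empty."""
    peak = max(load.values())
    return 1.0 - load[node_id] / peak if peak else 1.0


neighbor_edges = {"n1": 3, "n2": 1}   # neighbouring edges per node
stored = {"n1": 100, "n2": 50}        # amount of data stored per node
throughput = {"n1": 8, "n2": 2}       # throughput in previous processing

rs = {n: replication_score(neighbor_edges, n) for n in neighbor_edges}
us = {n: inverse_load_score(stored, n) for n in stored}
cs = {n: inverse_load_score(throughput, n) for n in throughput}
```

In this example n1 scores higher on replication because it already holds most of the neighboring edges, whereas n2 scores higher on storage and throughput because it is less loaded.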

(deleted)

A partitioning device for real-time distributed storage of a graph stream, comprising:
a score calculation unit that, when a new graph stream occurs in a social network due to interaction between objects, calculates a score for each of a plurality of nodes that individually process respective ones of a plurality of subgraphs obtained by partitioning a graph;
a node determination unit that determines a selected node having the highest calculated score among the plurality of nodes; and
a partition connection unit that connects a vertex associated with the new graph stream by an edge to the subgraph belonging to the selected node, so that the new graph stream is processed at the selected node,
wherein the score calculation unit:
[Equation]
calculates, as the score, a total score (TSi) computed for each of the plurality of nodes by applying the equation above,
where the Replication Score (RSi) is the vertex replication ratio score of a node, the Storage Utilization Score (USi) is the storage amount score of each node, the Computation Size Score (CSi) is the throughput score of each node, α is the weight of RSi, β is the weight of USi, and γ is the weight of CSi;
identifies whether data included in a first vertex associated with the new graph stream is hot data that is used at or above a predetermined frequency; and
when the data is hot data, adjusts γ to a positive value so that CSi is calculated to be high.

The partitioning device of claim 10,
wherein the score calculation unit
identifies, with reference to a load table, a storage amount and a throughput for each of the plurality of nodes, and calculates the score to be higher as the storage amount is lower or the throughput is lower.

The partitioning device of claim 10,
wherein the score calculation unit
identifies, from the graph, a second vertex having a pattern in common with a first vertex associated with the new graph stream, and
calculates the score to be high by assigning a weight to a node processing the subgraph that contains the second vertex.

The partitioning device of claim 10,
wherein the score calculation unit,
when the data is hot data, calculates, with reference to a load table, the highest score for the node having the lowest throughput among the plurality of nodes.

(deleted)