KR101307337B1

KR101307337B1 - System and method for Triangle Counting Sampling by using Map-Reduce

Info

Publication number: KR101307337B1
Application number: KR1020110133139A
Authority: KR
Inventors: 김성열; 윤진현
Original assignee: 건국대학교 산학협력단
Priority date: 2011-12-12
Filing date: 2011-12-12
Publication date: 2013-09-10
Also published as: KR20130066352A

Abstract

The present invention relates to a triangle counting sampling apparatus using Map-Reduce and a method thereof. To collect information about clusters with a data mining engine, it performs triangle counting using Hadoop's Map-Reduce technology and samples the graph faster than the prior art while preserving the edges that are shared by many triangles, so that distributed data processing gives accurate results not only for the graph as a whole but also for specific parts of it.
In a method of finding three connected nodes by a triangle counting algorithm, the invention comprises: a step in which a data mining engine counts, through a triangle counting module, the frequency of appearance of each node in the entire data; a step of taking, for each data item consisting of an edge and therefore having two frequencies, the minimum of the two node values among the three nodes; a step of forming a subgraph by sampling the edges with a probability proportional to the frequency of appearance of the nodes; and a step of counting triangles in the subgraph using the triangle counting algorithm.

Description

Triangle Counting Sampling Apparatus and Method Using Map Reduce {System and method for Triangle Counting Sampling by using Map-Reduce}

The present invention relates to a triangle counting sampling apparatus using Map-Reduce and a method thereof. To collect information about clusters with a data mining engine, it performs triangle counting using Hadoop's Map-Reduce technology and samples the graph faster than the prior art while preserving the edges that are shared by many triangles, so that distributed data processing gives accurate results not only for the graph as a whole but also for specific parts of it.

Korean Patent Publication No. 2011-0069338, a prior art, discloses a system that processes large volumes of data in a distributed, parallel manner across a plurality of computing nodes using MapReduce. It is a distributed parallel processing system that provides incremental MapReduce-based processing not only for large stored data that has already been collected but also for large volumes of stream data that continue to arrive while the distributed parallel job is running.

There are many algorithms for triangle counting in an undirected graph. A simple triangle counting technique looks up every set of three connected nodes one by one, and therefore requires between O(…) and O(…) computation. The various triangle counting algorithms aim for a small error relative to the actual value with little computation in a short time. DOULION, which samples for triangle counting using Hadoop's Map-Reduce, likewise aims for a small error in a short time. The key idea of DOULION is to reduce the amount of computation by using Map-Reduce to cut down the number of edges through sampling. Because only … edges are selected, the total amount of computation is reduced by a ratio of … (the expressions and the accompanying formula are not reproduced in this text).
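As a rough illustration of the uniform edge sampling described above, the following Python sketch keeps each edge independently with probability `p` and rescales the triangle count of the sparsified graph by `1/p^3`, since a triangle survives only when all three of its edges survive. This follows the commonly published description of DOULION, not the formula of this document, which is not reproduced in the text. The function names `count_triangles` and `doulion_estimate` are illustrative assumptions.

```python
import random
from itertools import combinations


def count_triangles(edges):
    """Count triangles exactly in an undirected graph given as unique edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # The common neighbours of an edge's endpoints are the triangles on that
    # edge; every triangle is seen once per edge, so divide the sum by three.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def doulion_estimate(edges, p, rng):
    """Uniform sampling: keep each edge with probability p, rescale by 1/p^3."""
    sampled = [e for e in edges if rng.random() < p]
    return count_triangles(sampled) / p ** 3


# Complete graph on 30 nodes, which contains C(30, 3) = 4060 triangles.
edges = list(combinations(range(30), 2))
exact = count_triangles(edges)
estimate = doulion_estimate(edges, p=0.5, rng=random.Random(0))
print(exact, estimate)
```

On a dense, regular graph such as this one the estimate tends to land near the exact count, which matches the observation that the overall error of uniform sampling is usually small.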

In practice, the overall triangle count obtained this way is close to the actual value. However, although the error is small when the graph is viewed as a whole, it can become a problem when edges that are shared by many triangles are dropped by the sampling. Suppose, as in FIG. 2, that an edge shared by a very large number of triangles, for which the value k given in the formula is extremely high, is omitted in the sampling process.

If many such edges are omitted, the error relative to the actual value can become large even overall, as in FIG. 1, and the problem is more serious when a small region of the graph is singled out. Part of the graph may then carry a large error, so that inaccurate results are obtained in clustering and similar tasks.
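The failure case can be made concrete with a small Python sketch. It builds a graph in which a single edge, here called `hub` (an illustrative name), is shared by `k` triangles, and compares the exact triangle count with and without that edge. Under uniform sampling with probability `p`, the hub edge is dropped with probability `1 - p`, and every triangle disappears with it.

```python
def count_triangles(edges):
    """Count triangles exactly in an undirected graph given as unique edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


k = 100
hub = (0, 1)
# Nodes 2 .. k+1 each connect to both endpoints of the hub edge,
# so every triangle in the graph contains the hub edge.
spokes = [(h, w) for w in range(2, k + 2) for h in hub]

with_hub = count_triangles([hub] + spokes)
without_hub = count_triangles(spokes)
print(with_hub, without_hub)  # 100 0
```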

To solve the above problem, an object of the present invention is to provide a triangle counting sampling apparatus using Map-Reduce, and a method thereof, that present a new sampling method for triangle counting which preserves heavily shared edges.

Which edges are heavily shared is only known once the result is available. A further object is therefore to provide a triangle counting sampling apparatus using Map-Reduce, and a method thereof, in which the frequency of appearance of each node in the entire data is counted and edges are sampled with different probabilities proportional to that frequency, so that heavily shared edges in particular are not omitted.

In a method of finding three connected nodes by a triangle counting algorithm, the present invention comprises: a step in which a data mining engine counts, through a triangle counting module, the frequency of appearance of each node in the entire data; a step of taking, for each data item consisting of an edge and therefore having two frequencies, the minimum of the numbers of connected edges of the two nodes joined by the edge; a step of forming a subgraph by sampling the edges with a probability proportional to the frequency of appearance of the nodes; and a step of counting triangles in the subgraph using the triangle counting algorithm.

The work of the nodes is processed in parallel using Hadoop's Map-Reduce, so that processing speed increases as nodes are added.

The subgraph is … (V: vertex value, E: edge value).

In the step of forming the subgraph by sampling the edges with a probability proportional to the frequency of appearance of the nodes, the frequency measurement formula is <Equation 2> below.

<Equation 2> …

In a triangle counting sampling apparatus using Map-Reduce, the present invention includes a data mining engine. The data mining engine comprises a triangle counting module that counts the frequency of appearance of each node in the entire data. For each data item consisting of an edge and therefore having two frequencies, it takes the minimum of the numbers of connected edges of the two nodes joined by the edge, forms a subgraph by sampling the edges with a probability proportional to the frequency of appearance of the nodes, and counts triangles in the subgraph using the triangle counting algorithm.

The data mining engine processes the work of the nodes in parallel using Hadoop's Map-Reduce, so that processing speed increases as nodes are added.

According to the present invention, the cluster manager distributes resources efficiently. For each data item consisting of an edge and therefore having two frequencies, the minimum of the numbers of connected edges of the two nodes joined by the edge is taken, a subgraph is formed by sampling the edges with a probability proportional to the frequency of appearance of the nodes, and triangles can be counted in the subgraph using the triangle counting algorithm.

According to the present invention, to collect information about clusters with a data mining engine, triangle counting is performed using Hadoop's Map-Reduce technology, and the graph is sampled faster than in the prior art while preserving the edges that are shared by many triangles. Distributed data processing is therefore accurate not only for the graph as a whole but also for specific parts of it, and the desired result can be obtained.

FIG. 1 shows a graph with countless triangles that share only a single edge, according to the prior art.
FIG. 2 is a graph of experimental results showing the problem with DOULION.
FIG. 3 shows a distributed processing method using Hadoop's Map-Reduce.
FIG. 4 shows the configuration of the triangle counting sampling apparatus using Map-Reduce according to the present invention.
FIG. 5 shows the sequence of the triangle counting sampling method using Map-Reduce according to the present invention.
FIG. 6 is a graph comparing the result values on a graph simulated according to the present invention ("This paper" denotes the present invention).
FIG. 7 is a graph showing performance on a special graph according to an embodiment of the present invention.

Hereinafter, specific details for carrying out the present invention are described with reference to the drawings.

The triangle counting sampling apparatus using Map-Reduce according to the present invention includes a data mining engine (100) and samples using weights. For a particular kind of graph, namely one with countless triangles that share only a single edge, this prevents important edges from being omitted and yields good performance with little deviation from the actual count.

The data mining engine (100) comprises a triangle counting module (150) that counts the frequency of appearance of each node in the entire data.

Accordingly, for each data item consisting of an edge and therefore having two frequencies, the minimum of the numbers of connected edges of the two nodes joined by the edge is taken, a subgraph is formed by sampling the edges with a probability proportional to the frequency of appearance of the nodes, and triangles are counted in the subgraph using the triangle counting algorithm.

In addition, the data mining engine (100) processes the work of the nodes in parallel using Hadoop's Map-Reduce, so that processing speed increases as nodes are added.
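To show how the node frequency count fits the Map-Reduce model, here is a minimal sketch that simulates the map, shuffle, and reduce phases in plain Python on a single machine. It does not use any Hadoop API; the function names (`map_edge`, `shuffle`, `reduce_counts`, `run_job`) are assumptions made for illustration. In a real cluster the map and reduce calls would run on different nodes, which is where the speed-up from adding nodes would come from.

```python
from collections import defaultdict


def map_edge(edge):
    """Map phase: emit (node, 1) for both endpoints of an edge."""
    u, v = edge
    yield (u, 1)
    yield (v, 1)


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the ones to get the node's frequency of appearance."""
    return key, sum(values)


def run_job(edges):
    mapped = (pair for edge in edges for pair in map_edge(edge))
    return dict(reduce_counts(k, vs) for k, vs in shuffle(mapped).items())


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
frequencies = run_job(edges)
print(frequencies["c"])  # 3
```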

In addition, the subgraph is … (V: vertex value, E: edge value).

In the data mining engine (100), the frequency measurement formula used to form the subgraph by sampling the edges with a probability proportional to the frequency of appearance of the nodes is <Equation 2>.

Accordingly, information about clusters is collected with the data mining engine, triangle counting is performed using Hadoop's Map-Reduce technology, and the graph is sampled faster than before while preserving the edges that are shared by many triangles. Distributed data processing is therefore accurate not only for the graph as a whole but also for specific parts of it, and the desired result can be obtained.

Hereinafter, the triangle counting sampling method using Map-Reduce for carrying out the present invention is described in detail.

First, the data mining engine counts, through the triangle counting module, the frequency of appearance of each node in the entire data.

Then, for each data item consisting of an edge and therefore having two frequencies, the minimum of the two node values among the three nodes is taken.
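The two steps above can be sketched in Python as follows. Each node's frequency is taken to be the number of edges it appears in, and each edge is weighted by the smaller of its two endpoint frequencies. One plausible reading of why the minimum is used: an edge (u, v) can take part in at most min(deg u, deg v) - 1 triangles, so the minimum is a cheap upper-bound proxy for how heavily the edge is shared. The name `edge_weights` is an illustrative assumption.

```python
from collections import Counter


def edge_weights(edges):
    """Weight each edge by the smaller frequency of its two endpoints."""
    # Frequency of appearance of each node = number of edges it occurs in.
    freq = Counter(node for edge in edges for node in edge)
    return {(u, v): min(freq[u], freq[v]) for u, v in edges}


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
weights = edge_weights(edges)
print(weights[("c", "d")])  # 1, because node "d" appears only once
```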

Next, a subgraph is formed by sampling the edges with a probability proportional to the frequency of appearance of the nodes.
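A minimal Python sketch of frequency-proportional edge sampling follows. Equation 2 is not reproduced in this text, so the probability rule below, `alpha + (1 - alpha) * (weight / max_weight)`, is a hypothetical stand-in chosen only to illustrate the idea; it is not the document's formula. The parameter `alpha` is assumed here to be an adjustable value between 0 and 1, and all function names are illustrative.

```python
import random
from collections import Counter


def sampling_probabilities(edges, alpha):
    """Hypothetical rule (NOT Equation 2): higher edge weight, higher probability."""
    freq = Counter(node for edge in edges for node in edge)
    weight = {e: min(freq[e[0]], freq[e[1]]) for e in edges}
    w_max = max(weight.values())
    return {e: alpha + (1 - alpha) * (weight[e] / w_max) for e in edges}


def sample_subgraph(edges, alpha, rng):
    """Keep each edge independently with its own probability."""
    prob = sampling_probabilities(edges, alpha)
    return [e for e in edges if rng.random() < prob[e]]


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
prob = sampling_probabilities(edges, alpha=0.5)
subgraph = sample_subgraph(edges, alpha=0.5, rng=random.Random(7))
print(prob[("a", "b")], prob[("c", "d")])  # 1.0 0.75
```

With this stand-in rule the edges with the highest weight are always kept, while lightly connected edges are kept with a probability that approaches `alpha`.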

Then, triangles are counted in the subgraph using the triangle counting algorithm.
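The document does not specify which exact counting algorithm is applied to the subgraph. As one possible choice, the sketch below counts each triangle exactly once by storing, for every node, only its neighbours with a larger label and checking pairs of those neighbours. The name `count_triangles_ordered` is an assumption, and node labels are assumed to be mutually comparable.

```python
from itertools import combinations


def count_triangles_ordered(edges):
    """Count each triangle once, starting from its smallest-labelled node."""
    higher = {}
    for u, v in edges:
        lo, hi = (u, v) if u < v else (v, u)
        higher.setdefault(lo, set()).add(hi)
    count = 0
    for nbrs in higher.values():
        # For a triangle x < y < z, both y and z are stored under x,
        # and the closing edge (y, z) is stored under y.
        for a, b in combinations(sorted(nbrs), 2):
            if b in higher.get(a, ()):
                count += 1
    return count


subgraph = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(count_triangles_ordered(subgraph))  # 1
```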

In addition, the work of the nodes is processed in parallel using Hadoop's Map-Reduce, so that processing speed increases as nodes are added. Here, the subgraph is … (V: vertex value, E: edge value).

In the step of forming the subgraph by sampling the edges with a probability proportional to the frequency of appearance of the nodes, the frequency measurement formula is <Equation 2> above.

Specifically, the method can be divided into the following.

1) Using Hadoop's Map-Reduce, the work of the nodes (PCs) can be processed in parallel as shown in FIG. 3. The more nodes there are, the faster the processing.

A) The frequency of appearance of each node in the entire data is counted.

B) Each data item consisting of an edge has two frequencies. The minimum of the two node values is taken.

C) Each edge is given a different sampling probability, proportional to the frequency of appearance of its nodes.

D) The newly sampled subgraph is called ….

E) Triangles are counted in … using the triangle counting algorithm. The experimental results show a small deviation from the actual value.
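Steps A) to E) can be strung together in a short Python sketch, run here on a graph where one edge (`hub`, an illustrative name) is shared by every triangle. Because Equation 2 is not reproduced in this text, the probability rule `alpha + (1 - alpha) * (weight / max_weight)` is a hypothetical stand-in, not the document's formula, and the sketch does not attempt to rescale the sampled count into an estimate. It only checks that the heavily shared edge survives the sampling.

```python
import random
from collections import Counter


def count_triangles(edges):
    """Exact triangle count for an undirected graph given as unique edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def weighted_sample(edges, alpha, rng):
    # A) frequency of appearance of each node
    freq = Counter(node for edge in edges for node in edge)
    # B) minimum of the two endpoint frequencies for each edge
    weight = {e: min(freq[e[0]], freq[e[1]]) for e in edges}
    w_max = max(weight.values())
    # C) hypothetical probability rule, NOT the document's Equation 2
    prob = {e: alpha + (1 - alpha) * (weight[e] / w_max) for e in edges}
    # D) the sampled subgraph
    return [e for e in edges if rng.random() < prob[e]]


k = 50
hub = (0, 1)
edges = [hub] + [(h, w) for w in range(2, k + 2) for h in hub]

subgraph = weighted_sample(edges, alpha=0.25, rng=random.Random(42))
# E) count triangles in the sampled subgraph
print(hub in subgraph, count_triangles(subgraph), count_triangles(edges))
```

Under this stand-in rule the hub edge has the largest weight and is kept with probability 1, so the sampled subgraph can never lose all of its triangles to a single unlucky draw, which is the behaviour the invention is aiming for.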

Accordingly, the present invention counts the number of edges of each node in the graph, samples with a probability proportional to that count, and performs triangle counting using Hadoop's Map-Reduce, so that distributed data processing is accurate not only for the graph as a whole but also for specific parts of it, and the desired result can be obtained.

100: data mining engine
150: triangle counting module

Claims

A triangle counting sampling method using Map-Reduce, for finding three connected nodes by a triangle counting algorithm, the method comprising:
counting, by a data mining engine through a triangle counting module, the frequency of appearance of each node in the entire data;
taking, for each data item consisting of an edge and therefore having two frequencies, the minimum of the numbers of connected edges of the two nodes joined by the edge;
forming a subgraph by sampling the edges with a probability proportional to the frequency of appearance of the nodes; and
counting triangles in the subgraph using the triangle counting algorithm.

The method of claim 1,
wherein the operations of the nodes are processed in parallel using Hadoop's Map-Reduce.

The method of claim 1,
wherein the subgraph is … sampled from the analysis target graph G(V, E) (V: vertex value, E: edge value).

The method of claim 1,
wherein, in the step of forming the subgraph by sampling the edges with a probability proportional to the frequency of appearance of the nodes, the frequency measurement formula is <Equation 2> below.
<Equation 2> …
(α: an adjustable parameter with a value between 0 and 1)

A triangle counting sampling apparatus using Map-Reduce, comprising a data mining engine,
wherein the data mining engine comprises a triangle counting module that counts the frequency of appearance of each node in the entire data, and
wherein the data mining engine takes, for each data item consisting of an edge and therefore having two frequencies, the minimum of the numbers of connected edges of the two nodes joined by the edge, forms a subgraph by sampling the edges with a probability proportional to the frequency of appearance of the nodes, and counts triangles in the subgraph using the triangle counting algorithm.

The apparatus of claim 5,
wherein the data mining engine processes the operations of the nodes in parallel using Hadoop's Map-Reduce.

The apparatus of claim 6,
wherein the subgraph is … sampled from the analysis target graph G(V, E) (V: vertex value, E: edge value).

The apparatus of claim 6,
wherein, in the data mining engine, the frequency measurement formula used to form the subgraph by sampling the edges with a probability proportional to the frequency of appearance of the nodes is <Equation 2> below.
<Equation 2> …
(α: an adjustable parameter with a value between 0 and 1)