KR101669356B1

KR101669356B1 - Mapreduce method for triangle enumeration and apparatus thereof

Info

Publication number: KR101669356B1
Application number: KR1020150020455A
Authority: KR
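Inventors: U Kang (강유); Ha-Myung Park (박하명); Rasmus Pagh (라스무스 파); Francesco Silvestri (프란체스코 실베스트리)
Original assignee: Korea Advanced Institute of Science and Technology (한국과학기술원)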
Priority date: 2015-02-10
Filing date: 2015-02-10
Publication date: 2016-10-25
Also published as: KR20160097943A

Abstract

A mapping method for triangle enumeration is disclosed. The method includes receiving reducing data that includes an edge value, determining a color for each of the vertices constituting the edge value, and generating mapping data indicating a type of triangle based on the colors, wherein the mapping data is transmitted to a reducer determined based on the type of the triangle.

Description

MAPREDUCE METHOD FOR TRIANGLE ENUMERATION AND APPARATUS THEREOF

The various embodiments described herein relate to a MapReduce method and to apparatuses that use the method.

With the recent growth in the volume of information and advances in data processing technology, various techniques for extracting specific meaning from vast amounts of data are being studied. Here, specific meaning may refer to relationships among data. Distributed processing techniques such as MapReduce may be used to process data efficiently. MapReduce can be implemented according to various algorithms. Extracting relationships among data with MapReduce calls for an algorithm that is easy to apply to a distributed system, is scalable, and keeps the cost of building and maintaining a cluster low.

An object of the various embodiments described herein is to provide a MapReduce method for efficiently and accurately enumerating the triangles contained in a graph, and apparatuses that use the method.

A mapping method for triangle enumeration according to one aspect includes: receiving reducing data that includes an edge value; and generating mapping data indicating a type of triangle based on the color of each of the vertices constituting the edge value. The mapping data may be transmitted to a reducer determined based on the type of the triangle.
The colors of the vertices may be determined by assigning to the vertices in the graph colors selected uniformly at random from among predetermined colors, and the number of predetermined colors may be determined based on a first factor related to the number of edges in the graph and a second factor related to the memory capacity of a reducer.
The colors of the vertices may be determined by assigning to the vertices in the graph colors selected from among predetermined colors, using a function selected uniformly at random from a plurality of coloring functions.
The colors may include a first color for a first vertex constituting the edge value and a second color for a second vertex constituting the edge value. Generating the mapping data may include, when the first color and the second color are the same, generating mapping data indicating a second type, in which two vertices share a color and the remaining vertex has a different color, or mapping data indicating a first type, in which all three vertices share a color.
In that case, the mapping data may include a first element corresponding to the first color or the second color, a second element corresponding to a third color determined based on the current round, and a third element corresponding to null.
Generating the mapping data may include, when the first color and the second color differ, generating mapping data indicating a third type, in which all three vertices have different colors.
In that case, the mapping data may include a first element corresponding to the first color, a second element corresponding to the second color, and a third element corresponding to a third color determined based on the current round.
A reducing method according to one aspect includes: receiving mapping data indicating a type of triangle; and enumerating, in a graph consisting of a vertex set and an edge set, the triangles corresponding to the type of triangle. The type of a triangle may be determined by the colors of the vertices constituting the triangle.
The mapping data may include a first element corresponding to a first color, a second element corresponding to a second color, and a third element corresponding to a third color or to null.
The enumerating may include, when the third element corresponds to null: enumerating triangles of the first type, in which all three vertices share a color, using the first color or the second color; and enumerating triangles of the second type, in which two vertices share a color and the remaining vertex has a different color, using the first color and the second color.
The enumerating may include, when the third element corresponds to the third color, enumerating triangles of the third type, in which all three vertices have different colors, using the first color, the second color, and the third color.
Enumerating the triangles may include enumerating the triangles from the set of edges determined by the mapping data.
The colors of the vertices may be determined by assigning to each of the vertices a color selected uniformly at random from among predetermined colors.
A mapper for triangle enumeration according to one aspect includes: a reducing data receiving unit that receives reducing data including an edge value; and a mapping data generating unit that generates mapping data indicating a type of triangle based on the color of each of the vertices constituting the edge value. The mapping data may be transmitted to a reducer determined based on the type of the triangle.
The colors may include a first color for a first vertex constituting the edge value and a second color for a second vertex constituting the edge value. When the first color and the second color are the same, the mapping data generating unit may generate mapping data indicating the second type or mapping data indicating the first type; when the first color and the second color differ, it may generate mapping data indicating the third type.
A reducer according to one aspect includes: a mapping data receiving unit that receives mapping data indicating a type of triangle; and a triangle enumeration unit that enumerates, in a graph consisting of a vertex set and an edge set, the triangles corresponding to the type of triangle. The type of a triangle may be determined by the colors of the vertices constituting the triangle.

The various embodiments described herein provide a MapReduce method that is easy to apply to a distributed system, fault tolerant, scalable, and relatively inexpensive in terms of building and maintaining a cluster, along with apparatuses that use the method.

FIG. 1 is a diagram illustrating how mappers and reducers process data according to an embodiment.
FIG. 2 is a block diagram illustrating the configuration of a mapper and a reducer according to an embodiment.
FIG. 3 is a flowchart illustrating the operation of a mapper according to an embodiment.
FIG. 4 is a flowchart illustrating the operation of a reducer according to an embodiment.
FIGS. 5 to 10 are graphs showing the results of performance experiments on a mapper and a reducer according to an embodiment.

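Hereinafter, embodiments are described in detail with reference to the accompanying drawings. Like reference numerals in the drawings denote like elements.
Various modifications may be made to the embodiments described below. The embodiments described below are not intended to limit the forms of implementation and should be understood to include all modifications, equivalents, and alternatives thereto.
The terms used in the embodiments are used only to describe particular embodiments and are not intended to limit them. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" specify the presence of the stated features, numbers, steps, operations, elements, components, or combinations thereof, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.
Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries are to be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless expressly defined in this application, are not to be interpreted in an idealized or overly formal sense.
In the description with reference to the accompanying drawings, like elements are given like reference numerals regardless of the figure number, and redundant descriptions of them are omitted. Where a detailed description of related known art would unnecessarily obscure the gist of an embodiment, that description is omitted.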
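FIG. 1 is a diagram illustrating how mappers and reducers process data according to an embodiment.
The MapReduce method for triangle enumeration may be implemented with mappers 10-1 to 10-5 and reducers 20-1 to 20-7, and may be carried out over multiple stages. For example, the stages may include an r-th stage and an (r+1)-th stage. A fixed number of mappers and reducers may run in each stage. For example, mappers 10-1 to 10-3 and reducers 20-1 to 20-4 may run in the r-th stage, and mappers 10-4 and 10-5 and reducers 20-5 to 20-7 may run in the (r+1)-th stage.
The number of mappers and the number of reducers in each stage may vary. For example, the number of mappers may equal the number of reducers. The connections between one stage and the next may also vary. The configurations of mappers 10-1 to 10-5 and reducers 20-1 to 20-7 are described in detail below with reference to FIGS. 2 to 4.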
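MapReduce for triangle enumeration may be performed with the CTTP (Colored Triangle Type Partition) technique. Before describing the CTTP technique according to the embodiments, the computational model used by the embodiments is described first.
The embodiments below use the computational model MR(m, M) for MapReduce. MapReduce is a distributed programming framework for processing very large data. MapReduce (i) is easy to apply to a distributed system, (ii) is fault tolerant, (iii) is scalable, and (iv) requires relatively little cost to build and maintain a cluster. The embodiments may be implemented using Hadoop.
The computational model uses parameters m and M. Parameter m denotes the maximum memory capacity available for executing a map/reduce function, and parameter M denotes the maximum available memory capacity of the whole system. A map/reduce function may refer to at least one of a map function and a reduce function. Hereinafter, a reducer may refer to an application that executes a reduce function, and a mapper may refer to an application that executes a map function. A reducer and a mapper may each be a single application. Parameter m may also be referred to as the size of a mapper or the size of a reducer.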
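An algorithm based on the computational model may be specified as a sequence of rounds. The computation in each round may be defined by a map function and a reduce function. The inputs and outputs of the map and reduce functions may be multisets of key-value pairs. A key-value pair may be written <k; v>, where k is the key and v is the value. The map function takes one pair as input and may output a multiset of new pairs. The reduce function takes the pairs that share a key as input and may output a multiset of new pairs. Hereinafter, emit(<k; v>) in a map/reduce function indicates that <k; v> is output.
In round r (r ≥ 0), a multiset I_r of pairs may be given as input. In the map step of round r, the map function is applied to each pair in I_r, producing a new multiset W_r. In the shuffle step of round r, the pairs in W_r that share a key are grouped together. In the reduce step of round r, each group of pairs sharing a key is processed by the reduce function. As a result, the multiset O_r is output as the final result of round r. O_r may contain the output of the reduce function for each group of pairs sharing a key, and may be used as the input of the next round (r+1).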
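The three steps of a round can be sketched in a few lines of Python. This is only an in-memory illustration of the model, not Hadoop code; the names `run_round`, `degree_map`, and `degree_reduce`, and the toy degree-counting job, are assumptions introduced here for the example.

```python
from collections import defaultdict

def run_round(pairs, map_fn, reduce_fn):
    """Simulate one round: map step, shuffle step, reduce step."""
    # Map step: apply the map function to every pair of I_r to build W_r.
    mapped = []
    for key, value in pairs:
        mapped.extend(map_fn(key, value))
    # Shuffle step: group the pairs of W_r that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce step: process each group on its own; the union is O_r.
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output

# Toy job (illustrative only): count the edges incident to each vertex.
def degree_map(_key, edge):
    u, v = edge
    return [(u, 1), (v, 1)]

def degree_reduce(vertex, ones):
    return [(vertex, sum(ones))]

edges = [("psi", (0, 1)), ("psi", (0, 2)), ("psi", (1, 2)), ("psi", (2, 3))]
degrees = dict(run_round(edges, degree_map, degree_reduce))
```

The output list of one call can be passed as `pairs` to the next call, which mirrors O_r serving as the input of round r+1.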
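m_{k,r} may denote the space required by the reduce function to process the group defined by key k in round r. K_r may denote the set of distinct keys in W_r. The computational model then requires m_{k,r} ≤ m for every r ≥ 0 and every k ∈ K_r, and Σ_{k ∈ K_r} m_{k,r} ≤ M for every r ≥ 0. Similar constraints may be imposed on the map function.
The complexity of an algorithm based on the computational model may be the number of rounds R required in the worst case. The embodiments may provide a technique that minimizes the number of rounds R for given m and M. Hereinafter, the total work of an algorithm based on the computational model may be defined as the sum of the work required by the mappers and the reducers.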
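Hereinafter, a graph G is assumed for convenience of description. Graph G may be an undirected graph. Graph G contains no self loops and no parallel edges. Graph G includes a vertex set V and an edge set E. The triangle enumeration problem is the problem of enumerating the triangles contained in graph G. The embodiments may be implemented so that a local function enum() is called for each triangle (u, v, w). The input parameters of enum() may be the three vertices of that triangle.
According to the embodiments, the triangle enumeration problem may be solved with a MapReduce-based algorithm. Hereinafter, solving the triangle enumeration problem or a sub-problem may be understood as performing the operations for solving that problem or sub-problem.
Referring to Table 1, for notational convenience, the symbol denoting a set may also be used to denote the size of that set. For example, E may denote the size of the edge set.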

The vertices of vertex set V may be ordered by degree. Vertices of equal degree may be ordered arbitrarily. The degree of a vertex may be the number of edges incident to it. Each edge {u, v} requires a single memory word and may initially be represented as <Ψ; (u, v)>, where Ψ is a dummy key, (u, v) is the value, and u < v. In a triangle (v1, v2, v3) with v1 < v2 < v3, (v2, v3) may be called the pivot edge and v1 may be called the cone vertex. For any integer n, [n] may denote the set {0, ..., n-1}.
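As a small illustration of the ordering just described, the sketch below ranks vertices by degree and splits a triangle into its cone vertex and pivot edge. The helper names `degree_rank` and `cone_and_pivot` are hypothetical, and breaking ties by vertex label is an assumption made only to keep the example deterministic (the text allows any tie order).

```python
from collections import Counter

def degree_rank(edges):
    """Map each vertex to its position in the degree order."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # Equal-degree vertices may be ordered arbitrarily; this sketch
    # breaks ties by vertex label.
    ordered = sorted(deg, key=lambda x: (deg[x], x))
    return {vertex: rank for rank, vertex in enumerate(ordered)}

def cone_and_pivot(triangle, rank):
    """Split a triangle into its cone vertex v1 and pivot edge (v2, v3)."""
    v1, v2, v3 = sorted(triangle, key=rank.__getitem__)
    return v1, (v2, v3)

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("c", "e"), ("b", "d")]
rank = degree_rank(edges)
cone, pivot = cone_and_pivot(("c", "a", "b"), rank)
```

Here the degrees are a:2, b:3, c:4, d:2, e:1, so in the triangle (a, b, c) the lowest-ranked vertex a is the cone vertex and (b, c) is the pivot edge.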
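The embodiments may set aside vertices of very high degree, such as vertices whose degree exceeds [formula not reproduced]. More specifically, very-high-degree vertices may be enumerated using sorting. For example, for each very-high-degree vertex v, the triangles containing v can be found by sorting the edge set three times.
When reducers of size greater than [formula not reproduced] are used, the sorting algorithm may require a complexity of only O(1) in each round. Because the aggregate space needed to find the triangles containing a very-high-degree vertex is E, M/E very-high-degree vertices can be processed in parallel.
Therefore, when the reducer size is m, the mapper size is constant, the maximum available memory capacity of the whole system is M, and the total work is O(E^{3/2}), the triangles containing at least one very-high-degree vertex can be enumerated in a number of rounds with complexity [formula not reproduced]. Compared with the complexity of the CTTP technique described below, the total work of removing these very-high-degree vertices is asymptotically negligible.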
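Hereinafter, the CTTP (Colored Triangle Type Partition) technique carried out by the mapper and the reducer is described in detail with reference to FIGS. 2 to 4.
FIG. 2 is a block diagram illustrating the configuration of a mapper and a reducer according to an embodiment.
Referring to FIG. 2, mapper 10 includes a reducing data receiving unit 11 and a mapping data generating unit 13. Reducer 20 includes a mapping data receiving unit 21 and a triangle enumeration unit 22. Mapper 10, reducer 20, and each of the components they contain may be implemented as at least one hardware module or software module.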
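The CTTP technique is a MapReduce-based algorithm for solving the triangle enumeration problem. For example, the CTTP technique may be a multi-round MapReduce randomized algorithm for enumerating all triangles in an enormous graph.
The CTTP technique may take into account the trade-off among the number of rounds R, the memory capacity m required by each mapper and reducer, and the available memory capacity M required of the whole system. For an arbitrary input graph, the CTTP technique may require, in the worst case, a number of rounds with complexity [formula not reproduced]. The CTTP technique may require a memory capacity of M/E for each mapper and a memory capacity of m for each reducer. The CTTP technique can resolve the problem of the total work being distributed unevenly among reducers.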
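The reducing data receiving unit 11 receives reducing data. The reducing data may include a dummy key and an edge value. The edge value may correspond to the value of a key-value pair and may consist of vertices.
Each vertex may have a color. For example, the vertices of the graph may be colored before the triangle enumeration problem is solved using mapper 10 and reducer 20. More specifically, the CTTP technique is based on vertex partitioning. Unlike conventional vertex partitioning techniques, the CTTP technique partitions the vertices according to a coloring function selected at random from a k-wise independent family of functions. Here k may be a positive integer of 2 or more; for convenience, the embodiment with k = 4 is described below, in which case the family is referred to as a 4-wise independent family of functions.
The functions of the 4-wise independent family may be functions used for enumerating triangles in external memory. By requiring only 4-wiseness of the coloring function, the CTTP technique overcomes the limitation of existing techniques that assume a random input graph and guarantees generality, in that it applies to any input graph.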
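The vertices may be colored with [formula not reproduced] colors using a function [formula not reproduced]; in the notation below, the coloring function is written ξ and the number of colors ρ. The function may be selected uniformly at random from the 4-wise independent family of functions. E_{i,j} may denote the edge set {(u, v) ∈ E | i = min{ξ(u), ξ(v)} and j = max{ξ(u), ξ(v)}}, where i ≤ j, i, j ∈ [ρ], and ξ(x) is the color of vertex x.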
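One standard way to obtain a 4-wise independent family is to evaluate a uniformly random polynomial of degree at most 3 over a prime field. The sketch below uses that construction as an assumption; the source does not state which family is used. The names `PRIME`, `make_coloring`, and `partition_edges` are hypothetical, vertex ids are assumed to be non-negative integers below `PRIME`, and reducing the hash modulo the number of colors makes the colors only approximately uniform.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1; vertex ids must be smaller

def make_coloring(num_colors, rng=random):
    """Draw a coloring function xi from a 4-wise independent family.

    A uniformly random polynomial of degree at most 3 over Z_PRIME is
    4-wise independent on distinct inputs; taking the result modulo
    num_colors maps it to a color in [num_colors].
    """
    coeffs = [rng.randrange(PRIME) for _ in range(4)]

    def xi(vertex):
        acc = 0
        for c in coeffs:  # Horner evaluation modulo PRIME
            acc = (acc * vertex + c) % PRIME
        return acc % num_colors

    return xi

def partition_edges(edges, xi):
    """Group edges into the sets E_{i,j} with i <= j."""
    parts = defaultdict(list)
    for u, v in edges:
        i, j = sorted((xi(u), xi(v)))
        parts[(i, j)].append((u, v))
    return parts

xi = make_coloring(num_colors=3, rng=random.Random(7))
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
parts = partition_edges(edges, xi)
```

Each edge lands in exactly one set E_{i,j}, keyed by the smaller and larger color of its endpoints.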
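The mapping data generating unit 13 generates mapping data indicating a type of triangle based on the determined colors.
A triangle (u, v, w) may be classified as type-3 when all three vertices have different colors, as type-2 when two of the three vertices share a color and the remaining vertex has a different color, and as type-1 when all three vertices share a color.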
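Under this classification, the type number equals the number of distinct colors among the three vertices. A minimal sketch, where `triangle_type` is a hypothetical helper name:

```python
def triangle_type(colors):
    """Return 1, 2, or 3: the number of distinct colors among three vertices."""
    if len(colors) != 3:
        raise ValueError("a triangle has exactly three vertices")
    return len(set(colors))

same = triangle_type((0, 0, 0))    # all three colors equal: type-1
two = triangle_type((0, 1, 0))     # two equal, one different: type-2
three = triangle_type((2, 0, 1))   # all different: type-3
```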
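The CTTP technique may decompose the triangle enumeration problem into [formula not reproduced] sub-problems. Each sub-problem may be one of the two types in Table 2.
The CTTP technique may distribute the K sub-problems evenly over [formula not reproduced] rounds, so that each sub-problem is solved by a single reducer. In round r, the CTTP technique may operate as in Algorithm 1 of Table 3. The input of each round may be the key-value pairs <Ψ; (u, v)>.
In round r, for 0 ≤ i < j < k < ρ with (i + j + k) = (r mod R), the CTTP technique may solve the (i, j, k)-sub-problem using the reducer associated with key (i, j, k). In round r, for 0 ≤ i < j < ρ with (i + j) = (r mod R), the CTTP technique may solve the (i, j)-sub-problem using the reducer associated with key (i, j, -1).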
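The round assignment can be sketched as follows. The key shapes (i, j, k) and (i, j, -1) come from the text; comparing the color sum with r modulo R is an assumption made here so that every sub-problem is scheduled in some round, and the function name `subproblems_for_round` is hypothetical.

```python
from itertools import combinations

def subproblems_for_round(r, num_rounds, num_colors):
    """List the reducer keys of the sub-problems scheduled in round r.

    ASSUMPTION: a sub-problem belongs to round r when the sum of its
    colors is congruent to r modulo num_rounds.
    """
    target = r % num_rounds
    keys = [(i, j, k) for i, j, k in combinations(range(num_colors), 3)
            if (i + j + k) % num_rounds == target]
    keys += [(i, j, -1) for i, j in combinations(range(num_colors), 2)
             if (i + j) % num_rounds == target]
    return keys

schedule = [subproblems_for_round(r, num_rounds=5, num_colors=6)
            for r in range(5)]
```

With 6 colors there are 20 three-color and 15 two-color sub-problems; under the stated assumption each appears in exactly one of the 5 rounds.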

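The mappers may forward each edge to the appropriate reducers. In other words, the mapping data may be transmitted to a reducer determined based on the type of triangle.
For each input pair <Ψ; (u, v)>, the mapper may send messages as in Table 4, where i = min{ξ(u), ξ(v)} and j = max{ξ(u), ξ(v)}.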

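The mapping data receiving unit 21 receives, from the mappers, mapping data indicating a type of triangle.
The triangle enumeration unit 22 enumerates, in graph G consisting of a vertex set and an edge set, the triangles corresponding to the type of triangle. The type of a triangle is determined by the colors of the vertices constituting it.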
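The interplay between mapper routing and reducer enumeration can be simulated end to end on a small graph. The sketch below is not the patent's Algorithm 1 or its Table 4 messages, which are not reproduced in this text. It assumes that a three-color reducer receives the edges whose two distinct colors are among its colors, that a two-color reducer receives the edges whose colors both lie in its pair, that single-color triangles of color c are owned by the two-color sub-problem (c, c+1) (or (ρ-2, ρ-1) for the last color), and that reducers enumerate by brute force. All function names are hypothetical, and at least two colors are assumed.

```python
from itertools import combinations

def route_edge(edge, color, scheduled_keys):
    """Mapper side: pair the edge with every scheduled key that needs it."""
    i, j = sorted((color[edge[0]], color[edge[1]]))
    out = []
    for key in scheduled_keys:
        if key[2] == -1:   # two-color sub-problem (a, b)
            needed = {i, j} <= {key[0], key[1]}
        else:              # three-color sub-problem (a, b, c)
            needed = i != j and {i, j} <= set(key)
        if needed:
            out.append((key, edge))
    return out

def enumerate_in_reducer(key, edges, color, num_colors):
    """Reducer side: brute-force the triangles this sub-problem owns."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = []
    for u, v, w in combinations(sorted(adj), 3):
        if v in adj[u] and w in adj[u] and w in adj[v]:
            distinct = {color[u], color[v], color[w]}
            if key[2] != -1:
                ok = len(distinct) == 3
            elif len(distinct) == 2:
                ok = True
            else:
                # ASSUMPTION: ownership rule for single-color triangles.
                c = distinct.pop()
                owner = (c, c + 1) if c + 1 < num_colors else (c - 1, c)
                ok = owner == key[:2]
            if ok:
                found.append((u, v, w))
    return found

def enumerate_triangles(edges, color, num_colors, num_rounds):
    triangles = []
    for r in range(num_rounds):
        keys = [k for k in combinations(range(num_colors), 3)
                if sum(k) % num_rounds == r]
        keys += [(i, j, -1) for i, j in combinations(range(num_colors), 2)
                 if (i + j) % num_rounds == r]
        groups = {}  # shuffle step: group routed edges by reducer key
        for edge in edges:
            for key, e in route_edge(edge, color, keys):
                groups.setdefault(key, []).append(e)
        for key, group in groups.items():
            triangles += enumerate_in_reducer(key, group, color, num_colors)
    return sorted(triangles)

color = {0: 0, 1: 0, 2: 0, 3: 1, 4: 2, 5: 1}
edges = list(combinations(range(6), 2))  # complete graph on 6 vertices
result = enumerate_triangles(edges, color, num_colors=3, num_rounds=2)
```

On this complete graph the simulation reports each of the 20 triangles exactly once, which is the property the type-based partition is meant to provide.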
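When the sub-problems are distributed over rounds according to the embodiments, each mapper can be guaranteed to emit the same number of pairs. For example, it can be guaranteed that [formula not reproduced] pairs are emitted per mapper in each round. The map step can therefore be distributed evenly over the available processing units. For instance, the map step may be distributed evenly by a real-time schedule, and may further be distributed evenly while avoiding the delay caused by a few slow mappers.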
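FIG. 3 is a flowchart illustrating the operation of a mapper according to an embodiment.
Referring to FIG. 3, in step 110, mapper 10 receives reducing data that includes an edge value.
In step 130, mapper 10 compares a first color with a second color. The first color is the color of the first vertex constituting the edge value, and the second color is the color of the second vertex constituting the edge value. For example, in Algorithm 1 described above, the first color may be i and the second color may be j. If step 130 finds that the first color and the second color differ, mapper 10 may perform step 141; if they are the same, mapper 10 may perform step 142.
In step 141, mapper 10 generates mapping data indicating the third type. The third type is the type of triangle in which all three vertices have different colors.
In step 142, mapper 10 generates mapping data indicating the second type or mapping data indicating the first type. The second type is the type of triangle in which two vertices share a color and the remaining vertex has a different color, and the first type is the type of triangle in which all three vertices share a color.
In other respects, the CTTP technique described above applies to mapper 10.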
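The branch in steps 130, 141, and 142 can be sketched as a function that builds the three-element mapping data described in the summary. The function name `build_mapping_data` and the parameter `round_color` are hypothetical; `round_color` stands for the "third color determined based on the current round", and how it is derived from the round is not shown here.

```python
def build_mapping_data(first_color, second_color, round_color):
    """Build the three-element mapping data for one edge.

    round_color is a placeholder for the third color determined from the
    current round (an assumed input, not computed in this sketch).
    """
    if first_color != second_color:
        # Step 141: third type, all three colors differ.
        return (first_color, second_color, round_color)
    # Step 142: second or first type; the third element is null.
    return (first_color, round_color, None)

type3_data = build_mapping_data(0, 2, 1)
type12_data = build_mapping_data(1, 1, 2)
```

A reducer can then tell the cases apart by checking whether the third element is null.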
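FIG. 4 is a flowchart illustrating the operation of a reducer according to an embodiment.
Referring to FIG. 4, in step 210, reducer 20 receives mapping data indicating a type of triangle.
In step 220, reducer 20 enumerates, in a graph consisting of a vertex set and an edge set, the triangles corresponding to the type of triangle. The type of a triangle may be determined by the colors of the vertices constituting it.
In other respects, the CTTP technique described above applies to reducer 20.
FIGS. 5 to 10 are graphs showing the results of performance experiments on a mapper and a reducer according to an embodiment.
Before the experimental graphs are described, the performance of the CTTP technique and embodiments that improve it are described.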
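Assuming that the total number of sub-problems is [formula not reproduced], a number of sub-problems with complexity [formula not reproduced] can be solved in each round. Lemma 1 of Table 5 shows that exactly K/R sub-problems are solved in each round when R is not a multiple of 2 or 3. Lemma 1 of Table 5 also shows that when R is a multiple of 2 or 3, the deviation from K/R is at most [formula not reproduced].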

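Theorem 1 of Table 6 describes the performance of the CTTP technique. The CTTP technique may provide an optimal technique for solving the triangle enumeration problem. For example, the total work required by the CTTP technique may have complexity [formula not reproduced].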

By deriving a lower bound on the number of rounds needed to solve the triangle enumeration problem, the embodiments may provide an optimal solution to that problem. Assuming that each edge or vertex requires at least one memory word, at most m edges or vertices can reside in a reducer of size m at any moment. This assumption can be verified by Theorem 2 of Table 7.

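To give a strong guarantee on the maximum load of each reducer, the vertex coloring in the CTTP technique can be improved. As described above, the degree of a vertex may be assumed to be no greater than [formula not reproduced]. For convenience, the notation used below is asymptotic; the exact bounds behind the asymptotic notation can be derived from the preceding description.
Two coloring techniques may be used to improve the vertex coloring in the CTTP technique: one for high-degree vertices, whose degree falls in the range [formula not reproduced], and one for low-degree vertices, whose degree falls in the range [formula not reproduced]. Low-degree vertices may be colored using a coloring function [formula not reproduced] selected at random from a set of (log E)-wise functions. The colors of high-degree vertices may be computed deterministically with a subgraph enumeration technique in external memory. For example, high-degree vertices may be colored with ρ colors such that the sum of the degrees of the vertices sharing a color is [formula not reproduced]. The degree of a vertex can be computed by sorting the edge set E. When [formula not reproduced], at most [formula not reproduced] high-degree vertices can be computed by a single reducer.
The performance improvement of the vertex coloring in the CTTP technique is described by Theorem 3 of Table 8.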

FIG. 5 shows the effect of the number of rounds on (a) running time (minutes) and (b) the size (GB) of shuffled data per round.
FIG. 6 shows the average running time of each of the reduce, shuffle, and map steps.
FIG. 7 shows, for the CTTP, TTP, and GP algorithms, the effect of the number of edges on (a) running time (minutes) and (b) the size (GB) of shuffled data per round.
FIG. 8 shows the speed-up factor as a function of the number of reducers for the CTTP, TTP, and GP algorithms.
FIG. 9 shows (a) the relative running times of the CTTP, TTP, and GP algorithms and (b) a list of running times (minutes).
FIG. 10 shows the size of the shuffled data for the CTTP, TTP, and GP algorithms.
The experimental results in FIGS. 5 to 10 show that the CTTP technique according to the various embodiments described herein is highly scalable and outperforms the TTP and GP algorithms.
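The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.
Although the embodiments have been described with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are combined in a form different from the described method, or are replaced or substituted by other components or equivalents.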
Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.
Various modifications may be made to the embodiments described below. It is to be understood that the embodiments described below are not intended to limit the embodiments, but include all modifications, equivalents, and alternatives to them.
The terms used in the examples are used only to illustrate specific embodiments and are not intended to limit the embodiments. The singular expressions include plural expressions unless the context clearly dictates otherwise. In this specification, the terms "comprises" or "having" and the like refer to the presence of stated features, integers, steps, operations, elements, components, or combinations thereof, But do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, or combinations thereof.
Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this embodiment belongs. Terms such as those defined in commonly used dictionaries are to be interpreted as having a meaning consistent with the contextual meaning of the related art and are to be interpreted as either ideal or overly formal in the sense of the present application Do not.
In the following description of the present invention with reference to the accompanying drawings, the same components are denoted by the same reference numerals regardless of the reference numerals, and redundant explanations thereof will be omitted. In the following description of the embodiments, a detailed description of related arts will be omitted if it is determined that the gist of the embodiments may be unnecessarily blurred.
FIG. 1 is a diagram for explaining a process of data processing by a mapper and a reducer according to an embodiment.
The method of reducing the number of triangles may be implemented by the mapper 10-1 through 10-5 and the reducers 20-1 through 20-7. The method of reducing the size of triangles can be performed through a number of stages. For example, the plurality of stages may include an r-th stage and an (r + 1) -th stage. In each stage, a certain number of mapper and reducer can be executed. For example, in the rth stage, the mappers 10-1 through 10-3 and the reducers 20-1 through 20-4 can be executed, and in the r + l stage, the mappers 10-4 and 10-5, (20-5 and 20-7) can be executed.
The number of mapper and reducer included in each stage can be variously modified. For example, the number of mapper and the number of reducer may be the same. Also, the connection relationship between one stage and the next stage can be variously modified. The configurations of the mappers 10-1 to 10-5 and the reducers 20-1 to 20-7 will be described in detail later with reference to Figs. 2 to 4. Fig.
The mapping reduction for triangle enumeration can be done through the CTTP (Colored Triangle Type Partition) technique. Before describing the CTTP (Colored Triangle Type Partition) technique according to the embodiments, a calculation model used by embodiments will be described first.
Hereinafter, the embodiments use a calculation model MR (m, M) for map reduction. MapReduce is a distributed programming framework that processes very large data. MapReduce can be easily applied to (i) distributed systems, (ii) fault tolerant, (iii) scalable, (iv) build and maintain clusters. ), But it requires relatively low cost. Embodiments may be implemented using Hadoop.
The calculation model uses parameters m and M. The parameter m represents the maximum memory capacity for performing the map / reduce function, and the parameter M represents the maximum available memory capacity of the entire system. The map / reduce function may refer to at least one of a map function or a reduce function. Hereinafter, the reducer may refer to an application that performs a decreasing function. Also, a mapper may refer to an application that performs a map function. The reducer and mapper can each be a single application. The parameter m may refer to the size of the mapper or the size of the reducer.
An algorithm based on a computational model can be specified as a sequence of rounds. Each round computation can be defined as a map function and a reduction function. The input and output of the map function and the decrement function may be multisets of key-value pairs. A key-value pair is a <k;v>. Where k is the key and v is the value. The map function accepts one pair and can output a multiset of new pairs. The Reduce function can receive pairs with the same key and output a multiset of new pairs. Hereinafter, emit (<k; v >) in the map / v> is output.
The multi-set I _r of the pair can be input in the r-th round (r ≥ 0). In the map step of the r-th round, a map function is applied to each pair in I _r , so that a new multiset W _r can be generated. In the shuffle step of the rth round, pairs having the same key among pairs in W _r can be grouped. In the reduce step of the rth round, each of the groups of pairs having the same key can be processed by a decreasing function. As a result, the multiset O _r can be output as the final result of the rth round. O _r may include the output of a reduction function for each of the groups of pairs having the same key. O _r can be used as the input of the next round (r + 1).
m _{k, r} may refer to the capacity required by the reduction function to process the group defined by key k in round r. K _r may refer to a set of distinct keys in W _r . In this case, the calculation model is k? For each k with K _r and r = 0 it is required that m _{k, r} ≤ m and for each r r = 0? k? It is required that K _r m _{k, r} ≤ M. Similar constraints may be required for map functions as well.
The complexity of the algorithm based on the computational model may be the number of rounds R required in the worst case. Embodiments can provide a technique for minimizing the number R of rounds for a given m and M. Hereinafter, the total work of the algorithm based on the computational model can be defined as the sum of the work required by the mapper and the reducers.
Hereinafter, for convenience of explanation, graph G is assumed. The graph G may be a non-directional graph. Graph G does not include a self loop and does not include a parallel edge. The graph G includes a vertex set V and an edge set E. The triangle enumeration problem is a matter of enumerating the triangles contained in graph G. Embodiments may be implemented such that, in each triangle (u, v, w), the local function enum () is called. The input parameters of the enum () function can be three vertices of the corresponding triangle.
According to embodiments, the triangle enumeration problem can be solved through a Map Reduce-based algorithm. Hereinafter, solving the triangle enumeration problem or sub problem can be understood as performing the operations to solve the problem or the sub problem.
Referring to Table 1, for convenience in the following notation, a symbol designating a particular set may be used to refer to the size of the set. For example, E may refer to the size of the edge set.

The vertices of vertex set V may be ordered by degree. The vertices of the same order can be arbitrarily ordered. The degree of a vertex may be the number of edges connected to the vertex. Each edge {u, v} requires a single memory word and can be expressed initially as Ψ <(u, v)>. Where? Is a dummy key, (u, v) is a value, and u < v. In the triangles v1, v2 and v3 with v1 <v2 <v3, (v2, v3) is called the pivot edge and v1 can be called the cone vertex. For any integer n, [n] may refer to a set of {0, ..., n-1}.
Examples include

A vertex of a very large order such as a vertex of a larger order can be excluded. More specifically, very large orders of vertices may be enumerated using sorting. For example, for each very large order vertex v, the triangles containing the vertex v can be properly retrieved by aligning the edge sets three times.

If a larger size reducer is used, the sorting algorithm may require only the complexity of O (1) in each round. Since the aggregate space for searching triangles with very large order vertices is E, very large vertices of M / E can be processed in parallel.
Thus, if the size of the reducer is m, the size of the mapper is constant, the maximum available memory capacity in the overall system is M, and the total is O (E3 / 2), then the triangles with at least one very large vertex

Lt; RTI ID = 0.0 > complexity. &Lt; / RTI > Compared with the complexity of the CTTP technique described below, the complexity of the entire work to eliminate the very large vertices described above can be asymptotically negligible.
Hereinafter, a CTTP (Colored Triangle Type Partition) technique using a mapper and a reducer will be described in detail with reference to FIGS. 2 to 4.
FIG. 2 is a block diagram for explaining a configuration of a mapper and a reducer according to an embodiment.
Referring to FIG. 2, the mapper 10 includes a reducing data receiving unit 11 and a mapping data generation unit 13. Further, the reducer 20 includes a mapping data receiving unit 21 and a triangle enumeration unit 22. The mapper 10, the reducer 20, and each of the components included in them may be implemented with at least one hardware module or software module.
The CTTP technique is a MapReduce-based algorithm for solving the triangle enumeration problem. For example, the CTTP technique may be a multi-round randomized MapReduce algorithm for enumerating all triangles in an enormous graph.
The CTTP technique may consider a trade-off between the number of rounds R, the memory capacity m required by each mapper and reducer, and the total memory capacity M available in the overall system. For an arbitrary input graph, the CTTP technique has a worst-case complexity of [formula omitted]. The CTTP technique requires a memory capacity of M/E for each mapper and a memory capacity of m for each reducer. The CTTP technique can solve the problem of the total work being distributed unevenly among the reducers.
The reducing data receiving unit 11 receives reducing data. The reducing data may include a dummy key and an edge value. The edge value corresponds to the value of a key-value pair and consists of the vertices of an edge.
Each vertex can have a color. For example, before the triangle enumeration problem is solved using the mapper 10 and the reducer 20, the vertices in the graph may be colored. More specifically, the CTTP technique is based on a vertex partitioning technique. Unlike ordinary vertex partitioning techniques, the CTTP technique partitions the vertices according to a coloring function selected at random from a k-wise independent family of functions. Here, k may be an integer of 2 or more, and for convenience of explanation an embodiment with k = 4 will be described. In this case, the family is referred to as a 4-wise independent family of functions.
A 4-wise independent family of functions can be a family of functions used for enumerating triangles in external memory. By requiring only 4-wise independence of the coloring function, the CTTP technique overcomes the limitation of existing techniques that assume a random input graph, and its guarantees apply to any input graph.
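One standard way to obtain a 4-wise independent family is a random polynomial of degree 3 over a prime field. The sketch below uses that construction as an assumption; the specification does not state which family is used. The names `PRIME` and `make_coloring` are hypothetical, vertex identifiers are assumed to be integers in the range [0, PRIME), and the final reduction modulo the number of colors makes the colors only approximately uniform.

```python
import random

# A Mersenne prime, assumed to be larger than every vertex identifier.
PRIME = (1 << 61) - 1


def make_coloring(num_colors, rng=random):
    """Draw a coloring function from a degree-3 polynomial family over GF(PRIME)."""
    coeffs = [rng.randrange(PRIME) for _ in range(4)]

    def xi(vertex):
        value = 0
        for c in coeffs:  # Horner evaluation of the degree-3 polynomial
            value = (value * vertex + c) % PRIME
        return value % num_colors  # map the field element to a color in [num_colors]

    return xi


xi = make_coloring(5, random.Random(7))
colors = [xi(v) for v in range(10)]
```

Because the function is fully determined by four coefficients, every mapper can evaluate the same coloring locally once the coefficients have been shared.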
The vertices are colored by a coloring function ξ that assigns each vertex one of ρ colors, that is, a color in [ρ] [formula for ρ omitted]. The function ξ can be selected uniformly at random from a 4-wise independent family of functions. E_{i,j} denotes the edge set {(u, v) ∈ E | i = min{ξ(u), ξ(v)} and j = max{ξ(u), ξ(v)}}. Here, i ≤ j and i, j ∈ [ρ], and ξ(x) is the color of the vertex x.
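The edge sets E_{i,j} defined above can be built in a single pass over the edges. In the sketch below, `partition_edges` is a hypothetical name and the coloring `x % 3` is only a stand-in for a randomly selected coloring function.

```python
from collections import defaultdict


def partition_edges(edges, xi):
    """Group edges into the sets E_{i,j}, keyed by (min color, max color)."""
    parts = defaultdict(list)
    for u, v in edges:
        i, j = sorted((xi(u), xi(v)))
        parts[(i, j)].append((u, v))
    return dict(parts)


def xi(x):
    # Stand-in coloring with three colors, used only for this example.
    return x % 3


edges = [(0, 1), (1, 2), (0, 3), (2, 5), (4, 5)]
parts = partition_edges(edges, xi)
```

Edges whose endpoints share a color land in a set of the form E_{i,i}, for example the edge (0, 3) in E_{0,0}.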
The mapping data generation unit 13 generates mapping data indicating the type of the triangle based on the determined color.
A triangle (u, v, w) is classified as type-3 if all three vertices have different colors, as type-2 if two vertices have the same color and the other vertex has a different color, and as type-1 if all three vertices have the same color.
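Under this classification, the type of a triangle equals the number of distinct colors among its vertices. A minimal sketch follows, with `triangle_type` as a hypothetical name and `x % 3` as a stand-in coloring.

```python
def triangle_type(u, v, w, xi):
    """Return 1, 2, or 3: the number of distinct colors among the three vertices."""
    return len({xi(u), xi(v), xi(w)})


def xi(x):
    # Stand-in coloring with three colors, used only for this example.
    return x % 3


types = [triangle_type(0, 3, 6, xi), triangle_type(0, 3, 1, xi), triangle_type(0, 1, 2, xi)]
```

The three example triangles have one, two, and three distinct colors respectively.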
The CTTP technique can decompose the triangle enumeration problem into [formula omitted] sub-problems. Each sub-problem is one of the two types in Table 2.

The CTTP technique distributes the K sub-problems evenly over the rounds [formula omitted], and each sub-problem can be solved using a single reducer. In the r-th round, the CTTP technique can operate as in Algorithm 1 of Table 3. The input of each round may be a key-value pair <Ψ; (u, v)>.
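For colors 0 ≤ i < j < k < ρ for which (i + j + k) satisfies the condition for the current round [right-hand side omitted], the (i, j, k)-sub-problem can be solved using the reducer associated with the key (i, j, k). For colors 0 ≤ i < j < ρ that satisfy the corresponding condition for the current round, the (i, j)-sub-problem can be solved using the reducer associated with the key (i, j, -1).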
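The assignment of sub-problems to rounds can be sketched as follows. The exact round condition is not legible in this text, so the sketch assumes, purely for illustration, that a sub-problem is solved in the round given by the sum of its colors modulo the number of rounds R. The name `subproblem_keys` is hypothetical.

```python
from itertools import combinations


def subproblem_keys(num_colors, num_rounds, r):
    """List the reducer keys handled in round r under the assumed mod-R rule."""
    # (i, j, k)-sub-problems: three distinct colors, i < j < k.
    keys = [
        (i, j, k)
        for i, j, k in combinations(range(num_colors), 3)
        if (i + j + k) % num_rounds == r
    ]
    # (i, j)-sub-problems: two distinct colors, with -1 standing for null.
    keys += [
        (i, j, -1)
        for i, j in combinations(range(num_colors), 2)
        if (i + j) % num_rounds == r
    ]
    return keys


round0 = subproblem_keys(4, 2, 0)
all_keys = [key for r in range(2) for key in subproblem_keys(4, 2, r)]
```

With four colors and two rounds, the ten sub-problems are split so that each appears in exactly one round.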

The mapper can forward each edge to the appropriate reducers. In other words, the mapping data may be sent to a reducer determined based on the type of the triangle.
For each input pair <Ψ; (u, v)>, the mapper can send messages as shown in Table 4. Here, i = min{ξ(u), ξ(v)} and j = max{ξ(u), ξ(v)}.
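Since Table 4 is not reproduced in this text, the following is only a sketch of what the mapper's messages might look like. It assumes the illustrative rule that a sub-problem belongs to the round given by the sum of its colors modulo R, and it assumes that an edge with two different colors is also sent to the (i, j)-sub-problem so that type-2 triangles can be found there. The names `NULL` and `map_edge` are hypothetical.

```python
NULL = -1  # stands for the null third element of a key


def map_edge(edge, xi, num_colors, num_rounds, r):
    """Emit (key, edge) pairs for one input edge in round r (assumed mod-R rule)."""
    u, v = edge
    i, j = sorted((xi(u), xi(v)))
    pairs = []
    if i == j:
        # Same colors: the edge can only belong to type-1 or type-2 triangles.
        for k in range(num_colors):
            if k != i and (i + k) % num_rounds == r:
                pairs.append(((min(i, k), max(i, k), NULL), edge))
    else:
        # Different colors: type-2 triangles via the (i, j)-sub-problem ...
        if (i + j) % num_rounds == r:
            pairs.append(((i, j, NULL), edge))
        # ... and type-3 triangles via every third color k of this round.
        for k in range(num_colors):
            if k not in (i, j) and (i + j + k) % num_rounds == r:
                pairs.append((tuple(sorted((i, j, k))), edge))
    return pairs


def xi(x):
    # Stand-in coloring with four colors, used only for this example.
    return x % 4


same_color = map_edge((0, 4), xi, 4, 2, 0)
two_colors = map_edge((1, 3), xi, 4, 2, 0)
```

The third element of each key is either a third color chosen from the current round or the null marker, which matches the key shapes described for the mapping data.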

The mapping data receiving unit 21 receives the mapping data indicating the type of the triangle from the mapper.
The triangle enumeration unit 22 enumerates the triangles corresponding to the type of the triangle in the graph G consisting of the vertex set and the edge set. Here, the type of the triangle is determined by the colors of the vertices constituting the triangle.
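The enumeration performed for one key can be sketched as below. This is an in-memory illustration under stated assumptions: `reduce_triangles` is a hypothetical name, -1 stands for the null third element, and the rule that prevents the same type-1 triangle from being reported by several sub-problems is not specified here and is therefore not modeled.

```python
def reduce_triangles(key, edges, xi):
    """Enumerate the triangles among `edges` that match the sub-problem `key`."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    i, j, k = key
    wanted = {i, j} if k == -1 else {i, j, k}
    found = set()
    for u, v in edges:
        # Every common neighbor w of u and v closes the triangle (u, v, w).
        for w in adjacency[u] & adjacency[v]:
            colors = {xi(u), xi(v), xi(w)}
            if k == -1:
                matches = colors <= wanted  # type-1 or type-2 within colors {i, j}
            else:
                matches = colors == wanted  # type-3 with colors exactly {i, j, k}
            if matches:
                found.add(tuple(sorted((u, v, w))))
    return sorted(found)


def xi(x):
    # Stand-in coloring with three colors, used only for this example.
    return x % 3


edges = [(0, 1), (1, 2), (0, 2), (0, 3), (3, 6), (0, 6), (1, 3)]
type3 = reduce_triangles((0, 1, 2), edges, xi)
type12 = reduce_triangles((0, 1, -1), edges, xi)
```

In practice a reducer would receive only the edges forwarded for its key; the full edge list is passed here to keep the example short.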
If the sub-problems are distributed over the rounds according to the embodiments, each mapper can be guaranteed to emit the same number of pairs. For example, each mapper can be guaranteed to output [formula omitted] pairs in each round. Therefore, the map step can be evenly distributed over the available processing units. In one example, the map step may be distributed evenly by scheduling in real time, which also avoids delays caused by some slower mappers.
FIG. 3 is a flowchart for explaining the operation of the mapper according to one embodiment.
Referring to FIG. 3, in step 110, the mapper 10 receives reducing data including an edge value.
In step 130, the mapper 10 compares the first color with the second color. Here, the first color is the color of a first vertex constituting the edge value, and the second color is the color of a second vertex constituting the edge value. For example, in Algorithm 1 described above, the first color may correspond to i and the second color may correspond to j. As a result of the comparison in step 130, the mapper 10 may perform step 141 if the first color and the second color are different, and may perform step 142 if the first color and the second color are the same.
At step 141, the mapper 10 generates mapping data indicating a third type. The third type represents the type of triangle in which the three vertices have different colors.
At step 142, the mapper 10 generates mapping data representing the second type or mapping data representing the first type. The second type indicates a type of a triangle in which the colors of the two vertices are the same and the color of the other vertex is different, and the first type indicates a type of a triangle having all three vertices having the same color.
In addition, the CTTP technique described above can be applied to the mapper 10.
FIG. 4 is a flowchart illustrating an operation of the reducer according to an embodiment.
Referring to FIG. 4, in step 210, the reducer 20 receives mapping data indicating the type of the triangle.
In step 220, the reducer 20 enumerates the triangles corresponding to the type of the triangle in the graph consisting of the vertex set and the edge set. Here, the type of the triangle can be determined by the colors of the vertices constituting the triangle.
In addition, for the reducer 20, the CTTP technique described above can be applied.
FIGS. 5 to 10 are graphs showing performance test results of a mapper and a reducer according to an embodiment.
Before the graphs of the experimental results are described, the performance of the CTTP technique and embodiments that improve its performance will be described.
The total number of sub-problems is [formula omitted]. In each round, a number of sub-problems having a complexity of [formula omitted] can be solved. Lemma 1 of Table 5 shows that if R is not a multiple of 2 or 3, exactly K/R sub-problems are solved in each round. Lemma 1 of Table 5 also shows that when R is a multiple of 2 or 3, the deviation from K/R is at most [formula omitted].

Theorem 1 of Table 6 describes the performance of the CTTP technique. The CTTP technique can provide an optimal way to solve the triangle enumeration problem. For example, the total work required by the CTTP technique is [formula omitted].

Embodiments can provide an optimal solution to the triangle enumeration problem by deriving a lower bound on the number of rounds needed to solve it. Assuming that each edge or each vertex requires at least one memory word, a reducer of size m can hold at most m edges or vertices at any instant. This can be verified by Theorem 2 of Table 7.

To give a strong guarantee on the maximum load of each reducer, the vertex coloring in the CTTP technique can be improved. As described above, the degree of each vertex can be assumed to be no greater than [formula omitted]. For ease of explanation, asymptotic notation is used below. The exact bounds behind the asymptotic notation can be derived from the description above.
To improve the vertex coloring in the CTTP technique, two coloring techniques can be used. For example, one coloring technique can be used for high degree vertices, whose degree falls in the range [formula omitted], and another for low degree vertices, whose degree falls in the range [formula omitted]. The low degree vertices are colored by a coloring function [formula omitted] selected at random from a family of (log E)-wise independent functions. The colors of the high degree vertices can be computed deterministically through a subgraph enumeration technique in external memory. For example, the high degree vertices are colored such that the sum of the degrees of the vertices having the same color is [formula omitted]. The degrees of the vertices can be computed by sorting the edge set E.

If [condition omitted], the colors of the high degree vertices can be computed by a single reducer.
The performance improvement of the vertex coloring in the CTTP technique is described by Theorem 3 of Table 8.
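The two ingredients named above, computing degrees by sorting the edge set and coloring the high degree vertices so that the degree sums per color stay balanced, can be sketched as follows. The greedy rule used here (largest degree first, always into the color with the smallest current degree sum) is a substitute chosen for illustration and is not the deterministic procedure of the specification. The names `degrees_by_sorting` and `balance_colors`, and the degree threshold of 3 in the example, are hypothetical.

```python
import heapq
from itertools import groupby


def degrees_by_sorting(edges):
    """Compute vertex degrees by sorting all edge endpoints and counting runs."""
    endpoints = sorted(x for edge in edges for x in edge)
    return {v: len(list(group)) for v, group in groupby(endpoints)}


def balance_colors(high_degree, num_colors):
    """Greedily color vertices so that per-color degree sums stay close."""
    heap = [(0, color) for color in range(num_colors)]  # (degree sum, color)
    heapq.heapify(heap)
    coloring = {}
    by_degree = sorted(high_degree.items(), key=lambda item: (-item[1], item[0]))
    for vertex, degree in by_degree:
        total, color = heapq.heappop(heap)  # color with the smallest sum so far
        coloring[vertex] = color
        heapq.heappush(heap, (total + degree, color))
    return coloring


edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4)]
degree = degrees_by_sorting(edges)
high = {v: d for v, d in degree.items() if d >= 3}  # hypothetical threshold
coloring = balance_colors(high, 2)
```

In this example the two colors receive degree sums of 7 and 6, so neither color concentrates the high degree vertices.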

Referring to FIG. 5, the effect of the number of rounds on (a) the running time (minutes) and (b) the size (GB) of the shuffled data per round is shown.
Also, referring to FIG. 6, the average running time of each of the reduce step, the shuffle step, and the map step is shown.
Referring to FIG. 7, the effect of the number of edges on (a) the running time (minutes) and (b) the size (GB) of the shuffled data per round is shown for each of the CTTP, TTP, and GP algorithms.
Also, referring to FIG. 8, the speed up factors according to the number of reducers in the CTTP, TTP and GP algorithms are shown.
Referring to FIG. 9, (a) shows the relative running time of the CTTP, TTP, and GP algorithms, and (b) shows a list of running times (minutes).
Referring to FIG. 10, the sizes of shuffled data of CTTP, TTP, and GP algorithms are shown.
It can be seen from the experimental result graphs shown in FIGS. 5 to 10 that the CTTP technique according to the various embodiments described herein has high scalability, and is superior to the TTP and GP algorithms.
The method according to an embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments or may be known and available to those skilled in the art of computer software. Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. For example, appropriate results may be achieved even if the described techniques are performed in a different order than the described methods, and/or if components of the described systems, structures, devices, circuits, and the like are combined in a different form than described, or are replaced or substituted by other components or equivalents.
Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the following claims.

Claims

A mapping method for triangle enumeration, the method comprising:
receiving reducing data including an edge value; and
generating mapping data indicating a type of a triangle based on a color of each of the vertices constituting the edge value,
wherein the mapping data is transmitted to a reducer determined based on the type of the triangle,
wherein the color of each of the vertices is determined by assigning, to the vertices in a graph, a color selected uniformly at random from a predetermined number of colors, and
wherein the predetermined number of colors is determined based on a first element associated with the number of edges in the graph and a second element associated with a memory capacity of the reducer.

delete

The mapping method of claim 1,
wherein the color of each of the vertices is determined by assigning, to the vertices in the graph, a color selected from the predetermined number of colors using a function selected uniformly at random from a plurality of coloring functions.

The mapping method of claim 1,
wherein the color comprises a first color of a first vertex constituting the edge value and a second color of a second vertex constituting the edge value, and
wherein the generating of the mapping data comprises generating, when the first color and the second color are the same, mapping data indicating a second type in which two vertices have the same color and the other vertex has a different color.

The mapping method of claim 4,
wherein the mapping data comprises a first element corresponding to the first color or the second color, a second element corresponding to a third color determined based on a current round, and a third element corresponding to null.

The mapping method of claim 4,
wherein the generating of the mapping data comprises generating, when the first color and the second color are different, mapping data indicating a third type in which all three vertices have different colors.

The mapping method of claim 6,
wherein the mapping data comprises a first element corresponding to the first color, a second element corresponding to the second color, and a third element corresponding to a third color determined based on a current round.

A reducing method for triangle enumeration, the method comprising:
receiving mapping data indicating a type of a triangle; and
enumerating, in a graph consisting of a vertex set and an edge set, triangles corresponding to the type of the triangle,
wherein the type of the triangle is determined by the colors of the vertices constituting the triangle, and
wherein the mapping data comprises a first element corresponding to a first color, a second element corresponding to a second color, and a third element corresponding to a third color or null.

delete

The reducing method of claim 8,
wherein the enumerating comprises, when the third element corresponds to the null:
enumerating triangles of a first type, in which all three vertices have the same color, using the first color or the second color; and
enumerating triangles of a second type, in which two vertices have the same color and the other vertex has a different color, using the first color and the second color.

The reducing method of claim 8,
wherein the enumerating comprises, when the third element corresponds to the third color, enumerating triangles of a third type, in which all three vertices have different colors, using the first color, the second color, and the third color.

The reducing method of claim 8,
wherein the enumerating comprises enumerating the triangles from an edge set determined by the mapping data.

The reducing method of claim 8,
wherein the color of each of the vertices is determined by assigning, to each of the vertices, a color selected uniformly at random from a predetermined number of colors.

A computer program stored in a medium, in combination with hardware, for executing the method of any one of claims 1, 3, 4, 5, 6, 7, 8, and 13.

A mapper for triangle enumeration, the mapper comprising:
a reducing data receiving unit configured to receive reducing data including an edge value; and
a mapping data generation unit configured to generate mapping data indicating a type of a triangle based on a color of each of the vertices constituting the edge value,
wherein the mapping data is transmitted to a reducer determined based on the type of the triangle,
wherein the color of each of the vertices is determined by assigning, to the vertices in a graph, a color selected uniformly at random from a predetermined number of colors, and
wherein the predetermined number of colors is determined based on a first element associated with the number of edges in the graph and a second element associated with a memory capacity of the reducer.

The mapper of claim 15,
wherein the color comprises a first color of a first vertex constituting the edge value and a second color of a second vertex constituting the edge value, and
wherein the mapping data generation unit is configured to:
generate, when the first color and the second color are the same, mapping data indicating a second type in which two vertices have the same color and the other vertex has a different color; and
generate, when the first color and the second color are different, mapping data indicating a third type in which all three vertices have different colors.

A reducer for triangle enumeration, the reducer comprising:
a mapping data receiving unit configured to receive mapping data indicating a type of a triangle; and
a triangle enumeration unit configured to enumerate, in a graph consisting of a vertex set and an edge set, triangles corresponding to the type of the triangle,
wherein the type of the triangle is determined by the colors of the vertices constituting the triangle, and
wherein the mapping data comprises a first element corresponding to a first color, a second element corresponding to a second color, and a third element corresponding to a third color or null.