KR102412843B1

KR102412843B1 - Device and method for extracting sample graph from original graph having properties of original graph

Info

Publication number: KR102412843B1
Application number: KR1020190156716A
Authority: KR
Inventors: 김수현
Original assignee: 한국과학기술연구원
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2022-06-27
Also published as: KR20210067278A

Abstract

Embodiments relate to a graph sampling apparatus configured to perform, in order to obtain a sample graph having the properties of an original graph: extracting from the original graph a subgraph having more edges than the total number of edges of the sample graph; selecting an edge to be removed from the subgraph based on one or more properties of the original graph and the subgraph; and obtaining the sample graph by removing the selected edge.

Description

Graph sampling apparatus and method for extracting a sample graph having properties of the original graph

Embodiments of the present invention relate to a technique for generating a graph including nodes and edges, and more particularly to a graph sampling apparatus and method that extract a part of an original graph as a subgraph and transform the extracted subgraph in order to obtain a sample graph that is reduced in size while retaining the properties of the large-scale original graph.

A network is a set of entities interconnected by links, and it takes various forms depending on the technical field. For example, in the social field a network may represent human relationships with friendships as links. In the IT field it may represent interconnections between computers, or web pages that point to one another.

With advances in big data and neuroscience, interest in social, information, biological, and technological networks is growing. In particular, the growth of social network services has sharply increased demand for analyzing and modeling social networks.

A graph is an efficient means of analyzing and describing such networks. A graph consists of nodes and edges connecting the nodes, which makes it both simple and general.

Analyzing a graph that represents an online social network requires storing the entire graph in computer memory. However, because an online social network graph contains millions of nodes and edges, storing it is sometimes impossible. Even when storage is possible, computing even a few properties of the graph takes a long time. In other words, analyzing the online social network graph itself is practically infeasible.

For this reason, graph sampling, which extracts a reduced-scale representative subgraph from a large original graph, has emerged in recent years as a solution.

Conventional graph sampling solutions preserve the structure and properties of the original network while reducing the graph to 10% or more of the original size (typically in the range of 20 to 30%). However, when the graph is reduced further, the structure of the sampled graph degrades. Because of advances in social networking, online social network graphs often contain billions of nodes, so even a sampled graph reduced to 10% of the original contains tens of millions of nodes and still runs into memory capacity limits.

Hardiman and Katzir (2013) Estimating Clustering Coefficient and Size of Social Networks via Random Walk. In ACM's WWW
Leskovec and Faloutsos (2006) Sampling from Large Graphs. In SIGKDD, pp. 631-636
Ahmed et al. (2011) Network Sampling via Edge-based Node Selection with Graph Induction. Technical Report 11-016, Purdue Digital Library

Embodiments of the present invention aim to provide a graph sampling apparatus and method that extract a part of an original graph and transform the extracted subgraph in order to obtain a sample graph that is reduced in size while retaining the properties of the large-scale original graph.

According to one aspect of the present invention, a graph sampling apparatus for obtaining a sample graph having the properties of an original graph may be configured to perform: extracting from the original graph a subgraph having more edges than the total number of edges of the sample graph; selecting an edge to be removed from the subgraph based on one or more properties of the original graph and the subgraph; and obtaining the sample graph by removing the selected edge.

In an embodiment, the properties of the original graph may include one or more of the degree and the clustering coefficient of the original graph.

In an embodiment, extracting the subgraph from the original graph may include: searching the original graph for nodes to be extracted into the subgraph according to a sampling fraction; and inducing the subgraph by connecting the found nodes with edges.

In an embodiment, searching for the nodes to be extracted may include, among the branches connected to the current node at which the search starts on the original graph, searching the nodes in a specific branch direction and then searching nodes in the next branch direction. Here, the node in the specific branch direction is the node with the highest degree among the nodes connected to the current node.

In an embodiment, selecting the edge to be removed may include: calculating an edge weight for each edge included in the subgraph; determining, based on the edge weights, a first group and a second group that differ in their tendency to decrease the clustering coefficient; and selecting the edge to be removed from the first group or the second group based on the clustering coefficient of the subgraph and the clustering coefficient of the sample graph.

In an embodiment, the first group may include edges whose removal decreases the clustering coefficient of the subgraph more than removing an edge of the second group does.

In an embodiment, removing a second edge whose edge weight is greater than that of a first edge decreases the clustering coefficient of the subgraph by more than removing the first edge does.

In an embodiment, the edge weight of an edge connecting first and second nodes that form a closed triplet in the subgraph is calculated by the following equation.

[Equation]

Here, k denotes the total number of nodes in the subgraph.

In an embodiment, determining the state of decrease of the clustering coefficient of the subgraph may include: calculating a real-time clustering coefficient and a predicted clustering coefficient of the subgraph; when comparison of the real-time and predicted values shows that the real-time clustering coefficient of the subgraph is greater than the predicted clustering coefficient and the clustering coefficient of the subgraph is greater than that of the original graph, selecting for removal an edge whose edge weight yields a larger decrease in the clustering coefficient; and when the comparison shows that the real-time clustering coefficient of the subgraph is less than the predicted clustering coefficient, or the clustering coefficient of the subgraph is less than that of the original graph, selecting for removal an edge whose edge weight yields a smaller decrease in the clustering coefficient.
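The comparison rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `pick_group` and `pick_edge` are hypothetical, the predicted clustering coefficient is taken as an input because its equation is not reproduced here, equality cases (which the text leaves open) fall into the "small" branch, and choosing the single largest or smallest weight stands in for whatever grouping of edge weights the embodiment actually uses.

```python
def pick_group(cc_now, cc_pred, cc_org):
    """Decide which kind of edge to remove next.

    cc_now  : real-time clustering coefficient of the subgraph
    cc_pred : predicted clustering coefficient (supplied by the caller)
    cc_org  : clustering coefficient of the original graph
    """
    # Ahead of both the prediction and the target: remove an edge that
    # lowers the clustering coefficient by a larger amount.
    if cc_now > cc_pred and cc_now > cc_org:
        return "large"
    # Otherwise (below either value, or equal: an assumption of this sketch)
    # remove an edge that lowers it by a smaller amount.
    return "small"


def pick_edge(weights, cc_now, cc_pred, cc_org):
    """Pick one edge from a dict {edge: edge_weight} (simplified selection)."""
    chooser = max if pick_group(cc_now, cc_pred, cc_org) == "large" else min
    # Ties are broken by the edge identifier so the result is deterministic.
    return chooser(weights, key=lambda e: (weights[e], e))


weights = {("a", "b"): 0.3, ("b", "c"): 0.1, ("a", "c"): 0.2}
heavy = pick_edge(weights, cc_now=0.5, cc_pred=0.4, cc_org=0.3)
light = pick_edge(weights, cc_now=0.35, cc_pred=0.4, cc_org=0.3)
```

In this toy input the first call selects the heaviest edge and the second the lightest, mirroring the two branches of the rule.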

In an embodiment, the predicted clustering coefficient of the next subgraph is calculated by the following equation,

[Equation]

where e_del denotes the number of edges already removed, CC_org denotes the clustering coefficient of the original graph, and slope is given by the following equation,

[Equation]

where e_extra denotes the total number of edges to be removed.

In an embodiment, a single edge is removed in the removing step, and the graph sampling apparatus may be further configured to repeat the removing step based on the difference between the total number of edges of the initial subgraph and the total number of edges of the previous sample graph.
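The one-edge-per-iteration loop can be sketched as below. This is a hedged illustration: it assumes the number of repetitions equals the edge count of the initial subgraph minus the target edge count (the quantity called e_extra in this description), and both `prune_to_target` and the `choose_edge` callback are hypothetical names standing in for the weight-based selection described elsewhere.

```python
def prune_to_target(edges, target_edge_count, choose_edge):
    """Remove one edge per iteration until the target edge count is reached.

    edges             : iterable of hashable edges, e.g. (u, w) tuples
    target_edge_count : number of edges the final sample graph should have
    choose_edge       : callback that picks the next edge to remove from
                        the set of remaining edges (placeholder policy)
    """
    remaining = set(edges)
    e_extra = max(0, len(remaining) - target_edge_count)  # edges to remove
    for _ in range(e_extra):
        remaining.remove(choose_edge(remaining))  # exactly one edge per step
    return remaining


initial_edges = {(1, 2), (1, 3), (2, 3), (3, 4)}
# `min` is only a stand-in policy so that the example is deterministic.
sample_edges = prune_to_target(initial_edges, target_edge_count=2, choose_edge=min)
```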

In an embodiment, the repeating may include: calculating local clustering coefficients of the end nodes of the removed edge and of the common friend nodes of the end nodes; calculating the clustering coefficient of the subgraph from which the selected edge has been removed, based on the local clustering coefficients and the clustering coefficient before the edge was removed; and updating the clustering coefficient of the subgraph to the clustering coefficient of the subgraph from which the edge has been removed.
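One way to read this update step is that only the two end nodes and their common neighbors can change their local clustering coefficient when an edge is removed, so the average can be patched without a full recomputation. The sketch below illustrates that reading; it assumes the graph-level value is the average of local clustering coefficients, that nodes with fewer than two neighbors count as 0, and that the graph is stored as a dict of neighbor sets. The function names are hypothetical.

```python
def local_cc(adj, v):
    """Local clustering coefficient of v in an adjacency dict of sets."""
    d = len(adj[v])
    if d < 2:
        return 0.0  # convention assumed in this sketch
    nbrs = adj[v]
    # Each edge among the neighbors is seen from both of its ends.
    links = sum(1 for a in nbrs for b in adj[a] if b in nbrs) // 2
    return 2.0 * links / (d * (d - 1))


def average_cc(adj):
    """Average clustering coefficient, recomputed from scratch."""
    return sum(local_cc(adj, v) for v in adj) / len(adj)


def remove_edge_and_update(adj, u, w, avg_cc):
    """Remove edge (u, w) in place and return the updated average."""
    affected = {u, w} | (adj[u] & adj[w])  # end nodes and common neighbors
    before = sum(local_cc(adj, v) for v in affected)
    adj[u].discard(w)
    adj[w].discard(u)
    after = sum(local_cc(adj, v) for v in affected)
    return avg_cc + (after - before) / len(adj)


# Triangle 1-2-3 with a pendant node 4 attached to node 3.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
cc_before = average_cc(graph)
cc_after = remove_edge_and_update(graph, 1, 2, cc_before)
```

After the removal, the incrementally updated value can be checked against a full recomputation with `average_cc(graph)`.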

A graph sampling apparatus according to an embodiment of the present invention can generate a sample graph having the properties of the original graph.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the claims.

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings are intended only to illustrate the embodiments of this specification, not to limit them. For clarity of description, some elements in the drawings may be shown with modifications such as exaggeration or omission.
FIG. 1 is a flowchart of a graph generation method according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating edge weights according to an embodiment of the present invention.
FIG. 3 is a conceptual diagram illustrating an edge removal slope according to an embodiment of the present invention.
FIG. 4 is exemplary code for the operation of a graph sampling apparatus according to an embodiment of the present invention.
FIGS. 5 to 7 are diagrams illustrating the performance of subgraphs obtained as sampling results according to an experimental example of the present invention.

The terminology used herein is for describing particular embodiments only and is not intended to limit the present invention. Singular forms used herein include plural forms unless the wording clearly indicates otherwise. As used in the specification, "comprising" specifies a particular characteristic, region, integer, step, operation, element, and/or component, and does not exclude the presence or addition of other characteristics, regions, integers, steps, operations, elements, and/or components.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are further interpreted as having meanings consistent with the related technical literature and the present disclosure, and are not interpreted in an idealized or overly formal sense unless so defined.

In this specification, having the properties of the original graph means having a value that is the same as or similar to that of the actual network with respect to a particular property. Here, a similar value is one that lies within a predetermined error range, or that is closer to the original graph than the analysis results of conventional approaches.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

In this specification, the large-scale graph before sampling is G=(V, E), where V is the set of nodes in the graph, V={v₁, v₂, v₃, ..., v_n} (n is an integer of 1 or more), and E is the set of edges (or links) in the graph, E={e₁, e₂, e₃, ..., e_m} (m is an integer of 0 or more). The sampled graph (hereinafter, "sample graph") is G_s=(V_s, E_s), a subgraph of the original graph in which V_s is contained in V and E_s is contained in E.

A graph sampling method according to an embodiment of the present invention obtains a reduced-scale subgraph by extracting (or generating), from a large-scale original graph, a sample graph whose sampling fraction (=|V_s|/|V|) is φ.
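As a small worked illustration of the sampling fraction, the target node count |V_s| follows directly from φ and |V|. The helper name `target_node_count` is hypothetical, and rounding down while keeping at least one node is an assumption of this sketch rather than something the description specifies.

```python
import math


def target_node_count(num_original_nodes, phi):
    """Number of nodes |V_s| implied by the sampling fraction phi = |V_s| / |V|."""
    if not 0.0 < phi <= 1.0:
        raise ValueError("phi must be in (0, 1]")
    # Assumption: round down, but never return an empty sample.
    return max(1, math.floor(phi * num_original_nodes))


quarter = target_node_count(1000, 0.25)
tiny = target_node_count(3, 0.1)
```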

FIG. 1 is a flowchart of a graph generation method according to an embodiment of the present invention.

Referring to FIG. 1, a subgraph of the original graph is first extracted as an initial subgraph (S10).

In an embodiment, before the initial subgraph is extracted, the sampling elements that the final sample graph targeted by the user should have are obtained (S0). The sampling elements serve as the guideline for the sampling process. That is, performing sampling under the guideline means generating a sample graph that has the sampling elements. Here, having a sampling element means having a value equal to the element's value or within an allowable error range of it.

In an embodiment, the sampling elements include the total number of nodes and/or the total number of edges of the sample graph.

Because the sample graph is a subgraph of the actual original graph, its totals are much smaller than the total numbers of edges and nodes of the original graph. In an embodiment, the total number of nodes of the sample graph may be 10% or less of the original graph (i.e., sampling fraction φ = 0.1), or 5% or less (i.e., φ = 0.05). The sampling performance is higher the fewer nodes are needed to retain the same properties (e.g., the clustering coefficient) as the original graph.

The sampling elements may further include elements related to the properties of the original graph.

In an embodiment, the sampling elements include one or more of the degree and the clustering coefficient of the original graph. In some embodiments, the degree or clustering coefficient used as the guideline may be the average degree or the average clustering coefficient.

Some of the sampling elements may be obtained directly or indirectly through user input or electronic communication. In one example, the clustering coefficient, total number of edges, and total number of nodes that the sample graph should have may be obtained directly through user input.

In another example, at least the clustering coefficient and/or degree of the sample graph may be calculated by an estimation method based on a part of the original graph. In step S0, the property values of the original graph are calculated by estimating the property values of the whole graph from a part of it (e.g., 0.1%). For example, the guideline property values are calculated using the Metropolis-Hastings Random Walk (MHRW) method, which captures the degree distribution of a graph by exploring part of it, or the method of Non-Patent Document 1 (Hardiman and Katzir (2013)), which estimates the average clustering coefficient of the whole graph by mining a small fraction of the nodes through a random walk.
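For orientation, a Metropolis-Hastings random walk in its textbook form proposes a uniformly random neighbor and accepts the move with probability min(1, d_current/d_candidate), which removes the bias of a plain random walk toward high-degree nodes. The sketch below shows only that generic walk; it is not the estimator of Non-Patent Document 1, and the function name `mhrw`, the fixed seed, and the dict-of-sets graph layout are illustrative assumptions. It also assumes every node has at least one neighbor.

```python
import random


def mhrw(adj, start, steps, seed=0):
    """Metropolis-Hastings random walk over an adjacency dict of sets.

    Returns the list of visited nodes (length steps + 1). Rejected
    proposals repeat the current node, as the method requires.
    """
    rng = random.Random(seed)
    current = start
    walk = [current]
    for _ in range(steps):
        candidate = rng.choice(sorted(adj[current]))  # uniform neighbor proposal
        accept_prob = len(adj[current]) / len(adj[candidate])
        if rng.random() < accept_prob:  # always accepted when the ratio is >= 1
            current = candidate
        walk.append(current)
    return walk


# Small example graph: a hub (node 0) connected to a short path.
example = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {3}}
visited = mhrw(example, start=0, steps=50)
# Degrees observed along the walk give a rough picture of the degree distribution.
observed_degrees = [len(example[v]) for v in visited]
```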

Under this guideline (e.g., so as to achieve the degree and clustering coefficient of the original graph), the sample graph is generated from the original graph, and as a result a subgraph is obtained. For example, when the sampling elements include the average degree, average clustering coefficient, and so on of the original graph, the sample graph is generated from the original graph so as to achieve those values.

Once the initial subgraph has been extracted in step S10 based on the sampling elements, the initial subgraph is transformed to finally obtain the subgraph (S30).

In an embodiment, the initial subgraph is extracted by extracting nodes and/or edges from the original graph based on the sampling elements (S10). The nodes and/or edges for the initial subgraph are extracted by traversing nodes through edges in the original graph. Because sampling is a matter of scale reduction, it is advantageous to first extract more nodes and/or edges than the final number and then reduce them. Creating additional nodes and/or edges midway could instead complicate the sampling process. Accordingly, the total number of nodes and/or edges of the initial subgraph must be at least the total number of nodes and/or edges included in the sampling elements, which the sample graph should finally have.

The most representative search-based node/edge extraction methods are BFS (Breadth First Search) and DFS (Depth First Search). BFS starts from an arbitrary node (e.g., the root node) and explores adjacent nodes first: it visits vertices close to the starting vertex before visiting vertices farther away. In other words, it searches broadly before searching deeply. DFS, on the other hand, starts from an arbitrary node (e.g., the root node) and fully explores one branch before moving on to the next. This resembles exploring a maze by continuing in one direction as far as possible and, when no further progress is possible, returning to the nearest fork and searching in another direction. In other words, it searches deeply before searching broadly.
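The difference in visiting order can be seen in a short Python sketch of the two standard traversals. The graph layout (a dict of neighbor sets) and the rule of visiting neighbors in ascending order are assumptions made only to keep the output deterministic.

```python
from collections import deque


def bfs_order(adj, root):
    """Visit nodes level by level, nearest to the root first."""
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in sorted(adj[v]):
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return order


def dfs_order(adj, root):
    """Follow one branch to its end before backtracking to the next."""
    seen = set()
    order = []
    stack = [root]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order.append(v)
        # Push in reverse so the smallest neighbor is explored first.
        for u in sorted(adj[v], reverse=True):
            if u not in seen:
                stack.append(u)
    return order


# Root 1 has branches through 2 (leaves 4, 5) and through 3 (leaf 6).
tree = {1: {2, 3}, 2: {1, 4, 5}, 3: {1, 6}, 4: {2}, 5: {2}, 6: {3}}
print(bfs_order(tree, 1))  # [1, 2, 3, 4, 5, 6]
print(dfs_order(tree, 1))  # [1, 2, 4, 5, 3, 6]
```

BFS reaches both children of the root before any grandchild, while DFS finishes the whole branch under node 2 before touching node 3.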

With DFS, the number of edges explored is small, so it is likely that fewer edges will be extracted than the total number of edges given as the guideline, which makes DFS unsuitable for graph sampling. With BFS, more edges are explored than with DFS, but the path lengths between the explored nodes are short and may differ greatly from the path lengths of the original graph. The path length of a graph is also an important property used to describe the graph, so it should not be ignored even when it is not included in the sampling elements. BFS is therefore limited in its ability to sample a graph while retaining the properties of the original graph.

In an embodiment, the nodes and/or edges for generating the initial subgraph are obtained by performing DFS in the direction of the neighbor with the highest degree among the neighbors not yet sampled, thereby extracting the initial nodes V_s (S10).

When sampling starts at an arbitrary node of the original graph, the node with the highest degree is determined among the neighbors connected to the start node by an edge. The branch of the start node leading to that highest-degree node is explored first, and then (if the initial subgraph required by the sampling fraction has not yet been fully explored) the search proceeds in the next branch direction.
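A degree-biased depth-first search of this kind might look like the following sketch. It is an illustration under assumptions rather than the patented procedure: `degree_biased_dfs` is a hypothetical name, degrees are taken from the original graph, ties are broken by node identifier, and the search stops as soon as a given number of nodes has been collected.

```python
def degree_biased_dfs(adj, start, target_nodes):
    """Collect up to target_nodes nodes, always descending into the
    unsampled neighbor with the highest degree and backtracking when
    the current node has no unsampled neighbor left."""
    sampled = [start]
    seen = {start}
    stack = [start]
    while stack and len(sampled) < target_nodes:
        current = stack[-1]
        candidates = [u for u in adj[current] if u not in seen]
        if not candidates:
            stack.pop()  # dead end: return to the previous fork
            continue
        # Highest degree first; ties go to the smaller identifier (assumption).
        nxt = min(candidates, key=lambda u: (-len(adj[u]), u))
        seen.add(nxt)
        sampled.append(nxt)
        stack.append(nxt)
    return sampled


original = {
    "a": {"b", "c"},
    "b": {"a", "c", "d", "e"},
    "c": {"a", "b"},
    "d": {"b"},
    "e": {"b"},
}
initial_nodes = degree_biased_dfs(original, "a", target_nodes=4)
print(initial_nodes)  # ['a', 'b', 'c', 'd']
```

Starting at "a", the walk prefers "b" (degree 4) over "c" (degree 2), continues to "c", backtracks to "b" when "c" has no unsampled neighbor, and then takes the next branch.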

In addition, initial edges E_s connecting the initial nodes V_s are generated. In an embodiment, the initial edges E_s are generated by including every edge that exists in the original graph between the initial nodes V_s. For example, the initial subgraph is induced by populating the initial edges E_s. Through induction, edges between a father node and a son node, edges between a grandfather node and a son node, and so on are connected among the initial nodes V_s.
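Graph induction as described here amounts to keeping every original edge whose two endpoints were both sampled. A minimal sketch, assuming a dict-of-sets graph and the hypothetical helper names `induce_subgraph` and `edge_count`:

```python
def induce_subgraph(adj, nodes):
    """Return the subgraph of adj induced by the given nodes."""
    keep = set(nodes)
    # Intersecting each neighbor set with the sampled nodes retains exactly
    # the original edges that run between two sampled nodes.
    return {v: adj[v] & keep for v in keep}


def edge_count(adj):
    """Number of undirected edges in an adjacency dict of sets."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2


original = {
    "a": {"b", "c"},
    "b": {"a", "c", "d", "e"},
    "c": {"a", "b"},
    "d": {"b"},
    "e": {"b"},
}
# Nodes found by a search a -> b -> c, then b -> d.
initial = induce_subgraph(original, ["a", "b", "c", "d"])
print(edge_count(initial))  # 4
```

The induced subgraph contains the search-tree edges a-b, b-c, and b-d, plus the edge a-c between a "grandfather" and a "son" that the search itself never traversed.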

In an embodiment, the total number of nodes of the initial subgraph may be the value obtained by applying the sampling fraction to the original graph. Once the total number of nodes of the initial subgraph has been calculated from the sampling fraction, the procedure described above (determining the highest-degree node among the neighbors connected to the start node by an edge and performing DFS toward it) continues until the calculated number of nodes has been extracted. The initial edges E_s are then generated by induction on the extracted nodes V_s.

In this way, the nodes and edges of the initial subgraph are extracted by a modified DFS node search in which the DFS prefers nodes with high degree. This produces an initial subgraph with overestimated edges in terms of average degree and average clustering coefficient (S10). This initial subgraph is a subgraph extracted from the original graph that has more edges than the final total number of edges of the sample graph, without its path lengths being excessively short.

Once the initial subgraph has been obtained in step S10, it is transformed. Here, transforming the sample graph includes removing overestimated edges from the previous sample graph (e.g., the initial subgraph). The initial graph and the graphs obtained by removing edges from it are also subgraphs of the original graph (S30).

As described above, the sampling process of FIG. 1 extracts a subgraph from the original graph and then generates the sample graph by reducing the scale of the subgraph under the guideline obtained in step S0. That is, the sample graph is not generated by random sampling that blindly shrinks the original graph.

If the overestimated edges are removed at random, a subgraph that satisfies the guideline is not obtained. The overestimated edges must be removed in a way that moves toward the guideline targets, which requires a criterion for selecting the edges to remove. Accordingly, before step S30, a criterion for selecting the edges to be removed is calculated (S20).

In an embodiment, the criterion represents a property of an edge. The edge properties include an edge weight (W_e) for each edge.

As described above, the sample graph may have the clustering coefficient of the original graph as its guideline, so the selection of the edge to be removed depends on the previous clustering coefficient of the sample graph. For example, when the edge removal operation has been performed t times, the edge to be removed in iteration t+1 depends on the clustering coefficient of the sample graph obtained after t iterations.

The neighborhood of a node includes all nodes connected to it by an edge. When an arbitrary node v of the initial subgraph has degree d_v and the number of edges among its neighbor nodes is l_v, the local clustering coefficient cc_v is calculated by the following equation.

[Equation 1]

$$cc_v = \frac{2\, l_v}{d_v (d_v - 1)}$$

If the degree d_v is preserved during edge removal, each removed edge between two neighbors of v reduces the clustering coefficient of node v by 2/(d_v*(d_v-1)). On this basis, an attribute value related to the clustering coefficient can be assigned to every edge of the sample graph, and edges can be removed according to that value so that the clustering coefficient of the sample graph matches that of the original graph.
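The relationship between the local clustering coefficient and the per-edge decrease can be checked numerically. The following is a minimal sketch, assuming the standard definition cc_v = 2*l_v/(d_v*(d_v-1)); the function name `local_clustering`, the adjacency-set representation, and the four-node toy graph are illustrative assumptions and do not come from the patent.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Local clustering coefficient of v: 2*l_v / (d_v*(d_v-1))."""
    neighbors = adj[v]
    d_v = len(neighbors)
    if d_v < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbors
    # l_v: number of edges that connect two neighbors of v
    l_v = sum(1 for u, w in combinations(neighbors, 2) if w in adj[u])
    return 2.0 * l_v / (d_v * (d_v - 1))

# Hypothetical toy graph: triangle a-b-c plus a pendant node d attached to a.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}

before = local_clustering(adj, "a")   # one edge (b, c) among three neighbors
adj["b"].discard("c")                 # remove e(b, c); the degree of a stays 3
adj["c"].discard("b")
after = local_clustering(adj, "a")
drop = before - after                 # equals 2 / (3 * 2) for d_a = 3
```

Because the removed edge joins two neighbors of `a` without touching `a` itself, the degree of `a` is unchanged and the drop is exactly one unit of 2/(d_v*(d_v-1)).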

Meanwhile, the clustering coefficient of a graph depends on node degrees, and removing an edge generally changes degree values. It is practically impossible to calculate exactly how edge removal affects the average clustering coefficient of the graph, because removing a single edge may prevent many triplets from forming.

By removing an edge between neighbor nodes so that the degree of a specific node does not change, the graph sampling apparatus can accurately measure the decrease in the clustering coefficient of that node.

Consider three nodes v, u, and w that can potentially form a triplet. If all three nodes are connected by edges and the edge e(u, w) between nodes u and w is removed, the degree d_v of node v is unaffected but its clustering coefficient decreases. That is, node v assigns a weight to the edge e(u, w) between nodes u and w. On this basis, the weight of edge e(u, w) is expressed by the following equation.

[Equation 2]

$$W_{e(u,w)} = \frac{2}{d_v (d_v - 1)}$$

Likewise, node u contributes a weight to edge e(v, w), and node w contributes a weight to edge e(u, v).

The structure of Equation 2 can be regarded as a subgraph with only three nodes. When the subgraph has k nodes, the weight of edge e(u, w), which lets nodes u and w form closed triplets, is calculated by the following equation, where N(x) denotes the neighborhood of node x.

[Equation 3]

$$W_{e(u,w)} = \sum_{v \in N(u) \cap N(w)} \frac{2}{d_v (d_v - 1)}$$

Based on Equation 3, the edge weight W_e of each edge included in the previous subgraph (e.g., the initial subgraph) is calculated (S20).

An edge with a high weight W_e has a large influence on the clustering coefficient (e.g., the average clustering coefficient) of the subgraph. Compared with removing an edge with a low edge weight, removing an edge with a high edge weight lowers the local clustering coefficients of many nodes, which in turn lowers the clustering coefficient of the sample graph. In addition, nodes with low degree contribute more to the edge weight. When an edge with a high edge weight is removed, the local clustering coefficients of such nodes decrease substantially, and the (average) clustering coefficient of the sample graph drops sharply. Here, "high" and "low" are not limited to meaning that an edge weight or degree is above or below the median in the sample graph. They are relative terms used to describe tendencies in a sample graph whose edges and nodes have differing weights and degrees.

FIG. 2 is a diagram for explaining edge weights according to an embodiment of the present invention.

The sample graph of FIG. 2 has |V| = 9 and |E| = 16, and its average clustering coefficient is 0.66. The degree d_A of the first node (A) is 3, and based on Equation 2 the weight this node assigns to edges e(B, C) and e(C, D) is 0.33. Similarly, the degree d_F of the sixth node (F) is 5, and it assigns a weight of 0.1 to edges e(B, C), e(B, E), e(C, G), and e(G, I). Removing any of these edges does not change the degree of the sixth node.

The total weight (i.e., the edge weight) W_e(B, C) of edge e(B, C) is 0.33 + 0.1 = 0.43. Calculating all edge weights in this way yields the edge weight results shown in FIG. 2.
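The accumulation of per-node contributions into an edge weight can be sketched as follows. This is an illustrative sketch only: it assumes each common neighbor v of an edge's endpoints contributes 2/(d_v*(d_v-1)), and it uses a hypothetical toy graph (not the graph of FIG. 2, whose full edge list is not given in the text) built so that a degree-3 node A and a degree-5 node F both neighbor B and C. The function name `edge_weights` is an assumption.

```python
from itertools import combinations

def edge_weights(adj):
    """Sum, for every edge (u, w), the contribution 2/(d_v*(d_v-1)) of each
    node v that has both u and w as neighbors. Node labels must be sortable."""
    # Start every existing edge at weight 0 so edges in no triangle are kept.
    weights = {(u, w): 0.0 for u in adj for w in adj[u] if u < w}
    for v, nbrs in adj.items():
        d_v = len(nbrs)
        if d_v < 2:
            continue
        contribution = 2.0 / (d_v * (d_v - 1))
        for u, w in combinations(sorted(nbrs), 2):
            if w in adj[u]:  # u and w are themselves connected
                weights[(u, w)] += contribution
    return weights

# Hypothetical toy graph (symmetric adjacency sets).
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "F"},
    "C": {"A", "B", "F"},
    "D": {"A"},
    "E": {"F"},
    "F": {"B", "C", "E", "G", "I"},
    "G": {"F"},
    "I": {"F"},
}
weights = edge_weights(adj)
# e(B, C) receives 2/(3*2) from A and 2/(5*4) from F, about 0.43 in total.
```

In this toy graph the edge e(A, D) lies in no triangle, so its weight stays 0 and removing it would leave every local clustering coefficient numerator unchanged.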

If e(B, E), whose W_e is 0.1, is removed from the sample graph of FIG. 2, the average clustering coefficient of the sample graph decreases slightly to 0.63. In contrast, if e(D, G), whose W_e is 1.1, is removed, the average clustering coefficient decreases to 0.51. That is, removing an edge with a larger edge weight reduces the average clustering coefficient more.

Thus, the edge weight indicates the magnitude of the decrease in the local clustering coefficients of nodes in the subgraph, and ultimately the rate at which the clustering coefficient of the subgraph decreases. Therefore, by comparing the clustering coefficient of the current subgraph with the target clustering coefficient and selecting an edge with a suitable edge weight at the current step, edges can be removed in a direction that satisfies the clustering coefficient guideline, and the sample graph can be obtained.

In one embodiment, once the edge weights of all edges have been calculated (S20), each edge is identified on the basis of its edge weight. For example, the edge weight of each edge is calculated, the edges are sorted by weight, and an index is assigned to each edge in sorted order. Indices may be assigned from the largest weight to the smallest (descending order) or from the smallest to the largest (ascending order). Hereinafter, for clarity, the invention is described for the descending case, in which edge weights are sorted from largest to smallest and indices are assigned starting from the edge with the largest weight. However, it will be apparent to those skilled in the art that the invention is not limited thereto.
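The descending-order indexing can be sketched in a few lines. The function name `index_edges` and the tie-breaking by edge label are assumptions made for determinism; the three weights are the values quoted in the text for FIG. 2.

```python
def index_edges(weights):
    """Sort edges by weight, largest first, and assign indices 0, 1, 2, ...
    Ties are broken by edge label (an assumption, for reproducibility)."""
    ranked = sorted(weights, key=lambda e: (-weights[e], e))
    return {edge: i for i, edge in enumerate(ranked)}

# Edge weights quoted in the text for three edges of FIG. 2.
weights = {("B", "C"): 0.43, ("B", "E"): 0.1, ("D", "G"): 1.1}
index = index_edges(weights)
```

With this convention a low index means a high edge weight, so edges in the lower-index half are the ones whose removal reduces the clustering coefficient more.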

In this way, edge weights are calculated in step S20, and a first group and a second group may be determined based on the edge weights. Here, the first group includes edges whose removal decreases the clustering coefficient of the subgraph more than removal of the edges of the second group does.

Once the criterion for removing edges (e.g., the edge weights of the edges included in the subgraph) has been calculated in step S20, an edge to be removed is selected from the previous subgraph (e.g., the initial subgraph of step S10), and the selected edge is removed (S30).

Step S30 is repeated until the initially overestimated total number of edges reaches the target total number of edges (i.e., the sampling element value) (S40). That is, step S30 is repeated until all extra edges have been removed (S40).

As described above, the initial subgraph is formed with more edges than the final sample graph, so every intermediate subgraph between the initial subgraph and the final sample graph, from which one or more edges have been removed, also has more edges than the final sample graph. The edges that a subgraph contains in excess of the final sample graph are referred to as extra edges. The extra edges are only a quantity, not a set of specific edges.

The total number of extra edges is calculated by the following equation.

Here, d_org may be the degree value of the original graph estimated from a portion of the original graph. In some embodiments, d_org may be the average degree of the original graph estimated from a portion of the original graph.

In one embodiment, when extra edges are removed one at a time, step S30 is repeated as many times as the total number of extra edges (S40). As step S30 is repeated, the current clustering coefficient gradually approaches or coincides with the final clustering coefficient.

The edge to be removed in step S30 is selected based on an attribute of the subgraph. In addition, the edge to be removed is selected further based on an attribute of the original graph (e.g., a sampling element) that serves as the sampling target.

In one embodiment, the edge to be removed is selected based on the edge weights (S30).

Because of the extra edges, the clustering coefficient of a subgraph during the sampling process is larger than that of the sample graph, and it decreases as the extra edges are removed. As described with reference to FIG. 2, removing a second edge that has a larger edge weight than a first edge tends to reduce the clustering coefficient of the subgraph more than removing the first edge does. That is, because the edge weight is related to the decrease in the clustering coefficient, it can be used to select the edge to be removed.

In one embodiment, to select the edge to be removed, the decrease state of the clustering coefficient of the subgraph in step S30 is determined. If the decrease state is determined to indicate a value higher than the clustering coefficient targeted for the sample graph, an edge that reduces the clustering coefficient more is selected for removal. Conversely, if the decrease state is determined to indicate a value lower than the clustering coefficient targeted for the sample graph, an edge that reduces the clustering coefficient less is selected for removal.

In one embodiment, the decrease state is determined based on the clustering coefficient of the subgraph and the clustering coefficient of the sample graph. When the real-time clustering coefficient and the predicted clustering coefficient of the subgraph are each calculated in the same step S30, the decrease state is determined by comparing the real-time value with the predicted value. Here, the real-time value is the clustering coefficient calculated directly or indirectly from the subgraph, and the predicted value is the value the subgraph is expected to have at the current step in order to reach the target clustering coefficient, based on the clustering coefficient of the initial subgraph and the total number of extra edges.

When the subgraph is the initial subgraph, its real-time clustering coefficient is calculated based on, for example, the average clustering coefficient of the original graph and the number of nodes in the initial subgraph.

The real-time clustering coefficient cc_curr of a subgraph from which one or more extra edges have been removed is calculated based on the subgraph itself and/or another subgraph (e.g., the initial subgraph).

In one embodiment, the real-time average clustering coefficient cc_curr in the removal step S30 is calculated based on the entire structure of the previous subgraph. For example, when step S30 is repeated t times, the real-time average clustering coefficient used to select an edge in the t-th step S30 is calculated from the entire structure of the subgraph for which the (t-1)-th step S30 has been completed.

In another embodiment, the real-time average clustering coefficient cc_curr in the removal step S30 is calculated based on a portion of the previous subgraph. For example, when step S30 is repeated t times, the real-time average clustering coefficient used to select an edge in the t-th step S30 is calculated from the portion of the graph near the edge removed during the (t-1)-th step S30 and from the real-time average clustering coefficient of the (t-2)-th step S30.

Deleting an extra edge affects only the local clustering coefficients of the end nodes of the deleted edge and of the common neighbor (friend) nodes of those end nodes. By calculating only the local clustering coefficients of those end nodes and common neighbor nodes, and deriving the average clustering coefficient after removal from the average before removal and the recalculated local coefficients, the cost of computing the average clustering coefficient of the sample graph during sampling can be minimized.
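The local update described here can be sketched as follows. This is one possible realization, not the patent's code: the helper names (`local_cc`, `average_cc`, `remove_edge_and_update`) and the toy graph are assumptions. Only the two end nodes and their common neighbors are recomputed, and the result is compared with a full recomputation.

```python
from itertools import combinations

def local_cc(adj, v):
    d = len(adj[v])
    if d < 2:
        return 0.0  # assumed convention for degree < 2
    links = sum(1 for u, w in combinations(adj[v], 2) if w in adj[u])
    return 2.0 * links / (d * (d - 1))

def average_cc(adj):
    """Full recomputation over every node (the expensive path)."""
    return sum(local_cc(adj, v) for v in adj) / len(adj)

def remove_edge_and_update(adj, u, w, avg_cc):
    """Remove e(u, w) in place and return the new average clustering
    coefficient, recomputing only the nodes whose local value can change."""
    affected = {u, w} | (adj[u] & adj[w])  # end nodes and common neighbors
    before = sum(local_cc(adj, x) for x in affected)
    adj[u].discard(w)
    adj[w].discard(u)
    after = sum(local_cc(adj, x) for x in affected)
    return avg_cc + (after - before) / len(adj)

# Hypothetical toy graph: triangle a-b-c plus a pendant node d attached to a.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
full_before = average_cc(adj)
incremental = remove_edge_and_update(adj, "b", "c", full_before)
full_after = average_cc(adj)  # should agree with the incremental value
```

The incremental path touches a number of nodes proportional to the size of the common neighborhood instead of the whole graph, which is where the saving comes from on large subgraphs.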

In one embodiment, the predicted clustering coefficient cc_exp is calculated by the following equation.

Here, e_del is the number of edges already removed; when an edge is removed from the initial subgraph, e_del = 0. slop is the edge removal slope and is expressed by the following equation.

Here, cc_org may be the clustering coefficient of the original graph estimated from a portion of the original graph. In some embodiments, cc_org may be the average clustering coefficient of the original graph estimated from a portion of the original graph. cc_curr is the current clustering coefficient; when step S30 has been applied t times, it denotes the clustering coefficient cc_t of the subgraph from which t edges have been removed.

FIG. 3 is a conceptual diagram illustrating the edge removal slope according to an embodiment of the present invention.

Referring to FIG. 3, on a plot whose x-axis is the number of removed edges and whose y-axis is the clustering coefficient (cc) of the sample graph, the edge removal slope represents a linear relationship between the clustering coefficient of the initial subgraph and that of the final sample graph, and can serve as a reference for the rate at which the clustering coefficient of the subgraph decreases.
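The linear reference line of FIG. 3 can be illustrated with a short sketch. The exact equations for cc_exp and the slope are not reproduced in this text, so the form below is an assumption: a straight line from the initial clustering coefficient cc_init (at zero removed edges) to the target cc_org (at e_extra removed edges). The function name `expected_cc` and the numeric inputs are hypothetical.

```python
def expected_cc(cc_init, cc_org, e_extra, e_del):
    """Clustering coefficient expected after e_del removals, ASSUMING a
    linear decrease from cc_init (e_del = 0) to cc_org (e_del = e_extra)."""
    if e_extra <= 0:
        raise ValueError("e_extra must be positive")
    slope = (cc_init - cc_org) / e_extra  # assumed edge removal slope
    return cc_init - slope * e_del

# Hypothetical values: start at 0.66, target 0.50, eight extra edges.
start = expected_cc(0.66, 0.50, 8, 0)    # equals cc_init
halfway = expected_cc(0.66, 0.50, 8, 4)  # midpoint of the line
end = expected_cc(0.66, 0.50, 8, 8)      # equals cc_org up to rounding
```

Comparing the real-time clustering coefficient against this line at each step indicates whether the subgraph is decreasing too slowly or too quickly relative to the target.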

If, in the same step S30, the real-time clustering coefficient is higher than the clustering coefficient predicted from the edge removal slope, and this decrease state persists, the clustering coefficient of the finally obtained sample graph is likely to be higher than the target clustering coefficient of the original graph. For example, when the real-time clustering coefficient of the subgraph is larger than the predicted clustering coefficient and the clustering coefficient of the subgraph is larger than that of the original graph, the decrease state is determined to indicate a value higher than the clustering coefficient targeted for the sample graph.

Therefore, an edge that reduces the clustering coefficient by a relatively large amount must be removed in this step S30 to raise the likelihood that the clustering coefficient of the sample graph approaches that of the original graph. To this end, as shown in FIG. 3, an edge with a high edge weight is selected for removal in this step S30, because, as described above, removing an edge with a high edge weight reduces the clustering coefficient more.

In some embodiments, when the edges are ordered by edge weight, any one edge is selected from the first group, which contains the edges whose weights are higher than that of the middle edge (e_mid) having the median edge weight, in order to reduce the clustering coefficient more. For example, when the edges are arranged in descending order of edge weight, an edge with an index lower than the middle index is selected. In this case, the edge to be removed is the edge whose index is calculated by the following equation.

Here, mid = |E_s|/2 and cc_ratio = cc_org/cc_curr.

Conversely, if, in the same step S30, the real-time clustering coefficient is lower than the clustering coefficient predicted from the edge removal slope, and this decrease state persists, the clustering coefficient of the finally obtained sample graph is likely to be lower than the target clustering coefficient of the original graph. For example, when the real-time clustering coefficient of the subgraph is smaller than the predicted clustering coefficient, or the clustering coefficient of the subgraph is smaller than that of the original graph, the decrease state is determined to indicate a value lower than the clustering coefficient targeted for the sample graph.

Therefore, an edge that reduces the clustering coefficient by a relatively small amount must be removed in this step S30 to raise the likelihood that the clustering coefficient of the sample graph approaches that of the original graph. To this end, as shown in FIG. 3, an edge with a low edge weight is selected for removal in this step S30, because, as described above, removing an edge with a low edge weight reduces the clustering coefficient less.

In some embodiments, when the edges are ordered by edge weight, any one edge is selected from the second group, which contains the edges whose weights are lower than that of the middle edge, in order to reduce the clustering coefficient less. For example, when the edges are arranged in descending order of edge weight, an edge with an index higher than the middle index is selected. In this case, the edge to be removed is the edge whose index is calculated by the following equation.
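The two-group selection rule can be sketched as below. The index equations themselves are not reproduced in this text, so this sketch falls back on the simpler behavior the text also describes: picking an arbitrary edge from the appropriate group. The function name `pick_edge`, the use of a seeded `random.Random`, and the placement of the middle edge in the second group are assumptions.

```python
import random

def pick_edge(ranked, cc_curr, cc_exp, cc_org, rng):
    """Pick one edge from `ranked` (sorted by weight, largest first).

    If the current coefficient is above both the expected value and the
    target, draw from the first (high-weight) group; otherwise draw from
    the second (low-weight) group. The draw is uniform at random, which
    stands in for the index equations not reproduced here."""
    mid = len(ranked) // 2
    too_high = cc_curr > cc_exp and cc_curr > cc_org
    group = ranked[:mid] if too_high else ranked[mid:]
    if not group:  # e.g. a single remaining edge
        group = ranked
    return rng.choice(group)

# Hypothetical ranked edge labels, highest weight first.
ranked = ["e0", "e1", "e2", "e3", "e4", "e5"]
rng = random.Random(0)
high_pick = pick_edge(ranked, cc_curr=0.70, cc_exp=0.60, cc_org=0.50, rng=rng)
low_pick = pick_edge(ranked, cc_curr=0.55, cc_exp=0.60, cc_org=0.50, rng=rng)
```

A seeded generator is used only so that repeated runs select the same edges; any source of randomness would fit the rule as described.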

If extra edges remain after an edge is removed in step S30, step S30 is repeated (S40). In this case, the total number of removed edges is updated to the previous total (e.g., e_del) + 1.

FIG. 4 is exemplary code for the operation of the graph sampling apparatus according to an embodiment of the present invention.

In FIG. 4, when the previous sample graph is the initial subgraph, the initial conditions for removing extra edges are as follows: e_del = 0; e_ratio = 1; cc_curr = cc_init. Here, e_del is the number of removed edges, e_ratio is the change ratio of the extra edges (= (e_extra - e_del)/e_extra), and cc_init may be the clustering coefficient of the original graph estimated from a portion of the original graph. In some embodiments, cc_init may be the average clustering coefficient of the original graph estimated from a portion of the original graph.

As shown in FIG. 4, while the process of selecting and removing an edge (S30) is repeated until all extra edges are removed (S40), the following values are updated based on the sample graph information after each edge removal: e_del, the number of removed edges; e_ratio, the change ratio of the extra edges; cc_curr, the real-time clustering coefficient of the subgraph at each removal step S30; cc_exp, the predicted clustering coefficient; and cc_ratio, the scaling ratio of the clustering coefficient.
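The code of FIG. 4 is not reproduced in this text. The loop it describes can be approximated by the sketch below, which is not the patent's code: it assumes a linear expected-value line, recomputes all edge weights and the average clustering coefficient from scratch each iteration, and draws a random edge from the high-weight or low-weight half. All function names and the K5 example are hypothetical.

```python
import random
from itertools import combinations

def local_cc(adj, v):
    d = len(adj[v])
    if d < 2:
        return 0.0
    links = sum(1 for u, w in combinations(adj[v], 2) if w in adj[u])
    return 2.0 * links / (d * (d - 1))

def average_cc(adj):
    return sum(local_cc(adj, v) for v in adj) / len(adj)

def edge_weights(adj):
    weights = {(u, w): 0.0 for u in adj for w in adj[u] if u < w}
    for v, nbrs in adj.items():
        d = len(nbrs)
        if d < 2:
            continue
        for u, w in combinations(sorted(nbrs), 2):
            if w in adj[u]:
                weights[(u, w)] += 2.0 / (d * (d - 1))
    return weights

def sample_by_edge_removal(adj, e_extra, cc_org, seed=0):
    """Remove e_extra edges (1 <= e_extra <= number of edges) from a copy
    of adj, steering the average clustering coefficient toward cc_org."""
    adj = {v: set(n) for v, n in adj.items()}   # leave the input untouched
    rng = random.Random(seed)
    cc_init = average_cc(adj)
    slope = (cc_init - cc_org) / e_extra        # assumed linear slope
    for e_del in range(e_extra):
        cc_curr = average_cc(adj)
        cc_exp = cc_init - slope * e_del
        weights = edge_weights(adj)
        ranked = sorted(weights, key=lambda e: (-weights[e], e))
        mid = len(ranked) // 2
        too_high = cc_curr > cc_exp and cc_curr > cc_org
        group = (ranked[:mid] if too_high else ranked[mid:]) or ranked
        u, w = rng.choice(group)
        adj[u].discard(w)
        adj[w].discard(u)
    return adj

# Hypothetical input: the complete graph K5 (10 edges), with 4 extra edges.
k5 = {i: {j for j in range(5) if j != i} for i in range(5)}
sampled = sample_by_edge_removal(k5, e_extra=4, cc_org=0.5)
```

Full recomputation keeps the sketch short; on large graphs the per-step cost would be reduced by updating only the nodes around the removed edge.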

The graph generation method described above may be performed by a computing device including a processor (e.g., the graph sampling apparatus).

The graph sampling apparatus according to the embodiments may be entirely hardware, entirely software, or partly hardware and partly software. For example, an apparatus or system may refer collectively to hardware with data processing capability and the operating software that drives it. In this specification, terms such as "unit," "module," "apparatus," or "system" are intended to refer to a combination of hardware and the software run by that hardware. For example, the hardware may be a data processing device including a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or another processor. The software may refer to a running process, an object, an executable, a thread of execution, a program, and the like.

FIGS. 5 to 7 are diagrams for explaining the performance of subgraphs obtained as sampling results according to an experimental example of the present invention.

In the experimental example, subgraphs produced by the conventional sampling methods FFS (Forest Fire Sampling) and TIES (Totally Induced Edge Sampling) were compared with subgraphs produced by the sampling method of FIG. 1 (GS). The sampling ratio was varied from φ = 0.001 to φ = 0.01. Degree, average clustering coefficient, and path length were used as comparison metrics. The value of each metric was calculated as the RMSE (Root Mean Square Error). The three sampling methods were applied to seven original networks (data sets) having the attribute values shown in the table below.

[Table 1]

FIG. 5 is a graph evaluating the subgraphs in terms of degree. In FIG. 5, the x-axis represents the sampling ratio relative to the original network, and the y-axis represents the scaling ratio of the average degree. A y value equal or close to 1 indicates that the subgraph was extracted in keeping with the guideline of preserving the properties of the original graph.

Referring to FIG. 5, the sample graph produced by the sampling method of FIG. 1 (GS) has attribute values closest to those of the original graph. Because GS samples toward a target value instead of blindly sampling nodes and/or edges, it can obtain a subgraph with the attribute values of the original graph. The FFS method always underestimates the degree. The TIES method overestimates the average degree on some original networks and underestimates it on others.

FIG. 6 is a graph evaluating the subgraphs in terms of the clustering coefficient. In FIG. 6, the x-axis represents the sampling ratio relative to the original network, and the y-axis represents the scaling ratio of the average clustering coefficient.

Referring to FIG. 6, the sample graph produced by the sampling method of FIG. 1 (GS) has attribute values closest to those of the original graph. GS performs well because it removes extra edges according to edge weights that help it reach the target value. In contrast, sample graphs produced by the FFS method perform poorly. The TIES method also gives inaccurate values on most networks, and its accuracy is especially low at low sampling ratios.

FIG. 7 is a graph evaluating the subgraphs in terms of path length. In FIG. 7, the x-axis represents the sampling ratio relative to the original network, and the y-axis represents the scaling ratio of the average path length.

Referring to FIG. 7, the FFS method has a very large scaling ratio in all networks and therefore performs worst. Comparing the TIES method with the sampling method (GS) of FIG. 1, the GS method yields a path length closer to that of the original graph and therefore performs better.
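As a non-limiting illustration of the quantities plotted in FIGS. 5 to 7, the following Python sketch computes the average degree, average clustering coefficient, and average path length of a graph and compares a sample graph with its original. It assumes that the "scaling ratio" is the sample value divided by the original value, which the text does not state explicitly. The function names and the adjacency-dictionary representation are hypothetical and are not taken from the specification.

```python
from collections import deque

# Graphs are undirected adjacency dicts: {node: set_of_neighbours}.
# Node labels are assumed to be mutually comparable (e.g. integers).

def average_degree(adj):
    """Mean number of neighbours per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def local_clustering(adj, v):
    """Fraction of neighbour pairs of v that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

def scaling_ratio(metric, sample, original):
    """Assumed definition: metric(sample) / metric(original); 1.0 is ideal."""
    return metric(sample) / metric(original)

# Toy example: a triangle with one pendant node, and the triangle alone.
original = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
sample = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
for metric in (average_degree, average_clustering, average_path_length):
    print(metric.__name__, scaling_ratio(metric, sample, original))
```

Under this assumed definition, a ratio near 1 means that the sample preserves the property, a ratio below 1 corresponds to underestimation, and a ratio above 1 corresponds to overestimation.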

The operations of the graph sampling apparatus and method according to the embodiments described above, which extract a sample graph reduced to a very small scale while maintaining the properties of the original graph, may be implemented at least partially as a computer program and recorded on a computer-readable recording medium. For example, they may be implemented with a program product consisting of a computer-readable medium containing program code, which may be executed by a processor to perform any or all of the steps, operations, or processes described.

The computer may be a computing device such as a desktop computer, laptop computer, notebook, smartphone, or the like, or any device into which such a computing device may be integrated. A computer is a device having one or more alternative and special-purpose processors, memory, storage, and networking components (either wireless or wired). The computer may run an operating system such as one compatible with Microsoft Windows, Apple OS X or iOS, a Linux distribution, or Google's Android OS.

The computer-readable recording medium includes all types of recording devices in which computer-readable data is stored. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed over network-connected computer systems, so that the computer-readable code is stored and executed in a distributed manner. In addition, functional programs, code, and code segments for implementing the present embodiment will be readily understood by those skilled in the art to which the present embodiment belongs.

Although the present invention has been described above with reference to the embodiments shown in the drawings, these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and variations of the embodiments are possible. Such modifications should be considered to fall within the technical protection scope of the present invention. Accordingly, the true technical protection scope of the present invention should be determined by the technical spirit of the appended claims.

The graph sampling apparatus of the present invention may be used in the fields of health care, social networks, Internet network topology, and bioinformatics.

For example, given the explosive growth of the Internet of Things (IoT) and the large number of devices in cloud environments, it is almost impossible to analyze the topology of all Internet-connected client devices. The graph sampling apparatus, however, can extract a partial graph that maintains the properties of the whole, and is therefore highly applicable to the analysis of large-scale Internet topologies.

Claims

A graph sampling device for obtaining a sample graph having properties of an original graph, the graph sampling device being configured to perform:
extracting, from the original graph, a partial graph having more edges than the total number of edges of the sample graph;
selecting an edge to be removed from the partial graph based on a property of at least one of the original graph and the partial graph; and
removing the selected edge to obtain the sample graph,
wherein the properties of the original graph or the partial graph include at least one of a degree and a clustering coefficient of the corresponding graph,
wherein extracting the partial graph from the original graph comprises:
searching the original graph, according to a sampling rate, for nodes to be extracted into the partial graph; and
inducing the partial graph by connecting the found nodes with edges,
wherein searching for the nodes to be extracted comprises:
starting from a current node on the original graph, searching for a node in a specific branch direction among the branches connected to the current node, and then searching for a node in the direction of a next branch, and
wherein the node in the specific branch direction is the node having the highest degree among the nodes connected to the current node.
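As a non-limiting illustration of the search and induction recited above, the following Python sketch performs a depth-first traversal that always follows the highest-degree neighbor first, stops once the number of nodes given by the sampling rate has been found, and then induces the partial graph from the found nodes. The function names, the tie-breaking rule, and the rounding of the target node count are assumptions made for this sketch and are not specified in the claim.

```python
import math

def degree_first_search(adj, start, sampling_rate):
    """Depth-first search that explores the highest-degree branch first.

    adj is an undirected adjacency dict {node: set_of_neighbours}.
    Returns the nodes in the order in which they were found.
    """
    # Assumed rounding: at least one node, otherwise ceil(rate * node count).
    target = max(1, math.ceil(sampling_rate * len(adj)))
    found, seen = [], set()
    stack = [start]
    while stack and len(found) < target:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        found.append(v)
        # Sort ascending by degree so the highest-degree neighbour is
        # popped next; ties are broken by the larger label (assumption).
        branches = sorted((u for u in adj[v] if u not in seen),
                          key=lambda u: (len(adj[u]), u))
        stack.extend(branches)
    return found

def induce_partial_graph(adj, nodes):
    """Keep every original edge whose two end nodes were both found."""
    keep = set(nodes)
    return {v: adj[v] & keep for v in keep}

# Toy example with six nodes and a sampling rate of 0.5.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3, 4},
         3: {0, 2}, 4: {2, 5}, 5: {4}}
nodes = degree_first_search(graph, start=0, sampling_rate=0.5)
partial = induce_partial_graph(graph, nodes)
```

Because the induced partial graph keeps every edge between the found nodes, it generally has more edges than the desired sample graph, which is what makes the subsequent edge-removal step necessary.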

delete

A graph sampling device for obtaining a sample graph having properties of an original graph, the graph sampling device being configured to perform:
extracting, from the original graph, a partial graph having more edges than the total number of edges of the sample graph;
selecting an edge to be removed from the partial graph based on a property of at least one of the original graph and the partial graph; and
removing the selected edge to obtain the sample graph,
wherein the properties of the original graph or the partial graph include at least one of a degree and a clustering coefficient of the corresponding graph,
wherein selecting the edge to be removed comprises:
calculating an edge weight for each edge included in the partial graph;
determining a first group and a second group based on the edge weights; and
selecting the edge to be removed from the first group or the second group based on the clustering coefficient of the partial graph and the clustering coefficient of the sample graph.

The graph sampling device of claim 5,
wherein the first group includes edges whose removal reduces the clustering coefficient of the partial graph more than removal of the edges of the second group.

The graph sampling device of claim 5,
wherein removing a second edge having an edge weight greater than that of a first edge decreases the clustering coefficient of the partial graph more than removing the first edge.

The graph sampling device of claim 7,
wherein the edge weight of an edge that connects a first node and a second node and forms a closed triplet in the partial graph is calculated by the following equation,
[Equation]

Here, k represents the total number of nodes of the partial graph.

The graph sampling device of claim 5, wherein determining a decrease state of the clustering coefficient of the partial graph comprises:
calculating a real-time clustering coefficient and a predicted clustering coefficient of the partial graph;
comparing the real-time value with the predicted value and, when the real-time clustering coefficient of the partial graph is larger than the predicted clustering coefficient and the clustering coefficient of the partial graph is larger than that of the original graph, selecting as the removal target an edge having an edge weight that yields a larger decrease in the clustering coefficient; and
comparing the real-time value with the predicted value and, when the real-time clustering coefficient of the partial graph is smaller than the predicted clustering coefficient or the clustering coefficient of the partial graph is smaller than that of the original graph, selecting as the removal target an edge having an edge weight that yields a smaller decrease in the clustering coefficient.
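As a non-limiting illustration of the comparison recited above, the following Python sketch chooses a removal target from precomputed edge weights. It assumes, consistent with the dependent claims, that a larger edge weight means a larger decrease in the clustering coefficient when the edge is removed. For brevity it collapses the first and second groups into "largest weight" and "smallest weight", and it treats the unspecified case of equal values as the small-decrease case. The function name and argument names are hypothetical.

```python
def pick_edge_to_remove(edge_weights, cc_now, cc_predicted, cc_original):
    """Return the edge to remove next.

    edge_weights maps an edge (u, v) to its weight; a larger weight is
    assumed to mean a larger drop in the clustering coefficient.
    cc_now       : real-time clustering coefficient of the partial graph
    cc_predicted : predicted clustering coefficient at this step
    cc_original  : clustering coefficient of the original graph
    """
    if cc_now > cc_predicted and cc_now > cc_original:
        # Clustering is too high: remove the edge that lowers it the most.
        return max(edge_weights, key=edge_weights.get)
    # Otherwise (including ties, by assumption): remove the edge that
    # lowers the clustering coefficient the least.
    return min(edge_weights, key=edge_weights.get)

# Toy example with three candidate edges and made-up weights.
weights = {(0, 1): 0.30, (1, 2): 0.05, (2, 3): 0.10}
too_high = pick_edge_to_remove(weights, cc_now=0.6, cc_predicted=0.5, cc_original=0.4)
too_low = pick_edge_to_remove(weights, cc_now=0.3, cc_predicted=0.5, cc_original=0.4)
```

In the toy example, `too_high` is the heaviest edge `(0, 1)` and `too_low` is the lightest edge `(1, 2)`, so the clustering coefficient is steered toward the target from either side.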

The graph sampling device of claim 9,
wherein the predicted clustering coefficient of the partial graph after extra edges are deleted from the partial graph is calculated by the following equation,
[Equation]

Here, e_del represents the number of edges that have already been removed, CC_org represents the clustering coefficient of the original graph, and slope is expressed by the following equation,
[Equation]

Here, e_extra represents the total number of edges to be removed.

The graph sampling device of claim 1,
wherein one edge is removed in the step of removing the edge, and
wherein the graph sampling device is further configured to repeat the removing based on a difference between the total number of edges of the partial graph before the edges are removed and the total number of edges of the current sample graph obtained by removing the edges.

The graph sampling device of claim 11, wherein the repeating comprises:
calculating local clustering coefficients of the end nodes of the removed edge and of the common friend nodes of the end nodes;
calculating the clustering coefficient of the partial graph from which the selected edge has been removed, based on the local clustering coefficients and the clustering coefficient before the edge was removed; and
updating the clustering coefficient of the partial graph to the clustering coefficient of the partial graph from which the edge has been removed.
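As a non-limiting illustration of the update recited above, the following Python sketch removes one edge and updates the average clustering coefficient by recomputing only the local clustering coefficients of the two end nodes and their common friend (common neighbor) nodes, since no other node's local coefficient changes. It assumes that the clustering coefficient of the graph is the mean of the local clustering coefficients. The function names are hypothetical.

```python
def local_clustering(adj, v):
    """Fraction of neighbour pairs of v that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """Full recomputation, used here only to check the incremental update."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)

def remove_edge_and_update(adj, u, v, avg_cc):
    """Remove edge (u, v) in place and return the updated average clustering.

    Only u, v, and their common neighbours can change their local
    clustering coefficient, so only those nodes are re-evaluated.
    """
    affected = {u, v} | (adj[u] & adj[v])
    before = sum(local_clustering(adj, w) for w in affected)
    adj[u].discard(v)
    adj[v].discard(u)
    after = sum(local_clustering(adj, w) for w in affected)
    return avg_cc + (after - before) / len(adj)

# Toy example: a complete graph on four nodes, then remove edge (0, 1).
graph = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
cc = average_clustering(graph)
cc = remove_edge_and_update(graph, 0, 1, cc)
```

The incremental value matches a full recomputation with `average_clustering(graph)`, while touching only the neighborhood of the removed edge, which is what keeps repeated single-edge removal affordable on a large partial graph.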

A graph sampling method for obtaining a sample graph having properties of an original graph, performed by a processor, the graph sampling method comprising:
extracting, from the original graph, a partial graph having more edges than the total number of edges of the sample graph;
selecting an edge to be removed from the partial graph based on a property of at least one of the original graph and the partial graph; and
obtaining the sample graph by removing the selected edge,
wherein the properties of the original graph or the partial graph include at least one of a degree and a clustering coefficient of the corresponding graph,
wherein extracting the partial graph from the original graph comprises:
searching the original graph, according to a sampling rate, for nodes to be extracted into the partial graph; and
inducing the partial graph by connecting the found nodes with edges,
wherein searching for the nodes to be extracted comprises:
starting from a current node on the original graph, searching for a node in a specific branch direction among the branches connected to the current node, and then searching for a node in the direction of a next branch, and
wherein the node in the specific branch direction is the node having the highest degree among the nodes connected to the current node.

A graph sampling method for obtaining a sample graph having properties of an original graph, performed by a processor, the graph sampling method comprising:
extracting, from the original graph, a partial graph having more edges than the total number of edges of the sample graph;
selecting an edge to be removed from the partial graph based on a property of at least one of the original graph and the partial graph; and
obtaining the sample graph by removing the selected edge,
wherein the properties of the original graph or the partial graph include at least one of a degree and a clustering coefficient of the corresponding graph,
wherein selecting the edge to be removed comprises:
calculating an edge weight for each edge included in the partial graph;
determining a first group and a second group based on the edge weights; and
selecting the edge to be removed from the first group or the second group based on the clustering coefficient of the partial graph and the clustering coefficient of the sample graph.