KR20040102887A - A Method for Visualizing Protein Interaction Networks

Info

Publication number: KR20040102887A
Application number: KR1020030034718A
Authority: KR
Inventors: 한경숙; 주병현
Original assignee: 학교법인 인하학원
Priority date: 2003-05-30
Filing date: 2003-05-30
Publication date: 2004-12-08
Also published as: KR100471417B1

Abstract

PURPOSE: A method for visualizing a large-scale protein-protein interaction network is provided. It visualizes the network quickly and simply by first laying out all connected components in the network, placing the nodes of each connected component around pivot nodes, and then adjusting the position of each node relative to its neighboring nodes. CONSTITUTION: The connected components in the protein-protein interaction network are identified and laid out. Intermediate nodes and pivot nodes in each connected component are determined. The distance from the pivot nodes to each node within each connected component is calculated, and each node is placed relative to the pivot nodes based on the calculated distances. Each connected component is rearranged by placing the cutvertices associated with the intermediate nodes and the intermediate nodes associated with the neighbors of those cutvertices. Each connected component is then rearranged again by placing each node relative to the neighboring nodes within a fixed distance of it.

Description

A Method for Visualizing Protein Interaction Networks

The present invention relates to a method of visualizing a protein-protein interaction network and, more particularly, to a method that quickly and simply visualizes a large-scale protein interaction network containing thousands of proteins or more.

Conventional biochemical experiments produced only small amounts of data on individual protein interactions. Over the last three years, however, the volume of protein interaction data has grown rapidly owing to advances in high-throughput interaction detection methods such as yeast two-hybrid screening and mass spectrometry. Such interaction data are available in text files or databases. Because of the sheer volume of data, however, a graphical representation of protein interactions has proven easier to understand than a long list of interacting proteins.

Conventionally, protein interactions have been visualized as an undirected graph G = (V, E), where x ∈ V represents a protein and (x, y) ∈ E represents an interaction between proteins x and y. Visualizing a graph is straightforward when it has only a small number of nodes and edges. In practice, however, protein interaction networks contain thousands of nodes or more. The usefulness of many graph drawing tools for such networks is very limited: they produce cluttered drawings with many edge crossings or static drawings that are difficult to modify, they are too slow for interactive analysis of large data sets, or they require input data in a specific format rather than obtaining the data directly from protein interaction databases. The fundamental weakness of a protein interaction network drawing lies in whether it is readable. A drawing of a protein interaction network should therefore focus on conveying interaction information quickly and accurately.

The conventional force-directed layout algorithm, called INTERVIEWER, is the most widely known method of visualizing an undirected graph. The algorithm produces an optimal layout based on a force model. A straightforward implementation of the conventional force-directed layout, however, runs into practical difficulties when drawing graphs with more than a few hundred nodes. The difficulties have two causes. First, each step of the optimization process computes the forces between all pairs of nodes. Second, for a large graph the optimization must be repeated a very large number of times to transform the initial random layout into an optimal one.

For these reasons, when a large-scale protein interaction network containing thousands of proteins or more is visualized interactively with a conventional graph drawing tool, processing is too slow or the graph is drawn with many edge crossings; in particular, unclear drawings with many edge crossings end up being abandoned.

The present invention has been proposed to solve the above problems of the prior art that arise when interactively visualizing a large-scale protein interaction network containing thousands of proteins or more. Its object is to provide a method that visualizes a protein interaction network quickly and simply by first laying out all connected components of the large-scale network, placing the nodes within each connected component around pivot nodes, and then adjusting the position of each node relative to its adjacent nodes.

FIG. 1 is an exemplary diagram showing cutvertices and nodes in a graph according to the present invention.

FIG. 2 is an exemplary diagram showing a graph layout that includes pivot nodes according to the present invention.

FIG. 3 shows a graph in which the original graph is replaced with a clique according to the present invention.

FIG. 4 shows a network in which a subnetwork of the original network is abstracted into a compound node according to the present invention.

FIG. 5 shows a network with 44387 interactions among 4242 human proteins according to an embodiment of the present invention.

FIG. 6 shows an example of network abstraction according to the present invention.

* Explanation of reference symbols for the main parts of the drawings *

11, 12, 13: paths

21: pivot node

c1, c2: cutvertices

v5 to v11: intermediate nodes

G1: first intermediate node group

G2: second intermediate node group

To achieve the above object, the method of visualizing a protein interaction network according to the present invention comprises: a first step of identifying and laying out the connected components in a protein-protein interaction network; a second step of determining the intermediate nodes and pivot nodes in each connected component; a third step of calculating, within each connected component, the distance from the pivot nodes to each node and placing each node relative to the pivot nodes based on the calculated distances; a fourth step of rearranging each connected component by placing the cutvertices associated with the intermediate nodes and the intermediate nodes associated with the adjacent nodes of those cutvertices; and a fifth step of rearranging each connected component by placing each node relative to the adjacent nodes within a distance set for that node.

The visualization method according to the present invention may further include a step of reducing the number of nodes and edges by replacing a group of nodes that have the same interactions with a single compound node.
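The node-reduction step can be illustrated with a minimal Python sketch. It assumes that "the same interactions" means an identical set of neighbors; the function name `merge_identical_nodes` and the adjacency-dictionary representation are hypothetical and do not come from the patent.

```python
from collections import defaultdict

def merge_identical_nodes(adj):
    """Group nodes that have exactly the same neighbor set.

    adj: dict mapping node -> set of neighboring nodes (undirected graph).
    Returns a sorted list of groups (each a sorted list of nodes). A group
    with more than one member could be drawn as a single compound node.
    Assumption: "same interactions" is read as "identical neighbor set".
    """
    groups = defaultdict(list)
    for node, neighbors in adj.items():
        groups[frozenset(neighbors)].append(node)
    return sorted(sorted(group) for group in groups.values())
```

Replacing each multi-member group with one compound node removes the duplicated edges, which is where the reduction in edge count would come from under this reading.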

The above objects and features will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. The accompanying drawings merely illustrate preferred embodiments of the present invention, and the invention is not limited to them. In the present invention, "network" and "graph" denote the same concept, and the two terms are used interchangeably in the description below. Preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary diagram showing cutvertices and nodes in a graph according to the present invention. In the example graph G of FIG. 1, c1 and c2 are cutvertices (also called articulation points; hereinafter written 'cutvertex' for terminological convenience). A cutvertex in the present invention is a node whose removal disconnects the graph to which it belongs. For example, removing node c1 in FIG. 1 disconnects nodes v1 and v2 from graph G, so c1 is a cutvertex. Likewise, c2 is a cutvertex.
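The cutvertex definition can be checked directly with a brute-force Python sketch: remove each node in turn and test whether the rest of the graph is still connected. The function names and the adjacency-dictionary input are assumptions made for illustration, and the sketch assumes the input graph is connected. The patent does not state how cutvertices are found; linear-time depth-first-search methods exist and would be preferable for large networks.

```python
def is_connected(adj, removed=None):
    """Return True if the graph is connected when `removed` is ignored."""
    nodes = [n for n in adj if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        cur = stack.pop()
        for nb in adj[cur]:
            if nb != removed and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == len(nodes)

def cut_vertices(adj):
    """Brute force: a node is a cutvertex if removing it disconnects the graph.

    Assumes `adj` (node -> set of neighbors) describes a connected graph.
    """
    return sorted(n for n in adj if not is_connected(adj, removed=n))
```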

The degree of a node v is the number of edges incident to v and is denoted deg(v). A path in graph G is a sequence (v1, v2, ..., vn) of distinct nodes of G such that (vi, vi+1) ∈ E for 1 ≤ i ≤ n-1. A graph G' = (V', E') is a subgraph of G = (V, E) if V' ⊆ V and E' ⊆ E ∩ (V' × V').

As shown in FIG. 1, when multiple paths exist between a pair of cutvertices, the nodes on those paths are called intermediate nodes. In the example of FIG. 1, three paths (11, 12, 13) run between the pair of cutvertices (c1, c2) within one connected component, and intermediate nodes (v5 to v11) lie on these paths. If the paths (11, 12, 13) between the pair of cutvertices (c1, c2) differ in length, the intermediate nodes on paths of the same length are grouped together. For example, as shown in FIG. 1, if the upper two paths (11, 12) have the same length, the intermediate nodes on them (v5, v6, v8, v9) form the first intermediate node group (G1), and the intermediate nodes (v7, v10, v11) on the remaining path (13), whose length differs from that of the other two, form the second intermediate node group (G2).
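The grouping rule for intermediate nodes can be sketched in Python as follows. The function name and the representation of a path as a list of nodes are assumptions for illustration. The text does not say which of v5 to v11 lies on which path, so any concrete path assignment used with this sketch is hypothetical.

```python
from collections import defaultdict

def group_intermediate_nodes(paths):
    """Group the intermediate nodes of equal-length paths between two cutvertices.

    paths: list of node lists, each running from one cutvertex to the other.
    Returns {path length in edges: sorted list of intermediate nodes}.
    """
    groups = defaultdict(set)
    for path in paths:
        length = len(path) - 1              # number of edges on the path
        groups[length].update(path[1:-1])   # drop the two cutvertex endpoints
    return {length: sorted(nodes) for length, nodes in groups.items()}
```

With two paths of three edges and one path of four edges, the sketch yields two groups, corresponding to G1 and G2 in the description.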

The protein interaction network visualization method of the present invention draws a graph using a multilevel technique consisting of a grouping phase and a layout phase. In the grouping phase, all connected components of the whole network (a connected component is a maximal connected subgraph of the graph) are identified and grouped, the intermediate nodes and pivot nodes of each connected component are found, and the distance from the pivot nodes to every node within each connected component is calculated. A pivot node serves as a key node in the layout of the graph. More specifically, pivot nodes are nodes selected according to a predefined rule for the layout of the graph according to the present invention; because a connected component is laid out by considering distances from its pivot nodes, the result varies with how the pivot nodes are selected and how many there are. If too many pivot nodes are selected, calculating the distances from the pivot nodes to every node in each connected component takes a long time; conversely, if too few are selected, the layout of the connected component to which they belong is poor.

In the layout phase, the layout of the connected components of the whole network is found first, which determines how the connected components are arranged relative to one another. Then, for each connected component, the layout of its nodes relative to the pivot nodes of that component is found, which adjusts the global layout within the component. Next, the layout of each connected component is refined by repositioning the intermediate nodes relative to their cutvertices and the direct neighbors of those cutvertices, which adjusts the local layout of the intermediate nodes within the component. Finally, the layout of each connected component is refined by repositioning every node relative to the nodes within distance 2 of it, which adjusts the local layout of all nodes within the component.

FIG. 2 is an exemplary diagram showing graph layouts that include pivot nodes according to the present invention: FIG. 2(a) shows the layout of a graph with pivot nodes selected from a mesh, and FIG. 2(b) shows the layout of a graph with pivot nodes selected from a protein interaction network. As shown in FIG. 2, pivot nodes (21) that are distributed almost uniformly over each connected component are selected in order to produce a high-quality layout. The number of pivot nodes (21) and the distances between them are determined by the number of nodes and the diameter of the network, where the diameter of a network is the maximum distance between any two of its nodes. In general, pivot nodes are selected for networks whose diameter is large relative to the number of nodes rather than for networks whose diameter is small relative to the number of nodes. For a small network with 100 nodes or fewer, pivot nodes are selected so that the distance between them is 3 or less. For a network with a diameter of 20 or less, the distance between pivot nodes is likewise reduced to a distance at which pivot nodes can still be selected. The specific way in which the visualization method identifies and lays out connected components, selects pivot nodes, calculates the distances between nodes, and places each node is described in detail below.
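Since the pivot-selection rules depend on the network diameter, a small Python sketch of that quantity may help. It takes the diameter as the largest breadth-first-search distance between any two nodes, matching the definition in the text. The function names are hypothetical, the graph is assumed to be connected and non-empty, and the patent does not state how the diameter is actually computed.

```python
from collections import deque

def eccentricity(adj, source):
    """Largest breadth-first-search distance from `source` to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        for nb in adj[cur]:
            if nb not in dist:
                dist[nb] = dist[cur] + 1
                queue.append(nb)
    return max(dist.values())

def diameter(adj):
    """Maximum shortest-path distance between any two nodes.

    Assumes `adj` (node -> iterable of neighbors) is connected and non-empty.
    """
    return max(eccentricity(adj, node) for node in adj)
```

This all-sources search is quadratic in the worst case, so it is only meant to make the definition concrete.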

The protein interaction network visualization process of the present invention is described below with reference to algorithms according to one embodiment. Note that the algorithms described below are merely one embodiment of the present invention; the invention is not limited to them and can be modified in various ways. In particular, the algorithms described below may also be implemented as other programs, preferably ones that a computer can read and execute.

The algorithms described below for implementing the visualization method will be readily understood by a person of ordinary skill in the art, who could also readily implement them on a device such as a computer.

The visualization process of the protein interaction network according to the present invention is now described with reference to the algorithms of one embodiment.

First, Algorithm 1 according to an embodiment of the present invention groups the nodes of a disconnected graph in the whole network into connected components and lays them out.

Algorithm 1 finds the connected components of a graph, that is, each of its connected subgraphs. For each node v in V, the set of all nodes of the whole graph, that has at least one edge and does not already belong to a connected component, a new group identifier is assigned to v, and the nodes connected to v are found by visiting the neighbors of v one at a time. While searching for all nodes connected to v, the algorithm takes the i-th node GLst[i] of GLst, the list that collects the nodes of the current connected component, and checks for each of its adjacent nodes u whether u already belongs to a connected component. If it does not, u is appended to GLst. The process is repeated for the remaining adjacent nodes.
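The listing of Algorithm 1 is not reproduced in this text, so the following Python sketch reconstructs the procedure from the description only. The variable `g_lst` plays the role of GLst; the function name, the dictionary-based graph representation, and the use of integers as group identifiers are assumptions.

```python
def find_connected_components(adj):
    """Assign a group identifier to every node that has at least one edge.

    adj: dict mapping node -> iterable of neighbors (undirected graph).
    Returns {node: group_id}. Nodes without edges are left out, since the
    description handles them separately.
    """
    group_of = {}
    next_id = 0
    for v in adj:
        if not adj[v] or v in group_of:
            continue                  # no edges, or already grouped
        group_of[v] = next_id
        g_lst = [v]                   # nodes of the current component (GLst)
        i = 0
        while i < len(g_lst):
            for u in adj[g_lst[i]]:   # neighbors of the i-th node, GLst[i]
                if u not in group_of:
                    group_of[u] = next_id
                    g_lst.append(u)
            i += 1
        next_id += 1
    return group_of
```

Growing a list while scanning it with an index gives a breadth-first traversal without a separate queue, which appears to be what the GLst[i] wording describes.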

Algorithm 2 according to an embodiment of the present invention calculates the distance between two nodes v and w that belong to the same connected component. The distance from a node to an adjacent node is taken as 1. That is, the distance from node a to an adjacent node b is 1, and the distance from node a to a node c adjacent to b is 2.

In Algorithm 2, the start node v is first entered into DLst together with 0, its distance from itself, and the repeat-until block is then executed from the first node of DLst to calculate the distance between the start node v and each node. More specifically, the current node v' and currentDist, the distance from the start node v to v', are read from DLst; every node u adjacent to v' that is not already in DLst is added to DLst with a distance of currentDist + 1. If u is the target node w, currentDist + 1 is returned as the result. DLst.First moves to the first node in DLst, and DLst.Add appends a node and its distance to DLst. DLst.Next moves to the next entry of the list, and DLst.Eof is true when no more nodes remain and false otherwise.
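The listing of Algorithm 2 is likewise missing from this text. A minimal Python sketch of the described breadth-first distance computation follows, with `d_lst` standing in for DLst and an integer cursor standing in for DLst.First, DLst.Next, and DLst.Eof. The function name, the `seen` set used for fast membership tests, and the `None` return value for unreachable nodes are assumptions.

```python
def graph_distance(adj, v, w):
    """Return the number of edges on a shortest path from v to w, or None."""
    if v == w:
        return 0
    d_lst = [(v, 0)]      # (node, distance from v), like DLst
    seen = {v}            # nodes already in d_lst
    pos = 0               # cursor: DLst.First, advanced like DLst.Next
    while pos < len(d_lst):               # stops when DLst.Eof would be true
        current, current_dist = d_lst[pos]
        for u in adj[current]:
            if u in seen:
                continue
            if u == w:
                return current_dist + 1   # target reached
            seen.add(u)
            d_lst.append((u, current_dist + 1))   # DLst.Add
        pos += 1
    return None           # w is not in the same connected component
```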

Because Algorithms 1 and 2 operate on nodes that have at least one edge, nodes without edges are placed after the connected components of size 2 or more have been placed when the connected components of the whole network are identified and laid out. For a graph with n nodes, the time complexity of Algorithm 1 is O(n) and that of Algorithm 2 is O(n·|PvN|), where |PvN| is the number of pivot nodes.

Algorithm 3 determines the pivot nodes after all connected components of the whole network have been identified and laid out. It determines the pivot nodes inside each connected component.

Algorithm 3 is executed after the layout of all connected components in the whole network has been determined. The computation starts by setting the first node (V[0]) of each connected component as its first pivot node. In Algorithm 3, DLst is a variable used to calculate the distances from each pivot node to the other nodes, and PvN holds each pivot node together with a distance table that stores the distances from that pivot node to the other nodes. MaxDist is a variable that stores the maximum distance within the current connected component. DLst.Clear resets DLst, after which the current pivot node is entered with a distance of 0 and ChkDistance is called to calculate the distances from each pivot node.

The algorithm called from Algorithm 3 to determine the pivot nodes is Algorithm 4. Algorithm 4 takes DLst, DistTable, MaxDist, and other parameters. It adds the distances of the neighbors of the current node in DLst to DLst and DistTable, picks out the current nodes v that are candidates for becoming a pivot node, and calls ChkPvN on them to add them as pivot nodes.

In Algorithm 4, DLst.GetCurrent(v, dist) retrieves the current node v and its distance dist; if dist is greater than the maximum distance MaxDist, MaxDist is replaced with dist. bAddPvN indicates whether the current node is a candidate pivot node. Two cases are checked at this stage: the case in which none of the neighbors of the current node v can be added to DLst, and the case in which the node lies at 1/n of MaxDist (1/3 in Algorithm 4, as an example). These two cases are therefore examined first.

In the sixth step of Algorithm 4, each node w adjacent to the current node v is checked for membership in DLst; if w is not in DLst, bAddPvN is set to false and w is added to DLst with a distance of dist + 1. DistTable stores distances in the order of the nodes of the current connected component rather than in the order of the nodes in DLst, so dist + 1, the distance from the pivot node to w, is written to DistTable(w).

If both bAddPvN and ChkPvN(v) are true, the current node v is added to the pivot nodes, and a new distance table DistTable' for v is created and initialized with 0, the distance from v to itself. ChkPvN(v) returns true for a node that satisfies the following conditions.

1) If the connected component has fewer than 40 nodes, the distance from every existing pivot node to node v must be at least 2.

2) If the connected component has at least 40 but fewer than 100 nodes, the distance from every existing pivot node to node v must be at least 3.

3) If the connected component has 100 or more nodes:

(a) if the diameter d of the connected component is less than 7, the number of edges of node v, degree(v), must be at least 3;

(b) if 7 ≤ d < 15, degree(v) must be at least 4;

(c) if 15 ≤ d < 20, degree(v) must be at least 5;

(d) otherwise, with R denoting the ratio of the diameter of the connected component to its number of nodes:

i) if R < 0.01, the distance from every existing pivot node to node v must be at least 40;

ii) if 0.01 ≤ R < 0.02, the distance from every existing pivot node to node v must be at least 17, adjusted to 30 if the total number of nodes exceeds 1000;

iii) if 0.02 ≤ R < 0.035, the distance from every existing pivot node to node v must be at least 13, adjusted to 20 if the total number of nodes exceeds 1000;

iv) if 0.035 ≤ R < 0.07, the distance from every existing pivot node to node v must be at least 10;

v) if R ≥ 0.07, the distance from every existing pivot node to node v must be at least 5.
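The conditions listed above can be collected into one predicate. The following Python sketch is an illustration under stated assumptions: the parameter names are hypothetical, `min_pivot_dist` is taken to be the smallest distance from any existing pivot node to v, R is computed as diameter divided by the node count of the component, and "total number of nodes" is passed separately as `total_nodes` because the text does not say whether it refers to the component or to the whole network.

```python
def chk_pvn(n, diameter, degree_v, min_pivot_dist, total_nodes):
    """Return True if node v qualifies as a new pivot node.

    n: number of nodes in the connected component.
    diameter: diameter d of the connected component.
    degree_v: number of edges of node v.
    min_pivot_dist: smallest distance from an existing pivot node to v.
    total_nodes: node count compared against 1000 (interpretation assumed).
    """
    if n < 40:
        return min_pivot_dist >= 2
    if n < 100:
        return min_pivot_dist >= 3
    # Components with 100 or more nodes: first decide by diameter and degree.
    if diameter < 7:
        return degree_v >= 3
    if diameter < 15:
        return degree_v >= 4
    if diameter < 20:
        return degree_v >= 5
    # Diameter of 20 or more: decide by the ratio R = diameter / n.
    ratio = diameter / n
    if ratio < 0.01:
        required = 40
    elif ratio < 0.02:
        required = 30 if total_nodes > 1000 else 17
    elif ratio < 0.035:
        required = 20 if total_nodes > 1000 else 13
    elif ratio < 0.07:
        required = 10
    else:
        required = 5
    return min_pivot_dist >= required
```

Keeping the thresholds in one function makes it easy to swap in other values, which the description explicitly allows.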

Conditions 1 to 3 above are merely one embodiment of the present invention, and the numerical values used may be changed according to conditions such as the size of the whole network and the number of nodes. Likewise, Algorithm 4 represents just one embodiment for determining an appropriate number of pivot nodes.

When a pivot node is selected in Algorithms 3 and 4, the distances from that pivot node to all other nodes are computed. Algorithm 4 checks whether the current node is already a pivot node. If it is not, the algorithm decides whether to include the node in the pivot node set PvN based on its distance from the existing pivot nodes and on the structure of the connected component (that is, the diameter, the number of nodes, and the number of edges of the connected component). Algorithms 3 and 4 take O(n) time for a single pivot node, so the overall time complexity of selecting all pivot nodes is O(|PvN|·n).
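
The description does not state how the distances from a pivot node are obtained. The sketch below assumes a breadth-first search over an adjacency-set dictionary, and the function names are hypothetical. Only the name DistTable is taken from the description. Note that breadth-first search costs O(n + m) per pivot node for m edges, whereas the description states O(n).

```python
from collections import deque


def bfs_distances(adj, source):
    """Graph distance (number of edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def build_dist_table(adj, pivots):
    """One distance map per pivot node: dist_table[pivot][node]."""
    return {pivot: bfs_distances(adj, pivot) for pivot in pivots}
```

With one search per pivot node, the total cost grows with the number of pivot nodes, which is consistent with the |PvN| factor in the stated complexity.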

Algorithm 5 determines the position of the current node v relative to a set of other nodes V'. The set V' and the distance Distance(u,v) to each other node u are chosen differently at each placement stage. The position of the current node v is always determined relative to the reference node set V', where V' is a subset of V.

In Algorithm 5, Δ (delta) denotes the relative coordinates obtained by subtracting the coordinates of v from the coordinates of u. D is initialized to 0 and is then updated by the calculation for each u. The term Δ(1 − Distance(u,v)/|Δ|) moves v in the (−) direction when the current distance between u and v is greater than Distance(u,v), the distance in the actual network (graph), and in the (+) direction when it is smaller. In other words, the term can be written as 'Δ − (Distance(u,v) × unit vector of Δ)', so that for each of the x, y, and z components, the value of Δ minus the graph distance expressed as (Distance × unit vector) is added to D.

Because the displacement has been accumulated over all nodes in the set V', D is then divided by the number of nodes in V' to obtain the average, and the current position of v is shifted by that average to give the new position of v.
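
The update described for Algorithm 5 can be sketched as follows. This is an illustrative reading of the description, not the original pseudocode. The function name relocate and the data layout are hypothetical, and skipping a reference node that coincides with v is an added assumption.

```python
import math


def relocate(v_pos, refs):
    """Move v relative to the reference nodes in V'.

    v_pos: (x, y, z) coordinates of v.
    refs:  list of (u_pos, graph_distance) pairs, one per node u in V'.
    """
    if not refs:
        return tuple(v_pos)
    total = [0.0, 0.0, 0.0]                             # D, initialised to 0
    for u_pos, graph_dist in refs:
        delta = [u - v for u, v in zip(u_pos, v_pos)]   # delta = u - v
        norm = math.sqrt(sum(c * c for c in delta))     # |delta|
        if norm == 0.0:
            continue  # assumption: no direction is defined, so u is skipped
        scale = 1.0 - graph_dist / norm
        for i in range(3):
            total[i] += delta[i] * scale
    # Shift v by the average displacement D / |V'|.
    return tuple(v + t / len(refs) for v, t in zip(v_pos, total))
```

With a single reference node at (3, 0, 0) and a graph distance of 1, a node at the origin moves to approximately (2, 0, 0), so that its drawn distance to the reference node matches the graph distance.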

Using Algorithm 5, the entire network is laid out as follows. The layout is divided into a global layout, which places the connected components, and a local layout, which places the nodes inside each connected component. The local layout is further divided into pivot node placement, intermediate node placement, and adjacent node placement.

In the global layout, V' is the set of pivot nodes of the other connected components, to which v does not belong. In pivot node placement, V' is the set of pivot nodes of the connected component to which v belongs. In intermediate node placement, V' is the cutvertex together with the set of nodes directly adjacent to that cutvertex. In adjacent node placement, V' is the set of nodes at a distance of less than 2 from v.

More specifically, in the global layout, v ranges over the pivot nodes of the connected component currently being computed, and V' consists of the pivot nodes computed in the previous steps. Distance(u,v) is calculated using the maximum distance among all connected components.

In pivot node placement within the local layout, v ranges over all nodes of the current connected component, and V' is the set of pivot nodes of that connected component. In this case, Distance(u,v) is taken from the DistTable computed in Algorithm 3, which determines the pivot nodes.

In intermediate node placement, v is a node that is adjacent to one of the two cutvertices and belongs to the intermediate nodes, and V' consists of the cutvertex and the nodes neighboring the cutvertex. Distance(u,v) in this case is calculated as follows. Suppose that v5 in the connected component of FIG. 1 is the node v to be placed. The distance from v5 to its adjacent node c1 is 1, and the distance to each of the neighbors of c1 (v1, v2, v6, v7) is 2. The distance to c2, the cutvertex on the opposite side, is the distance between c1 and c2 minus 1, and the distance to v8 is the distance between c1 and c2 minus 2. The distance to v3, v4, v9, and v10 is the distance between c1 and c2.
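
The distance assignments of the FIG. 1 example can be written out as a small lookup. The sketch below only transcribes the values stated for node v5. The function name is hypothetical, and the distance between c1 and c2 is left as a parameter because its value is not given in the description.

```python
def distances_from_v5(d_c1_c2):
    """Distance(u, v5) for the reference nodes named in the FIG. 1 example.

    d_c1_c2: distance between the two cutvertices c1 and c2.
    """
    d = d_c1_c2
    dist = {"c1": 1}                                         # cutvertex adjacent to v5
    dist.update({n: 2 for n in ("v1", "v2", "v6", "v7")})    # neighbours of c1
    dist["c2"] = d - 1                                       # opposite cutvertex
    dist["v8"] = d - 2
    dist.update({n: d for n in ("v3", "v4", "v9", "v10")})
    return dist
```

Calling distances_from_v5(4), with 4 chosen arbitrarily for illustration, gives a distance of 3 to c2, 2 to v8, and 4 to v3, v4, v9, and v10.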

Intermediate nodes (v5 to v11) are handled both when there are cutvertices on both sides (c1, c2), as in the example of FIG. 1, and when there is a cutvertex on only one side and a terminal node on the other. When there is a cutvertex on only one side (c1), intermediate node placement is carried out using only that cutvertex (c1) and the nodes connected to it (v5, v6, v7). Furthermore, intermediate node placement does not place every intermediate node: only the nodes connected to a cutvertex are placed, and a node lying in the middle of the intermediate nodes (v11) is not placed separately.

Finally, in adjacent node placement, the node currently being computed is taken as v, and the set of nodes at distance 2 from v is taken as V'. In this case, Distance(u,v) is the distance from node v to each of those nodes. The global layout and, within the local layout, the pivot node placement and intermediate node placement are computed repeatedly. The repetition continues until the ratio between each edge distance and the actual distance falls below a given level. Adjacent node placement is then performed last, which completes the entire layout process.

A single execution of Algorithm 5 takes O(|V'|) time. Therefore, the overall time complexity from the global layout stage through the local layout stage is O(n·|PvN|), where |PvN| is the number of pivot nodes.

In a complex protein interaction network with many edges and nodes, the clutter of edges and nodes often reduces the readability of the network. In general, there are two ways to analyze such a complex network. One is to extract smaller sub-networks from the entire network and analyze them one at a time; the other is to abstract the entire network into a smaller network. The present invention can extract a sub-network in several ways. For example, it can extract the sub-network of proteins within a specified interaction distance of one or more target proteins, or the sub-network of proteins shared by several protein interaction networks. An abstracted network can be expanded back into the detailed network on demand. For network abstraction, the present invention provides two functions. First, it abstracts a clique into a star-shaped subgraph. That is, a clique (a complete subgraph in which every pair of nodes is connected by an edge) is replaced by a star-shaped subgraph centered on a dummy node, which appears as a circle in the abstracted graph. Second, it abstracts a group of nodes with identical interactions into a composite node. That is, a group of nodes with the same interaction partners is abstracted into a single composite node, which appears as a diamond in the abstracted graph.
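
Extracting the sub-network within a given interaction distance of one or more target proteins can be sketched as a bounded breadth-first search. The description does not specify the procedure, so the search strategy, the function name, and the adjacency-set representation are all assumptions.

```python
from collections import deque


def extract_subnetwork(adj, targets, max_dist):
    """Induced sub-network of all nodes within max_dist edges of any target."""
    dist = {t: 0 for t in targets}
    queue = deque(targets)
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue                      # do not expand beyond the cutoff
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    keep = set(dist)
    # Keep only the edges whose two endpoints both survive.
    return {n: {m for m in adj[n] if m in keep} for n in keep}
```

The result has the same adjacency-set shape as the input, so it could be handed to the same layout routine as the full network.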

A clique with n nodes contains n(n−1)/2 edges, whereas the star-shaped graph for that clique contains exactly n edges. Replacing a clique with a star-shaped graph therefore reduces the number of edges substantially. Finding a maximum-size clique in a graph is an NP-hard problem. Algorithms 6 and 7 present an efficient method according to the invention for identifying edge-disjoint cliques. Because abstracting a clique of size 3 does not reduce the number of edges, abstraction is applied only to cliques of size 4 or more (see Algorithm 6).
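
The edge counts above can be checked with a short sketch that replaces a clique by a star centered on a dummy node. The adjacency-set representation and the function names are assumptions made for illustration.

```python
def clique_edge_count(n):
    """Number of edges in a clique with n nodes: n(n-1)/2."""
    return n * (n - 1) // 2


def edge_count(adj):
    """Number of undirected edges in an adjacency-set dictionary."""
    return sum(len(neighbours) for neighbours in adj.values()) // 2


def replace_clique_with_star(adj, clique, dummy):
    """Replace the edges among the clique members by edges to a dummy node.

    Modifies adj in place and returns it.
    """
    members = set(clique)
    for a in members:
        adj[a] -= members - {a}   # drop the edges inside the clique
        adj[a].add(dummy)         # connect the member to the star centre
    adj[dummy] = set(members)
    return adj
```

For a clique of 5 nodes this reduces 10 edges to 5, whereas for a clique of 3 nodes the count stays at 3, which matches the stated reason for abstracting only cliques of size 4 or more.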

Algorithm 6 finds cliques, that is, complete subgraphs within the graph, and can be used to reduce the number of edges by representing each clique through a single dummy node. For every node v in the connected component V, the computation is carried out for each adjacent node u of v. After Lst is initialized, if u and v do not already form part of a clique, Lst is seeded with the entries (v, index of u in v + 1) and (u, index of v in u + 1). Next, it is checked whether v and the adjacent node of v at position Lst[0].Idx already form a clique, and that node is passed to ChkClique. Lst[0].Idx is then incremented by 1, and this is repeated until Lst[0].Idx is greater than or equal to the number of edges of the node Lst[0].node. If, after this process, four or more nodes remain in Lst, a clique is formed from the nodes in Lst. FIG. 3 shows the result of the clique replacement applied to a graph according to the present invention.

Algorithm 7 is a function called from Algorithm 6. It examines the nodes in Lst, decides whether another node can be added to Lst, and adds it if so. Here, N denotes the position in Lst currently being examined, and NVal holds the node that is a candidate for addition.

In Algorithm 7, if N equals the number of nodes in Lst (that is, all nodes have been examined and it has been determined that NVal can be added), NVal is added to Lst with its idx initialized to 0. Otherwise, while the node at position idx of the current level's node precedes NVal, the idx of the current level is incremented until that node is equal to or later than NVal. If the two are equal, the check proceeds to the next level; if the node at idx is a later node than NVal, control returns to the upper level and a new NVal is chosen. This check is performed for all adjacent nodes of the top-level node, which yields one clique.
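
The index-based bookkeeping of Algorithms 6 and 7 (Lst, Idx, ChkClique) is not reproduced here. As a simplified stand-in, the following sketch greedily grows a clique from each unused edge and accepts it only if it reaches size 4, marking its edges as used so that the resulting cliques are edge-disjoint. All names are hypothetical, node labels are assumed to be sortable, and a greedy search may find fewer or smaller cliques than the method of the invention.

```python
from itertools import combinations


def edge_disjoint_cliques(adj, min_size=4):
    """Greedy search for edge-disjoint cliques with at least min_size nodes."""
    used = set()      # edges already assigned to a clique, as frozensets
    cliques = []

    def free(a, b):
        # True if a-b is an edge that no accepted clique has claimed yet.
        return b in adj[a] and frozenset((a, b)) not in used

    for v in sorted(adj):
        for u in sorted(adj[v]):
            if not free(u, v):
                continue
            members = [v, u]
            # Add every neighbour of v joined to all current members by free edges.
            for w in sorted(adj[v]):
                if w not in members and all(free(w, m) for m in members):
                    members.append(w)
            if len(members) >= min_size:
                cliques.append(members)
                used.update(frozenset(p) for p in combinations(members, 2))
    return cliques
```

Each clique returned in this way could then be replaced by a star-shaped subgraph centered on a dummy node.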

A method for identifying all groups of nodes with identical interactions is given in Algorithm 8. Abstracting a clique into a star-shaped graph reduces only the number of edges, whereas abstracting nodes with identical interactions into a composite node reduces the number of nodes as well as the number of edges.

Algorithm 8 finds nodes that have the same connections and represents them as a single dummy node. For every node v in V that is not already included in a composite node, CList is initialized, and every node v' that has the same number of adjacent nodes as v and the same edge configuration as v is added to CList. When there are no more nodes to add, the nodes in CList are converted into a composite node. FIG. 4 shows a network in which sub-networks of the original network have been abstracted into composite nodes according to the present invention. As shown in FIG. 4, abstracting groups of nodes with the same interaction pattern into composite nodes allows subgraphs that are dense in terminal nodes to be represented very simply.
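
The grouping performed by Algorithm 8 can be sketched by keying each node on its set of neighbors. Algorithm 8 as described compares candidate nodes that have the same number of adjacent nodes; the dictionary-based grouping below is a substitute technique that is expected to produce the same groups. The function name and the data layout are assumptions.

```python
from collections import defaultdict


def composite_groups(adj):
    """Groups of two or more nodes that share exactly the same neighbours."""
    groups = defaultdict(list)
    for node in sorted(adj):
        groups[frozenset(adj[node])].append(node)
    return [members for members in groups.values() if len(members) > 1]
```

Each returned group could be collapsed into a single composite node that keeps one edge to each shared interaction partner, which reduces both the number of nodes and the number of edges.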

FIG. 5 shows a network of 44387 interactions among 4242 human proteins according to one embodiment of the present invention. The figure appears to contain edge crossings, but when the network is visualized as a three-dimensional drawing on a video monitor it actually contains none.

FIG. 6 shows an example of network abstraction according to the present invention. FIG. 6(a) is a protein interaction network with 307 nodes and 1063 edges. FIG. 6(b) is a graph with 311 nodes and 700 edges, simplified by replacing the cliques of FIG. 6(a) with star-shaped subgraphs centered on dummy nodes, which are shown as circles. FIG. 6(c) is a graph with 47 nodes and 115 edges, simplified by abstracting each group of nodes with the same interaction partners in FIG. 6(b) into a composite node, shown as a diamond. As FIG. 6 illustrates, for a graph with many cliques a single abstraction function alone is very effective in reducing the complexity of the graph. For a graph with few cliques, applying both abstraction functions can reduce the complexity considerably.

Table 1 below shows the results of running the algorithm of the present invention and two other graph-drawing programs, Pajek and Tulip, in order to compare their actual execution times.

| Program (algorithm) | Y2H data (3751 nodes, 12917 edges) | BIND data (4048 nodes, 8286 edges) | DIP data (4690 nodes, 14460 edges) | Human map 1 (8654 nodes, 184407 edges) | Human map 2 (12056 nodes, 6989558 edges) |
| --- | --- | --- | --- | --- | --- |
| Present invention | 7 s | 6 s | 7 s | 19 s | 8 min 20 s |
| Pajek (K-K) | 2 min 31 s | 1 min 37 s | 3 min 04 s | 56 min 38 s | Out of memory |
| Pajek (F-R) | 28 min 23 s | 20 min 02 s | 42 min 45 s | 2 h 28 min 40 s | 5 h 43 min 32 s |
| Tulip (GEM) | 2 min 10 s | 4 min 40 s | 18 min 40 s | 9 h 19 min 10 s | More than 10 h |
| Tulip (S-E) | 24 min 35 s | 35 min 47 s | 56 min 45 s | More than 10 h | More than 10 h |

Table 1 shows the execution times of five layout algorithms on the same set of test cases: the layout algorithm of the present invention, the Kamada-Kawai layout of Pajek, the Fruchterman-Reingold layout of Pajek, the GEM layout of Tulip, and the Spring-Electric force layout of Tulip. Pajek with the Kamada-Kawai layout could not visualize human map 2 because of an 'out of memory' error. As Table 1 shows, the layout algorithm of the present invention runs faster than the force-directed layout algorithms.
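
To make the comparison in Table 1 concrete, the reported times can be converted to seconds and divided by the time of the present invention. The sketch below does this for the Y2H column only. The helper names are hypothetical, and the figures are copied from Table 1.

```python
def seconds(h=0, m=0, s=0):
    """Convert hours, minutes, and seconds to seconds."""
    return 3600 * h + 60 * m + s


# Y2H column of Table 1 (3751 nodes, 12917 edges).
y2h = {
    "Present invention": seconds(s=7),
    "Pajek (K-K)": seconds(m=2, s=31),
    "Pajek (F-R)": seconds(m=28, s=23),
    "Tulip (GEM)": seconds(m=2, s=10),
    "Tulip (S-E)": seconds(m=24, s=35),
}


def speedups(column, baseline="Present invention"):
    """Ratio of each program's time to the baseline time."""
    base = column[baseline]
    return {name: t / base for name, t in column.items() if name != baseline}
```

On the Y2H data this gives ratios between roughly 19 (Tulip GEM, 130 s against 7 s) and roughly 243 (Pajek F-R, 1703 s against 7 s).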

As described above, the present invention provides a new algorithm for drawing large-scale protein interaction networks in three-dimensional space. The algorithm runs faster than other force-directed drawing algorithms and can be used not only to visualize protein interactions but also to explore individual connected components or subgraphs interactively, which makes it very useful for studying large-scale protein interactions. The invention also provides a framework that queries an integrated protein interaction database and directly visualizes the query results, which makes it easier to visualize and analyze large amounts of updated data. Furthermore, the invention produces clear and aesthetically pleasing drawings of large-scale protein interaction networks while running faster than other force-directed algorithms. Work is under way to extend it to a web-based application that supports wired and wireless communication.

The network layout algorithm and abstraction functions of the present invention can be implemented in, for example, Borland Delphi 6.0. The protein interaction database can be built using Microsoft Data Access Components 2.7. The invention can run on any PC running Windows 2000/XP/Me/98/NT 4.0. Those of ordinary skill in the art will readily be able to apply devices and means for implementing the network visualization algorithm of the present invention, such as the programs and systems mentioned above.

The detailed description and drawings disclose a preferred embodiment to aid understanding of the present invention and do not limit its scope. The scope of the invention is to be determined not by the foregoing detailed description but by the appended claims.

According to the present invention, the following effects are obtained compared with the existing force-directed layout algorithm called InterViewer.

First, better drawings can be produced without calculating the forces between pairs of nodes, and execution is fast.

Second, a number of abstraction functions are provided, so that a complex network can be reduced to a simpler network.

Third, multiple protein interaction networks can be compared with one another with respect to the proteins, or the interactions between proteins, that are shared by some or all of the networks.

Fourth, the invention is easy to use because it can be run in a web browser.

Claims

Identifying and placing connected components in a protein-protein interaction network;

Determining an intermediate node and a pivot node in each of the connected components;

Calculating the distance from the pivot node to each node in each connected component, and placing each node relative to the pivot node based on the calculated distance;

Rearranging each connected component by placing a cutvertex associated with the intermediate node and placing the intermediate node relative to the adjacent nodes of the cutvertex; and

Rearranging each connected component by placing each node relative to the adjacent nodes within a distance set for that node.

The method of claim 1, wherein the first step comprises:

Determining whether a node having at least one edge already belongs to another connected component;

If the node is determined not to belong to another connected component, assigning an identifier of a new connected component to the node and searching for neighbor nodes connected to the node; and

Determining whether each neighbor node connected to the node belongs to another connected component and, if not, adding the neighbor node to the connected component of the node.

The method of claim 1, wherein the determination of the pivot node in the second step comprises:

Setting a first node of the connected component as a first pivot node;

Determining whether the current node is a pivot node and, if it is not, calculating its distance from the first pivot node; and

Determining a pivot node by examining the likelihood that the current node becomes a pivot node based on the calculated distance and the number of nodes in the connected component.

The method according to claim 1 or 3, wherein the distance from the pivot node to the current node is at least 3 when the number of nodes in the connected component is substantially 100 or less.

The method of claim 1, wherein the distance between adjacent nodes is 1.

The method of claim 1, wherein, in the fifth step, each node is placed relative to the adjacent nodes at a distance of less than 2 from that node.

The method of claim 1, further comprising reducing the number of nodes and edges by replacing a group of nodes having the same interactions with a single composite node.