KR101556060B1 - Method and system for searching subgraph isomorphism using candidate region exploration

Info

Publication number: KR101556060B1
Application number: KR1020140050819A
Authority: KR
Inventors: 한욱신; 이정훈
Original assignee: 포항공과대학교 산학협력단; 서울대학교산학협력단
Priority date: 2014-04-28
Filing date: 2014-04-28
Publication date: 2015-10-01

Abstract

The present invention relates to a subgraph isomorphism search method and system that use candidate region exploration, a new concept that enables efficient and robust subgraph search. The method comprises the steps of: selecting a start query vertex (u_s) from a query graph (q); obtaining a breadth-first search (BFS) tree (q′) from the query graph (q); for each selected start vertex, exploring a candidate region by traversing a data graph (g) from the root vertex (u_s′) of the BFS tree (q′); determining a matching order for each explored candidate region; mapping the start vertex (u_s′) of the BFS tree (q′) to the start vertex (v_s) of the corresponding candidate region; and generating all possible embeddings for each candidate region. With this configuration, the invention speeds up the checks performed during candidate region exploration and efficiently prunes unpromising data vertices. Moreover, because selectivity is computed for each candidate region rather than for the entire graph, this more accurate information yields a more robust matching order.

Description

Method and system for searching subgraph isomorphism using candidate region exploration

The present invention relates to a subgraph isomorphism search method and system using candidate region exploration. More particularly, it relates to a subgraph isomorphism search method and system that introduce the new concept of candidate region exploration, which immediately identifies the candidate subgraphs (i.e., candidate regions) containing the search results and computes a robust matching order for each explored candidate region, thereby achieving fast, efficient, and robust subgraph search.

Graphs are widely used to model complex structures such as molecular structures, protein interaction networks, and program dependencies. As graphs grow in importance, efficient graph data management has attracted a great deal of research.

The subgraph isomorphism search problem is, given a query graph (q) and a data graph (g), to find all occurrences of the query graph (q) in the data graph (g). It is one of the most fundamental query types that a graph database must support and is used in many applications. Although the problem is NP-hard, many algorithms have been proposed that solve it within a reasonable time on real data sets.

Most subgraph isomorphism algorithms are based on backtracking (see J. Lee, W.-S. Han, R. Kasperovics, and J.-H. Lee. An in-depth comparison of subgraph isomorphism algorithms in graph databases. PVLDB, 6(2), 2013, http://www-db.knu.ac.kr/vldb13.pdf).

These algorithms first obtain a set of candidate vertices (C(u)) for each query vertex (u) of the query graph (q). To do so, they maintain a mapping table that maps each label to the list of vertices carrying that label. Given a matching order (O), each algorithm invokes the main recursive subroutine (the SUBGRAPHSEARCH function), which matches one query vertex to one data vertex at a time, following the matching order (O). The matching order has a profound effect on the performance of subgraph isomorphism search.

Once a query vertex (u) is selected in this subroutine (the SUBGRAPHSEARCH function), each algorithm refines C(u) using its own pruning rules. Next, for each refined candidate vertex (v), it checks whether the edges between the query vertex (u) and the already matched query vertices have corresponding edges between the candidate vertex (v) and the already matched data vertices in the data graph (g). If the candidate vertex (v) satisfies this condition, the query vertex (u) is matched to the candidate vertex (v). The algorithm then calls the subroutine (the SUBGRAPHSEARCH function) recursively to match the remaining query vertices. The subgraph isomorphism search ends once all possible embeddings have been obtained.
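
The generic backtracking scheme described above can be sketched as follows. This is a minimal illustration rather than any of the cited algorithms: the function name `subgraph_search`, the dictionary-based graph representation, and the two pruning rules used to refine C(u) (label equality and a degree test) are assumptions made for the example.

```python
def subgraph_search(q_adj, q_label, g_adj, g_label, order):
    """Enumerate all embeddings of a query graph in a data graph by backtracking.

    q_adj / g_adj map each vertex to the set of its neighbours (undirected).
    order is the matching order O, a list containing every query vertex once.
    """
    embeddings = []
    mapping = {}   # query vertex -> data vertex matched so far
    used = set()   # data vertices already matched

    def recurse(depth):
        if depth == len(order):
            embeddings.append(dict(mapping))
            return
        u = order[depth]
        for v in g_adj:
            # Refine C(u): same label, not yet used, and a large enough degree.
            if v in used or g_label[v] != q_label[u]:
                continue
            if len(g_adj[v]) < len(q_adj[u]):
                continue
            # Every edge from u to a matched query vertex needs a data edge.
            if all(mapping[w] in g_adj[v] for w in q_adj[u] if w in mapping):
                mapping[u] = v
                used.add(v)
                recurse(depth + 1)
                del mapping[u]      # restore state before trying the next v
                used.discard(v)

    recurse(0)
    return embeddings


# Toy input (hypothetical): a triangle query A-B-C and a small data graph.
q_adj = {"qa": {"qb", "qc"}, "qb": {"qa", "qc"}, "qc": {"qa", "qb"}}
q_label = {"qa": "A", "qb": "B", "qc": "C"}
g_adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
g_label = {0: "A", 1: "B", 2: "C", 3: "B"}
embeddings = subgraph_search(q_adj, q_label, g_adj, g_label, ["qa", "qb", "qc"])
print(embeddings)
```

In this sketch the matching order is simply passed in by the caller, which makes it easy to see how different orders change the amount of backtracking.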

The Ullmann algorithm (see J. R. Ullmann. An algorithm for subgraph isomorphism. J. ACM, 23:31-42, January 1976) was the first algorithm to address the subgraph isomorphism search problem. Because it uses the input order of the query vertices as the matching order and has almost no pruning rules for filtering out candidate vertices, it is the slowest of all existing algorithms.

VF2 (see L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (sub)graph isomorphism algorithm for matching large graphs. IEEE PAMI, 26(10):1367-1372, 2004) optimizes the Ullmann algorithm in the following respect. In the subroutine (the SUBGRAPHSEARCH function), the Ullmann algorithm selects the current query vertex (u) arbitrarily, whereas VF2 selects as the current query vertex a query vertex (u) that is connected to the previously matched query vertices. VF2 uses this constraint to remove candidate vertices from the candidate set (C(u)). However, VF2 also does not consider how to choose a good matching order.

The QuickSI algorithm (see P. Zhao and J. Han. On graph query optimization in large networks. PVLDB, 3(1):340-351, 2010) proposed a method for choosing the matching order. It selects, as early as possible, query vertices with infrequent vertex labels and infrequent adjacent edge labels. This greedy method works well on some data sets but causes serious matching-order problems on others. For example, on a human protein interaction network data set, choosing a query edge that appears infrequently in the data (i.e., a more selective edge) ends up forcing the choice of a less selective query path that appears frequently in the data. If statistics were maintained for all distinct paths, the selectivity of a path could be computed exactly. However, this would require very large storage space and high maintenance cost, so it is infeasible in practice.

GraphQL (see H. He and A. K. Singh. Graphs-at-a-time: query language and access methods for graph databases. In SIGMOD, pages 405-418, 2008) and SPath (see P. Zhao and J. Han. On graph query optimization in large networks. PVLDB, 3(1):340-351, 2010) focus on minimizing the size of C(u) by means of neighborhood-based filtering. GraphQL reduces the size of C(u) through a pseudo-isomorphism test: it builds BFS search trees for the query vertex (u) and for each candidate vertex (v) ∈ C(u), checks whether a containment relationship holds between the two trees, and repeats this check while expanding the BFS search trees. However, its matching-order selection strategy relies on a rather simple formula and therefore suffers from serious matching-order problems. Moreover, the values of the parameters used to store the neighborhood information of each data vertex must be chosen by the user.

Recently, Wang et al. (see Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matching on billion node graphs. PVLDB, 5(9):788-799, 2012) presented a new subgraph isomorphism algorithm. The algorithm is implemented on the memory cloud of Trinity (see B. Shao, H. Wang, and Y. Li. The trinity graph engine. Technical Report 161291, Microsoft Research, 2012). It first decomposes the query into a set of basic unit queries called STwigs and then obtains all possible intermediate results for each STwig. At this point, large volumes of intermediate results must be shipped to other servers so that they can be merged. The algorithm then merges the intermediate results with a block-based pipelined join. To determine the join order, it uses sampling-based join optimization. However, because the proposed algorithm was not compared experimentally with other existing algorithms, it is unclear whether it outperforms them. Its most notable advantage is that it maintains no auxiliary index structure other than a mapping table that maps each label to the list of vertices carrying that label. As a result, its cost grows slowly as the graph grows, so it can support large graphs.

Graph-index-based subgraph isomorphism algorithms also exist, such as gIndex (see X. Yan, P. S. Yu, and J. Han. Graph indexing: A frequent structure-based approach. In SIGMOD, pages 335-346, 2004; and X. Yan, P. S. Yu, and J. Han. Graph indexing based on discriminative frequent structure analysis. ACM Trans. Database Syst., 30(4):960-993, 2005), Tree+Δ (see P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: Tree + delta >= graph. In VLDB, pages 938-949, 2007), SwiftIndex (see H. Shang, Y. Zhang, X. Lin, and J. X. Yu. Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. PVLDB, 1(1):364-375, 2008), C-Tree (see S. Zhang, M. Hu, and J. Yang. Treepi: A novel graph indexing method. In ICDE, pages 966-975, 2007), and gCode (see L. Zou, L. Chen, J. X. Yu, and Y. Lu. A novel spectral coding in a large graph database. In EDBT, pages 181-192, 2008). These algorithms follow a filter-and-refine strategy built on graph indexes: using the graph index as a filter, they can cheaply discard many irrelevant graphs during query processing. They store pairs consisting of a graph feature and the posting list of the graphs that contain that feature. Given a query graph (q), they enumerate all subgraphs of the query up to size maxL and check whether each such feature is in the index. They then retrieve the posting lists, which consist of graph IDs, associated with the matching features and intersect them to obtain the candidate graphs. The result of the filtering step is therefore a set of graph IDs. Because index-based algorithms must still run a subgraph isomorphism test to verify that the query graph is contained in each candidate graph stored in the database, such graph indexes are of little use when the database holds only a single large graph.
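
The filtering step of such index-based methods amounts to intersecting posting lists. The sketch below illustrates only that step, under assumed names: `filter_candidates`, the feature strings, and the graph IDs are hypothetical, and real indexes such as gIndex choose and encode features quite differently.

```python
def filter_candidates(index, query_features, all_graph_ids):
    """Return the IDs of graphs that contain every indexed query feature.

    index maps a feature to the set of graph IDs whose graph contains it.
    Features missing from the index cannot be used for filtering and are skipped.
    """
    candidates = set(all_graph_ids)
    for feature in query_features:
        if feature in index:
            candidates &= index[feature]
    return candidates


# Hypothetical index over four database graphs, with path-shaped features.
index = {"A-B": {1, 2, 3}, "B-C": {2, 3}, "A-B-C": {3}}
candidate_ids = filter_candidates(index, ["A-B", "B-C", "C-D"], {1, 2, 3, 4})
print(candidate_ids)
```

Every surviving graph ID still has to go through a subgraph isomorphism test in the refinement step, which is why this approach helps little when the database is one large graph.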

The GraMi algorithm (see M. E. Saeedy and P. Kalnis. Grami: Generalized frequent pattern mining in a single large graph.) is a frequent subgraph mining algorithm for a single graph (g). GraMi finds more general subgraph patterns in which an edge of the pattern may correspond to a distance-constrained path in the single graph (g). A pattern (P) is defined to be frequent in the single graph (g) if the graph contains at least σ embeddings of the pattern (i.e., the support of P is at least σ). To find frequent patterns, GraMi proposed a subgraph isomorphism algorithm based on solving a constraint satisfaction problem. However, the problem GraMi addresses differs completely from the conventional subgraph isomorphism problem in that it uses the minimum image-based support (MNI) threshold σ. In GraMi, let M_j be the j-th embedding of the pattern (P) in the single graph (g), and let ∇_i = ∪_{j=1}^{|M|} {M_j(u_i)}, where u_i is the i-th vertex and |M| is the number of all possible embeddings. Then min_i |∇_i| is used to compute the support of the pattern. The key idea for checking whether a given pattern (P) is frequent under the MNI metric is to verify that each query vertex has at least σ valid data vertices mapped to it, without enumerating all possible embeddings of the pattern (P) in the single graph (g). Otherwise, the pattern (P) is defined to be infrequent. This property, however, cannot be exploited for the subgraph isomorphism problem. For the matching order, GraMi selects the query vertex with the largest number of query constraints, without considering selectivity. This heuristic can lead to performance problems similar to those of existing subgraph isomorphism methods. GraMi also applies an optimization under the MNI metric to handle the special case in which the query graph is perfectly symmetric.
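
To make the minimum image-based support concrete, the sketch below computes it from an explicit list of embeddings, assuming the usual reading of MNI: for each pattern vertex, collect the distinct data vertices it is mapped to, and take the size of the smallest such set. The function name `mni_support` and the toy embeddings are illustrative only; GraMi itself avoids enumerating all embeddings.

```python
def mni_support(embeddings, pattern_vertices):
    """Minimum image-based support of a pattern, given all of its embeddings.

    Each embedding is a dict mapping every pattern vertex to a data vertex.
    """
    images = {u: set() for u in pattern_vertices}
    for embedding in embeddings:
        for u in pattern_vertices:
            images[u].add(embedding[u])
    # The support is the size of the smallest image set over all pattern vertices.
    return min(len(image) for image in images.values())


# Three embeddings that all map u1 to data vertex 0: the MNI support is 1, not 3.
toy_embeddings = [{"u1": 0, "u2": 1}, {"u1": 0, "u2": 2}, {"u1": 0, "u2": 3}]
support = mni_support(toy_embeddings, ["u1", "u2"])
print(support)
```

The example shows why MNI differs from simply counting embeddings: a pattern with many embeddings can still have low support if one pattern vertex always maps to the same data vertex.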

In addition, there is research on approximate subgraph search (see, for example, Y. Tian and J. M. Patel. TALE: A tool for approximate large graph matching. In ICDE, pages 963-972, 2008; M. Mongiovi, R. D. Natale, R. Giugno, A. Pulvirenti, A. Ferro, and R. Sharan. Sigma: a set-cover-based inexact graph matching algorithm. J. Bioinformatics and Computational Biology, 8(2):199-218, 2010; and A. Khan, N. Li, X. Yan, Z. Guan, S. Chakraborty, and S. Tao. Neighborhood based fast graph search in large networks. In SIGMOD, pages 901-912, 2011). These studies use their own similarity measures to find approximate embeddings. Exact subgraph search and approximate subgraph search target different applications.

Although earlier studies reported successful results for subgraph isomorphism search, the recent comparison study of backtracking algorithms cited above showed, through an extensive benchmark over many data sets, that all existing algorithms suffer from the following serious performance problems on some query sets and data sets: (P1) All existing methods have serious problems in matching-order selection, so query processing time grows exponentially for some query sets and data sets. Consequently, no method shows robust search performance. Fortunately, however, for each query set and data set at least one method processes the query set within a reasonable time; (P2) For many query sets and data sets, the blind exploration performed by the neighborhood-based filtering of GraphQL and SPath sometimes negates the otherwise good search performance of these algorithms. In contrast, QuickSI picks a good matching order on the same query sets and data sets and sometimes outperforms more recent algorithms such as GraphQL and SPath. This again underlines the importance of choosing a good matching order; (P3) The parameters of GraphQL and SPath severely affect search performance, yet it is very difficult in practice to find in advance the parameter values that give the best performance for each query.

To solve the problems of the prior art described above, the present invention aims to provide a subgraph isomorphism search method and system using candidate region exploration, a new concept that enables efficient and robust subgraph search.

To solve the above problem, the present invention comprises the steps of: selecting a start query vertex (u_s) from a query graph (q); obtaining a breadth-first search (BFS) tree (q') from the query graph (q); for each selected start vertex, exploring a candidate region by traversing a data graph (g) from the root vertex (u_s') of the BFS tree (q'); determining a matching order for each explored candidate region; mapping the start vertex (u_s') of the BFS tree (q') to the start vertex (v_s) of that candidate region; and generating all possible embeddings for each candidate region.

In one embodiment, the selecting step may comprise: ranking all query vertices (u) by frequency and degree; selecting the top-k query vertices in the ranking; estimating the number of candidate regions for each of the top-k query vertices (u); and selecting the vertex with the smallest number of candidate regions as the start query vertex.

In one embodiment, the ranking step may give a vertex higher priority the lower its frequency or the higher its degree.

In one embodiment, the estimating step may check whether the degree of the query vertex (u) is less than or equal to the degree of each vertex (v) belonging to a candidate region, check whether the number of vertices with label l adjacent to the query vertex (u) is less than or equal to the number of vertices with label l adjacent to the vertex (v) belonging to the candidate region, and estimate the number of candidate regions as the number that satisfy both conditions.
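
The start-vertex selection described in the preceding embodiments can be sketched as follows. This is only an illustration under stated assumptions: the ranking score `frequency / degree` is one possible way to prefer low frequency and high degree and is not given in the text, and the names `passes_filters` and `choose_start_vertex` are hypothetical.

```python
from collections import Counter


def passes_filters(u, v, q_adj, q_label, g_adj, g_label):
    """Degree test and per-label neighbour-count test for the pair (u, v)."""
    if len(q_adj[u]) > len(g_adj[v]):
        return False
    need = Counter(q_label[w] for w in q_adj[u])
    have = Counter(g_label[w] for w in g_adj[v])
    return all(have[label] >= count for label, count in need.items())


def choose_start_vertex(q_adj, q_label, g_adj, g_label, k=3):
    """Pick the top-k ranked query vertex with the fewest estimated candidate regions."""
    freq = Counter(g_label.values())
    # Assumed ranking: a smaller frequency-to-degree ratio ranks first.
    ranked = sorted(q_adj, key=lambda u: freq[q_label[u]] / max(len(q_adj[u]), 1))
    best, best_count = None, None
    for u in ranked[:k]:
        # Each data vertex that passes both tests could start one candidate region.
        count = sum(1 for v in g_adj
                    if g_label[v] == q_label[u]
                    and passes_filters(u, v, q_adj, q_label, g_adj, g_label))
        if best_count is None or count < best_count:
            best, best_count = u, count
    return best, best_count


# Toy input (hypothetical): a path query A-B-C and a six-vertex data graph.
q_adj = {"u0": {"u1"}, "u1": {"u0", "u2"}, "u2": {"u1"}}
q_label = {"u0": "A", "u1": "B", "u2": "C"}
g_adj = {0: {3}, 1: {3}, 2: {3, 5}, 3: {0, 1, 2, 4}, 4: {3}, 5: {2}}
g_label = {0: "A", 1: "A", 2: "A", 3: "B", 4: "C", 5: "B"}
start = choose_start_vertex(q_adj, q_label, g_adj, g_label, k=2)
print(start)
```

Restricting the estimate to the top-k ranked vertices keeps the selection cheap, since the two filter tests are run only for a few query vertices.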

In one embodiment, the obtaining step may generate the BFS tree (q') such that the candidate regions matching the BFS tree (q') are small.
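
A BFS tree of the query graph can be derived as in the sketch below, which also returns the query edges left out of the tree, since those still have to be checked during matching. The function name `bfs_tree` is hypothetical, and neighbours are visited in sorted order purely for determinism; the sketch does not attempt the heuristic that keeps candidate regions small.

```python
from collections import deque


def bfs_tree(q_adj, start):
    """Return (parent map of the BFS tree, set of non-tree query edges)."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in sorted(q_adj[u]):
            if w not in parent:
                parent[w] = u
                queue.append(w)
    tree_edges = {frozenset((child, p)) for child, p in parent.items() if p is not None}
    all_edges = {frozenset((u, w)) for u in q_adj for w in q_adj[u]}
    return parent, all_edges - tree_edges


# Toy query (hypothetical): a triangle a-b-c with a pendant vertex d attached to c.
q_adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
parent, non_tree_edges = bfs_tree(q_adj, "a")
print(parent, non_tree_edges)
```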

In one embodiment, the exploring step may comprise: determining, for each unvisited data vertex (v') among the candidate data vertices (V_M) for a query vertex (u') of the BFS tree (q'), whether the data vertex satisfies an identification condition; marking each data vertex (v') that satisfies the identification condition as visited, and sorting the child query vertices (u'_c) of the query vertex (u') by the number of vertices that are adjacent to the data vertex (v') and have the same label; determining whether the subtrees of the data vertex (v') match the subtrees of the query vertex (u'); if the subtrees match, resetting the visited mark of the data vertex (v') and adding the data vertex to the candidate subregion (CR(u',v)); and, once all data vertices (v') have been visited, deleting the entire candidate region if the candidate subregion (CR(u',v)) contains fewer than one data vertex.

In one embodiment, the step of determining whether the identification condition is satisfied may check whether the degree of the query vertex (u') is less than or equal to the degree of the data vertex (v'), and whether the number of vertices with label l adjacent to the query vertex (u') is less than or equal to the number of vertices with label l adjacent to the vertex (v') belonging to the candidate region.

In one embodiment, the method may further comprise pruning the subtree of a data vertex (v') that does not satisfy the identification condition, without traversing it.

In one embodiment, the method may further comprise, if the subtrees do not match, deleting the candidate regions for all vertices visited in the subtree of the data vertex (v').
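
The exploration steps above can be sketched as a depth-first recursion. This is a simplified illustration, not the claimed procedure: the function name `explore_candidate_region`, the dictionary `cr` keyed by (query vertex, parent data vertex), and the way failed subtrees are cleared are all assumptions made for the example.

```python
from collections import Counter


def explore_candidate_region(q_adj, q_label, children, root, g_adj, g_label, v_start):
    """Return {(query vertex, parent data vertex): [data vertices]} or None."""
    cr = {}
    visited = set()

    def satisfies(u, v):
        # Identification condition: degree test plus per-label neighbour counts.
        if len(q_adj[u]) > len(g_adj[v]):
            return False
        need = Counter(q_label[w] for w in q_adj[u])
        have = Counter(g_label[w] for w in g_adj[v])
        return all(have[label] >= n for label, n in need.items())

    def clear(u, v_parent):
        # Drop the subregion of u under v_parent together with everything below it.
        for v in cr.pop((u, v_parent), []):
            for c in children.get(u, []):
                clear(c, v)

    def explore(u, v_parent, candidates):
        found = []
        for v in candidates:
            if v in visited or g_label[v] != q_label[u] or not satisfies(u, v):
                continue  # pruned without traversing anything below v
            visited.add(v)
            kids = sorted(children.get(u, []),
                          key=lambda c: sum(g_label[w] == q_label[c] for w in g_adj[v]))
            done = []
            for c in kids:
                if not explore(c, v, sorted(g_adj[v])):
                    break
                done.append(c)
            visited.discard(v)  # reset the mark once the subtree has been handled
            if len(done) == len(kids):
                found.append(v)
            else:
                for c in done:
                    clear(c, v)
        if not found:
            return False
        cr[(u, v_parent)] = found
        return True

    return cr if explore(root, None, [v_start]) else None


# Toy input (hypothetical): query tree u0(A) with children u1(B) and u2(C).
q_adj = {"u0": {"u1", "u2"}, "u1": {"u0"}, "u2": {"u0"}}
q_label = {"u0": "A", "u1": "B", "u2": "C"}
children = {"u0": ["u1", "u2"]}
g_adj = {0: {1, 2}, 1: {0}, 2: {0}, 3: {4}, 4: {3}}
g_label = {0: "A", 1: "B", 2: "C", 3: "A", 4: "B"}
region = explore_candidate_region(q_adj, q_label, children, "u0", g_adj, g_label, 0)
print(region)
```

Starting the same exploration from data vertex 3 returns `None`, because that vertex fails the degree test and its neighbourhood is never traversed.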

In one embodiment, the determining step may comprise: selecting the path with the smallest number of unselected candidate vertices; and performing subgraph isomorphism matching starting from the path in the data graph (g) that corresponds to the query path containing the query vertex with the smallest estimated value.
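
One way to read this embodiment is to order the root-to-leaf query paths by how many candidate vertices the explored region holds for each path's leaf, and to match the cheapest path first. The sketch below follows that reading; the cost measure, the function name `matching_order`, and the `cr` layout (keys are pairs of query vertex and parent data vertex) are assumptions, not details given in the text.

```python
def matching_order(cr, parent, leaves):
    """Order query vertices path by path, cheapest root-to-leaf path first."""

    def path_to(leaf):
        path = []
        u = leaf
        while u is not None:
            path.append(u)
            u = parent[u]
        return path[::-1]  # root first

    def leaf_candidates(leaf):
        # Total number of candidate data vertices recorded for this leaf.
        return sum(len(vs) for (u, _), vs in cr.items() if u == leaf)

    order = []
    for leaf in sorted(leaves, key=leaf_candidates):
        for u in path_to(leaf):
            if u not in order:
                order.append(u)
    return order


# Toy input (hypothetical): leaf u1 has three candidates, leaf u3 has one.
parent = {"u0": None, "u1": "u0", "u2": "u0", "u3": "u2"}
cr = {("u0", None): [0], ("u1", 0): [1, 4, 5], ("u2", 0): [2], ("u3", 2): [3]}
order = matching_order(cr, parent, ["u1", "u3"])
print(order)
```

Because the counts come from one candidate region rather than from the whole data graph, a different region can yield a different order for the same query.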

In one embodiment, the generating step may comprise: obtaining the candidate subregions for a vertex (u') of the BFS tree (q'); if a data vertex (v') belonging to the candidate subregion has already been matched, safely discarding that combination; checking, for the vertex (u') of the BFS tree (q'), whether the edges between the vertex (u') and the already matched query vertices of the query graph (q) have corresponding edges among the already matched data vertices (v'); if this edge condition is satisfied, mapping the data vertex (v') to the corresponding query vertex (u) and updating the state information; if a terminal query vertex is being visited, incrementing the number of matched embeddings; and, once all embeddings have been found, outputting the result.

In one embodiment, if not all embeddings have been found, the modified state information may be restored and the method may return to the step of safely discarding the combination.

In one embodiment, the method may comprise, after the adding step: determining, for each terminal vertex (leaf) of the BFS tree (q'), whether the number of vertices obtained in the candidate region exploration exceeds k; and, if the number of vertices exceeds k, stopping further exploration and performing a subgraph isomorphism test on the candidate vertices found so far, wherein the step of exploring the candidate region, the step of determining whether k is exceeded, and the step of performing the test are repeated for other candidate vertices until a suitable subgraph isomorphism is found.

A subgraph isomorphism search system using candidate region exploration according to another aspect of the present invention comprises: a query rewriting unit that selects a start query vertex (u_s) from a query graph (q) and obtains a breadth-first search (BFS) tree (q') from the query graph (q); a candidate region exploration unit that, for each selected start vertex, explores a candidate region by traversing a data graph (g) from the root vertex (u_s') of the BFS tree (q'); a subgraph isomorphism matching unit that determines a matching order for each explored candidate region, maps the start vertex (u_s') of the BFS tree (q') to the start vertex (v_s) of that candidate region, and generates all possible embeddings for each candidate region; and a database in which the data graph (g) is stored.

In one embodiment, the query rewriting unit may rank all query vertices (u) by frequency and degree to select the top-k query vertices, estimate the number of candidate regions for each of those query vertices (u), and select the vertex with the smallest number of candidate regions as the start query vertex.

In one embodiment, the candidate region exploration unit may explore any candidate region induced by the input query tree in depth-first order from the start vertex of that region.

In one embodiment, the subgraph isomorphism matching unit may determine, for each leaf of the BFS tree (q'), whether the number of vertices obtained in the candidate region exploration exceeds k. If it exceeds k, the unit stops exploring and performs a subgraph isomorphism test on the candidate vertices found; otherwise, it explores candidate regions for other candidate vertices.

In one embodiment, the data graph may have a structure that includes an inverse vertex label list, used to access the list of vertex IDs sorted by vertex label, and a per-vertex adjacency list that stores each vertex's adjacency information.
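As a non-limiting illustration of this storage structure, the following Python sketch builds an inverse vertex label list and per-vertex adjacency lists for a small toy graph. The function name `build_storage`, the input format, and the toy data are assumptions made for this example only; the actual layout is the one described with reference to FIG. 8.

```python
from collections import defaultdict


def build_storage(labels, edges):
    """Build the two structures from a toy undirected labeled graph.

    labels: dict mapping vertex ID -> label (hypothetical input format)
    edges:  iterable of (a, b) vertex-ID pairs
    """
    # Inverse vertex label list: label -> vertex IDs in ascending order.
    inverse_label_list = defaultdict(list)
    for vid in sorted(labels):
        inverse_label_list[labels[vid]].append(vid)

    # Per-vertex adjacency list, kept sorted for deterministic scans.
    adjacency = {vid: [] for vid in labels}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    for neighbors in adjacency.values():
        neighbors.sort()

    return dict(inverse_label_list), adjacency


# Toy data (not taken from the figures).
labels = {1: "A", 2: "B", 3: "B", 4: "C"}
edges = [(1, 2), (1, 3), (3, 4)]
inverse_label_list, adjacency = build_storage(labels, edges)
```

With such an index, all data vertices carrying a given label can be looked up directly, which is what the start-vertex selection and the filters described later rely on.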

The subgraph isomorphism search method and system using candidate region exploration according to the present invention can perform efficient and robust subgraph search by means of the new concept of candidate region exploration.

In addition, the present invention performs candidate region exploration to immediately identify the candidate subgraphs (i.e., candidate regions) that contain results, computes a robust matching order for each explored candidate region, and performs the search region by region according to the computed matching order.

In addition, the present invention can speed up the checks performed during candidate region exploration and efficiently prune unpromising data vertices.

In addition, because the present invention computes selectivity for each candidate region rather than for the whole graph, it can use this accurate information to generate a more robust matching order.

FIG. 1 is a conceptual diagram illustrating a subgraph isomorphism search method according to an embodiment of the present invention.
FIG. 2 is a flowchart of a subgraph isomorphism search method using candidate region exploration according to an embodiment of the present invention.
FIGS. 3A to 3D are conceptual diagrams illustrating an execution example of the method of FIG. 2.
FIG. 4 is a flowchart showing the candidate region exploration step of FIG. 2 in detail.
FIG. 5 is a conceptual diagram illustrating the concept of a candidate subregion.
FIG. 6 is a conceptual diagram illustrating the data structure for candidate subregions.
FIG. 7 is a conceptual diagram illustrating the importance of the matching order.
FIG. 8 is a conceptual diagram illustrating the storage structure of the data graph.
FIG. 9 is a flowchart showing the subgraph isomorphism search of FIG. 2 in detail.
FIG. 10 is a flowchart of a subgraph isomorphism search method using candidate region exploration according to another embodiment of the present invention.
FIG. 11 is a block diagram of a subgraph isomorphism search system using candidate region exploration according to an embodiment of the present invention.

Hereinafter, the present invention is described in detail with reference to preferred embodiments and the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice it. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein.

The present invention is based on the observation that a matching order that is good for one data subgraph of the data graph (g) may be bad for another data subgraph of the same data graph. A recent study that ran extensive benchmarks on a variety of real data sets showed that conventional subgraph isomorphism search algorithms have serious problems in choosing a matching order. To overcome these problems, the present invention proposes a method that performs efficient and robust subgraph search using a new concept, candidate region exploration. The method performs candidate region exploration to immediately identify the candidate subgraphs (i.e., candidate regions) that contain results, computes a robust matching order for each explored candidate region, and searches each candidate region according to its computed matching order.

FIG. 1 is a conceptual diagram illustrating a subgraph isomorphism search method according to an embodiment of the present invention. As shown in FIG. 1, assume two data subgraphs, g₁ and g₂. If the matching order O₁ = (u₁, u₂, u₃) is used to match the query vertices to the data vertices of subgraph g₁, then after (u₁, v₁), (u₂, v₂), and (u₃, v₃) have been matched, it is clear that no further match exists in g₁. If, however, the matching order O₂ = (u₁, u₃, u₂) is used, the query vertex u₂ must be matched against every child vertex (v_c) of the candidate vertex v₄. The number of matching attempts under O₂ is therefore 1003 (one for u₁, two for u₃, and 1000 for u₂). Likewise, for subgraph g₂, the matching order O₂ requires only three matching attempts, whereas the matching order O₁ requires 1003.

Existing subgraph isomorphism search algorithms all suffer from unnecessary computation because they blindly permute the possible mapping orders of the query vertices. The present invention proposes a method that performs subgraph isomorphism search quickly and robustly using candidate region exploration. The method performs candidate region exploration to immediately identify the candidate subgraphs (i.e., candidate regions) that contain results, and computes a robust matching order for each explored candidate region.

The candidate region exploration process according to the present invention explores candidate regions by performing a depth-first search from the start data vertex of each candidate region. It then obtains the candidate data vertices for each query vertex. Candidate region exploration guarantees that every candidate vertex for a query vertex (u') has a subtree with the same structure as the subtree of u'. Next, the leaf query vertices are sorted by their number of candidate vertices to obtain a matching order for each candidate region.

The new subgraph isomorphism search algorithm proposed by the present invention first selects a start query vertex from the query graph (q). Next, it obtains a breadth-first search (BFS) tree (q') from the query graph (q) and, for each start vertex of each region, performs candidate region exploration from the root vertex (u'_s) of the tree (q'). When a candidate region is found, it determines the matching order and maps the start vertex (u'_s) of the query graph (q) to the start vertex (v_s) of that candidate region. Like existing algorithms, the proposed algorithm uses two arrays, M and F: M is the mapping array, and F stores information about the data vertices used in the current mapping. The algorithm then runs the main recursive subroutine to generate all possible mappings for each region and, once all mappings for a region have been found, restores the values of the two arrays.
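The role of the two arrays M and F can be illustrated with a generic backtracking sketch in Python. This is not the main recursive subroutine of the invention: it scans every data vertex instead of drawing candidates from candidate regions, and the function name `enumerate_mappings` and the toy graphs are assumptions introduced only for illustration. What it does show is how M and F are updated before each recursive call and restored afterwards.

```python
def enumerate_mappings(q_adj, q_labels, g_adj, g_labels, order):
    """Generic backtracking over a fixed matching order (illustrative only)."""
    M = {}        # mapping array: query vertex -> data vertex
    F = set()     # data vertices used by the current partial mapping
    results = []

    def recurse(i):
        if i == len(order):
            results.append(dict(M))   # record one complete mapping
            return
        u = order[i]
        for v in g_adj:
            if v in F or g_labels[v] != q_labels[u]:
                continue
            # Every already-mapped neighbor of u must map to a neighbor of v.
            if any(w in M and M[w] not in g_adj[v] for w in q_adj[u]):
                continue
            M[u] = v
            F.add(v)
            recurse(i + 1)
            del M[u]                  # restore M and F before the next try
            F.discard(v)

    recurse(0)
    return results


# Toy query (one edge A-B) and toy data graph (A with two B neighbors).
q_adj = {"u1": ["u2"], "u2": ["u1"]}
q_labels = {"u1": "A", "u2": "B"}
g_adj = {"v1": ["v2", "v3"], "v2": ["v1"], "v3": ["v1"]}
g_labels = {"v1": "A", "v2": "B", "v3": "B"}
mappings = enumerate_mappings(q_adj, q_labels, g_adj, g_labels, ["u1", "u2"])
```

In the method described here, the inner loop would iterate over the candidates stored for the current candidate region rather than over all data vertices, and the order would be the matching order computed for that region.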

Hereinafter, a subgraph isomorphism search method 200 using candidate region exploration according to an embodiment of the present invention is described with reference to FIGS. 2 to 10. FIG. 2 is a flowchart of the method, and FIGS. 3A to 3D are conceptual diagrams illustrating an execution example of the method of FIG. 2.

The subgraph isomorphism search method 200 using candidate region exploration comprises: selecting a start query vertex (u_s) (S201); obtaining a breadth-first search (BFS) tree (q') (S202); exploring candidate regions while traversing the data graph (g) from the root vertex (u'_s) of the BFS tree (q') (S203); determining a matching order for each explored candidate region (S204); mapping the start vertex (u'_s) of the BFS tree (q') to the start vertex (v_s) of that candidate region (S205); and generating all possible mappings for each candidate region (S206).

More specifically, as shown in FIG. 2, a start query vertex (u_s) is first selected from the query graph (q) (step S201). Here, all query vertices (u) are ranked according to frequency and degree, with lower frequency or higher degree giving higher priority, and the top k query vertices in this ranking may be selected. That is, to choose the start query vertex (u_s) that will be matched to the start vertex of each region, a rank value is computed for every query vertex and the top k query vertices by rank value are selected. For example, the following expression is used as the ranking function.

Here, freq(g, l) is the number of data vertices with label l in the data graph (g), and deg(u) is the degree of the query vertex (u).

In this embodiment, k is assumed to be 3. To minimize the number of candidate regions, this ranking function gives a high rank (i.e., a low rank value) to vertices with low frequency and high degree. In the execution example of FIG. 3, the query vertex u₁ has the smallest rank value (= 1/3) and is therefore selected as the start query vertex.
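The ranking step can be sketched in Python as follows. The ranking expression itself is not reproduced in this text, so the sketch assumes the ratio freq(g, L(u)) / deg(u), where L(u) denotes the label of u; this form is only an assumption that agrees with the description (low frequency and high degree give a low rank value). The function names and the toy graphs are hypothetical and are not the graphs of FIG. 3.

```python
from collections import Counter


def rank_value(label_freq, q_labels, q_adj, u):
    # ASSUMED form: freq(g, L(u)) / deg(u); a smaller value means higher priority.
    # Query vertices are assumed to have degree >= 1 (connected query graph).
    return label_freq[q_labels[u]] / len(q_adj[u])


def top_k_query_vertices(g_labels, q_labels, q_adj, k=3):
    label_freq = Counter(g_labels.values())   # freq(g, l) for every label l
    ranked = sorted(
        q_labels,
        key=lambda u: rank_value(label_freq, q_labels, q_adj, u),
    )
    return ranked[:k]


# Toy data graph labels: one "A", four "B", six "C" vertices.
g_labels = {"a0": "A"}
g_labels.update({f"b{i}": "B" for i in range(4)})
g_labels.update({f"c{i}": "C" for i in range(6)})

# Toy query graph with degrees 3, 2, 2, 1.
q_labels = {"u1": "A", "u2": "B", "u3": "C", "u4": "C"}
q_adj = {
    "u1": ["u2", "u3", "u4"],
    "u2": ["u1", "u3"],
    "u3": ["u1", "u2"],
    "u4": ["u1"],
}
top3 = top_k_query_vertices(g_labels, q_labels, q_adj, k=3)
```

In this toy input, u1 has the rare label and the highest degree, so it receives the smallest rank value and heads the list of start-vertex candidates.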

Here, two inexpensive filters, described next, are used to estimate the number of candidate regions for the three selected query vertices (u₁, u₂, and u₃) more accurately. That is, the number of candidate regions is estimated for each of these query vertices (u), and the vertex with the fewest candidate regions is selected as the start query vertex.

To this end, the set of candidate data vertices C(u) is obtained for each selected query vertex (u). For each vertex (v) in C(u), it is checked whether the degree of the query vertex (u) is less than or equal to the degree of v, that is, whether deg(u) ≤ deg(v) (Equation 2).

If the check of Equation 2 is true, the neighbor label frequency (NLF) filter is applied. For every distinct label l among the neighbors of the query vertex (u), the NLF filter checks whether the number of neighbors of u with label l is less than or equal to the number of neighbors of v (a vertex in C(u)) with label l, that is, whether |adj(u, l)| ≤ |adj(v, l)| (Equation 3).

Here, adj(u, l) is the list of vertices adjacent to the query vertex (u) that have label l.

This filtering makes it possible to estimate the number of candidate regions that satisfy both conditions (Equation 2 and Equation 3).
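The two filters can be sketched in Python as shown below, under the assumption that graphs are given as label and adjacency dictionaries. The degree check corresponds to deg(u) ≤ deg(v) and the NLF check to |adj(u, l)| ≤ |adj(v, l)| for every neighbor label l of u, as stated in the text. The function names and the toy graph are hypothetical.

```python
from collections import Counter


def degree_filter(q_adj, g_adj, u, v):
    # deg(u) <= deg(v)
    return len(q_adj[u]) <= len(g_adj[v])


def nlf_filter(q_adj, q_labels, g_adj, g_labels, u, v):
    # |adj(u, l)| <= |adj(v, l)| for every distinct neighbor label l of u
    need = Counter(q_labels[w] for w in q_adj[u])
    have = Counter(g_labels[w] for w in g_adj[v])
    return all(have[label] >= count for label, count in need.items())


def estimate_candidate_regions(q_adj, q_labels, g_adj, g_labels, u):
    """Count data vertices with u's label that pass both filters."""
    return sum(
        1
        for v in g_adj
        if g_labels[v] == q_labels[u]
        and degree_filter(q_adj, g_adj, u, v)
        and nlf_filter(q_adj, q_labels, g_adj, g_labels, u, v)
    )


# Toy query: u (label A) with one B neighbor and one C neighbor.
q_labels = {"u": "A", "ub": "B", "uc": "C"}
q_adj = {"u": ["ub", "uc"], "ub": ["u"], "uc": ["u"]}

# Toy data: v1 passes both filters, v2 fails NLF, v3 fails the degree check.
g_labels = {"v1": "A", "v2": "A", "v3": "A",
            "b1": "B", "b2": "B", "b3": "B", "c1": "C"}
g_adj = {
    "v1": ["b1", "c1"], "v2": ["b2", "b3"], "v3": ["b1"],
    "b1": ["v1", "v3"], "b2": ["v2"], "b3": ["v2"], "c1": ["v1"],
}
estimate = estimate_candidate_regions(q_adj, q_labels, g_adj, g_labels, "u")
```

The degree check is evaluated first because it is cheaper; the NLF check only runs for vertices that survive it, mirroring the order given in the description.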

After the number of candidate regions has been estimated in this way for each selected start query vertex, the single vertex with the fewest candidate regions is chosen. If the selectivity is greater than 5%, these filters are not applied to the query vertex (u₂ or u₃). That is, the filters described above are not applied to non-selective start query vertices.

Next, a breadth-first search (BFS) tree (q') is obtained from the query graph (q) (step S202). The BFS tree (q') may be generated so that the candidate regions matching it are small. That is, when the query graph (q) is rewritten as the tree (q'), the candidate regions matching q' should be as small as possible; a breadth-first search tree is therefore generated from the query graph (q) so that the diameter of each candidate region is minimized on average.

As described above, all mappings for a given query graph (q) can be obtained by performing the subgraph isomorphism search only on candidate regions. Therefore, to minimize the size of the candidate regions, the query graph (q) is rewritten as its BFS tree (q').
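Rewriting a query graph into its BFS tree can be sketched in Python as follows. The sketch is a standard breadth-first traversal; the function name `bfs_tree`, the dictionary representation, and the toy query graph are assumptions made for this example.

```python
from collections import deque


def bfs_tree(q_adj, start):
    """Return (parent, children) maps of the BFS tree rooted at start."""
    parent = {start: None}
    children = {start: []}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in q_adj[u]:
            if w not in parent:        # first time w is reached: tree edge u-w
                parent[w] = u
                children[w] = []
                children[u].append(w)
                queue.append(w)
    return parent, children


# Toy query graph: a triangle u1-u2-u3 plus a pendant vertex u4 on u3.
q_adj = {
    "u1": ["u2", "u3"],
    "u2": ["u1", "u3"],
    "u3": ["u1", "u2", "u4"],
    "u4": ["u3"],
}
parent, children = bfs_tree(q_adj, "u1")
```

Because breadth-first search reaches every vertex along a shortest path from the root, the resulting tree is as shallow as possible for that root, which is in line with the stated aim of keeping candidate regions small. The query edge u2-u3 is not part of the tree in this toy example.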

The concept of a candidate subregion, which stores the candidate data vertices for each vertex of the BFS tree, is now described. The union of all candidate subregions belonging to a candidate region equals that candidate region.

Next, for each selected start vertex, candidate regions are explored while traversing the data graph (g) from the root vertex (u'_s) of the BFS tree (q') (step S203). That is, given the BFS tree (q') of the query graph (q), candidate regions can be identified by traversing the data graph (g) using q'. To this end, the candidate region exploration method traverses the subgraph rooted at the start vertex of each candidate region.

Candidate region exploration is described in more detail with reference to FIGS. 4 to 6. FIG. 4 is a flowchart showing the candidate region exploration step of FIG. 2 in detail, FIG. 5 is a conceptual diagram illustrating the concept of a candidate subregion, and FIG. 6 is a conceptual diagram illustrating the data structure for candidate subregions.

While candidate region exploration is performed, the candidate data vertices for each BFS tree vertex, which are later used in the subgraph isomorphism search, can be identified. To store these data vertices, the present invention introduces the concept of a candidate subregion, denoted CR(u', v), where u' is a vertex of the BFS tree and v is a data vertex. CR(u', v) contains the data vertices that match u' and are children of v in the depth-first search tree. CR(u', v) is defined in Definition 1 below.

Definition 1.

CR(u', v) is the set of data vertices (v') that satisfy the following conditions:

1) v' is a child vertex of v in the depth-first search tree.

2) All BFS tree vertices on the path from u'_s to u' are matched with all data vertices on the path from v_s to v' in the depth-first search tree.

3) The subtree of u' matches the corresponding subtree of v' in the depth-first search tree. Here, u_s' is the root vertex of q', and v_s is the start vertex of the candidate region.

FIG. 5 illustrates the meaning of CR(u', v). To store the candidate vertex of u'_s (i.e., v_s), a virtual vertex (v*_s) that is the parent of v_s is assumed to exist. As shown in the BFS tree and data graph of FIG. 3, the vertex v₂ can be matched to u'₂ and the subtree of u'₂ can be matched to the subtree of v₂, so CR(u'₂, v₁) = {v₂}. Because the subtrees of v₃, v₄, and v₅ do not match the subtree of u'₂, the vertices v₃, v₄, and v₅ are not included in CR(u'₂, v₁). In the same way, CR(u'₃, v₁) and CR(u'₄, v₁) contain {v₂, v₃, v₄, v₅}. FIG. 3D depicts all candidate subregions of FIG. 3C.

As shown in FIG. 4, the candidate region exploration method 400 explores a candidate region induced by the input query tree in depth-first order from the start vertex of that region. The method begins by determining, for each unvisited data vertex (v') among the candidate data vertices (V_M) for the query vertex (u') of the BFS tree (q'), whether v' satisfies the identification conditions (step S401).

That is, when exploring a candidate region, it must be checked whether a data vertex (v') adjacent to v belongs to CR(u', v) for the BFS tree vertex (u'). To speed up this check and efficiently prune unpromising data vertices, the following identification conditions are evaluated.

Specifically, for each unvisited data vertex (v'), it is first checked whether the degree deg(u') of the query vertex (u') is less than or equal to the degree deg(v') of the data vertex (v'); then, before the subtree of v' is traversed, the NLF filter is applied to v'. For example, it is checked whether the number of neighbors of u' with label l is less than or equal to the number of neighbors of v' with label l.

If step S401 determines that the data vertex (v') satisfies these conditions, v' is marked as visited (step S402).

Next, the child query vertices (u'_c) of the query vertex (u') are sorted by the number of vertices that are adjacent to v', the data vertex corresponding to u', and that have the same label (step S403). This ordering increases the chance that a data vertex (v') whose subtrees do not match the subtrees of u' is eliminated early.

If step S401 determines that the data vertex (v') does not satisfy these conditions, v' is discarded without traversing its subtree (step S404), and the method proceeds to step S407 to determine whether all data vertices (v') have been visited.

Next, it is determined whether the subtrees of the data vertex (v') match the subtrees of the query vertex (u') (step S405). If they match, the visited mark of v' is reset and v' is added to the candidate subregion CR(u', v) (step S406), where v is the parent of v' in the depth-first search tree.

Next, it is determined whether all data vertices (v') have been visited (step S407). Once all have been visited, if the number of data vertices in the candidate subregion CR(u', v) is less than 1, i.e., |CR(u', v)| < 1, the query vertex (u') cannot be mapped to any vertex of CR(u', v), so all candidate regions explored so far are deleted (step S408). When candidate region exploration ends, the method proceeds to step S204 to determine the matching order.

If step S405 determines that the subtrees of the data vertex (v') do not match the subtrees of the query vertex (u'), the candidate regions for all vertices visited in the subtree of v' are deleted (step S409), and the method proceeds to step S407 to determine whether all data vertices (v') have been visited.
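The flow of steps S401 to S409 can be approximated by the simplified recursive Python sketch below. It is not the flowchart of FIG. 4 itself: the identification conditions of S401 are reduced to a label comparison (the degree and NLF checks are omitted for brevity), the ascending child order used for S403 is an assumption, and candidate subregions are kept in a plain dictionary keyed by (query vertex, parent data vertex). All names and the toy graphs are hypothetical.

```python
def explore(q_children, q_labels, g_adj, g_labels, u, v, on_path):
    """Try to match the subtree of q' rooted at u onto data vertex v.

    Returns {(child query vertex, parent data vertex): [candidates]} on
    success, or None if some child of u has no candidate under v.
    """
    on_path.add(v)                       # mark v as visited (cf. S402)
    region = {}
    try:
        def same_label_count(uc):
            return sum(1 for w in g_adj[v] if g_labels[w] == q_labels[uc])

        # Cf. S403: children with the fewest same-label neighbors go first
        # (ascending order is an assumption of this sketch).
        for uc in sorted(q_children[u], key=same_label_count):
            candidates = []
            for w in g_adj[v]:
                # Stand-in for the S401 conditions: unvisited and same label.
                if w in on_path or g_labels[w] != q_labels[uc]:
                    continue
                sub = explore(q_children, q_labels, g_adj, g_labels,
                              uc, w, on_path)
                if sub is not None:      # subtrees matched (cf. S405, S406)
                    candidates.append(w)
                    region.update(sub)
            if not candidates:
                # |CR(uc, v)| < 1: discard everything found under v (cf. S408).
                return None
            region[(uc, v)] = candidates
        return region
    finally:
        on_path.discard(v)               # reset the visited mark


# Toy BFS tree: u1(A) -> u2(B), u3(C); u3 -> u4(D).
q_children = {"u1": ["u2", "u3"], "u2": [], "u3": ["u4"], "u4": []}
q_labels = {"u1": "A", "u2": "B", "u3": "C", "u4": "D"}

# Toy data graph: v4 has label C but no D neighbor; v6 has label A but no C neighbor.
g_labels = {"v1": "A", "v2": "B", "v3": "C", "v4": "C", "v5": "D", "v6": "A"}
g_adj = {
    "v1": ["v2", "v3", "v4"], "v2": ["v1", "v6"], "v3": ["v1", "v5"],
    "v4": ["v1"], "v5": ["v3"], "v6": ["v2"],
}
region_v1 = explore(q_children, q_labels, g_adj, g_labels, "u1", "v1", set())
region_v6 = explore(q_children, q_labels, g_adj, g_labels, "u1", "v6", set())
```

Starting from v1, the sketch keeps v3 as a candidate for u3 and drops v4, whose subtree cannot host u4. Starting from v6, exploration fails at once because the child with zero same-label neighbors is tried first, which is the early-termination effect that the child ordering is meant to produce.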

If the data vertex (v') fails the conditions above, the candidate regions recorded for all visited vertices in the subtree of v' must be deleted. To speed up this process, the present invention proposes an efficient data structure for representing CR(u', v), shown in FIG. 6. FIG. 6 shows an example of the data structure for CR(u', -).

In this data structure, as illustrated, only one array CR(u', -) is maintained for each query vertex (u'), and all CR(u', v) are stored in that array. Each element of CR(u', v) consists of a data vertex v' and a list (L) of ranges [s, e], where |L| equals the number of child query vertices of u'. If the i-th child query vertex of u' is u", the candidate vertices of CR(u", v') are stored in CR(u", -)[L_i:s] through CR(u", -)[L_i:e]. Therefore, to delete all explored candidate regions belonging to the subtree of a data vertex (v'), it suffices to perform the deletion for each descendant query vertex of u'.

As shown in FIG. 6, consider CR(u'₁, -), which corresponds to CR(u'₁, v*_s). CR(u'₁, -) has only one element, (v₁, [1:1], [1:4], [1:4]). Since u'₁ has the child query vertices u'₂, u'₃, and u'₄, the candidate vertex of CR(u'₂, v₁) is stored in CR(u'₂, -)[1], and the candidate vertices of CR(u'₃, v₁) are stored in CR(u'₃, -)[1] through CR(u'₃, -)[4]. Similarly, because the second element of CR(u'₆, -) is (v₈, [2:4]), the candidate vertices of CR(u'₈, v₈) are stored in CR(u'₈, -)[2] through CR(u'₈, -)[4].
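The range-based layout can be illustrated with the following Python sketch, which encodes only the first level of the example: the single element of CR(u'₁, -) and the arrays of its three children, with contents taken from the description of CR(u'₂, v₁), CR(u'₃, v₁), and CR(u'₄, v₁). The class name `Element`, the helper `candidates`, and the choice to give the child elements empty range lists (the deeper levels of FIG. 6 are not reproduced here) are assumptions of this sketch. Ranges are 1-based and inclusive, as in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    vertex: str                                  # data vertex v'
    ranges: list = field(default_factory=list)   # one (s, e) per child query vertex


# Child query vertices of each query vertex (first level only).
children = {"u1": ["u2", "u3", "u4"]}

# One flat array CR(u', -) per query vertex.
cr = {
    "u1": [Element("v1", [(1, 1), (1, 4), (1, 4)])],
    "u2": [Element("v2")],
    "u3": [Element(v) for v in ("v2", "v3", "v4", "v5")],
    "u4": [Element(v) for v in ("v2", "v3", "v4", "v5")],
}


def candidates(cr, children, u, element_pos, child_pos):
    """Candidates of the child_pos-th child of u under the element_pos-th
    element of CR(u, -); both positions are 1-based."""
    s, e = cr[u][element_pos - 1].ranges[child_pos - 1]
    child = children[u][child_pos - 1]
    # 1-based inclusive [s, e] corresponds to the Python slice [s - 1:e].
    return [el.vertex for el in cr[child][s - 1:e]]


cr_u2_v1 = candidates(cr, children, "u1", 1, 1)
cr_u3_v1 = candidates(cr, children, "u1", 1, 2)
```

Because each element only stores index ranges into its children's arrays, discarding the candidates below a data vertex amounts to dropping contiguous slices of the descendant arrays rather than searching for individual entries.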

If the query is a path and its start vertex lies at one end of the path, the candidate region exploration method described above can generate mappings directly, without calling the main recursive subroutine. When the query is small, this simple optimization is typically effective.

In the execution example of FIG. 3, the candidate region exploration method performs a depth-first search starting at v₁ using the BFS tree (q') and identifies the candidate vertices for each vertex of the BFS tree. The candidate vertices are stored in the corresponding candidate subregions. The vertices v₁₈ and v₁₀ to v₁₃ are determined not to belong to the candidate region as a result of the ordering of the child query vertices (u'_c) of the query vertex (u') in step S403.

Referring again to FIG. 2, a matching order is determined for each explored candidate region (step S204). More specifically, among the paths not yet selected, the path with the fewest candidate vertices is chosen, and subgraph isomorphism matching proceeds from the path in the data graph (g) that corresponds to the query path containing the query vertex with the smallest estimate.

The effect of the matching order is now examined with reference to FIG. 7. FIG. 7 is a conceptual diagram illustrating the importance of the matching order.

If the query graph (q) has n vertices (u₁, u₂, …, u_n), the search space is |C(u₁)| × |C(u₂)| × … × |C(u_n)|, where C(u_i) is the set of candidate vertices that match u_i. It is therefore necessary to find a matching order that minimizes the intermediate search space produced as intermediate results during query processing. However, because the join selectivity between pairs of query vertices is unknown, finding such a matching order is very difficult. Existing methods have tried to solve this problem with estimation formulas of their own design, but serious flaws in those formulas mean that they do not always find a good matching order.
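The size of this search space is simply the product of the candidate set sizes, as the short Python sketch below shows. The function name and the toy candidate sets are assumptions made for illustration.

```python
from math import prod


def search_space_size(candidate_sets):
    """|C(u_1)| x |C(u_2)| x ... x |C(u_n)| for a dict u_i -> C(u_i)."""
    return prod(len(c) for c in candidate_sets.values())


# Toy candidate sets of sizes 1, 2, and 3.
candidate_sets = {
    "u1": {"v1"},
    "u2": {"v2", "v3"},
    "u3": {"v4", "v5", "v6"},
}
size = search_space_size(candidate_sets)
```

The full product is the same for every matching order; what the order changes is how many partial mappings are generated before unpromising branches are cut, which is why the size of the intermediate results is the quantity to minimize.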

For example, the SPath algorithm looks for the most selective path starting from an already matched query vertex. The selectivity function sel(p) for a given path (p) is calculated as follows.

Here, V(p) denotes the set of all vertices on the path (p), and the denominator is the join cardinality of the candidate sets of all query vertices on the path (p). This overestimate, however, causes serious errors in predicting the join cardinality.

Instead of relying on such unrealistic estimation formulas, the present invention performs candidate region search to measure the number of candidate vertices for a given path of the BFS tree. In this way, a nearly exact selectivity can be obtained for each query path starting from the start query vertex of the BFS tree. Furthermore, because the selectivity is computed for each candidate region rather than for the entire graph, this accurate information can be used to generate a more robust matching order than existing methods.

Specifically, each leaf vertex (u') (that is, the path to u') is ordered by |CR(u', -)|. In other words, among the paths not yet selected, the path with the fewest candidate vertices is selected first. Suppose there are three query paths (p₁ to p₃), as shown in FIG. 7. Candidate region search shows that the cardinalities of these paths in the first candidate region are 10, 10000, and 5, respectively, and that the cardinalities in the second candidate region are 1000, 10, and 5, respectively.

If the subgraph isomorphism search starts with paths p₁ and p₂, a very large number of partial results are enumerated. However, because there are no edges between the data vertices labeled X and the data vertices labeled Z, all of these partial results are eventually eliminated when p₃ is matched. Conversely, if the search starts with paths p₁ and p₃, no partial result is produced, so the entire search space is pruned early. For a similar reason, starting with p₂ and p₃ on the second candidate region prunes the entire space early. Choosing any other join order first causes many partial results to be enumerated.
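The per-region ordering described above can be sketched as follows, using the path cardinalities of the FIG. 7 example. The function name `path_order` and the dictionary layout are assumptions for illustration; the sketch only orders paths by measured cardinality and does not perform the matching itself.

```python
def path_order(cardinalities):
    """Order query paths by ascending candidate count within one candidate region."""
    return sorted(cardinalities, key=cardinalities.get)

# Cardinalities of paths p1..p3 measured by candidate region search (FIG. 7 example).
region_1 = {"p1": 10, "p2": 10000, "p3": 5}
region_2 = {"p1": 1000, "p2": 10, "p3": 5}

order_1 = path_order(region_1)
order_2 = path_order(region_2)
print(order_1, order_2)
```

Because the order is computed separately for each candidate region, the first region starts with p₃ and p₁ while the second starts with p₃ and p₂, which matches the orders the text identifies as pruning the search space early.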

Now consider the case where the query graph (q) contains non-tree edges. Candidate region search does not take the non-tree edges of the query graph into account, so they must be considered when estimating the number of candidate vertices for a given query vertex. Assuming that j non-tree edges are adjacent to the query vertex (u'), the present invention uses, as the estimate, the number of candidate vertices for u' computed by the algorithm divided by j+1. This is consistent with the way the start vertex is selected using the value of the ranking function Rank(u) described above.

The present invention also applies this reduction factor to the subtree of the query vertex (u'). For example, in FIG. 3 the number of candidate vertices for u'_s is 4. However, because there is a non-tree edge between u₅ and u₆, the estimate 2, obtained by dividing 4 by 2, is used (this estimate is in fact close to the actual number of vertices, which is 1, i.e., |{v₁₄}|). As a result, the matching order O₄ = (u'₁, u'₂, u'₅, u'₆, u'₈, u'₇, u'₉, u'₃, u'₄) is obtained.
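The reduction factor can be written as a one-line computation. The sketch below is illustrative only; the function name `estimated_candidates` is an assumption, and the numbers are those of the example above (4 measured candidates, one adjacent non-tree edge).

```python
def estimated_candidates(measured_count, non_tree_edges):
    """Divide the measured candidate count by j + 1, where j is the
    number of non-tree edges adjacent to the query vertex."""
    return measured_count / (non_tree_edges + 1)

estimate = estimated_candidates(4, 1)
print(estimate)
```

With no adjacent non-tree edge (j = 0) the measured count is used unchanged.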

Referring again to FIG. 2, the start vertex (u'_s) of the BFS tree (q') is mapped to the start vertex (v_s) of the candidate region (step S205).

Next, all possible mappings are generated for each candidate region (step S206).

Before this subgraph isomorphism matching is described, the on-disk representation of a data graph is described with reference to FIG. 8. FIG. 8 is a conceptual diagram illustrating the storage structure of the data graph.

The present invention represents the graph with two structures: 1) an inverted vertex label list for accessing the list of vertex IDs sorted by vertex label (FIG. 8a); and 2) an adjacency list for each vertex that stores its adjacency information (FIG. 8b). The adjacent vertices in each adjacency list are grouped by label. In this way, adj(v, l), which is used during candidate region search to remove vertices that are certain not to be candidates, can be computed efficiently.
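A minimal in-memory sketch of these two structures is given below. It is not the on-disk layout of FIG. 8; the class `LabeledGraph`, its method names, and the helper `passes_filter` are assumptions for illustration. The filter applies the two conditions stated in the claims: the degree of the query vertex must not exceed that of the data vertex, and for every label l the query vertex must not have more neighbors labeled l than the data vertex.

```python
from collections import defaultdict

class LabeledGraph:
    """Toy stand-in for the inverted label list and label-grouped adjacency lists."""

    def __init__(self):
        self.label = {}                    # vertex ID -> label
        self.by_label = defaultdict(list)  # inverted vertex label list: label -> vertex IDs
        self.adj = {}                      # vertex ID -> {neighbor label -> neighbor IDs}

    def add_vertex(self, v, label):
        self.label[v] = label
        self.by_label[label].append(v)
        self.adj[v] = defaultdict(list)

    def add_edge(self, a, b):
        # Both endpoints must already have been added with add_vertex.
        self.adj[a][self.label[b]].append(b)
        self.adj[b][self.label[a]].append(a)

    def adj_by_label(self, v, l):
        """adj(v, l): neighbors of v that carry label l."""
        return self.adj[v].get(l, [])

    def degree(self, v):
        return sum(len(ns) for ns in self.adj[v].values())


def passes_filter(q, u, g, v):
    """Degree test and per-label neighbor-count test for query vertex u and data vertex v."""
    if q.degree(u) > g.degree(v):
        return False
    return all(len(ns) <= len(g.adj_by_label(v, l)) for l, ns in q.adj[u].items())


q = LabeledGraph()
for vid, lab in [("u1", "A"), ("u2", "B"), ("u3", "B")]:
    q.add_vertex(vid, lab)
q.add_edge("u1", "u2")
q.add_edge("u1", "u3")

g = LabeledGraph()
for vid, lab in [("v1", "A"), ("v2", "B"), ("v3", "B"), ("v4", "A"), ("v5", "B")]:
    g.add_vertex(vid, lab)
for a, b in [("v1", "v2"), ("v1", "v3"), ("v4", "v5")]:
    g.add_edge(a, b)

survivors = [v for v in g.by_label["A"] if passes_filter(q, "u1", g, v)]
print(g.adj_by_label("v1", "B"), survivors)
```

Grouping neighbors by label makes adj(v, l) a single dictionary lookup, so the per-label count comparison needs no scan of the full adjacency list.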

FIG. 9 is a flowchart showing the detailed procedure of the subgraph isomorphism search of FIG. 2.

As shown in FIG. 9, the subgraph isomorphism search method 900 begins by obtaining the candidate subregions for a vertex (u') of the BFS tree (q') (step S901). That is, for a given BFS tree vertex (u'), the set C containing CR(u', P(u')) is obtained.

Next, if a data vertex (v') belonging to the candidate subregion has already been matched, that combination is safely discarded (step S902). That is, any data vertex in C that has already been matched is safely skipped.

Next, for the vertex (u') of the BFS tree (q'), it is checked whether the edges between u' and the already matched query vertices of the query graph (q) have corresponding edges between the already matched data vertices (v') (step S903). That is, for each edge between u' and an already matched query vertex of q, the method checks whether a corresponding edge exists between the vertex of C assigned to u' and the already matched data vertex of g. If this condition is not satisfied, the check is performed for the next vertex (u') of the BFS tree (q').

If the check in step S903 shows that the vertex (u') of the BFS tree (q') satisfies the condition, the data vertex (v') is mapped to the corresponding query vertex (u) and the state information is updated (step S904).

Next, it is determined whether a leaf query vertex is being visited (step S905). If so, the number of matched mappings is incremented (step S906); if not, the process returns to step S901 to match the next query vertices. That is, with u' set to a child query vertex of u', matching is performed recursively for the descendant query vertices.

Next, it is determined whether all mappings have been found (step S907). If not, the modified state information is restored (step S909), and the process proceeds to step S902 to match the next data vertex.

If step S907 determines that all mappings have been found, the result is output (step S908) and the subgraph isomorphism matching method 900 ends.
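The flow of steps S901 to S909 can be approximated by a short backtracking routine. The sketch below is a simplification under stated assumptions, not the patented procedure: candidate subregions are passed in as a dictionary `cr` keyed by (query vertex, data vertex matched to its tree parent), a mapping is recorded once every query vertex in the matching order is assigned rather than by counting at leaf query vertices, and all names (`find_mappings`, `order`, `parent`, `cr`) are hypothetical.

```python
def find_mappings(order, parent, query_adj, data_adj, cr, u_s, v_s):
    """Enumerate mappings of the query into one candidate region by backtracking.

    order     -- matching order of query vertices, with order[0] == u_s
    parent    -- BFS-tree parent of each non-root query vertex
    query_adj -- query vertex -> set of adjacent query vertices (tree and non-tree edges)
    data_adj  -- data vertex -> set of adjacent data vertices
    cr        -- (query vertex, parent's data vertex) -> list of candidate data vertices
    """
    results = []
    mapping, used = {u_s: v_s}, {v_s}      # S205: the start vertex is mapped first

    def extend(depth):
        if depth == len(order):            # every query vertex has been mapped
            results.append(dict(mapping))
            return
        u = order[depth]
        # S901: candidate subregion of u below the data vertex matched to u's parent
        for v in cr.get((u, mapping[parent[u]]), []):
            if v in used:                  # S902: data vertex already matched
                continue
            # S903: each query edge to a matched vertex needs a corresponding data edge
            if any(w in mapping and mapping[w] not in data_adj[v] for w in query_adj[u]):
                continue
            mapping[u] = v                 # S904: map and update state
            used.add(v)
            extend(depth + 1)
            del mapping[u]                 # S909: restore state before the next candidate
            used.remove(v)

    extend(1)
    return results


# Triangle query u1-u2-u3 whose edge (u2, u3) is a non-tree edge of the BFS tree rooted at u1.
query_adj = {"u1": {"u2", "u3"}, "u2": {"u1", "u3"}, "u3": {"u1", "u2"}}
data_adj = {"v1": {"v2", "v3", "v4"}, "v2": {"v1", "v3"}, "v3": {"v1", "v2"}, "v4": {"v1"}}
cr = {("u2", "v1"): ["v2"], ("u3", "v1"): ["v3", "v4"]}
found = find_mappings(["u1", "u2", "u3"], {"u2": "u1", "u3": "u1"},
                      query_adj, data_adj, cr, "u1", "v1")
print(found)
```

In this example v₄ is a candidate for u₃ in the tree, but it is rejected at the edge check because the non-tree edge (u₂, u₃) has no counterpart (v₂, v₄) in the data graph.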

Unlike all conventional algorithms, the subgraph isomorphism search method 200 using the candidate region search technique according to an embodiment of the present invention processes a subgraph isomorphism query in two phases.

When only k subgraph isomorphisms are required, this approach may, in the candidate regions of some queries, have little chance of terminating early without accessing all data vertices. To address this problem, the subgraph isomorphism search method using the candidate region search technique according to another embodiment of the present invention can interleave the candidate region search method with the subgraph isomorphism search method.

This is described in more detail with reference to FIG. 10. FIG. 10 is a flowchart of a subgraph isomorphism search method using the candidate region search technique according to another embodiment of the present invention.

First, while candidate region search is being performed in step S203 of FIG. 2, it is determined for each leaf of the BFS tree (q') whether the number of vertices obtained by candidate region search exceeds k (step S1001). That is, in step S405 of the detailed candidate region search procedure, before the next unvisited data vertex (v') is explored, it is determined whether the number of vertices obtained so far exceeds k.

If step S1001 determines that the number of vertices obtained by candidate region search exceeds k, the vertices of the BFS tree are not explored further, and the subgraph isomorphism search of steps S204 to S206 of FIG. 2 is performed on the candidate vertices found so far.

If step S1001 determines that the number of vertices obtained by candidate region search does not exceed k, the process returns to step S1001, and the candidate region search of step S203 of FIG. 2 is repeated until the number of vertices obtained reaches k.

Next, after this subgraph isomorphism search, it is determined whether the required subgraph isomorphisms have been found (step S1003). If they have not, that is, if k subgraph isomorphisms cannot be found among the candidate vertices, the process returns to step S1001 and resumes candidate region search to find other candidate vertices.

If step S1003 determines that the required subgraph isomorphisms have been found, the subgraph isomorphism search method 1000 using the candidate region search technique ends.
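The benefit of interleaving can be illustrated with a much coarser scheme than the one in FIG. 10. The sketch below alternates exploration and matching at the level of whole candidate regions, one start vertex at a time, and stops as soon as k mappings are available; it does not implement the per-leaf check of step S1001. The callables `explore_region` and `search_region` and the function name `interleaved_search` are hypothetical placeholders.

```python
from itertools import islice

def interleaved_search(start_vertices, explore_region, search_region, k):
    """Return up to k mappings, exploring candidate regions only as needed."""
    def mappings():
        for v_s in start_vertices:
            region = explore_region(v_s)      # candidate region search for one start vertex
            if region is None:                # no candidate region below this start vertex
                continue
            yield from search_region(region)  # subgraph isomorphism search in that region
    return list(islice(mappings(), k))


# Stand-in callables that record which start vertices were actually explored.
explored = []

def explore_region(v_s):
    explored.append(v_s)
    return {"start": v_s}

def search_region(region):
    return [(region["start"], i) for i in range(2)]  # pretend each region yields 2 mappings

top3 = interleaved_search(["v1", "v2", "v3"], explore_region, search_region, 3)
print(top3, explored)
```

Because the generator is consumed lazily, the third start vertex is never explored once three mappings have been produced.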

By this method, the present invention can perform efficient and robust subgraph search using the new concept of candidate region search: it performs candidate region search to immediately identify the candidate subgraphs (i.e., candidate regions) that contain results, computes a robust matching order for each explored candidate region, and performs the search region by region according to the computed matching order.

In addition, the present invention speeds up the checks performed during candidate region search and efficiently prunes unpromising data vertices. Because the selectivity is computed for each candidate region rather than for the entire graph, this accurate information can be used to generate a more robust matching order.

Hereinafter, a subgraph isomorphism search system using the candidate region search technique according to an embodiment of the present invention is described with reference to FIG. 11.

FIG. 11 is a block diagram of a subgraph isomorphism search system using the candidate region search technique according to an embodiment of the present invention.

The subgraph isomorphism search system 1100 includes an isomorphism search unit 1110 that searches for subgraph isomorphisms for a query, and a storage unit 1120 that stores information generated during the subgraph isomorphism search.

The isomorphism search unit 1110 includes a query rewriting unit 1112 that rewrites the query into a BFS tree, a candidate region search unit 1114 that searches for candidate regions for the query, and a subgraph isomorphism matching unit 1116 that matches subgraph isomorphisms in the data graph.

The query rewriting unit 1112 selects the start query vertex (u_s) from the query graph (q) and obtains a breadth-first search (BFS) tree (q') from the query graph (q). The query rewriting unit 1112 may also rank all query vertices (u) by frequency and degree, select the top-k query vertices, estimate the number of candidate regions for each of those query vertices (u), and select the vertex with the fewest candidate regions as the start query vertex.

The query rewriting unit 1112 may perform the procedures and functions corresponding to the start query vertex selection step (S201) and the BFS tree acquisition step (S202) of FIG. 2.
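The start vertex selection performed by the query rewriting unit can be sketched as follows. The ranking key used here, label frequency divided by degree, is an assumption chosen only to agree with the stated preference for lower frequency and higher degree; the region estimate is also simplified to a label and degree test and omits the neighbor label counts. The function name `choose_start_vertex` and its parameters are hypothetical.

```python
from collections import Counter

def choose_start_vertex(query_label, query_deg, data_label, data_deg, k=3):
    """Pick a start query vertex: rank, keep the top k, then minimize the region estimate."""
    freq = Counter(data_label.values())   # how often each label occurs in the data graph
    # Assumed ranking: smaller frequency/degree ratio means higher priority.
    ranked = sorted(query_label, key=lambda u: freq[query_label[u]] / query_deg[u])

    def region_estimate(u):
        # Simplified estimate: data vertices with the same label and sufficient degree.
        return sum(1 for v, lab in data_label.items()
                   if lab == query_label[u] and data_deg[v] >= query_deg[u])

    return min(ranked[:k], key=region_estimate)


query_label = {"u1": "A", "u2": "B", "u3": "B"}
query_deg = {"u1": 2, "u2": 1, "u3": 1}
data_label = {"v1": "A", "v2": "B", "v3": "B", "v4": "B"}
data_deg = {"v1": 3, "v2": 1, "v3": 1, "v4": 1}

start = choose_start_vertex(query_label, query_deg, data_label, data_deg, k=2)
print(start)
```

Restricting the more expensive region estimate to the top-k ranked vertices keeps the selection cheap even for larger query graphs.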

The candidate region search unit 1114 searches for candidate regions by traversing the data graph (g) from the root vertex (u'_s) of the BFS tree (q') for each selected start vertex. The candidate region search unit 1114 may also explore any candidate region induced by the input query tree depth-first from the start vertex of that region.

The candidate region search unit 1114 may perform the procedures and functions corresponding to the candidate region search step (S203) of FIG. 2 and the candidate region search method of FIG. 4.

The subgraph isomorphism matching unit 1116 determines a matching order for each explored candidate region, maps the start vertex (u'_s) of the BFS tree (q') to the start vertex (v_s) of the candidate region, and generates all possible mappings for each candidate region. The subgraph isomorphism matching unit 1116 may also determine, for each leaf of the BFS tree (q'), whether the number of vertices obtained by candidate region search exceeds k; if it does, the unit stops exploring and performs the subgraph isomorphism check on the candidate vertices found, and if it does not, the unit searches candidate regions for other candidate vertices.

The subgraph isomorphism matching unit 1116 may perform the matching order determination step (S204), the start vertex mapping step (S205), and the step of generating all possible mappings (S206) of FIG. 2, the subgraph isomorphism search method of FIG. 9, and the main procedures and functions of the other embodiment shown in FIG. 10.

The storage unit 1120 holds the information generated during candidate region search, namely the BFS information 1122 of the query, obtained by rewriting the query graph (q) into the BFS tree (q'), and the candidate region information 1124 found in the data graph (g) by candidate region search.

The data graph 1130 stores the data graph (g) to be searched. The data graph 1130 may have a structure that includes an inverted vertex label list for accessing the ID list sorted by vertex label, and a per-vertex adjacency list that stores each vertex's adjacency information.

With this configuration, the present invention can perform efficient and robust subgraph search using the new concept of candidate region search: it performs candidate region search to immediately identify the candidate subgraphs (i.e., candidate regions) that contain results, computes a robust matching order for each explored candidate region, and performs the search region by region according to the computed matching order.

Although preferred embodiments of the present invention have been described above, the present invention is not limited to them and may be practiced with various modifications within the scope of its technical idea; such modifications naturally fall within the scope of the appended claims.

200: Subgraph isomorphism search method using the candidate region search technique
1100: Subgraph isomorphism search system using the candidate region search technique
1110: Isomorphism search unit 1112: Query rewriting unit
1114: Candidate region search unit 1116: Subgraph isomorphism matching unit
1120: Storage unit 1122: BFS information of the query
1124: Candidate region information 1130: Data graph

Claims

A subgraph isomorphism search method using a candidate region search technique, the method comprising:
selecting a start query vertex (u_s) from a query graph (q);
obtaining a breadth-first search (BFS) tree (q') from the query graph (q);
searching for candidate regions by traversing a data graph (g) from the root vertex (u'_s) of the BFS tree (q') for each selected start vertex;
determining a matching order for each searched candidate region;
mapping the start vertex (u'_s) of the BFS tree (q') to the start vertex (v_s) of the candidate region; and
generating all possible mappings for each candidate region.

The method according to claim 1,
wherein the selecting step comprises:
ranking all query vertices (u) according to frequency and degree;
selecting the top-k query vertices in the ranking;
estimating the number of candidate regions for each of the top-k query vertices (u); and
selecting the vertex with the fewest candidate regions as the start query vertex.

The method of claim 2,
wherein the ranking step assigns a higher rank as the frequency is lower or as the degree is higher.

The method of claim 2,
wherein the estimating step determines whether the degree of the query vertex (u) is less than or equal to the degree of each vertex (v) belonging to the candidate region, and whether, for each label (l), the number of vertices with label l among the vertices adjacent to the query vertex (u) is less than or equal to the number of vertices with label l among the vertices adjacent to the vertex (v) belonging to the candidate region, and estimates the number of candidate regions that satisfy both conditions.

The method according to claim 1,
wherein the obtaining step generates the BFS tree (q') such that the size of the candidate regions matching the BFS tree (q') is small.

The method according to claim 1,
wherein the searching step comprises:
determining whether each unvisited data vertex (v') among the candidate data vertices (V_M) for a query vertex (u') of the BFS tree (q') satisfies an identification condition;
marking each data vertex (v') that satisfies the identification condition as visited, and ordering the child query vertices (u'_c) of the query vertex (u') according to the number of vertices adjacent to the data vertex (v');
determining whether the subtrees of the data vertex (v') match the subtrees of the query vertex (u');
if the subtrees match, resetting the visited mark of the data vertex (v') and adding it to the candidate subregion CR(u', v); and
once all data vertices (v') have been visited, deleting the entire candidate region if the number of data vertices belonging to the candidate subregion CR(u', v) is less than 1.

The method according to claim 6,
wherein determining the identification condition comprises determining whether the degree of the query vertex (u') is less than or equal to the degree of the data vertex (v'), and whether, for each label (l), the number of vertices with label l among the vertices adjacent to the query vertex (u') is less than or equal to the number of vertices with label l among the vertices adjacent to the data vertex (v') belonging to the candidate region.

The method according to claim 6,
further comprising pruning the subtree of a data vertex (v') that does not satisfy the identification condition without traversing that subtree.

The method according to claim 6,
further comprising, if the subtrees do not match, deleting the candidate regions for all vertices visited in the subtree of the data vertex (v').

The method of claim 2,
wherein the determining step comprises:
selecting, among the paths not yet selected, the path with the fewest candidate vertices; and
performing subgraph isomorphism matching starting from the path on the data graph (g) that corresponds to the query path containing the query vertex with the smallest estimated number of candidate vertices.

The method according to claim 1,
wherein the generating step comprises:
obtaining candidate subregions for a vertex (u') of the BFS tree (q');
safely discarding the combination if a data vertex (v') belonging to the candidate subregion has already been matched;
checking, for the vertex (u') of the BFS tree (q'), whether the edges between the vertex (u') and the already matched query vertices of the query graph (q) have corresponding edges between the already matched data vertices (v');
mapping the data vertex (v') to the corresponding query vertex (u) and updating state information if the edge condition is satisfied;
incrementing the number of matched mappings if a leaf query vertex is being visited; and
outputting the result if all mappings have been found.

The method of claim 11,
further comprising, if not all mappings have been found, restoring the modified state information and returning to the safely discarding step.

The method according to claim 6,
further comprising, after the adding step:
determining, for each leaf of the BFS tree (q'), whether the number of vertices obtained by candidate region search exceeds k;
performing a subgraph isomorphism check on the candidate vertices found, without searching further, if the number of vertices exceeds k; and
repeating the candidate region search for other candidate vertices, the determination of whether k is exceeded, and the check until the required subgraph isomorphisms are found.

A subgraph isomorphism search system using a candidate region search technique, the system comprising:
a query rewriting unit that selects a start query vertex (u_s) from a query graph (q) and obtains a breadth-first search (BFS) tree (q') from the query graph (q);
a candidate region search unit that searches for candidate regions by traversing a data graph (g) from the root vertex (u'_s) of the BFS tree (q') for each selected start vertex;
a subgraph isomorphism matching unit that determines a matching order for each searched candidate region, maps the start vertex (u'_s) of the BFS tree (q') to the start vertex (v_s) of the candidate region, and generates all possible mappings for each candidate region; and
a database in which the data graph (g) is stored.

The system of claim 14,
wherein the query rewriting unit ranks all query vertices (u) according to frequency and degree, selects the top-k query vertices, estimates the number of candidate regions for each of those query vertices (u), and selects the vertex with the fewest candidate regions as the start query vertex.

The system of claim 14,
wherein the candidate region search unit explores any candidate region induced by the input query tree depth-first from the start vertex of that region.

The system of claim 14,
wherein the subgraph isomorphism matching unit determines, for each leaf of the BFS tree (q'), whether the number of vertices obtained by candidate region search exceeds k, performs a subgraph isomorphism check on the candidate vertices found, without searching further, if the number exceeds k, and searches candidate regions for other candidate vertices if the number does not exceed k.

The system of claim 14,
wherein the data graph has a structure that includes an inverted vertex label list for accessing the ID list sorted by vertex label, and a per-vertex adjacency list that stores each vertex's adjacency information.