KR101441000B1

KR101441000B1 - A parallel change detection method for triple data

Info

Publication number: KR101441000B1
Application number: KR1020120143813A
Authority: KR
Inventors: 임동혁; 김홍기; 안진현
Original assignee: 서울대학교산학협력단
Priority date: 2012-12-11
Filing date: 2012-12-11
Publication date: 2014-09-17
Also published as: KR20140075456A

Abstract

The invention relates to a parallel-processing-based change detection method for triple data that, given source triple data and target triple data, finds the triples added to and deleted from the target relative to the source. The method comprises: (a) generating a triple group for each subject from the triple data; (b) sorting the triples within each triple group; (c) comparing the number of source triples with the number of target triples in the triple group; (d) if the numbers are equal, comparing the source triples and the target triples as a whole; (e) if the whole-group comparison shows a difference, comparing the source triples and target triples in the group one by one to build an added-triple list and a deleted-triple list; (f) if the numbers differ, determining whether the number of source triples or the number of target triples is 0 and, if neither is 0, performing step (e); and (g) if the number of source triples or the number of target triples is 0, placing the target triples in the added-triple list or the source triples in the deleted-triple list, respectively.
With this change detection method, RDF triple comparison can be processed in parallel, so RDF triple change detection can run in a distributed environment made up of several computers and large-scale RDF triple data can be handled.

Description

The present invention relates to a parallel-processing-based change detection method for triple data that, given two triple sets created at different points in time, takes one set as the reference and finds the triples added to and deleted from the other.

In general, the representative model for triple data is RDF. An RDF (Resource Description Framework) triple describes a piece of information using three descriptive units: a subject, a predicate, and an object.

For example, the following statement is expressed in RDF form as shown in FIG. 1.

"The genre of the music of the singer 015B is ballad."

The subject is the thing being described, the predicate states which attribute of the subject is being described, and the object is the value of that attribute. Because the description consists of these three parts, one such unit is usually called a triple.

There are several notations for the RDF model; the present invention deals with the N3 format. N3, or Notation 3, is one of the notations for the RDF (Resource Description Framework) model and writes triples in a simpler form than the XML notation.

FIG. 2 shows part of an actual Korean DBpedia infobox.

Information about 015B, such as genre, period of activity, and members, is given in subject, predicate, object form. The subject, predicate, and object are separated by tabs or spaces, and each triple is terminated by a period (.).
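
The line format just described can be read with a few lines of Python. This is a minimal sketch, not a full N3 parser: it assumes one triple per line, tab or space separators, and a trailing period, and the function name `parse_triples` and the sample identifiers are hypothetical.

```python
def parse_triples(text):
    """Parse a simplified N3-style listing into (subject, predicate, object) tuples.

    Assumption: one triple per line, fields separated by tabs or spaces,
    terminated by a period. Prefixes and multi-line statements are not handled.
    """
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith("."):
            line = line[:-1].rstrip()  # drop the terminating period
        # Split into at most three fields so an object may contain spaces.
        subject, predicate, obj = line.split(None, 2)
        triples.append((subject, predicate, obj))
    return triples


# Hypothetical sample data, loosely modeled on the 015B infobox description.
sample = """
<015B> <genre> <Ballad> .
<015B>\t<activeFrom> "1990" .
"""
for triple in parse_triples(sample):
    print(triple)
```

Splitting with a limit of three fields keeps an object literal that contains spaces in one piece, which a plain `split()` would break apart.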

The change detection problem is, given any two pieces of data, to take one as the reference and find the data added to and deleted from the other. The change detection problem for RDF triple data is therefore to take one set of RDF triple data as the reference and find the triples added to and deleted from the other.

FIG. 3 shows an example of the change detection problem for RDF triples.

As shown in FIG. 3, there are two kinds of input RDF triple data: the source and the target. Given these two sets of RDF triple data as input, the output of RDF change detection is the "deleted triple list" and the "added triple list" shown on the right side of FIG. 3. The triple <Jochiwon, belongs to, Chungcheong-do> is in the source but not in the target, so it is a deleted triple. The triples <Yeongi, belongs to, Sejong City> and <Jochiwon, belongs to, Sejong City> are not in the source but are in the target, so they are added triples.

FIG. 4 shows the structural diff result and the semantic diff result of change detection between version A and version B of RDF triple data.

There are two approaches to change detection for RDF triple data. The structural diff approach identifies added and deleted triples without considering inference over properties. The semantic diff approach treats any triple that can be derived by inference over properties as already present, and identifies added and deleted triples on that basis.

The added and deleted triples produced by structural diff change detection between version A and version B can be defined with set difference operations as follows.

[Equation 1]

added(A, B) = B - A = B - (A ∩ B)

removed(A, B) = A - B = A - (A ∩ B)
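
Equation 1 maps directly onto set difference. The following is a minimal Python sketch; the function name `structural_diff` is hypothetical, and the sample data uses only the three differing triples named in the FIG. 3 discussion plus one made-up common triple.

```python
def structural_diff(version_a, version_b):
    """Return (added, removed) per Equation 1: added = B - A, removed = A - B."""
    a, b = set(version_a), set(version_b)
    return b - a, a - b


# Triples are (subject, predicate, object) tuples.
source = [
    ("Jochiwon", "belongsTo", "Chungcheong-do"),
    ("Daejeon", "type", "City"),  # hypothetical unchanged triple
]
target = [
    ("Jochiwon", "belongsTo", "Sejong City"),
    ("Yeongi", "belongsTo", "Sejong City"),
    ("Daejeon", "type", "City"),
]

added, removed = structural_diff(source, target)
print("added:", sorted(added))
print("removed:", sorted(removed))
```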

For the semantic diff change detection method, they are defined as follows.

[Equation 2]

added(A, B) = inf(B) - inf(A)

removed(A, B) = inf(A) - inf(B)

Here, inf(A) denotes the data set obtained by applying inference rules to version A.

For example, inf(A) is the data set consisting of version A plus the following triples:

a rdfs:type d

b rdfs:type d

This follows from the transitivity rule "if a -> c and c -> d, then a -> d".

Therefore, in the semantic diff approach these two triples do not appear in the added data set: because they can be obtained through the inference above, they are not regarded as added.
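
Equation 2 can be illustrated with a toy inference step. The sketch below assumes a single transitive predicate, written `rdfs:type` to follow the notation used above, and makes up the contents of versions A and B, since FIG. 4 is not reproduced here. The names `infer` and `semantic_diff` are hypothetical.

```python
def infer(triples, transitive_predicate="rdfs:type"):
    """Close a triple set under the rule: a -> c and c -> d implies a -> d."""
    closed = set(triples)
    changed = True
    while changed:  # repeat until no new triple can be derived
        changed = False
        for s1, p1, o1 in list(closed):
            for s2, p2, o2 in list(closed):
                if p1 == p2 == transitive_predicate and o1 == s2:
                    derived = (s1, p1, o2)
                    if derived not in closed:
                        closed.add(derived)
                        changed = True
    return closed


def semantic_diff(version_a, version_b):
    """Return (added, removed) per Equation 2, using inf(A) and inf(B)."""
    inf_a, inf_b = infer(version_a), infer(version_b)
    return inf_b - inf_a, inf_a - inf_b


# Assumed example: version B states explicitly what version A only implies.
version_a = {("a", "rdfs:type", "c"), ("b", "rdfs:type", "c"), ("c", "rdfs:type", "d")}
version_b = version_a | {("a", "rdfs:type", "d"), ("b", "rdfs:type", "d")}

print(semantic_diff(version_a, version_b))
```

Under these assumptions a structural diff would report the two explicit triples as added, while the semantic diff reports nothing, because both triples are already in inf(A).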

This set-operation-based change detection can be implemented using memory, a database, and so on. For example, when there are two database tables, source and target, as in FIG. 5, the added triples (added) and the deleted triples (removed) can be obtained with SQL difference (MINUS) queries.
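
The database variant can be tried with Python's built-in `sqlite3` module. This is a sketch under assumptions: SQLite spells the difference operator `EXCEPT` (the standard SQL form of the `MINUS` operator mentioned above), and the table layout, column names `s`, `p`, `o`, and function name `sql_diff` are hypothetical rather than taken from FIG. 5.

```python
import sqlite3


def sql_diff(source_rows, target_rows):
    """Compute (added, removed) with SQL set difference on two triple tables."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(
            "CREATE TABLE source (s TEXT, p TEXT, o TEXT);"
            "CREATE TABLE target (s TEXT, p TEXT, o TEXT);"
        )
        conn.executemany("INSERT INTO source VALUES (?, ?, ?)", source_rows)
        conn.executemany("INSERT INTO target VALUES (?, ?, ?)", target_rows)
        # added = target - source
        added = conn.execute(
            "SELECT s, p, o FROM target EXCEPT SELECT s, p, o FROM source"
        ).fetchall()
        # removed = source - target
        removed = conn.execute(
            "SELECT s, p, o FROM source EXCEPT SELECT s, p, o FROM target"
        ).fetchall()
    finally:
        conn.close()
    return added, removed


added, removed = sql_diff(
    [("Jochiwon", "belongsTo", "Chungcheong-do")],
    [("Jochiwon", "belongsTo", "Sejong City"), ("Yeongi", "belongsTo", "Sejong City")],
)
print("added:", sorted(added))
print("removed:", sorted(removed))
```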

However, a memory-based implementation cannot handle large RDF triple data, because memory capacity is limited. A database-based implementation can handle large RDF triple data, but it takes a long time.

[1] Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150
[2] Apache Hadoop, http://hadoop.apache.org/
[3] SHA, http://ko.wikipedia.org/wiki/SHA

The object of the present invention is to solve the problems described above by providing a parallel-processing-based change detection method for triple data that, in order to reduce change detection time for large RDF triple data, uses a distributed environment made up of several computers to process RDF triple comparisons in parallel.

In particular, an object of the present invention is to provide a parallel-processing-based change detection method for triple data that reduces comparison time by skipping the comparison process in certain cases when comparing RDF triple data.

To achieve the above object, the present invention relates to a parallel-processing-based change detection method for triple data that, given source triple data and target triple data, finds the triples added to and deleted from the target relative to the source, the method comprising: (a) generating a triple group for each subject from the triple data; (b) sorting the triples within each triple group; (c) comparing the number of source triples with the number of target triples in the triple group; (d) if the numbers are equal, comparing the source triples and the target triples as a whole; (e) if the whole-group comparison shows a difference, comparing the source triples and target triples in the group one by one to build an added-triple list and a deleted-triple list; (f) if the numbers differ, determining whether the number of source triples or the number of target triples is 0 and, if neither is 0, performing step (e); and (g) if the number of source triples or the number of target triples is 0, placing the target triples in the added-triple list or the source triples in the deleted-triple list, respectively, wherein the added-triple list is the list of triples that are not in the source triple list but are in the target triple list, and the deleted-triple list is the list of triples that are in the source triple list but are not in the target triple list.

Further, in the parallel-processing-based change detection method for triple data according to the present invention, step (e) comprises: (e1) comparing the source triples and target triples in the triple group sequentially; (e2) if the source triple is greater than the target triple, placing the target triple in the added-triple list and selecting the next target triple; (e3) if the target triple is greater than the source triple, placing the source triple in the deleted-triple list and selecting the next source triple; (e4) if the source triple and the target triple are equal, selecting the next source triple and the next target triple; and (e5) repeating steps (e1) to (e4) while source triples and target triples remain (S55).

Further, in the parallel-processing-based change detection method for triple data according to the present invention, in step (d), all source triples in the triple group are converted into a character string (hereinafter, the source string), all target triples are converted into a character string (the target string), the source string and the target string are hashed, and the resulting hash values are compared to determine whether they are identical.

Further, in the parallel-processing-based change detection method for triple data according to the present invention, each triple consists of a subject, a predicate, and an object, in that order.

Further, in the parallel-processing-based change detection method for triple data according to the present invention, the triples are sorted in the order subject, predicate, object, and each of the subject, predicate, and object is sorted as a character string.

As described above, the parallel-processing-based change detection method for triple data according to the present invention allows RDF triple comparison to be processed in parallel, so RDF triple change detection can run in a distributed environment made up of several computers and large-scale RDF triple data can be handled.

In addition, the parallel-processing-based change detection method for triple data according to the present invention skips the comparison process in certain cases when comparing RDF triple data, which considerably reduces comparison time.

FIG. 1 is an example of an RDF triple according to the prior art.
FIG. 2 is an example of part of the Korean DBpedia infobox written in N3 format, according to the prior art.
FIG. 3 illustrates the change detection problem for RDF triples according to the prior art.
FIG. 4 shows an example of change detection by structural diff and semantic diff according to the prior art.
FIG. 5 illustrates a prior-art change detection method for RDF triple data using a database.
FIG. 6 shows the configuration of the overall system for carrying out the present invention.
FIG. 7 is a flowchart describing the parallel-processing-based change detection method for triple data according to an embodiment of the present invention.
FIG. 8 is a flowchart describing the method of comparing triple data to find added and deleted triples according to an embodiment of the present invention.
FIG. 9 shows an example of an algorithm for the method of FIG. 8.
FIG. 10 shows an example of finding added and deleted triples by the method of FIG. 8.
FIG. 11 shows an example of finding added and deleted triples by parallel processing of triple data according to an embodiment of the present invention.

Hereinafter, specific details for carrying out the present invention are described with reference to the drawings.

In describing the present invention, identical parts are given identical reference numerals and their repeated description is omitted.

First, examples of the configuration of the overall system for carrying out the present invention are described with reference to FIGS. 6a and 6b.

As shown in FIG. 6a, the parallel-processing-based change detection method for triple data according to the present invention may be implemented as a program system on a computer terminal or server (20) that receives source triple data (11) and target triple data (11) and finds the triples newly added to and deleted from the target triple data (11) relative to the source triple data (11). That is, the change detection method may be built as a program and installed and executed on the computer terminal or server (20). The program installed on the computer terminal or server (20) can operate as a single program system (30).

In another embodiment, besides being built as a program running on a general-purpose computer, the change detection method may be implemented as a single electronic circuit such as an ASIC (application-specific integrated circuit). It may also be developed as a dedicated computer terminal (20) that handles only change detection of triple data. This is referred to as a parallel-processing-based triple data change detection apparatus. Other possible forms may also be implemented.

Next, a method of processing triple change detection in parallel on multiple servers according to an embodiment of the present invention is described with reference to FIG. 6b.

As shown in FIG. 6b, multiple servers S₀, S₁, ..., S_N are connected to one another through a network (not shown). One of the servers, S₀, acts as the master computer, and the remaining servers S₁, ..., S_N act as slave computers.

That is, server S₀ is the master computer on which the input files (the source and target triple data) are stored, and the remaining servers are slave computers that carry out the comparison work.

The master computer gathers, from the source and target triple lists, only the triples that share the same subject and sends them to a slave computer. Triples that came from the source are marked with a trailing S, and triples that came from the target are marked with a trailing T.

The slave computer performs the comparison on what it receives and obtains a deleted (del) list and an added (add) list (the comparison is described in detail below). When the comparison is finished, the slave computer sends the del list and add list it produced to the master computer.

The master computer merges the individual del lists into one and the individual add lists into one. This yields the final del list and add list for the original source and target triple lists.
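
The master/slave exchange can be imitated on one machine. In the sketch below a thread pool stands in for the slave computers, which is an assumption made only to keep the example runnable; the function names are hypothetical, and the per-group comparison is a plain set difference rather than the detailed procedure described later.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def compare_group(tagged_triples):
    """Slave-side job: diff one subject group of (triple, tag) pairs."""
    src = {triple for triple, tag in tagged_triples if tag == "S"}
    tgt = {triple for triple, tag in tagged_triples if tag == "T"}
    return sorted(tgt - src), sorted(src - tgt)  # (add list, del list)


def detect_changes(source, target, workers=4):
    """Master-side job: group by subject, dispatch the groups, merge the results."""
    groups = defaultdict(list)
    for triple in source:
        groups[triple[0]].append((triple, "S"))  # tag source triples with S
    for triple in target:
        groups[triple[0]].append((triple, "T"))  # tag target triples with T

    add_list, del_list = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for add_part, del_part in pool.map(compare_group, groups.values()):
            add_list.extend(add_part)  # merge partial add lists
            del_list.extend(del_part)  # merge partial del lists
    return sorted(add_list), sorted(del_list)


add_list, del_list = detect_changes(
    [("Jochiwon", "belongsTo", "Chungcheong-do")],
    [("Jochiwon", "belongsTo", "Sejong City"), ("Yeongi", "belongsTo", "Sejong City")],
)
print("add:", add_list)
print("del:", del_list)
```

Because every triple with a given subject lands in exactly one group, the partial lists never overlap and the merge step is a simple concatenation.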

Next, the parallel-processing-based change detection method for triple data according to an embodiment of the present invention is described with reference to FIG. 7.

As shown in FIG. 7, the parallel-processing-based triple data change detection method according to an embodiment of the present invention consists of: (a) generating a triple group per subject (S10); (b) sorting the triples (S20); (c) comparing the source and target triple counts (S30); (d) if the counts are equal, comparing the source and target triples as a whole (S40); (e) if the whole-group comparison shows a difference, comparing the individual triples (S50); (f) if the counts differ, determining whether either count is 0 (S60); and (g) if a count is 0, placing the source or target triples in the deleted or added list (S70).

First, the RDF triple data given as input is partitioned by subject (S10). Triples with the same subject are gathered to create one triple group per subject. A triple group contains triples from both the source and the target, and each triple is marked to indicate whether it came from the source or from the target.

Next, the triples in each triple group are sorted (S20). Sorting is based on the characters in the triples. Whether a triple came from the source or the target is ignored during sorting; only the subject, predicate, and object are used.
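
Steps S10 and S20 can be sketched together in Python. The representation is an assumption: each group member is a `(triple, tag)` pair with tag `"S"` or `"T"`, and `build_sorted_groups` is a hypothetical name.

```python
from collections import defaultdict


def build_sorted_groups(source, target):
    """S10: group tagged triples by subject. S20: sort each group."""
    groups = defaultdict(list)
    for triple in source:
        groups[triple[0]].append((triple, "S"))
    for triple in target:
        groups[triple[0]].append((triple, "T"))
    for members in groups.values():
        # Sort on (subject, predicate, object) only; the S/T tag is ignored.
        members.sort(key=lambda tagged: tagged[0])
    return dict(groups)


groups = build_sorted_groups(
    source=[("A", "p", "B"), ("A", "k", "D")],
    target=[("A", "k", "B"), ("A", "p", "B"), ("D", "p", "C")],
)
for subject, members in groups.items():
    print(subject, members)
```

Python compares tuples field by field, so sorting on the triple itself gives the subject, predicate, object ordering without extra code.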

Next, the number of source triples and the number of target triples in each triple group are compared (S30). What happens next depends on these counts.

First, if the number of source triples equals the number of target triples, the method checks whether the full contents of the source triples and the target triples are identical (S40).

For a triple group in which the number of triples from the source equals the number from the target, there are two cases. In the first, the two sets consist of exactly the same triples; in the second, the counts are equal but the triples differ.

To tell these apart, the source triple list and the target triple list are each converted into a character string. Take as an example a triple group in which S marks a triple from the source and T marks a triple from the target.

The lists are converted into strings as follows.

Source: "apccpc"

Target: "bwtdpq"

A hash function is then applied to each of these two strings.

A hash function converts an input string into another string. The output string of a hash function is called a hash value and is typically shorter than the input string.

In general, if two input strings are equal their hash values are equal, and if they differ their hash values differ as well (barring hash collisions). Therefore, when deciding whether two large strings are equal, it is efficient to compare only the hash values instead of comparing the input strings character by character.

In the present invention as well, if the hash value of the source triple group (that is, of its string) equals the hash value of the target triple group, the source triple group and the target triple group consist of the same triples. In that case there is nothing in this triple group to identify as an added or deleted triple.

If the two hash values differ, the source triple list and the target triple list contain different triples. In this case the method proceeds to the step of comparing the triple lists one by one (S50).
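
The whole-group check of step S40 can be sketched with Python's `hashlib`. SHA-1 is chosen here only as an assumption, since the text does not name a specific hash function, and the helper names are hypothetical. The triples are concatenated without separators to match the "apccpc" style of string shown above.

```python
import hashlib


def group_string(triples):
    """Concatenate the sorted triples of one side of a group into one string."""
    return "".join(s + p + o for s, p, o in sorted(triples))


def group_digest(triples):
    """Hash the concatenated string (SHA-1 is an assumed choice)."""
    return hashlib.sha1(group_string(triples).encode("utf-8")).hexdigest()


def same_content(source_triples, target_triples):
    """S40: treat the two sides as identical when their hash values match."""
    return group_digest(source_triples) == group_digest(target_triples)


# Single-letter terms reproduce the example strings from the text.
source_side = [("a", "p", "c"), ("c", "p", "c")]
target_side = [("b", "w", "t"), ("d", "p", "q")]
print(group_string(source_side), group_string(target_side))
print(same_content(source_side, target_side))
```

Concatenating without a delimiter can make different triple lists produce the same string (for example "ab" + "c" and "a" + "bc"), so an implementation that needs to rule this out could join the terms with a separator that cannot occur in the data.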

Next, the case in which the number of triples from the source differs from the number of triples from the target is described.

If the number of source triples and the number of target triples differ, the method checks whether the number of source triples or the number of target triples is 0 (S60).

If the number of source triples or the number of target triples is 0, all triples of the non-empty side are placed in the added-triple list or the deleted-triple list (S70).

That is, if the number of source triples is 0, the triples of the triple group are placed in the added-triple list. If the number of target triples is 0, the triples of the triple group are placed in the deleted-triple list.

If neither the number of source triples nor the number of target triples is 0, the method proceeds to the step of comparing the triple lists one by one (S50).
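
The branching across steps S30 to S70 can be summarized in one function. This is a sketch under assumptions: `diff_group` and `compare_each` are hypothetical names, direct list equality stands in for the hash comparison of S40, and `compare_each` is a simple placeholder for the one-by-one comparison of S50.

```python
def compare_each(src, tgt):
    """Placeholder for step S50: report triples present on one side only."""
    added = [t for t in tgt if t not in src]
    removed = [s for s in src if s not in tgt]
    return added, removed


def diff_group(source_triples, target_triples):
    """Return (added, removed) for one subject group, following S30 to S70."""
    src, tgt = sorted(source_triples), sorted(target_triples)
    if len(src) == len(tgt):           # S30: counts are equal
        if src == tgt:                 # S40: contents identical, nothing to report
            return [], []
        return compare_each(src, tgt)  # S50
    if not src:                        # S60/S70: no source triples, all are added
        return tgt, []
    if not tgt:                        # S60/S70: no target triples, all are deleted
        return [], src
    return compare_each(src, tgt)      # S50


print(diff_group([], [("A", "p", "B")]))
print(diff_group([("A", "p", "B")], []))
print(diff_group([("A", "p", "B")], [("A", "p", "C")]))
```

The two early returns are where comparison work is skipped entirely: identical groups and groups that exist on only one side never reach the one-by-one comparison.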

Finally, there is the step of comparing the two given triple lists to find the triples that differ (S50). This step (S50) is described in more detail below.

다음으로, 본 발명의 일실시예에 따른 2개의 트리플 리스트를 비교해서 다른 트리플을 찾아내는 단계를 도 8 내지 도 10을 참조하여 보다 구체적으로 설명한다.Next, the step of comparing two triple lists according to an embodiment of the present invention to find other triples will be described in more detail with reference to FIGS. 8 to 10. FIG.

도 8에서 보는 바와 같이, 본 발명에 따른 소스/타겟 트리플 그룹을 비교하여 변경된 트리플을 찾는 단계(S50)는, (e1) 정렬된 리스트에서 소스 트리플과 타겟 트리플을 순차적으로 비교하는 단계(S51); (e2) 소스 트리플이 타겟 트리플보다 큰 경우, 타겟 트리플을 추가된 트리플 목록에 포함시키고 다음 타겟 트리플을 선정하는 단계(S52); (e3) 타겟 트리플이 소스 트리플보다 큰 경우, 소스 트리플을 삭제된 트리플 목록에 포함시키고 다음 소스 트리플을 선정하는 단계(S53); (e4) 소스/타겟 트리플이 같은 경우, 다음 소스/타겟 트리플을 선정하는 단계(S54); 및, (e5) 남은 소스/타겟 트리플이 있을 때까지 상기 (e1) 단계에서 (e4) 단계를 반복하는 단계(S55)로 구성된다.As shown in FIG. 8, the step S50 of comparing the source / target triple groups according to the present invention to the changed triplets S50 includes the steps of sequentially comparing the source triple and the target triple in the sorted list S51, ; (e2) if the source triple is larger than the target triple, including the target triple in the added triple list and selecting the next target triple (S52); (e3) if the target triple is larger than the source triple, including the source triple in the deleted triple list and selecting the next source triple (S53); (e4) selecting the next source / target triple if the source / target triple is the same (S54); And (e5) repeating (e1) to (e4) until there is a remaining source / target triple (S55).

도 9는 도 8의 단계들을 알고리즘으로 구현한 것이고, 도 10은 도 8의 단계들의 예를 표시한 것이다.Figure 9 is an algorithm implementation of the steps of Figure 8, and Figure 10 is an example of the steps of Figure 8.

한 트리플 그룹이므로 주어(Subject) 부분은 모두 같다. 술어(Predicate) 부분은 다를 수 있으나 설명을 위해서 술어(Predicate) 부분도 모두 같다고 가정한다.Because it is a triple group, the subject parts are all the same. Predicate parts may be different, but assume that the predicate parts are all the same for the sake of explanation.

앞서 (b) 단계(S20)에서 정렬을 했으므로 목적어(Object) 부분의 알파벳 순서로 정렬되어 있다. 또한 한 트리플 그룹이므로 소스(Source) 트리플과 타겟(Target) 트리플이 같이 정렬이 되어 섞여있었지만 따로 분리한다.Since they are aligned in step (b) and step (S20), they are arranged in the alphabetical order of the object part. Also, since it is a triple group, the source triple and the target triple are arranged in the same order, but are separated.

도 9 및 도 10에서 보는 바와 같이, 먼저 소스(Source)에 있는 첫 번째 트리플(<S P E>)과 타겟(Target)에 있는 첫 번째 트리플(<S P A>)부터 비교를 한다(S51). E > A 이므로 알고리즘 상의 첫 번째 if문에 의해 <S P A>는 추가된 트리플 목록에 포함시킨다(S52). t 인덱스만 1증가하여 다음으로 비교할 것은 <S P E>와 타겟(Target)의 두 번째 트리플 <S P D>이다. 마찬가지로 E > D 이므로 <S P D>를 추가된 트리플 목록에 포함시킨다(S53). 그 다음으로 <S P E>와 <S P J>를 비교하면 E < J 이므로 두 번째 if문에 의해 <S P E>는 삭제된 트리플 목록에 포함시킨다. As shown in FIGS. 9 and 10, the first triple (<SPE>) in the source is compared with the first triple (<SPE>) in the target (S51). Since E> A, <S P A> is included in the added triple list by the first if statement in the algorithm (S52). The only increment of t index is 1, and the next comparison is <S P E> and the second triple <S P D> of the target. Similarly, since E> D, <S P D> is included in the added triple list (S53). Then, comparing <S P E> and <S P J>, E <J, so the <S P E> by the second if statement is included in the deleted triple list.

이런 식으로 진행하면 도 8의 맨 마지막 부분까지 된다. 소스(Source)의 모든 트리플을 탐색했으므로 Loop를 종료한다(S55). Loop를 종료했는데 타겟(Target)에는 아직 탐색하지 않은 2개의 트리플이 있다. 이 트리플들은 모두 추가된 트리플 목록에 포함시킨다.Proceeding this way leads to the last part of Figure 8. Since all the triples of the source have been searched, the loop is terminated (S55). I quit the loop, but there are two triples in the target that have not been searched yet. These triples are all included in the added triple list.

The algorithm of FIG. 9 can be used only when both lists are sorted, and it is fast because it traverses each list only once.
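The walkthrough above can be sketched in Python as follows. This is a minimal illustration and not the code of FIG. 9: the function name `diff_sorted`, the tuple representation of a triple, and the example lists are assumptions. Only the first source triple (E) and the first three target triples (A, D, J) are taken from the description; the remaining triples are hypothetical. Appending leftover source triples to the deleted list is also an assumption, since the description only covers leftover target triples.

```python
def diff_sorted(source, target):
    """Compare two sorted triple lists in one pass (hypothetical helper)."""
    added, deleted = [], []
    s = t = 0
    while s < len(source) and t < len(target):
        if source[s] > target[t]:
            # Target triple is absent from the source: it was added.
            added.append(target[t])
            t += 1
        elif source[s] < target[t]:
            # Source triple is absent from the target: it was deleted.
            deleted.append(source[s])
            s += 1
        else:
            # Same triple on both sides: advance both indexes.
            s += 1
            t += 1
    # Whatever is left on one side has no counterpart on the other.
    deleted.extend(source[s:])  # assumption: not spelled out in the text
    added.extend(target[t:])
    return added, deleted


# Hypothetical example lists, already sorted by object.
source = [("S", "P", "E"), ("S", "P", "J")]
target = [("S", "P", "A"), ("S", "P", "D"), ("S", "P", "J"),
          ("S", "P", "K"), ("S", "P", "M")]
added, deleted = diff_sorted(source, target)
```

With these lists, the loop stops once the source is exhausted, and the two target triples that were not visited are appended to the added list, as in the description.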

Next, a method of distributing RDF triple data and processing it in parallel according to an embodiment of the present invention will be described with reference to FIG. 11.

The parallel processing may use the MapReduce framework [Non-Patent Document 1]. The MapReduce framework is a kind of parallel processing scheme and is one of the methods used by the Google search engine to process large-scale web data. In the Map phase, the data is partitioned according to a specific rule, and in the Reduce phase, operations are performed on the partitioned data.

Apache Hadoop [Non-Patent Document 2] is an open source library that implements the MapReduce framework in the Java language. In a distributed computing environment, unexpected errors can occur on individual computers, so a way to monitor and handle them is needed. Apache Hadoop also provides distributed computing management functions, such as restarting a task or handing it over to another computer when an error occurs in the Map phase or the Reduce phase on a particular computer.

FIG. 11 is a diagram expressing the algorithm of the present invention in the MapReduce framework.

In the Map phase, all RDF triples in the source and the target are partitioned by subject. Tasks are created so that task #1 holds the triples whose subject is A, task #2 holds the triples whose subject is D, and so on. If there are fewer machines than tasks, the remaining tasks wait, and when a running task completes, a waiting task is sent to that machine. In task #1, the triples (<A p B> <A k D>) come from the source, so they are marked with S at the end. Likewise, in task #1, the triples (<A k B> <A p B>) come from the target, so they are marked with T at the end.
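The partitioning described for the Map phase might look like the following single-process Python sketch. It does not use Hadoop; the function name `map_by_subject` and the tuple layout are assumptions made for illustration. The triples with subject A are those given for task #1, while the triples with subject D are hypothetical.

```python
from collections import defaultdict


def map_by_subject(source, target):
    """Group triples by subject and tag each with its origin (S or T)."""
    groups = defaultdict(list)
    for origin, triples in (("S", source), ("T", target)):
        for subj, pred, obj in triples:
            # The subject is the partition key; the origin tag is appended.
            groups[subj].append((subj, pred, obj, origin))
    return dict(groups)


# Subject-A triples are from the description; subject-D triples are made up.
source = [("A", "p", "B"), ("A", "k", "D"), ("D", "p", "C")]
target = [("A", "k", "B"), ("A", "p", "B"), ("D", "p", "C")]
tasks = map_by_subject(source, target)
```

Each key of `tasks` corresponds to one task, so `tasks["A"]` plays the role of task #1 and `tasks["D"]` the role of task #2.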

The Reduce phase compares the two given triple lists (source and target). If the two triple lists have the same size, their hash values are compared using a hash function (SHA). If the hash values are the same, the two triple lists are identical and the group is ignored. If they are not the same, the two triple lists are compared to find the added triples and the deleted triples.
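The decision flow of the Reduce phase for one triple group can be sketched in Python as below. Several points are assumptions: the description names only "SHA", so SHA-256 from `hashlib` is chosen here; the way triples are joined into one string is illustrative; and the final comparison uses set membership as a stand-in for the sorted single-pass comparison of FIG. 9. The names `digest` and `reduce_group` are hypothetical.

```python
import hashlib


def digest(triples):
    """Hash a sorted triple list as one string (SHA-256 is an assumption)."""
    joined = "\n".join(" ".join(t) for t in triples)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def reduce_group(tagged):
    """Return (added, deleted) for one group of origin-tagged triples."""
    source = sorted(t[:3] for t in tagged if t[3] == "S")
    target = sorted(t[:3] for t in tagged if t[3] == "T")
    # One side is empty: every triple on the other side is added or deleted.
    if not source or not target:
        return target, source
    # Same size and same hash: the lists are identical, nothing to report.
    if len(source) == len(target) and digest(source) == digest(target):
        return [], []
    # Otherwise compare the lists (set lookup stands in for the FIG. 9 pass).
    src_set, tgt_set = set(source), set(target)
    added = [t for t in target if t not in src_set]
    deleted = [t for t in source if t not in tgt_set]
    return added, deleted


# Task #1 from the description: two source (S) and two target (T) triples.
task1 = [("A", "p", "B", "S"), ("A", "k", "D", "S"),
         ("A", "k", "B", "T"), ("A", "p", "B", "T")]
added, deleted = reduce_group(task1)
```

For task #1 the two lists have the same size but different hash values, so the group falls through to the comparison, which reports <A k B> as added and <A k D> as deleted.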

Systems that publish data on the web in a standardized form so that anyone can easily use it are being actively built. One such web data standard is LOD. LOD stands for Linked Open Data and refers to data from public institutions and the like described in a standardized form (RDF triples).

In LOD, triples are added and deleted over time. When using such LOD, an efficient LOD utilization system can be designed if the triples common to the previous version and the current version of the LOD can be distinguished from those that are not. Therefore, an RDF triple change detection technique that identifies the added triples and the deleted triples in any two given triple lists is important.

Although the invention made by the present inventors has been described in detail with reference to the above embodiment, the present invention is not limited to that embodiment, and it goes without saying that various modifications may be made without departing from its gist.

11: source triple data 12: target triple data
20: computer terminal 30: program system

Claims

1. A parallel-processing-based change detection method for triple data which, for source triple data and target triple data, finds the triples added to and deleted from the target triples relative to the source triple data, the method comprising:
(a) generating a triple group by subject from the triple data;
(b) sorting the triples in the triple group;
(c) comparing the number of source triples and target triples in the triple group;
(d) if the numbers are the same, comparing the source triples and the target triples as a whole;
(e) if the whole comparison shows a difference, comparing the source triples and the target triples in the group one by one to create an added triple list and a deleted triple list;
(f) if the numbers are different, determining whether the number of source triples or the number of target triples is 0, and if neither is 0, performing step (e); and
(g) if the number of source triples or the number of target triples is 0, including the source triples or the target triples in the added triple list or the deleted triple list,
wherein the added triple list is a list of triples that are not included in the source triple list but are included in the target triple list, and
wherein the deleted triple list is a list of triples that are included in the source triple list but not included in the target triple list.

2. The method of claim 1, wherein step (e) comprises:
(e1) sequentially comparing the source triples and the target triples in the triple group by the strings of the corresponding triple data;
(e2) if the source triple is greater than the target triple, including the target triple in the added triple list and selecting the next target triple;
(e3) if the target triple is greater than the source triple, including the source triple in the deleted triple list and selecting the next source triple;
(e4) if the source triple and the target triple are the same, selecting the next source triple and the next target triple; and
(e5) repeating steps (e1) through (e5) until no source triples and target triples remain (S55).

3. The method of claim 2,
wherein in step (d), all the source triples in the triple group are converted into a string (hereinafter, the source string), all the target triples are converted into a string (hereinafter, the target string), the source string and the target string are each hashed, and the hashed values are compared.

4. The method of claim 1,
wherein the triple comprises a subject, a predicate, and an object, in that order.

5. The method of claim 4,
wherein the triples are sorted by the strings of the subject, the predicate, and the object, in the order of subject, predicate, and object.