KR101524375B1

KR101524375B1 - Method for similarity joins by adaptive prefix filtering

Info

Publication number: KR101524375B1
Application number: KR1020130156178A
Authority: KR
Inventors: 박종수
Original assignee: 성신여자대학교 산학협력단
Priority date: 2013-12-16
Filing date: 2013-12-16
Publication date: 2015-07-01

Abstract

Disclosed is a similarity join method using adaptive prefix filtering, which uses adaptive prefixes to generate and verify candidate pairs whose similarity in a data set is greater than or equal to a given threshold. The method comprises the steps of: (a) setting the prefix-token ranges of a search record x and an indexing record y that are to be compared; (b) generating the search record and the indexing record as a similarity join candidate pair if the search token being compared and the indexing token of the indexing record both lie within the set prefix ranges; and (c) verifying the similarity join candidate pairs to determine the similarity join pairs.

Description

METHOD FOR SIMILARITY JOINS BY ADAPTIVE PREFIX FILTERING

The disclosed technology relates to a similarity join method using adaptive prefix filtering and, more particularly, to an efficient similarity join method that applies the prefix filtering principle as a constraint when generating candidate pairs, so that similarity joins run quickly.

With the spread of the Internet and the advance of mobile computing, the volume of data is growing rapidly. As online participation, openness, and sharing become more widespread, the boundary between traditional information providers and consumers is breaking down, and users increasingly expect that the information they want will be available online.

To find the information a user wants in such a vast amount of data, however, the similarity of the data must first be compared.

Conventional techniques that process data to compare similarity include "Method and System for Calculating Similarity Between Objects Using Term Similarity" (Korean Patent Laid-Open No. 10-2009-0063801), "Similarity Determination Method and Similarity Determination Apparatus" (Korean Patent Registration No. 10-1265062), and "User-Based Collaborative Filtering Recommendation Method and System for Correcting Similarity Using Information Entropy" (Korean Patent Laid-Open No. 10-2010-0086296). None of these, however, applies the prefix filtering principle when generating candidate pairs for a fast similarity join.

Meanwhile, the PPJoin, PPJoin+, and MPJoin algorithms have been proposed as similarity join algorithms that apply the prefix filtering principle. However, PPJoin generates too many join candidate pairs and therefore spends a long time on join verification; PPJoin+ likewise suffers from long verification times; and MPJoin was found in experiments to give unsatisfactory results.

The present invention is proposed to overcome the limitations of the prior art described above. Its object is to provide an efficient similarity join method that uses the prefix filtering principle as a strong constraint when generating candidate pairs for a fast similarity join, that is, an efficient similarity join method using adaptive prefix filtering (hereinafter referred to as 'APJoin').

To achieve this technical object, the present invention provides a similarity join method, which is a data processing method that uses prefix filtering to generate and verify candidate pairs whose similarity in a data set is at least a given threshold, the method comprising the steps of: (a) setting the prefix-token ranges of a search record x and an indexing record y that are to be compared; (b) generating the search record and the indexing record as a similarity join candidate pair if the search token being compared and the indexing token of the indexing record lie within the set prefix ranges; and (c) verifying the similarity join candidate pairs to determine the similarity join pairs.

Here, step (a) includes calculating the similarity between the search record x and the indexing record y.

The similarity between the search record x and the indexing record y may be calculated as the Jaccard similarity, defined as

$$J(x, y) = \frac{|x \cap y|}{|x \cup y|},$$

and as the overlap similarity, defined as

$$O(x, y) = |x \cap y|.$$

The Jaccard similarity and the overlap similarity are related by

$$J(x, y) \ge t \iff O(x, y) \ge \alpha = \left\lceil \frac{t}{1+t}\,(|x| + |y|) \right\rceil,$$

where t is the Jaccard similarity threshold and α is the overlap similarity threshold.

Specifically, in step (a), the number of prefix tokens of the search record x, prefix_x, may be set to

$$\text{prefix\_x} = |x| - \alpha + 1,$$

and the number of prefix tokens of the indexing record y, prefix_y, may be set to

$$\text{prefix\_y} = |y| - \alpha + 1,$$

where |x| is the number of tokens in record x and |y| is the number of tokens in record y.

Also in step (a), the maximum size that the prefix of the search record x can take, max_probe_prefix, may be determined as

$$\text{max\_probe\_prefix} = |x| - \lceil t \cdot |x| \rceil + 1,$$

and the maximum size to be stored in the inverted index when the tokens of the search record x are later used as tokens of an indexing record, max_index_prefix, may be determined as

$$\text{max\_index\_prefix} = |x| - \left\lceil \frac{2t}{1+t}\,|x| \right\rceil + 1.$$

Step (b) compares the i-th token (x, i) of the search record x with the j-th token (y, j) of an indexing record y retrieved from the inverted index of that token, and includes: deleting the token (y, j) if the size of y is smaller than t·|x|; otherwise, deleting the token (y, j) if j lies outside the prefix range of y; otherwise, moving on to the next token if i lies outside the prefix range of x; and otherwise determining x and y to be a similarity join candidate pair (where t is the Jaccard similarity threshold).

In step (c), the tokens of the two records of each similarity join candidate pair determined in step (b) are compared, and the similarity join pairs may be determined by comparing the current number of common tokens, the number of remaining tokens, and the α value.

According to another aspect of the present invention, there is provided a recording medium programmed to perform the similarity join method described above.

With APJoin, the efficient similarity join method using adaptive prefix filtering proposed by the present invention, mutually similar records can be found efficiently among a large number of records, with a reduced execution time.

FIG. 1 illustrates the prefix ranges of the search record x and the indexing record y, to explain the efficient similarity join method using adaptive prefix filtering according to an embodiment of the present invention.
FIG. 2 shows an example algorithm implementing the efficient similarity join method using adaptive prefix filtering according to an embodiment of the present invention.
FIG. 3 is a graph showing the results of applying the efficient similarity join method using adaptive prefix filtering according to an embodiment of the present invention to the DBLP data.
FIG. 4 is a graph showing the results of applying the efficient similarity join method using adaptive prefix filtering according to an embodiment of the present invention to the ENRON data.

The description of the disclosed technology is merely an embodiment for structural or functional explanation, so the scope of the disclosed technology should not be construed as limited by the embodiments described herein. That is, because the embodiments can be modified in various ways and can take various forms, the scope of the disclosed technology should be understood to include equivalents capable of realizing its technical idea.

Unless otherwise defined, all terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the disclosed technology belongs. Terms defined in commonly used dictionaries should be interpreted as consistent with their meaning in the context of the related art and, unless explicitly defined in this application, should not be interpreted in an idealized or overly formal sense.

The steps may also occur in an order different from the stated one unless the context clearly specifies a particular order. That is, the steps may be performed in the stated order, substantially concurrently, or in the reverse order.

Similarity is computed by a similarity function that returns a value in [0, 1] indicating how similar the tokens making up two objects are. A similarity join is an operation that finds all pairs of records in a data set whose similarity is at least a given threshold. Similarity functions include Jaccard similarity, cosine similarity, and overlap similarity; the present invention proposes a similarity join algorithm based on Jaccard similarity.
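As a point of reference, the definition of a similarity join given above can be written directly as a brute-force comparison of every pair of records. The sketch below is only an illustrative baseline and is not part of the disclosed method; the function names `jaccard` and `naive_similarity_join` and the sample records are assumptions made for this example.

```python
from itertools import combinations


def jaccard(x, y):
    """Jaccard similarity |x ∩ y| / |x ∪ y| of two token collections."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # convention chosen here for two empty records
    return len(x & y) / len(x | y)


def naive_similarity_join(records, t):
    """Return all index pairs (a, b), a < b, whose Jaccard similarity is >= t."""
    return [
        (a, b)
        for a, b in combinations(range(len(records)), 2)
        if jaccard(records[a], records[b]) >= t
    ]


# Hypothetical sample data: each record is a set of integer tokens.
records = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}, {7, 8, 9}]
print(naive_similarity_join(records, 0.8))  # [(0, 2), (1, 2)]
```

This baseline compares every pair, so its cost grows quadratically with the number of records; that cost is what index-based candidate generation is meant to avoid.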

Index-based similarity join algorithms generally comprise two stages, candidate generation and verification, and research has pursued both approaches that generate fewer join candidates and approaches that speed up verification.

The present invention applies the prefix filtering principle to the similarity join algorithm in an adaptive manner. The features and effects of the present invention are described below with reference to FIGS. 1 to 4.

FIG. 1 illustrates the prefix ranges of the search record x and the indexing record y, to explain the efficient similarity join method using adaptive prefix filtering according to an embodiment of the present invention.

The choice of similarity function depends heavily on the application domain. In the present invention, Jaccard similarity and overlap similarity are used in the step of generating similarity join candidate pairs.

Let |x| and |y| denote the numbers of tokens in record x, the search record, and record y, the indexing record, respectively. The Jaccard similarity is defined by

$$J(x, y) = \frac{|x \cap y|}{|x \cup y|},$$

and the overlap similarity is expressed by

$$O(x, y) = |x \cap y|.$$

If t denotes the Jaccard similarity threshold, given as a value in [0, 1], the Jaccard similarity and the overlap similarity are related by the following relation [Equation 1].

[Equation 1]

$$J(x, y) \ge t \iff O(x, y) \ge \alpha = \left\lceil \frac{t}{1+t}\,(|x| + |y|) \right\rceil$$

In [Equation 1], α is the minimum overlap similarity, which is shown again in [Equation 2].

[Equation 2]

$$\alpha = \left\lceil \frac{t}{1+t}\,(|x| + |y|) \right\rceil$$
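The relation between the two thresholds can be checked numerically. The sketch below assumes that α has the usual form ⌈t/(1+t)·(|x|+|y|)⌉ and confirms, for small record sizes, that a Jaccard similarity of at least t is equivalent to an overlap of at least α. The names `min_overlap` and `jaccard_from_sizes` are hypothetical, and exact rational arithmetic is an implementation choice made here to keep the ceiling free of floating-point rounding.

```python
import math
from fractions import Fraction


def min_overlap(t, size_x, size_y):
    """Minimum overlap alpha = ceil(t / (1 + t) * (|x| + |y|))."""
    t = Fraction(str(t))  # exact rational value, e.g. 0.8 -> 4/5
    return math.ceil(t / (1 + t) * (size_x + size_y))


def jaccard_from_sizes(size_x, size_y, overlap):
    """Jaccard similarity of two sets with the given sizes and overlap."""
    return Fraction(overlap, size_x + size_y - overlap)


# Exhaustive check of the equivalence for small sizes and one threshold.
t = 0.8
equivalent = all(
    (jaccard_from_sizes(nx, ny, o) >= Fraction(str(t))) == (o >= min_overlap(t, nx, ny))
    for nx in range(1, 15)
    for ny in range(1, 15)
    for o in range(0, min(nx, ny) + 1)
)
print(equivalent)               # True
print(min_overlap(t, 12, 10))   # 10
```

The equivalence follows from |x ∪ y| = |x| + |y| − |x ∩ y|: solving |x ∩ y| / |x ∪ y| ≥ t for the overlap gives |x ∩ y| ≥ t/(1+t)·(|x| + |y|), and the ceiling applies because the overlap is an integer.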

The most basic principle in applying prefix filtering is as follows.

o Given an ordering O of the token universe U and a set of records, the tokens of each record are sorted according to the ordering O.
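In the similarity join literature, the consequence usually drawn from this ordering is that two sorted records sharing at least α tokens must also share at least one token among the first |x| − α + 1 tokens of x and the first |y| − α + 1 tokens of y. The sketch below illustrates that statement under the assumption that records are duplicate-free lists of integer tokens sorted in ascending order; the function name `prefixes_share_token` is hypothetical.

```python
from itertools import combinations


def prefixes_share_token(x, y, alpha):
    """True if the (|x|-alpha+1)-prefix of x and the (|y|-alpha+1)-prefix of y intersect."""
    prefix_x = x[: len(x) - alpha + 1]
    prefix_y = y[: len(y) - alpha + 1]
    return not set(prefix_x).isdisjoint(prefix_y)


# Exhaustive check over all non-empty sorted subsets of a 5-token universe:
# whenever the true overlap is at least alpha, the prefixes intersect.
subsets = [list(c) for k in range(1, 6) for c in combinations(range(5), k)]
holds = all(
    prefixes_share_token(x, y, alpha)
    for x in subsets
    for y in subsets
    for alpha in range(1, len(set(x) & set(y)) + 1)
)
print(holds)  # True

# Used in the other direction, disjoint prefixes allow a pair to be pruned.
print(prefixes_share_token([1, 2, 3, 4, 5], [4, 5, 6, 7, 8], 4))  # False
```

In the last call the two records share only two tokens, fewer than the required four, and the disjoint prefixes reveal this without a full comparison.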

Applying this prefix filtering principle, the present invention determines in advance, in an adaptive manner, the prefix ranges of the search record x and the indexing record y, and quickly determines similarity join candidate pairs within these ranges.

From [Equation 1] and the prefix filtering principle, the numbers of tokens of the search record x and the indexing record y that are used in prefix filtering can be expressed by Equations 3 and 4 below.

[Equation 3]

$$\text{prefix\_x} = |x| - \alpha + 1$$

[Equation 4]

$$\text{prefix\_y} = |y| - \alpha + 1$$

The maximum size that the prefix of the search record x can take, max_probe_prefix, is given by Equation 5, which is Equation 3 evaluated at |y| = t·|x|, where the value of α is smallest.

[Equation 5]

$$\text{max\_probe\_prefix} = |x| - \lceil t \cdot |x| \rceil + 1$$

Here t is the Jaccard similarity threshold. When the tokens of x are used as tokens of an indexing record in later steps, the maximum size that must be stored in the inverted index is given by Equation 6.

[Equation 6]

$$\text{max\_index\_prefix} = |x| - \left\lceil \frac{2t}{1+t}\,|x| \right\rceil + 1$$
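The four prefix sizes can be put side by side in a short sketch. It assumes the forms prefix_x = |x| − α + 1, prefix_y = |y| − α + 1, max_probe_prefix = |x| − ⌈t·|x|⌉ + 1 and max_index_prefix = |x| − ⌈2t/(1+t)·|x|⌉ + 1, which are inferred from the surrounding text and agree with the worked values given for t = 0.8 and |x| = 12. The helper `alpha` and the use of exact fractions are assumptions of this example.

```python
import math
from fractions import Fraction


def alpha(t, nx, ny):
    """Minimum overlap for records of sizes nx and ny (Equation 2)."""
    t = Fraction(str(t))
    return math.ceil(t / (1 + t) * (nx + ny))


def prefix_x(t, nx, ny):
    """Prefix length of the search record (Equation 3)."""
    return nx - alpha(t, nx, ny) + 1


def prefix_y(t, nx, ny):
    """Prefix length of the indexing record (Equation 4)."""
    return ny - alpha(t, nx, ny) + 1


def max_probe_prefix(t, nx):
    """Largest prefix a search record of size nx can need (Equation 5)."""
    return nx - math.ceil(Fraction(str(t)) * nx) + 1


def max_index_prefix(t, nx):
    """Number of leading tokens of a record of size nx worth indexing (Equation 6)."""
    t = Fraction(str(t))
    return nx - math.ceil(2 * t / (1 + t) * nx) + 1


t, n = 0.8, 12
# Indexing records that can match a search record of size n have sizes in [ceil(t*n), n].
index_sizes = range(math.ceil(Fraction(str(t)) * n), n + 1)
# Search records that can match an indexing record of size n have sizes in [n, floor(n/t)].
probe_sizes = range(n, math.floor(n / Fraction(str(t))) + 1)

probe_bound_ok = max(prefix_x(t, n, ny) for ny in index_sizes) <= max_probe_prefix(t, n)
index_bound_ok = max(prefix_y(t, nx, n) for nx in probe_sizes) == max_index_prefix(t, n)
print(max_probe_prefix(t, n), max_index_prefix(t, n))  # 3 2
print(probe_bound_ok, index_bound_ok)                  # True True
```

Under these assumptions Equation 6 is Equation 4 evaluated for a search record of the same size as the indexing record, which is the smallest search record that can follow it once records are processed in order of size.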

In PPJoin, described above, an indexing token (y, j) retrieved from the inverted index of the search token (x, i) of the search record is deleted if |y| < t·|x|; otherwise, taking the current position i in x and the current position j in y into account, if the positional condition is satisfied, the overlap value is computed and the pair is determined to be a similarity join candidate. In MPJoin, the prefix_size of the search record x is initially set according to Equation 6, and when the tokens of x have been stored in the inverted index and serve as indexing tokens in later processing, prefix_size is recomputed at that point by Equation 4 and is thus updated dynamically. Then, at the point of deciding whether the indexing token (y, j) yields a join candidate, the token (y, j) is deleted if j is greater than the prefix_size of y.

APJoin according to the present invention has the following features.

First, it applies the prefix filtering and positional filtering used in PPJoin and MPJoin.

Second, it extends MPJoin's dynamically updated prefix_size for indexing records to setting the prefix ranges of both the search record and the indexing record. That is, when a record is first considered as a search record, its prefix range is set from Equation 3, and the prefix range of each indexing record that can fall within the prefix range together with this record is set from Equation 4.

Third, because the prefix ranges of the search record and the indexing record are already set, the two records are determined to be a similarity join candidate pair whenever the search token and the indexing token being compared both lie within the set prefix ranges.

In other words, PPJoin and MPJoin determine a similarity join candidate pair by computing the current number of common tokens, the number of remaining tokens, and the α value, whereas APJoin compares the i-th token (x, i) of the search record x with the j-th token (y, j) of an indexing record y retrieved from the inverted index of that token and checks only whether both lie within the set prefix ranges in order to determine the two records to be a similarity join candidate pair. This condition check is described in more detail below.

o If the size of y is smaller than t·|x|, the token (y, j) is deleted;

o otherwise, if j lies outside the prefix range of y, the token (y, j) is deleted;

o otherwise, if i lies outside the prefix range of x, processing moves on to the next token;

o otherwise, x and y become a similarity join candidate pair.

Here, the prefix ranges of the search record and the indexing record are computed from Equations 3 and 4 and stored in arrays.

FIG. 1 and Table 1 show that, when t = 0.8 and the size of the search record is |x| = 12, the sizes of the indexing records that can be considered satisfy

$$t \cdot |x| \le |y| \le |x|, \quad \text{that is,} \quad 9.6 \le |y| \le 12,$$

so that |y| = 10, 11, or 12.

In FIG. 1, the hatched cells indicate the prefix ranges of the search record and the indexing record. The algorithm of FIG. 2 computes the prefix values prefix_x[1:3] = {2, 2, 3} and prefix_y[1:3] = {2, 1, 1}. If the positions of the tokens being compared in the two records fall outside the given prefix ranges, the pair is excluded from the similarity join candidates. The last four columns of Table 1, for |y| = 12, give the values of the variables for the case in which a search record x of size 12 is used as an indexing record y in later processing. The maximum of the prefix_y values is 2, which equals max_index_prefix = 2 for |x| = 12. It is therefore sufficient to store in the inverted index only the first max_index_prefix tokens (Equation 6) of a search record x of size 12.
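The quoted array values can be reproduced with a few lines of arithmetic. The sketch below assumes α = ⌈t/(1+t)·(|x|+|y|)⌉ and assumes that the array positions 1 to 3 correspond to |y| = 12, 11, 10 in that order; that ordering is inferred from the values and from the discussion of the third token of x, not stated in the text. The variable names are chosen for this example.

```python
import math
from fractions import Fraction

t = Fraction(4, 5)   # Jaccard threshold 0.8, kept exact
size_x = 12          # size of the search record

rows = []
# Candidate indexing-record sizes, from |x| down to ceil(t * |x|): 12, 11, 10.
for size_y in range(size_x, math.ceil(t * size_x) - 1, -1):
    alpha = math.ceil(t / (1 + t) * (size_x + size_y))
    rows.append((size_y, alpha, size_x - alpha + 1, size_y - alpha + 1))

prefix_x = [row[2] for row in rows]
prefix_y = [row[3] for row in rows]

for size_y, alpha, px, py in rows:
    print(f"|y|={size_y}  alpha={alpha}  prefix_x={px}  prefix_y={py}")
print(prefix_x, prefix_y)  # [2, 2, 3] [2, 1, 1]
```

The third token of x falls inside the prefix only for |y| = 10, where prefix_x is 3, which matches the later remark that this token is skipped in FIGS. 1(a) and 1(b) and used in FIG. 1(c).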

FIG. 2 shows an example algorithm implementing the efficient similarity join method using adaptive prefix filtering according to an embodiment of the present invention.

As shown in FIG. 2, the input to APJoin according to the present invention is a set of records R and a given Jaccard similarity threshold t, and the output is all record pairs whose similarity is at least the threshold t.

Each record consists of tokens, and the records are sorted in non-decreasing order of their number of tokens. Steps 5 to 10 shown in FIG. 2 compute the variables corresponding to Equations 2 to 6. Once the search record x is given in step 8, the sizes of the indexing records y are determined, so they can be used in the other equations.

Steps 14 to 21 determine similarity join candidate pairs from the search token (x, i) and the indexing token (y, j) using only the condition check described above.

If the current token fails the condition check and has no possibility of becoming a similarity join candidate, it is deleted from the inverted index, which reduces the number of indexing-record tokens that are subsequently compared. The condition in step 18 expresses that when, as with the third token of the search record x in FIGS. 1(a) and 1(b), there is no possibility of forming a join candidate with the token of the current indexing record, the current token is not processed and processing moves on to the next token. The third token of x will, however, be used in FIG. 1(c). In step 21, because the search token (x, i) and the indexing token (y, j) fall within the prefix ranges of the two records x and y, x and y are generated as a similarity join candidate pair.

In steps 22 to 24, the tokens that will be used in later processing are stored in the inverted index.
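The candidate generation and indexing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of FIG. 2, which is not reproduced in the text: records are assumed to be non-empty, duplicate-free lists of integer tokens sorted in one global order, the list of records is assumed to be sorted by non-decreasing length, positions are 1-based, and the prefix sizes are assumed to follow the forms |x| − α + 1, |y| − α + 1, |x| − ⌈t·|x|⌉ + 1 and |x| − ⌈2t/(1+t)·|x|⌉ + 1. The name `apjoin_candidates` is hypothetical.

```python
import math
from collections import defaultdict
from fractions import Fraction


def apjoin_candidates(records, t):
    """Return candidate pairs (y_id, x_id), y_id < x_id, found by prefix-range checks only."""
    t = Fraction(str(t))                     # exact threshold, avoids float rounding in ceil
    index = defaultdict(list)                # token -> list of (record id, 1-based position)
    candidates = set()
    for x_id, x in enumerate(records):
        nx = len(x)
        if nx == 0:
            continue                         # empty records are skipped in this sketch
        max_probe = nx - math.ceil(t * nx) + 1
        for i in range(1, max_probe + 1):
            token = x[i - 1]
            kept = []
            for y_id, j in index[token]:
                ny = len(records[y_id])
                if ny < t * nx:
                    continue                 # size check failed: drop (y, j) from the index
                a = math.ceil(t / (1 + t) * (nx + ny))
                if j > ny - a + 1:
                    continue                 # j outside the prefix of y: drop (y, j)
                kept.append((y_id, j))
                if i > nx - a + 1:
                    continue                 # i outside the prefix of x: move to next token
                candidates.add((y_id, x_id)) # both positions inside their prefix ranges
            index[token] = kept
        # Store only the leading tokens that later search records can still reach.
        max_index = nx - math.ceil(2 * t / (1 + t) * nx) + 1
        for i in range(1, max_index + 1):
            index[x[i - 1]].append((x_id, i))
    return candidates


# Hypothetical sample data, already sorted by length.
records = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
print(sorted(apjoin_candidates(records, 0.8)))
```

Dropping an index entry is safe in this sketch because records arrive in non-decreasing size: once |y| < t·|x| or j exceeds the prefix of y for one search record, the same holds for every later, larger search record. The returned pairs are only candidates and still have to be verified.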

VerifiZip in step 25 is a function that determines similarity join pairs by computing the overlap in a zipper-merge fashion. While comparing the tokens of the two records, it compares the current number of common tokens, the number of remaining tokens, and the α value to decide whether to continue or abandon the verification. Similarity join candidate pairs can therefore be verified quickly.
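A verification routine of this kind can be sketched as a merge of two sorted token lists that stops as soon as the required overlap is out of reach. The function `verify_zip` below is a hypothetical stand-in, not the VerifiZip function of FIG. 2; it assumes duplicate-free lists of integer tokens sorted in ascending order and α = ⌈t/(1+t)·(|x|+|y|)⌉.

```python
import math
from fractions import Fraction


def verify_zip(x, y, t):
    """Return True if the Jaccard similarity of sorted token lists x and y is >= t."""
    t = Fraction(str(t))
    alpha = math.ceil(t / (1 + t) * (len(x) + len(y)))  # required overlap
    i = j = overlap = 0
    while i < len(x) and j < len(y):
        # Even if every remaining token matched, alpha could not be reached: give up.
        if overlap + min(len(x) - i, len(y) - j) < alpha:
            return False
        if x[i] == y[j]:
            overlap += 1
            i += 1
            j += 1
        elif x[i] < y[j]:
            i += 1
        else:
            j += 1
    return overlap >= alpha


print(verify_zip([1, 2, 3, 4], [1, 2, 3, 4, 5], 0.8))  # True
print(verify_zip([1, 2, 3, 4], [1, 2, 3, 5], 0.8))     # False
```

The early exit compares the current overlap plus the smaller number of remaining tokens against α, which corresponds to the three quantities named in the description.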

The efficient similarity join method using adaptive prefix filtering proposed in the present invention was compared with other similarity join algorithms; the results are shown in FIG. 3, FIG. 4, and Table 2.

The comparison algorithms PPJoin, PPJoin+, and MPJoin were implemented in C++ for the performance comparison. The experiments were run on the MS Windows 7 64-bit operating system with MS Visual Studio 2008, on hardware with an Intel i920 CPU @ 2.67 GHz and 16 GB of RAM.

Two data sets were used in the experiments: one consisting of bibliographic records from the DBLP website and one consisting of e-mails from the Enron company. For convenience, the former is referred to as DBLP and the latter as ENRON. DBLP consists of 988,567 records; each record was formed by converting the words in the author names and title of a publication into integer tokens. The total number of tokens is about 710,000, and the average number of tokens per record is 15.2. ENRON consists of 245,481 e-mails; for each e-mail, the words separated by whitespace or special characters were extracted and converted into integer tokens. In ENRON, the total number of tokens is about 2.36 million and the average number of tokens is 285.5.
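The preprocessing described here, splitting text into words and mapping the words to integer tokens, might look like the sketch below. The details are assumptions made for illustration only: the text does not say how words were normalized or how token ids were ordered, so lowercasing, the alphanumeric word pattern, and ids assigned in order of first appearance are all choices of this example, as is the function name `tokenize`.

```python
import re


def tokenize(texts):
    """Convert texts to sorted, duplicate-free lists of integer tokens, shortest record first."""
    vocab = {}                                    # word -> integer token id
    records = []
    for text in texts:
        # Words are maximal runs of letters and digits; everything else separates them.
        words = re.findall(r"[A-Za-z0-9]+", text.lower())
        ids = {vocab.setdefault(word, len(vocab)) for word in words}
        records.append(sorted(ids))
    records.sort(key=len)                         # non-decreasing number of tokens
    return records, vocab


records, vocab = tokenize(["Prefix filtering, prefix join", "join"])
print(records)  # [[2], [0, 1, 2]]
print(vocab)    # {'prefix': 0, 'filtering': 1, 'join': 2}
```

Sorting each record by token id gives all records the same global token order, and sorting the records by length matches the input order that the join expects.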

Table 2 shows the experimental results with the DBLP data as input.

In Table 2, |Join| is the number of similarity join result pairs, |Cand| is the number of similarity join candidate pairs, Time is the execution time, and |Comp| is the number of times the step of examining a search token and an indexing token to determine a similarity join candidate pair was performed.

FIG. 3 is a graph showing the results of applying the efficient similarity join method using adaptive prefix filtering according to an embodiment of the present invention to the DBLP data, and FIG. 4 is a graph showing the results of applying it to the ENRON data. They show the execution time of each algorithm on the given input data as the Jaccard similarity threshold t varies.

As shown, the |Comp| value of MPJoin is smaller than that of PPJoin because an indexing token is deleted when its position j exceeds the prefix range. The |Comp| value of APJoin, the efficient similarity join method using adaptive prefix filtering proposed in the present invention, is smaller still because similarity join candidate pairs are determined only when the token position in the search record and the token position in the indexing record both lie within the set prefix ranges. A smaller |Comp| leads to a shorter execution time.

In the execution time comparison, APJoin proposed in the present invention improved on PPJoin by 68% and on MPJoin by 32% in terms of execution speed.

On the DBLP data, the execution time improves on MPJoin by a roughly constant 32%. On the ENRON data, the relative execution time of the APJoin algorithm improves gradually, from 4% when the similarity threshold t is 0.95 to 33% when t is 0.75. As FIG. 4 shows, because the records of the ENRON data have a larger average number of tokens than those of the DBLP data, execution takes longer, and the relative execution time behaves differently from the result shown in FIG. 3.

As described above, APJoin, the efficient similarity join method using adaptive prefix filtering proposed in the present invention, sets the numbers of prefix tokens of the search record and the indexing record according to the prefix filtering principle, determines a pair to be a similarity join candidate when tokens falling within these ranges match, and then verifies it. This makes it possible to find mutually similar records efficiently among a large number of records in a short execution time.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. Accordingly, the true scope of technical protection of the disclosed technology should be determined by the appended claims.

Claims

A data processing method performed in a computer for generating and verifying, using prefix filtering, candidate pairs whose similarity in a data set is at least a given threshold, the method comprising:
(a) setting the prefix-token ranges of a search record x and an indexing record y that are to be compared;
(b) generating the search record and the indexing record as a similarity join candidate pair if the search token being compared and the indexing token of the indexing record lie within the set prefix ranges; and
(c) verifying the similarity join candidate pairs to determine similarity join pairs.

The method according to claim 1,
wherein step (a) comprises calculating the similarity between the search record x and the indexing record y.

The method of claim 2,
wherein the similarity between the search record x and the indexing record y is calculated as the Jaccard similarity, defined as

$$J(x, y) = \frac{|x \cap y|}{|x \cup y|},$$

and as the overlap similarity, defined as

$$O(x, y) = |x \cap y|.$$

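The method of claim 3,
wherein the Jaccard similarity and the overlap similarity satisfy the relation

$$J(x, y) \ge t \iff O(x, y) \ge \alpha = \left\lceil \frac{t}{1+t}\,(|x| + |y|) \right\rceil$$

(where t is the Jaccard similarity threshold and α is the overlap similarity threshold).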

The method of claim 4,
wherein, in step (a), the number of prefix tokens of the search record x, prefix_x, is set to

$$\text{prefix\_x} = |x| - \alpha + 1,$$

and the number of prefix tokens of the indexing record y, prefix_y, is set to

$$\text{prefix\_y} = |y| - \alpha + 1$$

(where |x| is the number of tokens in record x and |y| is the number of tokens in record y).

The method of claim 5,
wherein, in step (a), the maximum size that the prefix of the search record x can take, max_probe_prefix, is determined as

$$\text{max\_probe\_prefix} = |x| - \lceil t \cdot |x| \rceil + 1.$$

The method of claim 5,
wherein, in step (a), the maximum size to be stored in the inverted index when the tokens of the search record x are used as tokens of an indexing record, max_index_prefix, is determined as

$$\text{max\_index\_prefix} = |x| - \left\lceil \frac{2t}{1+t}\,|x| \right\rceil + 1.$$

The method according to claim 1,
wherein step (b) compares the i-th token (x, i) of the search record x with the j-th token (y, j) of an indexing record y retrieved from the inverted index of that token, and comprises:
deleting the token (y, j) if the size of y is smaller than t·|x|;
otherwise, deleting the token (y, j) if j lies outside the prefix range of y;
otherwise, moving on to the next token if i lies outside the prefix range of x; and
otherwise, determining x and y to be a similarity join candidate pair (where t is the Jaccard similarity threshold).

The method according to claim 1,
wherein step (c) comprises comparing the tokens of the two records of each similarity join candidate pair determined in step (b), and determining the similarity join pairs by comparing the current number of common tokens, the number of remaining tokens, and the α value.

A computer program for executing the similarity joining method according to any one of claims 1 to 9.