KR101560726B1 - System and method for generating cluster using seed based link

Info

Publication number: KR101560726B1
Application number: KR1020090058991A
Authority: KR
Inventors: 이재범; 김상욱; 윤석호; 송석순; 김동진
Original assignee: 네이버 주식회사
Priority date: 2009-06-30
Filing date: 2009-06-30
Publication date: 2015-10-16
Also published as: KR20110001448A

Abstract

A system and method for generating clusters using link-based seeds are disclosed. The cluster generation system may include a seed determination unit that determines, based on link information, a seed composed of objects whose similarity is at or above a preset level from among a plurality of objects connected by links, and a cluster generation unit that generates a cluster using the determined seed. Determining a high-similarity seed and generating the cluster from it can shorten the initial execution time of cluster generation.

Cluster, link, seed, similarity, tree, object, hash-based

Description

The present invention relates to a system and method for generating clusters using link-based seeds, and more particularly to a cluster generation system and method that generate clusters using a hash-based search method.

Clustering means grouping mutually similar objects into clusters. Clustering has long been studied in fields such as statistics, databases, and data mining. Clusters derived through clustering can be applied to services such as advertising and search to improve the accuracy of those services, and they make similarity trends among objects easy to identify.

Recently, research on generating clusters of highly similar objects from the large amount of data available on the Internet has attracted attention. In particular, link-based clustering, which clusters objects by considering the links between them, has come to the fore.

However, as Internet use grows, generating clusters for the vast number of objects on the Internet requires considerable execution time. Because longer execution time consumes more clustering resources, a way to handle this effectively is needed. That is, for large-scale data environments, an approach is required that reduces execution time for fast processing while still guaranteeing similarity.

Ultimately, what is needed above all is to cluster large-scale data accurately and efficiently by improving the bottleneck that affects overall clustering performance.

The present invention provides a cluster generation system and method that shorten the execution time required for clustering by determining, as a seed, a small number of objects with very high similarity among a plurality of objects connected by links, and generating a cluster using the seed.

The present invention provides a cluster generation system and method that can reduce cluster generation time while guaranteeing a certain level of similarity by generating a hash structure from transaction data for a plurality of objects connected by links, and by extracting patterns that exceed a minimum frequency while searching the hash structure and determining them as seeds.

The present invention provides a cluster generation system and method that can improve the similarity of the clusters to be generated by judging objects with few links to be noise and removing them from the transaction data in advance through a preprocessing step for cluster generation.

The present invention provides a cluster generation system and method that make the similarity among a plurality of objects easy to grasp by generating a tree that hierarchically represents the similarity between objects through the plurality of objects and the clusters generated from them.

A cluster generation system according to an embodiment of the present invention may include a seed determination unit that determines, based on link information, a seed composed of objects whose similarity is at or above a preset level from among a plurality of objects connected by links, and a cluster generation unit that generates a cluster using the determined seed.

According to one aspect of the present invention, the seed determination unit may include a transaction data generation unit that generates transaction data using the plurality of objects connected by the links, a hash structure determination unit that searches the transaction data to determine a hash structure composed of candidate patterns based on a pattern length, and a seed extraction unit that extracts, as a seed, the objects corresponding to candidate patterns that exceed a minimum frequency among the candidate patterns constituting the hash structure.

According to one aspect of the present invention, the hash structure determination unit may include a candidate pattern generation unit that generates, from the transaction data, candidate patterns corresponding to a pattern length defined by the number of items, and a frequency determination unit that determines a frequency by counting the generated candidate patterns in the transaction data.

According to one aspect of the present invention, the seed determination unit may further include a noise removal unit that judges objects whose number of links is at or below a preset number to be noise and removes them from the transaction data.

The cluster generation system according to an embodiment of the present invention may further include a tree generation unit that, for each type, sets the plurality of objects as lower-level nodes and the clusters generated from the plurality of objects as higher-level nodes to generate a structured tree.

A cluster generation method according to an embodiment of the present invention may include determining, based on link information, a seed composed of objects whose similarity is at or above a preset level from among a plurality of objects connected by links, and generating a cluster using the determined seed.

According to one aspect of the present invention, determining the seed composed of objects whose similarity is at or above the preset level may include generating transaction data using the plurality of objects connected by the links, searching the transaction data to determine a hash structure composed of candidate patterns based on a pattern length, and extracting, as a seed, the objects corresponding to candidate patterns that exceed a minimum frequency among the candidate patterns constituting the hash structure.

According to one aspect of the present invention, determining the seed may further include judging objects whose number of links is at or below a preset number to be noise and removing them from the transaction data.

The cluster generation method according to an embodiment of the present invention may further include, for each type, setting the plurality of objects as lower-level nodes and the clusters generated from the plurality of objects as higher-level nodes to generate a structured tree.

According to an embodiment of the present invention, the execution time required for clustering can be shortened by determining, as a seed, a small number of objects with very high similarity among a plurality of objects connected by links, and generating a cluster using the seed.

According to an embodiment of the present invention, cluster generation time can be reduced while guaranteeing a certain level of similarity by generating a hash structure from transaction data for a plurality of objects connected by links, and by extracting patterns that exceed a minimum frequency while searching the hash structure and determining them as seeds.

According to an embodiment of the present invention, the similarity of the clusters to be generated can be improved by judging objects with few links to be noise and removing them from the transaction data through a preprocessing step for cluster generation.

According to an embodiment of the present invention, the similarity among a plurality of objects can be grasped easily by generating a tree that hierarchically represents the similarity between objects through the plurality of objects and the clusters generated from them.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. However, the present invention is not restricted or limited by the embodiments. The same reference numerals in the drawings denote the same members. A cluster generation method according to an embodiment of the present invention may be performed by the cluster generation system.

FIG. 1 is a diagram for explaining the overall process performed by a cluster generation system according to an embodiment of the present invention.

The cluster generation system (100) may generate a structured tree (103) by clustering a plurality of objects (101, 102) connected by links according to link-based similarity. Clustering may mean grouping mutually similar objects into a particular cluster. Here, the objects (101) and the objects (102) connected by links may be data of different types. For example, the objects (101) may be blogs and the objects (102) may be posts connected to them by links through scrapping. FIG. 1 shows two kinds of objects of different types, but the number of types is not limited.

As the amount of objects (data) to be clustered keeps growing, efficiently clustering large-scale data has become important. Because the objects are connected to one another through links, the amount of computation required for clustering inevitably increases as the data volume grows.

To address this, the cluster generation system (100) according to an embodiment of the present invention may generate a hash structure using transaction data composed of objects connected by links, and may generate clusters through a hash-based seed mining method. Specifically, considering that clusters are formed by a small number of highly similar objects (101, 102), the cluster generation system (100) may determine a small number of objects with very high similarity as a seed and include other objects (101, 102) related to the seed in the seed, so that initial clustering is processed efficiently and quickly.

In addition, the cluster generation system (100) may shorten clustering execution time by iteratively removing objects (101, 102) that correspond to noise from the transaction data through preprocessing before generating clusters. Through this process, the cluster generation system (100) may cluster the objects (101, 102) to generate clusters and use the generated clusters to generate a tree (103, 104) for each type, thereby performing link-based clustering that reduces execution time without reducing accuracy.

FIG. 2 is a block diagram showing the overall configuration of a cluster generation system according to an embodiment of the present invention.

Referring to FIG. 2, the cluster generation system (100) may include a seed determination unit (201), a cluster generation unit (202), and a tree generation unit (203).

The seed determination unit (201) may determine, based on link information, a seed composed of objects whose similarity is at or above a preset level from among the plurality of objects (101, 102) connected by links. According to an embodiment of the present invention, the seed may serve as the reference objects from which a cluster is expanded. Here, the plurality of objects (101) and the plurality of objects (102) have different types, and the seed determination unit (201) may determine a seed for each type of object (101, 102).

The cluster generation system (100) according to an embodiment of the present invention may reduce the time to generate clusters from the objects (101) by roughly building an initial cluster corresponding to the seed. Here, the seed may include a small number of objects with very high similarity among the plurality of objects (101).

For example, the seed determination unit (201) may include a transaction data generation unit (204), a noise removal unit (205), a hash structure determination unit (206), and a seed extraction unit (207). Specifically, the seed determination unit (201) may determine a seed from each of the plurality of objects (101, 102) of different types through a hash-based seed mining method.

The transaction data generation unit (204) may generate transaction data using the plurality of objects (101) connected by links. For example, the transaction data generation unit (204) may set, as transactions, the objects (102) of the other type that each of the plurality of objects (101) points to through a link.

Then, the transaction data generation unit (204) may generate the transaction data by classifying the plurality of objects (101), according to the links, as the items that make up each transaction. That is, the transaction data generation unit (204) may convert the objects (101, 102) connected by links into transaction data. A specific example of generating transaction data is described with reference to FIG. 3.

The noise removal unit (205) may judge objects whose number of links is at or below a preset number to be noise and remove them from the transaction data. Particular objects (101) may be connected to objects (102) of the other type through only a few links. A cluster is composed of highly similar objects (101), and similarity is related to the number of links. When the number of links is small, the reliability of the similarity between objects cannot be guaranteed. Therefore, if an object with few links is included in a cluster, the accuracy of the cluster may decrease. In addition, objects with few links increase the amount of data to be processed for clustering, which also increases clustering execution time.

Accordingly, the cluster generation system (100) according to an embodiment of the present invention may judge objects (101) with few links to be noise and remove them in advance through preprocessing before generating clusters, thereby improving the similarity between the objects in a cluster and shortening clustering time.

For example, the noise removal unit (205) may iteratively remove objects of the different types, type by type. For example, assume that object A (101) of type 1 is connected by links to objects a and b (102) of type 2, and that the noise removal unit (205) judges objects with one link to be noise and removes them.

Here, when object a (102) has one link, the noise removal unit (205) may remove object a (102) from the cluster. With object a (102) removed, the number of links of object A (101) also decreases. The number of links of object A (101) of type 1 then goes from two to one, so the noise removal unit (205) may also judge object A (101) to be noise and remove it from the cluster. In this way, the cluster generation system (100) according to an embodiment of the present invention may repeat the removal until no object corresponding to noise remains in the cluster.
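The cascading removal described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `remove_noise`, the parameter `min_links`, and the representation of transaction data as a dictionary of sets are all assumptions made for the example, and the toy data is hypothetical.

```python
def remove_noise(transactions, min_links=1):
    """Repeatedly drop objects of either type whose link count is <= min_links.

    `transactions` maps each object of one type (the transaction) to the set
    of objects of the other type linked to it (the items). Hypothetical layout.
    """
    tx = {t: set(items) for t, items in transactions.items()}
    changed = True
    while changed:
        changed = False
        # Transactions with too few linked items are treated as noise.
        for t in [t for t, items in tx.items() if len(items) <= min_links]:
            del tx[t]
            changed = True
        # Count how many remaining transactions each item is linked to.
        counts = {}
        for items in tx.values():
            for item in items:
                counts[item] = counts.get(item, 0) + 1
        noisy = {item for item, c in counts.items() if c <= min_links}
        if noisy:
            changed = True
            for items in tx.values():
                items -= noisy
    return tx


# Toy data: a has one link (to A); A has two links (to a and b).
toy = {"a": {"A"}, "b": {"A", "B", "C"}, "c": {"B", "C"}, "d": {"B", "C"}}
cleaned = remove_noise(toy, min_links=1)
```

In this toy run, removing `a` lowers the link count of `A` from two to one, so `A` is removed in the same pass, which mirrors the example in the text. The loop stops once a full pass removes nothing.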

The hash structure determination unit (206) may search the generated transaction data to determine a hash structure composed of candidate patterns based on a pattern length. Here, the transaction data searched may be the transaction data from which the noise removal unit (205) has removed the objects judged to be noise. The pattern length may be the number of objects in an object pair contained in the transaction data. For example, the hash structure determination unit (206) may include a candidate pattern generation unit and a frequency determination unit.

The candidate pattern generation unit may generate, from the transaction data, candidate patterns corresponding to a pattern length defined by the number of items. That is, the candidate pattern generation unit may search the transaction data to generate candidate patterns corresponding to the pattern length.

The frequency determination unit may then count the candidate patterns in the transaction data to determine a frequency indicating how often each candidate pattern occurs in the transaction data. As a result, the hash structure determination unit (206) may determine the hash structure using the candidate patterns and the frequency corresponding to each candidate pattern.

The seed extraction unit (207) may extract, as a seed, the objects corresponding to candidate patterns that exceed the minimum frequency among the candidate patterns constituting the hash structure. That is, the seed extraction unit (207) may set a minimum frequency that guarantees at least a certain level of similarity, and extract as a seed those objects among the plurality of objects (101) that correspond to candidate patterns exceeding the minimum frequency.

Thus, by searching the hash structure composed of candidate patterns only once, the seed extraction unit (207) may quickly determine a small number of objects with very high similarity from the plurality of objects (101).
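The candidate-pattern counting and seed extraction described above can be sketched as follows. The text does not specify the internal layout of the hash structure, so this sketch assumes a plain Python dictionary keyed by the candidate pattern; `count_candidates`, `extract_seeds`, and the toy transactions are hypothetical names and data chosen for illustration.

```python
from itertools import combinations


def count_candidates(transactions, pattern_length=2):
    """Build a hash table mapping each candidate pattern to its frequency.

    A candidate pattern is a sorted tuple of `pattern_length` items that
    appear together in one transaction.
    """
    table = {}
    for items in transactions.values():
        for pattern in combinations(sorted(items), pattern_length):
            table[pattern] = table.get(pattern, 0) + 1
    return table


def extract_seeds(table, min_frequency):
    """Keep only the patterns whose frequency exceeds min_frequency."""
    return {p: f for p, f in table.items() if f > min_frequency}


# Toy transaction data (hypothetical): posts as transactions, blogs as items.
toy = {
    "P1": {"B1", "B3"},
    "P3": {"B1", "B3", "B4"},
    "P4": {"B2", "B4"},
}
table = count_candidates(toy, pattern_length=2)
seeds = extract_seeds(table, min_frequency=1)
```

Each transaction is visited once while the table is filled, and seed extraction is then a single scan over the table, which is one plausible reading of why a hash-based search keeps the initial step cheap.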

The cluster generation unit (202) may generate a cluster using the seed determined by the seed determination unit (201). A cluster may be generated for each type of object.

For example, the cluster generation unit (202) may initially generate a cluster composed of the objects corresponding to the seed. Then, from among the objects not yet included in the cluster, the cluster generation unit (202) may extract objects that frequently appear in the same transactions as the seed and add them to the cluster. Specifically, the cluster generation unit (202) may keep adding objects to the cluster until the number of objects in the cluster reaches a preset number.
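One way to picture this seed expansion is the sketch below. It rests on assumptions the text leaves open: "the same transaction as the seed" is read here as any transaction containing at least one seed object, candidates are ranked by how many such transactions they appear in, and ties are broken by name. The function `grow_cluster`, the parameter `max_size`, and the toy data are hypothetical.

```python
def grow_cluster(seed, transactions, max_size):
    """Start from the seed and add the objects that co-occur with it most often."""
    cluster = set(seed)
    co_counts = {}
    for items in transactions.values():
        if cluster & items:  # the transaction shares an object with the seed
            for obj in items - cluster:
                co_counts[obj] = co_counts.get(obj, 0) + 1
    # Most frequent co-occurring objects first; ties broken by name.
    ranked = sorted(co_counts, key=lambda o: (-co_counts[o], o))
    for obj in ranked:
        if len(cluster) >= max_size:
            break
        cluster.add(obj)
    return cluster


# Toy transaction data (hypothetical).
toy = {
    "P1": {"B1", "B3"},
    "P3": {"B1", "B3", "B4"},
    "P5": {"B1", "B4", "B5"},
    "P6": {"B2", "B6"},
}
small = grow_cluster(("B1", "B3"), toy, max_size=3)
large = grow_cluster(("B1", "B3"), toy, max_size=10)
```

With `max_size=3` only the strongest co-occurring object joins the seed; with a larger limit every object that shares a transaction with the seed is added, while objects that never co-occur with it stay outside.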

The tree generation unit (203) may, for each type, set the plurality of objects (101, 102) as lower-level nodes and the clusters generated from the plurality of objects as higher-level nodes to generate a structured tree (103, 104). That is, tree X (103) may be generated for the plurality of objects (101), and tree Y (104) may be generated for the plurality of objects (102).

Here, each of the plurality of objects (101, 102) may be determined as a leaf node of the tree (103, 104), and each cluster as a non-leaf node. In addition, the cluster generation system (100) may treat the clusters first determined by the cluster generation process as objects and generate new clusters from them. That is, by repeating the cluster generation process, the cluster generation system (100) may generate a tree (103, 104) in which the plurality of objects (101, 102) and the clusters are structured by level.
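The level-by-level tree construction can be sketched as follows. The sketch only shows the data structure: the cluster memberships passed to `build_level` are supplied by hand as hypothetical results of the clustering step, and `Node`, `build_level`, and the label prefixes are illustrative names, not terms from the source.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

    def leaves(self):
        """Return the labels of all leaf nodes (original objects) under this node."""
        if not self.children:
            return [self.label]
        return [leaf for child in self.children for leaf in child.leaves()]


def build_level(nodes, groups, prefix):
    """Wrap each group of existing nodes in a new parent node.

    Nodes that belong to no group are passed through unchanged.
    """
    by_label = {n.label: n for n in nodes}
    grouped = set()
    parents = []
    for i, group in enumerate(groups, start=1):
        parents.append(Node(f"{prefix}{i}", [by_label[g] for g in group]))
        grouped.update(group)
    rest = [n for n in nodes if n.label not in grouped]
    return parents + rest


# Objects become leaf nodes; clusters become non-leaf nodes.
objects = [Node(b) for b in ["B1", "B2", "B3", "B4"]]
level1 = build_level(objects, [["B1", "B3"], ["B2", "B4"]], prefix="C")
# Clusters are treated as objects and clustered again to form the next level.
level2 = build_level(level1, [["C1", "C2"]], prefix="D")
root = level2[0]
```

Calling `build_level` again on its own output is what corresponds to treating first-level clusters as objects and repeating the cluster generation process.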

FIG. 3 is a diagram for explaining transaction data based on objects connected by links according to an embodiment of the present invention.

Referring to FIG. 3, a plurality of objects (301, 302) of different types are connected by links. For example, assume that in FIG. 3 the plurality of objects (301) are blogs and the plurality of objects (302) are posts.

The cluster generation system may determine, based on link information, a seed composed of objects whose similarity is at or above a preset level from among the plurality of objects (301, 302) connected by links. This similarity may be determined according to the links of the objects. A seed may be extracted from each of the plurality of objects (301, 302) of different types.

The cluster generation system may generate transaction data using the plurality of objects (301, 302) connected by links. For example, the cluster generation system may set, as transactions, the objects (302) of the other type that each of the plurality of objects (301) points to through a link. The cluster generation system may then generate the transaction data by classifying the plurality of objects (301), according to the links, as the items that make up each transaction.

In FIG. 3, when the post data are set as transactions, the cluster generation system may classify the blog data linked to each post as the items (303) of that transaction. For example, because blog data B1, B3, and B4 are connected by links to post data P3, the items of transaction P3 may be B1, B3, and B4.

Through this process, the blog data connected by links to each of the post data P1 to P8 set as transactions are classified as items (303), and the transaction data may be generated. Conversely, when the blog data are set as transactions, the cluster generation system may classify the post data linked to each blog as the items of that transaction.
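The conversion from links to transaction data can be sketched as follows. The link list is toy data, not the full contents of FIG. 3: only the facts stated in the text (B1, B3, and B4 link to P3; B1 and B3 both link to P1) are taken from the source, and the remaining link, along with the names `build_transactions` and `invert`, is hypothetical.

```python
from collections import defaultdict

# (blog, post) links. Partly from the text, partly hypothetical toy data.
links = [
    ("B1", "P1"), ("B3", "P1"),
    ("B1", "P3"), ("B3", "P3"), ("B4", "P3"),
    ("B2", "P2"),  # hypothetical link added for illustration
]


def build_transactions(links):
    """Group source objects (items) by the target object (transaction) they link to."""
    transactions = defaultdict(set)
    for item, transaction in links:
        transactions[transaction].add(item)
    return dict(transactions)


def invert(links):
    """Swap roles so that the other object type becomes the transaction."""
    return [(target, source) for source, target in links]


post_transactions = build_transactions(links)          # posts as transactions
blog_transactions = build_transactions(invert(links))  # blogs as transactions
```

Running the same function on the inverted links gives the opposite view, with each blog as a transaction and its linked posts as items.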

By generating transaction data from the plurality of objects connected by links, the cluster generation system can identify patterns of objects that occur frequently. For example, blog data B1 and B3 both point to post data P1 and P3 through links, so they can be regarded as similar to each other. Moreover, because B1 and B3 appear together in transactions P1 and P3, the cluster generation system may set B1 and B3 as a frequently occurring pattern.

The cluster generation system may then judge objects in the generated transaction data whose number of links is at or below a preset number to be noise and remove them. For example, the cluster generation system may iteratively remove the objects judged to be noise, type by type. The noise removal process may be repeated until no object corresponding to noise remains in the transaction data.

As a result, by removing objects with few links before generating clusters, the cluster generation system need not generate clusters for objects judged to be noise and can therefore generate clusters more quickly. Similarity is also related to the number of links. That is, by judging objects whose number of links is below a preset number to be noise and removing them, the cluster generation system may improve the similarity between the objects in a cluster and shorten clustering time.

FIG. 4 is a diagram for explaining a process of determining a hash structure from transaction data according to an embodiment of the present invention.

Referring to FIG. 4, the cluster generation system can search the transaction data 401 to generate a hash structure. For example, the cluster generation system can search the transaction data 401 to determine a hash structure 402 composed of candidate patterns based on a pattern length. Here, the transaction data 401 may be data from which the objects judged to be noise have been removed. That is, to generate the hash structure, the cluster generation system can search the transaction data 401 from which the objects judged to be noise have been removed.

Specifically, based on a pattern length that depends on the number of items constituting the transaction data 401, the cluster generation system can generate candidate patterns corresponding to that pattern length from the transaction data. The cluster generation system can then determine the frequency of each candidate pattern by counting its occurrences in the transaction data.

Assume that the pattern length is set to 2 in FIG. 4. The cluster generation system then searches the transaction data 401 and generates candidate patterns of length 2 for each transaction. Because transactions P1, P3, P4, P6, and P8 each contain two or more items, candidate patterns of length 2 can be generated from them. A candidate pattern of length 2 is a pair of two objects.

By searching the transaction data 401, the cluster generation system can generate the candidate patterns of length 2 {B1, B2}, {B1, B3}, {B1, B4}, {B3, B4}, and {B4, B5}. The cluster generation system can then determine the frequency of each candidate pattern by counting it in the transaction data 401. In the transaction data 401, the frequencies of these candidate patterns are 1, 2, 1, 1, and 2, respectively.
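
The counting step can be sketched as follows, using a Python `Counter` (a hash table) as a stand-in for the hash structure; the specification does not say how the hash structure is implemented. The contents of transaction data 401 appear only in the figure, so the transactions below are hypothetical and were chosen solely to reproduce the candidate patterns and frequencies stated above. The function name `count_candidate_patterns` is likewise an assumption.

```python
from collections import Counter
from itertools import combinations

def count_candidate_patterns(transactions, pattern_length=2):
    """Count every item combination of the given length across transactions."""
    counts = Counter()
    for items in transactions.values():
        # Transactions with fewer items than pattern_length yield no candidates.
        for pattern in combinations(sorted(items), pattern_length):
            counts[pattern] += 1
    return counts

# Hypothetical transaction data (not the actual content of FIG. 4).
transactions = {
    "P1": {"B1", "B3"},
    "P2": {"B2"},
    "P3": {"B1", "B3", "B4"},
    "P4": {"B1", "B2"},
    "P6": {"B4", "B5"},
    "P8": {"B4", "B5"},
}
pattern_counts = count_candidate_patterns(transactions, pattern_length=2)
```

Sorting the items before forming combinations gives each pair a single canonical key, so {B1, B3} and {B3, B1} are counted as the same pattern.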

If the minimum frequency is set to 1, the cluster generation system can determine the candidate patterns whose frequency exceeds the minimum frequency to be seeds. That is, the cluster generation system can determine the candidate patterns {B1, B3} and {B4, B5} to be seeds. The pattern length and the minimum frequency set in FIG. 4 can be changed according to the configuration of the system.
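
The seed selection rule is a strict threshold on the counted frequencies. A minimal sketch, with the hypothetical function name `select_seeds` and the frequencies from the example entered by hand:

```python
def select_seeds(pattern_counts, min_frequency):
    """Return the candidate patterns whose frequency exceeds min_frequency."""
    return [pattern
            for pattern, count in sorted(pattern_counts.items())
            if count > min_frequency]

# Candidate patterns and frequencies from the example in the text.
pattern_counts = {
    ("B1", "B2"): 1,
    ("B1", "B3"): 2,
    ("B1", "B4"): 1,
    ("B3", "B4"): 1,
    ("B4", "B5"): 2,
}
seeds = select_seeds(pattern_counts, min_frequency=1)
```

Because the comparison is strict, patterns with a frequency of exactly 1 are not selected when the minimum frequency is 1.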

FIG. 5 is a diagram illustrating a process of generating a cluster and adding objects to it according to an embodiment of the present invention.

The cluster generation system can extract, as seeds, the objects corresponding to the candidate patterns that exceed the minimum frequency among the candidate patterns constituting the hash structure. The cluster generation system can then generate clusters using the seeds, which are objects judged to be similar to each other based on link information. The cluster generation system can also add to a cluster those objects, among the objects not yet included in the cluster, that frequently appear in the same transactions as the seed.

Referring to the example of FIG. 4, the candidate patterns {B1, B3} and {B4, B5}, which exceed the minimum frequency of 1, can each be determined to be a seed. Then, as shown at reference numeral 501 in FIG. 5, B1 and B3 can be determined to be one group, cluster X, and B4 and B5 can be determined to be another group, cluster Y. As shown at reference numeral 502 in FIG. 5, the cluster generation system can add B2, which is included in the same transaction as B1 and B3, to cluster X.

In short, the cluster generation system can explore the hash structure based on the transaction data and determine, from among the plurality of objects, a small number of objects with very high similarity to be seeds. After generating clusters from the seeds, the cluster generation system can perform clustering over the plurality of objects by adding to each cluster the objects that frequently occur in the same transactions as the objects included in that cluster.
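
One way to sketch the seed-then-expand procedure is shown below. The specification does not define how "frequently" is measured when adding objects, so this sketch makes an assumption: each unclustered object joins the cluster whose seed objects it shares the most transactions with, provided that count reaches a hypothetical `min_cooccurrence` parameter. The function name `expand_clusters` and the transaction data are also hypothetical.

```python
from collections import Counter

def expand_clusters(seeds, transactions, min_cooccurrence=1):
    """Start one cluster per seed, then attach each unclustered object to the
    cluster whose seed objects it co-occurs with most often."""
    clusters = [set(seed) for seed in seeds]
    clustered = set().union(*clusters)
    all_objects = set().union(*transactions.values())
    for obj in sorted(all_objects - clustered):
        scores = Counter()
        for items in transactions.values():
            if obj in items:
                for idx, seed in enumerate(seeds):
                    # Number of seed objects sharing this transaction with obj.
                    scores[idx] += len(items & set(seed))
        if not scores:
            continue
        # Highest score wins; ties go to the earlier seed.
        best = max(scores, key=lambda idx: (scores[idx], -idx))
        if scores[best] >= min_cooccurrence:
            clusters[best].add(obj)
    return clusters

# Hypothetical transaction data (not the actual content of the figures).
transactions = {
    "P1": {"B1", "B3"},
    "P3": {"B1", "B3", "B4"},
    "P4": {"B1", "B2"},
    "P6": {"B4", "B5"},
    "P8": {"B4", "B5"},
}
seeds = [("B1", "B3"), ("B4", "B5")]
clusters = expand_clusters(seeds, transactions)
```

With this sample data, B2 shares a transaction with the seed object B1 and with no object of the other seed, so it joins the first cluster, mirroring the addition of B2 to cluster X in the example.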

FIG. 6 is a diagram illustrating a tree composed of objects and clusters according to an embodiment of the present invention.

FIG. 6 shows trees in which clusters are generated from a plurality of objects of different types and each tree is composed of the objects and the clusters. Level 0, the terminal (leaf) level of a tree, consists of the plurality of objects. The non-terminal nodes at the next higher level, level 1, consist of clusters generated from the level-0 objects. The non-terminal nodes at level 2 consist of clusters generated by treating the level-1 clusters as objects in turn. The tree can be extended by repeating this process.
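
The level structure can be illustrated with a small node type. This is a sketch only: the `Node` class, the `build_level` helper, and the cluster names X, Y, and R are hypothetical, and the grouping of nodes into clusters is supplied by hand rather than computed.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    level: int
    children: list = field(default_factory=list)

def build_level(groups, level):
    """Wrap each named group of child nodes in a parent node at `level`."""
    return [Node(name, level, list(children)) for name, children in groups.items()]

# Level 0: the objects themselves (leaf nodes).
objects = {name: Node(name, 0) for name in ["B1", "B2", "B3", "B4", "B5"]}

# Level 1: clusters generated from the level-0 objects.
level1 = build_level(
    {"X": [objects["B1"], objects["B2"], objects["B3"]],
     "Y": [objects["B4"], objects["B5"]]},
    level=1,
)

# Level 2: the level-1 clusters are treated as objects and clustered again.
root = build_level({"R": level1}, level=2)[0]
```

Repeating the last step on the newest level is what extends the tree upward.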

The objects at level 0 in trees of different types can be connected to each other through links, and the similarity between nodes can be determined through these links. For example, a tree can store the similarity between child nodes that belong to the same parent node. The similarity between child nodes that do not belong to the same parent node can be calculated from the similarity between their ancestor nodes.

That is, similarity is not calculated for every pair of nodes in the tree but is calculated through the hierarchical structure of the tree, so the similarity of the nodes in the tree can be calculated quickly. Based on the finally calculated similarity, the tree can be refined so that it groups nodes that are closer in similarity.
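
The lookup can be sketched as follows. The specification says only that the similarity of nodes under different parents is calculated through the similarity of their ancestors; it does not give a formula. This sketch therefore makes a simplifying assumption: it climbs both nodes until their ancestors share a parent and returns the similarity stored for that pair of ancestors. The function name, the `parent` and `sibling_sim` dictionaries, and all similarity values are hypothetical.

```python
def similarity(a, b, parent, sibling_sim):
    """Similarity of two nodes at the same level of one tree.

    parent: dict mapping each node to its parent (the root maps to None).
    sibling_sim: dict mapping frozenset({u, v}) to the stored similarity of
                 nodes u and v that share a parent.
    """
    if a == b:
        return 1.0
    # Climb both nodes in step until they are children of the same parent.
    while parent[a] != parent[b]:
        a, b = parent[a], parent[b]
        if a is None or b is None:
            raise ValueError("nodes are not at the same level of one tree")
    return sibling_sim[frozenset((a, b))]

# Hypothetical tree: R -> {X, Y}, X -> {B1, B2, B3}, Y -> {B4, B5}.
parent = {"B1": "X", "B2": "X", "B3": "X", "B4": "Y", "B5": "Y",
          "X": "R", "Y": "R", "R": None}
# Hypothetical stored similarities between siblings.
sibling_sim = {frozenset(("B1", "B3")): 0.9, frozenset(("X", "Y")): 0.2}

same_parent = similarity("B1", "B3", parent, sibling_sim)
different_parent = similarity("B1", "B4", parent, sibling_sim)
```

Only sibling pairs need stored values, which is why the tree avoids computing similarity for every pair of nodes.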

FIG. 7 is a flowchart illustrating the overall process of a cluster generation method according to an embodiment of the present invention.

The cluster generation system can determine, from a plurality of objects connected by links and based on link information, a seed composed of objects whose similarity is equal to or greater than a preset similarity (S701).

For example, the cluster generation system can determine the seed through a search method based on a hash structure. Specifically, the cluster generation system can generate transaction data using the plurality of objects connected by links (S704). Here, the cluster generation system can set the objects of another type that each of the plurality of objects points to through a link as transactions, and classify the plurality of objects, according to the links, as the items constituting those transactions, thereby generating the transaction data.

The cluster generation system can treat objects whose number of links is equal to or less than a preset number as noise and remove them from the transaction data in a preprocessing step (S705). The similarity between objects increases with the number of links they have in common. If an object with few links is included in a cluster, however, the similarity of the cluster can decrease. Therefore, before generating clusters, the cluster generation system can judge objects with few links to be noise in the preprocessing step, and by removing those objects from the transaction data it can improve the similarity of the clusters generated afterwards. The preset number of links can be changed according to the configuration of the system.

For example, the cluster generation system can repeatedly remove objects of the different types, one type at a time. Because objects are connected by links to objects of a different type, removing one object can also reduce the number of links of the objects of the other type that were linked to it, which is why the removal is repeated for each type.

The cluster generation system can then search the transaction data to determine a hash structure composed of candidate patterns based on a pattern length (S706). Here, the transaction data may be data from which the objects judged to be noise have been removed. For example, the cluster generation system can generate candidate patterns corresponding to a pattern length from the transaction data, where the pattern length depends on the number of items. The cluster generation system can then determine the frequency of each candidate pattern by counting it in the transaction data. Finally, a hash structure composed of the candidate patterns and their frequencies can be determined.

The cluster generation system can extract, as a seed, the objects corresponding to a candidate pattern that exceeds the minimum frequency among the candidate patterns constituting the hash structure (S707).

The cluster generation system can generate a cluster using the seed (S702). For example, the cluster generation system can generate a cluster from the seed and add to the cluster those objects, among the objects not included in the cluster, that frequently appear in the same transactions as the seed. That is, the cluster generation system according to an embodiment of the present invention determines a seed consisting of a small number of objects with very high similarity, generates a cluster from it, and extends the cluster from the seed, which can shorten the initial execution time for generating clusters.

For each type, the cluster generation system can set the plurality of objects as lower-level nodes and set the clusters generated from the plurality of objects as higher-level nodes, thereby generating a structured tree (S703).

Here, level 0, the terminal (leaf) level of the tree, consists of the plurality of objects. The non-terminal nodes at the next higher level, level 1, consist of clusters generated from the level-0 objects. The non-terminal nodes at level 2 consist of clusters generated by treating the level-1 clusters as objects in turn. The tree can be extended by repeating this process.

For details not described with reference to FIG. 7, refer to the descriptions of FIGS. 1 to 6.

A cluster generation method according to an embodiment of the present invention also includes a computer-readable medium containing program instructions for performing various computer-implemented operations. The computer-readable medium may contain program instructions, data files, data structures, and the like, alone or in combination. The program instructions on the medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like.

Although the present invention has been described above with reference to limited embodiments and drawings, the present invention is not limited to these embodiments, and those of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description. Accordingly, the spirit of the present invention should be determined only by the claims set forth below, and all equal or equivalent modifications thereof fall within the scope of the spirit of the present invention.

<Description of reference numerals for the main parts of the drawings>

100: Cluster generation system

101: Object <type 1> 102: Object <type 2>

103: Tree <type 1> 104: Tree <type 2>

Claims

1. A cluster generation system comprising:

a seed determining unit that determines, from a plurality of objects connected by links and based on link information, a seed composed of objects whose similarity, which is based on the number of connected links, is equal to or greater than a preset similarity; and

a cluster generating unit that generates a cluster using the determined seed.

2. The cluster generation system of claim 1, wherein the seed determining unit comprises:

a transaction data generating unit that generates transaction data using the plurality of objects connected by the links;

a hash structure determining unit that searches the transaction data to determine a hash structure composed of candidate patterns based on a pattern length; and

a seed extracting unit that extracts, as the seed, the objects corresponding to a candidate pattern that exceeds a minimum frequency among the candidate patterns constituting the hash structure.

3. The cluster generation system of claim 2, wherein the transaction data generating unit sets transactions according to the links of the plurality of objects and generates the transaction data by classifying the plurality of objects as items constituting the set transactions.

4. The cluster generation system of claim 3, wherein the hash structure determining unit comprises:

a candidate pattern generating unit that generates, from the transaction data, candidate patterns corresponding to the pattern length, based on a pattern length that depends on the number of the items; and

a frequency determining unit that determines a frequency by counting the generated candidate patterns in the transaction data.

5. The cluster generation system of claim 4, wherein the hash structure determining unit determines the hash structure using the candidate patterns and the frequencies of the candidate patterns.

6. The cluster generation system of claim 2, wherein the cluster generating unit generates a cluster using the extracted seed and adds to the cluster those objects, among the objects not included in the cluster, that are frequently included in the same transactions as the seed.

7. The cluster generation system of claim 2, wherein the seed determining unit further comprises:

a noise removing unit that judges objects whose number of links is equal to or less than a preset number to be noise and removes them from the transaction data,

wherein the hash structure determining unit searches the transaction data from which the objects judged to be noise have been removed.

8. The cluster generation system of claim 7, wherein the noise removing unit repeatedly removes objects of different types, one type at a time.

9. The cluster generation system of claim 1, further comprising:

a tree generating unit that, for each type, sets the plurality of objects as lower-level nodes and sets the clusters generated from the plurality of objects as higher-level nodes to generate a structured tree.

10. A cluster generation method performed by a cluster generation system, the method comprising:

determining, by a seed determining unit included in the cluster generation system, from a plurality of objects connected by links and based on link information, a seed composed of objects whose similarity, which is based on the number of connected links, is equal to or greater than a preset similarity; and

generating, by a cluster generating unit included in the cluster generation system, a cluster using the determined seed.

11. The cluster generation method of claim 10, wherein determining the seed composed of objects whose similarity is equal to or greater than the preset similarity comprises:

generating transaction data using the plurality of objects connected by the links;

searching the transaction data to determine a hash structure composed of candidate patterns based on a pattern length; and

extracting, as the seed, the objects corresponding to a candidate pattern that exceeds a minimum frequency among the candidate patterns constituting the hash structure.

12. The cluster generation method of claim 11, wherein generating the transaction data comprises setting, as transactions, the objects of another type that each of the plurality of objects points to through a link, and generating the transaction data by classifying the plurality of objects, according to the links, as items constituting the set transactions.

13. The cluster generation method of claim 12, wherein determining the hash structure composed of candidate patterns based on the pattern length comprises:

generating, from the transaction data, candidate patterns corresponding to the pattern length, based on a pattern length that depends on the number of the items; and

determining a frequency by counting the generated candidate patterns in the transaction data.

14. The method of claim 13,

15. The cluster generation method of claim 11, wherein generating the cluster comprises:

generating a cluster using the extracted seed; and

adding to the cluster those objects, among the objects not included in the cluster, that are frequently included in the same transactions as the seed.

16. The cluster generation method of claim 12, wherein determining the seed further comprises:

judging objects whose number of links is equal to or less than a preset number to be noise and removing those objects from the cluster,

wherein determining the hash structure comprises searching the transaction data from which the objects judged to be noise have been removed.

17. The cluster generation method of claim 16, wherein judging objects whose number of links is equal to or less than the preset number to be noise and removing those objects from the cluster comprises repeatedly removing objects of different types, one type at a time.

18. The cluster generation method of claim 10, further comprising:

generating a structured tree by setting, for each type, the plurality of objects as lower-level nodes and setting the clusters generated from the plurality of objects as higher-level nodes.

19. A computer-readable recording medium on which a program for executing the method of any one of claims 10 to 18 is recorded.