KR20110001448A

KR20110001448A - System and method for generating cluster using seed based link

Info

Publication number: KR20110001448A
Application number: KR1020090058991A
Authority: KR
Inventors: 이재범; 김상욱; 윤석호; 송석순; 김동진
Original assignee: 엔에이치엔(주)
Priority date: 2009-06-30
Filing date: 2009-06-30
Publication date: 2011-01-06
Also published as: KR101560726B1

Abstract

PURPOSE: A cluster generation system using link-based seeds is provided to reduce the operation time required for clustering by determining objects of high similarity as a seed and generating a cluster from the seed. CONSTITUTION: A cluster generation system (100) includes a seed determination unit (201) and a cluster generation unit (202). The seed determination unit determines, based on link information, a seed composed of objects having at least a preset similarity from a plurality of objects connected by links. The cluster generation unit generates a cluster using the determined seed.

Description

System and method for creating clusters using seed along links {SYSTEM AND METHOD FOR GENERATING CLUSTER USING SEED BASED LINK}

The present invention relates to a system and method for generating clusters using link-based seeds, and more particularly, to a cluster generation system and method that generate clusters using a hash-based search method.

Clustering means grouping mutually similar objects into clusters. Clustering has long been studied in fields such as statistics, databases, and data mining. Clusters derived through clustering can be applied to services such as advertising and search to increase the accuracy of those services, and they make it easy to identify similarity trends among objects.

Recently, research on generating clusters of highly similar objects from the large amount of data on the Internet has attracted attention. In particular, link-based clustering, which clusters objects in consideration of the links between them, has come to the fore.

However, as Internet use increases, generating clusters for the numerous objects on the Internet requires a long execution time. Because longer execution times consume more clustering resources, a way to handle this effectively is needed. That is, for large-scale data environments, a method is required that reduces execution time for fast processing while still guaranteeing similarity.

Consequently, what is needed above all is to cluster large volumes of data accurately and efficiently by improving the bottlenecks that affect overall clustering performance.

The present invention provides a cluster generation system and method that shorten the execution time required for clustering by determining a small number of very highly similar objects, among a plurality of objects connected by links, as a seed and generating a cluster using the seed.

The present invention provides a cluster generation system and method that can reduce cluster generation time while guaranteeing a certain level of similarity by generating a hash structure from transaction data for a plurality of objects connected by links, extracting patterns that exceed a minimum frequency while searching the hash structure, and determining those patterns as seeds.

The present invention provides a cluster generation system and method that can improve the similarity of the clusters to be generated by judging objects with few links to be noise and removing them from the transaction data in advance, as preprocessing before cluster generation.

The present invention provides a cluster generation system and method that make it easy to grasp the similarity among a plurality of objects by generating a tree that hierarchically expresses the similarity between objects through the objects and the clusters generated from them.

A cluster generation system according to an embodiment of the present invention may include a seed determination unit that determines, based on link information, a seed composed of objects having at least a preset similarity from a plurality of objects connected by links, and a cluster generation unit that generates a cluster using the determined seed.

According to one aspect of the present invention, the seed determination unit may include a transaction data generation unit that generates transaction data using the plurality of objects connected by the links, a hash structure determination unit that scans the transaction data to determine a hash structure composed of candidate patterns based on a pattern length, and a seed extraction unit that extracts, as a seed, the objects corresponding to candidate patterns that exceed a minimum frequency among the candidate patterns constituting the hash structure.

According to one aspect of the present invention, the hash structure determination unit may include a candidate pattern generation unit that generates, from the transaction data, candidate patterns corresponding to a pattern length based on the number of items, and a frequency determination unit that determines a frequency by counting the generated candidate patterns in the transaction data.

According to one aspect of the present invention, the seed determination unit may further include a noise removal unit that judges objects whose number of links is less than or equal to a preset number to be noise and removes them from the transaction data.

The cluster generation system according to an embodiment of the present invention may further include a tree generation unit that generates a structured tree by setting, for each type, the plurality of objects as lower-level nodes and the clusters generated from the plurality of objects as upper-level nodes.

A cluster generation method according to an embodiment of the present invention may include determining, based on link information, a seed composed of objects having at least a preset similarity from a plurality of objects connected by links, and generating a cluster using the determined seed.

According to one aspect of the present invention, determining the seed composed of objects having at least the preset similarity may include generating transaction data using the plurality of objects connected by the links, scanning the transaction data to determine a hash structure composed of candidate patterns based on a pattern length, and extracting, as a seed, the objects corresponding to candidate patterns that exceed a minimum frequency among the candidate patterns constituting the hash structure.

According to one aspect of the present invention, determining the seed may further include judging objects whose number of links is less than or equal to a preset number to be noise and removing them from the transaction data.

The cluster generation method according to an embodiment of the present invention may further include generating a structured tree by setting, for each type, the plurality of objects as lower-level nodes and the clusters generated from the plurality of objects as upper-level nodes.

According to an embodiment of the present invention, the execution time required for clustering can be shortened by determining a small number of very highly similar objects, among a plurality of objects connected by links, as a seed and generating a cluster using the seed.

According to an embodiment of the present invention, cluster generation time can be reduced while guaranteeing a certain level of similarity by generating a hash structure from transaction data for a plurality of objects connected by links, extracting patterns that exceed a minimum frequency while searching the hash structure, and determining those patterns as seeds.

According to an embodiment of the present invention, the similarity of the clusters to be generated can be improved by judging objects with few links to be noise and removing them from the transaction data as preprocessing before cluster generation.

According to an embodiment of the present invention, the similarity among a plurality of objects can be easily grasped by generating a tree that hierarchically expresses the similarity between objects through the objects and the clusters generated from them.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. However, the present invention is not restricted or limited by the embodiments. Like reference numerals in the drawings denote like elements. A cluster generation method according to an embodiment of the present invention may be performed by a cluster generation system.

FIG. 1 is a diagram for explaining the overall process performed by a cluster generation system according to an embodiment of the present invention.

The cluster generation system (100) may generate a structured tree (103) by clustering a plurality of objects (101, 102) connected by links according to link-based similarity. Clustering may mean grouping mutually similar objects into a specific cluster. Here, the object (101) and the object (102) connected by a link may be data of different types. For example, the object (101) may be a blog and the object (102) may be a post linked to it through a scrap. Although FIG. 1 shows two kinds of objects of different types, the number of types is not limited.

As the amount of objects (data) to be clustered continues to grow, clustering large volumes of data efficiently has become important. Because the objects are connected to one another by links, the amount of computation required for clustering inevitably increases as the amount of data grows.

To address this, the cluster generation system (100) according to an embodiment of the present invention may generate a hash structure using transaction data composed of objects connected by links, and may generate clusters through a hash-based seed discovery method. Specifically, taking into account that a cluster is formed by a small number of highly similar objects (101, 102), the cluster generation system (100) determines a small number of very highly similar objects as a seed and adds other objects (101, 102) related to the seed to it, so that initial clustering can be processed efficiently and quickly.

In addition, the cluster generation system (100) may shorten the clustering execution time by iteratively removing objects (101, 102) that correspond to noise from the transaction data through preprocessing before generating clusters. Through this process, the cluster generation system (100) clusters the objects (101, 102) to generate clusters and uses the generated clusters to build a tree (103, 104) for each type, thereby performing link-based clustering that reduces execution time without reducing accuracy.

FIG. 2 is a block diagram showing the overall configuration of a cluster generation system according to an embodiment of the present invention.

Referring to FIG. 2, the cluster generation system (100) may include a seed determination unit (201), a cluster generation unit (202), and a tree generation unit (203).

The seed determination unit (201) may determine, based on link information, a seed composed of objects having at least a preset similarity from the plurality of objects (101, 102) connected by links. According to an embodiment of the present invention, the seed may serve as the reference objects from which a cluster is expanded. Here, the plurality of objects (101) and the plurality of objects (102) have different types, and the seed determination unit (201) may determine a seed for each type of object (101, 102).

The cluster generation system (100) according to an embodiment of the present invention can reduce the time needed to generate clusters from the objects (101) by roughly building an initial cluster corresponding to the seed. Here, the seed may include a small number of very highly similar objects among the plurality of objects (101).

For example, the seed determination unit (201) may include a transaction data generation unit (204), a noise removal unit (205), a hash structure determination unit (206), and a seed extraction unit (207). Specifically, the seed determination unit (201) may determine a seed from each of the pluralities of objects (101, 102) of different types through a hash-based seed discovery method.

The transaction data generation unit (204) may generate transaction data using the plurality of objects (101) connected by links. For example, the transaction data generation unit (204) may set, as transactions, the objects (102) of the other type that each of the plurality of objects (101) points to via links.

The transaction data generation unit (204) may then generate transaction data by classifying the plurality of objects (101), according to their links, as the items that make up the set transactions. That is, the transaction data generation unit (204) may convert the objects (101, 102) connected by links into transaction data. A specific example of generating transaction data is described with reference to FIG. 3.

The noise removal unit (205) may judge objects whose number of links is less than or equal to a preset number to be noise and remove them from the transaction data. Particular objects (101) may be connected to objects (102) of the other type through only a few links. A cluster consists of highly similar objects (101), and similarity is related to the number of links. When the number of links is small, the reliability of the similarity between objects cannot be guaranteed. Thus, if an object with few links is included in a cluster, the accuracy of the cluster may decrease. In addition, objects with few links increase the amount of data to be processed for clustering, so the clustering execution time also increases.

Accordingly, the cluster generation system (100) according to an embodiment of the present invention judges objects (101) with few links to be noise and removes them in advance through preprocessing before generating clusters, thereby improving the similarity between the objects included in a cluster and shortening the clustering time.

For example, the noise removal unit (205) may iteratively remove objects of the different types, type by type. Assume that an object A (101) of type 1 is connected by links to objects a and b (102) of type 2, and that the noise removal unit (205) judges objects with one link to be noise and removes them.

In this case, when object a (102) has one link, the noise removal unit (205) may remove object a (102) from the cluster. As object a (102) is removed, the number of links of object A (101) also decreases. The number of links of object A (101) of type 1 then drops from two to one, so the noise removal unit (205) may judge object A (101) to be noise as well and remove it from the cluster. In this way, the cluster generation system (100) according to an embodiment of the present invention may repeat the removal until no object corresponding to noise remains in the cluster.
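The iterative removal described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `remove_noise`, the parameter `max_noise_links`, and the sample links are assumptions. For brevity it drops noisy objects of both types in the same round instead of alternating by type; repeated pruning of this kind reaches the same final link set either way.

```python
def remove_noise(links, max_noise_links=1):
    """Iteratively drop objects whose link count is <= max_noise_links.

    links is a collection of (type1_object, type2_object) pairs.
    """
    links = set(links)
    while True:
        # Count links per object, tagging each object with its side so that
        # identical IDs in the two types never collide.
        degree = {}
        for left, right in links:
            degree[("T1", left)] = degree.get(("T1", left), 0) + 1
            degree[("T2", right)] = degree.get(("T2", right), 0) + 1
        noisy = {node for node, count in degree.items() if count <= max_noise_links}
        if not noisy:
            return links
        # Removing a noisy object removes all of its links, which may turn
        # its neighbours into noise in the next round.
        links = {
            (left, right)
            for left, right in links
            if ("T1", left) not in noisy and ("T2", right) not in noisy
        }


# Hypothetical links: A points to a and b, as in the example in the text.
sample_links = [("A", "a"), ("A", "b"), ("B", "b"), ("B", "c"), ("C", "b"), ("C", "c")]
cleaned = remove_noise(sample_links)
print(sorted(cleaned))
```

Here object a is removed first (one link), which leaves A with a single link, so A is removed in the next round.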

The hash structure determination unit (206) may scan the generated transaction data to determine a hash structure composed of candidate patterns based on a pattern length. Here, the hash structure determination unit (206) may scan the transaction data from which the objects judged to be noise have been removed by the noise removal unit (205). The pattern length may be the number of objects in an object group contained in the transaction data. For example, the hash structure determination unit (206) may include a candidate pattern generation unit and a frequency determination unit.

The candidate pattern generation unit may generate, from the transaction data, candidate patterns corresponding to a pattern length based on the number of items. That is, the candidate pattern generation unit may scan the transaction data to generate candidate patterns of the given pattern length.

The frequency determination unit may count the candidate patterns in the transaction data to determine a frequency that indicates how often each candidate pattern occurs in the transaction data. The hash structure determination unit (206) may then determine the hash structure using the candidate patterns and the frequency corresponding to each candidate pattern.
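As a rough sketch of candidate pattern generation and frequency counting, the snippet below uses a Python `Counter` (a hash table) to stand in for the hash structure. The text does not specify the layout of the hash structure, so this representation, the function name `build_hash_structure`, and the sample transactions are assumptions.

```python
from collections import Counter
from itertools import combinations


def build_hash_structure(transactions, pattern_length=2):
    """Map each candidate pattern (a tuple of items) to its frequency."""
    counts = Counter()
    for items in transactions.values():
        # Sorting makes ("B1", "B3") and ("B3", "B1") the same key.
        # Transactions with fewer items than pattern_length yield no pattern.
        for pattern in combinations(sorted(set(items)), pattern_length):
            counts[pattern] += 1
    return counts


# Hypothetical transaction data: post -> blogs linked to it.
sample_transactions = {
    "P1": ["B1", "B3"],
    "P2": ["B2"],
    "P3": ["B1", "B3", "B4"],
}
hash_structure = build_hash_structure(sample_transactions)
for pattern, frequency in sorted(hash_structure.items()):
    print(pattern, frequency)
```

A single scan over the transactions fills the table, so each later frequency lookup is a constant-time hash access.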

The seed extraction unit (207) may extract, as a seed, the objects corresponding to candidate patterns that exceed a minimum frequency among the candidate patterns constituting the hash structure. That is, the seed extraction unit (207) sets a minimum frequency that can guarantee at least a certain level of similarity and extracts, as a seed, those of the plurality of objects (101) that correspond to candidate patterns exceeding the minimum frequency.

Accordingly, the seed extraction unit (207) can quickly determine a small number of very highly similar objects from the plurality of objects (101) by searching the hash structure of candidate patterns in a single pass.
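Seed extraction can be illustrated with a single pass over such a pattern-to-frequency table. This is a sketch under assumptions: the dictionary layout, the function name `extract_seeds`, and the sample frequencies are hypothetical, and the comparison is strictly greater-than because the text speaks of patterns that exceed the minimum frequency.

```python
def extract_seeds(hash_structure, min_frequency):
    """Return the candidate patterns whose frequency exceeds min_frequency."""
    return sorted(
        pattern
        for pattern, frequency in hash_structure.items()
        if frequency > min_frequency
    )


# Hypothetical hash structure: candidate pattern -> frequency.
sample_structure = {
    ("B1", "B3"): 2,
    ("B1", "B4"): 1,
    ("B3", "B4"): 1,
}
seeds = extract_seeds(sample_structure, min_frequency=1)
print(seeds)
```

Raising `min_frequency` yields fewer but more strongly related seeds; lowering it admits weaker patterns.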

The cluster generation unit (202) may generate a cluster using the seed determined by the seed determination unit (201). A cluster may be generated for each type of object.

For example, the cluster generation unit (202) may initially generate a cluster composed of the objects that make up the seed. Then, among the objects not yet included in the cluster, the cluster generation unit (202) may extract those that frequently appear in the same transactions as the seed and add them to the cluster. Specifically, the cluster generation unit (202) may keep adding objects to the cluster until the number of objects in the cluster reaches a preset number.
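One way to read this expansion step is sketched below. The ranking rule is an assumption: an object scores one point for every transaction it shares with at least one seed object, and the highest-scoring objects are added first. The names `grow_cluster` and `target_size` and the sample data are hypothetical.

```python
from collections import Counter


def grow_cluster(seed, transactions, target_size):
    """Expand a seed into a cluster of at most target_size objects."""
    seed = set(seed)
    co_occurrence = Counter()
    for items in transactions.values():
        items = set(items)
        if items & seed:
            # Count non-seed objects that share this transaction with the seed.
            co_occurrence.update(items - seed)
    cluster = set(seed)
    # Highest count first; ties are broken by name for a deterministic order.
    ranked = sorted(co_occurrence.items(), key=lambda kv: (-kv[1], kv[0]))
    for obj, _count in ranked:
        if len(cluster) >= target_size:
            break
        cluster.add(obj)
    return cluster


# Hypothetical transaction data: post -> blogs linked to it.
sample_transactions = {
    "P1": ["B1", "B3"],
    "P3": ["B1", "B3", "B4"],
    "P4": ["B3", "B4", "B5"],
    "P6": ["B2", "B5"],
}
cluster = grow_cluster({"B1", "B3"}, sample_transactions, target_size=3)
print(sorted(cluster))
```

In this sample B4 shares two transactions with the seed and B5 shares one, so B4 is added first and the cluster stops growing at three objects.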

The tree generation unit (203) may generate structured trees (103, 104) by setting, for each type, the plurality of objects (101, 102) as lower-level nodes and the clusters generated from the plurality of objects as upper-level nodes. That is, tree X (103) may be generated for the plurality of objects (101), and tree Y (104) may be generated for the plurality of objects (102).

Here, each of the plurality of objects (101, 102) may be a terminal (leaf) node of the tree (103, 104), and each cluster may be a non-terminal node. In addition, the cluster generation system (100) may treat the clusters first determined through the cluster generation process as objects and generate new clusters from them. That is, by repeating the cluster generation process, the cluster generation system (100) may generate trees (103, 104) in which the plurality of objects (101, 102) and the clusters are structured by level.
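The level-by-level tree can be pictured with a small sketch in which objects are leaves and each round of clustering wraps existing nodes in new parent nodes. The `Node` class, the `build_level` helper, the labels, and the groupings are all hypothetical; in the described system the groupings would come from the cluster generation step.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def build_level(nodes, groups, prefix):
    """Wrap groups of existing nodes in new parent (cluster) nodes."""
    by_label = {node.label: node for node in nodes}
    return [
        Node(f"{prefix}{index}", [by_label[member] for member in members])
        for index, members in enumerate(groups, start=1)
    ]


def leaves(node):
    """Collect the labels of all terminal nodes under a node."""
    if not node.children:
        return [node.label]
    return [label for child in node.children for label in leaves(child)]


# Level 0: the objects themselves are terminal nodes.
objects = [Node(name) for name in ["B1", "B2", "B3", "B4"]]
# Level 1: clusters of objects (hypothetical grouping).
level1 = build_level(objects, [["B1", "B3"], ["B2", "B4"]], prefix="C")
# Level 2: the clusters are treated as objects and clustered again.
root = build_level(level1, [["C1", "C2"]], prefix="R")[0]
print(root.label, [child.label for child in root.children], leaves(root))
```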

FIG. 3 is a diagram for explaining transaction data based on objects connected by links according to an embodiment of the present invention.

Referring to FIG. 3, a plurality of objects (301, 302) of different types are connected by links. For example, assume that in FIG. 3 the plurality of objects (301) are blogs and the plurality of objects (302) are posts.

The cluster generation system may determine, based on link information, a seed composed of objects having at least a given similarity from the plurality of objects (301, 302) connected by links. This similarity may be determined according to the links of the objects. A seed may be extracted from each of the pluralities of objects (301, 302) of different types.

The cluster generation system may generate transaction data using the plurality of objects (301, 302) connected by links. For example, the cluster generation system may set, as transactions, the objects (302) of the other type that each of the plurality of objects (301) points to via links. The cluster generation system may then generate transaction data by classifying the plurality of objects (301), according to their links, as the items that make up the set transactions.

In FIG. 3, when the post data are set as transactions, the cluster generation system may classify the blog data linked to each post as the items (303) of that transaction. For example, because blog data B1, B3, and B4 are linked to post data P3, the items of transaction P3 are B1, B3, and B4.

Through this process, the blog data linked to each of the post data P1 to P8 set as transactions are classified as items (303), and the transaction data are generated. Conversely, when the blog data are set as transactions, the cluster generation system may classify the post data linked to each blog as the items of that transaction.
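The conversion from links to transaction data can be sketched as a simple grouping step. Only the P3 row below mirrors the example in the text; the other links, the function name `build_transactions`, and the `(blog, post)` tuple layout are assumptions made for illustration.

```python
from collections import defaultdict


def build_transactions(links):
    """Group items (first element) under the transaction (second element)."""
    grouped = defaultdict(set)
    for item, transaction in links:
        grouped[transaction].add(item)
    return {transaction: sorted(items) for transaction, items in grouped.items()}


# Hypothetical (blog, post) links; only the P3 row follows the text.
sample_links = [
    ("B1", "P1"), ("B3", "P1"),
    ("B2", "P2"),
    ("B1", "P3"), ("B3", "P3"), ("B4", "P3"),
]
transactions = build_transactions(sample_links)
print(transactions["P3"])

# Swapping each pair treats blogs as transactions and posts as items.
reverse = build_transactions([(post, blog) for blog, post in sample_links])
print(reverse["B1"])
```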

By generating transaction data from the plurality of objects connected by links, the cluster generation system can identify patterns of objects that occur frequently. For example, blog data B1 and B3 both point, via links, to post data P1 and P3, so they can be considered similar to each other. Moreover, because B1 and B3 frequently appear together in the same transactions P1 and P3, the cluster generation system may set B1 and B3 as a frequently occurring pattern.

The cluster generation system may also judge objects whose number of links in the generated transaction data is less than or equal to a preset number to be noise and remove them. For example, the cluster generation system may iteratively remove the objects judged to be noise, type by type. The noise removal process may be repeated until no object corresponding to noise remains in the transaction data.

As a result, by removing objects with few links before generating clusters, the cluster generation system does not need to cluster the objects judged to be noise and can therefore generate clusters more quickly. Similarity is also related to the number of links. That is, by judging objects with fewer links than a preset number to be noise and removing them, the cluster generation system can improve the similarity between the objects included in a cluster and shorten the clustering time.

FIG. 4 is a diagram for describing a process of determining a hash structure from transaction data according to an embodiment of the present invention.

Referring to FIG. 4, the cluster generation system may generate a hash structure by searching transaction data 401. For example, the cluster generation system may search the transaction data 401 to determine a hash structure 402 composed of candidate patterns based on a pattern length. Here, the transaction data 401 may be data from which objects determined to be noise have been removed. That is, to generate the hash structure, the cluster generation system may search the transaction data 401 from which the objects determined to be noise have been removed.

Specifically, the cluster generation system may generate, from the transaction data, candidate patterns corresponding to a pattern length, where the pattern length is based on the number of items constituting the transaction data 401. The cluster generation system may then determine the frequency of each candidate pattern by counting its occurrences in the transaction data.

Assume that the pattern length is set to 2 in FIG. 4. The cluster generation system may then search the transaction data 401 and generate candidate patterns of length 2 for each transaction. Because transactions P1, P3, P4, P6, and P8 each contain two or more items, candidate patterns of length 2 can be generated from them. A candidate pattern of length 2 is a pair of two objects.

The cluster generation system may search the transaction data 401 to generate the candidate patterns of length 2: {B1, B2}, {B1, B3}, {B1, B4}, {B3, B4}, and {B4, B5}. The cluster generation system may then determine the frequency of each candidate pattern by counting its occurrences in the transaction data 401. The frequencies of these candidate patterns in the transaction data 401 are 1, 2, 1, 1, and 2, respectively.
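Candidate generation and counting can be sketched with a hash-based counter. The contents of transaction data 401 are not reproduced in the text, so the transactions below are hypothetical, chosen only so that the resulting counts match the frequencies stated above; `count_candidate_patterns` is likewise an illustrative name.

```python
from collections import Counter
from itertools import combinations

def count_candidate_patterns(transactions, pattern_length):
    """Count every item combination of the given length per transaction.

    transactions: dict mapping a transaction id to a set of items.
    Returns a Counter (a hash table) keyed by sorted item tuples.
    """
    counts = Counter()
    for items in transactions.values():
        if len(items) < pattern_length:
            continue  # too few items to form a pattern of this length
        for pattern in combinations(sorted(items), pattern_length):
            counts[pattern] += 1
    return counts

# Hypothetical transaction data (NOT the actual content of FIG. 4).
transactions = {
    "P1": {"B1", "B3"},
    "P3": {"B1", "B3", "B4"},
    "P4": {"B1", "B2"},
    "P6": {"B4", "B5"},
    "P8": {"B4", "B5"},
}
pattern_counts = count_candidate_patterns(transactions, 2)
```

Sorting the items before forming combinations ensures that {B1, B3} and {B3, B1} hash to the same key.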

If the minimum frequency is set to 1, the cluster generation system may determine the candidate patterns whose frequency exceeds the minimum frequency to be seeds. That is, the cluster generation system may determine the candidate patterns {B1, B3} and {B4, B5}, each of which has a frequency of 2, to be seeds. The pattern length and the minimum frequency used in FIG. 4 may be changed according to the configuration of the system.
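Seed selection then reduces to a filter over the counted patterns. The sketch below uses the pattern frequencies given in the text and a strict "exceeds" comparison; the function name `select_seeds` is hypothetical.

```python
def select_seeds(pattern_counts, min_frequency):
    """Return the patterns whose frequency strictly exceeds min_frequency."""
    return sorted(p for p, n in pattern_counts.items() if n > min_frequency)

# Frequencies of the length-2 candidate patterns as stated in the text.
pattern_counts = {
    ("B1", "B2"): 1,
    ("B1", "B3"): 2,
    ("B1", "B4"): 1,
    ("B3", "B4"): 1,
    ("B4", "B5"): 2,
}
seeds = select_seeds(pattern_counts, min_frequency=1)
```

With a minimum frequency of 1, only {B1, B3} and {B4, B5} remain, matching the seeds named above.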

FIG. 5 is a diagram for describing a process of generating clusters and adding objects to them according to an embodiment of the present invention.

The cluster generation system may extract, as seeds, the objects corresponding to candidate patterns whose frequency exceeds the minimum frequency among the candidate patterns constituting the hash structure. The cluster generation system may then generate clusters using the seeds, which are objects determined through link information to be similar to each other. The cluster generation system may also add to a cluster those objects, among the objects not yet included in any cluster, that frequently appear in the same transactions as the seed.

Referring to the example of FIG. 4, the candidate patterns {B1, B3} and {B4, B5}, whose frequency exceeds the minimum frequency of 1, may each be determined to be a seed. Then, as shown at reference numeral 501 of FIG. 5, B1 and B3 may be set as one group, cluster X, and B4 and B5 may be set as another group, cluster Y. As shown at reference numeral 502 of FIG. 5, the cluster generation system may then add B2, which is included in the same transaction as B1 and B3, to cluster X.
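Seed-based cluster creation and expansion can be sketched as follows. The text does not state how "frequently" is quantified, so the threshold `min_cooccurrence` is an assumption, as are the function name `build_clusters` and the sample transactions (which are not the actual content of the drawings).

```python
def build_clusters(seeds, transactions, min_cooccurrence=1):
    """Start one cluster per seed, then attach unclustered objects.

    An unclustered object joins the cluster whose seed it shares the
    most transactions with, provided that count reaches
    min_cooccurrence (an assumed threshold).
    """
    clusters = [set(seed) for seed in seeds]
    clustered = set().union(*clusters)
    all_items = set().union(*transactions.values())
    for obj in sorted(all_items - clustered):
        best, best_count = None, 0
        for cluster, seed in zip(clusters, seeds):
            # Number of transactions containing obj and any seed member.
            count = sum(1 for items in transactions.values()
                        if obj in items and items & set(seed))
            if count > best_count:
                best, best_count = cluster, count
        if best is not None and best_count >= min_cooccurrence:
            best.add(obj)
    return clusters

# Hypothetical transaction data in which B2 co-occurs with seed member B1.
transactions = {
    "P1": {"B1", "B3"},
    "P3": {"B1", "B3", "B4"},
    "P4": {"B1", "B2"},
    "P6": {"B4", "B5"},
    "P8": {"B4", "B5"},
}
clusters = build_clusters([("B1", "B3"), ("B4", "B5")], transactions)
```

Here cluster X grows from {B1, B3} to {B1, B2, B3}, while cluster Y stays {B4, B5}, mirroring reference numerals 501 and 502.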

As a result, the cluster generation system may search the hash structure based on the transaction data to determine, as seeds, a small number of objects with very high similarity among the plurality of objects. After generating a cluster using a seed, the cluster generation system may add to the cluster those objects that frequently occur in the same transactions as the objects already in the cluster, thereby clustering the plurality of objects.

FIG. 6 is a diagram illustrating trees composed of objects and clusters according to an embodiment of the present invention.

FIG. 6 shows trees in which clusters are generated from a plurality of objects of different types, each tree being composed of the objects and the clusters. Level 0, which holds the leaf nodes of a tree, consists of the plurality of objects. The non-leaf nodes of level 1, the next level up, are the clusters generated from the level 0 objects. The non-leaf nodes of level 2 are clusters generated by treating the level 1 clusters themselves as objects. The tree can be extended by repeating this process.
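The level structure of such a tree can be sketched with a small node type. The class `Node`, the helper `make_parent`, and the node names are hypothetical and only illustrate how level 1 and level 2 nodes are stacked on top of level 0 objects.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    level: int
    children: list = field(default_factory=list)

def make_parent(name, children):
    """Create a cluster node one level above its highest child."""
    level = max(child.level for child in children) + 1
    return Node(name, level, list(children))

# Level 0: the objects themselves (leaf nodes).
leaves = {n: Node(n, 0) for n in ["B1", "B2", "B3", "B4", "B5"]}

# Level 1: clusters generated from level 0 objects.
x = make_parent("X", [leaves[n] for n in ("B1", "B2", "B3")])
y = make_parent("Y", [leaves[n] for n in ("B4", "B5")])

# Level 2: a cluster generated by treating level 1 clusters as objects.
root = make_parent("R", [x, y])
```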

The objects at level 0 of trees of different types may be connected to each other through links. The similarity between nodes may be determined through these links. For example, a tree may store the similarity between child nodes that belong to the same parent node. The similarity between child nodes that do not belong to the same parent node may be calculated from the similarity between the ancestor nodes of those child nodes.
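One possible reading of this lookup is sketched below. The text does not give a formula, so the sketch simply assumes that the similarity of two nodes under different parents is the stored similarity of their ancestors that share a parent; the function name, the data layout, and the similarity values are all hypothetical.

```python
def similarity(a, b, parent, sibling_sim):
    """Look up similarity through the tree hierarchy.

    parent: dict mapping each node to its parent (the root has no entry).
    sibling_sim: dict mapping frozenset({u, v}) to the stored similarity
    of two nodes u and v that share a parent.
    Assumption: for nodes under different parents, the similarity of the
    ancestors just below their lowest common ancestor is returned as is.
    """
    if a == b:
        return 1.0  # assumed convention for identical nodes

    def path_to_root(node):
        path = [node]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        return path

    pa, pb = path_to_root(a), path_to_root(b)
    # Strip the ancestors that both paths have in common.
    while pa and pb and pa[-1] == pb[-1]:
        pa.pop()
        pb.pop()
    if not pa or not pb:
        raise ValueError("one node is an ancestor of the other")
    return sibling_sim[frozenset((pa[-1], pb[-1]))]

# Hypothetical tree and similarity values.
parent = {"B1": "X", "B2": "X", "B3": "X", "B4": "Y", "B5": "Y",
          "X": "R", "Y": "R"}
sibling_sim = {frozenset(("B1", "B3")): 0.9, frozenset(("X", "Y")): 0.2}

same_parent = similarity("B1", "B3", parent, sibling_sim)
different_parent = similarity("B1", "B4", parent, sibling_sim)
```

Only similarities between siblings are stored, so the table stays far smaller than a full pairwise matrix; an actual implementation might also weight the result by each node's similarity to its own ancestor.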

That is, the similarity is not computed for every pair of nodes in the tree but is derived through the hierarchical structure of the tree, so the similarity of the nodes in the tree can be calculated quickly. Based on the finally calculated similarity, the tree may be refined so that nodes are grouped with the nodes most similar to them.

FIG. 7 is a flowchart illustrating the overall process of a cluster generation method according to an embodiment of the present invention.

The cluster generation system may determine, based on link information, a seed composed of objects having at least a preset similarity from among a plurality of objects connected by links (S701).

For example, the cluster generation system may determine the seed through a search method based on a hash structure. Specifically, the cluster generation system may generate transaction data using the plurality of objects connected by links (S704). Here, the cluster generation system may generate the transaction data by setting, as transactions, the objects of a different type that each of the plurality of objects points to through a link, and classifying the plurality of objects, according to the links, as items constituting the set transactions.

The cluster generation system may determine objects whose number of links is equal to or less than a preset number to be noise, and remove them from the transaction data through a preprocessing step (S705). The similarity between objects increases as the number of links they have in common increases. However, when an object with few links is included in a cluster, the similarity of the cluster may decrease. Therefore, before generating clusters, the cluster generation system may determine, through the preprocessing step, objects with few links to be noise. By removing the objects determined to be noise from the transaction data, the cluster generation system can improve the similarity of the clusters generated afterwards. The preset number of links may be changed according to the configuration of the system.

For example, the cluster generation system may repeatedly remove objects of different types, type by type. That is, because objects are connected by links to objects of a different type, removing one object may also reduce the number of links of the objects of the other type that were linked to it.

The cluster generation system may then search the transaction data to determine a hash structure composed of candidate patterns based on a pattern length (S706). Here, the transaction data may be data from which the objects determined to be noise have been removed. For example, the cluster generation system may generate, from the transaction data, candidate patterns corresponding to the pattern length, where the pattern length is based on the number of items. The cluster generation system may then determine the frequency of each candidate pattern by counting its occurrences in the transaction data. Finally, a hash structure composed of the candidate patterns and their frequencies may be determined.

The cluster generation system may extract, as seeds, the objects corresponding to candidate patterns whose frequency exceeds the minimum frequency among the candidate patterns constituting the hash structure (S707).

The cluster generation system may generate a cluster using the seed (S702). For example, the cluster generation system may generate a cluster using the seed and add to the cluster those objects, among the objects not included in the cluster, that frequently appear in the same transactions as the seed. That is, the cluster generation system according to an embodiment of the present invention determines a seed, which is a small number of objects with very high similarity, generates a cluster from it, and extends the cluster through the seed, thereby shortening the initial execution time required to generate clusters.

The cluster generation system may generate a structural tree for each type by setting the plurality of objects as lower-level nodes and setting the clusters generated from the plurality of objects as higher-level nodes (S703).

Here, level 0, which holds the leaf nodes of the tree, consists of the plurality of objects. The non-leaf nodes of level 1, the next level up, are the clusters generated from the level 0 objects. The non-leaf nodes of level 2 are clusters generated by treating the level 1 clusters themselves as objects. The tree can be extended by repeating this process.

For details not described with reference to FIG. 7, refer to the descriptions of FIGS. 1 to 6.

The cluster generation method according to an embodiment of the present invention may also be embodied as a computer-readable medium including program instructions for performing operations implemented by various computers. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the present invention has been described with reference to limited embodiments and drawings, the present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description. Accordingly, the spirit of the present invention should be understood only by the claims set forth below, and all equal or equivalent modifications thereof fall within the scope of the spirit of the present invention.

<Explanation of reference numerals for the main parts of the drawings>

100: cluster generation system

101: object <type 1>    102: object <type 2>

103: tree <type 1>    104: tree <type 2>

Claims

A cluster generation system comprising:

a seed determination unit configured to determine, based on link information, a seed composed of objects having at least a preset similarity from among a plurality of objects connected by links; and

a cluster generation unit configured to generate a cluster using the determined seed.

The cluster generation system of claim 1,

wherein the seed determination unit comprises:

a transaction data generation unit configured to generate transaction data using the plurality of objects connected by the links;

a hash structure determination unit configured to search the transaction data to determine a hash structure composed of candidate patterns based on a pattern length; and

a seed extraction unit configured to extract, as the seed, objects corresponding to a candidate pattern whose frequency exceeds a minimum frequency among the candidate patterns constituting the hash structure.

The cluster generation system of claim 2,

wherein the transaction data generation unit

generates the transaction data by setting, as transactions, the objects of a different type that each of the plurality of objects points to through a link, and classifying the plurality of objects, according to the links, as items constituting the set transactions.

The cluster generation system of claim 3,

wherein the hash structure determination unit comprises:

a candidate pattern generation unit configured to generate, from the transaction data, candidate patterns corresponding to the pattern length, the pattern length being based on the number of items; and

a frequency determination unit configured to determine a frequency by counting the generated candidate patterns in the transaction data.

The cluster generation system of claim 4,

wherein the hash structure determination unit

determines the hash structure using the candidate patterns and the frequencies of the candidate patterns.

The cluster generation system of claim 2,

wherein the cluster generation unit

generates the cluster using the extracted seed, and adds to the cluster those objects, among objects not included in the cluster, that frequently appear in the same transactions as the seed.

The cluster generation system of claim 2,

wherein the seed determination unit further comprises

a noise removal unit configured to determine objects whose number of links is equal to or less than a preset number to be noise and to remove them from the transaction data,

and wherein the hash structure determination unit

searches the transaction data from which the objects determined to be noise have been removed.

The cluster generation system of claim 7,

wherein the noise removal unit

repeatedly removes objects of different types, type by type.

The cluster generation system of claim 1,

further comprising a tree generation unit configured to generate a structural tree for each type by setting the plurality of objects as lower-level nodes and setting the clusters generated from the plurality of objects as higher-level nodes.

A cluster generation method comprising:

determining, based on link information, a seed composed of objects having at least a preset similarity from among a plurality of objects connected by links; and

generating a cluster using the determined seed.

The method of claim 10,

wherein determining the seed composed of objects having at least the preset similarity comprises:

generating transaction data using the plurality of objects connected by the links;

searching the transaction data to determine a hash structure composed of candidate patterns based on a pattern length; and

extracting, as the seed, objects corresponding to a candidate pattern whose frequency exceeds a minimum frequency among the candidate patterns constituting the hash structure.

The method of claim 11,

wherein generating the transaction data comprises

setting, as transactions, the objects of a different type that each of the plurality of objects points to through a link, and classifying the plurality of objects, according to the links, as items constituting the set transactions, thereby generating the transaction data.

The method of claim 12,

wherein determining the hash structure composed of candidate patterns based on the pattern length comprises:

generating, from the transaction data, candidate patterns corresponding to the pattern length, the pattern length being based on the number of items; and

determining a frequency by counting the generated candidate patterns in the transaction data.

The method of claim 13,

wherein the hash structure is determined using the candidate patterns and the frequencies of the candidate patterns.

The method of claim 11,

wherein generating the cluster comprises:

generating the cluster using the extracted seed; and

adding to the cluster those objects, among objects not included in the cluster, that frequently appear in the same transactions as the seed.

The method of claim 12,

wherein determining the seed further comprises

determining objects whose number of links is equal to or less than a preset number to be noise and removing them from the cluster,

and wherein determining the hash structure comprises

searching the transaction data from which the objects determined to be noise have been removed.

The method of claim 16,

wherein determining objects whose number of links is equal to or less than the preset number to be noise and removing them from the cluster comprises

repeatedly removing objects of different types, type by type.

The method of claim 10,

further comprising generating a structural tree for each type by setting the plurality of objects as lower-level nodes and setting the clusters generated from the plurality of objects as higher-level nodes.

A computer-readable recording medium having recorded thereon a program for executing the method of claim 10.