KR20120050643A

KR20120050643A - Method and apparatus for detecting outlier based on graph

Info

Publication number: KR20120050643A
Application number: KR1020100112000A
Authority: KR
Inventors: 김상욱; 정서
Original assignee: 한양대학교 산학협력단
Priority date: 2010-11-11
Filing date: 2010-11-11
Publication date: 2012-05-21
Also published as: KR101186400B1

Abstract

PURPOSE: A graph-based outlier detection method and apparatus are provided to detect outliers efficiently by using the HITS (Hyperlink-Induced Topic Search) algorithm. CONSTITUTION: A graph generating unit (410) generates a graph based on a data set. A score assigning unit (420) assigns a score to each object based on edge information of the graph. An outlier detecting unit (430) detects outliers based on the object scores. Each vertex of the graph corresponds to one of the objects. The graph generating unit generates the graph so that the in-degree of each vertex is proportional to the number of objects neighboring the corresponding object.

Description

Graph-based outlier detection method and apparatus {METHOD AND APPARATUS FOR DETECTING OUTLIER BASED ON GRAPH}

The following embodiments relate to a method and apparatus for detecting outliers.

A method and apparatus for detecting outliers by using a graph generated from a data set are disclosed.

An outlier is an object in a data set that is relatively dissimilar to the other objects in the set.

Detecting such outliers is useful in many domains. For example, in a domain where financial transaction fraud must be uncovered, a customer whose transaction pattern differs from that of other customers can be identified through outlier detection. That customer's transaction history can then be examined in detail.

The following existing methods for outlier detection are known.

The statistical method identifies, among various statistical models, the model that the data set follows, and detects objects that deviate from that model as outliers.

The distance-based method uses the distances between objects in the data set as its measure and detects objects that lie relatively far from the others as outliers.

The density-based method detects an object as an outlier when its density differs greatly from the density of the objects around it.

The existing methods have the following problems.

With the statistical method, if the data set is multidimensional, a statistical model must be determined for each dimension, and the resulting models must then be combined into a single model. For this reason, the statistical method is difficult to apply to multidimensional data.

Distance-based methods use distance as their only outlier measure. As a result, they can suffer from the local density problem.

Density-based methods use only the comparison between each object's density and the densities of its neighboring objects as the criterion for deciding whether the object is an outlier. As a result, they can suffer from the multi-granularity problem.

Recently, graph-based methods have been proposed to address these problems. For example, a given data set is modeled as a graph, and Random Walks with Restart (RWR) is performed on the graph to assign each object a score indicating how far it lies from the other objects.

However, this method models the given data as a complete graph. Consequently, the object at the center of gravity of the graph receives the highest score.

An embodiment of the present invention may provide an outlier detection method and apparatus based on a k-nearest neighbor graph.

An embodiment of the present invention may provide an outlier detection method and apparatus using a weighted HITS algorithm.

According to one aspect of the present invention, there is provided a method for detecting an outlier in a data set including one or more objects, the method including: generating a graph based on the data set; assigning a score to each of the one or more objects based on edge information of the graph; and detecting the outlier among the one or more objects based on the score assigned to each of the one or more objects, wherein each of one or more vertices of the graph corresponds to one of the one or more objects.

The in-degree of each of the one or more vertices of the graph may be proportional to the number of objects neighboring the object corresponding to that vertex.

The graph may be a directed k-nearest neighbor graph, and the graph may have an edge from a first vertex corresponding to a first object to a second vertex corresponding to a second object only if the second object is among the k nearest neighbors of the first object.

Generating the graph based on the data set may include: generating one or more vertices respectively corresponding to the one or more objects; searching for the k nearest neighbor objects of each of the one or more objects; and generating edges from each of the one or more objects to the k nearest neighbor objects found for it.

The score may be the sum of an authority score and a hub score. The authority score may indicate how many other objects lie around the object having that authority score, and the hub score may indicate how many non-outlier objects lie around the object having that hub score.

Assigning a score to each of the one or more objects based on the edge information of the graph may include: assigning an authority score to each of the one or more objects based on the edge information of the graph; and assigning a hub score to each of the one or more objects based on the edge information of the graph.

Detecting the outlier among the one or more objects based on the score assigned to each of the one or more objects may include detecting, as outliers, the n objects having the smallest sum of scores among the one or more objects.

According to another aspect of the present invention, there is provided an apparatus for detecting an outlier in a data set including one or more objects, the apparatus including: a graph generating unit that generates a graph based on the data set; a score assigning unit that assigns a score to each of the one or more objects based on edge information of the graph; and an outlier detecting unit that detects the outlier among the one or more objects based on the score assigned to each of the one or more objects, wherein each of one or more vertices of the graph corresponds to one of the one or more objects.

The graph generating unit may generate the graph so that the in-degree of each of the one or more vertices of the graph is proportional to the number of objects neighboring the object corresponding to that vertex.

The score assigning unit may assign a score to each of the one or more objects based on the edge information of the graph by assigning an authority score to each of the one or more objects based on the edge information of the graph and assigning a hub score to each of the one or more objects based on the edge information of the graph.

The authority score may be determined based on the set of objects that point to the object corresponding to the node receiving the authority score, the set of objects that this object points to, and the similarity between objects.

The hub score may be determined based on the set of objects that point to the object corresponding to the node receiving the hub score, the set of objects that this object points to, and the similarity between objects.

The outlier detecting unit may detect, as outliers, the n objects having the smallest sum of scores among the one or more objects.

An outlier detection method and apparatus based on a k-nearest neighbor graph are provided.

An outlier detection method and apparatus using a weighted HITS algorithm are provided.

FIG. 1 illustrates an outlier detection method using a complete graph.
FIG. 2 illustrates an outlier detection method according to an embodiment of the present invention.
FIG. 3 is a flowchart of an outlier detection method according to an embodiment of the present invention.
FIG. 4 is a structural diagram of an outlier detection apparatus according to an embodiment of the present invention.
FIG. 5 shows an example of outlier detection according to an example of the present invention.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings. The present invention is not, however, restricted or limited by the embodiments. Like reference numerals in the drawings denote like elements.

An embodiment of the present invention discloses a new graph-based method for accurate outlier detection.

FIG. 1 illustrates an outlier detection method using a complete graph.

The outlier detection method using a complete graph disclosed in H. Moonesinghe, "Outlier Detection using Random Walks," In ICTAAI, pp. 532-539, 2006, and the problems of that method, are described below. The entire content of that paper may be used to supplement the content of this document.

This graph-based outlier detection method models the given data as a complete graph and then applies RWR to the modeled graph. The objects that receive low scores are then regarded as outliers.

The above method has the following problems.

1) The given data set is modeled as a complete graph. Consequently, an object located at the center of gravity of the whole graph receives a higher score than the other objects.

2) In FIG. 1, the score of point c (130), which is judged to be an outlier, should be lower than that of point a (110). In addition, points b (120), d (140), and e (150) each lie at the center of a cluster and should therefore receive higher scores than point c (130). However, the scores actually assigned to the points by the above method are, in descending order, c (130), b (120), d (140), a (110), and e (150).

This happens because, in the complete graph generated by the conventional method, all objects (that is, all vertices in the graph) have the same in-degree, so the only factor that affects an object's score is its distance to the other objects. The closer an object (or the node corresponding to it) is to the overall center of gravity of the graph, the smaller the sum of its distances to all other objects in the graph. An object closer to the overall center of gravity therefore receives a higher score. As a result, the scores of the objects decrease with distance from the center of gravity of the graph.

FIG. 2 illustrates an outlier detection method according to an embodiment of the present invention.

In this embodiment, the given data set is modeled not as a complete graph but as a k-nearest neighbor (k-NN) graph, in which edges connect each object only to the k objects most similar to it.

A k-NN graph is a graph defined for a set P of n objects in a metric space (for example, a set of points in a plane under Euclidean distance). The vertex set of the k-NN graph is P. The k-NN graph has a directed edge from p to q if q is a k-nearest neighbor of p (that is, if q is among the k objects closest to p).

In other words, if object B is included in the k-NN of object A, an edge A→B is added (that is, an edge from the vertex corresponding to object A to the vertex corresponding to object B).
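The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: objects are 2-D points, distance is Euclidean, ties are broken by index, and the names `knn_graph` and `in_degrees` are hypothetical. It assumes Python 3.8 or later for `math.dist`.

```python
import math


def knn_graph(points, k):
    """Return {i: [j, ...]} with an edge i -> j iff j is one of the k nearest neighbors of i."""
    edges = {}
    for i, p in enumerate(points):
        # Rank every other object by Euclidean distance to p (ties broken by index).
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        edges[i] = others[:k]
    return edges


def in_degrees(edges):
    """Count the incoming edges of every vertex."""
    degree = {v: 0 for v in edges}
    for targets in edges.values():
        for j in targets:
            degree[j] += 1
    return degree


# Hypothetical data: three clustered points and one isolated point (index 3).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 11.0)]
graph = knn_graph(points, k=2)
degrees = in_degrees(graph)
```

Every vertex has exactly k outgoing edges, but the isolated point receives no incoming edge, which is the asymmetry the rest of the method relies on.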

The weight of each edge is the similarity (or a value proportional to the similarity) between the two objects that the edge connects.

The similarity can be computed from the distance between the two objects. That is, the closer two objects are to each other, the higher their similarity can be.

The similarity can be computed by various methods, such as cosine similarity or Euclidean distance.
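As an illustration of the two options named above, the following sketch shows one way to turn a Euclidean distance into a similarity and how cosine similarity is computed. The mapping `1 / (1 + d)` is an assumption chosen only because it decreases with distance; the text does not specify a formula. The function names are hypothetical.

```python
import math


def euclidean_similarity(a, b):
    """Map Euclidean distance d to a similarity in (0, 1]; 1 / (1 + d) is an assumed mapping."""
    return 1.0 / (1.0 + math.dist(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


near = euclidean_similarity((0.0, 0.0), (1.0, 0.0))
far = euclidean_similarity((0.0, 0.0), (3.0, 4.0))
```

Either function can supply the edge weights of the k-NN graph; which one is appropriate depends on whether magnitude or direction of the feature vectors matters for the data at hand.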

The graph can be modeled so that the in-degree of each object (that is, of the node corresponding to the object) is proportional to the number of objects around it. In other words, the graph is modeled so that an object with many other objects nearby has a high in-degree, and an object with few other objects nearby has a low in-degree.

Applying RWR to the k-NN graph, however, can give rise to a new problem.

RWR assigns every object with zero in-degree the same score, namely the score corresponding to the restart probability. Cluster-boundary objects and outliers therefore cannot be told apart.

That is, if RWR is applied to the graph (200) of FIG. 2, the cluster-boundary objects (210, 220, and 230) and the outlier object (240) all receive the same score. The cluster-boundary objects (210, 220, and 230) and the outlier object (240) therefore cannot be distinguished from one another.
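The effect can be reproduced with a small experiment. The sketch below uses a PageRank-style formulation of a random walk with restart, which is an assumption; the cited RWR method may differ in detail. The graph, the restart probability of 0.15, and the function name `rwr_scores` are all hypothetical.

```python
def rwr_scores(edges, restart=0.15, iterations=100):
    """PageRank-style random walk with restart on a directed graph {i: [j, ...]}.

    Assumes every vertex has at least one outgoing edge, which holds for a k-NN graph.
    """
    nodes = list(edges)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every vertex first receives the restart mass, shared uniformly.
        nxt = {v: restart / n for v in nodes}
        # The remaining mass flows along outgoing edges in equal shares.
        for i, targets in edges.items():
            share = (1.0 - restart) * score[i] / len(targets)
            for j in targets:
                nxt[j] += share
        score = nxt
    return score


# Hypothetical 2-NN graph: 0, 1, 2 form a cluster core, 3 is a cluster-boundary
# object, and 4 is a distant outlier. No edge points at 3 or at 4.
edges = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [0, 1], 4: [1, 2]}
scores = rwr_scores(edges)
```

Objects 3 and 4 end up with identical scores equal to the restart share, even though one sits next to the cluster and the other does not, because the walk never uses the information carried by their outgoing edges.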

Therefore, this embodiment uses the HITS (Hyperlink-Induced Topic Search) algorithm instead of RWR. The HITS algorithm is described in detail in J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," In JACM, Vol. 46, No. 5, pp. 604-632, 1999. The entire content of that paper may be used to supplement the content of this document.

In the HITS algorithm, the authority score a_p and the hub score h_p of an object p can be computed by Equations 1 and 2 below.

Here, the set B(x) denotes the set of objects that point to object x, and the set F(x) denotes the set of objects that object x points to.
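Equations 1 and 2 are not reproduced in this text. For reference, the standard HITS updates from the Kleinberg paper cited above, written with the sets B and F as defined here, take the following form. This is the textbook formulation rather than a transcription of the patent's own figures, and the normalization step that usually follows each update is omitted.

```latex
\begin{align}
a_p &= \sum_{q \in B(p)} h_q \tag{1} \\
h_p &= \sum_{q \in F(p)} a_q \tag{2}
\end{align}
```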

In this embodiment, the existing HITS algorithm is modified so that the authority score and the hub score are computed with the similarity between objects taken into account. The purpose of the modification is to make the authority and hub scores propagate in proportion to the similarity between objects.

Equations 3 and 4 below compute the authority score a_p and the hub score h_p, respectively, with the similarity between objects taken into account.

Here, W_ij denotes the similarity between object i and object j.
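Equations 3 and 4 are not reproduced in this text. One plausible reading, assuming that each term of the standard HITS sums is simply multiplied by the similarity on the corresponding edge, is sketched below. The index order of W (source object first, target object second) is an assumption, not something the text states.

```latex
\begin{align}
a_p &= \sum_{q \in B(p)} W_{qp}\, h_q \tag{3} \\
h_p &= \sum_{q \in F(p)} W_{pq}\, a_q \tag{4}
\end{align}
```

Under this reading, setting every W_ij to 1 recovers the unweighted HITS updates.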

The authority score assigned to an object indicates how many other objects lie around that object.

The hub score assigned to an object indicates how many non-outlier objects lie around that object.
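A similarity-weighted HITS iteration can be sketched as follows. The update rule is an assumption: the authority of an object is taken to be the similarity-weighted sum of the hub scores of the objects pointing at it, and its hub score the similarity-weighted sum of the authority scores of the objects it points at. The L2 normalization after each round, the function name `weighted_hits`, and the example weights are also assumptions made for illustration.

```python
import math


def _normalize(scores):
    """Scale a score dict to unit L2 norm (left unchanged if all scores are zero)."""
    norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
    return {v: s / norm for v, s in scores.items()}


def weighted_hits(weights, iterations=50):
    """weights[(i, j)] is the similarity on the directed edge i -> j.

    Returns (authority, hub) dicts, normalized after every round.
    """
    nodes = sorted({v for edge in weights for v in edge})
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority of j: weighted hub scores of the objects pointing at j.
        new_auth = {v: 0.0 for v in nodes}
        for (i, j), w in weights.items():
            new_auth[j] += w * hub[i]
        # Hub of i: weighted authority scores of the objects i points at.
        new_hub = {v: 0.0 for v in nodes}
        for (i, j), w in weights.items():
            new_hub[i] += w * new_auth[j]
        auth, hub = _normalize(new_auth), _normalize(new_hub)
    return auth, hub


# Hypothetical 2-NN graph: 0, 1, 2 are a cluster core (similarity 1.0),
# 3 is a cluster-boundary object (0.5), and 4 is a distant outlier (0.1).
weights = {
    (0, 1): 1.0, (0, 2): 1.0, (1, 0): 1.0, (1, 2): 1.0, (2, 0): 1.0, (2, 1): 1.0,
    (3, 0): 0.5, (3, 1): 0.5, (4, 1): 0.1, (4, 2): 0.1,
}
auth, hub = weighted_hits(weights)
total = {v: auth[v] + hub[v] for v in auth}
```

Objects 3 and 4 both have zero authority because nothing points at them, but the boundary object keeps a larger hub score than the outlier, so the two are no longer tied.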

The n objects with the smallest sum of hub score and authority score among the objects can be detected as outliers.
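The selection rule amounts to a bottom-n query over the summed scores. A minimal sketch, with hypothetical score values and a hypothetical function name `detect_outliers`:

```python
import heapq


def detect_outliers(authority, hub, n):
    """Return the n object ids with the smallest authority + hub totals, lowest first."""
    total = {obj: authority[obj] + hub[obj] for obj in authority}
    return heapq.nsmallest(n, total, key=total.get)


# Hypothetical scores for four objects.
authority = {"a": 0.60, "b": 0.55, "c": 0.00, "d": 0.00}
hub = {"a": 0.50, "b": 0.52, "c": 0.30, "d": 0.05}
outliers = detect_outliers(authority, hub, n=2)
```

Note that n is a parameter chosen by the user; the method ranks objects but does not itself decide how many of them are outliers.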

FIG. 3 is a flowchart of an outlier detection method according to an embodiment of the present invention.

The method according to this embodiment for detecting an outlier in a data set including one or more objects may include a graph generation step (S310 to S330), a scoring step (S340 and S350), and an outlier detection step (S360).

In the graph generation step (S310 to S330), a graph is generated based on the data set. Each of the one or more vertices of the generated graph corresponds to one of the one or more objects.

The in-degree of each of the one or more vertices of the generated graph may be proportional to the number of objects neighboring the object corresponding to that vertex.

As described above, the generated graph may be a directed k-NN graph. The graph therefore has an edge from a first vertex corresponding to a first object to a second vertex corresponding to a second object only if the second object is among the k nearest neighbors of the first object.

In the scoring step (S340 and S350), a score is assigned to each of the one or more objects based on the edge information of the generated graph.

The score assigned to each object may be the sum of an authority score and a hub score.

In the outlier detection step (S360), the outlier among the one or more objects is detected based on the score assigned to each of the one or more objects.

First, the graph generation step (S310 to S330) is described in detail.

In step S310, one or more vertices respectively corresponding to the one or more objects are created in the graph.

In step S320, the k nearest neighbor objects are searched for each of the one or more objects.

In step S330, edges are created from each of the one or more objects to the k nearest neighbor objects found for that object.

Next, the scoring step (S340 and S350) is described.

In step S340, an authority score is assigned to each of the one or more objects based on the edge information of the graph.

In step S350, a hub score is assigned to each of the one or more objects based on the edge information of the graph.

Next, the outlier detection step (S360) is described.

In step S360, the n objects with the smallest sum of assigned scores among the one or more objects are detected as outliers.
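Steps S310 to S360 can be strung together in a single function. The sketch below is an illustration under several assumptions that the text does not fix: points are compared by Euclidean distance, similarity is `1 / (1 + d)`, scores are L2-normalized after each round, the iteration count is 50, and the data set has more than k objects. The function name `detect_outliers_knn_hits` and the sample points are hypothetical.

```python
import math


def detect_outliers_knn_hits(points, k, n, iterations=50):
    ids = range(len(points))  # S310: one vertex per object
    weights = {}
    for i in ids:
        # S320: find the k nearest neighbors of object i (ties broken by index).
        nearest = sorted(
            (j for j in ids if j != i),
            key=lambda j: (math.dist(points[i], points[j]), j),
        )[:k]
        # S330: add a directed edge i -> j weighted by the assumed similarity 1 / (1 + d).
        for j in nearest:
            weights[(i, j)] = 1.0 / (1.0 + math.dist(points[i], points[j]))
    auth = dict.fromkeys(ids, 1.0)
    hub = dict.fromkeys(ids, 1.0)
    for _ in range(iterations):
        new_auth = dict.fromkeys(ids, 0.0)  # S340: authority scores
        for (i, j), w in weights.items():
            new_auth[j] += w * hub[i]
        new_hub = dict.fromkeys(ids, 0.0)  # S350: hub scores
        for (i, j), w in weights.items():
            new_hub[i] += w * new_auth[j]
        a_norm = math.sqrt(sum(s * s for s in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(s * s for s in new_hub.values())) or 1.0
        auth = {v: s / a_norm for v, s in new_auth.items()}
        hub = {v: s / h_norm for v, s in new_hub.items()}
    # S360: the n objects with the smallest authority + hub totals are the outliers.
    return sorted(ids, key=lambda v: (auth[v] + hub[v], v))[:n]


# Hypothetical data: five points in a tight cluster and one distant point (index 5).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]
outliers = detect_outliers_knn_hits(points, k=2, n=1)
```

The brute-force neighbor search is quadratic in the number of objects; a spatial index would be the natural replacement for larger data sets, but that choice is outside what the text describes.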

FIG. 4 is a structural diagram of an outlier detection apparatus according to an embodiment of the present invention.

The outlier detection apparatus (400) includes a graph generating unit (410), a score assigning unit (420), and an outlier detecting unit (430). The outlier detection apparatus (400) may further include a storage unit (440).

The outlier detection apparatus (400) detects an outlier in a data set including one or more objects.

The graph generating unit (410) generates a graph based on the data set. Each of the one or more vertices of the generated graph corresponds to one of the one or more objects.

The score assigning unit (420) assigns a score to each of the one or more objects based on the edge information of the generated graph.

The outlier detecting unit (430) detects the outlier among the one or more objects based on the score assigned to each of the one or more objects.

The storage unit (440) stores the data structures needed for outlier detection, such as the data set, the one or more objects, the scores (including authority scores and hub scores), the graph (including vertices and edges), and the detected outliers, and provides these data structures to the graph generating unit (410), the score assigning unit (420), and the outlier detecting unit (430).

The graph generating unit (410) may generate the graph so that the in-degree of each of the one or more vertices of the graph is proportional to the number of objects neighboring the object corresponding to that vertex.

The graph generated by the graph generating unit (410) may be a directed k-NN graph, and the graph generating unit (410) may create an edge from a first vertex corresponding to a first object to a second vertex corresponding to a second object only if the second object is among the k nearest neighbors of the first object.

The graph generating unit (410) may generate the graph by creating one or more vertices respectively corresponding to the one or more objects in the data set, searching for the k nearest neighbor objects of each of the one or more objects, and creating edges from each of the one or more objects to the k nearest neighbor objects found for that object.

The score assigning unit (420) may assign a score to each of the one or more objects based on the edge information of the graph by assigning an authority score to each of the one or more objects in the data set based on the edge information of the graph and assigning a hub score to each of the one or more objects based on the edge information of the graph.

The authority score may be determined based on the set of objects that point to the object corresponding to the node receiving the authority score, the set of objects that this object points to, and the similarity between objects. The hub score may be determined based on the set of objects that point to the object corresponding to the node receiving the hub score, the set of objects that this object points to, and the similarity between objects.

The score assigning unit (420) may determine the similarity between objects based on the distance between them.

The outlier detecting unit (430) may detect, as outliers, the n objects having the smallest sum of scores among the one or more objects in the data set.

The technical details according to an embodiment of the present invention described above with reference to FIGS. 1 to 3 apply to this embodiment as they are. A more detailed description is therefore omitted.

The functions of the components (410 to 430) may be performed by a single controller (not shown). The controller may be a single chip or multiple chips, a processor, or a core. Each of the components (410 to 430) may be a function, a library, a service, a process, a thread, or a module executed by the controller.
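One way to picture the arrangement in software is a controller object that holds the three units as interchangeable callables. This is only a structural sketch under assumed interfaces: the class name, the field names, the type signatures, and the stub units in the example are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


@dataclass
class OutlierDetectionApparatus:
    """Controller that wires the three units (410, 420, 430) together as plain callables."""

    graph_generator: Callable[[Sequence], Dict[Edge, float]]  # unit 410
    score_assigner: Callable[[Dict[Edge, float]], Dict[Hashable, float]]  # unit 420
    outlier_detector: Callable[[Dict[Hashable, float]], List[Hashable]]  # unit 430

    def run(self, data_set: Sequence) -> List[Hashable]:
        graph = self.graph_generator(data_set)
        scores = self.score_assigner(graph)
        return self.outlier_detector(scores)


# Stub units, used only to show the data flow between the three stages.
apparatus = OutlierDetectionApparatus(
    graph_generator=lambda data: {(i, (i + 1) % len(data)): 1.0 for i in range(len(data))},
    score_assigner=lambda graph: {j: w for (_, j), w in graph.items()},
    outlier_detector=lambda scores: sorted(scores, key=scores.get)[:1],
)
result = apparatus.run(["x", "y", "z"])
```

Keeping the units as separate callables mirrors the statement that each one may be a function, library, service, process, thread, or module; any of them can be swapped without touching the others.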

FIG. 5 shows an example of outlier detection according to an example of the present invention.

In this example, outlier detection was performed on the data set presented in G. Karypis, "Chameleon: Hierarchical Clustering using Dynamic Modeling," In IEEE Computer, Vol. 32, No. 8, pp. 68-75, 1999.

An 80-nearest neighbor graph was used in this example.

The test data set consisted of 8,000 objects.

Objects marked with circles are those detected as outliers. The outliers are the 250 objects with the lowest sums of hub score and authority score.

Objects marked with crosses are those not detected as outliers.

As shown, outliers on the periphery of the data and outliers between clusters were detected, while objects on the cluster boundaries were not.

The method according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

Although the present invention has been described above with reference to limited embodiments and drawings, it is not limited to those embodiments, and those of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description.

Therefore, the scope of the present invention should not be limited to the described embodiments, but should be determined by the claims below and by their equivalents.

400: outlier detection apparatus
410: graph generating unit
420: score assigning unit
430: outlier detecting unit

Claims

A method for detecting an outlier in a data set that includes one or more objects, the method comprising:
generating a graph based on the data set;
assigning a score to each of the one or more objects based on edge information of the graph; and
detecting the outlier among the one or more objects based on the score assigned to each of the one or more objects,
wherein each of one or more vertices of the graph corresponds to one of the one or more objects.

The method of claim 1,
wherein the in-degree of each of the one or more vertices of the graph is proportional to the number of objects surrounding the object corresponding to the vertex.

The method of claim 1,
wherein the graph is a k-nearest-neighbor directed graph, and the graph has an edge from a first vertex corresponding to a first object to a second vertex corresponding to a second object only when the second object is included among the k nearest neighbors of the first object.

The method of claim 1,
wherein generating the graph based on the data set comprises:
creating one or more vertices, each corresponding to one of the one or more objects;
retrieving k nearest neighbor objects for each of the one or more objects; and
generating edges from each of the one or more objects to the k nearest neighbor objects retrieved for that object.
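The graph generation steps recited above can be illustrated with a minimal Python sketch. It assumes objects are numeric tuples compared by Euclidean distance and uses a brute-force neighbor search; the function names `knn_directed_graph` and `in_degrees` and the sample points are hypothetical and do not come from the specification.

```python
import math

def knn_directed_graph(points, k):
    """Map each vertex index to the indices of its k nearest neighbors."""
    n = len(points)
    edges = {}
    for i in range(n):
        # Brute-force distances from object i to every other object.
        others = [(math.dist(points[i], points[j]), j) for j in range(n) if j != i]
        others.sort()
        # Directed edges run from object i to its k nearest neighbors.
        edges[i] = [j for _, j in others[:k]]
    return edges

def in_degrees(edges):
    """Count the incoming edges of every vertex."""
    degree = {v: 0 for v in edges}
    for targets in edges.values():
        for t in targets:
            degree[t] += 1
    return degree

# Four objects forming a tight square and one isolated object (illustrative data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
graph = knn_directed_graph(points, k=2)
degrees = in_degrees(graph)
```

In this sketch the isolated object still has k outgoing edges, but no other object lists it among its nearest neighbors, so its in-degree is zero.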

The method of claim 1,
wherein the score is the sum of an authority score and a hub score, the authority score indicates how many other objects are in the vicinity of the object having the authority score, and the hub score indicates how many other objects that are not outliers are in the vicinity of the object having the hub score.

The method of claim 1,
wherein assigning the score to each of the one or more objects based on the edge information of the graph comprises:
assigning an authority score to each of the one or more objects based on the edge information of the graph; and
assigning a hub score to each of the one or more objects based on the edge information of the graph.

The method of claim 6,
wherein the authority score a_p for an object p is generated according to Equation 1 below, and the hub score h_p for the object p is generated according to Equation 2 below.
[Equation 1]

[Equation 2]

Here, the set B(x) denotes the set of objects that point to the object x, the set F(x) denotes the set of objects that the object x points to, and W_ij denotes the similarity between the object i and the object j.
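Equations 1 and 2 are not reproduced in this text. One form consistent with the definitions of B(x), F(x), and W_ij, assuming the usual similarity-weighted HITS update, is sketched below. It is an assumption and may differ from the published equations.

```latex
a_p = \sum_{q \in B(p)} W_{qp}\, h_q
\qquad\qquad
h_p = \sum_{q \in F(p)} W_{pq}\, a_q
```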

The method of claim 7,
wherein the similarity is determined based on the distance between the objects.
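As an illustration of distance-based similarity feeding the authority and hub scores, the following Python sketch iterates a similarity-weighted HITS update. The mapping 1 / (1 + distance), the L2 normalization, the iteration count, and the sample graph are all assumptions made for this example and are not taken from the claims.

```python
import math

def similarity(p, q):
    # Assumed mapping from distance to similarity: closer objects are more similar.
    return 1.0 / (1.0 + math.dist(p, q))

def normalize(values):
    # L2 normalization keeps the scores bounded across iterations.
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else values

def weighted_hits(points, edges, iterations=50):
    """Iterate similarity-weighted authority and hub updates on a directed graph."""
    n = len(points)
    auth, hub = [1.0] * n, [1.0] * n
    for _ in range(iterations):
        # Authority of dst: weighted sum of hub scores of objects pointing to dst.
        new_auth = [0.0] * n
        for src, targets in edges.items():
            for dst in targets:
                new_auth[dst] += similarity(points[src], points[dst]) * hub[src]
        # Hub of src: weighted sum of authority scores of objects src points to.
        new_hub = [0.0] * n
        for src, targets in edges.items():
            for dst in targets:
                new_hub[src] += similarity(points[src], points[dst]) * new_auth[dst]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Illustrative 2-nearest-neighbor directed graph: a tight square plus one isolated object.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
edges = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2], 4: [3, 1]}
auth, hub = weighted_hits(points, edges)
scores = [a + h for a, h in zip(auth, hub)]
```

The isolated object receives no incoming edges, so its authority score stays at zero, and its outgoing edges carry low similarity weights, so its hub score is small as well.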

The method of claim 1,
wherein detecting the outlier among the one or more objects based on the score assigned to each of the one or more objects comprises
detecting, as outliers, the n objects having the smallest sums of the scores among the one or more objects.
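The selection step can be sketched in a few lines of Python. The helper name `detect_outliers` and the score values are hypothetical; each score is assumed to be the sum of an object's authority and hub scores.

```python
import heapq

def detect_outliers(scores, n):
    """Return the indices of the n objects with the smallest scores, smallest first."""
    return heapq.nsmallest(n, range(len(scores)), key=lambda i: scores[i])

# Hypothetical per-object sums of authority and hub scores.
scores = [1.00, 0.98, 0.97, 1.01, 0.07]
outliers = detect_outliers(scores, n=1)
```

Using `heapq.nsmallest` avoids sorting the full score list when n is much smaller than the number of objects.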

An apparatus for detecting an outlier in a data set that includes one or more objects, the apparatus comprising:
a graph generating unit that generates a graph based on the data set;
a score assigning unit that assigns a score to each of the one or more objects based on edge information of the graph; and
an outlier detecting unit that detects the outlier among the one or more objects based on the score assigned to each of the one or more objects,
wherein each of one or more vertices of the graph corresponds to one of the one or more objects.

The apparatus of claim 10,
wherein the graph generating unit generates the graph such that the in-degree of each of the one or more vertices of the graph is proportional to the number of objects surrounding the object corresponding to the vertex.

The apparatus of claim 10,
wherein the graph is a k-nearest-neighbor directed graph, and
the graph generating unit generates an edge from a first vertex corresponding to a first object to a second vertex corresponding to a second object only when the second object is included among the k nearest neighbors of the first object.

The apparatus of claim 10,
wherein the graph generating unit generates the graph based on the data set by creating one or more vertices, each corresponding to one of the one or more objects, retrieving k nearest neighbor objects for each of the one or more objects, and generating edges from each of the one or more objects to the k nearest neighbor objects retrieved for that object.

The apparatus of claim 10,
wherein the score is the sum of an authority score and a hub score, the authority score indicates how many other objects are in the vicinity of the object having the authority score, and the hub score indicates how many other objects that are not outliers are in the vicinity of the object having the hub score.

The apparatus of claim 10,
wherein the score assigning unit assigns the score to each of the one or more objects based on the edge information of the graph by assigning an authority score to each of the one or more objects based on the edge information of the graph and assigning a hub score to each of the one or more objects based on the edge information of the graph.

The apparatus of claim 15,
wherein the authority score is determined based on a set of objects pointing to the object corresponding to the node to which the authority score is assigned, a set of objects pointed to by the object corresponding to the node to which the authority score is assigned, and the similarity between the objects, and
the hub score is determined based on a set of objects pointing to the object corresponding to the node to which the hub score is assigned, a set of objects pointed to by the object corresponding to the node to which the hub score is assigned, and the similarity between the objects.

The apparatus of claim 16,
wherein the score assigning unit determines the similarity based on the distance between the objects.

The apparatus of claim 10,
wherein the outlier detecting unit detects, as outliers, the n objects having the smallest sums of the scores among the one or more objects.