KR20080095708A

KR20080095708A - A method for modelling a huge amount of data

Info

Publication number: KR20080095708A
Application number: KR1020070040561A
Authority: KR
Inventors: 서대관; 최영식; 김기주
Original assignee: 디듬인네트워크(주)
Priority date: 2007-04-25
Filing date: 2007-04-25
Publication date: 2008-10-29

Abstract

A method for modeling large-scale data is provided that selects a representative data set from the large-scale data by using a genetic algorithm and obtains an optimal hyperplane by applying kernel ridge regression (KRR) to the selected representative data set, so that large-scale data can be modeled efficiently with a small amount of computation. The method comprises the steps of: selecting a representative data set (415) by applying a genetic algorithm to the large-scale data (step 410); extracting feature vectors (425) from the selected representative data set (step 420); and obtaining an optimal hyperplane (435), which is the data model, by applying KRR to the extracted feature vectors (step 430). In particular, the number of data items belonging to the representative data set can be set freely in the genetic algorithm step, and this setting adjusts the trade-off between the accuracy of the optimal hyperplane and the amount of computation required in step 430. In addition, the feature vectors extracted in step 420 can be extended into a nonlinear space by using a kernel function before KRR is applied in step 430.

Description

A METHOD FOR MODELLING A HUGE AMOUNT OF DATA

FIG. 1 shows the data sets used in a comparative experiment: FIG. 1A shows a data set forming one large circular cluster, and FIG. 1B shows another data set forming four small circular clusters.

FIG. 2 shows the results of modeling the data sets of FIG. 1 using the support vector machine (SVM) method: FIGS. 2A and 2B show the results of modeling the data set of FIG. 1A with different kernel function parameters, and FIGS. 2C and 2D show the results of modeling the data set of FIG. 1B with different kernel function parameters.

FIG. 3 shows the results of modeling the data sets of FIG. 1 using the kernel ridge regression (KRR) method: FIGS. 3A and 3B show the results of modeling the data set of FIG. 1A with different kernel function parameters, and FIGS. 3C and 3D show the results of modeling the data set of FIG. 1B with different kernel function parameters.

FIG. 4 is a block diagram showing the steps of a method for modeling large-scale data according to an embodiment of the present invention.

FIG. 5 shows an example of a basic genetic algorithm.

FIG. 6 shows the data used in the experiment.

FIG. 7 shows the experimental result of KRR when 50 data items are selected at random from the 2000 total data items.

FIG. 8 shows the experimental result of KRR when 50 data items are selected from the 2000 total data items using a genetic algorithm according to an embodiment of the present invention.

FIG. 9 shows the experimental result of KRR when 20 data items are selected from the 2000 total data items using a genetic algorithm according to an embodiment of the present invention.

<Explanation of reference numerals for main parts of the drawings>

400: method for modeling large-scale data (according to an embodiment of the present invention)

405: large-scale data

410: step of applying the genetic algorithm

415: representative data set

420: step of extracting feature vectors

425: feature vector

430: step of applying the ridge regression method

435: optimal hyperplane (data model)

The present invention relates to a method for modeling large-scale data, and more specifically to a method that selects a representative data set from large-scale data by using a genetic algorithm and then applies kernel ridge regression (KRR) to the selected representative data set to obtain an optimal hyperplane, thereby modeling the large-scale data effectively with a smaller amount of computation.

Modeling large-scale data is useful in many applications. In particular, topic modeling, which models documents that share the same topic, can be applied in specialized search fields such as topic-based search and personalized search. Methods for modeling data include the support vector machine (SVM) method, which uses support vectors, and the ridge regression (RR) method. The SVM method extracts the region in which the data lie and represents that region by support vectors; data outside the represented region are regarded as outliers. By using a kernel function to map the data into kernel space, which is a high-dimensional feature space, the SVM method can naturally represent complex regions (including nonlinear regions) in which the data lie. In contrast, the RR method is a form of regularized least-squares error method that finds the optimal hyperplane minimizing the distance to all the given data. The RR method is well suited to modeling the cohesion (condensation) of data and, like the SVM method, can obtain a nonlinear hyperplane by using a kernel function to map the given data into kernel space. The RR method that uses a kernel function is called the kernel ridge regression (KRR) method.

Compared with the ridge regression (RR) method, the SVM method represents only the boundary of the given data and therefore says little about the internal distribution of the data. In other words, the SVM method can express the shape of the distribution of the given data but is not suitable for expressing the nature of that distribution.

FIGS. 1 to 3 compare the data modeling results of the SVM method and the KRR method. FIG. 1 shows the data sets used in the comparative experiment: FIG. 1A shows a data set forming one large circular cluster, and FIG. 1B shows another data set forming four small circular clusters. FIG. 2 shows the results of modeling the data sets of FIG. 1 using the SVM method: FIGS. 2A and 2B show the results for the data set of FIG. 1A with different kernel function parameters, and FIGS. 2C and 2D show the results for the data set of FIG. 1B with different kernel function parameters. FIG. 3 shows the results of modeling the data sets of FIG. 1 using the KRR method: FIGS. 3A and 3B show the results for the data set of FIG. 1A with different kernel function parameters, and FIGS. 3C and 3D show the results for the data set of FIG. 1B with different kernel function parameters. A comparison of FIG. 2 with FIG. 3 clearly shows that the data model produced by the SVM method can represent only the boundary of the data, whereas the data model produced by the KRR method can also represent the internal distribution of the data.

As FIGS. 1 to 3 clearly show, the KRR method has the advantage over the SVM method that it can also represent the internal distribution of the data. However, the KRR method has the drawback that finding the optimal hyperplane involves a matrix inversion, which is computationally expensive. In particular, when the amount of data is large, the matrix inversion requires a great deal of time and memory.

The present invention has been proposed to solve the above problems. Its object is to provide a method that selects a representative data set from large-scale data by using a genetic algorithm and then applies kernel ridge regression (KRR) to the selected representative data set to obtain an optimal hyperplane, thereby modeling the large-scale data effectively with a smaller amount of computation.

To achieve the above object, a method for modeling large-scale data according to a feature of the present invention comprises:

(1) a first step of selecting, by using a genetic algorithm, a representative data set that optimally approximates the entire large-scale data;

(2) a second step of extracting feature vectors from the representative data set selected in the first step; and

(3) a third step of obtaining an optimal hyperplane that represents the entire large-scale data by applying the feature vectors extracted in the second step to a ridge regression method.

Preferably, in the third step, the feature vectors may be applied to a kernel function to extend them into a nonlinear space and then applied to the ridge regression method.

More preferably, in the first step, the number of data items belonging to the representative data set may be set arbitrarily.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 4 is a block diagram showing the steps of a method for modeling large-scale data according to an embodiment of the present invention. As shown in FIG. 4, the large-scale data modeling method (400) according to the present invention includes a step (410) of selecting a representative data set (415) by applying a genetic algorithm to the large-scale data (405), a step (420) of extracting feature vectors (425) from the selected representative data set (415), and a step (430) of obtaining an optimal hyperplane (435), that is, a data model, by applying the ridge regression method to the extracted feature vectors (425). In particular, the number of data items belonging to the representative data set (415) can be set arbitrarily in the step (410) of applying the genetic algorithm, which makes it possible to adjust the trade-off between the accuracy of the resulting optimal hyperplane (435) and the amount of computation required in the step (430) of applying the ridge regression method. In addition, before the step (430) of applying the ridge regression method, the feature vectors (425) extracted in the feature vector extraction step (420) may be extended into a nonlinear space by using a kernel function.

Next, the KRR method combined with a genetic algorithm, which the present invention uses for modeling large-scale data, is described in more detail, including its mathematical basis.

1. Ridge Regression Method

(1) Linear Ridge Regression (LRR)

Given training data together with their labels and errors, LRR seeks a linear function f(x) of the form shown in Equation 1 below.

[Equation 1]

The parameters (w, b) that minimize the sum of squared differences over the data must then be determined, which can be defined by the objective function in Equation 2 below.

[Equation 2]

Expressing Equation 2 with the method of Lagrange multipliers gives Equation 3 below.

[Equation 3]

Differentiating Equation 3 with respect to w, b, and the error term and rearranging gives Equation 4 below.

[Equation 4]

Expressing Equation 4 as a dual objective function gives Equation 5 below.

[Equation 5]

Differentiating Equation 5 with respect to a determines a, as shown in Equation 6 below.

[Equation 6]

Substituting Equation 6 into the decision function f(x) of Equation 1 and rearranging gives Equation 7 below.

[Equation 7]
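The steps described above follow the usual dual derivation of ridge regression. As a sketch only, and not a reproduction of Equations 1 to 7, that derivation can be written in conventional notation as follows. The symbols x_i, y_i (label), ξ_i (error), α_i (the multiplier the text calls a), the regularization constant C, and the Gram matrix K are assumptions chosen here, and the closed form for α is given for the bias-free case b = 0.

```latex
\begin{aligned}
&\text{decision function:} && f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b \\
&\text{objective:} && \min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2} + \tfrac{C}{2}\sum_{i=1}^{m}\xi_i^{2}
\quad\text{s.t.}\quad \xi_i = y_i - \mathbf{w}\cdot\mathbf{x}_i - b \\
&\text{Lagrangian:} && L = \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2} + \tfrac{C}{2}\sum_{i=1}^{m}\xi_i^{2}
 + \sum_{i=1}^{m}\alpha_i\,\bigl(y_i - \mathbf{w}\cdot\mathbf{x}_i - b - \xi_i\bigr) \\
&\text{stationarity:} && \mathbf{w} = \sum_{i=1}^{m}\alpha_i\mathbf{x}_i,\qquad
 \sum_{i=1}^{m}\alpha_i = 0,\qquad \xi_i = \frac{\alpha_i}{C} \\
&\text{dual objective:} && W(\boldsymbol{\alpha}) = \sum_{i=1}^{m}\alpha_i y_i
 - \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j\,\mathbf{x}_i\cdot\mathbf{x}_j
 - \frac{1}{2C}\sum_{i=1}^{m}\alpha_i^{2} \\
&\text{solution } (b = 0)\text{:} && \boldsymbol{\alpha} = \Bigl(K + \tfrac{1}{C}I\Bigr)^{-1}\mathbf{y},
 \qquad K_{ij} = \mathbf{x}_i\cdot\mathbf{x}_j \\
&\text{decision function:} && f(\mathbf{x}) = \sum_{i=1}^{m}\alpha_i\,(\mathbf{x}_i\cdot\mathbf{x})
\end{aligned}
```

In this form the only expensive operation is the inversion of the m-by-m matrix K + I/C, where m is the number of training data.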

(2) Kernel Ridge Regression (KRR)

KRR adds to LRR the step of mapping the data x into kernel space by using a kernel function. Using the kernel function, the decision function f(x) of Equation 1 can be rewritten as Equation 8 below.

[Equation 8]

Here, the training data are transformed by the kernel function, and the objective function of KRR can be defined as in Equation 9 below.

[Equation 9]

Applying the method of Lagrange multipliers to Equation 9 gives Equation 10 below.

[Equation 10]

Differentiating Equation 10 with respect to w, b, and the error term and rearranging gives Equation 11 below.

[Equation 11]

Expressing Equation 11 as a dual objective function gives Equation 12 below.

[Equation 12]

Differentiating Equation 12 with respect to a determines a, as shown in Equation 13 below.

[Equation 13]

Substituting Equation 13 into the decision function f(x) of Equation 8 and rearranging gives the KRR decision function of Equation 14 below.

[Equation 14]
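A minimal NumPy sketch of the KRR fit and decision function described above may make the role of the matrix inversion concrete. It assumes a Gaussian (RBF) kernel, which the text does not specify, takes the bias b as 0, and uses hypothetical names (`rbf_kernel`, `krr_fit`, `krr_predict`, `C`, `gamma`) that do not come from the source.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B (assumed kernel)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_fit(X, y, C=10.0, gamma=1.0):
    """Dual coefficients a solving (K + I/C) a = y; this is the m-by-m inversion step."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + np.eye(len(X)) / C, y)

def krr_predict(X_train, a, X_new, gamma=1.0):
    """Decision function f(x) = sum_i a_i k(x_i, x), evaluated for each row of X_new."""
    return rbf_kernel(X_new, X_train, gamma) @ a

# Usage: model three points with all labels equal to 1 (one-class style modeling)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
a = krr_fit(X, np.ones(3), C=10.0, gamma=0.5)
scores = krr_predict(X, a, np.array([[0.2, 0.2], [5.0, 5.0]]), gamma=0.5)
```

`np.linalg.solve` is used instead of forming the inverse explicitly; the cost still grows with the cube of the number of training points, which is the drawback the following sections address.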

2. Feature Vector

(1) Feature Extraction

A typical web document consists of a set of terms, and these terms can be used as a feature vector. A web document is broadly divided into HTML tags and plain text. After HTML parsing removes the tag information, the words contained in the plain text are selected as the feature vector. TF (term frequency) and TF-IDF are commonly used as the attribute values of the feature vector. TF is defined as the frequency of word w_j in document d_i, and IDF (inverse document frequency) is determined by n_j, the number of documents d_i that contain word w_j. This is expressed by Equation 15 below.

[Equation 15]

IDF gives a word a higher weight the fewer documents contain it. The weight of word w_j in document d_i is determined by TF-IDF, the product of TF and IDF. Because web documents differ in size, the feature vector of each document is normalized to unit length before use.
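The weighting and normalization just described can be sketched in plain Python. The IDF form log(N / n_j) used here is the common textbook definition and is an assumption, since the exact formula of Equation 15 is not shown; the function name `tfidf_vectors` is likewise hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, unit-length TF-IDF vectors)."""
    n_docs = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # n_j: number of documents that contain term w_j
    df = Counter(t for d in docs for t in set(d))
    # Assumed IDF form log(N / n_j): fewer containing documents give a higher weight
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)  # frequency of each term within this document
        v = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in v))
        # Normalize to unit length so documents of different sizes are comparable
        vectors.append([x / norm for x in v] if norm > 0 else v)
    return vocab, vectors

vocab, vectors = tfidf_vectors([["kernel", "ridge", "ridge"], ["kernel", "genetic"]])
```

In this small example "kernel" occurs in every document, so its IDF is zero and it contributes nothing to either vector.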

(2) Feature Reduction

A typical web document contains a very large number of words, which map it into a high-dimensional space. Misspelled words and very frequent words (pronouns, particles, and the like) can degrade search performance. Words unsuitable for classification are therefore removed as stopwords. In addition, stemming represents words with the same meaning by the same feature, which reduces the number of features.
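As a rough illustration of these two reductions, the sketch below filters a token list against a stopword set and then applies a deliberately naive suffix-stripping stemmer. Both the stopword list and the stemming rule are illustrative stand-ins rather than anything specified in the source; a real system would use a proper stopword list and stemmer for the language at hand.

```python
# Illustrative stopword list only; not taken from the source
STOPWORDS = {"the", "a", "an", "it", "they", "of", "to", "in", "is", "and"}

def naive_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def reduce_features(tokens):
    """Lowercase the tokens, drop stopwords, then map each token to its stem."""
    lowered = (tok.lower() for tok in tokens)
    return [naive_stem(t) for t in lowered if t not in STOPWORDS]

reduced = reduce_features(["The", "models", "modeled", "modeling", "data"])
```

Here the three inflected forms collapse to the single feature "model", so the feature space shrinks from four distinct terms to two.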

(3) Feature Selection

Using every word extracted from a web document as a feature causes problems of computation time and memory. These problems can be solved by excluding words that do not affect performance and selecting the important words. Methods for selecting features include document frequency (DF), thresholding, information gain (IG), mutual information (MI), and χ² (chi-squared) statistics.
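Of the methods listed, selection by document frequency with a threshold is the simplest to show. The sketch below keeps only terms that occur in at least `min_df` documents and optionally caps the number of features; the function name and both parameters are hypothetical.

```python
from collections import Counter

def select_by_document_frequency(docs, min_df=2, max_features=None):
    """Keep terms whose document frequency is at least min_df, most frequent first."""
    df = Counter(t for d in docs for t in set(d))  # count each term once per document
    kept = [t for t, n in df.items() if n >= min_df]
    kept.sort(key=lambda t: (-df[t], t))  # ties broken alphabetically
    return kept[:max_features] if max_features is not None else kept

docs = [["kernel", "ridge"], ["kernel", "genetic"], ["kernel", "ridge", "data"]]
selected = select_by_document_frequency(docs, min_df=2)
```

Information gain, mutual information, and chi-squared scoring follow the same pattern, with the document-frequency count replaced by a score computed against class labels.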

3. Genetic Algorithm

The optimal hyperplane of KRR is obtained by computing the Lagrange multipliers a of Equation 14. If there are m training data, obtaining a from Equation 14 requires inverting a square matrix of size m, and the computation time and memory that this inversion needs grow very large as the training data and the feature size increase. To compensate for this drawback of KRR, the present invention uses a genetic algorithm, which is one kind of evolutionary algorithm.

FIG. 5 shows an example of a basic genetic algorithm. In FIG. 5, the population P(t) consists of n individuals, and each individual consists of k training data, that is, genes. The initial P(t) is initialized by random selection from the entire training data. After P(t) is evaluated with a fitness function, a new P(t) is chosen through selection, crossover, and mutation. When the new P(t) is chosen, its last individual is set to the individual with the highest evaluation value so far (elitism). Among the finally selected P(t), the individual with the best evaluation value becomes the training data that approximate the entire training data.

The evaluation function can be defined as in Equation 16 below.

[Equation 16]

Here, N is the total number of training data, and P_j is the result of evaluating the entire data with the optimal hyperplane obtained by performing KRR on each individual; a larger P_j indicates training data that approximate the entire training data more closely.

In the selection step, individuals are chosen probabilistically according to their evaluation values, and crossover and mutation occur probabilistically according to preset values. The genetic algorithm terminates when the best evaluation value no longer changes, and the individual with the largest evaluation value is selected as the training data that best represent the given entire training data.
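The loop described above can be sketched as follows. Several parts are assumptions rather than the source's definitions: the fitness is taken to be the mean KRR output over all N points for a model fit on the individual with all labels equal to 1 (the exact form of Equation 16 is not shown), the kernel is Gaussian, selection is roulette-wheel, crossover resamples k genes from the union of two parents, mutation swaps one gene, and "no longer changes" is implemented as a `patience` counter. All function and parameter names are hypothetical, and the data set is assumed to have more than k points.

```python
import numpy as np

def rbf(A, B, gamma):
    """Gaussian kernel matrix between rows of A and rows of B (assumed kernel)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fitness(X, idx, C=10.0, gamma=1.0):
    """Assumed fitness: mean KRR output over all data for a model fit on X[idx], labels 1."""
    S = X[idx]
    a = np.linalg.solve(rbf(S, S, gamma) + np.eye(len(idx)) / C, np.ones(len(idx)))
    return float((rbf(X, S, gamma) @ a).mean())

def select_representatives(X, k=50, pop_size=10, p_mut=0.3,
                           patience=20, max_gen=500, seed=0):
    """Return (indices of the best individual, its fitness)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    # Initial population: each individual is k distinct randomly chosen data indices
    pop = [rng.choice(N, size=k, replace=False) for _ in range(pop_size)]
    best_idx, best_fit, stale = None, -np.inf, 0
    for _ in range(max_gen):
        fits = np.array([fitness(X, ind) for ind in pop])
        if fits.max() > best_fit + 1e-12:
            best_fit, best_idx, stale = float(fits.max()), pop[int(fits.argmax())].copy(), 0
        else:
            stale += 1
            if stale >= patience:  # best value has stopped changing
                break
        # Roulette-wheel selection: probability proportional to shifted fitness
        w = fits - fits.min() + 1e-12
        probs = w / w.sum()
        children = []
        for _ in range(pop_size - 1):
            pa, pb = (pop[i] for i in rng.choice(pop_size, size=2, p=probs))
            # Crossover: draw k distinct genes from the union of both parents
            child = rng.choice(np.union1d(pa, pb), size=k, replace=False)
            if rng.random() < p_mut:
                # Mutation: replace one gene with an index not already in the child
                unused = np.setdiff1d(np.arange(N), child)
                child[rng.integers(k)] = rng.choice(unused)
            children.append(child)
        pop = children + [best_idx.copy()]  # elitism: best-so-far kept in the last slot
    return best_idx, best_fit
```

The defaults `k=50`, `pop_size=10`, and `p_mut=0.3` mirror the settings reported in the experiments below. Each fitness evaluation inverts only a k-by-k matrix, so the full-size inversion over all N data is never performed.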

4. Experimental Results

The experiment compared the result of applying KRR to data selected with the genetic algorithm according to the present invention against the result of applying KRR to data selected at random. FIG. 6 shows the data used in the experiment. As shown in FIG. 6, the data form a chessboard pattern and are random data following a normal distribution; there are 2000 data items in total, of which 400 are noise.

First, an experiment was run with the population size set to 10 and the size of each individual in the population set to 50 data items (2.5% of the total data). FIG. 7 shows the experimental result of KRR when 50 data items are selected at random from the 2000 total data items. In the figure, the white areas are the hyperplane that best represents the given data, and darker areas correspond to data less related to the training data. In FIG. 7, the hyperplane does not appear distinctly. FIG. 8 shows the experimental result of KRR when 50 data items are selected from the 2000 total data items using a genetic algorithm according to an embodiment of the present invention. The genetic algorithm was run until the final result converged, and the mutation probability was set to 0.3. Comparing FIG. 8 with FIG. 7, the optimal hyperplane appears far more distinctly. The experimental results thus confirm that the method according to the present invention can model large-scale data efficiently even with a smaller amount of computation.

Next, an experiment was run with the size of each individual in the population set to 20 data items (1% of the total data). FIG. 9 shows the experimental result of KRR when 20 data items are selected from the 2000 total data items using a genetic algorithm according to an embodiment of the present invention. The settings of the genetic algorithm were the same as in the previous experiment. Comparing FIG. 9 with FIG. 8, the white areas appear slightly fainter. This means that the optimal hyperplane, and hence the data model, is slightly less distinct, which is the natural result of using less data in KRR; in return, the amount of computation for the matrix inversion is much smaller. The comparison of FIG. 8 and FIG. 9 confirms that the user can adjust the trade-off between the accuracy of the data model and the required amount of computation (memory size) by setting the size of the individuals in the population.
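The size of this trade-off can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a dense kernel matrix of m by m entries and the usual cubic operation count for dense matrix inversion; these are standard scaling estimates rather than figures from the source, and the helper name `krr_cost` is hypothetical.

```python
def krr_cost(m):
    """Rough scaling for m training points: kernel entries to store, operations to invert."""
    return {"kernel_entries": m * m, "inverse_ops": m ** 3}

# Full data set versus the two representative-set sizes used in the experiments
for m in (2000, 50, 20):
    print(m, krr_cost(m))
```

Under these assumptions, going from 2000 points to 50 shrinks the kernel matrix from 4,000,000 entries to 2,500 and the inversion from about 8,000,000,000 operations to 125,000, and going to 20 points shrinks them further to 400 entries and 8,000 operations. The genetic algorithm repeats the small solve many times and evaluates every candidate on all data, a cost that this estimate leaves out.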

The present invention described above may be modified or applied in various ways by those of ordinary skill in the art to which it pertains, and the scope of its technical idea is to be determined by the claims below.

According to the method for modeling large-scale data of the present invention, a representative data set is selected from the large-scale data by using a genetic algorithm, and kernel ridge regression (KRR) is then applied to the selected representative data set to obtain an optimal hyperplane, so that large-scale data can be modeled effectively with a smaller amount of computation.

Claims

A method for modeling large-scale data, comprising:

a first step of selecting, by using a genetic algorithm, a representative data set that optimally approximates the entire large-scale data;

a second step of extracting feature vectors from the representative data set selected in the first step; and

a third step of obtaining an optimal hyperplane that represents the entire large-scale data by applying the feature vectors extracted in the second step to a ridge regression method.

The method of claim 1,

wherein, in the third step, the feature vectors are applied to a kernel function to extend them into a nonlinear space and are then applied to the ridge regression method.

The method according to claim 1 or 2,

wherein, in the first step, the number of data items belonging to the representative data set can be set arbitrarily.