KR101247307B1 - Data processing method for classifying data, media for writing the method, and data processing apparatus for applying the method

Info

Publication number: KR101247307B1
Application number: KR1020080046291A
Authority: KR
Inventors: 나진희; 박명수; 최진영
Original assignee: 재단법인서울대학교산학협력재단; 삼성테크윈 주식회사
Priority date: 2008-05-19
Filing date: 2008-05-19
Publication date: 2013-03-25
Also published as: KR20090120319A

Abstract

The present invention provides a data processing method that, when data are classified using a linear subspace method, removes as a preprocessing step the boundary data that degrade classification performance, so that features suitable for classification are selected during feature extraction.

Description

Data processing method for classifying data, recording medium on which the method is recorded, and data processing apparatus for executing the method

The present invention relates to a data processing method that performs a data preprocessing step when obtaining a transformation matrix for reducing the dimension of data, in order to improve the performance of linear feature extraction methods in the field of pattern recognition; to a recording medium on which the method is recorded; and to a data processing apparatus that executes the method.

In a typical computer-based classification system, the statistical characteristics of given data are analyzed to determine the class of new data judged to share those characteristics. To improve classification performance, the given data must first be processed so that highly reliable data are selected. This step is called data preprocessing: a preliminary operation that selects the part of the data of current interest, or shapes the data to separate out unnecessary information, so that essential information can be extracted easily from the input. Preprocessing includes operations such as data normalization, data selection, and noise removal. After preprocessing, the original data need to be converted into new, lower-dimensional data in order to reduce the amount of computation to a manageable level and to improve data quality. The new data obtained through this dimension reduction are called features of the original data, and the process of extracting them is called feature extraction. Once features have been extracted, a classifier is used to make the final determination of the class of the given data.

What must be considered in feature extraction for data classification is that the extracted features contain enough information to classify the data according to their class. Feature extraction is broadly divided into global and local methods. A method in which every dimension of the original data is considered when extracting features is called a global method; a method that considers only some dimensions of the original data and the geometric relationships among them is called a local method.

A widely used global method is the subspace method. It consists of a preprocessing step that processes the raw data so that information essential for classification can be extracted; a step of using the processed data to find a transformation matrix, which is a set of axes onto which the data are projected; a step of extracting features by projecting the data with the transformation matrix; and a step of classifying the data using the projected features instead of the raw data.

A representative subspace method is principal component analysis (PCA), a multivariate data analysis technique that identifies correlations among variables in order to reduce dimension. It is a statistical technique that condenses the multidimensional data to be analyzed into lower-dimensional data while losing as little of the contained information as possible, and is therefore useful for describing the original data effectively with low-dimensional features.

Principal component analysis finds the set of axes that maximizes the variance contained in each dimension of the data, projects the data onto those axes, and uses the projected values as features in place of the original image data.

However, features extracted by principal component analysis are often unsuitable for a classification system. They do not take into account how well the original data can be classified, so classification performance degrades as the dimension is reduced. In particular, boundary data lying at a decision boundary are often intermingled, and when a linear subspace method is used, such boundary data can cause a transformation matrix unsuitable for classification to be obtained.

Linear discriminant analysis (LDA) was proposed to address this drawback. It consists of finding the axes that maximize a scatter defined to increase classification performance, projecting the data onto those axes, and using the projected values as features in place of the original image.

Linear discriminant analysis builds on principal component analysis: instead of maximizing variance, it finds the axes that maximize a scatter defined to increase classification performance, and obtains features using those axes.

The scatter matrix that expresses this scatter is formed from the ratio of the variance among data that should be assigned to the same class (within-class variance) to the variance among the means of those data (between-class variance). Using it, one can find the axes that maximize the variance between data that should be classified differently.

Features obtained by linear discriminant analysis have a dimension less than or equal to that of features obtained by principal component analysis, yet show higher classification performance. Even with linear discriminant analysis, however, data at a decision boundary are mixed with data of other classes, so these boundary data can still cause a transformation matrix unsuitable for classification to be obtained.

The present invention has been proposed to solve the above and other problems. Its object is to provide a data processing method that uses class information in the preprocessing step to remove the data that degrade classification performance, so that, when a linear subspace method is used, features suitable for classification are selected during feature extraction.

The present invention provides a data processing method for data classification, comprising: deriving non-boundary data from a plurality of data by distinguishing them from boundary data; and deriving a transformation matrix using the non-boundary data.

In the data processing method, the non-boundary data may be distinguished by deriving, for at least one datum contained in a region of the plurality of data, a probability for each class; deriving an entropy from the probability; and using the entropy to divide the plurality of data into boundary data and non-boundary data.

In the data processing method, the probability may indicate what proportion of the i-th datum and the k data contained in a region around it are of the same class as the i-th datum.

In the data processing method, the entropy indicates the degree to which data of different classes are contained in the region.

In the data processing method, dividing the plurality of data into boundary data and non-boundary data may include determining whether the entropy of the i-th datum is greater than a reference value, and classifying the i-th datum as boundary data if it is and as non-boundary data if it is not.

The data processing method may further include extracting features of the data by projecting the plurality of data onto the transformation matrix, and classifying the plurality of data using the extracted features.

The present invention also provides a recording medium on which the above data processing methods are recorded.

The present invention further provides a data processing apparatus including a data derivation unit that derives non-boundary data from a plurality of data by distinguishing them from boundary data, and a transformation matrix derivation unit that derives a transformation matrix using the non-boundary data.

In the data processing apparatus, the data derivation unit may include a probability derivation unit that derives, for at least one datum contained in a region of the plurality of data, a probability for each class; an entropy derivation unit that derives an entropy from the probability; and a non-boundary data derivation unit that uses the entropy to derive non-boundary data as distinguished from boundary data.

In the data processing apparatus, the non-boundary data derivation unit may include a determination unit that determines whether the entropy of any one of the plurality of data is greater than a reference value, and a control unit that classifies the datum as boundary data if its entropy is greater than the reference value and as non-boundary data if it is not.

The data processing apparatus may further include a feature extraction unit that extracts features by projecting the plurality of data onto the transformation matrix, and a classification unit that classifies the data using the features extracted by the feature extraction unit.

As described above, the present invention, as a preprocessing method for data classification, can improve classification performance by excluding boundary data and using non-boundary data to extract features better suited to classification.

Hereinafter, the data processing method according to the present invention, a recording medium on which it is recorded, and a data processing apparatus that performs the method are described with reference to the accompanying drawings.

FIG. 1 is a flowchart illustrating the data processing method according to the present invention.

First, a plurality of data of at least two classes are collected. Before the data are collected, or after collection and before classification, a region around a datum is defined. The region is a local area surrounding the datum and may be set manually by a user or automatically. The region may be defined as in Equation 1 below.

Here, N(x_i, k) denotes the set of k data lying within a predetermined distance of the datum x_i, and determines the region containing x_i. The size of the region is determined by the value of k.

Once the region is set, a probability for each class is derived for the data contained in the region (S11). For example, the probability indicates what proportion of the i-th datum and the k data contained in the region around it, the i-th datum included, are of the same class as the i-th datum. The probability may be derived as in Equation 2 below.

Here, I_j(n) is 1 if the n-th nearest neighbor of x_i has the same class information as x_i, and 0 otherwise. The index j denotes the class. That is, P_j(x_i) is the probability that x_i belongs to the j-th class.

For example, referring to FIG. 2, the datum x_i has class ○, and the probability P_j(x_i) that data in the region containing x_i have the same class ○ is 3/5. Referring to FIG. 3, the corresponding probability P_j(x_{i+1}) for the (i+1)-th datum x_{i+1} is 1.
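The probability step (S11) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `class_probabilities`, the Euclidean distance, the toy coordinates, and the choice to count x_i itself among the k points of its region are all assumptions made for the example.

```python
from collections import Counter
from math import dist


def class_probabilities(points, labels, i, k):
    """Return {class: fraction} over the k points nearest to points[i].

    Assumption: the region counts points[i] itself (distance 0) as one
    of its k members, and distance is Euclidean.
    """
    order = sorted(range(len(points)), key=lambda n: dist(points[i], points[n]))
    region = order[:k]
    counts = Counter(labels[n] for n in region)
    return {cls: c / k for cls, c in counts.items()}


# Hypothetical toy data: three "o" points and two "x" points close to
# points[0], plus one distant "x" point that falls outside the region.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0)]
labels = ["o", "o", "o", "x", "x", "x"]

probs = class_probabilities(points, labels, i=0, k=5)
print(probs["o"])  # 0.6, i.e. 3/5 as in the FIG. 2 example
```

With k = 5 the region of the first point holds three "o" data and two "x" data, which reproduces the 3/5 value quoted for FIG. 2.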

Using the probability, the entropy in the region is derived (S12). The entropy indicates the degree to which data of different classes are contained in the region, and may be derived according to Equation 3 below.

Here, Nc denotes the number of classes.
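The equation images for Equations 2 and 3 did not survive in this text. One form consistent with the surrounding definitions (an indicator I_j(n) summed over the k members of the region, and an entropy summed over the Nc classes) would be the following; the normalization by k and the base of the logarithm are assumptions, not taken from the source.

```latex
% Assumed reconstruction, not the original equation images.
P_j(x_i) = \frac{1}{k} \sum_{n=1}^{k} I_j(n)
\qquad
H(x_i) = - \sum_{j=1}^{N_c} P_j(x_i) \, \log P_j(x_i)
```

Under this form a region containing a single class gives H(x_i) = 0, and any mixture of classes gives H(x_i) > 0, which matches the behavior the text describes for FIG. 3 and FIG. 2 respectively.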

When the entropy of the datum x_i shown in FIG. 2 is calculated using Equation 3, it is greater than 0, whereas the entropy of the datum x_{i+1} shown in FIG. 3 is 0. A larger entropy means that the datum lies in a boundary region adjoining a set of data of another class.
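The two cases above can be checked numerically. The sketch below assumes the usual Shannon form of entropy with a natural logarithm; the function name `entropy` and the variable names are illustrative only. The class probabilities 3/5 and 2/5 assume that the FIG. 2 region holds exactly two classes.

```python
from math import log


def entropy(probabilities):
    """Shannon entropy of a list of class probabilities (natural log).

    Classes with probability 0 contribute nothing and are skipped.
    """
    return sum(-p * log(p) for p in probabilities if p > 0)


# FIG. 2 style region: 3/5 of one class, 2/5 of another (assumed split).
h_mixed = entropy([3 / 5, 2 / 5])

# FIG. 3 style region: every member has the same class.
h_pure = entropy([1.0])

print(h_mixed > 0)  # True, roughly 0.673
print(h_pure == 0)  # True
```

A pure region always gives zero regardless of the logarithm base, so only the numerical scale of the threshold depends on that choice.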

Accordingly, it is determined whether the derived entropy is greater than a reference value (S13).

The reference value is determined empirically, and may be chosen appropriately in view of factors such as the presence of noise data in the region.

The determination may be expressed by Equation 4 below.

Here, Xb is the set of boundary data, and θ(k, Nc) is a reference threshold set in advance by the user; as noted above, it may be set to an appropriate value according to the size of the neighborhood, which follows from the choice of k, and the number of classes.

If the determination shows that the entropy of x_i is not greater than the reference value, x_i is classified as non-boundary data (S14); if the entropy of x_i is greater than the reference value, x_i is classified as boundary data (S15).
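Steps S11 to S15 can be put together as a short loop. The following is a sketch under assumptions rather than the claimed method itself: the names `neighbourhood_entropy` and `split_boundary`, the Euclidean distance, the natural logarithm, the inclusion of each datum in its own region, and the toy data with threshold 0 are all chosen for illustration.

```python
from collections import Counter
from math import dist, log


def neighbourhood_entropy(points, labels, i, k):
    """Entropy of the class mix among the k points nearest to points[i].

    Assumption: points[i] itself counts as one of the k points.
    """
    order = sorted(range(len(points)), key=lambda n: dist(points[i], points[n]))
    counts = Counter(labels[n] for n in order[:k])
    return sum(-(c / k) * log(c / k) for c in counts.values())


def split_boundary(points, labels, k, threshold):
    """Return (boundary, non_boundary) index lists.

    A datum is boundary data when its entropy exceeds the threshold,
    and non-boundary data otherwise.
    """
    boundary, non_boundary = [], []
    for i in range(len(points)):
        if neighbourhood_entropy(points, labels, i, k) > threshold:
            boundary.append(i)
        else:
            non_boundary.append(i)
    return boundary, non_boundary


# Hypothetical 1-D toy data: class "a" on the left, class "b" on the right.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
labels = ["a", "a", "a", "b", "b", "b"]

boundary, non_boundary = split_boundary(points, labels, k=3, threshold=0.0)
print(boundary)      # [2, 3]
print(non_boundary)  # [0, 1, 4, 5]
```

Only the two points where the classes meet have a mixed region, so only they are set aside; the remaining four would be passed on to the transformation matrix step.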

It is then determined whether x_i is the last datum (S16). If it is not, the next datum is selected (S17), and the steps of deriving the probability and the entropy, together with the subsequent steps, are performed for that datum.

If it is the last datum, the transformation matrix to be used for feature extraction is derived from the non-boundary data identified by the determination in S13 (S18).

The transformation matrix may be derived in various ways; for example, it may be generated by Equation 5 or Equation 6 below.

Here, S^(nb) is the covariance matrix of the non-boundary data, and W_PCA^(nb) is the transformation matrix representing the set of axes onto which the data are projected.

Here, S_b^(nb) is the between-class scatter matrix of the non-boundary data, and S_w^(nb) is the within-class scatter matrix of the non-boundary data. W_LDA^(nb) is the transformation matrix representing the set of axes onto which the data are projected.
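The equation images for Equations 5 and 6 are missing from this text. For orientation only, the textbook objectives of principal component analysis and linear discriminant analysis, written with the symbols defined above and restricted to the non-boundary data, are shown below. That the patent's equations take exactly this form is an assumption.

```latex
% Standard PCA and LDA criteria, assumed to correspond to Equations 5 and 6.
W_{PCA}^{(nb)} = \arg\max_{W} \left| W^{T} S^{(nb)} W \right|
\qquad
W_{LDA}^{(nb)} = \arg\max_{W}
  \frac{\left| W^{T} S_b^{(nb)} W \right|}{\left| W^{T} S_w^{(nb)} W \right|}
```

In both cases the only change from the conventional criteria is that the covariance and scatter matrices are computed from the non-boundary data alone.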

The collected data are then projected onto the derived transformation matrix to extract the final features (S19). Equation 7 below may be used for feature extraction.

F is the transformation matrix representing the set of axes onto which the data are projected.

The data can then be classified by applying a classifier to the resulting feature set. By generating the transformation matrix from only the non-boundary data, with the boundary data excluded, before feature extraction, and extracting the features of the data with that transformation matrix, the collected data can be classified effectively.
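Steps S18 and S19 can be illustrated for the simplest case: two-dimensional data and a single principal axis. The sketch uses the closed-form orientation of the largest-variance axis of a 2x2 covariance matrix instead of a general eigen-decomposition, which is a simplification chosen for the example. The function names, the toy coordinates, and projecting uncentered coordinates are all assumptions.

```python
from math import atan2, cos, sin


def principal_axis_2d(points):
    """Unit vector along the largest-variance direction of 2-D points.

    Uses the closed form angle = 0.5 * atan2(2*sxy, sxx - syy) for the
    leading eigenvector of the 2x2 covariance matrix.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    angle = 0.5 * atan2(2 * sxy, sxx - syy)
    return (cos(angle), sin(angle))


def project(points, axis):
    """One-dimensional feature of each point: its dot product with axis."""
    return [p[0] * axis[0] + p[1] * axis[1] for p in points]


# Hypothetical data: non-boundary points spread horizontally, plus two
# boundary points that would pull the axis toward vertical if included.
non_boundary = [(-3.0, 0.1), (-2.0, -0.1), (2.0, -0.1), (3.0, 0.1)]
all_points = non_boundary + [(0.0, 2.0), (0.0, -2.0)]

axis = principal_axis_2d(non_boundary)  # derived from non-boundary data only
features = project(all_points, axis)    # but applied to every collected datum
print(axis)
```

The axis is estimated from the non-boundary subset and then applied to all collected data, mirroring the order of S18 and S19.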

The present invention includes a recording medium on which the above method is recorded, for example a computer-readable recording medium such as an optical disc, a magnetic disk, or a hard disk.

The present invention also includes a data processing apparatus that distinguishes non-boundary data from collected data, generates a transformation matrix using the non-boundary data, and then extracts features and classifies the data using the transformation matrix. This is described in detail with reference to FIG. 4.

Referring to FIG. 4, the data processing apparatus according to the present invention includes a data preprocessing unit (100). The data preprocessing unit (100) includes a data derivation unit (110) that identifies and derives non-boundary data, and a transformation matrix derivation unit (120) that generates a transformation matrix using the non-boundary data.

The data processing apparatus may also include a feature extraction unit (200) that extracts features of the data using the transformation matrix, and a classification unit (300) that classifies the data from those features.

The data derivation unit (110) is described in more detail with reference also to FIG. 5.

As shown in FIG. 5, the data derivation unit (110) may include a probability derivation unit (111), an entropy derivation unit (112), and a non-boundary data derivation unit (113).

The probability derivation unit (111) derives a probability for each class for the data in a region around the selected datum. Specifically, the probability indicates what proportion of the i-th datum and the k data contained in the region around it are of the same class as the i-th datum. It may be obtained, for example, in the same way as in Equation 2 above. The region may be set in advance by a user or the like.

The entropy derivation unit (112) uses the probability to derive the entropy, that is, the degree to which data of different classes are contained in the region. The entropy may be calculated, for example, by Equation 3 above.

The non-boundary data derivation unit (113) uses the entropy to derive non-boundary data as distinguished from boundary data. Specifically, the non-boundary data derivation unit (113) may include a determination unit (113a) that determines whether the entropy of a datum is greater than the reference value, and a control unit (113b) that classifies the datum as boundary data if its entropy is greater than the reference value and as non-boundary data if it is not.

The transformation matrix is derived using the non-boundary data extracted by the data derivation unit (110), with the boundary data excluded. This is performed by the transformation matrix derivation unit (120). Thus, in the present invention, the data preprocessing unit (100) derives the transformation matrix using non-boundary data.

The feature extraction unit (200) then extracts features by projecting the collected data onto the derived transformation matrix, and the classification unit (300) classifies the data using those features.

An embodiment evaluating the effect of the data preprocessing in the data processing method according to the present invention is described below.

First, FIGS. 6 and 7 are graphs showing, as a control, the results of extracting features by principal component analysis from collected data and classifying them without the data preprocessing of the present invention.

FIGS. 8 and 9 are graphs showing the results of generating a transformation matrix from non-boundary data according to the data processing method of the present invention, extracting features with that transformation matrix, and then classifying.

For the control, data of two classes, a and b, were collected, and features were extracted from the collected data using conventional principal component analysis. As FIG. 6 shows, a vertical straight axis was extracted as the axis for feature extraction. FIG. 7 shows that when all the data are projected onto this axis, the a data and the b data are difficult to distinguish.

In contrast, data of two different classes, c and d, were collected, non-boundary data were selectively extracted from the collected data, and a transformation matrix was derived from them. As FIG. 8 shows, a horizontal straight axis was extracted as the axis for feature extraction. When all the data are projected onto this axis, the c and d data are comparatively easy to classify, as shown in FIG. 9.

The evaluation of the classification performance of the data processing method of the present invention is described next.

Classification experiments were performed using several classification databases published in the UCI Machine Learning Repository, shown in FIG. 10.

Two kinds of cross-validation were carried out to compare performance. In the first, features were extracted and the classifier was trained using all data in the data set except one datum, and it was recorded whether the excluded datum was classified correctly. This procedure was repeated for every datum, so that cross-validation covered the entire set, and the classification performance was determined.

In the second, the data set was divided into 10 subsets; features were extracted and the classifier was trained using the data in 9 of the subsets, and it was recorded whether the data in the remaining subset were classified correctly. A nearest neighbor classifier was used as the classifier.
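The first evaluation protocol (hold out one datum, train on the rest, test on the held-out datum) with a nearest neighbor classifier can be sketched as follows. This is an illustration only: the function names and toy data are hypothetical, distance is assumed Euclidean, and the feature extraction that the text repeats inside every fold is omitted, so the classifier here works on raw coordinates.

```python
from math import dist


def nearest_neighbour_label(train_points, train_labels, query):
    """Label of the training point closest to query (1-NN, Euclidean)."""
    best = min(range(len(train_points)), key=lambda n: dist(train_points[n], query))
    return train_labels[best]


def leave_one_out_accuracy(points, labels):
    """Fraction of points classified correctly when each is held out in turn."""
    correct = 0
    for i in range(len(points)):
        train_points = points[:i] + points[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if nearest_neighbour_label(train_points, train_labels, points[i]) == labels[i]:
            correct += 1
    return correct / len(points)


# Hypothetical, well-separated toy data.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
labels = ["a", "a", "a", "b", "b", "b"]

accuracy = leave_one_out_accuracy(points, labels)
print(accuracy)  # 1.0
```

The 10-subset protocol follows the same pattern, with each held-out group being one tenth of the data instead of a single datum.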

The classification performance obtained on the public classification data in this way was compared with that of classifying the data by principal component analysis and linear discriminant analysis without the data preprocessing method provided by the present invention. Here, the data preprocessing consists of identifying and extracting non-boundary data from the collected data and deriving the transformation matrix from those non-boundary data.

FIGS. 11 and 12 show the results of comparing principal component analysis with the data preprocessing of the present invention applied (RPS+PCA: e) against conventional principal component analysis (PCA: f). FIG. 11 shows the results for the first cross-validation method, and FIG. 12 for the second.

FIGS. 13 and 14 show the results of comparing linear discriminant analysis with the data preprocessing of the present invention applied (RPS+LDA: g) against conventional linear discriminant analysis (LDA: h). FIG. 13 shows the results for the first cross-validation method, and FIG. 14 for the second.

FIGS. 11 to 14 show that classification performance is better when the data processing method proposed in the present invention is applied.

Thus, the present invention aims to improve the classification performance of subspace methods such as principal component analysis and linear discriminant analysis by performing data preprocessing that uses the class information needed for classification.

Although the above description has been limited to preferred embodiments of the present invention, the present invention is not limited thereto, and various changes, modifications, and equivalents may be used. Accordingly, the present invention may be applied by appropriately modifying these embodiments, and such applications naturally fall within the scope of the present invention insofar as they are based on the technical idea set forth in the claims below.

FIGS. 2 and 3 are diagrams for explaining the step of deriving the probability and the degree of disorder for the i-th data and the (i+1)-th data according to the data processing method of the present invention.

FIGS. 4 and 5 are block diagrams for explaining a data processing apparatus according to the present invention.

FIG. 6 is a graph showing the axis onto which data are projected using principal component analysis.

FIG. 7 is a graph showing the distribution of data having different type information when the data are projected onto the axis shown in FIG. 6.

FIG. 8 is a graph showing the axis onto which data are projected using principal component analysis after the data preprocessing according to the present invention.

FIG. 9 is a graph showing the distribution of data having different type information when the data are projected onto the axis shown in FIG. 8.

FIG. 10 is a diagram showing the UCI Machine Learning database used for the performance evaluation of the present invention.

FIGS. 11 to 14 are graphs comparing the classification performance of the data processing method for data classification according to the present invention with that of conventional data classification methods, using the database shown in FIG. 10.

Claims

Deriving a probability according to type for at least one data item included in a region among a plurality of data, deriving a degree of disorder using the probability, and using the degree of disorder to divide the plurality of data into boundary data and non-boundary data, thereby deriving the non-boundary data; and

Deriving a transformation matrix representing a set of axes onto which the data are to be projected, using the covariance matrix of the non-boundary data.

(Deleted)

Claim 3 was abandoned when the registration fee was paid.

The data processing method for data classification of claim 1, wherein the probability is the proportion of data of the same type as the i-th data among the i-th data and the k data included in a region around the i-th data.

Claim 4 was abandoned when the registration fee was paid.

The method of claim 1, wherein dividing the plurality of data into boundary data and non-boundary data comprises:

Determining whether the degree of disorder of the i-th data among the plurality of data is greater than a reference value; and

Classifying the i-th data as boundary data if the degree of disorder is greater than the reference value, and as non-boundary data if the degree of disorder is not greater than the reference value.

Claim 6 was abandoned when the registration fee was paid.

The method of claim 1, wherein the deriving of the transformation matrix comprises:

Deriving a transformation matrix representing a set of axes onto which the data are to be projected, using the between-type scatter matrix of the non-boundary data and the within-type scatter matrix of the non-boundary data.

A recording medium on which the data processing method of any one of claims 1 to 3 is recorded.

A probability derivation unit for deriving a probability according to type for at least one data item included in a region among a plurality of data;

A disorder derivation unit for deriving a degree of disorder using the probability;

A non-boundary data derivation unit for distinguishing non-boundary data from boundary data and deriving the non-boundary data using the degree of disorder; and

A transformation matrix derivation unit for deriving a transformation matrix representing a set of axes onto which the data are to be projected, using the covariance matrix of the non-boundary data.

(Deleted)

The data processing apparatus of claim 8, wherein the non-boundary data derivation unit comprises:

A determination unit for determining whether the degree of disorder of any one of the plurality of data is greater than a reference value; and

A control unit for classifying the data as boundary data if the degree of disorder is greater than the reference value, and as non-boundary data if the degree of disorder is not greater than the reference value.

Claim 11 was abandoned when the registration fee was paid.

The data processing apparatus of claim 8, wherein the transformation matrix derivation unit

Derives a transformation matrix representing a set of axes onto which the data are to be projected, using the between-type scatter matrix of the non-boundary data and the within-type scatter matrix of the non-boundary data.