KR20120072291A

KR20120072291A - Method for imputating of missing values using incremental expectation maximization principal component analysis

Info

Publication number: KR20120072291A
Application number: KR1020110007484A
Authority: KR
Inventors: 양형정; 김선희; 캄스위
Original assignee: 전남대학교산학협력단
Priority date: 2010-12-23
Filing date: 2011-01-25
Publication date: 2012-07-03
Also published as: KR101213677B1

Abstract

PURPOSE: A method for imputing missing values using incremental EM (Expectation Maximization) PCA (Principal Component Analysis) is provided to maximize data accuracy and reduce computational complexity by detecting and estimating missing values with incremental EM PCA. CONSTITUTION: An input vector is received. If the input vector includes a missing value, the missing value is initialized to the mean of the data matrix. PCA is performed on the data matrix. The data matrix is reconstructed. Each missing value in the data matrix is replaced with the corresponding value from the reconstructed data matrix. These steps are repeated until the error rate of the data matrix falls below a threshold.

Description

METHODS FOR IMPUTATING OF MISSING VALUES USING INCREMENTAL EXPECTATION MAXIMIZATION PRINCIPAL COMPONENT ANALYSIS

The present invention relates to a method for imputing missing values using incremental EM (Expectation Maximization) PCA (Principal Component Analysis). More specifically, in the analysis of physiological multivariate data such as electroencephalography (EEG) data, inaccurate data arise when many missing values are discarded or are measured or estimated with simple methods. To solve this problem, the invention detects and estimates missing values in real-time data using incremental EM PCA, thereby maximizing data accuracy.

A missing value is a data element that is absent from the data matrix. It can arise for various reasons, for example technical problems or biological characteristics.

In general, EEG data are acquired intermittently. Therefore, in an EEG signal environment where noise, artifacts, or other fluctuations are present, loss of EEG data occurs frequently, and EEG signal analysis can be adversely affected by periods of missing data.

Such missing values are usually either excluded from the analysis or replaced with estimated values. If the proportion of missing values exceeds a certain level, however, data reliability drops and the analysis can produce substantially erroneous results.

Two approaches can be given as examples of how to handle the missing values that occur frequently in most physiological data.

The first is to remove the incomplete data, as in FIG. 1. The second is to measure or estimate the missing values from the remaining signal.

In FIG. 1, the EEG data contain missing values. There are 31 electrodes (columns) over about 130 time points (rows), and dark cells mark missing values in the recording. A dark cell in row t and column n indicates a missing value at a specific time (the t-th time point, or row) for a specific electrode (the n-th column).

The first method is obviously simple. However, when the missing values do not occur at random, removing them produces a biased model and causes loss of information.
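The information loss caused by removing incomplete data can be illustrated with a small sketch. The matrix below is synthetic and only borrows the shape described for FIG. 1 (about 130 time points by 31 electrodes); the 5% missing rate and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical recording: 130 time points (rows) x 31 electrodes (columns).
eeg = rng.normal(size=(130, 31))
eeg[rng.random(eeg.shape) < 0.05] = np.nan  # mark about 5% of the cells as missing

# Listwise deletion keeps only the time points where every electrode was recorded.
complete_rows = ~np.isnan(eeg).any(axis=1)
eeg_complete = eeg[complete_rows]

cells_missing = np.isnan(eeg).mean()
rows_dropped = 1.0 - complete_rows.mean()
print(f"missing cells: {cells_missing:.1%}, time points dropped: {rows_dropped:.1%}")
```

Because a single missing cell discards the whole time point, the fraction of dropped rows can never be smaller than the fraction of missing cells, and with many electrodes it is usually far larger.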

The second approach mainly uses single imputation or multiple imputation. Because these imputation methods lose relatively little information from the original data compared with removing incomplete data, they have been commonly used and extensively developed (Graham et al., 2007; Graham, 2009; Horton and Lipsitz, 2001).

However, single imputation always uses the mean of the observed values as the replacement, and therefore underestimates the standard deviation. In other words, it does not account for the increased uncertainty caused by the missing values.
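The shrinkage of the standard deviation under mean imputation can be checked numerically. The following is a minimal sketch on synthetic data; the function name `mean_impute` and the 30% missing rate are illustrative assumptions.

```python
import numpy as np

def mean_impute(values):
    """Replace NaN entries with the mean of the observed entries."""
    values = np.asarray(values, dtype=float)
    filled = values.copy()
    filled[np.isnan(values)] = np.nanmean(values)
    return filled

rng = np.random.default_rng(0)
signal = rng.normal(loc=0.0, scale=1.0, size=200)
signal[rng.choice(200, size=60, replace=False)] = np.nan  # 30% missing

observed_std = np.nanstd(signal)            # spread of the 140 observed values
imputed_std = np.std(mean_impute(signal))   # spread after filling with the mean
print(observed_std, imputed_std)
```

Every imputed entry sits exactly on the mean and contributes nothing to the variance, so the variance of the filled series is the observed variance scaled by the observed fraction (here 140/200).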

Multiple imputation methods, for example MLE (Maximum Likelihood Estimation), were proposed to solve the problems of single imputation. Multiple imputation considers several candidate values when estimating a missing value.

In multiple imputation, each missing value is replaced with candidate values by random draws. The method consists of running the statistical analysis on each completed data set and then combining the estimated coefficients (Hardy and Robillard, 2005; Little and Rubin, 2002).

Meanwhile, multivariate analysis techniques such as principal component analysis (PCA) are applied to a variety of analysis processes that exhibit nonlinear characteristics (Nuding and Zetzsche, 2007). PCA detects and diagnoses abnormal values and addresses the missing-value problem at the data collection stage (Xu et al., 2009; Gu et al., 2009).

For example, Adams applied EM PCA, which combines principal component analysis with the EM algorithm, to impute missing values in a pharmaceutical manufacturing process (Adams et al., 2002). Zhao applied robust EM PCA and modified EM PCA to data from a wastewater treatment process and compared their performance with respect to missing values (Zhao et al., 2006).

Imputation methods like those above, which replace missing values with suitable values, are known to be effective for handling missing values and deliver considerable performance. In particular, robust EM PCA is known as the best existing method for estimating missing values.

However, these imputation methods require a large amount of information and a long execution time. Moreover, in correlation analysis, if the proportion of missing values is high, data reliability falls and inaccurate results are inevitable. In addition, robust EM PCA can guarantee its performance in estimating missing values only when the data are non-negative.

An object of the present invention is to overcome the above problems, namely to solve the problem that inaccurate data arise in physiological multivariate data analysis when many missing values are removed or are measured or estimated using a simple method.

Another object of the present invention is to maximize data accuracy while reducing computational complexity by detecting and estimating missing values in real-time data using incremental EM PCA.

The present invention relates to a method for imputing missing values using incremental EM PCA, comprising: (a) receiving an input vector x; (b) if the received input vector x includes a missing value, initializing the missing value to the mean of the data matrix X; (c) performing PCA on the data matrix X; (d) reconstructing the data matrix X; and (e) replacing each missing value in the data matrix X as it was before reconstruction with the corresponding value in the reconstructed data matrix X.

Preferably, the method further comprises, after step (e): (f) repeating steps (c) to (e) until the error rate of the data matrix X with imputed missing values is less than or equal to a preset threshold.

Also preferably, the method further comprises, after step (f): (g) calculating hidden variables; (h) calculating the total hidden-variable energy; and (i) calculating the total input-data energy.

Also preferably, steps (g) to (i) are each performed according to the following equations.

(In the equations, y is a hidden variable, T is a score vector, W_new is the updated weight matrix, x_replace is the determined missing value, E_hv is the existing or updated total hidden-variable energy, λ is the exponential forgetting factor, and E_x is the existing or updated total input-data energy.)

Also preferably, if E_hv < f E_x (where f is the number of principal components), the method further comprises, after step (i): (j) adding a hidden variable.

And preferably, the exponential forgetting factor λ is selected from the range 0.96 to 0.98.

According to the present invention, the problem that inaccurate data arise in physiological multivariate data analysis when many missing values are removed or are measured or estimated using a simple method is solved.

According to the present invention, detecting and estimating missing values in real-time data using incremental EM PCA also maximizes data accuracy while reducing computational complexity.

FIG. 1 is a diagram illustrating a conventional missing-value imputation method.
FIG. 2 is a diagram illustrating the accuracy evaluation results of the method according to the present invention.
FIG. 3 is a diagram illustrating the evaluation results for hidden-variable discovery by the method according to the present invention.
FIG. 4 is a diagram illustrating the reconstruction evaluation results of the method according to the present invention.
FIG. 5 is a diagram illustrating the computational complexity evaluation results of the method according to the present invention.

Before describing specific details for carrying out the invention, it should be noted that the terms and words used in this specification and the claims are to be interpreted with meanings and concepts consistent with the technical idea of the invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way.

It should also be noted that detailed descriptions of well-known functions and configurations related to the invention are omitted where they could unnecessarily obscure the gist of the invention.

The present invention discloses an incremental EM PCA method as a missing-value imputation method, in order to solve the problem that inaccurate data arise in the analysis of physiological multivariate data such as EEG data when many missing values are removed or are estimated using a simple method.

With the incremental EM PCA developed by the inventors, missing values can be recovered, hidden variables discovered, and trends predicted. A missing value can be inferred from other observations by exploiting the correlations among them.

Exploiting these correlations reduces computational complexity, and discovering hidden variables makes it possible to remove redundant data features. As a result, the invention significantly improves data accuracy compared with conventional missing-value imputation methods.

Before describing specific embodiments of the invention, PCA is first described in detail.

PCA is a technique for reducing the number of dimensions in data analysis. Its main features can be summarized as dimension reduction, noise removal, and orthogonal transformation.

Specifically, PCA selects the feature components that are most representative of the data, extracts the intrinsic components that capture the data's characteristics, and reduces the dimension. That is, PCA determines the principal-component axes that adequately represent the data, and the dimension is reduced by projecting the data matrix X onto those axes.

The principal-component axes are computed from the eigenvectors, eigenvalues, and principal components using singular value decomposition (SVD).

Therefore, PCA can be expressed as the product of two matrices, as in [Equation 1].

[Equation 1]

X = T V^T

In [Equation 1], X is a data matrix with n features and m observations, T is the m×f score matrix, and V is the n×f loading matrix. f is the number of principal components, and for time-series data m indexes the time points.
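The factorization in [Equation 1] can be sketched with NumPy's SVD. The helper name `pca_svd` and the synthetic rank-3 data are assumptions for illustration; the shapes follow the definitions above (X is m×n, T is m×f, V is n×f).

```python
import numpy as np

def pca_svd(X, f):
    """Return scores T (m x f) and loadings V (n x f) of a centered m x n matrix."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :f] * s[:f]   # scores: left singular vectors scaled by singular values
    V = Vt[:f].T           # loadings: first f right singular vectors as columns
    return T, V

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 8))  # m = 50, n = 8, rank 3
X = X - X.mean(axis=0)                                   # center each feature

T, V = pca_svd(X, f=3)
X_hat = T @ V.T   # X = T V^T, exact here because X has rank 3
```

With fewer components than the rank of X, the same product gives the best rank-f approximation in the least-squares sense rather than an exact reconstruction.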

However, if the data matrix X contains missing values, conventional PCA inevitably loses consistency.

Hereinafter, the incremental EM PCA according to the present invention is described in detail.

EM PCA was proposed to solve the problems of conventional PCA (AI-Deek et al., 2004; Roweis, 1998). The algorithm makes it possible to extract a few eigenvectors and eigenvalues from high-dimensional data.

EM PCA improves computational efficiency in both time and space, does not require computing the sample covariance of the data, and accommodates missing values naturally.

EM PCA finds the missing values by iterating the two steps of [Equation 2] and [Equation 3]. The E-step determines the missing values based on the expected values of the data.

The initial values are the means of the matrix. The M-step maximizes over the values obtained in the E-step, and through the EM iterations the missing values are re-estimated as the values predicted by PCA.

[Equation 2]

[Equation 3]
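The bodies of [Equation 2] and [Equation 3] do not survive in this text. As a point of reference only, and assuming the method follows the EM algorithm for PCA of Roweis (1998), which the description cites, the two steps would take the following form. Here X holds one observation per column, V is the loading matrix, and Y is the projected data. This is a sketch under that assumption, not a reproduction of the patent's own equations.

```latex
\begin{aligned}
\text{E-step:}\quad & Y = (V^{T}V)^{-1}V^{T}X \\
\text{M-step:}\quad & V_{\text{new}} = XY^{T}(YY^{T})^{-1}
\end{aligned}
```

When the columns of V are orthonormal, the E-step reduces to Y = V^T X, which is the projection written in [Equation 4].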

For missing-value imputation, PCA is first fitted to the data completed by initially setting the missing values to the matrix means. The fitted data are then reconstructed, and the reconstructed data are used to replace the missing values, repeating until convergence.

EM PCA can therefore be summarized as follows.

First, initialize the missing values with the mean of the data matrix X.

Second, perform PCA on the completed data set. The loading vector V and the score vector T are computed iteratively using [Equation 2] and [Equation 3].

[Equation 4]

Y = V^T X // project X onto the loading vector V_new

Third, reconstruct the data matrix X using a predefined number of significant principal components.

[Equation 5]

// reconstruct

Fourth, replace the missing values in the original matrix X with their reconstructed values. The observed values of X are left unchanged.

[Equation 6]

// replace missing values

[Equation 7]

// evaluate error rate

Finally, repeat the second through fourth steps until convergence.

The missing values in the data matrix X are thus replaced with the values predicted by the EM PCA algorithm.
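The four steps above can be sketched as a batch procedure. This is a minimal illustration, not the patented implementation: it uses a truncated SVD for the PCA step instead of the EM iterations of [Equation 2] and [Equation 3], it stops when the mean squared change of the imputed entries falls below a tolerance (an assumed stand-in for [Equation 7]), and the names `em_pca_impute`, `tol`, and `max_iter` are hypothetical.

```python
import numpy as np

def em_pca_impute(X, n_components, tol=1e-10, max_iter=1000):
    """Iteratively impute NaN entries of an (observations x features) matrix."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: initialize missing entries with the column means of the observed data.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    if not missing.any():
        return filled
    for _ in range(max_iter):
        # Step 2: PCA of the completed matrix (here via SVD of the centered data).
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        T = U[:, :n_components] * s[:n_components]
        V = Vt[:n_components].T
        # Step 3: reconstruct from the leading principal components.
        X_hat = T @ V.T + mu
        # Step 4: replace only the missing entries; observed values stay unchanged.
        error = np.mean((filled[missing] - X_hat[missing]) ** 2)
        filled[missing] = X_hat[missing]
        if error < tol:   # assumed convergence test on the imputed entries
            break
    return filled

rng = np.random.default_rng(0)
truth = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 6)) + 3.0  # rank-2 signal plus offset
X = truth.copy()
mask = rng.random(X.shape) < 0.05
X[mask] = np.nan
filled = em_pca_impute(X, n_components=2)
```

On this synthetic low-rank example the imputed entries can be compared against `truth[mask]`; on real recordings no such ground truth exists, so the number of components and the tolerance would need to be chosen by other means.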

The optimal values for the missing entries are then obtained through the EM PCA iterations.

Because EM PCA minimizes the residual sum of squares as follows, the EM PCA parameters satisfy the least-squares criterion.

[Equation 8]

In [Equation 8], the reconstruction symbol denotes the reconstructed i-th data point.

Hereinafter, the incremental EM PCA for missing-value imputation according to the present invention is described in detail, taking co-evolving EEG data as an example.

PCA is an analysis technique used to separate distinct components from multivariate dependent data, and it is an efficient technique that reduces the amount of information while still representing the given data.

PCA transforms n-dimensional data into k-dimensional data (n > k) that retain as much of the information in X as possible and are mutually uncorrelated. In short, PCA can be described as forming linear combinations of the variables to derive new principal components.

In the present invention, incremental EM PCA predicts and replaces missing values in multivariate EEG time-series data, and summarizes the data by discovering hidden variables that capture its characteristics.

First, the EEG time-series data are arranged as a data matrix of n variables and t observation time points.

For example, the data x form an n×t matrix, as shown in [Table 1].

Multivariate EEG time-series data therefore have temporal dependence.

An n-dimensional feature vector containing missing values is then represented, where n is the number of sensors used for measurement and t is the time point.

The general form of the linear combination of principal components is Y = W^T X. The reduced-dimension feature vectors are given by Y = W_k^T X, which is a reduced-size representation of the matrix X, with n > k.

Y is a (k×t) matrix of feature vectors, where k is the dimension of the feature vectors, and W consists of the (n×n) basis vectors.

The basis vectors are ordinarily taken to be the standard basis vectors.

However, for EEG time-series data the standard basis vectors are inadequate for representing the data.

Because the data are measured over time, new basis vectors are required to represent the characteristics of the EEG time-series data at each measurement time point.

Variable        Meaning
n               Number of streams
k               Number of hidden variables
W               Weight matrix
w_i             i-th weight vector
X               n-dimensional data matrix
x               Input vector
x_t             n-stream input at time t
x_replace       n-stream input at time t with missing values replaced
x_{t,miss}      n-stream input at time t containing missing values
x_mean          Mean of x_1 through x_{t-1}
(not shown)     Reconstruction of x_i
Y               Hidden-variable matrix
y_t             Vector of hidden variables at time t
d_i             Energy measure of the i-th hidden variable
m_{k,t}         Energy accumulated by k at time t
e_i             Reconstruction error
P               Latent variables
λ               Exponential forgetting factor
E_{t,i}         Total energy of the hidden variables up to time t

The incremental PCA according to the present invention extends existing principal component analysis to reflect the characteristics of time series: while predicting the original high-dimensional data, it finds a new basis by means of basis vectors w that are updated incrementally. In other words, incremental PCA finds new vectors while gradually updating the principal-component basis vectors so as to minimize the average reconstruction error.

Here, k is the number of principal components. Incremental PCA therefore requires k basis vectors and updates the values w_{i,k} toward the principal components.

In the present invention, the number of hidden variables is denoted k, and its initial value is set to 3.

The vector x_t is then transformed into a new 3×t matrix.

The value of the reduced feature vector y_i is obtained by multiplying the original vector x_i by the basis vectors w_k, as expressed in [Equation 9].

[Equation 9]

To determine the number of principal components, the magnitude of the components is evaluated incrementally over time using the basis vectors summarized in [Equation 9]. [Equation 10] is used for this magnitude evaluation.

[Equation 10]

The magnitude threshold ranges from a minimum of 95% to a maximum of 98% of the reference value (Pan et al., 2004; Papadimitriou et al., 2005; Sun et al., 2005). This specification refers to this quantity as a magnitude threshold rather than as energy, although the two terms are used interchangeably below.

This energy range specifies the upper and lower bounds on the energy and determines the components retained. That is, the energy range determines the number of hidden variables.

If the energy of the hidden variables computed at a given time step is below the lower energy bound, the number of hidden variables is increased; if it is above the upper bound, the number is decreased.

Setting the lower energy bound lower yields fewer hidden variables, in which case useful information carried by low-energy hidden variables may be lost.
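The rule for adjusting the number of hidden variables can be written as a small function. This is a sketch: the function and parameter names are hypothetical, the default bounds 0.95 and 0.98 are the threshold range quoted above, and the guards that keep k between 1 and an optional maximum are assumptions not stated in the source.

```python
def adjust_hidden_count(k, e_hidden, e_input, lower=0.95, upper=0.98, k_max=None):
    """Return the new number of hidden variables given the two energy totals."""
    if e_hidden < lower * e_input and (k_max is None or k < k_max):
        return k + 1   # hidden variables capture too little energy: add one
    if e_hidden > upper * e_input and k > 1:
        return k - 1   # hidden variables capture more than needed: remove one
    return k           # energy lies inside the band: keep k

print(adjust_hidden_count(3, e_hidden=90.0, e_input=100.0))   # 4
print(adjust_hidden_count(3, e_hidden=99.0, e_input=100.0))   # 2
print(adjust_hidden_count(3, e_hidden=96.5, e_input=100.0))   # 3
```

Lowering `lower` makes the first branch fire less often, which is why a lower bound set too low leaves fewer hidden variables and can discard low-energy components.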

Nevertheless, the first hidden variable is sufficient to summarize the whole EEG time-series signal across the distinct streams. Tracking the energy level of the hidden variables is necessary for adequate results, and the energy range disclosed in the present invention is reasonable.

The average reconstruction error (ARE) in [Equation 11] is used to update the basis vectors.

[Equation 11]

In [Equation 11], the reconstruction term denotes the reconstruction after projection onto the k-dimensional space, and x_t is reconstructed according to [Equation 12].

[Equation 12]

EEG data contain missing values that occur in real time.

If a new sample x_{t+1} at a given time contains missing values, the algorithm quickly adjusts the missing values and incrementally changes the principal components by updating the basis vectors, following the incremental EM PCA of the present invention (see [Table 2]).

In incremental EM PCA, the exponential forgetting factor λ is used to adapt to recent behavior in the EEG data.

Common choices for the exponential forgetting factor lie between 0.96 and 0.98 (Pan et al., 2004; Papadimitriou et al., 2005; Sun et al., 2005).

Compared with a sliding window, this requires no buffer space for the EEG time series and therefore reduces memory usage.

In the present invention λ = 0.96. Because this approach operates incrementally, it is particularly well suited to EEG time-series data.
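The memory argument for exponential forgetting can be made concrete with a running energy that needs only one stored number. The update rule E ← λE + ||x||² used below is an assumption in the spirit of the cited streaming work, since the patent's own energy equations do not survive in this text; the function name `update_energy` is hypothetical.

```python
import numpy as np

def update_energy(energy, sample, lam=0.96):
    """Return lam * energy + ||sample||^2, an exponentially weighted running energy."""
    sample = np.asarray(sample, dtype=float)
    return lam * energy + float(sample @ sample)

energy = 0.0
for sample in ([3.0, 4.0], [0.0, 0.0], [1.0, 0.0]):
    energy = update_energy(energy, sample)  # only one scalar is kept between samples

print(energy)
```

A sliding window would have to buffer every sample inside the window, whereas here past samples are down-weighted geometrically. With λ = 0.96 the weights sum to 1 / (1 - λ) = 25, so the effective memory is roughly 25 samples.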

Incremental EM PCA for missing-value imputation

[Input]
New input x
Existing weight matrix W
Existing energy measure d from past data
Existing total hidden-variable energy E_hv
Existing total input-data energy E_x
Predefined lower energy bound f
Predefined upper energy bound F
Exponential forgetting factor λ

[Output]
Updated weight matrix W
Updated energy measure d
Updated total hidden-variable energy E_hv
Updated total input-data energy E_x
Determined missing value x_replace

[Algorithm]
If the input vector x contains missing values:
    x_replace = x_{t,miss} = x_mean    // initialize missing values with x_mean (observed values of x are kept)
    Error = MSE(x_t, x_replace)        // evaluate error rate
    If the missing values have not converged:
        replace the missing values using [Equation 4] to [Equation 7]
    // calculate hidden variables
    // calculate total hidden-variable energy
    // calculate total input-data energy
    If E_hv < f E_x:
        k = k + 1                      // add a hidden variable
    If E_hv > F E_x:
        k = k - 1
End
If the input vector x does not contain missing values:
    update the weight vectors for i = 1 to k using [Equation 9] to [Equation 12]
End
Adjust the number of hidden variables k
End
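The streaming procedure in the table can be sketched as a small class. Because [Equation 9] to [Equation 12] and the energy equations do not survive in this text, the weight update below assumes the SPIRIT-style rule of Papadimitriou et al. (2005), which the description cites; the class name `IncrementalEMPCA`, its parameters, the pseudo-inverse projection used for imputation, and the initial values are all illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

class IncrementalEMPCA:
    """Streaming sketch: impute NaNs in each sample, then update the weights."""

    def __init__(self, n, k=3, lam=0.96, lower=0.95, upper=0.98, tol=1e-8, max_iter=50):
        self.n, self.k, self.lam = n, k, lam
        self.lower, self.upper = lower, upper
        self.tol, self.max_iter = tol, max_iter
        self.W = np.eye(n)[:, :k].copy()  # weight vectors w_1..w_k as columns
        self.d = np.full(k, 1e-3)         # energy measure per hidden variable
        self.e_hv = 0.0                   # total hidden-variable energy
        self.e_x = 0.0                    # total input-data energy
        self.mean = np.zeros(n)           # running mean of past inputs (x_mean)
        self.t = 0

    def _impute(self, x, missing):
        x = np.where(missing, self.mean, x)   # initialize missing entries with x_mean
        pinv = np.linalg.pinv(self.W)
        for _ in range(self.max_iter):
            x_hat = self.W @ (pinv @ x)       # project onto the weights, then reconstruct
            change = np.mean((x[missing] - x_hat[missing]) ** 2)
            x[missing] = x_hat[missing]       # observed entries are never modified
            if change < self.tol:
                break
        return x

    def _adjust_k(self):
        if self.e_hv < self.lower * self.e_x and self.k < self.n:
            new_w = np.zeros((self.n, 1))
            new_w[self.k, 0] = 1.0            # start the new weight vector on a unit axis
            self.W = np.hstack([self.W, new_w])
            self.d = np.append(self.d, 1e-3)
            self.k += 1
        elif self.e_hv > self.upper * self.e_x and self.k > 1:
            self.W, self.d = self.W[:, :-1], self.d[:-1]
            self.k -= 1

    def update(self, x):
        x = np.array(x, dtype=float)
        missing = np.isnan(x)
        if missing.any():
            x = self._impute(x, missing)
        self.t += 1
        self.mean += (x - self.mean) / self.t
        residual, y = x.copy(), np.zeros(self.k)
        for i in range(self.k):               # assumed SPIRIT-style update per weight vector
            w = self.W[:, i]
            y[i] = w @ residual
            self.d[i] = self.lam * self.d[i] + y[i] ** 2
            w = w + (y[i] / self.d[i]) * (residual - y[i] * w)
            self.W[:, i] = w
            residual = residual - y[i] * w
        self.e_hv = self.lam * self.e_hv + float(y @ y)
        self.e_x = self.lam * self.e_x + float(x @ x)
        self._adjust_k()
        return x, y

model = IncrementalEMPCA(n=4, k=2)
filled, hidden = model.update([1.0, np.nan, -0.5, 2.0])
```

Each call to `update` returns the sample with its missing entries filled in and the hidden variables computed before the count k was adjusted. Unlike the table, this sketch updates the weights for every sample, including imputed ones, which is a design choice made here to keep the state consistent.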

상술한 점진적 EM PCA의 특징을 요약한다면, 결측값을 포함하는 EEG 시계열 데이터에서 결측값이 측정된다. 시간 내 변화에 의해서 데이터의 특성들을 표현하는 패턴들을 도출하며, EEG 시계열 데이터의 은닉 변수 발견에 의해서 데이터 상호 간에 상관관계를 발견한다.Summarizing the features of the progressive EM PCA described above, missing values are measured in EEG time series data including missing values. Patterns representing the characteristics of the data are derived by changes in time, and correlations between the data are found by finding hidden variables in the EEG time series data.

본 발명에 의하면, 메모리 사용량을 감소시킴으로써 계산의 복잡성을 감소시킬 수 있으며, 실시간 진행 과정에 의해 특성들을 도출하고 EEG 시계열 데이터의 결측값을 측정할 수 있다. 또한, 원래의 입력 데이터보다 낮은 은닉 함수의 하위 차원에 의해서 데이터를 요약한다.
According to the present invention, it is possible to reduce the complexity of the calculation by reducing the memory usage, to derive the characteristics and to measure the missing value of the EEG time series data by the real-time process. In addition, the data is summarized by a lower dimension of the concealment function than the original input data.

The experimental results for the incremental EM PCA according to the present invention are described in detail below.

First, the data sets used in the experiments are described.

As shown in [Table 3], three sets of multivariate EEG time-series data were used in the experiments.

The first set is EEG data on epileptic seizures provided by Krug, publicly available at http://www.epileptologie-bonn.de/cms/.

The epilepsy EEG data were measured from epilepsy patients using 100 channels over 23.6 seconds, and were taken from continuous multichannel EEG recordings after visual inspection for artifacts such as eye movements or muscle activity.

Studies 1 and 2 were obtained from surface EEG recordings of five healthy volunteers who repeatedly closed and opened their eyes. The data of two further studies were measured in five patients during seizure-free intervals, from within the epileptogenic zone (study 4) and from the hippocampal formation of the opposite hemisphere of the brain (study 3).

Study 5 contains seizure activity and was selected from all recording sites that exhibited seizure activity. Studies 1 and 2 were recorded extracranially, whereas studies 3, 4 and 5 were recorded intracranially.

Apart from the different recording electrodes, the recording parameters were fixed. For this reason, differing analysis results can be attributed to differing dynamic properties of the brain. The experiments therefore use three studies (1, 3 and 5), which are complete data sets.

The self-regulation EEG data set was provided by Klaus (http://www.bbci.de/competition/ii/). This experiment was conducted to determine a position on a computer screen (left, right, up, down).

When a target appears, the subject moves a cursor on the screen by thought so as to track the target (left to right, right to left) while the EEG is recorded. The self-regulation EEG data comprise EEG data collected from three subjects using 64 channels.

Each subject was given ten sessions per day, and each session was specified. In this experiment, mu or beta rhythm amplitude was used to control vertical cursor movement toward the vertical position of a target located at the right corner of the video screen.

Data were collected from ten sessions of 30 minutes each. Each session consists of six runs separated by one-minute breaks, and each run consists of 32 individual trials. Each trial begins with a 1-second period during which the screen is black.

A target then appeared at one of four possible positions on the right edge.

One second later, a cursor appeared at the middle of the left edge of the screen and began moving from left to right at a constant speed.

The vertical position of the cursor was controlled by the subject's EEG (update rate: 10 times per second). The subject's goal was to move the cursor to the height of the correct target. When the cursor reached the right edge, the screen went black.

This marked the end of the trial. Sessions 1 to 6 of each subject were used.

The last data set used in the experiments is EEG data on event-related potentials (ERP).

These data are available at http://sccn.ucsd.edu/~arno/fam2data/publicly_available_EEG_data.html.

The ERP EEG data were recorded using 31 channels. This data set was obtained from participants seated in a dimly lit room, 110 cm from a computer-controlled screen.

The subjects performed two tasks: an animal categorization task and a recognition task. Subjects had to press a touch-sensitive button. A small fixation point was drawn at the center of a black screen.

Then an 8-bit color vertical photograph was flashed for 20 ms using a programmed graphics board. This short presentation time prevented subjects from making exploratory eye movements in response. Participants responded according to a "go/no-go" paradigm.

In the animal categorization task, participants had to respond whenever an animal appeared in the photograph. The recognition task began with a learning phase.

A particular probe image was flashed 14 times for 20 ms each, intermixed with two presentations of 1000 ms after the fifth and tenth flashes, with an inter-stimulus interval of 1000 ms; the longer presentations allowed ocular exploration of the image.

Both tasks consisted of series of 100 images. In the animal categorization task, 50 target images were mixed with 50 non-targets. In the recognition task, 50 unique photographs were randomly mixed with 50 non-targets.

All of the EEG data sets are complete, with no missing values. Therefore, to evaluate the performance of the incremental EM PCA according to the present invention, missing values were generated at rates of 5%, 10% and 15%.
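A minimal sketch of this evaluation setup is given below: entries of a complete data matrix are hidden at a chosen rate, and an imputation is later scored only on the hidden entries. The helper names `inject_missing` and `imputation_mse`, the uniform random selection of entries, and the fixed seed are assumptions for illustration; the source does not state how the missing positions were chosen.

```python
import math
import random

def inject_missing(rows, rate, seed=0):
    """Return a copy of rows with round(rate * size) entries set to NaN."""
    rng = random.Random(seed)
    n_rows, n_cols = len(rows), len(rows[0])
    n_miss = round(rate * n_rows * n_cols)
    # Flat indices of the entries to hide, drawn without replacement.
    picked = set(rng.sample(range(n_rows * n_cols), n_miss))
    masked = [
        [math.nan if r * n_cols + c in picked else v for c, v in enumerate(row)]
        for r, row in enumerate(rows)
    ]
    return masked, picked

def imputation_mse(original, imputed, picked):
    """Mean squared error over the hidden entries only (picked must be non-empty)."""
    n_cols = len(original[0])
    errors = [
        (original[i // n_cols][i % n_cols] - imputed[i // n_cols][i % n_cols]) ** 2
        for i in picked
    ]
    return sum(errors) / len(errors)
```

Calling `inject_missing(data, 0.05)`, `inject_missing(data, 0.10)` and `inject_missing(data, 0.15)` yields the three missing rates used in the experiments, and the original matrix is left unmodified so that it can serve as the reference for scoring.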

The accuracy evaluation results for the incremental EM PCA according to the present invention are described next.

Accuracy was evaluated by comparing the values estimated by the incremental EM PCA according to the present invention against the original values. To automatically extract the number of hidden variables that summarize the large volume of data, energy values from 0.95 to 0.98 were used for all data.
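The energy test that governs the number of hidden variables can be sketched as below, following the rule in the algorithm listing (add a hidden variable when E_hv < f·E_x, remove one when E_hv > F·E_x). Treating 0.95 and 0.98 as the lower bound f and the upper bound F is an assumption based on the range quoted above; the function name `adjust_hidden_count`, the floor of one hidden variable, and the optional `k_max` cap are likewise hypothetical additions for the example.

```python
def adjust_hidden_count(k, e_hv, e_x, f=0.95, F=0.98, k_max=None):
    """Return the new number of hidden variables k after the energy test."""
    if e_hv < f * e_x:
        # Too little of the input energy is retained: add a hidden variable.
        if k_max is None or k < k_max:
            return k + 1
    elif e_hv > F * e_x and k > 1:
        # More energy than required is retained: remove a hidden variable.
        return k - 1
    return k
```

With these defaults, a retained energy ratio between 0.95 and 0.98 leaves k unchanged, so k settles once the hidden variables capture that share of the input energy.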

For the accuracy evaluation, the incremental EM PCA according to the present invention was compared, as shown in [Table 4], with robust EMPCA (Zhao et al., 2006) and missing-value SVD (MSVD, Troyanskaya et al., 2001), and the error rate of missing-value imputation was measured for each. Each data set contains missing values generated repeatedly at rates of 5%, 10% and 15%.

In [Table 4], for the epilepsy EEG data, robust EMPCA and MSVD extracted on average 22 and 80 hidden variables, respectively, from the data set with a 5% missing rate, whereas the method according to the present invention successfully summarized the data with 21 hidden variables.

On the epilepsy EEG data, the method according to the present invention shows missing-value imputation error rates that are lower than those of robust EMPCA and MSVD by 0.0016% and 0.0004% on average, respectively.

The mean imputation error rate of MSVD was lower than that of robust EMPCA.

Singular value decomposition (SVD, Troyanskaya et al., 2001) can recover missing values within a single result but is limited to finding linear correlations across multiple results, and it is confirmed once again that robust EMPCA is applicable only to non-negative data.

In [Table 4], for the self-regulation EEG data, the missing-value imputation error rate of the method according to the present invention is on average 0.00008% higher than that of robust EM PCA.

Study #1 is the first session of the first subject, study #2 is the first session of the second subject, and study #3 is the result for the third subject of the self-regulation EEG data. Robust EM PCA (Zhao et al., 2006) is known as the best of the existing methods for estimating missing values in linear data.

On linear data, for example, robust EM PCA shows relatively higher accuracy than the method according to the present invention. Moreover, MSVD shows a lower error rate than robust EM PCA.

Therefore, on linear data, MSVD shows high accuracy in missing-value imputation. On nonlinear data, however, the method according to the present invention overall shows a lower error rate than robust EM PCA.

In FIG. 2, data distributions (a) and (b) are the epilepsy EEG data and the ERP EEG data, respectively. (a) and (b) show nonlinear distributions, and (c) shows the linear self-regulation EEG data. In (a), electrode 1 has amplitudes from -200 to 200 μV and electrode 2 from -300 to 200 μV. In (b), electrodes 1 and 2 have amplitudes in the ranges -500 to 600 and -1000 to 3000 μV, respectively. (c) spans amplitudes from -20000 to 20000.

When the method according to the present invention is applied to the ERP EEG data, its missing-value imputation error rate is on average 0.00007% lower than that of MSVD. Comparing robust EM PCA and MSVD, MSVD shows an error rate that is on average 0.00019% lower than that of robust EM PCA.

The evaluation of hidden variable discovery by the incremental EM PCA according to the present invention is described in detail below.

Discovering hidden variables is important for summarizing EEG time-series data sets and for reducing the time needed to learn from the data. To evaluate hidden variable discovery, data with 5%, 10% and 15% missing values were randomly generated from each EEG data set, and the method according to the present invention and MSVD were then each applied to estimate the missing values and discover hidden variables.

FIG. 3 compares the first hidden variables discovered: (a) shows the epilepsy EEG data, (b) the first hidden variable of the self-regulation EEG data, and (c) the result for the first session of the ERP EEG data. Each data set has a 5% missing rate.

In FIG. 3, the method according to the present invention yields a hidden variable pattern that is closer to the one obtained from the original complete data than the pattern yielded by MSVD. The top graph of FIG. 3(a) is the first hidden variable discovered by applying the method according to the present invention to the complete data set.

The middle graph of FIG. 3(a) is the first hidden variable discovered after applying the method according to the present invention. Its pattern is similar in shape to the top graph of FIG. 3(a). The pattern in the bottom graph of FIG. 3(a), however, differs from the top graph.

The bottom graph is for MSVD. It is clear that, unlike the method according to the present invention, MSVD cannot be expected to produce accurate values close to the original values.

FIG. 3(b) shows the result for the first session of subject 1 in the self-regulation EEG data. In FIG. 3(b), comparing the patterns of the first hidden variable for the method according to the present invention, PCA and MSVD shows that all three methods produce similar patterns.

FIG. 3(c) is the first hidden variable of the ERP EEG data. Comparing the hidden variable patterns discovered by the three methods shows that, for PCA and the method according to the present invention, the pattern clearly separates into four different tasks.

With MSVD, however, the tasks cannot be distinguished.

In contrast, the method according to the present invention discovers hidden variables automatically and in real time. It is also applicable to various fields related to pattern recognition, for discovering patterns and hidden variables.

The reconstruction evaluation results for the incremental EM PCA according to the present invention are described in detail below.

If the reconstruction error is minimized, the information loss caused by feature extraction or dimensionality reduction can be minimized, so the method can be widely applied in computer vision fields such as image compression, image restoration and information recognition systems.

FIG. 4 shows curves of the reconstructed data and the original data for three randomly selected sensors. The reconstructed data were produced by the method according to the present invention from data containing 5% missing values.

FIG. 4(a) shows sensors 30 and 90 in the epilepsy data recorded during each of the five types of tasks. FIG. 4(b) compares the reconstructed and original ERP EEG data. FIG. 4(c) shows original and reconstructed data for three randomly selected channels of the self-regulation EEG data, including channels 2 and 35 of the first subject.

In FIG. 4, the light dotted lines represent the original data, and the dark lines represent the reconstructed data after applying the method according to the present invention. FIG. 4 shows that the reconstructed data closely follow the original data.

Therefore, the method according to the present invention minimizes information loss when reducing data that contain missing values, and can thus be applied in various fields for feature extraction, dimensionality reduction and data compression.

The computational complexity evaluation results for the incremental EM PCA according to the present invention are described in detail below.

The complexity of the method according to the present invention is

where T is each time point and k is the number of hidden variables. The complexity of MSVD, however, is

where m is the number of samples and n is the dimension of X.

Therefore, the method according to the present invention uses less memory than robust EMPCA. FIG. 5 shows the running time of the algorithms for various data sets and sequence length ratios. In each run, 5% of each data set was treated as missing. These data sets were used to compare the method according to the present invention with MSVD.

FIG. 5(a) shows the learning time for the epilepsy EEG data, (b) for the ERP EEG data, and (c) for the self-regulation EEG data; MSVD shows the longest running time. This is because, when new data t+1 are added, all data from 1 to t must be used again for learning.

The method according to the present invention, however, uses the initial values of the missing values and the weights at time t, so little additional learning time is needed when t+1 is added. The method can therefore address both the running time problem and the limited memory problem in estimating missing data in real-time data, and can successfully summarize the data by discovering hidden variables.
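The difference between an incremental update and a batch recomputation can be illustrated with the running energy alone. The sketch below assumes an exponentially weighted update of the form E ← λE + ‖x‖², which is a common choice but is not confirmed by this section, since the energy equations are not reproduced here. The names `update_energy` and `batch_energy` are hypothetical.

```python
def update_energy(energy, sample, lam=0.96):
    """Incremental update: needs only the previous energy and the new sample."""
    return lam * energy + sum(v * v for v in sample)

def batch_energy(samples, lam=0.96):
    """Batch recomputation: revisits every stored sample from 1 to t."""
    last = len(samples) - 1
    return sum(
        lam ** (last - t) * sum(v * v for v in sample)
        for t, sample in enumerate(samples)
    )
```

Both functions give the same value after t samples, but the incremental form keeps a single number as state and does constant work per new sample, whereas the batch form must store and reprocess the whole history each time a sample at t+1 arrives.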

The present invention has been described and illustrated above with reference to preferred embodiments that exemplify its technical idea, but it is not limited to the exact configuration and operation shown and described. Those skilled in the art will appreciate that numerous changes and modifications can be made to the present invention without departing from the scope of the technical idea. Accordingly, all such appropriate changes, modifications and equivalents should be regarded as falling within the scope of the present invention.

Claims

A method for imputing missing values using incremental EM (Expectation Maximization) PCA (Principal Component Analysis), the method comprising:
(a) receiving an input vector x;
(b) when the received input vector x includes a missing value, initializing the missing value with the mean of the data matrix X;
(c) performing PCA on the data matrix X;
(d) reconstructing the data matrix X; and
(e) replacing the missing values in the data matrix X before reconstruction with the corresponding values in the reconstructed data matrix X.

The method of claim 1, further comprising, after step (e):
(f) repeating steps (c) to (e) until the error rate for the data matrix X containing missing values is less than or equal to a preset threshold.

The method of claim 2, further comprising, after step (f):
(g) calculating a hidden variable;
(h) calculating the total hidden variable energy; and
(i) calculating the total input data energy.

The method of claim 3, wherein steps (g) to (i) are performed according to the following equations, respectively.

(where y is a hidden variable, T is a score vector, W_new is the updated weight matrix, x_replace is the determined missing value, E_hv is the existing or updated total hidden variable energy, λ is the exponential forgetting factor, and E_x is the existing or updated total input data energy.)

The method of claim 4, further comprising, if E_hv < fE_x (f is the number of principal components), after step (i):
(j) adding a hidden variable.

The method of claim 4, wherein the exponential forgetting factor λ is selected from the range 0.96 to 0.98.