KR20120072290A

KR20120072290A - Method for analyzing incremental non-Gaussian

Info

Publication number: KR20120072290A
Application number: KR1020110007479A
Authority: KR
Inventors: 양형정; 캄스위; 김선희
Original assignee: 전남대학교산학협력단
Priority date: 2010-12-23
Filing date: 2011-01-25
Publication date: 2012-07-03
Also published as: KR101213676B1

Abstract

PURPOSE: An incremental non-Gaussian analysis method is provided that minimizes memory usage and data processing time by applying an incremental ICA (Independent Component Analysis) scheme to the analysis of EEG (electroencephalography) signal data. CONSTITUTION: First input vector data is arranged as a matrix. The first input vector data is standardized and decomposed into non-Gaussian weight vectors. The non-Gaussian weight vectors are updated with second input vector data. The update is repeated until the weight vectors exhibit maximum non-Gaussianity.

Description

Incremental Non-Gaussian Analysis Method {METHOD FOR ANALYZING INCREMENTAL NON-GAUSSIAN}

The present invention relates to an incremental non-Gaussian analysis method. More specifically, it relates to a data analysis method that applies incremental independent component analysis to physiological signal data such as multivariate EEG (electroencephalography) signals. To discover latent variables, the method processes the data of each input vector as it arrives, stores previous data only as a few variables, and updates the previous values of those variables with later input vectors, thereby minimizing memory usage and data processing time.

Over the past decade, interest in multivariate data mining has grown dramatically. As modern information technology advances, large volumes of multivariate data have become important in fields ranging from finance, engineering, and medicine to science. It is essential to collect data on several variables and to model a reality in which those variables are closely interrelated in various ways.

As research on multivariate data mining has diversified, research on methods for mining large data sets containing more than one variable has also become active. The aim is to retrieve useful information from very large multivariate data and to support useful decision making.

Typically, in many practical applications, the data tables being analyzed consist of several measurements collected on a set of units (for example, subjects or samples). For example, on a microarray chip, a large number of genes are collected in a single experiment, and gene expression is identified across experiments. Another example is the EEG signal, in which a large number of signals are acquired from the cerebral cortex. The signals are collected at high frequency, and the data set has very high dimensionality.

DNA microarrays are devices that can simultaneously measure the presence of mRNAs for a large number of genes in a single experiment. The high dimensionality of microarray gene expression data has been an obstacle to biologists gaining comprehensive insight into microarray data.

Microarray data is unusual in that it contains many features but relatively few observations, in contrast to most real-world data. Moreover, the gene expression of two individuals is rarely identical.

Genetic variation makes microarray data more challenging to analyze. Moreover, during a microarray experiment, the data tends to become noisy at every stage, from tissue collection and mRNA amplification to hybridization on the chip.

Time-series microarray experiments are commonly used to characterize the dynamic biological processes of genes over time. The ability to analyze a huge number of genes in one experiment encourages biologists to collect more samples of microarray data.

Therefore, data from many patients is expected to become available in the future. Moreover, as modern technology develops, biologists may discover more genes. Both the number of samples and the number of genes are thus expected to grow.

Consequently, the size of microarray data will continue to increase. A suitable method is therefore needed to address the problem of huge data dimensionality by retaining the important information and removing the contaminating noise.

The EEG signal is an example of a data stream in which a continuous flow of signals arrives successively at a high rate. The data is collected from sensors attached over the cerebral cortex. High-speed electrical data collected from many sensors tends to be high-dimensional.

Traditional systems are slow at processing time-dependent EEG signals and often cannot provide the immediate processing that is important for decision making. Because the electrical signals arrive successively, rapid processing of high-speed data is important. Moreover, given current computing capacity, it is impossible to keep all previous data in the allocated memory.

EEG signals are considered to be inherently correlated across sensors, so patterns and correlations among them can be discovered. While EEG signals are being collected, they inevitably contain artifacts introduced by the leads and devices used to collect and visualize the signals.

Such noise is undesirable because it affects the results. It is therefore important to remove noise from the data while detecting correlations in the data online.

As its name suggests, multivariate analysis comprises a set of techniques suited to analyzing data sets with more than one variable. These techniques were developed because it is important to find results that account for relationships among multiple variables as well as within each variable.

Multivariate analysis is widely used to extract features and to reduce the dimensionality of microarray data sets and EEG signals.

Principal component analysis (PCA) is the most common multivariate analysis method for dimensionality reduction. It has been studied for summarizing time-series microarray data and for reducing the dimensionality of EEG data sets.

Independent component analysis (ICA) is another method of multivariate statistical analysis, used to separate data into its underlying informational components. ICA is an essentially useful method for revealing the driving forces that underlie a set of observed phenomena. Such phenomena include the firing of a set of neurons in the brain and microarray data sets. ICA is applied in many areas, such as the separation of artifacts in MEG data and the visualization, localization, and feature extraction of EEG signals.

ICA finds a set of independent source signals, whereas PCA finds a set of signals with a property much weaker than independence. Specifically, PCA finds a set of signals that are uncorrelated with each other.

In some cases, if the data is Gaussian, estimating the model requires only an orthogonal transformation. However, PCA imposes its orthogonality requirement even on real-world data that is not Gaussian.

In probability theory, the central limit theorem (CLT) states conditions under which the sum of a sufficiently large number of independent random variables is approximately normally distributed. In other words, mixtures of several sources tend to be more Gaussian than the original sources. PCA yields a set of independent components only if the components are Gaussian. ICA, by contrast, can be regarded as non-Gaussian factor analysis that decomposes data into statistically independent components. Many studies show that ICA outperforms PCA in the analysis of microarray data and EEG signals.
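The tendency of mixtures to be more Gaussian than their sources can be illustrated numerically. The following is a minimal sketch, not part of the disclosed method: it assumes uniformly distributed sources and uses excess kurtosis (zero for a Gaussian) as the non-Gaussianity measure. The function name `excess_kurtosis` and the choice of eight sources are illustrative assumptions.

```python
import numpy as np

def excess_kurtosis(x):
    """Excess kurtosis of a 1-D sample: 0 for a Gaussian, about -1.2 for a uniform source."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
# Eight hypothetical independent, non-Gaussian (uniform) sources.
sources = rng.uniform(-1.0, 1.0, size=(8, 100_000))

single = excess_kurtosis(sources[0])            # one source on its own
mixture = excess_kurtosis(sources.sum(axis=0))  # equal-weight mixture of all eight

print(f"single source: {single:.3f}, mixture: {mixture:.3f}")
```

The mixture's excess kurtosis lies much closer to zero than that of a single source, which is why maximizing non-Gaussianity is a usable criterion for undoing a mixture.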

More and more applications involving large microarray and EEG signal data sets are emerging. It is important to analyze data as soon as it arrives. Unfortunately, conventional methods for processing microarray data sets and EEG signals always treat the data as static. Furthermore, for time-series data in particular, batch processing depends on the duration (t), which grows without bound.

Conventional PCA and ICA both involve computing a singular value decomposition (SVD), which consumes a large amount of memory. Because the space requirement also depends on the duration (t), space consumption is proportional to t. Batch-mode processing therefore always entails a large memory requirement and is time-consuming, especially as the data grows. In other words, conventional ICA cannot process multivariate data when the duration grows without bound.

An object of the present invention is to overcome the above problems by providing an incremental non-Gaussian analysis method that minimizes memory usage and data processing time by applying an incremental independent component analysis technique to the analysis of multivariate EEG signal data.

Another object of the present invention is to provide an incremental non-Gaussian analysis method that combines non-normality with an adaptive incremental model to extract EEG signal features successfully, and that can therefore also analyze microarray gene expression data, in which the number of features is far greater than the number of observations.

The present invention relates to an incremental non-Gaussian analysis method comprising: (a) arranging incoming first input vector data as a matrix; (b) standardizing the first input vector data arranged as a matrix and decomposing it into non-Gaussian weight vectors; and (c) updating the non-Gaussian weight vectors with second input vector data that arrives next.

Preferably, the first and second input vector data are microarray gene expression data or EEG (electroencephalography) signal data.

Also preferably, step (b) includes (b-1) deriving hidden variables by projecting the non-Gaussian weight vector onto the first input vector through a linear transformation.

Also preferably, step (c) includes: (c-1) calculating a hidden variable from the second input vector using the non-Gaussian weight vector of step (b); and (c-2) estimating the energy by the following equation using the hidden variable calculated in step (c-1).

(In the above equation, $d_i$ is the energy, λ is the forgetting factor, and $y_i$ is the hidden variable calculated in step (c-1).)

Also preferably, the forgetting factor λ is selected from the range 0.96 to 0.98.

Also preferably, the energy $d_i$ is set within the range 95% to 98%.

Preferably, the method further includes, after step (c), (d) repeating step (c) until the updated non-Gaussian weight vector exhibits maximum non-Gaussianity.

According to the present invention, memory usage and data processing time can be minimized by applying an incremental independent component analysis technique to the analysis of multivariate EEG signal data.

According to the present invention, EEG signal features can be extracted successfully by combining non-normality with an adaptive incremental model, so that microarray gene expression data, in which the number of features is far greater than the number of observations, can also be analyzed.

FIG. 1 shows a fragment of the non-Gaussian analysis algorithm.
FIG. 2 shows hierarchical clustering of the outputs of the method according to the present invention (a), incremental PCA (b), ICA (c), and PCA (d).
FIG. 3 shows hierarchical clustering of the normalized raw data (a) and of the data reconstructed by the method according to the present invention (b), incremental PCA (c), ICA (d), and PCA (e).
FIG. 4A shows execution time versus the number of genes on a microarray data set.
FIG. 4B shows execution time versus the number of sources on an EEG signal data set.
FIG. 5A shows execution time versus the number of experiments in microarray data.
FIG. 5B shows execution time versus stream size in EEG signal data.

Before the specific details for carrying out the invention are described, it should be noted that the terms and words used in this specification and the claims are to be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way.

It should also be noted that detailed descriptions of known functions and configurations related to the present invention are omitted where they would unnecessarily obscure the subject matter of the invention.

The present invention relates to an analysis method that adds an incremental approach to the independent component analysis (ICA) technique for the analysis of multivariate EEG signal data.

According to the method of the present invention, orthogonal weights are computed by updating each weight vector within a predetermined energy range. Once the orthogonal weights are obtained, they are converged into non-Gaussian weights.

Thereafter, when new data arrives, the method works incrementally: instead of recomputing over the entire data set, it updates the non-Gaussian weights using the stored past variables, which results in a low computational cost.

The method according to the present invention adapts the concept of statistically converging to the independent components of multivariate data to an incremental approach. Conventional batch methods require the complete data matrix to be recomputed whenever new data arrives, which costs unnecessary memory and processing time. The method according to the present invention, by contrast, efficiently updates with newly arriving data using the past variables, without holding the complete data matrix.

Before the incremental non-Gaussian analysis method according to the present invention is described, independent component analysis is first described in detail.

Independent component analysis (ICA) is a method for finding underlying components in multidimensional data. A set of components that are maximally independent of one another is determined using higher-order statistics. ICA is widely applied to EEG signal analysis.

The description below uses EEG signals as an example, but neither independent component analysis nor the present invention is limited thereto.

In EEG signal analysis, simultaneous neural sources that cooperate in a particular response are collected through sensors. Many signals are acquired by sensors attached over regions of the cerebral cortex. These signals are mixtures of the original neural signals from the brain.

Let $x_i(t)$ denote the signal from sensor $i$, where $x_i$ is the amplitude and $t$ is the time index, so that the sensor signals are $x_1(t), \dots, x_n(t)$. Each of these recorded signals is a weighted sum of the neural signals emitted by the cranial nerves, denoted $s_1(t), \dots, s_n(t)$. This can be expressed as a linear equation, as in [Equation 1] below.

[Equation 1]

$x_i(t) = a_{i1} s_1(t) + a_{i2} s_2(t) + \dots + a_{in} s_n(t), \quad i = 1, \dots, n$

In [Equation 1], the $a_{ij}$ are parameters that depend on the distances of the sensors from the cranial nerves or on other conditions. Put simply, a linear equation such as [Equation 1] can be written as in [Equation 2] below.

[Equation 2]

$\mathbf{x} = \mathbf{A}\mathbf{s}$

In general, bold lowercase letters denote vectors and bold uppercase letters denote matrices.

The task of ICA is to find a matrix $\mathbf{A}$ such that the source signals, or components ($\mathbf{s}$), are statistically independent. After the matrix $\mathbf{A}$ is estimated, its inverse $\mathbf{W}$ can be computed to obtain the independent components, as in [Equation 3] and [Equation 4].

[Equation 3]

$\mathbf{W} = \mathbf{A}^{-1}$

[Equation 4]

$\mathbf{s} = \mathbf{W}\mathbf{x}$
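The relationship between the mixing matrix and its inverse can be checked numerically. The sketch below is illustrative only: the Laplace-distributed sources and the 3×3 mixing matrix `A` are hypothetical values chosen for the example, and in practice `A` is unknown and must be estimated by ICA.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three hypothetical independent, non-Gaussian sources, 1000 samples each.
s = rng.laplace(size=(3, 1000))

# Hypothetical mixing matrix (known here only for the sake of the example).
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.2, 0.6, 1.0]])

x = A @ s              # observed sensor signals: x = As
W = np.linalg.inv(A)   # unmixing matrix: the inverse of A
s_rec = W @ x          # recovered sources: s = Wx

print(np.allclose(s_rec, s))
```

With the true inverse the sources are recovered exactly up to rounding. An ICA estimate recovers them only up to permutation and scaling of the components.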

The model can also be expressed as in [Equation 5] below.

[Equation 5]

The assumptions of the ICA model are that the components ($\mathbf{s}$) are statistically independent and that the independent components have non-Gaussian distributions. Consider $y$, a linear combination of the $s_i$, and obtain one of the independent components ($y$) as in [Equation 6].

[Equation 6]

$y = \mathbf{w}^T \mathbf{x}$

By maximizing the non-Gaussianity of $\mathbf{w}^T \mathbf{x}$, the independent component ($y$) can be obtained.

FastICA derives the maximum non-Gaussianity of $\mathbf{w}^T \mathbf{x}$ using a fixed-point iteration scheme. One measure of non-Gaussianity is given by negentropy, which is based on the information-theoretic quantity of differential entropy.

Using an approximation of negentropy, FastICA applies a learning rule that finds a direction, that is, a unit vector ($\mathbf{w}$), such that the projection maximizes non-Gaussianity.

First, an initial weight vector ($\mathbf{w}$) is chosen at random. Then, negentropy is approximated using higher-order moments, as in [Equation 7] and [Equation 8].

[Equation 7]

[Equation 8]

The update also uses the derivative of the function given in [Equation 8], and the algorithm is repeated until convergence.
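By way of illustration only, a one-unit fixed-point iteration of the FastICA type can be sketched as follows. Because [Equation 7] and [Equation 8] are not reproduced in this text, the sketch assumes the commonly published FastICA update $\mathbf{w} \leftarrow E\{\mathbf{z}\, g(\mathbf{w}^T \mathbf{z})\} - E\{g'(\mathbf{w}^T \mathbf{z})\}\,\mathbf{w}$ with $g = \tanh$ on whitened data $\mathbf{z}$. The function name `fastica_one_unit` and its parameters are hypothetical, and the nonlinearity actually used in the patent may differ.

```python
import numpy as np

def fastica_one_unit(z, max_iter=200, tol=1e-8, seed=0):
    """Estimate one unit vector w that maximizes the non-Gaussianity of w^T z.

    z: whitened data of shape (n_features, n_samples).
    Assumes the standard FastICA fixed-point update with g = tanh.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(z.shape[0])       # random initial weight vector
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        u = w @ z                             # current projection w^T z
        g = np.tanh(u)                        # assumed nonlinearity g
        g_prime = 1.0 - g ** 2                # its derivative g'
        w_new = (z * g).mean(axis=1) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)        # keep w a unit vector
        # Converged when the direction no longer changes (up to sign).
        converged = abs(abs(w_new @ w) - 1.0) < tol
        w = w_new
        if converged:
            break
    return w
```

The input is expected to be centered and whitened beforehand, for example through an eigendecomposition of the covariance matrix. The returned vector identifies one independent component, up to sign.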

The incremental non-Gaussian analysis method according to the present invention is described in detail below.

The method according to the present invention decomposes the data into statistically independent non-Gaussian weight vectors. The biological processes underlying microarray gene expression data are more super-Gaussian than the mixtures of the original sources.

Only a small number of genes change at each pathological transition while most genes remain unaffected, which yields a super-Gaussian distribution. Non-Gaussian analysis is therefore well suited to analyzing microarray gene expression data.

The method according to the present invention performs unsupervised hierarchical clustering on an AD microarray data set.

The microarray data (X(m,n)) is arranged as a two-dimensional m×n matrix. Each row (m) represents gene profile data, and each column (n) represents gene data.

The microarray data is standardized by normalizing it to zero mean and unit variance. That is, the mean is subtracted from the microarray data, and the result is divided by the standard deviation.
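The standardization step can be sketched as follows. This is a minimal example under the assumption that each row of the data matrix is standardized independently; the text does not state the axis. The function name `standardize` and the guard against zero variance are illustrative additions.

```python
import numpy as np

def standardize(X, axis=1):
    """Normalize X to zero mean and unit variance along the given axis.

    Assumption: statistics are computed per row (axis=1).
    """
    mean = X.mean(axis=axis, keepdims=True)
    std = X.std(axis=axis, keepdims=True)
    std = np.where(std == 0.0, 1.0, std)  # avoid division by zero for constant rows
    return (X - mean) / std

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
Z = standardize(X)
print(Z)
```

After this step, both example rows have the same standardized values, because they differ only by a scale factor.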

Before the clustering process, the first input vector (gene profile data) is decomposed into non-Gaussian weight vectors.

Then, the non-Gaussian weight vectors are updated to reflect the second input vector (the gene profile data that arrives next).

The update process is repeated until the last, n-th input vector (the last gene profile data to arrive) has been reflected.

When the update process over the microarray data is complete, the final non-Gaussian weight vectors are obtained.

In the present invention, the forgetting factor was set to 0.96 and the energy was set to 95% to 98%. After preprocessing, the final dimension of the microarray data matrix was reduced to 4. The data matrix used for clustering is 134. The meaning of each variable is given in [Table 1].

Symbol          Description
$\mathbf{x}$    Input vector (bold lowercase)
$\mathbf{x}_t$  The n stream values input at time (t)
$\mathbf{W}$    Weight matrix (bold uppercase)
$\mathbf{w}_i$  i-th participation weight vector
$n$             Number of streams
$k$             Number of hidden variables
$\mathbf{y}_t$  Vector of hidden variables for $\mathbf{x}_t$
$d_i$           Energy estimate of the i-th hidden variable
$e_i$           Reproduction error
$E_{t,i}$       Total energy of the hidden variables up to time t and up to hidden variable i
$E_{hv}$        Total energy of the hidden variables
$E$             Energy of the input data (x)
$\lambda$       Exponential forgetting factor
$f$             Predetermined lower energy limit
$F$             Predetermined upper energy limit

In a data stream, $\mathbf{x}_t$ is the n-dimensional column vector of measurements at time (t), and the stream can grow continuously without bound. At the first timestamp, the weight vectors ($\mathbf{w}_i$) are initialized to basis vectors.

Each weight vector ($\mathbf{w}_i$) is projected onto the input vector ($\mathbf{x}_t$) as a linear transformation of the data stream, yielding the hidden variables, or components ($\mathbf{y}_t$), over time.

The core of the incremental approach according to the present invention is to update each participation weight vector ($\mathbf{w}_i$) incrementally at each time step in the newly projected space.

In contrast to incremental PCA, the weight vectors of the method according to the present invention are non-Gaussian. Once an orthogonal weight vector ($\mathbf{w}_i$) is obtained, each weight vector ($\mathbf{w}_i$) is updated until maximum non-Gaussianity is reached.

The number of hidden variables (k) is first initialized to an arbitrary number. Then, the input vector ($\mathbf{x}_t$) with n dimensions is obtained at time (t).

From the input vector, the i-th component ($y_{t,i}$) is computed based on the previous weight ($\mathbf{w}_i$).

The computation of the i-th component is given by [Equation 9] below. That is, it is the projection of the weight vector onto the input vector at time (t), computed as a sum of products.

[Equation 9]

$y_{t,i} = \mathbf{w}_i^T \mathbf{x}_t$

Next, the energy ($d_i$) and the reproduction error ($e_i$) are estimated using the hidden variable computed in the previous step. The initial value of the energy is a small positive value.

Once the hidden variable ($y_{t,i}$) is obtained, [Equation 10] and [Equation 11] are evaluated.

[Equation 10]

[Equation 11]

In [Equation 10], an exponential forgetting factor (λ) is introduced so that new data can be adapted to the previous behavior of the data stream. The value of the exponential forgetting factor lies between 0 and 1.

When λ = 0, no previous data is considered in subsequent processing. Conversely, λ = 1 corresponds to a stationary data stream. This case generally does not occur, however, because most data streams drift gradually.

Introducing λ helps reduce memory usage, because no buffer space is needed for the entire data stream. The preferred range of the exponential forgetting factor is 0.96 to 0.98. The factor is set to a high value so that the data adapts to past values. As long as the value does not change greatly, the results are similar.
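The effect of the forgetting factor can be made concrete with a short calculation. The sketch assumes that the energy in [Equation 10] is accumulated recursively as $d_i \leftarrow \lambda d_i + y_{t,i}^2$, which is consistent with the stated roles of $d_i$, λ, and $y_i$ but is not reproduced in this text. Under that assumption, a sample that is j steps old carries weight $\lambda^j$, and the weights sum to $1/(1-\lambda)$. The function names are hypothetical.

```python
def sample_weight(lam, age):
    """Weight given to a sample that is `age` steps old under exponential forgetting."""
    return lam ** age

def effective_window(lam):
    """Sum of all sample weights, i.e. the effective number of remembered samples."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must be in [0, 1); lam = 1 never forgets")
    return 1.0 / (1.0 - lam)

for lam in (0.96, 0.98):
    print(lam, round(effective_window(lam)), round(sample_weight(lam, 100), 4))
```

Under this assumption the preferred range of 0.96 to 0.98 corresponds to an effective memory of roughly 25 to 50 recent samples, independent of the total stream length.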

The magnitude of the estimates takes into account the past data already captured by the participation weight vector ($\mathbf{w}_i$). For this reason, the update is inversely proportional to the current energy ($E_{t,i}$) of the i-th hidden variable.

참여 가중치 벡터는 [수학식 12]에 의해 갱신된다.The participation weight vector is updated by Equation 12.

[수학식 12][Equation 12]

참여 가중 추정치를 얻은 다음, 수렴할 때까지 [수학식 7]을 이용하여 비-가우시안의 최대화를 계속한다. 실제 값에 수렴하기 위해, 참여 가중치는 요소의 개수에 따라 k번 갱신된다.After obtaining the participatory weighting estimate, continue to maximize the non-Gaussian equation (7) until convergence. In order to converge to the actual value, the participation weight is updated k times according to the number of elements.

Finally, the updated participation weight is obtained.

The actual hidden variables y_t at time t are computed by projecting the input vector x with the weight matrix w.
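The projection step can be sketched in a few lines of Python. This is a minimal illustration that assumes the weight matrix is stored as a list of k weight vectors (one per hidden variable); the function name is hypothetical.

```python
def project(weights, x):
    """Return the hidden variables y, where y[i] is the dot product of weights[i] and x."""
    return [sum(wj * xj for wj, xj in zip(w_i, x)) for w_i in weights]
```

For example, two weight vectors applied to a three-channel input yield two hidden variables, so the representation handed to later steps has length k rather than the input length.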

Next, the energy E_hv is calculated from the values of the hidden variables. To confirm that there are enough components to represent the data, an energy threshold is applied to determine how many hidden variables are needed.

The energy E_hv retained by the hidden variables is compared with the upper energy limit F_E E and the lower energy limit f_E E of the original input data.

If the hidden variables retain too little energy (below the lower limit), the number of hidden variables k is increased. Conversely, if the retained energy is too high (above the upper limit), the number of hidden variables k is decreased. This ensures that the energy of the hidden variables always stays within a specific interval between the low and high energy values.
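The thresholding rule can be sketched as follows. This is a minimal illustration, assuming the lower and upper limits are the fractions f_E and F_E of the input energy E, using the 95% and 98% values quoted elsewhere in this document as defaults. The function name, the k_max cap and the floor of one hidden variable are assumptions added for the example.

```python
def adjust_hidden_count(k, e_hidden, e_input, f_low=0.95, f_high=0.98, k_max=None):
    """Return the new number of hidden variables after one energy check."""
    if e_hidden < f_low * e_input:
        # Too little energy retained: add a hidden variable if allowed.
        if k_max is None or k < k_max:
            return k + 1
        return k
    if e_hidden > f_high * e_input and k > 1:
        # More energy than needed: drop a hidden variable, keeping at least one.
        return k - 1
    return k
```

Calling this once per incoming sample keeps the retained energy inside the chosen band while changing k by at most one at a time.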

Each time new data arrive, the weight-vector update is repeated and the number of hidden variables is adjusted so that the energy of the components stays between the predetermined lower and upper limits. FIG. 1 shows a fragment of the non-Gaussian analysis algorithm.

Hereinafter, experiments for verifying the efficiency of the incremental non-Gaussian analysis method according to the present invention are described in detail.

First, to verify the efficiency of the method according to the present invention on a microarray data set, experiments were performed using the GEO data sets deposited by Blalock et al.

The GEO data sets used in the experiment come from a study that analyzed hippocampal gene expression in Alzheimer's disease (AD) of varying severity on 31 separate microarrays. The samples were obtained from the brain bank of the Alzheimer's Disease Research Center at the University of Kentucky, and Human GeneChips (HG-U133A) and Microarray Suite 5 were used for data collection.

Samples containing significant noise were removed, leaving eight control samples and five severe AD samples. Unregulated genes in microarray data often carry little information, so these genes were removed from the experiment. As a result, 13 samples with 3617 genes each were used in the experiment.

In addition, the method according to the present invention was applied to EEG signals of albino rats provided by Dr. Eamonn Keogh. The EEG signal was sampled at 128 Hz and comprises 21386 samples with 512 features. The number of features corresponds to 4 seconds of EEG recording at 128 Hz, and the data set has three classes: wake, synchronized sleep, and REM sleep.

The original data contain missing labels for the last 15 records; the records with missing labels were removed for the experiments.

FIG. 2(a) shows the clustering result for the projection of the microarray data onto the non-Gaussian weight vectors. The control samples and the severe AD samples can be clearly distinguished using a small number of features.

For comparison with the method according to the present invention, experiments with incremental PCA were also performed. The main difference between the two is that, in incremental PCA, the decomposed vector components are not independent. In other words, incremental PCA is similar to PCA except that it operates incrementally.

The incremental PCA experiments were performed in the same manner and with the same parameter settings as the experiments for the method according to the present invention.

FIG. 2(b) shows the clustering result of the incremental PCA experiment. One AD sample, AD2, clearly cannot be clustered correctly.

The experimental results for the method according to the present invention can also be compared with the ICA results of Kong et al. FIG. 2(c) shows the clustering result obtained by ICA.

In FIG. 2(c), the complete data matrix is processed by the FastICA algorithm, which is run 50 times to mitigate the instability caused by the somewhat different results of each run. Eleven ICA latent variables are identified, which sufficiently capture the important underlying biological information in the original data matrix. Both the control samples and the severe AD samples are assigned to the correct clusters.

However, comparing ICA with the method according to the present invention shows that the latter achieves remarkably efficient results with fewer components.

The experimental results for the method according to the present invention can also be compared with PCA, for which the PCA results of Kong et al. can likewise be used.

PCA is a linear projection in the sense that the variables of the projection space are linear combinations of gene expressions. The PCA experiment is performed by decomposing the gene expression matrix into principal components that retain 95.5% of the variance. Principal components with low variance, which contain noise, are excluded from clustering.
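Choosing how many principal components to keep for a target retention level can be sketched as below. The example assumes the 95.5% figure refers to the fraction of total variance, and that the eigenvalues of the covariance matrix (the per-component variances) are already available. The function name and the sample eigenvalues are hypothetical.

```python
def components_for_variance(eigenvalues, target=0.955):
    """Smallest number of leading components whose variance share reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)
```

Components beyond the returned count are the low-variance ones that would be dropped before clustering.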

FIG. 2(d) shows the clustering result obtained by PCA. The control samples were clustered successfully apart from the AD samples, but the AD2 sample was clustered incorrectly.

Next, the inventors examined how well each of the above methods distinguishes AD samples from control samples after reconstruction.

The reconstructed data were obtained by projecting the raw data onto the latent variables found by each method. The hierarchical clustering results on the normalized raw data were then compared with those on the data reconstructed by each method.
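One common way to read "reconstruction by projection" is to project a sample onto the weight vectors and then map the hidden variables back to the input space. The sketch below follows that reading and assumes the weight vectors are orthonormal. The document does not spell out the reconstruction formula, so both the formula and the function name are assumptions.

```python
def reconstruct(weights, x):
    """Project x onto the weight vectors, then map the result back to input space."""
    hidden = [sum(wj * xj for wj, xj in zip(w_i, x)) for w_i in weights]
    return [
        sum(y_i * w_i[j] for y_i, w_i in zip(hidden, weights))
        for j in range(len(x))
    ]
```

Whatever lies outside the span of the weight vectors is discarded, which is why the reconstructed data can cluster differently from the raw data.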

FIG. 3(a) shows the clustering result for the normalized raw data. Although some AD samples are clustered together, the higher levels of the hierarchy fail to separate the two distinct clusters.

In the method according to the present invention, the data are reconstructed by projecting the raw data onto four-dimensional non-Gaussian weight vectors. These are obtained by incrementally updating the weight vectors at each data observation, with an exponential forgetting factor of 0.96 and an energy range of 95% to 98%.

FIG. 3(b) shows the clustering result for the data reconstructed by the method according to the present invention. The control samples and AD samples are clearly separated into different groups, showing that the method can maximize the discriminating power of the clustering result. That is, the method according to the present invention can perform markedly more accurate and efficient clustering than the other methods, even with a small number of components.

FIG. 3(c) shows the clustering result for the data reconstructed by incremental PCA.

The reconstructed data in FIG. 3(c) are obtained by projecting the raw data onto the orthogonal weight vectors obtained from incremental PCA. For a fair comparison, the parameters were set to the same values as for the method according to the present invention.

FIG. 3(c) shows that incremental PCA fails to separate one AD sample, AD2, from the control samples.

For PCA and ICA, the 10 principal components retaining 95.5% of the variance and the 11 independent components identified as being associated with biological processes are selected, respectively, to reconstruct the data. In both cases, one AD sample, AD2, is not properly clustered into the AD group.

The inventors performed EEG signal separation and demonstrated the superiority of the method according to the present invention by comparing its results with those of incremental PCA, ICA, and PCA. The method according to the present invention updates the weight vectors each time a new input arrives, deriving hidden variables with non-Gaussian weights. ICA and PCA, by contrast, process the data in batch.

To simplify the experiment, the inventors ran ICA and PCA at the final time point, taking the entire data stream into account.

The energy cannot be determined in advance for ICA. The number of components was therefore derived from PCA so as to retain 95% of the energy. Incremental PCA derives orthogonal features incrementally by updating the orthogonal weights at each time point.

K-nearest neighbor (KNN) with K = 1, 2, 3 and linear discriminant analysis (LDA) are adopted as classifiers. 70% of the data are used to train the classifiers, and the remaining 30% are used for testing. The exponential forgetting factor λ and the energy range for the albino rat EEG signal are varied as shown in Table 2.
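The evaluation protocol can be sketched with a plain K-nearest-neighbor vote over extracted features. This is a minimal illustration using only the Python standard library. The ordered (non-shuffled) 70/30 split, the Euclidean distance and the function names are assumptions, since the document does not state how the split was drawn or which distance was used.

```python
import math
from collections import Counter


def split_train_test(samples, labels, train_fraction=0.7):
    """Use the first train_fraction of the records for training, the rest for testing."""
    cut = round(len(samples) * train_fraction)
    return (samples[:cut], labels[:cut]), (samples[cut:], labels[cut:])


def knn_predict(train_x, train_y, query, k=1):
    """Label the query by majority vote among its k nearest training samples."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, test_x, test_y, k=1):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, q, k) == t for q, t in zip(test_x, test_y))
    return hits / len(test_y)
```

Because the vote scans every training sample, its cost per query grows with the feature length, so a shorter hidden-variable representation makes each classification cheaper.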

On average, the method according to the present invention outperforms the other methods. The average classification rate is highest when λ = 0.96 and the energy range is 95% to 98%. Moreover, the method achieves its best result with KNN under the same conditions.

With the LDA classifier, ICA has the highest classification rate, a rate that the method according to the present invention also attains with λ = 0.98 and an energy range of 95% to 98%. ICA decomposes data into independent, non-Gaussian components.

However, ICA has the drawback that it involves computing a covariance matrix, which requires a large amount of memory. Moreover, conventional ICA must recompute the entire matrix each time new training data are input.

The method according to the present invention stores past values as variables. When new training data are input, it only needs to update these variables rather than recompute the entire matrix, which saves data processing time and memory. In addition, the effect of the method can be maximized by appropriately adjusting the energy range and the value of λ.
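The idea of carrying past values in a variable instead of a buffer can be illustrated with an exponentially forgotten energy sum. The sketch below assumes a recursion of the form d = λ·d + y², which is an assumption and not a formula reproduced in this text, and shows that a single running variable gives the same number as recomputing over the whole stored history. Function names are hypothetical.

```python
def batch_energy(ys, lam):
    """Weighted sum over the whole buffered history (needs every past value)."""
    n = len(ys)
    return sum(lam ** (n - 1 - s) * ys[s] ** 2 for s in range(n))


def streaming_energy(ys, lam):
    """Same quantity, carried in one variable that is updated once per sample."""
    d = 0.0
    for y in ys:
        d = lam * d + y * y
    return d
```

The streaming version touches each sample once and keeps one number of state, whereas the batch version must retain and revisit the entire history whenever a new sample arrives.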

Furthermore, by examining the number of hidden variables found by the above methods, the inventors found that the method according to the present invention and incremental PCA require fewer data features than PCA and ICA, which do not take an incremental approach. That is, the method according to the present invention can achieve more accurate and efficient results while extracting relatively few features. Fewer data features mean less computation by the classifiers, and therefore faster classification.

Hereinafter, the effect of the method according to the present invention is compared with that of ICA from a qualitative point of view. ICA was chosen for comparison because both methods decompose data into non-Gaussian components.

The comparison experiments were performed on the Alzheimer's disease microarray data set and the EEG signal data set, because these data sets exhibit different behaviors in multivariate data. Synthetic data are added to these data sets so that more features and more observations are available.

FIG. 4A shows the execution time versus the number of genes on the microarray data set. In FIG. 4A, the method according to the present invention is plotted with asterisks and ICA with plus signs.

In the microarray experiment, the number of observations and the other parameter settings were fixed, while the number of genes was increased in each loop so that the execution time for different numbers of genes could be recorded for both the method according to the present invention and ICA. The exponential forgetting factor was set to 0.96, the energy range was 95% to 98%, and there were initially three hidden variables.

In FIG. 4A, the execution times of both methods increase proportionally as the number of genes increases. However, ICA clearly requires significantly more computation time than the method according to the present invention as the number of genes grows.

FIG. 4B shows the execution time versus the number of sources on the EEG signal data set. The experimental procedure was the same as for the microarray data set shown in FIG. 4A.

In FIG. 4B, the execution time of ICA increases exponentially as the number of sources increases, whereas the execution time of the method according to the present invention does not increase sharply.

The method according to the present invention involves only floating-point operations, whereas ICA involves covariance matrix computation, so ICA requires longer computation time as the matrix grows. The method according to the present invention is therefore clearly more efficient as the number of genes or sources increases.
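A back-of-the-envelope count makes the scaling argument concrete. The sketch below is not a measurement and not the document's own analysis. It assumes that adding one sample to an n-by-n covariance matrix costs about n² multiply-adds (a full outer product), and that an incremental tracker with k weight vectors needs roughly three passes of length n per vector (a dot product, a residual and a weight correction). Both function names are hypothetical.

```python
def covariance_update_cost(n):
    """Approximate multiply-adds to add one n-dimensional sample to an n x n covariance."""
    return n * n


def tracker_update_cost(n, k):
    """Approximate multiply-adds for one update of k weight vectors of length n."""
    return 3 * n * k
```

Under these assumptions, with n = 512 features and k = 4 hidden variables the counts are 262144 versus 6144 per sample, and the gap widens linearly in n when k stays small.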

FIG. 5A shows the execution time versus the number of experiments on the microarray data set. The parameters were set as in the experiment above, except that the number of genes was fixed. The number of experiments was varied in each iteration to observe the execution time for different experiment sizes.

In FIG. 5A, the execution time does not increase proportionally as the number of experiments increases. This is because, when a new experiment with gene expression data arrives, the method according to the present invention updates the weight vectors with the new input vector using the past values stored as variables.

For ICA, however, the results show a steep upward trend as the number of experiments increases. Moreover, the execution time of ICA is significantly longer than that of the method according to the present invention.

FIG. 5B shows execution time versus stream size for ICA and the method according to the present invention on the EEG signal data set. The experimental conditions were the same as for the microarray data set.

In FIG. 5B, ICA clearly requires an extremely long processing time when the stream size grows very large. In contrast, the method according to the present invention shows only a slight increase in processing time as the stream size increases.

As described above, the comprehensive performance evaluation on the microarray and EEG signal data sets shows that the method according to the present invention is clearly superior with respect to both the number of data features (genes or sources) and the number of observations, and that it is suitable for both microarray and data stream analysis.

In addition, the method according to the present invention can be extended to multi-way data analysis. Multi-way data analysis obtains richer behavioral information by discovering interrelationships among different dimensions. Using the method according to the present invention, a multi-linear structure can be obtained by incrementally exploiting higher-order statistics. If the data consist of two or more modes, the underlying structures can be detected more efficiently using the incremental approach of the present invention.

Although the present invention has been described and illustrated with reference to preferred embodiments that exemplify its technical idea, it is not limited to the exact configuration and operation shown and described. Those skilled in the art will appreciate that many changes and modifications can be made to the present invention without departing from the scope of the technical idea. Accordingly, all such appropriate changes, modifications, and equivalents should be regarded as falling within the scope of the present invention.

Claims

An incremental non-Gaussian analysis method, comprising:
(a) arranging first input vector data that has been input into a matrix;
(b) normalizing the first input vector data arranged as a matrix and decomposing it into a non-Gaussian weight vector; and
(c) updating the non-Gaussian weight vector with second input vector data that is input subsequently.

The method of claim 1,
wherein the first and second input vector data are microarray gene expression data or electroencephalography (EEG) signal data.

The method of claim 1,
wherein step (b) comprises:
(b-1) deriving a hidden variable by projecting the first input vector onto the non-Gaussian weight vector through a linear transformation.

The method of claim 1,
wherein step (c) comprises:
(c-1) calculating a hidden variable from the second input vector using the non-Gaussian weight vector of step (b); and
(c-2) estimating the energy by the following equation using the hidden variable calculated in step (c-1).

(In the above equation, d_i is the energy, λ is the forgetting factor, and y_i is the hidden variable calculated in step (c-1).)

The method of claim 4,
wherein the forgetting factor λ is selected from the range 0.96 to 0.98.

The method of claim 4,
wherein the energy d_i is set between 95% and 98%.

The method of claim 1,
further comprising, after step (c):
(d) repeating step (c) until the updated non-Gaussian weight vector exhibits maximum non-Gaussianity.