WO2023024210A1 - Data dimension reduction method based on fourier-domain principal component analysis - Google Patents

Data dimension reduction method based on fourier-domain principal component analysis

Info

Publication number
WO2023024210A1
Authority
WO
WIPO (PCT)
Prior art keywords
fourier
data
batch
matrix
vector
Prior art date
Application number
PCT/CN2021/120524
Other languages
French (fr)
Chinese (zh)
Inventor
沈项军
徐兆瑞
刘志锋
Original Assignee
江苏大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 江苏大学 (Jiangsu University)
Publication of WO2023024210A1 publication Critical patent/WO2023024210A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2135 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/14 - Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F 17/141 - Discrete Fourier transforms
    • G06F 17/142 - Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting


Abstract

Disclosed in the present invention is a data dimension reduction method based on Fourier-domain principal component analysis. High-dimensional data are projected into the Fourier domain, and the eigenvector problem of principal component analysis is converted, by using the properties of circulant matrices and the Fourier matrix, into a search for meaningful Fourier-domain bases. Because the Fourier-domain bases are predefined and the principal-component distribution of the data is ordered, training can be accelerated by inputting the training samples in batches until the required Fourier bases are stable and ordered. Once the number of Fourier bases and the projection matrix are determined, the projection matrix is multiplied by the high-dimensional data set to obtain a low-dimensional data set, which facilitates fast data processing. By means of the data dimension reduction method provided in the present invention, which builds on principal component analysis and the fast Fourier transform, noise and redundant information in high-dimensional data sets can be removed, thereby reducing unnecessary operations during data processing and improving the running speed and memory efficiency of the algorithm.

Description

A Data Dimensionality Reduction Method Based on Fourier-Domain Principal Component Analysis

Technical Field

The invention belongs to the field of computer science and image processing technology, and in particular relates to a data dimensionality reduction method based on Fourier-domain principal component analysis.
Background Art

Traditional data processing methods can no longer analyze massive data effectively. At the same time, as the dimensionality of the data produced by big data processing and cloud computing keeps increasing, it becomes ever more necessary to reduce the dimensionality of high-dimensional data in order to remove noise and redundant information from high-dimensional data sets, cut unnecessary operations during data processing, and improve the running efficiency of algorithms.

Principal Component Analysis (PCA) is an advanced data exploration algorithm that can be used to find patterns in data and to find transformed representations of the data that emphasize those patterns. Through an orthogonal rotation of the coordinate axes of the original data set, principal component analysis concentrates the originally scattered sample points near a few characteristic coordinate axes; when the principal components that carry little of the original information are discarded, the dimensionality of the original data is reduced and the data are compressed. Principal component analysis is a powerful data transformation technique that can be applied before further analysis. It is especially useful for high-dimensional data sets and helps with understanding the underlying data structure, cluster analysis, regression analysis, and many other tasks.

However, although principal component analysis performs well, its high computational complexity limits its application to massive data processing problems. To handle large-scale data, many optimization techniques have been proposed to accelerate the principal component analysis algorithm. Depending on the strategy used, existing optimization techniques fall roughly into two categories. One uses the Nyström matrix approximation technique, which reduces the computational cost of the eigendecomposition step by using the eigenvectors computed from a sub-matrix to approximate the eigenvectors of the original matrix. The other uses Random Fourier Features to approximate the matrix, which transforms the original kernel PCA (KPCA) problem into a high-dimensional linear PCA problem. Although these methods address the processing of massive data, their speed and memory efficiency are still insufficient, and fast, efficient computation on massive data remains an open problem.
Summary of the Invention

To address the deficiencies of the prior art, the present invention proposes a data dimensionality reduction method based on Fourier-domain principal component analysis. The fast Fourier transform is used to observe each data point of a sequence from the frequency-domain perspective, yielding a new principal component analysis algorithm based on the Fourier transform. By recasting the eigenvector problem of principal component analysis as the search for meaningful Fourier-domain bases, and by inputting the training samples in batches, the eigenvalue distribution of the global sample set is approximated from the stable, ordered eigenvalues of a subset of samples. This improves the computing speed and memory utilization of data dimensionality reduction and provides support and acceleration for principal component analysis of large-scale data.

The technical solution adopted by the present invention is as follows.

A data dimensionality reduction method based on Fourier-domain principal component analysis comprises the following steps:

Step 1, data initialization: collect a data sample set X as the required data set, where X is an M×N matrix; initialize the current batch number j, the initial M×M zero matrix Λ_0, the random Fourier-basis set P_0 and the discrete Fourier matrix F; M is the dimensionality of the data set X and N is the number of data samples;
Step 2, construct batch samples and their Fourier data representation: randomly input a batch sample set X_b ∈ R^(M×b) containing b samples, and apply the Fourier transform to each sample vector x_i in X_b to obtain x̂_i;
Step 3, for each randomly input batch sample set X_b, compute the eigenvalue matrix Λ_b obtained from this batch, expressed as

Λ_b = (1/b) Σ_{i=1}^{b} diag(x̂_i ⊙ x̂_i*),

where x̂_i* is the complex conjugate of x̂_i;

as small batches keep being input, the eigenvalue matrix Λ_b obtained from each batch sample set X_b is added to Λ_j, where Λ_j denotes the accumulation of the eigenvalues after the j-th batch sample set has been input; the process is expressed as Λ_j ← Λ_{j-1} + Λ_b, where Λ_{j-1} denotes the eigenvalue accumulation obtained after the (j-1)-th batch sample set X_b has been input;
Step 4, obtain the Fourier projection bases of the batch sample sets: take v̂ to be the column vectors of F; sort the diagonal elements λ_1, λ_2, …, λ_M of the eigenvalue matrix Λ_j in ascending order, and select the Fourier bases p̂_1, p̂_2, …, p̂_r in the matrix F corresponding to the r smallest eigenvalues λ_1, λ_2, …, λ_r to form the current projection set P_j = {p̂_1, p̂_2, …, p̂_r}, where r is the preset number of required Fourier projection bases;
Step 5, if the set P_j is identical to P_{j-1}, stop repeating steps 2 to 4 and take the obtained Fourier bases p̂_1, p̂_2, …, p̂_r as the final Fourier projection bases; otherwise execute steps 2 to 4 again and update the current batch number, j ← j+1;
Step 6, apply the inverse Fourier transform to each Fourier projection basis in the set P_j, p_i = F^(-1)(p̂_i), i = 1, …, r, to form the projection matrix V′ = [p_1 p_2 … p_r]; multiplying the high-dimensional data set X by the projection matrix V′^T yields the dimension-reduced data set X′ = V′^T X.
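The six steps above can be condensed into a short numerical sketch. The NumPy routine below is only one possible reading of the method, not the patented implementation: the function and variable names, the unitary normalization of the Fourier matrix, the use of the mean power spectrum for Λ_b and the cap on the number of batches are all assumptions made for illustration.

```python
import numpy as np

def fourier_domain_pca(X, r=50, g=0.01, max_batches=1000, seed=0):
    """Sketch of steps 1-6: batch-wise Fourier-domain PCA (assumed reading of the method)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    b = max(1, int(N * g))                              # step 1: batch size b = N*g
    F = np.fft.fft(np.eye(M), axis=0) / np.sqrt(M)      # discrete Fourier matrix (unitary scaling assumed)
    lam = np.zeros(M)                                   # diagonal of the accumulated eigenvalue matrix
    prev = None                                         # indices of the previously selected bases (P_{j-1})
    for j in range(1, max_batches + 1):
        Xb = X[:, rng.choice(N, size=b, replace=False)]     # step 2: random batch X_b
        Xb_hat = np.fft.fft(Xb, axis=0)                     # Fourier representation of each sample
        lam = lam + np.mean(np.abs(Xb_hat) ** 2, axis=1)    # step 3: Lambda_j <- Lambda_{j-1} + Lambda_b
        idx = np.sort(np.argsort(lam)[:r])                  # step 4: r smallest eigenvalues -> Fourier bases
        if prev is not None and np.array_equal(idx, prev):
            break                                           # step 5: the selection is stable, stop training
        prev = idx
    P = F[:, idx]                                       # selected Fourier projection bases
    V = np.real(np.fft.ifft(P, axis=0))                 # step 6: inverse FFT of each basis -> V'
    return V.T @ X                                      # reduced data X' = V'^T X
```

Called on an M×N array, the routine returns an r×N array; the individual steps are spelled out again, with the notation of the embodiment, in the detailed description below.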
Further, a threshold g is set, and the number of samples in a batch sample set X_b is b = N·g, with b << M.

Further, the threshold g takes a value between 0.5% and 5%.
Further, applying the Fourier transform to a sample vector x_i yields x̂_i, expressed as

x̂_i = F(x_i),

where x̂_i is the vector generated by the Fourier transform, F(x_i) denotes the fast Fourier transform of the vector x_i, and F is the discrete Fourier matrix.
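As an illustration of this relation, the snippet below checks numerically that the fast Fourier transform of a sample vector coincides with multiplication by the discrete Fourier matrix. The unitary scaling 1/√M of F, and hence the √M factor in the check, are assumptions made here; the patent itself does not reproduce the exact normalization at this point.

```python
import numpy as np

M = 8
x = np.random.randn(M)                            # a single M-dimensional sample vector
F = np.fft.fft(np.eye(M), axis=0) / np.sqrt(M)    # discrete Fourier matrix, unitary scaling assumed
x_hat = np.fft.fft(x)                             # fast Fourier transform of the sample vector
assert np.allclose(x_hat, np.sqrt(M) * F @ x)     # x_hat agrees with sqrt(M) * F * x
```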
Further, the sample vectors x_i are obtained as follows: the data sample set is written as X = [x_1 x_2 … x_N], and the sample vector x_i is formed by the i-th column of data samples in the data sample set X, i = 1, 2, …, N, where N is the number of data samples; the sample vector x_i contains the M-dimensional data sample of the i-th column.
Further, in this embodiment, C(x_i) is the circulant matrix constructed from the sample x_i, expressed as

C(x_i) = circ(x_i),

where circ denotes constructing the circulant matrix whose rows are the cyclic shifts of the vector x_i. The characteristic property of such a circulant matrix is that it can be diagonalized by the Fourier transform,

C(x_i) = F diag(x̂_i) F^H,

where F^H = (F*)^T is the conjugate transpose of the Fourier matrix F and H denotes the conjugate-transpose operation. The eigenvalues of the current batch of samples X_b are obtained as follows:

(1/b) Σ_{i=1}^{b} F diag(x̂_i ⊙ x̂_i*) F^H v̂ = λ v̂,

where λ is the Lagrange factor; b is the number of batch samples; x̂_i* is the complex conjugate of x̂_i; ⊙ is the element-wise product of matrix elements; diag converts a vector into the diagonal matrix whose main diagonal consists of the vector's elements; and v̂ is the principal projection vector, i.e. the eigenvector, of the training data set X. For each randomly input batch of samples X_b we can therefore obtain

Λ_b = (1/b) Σ_{i=1}^{b} diag(x̂_i ⊙ x̂_i*),

where Λ_b is the eigenvalue matrix obtained from this batch of samples.
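The circulant-diagonalization property used in this derivation can be verified numerically. In the sketch below, SciPy's circulant builds the matrix from its first column, so it is transposed to obtain rows that are cyclic shifts of x; the unitary scaling of F is again an assumption. The second check shows why the per-frequency power x̂ ⊙ x̂* plays the role of the batch eigenvalues collected in Λ_b.

```python
import numpy as np
from scipy.linalg import circulant

M = 8
x = np.random.randn(M)
x_hat = np.fft.fft(x)
F = np.fft.fft(np.eye(M), axis=0) / np.sqrt(M)      # unitary DFT matrix (assumed normalization)

C = circulant(x).T                                  # circulant matrix whose rows are cyclic shifts of x
assert np.allclose(C, F @ np.diag(x_hat) @ F.conj().T)          # C(x) = F diag(x_hat) F^H

# Hence C C^H = F diag(|x_hat|^2) F^H: the Fourier bases are its eigenvectors and the
# per-frequency power |x_hat|^2 supplies the eigenvalues accumulated in Lambda_b.
assert np.allclose(C @ C.conj().T, F @ np.diag(np.abs(x_hat) ** 2) @ F.conj().T)
```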
Further, the Fourier-domain basis vectors are given by the discrete Fourier matrix F (written out in the detailed description below), where V is the set of projection column vectors v, i.e. V = [v_1 v_2 … v_n].
Beneficial effects of the present invention:

1. The repeatability of the data sequence is exploited to model the data in the Fourier domain. The fast Fourier transform is used to observe each data point of the sequence from the frequency-domain perspective, yielding a new Fourier-domain principal component analysis algorithm. Finding the projection target of principal component analysis can then be achieved by finding predefined, meaningful Fourier bases.

2. Owing to the arithmetic properties of the Fourier domain, complex matrix inversion in the time domain can be avoided through simple element-wise (dot-product) matrix operations in the Fourier domain.

3. To obtain the Fourier bases meaningfully, the training process does not need to load all data samples; only a few batches of data samples need to be loaded, until the ordering of the Fourier bases stabilizes, which undoubtedly makes more efficient use of memory.

4. By recasting the eigenvector problem of principal component analysis as the search for meaningful Fourier-domain bases, and by inputting the training samples in batches, the eigenvalue distribution of the global sample set is approximated from the stable, ordered eigenvalues of a subset of samples. This improves the computing speed and memory utilization of the dimensionality reduction process and provides support and acceleration for principal component analysis of massive data.
Brief Description of the Drawings

Fig. 1 is the main flowchart of the method proposed by the present invention.

Detailed Description of the Embodiments

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.

As shown in Fig. 1, a data dimensionality reduction method based on Fourier-domain principal component analysis includes the following steps.

Step 1, data set preparation: collect a data sample set X (M×N) as the required data set and write it as X = [x_1 x_2 … x_N], where the sample vector x_i is formed by the i-th column of data samples in X, i = 1, 2, …, N; N is the number of data samples, and each sample vector x_i contains the M-dimensional data sample of the i-th column.

Initialize the parameters j, Λ_0, F and P_0, where j is the current batch number of the batch-wise training, with j = 1; Λ_0 is the initial M×M zero matrix; F is the discrete Fourier (DFT) matrix; and P_0 is a random set of Fourier bases whose elements are column vectors of the discrete Fourier matrix F. The discrete Fourier matrix F is expressed as

F = [v_1 v_2 … v_n] =
    [ 1    1          1           …    1                ]
    [ 1    ω          ω^2         …    ω^(M-1)          ]
    [ 1    ω^2        ω^4         …    ω^(2(M-1))       ]
    [ ⋮    ⋮          ⋮                ⋮                ]
    [ 1    ω^(M-1)    ω^(2(M-1))  …    ω^((M-1)(M-1))   ]

where V is the set of projection column vectors v, i.e. V = [v_1 v_2 … v_n], n is the number of columns of the set V, and ω is a complex number that can be expressed as ω = e^(-2πi/M), with i the imaginary unit.
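A possible initialization of Step 1 is sketched below in NumPy. The placeholder data matrix, the random seed, the unitary scaling of F and the size of the random index set standing in for P_0 are illustrative assumptions rather than values fixed by the patent.

```python
import numpy as np

X = np.random.randn(256, 1000)       # placeholder data set: M = 256 dimensions, N = 1000 samples
M, N = X.shape
j = 1                                # current batch number of the batch-wise training
Lambda_acc = np.zeros(M)             # diagonal of the initial M x M zero matrix Lambda_0
k = np.arange(M)
F = np.exp(-2 * np.pi * 1j * np.outer(k, k) / M) / np.sqrt(M)   # DFT matrix with omega = e^(-2*pi*i/M)
rng = np.random.default_rng(0)
P_prev = np.sort(rng.choice(M, size=50, replace=False))         # P_0: random initial Fourier-basis (column) indices
```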
Step 2, construct batch samples and their Fourier data representation.

According to the threshold g, a batch of b = N·g samples X_b ∈ R^(M×b) is input at random, with g between 0.5% and 5%. The fast Fourier transform is applied to the batch samples X_b so that the data are observed from the frequency-domain perspective:

x̂_i = F(x_i), i = 1, …, b,      (1)

where F(x_i) denotes the fast Fourier transform of the vector x_i; F is the discrete Fourier matrix; x̂_i is the vector generated by the Fourier transform, the hat (∧) denoting a vector produced by the fast Fourier transform; and x_i is the i-th column vector of the data sample set X.
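Continuing the initialization sketch above, Step 2 amounts to drawing a random subset of columns and transforming them column-wise; g = 1% is an arbitrary choice within the stated 0.5% to 5% range.

```python
g = 0.01                                    # threshold g, chosen inside the 0.5%-5% range
b = max(1, int(N * g))                      # batch size b = N * g
cols = rng.choice(N, size=b, replace=False) # indices of the randomly selected samples
X_b = X[:, cols]                            # batch sample set X_b in R^(M x b)
X_b_hat = np.fft.fft(X_b, axis=0)           # column-wise FFT: x_hat_i for every sample of the batch
```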
Step 3, obtain the eigenvalues of the current batch of samples.

The eigenvalues of the current batch of samples X_b are obtained as follows:

(1/b) Σ_{i=1}^{b} F diag(x̂_i ⊙ x̂_i*) F^H v̂ = λ v̂,      (2)

where λ is the Lagrange factor; b is the number of batch samples; x̂_i* is the complex conjugate of x̂_i; ⊙ is the element-wise product of matrix elements; diag converts a vector into the diagonal matrix whose main diagonal consists of the vector's elements; v̂ is the principal projection vector, i.e. the eigenvector, of the training data set X; F^H is the conjugate transpose of the Fourier matrix F, and H denotes the conjugate-transpose operation. According to formula (2), for each randomly input batch of samples X_b we can obtain

Λ_b = (1/b) Σ_{i=1}^{b} diag(x̂_i ⊙ x̂_i*),      (3)

where Λ_b is the eigenvalue matrix obtained from this batch of samples. Let Λ_j denote the accumulation of the eigenvalues after the j-th batch of samples has been input, j being the current batch number. As small batches keep being input, the eigenvalue matrix Λ_b obtained from each batch is added to Λ_j:
Λ_j ← Λ_{j-1} + Λ_b      (4)
where Λ_{j-1} denotes the eigenvalue accumulation obtained after the (j-1)-th batch of samples has been input.
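In the running sketch, formulas (3) and (4) reduce to a per-frequency power average over the batch and a running sum across batches; only the diagonals of Λ_b and Λ_j are stored. Reading formula (3) as the mean power spectrum is our assumption.

```python
Lambda_b = np.mean(np.abs(X_b_hat) ** 2, axis=1)    # formula (3): diagonal of the batch eigenvalue matrix
Lambda_acc = Lambda_acc + Lambda_b                  # formula (4): Lambda_j <- Lambda_{j-1} + Lambda_b
```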
Step 4, obtain the Fourier projection bases of the batch samples.

According to formula (2), v̂ is taken to be the column vectors of F. The diagonal elements λ_1, λ_2, …, λ_M of the eigenvalue matrix Λ_j are sorted in ascending order, and the Fourier bases p̂_1, p̂_2, …, p̂_r in the matrix F corresponding to the r smallest eigenvalues λ_1, λ_2, …, λ_r are selected to form the current projection set P_j = {p̂_1, p̂_2, …, p̂_r}, where r is the preset number of required Fourier projection bases, taken here to be 50.
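In the sketch, Step 4 is an ascending sort of the accumulated diagonal followed by keeping the Fourier-matrix columns of the r smallest entries; representing P_j by its sorted column indices simplifies the later comparison with P_{j-1}.

```python
r = 50                                 # preset number of required Fourier projection bases
order = np.argsort(Lambda_acc)         # ascending order of the diagonal eigenvalues
P_idx = np.sort(order[:r])             # indices of the r smallest eigenvalues -> columns of F
P_j = F[:, P_idx]                      # current projection set: the selected Fourier bases
```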
Step 5, if the set P_j is identical to P_{j-1}, stop repeating steps 2 to 4 and take the obtained Fourier bases p̂_1, p̂_2, …, p̂_r as the final Fourier projection bases. Otherwise, execute steps 2 to 4 again and update the current batch number, j ← j+1.
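With the index representation used in the sketch, the comparison of P_j and P_{j-1} is a simple array equality; in the full loop, steps 2 to 4 are repeated until it holds.

```python
if np.array_equal(P_idx, P_prev):      # P_j identical to P_{j-1}: the basis selection is stable
    P_final_idx = P_idx                # keep these indices as the final Fourier projection bases
else:
    P_prev = P_idx                     # otherwise continue training with the next batch
    j += 1                             # j <- j + 1
```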
Step 6, apply the inverse Fourier transform to each Fourier projection basis in the set P_j, p_i = F^(-1)(p̂_i), i = 1, …, r, to obtain the projection matrix V′ = [p_1 p_2 … p_r]. Multiplying the high-dimensional data set X by the projection matrix V′^T gives the dimension-reduced data set X′ = V′^T X.
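In the sketch, Step 6 applies the inverse FFT to each selected basis and projects the data; keeping only the real part of the inverse transform is our simplification for obtaining a real-valued projection matrix. Put together, these fragments reproduce the end-to-end routine given after the summary of steps 1 to 6 above.

```python
V = np.real(np.fft.ifft(F[:, P_idx], axis=0))    # p_i = inverse FFT of each selected basis; V' = [p_1 ... p_r]
X_reduced = V.T @ X                              # X' = V'^T X, an r x N low-dimensional data set
```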
The above embodiments are only intended to illustrate the design concept and features of the present invention, and their purpose is to enable those skilled in the art to understand the content of the present invention and implement it accordingly; the scope of protection of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications made according to the principles and design ideas disclosed in the present invention fall within the scope of protection of the present invention.

Claims (7)

  1. A data dimensionality reduction method based on Fourier-domain principal component analysis, characterized in that it comprises the following steps:

    Step 1, data initialization: collect a data sample set X as the required data set, where X is an M×N matrix; initialize the current batch number j, the initial M×M zero matrix Λ_0, the random Fourier-basis set P_0 and the discrete Fourier matrix F; M is the dimensionality of the data set X and N is the number of data samples;
    Step 2, construct batch samples and their Fourier data representation: randomly input a batch sample set X_b ∈ R^(M×b) containing b samples, and apply the Fourier transform to each sample vector x_i in X_b to obtain x̂_i;
    Step 3, for each randomly input batch sample set X_b, compute the eigenvalue matrix Λ_b obtained from this batch, expressed as Λ_b = (1/b) Σ_{i=1}^{b} diag(x̂_i ⊙ x̂_i*), where x̂_i* is the complex conjugate of x̂_i;

    as small batches keep being input, the eigenvalue matrix Λ_b obtained from each batch sample set X_b is added to Λ_j, where Λ_j denotes the accumulation of the eigenvalues after the j-th batch sample set has been input; the process is expressed as Λ_j ← Λ_{j-1} + Λ_b, where Λ_{j-1} denotes the eigenvalue accumulation obtained after the (j-1)-th batch sample set X_b has been input;
    Step 4, obtain the Fourier projection bases of the batch sample sets: take v̂ to be the column vectors of F; sort the diagonal elements λ_1, λ_2, …, λ_M of the eigenvalue matrix Λ_j in ascending order, and select the Fourier bases p̂_1, p̂_2, …, p̂_r in the matrix F corresponding to the r smallest eigenvalues λ_1, λ_2, …, λ_r to form the current projection set P_j = {p̂_1, p̂_2, …, p̂_r}, where r is the preset number of required Fourier projection bases;
    Step 5, if the set P_j is identical to P_{j-1}, stop repeating steps 2 to 4 and take the obtained Fourier bases p̂_1, p̂_2, …, p̂_r as the final Fourier projection bases; otherwise execute steps 2 to 4 again and update the current batch number, j ← j+1;
    Step 6, apply the inverse Fourier transform to each Fourier projection basis in the set P_j, p_i = F^(-1)(p̂_i), i = 1, …, r, to form the projection matrix V′ = [p_1 p_2 … p_r]; multiplying the high-dimensional data set X by the projection matrix V′^T yields the dimension-reduced data set X′ = V′^T X.
  2. The data dimensionality reduction method based on Fourier-domain principal component analysis according to claim 1, characterized in that a threshold g is set, and the number of samples in a batch sample set X_b is b = N·g, with b << M.

  3. The data dimensionality reduction method based on Fourier-domain principal component analysis according to claim 2, characterized in that the threshold g takes a value between 0.5% and 5%.
  4. The data dimensionality reduction method based on Fourier-domain principal component analysis according to claim 1, characterized in that applying the Fourier transform to a sample vector x_i yields x̂_i, expressed as x̂_i = F(x_i), where x̂_i is the vector generated by the Fourier transform, F(x_i) denotes the fast Fourier transform of the vector x_i, and F is the discrete Fourier matrix.

  5. The data dimensionality reduction method based on Fourier-domain principal component analysis according to claim 4, characterized in that the sample vectors x_i are obtained as follows: the data sample set is written as X = [x_1 x_2 … x_N], and the sample vector x_i is formed by the i-th column of data samples in the data sample set X, i = 1, 2, …, N, where N is the number of data samples; the sample vector x_i contains the M-dimensional data sample of the i-th column.
  6. The data dimensionality reduction method based on Fourier-domain principal component analysis according to claim 1, characterized in that the eigenvalues of the current batch of samples X_b are obtained as follows:

    (1/b) Σ_{i=1}^{b} F diag(x̂_i ⊙ x̂_i*) F^H v̂ = λ v̂,

    where λ is the Lagrange factor; b is the number of batch samples; x̂_i* is the complex conjugate of x̂_i; ⊙ is the element-wise product of matrix elements; diag converts a vector into the diagonal matrix whose main diagonal consists of the vector's elements; and v̂ is the principal projection vector, i.e. the eigenvector, of the training data set X; for each randomly input batch of samples X_b we can obtain

    Λ_b = (1/b) Σ_{i=1}^{b} diag(x̂_i ⊙ x̂_i*),

    where Λ_b is the eigenvalue matrix obtained from this batch of samples.
  7. The data dimensionality reduction method based on Fourier-domain principal component analysis according to claim 4, characterized in that the Fourier-domain basis vectors are the columns of

    V = [v_1 v_2 … v_n] =
        [ 1    1          1           …    1                ]
        [ 1    ω          ω^2         …    ω^(M-1)          ]
        [ ⋮    ⋮          ⋮                ⋮                ]
        [ 1    ω^(M-1)    ω^(2(M-1))  …    ω^((M-1)(M-1))   ]

    where V is the set of projection column vectors v, i.e. V = [v_1 v_2 … v_n], n is the number of columns of the set V, and ω is a complex number that can be expressed as ω = e^(-2πi/M), with i the imaginary unit.
PCT/CN2021/120524 2021-08-23 2021-09-26 Data dimension reduction method based on fourier-domain principal component analysis WO2023024210A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110968131.9A CN113743485A (en) 2021-08-23 2021-08-23 Data dimension reduction method based on Fourier domain principal component analysis
CN202110968131.9 2021-08-23

Publications (1)

Publication Number Publication Date
WO2023024210A1 true WO2023024210A1 (en) 2023-03-02

Family

ID=78732295

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120524 WO2023024210A1 (en) 2021-08-23 2021-09-26 Data dimension reduction method based on fourier-domain principal component analysis

Country Status (2)

Country Link
CN (1) CN113743485A (en)
WO (1) WO2023024210A1 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4292837B2 (en) * 2002-07-16 2009-07-08 日本電気株式会社 Pattern feature extraction method and apparatus

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682089A (en) * 2012-04-24 2012-09-19 浙江工业大学 Method for data dimensionality reduction by identifying random neighbourhood embedding analyses
CN102938072A (en) * 2012-10-20 2013-02-20 复旦大学 Dimension reducing and sorting method of hyperspectral imagery based on blocking low rank tensor analysis
US20200272651A1 (en) * 2019-02-22 2020-08-27 International Business Machines Corporation Heuristic dimension reduction in metadata modeling
CN112149045A (en) * 2020-08-19 2020-12-29 江苏大学 Dimension reduction and correlation analysis method suitable for large-scale data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861224A (en) * 2023-09-04 2023-10-10 鲁东大学 Intermittent process soft measurement modeling system based on intermittent process soft measurement modeling method
CN116861224B (en) * 2023-09-04 2023-12-01 鲁东大学 Intermittent process soft measurement modeling system based on intermittent process soft measurement modeling method

Also Published As

Publication number Publication date
CN113743485A (en) 2021-12-03

Similar Documents

Publication Publication Date Title
WO2022037012A1 (en) Dimension reduction and correlation analysis method applicable to large-scale data
Zhang et al. Robust low-rank kernel multi-view subspace clustering based on the schatten p-norm and correntropy
Shao et al. A regularization for the projection twin support vector machine
JP2023535109A (en) Method and apparatus for acquiring ground state of quantum system, computer equipment, storage medium, and computer program
Liang et al. Variational quantum algorithms for dimensionality reduction and classification
CN113496285A (en) Data processing method and device based on quantum circuit, electronic device and medium
WO2023024210A1 (en) Data dimension reduction method based on fourier-domain principal component analysis
Srinivasan et al. GPUML: Graphical processors for speeding up kernel machines
Ambainis et al. Spatial search on grids with minimum memory
Sasaki et al. Direct density-derivative estimation and its application in KL-divergence approximation
Saade et al. Clustering from sparse pairwise measurements
Jiang et al. Robust and efficient computation of eigenvectors in a generalized spectral method for constrained clustering
Rančić Noisy intermediate-scale quantum computing algorithm for solving an n-vertex MaxCut problem with log (n) qubits
Huang et al. Stochastic alternating direction method of multipliers with variance reduction for nonconvex optimization
Li et al. Sub-selective quantization for large-scale image search
Wang et al. Accelerating nearest neighbor partitioning neural network classifier based on CUDA
Xu et al. A practical Riemannian algorithm for computing dominant generalized Eigenspace
Siegel et al. Adaptive neuron apoptosis for accelerating deep learning on large scale systems
Zhang et al. Sparse semi-supervised learning on low-rank kernel
Lin et al. Online kernel learning with nearly constant support vectors
Lu et al. Complexity-reduced implementations of complete and null-space-based linear discriminant analysis
WO2022188711A1 (en) Svm model training method and apparatus, device, and computer-readable storage medium
Liu et al. Ensemble kernel method: SVM classification based on game theory
Du et al. Maxios: Large scale nonnegative matrix factorization for collaborative filtering
Dubout et al. Accelerated training of linear object detectors

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE