WO2022037012A1

WO2022037012A1 - Dimension reduction and correlation analysis method applicable to large-scale data

Info

Publication number: WO2022037012A1
Application number: PCT/CN2021/073088
Authority: WO
Inventors: 沈项军; 徐兆瑞
Original assignee: 江苏大学
Priority date: 2020-08-19
Filing date: 2021-01-21
Publication date: 2022-02-24
Also published as: CN112149045A

Abstract

Disclosed in the present invention is a dimension reduction and correlation analysis method applicable to large-scale data. High-dimensional data is projected into the Fourier domain, so that the problem of solving for eigenvectors in linear correlation analysis is transformed into a search for meaningful Fourier-domain bases. Because the Fourier-domain bases are predefined and the eigenvalue distribution of the data is ordered, training samples are input in batches to accelerate training until the required Fourier bases are stable and ordered. The number of Fourier bases and a projection matrix are then determined, and the projection matrix is multiplied by the high-dimensional data set to obtain a low-dimensional data set, which facilitates fast data processing. Based on the fast Fourier transform and correlation analysis, the data dimension reduction method of the present invention can eliminate noise and redundant information in a high-dimensional data set and reduce unnecessary operations in data processing, thereby improving running speed and memory utilization efficiency in dimension reduction calculations.

Description

A dimensionality reduction and correlation analysis method suitable for large-scale data

Technical field

The invention belongs to the field of computer science and image processing technology, and in particular relates to a dimensionality reduction and correlation analysis method suitable for large-scale data.

Background art

Traditional data processing methods have been unable to effectively analyze massive data. At the same time, with the continuous increase of data dimensions generated by big data processing and cloud computing, in many fields of research and applications, it is usually necessary to observe data containing multiple variables, and collect a large amount of data to analyze and find rules. Multivariate large data sets will undoubtedly provide rich information for research and application, but also increase the workload of data collection to a certain extent.

Canonical Correlation Analysis (CCA) is one of the most commonly used algorithms for mining correlations in data. It is also a dimensionality reduction technique: it can be used to test for correlations and to find transformed representations of the data that emphasize them. In essence, CCA selects several representative composite indicators (linear combinations of variables) from each of two groups of random variables and uses the correlation between these indicators to represent the correlation between the original two groups. This helps with understanding the underlying data structure and supports cluster analysis, regression analysis, and many other tasks.
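As a point of reference, classical linear CCA can be sketched in a few lines of NumPy. This is the textbook whitening-plus-SVD formulation, not the Fourier-domain method of the invention; the function name `linear_cca`, the ridge term `reg`, and the toy data are illustrative assumptions.

```python
import numpy as np

def linear_cca(X, Y, reg=1e-6):
    """Textbook linear CCA. X is (M1, N), Y is (M2, N); columns are samples."""
    N = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)   # center each attribute (row)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Cxx = Xc @ Xc.T / N + reg * np.eye(X.shape[0])   # regularized covariances
    Cyy = Yc @ Yc.T / N + reg * np.eye(Y.shape[0])
    Cxy = Xc @ Yc.T / N
    Lx = np.linalg.cholesky(Cxx)             # Cxx = Lx Lx^T
    Ly = np.linalg.cholesky(Cyy)
    # Whitened cross-covariance: Lx^{-1} Cxy Ly^{-T}
    T = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U)            # projection directions for X
    Wy = np.linalg.solve(Ly.T, Vt.T)         # projection directions for Y
    return Wx, Wy, s                         # s holds the canonical correlations

# Toy data (assumption): Y is almost a linear function of two attributes of X.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 500))
Y = 2.0 * X[:2] + 0.01 * rng.standard_normal((2, 500))
Wx, Wy, s = linear_cca(X, Y)
```

The cost of this formulation is dominated by forming and factorizing the covariance matrices, which grows quickly with the data dimension.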

However, although canonical correlation analysis performs well, its high computational complexity limits its application to massive data processing problems. To handle large-scale data, many optimization techniques have been proposed to speed up correlation analysis algorithms. According to the strategy they adopt, the existing techniques fall roughly into two categories. The first is the Nyström matrix approximation, which approximates the eigenvectors of the original matrix by the eigenvectors computed for a submatrix, reducing the computational cost of the eigendecomposition step. The second approximates the matrix using random Fourier features, which transforms the original kernel CCA (KCCA) problem into a high-dimensional linear CCA problem. Although these methods make massive data tractable, their speed and memory efficiency remain insufficient, and fast, efficient computation on massive data is still an open problem.
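The random Fourier feature idea mentioned above can be illustrated as follows. This is a minimal sketch of the standard construction for a Gaussian (RBF) kernel, assuming the kernel $k(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$; the function name `rff_map` and the parameters `D`, `gamma`, and `seed` are illustrative and do not come from the source.

```python
import numpy as np

def rff_map(X, D=5000, gamma=0.5, seed=0):
    """Map the columns of X (shape (M, N)) to D random Fourier features."""
    rng = np.random.default_rng(seed)
    M = X.shape[0]
    # Frequencies drawn from the spectral density of the RBF kernel: N(0, 2*gamma*I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, M))
    b = rng.uniform(0.0, 2.0 * np.pi, size=(D, 1))   # random phases
    return np.sqrt(2.0 / D) * np.cos(W @ X + b)      # shape (D, N)

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 2))          # two 4-dimensional samples as columns
Z = rff_map(X)
approx = float(Z[:, 0] @ Z[:, 1])        # inner product of the feature vectors
exact = float(np.exp(-0.5 * np.sum((X[:, 0] - X[:, 1]) ** 2)))  # kernel with gamma = 0.5
```

Once the data is mapped this way, linear CCA on the features stands in for kernel CCA, at the price of working in a `D`-dimensional space.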

Summary of the invention

In view of the deficiencies in the prior art, the present invention proposes a dimensionality reduction and correlation analysis method suitable for large-scale data. The eigenvector problem of correlation analysis is recast as a search for meaningful Fourier-domain bases, and training samples are input in batches, so that the stable, ordered eigenvalues of partial samples approximate the eigenvalue distribution of the global sample. This improves the operation speed and memory utilization of the data dimensionality reduction process and supports and accelerates the correlation analysis of massive data.

The technical scheme adopted in the present invention is as follows:

A dimensionality reduction and correlation analysis method suitable for large-scale data, comprising the following steps:

Step 1, data initialization: collect the data sample sets $X$ ($M_1 \times N$) and $Y$ ($M_2 \times N$) as the required data sets, and initialize the current batch number $j$, the dimension parameter $M$, the initial $M \times M$ zero matrix $\Lambda_0$, the random Fourier basis set $P_0$, and the discrete Fourier matrix $F$, where $M_1$ and $M_2$ are the dimensions of the data sets $X$ and $Y$ respectively and $N$ is the number of data samples;

Step 2, construct the Fourier data representation of the batch samples: randomly input batch sample sets $X_b$ and $Y_b$, each containing $b$ samples; pad $X_b$ and $Y_b$ with zero elements up to dimension $M$; and apply the Fourier transform to the samples $x_i$ and $y_i$ in $X_b$ and $Y_b$ to obtain $\hat{x}_i$ and $\hat{y}_i$ respectively;

Step 3, for each batch of randomly input samples $X_b$, $Y_b$, calculate the eigenvalue matrix $\Lambda_b$ of that batch. As mini-batches continue to be input, the eigenvalue matrix $\Lambda_b$ of each batch is added to $\Lambda_j$, which denotes the accumulated eigenvalues after the $j$-th batch has been input: $\Lambda_j \leftarrow \Lambda_{j-1} + \Lambda_b$, where $\Lambda_{j-1}$ is the accumulation of eigenvalues obtained after inputting $j-1$ batches of samples;

Step 4, obtain the Fourier projection basis of the batch samples: the eigenvector (main projection vector) is taken to be a column vector of $F$. Sort the diagonal elements $\lambda_1, \lambda_2, \ldots, \lambda_M$ of the eigenvalue matrix $\Lambda_j$ in ascending order, and select the Fourier bases in matrix $F$ corresponding to the $r$ smallest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_r$ to form the current projection set $P_j$, where $r$ is the preset number of required Fourier projection bases;

Step 5, if the set $P_j$ is the same as $P_{j-1}$, stop executing steps 2 to 4 and take $P_j$ as the final Fourier projection basis; otherwise update the current input batch number, $j \leftarrow j+1$, and perform steps 2 to 4 again;

Step 6, perform an inverse Fourier transform on each Fourier projection basis in the set $P_j$ to obtain $p_i$, $i = 1, \ldots, r$, which form the projection matrix $V' = [p_1\ p_2\ \ldots\ p_r]$; multiplying the high-dimensional data set $X$ by the projection matrix $V'^{T}$ gives the dimension-reduced data set $X' = V'^{T} X$.

Further, the dimension parameter $M$ is required to satisfy $M \ge M_1$ and $M \ge M_2$;

Further, the discrete Fourier matrix (DFT) $F$ is expressed as:

where $\omega$ is the complex number $\omega = e^{-2\pi i/M}$ and $i$ is the imaginary unit.
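The matrix entries themselves are not shown above. As a minimal sketch, assuming the conventional unnormalized form in which entry $(k, n)$ of $F$ is $\omega^{kn}$ for $k, n = 0, \ldots, M-1$ (any scaling factor used in the original is not known), the matrix can be built and checked against a library FFT as follows. The helper name `dft_matrix` is illustrative.

```python
import numpy as np

def dft_matrix(M):
    """Unnormalized M x M DFT matrix with entries omega**(k*n) (assumed convention)."""
    omega = np.exp(-2j * np.pi / M)      # omega = e^{-2*pi*i/M}, as defined in the text
    k = np.arange(M)
    return omega ** np.outer(k, k)

M = 8
F = dft_matrix(M)
x = np.arange(M, dtype=float)

same_as_fft = np.allclose(F @ x, np.fft.fft(x))                    # F x equals the FFT of x
unitary_up_to_scale = np.allclose(F.conj().T @ F, M * np.eye(M))   # F^H F = M I
```

Under this convention the conjugate transpose $F^H$ inverts $F$ up to the factor $M$, which is why explicit products with $F$ can be replaced by FFT calls.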

Further, the batch samples $X_b$ and $Y_b$ are input at random, with the batch size $b = N \cdot g$ determined by the threshold $g$;

Further, $x_i$ and $y_i$ are Fourier transformed to obtain $\hat{x}_i$ and $\hat{y}_i$, the generating vectors of the Fourier transform. They are computed as the fast Fourier transforms of the vectors $x_i$ and $y_i$ respectively, with $F$ the discrete Fourier matrix;

Further, the eigenvalues of the batch samples $X_b$ and $Y_b$ of the current batch are obtained as follows:

where $1./$ denotes the element-wise reciprocal of a vector; $\lambda$ is the Lagrange multiplier; $b$ is the number of samples in the batch; the conjugated terms are the complex conjugates of the corresponding Fourier-domain data; $\odot$ is the element-wise product of matrix entries; $\mathrm{diag}(\cdot)$ converts a vector into a diagonal matrix with the vector's elements on the main diagonal; the eigenvector is the main projection vector of the training data set $X$; and $F^H$ is the conjugate transpose of the Fourier matrix $F$, with $H$ denoting the conjugate transpose operation. For each batch of randomly input samples $X_b$, $Y_b$, this yields $\Lambda_b$, the eigenvalue matrix obtained from that batch of samples.

Beneficial effects of the present invention:

1. The repetitive nature of data sequences is exploited to model the data in the Fourier domain. Using the fast Fourier transform to view each data point of a time series from the frequency-domain perspective, a new Fourier-domain correlation analysis algorithm is constructed. The projection targets of the correlation analysis can then be found by searching among predefined, meaningful Fourier bases.

2. Owing to the properties of operations in the Fourier domain, complex matrix inversion in the time domain can be replaced by simple element-wise matrix products in the Fourier domain.

3. Because meaningful Fourier bases are what is sought, the training process does not need to load all data samples; it only needs to load batches of samples until the ordering of the Fourier bases becomes stable, which uses memory more efficiently.

4. The eigenvector problem of correlation analysis is recast as a search for meaningful Fourier-domain bases, and training samples are input in batches, so that the eigenvalue distribution of the global sample is approximated by the stable, ordered eigenvalues of partial samples. This improves the operation speed and memory utilization of the dimensionality reduction process and supports and accelerates the correlation analysis of massive data.
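Point 2 rests on a standard property of circulant matrices: the DFT diagonalizes them, so solving a circulant linear system reduces to an element-wise division of spectra. The sketch below illustrates that property only; it is not the eigenvalue formula of the method, and the helper name `circulant` and the test data are assumptions made for the example.

```python
import numpy as np

def circulant(c):
    """Circulant matrix whose first column is the generating vector c."""
    M = len(c)
    return np.stack([np.roll(c, k) for k in range(M)], axis=1)

rng = np.random.default_rng(0)
M = 16
c = rng.standard_normal(M)
c[0] += M                      # shift the spectrum away from zero (well-conditioned toy case)
y = rng.standard_normal(M)

C = circulant(c)
x_time = np.linalg.solve(C, y)                               # dense solve in the time domain
x_freq = np.fft.ifft(np.fft.fft(y) / np.fft.fft(c)).real     # element-wise division in the Fourier domain
```

Both routes give the same solution, but the Fourier-domain route needs only FFTs and one element-wise division, with no matrix ever formed or inverted.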

Description of drawings

Fig. 1 is the main flow chart of the method proposed by the present invention.

Detailed description

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention.

As shown in Fig. 1, a dimensionality reduction and correlation analysis method suitable for large-scale data includes the following steps:

Step 1, data initialization: collect the data sample sets $X$ ($M_1 \times N$) and $Y$ ($M_2 \times N$) as the required data sets. Here $M_1$ and $M_2$ are the dimensions of the data sets $X$ and $Y$ respectively, so each row of $X$ and $Y$ is regarded as one attribute of the data. $X = [x_1\ x_2\ \ldots\ x_N]$ and, likewise, $Y = [y_1\ y_2\ \ldots\ y_N]$, where $N$ is the number of data samples; each column vector ($x_i$ or $y_i$, $i = 1, 2, \ldots, N$) is one data sample and holds that sample's values under all attributes.

Initialization parameters: $j$, $M$, $\Lambda_0$, $F$, $P_0$. Here $j$ is the current batch number in batch training, initialized to $j = 1$; $M$ is a dimension parameter constructed to obtain a finer eigenvector, with $M > M_1$ and $M > M_2$; $\Lambda_0$ is the initial $M \times M$ zero matrix; and $P_0$ is a random Fourier basis set whose elements are column vectors of the discrete Fourier matrix (DFT) $F$. The discrete Fourier matrix (DFT) $F$ is expressed as:

Step 2, construct the Fourier data representation of batch samples.

According to the threshold $g$, batch sample sets $X_b$ and $Y_b$ of $b = N \cdot g$ samples are input at random, where $g$ takes a value from 0.5% to 5%. Taking the data set $X_b$ as an example, each sample $x_i$ in $X_b$ is increased to dimension $M$ by zero-element padding; its original entries are the values of the sample point $x_i$ under the different attributes. The fast Fourier transform is then used to view the data in the frequency domain: $\hat{x}_i$, the fast Fourier transform of the vector $x_i$, is the generating vector of the Fourier transform, $F$ is the discrete Fourier matrix, and the hat marks such a generating vector. Similarly, each sample vector $y_i$ in the data set $Y_b$ is increased to dimension $M$ by zero-element padding and fast Fourier transformed to give $\hat{y}_i$.
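Step 2 can be sketched as follows. This is a minimal illustration, assuming samples are stored as columns, the batch is drawn without replacement, and an unnormalized FFT is used; the function name `fourier_batch` and the toy sizes are not from the source.

```python
import numpy as np

def fourier_batch(X, M, g, rng):
    """Draw b = N*g random samples (columns) of X, zero-pad to M rows, and FFT each one."""
    M1, N = X.shape
    b = max(1, int(round(N * g)))                  # batch size from the threshold g
    cols = rng.choice(N, size=b, replace=False)    # random batch (no replacement: an assumption)
    Xb = np.zeros((M, b))
    Xb[:M1, :] = X[:, cols]                        # zero-element padding up to dimension M
    return np.fft.fft(Xb, axis=0), cols            # column i holds x_hat_i

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 1000))                 # M1 = 5 attributes, N = 1000 samples
Xb_hat, cols = fourier_batch(X, M=8, g=0.02, rng=rng)
```

With `g = 0.02` and `N = 1000` the batch holds 20 samples, so `Xb_hat` has shape `(8, 20)`; the same call applied to `Y` would produce the Fourier representation of `Y_b`.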

Step 3, obtain the eigenvalues of batch samples.

The eigenvalues of the batch samples $X_b$ and $Y_b$ of the current batch are obtained as follows:

In this expression the eigenvector is the main projection vector of the training data set $X$, and $F^H$ is the conjugate transpose of the Fourier matrix $F$, with $H$ denoting the conjugate transpose operation. According to formula (2), for each batch of randomly input samples $X_b$ and $Y_b$ we obtain $\Lambda_b$, the eigenvalue matrix of that batch. $\Lambda_j$ denotes the accumulated eigenvalues after the $j$-th batch has been input, and $j$ is the number of batches input so far. As mini-batches continue to be input, the eigenvalue matrix $\Lambda_b$ of each batch is added to $\Lambda_j$:

$\Lambda_j \leftarrow \Lambda_{j-1} + \Lambda_b$ (4)

Step 4, obtain the Fourier projection basis of batch samples.

According to formula (2), the eigenvector is taken to be a column vector of $F$. The diagonal elements $\lambda_1, \lambda_2, \ldots, \lambda_M$ of the eigenvalue matrix $\Lambda_j$ are sorted in ascending order, and the Fourier bases in matrix $F$ corresponding to the $r$ smallest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_r$ are selected to form the current projection set $P_j$, where $r$ is the preset number of required Fourier projection bases; its value here is 50.

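Step 5, if the set $P_j$ is the same as $P_{j-1}$, stop executing steps 2 to 4 and take $P_j$ as the final Fourier projection basis. Otherwise, update the current batch number, $j \leftarrow j+1$, and go to steps 2 to 4.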
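The accumulate, sort, select, and compare loop of steps 3 to 5 can be sketched as below. The per-batch eigenvalue formula is not reproduced in this text, so `batch_eigs` is a hypothetical callable that returns the diagonal of $\Lambda_b$ for batch $j$; the toy spectrum used to exercise the loop is likewise an assumption, as is treating "the same" as equality of index sets and starting without a random $P_0$.

```python
import numpy as np

def select_fourier_bases(batch_eigs, M, r, max_batches=100):
    """Accumulate per-batch eigenvalues and stop once the r selected indices repeat.

    batch_eigs: hypothetical callable j -> length-M real vector (diagonal of Lambda_b).
    Returns the sorted column indices of F that were selected and the batches used.
    """
    acc = np.zeros(M)                              # diagonal of Lambda_0 (zero matrix)
    prev = None                                    # stands in for P_{j-1}
    for j in range(1, max_batches + 1):
        acc += batch_eigs(j)                       # Lambda_j <- Lambda_{j-1} + Lambda_b
        idx = np.argsort(acc, kind="stable")[:r]   # indices of the r smallest eigenvalues
        cur = frozenset(idx.tolist())              # current projection set P_j, as column indices
        if cur == prev:                            # P_j equals P_{j-1}: stop
            break
        prev = cur
    return np.sort(idx), j

# Toy stand-in spectrum (assumption): a fixed ordering plus small batch-to-batch noise.
rng = np.random.default_rng(0)
M, r = 32, 5
base = np.arange(M, dtype=float)
toy_eigs = lambda j: base + 0.01 * rng.standard_normal(M)
selected, batches_used = select_fourier_bases(toy_eigs, M, r)
```

Only the running sum of eigenvalues and the current index set are kept between batches, which is what lets training stop early without loading the full data set.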

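Step 6, perform an inverse Fourier transform on each Fourier projection basis in the set $P_j$ to obtain $p_i$, $i = 1, \ldots, r$, and form the projection matrix $V' = [p_1\ p_2\ \ldots\ p_r]$. Multiplying the high-dimensional data set $X$ by the projection matrix $V'^{T}$ gives the dimension-reduced data set $X' = V'^{T} X$.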
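The final multiplication can be sketched as below to make the shapes explicit. The sketch assumes that $V'$ has $M$ rows (one per padded dimension) and that $X$ is zero-padded to $M$ rows before the product so that the shapes agree; the matrix `V` is a random placeholder, not a learned projection, and the function name `reduce_dimension` is illustrative.

```python
import numpy as np

def reduce_dimension(X, V):
    """Return X' = V^T X after zero-padding X to the row count of V (an assumption)."""
    M, r = V.shape
    M1, N = X.shape
    X_pad = np.zeros((M, N), dtype=X.dtype)
    X_pad[:M1, :] = X                   # pad the attributes from M1 up to M
    return V.T @ X_pad                  # shape (r, N)

rng = np.random.default_rng(0)
M, r = 64, 50                           # r = 50 as in the embodiment
V = rng.standard_normal((M, r))         # placeholder only; NOT the learned V'
X = rng.standard_normal((60, 200))      # M1 = 60 attributes, N = 200 samples
X_reduced = reduce_dimension(X, V)
```

Each of the 200 samples is reduced from 60 attributes to `r = 50` coordinates.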

The above embodiments are only used to illustrate the design ideas and features of the present invention, and the purpose is to enable those skilled in the art to understand the contents of the present invention and implement them accordingly, and the protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications made according to the principles and design ideas disclosed in the present invention fall within the protection scope of the present invention.

Claims

A dimensionality reduction and correlation analysis method suitable for large-scale data, characterized in that it comprises the following steps:

Step 1, data initialization: collect the data sample sets $X$ ($M_1 \times N$) and $Y$ ($M_2 \times N$) as the required data sets, and initialize the current batch number $j$, the dimension parameter $M$, the initial $M \times M$ zero matrix $\Lambda_0$, the random Fourier basis set $P_0$, and the discrete Fourier matrix $F$, where $M_1$ and $M_2$ are the dimensions of the data sets $X$ and $Y$ respectively and $N$ is the number of data samples;

Step 2, construct the Fourier data representation of the batch samples: randomly input batch sample sets $X_b$ and $Y_b$, each containing $b$ samples; pad $X_b$ and $Y_b$ with zero elements up to dimension $M$; and apply the Fourier transform to the samples $x_i$ and $y_i$ in $X_b$ and $Y_b$ to obtain $\hat{x}_i$ and $\hat{y}_i$ respectively;

Step 3, for each batch of randomly input samples $X_b$, $Y_b$, calculate the eigenvalue matrix $\Lambda_b$ of that batch; as mini-batches continue to be input, the eigenvalue matrix $\Lambda_b$ of each batch is added to $\Lambda_j$, which denotes the accumulated eigenvalues after the $j$-th batch has been input, expressed as $\Lambda_j \leftarrow \Lambda_{j-1} + \Lambda_b$, where $\Lambda_{j-1}$ is the accumulation of eigenvalues obtained after inputting $j-1$ batches of samples;

Step 4, obtain the Fourier projection basis of the batch samples: the eigenvector (main projection vector) is taken to be a column vector of $F$; sort the diagonal elements $\lambda_1, \lambda_2, \ldots, \lambda_M$ of the eigenvalue matrix $\Lambda_j$ in ascending order, and select the Fourier bases in matrix $F$ corresponding to the $r$ smallest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_r$ to form the current projection set $P_j$, where $r$ is the preset number of required Fourier projection bases;

Step 5, if the set $P_j$ is the same as $P_{j-1}$, stop executing steps 2 to 4 and take $P_j$ as the final Fourier projection basis; otherwise update the current input batch number, $j \leftarrow j+1$, and perform steps 2 to 4 again;

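Step 6, perform an inverse Fourier transform on each Fourier projection basis in the set $P_j$ to obtain $p_i$, $i = 1, \ldots, r$, forming the projection matrix $V' = [p_1\ p_2\ \ldots\ p_r]$; the high-dimensional data set $X$ is multiplied by the projection matrix $V'^{T}$ to obtain the dimension-reduced data set $X' = V'^{T} X$.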
The dimensionality reduction and correlation analysis method suitable for large-scale data according to claim 1, wherein the dimension parameter $M$ is required to satisfy $M \ge M_1$ and $M \ge M_2$.
The dimensionality reduction and correlation analysis method suitable for large-scale data according to claim 1, wherein the discrete Fourier matrix (DFT) $F$ is expressed as:

where $\omega$ is the complex number $\omega = e^{-2\pi i/M}$ and $i$ is the imaginary unit.
The dimensionality reduction and correlation analysis method suitable for large-scale data according to claim 1, characterized in that the batch samples $X_b$ and $Y_b$ are input at random, with the batch size $b = N \cdot g$ determined by the threshold $g$.
The dimensionality reduction and correlation analysis method suitable for large-scale data according to claim 1, wherein $x_i$ and $y_i$ are Fourier transformed to obtain $\hat{x}_i$ and $\hat{y}_i$, the generating vectors of the Fourier transform, which are computed as the fast Fourier transforms of the vectors $x_i$ and $y_i$ respectively, with $F$ the discrete Fourier matrix.
The dimensionality reduction and correlation analysis method suitable for large-scale data according to claim 1, characterized in that the eigenvalues of the batch samples $X_b$ and $Y_b$ of the current batch are obtained as follows:

where $1./$ denotes the element-wise reciprocal of a vector; $\lambda$ is the Lagrange multiplier; $b$ is the number of samples in the batch; the conjugated terms are the complex conjugates of the corresponding Fourier-domain data; $\odot$ is the element-wise product of matrix entries; $\mathrm{diag}(\cdot)$ converts a vector into a diagonal matrix with the vector's elements on the main diagonal; the eigenvector is the main projection vector of the training data set $X$; and $F^H$ is the conjugate transpose of the Fourier matrix $F$, with $H$ denoting the conjugate transpose operation. For each batch of randomly input samples $X_b$, $Y_b$, this yields $\Lambda_b$, the eigenvalue matrix obtained from that batch of samples.