GB2601862A - Dimension reduction and correlation analysis method applicable to large-scale data - Google Patents
- Publication number
- GB2601862A (application GB2110472.4A / GB202110472A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- batch
- fourier
- samples
- matrix
- correlation analysis
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
- G06F17/141—Discrete Fourier transforms
- G06F17/142—Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
- G06F17/156—Correlation function computation including computation of convolution operations using a domain transform, e.g. Fourier transform, polynomial transform, number theoretic transform
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
Disclosed in the present invention is a dimension reduction and correlation analysis method applicable to large-scale data. High-dimensional data is projected into the Fourier domain, so that the eigenvector-pursuit problem of linear correlation analysis is transformed into a search for meaningful Fourier-domain bases. Because the Fourier-domain bases are predefined and the eigenvalue distribution of the data is ordered, training samples can be input in batches to accelerate training until the required Fourier bases are stable and ordered. The number of Fourier bases and a projection matrix are then determined, and the projection matrix is multiplied by the high-dimensional dataset to obtain a low-dimensional dataset, facilitating fast data processing. By means of the data dimension reduction method of the present invention, based on the fast Fourier transform and correlation analysis, noise and redundant information in a high-dimensional dataset can be eliminated and unnecessary operations in data processing can be reduced, thereby improving the running speed and memory utilization efficiency of dimension reduction computation.
Description
DIMENSIONALITY REDUCTION AND CORRELATION ANALYSIS METHOD
SUITABLE FOR LARGE-SCALE DATA
Technical Field
The present invention relates to the technical field of computer science and image processing, and in particular, to a dimensionality reduction and correlation analysis method suitable for large-scale data.
Background
Traditional data processing methods can no longer effectively analyze massive data. Meanwhile, as the dimensionality of data generated by big data processing and cloud computing continues to increase, studies and applications in many fields usually need to observe data containing multiple variables and to collect large amounts of data in order to analyze them and find patterns. Multivariate large datasets undoubtedly provide rich information for such studies and applications, but they also increase the workload of data collection to a certain extent.
Canonical Correlation Analysis (CCA) is one of the most commonly used algorithms for mining correlation relationships in data. It is also a dimensionality reduction technique that can be used to test the correlation of data and to find transformed representations that emphasize these correlations. The essence of canonical correlation analysis is to select several representative comprehensive indicators (linear combinations of variables) from two sets of random variables, and to use the correlation between these indicators to represent the correlation between the original two sets of variables, which helps with understanding the underlying data structure, cluster analysis, regression analysis, and many other tasks.
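For reference, classical CCA can be implemented directly with a whitening step and an SVD. The following NumPy sketch (all names illustrative, not the invention's method) uses the same rows-as-variables layout as the datasets X and Y described later:

```python
import numpy as np

def cca(X, Y, r=2, eps=1e-8):
    """Classical CCA via whitening + SVD.

    X: (m1, n), Y: (m2, n) -- rows are variables, columns are samples.
    Returns the first r pairs of canonical weight vectors and the
    canonical correlations.
    """
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    n = X.shape[1]
    Cxx = X @ X.T / n + eps * np.eye(X.shape[0])   # auto-covariance of X
    Cyy = Y @ Y.T / n + eps * np.eye(Y.shape[0])   # auto-covariance of Y
    Cxy = X @ Y.T / n                              # cross-covariance
    # Whiten each view; the SVD of the whitened cross-covariance
    # yields the canonical directions and correlations.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    A = Wx.T @ U[:, :r]    # canonical weights for X
    B = Wy.T @ Vt[:r].T    # canonical weights for Y
    return A, B, s[:r]

# Two views sharing a latent signal give a canonical correlation close to 1.
rng = np.random.default_rng(0)
Z = rng.standard_normal((2, 500))
X = rng.standard_normal((4, 2)) @ Z + 0.01 * rng.standard_normal((4, 500))
Y = rng.standard_normal((3, 2)) @ Z + 0.01 * rng.standard_normal((3, 500))
A, B, corr = cca(X, Y, r=1)
print(corr[0])  # close to 1
```

The eigen-decomposition hidden inside this SVD is exactly the step whose cost motivates the approximation techniques discussed next.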
However, although canonical correlation analysis exhibits good performance, its application to mass data processing is limited by its high computational complexity. To deal with large-scale data, a number of optimization techniques have been proposed to accelerate correlation analysis. In terms of the strategies used, existing optimization techniques can be roughly divided into two general categories. Representative of the first category is the Nyström matrix approximation technique, which uses the eigenvectors of a sub-matrix to estimate an approximation of the eigenvectors of the original matrix, reducing the computational cost of the eigen-decomposition step. The other approach uses random Fourier features to approximate the kernel matrix, which turns the original kernel CCA (KCCA) problem into a high-dimensional linear CCA problem. While these existing methods make large-scale applications feasible, they remain limited in speed and memory efficiency, and fast, efficient computation on mass data is still an open problem.
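The random-Fourier-feature idea mentioned above can be illustrated as follows. This is a generic sketch of the Rahimi–Recht approximation of an RBF kernel (not the construction of the present invention); the kernel matrix of the explicit features approximates the exact kernel matrix, which is what lets KCCA be replaced by linear CCA:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, D = 200, 5, 2000        # samples, input dim, number of random features
gamma = 0.5                   # RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)
X = rng.standard_normal((n, d))

# Random Fourier features: z(x) = sqrt(2/D) * cos(Wx + b),
# with W drawn from the kernel's spectral distribution N(0, 2*gamma*I).
W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# The linear kernel of the features approximates the exact RBF kernel matrix.
K_exact = np.exp(-gamma * np.sum((X[:, None] - X[None]) ** 2, axis=-1))
K_approx = Z @ Z.T
print(np.abs(K_exact - K_approx).max())  # small, typically well below 0.2
```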
Summary
To overcome the drawbacks of the prior art, the present invention provides a dimensionality reduction and correlation analysis method suitable for large-scale data, where the eigenvector-pursuit problem of correlation analysis is converted into finding discriminative Fourier bases, samples are inputted in batches for training, and an approximation of the global eigenvalue distribution of the samples is estimated from partial sample values that are stable and orderly. Thus, the computing speed and memory utilization rate in the data dimensionality reduction process are increased, providing support for and speeding up the correlation analysis of mass data.
The following technical solution is adopted in the present invention.
A dimensionality reduction and correlation analysis method suitable for large-scale data includes the following steps.

Step 1: initialize data; acquire data sample sets X(M₁ × N) and Y(M₂ × N) as the required datasets, and initialize a current batch number j, a dimension parameter M, an initial M × M zero matrix Λ₀, a random Fourier basis set P₀, and a discrete Fourier matrix F, where M₁ and M₂ respectively represent the dimensions of the datasets X and Y, and N is the data sample size.

Step 2: construct Fourier data representations of batch samples; randomly input batch sample sets X_b ∈ R^{M₁×b} and Y_b ∈ R^{M₂×b}, each of size b, and increase X_b and Y_b to M dimensions by padding with zero elements; respectively perform a Fourier transform of the samples xᵢ and yᵢ in X_b and Y_b to obtain x̂ᵢ and ŷᵢ.

Step 3: for each batch of randomly input samples X_b and Y_b, compute the eigenvalue matrix Λ_b obtained from that batch; as small batches of samples are continuously input, add the eigenvalue matrix Λ_b obtained from each batch to Λ_j, where Λ_j represents the accumulation of eigenvalues after the j-th batch of samples has been input. This process is expressed as:
Λ_j ← Λ_{j−1} + Λ_b,

where Λ_{j−1} represents the accumulation of eigenvalues obtained after the (j−1)-th batch of samples has been input.

Step 4: obtain the Fourier projection bases of the batch samples. Taking p̂ as a column vector of F, sort the diagonal elements λ₁, λ₂, …, λ_M of the eigenvalue matrix Λ_j in ascending order, and choose the Fourier bases p̂₁, p̂₂, …, p̂_r in the matrix F corresponding to the first r smallest eigenvalues λ₁, λ₂, …, λ_r, constructing a current projection set P_j = {p̂₁, p̂₂, …, p̂_r}, where r is the preset number of Fourier projection bases required.

Step 5: if the set P_j is identical to P_{j−1}, end the execution of steps 2 to 4 and take the required Fourier bases P_j = {p̂₁, p̂₂, …, p̂_r} as the final Fourier projection bases; otherwise, execute steps 2 to 4 again and update the current batch number, j ← j + 1.

Step 6: perform an inverse Fourier transform of each Fourier projection basis in the set P_j, pᵢ = F⁻¹(p̂ᵢ) = (1/√M) Fᴴ p̂ᵢ, i = 1, …, r, to construct a projection matrix V′ = [p₁ p₂ ⋯ p_r]; multiply the high-dimensional dataset X by the projection matrix V′ᵀ to obtain the dimensionality-reduced dataset X′ = V′ᵀX.

Further, the dimension parameter M is required to satisfy M > M₁ and M > M₂.

Further, the discrete Fourier matrix (DFT) F is expressed as:

F = (1/√M) ·
[ 1  1        ⋯  1
  1  ω        ⋯  ω^(M−1)
  ⋮  ⋮        ⋱  ⋮
  1  ω^(M−1)  ⋯  ω^((M−1)²) ]

where ω is a complex number that can be expressed as ω = e^(−2πi/M), and i is the imaginary unit.
Further, the batch samples X_b and Y_b consist of b = N · g samples randomly input according to a threshold g.
Further, a Fourier transform of xᵢ and yᵢ is performed to obtain x̂ᵢ and ŷᵢ, which are respectively expressed as:

x̂ᵢ = F(xᵢ) = √M · F xᵢ,
ŷᵢ = F(yᵢ) = √M · F yᵢ,

where x̂ᵢ and ŷᵢ are respectively the generating vectors of the Fourier transform, F(xᵢ) and F(yᵢ) respectively represent performing a fast Fourier transform of the vectors xᵢ and yᵢ, and F is the discrete Fourier matrix.
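As a quick numerical check of the transform above (a sketch under the assumption that F carries the unitary 1/√M normalization, so that √M · F x coincides with the standard unnormalized FFT):

```python
import numpy as np

M1, M = 5, 16
rng = np.random.default_rng(4)
x = rng.standard_normal(M1)

# Zero-pad the M1-dimensional sample to M dimensions, then transform.
x_padded = np.concatenate([x, np.zeros(M - M1)])
x_hat = np.fft.fft(x_padded)    # equals sqrt(M) * F @ x_padded for unitary F

# np.fft.fft pads with zeros implicitly when given the target length n=M.
assert np.allclose(x_hat, np.fft.fft(x, n=M))
print(x_hat.shape)  # (16,)
```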
Further, the eigenvalues of the batch samples X_b and Y_b in the current batch are obtained in the following manner:

diag([1./ Σ_{i=1}^{b} (x̂ᵢ* ⊙ x̂ᵢ)] ⊙ [Σ_{i=1}^{b} (x̂ᵢ* ⊙ ŷᵢ)] ⊙ [1./ Σ_{i=1}^{b} (ŷᵢ* ⊙ ŷᵢ)] ⊙ [Σ_{i=1}^{b} (ŷᵢ* ⊙ x̂ᵢ)]) Fᴴ v = λ Fᴴ v,

where 1./ denotes taking the reciprocal of each element of the vector; λ is a Lagrange factor; b is the batch sample size; x̂ᵢ* and ŷᵢ* are respectively the complex conjugates of x̂ᵢ and ŷᵢ; ⊙ represents an element-wise dot product operation; diag represents transforming the vector into a diagonal matrix whose main diagonal consists of the vector elements; v is a main projection vector of the training dataset X, i.e., an eigenvector; and Fᴴ is the conjugate transpose of the Fourier matrix F, with H denoting the conjugate transpose. For each batch of randomly input samples X_b and Y_b, Λ_b is obtained:

Λ_b = diag([1./ Σ_{i=1}^{b} (x̂ᵢ* ⊙ x̂ᵢ)] ⊙ [Σ_{i=1}^{b} (x̂ᵢ* ⊙ ŷᵢ)] ⊙ [1./ Σ_{i=1}^{b} (ŷᵢ* ⊙ ŷᵢ)] ⊙ [Σ_{i=1}^{b} (ŷᵢ* ⊙ x̂ᵢ)]),

where Λ_b is the eigenvalue matrix obtained from the batch of samples.
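Under these definitions, the diagonal of Λ_b reduces to element-wise sums and reciprocals over the batch FFTs. A hedged NumPy sketch of that diagonal computation and the accumulation step, with random stand-in data and illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(5)
M, b = 16, 8
Xf = np.fft.fft(rng.standard_normal((M, b)), axis=0)   # columns are x_hat_i
Yf = np.fft.fft(rng.standard_normal((M, b)), axis=0)   # columns are y_hat_i

# Element-wise sums over the b samples of the batch.
sxx = np.sum(np.conj(Xf) * Xf, axis=1)   # sum_i x_hat_i^* ⊙ x_hat_i
sxy = np.sum(np.conj(Xf) * Yf, axis=1)   # sum_i x_hat_i^* ⊙ y_hat_i
syy = np.sum(np.conj(Yf) * Yf, axis=1)   # sum_i y_hat_i^* ⊙ y_hat_i
syx = np.sum(np.conj(Yf) * Xf, axis=1)   # sum_i y_hat_i^* ⊙ x_hat_i
lam_b = (1.0 / sxx) * sxy * (1.0 / syy) * syx

# syx is the conjugate of sxy, so each eigenvalue is real and non-negative.
assert np.allclose(lam_b.imag, 0)
Lambda_b = np.diag(lam_b.real)

# Accumulation: Lambda_j = Lambda_{j-1} + Lambda_b, starting from the zero matrix.
Lambda_j = np.zeros((M, M))
Lambda_j = Lambda_j + Lambda_b
print(Lambda_j.shape)  # (16, 16)
```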
The present invention has the following beneficial effects.
1. Data is modeled in the Fourier domain by exploiting the repeatability of data patterns. The fast Fourier transform is used to observe each data point in the time sequence from the perspective of the frequency domain, so as to construct a novel correlation analysis algorithm based on the Fourier domain. The objective of finding projections for correlation analysis can be achieved by choosing discriminative, predefined Fourier bases.
2. Owing to the operational properties of the Fourier domain, complex matrix inversion operations in the time domain can be avoided by using simple dot-product operations in the Fourier domain.
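As a generic illustration of this property (not the patent's exact operation): solving a circulant linear system by direct inversion costs O(M³) in the time domain, while element-wise division in the Fourier domain costs only O(M log M):

```python
import numpy as np

rng = np.random.default_rng(2)
M = 64
c = rng.standard_normal(M)
c[0] += M                                   # diagonal boost keeps C well-conditioned
C = np.stack([np.roll(c, k) for k in range(M)], axis=1)  # circulant matrix from c
y = rng.standard_normal(M)

x_direct = np.linalg.solve(C, y)            # O(M^3) time-domain solve
# Circulant systems diagonalize under the DFT, so the solve becomes
# an element-wise division of spectra.
x_fft = np.fft.ifft(np.fft.fft(y) / np.fft.fft(c)).real  # O(M log M)
print(np.allclose(x_direct, x_fft))  # True
```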
3. To obtain discriminative Fourier bases, the training process does not require all data samples to be loaded at once; instead, the data samples only need to be loaded in several batches until the pursued set of Fourier bases is stable. This undoubtedly uses memory more efficiently.
4. The eigenvector pursuing problem of correlation analysis is optimized into finding discriminative Fourier bases, samples are inputted in batches for training, and an approximation of a global eigenvalue distribution of samples is estimated based on partial sample values that are stable and orderly. Thus, the computing speed and memory utilization rate in the data dimensionality reduction process are increased, providing support for and speeding up the correlation analysis of mass data.
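The batched procedure described above (steps 1 to 6) can be sketched end to end in NumPy. This is a hedged illustration, not the authoritative implementation: all names are invented, the unit-vector representation of each selected Fourier bin and the small regularizers are simplifications, and the smallest-eigenvalue selection rule follows the text above:

```python
import numpy as np

def fourier_dimension_reduction(X, Y, M, r, g=0.02, max_batches=200, seed=0):
    """Batched Fourier-domain correlation analysis (illustrative sketch).

    X: (M1, N), Y: (M2, N) with M > max(M1, M2).  Returns the reduced
    dataset of shape (r, N).
    """
    rng = np.random.default_rng(seed)
    M1, N = X.shape
    b = max(1, int(N * g))               # batch size from threshold g
    lam = np.zeros(M)                    # running diagonal of Lambda_j
    prev = None
    for _ in range(max_batches):
        idx = rng.choice(N, size=b, replace=False)
        Xb = np.zeros((M, b)); Xb[:M1] = X[:, idx]          # zero-pad to M dims
        Yb = np.zeros((M, b)); Yb[:Y.shape[0]] = Y[:, idx]
        Xf = np.fft.fft(Xb, axis=0)      # column-wise FFT of the batch
        Yf = np.fft.fft(Yb, axis=0)
        sxx = np.sum(np.conj(Xf) * Xf, axis=1).real
        sxy = np.sum(np.conj(Xf) * Yf, axis=1)
        syy = np.sum(np.conj(Yf) * Yf, axis=1).real
        syx = np.sum(np.conj(Yf) * Xf, axis=1)
        lam_b = (sxy / (sxx + 1e-12)) * (syx / (syy + 1e-12))
        lam = lam + lam_b.real           # Lambda_j = Lambda_{j-1} + Lambda_b
        sel = np.argsort(lam)[:r]        # indices of the r smallest eigenvalues
        if prev is not None and np.array_equal(np.sort(sel), np.sort(prev)):
            break                        # basis set is stable -> stop training
        prev = sel
    # Inverse-transform the selected Fourier bins into time-domain projections.
    P_hat = np.eye(M, dtype=complex)[:, sel]    # selected bins as unit vectors
    V = np.fft.ifft(P_hat, axis=0).real         # columns are projection vectors
    return V[:M1].T @ X                          # (r, N) reduced dataset

rng = np.random.default_rng(3)
X = rng.standard_normal((8, 400))
Y = rng.standard_normal((6, 400))
X_red = fourier_dimension_reduction(X, Y, M=16, r=4)
print(X_red.shape)  # (4, 400)
```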
Brief Description of the Drawings
FIG. 1 is a main flowchart of a method according to the present invention.
Detailed Description of the Embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail with reference to accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely used for explaining the present invention, and are not intended to limit the present invention.
As shown in FIG. 1, a dimensionality reduction and correlation analysis method suitable for large-scale data includes the following steps.

Step 1. Data is initialized, and data sample sets X(M₁ × N) and Y(M₂ × N) are acquired as the required datasets. Here M₁ and M₂ respectively represent the dimensions of the datasets X and Y; that is, each row of X and Y is taken as an attribute of the data, X = [x₁ x₂ ⋯ x_N] and, similarly, Y = [y₁ y₂ ⋯ y_N], where N represents the data sample size; that is, each column vector (i.e., xᵢ and yᵢ, where i = 1, 2, …, N) represents all attribute values of a single data sample.
The parameters j, M, Λ₀, F, and P₀ are initialized, where j represents the current batch number when training in batches, and j = 1; M is a dimension parameter constructed for obtaining finer eigenvectors, with M > M₁ and M > M₂; Λ₀ represents the initial M × M zero matrix; and P₀ is a random Fourier basis set whose elements are column vectors of the discrete Fourier matrix (DFT) F. The discrete Fourier matrix F is expressed as:

F = (1/√M) ·
[ 1  1        ⋯  1
  1  ω        ⋯  ω^(M−1)
  ⋮  ⋮        ⋱  ⋮
  1  ω^(M−1)  ⋯  ω^((M−1)²) ]

where ω is a complex number that can be expressed as ω = e^(−2πi/M), and i is the imaginary unit.
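Assuming the 1/√M normalization above (one consistent reading of the formula), F is unitary, so its conjugate transpose is also its inverse, and √M · F x coincides with the standard FFT. A short check:

```python
import numpy as np

M = 8
n = np.arange(M)
omega = np.exp(-2j * np.pi / M)
F = omega ** np.outer(n, n) / np.sqrt(M)   # F[j, k] = omega^(j*k) / sqrt(M)

# With the 1/sqrt(M) normalization, F is unitary: F^H F = I.
assert np.allclose(F.conj().T @ F, np.eye(M))

# sqrt(M) * F @ x coincides with the standard FFT of x.
x = np.arange(M, dtype=float)
assert np.allclose(np.sqrt(M) * F @ x, np.fft.fft(x))
```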
Step 2. Fourier data representations of batch samples are constructed.
Batch samples X_b ∈ R^{M₁×b} and Y_b ∈ R^{M₂×b} of size b = N · g are randomly input according to a threshold g, where g is 0.5% to 5%. Taking the dataset X_b as an example, each sample xᵢ ∈ R^{M₁} in X_b is padded with zero elements to M dimensions, that is, xᵢ = [x_{i1} x_{i2} ⋯ x_{iM₁} 0 ⋯ 0]ᵀ ∈ R^M, where x_{i1}, x_{i2}, …, x_{iM₁} respectively represent the values of the sample point xᵢ under different attributes. The fast Fourier transform method is used to observe the data from the perspective of the frequency domain:
x̂ᵢ = F(xᵢ) = √M · F xᵢ   (1)

where F(xᵢ) represents performing a fast Fourier transform of the vector xᵢ; F is the discrete Fourier matrix; and x̂ᵢ is the generating vector of the fast Fourier transform. Similarly, each sample vector yᵢ ∈ R^{M₂} in the dataset Y_b is padded with zero elements to M dimensions, and a fast Fourier transform is performed: ŷᵢ = F(yᵢ) = √M · F yᵢ.

Step 3. Eigenvalues of batch samples are obtained.
The eigenvalues of the batch samples X_b and Y_b in the current batch are obtained in the following manner:

diag([1./ Σ_{i=1}^{b} (x̂ᵢ* ⊙ x̂ᵢ)] ⊙ [Σ_{i=1}^{b} (x̂ᵢ* ⊙ ŷᵢ)] ⊙ [1./ Σ_{i=1}^{b} (ŷᵢ* ⊙ ŷᵢ)] ⊙ [Σ_{i=1}^{b} (ŷᵢ* ⊙ x̂ᵢ)]) Fᴴ v = λ Fᴴ v   (2)

where 1./ denotes taking the reciprocal of each element of the vector; λ is a Lagrange factor; b is the batch sample size; x̂ᵢ* and ŷᵢ* are respectively the complex conjugates of x̂ᵢ and ŷᵢ; ⊙ represents an element-wise dot product operation; diag represents transforming the vector into a diagonal matrix whose main diagonal consists of the vector elements; v is a main projection vector of the training dataset X, i.e., an eigenvector; and Fᴴ is the conjugate transpose of the Fourier matrix F, where H denotes the conjugate transpose. According to formula (2), for each batch of randomly input samples X_b and Y_b, Λ_b is obtained:

Λ_b = diag([1./ Σ_{i=1}^{b} (x̂ᵢ* ⊙ x̂ᵢ)] ⊙ [Σ_{i=1}^{b} (x̂ᵢ* ⊙ ŷᵢ)] ⊙ [1./ Σ_{i=1}^{b} (ŷᵢ* ⊙ ŷᵢ)] ⊙ [Σ_{i=1}^{b} (ŷᵢ* ⊙ x̂ᵢ)])   (3)

where Λ_b is the eigenvalue matrix obtained from the batch of samples, Λ_j represents the accumulation of eigenvalues after the j-th batch of samples has been input, and j is the current batch number. As small batches of samples are continuously input, the eigenvalue matrix Λ_b obtained for each batch is added to Λ_j:

Λ_j ← Λ_{j−1} + Λ_b   (4)

where Λ_{j−1} represents the accumulation of eigenvalues obtained after the (j−1)-th batch of samples has been input.

Step 4. Fourier projection bases of batch samples are obtained.
According to formula (2), p̂ is taken as a column vector of F; the diagonal elements λ₁, λ₂, …, λ_M of the eigenvalue matrix Λ_j are sorted in ascending order, and the Fourier bases p̂₁, p̂₂, …, p̂_r in the matrix F corresponding to the first r smallest eigenvalues λ₁, λ₂, …, λ_r are chosen, to construct a current projection set P_j = {p̂₁, p̂₂, …, p̂_r}. Here r is the preset number of Fourier projection bases required, and has a value of 50 herein.
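Step 4's selection can be sketched as follows (illustrative eigenvalues; F as defined in step 1):

```python
import numpy as np

M, r = 16, 4
rng = np.random.default_rng(6)
lam = rng.uniform(size=M)                 # stand-in diagonal of the accumulated Lambda_j
n = np.arange(M)
F = np.exp(-2j * np.pi * np.outer(n, n) / M) / np.sqrt(M)

# Sort eigenvalues in ascending order and keep the Fourier bases
# (columns of F) matching the r smallest ones.
order = np.argsort(lam)
P_j = F[:, order[:r]]                     # current projection set P_j
print(P_j.shape)  # (16, 4)
```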
Step 5. If the set P_j is identical to P_{j−1}, the execution of steps 2 to 4 is ended, and the required Fourier bases P_j = {p̂₁, p̂₂, …, p̂_r} are obtained as the final Fourier projection bases. Otherwise, steps 2 to 4 are executed again, and the current batch number is updated, j ← j + 1.
Step 6. An inverse Fourier transform of each Fourier projection basis in the set P_j, pᵢ = F⁻¹(p̂ᵢ) = (1/√M) Fᴴ p̂ᵢ, i = 1, …, r, is performed to obtain a projection matrix V′ = [p₁ p₂ ⋯ p_r]. The high-dimensional dataset X is multiplied by the projection matrix V′ᵀ to obtain a dimensionality-reduced dataset X′ = V′ᵀX.

The above embodiments are only used to illustrate the design ideas and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and practice it accordingly. The protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications made according to the principles and design ideas disclosed in the present invention fall within the protection scope of the present invention.
Claims (6)
- Claims
What is claimed is:
1. A dimensionality reduction and correlation analysis method suitable for large-scale data, characterized by comprising the following steps:
step 1, initializing data, acquiring data sample sets X(M₁ × N) and Y(M₂ × N) as required datasets, and initializing a current batch number j, a dimension parameter M, an initial M × M zero matrix Λ₀, a random Fourier basis set P₀, and a discrete Fourier matrix F, wherein M₁ and M₂ respectively represent dimensions of the datasets X and Y, and N is a data sample size;
step 2, constructing Fourier data representations of batch samples, randomly inputting batch sample sets X_b ∈ R^{M₁×b} and Y_b ∈ R^{M₂×b} each having a size b, and increasing X_b and Y_b to M dimensions by padding with zero elements; respectively performing a Fourier transform of samples xᵢ and yᵢ in X_b and Y_b to obtain x̂ᵢ and ŷᵢ;
step 3, for each batch of randomly inputted samples X_b and Y_b, computing an eigenvalue matrix Λ_b obtained for the batch of samples, and as small batches of samples are continuously inputted, adding the eigenvalue matrix Λ_b obtained for each batch of samples to Λ_j, wherein Λ_j represents an accumulation of eigenvalues after the j-th batch of samples is inputted, and is expressed as: Λ_j ← Λ_{j−1} + Λ_b, wherein Λ_{j−1} represents an accumulation of eigenvalues obtained after the (j−1)-th batch of samples is inputted;
step 4, obtaining Fourier projection bases of batch samples, and taking p̂ as a column vector of F; sorting diagonal elements λ₁, λ₂, …, λ_M of the eigenvalue matrix Λ_j in ascending order, and choosing Fourier bases p̂₁, p̂₂, …, p̂_r in the matrix F that correspond to the first r smallest eigenvalues λ₁, λ₂, …, λ_r; constructing a current projection set P_j = {p̂₁, p̂₂, …, p̂_r}, wherein r is a preset number of Fourier projection bases required;
step 5, if the set P_j is identical to P_{j−1}, ending execution of the steps 2 to 4, and obtaining the required Fourier bases P_j = {p̂₁, p̂₂, …, p̂_r} as final Fourier
projection bases; otherwise, executing the steps 2 to 4, and updating a currently inputted batch number, j ← j + 1; and
step 6, performing an inverse Fourier transform of each of the Fourier projection bases in the set P_j, pᵢ = F⁻¹(p̂ᵢ) = (1/√M) Fᴴ p̂ᵢ, i = 1, …, r, to construct a projection matrix V′ = [p₁ p₂ ⋯ p_r]; and multiplying a high-dimensional dataset X by the projection matrix V′ᵀ to obtain a dimensionality-reduced dataset X′ = V′ᵀX.
- 2. The dimensionality reduction and correlation analysis method suitable for the large-scale data according to claim 1, characterized in that the dimension parameter M is required to satisfy M > M₁ and M > M₂.
- 3. The dimensionality reduction and correlation analysis method suitable for the large-scale data according to claim 1, characterized in that the discrete Fourier matrix (DFT) F is expressed as:

F = (1/√M) ·
[ 1  1        ⋯  1
  1  ω        ⋯  ω^(M−1)
  ⋮  ⋮        ⋱  ⋮
  1  ω^(M−1)  ⋯  ω^((M−1)²) ]

wherein ω is a complex number that can be expressed as ω = e^(−2πi/M), and i is an imaginary unit.
- 4. The dimensionality reduction and correlation analysis method suitable for the large-scale data according to claim 1, characterized in that the batch samples X_b and Y_b consist of b = N · g samples randomly inputted according to a threshold g.
- 5. The dimensionality reduction and correlation analysis method suitable for the large-scale data according to claim 1, characterized in that the Fourier transform of xᵢ and yᵢ is performed to obtain x̂ᵢ and ŷᵢ, which are respectively expressed as:

x̂ᵢ = F(xᵢ) = √M · F xᵢ,
ŷᵢ = F(yᵢ) = √M · F yᵢ,

wherein x̂ᵢ and ŷᵢ are respectively generating vectors of the Fourier transform, F(xᵢ) and F(yᵢ) respectively represent performing a fast Fourier transform of vectors xᵢ and yᵢ, and F is the discrete Fourier matrix.
- 6. The dimensionality reduction and correlation analysis method suitable for the large-scale data according to claim 1, characterized in that eigenvalues of the batch samples X_b and Y_b in a current batch are obtained in the following manner:

diag([1./ Σ_{i=1}^{b} (x̂ᵢ* ⊙ x̂ᵢ)] ⊙ [Σ_{i=1}^{b} (x̂ᵢ* ⊙ ŷᵢ)] ⊙ [1./ Σ_{i=1}^{b} (ŷᵢ* ⊙ ŷᵢ)] ⊙ [Σ_{i=1}^{b} (ŷᵢ* ⊙ x̂ᵢ)]) Fᴴ v = λ Fᴴ v,

wherein 1./ is finding a reciprocal of each element in the vector, and λ is a Lagrange factor; b is the batch sample size; x̂ᵢ* and ŷᵢ* are respectively complex conjugates of x̂ᵢ and ŷᵢ; ⊙ represents an element-wise dot product operation in the matrices; diag represents transforming the vector into a diagonal matrix in which the main diagonal consists of the vector elements; v is a main projection vector of a training dataset X, i.e., an eigenvector; Fᴴ is a conjugate transpose of the Fourier matrix F, and H represents the conjugate transpose; for each batch of randomly inputted samples X_b and Y_b, Λ_b is obtained:

Λ_b = diag([1./ Σ_{i=1}^{b} (x̂ᵢ* ⊙ x̂ᵢ)] ⊙ [Σ_{i=1}^{b} (x̂ᵢ* ⊙ ŷᵢ)] ⊙ [1./ Σ_{i=1}^{b} (ŷᵢ* ⊙ ŷᵢ)] ⊙ [Σ_{i=1}^{b} (ŷᵢ* ⊙ x̂ᵢ)]),

wherein Λ_b is the eigenvalue matrix obtained for the batch of samples.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010835235.8A CN112149045A (en) | 2020-08-19 | 2020-08-19 | Dimension reduction and correlation analysis method suitable for large-scale data |
PCT/CN2021/073088 WO2022037012A1 (en) | 2020-08-19 | 2021-01-21 | Dimension reduction and correlation analysis method applicable to large-scale data |
Publications (2)
Publication Number | Publication Date |
---|---|
GB202110472D0 GB202110472D0 (en) | 2021-09-01 |
GB2601862A (en) | 2022-06-15
Family
ID=81656136
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB2110472.4A Pending GB2601862A (en) | 2020-08-19 | 2021-01-21 | Dimension reduction and correlation analysis method applicable to large-scale data |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2601862A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011155288A1 (en) * | 2010-06-11 | 2011-12-15 | 国立大学法人豊橋技術科学大学 | Data index dimension reduction method, and data search method and device using same |
Also Published As
Publication number | Publication date |
---|---|
GB202110472D0 (en) | 2021-09-01 |