GB2601862A - Dimension reduction and correlation analysis method applicable to large-scale data - Google Patents

Dimension reduction and correlation analysis method applicable to large-scale data

Info

Publication number
GB2601862A
GB2601862A
Authority
GB
United Kingdom
Prior art keywords
batch
fourier
samples
matrix
correlation analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2110472.4A
Other versions
GB202110472D0 (en)
Inventor
Shen Xiangjun
Xu Zhaorui
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202010835235.8A external-priority patent/CN112149045A/en
Application filed by Jiangsu University filed Critical Jiangsu University
Publication of GB202110472D0 publication Critical patent/GB202110472D0/en
Publication of GB2601862A publication Critical patent/GB2601862A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/14Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141Discrete Fourier transforms
    • G06F17/142Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • G06F17/156Correlation function computation including computation of convolution operations using a domain transform, e.g. Fourier transform, polynomial transform, number theoretic transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

Disclosed in the present invention is a dimension reduction and correlation analysis method applicable to large-scale data. High-dimensional data is projected into the Fourier domain, so that the problem of solving eigenvectors in linear correlation analysis is transformed into searching for meaningful Fourier-domain bases. Because the Fourier-domain bases are predefined and the eigenvalue distribution of the data is ordered, training samples can be input in batches to accelerate training, until the required Fourier bases are stable and ordered. The number of Fourier bases and a projection matrix are determined, and the projection matrix is multiplied by the high-dimensional dataset to obtain a low-dimensional dataset, thereby facilitating fast data processing. By means of the data dimension reduction method of the present invention, based on the fast Fourier transform and correlation analysis, noise and redundant information in a high-dimensional dataset can be eliminated, and unnecessary operations in data processing can be reduced, thereby improving the running speed and memory utilization efficiency of data dimension reduction computation.

Description

DIMENSIONALITY REDUCTION AND CORRELATION ANALYSIS METHOD
SUITABLE FOR LARGE-SCALE DATA
Technical Field
The present invention relates to the technical field of computer science and image processing, and in particular, to a dimensionality reduction and correlation analysis method suitable for large-scale data.
Background
Traditional data processing methods can no longer effectively analyze mass data. Meanwhile, as the dimensions of data generated by big data processing and cloud computing continue to increase, it is usually necessary to observe data containing multiple variables and collect a large amount of data to analyze and find rules in the studies and applications in many fields. Multivariate large datasets will undoubtedly provide rich information for the studies and applications, but also increase the workload of data collection to a certain extent.
Canonical Correlation Analysis (CCA) is one of the most commonly used algorithms for mining data correlation relationships. It is also a dimensionality reduction technique that can be used to test the correlation of data and find data transformation representations that can emphasize these correlations. The essence of canonical correlation analysis is to select several representative comprehensive indicators (linear combinations of variables) from two sets of random variables, and use the correlation relationship of these indicators to represent the correlation relationship between the original two sets of variables, which can help understand the underlying data structure, cluster analysis, regression analysis and many other tasks.
However, although canonical correlation analysis exhibits good performance, its application in mass data processing is limited due to its high computational complexity. To deal with large-scale data, a number of optimization techniques have been proposed to accelerate the correlation analysis. In terms of different strategies to solve this problem, existing optimization techniques can be roughly classified into the following two general categories. A representation of the first category is the Nystrom matrix approximation technique which aims to use the eigenvectors of a sub-matrix to estimate an approximation of the eigenvectors of the original matrix, to reduce the computational cost of the eigen-decomposition step. Another way is using random Fourier features to approximate the matrix. This approach can turn the original KCCA problem into a high-dimensional linear CCA problem. While these existing methods solve applications in large-scale problems, they are still underutilized in terms of speed and memory efficiency. Fast and efficient computing of mass data is still an issue to be solved.
Summary
To overcome the drawbacks of the prior art, the present invention provides a dimensionality reduction and correlation analysis method suitable for large-scale data, in which the eigenvector pursuit problem of correlation analysis is recast as finding discriminative Fourier bases, samples are inputted in batches for training, and an approximation of the global eigenvalue distribution of the samples is estimated from partial sample values that are stable and orderly. Thus, the computing speed and memory utilization rate in the data dimensionality reduction process are increased, providing support for and speeding up the correlation analysis of mass data.
The following technical solution is adopted in the present invention.
A dimensionality reduction and correlation analysis method suitable for large-scale data, including the following steps: step 1, initializing data, acquiring data sample sets $X (M_1 \times N)$ and $Y (M_2 \times N)$ as required datasets, and initializing a current batch number $j$, a dimension parameter $M$, an initial $M \times M$-dimensional zero matrix $\Lambda_0$, a random Fourier basis set $P_0$, and a discrete Fourier matrix $F$, where $M_1$ and $M_2$ respectively represent dimensions of the datasets $X$ and $Y$, and $N$ is the data sample size; step 2, constructing Fourier data representations of batch samples, randomly inputting batch sample sets $X_b \in R^{M_1 \times b}$ and $Y_b \in R^{M_2 \times b}$, each having a size $b$, and increasing $X_b$ and $Y_b$ to $M$ dimensions by padding with zero elements; respectively performing a Fourier transform of samples $x_i$ and $y_i$ in $X_b$ and $Y_b$ to obtain $\hat{x}_i$ and $\hat{y}_i$; step 3, for each batch of randomly inputted samples $X_b$ and $Y_b$, computing an eigenvalue matrix $\Lambda_b$ obtained from the batch of samples, and, as small batches of samples are continuously inputted, adding the eigenvalue matrix $\Lambda_b$ obtained from each batch of samples to $\Lambda_j$, where $\Lambda_j$ represents the accumulation of eigenvalues after the $j$th batch of samples is inputted; this process is expressed as:
$\Lambda_j \leftarrow \Lambda_{j-1} + \Lambda_b$, where $\Lambda_{j-1}$ represents the accumulation of eigenvalues obtained after the $(j-1)$th batch of samples is inputted; step 4, obtaining Fourier projection bases of batch samples, taking $\hat{v}$ as a column vector of $F$, sorting the diagonal elements $\lambda_1, \lambda_2, \ldots, \lambda_M$ of the eigenvalue matrix $\Lambda_j$ in ascending order, and choosing the Fourier bases $\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_r$ in the matrix $F$ that correspond to the first $r$ smallest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_r$; constructing a current projection set $P_j = \{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_r\}$, where $r$ is a preset number of required Fourier projection bases; step 5, if the set $P_j$ is identical to $P_{j-1}$, ending the execution of steps 2 to 4 and obtaining the required Fourier bases $P_j = \{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_r\}$ as the final Fourier projection bases; otherwise, executing steps 2 to 4 and updating the currently inputted batch number, $j \leftarrow j + 1$; step 6, performing an inverse Fourier transform of each Fourier projection base in the set $P_j$, $p_i = \mathcal{F}^{-1}(\hat{p}_i) = \sqrt{M} F^{-1} \hat{p}_i$, $i = 1, \ldots, r$, to construct a projection matrix $V' = [p_1\ p_2\ \cdots\ p_r]$; multiplying the high-dimensional dataset $X$ by the projection matrix $V'^T$ to obtain a dimensionality-reduced dataset $X' = V'^T X$.
Further, the dimension parameter $M$ is required to satisfy $M > M_1$ and $M > M_2$.
Further, the discrete Fourier matrix (DFT) $F$ is expressed as:
$$F = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & \omega & \cdots & \omega^{M-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \omega^{M-1} & \cdots & \omega^{(M-1)^2} \end{bmatrix}$$
where $\omega$ is a complex number that can be expressed as $\omega = e^{-2\pi i/M}$, and $i$ is the imaginary unit.
Further, the batch samples $X_b$ and $Y_b$ have size $b = N \cdot g$, i.e., $b$ samples are randomly inputted according to a threshold $g$.
Further, a Fourier transform of $x_i$ and $y_i$ is performed to obtain $\hat{x}_i$ and $\hat{y}_i$, which are respectively expressed as:
$$\hat{x}_i = \mathcal{F}(x_i) = \sqrt{1/M}\, F x_i, \qquad \hat{y}_i = \mathcal{F}(y_i) = \sqrt{1/M}\, F y_i$$
where $\hat{x}_i$ and $\hat{y}_i$ are respectively the generating vectors of the Fourier transform, $\mathcal{F}(x_i)$ and $\mathcal{F}(y_i)$ respectively represent performing a fast Fourier transform of the vectors $x_i$ and $y_i$, and $F$ is the discrete Fourier matrix.
Further, the eigenvalues of the batch samples $X_b$ and $Y_b$ in a current batch are obtained in the following manner:
$$\mathrm{diag}\!\left(\left[1./\sum_{i=1}^{b}(\hat{x}_i \odot \hat{x}_i^{*})\right] \odot \left[\sum_{i=1}^{b}(\hat{x}_i^{*} \odot \hat{y}_i)\right] \odot \left[1./\sum_{i=1}^{b}(\hat{y}_i \odot \hat{y}_i^{*})\right] \odot \left[\sum_{i=1}^{b}(\hat{y}_i^{*} \odot \hat{x}_i)\right]\right) F^{H}\hat{v} = \lambda F^{H}\hat{v}$$
where $1./$ denotes taking the reciprocal of each element of the vector, and $\lambda$ is a Lagrange factor; $b$ is the batch sample size; $\hat{x}_i^{*}$ and $\hat{y}_i^{*}$ are respectively the complex conjugates of $\hat{x}_i$ and $\hat{y}_i$; $\odot$ represents the element-wise dot product operation; $\mathrm{diag}$ represents transforming a vector into a diagonal matrix whose main diagonal consists of the vector elements; $\hat{v}$ is a main projection vector of the training dataset $X$, i.e., an eigenvector; $F^{H}$ is the conjugate transpose of the Fourier matrix $F$, and $H$ denotes the conjugate transpose. For each batch of randomly inputted samples $X_b$ and $Y_b$, $\Lambda_b$ is obtained:
$$\Lambda_b = \mathrm{diag}\!\left(\left[1./\sum_{i=1}^{b}(\hat{x}_i \odot \hat{x}_i^{*})\right] \odot \left[\sum_{i=1}^{b}(\hat{x}_i^{*} \odot \hat{y}_i)\right] \odot \left[1./\sum_{i=1}^{b}(\hat{y}_i \odot \hat{y}_i^{*})\right] \odot \left[\sum_{i=1}^{b}(\hat{y}_i^{*} \odot \hat{x}_i)\right]\right)$$
where $\Lambda_b$ is the eigenvalue matrix obtained from the batch of samples.
The present invention has the following beneficial effects.
1. Data is modeled in the Fourier domain by exploiting the repeatability of data patterns. The fast Fourier transform is used to observe each data point in the time sequence from the perspective of the frequency domain, to construct a novel correlation analysis algorithm based on the Fourier domain. The objective of finding projections for correlation analysis can be achieved by choosing discriminative, predefined Fourier bases.
2. Owing to the operation properties of the Fourier domain, the complex matrix inversion operations required in the time domain can be avoided by using simple dot product operations in the Fourier domain.
3. To obtain discriminative Fourier bases, the training process does not require all data samples to be loaded at once, but instead, only requires the data samples to be loaded in several batches till the pursued Fourier bases set is stable. This undoubtedly makes use of the memory more efficiently.
4. The eigenvector pursuing problem of correlation analysis is optimized into finding discriminative Fourier bases, samples are inputted in batches for training, and an approximation of a global eigenvalue distribution of samples is estimated based on partial sample values that are stable and orderly. Thus, the computing speed and memory utilization rate in the data dimensionality reduction process are increased, providing support for and speeding up the correlation analysis of mass data.
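Points 2 above can be illustrated with a standard property (this is our own illustrative sketch, not code from the patent): a circulant matrix is diagonalized by the DFT, so a time-domain matrix inversion reduces to an element-wise division of spectra in the Fourier domain.

```python
import numpy as np

# Solving C z = t for a circulant matrix C: the O(M^3) time-domain inversion
# is replaced by O(M log M) element-wise operations on FFT spectra.
M = 8
c = np.zeros(M)
c[0], c[1], c[-1] = 4.0, 1.0, 1.0            # first column of a well-conditioned C
C = np.stack([np.roll(c, k) for k in range(M)], axis=1)  # circulant matrix
t = np.arange(1.0, M + 1.0)                  # right-hand side

z_direct = np.linalg.solve(C, t)             # time-domain matrix inversion
z_fourier = np.fft.ifft(np.fft.fft(t) / np.fft.fft(c)).real  # Fourier-domain division

assert np.allclose(z_direct, z_fourier)
```

The matrix sizes and values here are arbitrary toy choices; the point is only that the spectral route gives the same answer without forming or inverting the matrix.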
Brief Description of the Drawings
FIG. 1 is a main flowchart of a method according to the present invention.
Detailed Description of the Embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail with reference to accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely used for explaining the present invention, and are not intended to limit the present invention.
As shown in FIG. 1, a dimensionality reduction and correlation analysis method suitable for large-scale data includes the following steps.
Step 1. Data is initialized, and data sample sets $X (M_1 \times N)$ and $Y (M_2 \times N)$ are acquired as required datasets. Note that $M_1$ and $M_2$ respectively represent the dimensions of the datasets $X$ and $Y$; that is, each row of $X$ and $Y$ is taken as an attribute of the data, $X = [x_1\ x_2\ \cdots\ x_N]$ and, similarly, $Y = [y_1\ y_2\ \cdots\ y_N]$, where $N$ represents the data sample size; that is, each column vector ($x_i$ and $y_i$, where $i = 1, 2, \ldots, N$) represents all values of a data sample in the same dimension.
The parameters $j$, $M$, $\Lambda_0$, $F$, and $P_0$ are initialized, where $j$ represents the current batch number in the case of training in batches, and $j = 1$; $M$ is a dimension parameter constructed for obtaining finer eigenvectors, $M > M_1$ and $M > M_2$; $\Lambda_0$ represents an initial $M \times M$-dimensional zero matrix; $P_0$ is a random Fourier basis set, and the elements of the set $P_0$ are column vectors of a discrete Fourier matrix (DFT) $F$. The discrete Fourier matrix $F$ is expressed as:
$$F = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & \omega & \cdots & \omega^{M-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \omega^{M-1} & \cdots & \omega^{(M-1)^2} \end{bmatrix}$$
where $\omega$ is a complex number that can be expressed as $\omega = e^{-2\pi i/M}$, and $i$ is the imaginary unit.
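A minimal NumPy sketch of constructing the matrix $F$ defined above, with entries $F_{jk} = \omega^{jk}$ (the function name is our own; `np.fft.fft` realizes the same transform):

```python
import numpy as np

# Build the M x M discrete Fourier matrix F with F[j, k] = omega**(j*k),
# omega = exp(-2*pi*1j/M), matching the formula in the text.
def dft_matrix(M: int) -> np.ndarray:
    omega = np.exp(-2j * np.pi / M)
    j, k = np.meshgrid(np.arange(M), np.arange(M), indexing="ij")
    return omega ** (j * k)

F = dft_matrix(8)
x = np.arange(8.0)
assert np.allclose(F @ x, np.fft.fft(x))           # F realizes the (unnormalized) DFT
assert np.allclose(F.conj().T @ F, 8 * np.eye(8))  # columns are orthogonal with norm sqrt(M)
```

In practice one would never materialize $F$ for large $M$; the FFT applies it implicitly, which is the point of the method.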
Step 2. Fourier data representations of batch samples are constructed.
Batch samples $X_b \in R^{M_1 \times b}$ and $Y_b \in R^{M_2 \times b}$ of size $b = N \cdot g$ are randomly inputted according to a threshold $g$, where $g$ is between 0.5% and 5%. Taking the dataset $X_b$ as an example, each sample $x_i \in R^{M_1}$ in the dataset $X_b$ is padded with zero elements to $M$ dimensions, that is, $x_i = [x_{i1}\ x_{i2}\ \cdots\ x_{iM_1}\ 0\ \cdots\ 0]^T \in R^{M}$, where $x_{i1}, x_{i2}, \ldots, x_{iM_1}$ respectively represent the values of the sample point $x_i$ under different attributes. The fast Fourier transform method is used to observe the data from the perspective of the frequency domain.
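The zero-padding and batch transform just described can be sketched as follows (a toy example with sizes and names of our own choosing; the $1/\sqrt{M}$ scaling follows the transform as reconstructed in formula (1) below... as reconstructed here):

```python
import numpy as np

# Step 2 sketch: pad each column of a batch from M1 rows to M rows with
# zeros, then FFT every column to get its Fourier-domain representation.
rng = np.random.default_rng(1)
M1, M, b = 5, 8, 4
Xb = rng.standard_normal((M1, b))                  # one random batch of X
Xb_pad = np.vstack([Xb, np.zeros((M - M1, b))])    # pad with zero elements to M dims
Xb_hat = np.fft.fft(Xb_pad, axis=0) / np.sqrt(M)   # columns are the transformed samples

assert Xb_hat.shape == (M, b)
```

The same padding and transform would be applied to the $Y_b$ batch.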
$$\hat{x}_i = \mathcal{F}(x_i) = \sqrt{1/M}\, F x_i \quad (1)$$
where $\mathcal{F}(x_i)$ represents performing a fast Fourier transform of the vector $x_i$; $F$ is the discrete Fourier matrix; and $\hat{x}_i$ is the generating vector of the fast Fourier transform. Similarly, each sample vector $y_i \in R^{M_2}$ in the dataset $Y_b$ is padded with zero elements to $M$ dimensions, and a fast Fourier transform is performed: $\hat{y}_i = \mathcal{F}(y_i) = \sqrt{1/M}\, F y_i$.
Step 3. Eigenvalues of batch samples are obtained.
The eigenvalues of the batch samples $X_b$ and $Y_b$ in the current batch are obtained in the following manner:
$$\mathrm{diag}\!\left(\left[1./\sum_{i=1}^{b}(\hat{x}_i \odot \hat{x}_i^{*})\right] \odot \left[\sum_{i=1}^{b}(\hat{x}_i^{*} \odot \hat{y}_i)\right] \odot \left[1./\sum_{i=1}^{b}(\hat{y}_i \odot \hat{y}_i^{*})\right] \odot \left[\sum_{i=1}^{b}(\hat{y}_i^{*} \odot \hat{x}_i)\right]\right) F^{H}\hat{v} = \lambda F^{H}\hat{v} \quad (2)$$
where $1./$ denotes taking the reciprocal of each element of the vector, and $\lambda$ is a Lagrange factor; $b$ is the batch sample size; $\hat{x}_i^{*}$ and $\hat{y}_i^{*}$ are respectively the complex conjugates of $\hat{x}_i$ and $\hat{y}_i$; $\odot$ represents the element-wise dot product operation; $\mathrm{diag}$ represents transforming a vector into a diagonal matrix whose main diagonal consists of the vector elements; $\hat{v}$ is a main projection vector of the training dataset $X$, i.e., an eigenvector; $F^{H}$ is the conjugate transpose of the Fourier matrix $F$, and $H$ denotes the conjugate transpose. According to formula (2), for each batch of randomly inputted samples $X_b$ and $Y_b$, $\Lambda_b$ is obtained:
$$\Lambda_b = \mathrm{diag}\!\left(\left[1./\sum_{i=1}^{b}(\hat{x}_i \odot \hat{x}_i^{*})\right] \odot \left[\sum_{i=1}^{b}(\hat{x}_i^{*} \odot \hat{y}_i)\right] \odot \left[1./\sum_{i=1}^{b}(\hat{y}_i \odot \hat{y}_i^{*})\right] \odot \left[\sum_{i=1}^{b}(\hat{y}_i^{*} \odot \hat{x}_i)\right]\right) \quad (3)$$
where $\Lambda_b$ is the eigenvalue matrix obtained from the batch of samples, $\Lambda_j$ represents the accumulation of eigenvalues after the $j$th batch of samples is inputted, and $j$ represents the currently inputted batch number. As small batches of samples are continuously inputted, the eigenvalue matrix $\Lambda_b$ obtained for each batch of samples is added to $\Lambda_j$:
$$\Lambda_j \leftarrow \Lambda_{j-1} + \Lambda_b \quad (4)$$
where $\Lambda_{j-1}$ represents the accumulation of eigenvalues obtained after the $(j-1)$th batch of samples is inputted.
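The per-batch eigenvalue computation can be sketched as below. This is a minimal NumPy illustration of formulas (2)-(4) as reconstructed here, keeping only the diagonal of $\Lambda$; all variable names and toy sizes are our own.

```python
import numpy as np

# Step 3 sketch: build the diagonal of Lambda_b from element-wise spectral
# statistics of one batch, then accumulate it across batches.
rng = np.random.default_rng(2)
M, b = 8, 16
Xb_hat = np.fft.fft(rng.standard_normal((M, b)), axis=0)  # batch spectra of X
Yb_hat = np.fft.fft(rng.standard_normal((M, b)), axis=0)  # batch spectra of Y

Sxx = (Xb_hat * Xb_hat.conj()).sum(axis=1)  # sum_i x_hat_i (elementwise) x_hat_i*
Sxy = (Xb_hat.conj() * Yb_hat).sum(axis=1)  # sum_i x_hat_i* (elementwise) y_hat_i
Syy = (Yb_hat * Yb_hat.conj()).sum(axis=1)  # sum_i y_hat_i (elementwise) y_hat_i*
Syx = (Yb_hat.conj() * Xb_hat).sum(axis=1)  # sum_i y_hat_i* (elementwise) x_hat_i

lam_b = (1.0 / Sxx) * Sxy * (1.0 / Syy) * Syx  # diagonal of Lambda_b (formula (3))
Lambda = np.zeros(M, dtype=complex)            # Lambda_0, diagonal kept as a vector
Lambda = Lambda + lam_b                        # formula (4): Lambda_j <- Lambda_{j-1} + Lambda_b

assert np.allclose(lam_b.imag, 0)              # Syx = conj(Sxy), so lam_b is real
```

Note that only element-wise products and divisions appear, which is the source of the memory and speed advantage claimed over the time-domain formulation.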
According to formula (2), $\hat{v}$ is taken as a column vector of $F$; the diagonal elements $\lambda_1, \lambda_2, \ldots, \lambda_M$ of the eigenvalue matrix $\Lambda_j$ are sorted in ascending order, and the Fourier bases $\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_r$ in the matrix $F$ that correspond to the first $r$ smallest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_r$ are chosen, to construct a current projection set $P_j = \{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_r\}$. $r$ is a preset number of required Fourier projection bases, and has a value of 50 herein.
Step 5. If the set $P_j$ is identical to $P_{j-1}$, the execution of steps 2 to 4 is ended, and the required Fourier bases $P_j = \{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_r\}$ are obtained as the final Fourier projection bases. Otherwise, steps 2 to 4 are executed, and the currently inputted batch number is updated, $j \leftarrow j + 1$.
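Steps 4 and 5 can be sketched as follows. Index sets stand in for the basis sets $P_j$ and $P_{j-1}$; the eigenvalue vector, the previous selection, and all names are hypothetical toy values of our own.

```python
import numpy as np

# Steps 4-5 sketch: keep the indices of the r smallest accumulated
# eigenvalues; stop once the selected index set no longer changes.
M, r = 8, 3
Lambda = np.array([0.9, 0.1, 0.5, 0.05, 0.7, 0.3, 0.8, 0.6])  # accumulated diagonal
F = np.fft.fft(np.eye(M))              # columns of F are the candidate Fourier bases

idx = np.sort(np.argsort(Lambda)[:r])  # indices of the r smallest eigenvalues
P_j = F[:, idx]                        # current projection set P_j

idx_prev = np.array([1, 3, 5])         # hypothetical selection after batch j-1
converged = np.array_equal(idx, idx_prev)  # identical sets would end training
```

In a full implementation the selection would be recomputed after every batch, with `idx_prev` carried over from the previous iteration.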
Step 6. An inverse Fourier transform of each Fourier projection base in the set $P_j$, $p_i = \mathcal{F}^{-1}(\hat{p}_i) = \sqrt{M} F^{-1} \hat{p}_i$, $i = 1, \ldots, r$, is performed to obtain a projection matrix $V' = [p_1\ p_2\ \cdots\ p_r]$. The high-dimensional dataset $X$ is multiplied by the projection matrix $V'^T$ to obtain a dimensionality-reduced dataset $X' = V'^T X$.
The above embodiments are only used to illustrate the design ideas and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and practice the present invention accordingly. The protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications made according to the principles and design ideas disclosed in the present invention fall within the protection scope of the present invention.
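Step 6 can be sketched end-to-end as below, assuming the inverse transform $p_i = \sqrt{M}\,F^{-1}\hat{p}_i$ as reconstructed here; the toy sizes, the chosen indices, and all names are our own.

```python
import numpy as np

# Step 6 sketch: bring the chosen Fourier bases back to the time domain,
# stack them into V', and project X down to r dimensions.
rng = np.random.default_rng(4)
M, N, r = 8, 20, 3
X = rng.standard_normal((M, N))        # high-dimensional dataset
F = np.fft.fft(np.eye(M))              # discrete Fourier matrix
idx = np.array([1, 4, 6])              # basis indices chosen in steps 4-5
P_hat = F[:, idx]                      # selected Fourier projection bases

V = np.sqrt(M) * np.fft.ifft(P_hat, axis=0)  # inverse FFT of each base
V = V.real                             # imaginary residue vanishes for these bases
X_reduced = V.T @ X                    # X' = V'^T X

assert X_reduced.shape == (r, N)
```

For DFT columns, the inverse FFT recovers (scaled) standard basis vectors, so the projection here simply selects and rescales rows of $X$; with data-driven basis choices the projection mixes dimensions.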

Claims (6)

  1. A dimensionality reduction and correlation analysis method suitable for large-scale data, characterized by comprising the following steps: step 1, initializing data, acquiring data sample sets $X (M_1 \times N)$ and $Y (M_2 \times N)$ as required datasets, and initializing a current batch number $j$, a dimension parameter $M$, an initial $M \times M$-dimensional zero matrix $\Lambda_0$, a random Fourier basis set $P_0$, and a discrete Fourier matrix $F$, wherein $M_1$ and $M_2$ respectively represent dimensions of the datasets $X$ and $Y$, and $N$ is a data sample size; step 2, constructing Fourier data representations of batch samples, randomly inputting batch sample sets $X_b \in R^{M_1 \times b}$ and $Y_b \in R^{M_2 \times b}$, each having a size $b$, and increasing $X_b$ and $Y_b$ to $M$ dimensions by padding with zero elements; respectively performing a Fourier transform of samples $x_i$ and $y_i$ in $X_b$ and $Y_b$ to obtain $\hat{x}_i$ and $\hat{y}_i$; step 3, for each batch of randomly inputted samples $X_b$ and $Y_b$, computing an eigenvalue matrix $\Lambda_b$ obtained for the batch of samples, and, as small batches of samples are continuously inputted, adding the eigenvalue matrix $\Lambda_b$ obtained for each batch of samples to $\Lambda_j$, wherein $\Lambda_j$ represents an accumulation of eigenvalues after the $j$th batch of samples is inputted, and is expressed as: $\Lambda_j \leftarrow \Lambda_{j-1} + \Lambda_b$, wherein $\Lambda_{j-1}$ represents an accumulation of eigenvalues obtained after the $(j-1)$th batch of samples is inputted; step 4, obtaining Fourier projection bases of batch samples, and taking $\hat{v}$ as a column vector of $F$; sorting diagonal elements $\lambda_1, \lambda_2, \ldots, \lambda_M$ of the eigenvalue matrix $\Lambda_j$ in ascending order, and choosing Fourier bases $\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_r$ in the matrix $F$ that correspond to the first $r$ smallest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_r$; constructing a current projection set $P_j = \{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_r\}$, wherein $r$ is a preset number of required Fourier projection bases; step 5, if the set $P_j$ is identical to $P_{j-1}$, ending execution of steps 2 to 4, and obtaining the required Fourier bases $P_j = \{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_r\}$ as final Fourier projection bases; otherwise, executing steps 2 to 4, and updating the currently inputted batch number, $j \leftarrow j + 1$; and step 6, performing an inverse Fourier transform of each of the Fourier projection bases in the set $P_j$, $p_i = \mathcal{F}^{-1}(\hat{p}_i) = \sqrt{M} F^{-1} \hat{p}_i$, $i = 1, \ldots, r$, to construct a projection matrix $V' = [p_1\ p_2\ \cdots\ p_r]$; and multiplying a high-dimensional dataset $X$ by the projection matrix $V'^T$ to obtain a dimensionality-reduced dataset $X' = V'^T X$.
  2. The dimensionality reduction and correlation analysis method suitable for the large-scale data according to claim 1, characterized in that the dimension parameter $M$ is required to satisfy $M > M_1$ and $M > M_2$.
  3. The dimensionality reduction and correlation analysis method suitable for the large-scale data according to claim 1, characterized in that the discrete Fourier matrix (DFT) $F$ is expressed as:
$$F = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & \omega & \cdots & \omega^{M-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \omega^{M-1} & \cdots & \omega^{(M-1)^2} \end{bmatrix}$$
wherein $\omega$ is a complex number that can be expressed as $\omega = e^{-2\pi i/M}$, and $i$ is an imaginary unit.
  4. The dimensionality reduction and correlation analysis method suitable for the large-scale data according to claim 1, characterized in that the batch samples $X_b$ and $Y_b$ are batch samples of size $b = N \cdot g$ randomly inputted according to a threshold $g$.
  5. The dimensionality reduction and correlation analysis method suitable for the large-scale data according to claim 1, characterized in that the Fourier transform of $x_i$ and $y_i$ is performed to obtain $\hat{x}_i$ and $\hat{y}_i$, which are respectively expressed as:
$$\hat{x}_i = \mathcal{F}(x_i) = \sqrt{1/M}\, F x_i, \qquad \hat{y}_i = \mathcal{F}(y_i) = \sqrt{1/M}\, F y_i$$
wherein $\hat{x}_i$ and $\hat{y}_i$ are respectively generating vectors of the Fourier transform, $\mathcal{F}(x_i)$ and $\mathcal{F}(y_i)$ respectively represent performing a fast Fourier transform of the vectors $x_i$ and $y_i$, and $F$ is the discrete Fourier matrix.
  6. The dimensionality reduction and correlation analysis method suitable for the large-scale data according to claim 1, characterized in that eigenvalues of the batch samples $X_b$ and $Y_b$ in a current batch are obtained in the following manner:
$$\mathrm{diag}\!\left(\left[1./\sum_{i=1}^{b}(\hat{x}_i \odot \hat{x}_i^{*})\right] \odot \left[\sum_{i=1}^{b}(\hat{x}_i^{*} \odot \hat{y}_i)\right] \odot \left[1./\sum_{i=1}^{b}(\hat{y}_i \odot \hat{y}_i^{*})\right] \odot \left[\sum_{i=1}^{b}(\hat{y}_i^{*} \odot \hat{x}_i)\right]\right) F^{H}\hat{v} = \lambda F^{H}\hat{v}$$
wherein $1./$ denotes taking the reciprocal of each element of the vector, and $\lambda$ is a Lagrange factor; $b$ is the batch sample size; $\hat{x}_i^{*}$ and $\hat{y}_i^{*}$ are respectively the complex conjugates of $\hat{x}_i$ and $\hat{y}_i$; $\odot$ represents an element-wise dot product operation; $\mathrm{diag}$ represents transforming a vector into a diagonal matrix whose main diagonal consists of the vector elements; $\hat{v}$ is a main projection vector of a training dataset $X$, i.e., an eigenvector; $F^{H}$ is a conjugate transpose of the Fourier matrix $F$, and $H$ represents the conjugate transpose; for each batch of randomly inputted samples $X_b$ and $Y_b$, $\Lambda_b$ is obtained:
$$\Lambda_b = \mathrm{diag}\!\left(\left[1./\sum_{i=1}^{b}(\hat{x}_i \odot \hat{x}_i^{*})\right] \odot \left[\sum_{i=1}^{b}(\hat{x}_i^{*} \odot \hat{y}_i)\right] \odot \left[1./\sum_{i=1}^{b}(\hat{y}_i \odot \hat{y}_i^{*})\right] \odot \left[\sum_{i=1}^{b}(\hat{y}_i^{*} \odot \hat{x}_i)\right]\right)$$
wherein $\Lambda_b$ is the eigenvalue matrix obtained for the batch of samples.
GB2110472.4A 2020-08-19 2021-01-21 Dimension reduction and correlation analysis method applicable to large-scale data Pending GB2601862A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010835235.8A CN112149045A (en) 2020-08-19 2020-08-19 Dimension reduction and correlation analysis method suitable for large-scale data
PCT/CN2021/073088 WO2022037012A1 (en) 2020-08-19 2021-01-21 Dimension reduction and correlation analysis method applicable to large-scale data

Publications (2)

Publication Number Publication Date
GB202110472D0 GB202110472D0 (en) 2021-09-01
GB2601862A true GB2601862A (en) 2022-06-15

Family

ID=81656136

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2110472.4A Pending GB2601862A (en) 2020-08-19 2021-01-21 Dimension reduction and correlation analysis method applicable to large-scale data

Country Status (1)

Country Link
GB (1) GB2601862A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011155288A1 (en) * 2010-06-11 2011-12-15 国立大学法人豊橋技術科学大学 Data index dimension reduction method, and data search method and device using same


Similar Documents

Publication Publication Date Title
WO2022037012A1 (en) Dimension reduction and correlation analysis method applicable to large-scale data
US8650138B2 (en) Active metric learning device, active metric learning method, and active metric learning program
Zhang et al. Efficient kNN algorithm based on graph sparse reconstruction
Yuan et al. Randomized tensor ring decomposition and its application to large-scale data reconstruction
Kotte et al. A similarity function for feature pattern clustering and high dimensional text document classification
Zhao et al. Joint adaptive graph learning and discriminative analysis for unsupervised feature selection
Li Principal Component Analysis
Nakaji et al. Measurement optimization of variational quantum simulation by classical shadow and derandomization
Huang et al. Support vector machine classification algorithm based on relief-F feature weighting
GB2601862A (en) Dimension reduction and correlation analysis method applicable to large-scale data
WO2023024210A1 (en) Data dimension reduction method based on fourier-domain principal component analysis
WO2022188711A1 (en) Svm model training method and apparatus, device, and computer-readable storage medium
Li et al. Objective extraction for simplifying many-objective solution sets
An et al. Unsupervised feature selection with joint clustering analysis
US11120072B1 (en) High dimensional to low dimensional data transformation and visualization system
CN114021322A (en) Modal mode shape analysis method and device of linear periodic time-varying system
Yang et al. Robust non-negative matrix factorization via joint sparse and graph regularization
El Omari Notes on spherical bifractional Brownian motion
Bannai et al. Classification of Spherical $2 $-distance $\{4, 2, 1\} $-designs by Solving Diophantine Equations
Li et al. Enforced block diagonal graph learning for multikernel clustering
Ye et al. Dual global structure preservation based supervised feature selection
Shen et al. Automatic Gaussian Bandwidth Selection for Kernel Principal Component Analysis
Chen et al. FINC: An efficient and effective optimization method for normalized cut
Xia et al. C-ISTA: Iterative shrinkage-thresholding algorithm for sparse covariance matrix estimation
Atwa et al. Active query selection for constraint-based clustering algorithms