WO2022037012A1 - Dimension reduction and correlation analysis method applicable to large-scale data - Google Patents

Dimension reduction and correlation analysis method applicable to large-scale data

Info

Publication number
WO2022037012A1
Authority
WO
WIPO (PCT)
Prior art keywords
fourier
batch
samples
matrix
data
Prior art date
Application number
PCT/CN2021/073088
Other languages
French (fr)
Chinese (zh)
Inventor
沈项军
徐兆瑞
Original Assignee
江苏大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 江苏大学
Priority to GB2110472.4A (patent GB2601862A)
Publication of WO2022037012A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141 Discrete Fourier transforms
    • G06F17/142 Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Definitions

  • The invention belongs to the field of computer science and image processing technology, and in particular relates to a dimensionality reduction and correlation analysis method suitable for large-scale data.
  • Canonical correlation analysis (CCA) is one of the most commonly used algorithms for mining correlations in data. It is also a dimensionality reduction technique that can be used to test data correlations and to find transformed representations of the data that emphasize those correlations.
  • The essence of canonical correlation analysis is to select several representative composite indicators (linear combinations of variables) from two groups of random variables and to use the correlation between these indicators to represent the correlation between the original two groups of variables. This helps with understanding the underlying data structure, cluster analysis, regression analysis, and many other tasks.
  • The present invention proposes a dimensionality reduction and correlation analysis method suitable for large-scale data.
  • With batch-wise training input, the stable and ordered eigenvalues of partial samples are used to approximate the eigenvalue distribution of the full sample set.
  • This improves the computation speed and memory utilization of the dimensionality reduction process and supports and accelerates correlation analysis of massive data.
  • Step 1, data initialization: collect data sample sets $X$ ($M_1 \times N$) and $Y$ ($M_2 \times N$) as the required data sets, and initialize the current batch number $j$, the dimension parameter $M$, the initial $M \times M$ zero matrix $\Lambda_0$, the random Fourier basis set $P_0$, and the discrete Fourier matrix $F$, where $M_1$ and $M_2$ are the dimensions of data sets $X$ and $Y$ respectively, and $N$ is the number of data samples.
  • Step 2, construct the Fourier data representation of the batch samples: randomly input batch sample sets $X_b$ and $Y_b$ of size $b$, raise $X_b$ and $Y_b$ to $M$ dimensions by zero padding, and apply the Fourier transform to the samples $x_i$ and $y_i$ in $X_b$ and $Y_b$ respectively.
  • Step 3, for each batch of randomly input samples $X_b$, $Y_b$, compute the eigenvalue matrix $\Lambda_b$ obtained from that batch. As mini-batches keep arriving, the $\Lambda_b$ of each batch is added to $\Lambda_j$, where $\Lambda_j$ denotes the eigenvalues accumulated after the $j$-th batch has been input: $\Lambda_j \leftarrow \Lambda_{j-1} + \Lambda_b$.
  • $\Lambda_{j-1}$ denotes the eigenvalues accumulated after $j-1$ batches of samples have been input.
  • Step 4, obtain the Fourier projection bases of the batch samples, with the projection vector taken to be a column vector of $F$: sort the diagonal elements $\lambda_1, \lambda_2, \dots, \lambda_M$ of the eigenvalue matrix $\Lambda_j$ in ascending order, and select the Fourier bases in the matrix $F$ that correspond to the $r$ smallest eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_r$ to form the current projection set, where $r$ is the preset number of required Fourier projection bases.
  • Step 5, if the set $P_j$ is the same as $P_{j-1}$, stop executing steps 2 to 4 and take the resulting Fourier bases as the final Fourier projection bases; otherwise, repeat steps 2 to 4 and update the current batch number, $j \leftarrow j+1$.
  • Step 6, perform an inverse Fourier transform on each Fourier projection basis in the set $P_j$.
  • The discrete Fourier (DFT) matrix is denoted $F$.
  • $x_i$ and $y_i$ are each Fourier transformed.
  • $\Lambda_b$ is the eigenvalue matrix obtained from the batch of samples.
  • To obtain meaningful Fourier bases, training does not need to load all data samples. It only needs to load a few batches, until the ordering of the selected Fourier bases becomes stable, which uses memory more efficiently.
  • Fig. 1 is the main flow chart of the method proposed by the present invention.
  • A dimensionality reduction and correlation analysis method suitable for large-scale data includes the following steps:
  • Step 1, data initialization: collect data sample sets $X$ ($M_1 \times N$) and $Y$ ($M_2 \times N$) as the required data sets.
  • $M_1$ and $M_2$ are the dimensions of data sets $X$ and $Y$ respectively; that is, each row of $X$ and $Y$ is one attribute of the data.
  • $X = [x_1\ x_2\ \dots\ x_N]$ and, likewise, $Y = [y_1\ y_2\ \dots\ y_N]$.
  • Step 2, construct the Fourier data representation of the batch samples.
  • Step 3, obtain the eigenvalues of the batch samples.
  • $\Lambda_j$ denotes the eigenvalues accumulated after the $j$-th batch of partial samples has been input, and $j$ denotes the number of batches input so far.
  • The eigenvalue matrix $\Lambda_b$ obtained from each batch of samples is added to $\Lambda_j$.
  • Step 4, obtain the Fourier projection bases of the batch samples.
  • Step 5, if the set $P_j$ is the same as $P_{j-1}$, stop executing steps 2 to 4 and take the resulting Fourier bases as the final Fourier projection bases. Otherwise, repeat steps 2 to 4 and update the current batch number, $j \leftarrow j+1$.

Abstract

Disclosed in the present invention is a dimension reduction and correlation analysis method applicable to large-scale data. High-dimensional data is projected into a Fourier domain, such that the problem of solving feature vectors in linear correlation analysis is transformed into searching for meaningful Fourier domain bases. Fourier domain bases are predefined and feature value distribution of data is ordered, and therefore, training samples are input in batches to accelerate training, until required Fourier bases are stable and ordered. The number of Fourier bases, and a projection matrix are determined, and the projection matrix is multiplied by a high-dimensional data set to obtain a low-dimensional data set, thereby facilitating the fast data processing. By means of the data dimension reduction method in the present invention, on the basis of fast Fourier transform and correlation analysis, noise and redundant information in a high-dimensional data set can be eliminated, and unnecessary operation processes in data processing can be reduced, thereby improving the running speed and memory utilization efficiency in data dimension reduction calculation.

Description

A dimensionality reduction and correlation analysis method suitable for large-scale data
Technical field
The invention belongs to the field of computer science and image processing technology, and in particular relates to a dimensionality reduction and correlation analysis method suitable for large-scale data.
Background
Traditional data processing methods can no longer analyze massive data effectively. At the same time, as the dimensionality of data generated by big data processing and cloud computing keeps increasing, research and applications in many fields need to observe data containing multiple variables, collect large amounts of data, and then analyze them to find patterns. Large multivariate data sets undoubtedly provide rich information for research and applications, but they also increase the workload of data collection to some extent.
Canonical correlation analysis (CCA) is one of the most commonly used algorithms for mining correlations in data. It is also a dimensionality reduction technique that can be used to test data correlations and to find transformed representations of the data that emphasize those correlations. The essence of canonical correlation analysis is to select several representative composite indicators (linear combinations of variables) from two groups of random variables and to use the correlation between these indicators to represent the correlation between the original two groups of variables. This helps with understanding the underlying data structure, cluster analysis, regression analysis, and many other tasks.
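For reference, the usual textbook form of the CCA objective is given below. It is the standard formulation rather than a formula taken from this application, and the symbols $w_x$, $w_y$, $C_{xx}$, $C_{yy}$ and $C_{xy}$ (projection vectors, within-set covariance matrices and cross-covariance matrix of the two variable groups) are introduced here only for illustration.

```latex
\rho = \max_{w_x,\, w_y}
\frac{w_x^{\top} C_{xy}\, w_y}
     {\sqrt{w_x^{\top} C_{xx}\, w_x}\;\sqrt{w_y^{\top} C_{yy}\, w_y}}
```

The composite indicators mentioned above correspond to the linear combinations $w_x^{\top} x$ and $w_y^{\top} y$, and $\rho$ is the correlation between them.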
However, although CCA performs well, its high computational complexity limits its application to massive data processing problems. To handle large-scale data, many optimization techniques have been proposed to accelerate correlation analysis algorithms. Depending on the strategy, existing optimization techniques fall roughly into two categories. The first uses the Nyström matrix approximation technique, which reduces the cost of the eigendecomposition step by using the eigenvectors of a computed submatrix to approximate the eigenvectors of the original matrix. The second uses random Fourier features to approximate the matrix, which turns the original kernel CCA (KCCA) problem into a high-dimensional linear CCA problem. Although these methods make it possible to process massive data, they still fall short in speed and memory efficiency, so fast and efficient computation on massive data remains an open problem.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes a dimensionality reduction and correlation analysis method suitable for large-scale data. It recasts the eigenvector problem of correlation analysis as a search for meaningful Fourier-domain bases and feeds the training samples in batches, using the stable and ordered eigenvalues of partial samples to approximate the eigenvalue distribution of the full sample set. This improves the computation speed and memory utilization of the dimensionality reduction process and supports and accelerates correlation analysis of massive data.
The technical scheme adopted by the present invention is as follows:
A dimensionality reduction and correlation analysis method suitable for large-scale data, comprising the following steps:
Step 1, data initialization: collect data sample sets $X$ ($M_1 \times N$) and $Y$ ($M_2 \times N$) as the required data sets, and initialize the current batch number $j$, the dimension parameter $M$, the initial $M \times M$ zero matrix $\Lambda_0$, the random Fourier basis set $P_0$, and the discrete Fourier matrix $F$, where $M_1$ and $M_2$ are the dimensions of data sets $X$ and $Y$ respectively, and $N$ is the number of data samples;
Step 2, construct the Fourier data representation of the batch samples: randomly input batch sample sets [Figure PCTCN2021073088-appb-000001] and [Figure PCTCN2021073088-appb-000002] of size $b$; raise $X_b$ and $Y_b$ to $M$ dimensions by zero padding; and apply the Fourier transform to the samples $x_i$ and $y_i$ in $X_b$ and $Y_b$ respectively to obtain [Figure PCTCN2021073088-appb-000003].
Step 3, for each batch of randomly input samples $X_b$, $Y_b$, compute the eigenvalue matrix $\Lambda_b$ obtained from that batch. As mini-batches keep arriving, the eigenvalue matrix $\Lambda_b$ of each batch is added to $\Lambda_j$, where $\Lambda_j$ denotes the eigenvalues accumulated after the $j$-th batch of partial samples has been input. The process is expressed as: $\Lambda_j \leftarrow \Lambda_{j-1} + \Lambda_b$;
where $\Lambda_{j-1}$ denotes the eigenvalues accumulated after $j-1$ batches of samples have been input.
Step 4, obtain the Fourier projection bases of the batch samples: take [Figure PCTCN2021073088-appb-000004] to be a column vector of $F$. Sort the diagonal elements $\lambda_1, \lambda_2, \dots, \lambda_M$ of the eigenvalue matrix $\Lambda_j$ in ascending order, and select the Fourier bases [Figure PCTCN2021073088-appb-000005] in the matrix $F$ that correspond to the $r$ smallest eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_r$ to form the current projection set [Figure PCTCN2021073088-appb-000006], where $r$ is the preset number of required Fourier projection bases.
Step 5, if the set $P_j$ is the same as $P_{j-1}$, stop executing steps 2 to 4 and take the resulting Fourier bases [Figure PCTCN2021073088-appb-000007] as the final Fourier projection bases; otherwise, repeat steps 2 to 4 and update the current batch number, $j \leftarrow j+1$.
Step 6, perform an inverse Fourier transform on each Fourier projection basis in the set $P_j$, [Figure PCTCN2021073088-appb-000008] [Figure PCTCN2021073088-appb-000009], to form the projection matrix $V' = [p_1\ p_2\ \dots\ p_r]$; multiply the high-dimensional data set $X$ by the projection matrix $V'^{\top}$ to obtain the dimension-reduced data set $X' = V'^{\top} X$.
Further, the dimension parameter $M$ is required to satisfy $M \ge M_1$ and $M \ge M_2$.
Further, the discrete Fourier (DFT) matrix $F$ is expressed as:
[Figure PCTCN2021073088-appb-000010]
where $\omega$ is a complex number that can be written as $\omega = e^{-2\pi i/M}$, and $i$ is the imaginary unit.
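A DFT matrix built from $\omega = e^{-2\pi i/M}$ can be sketched in NumPy as follows. Because the equation image is not available here, the sketch assumes the common unnormalized convention with entries $\omega^{kn}$; the function name `dft_matrix` and the test vector `x` are hypothetical.

```python
import numpy as np

def dft_matrix(M: int) -> np.ndarray:
    """Return an M x M DFT matrix with entries omega**(k*n), omega = exp(-2*pi*i/M).

    Assumption: no 1/sqrt(M) normalization factor is applied.
    """
    omega = np.exp(-2j * np.pi / M)
    k = np.arange(M)
    return omega ** np.outer(k, k)

M = 8
F = dft_matrix(M)
x = np.arange(M, dtype=float)   # hypothetical test vector

# Under this convention, multiplying by F gives the same result as NumPy's FFT.
same_as_fft = np.allclose(F @ x, np.fft.fft(x))
```

In practice the matrix product costs $O(M^2)$ per vector while the FFT costs $O(M \log M)$, which is why the text refers to the fast Fourier transform when transforming samples.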
Further, the batch samples $X_b$ and $Y_b$ are drawn at random according to a threshold $g$, with batch size $b = N \cdot g$.
Further, applying the Fourier transform to $x_i$ and $y_i$ yields [Figure PCTCN2021073088-appb-000011], expressed respectively as:
[Figure PCTCN2021073088-appb-000012]
[Figure PCTCN2021073088-appb-000013]
where [Figure PCTCN2021073088-appb-000014] are the generating vectors of the Fourier transform, [Figure PCTCN2021073088-appb-000015] denote the fast Fourier transform of the vectors $x_i$ and $y_i$ respectively, and $F$ is the discrete Fourier matrix.
Further, the eigenvalues of the current batch samples $X_b$ and $Y_b$ are obtained as follows:
[Figure PCTCN2021073088-appb-000016]
where $1./$ denotes the element-wise reciprocal of a vector; $\lambda$ is the Lagrange multiplier; $b$ is the number of samples in the batch; [Figure PCTCN2021073088-appb-000017] are the complex conjugate matrices of [Figure PCTCN2021073088-appb-000018] respectively; $\odot$ is the element-wise product of matrices; $\mathrm{diag}$ converts a vector into a diagonal matrix whose main diagonal holds the elements of the vector; [Figure PCTCN2021073088-appb-000019] is the principal projection vector of the training data set $X$, that is, the eigenvector; and $F^{H}$ is the conjugate transpose of the Fourier matrix $F$, with $H$ denoting the conjugate transpose operation. For each batch of randomly input samples $X_b$, $Y_b$, we obtain $\Lambda_b$:
[Figure PCTCN2021073088-appb-000020]
where $\Lambda_b$ is the eigenvalue matrix obtained from that batch of samples.
Beneficial effects of the present invention:
1. The data are modeled in the Fourier domain by exploiting the repeatability of data sequences. The fast Fourier transform is used to observe each data point of the time series from the frequency-domain perspective, yielding a new Fourier-domain correlation analysis algorithm. The projection targets of the correlation analysis can be found by finding predefined, meaningful Fourier bases.
2. Owing to the computational properties of the Fourier domain, simple element-wise matrix products in the Fourier domain replace the costly matrix inversion that would otherwise be required in the time domain.
3. To obtain meaningful Fourier bases, training does not need to load all data samples. It only needs to load a few batches, until the ordering of the selected Fourier bases becomes stable, which uses memory more efficiently.
4. The eigenvector problem of correlation analysis is recast as a search for meaningful Fourier-domain bases, and the training samples are fed in batches, so that the stable and ordered eigenvalues of partial samples approximate the eigenvalue distribution of the full sample set. This improves the computation speed and memory utilization of the dimensionality reduction process and supports and accelerates correlation analysis of massive data.
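The second benefit can be illustrated with a standard identity for circulant matrices, one common setting in which generating vectors and the DFT appear together. The sketch below is an illustrative assumption and not the formula of this application: the helper `circulant`, the vectors `c` and `y`, and the regularized system $(C + \lambda I)\,w = y$ are hypothetical and chosen only to show a matrix inverse being replaced by element-wise division in the Fourier domain.

```python
import numpy as np

def circulant(c: np.ndarray) -> np.ndarray:
    """Build the circulant matrix whose first column is the generating vector c."""
    n = len(c)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return c[idx]

n, lam = 16, 0.5
c = np.zeros(n)
c[0], c[1], c[-1] = 4.0, 1.0, 1.0     # hypothetical generating vector
y = np.arange(n, dtype=float)         # hypothetical right-hand side

# Time domain: solve (C + lam*I) w = y with a dense solver.
w_dense = np.linalg.solve(circulant(c) + lam * np.eye(n), y)

# Fourier domain: the same solve reduces to an element-wise division.
w_fft = np.fft.ifft(np.fft.fft(y) / (np.fft.fft(c) + lam)).real
```

Both routes give the same `w`; the Fourier route needs only FFTs and element-wise operations instead of a dense factorization.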
Description of drawings
Fig. 1 is the main flow chart of the method proposed by the present invention.
Detailed description
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawing and an embodiment. The specific embodiment described here only serves to explain the present invention and does not limit it.
As shown in Fig. 1, a dimensionality reduction and correlation analysis method suitable for large-scale data includes the following steps:
Step 1, data initialization: collect data sample sets $X$ ($M_1 \times N$) and $Y$ ($M_2 \times N$) as the required data sets. Here $M_1$ and $M_2$ are the dimensions of data sets $X$ and $Y$ respectively, that is, each row of $X$ and $Y$ is one attribute of the data. $X = [x_1\ x_2\ \dots\ x_N]$ and, likewise, $Y = [y_1\ y_2\ \dots\ y_N]$, where $N$ is the number of data samples, that is, each column vector ($x_i$ and $y_i$, $i = 1, 2, \dots, N$) holds the values of one data sample across the attributes.
Initialization parameters: $j$, $M$, $\Lambda_0$, $F$, $P_0$. Here $j$ is the current batch number of the batch-wise training, with $j = 1$; $M$ is a dimension parameter constructed to obtain finer eigenvectors, with $M > M_1$ and $M > M_2$; $\Lambda_0$ is the initial $M \times M$ zero matrix; and $P_0$ is a random Fourier basis set whose elements are column vectors of the discrete Fourier (DFT) matrix $F$. The DFT matrix $F$ is expressed as:
[Figure PCTCN2021073088-appb-000021]
where $\omega$ is a complex number that can be written as $\omega = e^{-2\pi i/M}$, and $i$ is the imaginary unit.
Step 2, construct the Fourier data representation of the batch samples.
According to the threshold $g$, randomly input batch samples [Figure PCTCN2021073088-appb-000022] and [Figure PCTCN2021073088-appb-000023] of size $b = N \cdot g$, with $g$ between 0.5% and 5%. Taking the data set $X_b$ as an example, each sample [Figure PCTCN2021073088-appb-000024] in $X_b$ is raised to $M$ dimensions by zero padding, that is, [Figure PCTCN2021073088-appb-000025], where [Figure PCTCN2021073088-appb-000026] are the values of the sample point $x_i$ under its different attributes. The fast Fourier transform is then used to view the data from the frequency-domain perspective:
[Figure PCTCN2021073088-appb-000027]
where [Figure PCTCN2021073088-appb-000028] denotes the fast Fourier transform of the vector $x_i$; $F$ is the discrete Fourier matrix; and [Figure PCTCN2021073088-appb-000029] is the generating vector of the Fourier transform, with ^ marking a generating vector produced by the fast Fourier transform. Likewise, each sample vector [Figure PCTCN2021073088-appb-000030] in the data set $Y_b$ is raised to $M$ dimensions by zero padding and then fast Fourier transformed:
[Figure PCTCN2021073088-appb-000031]
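Step 2 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function name `fourier_batch`, the random data, and the choices `M = 32` and `g = 0.02` are hypothetical, and the batch is drawn without replacement.

```python
import numpy as np

def fourier_batch(X: np.ndarray, M: int, g: float, rng: np.random.Generator) -> np.ndarray:
    """Draw b = N*g random samples (columns) of X, zero-pad each to M rows, FFT each column."""
    M1, N = X.shape
    b = max(1, round(N * g))                       # batch size b = N * g
    cols = rng.choice(N, size=b, replace=False)    # random sample indices (assumed: no repeats)
    X_b = np.zeros((M, b))
    X_b[:M1, :] = X[:, cols]                       # zero padding from M1 up to M rows
    return np.fft.fft(X_b, axis=0)                 # one FFT per sample (column)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 1000))   # hypothetical data: M1 = 20 attributes, N = 1000 samples
X_hat_b = fourier_batch(X, M=32, g=0.02, rng=rng)
print(X_hat_b.shape)                  # (32, 20)
```

Because $X$ and $Y$ hold paired samples, a real implementation would have to reuse the same column indices for $Y_b$; the sketch handles one data set only to stay short.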
Step 3, obtain the eigenvalues of the batch samples.
The eigenvalues of the current batch samples $X_b$ and $Y_b$ are obtained as follows:
[Figure PCTCN2021073088-appb-000032]
where $1./$ denotes the element-wise reciprocal of a vector; $\lambda$ is the Lagrange multiplier; $b$ is the number of samples in the batch; [Figure PCTCN2021073088-appb-000033] are the complex conjugate matrices of [Figure PCTCN2021073088-appb-000034] respectively; $\odot$ is the element-wise product of matrices; $\mathrm{diag}$ converts a vector into a diagonal matrix whose main diagonal holds the elements of the vector; [Figure PCTCN2021073088-appb-000035] is the principal projection vector of the training data set $X$, that is, the eigenvector; and $F^{H}$ is the conjugate transpose of the Fourier matrix $F$, with $H$ denoting the conjugate transpose operation. According to formula (2), for each batch of randomly input samples $X_b$ and $Y_b$, we obtain:
[Figure PCTCN2021073088-appb-000036]
where $\Lambda_b$ is the eigenvalue matrix obtained from that batch of samples. We use $\Lambda_j$ to denote the eigenvalues accumulated after the $j$-th batch of partial samples has been input, and $j$ to denote the number of batches input so far. As mini-batches keep arriving, the eigenvalue matrix $\Lambda_b$ obtained from each batch is added to $\Lambda_j$:
$\Lambda_j \leftarrow \Lambda_{j-1} + \Lambda_b$      (4)
where $\Lambda_{j-1}$ denotes the eigenvalues accumulated after $j-1$ batches of samples have been input.
Step 4, obtain the Fourier projection bases of the batch samples.
According to formula (2), take [Figure PCTCN2021073088-appb-000037] to be a column vector of $F$. Sort the diagonal elements $\lambda_1, \lambda_2, \dots, \lambda_M$ of the eigenvalue matrix $\Lambda_j$ in ascending order, and select the Fourier bases [Figure PCTCN2021073088-appb-000038] in the matrix $F$ that correspond to the $r$ smallest eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_r$ to form the current projection set [Figure PCTCN2021073088-appb-000039], where $r$ is the preset number of required Fourier projection bases, set to 50 in this embodiment.
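The selection in step 4 amounts to an ascending sort of the diagonal of $\Lambda_j$ followed by picking the matching columns of $F$. A minimal sketch, assuming an unnormalized DFT matrix and comparing the real parts of the diagonal; the function name `select_fourier_bases` and the example eigenvalues are hypothetical:

```python
import numpy as np

def select_fourier_bases(Lambda_j: np.ndarray, F: np.ndarray, r: int):
    """Return the indices and columns of F for the r smallest diagonal entries of Lambda_j."""
    eigvals = np.real(np.diag(Lambda_j))          # assumption: order by the real part
    order = np.argsort(eigvals, kind="stable")    # ascending order
    idx = order[:r]
    return idx, F[:, idx]

M, r = 8, 3
k = np.arange(M)
F = np.exp(-2j * np.pi * np.outer(k, k) / M)      # unnormalized DFT matrix (assumed convention)
Lambda_j = np.diag([5.0, 1.0, 4.0, 0.5, 3.0, 2.0, 6.0, 7.0])  # hypothetical accumulated eigenvalues
idx, bases = select_fourier_bases(Lambda_j, F, r)
print(idx)   # [3 1 5]
```

Keeping the column indices alongside the columns themselves makes the set comparison in step 5 cheap, since two index sets can be compared without touching the complex-valued vectors.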
Step 5, if the set $P_j$ is the same as $P_{j-1}$, stop executing steps 2 to 4 and take the resulting Fourier bases [Figure PCTCN2021073088-appb-000040] as the final Fourier projection bases. Otherwise, repeat steps 2 to 4 and update the current batch number, $j \leftarrow j+1$.
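The loop formed by steps 2 to 5 can be sketched as below. The per-batch eigenvalue formula is not reproduced; instead the sketch takes a hypothetical callable `batch_eigvals` that returns the diagonal of $\Lambda_b$ for the next random batch. The names `train_projection_set`, `max_batches`, `true_diag` and `noisy` are likewise assumptions, and the previous set starts as `None` rather than as a random set $P_0$ to keep the example deterministic.

```python
import numpy as np

def train_projection_set(batch_eigvals, M: int, r: int, max_batches: int = 100):
    """Accumulate per-batch eigenvalues until the r selected basis indices stop changing.

    batch_eigvals: hypothetical callable returning the length-M diagonal of Lambda_b
    for the next random batch (a stand-in for the per-batch formula in the text).
    """
    Lambda = np.zeros(M)      # diagonal of Lambda_0
    P_prev = None             # stand-in for P_{j-1}; the text starts from a random P_0
    for j in range(1, max_batches + 1):
        Lambda = Lambda + batch_eigvals()                  # Lambda_j <- Lambda_{j-1} + Lambda_b
        P_j = frozenset(np.argsort(Lambda, kind="stable")[:r].tolist())  # r smallest entries
        if P_j == P_prev:     # step 5: the selected set is unchanged, so stop
            return sorted(P_j), j
        P_prev = P_j
    return sorted(P_prev), max_batches

rng = np.random.default_rng(0)
true_diag = np.arange(16, dtype=float)                      # hypothetical "global" eigenvalues
noisy = lambda: true_diag + 0.1 * rng.standard_normal(16)   # hypothetical noisy batch estimate
indices, batches_used = train_projection_set(noisy, M=16, r=4)
```

The `max_batches` cap is a design choice of this sketch: the text stops only when two consecutive sets agree, and a cap guards against data for which the selection never settles.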
Step 6, perform an inverse Fourier transform on each Fourier projection basis in the set $P_j$, [Figure PCTCN2021073088-appb-000041] [Figure PCTCN2021073088-appb-000042], to obtain the projection matrix $V' = [p_1\ p_2\ \dots\ p_r]$. Multiply the high-dimensional data set $X$ by the projection matrix $V'^{\top}$ to obtain the dimension-reduced data set $X' = V'^{\top} X$.
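Step 6 can be sketched as an inverse FFT of the selected columns of $F$ followed by one matrix product. The sketch rests on two assumptions that the text does not spell out: $F$ is the unnormalized DFT matrix, and $X$ is zero-padded to $M$ rows before the product so that the shapes agree. The function name `project`, the indices `[3, 1, 5]` and the data are hypothetical.

```python
import numpy as np

def project(X: np.ndarray, F: np.ndarray, idx) -> np.ndarray:
    """Inverse-transform the selected columns of F and project zero-padded X onto them."""
    M = F.shape[0]
    M1, N = X.shape
    V = np.fft.ifft(F[:, idx], axis=0)   # p_i = inverse FFT of each selected Fourier basis
    X_pad = np.zeros((M, N))
    X_pad[:M1, :] = X                    # assumption: X is zero-padded to M rows first
    return V.T @ X_pad                   # X' = V'^T X, shape (r, N)

M = 8
k = np.arange(M)
F = np.exp(-2j * np.pi * np.outer(k, k) / M)   # unnormalized DFT matrix (assumed convention)
X = np.arange(12, dtype=float).reshape(6, 2)   # hypothetical data: M1 = 6, N = 2
X_low = project(X, F, idx=[3, 1, 5])
print(X_low.shape)   # (3, 2)
```

The result has a complex dtype because the columns of $F$ are complex; under the assumed convention its imaginary part is numerically zero, so a caller could keep only the real part.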
The above embodiment only illustrates the design ideas and features of the present invention so that those skilled in the art can understand and implement it; the protection scope of the present invention is not limited to this embodiment. All equivalent changes or modifications made according to the principles and design ideas disclosed in the present invention therefore fall within its protection scope.

Claims (6)

  1. 一种适用于大规模数据的降维、关联分析方法,其特征在于,包括如下步骤:A dimensionality reduction and association analysis method suitable for large-scale data, characterized in that it comprises the following steps:
    步骤1,数据初始化,采集数据样本集X(M 1×N)和Y(M 2×N)作为所需的数据集,且初始化当前批次数j、维度参数M、初始的M×M维零矩阵Λ 0、随机傅里叶基集合P 0和离散傅里叶矩阵F;其中,M 1和M 2分别表示数据集X和Y的维度,N是数据的样本数量; Step 1, data initialization, collect data sample sets X (M 1 ×N) and Y (M 2 ×N) as the required data sets, and initialize the current batch number j, dimension parameter M, and initial M × M dimension zero Matrix Λ 0 , random Fourier basis set P 0 and discrete Fourier matrix F; wherein, M 1 and M 2 represent the dimensions of data set X and Y respectively, and N is the number of data samples;
    步骤2,构造批量样本的傅里叶数据表达,随机输入数量为b的批量样本集
    Figure PCTCN2021073088-appb-100001
    Figure PCTCN2021073088-appb-100002
    通过零元素填充的方式分别将X b和Y b增加至M维;分别对X b、Y b中的样本x i、y i进行傅里叶变换得到
    Figure PCTCN2021073088-appb-100003
    Step 2, construct the Fourier data representation of the batch samples, and randomly input the batch sample set with the number b
    Figure PCTCN2021073088-appb-100001
    and
    Figure PCTCN2021073088-appb-100002
    The X b and Y b are respectively increased to M dimension by filling with zero elements; the samples x i and y i in X b and Y b are respectively obtained by Fourier transform.
    Figure PCTCN2021073088-appb-100003
    步骤3,对于每批次随机输入的样本X b,Y b,计算该批次样本所获得的特征值矩阵Λ b,随着小批量样本的不断输入,将每一批样本所获特征值矩阵Λ b添加到Λ j,用Λ j表示在输入第j次部分样本后的特征值的累积,表示为:Λ j←Λ j-1b;其中,Λ j-1表示在输入j-1批次样本后所获得的特征值累积; Step 3, for each batch of randomly input samples X b , Y b , calculate the eigenvalue matrix Λ b obtained by the batch of samples, and with the continuous input of small batch samples, calculate the eigenvalue matrix obtained by each batch of samples. Λ b is added to Λ j , and Λ j is used to represent the accumulation of eigenvalues after inputting the jth partial sample, expressed as: Λ j ←Λ j - 1b ; The eigenvalues obtained after 1 batch of samples are accumulated;
    Step 4, obtain the Fourier projection bases of the batch samples: take
    Figure PCTCN2021073088-appb-100004
    to be the column vectors of F. Sort the diagonal elements λ_1, λ_2, ..., λ_M of the eigenvalue matrix Λ_j in ascending order, and select the Fourier bases in F that correspond to the r smallest eigenvalues λ_1, λ_2, ..., λ_r,
    Figure PCTCN2021073088-appb-100005
    to form the current projection set
    Figure PCTCN2021073088-appb-100006
    where r is the preset number of Fourier projection bases required;
    Step 5, if the set P_j is identical to P_{j-1}, stop repeating steps 2 to 4 and take the resulting Fourier bases
    Figure PCTCN2021073088-appb-100007
    as the final Fourier projection bases; otherwise repeat steps 2 to 4 and update the batch counter, j←j+1;
    Step 6, apply the inverse Fourier transform to each Fourier projection basis in the set P_j,
    Figure PCTCN2021073088-appb-100008
    Figure PCTCN2021073088-appb-100009
    i=1,...,r, to form the projection matrix V′=[p_1 p_2 … p_r]; multiplying the high-dimensional data set X by the transposed projection matrix V′^T gives the dimension-reduced data set X′=V′^T X.
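The batch selection loop of steps 1 to 5 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the per-batch eigenvalue formula is available only as a figure, so `batch_scores` is a hypothetical stand-in (the negated magnitude of the mean cross-spectrum), and the function names, the default `g`, `max_batches`, and the seed are illustrative. Step 6 is left out because its inverse-transform expressions are also given only as figures.

```python
import numpy as np

def batch_scores(Xf, Yf):
    # HYPOTHETICAL stand-in for the per-batch eigenvalue formula (figure only):
    # negated magnitude of the mean cross-spectrum, one score per frequency.
    return -np.abs(np.mean(np.conj(Xf) * Yf, axis=1))

def fit_fourier_bases(X, Y, M, r, g=0.1, max_batches=100, seed=0):
    """Return the indices of the r selected columns of the DFT matrix F."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    b = max(1, int(round(N * g)))            # batch size b = N*g
    lam = np.zeros(M)                        # diagonal of Lambda_0 (step 1)
    prev = None                              # plays the role of P_{j-1}
    for _ in range(max_batches):
        idx = rng.choice(N, size=b, replace=False)    # step 2: random batch
        Xf = np.fft.fft(X[:, idx], n=M, axis=0)       # zero-pad to M, then FFT
        Yf = np.fft.fft(Y[:, idx], n=M, axis=0)
        lam += batch_scores(Xf, Yf)                   # step 3: accumulate
        cur = frozenset(np.argsort(lam)[:r].tolist()) # step 4: r smallest
        if cur == prev:                               # step 5: P_j == P_{j-1}
            break
        prev = cur
    return np.array(sorted(cur))

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 200))            # M_1 = 6, N = 200
Y = rng.standard_normal((5, 200))            # M_2 = 5
selected = fit_fourier_bases(X, Y, M=8, r=3)
print(selected)
```

Only the diagonal of Λ_j is stored, since the steps above use nothing but its diagonal elements; comparing index sets rather than ordered lists mirrors the set comparison in step 5.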
  2. The dimension reduction and correlation analysis method applicable to large-scale data according to claim 1, wherein the dimension parameter M satisfies M≥M_1 and M≥M_2.
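The constraint matters because zero padding can only lengthen a sample. The short NumPy sketch below uses made-up values and relies on the `n` argument of `np.fft.fft` as one convenient way to pad (an assumption, not the claimed implementation): when `n` is at least the input length the input is zero-padded, and when `n` is smaller the input is cropped, so information is lost.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # a sample with M_1 = 5 features

# M >= M_1: fft zero-pads x to length M, so x is fully recoverable.
M = 8
x_hat = np.fft.fft(x, n=M)
recovered = np.fft.ifft(x_hat).real
print(np.allclose(recovered[:5], x), np.allclose(recovered[5:], 0.0))

# M < M_1: fft crops x to its first M entries; the remaining entries are discarded.
M_small = 3
cropped = np.fft.ifft(np.fft.fft(x, n=M_small)).real
print(cropped)
```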
  3. The dimension reduction and correlation analysis method applicable to large-scale data according to claim 1, wherein the discrete Fourier transform (DFT) matrix F is expressed as:
    Figure PCTCN2021073088-appb-100010
    where ω is a complex number that can be written as ω=e^{-2πi/M}, and i is the imaginary unit.
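For reference, the textbook M×M DFT matrix built from ω = e^{-2πi/M} has entries ω^{jk}. The sketch below assumes the unnormalized convention (no 1/√M factor), which is also the one `np.fft.fft` uses; the scaling of the claimed matrix is shown only in the figure and may differ. The helper name `dft_matrix` is illustrative.

```python
import numpy as np

def dft_matrix(M):
    """Unnormalized DFT matrix with entries omega**(j*k), omega = exp(-2*pi*i/M)."""
    omega = np.exp(-2j * np.pi / M)
    j, k = np.meshgrid(np.arange(M), np.arange(M), indexing="ij")
    return omega ** (j * k)

F = dft_matrix(4)
# F is symmetric, and F^H F = M * I under this convention.
print(np.allclose(F, F.T))
print(np.allclose(F.conj().T @ F, 4 * np.eye(4)))
```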
  4. The dimension reduction and correlation analysis method applicable to large-scale data according to claim 1, wherein the batch samples X_b and Y_b are drawn at random according to a threshold g, with batch size b=N*g.
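Drawing such a batch could look like the sketch below. The rounding of N*g to an integer, sampling without replacement, and the function name `draw_batch` are assumptions; the claim specifies only b = N*g.

```python
import numpy as np

def draw_batch(X, Y, g, rng):
    """Pick the same b = N*g random sample columns from X and Y."""
    N = X.shape[1]
    b = max(1, int(round(N * g)))           # assumed rounding rule
    idx = rng.choice(N, size=b, replace=False)
    return X[:, idx], Y[:, idx]

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 1000))
Y = rng.standard_normal((4, 1000))
Xb, Yb = draw_batch(X, Y, g=0.05, rng=rng)
print(Xb.shape, Yb.shape)
```

Using one index array for both data sets keeps each x_i paired with its y_i, which a correlation analysis between X and Y needs.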
  5. The dimension reduction and correlation analysis method applicable to large-scale data according to claim 1, wherein the Fourier transforms of x_i and y_i,
    Figure PCTCN2021073088-appb-100011
    are respectively expressed as:
    Figure PCTCN2021073088-appb-100012
    Figure PCTCN2021073088-appb-100013
    where
    Figure PCTCN2021073088-appb-100014
    are the respective generating vectors of the Fourier transform,
    Figure PCTCN2021073088-appb-100015
    respectively denote the fast Fourier transforms of the vectors x_i and y_i, and F is the discrete Fourier matrix.
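That multiplying by the discrete Fourier matrix and running a fast Fourier transform give the same result can be checked numerically. The sketch assumes the unnormalized DFT convention of `np.fft.fft` and an arbitrary example vector; the exact expressions in the claim are shown only in the figures.

```python
import numpy as np

M = 8
x = np.array([0.5, -1.0, 2.0])                      # sample with fewer than M entries
x_pad = np.concatenate([x, np.zeros(M - x.size)])   # zero-pad to M dimensions

F = np.fft.fft(np.eye(M))                           # DFT matrix; applying it costs O(M^2)
via_matrix = F @ x_pad
via_fft = np.fft.fft(x_pad)                         # same result in O(M log M)
print(np.allclose(via_matrix, via_fft))
```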
  6. The dimension reduction and correlation analysis method applicable to large-scale data according to claim 1, wherein the eigenvalues of the current batch samples X_b and Y_b are obtained as follows:
    Figure PCTCN2021073088-appb-100016
    where 1./ denotes the element-wise reciprocal of a vector; λ is the Lagrange multiplier; b is the number of samples in the batch;
    Figure PCTCN2021073088-appb-100017
    are respectively the complex conjugates of
    Figure PCTCN2021073088-appb-100018
    ⊙ denotes the element-wise product; diag converts a vector into a diagonal matrix whose main diagonal holds the vector's elements;
    Figure PCTCN2021073088-appb-100019
    is the principal projection vector of the training data set X, that is, the eigenvector; F^H is the conjugate transpose of the Fourier matrix F, with H denoting the conjugate transpose operation. For each randomly drawn batch X_b, Y_b, Λ_b is then obtained as:
    Figure PCTCN2021073088-appb-100020
    where Λ_b is the eigenvalue matrix obtained from that batch of samples.
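The operators named in this claim map directly onto NumPy operations. The sketch below only illustrates the notation on made-up vectors; it does not reproduce the eigenvalue formula, which is available only as a figure, and the unnormalized DFT matrix is an assumed convention.

```python
import numpy as np

a = np.array([1 + 1j, 2 - 1j, 4 + 0j])
c = np.array([2 + 0j, 0 + 1j, 1 - 1j])

reciprocal = 1.0 / a          # "1./": element-wise reciprocal
conjugate = np.conj(a)        # complex conjugate
hadamard = a * c              # "⊙": element-wise product
D = np.diag(a)                # "diag": vector placed on the main diagonal

F = np.fft.fft(np.eye(3))     # unnormalized DFT matrix (assumed convention)
F_H = F.conj().T              # "F^H": conjugate transpose
```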
PCT/CN2021/073088 2020-08-19 2021-01-21 Dimension reduction and correlation analysis method applicable to large-scale data WO2022037012A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB2110472.4A GB2601862A (en) 2020-08-19 2021-01-21 Dimension reduction and correlation analysis method applicable to large-scale data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010835235.8A CN112149045A (en) 2020-08-19 2020-08-19 Dimension reduction and correlation analysis method suitable for large-scale data
CN202010835235.8 2020-08-19

Publications (1)

Publication Number Publication Date
WO2022037012A1 true WO2022037012A1 (en) 2022-02-24

Family

ID=73887570

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/073088 WO2022037012A1 (en) 2020-08-19 2021-01-21 Dimension reduction and correlation analysis method applicable to large-scale data

Country Status (2)

Country Link
CN (1) CN112149045A (en)
WO (1) WO2022037012A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149045A (en) * 2020-08-19 2020-12-29 江苏大学 Dimension reduction and correlation analysis method suitable for large-scale data
CN113743485A (en) * 2021-08-23 2021-12-03 江苏大学 Data dimension reduction method based on Fourier domain principal component analysis

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103413551A (en) * 2013-07-16 2013-11-27 清华大学 Sparse dimension reduction-based speaker identification method
CN108682007A (en) * 2018-04-28 2018-10-19 华中师范大学 Jpeg image resampling automatic testing method based on depth random forest
US20200098077A1 (en) * 2018-09-20 2020-03-26 At&T Intellectual Property I, L.P. Enabling secure video sharing by exploiting data sparsity
CN112149045A (en) * 2020-08-19 2020-12-29 江苏大学 Dimension reduction and correlation analysis method suitable for large-scale data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114510880A (en) * 2022-04-19 2022-05-17 中国石油大学(华东) Method for diagnosing working condition of sucker-rod pump based on Fourier transform and geometric characteristics
CN114510880B (en) * 2022-04-19 2022-07-12 中国石油大学(华东) Method for diagnosing working condition of sucker-rod pump based on Fourier transform and geometric characteristics

Also Published As

Publication number Publication date
CN112149045A (en) 2020-12-29

Legal Events

Date Code Title Description
ENP Entry into the national phase. Ref document number: 202110472; Country of ref document: GB; Kind code of ref document: A; Free format text: PCT FILING DATE = 20210121
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 21857118; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: pct application non-entry in european phase. Ref document number: 21857118; Country of ref document: EP; Kind code of ref document: A1