WO2023024210A1 - Data dimension reduction method based on fourier-domain principal component analysis - Google Patents

Data dimension reduction method based on fourier-domain principal component analysis

Info

Publication number
WO2023024210A1
Authority
WO
WIPO (PCT)
Prior art keywords
fourier
data
batch
matrix
vector
Prior art date
Application number
PCT/CN2021/120524
Other languages
French (fr)
Chinese (zh)
Inventor
沈项军
徐兆瑞
刘志锋
Original Assignee
江苏大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 江苏大学 (Jiangsu University)
Publication of WO2023024210A1 publication Critical patent/WO2023024210A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2135 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/14 - Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F 17/141 - Discrete Fourier transforms
    • G06F 17/142 - Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting


Abstract

Disclosed in the present invention is a data dimension reduction method based on Fourier-domain principal component analysis. High-dimensional data are projected into the Fourier domain, and the eigenvector problem of principal component analysis is converted, by using the properties of circulant matrices and the Fourier matrix, into a search for meaningful Fourier-domain bases. Because the Fourier-domain bases are predefined and the principal-component distribution of the data is ordered, training can be accelerated by inputting the training samples in batches until the required Fourier bases are stable and ordered. Once the number of Fourier bases and the projection matrix are determined, the projection matrix is multiplied by the high-dimensional data set to obtain a low-dimensional data set, which facilitates fast data processing. By means of the data dimension reduction method provided in the present invention, which builds on principal component analysis and the fast Fourier transform, noise and redundant information in high-dimensional data sets can be removed, thereby reducing unnecessary operations during data processing and improving the running speed and memory efficiency of the algorithm.

Description

A Data Dimensionality Reduction Method Based on Fourier-Domain Principal Component Analysis

Technical Field

The invention belongs to the field of computer science and image processing technology, and in particular relates to a data dimensionality reduction method based on Fourier-domain principal component analysis.
Background Art

Traditional data processing methods can no longer analyze massive data effectively. At the same time, as the dimensionality of the data produced by big data processing and cloud computing keeps increasing, it becomes ever more necessary to reduce the dimensionality of high-dimensional data in order to remove noise and redundant information from high-dimensional data sets, cut unnecessary operations during data processing, and improve the running efficiency of algorithms.

Principal Component Analysis (PCA) is an advanced data exploration algorithm that can be used to find patterns in data and to find transformed representations of the data that emphasize those patterns. Through an orthogonal rotation of the coordinate axes of the original data set, principal component analysis concentrates the originally scattered sample points near a few characteristic coordinate axes; when the principal components that carry little of the original information are discarded, the dimensionality of the original data is reduced and the data are compressed. Principal component analysis is a powerful data transformation technique that can be applied before further analysis. It is especially useful for high-dimensional data sets and helps with understanding the underlying data structure, cluster analysis, regression analysis, and many other tasks.

However, although principal component analysis performs well, its high computational complexity limits its application to massive data processing problems. To handle large-scale data, many optimization techniques have been proposed to accelerate the principal component analysis algorithm. Depending on the strategy used, existing optimization techniques fall roughly into two categories. One uses the Nyström matrix approximation technique, which reduces the computational cost of the eigendecomposition step by using the eigenvectors computed from a sub-matrix to approximate the eigenvectors of the original matrix. The other uses Random Fourier Features to approximate the matrix, which transforms the original kernel PCA (KPCA) problem into a high-dimensional linear PCA problem. Although these methods address the processing of massive data, their speed and memory efficiency are still insufficient, and fast, efficient computation on massive data remains an open problem.
Summary of the Invention

To address the deficiencies of the prior art, the present invention proposes a data dimensionality reduction method based on Fourier-domain principal component analysis. The fast Fourier transform is used to observe each data point of a sequence from the frequency-domain perspective, yielding a new principal component analysis algorithm based on the Fourier transform. By recasting the eigenvector problem of principal component analysis as the search for meaningful Fourier-domain bases, and by inputting the training samples in batches, the eigenvalue distribution of the global sample set is approximated from the stable, ordered eigenvalues of a subset of samples. This improves the computing speed and memory utilization of data dimensionality reduction and provides support and acceleration for principal component analysis of large-scale data.

The technical solution adopted by the present invention is as follows.

A data dimensionality reduction method based on Fourier-domain principal component analysis comprises the following steps:

Step 1, data initialization: collect a data sample set X as the required data set, where X is an M×N matrix; initialize the current batch number j, the initial M×M zero matrix Λ_0, the random Fourier-basis set P_0 and the discrete Fourier matrix F; M is the dimensionality of the data set X and N is the number of data samples;
Step 2, construct batch samples and their Fourier data representation: randomly input a batch sample set X_b ∈ R^(M×b) containing b samples, and apply the Fourier transform to each sample vector x_i in X_b to obtain x̂_i;
Step 3, for each randomly input batch sample set X_b, compute the eigenvalue matrix Λ_b obtained from this batch, expressed as

Λ_b = (1/b) Σ_{i=1}^{b} diag(x̂_i ⊙ x̂_i*),

where x̂_i* is the complex conjugate of x̂_i;

as small batches keep being input, the eigenvalue matrix Λ_b obtained from each batch sample set X_b is added to Λ_j, where Λ_j denotes the accumulation of the eigenvalues after the j-th batch sample set has been input; the process is expressed as Λ_j ← Λ_{j-1} + Λ_b, where Λ_{j-1} denotes the eigenvalue accumulation obtained after the (j-1)-th batch sample set X_b has been input;
Step 4, obtain the Fourier projection bases of the batch sample sets: take v̂ to be the column vectors of F; sort the diagonal elements λ_1, λ_2, …, λ_M of the eigenvalue matrix Λ_j in ascending order, and select the Fourier bases p̂_1, p̂_2, …, p̂_r in the matrix F corresponding to the r smallest eigenvalues λ_1, λ_2, …, λ_r to form the current projection set P_j = {p̂_1, p̂_2, …, p̂_r}, where r is the preset number of required Fourier projection bases;
Step 5, if the set P_j is identical to P_{j-1}, stop repeating steps 2 to 4 and take the obtained Fourier bases p̂_1, p̂_2, …, p̂_r as the final Fourier projection bases; otherwise execute steps 2 to 4 again and update the current batch number, j ← j+1;
Step 6, apply the inverse Fourier transform to each Fourier projection basis in the set P_j, p_i = F^(-1)(p̂_i), i = 1, …, r, to form the projection matrix V′ = [p_1 p_2 … p_r]; multiplying the high-dimensional data set X by the projection matrix V′^T yields the dimension-reduced data set X′ = V′^T X.
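The six steps above can be condensed into a short numerical sketch. The NumPy routine below is only one possible reading of the method, not the patented implementation: the function and variable names, the unitary normalization of the Fourier matrix, the use of the mean power spectrum for Λ_b and the cap on the number of batches are all assumptions made for illustration.

```python
import numpy as np

def fourier_domain_pca(X, r=50, g=0.01, max_batches=1000, seed=0):
    """Sketch of steps 1-6: batch-wise Fourier-domain PCA (assumed reading of the method)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    b = max(1, int(N * g))                              # step 1: batch size b = N*g
    F = np.fft.fft(np.eye(M), axis=0) / np.sqrt(M)      # discrete Fourier matrix (unitary scaling assumed)
    lam = np.zeros(M)                                   # diagonal of the accumulated eigenvalue matrix
    prev = None                                         # indices of the previously selected bases (P_{j-1})
    for j in range(1, max_batches + 1):
        Xb = X[:, rng.choice(N, size=b, replace=False)]     # step 2: random batch X_b
        Xb_hat = np.fft.fft(Xb, axis=0)                     # Fourier representation of each sample
        lam = lam + np.mean(np.abs(Xb_hat) ** 2, axis=1)    # step 3: Lambda_j <- Lambda_{j-1} + Lambda_b
        idx = np.sort(np.argsort(lam)[:r])                  # step 4: r smallest eigenvalues -> Fourier bases
        if prev is not None and np.array_equal(idx, prev):
            break                                           # step 5: the selection is stable, stop training
        prev = idx
    P = F[:, idx]                                       # selected Fourier projection bases
    V = np.real(np.fft.ifft(P, axis=0))                 # step 6: inverse FFT of each basis -> V'
    return V.T @ X                                      # reduced data X' = V'^T X
```

Called on an M×N array, the routine returns an r×N array; the individual steps are spelled out again, with the notation of the embodiment, in the detailed description below.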
Further, a threshold g is set, and the number of samples in a batch sample set X_b is b = N·g, with b << M.

Further, the threshold g takes a value between 0.5% and 5%.
Further, applying the Fourier transform to a sample vector x_i yields x̂_i, expressed as

x̂_i = F(x_i),

where x̂_i is the vector generated by the Fourier transform, F(x_i) denotes the fast Fourier transform of the vector x_i, and F is the discrete Fourier matrix.
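As an illustration of this relation, the snippet below checks numerically that the fast Fourier transform of a sample vector coincides with multiplication by the discrete Fourier matrix. The unitary scaling 1/√M of F, and hence the √M factor in the check, are assumptions made here; the patent itself does not reproduce the exact normalization at this point.

```python
import numpy as np

M = 8
x = np.random.randn(M)                            # a single M-dimensional sample vector
F = np.fft.fft(np.eye(M), axis=0) / np.sqrt(M)    # discrete Fourier matrix, unitary scaling assumed
x_hat = np.fft.fft(x)                             # fast Fourier transform of the sample vector
assert np.allclose(x_hat, np.sqrt(M) * F @ x)     # x_hat agrees with sqrt(M) * F * x
```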
Further, the sample vectors x_i are obtained as follows: the data sample set is written as X = [x_1 x_2 … x_N], and the sample vector x_i is formed by the i-th column of data samples in the data sample set X, i = 1, 2, …, N, where N is the number of data samples; the sample vector x_i contains the M-dimensional data sample of the i-th column.
Further, in this embodiment, C(x_i) is the circulant matrix constructed from the sample x_i, expressed as

C(x_i) = circ(x_i),

where circ denotes constructing the circulant matrix whose rows are the cyclic shifts of the vector x_i. The characteristic property of such a circulant matrix is that it can be diagonalized by the Fourier transform,

C(x_i) = F diag(x̂_i) F^H,

where F^H = (F*)^T is the conjugate transpose of the Fourier matrix F and H denotes the conjugate-transpose operation. The eigenvalues of the current batch of samples X_b are obtained as follows:

(1/b) Σ_{i=1}^{b} F diag(x̂_i ⊙ x̂_i*) F^H v̂ = λ v̂,

where λ is the Lagrange factor; b is the number of batch samples; x̂_i* is the complex conjugate of x̂_i; ⊙ is the element-wise product of matrix elements; diag converts a vector into the diagonal matrix whose main diagonal consists of the vector's elements; and v̂ is the principal projection vector, i.e. the eigenvector, of the training data set X. For each randomly input batch of samples X_b we can therefore obtain

Λ_b = (1/b) Σ_{i=1}^{b} diag(x̂_i ⊙ x̂_i*),

where Λ_b is the eigenvalue matrix obtained from this batch of samples.
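The circulant-diagonalization property used in this derivation can be verified numerically. In the sketch below, SciPy's circulant builds the matrix from its first column, so it is transposed to obtain rows that are cyclic shifts of x; the unitary scaling of F is again an assumption. The second check shows why the per-frequency power x̂ ⊙ x̂* plays the role of the batch eigenvalues collected in Λ_b.

```python
import numpy as np
from scipy.linalg import circulant

M = 8
x = np.random.randn(M)
x_hat = np.fft.fft(x)
F = np.fft.fft(np.eye(M), axis=0) / np.sqrt(M)      # unitary DFT matrix (assumed normalization)

C = circulant(x).T                                  # circulant matrix whose rows are cyclic shifts of x
assert np.allclose(C, F @ np.diag(x_hat) @ F.conj().T)          # C(x) = F diag(x_hat) F^H

# Hence C C^H = F diag(|x_hat|^2) F^H: the Fourier bases are its eigenvectors and the
# per-frequency power |x_hat|^2 supplies the eigenvalues accumulated in Lambda_b.
assert np.allclose(C @ C.conj().T, F @ np.diag(np.abs(x_hat) ** 2) @ F.conj().T)
```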
Further, the Fourier-domain basis vectors are given by the discrete Fourier matrix F (written out in the detailed description below), where V is the set of projection column vectors v, i.e. V = [v_1 v_2 … v_n].
Beneficial effects of the present invention:

1. The repeatability of the data sequence is exploited to model the data in the Fourier domain. The fast Fourier transform is used to observe each data point of the sequence from the frequency-domain perspective, yielding a new Fourier-domain principal component analysis algorithm. Finding the projection target of principal component analysis can then be achieved by finding predefined, meaningful Fourier bases.

2. Owing to the arithmetic properties of the Fourier domain, complex matrix inversion in the time domain can be avoided through simple element-wise (dot-product) matrix operations in the Fourier domain.

3. To obtain the Fourier bases meaningfully, the training process does not need to load all data samples; only a few batches of data samples need to be loaded, until the ordering of the Fourier bases stabilizes, which undoubtedly makes more efficient use of memory.

4. By recasting the eigenvector problem of principal component analysis as the search for meaningful Fourier-domain bases, and by inputting the training samples in batches, the eigenvalue distribution of the global sample set is approximated from the stable, ordered eigenvalues of a subset of samples. This improves the computing speed and memory utilization of the dimensionality reduction process and provides support and acceleration for principal component analysis of massive data.
Brief Description of the Drawings

Fig. 1 is the main flowchart of the method proposed by the present invention.

Detailed Description of the Embodiments

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.

As shown in Fig. 1, a data dimensionality reduction method based on Fourier-domain principal component analysis includes the following steps.

Step 1, data set preparation: collect a data sample set X (M×N) as the required data set and write it as X = [x_1 x_2 … x_N], where the sample vector x_i is formed by the i-th column of data samples in X, i = 1, 2, …, N; N is the number of data samples, and each sample vector x_i contains the M-dimensional data sample of the i-th column.

Initialize the parameters j, Λ_0, F and P_0, where j is the current batch number of the batch-wise training, with j = 1; Λ_0 is the initial M×M zero matrix; F is the discrete Fourier (DFT) matrix; and P_0 is a random set of Fourier bases whose elements are column vectors of the discrete Fourier matrix F. The discrete Fourier matrix F is expressed as

F = [v_1 v_2 … v_n] =
    [ 1    1          1           …    1                ]
    [ 1    ω          ω^2         …    ω^(M-1)          ]
    [ 1    ω^2        ω^4         …    ω^(2(M-1))       ]
    [ ⋮    ⋮          ⋮                ⋮                ]
    [ 1    ω^(M-1)    ω^(2(M-1))  …    ω^((M-1)(M-1))   ]

where V is the set of projection column vectors v, i.e. V = [v_1 v_2 … v_n], n is the number of columns of the set V, and ω is a complex number that can be expressed as ω = e^(-2πi/M), with i the imaginary unit.
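A possible initialization of Step 1 is sketched below in NumPy. The placeholder data matrix, the random seed, the unitary scaling of F and the size of the random index set standing in for P_0 are illustrative assumptions rather than values fixed by the patent.

```python
import numpy as np

X = np.random.randn(256, 1000)       # placeholder data set: M = 256 dimensions, N = 1000 samples
M, N = X.shape
j = 1                                # current batch number of the batch-wise training
Lambda_acc = np.zeros(M)             # diagonal of the initial M x M zero matrix Lambda_0
k = np.arange(M)
F = np.exp(-2 * np.pi * 1j * np.outer(k, k) / M) / np.sqrt(M)   # DFT matrix with omega = e^(-2*pi*i/M)
rng = np.random.default_rng(0)
P_prev = np.sort(rng.choice(M, size=50, replace=False))         # P_0: random initial Fourier-basis (column) indices
```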
Step 2, construct batch samples and their Fourier data representation.

According to the threshold g, a batch of b = N·g samples X_b ∈ R^(M×b) is input at random, with g between 0.5% and 5%. The fast Fourier transform is applied to the batch samples X_b so that the data are observed from the frequency-domain perspective:

x̂_i = F(x_i), i = 1, …, b,      (1)

where F(x_i) denotes the fast Fourier transform of the vector x_i; F is the discrete Fourier matrix; x̂_i is the vector generated by the Fourier transform, the hat (∧) denoting a vector produced by the fast Fourier transform; and x_i is the i-th column vector of the data sample set X.
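Continuing the initialization sketch above, Step 2 amounts to drawing a random subset of columns and transforming them column-wise; g = 1% is an arbitrary choice within the stated 0.5% to 5% range.

```python
g = 0.01                                    # threshold g, chosen inside the 0.5%-5% range
b = max(1, int(N * g))                      # batch size b = N * g
cols = rng.choice(N, size=b, replace=False) # indices of the randomly selected samples
X_b = X[:, cols]                            # batch sample set X_b in R^(M x b)
X_b_hat = np.fft.fft(X_b, axis=0)           # column-wise FFT: x_hat_i for every sample of the batch
```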
Step 3, obtain the eigenvalues of the current batch of samples.

The eigenvalues of the current batch of samples X_b are obtained as follows:

(1/b) Σ_{i=1}^{b} F diag(x̂_i ⊙ x̂_i*) F^H v̂ = λ v̂,      (2)

where λ is the Lagrange factor; b is the number of batch samples; x̂_i* is the complex conjugate of x̂_i; ⊙ is the element-wise product of matrix elements; diag converts a vector into the diagonal matrix whose main diagonal consists of the vector's elements; v̂ is the principal projection vector, i.e. the eigenvector, of the training data set X; F^H is the conjugate transpose of the Fourier matrix F, and H denotes the conjugate-transpose operation. According to formula (2), for each randomly input batch of samples X_b we can obtain

Λ_b = (1/b) Σ_{i=1}^{b} diag(x̂_i ⊙ x̂_i*),      (3)

where Λ_b is the eigenvalue matrix obtained from this batch of samples. Let Λ_j denote the accumulation of the eigenvalues after the j-th batch of samples has been input, j being the current batch number. As small batches keep being input, the eigenvalue matrix Λ_b obtained from each batch is added to Λ_j:
Λ_j ← Λ_{j-1} + Λ_b      (4)
where Λ_{j-1} denotes the eigenvalue accumulation obtained after the (j-1)-th batch of samples has been input.
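In the running sketch, formulas (3) and (4) reduce to a per-frequency power average over the batch and a running sum across batches; only the diagonals of Λ_b and Λ_j are stored. Reading formula (3) as the mean power spectrum is our assumption.

```python
Lambda_b = np.mean(np.abs(X_b_hat) ** 2, axis=1)    # formula (3): diagonal of the batch eigenvalue matrix
Lambda_acc = Lambda_acc + Lambda_b                  # formula (4): Lambda_j <- Lambda_{j-1} + Lambda_b
```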
Step 4, obtain the Fourier projection bases of the batch samples.

According to formula (2), v̂ is taken to be the column vectors of F. The diagonal elements λ_1, λ_2, …, λ_M of the eigenvalue matrix Λ_j are sorted in ascending order, and the Fourier bases p̂_1, p̂_2, …, p̂_r in the matrix F corresponding to the r smallest eigenvalues λ_1, λ_2, …, λ_r are selected to form the current projection set P_j = {p̂_1, p̂_2, …, p̂_r}, where r is the preset number of required Fourier projection bases, taken here to be 50.
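In the sketch, Step 4 is an ascending sort of the accumulated diagonal followed by keeping the Fourier-matrix columns of the r smallest entries; representing P_j by its sorted column indices simplifies the later comparison with P_{j-1}.

```python
r = 50                                 # preset number of required Fourier projection bases
order = np.argsort(Lambda_acc)         # ascending order of the diagonal eigenvalues
P_idx = np.sort(order[:r])             # indices of the r smallest eigenvalues -> columns of F
P_j = F[:, P_idx]                      # current projection set: the selected Fourier bases
```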
Step 5, if the set P_j is identical to P_{j-1}, stop repeating steps 2 to 4 and take the obtained Fourier bases p̂_1, p̂_2, …, p̂_r as the final Fourier projection bases. Otherwise, execute steps 2 to 4 again and update the current batch number, j ← j+1.
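With the index representation used in the sketch, the comparison of P_j and P_{j-1} is a simple array equality; in the full loop, steps 2 to 4 are repeated until it holds.

```python
if np.array_equal(P_idx, P_prev):      # P_j identical to P_{j-1}: the basis selection is stable
    P_final_idx = P_idx                # keep these indices as the final Fourier projection bases
else:
    P_prev = P_idx                     # otherwise continue training with the next batch
    j += 1                             # j <- j + 1
```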
Step 6, apply the inverse Fourier transform to each Fourier projection basis in the set P_j, p_i = F^(-1)(p̂_i), i = 1, …, r, to obtain the projection matrix V′ = [p_1 p_2 … p_r]. Multiplying the high-dimensional data set X by the projection matrix V′^T gives the dimension-reduced data set X′ = V′^T X.
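In the sketch, Step 6 applies the inverse FFT to each selected basis and projects the data; keeping only the real part of the inverse transform is our simplification for obtaining a real-valued projection matrix. Put together, these fragments reproduce the end-to-end routine given after the summary of steps 1 to 6 above.

```python
V = np.real(np.fft.ifft(F[:, P_idx], axis=0))    # p_i = inverse FFT of each selected basis; V' = [p_1 ... p_r]
X_reduced = V.T @ X                              # X' = V'^T X, an r x N low-dimensional data set
```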
The above embodiments are only intended to illustrate the design concept and features of the present invention, and their purpose is to enable those skilled in the art to understand the content of the present invention and implement it accordingly; the scope of protection of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications made according to the principles and design ideas disclosed in the present invention fall within the scope of protection of the present invention.

Claims (7)

  1. A data dimensionality reduction method based on Fourier-domain principal component analysis, characterized in that it comprises the following steps:

    Step 1, data initialization: collect a data sample set X as the required data set, where X is an M×N matrix; initialize the current batch number j, the initial M×M zero matrix Λ_0, the random Fourier-basis set P_0 and the discrete Fourier matrix F; M is the dimensionality of the data set X and N is the number of data samples;
    Step 2, construct batch samples and their Fourier data representation: randomly input a batch sample set X_b ∈ R^(M×b) containing b samples, and apply the Fourier transform to each sample vector x_i in X_b to obtain x̂_i;
    Step 3, for each randomly input batch sample set X_b, compute the eigenvalue matrix Λ_b obtained from this batch, expressed as Λ_b = (1/b) Σ_{i=1}^{b} diag(x̂_i ⊙ x̂_i*), where x̂_i* is the complex conjugate of x̂_i;

    as small batches keep being input, the eigenvalue matrix Λ_b obtained from each batch sample set X_b is added to Λ_j, where Λ_j denotes the accumulation of the eigenvalues after the j-th batch sample set has been input; the process is expressed as Λ_j ← Λ_{j-1} + Λ_b, where Λ_{j-1} denotes the eigenvalue accumulation obtained after the (j-1)-th batch sample set X_b has been input;
    Step 4, obtain the Fourier projection bases of the batch sample sets: take v̂ to be the column vectors of F; sort the diagonal elements λ_1, λ_2, …, λ_M of the eigenvalue matrix Λ_j in ascending order, and select the Fourier bases p̂_1, p̂_2, …, p̂_r in the matrix F corresponding to the r smallest eigenvalues λ_1, λ_2, …, λ_r to form the current projection set P_j = {p̂_1, p̂_2, …, p̂_r}, where r is the preset number of required Fourier projection bases;
    Step 5, if the set P_j is identical to P_{j-1}, stop repeating steps 2 to 4 and take the obtained Fourier bases p̂_1, p̂_2, …, p̂_r as the final Fourier projection bases; otherwise execute steps 2 to 4 again and update the current batch number, j ← j+1;
    Step 6, apply the inverse Fourier transform to each Fourier projection basis in the set P_j, p_i = F^(-1)(p̂_i), i = 1, …, r, to form the projection matrix V′ = [p_1 p_2 … p_r]; multiplying the high-dimensional data set X by the projection matrix V′^T yields the dimension-reduced data set X′ = V′^T X.
  2. The data dimensionality reduction method based on Fourier-domain principal component analysis according to claim 1, characterized in that a threshold g is set, and the number of samples in a batch sample set X_b is b = N·g, with b << M.

  3. The data dimensionality reduction method based on Fourier-domain principal component analysis according to claim 2, characterized in that the threshold g takes a value between 0.5% and 5%.
  4. The data dimensionality reduction method based on Fourier-domain principal component analysis according to claim 1, characterized in that applying the Fourier transform to a sample vector x_i yields x̂_i, expressed as x̂_i = F(x_i), where x̂_i is the vector generated by the Fourier transform, F(x_i) denotes the fast Fourier transform of the vector x_i, and F is the discrete Fourier matrix.

  5. The data dimensionality reduction method based on Fourier-domain principal component analysis according to claim 4, characterized in that the sample vectors x_i are obtained as follows: the data sample set is written as X = [x_1 x_2 … x_N], and the sample vector x_i is formed by the i-th column of data samples in the data sample set X, i = 1, 2, …, N, where N is the number of data samples; the sample vector x_i contains the M-dimensional data sample of the i-th column.
  6. The data dimensionality reduction method based on Fourier-domain principal component analysis according to claim 1, characterized in that the eigenvalues of the current batch of samples X_b are obtained as follows:

    (1/b) Σ_{i=1}^{b} F diag(x̂_i ⊙ x̂_i*) F^H v̂ = λ v̂,

    where λ is the Lagrange factor; b is the number of batch samples; x̂_i* is the complex conjugate of x̂_i; ⊙ is the element-wise product of matrix elements; diag converts a vector into the diagonal matrix whose main diagonal consists of the vector's elements; and v̂ is the principal projection vector, i.e. the eigenvector, of the training data set X; for each randomly input batch of samples X_b we can obtain

    Λ_b = (1/b) Σ_{i=1}^{b} diag(x̂_i ⊙ x̂_i*),

    where Λ_b is the eigenvalue matrix obtained from this batch of samples.
  7. The data dimensionality reduction method based on Fourier-domain principal component analysis according to claim 4, characterized in that the Fourier-domain basis vectors are the columns of

    V = [v_1 v_2 … v_n] =
        [ 1    1          1           …    1                ]
        [ 1    ω          ω^2         …    ω^(M-1)          ]
        [ ⋮    ⋮          ⋮                ⋮                ]
        [ 1    ω^(M-1)    ω^(2(M-1))  …    ω^((M-1)(M-1))   ]

    where V is the set of projection column vectors v, i.e. V = [v_1 v_2 … v_n], n is the number of columns of the set V, and ω is a complex number that can be expressed as ω = e^(-2πi/M), with i the imaginary unit.
PCT/CN2021/120524 2021-08-23 2021-09-26 Data dimension reduction method based on fourier-domain principal component analysis WO2023024210A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110968131.9A CN113743485A (en) 2021-08-23 2021-08-23 Data dimension reduction method based on Fourier domain principal component analysis
CN202110968131.9 2021-08-23

Publications (1)

Publication Number Publication Date
WO2023024210A1 true WO2023024210A1 (en) 2023-03-02

Family

ID=78732295

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120524 WO2023024210A1 (en) 2021-08-23 2021-09-26 Data dimension reduction method based on fourier-domain principal component analysis

Country Status (2)

Country Link
CN (1) CN113743485A (en)
WO (1) WO2023024210A1 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4292837B2 (en) * 2002-07-16 2009-07-08 日本電気株式会社 Pattern feature extraction method and apparatus

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682089A (en) * 2012-04-24 2012-09-19 浙江工业大学 Method for data dimensionality reduction by identifying random neighbourhood embedding analyses
CN102938072A (en) * 2012-10-20 2013-02-20 复旦大学 Dimension reducing and sorting method of hyperspectral imagery based on blocking low rank tensor analysis
US20200272651A1 (en) * 2019-02-22 2020-08-27 International Business Machines Corporation Heuristic dimension reduction in metadata modeling
CN112149045A (en) * 2020-08-19 2020-12-29 江苏大学 Dimension reduction and correlation analysis method suitable for large-scale data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861224A (en) * 2023-09-04 2023-10-10 鲁东大学 Intermittent process soft measurement modeling system based on intermittent process soft measurement modeling method
CN116861224B (en) * 2023-09-04 2023-12-01 鲁东大学 Intermittent process soft measurement modeling system based on intermittent process soft measurement modeling method

Also Published As

Publication number Publication date
CN113743485A (en) 2021-12-03

Similar Documents

Publication Publication Date Title
WO2022037012A1 (en) Dimension reduction and correlation analysis method applicable to large-scale data
Zhang et al. Robust low-rank kernel multi-view subspace clustering based on the schatten p-norm and correntropy
Shao et al. A regularization for the projection twin support vector machine
JP2023535109A (en) Method and apparatus for acquiring ground state of quantum system, computer equipment, storage medium, and computer program
Liang et al. Variational quantum algorithms for dimensionality reduction and classification
CN113496285A (en) Data processing method and device based on quantum circuit, electronic device and medium
WO2023024210A1 (en) Data dimension reduction method based on fourier-domain principal component analysis
Srinivasan et al. GPUML: Graphical processors for speeding up kernel machines
Ambainis et al. Spatial search on grids with minimum memory
Sasaki et al. Direct density-derivative estimation and its application in KL-divergence approximation
Saade et al. Clustering from sparse pairwise measurements
Jiang et al. Robust and efficient computation of eigenvectors in a generalized spectral method for constrained clustering
Rančić Noisy intermediate-scale quantum computing algorithm for solving an n-vertex MaxCut problem with log (n) qubits
Huang et al. Stochastic alternating direction method of multipliers with variance reduction for nonconvex optimization
Li et al. Sub-selective quantization for large-scale image search
Wang et al. Accelerating nearest neighbor partitioning neural network classifier based on CUDA
Xu et al. A practical Riemannian algorithm for computing dominant generalized Eigenspace
Siegel et al. Adaptive neuron apoptosis for accelerating deep learning on large scale systems
Zhang et al. Sparse semi-supervised learning on low-rank kernel
Lin et al. Online kernel learning with nearly constant support vectors
Lu et al. Complexity-reduced implementations of complete and null-space-based linear discriminant analysis
WO2022188711A1 (en) Svm model training method and apparatus, device, and computer-readable storage medium
Liu et al. Ensemble kernel method: SVM classification based on game theory
Du et al. Maxios: Large scale nonnegative matrix factorization for collaborative filtering
Dubout et al. Accelerated training of linear object detectors

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE