CN106855918A

CN106855918A - Principal component analysis method for processing large-scale matrix data

Info

Publication number: CN106855918A
Application number: CN201611153472.6A
Authority: CN
Inventors: 喻文健; 谷昱
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2016-12-14
Filing date: 2016-12-14
Publication date: 2017-06-16

Abstract

The present invention proposes a principal component analysis method for processing large-scale matrix data, comprising: generating a random matrix Ω; computing matrices G and H from the raw data matrix A; initializing the variable j=1 and setting the m × l matrix Q and the l × n matrix B to zero matrices; letting G_[j,j+b] and Ω_[j,j+b] denote columns j through j+b of G and Ω respectively and, when j>1, computing G_[j,j+b]-QBΩ_[j,j+b] and overwriting G_[j,j+b] with the result; performing a reduced QR decomposition of G_[j,j+b] to obtain a column-orthogonal matrix Q_[j,j+b] and an upper triangular square matrix R; if j>1, performing a reduced QR decomposition of Q_[j,j+b]-Q(Q^T Q_[j,j+b]), overwriting Q_[j,j+b] with the resulting column-orthogonal matrix, left-multiplying R by the resulting upper triangular matrix, and overwriting R with the product; letting H_[j,j+b] denote columns j through j+b of H and computing the matrix B_temp, by one formula if j=1 and by another otherwise [formulas not reproduced]; assigning the value j+b+1 to the variable j; if j≤l, returning to the fourth step, otherwise proceeding to the next step; and performing a singular value decomposition of B to obtain the first k principal component vectors and the corresponding singular values. The invention is suitable for a variety of big data analysis scenarios and offers high computational efficiency and practicality.

Description

Principal component analysis method for processing large-scale matrix data

Technical field

The present invention relates to the technical field of big data analysis, and in particular to a principal component analysis method for processing large-scale matrix data.

Background art

Principal component analysis (PCA) is a commonly used data analysis method. Through matrix computations, PCA extracts a set of principal basis vectors (i.e. principal feature components) of the raw data in a linear space and then projects the raw data onto this basis, thereby reducing the dimension of high-dimensional data. The reduced data can then be clustered, classified and so on, enabling artificial intelligence applications such as feature extraction, automatic classification and recognition. As an important unsupervised learning method, PCA is now widely used in application problems related to data mining and machine learning.

In practical problems, data can often be expressed as a matrix. Without loss of generality, each data item is regarded as a row of matrix A, so the number of columns of the matrix is the dimension of each data item. The goal of PCA is to compute several principal feature components of the raw data, which can be obtained through an eigenvalue decomposition or a singular value decomposition of the matrix. The eigenvalue-decomposition approach first computes the matrix A^T A and then performs an eigenvalue decomposition of A^T A; the eigenvectors corresponding to the largest eigenvalues are the required "principal components". The singular-value-decomposition approach decomposes A directly: A = UΣV^T, where U and V are orthogonal matrices and Σ is a diagonal matrix whose diagonal elements are arranged in descending order; the first several columns of V are the required "principal components". If the data dimension is not very high, i.e. the number of columns of A is much smaller than the number of rows, the eigenvalue-decomposition approach is more efficient, because the matrix A^T A it processes is of low order.
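The two routes described above can be compared on a small example. The following is a minimal sketch, assuming NumPy and a small random matrix that fits in memory; the variable names are illustrative and do not come from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 6))   # 200 data items (rows), 6 features (columns)
k = 2                               # number of principal components wanted

# Route 1: eigenvalue decomposition of the small 6 x 6 matrix A^T A.
eigvals, eigvecs = np.linalg.eigh(A.T @ A)   # eigh returns ascending eigenvalues
top = np.argsort(eigvals)[::-1][:k]          # indices of the k largest eigenvalues
V_eig = eigvecs[:, top]
sigma_eig = np.sqrt(eigvals[top])            # square roots give singular values of A

# Route 2: singular value decomposition A = U Sigma V^T.
_, S, Vt = np.linalg.svd(A, full_matrices=False)
V_svd = Vt[:k].T
sigma_svd = S[:k]

# Matching columns agree up to sign, so |v_eig . v_svd| should be close to 1.
alignment = np.abs(np.sum(V_eig * V_svd, axis=0))
```

Both routes yield the same leading directions up to sign; the first only ever factorizes an n × n matrix, which is why it is attractive when n is much smaller than m.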

On the other hand, with the rapid development of mobile devices, the Internet, sensor networks and genetic engineering, data sources have become diverse and data volumes are growing exponentially. In other words, we are now in the so-called "big data" era. How to store, analyze and manage ever-growing data sets within acceptable time and space limits is a problem facing traditional data processing methods. Research shows that 85% of current data can be expressed, directly or after conversion, as numerical data, i.e. the common integer and real-valued data, and the "table" structures built from numerical data and stored in databases are generally treated as matrices. It has therefore become extremely important to develop effective "big matrix" data analysis methods that take into account how such big data is generated, stored and used. Specifically, because the data is so large, it may be stored in a distributed manner (i.e. on different computer nodes of a network), or stored on a computer's hard disk without being loadable into memory in its entirety (because of limited memory capacity). In other application scenarios, the data may be generated and acquired gradually as a "data stream", which does not suit the traditional store-first, compute-later approach. Traditional methods for computing PCA must perform an eigenvalue decomposition or singular value decomposition of the whole matrix, and their algorithms must repeatedly read and traverse the elements of the data matrix (to compute the first k principal components, the matrix elements must be read in full at least k times). They are clearly unsuitable for analyzing big data in the above scenarios, where reading the data is very expensive.

Against this background, randomized matrix computation methods, including algorithms for eigenvalue decomposition and singular value decomposition, have attracted much attention in recent years. The paper N. Halko, P.-G. Martinsson and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Review, 53 (2011), no. 2, pp. 217-288 (hereinafter SIAM2011) proposes a randomized singular value decomposition algorithm that traverses the matrix data fewer times. The method multiplies the original matrix A by a random matrix with only k columns to obtain a k-dimensional feature subspace of the column space of A, and then obtains the orthogonal basis matrix Q of that subspace and an approximate factorization of A: A ≈ QB, where B is a matrix with only k rows. Finally, a conventional singular value decomposition is performed on the smaller matrix B, which yields approximations of the first k singular values of A and the corresponding left and right singular vectors. SIAM2011 also analyzes the accuracy of this approximation theoretically, showing that with very high probability the error falls within a very small bound, and proposes several techniques for improving the accuracy of the result.
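The two-pass scheme summarized above can be sketched as follows. This is a minimal illustration, assuming NumPy and an in-memory matrix; the function name `basic_randomized_svd` and the `oversample` parameter (a few extra random columns beyond k) are assumptions made for the example, not details taken from SIAM2011 or from this document.

```python
import numpy as np

def basic_randomized_svd(A, k, oversample=5, seed=0):
    """Approximate the first k singular triplets of A with two passes over A."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Omega = rng.standard_normal((n, k + oversample))  # random test matrix
    Y = A @ Omega                  # first pass over A: sample the column space
    Q, _ = np.linalg.qr(Y)         # orthonormal basis of the sampled subspace
    B = Q.T @ A                    # second pass over A: A is approximated by Q B
    U_b, S, Vt = np.linalg.svd(B, full_matrices=False)  # SVD of the small matrix
    return (Q @ U_b)[:, :k], S[:k], Vt[:k].T

# Demo on a matrix of exact rank 3, where the approximation is exact up to rounding.
rng = np.random.default_rng(1)
A_demo = rng.standard_normal((80, 3)) @ rng.standard_normal((3, 40))
U, S, V = basic_randomized_svd(A_demo, k=3)
```

The two products `A @ Omega` and `Q.T @ A` are the two traversals of the data matrix that the method described in this document sets out to reduce to one.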

It should be noted that, although the method of SIAM2011 greatly reduces the number of passes over the matrix elements compared with traditional singular value decomposition algorithms, it still needs to traverse the matrix elements at least twice. There is therefore still room to improve its computational efficiency, and it cannot meet the processing requirements of streaming big data.

Summary of the invention

The present invention aims to solve at least one of the above technical problems.

To this end, an object of the invention is to propose a principal component analysis method for processing large-scale matrix data that is suitable for a variety of big data analysis scenarios and offers high computational efficiency and practicality.

To achieve the above object, an embodiment of the invention discloses a principal component analysis method for processing large-scale matrix data, comprising the following steps. S1: generate in memory a random matrix Ω with n rows and l columns. S2: take the raw data matrix A, compute matrices G and H from it, and store G and H in memory, where G = AΩ, H = A^T G, and A is an m × n matrix. S3: initialize the variable j=1, and initialize the m × l matrix Q and the l × n matrix B as zero matrices. S4: let G_[j,j+b] and Ω_[j,j+b] denote columns j through j+b of G and Ω respectively; when j>1, compute G_[j,j+b]-QBΩ_[j,j+b] and overwrite G_[j,j+b] with the result, where b is a nonnegative integer not greater than l-j. S5: perform a reduced QR decomposition of G_[j,j+b] to obtain an m × (b+1) column-orthogonal matrix Q_[j,j+b] and an upper triangular square matrix R, where Q_[j,j+b] is stored in columns j through j+b of Q. S6: if j>1, perform a reduced QR decomposition of Q_[j,j+b]-Q(Q^T Q_[j,j+b]), overwrite Q_[j,j+b] with the resulting m × (b+1) column-orthogonal matrix, left-multiply R by the resulting upper triangular matrix, and overwrite R with the product. S7: let H_[j,j+b] denote columns j through j+b of H; compute the (b+1) × n matrix B_temp, by one formula if j=1 and by another otherwise [formulas not reproduced], and store B_temp in rows j through j+b of B. S8: assign the value j+b+1 to the variable j. S9: if j≤l, return to S4; otherwise perform S10. S10: perform a singular value decomposition of B: B = UΣV^T, where the first k columns of V are the first k principal component vectors and the first k diagonal elements of Σ are the corresponding singular values.

In addition, the principal component analysis method for processing large-scale matrix data according to the above embodiment of the invention may also have the following additional technical features:

In some examples, in S1, the parameter l is an integer at least 5 greater than k.

In some examples, S1 further comprises: S11: generate an n × l random matrix Ω using random number generator software; S12: initialize the variable i=0, where the variable P is a nonnegative integer less than 10; S13: if i=P, stop; otherwise go to S14; S14: compute the matrix product AΩ, perform a reduced QR decomposition of the result, and assign the resulting m × l column-orthogonal matrix to G; S15: compute the matrix product A^T G, perform a reduced QR decomposition of the result, and assign the resulting n × l column-orthogonal matrix to Ω; S16: increase i by 1 and go to S13.

In some examples, in S2, depending on how the raw data matrix A is generated or where it comes from, the matrices G = AΩ and H = A^T G are computed in a single traversal of the elements of A.

In some examples, S2 further comprises: S21: allocate a two-dimensional array in memory to store the n × l matrix H, and initialize the entries of H to 0; S22: obtain the data of a preset number of rows of the raw data matrix A and store them in memory; let these rows form the s × n matrix A_i, and compute the matrix product G_i = A_iΩ, where G_i forms the corresponding rows of G; S23: compute H + A_i^T G_i and assign the result to H; S24: determine whether all rows of A have been obtained; if so, stop; otherwise return to S22.

The principal component analysis method for processing large-scale matrix data according to the embodiments of the invention is based on the existing randomized singular value decomposition algorithm, but through improvements it reduces the number of passes that the main part of the algorithm makes over the data matrix from two to one while keeping the accuracy of the original algorithm unchanged. It is therefore suitable for a variety of big data analysis scenarios and offers high computational efficiency and practicality.

Additional aspects and advantages of the invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawing, in which:

Fig. 1 is a flow chart of the principal component analysis method for processing large-scale matrix data according to an embodiment of the invention.

Detailed description of the embodiments

Embodiments of the invention are described in detail below, and an example of the embodiments is shown in the accompanying drawing.

The principal component analysis method for processing large-scale matrix data according to the embodiments of the invention is described below with reference to the drawing.

Fig. 1 is a flow chart of the principal component analysis method for processing large-scale matrix data according to one embodiment of the invention. As shown in Fig. 1, the method comprises the following steps:

Step S1: Generate in memory a random matrix Ω with n rows and l columns, where the parameter l is an integer at least 5 greater than k, and k is the number of principal components to be computed.

In one embodiment of the invention, the elements of the random matrix Ω are uniformly distributed random numbers or standard normal random numbers, or are obtained in a more elaborate way; these are not enumerated one by one here.

In one embodiment of the invention, step S1 further comprises:

S11: Generate an n × l random matrix Ω using random number generator software.

S12: Initialize the variable i=0; the variable P is a nonnegative integer less than 10.

S13: If i=P, stop; otherwise go to step S14.

S14: Compute the matrix product AΩ, perform a reduced QR decomposition of the result, and assign the resulting m × l column-orthogonal matrix to G.

S15: Compute the matrix product A^T G, perform a reduced QR decomposition of the result, and assign the resulting n × l column-orthogonal matrix to Ω.

S16: Increase i by 1 and go to S13. This iteration improves the accuracy of the result.
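Steps S11 to S16 can be sketched as follows. This is a minimal illustration, assuming NumPy, an in-memory matrix A and standard normal entries for the initial Ω; the function name `refine_omega` and the seed handling are assumptions made for the example.

```python
import numpy as np

def refine_omega(A, l, P=2, seed=0):
    """Generate an n x l random matrix and refine it with P rounds of S14-S15."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Omega = rng.standard_normal((n, l))     # S11: initial random matrix
    i = 0                                   # S12: iteration counter
    while i != P:                           # S13: stop once i equals P
        G, _ = np.linalg.qr(A @ Omega)      # S14: m x l column-orthogonal matrix
        Omega, _ = np.linalg.qr(A.T @ G)    # S15: n x l column-orthogonal matrix
        i += 1                              # S16: next round
    return Omega

rng = np.random.default_rng(2)
A_small = rng.standard_normal((50, 20))
Omega_refined = refine_omega(A_small, l=6, P=2)
Omega_plain = refine_omega(A_small, l=6, P=0)   # P = 0 leaves the random matrix as generated
```

`np.linalg.qr` returns the reduced factorization by default, so the orthogonal factors have exactly l columns provided that m and n are both at least l.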

Step S2: Take the raw data matrix A, compute matrices G and H from it, and store G and H in memory, where G = AΩ, H = A^T G, and A is an m × n matrix each row of which represents one data item, m rows in total.

In one embodiment of the invention, in step S2, depending on how the raw data matrix A is generated or where it comes from, the matrices G = AΩ and H = A^T G are computed in a single traversal of the elements of A.

On this basis, step S2 further comprises:

S21: Allocate a two-dimensional array in memory to store the n × l matrix H, and initialize the entries of H to 0.

S22: Obtain the data of a preset number of rows of the raw data matrix A and store them in memory; let these rows form the s × n matrix A_i, and compute the matrix product G_i = A_iΩ, where G_i forms the corresponding rows of G.

S23: Compute H + A_i^T G_i and assign the result to (overwrite) matrix H.

S24: Determine whether all rows of the raw data matrix A have been obtained; if so, stop; otherwise return to step S22.
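The accumulation in S21 to S24 can be sketched as follows. This is a minimal illustration, assuming NumPy and any iterable that yields consecutive row blocks of A; the function name `accumulate_G_H` is hypothetical. The update H ← H + A_i^T G_i used below is the reading that follows from H = A^T G when A and G are split into matching row blocks.

```python
import numpy as np

def accumulate_G_H(row_blocks, Omega):
    """Build G = A Omega and H = A^T G while seeing each row block of A once."""
    n, l = Omega.shape
    H = np.zeros((n, l))            # S21: H starts as the zero matrix
    G_parts = []
    for A_i in row_blocks:          # S22: next s x n block of rows of A
        G_i = A_i @ Omega           # the corresponding rows of G
        G_parts.append(G_i)
        H += A_i.T @ G_i            # S23: accumulate the block's contribution to H
    return np.vstack(G_parts), H    # S24: all rows have been processed

rng = np.random.default_rng(3)
A_small = rng.standard_normal((30, 8))
Omega_small = rng.standard_normal((8, 3))
blocks = (A_small[i:i + 7] for i in range(0, 30, 7))   # last block is shorter
G, H = accumulate_G_H(blocks, Omega_small)
```

Because each block is used and then discarded, A never has to be held in memory as a whole; only Ω, G and H are kept.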

Step S3: Initialize the variable j=1, and initialize the m × l matrix Q and the l × n matrix B as zero matrices.

Step S4: Let G_[j,j+b] and Ω_[j,j+b] denote columns j through j+b of matrix G and matrix Ω respectively; when j>1, compute G_[j,j+b]-QBΩ_[j,j+b] and overwrite G_[j,j+b] with the result, where b is a nonnegative integer not greater than l-j.

Step S5: Perform a reduced QR decomposition of G_[j,j+b] to obtain an m × (b+1) column-orthogonal matrix Q_[j,j+b] and an upper triangular square matrix R, where Q_[j,j+b] is stored in columns j through j+b of Q.

Step S6: If j>1, perform a reduced QR decomposition of Q_[j,j+b]-Q(Q^T Q_[j,j+b]), overwrite Q_[j,j+b] with the resulting m × (b+1) column-orthogonal matrix, left-multiply R by the resulting upper triangular matrix, and overwrite R with the product.

Step S7: Let H_[j,j+b] denote columns j through j+b of matrix H; compute the (b+1) × n matrix B_temp, by one formula if j=1 and by another otherwise [formulas not reproduced], and store B_temp in rows j through j+b of B.

Step S8: Assign the value j+b+1 to the variable j.

Step S9: If j≤l, return to S4; otherwise perform S10.

Step S10: Perform a singular value decomposition of matrix B to obtain the first k principal component vectors and the corresponding singular values. Specifically, the singular value decomposition of B is:

B = UΣV^T,

where the first k columns of V are the first k principal component vectors, and the first k diagonal elements of Σ are the corresponding singular values.
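The expressions for B_temp in step S7 do not appear in this text. One expression consistent with the surrounding definitions can be derived as below. This is a reconstruction under stated assumptions, not necessarily the formula of the original filing: it assumes that the rows of B already computed satisfy B = Q^T A, that the goal of S7 is B_temp = Q_[j,j+b]^T A, and that rounding errors are ignored. The symbol Y (the block G_[j,j+b] after the overwrite in S4) is introduced here for illustration.

```latex
% Y: G_{[j,j+b]} after S4; R: the triangular factor after S5-S6.
% Since Q^T Y = 0 when B = Q^T A, the re-orthogonalization in S6 leaves Y = Q_{[j,j+b]} R.
\begin{aligned}
Y &= G_{[j,j+b]} - QB\,\Omega_{[j,j+b]} = (A - QB)\,\Omega_{[j,j+b]}, \\
Y &= Q_{[j,j+b]}\,R \quad\Longrightarrow\quad Q_{[j,j+b]}^{T} = R^{-T}\,Y^{T}, \\
B_{temp} &= Q_{[j,j+b]}^{T} A
          = R^{-T}\,\Omega_{[j,j+b]}^{T}\,(A - QB)^{T} A \\
         &= R^{-T}\left( H_{[j,j+b]}^{T} - \Omega_{[j,j+b]}^{T}\,B^{T} B \right),
\end{aligned}
```

where the last line uses H_[j,j+b] = A^T A Ω_[j,j+b] and Q^T A = B. For j=1, Q and B are zero matrices and the expression reduces to B_temp = R^{-T} H_[j,j+b]^T. Algebraically equivalent variants exist, for example with an extra term Y^T Q B, which vanishes in exact arithmetic. Only G, H, Ω, Q, B and R appear on the right-hand side, so no further access to A is needed.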

It should be noted that the above embodiment assumes that the raw data matrix A is an m × n matrix each row of which represents one data item, m rows in total; on this basis, the purpose of the invention is to compute the first k principal component vectors of the data and the corresponding singular values. If instead each column of A represents one data item, the computation described in the above embodiment can be applied to its transpose A^T.

To facilitate a better understanding of the principal component analysis method for processing large-scale matrix data according to the embodiments of the invention, the invention is explained in further detail below with reference to a specific embodiment.

In this embodiment, the method may be implemented in any programming language and executed on a computing device having a CPU and memory. The random number generator, matrix addition and subtraction, multiplication, transposition, matrix inversion (or the solution of linear systems), QR decomposition and singular value decomposition used in this embodiment are all prior art and can be realized by calling the numerical library of the chosen programming language.

In this embodiment, consider a large data matrix A stored on a hard disk. It holds time-series data in which each data item has a relatively large number of features, for example 100,000, and there are 500,000 data items in total. Each value is stored as a double-precision floating-point number, which requires 8 bytes, so the whole data set needs about 400 GB of storage. Suppose 1000 principal features of the data are to be extracted, i.e. the first 1000 principal component vectors are to be computed. The following steps complete the computation with a single pass over the data (parameter values k=1000, m=500,000, n=100,000):
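The storage figure above can be checked with a few lines of arithmetic. The sketch below also estimates the memory needed for the matrices the method works with in memory (Ω, G, H, Q and B). It assumes l = k + 5, 8-byte values throughout and gigabytes of 10^9 bytes; the helper name `gigabytes` is illustrative.

```python
m, n, k = 500_000, 100_000, 1000
l = k + 5                      # number of columns of the random matrix
BYTES_PER_VALUE = 8            # double-precision floating point

def gigabytes(rows, cols):
    """Storage of a dense rows x cols matrix of doubles, in units of 10^9 bytes."""
    return rows * cols * BYTES_PER_VALUE / 1e9

size_A = gigabytes(m, n)       # the raw data matrix on disk

in_memory = {
    "Omega": gigabytes(n, l),
    "G": gigabytes(m, l),
    "H": gigabytes(n, l),
    "Q": gigabytes(m, l),
    "B": gigabytes(l, n),
}
total_in_memory = sum(in_memory.values())
```

Under these assumptions the five working matrices together take roughly 10 GB, compared with 400 GB for A itself.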

Step 1: Determine the value of l from the parameter k by setting l=k+5.

Step 2: Generate in memory a standard normal random matrix Ω with n rows and l columns.

Step 3: Compute the matrices G = AΩ and H = A^T G by traversing the matrix A on the hard disk once. The specific steps are as follows:

Step 3.1: Allocate two-dimensional arrays in memory to store the n × l matrix H and the m × l matrix G, initialize their entries to 0, open the disk file that stores A, and place the read pointer at the start of the file.

Step 3.2: Starting from the file pointer position, read 1000 rows of A into memory; let them form the matrix A_i, compute G_i = A_iΩ, and store the result G_i in the corresponding rows of G.

Step 3.3: Compute H + A_i^T G_i and assign the result to (overwrite) H.

Step 3.4: If the file holding A has not been read to the end, return to Step 3.2; otherwise go to Step 4.

Step 4: Initialize the variable j=1, and initialize the m × l matrix Q and the l × n matrix B as zero matrices.

Step 5: Set the value of b to 19.

Step 6: Let G_[j,j+b] and Ω_[j,j+b] denote columns j through j+b of G and Ω respectively; if j>1, compute G_[j,j+b]-QBΩ_[j,j+b] and overwrite G_[j,j+b] with the result.

Step 7: Perform a reduced QR decomposition of G_[j,j+b] to obtain an m × (b+1) column-orthogonal matrix Q_[j,j+b] and an upper triangular square matrix R; Q_[j,j+b] is stored in columns j through j+b of Q.

Step 8: If j>1, perform a reduced QR decomposition of Q_[j,j+b]-Q(Q^T Q_[j,j+b]), overwrite Q_[j,j+b] with the resulting m × (b+1) column-orthogonal matrix, left-multiply R by the resulting upper triangular matrix, and overwrite R with the product.

Step 9: Let H_[j,j+b] denote columns j through j+b of H; compute the (b+1) × n matrix B_temp, by one formula if j=1 and by another otherwise [formulas not reproduced], and store it in rows j through j+b of B.

Step 10: Assign the value j+b+1 to the variable j.

Step 11: If j≤l, return to Step 5; otherwise go to Step 12.

Step 12: Perform a singular value decomposition of matrix B, i.e. B = UΣV^T; the first k columns of V are then the desired first k principal component vectors, and the first k diagonal elements of Σ are the corresponding singular values.
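The block loop of Steps 4 to 12 can be sketched end to end as follows. This is a minimal illustration on a small in-memory example, assuming NumPy and 0-based indexing. The function name `single_pass_pca` is hypothetical, the block width is clipped so that the last block does not run past column l, and the expression used for B_temp is an assumption (it is derived from requiring B_temp = Q_[j,j+b]^T A, since the formula itself is not reproduced in this text).

```python
import numpy as np

def single_pass_pca(G, H, Omega, k, b=19):
    """Recover k principal vectors from G = A Omega and H = A^T G, without A."""
    m, l = G.shape
    n = H.shape[0]
    G = G.copy()                               # blocks of G are overwritten below
    Q = np.zeros((m, l))
    B = np.zeros((l, n))
    j = 0                                      # 0-based counterpart of j = 1
    while j < l:
        width = min(b, l - 1 - j)              # block has width + 1 columns
        cols = slice(j, j + width + 1)
        B_Om = B @ Omega[:, cols]              # zero while Q and B are still empty
        if j > 0:
            G[:, cols] -= Q @ B_Om             # subtract Q B Omega_block
        Q_blk, R = np.linalg.qr(G[:, cols])    # reduced QR of the block
        if j > 0:                              # re-orthogonalize against earlier Q
            Q_blk, R_extra = np.linalg.qr(Q_blk - Q @ (Q.T @ Q_blk))
            R = R_extra @ R
        # Assumed formula: B_temp = R^{-T} (H_block^T - Omega_block^T B^T B).
        rhs = H[:, cols].T - B_Om.T @ B
        B[cols, :] = np.linalg.solve(R.T, rhs)
        Q[:, cols] = Q_blk
        j += width + 1
    _, S, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:k].T, S[:k], Q, B

# Small demo: G and H are formed from an in-memory A only to set up the example.
rng = np.random.default_rng(5)
A_demo = rng.standard_normal((60, 40))
Omega_demo = rng.standard_normal((40, 10))
G_demo = A_demo @ Omega_demo
H_demo = A_demo.T @ G_demo
V, S, Q, B = single_pass_pca(G_demo, H_demo, Omega_demo, k=5, b=3)
```

In exact arithmetic this choice of B_temp gives B = Q^T A, so the singular value decomposition of B plays the same role as in the two-pass scheme. Solving with R^T amplifies rounding errors when R is ill-conditioned, so the sketch is best read as an illustration of the data flow rather than a tuned implementation.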

Because reading data from the hard disk takes far longer than computing in memory, the above algorithm, which reads the data on the hard disk only once, runs twice as fast as the corresponding algorithm in SIAM2011, greatly reducing the time of the whole big data analysis.

The first 1000 principal component vectors obtained by the above process form the matrix V; note that it is a 100,000 × 1000 column-orthogonal matrix. Computing AV then yields a 500,000 × 1000 reduced-dimension data matrix, which still represents 500,000 data items but with a greatly reduced data dimension. The reduced data can then be clustered, classified and so on for data mining and analysis.
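The projection AV can be computed block by block in the same spirit as the earlier steps. The following is a minimal sketch, assuming NumPy and an iterable of row blocks; the function name `project_in_blocks` is hypothetical, and V is obtained here from a plain singular value decomposition of a small matrix purely as a stand-in for the principal component vectors.

```python
import numpy as np

def project_in_blocks(row_blocks, V):
    """Return the reduced data A V, computed one row block of A at a time."""
    return np.vstack([A_i @ V for A_i in row_blocks])

rng = np.random.default_rng(4)
A_small = rng.standard_normal((50, 12))
# Stand-in for the first k = 3 principal component vectors (12 x 3, orthonormal columns).
V = np.linalg.svd(A_small, full_matrices=False)[2][:3].T
reduced = project_in_blocks((A_small[i:i + 10] for i in range(0, 50, 10)), V)
```

Each data item keeps its row position in `reduced`, so downstream clustering or classification can use it in place of the original rows.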

In summary, the principal component analysis method for processing large-scale matrix data according to the embodiments of the invention is based on the existing randomized singular value decomposition algorithm, but through improvements it reduces the number of passes that the main part of the algorithm makes over the data matrix from two to one while keeping the accuracy of the original algorithm unchanged. It is therefore suitable for a variety of big data analysis scenarios and offers high computational efficiency and practicality.

In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principle and spirit of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims

1. A principal component analysis method for processing large-scale matrix data, characterized by comprising the following steps:

S1: generating in memory a random matrix Ω with n rows and l columns;

S2: taking the raw data matrix A, computing matrices G and H from the raw data matrix A, and storing G and H in memory, wherein G = AΩ, H = A^T G, and the raw data matrix A is an m × n matrix;

S3: initializing the variable j=1, and initializing the m × l matrix Q and the l × n matrix B as zero matrices;

S4: letting G_[j,j+b] and Ω_[j,j+b] denote columns j through j+b of matrix G and matrix Ω respectively, and, when j>1, computing G_[j,j+b]-QBΩ_[j,j+b] and overwriting G_[j,j+b] with the result, wherein b is a nonnegative integer not greater than l-j;

S5: performing a reduced QR decomposition of G_[j,j+b] to obtain an m × (b+1) column-orthogonal matrix Q_[j,j+b] and an upper triangular square matrix R, wherein Q_[j,j+b] is stored in columns j through j+b of matrix Q;

S6: if j>1, performing a reduced QR decomposition of Q_[j,j+b]-Q(Q^T Q_[j,j+b]), overwriting Q_[j,j+b] with the resulting m × (b+1) column-orthogonal matrix, left-multiplying R by the resulting upper triangular matrix, and overwriting R with the product;

S7: letting H_[j,j+b] denote columns j through j+b of matrix H, computing the (b+1) × n matrix B_temp, by one formula if j=1 and by another otherwise [formulas not reproduced], and storing B_temp in rows j through j+b of matrix B;

S8: assigning the value j+b+1 to the variable j;

S9: if j≤l, returning to S4, otherwise performing S10;

S10: performing a singular value decomposition of matrix B: B = UΣV^T, wherein the first k columns of matrix V are the first k principal component vectors, and the first k diagonal elements of Σ are the corresponding singular values.

2. The principal component analysis method for processing large-scale matrix data according to claim 1, characterized in that, in S1, the parameter l is an integer at least 5 greater than k.

3. The principal component analysis method for processing large-scale matrix data according to claim 1, characterized in that S1 further comprises:

S11: generating an n × l random matrix Ω using random number generator software;

S12: initializing the variable i=0, the variable P being a nonnegative integer less than 10;

S13: if i=P, stopping, otherwise going to S14;

S14: computing the matrix product AΩ, performing a reduced QR decomposition of the result, and assigning the resulting m × l column-orthogonal matrix to matrix G;

S15: computing the matrix product A^T G, performing a reduced QR decomposition of the result, and assigning the resulting n × l column-orthogonal matrix to matrix Ω;

S16: increasing the value of i by 1 and going to S13.

4. The principal component analysis method for processing large-scale matrix data according to claim 1, characterized in that, in S2, depending on how the raw data matrix A is generated or where it comes from, the matrices G = AΩ and H = A^T G are computed in a single traversal of the elements of the raw data matrix A.

5. The principal component analysis method for processing large-scale matrix data according to claim 1, characterized in that S2 further comprises:

S21: allocating a two-dimensional array in memory to store the n × l matrix H, and initializing the entries of matrix H to 0;

S22: obtaining the data of a preset number of rows of the raw data matrix A and storing them in memory, letting these rows form the s × n matrix A_i, and computing the matrix product G_i = A_iΩ, wherein G_i forms the corresponding rows of matrix G;

S23: computing H + A_i^T G_i and assigning the result to matrix H;

S24: determining whether all rows of the raw data matrix A have been obtained; if so, stopping, otherwise returning to S22.