Summary of the invention
Object of the invention: To address the problems and shortcomings of the prior art, the present invention proposes a high-dimensional data visualization method for cloud computing networks based on improved PCA, which standardizes the high-dimensional data in the cloud computing network. Simulation results show that the improved method achieves good visualization and classification performance and can effectively standardize high-dimensional data in a cloud computing network. Principal component analysis (PCA) is a mathematical dimensionality-reduction method: it finds a small number of composite variables to replace the many original variables, such that these composite variables reflect as much of the information in the original variables as possible and are independent of one another.
Technical solution: A high-dimensional data visualization method for cloud computing networks based on improved PCA, which standardizes and optimizes the high-dimensional data in the cloud computing network. The method comprises two parts: construction of the high-dimensional data feature matrix, and data standardization and optimization based on high-dimensional data visualization.
Construction of the high-dimensional data feature matrix
In the high-dimensional data visualization process, the variables in the original high-dimensional data matrix are standardized to give a new high-dimensional data feature matrix; the eigenvalues of this matrix are arranged in order, and the principal component data with the largest variance are selected. The specific steps are as follows:
Let x denote the raw data matrix in the cloud computing network. Each variable in x is standardized as a preprocessing step, and the standardized data matrix z is obtained using formula (5).
In the formula, $x_{i,j}$ denotes the j-th category attribute of the i-th high-dimensional data item; the other two symbols denote, respectively, the covariance matrix and the low-dimensional embedding space of the i-th high-dimensional data item.
The calculation then proceeds using formulas (6) and (7).
Let c denote the covariance matrix; c is calculated using formula (8).
The eigenvalue matrix $\lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$ of c and the eigenvector matrix w are obtained using the Jacobi method.
The eigenvalues are arranged in descending order, $\lambda_1 > \lambda_2 > \cdots > \lambda_m$, and the columns of the eigenvector matrix are reordered accordingly, so that the first principal component has the largest variance, the second principal component has the second-largest variance, and the d-th principal component corresponds to the smallest variance. The k principal components with the largest variance are selected so that they retain most of the original information; k is generally chosen so that the cumulative variance contribution of the selected components exceeds 85% of the total variance, that is, $\sum_{i=1}^{k}\lambda_i \big/ \sum_{i=1}^{m}\lambda_i > 85\%$. Let $w_i$ denote the eigenvectors of the k selected principal components; formula (11) is then used to obtain k independent new variables, each a linear combination of the original variables.
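The steps above (standardization, covariance matrix, eigendecomposition, descending sort, and selection of k by the 85% rule) can be sketched as follows. This is a minimal illustration only: the function name `pca_select_components` is hypothetical, the standardization is assumed to be the usual zero-mean, unit-variance scaling because formula (5) is not reproduced here, and NumPy's symmetric eigensolver stands in for the Jacobi method.

```python
import numpy as np

def pca_select_components(x, threshold=0.85):
    """Standardize x (rows = samples) and keep the k leading principal components."""
    x = np.asarray(x, dtype=float)
    # Assumed standardization: zero mean and unit sample standard deviation per column.
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Covariance matrix c of the standardized variables.
    c = np.cov(z, rowvar=False)
    # Symmetric eigensolver, used here in place of a hand-written Jacobi iteration.
    eigvals, w = np.linalg.eigh(c)
    # Sort eigenvalues in descending order and reorder the eigenvector columns to match.
    order = np.argsort(eigvals)[::-1]
    eigvals, w = eigvals[order], w[:, order]
    # Contribution rate of each component, and the smallest k whose
    # cumulative contribution exceeds the threshold.
    rate = eigvals / eigvals.sum()
    k = int(np.searchsorted(np.cumsum(rate), threshold, side="right")) + 1
    k = min(k, len(rate))
    # New variables: linear combinations of the standardized variables.
    y = z @ w[:, :k]
    return y, rate[:k], k
```

A column with zero variance would cause a division by zero in the standardization step, so such columns would need to be dropped beforehand in this sketch.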
In summary, in the high-dimensional data visualization process, standardizing the variables in the original high-dimensional data matrix, giving a new high-dimensional data feature matrix, arranging its eigenvalues in order, and selecting the principal component data with the largest variance lays the foundation for visualizing the high-dimensional data.
Data standardization and optimization based on high-dimensional data visualization
Taking into account both the principal component contribution rate factor and the similarity between columns, a new method for ordering data columns is proposed. The main procedure is as follows:
Let y denote the data matrix after the principal component transformation. Based on the $\xi_k$ obtained, y is calculated using formula (12).
In the formula, fc denotes the between-class separation of data of different classes, g denotes the clustered data in the high-dimensional space, and $\omega^*$ denotes the within-class aggregation degree.
1. Calculation of the contribution degree factor
First, the similarity matrix between the columns is calculated, where $s_{ij}$ denotes the similarity between column i and column j. For column i, the average similarity with all other columns is then calculated as $t_i$. Since $t_i$ reflects how similar column i is to the other columns, a new contribution degree factor $g_i$ can be defined, in which $a_i$ denotes the contribution degree factor weight. This contribution degree factor is the product of the principal component contribution rate factor and the similarity between columns, and therefore better reflects the importance of each column.
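The contribution degree factor can be illustrated with a short sketch. The similarity measure is not specified in the text, so the code below assumes, purely for illustration, a distance-based similarity $s_{ij} = 1/(1 + \lVert y_i - y_j \rVert)$ between columns; the function name `contribution_factors` and the argument `rate` (the principal component contribution rates, playing the role of $a_i$) are likewise hypothetical.

```python
import numpy as np

def contribution_factors(y, rate):
    """Return (s, t, g): column similarity matrix, mean similarity, contribution factor."""
    y = np.asarray(y, dtype=float)
    a = np.asarray(rate, dtype=float)
    cols = y.T  # one row per data column of y
    # Pairwise Euclidean distances between columns.
    dist = np.linalg.norm(cols[:, None, :] - cols[None, :, :], axis=2)
    # Assumed similarity measure: 1 / (1 + distance), so identical columns score 1.
    s = 1.0 / (1.0 + dist)
    n = s.shape[0]
    # Average similarity of column i with all other columns (diagonal excluded).
    t = (s.sum(axis=1) - np.diag(s)) / (n - 1)
    # Contribution degree factor: contribution rate times average similarity.
    g = a * t
    return s, t, g
```

Any other symmetric similarity measure could be substituted for the assumed one without changing the remaining two lines of the calculation.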
2. Data sorting
The contribution degree factors $g_i$ are arranged in descending order, and the order of the corresponding columns in y is adjusted accordingly. Let y′ denote the matrix after reordering; it is expressed by formula (14).
As formula (14) shows, the larger the contribution rate of a data column in y, the earlier the corresponding column appears in y′, and hence the earlier it is displayed in the visualization.
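The reordering step amounts to sorting the columns of y by descending contribution degree factor. A minimal sketch, with the hypothetical helper name `reorder_columns`:

```python
import numpy as np

def reorder_columns(y, g):
    """Reorder the columns of y so that larger contribution factors come first."""
    y = np.asarray(y)
    # Stable sort on the negated factors gives descending order and
    # keeps the original order among columns with equal factors.
    order = np.argsort(-np.asarray(g, dtype=float), kind="stable")
    return y[:, order], order
```

Returning `order` alongside the reordered matrix makes it possible to reorder the contribution rates with the same permutation.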
3. Data column weighting
The weight of each column of y′ is defined as its contribution rate, and each column of y′ is multiplied by the corresponding contribution rate, as expressed by formula (15).
Let $\lambda_{new}$ denote the new data contribution rate; formula (16) is used to calculate the distance between any two columns i and j under $\lambda_{new}$.
In the formula, d(i, j) denotes the distance between columns i and j before the contribution rate factor is applied.
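The weighting step can be sketched as below. Formula (16) is not reproduced in the text, so the distance used here is an assumption: the plain Euclidean distance between the weighted columns. The name `weighted_column_distances` is hypothetical.

```python
import numpy as np

def weighted_column_distances(y_sorted, rate):
    """Scale each column by its contribution rate and return pairwise column distances."""
    y_sorted = np.asarray(y_sorted, dtype=float)
    rate = np.asarray(rate, dtype=float)
    # Broadcasting multiplies column j of y_sorted by rate[j].
    y_weighted = y_sorted * rate
    cols = y_weighted.T
    # Assumed distance: Euclidean distance between the weighted columns.
    dist = np.linalg.norm(cols[:, None, :] - cols[None, :, :], axis=2)
    return y_weighted, dist
```

Under this assumption, columns with a small contribution rate are pulled towards the origin, so they influence the inter-column distances less than the leading components do.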
Specific embodiments
The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments serve only to illustrate the present invention and not to limit its scope; after reading the present invention, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the claims appended to this application.
Standardization and optimization of high-dimensional data can only be completed on the basis of high-dimensional data visualization in the cloud computing network. First, dimensionality reduction is applied to all the high-dimensional data in the cloud computing network, projecting the data onto a two-dimensional data space. The relationship between the classes in the high-dimensional dataset and their features is then extracted, and the arrangement order of the data features and the optimal mapping are sought. On this basis the dataset is classified, and the result is used to complete the visualization of the high-dimensional data and thereby its standardization. The specific steps are as follows:
Let $x_i$ denote the high-dimensional data input in the cloud computing environment, and let $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T$ denote the d-dimensional feature vector of $x_i$. Formula (1) is then used to reduce the dimensionality of all the high-dimensional data in the cloud computing network, projecting the data onto a two-dimensional data space.
In the formula, $w_j$ denotes a vector of the same dimension as the high-dimensional data, ε(t) denotes the sample dataset, and λ(k) denotes the average similarity between features. In formula (1), j = 1, …, d, x is the high-dimensional data matrix, and $w_c$ is the data matrix to be reduced in dimension.
Let cv(i, j) denote the similarity between high-dimensional data samples and $v_k(i)$ denote a random high-dimensional data feature vector. Formula (2) is then used to extract the relationship between the classes in the high-dimensional dataset and their features.
In the formula, $cv_{n\times n}$ denotes the transformation matrix of the high-dimensional data.
Given the transformation matrix $cv_{n\times n}$ and the eigenvalues of the different types of high-dimensional data, satisfying the condition for k = 1, 2, 3, …, the arrangement order of the data features and the optimal mapping are sought, and the visualization model of the high-dimensional data is established using formula (3).
In the formula, prox(i, j) denotes the similarity between samples i and j, and λ(k) denotes the average similarity between features.
Based on formula (3), the entire high-dimensional dataset is standardized using formula (4).
However, this traditional method cannot effectively eliminate redundant data and information, its visualization is poor, and it degrades the result of data standardization. A high-dimensional data visualization method for cloud computing networks based on improved PCA is therefore proposed, which standardizes and optimizes the high-dimensional data in the cloud computing network. It comprises two parts: construction of the high-dimensional data feature matrix, and data standardization and optimization based on high-dimensional data visualization.
Simulation verification
To demonstrate the effectiveness of the proposed improved-PCA-based high-dimensional data visualization method for standardizing high-dimensional data in a cloud computing network, an experiment was carried out. The hardware used was a computer with a 2.8 GHz CPU and 1 GB of memory. The datasets were downloaded from http://dbgroup.cs.tsinghua.edu. The selected datasets are commonly used in the literature to compare performance on various pattern recognition tasks. Table 1 gives the number of samples, number of features, and number of classes of each experimental dataset.
Table 1. Experimental dataset information
Here nd denotes the dataset name, ns the number of samples, cs the number of features, and nc the number of classes.
To ensure the fairness of the high-dimensional data visualization experiment, the classifier error rate is estimated using 6v, taking the average result of 6 independent experiments, where 6v means that the dataset samples are divided into 6 parts. Because the quality of the visualization directly affects the result of data standardization and optimization, the present invention is verified through the visualization results for high-dimensional data.
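The error-rate estimate can be sketched as follows, under the assumption that "6v" denotes 6-fold cross-validation: the samples are split into 6 parts, each part serves once as the test set, and the 6 error rates are averaged. The classifier is not named in the text, so a simple nearest-centroid classifier is used here as a stand-in; both function names are hypothetical.

```python
import numpy as np

def nearest_centroid(train_x, train_labels, test_x):
    """Stand-in classifier: assign each test sample to the nearest class mean."""
    classes = np.unique(train_labels)
    centroids = np.array([train_x[train_labels == c].mean(axis=0) for c in classes])
    dist = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dist, axis=1)]

def kfold_error_rate(x, labels, fit_predict, folds=6, seed=0):
    """Average classification error over `folds` train/test splits."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    # Shuffle the sample indices and divide them into `folds` parts.
    parts = np.array_split(rng.permutation(len(x)), folds)
    errors = []
    for i, test_idx in enumerate(parts):
        train_idx = np.concatenate([p for j, p in enumerate(parts) if j != i])
        predicted = fit_predict(x[train_idx], labels[train_idx], x[test_idx])
        errors.append(np.mean(predicted != labels[test_idx]))
    return float(np.mean(errors))
```

Passing the classifier as the `fit_predict` argument keeps the splitting logic independent of whichever classifier is actually used in the experiment.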
Classification error rates of the different algorithms
The method of the present invention and the algorithms of documents [2] and [1] were each used to carry out high-dimensional data visualization experiments in the cloud computing network. The high-dimensional data classification error rates of the three algorithms were compared; the results are shown in Table 2.
Table 2. Comparison of classification error rates of the different algorithms
Here nd denotes the dataset name, pa the error rate of the method of the present invention, la [9] the error rate of the algorithm of document [2], and la [8] the error rate of the algorithm of document [1].
Table 2 shows that the error rate of high-dimensional data visualization classification in the cloud computing network using the method of the present invention is far lower than that obtained with the algorithms of documents [2] and [1]. This is mainly because, when high-dimensional data are visualized with the method of the present invention, the inter-column distances of the high-dimensional data are modified by the principal component contribution rate, and the transformed high-dimensional data are then re-established column by column using a hierarchical clustering algorithm. This ensures the accuracy of data classification when the method of the present invention is used to visualize high-dimensional data in the cloud computing network.
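How a hierarchical clustering can rearrange columns from an inter-column distance matrix is sketched below. The linkage criterion is not stated in the text; single linkage is assumed here for simplicity, and the function name `single_linkage_order` is hypothetical. The sketch favors readability over speed and is only suitable for a small number of columns.

```python
import numpy as np

def single_linkage_order(dist):
    """Return a column order in which columns merged early end up adjacent."""
    dist = np.asarray(dist, dtype=float)
    # Start with every column in its own cluster.
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the two closest clusters, keeping their members contiguous.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters[0]
```

The returned list can be used directly as a column index, so that similar columns are placed next to each other in the display.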
Comparison of the high-dimensional data visualization results of the different algorithms
The method of the present invention and the algorithm of document [2] were each used to carry out high-dimensional data visualization experiments in the cloud computing network, and the visualization results of the two algorithms were compared. The comparison results are shown in Fig. 1 and Fig. 2.
Fig. 1 and Fig. 2 show that the visualization of high-dimensional data in the cloud computing network obtained with the method of the present invention is better than that obtained with the algorithm of document [2]. This is mainly because, when visualizing high-dimensional data, the method of the present invention first standardizes the variables in the original high-dimensional data matrix to give a new high-dimensional data feature matrix, arranges the eigenvalues of the matrix in order, and selects the principal component data with the largest variance, which ensures the superiority of the method of the present invention in high-dimensional data visualization.
The simulation results show that the proposed method achieves good visualization and classification performance.