CN106407363A

CN106407363A - Ultra-high-dimensional data dimension reduction algorithm based on information entropy

Info

Publication number: CN106407363A
Application number: CN201610810509.1A
Authority: CN
Inventors: 何兴高; 李蝉娟; 张效藩
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2016-09-08
Filing date: 2016-09-08
Publication date: 2017-02-15

Abstract

The invention provides an ultra-high-dimensional data dimension reduction algorithm based on information entropy and belongs to the field of high-dimensional data preprocessing. It addresses a problem that arises when the conventional PCA algorithm is applied in practice: when the data dimensionality (number of features) is high enough, the values of all features cannot be read into memory at once for analysis and calculation. A block processing method that does not depend on a cloud platform or a distributed computing platform was tested experimentally, but it takes too long to run and cannot meet practical application requirements. On this basis, the idea of information entropy is introduced to improve the PCA algorithm so that the improved algorithm can perform dimension reduction on ultra-high-dimensional data. Experimental results show that, while retaining the same proportion of the original data information, the running time of the improved algorithm is reduced by a factor of 60 compared with the block processing algorithm.

Description

An ultra-high-dimensional data dimension reduction algorithm based on information entropy

Technical field

The invention belongs to the technical field of high-dimensional data preprocessing. More specifically, it is an ultra-high-dimensional data dimension reduction algorithm improved on the basis of information entropy.

Background art

With the rapid development of information science and technology, information is expressed ever more comprehensively, data are increasingly easy to obtain, and the data objects of interest grow more complex by the day. Industry's demand for data analysis and processing techniques is increasingly urgent, particularly for high-dimensional data. Processing high-dimensional data directly faces the following difficulties: the curse of dimensionality, empty space, ill-posedness, and algorithm failure. The present invention addresses the case where the feature dimensionality is too high and memory is too limited for the data to be read into memory at once for analysis and calculation. A block processing method was adopted first, with the processing flow shown in Fig. 1. The results showed, however, that the running time is too long to meet application requirements. On this basis, information entropy is introduced: feature screening is performed first, which greatly reduces the number of features, and dimension reduction is performed afterwards. The specific flow is shown in Fig. 2 and the specific algorithm in Fig. 3. The running time of the whole process is reduced several-fold, and the dimension reduction result retains most of the principal components, so it still meets application requirements.

Summary of the invention

The ultimate purpose of the present invention is to reduce the dimensionality of the original ultra-high-dimensional data so that the reduced data can be analyzed and processed further with less memory and in less time. The invention mainly uses the significance of information entropy in information processing to improve the PCA algorithm. Here, the data dimension is the number of attributes of each data record.

To achieve the above purpose, the present invention improves the PCA algorithm on the basis of information entropy. The algorithm is composed as follows:

1) Matrix=getMatrix (rdata)

2) Compute the information entropy of each attribute and screen the attributes

3) Split the data matrix

[B, C]=randomSplit (Matrix) // B is the training set, C is the test set

4) Center the sample matrix B, i.e. subtract from each dimension the mean of that dimension:

X=B-repmat (mean (B, 2), 1, m1) // m1 is the number of records (columns) in B

5) Compute the covariance between different dimensions to form the covariance matrix:

Cov=(X*X^T)/size(X,2)

6) Compute the eigenvectors eigenVector and the eigenvalues eigenValue of the covariance matrix

[eigenVector, eigenValue]=eig (Cov)

7) Select the k eigenvectors corresponding to the k largest eigenvalues and use them as column vectors to form the eigenvector matrix V_n×k; k is determined by f.

8) Compute the dimension reduction result: Y=V^T*X

9) Center matrix C to obtain X0

10) Compute the result: Y0=V^T*X0

11) Perform subsequent comparison, such as classification.
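Steps 4) to 10) above can be sketched in Python as follows. This is only an illustration under stated assumptions: matrices are plain lists with one row per attribute and one column per record, the function names are hypothetical, and power iteration with deflation stands in for the eig call of step 6), which the patent does not specify further. The test set is centered with its own means, as step 9) literally reads.

```python
import math
import random

def center(rows):
    """Step 4/9: subtract each attribute's mean (rows[i] = attribute i over all records)."""
    means = [sum(r) / len(r) for r in rows]
    return [[v - mu for v in r] for r, mu in zip(rows, means)]

def covariance(x):
    """Step 5: Cov = X * X^T / m for a centered n x m matrix X."""
    m = len(x[0])
    return [[sum(a * b for a, b in zip(ri, rj)) / m for rj in x] for ri in x]

def top_eigenvectors(cov, k, iters=500, seed=0):
    """Steps 6-7 stand-in (assumption): power iteration with deflation, not eig()."""
    rng = random.Random(seed)
    n = len(cov)
    a = [row[:] for row in cov]  # work on a copy so cov is left untouched
    vectors, values = [], []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0.0:  # nothing left to extract
                break
            v = [c / norm for c in w]
        # Rayleigh quotient gives the eigenvalue estimate for the unit vector v.
        lam = sum(v[i] * sum(a[i][j] * v[j] for j in range(n)) for i in range(n))
        vectors.append(v)
        values.append(lam)
        # Deflate: remove the component just found before looking for the next one.
        for i in range(n):
            for j in range(n):
                a[i][j] -= lam * v[i] * v[j]
    return vectors, values

def project(vectors, x):
    """Steps 8 and 10: Y = V^T * X, a k x m matrix (vectors holds the rows of V^T)."""
    n, m = len(x), len(x[0])
    return [[sum(v[i] * x[i][c] for i in range(n)) for c in range(m)] for v in vectors]

# Toy data: 2 attributes, 4 training records (B) and 2 test records (C).
B = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
C = [[1.0, 3.0], [2.0, 6.0]]
X = center(B)
V, eigenvalues = top_eigenvectors(covariance(X), k=1)
Y = project(V, X)            # reduced training set
Y0 = project(V, center(C))   # reduced test set
```

In practice a library eigensolver would replace top_eigenvectors; the hand-written version only keeps the sketch free of dependencies.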

Brief description of the drawings

Fig. 1 is the flow of block-based PCA processing of high-dimensional data (processing flow applying PCA). Fig. 2 is the information-entropy-based PCA dimension reduction flow of the present invention (dimension reduction flow based on information entropy). Fig. 3 shows the steps of the E-PCA algorithm of the invention (procedure of the information-entropy-based PCA (E-PCA) algorithm).

Detailed description of the embodiments

Specific embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note in particular that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the invention.

Fig. 2 shows the flow of the information-entropy-based dimension reduction of ultra-high-dimensional data according to the present invention. In this embodiment, as shown in Fig. 2, the raw data serve as the input. If the raw data already form a matrix of attributes and records, the step of converting them to a matrix can be omitted.
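The conversion step, which claim 1 describes as producing an n × m matrix of 0s and 1s, might look like the following Python sketch. The input representation (one collection of attribute names per record) and the name get_matrix are assumptions made for illustration; the patent does not show the body of getMatrix.

```python
def get_matrix(records):
    """Return (attributes, matrix); matrix[i][j] is 1 if record j contains attribute i.

    Assumption: each record is given as an iterable of hashable attribute names.
    """
    record_sets = [set(rec) for rec in records]
    attributes = sorted({a for rec in record_sets for a in rec})
    matrix = [[1 if a in rec else 0 for rec in record_sets] for a in attributes]
    return attributes, matrix

# Three records over the attributes "a", "b", "c".
attributes, matrix = get_matrix([{"a", "b"}, {"b"}, {"c", "a"}])
```

Rows correspond to attributes and columns to records, matching the n × m layout used in the rest of the description.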

After the matrix is generated, the next step is to compute the information entropy H(i) of each attribute and compare it with the threshold et (whose value is chosen according to the specific application). Attributes whose entropy exceeds the threshold are retained, and the resulting matrix A is the input to the next step.
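A minimal Python sketch of this screening step follows. It assumes that H(i) is the Shannon entropy (in bits) of the value frequencies of attribute i, which the patent does not state explicitly; the function names are hypothetical.

```python
import math
from collections import Counter

def get_entropy(values):
    """Shannon entropy in bits of one attribute's observed values (assumed definition)."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def screen_by_entropy(matrix, et):
    """Keep the rows (attributes) whose entropy is strictly greater than et."""
    kept = [i for i, row in enumerate(matrix) if get_entropy(row) > et]
    return kept, [matrix[i] for i in kept]

# Row 0 is constant (entropy 0) and is dropped for any non-negative threshold.
U = [[1, 1, 1, 1], [1, 0, 1, 0], [1, 0, 0, 0]]
kept, A = screen_by_entropy(U, et=0.5)
```

A constant attribute carries no information under this definition, so screening removes it before the more expensive covariance and eigenvalue computations.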

The data that have passed the information entropy step then enter the PCA processing flow. The samples are first centered, and the covariance between different attributes is computed to form the covariance matrix. The eigenvalues and eigenvectors of the covariance matrix are then computed. The eigenvalue contribution rate f (which characterizes the proportion of the original data information accounted for by the resulting principal components) determines the value of k and thus the number of principal components. The k eigenvectors corresponding to the k largest eigenvalues are extracted as the transformation basis, and the original data are reduced to the result Y_k×m for subsequent analysis and calculation.
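The way f determines k can be sketched in Python as below. The rule used here, the smallest k whose largest eigenvalues together reach a share f of the eigenvalue sum, is the usual cumulative contribution rate and is an assumption; the patent does not give the formula. The name choose_k is hypothetical.

```python
def choose_k(eigenvalues, f):
    """Smallest k such that the k largest eigenvalues reach a share f of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= f * total:
            return k
    return len(ordered)

# Eigenvalues sum to 8.0; the two largest (4.0 and 2.5) cover 81.25 % of it.
k = choose_k([0.5, 4.0, 1.0, 2.5], f=0.8)
```

A larger f keeps more principal components and therefore more of the original information, at the cost of a higher output dimension k.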

Although illustrative specific embodiments of the present invention have been described above so that those skilled in the art can understand the invention, it should be clear that the invention is not limited to the scope of these specific embodiments. To those of ordinary skill in the art, various changes are obvious as long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and all inventions and creations that make use of the inventive concept are protected.

Claims

1. Information entropy can measure the amount of information, and dimension reduction of high-dimensional data is an effective way to solve the four major difficulties faced when processing high-dimensional data directly; analyzing and computing on ultra-high-dimensional data makes dimension reduction all the more necessary. An ultra-high-dimensional data dimension reduction algorithm based on information entropy, characterized by the following:

// Input: the data matrix U_n×m to be reduced (or raw data rdata in non-matrix form),

information entropy threshold et,

eigenvalue contribution rate f

// Output: the result Y_k×m of reducing the dimension of the original data

/* getMatrix function: converts the raw data rdata to matrix form; an attribute that a record contains is set to 1 and an attribute it lacks is set to 0, yielding an n × m matrix of 0s and 1s; used when the raw data are not in matrix form */

Matrix=getMatrix (rdata)

/* getEntropy function: computes the information entropy of attribute α_i */

H (i)=getEntropy (α_i)

/* randomSplit function: randomly draws records from the matrix-form data Matrix in a given proportion as the training set; the remaining records form the test set */

[B, C]=randomSplit (Matrix)

/* eig function: computes the eigenvalues and eigenvectors of a matrix */

[eigenVe, eigenVa]=eig (Cov)

/* variable f: the eigenvalue contribution rate, which characterizes the proportion of the original data information accounted for by the principal components */

.

2. Combining the features of claim 1, the dimension reduction process for ultra-high-dimensional data is expressed by the file testepca.m:

/* testepca.m: the main dimension reduction program; performs PCA dimension reduction on the data after information entropy processing and outputs the dimension reduction result */