CN106407363A - Ultra-high-dimensional data dimension reduction algorithm based on information entropy - Google Patents
- Publication number
- CN106407363A (application number CN201610810509.1A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- data
- dimension
- eigenvalue
- dimensionality reduction
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides an ultra-high-dimensional data dimension reduction algorithm based on information entropy and belongs to the field of high-dimensional data preprocessing. It addresses a problem that arises when the conventional PCA algorithm is applied in practice: when the data dimensionality (number of features) is high enough, the feature values cannot all be read into memory at once for analysis and calculation. A block-wise (partitioned) processing method that does not depend on a cloud platform or a distributed computing platform was first tried experimentally, but it takes too long to run and cannot meet practical application requirements. On this basis, the idea of information entropy is introduced to improve the PCA algorithm so that the improved algorithm can handle dimension reduction of ultra-high-dimensional data. Experimental results show that, while retaining the same proportion of the original data information, the running time of the improved algorithm is reduced by a factor of 60 compared with the block-wise processing algorithm.
Description
Technical field
The invention belongs to the technical field of high-dimensional data preprocessing; more specifically, it is an ultra-high-dimensional data dimension reduction algorithm improved on the basis of information entropy.
Background art
With the rapid development of information science and technology, information is represented ever more comprehensively, data is ever easier to obtain, and the data objects of interest grow increasingly complex. Industry demand for data analysis and processing techniques, particularly for high-dimensional data, is increasingly urgent. Processing high-dimensional data directly faces the following difficulties: the curse of dimensionality, the empty-space phenomenon, ill-posedness, and algorithm failure. The present invention targets the case where the feature dimensionality of the data is too high and memory is limited, so that the data cannot be read into memory at once for analysis and calculation. A block-wise processing method was adopted first; its processing flow is shown in Fig. 1. The results show, however, that it takes too long to run and cannot meet application requirements. On this basis, information entropy is introduced: feature screening is performed first, which greatly reduces the number of features, and dimension reduction is performed afterwards. The specific flow is shown in Fig. 2 and the specific algorithm in Fig. 3. The running time of the whole process is reduced several-fold, and the dimension reduction result retains most of the principal components and still meets application requirements.
Summary of the invention
The ultimate purpose of the present invention is to reduce the dimensionality of the original ultra-high-dimensional data so that the reduced data can continue to be analyzed and processed with less memory and in less time. The invention improves the PCA algorithm mainly by exploiting the significance of information entropy in information processing. The data dimension here is the number of attributes in each data record.
To achieve the above object, the present invention improves the PCA algorithm on the basis of information entropy. The algorithm is composed as follows:
1) Matrix = getMatrix(rdata)
2) Compute the information entropy of each attribute and screen the attributes
3) Partition the data matrix
[B, C] = randomSplit(Matrix) // B is the training set, C is the test set
4) Center the sample matrix B, i.e. subtract from each dimension the mean of that dimension
X = B - repmat(mean(B, 2), 1, m1) // m1 is the number of columns (records) of B
5) Compute the covariance between different dimensions to form the covariance matrix:
Cov = (X * X') / size(X, 2)
6) Compute the eigenvectors eigenVector and the eigenvalues eigenValue of the covariance matrix
[eigenVector, eigenValue] = eig(Cov)
7) Select the k eigenvectors corresponding to the k largest eigenvalues and use them as column vectors to form the eigenvector matrix V (n×k); k is determined from f.
8) Compute the dimension reduction result: Y = V' * X
9) Center matrix C to obtain X0
10) Compute the result: Y0 = V' * X0
11) Subsequent comparison, such as classification.
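Steps 4) and 5) can be illustrated with a small sketch. It is written in plain Python rather than the MATLAB-style notation of the listing, and the helper names `center_rows` and `covariance` are hypothetical. It assumes, as the listing does, that rows are dimensions (attributes) and columns are records.

```python
def center_rows(B):
    """Subtract each row's mean (one row per attribute, one column per record)."""
    centered = []
    for row in B:
        mean = sum(row) / len(row)
        centered.append([value - mean for value in row])
    return centered


def covariance(X):
    """Return (X * X^T) / m for an n-by-m centered matrix X."""
    m = len(X[0])
    return [[sum(a * b for a, b in zip(row_i, row_j)) / m for row_j in X]
            for row_i in X]


# Toy training matrix: 2 attributes (rows), 3 records (columns).
B = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0]]
X = center_rows(B)      # step 4: centering
Cov = covariance(X)     # step 5: n-by-n covariance matrix
```

Dividing by the number of records, as `size(X, 2)` does in the listing, gives the population covariance; dividing by m - 1 instead would give the sample covariance, and the eigenvectors are the same in both cases.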
Brief description of the drawings
Fig. 1 is the flow of block-based PCA processing of high-dimensional data (the processing flow applying PCA); Fig. 2 is the PCA dimension reduction flow of the present invention based on information entropy (the entropy-based dimension reduction flow); Fig. 3 shows the steps of the inventive algorithm E-PCA (the information-entropy-based PCA (E-PCA) algorithm procedure).
Detailed description of the embodiments
Specific embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note in particular that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the invention.
Fig. 2 shows the flow of the present invention's entropy-based dimension reduction of ultra-high-dimensional data. In this embodiment, as shown in Fig. 2, the raw data serves as the input; if the raw data is already a matrix composed of attributes and records, the step of converting it to a matrix can be omitted.
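The claims describe the conversion step (getMatrix) as producing an n×m 0/1 matrix in which an attribute that a record contains is 1 and an attribute it lacks is 0. A minimal sketch of that idea follows. The representation of the raw data as a list of attribute sets, the sorted attribute order, and the name `get_matrix` are all assumptions made for illustration and do not come from the patent.

```python
def get_matrix(rdata):
    """Build an n-by-m 0/1 matrix: rows are attributes, columns are records.

    rdata is assumed to be a list of records, each given as a set of the
    attributes that the record contains.
    """
    # Fix a deterministic attribute order so that row indices are stable.
    attributes = sorted({attr for record in rdata for attr in record})
    matrix = [[1 if attr in record else 0 for record in rdata]
              for attr in attributes]
    return attributes, matrix


rdata = [{"a", "b"}, {"b"}, {"a", "c"}]
attributes, Matrix = get_matrix(rdata)
```

Here each row of `Matrix` corresponds to one attribute and each column to one record, which matches the orientation implied by `mean(B, 2)` in the algorithm listing.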
The step after generating the matrix is to compute the information entropy H(i) of each attribute and compare it with the threshold et (et is chosen according to the specific application). Attributes whose entropy exceeds the threshold are retained, and the resulting matrix A is the input to the next step.
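The text does not give the formula used for H(i). The sketch below therefore assumes the Shannon entropy (in bits) of each 0/1 attribute row, which is one natural reading for binary data, and keeps the attributes whose entropy is strictly greater than et. The names `binary_entropy` and `filter_by_entropy` are hypothetical.

```python
import math


def binary_entropy(row):
    """Shannon entropy in bits of a 0/1 attribute row (an assumed definition)."""
    p = sum(row) / len(row)          # fraction of records that have the attribute
    if p in (0.0, 1.0):
        return 0.0                   # a constant attribute carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def filter_by_entropy(matrix, et):
    """Keep the rows (attributes) whose entropy exceeds the threshold et."""
    kept = [i for i, row in enumerate(matrix) if binary_entropy(row) > et]
    return kept, [matrix[i] for i in kept]


Matrix = [[1, 1, 1, 1],   # constant attribute, entropy 0
          [1, 0, 1, 0],   # balanced attribute, entropy 1
          [1, 0, 0, 0]]   # skewed attribute, entropy about 0.81
kept, A = filter_by_entropy(Matrix, 0.9)
```

With a threshold of 0.9 only the balanced attribute survives, which shows how the screening step can shrink the number of rows before the covariance matrix is ever formed.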
The data processed by information entropy then enters the PCA flow. The samples are first centered, and the covariance between different attributes is then computed to form the covariance matrix. Next, the eigenvalues and eigenvectors of the covariance matrix are computed, and the eigenvalue contribution rate f (which characterizes the proportion of the original data information accounted for by the principal components obtained) is used to determine k and hence the number of principal components. The k eigenvectors corresponding to the k largest eigenvalues are extracted as the transformation basis, and the original data is reduced to the result Y (k×m) for subsequent analysis and calculation.
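The text says that k is determined from the eigenvalue contribution rate f but does not spell out the rule. A common reading, assumed here, is to take the smallest k for which the k largest eigenvalues account for at least a fraction f of the sum of all eigenvalues. The function name `choose_k` is hypothetical.

```python
def choose_k(eigenvalues, f):
    """Smallest k whose top-k eigenvalues reach a cumulative share of at least f."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= f:
            return k
    return len(ordered)


# Eigenvalues sum to 8; the cumulative shares are 0.5, 0.75, 0.9375, 1.0.
k = choose_k([0.5, 4.0, 1.5, 2.0], 0.9)
```

In this toy case three components are needed to reach a 0.9 contribution rate, while a rate of 0.75 would be met by the two largest eigenvalues alone.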
Although illustrative specific embodiments of the present invention have been described above so that those skilled in the art can understand the invention, it should be clear that the invention is not limited to the scope of the specific embodiments. To those of ordinary skill in the art, as long as the various changes fall within the spirit and scope of the invention as defined and determined by the appended claims, these changes are obvious, and all inventions and creations that make use of the inventive concept are protected.
Claims (2)
1. Information entropy can measure the amount of information, and dimension reduction of high-dimensional data is an effective way to solve the four major difficulties faced when processing high-dimensional data directly; to analyze and compute on ultra-high-dimensional data, dimension reduction is all the more necessary. An ultra-high-dimensional data dimension reduction algorithm based on information entropy is characterized by the following:
// Input: the data matrix U (n×m) to be reduced (or rdata in non-matrix form),
information entropy threshold et,
eigenvalue contribution rate f
// Output: the result Y (k×m) of reducing the original data
/* getMatrix function: converts the raw data rdata to matrix form; an attribute that a record contains is set to 1 and an attribute it lacks is set to 0; outputs an n×m 0/1 matrix; used when the raw data is not in matrix form */
Matrix = getMatrix(rdata)
/* getEntropy function: computes the information entropy of attribute αi */
H(i) = getEntropy(αi)
/* randomSplit function: randomly draws records from the matrix-form data Matrix in a given proportion as the training set; the remainder is the test set */
[B, C] = randomSplit(Matrix)
/* eig function: computes the eigenvalues and eigenvectors of a matrix */
[eigenVe, eigenVa] = eig(Cov)
/* variable f: the eigenvalue contribution rate, which characterizes the proportion of the original data information accounted for by the principal components */
.
2. Combining the features of claim 1, the dimension reduction process for ultra-high-dimensional data is expressed by the file testepca.m:
/* testepca.m: the main dimension reduction program; performs PCA dimension reduction on the data processed by information entropy and outputs the dimension reduction result */
1) Matrix = getMatrix(rdata)
2) Compute the information entropy of each attribute and screen the attributes
3) Partition the data matrix
[B, C] = randomSplit(Matrix) // B is the training set, C is the test set
4) Center the sample matrix B, i.e. subtract from each dimension the mean of that dimension
X = B - repmat(mean(B, 2), 1, m1) // m1 is the number of columns (records) of B
5) Compute the covariance between different dimensions to form the covariance matrix:
Cov = (X * X') / size(X, 2)
6) Compute the eigenvectors eigenVector and the eigenvalues eigenValue of the covariance matrix
[eigenVector, eigenValue] = eig(Cov)
7) Select the k eigenvectors corresponding to the k largest eigenvalues and use them as column vectors to form the eigenvector matrix V (n×k); k is determined from f.
8) Compute the dimension reduction result: Y = V' * X
9) Center matrix C to obtain X0
10) Compute the result: Y0 = V' * X0
The above dimension reduction results Y and Y0 can be used for subsequent analysis and calculation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610810509.1A CN106407363A (en) | 2016-09-08 | 2016-09-08 | Ultra-high-dimensional data dimension reduction algorithm based on information entropy |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610810509.1A CN106407363A (en) | 2016-09-08 | 2016-09-08 | Ultra-high-dimensional data dimension reduction algorithm based on information entropy |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106407363A true CN106407363A (en) | 2017-02-15 |
Family
ID=57998945
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610810509.1A Pending CN106407363A (en) | 2016-09-08 | 2016-09-08 | Ultra-high-dimensional data dimension reduction algorithm based on information entropy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106407363A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106919677A (en) * | 2017-02-25 | 2017-07-04 | 浙江沛宏网络科技有限公司 | One kind is based on third-party user data statistical method and system |
CN108561127A (en) * | 2018-03-26 | 2018-09-21 | 上海电力学院 | A kind of Formation pressure prediction method based on stochastic simulation |
CN108828590A (en) * | 2018-07-03 | 2018-11-16 | 南京信息工程大学 | A kind of low complex degree entropy extension through-wall radar imaging method |
CN109241231A (en) * | 2018-09-07 | 2019-01-18 | 武汉中海庭数据技术有限公司 | The accurately pretreatment unit and method of diagram data |
CN109446476A (en) * | 2018-09-27 | 2019-03-08 | 清华大学 | A kind of multimodal sensor information decoupling method |
CN110007989A (en) * | 2018-12-13 | 2019-07-12 | 国网信通亿力科技有限责任公司 | Data visualization platform system |
CN110334546A (en) * | 2019-07-08 | 2019-10-15 | 辽宁工业大学 | Difference privacy high dimensional data based on principal component analysis optimization issues guard method |
CN110501917A (en) * | 2019-09-11 | 2019-11-26 | 智慧谷(厦门)物联科技有限公司 | The system and method for realizing internet of things intelligent household information management using cloud computing |
CN110705276A (en) * | 2019-09-26 | 2020-01-17 | 中电万维信息技术有限责任公司 | Method, device and storage medium for monitoring network public sentiment based on neural network |
CN111948736A (en) * | 2019-05-14 | 2020-11-17 | 中国电力科学研究院有限公司 | High-dimensional weather forecast data dimension reduction method based on big data platform |
CN111984466A (en) * | 2020-07-30 | 2020-11-24 | 苏州浪潮智能科技有限公司 | ICC-based data consistency inspection method and system |
CN112241748A (en) * | 2019-07-16 | 2021-01-19 | 广州汽车集团股份有限公司 | Data dimension reduction method and device based on multi-source information entropy difference |
CN112288016A (en) * | 2020-10-30 | 2021-01-29 | 上海淇玥信息技术有限公司 | Channel anti-cheating method and device based on principal component analysis algorithm and electronic equipment |
CN112464154A (en) * | 2020-11-27 | 2021-03-09 | 中国船舶重工集团公司第七0四研究所 | Method for automatically screening effective features based on unsupervised learning |
CN116434950A (en) * | 2023-06-05 | 2023-07-14 | 山东建筑大学 | Diagnosis system for autism spectrum disorder based on data clustering and ensemble learning |
CN117520824A (en) * | 2024-01-03 | 2024-02-06 | 浙江省白马湖实验室有限公司 | Information entropy-based distributed optical fiber data characteristic reconstruction method |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106919677A (en) * | 2017-02-25 | 2017-07-04 | 浙江沛宏网络科技有限公司 | One kind is based on third-party user data statistical method and system |
CN108561127A (en) * | 2018-03-26 | 2018-09-21 | 上海电力学院 | A kind of Formation pressure prediction method based on stochastic simulation |
CN108561127B (en) * | 2018-03-26 | 2022-04-01 | 上海电力学院 | Stratum pressure prediction method based on random simulation |
CN108828590A (en) * | 2018-07-03 | 2018-11-16 | 南京信息工程大学 | A kind of low complex degree entropy extension through-wall radar imaging method |
CN109241231A (en) * | 2018-09-07 | 2019-01-18 | 武汉中海庭数据技术有限公司 | The accurately pretreatment unit and method of diagram data |
CN109446476B (en) * | 2018-09-27 | 2020-07-14 | 清华大学 | Multi-mode sensor information decoupling method |
CN109446476A (en) * | 2018-09-27 | 2019-03-08 | 清华大学 | A kind of multimodal sensor information decoupling method |
CN110007989A (en) * | 2018-12-13 | 2019-07-12 | 国网信通亿力科技有限责任公司 | Data visualization platform system |
CN111948736A (en) * | 2019-05-14 | 2020-11-17 | 中国电力科学研究院有限公司 | High-dimensional weather forecast data dimension reduction method based on big data platform |
CN110334546B (en) * | 2019-07-08 | 2021-11-23 | 辽宁工业大学 | Difference privacy high-dimensional data release protection method based on principal component analysis optimization |
CN110334546A (en) * | 2019-07-08 | 2019-10-15 | 辽宁工业大学 | Difference privacy high dimensional data based on principal component analysis optimization issues guard method |
CN112241748A (en) * | 2019-07-16 | 2021-01-19 | 广州汽车集团股份有限公司 | Data dimension reduction method and device based on multi-source information entropy difference |
CN110501917A (en) * | 2019-09-11 | 2019-11-26 | 智慧谷(厦门)物联科技有限公司 | The system and method for realizing internet of things intelligent household information management using cloud computing |
CN110705276A (en) * | 2019-09-26 | 2020-01-17 | 中电万维信息技术有限责任公司 | Method, device and storage medium for monitoring network public sentiment based on neural network |
CN111984466A (en) * | 2020-07-30 | 2020-11-24 | 苏州浪潮智能科技有限公司 | ICC-based data consistency inspection method and system |
CN111984466B (en) * | 2020-07-30 | 2022-10-25 | 苏州浪潮智能科技有限公司 | ICC-based data consistency inspection method and system |
CN112288016A (en) * | 2020-10-30 | 2021-01-29 | 上海淇玥信息技术有限公司 | Channel anti-cheating method and device based on principal component analysis algorithm and electronic equipment |
CN112288016B (en) * | 2020-10-30 | 2023-10-31 | 上海淇玥信息技术有限公司 | Channel anti-cheating method and device based on principal component analysis algorithm and electronic equipment |
CN112464154A (en) * | 2020-11-27 | 2021-03-09 | 中国船舶重工集团公司第七0四研究所 | Method for automatically screening effective features based on unsupervised learning |
CN112464154B (en) * | 2020-11-27 | 2024-03-01 | 中国船舶重工集团公司第七0四研究所 | Method for automatically screening effective features based on unsupervised learning |
CN116434950A (en) * | 2023-06-05 | 2023-07-14 | 山东建筑大学 | Diagnosis system for autism spectrum disorder based on data clustering and ensemble learning |
CN116434950B (en) * | 2023-06-05 | 2023-08-29 | 山东建筑大学 | Diagnosis system for autism spectrum disorder based on data clustering and ensemble learning |
CN117520824A (en) * | 2024-01-03 | 2024-02-06 | 浙江省白马湖实验室有限公司 | Information entropy-based distributed optical fiber data characteristic reconstruction method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106407363A (en) | Ultra-high-dimensional data dimension reduction algorithm based on information entropy | |
CN105354595A (en) | Robust visual image classification method and system | |
US20140099033A1 (en) | Fast computation of kernel descriptors | |
CN103177265B (en) | High-definition image classification method based on kernel function Yu sparse coding | |
Zhao et al. | Local quantization code histogram for texture classification | |
Langone et al. | Soft kernel spectral clustering | |
CN107679539B (en) | Single convolution neural network local information and global information integration method based on local perception field | |
Schettino et al. | Income polarization in the USA: What happened to the middle class in the last few decades? | |
CN109711442A (en) | Unsupervised layer-by-layer generation fights character representation learning method | |
CN109522953A (en) | The method classified based on internet startup disk algorithm and CNN to graph structure data | |
Van den Broeck et al. | Learning in feedforward Boolean networks | |
CN111931867A (en) | New coronary pneumonia X-ray image classification method and system based on lightweight model | |
CN110851627A (en) | Method for describing sun black subgroup in full-sun image | |
Saha et al. | Matrix compression via randomized low rank and low precision factorization | |
Li et al. | Weight‐Selected Attribute Bagging for Credit Scoring | |
CN106557783A (en) | A kind of automatic extracting system and method for caricature dominant role | |
Cheng et al. | Adaptive matching of kernel means | |
CN107506744A (en) | Represent to retain based on local linear and differentiate embedded face identification method | |
CN108280511A (en) | A method of network access data is carried out based on convolutional network and is handled | |
Chen et al. | Feature coding for image classification combining global saliency and local difference | |
CN107563334A (en) | Based on the face identification method for differentiating linear expression retaining projection | |
Sassi et al. | A methodology using neural network to cluster validity discovered from a marketing database | |
Li et al. | [Retracted] Multiobject Detection Algorithm Based on Adaptive Default Box Mechanism | |
Zhang et al. | Sparse eigenfaces analysis for recognition | |
CN104200510B (en) | Vector quantization compression object plotting method based on target CF |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170215 |