CN106407363A - Ultra-high-dimensional data dimension reduction algorithm based on information entropy - Google Patents

Info

Publication number
CN106407363A
Authority
CN
China
Prior art keywords
matrix
data
dimension
eigenvalue
dimensionality reduction
Legal status
Pending
Application number
CN201610810509.1A
Other languages
Chinese (zh)
Inventor
何兴高
李蝉娟
张效藩
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
2016-09-08
Filing date
2016-09-08
Publication date
2017-02-15

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/283: Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an ultra-high-dimensional data dimension reduction algorithm based on information entropy and belongs to the field of high-dimensional data preprocessing. It addresses a problem that the conventional PCA algorithm meets in practice: when the data dimensionality (the number of features) is high enough, the values of all features cannot be read into memory at once for analysis and calculation. A block processing method that does not depend on a cloud platform or a distributed computing platform was tried experimentally, but it takes too long to run and cannot meet the requirements of practical applications. On this basis, the idea of information entropy is introduced to improve the PCA algorithm so that the improved algorithm can reduce the dimension of ultra-high-dimensional data. Experimental results show that, when the same proportion of the original data information is retained, the running time of the improved algorithm is shorter than that of the block processing algorithm by a factor of 60.

Description

An ultra-high-dimensional data dimension reduction algorithm based on information entropy
Technical field
The invention belongs to the technical field of high-dimensional data preprocessing; more specifically, it is an ultra-high-dimensional data dimension reduction algorithm improved on the basis of information entropy.
Background
With the rapid development of information science and technology, information is represented ever more comprehensively, data are ever easier to obtain, and the data objects of interest grow more complex by the day. Industry's demand for data analysis and processing techniques, particularly for high-dimensional data, is increasingly urgent. Processing high-dimensional data directly faces the following difficulties: the curse of dimensionality, the empty space phenomenon, ill-posedness, and algorithm failure. The invention targets the case where the feature dimension is too high and memory too limited for the data to be read into memory at once for analysis and calculation. A block processing method was adopted first; its processing flow is shown in Fig. 1. The results show, however, that it takes too long to run and cannot meet application requirements. On this basis, information entropy is introduced: features are screened first, which greatly reduces their number, and dimension reduction is performed afterwards. The detailed flow is shown in Fig. 2 and the detailed algorithm in Fig. 3. The running time of the whole process is reduced several-fold, and the dimension reduction result retains most of the principal components and still meets application requirements.
Summary of the invention
The ultimate purpose of the invention is to reduce the dimension of the original ultra-high-dimensional data so that the reduced data can be analysed and processed further with less memory and in less time. The invention improves the PCA algorithm mainly by exploiting the significance of information entropy in information processing. The dimension of the data is the number of attributes in each record.
To achieve the above purpose, the invention improves the PCA algorithm on the basis of information entropy. The algorithm consists of the following steps:
1) Matrix = getMatrix(rdata)
2) Compute the information entropy of each attribute and screen the attributes
3) Split the data matrix
[B, C] = randomSplit(Matrix) // B is the training set, C is the test set
4) Centre the sample matrix B, i.e. subtract from each dimension the mean of that dimension (m1 is the number of columns, i.e. records, of B)
X = B - repmat(mean(B, 2), 1, m1)
5) Compute the covariance between the different dimensions to form the covariance matrix:
Cov = (X * X^T) / size(X, 2)
6) Compute the eigenvectors eigenVector and the eigenvalues eigenValue of the covariance matrix
[eigenVector, eigenValue] = eig(Cov)
7) Select the k eigenvectors corresponding to the k largest eigenvalues as column vectors to form the eigenvector matrix V_{n×k}; k is determined by f.
8) Compute the dimension reduction result: Y = V^T * X
9) Centre matrix C to obtain X0
10) Compute the result: Y0 = V^T * X0
11) Carry out follow-up comparison, such as classification.
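Steps 4) and 5) above can be sketched in plain Python as follows. This is only an illustrative sketch, not the patent's testepca.m: the function names center_rows and covariance are hypothetical, and the matrix is assumed to be stored as a list of rows with one row per attribute and one column per record.

```python
def center_rows(B):
    """Subtract each row's mean (rows are attributes, columns are records)."""
    centered = []
    for row in B:
        mu = sum(row) / len(row)
        centered.append([v - mu for v in row])
    return centered


def covariance(X):
    """Return (X * X^T) / m for a row-centred n x m matrix X."""
    n, m = len(X), len(X[0])
    return [[sum(X[i][t] * X[j][t] for t in range(m)) / m for j in range(n)]
            for i in range(n)]


# Toy training matrix: 2 attributes, 3 records (made-up values).
B = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0]]
X = center_rows(B)
cov = covariance(X)
```

The division by the number of records mirrors size(X, 2) in step 5); the resulting matrix is n × n, so its cost grows with the number of attributes that survive the entropy screening.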
Brief description of the drawings
Fig. 1 is the flow of block-based PCA processing of high-dimensional data (the processing flow applying PCA); Fig. 2 is the PCA dimension reduction flow of the invention based on information entropy (the dimension reduction flow based on information entropy); Fig. 3 shows the steps of the algorithm of the invention, E-PCA (the procedure of the PCA algorithm based on information entropy, E-PCA).
Detailed description of the embodiments
Specific embodiments of the invention are described below with reference to the drawings so that those skilled in the art may better understand the invention. Note in particular that, in the following description, detailed accounts of known functions and designs are omitted where they might obscure the main content of the invention.
Fig. 2 shows the flow of the invention's information-entropy-based dimension reduction of ultra-high-dimensional data. In this embodiment, as shown in Fig. 2, the raw data serve as input. If the raw data are already a matrix composed of attributes and records, the step of converting them to a matrix can be omitted.
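The conversion step can be illustrated with a short Python sketch. It follows the description of getMatrix given in the claims (an attribute that a record contains is 1, otherwise 0, giving an n × m matrix), but the function name get_matrix, its two-argument signature, and the representation of a record as a set of attribute names are assumptions made for this example only.

```python
def get_matrix(records, attributes):
    """Build an n x m 0/1 matrix: row i is attribute i, column j is record j."""
    return [[1 if attr in rec else 0 for rec in records] for attr in attributes]


# Hypothetical raw data: three records, each a set of attribute names.
records = [{"a", "b"}, {"b"}, {"a", "c"}]
attributes = ["a", "b", "c"]
matrix = get_matrix(records, attributes)
```

Here matrix has one row per attribute, which matches the row-wise mean taken later by mean(B, 2).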
After the matrix has been generated, the next step computes the information entropy H(i) of each attribute and compares it with the threshold et (whose value is chosen according to the specific application). Attributes whose entropy exceeds the threshold are retained, and the resulting matrix A is the input to the next step.
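A minimal Python sketch of this screening step follows. The patent text does not state the entropy formula, so the sketch assumes the Shannon entropy of each attribute's empirical value distribution, in bits; the names entropy and screen_by_entropy and the threshold value 0.5 are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def screen_by_entropy(matrix, et):
    """Keep the rows (attributes) whose entropy is greater than et."""
    kept = [i for i, row in enumerate(matrix) if entropy(row) > et]
    return kept, [matrix[i] for i in kept]


# Three attributes over four records; the second attribute is constant.
matrix = [[1, 0, 1, 0],
          [1, 1, 1, 1],
          [1, 0, 0, 0]]
kept, A = screen_by_entropy(matrix, et=0.5)
```

A constant attribute has zero entropy and is dropped for any non-negative threshold, which is the intuition behind using entropy to discard uninformative attributes before PCA.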
The data screened by information entropy then enter the PCA flow. The samples are first centred, and the covariance between the different attributes is computed to form the covariance matrix. The eigenvalues and eigenvectors of the covariance matrix are then computed, and the eigenvalue contribution rate f (which characterizes the proportion of the original data information accounted for by the retained principal components) is used to determine k, and hence the number of principal components. The k eigenvectors corresponding to the k largest eigenvalues are taken as the transformation basis, and the original data are reduced to the result Y_{k×m} for subsequent analysis and calculation.
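How k might follow from f can be sketched in Python. The patent text does not give the formula, so this sketch assumes the common convention that k is the smallest number of leading eigenvalues whose sum reaches the fraction f of the total; the name choose_k and the sample eigenvalues are hypothetical.

```python
def choose_k(eigenvalues, f):
    """Smallest k whose top-k eigenvalue sum is at least f of the total.

    Assumes the eigenvalues are non-negative with a positive sum.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, lam in enumerate(ordered, start=1):
        running += lam
        if running / total >= f:
            return k
    return len(ordered)


# Made-up eigenvalues; with f = 0.8 the two largest (4.0 and 2.5) suffice.
k = choose_k([0.5, 4.0, 1.0, 2.5], f=0.8)
```

Under this convention a larger f keeps more principal components and so preserves more of the original information at a higher cost.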
Although illustrative embodiments of the invention are described above so that those skilled in the art may understand the invention, it should be clear that the invention is not limited to the scope of those embodiments. To those of ordinary skill in the art, variations are obvious so long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and all inventions and creations that make use of the inventive concept are protected.

Claims (2)

1. Information entropy measures the amount of information, and dimension reduction is an effective way to solve the four major difficulties faced when high-dimensional data are processed directly; analysing ultra-high-dimensional data makes dimension reduction all the more necessary. An ultra-high-dimensional data dimension reduction algorithm based on information entropy is characterized by the following:
// Input: the data matrix U_{n×m} to be reduced (or raw data rdata not in matrix form),
the information entropy threshold et,
the eigenvalue contribution rate f
// Output: the result Y_{k×m} of reducing the dimension of the original data
/* getMatrix function: converts the raw data rdata to matrix form; an attribute that a record contains is set to 1 and an attribute it lacks is set to 0, giving an n × m 0/1 matrix; used when the raw data are not in matrix form */
Matrix = getMatrix(rdata)
/* getEntropy function: computes the information entropy of attribute α_i */
H(i) = getEntropy(α_i)
/* randomSplit function: randomly draws records from the matrix-form data Matrix, in the corresponding proportion, as the training set; the remainder is the test set */
[B, C] = randomSplit(Matrix)
/* eig function: computes the eigenvalues and eigenvectors of a matrix */
[eigenVe, eigenVa] = eig(Cov)
/* variable f: the eigenvalue contribution rate, which characterizes the proportion of the original data information accounted for by the principal components */
.
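The randomSplit function named in claim 1 could be sketched in Python as below. The claim only says that records are drawn at random in a given proportion, so the explicit train_ratio and seed parameters, the name random_split, and the choice to keep the drawn columns in their original order are all assumptions of this sketch.

```python
import random


def random_split(matrix, train_ratio, seed=None):
    """Randomly split the columns (records) into a training and a test matrix."""
    rng = random.Random(seed)
    m = len(matrix[0])
    cols = list(range(m))
    rng.shuffle(cols)
    n_train = round(train_ratio * m)
    train_cols = sorted(cols[:n_train])
    test_cols = sorted(cols[n_train:])
    B = [[row[j] for j in train_cols] for row in matrix]
    C = [[row[j] for j in test_cols] for row in matrix]
    return B, C


# Two attributes, five records; 60% of the records go to the training set.
matrix = [[1, 0, 1, 0, 1],
          [0, 1, 1, 0, 0]]
B, C = random_split(matrix, train_ratio=0.6, seed=0)
```

Splitting by column keeps every attribute in both B and C, so the basis computed from B can be applied to C without reindexing.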
2. Combining the features of claim 1, the dimension reduction procedure for ultra-high-dimensional data is expressed by the file testepca.m:
/* testepca.m: the main dimension reduction program; applies PCA dimension reduction to the data screened by information entropy and outputs the dimension reduction result */
1) Matrix = getMatrix(rdata)
2) Compute the information entropy of each attribute and screen the attributes
3) Split the data matrix
[B, C] = randomSplit(Matrix) // B is the training set, C is the test set
4) Centre the sample matrix B, i.e. subtract from each dimension the mean of that dimension (m1 is the number of columns, i.e. records, of B)
X = B - repmat(mean(B, 2), 1, m1)
5) Compute the covariance between the different dimensions to form the covariance matrix:
Cov = (X * X^T) / size(X, 2)
6) Compute the eigenvectors eigenVector and the eigenvalues eigenValue of the covariance matrix
[eigenVector, eigenValue] = eig(Cov)
7) Select the k eigenvectors corresponding to the k largest eigenvalues as column vectors to form the eigenvector matrix V_{n×k}; k is determined by f.
8) Compute the dimension reduction result: Y = V^T * X
9) Centre matrix C to obtain X0
10) Compute the result: Y0 = V^T * X0
The dimension reduction results Y and Y0 above can be used for subsequent analysis and calculation.
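The projection in steps 8) and 10) is a single matrix product, sketched here in plain Python. The helper names transpose, matmul and project are hypothetical, and the basis V is picked by hand for this toy example (it is the leading eigenvector of the toy data's covariance matrix) rather than computed by an eigensolver.

```python
def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    """Plain triple-loop matrix product A * B."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def project(V, X):
    """Y = V^T * X, with V of size n x k and X row-centred of size n x m."""
    return matmul(transpose(V), X)


# Row-centred toy data: 2 attributes, 3 records.
X = [[-1.0, 0.0, 1.0],
     [-2.0, 0.0, 2.0]]
s = 5 ** 0.5
V = [[1 / s], [2 / s]]  # unit vector along (1, 2), so k = 1
Y = project(V, X)
```

The result Y has k rows and m columns, matching the Y_{k×m} output named in claim 1; the same call with X0 in place of X gives Y0.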

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201610810509.1A | 2016-09-08 | 2016-09-08 | Ultra-high-dimensional data dimension reduction algorithm based on information entropy

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201610810509.1A | 2016-09-08 | 2016-09-08 | Ultra-high-dimensional data dimension reduction algorithm based on information entropy

Publications (1)

Publication Number | Publication Date
CN106407363A (en) | 2017-02-15

Family

ID=57998945

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN201610810509.1A | Pending | CN106407363A (en) | 2016-09-08 | 2016-09-08 | Ultra-high-dimensional data dimension reduction algorithm based on information entropy

Country Status (1)

Country | Link
CN (1) | CN106407363A (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106919677A (en) * 2017-02-25 2017-07-04 浙江沛宏网络科技有限公司 One kind is based on third-party user data statistical method and system
CN108561127A (en) * 2018-03-26 2018-09-21 上海电力学院 A kind of Formation pressure prediction method based on stochastic simulation
CN108561127B (en) * 2018-03-26 2022-04-01 上海电力学院 Stratum pressure prediction method based on random simulation
CN108828590A (en) * 2018-07-03 2018-11-16 南京信息工程大学 A kind of low complex degree entropy extension through-wall radar imaging method
CN109241231A (en) * 2018-09-07 2019-01-18 武汉中海庭数据技术有限公司 The accurately pretreatment unit and method of diagram data
CN109446476B (en) * 2018-09-27 2020-07-14 清华大学 Multi-mode sensor information decoupling method
CN109446476A (en) * 2018-09-27 2019-03-08 清华大学 A kind of multimodal sensor information decoupling method
CN110007989A (en) * 2018-12-13 2019-07-12 国网信通亿力科技有限责任公司 Data visualization platform system
CN111948736A (en) * 2019-05-14 2020-11-17 中国电力科学研究院有限公司 High-dimensional weather forecast data dimension reduction method based on big data platform
CN110334546B (en) * 2019-07-08 2021-11-23 辽宁工业大学 Difference privacy high-dimensional data release protection method based on principal component analysis optimization
CN110334546A (en) * 2019-07-08 2019-10-15 辽宁工业大学 Difference privacy high dimensional data based on principal component analysis optimization issues guard method
CN112241748A (en) * 2019-07-16 2021-01-19 广州汽车集团股份有限公司 Data dimension reduction method and device based on multi-source information entropy difference
CN110501917A (en) * 2019-09-11 2019-11-26 智慧谷(厦门)物联科技有限公司 The system and method for realizing internet of things intelligent household information management using cloud computing
CN110705276A (en) * 2019-09-26 2020-01-17 中电万维信息技术有限责任公司 Method, device and storage medium for monitoring network public sentiment based on neural network
CN111984466A (en) * 2020-07-30 2020-11-24 苏州浪潮智能科技有限公司 ICC-based data consistency inspection method and system
CN111984466B (en) * 2020-07-30 2022-10-25 苏州浪潮智能科技有限公司 ICC-based data consistency inspection method and system
CN112288016A (en) * 2020-10-30 2021-01-29 上海淇玥信息技术有限公司 Channel anti-cheating method and device based on principal component analysis algorithm and electronic equipment
CN112288016B (en) * 2020-10-30 2023-10-31 上海淇玥信息技术有限公司 Channel anti-cheating method and device based on principal component analysis algorithm and electronic equipment
CN112464154A (en) * 2020-11-27 2021-03-09 中国船舶重工集团公司第七0四研究所 Method for automatically screening effective features based on unsupervised learning
CN112464154B (en) * 2020-11-27 2024-03-01 中国船舶重工集团公司第七0四研究所 Method for automatically screening effective features based on unsupervised learning
CN116434950A (en) * 2023-06-05 2023-07-14 山东建筑大学 Diagnosis system for autism spectrum disorder based on data clustering and ensemble learning
CN116434950B (en) * 2023-06-05 2023-08-29 山东建筑大学 Diagnosis system for autism spectrum disorder based on data clustering and ensemble learning
CN117520824A (en) * 2024-01-03 2024-02-06 浙江省白马湖实验室有限公司 Information entropy-based distributed optical fiber data characteristic reconstruction method

Similar Documents

Publication Publication Date Title
CN106407363A (en) Ultra-high-dimensional data dimension reduction algorithm based on information entropy
CN105354595A (en) Robust visual image classification method and system
US20140099033A1 (en) Fast computation of kernel descriptors
CN103177265B (en) High-definition image classification method based on kernel function Yu sparse coding
Zhao et al. Local quantization code histogram for texture classification
Langone et al. Soft kernel spectral clustering
CN107679539B (en) Single convolution neural network local information and global information integration method based on local perception field
Schettino et al. Income polarization in the USA: What happened to the middle class in the last few decades?
CN109711442A (en) Unsupervised layer-by-layer generation fights character representation learning method
CN109522953A (en) The method classified based on internet startup disk algorithm and CNN to graph structure data
Van den Broeck et al. Learning in feedforward Boolean networks
CN111931867A (en) New coronary pneumonia X-ray image classification method and system based on lightweight model
CN110851627A (en) Method for describing sun black subgroup in full-sun image
Saha et al. Matrix compression via randomized low rank and low precision factorization
Li et al. Weight‐Selected Attribute Bagging for Credit Scoring
CN106557783A (en) A kind of automatic extracting system and method for caricature dominant role
Cheng et al. Adaptive matching of kernel means
CN107506744A (en) Represent to retain based on local linear and differentiate embedded face identification method
CN108280511A (en) A method of network access data is carried out based on convolutional network and is handled
Chen et al. Feature coding for image classification combining global saliency and local difference
CN107563334A (en) Based on the face identification method for differentiating linear expression retaining projection
Sassi et al. A methodology using neural network to cluster validity discovered from a marketing database
Li et al. [Retracted] Multiobject Detection Algorithm Based on Adaptive Default Box Mechanism
Zhang et al. Sparse eigenfaces analysis for recognition
CN104200510B (en) Vector quantization compression object plotting method based on target CF

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170215