CN110390358A - Deep learning method based on feature clustering - Google Patents
Deep learning method based on feature clustering
- Publication number
- CN110390358A CN110390358A CN201910665812.0A CN201910665812A CN110390358A CN 110390358 A CN110390358 A CN 110390358A CN 201910665812 A CN201910665812 A CN 201910665812A CN 110390358 A CN110390358 A CN 110390358A
- Authority
- CN
- China
- Prior art keywords
- feature variable
- clustering
- data
- correlation coefficient
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
- G06F18/2113—Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a deep learning method based on feature clustering, comprising the following steps: selecting feature variables from a specific data set; preprocessing the selected feature variables; computing the correlation coefficients between the feature variables and screening out the highly correlated ones with a custom function; extracting the principal components of the feature variables; building a network graph structure from the extracted principal components; clustering that structure; and guiding the configuration of a neural network with the clustering result. By preprocessing the selected feature variables, data scaling removes differences between samples in attributes such as characteristics and order of magnitude, while dimensionality reduction maps the samples into a low-dimensional space where they can be displayed, so that the most suitable clustering method can later be chosen from the observed shape of the data, improving the accuracy of feature clustering. Screening the highly correlated feature variables with a custom function addresses the problem of low correlation between the variables selected in cluster analysis.
Description
Technical field
The present invention relates to the field of machine learning, and in particular to a deep learning method based on feature clustering.
Background art
In the deep learning field, the mainstream architectures are DNN, RNN and CNN. A DNN is a fully connected neural network and a general-purpose deep learning method. An RNN is a recurrent neural network, likewise a fully connected structure, used mainly where there is temporal context, such as in NLP. A CNN is a convolutional neural network, characterised by local connectivity based on spatial correlation, and is used mainly in image processing. The strengths and weaknesses of these three mainstream architectures are by now clear: the locally correlated connectivity of a CNN saves a large amount of parameter storage and computation, whereas a DNN ignores feature correlation and fully connects all features, creating heavy computation and storage pressure and wasting calculation on connections between many unrelated features; an RNN has similar problems.
For images and other data sets whose features are spatially correlated, a CNN can be applied directly. For data without image-like local spatial correlation, however, applying a CNN directly does not work well, while applying a DNN directly incurs the cost of connecting and storing parameters for a large number of uncorrelated features. The present invention therefore proposes a deep learning method based on feature clustering to remedy these shortcomings of the prior art.
Summary of the invention
In view of the above problems, the present invention proposes a deep learning method based on feature clustering. Preprocessing the selected feature variables with data scaling removes differences between samples in attributes such as characteristics and order of magnitude; dimensionality reduction maps the samples into a low-dimensional space where they can be displayed, so that the most suitable clustering method can later be chosen from the observed shape of the data, improving the accuracy of feature clustering; and screening the highly correlated feature variables with a custom function addresses the problem of low correlation between the variables selected in cluster analysis.
The present invention proposes a deep learning method based on feature clustering, comprising the following steps:
Step 1: based on a specific data set, select the most important feature variables from it;
Step 2: preprocess the selected feature variables, including data scaling, data transformation and dimensionality reduction;
Step 3: compute the correlation coefficients between the feature variables, take the correlation coefficient as the similarity measure, and screen out the highly correlated feature variables with a custom function;
Step 4: based on the correlation coefficients between the feature variables, extract the principal components of the feature variables;
Step 5: build a network graph structure from the extracted principal components;
Step 6: cluster the network graph structure, grouping the highly correlated feature variables into the same cluster, to obtain the clustering result;
Step 7: guide the configuration of the neural network with the obtained clustering result.
In a further refinement, the feature variables in step 1 may be chosen by any one of: correlation, the Gini coefficient, information entropy, statistical tests, or random forests.
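The correlation option above can be sketched in a few lines. A minimal NumPy-only illustration (the function name `select_top_features` and the synthetic data are my own, not the patent's):

```python
import numpy as np

def select_top_features(X, y, k):
    """Rank features by absolute Pearson correlation with the target, keep top k."""
    Xc = X - X.mean(axis=0)            # centre each feature column
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    order = np.argsort(-np.abs(corr))  # most correlated first
    return order[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] + 0.1 * rng.normal(size=200)  # feature 2 drives the target
top = select_top_features(X, y, 2)
```

Any of the other listed criteria (Gini coefficient, information entropy, statistical tests, random forest importance) could replace the correlation score in the ranking.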
In a further refinement, the data scaling in step 2 proceeds as follows: the acquired feature variables are rescaled proportionally, compressing the converted values into the interval (0, 1).
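A minimal sketch of this proportional rescaling (standard min-max scaling; the small `eps` that keeps the top value strictly below 1 is my assumption about how the open interval (0, 1) is enforced):

```python
import numpy as np

def min_max_scale(X, eps=1e-9):
    """Proportionally rescale each feature column into (0, 1)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo + eps)   # eps keeps the maximum just under 1

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
Xs = min_max_scale(X)
```

After scaling, every feature lies on the same order of magnitude regardless of its original units, which is the stated purpose of this step.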
In a further refinement, the data transformation in step 2 uses either the discrete Fourier transform or the discrete wavelet transform.
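For the discrete-Fourier-transform option, a sketch using NumPy's FFT. Representing each sample by the magnitudes of its leading spectral coefficients is one plausible reading of "data transformation"; the specifics here are illustrative, not the patent's:

```python
import numpy as np

def dft_features(signal, n_coeffs=8):
    """Describe a 1-D signal by the magnitudes of its leading DFT coefficients."""
    spectrum = np.fft.rfft(signal)
    return np.abs(spectrum[:n_coeffs])

t = np.linspace(0.0, 1.0, 64, endpoint=False)
signal = np.sin(2 * np.pi * 3 * t)   # a pure 3 Hz tone
feats = dft_features(signal)
```

The dominant coefficient lands in the bin matching the tone's frequency, so the transformed representation captures the signal's structure in far fewer variables.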
In a further refinement, the process of screening highly correlated feature variables with a custom function in step 3 is: first compute the correlation coefficient matrix and pick out the feature variables whose correlation coefficients exceed a preset value; set those correlation coefficients to 1 and mark the variables as target feature variables, setting non-target feature variables to 0; then find the qualifying feature variables and delete the highly correlated ones.
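The patent leaves the custom function itself unspecified. One plausible reading, which binarises the correlation matrix at the preset value (1 for a highly correlated pair, 0 otherwise) and greedily deletes the later member of each flagged pair, is sketched below; the threshold and data are illustrative:

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Binarise the correlation matrix at `threshold` and drop the later
    column of every flagged (highly correlated) pair."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    flags = (corr > threshold).astype(int)   # 1 = target pair, 0 = non-target
    keep = []
    for j in range(corr.shape[1]):
        if all(flags[j, i] == 0 for i in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = a + 0.01 * rng.normal(size=300)   # near-duplicate of a
c = rng.normal(size=300)
keep = drop_correlated(np.column_stack([a, b, c]))
```

Here the near-duplicate column is removed while the two genuinely distinct columns survive.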
In a further refinement, the process of extracting the principal components of the feature variables in step 4 is:
S1: input the feature-variable data set and mean-centre it;
S2: compute the covariance matrix;
S3: obtain the eigenvalues and eigenvectors of the covariance matrix by eigendecomposition;
S4: sort the eigenvalues in descending order, select the largest k, and take the corresponding k eigenvectors as row vectors to form the eigenvector matrix;
S5: finally, transform the data into the new space spanned by the k eigenvectors.
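Steps S1 to S5 describe classical principal component analysis via eigendecomposition. A direct NumPy transcription (k = 2 and the random data are illustrative):

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)             # S1: mean-centre the data set
    cov = np.cov(Xc, rowvar=False)      # S2: covariance matrix
    vals, vecs = np.linalg.eigh(cov)    # S3: eigenvalues and eigenvectors
    order = np.argsort(vals)[::-1][:k]  # S4: largest k eigenvalues ...
    W = vecs[:, order].T                # ... their eigenvectors as row vectors
    return Xc @ W.T                     # S5: project into the new k-D space

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
Z = pca(X, 2)
```

The projected components are mutually uncorrelated, since distinct eigenvectors of the sample covariance matrix are orthogonal.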
In a further refinement, the clustering of the network graph structure in step 6 may use either hierarchical clustering or the K-means clustering algorithm.
In a further refinement, when hierarchical clustering is used in step 6, the previous clustering result serves as the node for the next round of clustering.
In a further refinement, when the K-means clustering algorithm is used in step 6, the detailed process is:
T1: first determine the number of clusters, then choose K samples from the data set of the network graph structure as cluster centres; compute the Euclidean distance from each cluster centre to the other samples and assign each sample to the class of its nearest cluster centre, obtaining an initial clustering result;
T2: compute the mean of all samples in each cluster of the initial result, take these means as the new cluster centres, and repeat the operation of T1;
T3: repeat until the cluster centres no longer move, completing the clustering.
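Steps T1 to T3 are the standard Lloyd iteration. A compact NumPy sketch (the `init_idx` parameter, added so the demonstration below is deterministic, is my addition; by default the centres are random samples as T1 prescribes):

```python
import numpy as np

def k_means(X, k, init_idx=None, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    if init_idx is None:
        # T1: choose K samples from the data set as initial cluster centres
        init_idx = rng.choice(len(X), size=k, replace=False)
    centres = X[np.asarray(init_idx)]
    for _ in range(n_iter):
        # T1 (cont.): Euclidean distance to each centre; nearest centre wins
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # T2: the mean of each cluster's samples becomes the new centre
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # T3: stop once the centres no longer move
        if np.allclose(new, centres):
            break
        centres = new
    return labels, centres

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(20, 2)),          # blob around (0, 0)
               rng.normal(size=(20, 2)) + 10.0])  # blob around (10, 10)
labels, centres = k_means(X, 2, init_idx=[0, 20])
```

With one seed point drawn from each blob, the iteration recovers the two well-separated groups.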
The invention has the following benefits: by preprocessing the selected feature variables, data scaling removes differences between samples in attributes such as characteristics and order of magnitude, ensuring that every sample's feature values lie on the same order of magnitude; dimensionality reduction maps the samples into a low-dimensional space where they can be displayed, so that the most suitable clustering method can later be chosen from the observed shape of the data, improving the accuracy of feature clustering; screening the highly correlated feature variables with a custom function addresses the problem of low correlation between the variables selected in cluster analysis; and the method's computation and storage pressure are low, reducing the training time of the deep learning model and raising efficiency.
Specific embodiment
To deepen understanding, the invention is further described below in conjunction with an embodiment. The embodiment serves only to explain the invention and does not limit its scope.
A deep learning method based on feature clustering, comprising the following steps:
Step 1: based on a specific data set, select the most important feature variables from it; in this embodiment the feature variables are chosen by the correlation method;
Step 2: preprocess the selected feature variables, including data scaling, data transformation and dimensionality reduction; the data scaling rescales the acquired feature variables proportionally, compressing the converted values into the interval (0, 1), and the data transformation uses the discrete wavelet transform;
Step 3: compute the correlation coefficients between the feature variables, take the correlation coefficient as the similarity measure, and screen out the highly correlated feature variables with a custom function: first compute the correlation coefficient matrix and pick out the feature variables whose correlation coefficients exceed a preset value; set those correlation coefficients to 1 and mark the variables as target feature variables, setting non-target feature variables to 0; then find the qualifying feature variables and delete the highly correlated ones;
Step 4: based on the correlation coefficients between the feature variables, extract the principal components of the feature variables, as follows:
S1: input the feature-variable data set and mean-centre it;
S2: compute the covariance matrix;
S3: obtain the eigenvalues and eigenvectors of the covariance matrix by eigendecomposition;
S4: sort the eigenvalues in descending order, select the largest k, and take the corresponding k eigenvectors as row vectors to form the eigenvector matrix;
S5: finally, transform the data into the new space spanned by the k eigenvectors;
Step 5: build a network graph structure from the extracted principal components;
Step 6: cluster the network graph structure with the K-means clustering algorithm, grouping the highly correlated feature variables into the same cluster to obtain the clustering result; the detailed process is:
T1: first determine the number of clusters, then choose K samples from the data set of the network graph structure as cluster centres; compute the Euclidean distance from each cluster centre to the other samples and assign each sample to the class of its nearest cluster centre, obtaining an initial clustering result;
T2: compute the mean of all samples in each cluster of the initial result, take these means as the new cluster centres, and repeat the operation of T1;
T3: repeat until the cluster centres no longer move, completing the clustering;
Step 7: guide the configuration of the neural network with the obtained clustering result.
As noted in the summary, the clustering of the network graph structure in step 6 may use either hierarchical clustering or the K-means clustering algorithm; this embodiment uses K-means.
By preprocessing the selected feature variables, data scaling removes differences between samples in attributes such as characteristics and order of magnitude, ensuring that every sample's feature values lie on the same order of magnitude; dimensionality reduction maps the samples into a low-dimensional space where they can be displayed, so that the most suitable clustering method can later be chosen from the observed shape of the data, improving the accuracy of feature clustering; screening the highly correlated feature variables with a custom function addresses the problem of low correlation between the variables selected in cluster analysis; and the method's computation and storage pressure are low, reducing the training time of the deep learning model and raising efficiency.
The basic principles, main features and advantages of the invention have been shown and described above. Those skilled in the art should understand that the invention is not limited to the above embodiment, which, together with the description, merely illustrates its principles. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the protection scope of the claimed invention, which is defined by the appended claims and their equivalents.
Claims (9)
1. A deep learning method based on feature clustering, characterised by comprising the following steps:
Step 1: based on a specific data set, select the most important feature variables from it;
Step 2: preprocess the selected feature variables, including data scaling, data transformation and dimensionality reduction;
Step 3: compute the correlation coefficients between the feature variables, take the correlation coefficient as the similarity measure, and screen out the highly correlated feature variables with a custom function;
Step 4: based on the correlation coefficients between the feature variables, extract the principal components of the feature variables;
Step 5: build a network graph structure from the extracted principal components;
Step 6: cluster the network graph structure, grouping the highly correlated feature variables into the same cluster, to obtain the clustering result;
Step 7: guide the configuration of the neural network with the obtained clustering result.
2. The deep learning method based on feature clustering according to claim 1, characterised in that the feature variables in step 1 may be chosen by any one of: correlation, the Gini coefficient, information entropy, statistical tests, or random forests.
3. The deep learning method based on feature clustering according to claim 1, characterised in that the data scaling in step 2 proceeds as follows: the acquired feature variables are rescaled proportionally, compressing the converted values into the interval (0, 1).
4. The deep learning method based on feature clustering according to claim 1, characterised in that the data transformation in step 2 uses either the discrete Fourier transform or the discrete wavelet transform.
5. The deep learning method based on feature clustering according to claim 1, characterised in that the process of screening highly correlated feature variables with a custom function in step 3 is: first compute the correlation coefficient matrix and pick out the feature variables whose correlation coefficients exceed a preset value; set those correlation coefficients to 1 and mark the variables as target feature variables, setting non-target feature variables to 0; then find the qualifying feature variables and delete the highly correlated ones.
6. The deep learning method based on feature clustering according to claim 1, characterised in that the process of extracting the principal components of the feature variables in step 4 is:
S1: input the feature-variable data set and mean-centre it;
S2: compute the covariance matrix;
S3: obtain the eigenvalues and eigenvectors of the covariance matrix by eigendecomposition;
S4: sort the eigenvalues in descending order, select the largest k, and take the corresponding k eigenvectors as row vectors to form the eigenvector matrix;
S5: finally, transform the data into the new space spanned by the k eigenvectors.
7. The deep learning method based on feature clustering according to claim 1, characterised in that the clustering of the network graph structure in step 6 may use either hierarchical clustering or the K-means clustering algorithm.
8. The deep learning method based on feature clustering according to claim 7, characterised in that when hierarchical clustering is used in step 6, the previous clustering result serves as the node for the next round of clustering.
9. The deep learning method based on feature clustering according to claim 6, characterised in that when the K-means clustering algorithm is used in step 6, the detailed process is:
T1: first determine the number of clusters, then choose K samples from the data set of the network graph structure as cluster centres; compute the Euclidean distance from each cluster centre to the other samples and assign each sample to the class of its nearest cluster centre, obtaining an initial clustering result;
T2: compute the mean of all samples in each cluster of the initial result, take these means as the new cluster centres, and repeat the operation of T1;
T3: repeat until the cluster centres no longer move, completing the clustering.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910665812.0A CN110390358A (en) | 2019-07-23 | 2019-07-23 | A kind of deep learning method based on feature clustering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910665812.0A CN110390358A (en) | 2019-07-23 | 2019-07-23 | A kind of deep learning method based on feature clustering |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110390358A (en) | 2019-10-29 |
Family
ID=68287222
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910665812.0A | A kind of deep learning method based on feature clustering (pending) | 2019-07-23 | 2019-07-23 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110390358A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111338897A (en) * | 2020-02-24 | 2020-06-26 | 京东数字科技控股有限公司 | Identification method of abnormal node in application host, monitoring equipment and electronic equipment |
CN112043252A (en) * | 2020-10-10 | 2020-12-08 | 山东大学 | Emotion recognition system and method based on respiratory component in pulse signal |
CN113257365A (en) * | 2021-05-26 | 2021-08-13 | 南开大学 | Clustering method and system for non-standardized single cell transcriptome sequencing data |
TWI752485B (en) * | 2019-11-14 | 2022-01-11 | 大陸商支付寶(杭州)信息技術有限公司 | User clustering and feature learning method, device, and computer-readable medium |
CN116955117A (en) * | 2023-09-18 | 2023-10-27 | 深圳市艺高智慧科技有限公司 | Computer radiator performance analysis system based on data visualization enhancement |
CN116955117B (en) * | 2023-09-18 | 2023-12-22 | 深圳市艺高智慧科技有限公司 | Computer radiator performance analysis system based on data visualization enhancement |
- 2019-07-23: application CN201910665812.0A filed in China; published as CN110390358A; legal status Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110390358A (en) | A kind of deep learning method based on feature clustering | |
CN110348399B (en) | Hyperspectral intelligent classification method based on prototype learning mechanism and multidimensional residual error network | |
Yu et al. | Mixed pooling for convolutional neural networks | |
CN106960214A (en) | Object identification method based on image | |
CN109993100B (en) | Method for realizing facial expression recognition based on deep feature clustering | |
JP4618098B2 (en) | Image processing system | |
CN109948647A (en) | A kind of electrocardiogram classification method and system based on depth residual error network | |
CN110353694B (en) | Motion recognition method based on feature selection | |
CN104392250A (en) | Image classification method based on MapReduce | |
CN111105160A (en) | Steel quality prediction method based on tendency heterogeneous bagging algorithm | |
CN109711401A (en) | A kind of Method for text detection in natural scene image based on Faster Rcnn | |
CN109325510B (en) | Image feature point matching method based on grid statistics | |
CN110334777A (en) | A kind of unsupervised attribute selection method of weighting multi-angle of view | |
CN101833667A (en) | Pattern recognition classification method expressed based on grouping sparsity | |
CN109344898A (en) | Convolutional neural networks image classification method based on sparse coding pre-training | |
CN109711442A (en) | Unsupervised layer-by-layer generation confrontation feature representation learning method | |
CN109447153A (en) | Divergence-excitation self-encoding encoder and its classification method for lack of balance data classification | |
Guan et al. | Defect detection and classification for plain woven fabric based on deep learning | |
Andrews et al. | Fast scalable and accurate discovery of dags using the best order score search and grow shrink trees | |
CN111666999A (en) | Remote sensing image classification method | |
CN116776245A (en) | Three-phase inverter equipment fault diagnosis method based on machine learning | |
Çakmak | Grapevine Leaves Classification Using Transfer Learning and Fine Tuning | |
CN116051924B (en) | Divide-and-conquer defense method for image countermeasure sample | |
CN110222553A (en) | A kind of recognition methods again of the Multi-shot pedestrian based on rarefaction representation | |
CN113723281A (en) | High-resolution image classification method based on local adaptive scale ensemble learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2019-10-29 |