CN109919227A - A density peaks clustering method for mixed-attribute data sets - Google Patents
A density peaks clustering method for mixed-attribute data sets
- Publication number: CN109919227A
- Application number: CN201910171730.0A
- Authority: CN (China)
- Prior art keywords: data, value, data set, distance, indicate
- Prior art date: 2019-03-07
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Landscapes: Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a density peaks clustering method for mixed-attribute data sets, and belongs to the field of data mining. The method specifically comprises: S1: obtaining data and preprocessing it; S2: calculating the mixed-attribute distance between sample points; S3: obtaining the cluster center points by a twofold residual analysis; S4: iteratively updating with a genetic algorithm according to a fitness function to obtain the optimal cut-off distance; S5: assigning each remaining sample point to the cluster of its nearest higher-density neighbor, thereby completing the clustering. The invention improves the calculation of the mixed-attribute distance between data sample points, so that the density peaks clustering algorithm becomes applicable to mixed-attribute data sets.
Description
Technical field
The invention belongs to the field of data mining, and relates to a density peaks clustering method for mixed-attribute data sets.
Background technique
With the arrival of the era of big data and artificial intelligence, clustering, as an important data mining technique, has attracted increasing attention and is widely used in numerous fields, including pattern recognition, medical diagnosis, knowledge discovery, and biomedicine.
Most data sets generated in the real world are mixed-attribute data sets that contain both numeric attribute features and categorical attribute features. A numeric attribute takes ordered numerical values, while a categorical attribute takes discrete values that represent categories or states. Existing clustering algorithms for mixed-attribute data sets fall broadly into two classes: partition-based and hierarchy-based algorithms. Partition-based methods are relatively simple to implement and have low time complexity, but their drawbacks are evident: the number of clusters must be given in advance, clusters of arbitrary shape cannot be found, and the methods are sensitive to outliers. Hierarchy-based methods do not require the number of clusters in advance and allow it to be set subjectively, but they must store a similarity matrix and therefore suffer from high time and space complexity.
On " Science " in 2014, Rodriguez and Laio propose one kind based on density (density-
Based density peaks clustering algorithm (density peaks clustering algorithm)).Density peaks clustering algorithm
Have the advantages that it is more, it is simple and efficiently, can identify the class cluster of arbitrary shape and not need to give class cluster number in advance, together
When without the concern for data set probability-distribution function, performance also do not influenced by data space dimension, has the lower time
And space complexity.But there is also some shortcomings for the algorithm: algorithm is just for Numeric Attributes data set, not to mixed
Mould assembly attribute data collection is analyzed;The identification of class cluster central point and noise spot is general, and it is accurate not provide a kind of comparison
Recognition methods;The Clustering Effect of algorithm very relies on the selection of truncation distance, and the selection that distance is truncated is rule of thumb to select
It takes, it is personal subjective, it often will cause the assignment error of sample point.
In view of the above deficiencies, the present invention proposes a kind of density peaks clustering method towards mixed attributes data set.
Summary of the invention
In view of this, the object of the present invention is to provide a density peaks clustering method for mixed-attribute data sets, so as to solve the problems of existing mixed-attribute clustering algorithms, namely that the number of clusters must be given in advance, clusters of arbitrary shape cannot be found, the algorithms are sensitive to outliers, and the time and space complexity is high. At the same time, a twofold residual analysis is introduced to obtain the cluster center points, and a genetic algorithm is used to obtain the optimal cut-off distance, thereby solving the problems of the density peaks clustering algorithm that cluster centers and noise points are poorly identified and that a subjectively set cut-off distance causes sample points to be assigned incorrectly.
In order to achieve the above object, the invention provides the following technical scheme:
A density peaks clustering method for mixed-attribute data sets, specifically comprising the following steps:
S1: obtaining data and preprocessing it;
S2: calculating the mixed-attribute distance between sample points;
S3: obtaining the cluster center points by a twofold residual analysis;
S4: iteratively updating with a genetic algorithm according to a fitness function to obtain the optimal cut-off distance;
S5: assigning each remaining sample point to the cluster of its nearest higher-density neighbor, thereby completing the clustering.
Further, the step S1 specifically comprises the following steps:
S11: obtaining a mixed-attribute data set D with N data sample points and M attribute features, of which T are numeric attribute features and U are categorical attribute features, i.e. M = T + U; Dom(Ci) denotes the value set of categorical attribute feature Ci, and Dom(Ci) = {ci,1, ci,2, ..., ci,f} indicates that attribute Ci has f distinct values;
S12: filling in or deleting the null values, illegal values, inconsistent data, and duplicate records in the mixed-attribute data set;
S13: applying min-max linear normalization to all numeric attribute features:
X'ij = (Xij − minXj) / (maxXj − minXj)
where Xj denotes the data of the j-th column, minXj and maxXj denote the minimum and maximum values in the j-th column, Xij denotes the actual value in row i, column j, and X'ij denotes the normalized value.
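For illustration, the min-max normalization of step S13 can be sketched in Python as follows; the function name and the guard for constant columns are illustrative assumptions of this sketch, not part of the disclosure:

```python
import numpy as np

def min_max_normalize(X):
    """Step S13: min-max normalize each numeric column to [0, 1] via
    X'ij = (Xij - minXj) / (maxXj - minXj)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Guard against division by zero for constant columns (assumed detail).
    rng = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / rng
```

Each column is scaled independently, so attributes with different ranges contribute comparably to the later distance computation.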
Further, the step S2 specifically comprises the following steps:
S21: calculating the numeric-attribute similarity between samples: for x, y ∈ D, the numeric-attribute similarity of x and y is computed from dr(x, y), the Euclidean distance between x and y over the numeric attributes; the numeric-attribute similarity expresses how similar two objects are on their numeric attributes, and its value range is [0, 1];
S22: calculating the support: let Ci be a categorical attribute feature; the support of attribute value ci,j (ci,j ∈ Dom(Ci)) with respect to Ci, Sup(Ci|ci,j), is the number of data objects in data set D whose attribute Ci equals ci,j;
S23: calculating the categorical-attribute similarity between samples: for x, y ∈ D, the categorical-attribute similarity of x and y is a weighted sum over the categorical attributes, where Wi denotes the weight of categorical attribute feature Ci and Hc(Ci) denotes the entropy of Ci:
Hc(Ci) = −Σj p(ci,j) log p(ci,j)
where p(ci,j) denotes the probability of attribute value ci,j in Ci, i.e. p(ci,j) = Sup(Ci|ci,j)/|D|; the categorical-attribute similarity expresses how similar two objects are on their categorical attributes, and its value range is [0, 1];
S24: calculating the mixed-attribute distance between samples: for x, y ∈ D, the overall similarity S(x, y) combines the numeric-attribute and categorical-attribute similarities, and the distance between x and y is d(x, y) = 1 − S(x, y).
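The mixed-attribute distance of step S2 can be sketched as follows. The extracted text lost the exact formula images, so the numeric similarity form Sr = 1/(1 + dr), the simple-matching categorical similarity, the entropy-normalized weights Wi, and the (T·Sr + U·Sc)/M combination are all illustrative assumptions; only the entropy definition, the support-based probability p(ci,j) = Sup/|D|, and d = 1 − S come from the text:

```python
import numpy as np
from collections import Counter

def entropy_weights(cat_data):
    """Weight Wi per categorical attribute from its entropy
    Hc(Ci) = -sum_j p(ci,j) * log p(ci,j); normalizing the entropies to
    sum to 1 is an assumed form of the (lost) weight formula."""
    cat_data = np.asarray(cat_data)
    n = cat_data.shape[0]
    ents = []
    for col in cat_data.T:
        probs = np.array(list(Counter(col).values())) / n  # p(ci,j) = Sup/|D|
        ents.append(-np.sum(probs * np.log(probs)))
    ents = np.array(ents)
    return ents / ents.sum() if ents.sum() > 0 else np.full(len(ents), 1.0 / len(ents))

def mixed_distance(x_num, y_num, x_cat, y_cat, W):
    """d(x, y) = 1 - S(x, y), where S blends an assumed numeric similarity
    Sr = 1/(1 + dr) (dr: Euclidean over normalized numeric attributes) with
    an entropy-weighted simple-matching categorical similarity Sc."""
    x_num, y_num = np.asarray(x_num, float), np.asarray(y_num, float)
    x_cat, y_cat = np.asarray(x_cat), np.asarray(y_cat)
    s_r = 1.0 / (1.0 + np.linalg.norm(x_num - y_num))
    s_c = float(np.sum(np.asarray(W) * (x_cat == y_cat)))
    T, U = x_num.size, x_cat.size
    s = (T * s_r + U * s_c) / (T + U)
    return 1.0 - s
```

Under these assumptions, identical samples get distance 0, and an attribute whose values are all equal (zero entropy) contributes no categorical weight.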
Further, the step S3 specifically comprises the following steps:
S31: calculating the local density ρi of each sample point xi, i.e. the number of sample points whose distance to xi is less than the cut-off distance dc;
S32: calculating the distance δi of each sample point xi, i.e. the minimum distance from xi to any sample point of higher local density (for the point of highest density, δi is taken as the maximum distance to any other point);
S33: calculating the weight γi of each sample point xi:
γi = ρi * δi
and sorting all sample points by weight in descending order, the rank of xi being Ri;
S34: fitting the equation y = b0 + b1x to the ρi-δi and the Ri-γi distribution plots by linear regression, and performing residual analysis on each of the two regression results to obtain the outlier points; the points that appear as outliers in both the ρi-δi and the Ri-γi plots are the cluster center points.
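Steps S31-S34 can be sketched as follows, using the standard DPC cut-off density and higher-density distance together with an ordinary least-squares residual test; the k-sigma outlier threshold is an assumed detail that the patent does not state:

```python
import numpy as np

def rho_delta_gamma(dist, d_c):
    """S31-S33: local density rho_i (number of points closer than d_c),
    distance delta_i to the nearest higher-density point (the densest
    point takes the maximum distance), and weight gamma_i = rho_i * delta_i."""
    n = dist.shape[0]
    rho = (dist < d_c).sum(axis=1) - 1          # subtract the self-distance 0
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = dist[i, higher].min() if higher.size else dist[i].max()
    return rho, delta, rho * delta

def residual_outliers(x, y, k=2.0):
    """S34: fit y = b0 + b1*x and flag points whose residual exceeds k
    standard deviations (k is an assumed threshold)."""
    b1, b0 = np.polyfit(x, y, 1)
    res = np.asarray(y, float) - (b0 + b1 * np.asarray(x, float))
    return set(np.where(np.abs(res) > k * res.std())[0])
```

The cluster centers would then be the intersection `residual_outliers(rho, delta) & residual_outliers(ranks, gamma)`, i.e. the points flagged in both plots, as step S34 requires.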
Further, the step S4 specifically comprises the following steps:
S41: defining the fitness function in terms of the between-cluster distance sum and the within-cluster distance sum, where the between-cluster term measures the distances from the cluster centers to the global center of all data points and the within-cluster term measures the distances from the cluster objects to their cluster centers after a given iteration; unk ∈ {0, 1}, and unk = 1 indicates that data point n belongs to the k-th cluster, otherwise it does not; for the global center of all data points, each numeric dimension l takes the mean of that dimension over all points, each categorical dimension l takes the most frequent category of that dimension, and ||ck|| denotes the number of data objects in the k-th cluster;
S42: setting the value range of the cut-off distance dc such that the average local density of all data objects is 1%-20% of the total number of points in the data set, giving the interval [dc_low, dc_high]; then selecting z values of dc at random to form the first generation of the dc population; performing the cluster division for each dc according to the cluster-division method of the density peaks clustering algorithm (DPC); and then evolving each generation of the dc population through the iterative process of the genetic algorithm, finally obtaining the optimal cut-off distance dcBest and the corresponding cluster center points.
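Step S42 can be sketched as a simple real-coded genetic algorithm over dc. The elitist selection, arithmetic crossover, and uniform-mutation operators, as well as the population parameters, are illustrative assumptions: the patent only specifies a random first generation in [dc_low, dc_high] and generational evolution under the fitness function:

```python
import random

def evolve_dc(fitness, dc_low, dc_high, pop_size=20, generations=50,
              mutation_rate=0.1):
    """S42 sketch: evolve a population of candidate cut-off distances d_c
    and return the fittest. `fitness(dc)` would cluster the data with this
    d_c (DPC cluster division) and score the result; here it is a caller-
    supplied callable."""
    pop = [random.uniform(dc_low, dc_high) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        elite = scored[: pop_size // 2]         # elitist selection (assumed)
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            w = random.random()
            child = w * a + (1 - w) * b         # arithmetic crossover (assumed)
            if random.random() < mutation_rate:
                child = random.uniform(dc_low, dc_high)  # uniform mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)                # dcBest
```

Because the elite half is carried over unchanged, the best dc found so far never degrades from one generation to the next.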
The beneficial effects of the present invention are: the invention improves the calculation of the mixed-attribute distance between data sample points, so that the density peaks clustering algorithm becomes applicable to mixed-attribute data sets. A twofold residual analysis is proposed to precisely identify the cluster center points and noise points. At the same time, a genetic algorithm is used to iteratively obtain the optimal cut-off distance, solving the problem that a subjectively set cut-off distance causes sample points to be assigned incorrectly.
The present invention is applicable to many fields. For example: in commerce, it helps marketing personnel discover groups with different characteristics within a customer base and describe these customer groups by their purchasing patterns; in biology, it derives the taxonomic hierarchy of animals or plants and classifies them according to gene function; in geographic information, it helps identify regions with similar land use from an earth observation database; on the internet, it recognizes the categories of documents for deeper data discovery. The invention can also be used as a data analysis tool, for obtaining the structure of data, analyzing data characteristics, and identifying interesting data categories.
Other advantages, objects, and features of the invention will be set forth to some extent in the following description, and to some extent will become apparent to those skilled in the art upon study of what follows, or may be learned from the practice of the invention. The objects and other advantages of the invention can be realized and obtained through the following specification.
Detailed description of the invention
To make the objects, technical solutions, and advantages of the present invention clearer, a detailed description of preferred embodiments of the invention is given below in conjunction with the accompanying drawings, in which:
Fig. 1 is density peaks clustering method flow chart of the present invention.
Specific embodiment
The embodiments of the present invention are illustrated below by way of specific examples, and those skilled in the art can easily understand other advantages and effects of the invention from the contents disclosed in this specification. The invention may also be implemented or applied through other, different embodiments, and the details in this specification may be modified or changed in various ways from different viewpoints and for different applications without departing from the spirit of the invention. It should be noted that the figures provided in the following embodiments only illustrate the basic concept of the invention schematically, and, in the absence of conflict, the features of the following embodiments may be combined with one another.
The drawings are for illustrative purposes only, are merely schematic, and are not pictorial diagrams; they should not be understood as limiting the invention. To better illustrate the embodiments, certain components in the drawings may be omitted, enlarged, or reduced, and do not represent the size of the actual product. It will be understood by those skilled in the art that certain well-known structures and their explanations may be omitted from the drawings.
Referring to Fig. 1, a density peaks clustering method for mixed-attribute data sets specifically comprises the following steps:
S1: data are obtained and preprocessed, specifically including:
S11: obtaining a mixed-attribute data set D with N data sample points and M attribute features, of which T are numeric attribute features and U are categorical attribute features, i.e. M = T + U; Dom(Ci) denotes the value set of categorical attribute feature Ci, and Dom(Ci) = {ci,1, ci,2, ..., ci,f} indicates that attribute Ci has f distinct values;
S12: filling in or deleting the null values, illegal values, inconsistent data, and duplicate records in the mixed-attribute data set;
S13: applying min-max linear normalization to all numeric attribute features:
X'ij = (Xij − minXj) / (maxXj − minXj)
where Xj denotes the data of the j-th column, minXj and maxXj denote the minimum and maximum values in the j-th column, Xij denotes the actual value in row i, column j, and X'ij denotes the normalized value.
S2: the mixed-attribute distance between sample points is calculated, specifically including:
S21: calculating the numeric-attribute similarity between samples: for x, y ∈ D, the numeric-attribute similarity of x and y is computed from dr(x, y), the Euclidean distance between x and y over the numeric attributes; the numeric-attribute similarity expresses how similar two objects are on their numeric attributes, and its value range is [0, 1];
S22: calculating the support: let Ci be a categorical attribute feature; the support of attribute value ci,j (ci,j ∈ Dom(Ci)) with respect to Ci, Sup(Ci|ci,j), is the number of data objects in data set D whose attribute Ci equals ci,j;
S23: calculating the categorical-attribute similarity between samples: for x, y ∈ D, the categorical-attribute similarity of x and y is a weighted sum over the categorical attributes, where Wi denotes the weight of categorical attribute feature Ci and Hc(Ci) denotes the entropy of Ci:
Hc(Ci) = −Σj p(ci,j) log p(ci,j)
where p(ci,j) denotes the probability of attribute value ci,j in Ci, i.e. p(ci,j) = Sup(Ci|ci,j)/|D|; the categorical-attribute similarity expresses how similar two objects are on their categorical attributes, and its value range is [0, 1];
S24: calculating the mixed-attribute distance between samples: for x, y ∈ D, the overall similarity S(x, y) combines the numeric-attribute and categorical-attribute similarities, and the distance between x and y is d(x, y) = 1 − S(x, y).
S3: the cluster center points are obtained by a twofold residual analysis, specifically:
S31: calculating the local density ρi of each sample point xi, i.e. the number of sample points whose distance to xi is less than the cut-off distance dc;
S32: calculating the distance δi of each sample point xi, i.e. the minimum distance from xi to any sample point of higher local density (for the point of highest density, δi is taken as the maximum distance to any other point);
S33: calculating the weight γi of each sample point xi:
γi = ρi * δi
and sorting all sample points by weight in descending order, the rank of xi being Ri;
S34: fitting the equation y = b0 + b1x to the ρi-δi and the Ri-γi distribution plots by linear regression, and performing residual analysis on each of the two regression results to obtain the outlier points; the points that appear as outliers in both the ρi-δi and the Ri-γi plots are the cluster center points.
S4: the optimal cut-off distance is obtained by iterative updating with a genetic algorithm according to a fitness function, specifically:
S41: defining the fitness function in terms of the between-cluster distance sum and the within-cluster distance sum, where the between-cluster term measures the distances from the cluster centers to the global center of all data points and the within-cluster term measures the distances from the cluster objects to their cluster centers after a given iteration; unk ∈ {0, 1}, and unk = 1 indicates that data point n belongs to the k-th cluster, otherwise it does not; for the global center of all data points, each numeric dimension l takes the mean of that dimension over all points, each categorical dimension l takes the most frequent category of that dimension, and ||ck|| denotes the number of data objects in the k-th cluster;
S42: setting the value range of the cut-off distance dc such that the average local density of all data objects is 1%-20% of the total number of points in the data set, giving the interval [dc_low, dc_high]; then selecting z values of dc at random to form the first generation of the dc population; performing the cluster division for each dc according to the cluster-division method of the density peaks clustering algorithm (DPC); and then evolving each generation of the dc population through the iterative process of the genetic algorithm, finally obtaining the optimal cut-off distance dcBest and the corresponding cluster center points.
S5: each remaining sample point is assigned to the cluster of its nearest higher-density neighbor, thereby completing the clustering.
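The assignment of step S5 can be sketched as follows. This is the standard DPC assignment rule; it assumes the highest-density point is among the detected centers, so the nearest higher-density neighbor of every visited point is already labeled:

```python
import numpy as np

def assign_clusters(dist, rho, center_idx):
    """Step S5: visit non-center points in order of decreasing density and
    give each the label of its nearest neighbor among all points of
    strictly higher density."""
    n = dist.shape[0]
    labels = np.full(n, -1)
    for k, c in enumerate(center_idx):
        labels[c] = k
    for i in np.argsort(-rho):                  # descending density
        if labels[i] != -1:
            continue                            # already a labeled center
        higher = np.where(rho > rho[i])[0]
        nearest = higher[np.argmin(dist[i, higher])]
        labels[i] = labels[nearest]
    return labels
```

Visiting points in decreasing-density order guarantees that, by the time a point is reached, its nearest higher-density neighbor has already received a label, so a single pass suffices.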
Finally, it is noted that the above embodiments are only intended to illustrate the technical solution of the invention and not to limit it. Although the invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that the technical solution of the invention may be modified or equivalently replaced without departing from the purpose and scope of the technical solution, and all such modifications shall be covered by the scope of the claims of the invention.
Claims (5)
1. A density peaks clustering method for mixed-attribute data sets, characterized in that the method specifically comprises the following steps:
S1: obtaining data and preprocessing it;
S2: calculating the mixed-attribute distance between sample points;
S3: obtaining the cluster center points by a twofold residual analysis;
S4: iteratively updating with a genetic algorithm according to a fitness function to obtain the optimal cut-off distance;
S5: assigning each remaining sample point to the cluster of its nearest higher-density neighbor, thereby completing the clustering.
2. The density peaks clustering method for mixed-attribute data sets according to claim 1, characterized in that the step S1 specifically comprises the following steps:
S11: obtaining a mixed-attribute data set D with N data sample points and M attribute features, of which T are numeric attribute features and U are categorical attribute features, i.e. M = T + U; Dom(Ci) denotes the value set of categorical attribute feature Ci, and Dom(Ci) = {ci,1, ci,2, ..., ci,f} indicates that attribute Ci has f distinct values;
S12: filling in or deleting the null values, illegal values, inconsistent data, and duplicate records in the mixed-attribute data set;
S13: applying min-max linear normalization to all numeric attribute features:
X'ij = (Xij − minXj) / (maxXj − minXj)
where Xj denotes the data of the j-th column, minXj and maxXj denote the minimum and maximum values in the j-th column, Xij denotes the actual value in row i, column j, and X'ij denotes the normalized value.
3. The density peaks clustering method for mixed-attribute data sets according to claim 2, characterized in that the step S2 specifically comprises the following steps:
S21: calculating the numeric-attribute similarity between samples: for x, y ∈ D, the numeric-attribute similarity of x and y is computed from dr(x, y), the Euclidean distance between x and y over the numeric attributes;
S22: calculating the support: let Ci be a categorical attribute feature; the support of attribute value ci,j with respect to Ci is the number of data objects in data set D whose attribute Ci equals ci,j;
S23: calculating the categorical-attribute similarity between samples: for x, y ∈ D, the categorical-attribute similarity of x and y is a weighted sum over the categorical attributes, where Wi denotes the weight of categorical attribute feature Ci and Hc(Ci) denotes the entropy of Ci:
Hc(Ci) = −Σj p(ci,j) log p(ci,j)
where p(ci,j) denotes the probability of attribute value ci,j in Ci, i.e. p(ci,j) = Sup(Ci|ci,j)/|D|;
S24: calculating the mixed-attribute distance between samples: for x, y ∈ D, the overall similarity S(x, y) combines the numeric-attribute and categorical-attribute similarities, and the distance between x and y is d(x, y) = 1 − S(x, y).
4. The density peaks clustering method for mixed-attribute data sets according to claim 3, characterized in that the step S3 specifically comprises the following steps:
S31: calculating the local density ρi of each sample point xi, i.e. the number of sample points whose distance to xi is less than the cut-off distance dc;
S32: calculating the distance δi of each sample point xi, i.e. the minimum distance from xi to any sample point of higher local density;
S33: calculating the weight γi of each sample point xi:
γi = ρi * δi
and sorting all sample points by weight in descending order, the rank of xi being Ri;
S34: fitting the equation y = b0 + b1x to the ρi-δi and the Ri-γi distribution plots by linear regression, and performing residual analysis on each of the two regression results to obtain the outlier points; the points that appear as outliers in both the ρi-δi and the Ri-γi plots are the cluster center points.
5. The density peaks clustering method for mixed-attribute data sets according to claim 4, characterized in that the step S4 specifically comprises the following steps:
S41: defining the fitness function in terms of the between-cluster distance sum and the within-cluster distance sum, where the cluster objects and cluster centers are those in the clustering result after a given iteration and the global center of all data points is also used;
S42: setting the value range of the cut-off distance dc such that the average local density of all data objects is 1%-20% of the total number of points in the data set, giving the interval [dc_low, dc_high]; then selecting z values of dc at random to form the first generation of the dc population; performing the cluster division for each dc according to the cluster-division method of the density peaks clustering algorithm (DPC); and then evolving each generation of the dc population through the iterative process of the genetic algorithm, finally obtaining the optimal cut-off distance dcBest and the corresponding cluster center points.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910171730.0A | 2019-03-07 | 2019-03-07 | A density peaks clustering method for mixed-attribute data sets
Publications (1)
Publication Number | Publication Date
---|---
CN109919227A (en) | 2019-06-21
Family (ID=66963816)
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201910171730.0A | A density peaks clustering method for mixed-attribute data sets | 2019-03-07 | 2019-03-07
Country Status (1)
Country | Link
---|---
CN | CN109919227A (en)
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN110772267A (granted as CN110772267B, 2022-04-19) | 2019-11-07 | 2020-02-11 | 中国人民解放军63850部队 | Human body physiological fatigue data marking method and fatigue identification model
CN111339294A | 2020-02-11 | 2020-06-26 | 普信恒业科技发展(北京)有限公司 | Client data classification method and device and electronic equipment
CN113158817A (granted as CN113158817B, 2023-07-18) | 2021-03-29 | 2021-07-23 | 南京信息工程大学 | Objective weather typing method based on rapid density peak clustering
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication
Application publication date: 20190621