CN109919227A - Density peak clustering method for mixed-attribute data sets

Publication number
CN109919227A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910171730.0A
Other languages
Chinese (zh)
Inventor
雒江涛
戴文彬
许国良
易燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-03-07
Filing date
2019-03-07
Publication date
2019-06-21

Abstract

The invention relates to a density peak clustering method for mixed-attribute data sets and belongs to the field of data mining. The method comprises: S1, acquiring the data and preprocessing it; S2, computing the mixed-attribute distance between sample points; S3, obtaining the cluster centers by two residual analyses; S4, iteratively updating the cutoff distance with a genetic algorithm according to a fitness function to obtain the optimal cutoff distance; S5, assigning each remaining sample point to the cluster of its nearest neighbor of higher density, which completes the clustering. By improving the computation of the mixed-attribute distance between sample points, the invention makes the density peaks clustering algorithm applicable to mixed-attribute data sets.

Description

Density peak clustering method for mixed-attribute data sets
Technical field
The invention belongs to the field of data mining and relates to a density peak clustering method for mixed-attribute data sets.
Background
With the arrival of the era of big data and artificial intelligence, clustering, as an important class of data mining algorithms, has attracted growing attention and is widely used in fields such as pattern recognition, medical diagnosis, knowledge discovery, and biomedicine.
Most data sets generated in the real world are mixed-attribute data sets that contain both numerical and categorical attributes. A numerical attribute takes continuous values; a categorical attribute takes discrete values that represent a class or state. Existing clustering algorithms for mixed-attribute data fall mainly into two families: partition-based and hierarchy-based. Partition-based methods are simple to implement and have low time complexity, but they require the number of clusters in advance, cannot find clusters of arbitrary shape, and are sensitive to outliers. Hierarchy-based methods do not require the number of clusters in advance, and the number can be chosen by the user, but they must store a similarity matrix and have high time and space complexity.
In 2014, Rodriguez and Laio proposed a density-based clustering algorithm in Science, the density peaks clustering algorithm. The algorithm has several advantages: it is simple and efficient, identifies clusters of arbitrary shape, does not require the number of clusters in advance, needs no assumption about the probability distribution of the data, is not affected by the dimensionality of the data space, and has low time and space complexity. It also has shortcomings: it handles only numerical data sets and does not address mixed-attribute data sets; its identification of cluster centers and noise points is coarse, with no accurate identification method provided; and its clustering quality depends heavily on the cutoff distance, which is chosen empirically and subjectively and therefore often causes sample points to be misassigned.
To address these shortcomings, the invention proposes a density peak clustering method for mixed-attribute data sets.
Summary of the invention
In view of this, the purpose of the invention is to provide a density peak clustering method that applies to mixed-attribute data sets and overcomes the problems of existing mixed-attribute clustering algorithms: the need to specify the number of clusters in advance, the inability to find clusters of arbitrary shape, sensitivity to outliers, and high time and space complexity. The method also introduces two residual analyses to obtain the cluster centers and a genetic algorithm to obtain the optimal cutoff distance, which addresses the poor identification of cluster centers and noise points in the density peaks clustering algorithm and the misassignment of sample points caused by a subjectively chosen cutoff distance.
To achieve the above purpose, the invention provides the following technical solution:
A density peak clustering method for mixed-attribute data sets, comprising the following steps:
S1: Acquire the data and preprocess it;
S2: Compute the mixed-attribute distance between sample points;
S3: Obtain the cluster centers by two residual analyses;
S4: Iteratively update the cutoff distance with a genetic algorithm according to a fitness function to obtain the optimal cutoff distance;
S5: Assign each remaining sample point to the cluster of its nearest neighbor of higher density, which completes the clustering.
Further, step S1 comprises the following steps:
S11: Acquire a mixed-attribute data set $D$ containing $N$ sample points and $M$ attributes, of which $T$ are numerical and $U$ are categorical, so that $M = T + U$. $\mathrm{Dom}(C_i)$ denotes the value set of the categorical attribute $C_i$; $\mathrm{Dom}(C_i) = \{c_{i,1}, c_{i,2}, \ldots, c_{i,f}\}$ means that $C_i$ takes $f$ distinct values;
S12: Fill in or delete null values, illegal values, inconsistent data, and duplicate records in the acquired mixed-attribute data set;
S13: Apply min-max linear normalization to all numerical attributes, computed as:
$$X'_{ij} = \frac{X_{ij} - \min X_j}{\max X_j - \min X_j}$$
where $X_j$ denotes the data in column $j$, $\min X_j$ the minimum value in column $j$, $\max X_j$ the maximum value in column $j$, $X_{ij}$ the actual value in row $i$, column $j$, and $X'_{ij}$ the value after normalization.
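The normalization in S13 can be sketched in Python as follows. This is a minimal illustration rather than the applicants' implementation: the function name `min_max_normalize`, the row-major list layout, and the choice to map a constant column to 0.0 are assumptions made here.

```python
def min_max_normalize(rows, numeric_cols):
    """Scale the listed numeric columns of a row-major table into [0, 1]."""
    out = [list(r) for r in rows]
    for j in numeric_cols:
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        span = hi - lo
        for i in range(len(rows)):
            # A constant column has zero range; mapping it to 0.0 is an assumption.
            out[i][j] = 0.0 if span == 0 else (rows[i][j] - lo) / span
    return out


# Toy data: one numerical column and one categorical column.
data = [[2.0, "a"], [4.0, "b"], [10.0, "a"]]
normalized = min_max_normalize(data, numeric_cols=[0])
```

Categorical columns are left untouched, since S13 applies only to numerical attributes.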
Further, step S2 comprises the following steps:
S21: Compute the numerical-attribute similarity between samples: for $x, y \in D$, the numerical-attribute similarity of $x$ and $y$ is given by an expression in which $d_r(x, y)$ denotes the Euclidean distance between the numerical parts of $x$ and $y$. The numerical-attribute similarity measures how similar two objects are on their numerical attributes, and its value range is [0, 1];
S22: Compute the support: for a categorical attribute $C_i$, the support of an attribute value $c_{i,j}$ ($c_{i,j} \in \mathrm{Dom}(C_i)$) with respect to $C_i$ is the number of data objects in $D$ whose value of $C_i$ equals $c_{i,j}$:
$$\mathrm{Sup}(C_i \mid c_{i,j}) = \left|\{x \in D : x.C_i = c_{i,j}\}\right|$$
S23: Compute the categorical-attribute similarity between samples: for $x, y \in D$, the categorical-attribute similarity of $x$ and $y$ is given by an expression in which $W_i$ denotes the weight of the categorical attribute $C_i$ and $H_c(C_i)$ denotes the entropy of $C_i$. In the entropy, $p(c_{i,j})$ denotes the probability of the attribute value $c_{i,j}$ in $C_i$, that is, $p(c_{i,j}) = \mathrm{Sup}(C_i \mid c_{i,j}) / |D|$. The categorical-attribute similarity measures how similar two objects are on their categorical attributes, and its value range is [0, 1];
S24: Compute the distance between mixed-attribute samples: for $x, y \in D$, let $S(x, y)$ be the similarity of $x$ and $y$. The distance between $x$ and $y$ is then $d(x, y) = 1 - S(x, y)$.
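The support and probability defined in S22 and S23, and the distance in S24, can be illustrated with a short Python sketch. The entropy is computed here as standard Shannon entropy with a base-2 logarithm, which is an assumption because the entropy expression is not reproduced in this text; the similarity formulas are likewise not reproduced and are not guessed here. All function names are hypothetical.

```python
import math
from collections import Counter


def support(column):
    """Sup(C_i | c): number of objects whose value of C_i equals c."""
    return Counter(column)


def entropy(column):
    """Shannon entropy (base 2, assumed) with p(c) = Sup(C_i | c) / |D|."""
    n = len(column)
    return -sum((k / n) * math.log2(k / n) for k in Counter(column).values())


def distance_from_similarity(s):
    """d(x, y) = 1 - S(x, y) for a similarity S in [0, 1]."""
    return 1.0 - s


# Toy categorical attribute with three distinct values.
colour = ["red", "red", "blue", "green"]
sup = support(colour)
h = entropy(colour)
```

Here `sup["red"]` is the support of the value `red`, and dividing it by `len(colour)` gives the probability used in the entropy.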
Further, step S3 comprises the following steps:
S31: Compute the local density $\rho_i$ of sample point $x_i$;
S32: Compute the distance $\delta_i$ of sample point $x_i$;
S33: Compute the weight $\gamma_i$ of sample point $x_i$:
$$\gamma_i = \rho_i \delta_i$$
Sort the weights of all sample points in descending order; the rank of $x_i$ in this ordering is $R_i$;
S34: Use the equation $y = b_0 + b_1 x$ to perform linear regression on the $\rho_i$-$\delta_i$ and $R_i$-$\gamma_i$ distribution plots, and run a residual analysis on each of the two regression results to obtain the singular points. The points that are singular in both the $\rho_i$-$\delta_i$ plot and the $R_i$-$\gamma_i$ plot are the cluster centers.
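The quantities in S31 to S33 can be sketched as follows. Because the expressions for the local density and the distance are not reproduced in this text, the sketch assumes the cutoff-kernel definitions from the original density peaks algorithm of Rodriguez and Laio: the density counts the points closer than the cutoff distance, and the distance is that to the nearest point of higher density (the largest distance for a densest point). The function name and input layout are hypothetical.

```python
def density_peaks_quantities(dist, d_c):
    """Return (rho, delta, gamma, rank) from a full pairwise distance matrix."""
    n = len(dist)
    # rho_i: number of other points closer than the cutoff distance d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A densest point has no denser neighbor; use its largest distance.
        delta.append(min(higher) if higher else max(dist[i]))
    gamma = [r * d for r, d in zip(rho, delta)]
    # R_i: 1-based position of x_i when gamma is sorted in descending order.
    order = sorted(range(n), key=lambda i: gamma[i], reverse=True)
    rank = [0] * n
    for position, i in enumerate(order, start=1):
        rank[i] = position
    return rho, delta, gamma, rank


# Two well-separated groups on a line.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(a - b) for b in points] for a in points]
rho, delta, gamma, rank = density_peaks_quantities(dist, d_c=1.5)
```

In this toy data the middle point of each group gets both the highest density and the largest distance, so its weight stands out from the rest.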
Further, step S4 comprises the following steps:
S41: Define the fitness function. In the fitness function, one term is the sum of between-cluster distances and the other is the sum of within-cluster distances; they are computed from the cluster objects and cluster centers in the clustering result after a given iteration and from the center of all data points. The indicator $u_{nk} \in \{0, 1\}$ equals 1 when data point $n$ belongs to the $k$-th cluster and 0 otherwise. In the center of all data points, the $l$-th dimension of a numerical attribute is represented by the global central value of that dimension, and the $l$-th dimension of a nominal attribute is represented by the most frequent category of that dimension. $\|c_k\|$ denotes the number of data objects in the $k$-th cluster;
S42: Set the range of the cutoff distance $d_c$ so that the average local density of all data objects is 1%-20% of the total size of the data set, which gives the interval $[d_{c\_low}, d_{c\_high}]$. Then randomly select $z$ values of $d_c$ to form the first-generation $d_c$ population, and partition the data into clusters for each $d_c$ following the cluster-division procedure of the density peaks clustering algorithm (DPCA). The $d_c$ population is then evolved generation by generation through the iterations of the genetic algorithm, finally yielding the optimal cutoff distance $d_{cBest}$ and the corresponding cluster centers.
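One way to approximate the interval in S42 is to read the cutoff distance off the sorted pairwise distances: if a fraction p of all pairs lie closer than the cutoff, the average number of neighbors within the cutoff is about p times N. The sketch below uses that approximation with a cutoff-kernel density, both of which are assumptions; `cutoff_interval` is a hypothetical helper.

```python
def cutoff_interval(dist, low=0.01, high=0.20):
    """Approximate [d_c_low, d_c_high] from quantiles of pairwise distances."""
    n = len(dist)
    pairs = sorted(dist[i][j] for i in range(n) for j in range(i + 1, n))

    def quantile(p):
        # Distance below which roughly a fraction p of all pairs fall.
        return pairs[min(len(pairs) - 1, int(p * len(pairs)))]

    return quantile(low), quantile(high)


# Eleven equally spaced points on a line.
points = [float(v) for v in range(11)]
dist = [[abs(a - b) for b in points] for a in points]
d_low, d_high = cutoff_interval(dist)
```

The first-generation population of S42 would then be drawn at random from the returned interval.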
The beneficial effects of the invention are as follows. The invention improves the computation of the mixed-attribute distance between sample points, so that the density peaks clustering algorithm can be applied to mixed-attribute data sets. It proposes two residual analyses to identify cluster centers and noise points accurately. It also obtains the optimal cutoff distance by genetic-algorithm iteration, which avoids the misassignment of sample points caused by a subjectively chosen cutoff distance.
The invention is applicable to many fields. In business, it helps marketers discover groups with different characteristics within a customer base and describe those groups by their purchasing patterns. In biology, it can derive the hierarchical structure of animals or plants and classify them by gene function. In geographic information, it helps identify regions with similar land use in earth observation databases. On the internet, it can classify online documents for deeper data discovery. The invention can also serve as a data analysis tool for obtaining the structure of data, analyzing data characteristics, and determining data categories of interest.
Other advantages, objectives, and features of the invention will be set forth to some extent in the description that follows, and to some extent will be apparent to those skilled in the art from study of the following, or may be learned from practice of the invention. The objectives and other advantages of the invention can be realized and obtained through the following description.
Brief description of the drawings
To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described in detail below with reference to the accompanying drawing, in which:
Fig. 1 is a flowchart of the density peak clustering method of the invention.
Detailed description of the embodiments
Embodiments of the invention are described below through specific examples; those skilled in the art can readily understand other advantages and effects of the invention from the content disclosed in this specification. The invention can also be implemented or applied through other specific embodiments, and the details in this specification can be modified or changed from different viewpoints and for different applications without departing from the spirit of the invention. Note that the drawings provided with the following embodiments illustrate the basic concept of the invention only schematically, and that the following embodiments and the features in them can be combined with one another where no conflict arises.
The drawings are for illustration only; they are schematic rather than physical drawings and should not be understood as limiting the invention. To better illustrate the embodiments, some components in the drawings may be omitted, enlarged, or reduced and do not represent the dimensions of an actual product. Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.
Referring to Fig. 1, a density peak clustering method for mixed-attribute data sets comprises the following steps:
S1: Acquire the data and preprocess it, specifically:
S11: Acquire a mixed-attribute data set $D$ containing $N$ sample points and $M$ attributes, of which $T$ are numerical and $U$ are categorical, so that $M = T + U$. $\mathrm{Dom}(C_i)$ denotes the value set of the categorical attribute $C_i$; $\mathrm{Dom}(C_i) = \{c_{i,1}, c_{i,2}, \ldots, c_{i,f}\}$ means that $C_i$ takes $f$ distinct values;
S12: Fill in or delete null values, illegal values, inconsistent data, and duplicate records in the acquired mixed-attribute data set;
S13: Apply min-max linear normalization to all numerical attributes, computed as:
$$X'_{ij} = \frac{X_{ij} - \min X_j}{\max X_j - \min X_j}$$
where $X_j$ denotes the data in column $j$, $\min X_j$ the minimum value in column $j$, $\max X_j$ the maximum value in column $j$, $X_{ij}$ the actual value in row $i$, column $j$, and $X'_{ij}$ the value after normalization.
S2: Compute the mixed-attribute distance between sample points, specifically:
S21: Compute the numerical-attribute similarity between samples: for $x, y \in D$, the numerical-attribute similarity of $x$ and $y$ is given by an expression in which $d_r(x, y)$ denotes the Euclidean distance between the numerical parts of $x$ and $y$. The numerical-attribute similarity measures how similar two objects are on their numerical attributes, and its value range is [0, 1];
S22: Compute the support: for a categorical attribute $C_i$, the support of an attribute value $c_{i,j}$ ($c_{i,j} \in \mathrm{Dom}(C_i)$) with respect to $C_i$ is the number of data objects in $D$ whose value of $C_i$ equals $c_{i,j}$:
$$\mathrm{Sup}(C_i \mid c_{i,j}) = \left|\{x \in D : x.C_i = c_{i,j}\}\right|$$
S23: Compute the categorical-attribute similarity between samples: for $x, y \in D$, the categorical-attribute similarity of $x$ and $y$ is given by an expression in which $W_i$ denotes the weight of the categorical attribute $C_i$ and $H_c(C_i)$ denotes the entropy of $C_i$. In the entropy, $p(c_{i,j})$ denotes the probability of the attribute value $c_{i,j}$ in $C_i$, that is, $p(c_{i,j}) = \mathrm{Sup}(C_i \mid c_{i,j}) / |D|$. The categorical-attribute similarity measures how similar two objects are on their categorical attributes, and its value range is [0, 1];
S24: Compute the distance between mixed-attribute samples: for $x, y \in D$, let $S(x, y)$ be the similarity of $x$ and $y$. The distance between $x$ and $y$ is then $d(x, y) = 1 - S(x, y)$.
S3: Obtain the cluster centers by two residual analyses, specifically:
S31: Compute the local density $\rho_i$ of sample point $x_i$;
S32: Compute the distance $\delta_i$ of sample point $x_i$;
S33: Compute the weight $\gamma_i$ of sample point $x_i$:
$$\gamma_i = \rho_i \delta_i$$
Sort the weights of all sample points in descending order; the rank of $x_i$ in this ordering is $R_i$;
S34: Use the equation $y = b_0 + b_1 x$ to perform linear regression on the $\rho_i$-$\delta_i$ and $R_i$-$\gamma_i$ distribution plots, and run a residual analysis on each of the two regression results to obtain the singular points. The points that are singular in both the $\rho_i$-$\delta_i$ plot and the $R_i$-$\gamma_i$ plot are the cluster centers.
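S34 can be sketched as an ordinary least-squares fit followed by a residual test. The text does not state the rule used to flag a residual as singular, so the sketch assumes a simple threshold of k standard deviations (k = 2 by default); the function names are hypothetical.

```python
import statistics


def linear_fit(xs, ys):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = 0.0 if sxx == 0 else sxy / sxx
    return my - b1 * mx, b1


def singular_points(xs, ys, k=2.0):
    """Indices whose residual exceeds k standard deviations (assumed rule)."""
    b0, b1 = linear_fit(xs, ys)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    scale = statistics.pstdev(residuals)
    if scale == 0:
        return set()
    return {i for i, r in enumerate(residuals) if abs(r) > k * scale}


def cluster_centers(rho, delta, rank, gamma, k=2.0):
    """Points that are singular in both the rho-delta and the R-gamma plot."""
    return singular_points(rho, delta, k) & singular_points(rank, gamma, k)


# Toy plot: a flat trend with one point far above it.
xs = list(range(9))
ys = [0, 0, 0, 0, 9, 0, 0, 0, 0]
flagged = singular_points(xs, ys)
```

Taking the intersection of the two sets mirrors the requirement that a cluster center be singular in both plots.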
S4: Iteratively update the cutoff distance with a genetic algorithm according to a fitness function to obtain the optimal cutoff distance, specifically:
S41: Define the fitness function. In the fitness function, one term is the sum of between-cluster distances and the other is the sum of within-cluster distances; they are computed from the cluster objects and cluster centers in the clustering result after a given iteration and from the center of all data points. The indicator $u_{nk} \in \{0, 1\}$ equals 1 when data point $n$ belongs to the $k$-th cluster and 0 otherwise. In the center of all data points, the $l$-th dimension of a numerical attribute is represented by the global central value of that dimension, and the $l$-th dimension of a nominal attribute is represented by the most frequent category of that dimension. $\|c_k\|$ denotes the number of data objects in the $k$-th cluster;
S42: Set the range of the cutoff distance $d_c$ so that the average local density of all data objects is 1%-20% of the total size of the data set, which gives the interval $[d_{c\_low}, d_{c\_high}]$. Then randomly select $z$ values of $d_c$ to form the first-generation $d_c$ population, and partition the data into clusters for each $d_c$ following the cluster-division procedure of the density peaks clustering algorithm (DPCA). The $d_c$ population is then evolved generation by generation through the iterations of the genetic algorithm, finally yielding the optimal cutoff distance $d_{cBest}$ and the corresponding cluster centers.
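The evolutionary search in S42 can be sketched as below. The selection, crossover, and mutation operators are not specified in this text, so the sketch assumes truncation selection, arithmetic crossover, Gaussian mutation, and elitism. The `fitness` argument is a hypothetical stand-in for the fitness function of S41, which in the method would be evaluated by clustering the data with the candidate cutoff distance.

```python
import random


def evolve_cutoff(fitness, d_low, d_high, z=10, generations=30, seed=0):
    """Evolve a population of z cutoff distances (z >= 2); return (best, history)."""
    rng = random.Random(seed)
    population = [rng.uniform(d_low, d_high) for _ in range(z)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        history.append(fitness(scored[0]))
        parents = scored[: max(2, z // 2)]  # truncation selection (assumed)
        children = [scored[0]]              # elitism: keep the current best
        while len(children) < z:
            a, b = rng.sample(parents, 2)
            # Arithmetic crossover plus Gaussian mutation (both assumed).
            child = (a + b) / 2 + rng.gauss(0.0, 0.05 * (d_high - d_low))
            children.append(min(d_high, max(d_low, child)))
        population = children
    return max(population, key=fitness), history


# Toy fitness with a single optimum at 0.3, used only to exercise the loop.
def toy_fitness(d_c):
    return -(d_c - 0.3) ** 2


best, history = evolve_cutoff(toy_fitness, 0.1, 0.5)
```

Because the best individual is copied unchanged into the next generation, the best fitness never decreases from one generation to the next.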
S5: Assign each remaining sample point to the cluster of its nearest neighbor of higher density, which completes the clustering.
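Step S5 can be sketched as follows, assuming the local densities and the cluster centers are already available. Points are visited in order of decreasing density so that the nearest denser neighbor is always labelled first; ties in density are broken by position in that order, which is an assumption. `assign_clusters` is a hypothetical name.

```python
def assign_clusters(dist, rho, centers):
    """Label every point with the cluster of its nearest denser neighbor."""
    n = len(dist)
    labels = [-1] * n
    for k, c in enumerate(centers):
        labels[c] = k
    # Visit points from densest to sparsest (stable for equal densities).
    order = sorted(range(n), key=lambda i: rho[i], reverse=True)
    for pos, i in enumerate(order):
        if labels[i] != -1:
            continue
        # Candidates are the points visited earlier; if the densest point is
        # not itself a center, fall back to the centers (assumed behavior).
        candidates = order[:pos] if pos > 0 else list(centers)
        parent = min(candidates, key=lambda j: dist[i][j])
        labels[i] = labels[parent]
    return labels


# Two groups on a line, with the middle point of each group as its center.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(a - b) for b in points] for a in points]
rho = [1, 2, 1, 1, 2, 1]
labels = assign_clusters(dist, rho, centers=[1, 4])
```

A single pass suffices because each point inherits the label of a point that has already been labelled.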
Finally, the above embodiments serve only to illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that the technical solution of the invention may be modified or replaced by equivalents without departing from its purpose and scope, and all such modifications and replacements fall within the scope of the claims of the invention.

Claims (5)

1. A density peak clustering method for mixed-attribute data sets, characterized in that the method comprises the following steps:
S1: Acquire the data and preprocess it;
S2: Compute the mixed-attribute distance between sample points;
S3: Obtain the cluster centers by two residual analyses;
S4: Iteratively update the cutoff distance with a genetic algorithm according to a fitness function to obtain the optimal cutoff distance;
S5: Assign each remaining sample point to the cluster of its nearest neighbor of higher density, which completes the clustering.
2. The density peak clustering method for mixed-attribute data sets according to claim 1, characterized in that step S1 comprises the following steps:
S11: Acquire a mixed-attribute data set $D$ containing $N$ sample points and $M$ attributes, of which $T$ are numerical and $U$ are categorical, so that $M = T + U$. $\mathrm{Dom}(C_i)$ denotes the value set of the categorical attribute $C_i$; $\mathrm{Dom}(C_i) = \{c_{i,1}, c_{i,2}, \ldots, c_{i,f}\}$ means that $C_i$ takes $f$ distinct values;
S12: Fill in or delete null values, illegal values, inconsistent data, and duplicate records in the acquired mixed-attribute data set;
S13: Apply min-max linear normalization to all numerical attributes, computed as:
$$X'_{ij} = \frac{X_{ij} - \min X_j}{\max X_j - \min X_j}$$
where $X_j$ denotes the data in column $j$, $\min X_j$ the minimum value in column $j$, $\max X_j$ the maximum value in column $j$, $X_{ij}$ the actual value in row $i$, column $j$, and $X'_{ij}$ the value after normalization.
3. The density peak clustering method for mixed-attribute data sets according to claim 2, characterized in that step S2 comprises the following steps:
S21: Compute the numerical-attribute similarity between samples: for $x, y \in D$, the numerical-attribute similarity of $x$ and $y$ is given by an expression in which $d_r(x, y)$ denotes the Euclidean distance between the numerical parts of $x$ and $y$;
S22: Compute the support: for a categorical attribute $C_i$, the support of an attribute value $c_{i,j}$ with respect to $C_i$ is the number of data objects in $D$ whose value of $C_i$ equals $c_{i,j}$:
$$\mathrm{Sup}(C_i \mid c_{i,j}) = \left|\{x \in D : x.C_i = c_{i,j}\}\right|$$
S23: Compute the categorical-attribute similarity between samples: for $x, y \in D$, the categorical-attribute similarity of $x$ and $y$ is given by an expression in which $W_i$ denotes the weight of the categorical attribute $C_i$ and $H_c(C_i)$ denotes the entropy of $C_i$. In the entropy, $p(c_{i,j})$ denotes the probability of the attribute value $c_{i,j}$ in $C_i$, that is, $p(c_{i,j}) = \mathrm{Sup}(C_i \mid c_{i,j}) / |D|$;
S24: Compute the distance between mixed-attribute samples: for $x, y \in D$, let $S(x, y)$ be the similarity of $x$ and $y$. The distance between $x$ and $y$ is then $d(x, y) = 1 - S(x, y)$.
4. The density peak clustering method for mixed-attribute data sets according to claim 3, characterized in that step S3 comprises the following steps:
S31: Compute the local density $\rho_i$ of sample point $x_i$;
S32: Compute the distance $\delta_i$ of sample point $x_i$;
S33: Compute the weight $\gamma_i$ of sample point $x_i$:
$$\gamma_i = \rho_i \delta_i$$
Sort the weights of all sample points in descending order; the rank of $x_i$ in this ordering is $R_i$;
S34: Use the equation $y = b_0 + b_1 x$ to perform linear regression on the $\rho_i$-$\delta_i$ and $R_i$-$\gamma_i$ distribution plots, and run a residual analysis on each of the two regression results to obtain the singular points. The points that are singular in both the $\rho_i$-$\delta_i$ plot and the $R_i$-$\gamma_i$ plot are the cluster centers.
5. The density peak clustering method for mixed-attribute data sets according to claim 4, characterized in that step S4 comprises the following steps:
S41: Define the fitness function, in which one term is the sum of between-cluster distances and the other is the sum of within-cluster distances; they are computed from the cluster objects and cluster centers in the clustering result after a given iteration and from the center of all data points;
S42: Set the range of the cutoff distance $d_c$ so that the average local density of all data objects is 1%-20% of the total size of the data set, which gives the interval $[d_{c\_low}, d_{c\_high}]$. Then randomly select $z$ values of $d_c$ to form the first-generation $d_c$ population, and partition the data into clusters for each $d_c$ following the cluster-division procedure of the density peaks clustering algorithm (DPCA). The $d_c$ population is then evolved generation by generation through the iterations of the genetic algorithm, finally yielding the optimal cutoff distance $d_{cBest}$ and the corresponding cluster centers.

Priority Applications (1)

Application Number    Priority Date    Filing Date    Title
CN201910171730.0A    2019-03-07    2019-03-07    Density peak clustering method for mixed-attribute data sets

Publications (1)

Publication Number    Publication Date
CN109919227A    2019-06-21

Family ID: 66963816
Country: CN


Legal Events

Code    Title
PB01    Publication
SE01    Entry into force of request for substantive examination
RJ01    Rejection of invention patent application after publication

Application publication date: 20190621