CN109919227A - A density peaks clustering method for mixed-attribute data sets - Google Patents
A density peaks clustering method for mixed-attribute data sets
- Publication number: CN109919227A
- Application number: CN201910171730.0A
- Authority: CN (China)
- Prior art keywords: data, value, data set, distance, indicate
- Prior art date: 2019-03-07
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Landscapes: Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a density peaks clustering method for mixed-attribute data sets, and belongs to the field of data mining. The method specifically comprises: S1: obtaining data and preprocessing it; S2: calculating the mixed-attribute distance between sample points; S3: obtaining the cluster center points by a twofold residual analysis; S4: iteratively updating with a genetic algorithm according to a fitness function to obtain the optimal cut-off distance; S5: assigning each remaining sample point to the cluster of its nearest higher-density neighbor, thereby completing the clustering. The invention improves the calculation of the mixed-attribute distance between data sample points, so that the density peaks clustering algorithm becomes applicable to mixed-attribute data sets.
Description
Technical field
The invention belongs to the field of data mining, and relates to a density peaks clustering method for mixed-attribute data sets.
Background technique
With the arrival of the era of big data and artificial intelligence, clustering, as an important data mining technique, has attracted increasing attention and is widely used in numerous fields, including pattern recognition, medical diagnosis, knowledge discovery, and biomedicine.
Most data sets generated in the real world are mixed-attribute data sets that contain both numeric attribute features and categorical attribute features. A numeric attribute takes ordered numerical values, while a categorical attribute takes discrete values that represent categories or states. Existing clustering algorithms for mixed-attribute data sets fall broadly into two classes: partition-based and hierarchy-based algorithms. Partition-based methods are relatively simple to implement and have low time complexity, but their drawbacks are evident: the number of clusters must be given in advance, clusters of arbitrary shape cannot be found, and the methods are sensitive to outliers. Hierarchy-based methods do not require the number of clusters in advance and allow it to be set subjectively, but they must store a similarity matrix and therefore suffer from high time and space complexity.
On " Science " in 2014, Rodriguez and Laio propose one kind based on density (density-
Based density peaks clustering algorithm (density peaks clustering algorithm)).Density peaks clustering algorithm
Have the advantages that it is more, it is simple and efficiently, can identify the class cluster of arbitrary shape and not need to give class cluster number in advance, together
When without the concern for data set probability-distribution function, performance also do not influenced by data space dimension, has the lower time
And space complexity.But there is also some shortcomings for the algorithm: algorithm is just for Numeric Attributes data set, not to mixed
Mould assembly attribute data collection is analyzed;The identification of class cluster central point and noise spot is general, and it is accurate not provide a kind of comparison
Recognition methods;The Clustering Effect of algorithm very relies on the selection of truncation distance, and the selection that distance is truncated is rule of thumb to select
It takes, it is personal subjective, it often will cause the assignment error of sample point.
In view of the above deficiencies, the present invention proposes a kind of density peaks clustering method towards mixed attributes data set.
Summary of the invention
In view of this, the object of the present invention is to provide a density peaks clustering method for mixed-attribute data sets, so as to solve the problems of existing mixed-attribute clustering algorithms, namely that the number of clusters must be given in advance, clusters of arbitrary shape cannot be found, the algorithms are sensitive to outliers, and the time and space complexity is high. At the same time, a twofold residual analysis is introduced to obtain the cluster center points, and a genetic algorithm is used to obtain the optimal cut-off distance, thereby solving the problems of the density peaks clustering algorithm that cluster centers and noise points are poorly identified and that a subjectively set cut-off distance causes sample points to be assigned incorrectly.
In order to achieve the above object, the invention provides the following technical scheme:
A density peaks clustering method for mixed-attribute data sets, specifically comprising the following steps:
S1: obtaining data and preprocessing it;
S2: calculating the mixed-attribute distance between sample points;
S3: obtaining the cluster center points by a twofold residual analysis;
S4: iteratively updating with a genetic algorithm according to a fitness function to obtain the optimal cut-off distance;
S5: assigning each remaining sample point to the cluster of its nearest higher-density neighbor, thereby completing the clustering.
Further, the step S1 specifically comprises the following steps:
S11: obtaining a mixed-attribute data set D with N data sample points and M attribute features, of which T are numeric attribute features and U are categorical attribute features, i.e. M = T + U; Dom(Ci) denotes the value set of categorical attribute feature Ci, and Dom(Ci) = {ci,1, ci,2, ..., ci,f} indicates that attribute Ci has f distinct values;
S12: filling in or deleting the null values, illegal values, inconsistent data, and duplicate records in the mixed-attribute data set;
S13: applying min-max linear normalization to all numeric attribute features:
X'ij = (Xij − minXj) / (maxXj − minXj)
where Xj denotes the data of the j-th column, minXj and maxXj denote the minimum and maximum values in the j-th column, Xij denotes the actual value in row i, column j, and X'ij denotes the normalized value.
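For illustration, the min-max normalization of step S13 can be sketched in Python as follows; the function name and the guard for constant columns are illustrative assumptions of this sketch, not part of the disclosure:

```python
import numpy as np

def min_max_normalize(X):
    """Step S13: min-max normalize each numeric column to [0, 1] via
    X'ij = (Xij - minXj) / (maxXj - minXj)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Guard against division by zero for constant columns (assumed detail).
    rng = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / rng
```

Each column is scaled independently, so attributes with different ranges contribute comparably to the later distance computation.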
Further, the step S2 specifically comprises the following steps:
S21: calculating the numeric-attribute similarity between samples: for x, y ∈ D, the numeric-attribute similarity of x and y is computed from dr(x, y), the Euclidean distance between x and y over the numeric attributes; the numeric-attribute similarity expresses how similar two objects are on their numeric attributes, and its value range is [0, 1];
S22: calculating the support: let Ci be a categorical attribute feature; the support of attribute value ci,j (ci,j ∈ Dom(Ci)) with respect to Ci, Sup(Ci|ci,j), is the number of data objects in data set D whose attribute Ci equals ci,j;
S23: calculating the categorical-attribute similarity between samples: for x, y ∈ D, the categorical-attribute similarity of x and y is a weighted sum over the categorical attributes, where Wi denotes the weight of categorical attribute feature Ci and Hc(Ci) denotes the entropy of Ci:
Hc(Ci) = −Σj p(ci,j) log p(ci,j)
where p(ci,j) denotes the probability of attribute value ci,j in Ci, i.e. p(ci,j) = Sup(Ci|ci,j)/|D|; the categorical-attribute similarity expresses how similar two objects are on their categorical attributes, and its value range is [0, 1];
S24: calculating the mixed-attribute distance between samples: for x, y ∈ D, the overall similarity S(x, y) combines the numeric-attribute and categorical-attribute similarities, and the distance between x and y is d(x, y) = 1 − S(x, y).
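The mixed-attribute distance of step S2 can be sketched as follows. The extracted text lost the exact formula images, so the numeric similarity form Sr = 1/(1 + dr), the simple-matching categorical similarity, the entropy-normalized weights Wi, and the (T·Sr + U·Sc)/M combination are all illustrative assumptions; only the entropy definition, the support-based probability p(ci,j) = Sup/|D|, and d = 1 − S come from the text:

```python
import numpy as np
from collections import Counter

def entropy_weights(cat_data):
    """Weight Wi per categorical attribute from its entropy
    Hc(Ci) = -sum_j p(ci,j) * log p(ci,j); normalizing the entropies to
    sum to 1 is an assumed form of the (lost) weight formula."""
    cat_data = np.asarray(cat_data)
    n = cat_data.shape[0]
    ents = []
    for col in cat_data.T:
        probs = np.array(list(Counter(col).values())) / n  # p(ci,j) = Sup/|D|
        ents.append(-np.sum(probs * np.log(probs)))
    ents = np.array(ents)
    return ents / ents.sum() if ents.sum() > 0 else np.full(len(ents), 1.0 / len(ents))

def mixed_distance(x_num, y_num, x_cat, y_cat, W):
    """d(x, y) = 1 - S(x, y), where S blends an assumed numeric similarity
    Sr = 1/(1 + dr) (dr: Euclidean over normalized numeric attributes) with
    an entropy-weighted simple-matching categorical similarity Sc."""
    x_num, y_num = np.asarray(x_num, float), np.asarray(y_num, float)
    x_cat, y_cat = np.asarray(x_cat), np.asarray(y_cat)
    s_r = 1.0 / (1.0 + np.linalg.norm(x_num - y_num))
    s_c = float(np.sum(np.asarray(W) * (x_cat == y_cat)))
    T, U = x_num.size, x_cat.size
    s = (T * s_r + U * s_c) / (T + U)
    return 1.0 - s
```

Under these assumptions, identical samples get distance 0, and an attribute whose values are all equal (zero entropy) contributes no categorical weight.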
Further, the step S3 specifically comprises the following steps:
S31: calculating the local density ρi of each sample point xi, i.e. the number of sample points whose distance to xi is less than the cut-off distance dc;
S32: calculating the distance δi of each sample point xi, i.e. the minimum distance from xi to any sample point of higher local density (for the point of highest density, δi is taken as the maximum distance to any other point);
S33: calculating the weight γi of each sample point xi:
γi = ρi * δi
and sorting all sample points by weight in descending order, the rank of xi being Ri;
S34: fitting the equation y = b0 + b1x to the ρi-δi and the Ri-γi distribution plots by linear regression, and performing residual analysis on each of the two regression results to obtain the outlier points; the points that appear as outliers in both the ρi-δi and the Ri-γi plots are the cluster center points.
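Steps S31-S34 can be sketched as follows, using the standard DPC cut-off density and higher-density distance together with an ordinary least-squares residual test; the k-sigma outlier threshold is an assumed detail that the patent does not state:

```python
import numpy as np

def rho_delta_gamma(dist, d_c):
    """S31-S33: local density rho_i (number of points closer than d_c),
    distance delta_i to the nearest higher-density point (the densest
    point takes the maximum distance), and weight gamma_i = rho_i * delta_i."""
    n = dist.shape[0]
    rho = (dist < d_c).sum(axis=1) - 1          # subtract the self-distance 0
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = dist[i, higher].min() if higher.size else dist[i].max()
    return rho, delta, rho * delta

def residual_outliers(x, y, k=2.0):
    """S34: fit y = b0 + b1*x and flag points whose residual exceeds k
    standard deviations (k is an assumed threshold)."""
    b1, b0 = np.polyfit(x, y, 1)
    res = np.asarray(y, float) - (b0 + b1 * np.asarray(x, float))
    return set(np.where(np.abs(res) > k * res.std())[0])
```

The cluster centers would then be the intersection `residual_outliers(rho, delta) & residual_outliers(ranks, gamma)`, i.e. the points flagged in both plots, as step S34 requires.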
Further, the step S4 specifically comprises the following steps:
S41: defining the fitness function in terms of the between-cluster distance sum and the within-cluster distance sum, where the between-cluster term measures the distances from the cluster centers to the global center of all data points and the within-cluster term measures the distances from the cluster objects to their cluster centers after a given iteration; unk ∈ {0, 1}, and unk = 1 indicates that data point n belongs to the k-th cluster, otherwise it does not; for the global center of all data points, each numeric dimension l takes the mean of that dimension over all points, each categorical dimension l takes the most frequent category of that dimension, and ||ck|| denotes the number of data objects in the k-th cluster;
S42: setting the value range of the cut-off distance dc such that the average local density of all data objects is 1%-20% of the total number of points in the data set, giving the interval [dc_low, dc_high]; then selecting z values of dc at random to form the first generation of the dc population; performing the cluster division for each dc according to the cluster-division method of the density peaks clustering algorithm (DPC); and then evolving each generation of the dc population through the iterative process of the genetic algorithm, finally obtaining the optimal cut-off distance dcBest and the corresponding cluster center points.
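Step S42 can be sketched as a simple real-coded genetic algorithm over dc. The elitist selection, arithmetic crossover, and uniform-mutation operators, as well as the population parameters, are illustrative assumptions: the patent only specifies a random first generation in [dc_low, dc_high] and generational evolution under the fitness function:

```python
import random

def evolve_dc(fitness, dc_low, dc_high, pop_size=20, generations=50,
              mutation_rate=0.1):
    """S42 sketch: evolve a population of candidate cut-off distances d_c
    and return the fittest. `fitness(dc)` would cluster the data with this
    d_c (DPC cluster division) and score the result; here it is a caller-
    supplied callable."""
    pop = [random.uniform(dc_low, dc_high) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        elite = scored[: pop_size // 2]         # elitist selection (assumed)
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            w = random.random()
            child = w * a + (1 - w) * b         # arithmetic crossover (assumed)
            if random.random() < mutation_rate:
                child = random.uniform(dc_low, dc_high)  # uniform mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)                # dcBest
```

Because the elite half is carried over unchanged, the best dc found so far never degrades from one generation to the next.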
The beneficial effects of the present invention are: the invention improves the calculation of the mixed-attribute distance between data sample points, so that the density peaks clustering algorithm becomes applicable to mixed-attribute data sets. A twofold residual analysis is proposed to precisely identify the cluster center points and noise points. At the same time, a genetic algorithm is used to iteratively obtain the optimal cut-off distance, solving the problem that a subjectively set cut-off distance causes sample points to be assigned incorrectly.
The present invention is applicable to many fields. For example: in commerce, it helps marketing personnel discover groups with different characteristics within a customer base and describe these customer groups by their purchasing patterns; in biology, it derives the taxonomic hierarchy of animals or plants and classifies them according to gene function; in geographic information, it helps identify regions with similar land use from an earth observation database; on the internet, it recognizes the categories of documents for deeper data discovery. The invention can also be used as a data analysis tool, for obtaining the structure of data, analyzing data characteristics, and identifying interesting data categories.
Other advantages, objects, and features of the invention will be set forth to some extent in the following description, and to some extent will become apparent to those skilled in the art upon study of what follows, or may be learned from the practice of the invention. The objects and other advantages of the invention can be realized and obtained through the following specification.
Detailed description of the invention
To make the objects, technical solutions, and advantages of the present invention clearer, a detailed description of preferred embodiments of the invention is given below in conjunction with the accompanying drawings, in which:
Fig. 1 is density peaks clustering method flow chart of the present invention.
Specific embodiment
The embodiments of the present invention are illustrated below by way of specific examples, and those skilled in the art can easily understand other advantages and effects of the invention from the contents disclosed in this specification. The invention may also be implemented or applied through other, different embodiments, and the details in this specification may be modified or changed in various ways from different viewpoints and for different applications without departing from the spirit of the invention. It should be noted that the figures provided in the following embodiments only illustrate the basic concept of the invention schematically, and, in the absence of conflict, the features of the following embodiments may be combined with one another.
The drawings are for illustrative purposes only, are merely schematic, and are not pictorial diagrams; they should not be understood as limiting the invention. To better illustrate the embodiments, certain components in the drawings may be omitted, enlarged, or reduced, and do not represent the size of the actual product. It will be understood by those skilled in the art that certain well-known structures and their explanations may be omitted from the drawings.
Referring to Fig. 1, a density peaks clustering method for mixed-attribute data sets specifically comprises the following steps:
S1: data are obtained and preprocessed, specifically including:
S11: obtaining a mixed-attribute data set D with N data sample points and M attribute features, of which T are numeric attribute features and U are categorical attribute features, i.e. M = T + U; Dom(Ci) denotes the value set of categorical attribute feature Ci, and Dom(Ci) = {ci,1, ci,2, ..., ci,f} indicates that attribute Ci has f distinct values;
S12: filling in or deleting the null values, illegal values, inconsistent data, and duplicate records in the mixed-attribute data set;
S13: applying min-max linear normalization to all numeric attribute features:
X'ij = (Xij − minXj) / (maxXj − minXj)
where Xj denotes the data of the j-th column, minXj and maxXj denote the minimum and maximum values in the j-th column, Xij denotes the actual value in row i, column j, and X'ij denotes the normalized value.
S2: the mixed-attribute distance between sample points is calculated, specifically including:
S21: calculating the numeric-attribute similarity between samples: for x, y ∈ D, the numeric-attribute similarity of x and y is computed from dr(x, y), the Euclidean distance between x and y over the numeric attributes; the numeric-attribute similarity expresses how similar two objects are on their numeric attributes, and its value range is [0, 1];
S22: calculating the support: let Ci be a categorical attribute feature; the support of attribute value ci,j (ci,j ∈ Dom(Ci)) with respect to Ci, Sup(Ci|ci,j), is the number of data objects in data set D whose attribute Ci equals ci,j;
S23: calculating the categorical-attribute similarity between samples: for x, y ∈ D, the categorical-attribute similarity of x and y is a weighted sum over the categorical attributes, where Wi denotes the weight of categorical attribute feature Ci and Hc(Ci) denotes the entropy of Ci:
Hc(Ci) = −Σj p(ci,j) log p(ci,j)
where p(ci,j) denotes the probability of attribute value ci,j in Ci, i.e. p(ci,j) = Sup(Ci|ci,j)/|D|; the categorical-attribute similarity expresses how similar two objects are on their categorical attributes, and its value range is [0, 1];
S24: calculating the mixed-attribute distance between samples: for x, y ∈ D, the overall similarity S(x, y) combines the numeric-attribute and categorical-attribute similarities, and the distance between x and y is d(x, y) = 1 − S(x, y).
S3: the cluster center points are obtained by a twofold residual analysis, specifically:
S31: calculating the local density ρi of each sample point xi, i.e. the number of sample points whose distance to xi is less than the cut-off distance dc;
S32: calculating the distance δi of each sample point xi, i.e. the minimum distance from xi to any sample point of higher local density (for the point of highest density, δi is taken as the maximum distance to any other point);
S33: calculating the weight γi of each sample point xi:
γi = ρi * δi
and sorting all sample points by weight in descending order, the rank of xi being Ri;
S34: fitting the equation y = b0 + b1x to the ρi-δi and the Ri-γi distribution plots by linear regression, and performing residual analysis on each of the two regression results to obtain the outlier points; the points that appear as outliers in both the ρi-δi and the Ri-γi plots are the cluster center points.
S4: the optimal cut-off distance is obtained by iterative updating with a genetic algorithm according to a fitness function, specifically:
S41: defining the fitness function in terms of the between-cluster distance sum and the within-cluster distance sum, where the between-cluster term measures the distances from the cluster centers to the global center of all data points and the within-cluster term measures the distances from the cluster objects to their cluster centers after a given iteration; unk ∈ {0, 1}, and unk = 1 indicates that data point n belongs to the k-th cluster, otherwise it does not; for the global center of all data points, each numeric dimension l takes the mean of that dimension over all points, each categorical dimension l takes the most frequent category of that dimension, and ||ck|| denotes the number of data objects in the k-th cluster;
S42: setting the value range of the cut-off distance dc such that the average local density of all data objects is 1%-20% of the total number of points in the data set, giving the interval [dc_low, dc_high]; then selecting z values of dc at random to form the first generation of the dc population; performing the cluster division for each dc according to the cluster-division method of the density peaks clustering algorithm (DPC); and then evolving each generation of the dc population through the iterative process of the genetic algorithm, finally obtaining the optimal cut-off distance dcBest and the corresponding cluster center points.
S5: each remaining sample point is assigned to the cluster of its nearest higher-density neighbor, thereby completing the clustering.
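The assignment of step S5 can be sketched as follows. This is the standard DPC assignment rule; it assumes the highest-density point is among the detected centers, so the nearest higher-density neighbor of every visited point is already labeled:

```python
import numpy as np

def assign_clusters(dist, rho, center_idx):
    """Step S5: visit non-center points in order of decreasing density and
    give each the label of its nearest neighbor among all points of
    strictly higher density."""
    n = dist.shape[0]
    labels = np.full(n, -1)
    for k, c in enumerate(center_idx):
        labels[c] = k
    for i in np.argsort(-rho):                  # descending density
        if labels[i] != -1:
            continue                            # already a labeled center
        higher = np.where(rho > rho[i])[0]
        nearest = higher[np.argmin(dist[i, higher])]
        labels[i] = labels[nearest]
    return labels
```

Visiting points in decreasing-density order guarantees that, by the time a point is reached, its nearest higher-density neighbor has already received a label, so a single pass suffices.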
Finally, it is noted that the above embodiments are only intended to illustrate the technical solution of the invention and not to limit it. Although the invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that the technical solution of the invention may be modified or equivalently replaced without departing from the purpose and scope of the technical solution, and all such modifications shall be covered by the scope of the claims of the invention.
Claims (5)
1. A density peaks clustering method for mixed-attribute data sets, characterized in that the method specifically comprises the following steps:
S1: obtaining data and preprocessing it;
S2: calculating the mixed-attribute distance between sample points;
S3: obtaining the cluster center points by a twofold residual analysis;
S4: iteratively updating with a genetic algorithm according to a fitness function to obtain the optimal cut-off distance;
S5: assigning each remaining sample point to the cluster of its nearest higher-density neighbor, thereby completing the clustering.
2. The density peaks clustering method for mixed-attribute data sets according to claim 1, characterized in that the step S1 specifically comprises the following steps:
S11: obtaining a mixed-attribute data set D with N data sample points and M attribute features, of which T are numeric attribute features and U are categorical attribute features, i.e. M = T + U; Dom(Ci) denotes the value set of categorical attribute feature Ci, and Dom(Ci) = {ci,1, ci,2, ..., ci,f} indicates that attribute Ci has f distinct values;
S12: filling in or deleting the null values, illegal values, inconsistent data, and duplicate records in the mixed-attribute data set;
S13: applying min-max linear normalization to all numeric attribute features:
X'ij = (Xij − minXj) / (maxXj − minXj)
where Xj denotes the data of the j-th column, minXj and maxXj denote the minimum and maximum values in the j-th column, Xij denotes the actual value in row i, column j, and X'ij denotes the normalized value.
3. The density peaks clustering method for mixed-attribute data sets according to claim 2, characterized in that the step S2 specifically comprises the following steps:
S21: calculating the numeric-attribute similarity between samples: for x, y ∈ D, the numeric-attribute similarity of x and y is computed from dr(x, y), the Euclidean distance between x and y over the numeric attributes;
S22: calculating the support: let Ci be a categorical attribute feature; the support of attribute value ci,j with respect to Ci is the number of data objects in data set D whose attribute Ci equals ci,j;
S23: calculating the categorical-attribute similarity between samples: for x, y ∈ D, the categorical-attribute similarity of x and y is a weighted sum over the categorical attributes, where Wi denotes the weight of categorical attribute feature Ci and Hc(Ci) denotes the entropy of Ci:
Hc(Ci) = −Σj p(ci,j) log p(ci,j)
where p(ci,j) denotes the probability of attribute value ci,j in Ci, i.e. p(ci,j) = Sup(Ci|ci,j)/|D|;
S24: calculating the mixed-attribute distance between samples: for x, y ∈ D, the overall similarity S(x, y) combines the numeric-attribute and categorical-attribute similarities, and the distance between x and y is d(x, y) = 1 − S(x, y).
4. The density peaks clustering method for mixed-attribute data sets according to claim 3, characterized in that the step S3 specifically comprises the following steps:
S31: calculating the local density ρi of each sample point xi, i.e. the number of sample points whose distance to xi is less than the cut-off distance dc;
S32: calculating the distance δi of each sample point xi, i.e. the minimum distance from xi to any sample point of higher local density;
S33: calculating the weight γi of each sample point xi:
γi = ρi * δi
and sorting all sample points by weight in descending order, the rank of xi being Ri;
S34: fitting the equation y = b0 + b1x to the ρi-δi and the Ri-γi distribution plots by linear regression, and performing residual analysis on each of the two regression results to obtain the outlier points; the points that appear as outliers in both the ρi-δi and the Ri-γi plots are the cluster center points.
5. The density peaks clustering method for mixed-attribute data sets according to claim 4, characterized in that the step S4 specifically comprises the following steps:
S41: defining the fitness function in terms of the between-cluster distance sum and the within-cluster distance sum, where the cluster objects and cluster centers are those in the clustering result after a given iteration and the global center of all data points is also used;
S42: setting the value range of the cut-off distance dc such that the average local density of all data objects is 1%-20% of the total number of points in the data set, giving the interval [dc_low, dc_high]; then selecting z values of dc at random to form the first generation of the dc population; performing the cluster division for each dc according to the cluster-division method of the density peaks clustering algorithm (DPC); and then evolving each generation of the dc population through the iterative process of the genetic algorithm, finally obtaining the optimal cut-off distance dcBest and the corresponding cluster center points.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910171730.0A | 2019-03-07 | 2019-03-07 | A density peaks clustering method for mixed-attribute data sets
Publications (1)
Publication Number | Publication Date
---|---
CN109919227A (en) | 2019-06-21
Family (ID=66963816)
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201910171730.0A | A density peaks clustering method for mixed-attribute data sets | 2019-03-07 | 2019-03-07
Country Status (1)
Country | Link
---|---
CN | CN109919227A (en)
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN110772267A (granted as CN110772267B, 2022-04-19) | 2019-11-07 | 2020-02-11 | 中国人民解放军63850部队 | Human body physiological fatigue data marking method and fatigue identification model
CN111339294A | 2020-02-11 | 2020-06-26 | 普信恒业科技发展(北京)有限公司 | Client data classification method and device and electronic equipment
CN113158817A (granted as CN113158817B, 2023-07-18) | 2021-03-29 | 2021-07-23 | 南京信息工程大学 | Objective weather typing method based on rapid density peak clustering
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication
Application publication date: 20190621