CN105787520B - Algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search - Google Patents
- Publication number
- CN105787520B (application number CN201610179542.9A)
- Authority
- CN
- China
- Prior art keywords
- neighbors
- nearest
- shared
- cluster
- search
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the field of data mining and, more particularly, relates to an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search. The algorithm first performs a natural nearest-neighbor search on the data set; the search terminates when the number of points in the data set that have no shared nearest neighbors no longer changes, which yields the search neighbor number n. Using the proposed definition of natural shared neighbors, it then computes the natural shared nearest-neighbor relationship that each object obtains under n neighbors. Once the natural neighbor search based on shared nearest neighbors has determined this relationship for every object, the data are clustered and outliers are identified according to it. The algorithm introduces a new shared nearest-neighbor relationship and a new termination condition for the natural neighbor search, addressing the poor clustering quality and low outlier-detection accuracy that existing algorithms suffer because their definition of the natural neighborhood is not strict enough and their search termination condition is not well founded.
Description
Technical field
The invention belongs to the field of data mining and, more particularly, relates to an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search.
Background art
With the explosive growth of data and the continued development of big-data technologies such as cloud computing, data mining is receiving more and more attention. Mining clusters and outliers is a very important data mining technique: it helps uncover valuable information so that data can be analyzed effectively.
A natural nearest-neighbor algorithm already exists. It does not require the user to specify the number of nearest neighbors; the neighborhood relationships form naturally and are used to cluster the data, and some algorithms go on to detect outliers on the basis of the clustering. In existing natural nearest-neighbor algorithms, however, the definition of natural neighbors and the termination condition of the search are not well founded, so clustering quality is poor and outlier-detection accuracy is low. The present invention therefore proposes an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search. The algorithm refines the natural-neighbor definition into a shared nearest-neighbor definition and improves the search termination condition, so that the neighborhoods found are more sound, the clustering result better matches the true distribution of the data, and the detected outliers are more accurate.
Summary of the invention
To solve the above problems, the invention proposes an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search, characterized in that its specific steps are as follows.
Step 1: perform a natural nearest-neighbor search on a data set D composed of various plant growth parameters, where each dimension represents one growth parameter and each class of the data set can be separated from the other classes in advance. The search terminates when the number of points in the data set that have no shared nearest neighbors no longer changes, yielding the search neighbor number n. According to the proposed definition of natural shared neighbors, compute the natural shared nearest-neighbor relationship that each object obtains under n neighbors.
Step 2: once the natural neighbor search based on shared nearest neighbors has determined the natural shared nearest-neighbor relationship of each object, cluster the data and identify outliers according to this relationship.
Natural shared nearest neighbors are defined as follows: if object X regards object Y as its neighbor, Y regards X as its neighbor, and at least one neighbor of X is also a neighbor of Y, then X and Y are natural shared nearest neighbors of each other.
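A minimal Python sketch of this definition follows. The dictionary layout `knn` (object id mapped to the set of its k nearest neighbors) and the function name `is_natural_shared_neighbor` are illustrative assumptions, not part of the patent text.

```python
def is_natural_shared_neighbor(x, y, knn):
    """Check the definition for objects x and y.

    knn is a hypothetical mapping: object id -> set of its k nearest neighbors.
    """
    # X regards Y as a neighbor and Y regards X as a neighbor.
    mutual = y in knn[x] and x in knn[y]
    # At least one neighbor of X is also a neighbor of Y.
    has_common = bool(knn[x] & knn[y])
    return mutual and has_common


# Toy neighbor sets for k = 2 (made-up data, for illustration only).
knn2 = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"b", "c"},
}
# Toy neighbor sets for k = 1: mutual neighbors, but nothing in common.
knn1 = {"p": {"q"}, "q": {"p"}}

print(is_natural_shared_neighbor("a", "b", knn2))
print(is_natural_shared_neighbor("d", "c", knn2))
print(is_natural_shared_neighbor("p", "q", knn1))
```

Assuming an object is never listed among its own neighbors, a mutual pair cannot share a common neighbor when k = 1, so under this reading every object has 0 shared nearest neighbors at the first iteration of the search.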
The natural nearest-neighbor search on the data set in step 1 proceeds as follows:
(1) set the nearest-neighbor number k to 1;
(2) search for the k nearest neighbors of each object in the data set;
(3) after the search completes, compute the shared nearest neighbors of each object: if the k nearest neighbors of object a include object b, the k nearest neighbors of b include a, and the k nearest neighbors of a and of b have an object in common, then b is a shared nearest neighbor of a;
(4) count the number n1 of objects that have 0 shared nearest neighbors;
(5) set k = k + 1, return to step (2), and count the number n2 of objects that now have 0 shared nearest neighbors;
(6) if n2 = n1, stop the search; the final k is the nearest-neighbor number of the search, and the shared nearest neighbors of each object computed at this value constitute the natural shared nearest-neighbor relationship; otherwise update n1 to the current value of n2 and return to step (5).
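The six steps above can be sketched in Python as follows. This is one reading of the procedure under stated assumptions: Euclidean distance, ties broken by index, a brute-force neighbor search, and a guard that stops at k = N - 1 so the loop always terminates. None of these details is specified in the patent text, and all function names are hypothetical.

```python
import math


def k_nearest(points, k):
    """For each point index, return the set of indices of its k nearest other points."""
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),  # assumption: Euclidean, index tie-break
        )
        result.append(set(others[:k]))
    return result


def shared_neighbors(knn):
    """Step (3): b is a shared neighbor of a if they are mutual k-neighbors with a common k-neighbor."""
    return [
        {b for b in knn[a] if a in knn[b] and knn[a] & knn[b]}
        for a in range(len(knn))
    ]


def natural_shared_search(points):
    """Steps (1)-(6): grow k until the count of objects with 0 shared neighbors stops changing."""
    k = 1
    snn = shared_neighbors(k_nearest(points, k))
    n1 = sum(1 for s in snn if not s)
    while k < len(points) - 1:  # assumption: safety guard, not in the patent text
        k += 1
        snn = shared_neighbors(k_nearest(points, k))
        n2 = sum(1 for s in snn if not s)
        if n2 == n1:
            break
        n1 = n2
    return k, snn


# Made-up 2-D data: two tight groups and one distant point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
k, snn = natural_shared_search(points)
print(k, snn)
```

The returned `snn` list holds, for each object, its shared nearest neighbors at the final k; an empty set marks an object that no other object accepts as a natural shared neighbor.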
The clustering process in step 2 is as follows (here k denotes the class index):
(1) initially, no point in data set D is marked; take any point from the data set, mark it, and form a class c(k) from the point and its natural nearest neighbors;
(2) randomly take an unmarked point in class c(k), mark it, and add its natural nearest neighbors to the class; repeat until all points in the class are marked, then set k = k + 1;
(3) take an unmarked point from the data set and repeat the above process until all points in the data set are marked, which gives the final clustering result.
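The marking-and-expansion procedure above amounts to growing each class outward along the neighbor relation until no unmarked member remains. The following Python sketch illustrates it under assumptions: the input `neighbors` is a list of sets (index to natural shared nearest neighbors), and seeds are taken in index order rather than at random so that the result is reproducible. The function name is hypothetical.

```python
def cluster_by_expansion(neighbors):
    """Group objects into classes by repeatedly absorbing the neighbors of unmarked members."""
    marked = set()
    clusters = []
    for seed in range(len(neighbors)):
        if seed in marked:
            continue
        cluster = {seed}      # step (1): start a new class from an unmarked point
        frontier = [seed]     # members of the class that are not yet marked
        while frontier:
            p = frontier.pop()
            marked.add(p)     # step (2): mark the point ...
            for q in neighbors[p]:
                if q not in cluster:
                    cluster.add(q)      # ... and add its neighbors to the class
                    frontier.append(q)
        clusters.append(cluster)        # step (3): move on to the next unmarked point
    return clusters


# Made-up neighbor relation: two groups of three and one isolated object.
neighbors = [{1, 2}, {0, 2}, {0, 1}, {4, 5}, {3, 5}, {3, 4}, set()]
clusters = cluster_by_expansion(neighbors)
print(clusters)
```

An object with no natural shared nearest neighbors ends up in a class of its own, which is what lets the later outlier step pick it out by class size.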
The outlier discrimination process in step 2 is as follows:
Sort the k classes obtained by clustering from smallest to largest. If the i-th class c(i) satisfies condition one, |c(i)| < 10% |D|, and condition two (formula not reproduced in this text), then c(i) is regarded as an outlier or an outlying cluster.
Condition one treats clusters with few members as outliers or outlying clusters; condition two prevents the small clusters from being treated as outlying objects when the data set is divided into many small clusters.
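A hedged Python sketch of the discrimination step follows. Only condition one is stated explicitly in the text; the formula for condition two does not appear, so the sketch exposes it as a caller-supplied predicate `condition_two` whose default accepts every class. That default and the function name are assumptions made for illustration, not the patent's condition.

```python
def flag_outlying_clusters(clusters, total, condition_two=None):
    """Return the classes regarded as outliers or outlying clusters, smallest first.

    condition_two: hypothetical predicate (cluster, ordered_clusters, total) -> bool.
    ASSUMPTION: the patent's condition two is not recoverable from the text,
    so the default below accepts every class.
    """
    if condition_two is None:
        condition_two = lambda cluster, ordered, n: True
    ordered = sorted(clusters, key=len)  # arrange classes from smallest to largest
    flagged = []
    for c in ordered:
        condition_one = 10 * len(c) < total  # |c(i)| < 10% |D|, in integer arithmetic
        if condition_one and condition_two(c, ordered, total):
            flagged.append(c)
    return flagged


# Class sizes taken from the reported experiment: 49, 50, 1 and 9 objects (109 in total).
clusters = [set(range(49)), set(range(49, 99)), {99}, set(range(100, 109))]
flagged = flag_outlying_clusters(clusters, total=109)
print([len(c) for c in flagged])
```

With these sizes, condition one alone separates the classes of 1 and 9 objects from the two large classes, since 10% of 109 is 10.9.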
Beneficial effect
In view of the deficiencies of the prior art, the object of the invention is to provide an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search. The algorithm proposes a new shared nearest-neighbor relationship and a new termination condition for the natural neighbor search, solving the problems of poor clustering quality and low outlier-detection accuracy that arise in existing algorithms because the natural neighborhood definition is not strict enough and the search termination condition is not well founded.
Brief description of the drawings
Fig. 1 is a flowchart of the algorithm of the invention for discovering clusters and outliers based on natural shared nearest-neighbor search.
Detailed description of the embodiments
The invention is described in detail below with reference to the accompanying drawing. Fig. 1 is a flowchart of the algorithm of the invention for discovering clusters and outliers based on natural shared nearest-neighbor search.
An algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search, characterized in that its specific steps are as follows.
Step 1: perform a natural nearest-neighbor search on the data set. The search terminates when the number of points in the data set that have no shared nearest neighbors no longer changes, yielding the search neighbor number n. According to the proposed definition of natural shared neighbors, compute the natural shared nearest-neighbor relationship that each object obtains under n neighbors.
Step 2: once the natural neighbor search based on shared nearest neighbors has determined the natural shared nearest-neighbor relationship of each object, cluster the data and identify outliers according to this relationship.
Natural shared nearest neighbors are defined as follows: if object X regards object Y as its neighbor, Y regards X as its neighbor, and at least one neighbor of X is also a neighbor of Y, then X and Y are natural shared nearest neighbors of each other.
The natural nearest-neighbor search on the data set in step 1 proceeds as follows:
(1) set the nearest-neighbor number k to 1;
(2) search for the k nearest neighbors of each object in the data set;
(3) after the search completes, compute the shared nearest neighbors of each object: if the k nearest neighbors of object a include object b, the k nearest neighbors of b include a, and the k nearest neighbors of a and of b have an object in common, then b is a shared nearest neighbor of a;
(4) count the number n1 of objects that have 0 shared nearest neighbors;
(5) set k = k + 1, return to step (2), and count the number n2 of objects that now have 0 shared nearest neighbors;
(6) if n2 = n1, stop the search; the final k is the nearest-neighbor number of the search, and the shared nearest neighbors of each object computed at this value constitute the natural shared nearest-neighbor relationship; otherwise update n1 to the current value of n2 and return to step (5).
The clustering process in step 2 is as follows (here k denotes the class index):
(1) initially, no point in data set D is marked; take any point from the data set, mark it, and form a class c(k) from the point and its natural nearest neighbors;
(2) randomly take an unmarked point in class c(k), mark it, and add its natural nearest neighbors to the class; repeat until all points in the class are marked, then set k = k + 1;
(3) take an unmarked point from the data set and repeat the above process until all points in the data set are marked, which gives the final clustering result.
The outlier discrimination process in step 2 is as follows:
Sort the k classes obtained by clustering from smallest to largest. If the i-th class c(i) satisfies condition one, |c(i)| < 10% |D|, and condition two (formula not reproduced in this text), then c(i) is regarded as an outlier or an outlying cluster.
Condition one treats clusters with few members as outliers or outlying clusters; condition two prevents the small clusters from being treated as outlying objects when the data set is divided into many small clusters.
The data set is the Iris Plants data set from the UCI standard data sets. It contains 3 classes with 150 objects in total, and each object has 5 dimensions. The invention uses the first two classes as clusters and selects 9 points from the third class to serve as outliers and an outlying cluster. The proposed algorithm is then applied to this data set to detect clusters and outliers and so verify its effectiveness.
1. A natural nearest-neighbor search is performed on the data set; the algorithm terminates when the number of points without shared nearest neighbors no longer changes, and the resulting search neighbor number is 11.
2. According to the proposed definition of natural shared neighbors, the natural shared nearest-neighbor relationship that each object obtains under 11 neighbors is computed.
3. Based on the natural shared nearest-neighbor relationship, the data are clustered, yielding two classes of 49 and 50 objects, 1 outlier, and one outlying cluster of 9 objects.
It should be noted that the single outlier obtained is not a false detection. It is object No. 42 of the first class; although it is not one of the outliers we planted, it lies far from the core points of its cluster and is a local outlier. The classes and outlying cluster obtained by the algorithm of the invention therefore fully match the actual distribution of the data set, and both the clustering accuracy and the outlier-detection accuracy are 100%.
By contrast, clustering and outlier detection based on the existing natural neighbor search yields two classes of 42 and 67 objects. That algorithm wrongly assigns 7 objects of the first class to the second class, so the clustering is skewed and does not match the true distribution of the data. Moreover, the method fails to detect the outlying cluster of 9 points and instead wrongly assigns it to the second class, so it cannot detect outliers effectively.
Comparing the proposed algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search with the existing algorithm shows that the proposed algorithm improves both clustering quality and detection accuracy.
Claims (2)
1. An algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search, characterized in that the specific steps of the algorithm are:
Step 1: perform a natural nearest-neighbor search on a data set D composed of various plant growth parameters, where each dimension represents one growth parameter and each class of the data set can be separated from the other classes in advance; the search terminates when the number of points in the data set that have no shared nearest neighbors no longer changes, yielding the search neighbor number n; according to the proposed definition of natural shared neighbors, compute the natural shared nearest-neighbor relationship that each object obtains under n neighbors;
Step 2: once the natural neighbor search based on shared nearest neighbors has determined the natural shared nearest-neighbor relationship of each object, cluster the data and identify outliers according to this relationship;
the data set D is the Iris Plants data set from the UCI standard data sets;
the clustering process in step 2 is:
(1) initially, no point in data set D is marked; take any point from the data set, mark it, and form a class c(k) from the point and its natural nearest neighbors;
(2) randomly take an unmarked point in class c(k), mark it, and add its natural nearest neighbors to the class; repeat until all points in the class are marked, then set k = k + 1;
(3) take an unmarked point from the data set and repeat the above process until all points in the data set are marked, which gives the final clustering result;
the outlier discrimination process in step 2 is:
sort the k classes obtained by clustering from smallest to largest; if the i-th class c(i) satisfies condition one, |c(i)| < 10% |D|, and condition two (formula not reproduced in this text), then c(i) is regarded as an outlier or an outlying cluster;
condition one treats clusters with few members as outliers or outlying clusters, and condition two prevents the small clusters from being treated as outlying objects when the data set is divided into many small clusters;
natural shared nearest neighbors are defined as follows: if object X regards object Y as its neighbor, Y regards X as its neighbor, and at least one neighbor of X is also a neighbor of Y, then X and Y are natural shared nearest neighbors of each other.
2. The algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search according to claim 1, characterized in that the natural nearest-neighbor search on the data set in step 1 proceeds as follows:
(1) set the nearest-neighbor number M to 1;
(2) search for the M nearest neighbors of each object in the data set;
(3) after the search completes, compute the shared nearest neighbors of each object: if the M nearest neighbors of object a include object b, the M nearest neighbors of b include a, and the M nearest neighbors of a and of b have an object in common, then b is a shared nearest neighbor of a;
(4) count the number n1 of objects that have 0 shared nearest neighbors;
(5) set M = M + 1, return to step (2), and count the number n2 of objects that now have 0 shared nearest neighbors;
(6) if n2 = n1, stop the search; the final M is the nearest-neighbor number of the search, and the shared nearest neighbors of each object computed at this value constitute the natural shared nearest-neighbor relationship; otherwise update n1 to the current value of n2 and return to step (5).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610179542.9A CN105787520B (en) | 2016-03-25 | 2016-03-25 | Algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610179542.9A CN105787520B (en) | 2016-03-25 | 2016-03-25 | Algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105787520A (en) | 2016-07-20 |
CN105787520B (en) | 2019-09-20 |
Family
ID=56391086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610179542.9A (Expired - Fee Related), CN105787520B (en) | Algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search | 2016-03-25 | 2016-03-25 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105787520B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108337226A (en) * | 2017-12-19 | 2018-07-27 | 中国科学院声学研究所 | The detection method and embedded intelligent terminal of embedded intelligent terminal abnormal data |
CN108765954B (en) * | 2018-06-13 | 2022-05-24 | 上海应用技术大学 | Road traffic safety condition monitoring method based on SNN density ST-OPTIC improved clustering algorithm |
CN113158871B (en) * | 2021-04-15 | 2022-08-02 | 重庆大学 | Wireless signal intensity abnormity detection method based on density core |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104200206B (en) * | 2014-09-09 | 2017-04-26 | 武汉大学 | Double-angle sequencing optimization based pedestrian re-identification method |
CN104217015B (en) * | 2014-09-22 | 2017-11-03 | 西安理工大学 | Based on the hierarchy clustering method for sharing arest neighbors each other |
CN104391925A (en) * | 2014-11-20 | 2015-03-04 | 四川长虹电器股份有限公司 | Video recommendation method and system based on TV (television) user collaborative forecasting |
CN105117485B (en) * | 2015-09-17 | 2018-07-20 | 深圳大学 | A kind of high-accuracy overall situation outlier detection algorithm based on k very neighbours |
- 2016-03-25: CN application CN201610179542.9A, patent CN105787520B (en), not active (Expired - Fee Related)
Also Published As
Publication number | Publication date |
---|---|
CN105787520A (en) | 2016-07-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20190920; termination date: 20200325 |