CN105787520B - Algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search - Google Patents

Algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search

Info

Publication number
CN105787520B
Authority
CN
China
Prior art keywords
neighbors
nearest
shared
cluster
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610179542.9A
Other languages
Chinese (zh)
Other versions
CN105787520A (en)
Inventor
高红菊
刘艳哲
储汪兵
刘继文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Agricultural University
Original Assignee
China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Agricultural University
Priority to CN201610179542.9A
Publication of CN105787520A
Application granted
Publication of CN105787520B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of data mining and, more particularly, relates to an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search. A natural nearest-neighbor search is first performed on the data set; the search ends when the number of points in the data set that have no shared nearest neighbor no longer changes, which yields the search nearest-neighbor number n. Using the proposed definition of natural shared neighbors, the natural shared nearest-neighbor relationship of each object under n neighbors is computed. With the natural shared nearest-neighborhood relationship of each object determined by this shared-nearest-neighbor-based natural neighbor search, the data are clustered and outliers are identified according to that relationship. The algorithm proposes a new shared nearest-neighbor relationship and a new termination condition for the natural neighbor search, addressing the poor clustering and low outlier-detection precision that existing algorithms suffer because their definition of the natural neighborhood is not strict enough and their search condition is not well founded.

Description

Algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search
Technical field
The invention belongs to the field of data mining and, more particularly, relates to an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search.
Background
With the explosive growth of data and the continued development of big-data technologies such as cloud computing, data mining is receiving more and more attention. Mining clusters and outliers is an important data mining technique: it helps uncover valuable information so that data can be analyzed effectively.
A natural nearest-neighbor algorithm already exists. It does not require the user to specify the number of nearest neighbors; neighborhood relationships form naturally and are used to cluster the data, and some algorithms perform outlier detection on top of that clustering. However, in the existing natural nearest-neighbor algorithm, the definition of natural neighbors and the termination condition of the search are not well founded, so clustering quality is poor and outlier-detection precision is low. For this reason, the invention proposes an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search. The algorithm refines the definition of natural neighbors into a definition of shared nearest neighbors and improves the search termination condition, so that the discovered neighborhoods are better founded, the clustering result better matches the true distribution of the data, and the detected outliers are more accurate.
Summary of the invention
To solve the above problems, the invention proposes an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search, characterized by the following steps:
Step 1: perform a natural nearest-neighbor search on a data set D composed of various plant growth parameters, where each dimension represents one growth parameter and each class of the data set can be separated from the other classes in advance. The search ends when the number of points in the data set that have no shared nearest neighbor no longer changes, which yields the search nearest-neighbor number n. Using the proposed definition of natural shared neighbors, compute the natural shared nearest-neighbor relationship of each object under n neighbors.
Step 2: with the natural shared nearest-neighborhood relationship of each object determined by the shared-nearest-neighbor-based natural neighbor search, cluster the data and identify outliers according to this relationship.
Natural shared nearest neighbors are defined as follows: if object X regards object Y as its neighbor, Y regards X as its neighbor, and at least one neighbor of X is also a neighbor of Y, then X and Y are natural shared nearest neighbors of each other.
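The definition above can be read as a three-part predicate: X lists Y, Y lists X, and the two neighbor lists overlap. The following is a minimal sketch of that predicate, not the patent's own code; the function name `is_natural_shared_neighbor`, the dictionary-of-sets representation, and the toy neighbor lists are all assumptions made for illustration.

```python
def is_natural_shared_neighbor(x, y, neighbors):
    """Check the three conditions of the definition.

    `neighbors` is a hypothetical mapping from each object id to the set of
    ids it regards as its neighbors (for example, its k nearest neighbors).
    """
    nx, ny = neighbors[x], neighbors[y]
    mutual = y in nx and x in ny      # X lists Y and Y lists X
    overlap = len(nx & ny) > 0        # at least one neighbor in common
    return mutual and overlap


# Toy neighbor lists, invented for this example only.
neighbors = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "D"},
    "D": {"C", "E"},
    "E": {"D", "C"},
}

ab = is_natural_shared_neighbor("A", "B", neighbors)  # mutual, and both list C
cd = is_natural_shared_neighbor("C", "D", neighbors)  # mutual, but no common neighbor
print(ab, cd)
```

In this toy data A and B qualify, while C and D are mutual neighbors that fail the third condition, which is what makes the relation stricter than plain mutual neighborhood.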
The natural nearest-neighbor search on the data set in step 1 proceeds as follows:
(1) set the nearest-neighbor number k to 1;
(2) search for the k nearest neighbors of each object in the data set;
(3) after the search completes, compute the shared nearest neighbors of each object: if the k neighbors of object a include object b, the k neighbors of b include a, and the k neighbors of a and the k neighbors of b have an object in common, then b is a shared nearest neighbor of a;
(4) count the number n1 of objects that have 0 shared nearest neighbors;
(5) set k = k + 1, return to step (2), and count the number n2 of objects that now have 0 shared nearest neighbors;
(6) if n2 = n1, stop the search; the final k is the nearest-neighbor number of the search algorithm, and the shared nearest neighbors of each object computed at this value are the natural shared nearest-neighbor relationships. Otherwise, update n1 to the current value of n2 and return to step (5).
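Steps (1) to (6) can be sketched as a loop that grows k until the count of objects without shared nearest neighbors stops changing. This is an illustrative reading under stated assumptions, not the patented implementation: Euclidean distance, brute-force neighbor search, ties broken by index, the upper bound on k, and the names `knn_sets`, `shared_neighbors`, and `natural_shared_search` are all choices made here.

```python
import math


def knn_sets(points, k):
    """For each point index, return the set of indices of its k nearest neighbors.

    Assumption: Euclidean distance, brute force, ties broken by lower index.
    """
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        result.append(set(others[:k]))
    return result


def shared_neighbors(knn):
    """Step (3): b is a shared nearest neighbor of a if a and b list each
    other and their k-neighbor sets have at least one object in common."""
    shared = [set() for _ in knn]
    for a, neigh_a in enumerate(knn):
        for b in neigh_a:
            if a in knn[b] and neigh_a & knn[b]:
                shared[a].add(b)
    return shared


def natural_shared_search(points):
    """Steps (1)-(6): increase k until the number of objects with zero
    shared nearest neighbors is the same for two consecutive values of k."""
    k = 1
    shared = shared_neighbors(knn_sets(points, k))
    n1 = sum(1 for s in shared if not s)
    # Assumption: k is capped at len(points) - 1 so the loop always ends.
    while k < len(points) - 1:
        k += 1
        shared = shared_neighbors(knn_sets(points, k))
        n2 = sum(1 for s in shared if not s)
        if n2 == n1:
            break
        n1 = n2
    return k, shared


# Invented toy data: two tight groups and one distant point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
k, shared = natural_shared_search(points)
print(k, shared)
```

On this toy data the distant point never acquires a shared nearest neighbor, so the count stabilizes once both groups are internally connected. Note that with k = 1 no pair can have a common neighbor, so the first count always equals the number of objects.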
The clustering process in step 2 is:
(1) initially, none of the points in data set D is marked; take any point in the data set, mark it, and form a class c(k) from the point and its natural nearest neighbors;
(2) randomly take an unmarked point in class c(k), mark it, and add its natural nearest neighbors to the class; repeat until all points in the class are marked, then set k = k + 1;
(3) take any unmarked point in the data set and repeat the above process until all points in the data set are marked, which gives the final clustering result.
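The marking procedure above amounts to growing each class outward through neighbor links until no unmarked member remains. A minimal sketch follows, assuming the natural neighbor relation is given as a list of index sets; the function name `cluster_by_neighbors`, the deterministic seed order (the text allows any unmarked point), and the sample relation are assumptions for illustration.

```python
def cluster_by_neighbors(shared):
    """Grow classes by repeatedly marking a point and pulling in its neighbors.

    `shared[i]` is the set of natural neighbors of point i (hypothetical input).
    Seeds are taken in index order here instead of at random.
    """
    n = len(shared)
    label = [None] * n
    clusters = []
    for seed in range(n):
        if label[seed] is not None:
            continue                      # already belongs to a class
        cid = len(clusters)
        label[seed] = cid
        members, pending = [], [seed]
        while pending:                    # points of this class still to be marked
            p = pending.pop()
            members.append(p)
            for q in shared[p]:
                if label[q] is None:      # add unseen neighbors to the class
                    label[q] = cid
                    pending.append(q)
        clusters.append(sorted(members))
    return clusters


# Invented neighbor relation: two groups of three and one isolated point.
shared = [{1, 2}, {0, 2}, {0, 1}, {4, 5}, {3, 5}, {3, 4}, set()]
clusters = cluster_by_neighbors(shared)
print(clusters)
```

Because the shared-neighbor relation is symmetric, the resulting classes do not depend on which unmarked point is picked first; a point with no natural neighbors ends up in a class of its own.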
The outlier identification process in step 2 is:
Sort the k classes obtained by clustering from small to large. If the i-th class c(i) satisfies condition one, |c(i)| < 10% |D|, and condition two [formula missing in the source text], then c(i) is considered an outlier or an outlier cluster.
Condition one treats clusters with few members as outliers or outlier clusters; condition two prevents small clusters from being treated as outlying objects when the data set is divided into many small clusters.
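Condition one can be illustrated on its own. The sketch below applies only the size test |c(i)| < 10% |D|; condition two is deliberately not implemented because its formula is not available in the text, so the result should be read as a list of candidates, not a final decision. The function name `small_cluster_candidates`, the `ratio` parameter, and the sample cluster sizes are assumptions for illustration.

```python
def small_cluster_candidates(clusters, ratio=0.10):
    """Return classes whose size is below `ratio` of the whole data set.

    Implements condition one only. Condition two, which guards against a
    data set split into many small clusters, is NOT checked here.
    """
    total = sum(len(c) for c in clusters)   # |D|
    ordered = sorted(clusters, key=len)     # smallest classes first
    return [c for c in ordered if len(c) < ratio * total]


# Invented classes of sizes 49, 50, 1 and 9 (109 objects in total).
clusters = [
    list(range(0, 49)),
    list(range(49, 99)),
    [99],
    list(range(100, 109)),
]
candidates = small_cluster_candidates(clusters)
print([len(c) for c in candidates])
```

With 109 objects the threshold is 10.9, so only the classes of size 1 and 9 pass the size test.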
Beneficial effect
In view of the deficiencies of the prior art, the object of the invention is to provide an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search. The algorithm proposes a new shared nearest-neighbor relationship and a new termination condition for the natural neighbor search, addressing the poor clustering and low outlier-detection precision that existing algorithms suffer because their definition of the natural neighborhood is not strict enough and their search condition is not well founded.
Brief description of the drawings
Fig. 1 is a flowchart of the algorithm of the invention for discovering clusters and outliers based on natural shared nearest-neighbor search.
Detailed description
The invention is described in detail below with reference to the accompanying drawing. Fig. 1 is a flowchart of the algorithm of the invention for discovering clusters and outliers based on natural shared nearest-neighbor search.
An algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search, characterized by the following steps:
Step 1: perform a natural nearest-neighbor search on the data set. The search ends when the number of points in the data set that have no shared nearest neighbor no longer changes, which yields the search nearest-neighbor number n. Using the proposed definition of natural shared neighbors, compute the natural shared nearest-neighbor relationship of each object under n neighbors.
Step 2: with the natural shared nearest-neighborhood relationship of each object determined by the shared-nearest-neighbor-based natural neighbor search, cluster the data and identify outliers according to this relationship.
Natural shared nearest neighbors are defined as follows: if object X regards object Y as its neighbor, Y regards X as its neighbor, and at least one neighbor of X is also a neighbor of Y, then X and Y are natural shared nearest neighbors of each other.
The natural nearest-neighbor search on the data set in step 1 proceeds as follows:
(1) set the nearest-neighbor number k to 1;
(2) search for the k nearest neighbors of each object in the data set;
(3) after the search completes, compute the shared nearest neighbors of each object: if the k neighbors of object a include object b, the k neighbors of b include a, and the k neighbors of a and the k neighbors of b have an object in common, then b is a shared nearest neighbor of a;
(4) count the number n1 of objects that have 0 shared nearest neighbors;
(5) set k = k + 1, return to step (2), and count the number n2 of objects that now have 0 shared nearest neighbors;
(6) if n2 = n1, stop the search; the final k is the nearest-neighbor number of the search algorithm, and the shared nearest neighbors of each object computed at this value are the natural shared nearest-neighbor relationships. Otherwise, return to step (5).
The clustering process in step 2 is:
(1) initially, none of the points in data set D is marked; take any point in the data set, mark it, and form a class c(k) from the point and its natural nearest neighbors;
(2) randomly take an unmarked point in class c(k), mark it, and add its natural nearest neighbors to the class; repeat until all points in the class are marked, then set k = k + 1;
(3) take any unmarked point in the data set and repeat the above process until all points in the data set are marked, which gives the final clustering result.
The outlier identification process in step 2 is:
Sort the k classes obtained by clustering from large to small. If the i-th class c(i) satisfies condition one, |c(i)| < 10% |D|, and condition two [formula missing in the source text], then c(i) is considered an outlier or an outlier cluster.
Condition one treats clusters with few members as outliers or outlier clusters; condition two prevents small clusters from being treated as outlying objects when the data set is divided into many small clusters.
The data set is the Iris Plants data set from the UCI standard data sets. It contains 150 objects in 3 classes, and each object has 5 dimensions. The invention uses the first two classes as clusters and selects 9 points from the third class as outliers and an outlier cluster. The proposed algorithm is applied to this data set to detect clusters and outliers and thereby verify its effectiveness.
1. Perform a natural nearest-neighbor search on the data set. The algorithm ends when the number of points in the data set that have no shared nearest neighbor no longer changes; the resulting search nearest-neighbor number is 11.
2. Using the proposed definition of natural shared neighbors, compute the natural shared nearest-neighborhood of each object under 11 neighbors.
3. Cluster the data based on the natural shared nearest-neighbor relationships, obtaining two classes of sizes 49 and 50, 1 outlier, and one outlier cluster of size 9.
Note that the 1 outlier obtained is not a false detection. It is object No. 42 of the first class. Although it is not one of the outliers we set up, it lies far from the core points of its cluster and is therefore a local outlier. The classes and outlier cluster obtained with the algorithm of the invention thus fully match the actual distribution of the data set, and both clustering accuracy and outlier-detection accuracy are 100%.
By contrast, clustering the data with the existing clustering and outlier detection based on natural neighbor search yields two classes of sizes 42 and 67. That algorithm wrongly assigns 7 objects of the first class to the second class, so the clustering deviates from the true distribution of the data. In addition, it does not detect the outlier cluster of 9 points and instead wrongly assigns that cluster to the second class, so it cannot detect outliers effectively.
Comparing the proposed algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search with the existing algorithm shows that the proposed algorithm improves clustering quality and detection precision.

Claims (2)

1. An algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search, characterized in that the specific steps of the algorithm are:
Step 1: perform a natural nearest-neighbor search on a data set D composed of various plant growth parameters, where each dimension represents one growth parameter and each class of the data set can be separated from the other classes in advance; the search ends when the number of points in the data set that have no shared nearest neighbor no longer changes, which yields the search nearest-neighbor number n; using the proposed definition of natural shared neighbors, compute the natural shared nearest-neighbor relationship of each object under n neighbors;
Step 2: with the natural shared nearest-neighbor relationship of each object determined by the shared-nearest-neighbor-based natural neighbor search, cluster the data and identify outliers according to this relationship;
the data set D is the Iris Plants data set from the UCI standard data sets;
the clustering process in step 2 is:
(1) initially, none of the points in data set D is marked; take any point in the data set, mark it, and form a class c(k) from the point and its natural nearest neighbors;
(2) randomly take an unmarked point in class c(k), mark it, and add its natural nearest neighbors to the class; repeat until all points in the class are marked, then set k = k + 1;
(3) take any unmarked point in the data set and repeat the above process until all points in the data set are marked, which gives the final clustering result;
the outlier identification process in step 2 is:
sort the k classes obtained by clustering from small to large; if the i-th class c(i) satisfies condition one, |c(i)| < 10% |D|, and condition two [formula missing in the source text], then c(i) is considered an outlier or an outlier cluster;
condition one treats clusters with few members as outliers or outlier clusters, and condition two prevents small clusters from being treated as outlying objects when the data set is divided into many small clusters;
natural shared nearest neighbors are defined as follows: if object X regards object Y as its neighbor, Y regards X as its neighbor, and at least one neighbor of X is also a neighbor of Y, then X and Y are natural shared nearest neighbors of each other.
2. The algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search according to claim 1, characterized in that the natural nearest-neighbor search on the data set in step 1 proceeds as follows:
(1) set the nearest-neighbor number M to 1;
(2) search for the M nearest neighbors of each object in the data set;
(3) after the search completes, compute the shared nearest neighbors of each object: if the M neighbors of object a include object b, the M neighbors of b include a, and the M neighbors of a and the M neighbors of b have an object in common, then b is a shared nearest neighbor of a;
(4) count the number n1 of objects that have 0 shared nearest neighbors;
(5) set M = M + 1, return to step (2), and count the number n2 of objects that now have 0 shared nearest neighbors;
(6) if n2 = n1, stop the search; the final M is the nearest-neighbor number of the search algorithm, and the shared nearest neighbors of each object computed at this value are the natural shared nearest-neighbor relationships; otherwise, update n1 to the current value of n2 and return to step (5).
CN201610179542.9A 2016-03-25 2016-03-25 A kind of algorithm of discovery cluster and outlier based on naturally shared nearest-neighbors search Expired - Fee Related CN105787520B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201610179542.9A CN105787520B (en) | 2016-03-25 | 2016-03-25 | Algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201610179542.9A CN105787520B (en) | 2016-03-25 | 2016-03-25 | Algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search

Publications (2)

Publication Number | Publication Date
CN105787520A | 2016-07-20
CN105787520B | 2019-09-20

Family

ID=56391086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610179542.9A Expired - Fee Related CN105787520B (en) 2016-03-25 2016-03-25 A kind of algorithm of discovery cluster and outlier based on naturally shared nearest-neighbors search

Country Status (1)

Country Link
CN (1) CN105787520B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108337226A (en) * | 2017-12-19 | 2018-07-27 | Institute of Acoustics, Chinese Academy of Sciences | Method for detecting abnormal data of an embedded intelligent terminal, and embedded intelligent terminal
CN108765954B (en) * | 2018-06-13 | 2022-05-24 | Shanghai Institute of Technology | Road traffic safety condition monitoring method based on SNN density ST-OPTIC improved clustering algorithm
CN113158871B (en) * | 2021-04-15 | 2022-08-02 | Chongqing University | Wireless signal strength anomaly detection method based on density core

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104200206B (en) * | 2014-09-09 | 2017-04-26 | Wuhan University | Double-angle sequencing optimization based pedestrian re-identification method
CN104217015B (en) * | 2014-09-22 | 2017-11-03 | Xi'an University of Technology | Hierarchical clustering method based on mutually shared nearest neighbors
CN104391925A (en) * | 2014-11-20 | 2015-03-04 | Sichuan Changhong Electric Co., Ltd. | Video recommendation method and system based on TV (television) user collaborative forecasting
CN105117485B (en) * | 2015-09-17 | 2018-07-20 | Shenzhen University | High-accuracy global outlier detection algorithm based on k very neighbours

Also Published As

Publication number Publication date
CN105787520A (en) 2016-07-20

Similar Documents

Publication Publication Date Title
CN104346481B (en) A kind of community detection method based on dynamic synchronization model
CN104462184B (en) A kind of large-scale data abnormality recognition method based on two-way sampling combination
Zhang et al. Analysis of power consumer behavior based on the complementation of K-means and DBSCAN
CN105787520B (en) A kind of algorithm of discovery cluster and outlier based on naturally shared nearest-neighbors search
CN107291847A (en) A kind of large-scale data Distributed Cluster processing method based on MapReduce
CN103888541A (en) Method and system for discovering cells fused with topology potential and spectral clustering
CN104317908B (en) Outlier detection method based on three decisions and distance
Ding et al. Constrained spectral clustering based controlled islanding
CN106789149A (en) Using the intrusion detection method of modified self-organizing feature neural network clustering algorithm
Hsu et al. Charting the evolution of biohydrogen production technology through a patent analysis
CN103150470A (en) Visualization method for concept drift of data stream in dynamic data environment
Zhou et al. Dempster–Shafer theory-based robust least squares support vector machine for stochastic modelling
TW202009803A (en) Prediction system and method for solar photovoltaic power generation
CN109409394A (en) A kind of cop-kmeans method and system based on semi-supervised clustering
CN107578445A (en) Image discriminant region extracting method based on convolution characteristic spectrum
Miao et al. Ultra-short-term prediction of wind power based on sample similarity analysis
Nithiyananthan et al. Enhanced R package-based cluster analysis fault identification models for three phase power system network
CN110298373A (en) Power network line telemetry clustering ensemble method based on comentropy Dynamic Programming
Ma The Research of Stock Predictive Model based on the Combination of CART and DBSCAN
CN103744899A (en) Distributed environment based mass data rapid classification method
CN105354243B (en) The frequent probability subgraph search method of parallelization based on merger cluster
Long et al. A skeleton-based community detection algorithm for directed networks
CN102521845B (en) Visual attention focus transfer track planning method based on graph theory
Liu et al. An agglomerative hierarchical clustering algorithm based on global distance measurement
Li et al. Research on DBSCAN for Extraction of Typical Scenarios in New Energy Power System

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190920

Termination date: 20200325