CN105787520B

CN105787520B - An algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search

Info

Publication number: CN105787520B
Application number: CN201610179542.9A
Authority: CN
Inventors: 高红菊; 刘艳哲; 储汪兵; 刘继文
Original assignee: China Agricultural University
Current assignee: China Agricultural University
Priority date: 2016-03-25
Filing date: 2016-03-25
Publication date: 2019-09-20
Anticipated expiration: 2036-03-25
Also published as: CN105787520A

Abstract

The invention belongs to the field of data mining and, more particularly, relates to an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search. The algorithm first performs a natural nearest-neighbor search on the data set; the search terminates when the number of points in the data set that have no shared nearest neighbor no longer changes, which yields the search neighbor count n. Then, according to the proposed definition of natural shared neighbors, the natural shared nearest-neighbor relationship that each object obtains under n neighbors is computed. Once the natural neighbor search algorithm based on shared nearest neighbors has determined the natural shared nearest-neighbor relationship of each object, the data are clustered and outliers are identified according to that relationship. The algorithm proposes a new shared nearest-neighbor relationship and a new termination condition for the natural neighbor search. This addresses the poor clustering and low outlier-detection accuracy of existing algorithms, which arise because their definition of the natural neighborhood is not strict enough and their search termination condition is not well founded.

Description

An algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search

Technical field

The invention belongs to the field of data mining and, more particularly, relates to an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search.

Background art

With the explosive growth of data and the continued development of big-data technologies such as cloud computing, data mining is receiving more and more attention. Mining clusters and outliers is a very important data mining technique: it helps to find valuable information and thus to analyze data effectively.

A natural nearest-neighbor algorithm already exists. It does not require the user to specify the number of nearest neighbors; the neighborhood relationships form naturally and are used to cluster the data, and there are also algorithms that perform outlier detection on top of that clustering. In the existing natural nearest-neighbor algorithm, however, the definition of natural neighbors and the termination condition of the search are not well founded, so the clustering result is poor and the outlier-detection accuracy is low. For this reason, the present invention proposes an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search. The algorithm refines the definition of natural neighbors into a definition of shared nearest neighbors and improves the search termination condition, so that the discovered neighborhoods are better founded, the clustering result better matches the true distribution of the data, and the detected outliers are more accurate.

Summary of the invention

To solve the above problems, the invention proposes an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search, characterized in that the specific steps of the algorithm are as follows:

Step 1: perform a natural nearest-neighbor search on a data set D composed of various plant growth parameters, where each dimension represents one growth parameter and each class of the data set can be separated in advance from the other classes. The search terminates when the number of points in the data set that have no shared nearest neighbor no longer changes, which yields the search neighbor count n. According to the proposed definition of natural shared neighbors, compute the natural shared nearest-neighbor relationship that each object obtains under n neighbors;

Step 2: once the natural neighbor search algorithm based on shared nearest neighbors has determined the natural shared nearest-neighbor relationship of each object, cluster the data and identify outliers according to that relationship.

The natural shared nearest neighbor is defined as follows: if object X regards object Y as its neighbor, Y regards X as its neighbor, and X and Y have at least one neighbor in common, then X and Y are natural shared nearest neighbors of each other.
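
The definition above can be read as a predicate over neighbor sets. The following is a minimal sketch, assuming each object's neighbors are already available as a set; the function name `is_shared_nearest_neighbor` and the toy `neighbors` mapping are illustrative and do not come from the patent.

```python
from typing import Dict, Hashable, Set


def is_shared_nearest_neighbor(
    neighbors: Dict[Hashable, Set[Hashable]], x: Hashable, y: Hashable
) -> bool:
    """Return True if x and y are natural shared nearest neighbors.

    `neighbors` maps each object to the set of objects it regards as
    its neighbors (an assumed input format).
    """
    if x == y:
        return False
    # X regards Y as a neighbor and Y regards X as a neighbor.
    mutual = y in neighbors[x] and x in neighbors[y]
    # X and Y have at least one neighbor in common.
    common = bool(neighbors[x] & neighbors[y])
    return mutual and common


# Toy neighbor sets, invented for illustration only.
neighbors = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "D"},
    "D": {"C", "E"},
    "E": {"D", "C"},
}

print(is_shared_nearest_neighbor(neighbors, "A", "B"))  # mutual, common neighbor C
print(is_shared_nearest_neighbor(neighbors, "C", "D"))  # mutual, but nothing in common
```

In this toy data, A and B qualify because they are mutual neighbors and both list C, whereas C and D are mutual neighbors with no neighbor in common and therefore do not qualify.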

The natural nearest-neighbor search on the data set in Step 1 proceeds as follows:

(1) Set the neighbor count k to 1;

(2) Search for the k nearest neighbors of each object in the data set;

(3) After the search completes, compute the shared nearest neighbors of each object: if the k nearest neighbors of object a include object b, the k nearest neighbors of b include a, and the k nearest neighbors of a and of b have an object in common, then b is a shared nearest neighbor of a;

(4) Count the number n1 of objects that have zero shared nearest neighbors;

(5) Set k = k + 1, return to step (2), and count the number n2 of objects that now have zero shared nearest neighbors;

(6) If n2 = n1, stop the search. The final k is the neighbor count of the search algorithm, and the shared nearest neighbors computed for each object at this value constitute the natural shared nearest-neighbor relationship. Otherwise, update n1 to the current value of n2 and return to step (5).
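
Steps (1) to (6) can be sketched as a loop that grows k until the count of objects without shared nearest neighbors stops changing. This is an illustrative sketch only: it assumes Euclidean distance, a brute-force neighbor search, index-based tie-breaking, and an upper bound on k as a safeguard, none of which the patent specifies. The function names and the sample `points` are hypothetical.

```python
import math
from typing import List, Sequence, Set, Tuple


def k_nearest(points: Sequence[Sequence[float]], k: int) -> List[Set[int]]:
    """Step (2): brute-force k nearest neighbors (Euclidean distance assumed)."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: (math.dist(p, points[j]), j))
        result.append(set(others[:k]))
    return result


def shared_neighbors(knn: List[Set[int]]) -> List[Set[int]]:
    """Step (3): b is a shared nearest neighbor of a if they are mutual
    k-neighbors and their k-neighbor sets have an object in common."""
    return [
        {b for b in knn[a] if a in knn[b] and bool(knn[a] & knn[b])}
        for a in range(len(knn))
    ]


def natural_shared_search(
    points: Sequence[Sequence[float]],
) -> Tuple[int, List[Set[int]]]:
    k = 1                                           # step (1)
    shared = shared_neighbors(k_nearest(points, k))
    n1 = sum(1 for s in shared if not s)            # step (4)
    while k < len(points) - 1:                      # safeguard, not in the source
        k += 1                                      # step (5)
        shared = shared_neighbors(k_nearest(points, k))
        n2 = sum(1 for s in shared if not s)
        if n2 == n1:                                # step (6): count is stable
            break
        n1 = n2
    return k, shared


# Two small, well-separated groups, invented for illustration.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
k, shared = natural_shared_search(points)
print(k, shared[0], shared[3])
```

On this sample the search stops at k = 3, and each point's shared nearest neighbors are the other two points of its own group.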

The clustering process in Step 2 is as follows:

(1) Initially, no point in the data set D is labeled. Take any point from the data set, label it, and form a class c(k) from that point and its natural nearest neighbors;

(2) Randomly take an unlabeled point from class c(k), label it, and add its natural nearest neighbors to the class; repeat until every point in the class is labeled, then set k = k + 1;

(3) Take any unlabeled point from the data set and repeat the above process until every point in the data set is labeled, which gives the final clustering result.
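
The clustering steps amount to growing each class outward from a seed through the neighbor relationship. Below is a minimal sketch under the assumption that the relationship is symmetric and is supplied as a mapping from each object to its set of natural shared nearest neighbors; the function name `cluster_by_neighbors` and the sample mapping are illustrative. Points are visited in a fixed order rather than at random, which gives the same classes when the relationship is symmetric.

```python
from typing import Dict, Hashable, List, Set


def cluster_by_neighbors(
    shared: Dict[Hashable, Set[Hashable]]
) -> List[Set[Hashable]]:
    """Grow classes from seeds by repeatedly adding neighbors of class members."""
    labeled: Set[Hashable] = set()
    clusters: List[Set[Hashable]] = []
    for seed in shared:                      # step (3): next unlabeled point
        if seed in labeled:
            continue
        labeled.add(seed)                    # step (1): label the seed ...
        cluster = {seed} | shared[seed]      # ... and start a class with its neighbors
        pending = [p for p in cluster if p not in labeled]
        while pending:                       # step (2): expand until all are labeled
            p = pending.pop()
            if p in labeled:
                continue
            labeled.add(p)
            new_members = shared[p] - cluster
            cluster |= new_members
            pending.extend(new_members)
        clusters.append(cluster)
    return clusters


# Toy relationship, invented for illustration: two groups and one isolated object.
shared = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1},
    3: {4, 5}, 4: {3, 5}, 5: {3, 4},
    6: set(),
}
print(cluster_by_neighbors(shared))
```

An object with no shared nearest neighbors, such as object 6 here, ends up in a class of its own, which is what later allows it to be examined as a possible outlier.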

The outlier discrimination process in Step 2 is as follows:

Sort the k classes obtained by clustering from smallest to largest. If the i-th class c(i) satisfies condition one, |c(i)| < 10% |D|, and condition two (formula missing from this text), then c(i) is regarded as an outlier or an outlying cluster;

Condition one treats clusters with few members as outliers or outlying clusters. Condition two prevents small clusters from being treated as outlying objects when the data set is divided into many small clusters.
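
Condition one is a simple size threshold and can be sketched directly. The formula for condition two is not available in this text, so the sketch below does not attempt to reproduce it: it accepts condition two as a caller-supplied function, and the default `always_true` is a hypothetical stand-in that imposes no restriction. The function names and the sample clusters are illustrative.

```python
from typing import Callable, Hashable, List, Sequence, Set

Cluster = Set[Hashable]


def always_true(cluster: Cluster, ordered: Sequence[Cluster]) -> bool:
    """Hypothetical stand-in for condition two, whose formula is not given here."""
    return True


def find_outlying_clusters(
    clusters: Sequence[Cluster],
    dataset_size: int,
    condition_two: Callable[[Cluster, Sequence[Cluster]], bool] = always_true,
    ratio: float = 0.10,
) -> List[Cluster]:
    """Return the classes that satisfy condition one and the supplied condition two."""
    ordered = sorted(clusters, key=len)      # smallest class first
    return [
        c for c in ordered
        if len(c) < ratio * dataset_size     # condition one: |c(i)| < 10% |D|
        and condition_two(c, ordered)        # condition two: supplied by the caller
    ]


# Toy classes of sizes 12, 10, 2 and 1 over a 25-object data set (invented).
clusters = [set(range(0, 12)), set(range(12, 22)), {22, 23}, {24}]
flagged = find_outlying_clusters(clusters, dataset_size=25)
print([sorted(c) for c in flagged])
```

With 25 objects the threshold is 2.5, so only the classes of size 1 and 2 pass condition one. A real implementation would replace `always_true` with the patent's actual condition two.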

Beneficial effects

In view of the deficiencies of the prior art, the object of the present invention is to provide an algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search. The algorithm proposes a new shared nearest-neighbor relationship and a new termination condition for the natural neighbor search. This solves the problem that existing algorithms cluster poorly and detect outliers with low accuracy because their definition of the natural neighborhood is not strict enough and their search termination condition is not well founded.

Description of the drawings

Fig. 1 is a flow chart of the algorithm of the present invention for discovering clusters and outliers based on natural shared nearest-neighbor search.

Specific embodiment

The present invention is described in detail below with reference to the accompanying drawing. Fig. 1 is a flow chart of the algorithm of the present invention for discovering clusters and outliers based on natural shared nearest-neighbor search.

An algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search, characterized in that the specific steps of the algorithm are as follows:

Step 1: perform a natural nearest-neighbor search on the data set. The search terminates when the number of points in the data set that have no shared nearest neighbor no longer changes, which yields the search neighbor count n. According to the proposed definition of natural shared neighbors, compute the natural shared nearest-neighbor relationship that each object obtains under n neighbors;

(1) Set the neighbor count k to 1;

(2) Search for the k nearest neighbors of each object in the data set;

(6) If n2 = n1, stop the search. The final k is the neighbor count of the search algorithm, and the shared nearest neighbors computed for each object at this value constitute the natural shared nearest-neighbor relationship. Otherwise, return to step (5).

The clustering process in Step 2 is as follows:

The outlier discrimination process in Step 2 is as follows:

Sort the k classes obtained by clustering from smallest to largest. If the i-th class c(i) satisfies condition one, |c(i)| < 10% |D|, and condition two (formula missing from this text), then c(i) is regarded as an outlier or an outlying cluster;

The data set is the Iris Plants data set from the UCI standard data sets. It contains 3 classes with 150 objects in total, and each object has 5 dimensions. The present invention uses the first two classes as clusters and selects 9 points from the third class as outliers and an outlying cluster. The proposed algorithm is then used to detect clusters and outliers in this data set in order to verify its effectiveness.

1. Perform the natural nearest-neighbor search on the data set. The algorithm terminates when the number of points in the data set that have no shared nearest neighbor no longer changes, which gives a search neighbor count of 11;

2. According to the proposed definition of natural shared neighbors, compute the natural shared nearest neighborhood that each object obtains under 11 neighbors;

3. Based on the natural shared nearest-neighbor relationship, cluster the data. This yields two classes of 49 and 50 objects, 1 outlier, and one outlying cluster of 9 objects.

It should be noted that the 1 outlier obtained is not a false detection. It is object No. 42 of the first class. Although it is not one of the outliers that we set, it lies far from the core points of its cluster and is therefore a local outlier. The classes and the outlying cluster obtained by the algorithm of the invention thus fully match the actual distribution of the data set, and both the clustering accuracy and the outlier-detection accuracy are 100%.

By contrast, clustering the data with the existing clustering and outlier-detection method based on natural neighbor search yields two classes of 42 and 67 objects. That algorithm wrongly assigns 7 objects of the first class to the second class, which shows that its clustering result is biased and does not match the true distribution of the data. In addition, that method does not detect the outlying cluster of 9 points but wrongly assigns it to the second class, so it cannot detect outliers effectively.
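
The object counts reported for the two methods can be cross-checked with simple arithmetic. The sketch below assumes that the evaluated data set consists of the two complete 50-object Iris classes plus the 9 selected points, which is an inference from the description rather than a figure stated in it; the variable names are illustrative.

```python
# Assumed size of the evaluated data set: two full classes plus 9 selected points.
evaluated_size = 50 + 50 + 9

# Sizes reported for the proposed algorithm and for the existing method.
proposed = {"class 1": 49, "class 2": 50, "outlier": 1, "outlying cluster": 9}
existing = {"class 1": 42, "class 2": 67}

# Both partitions account for every object in the evaluated data set.
assert sum(proposed.values()) == evaluated_size
assert sum(existing.values()) == evaluated_size

# Condition one: groups smaller than 10% of the data set.
threshold = 0.10 * evaluated_size
small_groups = [name for name, size in proposed.items() if size < threshold]
print(evaluated_size, small_groups)
```

Under that assumption both results cover 109 objects, and only the single outlier and the 9-object outlying cluster fall below the 10% size threshold of condition one. Condition two is not evaluated here.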

A comparison of the proposed algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search with the existing algorithm shows that the proposed algorithm improves both the clustering result and the detection accuracy.

Claims

1. An algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search, characterized in that the specific steps of the algorithm are as follows:

Step 1: perform a natural nearest-neighbor search on a data set D composed of various plant growth parameters, where each dimension represents one growth parameter and each class of the data set can be separated in advance from the other classes; the search terminates when the number of points in the data set that have no shared nearest neighbor no longer changes, which yields the search neighbor count n; according to the proposed definition of natural shared neighbors, compute the natural shared nearest-neighbor relationship that each object obtains under n neighbors;

Step 2: once the natural neighbor search algorithm based on shared nearest neighbors has determined the natural shared nearest-neighbor relationship of each object, cluster the data and identify outliers according to that relationship;

The data set D is the Iris Plants data set from the UCI standard data sets;

The clustering process in Step 2 is as follows:

(1) Initially, no point in the data set D is labeled; take any point from the data set, label it, and form a class c(k) from that point and its natural nearest neighbors;

(2) Randomly take an unlabeled point from class c(k), label it, and add its natural nearest neighbors to the class; repeat until every point in the class is labeled, then set k = k + 1;

(3) Take any unlabeled point from the data set and repeat the above process until every point in the data set is labeled, which gives the final clustering result;

The outlier discrimination process in Step 2 is as follows:

Sort the k classes obtained by clustering from smallest to largest; if the i-th class c(i) satisfies condition one, |c(i)| < 10% |D|, and condition two (formula missing from this text), then c(i) is regarded as an outlier or an outlying cluster;

Condition one treats clusters with few members as outliers or outlying clusters; condition two prevents small clusters from being treated as outlying objects when the data set is divided into many small clusters;

The natural shared nearest neighbor is defined as follows: if object X regards object Y as its neighbor, Y regards X as its neighbor, and X and Y have at least one neighbor in common, then X and Y are natural shared nearest neighbors of each other.

2. The algorithm for discovering clusters and outliers based on natural shared nearest-neighbor search according to claim 1, characterized in that the natural nearest-neighbor search on the data set in Step 1 proceeds as follows:

(1) Set the neighbor count M to 1;

(2) Search for the M nearest neighbors of each object in the data set;

(3) After the search completes, compute the shared nearest neighbors of each object: if the M nearest neighbors of object a include object b, the M nearest neighbors of b include a, and the M nearest neighbors of a and of b have an object in common, then b is a shared nearest neighbor of a;

(4) Count the number n1 of objects that have zero shared nearest neighbors;

(5) Set M = M + 1, return to step (2), and count the number n2 of objects that now have zero shared nearest neighbors;

(6) If n2 = n1, stop the search; the final M is the neighbor count of the search algorithm, and the shared nearest neighbors computed for each object at this value constitute the natural shared nearest-neighbor relationship; otherwise, update n1 to the current value of n2 and return to step (5).