WO2021232442A1

WO2021232442A1 - Density clustering method and apparatus on basis of dynamic grid hash index

Info

Publication number: WO2021232442A1
Application number: PCT/CN2020/092225
Authority: WO
Inventors: 毛睿; 张贺; 陆敏华; 廖好; 王毅; 刘刚
Original assignee: 深圳大学
Priority date: 2020-05-21
Filing date: 2020-05-26
Publication date: 2021-11-25
Also published as: CN111612069A

Abstract

A density clustering method and apparatus based on a dynamic grid hash index. The method comprises: acquiring preset incremental information, comprising: D: incremental data set; Eps: radius; Minpts: decision threshold for whether a point is a core point; and unAttr: a dimension the value of which is uncertain; generating, according to the acquired preset incremental information and by means of the density clustering method, a data set on which incremental clustering has been performed on the basis of an original data set; and, after the loop ends, obtaining a data set that has completed incremental clustering. By introducing a correspondingly modified new index structure for uncertain data, the time complexity of the algorithm is reduced from O(n²) to O(n) and the space complexity is reduced from O(n²) to O(1); the algorithm adapts to dynamic data sets, and incremental clustering is more efficient than full clustering. On the basis of the newly proposed GH-PDBSCAN algorithm, in combination with the DGridHash index structure, the Incremental GH-PDBSCAN algorithm is proposed, which is suitable for clustering dynamic uncertain data sets.

Description

Density clustering method and device based on dynamic grid hash index

Technical field

This application relates to the field of data processing, in particular to a density clustering method and device based on dynamic grid hash index.

Background technique

In computer science, uncertain data refers to data that contains noise. This noise causes the recorded data to deviate from the correct values. When such data exists in a database, probability calculations need to be introduced.

Currently, PDBSCAN is a clustering algorithm for attribute uncertainty data. The idea of the PDBSCAN algorithm comes from the DBSCAN algorithm, but the DBSCAN algorithm is only suitable for deterministic data, while the PDBSCAN algorithm introduces probability to replace the previously determined value, making it suitable for uncertain data types. The algorithm steps of the PDBSCAN algorithm are as follows:

Algorithm 1: PDBSCAN

Input:

D: uncertain data set; Eps: search radius; Minpts: threshold for judging whether a point is a core point; f_value: probability threshold for direct density-reachability;

Output: the data set and the corresponding class labels;

Algorithm process:

Algorithm 1 describes the PDBSCAN algorithm, and Algorithm 2 gives the details of its cluster expansion. clu_num=k means that the current cluster category is k, where k is a positive integer. class(i)=0, -1, or 1...k means, respectively, that the data object o_i has not yet been classified, has been determined to be noise, or belongs to one of the classes 1...k. type(i)=0, -1, or 1 means, respectively, that the data object o_i is a boundary point, a noise point, or a core point. visited(i)=1 or 0 means, respectively, that the data object o_i has or has not been processed.

In Algorithm 1, after initialization is completed (lines 1-2), the PDBSCAN algorithm starts visiting the data points and calculates PNeighborhood(o_p) and PN_Eps(o_p) for each data point o_p (lines 3-5). If PN_Eps(o_p) is equal to 1, there is only one point in the Eps neighborhood, so o_p is judged to be noise (lines 6-7). If PN_Eps(o_p) lies between 1 and Minpts, the type of the data object still cannot be determined directly. When PN_Eps(o_p) is greater than or equal to Minpts, the point is a core point; the PDBSCAN algorithm assigns the data whose direct density-reachability probability is greater than the threshold f_value to the same class (lines 8-16) and calls the Expand_cluster function to expand the existing cluster. When the expansion step is completed, the data points whose class label is still 0 are processed again and classified as noise points.
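The three-way decision on PN_Eps(o_p) described above can be sketched in C++ as follows. This is only an illustration and not the listing of Algorithm 1: the enum Preliminary and the function classifyByPN are hypothetical names, and treating every PN_Eps value of at most 1 as noise (instead of testing for exact equality with 1) is an assumption made to avoid an exact floating-point comparison.

```cpp
#include <cassert>

// Hypothetical outcome of the first test applied to a data point.
enum class Preliminary { Noise, Undetermined, Core };

// pnEps:  PN_Eps(o_p), the probabilistic neighbor count of the point
//         (assumed to include the point itself, so its minimum is 1).
// minPts: the Minpts threshold.
Preliminary classifyByPN(double pnEps, int minPts) {
    assert(minPts >= 1 && pnEps >= 0.0);
    if (pnEps >= minPts) return Preliminary::Core;   // enough expected neighbors
    if (pnEps <= 1.0) return Preliminary::Noise;     // only the point itself
    return Preliminary::Undetermined;                // decided later, during expansion
}
```

A point that stays undetermined keeps class label 0 until cluster expansion either reaches it or leaves it to be marked as noise at the end.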

The following are the algorithm steps of the function Expand_cluster involved in the PDBSCAN algorithm.

Algorithm 2: Expand_cluster(PNeighborhood(o_p), clu_num, f_value, Minpts)

Let n represent the size of the attribute uncertainty data set, m the dimension of an attribute uncertainty data object, and S the number of different probability distribution functions introduced. In the preprocessing step, the time complexity of calculating the distance probabilities between data objects is O(n²mS²). In the main loop, n scans are required in the worst case, so the time complexity of the PDBSCAN algorithm is O(n²mS²). The algorithm must maintain a matrix of the probabilities that the distance between any two points is less than the specified radius, so the space complexity of the PDBSCAN algorithm is O(n²).

Through the above introduction, the shortcomings of PDBSCAN can be found as follows:

1. The time complexity of the PDBSCAN algorithm is too high, at the O(n²) level;

2. The space complexity of the PDBSCAN algorithm is too high, at the O(n²) level;

3. No incremental clustering algorithm based on dynamic uncertain data corresponding to the PDBSCAN algorithm is proposed.

Summary of the invention

In view of the problems, the present application is proposed in order to provide a method and device for density clustering based on dynamic grid hash index that overcome the problems or at least partially solve the problems, including:

A density clustering method based on dynamic grid hash index, including:

Obtain incremental preset information, including: D: incremental data set; Eps: radius; Minpts: judgment threshold of whether it is a core point; unAttr: dimension with uncertain value;

Create an index G based on the existing data with class tags;

Repeat the following steps to insert each data object p in D into index G;

A1. Obtain PNeighborhood(o_p), and determine from PN_Eps(o_p) whether p is a core point;

A2. Obtain UpSeed_ins(p);

A3. If the objects in UpSeed_ins(p) belong to different clusters, but after inserting p all objects in UpSeed_ins(p) are directly or indirectly density-reachable, merge the clusters containing p and the objects in UpSeed_ins(p);

and/or:

Establish an index G based on the existing data with class tags, and find the position of p in the original data set;

Repeat the following steps to delete each data object p in D from index G;

B1. Obtain PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p);

B2. If the data objects contained in UpSeed_del(p) are not directly density-reachable from one another and are not density-reachable through other core points of the same cluster, split the original cluster into several clusters;

After the end of the loop, a data set with incremental clustering is obtained.

Further, after obtaining UpSeed_ins(p), the method also includes: if UpSeed_ins(p) is empty and N_Eps(p) does not contain a core object, p is regarded as noise and the procedure returns.

Further, after obtaining UpSeed_ins(p), the method also includes: if UpSeed_ins(p) is not empty, and the objects it contains are density-reachable from one another, contain no core object of a known cluster, and do not belong to any cluster, a new cluster is created and the procedure returns.

Further, after obtaining UpSeed_ins(p), the method also includes: if, before p is inserted, the objects contained in UpSeed_ins(p) belong to the same cluster, or they carry different class labels and the data with different class labels are still not density-reachable after p is inserted, or UpSeed_ins(p) is empty and N_Eps(p) contains a core object, p is merged into an existing cluster and the procedure returns.

Further, after obtaining PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p), the method also includes: if p is noise, p is deleted and the procedure returns.

Further, after obtaining PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p), the method also includes: if p is not noise, UpSeed_del(p) is empty, and N_Eps(p) contains no core point after p is deleted, the other data points of the same class as p are set as noise and the procedure returns.

Further, after obtaining PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p), the method also includes: if UpSeed_del(p) is empty but N_Eps(p) still contains a core object, or the data points in UpSeed_del(p) are directly density-reachable from one another, these data objects still belong to the same cluster after p is deleted, and the procedure returns.

A density clustering device based on dynamic grid hash index, including:

The information input unit is used to obtain incremental preset information, including: D: incremental data set; Eps: radius; Minpts: judging threshold for core points; unAttr: dimension with uncertain value;

A data insertion unit, configured to generate a data set after incremental clustering on the basis of the original data set through the density clustering method according to claim 1 according to the acquired incremental preset information;

The search result output unit is configured to output the incremental clustering data set generated by the incremental clustering unit.

A device that includes a processor, a memory, and a computer program stored on the memory and capable of running on the processor. When the computer program is executed by the processor, the steps of the above-mentioned density clustering method based on dynamic grid hash index are implemented.

A computer-readable storage medium stores a computer program on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the above-mentioned density clustering method based on dynamic grid hash index are realized.

This application has the following advantages:

In the embodiments of the present application, incremental preset information is obtained, including: D: incremental data set; Eps: radius; Minpts: threshold for judging whether a point is a core point; unAttr: dimension with uncertain value. An index G is created based on the existing data with class labels, and the following steps are repeated to insert each data object p in D into index G: A1, obtain PNeighborhood(o_p) and determine from PN_Eps(o_p) whether p is a core point; A2, obtain UpSeed_ins(p); A3, if the objects in UpSeed_ins(p) belong to different clusters, but after inserting p all objects in UpSeed_ins(p) are directly or indirectly density-reachable, merge the clusters containing p and the objects in UpSeed_ins(p). And/or: an index G is created based on the existing data with class labels, the position of p in the original data set is found, and the following steps are repeated to delete each data object p in D from index G: B1, obtain PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p); B2, if the data objects contained in UpSeed_del(p) are not directly density-reachable from one another and are not density-reachable through other core points of the same cluster, split the original cluster into several clusters. After the loop ends, a data set that has completed incremental clustering is obtained.
By introducing a new index structure adapted to uncertain data, the time complexity of the algorithm is reduced from O(n²) to O(n), and the space complexity is reduced from O(n²) to O(1). The algorithm is made suitable for dynamic data sets, for which incremental clustering is more efficient than full clustering. On the basis of the newly proposed GH-PDBSCAN algorithm, combined with the DGridHash index structure, the Incremental GH-PDBSCAN algorithm is proposed, which is suitable for clustering dynamic uncertain data sets. The dynamic grid-based hash index structure also has good scalability and can be extended to other algorithm fields, such as incremental clustering and incremental classification.

Description of the drawings

In order to explain the technical solutions of the present application more clearly, the following will briefly introduce the accompanying drawings used in the description of the present application. Obviously, the accompanying drawings in the following description are only some embodiments of the present application. Ordinary technicians can obtain other drawings based on these drawings without creative labor.

FIG. 1 is a schematic diagram of the three-layer structure of the H grid provided by an embodiment of the present application;

Figure 2 is a schematic diagram of an affected part of a grid provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of efficiency comparison between the GH-PDBSCAN algorithm and the PDBSCAN algorithm provided by Example 1 of the present application;

Figure 4.1 is a schematic diagram of the speedup ratio of the Incremental GH-PDBSCAN when inserting data provided in Example 2 of this application;

Figure 4.2 is a schematic diagram of the speedup ratio of Incremental GH-PDBSCAN when deleting data provided in Example 2 of this application;

Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed description

In order to make the objectives, features, and advantages of the application more obvious and understandable, the application will be further described in detail below with reference to the accompanying drawings and specific implementations. Obviously, the described embodiments are part of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.

Aiming at the deficiencies of the PDBSCAN algorithm described in the background art, the present invention proposes a grid-based hash index structure (GridHash for short). A dynamic grid-based hash index structure (DGridHash for short) and an incremental clustering algorithm corresponding to PDBSCAN are also proposed. They are described in detail as follows.

1. Introduction of GridHash

GridHash divides the grid layer by layer and maps it to a hash table through a hash function. First, the space R^d is divided into cells, each with side length ε/√d, where ε is the search radius (Eps) and d is the dimension of the data. Each non-empty cell c (a cell containing at least one data object) is divided into 2^d grids of the same size. Each non-empty grid c′ among them is divided recursively in the same way until the side length of the grid is at most ερ/√d, where ρ is a specified constant.

Referring to Fig. 1, the hierarchical structure obtained by the division is denoted H, and only the non-empty cells in H are stored. For each cell, cnt(c) is used to represent all data objects covered by cell c. The side length of a cell at level i is ε/(2^i·√d), with the first layer at level 0.

The height of H is h=max[1,1+log₂(1/ρ)]=O(1), that is, the number of levels is constant. If a grid c′ at level i+1 is contained in a grid c at level i, then c′ is called a child node of c, and c is called the parent node of c′. A node that has no children is called a leaf node.
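The height of H can be checked by counting the levels directly. The following C++ sketch halves the side length, measured relative to the first-layer cell, until it is at most ρ. The function name gridHeight is hypothetical, and the sketch assumes that the logarithm in the height formula is rounded up to a whole number of levels and that subdivision stops once the relative side length is at most ρ.

```cpp
#include <cassert>

// Counts the levels of the hierarchy H for a given rho.
// The side length is expressed relative to a first-layer cell (value 1.0)
// and is halved at each level until it is at most rho.
int gridHeight(double rho) {
    assert(rho > 0.0);
    int height = 1;            // the first layer always exists
    double relativeSide = 1.0;
    while (relativeSide > rho) {
        relativeSide /= 2.0;   // each child cell has half the side length
        ++height;
    }
    return height;
}
```

For example, under these assumptions ρ = 0.25 gives three levels with relative side lengths 1, 0.5, and 0.25, which agrees with 1 + log₂(1/ρ) = 3.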

In the process of constructing GridHash, a hashDate class object is first generated for each cell. The object records the coordinate values and index number of the cell, its parent node is initialized to empty, and its node type is set to non-leaf node. The object is then mapped to the corresponding position of the hash table according to its index number. Finally, each data object in the data set is assigned to a hashDate class object and mapped downward layer by layer until a leaf node is reached.

Given ε and ρ, the time and space complexity of constructing GridHash is O(n), and the time complexity of a search using this index structure is O(1).

1. Define a hash table,

2. Each element in the hash table is a hashDate class, which contains several operations and related data definitions. The data contained in it is defined as follows:

vector<double> axisX; // Value range of the cell in the first dimension

vector<double> axisY; // Value range of the cell in the second dimension

vector<int> dateSubscripts; // Identifiers of the data points contained in the cell

vector<shared_ptr<hashDate>> hashPointer; // Pointer array

bool leafNode; // Whether the node is a leaf node

shared_ptr<hashDate> fatherNo; // Pointer to the parent node of the node

int indecNumber; // Index number of the node in the pointer array of its parent node

In this process, a hash function is needed to map each cell to the hash table, and the key of a data point in the hash table is determined from the coordinates of the data point. Taking two-dimensional data as an example, the hash function can be set as:

where axisX is the abscissa of the data object; axisY is the ordinate of the data object; minX is the minimum value of the first dimension of the grid structure; minY is the minimum value of the second dimension of the grid structure; wCell is the side length of the cell; and colsNum is the number of columns in the grid.
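One plausible form of such a hash function, built only from the quantities listed above, is sketched below in C++. The row-major numbering of cells (row index times colsNum plus column index) is an assumption made for illustration, as is the function name cellKey; the sketch is not presented as the exact formula of this application.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Maps a two-dimensional data point to the key of the first-layer cell
// that contains it, assuming cells are numbered row by row.
std::int64_t cellKey(double axisX, double axisY,
                     double minX, double minY,
                     double wCell, std::int64_t colsNum) {
    assert(wCell > 0.0 && axisX >= minX && axisY >= minY);
    // Column and row of the cell: offset from the grid origin in units of wCell.
    const auto col = static_cast<std::int64_t>(std::floor((axisX - minX) / wCell));
    const auto row = static_cast<std::int64_t>(std::floor((axisY - minY) / wCell));
    assert(col < colsNum);
    return row * colsNum + col;
}
```

With a grid origin at (0, 0), wCell = 1, and colsNum = 10, the point (2.5, 1.2) falls in column 2 of row 1 and therefore receives key 12 under this numbering.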

Dynamically insert data objects based on DGridHash

It should be noted that dynamically inserting data objects in DGridHash is a process of constantly creating new nodes and adding them to the index structure. The specific algorithm process is as follows:

Dynamic insertion algorithm

Input: dataPoint: the data object to be inserted; hashTable: hash table; wCell: cell side length; wCell_Final: end cell side length; colsNum: the number of grid columns;

Output: the index of the completed data object insertion operation;

Algorithm process:
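One way the insertion described above could proceed is sketched below in C++ for two-dimensional data. The types GridNode and DGridHash are hypothetical; their field names loosely follow the hashDate definition and the input list above. The sketch assumes row-major first-layer keys, quadrant-based children, and that a node becomes a leaf when halving its side length would fall below wCellFinal.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical node type for a two-dimensional index.
struct GridNode {
    double minX = 0.0, minY = 0.0;   // lower-left corner of the cell
    double width = 0.0;              // side length of the cell
    bool leafNode = false;           // true when the cell is not divided further
    std::vector<int> dateSubscripts; // identifiers of the covered data points
    std::array<std::shared_ptr<GridNode>, 4> hashPointer; // 2^d children, d = 2
};

struct DGridHash {
    double minX, minY;     // origin of the grid
    double wCell;          // side length of a first-layer cell
    double wCellFinal;     // smallest allowed side length
    std::int64_t colsNum;  // number of grid columns
    std::unordered_map<std::int64_t, std::shared_ptr<GridNode>> hashTable;

    // Inserts data point `id` at (x, y), creating nodes on the way down.
    void insert(int id, double x, double y) {
        assert(wCell > 0.0 && wCellFinal > 0.0 && x >= minX && y >= minY);
        const auto col = static_cast<std::int64_t>(std::floor((x - minX) / wCell));
        const auto row = static_cast<std::int64_t>(std::floor((y - minY) / wCell));
        assert(col < colsNum);
        auto& slot = hashTable[row * colsNum + col];
        if (!slot) {  // first point in this first-layer cell
            slot = std::make_shared<GridNode>();
            slot->minX = minX + col * wCell;
            slot->minY = minY + row * wCell;
            slot->width = wCell;
        }
        std::shared_ptr<GridNode> node = slot;
        for (;;) {
            node->dateSubscripts.push_back(id);
            const double half = node->width / 2.0;
            if (half < wCellFinal) {  // cannot divide further
                node->leafNode = true;
                return;
            }
            const int qx = (x >= node->minX + half) ? 1 : 0;
            const int qy = (y >= node->minY + half) ? 1 : 0;
            auto& child = node->hashPointer[qy * 2 + qx];
            if (!child) {  // create the child node on demand
                child = std::make_shared<GridNode>();
                child->minX = node->minX + qx * half;
                child->minY = node->minY + qy * half;
                child->width = half;
            }
            node = child;
        }
    }
};
```

Because nodes are created only along the path of the inserted point, empty cells never occupy memory, which matches the statement that only non-empty cells are stored.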

Dynamically delete data objects based on DGridHash

It should be noted that when a data object is dynamically deleted from DGridHash, all nodes containing the object are traversed from top to bottom, and the object is then deleted from the bottom up. After the deletion, if the number of data objects contained in a node is not zero, only the data object is deleted and no other processing is performed. If the number of data objects contained in the node becomes zero, the node itself is also deleted and its parent node is checked. If the parent node also contains no data objects, the parent node is deleted as well, and so on until the root node is reached. The specific algorithm process is as follows:

Dynamic deletion algorithm

Output: the index of the completed data object deletion operation;

Algorithm process:
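The deletion described above (collect the nodes that hold the object from the top down, then remove it from the bottom up and discard nodes that become empty) can be sketched in C++ as follows. GridNode, HashTable, holds, and erasePoint are hypothetical names; the field names follow the hashDate definition. The sketch assumes that the caller already knows the first-layer key of the point and that the parent link is a non-owning raw pointer.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical node type with a parent link.
struct GridNode {
    std::vector<int> dateSubscripts;  // identifiers of the covered data points
    std::array<std::shared_ptr<GridNode>, 4> hashPointer; // child nodes
    GridNode* fatherNo = nullptr;     // non-owning pointer to the parent node
    int indecNumber = 0;              // slot of this node in the parent's array
};

using HashTable = std::unordered_map<std::int64_t, std::shared_ptr<GridNode>>;

static bool holds(const GridNode& node, int id) {
    const auto& ids = node.dateSubscripts;
    return std::find(ids.begin(), ids.end(), id) != ids.end();
}

// Removes data point `id` from the first-layer cell `key` and its descendants.
// Returns false if the point is not stored under that key.
bool erasePoint(HashTable& hashTable, std::int64_t key, int id) {
    const auto it = hashTable.find(key);
    if (it == hashTable.end()) return false;

    // Top-down pass: collect every node on the path that holds the point.
    std::vector<GridNode*> path;
    GridNode* node = it->second.get();
    while (node != nullptr && holds(*node, id)) {
        path.push_back(node);
        GridNode* next = nullptr;
        for (const auto& child : node->hashPointer) {
            if (child && holds(*child, id)) { next = child.get(); break; }
        }
        node = next;
    }
    if (path.empty()) return false;

    // Bottom-up pass: remove the point and drop nodes that become empty.
    for (auto rit = path.rbegin(); rit != path.rend(); ++rit) {
        GridNode* current = *rit;
        auto& ids = current->dateSubscripts;
        ids.erase(std::remove(ids.begin(), ids.end(), id), ids.end());
        if (!ids.empty()) continue;   // node still covers other points
        if (current->fatherNo != nullptr) {
            assert(current->indecNumber >= 0 && current->indecNumber < 4);
            current->fatherNo->hashPointer[current->indecNumber].reset();
        } else {
            hashTable.erase(key);     // empty first-layer cell
        }
    }
    return true;
}
```

A node that still covers other data objects is kept, so its ancestors are kept as well; only a chain of nodes that all become empty is removed up to the first layer.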

Referring to Figure 2, a range search is performed on the index as follows.

For a given two-dimensional data point A (A.X, A.Y), its position in the grid is calculated from its coordinates. Because the width of a first-layer grid is Eps/√2 (that is, Eps/√d with d = 2), the number of affected grids is 21. The specific layout is shown in Figure 2.

These 21 affected grids are searched to find the data points that meet the conditions. For a single non-empty cell c and a query point q, the range search steps are the same as in GridHash. The search process of DGridHash is the same as that of the static grid-based hash structure GridHash, but because ρ is set, the precision of the range search is reduced. To obtain accurate results in the incremental clustering algorithm, the data points returned by the index search must therefore be filtered. This application filters the returned data points again with a linear traversal.
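The count of 21 can be verified with a short C++ sketch. It assumes that a first-layer cell has side length Eps/√2 and that a neighboring cell is affected only when its closest point is strictly less than Eps away from the cell containing the query point. Measured in units of the cell side, the squared gap between two cells must then be less than d = 2, which allows an integer-only test. The function name affectedCellOffsets2D is hypothetical.

```cpp
#include <algorithm>
#include <cstdlib>
#include <utility>
#include <vector>

// Returns the (dx, dy) offsets of the first-layer cells that can contain a
// point within Eps of some point in the cell at offset (0, 0), assuming the
// cell side length is Eps / sqrt(2).
std::vector<std::pair<int, int>> affectedCellOffsets2D() {
    constexpr int d = 2;  // dimension; (cell side)^2 = Eps^2 / d
    std::vector<std::pair<int, int>> offsets;
    for (int dx = -3; dx <= 3; ++dx) {
        for (int dy = -3; dy <= 3; ++dy) {
            // Gap between the two cells along each axis, in units of the cell side.
            const int gx = std::max(std::abs(dx) - 1, 0);
            const int gy = std::max(std::abs(dy) - 1, 0);
            // Squared minimum distance < Eps^2  <=>  gx^2 + gy^2 < d.
            if (gx * gx + gy * gy < d) offsets.emplace_back(dx, dy);
        }
    }
    return offsets;
}
```

The result is the 5 × 5 block of cells around the query cell without its four corner cells, whose closest points are exactly Eps away.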

Let d be the dimension of the data, and let each dimension i of the grid be divided into partition[i] parts. The first layer of the hash table then needs to store parNumber = partition[0] × ... × partition[d−1] hashDate objects. The space complexity of each hashDate object is O(n), so the space complexity of the first layer is O(n×parNumber). Each hashDate object in the first layer has 2^d subtrees, so the space complexity of the second layer is O(2^d×n×parNumber), and so on. Since the data objects stored in each layer total O(n), the tree height of the hash index is O(1), and d and parNumber are constants, the space complexity of the dynamic grid-based hash index is O(parNumber+n). The time complexity of constructing the index is O(n), and the time complexity of a range search is O(1).

The GH-PDBSCAN algorithm, an improvement based on PDBSCAN

The time and space complexity of the PDBSCAN algorithm is O(n ² mS ² ) and O(n ² ) respectively, mainly because the algorithm needs to calculate the distance probability between any two points and store the probability matrix during preprocessing. Therefore, the key to reducing the time and space complexity of the PDBSCAN algorithm is to introduce an index structure.

The introduction of GridHash above describes a grid-based hash index structure. The time complexity of a range search with this index structure is O(1) and its space complexity is O(n), so it is an efficient index structure. In attribute uncertainty data, not all attributes are uncertain; the data combine deterministic attributes and uncertain attributes. In the paper in which Zhang proposed the PDBSCAN algorithm, only one dimension of the data generated for the experiments is uncertain, while the other dimensions are deterministic values. Therefore, the grid-based hash index can be introduced into the PDBSCAN algorithm and used only for the range search over the deterministic attributes. The result is then combined with the calculation of "the probability that the distance between any two points is less than or equal to a given radius", and the data points whose probability value is greater than 0 are kept, which makes the approach suitable for clustering algorithms that deal with uncertain data. The PDBSCAN algorithm with this grid-based hash structure is called the GH-PDBSCAN algorithm in this application.

Suppose there is a data set D and a data object o_p of dimension k, whose first k−1 attributes are deterministic values and whose k-th attribute is a range. The steps are as follows:

1. According to the first k−1 dimensions, establish a grid-based hash index structure;

2. When the PDBSCAN algorithm calculates PNeighborhood(o_p), first find the Eps neighbors through the index; then use the Monte Carlo method to calculate the probability that the distance between o_p and each of its Eps neighbors is less than a threshold thresholdValue, which is defined as follows:

where o_q ∈ Eps neighbors, and o_pi denotes the i-th attribute of o_p.

3. After calculating PNeighborhood(o_p), run the subsequent steps of the PDBSCAN algorithm.

When the GH-PDBSCAN algorithm calculates the Eps neighbors of a data object, the first step is to build an index based on the deterministic attributes, and the second step is to calculate, with the Monte Carlo method, the probability that the distance is less than a certain threshold. When the data set has only one attribute with uncertainty, the probability between any two objects can be expressed as follows:

Assume that the uncertainty attribute of the data point o_r is [a, b], the uncertainty attribute of the data point o_w is [c, d], and the uncertainty attributes of the two data points are in the same dimension. The probability can be solved by definite integration or by the Monte Carlo method. When applying the Monte Carlo method, this application randomly samples the values of the two objects 1000 times to calculate the probability.
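The Monte Carlo estimate described above can be sketched in C++ as follows. The sketch assumes that the uncertain attribute of each object is uniformly distributed over its interval (the "random" form of the data), that the squared distance over the deterministic attributes is supplied by the caller as detDistSq, and that a fixed seed is acceptable. The function name reachProbability and its parameters are hypothetical.

```cpp
#include <cassert>
#include <random>

// Estimates the probability that the distance between o_r and o_w is at most
// eps when their uncertain attributes are uniform on [a, b] and [c, d].
// detDistSq: squared distance over the deterministic attributes (assumed given).
double reachProbability(double detDistSq, double a, double b, double c, double d,
                        double eps, int samples = 1000, unsigned seed = 42) {
    assert(a < b && c < d && detDistSq >= 0.0 && eps >= 0.0 && samples > 0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> valueR(a, b);
    std::uniform_real_distribution<double> valueW(c, d);
    int hits = 0;
    for (int i = 0; i < samples; ++i) {
        const double r = valueR(rng);  // sampled value of the uncertain attribute of o_r
        const double w = valueW(rng);  // sampled value of the uncertain attribute of o_w
        const double diff = r - w;
        if (detDistSq + diff * diff <= eps * eps) ++hits;
    }
    return static_cast<double>(hits) / samples;
}
```

With 1000 samples the standard error of the estimate is at most about 0.016, so a probability threshold such as f_value is resolved only to roughly that precision; more samples trade running time for accuracy.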

Since the time complexities of the grid-based hash index structure for index construction and range search are O(n) and O(1) respectively, the time complexity of GH-PDBSCAN is O(nmS²), where S represents the number of different probability distribution functions introduced. The space complexity of the GH-PDBSCAN algorithm is O(n).

The time and space complexity of the PDBSCAN algorithm is relatively high, and it is not suitable for situations with large data volume. Compared with the PDBSCAN algorithm, the GH-PDBSCAN algorithm has higher efficiency and lower space consumption, which is more meaningful and usable.

Example 1

Refer to Figure 3, the efficiency comparison of GH-PDBSCAN algorithm and PDBSCAN algorithm

The experiments use four different data sets. Two of them, image and abalone, come from UCI, and the negative values in these two data sets are deleted. The other two are artificially synthesized data, as shown in Table 1. For all four data sets, the method of Gullo et al. is used to generate attribute uncertainty data. Each data set includes two forms: random and normal. The test platform is Windows 7 with 32 GB of memory and a 32-core CPU, the development tool is Visual Studio 2012, and the programming language is C++.

Table 1

To compare the efficiency of the GH-PDBSCAN algorithm and the PDBSCAN algorithm, it is necessary to test separately from the two dimensions of the data value range and the data volume. The test results are shown in Figure 3.

Figure 3 shows four data types, corresponding to four different situations, (a) the data value range is larger, and the data volume is small; (b) the data value range is small, and the data volume is small; (c) The data value range is larger and the data volume is larger; (d) the data value range is smaller and the data volume is larger.

Experiments show that only in case (a) is the PDBSCAN algorithm more efficient than the GH-PDBSCAN algorithm; in the other cases, the GH-PDBSCAN algorithm is more efficient. In (a), compared with the GH-PDBSCAN algorithm, the efficiency of the PDBSCAN algorithm is higher by a factor of 5.62 when the data are randomly distributed and by a factor of 6.18 when the data are normally distributed. In (b), compared with the PDBSCAN algorithm, the efficiency of the GH-PDBSCAN algorithm is higher by a factor of 1.95 for randomly distributed data and 2.61 for normally distributed data. In (c), the corresponding factors are 4.23 for randomly distributed data and 3.63 for normally distributed data. In (d), they are 9.38 for randomly distributed data and 9.25 for normally distributed data.

In response to the problems raised in the background art, the present invention also proposes an efficient incremental algorithm built on GH-PDBSCAN: the Incremental GH-PDBSCAN algorithm

There are more and more attribute uncertainty data, and there are also dynamic updates. In order to enable the GH-PDBSCAN algorithm to be applied to dynamic attribute uncertainty data sets, this paper proposes the Incremental GH-PDBSCAN algorithm. The algorithm draws on the idea of the Incremental DBSCAN algorithm, but introduces the probability distribution function into it, making it suitable for data with uncertain attributes.

First, the seed data points are queried. Let D represent the data set and p represent the deleted or inserted data point. The seed objects for inserting or deleting data are defined as follows:

UpSeed_ins(p) = {q | q is a core object in D∪{p}, ∃q′: q′ is not a core point in D, q′ is a core point in D∪{p}, and q ∈ N_Eps(q′)}

UpSeed_del(p) = {q | q is a core object in D\{p}, ∃q′: q′ is a core point in D, q′ is not a core point in D\{p}, and q ∈ N_Eps(q′)}

The definition of the seed objects and the steps involved in incremental clustering are the same as those of Incremental DBSCAN, except that the neighborhood consists of the data objects whose direct density-reachability probability is larger than a certain threshold, and the range search uses the dynamic grid-based hash index. Before running Incremental PDBSCAN, the PDBSCAN algorithm must be run to cluster the data and to store, for each data object, whether it is a core point, its PN_Eps(o_p) value, and its class label. Incremental clustering of the uncertain data is then performed according to UpSeed_ins(p), UpSeed_del(p), PN_Eps(o_p), and related attribute information.

The detailed process of the incremental GH-PDBSCAN algorithm is as follows:

Among them, the process of Incremental GH-PDBSCAN inserting data is as follows:

Input: D: incremental data set; Eps: radius; Minpts: judging threshold for core points; unAttr: dimension with uncertain value;

Output: the data set that completes incremental clustering;

1: Create index G based on the existing data with class labels;

2: for (each data object p in D)

3: Insert p into index G;

4: Obtain PNeighborhood(o_p), and determine from PN_Eps(o_p) whether p is a core point;

5: Obtain UpSeed_ins(p);

6: If UpSeed_ins(p) is empty and N_Eps(p) does not contain a core object, regard p as noise and return; otherwise, go to step 7;

7: If UpSeed_ins(p) is not empty, and the objects it contains are density-reachable from one another, contain no core object of a known cluster, and do not belong to any cluster, create a new cluster and return; otherwise, go to step 8;

8: If, before p is inserted, the objects contained in UpSeed_ins(p) belong to the same cluster, or they carry different class labels and the data with different class labels are still not density-reachable after p is inserted, or UpSeed_ins(p) is empty and N_Eps(p) contains a core object, merge p into an existing cluster and return; otherwise, go to step 9;

9: If the objects in UpSeed_ins(p) belong to different clusters, but after inserting p all objects in UpSeed_ins(p) are directly or indirectly density-reachable, merge the clusters containing p and the objects in UpSeed_ins(p).
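The case analysis in steps 6 to 9 can be summarized by a small dispatch function, sketched below in C++. The names InsertCase and classifyInsertion are hypothetical. The sketch takes the class labels of the objects in UpSeed_ins(p) as input, treats labels 0 and -1 as "not in any cluster", and simplifies step 8 by assuming that seed objects from different clusters are always density-reachable through p, so the reachability test itself is not modeled.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical names for the four outcomes of an incremental insertion.
enum class InsertCase { Noise, CreateCluster, Absorb, Merge };

// seedLabels: class labels of the objects in UpSeed_ins(p);
//             0 (unclassified) and -1 (noise) mean "not in any cluster".
// neighborhoodHasCore: whether N_Eps(p) contains a core object.
InsertCase classifyInsertion(const std::vector<int>& seedLabels,
                             bool neighborhoodHasCore) {
    std::set<int> clusters;  // distinct cluster labels among the seed objects
    for (int label : seedLabels) {
        assert(label >= -1);
        if (label > 0) clusters.insert(label);
    }
    if (seedLabels.empty()) {
        // Step 6 (noise) or the last alternative of step 8 (absorption).
        return neighborhoodHasCore ? InsertCase::Absorb : InsertCase::Noise;
    }
    if (clusters.empty()) return InsertCase::CreateCluster;  // step 7
    if (clusters.size() == 1) return InsertCase::Absorb;     // step 8
    return InsertCase::Merge;                                // step 9
}
```

A full implementation would replace the simplification in the last branch with the density-reachability check of step 8 before deciding between absorption and merging.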

Incremental GH-PDBSCAN deletes data as follows:

Input: D: set of data objects to delete; Eps: radius; Minpts: threshold for judging whether a point is a core point; unAttr: dimension with uncertain value;

Output: the data set that completes incremental clustering;

1: Create index G based on the existing data with class labels;

2: Find the position of p in the original data set;

3: for (each data object p in D)

4: Delete p from index G;

5: Obtain PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p);

6: If p is noise, delete it and return; otherwise, go to step 7;

7: If p is not noise, UpSeed_del(p) is empty, and N_Eps(p) contains no core point after p is deleted, set the other data points of the same class as p as noise and return; otherwise, go to step 8;

8: If UpSeed_del(p) is empty but N_Eps(p) still contains a core object, or the data points in UpSeed_del(p) are directly density-reachable from one another, these data objects still belong to the same cluster after p is deleted, so return; otherwise, go to step 9;

9: If the data objects contained in UpSeed_del(p) are not directly density-reachable from one another and are not density-reachable through other core points of the same cluster, split the original cluster into several clusters; otherwise, no further processing is needed.

Incremental GH-PDBSCAN has four different situations during incremental insertion: noise, creating a cluster, merging into a cluster, and merging clusters. While the algorithm runs, PNeighborhood(o_p) must be calculated with a range search on the GridHash index and the Monte Carlo method. The time complexity of the algorithm is O(q × k), where q is the size of the set of data whose probability of being directly density-reachable from the inserted point p is greater than a certain threshold, and k is the number of data objects contained in UpSeed_del(p). The space complexity of Incremental GH-PDBSCAN is O(n+m), where n represents the size of the original data set and m represents the size of the data set that needs to be updated incrementally.

Incremental GH-PDBSCAN has four different situations when deleting data: noise, eliminating a cluster, reducing the objects of a cluster, and splitting a cluster. The time complexity of Incremental GH-PDBSCAN is O(L×(n−m)), where L is the number of clusters contained in UpSeed_del(p), n is the size of the original data set, and m is the size of the data set that needs to be deleted. The space complexity of the algorithm is O(n−m).

Example two

Refer to Figures 4.1 and 4.2 for the efficiency comparison of the Incremental GH-PDBSCAN algorithm and the GH-PDBSCAN algorithm.

The GH-PDBSCAN algorithm can process big data, so this application uses a data set of 1 million objects. Each data object is three-dimensional spatial data, and the third dimension of each object is designated as the uncertain attribute. For the specific implementation method, refer to the article published by Gullo et al. in 2008. The experimental method for the Incremental GH-PDBSCAN algorithm proposed in this application is similar to the experimental method for the Incremental DBSCAN algorithm proposed by Ester et al. in 1998.
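With the third dimension uncertain, whether two objects lie within Eps of each other becomes a probability rather than a yes/no answer. The sketch below estimates that probability by Monte Carlo sampling. It is only an illustration: the function name `prob_within_eps` is hypothetical, and the uncertain coordinate is assumed to be uniformly distributed over an interval, which the text here does not specify.

```python
import math
import random


def prob_within_eps(p, q, eps, samples=10_000, seed=0):
    """Estimate P(dist(p, q) <= eps) when the third coordinate is uncertain.

    p, q: tuples (x, y, (z_low, z_high)); the last entry is an interval from
        which the uncertain coordinate is drawn uniformly (an assumption).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # Draw one possible value of the uncertain dimension for each object.
        pz = rng.uniform(*p[2])
        qz = rng.uniform(*q[2])
        if math.dist((p[0], p[1], pz), (q[0], q[1], qz)) <= eps:
            hits += 1
    return hits / samples
```

For example, `prob_within_eps((0, 0, (0, 0)), (0, 0, (0, 2)), eps=1.0)` estimates the chance that a value drawn uniformly from [0, 2] is at most 1, so the result should be close to one half.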

For the GH-PDBSCAN algorithm, the clustering time of each data object is dominated by the range search time. Measured in range searches, the time cost of clustering n data objects can be written as Cost_DBSCAN(n), that is,

Cost_DBSCAN(n) = n    (5)

The number of range searches performed by the Incremental GH-PDBSCAN algorithm depends on the specific application, so the number of range searches required for each insertion and deletion must be determined experimentally. Generally speaking, deleting a data object affects more data points than inserting one. We introduce the parameters r_ins and r_del, which represent the average number of range searches per data object during incremental insertion and incremental deletion, respectively, and the parameters f_ins and f_del, which represent the proportions of insertions and deletions in an incremental update. When the size of the incrementally updated data set is m, the time cost of the Incremental GH-PDBSCAN algorithm is:

Cost_IncrementalGH-PDBSCAN(m) = m × (f_ins × r_ins + f_del × r_del)    (6)

Table 2 below lists the parameter values for 1 million 3D uncertainty data:

Parameter	Parameter meaning	1 million 3D uncertainty data
n	Original data set size	1,000,000
m	Incremental data set size	From 20,000 to 500,000
r_ins	Number of range searches during incremental insertion	1
r_del	Number of range searches during incremental deletion	4.6
f_ins	Proportion of inserted data	1 or 0
f_del	Proportion of deleted data	0 or 1

Table 2
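Equations (5) and (6) can be evaluated directly with the values in Table 2. The short Python sketch below does this for the smallest update size, m = 20,000; the function and variable names are illustrative choices, not part of the application.

```python
def cost_full(n):
    """Equation (5): cost of clustering n objects, counted in range searches."""
    return n


def cost_incremental(m, f_ins, r_ins, f_del, r_del):
    """Equation (6): cost of incrementally updating m objects."""
    return m * (f_ins * r_ins + f_del * r_del)


R_INS, R_DEL = 1, 4.6  # average range searches per object, from Table 2

# Pure insertion (f_ins = 1, f_del = 0) and pure deletion (f_ins = 0, f_del = 1).
insert_cost = cost_incremental(20_000, f_ins=1, r_ins=R_INS, f_del=0, r_del=R_DEL)
delete_cost = cost_incremental(20_000, f_ins=0, r_ins=R_INS, f_del=1, r_del=R_DEL)
```

Under this model, updating 20,000 objects costs 20,000 range searches for pure insertion and about 92,000 for pure deletion, against 1,000,000 for clustering the original data set from scratch.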

According to the above definitions, the speedup ratio of the Incremental GH-PDBSCAN algorithm relative to the GH-PDBSCAN algorithm can be calculated, which is defined as follows:

For incremental insertion and incremental deletion, the speedup ratios of the Incremental GH-PDBSCAN algorithm relative to the GH-PDBSCAN algorithm are shown in Figures 4.1 and 4.2.

Experiments show that the Incremental GH-PDBSCAN algorithm is considerably more efficient than the GH-PDBSCAN algorithm. When the size of the inserted or deleted data set is fixed, the speedup ratio grows in proportion to the size of the original data set; when the original data set is fixed, the larger the inserted or deleted data set, the lower the speedup ratio. The advantages of the Incremental GH-PDBSCAN algorithm are therefore most apparent when the original data set is large and the amount of inserted or deleted data is small.
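The trend described above can be reproduced with a small calculation. The sketch below is based on an assumption that is not stated in this text: the speedup is taken as the cost of re-clustering the updated data set from scratch, divided by the cost of the incremental update, with the updated data set containing n + m objects after pure insertion and n − m objects after pure deletion. The function name `speedup` is hypothetical.

```python
def speedup(n, m, f_ins, r_ins, f_del, r_del):
    """Assumed speedup: full re-clustering cost over incremental update cost.

    Assumption (not from the original text): full re-clustering processes
    the updated data set, i.e. n + m objects after pure insertion and
    n - m objects after pure deletion.
    """
    updated_size = n + m * (f_ins - f_del)
    incremental = m * (f_ins * r_ins + f_del * r_del)
    return updated_size / incremental
```

Under this assumption, inserting 20,000 objects into 1,000,000 gives a ratio of 1,020,000 / 20,000 = 51, and the ratio falls as m grows or as n shrinks, which matches the qualitative behavior reported for the experiments.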

Since the device embodiment is basically similar to the method embodiment, its description is relatively brief; for related details, refer to the corresponding description of the method embodiment.

An embodiment of the present application provides a density clustering device based on dynamic grid hash index, which includes:

An information input unit, configured to obtain incremental preset information, including: D: incremental data set; Eps: radius; Minpts: judgment threshold for core points; unAttr: dimension with uncertain value;

Referring to FIG. 5, a computer device for implementing the density clustering method based on dynamic grid hash index of the present invention is shown, which may specifically include the following:

The above-mentioned computer device 12 is represented in the form of a general-purpose computing device. The components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processing unit 16).

The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus structures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

The computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by the computer device 12, including volatile and nonvolatile media, removable and non-removable media.

The system memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (commonly referred to as "hard drives"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM, or other optical media) can be provided. In these cases, each drive can be connected to the bus 18 through one or more data medium interfaces. The memory may include at least one program product having a set (for example, at least one) of program modules 42 configured to perform the functions of the various embodiments of the present invention.

A program/utility tool 40 having a set of (at least one) program modules 42 may be stored in, for example, the memory. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules 42, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally execute the functions and/or methods of the described embodiments of the present invention.

The computer device 12 may also communicate with one or more external devices 14 (such as a keyboard, pointing device, display 24, camera, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (such as a network card, modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 22. In addition, the computer device 12 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that although not shown in FIG. 5, other hardware and/or software modules can be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units 16, external disk drive arrays, RAID systems, tape drives, and data backup storage systems 34.

The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, such as implementing the density clustering method based on dynamic grid hash index provided by the embodiment of the present invention.

That is, when the processing unit 16 executes the above program, it implements the following: obtaining incremental preset information, including D: the incremental data set; Eps: the radius; Minpts: the threshold for judging whether a point is a core point; and unAttr: the dimension whose value is uncertain; establishing an index G based on the existing data with class labels; repeating the following steps to insert each data object p in D into the index G: A1, obtaining PNeighborhood(o_p), and determining the probability that p is a core point according to PN_Eps(o_p); A2, obtaining UpSeed_ins(p); A3, if the objects in UpSeed_ins(p) belong to different classes but, after p is inserted, all objects in UpSeed_ins(p) are directly or indirectly density-reachable, merging the clusters in which p and the objects contained in UpSeed_ins(p) are located; and/or: establishing an index G based on the existing data with class labels, and finding the position of p in the original data set; repeating the following steps to delete each data object p in D from the index G: B1, obtaining PNeighborhood(o_p), PN_Eps(o_p) and UpSeed_del(p); B2, if the data objects contained in UpSeed_del(p) are not directly density-reachable from one another and are not density-reachable through other core points of the same cluster, splitting the original cluster into several clusters; and, after the loop ends, obtaining the data set for which incremental clustering has been completed.

In the embodiments of the present invention, the present invention also provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, it implements the density clustering method based on dynamic grid hash index provided in all embodiments of the present application:

That is, when the program is executed by the processor, it implements the following: obtaining incremental preset information, including D: the incremental data set; Eps: the radius; Minpts: the threshold for judging whether a point is a core point; and unAttr: the dimension whose value is uncertain; establishing an index G based on the existing data with class labels; repeating the following steps to insert each data object p in D into the index G: A1, obtaining PNeighborhood(o_p), and determining the probability that p is a core point according to PN_Eps(o_p); A2, obtaining UpSeed_ins(p); A3, if the objects in UpSeed_ins(p) belong to different classes but, after p is inserted, all objects in UpSeed_ins(p) are directly or indirectly density-reachable, merging the clusters in which p and the objects contained in UpSeed_ins(p) are located; and/or: establishing an index G based on the existing data with class labels, and finding the position of p in the original data set; repeating the following steps to delete each data object p in D from the index G: B1, obtaining PNeighborhood(o_p), PN_Eps(o_p) and UpSeed_del(p); B2, if the data objects contained in UpSeed_del(p) are not directly density-reachable from one another and are not density-reachable through other core points of the same cluster, splitting the original cluster into several clusters; and, after the loop ends, obtaining the data set for which incremental clustering has been completed.

Any combination of one or more computer-readable media may be used. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, the computer-readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.

The computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and it may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device.

The computer program code used to perform the operations of the present invention can be written in one or more programming languages or a combination thereof. These programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).

The various embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments can be referred to one another.

Although the preferred embodiments of the embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic creative concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present application.

Finally, it should be noted that in this document, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or sequence between these entities or operations. Moreover, the terms "comprising", "including" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or terminal device including a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to the process, method, article, or terminal device. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or terminal device that includes the element.

The above describes in detail the density clustering method, device, equipment and medium based on dynamic grid hash index provided by this application. Specific examples are used in this document to illustrate the principle and implementation of this application, and the description of the embodiments is only intended to help in understanding the method and core idea of this application. At the same time, for those of ordinary skill in the art, the specific implementation and the scope of application may change according to the idea of this application. In summary, the content of this specification should not be construed as a limitation on this application.

Claims

A density clustering method based on dynamic grid hash index, which is characterized in that it comprises:

Obtain incremental preset information, including: D: incremental data set; Eps: radius; Minpts: judgment threshold of whether it is a core point; unAttr: dimension with uncertain value;

Create an index G based on the existing data with class tags;

Repeat the following steps to insert each data object p in D into index G:

A1: Obtain PNeighborhood(o_p), and determine the probability that p is a core point according to PN_Eps(o_p);

A2: Obtain UpSeed_ins(p);

A3: If the objects in UpSeed_ins(p) belong to different classes, but all objects in UpSeed_ins(p) are directly or indirectly density-reachable after p is inserted, then merge the clusters in which p and the objects contained in UpSeed_ins(p) are located;

and/or;

Establish an index G based on the existing data with class tags, and find the position of p in the original data set;

Repeat the following steps to delete each data object p in D from index G;

B1: Obtain PNeighborhood(o_p), PN_Eps(o_p) and UpSeed_del(p);

B2: If the data objects contained in UpSeed_del(p) are not directly density-reachable from one another and are not density-reachable through other core points of the same cluster, the original cluster is split into several clusters;

After the loop ends, a data set for which incremental clustering has been completed is obtained.
The method according to claim 1, wherein after obtaining UpSeed_ins(p), it further comprises: if UpSeed_ins(p) is empty and N_Eps(p) does not contain a core object, then p is regarded as noise and ∞ is returned.
The method according to claim 1, characterized in that after obtaining UpSeed_ins(p), it further comprises: if UpSeed_ins(p) is not empty, and the objects it contains are density-reachable from one another but are core objects that do not belong to any known cluster, then a new cluster is created and ∞ is returned.
The method according to claim 1, wherein after obtaining UpSeed_ins(p), it further comprises: if, before p is inserted, the objects contained in UpSeed_ins(p) belong to the same cluster, or they carry different class labels and, after p is inserted, the data with different class labels are still not density-reachable, or UpSeed_ins(p) is empty and N_Eps(p) contains a core object, then p is merged into a cluster and ∞ is returned.
The method according to claim 1, wherein after obtaining PNeighborhood(o_p), PN_Eps(o_p) and UpSeed_del(p), it further comprises: if p is noise, then p is deleted and ∞ is returned.
The method according to claim 1, characterized in that after obtaining PNeighborhood(o_p), PN_Eps(o_p) and UpSeed_del(p), it further comprises: if p is not noise, UpSeed_del(p) is empty, and N_Eps(p) contains no core point after p is deleted, then the other data points of the same class as p are set as noise and ∞ is returned.
The method according to claim 1, wherein after obtaining PNeighborhood(o_p), PN_Eps(o_p) and UpSeed_del(p), it further comprises: if UpSeed_del(p) is empty but N_Eps(p) still contains a core object, or the data points in UpSeed_del(p) are directly density-reachable from one another, then these data objects remain in the same cluster after p is deleted and ∞ is returned.
A density clustering device based on dynamic grid hash index, which is characterized in that it comprises:

An information input unit, configured to obtain incremental preset information, including: D: incremental data set; Eps: radius; Minpts: judgment threshold for core points; unAttr: dimension with uncertain value;

A data insertion unit, configured to generate, according to the acquired incremental preset information, a data set after incremental clustering on the basis of the original data set through the density clustering method according to claim 1;

A search result output unit, configured to output the incremental clustering data set generated by the incremental clustering unit.
A device, characterized in that it comprises a processor, a memory, and a computer program stored on the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the method according to any one of claims 1 to 7.
A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method according to any one of claims 1 to 7 is implemented.