WO2021232442A1 - Density clustering method and apparatus on basis of dynamic grid hash index - Google Patents
Density clustering method and apparatus on basis of dynamic grid hash index
- Publication number
- WO2021232442A1 (PCT/CN2020/092225)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- incremental
- upseed
- algorithm
- clustering
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Definitions
- This application relates to the field of data processing, in particular to a density clustering method and device based on dynamic grid hash index.
- Uncertain data refers to data that contains noise. This noise causes the original data to deviate from the correct values. When such data exists in a database, probability calculations need to be introduced.
- PDBSCAN is a clustering algorithm for attribute-uncertain data.
- The idea of the PDBSCAN algorithm comes from the DBSCAN algorithm. DBSCAN is only suitable for deterministic data, whereas PDBSCAN replaces the previously deterministic values with probabilities, making it suitable for uncertain data.
- The steps of the PDBSCAN algorithm are as follows:
- Minpts: the threshold for judging whether a point is a core point
- f_value: the probability threshold for direct density-reachability
- Output: the data set and the corresponding class labels
- Algorithm 1 describes the PDBSCAN algorithm.
- Algorithm 2 gives the details of its cluster expansion.
- When PN_Eps(o_p) is between 1 and MinPts, the type of the data object still cannot be determined directly. When PN_Eps(o_p) is greater than or equal to MinPts, the point is a core point; the PDBSCAN algorithm then assigns the data whose direct density-reachability probability is greater than the threshold f_value to the same class (steps 8-16) and calls the Expand_cluster function to expand the existing cluster.
- After the expansion step is completed, the data points with a class label of 0 are processed again and classified as noise points.
- Expand_cluster(PNeighborhood(o_p), clu_num, f_value, Minpts)
- n the size of the attribute uncertainty data set
- m the dimension of the attribute uncertainty data object
- S the number of different probability distribution functions introduced.
- the time complexity of computing the pairwise distance probabilities is O(n^2·m·S^2).
- n scans are required in the worst case, so the time complexity of the PDBSCAN algorithm is O(n^2·m·S^2).
- the calculation process needs to maintain a matrix of the probabilities that the distance between any two points is less than a specified radius, so the space complexity of the PDBSCAN algorithm is O(n^2).
- the time complexity of the PDBSCAN algorithm is too high, at the O(n^2) level;
- the present application is proposed in order to provide a method and device for density clustering based on dynamic grid hash index that overcome the problems or at least partially solve the problems, including:
- a density clustering method based on dynamic grid hash index including:
- Obtain incremental preset information including: D: incremental data set; Eps: radius; Minpts: judgment threshold of whether it is a core point; unAttr: dimension with uncertain value;
- A1: obtain PNeighborhood(o_p), and determine the probability that p is a core point based on PN_Eps(o_p);
- After obtaining UpSeed_ins(p), the method further includes: if UpSeed_ins(p) is empty and N_Eps(p) does not contain a core object, p is regarded as noise and the procedure returns.
- After obtaining UpSeed_ins(p), the method further includes: if UpSeed_ins(p) is not empty, and the objects it contains are neither density-reachable from the core objects of any known cluster nor members of any cluster, a new cluster is created and the procedure returns.
- After obtaining UpSeed_ins(p), the method further includes: if, before p is inserted, the objects contained in UpSeed_ins(p) belong to the same cluster; or they carry different class labels and the data with different class labels are still not density-reachable after p is inserted; or UpSeed_ins(p) is empty but N_Eps(p) contains core objects, then p is merged into an existing cluster and the procedure returns.
- After obtaining PN_Eps(o_p) and UpSeed_del(p), the method further includes: if p is noise, p is removed and the procedure returns.
- After obtaining PN_Eps(o_p) and UpSeed_del(p), the method further includes: if p is not noise, UpSeed_del(p) is empty, and N_Eps(p) contains no core point after p is removed, the other data points of the same class as p are set as noise and the procedure returns.
- After obtaining PN_Eps(o_p) and UpSeed_del(p), the method further includes: if UpSeed_del(p) is empty but N_Eps(p) still contains a core object, or the data points in UpSeed_del(p) are directly density-reachable from one another, these data objects still form a single cluster after p is deleted, and the procedure returns.
- a density clustering device based on dynamic grid hash index including:
- the information input unit is used to obtain incremental preset information, including: D: incremental data set; Eps: radius; Minpts: judging threshold for core points; unAttr: dimension with uncertain value;
- a data insertion unit configured to generate a data set after incremental clustering on the basis of the original data set through the density clustering method according to claim 1 according to the acquired incremental preset information;
- the search result output unit is configured to output the incremental clustering data set generated by the incremental clustering unit.
- a device that includes a processor, a memory, and a computer program stored on the memory and capable of running on the processor.
- when the computer program is executed by the processor, the steps of the above-mentioned density clustering method based on dynamic grid hash index are implemented.
- a computer-readable storage medium stores a computer program on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the above-mentioned density clustering method based on dynamic grid hash index are realized.
- the incremental preset information includes: D: incremental data set; Eps: radius; Minpts: threshold for judging whether a point is a core point; unAttr: dimension with uncertain values; build an index G from the existing data with class labels; repeat the following steps to insert each data object p in D into the index G: A1, obtain PNeighborhood(o_p) and determine the probability that p is a core point according to PN_Eps(o_p); A2, obtain UpSeed_ins(p); A3, if the objects in UpSeed_ins(p) belong to different classes but, after p is inserted, all objects in UpSeed_ins(p) are directly or indirectly density-reachable, merge the clusters containing p and the objects in UpSeed_ins(p); and/or: build an index G from the existing data with class labels and find the position of p in the original data set; repeat the following steps to delete each data object p in D from the index
- the time complexity of the algorithm is reduced from O(n^2) to O(n), and the space complexity is reduced from O(n^2) to O(1);
- the algorithm is made suitable for dynamic data sets, where incremental clustering is more efficient than full clustering; on the basis of the newly proposed GH-PDBSCAN algorithm combined with the DGridHash index structure, the Incremental GH-PDBSCAN algorithm is proposed, making it suitable for clustering dynamic uncertain data sets; the dynamic grid-based hash index structure has good scalability and can be extended to other algorithm fields, such as incremental clustering and incremental classification.
- FIG. 1 is a schematic diagram of the three-layer structure of the H grid provided by an embodiment of the present application
- Figure 2 is a schematic diagram of an affected part of a grid provided by an embodiment of the present application.
- FIG. 3 is a schematic diagram of efficiency comparison between the GH-PDBSCAN algorithm and the PDBSCAN algorithm provided by Example 1 of the present application;
- Figure 4.1 is a schematic diagram of the speedup ratio of the Incremental GH-PDBSCAN when inserting data provided in Example 2 of this application;
- Figure 4.2 is a schematic diagram of the speedup ratio of Incremental GH-PDBSCAN when deleting data provided in Example 2 of this application;
- Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
- GridHash grid-based hash index structure
- DGridHash dynamic grid-based hash index structure
- GridHash divides the grid layer by layer and maps it to a hash table through a hash function.
- the space R^d is divided into several cells; the side length of each cell is given in terms of c and d, where c is the search radius and d is the dimension of the data.
- the hierarchical structure obtained by the division is marked as H, and only non-empty cells in H are stored.
- cnt(c) is used to represent all data objects covered by cell c.
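The expression for the cell side length did not survive in this text. As a minimal sketch, assuming the choice c / sqrt(d) that is common in grid-based density clustering (an assumption, not a formula confirmed by the source), the side length can be checked against the search radius as follows. The function names `cell_side` and `cell_diagonal` are hypothetical.

```python
import math


def cell_side(c, d):
    """Assumed side length c / sqrt(d): the cell diagonal then equals the search radius c."""
    return c / math.sqrt(d)


def cell_diagonal(side, d):
    """Longest distance between two points inside a d-dimensional cube with the given side."""
    return side * math.sqrt(d)
```

Under this assumption any two objects in the same cell are at most c apart, which is what lets a range search treat a whole cell as a unit.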
- In the process of constructing GridHash, each cell first generates a hashDate class object, which records the coordinate value and index number of the cell, initializes the parent node to empty, and sets the node type to non-leaf node. The object is then mapped to the corresponding position of the hash table according to its index number, and each data object in the data set is assigned to a hashDate class object and mapped down layer by layer until the leaf nodes are reached.
- Each element in the hash table is a hashDate class, which contains several operations and related data definitions.
- the data contained in it is defined as follows:
- a hash function is needed to map the cell to the hash table, and the key value of the data point in the hash table is determined according to the dimensions of the data point.
- the hash function can be set as follows:
- axisX is the abscissa value of the data object
- axisY is the ordinate value of the data object
- minX is the minimum value of the first dimension of the grid structure
- minY is the minimum value of the second dimension of the grid structure
- wCell is the side length of the cell
- colsNum is the number of columns in the grid.
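The hash expression itself is not present in this text. The sketch below combines the listed variables in one plausible way, assuming a row-major numbering of two-dimensional cells; this layout is an assumption for illustration, and `cell_key` is a hypothetical name rather than the patent's function.

```python
import math


def cell_key(axis_x, axis_y, min_x, min_y, w_cell, cols_num):
    """Map a 2-D data object to a hash-table key (assumed row-major cell numbering)."""
    col = math.floor((axis_x - min_x) / w_cell)  # column index of the containing cell
    row = math.floor((axis_y - min_y) / w_cell)  # row index of the containing cell
    return row * cols_num + col
```

With this layout, all objects that fall in the same cell share one key, so the cell can be found in the hash table with a single lookup.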
- dynamically inserting data objects in DGridHash is a process of constantly creating new nodes and adding them to the index structure.
- the specific algorithm process is as follows:
- Input dataPoint: the data object to be inserted; hashTable: hash table; wCell: cell side length; wCell_Final: end cell side length; colsNum: the number of grid columns;
- when a data object is dynamically deleted from DGridHash, all nodes containing the object are traversed from top to bottom, and the object is deleted from the bottom up.
- during deletion, if the number of data objects contained in a node is not zero after the object is removed, only the data object is deleted, with no other processing; if the number of data objects contained in the node becomes zero, the node itself is also deleted and its parent node is checked. If the parent node's count is also zero, the parent node is deleted as well, and so on up to the root node.
- the specific algorithm process is as follows:
- Input: dataPoint: the data object to be deleted; hashTable: hash table; wCell: cell side length; wCell_Final: end cell side length; colsNum: the number of grid columns;
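The insertion and deletion procedures are only described in words here. The following is a minimal sketch of that behaviour, assuming a two-dimensional grid whose cell side is halved from `w_cell` down to `w_cell_final` at each layer; the class name `DGridHashSketch`, the halving rule, and the key layout are illustrative assumptions, not the patent's data structure.

```python
class DGridHashSketch:
    """Toy multi-layer grid index: layer 0 uses w_cell, each deeper layer halves the side."""

    def __init__(self, min_x, min_y, w_cell, w_cell_final):
        self.min_x, self.min_y = min_x, min_y
        self.sides = []
        side = w_cell
        while side >= w_cell_final:
            self.sides.append(side)
            side /= 2
        # (layer, col, row) -> set of points; only non-empty cells are stored
        self.table = {}

    def _keys(self, point):
        """Yield the cell key of the point on every layer, from top to bottom."""
        x, y = point
        for layer, side in enumerate(self.sides):
            yield (layer, int((x - self.min_x) // side), int((y - self.min_y) // side))

    def insert(self, point):
        """Walk top-down, creating nodes that do not exist yet."""
        for key in self._keys(point):
            self.table.setdefault(key, set()).add(point)

    def delete(self, point):
        """Walk bottom-up, dropping every node that becomes empty."""
        for key in reversed(list(self._keys(point))):
            cell = self.table.get(key)
            if cell is None:
                continue
            cell.discard(point)
            if not cell:
                del self.table[key]
```

Because each layer stores its own members in this sketch, a parent becomes empty exactly when its last descendant object is removed, which mirrors the "delete the node, then check the parent" rule described above.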
- Each hashDate object in the first layer has 2 d subtrees, so the space complexity of the second layer is O(2 d ⁇ n ⁇ parNumber), and so on, the space complexity of each layer is O(n),
- the tree height of the hash index is O(1), and d and parNumber are constants, so the space complexity of the dynamic grid-based hash index is O(parNumber+n).
- the time complexity of constructing the index is O(n), and the time complexity of the range search is O(1).
- the time and space complexity of the PDBSCAN algorithm are O(n^2·m·S^2) and O(n^2) respectively, mainly because the algorithm needs to calculate the distance probability between any two points and store the probability matrix during preprocessing. Therefore, the key to reducing the time and space complexity of the PDBSCAN algorithm is to introduce an index structure.
- the grid-based hash index can be introduced into the PDBSCAN algorithm, where it is used only for the range search over the deterministic attributes; this is then combined with the calculation of "the probability that the distance between any two points is less than or equal to a given radius", keeping only the data points whose probability value is greater than 0, which makes the algorithm suitable for clustering uncertain data.
- This kind of PDBSCAN algorithm that introduces a grid-based hash structure is called GH-PDBSCAN algorithm in this article.
- When the PDBSCAN algorithm calculates PNeighborhood(o_p), it first finds the Eps-neighbors of o_p through the index, and then uses the Monte Carlo method to calculate the probability that the distance between o_p and each of its Eps-neighbors is smaller than a certain threshold thresholdValue; thresholdValue is defined as follows:
- o_q is an Eps-neighbor of o_p, and o_pi denotes the i-th attribute of o_p.
- the first step is to build an index based on the deterministic attribute
- the second step is to calculate the probability of less than a certain threshold based on the Monte Carlo method.
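The second step can be illustrated with a small Monte Carlo estimate. This is a sketch under stated assumptions: each object has fixed deterministic coordinates plus one uncertain attribute drawn from a caller-supplied sampler, and the distance is Euclidean. The function name `prob_within_eps` and its parameters are hypothetical and do not come from the source.

```python
import math
import random


def prob_within_eps(p_fixed, q_fixed, p_sampler, q_sampler, eps,
                    n_samples=10000, seed=0):
    """Estimate P(dist(p, q) <= eps) by sampling the uncertain attribute of each object."""
    rng = random.Random(seed)
    # Squared distance over the deterministic attributes is the same for every sample.
    fixed_sq = sum((a - b) ** 2 for a, b in zip(p_fixed, q_fixed))
    hits = 0
    for _ in range(n_samples):
        diff = p_sampler(rng) - q_sampler(rng)
        if math.sqrt(fixed_sq + diff ** 2) <= eps:
            hits += 1
    return hits / n_samples
```

For example, `prob_within_eps((0.0, 0.0), (0.0, 0.0), lambda r: r.gauss(0, 1), lambda r: 0.0, 1.0)` estimates the probability that a standard normal deviate lies within one unit of zero. Only pairs already returned by the index need this estimate, which is where the saving over an all-pairs probability matrix comes from.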
- the time complexity of the grid-based hash index structure in indexing and range search is O(n) and O(1) respectively
- the time complexity of GH-PDBSCAN is O(n·m·S^2)
- S represents the number of different probability distribution functions introduced.
- the space complexity of GH-PDBSCAN algorithm is O(n).
- the time and space complexity of the PDBSCAN algorithm is relatively high, and it is not suitable for situations with large data volume.
- the GH-PDBSCAN algorithm has higher efficiency and lower space consumption, which is more meaningful and usable.
- the experiments use four different data sets, of which image and abalone come from UCI, with the negative values in these two data sets deleted.
- the other two are artificially synthesized data sets, as shown in Table 1.
- attribute-uncertain data is generated for all four data sets using the method of Gullo et al.
- Each data set comes in two forms: random and normal.
- the test platform is Windows 7 with 32 GB of memory and a 32-core CPU; the development tool is Visual Studio 2012, and the programming language is C++.
- Figure 3 shows the four data sets, corresponding to four different situations: (a) the data value range is large and the data volume is small; (b) the data value range is small and the data volume is small; (c) the data value range is large and the data volume is large; (d) the data value range is small and the data volume is large.
- the present invention proposes an efficient incremental algorithm built on GH-PDBSCAN: the Incremental GH-PDBSCAN algorithm.
- UpSeed del (p) ⁇ q
- q is the core object in D ⁇ p ⁇ , Is not the core point in D
- q′ is the core point in D ⁇ p ⁇
- the definition of the seed object and the steps involved in incremental clustering are the same as those of Incremental DBSCAN.
- Before running Incremental PDBSCAN, the PDBSCAN clustering algorithm must be run first, storing for each data object whether it is a core point, its PN_Eps(o_p) value, and its class label; incremental clustering of the uncertain data is then performed according to attribute information such as UpSeed_ins(p), UpSeed_del(p), and PN_Eps(o_p).
- D: incremental data set
- Eps: radius
- Minpts: threshold for judging whether a point is a core point
- unAttr: dimension with uncertain values
- Output: the data set after incremental clustering is completed
- Incremental GH-PDBSCAN deletes data as follows:
- D: set of objects to delete
- Eps: radius
- Minpts: threshold for judging whether a point is a core point
- unAttr: dimension with uncertain values
- Output: the data set after incremental clustering is completed
- Step 6: if p is noise, delete it and return; otherwise go to step 7;
- Step 7: if p is not noise, UpSeed_del(p) is empty, and N_Eps(p) contains no core point after p is deleted, the other data points of the same class as p are set as noise and the procedure returns; otherwise go to step 8;
- Incremental GH-PDBSCAN has four different situations during incremental insertion: noise, creating clusters, merging into a cluster, and merging clusters.
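The four insertion outcomes can be summarised as a small decision function. This is a simplified sketch of the case analysis described in this document, assuming class label 0 means "not yet in any cluster"; the function name `classify_insertion` and its arguments are hypothetical, and the reachability test is passed in as a flag rather than computed.

```python
def classify_insertion(upseed_labels, neighborhood_has_core, reachable_after_insert=True):
    """Return which of the four insertion cases applies to a new object p.

    upseed_labels: class labels of the objects in UpSeed_ins(p), 0 = no cluster.
    neighborhood_has_core: whether N_Eps(p) contains a core object.
    reachable_after_insert: whether differently labelled seeds become
        density-reachable from each other once p is inserted.
    """
    clusters = {label for label in upseed_labels if label != 0}
    if not upseed_labels:
        # No seed objects: p joins a cluster only if a core object is nearby.
        return "absorb" if neighborhood_has_core else "noise"
    if not clusters:
        return "create"  # seeds exist but none belongs to a cluster yet
    if len(clusters) == 1 or not reachable_after_insert:
        return "absorb"  # p is merged into one existing cluster
    return "merge"       # p connects several clusters into one
```

Mixed seed sets (some label 0, one real cluster) are treated as absorption here, which is a simplification chosen for the sketch.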
- the time complexity of the algorithm is O (q ⁇ k), where q is the probability of the insertion point up directly density p
- k represents the amount of data contained in UpSeed del (p).
- the space complexity of Incremental GH-PDBSCAN is O(n+m), where n represents the size of the original data set, and m represents the size of the data set that needs to be updated incrementally.
- Incremental GH-PDBSCAN has four different situations when deleting data: noise, eliminating clusters, reducing cluster objects and cluster splitting.
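Telling "reducing cluster objects" apart from "cluster splitting" comes down to whether the affected seed objects are still connected after the deletion. The sketch below shows one way to check this with a breadth-first search; it assumes a symmetric adjacency map of direct density-reachability among the remaining objects, and the names `seed_components` and `neighbors` are hypothetical.

```python
from collections import deque


def seed_components(seeds, neighbors):
    """Group seed objects that can still reach each other after a deletion.

    seeds: list of object ids from UpSeed_del(p).
    neighbors: dict mapping an object id to the ids directly density-reachable
        from it once p has been removed (assumed symmetric).
    """
    assigned = set()
    groups = []
    for start in seeds:
        if start in assigned:
            continue
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in neighbors.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        group = [s for s in seeds if s in seen]
        assigned.update(group)
        groups.append(group)
    return groups
```

A single returned group means the cluster only lost objects; more than one group indicates that the cluster has split and each group needs its own label.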
- the time complexity of Incremental GH-PDBSCAN is O(L*(nm)), where L is the number of clusters contained in UpSeed_del(p), n is the size of the original data set, and m is the size of the data set that needs to be deleted.
- the space complexity of the algorithm is O(nm).
- the GH-PDBSCAN algorithm can process big data, so this article uses a data volume of 1 million objects; each data object is three-dimensional spatial data, and the third dimension of each object is designated as the uncertain attribute.
- the experimental method of the Incremental PDBSCAN algorithm proposed in this paper is similar to the experimental method of the Incremental DBSCAN algorithm proposed by Ester et al. in 1998.
- the clustering time of each data object of the GH-PDBSCAN algorithm depends on the range search time.
- the time consumption of clustering n data objects can be recorded as Cost DBSCAN (n), that is
- the number of range searches of the Incremental GH-PDBSCAN algorithm depends on the specific application, so experiments must be used to verify the number of range searches required for each insertion and deletion of data. Generally speaking, deleting a data object will affect more data points than inserting a data object.
- r_ins and r_del denote the average number of range searches per data object during insertion and deletion, respectively; f_ins and f_del denote the proportions of incremental insertions and incremental deletions in an incremental update.
- the time consumption of the Incremental GH-PDBSCAN algorithm is:
- $\mathrm{Cost}_{\text{Incremental GH-PDBSCAN}}(m) = m \cdot (f_{ins} \cdot r_{ins} + f_{del} \cdot r_{del})$ (6)
- the acceleration ratio of the Incremental GH-PDBSCAN algorithm and the GH-PDBSCAN algorithm can be calculated, which is defined as follows:
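The speedup expression is not present in this text. The sketch below computes the incremental cost from equation (6) and a speedup ratio under one assumption borrowed from the Incremental DBSCAN analysis of Ester et al. that this document refers to: full re-clustering performs one range search per object left after the update, i.e. n + f_ins·m - f_del·m searches. That full-clustering cost model and the function names are assumptions for illustration.

```python
def incremental_cost(m, f_ins, r_ins, f_del, r_del):
    """Equation (6): range searches needed to apply m incremental updates."""
    return m * (f_ins * r_ins + f_del * r_del)


def speedup(n, m, f_ins, r_ins, f_del, r_del):
    """Assumed ratio of full re-clustering cost to incremental cost.

    The numerator assumes one range search per object remaining after the update.
    """
    full_cost = n + f_ins * m - f_del * m
    return full_cost / incremental_cost(m, f_ins, r_ins, f_del, r_del)
```

Under this model the ratio grows with n and shrinks as m grows, which matches the qualitative behaviour reported for the experiments.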
- the Incremental GH-PDBSCAN algorithm has a great improvement in efficiency compared to the GH-PDBSCAN algorithm.
- the size of the original data set is proportional to the speedup ratio; when the original data set is fixed, the larger the inserted or deleted data set, the lower the speedup ratio.
- the advantages of the Incremental GH-PDBSCAN algorithm can be better reflected.
- the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.
- a computer device of a density clustering method based on dynamic grid hash index of the present invention is shown, which may specifically include the following:
- the above-mentioned computer device 12 is represented in the form of a general-purpose computing device.
- the components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processing unit 16).
- the bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus structures.
- For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
- the computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by the computer device 12, including volatile and nonvolatile media, removable and non-removable media.
- the system memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32.
- the computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
- the storage system 34 may be used to read and write non-removable, non-volatile magnetic media (commonly referred to as "hard drives").
- a disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk")
- an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM, or other optical media).
- each drive can be connected to the bus 18 through one or more data medium interfaces.
- the memory may include at least one program product, and the program product has a set (for example, at least one) of program modules 42 configured to perform the functions of the various embodiments of the present invention.
- a program/utility tool 40 having a set of (at least one) program module 42 may be stored in, for example, a memory.
- Such program module 42 includes, but is not limited to, an operating system, one or more application programs, and other program modules 42 and program data, each of these examples or some combination may include the realization of a network environment.
- the program module 42 generally executes the functions and/or methods in the described embodiments of the present invention.
- the computer device 12 may also communicate with one or more external devices 14 (such as a keyboard, pointing device, display 24, camera, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (such as a network card, modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 22.
- the computer device 12 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 20. As shown in the figure, the network adapter 20 communicates with other modules of the computer device 12 through the bus 18.
- the processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, such as implementing the density clustering method based on dynamic grid hash index provided by the embodiment of the present invention.
- when the processing unit 16 executes the above program, it implements: acquiring the incremental preset information, including: D: incremental data set; Eps: radius; Minpts: threshold for judging whether a point is a core point; unAttr: dimension with uncertain values; building an index G from the existing data with class labels; repeating the following steps to insert each data object p in D into the index G: A1, obtain PNeighborhood(o_p) and determine the probability that p is a core point according to PN_Eps(o_p); A2, obtain UpSeed_ins(p); A3, if the objects in UpSeed_ins(p) belong to different classes but, after p is inserted, all objects in UpSeed_ins(p) are directly or indirectly density-reachable, merge the clusters containing p and the objects in UpSeed_ins(p); and/or: building an index G from the existing data with class labels and finding the position of p in the original data set; repeating the steps for each data object in D
- the present invention also provides a computer-readable storage medium on which a computer program is stored.
- when the program is executed by a processor, it implements the density clustering method based on dynamic grid hash index provided in all embodiments of the present application:
- that is, when the program is executed by the processor, it implements: acquiring the incremental preset information, including: D: incremental data set; Eps: radius; Minpts: threshold for judging whether a point is a core point; unAttr: dimension with uncertain values; building an index G from the existing data with class labels; repeating the following steps to insert each data object p in D into the index G: A1, obtain PNeighborhood(o_p) and determine the probability that p is a core point according to PN_Eps(o_p); A2, obtain UpSeed_ins(p); A3, if the objects in UpSeed_ins(p) belong to different classes but, after p is inserted, all objects in UpSeed_ins(p) are directly or indirectly density-reachable, merge the clusters containing p and the objects in UpSeed_ins(p); and/or: building an index G from the existing data with class labels and finding the position of p in the original data set; repeating the following steps
- the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
- the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
- the computer-readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.
- the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and computer-readable program code is carried therein. This propagated data signal can take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
- the computer-readable medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device .
- the computer program code used to perform the operations of the present invention can be written in one or more programming languages or a combination thereof.
- the above-mentioned programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
- the program code can be executed entirely on the user's computer, partly on the user's computer, executed as an independent software package, partly on the user's computer and partly executed on a remote computer or entirely executed on the remote computer or server.
- the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, via the Internet using an Internet service provider).
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A density clustering method and apparatus based on a dynamic grid hash index. The method comprises: acquiring preset incremental information, comprising D, an incremental data set; Eps, a radius; Minpts, the decision threshold for whether a point is a core point; and unAttr, the dimension whose value is uncertain; generating, according to the acquired preset incremental information and by means of the density clustering method, a data set on which incremental clustering has been performed on the basis of an original data set; and, after the loop ends, obtaining the data set that has completed incremental clustering. By introducing a new index structure adapted to uncertain data, the time complexity of the algorithm is reduced from O(n^2) to O(n) and the space complexity from O(n^2) to O(1); the algorithm adapts to dynamic data sets, and incremental clustering is more efficient than full clustering. On the basis of the newly proposed GH-PDBSCAN algorithm, combined with the DGridHash index structure, the Incremental GH-PDBSCAN algorithm is proposed, making it suitable for clustering dynamic uncertain data sets.
Description
This application relates to the field of data processing, and in particular to a density clustering method and apparatus based on a dynamic grid hash index.
In computer science, uncertain data is data that contains noise which causes the recorded values to deviate from the correct values. When such data exists in a database, probability calculations must be introduced.
At present, PDBSCAN is a clustering algorithm for attribute-uncertain data. Its idea derives from the DBSCAN algorithm, but DBSCAN applies only to deterministic data, whereas PDBSCAN replaces the previously deterministic values with probabilities so that it can handle uncertain data types. The steps of the PDBSCAN algorithm are as follows:
Algorithm 1: PDBSCAN
Input:
D: uncertain data set; Eps: search radius;
Minpts: threshold for deciding whether a point is a core point; F_value: probability threshold for direct density-reachability;
Output: the data set with the corresponding class labels;
Algorithm process:
Algorithm 1 describes the PDBSCAN algorithm, and Algorithm 2 gives the details of its cluster expansion. clu_num = k means that the current cluster is k, where k is a positive integer. class(i) = 0, -1, or 1…k means, respectively, that data object o_i has not yet been classified, has been determined to be noise, or belongs to one of the clusters 1…k. type(i) = 0, -1, or 1 means, respectively, that data object o_i is a border point, a noise point, or a core point. visited(i) = 1 or 0 means, respectively, that data object o_i has or has not been processed.
In Algorithm 1, after initialization (lines 1-2), PDBSCAN visits each data point o_p and computes PNeighborhood(o_p) and PN_Eps(o_p) (lines 3-5). If PN_Eps(o_p) equals 1, the Eps-neighborhood of the point contains only one point, so the point is judged to be noise (lines 6-7). If PN_Eps(o_p) lies between 1 and Minpts, the type of the data object cannot yet be determined directly. When PN_Eps(o_p) is greater than or equal to Minpts, the point is a core point; PDBSCAN assigns the data whose probability of being directly density-reachable from it exceeds the threshold f_value to the same cluster (lines 8-16) and calls the Expand_cluster function to expand the existing cluster. After the expansion step is complete, the data points whose class label is still 0 are processed again and classified as noise points.
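The three-way test on PN_Eps(o_p) described above can be sketched as follows. This is a minimal illustration, not the listing of Algorithm 1 (which is not reproduced in this text); the function name `provisionalType` and the enum constants are hypothetical, and treating the undecided middle case as a border candidate is an assumption.

```cpp
#include <cassert>

// Type codes as used in the text: 0 = border, -1 = noise, 1 = core.
enum PointType { kBorder = 0, kNoise = -1, kCore = 1 };

// pnEps:  value of PN_Eps(o_p) for the point being visited.
// minPts: the Minpts threshold for core points.
// Hypothetical helper; the middle case is only provisional in the text.
PointType provisionalType(double pnEps, int minPts) {
    assert(minPts >= 1);
    if (pnEps >= minPts) return kCore;   // dense enough: core point
    if (pnEps <= 1.0)    return kNoise;  // only one point in the neighborhood
    return kBorder;                      // undecided, assumed border candidate
}
```

A call such as `provisionalType(2.5, 4)` falls in the undecided range, which the text resolves only later during cluster expansion.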
The following are the steps of the Expand_cluster function used by the PDBSCAN algorithm. Algorithm 2: Expand_cluster(PNeighborhood(o_p)′, clu_num, f_value, Minpts)
Let n denote the size of the attribute-uncertain data set, m the dimension of the attribute-uncertain data objects, and S the number of different probability distribution functions introduced. In the preprocessing stage, the computation of the pairwise distance probabilities has time complexity O(n^2 m S^2). In the main loop, n scans are needed in the worst case, so the time complexity of the PDBSCAN algorithm is O(n^2 m S^2). The algorithm must maintain a matrix of the probabilities that the distance between any two points is less than the specified radius, so the space complexity of the PDBSCAN algorithm is O(n^2).
From the above, the shortcomings of PDBSCAN are as follows:
1. The time complexity of the PDBSCAN algorithm is too high, at the O(n^2) level;
2. The space complexity of the PDBSCAN algorithm is too high, at the O(n^2) level;
3. No incremental clustering algorithm for dynamic uncertain data corresponding to the PDBSCAN algorithm has been proposed.
Summary of the invention
In view of these problems, the present application is proposed to provide a density clustering method and apparatus based on a dynamic grid hash index that overcome the problems or at least partially solve them, including:
A density clustering method based on a dynamic grid hash index, comprising:
obtaining preset incremental information, including: D: incremental data set; Eps: radius; Minpts: threshold for deciding whether a point is a core point; unAttr: the dimension whose value is uncertain;
building an index G from the existing data with class labels;
repeating the following steps to insert each data object p in D into the index G:
A1. Obtain PNeighborhood(o_p), and determine from PN_Eps(o_p) the probability that p is a core point;
A2. Obtain UpSeed_ins(p);
A3. If the objects in UpSeed_ins(p) belong to different clusters, but after p is inserted all objects in UpSeed_ins(p) are directly or indirectly density-reachable, merge p with the clusters containing the objects in UpSeed_ins(p);
and/or;
building an index G from the existing data with class labels, and locating p in the original data set;
repeating the following steps to delete each data object p in D from the index G:
B1. Obtain PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p);
B2. If the data objects in UpSeed_del(p) are not directly density-reachable from one another, and remain not density-reachable through the other core points of the same cluster, split the original cluster into several clusters;
After the loop ends, the data set that has completed incremental clustering is obtained.
Further, after UpSeed_ins(p) is obtained, the method also includes: if UpSeed_ins(p) is empty and N_Eps(p) contains no core object, treating p as noise and returning ∞.
Further, after UpSeed_ins(p) is obtained, the method also includes: if UpSeed_ins(p) is non-empty, and the objects it contains neither have a core object of a known cluster among their density-reachable objects nor belong to any cluster, creating a new cluster and returning ∞.
Further, after UpSeed_ins(p) is obtained, the method also includes: if, before p is inserted, the objects in UpSeed_ins(p) belong to the same cluster; or they carry different class labels and the data with different class labels are still not density-reachable after p is inserted; or UpSeed_ins(p) is empty and N_Eps(p) contains a core object, merging p into a cluster and returning ∞.
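The insertion cases above (noise, new cluster, absorption into a cluster, and the merge of step A3) can be summarized as a small decision function. This is only a sketch under stated assumptions: the struct `InsertContext`, its boolean fields, and the function `classifyInsertion` are hypothetical names, and the flags stand in for the results of computing UpSeed_ins(p) and N_Eps(p), which are not shown here.

```cpp
#include <cassert>

enum class InsertCase { Noise, NewCluster, Absorb, Merge };

// Hypothetical summary of what is known about UpSeed_ins(p) and N_Eps(p).
struct InsertContext {
    bool upSeedEmpty;           // UpSeed_ins(p) contains no objects
    bool coreInNeighborhood;    // N_Eps(p) contains a core object
    bool upSeedUnclustered;     // no object of UpSeed_ins(p) belongs to a cluster
    bool upSeedSpansClusters;   // objects of UpSeed_ins(p) carry different labels
    bool reachableAfterInsert;  // they are all density-reachable once p is inserted
};

InsertCase classifyInsertion(const InsertContext& ctx) {
    // An empty set cannot span several clusters.
    assert(!(ctx.upSeedEmpty && ctx.upSeedSpansClusters));
    if (ctx.upSeedEmpty)
        return ctx.coreInNeighborhood ? InsertCase::Absorb : InsertCase::Noise;
    if (ctx.upSeedUnclustered)
        return InsertCase::NewCluster;
    if (ctx.upSeedSpansClusters && ctx.reachableAfterInsert)
        return InsertCase::Merge;   // step A3
    return InsertCase::Absorb;      // same cluster, or labels differ but stay unreachable
}
```

Keeping the decision separate from the neighborhood computation makes each case easy to check against the corresponding clause of the method.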
Further, after PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p) are obtained, the method also includes: if p is noise, deleting it and returning ∞.
Further, after PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p) are obtained, the method also includes: if p is not noise, UpSeed_del(p) is empty, and N_Eps(p) contains no core point after p is deleted, setting the other data points in the same cluster as p to noise and returning ∞.
Further, after PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p) are obtained, the method also includes: if UpSeed_del(p) is empty but N_Eps(p) still contains a core object, or the data points in UpSeed_del(p) are all directly density-reachable, these data objects remain in the same cluster after p is deleted, and ∞ is returned.
A density clustering apparatus based on a dynamic grid hash index, comprising:
an information input unit, configured to obtain preset incremental information, including: D: incremental data set; Eps: radius; Minpts: threshold for deciding whether a point is a core point; unAttr: the dimension whose value is uncertain;
a data insertion unit, configured to generate, according to the obtained preset incremental information and by the density clustering method of claim 1, a data set on which incremental clustering has been performed on the basis of the original data set;
a search result output unit, configured to output the data set, generated by the incremental clustering unit, that has completed incremental clustering.
A device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the density clustering method based on a dynamic grid hash index as described above.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the density clustering method based on a dynamic grid hash index as described above.
This application has the following advantages:
In the embodiments of the present application, preset incremental information is obtained, including: D: incremental data set; Eps: radius; Minpts: threshold for deciding whether a point is a core point; unAttr: the dimension whose value is uncertain; an index G is built from the existing data with class labels; the following steps are repeated to insert each data object p in D into the index G: A1. obtain PNeighborhood(o_p), and determine from PN_Eps(o_p) the probability that p is a core point; A2. obtain UpSeed_ins(p); A3. if the objects in UpSeed_ins(p) belong to different clusters, but after p is inserted all objects in UpSeed_ins(p) are directly or indirectly density-reachable, merge p with the clusters containing the objects in UpSeed_ins(p); and/or; an index G is built from the existing data with class labels, and p is located in the original data set; the following steps are repeated to delete each data object p in D from the index G: B1. obtain PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p); B2. if the data objects in UpSeed_del(p) are not directly density-reachable from one another, and remain not density-reachable through the other core points of the same cluster, the original cluster is split into several clusters; after the loop ends, the data set that has completed incremental clustering is obtained.
By introducing a new index structure adapted to uncertain data, the time complexity of the algorithm is reduced from O(n^2) to O(n) and the space complexity from O(n^2) to O(1); the algorithm is made suitable for dynamic data sets, where incremental clustering is more efficient than full clustering; on the basis of the newly proposed GH-PDBSCAN algorithm, combined with the DGridHash index structure, the Incremental GH-PDBSCAN algorithm is proposed, making it suitable for clustering dynamic uncertain data sets; and the dynamic grid-based hash index structure has good scalability and can be extended to other algorithms, such as incremental clustering and incremental classification.
To explain the technical solutions of the present application more clearly, the drawings used in the description are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the three-layer structure of the grid in H provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of the affected part of the grid provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the efficiency comparison between the GH-PDBSCAN algorithm and the PDBSCAN algorithm provided by Example 1 of the present application;
FIG. 4.1 is a schematic diagram of the speedup of Incremental GH-PDBSCAN when inserting data, provided by Example 2 of the present application;
FIG. 4.2 is a schematic diagram of the speedup of Incremental GH-PDBSCAN when deleting data, provided by Example 2 of the present application;
FIG. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
To make the objectives, features, and advantages of the present application clearer and easier to understand, the application is described in further detail below with reference to the drawings and specific implementations. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments in this application, without creative effort, fall within the protection scope of this application.
To address the shortcomings of the PDBSCAN algorithm described in the background, the present invention proposes a grid-based hash index structure (GridHash for short) for the neighborhood computation, and further proposes a dynamic grid-based hash index structure (DGridHash for short) together with an incremental clustering algorithm corresponding to PDBSCAN, as detailed below.
1. Introduction to GridHash
GridHash partitions the grid layer by layer and maps it to a hash table through a hash function. First, the space R^d is divided into cells, each with side length [formula not reproduced in the source text], where c is the search radius and d is the dimension of the data. Each non-empty cell c (one containing at least one data object) is divided into 2^d grids of equal size; each non-empty grid c′ produced in this way is divided recursively in the same manner until the side length of the grid is less than [formula not reproduced in the source text], where ρ is a specified constant.
Referring to FIG. 1, the hierarchy obtained by the division is denoted H, and only the non-empty cells of H are stored. For each cell c, cnt(c) denotes all the data objects covered by c. The side length of a level-i cell is [formula not reproduced in the source text]. The height of H is h = max[1, 1 + log_2(1/ρ)] = O(1); that is, the number of levels into which each cell is divided is a constant. If a grid c′ at level i+1 is contained in a grid c, then c′ is called a child node of c and c is called the parent node of c′; a node with no child nodes is called a leaf node.
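The height formula above can be evaluated directly. The sketch below is illustrative only: the function name `hierarchyHeight` is hypothetical, and rounding log_2(1/ρ) up to an integer is an assumption, since the formula as printed has no rounding. It would also be consistent with this height if the top-level side length were c/√d and division stopped at cρ/√d, but those two formulas are not reproduced in the text and remain assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Height of the hierarchy H: h = max(1, 1 + ceil(log2(1 / rho))).
// The ceiling is an assumption; rho is the specified constant.
int hierarchyHeight(double rho) {
    assert(rho > 0.0);
    int levels = 1 + static_cast<int>(std::ceil(std::log2(1.0 / rho)));
    return std::max(1, levels);
}
```

Because ρ is fixed in advance, the height does not depend on the number of data objects, which is what makes each lookup in the hierarchy a constant-depth descent.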
When GridHash is built, a hashDate object is first generated for each cell; the object records the cell's coordinate ranges in each dimension and its index number, initializes the parent node to null, and sets the node type to non-leaf. The object is then mapped to the corresponding position in the hash table according to its index number. Finally, each data object in the data set is assigned to its hashDate object and mapped downward layer by layer until a leaf node is reached.
Given c and ρ, the time and space complexity of building GridHash are both O(n), and the time and space complexity of a search using this index structure are both O(1).
1. Define a hash table.
2. Each element of the hash table is an object of the hashDate class, which contains several operations and the related data definitions. Its data members are defined as follows:
vector<double> axisX; // value range of the cell in the first dimension
vector<double> axisY; // value range of the cell in the second dimension
vector<int> dateSubscripts; // identifiers of the data points contained in the cell
vector<shared_ptr<hashDate>> hashPointer; // pointer array
bool leafNode; // whether this node is a leaf node
shared_ptr<hashDate> fatherNo; // pointer to the parent node of this node
int indecNumber; // index number of this node in the pointer array of the upper-level node
This process requires a hash function that maps cells to the hash table; the key of a data point in the hash table is determined from its values in each dimension. Taking two-dimensional data as an example, the hash function can be set as: [formula not reproduced in the source text]
where axisX is the abscissa of the data object; axisY is the ordinate of the data object; minX is the minimum value of the first dimension of the grid structure; minY is the minimum value of the second dimension of the grid structure; wCell is the side length of a cell; and colsNum is the number of columns in the grid.
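Since the hash function itself is not reproduced in this text, the following is only one plausible form built from the variables listed above, not the formula of the application. It assumes a row-major key (row index times colsNum plus column index); the function name `cellKey` is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Assumed row-major key for a two-dimensional point:
//   col = floor((axisX - minX) / wCell)
//   row = floor((axisY - minY) / wCell)
//   key = row * colsNum + col
long cellKey(double axisX, double axisY, double minX, double minY,
             double wCell, long colsNum) {
    assert(wCell > 0.0 && colsNum > 0);
    long col = static_cast<long>(std::floor((axisX - minX) / wCell));
    long row = static_cast<long>(std::floor((axisY - minY) / wCell));
    return row * colsNum + col;
}
```

With this form, every point that falls in the same first-level cell receives the same key, so one hash lookup locates the cell.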
Dynamically inserting data objects based on DGridHash
It should be noted that dynamically inserting a data object into DGridHash is a process of continually creating new nodes and adding them to the index structure. The algorithm is as follows:
Dynamic insertion algorithm
Input: dataPoint: the data object to be inserted; hashTable: the hash table; wCell: cell side length; wCell_Final: cell side length at which division stops; colsNum: number of columns in the grid;
Output: the index after the insertion of the data object is complete;
Algorithm process:
Dynamically deleting data objects based on DGridHash
It should be noted that, to delete a data object from DGridHash dynamically, all nodes containing the object are traversed top-down and the object is removed from them bottom-up. If a node still contains data objects after the removal, only the data object is removed and nothing else is done. If the number of data objects in the node drops to zero, the node itself is deleted as well, and its parent node is then examined; if the parent is also empty it is deleted too, and so on until the root node is reached. The algorithm is as follows:
Dynamic deletion algorithm
Input: dataPoint: the data object to be deleted; hashTable: the hash table; wCell: cell side length; wCell_Final: cell side length at which division stops; colsNum: number of columns in the grid;
Output: the index after the deletion of the data object is complete;
Algorithm process:
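The insertion and deletion listings are not reproduced in this text. The following two-dimensional sketch shows one way the described behavior could be realized; it is not the application's algorithm. The types `Node` and `Grid`, the functions `insertPoint` and `erasePoint`, the row-major key, and the rule of splitting while the side length exceeds wCellFinal are all assumptions, and points are assumed to lie inside the grid.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <map>
#include <memory>
#include <unordered_map>
#include <vector>

struct Node {
    double minX, minY, side;                        // lower-left corner, side length
    std::vector<int> ids;                           // identifiers of covered objects
    std::map<int, std::unique_ptr<Node>> children;  // only non-empty children
};

struct Grid {
    double minX, minY, wCell, wCellFinal;
    long colsNum;
    std::unordered_map<long, std::unique_ptr<Node>> table;  // first-level cells
};

// Index (0..3) of the quadrant of n that contains (x, y).
static int quadrant(const Node& n, double x, double y) {
    double half = n.side / 2.0;
    return (x >= n.minX + half ? 1 : 0) + (y >= n.minY + half ? 2 : 0);
}

void insertPoint(Grid& g, int id, double x, double y) {
    assert(g.wCell > 0.0 && g.wCellFinal > 0.0);
    long col = static_cast<long>(std::floor((x - g.minX) / g.wCell));
    long row = static_cast<long>(std::floor((y - g.minY) / g.wCell));
    auto& slot = g.table[row * g.colsNum + col];
    if (!slot)  // create the first-level cell on demand
        slot.reset(new Node{g.minX + col * g.wCell, g.minY + row * g.wCell, g.wCell, {}, {}});
    Node* n = slot.get();
    n->ids.push_back(id);
    while (n->side > g.wCellFinal) {  // descend, creating child nodes as needed
        int q = quadrant(*n, x, y);
        double half = n->side / 2.0;
        auto& child = n->children[q];
        if (!child)
            child.reset(new Node{n->minX + (q & 1) * half, n->minY + (q >> 1) * half, half, {}, {}});
        n = child.get();
        n->ids.push_back(id);
    }
}

// Removes id from n and its descendants; returns true if n became empty.
static bool eraseFrom(Node& n, int id, double x, double y) {
    n.ids.erase(std::remove(n.ids.begin(), n.ids.end(), id), n.ids.end());
    auto it = n.children.find(quadrant(n, x, y));
    if (it != n.children.end() && eraseFrom(*it->second, id, x, y))
        n.children.erase(it);  // drop the emptied child on the way back up
    return n.ids.empty();
}

void erasePoint(Grid& g, int id, double x, double y) {
    long col = static_cast<long>(std::floor((x - g.minX) / g.wCell));
    long row = static_cast<long>(std::floor((y - g.minY) / g.wCell));
    auto it = g.table.find(row * g.colsNum + col);
    if (it != g.table.end() && eraseFrom(*it->second, id, x, y))
        g.table.erase(it);     // the first-level cell itself became empty
}
```

The recursion in `eraseFrom` walks down to the deepest node first and removes empty nodes as it returns, which mirrors the top-down traversal and bottom-up deletion described above.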
Referring to FIG. 2, range search on the index
For a given two-dimensional data point A(A.X, A.Y), its position in the grid is computed from [formula not reproduced in the source text]. Because the width of the first-level grid is [formula not reproduced in the source text], a total of 21 grids are affected, as shown in FIG. 2.
These 21 affected grids are searched to find the data points that satisfy the condition. For a single non-empty cell c and a query point q, the range search steps are the same as in GridHash. The search process of DGridHash is identical to that of the static grid-based hash structure GridHash, but the setting of ρ reduces the quality of the range search on this index structure. To obtain exact results in the incremental clustering algorithm, the data points returned by a search on the index therefore have to be filtered. Here, a linear traversal is used to filter the returned data points again.
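The count of 21 affected grids can be reproduced under one assumption that the text does not state explicitly (the width formula is not reproduced): that the first-level cell side is Eps/√2 in two dimensions. The sketch below enumerates the cell offsets whose closest point lies strictly within Eps of the cell containing the query point; the function name `affectedCells` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <utility>
#include <vector>

// Offsets (dx, dy) of first-level cells that may hold points within Eps of a
// point in cell (0, 0), assuming a cell side of Eps / sqrt(2).
std::vector<std::pair<int, int>> affectedCells() {
    std::vector<std::pair<int, int>> cells;
    // Offsets beyond 2 leave a gap of at least two cell sides, which exceeds Eps.
    for (int dx = -2; dx <= 2; ++dx) {
        for (int dy = -2; dy <= 2; ++dy) {
            int gx = std::max(std::abs(dx) - 1, 0);  // gap between the cells, in sides
            int gy = std::max(std::abs(dy) - 1, 0);
            // Squared gap in units of side^2; Eps^2 equals 2 * side^2.
            if (gx * gx + gy * gy < 2) cells.emplace_back(dx, dy);
        }
    }
    assert(cells.size() <= 25);
    return cells;
}
```

Under this assumption the 5 x 5 block around the query cell loses only its four corners, which gives the 21 cells mentioned above.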
Let d be the dimension of the data, and let each dimension be divided into partition[d] parts. The first level of the hash table then needs to store parNumber = partition[0] × … × partition[n] hashDate objects; each hashDate object has space complexity O(n), so the first level has space complexity O(n × parNumber). Each first-level hashDate object has 2^d subtrees, so the second level has space complexity O(2^d × n × parNumber), and so on. The space complexity of each level is O(n), the tree height of the hash index is O(1), and d and parNumber are constants, so the space complexity of the dynamic grid-based hash index is O(parNumber + n). The time complexity of building the index is O(n), and the time complexity of a range search is O(1).
Improved GH-PDBSCAN algorithm based on PDBSCAN
The time and space complexity of the PDBSCAN algorithm are O(n^2 m S^2) and O(n^2) respectively, mainly because the algorithm must compute the distance probability between every pair of points and store the probability matrix during preprocessing. The key to reducing the time and space complexity of PDBSCAN is therefore to introduce an index structure.
The introduction to GridHash presented a grid-based hash index structure whose range search has time complexity O(1) and whose space complexity is O(n), which makes it an efficient index structure. In attribute-uncertain data, not all attributes are uncertain; deterministic attributes are combined with uncertain ones. In the paper in which Zhang proposed the PDBSCAN algorithm, the data generated for the experiments had only one uncertain dimension, and all other dimensions were deterministic values. The grid-based hash index can therefore be introduced into the PDBSCAN algorithm and used only for the range search over the deterministic attributes; this is then combined with the computation of "the probability that the distance between any two points is less than or equal to the given radius", keeping the data points whose probability is greater than 0, so that the index becomes applicable to clustering algorithms for uncertain data. PDBSCAN with this grid-based hash structure is referred to here as the GH-PDBSCAN algorithm.
Suppose there is a data set D whose data objects o_p have dimension k, where the first k-1 attributes are deterministic values and the k-th dimension is a value range. The steps are as follows:
1. Build a grid-based hash index structure over the first k-1 dimensions;
2. In the PDBSCAN algorithm, when PNeighborhood(o_p) is computed, first find the Eps-neighbors through the index; then, for o_p and each of its Eps-neighbors, use the Monte Carlo method to compute the probability that the distance between them is less than a threshold ThresholdValue, which is defined as follows: [formula not reproduced in the source text]
where o_q is an Eps-neighbor and o_pi denotes the i-th attribute of o_p.
3. After PNeighborhood(o_p) has been computed, run the subsequent steps of the PDBSCAN algorithm.
When the GH-PDBSCAN algorithm computes the Eps-neighbors of a data object, the first step builds the index on the deterministic attributes, and the second step uses the Monte Carlo method to compute the probability of being below the threshold. When only one attribute of the data set is uncertain, the probability between any two objects can be expressed as follows: [formula not reproduced in the source text]
Suppose the uncertain attribute of data point o_r takes values in [a, b] and the uncertain attribute of data point o_w takes values in [c, d], and the uncertain attributes of the two data points lie in the same dimension. The probability can then be solved by definite integration or by the Monte Carlo method. When applying the Monte Carlo method, 1000 random values are drawn from the value intervals of the two objects to compute the probability.
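The Monte Carlo step can be sketched as follows. This is an illustration under stated assumptions rather than the application's procedure: values are drawn uniformly from [a, b] and [c, d] (the text does not fix the sampling distribution), the quantity compared with the threshold is the absolute difference in the single uncertain dimension, and the function name `withinThresholdProbability`, the `threshold` parameter, and the fixed seed are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Estimates P(|x - y| <= threshold) with x drawn from [a, b] and y from [c, d].
// Uniform sampling and the default of 1000 trials follow the description above;
// the uniform choice is an assumption.
double withinThresholdProbability(double a, double b, double c, double d,
                                  double threshold, int trials = 1000,
                                  unsigned seed = 42) {
    assert(a <= b && c <= d && trials > 0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> ux(a, b), uy(c, d);
    int hits = 0;
    for (int i = 0; i < trials; ++i)
        if (std::fabs(ux(gen) - uy(gen)) <= threshold) ++hits;
    return static_cast<double>(hits) / trials;
}
```

For two objects whose intervals are both [0, 1] and a threshold of 0.5, the exact probability under uniform sampling is 1 - 0.5^2 = 0.75, so an estimate from 1000 draws should land close to that value.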
Because building the grid-based hash index and performing a range search on it have time complexity O(n) and O(1) respectively, the time complexity of GH-PDBSCAN is O(n m S^2), where S is the number of different probability distribution functions introduced. The space complexity of the GH-PDBSCAN algorithm is O(n).
The PDBSCAN algorithm has high time and space complexity and is unsuitable for large data volumes, whereas the GH-PDBSCAN algorithm is more efficient and uses less space than PDBSCAN, which makes it more practical.
实例一Example one
参照图3,GH-PDBSCAN算法与PDBSCAN算法效率对比Refer to Figure 3, the efficiency comparison of GH-PDBSCAN algorithm and PDBSCAN algorithm
该实验部分采用了四种不同的数据类型,其中image和abalone来源于UCI,两种数据类型中的负值被删除。另两种为人工合成数据,如表1所示。四种数据类型均采用Gullo等人的方法生成属性不确定性数据,每种数据类型包含了随机分布(random)和正态分布(normal)两种形式。试验平台为window 7,32G内存,32核CPU,开发工具为Visual studio 2012,编程语言为C++。The experiment part uses four different data types, among which image and abalone are derived from UCI, and the negative values in the two data types are deleted. The other two are artificially synthesized data, as shown in Table 1. The four data types all use the method of Gullo et al. to generate attribute uncertainty data. Each data type includes two forms: random and normal. The test platform is window 7, 32G memory, 32-core CPU, the development tool is Visual studio 2012, and the programming language is C++.
Table 1
The efficiency of GH-PDBSCAN and PDBSCAN must be compared along two dimensions, the data value range and the data volume. The results are shown in Figure 3.
Figure 3 shows the four data sets, which correspond to four different cases: (a) large value range, small data volume; (b) small value range, small data volume; (c) large value range, large data volume; (d) small value range, large data volume.
The experiments show that PDBSCAN is more efficient than GH-PDBSCAN only in case (a); in all other cases GH-PDBSCAN is the more efficient algorithm. In (a), PDBSCAN improves on GH-PDBSCAN by a factor of 5.62 for randomly distributed data and 6.18 for normally distributed data. In (b), GH-PDBSCAN improves on PDBSCAN by a factor of 1.95 for randomly distributed data and 2.61 for normally distributed data. In (c), GH-PDBSCAN improves on PDBSCAN by a factor of 4.23 for randomly distributed data and 3.63 for normally distributed data. In (d), GH-PDBSCAN improves on PDBSCAN by a factor of 9.38 for randomly distributed data and 9.25 for normally distributed data.
To address the problems raised in the background section, the present invention proposes an efficient algorithm: the Incremental GH-PDBSCAN algorithm, which is built on GH-PDBSCAN.
Attribute-uncertain data are increasingly common and, like other data, are updated dynamically. To make the GH-PDBSCAN algorithm applicable to dynamic attribute-uncertain data sets, the Incremental GH-PDBSCAN algorithm is proposed. It borrows the idea of the Incremental DBSCAN algorithm but introduces probability distribution functions so that it applies to data with uncertain attributes.
First, the seed data points are queried. Let D denote the data set and p the data point to be deleted or inserted. The neighborhood objects affected by deleting or inserting data are expressed as follows:
UpSeed_ins(p) = {q | q is a core object in D∪{p}, ∃q′: q′ is not a core point in D, q′ is a core point in D∪{p}, and …}
UpSeed_del(p) = {q | q is a core object in D\{p}, ∃q′: q′ is not a core point in D, q′ is a core point in D\{p}, and …}
The definition of the seed objects and the steps of incremental clustering are the same as in Incremental DBSCAN. The range search, however, takes the data objects whose probability of being directly density-reachable exceeds a given threshold, and it uses the dynamic grid-based hash index. Before Incremental PDBSCAN is run, the data must first be clustered with the PDBSCAN algorithm, storing for each data object whether it is a core point, its PN_Eps(o_p) value, and its class label. Incremental clustering of the attribute-uncertain data is then carried out using UpSeed_ins(p), UpSeed_del(p), PN_Eps(o_p), and related information.
The detailed procedure of the Incremental GH-PDBSCAN algorithm is as follows.
Incremental GH-PDBSCAN inserts data as follows:
Input: D: the incremental data set; Eps: the radius; Minpts: the threshold for deciding whether a point is a core point; unAttr: the dimension whose values are uncertain;
Output: the data set for which incremental clustering is complete;
1: Build an index G from the existing class-labeled data;
2: for (each data object p in D)
3: Insert p into the index G;
4: Obtain PNeighborhood(o_p) and, from PN_Eps(o_p), determine the probability that p is a core point;
5: Obtain UpSeed_ins(p);
6: If UpSeed_ins(p) is empty and N_Eps(p) contains no core object, treat p as noise and return; otherwise go to step 7;
7: If UpSeed_ins(p) is non-empty and the objects it contains neither belong to any cluster nor have a core object of a known cluster among their density-reachable objects, create a new cluster and return; otherwise go to step 8;
8: If, before p is inserted, the objects in UpSeed_ins(p) all belong to the same cluster; or they carry different class labels and the data with different labels are still not density-reachable after p is inserted; or UpSeed_ins(p) is empty and N_Eps(p) contains a core object, then absorb p into a cluster and return; otherwise go to step 9;
9: If the objects in UpSeed_ins(p) belong to different clusters but all objects in UpSeed_ins(p) are directly or indirectly density-reachable after p is inserted, merge p and the clusters containing the objects in UpSeed_ins(p).
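The case analysis in steps 6 to 9 of the insertion procedure can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not the procedure of the application: the seed set is summarized by the cluster labels of its members (`None` meaning noise or unlabeled), step 7 is reduced to "no seed has a label", and the two reachability conditions are supplied as precomputed booleans. The names `InsertCase` and `classify_insertion` are hypothetical.

```python
from enum import Enum


class InsertCase(Enum):
    NOISE = "noise"
    CREATE = "create cluster"
    ABSORB = "absorb into cluster"
    MERGE = "merge clusters"


def classify_insertion(seed_labels, core_in_neighbourhood,
                       reachable_after_insert=True):
    """Decide which insertion case applies to a new point p.

    seed_labels            -- cluster label of each object in UpSeed_ins(p)
    core_in_neighbourhood  -- True if N_Eps(p) contains a core object
    reachable_after_insert -- True if differently labeled seeds become
                              density-reachable once p is inserted
    """
    clusters = {label for label in seed_labels if label is not None}
    if not seed_labels:
        # Step 6 (noise) versus the empty-seed branch of step 8 (absorb).
        return InsertCase.ABSORB if core_in_neighbourhood else InsertCase.NOISE
    if not clusters:
        # Step 7, simplified: no seed belongs to an existing cluster.
        return InsertCase.CREATE
    if len(clusters) == 1 or not reachable_after_insert:
        # Step 8: one cluster, or several that stay unconnected.
        return InsertCase.ABSORB
    # Step 9: several clusters that p connects.
    return InsertCase.MERGE
```

Keeping the decision separate from the range searches makes each of the four outcomes easy to exercise on its own.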
Incremental GH-PDBSCAN deletes data as follows:
Input: D: the set of objects to delete; Eps: the radius; Minpts: the threshold for deciding whether a point is a core point; unAttr: the dimension whose values are uncertain;
Output: the data set for which incremental clustering is complete;
1: Build an index G from the existing class-labeled data;
2: Locate p in the original data set;
3: for (each data object p in D)
4: Delete p from the index G;
5: Obtain PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p);
6: If p is noise, delete it and return; otherwise go to step 7;
7: If p is not noise, UpSeed_del(p) is empty, and N_Eps(p) contains no core point after p is deleted, mark the other data points in the same cluster as p as noise and return; otherwise go to step 8;
8: If UpSeed_del(p) is empty but N_Eps(p) still contains a core object, or if the data points in UpSeed_del(p) are all directly density-reachable from one another, these data objects remain in the same cluster after p is deleted; return. Otherwise go to step 9;
9: If the data objects in UpSeed_del(p) are not directly density-reachable from one another and still cannot be made density-reachable through other core points of the same cluster, the original cluster is split into several clusters; otherwise it is not split.
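The split test in steps 8 and 9 of the deletion procedure amounts to asking whether the seed objects are still connected through the remaining core points of their cluster. A minimal sketch of that check follows. It assumes the direct density-reachability relation among the remaining core points has already been computed (for example by thresholded range searches) and is symmetric; `split_components` and the `neighbours` mapping are hypothetical names.

```python
from collections import deque


def split_components(seeds, neighbours):
    """Group the seeds of UpSeed_del(p) by mutual density-reachability.

    seeds      -- objects in UpSeed_del(p), in any order
    neighbours -- dict mapping each remaining core point of the cluster to
                  the core points directly density-reachable from it
    Returns one list of seeds per connected group. A single group means
    the cluster stays intact; more than one means it splits.
    """
    assigned = set()
    groups = []
    for seed in seeds:
        if seed in assigned:
            continue
        # Breadth-first search over the reachability graph from this seed.
        reached = {seed}
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            for nxt in neighbours.get(current, ()):
                if nxt not in reached:
                    reached.add(nxt)
                    queue.append(nxt)
        group = [s for s in seeds if s in reached]
        assigned.update(group)
        groups.append(group)
    return groups
```

For instance, with seeds a, c, and d, where a and c are linked through another core point b and d is isolated, the function returns two groups, which corresponds to the cluster-splitting case.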
Incremental insertion in Incremental GH-PDBSCAN involves four different cases: noise, creation of a cluster, absorption into a cluster, and merging of clusters. During the algorithm, the GridHash index is used for range searches and the Monte Carlo method is used to compute PNeighborhood(o_p). The time complexity of the algorithm is O(q×k), where q is the size of the data set whose probability of being directly density-reachable from the inserted point p exceeds a given threshold, and k is the amount of data contained in UpSeed_del(p). The space complexity of Incremental GH-PDBSCAN is O(n+m), where n is the size of the original data set and m is the size of the data set to be updated incrementally.
Deletion in Incremental GH-PDBSCAN also involves four different cases: noise, elimination of a cluster, reduction of a cluster's objects, and splitting of a cluster. The time complexity of Incremental GH-PDBSCAN is O(L×(n−m)), where L is the number of clusters contained in UpSeed_del(p), n is the size of the original data set, and m is the size of the data set to be deleted. The space complexity of the algorithm is O(n−m).
Example 2
Referring to Figures 4.1 and 4.2: efficiency comparison of the Incremental GH-PDBSCAN and GH-PDBSCAN algorithms
The GH-PDBSCAN algorithm can handle big data, so 1 million data objects are used here. Each data object is a three-dimensional spatial datum, and the third dimension of each object is designated as the uncertain attribute; for the specific implementation, see the article published by Gullo et al. in 2008. The experimental method for the Incremental PDBSCAN algorithm proposed here is similar to that of the Incremental DBSCAN algorithm proposed by Ester et al. in 1998.
In the GH-PDBSCAN algorithm, the clustering time for each data object is determined by the range-search time, so the time cost of clustering n data objects can be written as Cost_DBSCAN(n), that is:
Cost_DBSCAN(n) = n    (5)
The number of range searches performed by the Incremental GH-PDBSCAN algorithm depends on the specific application, so the number of range searches needed for each insertion and deletion must be determined experimentally. In general, deleting a data object affects more data points than inserting one. Two parameters, r_ins and r_del, denote the average number of range searches per data object during incremental insertion and incremental deletion respectively, and f_ins and f_del denote the fractions of an incremental update made up of insertions and deletions. When the incrementally updated data set has size m, the time cost of the Incremental GH-PDBSCAN algorithm is:
Cost_IncrementalGH-PDBSCAN(m) = m × (f_ins × r_ins + f_del × r_del)    (6)
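Equations (5) and (6) count range searches, so they can be evaluated directly. The sketch below is only a worked illustration of the two formulas; the function names are hypothetical, and the sample values r_ins = 1 and r_del = 4.6 are the ones reported in Table 2.

```python
def cost_gh_pdbscan(n):
    """Equation (5): one range search per clustered object."""
    return n


def cost_incremental(m, f_ins, r_ins, f_del, r_del):
    """Equation (6): range searches needed to apply m updates."""
    return m * (f_ins * r_ins + f_del * r_del)


if __name__ == "__main__":
    m = 20_000
    # Pure insertion (f_ins = 1, f_del = 0) and pure deletion (f_ins = 0, f_del = 1).
    print(cost_incremental(m, f_ins=1, r_ins=1, f_del=0, r_del=4.6))
    print(cost_incremental(m, f_ins=0, r_ins=1, f_del=1, r_del=4.6))
    # Clustering all 1,000,000 objects from scratch, for comparison.
    print(cost_gh_pdbscan(1_000_000))
```

With these values a batch of deletions costs 4.6 times as many range searches as a batch of insertions of the same size, which matches the remark that deletions affect more data points.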
Table 2 below lists the parameter values for the 1 million 3-dimensional uncertain data objects:
Parameter | Meaning | Value for 1 million 3-D uncertain data |
n | Original data set size | 1,000,000 |
m | Incremental data set size | From 20,000 to 50,0000 |
r_ins | Number of range searches per incremental insertion | 1 |
r_del | Number of range searches per incremental deletion | 4.6 |
f_ins | Fraction of inserted data | 1 or 0 |
f_del | Fraction of deleted data | 0 or 1 |
Table 2
From the definitions above, the speedup of the Incremental GH-PDBSCAN algorithm over the GH-PDBSCAN algorithm can be computed. It is defined as follows:
The speedup of Incremental GH-PDBSCAN over GH-PDBSCAN for incremental insertion and deletion is shown in Figures 4.1 and 4.2.
The experiments show that Incremental GH-PDBSCAN is considerably more efficient than GH-PDBSCAN. When the inserted or deleted data set is fixed, the speedup is proportional to the size of the original data set; when the original data set is fixed, the larger the inserted or deleted data set, the lower the speedup. The advantage of Incremental GH-PDBSCAN is most evident when the original data set is large and the amount of inserted or deleted data is small.
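The trends stated above can be reproduced with a small calculation. The defining formula for the speedup does not survive in this text, so the sketch assumes, purely for illustration, that the speedup is the cost of clustering the n original objects from scratch, equation (5), divided by the incremental cost of equation (6). The function name `speedup` is hypothetical, and the resulting numbers are not the measured values of Figures 4.1 and 4.2.

```python
def speedup(n, m, f_ins, r_ins, f_del, r_del):
    """Assumed speedup: full re-clustering cost over incremental cost."""
    full_cost = n                                            # equation (5)
    incremental_cost = m * (f_ins * r_ins + f_del * r_del)   # equation (6)
    return full_cost / incremental_cost


if __name__ == "__main__":
    # Insertion only, with r_ins = 1 and r_del = 4.6 as in Table 2.
    for n, m in [(1_000_000, 20_000), (2_000_000, 20_000), (1_000_000, 50_000)]:
        print(n, m, speedup(n, m, f_ins=1, r_ins=1, f_del=0, r_del=4.6))
```

Under this assumption, doubling n at fixed m doubles the ratio, and enlarging m at fixed n lowers it, in line with the behavior described in the text.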
Because the apparatus embodiment is substantially similar to the method embodiment, it is described only briefly; for the relevant details, refer to the corresponding parts of the description of the method embodiment.
An embodiment of the present application provides a density clustering apparatus based on a dynamic grid hash index, characterized by comprising:
an information input unit, configured to acquire incremental preset information, including: D: the incremental data set; Eps: the radius; Minpts: the threshold for deciding whether a point is a core point; unAttr: the dimension whose values are uncertain;
a data insertion unit, configured to generate, from the acquired incremental preset information and by the density clustering method of claim 1, a data set that has been incrementally clustered on the basis of the original data set;
a search result output unit, configured to output the incrementally clustered data set generated by the incremental clustering unit.
Referring to FIG. 5, a computer device for the density clustering method based on a dynamic grid hash index of the present invention is shown, which may specifically include the following:
The computer device 12 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that connects the different system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer device 12 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer device 12, including volatile and non-volatile media and removable and non-removable media.
The system memory 28 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read from and write to a non-removable, non-volatile magnetic medium (commonly called a "hard disk drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM, or other optical medium) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data medium interfaces. The memory may include at least one program product having a set of (for example, at least one) program modules 42 configured to perform the functions of the embodiments of the present invention.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules 42, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present invention.
The computer device 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, or a camera), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (such as a network card or a modem) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. In addition, the computer device 12 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in FIG. 5, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units 16, external disk drive arrays, RAID systems, tape drives, and data backup storage systems 34.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example implementing the density clustering method based on a dynamic grid hash index provided by the embodiments of the present invention.
That is, when executing the above program, the processing unit 16 implements: acquiring incremental preset information, including: D: the incremental data set; Eps: the radius; Minpts: the threshold for deciding whether a point is a core point; unAttr: the dimension whose values are uncertain; building an index G from the existing class-labeled data; repeating the following steps to insert each data object p in D into the index G: A1: obtaining PNeighborhood(o_p) and determining, from PN_Eps(o_p), the probability that p is a core point; A2: obtaining UpSeed_ins(p); A3: if the objects in UpSeed_ins(p) belong to different clusters but all objects in UpSeed_ins(p) are directly or indirectly density-reachable after p is inserted, merging p and the clusters containing the objects in UpSeed_ins(p); and/or: building an index G from the existing class-labeled data and locating p in the original data set; repeating the following steps to delete each data object p in D from the index G: B1: obtaining PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p); B2: if the data objects in UpSeed_del(p) are not directly density-reachable from one another and still cannot be made density-reachable through other core points of the same cluster, splitting the original cluster into several clusters; and, after the loop ends, obtaining the data set for which incremental clustering is complete.
In an embodiment of the present invention, a computer-readable storage medium is further provided, on which a computer program is stored. When executed by a processor, the program implements the density clustering method based on a dynamic grid hash index provided by all embodiments of the present application:
That is, when executed by the processor, the program implements: acquiring incremental preset information, including: D: the incremental data set; Eps: the radius; Minpts: the threshold for deciding whether a point is a core point; unAttr: the dimension whose values are uncertain; building an index G from the existing class-labeled data; repeating the following steps to insert each data object p in D into the index G: A1: obtaining PNeighborhood(o_p) and determining, from PN_Eps(o_p), the probability that p is a core point; A2: obtaining UpSeed_ins(p); A3: if the objects in UpSeed_ins(p) belong to different clusters but all objects in UpSeed_ins(p) are directly or indirectly density-reachable after p is inserted, merging p and the clusters containing the objects in UpSeed_ins(p); and/or: building an index G from the existing class-labeled data and locating p in the original data set; repeating the following steps to delete each data object p in D from the index G: B1: obtaining PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p); B2: if the data objects in UpSeed_del(p) are not directly density-reachable from one another and still cannot be made density-reachable through other core points of the same cluster, splitting the original cluster into several clusters; and, after the loop ends, obtaining the data set for which incremental clustering is complete.
Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in combination with, an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider). The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced.
Although preferred embodiments of the present application have been described, those skilled in the art may make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.
Finally, it should be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that includes the element.
The density clustering method, apparatus, device, and medium based on a dynamic grid hash index provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. A person of ordinary skill in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.
Claims (10)
- 一种基于动态网格哈希索引的密度聚类方法,其特征在于,包括:A density clustering method based on dynamic grid hash index, which is characterized in that it comprises:获取增量预设信息,包括:D:增量数据集;Eps:半径;Minpts:是否为核心点的判定阈值;unAttr:数值不确定的维度;Obtain incremental preset information, including: D: incremental data set; Eps: radius; Minpts: judgment threshold of whether it is a core point; unAttr: dimension with uncertain value;根据现有的带有类标签的数据建立索引G;Create an index G based on the existing data with class tags;重复执行以下步骤将D中的每个数据对象p插入到索引G:Repeat the following steps to insert each data object p in D into index G:A1、获取PNeighborhood(o p),并根据PN Eps(o p)判断p为核心点的概率; A1, obtaining PNeighborhood (o p), and p is a core point determination probability based on PN Eps (o p);A2、获取UpSeed ins(p); A2, get UpSeed ins (p);A3、若UpSeed ins(p)中的对象所属类别不同,但插入p后UpSeed ins(p)中的所有对象可直接或间接密度可达,则将p及UpSeed ins(p)包含的对象所在的聚类合并; A3. If the objects in UpSeed ins (p) belong to different categories, but all objects in UpSeed ins (p) can be directly or indirectly reachable after inserting p, then the objects contained in p and UpSeed ins (p) are located Cluster merging和/或;and / or;根据现有的带有类标签的数据建立索引G,并在原数据集中查找p的位置;Establish an index G based on the existing data with class tags, and find the position of p in the original data set;重复执行以下步骤将D中的每个数据对象p从索引G中删除;Repeat the following steps to delete each data object p in D from index G;B1:获取PNeighborhood(o p),PN Eps(o p)及UpSeed del(p); B1: Get PNeighborhood (o p), PN Eps (o p) and UpSeed del (p);B2:若UpSeed del(p)含有的数据对象不能彼此直接密度可达,且通过同类簇的其他的核心点依然不能使其密度可达,则原聚类被分成若干个聚类; B2: If the data objects contained in UpSeed del (p) cannot be directly reachable to each other in density, and the density cannot be reached through other core points of the same cluster, the original cluster is divided into several clusters;循环结束后得到完成增量聚类的数据集。After the end of the loop, a data set with incremental clustering is obtained.
- The method according to claim 1, characterized in that, after UpSeed_ins(p) is obtained, the method further comprises: if UpSeed_ins(p) is empty and N_Eps(p) contains no core object, treating p as noise and returning ∞.
- The method according to claim 1, characterized in that, after UpSeed_ins(p) is obtained, the method further comprises: if UpSeed_ins(p) is non-empty, and the objects it contains neither have a core object of any known cluster among their density-reachable objects nor belong to any cluster, creating a new cluster and returning ∞.
- The method according to claim 1, characterized in that, after UpSeed_ins(p) is obtained, the method further comprises: if, before p is inserted, the objects in UpSeed_ins(p) all belong to the same cluster; or the objects carry different class labels and the data with different class labels are still not density-reachable after p is inserted; or UpSeed_ins(p) is empty and N_Eps(p) contains a core object, merging p into one of the clusters and returning ∞.
- The method according to claim 1, characterized in that, after PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p) are obtained, the method further comprises: if p is noise, deleting p and returning ∞.
- The method according to claim 1, characterized in that, after PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p) are obtained, the method further comprises: if p is not noise, UpSeed_del(p) is empty, and N_Eps(p) contains no core point after p is deleted, setting the other data points in the same cluster as p to noise and returning ∞.
- The method according to claim 1, characterized in that, after PNeighborhood(o_p), PN_Eps(o_p), and UpSeed_del(p) are obtained, the method further comprises: if UpSeed_del(p) is empty but N_Eps(p) still contains a core object, or if the data points in UpSeed_del(p) are all directly density-reachable from one another, keeping these data objects in the same cluster after p is deleted and returning ∞.
- A density clustering apparatus based on a dynamic grid hash index, characterized by comprising:
  an information input unit, configured to obtain incremental preset information, including: D, an incremental data set; Eps, a radius; Minpts, the threshold for deciding whether a point is a core point; and unAttr, the dimensions whose values are uncertain;
  a data insertion unit, configured to generate, according to the obtained incremental preset information and by means of the density clustering method of claim 1, a data set that has been incrementally clustered on the basis of the original data set;
  a search result output unit, configured to output the incrementally clustered data set generated by the incremental clustering unit.
- A device, characterized by comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method according to any one of claims 1 to 7.
- A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
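The insertion branch of claims 1 to 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation and rests on several assumptions: points are exact tuples with no duplicates, so the uncertain dimensions (unAttr) and the probabilistic neighborhoods PNeighborhood(o_p) and PN_Eps(o_p) are ignored; the grid cell side is set equal to Eps; and UpSeed_ins(p) is interpreted in the usual incremental-DBSCAN sense, as the core points lying within Eps of a point that became core because p was inserted. The names `GridHashIndex`, `insert_point`, and `NOISE` are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor

NOISE = -1  # label for points that belong to no cluster


class GridHashIndex:
    """Hash map from integer grid-cell coordinates to the points in that cell."""

    def __init__(self, eps):
        self.eps = eps  # assumption: cell side length equals the Eps radius
        self.cells = defaultdict(list)

    def _cell(self, p):
        return tuple(floor(x / self.eps) for x in p)

    def insert(self, p):
        self.cells[self._cell(p)].append(p)

    def neighbors(self, p):
        """Stored points within eps of p; only the 3^d adjacent cells are scanned."""
        base = self._cell(p)
        hits = []
        for offset in product((-1, 0, 1), repeat=len(p)):
            key = tuple(b + o for b, o in zip(base, offset))
            hits.extend(q for q in self.cells.get(key, ()) if dist(p, q) <= self.eps)
        return hits


def insert_point(index, labels, p, min_pts):
    """Insert p, update labels in place, return 'noise', 'create', 'absorb' or 'merge'."""
    index.insert(p)
    labels[p] = NOISE

    def is_core(q):
        return len(index.neighbors(q)) >= min_pts

    # Points that became core because of p: p itself, or a neighbor now exactly at min_pts.
    new_cores = [q for q in index.neighbors(p)
                 if is_core(q) and (q == p or len(index.neighbors(q)) == min_pts)]

    # Group the affected core points (the assumed UpSeed_ins) into connected components.
    components = []
    for q in new_cores:
        group = {r for r in index.neighbors(q) if is_core(r)}
        untouched = []
        for c in components:
            if c & group:
                group |= c
            else:
                untouched.append(c)
        components = untouched + [group]

    order = ["noise", "absorb", "create", "merge"]
    case = "noise"
    for comp in components:
        ids = {labels[r] for r in comp} - {NOISE}
        if not ids:                       # no known cluster involved: new cluster
            cid, step = max(labels.values()) + 1, "create"
        elif len(ids) == 1:               # one cluster involved: it absorbs the points
            cid, step = next(iter(ids)), "absorb"
        else:                             # several clusters now connected: merge them
            cid, step = min(ids), "merge"
            for s in labels:
                if labels[s] in ids:
                    labels[s] = cid
        for r in comp:                    # unlabeled neighbors of core points join the cluster
            for s in index.neighbors(r):
                if labels[s] == NOISE:
                    labels[s] = cid
        case = max(case, step, key=order.index)

    if labels[p] == NOISE:
        # The seed set is empty, but p may still lie within Eps of an existing core point.
        cores = [q for q in index.neighbors(p) if is_core(q)]
        if cores:
            labels[p], case = labels[cores[0]], "absorb"
    return case
```

A caller would build one `GridHashIndex`, keep a `labels` dictionary, and call `insert_point` once per object in D. Because the cell side equals Eps, each neighborhood query touches at most 3^d cells rather than the whole data set, which is the benefit a grid hash index is meant to provide. Grouping the seed points into components before labeling reflects the case in claim 4 where differently labeled objects remain unconnected after the insertion.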
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010436841.2A CN111612069A (en) | 2020-05-21 | 2020-05-21 | Density clustering method and device based on dynamic grid Hash index |
CN202010436841.2 | 2020-05-21 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021232442A1 (en) | 2021-11-25 |
Family
ID=72199001
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/092225 WO2021232442A1 (en) | 2020-05-21 | 2020-05-26 | Density clustering method and apparatus on basis of dynamic grid hash index |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111612069A (en) |
WO (1) | WO2021232442A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1953442A (en) * | 2006-09-14 | 2007-04-25 | 浙江大学 | Method of k-neighbour query based on data mesh |
CN102915347A (en) * | 2012-09-26 | 2013-02-06 | 中国信息安全测评中心 | Distributed data stream clustering method and system |
CN105740371A (en) * | 2016-01-27 | 2016-07-06 | 深圳大学 | Density-based incremental clustering data mining method and system |
US20190095514A1 (en) * | 2017-09-28 | 2019-03-28 | Here Global B.V. | Parallelized clustering of geospatial data |
- 2020-05-21: CN application CN202010436841.2A, published as CN111612069A (active, Pending)
- 2020-05-26: WO application PCT/CN2020/092225, published as WO2021232442A1 (active, Application Filing)
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114579577A (en) * | 2022-03-11 | 2022-06-03 | 中国地质大学(武汉) | Speed partitioning method, device and system for distributed index of mobile object |
CN115563522A (en) * | 2022-12-02 | 2023-01-03 | 湖南工商大学 | Traffic data clustering method, device, equipment and medium |
CN115563522B (en) * | 2022-12-02 | 2023-04-07 | 湖南工商大学 | Traffic data clustering method, device, equipment and medium |
CN118277812A (en) * | 2024-05-31 | 2024-07-02 | 中南大学 | Grid clustering method, system and user recommendation method for classifying mass data |
Also Published As
Publication number | Publication date |
---|---|
CN111612069A (en) | 2020-09-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Du et al. | | The optimal-location query |
WO2021232442A1 (en) | | Density clustering method and apparatus on basis of dynamic grid hash index |
CN108804576B (en) | | Domain name hierarchical structure detection method based on link analysis |
CN104820708B (en) | | A kind of big data clustering method and device based on cloud computing platform |
CN113190645B (en) | | Index structure establishment method, device, equipment and storage medium |
EP2049984A2 (en) | | Primenet data management system |
US10650559B2 (en) | | Methods and systems for simplified graphical depictions of bipartite graphs |
CN111159184A (en) | | Metadata tracing method and device and server |
CN111737393A (en) | | Vector data self-adaptive management method and system under web environment |
WO2021047373A1 (en) | | Big data-based column data processing method, apparatus, and medium |
CN106874425A (en) | | Real time critical word approximate search algorithm based on Storm |
CN113268557A (en) | | Rapid spatial indexing method suitable for display-oriented visualization analysis |
Hershberger et al. | | Adaptive sampling for geometric problems over data streams |
CN113722600B (en) | | Data query method, device, equipment and product applied to big data |
CN114595302A (en) | | Method, device, medium, and apparatus for constructing multi-level spatial relationship of spatial elements |
Álvarez-García et al. | | Compact and efficient representation of general graph databases |
CN112685574B (en) | | Method and device for determining hierarchical relationship of domain terms |
CN110321435B (en) | | Data source dividing method, device, equipment and storage medium |
CN111309854B (en) | | Article evaluation method and system based on article structure tree |
US20170031909A1 (en) | | Locality-sensitive hashing for algebraic expressions |
Wang et al. | | RODA: A fast outlier detection algorithm supporting multi-queries |
KR20220099745A (en) | | A spatial decomposition-based tree indexing and query processing methods and apparatus for geospatial blockchain data retrieval |
CN116226686B (en) | | Table similarity analysis method, apparatus, device and storage medium |
Hershberger et al. | | Adaptive sampling for geometric problems over data streams |
CN113761076B (en) | | Clustering method, device, equipment and storage medium applied to data warehouse |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20936864; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.03.2023) |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 20936864; Country of ref document: EP; Kind code of ref document: A1 |