WO2016107297A1 - Clustering method based on local density on mapreduce platform - Google Patents

Clustering method based on local density on mapreduce platform

Info

Publication number
WO2016107297A1
WO2016107297A1 · PCT/CN2015/094376 · CN2015094376W
Authority
WO
WIPO (PCT)
Prior art keywords
node
key
rho
value
neighbor
Prior art date
Application number
PCT/CN2015/094376
Other languages
French (fr)
Chinese (zh)
Inventor
蔡立宇
贾西贝
Original Assignee
深圳市华傲数据技术有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市华傲数据技术有限公司
Publication of WO2016107297A1 publication Critical patent/WO2016107297A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to the field of data processing technologies, and in particular, to a local density based clustering method on a MapReduce platform.
  • Cluster analysis is an important algorithm in data mining. It is based on similarity: patterns within one cluster are more similar to each other than to patterns in other clusters. Clustering algorithms can be divided into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. With the arrival of the cloud computing and big data era, the rapid development of social informatization and networking has led to explosive growth of data. Applying cluster analysis to big data therefore needs to be combined with distributed computing platforms in order to escape the limitations imposed by the limited resources of a single computer.
  • MapReduce is a distributed parallel computing framework proposed by Google for parallel computation over large-scale data sets; it processes them mainly through the two steps "Map" and "Reduce".
  • Each partition later corresponds to one Reduce job; key-value pairs with the same Key are processed by the same Reduce job, which reads these intermediate key-value pairs.
  • For each unique key, the key and its associated values are passed to the reduce function, and the output produced by the reduce function is appended to the output file of that partition.
  • A Map job processes one split of the input data and may call the map function many times, once per input key-value pair; a Reduce job processes the intermediate key-value pairs of one partition, calling the reduce function once for each distinct key, and ultimately corresponds to one output file.
  • The local density-based clustering method mainly includes: representing the data by nodes in a connected graph, and representing the similarity between data items by the length of the edges between nodes, where the shorter the edge between two nodes, the higher the similarity between the data they represent; computing the local density Rho of each node, where Rho is defined as the number of edges incident to the node whose length is below a predefined value Dc; computing the dispersion Delta of each node, where Delta is defined as the length of the shortest edge connecting the node to a neighbor with a higher Rho value, or, if no such neighbor exists, the length of the node's longest incident edge;
  • identifying nodes whose Rho and Delta values exceed the preset thresholds R_T and D_T, respectively, as class centers; and assigning each non-center node to the class of the nearest neighbor node with a higher Rho value.
  • The edge length measures how likely two nodes are to belong to the same class (their similarity); Rho measures the importance of the current node to its neighbors; Delta measures how distinguishable the current node would be from other class centers if it were taken as a class center. To process massive data and overcome the limitations imposed by the limited resources of a single machine, this local density based clustering method needs to be implemented on the MapReduce platform.
  • the object of the present invention is to provide a local density-based clustering method on the MapReduce platform, which realizes processing of massive data and overcomes the limitation imposed by the limited resources of the single machine.
  • the present invention provides a local density based clustering method on a MapReduce platform, including:
  • Step 10: preprocess the data to be clustered and construct a connected graph in which nodes represent the data and the length of the edge between two nodes represents the similarity between the data they represent; the shorter the edge between two nodes, the higher the similarity between the data represented by the nodes;
  • Step 20: using the node and edge information of the connected graph as input data, generate key-value pairs containing each node and its edge information through a Map job, and generate output containing each node, its local density Rho, and all of the node's edge information through a Reduce job, where Rho is defined as the number of incident edges whose length is below a predefined value Dc;
  • Step 30: for the output of the Reduce job in step 20, generate key-value pairs containing the node, the node's Rho, the neighbor's Rho, and the edge information through a Map job; for each node, traverse the node's Rho, all neighbors' Rho, and all edge information through a Reduce job to obtain the dispersion Delta of each node, where Delta is defined as the length of the shortest edge connecting the node to a neighbor with a higher Rho value, or, if no such neighbor exists, the length of the node's longest incident edge; then perform class identification according to a predetermined rule.
  • The predetermined rule includes: if the node's Rho and Delta are respectively higher than the thresholds R_T and D_T supplied as input parameters, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho;
  • the class identifier of the isolated node is its own class identifier.
  • The predetermined rule may alternatively include: pre-defining candidate intervals for the Rho value and corresponding candidate intervals for the Delta value; if the node's Rho value falls within a candidate Rho interval and its Delta value falls within the corresponding Delta interval, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho;
  • the class identifier of the isolated node is its own class identifier.
  • The data format of the node and edge information used as input data in step 20 includes a field identifying the node, a field identifying the neighbor node, and a field giving the length of the edge between the node and the neighbor node.
  • In step 20, the output of the Reduce job may be stored in a relational database or a key-value database.
  • step 20 includes:
  • Step 21: the node and edge information of the connected graph is used as input data to generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node and a field giving the length of the edge between the node and the neighbor node;
  • Step 22: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition;
  • Step 23: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group;
  • Step 25: via a Reduce job, traverse all edges of the same node by iterating over the values of the key-value pairs belonging to the same group, and generate output containing the node, the node's local density Rho, and all of the node's edge information.
  • step 20 further includes:
  • In step 21, the key further includes a field giving the length of the edge between the node and the neighbor node;
  • Step 24: sort the key-value pairs belonging to the same group by the edge length contained in the key.
  • The sort in step 24 may be in ascending order.
  • the output of the Reduce job is a key value pair, wherein the key includes a field identifying the node, and the value includes a field identifying the node, a field identifying the node Rho, and a field identifying all neighbor information of the node.
  • step 30 includes:
  • Step 31: for the output of the Reduce job in step 20, generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node, a field giving the length of the edge between the node and the neighbor node, a field giving the neighbor's Rho, and a field giving the node's Rho;
  • Step 32: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition;
  • Step 33: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group;
  • Step 35: via a Reduce job, for each node, traverse the node's Rho, all neighbors' Rho, and all edge information by iterating over the values of the key-value pairs belonging to the same group, obtain the node's dispersion Delta, and perform class identification according to the predetermined rule.
  • step 30 further includes:
  • In step 31, the key further includes a field giving the neighbor's Rho;
  • Step 34: sort the key-value pairs belonging to the same group by the neighbor's Rho contained in the key.
  • The present invention implements local density based clustering on a cluster by means of the popular MapReduce distributed computing model, weakening the limitations imposed by the limited resources of a single machine, enabling the processing of massive data and faster completion of the clustering operation.
  • FIG. 1 is a flowchart of a preferred embodiment of the local density based clustering method on a MapReduce platform of the present invention.
  • the preferred embodiment mainly includes:
  • Step 10: preprocess the data to be clustered and construct a connected graph in which nodes represent the data and the length of the edge between two nodes represents the similarity between the data they represent; the shorter the edge between two nodes, the higher the similarity between the data represented by the nodes.
  • In step 10, the similarity between the data items to be clustered is first computed according to a preset rule, and then the connected graph is constructed. Taking Chinese patent application CN 201410814330.4, "Virtual Person Establishment Method and Apparatus", as an example, the data to be clustered are accounts; the similarity between accounts is computed from how often the accounts co-occur, and the connected graph is then built.
  • Step 20: using the node and edge information of the connected graph as input data, generate key-value pairs containing each node and its edge information through a Map job, and generate output containing each node, its local density Rho, and all of the node's edge information through a Reduce job.
  • Rho is defined as the number of incident edges whose length is below the predefined value Dc.
  • Step 20 specifically includes:
  • Step 21: the node and edge information of the connected graph is used as input data to generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node and a field giving the length of the edge between the node and the neighbor node.
  • The edge information consists of the corresponding neighbor node and the edge length.
  • As an optimization, the key may further include a field giving the length of the edge between the node and the neighbor node.
  • In practice, each row of the input data may correspond to the edge information between a pair of nodes. For convenience, the input data can therefore be represented as a triple consisting of the node with the smaller identifier a, the node with the larger identifier b, and the edge length len(a,b): [a, b, len(a,b)].
  • Because the Rho value must be computed for every node, the Map job emits two <Key, Value> outputs for each edge in the connected graph.
  • Each Key or Value consists of two fields, left and right.
  • Step 22: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition.
  • The partition to which each record belongs depends only on the first field of the Key emitted by the Map job.
  • For example, the partition index can be the hash of the Key's left field modulo the known total number of partitions, expressed in pseudocode as:
  • K.left.hashCode() % numberOfPartitions.
  • Step 23: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group;
  • The result of grouping (GroupComparator) depends only on the comparison of the first field of the Keys being compared. For example, for two Keys k1 and k2, the comparison result is k1.left.compare(k2.left).
  • Step 24: sort the key-value pairs belonging to the same group by the edge length contained in the key.
  • The sort in step 24 may be in ascending order.
  • As an optional optimization, step 24 can be implemented as an intra-group sort (SortComparator, SC), defined as the result of comparing the two fields in left-then-right order.
  • With the SC intra-group sort, the edge information is returned in ascending order of edge length during the Reduce iteration.
  • The Key in step 21 is defined with the two fields node identifier and edge length precisely to enable this optimization; without it, the Key in step 21 need only consist of the node identifier.
  • Step 25: via a Reduce job, traverse all edges of the same node by iterating over the values of the key-value pairs belonging to the same group, and generate output containing the node, the node's local density Rho, and all of the node's edge information.
  • The output of the Reduce job in step 25 is a key-value pair in which the key includes a field identifying the node, and the value includes a field identifying the node, a field giving the node's Rho, and a field containing all of the node's edge information.
  • Each Reduce call can traverse all edges of one node by iterating over the Values.
  • Each time the reduce procedure is called, three pieces of information are output: the identifier of the current node n, the Rho value of n, and all edge information of n sorted by edge length.
  • With the SC optimization, the counting of the Rho value can stop as soon as the iterated edge length exceeds the predefined value Dc.
  • The edge information can likewise be concatenated in iteration order. Without this optimization, the Rho count must iterate to the last edge before finishing, and the edge information must be sorted before being emitted as part of the Value.
  • As an example, the format of the output can be the key-value pair [K = n, V = <n, Rho(n), n1:len(n,n1), ..., nN:len(n,nN)>].
  • The preferred embodiment uses the first MapReduce task described above mainly to compute the Rho values and to sort each node's neighbors in ascending order of distance.
  • The second MapReduce task that follows mainly computes the Delta values and identifies the class center points.
  • Step 30: for the output of the Reduce job in step 20, generate key-value pairs containing the node, the node's Rho, the neighbor's Rho, and the edge information through a Map job; for each node, traverse the node's Rho, all neighbors' Rho, and all edge information through a Reduce job to obtain the dispersion Delta of each node, and perform class identification according to a predetermined rule.
  • The predetermined rule is: if the node's Rho and Delta are respectively higher than the thresholds R_T and D_T supplied as input parameters, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho; the class identifier of an isolated node is its own identifier.
  • This predetermined rule is similar to the rule adopted in Chinese patent application CN 201410814330.4, "Virtual Person Establishment Method and Apparatus": a rigid requirement that the Rho value and the Delta value each exceed its corresponding threshold.
  • This is only one way of deciding whether a node can be identified as a class center. Fundamentally, whether a node can serve as a class center is determined from the node's Rho and Delta values; other decision methods that use factors including the Rho and Delta values also exist.
  • The local density based clustering method on the MapReduce platform of the present invention can also relax the way class centers are confirmed, and thereby complete the clustering operation more quickly.
  • The predetermined rule may include: pre-defining candidate intervals for the Rho value and corresponding candidate intervals for the Delta value; if the node's Rho value falls within a candidate Rho interval and its Delta value falls within the corresponding Delta interval, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho; the class identifier of an isolated node is its own identifier.
  • For example, if a node's Rho value lies in the range [10, 20] and its Delta value also lies in [0.9*10, 0.8*20] (that is, the Delta range varies with, and corresponds to, the Rho range), the node can also be identified as a class center.
  • Given the output of the Reduce job in step 20, the traversal of the neighbors' Rho values can be implemented with a Cartesian product on plain MapReduce, realizing the full join through a custom InputFormat; the purpose of this traversal is to support the subsequent computation of the Delta values.
  • Step 30 specifically includes:
  • Step 31: for the output of the Reduce job in step 20, generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node, a field giving the length of the edge between the node and the neighbor node, a field giving the neighbor's Rho, and a field giving the node's Rho.
  • As an optimization, the key may further include a field giving the neighbor's Rho; moving the Rho(b) information into the Key part facilitates the sort in the subsequent step 34.
  • Step 32: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition. For details, see step 22.
  • Step 33: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group. For details, see step 23.
  • Step 34: sort the key-value pairs belonging to the same group by the neighbor's Rho contained in the key. As an optional optimization, first use the first field of the Key to determine whether two Keys belong to the same node; if they do, sort them by the second field in descending order. This ordering guarantees that within the same Reduce call, neighbor nodes with high Rho values are visited first during the iteration.
  • Step 35: via a Reduce job, for each node, traverse the node's Rho, all neighbors' Rho, and all edge information by iterating over the values of the key-value pairs belonging to the same group, obtain the node's dispersion Delta, and perform class identification according to the predetermined rule.
  • In each Reduce call, the node itself and all of its edges can be traversed by iterating over the Value values.
  • Combined with the thresholds R_T and D_T supplied as input parameters, this generates the information needed for class identification.
  • The Map phase of step 30 is implemented on plain MapReduce, but in practice the process can be accelerated with common database techniques.
  • When the Reduce job of step 20 emits its output, the Rho value of each node can be stored in a relational database or a K-V database. Then, in the Map phase of step 30, the Rho value of a neighbor node can simply be looked up instead of being handled through a custom InputFormat; that is, the Cartesian product is no longer needed, and the data can be accessed directly in the Map phase to obtain the neighbors' Rho values.
  • The present invention implements local density based clustering on a cluster by means of the popular MapReduce distributed computing model, weakening the limitations imposed by the limited resources of a single machine, enabling the processing of massive data and faster completion of the clustering operation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided is a clustering method based on local density on a MapReduce platform. The method comprises: Step 10, preprocessing the data to be clustered and constructing a connected graph in which nodes represent the data; Step 20, using the nodes and their edge information in the connected graph as input data and generating the local density Rho of each node through a MapReduce operation; and Step 30, obtaining the dispersion degree Delta of each node through a MapReduce operation and performing class identification according to a preset rule. According to the present invention, clustering based on local density is realized on a cluster by using the MapReduce distributed computation idea; the restrictions caused by the limited resources of a single machine are weakened; massive data can be processed; and the clustering operation can be completed more quickly.

Description

Local density based clustering method on MapReduce platform

TECHNICAL FIELD
[0001] The present invention relates to the field of data processing technologies, and in particular to a local density based clustering method on a MapReduce platform.
[0002] BACKGROUND OF THE INVENTION
[0003] Cluster analysis is an important algorithm in data mining. It is based on similarity: patterns within one cluster are more similar to each other than to patterns in other clusters. Clustering algorithms can be divided into partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, and so on. With the arrival of the cloud computing and big data era, the rapid development of social informatization and networking has led to explosive growth of data. Applying cluster analysis to big data therefore needs to be combined with distributed computing platforms in order to escape the limitations imposed by the limited resources of a single computer.
[0004] MapReduce is a distributed parallel computing framework proposed by Google for parallel computation over large-scale data sets. It processes large data sets mainly through the two steps "Map" and "Reduce". In a computation on the MapReduce platform, the input data is first split across the computers of the cluster, and the other computers in the cluster are assigned to run Map jobs or Reduce jobs. A Map job extracts key-value pairs <Key, Value> from the input data; each key-value pair is passed as an argument to the map function, and the intermediate key-value pairs produced by the map function are buffered in memory. The buffered intermediate key-value pairs are periodically written to local disk and divided into R partitions, where R is user-defined; each partition later corresponds to one Reduce job. Key-value pairs with the same Key are processed by the same Reduce job, which reads these intermediate key-value pairs; for each unique key, the key and its associated values are passed to the reduce function, and the output produced by the reduce function is appended to the output file of that partition. The difference between Map/Reduce jobs and map/reduce functions: a Map job processes one split of the input data and may call the map function many times, once per input key-value pair; a Reduce job processes the intermediate key-value pairs of one partition, calling the reduce function once for each distinct key, and ultimately corresponds to one output file. Throughout the process, the input data comes from the underlying distributed file system, the intermediate data is placed on the local file system, and the final output is written back to the underlying distributed file system.

[0005] Chinese patent application CN201410814330.4, "Virtual Person Establishment Method and Apparatus", describes a clustering method based on local density. The method mainly includes: representing the data by nodes in a connected graph, and representing the similarity between data items by the length of the edges between nodes, where the shorter the edge between two nodes, the higher the similarity between the data they represent; computing the local density Rho of each node, where Rho is defined as the number of edges incident to the node whose length is below a predefined value Dc; computing the dispersion Delta of each node, where Delta is defined as the length of the shortest edge connecting the node to a neighbor with a higher Rho value, or, if no such neighbor exists, the length of the node's longest incident edge; identifying nodes whose Rho and Delta values exceed the preset thresholds R_T and D_T, respectively, as class centers; and assigning each non-center node to the class of the nearest neighbor node with a higher Rho value. The edge length measures how likely two nodes are to belong to the same class (their similarity); Rho measures the importance of the current node to its neighbors; Delta measures how distinguishable the current node would be from other class centers if it were taken as a class center. To process massive data and overcome the limitations imposed by the limited resources of a single machine, this local density based clustering method urgently needs to be implemented on the MapReduce platform.
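As context for the distributed version described below, here is a minimal single-machine sketch of the Rho and Delta definitions above, written in Java. It assumes the graph is held as an adjacency map of (neighbor, edge length) pairs; the class and method names are illustrative and do not come from the cited application.

```java
import java.util.HashMap;
import java.util.Map;

// Single-machine illustration of the Rho and Delta definitions (not the MapReduce version).
public class LocalDensitySketch {

    // Rho(n): number of edges incident to n whose length is below Dc.
    static int computeRho(Map<String, Double> edgesOfN, double dc) {
        int count = 0;
        for (double len : edgesOfN.values()) {
            if (len < dc) count++;
        }
        return count;
    }

    // Delta(n): shortest edge to a neighbor with a higher Rho,
    // or the longest incident edge if no such neighbor exists.
    static double computeDelta(String n,
                               Map<String, Map<String, Double>> graph,
                               Map<String, Integer> rhoOf) {
        double shortestToHigher = Double.POSITIVE_INFINITY;
        double longest = 0.0;
        for (Map.Entry<String, Double> e : graph.get(n).entrySet()) {
            longest = Math.max(longest, e.getValue());
            if (rhoOf.get(e.getKey()) > rhoOf.get(n)) {
                shortestToHigher = Math.min(shortestToHigher, e.getValue());
            }
        }
        return shortestToHigher == Double.POSITIVE_INFINITY ? longest : shortestToHigher;
    }

    public static void main(String[] args) {
        // Tiny undirected graph: node -> (neighbor -> edge length).
        Map<String, Map<String, Double>> graph = new HashMap<>();
        graph.put("a", Map.of("b", 1.0, "c", 2.5));
        graph.put("b", Map.of("a", 1.0, "c", 1.2));
        graph.put("c", Map.of("a", 2.5, "b", 1.2));

        double dc = 2.0;
        Map<String, Integer> rhoOf = new HashMap<>();
        for (String n : graph.keySet()) rhoOf.put(n, computeRho(graph.get(n), dc));
        for (String n : graph.keySet()) {
            System.out.println(n + ": Rho=" + rhoOf.get(n)
                    + " Delta=" + computeDelta(n, graph, rhoOf));
        }
    }
}
```

On a single machine this is straightforward; the point of the invention is to distribute exactly these two computations over a MapReduce cluster, as the following sections describe.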
[0006] SUMMARY OF THE INVENTION
[0007] Accordingly, an object of the present invention is to provide a local density based clustering method on the MapReduce platform that can process massive data and overcome the limitations imposed by the limited resources of a single machine.
[0008] To achieve the above object, the present invention provides a local density based clustering method on a MapReduce platform, including:
[0009] Step 10: preprocess the data to be clustered and construct a connected graph in which nodes represent the data and the length of the edge between two nodes represents the similarity between the data they represent; the shorter the edge between two nodes, the higher the similarity between the data represented by the nodes;
[0010] Step 20: using the node and edge information of the connected graph as input data, generate key-value pairs containing each node and its edge information through a Map job, and generate output containing each node, its local density Rho, and all of the node's edge information through a Reduce job, where Rho is defined as the number of incident edges whose length is below a predefined value Dc;
[0011] Step 30: for the output of the Reduce job of step 20, generate key-value pairs containing the node, the node's Rho, the neighbor's Rho, and the edge information through a Map job; for each node, traverse the node's Rho, all neighbors' Rho, and all edge information through a Reduce job to obtain the dispersion Delta of each node, where Delta is defined as the length of the shortest edge connecting the node to a neighbor with a higher Rho value, or, if no such neighbor exists, the length of the node's longest incident edge; then perform class identification according to a predetermined rule.
[0012] The predetermined rule may include: if the node's Rho and Delta are respectively higher than the thresholds R_T and D_T supplied as input parameters, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho;
[0013] An isolated node takes its own identifier as its class identifier.
[0014] The predetermined rule may alternatively include: pre-defining candidate intervals for the Rho value and corresponding candidate intervals for the Delta value; if the node's Rho value falls within a candidate Rho interval and its Delta value falls within the corresponding Delta interval, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho;
[0015] An isolated node takes its own identifier as its class identifier.
[0016] In step 20, the data format of the node and edge information used as input data includes a field identifying the node, a field identifying the neighbor node, and a field giving the length of the edge between the node and the neighbor node.
[0017] The output of the Reduce job of step 20 may be stored in a relational database or a key-value database.
[0018] In the Map job of step 30, the traversal of the neighbors' Rho values may be implemented by taking the Cartesian product of the output of the Reduce job of step 20.
[0019] Step 20 includes:
[0020] Step 21: the node and edge information of the connected graph is used as input data to generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node and a field giving the length of the edge between the node and the neighbor node;
[0021] Step 22: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition;
[0022] Step 23: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group;
[0023] Step 25: via a Reduce job, traverse all edges of the same node by iterating over the values of the key-value pairs belonging to the same group, and generate output containing the node, the node's local density Rho, and all of the node's edge information.
[0024] Step 20 may further include:
[0025] In step 21, the key further includes a field giving the length of the edge between the node and the neighbor node;
[0026] Step 24: sort the key-value pairs belonging to the same group by the edge length contained in the key.
[0027] The sort in step 24 may be in ascending order.
[0028] The output of the Reduce job in step 25 is a key-value pair in which the key includes a field identifying the node, and the value includes a field identifying the node, a field giving the node's Rho, and a field containing all of the node's edge information.
[0029] Step 30 includes:
[0030] Step 31: for the output of the Reduce job of step 20, generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node, a field giving the length of the edge between the node and the neighbor node, a field giving the neighbor's Rho, and a field giving the node's Rho;
[0031] Step 32: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition;
[0032] Step 33: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group;
[0033] Step 35: via a Reduce job, for each node, traverse the node's Rho, all neighbors' Rho, and all edge information by iterating over the values of the key-value pairs belonging to the same group, obtain the node's dispersion Delta, and perform class identification according to the predetermined rule.
[0034] Step 30 may further include:
[0035] In step 31, the key further includes a field giving the neighbor's Rho;
[0036] Step 34: sort the key-value pairs belonging to the same group by the neighbor's Rho contained in the key.
[0037] In summary, the present invention implements local density based clustering on a cluster by means of the popular MapReduce distributed computing model, weakening the limitations imposed by the limited resources of a single machine, enabling the processing of massive data and faster completion of the clustering operation.
[0038] BRIEF DESCRIPTION OF THE DRAWINGS
[0039] In the drawings:
[0040] FIG. 1 is a flowchart of a preferred embodiment of the local density based clustering method on a MapReduce platform of the present invention.
[0041] DETAILED DESCRIPTION
[0042] The technical solutions of the present invention and their advantageous effects will become apparent from the following detailed description of specific embodiments of the invention, taken in conjunction with the accompanying drawings.
[0043] Referring to FIG. 1, which is a flowchart of a preferred embodiment of the local density based clustering method on a MapReduce platform of the present invention, the preferred embodiment mainly includes:
[0044] Step 10: preprocess the data to be clustered and construct a connected graph in which nodes represent the data and the length of the edge between two nodes represents the similarity between the data they represent; the shorter the edge between two nodes, the higher the similarity between the data represented by the nodes. In step 10, the similarity between the data items to be clustered is first computed according to a preset rule, and then the connected graph is constructed. Taking Chinese patent application CN 201410814330.4, "Virtual Person Establishment Method and Apparatus", as an example, the data to be clustered are accounts; the similarity between accounts is computed from how often the accounts co-occur, and the connected graph is then built.
[0045] Step 20: using the node and edge information of the connected graph as input data, generate key-value pairs containing each node and its edge information through a Map job, and generate output containing each node, its local density Rho, and all of the node's edge information through a Reduce job. Rho is defined as the number of incident edges whose length is below the predefined value Dc.
[0046] Step 20 may specifically include:
[0047] Step 21: the node and edge information of the connected graph is used as input data to generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node and a field giving the length of the edge between the node and the neighbor node. The edge information consists of the corresponding neighbor node and the edge length. As an optimization, in step 21 the key may further include a field giving the length of the edge between the node and the neighbor node.
[0048] In practice, each row of the input data may correspond to the edge information between a pair of nodes. For convenience, the input data can therefore be represented as a triple consisting of the node with the smaller identifier a, the node with the larger identifier b, and the edge length len(a,b): [a, b, len(a,b)].
[0049] Because the Rho value must be computed for every node, the Map job emits two <Key, Value> outputs for each edge in the connected graph. Each Key or Value consists of two fields, left and right. Specifically, the first Key may be K1 = <a, len(a,b)> (here left = a, right = len(a,b)) with Value V1 = <b, len(a,b)>, and the second Key may be K2 = <b, len(a,b)> with Value V2 = <a, len(a,b)>.
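A minimal sketch of this Map step on the Hadoop MapReduce API is shown below. It assumes the input triples are comma-separated text lines and that node identifiers contain no ':' character; the class name and the ':' field separator inside the composite key are illustrative choices, not part of the original text.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Step 21 (sketch): for each edge "a,b,len(a,b)" emit two records, one keyed by
// each endpoint, so that every node later sees all of its incident edges.
public class EdgeMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] f = line.toString().split(",");          // [a, b, len(a,b)]
        String a = f[0], b = f[1], len = f[2];
        // Composite key <node, edge length> enables the intra-group sort of step 24.
        context.write(new Text(a + ":" + len), new Text(b + ":" + len));
        context.write(new Text(b + ":" + len), new Text(a + ":" + len));
    }
}
```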
[0050] Step 22: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition. Specifically, in this embodiment the partition to which each record belongs depends only on the first field of the Key emitted by the Map job. For example, the partition index can be the hash of the Key's left field modulo the known total number of partitions, expressed in pseudocode as:

[0051] K.left.hashCode() % numberOfPartitions.
[0052] This in effect guarantees that the edge information of all records whose left field is the same node is stored in the same partition.
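A corresponding partitioner sketch, again on the Hadoop API and assuming the composite key encoding used above, could look as follows; masking the hash sign bit is only there to keep the modulo result non-negative.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Step 22 (sketch): partition by the left field (the node identifier) only,
// so that all edges of one node land in the same partition.
public class NodePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String left = key.toString().split(":")[0];        // node identifier
        return (left.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```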
[0053] Step 23: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group.
[0054] The result of grouping (GroupComparator) depends only on the comparison of the first field of the Keys being compared. For example, for two Keys k1 and k2, the comparison result is

[0055] k1.left.compare(k2.left).
[0056] This in effect guarantees that all of a node's edge information (the Value values: neighbor nodes and edge lengths) is handled within a single Reduce call.
[0057] Step 24: sort the key-value pairs belonging to the same group by the edge length contained in the key. The sort in step 24 may be ascending. As an optional optimization, step 24 can be implemented as an intra-group sort (SortComparator, SC), defined as the result of comparing the two fields in left-then-right order. In pseudocode:

[0058] l_compare = k1.left.compare(k2.left)
[0059] if (l_compare == 0)   // if the left values are equal, compare the right values
[0060]     return k1.right.compare(k2.right)
[0061] else
[0062]     return l_compare
[0063] Since the right field of the Key holds the edge length, this guarantees that during the iteration in the Reduce phase, the edge information is returned in ascending order of edge length. Note: the Key in step 21 is defined with the two fields node identifier and edge length precisely to enable this optimization; without it, the Key in step 21 need only consist of the node identifier.
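The grouping of step 23 and the intra-group sort of step 24 can be sketched as two comparators on the Hadoop API, still assuming the "node:length" composite key encoding used in the earlier sketches:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Step 23 (sketch): two keys belong to the same group when their node field matches.
public class GroupByNode extends WritableComparator {
    public GroupByNode() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String leftA = a.toString().split(":")[0];
        String leftB = b.toString().split(":")[0];
        return leftA.compareTo(leftB);
    }
}

// Step 24 (sketch): within one node, order records by edge length, ascending,
// so that the Rho count can stop at the first edge whose length reaches Dc.
class SortByNodeThenLength extends WritableComparator {
    public SortByNodeThenLength() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String[] ka = a.toString().split(":");
        String[] kb = b.toString().split(":");
        int byNode = ka[0].compareTo(kb[0]);
        if (byNode != 0) return byNode;
        return Double.compare(Double.parseDouble(ka[1]), Double.parseDouble(kb[1]));
    }
}
```

In a driver these would typically be wired up with job.setPartitionerClass(NodePartitioner.class), job.setGroupingComparatorClass(GroupByNode.class) and job.setSortComparatorClass(SortByNodeThenLength.class); the exact wiring depends on the key type actually chosen.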
[0064] Step 25: via a Reduce job, traverse all edges of the same node by iterating over the values of the key-value pairs belonging to the same group, and generate output containing the node, the node's local density Rho, and all of the node's edge information.
[0065] The output of the Reduce job in step 25 is a key-value pair in which the key includes a field identifying the node, and the value includes a field identifying the node, a field giving the node's Rho, and a field containing all of the node's edge information.
[0066] After the above steps, each Reduce call can traverse all edges of one node by iterating over the Values. Each time the reduce procedure is called, three pieces of information are output: the identifier of the current node n, the Rho value of n, and all edge information of n sorted by edge length.
[0067] When the SC optimization above is used, the counting of the Rho value can stop as soon as the iterated edge length exceeds the predefined value Dc. Likewise, since the edges have already been sorted in ascending order by the SC, the edge information can simply be concatenated in iteration order. Without this optimization, the Rho count must iterate to the last edge before finishing, and the edge information must be sorted before being emitted as part of the Value.
[0068] As an example, the output format can be a key-value pair:

[0069] [K = n, V = <n, Rho(n), n1:len(n,n1), n2:len(n,n2), ..., nN:len(n,nN)>].
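A Reduce-side sketch of step 25 on the Hadoop API, assuming the records arrive sorted by edge length as arranged above, and that Dc is passed through the job configuration under an illustrative property name:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Step 25 (sketch): one reduce call sees all edges of one node, sorted by length
// ascending, so Rho can be counted with an early cutoff at Dc and the neighbor
// list can be concatenated in iteration order.
public class RhoReducer extends Reducer<Text, Text, Text, Text> {
    private double dc;                                    // predefined cutoff Dc

    @Override
    protected void setup(Context context) {
        dc = context.getConfiguration().getDouble("clustering.dc", 1.0);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String node = key.toString().split(":")[0];
        int rho = 0;
        boolean counting = true;
        StringBuilder neighbors = new StringBuilder();
        for (Text v : values) {                           // v = "neighbor:len"
            String[] f = v.toString().split(":");
            if (counting && Double.parseDouble(f[1]) < dc) rho++;
            else counting = false;                        // sorted, so no shorter edge follows
            if (neighbors.length() > 0) neighbors.append(",");
            neighbors.append(f[0]).append(":").append(f[1]);
        }
        // Emitted line: n \t n \t Rho(n) \t n1:len,...,nN:len  (cf. [0069])
        context.write(new Text(node), new Text(node + "\t" + rho + "\t" + neighbors));
    }
}
```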
[0070] Through the first MapReduce task described above, the preferred embodiment mainly computes the Rho values and sorts each node's neighbors in ascending order of distance. The second MapReduce task that follows mainly computes the Delta values and identifies the class center points.
[0071] Step 30: for the output of the Reduce job of step 20, generate key-value pairs containing the node, the node's Rho, the neighbor's Rho, and the edge information through a Map job; for each node, traverse the node's Rho, all neighbors' Rho, and all edge information through a Reduce job to obtain the dispersion Delta of each node, and perform class identification according to a predetermined rule.
[0072] In this preferred embodiment the predetermined rule is: if the node's Rho and Delta are respectively higher than the thresholds R_T and D_T supplied as input parameters, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho; the class identifier of an isolated node is its own identifier. This rule is similar to the rule adopted in Chinese patent application CN 201410814330.4, "Virtual Person Establishment Method and Apparatus": a rigid requirement that the Rho value and the Delta value each exceed its corresponding threshold.
[0073] This is only one way of deciding whether a node can be identified as a class center. Fundamentally, whether a node can serve as a class center is determined from the node's Rho and Delta values, and there are other decision methods that use factors including the Rho and Delta values. The local density based clustering method on the MapReduce platform of the present invention can also relax the way class centers are confirmed and thereby complete the clustering operation more quickly. For example, the predetermined rule may include: pre-defining candidate intervals for the Rho value and corresponding candidate intervals for the Delta value; if the node's Rho value falls within a candidate Rho interval and its Delta value falls within the corresponding Delta interval, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho; the class identifier of an isolated node is its own identifier. For instance, if a node's Rho value lies in the range [10, 20] and its Delta value also lies in [0.9*10, 0.8*20] (that is, the Delta range varies with, and corresponds to, the Rho range), the node can also be identified as a class center.
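The two variants of the center test described in [0072] and [0073] can be written as a small helper; the interval values below are just the example numbers from the text and the method names are illustrative:

```java
// Sketch of the two class-center tests; not taken verbatim from the cited application.
public class CenterRule {

    // Rigid rule: both Rho and Delta must exceed the thresholds R_T and D_T.
    static boolean isCenterByThreshold(int rho, double delta, int rhoT, double deltaT) {
        return rho > rhoT && delta > deltaT;
    }

    // Relaxed rule: Rho falls in a predefined interval and Delta falls in the
    // interval associated with it, e.g. Rho in [10, 20] with Delta in [0.9*10, 0.8*20].
    static boolean isCenterByInterval(int rho, double delta,
                                      int rhoLow, int rhoHigh,
                                      double deltaLow, double deltaHigh) {
        return rho >= rhoLow && rho <= rhoHigh
            && delta >= deltaLow && delta <= deltaHigh;
    }
}
```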
[0074] To compute the Delta value of a node, the Rho values of its neighbors are needed. Given the output of the Reduce job of step 20, the traversal of the neighbors' Rho values can be implemented with a Cartesian product on plain MapReduce, realizing the full join through a custom InputFormat; the purpose of this traversal is to support the subsequent computation of the Delta values. A related example can be found in [MapReduce Design Patterns, O'Reilly, Dec. 2012, pp. 128-138].
[0075] Step 30 may specifically include:
[0076] Step 31: for the output of the Reduce job of step 20, generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node, a field giving the length of the edge between the node and the neighbor node, a field giving the neighbor's Rho, and a field giving the node's Rho.
[0077] For the output of the Reduce job of step 20, the Map job outputs the current node together with the information of the neighbor nodes obtained through the join. An optimized example output format is:

[0078] [K = <a, Rho(b)>, V = <Rho(b), Rho(a), b, len(a,b)>].
[0079] In step 31, the key may optionally further include a field giving the neighbor's Rho; this optimization moves the Rho(b) information into the Key part to facilitate the sort of the subsequent step 34.

[0080] Step 32: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition. For details, see step 22.
[0081] Step 33: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group. For details, see step 23.
[0082] Step 34: sort the key-value pairs belonging to the same group by the neighbor-node Rho included in the key. As an optional optimization, the first field of the Key is used to determine whether two keys refer to the same node; if so, they are sorted by the second field in descending order. This ordering guarantees that, within a single Reduce invocation, neighbor nodes with high Rho values are visited first during iteration.
[0083] Step 35: via the Reduce job, for each node, traverse the node's Rho, the Rho of all its neighbor nodes, and all neighboring-edge information by iterating over the values of the key-value pairs belonging to the same group, thereby obtaining each node's dispersion Delta; class identification is then performed in combination with the predetermined rule.
[0084] After the above steps, in each Reduce invocation the information of a node and all of its neighboring edges can be traversed by iterating over the Value entries. At this point the thresholds R_T and D_T, supplied as input parameters, can optionally be combined to generate the information needed for class identification.
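For illustration only, the following single-machine Python sketch stands in for the step-35 Reduce logic; the record shapes, the threshold rule using R_T and D_T, and the "follow:" parent-pointer output are assumptions of the sketch rather than a literal implementation.

```python
# Single-machine stand-in for the step-35 Reduce on one node's group of values.
def step35_reduce(node, rho, neighbor_values, r_t, d_t):
    """neighbor_values: (rho_neighbor, neighbor, length) tuples, already sorted
    by rho_neighbor in descending order, as the step-34 sort guarantees."""
    delta, parent, max_len = None, None, 0.0
    for rho_nb, nb, length in neighbor_values:
        max_len = max(max_len, length)
        if rho_nb > rho and (delta is None or length < delta):
            delta, parent = length, nb        # shortest edge to a denser neighbor
    if delta is None:                         # no denser neighbor: use longest edge
        delta = max_len
    is_center = rho > r_t and delta > d_t
    label = node if (is_center or parent is None) else "follow:" + parent
    return node, rho, delta, label

# Node "a" (Rho=3) with neighbors b (Rho=5, len 0.4) and c (Rho=1, len 1.2):
print(step35_reduce("a", 3, [(5, "b", 0.4), (1, "c", 1.2)], r_t=2, d_t=1.0))
```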
[0085] In this preferred embodiment, the Map phase of step 30 is implemented on the native MapReduce scheme, but in practice the processing can be accelerated with common database techniques. For example, when the Reduce job of step 20 writes its output, the Rho value of each node can be stored in a relational database or a key-value (K-V) database. Then, in the Map phase of step 30, the Rho value of a neighbor node only needs to be looked up, rather than handled through a custom InputFormat; in other words, the Cartesian operation is no longer required, and the data can be accessed directly in the Map phase to obtain the neighbor node's Rho value.
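For illustration only, the following sketch shows this lookup-based variant with an in-memory dictionary standing in for the relational or K-V store; the store contents and the record layout are assumptions.

```python
# A plain dict stands in for the database holding the Rho values written when
# step 20's Reduce job finishes; its contents are assumed for the example.
rho_store = {"a": 3, "b": 5, "c": 1}

def step30_map_with_lookup(edge_record):
    """edge_record: (node, neighbor, length); the neighbor's Rho comes from a
    lookup, so no Cartesian product / custom InputFormat round is required."""
    node, neighbor, length = edge_record
    rho_a, rho_b = rho_store[node], rho_store[neighbor]
    return (node, rho_b), (rho_b, rho_a, neighbor, length)

print(step30_map_with_lookup(("a", "b", 0.4)))   # ((a, Rho(b)), (Rho(b), Rho(a), b, len))
```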
[0086] In summary, the present invention implements local-density-based clustering on a cluster by means of the popular MapReduce distributed-computing idea, which weakens the limitations imposed by the limited resources of a single machine, enabling massive data to be processed and the clustering operation to be completed faster.
[0087] For those of ordinary skill in the art, various other corresponding changes and modifications can be made according to the technical solution and technical concept of the present invention, and all such changes and modifications shall fall within the protection scope of the appended claims of the present invention.
Technical Problem
Solution to Problem
Advantageous Effects of Invention

Claims

[Claim 1] A local-density-based clustering method on a MapReduce platform, characterized in that it comprises:
step 10: pre-processing the data to be clustered to construct a connected graph in which nodes represent the data and the length of an edge between nodes represents the similarity between the data, a shorter edge between two nodes indicating a higher similarity between the data they represent;
step 20: taking the node and edge information of the connected graph as input data, generating key-value pairs comprising node and neighboring-edge information through a Map job, and generating, through a Reduce job, an output comprising each node, the node's local density Rho, and all of the node's neighboring-edge information, Rho being defined as the number of the node's neighboring edges whose length is below a predefined value Dc;
step 30: for the output of the Reduce job in step 20, generating key-value pairs comprising the node, the node's Rho, the neighbor nodes' Rho, and the neighboring-edge information through a Map job; for each node, traversing the node's Rho, all neighbor nodes' Rho, and all neighboring-edge information through a Reduce job to obtain the node's dispersion Delta, Delta being defined as the length of the shortest edge among the node's edges connecting to neighbor nodes with higher Rho values or, if no such neighbor node exists, the length of the node's longest neighboring edge; and then performing class identification in combination with a predetermined rule.
[Claim 2] The local-density-based clustering method on a MapReduce platform according to claim 1, characterized in that the predetermined rule comprises:
if a node's Rho and Delta are respectively higher than a threshold R_T and a threshold D_T supplied as input parameters, the node is the center of a class and its class identifier is its own class identifier; otherwise, the node's class identifier is the class identifier of its nearest neighbor node with a higher Rho;
an isolated node's class identifier is its own class identifier.
[Claim 3] The local-density-based clustering method on a MapReduce platform according to claim 1, characterized in that the predetermined rule comprises:
pre-dividing a possible range of Rho values and a corresponding possible range of Delta values; if a node's Rho value falls within the Rho range and its Delta value falls within the corresponding Delta range, the node is the center of a class and its class identifier is its own class identifier; otherwise, the node's class identifier is the class identifier of its nearest neighbor node with a higher Rho;
an isolated node's class identifier is its own class identifier.
[Claim 4] The local-density-based clustering method on a MapReduce platform according to claim 1, characterized in that the output of the Reduce job in step 20 is stored in a relational database or a key-value database.
[Claim 5] The local-density-based clustering method on a MapReduce platform according to claim 1, characterized in that, in the Map job of step 30, the traversal of the neighbor nodes' Rho is achieved by performing a Cartesian product on the output of the Reduce job in step 20.
[Claim 6] The local-density-based clustering method on a MapReduce platform according to claim 1, characterized in that step 20 comprises:
step 21: taking the node and edge information of the connected graph as input data and generating key-value pairs via a Map job, wherein the key includes a field identifying a node, and the value includes a field identifying a neighbor node and a field for the length of the edge between the node and that neighbor;
step 22: partitioning the key-value pairs by the node included in the key, key-value pairs whose keys contain the same node being assigned to the same partition;
step 23: within each partition, grouping the key-value pairs by the node included in the key, key-value pairs whose keys contain the same node being assigned to the same group;
step 25: via a Reduce job, traversing all neighboring edges of a given node by iterating over the values of the key-value pairs belonging to the same group, and generating an output comprising the node, the node's local density Rho, and all of the node's neighboring-edge information.
[Claim 7] The local-density-based clustering method on a MapReduce platform according to claim 6, characterized in that step 20 further comprises:
in step 21, the key further includes a field for the length of the edge between the node and the neighbor node; and
step 24: sorting the key-value pairs belonging to the same group by the edge length included in the key.
[Claim 8] The local-density-based clustering method on a MapReduce platform according to claim 6, characterized in that the output of the Reduce job in step 25 is a key-value pair in which the key includes a field identifying the node, and the value includes a field identifying the node, a field for the node's Rho, and a field for all of the node's neighboring-edge information.
[Claim 9] The local-density-based clustering method on a MapReduce platform according to claim 1, characterized in that step 30 comprises:
step 31: for the output of the Reduce job in step 20, generating key-value pairs via a Map job, wherein the key includes a field identifying a node, and the value includes a field identifying a neighbor node, a field for the length of the edge between the node and that neighbor, a field for the neighbor node's Rho, and a field for the node's Rho;
step 32: partitioning the key-value pairs by the node included in the key, key-value pairs whose keys contain the same node being assigned to the same partition;
step 33: within each partition, grouping the key-value pairs by the node included in the key, key-value pairs whose keys contain the same node being assigned to the same group;
step 35: via a Reduce job, for each node, traversing the node's Rho, all neighbor nodes' Rho, and all neighboring-edge information by iterating over the values of the key-value pairs belonging to the same group, obtaining the node's dispersion Delta, and performing class identification in combination with the predetermined rule.
[Claim 10] The local-density-based clustering method on a MapReduce platform according to claim 9, characterized in that step 30 further comprises:
in step 31, the key further includes a field identifying the neighbor node's Rho; and
step 34: sorting the key-value pairs belonging to the same group by the neighbor node's Rho included in the key.
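The following single-machine Python sketch is provided purely to illustrate the Rho computation recited in claims 1 and 6; the toy graph, the choice of Dc, and the in-memory grouping are assumptions of the sketch and form no part of the claims.

```python
# Non-normative illustration: Rho(node) is the number of the node's neighboring
# edges shorter than Dc; the dict-based grouping mimics the Map/shuffle/Reduce flow.
from collections import defaultdict

edges = [("a", "b", 0.4), ("a", "c", 1.2), ("b", "c", 0.9)]   # undirected (u, v, length)
DC = 1.0                                                       # assumed cutoff distance

# "Map" + shuffle: one (node, (neighbor, length)) record per edge endpoint.
grouped = defaultdict(list)
for u, v, length in edges:
    grouped[u].append((v, length))
    grouped[v].append((u, length))

# "Reduce": per node, count edges shorter than Dc and keep all neighboring-edge info.
for node, nbrs in sorted(grouped.items()):
    rho = sum(1 for _, length in nbrs if length < DC)
    print(node, rho, nbrs)    # node, Rho, all neighboring-edge information
```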
PCT/CN2015/094376 2014-12-31 2015-11-12 Clustering method based on local density on mapreduce platform WO2016107297A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410855502.2A CN104978382A (en) 2014-12-31 2014-12-31 Clustering method based on local density on MapReduce platform
CN201410855502.2 2014-12-31

Publications (1)

Publication Number Publication Date
WO2016107297A1 true WO2016107297A1 (en) 2016-07-07

Family

ID=54274894

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/094376 WO2016107297A1 (en) 2014-12-31 2015-11-12 Clustering method based on local density on mapreduce platform

Country Status (2)

Country Link
CN (1) CN104978382A (en)
WO (1) WO2016107297A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978382A (en) * 2014-12-31 2015-10-14 深圳市华傲数据技术有限公司 Clustering method based on local density on MapReduce platform
CN106204293B (en) * 2016-06-30 2019-05-31 河北科技大学 A kind of community discovery algorithm based on Hadoop platform
CN108073939A (en) * 2016-11-17 2018-05-25 中国移动通信有限公司研究院 A kind of data clustering method and device

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN103339624A (en) * 2010-12-14 2013-10-02 加利福尼亚大学董事会 High efficiency prefix search algorithm supporting interactive, fuzzy search on geographical structured data
CN103544289A (en) * 2013-10-28 2014-01-29 公安部第三研究所 Feature extraction achieving method based on deploy and control data mining

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
WO2013151221A1 (en) * 2012-04-06 2013-10-10 에스케이플래닛 주식회사 System and method for analyzing cluster results of large amounts of data
CN103593418A (en) * 2013-10-30 2014-02-19 中国科学院计算技术研究所 Distributed subject finding method and system for big data
CN104239553A (en) * 2014-09-24 2014-12-24 江苏名通信息科技有限公司 Entity recognition method based on Map-Reduce framework
CN104965846A (en) * 2014-12-31 2015-10-07 深圳市华傲数据技术有限公司 Virtual human establishing method on MapReduce platform
CN104978382A (en) * 2014-12-31 2015-10-14 深圳市华傲数据技术有限公司 Clustering method based on local density on MapReduce platform

Non-Patent Citations (1)

Title
Yang, Yajun: "A MapReduce Based Adaptive Density Clustering Algorithm", Master's Dissertation, 31 July 2014 (2014-07-31) *

Also Published As

Publication number Publication date
CN104978382A (en) 2015-10-14


Legal Events

Code Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15874978; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 15874978; Country of ref document: EP; Kind code of ref document: A1)