WO2016107297A1 - Clustering method based on local density on mapreduce platform - Google Patents

Clustering method based on local density on mapreduce platform

Info

Publication number
WO2016107297A1
WO2016107297A1 · PCT/CN2015/094376 · CN2015094376W
Authority
WO
WIPO (PCT)
Prior art keywords
node
key
rho
value
neighbor
Prior art date
Application number
PCT/CN2015/094376
Other languages
French (fr)
Chinese (zh)
Inventor
蔡立宇
贾西贝
Original Assignee
深圳市华傲数据技术有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市华傲数据技术有限公司
Publication of WO2016107297A1 publication Critical patent/WO2016107297A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to the field of data processing technologies, and in particular, to a local density based clustering method on a MapReduce platform.
  • Cluster analysis is an important algorithm in data mining. It is based on similarity: patterns within one cluster are more similar to each other than to patterns in other clusters. Clustering algorithms can be divided into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. With the arrival of the cloud computing and big data era, the rapid development of social informatization and networking has led to explosive growth of data. Applying cluster analysis to big data therefore needs to be combined with distributed computing platforms in order to escape the limitations imposed by the limited resources of a single computer.
  • MapReduce is a distributed parallel computing framework proposed by Google for parallel computation over large-scale data sets; it processes them mainly through the two steps "Map" and "Reduce".
  • Each partition later corresponds to one Reduce job; key-value pairs with the same Key are processed by the same Reduce job, which reads these intermediate key-value pairs.
  • For each unique key, the key and its associated values are passed to the reduce function, and the output produced by the reduce function is appended to the output file of that partition.
  • A Map job processes one split of the input data and may call the map function many times, once per input key-value pair; a Reduce job processes the intermediate key-value pairs of one partition, calling the reduce function once for each distinct key, and ultimately corresponds to one output file.
  • The local density-based clustering method mainly includes: representing the data by nodes in a connected graph, and representing the similarity between data items by the length of the edges between nodes, where the shorter the edge between two nodes, the higher the similarity between the data they represent; computing the local density Rho of each node, where Rho is defined as the number of edges incident to the node whose length is below a predefined value Dc; computing the dispersion Delta of each node, where Delta is defined as the length of the shortest edge connecting the node to a neighbor with a higher Rho value, or, if no such neighbor exists, the length of the node's longest incident edge;
  • identifying nodes whose Rho and Delta values exceed the preset thresholds R_T and D_T, respectively, as class centers; and assigning each non-center node to the class of the nearest neighbor node with a higher Rho value.
  • The edge length measures how likely two nodes are to belong to the same class (their similarity); Rho measures the importance of the current node to its neighbors; Delta measures how distinguishable the current node would be from other class centers if it were taken as a class center. To process massive data and overcome the limitations imposed by the limited resources of a single machine, this local density based clustering method needs to be implemented on the MapReduce platform.
  • the object of the present invention is to provide a local density-based clustering method on the MapReduce platform, which realizes processing of massive data and overcomes the limitation imposed by the limited resources of the single machine.
  • the present invention provides a local density based clustering method on a MapReduce platform, including:
  • Step 10: preprocess the data to be clustered and construct a connected graph in which nodes represent the data and the length of the edge between two nodes represents the similarity between the data they represent; the shorter the edge between two nodes, the higher the similarity between the data represented by the nodes;
  • Step 20: using the node and edge information of the connected graph as input data, generate key-value pairs containing each node and its edge information through a Map job, and generate output containing each node, its local density Rho, and all of the node's edge information through a Reduce job, where Rho is defined as the number of incident edges whose length is below a predefined value Dc;
  • Step 30: for the output of the Reduce job in step 20, generate key-value pairs containing the node, the node's Rho, the neighbor's Rho, and the edge information through a Map job; for each node, traverse the node's Rho, all neighbors' Rho, and all edge information through a Reduce job to obtain the dispersion Delta of each node, where Delta is defined as the length of the shortest edge connecting the node to a neighbor with a higher Rho value, or, if no such neighbor exists, the length of the node's longest incident edge; then perform class identification according to a predetermined rule.
  • The predetermined rule includes: if the node's Rho and Delta are respectively higher than the thresholds R_T and D_T supplied as input parameters, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho;
  • the class identifier of the isolated node is its own class identifier.
  • The predetermined rule may alternatively include: pre-defining candidate intervals for the Rho value and corresponding candidate intervals for the Delta value; if the node's Rho value falls within a candidate Rho interval and its Delta value falls within the corresponding Delta interval, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho;
  • the class identifier of the isolated node is its own class identifier.
  • The data format of the node and edge information used as input data in step 20 includes a field identifying the node, a field identifying the neighbor node, and a field giving the length of the edge between the node and the neighbor node.
  • In step 20, the output of the Reduce job may be stored in a relational database or a key-value database.
  • step 20 includes:
  • Step 21: the node and edge information of the connected graph is used as input data to generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node and a field giving the length of the edge between the node and the neighbor node;
  • Step 22: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition;
  • Step 23: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group;
  • Step 25: via a Reduce job, traverse all edges of the same node by iterating over the values of the key-value pairs belonging to the same group, and generate output containing the node, the node's local density Rho, and all of the node's edge information.
  • step 20 further includes:
  • In step 21, the key further includes a field giving the length of the edge between the node and the neighbor node;
  • Step 24: sort the key-value pairs belonging to the same group by the edge length contained in the key.
  • The sort in step 24 may be in ascending order.
  • the output of the Reduce job is a key value pair, wherein the key includes a field identifying the node, and the value includes a field identifying the node, a field identifying the node Rho, and a field identifying all neighbor information of the node.
  • step 30 includes:
  • Step 31: for the output of the Reduce job in step 20, generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node, a field giving the length of the edge between the node and the neighbor node, a field giving the neighbor's Rho, and a field giving the node's Rho;
  • Step 32: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition;
  • Step 33: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group;
  • Step 35: via a Reduce job, for each node, traverse the node's Rho, all neighbors' Rho, and all edge information by iterating over the values of the key-value pairs belonging to the same group, obtain the node's dispersion Delta, and perform class identification according to the predetermined rule.
  • step 30 further includes:
  • In step 31, the key further includes a field giving the neighbor's Rho;
  • Step 34: sort the key-value pairs belonging to the same group by the neighbor's Rho contained in the key.
  • The present invention implements local density based clustering on a cluster by means of the popular MapReduce distributed computing model, weakening the limitations imposed by the limited resources of a single machine, enabling the processing of massive data and faster completion of the clustering operation.
  • FIG. 1 is a flowchart of a preferred embodiment of the local density based clustering method on a MapReduce platform of the present invention.
  • the preferred embodiment mainly includes:
  • Step 10: preprocess the data to be clustered and construct a connected graph in which nodes represent the data and the length of the edge between two nodes represents the similarity between the data they represent; the shorter the edge between two nodes, the higher the similarity between the data represented by the nodes.
  • In step 10, the similarity between the data items to be clustered is first computed according to a preset rule, and then the connected graph is constructed. Taking Chinese patent application CN 201410814330.4, "Virtual Person Establishment Method and Apparatus", as an example, the data to be clustered are accounts; the similarity between accounts is computed from how often the accounts co-occur, and the connected graph is then built.
  • Step 20: using the node and edge information of the connected graph as input data, generate key-value pairs containing each node and its edge information through a Map job, and generate output containing each node, its local density Rho, and all of the node's edge information through a Reduce job.
  • Rho is defined as the number of incident edges whose length is below the predefined value Dc.
  • Step 20 specifically includes:
  • Step 21: the node and edge information of the connected graph is used as input data to generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node and a field giving the length of the edge between the node and the neighbor node.
  • The edge information consists of the corresponding neighbor node and the edge length.
  • As an optimization, the key may further include a field giving the length of the edge between the node and the neighbor node.
  • In practice, each row of the input data may correspond to the edge information between a pair of nodes. For convenience, the input data can therefore be represented as a triple consisting of the node with the smaller identifier a, the node with the larger identifier b, and the edge length len(a,b): [a, b, len(a,b)].
  • Because the Rho value must be computed for every node, the Map job emits two <Key, Value> outputs for each edge in the connected graph.
  • Each Key or Value consists of two fields, left and right.
  • Step 22: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition.
  • The partition to which each record belongs depends only on the first field of the Key emitted by the Map job.
  • For example, the partition index can be the hash of the Key's left field modulo the known total number of partitions, expressed in pseudocode as:
  • K.left.hashCode() % numberOfPartitions.
  • Step 23: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group;
  • The result of grouping (GroupComparator) depends only on the comparison of the first field of the Keys being compared. For example, for two Keys k1 and k2, the comparison result is k1.left.compare(k2.left).
  • Step 24: sort the key-value pairs belonging to the same group by the edge length contained in the key.
  • The sort in step 24 may be in ascending order.
  • As an optional optimization, step 24 can be implemented as an intra-group sort (SortComparator, SC), defined as the result of comparing the two fields in left-then-right order.
  • With the SC intra-group sort, the edge information is returned in ascending order of edge length during the Reduce iteration.
  • The Key in step 21 is defined with the two fields node identifier and edge length precisely to enable this optimization; without it, the Key in step 21 need only consist of the node identifier.
  • Step 25: via a Reduce job, traverse all edges of the same node by iterating over the values of the key-value pairs belonging to the same group, and generate output containing the node, the node's local density Rho, and all of the node's edge information.
  • The output of the Reduce job in step 25 is a key-value pair in which the key includes a field identifying the node, and the value includes a field identifying the node, a field giving the node's Rho, and a field containing all of the node's edge information.
  • Each Reduce call can traverse all edges of one node by iterating over the Values.
  • Each time the reduce procedure is called, three pieces of information are output: the identifier of the current node n, the Rho value of n, and all edge information of n sorted by edge length.
  • With the SC optimization, the counting of the Rho value can stop as soon as the iterated edge length exceeds the predefined value Dc.
  • The edge information can likewise be concatenated in iteration order. Without this optimization, the Rho count must iterate to the last edge before finishing, and the edge information must be sorted before being emitted as part of the Value.
  • As an example, the format of the output can be the key-value pair [K = n, V = <n, Rho(n), n1:len(n,n1), ..., nN:len(n,nN)>].
  • The preferred embodiment uses the first MapReduce task described above mainly to compute the Rho values and to sort each node's neighbors in ascending order of distance.
  • The second MapReduce task that follows mainly computes the Delta values and identifies the class center points.
  • Step 30: for the output of the Reduce job in step 20, generate key-value pairs containing the node, the node's Rho, the neighbor's Rho, and the edge information through a Map job; for each node, traverse the node's Rho, all neighbors' Rho, and all edge information through a Reduce job to obtain the dispersion Delta of each node, and perform class identification according to a predetermined rule.
  • The predetermined rule is: if the node's Rho and Delta are respectively higher than the thresholds R_T and D_T supplied as input parameters, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho; the class identifier of an isolated node is its own identifier.
  • This predetermined rule is similar to the rule adopted in Chinese patent application CN 201410814330.4, "Virtual Person Establishment Method and Apparatus": a rigid requirement that the Rho value and the Delta value each exceed its corresponding threshold.
  • This is only one way of deciding whether a node can be identified as a class center. Fundamentally, whether a node can serve as a class center is determined from the node's Rho and Delta values; other decision methods that use factors including the Rho and Delta values also exist.
  • The local density based clustering method on the MapReduce platform of the present invention can also relax the way class centers are confirmed, and thereby complete the clustering operation more quickly.
  • The predetermined rule may include: pre-defining candidate intervals for the Rho value and corresponding candidate intervals for the Delta value; if the node's Rho value falls within a candidate Rho interval and its Delta value falls within the corresponding Delta interval, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho; the class identifier of an isolated node is its own identifier.
  • For example, if a node's Rho value lies in the range [10, 20] and its Delta value also lies in [0.9*10, 0.8*20] (that is, the Delta range varies with, and corresponds to, the Rho range), the node can also be identified as a class center.
  • Given the output of the Reduce job in step 20, the traversal of the neighbors' Rho values can be implemented with a Cartesian product on plain MapReduce, realizing the full join through a custom InputFormat; the purpose of this traversal is to support the subsequent computation of the Delta values.
  • Step 30 specifically includes:
  • Step 31: for the output of the Reduce job in step 20, generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node, a field giving the length of the edge between the node and the neighbor node, a field giving the neighbor's Rho, and a field giving the node's Rho.
  • As an optimization, the key may further include a field giving the neighbor's Rho; moving the Rho(b) information into the Key part facilitates the sort in the subsequent step 34.
  • Step 32: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition. For details, see step 22.
  • Step 33: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group. For details, see step 23.
  • Step 34: sort the key-value pairs belonging to the same group by the neighbor's Rho contained in the key. As an optional optimization, first use the first field of the Key to determine whether two Keys belong to the same node; if they do, sort them by the second field in descending order. This ordering guarantees that within the same Reduce call, neighbor nodes with high Rho values are visited first during the iteration.
  • Step 35: via a Reduce job, for each node, traverse the node's Rho, all neighbors' Rho, and all edge information by iterating over the values of the key-value pairs belonging to the same group, obtain the node's dispersion Delta, and perform class identification according to the predetermined rule.
  • In each Reduce call, the node itself and all of its edges can be traversed by iterating over the Value values.
  • Combined with the thresholds R_T and D_T supplied as input parameters, this generates the information needed for class identification.
  • The Map phase of step 30 is implemented on plain MapReduce, but in practice the process can be accelerated with common database techniques.
  • When the Reduce job of step 20 emits its output, the Rho value of each node can be stored in a relational database or a K-V database. Then, in the Map phase of step 30, the Rho value of a neighbor node can simply be looked up instead of being handled through a custom InputFormat; that is, the Cartesian product is no longer needed, and the data can be accessed directly in the Map phase to obtain the neighbors' Rho values.
  • The present invention implements local density based clustering on a cluster by means of the popular MapReduce distributed computing model, weakening the limitations imposed by the limited resources of a single machine, enabling the processing of massive data and faster completion of the clustering operation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided is a clustering method based on local density on a MapReduce platform. The method comprises: Step 10, preprocessing the data to be clustered and constructing a connected graph in which nodes represent the data; Step 20, using the nodes and their edge information in the connected graph as input data and generating the local density Rho of each node through a MapReduce operation; and Step 30, obtaining the dispersion degree Delta of each node through a MapReduce operation and performing class identification according to a preset rule. According to the present invention, clustering based on local density is realized on a cluster by using the MapReduce distributed computation idea; the restrictions caused by the limited resources of a single machine are weakened; massive data can be processed; and the clustering operation can be completed more quickly.

Description

Local density based clustering method on MapReduce platform

TECHNICAL FIELD
[0001] The present invention relates to the field of data processing technologies, and in particular to a local density based clustering method on a MapReduce platform.
[0002] BACKGROUND OF THE INVENTION
[0003] Cluster analysis is an important algorithm in data mining. It is based on similarity: patterns within one cluster are more similar to each other than to patterns in other clusters. Clustering algorithms can be divided into partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, and so on. With the arrival of the cloud computing and big data era, the rapid development of social informatization and networking has led to explosive growth of data. Applying cluster analysis to big data therefore needs to be combined with distributed computing platforms in order to escape the limitations imposed by the limited resources of a single computer.
[0004] MapReduce is a distributed parallel computing framework proposed by Google for parallel computation over large-scale data sets. It processes large data sets mainly through the two steps "Map" and "Reduce". In a computation on the MapReduce platform, the input data is first split across the computers of the cluster, and the other computers in the cluster are assigned to run Map jobs or Reduce jobs. A Map job extracts key-value pairs <Key, Value> from the input data; each key-value pair is passed as an argument to the map function, and the intermediate key-value pairs produced by the map function are buffered in memory. The buffered intermediate key-value pairs are periodically written to local disk and divided into R partitions, where R is user-defined; each partition later corresponds to one Reduce job. Key-value pairs with the same Key are processed by the same Reduce job, which reads these intermediate key-value pairs; for each unique key, the key and its associated values are passed to the reduce function, and the output produced by the reduce function is appended to the output file of that partition. The difference between Map/Reduce jobs and map/reduce functions: a Map job processes one split of the input data and may call the map function many times, once per input key-value pair; a Reduce job processes the intermediate key-value pairs of one partition, calling the reduce function once for each distinct key, and ultimately corresponds to one output file. Throughout the process, the input data comes from the underlying distributed file system, the intermediate data is placed on the local file system, and the final output is written back to the underlying distributed file system.

[0005] Chinese patent application CN201410814330.4, "Virtual Person Establishment Method and Apparatus", describes a clustering method based on local density. The method mainly includes: representing the data by nodes in a connected graph, and representing the similarity between data items by the length of the edges between nodes, where the shorter the edge between two nodes, the higher the similarity between the data they represent; computing the local density Rho of each node, where Rho is defined as the number of edges incident to the node whose length is below a predefined value Dc; computing the dispersion Delta of each node, where Delta is defined as the length of the shortest edge connecting the node to a neighbor with a higher Rho value, or, if no such neighbor exists, the length of the node's longest incident edge; identifying nodes whose Rho and Delta values exceed the preset thresholds R_T and D_T, respectively, as class centers; and assigning each non-center node to the class of the nearest neighbor node with a higher Rho value. The edge length measures how likely two nodes are to belong to the same class (their similarity); Rho measures the importance of the current node to its neighbors; Delta measures how distinguishable the current node would be from other class centers if it were taken as a class center. To process massive data and overcome the limitations imposed by the limited resources of a single machine, this local density based clustering method urgently needs to be implemented on the MapReduce platform.
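As context for the distributed version described below, here is a minimal single-machine sketch of the Rho and Delta definitions above, written in Java. It assumes the graph is held as an adjacency map of (neighbor, edge length) pairs; the class and method names are illustrative and do not come from the cited application.

```java
import java.util.HashMap;
import java.util.Map;

// Single-machine illustration of the Rho and Delta definitions (not the MapReduce version).
public class LocalDensitySketch {

    // Rho(n): number of edges incident to n whose length is below Dc.
    static int computeRho(Map<String, Double> edgesOfN, double dc) {
        int count = 0;
        for (double len : edgesOfN.values()) {
            if (len < dc) count++;
        }
        return count;
    }

    // Delta(n): shortest edge to a neighbor with a higher Rho,
    // or the longest incident edge if no such neighbor exists.
    static double computeDelta(String n,
                               Map<String, Map<String, Double>> graph,
                               Map<String, Integer> rhoOf) {
        double shortestToHigher = Double.POSITIVE_INFINITY;
        double longest = 0.0;
        for (Map.Entry<String, Double> e : graph.get(n).entrySet()) {
            longest = Math.max(longest, e.getValue());
            if (rhoOf.get(e.getKey()) > rhoOf.get(n)) {
                shortestToHigher = Math.min(shortestToHigher, e.getValue());
            }
        }
        return shortestToHigher == Double.POSITIVE_INFINITY ? longest : shortestToHigher;
    }

    public static void main(String[] args) {
        // Tiny undirected graph: node -> (neighbor -> edge length).
        Map<String, Map<String, Double>> graph = new HashMap<>();
        graph.put("a", Map.of("b", 1.0, "c", 2.5));
        graph.put("b", Map.of("a", 1.0, "c", 1.2));
        graph.put("c", Map.of("a", 2.5, "b", 1.2));

        double dc = 2.0;
        Map<String, Integer> rhoOf = new HashMap<>();
        for (String n : graph.keySet()) rhoOf.put(n, computeRho(graph.get(n), dc));
        for (String n : graph.keySet()) {
            System.out.println(n + ": Rho=" + rhoOf.get(n)
                    + " Delta=" + computeDelta(n, graph, rhoOf));
        }
    }
}
```

On a single machine this is straightforward; the point of the invention is to distribute exactly these two computations over a MapReduce cluster, as the following sections describe.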
[0006] SUMMARY OF THE INVENTION
[0007] Accordingly, an object of the present invention is to provide a local density based clustering method on the MapReduce platform that can process massive data and overcome the limitations imposed by the limited resources of a single machine.
[0008] To achieve the above object, the present invention provides a local density based clustering method on a MapReduce platform, including:
[0009] Step 10: preprocess the data to be clustered and construct a connected graph in which nodes represent the data and the length of the edge between two nodes represents the similarity between the data they represent; the shorter the edge between two nodes, the higher the similarity between the data represented by the nodes;
[0010] Step 20: using the node and edge information of the connected graph as input data, generate key-value pairs containing each node and its edge information through a Map job, and generate output containing each node, its local density Rho, and all of the node's edge information through a Reduce job, where Rho is defined as the number of incident edges whose length is below a predefined value Dc;
[0011] Step 30: for the output of the Reduce job of step 20, generate key-value pairs containing the node, the node's Rho, the neighbor's Rho, and the edge information through a Map job; for each node, traverse the node's Rho, all neighbors' Rho, and all edge information through a Reduce job to obtain the dispersion Delta of each node, where Delta is defined as the length of the shortest edge connecting the node to a neighbor with a higher Rho value, or, if no such neighbor exists, the length of the node's longest incident edge; then perform class identification according to a predetermined rule.
[0012] The predetermined rule may include: if the node's Rho and Delta are respectively higher than the thresholds R_T and D_T supplied as input parameters, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho;
[0013] An isolated node takes its own identifier as its class identifier.
[0014] The predetermined rule may alternatively include: pre-defining candidate intervals for the Rho value and corresponding candidate intervals for the Delta value; if the node's Rho value falls within a candidate Rho interval and its Delta value falls within the corresponding Delta interval, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho;
[0015] An isolated node takes its own identifier as its class identifier.
[0016] In step 20, the data format of the node and edge information used as input data includes a field identifying the node, a field identifying the neighbor node, and a field giving the length of the edge between the node and the neighbor node.
[0017] The output of the Reduce job of step 20 may be stored in a relational database or a key-value database.
[0018] In the Map job of step 30, the traversal of the neighbors' Rho values may be implemented by taking the Cartesian product of the output of the Reduce job of step 20.
[0019] Step 20 includes:
[0020] Step 21: the node and edge information of the connected graph is used as input data to generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node and a field giving the length of the edge between the node and the neighbor node;
[0021] Step 22: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition;
[0022] Step 23: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group;
[0023] Step 25: via a Reduce job, traverse all edges of the same node by iterating over the values of the key-value pairs belonging to the same group, and generate output containing the node, the node's local density Rho, and all of the node's edge information.
[0024] Step 20 may further include:
[0025] In step 21, the key further includes a field giving the length of the edge between the node and the neighbor node;
[0026] Step 24: sort the key-value pairs belonging to the same group by the edge length contained in the key.
[0027] The sort in step 24 may be in ascending order.
[0028] The output of the Reduce job in step 25 is a key-value pair in which the key includes a field identifying the node, and the value includes a field identifying the node, a field giving the node's Rho, and a field containing all of the node's edge information.
[0029] Step 30 includes:
[0030] Step 31: for the output of the Reduce job of step 20, generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node, a field giving the length of the edge between the node and the neighbor node, a field giving the neighbor's Rho, and a field giving the node's Rho;
[0031] Step 32: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition;
[0032] Step 33: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group;
[0033] Step 35: via a Reduce job, for each node, traverse the node's Rho, all neighbors' Rho, and all edge information by iterating over the values of the key-value pairs belonging to the same group, obtain the node's dispersion Delta, and perform class identification according to the predetermined rule.
[0034] Step 30 may further include:
[0035] In step 31, the key further includes a field giving the neighbor's Rho;
[0036] Step 34: sort the key-value pairs belonging to the same group by the neighbor's Rho contained in the key.
[0037] In summary, the present invention implements local density based clustering on a cluster by means of the popular MapReduce distributed computing model, weakening the limitations imposed by the limited resources of a single machine, enabling the processing of massive data and faster completion of the clustering operation.
[0038] BRIEF DESCRIPTION OF THE DRAWINGS
[0039] In the drawings:
[0040] FIG. 1 is a flowchart of a preferred embodiment of the local density based clustering method on a MapReduce platform of the present invention.
[0041] DETAILED DESCRIPTION
[0042] The technical solutions of the present invention and their advantageous effects will become apparent from the following detailed description of specific embodiments of the invention, taken in conjunction with the accompanying drawings.
[0043] Referring to FIG. 1, which is a flowchart of a preferred embodiment of the local density based clustering method on a MapReduce platform of the present invention, the preferred embodiment mainly includes:
[0044] Step 10: preprocess the data to be clustered and construct a connected graph in which nodes represent the data and the length of the edge between two nodes represents the similarity between the data they represent; the shorter the edge between two nodes, the higher the similarity between the data represented by the nodes. In step 10, the similarity between the data items to be clustered is first computed according to a preset rule, and then the connected graph is constructed. Taking Chinese patent application CN 201410814330.4, "Virtual Person Establishment Method and Apparatus", as an example, the data to be clustered are accounts; the similarity between accounts is computed from how often the accounts co-occur, and the connected graph is then built.
[0045] Step 20: using the node and edge information of the connected graph as input data, generate key-value pairs containing each node and its edge information through a Map job, and generate output containing each node, its local density Rho, and all of the node's edge information through a Reduce job. Rho is defined as the number of incident edges whose length is below the predefined value Dc.
[0046] Step 20 may specifically include:
[0047] Step 21: the node and edge information of the connected graph is used as input data to generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node and a field giving the length of the edge between the node and the neighbor node. The edge information consists of the corresponding neighbor node and the edge length. As an optimization, in step 21 the key may further include a field giving the length of the edge between the node and the neighbor node.
[0048] In practice, each row of the input data may correspond to the edge information between a pair of nodes. For convenience, the input data can therefore be represented as a triple consisting of the node with the smaller identifier a, the node with the larger identifier b, and the edge length len(a,b): [a, b, len(a,b)].
[0049] Because the Rho value must be computed for every node, the Map job emits two <Key, Value> outputs for each edge in the connected graph. Each Key or Value consists of two fields, left and right. Specifically, the first Key may be K1 = <a, len(a,b)> (here left = a, right = len(a,b)) with Value V1 = <b, len(a,b)>, and the second Key may be K2 = <b, len(a,b)> with Value V2 = <a, len(a,b)>.
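A minimal sketch of this Map step on the Hadoop MapReduce API is shown below. It assumes the input triples are comma-separated text lines and that node identifiers contain no ':' character; the class name and the ':' field separator inside the composite key are illustrative choices, not part of the original text.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Step 21 (sketch): for each edge "a,b,len(a,b)" emit two records, one keyed by
// each endpoint, so that every node later sees all of its incident edges.
public class EdgeMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] f = line.toString().split(",");          // [a, b, len(a,b)]
        String a = f[0], b = f[1], len = f[2];
        // Composite key <node, edge length> enables the intra-group sort of step 24.
        context.write(new Text(a + ":" + len), new Text(b + ":" + len));
        context.write(new Text(b + ":" + len), new Text(a + ":" + len));
    }
}
```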
[0050] Step 22: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition. Specifically, in this embodiment the partition to which each record belongs depends only on the first field of the Key emitted by the Map job. For example, the partition index can be the hash of the Key's left field modulo the known total number of partitions, expressed in pseudocode as:

[0051] K.left.hashCode() % numberOfPartitions.
[0052] This in effect guarantees that the edge information of all records whose left field is the same node is stored in the same partition.
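A corresponding partitioner sketch, again on the Hadoop API and assuming the composite key encoding used above, could look as follows; masking the hash sign bit is only there to keep the modulo result non-negative.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Step 22 (sketch): partition by the left field (the node identifier) only,
// so that all edges of one node land in the same partition.
public class NodePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String left = key.toString().split(":")[0];        // node identifier
        return (left.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```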
[0053] Step 23: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group.
[0054] The result of grouping (GroupComparator) depends only on the comparison of the first field of the Keys being compared. For example, for two Keys k1 and k2, the comparison result is

[0055] k1.left.compare(k2.left).
[0056] This in effect guarantees that all of a node's edge information (the Value values: neighbor nodes and edge lengths) is handled within a single Reduce call.
[0057] Step 24: sort the key-value pairs belonging to the same group by the edge length contained in the key. The sort in step 24 may be ascending. As an optional optimization, step 24 can be implemented as an intra-group sort (SortComparator, SC), defined as the result of comparing the two fields in left-then-right order. In pseudocode:

[0058] l_compare = k1.left.compare(k2.left)
[0059] if (l_compare == 0)   // if the left values are equal, compare the right values
[0060]     return k1.right.compare(k2.right)
[0061] else
[0062]     return l_compare
[0063] Since the right field of the Key holds the edge length, this guarantees that during the iteration in the Reduce phase, the edge information is returned in ascending order of edge length. Note: the Key in step 21 is defined with the two fields node identifier and edge length precisely to enable this optimization; without it, the Key in step 21 need only consist of the node identifier.
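The grouping of step 23 and the intra-group sort of step 24 can be sketched as two comparators on the Hadoop API, still assuming the "node:length" composite key encoding used in the earlier sketches:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Step 23 (sketch): two keys belong to the same group when their node field matches.
public class GroupByNode extends WritableComparator {
    public GroupByNode() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String leftA = a.toString().split(":")[0];
        String leftB = b.toString().split(":")[0];
        return leftA.compareTo(leftB);
    }
}

// Step 24 (sketch): within one node, order records by edge length, ascending,
// so that the Rho count can stop at the first edge whose length reaches Dc.
class SortByNodeThenLength extends WritableComparator {
    public SortByNodeThenLength() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String[] ka = a.toString().split(":");
        String[] kb = b.toString().split(":");
        int byNode = ka[0].compareTo(kb[0]);
        if (byNode != 0) return byNode;
        return Double.compare(Double.parseDouble(ka[1]), Double.parseDouble(kb[1]));
    }
}
```

In a driver these would typically be wired up with job.setPartitionerClass(NodePartitioner.class), job.setGroupingComparatorClass(GroupByNode.class) and job.setSortComparatorClass(SortByNodeThenLength.class); the exact wiring depends on the key type actually chosen.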
[0064] Step 25: via a Reduce job, traverse all edges of the same node by iterating over the values of the key-value pairs belonging to the same group, and generate output containing the node, the node's local density Rho, and all of the node's edge information.
[0065] The output of the Reduce job in step 25 is a key-value pair in which the key includes a field identifying the node, and the value includes a field identifying the node, a field giving the node's Rho, and a field containing all of the node's edge information.
[0066] After the above steps, each Reduce call can traverse all edges of one node by iterating over the Values. Each time the reduce procedure is called, three pieces of information are output: the identifier of the current node n, the Rho value of n, and all edge information of n sorted by edge length.
[0067] When the SC optimization above is used, the counting of the Rho value can stop as soon as the iterated edge length exceeds the predefined value Dc. Likewise, since the edges have already been sorted in ascending order by the SC, the edge information can simply be concatenated in iteration order. Without this optimization, the Rho count must iterate to the last edge before finishing, and the edge information must be sorted before being emitted as part of the Value.
[0068] As an example, the output format can be a key-value pair:

[0069] [K = n, V = <n, Rho(n), n1:len(n,n1), n2:len(n,n2), ..., nN:len(n,nN)>].
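A Reduce-side sketch of step 25 on the Hadoop API, assuming the records arrive sorted by edge length as arranged above, and that Dc is passed through the job configuration under an illustrative property name:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Step 25 (sketch): one reduce call sees all edges of one node, sorted by length
// ascending, so Rho can be counted with an early cutoff at Dc and the neighbor
// list can be concatenated in iteration order.
public class RhoReducer extends Reducer<Text, Text, Text, Text> {
    private double dc;                                    // predefined cutoff Dc

    @Override
    protected void setup(Context context) {
        dc = context.getConfiguration().getDouble("clustering.dc", 1.0);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String node = key.toString().split(":")[0];
        int rho = 0;
        boolean counting = true;
        StringBuilder neighbors = new StringBuilder();
        for (Text v : values) {                           // v = "neighbor:len"
            String[] f = v.toString().split(":");
            if (counting && Double.parseDouble(f[1]) < dc) rho++;
            else counting = false;                        // sorted, so no shorter edge follows
            if (neighbors.length() > 0) neighbors.append(",");
            neighbors.append(f[0]).append(":").append(f[1]);
        }
        // Emitted line: n \t n \t Rho(n) \t n1:len,...,nN:len  (cf. [0069])
        context.write(new Text(node), new Text(node + "\t" + rho + "\t" + neighbors));
    }
}
```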
[0070] Through the first MapReduce task described above, the preferred embodiment mainly computes the Rho values and sorts each node's neighbors in ascending order of distance. The second MapReduce task that follows mainly computes the Delta values and identifies the class center points.
[0071] Step 30: for the output of the Reduce job of step 20, generate key-value pairs containing the node, the node's Rho, the neighbor's Rho, and the edge information through a Map job; for each node, traverse the node's Rho, all neighbors' Rho, and all edge information through a Reduce job to obtain the dispersion Delta of each node, and perform class identification according to a predetermined rule.
[0072] In this preferred embodiment the predetermined rule is: if the node's Rho and Delta are respectively higher than the thresholds R_T and D_T supplied as input parameters, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho; the class identifier of an isolated node is its own identifier. This rule is similar to the rule adopted in Chinese patent application CN 201410814330.4, "Virtual Person Establishment Method and Apparatus": a rigid requirement that the Rho value and the Delta value each exceed its corresponding threshold.
[0073] This is only one way of deciding whether a node can be identified as a class center. Fundamentally, whether a node can serve as a class center is determined from the node's Rho and Delta values, and there are other decision methods that use factors including the Rho and Delta values. The local density based clustering method on the MapReduce platform of the present invention can also relax the way class centers are confirmed and thereby complete the clustering operation more quickly. For example, the predetermined rule may include: pre-defining candidate intervals for the Rho value and corresponding candidate intervals for the Delta value; if the node's Rho value falls within a candidate Rho interval and its Delta value falls within the corresponding Delta interval, the node is the center of a class and its class identifier is its own identifier; otherwise, the node takes the class identifier of its nearest neighbor with a higher Rho; the class identifier of an isolated node is its own identifier. For instance, if a node's Rho value lies in the range [10, 20] and its Delta value also lies in [0.9*10, 0.8*20] (that is, the Delta range varies with, and corresponds to, the Rho range), the node can also be identified as a class center.
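The two variants of the center test described in [0072] and [0073] can be written as a small helper; the interval values below are just the example numbers from the text and the method names are illustrative:

```java
// Sketch of the two class-center tests; not taken verbatim from the cited application.
public class CenterRule {

    // Rigid rule: both Rho and Delta must exceed the thresholds R_T and D_T.
    static boolean isCenterByThreshold(int rho, double delta, int rhoT, double deltaT) {
        return rho > rhoT && delta > deltaT;
    }

    // Relaxed rule: Rho falls in a predefined interval and Delta falls in the
    // interval associated with it, e.g. Rho in [10, 20] with Delta in [0.9*10, 0.8*20].
    static boolean isCenterByInterval(int rho, double delta,
                                      int rhoLow, int rhoHigh,
                                      double deltaLow, double deltaHigh) {
        return rho >= rhoLow && rho <= rhoHigh
            && delta >= deltaLow && delta <= deltaHigh;
    }
}
```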
[0074] To compute the Delta value of a node, the Rho values of its neighbors are needed. Given the output of the Reduce job of step 20, the traversal of the neighbors' Rho values can be implemented with a Cartesian product on plain MapReduce, realizing the full join through a custom InputFormat; the purpose of this traversal is to support the subsequent computation of the Delta values. A related example can be found in [MapReduce Design Patterns, O'Reilly, Dec. 2012, pp. 128-138].
[0075] Step 30 may specifically include:
[0076] Step 31: for the output of the Reduce job of step 20, generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node, a field giving the length of the edge between the node and the neighbor node, a field giving the neighbor's Rho, and a field giving the node's Rho.
[0077] For the output of the Reduce job of step 20, the Map job outputs the current node together with the information of the neighbor nodes obtained through the join. An optimized example output format is:

[0078] [K = <a, Rho(b)>, V = <Rho(b), Rho(a), b, len(a,b)>].
[0079] In step 31, the key may optionally further include a field giving the neighbor's Rho; this optimization moves the Rho(b) information into the Key part to facilitate the sort of the subsequent step 34.

[0080] Step 32: partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition. For details, see step 22.
[0081] Step 33: within each partition, group the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group. For details, see step 23.
[0082] Step 34: sort the key-value pairs belonging to the same group by the neighbor-node Rho included in the key. As an optional optimization, the first field of the Key is used to determine whether two keys refer to the same node; if so, they are sorted by the second field in descending order. This ordering guarantees that, within a single Reduce invocation, neighbor nodes with high Rho values are visited first during iteration.
[0083] Step 35: via the Reduce job, for each node, traverse the node's Rho, the Rho of all its neighbor nodes, and all neighboring-edge information by iterating over the values of the key-value pairs belonging to the same group, thereby obtaining each node's dispersion Delta; class identification is then performed in combination with the predetermined rule.
[0084] After the above steps, in each Reduce invocation the information of a node and all of its neighboring edges can be traversed by iterating over the Value entries. At this point the thresholds R_T and D_T, supplied as input parameters, can optionally be combined to generate the information needed for class identification.
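For illustration only, the following single-machine Python sketch stands in for the step-35 Reduce logic; the record shapes, the threshold rule using R_T and D_T, and the "follow:" parent-pointer output are assumptions of the sketch rather than a literal implementation.

```python
# Single-machine stand-in for the step-35 Reduce on one node's group of values.
def step35_reduce(node, rho, neighbor_values, r_t, d_t):
    """neighbor_values: (rho_neighbor, neighbor, length) tuples, already sorted
    by rho_neighbor in descending order, as the step-34 sort guarantees."""
    delta, parent, max_len = None, None, 0.0
    for rho_nb, nb, length in neighbor_values:
        max_len = max(max_len, length)
        if rho_nb > rho and (delta is None or length < delta):
            delta, parent = length, nb        # shortest edge to a denser neighbor
    if delta is None:                         # no denser neighbor: use longest edge
        delta = max_len
    is_center = rho > r_t and delta > d_t
    label = node if (is_center or parent is None) else "follow:" + parent
    return node, rho, delta, label

# Node "a" (Rho=3) with neighbors b (Rho=5, len 0.4) and c (Rho=1, len 1.2):
print(step35_reduce("a", 3, [(5, "b", 0.4), (1, "c", 1.2)], r_t=2, d_t=1.0))
```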
[0085] In this preferred embodiment, the Map phase of step 30 is implemented on the native MapReduce scheme, but in practice the processing can be accelerated with common database techniques. For example, when the Reduce job of step 20 writes its output, the Rho value of each node can be stored in a relational database or a key-value (K-V) database. Then, in the Map phase of step 30, the Rho value of a neighbor node only needs to be looked up, rather than handled through a custom InputFormat; in other words, the Cartesian operation is no longer required, and the data can be accessed directly in the Map phase to obtain the neighbor node's Rho value.
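For illustration only, the following sketch shows this lookup-based variant with an in-memory dictionary standing in for the relational or K-V store; the store contents and the record layout are assumptions.

```python
# A plain dict stands in for the database holding the Rho values written when
# step 20's Reduce job finishes; its contents are assumed for the example.
rho_store = {"a": 3, "b": 5, "c": 1}

def step30_map_with_lookup(edge_record):
    """edge_record: (node, neighbor, length); the neighbor's Rho comes from a
    lookup, so no Cartesian product / custom InputFormat round is required."""
    node, neighbor, length = edge_record
    rho_a, rho_b = rho_store[node], rho_store[neighbor]
    return (node, rho_b), (rho_b, rho_a, neighbor, length)

print(step30_map_with_lookup(("a", "b", 0.4)))   # ((a, Rho(b)), (Rho(b), Rho(a), b, len))
```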
[0086] In summary, the present invention implements local-density-based clustering on a cluster by means of the popular MapReduce distributed-computing idea, which weakens the limitations imposed by the limited resources of a single machine, enabling massive data to be processed and the clustering operation to be completed faster.
[0087] For those of ordinary skill in the art, various other corresponding changes and modifications can be made according to the technical solution and technical concept of the present invention, and all such changes and modifications shall fall within the protection scope of the appended claims of the present invention.
Technical Problem
Solution to Problem
Advantageous Effects of Invention

Claims

[Claim 1] A local-density-based clustering method on a MapReduce platform, characterized in that it comprises:
step 10: pre-processing the data to be clustered to construct a connected graph in which nodes represent the data and the length of an edge between nodes represents the similarity between the data, a shorter edge between two nodes indicating a higher similarity between the data they represent;
step 20: taking the node and edge information of the connected graph as input data, generating key-value pairs comprising node and neighboring-edge information through a Map job, and generating, through a Reduce job, an output comprising each node, the node's local density Rho, and all of the node's neighboring-edge information, Rho being defined as the number of the node's neighboring edges whose length is below a predefined value Dc;
step 30: for the output of the Reduce job in step 20, generating key-value pairs comprising the node, the node's Rho, the neighbor nodes' Rho, and the neighboring-edge information through a Map job; for each node, traversing the node's Rho, all neighbor nodes' Rho, and all neighboring-edge information through a Reduce job to obtain the node's dispersion Delta, Delta being defined as the length of the shortest edge among the node's edges connecting to neighbor nodes with higher Rho values or, if no such neighbor node exists, the length of the node's longest neighboring edge; and then performing class identification in combination with a predetermined rule.
[Claim 2] The local-density-based clustering method on a MapReduce platform according to claim 1, characterized in that the predetermined rule comprises:
if a node's Rho and Delta are respectively higher than a threshold R_T and a threshold D_T supplied as input parameters, the node is the center of a class and its class identifier is its own class identifier; otherwise, the node's class identifier is the class identifier of its nearest neighbor node with a higher Rho;
an isolated node's class identifier is its own class identifier.
[Claim 3] The local-density-based clustering method on a MapReduce platform according to claim 1, characterized in that the predetermined rule comprises:
pre-dividing a possible range of Rho values and a corresponding possible range of Delta values; if a node's Rho value falls within the Rho range and its Delta value falls within the corresponding Delta range, the node is the center of a class and its class identifier is its own class identifier; otherwise, the node's class identifier is the class identifier of its nearest neighbor node with a higher Rho;
an isolated node's class identifier is its own class identifier.
[Claim 4] The local-density-based clustering method on a MapReduce platform according to claim 1, characterized in that the output of the Reduce job in step 20 is stored in a relational database or a key-value database.
[Claim 5] The local-density-based clustering method on a MapReduce platform according to claim 1, characterized in that, in the Map job of step 30, the traversal of the neighbor nodes' Rho is achieved by performing a Cartesian product on the output of the Reduce job in step 20.
[Claim 6] The local-density-based clustering method on a MapReduce platform according to claim 1, characterized in that step 20 comprises:
step 21: taking the node and edge information of the connected graph as input data and generating key-value pairs via a Map job, wherein the key includes a field identifying a node, and the value includes a field identifying a neighbor node and a field for the length of the edge between the node and that neighbor;
step 22: partitioning the key-value pairs by the node included in the key, key-value pairs whose keys contain the same node being assigned to the same partition;
step 23: within each partition, grouping the key-value pairs by the node included in the key, key-value pairs whose keys contain the same node being assigned to the same group;
step 25: via a Reduce job, traversing all neighboring edges of a given node by iterating over the values of the key-value pairs belonging to the same group, and generating an output comprising the node, the node's local density Rho, and all of the node's neighboring-edge information.
[Claim 7] The local-density-based clustering method on a MapReduce platform according to claim 6, characterized in that step 20 further comprises:
in step 21, the key further includes a field for the length of the edge between the node and the neighbor node; and
step 24: sorting the key-value pairs belonging to the same group by the edge length included in the key.
[Claim 8] The local-density-based clustering method on a MapReduce platform according to claim 6, characterized in that the output of the Reduce job in step 25 is a key-value pair in which the key includes a field identifying the node, and the value includes a field identifying the node, a field for the node's Rho, and a field for all of the node's neighboring-edge information.
[Claim 9] The local-density-based clustering method on a MapReduce platform according to claim 1, characterized in that step 30 comprises:
step 31: for the output of the Reduce job in step 20, generating key-value pairs via a Map job, wherein the key includes a field identifying a node, and the value includes a field identifying a neighbor node, a field for the length of the edge between the node and that neighbor, a field for the neighbor node's Rho, and a field for the node's Rho;
step 32: partitioning the key-value pairs by the node included in the key, key-value pairs whose keys contain the same node being assigned to the same partition;
step 33: within each partition, grouping the key-value pairs by the node included in the key, key-value pairs whose keys contain the same node being assigned to the same group;
step 35: via a Reduce job, for each node, traversing the node's Rho, all neighbor nodes' Rho, and all neighboring-edge information by iterating over the values of the key-value pairs belonging to the same group, obtaining the node's dispersion Delta, and performing class identification in combination with the predetermined rule.
[Claim 10] The local-density-based clustering method on a MapReduce platform according to claim 9, characterized in that step 30 further comprises:
in step 31, the key further includes a field identifying the neighbor node's Rho; and
step 34: sorting the key-value pairs belonging to the same group by the neighbor node's Rho included in the key.
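The following single-machine Python sketch is provided purely to illustrate the Rho computation recited in claims 1 and 6; the toy graph, the choice of Dc, and the in-memory grouping are assumptions of the sketch and form no part of the claims.

```python
# Non-normative illustration: Rho(node) is the number of the node's neighboring
# edges shorter than Dc; the dict-based grouping mimics the Map/shuffle/Reduce flow.
from collections import defaultdict

edges = [("a", "b", 0.4), ("a", "c", 1.2), ("b", "c", 0.9)]   # undirected (u, v, length)
DC = 1.0                                                       # assumed cutoff distance

# "Map" + shuffle: one (node, (neighbor, length)) record per edge endpoint.
grouped = defaultdict(list)
for u, v, length in edges:
    grouped[u].append((v, length))
    grouped[v].append((u, length))

# "Reduce": per node, count edges shorter than Dc and keep all neighboring-edge info.
for node, nbrs in sorted(grouped.items()):
    rho = sum(1 for _, length in nbrs if length < DC)
    print(node, rho, nbrs)    # node, Rho, all neighboring-edge information
```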
PCT/CN2015/094376 2014-12-31 2015-11-12 Clustering method based on local density on mapreduce platform WO2016107297A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410855502.2A CN104978382A (en) 2014-12-31 2014-12-31 Clustering method based on local density on MapReduce platform
CN201410855502.2 2014-12-31

Publications (1)

Publication Number Publication Date
WO2016107297A1 true WO2016107297A1 (en) 2016-07-07

Family

ID=54274894

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/094376 WO2016107297A1 (en) 2014-12-31 2015-11-12 Clustering method based on local density on mapreduce platform

Country Status (2)

Country Link
CN (1) CN104978382A (en)
WO (1) WO2016107297A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978382A (en) * 2014-12-31 2015-10-14 深圳市华傲数据技术有限公司 Clustering method based on local density on MapReduce platform
CN106204293B (en) * 2016-06-30 2019-05-31 河北科技大学 A kind of community discovery algorithm based on Hadoop platform
CN108073939A (en) * 2016-11-17 2018-05-25 中国移动通信有限公司研究院 A kind of data clustering method and device

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN103339624A (en) * 2010-12-14 2013-10-02 加利福尼亚大学董事会 High efficiency prefix search algorithm supporting interactive, fuzzy search on geographical structured data
CN103544289A (en) * 2013-10-28 2014-01-29 公安部第三研究所 Feature extraction achieving method based on deploy and control data mining

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
WO2013151221A1 (en) * 2012-04-06 2013-10-10 에스케이플래닛 주식회사 System and method for analyzing cluster results of large amounts of data
CN103593418A (en) * 2013-10-30 2014-02-19 中国科学院计算技术研究所 Distributed subject finding method and system for big data
CN104239553A (en) * 2014-09-24 2014-12-24 江苏名通信息科技有限公司 Entity recognition method based on Map-Reduce framework
CN104965846A (en) * 2014-12-31 2015-10-07 深圳市华傲数据技术有限公司 Virtual human establishing method on MapReduce platform
CN104978382A (en) * 2014-12-31 2015-10-14 深圳市华傲数据技术有限公司 Clustering method based on local density on MapReduce platform

Non-Patent Citations (1)

Title
Yang, Yajun: "A MapReduce Based Adaptive Density Clustering Algorithm", Master's Dissertation, 31 July 2014 (2014-07-31) *

Also Published As

Publication number Publication date
CN104978382A (en) 2015-10-14


Legal Events

Code Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15874978; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 15874978; Country of ref document: EP; Kind code of ref document: A1)