CN110325984B - System and method for hierarchical community detection in graphs - Google Patents

System and method for hierarchical community detection in graphs

Info

Publication number
CN110325984B
CN110325984B
Authority
CN
China
Prior art keywords
graph
vertex
community
vertices
communities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201780050053.6A
Other languages
Chinese (zh)
Other versions
CN110325984A (en)
Inventor
Viktor Vladimirovich Smirnov
Alexander Vladimirovich Slesarenko
Alexander Nikolaevich Filippov
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN110325984A publication Critical patent/CN110325984A/en
Application granted granted Critical
Publication of CN110325984B publication Critical patent/CN110325984B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists

Abstract

An apparatus for detecting communities in graphs is provided that includes a processor configured to compute the following loop: creating a directed decoupling graph of vertex communities according to the possible vertex movements between the communities of a computed graph, and filtering the edges of the directed decoupling graph so that each community retains only one of the following: incoming edges, whose target is a vertex of the respective community, or outgoing edges, whose origin is a vertex of the respective community; and updating the vertices of the labeled graph with community labels from the edges of the filtered directed decoupling graph. A compressed graph, created by merging the vertices of each community of the filtered directed decoupling graph and merging multiple edges between pairs of vertices of the compressed graph into a single edge, represents the computed hierarchical communities of the labeled graph. The graph eventually converges to a solution when the loop is executed in a parallel and/or distributed computing system.

Description

System and method for hierarchical community detection in graphs
Background
The present invention, in some embodiments thereof, relates to community detection, and more particularly, but not exclusively, to systems and methods for hierarchical community detection in graphs.
Community Detection (CD) is the process of identifying communities in a graph, where a community is a subset of closely related vertices. CD has applications in a variety of fields, including computer vision, economics, social science, medical research, and genetic research.
There are two general approaches to community detection: flat and hierarchical. Hierarchical CD is a community detection process that generates a hierarchical topology of the communities identified in the graph. Hierarchical CD is generally more complex than flat CD, but produces a hierarchical community topology and may provide better accuracy.
Hierarchical community detection methods follow one of two general approaches: divisive (top-down) and agglomerative (bottom-up). The agglomerative approach generally provides more control over the process of obtaining a solution.
The problem of community detection in big data is computationally difficult when such big data comprises very large graphs, e.g., graphs representing Internet connectivity. For practical applications, a method of community detection in large data graphs should be fast enough to produce results in a reasonable time. Since the graphs being analyzed are large, methods of O(n²) complexity may not produce a solution in a reasonable time. To identify communities in large data graphs in a reasonable time, the CD method should be designed to be closer to linear time complexity.
Referring now to FIG. 1, FIG. 1 is a diagram illustrating a graph 102 with identified hierarchical communities (represented by different shading) and a corresponding tree diagram 104 (corresponding to the community shading of graph 102), provided to aid in understanding some embodiments of the present invention.
Disclosure of Invention
The invention aims to provide an apparatus, a method, a computer program product, and a system for detecting communities in graphs.
The above and other objects are achieved by the features of the independent claims. Further implementations are apparent from the dependent claims, the description and the drawings.
According to a first aspect, there is provided an apparatus for detecting communities among vertices V and edges E in a graph G = (V, E), for: iteratively computing a compressed graph (i.e., in an iterative loop) using a labeled graph (G1 = (V1, E1)), wherein the graph G = (V, E) serves as the labeled graph for the first iteration of the iterative loop, and the compressed graph serves as the labeled graph for the next iteration of the iterative loop, until the graph modularity value representing the connectivity of the communities of the compressed graph of the current iteration is stable or inverted relative to the graph modularity value of the previously computed compressed graph, wherein the compressed graph of each iteration is computed by: obtaining, for each vertex v in V1 of the labeled graph (G1), all possible vertex movements between communities, wherein a vertex movement between any two communities corresponds to an edge e in E1 of the labeled graph (G1) connecting two vertices from the two communities; stopping the iterative loop when there is no vertex movement; creating, when there is at least one possible vertex movement, a directed decoupling graph (D(G1)) of the communities of vertices from the obtained vertex movements between communities, wherein each vertex of the directed decoupling graph (D(G1)) represents a respective community of the labeled graph and the obtained vertex movements between communities are represented as edges of the directed decoupling graph (D(G1)); filtering the edges of the directed decoupling graph (D(G1)) to obtain a filtered directed decoupling graph (FD(G1)), wherein each vertex of the filtered directed decoupling graph (FD(G1)) has only in-edges or only out-edges; updating the vertices of the labeled graph (G1) with community labels from the edges of the filtered directed decoupling graph (FD(G1)) to create an updated graph (G1'); creating the compressed graph of the current iteration by merging the vertices of each community of the updated graph (G1') into a single vertex and merging multiple edges between pairs of vertices of the compressed graph into a single edge with an aggregated weight between the corresponding pairs of vertices; and outputting the compressed graph of the last iteration, wherein the compressed graph of the last iteration represents the computed hierarchical communities of the labeled graph of the first iteration.
According to a second aspect, there is provided a method of detecting communities among vertices V and edges E in a graph G = (V, E), the method comprising: computing a compressed graph in an iterative loop using a labeled graph (G1 = (V1, E1)), wherein the graph G = (V, E) serves as the labeled graph for the first iteration of the iterative loop, and the compressed graph serves as the labeled graph for the next iteration of the iterative loop, until the graph modularity value representing the connectivity of the communities of the compressed graph of the current iteration is stable or inverted relative to the graph modularity value of the previously computed compressed graph, wherein the compressed graph of each iteration is computed by: computing, for each vertex v in V1 of the labeled graph (G1), all possible vertex movements between communities, wherein a vertex movement between any two communities corresponds to an edge e in E1 of the labeled graph (G1) connecting two vertices from the two communities; stopping the loop when there is no vertex movement; creating, when there is at least one possible vertex movement, a directed decoupling graph (D(G1)) of the communities of vertices from the computed movements between communities, wherein each vertex of the directed decoupling graph (D(G1)) represents a respective community of the labeled graph and the computed movements of vertices between communities are represented as edges of the directed decoupling graph (D(G1)); filtering the edges of the directed decoupling graph (D(G1)) to obtain a filtered directed decoupling graph (FD(G1)), wherein each vertex of the filtered directed decoupling graph (FD(G1)) has only in-edges or only out-edges; updating the vertices of the labeled graph (G1) with community labels from the edges of the filtered directed decoupling graph (FD(G1)) to create an updated graph (G1'); creating the compressed graph of the current iteration by merging the vertices of each community of the updated graph (G1') into a single vertex and merging multiple edges between pairs of vertices of the compressed graph into a single edge with an aggregated weight between the corresponding pairs of vertices; and outputting the compressed graph of the last iteration, the compressed graph of the last iteration representing the computed hierarchical communities of the input labeled graph (i.e., the labeled graph of the first iteration).
The edges are filtered so that the vertices of each community of the obtained graph are all associated with incoming edges or all associated with outgoing edges, thereby ensuring that the graph eventually converges to a solution when executed in a parallel and/or distributed computing system. For example, a naive parallelization approach may cause vertices to oscillate between communities, which may prevent final convergence to a solution.
Filtering the edges so that the vertices of each community of the obtained graph are all associated with incoming edges or all associated with outgoing edges provides linear time complexity for processing very large graphs in a reasonable amount of time, for example relative to other methods of O(n²) complexity, since the movement and updating of the vertices is performed in parallel by the distributed nodes.
In another implementation of the first and second aspects, the method further comprises and/or the apparatus is further configured to: creating/obtaining the labeled graph (G1) by initializing the vertices of the input graph (G) with community labels, placing each vertex in its own respective community.
The vertices of the labeled graph (G1) and/or the edges of the filtered directed decoupling graph (FD(G1)) may be distributed across nodes of a distributed system for parallel and/or distributed updates.
In another implementation of the first and second aspects, the method further comprises and/or the apparatus is further configured to: distributing the labeled graph (G1) over nodes of at least one of a parallel computing system and a distributed computing system, wherein each node computes the possible vertex movements for at least one different vertex of the labeled graph (G1) in order to obtain the possible vertex movements.
The apparatus, systems, methods, and/or code instructions described herein may be executed in parallel and/or in a distributed computing system. The parallel and/or distributed execution identifies communities in very large graphs (approximately at least 10⁶-10⁸ vertices and about 10⁹ edges or more) that otherwise may take an impractically long time to process using standard methods. Data may be stored locally at each node of the parallel and/or distributed computing system, which reduces the network traffic that would otherwise occur when one node accesses data stored in other nodes. In contrast, other common methods are designed to run sequentially, which results in lengthy processing times. Even if such methods are executed in parallel and/or distributed computing systems, only one core processes the graph at a time, since the algorithms are designed sequentially. Furthermore, if data is stored across multiple nodes in a distributed file system, accessing other nodes from the single executing node to obtain the distributed data may result in excessive network activity.
In another implementation of the first and second aspects, the apparatus includes a plurality of processors configured to compute the compressed graph in parallel, each processor performing, for at least one different community, the merging of the vertices of each community of the filtered directed decoupling graph (FD(G1)) into a single vertex and the merging of duplicate edges of the compressed graph into a single edge.
The decoupling graph is processed centrally once all data is received from the other nodes, to ensure that the full graph is processed. Computing the decoupling graph and filtering its edges may be performed by a single processor using relatively little computational resources and/or in a reasonable amount of time, because the computation is relatively simple.
Network traffic between nodes is reduced relative to other approaches. The network traffic associated with moving and storing the decoupling graph is O(n), where n represents the number of vertices in the graph, since each vertex can have only one associated move. In contrast, the network traffic associated with moving and storing the entire graph in one iteration is O(m), where m represents the number of edges in the graph.
In another implementation of the first and second aspects, the graph modularity represents a global cost function indicating a comparison of the density of edges between vertices within each community of the compressed graph with the density of edges between vertices of different communities of the compressed graph.
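Such a global cost function can be sketched as follows; this is an illustrative helper (names are ours, not from the patent) that computes the standard Newman modularity of a weighted undirected graph given an edge list and a vertex-to-community labeling:

```python
# Illustrative sketch: Newman modularity Q of a weighted undirected graph,
# the global cost function compared across iterations. Not the patent's code.
from collections import defaultdict

def modularity(edges, community):
    """edges: list of (u, v, weight); community: dict vertex -> label."""
    two_m = sum(2 * w for _, _, w in edges)  # 2m = total weighted degree
    internal = defaultdict(float)            # internal edge weight per community
    degree = defaultdict(float)              # total weighted degree per community
    for u, v, w in edges:
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    # Q = sum over communities of (edge density inside) - (expected density)
    return sum(2 * internal[c] / two_m - (degree[c] / two_m) ** 2
               for c in degree)
```

For two triangles joined by a single edge, grouping each triangle into its own community yields Q = 5/14 ≈ 0.357, while putting all vertices into one community yields Q = 0, illustrating why the denser grouping is preferred.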
In another implementation of the first and second aspects, the method further comprises and/or the apparatus is further configured to select vertex movements from the computed possible movements between communities according to a local move modularity value computed for each vertex movement, representing the change in the density of edges between vertices within the respective communities of each vertex movement, wherein the directed decoupling graph (D(G1)) of the communities of vertices is created according to the selected vertex movements.
Greedy-algorithm-based computation, in which vertex moves are selected according to the most significant change (e.g., the largest change) in the local move modularity value, is sufficiently accurate in practice, while providing linear time complexity for processing very large graphs in a reasonable time.
In another implementation of the first and second aspects, the directed decoupling graph (D(G1)) is created and filtered by performing the following steps: selecting a first vertex movement from the computed possible movements according to its local move modularity value; designating a first transmitter community and a first receiver community in accordance with the selected first vertex movement, wherein the transmitter community represents the origin of the moved vertex and the receiver community represents the target of the moved vertex; and iteratively processing the remaining movements by: selecting another vertex movement from the remaining possible movements according to the local move modularity value representing the most significant change in the density of edges, filtering out the other vertex movement when the origin of its vertex is located in one of the receiver communities or its target is located in one of the transmitter communities, and otherwise designating another transmitter community and another receiver community according to the selected other vertex movement.
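The transmitter/receiver bookkeeping described above can be sketched as follows; this is a hypothetical helper (ours, not the patent's implementation) that assumes each candidate move has been reduced to a (gain, source community, target community) triple:

```python
# Hedged sketch of the edge-filtering step: moves are taken greedily by
# modularity gain, and a move is discarded if it would turn a receiver
# community into a transmitter (or vice versa), so every community of the
# filtered decoupling graph ends up with only in-edges or only out-edges.
def filter_moves(moves):
    """moves: list of (gain, src_community, dst_community) tuples."""
    senders, receivers, accepted = set(), set(), []
    for gain, src, dst in sorted(moves, key=lambda m: m[0], reverse=True):
        if src == dst:
            continue
        # Reject a move whose origin is already a receiver or whose
        # target is already a transmitter: it would create a mixed community.
        if src in receivers or dst in senders:
            continue
        senders.add(src)
        receivers.add(dst)
        accepted.append((gain, src, dst))
    return accepted
```

Note that a community may transmit (or receive) several moves; only mixing the two roles is forbidden, which is what breaks the cycles that prevent convergence under naive parallelization.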
According to a third aspect, there is provided a system for determining communities among vertices V and edges E in a graph G = (V, E), the system comprising an apparatus according to the first aspect, a distributed system, a graph repository, and a user interface, the distributed system comprising a plurality of nodes, wherein the apparatus is configured to: receive the graph G = (V, E) from the graph repository; distribute the graph G across the nodes of the distributed system; receive the possible vertex movements of the graph G from the distributed system, computed in parallel for each vertex v of V by the nodes of the distributed system; and determine the communities in the graph G based on the received vertex movements.
According to a fourth aspect, there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to perform the steps of the method of the second aspect.
According to a fifth aspect, there is provided a computer program product comprising instructions which, when executed by a computer or processor, cause the computer or processor to perform the steps of the method of the second aspect.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present invention, with the exemplary methods and/or materials being described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and not necessarily limiting.
Drawings
Some embodiments of the invention are described herein, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the embodiments of the present invention. Thus, it will be apparent to one skilled in the art from the description of the figures how embodiments of the invention may be practiced.
In the drawings:
FIG. 1 is a diagram depicting a graph with identified hierarchical communities and corresponding tree diagrams to aid in understanding some embodiments of the present invention;
FIG. 2 is a schematic diagram that visually depicts a sequential implementation of hierarchical community detection in a graph that may be used to facilitate an understanding of some embodiments of the present invention;
FIG. 3 is a flow diagram depicting the sequential design of the Louvain method described herein, to assist in understanding some embodiments of the present invention;
FIG. 4 is a diagram that visually illustrates that prior art methods cannot compute communities for graphs even if implemented in a parallel environment, to aid in understanding some embodiments of the present invention;
FIG. 5 is a schematic diagram that visually illustrates the lack of graph convergence caused by parallel movement of vertices across different communities according to a prior art method designed for sequential processing, to aid in understanding some embodiments of the present invention;
FIG. 6 is a diagram that visually illustrates the inability of prior art methods designed for sequential execution to ensure graph convergence when implemented in a parallel environment, to aid in understanding some embodiments of the invention;
FIG. 7 is a flow diagram of a method for detecting communities in a graph for implementation in a distributed system, according to some embodiments of the present invention;
FIG. 8 is a block diagram of components of a system for detecting communities in a graph implemented by a distributed system, according to some embodiments of the present invention;
FIG. 9 is a schematic diagram depicting a process of limiting community shrinkage or growth during each iteration in accordance with some embodiments of the invention;
FIG. 10 is a schematic diagram depicting a process for centrally filtering the edges of the directed decoupling graph (D(G1)) in a distributed system, in accordance with some embodiments of the invention;
FIG. 11 is a schematic diagram depicting a process for calculating decoupled vertex movements based on a greedy transformation algorithm according to the method described with reference to FIG. 7, in accordance with some embodiments of the invention;
FIG. 12 is a diagram depicting parallel processing for computing hierarchical communities in graphs, in accordance with some embodiments of the invention;
FIG. 13 is a table summarizing experimental results for systems, apparatus, methods, and/or code instructions described herein;
FIG. 14 is a table summarizing the computational performance of systems, apparatus, methods, and/or code instructions described herein when identifying communities in big data graphs;
FIG. 15 is a diagram that graphically depicts one iteration of the process of computing the compressed graph in accordance with blocks 702-710 described with reference to FIG. 7, in accordance with some embodiments of the present invention.
Detailed Description
The present invention, in some embodiments thereof, relates to community detection, and more particularly, but not exclusively, to systems and methods for hierarchical community detection in graphs.
As used herein, the term "graph" refers to a set of objects, represented as vertices or nodes, each associated with one or more other vertices or nodes by edges. The input graph is the graph processed for detecting communities therein, sometimes denoted herein as G or G = (V, E), where V denotes the vertices and E denotes the edges. The labeled graph is derived by processing the input graph, sometimes denoted herein as G1 = (V1, E1), or G1. The directed decoupling graph is derived by processing the labeled graph, sometimes denoted herein as D(G1). The filtered directed decoupling graph is created by filtering the edges of the directed decoupling graph D(G1), sometimes denoted herein as FD(G1). The compressed graph is computed based on the filtered directed decoupling graph FD(G1).
As used herein, the term "community" refers to a subset of vertices of the graph, each vertex associated with a common community label. The vertices of each community have relatively strong relationships (e.g., edge weight, number of edges) with vertices of the same community, relative to vertices located outside the community, optionally in another community.
As used herein, the term "community tag" refers to a numerical attribute of the graph vertex that defines which community the graph vertex belongs to.
As used herein, the term "vertex movement" refers to the logical movement of a graph vertex from its respective community to some adjacent community along some edge. In practice, the community label of the target community (obtained from the neighboring vertex) is assigned to the moving vertex over the appropriate edge.
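The objects defined above can be illustrated with a hypothetical toy encoding (ours, not from the patent): an input graph G as a weighted edge list, community labels as a per-vertex attribute, and a vertex movement as the assignment of a neighboring community's label:

```python
# Hypothetical toy encoding of the defined terms. Initially each vertex is
# its own community; a "vertex movement" relabels a vertex with the label
# of the target community obtained from a neighbor over a shared edge.
graph_G = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (2, 3, 1.0)]  # (u, v, weight)
community_label = {v: v for v in range(4)}  # one community per vertex
# Vertex movement: vertex 3 moves into vertex 2's community along edge (2, 3).
community_label[3] = community_label[2]
```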
An aspect of some embodiments of the invention relates to a system, apparatus, method and/or code instructions (stored in a data storage device, executable by one or more processors) for detecting communities (optionally hierarchical communities) in an input graph using distributed computing. A directed decoupling graph of vertex communities is created by computing the possible vertex movements between communities for each vertex of the labeled graph computed from the input graph. Each vertex of the directed decoupling graph represents a respective community of the labeled graph. The computed movements between communities are represented as edges of the directed decoupling graph. The edges of the directed decoupling graph are filtered to obtain a filtered directed decoupling graph, wherein each vertex of the filtered directed decoupling graph has only in-edges or only out-edges. The vertices of the labeled graph are updated with community labels from the edges of the filtered directed decoupling graph to create an updated graph.
The directed decoupling graph (D(G1)) is a temporary (optionally logical) data object created during the process of updating the graph (G1) with new community labels. The directed decoupling graph (D(G1)) prevents convergence failure of the labeled graph (G1) while applying modularity-increasing updates to create the updated graph (G1').
The compressed graph is computed iteratively. The compressed graph of each iteration is computed by merging the vertices of each community of the updated graph into a single vertex, and by merging multiple edges between pairs of vertices of the compressed graph into a single edge with an aggregated weight between the corresponding pairs of vertices. The compressed graph of the current iteration serves as the labeled graph for the next iteration, until the graph modularity value representing the connectivity of the communities of the compressed graph of the current iteration is stable or inverted relative to the previously computed graph modularity value.
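The per-iteration compression step can be sketched as follows; this is an illustrative helper (not the patent's implementation) in which each community collapses into a single vertex and parallel edges between community pairs are merged with an aggregated weight:

```python
# Illustrative sketch of graph compression: communities become vertices,
# internal edges become self-loops, and parallel inter-community edges are
# merged into one edge with the summed weight. Names are ours.
from collections import defaultdict

def compress(edges, community):
    """edges: list of (u, v, weight); community: dict vertex -> label.
    Returns the compressed graph as {(community_a, community_b): weight}."""
    merged = defaultdict(float)
    for u, v, w in edges:
        cu, cv = community[u], community[v]
        key = (min(cu, cv), max(cu, cv))  # undirected: normalize the pair
        merged[key] += w                  # aggregate parallel edges / self-loops
    return dict(merged)
```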
The compressed graph of the last iteration is output, representing the computed hierarchical communities of the labeled graph.
Optionally, the possible vertex movements of the labeled graph are computed in parallel by nodes of the distributed computing system. Alternatively or additionally, the compressed graph is computed in parallel from the updated graph by nodes of the distributed computing system.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, method and/or computer program product. The computer program product may include one (or more) computer-readable storage media having computer-readable program instructions for causing a processor to perform various aspects of the present invention.
The computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. For example, the computer-readable storage medium may be, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing.
The computer-readable program instructions described herein may be downloaded to the respective computing/processing device from a computer-readable storage medium, or downloaded to an external computer or external storage device over a network. The network may be the Internet, a local area network, a wide area network, and/or a wireless network, etc.
The computer readable program instructions may execute entirely on the user's computer or partly on the user's computer, or as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, using the Internet from an Internet service provider). In some embodiments, an electronic circuit comprising a programmable logic circuit, a Field Programmable Gate Array (FPGA) or a Programmable Logic Array (PLA), etc., may execute computer readable program instructions using state information of the computer readable program instructions to personalize the electronic circuit to perform aspects of the present invention.
Aspects of the present invention are described herein in connection with flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Referring now to FIG. 2, FIG. 2 is a schematic diagram that visually depicts a sequential implementation of hierarchical community detection in a graph, provided to facilitate an understanding of some embodiments of the present invention. In contrast to the systems, devices, methods, and/or code instructions described herein that use parallel processing to detect communities in graphs, the Louvain-method-based process depicted in FIG. 2 (also referred to as fast unfolding) is designed for sequential execution. The Louvain method is based on heuristic optimization of the graph modularity, which is a global cost function. The Louvain method iteratively performs two stages until convergence of the graph is achieved (e.g., according to a convergence requirement). The first stage 202 is a flat community detection stage, also known as modularity optimization. The second stage 204 is a graph compression stage, also known as community aggregation.
During the first stage 202, each graph vertex is placed in its own community by marking the vertex with a community label. Vertices are then moved sequentially between communities so as to increase the modularity of the graph. During the second stage 204, the communities formed by the movement of the vertices are transformed into vertices of a new compressed graph. Edges within a community are transformed into a self-loop of the new vertex. Edges between communities are merged to create new graph edges of the compressed graph.
The first stage 202 is designed for a sequential implementation based on asynchronous updates. During each successive iteration, a single vertex is selected, the optimal movement (label assignment) is calculated for that vertex, and the vertex is labeled with the new community label according to the calculated optimal movement. The graph converges according to the following modularity gain equations, computed for each movement of a vertex i to a community C_k.
$$\Delta Q(i \rightarrow C_k) = \Delta Q^{+}(i, C_k) + \Delta Q^{-}(i, C_i) \tag{1}$$

Wherein:

$$\Delta Q^{+}(i, C_k) = \frac{k_{i,C_k}}{m} - \frac{\Sigma_{C_k} \cdot k_i}{2m^2}, \qquad \Delta Q^{-}(i, C_i) = -\frac{k_{i,C_i}}{m} + \frac{\Sigma_{C_i \setminus \{i\}} \cdot k_i}{2m^2} \tag{2}$$

respectively representing the positive and negative changes that occur to the modularity as the vertex i is moved out of its community C_i and into the community C_k.

$$k_i = \sum_{j} w_{ij}, \qquad \Sigma_C = \sum_{j \in C} k_j \tag{3}$$

Equation (3) represents the weight of the vertex i and the weight of a community C, respectively, calculated by summing the weights of the adjacent links (m denotes the total weight of the edges of the graph).
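As a concrete illustration, the per-move modularity gain described by equations (1)-(3) can be sketched in Python. This is a minimal sketch using the standard Louvain gain formula; the function and variable names are illustrative assumptions, not taken from the patent:

```python
def modularity_gain(k_i, k_i_in, sigma_tot, m):
    """Modularity gain for moving a vertex i into a community C_k.

    k_i       -- weight of vertex i (sum of the weights of its adjacent links)
    k_i_in    -- total weight of the links from i to vertices already in C_k
    sigma_tot -- weight of community C_k (sum of the weights of its vertices)
    m         -- total weight of all edges in the graph
    """
    return k_i_in / m - (sigma_tot * k_i) / (2.0 * m * m)

# Example: m = 10, a vertex of weight 2 with one unit link into a community
# of weight 4; moving it into the community changes the modularity by 0.06.
gain = modularity_gain(k_i=2, k_i_in=1, sigma_tot=4, m=10)
```

Because `sigma_tot` changes whenever any community absorbs or emits a vertex, two simultaneous moves would each be computed against stale community weights, which is the dependency discussed next.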
The method based on equations (1)-(3) is not designed for parallel processing. Since equation (2) depends on the values calculated by equation (3), and the values calculated by equation (3) change for each community absorbing or emitting a vertex, only one vertex can actually be moved at a time, thereby preventing computation based on parallel vertex movement.
The Louvain method may be expressed in terms of sequential steps that further illustrate the differences from the parallel-based processing performed by the systems, apparatus, methods, and/or code instructions described herein:
1. Set the current iteration i to 0; set the graph G to the input graph.
2. Put all vertices of G into a set S and randomly shuffle the set.
3. If S is empty, go to step 7; otherwise obtain a vertex n from the set S.
4. Find the optimal movement v for n.
5. If the modularity increase of v is negative, go to step 3.
6. Move n into the community according to v, and then go to step 3.
7. Compress the graph G by merging the vertices of the same community into a single vertex of a new graph G_p and by merging the duplicate edges in G_p into a single edge with an aggregated weight.
8. Compute the modularity Q_i of the compressed graph.
9. If Q_i > Q_{i-1}, or if i is 0, set i to i+1 and G to G_p and go to step 2; otherwise go to step 10.
10. The computed graph represents a solution with hierarchical communities.
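The flat community-detection pass of steps 2-6 above can be sketched in Python. This is a simplified, single-level illustration under an assumed adjacency representation ({vertex: {neighbour: weight}}) and assumed names, not the patent's implementation:

```python
import random

def louvain_level(adj, seed=0):
    """One flat community-detection pass (steps 2-6 above, simplified).
    Vertices are visited in shuffled order and greedily moved to the
    neighbouring community with the highest modularity increase."""
    m = sum(w for nbrs in adj.values() for w in nbrs.values()) / 2.0
    degree = {v: sum(nbrs.values()) for v, nbrs in adj.items()}
    community = {v: v for v in adj}     # each vertex starts in its own community
    sigma = dict(degree)                # total vertex weight per community
    order = list(adj)
    random.Random(seed).shuffle(order)  # step 2: random shuffle
    moved = True
    while moved:
        moved = False
        for v in order:
            c_old = community[v]
            sigma[c_old] -= degree[v]   # temporarily take v out of its community
            links = {}                  # weight of links from v into each community
            for u, w in adj[v].items():
                links[community[u]] = links.get(community[u], 0.0) + w
            # steps 4-5: move only if it strictly increases the modularity
            best_c = c_old
            best_gain = links.get(c_old, 0.0) / m - sigma[c_old] * degree[v] / (2 * m * m)
            for c, k_in in links.items():
                g = k_in / m - sigma[c] * degree[v] / (2 * m * m)
                if c != c_old and g > best_gain:
                    best_c, best_gain = c, g
            community[v] = best_c       # step 6: move v into the chosen community
            sigma[best_c] += degree[v]
            if best_c != c_old:
                moved = True
    return community
```

Running this on two triangles joined by a single edge groups each triangle into one community; the compression of steps 7-9 would then merge each triangle into a single vertex.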
Referring now to FIG. 3, FIG. 3 is a flow diagram depicting a sequential design of the Louvain method described herein, to assist in understanding some embodiments of the present invention. The sequential process described in FIG. 3 is in contrast to the parallel-based processing performed by the systems, apparatus, methods, and/or code instructions described herein.
At 302, G is set to the input graph. An initial set C of vertex communities is defined. At 304, an unmoved node n is selected from G. At 306, the optimal movement v of node n is found according to the highest increase in modularity of the graph. At 308, node n is updated with the optimal movement v. At 310, an analysis is performed to determine whether a maximum number of defined iterations is reached and/or whether the communities have converged. At 312, the process iterates by repeatedly performing blocks 304-310 when additional iterations may be performed and/or when the communities have not yet converged. At 314, the computed graph G and the determined community set C are output when the maximum number of iterations is reached and/or when the communities converge.
As described above, the Louvain method and other methods for detecting communities in graphs are designed to be executed sequentially. Very large graphs have about 10^6-10^8 or more vertices and about 10^9 or more edges; for the processing of such very large graphs, sequential processing is inefficient, and computing the solution in a reasonable amount of time may not be feasible using reasonably available processing resources. The systems, methods, and/or code instructions described herein perform parallel-based and/or distributed processing that computes solutions (i.e., identifies communities) for very large graphs in a reasonable amount of time and/or using a reasonable amount of computing resources.
Referring now to FIG. 4, FIG. 4 is a diagram that visually illustrates that the prior art method cannot compute communities for graphs even if implemented in a parallel environment 402, for aiding in understanding some embodiments of the present invention. The parallel environment 402 may include multiple nodes, each running one or more CPUs, optionally including multi-core CPUs. When the parallel environment 402 receives the input graph 404 and computes an output graph with communities 406 based on a prior art method implemented in the parallel environment 402 (e.g., the Louvain method), only one core of one CPU of one node is active when executing the method. The rest of the cores, CPUs, and nodes are idle and do not participate in the computation. Parallelization is prevented because updating a vertex according to its identified optimal movement affects other vertices. Thus, a second vertex movement cannot be computed in parallel, but must wait until the computation of the first vertex movement is complete.
In addition, a Distributed File System (DFS) may be used to store very large graphs (belonging to the category of big data) on many nodes of the parallel environment 402. Since the computations used to identify communities in the distributed graph are confined to a single core, random access from the active single core to other nodes storing the distributed graph is required. Communicating portions of the distributed graph from the other nodes to the active single core may result in excessively high network activity. The entire distributed graph may need to be stored in the local memory of the single active node during the computation, which may not be feasible when the graph is very large and exceeds the storage capacity of the memory of the single active node.
Referring now to FIG. 5, FIG. 5 is a schematic diagram that visually illustrates the lack of graph convergence when vertices are moved in parallel across different communities according to a prior art method designed for sequential processing, to aid in understanding some embodiments of the present invention. FIG. 5 depicts a naive parallelization of a prior art method designed for sequential execution (e.g., the Louvain method). For clarity of illustration, graph 502 includes a small number of vertices and edges. When all nodes move in parallel, the graph may not form stable communities, and thus the graph may not converge. When the graph does not converge, a solution cannot be calculated.
Referring now to FIG. 6, FIG. 6 is a diagram that visually illustrates the inability to ensure graph convergence when a prior art method designed for sequential execution is implemented in a parallel environment, to aid in understanding some embodiments of the present invention. For clarity of illustration, graph 602 includes three vertices and three edges, and graph 604 includes two vertices and two edges.
The graph 602 is processed into graphs 602B and 602C by sequentially moving one vertex at a time based on prior art methods (e.g., the Louvain method), correctly producing stable communities. The graph 604 is processed into a graph 604B by sequentially moving one vertex at a time based on prior art methods (e.g., the Louvain method), correctly producing stable communities.
Instead, the graph 602 is processed into graph 602D and graph 602E by moving all the vertices in parallel so that the vertices loop around each other and never converge to a stable distribution. Communities in the graph 602D and the graph 602E cannot be identified. The graph 604 is processed into a graph 604C by moving all vertices in parallel to loop the vertices around each other and never converge to a stable distribution. Communities in the graph 604C cannot be identified.
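The oscillation of graph 604 under naive parallel updates can be reproduced with a few lines of Python (an illustrative toy, not the patent's code): each of the two vertices simultaneously adopts its neighbour's community label, so the labels swap on every step and never stabilize.

```python
def parallel_step(labels, adj):
    # every vertex simultaneously adopts the community label of its first neighbour
    return {v: labels[adj[v][0]] for v in adj}

adj = {0: [1], 1: [0]}        # two vertices joined by a single edge (as in graph 604)
labels = {0: 0, 1: 1}         # each vertex starts in its own community
step1 = parallel_step(labels, adj)   # labels swap: {0: 1, 1: 0}
step2 = parallel_step(step1, adj)    # and swap back: {0: 0, 1: 1} -- no convergence
```

After any even number of parallel steps, the labels are back where they started; the two vertices circle each other forever and never merge into one community.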
Referring now to fig. 7, fig. 7 is a flow diagram of a method for detecting communities in a graph implemented in a distributed system, according to some embodiments of the present invention. Referring now also to FIG. 8, FIG. 8 is a block diagram of components of a system 800 for detecting communities in a graph implemented by a distributed system 802, according to some embodiments of the present invention.
Distributed system 802 is designed to execute code instructions in parallel. The distributed system 802 includes a plurality of nodes 804. The distributed system 802 may be implemented as a single unit (e.g., a chassis), or as multiple interconnected units (e.g., multiple chassis connected to one another).
Each node 804 may be implemented, for example, as a hardware processor, a virtual machine, a set of processors arranged for parallel processing, a multi-core processor, a computing device (i.e., at least one processor and associated data storage device), and/or a set of computing devices arranged as a sub-distributed system. The nodes 804 may be homogeneous or heterogeneous. Node 804 may be a standalone computing component, such as a Web server, computing cloud, local server, remote server, client terminal running code, mobile device, stationary device, server, smartphone, laptop, tablet, wearable computing device, glasses computing device, watch computing device, and desktop computer. The processor of the node 804 is implemented as, for example, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), and an Application Specific Integrated Circuit (ASIC).
Computing devices (i.e., devices used to determine communities in a graph) 806 communicate with a distributed system 802 (also sometimes referred to herein as a parallel processing system). The computing device 806 may act as a controller for the distributed system 802, e.g., a scheduler and/or dispatcher that handles work on the nodes 804 of the distributed system 802. Computing device 806 may be integrated within distributed system 802. Computing device 806 may be implemented as one or more nodes in distributed system 802.
The computing device 806 may be implemented, for example, as software code instructions stored and executed by a processor of the distributed system 802, code instructions stored and executed by one or more nodes of the distributed system 802, hardware cards installed in the distributed system 802 and/or in one or more nodes 804, and/or as a stand-alone computing device connected locally or remotely to the distributed system 802 using a network or direct connection (e.g., wired, short-range wireless link).
The computing device 806 may be implemented as, for example, a computing cloud, a cloud network, a computer network, a virtual machine (e.g., hypervisor, virtual server), a single computing device (e.g., client terminal), a parallel-arranged computing device group, a server, a client terminal, a mobile device, a stationary device, a kiosk, a smartphone, a laptop, a tablet, a wearable computing device, an eyewear computing device, a watch computing device, and a desktop computer.
The computing device 806 includes one or more processors 808 and a data storage device 810 (e.g., memory) that stores code instructions that, when executed by the processors 808, implement the acts of the method described with reference to fig. 7.
The processor 808 may be implemented as, for example, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a custom circuit, a processor for connecting with other units, and/or a dedicated hardware accelerator. Processor 808 may be implemented as a single processor, a multi-core processor, and/or a cluster of processors arranged for parallel processing (which may include homogeneous and/or heterogeneous processor architectures).
The data storage device 810 may be implemented, for example, as a Random Access Memory (RAM), a read-only memory (ROM), and/or a storage device such as a non-volatile memory, a magnetic medium, a semiconductor memory device, a hard disk drive, a removable storage device, and an optical medium (e.g., DVD, CD-ROM).
The computing device 806 receives the graph from a graph repository 812, such as a remote server, computing cloud, remote storage device, and/or local storage. Alternatively, the computing device 806 computes the graph, for example, by creating the graph by tracking social connections of a social networking website.
The computing device 806 is associated with one or more user interfaces 814, the user interfaces 814 including mechanisms for a user to input data (e.g., specify graphs) and/or view the identified communities. Exemplary user interfaces 814 include one or more of: a display, a touch screen, a keyboard, a mouse, and voice-activated software operating with a speaker and a microphone.
Referring now back to FIG. 7, at 702, a marker graph (G1 = (V1, E1)) is created by initializing the vertices of an input graph (G = (V, E)) using community labels and by placing the vertices of the input graph (G) into respective communities. Optionally, each vertex is placed in a respective community.
The input graph (G) may be received by the computing device 806 from the graph repository 812. The input graph (G) may be obtained from memory on a storage device, automatically computed by code, and/or manually entered by a user using the user interface 814.
At 704, vertex movements (one or more movements, optionally all vertex movements) between communities are computed for each vertex (denoted v in V1) of the label graph (G1). A vertex movement between two communities corresponds to an edge (denoted e in E1) of the label graph (G1) connecting two vertices from the two communities. It should be noted that some vertices may not be associated with any movement, as described with reference to act 712.
The label graph (G1) is distributed across the nodes of the parallel and/or distributed computing system 802. The nodes (optionally each available node) 804 of the distributed system 802 compute the possible vertex movements for one or more different vertices of the label graph (G1).
The parallel and/or distributed execution identifies communities in very large graphs (approximately at least 10^6-10^8 vertices and about 10^9 or more edges) that may otherwise take an impractically long time to process using standard methods. Data may be stored locally at each node of the parallel and/or distributed computing system, which reduces the network traffic that would otherwise occur when one node accesses data stored in other nodes. In contrast, other common methods are designed to run sequentially, which results in lengthy processing times. Even if other methods are executed in parallel and/or distributed computing systems, since the algorithms are designed sequentially, only one core processes the graph at a time. Furthermore, if data is stored in multiple nodes in a distributed file system, accessing the other nodes by the single executing node to obtain the distributed data may result in excessive network activity.
The vertex movements are selected from among the possible movements computed between communities. Vertex movements may be selected based on local move modularity values calculated for individual vertices (e.g., each vertex).
At 706, a directed decoupling graph (D(G1)) of vertex communities is created from the selected vertex movements computed between communities. Each vertex of the directed decoupling graph (D(G1)) represents a respective community of the label graph (G1). The possible movements computed between communities are represented as edges of the directed decoupling graph (D(G1)).
At 708, the edges of the directed decoupling graph (D(G1)) are filtered to obtain a filtered directed decoupling graph (FD(G1)). Each community vertex of the filtered directed decoupling graph (FD(G1)) has only in-edges, whose target is the vertex of the respective community, or only out-edges, whose origin is the vertex of the respective community. During each iteration (e.g., as described with reference to step 712), each community is restricted to either sending vertices (thereby reducing in size) or receiving vertices (thereby increasing in size).
As used herein, the term "decoupling" refers to a process that separates vertex movements that do not affect the convergence of the directed decoupling graph (D(G1)) from other movements that may prevent convergence or significantly increase the convergence time and/or the computational resources required to compute the convergence. The decoupled movements can be performed in parallel while maintaining the ability of the graph to converge.
The edges are filtered so that the vertices of each community of the obtained graph are all associated with in-edges or all associated with out-edges, thereby ensuring that the graph eventually converges to a solution when executed in a distributed computing system. For example, as described herein, the naive parallelization approach can cause vertices to loop between communities, which can prevent eventual convergence to a solution.
Filtering the edges so that the vertices of each community of the resulting graph are all associated with in-edges or all associated with out-edges provides linear time complexity, allowing very large graphs to be processed in a reasonable amount of time, since the movements and updates of the vertices are performed in parallel by the distributed nodes, for example, relative to other methods of O(n^2) complexity.
Referring now to fig. 9, fig. 9 is a schematic diagram depicting a process of limiting community reduction or increase during each iteration of the method described with reference to fig. 7, according to some embodiments of the invention. Communities 902A and 902B are designated as senders and are constrained to reduce size by sending vertices. Communities 904A and 904B are designated as receivers and are constrained to increase in size by receiving vertices.
Referring now back to act 708 of FIG. 7, optionally, the directed decoupling graph (D(G1)) is computed and/or the edges of the directed decoupling graph (D(G1)) are filtered by a single node 804, which aggregates (optionally with barrier synchronization) the possible vertex movements computed for one or more different vertices of the label graph (G1) by the other nodes 804 of the distributed system 802.
The directed decoupling graph (D(G1)) is centrally processed when all data has been received from the other nodes, ensuring that the full graph is processed. Computing the directed decoupling graph (D(G1)) and filtering its edges may be performed by the single node using relatively little computing resources and/or in a reasonable amount of time, because the computation is relatively simple.
Network traffic between nodes is reduced relative to other sequential approaches. The network traffic associated with moving and storing the directed decoupling graph (D(G1)) is O(n), where n represents the number of vertices in the graph, since a vertex can have at most one associated movement. In contrast, the network traffic associated with moving and storing the entire graph in one iteration is O(m), where m represents the number of edges in the graph.
The vertices of the label graph (G1) are updated with community labels from the edges of the filtered directed decoupling graph (FD(G1)) to create an updated graph (G1'). The vertices of the label graph (G1) and/or the edges of the filtered directed decoupling graph (FD(G1)) may be distributed across the nodes 804 of the distributed system 802 for parallel and/or distributed updates.
Referring now to FIG. 10, FIG. 10 is a schematic diagram depicting a process for centrally filtering the edges of the directed decoupling graph (D(G1)) in a distributed system, according to some embodiments of the invention. The process described with reference to FIG. 10 represents the community optimization phase. The input graph 1002, upon receipt, is processed to create a label graph (G1) that is then distributed to the nodes 1004 of the distributed computing system. Each node 1004 may include one or more CPUs, each having one or more cores. The nodes 1004 compute vertex movements in parallel (one movement 1006 is shown for clarity), as described with reference to act 704 of FIG. 7. The computed vertex movements (e.g., 1006) are sent to a single node 1008 (which may include one or more CPUs and/or one or more cores) for centralized filtering, i.e., decoupling. Node 1008 optionally aggregates the movements computed by the distributed nodes based on barrier synchronization. The vertices of the label graph (G1) are assigned to the nodes 1004 of the distributed system and are updated in a distributed manner using community labels from the edges of the filtered directed decoupling graph (FD(G1)) (one update 1010 is shown for clarity). A compressed graph 1012 is created based on the updated vertices and output, providing the computed hierarchical communities.
Referring now back to FIG. 7, at 710, the compressed graph is created (for the current iteration described with reference to act 712) by merging the vertices of each community of the filtered directed decoupling graph (FD(G1)) into a single vertex. A plurality of edges between pairs of vertices of the compressed graph are merged into a single edge with an aggregated weight between the corresponding pair of vertices.
The compressed graph is created in parallel by a plurality of nodes 804 of the distributed system 802, each node 804 processing at least one distinct community, merging the vertices of each community of the filtered directed decoupling graph (FD(G1)) into a single vertex, and merging duplicate edges of the compressed graph into a single edge.
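The merging step can be sketched in Python. This is a minimal single-process sketch of the aggregation logic only; the adjacency representation ({vertex: {neighbour: weight}}) and names are illustrative assumptions, and the parallel distribution across nodes is omitted:

```python
from collections import defaultdict

def compress(adj, community):
    """Merge all vertices of a community into one vertex of a new graph.
    Duplicate inter-community edges are merged into a single edge whose
    weight is the sum of the merged weights; intra-community edges become
    self-loops of the merged vertex."""
    new_adj = defaultdict(lambda: defaultdict(float))
    for v, nbrs in adj.items():
        for u, w in nbrs.items():
            new_adj[community[v]][community[u]] += w
    return {c: dict(nbrs) for c, nbrs in new_adj.items()}
```

For example, a triangle {0, 1, 2} in community 'a' linked by one edge to vertex 3 in community 'b' compresses to two vertices: 'a' with a self-loop of weight 6 (each intra-community edge counted in both directions) and an 'a'-'b' edge of weight 1.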
At 712, blocks 702-710 are iterated while one or more vertex movements are possible. Starting with the label graph (G1 = (V1, E1)), the compressed graph is recursively computed in an iterative manner by performing blocks 702-710 to create a compressed graph for each iteration. The compressed graph serves as the label graph (G1) for the next iteration. The iterations may be terminated when the graph modularity value, representing the connectivity of the communities of the current iteration's compressed graph, is stable or decreases relative to the previously computed graph modularity value of the compressed graph.
The graph modularity value represents a global cost function indicating a comparison of the density of edges between vertices within each community of the compressed graph with the density of edges between vertices of different communities of the compressed graph.
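This global cost function can be sketched in Python using the standard Newman-Girvan modularity (a sketch under the assumption of an undirected weighted graph stored as {vertex: {neighbour: weight}}; names are illustrative):

```python
from collections import defaultdict

def modularity(adj, community):
    """Q = sum over communities c of [sigma_in(c)/2m - (sigma_tot(c)/2m)^2],
    comparing the edge density inside each community with the density
    expected from the vertex degrees alone."""
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())
    sigma_in = defaultdict(float)    # twice the weight of the links inside each community
    sigma_tot = defaultdict(float)   # total weight of the links incident to each community
    for v, nbrs in adj.items():
        sigma_tot[community[v]] += sum(nbrs.values())
        for u, w in nbrs.items():
            if community[v] == community[u]:
                sigma_in[community[v]] += w
    return sum(sigma_in[c] / two_m - (sigma_tot[c] / two_m) ** 2
               for c in sigma_tot)
```

As a sanity check, two disconnected dyads, each placed in its own community, give Q = 0.5, while putting all four vertices into one community gives Q = 0, so the stopping criterion would detect that the second partition is worse.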
Referring now to FIG. 15, FIG. 15 is a diagram that graphically depicts one iteration of the process of computing the compressed graph according to blocks 702-710 described with reference to FIG. 7, in accordance with some embodiments of the present invention. Numerals 1502-1512 correspond to blocks 702-712.
The received input graph 1501 is processed to detect communities as described herein. At 1502, a label graph (G1) is created, as described with reference to block 702 of FIG. 7. At 1504, possible vertex movements are computed for the label graph (G1), as described with reference to block 704 of FIG. 7. At 1506, a directed decoupling graph (D(G1)) is created based on the possible vertex movements, as described with reference to block 706 of FIG. 7. At 1508, a filtered directed decoupling graph (FD(G1)) is created by filtering the edges of the directed decoupling graph of 1506, as described herein. Each vertex of the filtered directed decoupling graph (FD(G1)) has only in-edges or only out-edges. At 1509, the applicable movements are determined from the filtered directed decoupling graph (FD(G1)). At 1510, the vertices of the label graph (G1) are updated with community labels from the edges of the filtered directed decoupling graph (FD(G1)) to create an updated graph (G1'), as described with reference to block 710 of FIG. 7. At 1512, the updated graph enters the next iteration as the label graph, as described with reference to block 712 of FIG. 7.
Referring now back to FIG. 7, at 714, the iterative loop is terminated when no possible vertex movement is available. The compressed graph of the last iteration is output, representing the computed hierarchical communities of the label graph.
The compressed graph representing the computed hierarchical communities may be presented on a display (e.g., user interface 814), stored on a local data storage device (e.g., 810), sent to a remote server, and/or used as input to another set of code instructions (e.g., for further processing of the computed hierarchical communities).
Another implementation of the method described with reference to FIG. 7 is now described. The implementation is based on a greedy variant of the method described with reference to FIG. 7. For clarity and ease of illustration, variations of the acts of the method described with reference to FIG. 7 are described. The greedy-algorithm-based computation is sufficiently accurate in practice, where vertex movements are selected according to the most significant change (e.g., the largest change) in the local move modularity value, while providing linear time complexity that allows very large graphs to be processed in a reasonable time.
At 704, a set of possible movements of the graph vertices (denoted P) is computed in parallel. The set P may comprise conflicting movements that are not independent of each other. When performed in parallel and/or simultaneously, the conflicting movements prevent graph convergence or require significant time and/or computing resources to compute the convergence.
A first vertex movement is selected from the calculated possible movements according to the local move modularity values. The selection may be performed by sorting the members of the set P according to a decreasing sequence of changes in the local move modularity values to obtain an ordered set (denoted S) and selecting the first member of S, which has the largest change in the local move modularity value.
Optionally, an empty sender set (denoted E), an empty receiver set (denoted A), and an empty decoupled-movement set (denoted D) are created.
At 708, a first sender community and a second receiver community are designated according to the selected first vertex movement. The sender community is the origin of the vertex of the first vertex movement, and the receiver community is the target of the vertex of the first vertex movement.
Optionally, the first sender community is placed in the set E and the second receiver community is placed in the set A.
The following is iterated over the remaining movements to filter the edges of the directed decoupling graph (D(G)): another vertex movement is selected from the calculated possible movements according to the local move modularity values; the other vertex movement is considered filtered out when the origin of its vertex is located in one of the receiver communities or its destination is located in one of the sender communities. The filtering may be performed by checking whether the other movement conflicts with the sets E and A. The check may determine whether the source community (i.e., sender) of the other movement is already present in the set A and/or whether the destination community (i.e., receiver) of the other movement is already present in the set E. When the selected second movement does not conflict with a previously filtered movement, another sender community and another receiver community are designated according to the selected second vertex movement. The designation may be performed by adding the source community of the second movement to the set E and the destination community of the second movement to the set A. The second vertex movement (i.e., the decoupled movement) is added to the set D. When the second vertex movement is detected to conflict with a previous vertex movement, the second vertex movement is discarded.
The iteration is performed over the remaining movements according to the decreasing sequence of changes in the local move modularity values, optionally according to the movements stored in the set S. The set D is filled during the iteration. The set D is distributed over the nodes 804 of the distributed system 802 for updating the vertices of the label graph (G1) using community labels obtained from the movements stored in the set D.
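The greedy filter over the sets E, A, and D described above can be sketched in Python (an illustrative sketch; the tuple layout (vertex, source community, destination community, gain) is an assumption):

```python
def decouple(moves):
    """Scan candidate moves in decreasing order of modularity gain and keep a
    move only if its source community has not already received a vertex and
    its destination community has not already sent one."""
    senders, receivers = set(), set()    # sets E and A in the text
    decoupled = []                       # set D in the text
    for move in sorted(moves, key=lambda m: -m[3]):
        vertex, src, dst, gain = move
        if src in receivers or dst in senders:
            continue                     # conflicting move -- discarded
        senders.add(src)
        receivers.add(dst)
        decoupled.append(move)
    return decoupled
```

The kept moves can then be applied in parallel: since every community in the result only sends or only receives vertices, no two applied moves interfere with each other.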
At 710, the compressed graph is created.
At 712, new local move modularity values are calculated for the compressed graph, and blocks 704-710 are iterated.
Referring now to FIG. 11, FIG. 11 is a schematic diagram depicting a process for calculating decoupled vertex movements based on the greedy variant of the method described with reference to FIG. 7, in accordance with some embodiments of the invention. At 1102, the set P 1104 of possible movements of the vertices of the graph is computed in parallel. At 1106, an ordered set S of potential vertex movements is created by sorting the members of the set P according to the decreasing order of change in the local move modularity values. At 1108, a set D storing the decoupled movements is created, as described herein. The set D 1108 is distributed among the nodes of the distributed system for performing the vertex updates, as described herein. Conflicting movements are discarded 1110, as described herein.
Referring now to FIG. 12, FIG. 12 is a diagram that describes the parallel processing for computing hierarchical communities in a graph based on the method described with reference to FIG. 7 and/or implemented by the system 800 described with reference to FIG. 8, according to some embodiments of the invention. The process described with reference to FIG. 12 represents the community optimization phase; the graph aggregation phase is not described. At 1202, the initial graph (G) is received, and a label graph (G1) is computed, as described herein. At 1204, the label graph (G1) is distributed over the nodes (denoted n) of the distributed computing system. At 1206, the possible vertex movements (denoted v) are computed by the nodes of the distributed computing system, as described herein. At 1208, the vertices of the label graph (G1) are updated by the nodes of the distributed computing system, as described herein. Diagram 1210 depicts the process of filtering the edges of the directed decoupling graph (D(G)) according to the local move modularity values. At 1212, when the maximum number of iterations has not been reached and/or the communities have not converged, blocks 1204-1208 are iterated. Alternatively, at 1214, the hierarchical communities are provided when the maximum number of iterations is reached and/or the communities have converged.
As described herein, the methods, code instructions and/or actions performed by the system may be expressed in sequential steps as follows, which further elucidate the parallel-based processing described herein:
1. Set the current iteration i to 0. The graph G is optionally distributed in a balanced manner among the nodes (optionally all nodes) of the distributed computing system.
2. The set P of possible vertex movements (optionally all movements) of the vertices (optionally all vertices) across communities in the graph G is obtained independently in parallel on the nodes (optionally all nodes).
3. If no movements were found in step 2, go to step 7; otherwise go to step 4.
4. A directed decoupling graph D is created from P, where the communities of G are the vertices of D and the vertex movements from P are the edges of D.
5. Edges are deleted from the graph D such that any vertex in D has only in-edges or only out-edges.
6. The remaining edges of D are converted to a set of decoupled vertex movements in G, and the set is distributed over the nodes (optionally all nodes) of the distributed system to update the community labels of the vertices in parallel. Go to step 2.
7. The graph G is compressed in parallel by merging the vertices of each community into a single vertex of a new graph G_p and by merging the duplicate edges of G_p into a single edge with an aggregated weight.
8. The initial modularity Q_i of the compressed graph is computed in parallel.
9. When Q_i > Q_{i-1}, or when i is 0, i is set to i+1 and G is set to G_p, and go to step 2; otherwise go to step 10.
10. The output graph represents the hierarchical community.
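Steps 4-6 above — creating the directed decoupling graph D and deleting edges so that every community is purely a sender (out-edges only) or purely a receiver (in-edges only) — can be sketched as follows. This is a minimal, single-threaded illustration, not the patent's distributed implementation; the function name and tuple layout are hypothetical, and edges are processed in the order given:

```python
def filter_decoupled(moves):
    """Filter the edges of the directed decoupling graph D.

    moves: iterable of (vertex, src_comm, dst_comm), one edge of D per
    proposed vertex move. An edge is kept only if it leaves every
    community either a pure sender (out-edges only) or a pure receiver
    (in-edges only), so the surviving moves are mutually decoupled.
    """
    senders, receivers = set(), set()
    kept = []
    for vertex, src, dst in moves:
        # Skip edges that would make a community both a sender and a receiver.
        if src == dst or src in receivers or dst in senders:
            continue
        senders.add(src)
        receivers.add(dst)
        kept.append((vertex, src, dst))
    return kept
```

Because no community in the filtered graph both gains and loses vertices, the surviving moves can be applied to the community labels in parallel without conflicting updates (step 6).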
As described herein, an implementation of the methods, code instructions, and/or actions performed by the system based on a greedy algorithm may be expressed as the following sequential steps. The greedy implementation approximates the maximum global modularity gain by computing the modularity deltas due to individual vertex moves. It further illustrates the parallel processing described herein.
1. The current iteration i is set to 0; the graph G is set to the input graph.
2. The possible moves (optionally all possible moves) of the vertices (optionally all vertices) are computed in parallel across the communities in graph G.
3. If no moves were found in step 2, go to step 8; otherwise, proceed to step 4.
4. The possible moves are sorted in decreasing order of the modularity gain of the move, denoted ΔQ.
5. The first move (with the largest ΔQ) is marked as decoupled, and its current and new communities are marked as sender and receiver, respectively.
6. Starting with the next move (i.e., the one with the second largest ΔQ), a move is marked as decoupled when its vertex does not leave a receiver community and does not enter a sender community of an already decoupled move. As in step 5, the current and new communities of the moved vertex are marked as sender and receiver, respectively.
7. When the marking of the moves in the set is complete, the decoupled moves are retained and all other moves are discarded.
8. The graph G is compressed in parallel by merging the vertices of each common community into a single vertex of a new graph Gp and by merging the repeated edges of Gp into a single edge with an aggregated weight.
9. The initial modularity Qi of the compressed graph is computed in parallel.
10. When Qi > Qi-1 or i = 0, i is set to i + 1, G is set to Gp, and the process returns to step 2; otherwise, go to step 11.
11. The output graph represents the hierarchical communities.
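The ΔQ-ordered selection of steps 4-7 can be sketched as a single-threaded Python illustration. The function name and tuple layout are hypothetical, and the modularity gains are taken as precomputed inputs; the patent's implementation computes them in parallel:

```python
def select_moves_greedy(candidates):
    """Greedy selection of decoupled moves.

    candidates: list of (delta_q, vertex, src_comm, dst_comm), where
    delta_q is the modularity gain of moving vertex from src to dst.
    """
    senders, receivers = set(), set()
    decoupled = []
    # Step 4: order the moves by decreasing modularity gain delta-Q.
    for delta_q, vertex, src, dst in sorted(candidates, key=lambda m: -m[0]):
        # Steps 5-6: keep a move only if its vertex neither leaves a
        # receiver community nor enters a sender community of an
        # already decoupled move.
        if src in receivers or dst in senders:
            continue
        senders.add(src)
        receivers.add(dst)
        decoupled.append((vertex, src, dst))
    # Step 7: all moves not marked as decoupled are discarded.
    return decoupled
```

The move with the largest ΔQ is always accepted, so the selection approximates the maximum global modularity gain while guaranteeing the sender/receiver decoupling property.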
Various implementations and aspects of the systems and/or methods and/or code instructions described above and claimed in the claims section below find experimental support in the following examples.
Examples of the invention
Reference is now made to the following examples, which together with the above descriptions, illustrate in a non-limiting manner some implementations of the systems and/or methods and/or code instructions described herein.
Referring now to FIG. 13, FIG. 13 is a table summarizing experimental results comparing the systems, apparatus, methods, and/or code instructions described herein, which perform parallel processing to identify communities in graphs, with the sequential Louvain method (described herein) and with a naive parallelization of the Louvain method (described herein), which was designed for sequential processing. The graphs were obtained from publicly available data sets and from custom-created data sets. The naive parallelization is implemented in Apache Spark. The sequential Louvain method is written in the Scala language.
The slower behavior marked in the table is due to the overhead of Apache Spark on very small tasks compared to the sequential JVM. A separate marker in the table indicates that the sequential Louvain method was unable to run or did not complete within 24 hours.
Referring now to FIG. 14, FIG. 14 is a table summarizing the computational performance of the systems, apparatus, methods, and/or code instructions described herein when identifying communities in big data graphs.
As described above, the decoupling action of the process that identifies communities in the graph is performed on a single node and/or a single thread, because this action cannot be parallelized efficiently. Such a sequential bottleneck could limit parallelization according to Amdahl's law and/or Gustafson's law. However, according to the results presented in FIGS. 13-14, the sequential time of the decoupling action is small compared to the total running time, e.g., less than 5% (e.g., 1:600 to 1:27 of the run). In practice, the overhead resulting from the process actions performed at a single node or thread may be considered tolerable or negligible.
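The Amdahl bound referenced above can be made concrete: if a fraction s of the work stays sequential, the speedup on n workers is at most 1/(s + (1 - s)/n). A small illustrative calculation (the helper name is hypothetical; s = 0.05 corresponds to the "less than 5%" figure above):

```python
def amdahl_speedup(sequential_fraction, workers):
    """Upper bound on speedup under Amdahl's law: 1 / (s + (1 - s) / n)."""
    s = sequential_fraction
    return 1.0 / (s + (1.0 - s) / workers)
```

With s = 0.05, 16 workers reach at most about 9.1x, and even an unbounded number of workers caps the speedup at 1/s = 20x — consistent with the observation that a sub-5% sequential decoupling step is a tolerable bottleneck.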
Other systems, methods, features and advantages of the invention will be, or will become, apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the accompanying claims.
The description of the various embodiments of the present invention is intended to be illustrative and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant distributed systems will be developed, and the scope of the term distributed system is intended to include all such new technologies a priori.
The term "about" as used herein means ± 10%.
The terms "including," "comprising," "having," and variations thereof mean "including, but not limited to." These terms encompass the terms "consisting of" and "consisting essentially of."
The phrase "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, provided that the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. For example, the term "compound" or "at least one compound" may comprise a plurality of compounds, including mixtures thereof.
The word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any "exemplary" embodiment is not necessarily to be construed as preferred or advantageous over other embodiments and/or as excluding the incorporation of features from other embodiments.
The word "optionally" is used herein to mean "provided in some embodiments and not provided in other embodiments". Any particular embodiment of the invention may incorporate a plurality of "optional" features, unless these features contradict each other.
Throughout this application, various embodiments of the present invention may be presented in a range format. It is to be understood that the description in a range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the present invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, the description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6, etc., as well as individual numbers within that range, such as 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims (10)

1. An apparatus (806) for determining communities using vertices V and edges E in a graph G = (V, E) (812), the apparatus (806) being configured to:
first computing a compressed graph in an iterative loop using a label graph (G1 = (V1, E1)), wherein the graph G = (V, E) serves as the label graph for the first iteration of the iterative loop and the compressed graph serves as the label graph for the next iteration of the iterative loop, until a graph modularity value representing the connectivity of the communities of the compressed graph of the current iteration is stable or reversed relative to the graph modularity value of a previously computed compressed graph,
wherein the apparatus is configured to recursively compute the compression map for each iteration by:
obtaining, for each vertex v in V1 of the label graph (G1), all vertex moves between communities, where a vertex move between any two communities is an edge e in E1 of the label graph (G1) connecting two vertices from the two communities,
stopping the iterative loop when there is no vertex movement,
creating, when there is at least one possible vertex move, a directed decoupling graph (D(G1)) of the communities of vertices from the obtained vertex moves between communities, wherein each vertex of the directed decoupling graph (D(G1)) represents a respective community of the label graph and the obtained vertex moves between communities are represented as edges of the directed decoupling graph (D(G1)),
filtering the edges of the directed decoupling graph (D(G1)) to obtain a filtered directed decoupling graph (FD(G1)), wherein each vertex of the filtered directed decoupling graph (FD(G1)) has only in-edges or only out-edges,
and updating the vertices of the label graph (G1) using community labels from the edges of the filtered directed decoupling graph (FD(G1)) to create an updated graph (G1');
creating a compressed graph for the current iteration by merging the vertices of each community of the updated graph (G1') into a single vertex and merging multiple edges between pairs of vertices of the compressed graph into a single edge with an aggregated weight between the corresponding pair of vertices;
and outputting the compressed graph of the last iteration, wherein the compressed graph of the last iteration represents the computed hierarchical communities of the label graph of the first iteration.
2. The apparatus (806) of claim 1, further configured to obtain the label graph (G1) by placing each vertex in a respective community and initializing the vertices of the input graph (G) with community labels.
3. The apparatus (806) of any one of the preceding claims, further configured to distribute the label graph (G1) over nodes (804) of a distributed computing system (802), and wherein to obtain possible vertex movements, each node (804) calculates possible vertex movements for at least one different vertex of the label graph (G1).
4. The apparatus (806) of claim 1, comprising a plurality of processors (808) configured to compute the compressed graph in parallel, each processor performing, for at least one different community, the merging of the vertices of each community of the filtered directed decoupling graph (FD(G1)) into a single vertex and the merging of the repeated edges of the compressed graph into a single edge.
5. The apparatus (806) of claim 1, wherein the graph modularity represents a global cost function indicating a comparison of a density of edges between vertices within each community of the compression map and a density of edges between vertices of different communities of the compression map.
6. The apparatus (806) of any one of claims 4 to 5, wherein one of the processors (808) is further configured to select vertex moves from the computed possible moves between communities according to local move modularity values computed for each vertex move, each value representing the change in the density of edges between vertices within the respective communities of the vertex move, wherein the directed decoupling graph (D(G1)) of the communities of vertices is created according to the selected vertex moves.
7. The apparatus (806) of claim 1, wherein the apparatus is configured to create and filter the directed decoupling graph (D (G1)) by:
selecting a first vertex move from the computed possible moves according to the local move modularity value of the vertex move;
designating a first sender community and a second receiver community in accordance with the selected first vertex move, wherein the sender community represents the origin of the vertex of the first vertex move and the receiver community represents the destination of the vertex of the first vertex move;
and iteratively processing the remaining moves by:
selecting another vertex move from the computed possible moves according to the local move modularity value representing the most significant change in the density of edges among the remaining possible moves,
filtering out the other vertex move when the origin of the vertex of the other vertex move is located in one of the receiver communities or the destination is located in one of the sender communities,
and designating another sender community and another receiver community according to the selected other vertex move.
8. A system (800) for determining communities using vertices V and edges E in a graph G = (V, E), characterized in that the system comprises:
the device (806) of any of claims 1 to 7;
a distributed system (802) comprising a plurality of nodes (804);
a graph library (812);
a user interface (814);
wherein the apparatus (806) is configured to:
receiving a graph G = (V, E) from the graph library (812);
distributing the graph G over the nodes (804) of the distributed system (802); receiving possible vertex moves of the graph G from the distributed system (802), the moves for each vertex v of V being computed in parallel by the nodes (804) of the distributed system (802); and determining communities in the graph G based on the received vertex moves.
9. A method for determining communities using vertices V and edges E in a graph G = (V, E), comprising:
first computing a compressed graph in an iterative loop using a label graph (G1 = (V1, E1)), wherein the graph G = (V, E) serves as the label graph for the first iteration of the iterative loop and the compressed graph serves as the label graph for the next iteration of the iterative loop, until a graph modularity value representing the connectivity of the communities of the compressed graph of the current iteration is stable or reversed relative to the graph modularity value of a previously computed compressed graph, wherein the compressed graph for each iteration is recursively computed by: computing, for each vertex v in V1 of the label graph (G1), all vertex moves between communities, where a vertex move between any two communities is an edge e in E1 of the label graph (G1) connecting two vertices from the two communities (712);
stopping the iterative loop when there is no vertex movement,
creating, when there is at least one possible vertex move, a directed decoupling graph (D(G1)) of the communities of vertices from the computed moves between communities, wherein each vertex of the directed decoupling graph (D(G1)) represents a respective community of the label graph and the computed moves between communities are represented as edges of the directed decoupling graph (D(G1)) (706),
filtering the edges of the directed decoupling graph (D(G1)) to obtain a filtered directed decoupling graph (FD(G1)), wherein each vertex of the filtered directed decoupling graph (FD(G1)) has only in-edges or only out-edges (708),
and updating the vertices of the label graph (G1) using community labels from the edges of the filtered directed decoupling graph (FD(G1)) to create an updated graph (G1');
creating a compressed graph for the current iteration by merging the vertices of each community of the updated graph (G1') into a single vertex and merging multiple edges between pairs of vertices of the compressed graph into a single edge with an aggregated weight between the corresponding pair of vertices (710);
and outputting the compressed graph of the last iteration, the compressed graph of the last iteration representing the computed hierarchical communities of the label graph of the first iteration (714).
10. A computer-readable storage medium comprising instructions, which, when executed by a computer, cause the computer to perform the steps of the method according to claim 9.
CN201780050053.6A 2017-05-29 2017-05-29 System and method for hierarchical community detection in graphics Active CN110325984B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2017/000365 WO2018222064A1 (en) 2017-05-29 2017-05-29 Systems and methods of hierarchical community detection in graphs

Publications (2)

Publication Number Publication Date
CN110325984A CN110325984A (en) 2019-10-11
CN110325984B true CN110325984B (en) 2021-12-03

Family

ID=59485405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780050053.6A Active CN110325984B (en) 2017-05-29 2017-05-29 System and method for hierarchical community detection in graphics

Country Status (2)

Country Link
CN (1) CN110325984B (en)
WO (1) WO2018222064A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177474B (en) * 2019-06-27 2022-12-02 腾讯科技(深圳)有限公司 Graph data processing method and related device
CN112714080B (en) * 2020-12-23 2023-10-17 上海观安信息技术股份有限公司 Interconnection relation classification method and system based on spark graph algorithm

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8793283B1 (en) * 2011-04-20 2014-07-29 Google Inc. Label propagation in a distributed system
CN105069039A (en) * 2015-07-22 2015-11-18 山东大学 Overlapping community parallel discovery method of memory iteration on basis of spark platform
CN105279187A (en) * 2014-07-15 2016-01-27 天津科技大学 Edge clustering coefficient-based social network group division method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8090665B2 (en) * 2008-09-24 2012-01-03 Nec Laboratories America, Inc. Finding communities and their evolutions in dynamic social network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8793283B1 (en) * 2011-04-20 2014-07-29 Google Inc. Label propagation in a distributed system
CN105279187A (en) * 2014-07-15 2016-01-27 天津科技大学 Edge clustering coefficient-based social network group division method
CN105069039A (en) * 2015-07-22 2015-11-18 山东大学 Overlapping community parallel discovery method of memory iteration on basis of spark platform

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhao Yang, "A Comparative Analysis of Community Detection", Scientific Reports, 2016-08-02, full text *
Aaron Clauset, "Finding community structure in very large networks", Physical Review, 2004-08-30, full text *

Also Published As

Publication number Publication date
WO2018222064A1 (en) 2018-12-06
CN110325984A (en) 2019-10-11

Similar Documents

Publication Publication Date Title
US10831759B2 (en) Efficient determination of join paths via cardinality estimation
US9053067B2 (en) Distributed data scalable adaptive map-reduce framework
US20200175370A1 (en) Decentralized distributed deep learning
US10997176B2 (en) Massive time series correlation similarity computation
US20170091246A1 (en) Distributed graph database
US8990209B2 (en) Distributed scalable clustering and community detection
Moon et al. Parallel community detection on large graphs with MapReduce and GraphChi
US20150186427A1 (en) Method and system of analyzing dynamic graphs
US11574254B2 (en) Adaptive asynchronous federated learning
KR20110049644A (en) Structured grids and graph traversal for image processing
Puri et al. MapReduce algorithms for GIS polygonal overlay processing
US11847533B2 (en) Hybrid quantum computing network
KR102108342B1 (en) A graph upscaling method for preserving graph properties and operating method thereof
Meyerhenke et al. Drawing large graphs by multilevel maxent-stress optimization
CN110325984B (en) System and method for hierarchical community detection in graphics
CN115701613A (en) Multiple for neural network resolution hash encoding
CN106575296B (en) Dynamic N-dimensional cube for hosted analytics
Gupta et al. Map-based graph analysis on MapReduce
US11042530B2 (en) Data processing with nullable schema information
CN116736624A (en) Parallel mask rule checking for evolving mask shapes in an optical proximity correction stream
Shivashankar et al. Efficient software for programmable visual analysis using Morse-Smale complexes
US11164348B1 (en) Systems and methods for general-purpose temporal graph computing
US10437809B1 (en) Projection-based updates
US10922312B2 (en) Optimization of data processing job execution using hash trees
US20200065430A1 (en) Data driven shrinkage compensation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant