CN110598055A

CN110598055A - Parallel graph summarization method based on attribute graph

Info

Publication number: CN110598055A
Application number: CN201910783949.6A
Authority: CN
Inventors: 马应龙; 张鹏
Original assignee: North China Electric Power University
Current assignee: North China Electric Power University
Priority date: 2019-08-23
Filing date: 2019-08-23
Publication date: 2019-12-20

Abstract

The invention belongs to the technical field of computer graph summarization, and particularly relates to a parallel graph summarization method based on an attribute graph, which comprises the following steps: step 1: preprocessing the acquired graph data, and processing each node in the graph into a node structure containing its own information and the information of all its direct neighbors; step 2: randomly selecting a direct neighbor node of the current node, and then selecting, from all the direct neighbor nodes of that neighbor node, the node that has the same attribute as the current node and the maximum similarity to it as the candidate node to be merged with the current node; step 3: judging whether the error introduced by merging the current node and the candidate node exceeds an error threshold; if so, returning to step 2 to continue searching for other candidate nodes, and if not, merging the two nodes; step 4: performing the two-node merge by updating all node information in the node structure, repeating steps 3-4 until the number of remaining nodes is less than a set threshold, then saving the final node structure and exporting the summary graph.

Description

Parallel graph summarization method based on attribute graph

Technical Field

The invention belongs to the technical field of computer graph summarization, and particularly relates to a parallel graph summarization method based on an attribute graph.

Background

Graphs have strong inherent advantages and are widely used to model real-world objects and the relationships among them. Large-scale graph data is common in many application areas. In a graph, entities are modeled as vertices, and their relationships or connections are represented by edges. Modern applications generate large amounts of graph data, and because graph data implicitly encodes a large amount of relationship information, latent knowledge can be mined from it to better serve users; many researchers have therefore studied the processing and computation of graph data intensively. However, as the number of application users continues to grow, graphs become larger and structurally more complex, and analyzing and processing large graphs with millions or even billions of nodes and edges becomes a huge challenge. Because of the sheer volume and complexity of graph data, conventional graph analysis tools cannot complete mining and analysis in a limited time. Therefore, for both tools and algorithms, it is vital to reduce the size and complexity of graph data by summarizing large-scale graphs into compact, information-rich, highly abstract representations of the original graphs, so that large-scale graph data can be easily stored, managed, analyzed, and processed. Among graph computation techniques, graph summarization is one promising approach to these problems.

In graph summarization research, different research groups focus on different subjects and often extract graph features from different angles, which has produced many graph summarization algorithms. Most existing graph summarization algorithms use statistical methods to study and extract graph characteristics and mainly concern the topological structure of the graph, such as node degree distribution, frequent subgraph mining, and community detection. However, the summary generated by such algorithms is often a series of graphs that are merely frequently occurring or densely structured subgraphs of the original graph; the summary graph approximates the whole graph by its main structures. Although such summaries contain much of the main information of the original graph and can be analyzed and processed in its place, they often ignore other information in the graph, so that the structural information of the whole graph loses intuitiveness, which may bias or even invalidate the analysis results. Most algorithms consider only the topological structure of the graph and ignore node attribute and relationship information. However, most real-world network graphs are attribute graphs whose nodes and edges carry various attributes and relationships, so considering only the topology does not meet practical requirements. In addition, most existing methods perform graph summarization in a single-machine environment. With the rapid increase of internet users, the scale of a graph often exceeds the computing and storage capacity of a single computer; when nodes and edges number in the millions or billions, these algorithms cannot process such large-scale graphs and scale poorly.
Centralized graph summarization algorithms based on a single-machine environment are no longer suitable for current processing needs, and the research and implementation of parallel graph summarization algorithms for distributed environments play a crucial role in analyzing and processing future large-scale graph data.

Previous research on node pairing uses two node selection strategies: the greedy method and the random method. The greedy method selects, at each step, the optimal pair of 2-hop neighbor nodes in the whole graph for merging; although this yields the minimum summary error, it incurs a large amount of computation and network communication. The random method randomly selects a pair of 2-hop neighbor nodes as the candidate pair each time; although this greatly reduces the computation in the node pair selection stage, the selected pair is very likely to fail the error threshold for node merging, causing unnecessary computation in the subsequent stage.

Disclosure of Invention

To address the above technical problems, the invention provides a parallel graph summarization method based on an attribute graph, which comprises the following steps:

step 1: preprocessing the acquired graph data, and processing each node in the graph into a node structure containing its own information and the information of all its direct neighbors;

step 2: randomly selecting a direct neighbor node of the current node, and then selecting, from all the direct neighbor nodes of that neighbor node, the node that has the same attribute as the current node and the maximum similarity to it as the candidate node to be merged with the current node;

step 3: judging whether the error introduced by merging the current node and the candidate node exceeds an error threshold; if so, returning to step 2 to continue searching for other candidate nodes, and if not, merging the two nodes;

step 4: performing the two-node merge by updating all node information in the node structure, repeating steps 3-4 until the number of remaining nodes is less than a set threshold, then saving the final node structure and exporting the summary graph.
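As an illustration of step 1, the node structure might be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation of the invention: the `NodeRecord` class, its field names, and the `build_node_structure` helper are hypothetical, and an undirected graph with one relationship type per edge is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRecord:
    """Hypothetical per-node structure: own information plus direct neighbors."""
    node_id: int
    attribute: str
    owner_id: int          # equals node_id while the node is still alive
    size: int = 1          # number of original nodes represented
    self_edges: int = 0    # number of self-connections
    neighbors: dict = field(default_factory=dict)  # neighbor ID -> relationship type


def build_node_structure(attributes, edges):
    """attributes: {node_id: attribute}; edges: iterable of (u, v, relationship)."""
    nodes = {n: NodeRecord(n, a, owner_id=n) for n, a in attributes.items()}
    for u, v, rel in edges:
        if u == v:
            nodes[u].self_edges += 1
            continue
        # Undirected edge: each endpoint records the other as a direct neighbor.
        nodes[u].neighbors[v] = rel
        nodes[v].neighbors[u] = rel
    return nodes


nodes = build_node_structure(
    {1: "person", 2: "person", 3: "company"},
    [(1, 3, "works_at"), (2, 3, "works_at")],
)
```

Keeping the direct-neighbor list inside each record means a node can reach its 2-hop neighbors by reading only its neighbors' records, which suits a partitioned, parallel setting.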

The error threshold is dynamically adjusted by a simulated annealing heuristic algorithm, and increases continuously as the number of remaining nodes decreases.

The two-node merge discards the node with the larger ID and retains the ID of the node with the smaller ID.

Each node in the summary graph, called a supernode, corresponds to one part of a partition of the nodes of the original graph, and each edge, called a superedge, corresponds to a connection between two related parts; a superedge exists between two supernodes if and only if there is at least one edge between some pair of nodes, one from each of the two supernodes.

The introduced error is defined as:

$$\Delta^{E}_{i,j} = \alpha\,\Delta^{T}_{i,j} + (1-\alpha)\,\Delta^{R}_{i,j}$$

where $\Delta^{E}_{i,j}$ denotes the error introduced by merging supernodes $v_i$ and $v_j$ into $v_m$, $\alpha$ is an adjustable parameter, $\Delta^{T}_{i,j}$ represents the topology error, and $\Delta^{R}_{i,j}$ represents the relationship error:

$v_p$ is a supernode, $V_s$ is the supernode set, and the four error terms are the pairwise merge errors of supernodes $v_m$ and $v_p$, $v_i$ and $v_p$, $v_j$ and $v_p$, and $v_i$ and $v_p$, respectively; $r_{i,p}$ and $r_{j,p}$ denote the relationships between supernodes $v_i$ and $v_p$ and between supernodes $v_j$ and $v_p$, respectively.

The invention has the beneficial effects that:

The invention is oriented to attribute graphs and fully considers both the topological structure of the input graph and the attributes and relationships of its nodes during graph summarization. While allowing the user to control the summary resolution, it adopts a bottom-up node aggregation technique to generate summary graphs at different levels of abstraction.

The invention defines the concept of summary error and provides a specific method for computing the node aggregation error, so as to quantitatively evaluate the increase in error caused by merging nodes during graph summarization. A heuristic algorithm defines the threshold on the error introduced by each node pair merge, and the error threshold increases dynamically as the summarization process advances. This not only yields a smaller overall summary error but also raises the probability that a node pair merges successfully and avoids excessive invalid candidate pair selection, thereby improving the efficiency of graph summarization.

The summary graph generated by the invention is a suboptimal solution close to the optimal one. Extensive experiments were carried out on the Spark platform on attribute graphs with various node attributes and relationships, and the effectiveness and efficiency of the algorithm were evaluated by analyzing the summary error during summarization and the running time of the program. The experimental results show that, compared with traditional graph summarization algorithms, the proposed parallel graph summarization algorithm is highly feasible and efficient.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a graph of the summary error in the embodiment.

FIG. 3 is a time chart of the graph summarization process in the embodiment.

FIG. 4 is a diagram of extensibility analysis in an embodiment.

Detailed Description

The invention provides a parallel graph summarization method based on an attribute graph, which comprises the following three stages:

The first stage performs the candidate node pair selection task. Selecting the node pair to be merged is the key to the graph summarization algorithm in each iteration. Because the summarization targets attribute graphs, the two nodes to be merged must have the same attribute type and share as many neighborhood relationships as possible, which depends mainly on their direct neighbors. Because two nodes with a common direct neighbor are very likely to be similar, during candidate pair selection each current node selects, among its 2-hop neighbors with the same attribute, the node with the highest similarity as its merging partner. Previous studies use two node selection strategies: the greedy method and the random method. The greedy method selects, at each step, the optimal pair of 2-hop neighbor nodes in the whole graph for merging; although this yields the minimum summary error, it incurs a large amount of computation and network communication. The random method randomly selects a pair of 2-hop neighbor nodes as the candidate pair each time; although this greatly reduces the computation in the node pair selection stage, the selected pair is very likely to fail the error threshold for node merging, causing unnecessary computation in the subsequent stage. The candidate pair selection in this algorithm combines the two strategies: the current node randomly selects a direct neighbor node, and then selects, from all the direct neighbors of that neighbor, the node with the same attribute as the current node and the maximum similarity to it as its partner. This stage only needs to find a pair of nodes with the same attribute and similar relationship neighborhoods; it does not need to verify whether the pair can actually be merged.
Because the node pairs found are only likely, not certain, to be mergeable, the merge does not take place until the next stage makes the final decision. In addition, the program uses a variable to track the number of nodes remaining in the data graph during summarization. When the summary size, i.e., the number of nodes in the graph, is less than or equal to the user-defined number of nodes k, this stage stops looking for candidate pairs and sends a message telling the other stages that the summarization process is complete; the other stages then stop, and finally the whole process terminates.
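The hybrid selection strategy described above can be sketched as follows. This is an illustrative Python sketch only: the text does not specify the similarity measure, so Jaccard similarity over direct-neighbor sets is assumed here, and the names `jaccard` and `select_candidate` are hypothetical.

```python
import random


def jaccard(a, b):
    """Assumed similarity measure: overlap of two direct-neighbor sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_candidate(current, adjacency, attribute, rng=random):
    """Pick one random direct neighbor, then the most similar
    same-attribute node among that neighbor's direct neighbors."""
    neighbors = adjacency[current]
    if not neighbors:
        return None
    via = rng.choice(sorted(neighbors))  # random step
    best, best_sim = None, -1.0
    for cand in sorted(adjacency[via]):  # greedy step, limited to one neighborhood
        if cand == current or attribute[cand] != attribute[current]:
            continue
        sim = jaccard(adjacency[current], adjacency[cand])
        if sim > best_sim:
            best, best_sim = cand, sim
    return best  # None when no same-attribute 2-hop neighbor exists


adjacency = {1: {3}, 2: {3}, 3: {1, 2, 4}, 4: {3}}
attribute = {1: "person", 2: "person", 3: "company", 4: "city"}
partner = select_candidate(1, adjacency, attribute)
```

The random step keeps the search local to one neighborhood, while the greedy step inside that neighborhood raises the chance that the pair will pass the error threshold in the next stage.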

The second stage performs the candidate node pair merging task. After a pair of candidate nodes is found, the node pair information is sent to this stage, which determines whether the pair can be merged by comparing the error introduced by merging it with the error threshold. A metric Δ is defined to evaluate the similarity between the pair of nodes in the merging process; Δ accumulates the differences between the two nodes' relationships with their neighboring nodes, and a smaller value indicates that the two nodes are more similar, i.e., easier to merge. To obtain a lower total merging error, an error threshold ET limits the error of each merge and thereby optimizes the graph summarization algorithm. A pair of nodes is merged only when the error introduced by merging it is smaller than ET; otherwise the pair is not merged. In the early iterations of graph summarization, a pair of nodes satisfying the merging error condition is easy to find, which means that a node has a high probability of merging with other nodes at the beginning. As summarization progresses, the overall level of error introduced by a single node pair merge increases. If the error threshold stayed the same, it would become difficult to find a pair of nodes that can actually be merged. Therefore, a heuristic algorithm, simulated annealing (SA), is adopted to dynamically adjust the error threshold, which not only yields a smaller overall summary error but also raises the probability that a node pair merges successfully and avoids excessive invalid candidate pair selection, thereby improving the efficiency of graph summarization.
In the algorithm, the error threshold is set to a small initial value at the start of the program and then increases continuously as summarization advances. For each merge, the node with the smaller ID is always retained as the merged node instead of creating a new node. Specifically, the node with the larger ID sets its ownerID to the ID of the node with the smaller ID, which marks it as dead, and no subsequent operation is applied to it.
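The merge decision can be illustrated with a small sketch. The text does not give the simulated-annealing schedule, so the linear ramp below is only a placeholder for a threshold that starts small and grows as nodes are merged; the function names and the `growth` parameter are hypothetical.

```python
def error_threshold(initial, total_nodes, remaining_nodes, growth=4.0):
    """Placeholder schedule (an assumption, not the patent's rule):
    ET grows linearly with the fraction of nodes already merged away."""
    progress = 1.0 - remaining_nodes / total_nodes
    return initial * (1.0 + growth * progress)


def should_merge(introduced_error, threshold):
    """A candidate pair is merged only if its error is below ET."""
    return introduced_error < threshold


et_start = error_threshold(2.0, total_nodes=100, remaining_nodes=100)
et_half = error_threshold(2.0, total_nodes=100, remaining_nodes=50)
```

With these inputs a merge that introduces an error of 5.0 is rejected at the start of the run but accepted once half of the nodes have been merged, which is the behavior the dynamic threshold is meant to provide.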

The third stage performs the node structure update task. If a pair of candidate nodes satisfies all the merging conditions, including having the same attribute and meeting the error threshold, the actual merge is performed by updating all node information in the node structure: the node with the smaller ID (i.e., $v_i$) updates its size and its number of self-connections, and the node with the larger ID (i.e., $v_j$) sets its ownerID to the ID of the smaller node. In addition, merging a node pair affects all neighbors of the pair, which should update their neighbor lists with the new neighbor ID, size, and connections. In particular, common neighbors of $v_i$ and $v_j$ should remove $v_j$ from their neighbor lists and update $v_i$ with its new node information; neighbors of $v_i$ only should update the information of $v_i$ in their neighbor lists; and neighbors of $v_j$ only should replace $v_j$ in their neighbor lists with the new $v_i$ node information. After all affected information is updated, the program goes on to find another candidate node pair and repeats the loop above.
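The update rules of this stage can be sketched in Python as follows. This is a simplified, single-process illustration under stated assumptions: each node is a plain dictionary with hypothetical keys `owner`, `size`, `self_edges`, and `neighbors`, and each neighbor list stores the number of original edges to that neighbor (sizes are looked up in the node table instead of being copied into neighbor lists).

```python
def merge_nodes(nodes, a, b):
    """Merge two live nodes in place and return the surviving (smaller) ID."""
    keep, drop = min(a, b), max(a, b)
    kept, dropped = nodes[keep], nodes[drop]
    # Edges between the pair become self-connections of the survivor.
    between = kept["neighbors"].pop(drop, 0)
    dropped["neighbors"].pop(keep, None)
    kept["size"] += dropped["size"]
    kept["self_edges"] += dropped["self_edges"] + between
    dropped["owner"] = keep  # the larger-ID node is now dead
    for nbr, count in dropped["neighbors"].items():
        # The survivor inherits the dropped node's connections.
        kept["neighbors"][nbr] = kept["neighbors"].get(nbr, 0) + count
        # The neighbor replaces the dropped ID with the survivor's ID.
        nbr_list = nodes[nbr]["neighbors"]
        nbr_list.pop(drop, None)
        nbr_list[keep] = kept["neighbors"][nbr]
    dropped["neighbors"] = {}
    return keep


nodes = {
    1: {"owner": 1, "size": 1, "self_edges": 0, "neighbors": {2: 1, 3: 1}},
    2: {"owner": 2, "size": 1, "self_edges": 0, "neighbors": {1: 1, 3: 1, 4: 1}},
    3: {"owner": 3, "size": 1, "self_edges": 0, "neighbors": {1: 1, 2: 1}},
    4: {"owner": 4, "size": 1, "self_edges": 0, "neighbors": {2: 1}},
}
survivor = merge_nodes(nodes, 2, 1)
```

In this example node 3 is a common neighbor of the pair and node 4 is a neighbor of the dropped node only, so both cases of the neighbor update are exercised.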

At the end of the procedure, the resulting node structure is saved for exporting the summary graph. A node whose node ID equals its ownerID is a node present in the resulting summary graph. In essence, the ID of each node together with its ownerID forms a membership mapping from the nodes of the original graph to the supernodes of the summary graph into which they were merged. The algorithm assumes that summarization terminates when the resulting summary graph has k nodes. However, since multiple merges occur in each iteration, the number of nodes remaining in the resulting summary graph is in most cases less than k. In most practical cases, the user can accept a summary graph whose size is approximately k. Thus, the program terminates once the number of supernodes no longer exceeds k.
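Recovering the supernodes and their members from the saved ID and ownerID pairs might look like the sketch below. It assumes, as a guess not stated in the text, that ownerID values can form chains when a surviving node is itself merged later, so owners are followed until a live node is reached; the function names are hypothetical.

```python
def resolve_owner(owner, node):
    """Follow ownerID links until a live node (ID equal to ownerID) is reached."""
    while owner[node] != node:
        node = owner[node]
    return node


def export_membership(owner):
    """Map each supernode ID to the original node IDs it represents."""
    members = {}
    for node in owner:
        members.setdefault(resolve_owner(owner, node), []).append(node)
    return members


# Node 5 was merged into 4, and 4 was later merged into 3.
owner = {1: 1, 2: 1, 3: 3, 4: 3, 5: 4}
membership = export_membership(owner)
```

The keys of `membership` are exactly the nodes whose ID equals their ownerID, that is, the supernodes of the exported summary graph.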

Consider an attribute graph whose nodes have any number of attributes and are connected by edges of various relationship types. More precisely, the attribute graph is represented as $G = (V, E, A, R)$, where $V$ is the set of nodes and $E$ is the set of edges. The node attribute set of the graph is denoted $A = \{a_1, a_2, \ldots, a_m\}$, and every node $v \in V$ has one attribute type in $A$. The node relationship set of the graph is denoted $R = \{r_1, r_2, \ldots, r_n\}$, and every edge $(u, v) \in E$ has one relationship type in $R$.

The summary graph is defined as follows: given an input attribute graph $G = (V, E, A, R)$ and a node partition $P$ of $G$, the summary graph based on the partition $P$ is represented as $S(G) = (V_s, E_s, A, R)$, where $V_s = P$. More intuitively, each node in the summary graph, called a supernode, corresponds to one part of the partition of the original graph's nodes, and each edge, called a superedge, corresponds to a connection between two related parts. A superedge exists between two supernodes if and only if there is at least one edge between some pair of nodes, one from each of the two supernodes. Any node $v_i \in V_s$ is a supernode and represents a subset of the nodes in the original graph; any edge $(v_i, v_j) \in E_s$ is a superedge and indicates that every node in $v_i$ is connected to every node in $v_j$. In other words, for any $u \in v_i$ and $v \in v_j$, $(u, v) \in (v_i, v_j)$ whether or not the edge $(u, v)$ is present in the original graph. The type of the edge relationship between partitions $v_i$ and $v_j$ is defined as $r(v_i, v_j)$, abbreviated as $r_{i,j}$.
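To make the definition concrete, the sketch below derives the superedge set from a partition and an edge list. It is an illustration only: `build_summary` is a hypothetical name, each part is identified by its smallest member ID, relationship types are omitted, and edges inside a single part are not turned into superedges.

```python
def build_summary(partition, edges):
    """partition: {supernode_id: set of original nodes}; edges: iterable of (u, v).
    Returns the set of superedges as sorted ID pairs."""
    owner = {n: sid for sid, members in partition.items() for n in members}
    superedges = set()
    for u, v in edges:
        su, sv = owner[u], owner[v]
        if su != sv:
            # At least one original edge between two parts creates a superedge.
            superedges.add((min(su, sv), max(su, sv)))
    return superedges


partition = {1: {1, 2}, 3: {3}, 4: {4, 5}}
edges = [(1, 3), (2, 3), (1, 2), (4, 5)]
superedges = build_summary(partition, edges)
```

Here the two original edges (1, 3) and (2, 3) collapse into a single superedge, while the edges (1, 2) and (4, 5) fall inside supernodes and produce none.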

Because of node merging, a superedge between two supernodes may add edges that do not exist in the original graph, or the absence of a superedge may delete edges that do, which causes an error between the information contained in the summary graph and that in the original graph. Let $\Pi_{i,j}$ denote the set of all possible edges between the nodes of supernodes $v_i$ and $v_j$, i.e., the edges that would exist if the two node sets were fully connected, and let $A_{i,j}$ denote the set of edges between $v_i$ and $v_j$ that actually exist in the original graph. If the summary graph contains the superedge $(v_i, v_j)$, then merging adds $|\Pi_{i,j}| - |A_{i,j}|$ edges; otherwise $|A_{i,j}|$ edges are deleted from the summary graph. More precisely, the error $e_{i,j}$ associated with a pair of supernodes $v_i$ and $v_j$ in a summary graph generated from the topological graph is defined as follows:

$$e_{i,j} = \min\{|\Pi_{i,j}| - |A_{i,j}|,\ |A_{i,j}|\}$$
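A short numeric sketch of this pairwise error follows. It assumes a simple graph, so that the number of possible edges between two distinct supernodes is the product of their sizes; the function name and the tie rule (keep the superedge when both choices cost the same) are illustrative assumptions.

```python
def pair_error(size_i, size_j, actual_edges):
    """Return (e_ij, keep_superedge) for two distinct supernodes."""
    possible = size_i * size_j            # |Pi_ij| under the simple-graph assumption
    added_if_kept = possible - actual_edges   # spurious edges if the superedge is kept
    removed_if_dropped = actual_edges         # lost edges if the superedge is dropped
    keep = added_if_kept <= removed_if_dropped
    return min(added_if_kept, removed_if_dropped), keep


dense = pair_error(3, 2, 4)   # 4 of 6 possible edges exist
sparse = pair_error(3, 2, 1)  # 1 of 6 possible edges exists
```

For the dense pair, keeping the superedge costs 2 spurious edges instead of losing 4 real ones; for the sparse pair, dropping it costs 1 edge instead of adding 5.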

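Therefore, the summary error of the summary graph $S(G)$ can be defined as the sum of the errors over all pairs of supernodes:

$$E(S(G)) = \sum_{v_i, v_j \in V_s} e_{i,j}$$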

Once the user has selected a summary resolution, i.e., the number of supernodes k, the graph summarization problem becomes an optimization problem: generate a summary graph $S(G)$ that minimizes the summary error $E(S(G))$. Prior literature has shown that the graph summarization problem is NP-hard; its most difficult part is determining the supernode set, and once the supernode set is determined, the superedge set with the minimum summary error can be constructed in polynomial time.

The algorithm starts with a summary graph initialized to the original graph and iteratively merges a pair of nodes into a supernode, forming a lower-resolution summary graph, until k supernodes remain in the final summary graph. In each iteration, a pair of supernodes whose merge introduces a low error should be merged, thereby reducing the overall summary error. Based on the node structure obtained by preprocessing, the error increment of merging two supernodes $v_i$ and $v_j$ into a supernode $v_m$ is defined as:

$$\Delta_{i,j} = e_{m,\cdot} - (e_{i,\cdot} + e_{j,\cdot} - e_{i,j})$$

Here, $e_{i,\cdot}$ denotes the total error associated with supernode $v_i$, and $e_{m,\cdot}$ and $e_{j,\cdot}$ are defined similarly. The term $e_{i,\cdot} + e_{j,\cdot} - e_{i,j}$ is the sum of the errors associated with supernodes $v_i$ and $v_j$ before merging. Because $e_{i,j}$ is counted twice in $e_{i,\cdot} + e_{j,\cdot}$, it must be subtracted once.
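The error increment can be traced on a tiny example. The sketch below is illustrative only and rests on assumptions not fixed by the text: the graph is simple and undirected, the total error of a supernode sums its pairwise error with every supernode including itself, and the possible internal edges of a supernode of size s number s(s-1)/2. All names are hypothetical.

```python
def possible_edges(sizes, a, b):
    if a == b:
        return sizes[a] * (sizes[a] - 1) // 2  # internal node pairs
    return sizes[a] * sizes[b]


def pair_error(sizes, counts, a, b):
    actual = counts.get((min(a, b), max(a, b)), 0)
    return min(possible_edges(sizes, a, b) - actual, actual)


def total_error(sizes, counts, a):
    """e_{a,.}: error of supernode a against every supernode, itself included."""
    return sum(pair_error(sizes, counts, a, p) for p in sizes)


def merge_increment(sizes, counts, i, j):
    """Delta_ij = e_{m,.} - (e_{i,.} + e_{j,.} - e_{i,j})."""
    before = (total_error(sizes, counts, i) + total_error(sizes, counts, j)
              - pair_error(sizes, counts, i, j))
    m, d = min(i, j), max(i, j)  # the smaller ID survives as v_m
    new_sizes = {n: s for n, s in sizes.items() if n != d}
    new_sizes[m] = sizes[i] + sizes[j]
    new_counts = {}
    for (a, b), c in counts.items():
        a2 = m if a == d else a
        b2 = m if b == d else b
        key = (min(a2, b2), max(a2, b2))
        new_counts[key] = new_counts.get(key, 0) + c
    return total_error(new_sizes, new_counts, m) - before


sizes = {1: 1, 2: 1, 3: 1}
same_neighbors = merge_increment(sizes, {(1, 3): 1, (2, 3): 1}, 1, 2)
different_neighbors = merge_increment(sizes, {(1, 3): 1}, 1, 2)
```

Merging two nodes with identical neighborhoods introduces no error, whereas merging nodes whose neighborhoods differ by one edge introduces an error of one, which matches the intuition that a smaller increment marks a more similar pair.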

The above summary error calculation is topology-oriented and does not consider the error that node merging introduces in edge relationships. A metric named $\Delta^{E}$ is defined to evaluate the similarity between node pairs in each merge; it accumulates the errors in topology and in edge relationship types between the pair of supernodes and their neighbors. The smaller the value of $\Delta^{E}$, the more similar the two supernodes, meaning the pair is easier to merge. Suppose supernodes $v_i$ and $v_j$ are merged into $v_m$; the node pair merging error increment is then:

$$\Delta^{E}_{i,j} = \alpha\,\Delta^{T}_{i,j} + (1-\alpha)\,\Delta^{R}_{i,j}$$

where $\Delta^{T}_{i,j}$ represents the topology error and $\Delta^{R}_{i,j}$ represents the relationship error. In addition, a tunable parameter $\alpha$ ($\alpha \in [0, 1]$) is introduced to balance the importance of the topology error and the relationship error. If $\alpha = 1$, the formula reduces to the topology-only node pair merging error, i.e., $\Delta^{T}_{i,j}$ is defined as the $\Delta_{i,j}$ above. $\Delta^{R}_{i,j}$ is defined similarly, and its formula is as follows:

where $r_{i,p} = r(v_i, v_p)$ denotes the relationship between supernodes $v_i$ and $v_p$. In computing the relationship error, because various relationships exist, each superedge is regarded as a single edge, and the relationship error is accumulated from the actual edges of the original graph. In addition, of the two supernodes being merged, the relationship between the one containing more actual nodes and supernode $v_p$ is taken as the relationship after merging, because that supernode's relationship with $v_p$ dominates the merged relationship.

In the present invention, the proposed parallel graph summarization algorithm is implemented and evaluated on the Enron dataset (|V| = 36692, |E| = 367662). First, the effectiveness and efficiency of the algorithm are evaluated using the summary error and the summarization time; then the scalability of the algorithm is evaluated by increasing the size of the input graph. All experiments were repeated at least three times, and the averages are shown in the figures.

First, the summary error and the processing time of the topological graph and the attribute graph are compared (using 4 machines). As can be seen from FIG. 2, the summary error increases as the degree of abstraction of the summary increases. Across the different types of input graphs, the summary error of the attribute graph is always larger than that of the topological graph, but the difference is generally less than 10%, which shows the effectiveness of attribute-graph-oriented summarization. As can be seen from FIG. 3, the summarization time increases with the degree of abstraction. Across input graph types, the summarization time of the attribute graph is always greater than that of the topological graph, but the difference is less than 20% on average, which indicates that summarization of attribute graphs is efficient.

The scalability of the algorithm was then evaluated by increasing the size of the input graph (50% abstraction, 4 machines). As can be seen from fig. 4, the summary time increases approximately linearly with the increase of the scale of the input graph, which illustrates that the algorithm has good scalability to some extent.

The embodiments are only preferred embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A parallel graph summarization method based on an attribute graph is characterized by comprising the following steps:

2. The parallel graph summarization method according to claim 1, wherein the error threshold is dynamically adjusted by a simulated annealing heuristic algorithm, and the error threshold increases continuously as the number of remaining nodes decreases.

3. The parallel graph summarization method according to claim 1, wherein the two-node merge discards the node with the larger ID and retains the ID of the node with the smaller ID.

4. The parallel graph summarization method according to claim 1, wherein each node in the summary graph, called a supernode, corresponds to one part of a partition of the nodes of the original graph, and each edge, called a superedge, corresponds to a connection between two related parts; a superedge exists between two supernodes if and only if there is at least one edge between some pair of nodes, one from each of the two supernodes.

5. The method for parallel graph summarization according to claim 1, wherein the introduced error is defined as: