CN113139098A

CN113139098A - Abstract extraction method and system for big homogeneous relation graph

Info

Publication number: CN113139098A
Application number: CN202110308958.7A
Authority: CN
Inventors: 刘盛华; 程学旗; 周厚铨; 刘财政; 沈华伟
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2021-03-23
Filing date: 2021-03-23
Publication date: 2021-07-20
Anticipated expiration: 2041-03-23
Also published as: CN113139098B

Abstract

The invention provides a method and a system for extracting a summary of a large homogeneous relational graph. The method comprises: acquiring relational graph data to be summarized as current graph data, wherein the relational graph data is a large homogeneous relational graph and each node in the current graph data is regarded as a supernode; grouping the nodes in the current graph data by locality-sensitive hashing according to the adjacency matrix of the current graph data; randomly selecting a plurality of supernode pairs from a group, calculating for each pair the difference between the graph obtained after merging that pair and the relational graph data, and merging the supernode pair with the smallest difference to obtain reconstructed graph data; and outputting the reconstructed graph data as the summary extraction result.

Description

Abstract extraction method and system for big homogeneous relation graph

Technical Field

The invention relates to the field of data mining, and in particular to a technique and device for rapidly summarizing and reconstructing large homogeneous relational graphs.

Background

Social media has now surpassed search engines as the largest source of internet traffic, the two accounting for 46 percent and 40 percent respectively. Relational graph data has become a common form of data in many areas of science and engineering. A graph can be represented as G = (V, E), a pair of sets: the node set V represents entities, and the edge set E represents relationships or connections between entities. In computer science, a network contains nodes and edges; in social science, the corresponding terms are actors and relationships. These terms are used interchangeably in this document.

By the first quarter of 2020, the combined monthly active accounts of Weixin and WeChat reached 1.2025 billion, making WeChat the first application in China with more than 1 billion monthly active users; the number of WeChat messages sent increased by 64.2%, and 823 million people sent and received WeChat red packets. In Alibaba's report for the quarter ending March 31, 2020, the most widely circulated figure was "1 trillion US dollars": in the 12 months ending March 31, 2020, GMV on the Alibaba platform reached RMB 7.053 trillion. In those 12 months, 780 million people in China purchased products or services on the Alibaba platform; Alibaba's China retail marketplaces had 846 million mobile monthly active users and 726 million annual active buyers.

The messaging or shopping relationships between the users of these platforms form a graph, as shown in fig. 1 and fig. 2: the users are the nodes of the graph, and the shopping or messaging relationships between users are the edges. In most cases, graph data is produced by one or more generating processes that both represent activities in the system and collect observations of entities.
However, such graph data is very large in volume and difficult to process, analyze, and understand, which poses a significant challenge to graph data mining applications. One effective technique for addressing these challenges is graph summarization. Given a graph G, the goal is to find a compact representation of G, i.e., a summary graph with supernodes and superedges (see the example in fig. 3). A summarization model usually needs to reconstruct a graph from the summary graph, so the reconstruction scheme is at the core of most summarization models. Existing graph summarization methods fall mainly into the following categories:

(1) Adjacency matrix error: such methods attempt to minimize some measure of error between the adjacency matrix of the original graph and that of the reconstructed graph in order to obtain the best summary.

(2) Total number of edges: in such methods, the objective function is defined as the sum of the number of edges in the summary graph and the amount of edge correction information, and summarization performance is improved by reducing the number of edges and the correction information.

(3) Coding length: such methods typically use the Minimum Description Length (MDL) principle, with the total code length being the objective function. The minimum description length is usually optimized under different coding schemes.

The above methods focus mainly on static simple graphs, each applies only to a certain type of graph data, and none is generally applicable. Moreover, these methods need to compute the relationship between every pair of nodes in order to summarize the graph data. Although some methods optimize and accelerate this computation, the computational complexity remains high; on large graph data in particular, these methods are generally inefficient, time-consuming, and memory-intensive.

Disclosure of Invention

The invention relates to the field of data mining, and in particular to a technique and device for rapidly summarizing and reconstructing large homogeneous relational graphs. Homogeneous means that all nodes in the graph are of the same type; for example, the nodes in a social network are all users. In an e-commerce shopping network, by contrast, some nodes represent customers and others represent commodities, so the node types differ and the graph is heterogeneous.

The key idea is the same as that of the commonly used configuration model: the method distributes the weight of each superedge among node pairs in proportion to the node degrees. This configuration-based reconstruction scheme (CR scheme) can usually be embedded into existing summarization methods and can improve their performance and results. The method uses the Minimum Description Length (MDL) principle from information theory to minimize the cost of the summary graph together with the reconstruction error. The method also provides a fast algorithm, called the DPGS algorithm, which groups the candidate nodes of the large graph by Locality-Sensitive Hashing (LSH) and merges nodes greedily within each group to produce the summary of the graph. In theory, the method demonstrates that minimizing the reconstruction error bounds the perturbation of the Laplacian eigenvalues.

In view of the deficiencies of the prior art, the invention provides a summary extraction method for a large homogeneous relational graph, comprising the following steps:

step 1, acquiring relational graph data to be summarized as current graph data, wherein the relational graph data is a large homogeneous relational graph and each node in the current graph data is regarded as a supernode;

step 2, grouping the nodes in the current graph data by locality-sensitive hashing according to the adjacency matrix of the current graph data;

step 3, randomly selecting a plurality of supernode pairs from a group, calculating for each pair the difference between the graph obtained after merging that pair and the relational graph data, and merging the supernode pair with the smallest difference to obtain reconstructed graph data;

step 4, outputting the reconstructed graph data as the summary extraction result.

In the summary extraction method for a large homogeneous relational graph, after the reconstructed graph data is obtained in step 3, the iteration count is incremented by 1, and it is determined whether the current iteration count has reached a preset value; if so, step 4 is executed; otherwise, the reconstructed graph data is taken as the current graph data and step 2 is executed again.

In the summary extraction method for a large homogeneous relational graph, step 3 comprises: obtaining the difference L(M, D) between the graph after merging the supernode pair and the relational graph data by the following formula:

L(M,D)＝L(M)+L(D|M)

wherein d_i and d_j denote the degrees of nodes i and j; D_k and D_l denote the degrees of supernodes S_k and S_l; A_S is the adjacency matrix of the summary graph obtained after merging the supernode pair; A' is the adjacency matrix of the graph reconstructed from the summary graph; A is the adjacency matrix of the relational graph data; A'(i, j) is the weight of the edge from node i to node j in the adjacency matrix of the graph reconstructed from the summary graph; A_S(i, j) is the weight of the edge from node i to node j in the adjacency matrix of the summary graph; A(i, j) is the weight of the edge from node i to node j in the adjacency matrix of the relational graph data; LN is the code length function for a positive integer; LNU is the Bernoulli code length function; n and m are the numbers of nodes and edges, respectively; w_i is the weight of an edge; d_i is the degree of node i; L(M) is the description length of the summary graph; and L(D|M) is the reconstruction error.

In the summary extraction method for a large homogeneous relational graph, step 2 comprises: computing a hash value for each node in the current graph data from its neighbor nodes, and placing nodes with the same hash value in the same group.

In the summary extraction method for a large homogeneous relational graph, the relational graph data is an unweighted undirected graph.

The invention also provides a summary extraction system for a large homogeneous relational graph, comprising:

module 1, configured to acquire relational graph data to be summarized as current graph data, wherein the relational graph data is a large homogeneous relational graph and each node in the current graph data is regarded as a supernode;

module 2, configured to group the nodes in the current graph data by locality-sensitive hashing according to the adjacency matrix of the current graph data;

module 3, configured to randomly select a plurality of supernode pairs from a group, calculate for each pair the difference between the graph obtained after merging that pair and the relational graph data, and merge the supernode pair with the smallest difference to obtain reconstructed graph data;

module 4, configured to output the reconstructed graph data as the summary extraction result.

In the summary extraction system for a large homogeneous relational graph, after module 3 obtains the reconstructed graph data, the iteration count is incremented by 1, and it is determined whether the current iteration count has reached a preset value; if so, module 4 is called; otherwise, the reconstructed graph data is taken as the current graph data and module 2 is called again.

In the summary extraction system for a large homogeneous relational graph, module 3 comprises: obtaining the difference L(M, D) between the graph after merging the supernode pair and the relational graph data by the following formula:

L(M,D)＝L(M)+L(D|M)

wherein d_i and d_j denote the degrees of nodes i and j; D_k and D_l denote the degrees of supernodes S_k and S_l; A_S is the adjacency matrix of the summary graph obtained after merging the supernode pair; A' is the adjacency matrix of the graph reconstructed from the summary graph; A is the adjacency matrix of the relational graph data; A'(i, j) is the weight of the edge from node i to node j in the adjacency matrix of the graph reconstructed from the summary graph; A_S(i, j) is the weight of the edge from node i to node j in the adjacency matrix of the summary graph; A(i, j) is the weight of the edge from node i to node j in the adjacency matrix of the relational graph data; LN is the code length function for a positive integer; LNU is the Bernoulli code length function; n and m are the numbers of nodes and edges, respectively; w_i is the weight of an edge; d_i is the degree of node i; L(M) is the description length of the summary graph; and L(D|M) is the reconstruction error.

In the summary extraction system for a large homogeneous relational graph, module 2 comprises: computing a hash value for each node in the current graph data from its neighbor nodes, and placing nodes with the same hash value in the same group.

In the summary extraction system for a large homogeneous relational graph, the relational graph data is an unweighted undirected graph.

According to the above scheme, the advantages of the invention are:

(1) Novel reconstruction scheme: we have designed a graph summarization model called Degree-Preserving Graph Summarization (DPGS) and propose a novel reconstruction scheme based on the configuration model. We have demonstrated theoretically that DPGS bounds the graph perturbation by the reconstruction error.

(2) Compatibility: the proposed scheme is generally applicable to different graph summarization scenarios and improves the quality of the resulting summaries.

(3) Effectiveness: comparisons on synthetic and real-world graphs confirm the superiority of the designed reconstruction method and show that the DPGS algorithm outperforms several state-of-the-art methods, producing better summaries. In addition, the summary graph can help train graph neural networks effectively and efficiently.

(4) Scalability: the DPGS model runs fast; theoretical analysis shows that its complexity is linear in the number of edges, so it can be applied to summarizing large graph data.

Drawings

FIG. 1 is a schematic diagram of graph data;

FIG. 2 is a diagram of the adjacency matrix of an unweighted graph;

FIG. 3 is a diagram of the adjacency matrix of a weighted graph;

FIG. 4 is a schematic diagram of the operation of the process of the present invention;

FIG. 5 is a flow chart of an implementation of the method of the present invention.

Detailed Description

The invention relates to the field of data mining, and in particular to a technique and device for rapidly summarizing and reconstructing large homogeneous relational graphs.

The invention provides a technique for converting a large graph into a summary graph. To measure the quality of a summary graph, the summary graph is restored to a reconstructed graph, and the difference between the reconstructed graph and the original graph is then calculated: the smaller the difference, the better the summary graph, and the better the technique that converted the large graph into it.

Therefore, the invention uses a new reconstruction method based on the configuration model to better measure the difference between the reconstructed graph and the original graph.

(1) Reconstruction scheme: given a summary graph, the original graph can be reconstructed based on the graph summarization model. The method defines a new way of reconstructing a graph from the summary graph. Let A_S and A' denote the adjacency matrices of the summary graph and the reconstructed graph, respectively. The configuration-based reconstruction method (CR method) is calculated as follows:

wherein S_k and S_l are the supernodes to which nodes i and j belong, respectively; d_i and d_j denote the degrees of nodes i and j; and D_k and D_l denote the degrees of supernodes S_k and S_l. The reconstructed edge weight A'(i, j) is therefore proportional to the product of the endpoint degrees. In the summary graph, each node of the original graph belongs to exactly one supernode, and k and l denote the indices of the supernodes of nodes i and j in the summary graph.

In formula one, S_k and S_l appear through D_k and D_l. That is, the edge weights of the reconstructed graph take into account both the degrees d_i, d_j of the nodes themselves and the degrees D_k, D_l of the supernodes S_k, S_l to which they belong.
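Formula one itself is not reproduced in this text. A form consistent with the description above (edge weights proportional to the product of the endpoint degrees, normalized by the supernode degrees) is sketched below. The definitions of D_k and A_S(k, l) as sums over supernode members are assumptions made for illustration and are not taken from the source.

$$A'(i,j) = \frac{d_i\,d_j}{D_k\,D_l}\,A_S(k,l), \qquad i \in S_k,\ j \in S_l$$

$$D_k = \sum_{i \in S_k} d_i, \qquad A_S(k,l) = \sum_{i \in S_k}\sum_{j \in S_l} A(i,j)$$

Under these assumptions each row sum of A' equals the original degree, because $\sum_{j \in S_l} d_j = D_l$ and $\sum_l A_S(k,l) = D_k$:

$$\sum_j A'(i,j) = \frac{d_i}{D_k}\sum_l \frac{A_S(k,l)}{D_l}\sum_{j \in S_l} d_j = \frac{d_i}{D_k}\sum_l A_S(k,l) = d_i$$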

(2) Degree preservation: the reconstruction method preserves node degrees and has the following property:

where A' is the adjacency matrix of the reconstructed graph, A is the adjacency matrix of the original graph, A(i, j) is the weight of the edge from node i to node j in the adjacency matrix, and d_i denotes the degree of node i.
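The degree-preservation property can be checked numerically. The sketch below assumes a reconstruction of the form A'(i, j) = d_i d_j / (D_k D_l) · A_S(k, l), with D_k the sum of member degrees and A_S(k, l) the sum of edge weights between members of S_k and S_l; these definitions, the function names, and the toy graph are all illustrative assumptions rather than the patented implementation.

```python
import numpy as np

def summarize(A, labels):
    """Build the summary adjacency matrix A_S by summing edge weights between supernodes."""
    n, k = A.shape[0], labels.max() + 1
    P = np.zeros((n, k))                 # membership matrix: P[i, s] = 1 if node i is in S_s
    P[np.arange(n), labels] = 1.0
    return P.T @ A @ P

def reconstruct(A_S, labels, d):
    """Assumed configuration-based reconstruction: A'(i,j) = d_i d_j / (D_k D_l) * A_S(k,l)."""
    D = np.bincount(labels, weights=d, minlength=A_S.shape[0])  # supernode degrees
    scale = d / D[labels]                # d_i / D_k per node (assumes no isolated nodes)
    return scale[:, None] * A_S[np.ix_(labels, labels)] * scale[None, :]

# Hypothetical undirected, unweighted toy graph on 5 nodes.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
labels = np.array([0, 0, 1, 2, 2])       # nodes 0,1 -> S_0; node 2 -> S_1; nodes 3,4 -> S_2
d = A.sum(axis=1)                        # node degrees

A_S = summarize(A, labels)
A_rec = reconstruct(A_S, labels, d)
print("original degrees:     ", d)
print("reconstructed degrees:", A_rec.sum(axis=1))
```

The two printed vectors should agree up to floating-point rounding, which is the row-sum property described above.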

(3) Defining a new evaluation function: the method uses the MDL principle to find the summary graph. The total description length is minimized under the assumption that both the summary graph and the reconstruction error are needed to reconstruct the original graph exactly. A new graph summary and reconstruction error function is defined:

L(M, D) = L(M) + L(D|M) (formula three)

where L(M) is the description length of the summary graph and L(D|M) is the reconstruction error. The method uses the KL divergence to represent the coding error between the adjacency matrix of the original graph and the adjacency matrix of the summary graph (formula four).
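A KL-based reconstruction error of the kind described here can be illustrated on a three-node path graph. The sketch assumes the generalized KL divergence for nonnegative matrices, sum of A ln(A/A') - A + A' with 0 ln 0 taken as 0 and natural logarithms; the exact form of formula four is not given in this text, so this choice, the function name, and the hand-computed matrices are assumptions.

```python
import numpy as np

def kl_reconstruction_error(A, A_rec):
    """Generalized KL divergence between A and A_rec (assumed form, in nats)."""
    mask = A > 0                                   # 0 * ln(0 / x) is taken as 0
    kl = np.sum(A[mask] * np.log(A[mask] / A_rec[mask]))
    return float(kl - A.sum() + A_rec.sum())

# Path graph 0 - 1 - 2 (hypothetical example).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

# Merging nodes 0 and 2 (same neighbor set) reconstructs the graph exactly.
A_rec_lossless = A.copy()

# Merging nodes 0 and 1 under a degree-proportional reconstruction (computed by hand).
A_rec_lossy = np.array([[2, 4, 3],
                        [4, 8, 6],
                        [3, 6, 0]], dtype=float) / 9.0

err_lossless = kl_reconstruction_error(A, A_rec_lossless)
err_lossy = kl_reconstruction_error(A, A_rec_lossy)
print(err_lossless, err_lossy)
```

In this toy case, merging two nodes with identical neighborhoods costs nothing, while merging adjacent nodes with different neighborhoods yields a positive error, which is why the greedy step prefers the former.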

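The description length of the summary graph is defined as follows:

$$L(M) = \mathrm{LN}(n) + \mathrm{LN}(m) + n\,\mathrm{LN}(n) + \sum_i \mathrm{LN}(d_i) + \sum_i \mathrm{LN}(w_i) + \mathrm{LNU}\!\left(\frac{n(n+1)}{2},\, m\right) \quad \text{(formula five)}$$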

where LN is the code length function for a positive integer and LNU is the Bernoulli code length function; n and m are the numbers of nodes and edges, respectively; w_i is the weight of an edge; and d_i is the degree of node i.
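To make formula five concrete, the sketch below evaluates it for a tiny summary graph. The text does not define LN and LNU beyond their names, so two common MDL choices are assumed here: Rissanen's universal code length for positive integers as LN, and log2 of a binomial coefficient (the bits needed to say which m of N slots are occupied) as LNU. Both choices and all names and numbers are illustrative assumptions.

```python
import math

def LN(n):
    """Rissanen's universal code length in bits for a positive integer (assumed form of LN)."""
    if n < 1:
        raise ValueError("LN is defined for positive integers only")
    length = math.log2(2.865064)        # normalizing constant of the universal code
    term = math.log2(n)
    while term > 0:                     # add log2 n + log2 log2 n + ... while terms are positive
        length += term
        term = math.log2(term)
    return length

def LNU(N, m):
    """Assumed form of LNU: bits needed to identify which m of N slots are occupied."""
    return math.log2(math.comb(N, m))

def summary_description_length(n, m, degrees, weights):
    """Formula five: L(M) for a summary graph with n supernodes and m superedges."""
    return (LN(n) + LN(m) + n * LN(n)
            + sum(LN(d) for d in degrees)
            + sum(LN(w) for w in weights)
            + LNU(n * (n + 1) // 2, m))

# Hypothetical summary graph: 2 supernodes, 2 superedges (one of them a self-loop).
bits = summary_description_length(n=2, m=2, degrees=[3, 1], weights=[2, 1])
print(f"L(M) = {bits:.3f} bits")
```

The argument n(n+1)/2 counts the possible superedges among n supernodes when self-loops are allowed, which matches the first argument of LNU in formula five.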

(4) Defining eigenvalue perturbation:

The normalized Laplacian matrices of the original graph and the reconstructed graph are denoted L and L'. The total squared error of their eigenvalues (denoted λ(i) and λ'(i)) is then as follows:
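The total squared eigenvalue error can be computed directly for small graphs. The sketch below builds the normalized Laplacian I - D^(-1/2) A D^(-1/2) for an original and a reconstructed adjacency matrix and sums the squared differences of their sorted eigenvalues. It also prints the squared Frobenius norm of L - L', which bounds that sum for symmetric matrices by the Hoffman-Wielandt inequality; this is a generic bound and not necessarily the one derived for the invention. The function names and matrices are illustrative assumptions.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^(-1/2) A D^(-1/2); assumes every node has positive degree."""
    d = A.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(d)
    return np.eye(len(d)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]

def eigenvalue_squared_error(A, A_rec):
    """Sum over i of (lambda(i) - lambda'(i))^2 with eigenvalues sorted in ascending order."""
    lam = np.linalg.eigvalsh(normalized_laplacian(A))
    lam_rec = np.linalg.eigvalsh(normalized_laplacian(A_rec))
    return float(np.sum((lam - lam_rec) ** 2))

# Path graph 0 - 1 - 2 and a degree-preserving reconstruction after merging nodes 0 and 1.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_rec = np.array([[2, 4, 3],
                  [4, 8, 6],
                  [3, 6, 0]], dtype=float) / 9.0

err = eigenvalue_squared_error(A, A_rec)
frob = float(np.linalg.norm(normalized_laplacian(A) - normalized_laplacian(A_rec), "fro") ** 2)
print("eigenvalue squared error:", err)
print("squared Frobenius norm of L - L':", frob)
```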

(5) Designing a fast algorithm, called the DPGS algorithm. The input of the algorithm is a graph G = (V, E) and the number of iterations T; the output is the summary graph G_S = (V_S, E_S, A_S). The core steps of the algorithm are as follows:

In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments are described in detail below with reference to the accompanying figures.

In this embodiment, the implementation flow is shown in fig. 5, and the implementation process is described in detail taking an undirected graph as an example. The specific embodiment is as follows:

Step 1: given an undirected, unweighted graph as shown in fig. 4(1), obtain its adjacency matrix A, node count N, and edge count E, and set the number of iterations T of the method. A, N, and E are determined by the actual graph; T is a parameter that can be set by experts or by experiment. Each point is initially assumed to be a supernode; a supernode may contain one or more nodes.

Step 2: the basic idea of the LSH method is as follows: after two adjacent data points in the original data space undergo the same mapping or projection, the probability that they remain adjacent in the new data space is high, while the probability that non-adjacent data points are mapped to the same group is low. All supernodes are divided into different groups using the LSH method, and the adjacency matrix A is used for this computation. Specifically, each point computes a hash value from its neighbors, and points with identical hash values are placed in the same group. In fig. 4(2), for example, points 4, 5, 6, 7, and 8 are assumed to fall into one group. Note that this illustration serves only to explain the algorithm flow; which points are actually grouped together depends on the specific LSH method.
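The grouping in step 2 can be sketched with min-hash, one common LSH family for sets. The text leaves the specific LSH method open, so min-hash over neighbor sets is an assumption here, as are the function name and the toy adjacency lists.

```python
import random
from collections import defaultdict

def minhash_groups(neighbors, seed=0):
    """Group nodes whose neighbor sets share the same min-hash value."""
    rng = random.Random(seed)
    nodes = list(neighbors)
    perm = nodes[:]
    rng.shuffle(perm)                              # a random permutation acts as the hash function
    rank = {v: r for r, v in enumerate(perm)}
    groups = defaultdict(list)
    for v in nodes:
        # Hash of v = smallest rank among its neighbors (-1 for an isolated node).
        key = min((rank[u] for u in neighbors[v]), default=-1)
        groups[key].append(v)
    return list(groups.values())

# Hypothetical undirected graph given as adjacency lists.
neighbors = {
    0: {2, 3},
    1: {2, 3},      # same neighbor set as node 0, so always the same hash value
    2: {0, 1, 4},
    3: {0, 1},
    4: {2},
}
groups = minhash_groups(neighbors, seed=0)
print(groups)
```

Two nodes land in the same group with probability equal to the Jaccard similarity of their neighbor sets, so nodes with identical neighborhoods (0 and 1 here) are always grouped together, while changing the seed regroups the remaining nodes.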

Step 3: for each group, randomly sample different point pairs and evaluate formula three and formula four, with the aim of minimizing the value of formula three, i.e., the new graph summary and reconstruction error function. Merging a point pair changes the value of formula three; therefore, a plurality of point pairs are sampled in step 3, the value of formula three after merging each pair is calculated, and the pair that minimizes formula three is selected for merging. For example, if the two point pairs (5, 8) and (6, 7) are sampled and calculation shows that merging (5, 8) yields the smaller value of formula three, then (5, 8) is merged into a new supernode.

Sampling is used here so that not every pair of nodes has to be evaluated, which reduces the computational complexity. As shown in fig. 4, points 5 and 8 are sampled and formula three and formula four are calculated; points 6 and 7 are sampled and formula three and formula four are calculated; and points 5 and 8 are merged according to the results. Note that this illustration serves only to explain the algorithm flow; which points are actually merged depends on the calculated values of formula three and formula four.
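The sample-and-merge step can be sketched as follows. For brevity the sketch scores each candidate merge by a KL-style reconstruction error only and omits the L(M) term of formula three, so it is a simplification rather than the full objective. The reconstruction formula, the function names, and the toy graph are assumptions.

```python
import itertools
import random
import numpy as np

def reconstruction_error(A, labels):
    """KL-style error between A and an assumed degree-proportional reconstruction."""
    _, inv = np.unique(labels, return_inverse=True)    # relabel supernodes as 0..k-1
    n, k = len(inv), inv.max() + 1
    P = np.zeros((n, k))
    P[np.arange(n), inv] = 1.0                         # node-to-supernode membership
    A_S = P.T @ A @ P                                  # summary adjacency matrix
    d = A.sum(axis=1)
    scale = d / (P.T @ d)[inv]                         # d_i / D_k (assumes no isolated nodes)
    A_rec = scale[:, None] * A_S[np.ix_(inv, inv)] * scale[None, :]
    mask = A > 0
    return float(np.sum(A[mask] * np.log(A[mask] / A_rec[mask])))

def merge_best_pair(A, labels, group, num_samples, rng):
    """Sample supernode pairs from one group and apply the merge with the lowest cost."""
    pairs = list(itertools.combinations(sorted(group), 2))
    sampled = rng.sample(pairs, min(num_samples, len(pairs)))
    best_cost, best_labels = None, labels
    for a, b in sampled:
        trial = np.where(labels == b, a, labels)       # tentatively merge supernode b into a
        cost = reconstruction_error(A, trial)
        if best_cost is None or cost < best_cost:
            best_cost, best_labels = cost, trial
    return best_labels, best_cost

# Path graph 0 - 1 - 2; all three singleton supernodes are treated as one group.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
labels = np.arange(3)
new_labels, cost = merge_best_pair(A, labels, group=[0, 1, 2], num_samples=3,
                                   rng=random.Random(0))
print(new_labels, cost)
```

Because only `num_samples` pairs are scored per group, the cost of this step does not grow with the number of all possible pairs, which is the point of sampling made above.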

Step 4: update the LSH function and regroup the supernodes according to the updated LSH function.

Step 5: repeat step 2, step 3, and step 4 until the set number of iterations T is reached, then stop.

Step 6: return the summary graph of the original graph.
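Steps 1 to 6 can be tied together in a compact loop. This is a minimal sketch under stated assumptions, not the patented implementation: grouping uses min-hash over the original nodes adjacent to each supernode, a fresh random permutation per iteration stands in for updating the LSH function, one merge is applied per group per iteration, and candidate merges are scored by a KL-style reconstruction error alone. All function names are hypothetical.

```python
import itertools
import random
import numpy as np

def kl_error(A, labels):
    """KL-style error for an assumed degree-proportional reconstruction (no isolated nodes)."""
    _, inv = np.unique(labels, return_inverse=True)
    P = np.zeros((len(inv), inv.max() + 1))
    P[np.arange(len(inv)), inv] = 1.0
    d = A.sum(axis=1)
    scale = d / (P.T @ d)[inv]
    A_rec = scale[:, None] * (P.T @ A @ P)[np.ix_(inv, inv)] * scale[None, :]
    mask = A > 0
    return float(np.sum(A[mask] * np.log(A[mask] / A_rec[mask])))

def group_supernodes(A, labels, rng):
    """Min-hash each supernode on the set of original nodes adjacent to its members."""
    n = A.shape[0]
    rank = rng.sample(range(n), n)          # fresh permutation = updated hash function (step 4)
    groups = {}
    for s in np.unique(labels):
        nbrs = np.flatnonzero(A[labels == s].sum(axis=0))
        key = min((rank[u] for u in nbrs), default=-1)
        groups.setdefault(key, []).append(int(s))
    return list(groups.values())

def dpgs_sketch(A, T, num_samples=4, seed=0):
    rng = random.Random(seed)
    labels = np.arange(A.shape[0])          # step 1: every node starts as its own supernode
    for _ in range(T):                      # step 5: repeat for T iterations
        for group in group_supernodes(A, labels, rng):            # step 2: LSH grouping
            pairs = list(itertools.combinations(group, 2))
            if not pairs:
                continue
            sampled = rng.sample(pairs, min(num_samples, len(pairs)))
            trials = [np.where(labels == b, a, labels) for a, b in sampled]
            labels = min(trials, key=lambda t: kl_error(A, t))     # step 3: best sampled merge
    return labels                           # step 6: supernode label of every original node

# Path graph 0 - 1 - 2: nodes 0 and 2 share the neighbor set {1} and get merged.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
labels = dpgs_sketch(A, T=3)
print(labels)
```

The returned label vector identifies the supernode of each original node; the summary adjacency matrix A_S can then be formed by summing edge weights between supernodes.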

Although specific embodiments of the invention and the accompanying drawings have been disclosed for illustrative purposes, to aid understanding of the invention and its implementation, those skilled in the art will appreciate that the corresponding methods and tools may be implemented on other platforms without departing from the spirit and scope of the present invention and the appended claims. Therefore, the present invention should not be limited to what is disclosed in the embodiment and the drawings.

The following is a system embodiment corresponding to the above method embodiment, and the two can be implemented in cooperation with each other. The technical details mentioned in the above embodiment remain valid in this embodiment and are not repeated here. Correspondingly, the technical details mentioned in this embodiment also apply to the above embodiment.

L(M,D)＝L(M)+L(D|M)

wherein d_i and d_j denote the degrees of nodes i and j; D_k and D_l denote the degrees of supernodes S_k and S_l; A_S is the adjacency matrix of the summary graph obtained after merging the supernode pair; A' is the adjacency matrix of the graph reconstructed from the summary graph; A is the adjacency matrix of the relational graph data; A'(i, j) is the weight of the edge from node i to node j in the adjacency matrix of the graph reconstructed from the summary graph; A_S(i, j) is the weight of the edge from node i to node j in the adjacency matrix of the summary graph; A(i, j) is the weight of the edge from node i to node j in the adjacency matrix of the relational graph data; LN is the code length function for a positive integer; LNU is the Bernoulli code length function; n and m are the numbers of nodes and edges, respectively; w_i is the weight of an edge; d_i is the degree of node i; L(M) is the description length of the summary graph; and L(D|M) is the reconstruction error.

Claims

1. A summary extraction method for a large homogeneous relational graph, characterized by comprising the following steps:

2. The summary extraction method for a large homogeneous relational graph as claimed in claim 1, wherein after the reconstructed graph data is obtained in step 3, the iteration count is incremented by 1, and it is determined whether the current iteration count has reached a preset value; if so, step 4 is executed; otherwise, the reconstructed graph data is taken as the current graph data and step 2 is executed again.

3. The summary extraction method for a large homogeneous relational graph as claimed in claim 1, wherein step 3 comprises: obtaining the difference L(M, D) between the graph after merging the supernode pair and the relational graph data by the following formula:

L(M,D)＝L(M)+L(D|M)

4. The summary extraction method for a large homogeneous relational graph as claimed in claim 1, wherein step 2 comprises: computing a hash value for each node in the current graph data from its neighbor nodes, and placing nodes with the same hash value in the same group.

5. The summary extraction method for a large homogeneous relational graph as claimed in claim 1, wherein the relational graph data is an unweighted undirected graph.

6. A summary extraction system for a large homogeneous relational graph, comprising:

7. The summary extraction system for a large homogeneous relational graph as claimed in claim 6, wherein after module 3 obtains the reconstructed graph data, the iteration count is incremented by 1, and it is determined whether the current iteration count has reached a preset value; if so, module 4 is called; otherwise, the reconstructed graph data is taken as the current graph data and module 2 is called again.

8. The summary extraction system for a large homogeneous relational graph as claimed in claim 6, wherein module 3 comprises: obtaining the difference L(M, D) between the graph after merging the supernode pair and the relational graph data by the following formula:

L(M,D)＝L(M)+L(D|M)

9. The summary extraction system for a large homogeneous relational graph as claimed in claim 1, wherein module 2 comprises: computing a hash value for each node in the current graph data from its neighbor nodes, and placing nodes with the same hash value in the same group.

10. The summary extraction system for a large homogeneous relational graph as claimed in claim 6, wherein the relational graph data is an unweighted undirected graph.