CN111159483B - Tensor calculation-based social network diagram abstract generation method - Google Patents


Info

Publication number
CN111159483B
CN111159483B (application CN201911373671.1A)
Authority
CN
China
Prior art keywords
tensor
boolean
matrix
graph
decomposition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911373671.1A
Other languages
Chinese (zh)
Other versions
CN111159483A (en
Inventor
谢夏
王健
金海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201911373671.1A priority Critical patent/CN111159483B/en
Publication of CN111159483A publication Critical patent/CN111159483A/en
Application granted granted Critical
Publication of CN111159483B publication Critical patent/CN111159483B/en

Classifications

    • G06F16/901 Indexing; data structures therefor; storage structures
    • G06F16/9024 Graphs; linked lists
    • G06F16/906 Clustering; classification
    • G06Q50/01 Social networking
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a method for generating a social network graph abstract based on tensor calculation, and belongs to the field of social networks. The method comprises: representing the social network graph in a target time period as a Boolean tensor T_G; performing tensor decomposition on T_G to obtain the decomposed node matrices N_1, N_2, attribute matrices A_1, …, A_{h-3} and a time matrix T; clustering the node matrix N_1 or N_2 to obtain cluster centers and the cluster of each node; and taking the cluster centers as the superpoints of the graph abstract and calculating the superedge weights between superpoints to obtain the graph abstract. The method fuses the nodes, node attributes and timestamps of the social network into multidimensional data and, based on the binary nature of the social network graph and the high-dimensional expressiveness of tensors, achieves a unified expression of high-dimensional graph data, i.e. a Boolean tensorized representation of the complex social network. Tensor CP decomposition is introduced, and prior information such as the decomposition results of old graph tensors is fully utilized, reducing the size of the decomposed tensors and improving the decomposition efficiency of the graph abstract.

Description

Tensor calculation-based social network diagram abstract generation method
Technical Field
The invention belongs to the field of social networks, and particularly relates to a method for generating a social network diagram abstract based on tensor calculation.
Background
Social network analysis has been a popular topic in the data mining community in recent years; querying and reasoning about the interrelations between entities in a social network can inspire interesting and deep insights into various phenomena. However, because social networks have a complex, dynamic and changeable structure and a huge volume of data, the expression and mining of social network graph data are limited by computing resources and cost. Thus, the starting point for analyzing such complex large graph data is typically a concise representation, i.e., a graph abstract, which helps to understand these datasets and to represent queries in a meaningful way. Graph summarization plays a very important role in the processing of graph data, from reducing the number of bits required to encode the original graph to more complex database operations.
In recent years, tensor methods have been applied to graph summarization methods, which can produce more accurate weighted graph summaries. Tensors are a form of data storage that is multidimensional, with the dimensions of the data being referred to as the order of the tensors. Since real tensor data often has high-dimensional sparse characteristics, we generally use a tensor decomposition method to preserve original information, reduce computational complexity, and reduce data loss.
Current graph summarization methods focus only on either the temporal dynamics or the node attributes of graph data. User nodes in a social network carry various attributes, and the connection relationships between users change from moment to moment, so social network graph data exhibits dynamics and node attributes at the same time. In addition, for time-series dynamic graphs, current methods repeatedly compute over historical data, resulting in low computational efficiency.
Disclosure of Invention
Aiming at the defects and improvement demands of the prior art, the invention provides a method for generating a social network graph abstract based on tensor calculation, which aims to uniformly express the dynamics and node attributes of social network graph data by adopting a tensor calculation framework, and to introduce a Boolean tensor decomposition method to realize scalable and efficient graph abstract computation.
To achieve the above object, according to a first aspect of the present invention, there is provided a method for generating a social network graph abstract based on tensor calculation, the method comprising the steps of:
S1, performing tensor representation on a social network graph in a target time period to obtain a target Boolean tensor T_G;
S2, performing tensor decomposition on the target Boolean tensor T_G to obtain decomposed node matrices N_1, N_2;
S3, clustering the node matrix N_1 or N_2 to obtain cluster centers and the cluster of each node;
S4, regarding the cluster centers as the superpoints of the graph abstract, and calculating the superedge weights between superpoints to obtain the graph abstract of the social network graph.
Preferably, the social network graphs are dynamic undirected unweighted graphs that correspond one-to-one to the timestamps.
Preferably, step S2 comprises the steps of:
S21, merging the old Boolean tensor T_old and the target Boolean tensor T_G into a Boolean tensor T_all whose last order is the time dimension, the old Boolean tensor T_old being the tensor representation of the social network graph for a previous time period;
S22, performing biased sampling on the Boolean tensor T_all to generate k sub-tensors sT_i;
S23, performing parallel distributed Boolean CP decomposition on each sub-tensor sT_i, and calculating the decomposition factor matrices A_i^(1), …, A_i^(h) of each sub-tensor;
S24, merging the Boolean decomposition matrices A_i^(j) of the sub-tensors sT_i with the old decomposition matrices A_old^(j) of the old Boolean tensor T_old to obtain the Boolean CP decomposition result A_all^(j) of the new Boolean tensor T_all, wherein 1 ≤ i ≤ k and 1 ≤ j ≤ h.
Preferably, the step S22 comprises the steps of:
S221, summing each order of the h-order old Boolean tensor T_old to obtain the per-index count vectors s^(1), …, s^(h);
S222, dividing s^(j) by the number of non-zero elements in T_old to obtain the sampling probabilities p^(j) of the indices of each order;
S223, calculating the size L_j of the sampling index of each order of T_old according to the set sampling factor;
S224, sampling L_j indices of the j-th order of T_old according to the sampling probabilities p^(j), obtaining the sample index set V_j;
S225, merging the sample index sets with the time dimension index of the target Boolean tensor T_G to obtain {V_1, V_2, …, V_h ∪ {V_new}}, wherein V_new represents the time dimension index of T_G;
S226, taking the sample sub-tensor according to the index set {V_1, V_2, …, V_h ∪ {V_new}};
S227, repeating steps S221–S226 until k sub-tensors are generated.
Preferably, the step S23 comprises the steps of:
S231, initializing the factor matrices A_i^(1), …, A_i^(h) of the sub-tensor sT_i Y times, each time as a Boolean matrix with non-zero-entry probability p, and taking the factor matrices with the minimum reconstruction error as the final initialization;
S232, carrying out h iterations, fixing (h−1) factor matrices in each iteration and optimizing the remaining factor matrix so as to minimize the overall reconstruction error, thereby completing one round of iteration;
S233, repeating step S232 until the number of iteration rounds reaches k or the iteration error is smaller than e, and returning the Boolean factor matrices A_i^(1), …, A_i^(h).
Preferably, step S24 comprises the steps of:
S241, merging the Boolean decomposition matrices A_1^(j) of the sub-tensor sT_1 with the decomposition matrices A_old^(j) of the old tensor T_old to obtain the Boolean decomposition matrix set A_all^(j);
S242, merging the Boolean decomposition matrices A_2^(j) of the sub-tensor sT_2 with the corresponding matrices of the Boolean decomposition matrix set A_all^(j), and so on, until the Boolean decomposition matrices A_k^(j) of the sub-tensor sT_k are merged with the corresponding matrices of the set, obtaining the Boolean CP decomposition matrices A_all^(j) of the new tensor T_all.
Preferably, the merging of the Boolean decomposition matrices comprises the following steps:
(1) Calculating tensors V and U, where v_x is row x of the sub-tensor's factor matrix, u_x is the corresponding sampled index row of the old factor matrix, V is the tensor recovered from v_x and the other factor matrices, and U is the tensor recovered from u_x and the other factor matrices;
(2) Calculating the reconstruction errors ε_1 and ε_2 of tensors V and U against the old tensor factor matrix:
ε_1 = ||V − T_x||
ε_2 = ||U − T_x||
wherein T_x is the slice tensor of the corresponding index row;
(3) Judging whether ε_1 < ε_2 holds; if yes, updating row u_x of the original tensor factor matrix with v_x, otherwise not updating.
Preferably, in step S3, the Hamming distance is selected as the distance metric, the number r of cluster centers is set, and K-Means clustering is adopted to obtain the cluster centers S_i (i = 1, …, r) and the cluster to which each node belongs.
Preferably, step S4 comprises the steps of:
S41, calculating the superedge weight between the superpoints in the graph abstract, wherein the calculation formula is:
w(S_i, S_j) = ( Σ_{t=1}^{L} Σ_{l∈S_i} Σ_{m∈S_j} T_all(l, m, t) ) / ( L · σ(S_i) · σ(S_j) )
wherein S_i, S_j are the cluster centers calculated by the clustering algorithm, l and m are points in S_i and S_j respectively, L is the time dimension length of the Boolean tensor T_all, N is the number of nodes of T_all, and σ(S_i) is the number of points included in S_i;
S42, calculating the reconstruction error of the graph abstract, wherein the calculation formula is:
RE = sqrt( Σ_{t=1}^{L} Σ_{l=1}^{N} Σ_{m=1}^{N} | T_all(l, m, t) − w(S(l), S(m)) |² )
wherein S(l) denotes the superpoint to which node l belongs;
S43, judging whether the reconstruction error meets a set threshold; if yes, taking the clusters as the nodes of the graph abstract and the superedge weights as the weights of the edges of the graph abstract; otherwise, changing the number of cluster centers and returning to step S3.
To achieve the above object, according to a second aspect of the present invention, there is provided a computer-readable storage medium, wherein the computer-readable storage medium has stored thereon a computer program, which when executed by a processor, implements the method for generating a social network graph summary based on tensor computation according to the first aspect.
In general, through the above technical solutions conceived by the present invention, the following beneficial effects can be obtained:
(1) Aiming at the problem that existing graph summarization methods focus only on either the dynamics or the node attributes of graph data, the method fuses the multidimensional data of the nodes, node attributes and timestamps of the social network and, based on the binary nature of the social network graph and the high-dimensional expressiveness of tensors, realizes a unified expression of high-dimensional graph data, i.e. a Boolean tensorized representation of the complex social network.
(2) Aiming at the problem of the low computational efficiency of existing graph abstract methods, tensor Boolean CP decomposition is introduced, and prior information such as the decomposition results of old graph tensors is fully utilized, reducing the size of the decomposed tensors and improving the decomposition efficiency of the graph abstract.
Drawings
Fig. 1 is a flowchart of a method for generating a social network diagram abstract based on tensor calculation according to an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
First, some terms related to the present invention will be explained.
Graph abstract: the graph abstract is a concise representation of the original graph in which a large number of points and edges are aggregated into superpoints and superedges, which benefits the visualization of large graphs and the mining of graph data. A superpoint is a point set aggregated from a plurality of nodes in the graph, a superedge is an edge set aggregated from a plurality of edges in the graph, and the superedge weight is calculated from the edge adjacency characteristics and weights within the sets.
Boolean tensor: a tensor all of whose elements are 0 or 1. Owing to the binary nature of the adjacency matrix of an unweighted graph, a dynamic unweighted graph can be represented as a Boolean tensor, where the order is the number of dimensions of the tensor.
Undirected unweighted graph: edges in the graph have neither direction nor weight; a dynamic undirected graph is an undirected graph at each timestamp.
Tensor decomposition: the scheme of representing a tensor as a basic sequence of operations on other, simpler tensors, generally applicable to tensor completion, dimension-reduced representation, feature extraction, and the like.
CP decomposition: Canonical Polyadic decomposition, a common form of tensor decomposition that decomposes a tensor into a sum of a plurality of rank-1 tensors, a rank-1 tensor being a special tensor that can be decomposed into the outer product of a plurality of vectors.
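As a concrete illustration of the Boolean tensor and CP-decomposition terms defined above, the following sketch builds a Boolean tensor as the element-wise OR of rank-1 outer products. The function names are illustrative, not part of the patent:

```python
import numpy as np

def boolean_outer(vectors):
    """Rank-1 Boolean tensor: the outer product of 0/1 vectors."""
    t = np.array(True)
    for v in vectors:
        t = np.multiply.outer(t, np.asarray(v, dtype=bool))
    return t

def boolean_cp_reconstruct(factors):
    """Boolean CP reconstruction: OR together R rank-1 tensors, where
    factors[j] is an (n_j x R) Boolean factor matrix for order j."""
    R = factors[0].shape[1]
    T = np.zeros(tuple(f.shape[0] for f in factors), dtype=bool)
    for r in range(R):
        T |= boolean_outer([f[:, r] for f in factors])
    return T
```

In the Boolean setting the "sum" of rank-1 tensors is a logical OR, which is why the element-wise `|=` stands in for addition.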
As shown in FIG. 1, the invention provides a method for generating a social network diagram abstract based on tensor calculation, which comprises the following steps:
S1, performing tensor representation on the social network graph in the target time period to obtain the Boolean tensor T_G.
Preferably, the social network graphs are dynamic undirected unweighted graphs that correspond one-to-one to the timestamps.
And abstracting the users in the social network as nodes, abstracting the relationship among the users in the social network as edges, and obtaining a social network diagram. For example, in a microblog social network, a microblog user is a node, each node has a plurality of node attributes, such as gender, academic, work and occupation, and a concern relationship between users is an edge. The relationship of interest between users is dynamically changing, and thus, the social network diagram data is dynamic.
In this embodiment, the target period of time is 1 day, that is, the social network diagram abstract within 1 day needs to be generated. In the microblog social network, in the generated graph abstract, user nodes with similar user attributes and user interests are represented by one superpoint, and the connection relationship among different user superpoints is represented by a superside.
The graph data is constructed as a high-order tensor, with the node attributes and the timestamp of the graph data serving as different dimensions of the tensor. The tensor is binary, and a non-zero element in the tensor represents the two endpoint nodes of one edge in the dynamic attribute graph, together with the node attributes and the timestamp of that edge.
For a high-order sparse tensor, storing all elements would consume a large amount of storage space; therefore, for a graph tensor, the invention stores only tuples of the index values of non-zero elements in the different dimensions. For example, (node1, node2, node1.attribute, node2.attribute, t) indicates that at time t there is an edge with node1 and node2 as vertices, and that the attributes of node1 and node2 are node1.attribute and node2.attribute respectively. To support the computation of large-scale graph data, the graph tensor tuples are uploaded into the distributed file system HDFS.
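The tuple-based storage described above can be sketched as follows. The attribute names and the `to_dense` helper are hypothetical, for illustration only; in practice the tensor would remain in sparse tuple form (e.g., on HDFS):

```python
import numpy as np

# Hypothetical tuple store for the sparse graph tensor: one tuple
# (node1, node2, node1.attribute, node2.attribute, t) per non-zero entry.
edges = [
    (0, 1, "student", "engineer", 0),
    (1, 2, "engineer", "teacher", 0),
    (0, 2, "student", "teacher", 1),
]

def to_dense(edges, n_nodes, attrs, n_steps):
    """Expand the tuples into a dense Boolean tensor (illustration only;
    at scale the tensor stays in sparse tuple form)."""
    idx = {name: i for i, name in enumerate(attrs)}
    T = np.zeros((n_nodes, n_nodes, len(attrs), len(attrs), n_steps),
                 dtype=bool)
    for u, v, au, av, t in edges:
        T[u, v, idx[au], idx[av], t] = True
    return T
```

Three tuples here stand for three non-zero entries of a 5-order (h = 5) Boolean tensor: two node orders, two attribute orders and one time order.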
S2, performing tensor decomposition on the Boolean tensor T_G to obtain the decomposed node matrices N_1, N_2, attribute matrices A_1, …, A_{h−3} and time matrix T.
The decomposed node matrices N_1, N_2 are feature vectors representing the node adjacency characteristics, the attribute matrices A_1, …, A_{h−3} are feature vectors representing the node attributes of the graph, and the time matrix T is a feature vector representing the graph in the time dimension.
Preferably, step S2 comprises the steps of:
S21, merging the Boolean tensor T_old and the Boolean tensor T_G into a Boolean tensor T_all, wherein the last order is the time dimension and the Boolean tensor T_old is the tensor representation of the social network graph over a previous time period.
In the present embodiment, the Boolean tensor T_old is the tensor representation of the social network graph of the previous day.
S22, performing biased sampling on the Boolean tensor T_all to generate k sub-tensors sT_i, 1 ≤ i ≤ k.
The Boolean tensor T_all is sampled with bias according to an importance measure, so as to increase the non-zero-entry density of the sampled sub-tensors and to increase the influence of the decomposition result of each sub-tensor on the update of T_all. The following procedure is illustrated with h = 2.
Assume that T_old = [[0, 1], [1, 1]] (indices counted from 0) and that the sampling factor is 0.5.
Preferably, the step S22 comprises the steps of:
S221, summing each order of the h-order old Boolean tensor T_old to obtain the per-index count vectors s^(1), …, s^(h).
In the present embodiment, s^(1) = [1, 2] and s^(2) = [1, 2].
S222, dividing s^(j) by the number of non-zero entries in T_old to obtain the sampling probabilities p^(j) of the indices of each order.
In the present embodiment, p^(1) = [1/3, 2/3] ≈ [0.33, 0.67] and p^(2) = [1/3, 2/3] ≈ [0.33, 0.67].
S223, calculating the size L_j of the sampling index of each order of T_old according to the set sampling factor.
In the present embodiment, L_1 = 2 × 0.5 = 1 and L_2 = 2 × 0.5 = 1.
S224, sampling L_j indices of the j-th order of T_old according to the sampling probabilities p^(j), obtaining the sample index set V_j.
In the present embodiment, in the first dimension the sample size is 1 and the indices [0, 1] have sampling probabilities [0.33, 0.67]; in the second dimension the sample size is 1 and the indices [0, 1] likewise have sampling probabilities [0.33, 0.67]. Assume the sampling result is V_1 = [1] and V_2 = [1].
S225, merging the sample index sets with the index set of the new tensor to obtain {V_1, V_2, …, V_h ∪ {V_new}}, wherein V_new represents the time dimension index of T_G.
In the present embodiment, V_new = [2, 3], and the final sample index set is {[1], [1, 2, 3]}.
S226, taking the sample sub-tensor according to the sample index set {V_1, V_2, …, V_h ∪ {V_new}}.
In the present embodiment, the sub-tensor is T_all[1, {1, 2, 3}] = [1 1 1].
S227, repeating steps S221–S226 until the k sub-tensors sT_1, …, sT_k are generated.
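A minimal sketch of the biased sampling steps S221–S224, run on the 2-order toy tensor of the embodiment (assuming 0-based indices; the function name and `rng` are illustrative):

```python
import numpy as np

def biased_sample_indices(T_old, sampling_factor, rng):
    """Steps S221-S224 (sketch): biased index sampling per order."""
    nnz = T_old.sum()                            # number of non-zero entries
    sampled = []
    for j in range(T_old.ndim):
        other_axes = tuple(k for k in range(T_old.ndim) if k != j)
        s_j = T_old.sum(axis=other_axes)         # S221: per-index counts
        p_j = s_j / nnz                          # S222: sampling probabilities
        L_j = max(1, int(T_old.shape[j] * sampling_factor))      # S223
        V_j = rng.choice(T_old.shape[j], size=L_j,
                         replace=False, p=p_j)   # S224: biased sampling
        sampled.append(sorted(int(v) for v in V_j))
    return sampled

# Toy 2-order tensor consistent with the embodiment's numbers (0-based):
T_old = np.array([[0, 1],
                  [1, 1]], dtype=bool)
V = biased_sample_indices(T_old, 0.5, np.random.default_rng(0))
```

For this T_old both orders have counts [1, 2] and probabilities [1/3, 2/3], matching the embodiment; the sampled index in each order is then merged with the new time indices before the sub-tensor is cut out (S225–S226).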
S23, performing parallel distributed Boolean CP decomposition on each sub-tensor sT_i, and calculating the decomposition factor matrices A_i^(1), …, A_i^(h) of each sub-tensor.
Preferably, the step S23 comprises the steps of:
S231, initializing the factor matrices A_i^(1), …, A_i^(h) of the sub-tensor sT_i Y times, each time as a Boolean matrix with non-zero-entry probability p, and taking the factor matrices with the minimum reconstruction error as the final initialization.
In this embodiment, Y is set according to actual needs, and is generally an arbitrary integer from 5 to 20.
S232, carrying out h iterations, fixing (h−1) factor matrices in each iteration and optimizing the remaining factor matrix so as to minimize the overall reconstruction error.
This embodiment employs least-squares optimization. The following procedure is illustrated with h = 3.
The decomposition factor matrices of the sub-tensor sT_i are A_i^(1), A_i^(2), A_i^(3). Fix A_i^(2) and A_i^(3) and optimize A_i^(1) so that the reconstruction error is minimized; fix A_i^(1) and A_i^(3) and optimize A_i^(2) so that the reconstruction error is minimized; fix A_i^(1) and A_i^(2) and optimize A_i^(3) so that the reconstruction error is minimized.
S233, repeating step S232 until the number of iteration rounds reaches k or the iteration error is smaller than e, and returning the Boolean factor matrices A_i^(1), …, A_i^(h).
In this embodiment, k and e are set according to actual needs.
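The initialize-then-alternate scheme of steps S231–S233 can be sketched as follows. Because the per-matrix update is only described abstractly, this sketch substitutes an exhaustive per-row search over the 2^R candidate Boolean rows, which is exact for a tiny rank R; a real implementation would use a scalable update rule instead:

```python
import itertools
import numpy as np

def reconstruct(factors):
    """OR of R rank-1 Boolean tensors from (n_j x R) factor matrices."""
    R = factors[0].shape[1]
    T = np.zeros(tuple(f.shape[0] for f in factors), dtype=bool)
    for r in range(R):
        rank1 = np.array(True)
        for f in factors:
            rank1 = np.multiply.outer(rank1, f[:, r])
        T |= rank1
    return T

def boolean_cp(T, R=2, restarts=5, rounds=10, p=0.5, seed=0):
    """S231-S233 sketch: best of `restarts` random Boolean inits, then
    alternating rounds that re-optimize one factor at a time
    (each row chosen exhaustively among the 2^R Boolean candidates)."""
    rng = np.random.default_rng(seed)
    candidates = [np.array(bits, dtype=bool)
                  for bits in itertools.product([0, 1], repeat=R)]
    best, best_err = None, None
    for _ in range(restarts):
        # S231: random Boolean init with non-zero-entry probability p
        factors = [rng.random((n, R)) < p for n in T.shape]
        for _ in range(rounds):                    # S232/S233
            for j in range(len(factors)):          # fix h-1, update one
                for x in range(T.shape[j]):
                    errs = []
                    for c in candidates:
                        factors[j][x] = c
                        errs.append(np.sum(reconstruct(factors) != T))
                    factors[j][x] = candidates[int(np.argmin(errs))]
        err = np.sum(reconstruct(factors) != T)
        if best_err is None or err < best_err:
            best, best_err = [f.copy() for f in factors], err
    return best, int(best_err)
```

Each row update can only lower the mismatch count, so the error is monotonically non-increasing within a restart, mirroring the "minimize the overall reconstruction error" criterion of S232.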
S24, merging the Boolean decomposition matrices A_i^(j) of the sub-tensors sT_i with the decomposition matrices A_old^(j) of the old tensor T_old to obtain the Boolean CP decomposition result A_all^(j) of the new tensor T_all.
Merging the two to obtain the Boolean CP decomposition result A_all^(j) of the new tensor T_all introduces updates into the decomposition matrices of the old tensor and reduces the decomposition error.
Preferably, step S24 comprises the steps of:
S241, merging the Boolean decomposition matrices A_1^(j) of the sub-tensor sT_1 with the decomposition matrices A_old^(j) of the old tensor T_old to obtain the Boolean decomposition matrix set A_all^(j).
S242, merging the Boolean decomposition matrices A_2^(j) of the sub-tensor sT_2 with the corresponding matrices of the Boolean decomposition matrix set A_all^(j), and so on, until the Boolean decomposition matrices A_k^(j) of the sub-tensor sT_k are merged with the corresponding matrices of the set, obtaining the Boolean CP decomposition matrices A_all^(j) of the new tensor T_all.
Preferably, the merging of the Boolean decomposition matrices comprises the following steps:
(1) Calculating tensors V and U, where V is the tensor recovered from v_x and the other factor matrices and U is the tensor recovered from u_x and the other factor matrices, v_x being row x of the sub-tensor's factor matrix and u_x the corresponding sampled index row of the old factor matrix.
(2) Calculating the reconstruction errors ε_1 and ε_2 of tensors V and U against the old tensor factor matrix:
ε_1 = ||V − T_x||
ε_2 = ||U − T_x||
wherein T_x is the slice tensor of the corresponding index row, and the tensor 1-norm, denoted ||·||, is the number of non-zero entries of a Boolean tensor.
(3) Judging whether ε_1 < ε_2 holds; if yes, updating row u_x of the original tensor factor matrix with v_x, otherwise not updating.
Updating row u_x of the original tensor factor matrix with v_x only when ε_1 < ε_2 is satisfied reduces the overall reconstruction error.
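A sketch of the row-update rule above for the 2-order (matrix) case; `row_slice` and `merge_row` are illustrative names, and the error is the 1-norm (non-zero mismatch count) defined in step (2):

```python
import numpy as np

def row_slice(row, other):
    """Boolean reconstruction of one slice from a factor row `row` (R,)
    and the remaining factor matrix `other` (m x R): OR over r of
    row[r] & other[:, r]."""
    return (other & row).any(axis=1)

def merge_row(old_A, B, x, v_x, T_x):
    """Steps (1)-(3): replace row u_x of the old factor matrix by the
    sub-tensor's row v_x only if that lowers the slice reconstruction
    error against the slice tensor T_x."""
    u_x = old_A[x]
    eps1 = int(np.sum(row_slice(v_x, B) != T_x))  # error with candidate row
    eps2 = int(np.sum(row_slice(u_x, B) != T_x))  # error with current row
    if eps1 < eps2:
        old_A[x] = v_x                            # update only if it helps
    return old_A
```

Because a row is replaced only when its slice error strictly decreases, repeating this over all sampled rows cannot increase the overall reconstruction error.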
S3, clustering the node matrix N_1 or N_2 to obtain the cluster centers and the cluster of each node.
Preferably, in the step S3, the clustering mode is a K-Means clustering mode, and the Hamming distance is selected as the distance.
The method comprises the following steps:
s31, selecting a row vector set { N } of a node Boolean factor matrix N 1 ,n 2 ,...,n l "l" is the number of rows of matrix N, also the number of graph nodes。
S32, selecting r vectors from the vector as initial cluster centers, wherein r represents the number of cluster centers and is also the number of super points generated in the final graph abstract.
In this embodiment, r, the K of K-Means, is initialized to 100.
S33, calculating the Hamming distance from each node to every cluster center, and assigning the node to the closest cluster.
S34, updating each cluster center with the rounded mean of all the vectors in that cluster, thereby completing one round of iteration.
S35, if the number of iterations reaches the specified value, outputting the cluster of each point.
In this embodiment, the specified number of iterations is 10.
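Steps S31–S35 can be sketched as a K-Means variant over Boolean row vectors with Hamming distance and rounded-mean center updates. Taking the first r rows as initial centers is an assumption made here for determinism; the patent only says that r vectors are selected:

```python
import numpy as np

def hamming_kmeans(rows, r, iters=10):
    """S31-S35 sketch: K-Means on Boolean rows under Hamming distance,
    centers updated by the rounded (majority-vote) mean."""
    rows = np.asarray(rows, dtype=bool)
    centers = rows[:r].copy()          # S32: initial centers (assumed: first r rows)
    for _ in range(iters):             # S35: fixed number of rounds
        # S33: Hamming distance of every row to every center -> (n, r)
        dist = (rows[:, None, :] != centers[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        # S34: rounded mean per cluster (exact ties round down to 0)
        for c in range(r):
            members = rows[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0) > 0.5
    return centers, labels
```

The rounded mean keeps the centers Boolean, so the Hamming distance in S33 stays well defined from round to round.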
S4, regarding the cluster centers as the superpoints of the graph abstract, and calculating the superedge weights between superpoints to obtain the complete graph abstract.
Preferably, step S4 comprises the steps of:
S41, calculating the superedge weight between the superpoints in the graph abstract according to the graph node adjacency similarity formula:
w(S_i, S_j) = ( Σ_{t=1}^{L} Σ_{l∈S_i} Σ_{m∈S_j} T_all(l, m, t) ) / ( L · σ(S_i) · σ(S_j) )
S42, calculating the reconstruction error of the graph abstract according to the Euclidean distance of the tensors:
RE = sqrt( Σ_{t=1}^{L} Σ_{l=1}^{N} Σ_{m=1}^{N} | T_all(l, m, t) − w(S(l), S(m)) |² )
wherein S_i, S_j are the cluster centers calculated by the clustering algorithm, l and m are points in S_i and S_j respectively, L is the time dimension length of the Boolean tensor T_all, N is the number of nodes of T_all, σ(S_i) is the number of points included in S_i, S(l) denotes the superpoint to which node l belongs, and |·| denotes the absolute value operator.
S43, judging whether the reconstruction error meets the set threshold; if yes, taking the clusters as the nodes of the graph abstract and the superedge weights as the weights of the edges of the graph abstract; otherwise, changing the number of cluster centers and returning to step S3.
In this embodiment, the reconstruction error threshold is set to 1000. If the reconstruction error does not meet the set threshold, the number of cluster centers is increased.
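A sketch of the superedge-weight computation of step S41. The patent's formula survives only as an equation image, so this assumes a common density-style definition consistent with the listed variables (time length L, cluster sizes σ(S_i), points l and m); the function name is illustrative:

```python
import numpy as np

def summary_weights(T_all, labels, r):
    """Superedge weight between superpoints S_i and S_j: number of edges
    between the two clusters, averaged over the L timestamps and
    normalized by the cluster sizes (assumed density-style formula)."""
    N, _, L = T_all.shape              # (node, node, time) Boolean tensor
    W = np.zeros((r, r))
    for i in range(r):
        for j in range(r):
            li = np.flatnonzero(labels == i)   # points l in S_i
            mj = np.flatnonzero(labels == j)   # points m in S_j
            if len(li) == 0 or len(mj) == 0:
                continue                       # empty cluster: weight 0
            block = T_all[np.ix_(li, mj)]      # all entries T_all(l, m, t)
            W[i, j] = block.sum() / (L * len(li) * len(mj))
    return W
```

A weight of 1 would mean every pair of nodes across the two superpoints is connected at every timestamp, and 0 that the two superpoints are never connected, which is the behaviour a superedge weight in a weighted graph abstract needs.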
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (8)

1.A method for generating a social network diagram abstract based on tensor calculation is characterized by comprising the following steps:
s1, tensor representation is carried out on a social network diagram in a target time period to obtain a target Boolean tensor T G The method comprises the steps of carrying out a first treatment on the surface of the The social network graph is a dynamic undirected unauthorized graph and corresponds to the time stamp one by one;
s2, for a target Boolean tensor T G Tensor decomposition is carried out to obtain a decomposed node matrix N 1 ,N 2
S3, node matrix N 1 Or N 2 Clustering is carried out to obtain a cluster center and the type of each node;
s4, regarding the cluster center as the superpoints of the graph abstract, and calculating the superedge weight between the superpoints to obtain the graph abstract of the social network graph;
wherein, step S2 includes the following steps:
S21, merging the old Boolean tensor T_old and the target Boolean tensor T_G into a Boolean tensor T_all whose last order is the time dimension, the old Boolean tensor T_old being the tensor representation of the social network graph of the previous time period;
S22, performing biased sampling on the Boolean tensor T_all to generate k sub-tensors sT_i;
S23, performing parallel distributed Boolean CP decomposition on each sub-tensor sT_i to obtain the decomposition factor matrices A_1^(i), A_2^(i), ..., A_h^(i) of each sub-tensor;
S24, merging the Boolean decomposition matrices A_1^(i), ..., A_h^(i) of the sub-tensors sT_i with the Boolean decomposition matrices A_1^(old), ..., A_h^(old) of the old Boolean tensor T_old to obtain the Boolean CP decomposition result A_1^(all), ..., A_h^(all) of the new Boolean tensor T_all;
wherein 1 ≤ i ≤ k and 1 ≤ j ≤ h, j indexing the factor matrices.
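For illustration only (not part of the claimed method), the Boolean CP decomposition underlying step S2 approximates a Boolean tensor by the logical OR of rank-1 Boolean outer products. A minimal sketch of the 3-way reconstruction, with all names my own:

```python
import numpy as np

def boolean_cp_reconstruct(A, B, C):
    """Reconstruct a 3-way Boolean tensor from Boolean factor matrices
    A (I x R), B (J x R), C (K x R): OR over R rank-1 AND components."""
    I, R = A.shape
    J, K = B.shape[0], C.shape[0]
    T = np.zeros((I, J, K), dtype=bool)
    for r in range(R):
        # r-th rank-1 Boolean component via broadcasting AND
        T |= (A[:, r][:, None, None]
              & B[:, r][None, :, None]
              & C[:, r][None, None, :])
    return T
```

Under Boolean semantics, overlapping rank-1 components never "over-count" a cell, which is why OR replaces the sum used in ordinary CP decomposition.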
2. The method according to claim 1, wherein the step S22 comprises the steps of:
S221, summing each order of the h-order old Boolean tensor T_old to obtain the count vectors d_1, d_2, ..., d_h;
S222, dividing d_j by the number of non-zero elements in T_old to obtain the sampling probability p_j of each index in order j;
S223, calculating the sampling index size L_j for each order of T_old according to the set sampling factor;
S224, performing L_j samplings on the j-th order index of T_old according to the sampling probability p_j to obtain the sampling index set V_j;
S225, merging the sampling index sets V_1, ..., V_h with the time dimension index of the target Boolean tensor T_G to obtain {V_1, V_2, ..., V_h ∪ {V_new}}, wherein V_new denotes the time dimension index of T_G;
S226, obtaining a sampled sub-tensor according to the index set {V_1, V_2, ..., V_h ∪ {V_new}};
S227, repeating steps S221 to S226 until k sub-tensors are generated.
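The biased sampling of steps S221 to S226 can be sketched as below; the helper names are my own, and the exact normalization is an assumption (indices are drawn with probability proportional to the number of non-zeros they touch):

```python
import numpy as np

def mode_sampling_probs(T):
    """Steps S221-S222: for each order j, the probability of each index,
    proportional to its non-zero count (assumes T has at least one non-zero)."""
    nnz = T.sum()
    probs = []
    for j in range(T.ndim):
        axes = tuple(a for a in range(T.ndim) if a != j)
        counts = T.sum(axis=axes).astype(float)  # non-zeros per index of order j
        probs.append(counts / nnz)               # sums to 1 over the order
    return probs

def sample_subtensor(T, sizes, rng):
    """Steps S223-S226: draw sizes[j] distinct indices per order according
    to the biased probabilities, then slice out the sub-tensor."""
    probs = mode_sampling_probs(T)
    idx = [rng.choice(T.shape[j], size=sizes[j], replace=False, p=p)
           for j, p in enumerate(probs)]
    return T[np.ix_(*idx)], idx
```

Repeating `sample_subtensor` k times (step S227) yields the k sub-tensors decomposed in parallel in step S23.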
3. The method according to claim 1, wherein the step S23 comprises the steps of:
S231, initializing the factor matrices A_1^(i), ..., A_h^(i) of sub-tensor sT_i Y times, each initialization drawing Boolean matrices with non-zero entry probability p, and taking the factor matrices with the smallest reconstruction error as the final initialization;
S232, performing h iterations: in each iteration, fixing (h-1) factor matrices and optimizing the remaining factor matrix to minimize the overall reconstruction error, which completes one round of iteration;
S233, repeating step S232 until the number of iteration rounds reaches k or the iteration error is smaller than e, and returning the Boolean factor matrices A_1^(i), ..., A_h^(i).
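The restart initialization of step S231 can be sketched as follows for the 3-way case; measuring the error as the number of mismatched cells after Boolean reconstruction is my assumption, not a detail stated in the claim:

```python
import numpy as np

def boolean_error(T, factors):
    """Number of cells where the Boolean CP reconstruction disagrees with T
    (3-way case; factors are Boolean matrices sharing a common rank R)."""
    A, B, C = factors
    recon = np.zeros(T.shape, dtype=bool)
    for r in range(A.shape[1]):
        recon |= (A[:, r][:, None, None]
                  & B[:, r][None, :, None]
                  & C[:, r][None, None, :])
    return int(np.sum(recon != T))

def best_of_y_inits(T, rank, p, Y, rng):
    """Step S231 sketch: draw Y random Boolean factor triples whose entries
    are non-zero with probability p; keep the triple with the least error."""
    best, best_err = None, None
    for _ in range(Y):
        factors = [rng.random((n, rank)) < p for n in T.shape]
        err = boolean_error(T, factors)
        if best_err is None or err < best_err:
            best, best_err = factors, err
    return best, best_err
```

The alternating optimization of S232 would then repeatedly fix two of the three matrices and improve the third against this same error measure.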
4. The method of claim 1, wherein step S24 comprises the steps of:
S241, merging the Boolean decomposition matrices A_1^(1), ..., A_h^(1) of sub-tensor sT_1 with the Boolean decomposition matrices A_1^(old), ..., A_h^(old) of the old Boolean tensor T_old to obtain the Boolean decomposition matrix set B_1, ..., B_h;
S242, merging the corresponding matrices of the Boolean decomposition matrices A_1^(2), ..., A_h^(2) of sub-tensor sT_2 with the Boolean decomposition matrix set B_1, ..., B_h, and so on, until the Boolean decomposition matrices A_1^(k), ..., A_h^(k) of sub-tensor sT_k are merged with the current matrix set, yielding the Boolean CP decomposition matrices A_1^(all), ..., A_h^(all) of the new tensor T_all.
5. The method of claim 4, wherein the merging of the Boolean decomposition matrices comprises the following steps:
(1) calculating tensors V and U, wherein v_x is row x of the newly decomposed factor matrix, u_x is the corresponding sampled index row of the old factor matrix, V is the tensor recovered from v_x and the other factor matrices, and U is the tensor recovered from u_x and the other factor matrices;
(2) calculating the reconstruction errors ε_1 and ε_2 of tensors V and U:
ε_1 = ||V - T_x||
ε_2 = ||U - T_x||
wherein T_x is the slice tensor of the corresponding index row;
(3) judging whether ε_1 < ε_2 holds; if so, updating the row u_x of the original tensor factor matrix with v_x; otherwise, not updating.
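The row-update rule of claim 5 can be sketched as follows for one slice of a 3-way tensor; treating the norm as the Hamming (mismatch-count) norm over the slice, and the helper names, are assumptions of mine:

```python
import numpy as np

def slice_error(row, B, C, T_slice):
    """Boolean reconstruction error of one slice T_x, with the other two
    factor matrices B and C held fixed."""
    recon = np.zeros(T_slice.shape, dtype=bool)
    for r in np.flatnonzero(row):
        # only the components active in this row contribute to the slice
        recon |= B[:, r][:, None] & C[:, r][None, :]
    return int(np.sum(recon != T_slice))

def maybe_update_row(old_row, new_row, B, C, T_slice):
    """Claim 5, step (3): keep whichever row reconstructs the
    corresponding slice tensor T_x with the smaller error."""
    eps1 = slice_error(new_row, B, C, T_slice)   # error of V
    eps2 = slice_error(old_row, B, C, T_slice)   # error of U
    return new_row if eps1 < eps2 else old_row
```

Applying this row by row lets the merged decomposition keep the better of the old and newly sampled factors instead of overwriting unconditionally.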
6. The method of claim 1, wherein in step S3, the Hamming distance is selected as the distance metric, the number of cluster centers r is set, and K-Means clustering is adopted to obtain the cluster centers S_i, i = 1, ..., r, and the cluster to which each node belongs.
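A minimal sketch of the clustering in claim 6: K-Means-style assignment of Boolean node rows under Hamming distance, with each center recomputed by elementwise majority (a common adaptation of K-Means to Boolean data; the claim does not specify the center-update rule, so this is an assumption). Initial centers are passed in explicitly for reproducibility:

```python
import numpy as np

def hamming_kmeans(X, init_idx, iters=10):
    """X: (n, d) Boolean matrix of node rows; init_idx: rows used as
    initial centers. Returns (centers, labels)."""
    centers = X[np.asarray(init_idx)].copy()
    r = len(init_idx)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Hamming distance of every row to every center
        dist = (X[:, None, :] != centers[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        # recompute each center as the elementwise majority of its members
        for c in range(r):
            members = X[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0) >= 0.5
    return centers, labels
```

The resulting labels give the cluster of each node, and the centers serve as the superpoints S_i used in step S4.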
7. The method according to any one of claims 1 to 5, wherein step S4 comprises the steps of:
S41, calculating the superedge weight between superpoints in the graph summary, wherein the calculation formula (rendered as an image in the published claim; reconstructed here from the stated variable definitions) is:

w(S_i, S_j) = ( Σ_{t=1}^{L} Σ_{l∈S_i} Σ_{m∈S_j} T_all(l, m, t) ) / ( L · σ(S_i) · σ(S_j) )

wherein S_i, S_j are cluster centers calculated by the clustering algorithm, l and m are nodes in S_i and S_j respectively, L is the time dimension length of the Boolean tensor T_all, N is the number of nodes of T_all, and σ(S_i) is the number of points contained in S_i;
S42, calculating the reconstruction error of the graph summary, wherein the calculation formula (likewise an image in the published claim; one consistent reading) is:

ε = Σ_{t=1}^{L} Σ_{l=1}^{N} Σ_{m=1}^{N} | T_all(l, m, t) - w(S(l), S(m)) |

wherein S(l) denotes the superpoint containing node l;
S43, judging whether the reconstruction error meets the set threshold; if so, taking the clusters as the nodes of the graph summary and the superedge weights as the weights of its edges; otherwise, changing the number of cluster centers and returning to step S3.
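Because the S41 formula appears only as an image in the published claim, the sketch below implements one plausible reading (my assumption): the superedge weight as the average edge density between two clusters over the L timestamps:

```python
import numpy as np

def superedge_weight(T_all, members_i, members_j):
    """T_all: (N, N, L) Boolean adjacency over time; members_*: lists of
    node indices in the two clusters. Returns the fraction of possible
    (node-in-S_i, node-in-S_j, timestamp) edges actually present."""
    L = T_all.shape[2]
    block = T_all[np.ix_(members_i, members_j, np.arange(L))]
    return block.sum() / (L * len(members_i) * len(members_j))
```

A weight of 1.0 would mean the two clusters are fully connected at every timestamp; the S42 reconstruction error then measures how far this density summary deviates from the original tensor cell by cell.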
8. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for generating a social network graph summary based on tensor calculation according to any one of claims 1 to 7.
CN201911373671.1A 2019-12-26 2019-12-26 Tensor calculation-based social network diagram abstract generation method Active CN111159483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911373671.1A CN111159483B (en) 2019-12-26 2019-12-26 Tensor calculation-based social network diagram abstract generation method


Publications (2)

Publication Number Publication Date
CN111159483A CN111159483A (en) 2020-05-15
CN111159483B true CN111159483B (en) 2023-07-04

Family

ID=70558533


Country Status (1)

Country Link
CN (1) CN111159483B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881191B (en) * 2020-08-05 2021-06-11 留洋汇(厦门)金融技术服务有限公司 Client portrait key feature mining system and method under mobile internet
CN112287118B (en) * 2020-10-30 2023-06-02 西南电子技术研究所(中国电子科技集团公司第十研究所) Event mode frequent subgraph mining and prediction method
CN112507245B (en) * 2020-12-03 2023-07-18 中国人民大学 Social network friend recommendation method based on graph neural network
CN113139098B (en) * 2021-03-23 2023-12-12 中国科学院计算技术研究所 Abstract extraction method and system for homogeneity relation large graph
CN113157981B (en) * 2021-03-26 2022-12-13 支付宝(杭州)信息技术有限公司 Graph network relation diffusion method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107545509A (en) * 2017-07-17 2018-01-05 西安电子科技大学 A kind of group dividing method of more relation social networks
CN107656928A (en) * 2016-07-25 2018-02-02 长沙有干货网络技术有限公司 The method that a kind of isomery social networks of user clustering is recommended
CN107767280A (en) * 2017-10-16 2018-03-06 湖北文理学院 A kind of high-quality node detecting method based on element of time
CN109697467A (en) * 2018-12-24 2019-04-30 宁波大学 A kind of summarization methods of complex network figure

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8060512B2 (en) * 2009-06-05 2011-11-15 Xerox Corporation Hybrid tensor-based cluster analysis
US10956500B2 (en) * 2017-01-19 2021-03-23 Google Llc Dynamic-length stateful tensor array
US10268646B2 (en) * 2017-06-06 2019-04-23 Facebook, Inc. Tensor-based deep relevance model for search on online social networks


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pauli Miettinen et al., "Walk'n'Merge: A Scalable Algorithm for Boolean Tensor Factorization", IEEE, 2013 (full text). *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant