CN112765414A - Graph embedding vector generation method and graph embedding-based community discovery method - Google Patents


Info

Publication number
CN112765414A
CN112765414A (application CN202110079198.7A)
Authority
CN
China
Prior art keywords
vertex
vector
graph
graph embedding
generating
Prior art date
Legal status
Pending
Application number
CN202110079198.7A
Other languages
Chinese (zh)
Inventor
于东晓
张喜连
罗琦
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University
Priority to CN202110079198.7A
Publication of CN112765414A
Status: Pending

Classifications

    • G06F16/9024 Indexing; data structures therefor; storage structures: graphs; linked lists
    • G06F16/906 Clustering; classification
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G06F18/23213 Non-hierarchical clustering techniques with a fixed number of clusters, e.g. K-means clustering
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates


Abstract

The invention belongs to the technical field of data processing and relates to a method for generating graph embedding vectors and a graph-embedding-based community discovery method. The method for generating graph embedding vectors comprises the following steps: acquiring the core value of each vertex; acquiring the neighborhood structure information of the vertices and calculating the similarity between vertices; generating vertex sequences based on the similarity between each vertex and its neighbors; and performing word vector training on the vertex sequences to generate an embedding vector for each vertex. In this method, the neighborhood structure information of a vertex is preserved through its core value information, so that structurally similar vertices lie closer together in the embedding space. Communities are then discovered by clustering or classifying the obtained graph embedding vectors.

Description

Graph embedding vector generation method and graph embedding-based community discovery method
Technical Field
The invention belongs to the technical field of data processing, and relates to a method for generating a graph embedding vector and a community discovery method based on graph embedding.
Background
In the internet era, deep learning techniques have been applied to hundreds of practical problems over the past few years, from computer vision to natural language processing. Graph databases are also increasingly used in social networking, e-commerce and other fields because of their excellent performance in handling relationships between data items. In a graph network, a subgraph corresponding to a subset of nodes with relatively close internal connections is called a community, and the process of finding such community structures in a graph is called community discovery. Naturally, combining deep learning with community discovery on graph data has become a subject of research. However, raw graph data cannot be fed directly into a deep learning model; it first needs to be converted into sequence data.
Graph embedding techniques embed a graph into a vector space, representing it as low-dimensional vectors while preserving the structural information of the graph. Current graph embedding techniques can be roughly divided into three types: those based on matrix factorization, those based on random walks, and those based on neural network models, with specific algorithms such as LINE, DeepWalk and SDNE. Most existing methods treat the similarity between vertex pairs as the feature information of the vertices, but do not consider the neighborhood structure information of the vertices.
The core value of a vertex reflects, to some extent, the neighborhood structure of that vertex. A vertex with core value k has at least k neighbors, each of degree greater than or equal to k. Moreover, a vertex with core value k lies in the k-core subgraph, a dense subgraph in which every vertex has degree at least k. In summary, the core value of a vertex reflects the extent to which it belongs to a dense subgraph, and can therefore reflect the neighborhood structure of the vertex.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a core-value-based method for generating graph embedding vectors and a graph-embedding-based community discovery method.
In order to achieve the above purpose, one of the technical solutions provided by the present invention is: a method of generating graph embedding vectors, the method comprising:
acquiring a core value of a vertex;
acquiring neighborhood structure information of the vertexes, and calculating the similarity between the vertexes;
generating a vertex sequence based on the similarity of the vertex and the adjacent neighbor thereof;
and carrying out word vector training on the vertex sequence to generate an embedded vector of each vertex.
As a preferred embodiment of the present invention, the core value of a vertex is calculated by:
calculating the degrees of all vertices;
selecting a vertex of minimum degree, the core value of that vertex being its current degree;
and traversing the neighbors of the vertex selected in the previous step and, for every neighbor whose degree is greater than that of the selected vertex, subtracting 1 from the neighbor's degree.
Further preferably, the similarity between vertices is calculated by:
acquiring the sets of vertices at distance 1, 2, …, k from the vertex u, i.e. the hop neighbor sets N_1(u), N_2(u), …, N_k(u) of u;
acquiring, for each hop neighbor set N_k(u), the distribution of the core values of its vertices, represented as a vector l_u^k, where l_u^k[t] is the number of vertices in N_k(u) whose core value is t;
multiplying the vector of each hop neighbor set of u by a decay coefficient and summing them into an overall vector d_u; the larger the hop count, i.e. the larger k, the smaller the influence of that neighborhood information on the structure around the vertex, and therefore the smaller the decay coefficient;
and calculating the Euclidean distance between the vectors corresponding to the vertices u and v, and from it the similarity between the two vertices.
Further preferably, the vertex sequences are generated by:
calculating the probability of walking from a vertex to each adjacent vertex according to the similarity between the vertices;
and generating vertex sequences with a random walk model, using the obtained probabilities as the preset transition probabilities.
In order to further achieve the object of the present invention, the present invention also provides a graph embedding-based community discovery method, including:
acquiring a training sample set;
obtaining a graph embedding vector corresponding to each training sample, wherein the graph embedding vector is generated by adopting the method for generating the graph embedding vector provided by the invention;
and taking the graph embedding vector of the training sample as input data, and training a preset network model to obtain a community discovery model.
The preset network model is a clustering algorithm model, similar vertex vectors are summarized together, a plurality of different cluster classes are further obtained, and each cluster class represents a community.
Further preferably, under the condition that the labels of the training samples are known, each vertex vector is allocated to a community, and then the community to which the new vertex belongs is predicted to complete community discovery.
The invention has the following beneficial effects. Compared with the prior art, the invention provides a core-value-based method for generating graph embedding vectors, together with clustering-based and classification-based methods for processing the obtained graph embedding vectors into a number of distinct communities whose internal vertices are closely connected and highly similar, thereby achieving the goal of community discovery.
Drawings
FIG. 1 is a schematic flow chart of a community discovery method based on graph embedding according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating a method for generating a kernel-based graph embedding vector according to an embodiment of the present invention;
fig. 3 is an exemplary diagram of data samples in the method for generating a graph embedding vector according to this embodiment;
fig. 4 is a schematic flowchart of the k-means cluster-based community discovery provided in this embodiment;
fig. 5 is a schematic flowchart of the KNN classification-based community discovery provided in this embodiment.
Detailed Description
In order to facilitate an understanding of the invention, the invention is described in more detail below with reference to the accompanying drawings and specific examples. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Embodiment 1. This embodiment provides a core-value-based method for generating graph embedding vectors; the flow is shown in Fig. 2. First, the core values of all vertices are computed by core decomposition. Then the neighborhood structure information of the vertices is acquired, and the structural similarity between pairs of vertices is calculated from it. Next, the probability of walking from a vertex to each of its neighbors is computed from this structural similarity, and vertex sequences are generated by biased walks. Finally, word vector training is performed on the resulting vertex sequences to generate an embedding vector for each vertex.
Based on the overall process, the method comprises the following steps:
s201, calculating the kernel values of all vertexes:
specifically, the kernel values of all vertices are calculated in the graph using a kernel value decomposition method. The method for nuclear value decomposition specifically comprises the following steps: firstly, calculating degrees of all vertexes; secondly, selecting a vertex with the minimum degree, wherein the core value of the vertex is the value of the degree; thirdly, traversing the neighbor vertex of the vertex in the second step, and if the degree of a certain neighbor vertex is greater than that of the vertex, subtracting 1 from the degree of the neighbor vertex; fourthly, repeating the second step and the third step.
S202, acquiring the k-hop neighbor sets of a vertex:
Specifically, the k-hop neighbors of a vertex u are the vertices at distance k from u. Taking vertex u as an example, its multi-hop neighbor sets are obtained by a breadth-first traversal of the graph starting from u. The first layer consists of the neighbors directly connected to u, at distance 1; these are the 1-hop neighbors of u, denoted N_1(u). The vertices directly connected to a 1-hop neighbor of u and at distance 2 from u are its 2-hop neighbors, denoted N_2(u). By analogy, the vertices at distance k from u form the k-hop neighbor set of u, denoted N_k(u).
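The breadth-first collection of hop neighbor sets can be sketched as below (an illustrative helper, with assumed names, returning {k: N_k(u)} for k = 1..K):

```python
from collections import deque

def k_hop_neighbors(adj, u, K):
    """BFS from u; hops[k] is the set of vertices at exactly
    distance k from u, i.e. N_k(u)."""
    dist = {u: 0}
    hops = {k: set() for k in range(1, K + 1)}
    q = deque([u])
    while q:
        v = q.popleft()
        if dist[v] == K:        # no need to expand beyond K hops
            continue
        for w in adj[v]:
            if w not in dist:   # first visit gives the shortest distance
                dist[w] = dist[v] + 1
                hops[dist[w]].add(w)
                q.append(w)
    return hops
```

On a path A-B-C-D, the hop sets of A are N_1 = {B}, N_2 = {C}, N_3 = {D}.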
S203, obtaining the core value distribution vectors over the hop neighbor sets of a vertex:
Specifically, the core value distribution of the vertices in each hop neighbor set of u is counted and represented as a vector, which captures part of the structural information of u. For the vertices in N_k(u), a D-dimensional vector l_u^k represents the distribution of their core values, where D is the maximum core value of any vertex in the graph and l_u^k[i] is the number of vertices whose core value is i. Each l_u^k thus encodes the k-hop neighborhood structure information of u; the vectors obtained for the different values of k are integrated into a comprehensive vector c_u that represents the overall neighborhood structure information of u, and two vertices are considered structurally similar if their vectors are close in distance:

c_u = Σ_k δ_{k-1} · l_u^k

where δ ∈ (0, 1) is a coefficient controlling the influence of the neighbors at different hops on the comprehensive vector; by choosing decreasing δ values, vertices farther from vertex u contribute less to the structural information of u. For example, as shown in Fig. 3, the 1-hop neighbors of vertex A are B, C and E, i.e. N_1(A) = {B, C, E}. The core values of B and C are both 2 and the core value of E is 1, i.e. there are two vertices with core value 2 and one vertex with core value 1, so l_A^1 = (1, 2). The 2-hop neighbors of A are D and F, where D has core value 2 and F has core value 1, so l_A^2 = (1, 1). Integrating the two vectors with δ_0 = 1 and δ_1 = 0.5 gives c_A = 1·(1, 2) + 0.5·(1, 1) = (1.5, 2.5).
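The distribution vectors and their integration can be sketched as follows, reproducing the worked example for vertex A (only the core values and hop sets stated in the text are used; the full edge structure of Fig. 3 is not needed, and the function names are assumptions):

```python
def core_distribution(vertex_set, core, D):
    """l[t-1] = number of vertices in the set with core value t."""
    l = [0] * D
    for v in vertex_set:
        l[core[v] - 1] += 1
    return l

def combined_vector(hop_sets, core, D, deltas):
    """c_u = sum over k of delta_{k-1} * l_u^k (componentwise)."""
    c = [0.0] * D
    for k, vs in sorted(hop_sets.items()):
        l = core_distribution(vs, core, D)
        c = [ci + deltas[k - 1] * li for ci, li in zip(c, l)]
    return c
```

With core values B=2, C=2, E=1, D=2, F=1, hop sets N_1(A)={B,C,E}, N_2(A)={D,F}, D=2 and decay coefficients (1, 0.5), this yields c_A = [1.5, 2.5], matching the example above.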
S204, calculating the similarity between two vertices from their core value distribution vectors:
Specifically, the structure vectors c_u and c_v of two vertices u and v are compared through a similarity function s(u, v) that decreases with the Euclidean distance between them (one such choice is s(u, v) = e^(−‖c_u − c_v‖)). The larger the distance between c_u and c_v, the more dissimilar the structures of vertex u and vertex v; the smaller the distance, the more similar their structures, which in turn suggests that u and v may play similar roles and functions in the network.
S205, calculating the probability of walking from a vertex to each adjacent vertex, and obtaining vertex sequences by biased walks:
Specifically, a random walk model can be used to capture the topological structure of the graph. A random walk selects a vertex of the graph as its first step and moves to a neighboring vertex according to a preset transition probability; taking the vertex reached after each step as the new starting point, it again moves to a neighbor according to the preset transition probability, and so on until a preset stopping condition is met, yielding a number of random walk sequences for each vertex.
Specifically, in the present embodiment, the preset transition probability is defined as:

p(u, v) = s(u, v) / Σ_{w∈N(u)} s(u, w)

where s(u, v) is the structural similarity between vertex u and vertex v, Σ_{w∈N(u)} s(u, w) is the sum of the structural similarities between u and all of its neighbors, and p(u, v) is the probability of walking from u to v. The higher the structural similarity between u and v, the greater the probability of walking from u to v.
Specifically, the preset stopping condition of the random walk consists of the maximum walk length and the number of walk sequences generated per vertex.
Specifically, in one implementation of this embodiment, the random walks for a given vertex proceed as follows: first, starting from that vertex, move to a neighboring vertex according to the preset probability; second, taking the vertex reached as the new starting point, again move to a neighbor according to the preset probability, and so on until the preset condition is met, yielding several vertex sequences for that vertex.
For example, with 10 random walk sequences per vertex and a walk length of 80, sampling 10 walks from every vertex of the graph produces random vertex sequences of length 80; these sequences implicitly encode high-order adjacency relations between vertices.
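The similarity-biased walk of step S205 can be sketched as below. The exponential similarity is one plausible choice (the text only requires a function decreasing in the Euclidean distance between structure vectors), and all names are illustrative:

```python
import math
import random

def similarity(cu, cv):
    # s(u, v) decreasing in the distance between structure vectors;
    # exp(-distance) is an assumed concrete form.
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(cu, cv)))
    return math.exp(-d)

def transition_probs(u, adj, c):
    # p(u, v) = s(u, v) / sum over neighbors w of s(u, w).
    w = {v: similarity(c[u], c[v]) for v in adj[u]}
    total = sum(w.values())
    return {v: s / total for v, s in w.items()}

def biased_walk(start, adj, c, length, rng):
    # Repeatedly step to a neighbor with the preset transition
    # probability until the walk reaches the given length.
    walk = [start]
    while len(walk) < length:
        probs = transition_probs(walk[-1], adj, c)
        verts = list(probs)
        walk.append(rng.choices(verts, weights=[probs[v] for v in verts])[0])
    return walk
```

A neighbor whose structure vector matches the current vertex exactly receives a strictly larger transition probability than one at positive distance, which is exactly the bias the method intends.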
And S206, carrying out word vector training on the vertex sequence to generate an embedded vector of each vertex.
Specifically, after the random walk sequences of all vertices have been obtained in step S205, graph embedding is performed on the vertices with a word vector model, yielding the embedding vector of each vertex.
Specifically, in this embodiment, the embedding can be produced with a Skip-Gram model, which treats each "target word, context word" combination as a new observation and predicts the context words from the target word.
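The "target, context" observations that Skip-Gram trains on can be extracted from the walk sequences as sketched below (an illustrative helper; in practice the training itself would be delegated to a word-vector library such as gensim's Word2Vec with the skip-gram option):

```python
def skipgram_pairs(walks, window):
    """Every (target, context) combination within the given window
    of each walk is one Skip-Gram training observation."""
    pairs = []
    for walk in walks:
        for i, target in enumerate(walk):
            lo, hi = max(0, i - window), min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, walk[j]))
    return pairs
```

For the single walk [a, b, c] and window 1, the observations are (a,b), (b,a), (b,c), (c,b).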
Embodiment 2. The flow of the graph-embedding-based community discovery method provided in this embodiment is shown in Fig. 1, taking k-means clustering as an example. Fig. 4 is a flowchart of this method, which comprises the following steps:
S401, obtaining the graph embedding vectors.
Specifically, the embedding vectors of the training sample data are obtained with the core-value-based method for generating graph embedding vectors of Embodiment 1.
S402, embedding vectors of a graph of a training sample as input data, and randomly selecting k vertex vectors as a clustering center by adopting a k-means clustering method.
Specifically, k vertex vectors are randomly selected as centers of the clusters, and the k centers respectively belong to k different communities.
And S403, distributing other vertex vectors to communities where the nearest clustering centers are located, and recalculating each clustering center.
Specifically, for each vertex vector in the data set, the distance between the vertex vector and each vertex vector serving as a cluster center is calculated, and the vertex vector is assigned to the community in which the cluster center closest to the vertex vector is located.
Specifically, the distance between vectors can be calculated with the Manhattan distance, the Euclidean distance, or another distance measure.
Specifically, all k cluster centers after completion of community allocation are recalculated.
S404, judging whether the distance between each new cluster center and the corresponding old one is smaller than a threshold Δ.
Given the threshold Δ, judge whether the distance between each recalculated cluster center and the original one is smaller than Δ. If so, the cluster centers have barely moved and the algorithm has converged; otherwise, repeat steps S402 to S404.
S405, outputting k communities.
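Steps S402 to S405 can be sketched as a plain k-means loop over the embedding vectors (a minimal illustration with assumed names; a library implementation would normally be used):

```python
import math
import random

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, tol=1e-4, max_iter=100, seed=0):
    rng = random.Random(seed)
    # S402: k randomly chosen vectors serve as the initial centers.
    centers = rng.sample(vectors, k)
    for _ in range(max_iter):
        # S403: assign every vector to its nearest cluster center...
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k), key=lambda j: euclid(v, centers[j]))
            clusters[i].append(v)
        # ...then recompute each center as the mean of its community.
        new_centers = [
            [sum(xs) / len(c) for xs in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        # S404: stop once no center moved more than the threshold.
        if max(euclid(a, b) for a, b in zip(centers, new_centers)) < tol:
            break
        centers = new_centers
    return clusters  # S405: the k communities
```

On two well-separated groups of points the loop converges in a few iterations to the expected two communities.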
Embodiment 3. The graph-embedding-based community discovery method provided in this embodiment takes KNN classification as an example. Fig. 5 is a flowchart of this method, which comprises the following steps:
s501, obtaining a graph embedding vector.
Specifically, the embedding vectors of the training sample data are obtained with the core-value-based method for generating graph embedding vectors of Embodiment 1.
S502, taking the graph embedding vectors of the training samples as input data and, with the KNN classification method, calculating the distance between a new vector and every vector in the training set.
Specifically, every vector in the training set is labelled, i.e. the community to which it belongs is known. For each new vector whose community is unknown, its distance to every vector in the training set is calculated; the Euclidean distance, the Manhattan distance or another distance measure may be used.
S503, selecting the K vectors at the smallest distance and counting the frequencies of their communities.
Specifically, the computed distances between the new vector and the training vectors are sorted in increasing order and the first K vectors are selected. The occurrence frequencies of the communities to which these K vectors belong are tallied, and the most frequent community is taken as the community of the new vector.
S504, determining the community of each new vertex vector.
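Steps S502 to S504 can be sketched as below (an illustrative helper with assumed names; the training set is taken to be a list of (vector, community) pairs, and Euclidean distance is used):

```python
import math
from collections import Counter

def knn_community(new_vec, train, K):
    """Return the most frequent community among the K training
    vectors nearest to new_vec."""
    def euclid(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # S502/S503: sort the labelled vectors by distance, keep the
    # first K, and tally the communities they belong to.
    nearest = sorted(train, key=lambda t: euclid(new_vec, t[0]))[:K]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

A new vector near the labelled 'c1' vectors is assigned to 'c1', and one coinciding with a 'c2' vector to 'c2'.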

Claims (7)

1. A method for generating graph embedding vectors, comprising:
acquiring a core value of a vertex;
acquiring neighborhood structure information of the vertexes, and calculating the similarity between the vertexes;
generating a vertex sequence based on the similarity of the vertex and the adjacent neighbor thereof;
and carrying out word vector training on the vertex sequence to generate an embedded vector of each vertex.
2. The method for generating graph embedding vectors according to claim 1, wherein the core value of a vertex is calculated by:
calculating the degrees of all vertices;
selecting a vertex of minimum degree, the core value of that vertex being its current degree;
and traversing the neighbors of the vertex selected in the previous step and, for every neighbor whose degree is greater than that of the selected vertex, subtracting 1 from the neighbor's degree.
3. The method for generating graph embedding vectors according to claim 2, wherein the similarity between vertices is calculated by:
obtaining the k-hop neighbor sets N_1(u), …, N_k(u) of the vertex u;
obtaining, for each set N_k(u), the core value distribution vector l_u^k of its vertices, where l_u^k[t] is the number of vertices in N_k(u) whose core value is t;
multiplying the vector of each hop neighbor set of u by a decay coefficient and summing them into an overall vector d_u; the larger the hop count, i.e. the larger k, the smaller the influence of that neighborhood information on the structure around the vertex, and therefore the smaller the decay coefficient;
and calculating the Euclidean distance between the vectors corresponding to the vertex u and the vertex v, and from it the similarity between the two vertices.
4. The method for generating graph embedding vectors according to claim 3, wherein the vertex sequences are generated by:
calculating the probability of walking from a vertex to each adjacent vertex according to the similarity between the vertices;
and generating vertex sequences with a random walk model, using the obtained probabilities as the preset transition probabilities.
5. A graph embedding-based community discovery method is characterized by comprising the following steps:
acquiring a training sample set;
obtaining a graph embedding vector corresponding to each training sample, wherein the graph embedding vector is generated by adopting the method of any one of claims 1-4;
and taking the graph embedding vector of the training sample as input data, and training a preset network model to obtain a community discovery model.
6. The graph-embedding-based community discovery method according to claim 5, wherein the preset network model is a clustering algorithm model, similar vertex vectors are summarized together, and a plurality of different cluster classes are obtained, wherein each cluster class represents a community.
7. The graph-embedding-based community discovery method according to claim 6, wherein, when the labels of the training samples are known, each vertex vector is assigned to a community, and the community to which a new vertex belongs is then predicted to complete community discovery.
CN202110079198.7A 2021-01-21 2021-01-21 Graph embedding vector generation method and graph embedding-based community discovery method Pending CN112765414A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110079198.7A CN112765414A (en) 2021-01-21 2021-01-21 Graph embedding vector generation method and graph embedding-based community discovery method


Publications (1)

Publication Number Publication Date
CN112765414A true CN112765414A (en) 2021-05-07

Family

ID=75702097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110079198.7A Pending CN112765414A (en) 2021-01-21 2021-01-21 Graph embedding vector generation method and graph embedding-based community discovery method

Country Status (1)

Country Link
CN (1) CN112765414A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115061836A (en) * 2022-08-16 2022-09-16 浙江大学滨海产业技术研究院 Micro-service splitting method based on graph embedding algorithm for interface layer
CN115061836B (en) * 2022-08-16 2022-11-08 浙江大学滨海产业技术研究院 Micro-service splitting method based on graph embedding algorithm for interface layer


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210507)