CN111860866A - A network representation learning method and device with community structure - Google Patents
- Publication number: CN111860866A
- Application number: CN202010723330.9A
- Authority: CN (China)
- Legal status: Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
Description
Technical Field
The present invention relates to the field of computer technology, and in particular to a network representation learning method and device with a community structure.
Background Art
Many complex systems can be abstracted as networks, and a network is usually represented by a graph, i.e., a set of nodes and a set of edges. On small networks we can quickly perform many complex tasks, such as community mining and multi-label classification. For large networks (for example, networks with billions of vertices), however, performing such tasks is a challenge. To address this, we need a concise yet effective alternative representation of the network. Network embedding is one effective strategy: it learns low-dimensional vector representations of the vertices of a network. For each vertex, we map its structural features in the network to a low-dimensional vector, and then apply these vectors to complex tasks on the network.
In the past few years, many network embedding methods have been proposed to characterize the local structure of a network. DeepWalk characterizes the neighborhood structure of network vertices by a truncated random walk strategy. Node2vec shows that DeepWalk cannot capture the diversity of connection patterns in a network; it proposes a biased random walk strategy that combines BFS and DFS ideas to explore vertex neighborhoods. LINE is mainly applied to large-scale network embedding; it preserves higher-order vertex neighborhood structure and scales easily to millions of vertices. Cao et al. proposed a deep graph representation model that uses a random-surfing strategy to capture the structural information of a graph. Feng et al. proposed the "degree penalty" principle, which preserves the scale-free property by penalizing the proximity between high-degree vertices. Wang et al. proposed a semi-supervised deep model that captures highly nonlinear network structure by optimizing multiple layers of nonlinear functions. Yanardag et al. proposed a general framework for capturing mid-level similar structures. In addition, several methods have been proposed to preserve the global network structure. Wang et al. proposed a modular non-negative matrix factorization model that preserves the community structure of the network. Tu et al. proposed a heuristic community-enhancement mechanism that maps community structure information into the vertex vector representations. Chen et al. proposed a multi-level network representation learning paradigm that gradually merges the initial network into smaller but structurally similar networks while recapturing the global structure of the initial network.
Summary of the Invention
The technical problem to be solved by the present invention is that the three prior-art network embedding methods that characterize the local structure of a network cannot capture the community structure of the network well, and cannot achieve higher accuracy in vertex classification tasks. The purpose of the invention is to provide a network representation learning method and device with community structure that solve the above problems.
The present invention is realized through the following technical solutions:
A network representation learning method with community structure, comprising the following steps:
Step 1, data collection and processing: using a density function and a random walk strategy on the network G, obtain vertex sequence samples S = {s_1, s_2, ..., s_n};
Step 2, data representation learning: optimize the Skip-gram model and use it to train on the vertex sequence samples S = {s_1, s_2, ..., s_n}, obtaining a vector representation for each vertex sequence;
Step 3, data calculation: compute similarities between the vector representations of the vertex sequences to obtain the community partition similarity.
Further, in the above method, each vertex sequence in the sample set S = {s_1, s_2, ..., s_n} of Step 1 is written as s = {v_1, v_2, ..., v_{|s|}}.
Further, in the above method, the density function in Step 1 is defined as

f_s = \frac{k_{in}^{s}}{(k_{in}^{s} + k_{out}^{s})^{\alpha}}

where k_{in}^{s} and k_{out}^{s} are, respectively, the sums of the internal and external degrees of all vertices in the vertex sequence s, and α is a resolution parameter that controls the size of the communities;
The density function admits a density gain Δf_s^v, which satisfies the formula

Δf_s^v = f_{s+{v}} − f_s

where s + {v} denotes the new vertex sequence obtained by adding vertex v to s.
Further, in the above method, the specific steps for obtaining a vertex sequence sample in Step 1 are:
Step 11: randomly select a vertex v_{|s|+1} from the set N′(v_{|s|});
Step 12: compute Δf_s^{v_{|s|+1}} according to the formula Δf_s^v = f_{s+{v}} − f_s;
Step 13: if Δf_s^{v_{|s|+1}} < 0, delete v_{|s|+1} from the set N′(v_{|s|}) and return to Step 11;
Step 14: if Δf_s^{v_{|s|+1}} > 0, add v_{|s|+1} to the sequence s and mark v_{|s|+1} as the current vertex;
where the vertex v_{|s|} is the vertex added last, which is taken as the current vertex, and N′(v_{|s|}) denotes the set of all neighbors of the current vertex v_{|s|} that are not in s. Steps 11 to 14 are repeated until the density of the vertex sequence s can no longer be increased.
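As an illustration, Steps 11 to 14 can be sketched as a greedy sampling loop. This is a hypothetical reconstruction: the graph representation, the function names, and the fitness-style density f_s = k_in / (k_in + k_out)^α are our assumptions, not the patent's reference implementation.

```python
import random

def density(graph, seq, alpha):
    """Fitness-style density of a vertex sequence: k_in / (k_in + k_out)^alpha.
    k_in counts edge endpoints inside seq, k_out counts endpoints leaving seq."""
    inside = set(seq)
    k_in = k_out = 0
    for v in seq:
        for u in graph[v]:
            if u in inside:
                k_in += 1
            else:
                k_out += 1
    if k_in + k_out == 0:
        return 0.0
    return k_in / (k_in + k_out) ** alpha

def sample_sequence(graph, start, alpha=1.0):
    """Grow a vertex sequence greedily (Steps 11-14): add a randomly chosen
    candidate neighbor only if it increases the density; stop when no
    candidate can increase it any further."""
    seq = [start]
    candidates = set(graph[start]) - set(seq)
    while candidates:
        v = random.choice(sorted(candidates))                 # Step 11
        gain = density(graph, seq + [v], alpha) - density(graph, seq, alpha)  # Step 12
        if gain <= 0:
            candidates.discard(v)                             # Step 13
        else:
            seq.append(v)                                     # Step 14: v is now current
            candidates = set(graph[v]) - set(seq)
    return seq
```

On a graph made of two triangles joined by a single edge, a walk started inside one triangle tends to stay there, since crossing the bridge lowers the density.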
Further, in the above method, in Step 2 the Skip-gram model trains on the vertex sequence samples by minimizing the objective function

\min_{\Phi} \sum_{s \in S} \sum_{v_i \in s} \sum_{-t \le j \le t,\, j \ne 0} -\log p(v_{i+j} \mid v_i)

where t is the window size and v_j is a vertex in the context of v_i within the window; the probability p(v_j | v_i) in the above formula is defined as

p(v_j \mid v_i) = \frac{\exp(\Phi'(v_j) \cdot \Phi(v_i))}{\sum_{u \in V} \exp(\Phi'(u) \cdot \Phi(v_i))}

where Φ(·) denotes the embedding vector of a vertex, Φ′(·) denotes its context vector, and V is the vertex set of the network.
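To make the objective concrete, the following sketch evaluates the softmax probability p(v_j | v_i) and the windowed negative log-likelihood for toy, hand-set embeddings. In practice Φ and Φ′ are learned by gradient descent; the vectors and names here are illustrative assumptions only.

```python
import math

# Toy embeddings: Phi maps a vertex to its embedding vector,
# Phi_ctx to its context vector (both would be learned in practice).
Phi     = {"a": [0.5, 0.1], "b": [0.4, 0.2], "c": [-0.3, 0.8]}
Phi_ctx = {"a": [0.2, 0.0], "b": [0.1, 0.3], "c": [0.0, -0.4]}

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def p_context(vj, vi):
    """Softmax probability p(v_j | v_i) over all vertices."""
    num = math.exp(dot(Phi_ctx[vj], Phi[vi]))
    den = sum(math.exp(dot(Phi_ctx[u], Phi[vi])) for u in Phi)
    return num / den

def skipgram_loss(sequences, t=1):
    """Objective: sum over sequences and window pairs of -log p(v_{i+j} | v_i)."""
    loss = 0.0
    for s in sequences:
        for i, vi in enumerate(s):
            for j in range(-t, t + 1):
                if j != 0 and 0 <= i + j < len(s):
                    loss -= math.log(p_context(s[i + j], vi))
    return loss
```

The probabilities for a fixed center vertex sum to 1, and minimizing the loss pushes vertices that co-occur in sampled sequences toward similar embeddings.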
Further, in the above method, computing the similarity of the vector representations in Step 3 specifically comprises: for the vector representation of each vertex sequence in the network, computing its degree of similarity to the vector representations of the other vertex sequences, using the NMI formula to calculate the similarity.
A network representation learning device with community structure, comprising:
a data collection and processing module, configured to read vertex sequence samples and obtain the vertex sequence samples S = {s_1, s_2, ..., s_n};
a data representation learning module, configured to optimize the Skip-gram model and use it to train on the vertex sequence samples S = {s_1, s_2, ..., s_n}, obtaining a vector representation for each vertex sequence;
a similarity calculation module, configured to compute similarities between the vector representations of the vertex sequences to obtain the community partition similarity.
The method of the present invention uses the following NMI formula to calculate similarity:
The Normalized Mutual Information (NMI) measure is an information-theoretic index of the similarity between two community partitions A and B, defined as

NMI(A,B) = \frac{-2 \sum_{i=1}^{C_A} \sum_{j=1}^{C_B} C_{ij} \log\!\left( \frac{C_{ij} N}{C_{i.} C_{.j}} \right)}{\sum_{i=1}^{C_A} C_{i.} \log(C_{i.}/N) + \sum_{j=1}^{C_B} C_{.j} \log(C_{.j}/N)}

where C is the confusion matrix whose rows correspond to the "real" communities and whose columns correspond to the "detected" communities, and N is the number of nodes. C_{ij} is the number of vertices shared by real community i and detected community j. C_A and C_B are the numbers of real and detected communities, respectively, and C_{i.} and C_{.j} denote the sums of the i-th row and j-th column of C. NMI ranges from 0 to 1: it equals 1 if the detected partition is identical to the real partition, and 0 otherwise.
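A direct implementation of this NMI formula, building the confusion matrix from two label assignments, can be sketched as follows (the function and variable names are ours):

```python
import math

def nmi(true_labels, found_labels):
    """Normalized mutual information between two partitions, computed from
    the confusion matrix C (rows: real communities, columns: detected ones)."""
    N = len(true_labels)
    rows = sorted(set(true_labels))
    cols = sorted(set(found_labels))
    C = {(a, b): 0 for a in rows for b in cols}
    for a, b in zip(true_labels, found_labels):
        C[(a, b)] += 1
    row_sum = {a: sum(C[(a, b)] for b in cols) for a in rows}
    col_sum = {b: sum(C[(a, b)] for a in rows) for b in cols}
    # Numerator: -2 * sum_ij C_ij * log(C_ij * N / (C_i. * C_.j)), skipping C_ij = 0
    num = -2.0 * sum(
        C[(a, b)] * math.log(C[(a, b)] * N / (row_sum[a] * col_sum[b]))
        for a in rows for b in cols if C[(a, b)] > 0
    )
    # Denominator: sum_i C_i. log(C_i./N) + sum_j C_.j log(C_.j/N)
    den = sum(row_sum[a] * math.log(row_sum[a] / N) for a in rows) \
        + sum(col_sum[b] * math.log(col_sum[b] / N) for b in cols)
    return num / den if den != 0 else 1.0
```

Identical partitions (up to relabeling) score 1, and independent partitions score 0.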
Compared with the prior art, the present invention has the following advantages and beneficial effects:
The method of the present invention not only captures the community structure of a network better, but also achieves higher accuracy in vertex classification tasks.
Brief Description of the Drawings
The drawings described here are provided for further understanding of the embodiments of the present invention and constitute a part of the present application; they do not limit the embodiments of the present invention. In the drawings:
FIG. 1 is an overall flow chart of the present invention.
FIG. 2 shows the NMI on the artificial networks and the improvement ratio of Q-Walker, with parameter α = 1.5.
FIG. 3 shows the optimal interval of the parameter α on the four real networks.
Detailed Description of the Embodiments
To make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments and the accompanying drawings. The illustrative embodiments and their descriptions are intended only to explain the present invention and do not limit it.
Embodiment
As shown in FIG. 1, a network representation learning method with community structure comprises the following steps:
Step 1, data collection and processing: using a density function and a random walk strategy on the network G, obtain vertex sequence samples S = {s_1, s_2, ..., s_n};
Step 2, data representation learning: optimize the Skip-gram model and use it to train on the vertex sequence samples S = {s_1, s_2, ..., s_n}, obtaining a vector representation for each vertex sequence;
Step 3, data calculation: compute similarities between the vector representations of the vertex sequences to obtain the community partition similarity.
Further, in the above method, each vertex sequence in the sample set S = {s_1, s_2, ..., s_n} of Step 1 is written as s = {v_1, v_2, ..., v_{|s|}}.
Further, in the above method, the density function in Step 1 is defined as

f_s = \frac{k_{in}^{s}}{(k_{in}^{s} + k_{out}^{s})^{\alpha}}

where k_{in}^{s} and k_{out}^{s} are, respectively, the sums of the internal and external degrees of all vertices in the vertex sequence s, and α is a resolution parameter that controls the size of the communities;
The density function admits a density gain Δf_s^v, which satisfies the formula

Δf_s^v = f_{s+{v}} − f_s

where s + {v} denotes the new vertex sequence obtained by adding vertex v to s.
Further, in the above method, the specific steps for obtaining a vertex sequence sample in Step 1 are:
Step 11: randomly select a vertex v_{|s|+1} from the set N′(v_{|s|});
Step 12: compute Δf_s^{v_{|s|+1}} according to the formula Δf_s^v = f_{s+{v}} − f_s;
Step 13: if Δf_s^{v_{|s|+1}} < 0, delete v_{|s|+1} from the set N′(v_{|s|}) and return to Step 11;
Step 14: if Δf_s^{v_{|s|+1}} > 0, add v_{|s|+1} to the sequence s and mark v_{|s|+1} as the current vertex;
where the vertex v_{|s|} is the vertex added last, which is taken as the current vertex, and N′(v_{|s|}) denotes the set of all neighbors of the current vertex v_{|s|} that are not in s. Steps 11 to 14 are repeated until the density of the vertex sequence s can no longer be increased.
Further, in the above method, in Step 2 the Skip-gram model trains on the vertex sequence samples by minimizing the objective function

\min_{\Phi} \sum_{s \in S} \sum_{v_i \in s} \sum_{-t \le j \le t,\, j \ne 0} -\log p(v_{i+j} \mid v_i)

where t is the window size and v_j is a vertex in the context of v_i within the window; the probability p(v_j | v_i) in the above formula is defined as

p(v_j \mid v_i) = \frac{\exp(\Phi'(v_j) \cdot \Phi(v_i))}{\sum_{u \in V} \exp(\Phi'(u) \cdot \Phi(v_i))}

where Φ(·) denotes the embedding vector of a vertex, Φ′(·) denotes its context vector, and V is the vertex set of the network.
Further, in the above method, computing the similarity of the vector representations in Step 3 specifically comprises: for the vector representation of each vertex sequence in the network, computing its degree of similarity to the vector representations of the other vertex sequences, using the NMI formula to calculate the similarity.
A network representation learning device with community structure, comprising:
a data collection and processing module, configured to read vertex sequence samples and obtain the vertex sequence samples S = {s_1, s_2, ..., s_n};
a data representation learning module, configured to optimize the Skip-gram model and use it to train on the vertex sequence samples S = {s_1, s_2, ..., s_n}, obtaining a vector representation for each vertex sequence;
a similarity calculation module, configured to compute similarities between the vector representations of the vertex sequences to obtain the community partition similarity.
In this embodiment, the similarity is calculated with the following NMI formula:
The Normalized Mutual Information (NMI) measure is an information-theoretic index of the similarity between two community partitions A and B, defined as

NMI(A,B) = \frac{-2 \sum_{i=1}^{C_A} \sum_{j=1}^{C_B} C_{ij} \log\!\left( \frac{C_{ij} N}{C_{i.} C_{.j}} \right)}{\sum_{i=1}^{C_A} C_{i.} \log(C_{i.}/N) + \sum_{j=1}^{C_B} C_{.j} \log(C_{.j}/N)}

where C is the confusion matrix whose rows correspond to the "real" communities and whose columns correspond to the "detected" communities, and N is the number of nodes. C_{ij} is the number of vertices shared by real community i and detected community j. C_A and C_B are the numbers of real and detected communities, respectively, and C_{i.} and C_{.j} denote the sums of the i-th row and j-th column of C. NMI ranges from 0 to 1: it equals 1 if the detected partition is identical to the real partition, and 0 otherwise.
Many classical embedding methods, such as DeepWalk, Node2vec and DP-Walker, obtain a set of vertex sequence samples S = {s_1, s_2, ..., s_n} by running a random walk strategy on the network G, where each vertex sequence can be written as s = {v_1, ..., v_{|s|}}. By treating each vertex sequence as a sentence in a document, the Skip-gram model can then be used to learn the vertex representations in the network.
DeepWalk uses a uniform distribution p(v_{i+1} | v_i) during the random walk, i.e., every neighbor of v_i is selected with equal probability.
Node2vec uses a biased probability p(v_{i+1} | v_i) during the random walk, which is proportional to

p(v_{i+1} \mid v_i) \propto \begin{cases} 1/p & \text{if } d_{i-1,i+1} = 0 \\ 1 & \text{if } d_{i-1,i+1} = 1 \\ 1/q & \text{if } d_{i-1,i+1} = 2 \end{cases} \quad (3)

where d_{i-1,i+1} denotes the shortest-path distance between vertex v_{i-1} and vertex v_{i+1}. In equation (3), the parameters p and q control the weights of the breadth-first and depth-first search strategies in the random walk, respectively.
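The Node2vec bias described here can be sketched as follows; this matches the published Node2vec method as we understand it, and the exact form of the piecewise weights is our assumption rather than a quotation from this patent.

```python
def node2vec_bias(d, p, q):
    """Unnormalized transition weight for a candidate v_{i+1}, given the
    shortest-path distance d between v_{i-1} and v_{i+1} (d in {0, 1, 2})."""
    if d == 0:          # candidate is the vertex we just came from
        return 1.0 / p
    if d == 1:          # candidate is also a neighbor of v_{i-1} (BFS-like)
        return 1.0
    return 1.0 / q      # d == 2: candidate moves away from v_{i-1} (DFS-like)

def transition_probs(candidates, d_of, p, q):
    """Normalize the biases over the current vertex's neighbors."""
    w = {v: node2vec_bias(d_of[v], p, q) for v in candidates}
    total = sum(w.values())
    return {v: wv / total for v, wv in w.items()}
```

Small p favors returning (local exploration), while small q favors moving outward.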
For DP-Walker, the probability p(v_{i+1} | v_i) is defined in terms of k_i, the degree of vertex v_i, C_{i,i+1}, the number of common neighbors of v_i and v_{i+1}, and a model parameter β.
The mean-shift clustering method is a nonparametric clustering procedure. Unlike the classical k-means clustering method, it requires no assumption about the shape of the distribution or the number of clusters. Given n data points x_i ∈ R^d (i = 1, ..., n), the multivariate kernel density estimate based on a radially symmetric kernel K(x) is given by

\hat{f}(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)

where h is the radius (bandwidth) of the kernel. For each data point x_i, a gradient-ascent procedure is performed on this locally estimated density until convergence; all data points associated with the same mode belong to the same cluster.
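A minimal flat-kernel mean-shift sketch follows. This is our own illustrative implementation (real use would rely on an optimized library); with a flat kernel, the gradient-ascent step reduces to moving each mode to the mean of the points within radius h.

```python
def mean_shift(points, h, iters=50):
    """Flat-kernel mean shift: move each point's mode to the mean of all
    points within radius h of it, repeat, then group points whose modes
    coincide into the same cluster."""
    modes = [list(p) for p in points]
    for _ in range(iters):
        for idx, m in enumerate(modes):
            near = [p for p in points
                    if sum((a - b) ** 2 for a, b in zip(p, m)) <= h * h]
            modes[idx] = [sum(c) / len(near) for c in zip(*near)]
    # Group points whose modes coincide (up to a tolerance).
    labels, centers = [], []
    for m in modes:
        for ci, c in enumerate(centers):
            if sum((a - b) ** 2 for a, b in zip(c, m)) < 1e-6:
                labels.append(ci)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels
```

Two well-separated blobs collapse onto two distinct modes, yielding two clusters without specifying their number in advance.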
The following is the experimental analysis:
(1) Real networks
In the community mining experiments, we use four real networks: the Karate, Football, Dolphin and PolBooks networks. Table 1 lists the details of the four networks, including the number of nodes (|V|), the number of edges (|E|), the average degree (<k>), the mean-square degree (<k^2>), the average clustering coefficient (cc) and the number of real communities (nc).
Table 1: Statistics of the four networks with real communities
(2) Artificial networks
In the community mining experiments, we further use artificial networks to evaluate the performance of our method. The planted partition model is a classical generator of artificial benchmark networks. It generates a network with n = g·z vertices, where z is the number of communities and g is the number of vertices in each community. Any two vertices in the same community are connected by an edge with probability p_in, while any two vertices in different communities are connected with probability p_out. The average degree of each vertex is <k> = p_in (g − 1) + p_out g (z − 1). If p_in >= p_out, the network has community structure, because the edge density within communities is greater than that between communities. In the present invention we adopt the special case of the planted l-partition model proposed by Girvan and Newman, who set z = 4, g = 32 and <k> = 16. Table 2 shows the 7 artificial networks; the larger the internal average degree <k_in> of the vertices, the stronger the community structure.
Table 2: Statistics of the 7 artificial networks with different community structures
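The planted-partition generator described above can be sketched as follows (the function name, seed handling, and edge-list representation are our choices):

```python
import random

def planted_partition(z, g, p_in, p_out, seed=0):
    """Generate a planted l-partition graph: z communities of g vertices,
    intra-community edge probability p_in, inter-community probability p_out.
    Returns the community label of each vertex and the edge list."""
    rng = random.Random(seed)
    n = z * g
    community = [v // g for v in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if community[u] == community[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return community, edges
```

With z = 4, g = 32, raising p_in relative to p_out raises <k_in> and strengthens the community structure.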
(3) Benchmark methods
We compare our method (Q-Walker) with three network embedding methods: DeepWalk, Node2vec and DP-Walker.
(4) Community detection
After learning the embedding vector representation of each node, we use the mean-shift clustering algorithm to detect communities.
(5) Accuracy metric
The Normalized Mutual Information (NMI) measure is an information-theoretic index of the similarity between two community partitions A and B, defined as

NMI(A,B) = \frac{-2 \sum_{i=1}^{C_A} \sum_{j=1}^{C_B} C_{ij} \log\!\left( \frac{C_{ij} N}{C_{i.} C_{.j}} \right)}{\sum_{i=1}^{C_A} C_{i.} \log(C_{i.}/N) + \sum_{j=1}^{C_B} C_{.j} \log(C_{.j}/N)}

where C is the confusion matrix whose rows correspond to the "real" communities and whose columns correspond to the "detected" communities, and N is the number of nodes. C_{ij} is the number of vertices shared by real community i and detected community j. C_A and C_B are the numbers of real and detected communities, respectively, and C_{i.} and C_{.j} denote the sums of the i-th row and j-th column of C. NMI ranges from 0 to 1: it equals 1 if the detected partition is identical to the real partition, and 0 otherwise.
(6) Real network analysis
We first evaluate the performance of Q-Walker on the four real networks with known communities; the results are shown in Table 3. As Table 3 shows, Q-Walker and DP-Walker outperform the other two algorithms, DeepWalk and Node2vec, on all networks, so in the following we compare only Q-Walker and DP-Walker. On the Karate network, the NMI of Q-Walker and DP-Walker is 1 and 0.837, respectively: Q-Walker correctly detects all known communities, improving NMI by 19.47% over DP-Walker. Likewise, on the Dolphin network, Q-Walker also finds all known communities, an NMI improvement of 12.48%. On the PolBooks network, the NMI of Q-Walker and DP-Walker is 0.679 and 0.581, respectively, an improvement of 16.86%. On the Football network, although all methods perform fairly well, the NMI of Q-Walker is still slightly higher than that of the other three methods, a 1.81% improvement over DP-Walker.
Table 3: NMI on the four real networks with known communities
(7) Artificial network analysis
We also evaluate the performance of our method on artificial networks with different community structures; Table 2 gives the details of the 7 networks. The experimental results are shown in FIG. 2. As FIG. 2 shows, when <k_in> <= 10.5 the Q-Walker method outperforms the other three methods. We also note that the improvement ratio of Q-Walker over the NMI of the other three methods is inversely related to <k_in>: the smaller <k_in>, the higher the improvement. Taking Node2vec as an example, when <k_in> = 8.5 the improvement of Q-Walker exceeds 50%, while when <k_in> = 11.5 it is 0%. The reason is as follows. When <k_in> = 8.5, the network has many weak community structures. Because there are many edges between weak communities, a node can easily jump from one weak community to another during a random walk, so a vertex sequence s sampled by Node2vec cannot characterize the weak community structure well: most of the vertices in s come from different weak communities. For Q-Walker, by contrast, most of the vertices in a sequence s come from the same community, so s has relatively tight internal connections and Q-Walker characterizes weak community structure well. When <k_in> = 11.5, the network has many strong community structures. Since the internal edge density of a strong community is high and there are few edges between strong communities, a node walks within the same community most of the time. A vertex sequence s sampled by Node2vec therefore characterizes the strong community structure well, because most of the vertices in s come from the same community. The same analysis applies to the other two benchmark methods, DeepWalk and DP-Walker. In summary, Q-Walker performs well not only on networks with weak community structure but also on networks with strong community structure.
(8) Parameter sensitivity
Finally, we evaluate the performance of our method while varying the value of the parameter α in the density function. In the experiments, α ranges over 0.05 <= α <= 1.5; the results are shown in FIG. 3. As FIG. 3 shows, each network has an optimal resolution interval within which the performance of the algorithm is stable and best. Taking the Karate network as an example, for 0.55 <= α <= 0.7 our algorithm discovers all known community structures. Moreover, the optimal interval of α differs from network to network: for the Dolphin and PolBooks networks, the optimal intervals are 0.5 <= α <= 0.8 and 0.05 <= α <= 0.2, respectively. The difference in the optimal intervals is related to the hierarchy of community structure in the network; since the community hierarchies of different networks generally differ, the optimal intervals of α should differ as well.
(9) Multi-label classification
In addition, we evaluated the performance of our method on a multi-label classification task. To compare our method fairly with the other three methods, we used the following experimental procedure. Specifically, we randomly select a portion of the vertices as the training set and use the remaining vertices as the test set; a multi-class logistic regression model implemented with LibLinear then returns the labels with the highest probabilities. We repeat this process 50 times and average the Micro-F1 and Macro-F1 scores. We ran this experiment on the BlogCatalog network with the parameter α = 1. To speed up training of the multi-label classifier, we used small training-set ratios on BlogCatalog, ranging from 1% to 9%. Tables 4 and 5 report the Micro-F1 and Macro-F1 scores on BlogCatalog, respectively, with bold numbers indicating the highest score in each column. As Tables 4 and 5 show, the Q-Walker algorithm outperforms the other three methods on both Micro-F1 and Macro-F1.
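The repeated-split protocol above ends in Micro-F1 and Macro-F1 averaging. These metrics have standard definitions, sketched below from scratch for the multi-label case; the tiny label sets and predictions are made-up illustration data, not results from the experiments.

```python
from collections import defaultdict

def micro_macro_f1(y_true, y_pred, labels):
    """Micro-/Macro-F1 for multi-label predictions given as sets of labels.
    Micro pools TP/FP/FN over all labels; Macro averages per-label F1."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        for lab in labels:
            if lab in p and lab in t:
                tp[lab] += 1
            elif lab in p:
                fp[lab] += 1
            elif lab in t:
                fn[lab] += 1

    def f1(tp_, fp_, fn_):
        prec = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
        rec = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

# Illustration only: three test vertices, label vocabulary {"a", "b"}.
y_true = [{"a"}, {"a", "b"}, {"b"}]
y_pred = [{"a"}, {"a"}, {"a"}]
micro, macro = micro_macro_f1(y_true, y_pred, labels=["a", "b"])  # 4/7, 0.4
```

Micro-F1 weights every prediction equally and so is dominated by frequent labels; Macro-F1 weights every label equally, which is why the two tables can rank methods differently on an imbalanced label set like BlogCatalog's.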
Table 4: Micro-F1 scores on BlogCatalog
Table 5: Macro-F1 scores on BlogCatalog
The specific embodiments described above further detail the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the above descriptions are only specific embodiments of the present invention and are not intended to limit its protection scope; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010723330.9A CN111860866A (en) | 2020-07-24 | 2020-07-24 | A network representation learning method and device with community structure |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111860866A true CN111860866A (en) | 2020-10-30 |
Family
ID=72950884
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111860866A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140317736A1 (en) * | 2013-04-23 | 2014-10-23 | Telefonica Digital Espana, S.L.U. | Method and system for detecting fake accounts in online social networks |
CN108399189A (en) * | 2018-01-23 | 2018-08-14 | 重庆邮电大学 | Friend recommendation system based on community discovery and its method |
US20200142957A1 (en) * | 2018-11-02 | 2020-05-07 | Oracle International Corporation | Learning property graph representations edge-by-edge |
CN109615550A (en) * | 2018-11-26 | 2019-04-12 | 兰州大学 | A Similarity-Based Local Community Detection Method |
CN110598128A (en) * | 2019-09-11 | 2019-12-20 | 西安电子科技大学 | A Community Detection Method for Large-Scale Networks Against Sybil Attacks |
Non-Patent Citations (2)
Title |
---|
何嘉林: "时序网络中的社团探测及演化分析方法", 计算机工程与设计, vol. 38, no. 8 * |
王慧雪;: "基于node2vec的社区检测方法", 计算机与数字工程, no. 02 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115422445A (en) * | 2022-08-22 | 2022-12-02 | 深圳大学 | Network representation learning method based on hierarchical negative sampling |
CN115422445B (en) * | 2022-08-22 | 2024-11-26 | 深圳大学 | A network representation learning method based on layered negative sampling |
CN116980559A (en) * | 2023-06-09 | 2023-10-31 | 负熵信息科技(武汉)有限公司 | Metropolitan area level video intelligent bayonet planning layout method |
CN116980559B (en) * | 2023-06-09 | 2024-02-09 | 负熵信息科技(武汉)有限公司 | Metropolitan area level video intelligent bayonet planning layout method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20201030 |