CN113807543A

CN113807543A - Network embedding algorithm and system based on direction perception

Info

Publication number: CN113807543A
Application number: CN202110983059.7A
Authority: CN
Inventors: 周晟; 刘劭荣; 卜佳俊
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2021-08-25
Filing date: 2021-08-25
Publication date: 2021-12-17
Anticipated expiration: 2041-08-25
Also published as: CN113807543B

Abstract

A direction-aware directed network embedding algorithm, comprising: S1, calculating asymmetric proximity, specifically including: defining the single-step probability of a random walk strategy in a directed network, storing the direction and proximity information of each single step of the random walk in a weight, and calculating scores between nodes; S2, establishing a directed network embedding, specifically including: after the asymmetric proximity between the nodes is calculated, establishing a qualitative directed network embedding DNE-L that preserves the discrete asymmetric proximity between the nodes in the network embedding, and establishing a quantitative directed network embedding DNE-T that preserves the continuous asymmetric proximity between the nodes in the network embedding, and optimizing the model. The invention also includes a system for implementing the direction-aware directed network embedding algorithm. The invention offers better interpretability for practical problems in real networks and effectively preserves discrete and continuous directed network embeddings in the embedding space.

Description

Network embedding algorithm and system based on direction perception

Technical Field

The invention relates to machine learning, in particular to a directional network embedding algorithm and a system based on direction perception.

Background

The purpose of a network embedding algorithm is to embed the nodes of an existing network into a low-dimensional vector space in order to better understand the semantic relationships between the nodes. Existing network embedding algorithms focus primarily on undirected networks and preserve similarities through deterministic metrics or random walks. For a directed network, the usual solution is to ignore the direction of the edges and apply an undirected network embedding algorithm to the transformed network. However, this may lose information and is likely to produce an incorrect embedding.

Because edges in real networks often carry a direction, directed network embedding algorithms have received attention. Directed edges represent asymmetric proximity between nodes in the network, and this latent asymmetric proximity is a key characteristic of a directed network that a network embedding algorithm needs to preserve. While some existing methods attempt to preserve asymmetric proximity in a directed graph, the meaning of the asymmetric proximity they capture is ambiguous. Therefore, obtaining asymmetric proximity, preserving it efficiently in the embedding space, and making it practical for real networks remain significant challenges.

Disclosure of Invention

The present invention provides a directional network embedding algorithm based on direction perception and a system thereof.

The invention aims to acquire asymmetric proximity in a directed network, effectively store the asymmetric proximity in an embedding space and achieve better effect in link prediction and node classification tasks of a real network.

In order to achieve the purpose, the invention adopts the following technical scheme: a directional network embedding algorithm based on direction awareness, comprising:

S1: calculating asymmetric proximity;

S1a: defining the single-step probability of a random walk strategy in the directed network, wherein the single-step probability formula is as follows:

wherein $P$ denotes the single-step probability of the random walk; the remaining symbols of the formula denote, respectively, the $k$-th step of the random walk starting from $v_i$, the number of in-neighbors of node $a$, and the number of out-neighbors of node $a$; and $E_{ab} = 1$ means that there is a directed edge from $a$ to $b$;
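The formula itself is not reproduced in the text. Based only on the definitions given (a uniform step to an in-neighbor or out-neighbor of the current node), one plausible reconstruction is sketched below. The symbols $w_i^k$ (node visited at step $k$ of the walk started at $v_i$), $d_{in}(a)$ and $d_{out}(a)$ (in- and out-neighbor counts of node $a$) are assumptions introduced here for illustration; the text does not state how a neighbor linked in both directions is counted.

```latex
P\left(w_i^{k+1} = v_b \mid w_i^{k} = v_a\right) =
\begin{cases}
\dfrac{1}{d_{in}(a) + d_{out}(a)}, & E_{ab} = 1 \text{ or } E_{ba} = 1, \\[2ex]
0, & \text{otherwise.}
\end{cases}
```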

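S1b: storing the direction and proximity information of each single step of the random walk in a weight, wherein the single-step weight formula is as follows:

$$r_{i,i+1} = \begin{cases} 1, & \text{the step follows the edge direction} \\ -1, & \text{the step goes against the edge direction} \\ 0, & \text{directed edges exist in both directions between } v_i \text{ and } v_{i+1} \end{cases}$$

wherein $r_{i,i+1} = 1$ denotes a random-walk step along the edge direction, $r_{i,i+1} = -1$ denotes a random-walk step against the edge direction, and $r_{i,i+1} = 0$ denotes that directed edges exist in both directions between nodes $v_i$ and $v_{i+1}$;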

S1c: calculating scores between the nodes to express the asymmetric proximity between them, wherein the formula is as follows:

$$s_{i,i+k} = \frac{1}{k} \sum_{j=i}^{i+k-1} r_{j,j+1}$$

wherein $r_{j,j+1}$ is the weight of step $j$, and $1/k$ normalizes the effect of the number of steps.

S2: establishing directed network embedding;

S21: after the asymmetric proximity between the nodes is calculated, a qualitative directed network embedding DNE-L is established, and the discrete asymmetric proximity between the nodes is preserved in the network embedding:

S21a: defining the probability of observing the directed-graph context, i.e. the probability of observing node $v$ in the directed-graph context of node $u$ given the asymmetric proximity $s_{u,v}$. Different probability formulas are selected according to the directionality between the nodes:

wherein $h^s$ is the source embedding and $h^t$ is the target embedding. The probability for an observed score is given by the dot product between the source embedding of node $u$ and the target embedding of node $v$. When $s_{u,v} = 0$, node $u$ and node $v$ tend to form a bidirectional edge, so the probability is the sum of the probabilities obtained from the two directions.

S21b: preserving the asymmetric proximity in the network embedding by maximizing the probability of observing the directed-graph context nodes:

wherein $DC_u$ is the directed context of node $u$, $s_{u,v}$ is the score computed by the random walk strategy of S1, and $P(v \mid u, s_{u,v})$ is the probability of observing node $v$ in the directed-graph context of node $u$ given the score $s_{u,v}$.
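The objective is not reproduced in the text. A conventional way to write "maximizing the probability of observing the directed-graph context nodes" with the symbols defined above is the log-likelihood below. The node set $V$ and the use of a logarithm are assumptions made for this sketch, not details confirmed by the source.

```latex
\max_{h^s,\, h^t} \; \sum_{u \in V} \; \sum_{v \in DC_u} \log P\left(v \mid u, s_{u,v}\right)
```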

S22: after the asymmetric proximity between the nodes is calculated, a quantitative directed network embedding DNE-T is established, and the continuous asymmetric proximity between the nodes is preserved in the network embedding:

S22a: defining a weight conversion formula, by which the asymmetric proximity score calculated in step S1 is converted into a new weight through a weighting function:

wherein $s_{u,v}$ is the sum of the scores computed in the Infowalk above, and $b$ is an offset value that ensures the weight is positive.

S22b: defining a quantitative directed network embedding model, and learning the source embedding and the target embedding through weighted Skip-Gram optimization:

wherein $h^s$ is the source embedding, $h^t$ is the target embedding, and $\pi_{u,v}$ is the weight converted from the score in the quantitative directed network.

S23: optimizing the model: training efficiency is improved by adopting negative sampling and a stochastic gradient descent strategy:

wherein $\sigma$ denotes the activation function, and the remaining symbols denote the source embedding of node $u$, the target embedding of node $v$, and the weight $\pi_{u,v}$ between nodes $u$ and $v$.

Preferably, in S22a, the weighting function should satisfy the following requirements: (1) $\pi_0 > 0$; (2) $\pi_m > \pi_n$ when $m > n$; (3) for the same score, a longer walk length yields a smaller weight; wherein the weighting function is evaluated for a walk length $i$ and an asymmetric proximity score $m$.

Further, a random walk strategy Infowalk is used for effectively obtaining a hierarchical structure and asymmetric proximity between nodes in the directed network to obtain a weighted node sequence representing the asymmetric proximity between the nodes for directed embedded learning; the use of qualitative directed network embedding DNE-L and quantitative directed network embedding DNE-T effectively preserves the embedded network in the embedding space, allowing it to achieve excellent task results on real-world reference datasets.

The system for implementing the directional-perception-based directional network embedding algorithm of the invention comprises a memory, a processor and a program stored on the memory and executed on the processor, and is characterized in that: the program comprises an asymmetric proximity calculation module and a directed network embedding building module which are connected in sequence.

The invention has the following advantages: 1) a new information random walk strategy is provided to effectively obtain asymmetric proximity in a directed network structure, which gives a better explanation of practical problems in real networks; 2) a directed network embedding method DNE with two variants, the qualitative DNE-L and the quantitative DNE-T, is provided, so that discrete and continuous directed network embeddings are effectively preserved in the embedding space.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.

Fig. 1a to fig. 1f are schematic diagrams of an information walking strategy on a directed network according to an embodiment of the present invention, where fig. 1a is a schematic diagram of first backward walking and then forward walking between three nodes, fig. 1b is a schematic diagram of first forward walking and then backward walking between three nodes, fig. 1c is a schematic diagram of first backward walking and then forward walking twice between four nodes, fig. 1d is a schematic diagram of forward and backward spaced walking between four nodes, fig. 1e is a schematic diagram of forward walking in a directed ring graph, and fig. 1f is a schematic diagram of backward walking in a directed ring graph;

fig. 2 is an overall framework diagram of a directed network embedding method according to an embodiment of the present invention;

fig. 3a and 3b are graphs comparing the scores of the present method with those of existing algorithms in the user profiling experiment of the user recommendation scenario, provided by an embodiment of the present invention, wherein fig. 3a compares the Micro-F1 scores of different algorithms in the user profiling scenario, and fig. 3b compares the Macro-F1 scores of different algorithms in the user profiling scenario.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a direction-aware directed network embedding algorithm, in which: (1) a new information random walk strategy is used to effectively obtain the asymmetric proximity between nodes of a directed network, which applies well to practical problems in real networks; (2) both qualitative and quantitative directed network embeddings (the two variants DNE-L and DNE-T of the directed network embedding method DNE) are used to preserve discrete and continuous asymmetric proximity in the latent embedding space.

The core method proposed in the present invention is explained in detail below.

S1: calculating asymmetric proximity;

the invention provides an information random walk strategy Infowalk for calculating asymmetric proximity between nodes. The basic idea of InfoWalk is to first ignore the direction of an edge and allow random walks to access nodes from various directions. In each step of the random walk, the direction and asymmetric proximity are stored in a well-designed weight. After the random walk reaches the specified length, the Infowalk obtains a step-length weighted node sequence representing the asymmetric proximity between nodes, and the step-length weighted node sequence can be used for directed embedding learning.

S1a: defining the single-step probability of a random walk strategy in a directed network:

Given a directed network $G$, a random walk starting from node $v_i$ can be represented as $v_i \to v_j \to \dots \to v_k$, the sequence of nodes visited so far, in which each element is the node visited at the corresponding step of the walk. Suppose that at the $k$-th step the random walk reaches node $v_a$. At step $k+1$, the random walk moves uniformly at random to one of the in-neighbors or out-neighbors of $v_a$:

wherein $P$ denotes the single-step probability of the random walk; the remaining symbols of the formula denote, respectively, the $k$-th step of the random walk starting from $v_i$, the number of in-neighbors of node $a$, and the number of out-neighbors of node $a$; and $E_{ab} = 1$ means that there is a directed edge from $a$ to $b$.

This random walk can be viewed as a walk on the undirected network obtained by ignoring the edge directions of the directed graph $G$. It can therefore reach nodes to which no directed path exists, which allows it to capture asymmetric proximity.
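The walk described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `infowalk`, the edge-set representation, and the choice to count a neighbor linked in both directions only once are all assumptions.

```python
import random


def infowalk(edges, start, length, seed=None):
    """Walk up to `length` steps from `start`, ignoring edge direction.

    `edges` is a set of (tail, head) pairs of a directed graph.
    Assumption: a neighbor linked in both directions is counted once.
    """
    rng = random.Random(seed)
    # Build direction-free neighbor sets from the directed edges.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    walk = [start]
    for _ in range(length):
        candidates = sorted(neighbors.get(walk[-1], ()))
        if not candidates:  # isolated or unknown node: stop early
            break
        walk.append(rng.choice(candidates))  # uniform single step
    return walk


# a -> b and c -> b: there is no directed path from a to c,
# yet the direction-free walk can still reach c through b.
edges = {("a", "b"), ("c", "b")}
walk = infowalk(edges, "a", 2, seed=0)
```

Sorting the candidates before sampling only makes the sketch reproducible for a given seed; it does not change the uniform step distribution.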

S1b: storing the direction and proximity information of each single step of the random walk in a weight:

To capture the direction and proximity between nodes, the invention further introduces, for each step from $v_i$ to $v_{i+1}$, a direction-aware step weight $r_{i,i+1}$ according to the following rule:

$$r_{i,i+1} = \begin{cases} 1, & \text{the step follows the edge direction} \\ -1, & \text{the step goes against the edge direction} \\ 0, & \text{directed edges exist in both directions between } v_i \text{ and } v_{i+1} \end{cases}$$

wherein $r_{i,i+1} = 1$ denotes a random-walk step along the edge direction, $r_{i,i+1} = -1$ denotes a random-walk step against the edge direction, and $r_{i,i+1} = 0$ denotes that directed edges exist in both directions between nodes $v_i$ and $v_{i+1}$.
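The step-weight rule can be expressed as a small function. The name `step_weight` and the edge-set representation are assumptions made for this sketch.

```python
def step_weight(edges, u, v):
    """Direction-aware weight of a walk step from u to v.

    `edges` is a set of (tail, head) pairs of a directed graph.
    Returns 1 for a step along the edge, -1 for a step against it,
    and 0 when edges exist in both directions.
    """
    forward = (u, v) in edges
    backward = (v, u) in edges
    if forward and backward:
        return 0
    if forward:
        return 1
    if backward:
        return -1
    raise ValueError("u and v are not connected in either direction")


edges = {("a", "b"), ("b", "c"), ("c", "b")}
weights = [step_weight(edges, "a", "b"),   # along a -> b
           step_weight(edges, "b", "a"),   # against a -> b
           step_weight(edges, "b", "c")]   # b <-> c is bidirectional
```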

S1c: calculating scores between the nodes to represent the asymmetric proximity between them:

Given the weight $r_{i,i+1}$ of each step, the result of Infowalk can be represented as a sequence of nodes with weighted edges:

$$v_i \xrightarrow{\,r_{i,i+1}\,} v_{i+1} \xrightarrow{\,r_{i+1,i+2}\,} \cdots$$

Based on this weighted node sequence, the invention defines the score $s_{i,i+k}$ between nodes $v_i$ and $v_{i+k}$ in the sequence as the normalized sum of the step weights between them, as follows:

$$s_{i,i+k} = \frac{1}{k} \sum_{j=i}^{i+k-1} r_{j,j+1}$$

Fig. 1 is a schematic diagram of the information walk strategy on a directed network according to an embodiment of the present invention, where solid arrows indicate steps taken along the edge direction and dashed arrows indicate steps taken against the edge direction. $s_{i,j} > 0$ indicates that a directed edge from $v_i$ to $v_j$ tends to be observed; $s_{i,j} < 0$ indicates that a directed edge from $v_j$ to $v_i$ tends to be observed; and $s_{i,j} = 0$ indicates that bidirectional edges between $v_i$ and $v_j$ tend to be observed.
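The score computation can be illustrated with a short sketch that takes the step weights of one walk and returns the normalized sum for every pair of positions within a window. The function name `window_scores` and the dictionary output are assumptions; the example walk (one step against an edge, then one step along an edge) mirrors the kind of walk described for Fig. 1a.

```python
def window_scores(step_weights, window):
    """Scores s_{i,i+k} = (1/k) * sum of step weights from i to i+k-1.

    `step_weights[j]` is the weight of the step from position j to j+1,
    so a walk with n steps visits n+1 positions.
    Returns {(i, i+k): score} for 1 <= k <= window.
    """
    n = len(step_weights)
    scores = {}
    for i in range(n):
        total = 0
        for k in range(1, window + 1):
            if i + k > n:
                break
            total += step_weights[i + k - 1]
            scores[(i, i + k)] = total / k  # 1/k normalizes the step count
    return scores


# One step against an edge (-1), then one step along an edge (+1).
scores = window_scores([-1, 1], window=2)
```

In this example the two-step score between the first and last node is 0, because the opposite step directions cancel out.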

Infowalk can easily capture asymmetric proximity because it ignores the direction of edges in the directed network, so nodes with higher in-degree or out-degree are visited more frequently. Consequently, these nodes are also more likely to appear in the windows of other nodes.

S2: establishing directed network embedding;

The invention proposes two variants of directed network embedding: a qualitative directed network embedding (DNE-L) and a quantitative directed network embedding (DNE-T). For each variant, two independent embeddings, referred to as the source embedding and the target embedding, are learned to preserve asymmetric proximity. The two variants differ in how the asymmetric proximity is preserved. Fig. 2 shows the basic architecture of DNE-L and DNE-T, where DNE-L retains discrete directed network embeddings and DNE-T retains continuous directed network embeddings. DNE defines the directed-graph context based on the score $s_{u,v}$ and preserves the directed relations between nodes through the source embedding and the target embedding of each node.

The directed-graph context is the result of the information random walk on the directed network $G$ and is divided into source context, target context, and ambiguous context. The source context of a node consists of nodes reached by the walk that are likely to have a directed edge pointing to that node; the target context consists of nodes reached by the walk that are likely to be pointed to by a directed edge from that node; the ambiguous context consists of nodes reached by the walk for which no clear direction exists between them and that node.

S21a: defining the probability of observing the directed-graph context, i.e. the probability of observing node $v$ in the directed-graph context of node $u$ given the asymmetric proximity $s_{u,v}$. Different probability formulas are selected according to the directionality between the nodes:

wherein $h^s$ is the source embedding and $h^t$ is the target embedding. The probability for an observed score is given by the dot product between the source embedding of node $u$ and the target embedding of node $v$. When $s_{u,v} = 0$, node $u$ and node $v$ tend to form a bidirectional edge, so the probability is the sum of the probabilities obtained from the two directions.
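The case distinction can be sketched as below. Since the formula is not reproduced in the text, several details are assumptions: the sigmoid applied to the dot product, the reversed roles of the embeddings when $s_{u,v} < 0$, and the names `context_score`, `h_src` and `h_tgt`. The value returned for $s_{u,v} = 0$ is a sum of two terms and is therefore an unnormalized score, not a probability bounded by 1.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def context_score(h_src, h_tgt, u, v, s_uv):
    """Direction-dependent score of observing v in the context of u.

    h_src / h_tgt map each node to its source / target embedding.
    """
    forward = sigmoid(dot(h_src[u], h_tgt[v]))   # evidence for u -> v
    backward = sigmoid(dot(h_src[v], h_tgt[u]))  # evidence for v -> u
    if s_uv > 0:
        return forward
    if s_uv < 0:
        return backward
    return forward + backward  # bidirectional case: both directions


h_src = {"u": [1.0, 0.0], "v": [0.0, 1.0]}
h_tgt = {"u": [0.0, 0.0], "v": [1.0, 0.0]}
p_forward = context_score(h_src, h_tgt, "u", "v", 0.5)
p_both = context_score(h_src, h_tgt, "u", "v", 0.0)
```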

S22a: Because context nodes in the directed graph are visited by Infowalk with different probabilities relative to the central node, it is reasonable to weight the context nodes of the current node by their relative scores. However, using the score directly as a weight is problematic because 1) a context node with score $s_{u,v} = 0$ receives a weight of 0 rather than a positive weight; and 2) context nodes with score $s_{u,v} = 0$ receive the same weight regardless of the random walk length. Both effects reduce the accuracy of using the score directly to measure the importance of a node.

To solve the above problems, the quantitative directed network embedding defines a new weighting function of $s_{u,v}$, which must satisfy the following requirements: (1) $\pi_0 > 0$; (2) $\pi_m > \pi_n$ when $m > n$; (3) for the same score, a longer walk length yields a smaller weight.

A weight conversion formula is defined, by which the asymmetric proximity score calculated in step S1 is converted into a new weight through a weighting function:

wherein $s_{u,v}$ is the sum of the scores computed in the Infowalk above, and $b$ is an offset value that ensures the weight is positive. Such a conversion ensures that the weight has the following properties: (1) nodes with higher scores have larger weights; (2) nodes at longer distances in the random walk have smaller weights.
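The conversion formula is not reproduced in the text, so the function below is purely a hypothetical example of a weighting function with the stated properties (positive weight at score 0, larger weight for a higher score, smaller weight for a longer walk distance). It is not the formula of the invention. The form `(score + b) / length` and the requirement `b > 1` are assumptions; the latter relies on scores lying in $[-1, 1]$, which holds when a score is an average of step weights in $\{-1, 0, 1\}$.

```python
def hypothetical_weight(score, length, b=2.0):
    """Illustrative weight: increasing in score, decreasing in length.

    Assumes -1 <= score <= 1, so any offset b > 1 keeps the weight positive.
    """
    if b <= 1.0:
        raise ValueError("offset b must exceed 1 to keep weights positive")
    if length < 1:
        raise ValueError("length must be at least 1")
    return (score + b) / length


w_zero = hypothetical_weight(0.0, 1)         # positive at score 0
w_high = hypothetical_weight(1.0, 2)         # higher score
w_low = hypothetical_weight(-1.0, 2)         # lower score, same length
w_near = hypothetical_weight(0.5, 1)         # short distance
w_far = hypothetical_weight(0.5, 3)          # long distance, same score
```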

S23: optimizing the model: training efficiency is improved by adopting negative sampling and a stochastic gradient descent strategy, and the formula is as follows:

wherein $\sigma$ denotes the activation function, and the remaining symbols denote the source embedding of node $u$, the target embedding of node $v$, and the weight $\pi_{u,v}$ between nodes $u$ and $v$.
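A weighted Skip-Gram loss with negative sampling for a single training pair could look like the sketch below. The objective is not reproduced in the text, so this follows the standard negative-sampling form and assumes that $\sigma$ is the sigmoid and that the weight $\pi_{u,v}$ scales the whole per-pair loss. The name `pair_loss` and its arguments are hypothetical.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pair_loss(h_u_src, h_v_tgt, negatives_tgt, pi_uv):
    """Weighted negative-sampling loss for one (u, v) pair.

    h_u_src: source embedding of the center node u
    h_v_tgt: target embedding of the context node v
    negatives_tgt: target embeddings of sampled negative nodes
    pi_uv: weight converted from the score s_{u,v}
    """
    positive = log_sigmoid(dot(h_u_src, h_v_tgt))
    negative = sum(log_sigmoid(-dot(h_u_src, h_n)) for h_n in negatives_tgt)
    return -pi_uv * (positive + negative)


loss = pair_loss([0.0, 0.0], [0.0, 0.0], [[0.0, 0.0]], pi_uv=1.0)
```

In training, a loss of this kind would be averaged over batches of pairs and minimized with stochastic gradient descent on both embedding tables.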

The system for implementing the directional perception-based directed network embedding algorithm comprises a memory, a processor and a program stored on the memory and executed on the processor, wherein the program comprises an asymmetric proximity calculation module and a directed network embedding establishment module which are connected in sequence. The execution content of the asymmetric proximity calculation module corresponds to the content of step S1 of the method of the present invention, and the directed network embedding creation module corresponds to the content of step S2 of the method of the present invention.

In order to more clearly illustrate the specific application of the invention, the embodiment takes user recommendation on a microblog as an example, and elaborates the specific implementation process in detail:

The specific scenario of this embodiment is recommending users of interest for microblog users to follow.

A method for recommending users of interest for microblog users to follow comprises the following steps:

Step one, a technician collects the follow relationships between users and establishes a directed user-relationship network. Each node of the directed network represents an individual user, and each directed edge represents a follow action: the edge leaves the follower and enters the user being followed.
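Step one can be sketched as building adjacency sets from a list of follow records. The function name `build_follow_graph`, the `(follower, followee)` pair format, and the convention that an edge points from the follower to the followed user are assumptions made for this illustration.

```python
def build_follow_graph(follows):
    """Build out- and in-neighbor sets from (follower, followee) pairs.

    Assumed convention: each pair is a directed edge follower -> followee.
    """
    out_neighbors, in_neighbors = {}, {}
    for follower, followee in follows:
        # Make sure every user appears in both maps, even with no edges.
        for user in (follower, followee):
            out_neighbors.setdefault(user, set())
            in_neighbors.setdefault(user, set())
        out_neighbors[follower].add(followee)
        in_neighbors[followee].add(follower)
    return out_neighbors, in_neighbors


follows = [("alice", "bob"), ("carol", "bob"), ("bob", "alice")]
out_neighbors, in_neighbors = build_follow_graph(follows)
```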

Step two, after the directed network is established, the technician obtains the asymmetric proximity between the nodes, i.e., the asymmetric proximity between the users, by using the random walk strategy proposed in step S1.

Step three, the technician can choose to use the qualitative directed network embedding DNE-L mentioned in step S21 or the quantitative directed network embedding DNE-T mentioned in step S22 to keep the asymmetric proximity between users in the network embedding. In this process, the technician needs to use the negative sampling and stochastic gradient descent strategy in step S23 to improve the training efficiency and optimize the network model.

Step four, after the directed network embedding model has been trained, the technician obtains a representation of each user for the downstream task, namely user matching. The technician computes the similarity between user representations and groups users with similar representations together for user recommendation.
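One possible way to turn the learned representations into recommendations is sketched below. It scores a candidate by the inner product between the source embedding of the requesting user and the target embedding of the candidate, then returns the highest-scoring users not yet followed. The scoring rule, the function name `recommend`, and the toy embeddings are assumptions; the text only states that similarities between user representations are computed.

```python
def recommend(user, h_src, h_tgt, already_followed, top_k=2):
    """Rank candidates by inner product of source and target embeddings."""
    scored = []
    for candidate, target_vec in h_tgt.items():
        if candidate == user or candidate in already_followed:
            continue  # skip the user and accounts already followed
        score = sum(a * b for a, b in zip(h_src[user], target_vec))
        scored.append((score, candidate))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [candidate for _, candidate in scored[:top_k]]


h_src = {"u": [1.0, 0.0]}
h_tgt = {"u": [9.0, 9.0], "a": [0.9, 0.0], "b": [0.1, 0.5], "c": [0.5, 0.2]}
suggestions = recommend("u", h_src, h_tgt, already_followed={"a"}, top_k=2)
```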

The scheme provided by the embodiment of the invention mainly has the following beneficial effects: 1. asymmetric proximity is effectively obtained in a real directed network; 2. the discrete or continuous network embedding preserved by the DNE method performs better than existing embedding methods in tasks such as link prediction and node classification. To illustrate the effects of the above embodiments of the present invention, experiments are described below.

First, experimental data.

Extensive experiments were performed using several real social network datasets and citation networks with a label on each node. The social networks with directed edges are used to evaluate user recommendation, and the citation networks are used for user profiling. Because it is difficult to collect large-scale real social networks with ground-truth labels, the experiments adopt citation networks with directed edges. Table 1 shows the statistics of the datasets.

Dataset	#Nodes	#Edges	#Labels	% Dangling Nodes	% Bi-directional Edges
Wiki	7,115	103,689	-	0.141	0.0565
Epinions	75,879	508,837	-	0.204	0.4052
Slashdot	77,360	905,468	-	0.271	0.8783
Twitter	90,908	443,399	-	0.087	0.6066
LastFM	136,409	1,685,524	-	0.439	0.0009
Pubmed	19,717	44,338	3	0.803	0.0001
Cocit	44,034	195,361	15	0.451	0.0001

TABLE 1 Statistical information of the datasets

Second, experimental conclusions.

1. The DNE approach achieves better results in most network data sets by preserving proximity between nodes.

In the experiment, the method of the present invention was compared with several of the most advanced directed network embedding methods and user recommendation methods to evaluate the proposed DNE. In the experiment, no comparison was made with the social network based user recommendation method, as the experiment focused on evaluating the learning effect of embedding users/nodes in the directed graph.

Among the baseline methods, Node2Vec, DeepWalk, APP, and NERD are all random-walk-based methods. For a fair comparison, the experiment sets the random walk parameters of these methods to be the same as those of the DNE method of the present invention, specifically: random walk length $l = 10$, window size $k = 4$, and number of walks per node $r = 10$. For the Node2Vec method, the probability of breadth-first sampling is set to 0.25 and the probability of depth-first sampling is set to 0.5. The inner product of the embedding vectors is used to estimate the proximity between nodes. The APP, ATP, NERD, and HOPE methods preserve asymmetric proximity by learning two independent embeddings, a source embedding and a target embedding. For the node classification task, both kinds of embedding are tested and the best result is reported. LINE learns two embeddings per node, namely a context embedding and a node embedding. The DNE method of the invention is implemented using PyTorch and TensorFlow; the model parameters are randomly initialized with Xavier initialization and optimized with the Adam optimizer, with the learning rate set to 0.0005 and the batch size set to 512. The embedding dimension for all methods is 128.

Table 2 shows the results of generic user recommendation on five real-world social network datasets. NA denotes cases where a method could not be run on the hardware because of memory limitations or because its run time exceeded one week, and a marker on a result indicates that it is significant under a paired difference test at $p < 0.05$.

TABLE 2 comparison of Performance of the present invention and existing algorithms on common user recommendations

From table 2, it can be seen that: both variants of the proposed DNE method achieve better results in most network datasets than the existing methods in terms of preserving asymmetric proximity, which demonstrates the effectiveness of the present invention in obtaining asymmetric proximity in directed social networks.

2. The DNE method improves the effect of preserving the direction between the nodes in the user recommendation scene.

The experiment further evaluates direction-aware user recommendation to simulate real-world scenarios in which the direction of a recommendation must be considered. The generic user recommendation task only predicts whether an edge exists and cannot guarantee that its direction is predicted correctly. For example, if there is a directed edge from $v_i$ to $v_j$ but no edge from $v_j$ to $v_i$, a method that predicts edges in both directions can still pass the metric, because the pair is sampled as a positive example and the reverse direction is never sampled as a negative example. Following the experimental setup of existing methods, the experiment therefore also tests direction-aware user recommendation. 30% of the links are randomly sampled from the original network as positive links, and the negative links include random samples of edges that are not present in the original network as well as negative edges that are not present among the positive edges. Table 3 shows the results of direction-aware user recommendation and classic user recommendation on real datasets.

Table 3 comparison of the performance of the present invention in direction perception with existing algorithms

From Table 3, it can be seen that, among all evaluated methods, DNE-L and DNE-T of the present invention achieve the best results on all datasets, with significant improvements over the existing methods. Comparing Tables 2 and 3, all methods perform worse on direction-aware user recommendation, which illustrates the necessity of considering the direction and asymmetric proximity of edges. Comparing DNE-L and DNE-T, an improvement can be observed on both tasks, and the improvement is more pronounced for direction-aware user recommendation than for classic user recommendation. This further indicates the importance of considering direction when predicting directed links between nodes.

3. The method performs better in user profiling.

User profiling is another important task in user modeling, especially in directed social networks. Its goal is to find the user group to which a user belongs, which is the same as the classic node classification task. In the experiment, 30% of the labeled nodes are randomly sampled for training and the remaining nodes are used for testing. The learned embeddings are fed into the same SVM classifier, and the results are evaluated using Micro-F1 and Macro-F1 scores. For methods that learn two independent embeddings per node, the embeddings are concatenated for evaluation. The evaluation results are shown in Fig. 3.
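The two evaluation metrics can be computed from predicted and true labels as sketched below. This is a plain-Python illustration of the standard Micro-F1 and Macro-F1 definitions for single-label classification; the function name `f1_scores` and the toy labels are assumptions, and the classifier itself is not shown.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Micro-F1 pools the counts; Macro-F1 averages per-class F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


micro_f1, macro_f1 = f1_scores([0, 0, 1, 1, 2], [0, 1, 1, 1, 0])
```

Macro-F1 is lower than Micro-F1 in this toy case because the class that is never predicted correctly contributes a per-class F1 of 0 to the average.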

As can be seen from fig. 3, the basic observation result is similar to the user recommended task, and the DNE method has better effect than the existing method in two evaluation indexes.

The embodiments described in this specification are merely illustrative of implementations of the inventive concept and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments but rather by equivalents thereof as may occur to those skilled in the art upon consideration of the present inventive concept.

Claims

1. A directional network embedding algorithm based on direction awareness, comprising:

S1: calculating asymmetric proximity;

wherein $P$ denotes the single-step probability of the random walk; the remaining symbols of the formula denote, respectively, the $k$-th step of the random walk starting from $v_i$, the number of in-neighbors of node $a$, and the number of out-neighbors of node $a$; and $E_{ab} = 1$ means that there is a directed edge from $a$ to $b$;

wherein $r_{i,i+1} = 1$ denotes a random-walk step along the edge direction, $r_{i,i+1} = -1$ denotes a random-walk step against the edge direction, and $r_{i,i+1} = 0$ denotes that directed edges exist in both directions between nodes $v_i$ and $v_{i+1}$;

wherein $r_{j,j+1}$ is the weight of step $j$, and $1/k$ normalizes the effect of the number of steps.

S2: establishing directed network embedding;

S21a: defining the probability of observing the directed-graph context, i.e. the probability of observing node $v$ in the directed-graph context of node $u$ given the asymmetric proximity $s_{u,v}$. Different probability formulas are selected according to the directionality between the nodes:

wherein $h^s$ is the source embedding and $h^t$ is the target embedding. The probability for an observed score is given by the dot product between the source embedding of node $u$ and the target embedding of node $v$. When $s_{u,v} = 0$, node $u$ and node $v$ tend to form a bidirectional edge, so the probability is the sum of the probabilities obtained from the two directions.

wherein $DC_u$ is the directed context of node $u$, $s_{u,v}$ is the score computed by the random walk strategy of S1, and $P(v \mid u, s_{u,v})$ is the probability of observing node $v$ in the directed-graph context of node $u$ given the score $s_{u,v}$.

wherein $s_{u,v}$ is the sum of the scores computed in the Infowalk above, and $b$ is an offset value that ensures the weight is positive.

wherein $h^s$ is the source embedding, $h^t$ is the target embedding, and $\pi_{u,v}$ is the weight converted from the score in the quantitative directed network.

wherein $\sigma$ denotes the activation function, and the remaining symbols denote the source embedding of node $u$, the target embedding of node $v$, and the weight $\pi_{u,v}$ between nodes $u$ and $v$.

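2. The directional-awareness-based directed network embedding algorithm of claim 1, wherein: in step S22a, the weighting function should satisfy the following requirements: (1) $\pi_0 > 0$; (2) $\pi_m > \pi_n$ when $m > n$; (3) for the same score, a longer walk length yields a smaller weight; wherein the weighting function is evaluated for a walk length $i$ and an asymmetric proximity score $m$.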

3. The directional-awareness-based directed network embedding algorithm of claim 2, wherein: the method comprises the steps that a random walk strategy Infowalk is used for effectively obtaining a hierarchical structure and asymmetric proximity among nodes in a directed network, and a weighted node sequence representing the asymmetric proximity among the nodes is obtained and used for directed embedding learning; the use of qualitative directed network embedding DNE-L and quantitative directed network embedding DNE-T effectively preserves the embedded network in the embedding space, allowing it to achieve excellent task results on real-world reference datasets.

4. A system for implementing the directional-awareness-based directional-network-embedding algorithm of claim 1, comprising a memory and a processor and a program stored on the memory and executed on the processor, wherein: the program comprises an asymmetric proximity calculation module and a directed network embedding building module which are connected in sequence.