WO2021196240A1

WO2021196240A1 - Representation learning algorithm oriented to cross-network application

Info

Publication number: WO2021196240A1
Application number: PCT/CN2020/083378
Authority: WO
Inventors: 王朝坤; 严本成
Original assignee: 清华大学
Priority date: 2020-04-03
Filing date: 2020-04-03
Publication date: 2021-10-07
Also published as: CN113228059A

Abstract

Provided is a representation learning algorithm oriented to cross-network application, the algorithm comprising: S1, generating data of networks comprising a source network and a target network, wherein data of each network data includes topological structure information and node attribute information of the network, and the target network is a network to be inferred and characterized; S2, randomly sampling a set number of nodes from the source network and the target network, and sorting same into a data format that meets an input for the algorithm; S3, after input data from the source network and the target network is received, respectively inputting same into an L-layer neural network, respectively calculating structural features and expression features of the source network and the target network for each layer, and calculating the distance loss between corresponding features of the source network and the target network; S4, performing classification prediction probability calculation on an expression vector, obtained from the L-layer neural network, of a source network node, calculating a classification loss using a cross entropy loss function, and updating, in combination with the distance loss, neural network parameters by means of a back-propagation algorithm; and S5, repeating steps S2-S4 until the entire algorithm converges. Therefore, the invention effectively solves the problem of cross-network representation learning, and has a broad application space in reality.

Description

Representation learning algorithm for cross-network

Technical field

The present disclosure belongs to the field of computer technology, and in particular relates to a representation learning algorithm oriented across networks.

Background technique

Network structure data is widely used in many application scenarios because of its ability to naturally express the relationship between objects. For example, in the social field (WeChat or Weibo), the friendly relationship between users and users can be expressed in the form of social networks; in the field of scientific research, the relationship between authors and papers, and the relationship between papers and papers, can be used separately Express with the reference network; in the field of e-commerce, the network formed by the click relationship between users and commodities. Because of the universality and importance of network structure data, in recent years, how to effectively vectorize the expression (ie network embedding expression) for the nodes in the network has become an important research problem. The vectorization of nodes refers to the hope that the nodes in the network can be mapped to a low-dimensional space through algorithms. In this low-dimensional vector space, the distance between nodes can reflect the relationship between each other in the original network. The learned node vector can be applied to multiple tasks, such as recommendation, link prediction, and so on.

The existing network embedding representation algorithms can be divided into two categories: one is the direct inference representation learning algorithm. Given a target network, the direct expression algorithm directly optimizes the expression vector of each node through the attributes of the node and the network relationship, such as DeepWalk and Node2vec. The second is the inductive representation learning algorithm. The inductive representation algorithm is often to learn a mapping function. As long as the attributes of the input node and its neighbors are given, the expression vector of the node can be inferred through the mapping function, such as GCN, GraphSAGE and GAT.

In real applications, we may be faced with multiple networks, each of which may come from different moments or different data sources. The distribution of these network data may be different. We often hope to summarize useful knowledge from known networks and apply the summarized knowledge to unknown networks. For example, in the citation network of papers, even if the themes of the papers published at different times are different, we can still use the network formed by the papers published in the past years to help infer the relationship between the recently published papers and the papers. Therefore, when facing multiple different networks, how to solve the problem of different distributions between networks, so that the algorithm can make full use of the known network data to improve the quality of the learning vector of the unknown network data is the research of this technology. the key of.

However, none of the existing algorithms can well solve the problem of cross-network representation learning. Specifically:

(1) For the direct inference algorithm, because the direct inference algorithm directly optimizes the expression vector of the nodes in the network, for a new network, the direct inference algorithm cannot directly infer the expression vector of the nodes in the new network. . Therefore, the direct algorithm does not have any available knowledge that can be used for cross-network learning.

(2) For the inductive algorithm, although it considers learning a mapping function of node attributes and structural information when modeling, so that cross-network inference can be made naturally, but the inductive algorithm does not consider The data distribution between the network and the network is different, and the patterns or knowledge summarized from one network may not be well applicable to another network. Therefore, inductive algorithms also exist in the problem of cross-network representation learning. Certain flaws.

Therefore, the existing technology needs to be improved.

The foregoing background technical content is only used to help understand the present disclosure, and does not represent an acknowledgement or recognition that any content mentioned is part of the common general knowledge relative to the present disclosure.

Summary of the invention

In order to solve the above technical problems, the present disclosure proposes a cross-network-oriented representation learning algorithm.

Based on one aspect of the embodiments of the present disclosure, a cross-network-oriented representation learning algorithm is disclosed, including:

S1: Generate network data including a source network and a target network, each network data includes network topology information and node attribute information, and the target network is a network to be inferred and characterized;

S2, randomly sample a set number of nodes from the source network and the target network, and sort them into a data format that meets the input of the algorithm;

S3, after obtaining the input data of the source network and the target network, input them into an L-layer neural network, and calculate the structural characteristics and expression characteristics of the source network and the target network for each layer, and calculate the source network and the target network The distance loss between the corresponding features;

S4: Perform classification prediction probability calculation on the expression vector of the source network node obtained from the neural network of the L layer, calculate the classification loss through the cross-entropy loss function, and combine the distance loss to update the network parameters through the back propagation algorithm;

S5, repeat steps S2-S4 until the entire algorithm converges.

In another embodiment of the cross-network-oriented representation learning algorithm based on the present disclosure, in step S3, after obtaining the input data of the source network and the target network, they are inputted into an L-layer neural network respectively, and each The layer calculates the structural characteristics and expression characteristics of the source network and the target network respectively, and the distance loss between the corresponding characteristics of the source network and the target network includes:

S30: Input the node characteristics of the source network and the target network into the neural network of the L layer;

S31: In each layer of the L-layer neural network, the node feature expression vector of each network generates structural features through a message routing module;

S32, the structural feature obtains the new expression feature vector of the current node through the message aggregation module;

S33: Calculate the structural feature distance loss and the expression feature distance loss between the source network and the target network in the current layer through the cross-network alignment module;

S34: Repeat steps S31 to S33 for L times to obtain the node feature vectors of the final source network and the target network and the accumulated structural feature distance loss and expression feature distance loss of the L layer.

In another embodiment of the cross-network-oriented representation learning algorithm based on the present disclosure, in step S31, in each layer of the L-layer neural network, the node feature expression vector of each network generates structural features through a message routing module include:

The message routing module of each layer is expressed as:

Where

Is the structural feature vector of the source network and the target network calculated by the node i in the l-th layer of the L-layer neural network,

Is the expression feature vector of the source network and the target network of the l-1th layer in the L-layer neural network, the expression feature vector of the 0th layer is represented by the original feature vector x _{i of the} node,

Is the parameter matrix involved in the message routing module of the first layer, a ^(l)T is the parameter vector involved in the message routing module of the first layer, σ is the activation function, || is the direct connection operation of the two vectors, N(v ) Is the set of neighbors directly connected to node v,

Is the weight of the message sent from node u to node v.

In another embodiment of the cross-network-oriented representation learning algorithm based on the present disclosure, in step S32, the structural feature obtaining the new expression feature vector of the current node through the message aggregation module includes:

The message aggregation module of each layer is expressed as:

Where

with

Is the parameter matrix involved in the message aggregation module,

Is a vector showing the aggregation level of the node.

In another embodiment of the cross-network-oriented representation learning algorithm based on the present disclosure, the step S33 is to calculate the structural feature distance loss and the expression feature distance loss from the source network and the target network in the current layer through the cross-network alignment module include:

The current layer from the structural feature distance loss between the source network and the target network is:

In the formula, P _r , Q _r are the structural feature vectors of the source network and the target network

with

Distribution,

Is a distance function used to calculate the structural feature vector

with

The expected distance.

The distance loss of the current layer from the expression feature between the source network and the target network is:

In the formula, P _a , Q _a are the node expression feature vectors of the source network and the target network

with

Distribution,

As a distance function, used to calculate the node expression feature vector

with

The expected distance.

In another embodiment of the cross-network-oriented representation learning algorithm based on the present disclosure, in the step S34, steps S31 to S33 are repeated L times to obtain the node feature vectors of the final source network and the target network and the accumulated structural features of the L layer Distance loss and express feature distance loss include:

The node feature vectors of the source network and the target network and the accumulated structural feature distance loss of the L layer are:

The node feature vectors of the source network and the target network and the accumulated expression feature distance loss of the L layer are:

In another embodiment of the cross-network-oriented representation learning algorithm based on the present disclosure, in the step S4, the expression vector of the source network node obtained from the neural network of the L layer is calculated by the classification prediction probability, and the cross-entropy loss function is used Calculate the classification loss, and combine the distance loss to update the network parameters through the backpropagation algorithm, including:

The cross entropy loss function is expressed as:

Among them, L _S is the cross-entropy loss function, W _Z is the weight parameter matrix,

Is the feature expression vector _{of the node, z i} is the classification prediction probability of the node category, _yi is the true category of the node, and V ^S is the set of nodes with category information in the source network.

Compared with the prior art, the present disclosure has the following advantages:

The cross-network-oriented representation learning algorithm of the present disclosure can extract structural information and node attribute information in the network. At the same time, the algorithm takes into account the inconsistency of distribution between different network data, and compensates for the inconsistency by minimizing the feature distance. The resulting information loss effectively solves the problem of cross-network representation learning, and has a broad application space in reality.

Description of the drawings

The drawings constituting a part of the specification describe the embodiments of the present disclosure, and together with the description, serve to explain the principle of the present disclosure.

With reference to the accompanying drawings, the present disclosure can be understood more clearly according to the following detailed description, in which:

FIG. 1 is a flowchart of an embodiment of the cross-network-oriented representation learning algorithm proposed by the present disclosure;

Fig. 2 is a flowchart of another embodiment of the cross-network-oriented representation learning algorithm proposed by the present disclosure.

Detailed ways

In order to make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only a part of the embodiments of the present disclosure, rather than all the embodiments. Based on the embodiments in the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present disclosure.

In the following, a cross-network-oriented representation learning algorithm provided by the present disclosure will be described in more detail with reference to the accompanying drawings and embodiments.

FIG. 1 is a flowchart of an embodiment of the cross-network-oriented representation learning algorithm proposed by the present disclosure. As shown in FIG. 1, the cross-network-oriented representation learning algorithm:

S1, generating network data including a source network and a target network, each network data includes network topology information and node attribute information, the target network is the network to be inferred and characterized; the expression of the source network is G ^S , the target network The expression of is G ^t , the expression of topological structure information is G=(V,E), where V represents a node, E represents an edge, and the expression of node attribute information is x _v ,v∈V;

S2, randomly sample a set number of nodes from the source network and the target network, and sort them into a data format that meets the input of the algorithm; use the node attributes x _v corresponding to the collected nodes as the input data of the algorithm;

S5, repeat steps S2-S4 until the entire algorithm converges.

Figure 2 is a flowchart of another embodiment of the cross-network-oriented representation learning algorithm proposed by the present disclosure. As shown in Figure 2, the step S3 is to obtain the input data of the source network and the target network, and then input them to An L-layer neural network, and calculate the structural characteristics and expression characteristics of the source network and the target network separately for each layer, and calculate the distance loss between the corresponding characteristics of the source network and the target network including:

S30: Input the node characteristics of the source network and the target network into the neural network of the L layer; the node characteristics of the source network and the target network are respectively

with

will

with

Input to an L-layer neural network;

S31, in each layer of the L-layer neural network, the node feature expression vector of each network generates structural features through a message routing module; the structural feature expression is

S32, the structural feature obtains the new expression feature vector of the current node through the message aggregation module, and the expression feature vector expression is

S34: Repeat steps S31 to S33 for L times to obtain the node feature vectors of the final source network and the target network and the accumulated structural feature distance loss and expression feature distance loss of the L layer. Finally, the node feature vectors of the source network and the target network are

with

The cumulative structural feature distance loss value of the L layer is L _mra , and the expression feature distance loss value is L _maa .

In step S31, in each layer of the L-layer neural network, the node feature expression vector of each network generates structural features through a message routing module including:

The message routing module of each layer is expressed as:

Where

Is the structural feature vector of the source network and the target network calculated at the lth layer of the L-layer neural network for node i,

Is the weight of the message sent from node u to node v.

In step S32, the structure feature obtaining the new expression feature vector of the current node through the message aggregation module includes:

The message aggregation module of each layer is expressed as:

Where

with

Is the parameter matrix involved in the message aggregation module,

Is a vector showing the aggregation level of the node.

In step S33, calculating the structural feature distance loss and expression feature distance loss between the source network and the target network in the current layer through the cross-network alignment module includes:

with

Distribution,

Is a distance function used to calculate the structural feature vector

with

The expected distance.

In the formula, O _a , Q _a are the node expression feature vectors of the source network and the target network

with

Distribution,

As a distance function, used to calculate the node expression feature vector

with

The expected distance.

The step S34, repeating steps S31 to S33 for L times, to obtain the node feature vectors of the final source network and the target network and the accumulated structural feature distance loss and expression feature distance loss of the L layer include:

In step S4, the classification prediction probability calculation is performed on the expression vector of the source network node obtained from the neural network of the L layer, the classification loss is calculated by the cross-entropy loss function, and the distance loss is combined, and the network parameters are updated by the backpropagation algorithm. :

The cross entropy loss function is expressed as:

For those skilled in the art, it is obvious that the embodiments of the present disclosure are not limited to the details of the above exemplary embodiments, and the embodiments of the present disclosure can be implemented in other specific forms without departing from the spirit or basic characteristics of the embodiments of the present disclosure. example. Therefore, from any point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of the embodiments of the present disclosure is defined by the appended claims rather than the foregoing description, and therefore it is intended to fall on All changes within the meaning and scope of equivalent elements of the claims are included in the embodiments of the present disclosure. Any reference signs in the claims should not be regarded as limiting the claims involved. In addition, it is obvious that the word "including" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units, modules or devices stated in the claims of a system, device or terminal can also be implemented by the same unit, module or device through software or hardware. Words such as first and second are used to denote names, but do not denote any specific order.

Finally, it should be noted that the above implementation manners are only used to illustrate the technical solutions of the embodiments of the present disclosure and not to limit them. Although the embodiments of the present disclosure have been described in detail with reference to the above preferred implementation manners, those of ordinary skill in the art should understand that Modifications or equivalent replacements to the technical solutions of the embodiments of the present disclosure should not depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure.

Claims

A cross-network-oriented representation learning algorithm, which is characterized in that it includes:

S1: Generate network data including a source network and a target network, each network data includes network topology information and node attribute information, and the target network is a network to be inferred and characterized;

S2, randomly sample a set number of nodes from the source network and the target network, and sort them into a data format that meets the input of the algorithm;

S3, after obtaining the input data of the source network and the target network, input them into an L-layer neural network, and calculate the structural characteristics and expression characteristics of the source network and the target network for each layer, and calculate the source network and the target network The distance loss between the corresponding features;

S4: Perform classification prediction probability calculation on the expression vector of the source network node obtained from the neural network of the L layer, calculate the classification loss through the cross-entropy loss function, and combine the distance loss to update the network parameters through the back propagation algorithm;

S5, repeat steps S2-S4 until the entire algorithm converges.
The cross-network-oriented representation learning algorithm according to claim 1, characterized in that, in step S3, after obtaining the input data of the source network and the target network, they are respectively input to an L-layer neural network, and each The first layer calculates the structural characteristics and expression characteristics of the source network and the target network respectively, and calculates the distance loss between the corresponding characteristics of the source network and the target network, including:

S30: Input the node characteristics of the source network and the target network into the neural network of the L layer;

S31: In each layer of the L-layer neural network, the node feature expression vector of each network generates structural features through a message routing module;

S32, the structural feature obtains the new expression feature vector of the current node through the message aggregation module;

S33: Calculate the structural feature distance loss and the expression feature distance loss between the source network and the target network in the current layer through the cross-network alignment module;

S34: Repeat steps S31 to S33 for L times to obtain the node feature vectors of the final source network and the target network and the accumulated structural feature distance loss and expression feature distance loss of the L layer.
The cross-network-oriented representation learning algorithm according to claim 2, characterized in that, in step S31, in each layer of the L-layer neural network, the node feature expression vector of each network generates a structure through a message routing module Features include:

The message routing module of each layer is expressed as:

Where
Is the structural feature vector of the source network and the target network calculated at the lth layer of the L-layer neural network for node i,
Is the expression feature vector of the source network and the target network of the l-1th layer in the L-layer neural network, the expression feature vector of the 0th layer is represented by the original feature vector x i of the node,
Is the parameter matrix involved in the message routing module of the first layer, a (l)T is the parameter vector involved in the message routing module of the first layer, σ is the activation function, || is the direct connection operation of the two vectors, N(v ) Is the set of neighbors directly connected to node v,
Is the weight of the message sent from node u to node v.
The cross-network-oriented representation learning algorithm according to claim 3, characterized in that in step S32, the structure feature obtaining the new expression feature vector of the current node through the message aggregation module comprises:

The message aggregation module of each layer is expressed as:

Where
with
Is the parameter matrix involved in the message aggregation module,
Is a vector showing the aggregation level of the node.
The cross-network-oriented representation learning algorithm according to claim 4, characterized in that, in step S33, through the cross-network alignment module, the current layer is calculated from the structural feature distance loss and the expression feature distance between the source network and the target network. The losses include:

The current layer from the structural feature distance loss between the source network and the target network is:

In the formula, P r , Q r are the structural feature vectors of the source network and the target network
with
Distribution,
Is a distance function used to calculate the structural feature vector
with
The expected distance.

The distance loss of the current layer from the expression feature between the source network and the target network is:

In the formula, P a , Q a are the node expression feature vectors of the source network and the target network
with
Distribution,
As a distance function, used to calculate the node expression feature vector
with
The expected distance.
The cross-network-oriented representation learning algorithm according to claim 5, wherein the step S34 is to repeat steps S31 to S33 for L times to obtain the node feature vectors of the final source network and the target network and the L-layer accumulation structure Feature distance loss and expression feature distance loss include:

The node feature vectors of the source network and the target network and the accumulated structural feature distance loss of the L layer are:

The node feature vectors of the source network and the target network and the accumulated expression feature distance loss of the L layer are:
The cross-network-oriented representation learning algorithm according to claim 6, wherein the step S4 is to perform classification prediction probability calculation on the expression vector of the source network node obtained from the neural network of the L layer, through cross-entropy loss The function calculates the classification loss, and combines the distance loss to update the network parameters through the backpropagation algorithm including:

The cross entropy loss function is expressed as:

Among them, L s is the cross-entropy loss function, W z is the weight parameter matrix,
Is the feature expression vector of the node, z i is the classification prediction probability of the node category, yi is the true category of the node, and V s is the set of nodes with category information in the source network.