CN113868482A

CN113868482A - Heterogeneous network link prediction method suitable for scientific cooperative network

Info

Publication number: CN113868482A
Application number: CN202111286870.6A
Authority: CN
Inventors: 刘忠; 马扬; 许乃夫; 梁星星; 冯旸赫; 程光权; 黄金才; 施伟
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2021-07-21
Filing date: 2021-11-02
Publication date: 2021-12-31

Abstract

The invention discloses a heterogeneous network link prediction method suitable for a scientific cooperative network, comprising the following steps: acquiring scientific cooperation data and constructing a heterogeneous scientific cooperation network; sampling neighbor nodes on the heterogeneous scientific cooperation network; aggregating the information of the sampled neighbor nodes; training on the heterogeneous scientific cooperation network to learn a heterogeneous network representation; taking the element-wise (Hadamard) product of the representation vectors of the two end nodes of an edge as the representation vector of that edge, and training a binary classifier on the corresponding edges; and using the binary classifier to predict whether an unknown link exists. If a predicted link connects two authors, the two authors are expected to co-author an article in the future; if it connects an author and a paper, the author is expected to cite that paper in the future. The method performs well in predicting both author collaboration and the articles an author will cite.

Description

Heterogeneous network link prediction method suitable for scientific cooperative network

Technical Field

The invention belongs to the technical field of social network science, and relates to a heterogeneous network link prediction method suitable for a scientific cooperative network.

Background

In the real world, connections between individuals are ubiquitous, and these complex connections can be described by networks of different forms (social networks, citation networks, protein interaction networks, power networks, etc.). As a common data carrier, network-structured data exists in all aspects of society, and mining the meaningful information in such networks has high academic value and potential application value. Although the structures and forms of networks are complex and varied, their basic components are nodes and edges, so the analysis of complex networks has universal significance. Since a network contains a large number of nodes and edges, obtaining the full state information of every node is extremely time-consuming. Network embedding aims to find a mapping function that transforms the structural and attribute features of each node in the network into a low-dimensional latent representation. The graph neural network, a powerful deep representation learning method for graph data, shows excellent performance in network analysis and has attracted extensive attention from researchers. Through supervised or unsupervised learning, the structural and attribute information of the network is mined and converted into low-dimensional node vectors, which are then applied to downstream tasks such as node classification, node clustering and link prediction.

Although many graph neural network approaches have been proposed, research still focuses mainly on homogeneous networks, in which every node is of the same type. However, most real networks take the form of heterogeneous graphs. A heterogeneous graph contains rich structural and attribute information, and its nodes and edges are of multiple types. Current network representation research still faces a key problem: how should the information of a target node's neighbors be aggregated? Most graph neural networks define the neighbors of a node as its directly connected nodes and iteratively aggregate information from more distant neighbors by stacking multiple layers. Some methods expand the neighbor space of a node by random walks, treating all nodes on a walk as neighbors. These methods ignore the fact that the edges in a heterogeneous network are heterogeneous and that particular combinations of heterogeneous edges carry specific meanings.

Cooperation among scientists is a long-standing phenomenon in scientific practice, and a considerable amount of communication takes place in most stages of the research process. Scientific disciplines intersect and interpenetrate, forming a large scientific system with a complex structure. Scientists not only exchange research results and information, but also frequently work together to produce or report research results in the course of conversation, writing, reading papers, correspondence and the like. As science becomes more socialized, cooperation in scientific research grows more frequent and has become a social force whose influence on scientific development can no longer be ignored. Scientific collaboration networks are to a large extent self-organized by scientists, and a large number of science and technology policy initiatives have also promoted scientific collaboration. When a scientific document is completed by two or more authors, these authors are said to be in scientific collaboration. In a scientific collaboration network, the nodes are authors, and two authors are connected if they have co-authored one or more papers. Scientific collaboration networks are an important class of social networks and have been widely used to determine the structure of scientific collaboration and the status of individual researchers. Studying the cooperative relationships that may arise in a scientific cooperation network in the future is of great significance for promoting scientific research and development, building team advantages, finding advantageous platforms and the like. Predicting whether unknown links exist in the scientific cooperation network makes it possible to judge whether two authors will co-author an article in the future and whether an author will cite a certain article in the future, which provides strong guidance for scientific cooperation.

Disclosure of Invention

The invention relates to a heterogeneous network link prediction method suitable for a scientific cooperative network, which adopts the following technical scheme.

A heterogeneous network link prediction method suitable for a scientific cooperative network comprises the following steps:

step 1, acquiring scientific cooperation data and constructing a heterogeneous scientific cooperation network;

step 2, sampling neighbor nodes on the heterogeneous scientific cooperation network;

step 3, neighbor information aggregation is carried out on the sampled neighbor nodes;

step 4, training on the heterogeneous scientific cooperation network, and learning to obtain heterogeneous network representation;

step 5, taking the element-wise (Hadamard) product of the representation vectors of the nodes at the two ends of an edge as the representation vector of that edge, and training a binary classifier with the corresponding edges;

step 6, using the binary classifier to predict whether an unknown link exists; if the predicted link is between two authors, the two authors will co-author an article in the future, and if the predicted link is between an author and a paper, the author will cite that paper in the future;

specifically, the heterogeneous scientific cooperative network is represented as $G = (V, E, A, R)$, wherein $V$ denotes the node set and $E$ denotes the edge set, and the nodes and edges have the type mappings $\phi: V \rightarrow A$ and $\psi: E \rightarrow R$, respectively,

wherein $A$ denotes the node type set, comprising an author type, a paper type and a conference type, and $R$ denotes the edge type set, comprising an author-writes-paper type, a paper-cites-paper type and a paper-published-at-conference type;

the neighbor node sampling is based on meta-paths; a meta-path is a path defined on the graph $G = (V, E, A, R)$ and is represented in the following form:

$$A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_k} A_{k+1}$$

which defines the compound relation $R = R_1 \circ R_2 \circ \cdots \circ R_k$ between node types $A_1$ and $A_{k+1}$, wherein $\circ$ denotes the composition operation on relations; the meta-path represents the semantic information formed between $A_1$ and $A_{k+1}$ by linking $k+1$ nodes with $k$ edges.

Furthermore, the neighbor node sampling comprises the following steps of, step 201, performing direct attribute and indirect attribute aggregation on the nodes of the heterogeneous scientific cooperative network, and step 202, performing direct neighbor and indirect neighbor sampling on the nodes of the heterogeneous scientific cooperative network;

the direct attribute is an attribute of the node, the indirect attribute is an additional attribute obtained from the mutual relation in the heterogeneous scientific cooperative network, and the additional attribute is a supplement to the direct attribute; the direct attributes of the paper nodes comprise paper titles, paper abstracts and pre-training structure embedding, and the indirect attributes of the paper nodes comprise structure embedding of admission conferences, structure embedding of paper authors, structure embedding of cited documents and title vectors of the cited documents; the direct attribute of the author node comprises the structure embedding of the author node, and the indirect attribute of the author node is a vector representation of a title and a summary of a paper written by an author; the direct attribute of the conference node comprises the structure embedding of the conference node, and the indirect attribute of the conference node comprises the vector representation of the title and the abstract of the conference admission paper;

the direct neighbor is a neighbor node of the target node which is directly connected in the heterogeneous scientific cooperative network through a connecting edge, and the indirect neighbor is a neighbor node of the target node which is connected in the heterogeneous scientific cooperative network through a meta-path.

Preferably, the direct attributes and the indirect attributes are aggregated by a BiLSTM (Bidirectional Long Short-Term Memory) and then mean-pooled to obtain a more expressive $d_a$-dimensional initial feature vector $x_i \in \mathbb{R}^{d_a \times 1}$, $i \in A$, wherein $x_i$ is the feature vector obtained by the BiLSTM and $\mathbb{R}^{d_a \times 1}$ indicates that the vector has dimension $d_a \times 1$.

Further, the neighbor information aggregation includes step 301, performing homogeneous information aggregation on similar sampling neighbors, and step 302, performing heterogeneous information aggregation on different types of neighbors by adopting an attention mechanism;

the homogeneous information aggregation first uses the BiLSTM to aggregate the information vectors of the sampled neighbors and then obtains the information aggregation vector by mean pooling;

and in the heterogeneous information aggregation, an attention mechanism is adopted to learn the aggregation weight of the heterogeneous type.

Preferably, in the BiLSTM used for the homogeneous information aggregation, the forward LSTM output sequence $\overrightarrow{LSTM}$ and the backward LSTM output sequence $\overleftarrow{LSTM}$ are concatenated to capture sequential associations between elements; the $d$-dimensional output vector $y_i \in \mathbb{R}^{d \times 1}$, $i \in A$, of the type-$i$ neighbors of node $m$ is obtained from the following formula:

$$y_i = \frac{1}{|Neighbor_i(m)|} \sum_{j \in Neighbor_i(m)} \left[ \overrightarrow{LSTM}(x_j) \oplus \overleftarrow{LSTM}(x_j) \right]$$

wherein $\oplus$ denotes the concatenation operation, $Neighbor_i(m)$ denotes the set of type-$i$ neighbor nodes of node $m$, $\overrightarrow{LSTM}(x_j)$ denotes the forward representation vector obtained for node $j$ by the bidirectional LSTM, and $\overleftarrow{LSTM}(x_j)$ denotes the backward representation vector obtained for node $j$ by the bidirectional LSTM; $\overrightarrow{LSTM}$ and $\overleftarrow{LSTM}$ are each computed by an LSTM:

$$f_j = \sigma(W_f \cdot [h_{j-1}, x_j] + b_f)$$

$$i_j = \sigma(W_i \cdot [h_{j-1}, x_j] + b_i)$$

$$o_j = \sigma(W_o \cdot [h_{j-1}, x_j] + b_o)$$

$$h_j = o_j \cdot \tanh(C_j)$$

wherein $f$, $i$ and $o$ are the forget gate, the input gate and the output gate, respectively, $W$ and $b$ are learnable parameters, $d = 2d_a$, $\sigma$ denotes the sigmoid activation function, and $x_j$ denotes the feature vector obtained by the BiLSTM;

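in the attention-based heterogeneous information aggregation, given a target node of type $k \in A$ and the aggregation vectors $y_i$, $i \in A$, the corresponding attention coefficient $\alpha_{ki}$ and self-attention coefficient $\alpha'_k$ are

$$\alpha_{ki} = \frac{\exp\left(\sigma\left(u_i^{T} [y_i \oplus x_k]\right)\right)}{\exp\left(\sigma\left(u'^{T}_k [x_k \oplus x_k]\right)\right) + \sum_{l \in A} \exp\left(\sigma\left(u_l^{T} [y_l \oplus x_k]\right)\right)}$$

$$\alpha'_k = \frac{\exp\left(\sigma\left(u'^{T}_k [x_k \oplus x_k]\right)\right)}{\exp\left(\sigma\left(u'^{T}_k [x_k \oplus x_k]\right)\right) + \sum_{l \in A} \exp\left(\sigma\left(u_l^{T} [y_l \oplus x_k]\right)\right)}$$

wherein $\oplus$ denotes the concatenation operation, $x_k$ denotes the feature vector obtained by the BiLSTM, $u_i$ is a learnable attention parameter, $u'_k$ is a learnable self-attention parameter, $u_l^{T}$ is the transpose of the attention parameter vector of node type $l$, $y_l$ is the type-based aggregation vector of node type $l$, and $\sigma$ is the LeakyReLU activation function; the aggregation vector of a type-$k$ node is:

$$z_k = \alpha'_k x_k + \sum_{i \in A} \alpha_{ki} y_i$$

the attention mechanism model is repeated $n$ times independently, and the mean of the learned embeddings is taken as the final aggregation vector:

$$z'_k = \frac{1}{n} \sum_{i=1}^{n} z_k^{(i)}$$

wherein $z_k^{(i)}$ denotes the $i$-th computation of the aggregation vector of a type-$k$ node.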

Preferably, a graph context loss is employed in the training process in step 4, and the following optimization objective is defined:

$$o = \sum_{\langle v, i, j \rangle \in RW_G} \left[ \log \theta\left(z'_i \cdot z'_v\right) + \log \theta\left(-z'_j \cdot z'_v\right) \right]$$

wherein $RW_G$ is the set of paths sampled by random walks on graph $G$; $\langle v, i, j \rangle$ is a triple in which $v \in V$ is the target node, $i \in C_v$ is a context node of $v$ in $RW_G$, and $j \in V$ is a negatively sampled node; $z'_v$, $z'_i$ and $z'_j$ denote the final aggregation vectors of nodes $v$, $i$ and $j$, respectively; and $\theta$ is the sigmoid activation function.

Specifically, the meta-paths are interpretable meta-paths with a length not exceeding 3, and the meta-path types are: A-P-A, co-authors; A-P, an author writes a paper; A-P-P, a paper cited by an author's paper; A-P-V, a conference that accepted an author's paper; A-P-P-V, a conference that accepted a paper cited by an author's paper; P-A, a paper written by an author; P-P, a paper cites a paper; P-A-P, papers written by the same author; P-V-P, papers accepted by the same conference; P-V, a paper accepted by a conference; P-P-V, a conference that accepted a cited paper; V-P-A, an author of a paper accepted by a conference; V-P, a conference accepts a paper; V-P-P, a paper cited by a paper of the conference; V-P-P-V, a conference that accepted a paper cited by a paper of the conference.

Preferably, when the neighbor node sampling is performed, the same number of neighbors is sampled for each node, and the sampled neighbor set is passed to the subsequent neighbor information aggregation.

Preferably, when the neighbor node sampling is performed, the numbers of neighbors sampled along each type of meta-path are kept balanced.

Compared with the prior art, the method has the following advantages: 1) to aggregate information better, information acquisition in the heterogeneous network and the aggregation of feature information from different sources are studied, and sampling of direct and indirect neighbor nodes is realized based on meta-paths; 2) to better mine the internal relations among neighbors of the same type, a BiLSTM is used to aggregate sampled nodes of the same type, and to aggregate across node types more effectively, an attention mechanism is used for heterogeneous-type aggregation, so that link prediction is more accurate.

Drawings

FIG. 1 is a schematic flow diagram of an embodiment of the present invention;

FIG. 2 is a diagram of a heterogeneous network according to an embodiment of the present invention;

FIG. 3 is a meta-path diagram according to an embodiment of the invention;

FIG. 4 is a flow diagram of a heterogeneous scientific collaborative network learning representation in accordance with an embodiment of the present invention;

FIG. 5 is a diagram illustrating the aggregation of features from different sources for paper nodes according to an embodiment of the present invention;

FIG. 6 is a flow chart illustrating neighbor sampling according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of the information aggregation process for sampled neighbors according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

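As shown in fig. 1, a heterogeneous network link prediction method suitable for a scientific cooperative network includes the following steps:

step 1, acquiring scientific cooperation data and constructing a heterogeneous scientific cooperation network;

step 2, sampling neighbor nodes on the heterogeneous scientific cooperation network;

step 3, aggregating the information of the sampled neighbor nodes;

step 4, training on the heterogeneous scientific cooperation network to learn a heterogeneous network representation;

step 5, taking the element-wise (Hadamard) product of the representation vectors of the nodes at the two ends of an edge as the representation vector of that edge, and training a binary classifier with the corresponding edges;

step 6, using the binary classifier to predict whether an unknown link exists; if the predicted link is between two authors, the two authors will co-author an article in the future, and if the predicted link is between an author and a paper, the author will cite that paper in the future.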

A heterogeneous network is a special form of network that contains multiple types of nodes or multiple types of edges. The heterogeneous network may be represented as $G = (V, E, A, R)$, where $V$ denotes the node set and $E$ denotes the edge set, and the nodes and edges of the heterogeneous network have the type mappings $\phi: V \rightarrow A$ and $\psi: E \rightarrow R$, respectively, where $A$ and $R$ denote the type sets of the nodes and of the edges. In addition, nodes in $V$ of different types have different attributes. For a heterogeneous network, $|A| + |R| > 2$.

As shown in fig. 2, the scientific cooperative network serves as an example of a heterogeneous network. A node in $V$ may be an author, a paper or a conference, and an edge in $E$ may be of the author-writes-paper, paper-cites-paper or paper-published-at-conference type. For a paper node, the attributes include the abstract, the title and the like; for an author node, the attributes include published articles, collaborating researchers and the like; and for a conference node, the attributes include the accepted articles and the like. In this example, $|A| = 3$ and $|R| = 3$, so the network is a heterogeneous network.
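The definition above can be made concrete with a small data-structure sketch. This is only an illustration under stated assumptions: the node identifiers, the relation names (`writes`, `cites`, `published_at`) and the helper `build_adjacency` are hypothetical and do not come from the source.

```python
from collections import defaultdict

# Node types: A (author), P (paper), V (venue/conference). All IDs are made up.
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "v1": "V"}

# Typed edges (source, relation, target) for the three edge types in the text.
edges = [
    ("a1", "writes", "p1"),
    ("a2", "writes", "p1"),
    ("a1", "writes", "p2"),
    ("p2", "cites", "p1"),
    ("p1", "published_at", "v1"),
    ("p2", "published_at", "v1"),
]


def phi(node):
    """Node-type mapping phi: V -> A."""
    return node_type[node]


def build_adjacency(edge_list):
    """Undirected adjacency sets, grouped by the type of the neighbor."""
    adj = defaultdict(lambda: defaultdict(set))
    for src, _relation, dst in edge_list:
        adj[src][phi(dst)].add(dst)
        adj[dst][phi(src)].add(src)
    return adj


adjacency = build_adjacency(edges)
node_types = set(node_type.values())
edge_types = {relation for _src, relation, _dst in edges}

# Heterogeneity condition from the text: |A| + |R| > 2.
is_heterogeneous = len(node_types) + len(edge_types) > 2
```

Grouping neighbors by type up front is a design choice made here so that later per-type sampling and aggregation can read the neighbor sets directly.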

A meta-path is a path defined on the graph $G = (V, E, A, R)$ and is represented in the form:

$$A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_k} A_{k+1}$$

This defines the compound relation $R = R_1 \circ R_2 \circ \cdots \circ R_k$ between $A_1$ and $A_{k+1}$, where $\circ$ denotes the composition operation on relations. The path represents the semantic information formed between $A_1$ and $A_{k+1}$ by connecting $k+1$ nodes (node types may repeat) with $k$ edges (edge types may repeat). $A$ denotes the node type set, a subscripted $A$ denotes a node of one specific type, and the subscript is an index.

As shown in FIG. 3, in an academic collaboration network, for example, different authors may be connected via an A-P-A (Author-Paper-Author) meta-path, which represents paper co-authorship, and authors and conferences may be connected via an A-P-V (Author-Paper-Venue) meta-path, which represents an author participating in an academic conference.
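As a minimal sketch of how such meta-paths connect nodes, the function below follows a sequence of node types through a toy graph. The function name `metapath_neighbors`, the adjacency layout and the node identifiers are assumptions made for illustration only.

```python
def metapath_neighbors(adjacency, start, metapath):
    """Nodes reached from `start` by following the node-type sequence `metapath`.

    `adjacency[node][node_type]` is the set of neighbors of that type.
    `metapath` is a tuple such as ("A", "P", "A"); its first entry is the type of `start`.
    """
    frontier = {start}
    for next_type in metapath[1:]:
        frontier = {
            neighbor
            for node in frontier
            for neighbor in adjacency.get(node, {}).get(next_type, ())
        }
    frontier.discard(start)  # a node is not counted as its own meta-path neighbor
    return frontier


# Toy graph: a1 and a2 co-wrote p1, a1 also wrote p2, both papers appeared at v1.
adjacency = {
    "a1": {"P": {"p1", "p2"}},
    "a2": {"P": {"p1"}},
    "p1": {"A": {"a1", "a2"}, "V": {"v1"}},
    "p2": {"A": {"a1"}, "V": {"v1"}},
    "v1": {"P": {"p1", "p2"}},
}

coauthors = metapath_neighbors(adjacency, "a1", ("A", "P", "A"))
venues = metapath_neighbors(adjacency, "a1", ("A", "P", "V"))
```

Here `coauthors` holds the authors reachable from `a1` through a shared paper, and `venues` holds the conferences reachable through one of its papers.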

Given a heterogeneous network $G = (V, E, A, R)$, the core of heterogeneous network representation (graph embedding) is to extract the features of the network. It learns a unified $d$-dimensional representation vector for each node in the network, $X \in \mathbb{R}^{|V| \times d}$ with $d \ll |V|$, so that $X$ captures the structural and semantic information in the network and can be applied to subsequent tasks such as node classification, clustering, link prediction, recommendation and network reconstruction.

As shown in FIG. 4, the representation learning of the heterogeneous network in the present embodiment includes three phases: 1. meta-path-based neighbor sampling; 2. information aggregation of the sampled neighbors; 3. objective optimization and model training.

1. Meta-path based neighbor sampling

At the heart of most graph neural networks is the aggregation of neighbor information, and in large-scale networks it is also common to aggregate sampled neighbors, as in GraphSAGE. However, directly aggregating neighbors or sampled neighbors presents several problems:

Simple neighbor aggregation cannot directly capture the information of all node types. Taking the scientific cooperation network as an example, author nodes are never directly connected to each other, yet authors who jointly complete the same paper are closely connected through collaboration; this is semantic information that directly connected edges cannot reflect.

Different types of neighbor nodes have different content attribute features. These different features need to be converted into a unified representation space.

In a heterogeneous network, the feature information of nodes of the same type usually comes from different sources (such as paper abstract information and paper title information). Feature information from different sources is usually simply concatenated, and the correlation among these sources is rarely studied or discussed.

In order to solve the above problems, the present embodiment researches information acquisition in a heterogeneous network and aggregation of feature information from different sources; and then based on the meta path, sampling of the neighbor nodes is realized.

1) Node feature aggregation

The attribute information of a node in the heterogeneous network comprises direct attributes and indirect attributes. A direct attribute is an attribute of the node itself; an indirect attribute is an additional attribute derived from the relationships in the heterogeneous network and supplements the direct attributes. This embodiment introduces the sources of direct and indirect attributes for the different node types and aggregates the attributes of the different node types in an academic collaboration network. There are three types of nodes in the network: paper, author, and conference.

Paper nodes are the most important type of node in the network and are also the core source of attribute data. Their direct attributes include the paper title, the paper abstract, a pre-trained structure embedding and the like. It should be noted that, in order to mine more structural information from the network, the embodiment applies DeepWalk, treating the network as a homogeneous network, to obtain a pre-trained structure embedding for each node, and uses it as part of the node attributes. The indirect attributes of a paper node include the structure embedding of the accepting conference, the structure embeddings of its authors, the structure embeddings of the cited documents, and the title vectors of the cited documents. In order to improve efficiency and form a unified vector space, all indirect attributes are computed as averages over samples.

An author node is directly connected to paper nodes in the network; its direct attribute is its own structure embedding, and its indirect attributes are the vector representations of the titles and abstracts of the papers the author has written. A conference node is also directly connected to paper nodes in the network; its direct attribute is its own structure embedding, and its indirect attributes are the vector representations of the titles and abstracts of the papers it has accepted. For ease of computation, these indirect attributes are also computed as averages over samples.
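The averaging over samples mentioned above can be sketched as follows. The source does not spell out the procedure, so this assumes it means drawing a fixed number of related vectors and taking their element-wise mean; the function name `sampled_mean` and the toy vectors are hypothetical.

```python
import random


def sampled_mean(vectors, sample_size, rng):
    """Element-wise mean of up to `sample_size` vectors drawn without replacement."""
    if not vectors:
        raise ValueError("need at least one vector")
    if len(vectors) <= sample_size:
        chosen = vectors
    else:
        chosen = rng.sample(vectors, sample_size)
    dim = len(chosen[0])
    return [sum(vec[d] for vec in chosen) / len(chosen) for d in range(dim)]


# Example: an author's indirect attribute built from the title vectors of two papers.
title_vectors = [[1.0, 2.0], [3.0, 4.0]]
author_title_attribute = sampled_mean(title_vectors, sample_size=5, rng=random.Random(0))
```

Averaging keeps every indirect attribute at a fixed dimension regardless of how many papers, authors or citations a node has.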

In the prior art, attribute information from different sources is mostly concatenated directly. In order to capture the correlation between feature attributes from different sources and form a vector representation of uniform dimension, the embodiment uses a BiLSTM to capture the correlation between attributes from different sources, forming a more expressive $d_a$-dimensional initial feature vector $x_i \in \mathbb{R}^{d_a \times 1}$, $i \in A$. Fig. 5 illustrates the aggregation of features from different sources for paper nodes.

Similar to the classification of node attributes, the neighbors of a node in a heterogeneous network can also be divided into two categories: direct neighbors and indirect neighbors. Direct neighbors are nodes directly connected by an edge in the network; taking fig. 5 as an example, the direct neighbors of node A1 are P1 and P2. An indirect neighbor is a neighbor connected to the target node through a meta-path in the heterogeneous network; for example, node A1 is connected to node V1 through the meta-path A-P-V and to node A2 through the meta-path A-P-A.

As shown in fig. 6, the direct neighbors and indirect neighbors (meta-path neighbors) of each node together form the neighbor set of the node. In a large network, for efficiency of computation and storage, the present embodiment samples the same number of neighbors for each node and passes the sampled neighbor set to the subsequent neighbor aggregation.
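A minimal sketch of fixed-size neighbor sampling is given below. The source only states that the same number of neighbors is sampled per node; sampling with replacement when a node has too few candidates is an assumption made here, and `sample_neighbors` and the toy candidate sets are hypothetical.

```python
import random


def sample_neighbors(candidates_by_type, quota, rng):
    """Sample `quota[type]` neighbors for each node type.

    Draws without replacement when enough candidates exist and with replacement
    otherwise (an assumption). A type with no candidates yields an empty list.
    """
    sampled = {}
    for node_type, size in quota.items():
        pool = sorted(candidates_by_type.get(node_type, ()))
        if not pool:
            sampled[node_type] = []
        elif len(pool) >= size:
            sampled[node_type] = rng.sample(pool, size)
        else:
            sampled[node_type] = [rng.choice(pool) for _ in range(size)]
    return sampled


# Direct plus meta-path neighbors of one author node in a toy graph.
candidates = {"A": {"a2"}, "P": {"p1", "p2"}, "V": {"v1"}}
quota = {"A": 10, "P": 10, "V": 3}
neighbors = sample_neighbors(candidates, quota, random.Random(10))
```

The quota of 10 authors, 10 papers and 3 venues mirrors the sampling sizes reported in the experimental settings of this document.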

According to the prior art, meta-paths with a length exceeding 3 have very little influence. Therefore, for the scientific cooperation network, the present embodiment selects all interpretable meta-paths of length not exceeding 3 for the three types of nodes, as shown in table 1.

TABLE 1 Meta Path example on scientific collaboration network

Proper meta-path selection allows the structural and semantic information in the network to be captured better. Selecting more meta-paths does not necessarily give better model results: poorly chosen meta-paths contribute little to the result and may even degrade it. For heterogeneous graph neural network models based on meta-paths, more effort should therefore be put into screening the meta-paths.
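To illustrate the screening step, the sketch below enumerates every node-type sequence with at most three edges that the A-P, P-P and P-V edge types allow, and checks that the 15 meta-paths named in this document are among the candidates. The `SCHEMA` dictionary and the function `enumerate_metapaths` are illustrative assumptions; which candidates count as interpretable remains a manual judgment.

```python
# Which node types may follow which, given the A-P, P-P and P-V edge types.
SCHEMA = {"A": ["P"], "P": ["A", "P", "V"], "V": ["P"]}


def enumerate_metapaths(schema, max_hops):
    """All node-type sequences with 1..max_hops edges that respect the schema."""
    paths = []
    frontier = [(node_type,) for node_type in schema]
    for _ in range(max_hops):
        frontier = [path + (nxt,) for path in frontier for nxt in schema[path[-1]]]
        paths.extend(frontier)
    return ["-".join(path) for path in paths]


candidates = enumerate_metapaths(SCHEMA, max_hops=3)

# The 15 meta-paths listed in this document as interpretable.
selected = [
    "A-P-A", "A-P", "A-P-P", "A-P-V", "A-P-P-V",
    "P-A", "P-P", "P-A-P", "P-V-P", "P-V", "P-P-V",
    "V-P-A", "V-P", "V-P-P", "V-P-P-V",
]
missing = [path for path in selected if path not in candidates]
```

Under this schema the enumeration yields 37 candidates, so the selected set is a strict subset chosen for interpretability.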

2. Information aggregation of sampling neighbors

Neighbor aggregation faces the following two problems: how should neighbors of the same type be aggregated, and how should neighbors of different types be aggregated?

To solve these problems, the present embodiment first aggregates the sampled neighbors by type and then aggregates the information of the different types. In order to better mine the internal relations among neighbors of the same type, a BiLSTM is used to aggregate the sampled nodes of the same type. To aggregate across node types more effectively, an attention mechanism is used for heterogeneous-type aggregation.

1) Homogeneous sampling neighbor information aggregation

Fig. 7 shows the basic process of aggregating the information of sampled neighbors of the same type: the hidden-layer sequence is mean-pooled to obtain the information aggregation vector of a specific type.

The BiLSTM concatenates the forward LSTM output sequence $\overrightarrow{LSTM}$ and the backward LSTM output sequence $\overleftarrow{LSTM}$ to capture sequential associations between elements. The $d$-dimensional output vector $y_i \in \mathbb{R}^{d \times 1}$, $i \in A$, of the type-$i$ neighbors of node $m$ is obtained from the following formula:

$$y_i = \frac{1}{|Neighbor_i(m)|} \sum_{j \in Neighbor_i(m)} \left[ \overrightarrow{LSTM}(x_j) \oplus \overleftarrow{LSTM}(x_j) \right]$$

where $\oplus$ denotes the concatenation operation, and $\overrightarrow{LSTM}$ and $\overleftarrow{LSTM}$ are each computed by an LSTM:

$$f_j = \sigma(W_f \cdot [h_{j-1}, x_j] + b_f)$$

$$i_j = \sigma(W_i \cdot [h_{j-1}, x_j] + b_i)$$

$$o_j = \sigma(W_o \cdot [h_{j-1}, x_j] + b_o)$$

$$h_j = o_j \cdot \tanh(C_j)$$

where $f$, $i$ and $o$ are the forget gate, the input gate and the output gate, respectively, and $W$ and $b$ are learnable parameters. To obtain a $d$-dimensional output vector $y_i$, the vector dimensions satisfy the constraint $d = 2d_a$.
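The formulas above give the gates and the hidden state but not how the cell state $C_j$ is updated. Assuming the standard LSTM formulation is intended (the source does not state this explicitly), the missing step would read as follows, where $W_C$, $b_C$ and the candidate state $\tilde{C}_j$ are names introduced here for illustration:

```latex
\tilde{C}_j = \tanh\left(W_C \cdot [h_{j-1}, x_j] + b_C\right), \qquad
C_j = f_j \odot C_{j-1} + i_j \odot \tilde{C}_j
```

Under this reading, the forget gate $f_j$ scales the previous cell state, the input gate $i_j$ scales the new candidate, and the output gate $o_j$ then exposes $\tanh(C_j)$ as the hidden state $h_j$.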

2) Heterogeneous type aggregation

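The present embodiment employs a self-attention mechanism to learn the aggregation weights of the heterogeneous types. Given a target node of type $k \in A$ and the aggregation vectors $y_i$, $i \in A$, the corresponding attention coefficient $\alpha_{ki}$ and self-attention coefficient $\alpha'_k$ are

$$\alpha_{ki} = \frac{\exp\left(\sigma\left(u_i^{T} [y_i \oplus x_k]\right)\right)}{\exp\left(\sigma\left(u'^{T}_k [x_k \oplus x_k]\right)\right) + \sum_{l \in A} \exp\left(\sigma\left(u_l^{T} [y_l \oplus x_k]\right)\right)}$$

$$\alpha'_k = \frac{\exp\left(\sigma\left(u'^{T}_k [x_k \oplus x_k]\right)\right)}{\exp\left(\sigma\left(u'^{T}_k [x_k \oplus x_k]\right)\right) + \sum_{l \in A} \exp\left(\sigma\left(u_l^{T} [y_l \oplus x_k]\right)\right)}$$

where $\oplus$ denotes the concatenation operation, $u_i$ is a learnable attention parameter, $u'_k$ is a learnable self-attention parameter, and $\sigma$ is the LeakyReLU activation function, so the aggregation vector of a type-$k$ node is

$$z_k = \alpha'_k x_k + \sum_{i \in A} \alpha_{ki} y_i$$

Since heterogeneous graphs are scale-free, the variance of the graph data is large. To address this, the heterogeneous aggregation attention is extended to multi-head attention, which makes the training process more stable. Specifically, the present embodiment independently repeats the attention model $n$ times and takes the mean of the learned embeddings as the final aggregation vector:

$$z'_k = \frac{1}{n} \sum_{i=1}^{n} z_k^{(i)}$$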
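The attention-weighted aggregation and the multi-head averaging described above can be sketched in plain Python. Several points are assumptions rather than details confirmed by the source: each score is a LeakyReLU of a dot product between a learnable vector and the concatenation of a candidate vector with $x_k$, the scores are normalized by a softmax, and $x_k$ and every $y_i$ share one dimension. All function and variable names are hypothetical.

```python
import math
import random


def leaky_relu(value, slope=0.01):
    return value if value > 0 else slope * value


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_aggregate(x_k, y_by_type, u_self, u_by_type):
    """One head: z_k = alpha'_k * x_k + sum_i alpha_ki * y_i with softmax weights."""
    scores = {"self": leaky_relu(dot(u_self, x_k + x_k))}  # list + list concatenates
    for node_type, y in y_by_type.items():
        scores[node_type] = leaky_relu(dot(u_by_type[node_type], y + x_k))
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {key: math.exp(score - top) for key, score in scores.items()}
    total = sum(exp_scores.values())
    weights = {key: value / total for key, value in exp_scores.items()}
    z_k = [weights["self"] * value for value in x_k]
    for node_type, y in y_by_type.items():
        z_k = [acc + weights[node_type] * value for acc, value in zip(z_k, y)]
    return z_k, weights


def multi_head_aggregate(x_k, y_by_type, heads):
    """Average the outputs of independently parameterized attention heads."""
    outputs = [
        attention_aggregate(x_k, y_by_type, u_self, u_by_type)[0]
        for u_self, u_by_type in heads
    ]
    return [sum(values) / len(outputs) for values in zip(*outputs)]


# Toy usage with random (untrained) parameters, dimension 4 and 4 heads.
rng = random.Random(0)
dim, n_heads = 4, 4
x_k = [rng.uniform(-1, 1) for _ in range(dim)]
y_by_type = {t: [rng.uniform(-1, 1) for _ in range(dim)] for t in ("A", "P", "V")}
heads = [
    (
        [rng.gauss(0, 1) for _ in range(2 * dim)],
        {t: [rng.gauss(0, 1) for _ in range(2 * dim)] for t in y_by_type},
    )
    for _ in range(n_heads)
]
z_final = multi_head_aggregate(x_k, y_by_type, heads)
```

In a trained model the parameter vectors would be learned by backpropagation; here they are random only to show the data flow.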

3. Objective optimization and model training

To implement heterogeneous network representation learning, the present embodiment employs a graph context loss and defines the following optimization objective:

$$o = \sum_{\langle v, i, j \rangle \in RW_G} \left[ \log \theta\left(z'_i \cdot z'_v\right) + \log \theta\left(-z'_j \cdot z'_v\right) \right]$$

where $RW_G$ is the set of paths sampled by random walks on graph $G$; $\langle v, i, j \rangle$ is a triple in which $v \in V$ is the target node, $i \in C_v$ is a context node of $v$ in $RW_G$, and $j \in V$ is a negatively sampled node; and $\theta$ is the sigmoid activation function. Specifically, the present embodiment first generates a series of random-walk paths $RW_G$ on graph $G$. Then, for each node $v$ on each path $w \in RW_G$, it selects a context node $i$ on the path satisfying the distance constraint $dist(v, i) < \epsilon$. Finally, it samples from graph $G$ a node $j$ of the same type as node $i$. The academic collaboration network has 9 types of triples; by counting the frequency of the different node types in graph $G$, the present embodiment keeps the numbers of sampled triples of the different types roughly balanced. With mini-batch training and the Adam optimizer, the parameters of the whole MHGNN model are updated quickly and converge after about 50 epochs, and the accuracy of the representation vectors output by the model improves as the parameters are iterated.
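The triple sampling procedure and the loss can be sketched as follows. This assumes the objective has the usual negative-sampling form (log-sigmoid of the target-context dot product plus log-sigmoid of the negated target-negative dot product) and reports its negated mean as a loss to minimize; the toy graph and all function names are hypothetical.

```python
import math
import random


def random_walk(adjacency, start, length, rng):
    """Uniform random walk visiting up to `length` nodes."""
    walk = [start]
    while len(walk) < length:
        neighbors = adjacency.get(walk[-1], [])
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


def sample_triples(walks, node_type, max_dist, rng):
    """Triples (v, i, j): i lies within dist < max_dist of v on a walk, j shares i's type."""
    nodes_by_type = {}
    for node, kind in sorted(node_type.items()):
        nodes_by_type.setdefault(kind, []).append(node)
    triples = []
    for walk in walks:
        for pos, v in enumerate(walk):
            for offset in range(-(max_dist - 1), max_dist):
                ctx = pos + offset
                if offset == 0 or not 0 <= ctx < len(walk) or walk[ctx] == v:
                    continue
                i = walk[ctx]
                # Simplification: j is drawn uniformly and may be a true neighbor of v.
                j = rng.choice(nodes_by_type[node_type[i]])
                triples.append((v, i, j))
    return triples


def log_sigmoid(value):
    """Numerically stable log(sigmoid(value))."""
    if value >= 0:
        return -math.log1p(math.exp(-value))
    return value - math.log1p(math.exp(value))


def graph_context_loss(triples, emb):
    """Negated mean of log sig(z_i . z_v) + log sig(-z_j . z_v) over all triples."""
    total = 0.0
    for v, i, j in triples:
        pos_score = sum(a * b for a, b in zip(emb[i], emb[v]))
        neg_score = sum(a * b for a, b in zip(emb[j], emb[v]))
        total += log_sigmoid(pos_score) + log_sigmoid(-neg_score)
    return -total / len(triples)


# Toy graph; walk count, walk length and distance follow the reported settings.
adjacency = {
    "a1": ["p1", "p2"], "a2": ["p1"],
    "p1": ["a1", "a2", "v1", "p2"], "p2": ["a1", "v1", "p1"],
    "v1": ["p1", "p2"],
}
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "v1": "V"}
rng = random.Random(10)
walks = [random_walk(adjacency, node, 30, rng) for node in adjacency for _ in range(10)]
triples = sample_triples(walks, node_type, max_dist=5, rng=rng)
embeddings = {node: [0.0] * 8 for node in adjacency}  # untrained placeholder vectors
loss = graph_context_loss(triples, embeddings)
```

The sketch does not balance the nine triple types; a fuller version would reweight or resample triples by the type frequencies mentioned in the text.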

In this embodiment, data from 2006 to 2015 are extracted from the AMiner dataset for the experiments. The heterogeneous network data are shown in table 2, and for the different tasks the training and test sets are split using 2012 and 2013, respectively, as boundaries.

TABLE 2 Dataset details

Node type	Count	Edge type	Count
A (author)	28646	A-P	69311
P (paper)	21044	P-P	49631
V (venue)	18	P-V	21044

In node sampling, 10 A, 10 P and 3 V nodes are sampled from the neighbors of each node as its sampled neighbors. In node feature aggregation, the dimension of all input data is 128, the graph embedding dimension is 128, the learning rate is 0.001, the batch size is 200, the number of training epochs is 60, the optimizer is Adam, the random seed is 10, the number of samples for the graph context loss is 20000, the number of walks sampled per node is 10, the walk length is 30, the distance constraint is 5, and the number of attention heads is 4.
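For reference, the settings listed above can be collected in one configuration object. The class name `ExperimentConfig` and its field names are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentConfig:
    """Hyperparameters reported in the text; field names are illustrative."""
    neighbors_per_type: dict = field(default_factory=lambda: {"A": 10, "P": 10, "V": 3})
    input_dim: int = 128
    embed_dim: int = 128
    learning_rate: float = 0.001
    batch_size: int = 200
    epochs: int = 60
    optimizer: str = "Adam"
    seed: int = 10
    context_samples: int = 20000
    walks_per_node: int = 10
    walk_length: int = 30
    max_context_distance: int = 5
    attention_heads: int = 4


config = ExperimentConfig()
```

Keeping the values in one place makes it easy to pass the same settings to the sampling, aggregation and training stages.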

The following network representation and graph neural network algorithms were chosen as comparison algorithms in this embodiment:

metapath2vec: the model constructs the heterogeneous neighbor set of a node using meta-path-based random walks, and generates the corresponding node representation with a heterogeneous skip-gram model.

GraphSAGE: a classical graph neural network model that obtains the feature representation of a node by aggregating the information of its neighboring nodes in a specific form (mean, pooling, or LSTM).

GAT: the method learns the weights of different neighbor nodes through an attention mechanism, so that neighbor information is aggregated more effectively.

HetGNN: the method obtains the neighbors of each node in the heterogeneous graph through random walk with restart, and performs multiple aggregations according to neighbor type to obtain the heterogeneous graph representation vectors of the nodes.

ASNE: the method simultaneously uses the attribute characteristics and the potential characteristics of the nodes to learn node embedding.

SHNE: the method learns embeddings of text-associated heterogeneous graph nodes by jointly optimizing graph structure similarity and text semantic relevance.

Unlike conventional link prediction, which divides the data randomly, this task divides the data by time and trains the model on the earlier data (for the two tasks, 2012 and 2013 respectively serve as the boundary between the training set and the test set). The element-wise product of the representation vectors of the two endpoints of an edge is taken as the representation vector of that edge, and a binary classifier is trained on the corresponding edges (20% of the edges in the training set).
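The edge representation and classifier training can be sketched in plain Python as below. This is only an illustration under stated assumptions: the embeddings in `emb`, the labelled records, and the helper names `edge_vector`, `train_logistic`, and `predict` are hypothetical, logistic regression is swapped in as one possible binary classifier (the text does not name one), and the 20% subsampling of training edges is not modelled.

```python
import math

def edge_vector(u_vec, v_vec):
    """Element-wise product of the two endpoint representation vectors."""
    return [a * b for a, b in zip(u_vec, v_vec)]

def predict(w, b, x):
    """Probability that the edge with feature vector x exists."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=200):
    """Per-sample gradient descent on the logistic loss."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            g = predict(w, b, x) - y  # gradient of the loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Hypothetical learned node embeddings (2-dimensional for readability).
emb = {"a1": [1.0, 1.0], "a2": [0.9, 1.1], "a3": [-1.0, -1.0],
       "a4": [-1.1, -0.9], "a5": [1.0, 0.8]}

# Hypothetical records: (node u, node v, year, label); label 1 = link exists.
records = [("a1", "a2", 2011, 1), ("a3", "a4", 2012, 1),
           ("a1", "a3", 2011, 0), ("a2", "a4", 2012, 0),
           ("a1", "a5", 2013, 1), ("a5", "a3", 2013, 0)]

BOUNDARY = 2012  # records up to this year train the classifier
train = [r for r in records if r[2] <= BOUNDARY]
test = [r for r in records if r[2] > BOUNDARY]

w, b = train_logistic([edge_vector(emb[u], emb[v]) for u, v, _, _ in train],
                      [y for _, _, _, y in train])
scores = {(u, v): predict(w, b, edge_vector(emb[u], emb[v]))
          for u, v, _, _ in test}
```

Splitting on the year rather than at random means the classifier is only ever evaluated on links that appear after everything it was trained on, which matches the forecasting use case.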

This embodiment tests two types of tasks: 1) author collaboration: predicting whether two authors are likely to co-write an article in the future; 2) article citation: predicting whether an author will cite a given article in the future. The results are shown in Table 3, where the results of the method of the invention are reported as the mean and standard deviation over ten experiments.

Table 3. A-P and A-A link prediction results. The data set is divided into a training set and a test set by year.

According to the results, the method of the invention achieves a 5.7-8.3% improvement on A-A link prediction and a 1.3-2.5% improvement on A-P link prediction. The method of the invention performs best in all four link prediction tasks.

As the summary of the invention and the embodiment show, the method studies how to acquire information in a heterogeneous network and how to aggregate feature information from different sources. For better information aggregation, it samples direct and indirect neighbor nodes based on meta-paths. To better mine the internal connections among neighbors of the same kind, sampled nodes of the same type are aggregated with a BiLSTM; to better combine the node types, aggregation across heterogeneous types is performed with an attention mechanism. Compared with the comparison methods, the method of the invention improves link prediction on heterogeneous networks.

Claims

1. A heterogeneous network link prediction method suitable for a scientific cooperative network is characterized by comprising the following steps:

the heterogeneous scientific cooperative network is represented as G = (V, E, A, R), wherein V represents the node set, E represents the edge set, and the nodes and the edges respectively have the mappings φ: V → A and

defining the composite relation between node types A₁ and A_{k+1}

Wherein

2. The method according to claim 1, wherein the neighbor node sampling comprises: step 201, performing direct attribute and indirect attribute aggregation on the nodes of the heterogeneous scientific cooperative network; and step 202, performing direct neighbor and indirect neighbor sampling on the nodes of the heterogeneous scientific cooperative network;

3. The heterogeneous network link prediction method suitable for scientific cooperative network as claimed in claim 2, wherein the direct attribute and indirect attribute aggregation is BiLSTM aggregation followed by mean pooling, which yields a d_a-dimensional initial feature vector with expressive power

x_i is a feature vector obtained by the BiLSTM,

representing a vector of dimension d_a × 1.

4. A heterogeneous network link prediction method suitable for scientific cooperative network according to claim 2 or 3, wherein the neighbor information aggregation comprises: step 301, performing homogeneous information aggregation on sampled neighbors of the same type; and step 302, performing heterogeneous information aggregation on neighbors of different types by using an attention mechanism;

5. The method of claim 4, wherein the BiLSTM used for homogeneous information aggregation is obtained from the forward LSTM output sequence

and the backward LSTM output sequence

by the following formula:

wherein

and

are obtained by the LSTM:

f_j = σ(W_f·[h_{j-1}, x_j] + b_f)

i_j = σ(W_i·[h_{j-1}, x_j] + b_i)

o_j = σ(W_o·[h_{j-1}, x_j] + b_o)

h_j = o_j·tanh(C_j)

wherein

is a learnable attention parameter,

is a learnable self-attention parameter,

z_k = α'_k x_k + ∑_{i∈A} α_{ki} y_i

wherein

denotes the i-th computation of the aggregation vector of the type-k nodes.
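For illustration only, the weighted sum z_k = α'_k x_k + ∑ α_{ki} y_i can be sketched as follows. The scoring rule used here is an assumption, since the defining formulas for the attention weights are not reproduced in this text: each candidate vector is concatenated with x_k, scored by a dot product with a parameter vector `u` passed through a leaky ReLU, and the scores are normalized with a softmax. The names `attention_aggregate`, `leaky_relu`, and `u` are hypothetical, and a single attention head is shown.

```python
import math

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def attention_aggregate(x_k, type_vectors, u):
    """Combine a node's own vector x_k with the per-type vectors y_i.

    Returns (z_k, alphas), where alphas[0] plays the role of alpha'_k and
    the remaining entries play the role of alpha_ki, one per node type.
    """
    candidates = [x_k] + list(type_vectors.values())
    # Assumed scoring: leaky_relu(u . [candidate ; x_k]) for each candidate.
    scores = [leaky_relu(sum(ui * ci for ui, ci in zip(u, c + x_k)))
              for c in candidates]
    top = max(scores)                        # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]       # softmax weights, sum to 1
    z_k = [sum(a * c[d] for a, c in zip(alphas, candidates))
           for d in range(len(x_k))]
    return z_k, alphas

# Toy vectors (hypothetical): own features plus aggregated A, P, V neighbors.
x_k = [1.0, 0.0]
y = {"A": [0.0, 1.0], "P": [1.0, 1.0], "V": [0.0, 0.0]}
z_k, alphas = attention_aggregate(x_k, y, u=[0.5, 0.5, 0.1, 0.1])
```

With an all-zero `u` every score is equal, so the result reduces to the plain mean of x_k and the y_i, which is a convenient sanity check.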

6. The heterogeneous network link prediction method suitable for scientific cooperative network as claimed in claim 5, wherein graph context loss is adopted in the training process in step 4, and the following optimization objectives are defined:

7. A method as claimed in claim 1 or 6, wherein the meta-path is an interpretable meta-path having a length of 3 or less, and the types of meta-path include: A-P-A represents a co-author; A-P represents an author writing a paper; A-P-P represents a paper cited by an author in writing; A-P-V represents an author's paper accepted by a conference; A-P-P-V represents a cited paper accepted by a conference; P-A represents a paper written by an author; P-P represents a paper citing a paper; P-A-P represents papers written by the same author; P-V-P represents papers accepted by the same conference; P-V represents a paper accepted by a conference; P-P-V represents a cited paper accepted by a conference; V-P-A represents an author of a paper accepted by a conference; V-P represents a conference accepting a paper; V-P-P represents a paper cited by a paper of the conference; and V-P-P-V represents a conference related through citation.
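A small sketch of how neighbors along one such meta-path could be collected on a typed graph is given below. The function name `metapath_neighbors`, the adjacency dictionary `adj`, and the toy graph are hypothetical, and the sketch assumes that edges are stored in both directions, so the direction of a P-P citation is ignored.

```python
def metapath_neighbors(start, metapath, adj, node_type):
    """Return the nodes reached from `start` by following the type sequence."""
    if node_type[start] != metapath[0]:
        return set()
    frontier = {start}
    for next_type in metapath[1:]:
        # Keep only neighbors whose type matches the next step of the path.
        frontier = {n for cur in frontier for n in adj.get(cur, ())
                    if node_type[n] == next_type}
    frontier.discard(start)  # a node is not its own meta-path neighbor
    return frontier

# Toy graph (hypothetical): two authors, two papers, one venue.
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "v1": "V"}
adj = {
    "a1": ["p1"],
    "a2": ["p1", "p2"],
    "p1": ["a1", "a2", "v1", "p2"],
    "p2": ["a2", "v1", "p1"],
    "v1": ["p1", "p2"],
}
coauthors = metapath_neighbors("a1", ["A", "P", "A"], adj, node_type)
venues_of_cited = metapath_neighbors("a1", ["A", "P", "P", "V"], adj, node_type)
```

Meta-paths of length 1 (such as A-P) return direct neighbors, while longer ones (such as A-P-A or A-P-P-V) return indirect neighbors.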

8. The heterogeneous network link prediction method suitable for scientific cooperative networks according to claim 7, wherein when neighbor node sampling is performed, the same number of neighbors is sampled for each node, and the sampled neighbor set is passed to the subsequent neighbor information aggregation.

9. The method as claimed in claim 8, wherein the method controls a balance relationship between the number of neighbors sampled by each type of meta-path when performing neighbor node sampling.