CN113868482A - Heterogeneous network link prediction method suitable for scientific cooperative network - Google Patents

Heterogeneous network link prediction method suitable for scientific cooperative network

Info

Publication number
CN113868482A
Authority
CN
China
Prior art keywords
node
paper
heterogeneous
author
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111286870.6A
Other languages
Chinese (zh)
Inventor
刘忠
马扬
许乃夫
梁星星
冯旸赫
程光权
黄金才
施伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Publication of CN113868482A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a heterogeneous network link prediction method suitable for a scientific cooperative network, comprising the following steps: acquiring scientific cooperation data and constructing a heterogeneous scientific cooperation network; sampling neighbor nodes on the heterogeneous scientific cooperation network; aggregating the information of the sampled neighbor nodes; training on the heterogeneous scientific cooperation network to learn a heterogeneous network representation; taking the element-wise (Hadamard) product of the representation vectors of the two end nodes of an edge as the representation vector of that edge, and training a binary classifier on the corresponding edges; and using the binary classifier to predict whether an unknown link exists. A predicted link between two authors indicates that the two authors will co-author a paper in the future; a predicted link between an author and a paper indicates that the author will cite that paper in the future. The method performs well in predicting author collaboration and the papers an author will cite.

Description

Heterogeneous network link prediction method suitable for scientific cooperative network
Technical Field
The invention belongs to the technical field of social network science, and relates to a heterogeneous network link prediction method suitable for a scientific cooperative network.
Background
In the real world, connections between individuals are ubiquitous, and these complex connections can be described by networks of different forms (social networks, citation networks, protein interaction networks, power networks, etc.). As a common data carrier, network-structured data exists in all aspects of society, and deeply mining the meaningful information in networks has high academic value and potential application value. Although the structures and forms of networks are complex and varied, their basic components are nodes and edges, so the analysis of complex networks is of general significance. Because a network contains a large number of nodes and edges, obtaining the full state information of every node is extremely time-consuming. Network embedding aims to find a mapping function that transforms the structural and attribute features of each node in the network into a low-dimensional latent representation. Graph neural networks, as a powerful deep representation learning method for graph data, show excellent performance in network analysis and have attracted extensive attention from researchers. Through supervised or unsupervised learning, the structural and attribute information of the network is mined, converted into low-dimensional node vectors, and then applied to downstream tasks such as node classification, node clustering and link prediction.
Although many graph neural network methods have been proposed, research still focuses mainly on homogeneous networks, in which every node is of the same type. However, many real networks take the form of heterogeneous graphs. A heterogeneous graph contains rich structural and attribute information, and its nodes and edges are of multiple types. A key problem remains in current network representation research: how should the information of a target node's neighbors be aggregated? Most graph neural networks define the neighbors of a node as its directly connected nodes and iteratively aggregate information from neighbors at greater distances by stacking multiple layers. Some methods expand the neighbor space of a node by random walks and treat all nodes on a walk as neighbors. These methods ignore the fact that the edges in a heterogeneous network are heterogeneous and that a combination of heterogeneous edges carries a specific meaning.
Interactive cooperation among scientists is a long-standing phenomenon in scientific practice, and a considerable amount of communication takes place at most stages of the research process. Scientific disciplines cross and interpenetrate, forming a large scientific system with a complex structure. Scientists not only exchange research results and information, but also often collaborate to produce or report their research results through conversation, writing, reading papers, correspondence and the like. As science becomes increasingly socialized, cooperation in scientific research has become more frequent and has gradually become a social force whose influence on scientific development cannot be ignored. Scientific collaboration networks are, to a large extent, organized by scientists themselves, and many science and technology policy initiatives have also promoted scientific collaboration. When a scientific document is completed by two or more authors, the present embodiment regards these authors as being in scientific collaboration. In a scientific collaboration network, the nodes are authors, and two authors are connected if they have co-authored at least one paper. Scientific collaboration networks are an important class of social networks and have been widely used to determine the structure of scientific collaboration and the status of individual researchers. Studying the cooperative relationships that may arise in a scientific collaboration network in the future is of great significance for promoting scientific research and development, forming team advantages, finding advantageous platforms, and so on.
Predicting whether unknown links may exist in the scientific collaboration network can be used to judge whether two authors will co-author a paper in the future and whether an author will cite a certain paper in the future, which provides strong guidance for scientific cooperation.
Disclosure of Invention
The invention relates to a heterogeneous network link prediction method suitable for a scientific cooperative network, which adopts the following technical scheme.
A heterogeneous network link prediction method suitable for a scientific cooperative network comprises the following steps:
step 1, acquiring scientific cooperation data and constructing a heterogeneous scientific cooperation network;
step 2, sampling neighbor nodes on the heterogeneous scientific cooperation network;
step 3, neighbor information aggregation is carried out on the sampled neighbor nodes;
step 4, training on the heterogeneous scientific cooperation network, and learning to obtain heterogeneous network representation;
step 5, taking the element-wise (Hadamard) product of the representation vectors of the two end nodes of an edge as the representation vector of that edge, and training a binary classifier on the corresponding edges;
step 6, using the binary classifier to predict whether an unknown link exists; a predicted link between two authors indicates that the two authors will co-author a paper in the future, and a predicted link between an author and a paper indicates that the author will cite that paper in the future;
specifically, the heterogeneous scientific cooperation network is represented as $G = (V, E, A, R)$, wherein $V$ represents the node set, $E$ represents the edge set, and the nodes and edges have the type mappings $\phi: V \to A$ and $\psi: E \to R$ respectively, wherein $A$ represents the node type set, comprising an author type, a paper type and a conference type, and $R$ represents the edge type set, comprising an author-writes-paper type, a paper-cites-paper type and a paper-published-at-conference type;
the neighbor node sampling is based on meta-paths; a meta-path is a path defined on the graph $G = (V, E, A, R)$ and is represented in the following form:
$$A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_k} A_{k+1}$$
which defines the composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_k$ between node types $A_1$ and $A_{k+1}$, wherein $\circ$ represents the composition operation between relations; this meta-path represents the semantic information formed by linking $A_1$ and $A_{k+1}$ through $k+1$ nodes and $k$ links.
Furthermore, the neighbor node sampling comprises: step 201, aggregating the direct attributes and indirect attributes of the nodes of the heterogeneous scientific cooperation network; and step 202, sampling the direct neighbors and indirect neighbors of the nodes of the heterogeneous scientific cooperation network;
the direct attributes are the attributes of the node itself, and the indirect attributes are additional attributes derived from the relationships in the heterogeneous scientific cooperation network, which supplement the direct attributes; the direct attributes of a paper node comprise the paper title, the paper abstract and a pre-trained structural embedding, and the indirect attributes of a paper node comprise the structural embedding of the accepting conference, the structural embeddings of the paper's authors, the structural embeddings of the cited documents and the title vectors of the cited documents; the direct attribute of an author node comprises the structural embedding of the author node, and the indirect attribute of an author node is a vector representation of the titles and abstracts of the papers written by the author; the direct attribute of a conference node comprises the structural embedding of the conference node, and the indirect attribute of a conference node comprises a vector representation of the titles and abstracts of the papers accepted by the conference;
a direct neighbor is a neighbor node directly connected to the target node by an edge in the heterogeneous scientific cooperation network, and an indirect neighbor is a neighbor node connected to the target node through a meta-path in the heterogeneous scientific cooperation network.
Preferably, the direct attributes and the indirect attributes are aggregated using a BiLSTM (bidirectional long short-term memory network), followed by mean pooling, to obtain an expressive $d_a$-dimensional initial feature vector $x_i \in \mathbb{R}^{d_a \times 1}$, $i \in A$, where $x_i$ is the feature vector obtained by the BiLSTM and $\mathbb{R}^{d_a \times 1}$ indicates that the vector has dimension $d_a \times 1$.
Further, the neighbor information aggregation includes: step 301, aggregating homogeneous information over sampled neighbors of the same type; and step 302, aggregating heterogeneous information over neighbors of different types using an attention mechanism;
the homogeneous information aggregation first uses a BiLSTM to aggregate the information vectors of the sampled neighbors, and then obtains the information aggregation vector through mean pooling;
the heterogeneous information aggregation uses an attention mechanism to learn the aggregation weights of the heterogeneous types.
Preferably, in the BiLSTM used for homogeneous information aggregation, the forward LSTM output sequence $\overrightarrow{h}$ and the backward LSTM output sequence $\overleftarrow{h}$ are concatenated to discover sequential associations between elements; the $d$-dimensional output vector $y_i \in \mathbb{R}^{d \times 1}$, $i \in A$, of the type-$i$ neighbors of node $m$ is obtained from the following formula:
$$y_i = \frac{1}{|\mathrm{Neighbor}_i(m)|} \sum_{j \in \mathrm{Neighbor}_i(m)} \left[ \overrightarrow{h_j} \oplus \overleftarrow{h_j} \right]$$
where $\oplus$ is the concatenation operation, $\mathrm{Neighbor}_i(m)$ represents the set of type-$i$ neighbor nodes of node $m$, $\overrightarrow{h_j}$ represents the forward representation vector obtained for node $j$ by the bidirectional LSTM, and $\overleftarrow{h_j}$ represents the backward representation vector obtained for node $j$ by the bidirectional LSTM; $\overrightarrow{h_j}$ and $\overleftarrow{h_j}$ are obtained by an LSTM:
$$f_j = \sigma(W_f \cdot [h_{j-1}, x_j] + b_f)$$
$$i_j = \sigma(W_i \cdot [h_{j-1}, x_j] + b_i)$$
$$\tilde{C}_j = \tanh(W_C \cdot [h_{j-1}, x_j] + b_C)$$
$$C_j = f_j \cdot C_{j-1} + i_j \cdot \tilde{C}_j$$
$$o_j = \sigma(W_o \cdot [h_{j-1}, x_j] + b_o)$$
$$h_j = o_j \cdot \tanh(C_j)$$
wherein $f$, $i$ and $o$ are the forget gate, input gate and output gate respectively, $W$ and $b$ are learnable parameters, $d = 2d_a$, $\sigma$ denotes the sigmoid activation function, and $x_j$ represents the feature vector obtained by the BiLSTM;
in the attention-based information aggregation, given a target node of type $k \in A$ and the aggregation vectors $y_i$, $i \in A$, the corresponding attention coefficient $\alpha_{ki}$ and self-attention coefficient $\alpha'_k$ are
Figure BDA0003333199660000055
Figure BDA0003333199660000056
where $\oplus$ is the concatenation operation, $x_k$ represents the feature vector obtained by the BiLSTM,
Figure BDA0003333199660000058
is the learnable attention parameter,
Figure BDA0003333199660000059
is the learnable self-attention parameter,
Figure BDA00033331996600000510
represents the transpose of the self-attention parameter vector of node type $l$, $y_l$ is the type-based aggregation vector of node type $l$, and $\sigma$ is the LeakyReLU activation function; the aggregation vector of a type-$k$ node is:
$$z_k = \alpha'_k x_k + \sum_{i \in A} \alpha_{ki} y_i$$
the attention mechanism is repeated $n$ times independently, and the mean of the learned embeddings is taken as the final aggregation vector:
$$z'_k = \frac{1}{n} \sum_{i=1}^{n} z_k^{(i)}$$
where $z_k^{(i)}$ represents the aggregation vector of the type-$k$ node obtained in the $i$-th computation.
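The weighted combination $z_k = \alpha'_k x_k + \sum_i \alpha_{ki} y_i$ and the averaging over $n$ independent repetitions can be sketched as follows. The scoring function is an assumption: the exact attention formulas are not recoverable from this text, so the sketch scores each vector with LeakyReLU applied to a parameter vector dotted with a concatenation and normalizes with a softmax. The function names, the parameter values, and the assumption that $x_k$ and $y_i$ share one dimension are all hypothetical.

```python
import math

def leaky_relu(t, slope=0.01):
    return t if t > 0 else slope * t

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_aggregate(x_k, ys, u_self, us):
    """z_k = alpha'_k * x_k + sum_i alpha_ki * y_i with softmax-normalized weights.
    ASSUMED scoring: LeakyReLU(u^T [x_k concat y]); not taken from the source."""
    raw = [leaky_relu(dot(u_self, x_k + x_k))]
    raw += [leaky_relu(dot(u, x_k + y)) for u, y in zip(us, ys)]
    alphas = softmax(raw)  # alphas[0] plays the role of alpha'_k
    vecs = [x_k] + ys
    return [sum(a * v[d] for a, v in zip(alphas, vecs)) for d in range(len(x_k))]

def multi_head(x_k, ys, heads):
    """Average the aggregation vectors of n independent attention computations."""
    outs = [attention_aggregate(x_k, ys, u_self, us) for u_self, us in heads]
    return [sum(col) / len(outs) for col in zip(*outs)]

# Toy vectors and parameters (hypothetical values).
x_k = [1.0, 0.0]
ys = [[0.0, 1.0], [1.0, 1.0]]
zero = [0.0] * 4
uniform_head = (zero, [zero, zero])  # all-zero parameters give equal weights
other_head = ([0.3, -0.2, 0.1, 0.4],
              [[0.5, 0.1, -0.3, 0.2], [-0.1, 0.2, 0.3, -0.4]])
z_uniform = attention_aggregate(x_k, ys, *uniform_head)
z_final = multi_head(x_k, ys, [uniform_head, other_head])
```

Because the weights come from a softmax, every output coordinate is a convex combination of the corresponding input coordinates, which makes the result easy to sanity-check.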
Preferably, a graph context loss is employed in the training process in step 4, and the following optimization objective is defined:
Figure BDA0003333199660000061
where $RW_G$ is the set of paths sampled by random walks on graph $G$, $\langle v, i, j \rangle$ is a triplet in which $v \in V$ is the target node, $i \in C_v$ is its context in $RW_G$ and $j \in V$ is a negatively sampled node, $z'_v$, $z'_i$ and $z'_j$ represent the final aggregation vectors of nodes $v$, $i$ and $j$ respectively, and $\theta$ is the sigmoid activation function.
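As an illustration of a triplet-based graph context objective, the following minimal sketch uses the standard skip-gram negative-sampling form, $-[\log\theta(z'_i \cdot z'_v) + \log\theta(-z'_j \cdot z'_v)]$ averaged over triplets. This form is an assumption, since the objective itself is not recoverable from this text, and the embeddings and function names are hypothetical.

```python
import math

def log_sigmoid(t):
    # Numerically stable log(sigmoid(t)).
    return -math.log1p(math.exp(-t)) if t >= 0 else t - math.log1p(math.exp(t))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def graph_context_loss(emb, triplets):
    """ASSUMED negative-sampling loss over (target v, context i, negative j)."""
    total = 0.0
    for v, i, j in triplets:
        total -= log_sigmoid(dot(emb[i], emb[v])) + log_sigmoid(-dot(emb[j], emb[v]))
    return total / len(triplets)

# Toy embeddings: "i" agrees with the target "v", "j" points the other way.
emb = {"v": [1.0, 0.0], "i": [1.0, 0.0], "j": [-1.0, 0.0]}
good = graph_context_loss(emb, [("v", "i", "j")])  # context close, negative far
bad = graph_context_loss(emb, [("v", "j", "i")])   # roles swapped
```

Under this form the loss is lower when context nodes lie close to the target and negatively sampled nodes lie far from it, which is the behavior the triplet description suggests.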
Specifically, the meta-paths are interpretable meta-paths with a length not exceeding 3, of the following types, where A denotes an author, P a paper and V a conference:
A-P-A: co-authors;
A-P: an author writes a paper;
A-P-P: an author cites a paper when writing;
A-P-V: an author's paper is accepted by a conference;
A-P-P-V: a paper cited by the author is accepted by a conference;
P-A: a paper is written by an author;
P-P: a paper cites a paper;
P-A-P: papers written by the same author;
P-V-P: papers accepted by the same conference;
P-V: a paper is accepted by a conference;
P-P-V: a cited paper is accepted by a conference;
V-P-A: a conference accepts a paper written by an author;
V-P: a conference accepts a paper;
V-P-P: papers cited by a conference's papers;
V-P-P-V: conferences related through citations.
Preferably, when sampling neighbor nodes, the same number of neighbors is sampled for each node, and the sampled neighbor set is passed to the subsequent neighbor information aggregation.
Preferably, when sampling neighbor nodes, the balance among the numbers of neighbors sampled via each type of meta-path is controlled.
Compared with the prior art, the method has the following advantages: 1) for better information aggregation, information acquisition in a heterogeneous network and the aggregation of feature information from different sources are studied, and direct and indirect neighbor sampling is realized based on meta-paths; 2) to better mine the internal connections among neighbors of the same type, a BiLSTM is used to aggregate sampled nodes of the same type, and to aggregate across node types more effectively, an attention mechanism is used for heterogeneous-type aggregation, so that link prediction is more accurate.
Drawings
FIG. 1 is a schematic flow diagram of an embodiment of the present invention;
FIG. 2 is a diagram of a heterogeneous network according to an embodiment of the present invention;
FIG. 3 is a meta-path diagram according to an embodiment of the invention;
FIG. 4 is a flow diagram of a heterogeneous scientific collaborative network learning representation in accordance with an embodiment of the present invention;
FIG. 5 is a diagram illustrating the aggregation of features from different sources for a paper node according to an embodiment of the present invention;
FIG. 6 is a flow chart illustrating neighbor sampling according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the information aggregation process for sampled neighbors according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, a heterogeneous network link prediction method suitable for a scientific cooperative network includes the following steps:
step 1, acquiring scientific cooperation data and constructing a heterogeneous scientific cooperation network;
step 2, sampling neighbor nodes on the heterogeneous scientific cooperation network;
step 3, neighbor information aggregation is carried out on the sampled neighbor nodes;
step 4, training on the heterogeneous scientific cooperation network, and learning to obtain heterogeneous network representation;
step 5, taking the element-wise (Hadamard) product of the representation vectors of the two end nodes of an edge as the representation vector of that edge, and training a binary classifier on the corresponding edges;
step 6, using the binary classifier to predict whether an unknown link exists; a predicted link between two authors indicates that the two authors will co-author a paper in the future, and a predicted link between an author and a paper indicates that the author will cite that paper in the future.
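Steps 5 and 6 can be illustrated with a minimal sketch: the edge vector is the element-wise product of the two node embeddings, and a binary classifier is trained on observed edges and sampled non-edges. The text does not name a classifier, so the plain logistic regression here is an assumption, and the embeddings, node ids and function names are hypothetical.

```python
import math

def hadamard(u, v):
    """Element-wise product of two node embeddings, used as the edge vector."""
    return [a * b for a, b in zip(u, v)]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def train_logreg(X, y, lr=0.5, epochs=500):
    """Stochastic gradient descent logistic regression (stand-in classifier)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - t  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Probability that the edge represented by x exists."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy learned embeddings (hypothetical values).
emb = {"a1": [1.0, 0.9], "a2": [0.9, 1.0], "a3": [-1.0, 0.8], "p1": [1.0, 1.0]}
pos = [("a1", "a2"), ("a1", "p1"), ("a2", "p1")]  # observed edges
neg = [("a1", "a3"), ("a3", "p1")]                # sampled non-edges
X = [hadamard(emb[u], emb[v]) for u, v in pos + neg]
y = [1] * len(pos) + [0] * len(neg)
w, b = train_logreg(X, y)
score = predict(w, b, hadamard(emb["a2"], emb["a3"]))  # unknown link a2-a3
```

A score above a chosen threshold (for example 0.5) would be read as a predicted link between the two nodes.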
A heterogeneous network is a special form of network that contains multiple types of nodes or multiple types of edges. A heterogeneous network may be represented as $G = (V, E, A, R)$, where $V$ denotes the node set and $E$ denotes the edge set; the nodes and edges have the type mappings $\phi: V \to A$ and $\psi: E \to R$ respectively, where $A$ and $R$ denote the type sets of nodes and edges. In addition, nodes of different types in $V$ have different attributes. For a heterogeneous network, $|A| + |R| > 2$.
As shown in fig. 2, the scientific cooperation network serves as an example of a heterogeneous network. A node in $V$ may be an author, a paper or a conference, and an edge in $E$ may be author-writes-paper, paper-cites-paper or paper-published-at-conference. For a paper node, the attributes include the abstract, title and the like; for an author node, the attributes include published papers, collaborators and the like; and for a conference node, the attributes include the accepted papers and the like. In this example, $|A| = 3$ and $|R| = 3$, so the network is a heterogeneous network.
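A minimal sketch of how such a network might be held in memory is given below. The node ids, edge labels and variable names are hypothetical; only the three node types, the three edge types and the $|A| + |R| > 2$ criterion come from the text.

```python
from collections import defaultdict

# Node-type mapping phi: V -> A (node ids are hypothetical examples).
phi = {"a1": "author", "a2": "author",
       "p1": "paper", "p2": "paper",
       "v1": "conference"}

# Edges as (source, target, type); the third field is the edge-type mapping.
edges = [
    ("a1", "p1", "writes"),
    ("a2", "p1", "writes"),
    ("a1", "p2", "writes"),
    ("p2", "p1", "cites"),
    ("p1", "v1", "published_in"),
    ("p2", "v1", "published_in"),
]

A = set(phi.values())             # node type set
R = {r for _, _, r in edges}      # edge type set
is_heterogeneous = len(A) + len(R) > 2

# Undirected adjacency lists for later neighbor lookups.
adj = defaultdict(set)
for u, v, _ in edges:
    adj[u].add(v)
    adj[v].add(u)
```

Storing the type mappings separately from the adjacency lists keeps neighbor lookups cheap while still allowing filtering by node or edge type.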
A meta-path is a path defined on the graph $G = (V, E, A, R)$ and is represented in the form:
$$A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_k} A_{k+1}$$
This defines the composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_k$ between $A_1$ and $A_{k+1}$, where $\circ$ denotes the composition operation between relations. The path represents the semantic information formed by linking $A_1$ and $A_{k+1}$ through $k+1$ nodes (node types may repeat) and $k$ edges (edge types may repeat). Here $A$ denotes the node type set, a subscripted $A$ denotes a node of a specific type, and the subscript is a numeric index.
As shown in FIG. 3, in an academic collaboration network, for example, different authors may be connected via an A-P-A (Author-Paper-Author) meta-path, which represents paper co-authorship, and authors and conferences may be connected via an A-P-V (Author-Paper-Venue) meta-path, which represents an author participating in an academic conference.
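The meta-path idea can be sketched as a type-constrained walk: starting from a node, follow edges only to nodes whose type matches the next entry of the meta-path. The helper below is a hypothetical illustration; it matches node types only and ignores edge direction, which is a simplifying assumption.

```python
def metapath_neighbors(adj, node_type, start, metapath):
    """Return nodes reachable from `start` by following the node-type sequence
    in `metapath`, e.g. ["author", "paper", "author"] for A-P-A."""
    if node_type[start] != metapath[0]:
        return set()
    frontier = {start}
    for wanted in metapath[1:]:
        frontier = {n for u in frontier for n in adj[u] if node_type[n] == wanted}
    frontier.discard(start)  # a node is not its own neighbor
    return frontier

# Toy network (hypothetical ids): a1 and a2 co-wrote p1, which appeared at v1.
node_type = {"a1": "author", "a2": "author", "a3": "author",
             "p1": "paper", "p2": "paper", "v1": "conference"}
adj = {
    "a1": {"p1"}, "a2": {"p1", "p2"}, "a3": {"p2"},
    "p1": {"a1", "a2", "v1"}, "p2": {"a2", "a3"},
    "v1": {"p1"},
}
coauthors = metapath_neighbors(adj, node_type, "a1", ["author", "paper", "author"])
venues = metapath_neighbors(adj, node_type, "a1", ["author", "paper", "conference"])
```

Here `coauthors` holds the A-P-A neighbors of `a1` and `venues` holds its A-P-V neighbors.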
Given a heterogeneous network $G = (V, E, A, R)$, the core of heterogeneous network representation (graph embedding) is to extract the features of the network. It learns a unified $d$-dimensional representation vector for each node in the network, $X \in \mathbb{R}^{|V| \times d}$ with $d \ll |V|$, so that $X$ captures the structural and semantic information of the network and can be applied to subsequent tasks such as node classification, clustering, link prediction, recommendation and network reconstruction.
As shown in FIG. 4, the representation learning of the heterogeneous network in the present embodiment includes three phases: 1. neighbor sampling based on meta-paths; 2. information aggregation of the sampled neighbors; 3. objective optimization and model training.
1. Meta-path based neighbor sampling
At the heart of most graph neural networks is the aggregation of neighbor information, and in large-scale networks it is also common to aggregate over sampled neighbors, as in GraphSAGE. However, directly aggregating neighbors or sampled neighbors presents several problems:
Simple neighbor aggregation cannot directly capture the information of all node types. Taking a scientific cooperation network as an example, author nodes are not directly connected to each other, but authors who jointly complete the same paper are closely connected and cooperate closely, which is semantic information that directly connected edges cannot reflect.
Different types of neighbor nodes have different content attribute features. These different features need to be converted into a unified representation space.
In a heterogeneous network, the feature information of nodes of the same type usually comes from different sources (such as paper abstract information and paper title information). Feature information from different sources is usually concatenated, but the correlation among the sources is rarely studied and discussed.
To solve the above problems, the present embodiment studies information acquisition in a heterogeneous network and the aggregation of feature information from different sources, and then samples neighbor nodes based on meta-paths.
1) Node feature aggregation
The attribute information of a node in the heterogeneous network comprises direct attributes and indirect attributes. A direct attribute is an attribute of the node data itself; an indirect attribute is an additional attribute derived by organizing and summarizing the relationships in the heterogeneous network, and supplements the direct attributes. This embodiment introduces the direct and indirect attribute sources for the different node types and aggregates the attributes of the different node types in an academic collaboration network. There are three types of nodes in the network: paper, author and conference.
Paper nodes are the most important type of node in the network and are also the core source of attribute data in the network. Their direct attributes include the paper title, the paper abstract, a pre-trained structural embedding and the like. It should be noted that, in order to mine more structural information from the network, the embodiment applies DeepWalk, treating the network as a homogeneous network, to obtain a pre-trained structural embedding for each node, which is used as part of the node attributes. The indirect attributes of a paper node include the structural embedding of the accepting conference, the structural embeddings of its authors, the structural embeddings of the cited documents and the title vectors of the cited documents. To improve efficiency and form a uniform vector space, all indirect attributes are computed as averages over samples.
An author node is directly connected to paper nodes in the network; its direct attribute is the structural embedding of the author node, and its indirect attribute is the vector representation of the titles and abstracts of the papers the author has written. A conference node is also directly connected to paper nodes in the network; its direct attribute is the structural embedding of the conference node, and its indirect attribute is the vector representation of the titles and abstracts of the papers it has accepted. For convenience of calculation, these indirect attributes are also computed as averages over samples.
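One possible reading of "computed as averages over samples" is sketched below: draw a fixed number of related vectors at random and take their element-wise mean. This interpretation, the function name `sampled_mean` and the toy vectors are assumptions, not details given in the text.

```python
import random

def sampled_mean(vectors, k, rng):
    """Average up to k randomly drawn vectors (all of them if fewer than k)."""
    if not vectors:
        raise ValueError("no vectors to sample from")
    picked = rng.sample(vectors, k) if len(vectors) >= k else list(vectors)
    dim = len(picked[0])
    return [sum(v[d] for v in picked) / len(picked) for d in range(dim)]

# Hypothetical title vectors of the papers written by one author.
title_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indirect_attr = sampled_mean(title_vectors, k=3, rng=random.Random(0))
```

Averaging yields a vector of fixed dimension regardless of how many papers an author or conference has, which is what allows the indirect attributes to share one vector space.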
In the prior art, attribute information from different sources is mostly concatenated directly. To capture the correlation between feature attributes from different sources and form a vector representation of uniform dimension, the present embodiment uses a BiLSTM to capture the correlation between attributes from different sources, forming a more expressive $d_a$-dimensional initial feature vector $x_i \in \mathbb{R}^{d_a \times 1}$, $i \in A$. Fig. 5 illustrates the aggregation of features from different sources for a paper node.
Similar to the classification of node attributes, the neighbors of a node in a heterogeneous network can also be divided into two categories: direct neighbors and indirect neighbors. Direct neighbors are nodes directly connected by edges in the network; taking fig. 5 as an example, the direct neighbors of node A1 are P1 and P2. An indirect neighbor is a neighbor connected to the target node through a meta-path in the heterogeneous network; for example, node A1 is connected to node V1 through the meta-path A-P-V and to node A2 through the meta-path A-P-A.
As shown in fig. 6, the direct neighbors and indirect neighbors (meta-path neighbors) of each node together form a neighbor set of the node, and in a large network, for the efficiency of computation and storage, the present embodiment samples the same number of neighbors for each node, and brings the sampled neighbor set into the subsequent neighbor aggregation.
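Sampling the same number of neighbors for every node can be sketched as follows. Padding by sampling with replacement when a node has too few neighbors is an assumption, since the text does not say how such nodes are handled, and the helper name and node ids are hypothetical.

```python
import random

def sample_fixed(neighbors, size, rng):
    """Return exactly `size` neighbors: sample without replacement when there
    are enough candidates, otherwise pad by sampling with replacement."""
    pool = sorted(neighbors)  # sorted for reproducibility
    if not pool:
        return []
    if len(pool) >= size:
        return rng.sample(pool, size)
    return pool + [rng.choice(pool) for _ in range(size - len(pool))]

rng = random.Random(42)
direct = {"p1", "p2"}                  # neighbors via a single edge
indirect = {"a2", "a3", "v1", "p3"}    # neighbors via meta-paths
sampled = sample_fixed(direct | indirect, 4, rng)
padded = sample_fixed({"p1"}, 3, rng)
```

A fixed sample size gives every node an input of the same length for the later aggregation step, at the cost of repeating neighbors for sparsely connected nodes.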
According to prior work, meta-paths with lengths exceeding 3 have very little influence; therefore, for the scientific cooperation network, the present embodiment selects all interpretable meta-paths of length not exceeding 3 for the three node types, as shown in Table 1.
TABLE 1 Meta Path example on scientific collaboration network
Figure BDA0003333199660000111
Appropriate meta-path selection allows the structural and semantic information in the network to be captured better. Selecting more meta-paths does not necessarily give a better model: poorly chosen meta-paths contribute little to the result and may even degrade it. For heterogeneous graph neural network models based on meta-paths, more effort should therefore be put into screening the meta-paths.
2. Information aggregation of sampling neighbors
Neighbor aggregation faces two problems: how should neighbors of the same type be aggregated, and how should neighbors of different types be aggregated?
To solve these problems, this embodiment first aggregates the sampled neighbors by type and then aggregates the information of the different types. To better mine the internal connections among neighbors of the same type, BiLSTM is used to aggregate sampled nodes of the same type. To better aggregate across node types, an attention mechanism is used for the aggregation of heterogeneous types.
1) Homogeneous sampling neighbor information aggregation
Fig. 7 shows the basic process of information aggregation over sampled neighbors of the same type: the hidden-layer sequence is mean-pooled to obtain an information aggregation vector for the specific type.
BiLSTM concatenates the forward LSTM output sequence
Figure BDA0003333199660000121
and the backward LSTM output sequence
Figure BDA0003333199660000122
to discover sequential associations between elements. The d-dimensional output vector of the type-i neighbors of node m
Figure BDA0003333199660000123
i ∈ A, is obtained from the following formula:
Figure BDA0003333199660000124
where
Figure BDA0003333199660000125
is the concatenation operation, and
Figure BDA0003333199660000126
and
Figure BDA0003333199660000127
are obtained through the LSTM:
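$$f_j = \sigma(W_f \cdot [h_{j-1}, x_j] + b_f)$$
$$i_j = \sigma(W_i \cdot [h_{j-1}, x_j] + b_i)$$
$$\tilde{C}_j = \tanh(W_C \cdot [h_{j-1}, x_j] + b_C)$$
$$C_j = f_j \cdot C_{j-1} + i_j \cdot \tilde{C}_j$$
$$o_j = \sigma(W_o \cdot [h_{j-1}, x_j] + b_o)$$
$$h_j = o_j \cdot \tanh(C_j)$$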
where f, i and o are the forget gate, input gate and output gate respectively, and W and b are learnable parameters. To obtain a d-dimensional output vector y_i, the dimensions of the related vectors satisfy the constraint d = 2d_a.
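The same-type aggregation can be illustrated with a small pure-Python sketch of a bidirectional LSTM followed by mean pooling. It is a sketch under assumptions rather than the embodiment's implementation: the random initialisation, the helper names, the toy dimension and the use of the standard LSTM cell-state update are illustrative, and a practical system would normally rely on a deep learning framework.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    """Compute W·v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def init_params(input_dim, hidden_dim, rng):
    """Random (W, b) pairs for gates f, i, o and candidate c (hypothetical init)."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim + input_dim)]
                for _ in range(hidden_dim)]
    return {g: (mat(), [0.0] * hidden_dim) for g in "fico"}

def lstm_step(params, h_prev, c_prev, x):
    """One LSTM step: gates act on the concatenation [h_{j-1}, x_j]."""
    hx = h_prev + x
    f = [sigmoid(a) for a in affine(*params["f"], hx)]
    i = [sigmoid(a) for a in affine(*params["i"], hx)]
    c_tilde = [math.tanh(a) for a in affine(*params["c"], hx)]
    o = [sigmoid(a) for a in affine(*params["o"], hx)]
    c = [fj * cj + ij * ct for fj, cj, ij, ct in zip(f, c_prev, i, c_tilde)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

def run_lstm(params, xs, hidden_dim):
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    outputs = []
    for x in xs:
        h, c = lstm_step(params, h, c, x)
        outputs.append(h)
    return outputs

def bilstm_mean_pool(fwd, bwd, xs, hidden_dim):
    """Concatenate forward and backward states per element, then mean-pool."""
    forward = run_lstm(fwd, xs, hidden_dim)
    backward = run_lstm(bwd, xs[::-1], hidden_dim)[::-1]
    concat = [hf + hb for hf, hb in zip(forward, backward)]
    return [sum(col) / len(concat) for col in zip(*concat)]

rng = random.Random(10)
d_a = 4  # toy dimension chosen for illustration only
neighbors = [[rng.uniform(-1, 1) for _ in range(d_a)] for _ in range(3)]
fwd = init_params(d_a, d_a, rng)
bwd = init_params(d_a, d_a, rng)
y = bilstm_mean_pool(fwd, bwd, neighbors, d_a)  # length 2 * d_a
```

Because the forward and backward hidden states are concatenated, the pooled vector has twice the hidden size, which is how the constraint d = 2d_a arises in this sketch.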
2) Heterogeneous type aggregation
This embodiment employs a self-attention mechanism to learn the aggregation weights of the heterogeneous types. Given a target node of type k ∈ A and aggregation vectors y_i, i ∈ A, the corresponding attention coefficient α_ki and self-attention coefficient α'_k are
Figure BDA00033331996600001210
Figure BDA00033331996600001211
where
Figure BDA0003333199660000131
is the concatenation operation,
Figure BDA0003333199660000132
is the learnable attention parameter,
Figure BDA0003333199660000133
is the learnable self-attention parameter, and σ is the LeakyReLU activation function. The final aggregation vector of a type-k node is therefore
$$z_k = \alpha'_k x_k + \sum_{i \in A} \alpha_{ki} y_i$$
Since heterogeneous graphs are scale-free, the variance of the graph data is large. To address this, the heterogeneous aggregation attention is extended to multi-head attention, which makes the training process more stable. Specifically, this embodiment independently repeats the attention model n times and takes the mean of the learned embeddings as the final aggregation vector:
Figure BDA0003333199660000134
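The attention-weighted combination z_k = α'_k x_k + Σ α_ki y_i and its multi-head averaging can be sketched as below. The exact score formulas are given only in the figures, so this sketch assumes a common form: each score is LeakyReLU applied to the inner product of a parameter vector with the concatenation of x_k and y_i (or of x_k with itself for the self term), normalised by softmax. The function names, the LeakyReLU slope, the parameter values and the assumption that x_k and y_i share one dimension are all hypothetical.

```python
import math

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def aggregate_types(x_k, y, u, u_self):
    """Return (z_k, alpha) with z_k = alpha'_k * x_k + sum_i alpha_ki * y_i.

    x_k: feature vector of the target node; y: {type: aggregation vector y_i};
    u: {type: attention parameter vector}; u_self: self-attention parameter.
    The score form used here is an assumption, not taken from the source.
    """
    scores = {"self": leaky_relu(dot(u_self, x_k + x_k))}
    for t, y_t in y.items():
        scores[t] = leaky_relu(dot(u[t], x_k + y_t))
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(exp.values())
    alpha = {t: e / total for t, e in exp.items()}
    z = [alpha["self"] * v for v in x_k]
    for t, y_t in y.items():
        z = [zv + alpha[t] * yv for zv, yv in zip(z, y_t)]
    return z, alpha

def multi_head(x_k, y, heads):
    """Average the aggregation vectors of n independently parameterised heads."""
    outputs = [aggregate_types(x_k, y, u, u_self)[0] for u, u_self in heads]
    return [sum(col) / len(outputs) for col in zip(*outputs)]

x_k = [1.0, 0.0]
y = {"A": [0.0, 1.0], "P": [1.0, 1.0], "V": [0.5, 0.5]}
# Two hypothetical heads: per-type vectors u_i plus a self vector u'_k.
head_1 = ({"A": [0.1, 0.0, 0.0, 0.2], "P": [0.0, 0.3, 0.1, 0.0],
           "V": [0.2, 0.2, 0.0, 0.0]}, [0.1, 0.1, 0.1, 0.1])
head_2 = ({"A": [0.0, 0.1, 0.3, 0.0], "P": [0.2, 0.0, 0.0, 0.1],
           "V": [0.0, 0.0, 0.2, 0.2]}, [0.3, 0.0, 0.0, 0.1])
z_k = multi_head(x_k, y, [head_1, head_2])
```

The weights of one head always sum to 1, so each head outputs a convex combination of the node's own vector and the per-type aggregation vectors; averaging several heads is what damps the variance noted above.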
3. Objective optimization and model training
To implement heterogeneous network representation learning, this embodiment employs graph context loss and defines the following optimization objective:
Figure BDA0003333199660000135
where RW_G is the set of paths sampled by random walks on graph G, <v, i, j> is a triplet in which v ∈ V is the target node, i ∈ C_v is its context in RW_G, j ∈ V is a negative sampling node, and θ is the sigmoid activation function. Specifically, this embodiment first generates a series of random-walk paths RW_G on graph G. Then, for each node v on each path w ∈ RW_G, it selects a context node i on the path that satisfies the distance constraint dist(v, i) < ε. Finally, it samples from graph G a node j of the same type as node i. The academic collaboration network has 9 types of triplets; by counting the frequency of each node type in graph G, this embodiment keeps the numbers of sampled triplets of the different types roughly balanced. With mini-batch training and the Adam optimizer, the parameters of the whole MHGNN are updated quickly and converge after about 50 epochs, and the accuracy of the representation vectors output by the model improves as the parameters are iterated.
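The triplet construction described above can be sketched as follows. The loss formula itself appears only in the figure, so the loss in this sketch assumes the usual negative-sampling form, in which each triplet contributes -[log θ(z_i · z_v) + log θ(-z_j · z_v)]; the toy graph, the function names and the uniform negative sampling are likewise hypothetical, and the type balancing mentioned in the text is not implemented.

```python
import math
import random

# Toy adjacency (hypothetical). The first letter of a node id is its type.
adj = {"A1": {"P1", "P2"}, "A2": {"P1"},
       "P1": {"A1", "A2", "V1", "P2"}, "P2": {"A1", "V1", "P1"},
       "V1": {"P1", "P2"}}

def random_walk(adj, start, length, rng):
    walk = [start]
    while len(walk) < length:
        nxt = sorted(adj[walk[-1]])
        if not nxt:
            break
        walk.append(rng.choice(nxt))
    return walk

def sample_triplets(adj, walks, window, rng):
    """Build <v, i, j>: dist(v, i) < window on a walk, j has the type of i."""
    by_type = {}
    for n in adj:
        by_type.setdefault(n[0], []).append(n)
    triplets = []
    for walk in walks:
        for pos, v in enumerate(walk):
            lo, hi = max(0, pos - window + 1), min(len(walk), pos + window)
            for q in range(lo, hi):
                i = walk[q]
                if q == pos or i == v:
                    continue
                # Assumption: uniform negative sampling among nodes of i's type.
                j = rng.choice(sorted(by_type[i[0]]))
                triplets.append((v, i, j))
    return triplets

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def context_loss(z, triplets):
    """Mean negative-sampling loss over triplets (assumed form of the objective)."""
    total = 0.0
    for v, i, j in triplets:
        pos = sum(a * b for a, b in zip(z[v], z[i]))
        neg = sum(a * b for a, b in zip(z[v], z[j]))
        total -= math.log(sigmoid(pos)) + math.log(sigmoid(-neg))
    return total / max(len(triplets), 1)

rng = random.Random(10)
walks = [random_walk(adj, n, 6, rng) for n in sorted(adj)]
triplets = sample_triplets(adj, walks, window=3, rng=rng)
z = {n: [0.0, 0.0] for n in adj}  # untrained placeholder embeddings
loss = context_loss(z, triplets)
```

With three node types, the pair (type of v, type of i) can take 3 × 3 = 9 values, which matches the 9 triplet types mentioned above, since j always shares the type of i.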
In this embodiment, data from 2006 to 2015 are extracted from the AMiner data set for the experiments. The heterogeneous network data are shown in Table 2, and for the different tasks the training set and the test set are divided using 2012 and 2013 respectively as boundaries.
Table 2. Details of the data set
Node type     Number    Edge type    Number
A (author)    28646     A-P          69311
P (paper)     21044     P-P          49631
V (venue)     18        P-V          21044
In node sampling, 10 A, 10 P and 3 V nodes are sampled from the neighbors of each node as sampled neighbors. In node feature aggregation, the dimension of all input data is 128, the graph embedding dimension is 128, the learning rate is 0.001, the batch size is 200, the number of training epochs is 60, the optimizer is Adam, the random seed is 10, the number of samples for graph context loss is 20000, the number of sampled paths per node is 10, the path length is 30, the distance constraint is 5, and the number of heads of the multi-head attention is 4.
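For reference, the settings listed above can be collected in a single configuration object. The class name `MHGNNConfig` and the field names are hypothetical; only the values are taken from the text.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class MHGNNConfig:
    # Sampled neighbors per node, by neighbor type
    neighbors_per_type: dict = field(
        default_factory=lambda: {"A": 10, "P": 10, "V": 3})
    input_dim: int = 128              # dimension of all input data
    embed_dim: int = 128              # graph embedding dimension
    learning_rate: float = 0.001
    batch_size: int = 200
    epochs: int = 60
    optimizer: str = "Adam"
    seed: int = 10
    context_loss_samples: int = 20000
    walks_per_node: int = 10          # sampled paths per node
    walk_length: int = 30
    distance_constraint: int = 5      # epsilon in dist(v, i) < epsilon
    attention_heads: int = 4

config = MHGNNConfig()
total_sampled = sum(config.neighbors_per_type.values())
```

Each node therefore carries 23 sampled neighbors into the aggregation step.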
The following network representation and graph neural network algorithms were chosen as comparison algorithms in this example:
metapath2vec: the model constructs the heterogeneous neighbor set of a node by using meta-path-based random walks, and generates the corresponding node representation with a heterogeneous skip-gram model.
GraphSAGE: a classical graph neural network model that obtains the feature representation of a node by aggregating the information of its neighboring nodes in a specific form (mean, pooling or LSTM).
GAT: the method learns the weights of different neighbor nodes through an attention mechanism, so that neighbor information is aggregated more effectively.
HetGNN: the method obtains the neighbors of nodes in the heterogeneous graph through random walk with restart, and performs multiple aggregations according to neighbor type to obtain the heterogeneous graph representation vectors of the nodes.
ASNE: the method uses both the attribute features and the latent features of nodes to learn node embeddings.
SHNE: the method learns the embeddings of text-associated heterogeneous graph nodes by jointly optimizing graph structure similarity and text semantic relevance.
Unlike conventional link prediction, which divides the data randomly, this task splits the data by time and trains the model on the earlier data (in the two tasks, 2012 and 2013 are used respectively as the boundary between the training set and the test set). The element-wise product of the representation vectors of the two endpoints of an edge is taken as the representation vector of the edge, and a binary classifier is trained on the corresponding edges (20% of the edges in the training set).
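The edge representation and classifier training described above can be sketched as follows. This is a minimal illustration under assumptions: the embeddings, the edge records, the inclusive treatment of the boundary year, the use of sampled non-links as negative examples and the hand-written logistic regression are all hypothetical stand-ins, since the text does not specify which binary classifier is used.

```python
import math

def hadamard(u, v):
    """Element-wise product of two endpoint vectors: the edge representation."""
    return [a * b for a, b in zip(u, v)]

def temporal_split(edges, boundary_year):
    """Assumption: years up to and including the boundary form the training set."""
    train = [e for e in edges if e[2] <= boundary_year]
    test = [e for e in edges if e[2] > boundary_year]
    return train, test

def predict(w, b, x):
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

def train_logistic(features, labels, lr=0.5, epochs=200):
    """Plain stochastic gradient descent on the logistic loss."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            g = predict(w, b, x) - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Hypothetical learned embeddings and (u, v, year, label) edge records;
# label 1 is an observed link, label 0 a sampled non-link.
emb = {"A1": [1.0, 0.2], "A2": [0.9, 0.1], "A3": [-1.0, 0.3]}
edges = [("A1", "A2", 2011, 1), ("A1", "A3", 2011, 0),
         ("A2", "A3", 2012, 0), ("A1", "A2", 2013, 1)]

train, test = temporal_split(edges, 2012)
X = [hadamard(emb[u], emb[v]) for u, v, _, _ in train]
Y = [label for _, _, _, label in train]
w, b = train_logistic(X, Y)
scores = [predict(w, b, hadamard(emb[u], emb[v])) for u, v, _, _ in test]
```

A score above 0.5 is read as a predicted future link; any standard binary classifier could take the place of the logistic regression in this sketch.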
This embodiment tests two types of tasks: 1) author collaboration: analyzing whether two authors are likely to collaborate on an article in the future; 2) article citation: analyzing whether an author will cite another article in the future. The results are shown in Table 3, where the results of the method of the invention are reported as the mean and standard deviation of ten experiments.
Table 3. A-P and A-A link prediction results. The data set is divided into a training set and a test set by year.
Figure BDA0003333199660000151
According to result analysis, the method provided by the invention achieves 5.7-8.3% improvement on A-A link prediction and 1.3-2.5% improvement on A-P tasks. The inventive method performs best in all four link prediction tasks.
According to the summary of the invention and the embodiment, the method studies information acquisition in a heterogeneous network and, for better information aggregation, the aggregation of feature information from different sources, and it samples direct and indirect neighbor nodes based on meta-paths. To better mine the internal connections among neighbors of the same type, sampled nodes of the same type are aggregated with BiLSTM; to better aggregate across node types, the heterogeneous types are aggregated with an attention mechanism. Compared with the comparison methods, the method of the invention improves link prediction on the heterogeneous network.

Claims (9)

1. A heterogeneous network link prediction method suitable for a scientific cooperative network is characterized by comprising the following steps:
step 1, acquiring scientific cooperation data and constructing a heterogeneous scientific cooperation network;
step 2, sampling neighbor nodes on the heterogeneous scientific cooperation network;
step 3, neighbor information aggregation is carried out on the sampled neighbor nodes;
step 4, training on the heterogeneous scientific cooperation network, and learning to obtain heterogeneous network representation;
step 5, taking the element-wise product of the representation vectors of the nodes at the two ends of an edge as the representation vector of the edge, and training a binary classifier with the corresponding edges;
step 6, using the binary classifier to predict whether an unknown link exists; if the unknown link is a link between an author and an author, it indicates that the two authors will collaborate on an article in the future, and if the unknown link is a link between an author and a paper, it indicates that the author will cite that article in the future;
the heterogeneous scientific cooperative network is represented as G = (V, E, A, R), wherein V represents the node set, E represents the edge set, and the nodes and the edges respectively have the mappings φ: V → A and
Figure FDA0003333199650000013
wherein A represents the node type set, comprising an author type, a paper type and a conference type, and R represents the edge type set, comprising an author-writes-paper type, a paper-cites-paper type and a paper-published-at-conference type;
the neighbor node sampling is based on a meta-path, which is a path defined on the graph G = (V, E, A, R) and is represented in the following form:
Figure FDA0003333199650000012
which defines a composite relation between nodes A_1 and A_{k+1}
Figure FDA0003333199650000014
wherein
Figure FDA0003333199650000015
denotes the composition operation between relations; this meta-path represents the semantic information formed by the k+1 nodes and k links that connect A_1 and A_{k+1}.
2. The method according to claim 1, wherein the neighbor node sampling comprises: step 201, aggregating the direct attributes and indirect attributes of the nodes of the heterogeneous scientific cooperative network; and step 202, sampling the direct neighbors and indirect neighbors of the nodes of the heterogeneous scientific cooperative network;
the direct attributes are attributes of the node itself, and the indirect attributes are additional attributes obtained from the relations in the heterogeneous scientific cooperative network, which supplement the direct attributes; the direct attributes of a paper node comprise the paper title, the paper abstract and a pre-trained structure embedding, and the indirect attributes of a paper node comprise the structure embedding of the accepting conference, the structure embeddings of the paper's authors, the structure embeddings of the cited documents and the title vectors of the cited documents; the direct attributes of an author node comprise the structure embedding of the author node, and the indirect attributes of an author node are vector representations of the titles and abstracts of the papers written by the author; the direct attributes of a conference node comprise the structure embedding of the conference node, and the indirect attributes of a conference node comprise vector representations of the titles and abstracts of the papers accepted by the conference;
the direct neighbor is a neighbor node of the target node which is directly connected in the heterogeneous scientific cooperative network through a connecting edge, and the indirect neighbor is a neighbor node of the target node which is connected in the heterogeneous scientific cooperative network through a meta-path.
3. The heterogeneous network link prediction method suitable for a scientific cooperative network as claimed in claim 2, wherein the direct attributes and indirect attributes are aggregated by BiLSTM and then mean-pooled to obtain an expressive d_a-dimensional initial feature vector
Figure FDA0003333199650000021
wherein x_i is the feature vector obtained by BiLSTM, and
Figure FDA0003333199650000022
indicates that the vector has dimension d_a × 1.
4. A heterogeneous network link prediction method suitable for a scientific cooperative network according to claim 2 or 3, wherein the neighbor information aggregation comprises: step 301, performing homogeneous information aggregation on sampled neighbors of the same type; and step 302, performing heterogeneous information aggregation on neighbors of different types by using an attention mechanism;
the homogeneous information aggregation first uses BiLSTM to aggregate the information vectors of the sampled neighbors, and then obtains the information aggregation vector through mean pooling;
in the heterogeneous information aggregation, an attention mechanism is used to learn the aggregation weights of the heterogeneous types.
5. The method of claim 4, wherein, in the homogeneous information aggregation, BiLSTM concatenates the forward LSTM output sequence
Figure FDA0003333199650000031
and the backward LSTM output sequence
Figure FDA0003333199650000032
to discover sequential associations between elements, and the d-dimensional output vector of the type-i neighbors of node m
Figure FDA0003333199650000033
is obtained by the following formula:
Figure FDA0003333199650000034
wherein
Figure FDA0003333199650000035
is the concatenation operation, Neighbor_i(m) represents the set of neighbor nodes of node m,
Figure FDA0003333199650000036
represents the forward representation vector obtained for node j by the bidirectional LSTM,
Figure FDA0003333199650000037
represents the backward representation vector obtained for node j by the bidirectional LSTM, and
Figure FDA0003333199650000038
and
Figure FDA0003333199650000039
are obtained through the LSTM:
$$f_j = \sigma(W_f \cdot [h_{j-1}, x_j] + b_f)$$
$$i_j = \sigma(W_i \cdot [h_{j-1}, x_j] + b_i)$$
$$\tilde{C}_j = \tanh(W_C \cdot [h_{j-1}, x_j] + b_C)$$
$$C_j = f_j \cdot C_{j-1} + i_j \cdot \tilde{C}_j$$
$$o_j = \sigma(W_o \cdot [h_{j-1}, x_j] + b_o)$$
$$h_j = o_j \cdot \tanh(C_j)$$
wherein f, i and o are the forget gate, input gate and output gate respectively, W and b are learnable parameters, d = 2d_a, σ denotes the sigmoid activation function, and x_j represents a feature vector obtained by BiLSTM;
in the information aggregation of the attention mechanism, given a target node of type k ∈ A and aggregation vectors y_i, i ∈ A, the corresponding attention coefficient α_ki and self-attention coefficient α'_k are
Figure FDA00033331996500000312
Figure FDA00033331996500000313
wherein
Figure FDA00033331996500000314
is the concatenation operation, x_k represents the feature vector obtained by BiLSTM,
Figure FDA00033331996500000315
is the learnable attention parameter,
Figure FDA00033331996500000316
is the learnable self-attention parameter,
Figure FDA00033331996500000317
represents the transpose of the self-attention parameter vector of node type l, y_l is the type-based aggregation vector of node type l, σ is the LeakyReLU activation function, and the aggregation vector of a type-k node is:
$$z_k = \alpha'_k x_k + \sum_{i \in A} \alpha_{ki} y_i$$
the attention mechanism model is independently repeated n times, and the mean of the learned embeddings is taken as the final aggregation vector:
Figure FDA0003333199650000041
wherein
Figure FDA0003333199650000042
represents the i-th computation of the aggregation vector of a type-k node.
6. The heterogeneous network link prediction method suitable for scientific cooperative network as claimed in claim 5, wherein graph context loss is adopted in the training process in step 4, and the following optimization objectives are defined:
Figure FDA0003333199650000043
wherein RW_G is the set of paths sampled by random walks on graph G, <v, i, j> is a triplet in which v ∈ V is the target node, i ∈ C_v is its context in RW_G, j ∈ V is a negative sampling node, z'_v, z'_i and z'_j represent the final aggregation vectors of nodes v, i and j respectively, and θ is the sigmoid activation function.
7. A method as claimed in claim 1 or 6, wherein the meta-path is an interpretable meta-path having a length of 3 or less, and the types of meta-path include: A-P-A represents co-authors; A-P represents an author writing a paper; A-P-P represents an author citing a paper when writing; A-P-V represents an author's paper being accepted by a conference; A-P-P-V represents a cited paper being accepted by a conference; P-A represents a paper being written by an author; P-P represents a paper citing a paper; P-A-P represents papers written by the same author; P-V-P represents papers accepted by the same conference; P-V represents a paper being accepted by a conference; P-P-V represents a cited paper being accepted by a conference; V-P-A represents a conference accepting a paper written by an author; V-P represents a conference accepting a paper; V-P-P represents a cited paper related to a conference; V-P-P-V represents a cited conference related to a conference.
8. The heterogeneous network link prediction method suitable for scientific cooperative networks according to claim 7, wherein when neighbor node sampling is performed, the same number of neighbors are sampled for each node, and the sampled neighbor set is brought into subsequent neighbor information aggregation.
9. The method as claimed in claim 8, wherein the method controls a balance relationship between the number of neighbors sampled by each type of meta-path when performing neighbor node sampling.
CN202111286870.6A 2021-07-21 2021-11-02 Heterogeneous network link prediction method suitable for scientific cooperative network Pending CN113868482A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110825076 2021-07-21
CN2021108250768 2021-07-21

Publications (1)

Publication Number Publication Date
CN113868482A true CN113868482A (en) 2021-12-31

Family

ID=78986379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111286870.6A Pending CN113868482A (en) 2021-07-21 2021-11-02 Heterogeneous network link prediction method suitable for scientific cooperative network

Country Status (1)

Country Link
CN (1) CN113868482A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115037630A (en) * 2022-04-29 2022-09-09 电子科技大学长三角研究院(湖州) Weighted network link prediction method based on structural disturbance model
CN115037630B (en) * 2022-04-29 2023-10-20 电子科技大学长三角研究院(湖州) Weighted network link prediction method based on structure disturbance model
CN116383446A (en) * 2023-04-06 2023-07-04 哈尔滨工程大学 Author classification method based on heterogeneous quotation network
CN116610807A (en) * 2023-07-21 2023-08-18 北京语言大学 Knowledge structure identification method and device based on heterogeneous graph neural network
CN116610807B (en) * 2023-07-21 2023-10-13 北京语言大学 Knowledge structure identification method and device based on heterogeneous graph neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination