CN113869461B - Author migration classification method for scientific cooperation heterogeneous network - Google Patents

Author migration classification method for scientific cooperation heterogeneous network

Info

Publication number
CN113869461B
Authority
CN
China
Prior art keywords
node
author
heterogeneous
vector
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111286872.5A
Other languages
Chinese (zh)
Other versions
CN113869461A (en)
Inventor
程光权
马扬
梁星星
许乃夫
冯旸赫
黄金才
刘忠
陈晓轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Publication of CN113869461A
Application granted
Publication of CN113869461B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/23 Clustering techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an author migration classification method for a scientific cooperation heterogeneous network, which comprises the following steps: acquiring scientific cooperation data and constructing a heterogeneous scientific cooperation network; classifying each meeting in the heterogeneous scientific cooperation network according to topic, and grouping each author according to the meeting category to which the author's published papers belong; sampling neighbor nodes on the heterogeneous scientific cooperation network; carrying out neighbor information aggregation on the sampled neighbor nodes; training on the heterogeneous scientific cooperation network and learning a heterogeneous network representation; training an author classifier using a training data set of author representation vectors and categories; and using the trained author classifier to classify unknown authors. The method realizes sampling of direct neighbor nodes and indirect neighbor nodes based on meta-paths, and aggregates multiple sources of information, so that the migration classification of authors achieves a better effect.

Description

Author migration classification method for scientific cooperation heterogeneous network
Technical Field
The invention belongs to the technical field of social network science, and relates to an author migration classification method for a scientific cooperation heterogeneous network.
Background
In the real world, connections between individuals are ubiquitous, and these complex connections can be described by networks of different forms (social networks, citation networks, protein interaction networks, electrical networks, etc.). As a common data carrier, data in the form of networks exists in all aspects of society, and deeply mining the meaningful information in these networks has very high academic value and potential application value. The structures and forms of the various networks are complex and diverse, but their basic components are nodes and edges, so the analysis and research of complex networks has universal significance. Because a network contains a large number of nodes and edges, obtaining the complete state information of every node on the network is time-consuming. Network embedding aims to find a mapping function that converts the structural and attribute features of each node in the network into a low-dimensional latent representation. Graph neural networks, as a powerful method for deep representation learning on graph data, show excellent performance in network analysis and have drawn extensive attention from researchers. Through supervised or unsupervised learning, such algorithms mine the structural and attribute information of the network, convert it into low-dimensional node vectors, and apply these vectors to downstream tasks such as node classification, node clustering and link prediction.
Although many graph neural network approaches have been proposed, research has mainly focused on homogeneous networks, i.e. networks in which every node is of the same type. However, most real-world networks exhibit a heterogeneous graph structure. A heterogeneous graph network contains rich structural and attribute information, and its nodes and edges are of multiple different types. A key problem nevertheless remains in current network representation research: how should the information of the neighbors of a target node be aggregated? Most graph neural networks define the neighbors of a node as the nodes directly connected to it, and iteratively aggregate information from neighbors at distances greater than 2 through multi-layer nesting. Some methods expand the neighbor space of a node by random walks and treat the nodes on the walk as its neighbors. These approaches ignore the fact that the edges in a heterogeneous network are heterogeneous, and that combinations of heterogeneous edges carry specific meanings of their own.
Interaction between scientists in a scientific cooperation network is a long-standing phenomenon in scientific practice, and a considerable amount of communication takes place in most stages of the research process. The cross-penetration among many branches of science has developed into a large scientific system with a complex structure; scientists not only exchange research results and information, but also cooperate to produce or report research results while communicating, writing and reading papers and letters. As the degree of socialization of science increases, cooperation in scientific research becomes more frequent, and such cooperation has gradually become a social force that affects scientific development and cannot be ignored. Scientific cooperation networks are largely organized by the scientists themselves, and a large number of science and technology policies have also promoted scientific cooperation. When a scientific document is completed by two or more authors, it indicates that a scientific cooperation relationship exists between those authors. In a scientific cooperation network, a node is an author, and if two authors have co-authored at least one paper, there is a connecting edge between them. Scientific cooperation networks are an important class of social networks that have been widely used to determine the structure of scientific collaboration and the status of individual researchers. Research on possible future cooperation relationships in scientific cooperation networks is of great significance for promoting scientific research and development, forming team advantages, and finding advantageous platforms. Classifying the relevant authors by migration learning, according to the conference types in the scientific cooperation network and the various kinds of associated information in it, provides strong guidance for authors in scientific cooperation to find corresponding resource platforms.
Disclosure of Invention
The invention relates to an author migration classification method for a scientific cooperation heterogeneous network, which adopts the following technical scheme.
An author migration classification method for a scientific collaborative heterogeneous network, comprising the steps of:
step 1, acquiring scientific cooperation data and constructing a heterogeneous scientific cooperation network;
step 2, classifying each meeting in the heterogeneous scientific cooperation network according to the topic, and grouping each author according to the meeting category to which the published paper belongs;
step 3, sampling neighbor nodes on the heterogeneous scientific cooperation network;
step 4, carrying out neighbor information aggregation on the sampled neighbor nodes;
step 5, training is carried out on the heterogeneous scientific cooperation network, and heterogeneous network representation is obtained through learning;
step 6, training an author classifier by using the training data set of the author representation vector and the category;
and step 7, using the trained author classifier to classify unknown authors.
Specifically, the heterogeneous scientific cooperation network is expressed as G=(V,E,A,R), wherein V represents the node set, E represents the connecting-edge set, and the nodes and connecting edges have the mappings $\phi: V \rightarrow A$ and $\psi: E \rightarrow R$ respectively, wherein A represents the node type set, including an author type, a paper type and a meeting type, and R represents the connecting-edge type set, including an author-write-paper type, a paper-citation-paper type and a paper-publication-meeting type;
The neighbor node sampling is based on meta-paths. A meta-path is a path defined on the graph G=(V,E,A,R), expressed as follows: $\Phi: A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_k} A_{k+1}$, which defines the composite relationship $R = R_1 \circ R_2 \circ \cdots \circ R_k$ between $A_1$ and $A_{k+1}$, wherein $\circ$ represents the composition operation between relations; the meta-path represents the semantic information formed by concatenating the k+1 nodes and k connecting edges between $A_1$ and $A_{k+1}$.
Further, the neighbor node sampling includes the following steps, step 201, performing direct attribute and indirect attribute aggregation on the nodes of the heterogeneous scientific cooperation network, and step 202, performing direct neighbor and indirect neighbor sampling on the nodes of the heterogeneous scientific cooperation network;
the direct attribute is the attribute of the node, the indirect attribute is an additional attribute obtained from the interrelation in the heterogeneous scientific cooperation network, and the direct attribute is supplemented; the direct attribute of the paper node comprises a paper title, a paper abstract and a pre-training structure embedding, and the indirect attribute of the paper node comprises a structure embedding of a recording meeting, a structure embedding of a paper author, a structure embedding of a citation and a title vector of the citation; the direct attribute of the author node comprises structural embedding of the author node, and the indirect attribute of the author node is a vector representation of titles and abstracts of papers written by the author; the direct attribute of the conference node comprises structural embedding of the conference node, and the indirect attribute of the conference node comprises vector representation of the title and abstract of the conference admission paper;
the direct neighbors are neighbor nodes of the target nodes which are directly connected through the connecting edges in the heterogeneous scientific cooperation network, and the indirect neighbors are neighbor nodes of the target nodes which are connected through the element paths in the heterogeneous scientific cooperation network.
Preferably, the direct-attribute and indirect-attribute aggregation adopts a BiLSTM (Bi-directional Long Short-Term Memory) followed by mean pooling to obtain an expressive $d_a$-dimensional initial feature vector $x_i$, where $x_i$ is the feature vector obtained by the BiLSTM and the dimension of the representation vector is $d_a \times 1$.
Further, the neighbor information aggregation includes, step 301, performing homogeneous information aggregation on the same-type sampling neighbors, and step 302, performing heterogeneous information aggregation on different-type neighbors by adopting an attention mechanism;
the homogeneous information aggregation is carried out by firstly adopting BiLSTM to aggregate the information vectors of sampling neighbors, and then carrying out mean value pooling to obtain an information aggregation vector;
and learning heterogeneous aggregation weights by adopting a self-attention mechanism in heterogeneous information aggregation.
Preferably, in the BiLSTM used for homogeneous information aggregation, the output sequence of the forward LSTM $\overrightarrow{h}$ and the output sequence of the backward LSTM $\overleftarrow{h}$ are spliced to find the sequential associations between elements, and the d-dimensional output vector $y_i^m$ of the i-type neighbors of node m is obtained by the formula:

$$y_i^m = \frac{1}{|Neighbor_i(m)|} \sum_{j \in Neighbor_i(m)} \left[\, \overrightarrow{h_j} \,\|\, \overleftarrow{h_j} \,\right]$$

wherein $\|$ denotes the splicing operation, $Neighbor_i(m)$ represents the set of i-type neighbor nodes of node m, $\overrightarrow{h_j}$ represents the forward representation vector of node j obtained by the bi-directional LSTM, $\overleftarrow{h_j}$ represents the backward representation vector of node j obtained by the bi-directional LSTM, and $\overrightarrow{h_j}$ and $\overleftarrow{h_j}$ are obtained by the LSTM:

$$f_j = \sigma(W_f \cdot [h_{j-1}, x_j] + b_f)$$
$$i_j = \sigma(W_i \cdot [h_{j-1}, x_j] + b_i)$$
$$o_j = \sigma(W_o \cdot [h_{j-1}, x_j] + b_o)$$
$$C_j = f_j \cdot C_{j-1} + i_j \cdot \tanh(W_C \cdot [h_{j-1}, x_j] + b_C)$$
$$h_j = o_j \cdot \tanh(C_j)$$

wherein f, i, o are respectively the forget gate, input gate and output gate, W, b are learnable parameters, $C_j$ is the LSTM cell state (its update follows the standard LSTM formulation), $d = 2 d_a$, $\sigma$ represents the sigmoid activation function, and $x_j$ represents the feature vector obtained by the BiLSTM;
in the information aggregation of the attention mechanism, given a target node of type $k \in A$ and the aggregation vectors $y_i, i \in A$, the corresponding attention coefficient $\alpha_{ki}$ and self-attention coefficient $\alpha'_k$ are

$$\alpha_{ki} = \frac{\exp\left(\sigma\left(u_i^{\top}[\,x_k \,\|\, y_i\,]\right)\right)}{\exp\left(\sigma\left({u'}_k^{\top}[\,x_k \,\|\, x_k\,]\right)\right) + \sum_{l \in A} \exp\left(\sigma\left(u_l^{\top}[\,x_k \,\|\, y_l\,]\right)\right)}, \qquad \alpha'_k = \frac{\exp\left(\sigma\left({u'}_k^{\top}[\,x_k \,\|\, x_k\,]\right)\right)}{\exp\left(\sigma\left({u'}_k^{\top}[\,x_k \,\|\, x_k\,]\right)\right) + \sum_{l \in A} \exp\left(\sigma\left(u_l^{\top}[\,x_k \,\|\, y_l\,]\right)\right)}$$

wherein $\|$ denotes the splicing operation, $x_k$ represents the feature vector obtained by the BiLSTM, $u_k$ is a learnable attention parameter, $u'_k$ is a learnable self-attention parameter, $u_l^{\top}$ represents the transpose of the attention parameter vector of node type l, $y_l$ is the type-based aggregation vector of node type l, and $\sigma$ is the LeakyReLU activation function; the aggregation vector of the k-type node is:

$$z_k = \alpha'_k x_k + \sum_{i \in A} \alpha_{ki} y_i$$
the attention mechanism model is independently repeated n times, and the learned embedded mean value is taken as a final aggregate vector:
wherein,the ith calculation result of the aggregate vector representing the k-type node.
Preferably, the graph context loss is employed in the training process in step 4, and the following optimization objective is defined:

$$\mathcal{L} = \sum_{\langle v,i,j \rangle \in RW_G} \Big[ \log \theta\big({z'_i}^{\top} z'_v\big) + \log \theta\big(-{z'_j}^{\top} z'_v\big) \Big]$$

wherein $RW_G$ is the set of paths sampled by random walk on graph G, $\langle v,i,j \rangle$ is a triplet in which $v \in V$ is the target node, $i \in C_v$ is its context neighbor on $RW_G$, $j \in V$ is the negative sampling node, $z'_v$ represents the final aggregation vector of node v, $z'_i$ represents the final aggregation vector of node i, $z'_j$ represents the final aggregation vector of node j, and $\theta$ is the sigmoid activation function.
Specifically, the meta-paths adopted are interpretable meta-paths with length not exceeding 3, and the meta-path types specifically comprise: A-P-A represents co-authors; A-P represents an author writing a paper; A-P-P represents a paper cited by an author's paper; A-P-V represents an author's paper accepted by a conference; A-P-P-V represents the conference accepting a paper cited by the author's paper; P-A represents a paper written by an author; P-P represents a paper citing a paper; P-A-P represents papers written by the same author; P-V-P represents papers accepted by the same conference; P-P-V represents the conference accepting a cited paper; V-P-A represents an author whose paper is accepted by the conference; V-P represents a paper accepted by a conference; V-P-P represents a paper cited by a conference's accepted paper; and V-P-P-V represents conferences related through citation.
Preferably, when sampling neighbor nodes, sampling the same number of neighbors for each node, and bringing the sampled neighbor set into subsequent neighbor information aggregation.
Preferably, when sampling the neighbor nodes, the balance relation among the neighbor numbers of the various element path sampling is controlled.
Compared with the prior art, the method has the following advantages: 1) in order to aggregate information better, information acquisition in the heterogeneous network and the aggregation of feature information from different sources are studied, and sampling of direct neighbor nodes and indirect neighbor nodes is realized based on meta-paths; 2) in order to better mine the internal connections among neighbors of the same type, a BiLSTM is adopted to aggregate the sampled nodes of the same type, and in order to better aggregate the nodes of each type, an attention mechanism is adopted to realize heterogeneous-type aggregation, so that the conference-type-based migration classification of authors achieves a better effect.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of a heterogeneous network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a meta-path according to an embodiment of the present invention;
FIG. 4 is a schematic flow diagram of a heterogeneous scientific collaborative network learning representation in accordance with an embodiment of the present invention;
FIG. 5 is an aggregate schematic diagram of different source characteristics of nodes in an embodiment of the present invention;
FIG. 6 is a schematic diagram of a neighbor sampling flow in an embodiment of the present invention;
fig. 7 is a schematic diagram of an information aggregation flow of sampling neighbors in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in fig. 1, an author migration classification method for a scientific cooperative heterogeneous network includes the steps of:
step 1, acquiring scientific cooperation data and constructing a heterogeneous scientific cooperation network;
step 2, classifying each meeting in the heterogeneous scientific cooperation network according to the topic, and grouping each author according to the meeting category to which the published paper belongs;
step 3, sampling neighbor nodes on the heterogeneous scientific cooperation network;
step 4, carrying out neighbor information aggregation on the sampled neighbor nodes;
step 5, training is carried out on the heterogeneous scientific cooperation network, and heterogeneous network representation is obtained through learning;
step 6, training an author classifier by using the training data set of the author representation vector and the category;
and step 7, using the trained author classifier to classify unknown authors.
Heterogeneous networks are a special form of network that contains multiple types of nodes or multiple types of links. A heterogeneous network may be represented as G=(V,E,A,R), where V represents the set of nodes and E represents the set of connecting edges, and the nodes and connecting edges of the heterogeneous network have the mappings $\phi: V \rightarrow A$ and $\psi: E \rightarrow R$ respectively, wherein A and R respectively represent the type sets of nodes and edges. Further, the attributes differ for different types of nodes in V. For a heterogeneous network, $|A| + |R| > 2$.
As shown in fig. 2, a scientific cooperation network is an example of a heterogeneous network. A node V in the network may be an author, a paper or a meeting, and a connecting edge E may be author-write-paper, paper-citation-paper or paper-publication-meeting. For a paper node, its attributes include the abstract, title, etc.; for an author node, its attributes include published articles, collaborating researchers, etc.; for a meeting node, its attributes include the admitted articles, etc. In this example, |A|=3 and |R|=3, so the network is a heterogeneous network.
A meta-path is a path defined on the graph G=(V,E,A,R) and is expressed as follows: $\Phi: A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_k} A_{k+1}$. This defines the composite relationship $R = R_1 \circ R_2 \circ \cdots \circ R_k$ between $A_1$ and $A_{k+1}$, wherein $\circ$ represents the composition operation between relations. The meta-path represents the semantic information formed by concatenating the k+1 nodes (node types may repeat) and k edges (edge types may repeat) between $A_1$ and $A_{k+1}$; A represents the node type set, and the subscript denotes the serial number of a node of the given type along the path.
As shown in FIG. 3, in the academic collaboration network example, different authors may be connected by an A-P-A (Author-Paper-Author) meta-path, which represents Paper collaboration, and authors and meetings may be connected by an A-P-V (Author-Paper-Venue) meta-path, which represents authors participating in an academic meeting.
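By way of illustration only, the following Python sketch builds a toy heterogeneous cooperation network and enumerates the A-P-A and A-P-V meta-path neighbors of an author node; the node identifiers and the helper function are hypothetical and merely demonstrate the meta-path idea described above (edge types are ignored and only the node-type pattern is matched, which is sufficient for this toy example).

```python
# Toy heterogeneous network: node ids and attributes are hypothetical.
node_types = {"A1": "A", "A2": "A", "A3": "A",
              "P1": "P", "P2": "P", "V1": "V"}          # phi: V -> A
edges = [("A1", "write", "P1"), ("A2", "write", "P1"),  # psi: E -> R
         ("A1", "write", "P2"), ("A3", "write", "P2"),
         ("P1", "publish", "V1"), ("P2", "publish", "V1")]

# Undirected adjacency, looked up by node id.
adj = {}
for u, _, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def meta_path_neighbors(start, pattern):
    """Nodes reachable from `start` along a node-type pattern, e.g. 'APA'."""
    frontier = {start}
    for t in pattern[1:]:                       # walk one hop per type symbol
        frontier = {w for u in frontier for w in adj.get(u, ())
                    if node_types[w] == t}
    return frontier - {start}

print(meta_path_neighbors("A1", "APA"))   # co-authors of A1 -> {'A2', 'A3'}
print(meta_path_neighbors("A1", "APV"))   # venues of A1's papers -> {'V1'}
```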
Given a heterogeneous network G=(V,E,A,R), the core of heterogeneous network representation (graph embedding) is to extract the features of the network. It learns a unified d-dimensional representation $X \in \mathbb{R}^{|V| \times d}$ for the nodes in the network; X can capture the structural and semantic information in the network, and is then applied to subsequent tasks such as node classification, clustering, link prediction, recommendation and network reconstruction.
As in fig. 4, the representation learning of the heterogeneous network of the present embodiment includes three phases, 1. Neighbor sampling based on meta-paths; 2. information aggregation of sampling neighbors; 3. and (5) target optimization and model training.
1. Meta-path based neighbor sampling
The core of most graph neural networks is the aggregation of neighbor information, and in large-scale networks the aggregation of sampled neighbors is also common, such as in GraphSAGE. However, there are several problems with directly employing aggregation of neighbors or sampled neighbors:
simple neighbor aggregation cannot directly capture all types of node information. Taking a scientific cooperation network as an example, all author nodes are not directly connected, but close connection and cooperation are achieved among authors of the same paper together, and the semantic information which cannot be reflected by the direct connection side is obtained.
Different types of neighbor nodes have different content attribute characteristics. It is necessary to transform the different features into a unified representation space.
In heterogeneous networks, feature information for the same type of node is usually obtained through different acquisition paths (such as paper abstract information and paper title information), and feature information from different sources is usually simply spliced together, but the associations between these pieces of feature information are rarely studied and discussed.
In order to solve the above problems, the present embodiment has studied information acquisition in a heterogeneous network and aggregation of feature information of different sources; and then based on the meta-path, sampling of the neighbor nodes is realized.
1) Node feature aggregation
The attribute information of the nodes in the heterogeneous network comprises a direct attribute and an indirect attribute, wherein the direct attribute is the attribute of the data, the indirect attribute is an additional attribute obtained by sorting and summarizing the interrelation in the heterogeneous network, and the direct attribute is complemented. The present embodiment introduces direct and indirect sources of attributes for different types of nodes and aggregates the attributes for different types of nodes in an academic collaboration network. There are three types of nodes in the network: papers, authors, and conferences.
The paper nodes are the most important types of nodes in the network and are also core sources of attribute data in the network, and the direct attributes of the paper nodes comprise paper titles, paper abstracts, pre-training structure embedding and the like. It should be noted that in order to mine more structural information from the network, the present embodiment uses deep, regards the network as a homogeneous network, obtains a pre-training structure for each node in the network to embed, and uses it as part of the node attributes. Indirect attributes of paper nodes include structural embedding of the admission meeting, structural embedding of its author, structural embedding of the citation, and heading vector of the citation. In order to improve the operation efficiency and form a unified vector space, all indirect attributes are obtained through sampling average calculation.
The author node is directly connected to paper nodes in the network; its direct attribute is its structural embedding, and its indirect attribute is the vector representation of the titles and abstracts of the papers it has written. The conference node is also directly connected to paper nodes in the network; its direct attribute is its structural embedding, and its indirect attribute is the vector representation of the titles and abstracts of its admitted papers. For ease of calculation, the indirect attributes are also obtained by sampling and averaging.
In the prior art, attribute information from different sources is directly spliced together. In order to capture the associations between feature attributes from different sources and to form a vector representation with a unified dimension, a BiLSTM is used to capture the associations between the attributes from different sources, forming a more expressive $d_a$-dimensional initial feature vector $x_i \in \mathbb{R}^{d_a \times 1}$. FIG. 5 illustrates the aggregation of features from different sources for a paper node.
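A minimal sketch of this attribute-aggregation step, assuming PyTorch is available, is given below; the dimensions, the way the per-source attribute vectors are stacked, and the module name are illustrative assumptions rather than the exact network of the embodiment.

```python
import torch
import torch.nn as nn

class AttributeAggregator(nn.Module):
    """Fuse a node's direct/indirect attribute vectors into one d_a-dim vector
    via a BiLSTM followed by mean pooling (illustrative sketch)."""
    def __init__(self, attr_dim: int, d_a: int):
        super().__init__()
        # Bidirectional LSTM: forward + backward hidden states together give d_a dims.
        assert d_a % 2 == 0
        self.bilstm = nn.LSTM(attr_dim, d_a // 2, batch_first=True,
                              bidirectional=True)

    def forward(self, attrs: torch.Tensor) -> torch.Tensor:
        # attrs: (num_sources, attr_dim), one row per attribute source
        # (e.g. title vector, abstract vector, structural embedding).
        out, _ = self.bilstm(attrs.unsqueeze(0))   # (1, num_sources, d_a)
        return out.mean(dim=1).squeeze(0)          # mean pooling -> (d_a,)

# Toy usage: a paper node with 4 attribute sources of dimension 16.
agg = AttributeAggregator(attr_dim=16, d_a=8)
x_i = agg(torch.randn(4, 16))
print(x_i.shape)   # torch.Size([8])
```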
Similar to the classification of node attributes, neighbors of nodes in heterogeneous networks can also be classified into two categories: direct neighbors and indirect neighbors. The direct neighbors are nodes directly connected by a connecting edge in the network, and taking fig. 5 as an example, the direct neighbors of the node A1 are P1 and P2. The indirect neighbors are neighbors in the heterogeneous network connected with the target node through the meta-path, the node A1 is connected with the V1 node through the meta-path A-P-V, and the node A2 node through the meta-path A-P-A, and the like.
As shown in fig. 6, the direct neighbors and indirect neighbors (meta-path neighbors) of each node together form a neighbor set of nodes, and in a large network, for the efficiency of calculation and storage, the present embodiment samples the same number of neighbors for each node and brings the sampled neighbor set into the subsequent neighbor aggregation.
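The fixed-size sampling itself can be sketched in a few lines of Python; the per-type sample sizes and candidate sets below are placeholders, and resampling with replacement when a node has too few neighbors is one possible convention, not necessarily the one used in the embodiment.

```python
import random

def sample_fixed(neighbors, k, rng=random):
    """Return exactly k neighbors: subsample without replacement when there
    are too many, resample with replacement when there are too few."""
    neighbors = list(neighbors)
    if not neighbors:
        return []
    if len(neighbors) >= k:
        return rng.sample(neighbors, k)
    return neighbors + [rng.choice(neighbors) for _ in range(k - len(neighbors))]

# Hypothetical per-type sample sizes (cf. the 10 A / 10 P / 3 V setting used
# later in the experiments) and hypothetical candidate neighbor sets.
sizes = {"A": 10, "P": 10, "V": 3}
candidates = {"A": ["A2", "A3"], "P": ["P1", "P2", "P4"], "V": ["V1"]}
sampled = {t: sample_fixed(candidates[t], sizes[t]) for t in sizes}
print({t: len(v) for t, v in sampled.items()})   # {'A': 10, 'P': 10, 'V': 3}
```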
According to the prior art, the meta paths with lengths exceeding 3 have very little effect, so for the scientific cooperation network, the present embodiment screens out the interpretable meta paths with lengths not exceeding 3 of all three types of nodes, as shown in table 1.
Table 1 Meta-path examples on a scientific cooperation network

Meta-path    Semantics
A-P-A        co-authors
A-P          author writes a paper
A-P-P        paper cited by an author's paper
A-P-V        author's paper accepted by a conference
A-P-P-V      conference accepting a paper cited by the author's paper
P-A          paper written by an author
P-P          paper cites a paper
P-A-P        papers written by the same author
P-V-P        papers accepted by the same conference
P-P-V        conference accepting a cited paper
V-P-A        author of a paper accepted by the conference
V-P          paper accepted by a conference
V-P-P        paper cited by a conference's accepted paper
V-P-P-V      conferences related through citation
Proper meta-path selection can better capture the structural and semantic information in a network; selecting more meta-paths does not necessarily lead to better model results, and wrongly selected meta-paths contribute little to the result and may even degrade it. For a meta-path-based heterogeneous graph neural network model, more effort should therefore be put into the screening of meta-paths.
2. Information aggregation of sampling neighbors
Neighbor aggregation faces two problems: how are the same type of neighbors aggregated together? How are different types of neighbors grouped together?
In order to solve the above problems, the present embodiment first aggregates the sampled neighbors by type, and then aggregates the different types of information again. To better mine intra-type neighbor connections, a BiLSTM is used to aggregate sampled nodes of the same type. In order to achieve better aggregation for each type of node, an attention mechanism is employed to implement heterogeneous-type aggregation.
1) Homogeneous sampling neighbor information aggregation
As shown in fig. 7, a basic flow of homogeneous sampling neighbor information aggregation is described, and a hidden layer sequence is subjected to mean pooling to obtain a specific type of information aggregation vector.
The BiLSTM concatenates the output sequence of the forward LSTM $\overrightarrow{h}$ with the output sequence of the backward LSTM $\overleftarrow{h}$ to capture the sequential associations between elements. The d-dimensional output vector $y_i^m$ of the i-type neighbors of node m is obtained by the formula:

$$y_i^m = \frac{1}{|Neighbor_i(m)|} \sum_{j \in Neighbor_i(m)} \left[\, \overrightarrow{h_j} \,\|\, \overleftarrow{h_j} \,\right]$$

where $\|$ denotes the splicing operation, $Neighbor_i(m)$ denotes the set of sampled i-type neighbors of node m, and $\overrightarrow{h_j}$ and $\overleftarrow{h_j}$ are obtained by the LSTM:

$$f_j = \sigma(W_f \cdot [h_{j-1}, x_j] + b_f)$$
$$i_j = \sigma(W_i \cdot [h_{j-1}, x_j] + b_i)$$
$$o_j = \sigma(W_o \cdot [h_{j-1}, x_j] + b_o)$$
$$C_j = f_j \cdot C_{j-1} + i_j \cdot \tanh(W_C \cdot [h_{j-1}, x_j] + b_C)$$
$$h_j = o_j \cdot \tanh(C_j)$$

where f, i and o are the forget gate, input gate and output gate respectively, W and b are learnable parameters, $C_j$ is the LSTM cell state (its update follows the standard LSTM formulation), and $\sigma$ is the sigmoid activation function. To obtain the d-dimensional output vector $y_i$, the dimensions of the associated vectors satisfy the constraint $d = 2 d_a$.
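For illustration, the following minimal PyTorch sketch mirrors the formula above: the forward and backward hidden states of a BiLSTM over the sampled i-type neighbors are concatenated and mean-pooled, so the output dimension satisfies d = 2·d_a; the concrete sizes are assumptions.

```python
import torch
import torch.nn as nn

d_a, d = 8, 16                       # d = 2 * d_a for the concatenated output
lstm = nn.LSTM(input_size=d_a, hidden_size=d_a, bidirectional=True,
               batch_first=True)

# x: feature vectors of the sampled i-type neighbors of node m (here 5 of them),
# each of dimension d_a, as produced by the attribute aggregation step.
x = torch.randn(1, 5, d_a)
h, _ = lstm(x)                       # h[..., :d_a] = forward, h[..., d_a:] = backward
y_i_m = h.mean(dim=1).squeeze(0)     # mean pooling over neighbors -> y_i^m
print(y_i_m.shape)                   # torch.Size([16]), i.e. d = 2 * d_a
```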
2) Heterogeneous polymerization
The present embodiment learns heterogeneous-type aggregation weights using a self-attention mechanism. Given a target node of type $k \in A$ and the aggregation vectors $y_i, i \in A$, the corresponding attention coefficient $\alpha_{ki}$ and self-attention coefficient $\alpha'_k$ are

$$\alpha_{ki} = \frac{\exp\left(\sigma\left(u_i^{\top}[\,x_k \,\|\, y_i\,]\right)\right)}{\exp\left(\sigma\left({u'}_k^{\top}[\,x_k \,\|\, x_k\,]\right)\right) + \sum_{l \in A} \exp\left(\sigma\left(u_l^{\top}[\,x_k \,\|\, y_l\,]\right)\right)}, \qquad \alpha'_k = \frac{\exp\left(\sigma\left({u'}_k^{\top}[\,x_k \,\|\, x_k\,]\right)\right)}{\exp\left(\sigma\left({u'}_k^{\top}[\,x_k \,\|\, x_k\,]\right)\right) + \sum_{l \in A} \exp\left(\sigma\left(u_l^{\top}[\,x_k \,\|\, y_l\,]\right)\right)}$$

where $\|$ denotes the splicing operation, $u_k$ is a learnable attention parameter, $u'_k$ is a learnable self-attention parameter, and $\sigma$ is the LeakyReLU activation function, so the final aggregation vector of the k-type node is

$$z_k = \alpha'_k x_k + \sum_{i \in A} \alpha_{ki} y_i$$
Since heterogeneous graphs have a scale-free property, the variance of the graph data is large. To address this, the heterogeneous aggregation attention is extended to multi-head attention, which makes the training process more stable. Specifically, the present embodiment repeats the attention model n times independently and takes the mean of the learned embeddings as the final aggregation vector:

$$z'_k = \frac{1}{n} \sum_{i=1}^{n} z_k^{(i)}$$

where $z_k^{(i)}$ denotes the i-th computation result of the aggregation vector of the k-type node.
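The sketch below is one plausible PyTorch realization of the type-level attention and its multi-head extension described above; the exact scoring function of the original embodiment is not fully specified here, so the concatenation-based scoring, the parameter shapes and the head count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TypeAttention(nn.Module):
    """One plausible realization of the type-level (self-)attention described
    above: each type-based aggregation vector (plus the node's own feature, as
    the self term) is scored against the target feature, the scores are
    softmax-normalized, and the candidates are mixed (illustrative sketch)."""
    def __init__(self, d: int, n_heads: int = 4):
        super().__init__()
        # One attention vector u per head, applied to the concatenation [x_k || .].
        self.u = nn.Parameter(torch.randn(n_heads, 2 * d))
        self.n_heads = n_heads

    def forward(self, x_k: torch.Tensor, ys: torch.Tensor) -> torch.Tensor:
        # x_k: (d,) target-node feature; ys: (num_types, d) per-type aggregates.
        cand = torch.cat([x_k.unsqueeze(0), ys], dim=0)           # self term first
        pairs = torch.cat([x_k.expand_as(cand), cand], dim=-1)    # [x_k || candidate]
        heads = []
        for h in range(self.n_heads):                             # multi-head attention
            scores = F.leaky_relu(pairs @ self.u[h])              # (1 + num_types,)
            alpha = torch.softmax(scores, dim=0)                  # alpha'_k and alpha_ki
            heads.append((alpha.unsqueeze(-1) * cand).sum(dim=0)) # z_k for this head
        return torch.stack(heads).mean(dim=0)                     # mean over the n heads

att = TypeAttention(d=16, n_heads=4)
z_k = att(torch.randn(16), torch.randn(3, 16))   # 3 neighbor types, e.g. A, P, V
print(z_k.shape)                                 # torch.Size([16])
```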
3. target optimization and model training
To achieve heterogeneous network representation learning, the present embodiment employs a graph context loss and defines the following optimization objective:

$$\mathcal{L} = \sum_{\langle v,i,j \rangle \in RW_G} \Big[ \log \theta\big({z'_i}^{\top} z'_v\big) + \log \theta\big(-{z'_j}^{\top} z'_v\big) \Big]$$

where $RW_G$ is the set of paths sampled by random walk on graph G, $\langle v,i,j \rangle$ is a triplet in which $v \in V$ is the target node, $i \in C_v$ is one of its context neighbors on $RW_G$, $j \in V$ is a negative-sampled node, $z'_v$, $z'_i$ and $z'_j$ are the final aggregation vectors of v, i and j, and $\theta$ is the sigmoid activation function. Specifically, the present embodiment first generates a series of random-walk paths $RW_G$ on the graph G. Then, for each node v on each path $w \in RW_G$, the present embodiment selects a context node i on that path satisfying the distance constraint $dist(v,i) < \epsilon$. Finally, the present embodiment samples from graph G a negative node j of the same type as node i. For the academic cooperation network there are 9 types of triplets in total, and by counting the distribution frequencies of the different node types in graph G, an essentially balanced number of samples can be obtained for the different triplet types. Using mini-batch processing and the Adam optimizer, the parameters of the whole MHGNN can be updated and converge quickly (after about 50 epochs), and through continued parameter iteration the accuracy of the representation vectors output by the model becomes higher and higher.
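As an illustration, assuming the standard negative-sampling form of the graph context loss, a PyTorch sketch of the objective over a batch of <target, context, negative> triplets could look as follows; the batch construction from random walks is omitted and the dimensions are placeholders.

```python
import torch
import torch.nn.functional as F

def graph_context_loss(z_v, z_i, z_j):
    """Skip-gram style graph context loss over <target, context, negative>
    triplets, assuming the standard negative-sampling form:
    maximize log sigmoid(z_i . z_v) + log sigmoid(-z_j . z_v)."""
    pos = F.logsigmoid((z_i * z_v).sum(dim=-1))
    neg = F.logsigmoid(-(z_j * z_v).sum(dim=-1))
    return -(pos + neg).mean()       # minimize the negated objective

# Toy batch of 32 triplets with 128-dim final aggregation vectors.
z_v, z_i, z_j = (torch.randn(32, 128, requires_grad=True) for _ in range(3))
loss = graph_context_loss(z_v, z_i, z_j)
loss.backward()
print(float(loss))
```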
In this embodiment, experiments are performed by extracting data between 2006 and 2015 from AMiner data, the heterogeneous network data is shown in table 2, and training sets and test sets are respectively divided by taking 2012 and 2013 as boundaries according to different tasks.
Table 2 dataset detailed attributes
Node type    Quantity    Edge type    Quantity
A (author)   28646       A-P          69311
P (paper)    21044       P-P          49631
V (venue)    18          P-V          21044
In node sampling, 10 A, 10 P and 3 V nodes are sampled from the neighbors of each node as sampled neighbors. In node feature aggregation, the dimension of all input data is 128, the graph embedding dimension is 128, the learning rate is 0.001, the batch size is 200, the number of training epochs is 60, the optimizer is Adam, the random seed is 10, the number of samples for the graph context loss is 20000, the number of sampled paths per node is 10, the path length is 30, the distance constraint is 5, and the number of heads of the multi-head attention is 4.
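For convenience, the hyperparameters listed above can be collected into a single configuration object; the key names below are illustrative only.

```python
# Hyperparameters as reported above, gathered into one config dict
# (key names are illustrative, not from the original implementation).
config = {
    "sample_neighbors": {"A": 10, "P": 10, "V": 3},
    "input_dim": 128, "embedding_dim": 128,
    "learning_rate": 0.001, "batch_size": 200, "epochs": 60,
    "optimizer": "Adam", "random_seed": 10,
    "context_loss_samples": 20000, "walks_per_node": 10,
    "walk_length": 30, "context_distance": 5, "attention_heads": 4,
}
```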
The following network representation and graph neural network algorithm is chosen as a comparison algorithm in this embodiment:
metaath 2Vec: the model utilizes random walk based on a meta-path to construct a node heterogeneous neighbor set, and a heterogeneous skip-gram model is used for generating corresponding node representation.
GAT: the method learns the weights of different neighbor nodes through the attention mechanism, so that the neighbor information is more effectively aggregated.
HetGNN: the method comprises the steps of obtaining neighbors of nodes in a heterogeneous graph through random walk with restarting, and carrying out multiple aggregation aiming at the types of the neighbors to obtain heterogeneous graph representation vectors of the nodes.
ASNE: the method uses both the attribute features and the potential feature sums of the nodes to learn node embedding.
SHNE: according to the method, the similarity of the graph structure and the association degree of text semantics are combined and optimized, and the text-related heterogeneous graph node embedding is learned.
For the classification task, the present embodiment screens 12 meetings out of the 18 meetings and classifies them into four categories according to subject (DM, CV, NLP, DB), and each author is grouped according to the category to which their primarily published papers belong. The method learns a representation vector for each author, takes 10% and 30% of the data respectively as training input to a classifier, and analyzes the classification accuracy on the remaining author nodes. For the clustering task, the embodiment adopts the k-means algorithm to distinguish authors belonging to the four fields.
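Purely as an illustration of this evaluation protocol, the following sketch trains a classifier on a fraction of (hypothetical) author representation vectors and clusters them with k-means; scikit-learn, the logistic-regression classifier and the random vectors are assumptions for demonstration, since the embodiment does not specify the classifier implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, normalized_mutual_info_score

# Hypothetical learned author vectors and their four-way labels (DM/CV/NLP/DB).
rng = np.random.default_rng(10)
author_vecs = rng.normal(size=(1000, 128))
labels = rng.integers(0, 4, size=1000)

# Classification: train on 10% (or 30%) of the authors, evaluate on the rest.
X_tr, X_te, y_tr, y_te = train_test_split(author_vecs, labels,
                                          train_size=0.1, random_state=10)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# Clustering: k-means into the four fields, scored against the labels.
pred = KMeans(n_clusters=4, n_init=10, random_state=10).fit_predict(author_vecs)
print("NMI:", normalized_mutual_info_score(labels, pred))
```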
TABLE 3 Classification and clustering task results
As shown by result analysis, the method has excellent overall performance in clustering and classifying tasks.
In order to better aggregate information, the invention studies information acquisition in the heterogeneous network and the aggregation of feature information from different sources, and realizes sampling of direct neighbor nodes and indirect neighbor nodes based on meta-paths; in order to better mine the internal connections among neighbors of the same type, a BiLSTM is adopted to aggregate the sampled nodes of the same type, and in order to aggregate every type of node better, an attention mechanism is adopted to realize heterogeneous-type aggregation; the method achieves a good effect on the migration classification of authors in a scientific cooperation heterogeneous network.

Claims (7)

1. An author migration classification method for a scientific cooperative heterogeneous network, comprising the steps of:
step 1, acquiring scientific cooperation data and constructing a heterogeneous scientific cooperation network;
step 2, classifying each meeting in the heterogeneous scientific cooperation network according to the topic, and grouping each author according to the meeting category to which the published paper belongs;
step 3, sampling neighbor nodes on the heterogeneous scientific cooperation network;
step 4, carrying out neighbor information aggregation on the sampled neighbor nodes;
step 5, training is carried out on the heterogeneous scientific cooperation network, and heterogeneous network representation is obtained through learning;
step 6, training an author classifier by using the training data set of the author representation vector and the category;
step 7, the trained author classifier is used for classifying unknown authors;
the heterogeneous scientific cooperation network is expressed as G=(V,E,A,R), wherein V represents a node set, E represents a connecting-edge set, and the nodes and connecting edges have the mappings $\phi: V \rightarrow A$ and $\psi: E \rightarrow R$ respectively, wherein A represents a node type set comprising an author type, a paper type and a meeting type, and R represents a connecting-edge type set comprising an author-write-paper type, a paper-citation-paper type and a paper-publication-meeting type;
the neighbor node sampling is based on meta-paths, a meta-path being a path defined on the graph G=(V,E,A,R) and expressed as follows: $\Phi: A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_k} A_{k+1}$, which defines the composite relationship $R = R_1 \circ R_2 \circ \cdots \circ R_k$ between $A_1$ and $A_{k+1}$, wherein $\circ$ represents the composition operation between relations, and the meta-path represents the semantic information formed by concatenating the k+1 nodes and k connecting edges between $A_1$ and $A_{k+1}$;
the neighbor node sampling includes the following steps, step 201, direct attribute and indirect attribute aggregation are carried out on the nodes of the heterogeneous scientific cooperation network, and step 202, direct neighbor and indirect neighbor sampling are carried out on the nodes of the heterogeneous scientific cooperation network;
the direct attribute is the attribute of the node, the indirect attribute is an additional attribute obtained from the interrelation in the heterogeneous scientific cooperation network, and the direct attribute is supplemented; the direct attribute of the paper node comprises a paper title, a paper abstract and a pre-training structure embedding, and the indirect attribute of the paper node comprises a structure embedding of a recording meeting, a structure embedding of a paper author, a structure embedding of a citation and a title vector of the citation; the direct attribute of the author node comprises structural embedding of the author node, and the indirect attribute of the author node is a vector representation of titles and abstracts of papers written by the author; the direct attribute of the conference node comprises structural embedding of the conference node, and the indirect attribute of the conference node comprises vector representation of the title and abstract of the conference admission paper;
the direct neighbors are neighbor nodes directly connected by the target node through the connecting edge in the heterogeneous scientific cooperation network, and the indirect neighbors are neighbor nodes connected by the target node through the meta path in the heterogeneous scientific cooperation network;
the graph context loss is adopted in the training process in the step 4, and the following optimization objective is defined:

$$\mathcal{L} = \sum_{\langle v,i,j \rangle \in RW_G} \Big[ \log \theta\big({z'_i}^{\top} z'_v\big) + \log \theta\big(-{z'_j}^{\top} z'_v\big) \Big]$$

wherein $RW_G$ is the set of paths sampled by random walk on graph G, $\langle v,i,j \rangle$ is a triplet in which $v \in V$ is the target node, $i \in C_v$ is its context neighbor on $RW_G$, $j \in V$ is the negative sampling node, $z'_v$ represents the final aggregation vector of node v, $z'_i$ represents the final aggregation vector of node i, $z'_j$ represents the final aggregation vector of node j, and $\theta$ is the sigmoid activation function.
2. The method for author migration classification of a scientific collaborative heterogeneous network according to claim 1, characterized in that the direct-attribute and indirect-attribute aggregation adopts a BiLSTM followed by mean pooling to obtain an expressive $d_a$-dimensional initial feature vector $x_i$, wherein $x_i$ is the feature vector obtained by the BiLSTM and the dimension of the representation vector is $d_a \times 1$.
3. The method for author migration classification of scientific cooperative heterogeneous network according to claim 1 or 2, wherein the neighbor information aggregation includes step 301, performing homogeneous information aggregation on similar sampled neighbors, and step 302, performing heterogeneous information aggregation on different types of neighbors by adopting an attention mechanism;
the homogeneous information aggregation is carried out by firstly adopting BiLSTM to aggregate the information vectors of sampling neighbors, and then carrying out mean value pooling to obtain an information aggregation vector;
and learning heterogeneous aggregation weights by adopting a self-attention mechanism in heterogeneous information aggregation.
4. The method of claim 3, wherein in the BiLSTM of the homogeneous information aggregation, the output sequence of the forward LSTM $\overrightarrow{h}$ and the output sequence of the backward LSTM $\overleftarrow{h}$ are spliced to find the sequential associations between elements, and the d-dimensional output vector $y_i^m$ of the i-type neighbors of node m is obtained by the formula:

$$y_i^m = \frac{1}{|Neighbor_i(m)|} \sum_{j \in Neighbor_i(m)} \left[\, \overrightarrow{h_j} \,\|\, \overleftarrow{h_j} \,\right]$$

wherein $\|$ denotes the splicing operation, $Neighbor_i(m)$ represents the set of neighbor nodes of node m, $\overrightarrow{h_j}$ represents the forward representation vector of node j obtained by the bi-directional LSTM, $\overleftarrow{h_j}$ represents the backward representation vector of node j obtained by the bi-directional LSTM, and $\overrightarrow{h_j}$ and $\overleftarrow{h_j}$ are obtained by the LSTM:
$$f_j = \sigma(W_f \cdot [h_{j-1}, x_j] + b_f)$$
$$i_j = \sigma(W_i \cdot [h_{j-1}, x_j] + b_i)$$
$$o_j = \sigma(W_o \cdot [h_{j-1}, x_j] + b_o)$$
$$h_j = o_j \cdot \tanh(C_j)$$
wherein f, i, o are respectively a forget gate, an input gate and an output gate, W, b are learnable parameters, $d = 2 d_a$, $\sigma$ represents the sigmoid activation function, and $x_j$ represents the feature vector obtained by the BiLSTM;
in the information aggregation of the attention mechanism, given a target node of type $k \in A$ and the aggregation vectors $y_i, i \in A$, the corresponding attention coefficient $\alpha_{ki}$ and self-attention coefficient $\alpha'_k$ are

$$\alpha_{ki} = \frac{\exp\left(\sigma\left(u_i^{\top}[\,x_k \,\|\, y_i\,]\right)\right)}{\exp\left(\sigma\left({u'}_k^{\top}[\,x_k \,\|\, x_k\,]\right)\right) + \sum_{l \in A} \exp\left(\sigma\left(u_l^{\top}[\,x_k \,\|\, y_l\,]\right)\right)}, \qquad \alpha'_k = \frac{\exp\left(\sigma\left({u'}_k^{\top}[\,x_k \,\|\, x_k\,]\right)\right)}{\exp\left(\sigma\left({u'}_k^{\top}[\,x_k \,\|\, x_k\,]\right)\right) + \sum_{l \in A} \exp\left(\sigma\left(u_l^{\top}[\,x_k \,\|\, y_l\,]\right)\right)}$$

wherein $\|$ denotes the splicing operation, $x_k$ represents the feature vector obtained by the BiLSTM, $u_k$ is a learnable attention parameter, $u'_k$ is a learnable self-attention parameter, $u_l^{\top}$ represents the transpose of the attention parameter vector of node type l, $y_l$ is the type-based aggregation vector of node type l, and $\sigma$ is the LeakyReLU activation function; the aggregation vector of the k-type node is:

$$z_k = \alpha'_k x_k + \sum_{i \in A} \alpha_{ki} y_i$$
the attention mechanism model is independently repeated n times, and the learned embedded mean value is taken as a final aggregate vector:
wherein,the ith calculation result of the aggregate vector representing the k-type node.
5. An author migration classification method for a scientific collaborative heterogeneous network according to claim 1 or 4, wherein the meta-paths adopted are interpretable meta-paths with length not exceeding 3, the meta-path types specifically comprising: A-P-A represents co-authors; A-P represents an author writing a paper; A-P-P represents a paper cited by an author's paper; A-P-V represents an author's paper accepted by a conference; A-P-P-V represents the conference accepting a paper cited by the author's paper; P-A represents a paper written by an author; P-P represents a paper citing a paper; P-A-P represents papers written by the same author; P-V-P represents papers accepted by the same conference; P-P-V represents the conference accepting a cited paper; V-P-A represents an author whose paper is accepted by the conference; V-P represents a paper accepted by a conference; V-P-P represents a paper cited by a conference's accepted paper; and V-P-P-V represents conferences related through citation.
6. The method of claim 5, wherein when sampling neighbor nodes, sampling the same number of neighbors for each node, and bringing the sampled neighbor set into a subsequent neighbor information aggregation.
7. The method of claim 6, wherein the balancing relationship between the number of neighbors of each type of meta-path sampling is controlled when sampling neighbor nodes.
CN202111286872.5A 2021-07-21 2021-11-02 Author migration classification method for scientific cooperation heterogeneous network Active CN113869461B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021108250787 2021-07-21
CN202110825078 2021-07-21

Publications (2)

Publication Number Publication Date
CN113869461A CN113869461A (en) 2021-12-31
CN113869461B 2024-03-12

Family

ID=78986397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111286872.5A Active CN113869461B (en) 2021-07-21 2021-11-02 Author migration classification method for scientific cooperation heterogeneous network

Country Status (1)

Country Link
CN (1) CN113869461B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116578884B (en) * 2023-07-07 2023-10-31 北京邮电大学 Scientific research team identification method and device based on heterogeneous information network representation learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110830291A (en) * 2019-10-30 2020-02-21 辽宁工程技术大学 Node classification method of heterogeneous information network based on meta-path
CN112115971A (en) * 2020-08-13 2020-12-22 中国科学院计算技术研究所 Method and system for portraying scholars based on heterogeneous academic network
CN112989842A (en) * 2021-02-25 2021-06-18 电子科技大学 Construction method of universal embedded framework of multi-semantic heterogeneous graph
WO2021139256A1 (en) * 2020-07-28 2021-07-15 平安科技(深圳)有限公司 Disambiguation method and apparatus for author of paper, and computer device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674298B (en) * 2019-09-29 2022-09-30 安徽信息工程学院 Deep learning mixed topic model construction method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110830291A (en) * 2019-10-30 2020-02-21 辽宁工程技术大学 Node classification method of heterogeneous information network based on meta-path
WO2021139256A1 (en) * 2020-07-28 2021-07-15 平安科技(深圳)有限公司 Disambiguation method and apparatus for author of paper, and computer device
CN112115971A (en) * 2020-08-13 2020-12-22 中国科学院计算技术研究所 Method and system for portraying scholars based on heterogeneous academic network
CN112989842A (en) * 2021-02-25 2021-06-18 电子科技大学 Construction method of universal embedded framework of multi-semantic heterogeneous graph

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Heterogeneous Graph Neural Networks Based on Meta-path; Yang Ma; 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence; main text pages 1-4, figures 1-2 *
Research on Key Technologies of Heterogeneous Information Network Data Mining; 李际超; China Doctoral Dissertations Full-text Database; Vol. 2020, No. 12; full text *

Also Published As

Publication number Publication date
CN113869461A (en) 2021-12-31

Similar Documents

Publication Publication Date Title
Zhang et al. Hierarchical graph pooling with structure learning
CN112669916B (en) Molecular diagram representation learning method based on comparison learning
Standish Open-ended artificial evolution
CN113282612A (en) Author conference recommendation method based on scientific cooperation heterogeneous network analysis
CN106815310B (en) Hierarchical clustering method and system for massive document sets
CN113378913B (en) Semi-supervised node classification method based on self-supervised learning
CN113868482A (en) Heterogeneous network link prediction method suitable for scientific cooperative network
Fan et al. Federated few-shot learning with adversarial learning
CN109753589A (en) A kind of figure method for visualizing based on figure convolutional network
CN111681132B (en) Typical power consumption mode extraction method suitable for massive class unbalanced load data
CN114565053A (en) Deep heterogeneous map embedding model based on feature fusion
Obaid et al. Semantic web and web page clustering algorithms: a landscape view
CN113869461B (en) Author migration classification method for scientific cooperation heterogeneous network
CN113268993A (en) Mutual information-based attribute heterogeneous information network unsupervised network representation learning method
Ali et al. Fake accounts detection on social media using stack ensemble system
CN109344309A (en) Extensive file and picture classification method and system are stacked based on convolutional neural networks
Balafar et al. Active learning for constrained document clustering with uncertainty region
Patel et al. A reduced error pruning technique for improving accuracy of decision tree learning
CN113505223B (en) Network water army identification method and system
CN106156256A (en) A kind of user profile classification transmitting method and system
Vu et al. Graph-based clustering with background knowledge
Mortezanezhad et al. Big-data clustering with genetic algorithm
CN114297498A (en) Opinion leader identification method and device based on key propagation structure perception
CN114611668A (en) Vector representation learning method and system based on heterogeneous information network random walk
Wang et al. Adaptive ensemble method based on spatial characteristics for classifying imbalanced data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant