CN116578884A

CN116578884A - Scientific research team identification method and device based on heterogeneous information network representation learning

Info

Publication number: CN116578884A
Application number: CN202310831630.2A
Authority: CN
Inventors: 李雅文; 王军富; 李昂
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2023-07-07
Filing date: 2023-07-07
Publication date: 2023-08-11
Anticipated expiration: 2043-07-07
Also published as: CN116578884B

Abstract

The invention provides a scientific research team identification method and device based on heterogeneous information network representation learning, belonging to the technical field of big data. The method comprises the following steps: acquiring academic heterogeneous information network information, constructing a heterogeneous graph network structure, and determining the meta-paths, meta-path adjacency vectors and neighbor nodes of each node; inputting the meta-paths, the meta-path adjacency vectors and the neighbor nodes into a trained embedded representation learning model to obtain the structural feature similarity between each node and each of its neighbor nodes and the first node level attention weight of each neighbor node; calculating a second node level attention weight of each node based on the first node level attention weights, and determining the responsible person of the scientific research team based on the second node level attention weights; and determining the core members and the non-core members based on the first node level attention weights and the structural feature similarity between each node and its neighbor nodes. The method can accurately identify the responsible person, the core members and the non-core members of a scientific research team in an academic heterogeneous information network.

Description

Scientific research team identification method and device based on heterogeneous information network representation learning

Technical Field

The invention relates to the technical field of big data, in particular to a scientific research team identification method and device based on heterogeneous information network representation learning.

Background

A heterogeneous information network is an interconnected, large-scale and structurally complex directed graph formed by multiple types of nodes and edges, in which the number of node types or the number of edge types is greater than 1. A heterogeneous information network composed of nodes such as scholars, papers, conferences and journals, and of edges of types such as 'publication' and 'treatise', is an academic heterogeneous information network. Representation learning on a heterogeneous information network maps vectors from the original high-dimensional, sparse space into a low-dimensional vector space and learns a low-dimensional, dense, real-valued vector representation for each node, so that similar entities lie closer together in that space and dissimilar entities lie farther apart.

With the rapid development of the Internet and the continuous improvement of scientific research at home and abroad, the volume of data such as journal papers, funded projects and patents has grown explosively. The large amount of information about scientific research teams contained in these various types of data forms an academic heterogeneous information network that is interconnected, large in scale and complex in structure, and effectively identifying scientific research teams based on this network is very important. The traditional scientific research team identification method uses questionnaires to collect information related to a scientific research team, such as organization information and team project information. This method can overcome spatial limitations, but when information on scientific research teams is collected at large scale it greatly increases the labor, financial and material costs.

Because members of a scientific research team generally have close relations such as partnership, membership and cooperation, they form different connections among different nodes in an academic heterogeneous information network, and closely connected nodes gather to form community structures in the network. On this basis, scientific research teams are currently identified using hierarchical clustering, and the key to this approach is to effectively mine the community structure in the complex academic heterogeneous information network. However, because hierarchical clustering measures the structural division of communities through similarity calculation, and different similarity measures lead to different community divisions, the team identification results of this approach are inaccurate.

In addition, the prior art identifies scientific research teams with a method based on association rules, which mines the most frequent item sets representing the closest cooperation. This method can identify scientific research teams in a large-scale academic heterogeneous information network formed by paper-author partnerships, but it cannot identify and distinguish the roles of the members within a team, namely the team responsible person, the team core members and the non-core members, which reduces the practical value of the identification result. The prior art also uses centrality measures to identify closely cooperating scientific researchers and teams in the authoritative relationship network. Centrality measures are biased toward nodes that sit at the geometric center of the network and have many neighbor nodes, so they cannot adequately represent and measure the heterogeneous characteristics, such as rich topology and semantics, that the network contains because nodes of different types are joined by different types of connections. As a result, the identification of scientific research teams in the heterogeneous information network is inaccurate. Therefore, how to accurately identify the responsible person, core members and non-core members of a scientific research team in an academic heterogeneous information network is a technical problem to be solved.

Disclosure of Invention

In view of the above, the present invention provides a method and apparatus for identifying a scientific research team based on heterogeneous information network representation learning, so as to solve one or more problems in the prior art.

According to one aspect of the invention, a scientific research team identification method based on heterogeneous information network representation learning is disclosed, which comprises the following steps: acquiring academic heterogeneous information network information, constructing a heterogeneous graph network structure based on the academic heterogeneous information network information, and determining the meta-paths, meta-path adjacency vectors and neighbor nodes of each node in the heterogeneous graph network structure;

inputting the meta-paths, the meta-path adjacency vectors and the neighbor nodes of each node into a trained embedded representation learning model to obtain the structural feature similarity between each node and each of its neighbor nodes and the first node level attention weight of each neighbor node of each node;

calculating the second node level attention weight corresponding to each meta-path of each node based on the first node level attention weights of the neighbor nodes of each node, and determining the scientific research team responsible person based on the second node level attention weight corresponding to each meta-path of each node;

and determining the team core members and the team non-core members based on the first node level attention weight of each neighbor node of each node and the structural feature similarity between each node and each of its neighbor nodes.

In some embodiments of the invention, the method further comprises:

determining a neighbor node aggregation feature representation of each node based on the meta-path adjacency vector and the first node level attention weight of each neighbor node of each node, and determining the structural feature embedded representation corresponding to each meta-path of each node based on each meta-path adjacency vector of each node and the neighbor node aggregation feature representation;

acquiring a meta-path preference vector of each node, calculating the path similarity between each meta-path of each node and the meta-path preference vector based on structural feature embedding representation corresponding to each meta-path of each node and the meta-path preference vector, and determining comprehensive embedding representation of each node based on the path similarity and the structural feature embedding representation corresponding to each meta-path of each node;

and carrying out iterative training on the initial network model based on the comprehensive embedded representation to obtain a trained embedded representation learning model.

In some embodiments of the invention, the embedded representation learning model has a loss function of:

wherein the symbols of the loss function denote, respectively: the node set; the true value of a node on label l; and the predicted value of that node on label l obtained through learning; and L is the total number of labels.
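The loss formula itself is not reproduced in this text. As a hedged illustration only, one form that fits the quantities listed above (a node set, a true and a predicted value per node and label, and L labels) is the usual cross-entropy over the nodes in the set. The symbols $\mathcal{V}$, $y_v^{l}$ and $\hat{y}_v^{l}$ are notation chosen here, and the cross-entropy form is an assumption rather than the formula stated in the application:

```latex
% Assumed cross-entropy form; symbols are illustrative, not taken from the source.
\mathcal{L}_{\text{loss}} = -\sum_{v \in \mathcal{V}} \sum_{l=1}^{L} y_v^{l} \, \ln \hat{y}_v^{l}
```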

In some embodiments of the present invention, obtaining a meta path preference vector for each node, calculating a path similarity between each meta path of each node and the meta path preference vector based on a structural feature embedded representation corresponding to each meta path of each node and the meta path preference vector, and determining a comprehensive embedded representation of each node based on the path similarity and the structural feature embedded representation corresponding to each meta path of each node, comprising:

determining the dimension of the meta-path preference vector of each node;

converting the structural feature embedded representation corresponding to each meta-path of each node into an embedded representation with the same dimension as the meta-path preference vector;

respectively calculating the path similarity between the dimension-converted embedded representation corresponding to each meta-path of each node and the meta-path preference vector;

calculating the meta-path attention coefficients of each node based on the path similarities;

a composite embedded representation for each node is determined based on the meta-path attention coefficients for each node and the structural feature embedded representation corresponding to each meta-path for each node.

In some embodiments of the present invention, the calculation formula of the path similarity is:

；

the calculation formula of the element path attention coefficient is as follows:

；

the calculation formula of the comprehensive embedded expression is as follows:

；

wherein the symbols in the above formulas denote, respectively: the meta-path preference vector of the node; the embedded representation of the node after dimension conversion; the path similarity between the node's meta-path preference vector and its dimension-converted embedded representation; vector regularization; the meta-path attention coefficient of the node based on meta-path π; M, the total number of meta-paths of the node; the comprehensive embedded representation of the node based on meta-path π; and the structural feature embedded representation of the node based on meta-path π.
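The three formulas are not reproduced in this text, so the following is only a minimal sketch of the meta-path level steps described above: path similarity, attention coefficients, then a weighted combination. It assumes that the path similarity is a cosine similarity (suggested by the vector regularization term), that the attention coefficients are a softmax over the M path similarities, and that the comprehensive embedded representation is the coefficient-weighted sum of the structural feature embedded representations. The function names and sample vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; the norms stand in for 'vector regularization' (assumption)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def fuse_meta_paths(preference, converted, structural):
    """Return (attention coefficients, comprehensive embedding) for one node.

    preference: meta-path preference vector of the node
    converted:  per-meta-path embeddings, already converted to the dimension of preference
    structural: per-meta-path structural feature embedded representations
    """
    sims = [cosine(preference, z) for z in converted]
    z_sum = sum(math.exp(s) for s in sims)
    coeffs = [math.exp(s) / z_sum for s in sims]  # softmax over the M meta-paths
    dim = len(structural[0])
    fused = [sum(c * emb[k] for c, emb in zip(coeffs, structural)) for k in range(dim)]
    return coeffs, fused

# Hypothetical node with M = 2 meta-paths.
preference = [1.0, 0.0]
converted = [[0.9, 0.1], [0.1, 0.9]]
structural = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
coeffs, fused = fuse_meta_paths(preference, converted, structural)
```

In this toy input the first meta-path is closer to the preference vector, so it receives the larger coefficient and dominates the fused representation.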

In some embodiments of the present invention, calculating the second node level attention weight corresponding to each meta-path of each node based on the first node level attention weights of the neighbor nodes of each node includes:

sorting the neighbor nodes by importance according to their first node level attention weights;

and selecting the neighbor nodes of higher importance, and calculating the average of the first node level attention weights of the selected neighbor nodes as the second node level attention weight of the corresponding node.

In some embodiments of the present invention, determining the scientific research team responsible person based on the second node level attention weight corresponding to each meta-path of each node includes: taking the node with the largest second node level attention weight as the scientific research team responsible person;

determining team core members and team non-core members based on the first node level attention weights of each neighbor node of each node and the structural feature similarities of each node and each neighbor node thereof, comprising:

sorting the neighbor nodes of the node corresponding to the scientific research team responsible person by their structural feature similarity to that node;

selecting the neighbor nodes of higher similarity as the team member set of the scientific research team responsible person;

sorting the neighbor nodes by importance based on the first node level attention weights of all neighbor nodes of the node corresponding to the scientific research team responsible person;

selecting the neighbor nodes of higher importance as the influence node set of the scientific research team responsible person;

taking the intersection of the team member set and the influence node set as the team core members;

and taking the neighbor nodes of the node corresponding to the scientific research team responsible person that are outside the intersection as the team non-core members.

In some embodiments of the present invention, the structural feature similarity is calculated as:

the calculation formula of the first node level attention weight is as follows:

the calculation formula of the aggregation characteristic representation of the neighbor nodes is as follows:

the calculation formula of the embedded representation of the structural features is as follows:

wherein the symbols in the above formulas denote, respectively: the structural feature similarity between a node and its neighbor node based on meta-path π; σ, the activation function; the dimension conversion parameters, where N is the total number of nodes and d is the target dimension; the neighbor node aggregation feature representation of the node based on meta-path π; the first node level attention weight of the neighbor node based on meta-path π; the meta-path adjacency vectors, based on meta-path π, of the node and of its neighbor node; the weight coefficients of the linear transformation, where R represents the set of real numbers and d the target dimension; vector concatenation; and the set of neighbor nodes of the node based on meta-path π.
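The aggregation and embedding formulas are not reproduced in this text. The sketch below shows one reading of the quantities listed above: the neighbor node aggregation feature representation is taken as the attention-weighted sum of the neighbors' meta-path adjacency vectors, and the structural feature embedded representation as an activation applied to a linear transformation of the node's own vector concatenated with that aggregate. The sigmoid activation, the function names and all numeric values are assumptions for illustration.

```python
import math

def aggregate_neighbors(attention, neighbor_vectors):
    """Attention-weighted sum of the neighbors' meta-path adjacency vectors."""
    size = len(next(iter(neighbor_vectors.values())))
    agg = [0.0] * size
    for j, vec in neighbor_vectors.items():
        for k, x in enumerate(vec):
            agg[k] += attention[j] * x
    return agg

def structural_embedding(weight, a_i, aggregated):
    """Assumed form: sigmoid(W [a_i || aggregated]), where '||' is concatenation."""
    concat = a_i + aggregated
    return [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, concat))))
            for row in weight]

# Hypothetical first node level attention weights and neighbor vectors.
attention = {"j1": 0.75, "j2": 0.25}
neighbor_vectors = {"j1": [1.0, 0.0], "j2": [0.0, 1.0]}
aggregated = aggregate_neighbors(attention, neighbor_vectors)

# Hypothetical linear transformation with target dimension d = 2.
weight = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0]]
embedding = structural_embedding(weight, [0.5, 0.5], aggregated)
```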

According to another aspect of the present invention, a scientific research team identification system based on academic heterogeneous information network representation learning is also disclosed. The system comprises a processor and a memory, the memory storing computer instructions and the processor being configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the system implements the steps of the method according to any of the embodiments described above.

According to yet another aspect of the present application, a computer-readable storage medium is also disclosed, on which a computer program is stored; when executed by a processor, the program carries out the steps of the method according to any of the embodiments described above.

With the scientific research team identification method and device based on heterogeneous information network representation learning, the structural feature similarity between each node of the academic heterogeneous information network and each of its neighbor nodes, and the first node level attention weight of each neighbor node of each node, are obtained through a trained embedded representation learning model. The second node level attention weight corresponding to each meta-path of each node is then calculated from the first node level attention weights, the scientific research team responsible person is determined from the second node level attention weights, and the team core members and team non-core members are determined from the first node level attention weights and the structural feature similarity between each node and each of its neighbor nodes. The method can accurately identify the responsible person, the core members and the non-core members of a scientific research team in an academic heterogeneous information network.

In addition, the embedded representation learning model preserves the topological structure features and semantic features of the nodes in the heterogeneous information network, and the attention mechanisms at the node level and the meta-path level improve its ability to represent the heterogeneous features of the nodes, which further improves the accuracy of identifying team responsible persons, team core members and team non-core members.

Additional advantages, objects, and features of the application will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present application are not limited to the above-described specific ones, and that the above and other objects that can be achieved with the present application will be more clearly understood from the following detailed description.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate and together with the description serve to explain the application. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the application. Corresponding parts in the drawings may be exaggerated, i.e. made larger relative to other parts in an exemplary device actually manufactured according to the present application, for convenience in showing and describing some parts of the present application. In the drawings:

Fig. 1 is a schematic flow chart of a scientific research team identification method based on academic heterogeneous information network representation learning according to an embodiment of the application.

Fig. 2 is a schematic architecture diagram of a scientific research team identification system based on academic heterogeneous information network representation learning according to an embodiment of the application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings. The exemplary embodiments of the present application and their descriptions herein are for the purpose of explaining the present application, but are not to be construed as limiting the application.

It should be noted that, in order to avoid obscuring the present application due to unnecessary details, only structures and/or processing steps closely related to the solution according to the present application are shown in the drawings, while other details not greatly related to the present application are omitted.

It should be emphasized that the term "comprises/comprising" when used herein is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.

A scientific research team is a research group formed by a team responsible person and a certain number of scientific researchers who divide the work and cooperate toward a common research and development goal. Scientific research team identification means identifying and finding the team responsible person, team core members and non-core members belonging to the same scientific research team in a heterogeneous information network built from collaboration relations among scientific researchers. Existing scientific research team identification methods cannot effectively mine the heterogeneous characteristics of a heterogeneous information network, such as its rich topology and semantics, so the importance of team members is evaluated inaccurately and the roles of team members cannot be distinguished. The application discloses a scientific research team identification method and device based on academic heterogeneous information network representation learning, which accurately identify the responsible person, core members and non-core members of a scientific research team in an academic heterogeneous information network.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar components, or the same or similar steps.

Fig. 1 is a flow chart of a scientific research team identification method based on academic heterogeneous information network representation learning according to an embodiment of the invention, and referring to fig. 1, the scientific research team identification method at least includes steps S10 to S40.

Step S10: acquiring academic heterogeneous information network information, constructing a heterogeneous graph network structure based on the academic heterogeneous information network information, and determining the meta-paths, meta-path adjacency vectors and neighbor nodes of each node in the heterogeneous graph network structure.

In this step, each node may have M meta-paths and, correspondingly, M meta-path adjacency vectors, one for each meta-path π. Further, to simplify the calculation, the meta-path adjacency vector of a node based on meta-path π may be the normalized meta-path adjacency vector. The neighbor nodes of a node are determined based on the heterogeneous graph network structure.
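As an illustration of step S10, the following minimal sketch builds a tiny author-paper graph and derives the neighbor nodes and a normalized meta-path adjacency vector under an assumed author-paper-author meta-path. The node names, the choice of meta-path, the use of path-instance counts as the adjacency vector and the L1 normalization are all assumptions made for the example and are not taken from the application.

```python
from collections import defaultdict

# Hypothetical "writes" edges between author nodes and paper nodes.
writes = [("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("a3", "p2"),
          ("a1", "p3"), ("a3", "p3"), ("a4", "p3")]

def apa_adjacency(edges):
    """Count author-paper-author path instances between every pair of authors."""
    authors_of = defaultdict(set)
    for author, paper in edges:
        authors_of[paper].add(author)
    counts = defaultdict(lambda: defaultdict(int))
    for coauthors in authors_of.values():
        for a in coauthors:
            for b in coauthors:
                if a != b:
                    counts[a][b] += 1
    return counts

def normalized_vector(counts, node, all_nodes):
    """Meta-path adjacency vector of `node`, L1-normalized (assumed normalization)."""
    row = [counts[node].get(other, 0) for other in all_nodes]
    total = sum(row)
    return [x / total if total else 0.0 for x in row]

authors = sorted({a for a, _ in writes})
adj = apa_adjacency(writes)
# Neighbor nodes of each author under the assumed meta-path.
neighbors = {a: sorted(adj[a]) for a in authors}
vector_a1 = normalized_vector(adj, "a1", authors)
```

With M meta-paths, the same construction would be repeated once per meta-path, giving each node M adjacency vectors and M neighbor sets.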

Step S20: inputting the meta-paths, the meta-path adjacency vectors and the neighbor nodes of each node into a trained embedded representation learning model to obtain the structural feature similarity between each node and each of its neighbor nodes and the first node level attention weight of each neighbor node of each node.

In this step, the structural feature similarity between each node and its neighbor nodes and the first node level attention weight of each neighbor node of each node are learned by the embedded representation learning model. In this embodiment, among all neighbor nodes of a node, those whose structural features are similar to the node's are given a larger first node level attention weight.

The structural feature similarity is calculated by the following formula:

the calculation formula of the first node level attention weight is as follows:

wherein the symbols in the above formulas denote, respectively: the structural feature similarity between a node and its neighbor node based on meta-path π, where π ∈ [1, M] when the total number of meta-paths is M; the dimension conversion parameters, which represent the structural feature transformation of meta-path π, where R represents the set of real numbers, N is the total number of nodes and d is the target dimension; the meta-path adjacency vectors, based on meta-path π, of the node and of its neighbor node; the first node level attention weight of the neighbor node based on meta-path π; and the set of neighbor nodes connected to the node based on meta-path π.
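Since the two formulas are not reproduced in this text, the following is only a sketch of how step S20 could behave. It assumes that the structural feature similarity is a sigmoid of the dot product of the two dimension-converted meta-path adjacency vectors, and that the first node level attention weights are a softmax of those similarities over the node's neighbor set. Both choices, the function names and the sample parameters are hypothetical; the only property taken from the description is that more similar neighbors receive larger weights.

```python
import math

def project(matrix, vector):
    """Linear dimension conversion: multiply a d x N matrix by an N-vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def similarity(w_pi, a_i, a_j):
    """Assumed similarity: sigmoid of the dot product of the projected vectors."""
    h_i, h_j = project(w_pi, a_i), project(w_pi, a_j)
    dot = sum(x * y for x, y in zip(h_i, h_j))
    return 1.0 / (1.0 + math.exp(-dot))

def first_level_attention(w_pi, a_i, neighbor_vectors):
    """Softmax of the similarities over the neighbor set of one node."""
    sims = {j: similarity(w_pi, a_i, a_j) for j, a_j in neighbor_vectors.items()}
    z = sum(math.exp(s) for s in sims.values())
    return sims, {j: math.exp(s) / z for j, s in sims.items()}

w_pi = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical parameters, d = 2, N = 3
a_i = [0.5, 0.5, 0.0]
neighbor_vectors = {"j1": [0.6, 0.4, 0.0], "j2": [0.0, 0.1, 0.9]}
sims, weights = first_level_attention(w_pi, a_i, neighbor_vectors)
```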

Step S30: calculating the second node level attention weight corresponding to each meta-path of each node based on the first node level attention weights of the neighbor nodes of each node, and determining the scientific research team responsible person based on the second node level attention weight corresponding to each meta-path of each node.

In this step, the scientific research team responsible person is determined based on the calculated second node level attention weight corresponding to each meta-path of each node. The responsible person is the soul and core of a scientific research team, and effectively identifying the responsible person is the premise and key of identifying the team. The responsible person has great academic achievement and wisdom and exerts an important influence on team decisions. In the heterogeneous information network formed by the scientific research team, this role appears as an important node that influences the layout of the nodes and the flow of the network. The responsible person in the academic heterogeneous information network is therefore identified using the second node level attention weight, which represents the influence of a node: the "author" node with the highest second node level attention weight (the largest influence) is taken as the scientific research team responsible person. Specifically, determining the scientific research team responsible person based on the second node level attention weight corresponding to each meta-path of each node includes: taking the node with the largest second node level attention weight as the scientific research team responsible person.

In an embodiment, calculating the second node level attention weight corresponding to each meta-path of each node based on the first node level attention weights of the neighbor nodes of each node includes: sorting the neighbor nodes by importance according to their first node level attention weights; and selecting the neighbor nodes of higher importance, and calculating the average of the first node level attention weights of the selected neighbor nodes as the second node level attention weight of the corresponding node.

It can be understood that scientific researchers working in the same research direction are more likely to form a scientific research team. Mapped onto the academic heterogeneous information network formed by scientific research teams, the higher the structural feature similarity between nodes, the larger their influence and the more likely the nodes are to cluster tightly. The meta-path-based structural feature similarity between a node and its neighbor node is calculated as described above. Nodes representing members of the same scientific research team in the academic heterogeneous information network are necessarily connected through path instances and are neighbor nodes of each other, and members belonging to the same scientific research team have higher structural feature similarity. During representation learning on the academic heterogeneous information network, among all neighbor nodes connected to a node through path instances, the neighbor nodes whose structural features are similar to the node's are given a larger first node level attention weight coefficient. The first node level attention weight coefficient therefore measures the influence of a neighbor node of the node. The neighbor nodes of the node are sorted by the size of their first node level attention weights, and the neighbor nodes with larger influence are selected. For example, the neighbor nodes corresponding to the top K first node level attention weights may be selected, and the second node level attention weight of the node is then the average of these K first node level attention weights, i.e. $\alpha_i = \frac{1}{K}\sum_{j=1}^{K}\alpha_{ij}$, where $\alpha_i$ represents the second node level attention weight of node $i$ and $\alpha_{ij}$ represents the first node level attention weight of its $j$-th selected neighbor node.

In the above embodiment, the second node level attention weight represents the influence of the node. The second node level attention weights of all nodes are then sorted, and the author node with the largest influence is taken as the scientific research team responsible person, which completes the identification of the responsible person.
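The top-K averaging and the selection of the responsible person described above can be sketched as follows. The node names, the sample weights and the value of K are hypothetical, and using the mean of the available weights when a node has fewer than K neighbors is an assumption that the description does not address.

```python
def second_level_attention(first_level, k):
    """Mean of the k largest first node level attention weights of one node."""
    top = sorted(first_level.values(), reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0

def pick_responsible_person(first_level_by_node, k):
    """Return the node with the largest second node level attention weight."""
    scores = {n: second_level_attention(w, k) for n, w in first_level_by_node.items()}
    return max(scores, key=scores.get), scores

# Hypothetical first node level attention weights: node -> {neighbor: weight}.
first_level_by_node = {
    "a1": {"a2": 0.5, "a3": 0.3, "a4": 0.2},
    "a2": {"a1": 0.6, "a3": 0.4},
    "a3": {"a1": 0.4, "a2": 0.35, "a4": 0.25},
}
responsible, scores = pick_responsible_person(first_level_by_node, k=2)
```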

Step S40: determining the team core members and the team non-core members based on the first node level attention weight of each neighbor node of each node and the structural feature similarity between each node and each of its neighbor nodes.

After the scientific research team responsible person has been identified in step S30, the members of the scientific research team are identified. Scientific researchers in the same team have close relations with the responsible person, such as membership and partnership. These relations correspond to tight connections among different author nodes in the academic heterogeneous information network, and in the embedded representations of the network's nodes they appear as author-node vector representations with higher structural feature similarity. Therefore, the team members belonging to the same team as the scientific research team responsible person can be determined through the structural feature similarity between each node and each of its neighbor nodes.

In one embodiment, determining the team core members and team non-core members based on the first node level attention weight of each neighbor node of each node and the structural feature similarity between each node and each of its neighbor nodes comprises: sorting the neighbor nodes of the node corresponding to the scientific research team responsible person by their structural feature similarity to that node; selecting the neighbor nodes of higher similarity as the team member set of the scientific research team responsible person; sorting the neighbor nodes by importance based on the first node level attention weights of all neighbor nodes of the node corresponding to the scientific research team responsible person; selecting the neighbor nodes of higher importance as the influence node set of the scientific research team responsible person; taking the intersection of the team member set and the influence node set as the team core members; and taking the neighbor nodes of the node corresponding to the scientific research team responsible person that are outside the intersection as the team non-core members.

In this embodiment, the P neighbor nodes with the highest structural feature similarity to the node of the person in charge of the scientific research team are first selected from all of its neighbor nodes. These P neighbor nodes form the team member set, and the authors corresponding to them belong to the same scientific research team as the person in charge. Next, based on the calculated first node-level attention weights of all neighbor nodes of the node corresponding to the person in charge, the P' neighbor nodes with the highest first node-level attention weights are selected to form the influence node set. The intersection of the team member set and the influence node set is then taken; if the intersection contains M nodes, the researchers corresponding to these M nodes are the team core members. All other neighbor nodes of the node corresponding to the person in charge are team non-core members, so when that node has N neighbor nodes in total, the number of team non-core members is N-M.
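The top-P / top-P' selection and the intersection described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_team`, the dictionary inputs and the toy scores are hypothetical and are not taken from the source.

```python
def split_team(similarity, attention, p, p_prime):
    """Split the neighbors of a team leader into core and non-core members.

    similarity: neighbor id -> structural feature similarity to the leader
    attention:  neighbor id -> first node-level attention weight
    (both dictionaries are hypothetical inputs used for illustration)
    """
    by_similarity = sorted(similarity, key=similarity.get, reverse=True)
    by_attention = sorted(attention, key=attention.get, reverse=True)
    team_member_set = set(by_similarity[:p])     # top-P most similar neighbors
    influence_set = set(by_attention[:p_prime])  # top-P' most influential neighbors
    core = team_member_set & influence_set       # the M core members
    non_core = set(similarity) - core            # the remaining N - M neighbors
    return core, non_core


similarity = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.2}
attention = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.1}
core, non_core = split_team(similarity, attention, p=2, p_prime=2)
```

With these toy scores the team member set is {a, b} and the influence node set is {a, c}, so one neighbor is a core member and the other three are non-core members, matching the N-M count in the text.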

It can be appreciated that, after the person in charge, the team core members and the team non-core members of a first scientific research team have been determined by the above method, the node whose second node-level attention weight ranks second may be taken as the person in charge of a second scientific research team. The team core members and team non-core members of the second scientific research team are identified in the same way as for the first scientific research team, which is not repeated here.

In one embodiment, the specific recognition process of the scientific research team recognition method based on academic heterogeneous information network representation learning is shown in the following table:

Algorithm 1: Scientific research team identification method based on academic heterogeneous information network representation learning

Input: academic heterogeneous information network

Output: scientific research team identification result

1. Obtain vector representations of the network nodes through academic heterogeneous information network representation learning;

2. Calculate the influence of every unmarked 'author' node and sort the nodes by influence;

3. Select the node with the highest influence as the 'scientific research team responsible person' node, and mark it as 'identified';

4. Select the top P 'author' nodes most similar to the 'scientific research team responsible person' node as team core member candidate nodes belonging to the same scientific research team as the current 'scientific research team responsible person' node;

5. Calculate the intersection of the top P' nodes with the greatest influence on the 'scientific research team responsible person' node and the P similar nodes from step 4; the nodes in the intersection are the 'team core member' nodes of the scientific research team and are marked as 'identified';

6. Take the remaining neighbor 'author' nodes of the 'scientific research team responsible person' node as the 'non-core member' nodes of the team, and mark them as 'identified';

7. Output the identification result of the current scientific research team, and repeat steps 2-6 until all nodes are marked as 'identified'.
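The iterative loop of steps 2 to 7 can be sketched as follows, assuming that step 1 has already produced the scores. The function name `identify_teams`, the dictionary layout of the inputs and the toy data are hypothetical; restricting candidates to neighbors that are not yet marked is also an assumption, since the source does not say how already identified neighbors are handled.

```python
def identify_teams(influence, neighbors, similarity, attention, p, p_prime):
    """Greedy team identification loop (illustrative sketch of steps 2-7)."""
    identified = set()
    teams = []
    while len(identified) < len(influence):
        # Steps 2-3: the most influential unmarked author becomes the leader.
        unmarked = [n for n in influence if n not in identified]
        leader = max(unmarked, key=influence.get)
        identified.add(leader)
        # Assumption: only neighbors that are still unmarked are candidates.
        candidates = [n for n in neighbors[leader] if n not in identified]
        # Step 4: top-P neighbors by structural feature similarity.
        top_similar = sorted(candidates, key=lambda n: similarity[leader][n],
                             reverse=True)[:p]
        # Step 5: intersect with the top-P' neighbors by attention weight.
        top_influence = sorted(candidates, key=lambda n: attention[leader][n],
                               reverse=True)[:p_prime]
        core = set(top_similar) & set(top_influence)
        # Step 6: every other candidate neighbor is a non-core member.
        non_core = set(candidates) - core
        identified |= core | non_core
        # Step 7: record the result and continue with the remaining nodes.
        teams.append({"leader": leader, "core": core, "non_core": non_core})
    return teams


influence = {"a": 0.9, "b": 0.5, "c": 0.4, "d": 0.8, "e": 0.3}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": ["e"], "e": ["d"]}
similarity = {"a": {"b": 0.9, "c": 0.2}, "b": {"a": 0.9}, "c": {"a": 0.2},
              "d": {"e": 0.7}, "e": {"d": 0.7}}
attention = {"a": {"b": 0.7, "c": 0.3}, "b": {"a": 1.0}, "c": {"a": 1.0},
             "d": {"e": 1.0}, "e": {"d": 1.0}}
teams = identify_teams(influence, neighbors, similarity, attention, p=1, p_prime=1)
```

Each pass marks at least the leader, so the loop always terminates; on the toy data it yields two teams, led by "a" and "d".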

To obtain an accurate team identification result, the embedded representation learning model must be trained so that it has good model parameters. To this end, in one embodiment, the scientific research team identification method based on academic heterogeneous information network representation learning further comprises the following steps:

determining a neighbor node aggregation feature representation of each node based on the meta-path adjacency vector and the first node-level attention weight of each neighbor node of the node, and determining a structural feature embedded representation corresponding to each meta-path of each node based on each meta-path adjacency vector of the node and the neighbor node aggregation feature representation; acquiring a meta-path preference vector of each node, calculating the path similarity between each meta-path of each node and the meta-path preference vector based on the structural feature embedded representation corresponding to each meta-path of the node and the meta-path preference vector, and determining a comprehensive embedded representation of each node based on the path similarity and the structural feature embedded representation corresponding to each meta-path of the node; and iteratively training an initial network model based on the comprehensive embedded representation to obtain the trained embedded representation learning model.

In the above embodiment, after the structural feature similarity between each node and each of its neighbor nodes and the first node-level attention weight of each neighbor node have been obtained from the embedded representation learning model, the structural feature embedded representation and the semantic feature representation of each node in the heterogeneous graph network structure are learned, and the comprehensive embedded representation of the node is determined from the structural feature embedded representation and the semantic feature representation.

Illustratively, the neighbor node aggregation feature representation of node $v$ based on meta-path $\pi$ may be calculated as follows:

$$h_{N(v)}^{\pi} = \sigma\Big(\sum_{u \in N_v^{\pi}} \alpha_{vu}^{\pi} \cdot W_{\pi}^{\top} x_u^{\pi}\Big)$$

After the neighbor node aggregation feature representation of node $v$ is determined, the structural features of node $v$ itself and the neighbor node aggregation features are concatenated to obtain the structural feature embedded representation of node $v$, which may be calculated as follows:

$$z_v^{\pi} = \sigma\big(W_c \cdot \big[\, W_{\pi}^{\top} x_v^{\pi} \,\|\, h_{N(v)}^{\pi} \,\big]\big)$$

where $\sigma$ denotes the activation function; $W_{\pi} \in R^{N \times d}$ is the dimension transformation parameter representing the structural feature transformation of meta-path $\pi$, $R$ denotes the set of real numbers, $N$ is the total number of nodes and $d$ is the target dimension; $h_{N(v)}^{\pi}$ denotes the neighbor node aggregation feature representation of node $v$ based on meta-path $\pi$; $N_v^{\pi}$ denotes the set of neighbor nodes of node $v$ based on meta-path $\pi$; $\alpha_{vu}^{\pi}$ denotes the first node-level attention weight of neighbor node $u$ of node $v$ based on meta-path $\pi$; $x_u^{\pi}$ denotes the meta-path adjacency vector of node $u$ based on meta-path $\pi$; $W_c$ denotes the real-valued weight coefficients of the linear transformation from the concatenated vector to the embedding space; $x_v^{\pi}$ denotes the meta-path adjacency vector of node $v$ based on meta-path $\pi$; and $\|$ denotes vector concatenation.
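The aggregation and concatenation steps can be illustrated with a small NumPy sketch. It assumes that each meta-path adjacency vector has length N, that the dimension transformation parameter has shape N x d, and that the concatenated vector is projected back to d dimensions; the tanh activation, the random parameters, the attention weights and all variable names are placeholders chosen for illustration and are not fixed by the source.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 3                          # total number of nodes, target dimension (toy sizes)
X = rng.random((N, N))               # row u: meta-path adjacency vector of node u (assumed length N)
W = rng.normal(size=(N, d))          # dimension transformation parameter, shape N x d
W_c = rng.normal(size=(d, 2 * d))    # assumed shape: maps the concatenated vector to d dims

v, nbrs = 0, [1, 2, 4]               # node v and its neighbors under one meta-path
alpha = np.array([0.5, 0.3, 0.2])    # first node-level attention weights of the neighbors

# Attention-weighted aggregation of the transformed neighbor vectors.
h_nbr = np.tanh(alpha @ (X[nbrs] @ W))
# Concatenate v's own transformed features with the aggregate, then project.
z_v = np.tanh(W_c @ np.concatenate([X[v] @ W, h_nbr]))
```

Both `h_nbr` and `z_v` are d-dimensional vectors, so the structural feature embedded representation lives in the same target space as the aggregated neighbor features.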

Further, a meta path preference vector of each node is obtained, a path similarity between each meta path of each node and the meta path preference vector is calculated based on a structural feature embedded representation corresponding to each meta path of each node and the meta path preference vector, and a comprehensive embedded representation of each node is determined based on the path similarity and the structural feature embedded representation corresponding to each meta path of each node, specifically including: determining the dimension of the meta-path preference vector of each node; converting the structural feature embedded representation corresponding to each element path of each node into an embedded representation with the same dimension as the element path preference vector; respectively calculating the path similarity between the dimension converted embedded representation corresponding to each element path of each node and the element path preference vector; calculating the element path attention coefficient of each node based on each path similarity; a composite embedded representation for each node is determined based on the meta-path attention coefficients for each node and the structural feature embedded representation corresponding to each meta-path for each node.

In this embodiment, a meta-path preference vector $q_v$ is introduced for each node $v$ to guide the meta-path attention mechanism, so that the semantic feature representation of the node is learned on the basis of meta-paths. For the structural feature embedded representation $z_v^{\pi}$ of node $v$ based on meta-path $\pi$, the more similar it is to $q_v$, the larger the attention weight coefficient it is given, and the higher its contribution to the comprehensive embedded representation of the node. When the meta-path attention coefficient $\beta_v^{\pi}$ of node $v$ is calculated, the structural feature embedded representation of node $v$ is first converted into an embedded representation in a k-dimensional space, so that the dimension of the structural feature embedded representation of node $v$ based on meta-path $\pi$ is the same as that of the meta-path preference vector $q_v$. The structural feature embedded representation after dimension conversion is

$$\tilde{z}_v^{\pi} = \sigma\big(W_k z_v^{\pi} + b_k\big)$$

where $\sigma$ is the activation function, $W_k$ is the dimension conversion parameter and $b_k$ is the bias term. The path similarity between the dimension-converted embedded representation and the meta-path preference vector is then calculated as follows:

$$s_v^{\pi} = \frac{q_v^{\top} \tilde{z}_v^{\pi}}{\lVert q_v \rVert \, \lVert \tilde{z}_v^{\pi} \rVert}$$

where $q_v$ denotes the meta-path preference vector of node $v$, $\tilde{z}_v^{\pi}$ denotes the embedded representation of node $v$ after dimension conversion, $s_v^{\pi}$ denotes the path similarity between the meta-path preference vector $q_v$ and the dimension-converted embedded representation $\tilde{z}_v^{\pi}$, and $\lVert \cdot \rVert$ denotes the vector norm used for normalization. After the path similarity is calculated, the meta-path attention coefficient of node $v$ is calculated:

$$\beta_v^{\pi} = \frac{\exp(s_v^{\pi})}{\sum_{\pi'=1}^{M} \exp(s_v^{\pi'})}$$

where $\beta_v^{\pi}$ denotes the meta-path attention coefficient of node $v$ based on meta-path $\pi$, and $M$ denotes the total number of meta-paths of node $v$.

Further, the final comprehensive embedded representation $z_v$ of node $v$ in the heterogeneous information network is calculated as follows:

$$z_v = \sum_{\pi=1}^{M} \beta_v^{\pi} \cdot z_v^{\pi}$$

where $z_v$ denotes the comprehensive embedded representation of node $v$ based on the meta-paths, $\beta_v^{\pi}$ denotes the meta-path attention coefficient of node $v$ based on meta-path $\pi$, and $z_v^{\pi}$ denotes the structural feature embedded representation of node $v$ based on meta-path $\pi$. The meta-path attention weight $\beta_v^{\pi}$ incorporates the rich semantic information of the meta-paths, and the structural feature embedded representation $z_v^{\pi}$ incorporates the structural features of the node. The comprehensive embedded representation of node $v$ therefore retains both the topological structure features and the semantic features of the node in the heterogeneous information network, so that the node-level and meta-path-level attention mechanisms improve the ability to represent the heterogeneous features of nodes in the heterogeneous information network.
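The meta-path-level attention can be sketched with NumPy: convert each structural feature embedding to the dimension of the preference vector, score it against the preference vector, normalize the scores into attention coefficients, and take the weighted sum. The sketch assumes cosine similarity for the path similarity, a softmax for the normalization and tanh as the activation; the function name `metapath_attention`, the parameter shapes and the random inputs are hypothetical.

```python
import numpy as np


def metapath_attention(Z, q, W_k, b_k):
    """Z: (M, d) structural embeddings, one row per meta-path; q: (k,) preference vector."""
    Z_k = np.tanh(Z @ W_k.T + b_k)                 # dimension conversion to k dims
    sims = (Z_k @ q) / (np.linalg.norm(Z_k, axis=1) * np.linalg.norm(q))  # cosine similarity
    beta = np.exp(sims - sims.max())               # numerically stable softmax
    beta /= beta.sum()
    return beta, beta @ Z                          # coefficients and weighted sum of embeddings


rng = np.random.default_rng(0)
M, d, k = 3, 4, 2                                  # meta-paths, embedding dim, preference dim
Z = rng.normal(size=(M, d))
q = rng.normal(size=k)
W_k = rng.normal(size=(k, d))
b_k = rng.normal(size=k)
beta, z = metapath_attention(Z, q, W_k, b_k)
```

The coefficients in `beta` are positive and sum to one, so `z` is a convex combination of the per-meta-path embeddings in the original d-dimensional space.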

Furthermore, in order to perform iterative training on the initial network model based on the comprehensive embedded representation to obtain a trained embedded representation learning model, a loss function and a sample data set of the model need to be constructed, and the model is pre-trained based on the sample data set and the loss function. Illustratively, in an iterative training process, after the comprehensive embedded representation of the node is obtained, the node is further classified based on the model to obtain a predicted value of the node, and a cross entropy loss value of the model is calculated based on the label value and the predicted value of the node.

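The loss function of the embedded representation learning model may be:

$$\mathcal{L}_{CE} = -\sum_{v \in V} \sum_{l=1}^{L} y_v^{l} \log \hat{y}_v^{l}$$

where $V$ denotes the node set, $y_v^{l}$ denotes the true value of node $v$ on label $l$, $\hat{y}_v^{l}$ denotes the learned predicted value of node $v$ on label $l$, and $L$ is the total number of labels.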
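As a worked example of the cross-entropy loss computed from label values and predicted values, the following NumPy sketch sums the loss over a toy set of two nodes and two labels. The function name `cross_entropy`, the small `eps` guard against log(0) and the toy arrays are assumptions made for illustration.

```python
import numpy as np


def cross_entropy(y_true, y_pred, eps=1e-12):
    """y_true, y_pred: (num_nodes, L) arrays; each row of y_pred holds label probabilities."""
    return float(-np.sum(y_true * np.log(y_pred + eps)))


y_true = np.array([[1, 0], [0, 1]])          # one-hot label values of two nodes
y_pred = np.array([[0.8, 0.2], [0.4, 0.6]])  # predicted label probabilities
loss = cross_entropy(y_true, y_pred)
```

Only the probability assigned to the true label of each node contributes, so here the loss equals -(ln 0.8 + ln 0.6).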

In the above embodiment, the scientific research team identification method based on academic heterogeneous information network representation learning uses node-level and meta-path-level attention mechanisms to learn structural and semantic feature representations of the heterogeneous information network, and learns low-dimensional, dense, real-valued vector representations while retaining both the rich topology information of the nodes in the network and the meta-path-based semantic information. The importance of the nodes in the academic heterogeneous information network is evaluated by aggregating the influence of each node's neighbor nodes, and the person in charge, the core members and the non-core members of a scientific research team are effectively identified based on the maximization of node influence.

Correspondingly, the application also discloses a scientific research team identification system based on academic heterogeneous information network representation learning, which comprises a processor and a memory, wherein the memory is stored with computer instructions, the processor is used for executing the computer instructions stored in the memory, and the system realizes the steps of the method in any embodiment when the computer instructions are executed by the processor.

FIG. 2 is a schematic diagram of a scientific research team identification system based on academic heterogeneous information network representation learning according to an embodiment of the present invention. Referring to FIG. 2, the system first converts the academic heterogeneous information network into a heterogeneous graph network structure, and then inputs the heterogeneous graph network structure into the embedded representation learning model to perform structural feature representation learning and semantic feature representation learning of the nodes, so as to obtain the comprehensive embedded representation of each node. In the structural feature representation learning, $x_v^{\pi}$ denotes the meta-path adjacency vector of node $v$ based on meta-path $\pi$, and $x_{u_1}^{\pi}, x_{u_2}^{\pi}, \ldots, x_{u_n}^{\pi}$ denote the meta-path adjacency vectors of the first, second, ..., n-th neighbor nodes of node $v$. The first node-level attention weight of each neighbor node is calculated from the meta-path adjacency vector of node $v$ and the meta-path adjacency vector of that neighbor node, and the structural feature embedded representation $z_v^{\pi}$ of the node based on each meta-path is then obtained from the structural features of the node itself and the neighbor node aggregation features.

Further, the meta-path attention coefficient $\beta_v^{\pi}$ of node $v$ based on each meta-path is calculated, and the final comprehensive embedded representation $z_v$ is determined from the meta-path attention coefficients of node $v$ based on each meta-path and the structural feature embedded representations of node $v$. The comprehensive embedded representation $z_v$ is used to complete the pre-training of the embedded representation learning model.

Further, the system performs node similarity relation analysis, neighbor node influence measurement and neighbor node influence aggregation in order to evaluate node influence, and identifies the person in charge of the scientific research team based on node influence. It then determines the team member set based on the structural feature similarity between the node and its neighbor nodes, and identifies the core members and non-core members of the scientific research team based on the team member set and the influence node set determined from the first node-level attention weights of the neighbor nodes.

According to the scientific research team identification method and device based on heterogeneous information network representation learning, the structural feature similarity between each node and each of its neighbor nodes and the first node-level attention weight of each neighbor node can be obtained from the trained embedded representation learning model, so that the person in charge, the team core members and the non-core members of the scientific research team are identified based on the influence of each node, the structural feature similarity between neighboring nodes and the influence of each neighbor node. In addition, during pre-training the embedded representation learning model performs structural feature representation learning through a node-level attention mechanism to mine the topological structure features of the nodes in the academic heterogeneous information network, and performs semantic feature representation learning through a meta-path-level attention mechanism to mine the rich semantic information in the network. After low-dimensional, dense and robust representations of the nodes are obtained, the importance of the nodes in the academic heterogeneous information network formed by scientific research teams is evaluated by exploring the similarity among nodes and the influence of neighbor nodes, and by aggregating the importance of each node's neighbor nodes.

In addition, the invention also discloses a computer readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements the steps of the method according to any of the embodiments above.

Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein can be implemented as hardware, software, or a combination of both. Whether a particular implementation uses hardware or software depends on the specific application of the solution and the design constraints. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an application-specific integrated circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium or transmitted over transmission media or communication links by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transfer information. Examples of machine-readable media include electronic circuitry, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio frequency (RF) links, and the like. The code segments may be downloaded via computer networks such as the Internet or an intranet.

It should also be noted that the exemplary embodiments mentioned in this disclosure describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, or may be performed in a different order from the order in the embodiments, or several steps may be performed simultaneously.

In this disclosure, features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations can be made to the embodiments of the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A scientific research team identification method based on heterogeneous information network representation learning, which is characterized by comprising the following steps:

acquiring academic heterogeneous information network information, constructing a heterogeneous graph network structure based on the academic heterogeneous information network information, and determining the meta-paths, meta-path adjacency vectors and neighbor nodes of each node in the heterogeneous graph network structure;

2. The method for identifying a research team based on heterogeneous information network representation learning of claim 1, further comprising:

3. The method for identifying a research team based on heterogeneous information network representation learning of claim 2, wherein the loss function of the embedded representation learning model is:

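$$\mathcal{L}_{CE} = -\sum_{v \in V} \sum_{l=1}^{L} y_v^{l} \log \hat{y}_v^{l}$$

where $V$ denotes the node set, $y_v^{l}$ denotes the true value of node $v$ on label $l$, $\hat{y}_v^{l}$ denotes the learned predicted value of node $v$ on label $l$, and $L$ is the total number of labels.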

4. The method for identifying a research team based on heterogeneous information network representation learning of claim 2, wherein obtaining a meta path preference vector for each node, calculating a path similarity between each meta path for each node and the meta path preference vector based on a structural feature embedded representation corresponding to each meta path for each node and the meta path preference vector, determining a comprehensive embedded representation for each node based on the path similarity and the structural feature embedded representation corresponding to each meta path for each node, comprising:

Determining the dimension of the meta-path preference vector of each node;

5. The method for identifying a research team based on heterogeneous information network representation learning of claim 4, wherein,

the calculation formula of the path similarity is as follows:

$$s_v^{\pi} = \frac{q_v^{\top} \tilde{z}_v^{\pi}}{\lVert q_v \rVert \, \lVert \tilde{z}_v^{\pi} \rVert}$$

the calculation formula of the comprehensive embedded representation is as follows:

$$z_v = \sum_{\pi=1}^{M} \beta_v^{\pi} \cdot z_v^{\pi}$$

where $q_v$ denotes the meta-path preference vector of node $v$, $\tilde{z}_v^{\pi}$ denotes the embedded representation of node $v$ after dimension conversion, $s_v^{\pi}$ denotes the path similarity between the meta-path preference vector and the embedded representation after dimension conversion, $\lVert \cdot \rVert$ denotes the vector norm used for normalization, $\beta_v^{\pi}$ denotes the meta-path attention coefficient of node $v$ based on meta-path $\pi$, $M$ denotes the total number of meta-paths of node $v$, $z_v$ denotes the comprehensive embedded representation of node $v$ based on the meta-paths, and $z_v^{\pi}$ denotes the structural feature embedded representation of node $v$ based on meta-path $\pi$.

6. The method for identifying a research team based on heterogeneous information network representation learning of claim 1, wherein calculating the second node level attention weight corresponding to each element path of each node based on the first node level attention weights of each neighbor node of each node comprises:

7. The method for identifying a research team based on heterogeneous information network representation learning of claim 6, wherein determining a research team responsible person based on the second node level attention weights corresponding to each meta-path of each node comprises: taking the node with the largest attention weight of the second node level as a responsible person of the scientific research team;

8. The method for identifying a research team based on heterogeneous information network representation learning of claim 2, wherein,

the calculation formula of the structural feature similarity is as follows:

the calculation formula of the first node level attention weight is as follows:

where $e_{vu}^{\pi}$ denotes the structural feature similarity between node $v$ and node $u$ based on meta-path $\pi$; $\sigma$ denotes the activation function; $W_{\pi} \in R^{N \times d}$ is the dimension conversion parameter, $N$ is the total number of nodes and $d$ is the target dimension; $h_{N(v)}^{\pi}$ denotes the neighbor node aggregation feature representation of node $v$ based on meta-path $\pi$; $\alpha_{vu}^{\pi}$ denotes the first node-level attention weight of neighbor node $u$ of node $v$ based on meta-path $\pi$; $x_u^{\pi}$ denotes the meta-path adjacency vector of node $u$ based on meta-path $\pi$; $W_c$ denotes the weight coefficients of the linear transformation; $R$ denotes the set of real numbers; $x_v^{\pi}$ denotes the meta-path adjacency vector of node $v$ based on meta-path $\pi$; $\|$ denotes vector concatenation; and $N_v^{\pi}$ denotes the set of neighbor nodes of node $v$ based on meta-path $\pi$.

9. A research team identification system based on heterogeneous information network representation learning, the system comprising a processor and a memory, wherein the memory has stored therein computer instructions, the processor being adapted to execute the computer instructions stored in the memory, the system implementing the steps of the method according to any of claims 1 to 8 when the computer instructions are executed by the processor.

10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.