CN116959745A

CN116959745A - Infectious disease network key node identification method based on graph neural network

Info

Publication number: CN116959745A
Application number: CN202310970865.XA
Authority: CN
Inventors: 宋玉蓉; 张明磊; 曲鸿博
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2023-08-02
Filing date: 2023-08-02
Publication date: 2023-10-27

Abstract

The invention discloses a method for identifying key nodes of an infectious disease network based on a graph neural network, comprising the following steps: 1. simulating a virus transmission process on generated networks with a traditional SIR infectious disease model, and constructing training data labels; 2. selecting the feature combination that yields the highest Kendall correlation coefficient against the scores obtained in step 1, inputting the selected network features into an attention model with multiple graph attention layers, and training on a plurality of artificial networks to obtain a more widely applicable importance score ranking of the nodes; 3. taking the 'rich club' effect into account, selecting from the resulting ranking, by a node selection algorithm based on a 2-hop distance, a node set with a further expanded range of influence. The invention provides a new approach and tool for identifying key nodes in a complex network, and provides a scientific basis for network optimization and risk prevention.

Description

Infectious disease network key node identification method based on graph neural network

Technical Field

The invention relates to the field of infectious disease networks and graph neural networks, in particular to a method for identifying key nodes of an infectious disease network based on a graph neural network.

Background

A complex network is a complex system formed by a large number of nodes and edges. It exhibits characteristics such as self-organization, self-similarity, the small-world property and the scale-free property, and can describe phenomena and rules in many natural and artificial systems. Nodes in a complex network typically represent individuals or entities, while edges represent interactions or links between nodes. The nodes in a complex network are not equivalent: because of their different positions, functions or connection patterns, some nodes have a greater influence on the structure and dynamic behavior of the network, and these are called key nodes.

The identification of key nodes is an important topic in complex network research, and has theoretical significance and practical value for understanding and controlling complex networks. For example, in a social network, identifying key nodes can support efficient information dissemination, public opinion monitoring and social impact analysis; in biological networks, it can help discover important genes, proteins or metabolites, revealing the functions and mechanisms of living systems; in a traffic network, it can help optimize route design, improve transportation efficiency and ensure network safety; in a citation network, it can help readers find high-quality documents and improve learning efficiency.

As the diversity and complexity of network structures continue to increase, traditional key node identification algorithms struggle to process the huge data volumes involved. With the continuous development of artificial intelligence and computing power, machine learning is gradually becoming a new research direction for key node identification: it can analyze the features of a complex network with higher precision and adapts to different network structures. For example, GCN has been introduced into key node identification, converting the identification problem into a classification problem and achieving good results. However, due to the "rich club" phenomenon, the more important nodes are more tightly connected to one another in the network, so such a method cannot rank the nodes precisely by importance. Moreover, a node is influenced not only by its neighboring nodes; some non-neighboring nodes also play a non-negligible role, and this method ignores the influence of non-neighboring nodes.

Disclosure of Invention

In order to solve the above problems, the invention provides a key node identification method based on machine learning, namely an infectious disease network key node identification method based on a graph neural network. The method considers both local and global structural influence factors of the network, fully mines the influence capability of nodes by pre-training on generated networks with a graph attention mechanism, and selects a node set that maximizes the affected range of the network.

In order to achieve the above object, the present invention provides a method for identifying key nodes of an infectious disease network based on a graph neural network, comprising the following steps:

S01: building training data labels:

simulating a virus transmission process on generated networks with a traditional SIR infectious disease model to obtain, for each node, an initial output score that gives an initial representation of its transmission capability, and combining the initial output score with the message-passing idea to obtain a node output score;

obtaining local and global influence factors of each node of the network with an enhanced K-Shell algorithm, then splicing the obtained local and global influence factors of each node with the node output scores, ranking the nodes by importance according to the resulting scores, and taking these scores as the training data labels of the subsequent model;

S02: graph attention network training: selecting the feature combination that yields the highest Kendall correlation coefficient against the scores obtained in step S01, inputting the selected network features into an attention model with multiple graph attention layers, and training on a plurality of artificial networks to obtain a more widely applicable importance score ranking of the nodes;

S03: ranking the results and optimizing: taking the 'rich club' effect into account, selecting from the resulting ranking, by a node selection algorithm based on a 2-hop distance, a node set with a further expanded range of influence.

As a further improvement of the present invention, the step S01 specifically includes:

S11: in order to obtain the propagation capability value of each node, all nodes in the network are traversed in turn, each serving as the infection source: the state of that node is set to the infected state I and the states of all other nodes are set to the susceptible state S; the applicability of the model is improved by training on different types of networks; the infection rate is set to a ratio in which k is the node degree, the numerator is the network average degree and the denominator is given by the network degree distribution, and the recovery rate is fixed at 0.25; for a node i, in each SIR simulation the sum of the numbers of infected nodes I and recovered nodes R in the network is counted after a given number of propagation days, and this sum is divided by the number N of network nodes to obtain the initial propagation score of node i; the calculation process is as follows:

wherein I(j) and R(j) respectively represent the numbers of infected and recovered nodes in the network at the end of the j-th simulation with node i as the initial infection source, and times is the number of simulations;
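Read together with the definitions in S11, the initial propagation score is an average over the simulations of the infected-plus-recovered fraction. A possible way to write it is shown below; the symbol $s_i^{(0)}$ is a name assumed here for illustration, and the notation of the original formula may differ.

```latex
s_i^{(0)} \;=\; \frac{1}{\mathit{times}} \sum_{j=1}^{\mathit{times}} \frac{I(j) + R(j)}{N}
```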

S12: combining the score obtained in step S11 with the message-passing idea, so that the available information makes the importance score more monotonic and closer to reality;

wherein d_j is the degree of node v_j, and the remaining symbols denote the neighbor node set of v_i and the initial propagation score of v_i;

S13: in an actual network, the transmission capability of a node can cascade to infect non-adjacent nodes, so identifying key nodes from local features alone gives inaccurate results; therefore an enhanced K-Shell algorithm, which improves on the K-Shell algorithm by applying information entropy to the complex network and captures the local and global information of nodes well, is used to calculate the local and global influence scores of each node:

wherein IKS_i is the enhanced K-Shell value of node i, N is the number of nodes of the current network, and dist_ij is the shortest path length between nodes i and j; the propagation score is further combined with the local and global influence scores of the nodes, so the model training label is as follows:

wherein ∘ is the Hadamard product of matrices; the loss function of the model is defined as the mean square error between the prediction result and the real label obtained from the SIR simulation; the higher the output of the model, the stronger the infection capability of the node and the greater its influence in the network.

As a further improvement of the present invention, the step S02 specifically includes:

S21: after multiple experiments, five features, namely the node degree centrality value, the neighbor node degree, the neighbor node average degree, the two-hop neighbor node average degree and the clustering coefficient, are spliced into a feature vector matrix, where N is the number of network nodes and each row holds the five features of one node; these features reflect the characteristics of virus transmission well and make it easier for the model to learn the infection capability score; the feature matrix and the network structure data are input into the graph attention network model;

S22: converting the data in S21 into 1-dimensional high-level features by a learnable linear model, wherein W is the parameter to be learned of the linear layer, ·^T represents the transpose operation and b is the bias; by stacking multiple attention layers, the graph attention network model can dynamically calculate the importance of different neighbor nodes and fuse the information of higher-order neighbor nodes; the attention weight of each layer is determined by the node representation of the previous layer, realizing the transmission of cross-layer information; the model introduces three graph attention layers, and in each layer a self-attention mechanism is applied to every node: the initial score of the node is aggregated with those of its first-order neighbors, and the attention coefficient α_ij is obtained by normalization after applying a LeakyReLU nonlinear activation function; the calculation process is as follows:

wherein the model weight is a learnable parameter of the attention layer, F is the number of new node features generated after the graph attention layer, W is a weight matrix, || represents matrix concatenation, the neighborhood of node i is the set of its neighbor nodes, and ·^T represents the transpose operation; a multi-head attention mechanism is introduced to further enhance the computing power:

K represents the number of attention heads per layer, and the activation function σ is the ReLU function; the output of the model represents the comprehensive importance score of each node. A mean square error loss function is adopted to reflect the difference between the model output and the actual data, and the loss function is defined as follows:

As a further improvement of the present invention, the step S03 specifically includes:

S31: applying a node selection algorithm based on a 2-hop distance: after a node ranking has been obtained, the highest-ranked node is first selected as the seed node; the minimum distance from the next-ranked node to the seed node is then calculated; if the distance is smaller than 2, the node is skipped and the next node in the ranking list is examined; once a node that satisfies the minimum distance is found, it is added to the node set; these steps are repeated to obtain the node set.

As a further improvement of the invention, the selection follows the top-k method, in which the k highest-scoring nodes are to be chosen: the minimum distance from the next-ranked node to the seed node is calculated; if the distance is smaller than 2, the node is skipped and the next node in the ranking list is examined; once a node that satisfies the minimum distance is found, it is added to the node set; the above steps are repeated until k nodes have been selected as the final node set.

Compared with the prior art, the technical scheme provided by the invention has the following technical effects:

1. The invention applies the SIR propagation model and the message-passing idea on generated networks, fully considers the global and local structural information of the nodes, selects through multiple tests the network topology features that reflect propagation behavior, and can effectively obtain the propagation capability score of each node in the network.

2. According to the invention, a ranking of nodes by transmission capability is obtained through model training; by taking influence overlap into account and using a node selection method based on a shortest distance of 2 hops, a node set of a given size can be selected effectively so that the infected range of the network is maximized; and when the nodes are immunized sequentially in the ranking order, experiments show that virus transmission can be effectively suppressed.

Drawings

FIG. 1 is a flow chart of the identification of key nodes of an infectious disease network based on a graph neural network according to the present invention.

Fig. 2 is a schematic diagram of the internal architecture of the network.

Fig. 3 is a schematic diagram of influence overlap.

Detailed Description

For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following specific embodiments with reference to the accompanying drawings, but is not intended to limit the scope of the present invention.

FIG. 1 is a flowchart of the identification of key nodes of an infectious disease network based on a graph neural network. The method comprises a module that constructs training data labels with an SIR infectious disease model, a graph attention network training module, and a result ranking and optimization module. The method specifically comprises the following steps:

S01: constructing the training data labels. This module is divided into two sub-modules: (1) a virus transmission process is simulated on generated networks with a traditional SIR infectious disease model, and the transmission capability of each node in the network is then given an initial representation in combination with the message-passing idea; (2) local and global influence factors of each node in the network are obtained with an enhanced K-Shell algorithm and are then spliced with the output of sub-module (1) to obtain a node importance ranking, which serves as the training label of the subsequent model;

Specifically, the construction of the training data labels in S01 comprises the following steps:

S11: in order to obtain the propagation capability value of each node, all nodes in the network are traversed in turn, each serving as the infection source: the state of that node is set to the infected state I and the states of all other nodes are set to the susceptible state S, and the simulation is repeated multiple times on the network to obtain a stable result. Because repeated simulation on a large-scale network incurs a large computational overhead, the simulation is run on small-scale generated networks, which speeds up the spread of the infection and effectively reduces the data preprocessing time. The generated networks include BA scale-free networks, WS small-world networks and PL networks; training on different types of networks improves the applicability of the model. The infection rate is set to a ratio in which k is the node degree, the numerator is the network average degree and the denominator is given by the network degree distribution, and the recovery rate is fixed at 0.25. For a node i, in each SIR simulation the sum of the numbers of infected nodes I and recovered nodes R in the network is counted after a given number of propagation days, and this sum is divided by the number N of network nodes to obtain the initial propagation score of node i. The calculation process is as follows:

wherein I(j) and R(j) respectively represent the numbers of infected and recovered nodes in the network at the end of the j-th simulation with node i as the initial infection source, and times is the number of simulations.
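The simulation in S11 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the graph is an adjacency dictionary of sets, time advances in discrete rounds, the infection rate `beta` is passed in by the caller (the degree-based rate of S11 is not reproduced here), and the names `sir_run`, `initial_propagation_scores`, `steps` and `runs` are hypothetical.

```python
import random


def sir_run(adj, source, beta, gamma, steps, rng):
    """One discrete-time SIR run from `source`.

    Returns (infected + recovered) / N after at most `steps` rounds.
    """
    infected = {source}
    recovered = set()
    for _ in range(steps):
        if not infected:
            break  # the outbreak has died out
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                susceptible = v not in infected and v not in recovered
                if susceptible and rng.random() < beta:
                    newly_infected.add(v)
        # Only nodes that were already infected at the start of the round may recover.
        newly_recovered = {u for u in infected if rng.random() < gamma}
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
    return (len(infected) + len(recovered)) / len(adj)


def initial_propagation_scores(adj, beta, gamma=0.25, steps=10, runs=100, seed=0):
    """Average the outbreak fraction over `runs` simulations for every source node."""
    rng = random.Random(seed)
    return {
        node: sum(sir_run(adj, node, beta, gamma, steps, rng) for _ in range(runs)) / runs
        for node in adj
    }


# Small star graph: node 0 is the hub.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
scores = initial_propagation_scores(star, beta=0.3, steps=5, runs=200)
```

The recovery rate default of 0.25 follows S11; the number of rounds and the number of runs are free parameters that the text leaves open.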

S12: the score obtained in S11 is combined with the "message-passing" idea, so that the available information makes the importance score more monotonic and closer to reality.

S13: in an actual network, the transmission capability of a node can cascade to infect non-adjacent nodes, so identifying key nodes from local features alone gives inaccurate results. An enhanced K-Shell algorithm is therefore used, which improves on the K-Shell algorithm by applying information entropy to the complex network and captures the local and global information of nodes well.

The K-Shell algorithm is a node importance ranking algorithm based on the global structure of the network. From k=1 up to k=n, it recursively deletes the nodes whose degree is less than or equal to k and assigns k as the K-Shell value of each deleted node; the algorithm runs until every node in the network has been assigned a value, and a larger value indicates a more important node. The enhanced K-Shell algorithm based on information entropy (Improved K-Shell, IKS) builds on the K-Shell algorithm and uses Shannon information entropy to subdivide and refine the K-Shell decomposition. The importance I_i of node i is defined as follows:

where N is the number of network nodes and k_i is the degree of node i. The information entropy of node i is its IKS value:

where the set appearing in the formula is the neighbor node set of node i; among nodes in the same layer, i.e. with the same K-Shell value, the larger the IKS value, the higher the node importance.
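The K-Shell decomposition and the entropy refinement can be sketched as below. The two formulas used in `node_entropy` are assumptions, since they are not reproduced in this text: node importance is taken as I_i = k_i / (sum of all degrees) and the entropy as e_i = -Σ I_j ln I_j over the neighbors j of i, which matches the symbols defined above. The function names and the final sort are illustrative only, and the graph is assumed to be an adjacency dictionary of sets with at least one edge.

```python
import math


def k_shell(adj):
    """Return {node: K-Shell value} by repeatedly peeling low-degree nodes."""
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        peel = [u for u in remaining if degree[u] <= k]
        if not peel:
            k += 1  # nothing left to remove at this level, move to the next shell
            continue
        while peel:
            u = peel.pop()
            if u not in remaining:
                continue  # already removed via a duplicate queue entry
            remaining.discard(u)
            shell[u] = k
            for v in adj[u]:
                if v in remaining:
                    degree[v] -= 1
                    if degree[v] <= k:
                        peel.append(v)
    return shell


def node_entropy(adj):
    """Assumed IKS entropy: e_i = -sum over neighbors j of I_j * ln(I_j)."""
    total_degree = sum(len(nbrs) for nbrs in adj.values())
    importance = {u: len(nbrs) / total_degree for u, nbrs in adj.items()}
    return {
        u: -sum(importance[v] * math.log(importance[v]) for v in nbrs)
        for u, nbrs in adj.items()
    }


def rank_by_shell_then_entropy(adj):
    """Order nodes by K-Shell value, breaking ties inside a shell by entropy."""
    shell = k_shell(adj)
    entropy = node_entropy(adj)
    return sorted(adj, key=lambda u: (-shell[u], -entropy[u]))


# Triangle 0-1-2 with a pendant node 3 attached to node 0.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
shells = k_shell(graph)
ranking = rank_by_shell_then_entropy(graph)
```

In this small graph the three triangle nodes share one shell, and the entropy separates node 0 from nodes 1 and 2, which is the kind of tie-breaking within a layer that the text describes.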

In the data enhancement process, the effect of the local structure is strengthened in order to better distinguish the importance of adjacent nodes from that of non-adjacent nodes with similar neighborhood structures. A local influence equation is proposed that uses the IKS values to calculate the local influence score of any node, as follows:

where IKS_i is the enhanced K-Shell value of node i and N is the number of nodes in the current network.

In addition, infected nodes can produce a cascading effect along the edges and affect distant nodes. To strengthen the global information of the nodes, a global influence score is proposed:

where ∘ is the Hadamard product of matrices. Finally, the loss function of the model is defined as the mean square error between the prediction result and the real label obtained from the SIR simulation; the higher the output of the model, the stronger the infection capability of the node and the greater its influence in the network.

S02: graph attention network training. This module contains two parts: selecting the feature combination that yields the highest Kendall correlation coefficient against the scores obtained in step S01, and inputting the selected network features into an attention model, which is trained on a plurality of artificial networks to obtain a more widely applicable importance score ranking of the nodes.
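The feature selection criterion in S02 relies on the Kendall correlation between a candidate ranking and the S01 labels. A minimal sketch is given below; it computes the tau-a variant, in which tied pairs contribute zero, and this choice of variant is an assumption since the text does not say which one is used. The helper `best_feature_combination` and the example dictionary keys are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def best_feature_combination(predictions, labels):
    """Pick the candidate whose predicted scores agree best with the labels."""
    return max(predictions, key=lambda name: kendall_tau(predictions[name], labels))


labels = [0.9, 0.5, 0.1]
candidates = {
    "degree_only": [1.0, 3.0, 2.0],
    "five_features": [3.0, 2.0, 1.0],
}
chosen = best_feature_combination(candidates, labels)
```

A value close to 1 means the candidate orders the nodes almost exactly as the SIR-derived labels do, which is why the highest-scoring combination is kept.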

Specifically, the graph attention network training in S02 comprises the following steps:

S21: in order to find which feature combinations make the model output fit the infection capability ranking more closely, after multiple experiments five features, namely the node degree centrality value, the neighbor node degree, the neighbor node average degree, the two-hop neighbor node average degree and the clustering coefficient, are spliced into a feature vector matrix, where N is the number of network nodes and each row holds the five features of one node; these features reflect the characteristics of virus transmission well and make it easier for the model to learn the infection capability score; the feature matrix and the network structure data are used as the input of the model;
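The five-feature matrix of S21 can be illustrated with the sketch below. Several readings are assumptions: "neighbor node degree" is taken as the sum of the neighbors' degrees, "two-hop neighbor node average degree" as the mean degree of nodes at distance exactly 2, and degree centrality as k / (N - 1). The function name `node_features` is hypothetical and the graph is an adjacency dictionary of sets.

```python
def node_features(adj):
    """Return {node: [degree centrality, neighbor degree sum, neighbor degree mean,
    two-hop neighbor degree mean, clustering coefficient]}."""
    n = len(adj)
    deg = {u: len(nbrs) for u, nbrs in adj.items()}
    rows = {}
    for u, nbrs in adj.items():
        k = deg[u]
        degree_centrality = k / (n - 1) if n > 1 else 0.0
        neighbor_sum = sum(deg[v] for v in nbrs)
        neighbor_mean = neighbor_sum / k if k else 0.0
        # Nodes reachable in two steps that are neither u itself nor direct neighbors.
        two_hop = {w for v in nbrs for w in adj[v]} - nbrs - {u}
        two_hop_mean = sum(deg[w] for w in two_hop) / len(two_hop) if two_hop else 0.0
        # Each edge between two neighbors is seen from both ends, hence the division by 2.
        links = sum(1 for v in nbrs for w in adj[v] if w in nbrs) / 2
        clustering = 2 * links / (k * (k - 1)) if k > 1 else 0.0
        rows[u] = [degree_centrality, float(neighbor_sum), neighbor_mean,
                   two_hop_mean, clustering]
    return rows


# Triangle 0-1-2 with a pendant node 3 attached to node 0.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
features = node_features(graph)
```

Stacking the rows in a fixed node order gives a matrix with N rows and five columns, one row per node.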

S22: to make the results more interpretable, the data in S21 are transformed into 1-dimensional high-level features by a learnable linear model, wherein W is the parameter to be learned of the linear layer, ·^T represents the transpose operation and b is the bias. By stacking multiple attention layers, the graph attention network (GAT) model can dynamically calculate the importance of different neighbor nodes and fuse the information of higher-order neighbor nodes. The attention weight of each layer is determined by the node representation of the previous layer, so that cross-layer information is transmitted. Thus, the GAT model retains the structural information of the graph, captures the higher-order relations between nodes and improves the representation capability of the model. The model introduces three graph attention layers, and in each layer a self-attention mechanism is applied to every node: the initial score of the node is aggregated with those of its first-order neighbors, and the attention coefficient α_ij is obtained by normalization after applying a LeakyReLU nonlinear activation function. The calculation process is as follows:

wherein the model weight is a learnable parameter of the attention layer, F is the number of new node features generated after the graph attention layer, W is a weight matrix, || represents matrix concatenation, the neighborhood of node i is the set of its neighbor nodes, and ·^T represents the transpose operation. A multi-head attention mechanism is introduced to further enhance the computing power:

K represents the number of attention heads per layer, and the activation function σ is the ReLU function. The output of the model represents the comprehensive importance score of each node. A mean square error loss function is adopted to reflect the difference between the model output and the actual data, and the loss function is defined as follows:
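The attention coefficient and the loss can be illustrated with a single-head sketch in plain Python. It follows the commonly used graph attention formulation, in which the projected features of nodes i and j are concatenated, scored with an attention vector, passed through LeakyReLU and normalized with a softmax over the neighbors of i; whether the patent's formula is identical in every detail is an assumption. The names `attention_coefficients` and `mse_loss`, the attention vector `a` and the slope 0.2 are illustrative choices.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_coefficients(h, adj, W, a, i):
    """Single-head attention weights alpha_ij of node i over its neighbors j.

    h: {node: list of input features}
    W: weight matrix as a list of rows (F_out x F_in)
    a: attention vector of length 2 * F_out
    """
    if not adj[i]:
        raise ValueError("node has no neighbors to attend over")

    def project(x):
        return [sum(w * xv for w, xv in zip(row, x)) for row in W]

    hi = project(h[i])
    logits = {}
    for j in adj[i]:
        concat = hi + project(h[j])  # list concatenation, i.e. [W h_i || W h_j]
        logits[j] = leaky_relu(sum(av * cv for av, cv in zip(a, concat)))
    # Softmax over the neighbors, shifted by the maximum for numerical stability.
    top = max(logits.values())
    exps = {j: math.exp(e - top) for j, e in logits.items()}
    norm = sum(exps.values())
    return {j: e / norm for j, e in exps.items()}


def mse_loss(pred, target):
    """Mean square error between predicted scores and labels."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


h = {0: [1.0], 1: [2.0], 2: [4.0]}
adj = {0: {1, 2}, 1: {0}, 2: {0}}
alpha = attention_coefficients(h, adj, W=[[1.0]], a=[0.5, 0.5], i=0)
loss = mse_loss([1.0, 2.0], [1.0, 4.0])
```

With K heads, the same computation would be repeated with K independent pairs of `W` and `a` and the results combined, as described for the multi-head mechanism.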

S03: ranking the results and optimizing: taking the 'rich club' effect into account, a node set that further expands the range of influence is selected from the obtained ranking by a node selection algorithm based on a 2-hop distance.

The step S03 specifically includes:

S31: there are many methods for selecting the target node set once the ranking has been obtained. The top-k method is the simplest: it selects the k nodes with the highest scores. However, it may pick nodes that lie close together, so that a single node is affected multiple times and the infection range is ultimately limited; this phenomenon is called influence overlap.

There are many ways to choose the key node set from the ranking result, such as the top-k algorithm, which selects k nodes in descending order of node influence. However, because of the rich club phenomenon in complex networks, high-degree nodes tend to connect to other high-degree nodes, forming a tightly connected subset. The top-k method may therefore select nodes that are close together, so that a single node is infected multiple times and the extent of infection is ultimately limited, a phenomenon known as influence overlap. FIG. 3 shows the process of influence overlap. The ranking on the left is the result of a node ranking algorithm, from which the top k key nodes are selected in order according to the top-k algorithm. Propagation starts from the highest-ranked node v₃, which influences v₁, v₂, v₄ and v₆; the next highest-ranked node v₄ is then selected and influences v₂, v₃, v₅ and v₇. At this point the influence on node v₂ overlaps, i.e. it is affected multiple times by neighboring nodes.

Therefore, the method adopts a node selection algorithm based on a minimum distance of 2 hops: after a node ranking has been obtained, the highest-ranked node is first selected as the seed node; the minimum distance from the next-ranked node to the seed node is then calculated; if this distance is smaller than 2, the node is skipped and the next node in the ranking list is examined; once a node that satisfies the minimum distance is found, it is added to the node set; these steps are repeated until k nodes have been selected as the final node set.
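The selection procedure can be sketched as a greedy pass over the ranking with a bounded breadth-first search. This is an illustration under assumptions: the distance check is made against every node selected so far (not only the first seed), fewer than k nodes are returned if the ranking is exhausted, and the names `nodes_within` and `select_key_nodes` are hypothetical.

```python
from collections import deque


def nodes_within(adj, source, max_hops):
    """Set of nodes whose hop distance from `source` is at most `max_hops`."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        u, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                frontier.append((v, dist + 1))
    return seen


def select_key_nodes(adj, ranking, k, min_distance=2):
    """Walk the ranking and keep nodes at least `min_distance` hops from all kept nodes."""
    selected = []
    chosen = set()
    for node in ranking:
        if len(selected) == k:
            break
        if chosen & nodes_within(adj, node, min_distance - 1):
            continue  # closer than min_distance to an already selected node
        selected.append(node)
        chosen.add(node)
    return selected


# Path graph 0-1-2-3-4 with an example ranking that favors the middle nodes.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
ranking = [2, 1, 3, 0, 4]
picked = select_key_nodes(path, ranking, k=3)
```

Plain top-3 on this ranking would return the adjacent nodes 2, 1 and 3, whose neighborhoods overlap heavily; the distance rule instead spreads the selection along the path.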

In summary, the method for identifying key nodes of an infectious disease network based on a graph neural network uses the SIR propagation model and network topology features to mine local and global structural information, generates a node importance ranking with the graph attention network, and selects a subset of nodes that maximizes the affected range of the network, which makes it possible to suppress virus propagation in an actual network and thereby maintain the stability of the network effectively.

The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention.

Claims

1. The method for identifying the key nodes of the infectious disease network based on the graph neural network is characterized by comprising the following steps of:

S01: building training data labels:

obtaining local and global influence factors of each node of the network with an enhanced K-Shell algorithm, then splicing the obtained local and global influence factors of each node with the node output scores, ranking the nodes by importance according to the resulting scores, and taking these scores as the training data labels of the subsequent model;

S02: graph attention network training: selecting the feature combination that yields the highest Kendall correlation coefficient against the scores obtained in step S01, inputting the selected network features into an attention model with multiple graph attention layers, and training on a plurality of artificial networks to obtain a more widely applicable importance score ranking of the nodes;

2. The method for identifying the key nodes of the infectious disease network according to claim 1, wherein the step S01 specifically includes:

S13: in an actual network, the transmission capability of a node can cascade to infect non-adjacent nodes, so identifying key nodes from local features alone gives inaccurate results; therefore an enhanced K-Shell algorithm, which improves on the K-Shell algorithm by applying information entropy to the complex network and captures the local and global information of nodes well, is used to calculate the local and global influence scores of each node:

3. The method for identifying the key nodes of the infectious disease network according to claim 1, wherein the step S02 specifically includes:

S21: after multiple experiments, five features, namely the node degree centrality value, the neighbor node degree, the neighbor node average degree, the two-hop neighbor node average degree and the clustering coefficient, are spliced into a feature vector matrix, where N is the number of network nodes and each row holds the five features of one node; these features reflect the characteristics of virus transmission well and make it easier for the model to learn the infection capability score; the feature matrix and the network structure data are input into the graph attention network model;

S22: converting the data in S21 into 1-dimensional high-level features by a learnable linear model, wherein W is the parameter to be learned of the linear layer, ·^T represents the transpose operation and b is the bias; the graph attention network model introduces three graph attention layers, and in each layer a self-attention mechanism is applied to every node: the initial score of the node is aggregated with those of its first-order neighbors, and the attention coefficient α_ij is obtained by normalization after applying a LeakyReLU nonlinear activation function; the calculation process is as follows:

K represents the number of attention heads per layer, and the activation function σ is the ReLU function; the output of the model represents the comprehensive importance score of each node; a mean square error loss function is adopted to reflect the difference between the model output and the actual data, and the loss function is defined as follows:

4. The method for identifying the key nodes of the infectious disease network according to claim 1, wherein the step S03 specifically includes:

5. The method for identifying the key nodes of the infectious disease network according to claim 4, wherein: the selection follows the top-k method, in which the k highest-scoring nodes are to be chosen; the minimum distance from the next-ranked node to the seed node is calculated; if the distance is smaller than 2, the node is skipped and the next node in the ranking list is examined; once a node that satisfies the minimum distance is found, it is added to the node set; the above steps are repeated until k nodes have been selected as the final node set.