CN115049002A

CN115049002A - Complex network influence node identification method based on reverse generation network

Info

Publication number: CN115049002A
Application number: CN202210681512.3A
Authority: CN
Inventors: 刘小洋; 叶舒
Original assignee: Chongqing University of Technology
Current assignee: Chongqing University of Technology
Priority date: 2022-06-15
Filing date: 2022-06-15
Publication date: 2022-09-13

Abstract

The invention provides a complex network influence node identification method based on a reverse generation network, which comprises the following steps: s1, community division: carrying out community division on the network by using a Louvain algorithm; s2, generating a candidate node set: node information is collected by using graph traversal so as to assist in constructing an improved reverse generation network, and then a part of nodes are selected to be added into a candidate node set; s3, selecting a seed node: and finding out the final k influential nodes from the candidate node set as seed nodes. The method can select key nodes in the network or nodes for maintaining the stability of the network as candidate nodes, the seed nodes selected by the method have higher infection speed and larger infection scale, and the seed node sets are dispersed on most networks.

Description

Complex network influence node identification method based on reverse generation network

Technical Field

The invention relates to the technical field of influence node identification, in particular to a complex network influence node identification method based on a reverse generation network.

Background

The rise of network science has made the identification of sets of influential nodes in complex networks an active research topic, and systems from many real-world fields, including biology, physics, sociology, and engineering, can be represented as complex networks. Key nodes, often referred to as influential nodes, help maintain the stability of the network topology and determine how efficiently information such as rumors spreads through the network; they directly determine whether the network can operate properly. Identifying a group of influential nodes in a network is the classical influence maximization problem. Strengthening control over influential nodes offers new opportunities for accelerating information dissemination and is of great significance for viral marketing, the identification of drug targets and essential proteins, rumor control, and the like. For example, in word-of-mouth marketing, the most influential users are selected for promotion so that the maximum spreading effect is achieved with the lowest budget; in a social network, detecting the sources of public opinion propagation makes it possible to control its spread. With the advent of big data and the 5G era, the types and scale of networks are growing rapidly, which poses new challenges for existing methods of measuring node influence.

Influential node identification methods based on community discovery have attracted extensive attention from network science researchers in recent years. However, existing methods exploit the community structure only superficially, and some three-stage methods have shortcomings in selecting candidate node sets and seed nodes. First, existing methods such as random walks, genetic algorithms, and other heuristic algorithms need to traverse the entire network when selecting a candidate node set, and the selected candidate nodes may cluster together. If seed nodes are then selected from such a candidate set, their influence ranges may overlap. The generation of the candidate node set is therefore particularly important in the whole seed node selection process. Second, in the seed node identification stage, many algorithms use a greedy algorithm for accurate selection; although selecting from the generated candidate node set greatly improves efficiency, the selection is still time-consuming.

Disclosure of Invention

The invention aims to at least solve the technical problems in the prior art, and particularly creatively provides a complex network influence node identification method based on a reverse generation network.

In order to achieve the above object, the present invention provides a complex network influence node identification method based on a reverse generation network, including the following steps:

s1, community division: carrying out community division on the network by using a Louvain algorithm, and reducing the search space of the seed nodes;

s2, generating a candidate node set: node information is collected by graph traversal to assist in constructing an improved reverse generation network; with this method, high-importance nodes can be selected without restoring the original network during construction; a part of the nodes is then selected and added to the candidate node set;

s3, selecting a seed node: and finding out the final k influential nodes from the candidate node set as seed nodes.

Further, each community must at least satisfy the following size condition:

C_size = size(G) * η (9)

wherein C_size represents the required community size;

size(G) represents the number of nodes of network G;

η is an adjustable parameter for controlling the size of the communities that satisfy the condition.

Further, the S2 includes:

s2-1, calculating the adding cost of the remaining nodes, selecting the nodes with the minimized cost function to construct the network, wherein the cost function is as follows:

where cost(u, n+1) denotes the cost function of node u at the (n+1)-th time step;

the component-size term denotes the size of the maximum connected component obtained at the (n+1)-th time step when node u is added to network G;

AUC_u represents the AUC value of node u;

AUC_max represents the maximum AUC value among all nodes;

AUC_min represents the smallest AUC value among all nodes;

ξ is a sufficiently small positive parameter;

s2-2, after the node is added, the size of each connected component in the network needs to be updated, and the size of the maximum connected component in the network is recorded;

s2-3, repeating steps S2-1 to S2-2 until the number of remaining nodes equals the required number of candidate nodes, at which point construction of the network stops.

Generating the candidate node set with an improved reverse generation network has the following advantages. First, the generated candidate nodes are more dispersed, with few nodes clustered together, and all of them are key nodes in the network. Second, the influence ranges of the selected candidate nodes overlap little, so a heuristic method and a greedy method can be balanced when selecting the seed nodes in the third stage. Third, these candidate nodes sustain the robustness of the network: removing them easily causes the network to collapse. Fourth, the whole network need not be traversed when generating candidate nodes; network construction can stop as soon as the number of nodes not yet added equals the required size of the candidate node set.

Further, the S2 further includes:

in order to further narrow the search range while ensuring candidate nodes of adequate quality, the size of the candidate node set is set within the independent network formed by each community, according to the following formula:

where cand_num_i represents the size of the candidate node set of the i-th community;

(C_size_i - C_size_min)/(C_size_max - C_size_min) represents the proportion of the i-th community among all the selected communities;

β is an amplification parameter;

k is the number of seed nodes that ultimately need to be selected.

Further, the graph traversal comprises:

the degree centrality is selected as an initial center score in graph traversal, and the degree centrality is optimized through a graph traversal framework, so that an AUC score which is finer in granularity and can measure the influence of the nodes can be generated.

Further, the S3 includes:

s3-1, selecting k₁ nodes from the candidate node set through a degree discount algorithm;

the calculation formula of the degree discount algorithm is as follows:

where gdd_v represents the degree discount of node v;

d_v represents the degree of node v;

t_v represents the number of infected neighbors of node v;

t_w represents the number of infected neighbors of a susceptible neighbor node w of node v;

p represents the probability of infection;

s3-2, selecting k₂ nodes through an improved submodular algorithm;

k₂ = k - k₁ (13)

k is the number of seed nodes that finally need to be selected;

μ is an adjustable parameter that balances the greedy algorithm and the heuristic algorithm;

c represents the total number of communities in the network that meet the size condition;

cand_num_i represents the size of the candidate node set of the i-th community;

the improved submodular algorithm selects the k₂ nodes as follows: in each round of selection, if the selected node u is similar to a previously selected node or to a node selected in the heuristic process, node u is not selected and is removed from the candidate node set.

Compared with the original sub-modular algorithm, the improved sub-modular algorithm further considers the position information of the nodes and the structural similarity between the nodes.
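As an illustration of the heuristic stage in step S3-1, the following is a minimal Python sketch of the classical degree discount heuristic, in which the discounted degree of a node v is d_v - 2·t_v - (d_v - t_v)·t_v·p. This is the classical form only; the gdd_v used by the method additionally involves t_w, and that formula is not reproduced here. The function name `degree_discount` and the adjacency-dictionary input are assumptions made for illustration.

```python
def degree_discount(adj, k, p):
    """Select k seeds with the classical degree discount heuristic.

    adj: dict mapping each node to the set of its neighbours
         (undirected, unweighted, no self-loops).
    p:   infection probability.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    t = {v: 0 for v in adj}       # t_v: number of already selected neighbours
    dd = dict(degree)             # discounted degree, initially d_v
    chosen, seeds = set(), []
    for _ in range(min(k, len(adj))):
        # pick the unselected node with the largest discounted degree
        u = max((v for v in adj if v not in chosen), key=lambda v: dd[v])
        chosen.add(u)
        seeds.append(u)
        # discount every unselected neighbour of the new seed
        for v in adj[u]:
            if v in chosen:
                continue
            t[v] += 1
            dd[v] = degree[v] - 2 * t[v] - (degree[v] - t[v]) * t[v] * p
    return seeds
```

Because only the neighbours of a newly chosen seed are updated, each round touches a small part of the graph, which is why this kind of heuristic is much cheaper than a greedy search based on repeated simulation.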

Further, whether two nodes are similar is judged by the following formula; if the formula is satisfied, node u is similar to node v;

where sim represents the similarity of node u and node v;

N(u) represents the set of neighbor nodes of node u;

N(v) represents the set of neighbor nodes of node v;

N(u) ∩ N(v) represents the set of neighbors shared by node u and node v, and |N(u) ∩ N(v)| their number;

|·| represents the number of elements in a set;

ε is a parameter approaching 0;

abs(·) represents the absolute value;

ks(u) is the normalized k-shell index of node u;

ks(v) is the normalized k-shell index of node v.

Further, the method is evaluated by the following performance indexes: robustness value, cumulative distribution function, SIR model, and average shortest path length. Whether the method is reasonable or not can be comprehensively judged through the four performance indexes.

In summary, due to the adoption of the technical scheme, the invention has the following advantages:

(1) Influence scores are used to assist in constructing the reverse generation network, so the generated network need not be restored to the original network; the nodes not yet added to the network directly form the candidate node set, which greatly reduces computation time.

(2) Considering that the influence ranges of the selected node sets overlap due to the common neighbor and position relations among the nodes, an improved CELF algorithm is provided.

(3) The community structure of the network is considered, the algorithm is accelerated by utilizing the community, and the advantages of the connectivity of the network, graph traversal, a heuristic algorithm and a greedy algorithm are combined.

Therefore, the method can select key nodes in the network or nodes for maintaining the stability of the network as candidate nodes, the seed nodes selected by the method have higher infection speed and larger infection scale, and the seed node sets are dispersed on most networks.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a graph of the robustness value R of the present invention.

FIG. 2 is a schematic diagram of the graph traversal centrality of the present invention.

FIG. 3 is an overall schematic diagram of the CBGN framework of the present invention.

FIG. 4 is a schematic diagram of a toy network of the present invention.

FIG. 5 is a graph of the degree distributions of the 6 experimental networks of the present invention.

FIG. 6 is an R-curve diagram of 6 real networks of the present invention under different methods.

FIG. 7 is an R-curve diagram of the communities of 2 real networks of the present invention.

FIG. 8 is a CDF distribution diagram of Degrid and TARank_grid of the present invention over 3 networks.

FIG. 9 is a schematic diagram of the propagation effect of the seed nodes selected by different algorithms under the SIR model.

FIG. 10 is a schematic representation of the spread of the present invention at different infection rates.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

The invention provides a complex network influence node identification method based on a reverse generation network, which comprises the following specific embodiments:

s1, carrying out community division on the public opinion network: carrying out community division on a public opinion network by using a Louvain algorithm, and reducing a search space of public opinion seed nodes;

s2, generating a public opinion candidate node set: public opinion node information is collected by graph traversal to assist in constructing an improved reverse generation network; with this method, high-importance nodes can be selected without restoring the original network during construction; a part of the nodes is then selected and added to the public opinion candidate node set;

s3, selecting public opinion seed nodes: and finding out the final k influential nodes from the public opinion candidate node set as public opinion seed nodes.

The spread of public opinion can then be quickly controlled simply by applying speech control to the users corresponding to the selected public opinion seed nodes.

1. Introduction to related content

1.1 background

The influence maximization problem was first studied by Domingos and Richardson, and was formulated by Kempe et al. in 2003 as a discrete optimization problem and proved to be NP-hard. To obtain influential nodes efficiently, nodes have been measured from different angles and a large number of indexes have been proposed, such as Degree, ClusterRank, K-shell, H-index, neighborhood coreness, PageRank, and DegreeDiscount. Selecting a group of nodes purely by high centrality leads to overlapping influence ranges, and selecting nodes directly by indexes such as K-shell and H-index suffers from low accuracy. The PageRank algorithm considers both the number and the quality of the nodes linking to a node, but rank leakage can occur when dangling nodes are present. ClusterRank considers both the direct neighbors of a node and its clustering coefficient, but, because it uses only local network information, it also lacks a performance guarantee. Bae et al. assumed that a node has a wider propagation range when more of its neighbors are located at the core of the network and proposed the neighborhood coreness centrality; this method balances the degree and the position of a node and effectively alleviates the degeneracy of the K-shell method. Many methods also combine the network topology to select nodes heuristically. Beni et al. proposed the IMT algorithm, which selects seed nodes from the dense parts of the graph so as to reach more nodes within the shortest distance. Zhang et al. designed a constrained evolutionary algorithm (IICEA) based on a local-global influence index by considering local neighbor information and global community information, which effectively handles the budgeted influence constraint problem. Mao et al. proposed a topological potential approach for predicting influential nodes in large-scale online social networks. Although centrality measures based on the network topology can find influential nodes in a short time, such algorithms are often insufficiently stable and the quality of their solutions is not high.

In recent decades, classical greedy algorithms, such as the greedy algorithm that sequentially selects the node with the largest marginal gain and its improved variants CELF and CELF++, have been proposed one after another. Although the influence spread of these algorithms approximates the optimal solution within a factor of (1 - 1/e - ε), where ε is a very small parameter approaching 0 and e is the base of the natural logarithm, they require tens of thousands of Monte Carlo simulations, and the resulting long computation times make them difficult to scale to large networks. To address this, many network science researchers have proposed heuristic algorithms for specific domains, sacrificing some accuracy to greatly reduce computational complexity. In recent years, a new index, the robustness value, has been widely used to measure the importance of nodes. Robustness stems from the well-known percolation theory: when a portion of the nodes in a network is removed, the network breaks down. The robustness value measures the connectivity of the network, and a smaller robustness value means better performance of the algorithm. Compared with traditional methods, which need to rank all nodes, robustness considers only the importance of the nodes to the network as a whole. In addition, communities are an important structure in network science: nodes interact more frequently within the same community than across different communities. A large number of community-based influence maximization algorithms have been proposed, demonstrating the effectiveness of community division.

Based on the above discussion, in the present patent, a new Community-based reverse generation network (CBGN) method is proposed to identify a group of nodes in a complex network, which includes three steps: (1) dividing communities; (2) generating a candidate set; (3) a seed node is selected. Firstly, carrying out community division on the network by using a Louvain algorithm, and reducing the search space of the seed node. In the second step, the information of the nodes is collected by using the traversal of the graph, the reverse generation network is constructed in an auxiliary way by using the different information of the nodes, and a part of nodes are selected to be added into the candidate node set; and finally, finding out the final top-k influential node from the candidate nodes by utilizing a heuristic algorithm and an improved classical greedy algorithm.

1.2 motivation for research

Influential node identification methods based on community discovery have attracted extensive attention from network science researchers in recent years. However, existing methods exploit the community structure only superficially, and some three-stage methods have shortcomings in selecting candidate node sets and seed nodes. First, existing methods such as random walks, genetic algorithms, and other heuristic algorithms need to traverse the entire network when selecting a candidate node set, and the selected candidate nodes may cluster together. If seed nodes are then selected from such a candidate set, their influence ranges may overlap. The generation of the candidate node set is therefore particularly important in the whole seed node selection process. Second, in the seed node identification stage, many algorithms use a greedy algorithm for accurate selection; although selecting from the generated candidate node set greatly improves efficiency, the selection is still time-consuming. It is therefore desirable that the candidate node set selected in the second stage also assist the third stage, rather than merely narrowing the scope of the search.

2. Related concepts

Let G = (V, E) denote a complex network graph, where V = {v₁, v₂, ..., v_n} represents the set of nodes and E = {e₁, e₂, ..., e_m} represents the set of edges. Nodes represent individuals in a complex network and edges represent relationships between individuals in the network. In the present patent, only undirected and unweighted simple networks are considered, and no self-loops are allowed.

Definition 1, influence node top-k: top-k influential nodes are defined as nodes that have a significant influence in a particular scenario, and the number is designated as k.

Definition 2, influence maximization problem: the influence maximization problem refers to finding a group of k influential nodes such that, when these k nodes are used as the source node set S and influence is propagated under a specified propagation model, the influence is maximized, which can be expressed as

$$\mathrm{Influ}(S) = \arg\max_{S \subseteq V,\ |S| = k} \sigma(S) \tag{1}$$

where Influ(S) represents the influence of the source node set S, σ(·) represents the information diffusion function, and σ(S) represents the expected number of nodes that can be influenced after propagation under the specified model with S as the seed set.
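Because σ(S) is an expectation over a random propagation process, it is usually estimated by simulation. The following is a minimal Python sketch that estimates σ(S) by Monte Carlo simulation under the independent cascade model. The choice of the independent cascade model, the function names `simulate_ic` and `estimate_spread`, and the parameters `runs` and `seed` are assumptions made for illustration; the definition above leaves the propagation model open.

```python
import random


def simulate_ic(adj, seeds, p, rng):
    """Run one independent cascade and return the number of activated nodes."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        newly_active = []
        for u in frontier:
            for v in adj[u]:
                # each newly active node gets one chance to activate each neighbour
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return len(active)


def estimate_spread(adj, seeds, p, runs=1000, seed=0):
    """Monte Carlo estimate of sigma(S): mean cascade size over `runs` runs."""
    rng = random.Random(seed)
    return sum(simulate_ic(adj, seeds, p, rng) for _ in range(runs)) / runs
```

A greedy seed selection would call an estimator of this kind once per candidate node per round, which is the source of the long running times attributed to simulation-based greedy methods.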

Definition 3, robustness value: the robustness value can be used to evaluate the performance of the ranking algorithm. Given a network, one node is removed at each time step and the size of the largest connected component in the remaining network is calculated until the network is empty. The network robustness value R is defined as

$$R = \frac{1}{n} \sum_{k=1}^{n} \frac{\sigma_{gcc}(G \setminus \{v_1, v_2, \ldots, v_k\})}{\sigma_{gcc}(G)}$$

where n is the number of nodes in G, σ_gcc(G) represents the size of the maximum connected component of the network before any node is removed, and σ_gcc(G\{v₁, v₂, ..., v_k}) represents the size of the giant connected component in the remaining network after the nodes of the set K = {v₁, v₂, ..., v_k} have been removed from the network in sequence. The robustness value R can be regarded as the area under the R curve, as shown in FIG. 1, whose horizontal axis is k/n, the proportion of removed nodes, and whose vertical axis is σ_gcc(G\{v₁, v₂, ..., v_k})/σ_gcc(G), the relative size of the maximum connected component of the remaining network after the nodes are removed.

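Among all networks of n nodes, the most vulnerable network is the star network and the most robust is the fully connected network. For a star network whose central node is removed first,

$$R_{star} = \frac{1}{n} - \frac{1}{n^2}$$

and for a complete network,

$$R_{complete} = \frac{1}{2}\left(1 - \frac{1}{n}\right)$$

thus, the range of robustness values is

$$R \in \left[\frac{1}{n} - \frac{1}{n^2},\ \frac{1}{2} - \frac{1}{2n}\right]$$

A smaller robustness value means better performance of the algorithm.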
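The robustness value can be computed directly from its description: remove the nodes one at a time in the given order, record the size of the largest connected component after each removal relative to its size in the intact network, and average over the n steps. The following Python sketch does this under the assumption that R is that average; the function names `largest_component` and `robustness` are illustrative, and the repeated component search is chosen for clarity rather than speed.

```python
def largest_component(adj, removed):
    """Size of the largest connected component once `removed` nodes are deleted."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:                      # depth-first search of one component
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def robustness(adj, order):
    """Average relative size of the largest component over the removal order."""
    n = len(adj)
    full = largest_component(adj, set())
    removed, total = set(), 0.0
    for v in order:
        removed.add(v)
        total += largest_component(adj, removed) / full
    return total / n
```

For a 5-node star with the hub removed first this gives 4/25 = 0.16, and for a 4-node complete network it gives 3/8 = 0.375, which agree with (n-1)/n² and (n-1)/(2n) respectively.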

In this section, the following 3 aspects will be introduced: community detection, reverse generation of network and graph traversal method.

2.1. Community detection

Communities are generally regarded as groups of closely related nodes, for example users who are interested in similar content or share the same interests. Nodes inside a community are tightly clustered, while only sparse connections exist between different communities. Community structure is of great significance for discovering group structure in social networks, understanding the influence of group structure on information propagation, identifying key nodes in complex networks, and the like. Given the characteristics of real-world networks, making reasonable use of community structure can greatly reduce the complexity of an algorithm.

A number of outstanding community discovery algorithms, such as FPMQA, BiLPA, CommunityGAN, and SEAL, have been proposed in the last decade, but the Newman and Louvain algorithms are undoubtedly still widely used community detection algorithms. In the present patent, the Louvain algorithm is selected to cluster the network. It is a modularity-based community discovery algorithm whose optimization goal is to maximize the modularity of the whole community network. The network modularity is defined as

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$

wherein Q represents the network modularity;

k_i represents the sum of the weights of the edges connected to node i;

k_j represents the sum of the weights of the edges connected to node j;

A_ij represents the weight of the edge between nodes i and j;

m represents the sum of the weights of all edges;

c_i represents the community to which node i belongs, and δ(c_i, c_j) equals 1 when nodes i and j belong to the same community and 0 otherwise.

Here A is the network adjacency matrix, which holds the weights of the edges between nodes; when the network is not a weighted graph, the weight of every edge is regarded as 1. k_i = Σ_j A_ij represents the sum of the weights of all edges connected to node i, and m = (1/2) Σ_ij A_ij represents the sum of the weights of all edges. The Louvain algorithm mainly comprises two stages, Modularity Optimization and Community Aggregation. In the former stage, each node is moved into the community of one of its adjacent nodes so that the modularity keeps increasing; in the latter stage, each community obtained in the first stage is aggregated into a single node, and the network is then reconstructed according to the community structure of the previous step. The first stage needs to calculate the modularity gain, and the calculation formula is as follows:

the above equation can be simplified as:

wherein Σ_in represents the sum of the weights of all edges within community C, k_{i,in} represents the sum of the weights of the edges from node i to the nodes in community C, and Σ_tot represents the sum of the weights of the edges incident to the nodes in community C. The Louvain algorithm yields more natural communities and is also a fairly fast algorithm, so it is used to divide the network into communities and thereby accelerate the proposed algorithm.
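For reference, and as a hedged sketch rather than a reproduction of the formulas intended at this point, the modularity gain ΔQ obtained by moving an isolated node i into community C is commonly written as follows, in the form given in the original Louvain paper by Blondel et al. and using the quantities Σ_in, Σ_tot, k_i, k_{i,in} and m described above:

$$\Delta Q = \left[ \frac{\Sigma_{in} + k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right]$$

Expanding the squares cancels the terms in Σ_in, Σ_tot² and k_i², leaving

$$\Delta Q = \frac{k_{i,in}}{2m} - \frac{\Sigma_{tot}\, k_i}{2m^2}$$

Conventions differ on whether each internal edge is counted once or twice in Σ_in and k_{i,in}, so some presentations write 2k_{i,in} in the first bracket; the form above is an assumption about the convention in use.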

2.2. Reverse generation network

Lin et al. proposed Backup Generating Networks (BGN) to identify influential nodes in a complex network; the method obtains a ranking of node importance by minimizing the robustness value R. BGN aims to find a sequence of nodes whose removal causes the network to collapse as quickly as possible, i.e., the maximum connected component of the network shrinks rapidly. The core of BGN is a reverse process: rather than deleting nodes from the network, it adds nodes one by one to an empty network to reconstruct the original network, subject to the requirement that the size of the giant connected component grows as slowly as possible. The ranking of the nodes is then the reverse of the order of addition, i.e., nodes added later are more important for maintaining network connectivity.

The reverse process starts from an empty network G₀(V₀, E₀), where V₀ = ∅ and E₀ = ∅. At the (n+1)-th time step (the (n+1)-th node), one of the remaining nodes is added to the current network G_n(V_n, E_n), forming a new network with n+1 nodes, G_{n+1}(V_{n+1}, E_{n+1}), where V_n represents the node set of network G_n and E_n represents its edge set. This process is repeated until the original network is restored. Note that every intermediate network G_n (n = 0, 1, 2, ..., N) in this process is an induced subgraph of network G, where G_n denotes a network of n nodes. According to the BGN strategy, the node selected at each time step should make the maximum connected component of network G_{n+1} as small as possible.
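The reverse process described above can be sketched in Python with a union-find structure that tracks component sizes as nodes are added. This is a minimal illustration, not the method's cost function: ties between nodes that yield the same largest component are broken here by lower degree, which is an assumption made only to keep the example deterministic. The function name `backward_generating` and the `keep` parameter (the number of nodes left unadded, which allows stopping before the original network is restored) are likewise illustrative.

```python
def backward_generating(adj, keep=0):
    """Greedily add nodes so that the largest component grows as slowly as possible.

    Returns (order, remaining): the order of addition and the nodes never added.
    """
    parent, size = {}, {}          # union-find over the nodes added so far

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def roots_around(u):
        # distinct components that adding u would merge
        return {find(v) for v in adj[u] if v in parent}

    remaining = list(adj)
    order, largest = [], 0
    while len(remaining) > keep:
        def key(u):
            merged = 1 + sum(size[r] for r in roots_around(u))
            # primary: largest component after adding u; tie-break: lower degree
            return (max(largest, merged), len(adj[u]))

        u = min(remaining, key=key)
        roots = roots_around(u)
        parent[u], size[u] = u, 1
        for r in roots:                     # merge neighbouring components into u
            parent[r] = u
            size[u] += size[r]
        largest = max(largest, size[u])
        remaining.remove(u)
        order.append(u)
    return order, remaining
```

On a star network the hub is added last, so it ranks as the most important node; with `keep=1` the construction stops early and the hub is the single node left unadded.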

2.3. Graph traversal centrality

The graph traversal framework can incorporate different types of centrality to improve their performance. It approaches the identification of influential nodes from the viewpoint of graph traversal, which is completely different from existing methods. Any existing centrality, such as degree centrality or H-index, can be used to generate an importance score through the framework. Graph traversal centrality is illustrated in FIG. 2. First, for each node in the network, a breadth-first search (BFS) tree is constructed by traversing the graph layer by layer with the target node as the root node, as shown in FIG. 2(A). Each node has an initial centrality score, which can be obtained from any centrality measure. For an influential node, the upper layers of its BFS tree typically contain more nodes, because the upper-layer nodes are local neighbors of the root node. Second, a cumulative score vector vec = [l₁, l₂, ..., l_h] of length h is constructed from each BFS tree, where h is the maximum number of layers (the root node is at layer 1) and l_i is the sum of the scores of all nodes whose layer number is not greater than i, as shown in FIG. 2(B). Third, the first m values of the score vector vec are used to draw a curve, and the influence of the node is quantified by the area under this curve, recorded as the AUC value; as shown in FIG. 2(C), where m (1 ≤ m ≤ h) is a user-specified parameter and m = 2 in this figure, the x-axis represents the layer number of the BFS tree and the y-axis represents the cumulative score at each layer.

In a given network graph G, assume that the initial centrality scores are CS = {c₁, c₂, ..., c_n}, where c_i represents some centrality (e.g., degree, H-index) of node i and n represents the number of nodes. The cumulative score at layer m of the BFS tree generated by each node can then be calculated as follows:

$$\mathrm{cumscore}(m) = \sum_{q=1}^{m} \sum_{v_j \in T(q)} c_j$$

where cumscore(m) represents the cumulative score at layer m of the BFS tree generated by a node, v_j represents node j, T(q) represents the set of all nodes at the q-th layer of that BFS tree, and c_j represents the centrality of node j. The AUC value is obtained by plotting the first m entries of the generated score vector and calculating the area under the resulting curve, and can be expressed as follows:

where AUC(m) is the area under the curve drawn from the first m entries of the generated score vector; it is the AUC value of a node and measures the importance of that node.

The AUC value obtained through graph traversal refines centralities such as degree centrality, so that existing centrality ranking methods become finer in granularity.
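The three steps of graph traversal centrality (BFS layers, cumulative score vector, area under the first m entries) can be sketched in Python as follows. The area is approximated here by summing the first m cumulative scores, i.e., unit-width bars per layer; this rule is an assumption, since the exact AUC formula is not reproduced in this text. The function names `bfs_layers` and `traversal_auc` are illustrative.

```python
def bfs_layers(adj, root):
    """Nodes of the BFS tree rooted at `root`, grouped by layer (root is layer 1)."""
    seen = {root}
    layers = [[root]]
    frontier = [root]
    while frontier:
        next_layer = []
        for u in frontier:
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    next_layer.append(v)
        if next_layer:
            layers.append(next_layer)
        frontier = next_layer
    return layers


def traversal_auc(adj, root, score, m):
    """Area under the first m entries of the cumulative score vector of `root`."""
    vec, running = [], 0.0
    for layer in bfs_layers(adj, root):
        running += sum(score[v] for v in layer)   # l_i: scores of layers 1..i
        vec.append(running)
    # assumed area rule: unit-width bars, one per layer
    return sum(vec[:m])
```

With degree as the initial score on the path 0-1-2 and m = 2, the middle node obtains 2 + 4 = 6 and an end node obtains 1 + 3 = 4, so nodes that reach high-scoring neighbors in fewer layers receive larger values.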

3. Proposed framework

In this section, a new CBGN framework is proposed to achieve influence maximization in the network. The algorithm pseudocode is shown in Algorithm 1. As shown in FIG. 3, the framework consists of three parts: (1) Community division: at this stage, a runtime-aware community detection algorithm suited to the application data set is used (Algorithm 1, line 3). (2) Candidate set generation: the concept of graph traversal is introduced into reverse network generation. Nodes are added one by one to an empty network using the centrality measure further optimized by graph traversal; finally, the nodes that have not joined the network are the candidate nodes (Algorithm 1, lines 4-10). (3) Seed node selection: a heuristic algorithm is balanced with a greedy algorithm to select nodes quickly and accurately (Algorithm 1, line 12). Each step is described in detail below.

3.1 Community partitioning

The Louvain algorithm is used to partition the real network data sets of the present invention. Compared with graph partitioning methods, hierarchical clustering methods, label propagation methods and the like, the Louvain algorithm needs no prior knowledge of the number of communities and can discover more natural communities. The communities it obtains are therefore closer to the inherent communities of a real network, and the algorithm can also be applied to large-scale networks. In addition, after the network is divided, not every community is significant enough to contain a final seed node, so smaller communities are not passed to the candidate node selection stage. Considering that larger communities have more influence, each community must at least satisfy the following size condition:

C_size＝size(G)*η (9)

where size(G) represents the number of nodes of the network G and η is an adjustable parameter that controls the minimum size of a qualifying community. The invention sets η to 0.01.
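As a small illustration of the size condition, the sketch below keeps only the communities whose size reaches size(G)*η. It assumes the communities form a partition of the network (so their sizes sum to size(G)) and reads "at least" as a greater-than-or-equal test; the function name filter_communities is hypothetical, and the community list may come from any Louvain implementation.

```python
def filter_communities(communities, eta=0.01):
    """Keep communities with at least size(G) * eta nodes.

    `communities` is a list of node collections that partition the network,
    so the network size is taken as the sum of the community sizes.
    """
    network_size = sum(len(c) for c in communities)
    min_size = network_size * eta
    return [c for c in communities if len(c) >= min_size]
```

With η = 0.01 and a 200-node network, the threshold is 2 nodes, so a singleton community is dropped before candidate selection.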

3.2 Candidate node selection

At this stage, candidate nodes are generated in mutually independent communities. Reducing the number of candidate nodes that need to be evaluated makes the search for the most influential nodes more efficient. In this step, each community is treated as an induced subgraph, that is, an independent network. In each network, the reverse generation process is carried out by minimizing the robustness value. When node u joins the network, the largest connected component of the network at that moment is denoted G[u]. According to the reverse generation strategy, the joining node u should minimize the size of the largest connected component in the network, but two or more nodes may satisfy this condition at the same time, as the green nodes in fig. 4 show. Fig. 4 is a toy network. Fig. 4(a) is at the 2nd time step of the reverse generation process: while keeping the largest connected component minimal, several nodes (nodes 1, 2, 3, 4, 6) are selectable as the next node to join. Fig. 4(b) is at the 4th time step of the reverse generation process, where 3 nodes (nodes 1, 3, 4) are selectable. The centrality optimized by graph traversal is therefore used to assist in constructing the reverse generation network, and the goal of minimizing the size of the connected component is translated into minimizing a cost function, defined as:

where cost(u, n+1) is the cost function of node u at the (n+1)-th time step, AUC_u is the AUC value of node u, AUC_max is the maximum AUC value over all nodes, and AUC_min is the minimum AUC value over all nodes.

ξ is a positive parameter small enough to ensure

Degree centrality is selected as the initial centrality score in graph traversal and is optimized through the graph traversal framework, yielding an AUC score that is finer-grained and can measure node influence. At each time step the node that minimizes the cost function is added to the network, and construction stops when the number of nodes not yet added equals the required number of candidate nodes. The improved reverse generation network algorithm imp_BGN, which generates candidate nodes in each community, is shown in Algorithm 2. Because less important nodes are added first during reverse generation, the nodes that were never used to build the network are the ones that remain, and they are passed into the candidate node set. To further narrow the search range while ensuring candidate nodes of suitable quality, the size of the candidate node set is set in the independent network formed by each community according to the following formula:

where cand_num_i represents the size of the candidate node set of the i-th community, and the term (C_size_i-C_size_min)/(C_size_max-C_size_min) represents the relative size of the i-th community among all selected communities, with a value in [0,1]. β is an amplification parameter that controls the size of the candidate node set, and k is the number of seed nodes that ultimately need to be selected. Generating the candidate node set with the improved reverse generation network has several advantages. First, the generated candidate nodes are more dispersed, with few nodes clustered together, and all of them are key nodes in the network. Second, the influence ranges of the selected candidate nodes overlap little, so the third stage can balance heuristic and greedy methods when selecting seed nodes. Third, these candidate nodes ensure the robustness of the network: removing them easily causes the network to collapse. Fourth, the whole network need not be traversed when generating candidate nodes; construction can stop as soon as the number of nodes not yet added equals the required size of the candidate node set.

In summary, it makes sense to generate candidate nodes using imp_BGN.

In Algorithm 2, lines 1-2 initialize the algorithm, after which lines 3-11 build the reverse generation network to generate the set of candidate nodes. Every time a node is to be added to the network, the joining cost of each remaining node is calculated (lines 4-5), and the node that minimizes the cost function is selected to build the network (line 7). After a node joins, the size of each connected component in the network is updated and the size of the largest connected component is recorded (line 8). This process is repeated until the number of remaining nodes equals the required number of candidate nodes.
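The loop described above can be sketched with a union-find structure. This is an illustrative sketch, not Algorithm 2 itself: the cost used here is an assumption (the size of the largest connected component after the node joins, plus ξ times the min-max normalized score of the node, so that the score only breaks ties), and the names reverse_generation_candidates, score, cand_num and xi are hypothetical.

```python
def reverse_generation_candidates(adj, score, cand_num, xi=1e-3):
    """Add low-cost nodes to an empty network; return the nodes never added."""
    parent, size = {}, {}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    lo, hi = min(score.values()), max(score.values())
    span = (hi - lo) or 1.0          # avoid division by zero for equal scores
    remaining = set(adj)
    largest = 0                      # size of the largest component so far

    while len(remaining) > cand_num:
        best, best_cost = None, None
        for u in sorted(remaining):
            # Components that would be merged if u joined now.
            roots = {find(v) for v in adj[u] if v in parent}
            merged = 1 + sum(size[r] for r in roots)
            # Assumed cost: largest component size, score as a tie-breaker.
            cost = max(largest, merged) + xi * (score[u] - lo) / span
            if best_cost is None or cost < best_cost:
                best, best_cost = u, cost

        # Join the chosen node and merge its neighboring components.
        roots = {find(v) for v in adj[best] if v in parent}
        parent[best] = best
        size[best] = 1 + sum(size[r] for r in roots)
        for r in roots:
            parent[r] = best
        largest = max(largest, size[best])
        remaining.remove(best)

    return remaining
```

On a five-node path with degree as the score and two candidates requested, the sketch leaves nodes 1 and 3 unadded: joining either of them earlier would have merged components, which is the behavior the reverse generation strategy is meant to defer.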

3.3 Selecting seed nodes

Through the first two stages, the search space has been greatly reduced. Because the nodes in the candidate node set produced by the second stage are relatively dispersed, part of the seed nodes can be selected with a heuristic algorithm to improve efficiency. At this stage, a heuristic method is combined with a greedy method to select the seed influence nodes: either a heuristic combined with an improved classical greedy algorithm, or a heuristic combined with the classical greedy algorithm. Seed influence nodes are preferably selected using a heuristic algorithm combined with an improved classical greedy algorithm, as follows:

The whole seed node selection is divided into two steps. Step 1): select k_1 nodes from the candidate node set with the degree discount algorithm. Step 2): select k_2 nodes with the improved submodular CELF algorithm. Let k_1 and k_2 satisfy the following conditions:

k_2＝k-k_1 (13)

where k is the number of seed nodes that ultimately need to be selected and c is the total number of communities in the network that meet the size condition. μ is an adjustable parameter that balances the greedy and heuristic algorithms; for convenience, the invention sets μ to 0.5. In the heuristic selection process, the infection probability p used in the degree discount is set larger than the propagation threshold of the network. The generalized degree discount of each node is obtained by formula (14), the results are ranked from high to low, and the top k_1 nodes are selected.

where d_v represents the degree of node v, t_v represents the number of infected neighbors of node v, and t_w represents the number of infected neighbors of w, a susceptible neighbor of node v. In the greedy CELF selection stage, the position of nodes and the structural similarity between nodes are further considered. CELF selects nodes iteratively; if the node u chosen in a round of the improved CELF selection is similar to any node selected in an earlier round or to any node selected in the heuristic step, node u is not selected and is removed from the candidate node set. The similarity between node u and node v is designed as follows:

sim_loc＝1-abs(ks(u)-ks(v)) (16)

where sim represents the similarity between node u and node v, N(u) represents the set of neighbors of node u, N(v) represents the set of neighbors of node v, and abs(·) represents the absolute value. ks(u) and ks(v) are the normalized k-shell indices of node u and node v. The first term of the formula represents the structural similarity between nodes, and the second term represents their positional similarity. If the k-shell indices of two nodes are equal or close, the nodes should be located at close positions in the network and are considered similar in position. The position similarity sim_loc is calculated by equation (16); the larger sim_loc is, the more similar the node positions. ε is a positive parameter that balances structural similarity and positional similarity and is set to 0.1. The pseudocode of the seed node selection algorithm is shown in Algorithm 3.
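The position similarity of equation (16) can be illustrated with a short sketch. The k-shell decomposition below is the standard peeling procedure; normalizing by the largest k-shell value in the network is an assumption (the text only says "normalized"), and the function names are hypothetical. The structural term of the similarity is not reproduced here.

```python
def k_shell(adj):
    """Standard k-shell decomposition by iterative peeling."""
    degree = {u: len(vs) for u, vs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # The next shell index is the smallest residual degree left.
        k = max(k, min(degree[u] for u in remaining))
        peel = [u for u in remaining if degree[u] <= k]
        while peel:
            u = peel.pop()
            if u not in remaining:
                continue
            remaining.remove(u)
            shell[u] = k
            for v in adj[u]:
                if v in remaining:
                    degree[v] -= 1
                    if degree[v] <= k:
                        peel.append(v)
    return shell


def normalized_k_shell(adj):
    """k-shell values divided by the maximum k-shell (assumed normalization)."""
    shell = k_shell(adj)
    top = max(shell.values()) or 1
    return {u: s / top for u, s in shell.items()}


def position_similarity(ks, u, v):
    """sim_loc = 1 - abs(ks(u) - ks(v)), as in equation (16)."""
    return 1.0 - abs(ks[u] - ks[v])
```

For a triangle with one pendant node attached, the three triangle nodes share the same shell and have sim_loc = 1 with each other, while the pendant node has sim_loc = 0.5 with its neighbor.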

In Algorithm 3, lines 1-3 initialize the algorithm, line 5 selects k_1 nodes with the degree discount algorithm, and lines 12-19 select k_2 nodes with the improved CELF algorithm. sim_value represents a similarity threshold: two nodes are considered similar if their similarity is greater than this threshold.

4 Experimental results

To compare the proposed CBGN method with existing algorithms (Degree, K-shell, NC+, PageRank, ClusterRank and BGN), simulation experiments covering robustness and the propagation scale under the SIR model were performed on 6 real networks: Inf-USAir, CEnew, Power, Ca-GrQc, Hamster and Router.

4.1 Data set

The present patent uses 6 real network data sets of different types and sizes, the statistical properties of which are shown in table 1.

The degree distribution and community division of each experimental network are shown in FIG. 5: (a) Inf-USAir, (b) CEnew, (c) Power, (d) Hamster, (e) Ca-GrQc, and (f) Router. The abscissa represents node degree and the ordinate represents the frequency with which that degree occurs. The inset shows the community visualization of the network.

1) Inf-USAir is a US air transportation network; nodes represent airports and edges represent direct flight routes between two airports. 2) CEnew is a biological network, an edge list of the C. elegans metabolic network. 3) Power is an undirected, unweighted network representing the topology of a state power grid in the United States; each node represents a utility and each edge represents a relationship between utilities. 4) Hamster describes friendships between users of the website "www.hamsterster.com". 5) Ca-GrQc is a scientific collaboration network covering collaborations between authors of papers submitted to the General Relativity and Quantum Cosmology category. 6) Router is a symmetric snapshot of the Internet structure at the autonomous system level.

TABLE 1 Statistical characteristics of the networks

Network	|V|	|E|	<k>	k_max	<d>	C_num	β_min
Inf-USAir	332	2126	12.807	139	2.738	7	0.0231
CEnew	453	2025	8.94	237	2.664	9	0.0256
Power	685	1282	5.743	12	12.422	17	0.2778
Hamster	2426	16631	13.711	273	3.67	168	0.0241
Ca-GrQc	4158	13422	6.456	81	6.049	40	0.0589
Router	5022	6258	2.492	106	6.449	55	0.0786

In table 1, |V| represents the total number of nodes in the network and |E| represents the number of edges. <k> = 2|E|/|V| represents the average degree of the network. k_max represents the maximum degree of the network. <d> represents the average shortest path length of the network. C_num represents the number of communities in the network. β_min is the propagation threshold of the network, calculated by the formula <k>/(<k²>-<k>).
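The propagation threshold formula can be evaluated directly from a degree sequence. The sketch below is a straightforward reading of <k>/(<k²>-<k>), where <k²> is the mean of the squared degrees; the function name propagation_threshold is hypothetical.

```python
def propagation_threshold(degrees):
    """beta_min = <k> / (<k^2> - <k>) for a list of node degrees."""
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(d * d for d in degrees) / n
    return mean_k / (mean_k2 - mean_k)
```

For a star with four leaves (degrees 4, 1, 1, 1, 1), <k> = 1.6 and <k²> = 4.0, so the threshold is 1.6/2.4, about 0.667. The formula is undefined when <k²> equals <k>, for example when every node has degree 1.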

4.2 Performance index

(1) Robustness value

Robustness can be used to evaluate the performance of a ranking algorithm. Given a network, one node is deleted at each time step and the size of the largest connected component of the remaining network is calculated; equivalently, in the reverse process the largest connected component is recorded each time a node is added. These largest connected component sizes are summed and normalized by the network size N to obtain the robustness value. The robustness value is calculated by equation (2); the smaller the value, the better the ranking produced by the algorithm.
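The node-removal procedure can be sketched as below. The normalization is an assumption, since equation (2) is not reproduced in this part of the text: each largest connected component size is divided by N and the sum over all N removal steps is divided by N again. The function names are hypothetical, and the component search is recomputed from scratch at each step for clarity rather than speed.

```python
def largest_component(adj, alive):
    """Size of the largest connected component among the `alive` nodes."""
    best, seen = 0, set()
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        stack, comp_size = [start], 0
        while stack:
            u = stack.pop()
            comp_size += 1
            for v in adj[u]:
                if v in alive and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, comp_size)
    return best


def robustness(adj, removal_order):
    """Assumed form: R = (1/N) * sum over removals of (largest component / N)."""
    n = len(adj)
    alive = set(adj)
    total = 0.0
    for u in removal_order:
        alive.discard(u)
        total += largest_component(adj, alive) / n
    return total / n
```

On a three-node path, removing the middle node first gives R = 2/9, while removing an end node first gives R = 1/3, so the ranking that places the middle node first is judged better.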

(2) Cumulative Distribution Function (CDF)

The cumulative distribution function completely describes the probability distribution of a random variable X and is defined as follows for all real numbers x:

F_X(x)＝P(X≤x), for -∞<x<+∞ (17)

The CDF can be used to determine the probability that a random observation taken from the population is less than or equal to a particular value. The invention uses the CDF curve specifically to measure the ability of a ranking algorithm to distinguish node importance.

(3) SIR model

The SIR model is a common spreading model for infectious diseases. Its basic assumption is that the nodes in the network fall into three classes: a) susceptible nodes, which are not infected but have no immunity; b) infected nodes, which have been infected and can infect each neighboring susceptible node with probability β at each time step; c) recovered nodes: at each time step, each infected node recovers with probability γ and thereafter neither infects nor can be infected. SIR models are often used to measure the influence of a node. The invention uses the SIR model to measure the final infection scale of the selected seed nodes. Excellent spreaders rapidly reach a large infection scale. The infection scale F(t) at time t and the final infection scale F(t_c), reached when the infection process becomes stable, can be expressed as

where n_I(t) denotes the number of infected nodes at time t, n_R(t) denotes the number of recovered nodes at time t, and n is the number of nodes in G.
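A single SIR run can be sketched as follows. The sketch assumes a synchronous update (all infections and recoveries of a time step are decided from the state at the start of that step) and takes the infection scale as F(t) = (n_I(t) + n_R(t))/n, which is inferred from the symbol definitions above rather than copied from the formula; the function name sir_spread and its parameters are hypothetical. In practice the curve would be averaged over many independent runs.

```python
import random


def sir_spread(adj, seeds, beta, gamma=1.0, rng=None, max_steps=1000):
    """Return the list F(0), F(1), ... for one SIR run started from `seeds`."""
    rng = rng or random.Random()
    n = len(adj)
    infected = set(seeds)
    recovered = set()
    curve = [len(infected) / n]
    for _ in range(max_steps):
        if not infected:
            break  # steady state: nobody left to spread the infection
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                susceptible = v not in infected and v not in recovered
                if susceptible and rng.random() < beta:
                    newly_infected.add(v)
        # Each currently infected node recovers with probability gamma.
        healed = {u for u in infected if rng.random() < gamma}
        infected = (infected | newly_infected) - healed
        recovered |= healed
        curve.append((len(infected) + len(recovered)) / n)
    return curve
```

The last element of the returned list plays the role of the final infection scale F(t_c) for that run.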

(4) Average shortest path length

To ensure broader coverage, the selected seed nodes are expected to be spread across various parts of the network. Generally, the more dispersed, i.e., evenly distributed, the selected nodes are in the network, the smaller the overlap of their influence ranges and the larger the infection range that can be expected. The average shortest path length can be used to measure the degree of node dispersion, and the average shortest path length L_s between seed nodes can be calculated by equation (19).

where S represents the selected set of seed nodes and d_u,v represents the shortest path length from node u to node v.
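The dispersion measure can be sketched with breadth-first search. The sketch assumes an unweighted, undirected, connected network and takes L_s as the mean shortest path length over all pairs of distinct seed nodes; the exact normalization of equation (19) is not reproduced in the text, so this averaging is an assumption, and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def seed_dispersion(adj, seeds):
    """Mean shortest path length over all pairs of distinct seeds."""
    seeds = list(seeds)
    pairs = list(combinations(seeds, 2))
    if not pairs:
        return 0.0
    dist_from = {u: bfs_distances(adj, u) for u in seeds}
    # Raises KeyError if two seeds are disconnected (connectivity is assumed).
    return sum(dist_from[u][v] for u, v in pairs) / len(pairs)
```

On a five-node path, the seed set {0, 2, 4} has pairwise distances 2, 4 and 2, giving L_s = 8/3, whereas two adjacent seeds give L_s = 1.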

4.3 Baseline Algorithm

Six advanced algorithms are used as baselines and compared with the CBGN method in the robustness experiment and the propagation scale experiment. They are briefly introduced below.

Degree: the algorithm selects the nodes with the largest degree as seed nodes, and is a simple, intuitive and common standard algorithm.

K-shell: the K-shell value of each node can be obtained through K-shell decomposition, and the K-shell method considers the position relation of the node in the network.

Neighborhood Coreness (NC): this method is a further improvement on the K-shell method. The neighborhood coreness C_nc(v) and the extended neighborhood coreness C_nc+(v) of each node are calculated as follows, where ks(w) represents the k-shell value of node w.

where N(v) represents the set of neighbors of node v.
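A short sketch of the two quantities follows. It assumes the commonly used definitions, in which C_nc(v) is the sum of ks(w) over the neighbors w of v and C_nc+(v) is the sum of C_nc(w) over the same neighbors; the formula itself is not reproduced in the text, so this is an assumption. The k-shell values are passed in as a precomputed dictionary, and the function name neighborhood_coreness is hypothetical.

```python
def neighborhood_coreness(adj, ks):
    """Return (C_nc, C_nc_plus) dictionaries from k-shell values `ks`."""
    # C_nc(v): sum of the k-shell values of the neighbors of v.
    c_nc = {v: sum(ks[w] for w in adj[v]) for v in adj}
    # C_nc+(v): sum of the neighborhood coreness of the neighbors of v.
    c_nc_plus = {v: sum(c_nc[w] for w in adj[v]) for v in adj}
    return c_nc, c_nc_plus
```

For a triangle 0-1-2 with a pendant node 3 attached to node 0 (k-shell values 2, 2, 2, 1), node 0 obtains C_nc = 5 and C_nc+ = 10, which separates it from nodes 1 and 2 even though all three share the same k-shell value.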

PageRank: the PageRank algorithm was proposed to compute the importance of Internet web pages. The higher the PageRank value of a web page, the more important it is likely to be and the higher it may rank in Internet search results. A web page is important if many other web pages link to it, and if a web page with a high PageRank value links to another web page, the PageRank value of the linked page increases accordingly.
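The idea can be illustrated with a plain power iteration. This is a generic sketch rather than the configuration used in the experiments: the damping factor 0.85, the fixed iteration count, and the uniform redistribution of rank from nodes without outgoing links are assumptions, and the function name pagerank is hypothetical.

```python
def pagerank(adj, damping=0.85, iterations=100):
    """Power-iteration PageRank; adj[u] lists the nodes that u links to."""
    n = len(adj)
    rank = {u: 1.0 / n for u in adj}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread uniformly.
        dangling = sum(rank[u] for u in adj if not adj[u])
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in adj}
        for u in adj:
            for v in adj[u]:
                # u passes an equal share of its rank to each page it links to.
                new_rank[v] += damping * rank[u] / len(adj[u])
        rank = new_rank
    return rank
```

In an undirected network each edge is listed in both directions, so a hub that many nodes point to accumulates more rank than its leaves.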

ClusterRank: the ClusterRank algorithm not only considers the influence of the nodes, but also considers the clustering coefficients of the nodes. The method considers local information of the network and lacks performance guarantee.

BGN: the algorithm considers the importance of the nodes from the viewpoint of network robustness and considers the global information of the network.

4.4 Robustness analysis

To verify the effectiveness of the improved reverse generation network algorithm, imp_BGN was compared with the six baseline algorithms in terms of robustness. A good ranking algorithm should have a smaller robustness value, i.e. a smaller area under the R curve. FIG. 6 visualizes the R curves generated during reverse network generation in the 6 networks: (a) Inf-USAir, (b) CEnew, (c) Power, (d) Hamster, (e) Ca-GrQc, and (f) Router. The robustness values R of the different methods in the 6 real networks are given in table 2. As can be seen from fig. 6 and table 2, imp_BGN has a smaller robustness value R than the 6 baseline methods in every network except Inf-USAir, where BGN is lower. The method can therefore select candidate nodes well, and these candidate nodes help maintain the stability and connectivity of the network. Because the reverse generation network is applied in the candidate node selection phase, where it runs in the sub-networks formed by communities, the robustness of the algorithm was also analyzed in the 2 largest communities of each network to further verify its effectiveness. The results are summarized in table 3, where C_1 and C_2 denote the two largest communities. The R curves of the two largest communities in the CEnew and Power networks are visualized in fig. 7, where the horizontal axis represents the seed node ratio and the vertical axis represents the largest connected component in the network. FIG. 7(a) is the largest community in CEnew, FIG. 7(b) is the second largest community in CEnew, FIG. 7(c) is the largest community in Power, and FIG. 7(d) is the second largest community in Power.

As can be seen from table 3 and fig. 7, the imp_BGN algorithm also performs best in the communities of most networks, and it outperforms the BGN algorithm by a small margin in the communities of the Inf-USAir network.

TABLE 2 Robustness values R for different methods

Network	Inf-USAir	CEnew	Power	Hamster	Ca-GrQc	Router
K-shell	0.1614	0.1873	0.4223	0.1815	0.2317	0.0285
ClusterRank	0.1181	0.1301	0.2115	0.1371	0.1051	0.0158
PageRank	0.1227	0.1229	0.2019	0.1421	0.1027	0.0135
NC+	0.1643	0.1623	0.3329	0.1692	0.2143	0.0202
BGN	0.0899	0.1171	0.0633	0.1045	0.0606	0.0076
Degree	0.1260	0.1200	0.2286	0.1384	0.1313	0.0121
Imp_BGN	0.0961	0.0790	0.0431	0.0872	0.0538	0.0063

TABLE 3 robustness values R of different algorithms in communities

In addition, using the degree centrality optimized through graph traversal to assist in building the reverse generation network is more effective than using degree centrality directly, because the centrality becomes finer-grained after optimization by the graph traversal framework. The ability to distinguish node importance can be measured in terms of resolution, which in turn can be measured by the cumulative distribution function (CDF). Fig. 8 shows the CDF curves of degree centrality and of the graph-traversal-optimized centrality TRank_tree in 3 networks: (a) Inf-USAir, (b) CEnew, (c) Power. The x axis represents the rank of a node and the y axis represents the proportion of each rank. The smaller the angle between the CDF curve and the x axis, the better the algorithm. It can be seen that TRank_tree distinguishes the importance of nodes better, which further verifies the validity of the imp_BGN algorithm.

4.5 Propagation scale analysis

To verify the ability of the proposed CBGN method to select influential nodes, the SIR model was chosen to measure the final infection scale F(t_c) of the seed nodes selected by the different algorithms. The infection probability β is chosen to be higher than the propagation threshold β_min of the network, and the infection rate is set to λ = β/γ. Because of the randomness of the model, the experimental results are averages over 1000 independent experiments. The number of selected seed nodes is set to 3% of the network size. The results are shown in FIG. 9, where the abscissa is the infection time t and the ordinate F(t) is the cumulative number of infected nodes at time t; over time, F(t) reaches a stable value F(t_c). Reaching a larger F(t_c) in less time indicates better algorithm performance.

In the SIR time step experiment of fig. 9, the proposed CBGN achieves the best propagation in all 6 network data sets compared with the 6 other algorithms: (a) Inf-USAir, (b) CEnew, (c) Power, (d) Hamster, (e) Ca-GrQc, and (f) Router. In the Inf-USAir network, the infection scale of the proposed CBGN method is significantly higher than that of the 6 baseline algorithms, whose infection scales are comparable to one another. In the Power network, CBGN exceeds ClusterRank, the best baseline there, by 0.84% in infection scale. In the CEnew, Hamster, Ca-GrQc and Router networks, the infection scale of CBGN is 0.59%, 1.05%, 0.71% and 0.14% higher, respectively, than that of BGN, the best baseline. In these 4 networks the BGN algorithm shows excellent capability but is still inferior to CBGN. In addition, because different infection probabilities affect the propagation scale in the SIR model, experiments were carried out at different infection rates, with λ in the range [1.0, 2.0]; the results are shown in fig. 10. As before, the results are averages over 1000 independent experiments. The x axis represents the infection rate λ and the y axis represents the stable final infection scale F(t_c) at that infection rate.

Across the different infection probabilities, the proposed CBGN method is superior to the 6 baseline algorithms in infection scale. In the CEnew network CBGN performs similarly to BGN; in the remaining networks it outperforms BGN. Moreover, although the improved reverse generation network algorithm imp_BGN does not always yield the minimum robustness value on Inf-USAir and Router when selecting candidate nodes, the seed nodes finally selected by the CBGN method still infect the most nodes. The constructed CBGN framework is therefore advanced and effective.

4.6 Average shortest path length analysis

Generally speaking, the more dispersed the nodes of the selected seed set are, i.e. the larger the average shortest path between them, the wider the spread that can be reached. The average shortest path length L_s between seed nodes is therefore commonly used as a quality index. L_s is not an absolute indicator, however, because node selection considers the propagation capability of nodes and not only their degree of dispersion.

TABLE 4 Average shortest path length

Network	Inf-USAir	CEnew	Power	Hamster	Ca-GrQc	Router
Degree	1.0	1.3077	12.3809	1.6929	3.936	3.6381
K-shell	1.0	1.3077	9.7762	1.9958	4.0117	3.1819
NC+	1.0	1.2527	8.6714	1.5871	3.9745	3.0253
PageRank	1.1333	1.3187	11.2143	1.7393	3.3853	3.6440
ClusterRank	1.0	1.3187	10.6524	1.6541	2.9274	3.0640
BGN	1.0	1.4615	8.5381	2.1008	3.7463	4.1390
CBGN	1.2	1.5275	12.2857	2.2618	3.9821	4.0840

Table 4 gives the average shortest path length between the seed nodes selected by the proposed CBGN method and by the 6 baseline algorithms. In half of the networks (Inf-USAir, CEnew and Hamster), the seed node set selected by the CBGN method is the most dispersed. The Degree, K-shell and BGN methods select the most dispersed nodes in the Power, Ca-GrQc and Router networks, respectively.

5 Conclusion

The invention provides a community-based reverse generation network framework, CBGN, to solve the influence maximization problem. First, the network is divided into natural communities using the Louvain algorithm, chosen with running time in mind and suited to the data sets used, which narrows the search range for influential nodes. Then each community is regarded as an induced subgraph of the original graph, and the degree centrality optimized by graph traversal is applied in each subgraph to assist in constructing the network in reverse: at each step the node that minimizes the cost function is added to the network, construction stops when the number of nodes not yet added equals the required number of candidate nodes, and all remaining nodes are placed in the candidate node set. The robustness experiments show that the improved reverse generation network algorithm obtains smaller robustness values both in whole networks and in individual communities, which verifies that imp_BGN is better able to select key nodes, or nodes that maintain network stability, as candidate nodes. Finally, the final seed nodes are selected from the candidate node set using degree discount and a greedy algorithm that considers the network structure and the positional relationship of nodes. The propagation scale and average shortest path length experiments show that the seed nodes selected by the CBGN method infect faster and on a larger scale, and that the seed node sets are dispersed on most networks.

The main contributions of the present invention are summarized as follows:

(1) A new reverse generation network method, imp_BGN, that minimizes a cost function is provided. From the new perspective of graph traversal, each node is evaluated by constructing a breadth-first search (BFS) tree rooted at that node; an influence score for each node is obtained from its BFS tree and used to assist in constructing the reverse generation network. Because the least important nodes are added first, the generated network need not be restored to the original network: the nodes that have not been added are placed directly into the candidate node set, which greatly reduces computation time.

(2) The CELF algorithm is improved. Because the influence ranges of selected nodes may overlap, a similarity index based on the common neighbors and positional relationship between nodes is designed and applied in the CELF seed node selection process.

(3) A community-based reverse generation network framework, CBGN, is constructed to select a group of influential nodes in a complex network. It takes the community structure of the network into account, uses communities to accelerate the algorithm, and combines the advantages of network connectivity, graph traversal, heuristic algorithms and greedy algorithms.

(4) The CBGN method is subjected to experimental evaluation of robustness, influence propagation scale, average shortest path length among nodes and the like, and experimental results on 6 real networks such as Inf-USAir and the like show that the algorithm is more competitive compared with the existing advanced method.

Furthermore, the influential node identification problem still poses many challenges from different perspectives, for example how to efficiently mine influential nodes in large-scale networks, how influential nodes change with topology in time-varying networks, and how to better combine information between layers in a multi-layer network. Future work will extend to weighted networks, time-varying networks, multi-layer networks and heterogeneous networks.

While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A complex network influence node identification method based on a reverse generation network is characterized by comprising the following steps:

s1, community division: carrying out community division on the network by using a Louvain algorithm;

s2, generating a candidate node set: node information is collected by using graph traversal so as to assist in constructing an improved reverse generation network, and then a part of nodes are selected to be added into a candidate node set;

2. The method for identifying the complex network influence node based on the backward generation network according to claim 1, wherein the community size must at least satisfy the following condition:

C_size＝size(G)*η (9)

wherein C_size represents the community size;

size(G) represents the number of nodes of network G;

3. The complex network impact node identification method based on the backward generation network of claim 1, wherein the S2 comprises:

AUC_u represents the AUC value of node u;

AUC_max represents the maximum AUC value among all nodes;

AUC_min represents the smallest AUC value among all nodes;

ξ is a sufficiently small positive parameter;

4. The complex network impact node identification method based on the backward generation network of claim 1, wherein the S2 further comprises:

β is an amplification parameter;

k is the number of seed nodes that ultimately need to be selected.

5. The complex network influence node identification method based on the backward generation network according to claim 1, wherein the graph traversal comprises:

6. The complex network impact node identification method based on the backward generation network of claim 1, wherein the S3 comprises:

the calculation formula of the degree discount algorithm is as follows:

where gdd_v represents the degree discount of node v;

d_v represents the degree of node v;

t_v represents the number of infected neighbors of node v;

p represents the probability of infection;

S3-2, selecting k_2 nodes through the improved submodular algorithm;

k_2＝k-k_1 (13)

k is the number of the seed nodes which need to be selected finally;

cand_num_i represents the size of the i-th community candidate node set;

the specific steps of selecting the k_2 nodes by the improved submodular algorithm are as follows: if the node u selected in a round of the improved submodular algorithm is similar to a previously selected node or to a node selected in the heuristic process, the node is not selected and is removed from the candidate node set.

7. The complex network influence node identification method based on the backward generation network of claim 6, wherein whether two nodes are similar is determined by the following formula, and if the following formula is satisfied, node u is similar to node v;

where sim represents the similarity of node u and node v;

N(u) represents the set of neighbor nodes of node u;

N(v) represents the set of neighbor nodes of node v;

N(u)∩N(v) represents the set of neighbors shared by N(u) and N(v);

|·| represents the number of elements in a set;

ε is a parameter approaching 0;

abs (·) represents the absolute value;

ks (u) is the normalized k-shell index for node u;

ks (v) is a normalized k-shell index for node v.

8. The method for identifying the complex network influence node based on the backward generation network according to claim 1, further comprising evaluating the method by using the following performance indexes: robustness value, cumulative distribution function, SIR model, and average shortest path length.