CN115049002A - Complex network influence node identification method based on reverse generation network - Google Patents

Info

Publication number
CN115049002A
Authority
CN
China
Prior art keywords
node
network
nodes
algorithm
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210681512.3A
Other languages
Chinese (zh)
Inventor
刘小洋
叶舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology
Priority to CN202210681512.3A
Publication of CN115049002A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Economics (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • General Health & Medical Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a complex network influence node identification method based on a reverse generation network, comprising the following steps: S1, community division: dividing the network into communities using the Louvain algorithm; S2, generating a candidate node set: collecting node information by graph traversal to assist in constructing an improved reverse generation network, and then selecting a part of the nodes to add to the candidate node set; S3, selecting seed nodes: finding the final k influential nodes in the candidate node set as seed nodes. The method can select key nodes in the network, or nodes that maintain the stability of the network, as candidate nodes; the seed nodes it selects spread infection faster and on a larger scale, and on most networks the seed node set is well dispersed.

Description

Complex network influence node identification method based on reverse generation network
Technical Field
The invention relates to the technical field of influence node identification, in particular to a complex network influence node identification method based on a reverse generation network.
Background
The flourishing of network science has led to a new trend of identifying sets of influential nodes in complex networks; many real-life fields, including biology, physics, sociology, and engineering, can be represented by complex networks. Key nodes in a network can be used to maintain the stability of the network topology and determine the efficiency with which information such as rumors is transmitted in the network, and they directly determine whether the network can operate properly. Such nodes are often referred to as influential nodes. Identifying a group of influential nodes in a network is the classical influence maximization problem. Strengthening control over influential nodes provides a new opportunity for accelerating information dissemination and is of great significance for viral marketing, identification of drug targets and essential proteins, rumor control, and so on. For example, in word-of-mouth marketing, the most influential users are selected for promotion so that the maximum propagation effect is achieved with the lowest budget; in a social network, detecting the sources of public opinion propagation makes it possible to control that propagation. With the advent of big data and the 5G era, the types and scale of networks are increasing rapidly, which poses new challenges for existing methods of measuring node influence.
Influence node identification methods based on community discovery have attracted extensive attention from network science researchers in recent years. However, existing methods merely exploit the community structure, and some three-stage methods have shortcomings in selecting the candidate node set and the seed nodes. First, existing methods such as random walk, genetic algorithms, or other heuristic algorithms need to traverse the entire network when selecting the candidate node set, and the selected candidate nodes may cluster together. If seed nodes are selected from such a candidate node set, their influence ranges may overlap. The generation of the candidate node set is therefore particularly important in the whole seed node selection process. Second, in the seed node identification stage, a large number of algorithms use a greedy algorithm for accurate selection; although selecting from the generated candidate node set greatly improves efficiency, the selection is still time-consuming.
Disclosure of Invention
The invention aims to at least solve the technical problems in the prior art, and particularly creatively provides a complex network influence node identification method based on a reverse generation network.
In order to achieve the above object, the present invention provides a complex network influence node identification method based on a reverse generation network, including the following steps:
S1, community division: dividing the network into communities using the Louvain algorithm, thereby reducing the search space of the seed nodes;
S2, generating a candidate node set: collecting node information by graph traversal to assist in constructing an improved reverse generation network, so that high-importance nodes can be selected without restoring the original network during construction; then selecting a part of the nodes to add to the candidate node set;
S3, selecting seed nodes: finding the final k influential nodes in the candidate node set as seed nodes.
Further, the size of a community must at least satisfy the following condition:
C_size = size(G) * η (9)
where C_size represents the community size;
size(G) represents the number of nodes of network G;
η is an adjustable parameter that controls the size of the communities that satisfy the condition.
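The size condition (9) can be illustrated with a minimal sketch. It assumes that a community qualifies when it contains at least C_size = size(G) * η nodes; the function name `filter_communities` and the list-of-sets representation of communities are illustrative assumptions, not part of the claimed method.

```python
def filter_communities(communities, n_nodes, eta):
    """Keep the communities whose size is at least C_size = size(G) * eta.

    Assumption: formula (9) is read as a lower bound on the community size.
    """
    c_size = n_nodes * eta
    return [c for c in communities if len(c) >= c_size]


# Toy partition of a 9-node network into three communities.
communities = [{0, 1, 2, 3, 4}, {5, 6, 7}, {8}]
kept = filter_communities(communities, n_nodes=9, eta=0.3)
print(kept)
```

With η = 0.3 the threshold is 2.7 nodes, so the single-node community is discarded and only the two larger communities are searched for seed nodes.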
Further, the S2 includes:
S2-1, calculating the cost of adding each remaining node, and selecting the node that minimizes the cost function to construct the network, the cost function being:
Figure BDA0003696427860000021
where cost(u, n+1) denotes the cost function of node u at the (n+1)-th time step;
Figure BDA0003696427860000022
denotes the size of the maximum connected component after node u is added to network G at the (n+1)-th time step, i.e.
Figure BDA0003696427860000023
AUC_u represents the AUC value of node u;
AUC_max represents the maximum AUC value among all nodes;
AUC_min represents the minimum AUC value among all nodes;
ξ is a sufficiently small positive parameter;
S2-2, after the node is added, updating the size of each connected component in the network and recording the size of the maximum connected component;
S2-3, repeating steps S2-1 to S2-2 until the number of remaining nodes equals the required number of candidate nodes, at which point construction of the network stops.
Generating the candidate node set with the improved reverse generation network has the following advantages. First, the generated candidate nodes are more dispersed, few of them cluster together, and all of them are key nodes in the network. Second, the influence ranges of the selected candidate nodes overlap little, so a heuristic method and a greedy method can be balanced when selecting the seed nodes in the third stage. Third, these candidate nodes underpin the robustness of the network: removing them easily causes the network to collapse. Fourth, the whole network does not need to be traversed when generating the candidate nodes; construction can stop as soon as the number of nodes not yet added equals the required size of the candidate node set.
Further, the S2 further includes:
In order to further narrow the search range while ensuring that candidate nodes of adequate quality are available, the size of the candidate node set is set within the independent network formed by each community, according to the formula:
Figure BDA0003696427860000024
where cand_num_i represents the size of the candidate node set of the i-th community;
(C_size_i - C_size_min)/(C_size_max - C_size_min) represents the proportion of the i-th community among all the selected communities;
β is an amplification parameter;
k is the number of seed nodes that ultimately need to be selected.
Further, the graph traversal comprises:
degree centrality is selected as the initial centrality score in graph traversal and is refined through the graph traversal framework, producing a finer-grained AUC score that measures the influence of each node.
Further, the S3 includes:
S3-1, selecting k_1 nodes from the candidate node set through a degree discount algorithm;
Figure BDA0003696427860000031
the calculation formula of the degree discount algorithm is:
Figure BDA0003696427860000032
where gdd_v represents the degree discount of node v;
d_v represents the degree of node v;
t_v represents the number of infected neighbors of node v;
t_w represents the number of infected neighbors of a susceptible neighbor node w of node v;
p represents the infection probability;
S3-2, selecting k_2 nodes through an improved submodular algorithm;
k_2 = k - k_1 (13)
where k is the number of seed nodes that ultimately need to be selected;
μ is an adjustable parameter that balances the greedy algorithm and the heuristic algorithm;
c represents the total number of communities in the network that reach the required scale;
cand_num_i represents the size of the candidate node set of the i-th community;
selecting k_2 nodes with the improved submodular algorithm specifically comprises: in each round of selection, if the chosen node u is similar to a previously selected node or to a node selected in the heuristic stage, node u is not selected and is removed from the candidate node set.
Compared with the original submodular algorithm, the improved submodular algorithm additionally considers the position information of the nodes and the structural similarity between nodes.
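The heuristic stage S3-1 can be illustrated with a minimal sketch. Because the gdd_v formula above is only available as an image, the sketch swaps in the classic DegreeDiscountIC rule, dd_v = d_v - 2*t_v - (d_v - t_v)*t_v*p, which uses only d_v, t_v, and p and ignores the t_w term; it is therefore an approximation under stated assumptions, not the formula of the invention. The function name, the adjacency-dictionary input, and the tie-breaking rule are hypothetical.

```python
def degree_discount(adj, candidates, k1, p):
    """Pick k1 nodes from `candidates` with the classic DegreeDiscountIC rule.

    Assumption: dd_v = d_v - 2*t_v - (d_v - t_v)*t_v*p is used in place of
    the gdd_v formula of the patent, which is not recoverable from the text.
    """
    degree = {v: len(adj[v]) for v in adj}
    t = {v: 0 for v in adj}          # number of already selected neighbours
    dd = dict(degree)                # discounted degree, initially d_v
    pool = set(candidates)
    selected = []
    while pool and len(selected) < k1:
        # Highest discounted degree first; ties go to the smallest label.
        u = max(sorted(pool), key=lambda v: dd[v])
        selected.append(u)
        pool.discard(u)
        for v in adj[u]:
            if v not in selected:
                t[v] += 1
                dd[v] = degree[v] - 2 * t[v] - (degree[v] - t[v]) * t[v] * p
    return selected


# Two hubs (0 and 4) joined through node 3.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3, 5, 6], 5: [4], 6: [4]}
print(degree_discount(adj, adj.keys(), k1=2, p=0.1))
```

After hub 0 is chosen, its neighbors are discounted, so the second pick moves to the other hub rather than to a neighbor of the first seed.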
Further, whether two nodes are similar is judged by the following formula; if the formula is satisfied, node u is similar to node v:
Figure BDA0003696427860000041
where sim represents the similarity of node u and node v;
N(u) represents the set of neighbor nodes of node u;
N(v) represents the set of neighbor nodes of node v;
N(u) ∩ N(v) represents the neighbors shared by node u and node v;
|·| represents the number of elements in a set;
ε is a parameter approaching 0;
abs(·) represents the absolute value;
ks(u) is the normalized k-shell index of node u;
ks(v) is the normalized k-shell index of node v.
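The k-shell index used in the similarity test can be obtained with the standard k-shell decomposition, sketched below for an undirected network stored as an adjacency dictionary. The sketch returns the raw (unnormalized) index; how the index is normalized is not stated in the text, so no normalization is applied here. The function name `k_shell` is an illustrative assumption.

```python
def k_shell(adj):
    """Return the raw k-shell index of every node (standard peeling procedure)."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Nodes whose residual degree is at most k belong to shell k.
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1
            continue
        for v in peel:
            shell[v] = k
            remaining.discard(v)
            for w in adj[v]:
                if w in remaining:
                    degree[w] -= 1
    return shell


# Triangle 0-1-2 with a pendant node 3 attached to node 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
print(k_shell(adj))
```

The pendant node falls into shell 1 while the triangle forms shell 2, so two triangle nodes would have identical k-shell indexes in the similarity test.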
Further, the method is evaluated by the following performance indexes: robustness value, cumulative distribution function, SIR model, and average shortest path length. Whether the method is reasonable or not can be comprehensively judged through the four performance indexes.
In summary, due to the adoption of the technical scheme, the invention has the following advantages:
(1) the influence scores are used to assist in constructing the reverse generation network, so that the generated network does not need to be restored to the original network; nodes that have not been added to the network are added directly to the candidate node set, which greatly reduces the calculation time.
(2) Considering that the influence ranges of the selected node sets overlap due to the common neighbor and position relations among the nodes, an improved CELF algorithm is provided.
(3) The community structure of the network is considered, the algorithm is accelerated by utilizing the community, and the advantages of the connectivity of the network, graph traversal, a heuristic algorithm and a greedy algorithm are combined.
Therefore, the method can select key nodes in the network or nodes for maintaining the stability of the network as candidate nodes, the seed nodes selected by the method have higher infection speed and larger infection scale, and the seed node sets are dispersed on most networks.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a graph of the robustness value R of the present invention.
FIG. 2 is a schematic diagram of the graph traversal centrality of the present invention.
FIG. 3 is an overall schematic diagram of the CBGN framework of the present invention.
FIG. 4 is a schematic diagram of a toy network of the present invention.
FIG. 5 shows the degree distributions of the 6 experimental networks of the present invention.
FIG. 6 shows the R curves of the 6 real networks of the present invention under different methods.
FIG. 7 shows the R curves of the communities of 2 real networks of the present invention.
FIG. 8 shows the CDF distributions of Degrid and TARank_grid of the present invention on 3 networks.
FIG. 9 is a schematic diagram of the propagation effect of the seed nodes selected by different algorithms under the SIR model.
FIG. 10 is a schematic diagram of the propagation of the present invention at different infection rates.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
The invention provides a complex network influence node identification method based on a reverse generation network, which comprises the following specific embodiments:
S1, dividing the public opinion network into communities: dividing the public opinion network into communities using the Louvain algorithm, thereby reducing the search space of the public opinion seed nodes;
S2, generating a public opinion candidate node set: collecting public opinion node information by graph traversal to assist in constructing an improved reverse generation network, so that high-importance nodes can be selected without restoring the original network during construction; then selecting a part of the nodes to add to the public opinion candidate node set;
S3, selecting public opinion seed nodes: finding the final k influential nodes in the public opinion candidate node set as public opinion seed nodes.
The spread of public opinion can then be controlled quickly simply by moderating the speech of the users corresponding to the selected public opinion seed nodes.
1. Introduction to related content
1.1 Background
The flourishing of network science has led to a new trend of identifying sets of influential nodes in complex networks; many real-life fields, including biology, physics, sociology, and engineering, can be represented by complex networks. Key nodes in a network can be used to maintain the stability of the network topology and determine the efficiency with which information such as rumors is transmitted in the network, and they directly determine whether the network can operate properly. Such nodes are often referred to as influential nodes. Identifying a group of influential nodes in a network is the classical influence maximization problem. Strengthening control over influential nodes provides a new opportunity for accelerating information dissemination and is of great significance for viral marketing, identification of drug targets and essential proteins, rumor control, and so on. For example, in word-of-mouth marketing, the most influential users are selected for promotion so that the maximum propagation effect is achieved with the lowest budget; in a social network, detecting the sources of public opinion propagation makes it possible to control that propagation. With the advent of big data and the 5G era, the types and scale of networks are increasing rapidly, which poses new challenges for existing methods of measuring node influence.
The influence maximization problem was first studied by Domingos and Richardson; in 2003 Kempe et al. formulated it as a discrete optimization problem and proved it to be NP-hard. In order to obtain influential nodes efficiently, nodes have been measured from different angles and a large number of indexes have been proposed in succession, such as Degree, ClusterRank, K-shell, H-index, Neighborhood Coreness, PageRank, and DegreeDiscount. Selecting a group of nodes simply by high centrality causes their influence ranges to overlap, and selecting nodes directly by indexes such as K-shell and H-index suffers from low solution accuracy. The PageRank algorithm considers both the number and the quality of the nodes, but rank leakage can occur when dangling pages appear. ClusterRank considers both the direct neighbors of a node and its clustering coefficient, but it also lacks a performance guarantee because it only considers local network information. Bae et al. assumed that a node spreads more widely when more of its neighbors are located in the core of the network, and proposed neighborhood coreness centrality; this method balances the degree and the position of a node and effectively alleviates the degeneracy of the K-shell method. Many methods also select nodes heuristically by exploiting the network topology: Beni et al. proposed the IMT algorithm, which selects seed nodes from dense parts of the graph in order to reach more nodes within the shortest distance. Zhang et al. designed a constrained evolutionary algorithm (IICEA) based on a local-global influence index that considers local neighbor information and global community information, effectively addressing the budget-constrained influence problem. Mao et al. proposed a topological potential approach for predicting influential nodes in large-scale online social networks.
Although centrality measures based on the network topology can find influential nodes in a short time, such algorithms are often insufficiently stable and their solution quality is not high.
In recent decades, classical greedy algorithms have been proposed one after another, such as the greedy algorithm that sequentially selects the node with the largest marginal gain, and CELF and CELF++, which improve on it. The influence spread of these algorithms approximates the optimal solution within a factor of (1 - 1/e - ε), where ε is a very small parameter approaching 0 and e is the base of the natural logarithm; however, these algorithms require tens of thousands of Monte Carlo simulations, and the resulting long computation times make them difficult to scale to large networks. To address this, many network science researchers have proposed heuristic algorithms for specific fields, which sacrifice some accuracy in exchange for greatly reduced computational complexity. In recent years, a new index, the robustness value, has been widely used to measure the importance of nodes. Robustness stems from the well-known percolation theory: when a portion of the nodes in a network is removed, the network breaks down. The robustness value measures the connectivity of the network, and a smaller robustness value means better performance of the algorithm. Unlike traditional measures, the robustness value requires all nodes to be ranked, so it takes the importance of every node in the network into account. In addition, communities are an important structure in network science: nodes in the same community are likely to be associated more frequently than nodes in different communities. A large number of community-based influence maximization algorithms have been proposed, demonstrating the effectiveness of community division.
Based on the above discussion, the present patent proposes a new Community-based reverse generation network (CBGN) method for identifying a group of influential nodes in a complex network, which comprises three steps: (1) dividing communities; (2) generating a candidate set; (3) selecting seed nodes. First, the Louvain algorithm is used to divide the network into communities, reducing the search space of the seed nodes. Second, node information is collected by graph traversal and used to assist in constructing the reverse generation network, and a part of the nodes is selected to add to the candidate node set. Finally, the final top-k influential nodes are found among the candidate nodes using a heuristic algorithm and an improved classical greedy algorithm.
1.2 Research motivation
Influence node identification methods based on community discovery have attracted extensive attention from network science researchers in recent years. However, existing methods merely exploit the community structure, and some three-stage methods have shortcomings in selecting the candidate node set and the seed nodes. First, existing methods such as random walk, genetic algorithms, or other heuristic algorithms need to traverse the entire network when selecting the candidate node set, and the selected candidate nodes may cluster together. If seed nodes are selected from such a candidate node set, their influence ranges may overlap. The generation of the candidate node set is therefore particularly important in the whole seed node selection process. Second, in the seed node identification stage, a large number of algorithms use a greedy algorithm for accurate selection; although selecting from the generated candidate node set greatly improves efficiency, the selection is still time-consuming. It is therefore desirable that the candidate node set selected in the second stage can assist the third stage rather than merely narrow the search range.
2. Related concepts
Let G = (V, E) denote a complex network graph, where V = {v_1, v_2, ..., v_n} represents the set of nodes and E = {e_1, e_2, ..., e_m} represents the set of edges. Nodes represent individuals in a complex network and edges represent relationships between individuals in the network. The present patent considers only undirected, unweighted simple networks, and no self-loops are allowed.
Definition 1, influence node top-k: top-k influential nodes are defined as nodes that have a significant influence in a particular scenario, and the number is designated as k.
Definition 2, influence maximization problem: the influence maximization problem refers to finding a group of k influential nodes such that, when these k nodes are used as the source node set S and influence is propagated under a specified propagation model, the influence is maximized, which can be expressed as
Influ(S) = arg max_{S ⊆ V, |S| = k} σ(S) (1)
where Influ(S) represents the influence of the source node set S, σ(·) represents the information diffusion function, and σ(S) represents the expected number of nodes influenced after propagation under the specified model with S as the seed set.
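The quantity σ(S) in Definition 2 is usually estimated by simulation. The following is a minimal sketch that assumes the independent cascade model with a uniform activation probability p as the propagation model; this is an assumption made for illustration, since the definition leaves the model open and the evaluation of the invention uses the SIR model. The function names and the `runs` and `seed` parameters are hypothetical.

```python
import random


def simulate_ic(adj, seeds, p, rng):
    """Run one independent-cascade simulation and return the number of active nodes."""
    active = set(seeds)
    frontier = list(dict.fromkeys(seeds))
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj[u]:
                # Each newly active node gets one chance to activate a neighbour.
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)


def estimate_spread(adj, seeds, p, runs=1000, seed=0):
    """Monte Carlo estimate of sigma(S): average spread over `runs` simulations."""
    rng = random.Random(seed)
    return sum(simulate_ic(adj, seeds, p, rng) for _ in range(runs)) / runs


# Path network 0-1-2-3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(estimate_spread(adj, [0], p=1.0, runs=10))
```

A greedy algorithm would call such an estimator for every candidate seed set, which is why tens of thousands of simulations make the greedy approach expensive on large networks.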
Definition 3, robustness value: the robustness value can be used to evaluate the performance of the ranking algorithm. Given a network, one node is removed at each time step and the size of the largest connected component in the remaining network is calculated until the network is empty. The network robustness value R is defined as
Figure BDA0003696427860000071
where n is the number of nodes in G, σ_gcc(G) represents the size of the maximum connected component of the network when no node has been removed, and σ_gcc(G\{v_1, v_2, ..., v_k}) represents the size of the giant connected component of the remaining network after the nodes in the set K = {v_1, v_2, ..., v_k} have been sequentially removed from the network. The robustness value R can be regarded as the area under the R curve shown in FIG. 1, whose horizontal axis is k/n and whose vertical axis is σ_gcc(G\{v_1, v_2, ..., v_k})/σ_gcc(G); that is, the x-axis of FIG. 1 represents the proportion of removed nodes, and the y-axis represents the relative size of the maximum connected component of the remaining network after the nodes are removed.
Among all networks of n nodes, the most vulnerable is the star network and the most robust is the fully connected network. For the star network,
Figure BDA0003696427860000081
and for the complete network,
Figure BDA0003696427860000082
Thus, the range of robustness values is
Figure BDA0003696427860000083
A smaller robustness value means better performance of the ranking algorithm.
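The robustness value of a node ordering can be computed as in the following minimal sketch. It assumes that R is the average, over the n removal steps, of the ratio σ_gcc(G\{v_1, ..., v_k})/σ_gcc(G), which matches the description of R as the area under the curve with axes k/n and that ratio; the exact normalization of formula (2) is not recoverable from the text, so this form is an assumption. The function names are illustrative.

```python
from collections import deque


def largest_component_size(adj, removed):
    """Size of the largest connected component after ignoring `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best


def robustness(adj, order):
    """Assumed form: R = (1/n) * sum_k gcc(G minus first k nodes) / gcc(G)."""
    n = len(adj)
    full = largest_component_size(adj, set())
    removed = set()
    total = 0.0
    for v in order:
        removed.add(v)
        total += largest_component_size(adj, removed) / full
    return total / n


star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(robustness(star, [0, 1, 2, 3]))  # hub removed first
print(robustness(star, [1, 2, 3, 0]))  # hub removed last
```

Removing the hub first shatters the star immediately and gives the smaller R, which is the sense in which a smaller robustness value indicates a better node ranking.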
In this section, the following three aspects are introduced: community detection, reverse generation networks, and graph traversal.
2.1. Community detection
A community is generally considered to be a group of closely related nodes, for example users who are interested in similar content or share the same interests. Nodes inside a community are tightly clustered, while only sparse connections exist between different communities. Discovering the group structure in a social network is of great significance for understanding the influence of that structure on information propagation, identifying key nodes in a complex network, and so on. Given the characteristics of real-world networks, making reasonable use of the community structure can greatly reduce the complexity of an algorithm.
Many excellent community discovery algorithms such as FPMQA, BiLPA, CommunityGAN, and SEAL have been proposed in the last decade, but Newman and Louvain undoubtedly remain widely used community detection algorithms. In the present invention, the Louvain algorithm is selected to cluster the network; it is a modularity-based community discovery algorithm whose optimization goal is to maximize the modularity of the whole community network. The network modularity is defined as
Figure BDA0003696427860000084
where Q represents the network modularity;
k_i represents the sum of the weights of the edges connected to node i;
k_j represents the sum of the weights of the edges connected to node j;
A_ij represents the weight of the edge between nodes i and j;
m represents the sum of the weights of all edges;
Figure BDA0003696427860000085
where A represents the network adjacency matrix, which holds the weights of the edges between nodes; when the network is not a weighted graph, the weights of all edges are regarded as 1; k_i = Σ_j A_ij represents the sum of the weights of all edges connected to node i, and
Figure BDA0003696427860000086
represents the sum of the weights of all edges. The Louvain algorithm mainly comprises two stages, Modularity Optimization and Community Aggregation. In the first stage, each node is assigned to the community of one of its adjacent nodes so that the modularity keeps increasing; in the second stage, each community obtained in the first stage is aggregated into a single node, and the network is then reconstructed according to the community structure from the previous step. The first stage needs to calculate the modularity gain, whose calculation formula is:
Figure BDA0003696427860000087
The above equation can be simplified as:
Figure BDA0003696427860000091
where Σ_in represents the sum of the weights of all edges within community C, k_{i,in} represents the sum of the weights of the edges from node i to community C, and Σ_tot represents the sum of the weights of the edges incident to the nodes in community C. The Louvain algorithm obtains fairly natural communities of the network and is also a fairly fast algorithm, so it is used to divide the network into communities and thereby accelerate the proposed algorithm.
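The modularity that Louvain maximizes can be evaluated for a given partition as in the following minimal sketch. It uses the standard Newman modularity for an unweighted, undirected network written per community, Q = Σ_C [L_C/m - (d_C/(2m))^2], where L_C is the number of edges inside community C and d_C is the total degree of its nodes; this is stated here as the textbook form, since the formula in the text is only available as an image. The function name and input format are illustrative assumptions, and the network is assumed to contain at least one edge.

```python
def modularity(adj, communities):
    """Newman modularity of a partition of an unweighted, undirected graph."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    q = 0.0
    for community in communities:
        members = set(community)
        # Every internal edge is seen from both endpoints, hence the division by 2.
        inside = sum(1 for u in members for w in adj[u] if w in members) / 2
        total_degree = sum(len(adj[u]) for u in members)
        q += inside / m - (total_degree / (2 * m)) ** 2
    return q


# Two triangles joined by the single edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(modularity(adj, [{0, 1, 2}, {3, 4, 5}]))
print(modularity(adj, [{0, 1, 2, 3, 4, 5}]))
```

Splitting the network into its two triangles yields a positive modularity, whereas placing all nodes in one community yields zero, which is why the first stage of Louvain keeps moving nodes while the gain is positive.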
2.2. Reverse generation network
Lin et al. proposed Backup Generating Networks (BGN) to identify influential nodes in a complex network; the method obtains a ranking of node importance by minimizing the robustness value R. BGN aims to find a sequence of nodes whose removal makes the network collapse as quickly as possible, i.e., makes the largest connected component shrink quickly. The core of BGN is a reverse process: instead of deleting nodes from the network, it adds nodes one by one to an empty network until the original network is rebuilt, under the requirement that the size of the giant connected component grows as slowly as possible. The node ranking is then the reverse of the order of addition, i.e., nodes added later are more important for maintaining network connectivity.
The reverse process starts from the empty network G_0(V_0, E_0), where V_0 = ∅ and E_0 = ∅.
At the (n+1)-th time step (the (n+1)-th node), one of the remaining nodes is added to the current network G_n(V_n, E_n), forming a new network with n+1 nodes, G_{n+1}(V_{n+1}, E_{n+1}), where V_n is the node set of G_n and E_n is the edge set of G_n. This process is repeated until the original network is restored. Note that every intermediate network G_n (n = 0, 1, 2, ..., N) in this process is an induced subgraph of network G, where G_n is a network with n nodes. According to the BGN strategy, the node selected at each time step should make the largest connected component of network G_{n+1} as small as possible.
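The reverse process described above can be sketched in a few lines of Python. This is a minimal greedy illustration under stated assumptions, not the reference implementation of BGN: components are tracked with a union-find structure, ties are broken by the order of the input node list, and the name reverse_generation_order is hypothetical.

```python
def reverse_generation_order(nodes, edges):
    """Add nodes one by one so that the largest connected component grows as
    slowly as possible. Returns (order of addition, largest size per step)."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    parent, size = {}, {}            # union-find over the nodes added so far

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    remaining, order, sizes, largest = list(nodes), [], [], 0
    while remaining:
        best, best_size = None, None
        for u in remaining:
            # Components that would be merged if u joined the network now.
            roots = {find(w) for w in adj[u] if w in parent}
            new_largest = max(largest, 1 + sum(size[r] for r in roots))
            if best_size is None or new_largest < best_size:
                best, best_size = u, new_largest   # first minimum wins ties
        roots = {find(w) for w in adj[best] if w in parent}
        parent[best], size[best] = best, 1
        for r in roots:              # merge neighbouring components into best
            parent[r] = best
            size[best] += size[r]
        largest = best_size
        remaining.remove(best)
        order.append(best)
        sizes.append(largest)
    return order, sizes

# Star graph: hub 0 with four leaves. The hub is added last, so it ranks first.
order, sizes = reverse_generation_order([1, 2, 3, 4, 0],
                                        [(0, 1), (0, 2), (0, 3), (0, 4)])
```

Reading the returned order backwards gives the importance ranking. Because several nodes often yield the same largest-component size, the result depends on the tie-breaking rule, which is an arbitrary choice in this sketch.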
2.3. Graph traversal centrality
The graph traversal framework can incorporate different types of centrality and improve their performance. It approaches the identification of influential nodes from the perspective of graph traversal, which differs from existing methods. Any existing centrality, such as degree centrality or the H-index, can be turned into an importance score through the framework. Graph traversal centrality is illustrated in FIG. 2. First, for each node in the network, a breadth-first search (BFS) tree is constructed by traversing the graph layer by layer, with the target node as the root node, as shown in FIG. 2(A). Each node has an initial centrality score that can be obtained from any centrality measure. For an influential node, the BFS tree typically has more nodes in its top layers, because the top-layer nodes are local neighbors of the root node. Second, a cumulative score vector vec = [l_1, l_2, ..., l_h] of length h is constructed from each BFS tree, where h is the maximum number of layers (the root node is in layer 1) and l_i is the sum of the scores of all nodes in layers not greater than i, as shown in FIG. 2(B). Third, the first m values of the score vector vec are plotted as a curve, and the influence of the node is quantified by the area under this curve, recorded as the AUC value, as shown in FIG. 2(C). Here m (1 ≤ m ≤ h) is a user-specified parameter, and m = 2 in this figure; the x-axis represents the layer number of the BFS tree and the y-axis represents the cumulative score up to each layer.
In a given network graph G, assume that the initial centrality scores are CS = {c_1, c_2, ..., c_n}, where c_i represents some centrality (e.g., degree, H-index) of node i, n represents the number of nodes, and c_n represents the centrality score of the n-th node. The cumulative score at layer m of the BFS tree generated by each node can then be calculated as follows:
Figure BDA0003696427860000101
where cumscore(m) represents the cumulative score at layer m of the BFS tree generated by a node; v_j represents node j, T(q) represents all nodes in the q-th layer of the BFS tree generated by that node, and c_j represents the chosen centrality of node j. The AUC value is obtained by plotting the first m entries of the generated score vector and calculating the area under the curve, which can be expressed as follows:
Figure BDA0003696427860000102
AUC(m) is the area under the curve drawn from the first m entries of the generated score vector. It is the AUC value of a node and measures the importance of that node.
The AUC value obtained through graph traversal improves on centralities such as degree centrality, making existing centrality ranking methods finer grained.
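The construction of the cumulative score vector can be sketched as follows. The BFS layering and the cumulative sums follow the description above; the area computation uses the trapezoidal rule, which is an assumption made for illustration because the exact AUC expression is not reproduced in this text. The names cumulative_scores and auc are hypothetical.

```python
from collections import deque

def cumulative_scores(adj, root, score):
    """Return [l_1, ..., l_h], where l_i sums the scores of all nodes in the
    first i layers of the BFS tree rooted at root (the root is layer 1)."""
    depth = {root: 1}
    layer_sum = [score[root]]
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in depth:
                depth[w] = depth[u] + 1
                if depth[w] > len(layer_sum):
                    layer_sum.append(0.0)      # first node of a new layer
                layer_sum[depth[w] - 1] += score[w]
                queue.append(w)
    vec, running = [], 0.0
    for s in layer_sum:
        running += s
        vec.append(running)
    return vec

def auc(vec, m):
    """Area under the first m points of vec (trapezoidal rule, assumed)."""
    pts = vec[:m]
    return sum((a + b) / 2 for a, b in zip(pts, pts[1:]))

# Path 0 - 1 - 2 with degree as the initial centrality score.
adj = {0: [1], 1: [0, 2], 2: [1]}
degree = {u: len(adj[u]) for u in adj}
vec_end = cumulative_scores(adj, 0, degree)      # [1.0, 3.0, 4.0]
vec_mid = cumulative_scores(adj, 1, degree)      # [2.0, 4.0]
```

With m = 2 the middle node scores 3.0 and an end node 2.0, so the ranking agrees with degree here while also using information from the second layer. Under the trapezoidal assumption m = 1 gives zero area for every node, so m of at least 2 is needed for a meaningful comparison.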
3. Proposed framework
In this section, a new CBGN framework is proposed to achieve influence maximization in the network. The pseudocode is shown in Algorithm 1. As shown in FIG. 3, the framework consists of three parts: (1) Community partitioning. A community detection algorithm suited to the application data set is used at this stage, with its running time taken into account (Algorithm 1, line 3). (2) Candidate set generation. The concept of graph traversal is introduced into reverse network generation: nodes are added one by one to the empty network using a centrality measure further optimized by graph traversal, and the nodes that have not joined the network at the end are the candidate nodes (Algorithm 1, lines 4-10). (3) Seed node selection. A heuristic algorithm is balanced with a greedy algorithm to select nodes quickly and accurately (Algorithm 1, line 12). Each step is described in detail below.
3.1 Community partitioning
The Louvain algorithm is used to partition the real network data sets of the present invention. Compared with graph partitioning methods, hierarchical clustering methods, label propagation methods and the like, the Louvain algorithm needs no prior knowledge of the number of communities and can discover more natural communities, so the communities it finds are much closer to the inherent communities of a real network. It can also be applied to large-scale networks. After the network is partitioned, not every community is significant enough to contain a final seed node, so communities of small size are not passed to the candidate node selection stage. Considering that larger communities have more influence, each retained community must have at least the following size:
C_size=size(G)*η (9)
where size(G) represents the number of nodes of the network G, and η is an adjustable parameter that controls the minimum size of a qualifying community. The present invention sets η to 0.01.
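Equation (9) amounts to a simple size filter over the detected communities. The sketch below assumes the communities are given as a list of node sets that together partition G; the function name filter_communities is illustrative.

```python
def filter_communities(communities, eta=0.01):
    """Keep the communities whose size is at least C_size = size(G) * eta."""
    n = sum(len(c) for c in communities)   # size(G), assuming a partition of G
    c_size = n * eta
    return [c for c in communities if len(c) >= c_size]

# 100 nodes in three communities; with eta = 0.05 the threshold is 5 nodes.
communities = [set(range(90)), set(range(90, 98)), {98, 99}]
kept = filter_communities(communities, eta=0.05)
```

Here the community with two nodes is dropped and the other two are passed on to candidate node selection.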
Figure BDA0003696427860000111
3.2 candidate node selection
At this stage, candidate nodes are generated in mutually independent communities. Reducing the number of candidate nodes that need to be evaluated makes the search for the most influential nodes more efficient. In this step, each community is treated as an induced subgraph, i.e., an independent network. In each such network, the reverse generation process is carried out by minimizing the robustness value. When node u joins the network, the largest connected component of the network at that moment is denoted G[u]. According to the reverse generation strategy, the joining node u should minimize the size of the largest connected component in the network, but two or more nodes may satisfy this condition at the same time, as with the green nodes in FIG. 4, which shows a toy network. FIG. 4(A) is at the 2nd time step of the reverse generation process: when one more node is to be added while keeping the largest connected component minimal, several nodes (nodes 1, 2, 3, 4 and 6) are selectable. FIG. 4(B) is at the 4th time step, where 3 nodes (nodes 1, 3 and 4) are selectable. The centrality optimized by graph traversal is therefore used to assist in constructing the reverse generation network, and the goal of minimizing the size of the largest connected component is translated into minimizing a cost function, defined as:
Figure BDA0003696427860000112
where cost(u, n+1) is the cost function of node u at the (n+1)-th time step, AUC_u is the AUC value of node u, AUC_max is the maximum AUC value over all nodes, AUC_min is the minimum AUC value over all nodes,
Figure BDA0003696427860000113
denotes the size of the largest connected component obtained by adding node u to network G at the (n+1)-th time step, i.e.
Figure BDA0003696427860000114
ξ is a positive parameter small enough to ensure
Figure BDA0003696427860000115
Degree centrality is selected as the initial centrality score in the graph traversal and is optimized through the graph traversal framework, which yields a finer-grained AUC score that measures node influence. At each time step the node that minimizes the cost function is added to the network, and construction stops when the number of nodes not yet added equals the required number of candidate nodes. The improved reverse generation network algorithm imp_BGN, which generates candidate nodes in each community, is shown in Algorithm 2. Since less important nodes are added first during reverse generation, the nodes that remain unused when construction stops are placed in the candidate node set. To further narrow the search range while ensuring candidate nodes of suitable quality, the size of the candidate node set is set separately in the independent network formed by each community, according to the following formula:
Figure BDA0003696427860000121
where cand_num_i is the size of the candidate node set of the i-th community; the term (C_size_i - C_size_min)/(C_size_max - C_size_min) expresses the relative size of the i-th community among all selected communities and lies in [0, 1]; β is an amplification parameter that controls the size of the candidate node set; and k is the number of seed nodes that finally need to be selected. Generating the candidate node set with the improved reverse generation network has several advantages. First, the generated candidate nodes are more dispersed, with few nodes clustered together, and all of them are key nodes in the network. Second, the selected candidate nodes have low overlap in their influence ranges, and the third stage can further balance heuristic and greedy methods to select seed nodes. Third, these candidate nodes underpin the robustness of the network, since removing them easily causes the network to collapse. Fourth, the whole network need not be traversed when generating candidate nodes: construction can stop as soon as the number of nodes not yet added equals the required size of the candidate node set.
In summary, it makes sense to generate candidate nodes using imp _ BGN.
Figure BDA0003696427860000122
In Algorithm 2, lines 1-2 initialize the algorithm, and lines 3-11 then build the reverse generation network to generate the set of candidate nodes. Each time a node is to be added to the network, the cost of adding each of the remaining nodes is calculated (lines 4-5), and the node that minimizes the cost function is selected to extend the network (line 7). After a node joins, the size of each connected component in the network is updated and the size of the largest connected component is recorded (line 8). This process is repeated until the number of remaining nodes equals the required number of candidate nodes.
3.3 selecting seed nodes
Through the first two stages, the search space has been greatly reduced. Since the nodes in the candidate set from the second stage are relatively dispersed, some seed nodes can be selected with a heuristic algorithm to improve efficiency. At this stage, a heuristic method is combined with a greedy method to select the seed nodes; the greedy part may be either the classical greedy algorithm or an improved one. Seed nodes are preferably selected using a heuristic algorithm combined with the improved greedy algorithm, as follows:
The seed node selection is divided into two steps. Step 1): select k_1 nodes from the candidate node set with the degree discount algorithm. Step 2): select k_2 nodes with the improved submodularity-based CELF algorithm. k_1 and k_2 satisfy the following conditions:
Figure BDA0003696427860000131
k_2 = k - k_1 (13)
where k is the number of seed nodes that finally need to be selected, and c is the total number of communities in the network that meet the size requirement. μ is an adjustable parameter that balances the greedy and heuristic algorithms; for convenience, the present invention sets μ to 0.5. In the heuristic selection, the infection probability p used in the degree discount is set larger than the propagation threshold of the network. The generalized degree discount of each node is obtained by formula (14), the results are ranked from high to low, and the top k_1 nodes are selected.
Figure BDA0003696427860000132
where d_v represents the degree of node v, t_v represents the number of infected neighbors of node v, and t_w represents the number of infected neighbors of a susceptible neighbor w of node v. In the greedy CELF selection stage, the position information of the nodes and the structural similarity between nodes are further considered. If a node u picked during the improved CELF selection is similar to any node selected earlier in this stage (the CELF algorithm selects nodes iteratively) or to any node selected in the heuristic step, then node u is not selected and is removed from the candidate node set. The similarity between node u and node v is designed as follows:
Figure BDA0003696427860000133
sim_loc=1-abs(ks(u)-ks(v)) (16)
where sim represents the similarity between node u and node v, N(u) represents the set of neighbors of node u, N(v) represents the set of neighbors of node v, and abs(·) represents the absolute value. ks(u) and ks(v) are the normalized k-shell indices of node u and node v. The first term of the formula represents the structural similarity between the nodes, and the second term represents their positional similarity. If the k-shell indices of two nodes are equal or close, the nodes should lie at nearby positions in the network and are considered similar in position. The positional similarity sim_loc is calculated by equation (16); the larger sim_loc is, the more similar the node positions are. ε is a positive parameter that balances structural similarity and positional similarity and is set to 0.1 here. The pseudocode for selecting seed nodes is shown in Algorithm 3.
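A possible reading of this similarity test is sketched below. Equation (16) is implemented as written. The structural term and the way the two terms are combined are assumptions made for illustration: the sketch uses the Jaccard overlap of the neighbor sets and adds ε times sim_loc, which may differ from the exact expression in equation (15). The function name similarity is hypothetical.

```python
def similarity(adj, ks, u, v, eps=0.1):
    """Similarity of nodes u and v from shared neighbours and k-shell position.

    adj: dict node -> set of neighbours; ks: dict node -> normalized k-shell.
    """
    common = len(adj[u] & adj[v])
    union = len(adj[u] | adj[v])
    structural = common / union if union else 0.0   # Jaccard overlap (assumed)
    sim_loc = 1 - abs(ks[u] - ks[v])                # equation (16)
    return structural + eps * sim_loc               # combination is assumed

# 4-cycle a - c - b - d - a: a and b share both neighbours, a and c share none.
adj = {"a": {"c", "d"}, "b": {"c", "d"}, "c": {"a", "b"}, "d": {"a", "b"}}
ks = {node: 1.0 for node in adj}
sim_ab = similarity(adj, ks, "a", "b")
sim_ac = similarity(adj, ks, "a", "c")
```

In the improved CELF step, a candidate would be discarded when this value exceeds the similarity threshold against any already selected seed.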
Figure BDA0003696427860000134
Figure BDA0003696427860000141
In Algorithm 3, lines 1-3 initialize the algorithm, k_1 nodes are then selected with the degree discount algorithm (line 5), and finally k_2 nodes are selected with the improved CELF algorithm (lines 12-19). sim_value is a similarity threshold: if the similarity between two nodes is greater than this threshold, they are considered similar.
4. Experimental results
In order to compare the proposed CBGN method with the existing algorithms (Degree, K-shell, NC +, PageRank, ClusterRank and BGN), simulation experiments including robustness and propagation scale of SIR model were performed on 6 real networks Inf-USAir, CEnew, Power, Ca-GrQc, Hamster and Router.
4.1 data set
The present patent uses 6 real network data sets of different types and sizes, the statistical properties of which are shown in table 1.
The degree distribution and network community division of each experimental network are shown in FIG. 5, which are (a) Inf-USAir, (b) CEnew, (c) Power, (d) Hamster, (e) Ca-GrQc, and (f) Router, respectively. Wherein the abscissa represents the degree of the nodes in the network and the ordinate represents the frequency of occurrence of the degree of the nodes. The small graph is a community visualization result of the network.
1) Inf-USAir is a United States air transportation network in which a node represents an airport and an edge represents a direct flight route between two airports. 2) CEnew is a biological network, given as the edge list of the C. elegans metabolic network. 3) Power is an undirected, unweighted network representing the topology of a state power grid in the United States, in which each node represents a utility and each edge represents a relationship between utilities. 4) Hamster describes the friendship relations between users of the website "www.hamsterster.com". 5) Ca-GrQc is a scientific collaboration network covering collaborations between authors of papers submitted to the General Relativity and Quantum Cosmology category. 6) Router is a symmetrized snapshot of the Internet structure at the autonomous system level.
TABLE 1 statistical characteristics of networks
Network     |V|    |E|     <k>      k_max   <d>      C_num   β_min
Inf-USAir   332    2126    12.807   139     2.738    7       0.0231
CEnew       453    2025    8.94     237     2.664    9       0.0256
Power       685    1282    5.743    12      12.422   17      0.2778
Hamster     2426   16631   13.711   273     3.67     168     0.0241
Ca-GrQc     4158   13422   6.456    81      6.049    40      0.0589
Router      5022   6258    2.492    106     6.449    55      0.0786
In Table 1, |V| represents the total number of nodes in the network and |E| represents the number of edges. <k> = 2|E|/|V| represents the average degree of the network. k_max represents the maximum degree in the network. <d> represents the average shortest path length of the network. C_num represents the number of communities in the network. β_min is the propagation threshold of the network, calculated by the formula <k>/(<k^2> - <k>).
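The degree-based statistics in Table 1 can be computed directly from a degree sequence, as in this short sketch. The function name degree_statistics is illustrative, and the example degree sequence is a toy star graph rather than one of the data sets.

```python
def degree_statistics(degrees):
    """Return (<k>, <k^2>, beta_min) with beta_min = <k> / (<k^2> - <k>)."""
    n = len(degrees)
    k1 = sum(degrees) / n                   # <k>, equal to 2|E|/|V|
    k2 = sum(d * d for d in degrees) / n    # <k^2>
    return k1, k2, k1 / (k2 - k1)

# Star with three leaves: degrees 3, 1, 1, 1.
k1, k2, beta_min = degree_statistics([3, 1, 1, 1])
```

The threshold is undefined when <k^2> equals <k>, for example when every node has degree 1, so such degenerate inputs need separate handling.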
4.2 Performance index
(1) Robustness value
Robustness can be used to evaluate the performance of a ranking algorithm. Given a network, one node is deleted at each time step and the size of the largest connected component in the remaining network is calculated. The sizes of the largest connected component after each deletion are summed and normalized by the network size N to obtain the robustness value. The robustness value is calculated by equation (2); the smaller the value, the better the ranking produced by the algorithm.
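The following sketch shows one way to compute the robustness value for a given removal order. It assumes that equation (2) has the usual form R = (1/N) Σ_{Q=1..N} s(Q), where s(Q) is the fraction of nodes in the largest connected component after Q nodes have been removed; the function names are illustrative.

```python
def largest_component(adj, alive):
    """Size of the largest connected component among the nodes in alive."""
    best, seen = 0, set()
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        stack, count = [start], 0
        while stack:                      # depth-first search of one component
            u = stack.pop()
            count += 1
            for w in adj[u]:
                if w in alive and w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, count)
    return best

def robustness(adj, removal_order):
    """R = (1/N) * sum of largest-component fractions after each removal."""
    n = len(adj)
    alive = set(adj)
    total = 0.0
    for u in removal_order:
        alive.discard(u)
        total += largest_component(adj, alive) / n
    return total / n

# Star graph with hub 0: removing the hub first dismantles the network fastest.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
r_hub_first = robustness(star, [0, 1, 2, 3, 4])
r_hub_last = robustness(star, [1, 2, 3, 4, 0])
```

Recomputing the components after every removal keeps the sketch short but is quadratic in the network size; the reverse process with incremental component merging avoids this cost.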
(2) Cumulative Distribution Function (CDF)
The cumulative distribution function completely describes the probability distribution of a random variable X and is defined for all real numbers x as follows:
F_X(x) = P(X ≤ x), for -∞ < x < +∞ (17)
The CDF gives the probability that a random observation taken from the population is less than or equal to a particular value. The present invention uses the CDF curve specifically to measure the ability of a ranking algorithm to distinguish node importance.
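For a finite list of node scores, equation (17) reduces to the empirical CDF sketched below. The function name empirical_cdf and the toy score lists are illustrative.

```python
from collections import Counter

def empirical_cdf(scores):
    """Return [(value, P(X <= value))] over the distinct values, ascending."""
    counts = Counter(scores)
    n = len(scores)
    cdf, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        cdf.append((value, running / n))
    return cdf

coarse = empirical_cdf([3, 3, 3, 3])     # every node gets the same score
fine = empirical_cdf([1, 1, 2, 4])       # scores separate the nodes better
```

A ranking that assigns many distinct scores produces a curve with many small steps, whereas a coarse ranking jumps to 1 in a few large steps, which is the sense in which the CDF reflects the ability to distinguish nodes.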
(3) SIR model
The SIR model is a common spreading model for infectious diseases. Its basic assumption is that the nodes in the network fall into three classes: a) susceptible nodes, which are uninfected and not immune; b) infected nodes, which have been infected and infect each neighboring susceptible node with probability β at each time step; c) recovered nodes: at each time step, each infected node recovers with probability γ and afterwards neither infects others nor can be infected again. The SIR model is often used to measure the influence of nodes, and the present invention uses it to measure the final infection scale of the selected seed nodes. Good spreaders reach a large infection scale quickly. The infection scale F(t) at time t and the final infection scale F(t_c), reached when the infection process arrives at a steady state, can be expressed as
Figure BDA0003696427860000161
Figure BDA0003696427860000162
where n_I(t) denotes the number of infected nodes at time t, n_R(t) denotes the number of recovered nodes at time t, and n is the number of nodes in G.
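A single run of the discrete-time SIR process can be sketched as follows, assuming F(t) = (n_I(t) + n_R(t)) / n as suggested by the definitions above. The function name sir_spread, the fixed number of steps and the explicit random generator argument are illustrative choices.

```python
import random

def sir_spread(adj, seeds, beta, gamma, steps, rng):
    """Simulate SIR and return [F(0), ..., F(steps)], with
    F(t) = (infected + recovered) / n."""
    n = len(adj)
    infected, recovered = set(seeds), set()
    curve = [len(infected) / n]
    for _ in range(steps):
        newly = set()
        for u in infected:
            for w in adj[u]:
                susceptible = w not in infected and w not in recovered
                if susceptible and rng.random() < beta:
                    newly.add(w)
        # Only nodes infected before this step can recover in this step.
        healed = {u for u in infected if rng.random() < gamma}
        infected = (infected | newly) - healed
        recovered |= healed
        curve.append((len(infected) + len(recovered)) / n)
    return curve

# Path 0 - 1 - 2 - 3 seeded at node 0; beta = gamma = 1 makes the run deterministic.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
curve = sir_spread(path, [0], 1.0, 1.0, 5, random.Random(0))
```

In practice the curve would be averaged over many independent runs with different random seeds, since a single run is noisy for β and γ below 1.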
(4) Average shortest path length
To ensure broader coverage, the selected seed nodes should be spread over different parts of the network. Generally, the more dispersed, i.e., the more evenly distributed, the selected nodes are in the network, the smaller the overlap between their influence ranges and the larger the infection range that can be expected. The average shortest path length can be used to measure the dispersion of the nodes, and the average shortest path length L_s between seed nodes can be calculated by equation (19).
Figure BDA0003696427860000163
where S represents the selected set of seed nodes and d_u,v represents the shortest path length from node u to node v.
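A minimal sketch of this dispersion measure is given below. It assumes that L_s is the mean of d_u,v over all unordered pairs of distinct seed nodes, that the network is unweighted and connected, and that at least two seeds are given; the function names are illustrative.

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """Shortest path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def seed_dispersion(adj, seeds):
    """Mean shortest path length over all unordered pairs of seed nodes."""
    dist = {s: bfs_distances(adj, s) for s in seeds}
    pairs = list(combinations(seeds, 2))
    return sum(dist[u][v] for u, v in pairs) / len(pairs)

# Path 0 - 1 - 2 - 3 - 4 with seeds at both ends and in the middle.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
l_s = seed_dispersion(path, [0, 2, 4])
```

Only one breadth-first search per seed is needed, so the cost stays low even for large networks when the seed set is small.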
4.3 Baseline Algorithm
6 advanced algorithms are used as reference algorithms, and are compared with the CBGN method respectively in a robustness experiment and a propagation scale experiment, and the 6 algorithms are briefly introduced as follows.
Degree: the algorithm selects the nodes with the largest degree as seed nodes; it is a simple, intuitive and commonly used baseline.
K-shell: the K-shell value of each node can be obtained through K-shell decomposition, and the K-shell method considers the position relation of the node in the network.
Neighborhood Coreness (NC): the method is a further improvement of the K-shell method. The neighborhood coreness C_nc(v) and the extended neighborhood coreness C_nc+(v) of each node are calculated as follows, where ks(w) represents the k-shell value of node w.
Figure BDA0003696427860000164
Figure BDA0003696427860000165
where N(v) represents the set of neighbors of node v.
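The two coreness measures can be sketched as follows, assuming the commonly used definitions C_nc(v) = Σ_{w in N(v)} ks(w) and C_nc+(v) = Σ_{w in N(v)} C_nc(w). The k-shell values come from a straightforward peeling procedure; the function names are illustrative.

```python
def k_shell(adj):
    """k-shell index of every node, obtained by peeling low-degree nodes."""
    degree = {u: len(adj[u]) for u in adj}
    remaining = set(adj)
    shell, k = {}, 0
    while remaining:
        k = max(k, min(degree[u] for u in remaining))
        peel = [u for u in remaining if degree[u] <= k]
        while peel:
            u = peel.pop()
            if u not in remaining:
                continue                   # already peeled via another path
            remaining.remove(u)
            shell[u] = k
            for w in adj[u]:
                if w in remaining:
                    degree[w] -= 1
                    if degree[w] <= k:
                        peel.append(w)
    return shell

def neighborhood_coreness(adj):
    """Return (C_nc, C_nc+) as dicts keyed by node."""
    ks = k_shell(adj)
    c_nc = {v: sum(ks[w] for w in adj[v]) for v in adj}
    c_nc_plus = {v: sum(c_nc[w] for w in adj[v]) for v in adj}
    return c_nc, c_nc_plus

# Triangle 0 - 1 - 2 with a pendant node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
shells = k_shell(adj)
c_nc, c_nc_plus = neighborhood_coreness(adj)
```

In this example the triangle nodes all share k-shell 2, and the neighborhood coreness separates node 0 from nodes 1 and 2, which plain k-shell cannot do.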
PageRank: the PageRank algorithm was proposed to compute the importance of web pages on the Internet. The higher the PageRank value of a page, the more important it is likely to be and the higher it may rank in Internet search results. A web page is important if many other web pages link to it, and if a page with a high PageRank value links to another page, the PageRank value of the linked page increases accordingly.
ClusterRank: the ClusterRank algorithm not only considers the influence of the nodes, but also considers the clustering coefficients of the nodes. The method considers local information of the network and lacks performance guarantee.
BGN: the algorithm considers the importance of the nodes from the viewpoint of network robustness and considers the global information of the network.
4.4 robustness analysis
To verify the effectiveness of the improved reverse generation network algorithm, imp_BGN was compared with the six baseline algorithms in terms of robustness. A good ranking algorithm should have a smaller robustness value, i.e., a smaller area under the R curve. FIG. 6 shows the R curves generated during reverse network generation in the 6 networks: (a) Inf-USAir, (b) CEnew, (c) Power, (d) Hamster, (e) Ca-GrQc, and (f) Router. The robustness values R of the different methods in the 6 real networks are given in Table 2. As can be seen from FIG. 6 and Table 2, imp_BGN has a smaller robustness value R than the 6 baseline methods in every network except Inf-USAir, where only BGN is lower. The method therefore selects candidate nodes well, and these candidate nodes help maintain the stability and connectivity of the network. Because the reverse generation network is applied in the candidate node selection stage, where it runs in the sub-networks formed by communities, the robustness of the algorithm was also analyzed in the 2 largest communities of each network to further verify its effectiveness. The results are shown in Table 3, where C_1 and C_2 denote the two largest communities. The R curves of the two largest communities of CEnew and Power are shown in FIG. 7, where the horizontal axis represents the seed node ratio and the vertical axis represents the largest connected component in the network. FIG. 7(a) is the largest community in CEnew, FIG. 7(b) is the second largest community in CEnew, FIG. 7(c) is the largest community in Power, and FIG. 7(d) is the second largest community in Power.
As can be seen from Table 3 and FIG. 7, the imp_BGN algorithm also performs best in the communities of most networks, and in the communities of the Inf-USAir network it also outperforms the BGN algorithm by a slight margin.
TABLE 2 robustness values R for different methods
Network Inf-USAir CEnew Power Hamster Ca-GrQc Router
K-shell 0.1614 0.1873 0.4223 0.1815 0.2317 0.0285
ClusterRank 0.1181 0.1301 0.2115 0.1371 0.1051 0.0158
PageRank 0.1227 0.1229 0.2019 0.1421 0.1027 0.0135
NC+ 0.1643 0.1623 0.3329 0.1692 0.2143 0.0202
BGN 0.0899 0.1171 0.0633 0.1045 0.0606 0.0076
Degree 0.1260 0.1200 0.2286 0.1384 0.1313 0.0121
Imp_BGN 0.0961 0.0790 0.0431 0.0872 0.0538 0.0063
TABLE 3 robustness values R of different algorithms in communities
Figure BDA0003696427860000171
In addition, the degree centrality optimized through graph traversal is used to assist in building the reverse generation network, which is more effective than using degree centrality directly, because the centrality becomes finer grained after optimization by the graph traversal framework. The ability to distinguish node importance can be described as resolution, which can be measured by the cumulative distribution function (CDF). FIG. 8 shows the CDF curves of Degree and of the graph-traversal-optimized degree centrality TRank_Degree in 3 networks: (a) Inf-USAir, (b) CEnew, and (c) Power. The x-axis represents the rank of the nodes and the y-axis represents the proportion at each rank. The smaller the angle between the CDF curve and the x-axis, the better the algorithm. It can be seen that TRank_Degree distinguishes the importance of nodes better, which further supports the validity of the imp_BGN algorithm.
4.5 propagation Scale analysis
To verify the ability of the proposed CBGN method to select influential nodes, the SIR model was used to measure the final infection scale F(t_c) of the seed nodes selected by the different algorithms. The infection probability β should be higher than the propagation threshold β_min of the network, and the infection rate is set to λ = β/γ. Because the model is stochastic, the experimental results were obtained by averaging 1000 independent runs. The number of selected seed nodes is set to 3% of the network size. The results are shown in FIG. 9, where the abscissa represents the infection time t and the ordinate F(t) represents the cumulative infection scale at time t; as time passes, F(t) reaches a stable value F(t_c). Reaching a larger F(t_c) in less time indicates a better algorithm.
In the SIR time-step experiment of FIG. 9, the proposed CBGN spreads best compared with the 6 other algorithms in all 6 network data sets: (a) Inf-USAir, (b) CEnew, (c) Power, (d) Hamster, (e) Ca-GrQc, and (f) Router. In the Inf-USAir network, the infection scale of the proposed CBGN method is significantly higher than that of the 6 baseline algorithms, whose infection scales are comparable to each other. In the Power network, CBGN outperforms the best baseline, ClusterRank, by 0.84% in infection scale. In the CEnew, Hamster, Ca-GrQc and Router networks, the infection scale of CBGN is 0.59%, 1.05%, 0.71% and 0.14% higher, respectively, than that of the best baseline, BGN. In these 4 networks the BGN algorithm already shows excellent capability but remains inferior to CBGN. In addition, since the infection probability affects the propagation scale in the SIR model, experiments were carried out at different infection rates, with λ in the range [1.0, 2.0]; the results are shown in FIG. 10. These results were likewise obtained by averaging 1000 independent runs. The x-axis represents the infection rate λ and the y-axis represents the stable final infection scale F(t_c) at that infection rate.
The proposed CBGN method is superior to the 6 baseline algorithms in infection scale across the different infection probabilities. In the CEnew network CBGN performs similarly to BGN, while in the remaining networks CBGN is superior to BGN. In addition, although the improved reverse generation network algorithm imp_BGN does not always yield the minimum robustness value on Inf-USAir and Router when selecting candidate nodes, the seed nodes finally selected by the CBGN method still infect the most nodes. It follows that the constructed CBGN framework is advanced and effective.
4.6 average shortest Path Length analysis
Generally speaking, the more dispersed the selected seed nodes are, i.e., the larger the average shortest path length between them, the wider the spread that can be reached, so the average shortest path length L_s between seed nodes is often used as a quality index. L_s is not an absolute indicator, however, because node selection considers the propagation capability of the nodes and not only their dispersion.
Table 4 average shortest path length
Network Inf-USAir CEnew Power Hamster Ca-GrQc Router
Degree 1.0 1.3077 12.3809 1.6929 3.936 3.6381
K-shell 1.0 1.3077 9.7762 1.9958 4.0117 3.1819
NC+ 1.0 1.2527 8.6714 1.5871 3.9745 3.0253
PageRank 1.1333 1.3187 11.2143 1.7393 3.3853 3.6440
ClusterRank 1.0 1.3187 10.6524 1.6541 2.9274 3.0640
BGN 1.0 1.4615 8.5381 2.1008 3.7463 4.1390
CBGN 1.2 1.5275 12.2857 2.2618 3.9821 4.0840
Table 4 gives the average shortest path length between the seed nodes selected by the proposed CBGN method and by the 6 baseline algorithms. In half of the networks (Inf-USAir, CEnew and Hamster), the seed node set selected by the CBGN method is the most dispersed. The Degree, K-shell and BGN methods select the most dispersed nodes in the Power, Ca-GrQc and Router networks, respectively.
5 conclusion
The invention provides a community-based reverse generation network framework, CBGN, to solve the influence maximization problem. First, the network is divided into natural communities using the Louvain algorithm, chosen to suit the application data set with running time taken into account, which narrows the search range for influential nodes. Then each community is regarded as an induced subgraph of the original graph, and the degree centrality optimized by graph traversal is applied in each subgraph to assist in constructing the network in reverse: at each step the node that minimizes the cost function is added to the network, construction stops when the number of nodes not yet added equals the required number of candidate nodes, and the remaining nodes are placed in the candidate node set. The robustness experiments show that the improved reverse generation network algorithm obtains smaller robustness values both in the whole network and in independent communities, which verifies that the improved imp_BGN algorithm is better able to select key nodes, i.e., nodes that maintain network stability, as candidate nodes. Finally, the seed nodes are selected from the candidate node set using degree discount and a greedy algorithm that considers network structure and node position. Experiments on propagation scale and average shortest path length show that the seed nodes selected by the CBGN method infect faster and reach a larger infection scale, and that the seed node sets are dispersed in most networks.
In summary, the main contributions of the present invention are as follows:
(1) A new reverse generation network method, Imp_BGN, that minimizes a cost function is provided. From the new perspective of graph traversal, each node is evaluated by constructing a breadth-first search (BFS) tree with the target node as the root; an influence score for each node is obtained from the BFS tree and is used to assist in constructing the reverse generation network. Because the least important nodes are added first, the generated network does not need to be restored to the original network: the nodes not yet added are placed directly into the candidate node set, which greatly reduces the computation time.
(2) The CELF algorithm is improved. Considering that the influence ranges of the selected nodes may overlap, a similarity evaluation index based on the common neighbors and positional relations between nodes is designed and applied in the process of selecting seed nodes by CELF.
(3) A community-based reverse generation network framework, CBGN, is constructed to select a group of influential nodes in a complex network. It takes the community structure of the network into account, uses the communities to accelerate the algorithm, and combines the advantages of network connectivity, graph traversal, heuristic algorithms and greedy algorithms.
(4) The CBGN method is evaluated experimentally in terms of robustness, influence propagation scale and average shortest path length among nodes; the results on 6 real networks, including Inf-USAir, show that the algorithm is competitive with existing advanced methods.
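The graph traversal idea in contribution (1) can be illustrated with a short Python sketch. The scoring rule below is an assumption made for illustration and is not taken from the specification: it builds the BFS levels rooted at a node, accumulates the fraction of total degree reached by each depth, and averages that cumulative curve up to a fixed depth, so that nodes reaching high-degree regions sooner score higher. The names `bfs_levels`, `traversal_score` and `max_depth` are hypothetical.

```python
from collections import deque


def bfs_levels(adj, root):
    """Return the BFS depth of every node reachable from root."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return depth


def traversal_score(adj, root, max_depth):
    """Area under the cumulative degree curve of the BFS tree (assumed rule)."""
    total = sum(len(nbrs) for nbrs in adj.values())
    if total == 0:
        return 0.0
    # Degree mass found at each BFS depth, up to max_depth.
    per_level = [0] * (max_depth + 1)
    for node, d in bfs_levels(adj, root).items():
        if d <= max_depth:
            per_level[d] += len(adj[node])
    area, reached = 0.0, 0
    for mass in per_level:
        reached += mass
        area += reached / total
    return area / (max_depth + 1)


# Toy path graph 0-1-2-3-4 (hypothetical data).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
for node in adj:
    print(node, round(traversal_score(adj, node, max_depth=2), 4))
```

In this toy graph nodes 1 and 2 have the same degree but receive different scores, which is the kind of finer granularity that a traversal-based score offers over plain degree centrality.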
Furthermore, many challenges remain for the influential node identification problem from different perspectives, for example, how to mine influential nodes efficiently in large-scale networks, how influential nodes change as the topology changes in time-varying networks, and how to better combine information across layers in multi-layer networks. Future work will extend the method to weighted networks, time-varying networks, multi-layer networks and heterogeneous networks.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (8)

1. A complex network influence node identification method based on a reverse generation network is characterized by comprising the following steps:
S1, community division: carrying out community division on the network by using a Louvain algorithm;
S2, generating a candidate node set: collecting node information by graph traversal so as to assist in constructing an improved reverse generation network, and then selecting a part of the nodes to be added into a candidate node set;
S3, selecting seed nodes: finding the final k influential nodes in the candidate node set as seed nodes.
2. The complex network influence node identification method based on the reverse generation network according to claim 1, wherein the community size satisfies at least the following condition:
C_size = size(G) * η (9)
wherein C_size represents the community size;
size(G) represents the number of nodes of network G;
η is an adjustable parameter that controls the size of the communities satisfying the condition.
3. The complex network influence node identification method based on the reverse generation network according to claim 1, wherein S2 comprises:
S2-1, calculating the addition cost of each remaining node and selecting the node that minimizes the cost function to construct the network, wherein the cost function is as follows:
Figure FDA0003696427850000011
where cost(u, n+1) denotes the cost function of node u at the (n+1)-th time step;
Figure FDA0003696427850000012
denotes the size of the largest connected component after node u is added to network G at the (n+1)-th time step, i.e.
Figure FDA0003696427850000013
AUC_u denotes the AUC value of node u;
AUC_max denotes the maximum AUC value among all nodes;
AUC_min denotes the minimum AUC value among all nodes;
ξ is a sufficiently small positive parameter;
S2-2, after the node is added, updating the size of each connected component in the network and recording the size of the largest connected component in the network;
S2-3, repeating steps S2-1 to S2-2 until the number of remaining nodes equals the required number of candidate nodes, at which point construction of the network stops.
4. The complex network influence node identification method based on the reverse generation network according to claim 1, wherein S2 further comprises:
in order to further narrow the search range while ensuring that candidate nodes of suitable quality are available, setting the size of the candidate node set in the independent network formed by each community according to the following formula:
Figure FDA0003696427850000021
where cand_num_i denotes the size of the candidate node set of the i-th community;
(C_size_i - C_size_min)/(C_size_max - C_size_min) denotes the proportion of the i-th community among all the selected communities;
β is an amplification parameter;
k is the number of seed nodes that ultimately need to be selected.
5. The complex network influence node identification method based on the reverse generation network according to claim 1, wherein the graph traversal comprises:
selecting the degree centrality as the initial centrality score in the graph traversal and optimizing it through the graph traversal framework, thereby generating an AUC score that is finer-grained and measures the influence of each node.
6. The complex network influence node identification method based on the reverse generation network according to claim 1, wherein S3 comprises:
S3-1, selecting k_1 nodes from the candidate node set through a degree discount algorithm;
Figure FDA0003696427850000022
the calculation formula of the degree discount algorithm is as follows:
Figure FDA0003696427850000031
where gdd_v denotes the degree discount of node v;
d_v denotes the degree of node v;
t_v denotes the number of infected neighbors of node v;
t_w denotes the number of infected neighbors of a susceptible neighbor node w of node v;
p denotes the infection probability;
S3-2, selecting k_2 nodes through an improved submodular algorithm;
k_2 = k - k_1 (13)
k is the number of seed nodes that ultimately need to be selected;
μ is an adjustable parameter that balances the greedy algorithm and the heuristic algorithm;
c denotes the total number of communities in the network that meet a certain size;
cand_num_i denotes the size of the candidate node set of the i-th community;
the specific steps by which the improved submodular algorithm selects the k_2 nodes are as follows: in each round of selection by the improved submodular algorithm, if the selected node u is similar to a previously selected node or to a node selected in the heuristic stage, node u is not selected and is removed from the candidate node set.
7. The complex network influence node identification method based on the reverse generation network according to claim 6, wherein whether two nodes are similar is determined by the following formula, and if the formula is satisfied, node u is similar to node v:
Figure FDA0003696427850000032
where sim denotes the similarity of node u and node v;
N(u) denotes the set of neighbor nodes of node u;
N(v) denotes the set of neighbor nodes of node v;
N(u) ∩ N(v) denotes the neighbors shared by N(u) and N(v);
|·| denotes the number of elements in a set;
ε is a parameter approaching 0;
abs(·) denotes the absolute value;
ks(u) is the normalized k-shell index of node u;
ks(v) is the normalized k-shell index of node v.
8. The complex network influence node identification method based on the reverse generation network according to claim 1, further comprising evaluating the method by using the following performance indexes: robustness value, cumulative distribution function, SIR model, and average shortest path length.
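As an illustration of the heuristic stage S3-1 recited in claim 6, the following Python sketch shows the classic single-term degree discount heuristic, dd_v = d_v - 2·t_v - (d_v - t_v)·t_v·p. This is a simpler, well-known variant and is not the gdd formula of the claims, which additionally involves the t_w term; the function name `degree_discount_seeds`, the adjacency-dict input and the default value of p are assumptions made for illustration.

```python
def degree_discount_seeds(adj, k, p=0.01):
    """Pick k seeds with the classic degree discount heuristic.

    adj : dict mapping each node to a list of its neighbors (simple graph)
    k   : number of seeds to pick
    p   : infection probability
    """
    dd = {v: float(len(adj[v])) for v in adj}  # discounted degree
    t = {v: 0 for v in adj}                    # selected neighbors so far
    seeds, chosen = [], set()
    for _ in range(min(k, len(adj))):
        # Take the unselected node with the largest discounted degree.
        u = max((v for v in adj if v not in chosen), key=lambda v: dd[v])
        seeds.append(u)
        chosen.add(u)
        # Discount every unselected neighbor of the new seed.
        for v in adj[u]:
            if v not in chosen:
                t[v] += 1
                d = len(adj[v])
                dd[v] = d - 2 * t[v] - (d - t[v]) * t[v] * p
    return seeds


# Two hubs joined by an edge (hypothetical data).
adj = {
    0: [1, 2, 3, 4],
    1: [0], 2: [0], 3: [0],
    4: [0, 5, 6],
    5: [4], 6: [4],
}
print(degree_discount_seeds(adj, k=2, p=0.1))
```

After hub 0 is selected, the neighboring hub 4 is discounted from 3 to 0.8, so the second seed is taken from farther away instead of from the neighborhood of the first seed. Plain degree ranking would have picked the two adjacent hubs.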

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210681512.3A CN115049002A (en) 2022-06-15 2022-06-15 Complex network influence node identification method based on reverse generation network

Publications (1)

Publication Number Publication Date
CN115049002A true CN115049002A (en) 2022-09-13

Family

ID=83161442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210681512.3A Pending CN115049002A (en) 2022-06-15 2022-06-15 Complex network influence node identification method based on reverse generation network

Country Status (1)

Country Link
CN (1) CN115049002A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1849776A (en) * 2003-10-14 2006-10-18 思科技术公司 Method and apparatus for generating routing information in a data communications network
US20130013807A1 (en) * 2010-03-05 2013-01-10 Chrapko Evan V Systems and methods for conducting more reliable assessments with connectivity statistics
CN108492201A (en) * 2018-03-29 2018-09-04 山东科技大学 A kind of social network influence power maximization approach based on community structure
EP3425861A1 (en) * 2017-07-03 2019-01-09 Mitsubishi Electric R&D Centre Europe B.V. Improved routing in an heterogeneous iot network
CN111222029A (en) * 2020-01-16 2020-06-02 西安交通大学 Method for selecting key nodes in network public opinion information dissemination
CN112380456A (en) * 2020-11-25 2021-02-19 上海大学 Condensation entropy based dynamic influence maximization method
CN114242261A (en) * 2021-12-10 2022-03-25 西北工业大学 Virus propagation control method based on bounded seepage-greedy algorithm

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
XIAOYANG LIU et al.: "Influential Spreaders Identification in Complex Networks with TOPSIS and K-Shell Decomposition" *
YAN LIU et al.: "A graph-traversal approach to identify influential nodes in a network" *
ZHIWEI LIN et al.: "BGN: Identifying Influential Nodes in Complex Networks via Backward Generating Networks" *
何道兵 et al.: "A new community discovery algorithm for online social networks" *
邓心惠: "Influence maximization algorithm based on reverse reachable sets" *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination