CN106022936B - Community structure-based influence maximization algorithm applicable to thesis cooperative network

Info

Publication number: CN106022936B
Application number: CN201610353585.4A
Authority: CN
Inventors: 吴骏; 陈厚兵; 张梓雄; 王晓彤; 吴和生; 王崇骏
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2016-05-25
Filing date: 2016-05-25
Publication date: 2020-03-20
Anticipated expiration: 2036-05-25
Also published as: CN106022936A

Abstract

The invention provides a community structure-based influence maximization algorithm (the COMAX algorithm) applicable to paper cooperation networks, comprising the following steps: 1) a community discovery phase: a, constructing a paper cooperation network graph; b, merging local communities; c, constructing a new network graph; d, finishing; 2) a seed node selection stage: a, calculating the influence of each community; b, selecting the corresponding node in the community with the largest influence; c, finishing. The algorithm provides a new solution to the influence maximization problem on paper cooperation networks. Results show that, under the independent cascade model (ICM), the influence coverage of the COMAX algorithm is close to that of the greedy algorithm, and its time efficiency is very good.

Description

Community structure-based influence maximization algorithm applicable to thesis cooperative network

Technical Field

The invention relates to a method for solving the influence maximization problem on a paper cooperation network, and in particular to a method for solving the influence maximization problem based on community structure.

Background

In recent years, online social networks have developed rapidly, and more and more social networking sites have appeared. Information dissemination in these networks has surpassed that of real life in both scale and efficiency. The influence maximization problem concerns how to select a fixed number of seed nodes so as to maximize the coverage of information propagation. When a subject or field needs to be surveyed or understood in depth, it is impossible to read all the material in the field. Instead, a portion of the works of highly influential authors is selected, and finding those highly influential authors is exactly the process of selecting seed nodes.

In 2003, Kempe, Kleinberg, and Tardos [Maximizing the Spread of Influence through a Social Network] formally defined the influence maximization problem, cast it as a discrete optimization problem, and proved that it is NP-hard. For the linear threshold model and the independent cascade model they gave a greedy algorithm and proved that its approximation ratio relative to the optimal solution is (1-1/e). However, the time complexity of the greedy algorithm is very high: it considers neither the degree distribution nor the community structure of the network, and the influence of every candidate node must be recalculated each time a seed node is selected, so its time efficiency is low.
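
To make this baseline concrete, the following is a minimal sketch of greedy seed selection with Monte Carlo spread estimation under the independent cascade model. It is an illustration only, not the cited authors' implementation: the function names, the unweighted adjacency format, the uniform activation probability `p`, and the default number of runs are all assumptions.

```python
import random

def simulate_ic(adj, seeds, p, rng):
    """Run one independent cascade and return the number of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly_active = []
        for u in frontier:
            for v in adj[u]:
                # A newly active node gets one chance to activate each inactive neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return len(active)

def expected_spread(adj, seeds, p, runs, rng):
    """Average the cascade size over several simulations."""
    return sum(simulate_ic(adj, seeds, p, rng) for _ in range(runs)) / runs

def greedy_seeds(adj, k, p=0.1, runs=200, seed=0):
    """Pick k seeds, each time adding the node with the largest estimated marginal gain."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(k):
        base = expected_spread(adj, chosen, p, runs, rng) if chosen else 0.0
        best_node, best_gain = None, float("-inf")
        for v in adj:  # every remaining candidate is re-evaluated in every round
            if v in chosen:
                continue
            gain = expected_spread(adj, chosen + [v], p, runs, rng) - base
            if gain > best_gain:
                best_node, best_gain = v, gain
        if best_node is None:
            break
        chosen.append(best_node)
    return chosen
```

The inner loop over all candidates, each requiring `runs` simulations, is where the cost criticized above comes from.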

In 2007, to address the high time complexity of the greedy algorithm, Leskovec et al. [Cost-effective Outbreak Detection in Networks] exploited the submodularity of the influence function and proposed the "Lazy Forward" optimization strategy and the CELF algorithm.

In 2009, Chen Wei et al. [Efficient Influence Maximization in Social Networks] proposed the NewGreedy algorithm and the MixGreedy algorithm, again in response to the high time complexity of the greedy algorithm. The NewGreedy algorithm preprocesses the original network graph, deletes the edges irrelevant to the propagation process, and thereby reduces the problem to finding the set of nodes reachable from the seed node set in the new network graph. The MixGreedy algorithm combines the NewGreedy algorithm with the CELF algorithm: NewGreedy is used when the first node is selected, to calculate the initial influence of each node, and CELF is used for the subsequent seed nodes. The results show that the influence coverage of NewGreedy and MixGreedy is close to that of the greedy algorithm and their time efficiency is higher, but many Monte Carlo simulations are still required, so the overall efficiency is low and the methods are not suitable for large-scale social networks.

Many influence maximization algorithms do not consider the community structure of the network. However, the connections between nodes inside a community are closer than the connections to nodes outside it; accordingly, during information dissemination a node is more likely to activate other nodes in its own community than nodes outside that community. The invention therefore provides an influence maximization algorithm based on community structure: the whole network is divided into relatively independent communities, node influence is calculated within each community, and the largest node influence in a community is taken as the influence of that community. After a seed node is selected, only the influence of one community needs to be recalculated rather than all influence values, which greatly improves the efficiency of seed node selection.

Disclosure of Invention

The invention aims to solve the technical problem of providing a seed node selection method suitable for the influence maximization problem of a thesis cooperative network.

The technical scheme is as follows: in order to solve the above problem, the community structure-based influence maximization algorithm for paper cooperation networks of the present invention includes the following steps:

1) a community discovery phase:

a, constructing an initial paper cooperation network graph;

b, merging local communities;

c, constructing a new network graph;

d, finishing;

2) a seed node selection stage:

a, calculating community influence;

b, selecting a seed node;

c, finishing.

In the invention, the nodes of the network graph constructed in step 1)-a represent authors, an edge represents a cooperative relationship between two authors who have published papers together, and the weight of an edge represents the number of papers the two authors have published together.

In the invention, merging the local communities in step 1)-b means that each node is initially taken as a local community, and each node is then merged with whichever of its connected communities gives the largest modularity value increment after merging, wherein the modularity value is expressed by the following formula:

where nc denotes the number of all communities, in_c denotes the number of edges inside community c, and tot_c denotes the number of all edges connected to the nodes in community c.

The increment of the modularity value after the node i and the community c are merged is as follows:

wherein

represents the number of edges connecting node i with community c, which become internal edges of the new community after merging; k_i represents the degree of node i.
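
For orientation, one standard way to write a modularity and its merge increment that is consistent with the definitions above is sketched below. This is an assumption rather than a reproduction of the patent's own expressions, which may differ in normalization: it takes in_c to count each internal edge once, tot_c to be the total degree of the nodes in community c, and m to be the number of all edges in the network, and it introduces the symbol k_{i,in} (not taken from the text) for the number of edges between node i and community c.

```latex
Q = \sum_{c=1}^{nc} \left[ \frac{in_c}{m} - \left( \frac{tot_c}{2m} \right)^{2} \right]

\Delta Q = \frac{k_{i,in}}{m} - \frac{tot_c \, k_i}{2 m^{2}}
```

Under these assumptions the increment follows from comparing Q before the merge (community c and the single-node community i counted separately) with Q after it: the internal-edge term grows by k_{i,in}/m, and the squared-degree term changes by ((tot_c + k_i)^2 - tot_c^2 - k_i^2)/(4m^2) = tot_c k_i/(2m^2).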

In the invention, constructing the new network graph in step 1)-c means that all the nodes of each community obtained after merging in step 1)-b are represented by a single node of the new network graph, and the connecting edges between the original communities become the connecting edges between the nodes of the new network graph.

In the invention, calculating the community influence in step 2)-a means taking the influence value of the most influential node in a community as the influence of that community and recording the corresponding node.

In the invention, selecting the seed node in step 2)-b means selecting the recorded node of the community with the largest influence, after which the influence of that community is recalculated.

The invention has the following beneficial effects: the community structure-based influence maximization algorithm for paper cooperation networks provides a new heuristic solution to the influence maximization problem; the influence propagation range of the selected seed nodes is close to that of the greedy algorithm, the time efficiency is high, and the algorithm is suitable for solving the influence maximization problem on large-scale social networks.

Drawings

Fig. 1 is a flowchart of a paper cooperation network influence maximization method based on a community structure according to an embodiment of the present invention.

FIG. 2 is a flow chart of the community discovery phase of FIG. 1.

Fig. 3 is a flow chart of the seed node selection stage in Fig. 1.

Fig. 4 is a comparison, on the Hep data set, of the influence coverage of the seed nodes selected by the algorithm proposed by the invention (COMAX) and by other methods.

Detailed Description

In order to better understand the technical content of the present invention, specific embodiments are described below with reference to the accompanying drawings.

As shown in FIG. 1, the method has two stages, namely a community discovery stage and a seed node selection stage.

The influence maximization algorithm based on the community structure and applied to the thesis cooperative network comprises the following steps:

1) a community discovery phase:

a, constructing an initial paper cooperation network graph;

b, merging local communities;

c, constructing a new network graph;

d, finishing;

2) a seed node selection stage:

a, calculating community influence;

b, selecting a seed node;

c, finishing.

Fig. 2 is a flow chart of the community discovery phase, which is divided into three main parts: constructing the original network graph, merging local communities, and constructing a new network graph. After the local communities are merged, all nodes in the same local community are abstracted into one node, a new network is established, and merging is performed again. A merge is performed only when the modularity value increment is positive.

The concrete steps of the community discovery phase are as follows:

Step 1-0 is the start of the method.

Step 1-1 is to traverse the corpus. This is the first step in building the network, and the author information of all relevant documents in the corpus is recorded.

Step 1-2 is to extract the cooperative relationships. The nodes of the network are constructed in step 1-1, but the edges between the nodes and the weights of those edges are not yet known. If two authors have co-authored a paper, an edge is constructed between them, and the final weight of the edge is the total number of papers the two authors have co-authored.

Step 1-3 is to construct the cooperative network graph: an undirected weighted graph G(V, E, W) is constructed from the nodes obtained in step 1-1 and the edges obtained in step 1-2. V denotes the authors, E denotes the cooperative relationships between authors, and W denotes the number of papers co-authored by each pair of authors.
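
Steps 1-1 to 1-3 can be sketched as follows. This is a minimal illustration under assumptions: the corpus is taken to be a list of author lists (one per paper), and the function name and the dictionary-of-dictionaries graph format are hypothetical choices, not part of the patent.

```python
from collections import defaultdict
from itertools import combinations

def build_coauthor_graph(papers):
    """Return {author: {coauthor: number_of_joint_papers}} for an undirected weighted graph."""
    graph = defaultdict(dict)
    for authors in papers:
        unique = sorted(set(authors))
        for a in unique:
            graph[a]  # step 1-1: register every author as a node, even without coauthors
        # Step 1-2: every pair of coauthors on a paper adds 1 to their edge weight.
        for u, v in combinations(unique, 2):
            graph[u][v] = graph[u].get(v, 0) + 1
            graph[v][u] = graph[v].get(u, 0) + 1
    # Step 1-3: the symmetric weighted adjacency is the graph G(V, E, W).
    return {a: dict(neighbors) for a, neighbors in graph.items()}
```

For example, `build_coauthor_graph([["A", "B"], ["B", "A", "C"], ["D"]])` gives the edge A-B a weight of 2, the edges A-C and B-C a weight of 1, and keeps D as an isolated node.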

Step 1-4 is to calculate the modularity value increment of the combination of the node and the connected community, and the modularity value increment after the combination of the node i and the community c is as follows:

wherein

represents the number of edges connecting node i with community c, which become internal edges of the new community after merging; k_i represents the degree of node i. In this step, for each node, the modularity value increment obtained by merging it with each connected community is calculated, and the maximum increment and the corresponding community are recorded.

Step 1-5 is to judge whether, over all nodes, the maximum modularity value increment obtained by merging a node with a connected community is larger than 0. If not, jump to step 1-8 and end the community discovery phase.

Step 1-6 is the merge phase: each node is merged with the community that gives its largest modularity value increment, provided that increment is greater than 0.

Step 1-7 is to construct a new network graph: all the nodes that ended up in the same community after the merging in step 1-6 are abstracted into one node, and the edges between the original communities are taken as the edges between the nodes of the new graph, so that the number of nodes in the new network graph equals the number of communities after the merging in step 1-6 and each node represents one previous community. Then jump to step 1-4.

Step 1-8 is to return the community structure of the network, at which point the community discovery phase is complete.
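
The loop formed by steps 1-4 to 1-8 can be sketched as below. This is a simplified illustration, not the patent's implementation. It assumes the increment k_in/m - tot_c*k_i/(2m^2) of the standard modularity, a graph given as `{node: {neighbor: weight}}`, and that only nodes still forming a single-node community are merged within one pass. All function names are hypothetical.

```python
def merge_pass(adj, loops):
    """Steps 1-4 to 1-6: return (node -> community label, whether any merge happened)."""
    degree = {u: sum(adj[u].values()) + 2 * loops[u] for u in adj}
    m = sum(degree.values()) / 2            # total edge weight of the current graph
    comm = {u: u for u in adj}              # every node starts as its own local community
    tot = dict(degree)                      # tot_c: total degree of community c
    size = {u: 1 for u in adj}
    merged = False
    if m == 0:
        return comm, merged
    for i in adj:
        if size[comm[i]] != 1:
            continue                        # i already belongs to a merged community
        k_in = {}                           # edge weight from i to each connected community
        for j, w in adj[i].items():
            k_in[comm[j]] = k_in.get(comm[j], 0) + w
        best_c, best_gain = None, 0.0
        for c, weight in k_in.items():
            gain = weight / m - tot[c] * degree[i] / (2 * m * m)
            if gain > best_gain:
                best_c, best_gain = c, gain
        if best_c is not None:              # step 1-5: merge only on a positive increment
            size[comm[i]] = 0
            comm[i] = best_c
            tot[best_c] += degree[i]
            size[best_c] += 1
            merged = True
    return comm, merged

def aggregate(adj, loops, comm):
    """Step 1-7: collapse each community into one node of a new weighted graph."""
    new_adj = {c: {} for c in set(comm.values())}
    new_loops = {c: 0 for c in new_adj}
    for u in adj:
        cu = comm[u]
        new_loops[cu] += loops[u]
        for v, w in adj[u].items():
            cv = comm[v]
            if cu == cv:
                new_loops[cu] += w / 2      # each internal edge is visited from both ends
            else:
                new_adj[cu][cv] = new_adj[cu].get(cv, 0) + w
    return new_adj, new_loops

def discover_communities(adj):
    """Repeat merge and aggregation until no positive increment remains (step 1-8)."""
    members = {u: [u] for u in adj}
    loops = {u: 0 for u in adj}
    while True:
        comm, merged = merge_pass(adj, loops)
        if not merged:
            return list(members.values())
        new_members = {}
        for u, c in comm.items():
            new_members.setdefault(c, []).extend(members[u])
        adj, loops = aggregate(adj, loops, comm)
        members = new_members
```

Because every successful merge reduces the number of communities by one, the outer loop is guaranteed to stop. On two triangles joined by a single edge, the sketch returns the two triangles as communities.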

Fig. 3 is a flow chart of the seed node selection stage, which is divided into two main parts: calculating community influence and selecting a seed node. The influence of all communities is calculated first, and the node recorded for the community with the maximum influence is then selected. Afterwards only the influence of the selected community needs to be recalculated; the other communities do not need to be recalculated.

The seed node selection stage comprises the following specific steps:

Step 2-0 is the start of the method.

Step 2-1 is to calculate the influence of the nodes in each community. The information propagation model used is the independent cascade model, and for the weighted network graph the expected value of the influence of node v is as follows:

where in_v is the sum of the weights of the edges between node v and the nodes it is connected to inside the community, t_v is the sum of the weights of the edges between node v and those neighbors that have already become seed nodes, and p is the probability of each edge being successfully activated. For a node u and a node v joined by an edge of weight t, if u is in the active state, the probability that u activates v is 1-(1-p)^t.
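
The edge activation probability stated above can be read as follows: each of the t co-authored papers gives u an independent chance p of activating v, so all t attempts fail with probability (1-p)^t. A minimal sketch, with a hypothetical function name, covering only this probability and not the node-level expected influence:

```python
def activation_probability(p, t):
    """Probability that an active node activates a neighbor over an edge of weight t."""
    # All t independent attempts fail with probability (1 - p) ** t.
    return 1.0 - (1.0 - p) ** t
```

For instance, with p = 0.5 an edge of weight 2 gives 1 - 0.25 = 0.75, while an edge of weight 1 gives p itself.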

Step 2-2 is to calculate the community influence. The influence of a community is the maximum influence value over all nodes in the community, and the node corresponding to that value is recorded.

Step 2-3 is to select a seed node: the community with the largest influence is located first, and the node recorded for that community is then added to the seed node set.

Step 2-4 is to judge whether the seed node selection process is finished. If the number of selected seed nodes has reached K, jump to step 2-6 and end the algorithm.

Step 2-5 is to recalculate the influence values of all nodes in the community that had the maximum influence in step 2-3, then recalculate the influence of that community, and jump to step 2-3.

Step 2-6 is to return the selected seed node set, at which point the seed selection is complete.
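
The control flow of steps 2-1 to 2-6 can be sketched as below. The node influence function is passed in as a parameter because this sketch does not attempt to reproduce the patent's expected-influence expression; `internal_weight_influence` is a placeholder score introduced here purely for illustration, and all names are hypothetical.

```python
def select_seeds(communities, node_influence, k):
    """Pick up to k seeds; only the chosen community is re-scored after each pick."""
    seeds = []

    def best_node(community):
        # Steps 2-1 and 2-2: community influence is the largest node influence.
        scored = [(node_influence(v, community, seeds), v)
                  for v in community if v not in seeds]
        return max(scored, key=lambda s: s[0]) if scored else (float("-inf"), None)

    best = [best_node(c) for c in communities]
    while len(seeds) < k and best:                       # step 2-4
        idx = max(range(len(best)), key=lambda j: best[j][0])   # step 2-3
        node = best[idx][1]
        if node is None:
            break                                        # no candidate nodes remain
        seeds.append(node)
        best[idx] = best_node(communities[idx])          # step 2-5: this community only
    return seeds                                         # step 2-6

def internal_weight_influence(adj):
    """Placeholder score (an assumption): edge weight to non-seed neighbors in the community."""
    def score(v, community, seeds):
        inside = set(community)
        return sum(w for u, w in adj[v].items() if u in inside and u not in seeds)
    return score
```

The saving described in the text shows up in the last line of the loop: after a seed is chosen, one entry of `best` is refreshed instead of re-scoring every node in the network.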

The Hep data set used in Fig. 4 is a data set frequently used for the influence maximization problem and is a cooperation network graph from the field of high-energy physics. The figure shows that the influence coverage of the seed node set grows as the number of seed nodes increases, and that the influence coverage of the seed node set selected by the COMAX algorithm is very close to that of the CELF algorithm, the accelerated greedy algorithm, while its time efficiency is higher than that of the CELF algorithm by multiple orders of magnitude.

In summary, the community structure-based influence maximization algorithm provides a new method for discovering high-influence nodes in a paper cooperation network. The method first divides the network into relatively independent communities, then calculates the community influence, adds the corresponding node of the community with the largest influence to the seed node set, and recalculates the influence of that community, repeating this cycle until K seed nodes are found.

Although the present invention has been described with reference to the preferred embodiments, it is not intended to be limited thereto. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention should be determined by the appended claims.

Claims

1. A method for maximizing influence based on a community structure and applied to a thesis cooperation network is characterized by comprising the following steps:

1) a community discovery phase;

a, constructing an initial thesis cooperation network graph;

b, merging local communities;

c, constructing a new network graph;

d, finishing;

2) seed node selection stage

a, calculating community influence;

b, selecting a seed node;

c, finishing;

the step 1) -a of constructing the cooperative network graph means that in the constructed network graph, nodes represent authors, edges in the graph represent that a cooperative relationship exists between two authors and papers are published together, the weight values of the edges represent the number of the papers published together, and the constructed network graph is an undirected graph;

merging the local communities in step 1)-b means that each node is initially taken as a local community, and each node is merged with whichever of its connected communities gives the largest modularity value increment after merging, wherein the modularity value formula is as follows:

where nc denotes the number of all communities, in_c denotes the number of edges inside community c, tot_c denotes the number of all edges connected to the nodes in community c, and m denotes the number of all edges in the network;

wherein

represents the number of edges connecting node i with community c, which become internal edges of the new community after merging, and k_i represents the degree of node i;

the step 1) -c of constructing the new network graph refers to that all nodes in the communities obtained in the step 1) -b after combination are represented by one node to serve as nodes in the new network graph, and connecting edges between the original communities become connecting edges between the nodes in the new network graph;

calculating the influence of the community in the steps 2) -a, namely taking the influence value of the node with the maximum influence in the community as the influence of the community, and recording the corresponding node;

selecting the seed node in the steps 2) -b refers to selecting a corresponding node in the community with the largest influence, and recalculating the influence of the corresponding community.