CN108492201B - Social network influence maximization method based on community structure

Info

Publication number: CN108492201B
Application number: CN201810269184.XA
Authority: CN
Inventors: 仇丽青; 于金凤; 范鑫
Original assignee: Shandong University of Science and Technology
Current assignee: Shandong University of Science and Technology
Priority date: 2018-03-29
Filing date: 2018-03-29
Publication date: 2022-02-08
Anticipated expiration: 2038-03-29
Also published as: CN108492201A

Abstract

The invention discloses a social network influence maximization method based on community structure, which comprises the following steps: (1) dividing communities to form a candidate node set: core nodes and boundary nodes in the network are identified by dividing the network into communities, and together they form the candidate node set; (2) heuristically selecting nodes: for each node in the candidate node set, the potential influence is evaluated from the degree of the node, the size of its community, the number of connected communities, and the influence weight, and the nodes with the largest potential influence are heuristically selected to join the seed set; (3) executing a greedy algorithm: the nodes with the largest marginal gain are selected by a greedy algorithm to join the seed set. By analyzing the role of community structure in influence propagation, the method further improves the accuracy and running efficiency of initial seed node mining and effectively solves the social network influence maximization problem.

Description

Social network influence maximization method based on community structure

Technical Field

The invention relates to the field of social networks, in particular to a social network influence maximization method based on a community structure.

Background

In recent years, with the rise of social networks, social platforms such as Facebook, Twitter, and Google+ have attracted wide attention. These platforms act as carriers for social networks, allowing all kinds of information to propagate across them. How to spread information as widely as possible through social platforms so that more users accept it is called the "influence maximization problem". It is a hot topic in social network research and has great application value in fields such as marketing, disease propagation, and rumor control.

The influence maximization problem asks how to select the top-k seed nodes for propagation so that the final propagation range is maximized. The problem has been proved to be NP-hard, and two main classes of solutions exist at present: greedy algorithms, which achieve a better influence range, and heuristic algorithms, which are more efficient. Because greedy algorithms require long running times and the results of heuristic algorithms are unstable, hybrid algorithms that combine the two are currently a better way to solve the influence maximization problem; such an algorithm applies a heuristic algorithm in the first stage and a greedy algorithm in the second stage. These influence maximization algorithms are generally based on two influence propagation models, the linear threshold (LT) model and the independent cascade (IC) model. The independent cascade model is less stable and requires a large number of simulations, whereas the linear threshold model has advantages that the independent cascade model lacks: its "cumulative characteristic" allows a node to activate a large number of nodes in the subsequent activation process. The specific rule is as follows: each node is initially either active or inactive, each inactive node is assigned a threshold representing how easily it can be influenced, and a node is activated only when the sum of the influence of all its active neighbors is greater than or equal to its threshold. Each active node can take part in the activation process multiple times, so when an attempt to activate an inactive node fails, the influence of its neighbors continues to accumulate and the likelihood of later activation increases.
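The activation rule of the linear threshold model described above can be illustrated with a minimal Python sketch. The function name, the dictionary-based graph representation, and the toy weights are assumptions made for illustration only; they do not come from the patent.

```python
def linear_threshold(in_neighbors, weights, thresholds, seeds):
    """Return the final active set under the linear threshold rule.

    in_neighbors: dict mapping each node to the nodes that can influence it
    weights:      dict mapping (u, v) to the influence weight of u on v
    thresholds:   dict mapping each node to its activation threshold
    seeds:        initially active nodes
    """
    active = set(seeds)
    changed = True
    while changed:  # stop once a full sweep activates nobody
        changed = False
        for v in in_neighbors:
            if v in active:
                continue
            # Influence accumulates over all currently active neighbors.
            total = sum(weights[(u, v)] for u in in_neighbors[v] if u in active)
            if total >= thresholds[v]:
                active.add(v)
                changed = True
    return active


# Toy graph (hypothetical data): c needs both a and b to cross its threshold.
in_neighbors = {"a": [], "b": ["a"], "c": ["a", "b"], "d": ["c"]}
weights = {("a", "b"): 0.6, ("a", "c"): 0.3, ("b", "c"): 0.3, ("c", "d"): 0.4}
thresholds = {v: 0.5 for v in in_neighbors}
final_active = linear_threshold(in_neighbors, weights, thresholds, {"a"})
```

In this toy run, node a alone cannot activate c (0.3 is below 0.5), but once b becomes active the accumulated weight on c reaches 0.6, which illustrates the cumulative characteristic; node d stays inactive because its only active neighbor contributes 0.4.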

Generally, two main indicators are used to evaluate solutions to the influence maximization problem: influence range and execution efficiency. However, most current work does not consider the actual structure of the network. Each network has a community structure, that is, connections within a community are dense and connections between communities are sparse. Analyzing this structure can further improve both influence range and execution efficiency: core nodes within a community allow information to spread through the community quickly, boundary nodes between communities extend the range of propagation, and identifying these two types of nodes by dividing the network into communities improves execution efficiency and accuracy. Therefore, to cope with the large scale of networks in influence maximization, the Louvain algorithm is used to divide communities; it is a fast algorithm with high accuracy that can be applied to large-scale networks.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: in view of the technical shortcomings of existing influence maximization algorithms, a social network influence maximization method based on community structure is provided, which further improves the influence range of the seed nodes while maintaining algorithm efficiency by analyzing the role of community structure in actual information propagation.

The technical scheme of the invention is as follows:

a social network influence maximization method based on community structure, the method comprising the steps of:

(1) constructing a social network graph: G = (V, E), where G represents the social network, V represents the set of nodes, and E represents the set of edges of the network;

(2) dividing communities to generate a candidate node set: first, the Louvain algorithm proposed by Blondel et al. is used to divide the input network G into M communities, namely C = (C₁, C₂, ..., C_M); second, the boundary node set S_boundary and the core node set S_core of each community are found, and their union forms the candidate node set CS, wherein S_core consists of the top 10% of the nodes of each community ranked by degree centrality;

(3) heuristically selecting nodes: from the candidate node set formed in step (2), heuristically select the nodes with the largest potential influence and add them to the seed node set S, the number of nodes selected in this step being determined by k and c, and execute the activation process of the seed node set S using the linear threshold model to generate an initial active node set A, where k represents the target size of the seed node set S and c represents the heuristic factor;

(4) executing a greedy algorithm: using a greedy algorithm, continue to select from the candidate node set formed in step (2) the nodes with the largest marginal gain and add them to the seed set S, while activating with the linear threshold model and adding the newly activated nodes to the set A.
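Step (2) above can be sketched in Python once a community partition is available (for example from a Louvain run, which is not reproduced here). This is only a sketch under stated assumptions: the function name is hypothetical, a boundary node is assumed to be any node with at least one neighbor in a different community, and the 10% share is rounded up with at least one core node kept per community. The patent text does not specify these details.

```python
import math


def candidate_set(adj, community_of, core_fraction=0.10):
    """Return (core, boundary, candidates) for a partitioned undirected graph.

    adj:          dict mapping each node to the set of its neighbors
    community_of: dict mapping each node to its community id
    """
    # Group nodes by community.
    members = {}
    for v, c in community_of.items():
        members.setdefault(c, []).append(v)

    core = set()
    for nodes in members.values():
        # Assumption: round the 10% share up and keep at least one node.
        n_core = max(1, math.ceil(core_fraction * len(nodes)))
        # Rank by degree, highest first; ties broken by node label.
        ranked = sorted(nodes, key=lambda v: (-len(adj[v]), str(v)))
        core.update(ranked[:n_core])

    # Assumption: a boundary node has a neighbor in another community.
    boundary = {v for v in adj
                if any(community_of[w] != community_of[v] for w in adj[v])}
    return core, boundary, core | boundary


# Toy graph (hypothetical data): two star-like communities joined by edge 4-6.
adj = {1: {2, 3, 4}, 2: {1}, 3: {1}, 4: {1, 6}, 5: {6, 7}, 6: {4, 5}, 7: {5}}
community_of = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 1}
core, boundary, candidates = candidate_set(adj, community_of)
```

On this toy graph the core nodes are 1 and 5 and the boundary nodes are 4 and 6, so the later selection steps search four candidates instead of all seven nodes.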

Further, the specific operation steps of the Louvain algorithm in the step (2) are as follows:

(a) merging communities: taking each node in the network as a community, then determining which neighbor communities are combined based on the modularity gain maximization standard, and repeating the process until the modularity gain is not increased any more, wherein the modularity gain is defined as follows:

where Σ_in represents the sum of the weights of all edges inside community C, Σ_tot represents the sum of the weights of all edges connected to community C, k_i,in represents the sum of the weights of all edges from node i to community C, and m represents the total number of edges of the network;

(b) constructing a new network: taking the new community obtained in the step (a) as a new node, constructing a new network and repeatedly executing the step (a);

(c) the two stages are repeated until the modularity gain is not changed any more;
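For reference, the modularity gain used in step (a) can be sketched in Python using the form published by Blondel et al. for the Louvain algorithm. It is an assumption that the formula omitted from this text matches that published form. The symbol k_i (the sum of the weights of all edges incident to node i) does not appear in the definitions above and is taken from the Louvain paper.

```python
def modularity_gain(sigma_in, sigma_tot, k_i, k_i_in, m):
    """Modularity gain from moving an isolated node i into community C.

    sigma_in:  sum of edge weights inside C
    sigma_tot: sum of weights of edges connected to C
    k_i:       sum of weights of edges incident to node i (assumed symbol)
    k_i_in:    sum of weights of edges from node i to C
    m:         total edge weight (edge count for an unweighted network)
    """
    two_m = 2.0 * m
    # Modularity contribution of C after node i joins it.
    after = (sigma_in + k_i_in) / two_m - ((sigma_tot + k_i) / two_m) ** 2
    # Contribution of C and of the singleton community {i} before the move.
    before = sigma_in / two_m - (sigma_tot / two_m) ** 2 - (k_i / two_m) ** 2
    return after - before


# Hypothetical numbers, chosen only to exercise the formula.
gain = modularity_gain(sigma_in=4, sigma_tot=10, k_i=3, k_i_in=2, m=20)
```

Expanding the squares shows that the expression reduces to k_i,in/(2m) - Σ_tot·k_i/(2m²), so the gain depends only on how strongly node i is tied to C relative to what a random placement would give.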

Further, in selecting the node with the largest potential influence to join the seed node set in step (3), the potential influence of each node is calculated as follows:

(a) for any node v in the graph G, firstly, judging the community attribute of each node, namely a core node or a boundary node, and respectively calculating the community influence of each node based on the community structure;

(b) for a core node, the community influence is evaluated by combining the degree of the node and the size of the community in which the node is located, and the calculation formula is as follows:

CI(v) = (C_D(v) + C_S(v)) / 2

where C_D(v) represents the degree of the node and C_S(v) represents the size of the community in which the node is located;

(c) for a boundary node, the community influence is evaluated by combining the degree of the node, the number of communities directly connected to the node, and the mean size of the node's neighbor communities, and the calculation formula is as follows:

CI(v) = (C_D(v) + C_N(v) + AvgN_S(v)) / 3

where C_D(v) represents the degree of the node, C_N(v) represents the number of communities to which the node is directly connected, and AvgN_S(v) represents the mean size of the neighbor communities of the node, calculated by the following formula:

where |C_i(w)| represents the size of the community in which the neighbor w of node v is located;

(d) in order to make the contribution of each index to the community influence consistent, the normalization criterion is used for optimization, and the community influence of each node v in the network is defined as follows by integrating the steps (b) and (c):

wherein each index is the result after normalization;

(e) in addition to the community influence obtained in step (d), each node v has a direct influence weight b_vw on its neighbor node w; combining the two, the potential influence of each node v in the network is calculated as follows:

wherein w ∈ neighbor (v),

indicating that node w is an inactive neighbor of node v.
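The community influence of steps (b) to (d) can be sketched in Python as follows. Several details are assumptions because the corresponding formulas are not reproduced in this text: min-max scaling is used as the normalization, AvgN_S(v) is taken as the mean size of the distinct other communities adjacent to v, and a node that is both a core node and a boundary node is scored with the boundary rule. The final combination with the weights b_vw in step (e) is not sketched because its formula is not available here.

```python
def min_max(raw):
    """Min-max normalize a dict of scores into [0, 1] (assumed scheme)."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {v: 0.0 for v in raw}
    return {v: (x - lo) / (hi - lo) for v, x in raw.items()}


def community_influence(adj, community_of, core, boundary):
    """Return CI(v) for every candidate node v in core | boundary."""
    size = {}
    for c in community_of.values():
        size[c] = size.get(c, 0) + 1

    nodes = core | boundary
    # Communities other than v's own that v touches directly.
    other = {v: {community_of[w] for w in adj[v]} - {community_of[v]}
             for v in nodes}

    c_d = min_max({v: len(adj[v]) for v in nodes})            # degree
    c_s = min_max({v: size[community_of[v]] for v in nodes})  # own community size
    c_n = min_max({v: len(other[v]) for v in nodes})          # connected communities
    avg = min_max({v: sum(size[c] for c in other[v]) / len(other[v])
                   if other[v] else 0.0 for v in nodes})      # mean neighbor size

    ci = {}
    for v in nodes:
        if v in boundary:  # assumption: boundary rule wins if v is in both sets
            ci[v] = (c_d[v] + c_n[v] + avg[v]) / 3
        else:
            ci[v] = (c_d[v] + c_s[v]) / 2
    return ci


# Toy graph (hypothetical data): two communities joined by edge 4-6.
adj = {1: {2, 3, 4}, 2: {1}, 3: {1}, 4: {1, 6}, 5: {6, 7}, 6: {4, 5}, 7: {5}}
community_of = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 1}
ci = community_influence(adj, community_of, core={1, 5}, boundary={4, 6})
```

Averaging the normalized indices gives each index the same weight in CI(v), which is the stated purpose of the normalization in step (d).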

Further, the specific operation steps of the greedy algorithm in the step (4) are as follows:

(a) initialization: initializing a seed node set S;

(b) calculating the marginal gain of each node v: the marginal gain is the increase in final influence obtained by adding node v to the seed set S, and the calculation formula is as follows:

σ(S+v)-σ(S)

where σ(·) represents the influence function;

(c) selecting a seed node: selecting the node with the largest influence gain to be added into the seed set S, and updating the influence of each node;

(d) repeating step (c) until k nodes satisfying the target are selected.
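Steps (a) to (d) can be sketched in Python as a generic greedy loop over the marginal gain σ(S+v) - σ(S). The function name and the toy influence function are assumptions for illustration: in the method described here σ would be the spread under the linear threshold model, whereas the sketch uses a simple coverage count so that it stays self-contained.

```python
def greedy_seeds(candidates, k, sigma, seeds=()):
    """Grow `seeds` to k nodes by repeatedly adding the best candidate.

    candidates: iterable of candidate nodes
    k:          target number of seed nodes
    sigma:      influence function taking a list of seeds, returning a number
    seeds:      seeds already chosen (e.g. by a heuristic phase)
    """
    S = list(seeds)
    remaining = set(candidates) - set(S)
    while len(S) < k and remaining:
        base = sigma(S)
        # Marginal gain of v is sigma(S + [v]) - sigma(S).
        best = max(sorted(remaining), key=lambda v: sigma(S + [v]) - base)
        S.append(best)
        remaining.discard(best)
    return S


# Toy influence function (hypothetical): number of distinct items reached.
reach = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}


def coverage(S):
    covered = set()
    for v in S:
        covered |= reach[v]
    return len(covered)


chosen = greedy_seeds(list(reach), 2, coverage)
```

Passing the seeds found by the heuristic phase through the `seeds` argument lets the same loop fill the remaining slots, which mirrors the hybrid scheme in which the greedy phase continues from the heuristically chosen seed set.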

The invention has the beneficial effects that: according to the social network influence maximization method based on the community structure, core nodes and boundary nodes of a community are identified through community structure characteristics of the network, a candidate node set is formed, then the community influence of each node and the influence weight of each node are integrated by means of the accumulative characteristics of linear thresholds to heuristically select seed nodes with the largest potential influence, and finally a greedy algorithm is applied to select the seed nodes. By the method, the accuracy and the operation efficiency of the initial seed node mining are further improved, and the problem of maximization of the influence of the social network is effectively solved.

Drawings

The invention is further illustrated with reference to the following figures and examples.

FIG. 1 is a flow chart of a social network influence maximization method based on community structure according to the present invention;

FIG. 2 shows the influence range for different values of the heuristic factor c at a seed set size of 50;

FIG. 3 shows the running time for different values of the heuristic factor c at a seed set size of 50;

FIG. 4 compares the influence range of the present invention with existing algorithms on the HepTh social network;

FIG. 5 compares the influence range of the present invention with existing algorithms on the BrightKite social network;

FIG. 6 compares the influence range of the present invention with existing algorithms on the Epinions social network;

FIG. 7 compares the influence range of the present invention with existing algorithms on the Amazon social network;

FIG. 8 compares the running time of the present invention with existing algorithms on the four social networks.

Detailed Description

The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic views illustrating only the basic structure of the present invention in a schematic manner, and thus show only the constitution related to the present invention.

Fig. 1 shows a method for maximizing social network influence based on a community structure according to the present invention, which comprises the following specific steps:

Step 1, constructing a social network graph: G = (V, E),

where G represents the social network, V represents the set of nodes, and E represents the set of edges of the network.

Step 2, dividing communities to generate a candidate node set.

First, the input network G is divided into communities using the Louvain algorithm proposed by Blondel et al., generating M communities, namely C = (C₁, C₂, ..., C_M). Second, the boundary node set S_boundary and the core node set S_core of each community are found, and their union forms the candidate node set CS, where S_core consists of the top 10% of the nodes of each community ranked by degree centrality. The Louvain algorithm in this step comprises the following steps:

(2a) merging communities: taking each node in the network as a community, then determining which neighbor communities are combined based on the modularity gain maximization standard, and repeating the process until the modularity gain is not increased any more, wherein the modularity gain is defined as follows:

where Σ_in represents the sum of the weights of all edges inside community C, Σ_tot represents the sum of the weights of all edges connected to community C, k_i,in represents the sum of the weights of all edges from node i to community C, and m represents the total number of edges of the network.

(2b) Constructing a new network: taking the new communities obtained in step (2a) as new nodes, a new network is constructed and step (2a) is executed again.

(2c) These two phases are repeated until the modularity gain no longer changes.

Step 3, heuristically selecting nodes.

From the candidate node set formed in step 2, the nodes with the largest potential influence are heuristically selected and added to the seed node set S, the number of nodes selected in this step being determined by k and c, and the activation process of the seed node set S is executed using the linear threshold model to generate an initial active node set A, where k represents the target size of the seed node set S and c represents the heuristic factor. The potential influence in this step is calculated as follows:

(3a) for any node v in the graph G, firstly, the community attribute of each node, namely a core node or a boundary node, is judged, and the community influence of each node is calculated respectively based on the community structure.

(3b) For a core node, the community influence is evaluated by combining the degree of the node and the size of the community in which the node is located, and the calculation formula is as follows:

CI(v) = (C_D(v) + C_S(v)) / 2

where C_D(v) represents the degree of the node and C_S(v) represents the size of the community in which the node is located.

(3c) For a boundary node, the community influence is evaluated by combining the degree of the node, the number of communities directly connected to the node, and the mean size of the node's neighbor communities, and the calculation formula is as follows:

CI(v) = (C_D(v) + C_N(v) + AvgN_S(v)) / 3

where C_D(v) represents the degree of the node, C_N(v) represents the number of communities to which the node is directly connected, AvgN_S(v) represents the mean size of the neighbor communities of the node, and |C_i(w)| represents the size of the community in which the neighbor w of node v is located.

(3d) In order to make the contribution of each index to the community influence consistent, the normalization criterion is used for optimization, and the steps (3b) and (3c) are integrated, and the community influence of each node v in the network is defined as follows:

wherein each index is the result after normalization.

(3e) In addition to the community influence obtained in step (3d), each node v has a direct influence weight b_vw on its neighbor node w; combining the two, the potential influence of each node v in the network is calculated as follows:

wherein w ∈ neighbor (v),

indicating that node w is an inactive neighbor of node v.

Step 4, executing a greedy algorithm.

Using a greedy algorithm, the nodes with the largest marginal gain continue to be selected from the candidate node set formed in step 2 and added to the seed set S, while activation is performed with the linear threshold model and the newly activated nodes are added to the set A. The greedy algorithm in this step comprises the following steps:

(4a) Initialization: the seed node set S is initialized.

(4b) Calculating the marginal gain of each node v: the marginal gain is the increase in final influence obtained by adding node v to the seed set S, and the calculation formula is as follows:

σ(S+v)-σ(S)

where σ(·) represents the influence function.

(4c) Selecting a seed node: the node with the largest influence gain is selected to join the seed set S, and the influence of each node is updated.

(4d) Step (4c) is repeated until k nodes satisfying the target are selected.

Example:

First, data sets and experimental setup

In this example, four public data sets of different sizes from SNAP are used: the HepTh, BrightKite, Epinions, and Amazon data sets. The HepTh data set comes from a collaboration network of high-energy physics theory authors and is an undirected graph. The BrightKite data set is a location-based social network and is an undirected graph. The Epinions data set comes from a trust network and is a directed graph formed by the links created when members of the Epinions website choose to trust one another's reviews. The Amazon data set comes from the Amazon shopping website: if two products on the website are often purchased together, a link exists between them, which yields a directed graph. The static structural statistics of these four data sets are shown in Table 1.

Table 1: statistical analysis of static structural features of experimental data

The threshold in the linear threshold model used in the present invention is often assigned a value in [0, 1]. Here, to make the results more definite, the present invention uses the classical threshold θ = 0.5 proposed by Kempe et al. The influence weight of the linear threshold model is often set to b_vw = 1/C_D(v), which means that node v contributes equally to each of its neighbors; this does not match real-world situations, so the influence weight is set to

where C_D(v) represents the degree of node v, and N(v) represents the neighbor set of node v.

In all simulation experiments in the following examples, the PHG (Partition-Heuristic-Greedy) method of the present invention is compared with the hybrid HPG method, the Greedy method, the PageRank method, the Degree method, and the Random method.

Second, community division performance

To identify key nodes more accurately, the accuracy of community division is particularly important, and for a large-scale social network the running time must also be considered. Taking both into account, the invention uses the Louvain algorithm to divide communities; the algorithm can be applied to large-scale social networks and has high accuracy. The division results are shown in Table 2.

Table 2: results of community discovery

Table 2 reports two parameters that characterize the community division: the modularity Q and the parameter u. The modularity Q measures the quality of a network partition by comparing the connection density of the actual network with that of a reference network under the same community division; the higher the modularity, the better the partition. The modularity values of the four networks in Table 2 range from 0.76 to 0.91, indicating that the Louvain algorithm divides communities with high accuracy. The parameter u (u = S_min/S_max) represents the average probability that a node in the network is not in the same community as its neighbor nodes, and the strength of the community structure is directly determined by this parameter: the smaller the parameter, the stronger the community structure. The parameters of the four data sets in Table 2 are all lower than 0.01, which indicates that the communities obtained for the four networks all have strong community structure and is more favorable for identifying key nodes within the communities. From Table 2 and the related analysis, it can be seen that the Louvain algorithm is a relatively suitable algorithm for large-scale social networks.

Third, selection of the heuristic factor c

An appropriate heuristic factor c is selected for each data set by considering the combined effect on influence spread and running time. Fig. 2 and Fig. 3 show how the influence range and the running time, respectively, vary with the heuristic factor at a seed set size of 50. As can be seen from Fig. 2, as the heuristic factor increases, the influence range gradually decreases on most data sets, while the influence range on the Epinions data set changes little. Similarly, as can be seen from Fig. 3, running-time efficiency gradually improves as the heuristic factor increases. This is mainly because heuristic algorithms are less effective but more efficient, whereas greedy algorithms are more effective but less efficient. Therefore, to obtain a reasonable running time together with a relatively high influence range, the heuristic factors of the Amazon and BrightKite data sets are both set to 0.4. For the Epinions data set, whose influence range varies little, the heuristic factor is set to 1 based on running time. The HepTh data set is small and its running time varies little, so its heuristic factor is set to 0.2 based on influence range.

Fourth, range of influence

Fig. 4 to Fig. 7 compare the influence range of the PHG method of the present invention with the five other algorithms HPG, Greedy, PageRank, Degree, and Random on the HepTh, BrightKite, Epinions, and Amazon data sets, respectively, at seed set sizes from 1 to 50. As these four figures show, the Random algorithm performs worst, mainly because it takes no factors into account, whereas the other algorithms each consider some factors to varying degrees. The heuristic algorithms PageRank and Degree perform better than Random but much worse than the remaining algorithms Greedy, HPG, and PHG. When the seed set is small, the high-influence algorithms Greedy and HPG have an influence range similar to that of PHG, but PHG performs increasingly better as the seed set grows. For example, at a seed set size of 50, the influence range of PHG on the Amazon data set is 10.7% and 35.5% higher than that of Greedy and HPG, respectively. These results all indicate that community structure information plays an important role in information dissemination, so the invention can effectively identify influential nodes.

Fifth, running time

FIG. 8 compares the running time of the present invention with the five other algorithms HPG, Greedy, PageRank, Degree, and Random at a seed set size of 50. As the figure shows, the running times of the Random, Degree, and PageRank algorithms are relatively short, but these algorithms are unstable and do not achieve good influence spread. Among the remaining algorithms, the PHG method of the present invention is more efficient than Greedy and HPG, mainly because the invention forms the candidate node set by dividing communities to find the core nodes and boundary nodes, which reduces the seed search space and improves running efficiency.

In conclusion, by analyzing the effect of the community structure in influence propagation, the invention utilizes the community structure information to identify the key nodes in the social network, so that the invention not only improves the influence range, but also further optimizes the operation efficiency, and has excellent effect.

In light of the foregoing description of the preferred embodiment of the present invention, many modifications and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The technical scope of the present invention is not limited to the content of the specification, and must be determined according to the scope of the claims.

Claims

1. A social network influence maximization method based on community structure, the method comprising the steps of:

(1) obtaining product purchase data from a shopping website, wherein a link is formed between two products on the website if they are frequently purchased together, thereby forming a directed graph, and constructing a social network graph: G = (V, E), where G represents the social network, V represents the set of nodes, and E represents the set of edges of the network;

(2) dividing communities to generate a candidate node set: first, the input network G is divided into communities using the Louvain algorithm to generate M communities, namely C = (C₁, C₂, ..., C_M); second, the boundary node set S_boundary and the core node set S_core of each community are found, and their union forms the candidate node set CS, wherein S_core consists of the top 10% of the nodes of each community ranked by degree centrality;

(3) heuristically selecting nodes: heuristic selection from the set of candidate nodes formed in step (2)

(4) executing a greedy algorithm: using a greedy algorithm, continuing to select from the set of candidate nodes formed in step (3) the nodes with the largest marginal gain, adding them to the seed set S, and activating with the linear threshold model so that the newly activated nodes are added to the set A;

wherein, in selecting the node with the largest potential influence to join the seed node set in step (3), the potential influence of each node is calculated as follows:

CI(v) = (C_D(v) + C_S(v)) / 2

CI(v) = (C_D(v) + C_N(v) + AvgN_S(v)) / 3

wherein each index is the result after normalization;

wherein w ∈ neighbor (v),

indicating that node w is an inactive neighbor of node v.

2. The method for maximizing social network influence based on community structure as claimed in claim 1, wherein the specific operation steps of the Louvain algorithm in the step (2) are as follows:

(c) these two phases are repeated until the modularity gain is no longer changed.

3. The method for maximizing social network influence based on community structure as claimed in claim 1, wherein the greedy algorithm in step (4) is specifically operated as follows:

(a) initialization: initializing a seed node set S;

σ(S+v)-σ(S)

wherein σ(·) represents the influence function;

(d) repeating step (c) until k nodes satisfying the target are selected.