CN111784529A

CN111784529A - Social network high-quality user identification method based on overlapping community detection

Info

Publication number: CN111784529A
Application number: CN202010596006.5A
Authority: CN
Inventors: 张磊; 孙凤姣; 刘玉童; 吴鑫鹏
Original assignee: Anhui University
Current assignee: Anhui University
Priority date: 2020-06-28
Filing date: 2020-06-28
Publication date: 2020-10-16

Abstract

The invention discloses a social network high-quality user identification method based on overlapping community detection. Through its initialization strategy and local search strategy, the invention can effectively find high-influence user combinations under different cost budgets in the network, providing multiple choices for decision makers with different cost budget requirements.

Description

Social network high-quality user identification method based on overlapping community detection

Technical Field

The invention relates to social networks, and in particular to a social network high-quality user identification method based on overlapping community detection.

Background

With the development of internet technology, people share their opinions through social platforms such as Twitter and Weibo, so that information dissemination breaks through the limitations of time and space. At present, most research on social influence maximization focuses only on the influence of nodes and neglects their cost. In actual marketing, a merchant often promotes products through marketing strategies such as celebrity endorsements, free trial products and discounts, all of which require a certain cost. In addition, users differ in influence, and users with large influence spread information more widely; however, the higher a user's influence, the higher the cost. It is therefore important to identify high-quality users in the network so as to maximize influence while minimizing cost. To this end, a social network high-quality user identification method based on overlapping community detection is provided.

Currently, influence-cost optimization methods in social networks are mainly classified into the following two categories:

The first type requires a fixed cost budget. A fixed cost budget is set in advance for the social network, and the common approach uses a greedy strategy to select seed nodes in the network until the cost budget is exceeded; however, the greedy strategy is time-consuming and can find only one seed node combination.

The second type does not require a fixed cost budget. No budget is fixed in advance; instead, the cost of the seed nodes is treated as an optimization objective, so that, from the decision maker's perspective, the selected cost budget is as small as possible and the resulting influence is as large as possible. Current algorithms of this type use multi-objective optimization but do not exploit the overlapping community structure of the network to find high-quality users. However, the overlapping nodes of overlapping communities allow information to propagate between different communities, acting as a "bridge", and at the same cost level overlapping nodes spread influence significantly better than non-overlapping nodes. In addition, without strategies tailored to the problem, these algorithms do not perform well on it.

Disclosure of Invention

In view of the defects of the prior art, the invention provides a social network high-quality user identification method based on overlapping community detection, which uses overlapping community structure information to mine users with high influence and low cost in the network. Through an effective initialization strategy and local search strategy, it provides decision makers with multiple seed node combination schemes at different cost budgets, thereby meeting the needs of practical problems.

In order to solve the technical problems, the invention adopts the following technical scheme:

A social network high-quality user identification method based on overlapping community detection, characterized by comprising the following steps:

The network is defined as G = (V, E), where V = {v₁, v₂, …, v_i, …, v_n} denotes the users in the social network, v_i denotes the i-th user, and n is the total number of nodes; E = {e_ij | i = 1, 2, …, n; j = 1, 2, …, n} denotes the connections between pairs of nodes, where e_ij indicates whether an edge exists between the i-th node v_i and the j-th node v_j; if e_ij = 1, an edge connects v_i and v_j, and the two nodes are neighbors of each other; if e_ij = 0, there is no edge between v_i and v_j, i.e. the two nodes are not connected;
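As an illustration of the network definition above, the following minimal Python sketch stores G as neighbor sets rather than a full n × n matrix of e_ij values. The function names `build_adjacency` and `e`, the toy edge list, and the treatment of the graph as undirected (suggested by the statement that connected nodes are neighbors of each other) are assumptions of this sketch, not part of the described method.

```python
def build_adjacency(n, edges):
    """Return {node: set(neighbors)} for an undirected graph on nodes 1..n."""
    adj = {v: set() for v in range(1, n + 1)}
    for i, j in edges:
        if i == j:
            continue  # self-loops are ignored in this sketch
        adj[i].add(j)
        adj[j].add(i)
    return adj


def e(adj, i, j):
    """e_ij: 1 if an edge connects v_i and v_j, otherwise 0."""
    return 1 if j in adj[i] else 0


# Hypothetical toy network with n = 5 users.
edges = [(1, 2), (2, 3), (3, 4), (4, 1), (3, 5)]
adj = build_adjacency(5, edges)
degree = {v: len(neighbors) for v, neighbors in adj.items()}
```

The neighbor sets give the neighbor set N_v and the degree d_i used in later steps directly, without scanning a matrix row.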

Step 1, detection of overlapping communities

Preprocess the network with an overlapping community detection algorithm to obtain the overlapping community structure of the network and the community labels of the communities to which each node belongs;
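The text does not name a particular overlapping community detection algorithm, so the sketch below starts from that algorithm's output. It shows, under assumed names (`community_labels`, `overlapping_nodes`) and a hypothetical list of communities, how the per-node community labels could be stored and how overlapping nodes (nodes with more than one label) could be read off.

```python
def community_labels(communities):
    """Map each node to the set of indices of the communities containing it."""
    labels = {}
    for label, members in enumerate(communities):
        for v in members:
            labels.setdefault(v, set()).add(label)
    return labels


def overlapping_nodes(labels):
    """Nodes that carry more than one community label."""
    return sorted(v for v, node_labels in labels.items() if len(node_labels) > 1)


# Hypothetical detection result: three communities sharing nodes 3 and 5.
communities = [{1, 2, 3}, {3, 4, 5}, {5, 6}]
labels = community_labels(communities)
overlap = overlapping_nodes(labels)
```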

Step 2, individual coding

For the social network, a fixed number of seed nodes is selected to form an individual, and the individual is integer-encoded, giving the individual X = {x₁, x₂, …, x_k} that represents a seed node combination, where x_i denotes a node number in the network and k denotes the number of seed nodes selected in the social network;

Step 3, initialization

Step 3.1, define the population size as pop, the maximum number of iterations as maxgen, the initial iteration count gen = 0, and the parameter controlling how often local search is performed as m;

Step 3.2, calculating the cost performance index of the node:

First, calculate the structure degree structDegree(v) and the cost C(v) of the node, and then calculate the cost performance index of the node;

Step 3.2.1, calculating node structure degree

In formula (1), L_v denotes the community labels of node v, and N_v denotes the set of neighbor nodes of node v;

indicating the overlapping importance of the node v itself,

represents the overlapping importance of the neighbors of node v;

Step 3.2.2, calculate node cost

In formula (2), C_i denotes the cost of node i and d_i denotes the degree of node i; r, m and t are fixed constants: r is used to construct the different cost levels, while m and t measure the cost difference between nodes of different levels and between nodes of the same level, respectively; the larger the degree of a node, the higher its cost;

Step 3.2.3, obtaining the cost performance index of the node

In formula (3), structDegree(v) and C(v) are given by formula (1) and formula (2), respectively;

Step 3.3, let the population contain pop individuals {X₁, X₂, …, X_i, …, X_pop}, where X_i denotes the i-th individual;

Step 3.4, select the top-k nodes in descending order of degree to form an individual, denoted X₀;

Step 3.5, generate individuals from the individual X₀ obtained in step 3.4:

Step 3.5.1, for each gene position of individual X₀, generate a random number r in [0, 1), traversing every gene position;

Step 3.5.2, if the random number r is greater than 0.5, randomly select two nodes v and u from the network;

Step 3.5.3, compare the cost performance indexes cp(v) and cp(u); if the node with the larger cost performance index is not already in the individual, it replaces the node at the gene position; if the node with the larger cost performance index is already in the individual, or the two nodes have the same cost performance index, return to step 3.5.2; continue until all gene positions have been traversed, obtaining the t-th individual X_t = {x₁, x₂, …, x_k};

Step 3.6, repeat step 3.5 until pop individuals {X₁, X₂, …, X_i, …, X_pop} are obtained, forming the initial population P¹ = {X₁, X₂, …, X_i, …, X_pop};
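Steps 3.4 to 3.6 can be sketched in Python as follows. Because formulas (1) to (3) are not reproduced in this text, the cost performance index is passed in as a precomputed dictionary `cp` with hypothetical values. The function names, the tie-breaking by node number in `top_k_by_degree`, and the `max_tries` bound on the "return to step 3.5.2" loop are assumptions of this sketch, added so that it always terminates.

```python
import random


def top_k_by_degree(adj, k):
    """Step 3.4: the k nodes of highest degree form the base individual X0."""
    return sorted(adj, key=lambda v: (-len(adj[v]), v))[:k]


def make_individual(x0, nodes, cp, rng, max_tries=50):
    """Step 3.5: perturb a copy of X0 gene by gene using cost performance values."""
    individual = list(x0)
    for pos in range(len(individual)):
        if rng.random() <= 0.5:
            continue  # r <= 0.5: keep the gene inherited from X0
        for _ in range(max_tries):  # bounded retry is an assumption of this sketch
            v, u = rng.sample(nodes, 2)
            if cp[v] == cp[u]:
                continue  # equal indexes: draw another pair
            better = v if cp[v] > cp[u] else u
            if better not in individual:
                individual[pos] = better
                break
    return individual


def init_population(adj, cp, k, pop, seed=0):
    """Steps 3.4 to 3.6: build pop individuals from the degree-based individual X0."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    x0 = top_k_by_degree(adj, k)
    return [make_individual(x0, nodes, cp, rng) for _ in range(pop)]


# Hypothetical toy network and cost performance values.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 5}, 4: {1}, 5: {3, 6}, 6: {5}}
cp = {1: 0.2, 2: 0.9, 3: 0.4, 4: 0.7, 5: 0.5, 6: 0.8}
population = init_population(adj, cp, k=3, pop=4)
```

Since a replacement is accepted only when the incoming node is not already present, every individual keeps k distinct node numbers.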

Step 3.7, use formula (4) to calculate the 2-dimensional objective function value of the t-th individual X_t in the initial population, comprising: an approximate evaluation of the influence of the nodes, where larger is better, and the cost when the seed node set is S, where smaller is better;

where N_s denotes the nodes covered by the 1-hop range of node s, namely the neighbor nodes of s, and p is the propagation probability;

represents the influence within the 1-hop range of node s, and C_i is the cost of the i-th node in the seed node set S;

Step 3.8, sort the initial population by non-dominated sorting to obtain a sorted population with multiple fronts; calculate the crowding distance of the sorted population with multiple fronts according to the Euclidean distance;
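The sorting in step 3.8 can be illustrated with a small Python sketch over (influence, cost) pairs, where influence is maximized and cost is minimized. The objective values below are hypothetical. Note also that the crowding distance here is the per-objective normalized measure commonly used in NSGA-II, which is an assumption of this sketch; the text only states that the crowding distance is computed according to the Euclidean distance and does not give the exact formula.

```python
def dominates(a, b):
    """a and b are (influence, cost); a dominates b if it is no worse in both
    objectives and strictly better in at least one."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])


def non_dominated_sort(objs):
    """Return a list of fronts, each a sorted list of indices into objs."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(objs, front):
    """NSGA-II style crowding distance for the indices of one front."""
    dist = {i: 0.0 for i in front}
    for m in range(2):  # objective 0 = influence, objective 1 = cost
        ordered = sorted(front, key=lambda i: objs[i][m])
        lo, hi = objs[ordered[0]][m], objs[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")  # boundary points
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (objs[nxt][m] - objs[prev][m]) / (hi - lo)
    return dist


# Hypothetical (influence, cost) values for five individuals.
objs = [(10, 5), (8, 3), (6, 4), (4, 1), (9, 6)]
fronts = non_dominated_sort(objs)
dist = crowding_distance(objs, fronts[0])
```

With these values the first front contains individuals 0, 1 and 3, and the remaining two individuals form the second front.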

Step 4, population evolution

Step 4.1, apply a tournament selection strategy to the sorted population with multiple fronts to obtain a mating pool;
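A possible form of the selection in step 4.1 is sketched below. The text does not specify the tournament size or the comparison rule, so the binary tournament and the crowded comparison (lower front rank wins, ties broken by larger crowding distance) are assumptions of this sketch, as are the function name and the sample values.

```python
import random


def tournament_select(rank, crowd, pool_size, rng):
    """Binary tournament: the lower front rank wins; ties go to the larger
    crowding distance. Returns indices of the selected individuals."""
    n = len(rank)
    pool = []
    for _ in range(pool_size):
        a, b = rng.sample(range(n), 2)
        if (rank[a], -crowd[a]) <= (rank[b], -crowd[b]):
            pool.append(a)
        else:
            pool.append(b)
    return pool


# Hypothetical front ranks (0 = first front) and crowding distances.
rank = [0, 0, 1, 0, 1]
crowd = [float("inf"), 2.0, float("inf"), float("inf"), float("inf")]
pool = tournament_select(rank, crowd, pool_size=5, rng=random.Random(1))
```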

Step 4.2, apply crossover and mutation to the individuals in the mating pool to generate a new population of size pop, called the offspring population;

Step 4.3, calculate the influence and cost of the individuals in the offspring population according to step 3.7;

Step 4.4, merge the parent population and the offspring population, perform non-dominated sorting, calculate the crowding distance of the sorted population according to the Euclidean distance, and denote the current population as P_temp;

Step 4.5, determine by formula (5) whether the current population needs local search; if formula (5) does not hold, go to step 4.6 and do not perform local search; if formula (5) holds, go to step 4.7 to perform local search:

gen|m＝0 (5)

where gen is the current iteration number, m is a preset parameter controlling how often local search is performed, and '|' denotes the modulo operation;
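The condition in formula (5) can be checked with Python's `%` operator, as in the short sketch below. The values of maxgen and m are hypothetical, and the function name is an assumption of this sketch.

```python
def needs_local_search(gen, m):
    """Formula (5): local search runs when gen modulo m equals 0."""
    return gen % m == 0


maxgen, m = 10, 3  # hypothetical settings
schedule = [gen for gen in range(maxgen) if needs_local_search(gen, m)]
# schedule is [0, 3, 6, 9]
```

Because gen starts at 0, the condition already holds in the first iteration, and afterwards local search runs once every m iterations.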

Step 4.6, select pop individuals from the population P_temp as the population of the (gen+1)-th iteration;

Step 4.7, local search

Step 4.7.1, select all individuals of the first front from the population P_temp and sort them in descending order of influence; the first half of the sorted individuals, denoted P_inf, undergoes local search on the influence objective, and the second half, denoted P_cost, undergoes local search on the cost objective;

Step 4.7.2, for each individual in P_inf, sort its nodes in ascending order of structure degree and randomly select a search length l; starting from the first gene position, randomly replace the node at the current gene position with one of its neighbor nodes that is not already in the individual; if the influence of the individual after replacement is greater than before, keep the replacement; otherwise, continue traversing the neighbor nodes; repeat until the traversal length exceeds l, finally obtaining the set of searched individuals, denoted P'_inf;

Step 4.7.3, for each individual in P_cost, sort its nodes in descending order of cost and randomly select a search length l; starting from the first gene position, randomly replace the node at the current gene position with one of its neighbor nodes that is not already in the individual; if the cost of the individual after replacement is less than before, keep the replacement; otherwise, continue traversing the neighbor nodes; repeat until the traversal length exceeds l, finally obtaining the set of searched individuals, denoted P'_cost;

Step 4.7.4, merge the individuals in P_temp, P'_inf and P'_cost, perform non-dominated sorting, calculate the crowding distance of the sorted population according to the Euclidean distance, and select pop individuals from them as the population of the (gen+1)-th iteration;
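One reading of the neighbor-replacement search in steps 4.7.2 and 4.7.3 is sketched below as a single function that takes the gene ordering and the objective as parameters. Several points are assumptions of this sketch rather than details given in the text: the search length l is drawn uniformly from 1 to k, the first improving neighbor is accepted before moving to the next gene position, and the objective is supplied as a "higher is better" `score` callable (so cost is passed negated). The `coverage` function is only a stand-in for the influence estimate of formula (4), which is not reproduced here, and the `struct` and `cost` values are hypothetical.

```python
import random


def local_search(individual, adj, order_key, score, rng):
    """Try neighbor substitutions on the first l genes and keep improving swaps.

    order_key sorts the genes before the search; score is 'higher is better'.
    """
    genes = sorted(individual, key=order_key)
    l = rng.randint(1, len(genes))  # random search length (range is assumed)
    best = score(genes)
    for pos in range(l):
        candidates = sorted(u for u in adj[genes[pos]] if u not in genes)
        rng.shuffle(candidates)  # visit the neighbors in random order
        for u in candidates:
            trial = genes[:pos] + [u] + genes[pos + 1:]
            trial_score = score(trial)
            if trial_score > best:
                genes, best = trial, trial_score
                break  # keep the first improving neighbor, go to the next gene
    return genes


# Hypothetical toy data; coverage is a stand-in, not formula (4).
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 5}, 4: {1}, 5: {3, 6}, 6: {5}}
struct = {1: 0.5, 2: 0.3, 3: 0.9, 4: 0.1, 5: 0.6, 6: 0.2}
cost = {v: float(len(adj[v])) for v in adj}


def coverage(genes):
    """Number of nodes that are seeds or 1-hop neighbors of seeds."""
    return len(set(genes).union(*(adj[v] for v in genes)))


def negative_cost(genes):
    return -sum(cost[v] for v in genes)


rng = random.Random(0)
improved_inf = local_search([4, 6], adj, lambda v: struct[v], coverage, rng)
improved_cost = local_search([1, 3], adj, lambda v: -cost[v], negative_cost, rng)
```

Since only improving swaps are kept, the returned individual is never worse than the input on the objective being searched.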

Step 4.8, assign gen + 1 to gen; repeat step 4 until the maximum number of iterations is reached, obtaining the final population, denoted Lastpop;

Step 4.9, select all individuals in the first front of the population Lastpop; the seed node combinations in this front provide multiple solutions for decision makers with different cost budget requirements.
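To illustrate how a decision maker might use the first front from step 4.9, the sketch below filters a final population down to its non-dominated solutions and then picks the highest-influence seed combination that fits a given budget. The dictionary layout, the function names, the budget rule and all numeric values are hypothetical and serve only as an example.

```python
def first_front(solutions):
    """Keep the solutions not dominated by any other
    (influence is maximized, cost is minimized)."""
    def dominated(a, b):  # True if b dominates a
        return (b["influence"] >= a["influence"] and b["cost"] <= a["cost"]
                and (b["influence"] > a["influence"] or b["cost"] < a["cost"]))
    return [a for a in solutions if not any(dominated(a, b) for b in solutions)]


def best_within_budget(front, budget):
    """Highest-influence solution whose cost does not exceed the budget, or None."""
    affordable = [s for s in front if s["cost"] <= budget]
    return max(affordable, key=lambda s: s["influence"], default=None)


# Hypothetical final population Lastpop.
lastpop = [
    {"seeds": [1, 3], "influence": 10.0, "cost": 5.0},
    {"seeds": [2, 3], "influence": 8.0, "cost": 3.0},
    {"seeds": [2, 5], "influence": 6.0, "cost": 4.0},
    {"seeds": [4, 6], "influence": 4.0, "cost": 1.0},
]
front = first_front(lastpop)
choice = best_within_budget(front, budget=3.5)
```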

Compared with the prior art, the invention has the beneficial effects that:

1. Compared with single-objective methods that fix a cost budget and use a greedy strategy to optimize the seed node combination step by step, the algorithm obtains a set of seed node combinations with different cost budgets at once, and its running time is much shorter than that of the greedy strategy.

2. Compared with multi-objective methods without a fixed cost budget, the algorithm makes effective use of overlapping community structure information and provides an initialization strategy and a local search strategy that effectively improve its performance.

Drawings

FIG. 1 is a flow chart of the algorithm of the present invention;

FIG. 2 illustrates the detection of overlapping communities according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.

As shown in FIG. 1 and FIG. 2, a social network high-quality user identification method based on overlapping community detection is performed according to the following steps:

Step 1, detection of overlapping communities

Step 2, individual coding

For the social network, a fixed number of seed nodes is selected to form an individual, and the individual is integer-encoded, giving the individual X = {x₁, x₂, …, x_k} that represents a seed node combination, where x_i denotes a node number in the network and k denotes the number of seed nodes selected in the social network.

Step 3, initialization

Step 3.2, calculating the cost performance index of the node.

First, calculate the structure degree structDegree(v) and the cost C(v) of the node, and then calculate the cost performance index of the node.

Step 3.2.1, calculating node structure degree

In formula (1), L_v denotes the community labels of node v, and N_v denotes the set of neighbor nodes of node v.

Indicating the overlapping importance of the node v itself,

representing the overlapping importance of the neighbors of node v.

Step 3.2.2, calculate node cost

In formula (2), C_i denotes the cost of node i and d_i denotes the degree of node i; r, m and t are fixed constants: r is used to construct the different cost levels, while m and t measure the cost difference between nodes of different levels and between nodes of the same level, respectively; the larger the degree of a node, the higher its cost.

Step 3.2.3, obtaining the cost performance index of the node

In formula (3), structDegree(v) and C(v) are given by formula (1) and formula (2), respectively.

Step 3.3, let the population contain pop individuals {X₁, X₂, …, X_i, …, X_pop}, where X_i denotes the i-th individual;

Step 3.4, select the top-k nodes in descending order of degree to form an individual, denoted X₀;

Step 3.5, generate individuals from the individual X₀ obtained in step 3.4;

Step 3.5.1, for each gene position of individual X₀, generate a random number r in [0, 1), traversing every gene position.

Step 3.7, use formula (4) to calculate the 2-dimensional objective function value of the t-th individual X_t in the initial population, comprising: an approximate evaluation of the influence of the nodes, where larger is better, and the cost when the seed node set is S, where smaller is better.

Here N_s denotes the nodes covered by the 1-hop range of node s, namely the neighbor nodes of s, and p is the propagation probability.

Represents the influence within the 1-hop range of node s, and C_i is the cost of the i-th node in the seed node set S.

Step 3.8, sort the initial population by non-dominated sorting to obtain a sorted population with multiple fronts; calculate the crowding distance of the sorted population with multiple fronts according to the Euclidean distance.

Step 4, population evolution

Step 4.1, apply a tournament selection strategy to the sorted population with multiple fronts to obtain a mating pool.

Step 4.4, merge the parent population and the offspring population, perform non-dominated sorting, calculate the crowding distance of the sorted population according to the Euclidean distance, and denote the current population as P_temp.

gen|m＝0 (5)

Step 4.7, local search

Step 4.7.1, select all individuals of the first front from the population P_temp and sort them in descending order of influence. The first half of the sorted individuals, denoted P_inf, undergoes local search on the influence objective, and the second half, denoted P_cost, undergoes local search on the cost objective.

Step 4.7.2, for each individual in P_inf, sort its nodes in ascending order of structure degree and randomly select a search length l. Starting from the first gene position, randomly replace the node at the current gene position with one of its neighbor nodes that is not already in the individual; if the influence of the individual after replacement is greater than before, keep the replacement; otherwise, continue traversing the neighbor nodes. Repeat until the traversal length exceeds l, finally obtaining the set of searched individuals, denoted P'_inf.

Step 4.7.3, for each individual in P_cost, sort its nodes in descending order of cost and randomly select a search length l. Starting from the first gene position, randomly replace the node at the current gene position with one of its neighbor nodes that is not already in the individual; if the cost of the individual after replacement is less than before, keep the replacement; otherwise, continue traversing the neighbor nodes. Repeat until the traversal length exceeds l, finally obtaining the set of searched individuals, denoted P'_cost.

Step 4.7.4, merge the individuals in P_temp, P'_inf and P'_cost, perform non-dominated sorting, calculate the crowding distance of the sorted population according to the Euclidean distance, and select pop individuals from them as the population of the (gen+1)-th iteration.

The above description is only the preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A social network high-quality user identification method based on overlapping community detection is characterized by comprising the following steps:

the network is defined as G = (V, E), where V = {v₁, v₂, …, v_i, …, v_n} denotes the users in the social network, v_i denotes the i-th user, and n is the total number of nodes; E = {e_ij | i = 1, 2, …, n; j = 1, 2, …, n} denotes the connections between pairs of nodes, where e_ij indicates whether an edge exists between the i-th node v_i and the j-th node v_j; if e_ij = 1, an edge connects v_i and v_j, and the two nodes are neighbors of each other; if e_ij = 0, there is no edge between v_i and v_j, i.e. the two nodes are not connected;

step 1, detection of overlapping communities

step 2, individual coding

step 3, initialization

step 3.2, calculating the cost performance index of the node:

step 3.2.1, calculating node structure degree

indicating the overlapping importance of the node v itself,

represents the overlapping importance of the neighbors of node v;

step 3.2.2, calculate node cost

step 3.2.3, obtaining the cost performance index of the node

Step 3.5, generating individuals from the individual X₀ obtained in step 3.4;

step 4, population evolution

gen|m＝0 (5)

step 4.7, local search;

2. The social network high-quality user identification method based on overlapping community detection according to claim 1, wherein the local search specifically comprises the following steps:

step 4.7.1, selecting all individuals of the first front from the population P_temp and sorting them in descending order of influence; the first half of the sorted individuals, denoted P_inf, undergoes local search on the influence objective, and the second half, denoted P_cost, undergoes local search on the cost objective;

step 4.7.2, for each individual in P_inf, sorting its nodes in ascending order of structure degree and randomly selecting a search length l; starting from the first gene position, randomly replacing the node at the current gene position with one of its neighbor nodes that is not already in the individual; if the influence of the individual after replacement is greater than before, keeping the replacement; otherwise, continuing to traverse the neighbor nodes; repeating until the traversal length exceeds l, finally obtaining the set of searched individuals, denoted P'_inf;