CN110825935A

CN110825935A - Community core character mining method, system, electronic equipment and readable storage medium

Info

Publication number: CN110825935A
Application number: CN201910914473.5A
Authority: CN
Inventors: 黄萍; 张江华; 潘飞; 吕绪祥; 刘世峰; 高佳
Original assignee: FUJIAN NEW LAND SOFTWARE ENGINEERING Co Ltd
Current assignee: FUJIAN NEW LAND SOFTWARE ENGINEERING Co Ltd
Priority date: 2019-09-26
Filing date: 2019-09-26
Publication date: 2020-02-21

Abstract

The invention discloses a community core character mining method comprising the following steps: acquiring a target user group under a target group number, and acquiring communication data within the target user group; cleaning and converting the communication data to construct a target user communication sequence; dividing the community structure of the target user group from the communication data by using the Louvain algorithm; computing the shortest paths among all nodes of the community with the Dijkstra algorithm, calculating the centrality of each node, and finding the central nodes of the network graph; and calculating the volatility of each central node, wherein the central node with low volatility is the community core of the target user group. The whole process is unsupervised: the final structure depends entirely on the algorithm's clustering, and no manually preset classification is needed. The method places no strict requirement on the size of the graph, converges quickly after several iterations, and is computationally efficient.

Description

Community core character mining method, system, electronic equipment and readable storage medium

Technical Field

The invention relates to the technical field of data mining, in particular to a community core character mining method and system, electronic equipment and a readable storage medium.

Background

Phone calls are one of the most common ways in which people keep in contact today, and the relationships between people can be mined from call records. For existing illegal organizations, such as illegal distribution and marketing schemes, call records can be used to identify the people involved in illegal activities and to single out the core people among them. Existing community discovery algorithms include the following:

In the first approach, community division is realized by analyzing the similarity of user messages. A specific message content library is established; users' messages are compared against the library, and according to the degree of similarity the messages are mapped to specific users, the users are classified, and corresponding weights are set so as to identify core users. Taking each core user as a node, the core communication circle of that user is then divided out of the pairwise-connected communication network. However, this method requires a purpose-built message library and a message similarity analysis. Where there is little message exchange, community discovery and analysis of the group's organizational structure cannot be carried out. Where the exchanged messages have no distinctive features, a message library with distinctive features cannot be built, and the effectiveness of the algorithm drops sharply. Moreover, analyzing only the similarity of users' messages does not allow communication frequency to be analyzed, so the core people of the group cannot be mined.

In the second approach, community division is realized by expanding a topological graph built from communication data in the telephone network. The edge with the highest weight is selected from the current topological graph, and that edge together with its two nodes is regarded as an interaction circle. For each user node in the topological graph, the neighbor node with the greatest degree of attribution is found; if that degree of attribution exceeds a preset value the interaction circle is enlarged, and otherwise the enlargement stops. With this approach, if the designated user is not the core of the group, the expanded interaction circle may not meet the requirements, and the procedure as a whole is too coarse.

Disclosure of Invention

The invention aims to provide a method and a device for accurately mining a group organization structure and a core character of a mobile user.

In order to solve the technical problems, the technical scheme of the invention is as follows:

a community core character mining method comprises the following steps:

acquiring a target user group under a target group number, and acquiring communication data in the target user group;

cleaning and converting the communication data to construct a target user communication sequence;

dividing the community structure of the target user group by using a Louvain algorithm through the communication data;

making shortest paths among all nodes of the community through a Dijkstra algorithm, calculating the centrality of each node, and exploring a central node of a network graph;

and calculating the volatility of each central node, wherein the central node with small volatility is the community core of the target user group.

Preferably, the process of dividing the community structure of the target user group by using the Louvain algorithm through the communication data includes:

S31: each target user is taken as a node, and each node initially belongs to its own community;

S32: for any node i, calculating the change ΔQ in the modularity of the whole network after node i is merged into each adjacent community, and merging node i into the community with the largest ΔQ; if the calculated ΔQ is negative, the community of node i is left unchanged;

S33: repeating step S32 until moving any node to another adjacent community in the network no longer improves ΔQ;

S34: merging communities: compressing each obtained community into a single node, and assigning to each edge between community nodes a new weight equal to the sum of the edge weights of all node pairs in the original communities.

Preferably, the process of making the shortest path between the nodes of the community by dijkstra algorithm comprises:

S41: for each node v₀ in the community, let S = {v₀} and T = {all other nodes};

S42: calculating the distance from v₀ to every vertex v_i in T: if there is an arc between v₀ and v_i, the distance value of v_i is the weight on that arc; if there is no arc, the distance value of v_i is infinite;

S43: selecting from T the vertex w with the smallest distance value, and adding w to the set S;

S44: updating the distance values of the vertices remaining in T: if using w as an intermediate vertex shortens the distance from v₀ to v_i, the distance value of v_i is updated;

S45: repeating steps S43 and S44 until all vertices are contained in S.

Preferably, the process of calculating the centrality of each node comprises:

for each pair of nodes (s, t) within the community, calculating all shortest paths between them;

for each pair of nodes (s, t) in the community, judging whether the node v is on the solved shortest path;

accumulating over the shortest paths to calculate the node betweenness centrality of the node v:

$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

wherein σ_st is the number of shortest paths from s to t, and σ_st(v) is the number of those shortest paths that pass through node v;

and calculating the node betweenness centrality of all the nodes.

Preferably, the process of calculating the centrality of each node further comprises: calculating the betweenness centrality of each edge:

calculating all shortest paths between node pairs in the community;

judging whether the edge e is on the shortest path;

accumulating over the shortest paths to obtain the betweenness centrality of the edge e:

$$C_B(e) = \sum_{s \neq t} \frac{\sigma_{st}(e)}{\sigma_{st}}$$

wherein σ_st is the number of shortest paths from s to t in graph G, and σ_st(e) is the number of those shortest paths that contain edge e;

and calculating the betweenness centrality of all edges.
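The two betweenness calculations above can be sketched together in Python. This is a minimal illustration, not the patent's implementation: it assumes an undirected graph stored as a hypothetical dict-of-dicts `graph`, strictly positive edge weights that can be compared exactly (integers here), and it counts each unordered pair (s, t) once. The helper names `shortest_path_counts` and `betweenness` are made up for the example.

```python
import heapq
import math
from itertools import combinations

def shortest_path_counts(graph, source):
    """Dijkstra from `source`; returns (dist, sigma), sigma[v] = number of shortest paths."""
    dist = {n: math.inf for n in graph}
    sigma = {n: 0 for n in graph}
    dist[source], sigma[source] = 0, 1
    heap, done = [(0, source)], set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:              # strictly shorter route found
                dist[v], sigma[v] = nd, sigma[u]
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:           # another route of equal length
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(graph):
    """Return (node_score, edge_score) accumulated over unordered node pairs."""
    info = {s: shortest_path_counts(graph, s) for s in graph}
    edges = {tuple(sorted((u, w))) for u in graph for w in graph[u]}
    node_score = {v: 0.0 for v in graph}
    edge_score = {e: 0.0 for e in edges}
    for s, t in combinations(graph, 2):
        (dist_s, sigma_s), (dist_t, sigma_t) = info[s], info[t]
        total = sigma_s[t]                # sigma_st; zero if t is unreachable
        if total == 0:
            continue
        for v in graph:                   # is v on a shortest s-t path?
            if v not in (s, t) and dist_s[v] + dist_t[v] == dist_s[t]:
                node_score[v] += sigma_s[v] * sigma_t[v] / total
        for u, w in edges:                # is edge {u, w} on a shortest s-t path?
            for a, b in ((u, w), (w, u)):
                if dist_s[a] + graph[a][b] + dist_t[b] == dist_s[t]:
                    edge_score[(u, w)] += sigma_s[a] * sigma_t[b] / total
    return node_score, edge_score

# Toy community: a diamond a-b-d, a-c-d with a tail d-e, all weights 1.
g = {
    "a": {"b": 1, "c": 1},
    "b": {"a": 1, "d": 1},
    "c": {"a": 1, "d": 1},
    "d": {"b": 1, "c": 1, "e": 1},
    "e": {"d": 1},
}
node_score, edge_score = betweenness(g)
print(max(node_score, key=node_score.get))   # node with the highest betweenness
```

In this toy graph node `d` lies on the most shortest paths, so it is the natural candidate central node. The pairwise loop is quadratic in the number of nodes and is meant for clarity on small communities only.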

Preferably, the process of calculating the volatility of each said central node comprises:

by calculating the standard deviation of node v from all other nodes in the network:

wherein:

wherein the smaller the standard deviation, the smaller the fluctuation and the closer the node is to the center; such a node is the community core.
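One possible reading of this volatility measure is sketched below. It is an assumption, not the patent's formula: the "standard deviation of node v from all other nodes" is interpreted as the population standard deviation of the shortest-path distances from v to every other node. The function name `volatility` and the distance table are hypothetical.

```python
from statistics import pstdev

def volatility(dist_from_v, v):
    """Population standard deviation of the distances from v to all other nodes.

    Assumption: `dist_from_v` maps every node to its shortest-path distance from v.
    """
    others = [d for node, d in dist_from_v.items() if node != v]
    return pstdev(others)

# Hypothetical shortest-path distances for two candidate central nodes.
distances = {
    "d": {"a": 2, "b": 1, "c": 1, "d": 0, "e": 1},
    "b": {"a": 1, "b": 0, "c": 2, "d": 1, "e": 2},
}
scores = {v: volatility(dist, v) for v, dist in distances.items()}
core = min(scores, key=scores.get)   # smallest volatility is taken as the community core
print(core, scores)
```

Under this reading, a node whose distances to the rest of the community are nearly uniform scores low and is selected as the core.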

Preferably, the step of acquiring the target user group under the target group number comprises: filtering out the numbers of target users whose number status is invalid, joining the call detail record table of the target users, acquiring the call records within one month of the target users whose status is valid, and generating a call record sequence for the target users, wherein each call record comprises: user identification, called number and call duration.
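A minimal sketch of this filtering step follows. The record layout, the status strings and the function name `build_call_sequence` are hypothetical. The sketch also assumes that the records are already restricted to the one-month window, that only calls between two valid group members are kept, and that zero-length calls are dropped during cleaning.

```python
from collections import namedtuple

# Hypothetical record layout mirroring the three fields named in the text.
CallRecord = namedtuple("CallRecord", ["user_id", "called_number", "duration_s"])

def build_call_sequence(group_numbers, number_status, records):
    """Keep calls between group members whose number status is valid."""
    valid = {n for n in group_numbers if number_status.get(n) == "valid"}
    return [
        (r.user_id, r.called_number, r.duration_s)
        for r in records
        if r.user_id in valid and r.called_number in valid and r.duration_s > 0
    ]

group = ["1001", "1002", "1003"]
status = {"1001": "valid", "1002": "valid", "1003": "invalid"}
records = [
    CallRecord("1001", "1002", 65),
    CallRecord("1002", "1003", 30),   # callee's number status is invalid
    CallRecord("1001", "9999", 12),   # callee is outside the target group
    CallRecord("1002", "1001", 0),    # zero-length call, dropped (assumption)
]
sequence = build_call_sequence(group, status, records)
print(sequence)
```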

In a second aspect, the present invention further provides a system for mining community core people, including:

an acquisition module: acquiring a target user group under a target group number, and acquiring communication data in the target user group;

a cleaning module: cleaning and converting the communication data to construct a target user communication sequence;

a dividing module: dividing the community structure of the target user group by using a Louvain algorithm through the communication data;

a central node module: making shortest paths among all nodes of the community through a Dijkstra algorithm, calculating the centrality of each node, and exploring a central node of a network graph;

a core mining module: and calculating the volatility of each central node, wherein the central node with small volatility is the community core of the target user group.

In a third aspect, the present invention further provides an electronic device for community core people mining, including a memory, a processor and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the community core people mining method when executing the program.

In a fourth aspect, the present invention further provides a readable storage medium for community core persona mining, having a computer program stored thereon, the computer program being executed by a processor to implement the steps of the community core persona mining method described above.

The invention provides a method that, based on operator big data on calls, applies the Louvain algorithm and the Dijkstra algorithm to determine node centrality and calculate node volatility, thereby realizing automatic community division and automatic identification of community core members. The community structure obtained by the algorithm is hierarchical: the new graph obtained after each round of calculation is the result of discovering several finer communities within a larger community. Such hierarchy is a natural attribute of networks, so it allows researchers to gain deeper insight into the internal structure and formation mechanism of a given community. The invention uses the Louvain algorithm, and the whole process is unsupervised: the final structure depends entirely on the algorithm's clustering, and no manually preset classification is needed. Compared with some classical community detection algorithms, the Louvain algorithm places almost no upper limit on the size of the graph and converges quickly after several iterations, so it is highly efficient.

Drawings

FIG. 1 is a flowchart of an embodiment of a community core people mining method of the present invention;

FIG. 2 is a flowchart of step S30 in FIG. 1;

FIG. 3 is a flowchart of step S40 in FIG. 1;

FIG. 4 is a schematic diagram illustrating the principle of dividing the community structure of the target user group by the Louvain algorithm in the present invention.

Detailed Description

The following further describes embodiments of the present invention with reference to the drawings. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Referring to fig. 1, 2 and 3, an embodiment of the present invention provides a community core character mining method, including:

S10: acquiring a target user group under a target group number, and acquiring communication data within the target user group. Specifically: filtering out the numbers of target users whose number status is invalid, joining the call detail record table of the target users, acquiring the call records within one month of the target users whose status is valid, and generating a call record sequence for the target users, wherein each call record comprises: user identification, called number and call duration.

S20: cleaning and converting the communication data to construct a target user communication sequence;

S30: dividing the community structure of the target user group from the communication data by using the Louvain algorithm.

Each target user is taken as a node, and each node initially belongs to its own community. For any node i, the change ΔQ in the modularity of the whole network after node i is merged into each adjacent community is calculated, and node i is merged into the community with the largest ΔQ; if the calculated ΔQ is negative, the community of node i is left unchanged. The previous step is repeated until moving any node to another adjacent community in the network no longer improves ΔQ. Communities are then merged: each obtained community is compressed into a single node, and each edge between community nodes is assigned a new weight equal to the sum of the edge weights of all node pairs in the original communities.

S40: making shortest paths among all nodes of the community through a Dijkstra algorithm, calculating the centrality of each node, and exploring a central node of a network graph;

For each node v₀ in the community, let S = {v₀} and T = {all other nodes}. Calculate the distance from v₀ to every vertex v_i in T: if there is an arc between v₀ and v_i, the distance value of v_i is the weight on that arc; if there is no arc, the distance value of v_i is infinite. Select from T the vertex w with the smallest distance value and add w to the set S. Update the distance values of the vertices remaining in T: if using w as an intermediate vertex shortens the distance from v₀ to v_i, update the distance value of v_i. Repeat the previous two steps until all vertices are contained in S.

Calculating the centrality of the nodes includes calculating the node betweenness centrality of all nodes and calculating the betweenness centrality of each edge.

The process of calculating the centrality of each node comprises:

the process of calculating the centrality of each node further comprises: calculating the betweenness centrality of each edge:

calculating all shortest paths between node pairs in the community;

judging whether the edge e is on the shortest path;

and calculating the betweenness centrality of all edges.

S50: calculating the volatility of each central node, wherein the central node with low volatility is the community core of the target user group. The volatility is obtained by calculating the standard deviation of node v relative to all other nodes in the network:

wherein:

In another embodiment of the invention, the target user group under a group number is retrieved according to the group number, restricted to users whose mobile phone status is valid; the users' call detail record table is then joined; and finally the communication data of the target users is extracted. The test data covers a period of one month, and the statistical period can subsequently be chosen according to the specific situation.

Community discovery is then performed, as shown in FIG. 4: 1) initialization: each node in the network is assumed to belong to its own community; 2) for any node i, the change ΔQ in the Q value of the whole network after node i is merged into each adjacent community is calculated, and the community giving the largest change in Q is found; if ΔQ is negative, the community of node i is left unchanged;

can be simplified as follows:

wherein k_{i,in} represents the sum of the weights of the edges from node i incident on cluster C, Σ_tot represents the total weight of the edges incident on cluster C, and k_i represents the total weight of the edges incident on node i.
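As a worked sketch rather than the patent's own expression, the gain can be derived under the standard weighted modularity $Q = \frac{1}{2m}\sum_{i,j}\left[A_{i,j} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$. The derivation assumes that m is the total edge weight, that node i has been taken out of its previous community and carries no self-loop, and that Σ_in (a symbol introduced here for illustration) denotes the sum of A over ordered node pairs inside C:

```latex
\begin{aligned}
\Delta Q &= \left[\frac{\Sigma_{in} + 2k_{i,in}}{2m} - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^2\right]
          - \left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2\right] \\
         &= \frac{k_{i,in}}{m} - \frac{\Sigma_{tot}\,k_i}{2m^2}
\end{aligned}
```

Because every candidate community shares the factor 1/m, comparing k_{i,in} − Σ_tot·k_i/(2m) across neighboring communities is enough to pick the best move. Other presentations normalize k_{i,in} differently, so the constants may differ from the simplified form in the original filing.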

3) step 2 is repeated until the Q value no longer changes, that is, until moving any node to another adjacent community in the network no longer improves ΔQ and no node in the current network moves any more; 4) communities are merged: in this step the original graph is compressed, each community obtained in the previous steps becomes a node of the new graph, and each edge of the new graph is given a new weight equal to the sum of the edge weights of all node pairs in the original communities.

The above steps comprise two stages: optimizing the Q value, and merging the communities obtained by that round of division to obtain a new graph. The two stages together are called one round. After a round finishes, the algorithm automatically enters the first stage of the next round. After several iterations the Q value of the network no longer increases; at that point the network has been aggregated into a number of small communities that are tightly connected internally and sparsely connected externally, and the algorithm terminates.
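The two stages can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patent's implementation: the graph is a hypothetical symmetric dict-of-dicts `adj`, merged communities keep their internal weight as a self-loop, and the move criterion is the standard modularity gain up to a constant factor. The names `local_moving`, `aggregate` and `louvain` are made up for the example.

```python
from collections import defaultdict

def local_moving(adj):
    """Stage 1: move each node to the neighboring community with the largest gain."""
    m2 = sum(w for nbrs in adj.values() for w in nbrs.values())   # equals 2m
    k = {n: sum(adj[n].values()) for n in adj}                    # weighted degree k_i
    comm = {n: n for n in adj}                                    # every node starts alone
    tot = dict(k)                                                 # Sigma_tot per community
    improved = True
    while improved:
        improved = False
        for i in adj:
            old = comm[i]
            k_in = defaultdict(float)             # weight from i to each neighboring community
            for j, w in adj[i].items():
                if j != i:
                    k_in[comm[j]] += w
            tot[old] -= k[i]                      # take i out of its community
            best, best_gain = old, k_in[old] - tot[old] * k[i] / m2
            for c, kin in k_in.items():
                gain = kin - tot[c] * k[i] / m2   # proportional to delta Q
                if gain > best_gain:
                    best, best_gain = c, gain
            tot[best] += k[i]
            comm[i] = best
            improved = improved or best != old
    return comm

def aggregate(adj, comm):
    """Stage 2: compress each community into one node and sum the edge weights."""
    new_adj = defaultdict(lambda: defaultdict(float))
    for i, nbrs in adj.items():
        for j, w in nbrs.items():
            new_adj[comm[i]][comm[j]] += w
    return {c: dict(nbrs) for c, nbrs in new_adj.items()}

def louvain(adj):
    """Repeat rounds until a round merges nothing; returns {node: community label}."""
    membership = {n: n for n in adj}
    while True:
        comm = local_moving(adj)
        if len(set(comm.values())) == len(adj):   # no node changed community
            return membership
        membership = {n: comm[c] for n, c in membership.items()}
        adj = aggregate(adj, comm)

# Two triangles joined by a single bridge edge (3-4), unit weights.
pairs = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
graph = defaultdict(dict)
for a, b in pairs:
    graph[a][b] = graph[b][a] = 1.0
result = louvain(dict(graph))
print(result)
```

On this toy graph the sketch groups each triangle into its own community. Production implementations typically randomize the node order and apply a minimum-gain threshold, both of which are omitted here.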

It should be noted that the Louvain algorithm is a community discovery algorithm based on modularity, and the algorithm performs well in efficiency and effect, and can discover a hierarchical community structure, and the optimization goal is to maximize the modularity of the whole community network.

Modularity is defined as:

$$Q = \frac{1}{2m}\sum_{i,j}\left[A_{i,j} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$$

$$A_{i,j} = freq_{i,j} \cdot \log\left(\sum time\right)$$

wherein A_{i,j} represents the weight of the edge connecting nodes i and j; freq_{i,j} represents the frequency of calls between nodes i and j; time represents the duration of each call; m represents the number of edges in the network (for a weighted graph, the sum of the edge weights); k_i represents the sum of the weights of all edges connected to node i; c_i is the community to which node i belongs; and δ(c_i, c_j) takes the value 1 when its two arguments are the same and 0 otherwise.
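The edge weight and the modularity value can be computed as in the sketch below. Several points are assumptions rather than statements from the text: the logarithm is taken as natural, calls are treated as undirected, m is taken as the total edge weight, and Q is evaluated with the standard weighted modularity grouped per community. The function names and the input layout are hypothetical.

```python
import math
from collections import defaultdict

def edge_weights(calls):
    """calls: iterable of (caller, callee, duration). Returns {(i, j): A_ij}."""
    freq = defaultdict(int)
    total_time = defaultdict(float)
    for a, b, duration in calls:
        key = tuple(sorted((a, b)))      # undirected edge (assumption)
        freq[key] += 1
        total_time[key] += duration
    # A_ij = freq_ij * log(sum of durations); non-positive if the total is <= 1
    return {key: freq[key] * math.log(total_time[key]) for key in freq}

def modularity(weights, community):
    """Q = sum over communities of Sigma_in/(2m) - (Sigma_tot/(2m))^2."""
    m = sum(weights.values())
    internal = defaultdict(float)        # A summed over ordered pairs inside a community
    total = defaultdict(float)           # sum of k_i over a community's members
    for (i, j), w in weights.items():
        total[community[i]] += w
        total[community[j]] += w
        if community[i] == community[j]:
            internal[community[i]] += 2 * w
    return sum(internal[c] / (2 * m) - (total[c] / (2 * m)) ** 2 for c in total)

w = edge_weights([("a", "b", 60), ("b", "a", 40)])
print(w)

# Two unit-weight triangles joined by one bridge edge, split into two communities.
unit = {e: 1.0 for e in [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]}
split = {1: "x", 2: "x", 3: "x", 4: "y", 5: "y", 6: "y"}
print(modularity(unit, split))
```

The per-community form used in `modularity` is algebraically the same as the pairwise sum with δ(c_i, c_j), and it avoids iterating over every node pair.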

The community structure obtained by the Louvain algorithm is hierarchical: the new graph obtained after each round of calculation is the result of discovering several finer communities within a larger community. Such hierarchy is a natural attribute of networks, so it allows researchers to gain deeper insight into the internal structure and formation mechanism of a given community. The whole calculation process of the algorithm is unsupervised: the final structure depends entirely on the algorithm's clustering, and no manually preset classification is needed. The algorithm performs well; compared with some classical community detection algorithms, the Louvain algorithm places almost no upper limit on the size of the graph and converges quickly after several iterations.

Core members are mined from the social network graph of the community; the problem is thus abstracted as that of mining the central nodes of a complex network graph.

In a complex network, the connection mechanism can be measured through the centrality of nodes, which gives a reasonable explanation of real-world phenomena. Studies of complex networks use different definitions of centrality: degree centrality, node betweenness centrality, closeness centrality, edge betweenness centrality, eigenvector centrality, and so on.

The embodiment of the invention uses node betweenness centrality and edge betweenness centrality; in addition, a custom measure of the volatility of a node's central position is defined as needed.

First the shortest paths of the graph are computed. A shortest path is the path, among all paths along the edges of the graph from one vertex to another, whose edge weights have the smallest sum.

In the embodiment of the invention, the Dijkstra algorithm is used to calculate the shortest paths from one node to all other nodes. Its main characteristic is that it expands outward layer by layer from the starting point until the end point is reached.

The algorithm comprises the following steps:

1. Initially, let S = {v₀} and T = {all other nodes};

2. Calculate the distance from v₀ to every vertex v_i in T:

if there is an arc between v₀ and v_i, the distance value of v_i is the weight on that arc;

if there is no arc between v₀ and v_i, the distance value of v_i is infinite;

3. Select from T the vertex w with the smallest distance value, and add w to the set S;

4. Update the distance values of the vertices remaining in T: if using w as an intermediate vertex shortens the distance from v₀ to v_i, update the distance value of v_i;

5. Repeat steps 3 and 4 until all vertices are contained in S.
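The five steps above can be sketched in Python with a priority queue. The sketch assumes a hypothetical dict-of-dicts `graph` with non-negative weights and treats each edge weight directly as a distance, as the steps do. The heap replaces the linear scan for the minimum in step 3 but selects the same vertex.

```python
import heapq
import math

def dijkstra(graph, source):
    """Return the shortest-path distance from `source` to every node."""
    dist = {node: math.inf for node in graph}    # step 2: no arc means infinity
    dist[source] = 0.0
    settled = set()                              # the set S
    heap = [(0.0, source)]
    while heap:
        d, w = heapq.heappop(heap)               # step 3: closest vertex still in T
        if w in settled:
            continue
        settled.add(w)
        for v, weight in graph[w].items():       # step 4: try w as an intermediate vertex
            candidate = d + weight
            if candidate < dist[v]:
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist                                  # step 5: loop ends when T is exhausted

toy = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"a": 1.0, "c": 2.0},
    "c": {"a": 4.0, "b": 2.0, "d": 1.0},
    "d": {"c": 1.0},
}
print(dijkstra(toy, "a"))
```

Running the function once per node yields the all-pairs distances needed for the centrality calculations.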

The process of calculating the centrality of each node comprises:

calculating all shortest paths between node pairs in the community;

judging whether the edge e is on the shortest path;

wherein σ_st is the number of shortest paths from s to t in graph G, and σ_st(e) is the number of those shortest paths that contain edge e;

and calculating the betweenness centrality of all edges.

wherein:

The invention also provides a system for mining the community core character, which comprises the following steps:

The system for mining the community core people can also realize the community core people mining method.

The invention also provides electronic equipment for mining the community core characters, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the steps of the community core character mining method are realized when the processor executes the program.

The invention also proposes a readable storage medium for community core persona mining, on which a computer program is stored, the computer program being executed by a processor for implementing the steps of the community core persona mining method described above.

The invention provides a method that, based on operator big data on calls, applies the Louvain algorithm and the Dijkstra algorithm to determine node centrality, realizing automatic community division and automatic identification of community core members. In addition, the results obtained by the algorithm are hierarchical: the new graph obtained after each round of calculation is the result of discovering several finer communities within a larger community. Such hierarchy is a natural attribute of networks, so it allows researchers to gain deeper insight into the internal structure and formation mechanism of a given community.

The invention brings several beneficial effects. The community structure obtained by the algorithm is hierarchical: the new graph obtained after each round of calculation is the result of discovering several finer communities within a larger community. Such hierarchy is a natural attribute of networks, so it allows researchers to gain deeper insight into the internal structure and formation mechanism of a given community. The invention uses the Louvain algorithm, and the whole process is unsupervised: the final structure depends entirely on the algorithm's clustering, and no manually preset classification is needed. Compared with some classical community detection algorithms, the Louvain algorithm places almost no upper limit on the size of the graph and converges quickly after several iterations, so it is highly efficient.

The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, and the scope of protection is still within the scope of the invention.

Claims

1. A community core character mining method is characterized by comprising the following steps:

2. The community core character mining method according to claim 1, wherein the process of dividing the community structure of the target user group by using the Louvain algorithm through the communication data comprises:

s31: each target user is taken as a node and belongs to a community;

3. The community core character mining method according to any one of claims 1 or 2, wherein the process of making the shortest path between the community nodes by dijkstra algorithm comprises:

S41: for each node v₀ in the community, let S = {v₀} and T = {all other nodes};

S42: calculating the distance from v₀ to every vertex v_i in T: if there is an arc between v₀ and v_i, the distance value of v_i is the weight on that arc; if there is no arc, the distance value of v_i is infinite;

4. The community core character mining method according to claim 3, wherein the process of calculating the centrality of each node comprises:

and calculating the node betweenness centrality of all the nodes.

5. The community core character mining method according to claim 3, wherein the process of calculating the centrality of each node further comprises: calculating the betweenness centrality of each edge:

calculating all shortest paths between node pairs in the community;

judging whether the edge e is on the shortest path;

and calculating the betweenness centrality of all edges.

6. The community core character mining method according to claim 5, wherein the process of calculating the volatility of each of the central nodes comprises:

wherein:

7. The community core character mining method according to claim 1, wherein a target user group under a target group number is obtained, and the process of obtaining the communication data inside the target user group comprises: filtering the number with invalid number state in the target user, associating a bill list table of the target user, acquiring a call list of the target user with valid state in one month, and generating a call list sequence of the target user, wherein the call list comprises: user identification, called number and call duration.

8. A system for community core persona mining, comprising:

9. An electronic device for community core persona mining, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein: the processor when executing the program performs the steps of the community core people mining method of any of claims 1-7.

10. A readable storage medium for community core persona mining, having a computer program stored thereon, characterized by: the computer program is executed by a processor to perform the steps of the community core persona mining method of any one of claims 1-7.