CN106886524A

CN106886524A - A social network community division method based on random walk

Info

Publication number: CN106886524A
Application number: CN201510938101.8A
Authority: CN
Inventors: 张贤坤; 宋琛; 牛渊博; 高新雅; 任静; 熬阳月
Original assignee: Tianjin University of Science and Technology
Current assignee: Tianjin University of Science and Technology
Priority date: 2015-12-15
Filing date: 2015-12-15
Publication date: 2017-06-23

Abstract

The present invention relates to a social network community division method based on random walk and belongs to the field of social networks. The invention addresses the excessive randomness of the conventional label propagation algorithm during label updating. First, the idea of random walk is introduced to compute a matrix that measures the similarity between network nodes. Second, during label propagation, when several labels tie for the highest frequency among a node's neighbours, the label is not chosen at random; instead, the node adopts the label held by the neighbour with the highest similarity, which prevents labels from spreading arbitrarily between communities. Finally, experiments on real networks and artificial benchmark networks show that the algorithm performs better than the original algorithm in community detection. The social network community division method according to embodiments of the invention divides user groups by improving the label propagation algorithm on the basis of the similarity of personal relationships, and the division results have good application value for network public opinion monitoring, commercial user mining and similar tasks.

Description

A social network community division method based on random walk

Technical field

The present invention relates to the field of social networks, and in particular to a social network community division method based on random walk.

Background art

A network can be divided into several communities according to the connections between its nodes: nodes within a community are relatively densely connected, while connections between communities are sparser. Community detection has important application value for network public opinion monitoring, security early warning, e-commerce and so on. For example, friends recommended by chat software belong to the same community; shopping websites recommend goods of different styles to users in different communities; and a public security system can take immediate action when it detects that the frequency of terms such as "parade" rises within a cult community. Research on community detection can therefore yield a large amount of reliable and valuable information.

Research on community detection has made considerable progress in recent years, and many scholars have proposed new theories and methods. These methods fall broadly into four classes: graph partitioning methods, the W-H algorithm, hierarchical clustering methods and label propagation algorithms.

Graph partitioning is commonly used in computer science and is based on iterative bisection: each division splits the network into two optimal subgraphs, and the subgraphs are bisected again until the required number is reached. Graph partitioning methods can be roughly divided into two kinds: spectral bisection based on the Laplacian matrix, and the Kernighan-Lin algorithm. Their drawback is that each division only bisects the network, so repeated iteration is needed to obtain a result.

To solve this problem, Wu and Huberman proposed the W-H algorithm: two nodes in different communities are chosen and set as a source with voltage 1 and a sink with voltage 0, the resistance of every edge is set to 1, and the other nodes then obtain different voltage values. Nodes with similar voltage values are assigned to the same community. The drawback of the W-H algorithm is that some prior information about the community structure must be known before division, to ensure that the source and the sink are not in the same community.

Hierarchical clustering methods divide communities according to the connections and similarity between nodes, and can be further divided into agglomerative and divisive methods, represented by the Newman fast algorithm and the G-N algorithm respectively. However, because communities contain many nodes of very low similarity, hierarchical clustering often ignores these nodes, and the final result is unsatisfactory.

Compared with the former classes of methods, the label propagation algorithm (Label Propagation Algorithm, LPA) does not need to know the network structure or any prior community structure. It relies only on the propagation characteristics of the network and divides communities very efficiently, which has attracted wide attention from scholars at home and abroad. Its time complexity is close to linear, O(m), where m is the number of edges, and for fairly large networks (10^6 to 10^9 nodes) it begins to converge after 5 iterations. In addition, label propagation neither optimizes a predefined objective function nor requires prior information such as the number and size of communities, and it places no limit on community size. It has therefore become one of the more widely used community detection algorithms and is applied in fields such as multimedia information classification and virtual community mining.

The label propagation algorithm is accurate and efficient. However, during propagation, when several labels tie for the highest frequency among a node's neighbours, it treats every node equally and picks one of the most frequent labels at random. This randomness lets labels spread between different communities, so the division result is unstable and strongly random, and robustness needs to be improved. In summary, existing community detection methods still have considerable room for improvement in accuracy and time complexity.

Summary of the invention

The object of the present invention is to provide a social network community division method based on random walk that improves the accuracy and stability of network community division.

To achieve the above object, the technical solution of the present invention is a social network community division method based on random walk, comprising the following steps:

Step A: Read the social network data and construct a social network graph in which individuals are nodes and the relationships between individuals are edges;

Step B: Improve the random walk algorithm and compute the node similarity matrix with the improved, superposed algorithm;

Step C: Initialize the communities: shuffle the node order and assign each user node a label value, where the label value identifies the community to which the node belongs;

Step D: Update the labels: for a node x, compute the frequency of each label among its adjacent nodes and update the label of x to the most frequent one. If several labels tie for the highest frequency, compare the similarity values in the row of the similarity matrix that corresponds to node x and adopt the label held by the adjacent node with the greatest similarity; if several adjacent nodes share the greatest similarity, choose one of them at random;

Step E: Judge whether the stop condition is met, namely that the specified number of iterations has been reached or the label values have become stable after several iterations. Otherwise, return to step D and continue updating labels;

Step F: Classify all nodes with the same label into one community.

Further, in step B above, the random walk algorithm is improved as follows. In the traditional random walk algorithm there is only one random walker. Because of the Markov property of the random walk, the similarity finally obtained can be highly uncertain. To eliminate this error, the improved algorithm releases t walkers one after another, at intervals of Δt = 1, within the given time period t.

Further, in step B above, a new formula is proposed for computing the random walk similarity:

The social network graph is abstracted as a simple undirected graph G(N, E), where N is the set of nodes and E is the set of edges;

Let P_xy denote the probability that a walker starting from node x reaches node y in one step:

P_xy = a_xy / k_x

where k_x is the degree of node x and a_xy is the corresponding entry of the adjacency matrix;

Let π_xy(t) denote the probability that a walker starting from node x reaches node y in t steps. π_xy(t) can be obtained recursively from the one-step transition matrix: π_x(t) is the x-th column of the matrix π(t), and π_xy(t) is the y-th entry of π_x(t). π_x(0) is an n×1 vector whose x-th entry is 1 and whose other entries are 0,

π_x(t) = P^T π_x(t-1)

where P^T is the transpose of the matrix P;
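The recursion above can be illustrated with a minimal Python sketch. It assumes the one-step probability is P_xy = a_xy / k_x, as the definitions of k_x and a_xy suggest; the function names and the 3-node example graph are hypothetical and chosen only for illustration.

```python
def transition_matrix(adj):
    """One-step transition matrix with P[x][y] = a_xy / k_x."""
    P = []
    for row in adj:
        k = sum(row)  # degree k_x of the node
        P.append([a / k if k else 0.0 for a in row])
    return P


def walk_distribution(P, x, t):
    """Return pi_x(t) using pi_x(t) = P^T pi_x(t-1), starting from pi_x(0) = e_x."""
    n = len(P)
    pi = [0.0] * n
    pi[x] = 1.0  # the walker starts at node x
    for _ in range(t):
        # (P^T pi)[y] = sum over z of P[z][y] * pi[z]
        pi = [sum(P[z][y] * pi[z] for z in range(n)) for y in range(n)]
    return pi


# Hypothetical path graph 0 - 1 - 2 (adjacency matrix of a simple undirected graph)
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
P = transition_matrix(adj)
pi_1 = walk_distribution(P, 0, 1)  # distribution after one step from node 0
pi_2 = walk_distribution(P, 0, 2)  # distribution after two steps from node 0
```

In this small example the walker from node 0 must be at node 1 after one step, and is at node 0 or node 2 with equal probability after two steps.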

Let s_xy^LRW(t) denote the random walk similarity between nodes x and y. It is computed as follows:

s_xy^LRW(t) = (k_x / (2|E|)) π_xy(t) + (k_y / (2|E|)) π_yx(t)

where |E| is the total number of edges in the network;

For a fixed network the total number of edges |E| is fixed, so the factor 2|E| can be ignored in the calculation. The similarity computation of step B therefore uses the OLRW similarity, in which this factor is omitted:

s_xy^OLRW(t) = k_x π_xy(t) + k_y π_yx(t)

The similarities above are all computed for a single walker. For the improved, more stable superposed random walk algorithm, the OSRW similarity after superposition is computed as follows:

s_xy^OSRW(t) = Σ_{τ=1}^{t} s_xy^OLRW(τ)
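A possible way to compute the superposed similarity matrix is sketched below in Python. It is an illustration under stated assumptions, not the patent's implementation: it assumes the OLRW similarity is k_x π_xy(τ) + k_y π_yx(τ) and that OSRW sums it over steps τ = 1..t, which is how local and superposed random walk indices are usually defined. The function name `osrw_similarity` and the example graph are hypothetical.

```python
def osrw_similarity(adj, t):
    """Superposed random walk similarity, summed over steps 1..t (assumed form)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    # pi[x][y]: probability that a walker released at x is at y; pi(0) is the identity
    pi = [[1.0 if x == y else 0.0 for y in range(n)] for x in range(n)]
    sim = [[0.0] * n for _ in range(n)]
    for _ in range(t):
        # advance every walker one step: pi_x(tau) = P^T pi_x(tau - 1)
        pi = [[sum(pi[x][z] * adj[z][y] / deg[z] for z in range(n) if deg[z])
               for y in range(n)] for x in range(n)]
        for x in range(n):
            for y in range(n):
                # OLRW term for this step, accumulated into the OSRW total
                sim[x][y] += deg[x] * pi[x][y] + deg[y] * pi[y][x]
    return sim


# Hypothetical path graph 0 - 1 - 2
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
sim_1 = osrw_similarity(adj, 1)
sim_2 = osrw_similarity(adj, 2)
```

With t = 1 the two end nodes of the path have similarity 0 because neither can reach the other in one step; with t = 2 their similarity becomes positive, which shows how a larger t lets the measure see beyond direct neighbours.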

Further, in step C above, to ensure that every node is treated equally when labels are assigned in the initial state, and to ensure randomness in the update process, the node order is shuffled first and each user node is then assigned a label value, i.e. C_n = L_n, where C_n is the community to which node n belongs and L_n is the label value of node n.
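The initialization can be sketched in a few lines of Python. This is a minimal illustration that assumes each node's own identifier serves as its initial label, so every node starts in its own community; the function name `init_labels` and the `seed` parameter are hypothetical.

```python
import random


def init_labels(nodes, seed=None):
    """Shuffle the update order and give every node a unique label (C_n = L_n)."""
    rng = random.Random(seed)  # a seed is only used to make the sketch repeatable
    order = list(nodes)
    rng.shuffle(order)  # random update order for the later label sweeps
    labels = {n: n for n in order}  # each node initially forms its own community
    return order, labels


order, labels = init_labels(range(5), seed=0)
```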

Further, in step D above, the node label values are updated as follows: all vertices in the graph are updated iteratively, and the label value of each node is updated to the label value that occurs most often among the labels of its adjacent nodes. If several labels tie for the highest frequency, the label held by the adjacent node with the greatest similarity is adopted; if several adjacent nodes share the greatest similarity, one of them is chosen at random.
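The tie-breaking rule for a single node can be sketched as follows. The data layout is an assumption made for illustration: `neighbors` maps a node to its adjacent nodes, `labels` maps a node to its current label, and `sim[x][v]` holds the similarity between x and v. The function name `choose_label` is hypothetical.

```python
import random
from collections import Counter


def choose_label(x, neighbors, labels, sim, rng=random):
    """Pick the new label of node x: majority label, ties broken by similarity."""
    if not neighbors[x]:
        return labels[x]  # an isolated node keeps its label
    counts = Counter(labels[v] for v in neighbors[x])
    top = max(counts.values())
    tied = {lab for lab, c in counts.items() if c == top}
    if len(tied) == 1:
        return next(iter(tied))  # a single most frequent label, no tie to break
    # Several labels tie: look at the neighbours that hold one of the tied labels
    candidates = [v for v in neighbors[x] if labels[v] in tied]
    best = max(sim[x][v] for v in candidates)
    most_similar = [v for v in candidates if sim[x][v] == best]
    # Random choice only among neighbours that share the greatest similarity
    return labels[rng.choice(most_similar)]


# Hypothetical node "x" with four neighbours; labels 1 and 2 each occur twice
neighbors = {"x": ["a", "b", "c", "d"]}
labels = {"x": 0, "a": 1, "b": 1, "c": 2, "d": 2}
sim = {"x": {"a": 0.1, "b": 0.2, "c": 0.9, "d": 0.3}}
new_label = choose_label("x", neighbors, labels, sim)
```

Here the tie between labels 1 and 2 is resolved in favour of label 2, because neighbour "c" is the most similar to "x".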

Further, in step E above, the iteration terminates when the specified number of iterations is reached or when the label of every node in the network no longer changes.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the invention.

Fig. 2 shows the influence of the step number t on the accuracy of community division: the accuracy obtained by the method of the invention (denoted RWLPA) for different values of t, on a network of 500 nodes with mixing parameter μ = 0.6 (μ is the fraction of a node's edges that connect it to nodes in other communities).

Fig. 3 shows the influence of the step number t on the accuracy of community division: the accuracy obtained by the method of the invention (denoted RWLPA) for different values of t, on a network of 500 nodes with mixing parameter μ = 0.65 (μ is the fraction of a node's edges that connect it to nodes in other communities).

Fig. 4 is a schematic diagram of the result of dividing the Zachary's karate club data set with the original label propagation algorithm (denoted LPA).

Fig. 5 is a schematic diagram of the result of dividing the Zachary's karate club data set with the method of the invention (denoted RWLPA).

Fig. 6 compares the average NMI values over 100 runs of the method of the invention (denoted RWLPA) and the label propagation algorithm (denoted LPA) as the mixing parameter μ varies (μ ∈ [0, 0.90]), on a benchmark network of 250 nodes with average node degree <k> = 15 and maximum degree maxk = 50.

Fig. 7 compares the average NMI values over 100 runs of the method of the invention (denoted RWLPA) and the label propagation algorithm (denoted LPA) as the mixing parameter μ varies (μ ∈ [0, 0.90]), on a benchmark network of 500 nodes with average node degree <k> = 15 and maximum degree maxk = 50.

Fig. 8 compares the average NMI values over 100 runs of the method of the invention (denoted RWLPA) and the label propagation algorithm (denoted LPA) as the mixing parameter μ varies (μ ∈ [0, 0.90]), on a benchmark network of 750 nodes with average node degree <k> = 15 and maximum degree maxk = 50.

Fig. 9 compares the average NMI values over 100 runs of the method of the invention (denoted RWLPA) and the label propagation algorithm (denoted LPA) as the mixing parameter μ varies (μ ∈ [0, 0.90]), on a benchmark network of 1000 nodes with average node degree <k> = 15 and maximum degree maxk = 50.

Specific embodiment

The above features and advantages of the invention are explained in more detail below with reference to the accompanying drawings.

Fig. 1 shows the social network community division method based on random walk of the present invention. As shown in Fig. 1, the method comprises the following steps:

Step A: Read the social network data and construct a social network graph in which individuals are nodes and the relationships between individuals are edges.

For example, in a microblog social network each user is a node, and users who share the same features or viewpoints are connected by an edge. Many communities with shared features then form, which is of great significance for network public opinion monitoring. On the World Wide Web, if a small amount of information about some web pages is known, edges can be formed to other related pages, which is very useful for search engines. In a scientific collaboration network each author is a node, and a co-authored article creates an edge between the two authors, forming a large collaboration network.
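Step A can be sketched in Python as building an adjacency structure from a list of pairwise relations. The input format (a list of node pairs) and the function name `build_graph` are assumptions for illustration; the source does not specify how the network data are stored.

```python
def build_graph(edges):
    """Build a simple undirected graph as a dict: node -> set of adjacent nodes."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # a simple graph has no self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)  # undirected: store both directions
    return neighbors


# Hypothetical co-authorship pairs; the names are made up for illustration
edges = [("ann", "bob"), ("bob", "cat"), ("ann", "bob"), ("dan", "dan")]
graph = build_graph(edges)
```

Using sets removes duplicate relations, so the repeated pair ("ann", "bob") yields a single edge.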

In the present embodiment, 4 benchmark networks with different numbers of nodes, node degrees and mixing parameters are used. The mixing parameter μ indicates how pronounced the community structure of the social network is: the smaller the value of μ, the more pronounced the community structure.

Step B: Compute the similarity matrix between nodes.

For example, in a microblog social network, the higher the probability that a random walker starting from user A reaches user B within the specified time period, the greater their similarity and the more likely the two users are to end up in the same community.

Step C: Initialize the communities and label the nodes: assign each node a label that represents the community to which it belongs, and set the iteration counter t = 1.

Specifically, in step C, assigning each node a label that represents its community means setting C_n = L_n, where C_n is the community to which node n belongs and L_n is the label value of node n.

Step D: Update the labels iteratively; when several labels tie for the highest frequency, decide between them according to the similarity matrix.

Step E: Judge whether the stop condition is met.

Specifically, in step D, the node label update comprises the following steps:

Step D1: For each node x in the node sequence X, update the label of x to the label that occurs most frequently among the labels of its neighbour nodes. Suppose the k neighbour nodes of x are x_1, x_2, ..., x_k. In the t-th iteration, the label of x depends on the labels of those neighbours that have already been updated in the t-th iteration and on the labels that the remaining neighbours obtained in the (t-1)-th iteration. The label update formula for node x is:

C_x(t) = g(C_x1(t), ..., C_xm(t), C_x(m+1)(t-1), ..., C_xk(t-1))

where x_1, ..., x_m are the neighbours already updated in the t-th iteration, and the function g returns the most frequent label among the labels of the neighbour nodes of x.

Step D2: If several labels tie for the highest frequency, look up the similarity between the node being updated and each neighbour that holds one of the tied labels, and adopt the label held by the most similar neighbour.

Step E: If the labels of all nodes no longer change, the algorithm stops; otherwise, set t = t + 1 and return to step D.

Specifically, in step E, the algorithm also stops when the specified number of iterations is reached.

Step F: Classify all vertices with the same label into one community.
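The iteration of steps D to F can be sketched as a loop that sweeps over the nodes until no label changes or an iteration limit is reached, and then groups nodes by label. This is an illustrative sketch only: `propagate` and `majority_label` are hypothetical names, and the default update rule breaks ties by the smallest label as a simple deterministic stand-in for the similarity-based rule of step D2.

```python
from collections import Counter, defaultdict


def majority_label(x, neighbors, labels):
    """Stand-in update rule: most frequent neighbour label, ties -> smallest label."""
    counts = Counter(labels[v] for v in neighbors[x])
    if not counts:
        return labels[x]
    top = max(counts.values())
    return min(lab for lab, c in counts.items() if c == top)


def propagate(order, neighbors, labels, choose=majority_label, max_iter=100):
    """Run asynchronous label sweeps, then return the communities (step F)."""
    labels = dict(labels)
    for _ in range(max_iter):  # stop condition 1: iteration limit reached
        changed = False
        for x in order:
            new = choose(x, neighbors, labels)
            if new != labels[x]:
                labels[x] = new  # later nodes in this sweep already see the new label
                changed = True
        if not changed:  # stop condition 2: all labels are stable
            break
    communities = defaultdict(list)
    for node, lab in labels.items():
        communities[lab].append(node)
    return sorted(sorted(c) for c in communities.values())


# Hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
             3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
labels = {n: n for n in neighbors}
result = propagate([0, 1, 2, 5, 4, 3], neighbors, labels)
```

For this update order the two triangles are recovered as separate communities. With the order 0, 1, 2, 3, 4, 5 the same stand-in rule merges all six nodes into one community, which illustrates why an arbitrary tie-break can let labels leak between communities.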

In the present embodiment, the validity of the algorithm is first verified on a real network data set, Zachary's karate club. The karate data set comprises 34 members of a karate club in the United States and 78 ties between members. A conflict between the club's two leaders split the 34 members into two groups. The modularity Q is used as the evaluation index, and the experiment is run 100 times and averaged to reduce error as far as possible. Q is defined as follows:

Q = (1 / (2|E|)) Σ_{i,j} [A_ij - k_i k_j / (2|E|)] δ(C_i, C_j)

where |E| is the total number of edges of the undirected graph, A_ij is the adjacency matrix, k_i is the degree of node i, and δ(C_i, C_j) = 1 when nodes i and j are in the same community and 0 otherwise. Computing the modularity of the communities after division gives Q = 0.3916 for the RWLPA algorithm and Q = 0.3896 for the LPA algorithm, so the invention performs better than the traditional algorithm.
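The modularity can be computed directly from the adjacency matrix, as in the following Python sketch. It assumes the standard Newman definition, Q = (1 / (2|E|)) Σ_ij [A_ij - k_i k_j / (2|E|)] δ(C_i, C_j), which matches the symbols described above; the function name and the example graph are hypothetical.

```python
def modularity(adj, labels):
    """Modularity Q of a partition, with labels[i] the community of node i."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    two_m = sum(deg)  # equals 2|E| for an undirected graph
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:  # delta(C_i, C_j) = 1
                q += adj[i][j] - deg[i] * deg[j] / two_m
    return q / two_m


# Hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
q_split = modularity(adj, [0, 0, 0, 1, 1, 1])  # one community per triangle
q_single = modularity(adj, [0, 0, 0, 0, 0, 0])  # everything in one community
```

For this graph the split into two triangles gives Q = 5/14, about 0.357, while placing all nodes in a single community gives Q = 0.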

Secondly, the invention and the label propagation algorithm are compared on benchmark networks under 4 different parameter settings, using the average NMI value over 100 runs of each algorithm as the mixing parameter μ varies. The NMI (normalized mutual information) value measures how close the division result of an experiment is to the ground truth. It is defined as follows:

NMI(A, B) = -2 Σ_{i=1}^{c_A} Σ_{j=1}^{c_B} N_ij log(N_ij N / (N_i. N_.j)) / [Σ_{i=1}^{c_A} N_i. log(N_i. / N) + Σ_{j=1}^{c_B} N_.j log(N_.j / N)]

where N is a confusion matrix whose rows correspond to the real communities and whose columns correspond to the found communities; N_ij is the number of nodes that real community i shares with found community j; c_A is the number of real communities; c_B is the number of found communities; N_i. is the sum of row i of the matrix N_ij; N_.j is the sum of column j; and N inside the logarithms is the total number of nodes.
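The NMI value can be computed from two label lists as in the Python sketch below. It assumes the commonly used confusion-matrix form of normalized mutual information, with row sums N_i. and column sums N_.j; the function name `nmi` and the convention of returning 1.0 when both partitions consist of a single community are choices made for this illustration.

```python
from math import log


def nmi(true_labels, found_labels):
    """Normalized mutual information between two partitions of the same nodes."""
    n = len(true_labels)
    confusion = {}  # (real community, found community) -> number of shared nodes
    for a, b in zip(true_labels, found_labels):
        confusion[(a, b)] = confusion.get((a, b), 0) + 1
    row, col = {}, {}  # row sums N_i. and column sums N_.j
    for (a, b), c in confusion.items():
        row[a] = row.get(a, 0) + c
        col[b] = col.get(b, 0) + c
    num = -2.0 * sum(c * log(c * n / (row[a] * col[b]))
                     for (a, b), c in confusion.items())
    den = (sum(r * log(r / n) for r in row.values())
           + sum(c * log(c / n) for c in col.values()))
    # den is zero only if both partitions have a single community
    return num / den if den != 0 else 1.0


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # identical partitions, labels renamed
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # partitions share no information
```

Identical partitions give an NMI of 1 regardless of how the labels are named, and statistically independent partitions give 0.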

In the social network community division method based on random walk of the present invention, the community division process is divided into five stages: reading the social network data, computing the node similarity matrix, initializing the communities, updating the labels, and dividing the communities. First, the social network data are read and a social network graph is constructed with individuals as nodes and relationships between individuals as edges. The similarity matrix between nodes is then built. All nodes are arranged in random order, each node in the network is assigned a label that represents its community, and the iteration counter is set to t = 1. In the label update, each node x in the node sequence takes the most frequent label among its neighbours; if several labels tie for the highest frequency, the similarity matrix is consulted and the label held by the neighbour most similar to the node being updated is adopted. If the labels of all nodes no longer change, or the specified number of iterations is reached, the algorithm stops. Finally, all vertices with the same label are classified into one community.

The label propagation algorithm is highly random during label updating, which affects the accuracy of the community division result. The present invention starts from suppressing the randomness of label propagation: it introduces the idea of random walk, defines a new similarity calculation method based on the random walk distance formula, and builds the similarity matrix between nodes. During label propagation, when several labels tie for the highest frequency among a node's neighbours, the label is no longer chosen at random; the label held by the most similar node is adopted instead, which effectively prevents labels from spreading arbitrarily between communities and improves the accuracy of community division.

The invention has one parameter t, whose choice affects the experimental results. The value of t was therefore verified on benchmark networks under 2 different parameter settings, and the experiments show that the community division result is more accurate when 3 ≤ t ≤ 8.

To demonstrate the advantages of the method, the embodiment was verified on a real network and on artificial benchmark networks. The result on the Zachary's karate club data set shows that the modularity after division is higher than that of the traditional LPA algorithm and that the division result is better. Benchmark networks under 4 different parameter settings were also tested; the results are shown in Fig. 6, Fig. 7, Fig. 8 and Fig. 9. For mixing parameter values μ in the range 0 to 0.9, the NMI values of the method are on the whole better than those of the traditional label propagation algorithm. The above embodiment shows that RWLPA limits the randomness of label propagation to a large extent and substantially improves the accuracy of community detection. In summary, the method substantially improves the accuracy of the original community detection algorithm, can effectively mine the community structure in social networks, and can be applied to fields of different scales such as network public opinion monitoring and search engines.

The foregoing is only a preferred embodiment of the present invention. Any changes, modifications or equivalent substitutions made within the scope defined by the claims of the invention shall fall within the scope of protection of the present invention.

Claims

1. A social network community division method based on random walk, characterized in that the method comprises the following steps:

Step A: Read the social network data and construct a social network graph in which individuals are nodes and the relationships between individuals are edges;

Step C: Initialize the communities: shuffle the node order and assign each user node a label value, where the label value identifies the community to which the node belongs;

Step D: Update the labels: for a node x, compute the frequency of each label among its adjacent nodes and update the label of x to the most frequent one. If several labels tie for the highest frequency, compare the similarity values in the row of the similarity matrix that corresponds to node x and adopt the label held by the adjacent node with the greatest similarity; if several adjacent nodes share the greatest similarity, choose one of them at random;

Step E: Judge whether the stop condition is met, namely that the specified number of iterations has been reached or the label values have become stable after several iterations; otherwise, return to step D and continue updating labels;

Step F: Classify all nodes with the same label into one community.

2. The social network community division method based on random walk according to claim 1, characterized in that:

In step B above, the original random walk algorithm has only one walker, and its result is unstable. To eliminate the error caused by randomness in the random walk process, walkers are released one after another at intervals of Δt = 1 within the given time period t, until the first walker released has taken t steps, and the similarity between two nodes is obtained with the improved algorithm.

3. The social network community division method based on random walk according to claim 1, characterized in that:

In step B above, a new criterion is proposed for computing the random walk similarity. The value in row x, column y of the similarity matrix represents the degree of similarity between node x and node y; the larger the value, the more likely the individuals represented by the two nodes are to be in the same community. The social network graph constructed in step A is abstracted as a simple undirected graph G(N, E), where N is the set of nodes and E is the set of edges. The similarity between nodes is computed as follows:

P_xy = a_xy / k_x

where P_xy is the probability that a walker starting from node x reaches node y in one step, k_x is the degree of node x and a_xy is the corresponding entry of the adjacency matrix. Let π_xy(t) denote the probability that a walker starting from node x reaches node y in t steps. π_xy(t) can be obtained recursively from the one-step transition matrix: π_x(t) is the x-th column of the matrix π(t), and π_xy(t) is the y-th entry of π_x(t). π_x(0) is an n×1 vector whose x-th entry is 1 and whose other entries are 0, and P^T is the transpose of the matrix P;

π_x(t) = P^T π_x(t-1)

Let s_xy^LRW(t) denote the random walk similarity between nodes x and y, computed as follows:

s_xy^LRW(t) = (k_x / (2|E|)) π_xy(t) + (k_y / (2|E|)) π_yx(t)

where |E| is the total number of edges in the network;

For a fixed network the total number of edges |E| is fixed, so the factor 2|E| can be ignored in the calculation. The similarity computation of step B therefore uses the OLRW similarity, in which this factor is omitted:

s_xy^OLRW(t) = k_x π_xy(t) + k_y π_yx(t)

The similarities above are all computed for a single walker. For the improved, more stable superposed random walk algorithm, the OSRW similarity after superposition is computed as follows:

s_xy^OSRW(t) = Σ_{τ=1}^{t} s_xy^OLRW(τ)

4. The social network community division method based on random walk according to claim 1, characterized in that:

In step C above, to ensure that every node is treated equally when labels are assigned in the initial state, and to ensure randomness in the update process, the node order is shuffled first and each user node is then assigned a label value, i.e. C_n = L_n, where C_n is the community to which node n belongs and L_n is the label value of node n.

5. The social network community division method based on random walk according to claim 1, characterized in that:

In step D above, the node label values are updated as follows: all vertices in the graph are updated iteratively, and the label value of each node is updated to the label value that occurs most often among the labels of its adjacent nodes. If several labels tie for the highest frequency, the label held by the adjacent node with the greatest similarity is adopted; if several adjacent nodes share the greatest similarity, one of them is chosen at random.

6. The social network community division method based on random walk according to claim 1, characterized in that:

In step E above, the iteration terminates when the specified number of iterations is reached or when the label of every node in the social network no longer changes.