CN105069290A - Parallelization critical node discovery method for postal delivery data - Google Patents

Parallelization critical node discovery method for postal delivery data

Info

Publication number
CN105069290A
CN105069290A (application CN201510469302.8A)
Authority
CN
China
Prior art keywords
node
key
parallelization
represent
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510469302.8A
Other languages
Chinese (zh)
Other versions
CN105069290B (en)
Inventor
马云龙
刘敏
桂峰
章锋
袁菡
孙源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN201510469302.8A priority Critical patent/CN105069290B/en
Publication of CN105069290A publication Critical patent/CN105069290A/en
Application granted granted Critical
Publication of CN105069290B publication Critical patent/CN105069290B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Telephonic Communication Services (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present invention relates to a parallelization critical node discovery method for postal delivery data. The method comprises the following steps: step S1: acquiring node activity according to the total number of sending and receiving times of each node within a set time in the postal delivery data, and taking the node activity as the node's own weight value; step S2: acquiring the weight values of the edges of each node pair according to the interaction frequency and shared neighbor number metric of each node pair within the set time in the postal delivery data, and defining the network formed by the postal delivery data as a directed double-weighted network graph; and step S3: adding the nodes' own weight values and the weight values of the edges of the node pairs on the basis of the PageRank algorithm, and mining the critical nodes in the directed double-weighted network graph in parallel. In contrast to the prior art, the parallelization critical node discovery method fully utilizes the information in the logistics postal delivery network, reduces the loss of useful information, improves the accuracy of critical node discovery in the network, and implements parallel operation at the same time, thereby greatly improving the efficiency and stability of critical node mining.

Description

A parallelized key node discovery method for postal delivery data
Technical field
The present invention relates to the technical field of social network analysis, and in particular to a parallelized key node discovery method for postal delivery data.
Background technology
Since the concept of the "social network" was put forward by a British scholar in the 1920s, research on social networks has never stopped. With today's rapid development of bioinformatics, network technology, communication technology and social platforms, the individuals in a social network form a huge and complex network. Complex networks are closely bound up with our daily lives; the complex networks we commonly encounter include the Internet, the World Wide Web, communication networks, mail networks and microblog networks in the computing field, logistics delivery relationship networks in the logistics field, and protein-protein interaction networks in biomedicine. Key nodes are an important kind of node that is ubiquitous in social network structures, and research on key nodes in social networks has been a hot topic in recent years. In social and physical networks, discovering key nodes and assessing their importance has great practical significance, for example finding the most active users in a public organization, identifying key nodes in network attack and defence, or determining key persons in a logistics network. Discovering the key nodes in a social network structure helps to mine the information in the social network at a deeper level; finding the key nodes in the community structure has profound theoretical and practical significance for understanding the structure and function of social networks.
First, most existing research on key nodes in complex networks uses the PageRank algorithm of Google or improves upon it. However, most key node discovery algorithms only consider the weights of edges, and few take the weights of the nodes themselves into account, so much useful information is ignored when mining key persons in a network, which affects the accuracy of key node discovery. Second, in this invention node activity is defined as the node's own weight, and the weight of an edge is computed from two factors: the number of shared neighbors of the two nodes joined by the edge, and the interaction frequency between the nodes, so that the information in the network is fully utilized. Finally, with the rapid development of computer and Internet technology, people's ability to obtain data keeps growing, and the scale of the networks studied has risen from the original tens to hundreds of nodes to millions of nodes. Considering that the MapReduce programming framework is suitable for processing large-scale data, the present invention is based on the MapReduce programming framework and realizes parallelized key node discovery for large-scale postal delivery data.
Summary of the invention
The object of the present invention is to overcome the defects of the above prior art and to provide a parallelized key node discovery method for postal delivery data. Based on a real logistics network, node activity, node interaction frequency and the number of shared neighbors of each node pair are all taken into account in the weight computation, which makes full use of the information in the logistics delivery network, reduces the loss of useful information and improves the accuracy of key node discovery in the network. Moreover, the relatively mature PageRank algorithm of Google is improved on the basis of the MapReduce programming framework, realizing a parallelized algorithm and greatly improving the efficiency and stability of key node mining.
The object of the present invention can be achieved through the following technical solutions:
A parallelized key node discovery method for postal delivery data, comprising:
Step S1: obtain the node activity of each node according to its total number of send and receive events within a set time window in the postal delivery data, and take the node activity as the node's own weight;
Step S2: obtain the weight of the edge of each node pair according to the interaction frequency and the shared-neighbor metric of each node pair within the set time window in the postal delivery data, and define the network formed by the postal delivery data as a directed double-weighted network graph;
Step S3: add the nodes' own weights and the weights of the edges of the node pairs on the basis of the PageRank algorithm, and mine the key nodes in the directed double-weighted network graph in parallel.
The node activity satisfies the following formula:
a_i = M_i / Max_num   (1)
where a_i denotes the node activity of node i, M_i denotes the total number of send and receive events of node i within the set time window, and Max_num denotes the maximum of all M_i.
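For illustration only (this sketch is not part of the patent's wording), formula (1) can be computed in a few lines of Python, assuming the delivery records are available in memory as (sender, receiver) pairs; all names below are illustrative:
from collections import Counter

def node_activity(records):
    # records: iterable of (sender, receiver) pairs observed within the set time window
    totals = Counter()
    for sender, receiver in records:
        totals[sender] += 1    # a send event counts toward M_i
        totals[receiver] += 1  # a receive event counts toward M_i as well
    max_num = max(totals.values())
    # formula (1): a_i = M_i / Max_num
    return {node: m / max_num for node, m in totals.items()}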
The weight of an edge satisfies the following formula:
w_ji = a × freq_ij + (1 - a) × Neighbor(i, j)   (2)
where w_ji denotes the weight of the edge between node i and node j, freq_ij denotes the interaction frequency between node i and node j, Neighbor(i, j) denotes the shared-neighbor metric between node i and node j, and a denotes an adjustment factor.
The interaction frequency satisfies the following formula:
freq_ij = n_ij / Max_num   (3)
where freq_ij denotes the interaction frequency between node i and node j, n_ij denotes the number of occurrences of the edge formed by node i and node j, and Max_num denotes the maximum of all n_ij.
The shared-neighbor metric satisfies the following formula:
Neighbor(i, j) = Neighbor_shared_num(i, j) / Max_SharedNum   (4)
where Neighbor(i, j) denotes the shared-neighbor metric between node i and node j, Neighbor_shared_num(i, j) denotes the number of shared neighbors between node i and node j, and Max_SharedNum denotes the maximum of all Neighbor_shared_num(i, j).
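A similar illustrative sketch of formulas (2)-(4), assuming the per-pair occurrence counts and shared-neighbor counts have already been computed; the default value of the adjustment factor a and all names are assumptions of this sketch:
def edge_weights(pair_counts, shared_counts, a=0.5):
    # pair_counts: {(i, j): n_ij}, occurrences of the edge between i and j in the time window
    # shared_counts: {(i, j): Neighbor_shared_num(i, j)}
    # a: adjustment factor balancing interaction frequency against shared neighbors
    max_num = max(pair_counts.values())
    max_shared = max(shared_counts.values(), default=0) or 1
    weights = {}
    for pair, n_ij in pair_counts.items():
        freq_ij = n_ij / max_num                                # formula (3)
        neighbor_ij = shared_counts.get(pair, 0) / max_shared   # formula (4)
        weights[pair] = a * freq_ij + (1 - a) * neighbor_ij     # formula (2)
    return weights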
Step S3 specifically comprises:
301: obtain the PageRank value of each node, which satisfies the following formula:
PR(p_i) = a_i / N + (1 - a_i) × Σ PR(p_j) × w_ji / L(p_j)   (5)
where PR(p_i) denotes the PageRank value of node i, p_j ∈ M(p_i), M(p_i) denotes the set of nodes pointing to node i, L(p_j) denotes the out-degree of the node p_j pointing to node i, N denotes the total number of nodes in the postal delivery data, a_i denotes the node activity of node i, and w_ji denotes the weight of the edge between node i and node j;
302: for each node, compare the PageRank values obtained in the previous and current rounds, and judge whether the absolute value of their difference is greater than a given threshold ε; if so, jump to step 301 and continue to obtain the PageRank value of each node in the next round; if not, perform step 303;
303: sort the PageRank values of the nodes finally obtained in step 302, and take the top-k nodes as the mined key nodes, where k is the number of key nodes.
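Purely as an illustration of steps 301-303, a compact sequential (non-parallel) sketch of the iteration; the in-link sets, node activities, edge weights, out-degrees, threshold eps and k are assumed inputs of the sketch:
def weighted_pagerank(in_links, activity, weight, out_degree, eps=1e-6, k=10):
    # in_links: {i: set of nodes j pointing to i}, i.e. M(p_i) in formula (5)
    # activity: {i: a_i}; weight: {(j, i): w_ji}; out_degree: {j: L(p_j)}
    nodes = list(in_links)
    n = len(nodes)
    pr = {i: 1.0 / n for i in nodes}                 # initial PageRank values (assumption)
    while True:
        new_pr = {}
        for i in nodes:
            s = sum(pr[j] * weight[(j, i)] / out_degree[j] for j in in_links[i])
            new_pr[i] = activity[i] / n + (1 - activity[i]) * s   # formula (5)
        converged = all(abs(new_pr[i] - pr[i]) <= eps for i in nodes)   # step 302
        pr = new_pr
        if converged:
            break
    return sorted(pr, key=pr.get, reverse=True)[:k]  # step 303: top-k key nodes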
In this parallelized key node discovery method, the data of every step are processed in parallel based on the MapReduce programming framework.
Compared with the prior art, the present invention has the following advantages:
1) In the existing key node discovery algorithms for complex networks, few researchers consider at the same time the node's own weight and edge weights that are influenced by the interaction frequency between nodes and the number of shared neighbors. The method of the invention takes the activity of the node itself into account, using node activity as the node's own weight, and when considering the weight of an edge it introduces the two factors that determine the edge weight, namely the interaction frequency between nodes and the number of shared neighbors of the node pair. This makes full use of the information in the network, improves the accuracy of the algorithm, and is suitable for key node discovery in large-scale social networks.
2) A delivery network is built from the postal delivery data and the PageRank algorithm is applied to the network formed by the logistics delivery data to mine key nodes, which is suitable for accurate and fast mining of key nodes in massive postal delivery data.
3) Parallelized computation of the improved PageRank is realized based on the MapReduce programming framework, which greatly improves the scalability, mining efficiency and stability of the algorithm.
Brief description of the drawings
Fig. 1 is the overall flow chart of the parallelization scheme of the present invention;
Fig. 2 is a diagram of the MapReduce data processing procedure;
Fig. 3 is a schematic diagram of the definition of the number of shared neighbors.
Embodiment
The present invention is described in detail below in conjunction with the drawings and a specific embodiment. The embodiment is implemented on the premise of the technical solution of the present invention; a detailed implementation and a concrete operating process are given, but the protection scope of the present invention is not limited to the following embodiment.
As shown in Figure 2, MapReduce processes massive data by dividing and grouping them into parts that are completed jointly by the worker nodes distributed under a master node, and finally integrates the computation results of the worker nodes to obtain the final result. MapReduce abstracts the whole data handling process into two parts, expressed as two functions, map and reduce. The work of map is to decompose a task into multiple sub-tasks, while reduce is responsible for gathering the results of the multiple sub-tasks. Under the MapReduce framework a data set can be decomposed into multiple small data sets that can be processed in parallel.
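The split/map/shuffle/reduce flow that Figure 2 describes can be imitated in a single process for didactic purposes; this toy Python skeleton is only an illustration and stands in for the Hadoop MapReduce runtime actually used:
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # map phase: each input record yields zero or more (key, value) pairs
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)      # shuffle: group all values by key
    # reduce phase: merge each key's values into one result
    return {key: reduce_fn(key, values) for key, values in grouped.items()}
The step-specific jobs sketched further below plug illustrative mapper and reducer functions into this skeleton.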
As shown in Figure 1, a parallelized key node discovery method for postal delivery data based on the MapReduce framework and the PageRank algorithm comprises:
Step S1: obtain the node activity of each node according to its total number of send and receive events within the set time window in the postal delivery data, and take the node activity as the node's own weight. The details are as follows:
Based on the MapReduce framework, the raw data set to be mined is randomly split into multiple data blocks; the computer nodes in the MapReduce cluster start multiple Mappers, and each Mapper stage processes its corresponding data block: the map() handler reads the relevant node information and converts it into <key, value> output, where the key is the current node and the value is an adjacent node that has an interaction relation with the current node. For example, a delivery behavior A → B, where A is the sender and B is the receiver, is directed; although the input A → B is directed, the map stage outputs both A-B and B-A, which is undirected. Finally, the output of each map function is transferred to the reduce() handler of the Reducer stage, where the results are gathered and the total number of parcels sent and received by each node, i.e. its total send-and-receive count, is counted and saved to a file in the node:count data format.
The node activity is computed from the total send-and-receive count and satisfies the following formula:
a_i = M_i / Max_num   (1)
where a_i denotes the node activity of node i, M_i denotes the total number of send and receive events of node i within the set time window, and Max_num denotes the maximum of all M_i.
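Using the toy skeleton above, the step S1 job might look as follows (illustrative only; in the invention the job runs on a Hadoop cluster and writes node:count records to a file):
def s1_map(record):
    # one delivery behavior A -> B is emitted once for each end point
    sender, receiver = record
    yield sender, receiver
    yield receiver, sender

def s1_reduce(node, partners):
    # every emitted value corresponds to one send or receive event, so the
    # length of the grouped list is the total send-and-receive count M_i
    return len(partners)

# totals = run_mapreduce(records, s1_map, s1_reduce)
# activity = {node: m / max(totals.values()) for node, m in totals.items()}   # formula (1)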
Step S2: obtain the weight of the edge of each node pair according to the interaction frequency and the shared-neighbor metric of each node pair within the set time window in the postal delivery data, and define the network formed by the postal delivery data as a directed double-weighted network graph. The details are as follows:
201: compute the interaction frequency of each node pair within the set time window:
The data preprocessing is the same as above: the raw data set is randomly divided into several blocks, the computer nodes of the MapReduce cluster start multiple Mappers, and each Mapper stage processes its corresponding data block, reads the relevant information of nodes and edges, and converts it into <key, value> output, where the key is a node pair and the value is 1; for example, a single delivery behavior A → B is output by map as <A-B, 1> and <B-A, 1>. Then the output of each map is sent to the Reducer end and gathered, the total number of occurrences of the edge formed by each node pair is counted, and finally for each delivery behavior both the <node1-node2:count> and <node2-node1:count> forms are written to a file and saved.
The interaction frequency then satisfies the following formula:
freq_ij = n_ij / Max_num   (2)
where freq_ij denotes the interaction frequency between node i and node j, n_ij denotes the number of occurrences of the edge formed by node i and node j, and Max_num denotes the maximum of all n_ij.
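An analogous sketch of the step 201 job and formula (2); the mapper emits the node pair in both orders with value 1 and the reducer sums the occurrences (names are illustrative):
def pair_map(record):
    sender, receiver = record
    yield (sender, receiver), 1
    yield (receiver, sender), 1

def pair_reduce(pair, ones):
    return sum(ones)    # n_ij: number of occurrences of the edge formed by i and j

# pair_counts = run_mapreduce(records, pair_map, pair_reduce)
# freq = {p: n / max(pair_counts.values()) for p, n in pair_counts.items()}   # formula (2)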
202: compute the shared-neighbor metric of each node pair within the set time window:
The data preprocessing is the same as above. The raw data set is processed: through the Mapper stage, output in <key, value> form is obtained, where the key is a node pair and the value is one of the common adjacent nodes of the two nodes of the pair. Finally, the results are gathered at the Reducer end and the number of shared neighbors of each node pair, i.e. of each edge, is counted; two values are saved for each edge, for example for A → B the final output is <A-B:count> and <B-A:count>.
As shown in Figure 3, the number of shared neighbors of two interacting nodes A and B = the number of shared sending neighbors + the number of shared receiving neighbors. The more shared neighbor nodes two nodes have, the more likely they are to be associated within the same scope and the tighter their relationship; the shared-neighbor metric then satisfies the following formula:
Neighbor(i, j) = Neighbor_shared_num(i, j) / Max_SharedNum   (3)
where Neighbor(i, j) denotes the shared-neighbor metric between node i and node j, Neighbor_shared_num(i, j) denotes the number of shared neighbors between node i and node j, and Max_SharedNum denotes the maximum of all Neighbor_shared_num(i, j).
203: compute the weight of the edge of each node pair; the formula for the weight of the edge between two nodes is as follows:
w_ji = a × freq_ij + (1 - a) × Neighbor(i, j)   (4)
where w_ji denotes the weight of the edge between node i and node j, freq_ij denotes the interaction frequency between node i and node j, Neighbor(i, j) denotes the shared-neighbor metric between node i and node j, and a denotes an adjustment factor.
Step S3: add the nodes' own weights and the weights of the edges of the node pairs on the basis of the PageRank algorithm, and mine the key nodes in the directed double-weighted network graph in parallel. Specifically:
301: according to the node activity and the weight of every edge computed above, obtain the PageRank value of each node by the improved Google page ranking algorithm (PageRank); the PageRank formula is as follows:
PR(p_i) = a_i / N + (1 - a_i) × Σ PR(p_j) × w_ji / L(p_j)   (5)
where PR(p_i) denotes the PageRank value of node i, p_j ∈ M(p_i), M(p_i) denotes the set of nodes pointing to node i, L(p_j) denotes the out-degree of the node p_j pointing to node i, N denotes the total number of nodes in the postal delivery data, a_i denotes the node activity of node i, and w_ji denotes the weight of the edge between node i and node j;
302: after the PageRank values of all nodes are computed, compare the PageRank value obtained in the previous computation with the current PageRank value. If the absolute value of the difference between a node's current PageRank value and its previous one is greater than the given threshold ε, repeat step 301 to compute the PageRank value of each node in the next round; if the absolute values of the differences between the two successive PageRank values are all less than the given threshold ε, perform step 303;
303: sort the PageRank values of the nodes finally obtained in step 302; the top-k nodes are the k most important key nodes mined, where k is the number of key nodes.
The actual procedure in the MapReduce framework is described below:
1) The postal delivery data to be mined are divided into multiple data blocks and processed separately. Through one MapReduce job, output in <key, value> form is produced, where the key is a person node i in the network and the value is the number of nodes that have a delivery behavior with node i, including both senders and receivers. Specifically this comprises the following steps:
11) Split the postal delivery data to be mined into data blocks and hand them to the Mappers for processing in units of data blocks.
12) Each computing node in the cluster processes its corresponding data block and performs one MapReduce job.
The Mapper stage:
Input: the original postal delivery data to be mined and analyzed;
Output: <node_i, node_j>, where node_i and node_j represent the receiver and the sender participating in one delivery behavior; node_i and node_j can each be either receiver or sender, so in the Mapper stage, for such a node pair <node_i, node_j>, both <node_i, node_j> and <node_j, node_i> are output.
The Reducer stage:
Input: <node_i, node_j>;
Output: <node_i, count>, where the key is node node_i and the value is the number count of nodes that have a send/receive relation with node_i; the result is written to a file A1 on HDFS (Hadoop Distributed File System).
2) The postal delivery data to be mined are divided into multiple data blocks and processed separately. Through one MapReduce job, output in <key, value> form is produced, where the key is a node pair between which a delivery behavior occurs in the logistics network, and the value is an integer representing the number of times that node pair occurs. Specifically this comprises the following steps:
21) Split the postal delivery data to be mined into data blocks and hand them to the Mappers for processing in units of data blocks.
22) Each computing node in the cluster processes its corresponding data block and performs one MapReduce job.
The Mapper stage:
Input: the original postal delivery data to be mined and analyzed;
Output: <(node_i, node_j), 1>, where node_i and node_j represent the receiver and the sender participating in one delivery behavior; this output form describes one delivery behavior from node node_i to node node_j.
The Reducer stage:
Input: <(node_i, node_j), 1>;
Output: <(node_i, node_j), count>, where the key is the node pair (node_i, node_j) and the value is the number of times this node pair occurs; the result is written to a file A2 on HDFS.
3) Similar to the previous step, the input is still the original postal delivery data set, and the number of shared neighbors of each node pair is computed. The key obtained is a node pair, representing one delivery behavior between a sender and a receiver, and the number of shared neighbors of the node pair is computed according to the definition of the shared neighbors of a node pair. Specifically this comprises the following steps:
31) Split the postal delivery data to be mined into data blocks and hand them to the Mappers for processing in units of data blocks.
32) Each computing node in the cluster processes its corresponding data block and performs two MapReduce jobs.
The Mapper1 stage:
Input: the original postal delivery data to be mined and analyzed;
Output: <node_i, node_j>, where node_i and node_j represent the receiver and the sender participating in one delivery behavior; this output form describes one delivery behavior from node node_i to node node_j.
The Reducer1 stage:
Input: <node_i, node_j>;
Output: <node_i, adjacent nodes of node_i>, where the key is node node_i and the value is the set of nodes connected with it; the result obtained is exactly the adjacency-list form of the graph.
The Mapper2 stage:
Input: <node_i, adjacent nodes of node_i>, that is, the input of Mapper2 is exactly the reduce output of the first job;
Output: <(node_a, node_b), node_i>, where the key (node_a, node_b) represents a node pair formed by any two of the neighbor nodes of node node_i, and within a node pair the node with the smaller sequence number comes first. For example, for the inputs <A, (B, C, D)> and <B, (C, D)>, the output of Mapper2 is <(B, C), A>, <(B, D), A>, <(C, D), A>, <(C, D), B>.
The Reducer2 stage:
Input: <(node_a, node_b), node_i>;
Output: <(node_a, node_b), common adjacent nodes of node_a and node_b>; for example, under the assumption of the previous step, the output of the Reducer2 stage is <(B, C), A>, <(B, D), A>, <(C, D), A, B>. The result is written to a file A3.
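The two chained jobs of step 3) can be sketched with the same toy skeleton; this simplified version builds adjacency lists from the sending direction only and then counts, for every pair of neighbors, how many nodes they have in common, reproducing the <(C, D), A, B> example above (illustrative only):
from itertools import combinations

def adjacency_map(record):
    sender, receiver = record
    yield sender, receiver                 # Mapper1: one delivery behavior node_i -> node_j

def adjacency_reduce(node, neighbors):
    return sorted(set(neighbors))          # Reducer1: adjacency list of node_i

def shared_map(item):
    node, neighbors = item
    for a, b in combinations(sorted(neighbors), 2):   # smaller sequence number first
        yield (a, b), node                 # Mapper2: node is a common neighbor of (a, b)

def shared_reduce(pair, nodes):
    return len(set(nodes))                 # Reducer2: Neighbor_shared_num(a, b)

# adjacency = run_mapreduce(records, adjacency_map, adjacency_reduce)
# shared_counts = run_mapreduce(adjacency.items(), shared_map, shared_reduce)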
4) Compute the PageRank value of each node. Before the Mapper and Reducer stages, a separate standalone program is written to read the data obtained above: in its setup function the data in file A1, file A2 and file A3 of this embodiment are read, the activity of each node is obtained according to the definition of node activity and formula (1), and, with the node string as the key and the computed node activity as the value, the result is stored in a defined HashMap collection hashmap1. Then the weights of the edges are computed according to the definition and formula (2) of the interaction frequency between nodes, the definition and formula (3) of the shared-neighbor metric between nodes, and formula (4) for the edge weights, where a is the adjustment factor of the two factors that affect the edge weight and controls how much each of these two factors influences the edge weight. Then, with the string form of an edge as the key, the weight w_ji of the edge is stored as the value in a defined HashMap collection hashmap2. The PageRank value of each node is then computed according to the definition and formula (5) of the weighted key node discovery algorithm based on the improved PageRank algorithm: one MapReduce job is performed to compute the PageRank value of each node, the result of the previous run is used as the input of the next MapReduce job, and the iteration continues until the absolute value of the difference between the PageRank values of each node in two consecutive runs is less than the given threshold ε, at which point the iterative process stops and the result is obtained. Specifically this comprises the following steps:
41) Compute the node activity and the edge weights in the setup function.
The data of files A1, A2 and A3 are read in the setup function, then the activity of each node is computed according to the definition and formula of node activity and the result is stored in the hashmap1 collection. Then the weight of the edge between every pair of nodes is computed according to the definitions and formulas of the node interaction frequency and of the number of shared neighbors between nodes, and the result is stored in the hashmap2 collection.
42) Each computing node in the cluster processes its corresponding data block and performs one MapReduce job; this job mainly carries out data preprocessing, and the main steps are as follows:
The Mapper stage:
Input: postal delivery data to be mined in <node_i, node_j> form; this original data style represents node node_i sending a parcel to node node_j, which is directed, node_i → node_j; the adjacency-list form is finally obtained.
Output: <node_j, node_i>, i.e. the direction of the raw data record is reversed and the form becomes node_j → node_i.
The Reducer stage:
Input: <node_j, list of nodes pointing to node_j>. After the Mapper outputs its results and they pass through the shuffle and combine functions, the input key of Reduce is the receiver node_j, and the value is the set of senders that have sent a parcel to node_j.
Output: <node_j, list of nodes pointing to node_j>; Reduce outputs the result directly. For example, the result <A, (B, C, D)> means that nodes B, C and D have each sent a parcel to node A; A is the receiver and B, C, D form the sender set.
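The preprocessing job of step 42) simply reverses each edge and groups senders per receiver; a minimal sketch with the same skeleton (names illustrative):
def reverse_map(record):
    sender, receiver = record
    yield receiver, sender                 # turn node_i -> node_j into node_j <- node_i

def reverse_reduce(receiver, senders):
    return sorted(set(senders))            # e.g. <A, (B, C, D)>: all nodes that sent to A

# in_link_lists = run_mapreduce(records, reverse_map, reverse_reduce)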
43) By this step, the activity of each node, the weight of each edge and the preprocessed adjacency-list form of the raw data are all known. Each computing node in the cluster processes its corresponding data block and performs one more MapReduce job, computing the PageRank value of each node according to the improved key node formula (5) based on PageRank. The computation details of the node PageRank values are given below in the form of pseudocode.
Algorithm 1: Map(key, value)
Input:
Logistics network nodes
PR(p_i): the PageRank value of node p_i
w_ij: the value (weight) of the edge (i, j)
links[p_1, p_2, p_3, ..., p_m]: all the nodes p_j linked by node p_i
Output:
List of <key: value>
1. Emit(p_i, links[p_1, p_2, p_3, ..., p_m])
2. For each p_j in links[p_1, p_2, p_3, ..., p_m]
3.     partial(j) = PR(p_i) × w_ij / L(p_i)
4.     Emit(p_j, partial(j))
5. End For
Algorithm 2: Reduce(key, value)
Input:
Logistics network node p_j, list of <p_j, partial(j)>
Output:
PR(p_j): the PageRank value of node p_j
1. // Initialize the new PageRank value of node p_j
2. PR(p_j) = 0
3. For each partial(j) in the list
4.     PR(p_j) += partial(j)
5. End For
6. PR(p_j) = (1 - a_j) × PR(p_j) + a_j / N    // N is the total number of nodes, a_j is the node activity of node p_j
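The pseudocode above might be rendered in Python roughly as follows for a single iteration; pr, weight and activity are dictionaries prepared in the setup phase, and, consistent with formula (5), the sketch divides each contribution by the out-degree of the emitting node (an assumption of this sketch, not a quotation of the actual Hadoop job):
def pagerank_map(item, pr, weight):
    node, links = item                          # Algorithm 1: node and its out-link list
    yield node, ('links', links)                # pass the graph structure through unchanged
    for j in links:
        contrib = pr[node] * weight[(node, j)] / len(links)   # PR(p_i) * w_ij / L(p_i)
        yield j, ('partial', contrib)

def pagerank_reduce(node, values, activity, n_nodes):
    total = sum(v for tag, v in values if tag == 'partial')          # Algorithm 2, lines 2-5
    return activity[node] / n_nodes + (1 - activity[node]) * total   # line 6, formula (5)

# from functools import partial
# new_pr = run_mapreduce(out_links.items(),
#                        partial(pagerank_map, pr=pr, weight=weight),
#                        partial(pagerank_reduce, activity=activity, n_nodes=len(activity)))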
44) Once the PageRank value of each node from the first computation is obtained, these values are used as the initial PageRank values of the nodes in the second MapReduce job, and a second iteration is carried out to compute the PageRank values of the next iteration. In this way, the result of the previous iteration is used as the initial PageRank value for the next computation of each node, and the iterative computation continues until the PageRank value of each node computed in one run differs from the PageRank value computed in the next run by no more than the given threshold ε, at which point the iteration stops and the final PageRank value of each node is obtained. The nodes are then sorted by their PageRank values, and the top-k nodes are the k most important key nodes.
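Step 44) amounts to a driver loop around the iteration sketched above; run_pagerank_iteration is a hypothetical helper assumed to wrap one such MapReduce job:
def find_key_nodes(pr, run_pagerank_iteration, eps, k):
    # run_pagerank_iteration: hypothetical wrapper around one MapReduce PageRank job
    while True:
        new_pr = run_pagerank_iteration(pr)      # previous values seed the next job
        converged = max(abs(new_pr[n] - pr[n]) for n in pr) <= eps
        pr = new_pr
        if converged:
            break
    # sort by the final PageRank values; the top-k nodes are the key nodes
    return sorted(pr, key=pr.get, reverse=True)[:k]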
In existing key node research few people pay attention to the discovery of key nodes in logistics delivery networks. The present invention is based on a real logistics network: node activity, node interaction frequency and the number of shared neighbors of each node pair are all taken into account in the weight computation, which makes full use of the information in the logistics delivery network, reduces the loss of useful information and improves the accuracy of key node discovery in the network; moreover, the relatively mature PageRank algorithm of Google is improved on the basis of the MapReduce programming framework, the algorithm is parallelized, and the efficiency and stability of key node mining are greatly improved.
For discovering key nodes in the small-scale networks formed from small data sets, a traditional single-machine algorithm meets the demand well and its efficiency is adequate. For the large-scale networks formed from massive data, however, the traditional single-machine algorithm falls short, and the superiority of the method proposed by the present invention becomes quite obvious.

Claims (7)

1. A parallelized key node discovery method for postal delivery data, characterized in that it comprises:
Step S1: obtaining the node activity of each node according to its total number of send and receive events within a set time window in the postal delivery data, and taking the node activity as the node's own weight;
Step S2: obtaining the weight of the edge of each node pair according to the interaction frequency and the shared-neighbor metric of each node pair within the set time window in the postal delivery data, and defining the network formed by the postal delivery data as a directed double-weighted network graph;
Step S3: adding the nodes' own weights and the weights of the edges of the node pairs on the basis of the PageRank algorithm, and mining the key nodes in the directed double-weighted network graph in parallel.
2. The parallelized key node discovery method for postal delivery data according to claim 1, characterized in that the node activity satisfies the following formula:
a_i = M_i / Max_num   (1)
where a_i denotes the node activity of node i, M_i denotes the total number of send and receive events of node i within the set time window, and Max_num denotes the maximum of all M_i.
3. The parallelized key node discovery method for postal delivery data according to claim 1, characterized in that the weight of an edge satisfies the following formula:
w_ji = a × freq_ij + (1 - a) × Neighbor(i, j)   (2)
where w_ji denotes the weight of the edge between node i and node j, freq_ij denotes the interaction frequency between node i and node j, Neighbor(i, j) denotes the shared-neighbor metric between node i and node j, and a denotes an adjustment factor.
4. The parallelized key node discovery method for postal delivery data according to claim 1, characterized in that the interaction frequency satisfies the following formula:
freq_ij = n_ij / Max_num   (3)
where freq_ij denotes the interaction frequency between node i and node j, n_ij denotes the number of occurrences of the edge formed by node i and node j, and Max_num denotes the maximum of all n_ij.
5. The parallelized key node discovery method for postal delivery data according to claim 1, characterized in that the shared-neighbor metric satisfies the following formula:
Neighbor(i, j) = Neighbor_shared_num(i, j) / Max_SharedNum   (4)
where Neighbor(i, j) denotes the shared-neighbor metric between node i and node j, Neighbor_shared_num(i, j) denotes the number of shared neighbors between node i and node j, and Max_SharedNum denotes the maximum of all Neighbor_shared_num(i, j).
6. The parallelized key node discovery method for postal delivery data according to claim 1, characterized in that step S3 specifically comprises:
301: obtaining the PageRank value of each node, which satisfies the following formula:
PR(p_i) = a_i / N + (1 - a_i) × Σ PR(p_j) × w_ji / L(p_j)   (5)
where PR(p_i) denotes the PageRank value of node i, p_j ∈ M(p_i), M(p_i) denotes the set of nodes pointing to node i, L(p_j) denotes the out-degree of the node p_j pointing to node i, N denotes the total number of nodes in the postal delivery data, a_i denotes the node activity of node i, and w_ji denotes the weight of the edge between node i and node j;
302: for each node, comparing the PageRank values obtained in the previous and current rounds, and judging whether the absolute value of their difference is greater than a given threshold ε; if so, jumping to step 301 and continuing to obtain the PageRank value of each node in the next round; if not, performing step 303;
303: sorting the PageRank values of the nodes finally obtained in step 302 and taking the top-k nodes as the mined key nodes, where k is the number of key nodes.
7. The parallelized key node discovery method for postal delivery data according to claim 1, characterized in that in this parallelized key node discovery method the data of every step are processed in parallel based on the MapReduce programming framework.
CN201510469302.8A 2015-08-03 2015-08-03 Parallelization critical node discovery method for postal delivery data Expired - Fee Related CN105069290B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510469302.8A CN105069290B (en) 2015-08-03 2015-08-03 Parallelization critical node discovery method for postal delivery data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510469302.8A CN105069290B (en) 2015-08-03 2015-08-03 Parallelization critical node discovery method for postal delivery data

Publications (2)

Publication Number Publication Date
CN105069290A true CN105069290A (en) 2015-11-18
CN105069290B CN105069290B (en) 2017-12-26

Family

ID=54498655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510469302.8A Expired - Fee Related CN105069290B (en) 2015-08-03 2015-08-03 Parallelization critical node discovery method for postal delivery data

Country Status (1)

Country Link
CN (1) CN105069290B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106506192A (en) * 2016-10-09 2017-03-15 中国电子科技集团公司第三十六研究所 A kind of method and apparatus of identification network key node
CN106685690A (en) * 2016-10-27 2017-05-17 中南大学 Computer network key node discovery method based on simulated building process
CN107729478A (en) * 2017-10-16 2018-02-23 天津微迪加科技有限公司 A kind of data analysing method and device
CN109379220A (en) * 2018-10-10 2019-02-22 太原理工大学 The method that complex network key node cluster based on Combinatorial Optimization excavates
CN112507996A (en) * 2021-02-05 2021-03-16 成都东方天呈智能科技有限公司 Face detection method of main sample attention mechanism
CN112990633A (en) * 2019-12-18 2021-06-18 菜鸟智能物流控股有限公司 Index data generation method, logistics cost simulation method, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130148558A1 (en) * 2011-12-12 2013-06-13 Qualcomm Incorporated Low power node dormant state
CN103259263A (en) * 2013-05-31 2013-08-21 重庆大学 Electrical power system key node identification method based on active power load flow betweenness
CN103906271A (en) * 2014-04-21 2014-07-02 西安电子科技大学 Method for measuring key nodes in Ad Hoc network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130148558A1 (en) * 2011-12-12 2013-06-13 Qualcomm Incorporated Low power node dormant state
CN103259263A (en) * 2013-05-31 2013-08-21 重庆大学 Electrical power system key node identification method based on active power load flow betweenness
CN103906271A (en) * 2014-04-21 2014-07-02 西安电子科技大学 Method for measuring key nodes in Ad Hoc network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
韩忠明: "Algorithm for discovering important nodes in weighted social networks", Journal of Computer Applications *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106506192A (en) * 2016-10-09 2017-03-15 中国电子科技集团公司第三十六研究所 A kind of method and apparatus of identification network key node
CN106685690A (en) * 2016-10-27 2017-05-17 中南大学 Computer network key node discovery method based on simulated building process
CN106685690B (en) * 2016-10-27 2019-07-09 中南大学 Computer network key node based on simulation building process finds method
CN107729478A (en) * 2017-10-16 2018-02-23 天津微迪加科技有限公司 A kind of data analysing method and device
CN109379220A (en) * 2018-10-10 2019-02-22 太原理工大学 The method that complex network key node cluster based on Combinatorial Optimization excavates
CN109379220B (en) * 2018-10-10 2021-06-15 太原理工大学 Complex network key node cluster mining method based on combination optimization
CN112990633A (en) * 2019-12-18 2021-06-18 菜鸟智能物流控股有限公司 Index data generation method, logistics cost simulation method, equipment and storage medium
CN112990633B (en) * 2019-12-18 2024-04-05 菜鸟智能物流控股有限公司 Index data generation method, logistics cost simulation method, equipment and storage medium
CN112507996A (en) * 2021-02-05 2021-03-16 成都东方天呈智能科技有限公司 Face detection method of main sample attention mechanism

Also Published As

Publication number Publication date
CN105069290B (en) 2017-12-26

Similar Documents

Publication Publication Date Title
CN105069290A (en) Parallelization critical node discovery method for postal delivery data
CN103379158B (en) The method and system of commending friends information in a kind of social networks
CN103678671A (en) Dynamic community detection method in social network
CN103020267B (en) Based on the complex network community structure method for digging of triangular cluster multi-label
CN103914528A (en) Parallelizing method of association analytical algorithm
Jain et al. An adaptive parallel algorithm for computing connected components
CN110719106B (en) Social network graph compression method and system based on node classification and sorting
CN105913235A (en) Client account transfer relation analysis method and system
CN104731925A (en) MapReduce-based FP-Growth load balance parallel computing method
CN105138650A (en) Hadoop data cleaning method and system based on outlier mining
CN112182306A (en) Uncertain graph-based community discovery method
CN102298618B (en) Method for obtaining matching degree to execute corresponding operations and device and equipment
CN104700311B (en) A kind of neighborhood in community network follows community discovery method
CN111861771A (en) Multi-objective optimization community discovery system and method based on dynamic social network attributes
Park et al. On the power of gradual network alignment using dual-perception similarities
Kusumakumari et al. Frequent pattern mining on stream data using Hadoop CanTree-GTree
CN103761298A (en) Distributed-architecture-based entity matching method
CN107590225A (en) A kind of Visualized management system based on distributed data digging algorithm
Yang et al. An efficient accelerator for point-based and voxel-based point cloud neural networks
Demetrescu et al. Adapting parallel algorithms to the W-Stream model, with applications to graph problems
CN111107493B (en) Method and system for predicting position of mobile user
CN108509531B (en) Spark platform-based uncertain data set frequent item mining method
CN112036510B (en) Model generation method, device, electronic equipment and storage medium
Wu et al. A new approach to mine frequent patterns using item-transformation methods
CN116128701A (en) Device and method for executing graph calculation task

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171226

Termination date: 20200803