CN105159922B

CN105159922B - Parallel community discovery method for consignment data based on a label propagation algorithm

Info

Publication number: CN105159922B
Application number: CN201510469289.6A
Authority: CN
Inventors: 马云龙; 刘敏; 桂峰; 章锋; 袁菡; 孙源
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2015-08-03
Filing date: 2015-08-03
Publication date: 2018-08-24
Anticipated expiration: 2035-08-03
Also published as: CN105159922A

Abstract

The present invention relates to a parallel community discovery method for consignment data based on a label propagation algorithm, comprising: Step S1: preprocess the consignment data and structure it into text data according to a set format; Step S2: combine the consignment contact information between nodes in the text data, standardize the weights of the directed edges between nodes, and finally build a directed, weighted consignment relational network model in adjacency-list form; Step S3: use an improved label propagation algorithm, parallelized with the MapReduce framework, to mine the community structure in the consignment network; Step S4: parse the community structure obtained in step S3 to discover the communities in the consignment network. Compared with the prior art, the invention improves the scalability and operating efficiency of the conventional label propagation algorithm and ultimately mines communities in the consignment network accurately and efficiently.

Description

Parallel community discovery method for consignment data based on a label propagation algorithm

Technical field

The present invention relates to a method for building a consignment network from consignment data, and more particularly to a parallel community discovery method for consignment data based on a label propagation algorithm.

Background art

Research on social network analysis originated in the early 1920s and focuses on the relationships between social entities, such as communication among group members, trade between countries, or economic transactions between companies. With the rapid development of information technology, social networks are becoming increasingly complex, and both network administrators and network researchers want a clear understanding of social network structure. Community mining is important for understanding that structure: the discovery of community structure has significant theoretical and practical value for network topology analysis, network function analysis, and network behavior prediction, and it is widely applied in fields such as social networks, biological networks, and the identification of terrorist organizations.

First, clustering-based community discovery algorithms often consider only the attribute information of nodes and therefore ignore other useful information (such as edge weights); they also require an input parameter given in advance (the number of communities in the network), so the accuracy of community division is not high. Second, label propagation algorithms require no input parameters, have linear time complexity, converge quickly, and mine with relatively high accuracy, which makes them suitable for community mining in large-scale networks. Finally, with the rapid development of computer and Internet technology, the ability to acquire data keeps growing, and the networks to be analyzed have grown from tens or hundreds of nodes to the scale of millions, so non-distributed algorithms are no longer suitable for community discovery in larger networks. The MapReduce computing framework on the Hadoop platform is well suited to processing large-scale data; introducing MapReduce into the community mining algorithm and using distributed computing to solve community discovery in large-scale consignment networks is therefore a practical solution.

Summary of the invention

The object of the present invention is to overcome the above drawbacks of the prior art by providing a parallel community discovery method for consignment data based on a label propagation algorithm. On the basis of a constructed consignment relational network model, the method uses the MapReduce distributed computing framework to improve the scalability and operating efficiency of the conventional label propagation algorithm, and ultimately mines communities in the consignment network accurately and efficiently.

The purpose of the present invention can be achieved through the following technical solutions：

A parallel community discovery method for consignment data based on a label propagation algorithm, comprising:

Step S1: preprocess the consignment data and structure it into text data according to a set format;

Step S2: combine the consignment contact information between nodes in the text data, standardize the weights of the directed edges between nodes, and finally build a directed, weighted consignment relational network model in adjacency-list form;

Step S3: use an improved label propagation algorithm, parallelized with the MapReduce framework, to mine the community structure in the consignment network;

Step S4: parse the community structure obtained in step S3 to discover the communities in the consignment network.

The text data is uploaded to the HDFS (Hadoop Distributed File System) of the Hadoop platform for storage and processing.

Step S1 is specifically: for each consignment record, the sender's name, sender's telephone number, addressee's name, and addressee's telephone number are extracted; these four fields correspond to the four columns of each line of text data.

Step S2 is specifically:

201: For each sender, obtain the adjacency list of logistics contact frequencies between that sender and its addressees, and standardize the adjacency list;

202: For any sender and addressee that have logistics exchanges, count the number A of identical addressees they have when each acts as a sender; the number A is recorded as the number of shared sending neighbors;

203: For any sender and addressee that have logistics exchanges, count the number B of identical senders they have when each acts as an addressee; the number B is recorded as the number of shared receiving neighbors;

204: For any sender and addressee that have logistics exchanges, take the sum of their number of shared sending neighbors and number of shared receiving neighbors as the number of shared neighbors between that sender and addressee, and standardize the number of shared neighbors;

205: Add the adjacency-list weights obtained in step 201 and the numbers of shared neighbors obtained in step 204 in the ratio α : (1 - α) to obtain directed-edge weights that take into account both the sending frequency and the numbers of shared sending and shared receiving neighbors, and update the adjacency list, where 0 < α < 1.

The improved label propagation algorithm uses multiple iterations; one iteration is specifically:

301: Append the unique identifier (ID) of the corresponding sender node to the end of each adjacency list obtained in step S2 as the sender node's label (Label), completing label initialization;

302: From the adjacency lists with node labels, output multiple key-value pairs of the form <key,value>, divided into sender key-value pairs and addressee key-value pairs;

303: Collect the key-value pairs with the same key and traverse each value: first, take the value of the sender key-value pair, which represents the adjacency list of that key, and store it in the variable adjacent; second, for the values of the addressee key-value pairs, sum the weights under each distinct Label and update the node label NewLabel of that key according to the share of each Label;

304: Append NewLabel to the end of adjacent, output a new key-value pair of the form <key,value>, and update the label of the adjacency list; the community structure in the consignment network corresponds to the labeled adjacency lists.

The iteration termination conditions of the improved label propagation algorithm include: the labels of more than a set percentage of nodes do not change between two successive iterations, or the set number of iterations is reached.

The set percentage is 90%.

The set number of iterations is 20 to 30.

Step S4 is specifically: according to the adjacency lists obtained in step S3, nodes with the same label are regarded as belonging to the same community, thereby discovering the communities in the consignment network.

Compared with prior art, the present invention has the following advantages：

1) The prior art mainly mines communities with single-machine algorithms, which are not suitable for community mining in large-scale networks. The invention builds a consignment network from consignment data and applies a parallel label propagation algorithm to it, so that communities in the consignment network are mined accurately and efficiently. The method is especially suitable for large-scale networks, where its advantage over conventional single-machine algorithms is pronounced.

2) Three indicators are considered when calculating the weights of the edges of the consignment network: (1) the logistics contact frequency of the two parties; (2) the number of identical addressees the two parties have when each acts as a sender; (3) the number of identical senders the two parties have when each acts as an addressee. The invention combines these three indicators to calculate the weights of all edges in the network, so as to improve the precision and accuracy of mining.

3) The method does not require input parameters given in advance, such as the number of communities, has linear time complexity, and converges quickly, so it is suitable for community mining in large-scale networks.

4) Combined with the MapReduce distributed computing framework, the text data reflecting the consignment data is uploaded to the HDFS of a Hadoop cluster for storage and processing, which improves the scalability and time efficiency of the algorithm.

Description of the drawings

Fig. 1 is the overall flow chart of the method of the invention;

Fig. 2 is a flow chart of building the consignment relational network model from consignment data;

Fig. 3 is a flow chart of mining communities in parallel using the improved label propagation algorithm.

Detailed description of the embodiments

The invention is described in detail below with reference to the accompanying drawings and a specific embodiment. The embodiment is implemented on the premise of the technical solution of the invention and gives a detailed implementation and a specific operating procedure, but the scope of protection of the invention is not limited to the following embodiment.

As shown in Fig. 1, a parallel community discovery method for consignment data based on a label propagation algorithm is divided into a stage of building the consignment relational network model and a mining stage, as follows:

Step S1: preprocess the consignment data and structure it into text data according to a set format; the text data is uploaded to the HDFS of the Hadoop cluster for storage and processing. Specifically:

For each consignment record, the sender's name, sender's telephone number, addressee's name, and addressee's telephone number are extracted; these four fields correspond to the four columns of each line of text data.
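The four-column layout of step S1 can be sketched as follows. This is only an illustration: the field names (`sender_name`, `sender_phone`, `addressee_name`, `addressee_phone`) and the choice of a tab as column separator are assumptions, since the source only says that the data is structured "according to a set format".

```python
def to_text_row(record):
    """Format one consignment record as a four-column, tab-separated text line.

    The dictionary keys used here are hypothetical; the source does not name them.
    """
    columns = (
        record["sender_name"],
        record["sender_phone"],
        record["addressee_name"],
        record["addressee_phone"],
    )
    # Replace embedded tabs and trim whitespace so each field stays in its own column.
    return "\t".join(str(c).replace("\t", " ").strip() for c in columns)


if __name__ == "__main__":
    row = to_text_row({
        "sender_name": "Zhang",
        "sender_phone": "111",
        "addressee_name": "Li",
        "addressee_phone": "222",
    })
    print(row)
```

Each resulting line would then be written to a text file and uploaded to HDFS.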

Step S2: combine the consignment contact information between nodes in the text data, standardize the weights of the directed edges between nodes, and finally build a directed, weighted consignment relational network model in adjacency-list form, which is uploaded to HDFS. As shown in Fig. 2, specifically:

201: For each sender, obtain the adjacency list of logistics contact frequencies between that sender and its addressees, and standardize the adjacency list. In detail:

1) First, based on the MapReduce computing framework, the Map stage reads line by line the standardized text data stored in HDFS in step S1, uses the combination of name and telephone number as the unique identifier (ID) of each sender and each addressee, and outputs key-value pairs of the form <key,value>, where key is the sender ID and value is the addressee ID.

2) In the Reduce stage, for each identical key, i.e. each sender, the frequency of logistics exchanges between that sender and each of its addressees is counted. This yields, for each sender, an adjacency list that considers only the logistics exchange frequency between that sender and its addressees.

3) Next, according to each sender's adjacency list, if the sender's sending frequency exceeds a set frequency (taken empirically as 500 in this embodiment), the sender can be judged to be a logistics terminal, a Taobao seller, or similar. The adjacency list of that sender is therefore deleted, and the sender node is also deleted from the adjacency lists of the other senders.

4) Finally, from the newly generated adjacency lists of all senders, find the sender and addressee with the highest logistics contact frequency; let this maximum contact count be Max, and use Max to standardize the adjacency lists of all senders. Suppose a sender's adjacency list is [S\tR₁:C₁\tR₂:C₂...\tR_k:C_k], where \t is the separator, S (short for Sender) denotes the sender, R (short for Receiver) denotes an addressee, C (short for Count) denotes the contact count, and the subscript k is the index of the addressee and its count. The contact counts (C₁, C₂, ..., C_k) of the addressees (R₁, R₂, ..., R_k) that exchange logistics with the sender are each divided by Max, which gives the sender's standardized adjacency list [S\tR₁:W₁\tR₂:W₂...\tR_k:W_k], where W_k = C_k/Max.
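Sub-steps 1) to 4) of step 201 can be sketched in plain Python, with an in-memory dictionary standing in for the MapReduce shuffle. Several details are assumptions made for illustration: the `|` separator inside the ID, the function names, and the reading of "sending frequency" as a sender's total number of consignments.

```python
from collections import Counter, defaultdict


def node_id(name, phone):
    # Name plus telephone number serves as the unique ID; the "|" separator is an assumption.
    return f"{name}|{phone}"


def build_frequency_adjacency(rows, max_sent=500):
    """Return {sender: {addressee: weight}} with weights W = C / Max.

    rows is an iterable of (sender_name, sender_phone, addressee_name, addressee_phone).
    """
    # "Map" + "Reduce": count how often each sender ships to each addressee.
    counts = defaultdict(Counter)
    for s_name, s_phone, r_name, r_phone in rows:
        counts[node_id(s_name, s_phone)][node_id(r_name, r_phone)] += 1

    # Senders above the threshold (assumed: total consignments sent) are treated as
    # logistics terminals or online sellers and removed everywhere.
    heavy = {s for s, c in counts.items() if sum(c.values()) > max_sent}
    kept = {}
    for s, c in counts.items():
        if s in heavy:
            continue
        filtered = {r: n for r, n in c.items() if r not in heavy}
        if filtered:
            kept[s] = filtered
    if not kept:
        return {}

    # Standardize every count by the largest contact count in the network.
    max_count = max(n for c in kept.values() for n in c.values())
    return {s: {r: n / max_count for r, n in c.items()} for s, c in kept.items()}
```

For example, with two consignments from A to B and one from A to C, Max is 2, so the edge A to B gets weight 1.0 and the edge A to C gets weight 0.5.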

202: Obtain the number of shared sending neighbors: for any sender and addressee that have logistics exchanges, count the number A of identical addressees they have when each acts as a sender; the number A is recorded as the number of shared sending neighbors. In detail:

1) First, under the MapReduce computing framework, the Map stage reads each sender's adjacency list [S\tR₁:W₁\tR₂:W₂...\tR_k:W_k] and outputs multiple key-value pairs of the form <key,value>: <S,+R₁\tR₂...\tR_k> (the "+" distinguishes this pair from the key-value pairs that follow) and <R₁,S\tR₂...\tR_k>, <R₂,S\tR₁...\tR_k>, ..., <R_k,S\tR₁...\tR_{k-1}>.

2) In the Reduce stage, the key-value pairs with the same key are collected and each value is traversed. First, the value beginning with "+" is taken and split on "\t" into an array; the elements of the array are the addressees of the current key user when that user acts as a sender, and these neighbors are stored in a HashSet named set_key. Second, each remaining value without "+" is split on "\t" into an array and parsed, and the result is stored in a HashMap named map (the key of map is the first element of the split array, and the value is a HashSet holding the other elements of the array). Finally, map is traversed and the intersection of each element's value with set_key is computed; the size of the intersection is the number of shared sending neighbors of that element's key and the current Reduce key when each acts as a sender.
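The Map and Reduce logic of step 202 can be sketched as below. This is a single-process simulation, not Hadoop code: a dictionary of lists stands in for the shuffle phase, and the function names are hypothetical. It also assumes that no node ID begins with "+" or contains a tab.

```python
from collections import defaultdict


def map_sender(sender, receivers):
    """Emit <S, +R1\\tR2...> plus one <Ri, S\\t(other receivers)> pair per addressee."""
    yield sender, "+" + "\t".join(receivers)
    for r in receivers:
        others = [x for x in receivers if x != r]
        yield r, "\t".join([sender] + others)


def reduce_shared_sending(key, values):
    """Return {(sender, key): number of shared sending neighbors}."""
    set_key = set()          # addressees of `key` when `key` acts as a sender
    senders_to_key = {}      # sender -> that sender's other addressees
    for v in values:
        if v.startswith("+"):
            set_key = set(filter(None, v[1:].split("\t")))
        else:
            first, *rest = v.split("\t")
            senders_to_key[first] = set(rest)
    # Intersections are taken after the loop, so the order of values does not matter.
    return {(s, key): len(rest & set_key) for s, rest in senders_to_key.items()}


def shared_sending_neighbors(adjacency):
    """adjacency maps each sender to a list of its addressees."""
    grouped = defaultdict(list)  # stands in for the MapReduce shuffle
    for s, receivers in adjacency.items():
        for k, v in map_sender(s, list(receivers)):
            grouped[k].append(v)
    result = {}
    for k, vs in grouped.items():
        result.update(reduce_shared_sending(k, vs))
    return result
```

With A sending to B, C, D and B sending to C, D, E, the pair (A, B) has two shared sending neighbors, C and D.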

203: Obtain the number of shared receiving neighbors: for any sender and addressee that have logistics exchanges, count the number B of identical senders they have when each acts as an addressee; the number B is recorded as the number of shared receiving neighbors. In detail:

First, from the adjacency list of each sender in step 201, [S\tR₁:W₁\tR₂:W₂...\tR_k:W_k], an inverted index from each addressee to its senders is built, [R₁\tS_l\tS_p...\tS_n], where the subscripts l, p, n are the indices of the senders after inversion. Second, by analogy with the procedure of step 202, for any sender and addressee that have logistics exchanges, the number of shared receiving neighbors they have when each acts as an addressee is counted.
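The inverted index and the shared-receiving count of step 203 can be illustrated with a direct set-based sketch. It replaces the MapReduce formulation with in-memory sets for brevity, and the function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(adjacency):
    """Map each addressee to the set of senders that ship to it."""
    senders_of = defaultdict(set)
    for sender, receivers in adjacency.items():
        for r in receivers:
            senders_of[r].add(sender)
    return senders_of


def shared_receiving_neighbors(adjacency):
    """For every edge (sender, addressee), count the senders that ship to both nodes."""
    senders_of = build_inverted_index(adjacency)
    empty = set()
    return {
        (s, r): len(senders_of.get(s, empty) & senders_of.get(r, empty))
        for s, receivers in adjacency.items()
        for r in receivers
    }
```

If C and D both ship to A and to B, and A ships to B, then the edge (A, B) has two shared receiving neighbors.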

204: For any sender and addressee that have logistics exchanges, take the sum of their number of shared sending neighbors and number of shared receiving neighbors as the number of shared neighbors between that sender and addressee, and find the maximum number of shared neighbors in the whole network, which is used to standardize the number of shared neighbors of every sender node and addressee node that have logistics contact.

205: Add the adjacency-list weights obtained in step 201 and the numbers of shared neighbors obtained in step 204 in the ratio α : (1 - α) to obtain directed-edge weights that take into account both the sending frequency and the numbers of shared sending and shared receiving neighbors. That is, the edge weight in the adjacency list contributes with weight α, and the numbers of shared sending and shared receiving neighbors contribute with weight 1 - α, where 0 < α < 1. The adjacency list is updated with the new directed-edge weights, and the newly generated adjacency list is uploaded to HDFS.
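Steps 204 and 205 amount to a weighted sum of two standardized quantities. The sketch below shows one way to compute it; the function name, the dictionary layout, and the default `alpha=0.5` are assumptions (the source only requires 0 < α < 1).

```python
def combine_weights(freq_adj, shared_send, shared_recv, alpha=0.5):
    """Return {sender: {addressee: alpha * W + (1 - alpha) * N}}.

    W is the standardized frequency weight from step 201 and N is the number of
    shared neighbors (sending + receiving) divided by the network-wide maximum.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")

    # Step 204: sum the shared sending and shared receiving counts per edge.
    shared = {
        (s, r): shared_send.get((s, r), 0) + shared_recv.get((s, r), 0)
        for s, nbrs in freq_adj.items()
        for r in nbrs
    }
    max_shared = max(shared.values(), default=0)

    # Step 205: blend the two standardized terms in the ratio alpha : (1 - alpha).
    new_adj = {}
    for s, nbrs in freq_adj.items():
        new_adj[s] = {}
        for r, w in nbrs.items():
            norm = shared[(s, r)] / max_shared if max_shared else 0.0
            new_adj[s][r] = alpha * w + (1 - alpha) * norm
    return new_adj
```

A larger α emphasizes how often two parties ship to each other; a smaller α emphasizes how much their neighborhoods overlap.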

This completes the data processing of the stage of building the consignment relational network model, as shown in Fig. 2. The data processing of the mining stage, shown in Fig. 3, is described below.

Step S3: use an improved label propagation algorithm, parallelized with the MapReduce framework, to mine the community structure in the consignment network.

The improved label propagation algorithm uses multiple iterations; one iteration is specifically:

301: Append the unique identifier (ID) of the corresponding sender node to the end of each adjacency list obtained in step S2 as the sender node's label (Label), completing label initialization; the corresponding adjacency list with node label is expressed as [S\tR₁:W₁\tR₂:W₂...\tR_k:W_k\tLabel].

302: In the Map stage, output multiple key-value pairs of the form <key,value> from the adjacency lists with node labels, divided into the sender key-value pair <S,+R₁:W₁\tR₂:W₂...\tR_k:W_k> (the "+" distinguishes this pair from the key-value pairs generated below) and the addressee key-value pairs <R₁,Label\tW₁>, <R₂,Label\tW₂>, ..., <R_k,Label\tW_k>.

303: In the Reduce stage, the key-value pairs with the same key are collected and each value is traversed. First, the value of the sender key-value pair (the value with "+") is taken; it represents the adjacency list of that key and is stored in the variable adjacent. Second, for the values of the addressee key-value pairs (the values without "+"), the weights under each distinct Label are summed, and the node label NewLabel of that key is updated according to the share of each Label: the larger the share of a Label, the more likely the label of the current key node is updated to that Label.

304: Append the newly generated label NewLabel of the key node to the end of adjacent and output a new key-value pair of the form <key,value>, i.e. <S,R₁:W₁\tR₂:W₂...\tR_k:W_k\tNewLabel>, and update the label of the adjacency list; the community structure in the consignment network corresponds to the labeled adjacency lists.
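One iteration of steps 302 to 304 can be sketched as a single-process simulation, with a dictionary of lists standing in for the shuffle. Three choices here are assumptions rather than details from the source: values are passed as tuples instead of tab-separated strings, the new label is the one with the largest weight sum (the source only says a larger share makes a label more likely), and a node with no incoming edges keeps its previous label.

```python
from collections import defaultdict


def map_node(sender, neighbors, label):
    """Step 302: emit one sender pair (tagged "+") and one addressee pair per edge."""
    # The old label travels with the sender pair so isolated nodes can keep it (assumption).
    yield sender, ("+", neighbors, label)
    for addressee, weight in neighbors.items():
        yield addressee, ("", label, weight)


def reduce_node(key, values):
    """Steps 303-304: rebuild the adjacency list and choose the node's new label."""
    adjacent, old_label = {}, key
    weight_by_label = defaultdict(float)
    for value in values:
        if value[0] == "+":
            _, adjacent, old_label = value
        else:
            _, label, weight = value
            weight_by_label[label] += weight
    if weight_by_label:
        # Largest summed weight wins; ties go to the smallest label for determinism.
        new_label = max(sorted(weight_by_label), key=weight_by_label.get)
    else:
        new_label = old_label
    return key, adjacent, new_label


def lpa_iteration(graph):
    """graph maps node -> (neighbors {addressee: weight}, label); returns the same shape."""
    grouped = defaultdict(list)  # stands in for the MapReduce shuffle
    for sender, (neighbors, label) in graph.items():
        for k, v in map_node(sender, neighbors, label):
            grouped[k].append(v)
    reduced = (reduce_node(k, vs) for k, vs in grouped.items())
    return {k: (adjacent, label) for k, adjacent, label in reduced}
```

Label initialization (step 301) corresponds to building `graph` with each node's own ID as its label before the first call to `lpa_iteration`.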

The improved label propagation algorithm has two iteration termination conditions: (1) the node labels are basically stable, i.e. the labels of more than a set percentage of nodes do not change between two successive iterations, where the set percentage is 90% in this embodiment; (2) the set number of iterations is reached, which is generally 20 to 30 and is 25 in this embodiment.
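The two termination conditions can be checked with a small helper such as the one below. The function and parameter names are hypothetical; the defaults follow the values of this embodiment (90% and 25 iterations).

```python
def should_stop(prev_labels, new_labels, iteration, stable_fraction=0.9, max_iterations=25):
    """Return True when iteration should end.

    prev_labels and new_labels map node -> label for two successive iterations;
    iteration is the number of iterations completed so far.
    """
    if iteration >= max_iterations:
        return True
    if not new_labels:
        return True
    unchanged = sum(1 for node, label in new_labels.items() if prev_labels.get(node) == label)
    # "More than the set percentage" is read as a strict inequality.
    return unchanged / len(new_labels) > stable_fraction
```

In a driver loop, this check would run after each MapReduce job, comparing the labels just written with those of the previous round.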

Step S4: parse the community structure obtained in step S3 to discover the communities in the consignment network, and store the result in HDFS. Specifically:

According to the adjacency lists obtained in step S3, nodes with the same label are regarded as belonging to the same community, thereby discovering the communities in the consignment network.
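Step S4 is a grouping of nodes by final label. A minimal sketch, assuming the labels have already been parsed from the adjacency lists into a node-to-label dictionary (the function name is hypothetical):

```python
from collections import defaultdict


def communities_from_labels(labels):
    """Group nodes that carry the same label into one community."""
    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return dict(groups)
```

For instance, labels `{"A": "C", "B": "C", "D": "D"}` yield two communities, one containing A and B and one containing only D.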

In summary, the stage of building the consignment relational network model is a data preprocessing process and the mining stage is an iterative process; the iterative process implements a distributed form of the single-machine label propagation algorithm. In addition, given the particular nature of logistics consignment data, this patent considers three indicators when calculating the weights of the edges of the consignment network: (1) the logistics contact frequency of the two parties; (2) the number of identical addressees the two parties have when each acts as a sender; (3) the number of identical senders the two parties have when each acts as an addressee. The invention combines these three indicators to calculate the weights of all edges in the network, so that communities in the consignment network are mined accurately and efficiently.

Claims

1. A parallel community discovery method for consignment data based on a label propagation algorithm, characterized in that it comprises:

Step S1: preprocess the consignment data and structure it into text data according to a set format;

Step S2: combine the consignment contact information between nodes in the text data, standardize the weights of the directed edges between nodes, and finally build a directed, weighted consignment relational network model in adjacency-list form;

Step S3: use an improved label propagation algorithm, parallelized with the MapReduce framework, to mine the community structure in the consignment network;

Step S4: parse the community structure obtained in step S3 to discover the communities in the consignment network;

the improved label propagation algorithm uses multiple iterations, one iteration being specifically:

301: append the unique identifier (ID) of the corresponding sender node to the end of each adjacency list obtained in step S2 as the sender node's label (Label), completing label initialization;

302: from the adjacency lists with node labels, output multiple key-value pairs of the form <key,value>, divided into sender key-value pairs and addressee key-value pairs;

303: collect the key-value pairs with the same key and traverse each value: first, take the value of the sender key-value pair, which represents the adjacency list of that key, and store it in the variable adjacent; second, for the values of the addressee key-value pairs, sum the weights under each distinct Label and update the node label NewLabel of that key according to the share of each Label;

304: append NewLabel to the end of adjacent, output a new key-value pair of the form <key,value>, and update the label of the adjacency list, the community structure in the consignment network corresponding to the labeled adjacency lists.

2. The parallel community discovery method for consignment data based on a label propagation algorithm according to claim 1, characterized in that the text data is uploaded to the HDFS of a Hadoop cluster for storage and processing.

3. The parallel community discovery method for consignment data based on a label propagation algorithm according to claim 1, characterized in that step S1 is specifically: for each consignment record, the sender's name, sender's telephone number, addressee's name, and addressee's telephone number are extracted, these four fields corresponding to the four columns of each line of text data.

4. The parallel community discovery method for consignment data based on a label propagation algorithm according to claim 1, characterized in that step S2 is specifically:

201: for each sender, obtain the adjacency list of logistics contact frequencies between that sender and its addressees, and standardize the adjacency list;

202: for any sender and addressee that have logistics exchanges, count the number A of identical addressees they have when each acts as a sender, the number A being recorded as the number of shared sending neighbors;

203: for any sender and addressee that have logistics exchanges, count the number B of identical senders they have when each acts as an addressee, the number B being recorded as the number of shared receiving neighbors;

204: for any sender and addressee that have logistics exchanges, take the sum of their number of shared sending neighbors and number of shared receiving neighbors as the number of shared neighbors between that sender and addressee, and standardize the number of shared neighbors;

205: add the adjacency-list weights obtained in step 201 and the numbers of shared neighbors obtained in step 204 in the ratio α : (1 - α) to obtain directed-edge weights that take into account both the sending frequency and the numbers of shared sending and shared receiving neighbors, and update the adjacency list, where 0 < α < 1.

5. The parallel community discovery method for consignment data based on a label propagation algorithm according to claim 1, characterized in that the iteration termination conditions of the improved label propagation algorithm include: the labels of more than a set percentage of nodes do not change between two successive iterations, or the set number of iterations is reached.

6. The parallel community discovery method for consignment data based on a label propagation algorithm according to claim 5, characterized in that the set percentage is 90%.

7. The parallel community discovery method for consignment data based on a label propagation algorithm according to claim 5, characterized in that the set number of iterations is 20 to 30.

8. The parallel community discovery method for consignment data based on a label propagation algorithm according to claim 1, characterized in that step S4 is specifically: according to the adjacency lists obtained in step S3, nodes with the same label are regarded as belonging to the same community.