CN105159922A - Label propagation algorithm-based posting data-oriented parallelized community discovery method

Info

Publication number: CN105159922A
Application number: CN201510469289.6A
Authority: CN
Inventors: 马云龙; 刘敏; 桂峰; 章锋; 袁菡; 孙源
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2015-08-03
Filing date: 2015-08-03
Publication date: 2015-12-16
Anticipated expiration: 2035-08-03
Also published as: CN105159922B

Abstract

The present invention relates to a parallelized community discovery method for delivery data based on a label propagation algorithm, comprising: step S1: preprocessing the delivery data and structuring it into text data according to a set format; step S2: aggregating the delivery exchanges between nodes in the text data, normalizing the weights of the directed edges between nodes, and constructing a directed, weighted delivery relationship network model in the form of an adjacency list; step S3: using an improved label propagation algorithm, parallelized with the MapReduce framework, to mine the community structure in the delivery network; step S4: parsing the community structure obtained in step S3 to discover the communities in the delivery network. Compared with the prior art, the present invention improves the scalability and operating efficiency of the traditional label propagation algorithm and achieves accurate and efficient mining of communities in the delivery network.

Description

A Parallelized Community Discovery Method for Delivery Data Based on a Label Propagation Algorithm

Technical Field

The invention relates to a method for constructing a delivery network from delivery data, and in particular to a parallelized community discovery method for delivery data based on a label propagation algorithm.

Background

Research on social network analysis originated in the early 1920s and focuses on the relationships between social entities, for example communication among the members of a group, trade between countries, or economic transactions between companies. With the rapid growth of information, social networks are becoming increasingly complex, and both network managers and network researchers want a clear understanding of their structure. Community mining is important for understanding the structure of social networks. Discovering the community structure of a network has considerable theoretical and practical value for network topology analysis, network function analysis and network behavior prediction, and it is widely applied in fields such as social networks, biological networks and the identification of terrorist organizations.

First, clustering-based community discovery algorithms often consider only the attribute information of nodes and ignore other useful information (such as edge weights). They also require a pre-specified input parameter (the number of communities in the network), so the accuracy of the resulting community partition is low. Second, the label propagation algorithm requires no input parameters, has linear time complexity, converges quickly and mines with fairly high accuracy, which makes it suitable for community mining in large-scale networks. Finally, with the rapid development of computer and Internet technology, the ability to collect data keeps growing, and the networks to be analyzed have grown from tens or hundreds of nodes to millions or tens of millions of nodes, so non-distributed algorithms are no longer suitable for community discovery in large networks. The MapReduce computing framework of the Hadoop platform is well suited to processing large-scale data. Introducing MapReduce into the community mining algorithm, and using distributed computing to discover communities in large-scale delivery networks, is therefore a practical solution.

Summary of the Invention

The purpose of the present invention is to overcome the above defects of the prior art by providing a parallelized community discovery method for delivery data based on a label propagation algorithm. Building on a delivery relationship network model, the method uses the MapReduce distributed computing framework to improve the scalability and operating efficiency of the traditional label propagation algorithm, and thereby mines communities in the delivery network accurately and efficiently.

The purpose of the present invention can be achieved through the following technical solution:

A parallelized community discovery method for delivery data based on a label propagation algorithm, comprising:

Step S1: preprocess the delivery data and structure it into text data according to a set format;

Step S2: aggregate the delivery exchanges between nodes in the text data, normalize the weights of the directed edges between nodes, and construct a directed, weighted delivery relationship network model in the form of an adjacency list;

Step S3: use the improved label propagation algorithm, parallelized with the MapReduce framework, to mine the community structure in the delivery network;

Step S4: parse the community structure obtained in step S3 to discover the communities in the delivery network.

The text data is uploaded to the HDFS (Hadoop Distributed File System) of the Hadoop platform for storage and processing.

Step S1 is specifically: for each delivery record, the sender's name, sender's phone number, recipient's name and recipient's phone number are extracted; these four items form the four columns of each line of text data.

Step S2 is specifically:

201: for each sender, obtain an adjacency list of the delivery frequencies between that sender and each of its recipients, and normalize the adjacency list;

202: for any sender and recipient between whom deliveries exist, count the number A of recipients they have in common when each acts as a sender; A is recorded as the number of shared sending neighbors;

203: for any sender and recipient between whom deliveries exist, count the number B of senders they have in common when each acts as a recipient; B is recorded as the number of shared receiving neighbors;

204: for any sender and recipient between whom deliveries exist, take the sum of their number of shared sending neighbors and their number of shared receiving neighbors as the number of shared neighbors between that sender and recipient, and normalize the number of shared neighbors;

205: add the adjacency-list weight obtained in step 201 and the number of shared neighbors obtained in step 204 in the ratio α : 1-α to obtain a directed edge weight that accounts for the delivery frequency together with the numbers of shared sending and shared receiving neighbors, and update the adjacency list, where 0 < α < 1.

The improved label propagation algorithm runs for multiple iterations. One iteration is specifically:

301: append the unique ID of the corresponding sender node to the end of the adjacency list obtained in step S2 as the sender node's label Label, which completes label initialization;

302: from the labeled adjacency list, output several key-value pairs of the form <key,value>, divided into sender key-value pairs and recipient key-value pairs;

303: collect the key-value pairs that share the same key and traverse their values. First, take the value of the sender key-value pair, which holds the adjacency list of that key, and store it in the variable adjacent. Second, for the values of the recipient key-value pairs, sum the weights under each distinct Label and update the node label NewLabel of that key according to the share of each Label;

304: append NewLabel to the end of adjacent, output a new key-value pair of the form <key,value>, and update the label in the adjacency list. The community structure in the delivery network corresponds to the labeled adjacency lists.

The iteration termination conditions of the improved label propagation algorithm are: the labels of more than a set percentage of nodes do not change between two consecutive iterations, or a set number of iterations is reached.

The set percentage is 90%.

The set number of iterations is 20 to 30.

Step S4 is specifically: based on the adjacency lists obtained in step S3, nodes with the same label are regarded as one community, which reveals the communities in the delivery network.

Compared with the prior art, the present invention has the following advantages:

1) The prior art mines communities mainly with single-machine algorithms, which are unsuitable for large-scale networks. The present invention constructs a delivery network from delivery data and applies a parallel label propagation algorithm to it, so that communities in the delivery network are mined accurately and efficiently. The method is particularly suited to large-scale networks, where its advantage over traditional single-machine community mining is clear.

2) The edge weights of the delivery network take three indicators into account: 1. the delivery frequency between the two parties; 2. the number of recipients the two parties have in common when each acts as a sender; 3. the number of senders the two parties have in common when each acts as a recipient. The present invention combines these three indicators to compute the weights of all edges in the network, which improves the precision and accuracy of mining.

3) The method requires no input parameters, has linear time complexity and converges quickly, so it is suitable for community mining in large-scale networks.

4) In combination with the MapReduce distributed computing framework, the text data representing the delivery data is uploaded to the HDFS of a Hadoop cluster for storage and processing, which improves the scalability and time efficiency of the algorithm.

Brief Description of the Drawings

Fig. 1 is the overall flowchart of the method of the present invention;

Fig. 2 is a flowchart of building the delivery relationship network model from delivery data;

Fig. 3 is a flowchart of parallelized community mining with the improved label propagation algorithm.

Detailed Description

The present invention is described in detail below with reference to the drawings and a specific embodiment. The embodiment is implemented on the basis of the technical solution of the present invention and gives a detailed implementation and a specific operating procedure, but the protection scope of the present invention is not limited to the embodiment below.

As shown in Fig. 1, the parallelized community discovery method for delivery data based on a label propagation algorithm is divided into a stage that builds the delivery relationship network model and a mining stage, as follows:

Step S1: preprocess the delivery data, structure it into text data according to a set format, and upload the text data to the HDFS of the Hadoop cluster for storage and processing. Specifically:

For each delivery record, the sender's name, sender's phone number, recipient's name and recipient's phone number are extracted; these four items form the four columns of each line of text data.
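The structuring in step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the raw record layout is not specified in the source, so the dictionary field names in FIELDS and the use of a tab as column delimiter are assumptions.

```python
# Hypothetical raw field names; the source does not specify the raw record layout.
FIELDS = ("sender_name", "sender_phone", "recipient_name", "recipient_phone")


def to_text_line(record: dict) -> str:
    """Turn one raw delivery record into a four-column, tab-separated line."""
    # Surrounding whitespace is stripped so the same person yields the same columns.
    # Any extra fields in the raw record are ignored.
    return "\t".join(str(record[field]).strip() for field in FIELDS)


if __name__ == "__main__":
    raw = {
        "sender_name": " Alice ",
        "sender_phone": "111",
        "recipient_name": "Bob",
        "recipient_phone": "222",
        "parcel_weight": "1.2kg",  # not part of the four columns
    }
    print(to_text_line(raw))
```

The resulting lines are what would be uploaded to HDFS as the input of step S2.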

Step S2: aggregate the delivery exchanges between nodes in the text data, normalize the weights of the directed edges between nodes, construct a directed, weighted delivery relationship network model in the form of an adjacency list, and upload it to HDFS. As shown in Fig. 2, specifically:

201: for each sender, obtain an adjacency list of the delivery frequencies between that sender and each of its recipients, and normalize the adjacency list. In detail:

1) First, using the MapReduce computing framework, the Map phase reads line by line the structured text data stored in HDFS in step S1. For both sender and recipient, the combination of name and phone number serves as the unique ID. The Map phase outputs key-value pairs of the form <key,value>, where key is the sender ID and value is the recipient ID.

2) The Reduce phase collects the values that share the same key, that is, the same sender, and counts the delivery frequency between that sender and each distinct recipient. The result, for each sender, is an adjacency list that reflects only the delivery frequencies between that sender and its recipients.

3) Next, based on each sender's adjacency list, if the sender's delivery frequency exceeds a set threshold (500 in this embodiment, chosen empirically), the sender can be judged to be a logistics transfer station, a Taobao seller or a similar case. That sender's adjacency list is therefore deleted, and the sender node is also deleted from the adjacency lists of the other senders.

4) Finally, over the resulting adjacency lists of all senders, find the sender-recipient pair with the highest delivery frequency and denote that frequency Max; Max is used to normalize every sender's adjacency list. Suppose a sender's adjacency list is [S\tR₁:C₁\tR₂:C₂...\tR_k:C_k], where \t is the delimiter, S (Sender) is the sender, R (Receiver) is a recipient, C (Count) is the number of deliveries to that recipient, and the subscript k indexes the recipients and their counts. Dividing each count (C₁, C₂, ..., C_k) for the recipients (R₁, R₂, ..., R_k) by Max gives the sender's normalized adjacency list [S\tR₁:W₁\tR₂:W₂...\tR_k:W_k], where W_k = C_k/Max.
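The four sub-steps of step 201 can be sketched on a single machine as below. This is a simplified, in-memory stand-in for the MapReduce jobs, under stated assumptions: the node ID joins name and phone number with an underscore, the "delivery frequency" of a sender is read as the total number of parcels it sent, and a sender left with no recipients after filtering is dropped. None of these details is fixed by the source; the function name and the max_sent parameter are hypothetical.

```python
from collections import Counter, defaultdict


def build_frequency_adjacency(lines, max_sent=500):
    """Return {sender_id: {recipient_id: normalized_weight}} from four-column lines."""
    # Map phase: emit one (sender_id, recipient_id) pair per input line.
    pairs = []
    for line in lines:
        s_name, s_phone, r_name, r_phone = line.rstrip("\n").split("\t")
        pairs.append((f"{s_name}_{s_phone}", f"{r_name}_{r_phone}"))

    # Reduce phase: for each sender, count deliveries to each recipient.
    counts = defaultdict(Counter)
    for sender, recipient in pairs:
        counts[sender][recipient] += 1

    # Filtering: senders above the threshold are treated as hubs (transfer
    # stations, online sellers) and removed both as senders and as recipients.
    hubs = {s for s, c in counts.items() if sum(c.values()) > max_sent}
    filtered = {}
    for sender, c in counts.items():
        if sender in hubs:
            continue
        kept = {r: n for r, n in c.items() if r not in hubs}
        if kept:  # assumption: senders with no remaining recipients are dropped
            filtered[sender] = kept

    # Normalization: divide every count by the largest count in the network (Max).
    global_max = max((n for c in filtered.values() for n in c.values()), default=1)
    return {s: {r: n / global_max for r, n in c.items()} for s, c in filtered.items()}
```

With the default max_sent of 500 the filter matches the threshold of this embodiment; a smaller value is convenient for experimenting on toy data.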

202: compute the number of shared sending neighbors: for any sender and recipient between whom deliveries exist, count the number A of recipients they have in common when each acts as a sender; A is recorded as the number of shared sending neighbors. In detail:

1) First, under the MapReduce computing framework, the Map phase reads each sender's adjacency list [S\tR₁:W₁\tR₂:W₂...\tR_k:W_k] from step 201 and outputs several key-value pairs of the form <key,value>: <S,+R₁\tR₂...\tR_k> (the "+" distinguishes this pair from the ones that follow) and <R₁,S\tR₂...\tR_k>, <R₂,S\tR₁...\tR_k>, ..., <R_k,S\tR₁...\tR_{k-1}>.

2) The Reduce phase collects the <key,value> pairs that share the same key and traverses their values. First, take the value marked with "+" and split it on "\t" into an array; its elements are the recipients of the current key user acting as a sender. These neighbor users are stored in a HashSet named set_key. Second, split each remaining value (those without "+") on "\t" into an array and store the parsed result in a HashMap named map, whose key is the first element of the array and whose value is a HashSet holding the remaining elements. Finally, traverse map and intersect the value of each entry with set_key; the size of the intersection is the number of shared sending neighbors of that entry's key and the current Reduce key when each acts as a sender.
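The Map and Reduce phases of step 202 can be imitated in memory as in the sketch below. It is an illustration only: the shuffle is simulated with a dictionary, the "+" marker is replaced by a None tag, and the function name and the {(sender, recipient): A} output shape are assumptions rather than the patent's data format.

```python
from collections import defaultdict


def shared_sending_neighbours(adjacency):
    """adjacency: {sender: {recipient: weight}} -> {(sender, recipient): A}."""
    # Map phase: group emitted values by key, as the shuffle would.
    grouped = defaultdict(list)
    for sender, recipients in adjacency.items():
        names = list(recipients)
        # Pair marked with "+" in the text: the sender's own recipient list.
        grouped[sender].append((None, names))
        # One pair per recipient: the sender followed by its other recipients.
        for recipient in names:
            grouped[recipient].append((sender, [x for x in names if x != recipient]))

    # Reduce phase: intersect each incoming sender's recipients with set_key.
    result = {}
    for key, values in grouped.items():
        set_key = set()      # recipients of the current key acting as a sender
        other_lists = {}     # plays the role of the HashMap named map in the text
        for tag, names in values:
            if tag is None:
                set_key = set(names)
            else:
                other_lists[tag] = set(names)
        for sender, names in other_lists.items():
            result[(sender, key)] = len(names & set_key)
    return result
```

For the edges A→B, A→C, B→C and B→D, the only shared sending neighbor is C, shared by A and B, so the edge (A, B) gets A = 1 and every other edge gets 0.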

203: compute the number of shared receiving neighbors: for any sender and recipient between whom deliveries exist, count the number B of senders they have in common when each acts as a recipient; B is recorded as the number of shared receiving neighbors. In detail:

First, from each sender's adjacency list [S\tR₁:W₁\tR₂:W₂...\tR_k:W_k] in step 201, build for each recipient an inverted index to its senders, [R₁\tS_l\tS_p...\tS_n], where the subscripts l, p, n are the indices of the senders after inversion. Then, by analogy with the procedure of step 202, obtain for any sender and recipient with deliveries between them the number of shared receiving neighbors when each acts as a recipient.
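A compact sketch of step 203 follows. It computes the inverted index and the set intersections directly rather than as a MapReduce job, which is a simplification. The function names are hypothetical, and excluding the sender itself from the recipient's sender set mirrors how step 202 leaves the current recipient out of the emitted list.

```python
from collections import defaultdict


def invert(adjacency):
    """Build the inverted index {recipient: set of senders}."""
    inverted = defaultdict(set)
    for sender, recipients in adjacency.items():
        for recipient in recipients:
            inverted[recipient].add(sender)
    return dict(inverted)


def shared_receiving_neighbours(adjacency):
    """adjacency: {sender: {recipient: weight}} -> {(sender, recipient): B}."""
    inverted = invert(adjacency)
    result = {}
    for sender, recipients in adjacency.items():
        # Nodes that deliver to the sender (empty if the sender never receives).
        senders_to_sender = inverted.get(sender, set())
        for recipient in recipients:
            # Nodes that deliver to the recipient, excluding the sender itself.
            senders_to_recipient = inverted[recipient] - {sender}
            result[(sender, recipient)] = len(senders_to_sender & senders_to_recipient)
    return result
```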

204: for any sender and recipient between whom deliveries exist, take the sum of their number of shared sending neighbors and their number of shared receiving neighbors as the number of shared neighbors between that sender and recipient. Then find the maximum number of shared neighbors over the whole network and use it to normalize the number of shared neighbors of every sender node and recipient node between which deliveries exist.
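Step 204 amounts to a per-edge sum followed by division by the network-wide maximum. A minimal sketch, assuming the outputs of steps 202 and 203 are dictionaries keyed by (sender, recipient) and that an all-zero network is normalized to zeros (a case the source does not discuss):

```python
def normalised_shared_neighbours(shared_send, shared_recv):
    """Sum A and B per edge, then divide by the largest sum in the network."""
    edges = set(shared_send) | set(shared_recv)
    totals = {e: shared_send.get(e, 0) + shared_recv.get(e, 0) for e in edges}
    largest = max(totals.values(), default=0)
    if largest == 0:
        # Assumption: with no shared neighbors anywhere, every value is 0.
        return {e: 0.0 for e in totals}
    return {e: total / largest for e, total in totals.items()}
```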

205: add the adjacency-list weight obtained in step 201 and the number of shared neighbors obtained in step 204 in the ratio α : 1-α to obtain a directed edge weight that accounts for the delivery frequency together with the numbers of shared sending and shared receiving neighbors. In other words, the adjacency-list edge weight contributes a share of α and the shared sending and receiving neighbors contribute a share of 1-α, where 0 < α < 1. The adjacency list is updated with the new directed edge weights, and the new adjacency list is uploaded to HDFS.
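The weighted combination in step 205 can be written as new_weight = α·W + (1-α)·N, where W is the normalized frequency from step 201 and N the normalized number of shared neighbors from step 204. A short sketch, in which the function name and the dictionary shapes are assumptions and no particular value of α is implied by the source:

```python
def combine_weights(adjacency, shared, alpha):
    """Blend frequency weights and shared-neighbor scores in the ratio alpha : 1 - alpha."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return {
        sender: {
            recipient: alpha * weight + (1 - alpha) * shared.get((sender, recipient), 0.0)
            for recipient, weight in recipients.items()
        }
        for sender, recipients in adjacency.items()
    }
```

A value of α close to 1 makes the edge weight follow the raw delivery frequency, while a value close to 0 emphasizes the shared neighborhood of the two parties.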

This completes the data processing of the stage that builds the delivery relationship network model, shown in Fig. 2. The data processing of the mining stage, shown in Fig. 3, follows.

Step S3: use the improved label propagation algorithm, parallelized with the MapReduce framework, to mine the community structure in the delivery network.

The improved label propagation algorithm runs for multiple iterations. One iteration is specifically:

301: append the unique ID of the corresponding sender node to the end of the adjacency list obtained in step S2 as the sender node's label Label, which completes label initialization. The labeled adjacency list is written [S\tR₁:W₁\tR₂:W₂...\tR_k:W_k\tLabel].

302: in the Map phase, from the labeled adjacency list, output several key-value pairs of the form <key,value>, divided into the sender key-value pair <S,+R₁:W₁\tR₂:W₂...\tR_k:W_k> (the "+" distinguishes it from the pairs that follow) and the recipient key-value pairs <R₁,Label\tW₁>, <R₂,Label\tW₂>, ..., <R_k,Label\tW_k>.

303: in the Reduce phase, collect the <key,value> pairs that share the same key and traverse their values. First, take the value of the sender key-value pair (the value marked with "+"), which holds the adjacency list of that key, and store it in the variable adjacent. Second, for the values of the recipient key-value pairs (the values without "+"), sum the weights under each distinct Label and update the node label NewLabel of that key according to the share of each Label: the larger a Label's share, the more likely the label of the current key node is updated to that Label.

304: append the newly generated label NewLabel of the key node to the end of adjacent, output a new key-value pair of the form <key,value>, namely <S,R₁:W₁\tR₂:W₂...\tR_k:W_k\tNewLabel>, and update the label in the adjacency list. The community structure in the delivery network corresponds to the labeled adjacency lists.
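Steps 301 to 304 can be sketched as one synchronous iteration in memory, as below. Several choices here are assumptions, not statements of the patent: the label with the largest summed weight always wins (the text only says a larger share makes a label more likely, so a weighted random draw would be another reading), ties are broken by the smallest label, and a node that receives no labels keeps its current one. The function names are hypothetical.

```python
from collections import defaultdict


def init_labels(adjacency):
    """Step 301: every sender node starts with its own ID as label."""
    return {sender: sender for sender in adjacency}


def propagate_once(adjacency, labels):
    """Steps 302-304: return (new_adjacency, new_labels) after one iteration."""
    # Map phase (302): group values by key, as the shuffle would.
    grouped = defaultdict(list)
    for sender, recipients in adjacency.items():
        grouped[sender].append(("adj", recipients))  # sender pair, "+" in the text
        for recipient, weight in recipients.items():
            grouped[recipient].append(("label", labels[sender], weight))  # recipient pair

    # Reduce phase (303, 304).
    new_adjacency, new_labels = {}, {}
    for key, values in grouped.items():
        adjacent = {}
        totals = defaultdict(float)  # summed incoming weight per label
        for value in values:
            if value[0] == "adj":
                adjacent = value[1]
            else:
                _, label, weight = value
                totals[label] += weight
        if totals:
            # Assumption: pick the heaviest label, smallest label on ties.
            new_label = min(totals, key=lambda lab: (-totals[lab], lab))
        else:
            # Assumption: a node with no incoming edges keeps its label.
            new_label = labels.get(key, key)
        new_adjacency[key] = adjacent
        new_labels[key] = new_label
    return new_adjacency, new_labels
```

Feeding the returned adjacency and labels back into propagate_once gives the next iteration; nodes that only receive parcels appear in the output with an empty adjacency list.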

The improved label propagation algorithm has two iteration termination conditions: 1. the node labels are essentially stable, that is, the labels of more than a set percentage of nodes do not change between two consecutive iterations, where the set percentage is 90% in this embodiment; 2. a set number of iterations is reached, generally 20 to 30, and 25 in this embodiment.
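The two termination conditions can be checked with a small helper such as the one below. The defaults follow this embodiment (90% and 25 iterations); the function name, the argument layout and the treatment of an empty label set are assumptions.

```python
def should_stop(old_labels, new_labels, iteration, stable_fraction=0.9, max_iterations=25):
    """Return True when either termination condition holds."""
    # Condition 2: the set number of iterations has been reached.
    if iteration >= max_iterations:
        return True
    if not new_labels:
        return True  # assumption: nothing left to update
    # Condition 1: more than stable_fraction of the nodes kept their label.
    unchanged = sum(1 for node, label in new_labels.items() if old_labels.get(node) == label)
    return unchanged / len(new_labels) > stable_fraction
```

Because the text says "more than" the set percentage, the comparison is strict: exactly 90% of unchanged labels does not yet stop the loop.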

Step S4: parse the community structure obtained in step S3, discover the communities in the delivery network, and save the result in HDFS. Specifically:

Based on the adjacency lists obtained in step S3, nodes with the same label are regarded as one community, which reveals the communities in the delivery network.
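Step S4 can be sketched as a grouping of the labeled adjacency lines by their final field. The line layout follows the form [S\tR₁:W₁...\tR_k:W_k\tLabel] given in step 301; the function name and the choice to return a dictionary of sets are assumptions.

```python
from collections import defaultdict


def communities_from_lines(lines):
    """Group node IDs by label; each line is 'node<TAB>edges...<TAB>label'."""
    groups = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        node, label = fields[0], fields[-1]  # first field is the node, last is its label
        groups[label].add(node)
    return dict(groups)
```

For example, the lines "A\tB:0.9\tA", "B\tD:0.5\tA", "C\tB:0.3\tC" and "D\tA" yield two communities: nodes A, B and D under label A, and node C alone under label C.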

In summary, the stage that builds the delivery relationship network model is a data preprocessing process, and the mining stage is an iterative process that implements a distributed form of the single-machine label propagation algorithm. In addition, because of the particular nature of logistics delivery data, the edge weights of the delivery network take three indicators into account: 1. the delivery frequency between the two parties; 2. the number of recipients the two parties have in common when each acts as a sender; 3. the number of senders the two parties have in common when each acts as a recipient. The present invention combines these three indicators to compute the weights of all edges in the network, and thereby mines communities in the delivery network accurately and efficiently.

Claims

1. A parallelized community discovery method for delivery data based on a label propagation algorithm, characterized by comprising:

Step S1: preprocessing the delivery data and structuring it into text data according to a set format;

Step S2: aggregating the delivery exchanges between nodes in the text data, normalizing the weights of the directed edges between nodes, and constructing a directed, weighted delivery relationship network model in the form of an adjacency list;

Step S3: using an improved label propagation algorithm, parallelized with the MapReduce framework, to mine the community structure in the delivery network;

Step S4: parsing the community structure obtained in step S3 to discover the communities in the delivery network.

2. The parallelized community discovery method for delivery data based on a label propagation algorithm according to claim 1, characterized in that the text data is uploaded to the HDFS of a Hadoop cluster for storage and processing.

3. The parallelized community discovery method for delivery data based on a label propagation algorithm according to claim 1, characterized in that step S1 is specifically: for each delivery record, the sender's name, sender's phone number, recipient's name and recipient's phone number are extracted, and these four items correspond to the four columns of each line of text data.

4. The parallelized community discovery method for delivery data based on a label propagation algorithm according to claim 1, characterized in that step S2 is specifically:

201: for each sender, obtaining an adjacency list of the delivery frequencies between that sender and each of its recipients, and normalizing the adjacency list;

202: for any sender and recipient between whom deliveries exist, counting the number A of recipients they have in common when each acts as a sender, A being recorded as the number of shared sending neighbors;

203: for any sender and recipient between whom deliveries exist, counting the number B of senders they have in common when each acts as a recipient, B being recorded as the number of shared receiving neighbors;

204: for any sender and recipient between whom deliveries exist, taking the sum of their number of shared sending neighbors and their number of shared receiving neighbors as the number of shared neighbors between that sender and recipient, and normalizing the number of shared neighbors;

205: adding the adjacency-list weight obtained in step 201 and the number of shared neighbors obtained in step 204 in the ratio α : 1-α to obtain a directed edge weight that accounts for the delivery frequency together with the numbers of shared sending and shared receiving neighbors, and updating the adjacency list, where 0 < α < 1.

5. The parallelized community discovery method for delivery data based on a label propagation algorithm according to claim 1, characterized in that the improved label propagation algorithm runs for multiple iterations, one iteration being specifically:

301: appending the unique ID of the corresponding sender node to the end of the adjacency list obtained in step S2 as the sender node's label Label, which completes label initialization;

302: from the labeled adjacency list, outputting several key-value pairs of the form <key,value>, divided into sender key-value pairs and recipient key-value pairs;

303: collecting the key-value pairs that share the same key and traversing their values; first, taking the value of the sender key-value pair, which holds the adjacency list of that key, and storing it in the variable adjacent; second, for the values of the recipient key-value pairs, summing the weights under each distinct Label and updating the node label NewLabel of that key according to the share of each Label;

304: appending NewLabel to the end of adjacent, outputting a new key-value pair of the form <key,value>, and updating the label in the adjacency list, the community structure in the delivery network corresponding to the labeled adjacency lists.

6. The parallelized community discovery method for delivery data based on a label propagation algorithm according to claim 5, characterized in that the iteration termination conditions of the improved label propagation algorithm comprise: the labels of more than a set percentage of nodes do not change between two consecutive iterations, or a set number of iterations is reached.

7. The parallelized community discovery method for delivery data based on a label propagation algorithm according to claim 6, characterized in that the set percentage is 90%.

8. The parallelized community discovery method for delivery data based on a label propagation algorithm according to claim 6, characterized in that the set number of iterations is 20 to 30.

9. The parallelized community discovery method for delivery data based on a label propagation algorithm according to claim 5, characterized in that step S4 is specifically: based on the adjacency lists obtained in step S3, nodes with the same label are regarded as one community.