CN110647406B - Coarse-grained graph data asynchronous iterative updating method - Google Patents

Coarse-grained graph data asynchronous iterative updating method

Info

Publication number
CN110647406B
CN110647406B (application CN201910812905.1A)
Authority
CN
China
Prior art keywords
data
node
edge
data set
attribute
Prior art date
Legal status
Active
Application number
CN201910812905.1A
Other languages
Chinese (zh)
Other versions
CN110647406A (en)
Inventor
李超
唐杰辰
黄新宇
Current Assignee
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN201910812905.1A
Publication of CN110647406A
Application granted
Publication of CN110647406B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G06F9/547 Remote procedure calls [RPC]; Web services

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to the field of graph data processing and discloses an asynchronous iterative updating method for coarse-grained graph data. The method encapsulates the point partition data sets and edge partition data sets of large-scale graph data into remote procedure call message loop bodies, realizing asynchronous message communication and avoiding the I/O processing cost caused by frequent access to external storage.

Description

Coarse-grained graph data asynchronous iterative updating method
Technical Field
The invention relates to the field of graph data processing, in particular to a coarse-grained graph data asynchronous iterative updating method.
Background
In order to mine valuable information from massive graph data, various distributed graph processing frameworks are widely applied to iterative graph computation. Among them, the representative Pregel model computes at node granularity and iteratively updates the graph data in a message-driven manner. Although the Pregel model is easy to implement, its global synchronous update mechanism requires the state information of all surrounding neighbor nodes to arrive before a new round of iterative updating can begin, which slows the convergence of graph computation.
In recent years, the development of the mobile internet and social networks has led to ever-expanding network scale, and the data these networks generate has grown from the PB level to the EB level. As the amount of data to be processed by graph mining algorithms increases, traditional centralized data processing platforms can no longer meet the storage and computation requirements of large-scale graph data, so a number of distributed graph data processing systems have emerged, typified by distributed parallel processing systems based on the MapReduce model and on the BSP (bulk synchronous parallel) model. Although these parallel graph processing systems offer mature solutions for graph partitioning, computation, and fault-tolerance management, they still fall short when applied to large-scale iterative graph computation. There are three main problems:
1. Slow convergence of computation
Among distributed parallel processing systems based on the BSP model, Google's Pregel system is the most representative. In Pregel, as in most bulk synchronous parallel computation models, the computation is divided into a series of supersteps executed in a globally synchronous manner. Iterative algorithms share the property that round k can begin only from the results of round k-1; under global synchronization, the k-th iteration cannot start until every node has completed the (k-1)-th. However, differences in data complexity give different nodes different computation speeds. Globally synchronous iterative graph computation requires a node to receive the results of all its surrounding neighbors before entering the next round of updates, so during actual computation the faster nodes inevitably wait for the slower ones, which reduces the execution speed of the algorithm and causes unnecessary time overhead.
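The waiting cost of global synchronization can be illustrated with a toy iteration (an illustrative sketch, not the patent's method): a Jacobi-style update that reads only the previous round's values stands in for BSP supersteps, while an in-place update that consumes a neighbor's fresh value as soon as it exists stands in for asynchronous iteration. The ring system and coefficients below are invented for the demonstration.

```python
# Illustrative only: "jacobi" mimics a BSP superstep (reads only the
# previous round's values); "in_place" mimics asynchronous updating
# (uses a neighbor's value as soon as it is produced). Both solve
# x_i = 0.4 * (x_{i-1} + x_{i+1}) + 1 on a ring of n nodes.
def sweep_counts(n=8, tol=1e-8, max_sweeps=10_000):
    def jacobi():
        x = [0.0] * n
        for sweep in range(1, max_sweeps):
            new = [0.4 * (x[i - 1] + x[(i + 1) % n]) + 1 for i in range(n)]
            if max(abs(a - b) for a, b in zip(new, x)) < tol:
                return sweep
            x = new

    def in_place():
        x = [0.0] * n
        for sweep in range(1, max_sweeps):
            delta = 0.0
            for i in range(n):
                v = 0.4 * (x[i - 1] + x[(i + 1) % n]) + 1
                delta = max(delta, abs(v - x[i]))
                x[i] = v
            if delta < tol:
                return sweep

    return jacobi(), in_place()

sync_sweeps, async_sweeps = sweep_counts()
print(sync_sweeps, async_sweeps)  # the asynchronous variant needs fewer sweeps
```

The per-sweep work is identical; only the freshness of the values consumed differs, which is the same lever the asynchronous method pulls.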
2. High I/O processing cost
At present, the MapReduce computation model is mature in the fields of data mining and data analysis. Although systems of this type have good fault tolerance and can easily access resources in a distributed cluster, they store intermediate computation results via global disk reads and writes during iteration. When the MapReduce model is applied to iterative computations that must repeatedly reuse intermediate results, memory utilization is therefore low. Furthermore, external I/O accesses are slow and unreliable compared with direct memory accesses: frequently reading and writing large amounts of data to external disks not only consumes a great deal of time but can even bring the system to a standstill. To address the high I/O processing cost of iterative computation, UC Berkeley's AMPLab proposed the Apache Spark computation model in 2012. Spark is a memory-based distributed computing framework whose core is the Resilient Distributed Dataset (RDD), an abstraction of distributed memory for processing data on a cluster in parallel. Each Spark job chains RDDs into a directed acyclic graph (DAG) that can be executed efficiently, so Spark is particularly well suited to iterative algorithms, processing iterative jobs roughly 20 times faster than Hadoop.
Spark GraphX, a subproject of the Spark platform, implements in-memory graph computation on master-slave clusters. It has lower communication requirements for graph loading, node processing, edge processing, and adjacent-node computation, and the generated graph has a simpler structure, so performance is greatly improved. However, this graph subsystem still performs iterative computation with the Pregel programming framework in a globally synchronous manner, so the problem of slow convergence remains.
3. Low data transmission efficiency
In a distributed environment, a large-scale connected graph must be divided into several completely independent subgraphs, and when the subgraph data is deployed on different machines, cross edges between subgraphs are unavoidable. Graph processing systems based on the BSP model are usually node-centric and message-driven, so from the standpoint of communication cost the number of cross edges is an important measure of partition quality. If the coupling between subgraphs is high, graph computation must frequently access non-local data. In general, access to local memory is measured in nanoseconds while network communication between machines is measured in milliseconds, a difference of four to five orders of magnitude. Frequently following cross edges into another partition's data, one node at a time, therefore slows the processing of the whole graph.
In addition, under a node-centric fine-grained exchange model, every access by a node to data in another subgraph also carries the framing overhead of the transmission control protocol. A single node-centric iteration therefore generates a large amount of redundant information, placing a heavy burden on network communication; this is especially significant when the network is large.
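The redundancy argument can be made concrete with back-of-envelope arithmetic (all numbers below are hypothetical, chosen only to show the shape of the saving): fine-grained node-centric exchange pays per-message protocol overhead on every cross-edge update, while coarse-grained partition-level exchange pays it once per partition pair.

```python
# Hypothetical sizes: per-message framing overhead versus payload.
HEADER = 64          # assumed bytes of TCP/RPC framing per message
PAYLOAD = 12         # one (node_id, value) update

cross_edges = 1_000_000
partitions = 16

# Fine-grained: one message per cross-edge update.
fine = cross_edges * (HEADER + PAYLOAD)

# Coarse-grained: one message per ordered (partition, partition) pair,
# carrying all of that pair's updates in a single data block.
pairs = partitions * (partitions - 1)
coarse = pairs * HEADER + cross_edges * PAYLOAD

print(fine, coarse, round(fine / coarse, 2))
```

Under these assumed numbers the framing overhead drops from dominating the traffic to being negligible; the payload itself is unchanged.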
In summary, graph data is a type of big data with strong correlation characteristics, and its analysis requires careful attention to how the data is partitioned, stored, and iteratively computed. Given the shortcomings of existing synchronous parallel computation models for graph processing, namely slow convergence, high I/O processing cost, and low data transmission efficiency, the models and processing mechanisms for large-scale graph data urgently need improvement and optimization.
Disclosure of Invention
Aiming at the above shortcomings of the prior art, the invention provides an asynchronous iterative updating method for coarse-grained graph data: the point partition data sets and edge partition data sets of large-scale graph data are encapsulated into remote procedure call message loop bodies, realizing asynchronous message communication and avoiding the I/O processing cost caused by frequent access to external storage.
In order to achieve the above object, the asynchronous iterative update method for coarse-grained graph data according to the present invention comprises the following steps:
a) Defining a node and a plurality of destination edges which are connected with the node and take the node as a destination node and a plurality of source edges which are connected with the node and take the node as a source node, storing node attribute data of the node in a point partition data set, and storing edge attribute data of the destination edge and the source edge in a plurality of different edge partition data sets;
b) Wrapping the point partition data set and edge partition data sets obtained in step A) with a remote procedure call protocol, treating the point partition data set and each edge partition data set as an independent message loop body, and storing the node attribute data and edge attribute data with resilient distributed data sets;
c) After an edge partition data set sends a data block containing destination-edge attributes to the point partition, the message loop body of the point partition data set stores the received data block in a message queue, and data blocks are periodically taken out of the queue for calculation and analysis to obtain node attribute data state change values;
d) The point partition data set periodically sends the node attribute data state change value to an edge partition data set storing the active edge attribute;
e) After the edge partition data set storing the active edge attributes receives the node attribute data state change values, it uses them to update the node attribute information of the destination nodes corresponding to its source edges, then completes the aggregation of the source-edge attribute data inside its message loop body and sends the aggregated information back to the point partition data set in the form of data blocks, starting the next round of iterative computation. These steps are repeated until the attribute data state change values of all nodes in the point partition data set no longer change; the algorithm has then converged and the procedure ends.
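Steps A) through E) can be sketched as a single-process simulation (illustrative names and a toy delta-propagation update, not the patent's code): the "edge partition" aggregates source-edge contributions into a data block, the "point partition" turns the block into state-change values, and the loop runs until no change value survives the convergence threshold.

```python
# Toy delta-propagation over a 3-node graph; the roles of the edge and
# point partition data sets are simulated in one process. All names and
# constants are illustrative.
from collections import defaultdict

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]            # (source, destination)
out_deg = defaultdict(int)
for s, _ in edges:
    out_deg[s] += 1

D = 0.85                                            # damping-style edge weight
rank = {v: 1.0 for v in {v for e in edges for v in e}}
delta = dict(rank)                                  # initial state-change values

for _ in range(300):
    # Edge partition: aggregate source-edge contributions into a data block.
    block = defaultdict(float)
    for s, d in edges:
        c = delta.get(s, 0.0)
        if c:
            block[d] += D * c / out_deg[s]
    # Point partition: turn the block into new state-change values.
    delta = {v: c for v, c in block.items() if abs(c) > 1e-10}
    if not delta:                                   # step E): convergence
        break
    for v, c in delta.items():
        rank[v] += c

print({v: round(r, 3) for v, r in sorted(rank.items())})
```

At convergence the accumulated values satisfy the fixed-point equations of the toy update, without any round ever waiting for a global barrier.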
Preferably, in step C), the node attribute data is iteratively updated in an asynchronous manner, and three independent threads are created in the message loop body of the point partition data set:
message receiving thread: receives data blocks from the different edge partition data sets and puts them into the cache blocking queue of the message loop body;
data iteration updating thread: periodically takes data blocks out of the cache blocking queue for calculation and analysis to obtain node attribute data state change values, which it stores in the node attribute state table;
message sending thread: periodically sends the node attribute data state change values in the node attribute state table to the edge partition data sets storing the active edge attributes.
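A minimal sketch of the three cooperating threads (Python stand-ins with invented names; the patent's implementation sits inside a Spark/RPC message loop body, which is not reproduced here). The receiver feeds a blocking queue, the updater folds blocks into the state table under a lock, and the sender flushes the table; here the sender runs once at the end so the run is deterministic.

```python
# Illustrative three-thread point-partition loop body.
import queue
import threading

inbox = queue.Queue()                 # the cache blocking queue
state = {}                            # node attribute state table
state_lock = threading.Lock()
sent = []                             # stand-in for the outgoing RPC channel
STOP = object()                       # sentinel to end the run

def receiver(blocks):
    for b in blocks:                  # in reality: blocks arrive via RPC
        inbox.put(b)
    inbox.put(STOP)

def updater():
    while True:
        block = inbox.get()
        if block is STOP:
            break
        with state_lock:              # keep access to the table consistent
            for node_id, value in block:
                state[node_id] = state.get(node_id, 0.0) + value

def sender():                         # here: a single final flush
    with state_lock:
        sent.append(dict(state))

blocks = [[(1, 0.5), (2, 1.0)], [(1, 0.25)], [(3, 2.0)]]
t_recv = threading.Thread(target=receiver, args=(blocks,))
t_upd = threading.Thread(target=updater)
t_recv.start(); t_upd.start()
t_recv.join(); t_upd.join()
sender()
print(sent[0])   # {1: 0.75, 2: 1.0, 3: 2.0}
```

In the real design the sender would run periodically and concurrently; the lock around the state table is the minimal form of the inter-thread consistency guarantee the description mentions.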
Preferably, when the data blocks taken out of the cache blocking queue by the data iteration updating thread contain nodes with the same ID, the data of that node is aggregated and the processed result is filled into the node attribute state table.
Preferably, an index number is established for each node in the node attribute state table so that, when the data volume is very large, the ID of a target node can be located quickly.
Preferably, the node attribute data state change values obtained by the data iteration updating thread are accumulated, and a sending threshold is set: the message sending thread is awakened only after a node attribute data state change value exceeds the threshold, and it then sends the qualifying values to the edge partition data sets storing the active edge attributes. This avoids the communication overhead of frequently sending data blocks and makes information transmission more efficient.
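The threshold rule can be sketched as follows (threshold value and update stream are invented): changes accumulate per node, and only when a node's accumulated change exceeds the threshold is it handed to the message sending thread.

```python
# Illustrative threshold-gated sending of accumulated change values.
THRESHOLD = 1.0
pending = {}      # node_id -> accumulated state-change value
flushed = []      # what the message sending thread would transmit

def accumulate(node_id, change):
    pending[node_id] = pending.get(node_id, 0.0) + change
    if abs(pending[node_id]) > THRESHOLD:        # wake the sending thread
        flushed.append((node_id, pending.pop(node_id)))

for upd in [('a', 0.4), ('a', 0.4), ('b', 0.2), ('a', 0.4), ('b', 0.3)]:
    accumulate(*upd)

print([(n, round(v, 6)) for n, v in flushed],
      {k: round(v, 6) for k, v in pending.items()})
# [('a', 1.2)] {'b': 0.5}
```

Node 'a' crosses the threshold and is flushed once; node 'b' stays buffered, so no network round trip is spent on its small updates.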
Preferably, in step C), the asynchronous parallel calculation formula used by the point partition data set to analyze the data blocks mathematically satisfies the commutative, associative, and distributive laws, and a node attribute data state change value can serve as the input of the next iteration.
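The role of these algebraic laws can be checked directly (toy values, addition as the combine operator): when the combine step is commutative and associative, the folded result is independent of block arrival order, which is exactly what makes out-of-order asynchronous delivery safe.

```python
# Fold four state-change values in every possible arrival order and
# collect the distinct totals; a law-abiding operator yields exactly one.
import functools
import itertools
import operator

blocks = [3.0, 1.5, -2.0, 0.25]       # state-change values from four blocks

results = {
    round(functools.reduce(operator.add, perm), 10)
    for perm in itertools.permutations(blocks)
}
print(results)   # a single value: every arrival order gives the same total
```

An operator without these properties (e.g. subtraction) would produce many distinct totals here, and the asynchronous result would depend on network timing.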
Compared with the prior art, the invention has the following advantages:
1. by means of the resilient distributed data sets available under the Spark platform and the conditions required for asynchronous message communication, the point partition data sets and edge partition data sets of large-scale graph data are encapsulated into remote procedure call message loop bodies, and an asynchronous message communication model between the point and edge partition data sets is realized with the remote procedure call protocol; for the storage of intermediate iteration data, this avoids the I/O processing cost of frequent access to external storage;
2. by means of a remote procedure call protocol message loop body of a point partition data set and an edge partition data set, a traditional point-centered fine-grained transmission mechanism is converted into a partition data set-centered coarse-grained transmission mechanism, and by setting a threshold value, the data transmission efficiency is improved, and the network communication overhead is reduced;
3. and establishing a message cache queue in the point partition data set, performing cooperative processing among a message receiving thread, a message sending thread and a data iterative updating thread, and keeping the iterative updating of data in a running state as much as possible on the premise of ensuring the data access consistency among the threads.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
A coarse-grained graph data asynchronous iterative updating method comprises the following steps:
a) Defining a node and a plurality of destination edges which are connected with the node and take the node as a destination node and a plurality of source edges which are connected with the node and take the node as a source node, storing node attribute data of the node in a point partition data set, and storing edge attribute data of the destination edge and the source edge in a plurality of different edge partition data sets;
b) Wrapping the point partition data set and edge partition data sets obtained in step A) with a remote procedure call protocol, treating the point partition data set and each edge partition data set as an independent message loop body, and storing the node attribute data and edge attribute data with resilient distributed data sets;
c) After the edge partition data sets send data blocks containing destination-edge attributes to the point partition, the message loop body of the point partition data set stores the received data blocks in a message queue and periodically takes them out for calculation and analysis to obtain node attribute data state change values; computation need not wait until all data blocks have arrived, which reduces the waiting time of the point partition data set;
d) The point partition data set periodically sends the node attribute data state change value to the edge partition data set storing the active edge attribute;
e) After the edge partition data set storing the active edge attributes receives the node attribute data state change values, it uses them to update the node attribute information of the destination nodes corresponding to its source edges, then completes the aggregation of the source-edge attribute data inside its message loop body and sends the aggregated information back to the point partition data set in the form of data blocks, starting the next round of iterative computation. These steps are repeated until the attribute data state change values of all nodes in the point partition data set no longer change; the algorithm has then converged and the procedure ends.
In step C), performing iterative update processing on the node attribute data in an asynchronous manner, and creating three independent threads in a message loop body of a point partition data set, including:
message receiving thread: receives data blocks from the different edge partition data sets and puts them into the cache blocking queue of the message loop body;
data iteration updating thread: periodically takes data blocks out of the cache blocking queue for calculation and analysis to obtain node attribute data state change values; if the data blocks taken out contain nodes with the same ID, the thread aggregates that node's data and fills the processed result into the node attribute state table;
message sending thread: periodically sends the node attribute data state change values in the point partition data set to the edge partition data sets storing the active edge attributes.
In this embodiment, the node attribute data state change value obtained by the data iteration update thread is subjected to accumulation processing, a threshold is set for sending the node attribute data state change value, the message sending thread is awakened when the node attribute data state change value exceeds the threshold, and the message sending thread sends the node attribute data state change value meeting the condition to the edge partition dataset storing the active edge attribute.
In step C), the asynchronous parallel calculation formula used for performing calculation analysis on the data blocks by the point partition data set is mathematically in accordance with the commutative law, the associative law and the distributive law.
By means of the resilient distributed data sets available under the Spark platform and the conditions required for asynchronous message communication, this asynchronous iterative updating method for coarse-grained graph data encapsulates the point partition data sets and edge partition data sets of large-scale graph data into remote procedure call message loop bodies and realizes an asynchronous message communication model between them with the remote procedure call protocol, so that, for the storage of intermediate iteration data, the I/O processing cost of frequent access to external storage is avoided. Meanwhile, through the remote procedure call message loop bodies of the point and edge partition data sets, the traditional point-centered fine-grained transmission mechanism is converted into a coarse-grained transmission mechanism centered on partition data sets, and by setting a threshold the data transmission efficiency is improved and the network communication overhead is reduced. In addition, a message cache queue is created in the point partition data set, and the message receiving thread, message sending thread, and data iteration updating thread cooperate so that, on the premise of guaranteeing data access consistency among threads, the iterative updating of data stays in a running state as much as possible.

Claims (6)

1. A coarse-grained graph data asynchronous iterative updating method, characterized by comprising the following steps:
a) Defining a node and a plurality of destination edges which are connected with the node and take the node as a destination node and a plurality of source edges which are connected with the node and take the node as a source node, storing node attribute data of the node in a point partition data set, and storing edge attribute data of the destination edge and the source edge in a plurality of different edge partition data sets;
b) Packaging a remote procedure call protocol on the point partition data set and the edge partition data set obtained in the step A), respectively treating the point partition data set and each edge partition data set as an independent message loop body, and storing the node attribute data and the edge attribute data by using an elastic distributed data set;
c) After an edge partition data set sends a data block containing destination-edge attributes to the point partition, the message loop body of the point partition data set stores the received data block in a message queue, and data blocks are periodically taken out of the queue for calculation and analysis to obtain node attribute data state change values;
d) The point partition data set periodically sends the node attribute data state change value to the edge partition data set storing the active edge attribute;
e) After the edge partition data set storing the active edge attributes receives the node attribute data state change values, it uses them to update the node attribute information of the destination nodes corresponding to its source edges, then completes the aggregation of the source-edge attribute data inside its message loop body and sends the aggregated information back to the point partition data set in the form of data blocks, starting the next round of iterative computation. These steps are repeated until the attribute data state change values of all nodes in the point partition data set no longer change; the algorithm has then converged and the procedure ends.
2. The coarse-grained graph data asynchronous iterative updating method according to claim 1, characterized in that: in step C), the node attribute data is iteratively updated in an asynchronous manner, and three independent threads are created in the message loop body of the point partition data set, including:
message receiving thread: receiving data blocks from different edge partition data sets, and putting the data blocks into the cache blocking queue of the message loop body;
data iteration updating thread: periodically taking data blocks out of the cache blocking queue for calculation and analysis to obtain node attribute data state change values;
message sending thread: periodically sending the node attribute data state change values in the point partition data set to the edge partition data set storing the active edge attributes.
3. The coarse-grained graph data asynchronous iterative update method according to claim 2, characterized in that: and in a plurality of data blocks taken out from the cache blocking queue by the data iteration updating thread, if the data blocks contain the nodes with the same ID, carrying out aggregation calculation on the data of the ID node, and filling the processed result into a node attribute state table.
4. The coarse-grained graph data asynchronous iterative update method according to claim 3, characterized in that: and establishing an index number for each node in the node attribute state table.
5. The coarse-grained graph data asynchronous iterative update method of claim 2, wherein: and performing accumulation processing on the node attribute data state change value obtained by the data iteration updating thread, setting a threshold value for sending the node attribute data state change value, awakening the message sending thread after the node attribute data state change value exceeds the threshold value, and sending the node attribute data state change value meeting the condition to an edge partition data set storing the active edge attribute by the message sending thread.
6. The coarse-grained graph data asynchronous iterative updating method according to claim 1, characterized in that: in step C), the asynchronous parallel calculation formula adopted by the point partition data set to analyze the data blocks mathematically conforms to the commutative, associative, and distributive laws.
CN201910812905.1A 2019-08-29 2019-08-29 Coarse-grained graph data asynchronous iterative updating method Active CN110647406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910812905.1A CN110647406B (en) 2019-08-29 2019-08-29 Coarse-grained graph data asynchronous iterative updating method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910812905.1A CN110647406B (en) 2019-08-29 2019-08-29 Coarse-grained graph data asynchronous iterative updating method

Publications (2)

Publication Number Publication Date
CN110647406A (en) 2020-01-03
CN110647406B (en) 2022-11-29

Family

ID=69009949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910812905.1A Active CN110647406B (en) 2019-08-29 2019-08-29 Coarse-grained graph data asynchronous iterative updating method

Country Status (1)

Country Link
CN (1) CN110647406B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914556A (en) * 2014-04-15 2014-07-09 西北工业大学 Large-scale graph data processing method
CN107122248A (en) * 2017-05-02 2017-09-01 华中科技大学 A kind of distributed figure processing method of storage optimization

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120084747A1 (en) * 2010-10-01 2012-04-05 Nec Laboratories America, Inc. Partitioned iterative convergance programming model
US8521782B2 (en) * 2011-07-20 2013-08-27 Salesforce.Com, Inc. Methods and systems for processing large graphs using density-based processes using map-reduce

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914556A (en) * 2014-04-15 2014-07-09 西北工业大学 Large-scale graph data processing method
CN107122248A (en) * 2017-05-02 2017-09-01 华中科技大学 A kind of distributed figure processing method of storage optimization

Also Published As

Publication number Publication date
CN110647406A (en) 2020-01-03

Similar Documents

Publication Publication Date Title
Khorasani et al. Scalable simd-efficient graph processing on gpus
Hong et al. Efficient parallel graph exploration on multi-core CPU and GPU
Chu et al. Confidence-based work stealing in parallel constraint programming
CN110232087B (en) Big data increment iteration method and device, computer equipment and storage medium
Wang et al. Hybrid pulling/pushing for i/o-efficient distributed and iterative graph computing
Morari et al. Scaling irregular applications through data aggregation and software multithreading
CN110659278A (en) Graph data distributed processing system based on CPU-GPU heterogeneous architecture
CN112015765B (en) Spark cache elimination method and system based on cache value
Burkes et al. Design approaches for real-time transaction processing remote site recovery
Koo et al. Revealing critical loads and hidden data locality in GPGPU applications
Lee et al. Performance improvement of mapreduce process by promoting deep data locality
CN105426163A (en) Single data stream quantile processing method based on MIC coprocessor
EP3662376A1 (en) Reconfigurable cache architecture and methods for cache coherency
CN111176831A (en) Dynamic thread mapping optimization method and device based on multithread shared memory communication
CN110647406B (en) Coarse-grained graph data asynchronous iterative updating method
US11429299B2 (en) System and method for managing conversion of low-locality data into high-locality data
CN116069480B (en) Processor and computing device
CN110515729B (en) Graph computing node vector load balancing method and device based on graph processor
CN103955397A (en) Virtual machine scheduling multi-strategy selection method based on micro-architecture perception
Lin et al. swFLOW: A dataflow deep learning framework on sunway taihulight supercomputer
Toliopoulos et al. Adaptive distributed partitioning in apache flink
Liu et al. A hierarchical synchronous parallel model for wide-area graph analytics
CN115906684A (en) Hydrodynamics multi-grid solver parallel optimization method for Shenwei architecture
CN105573834B (en) A kind of higher-dimension vocabulary tree constructing method based on heterogeneous platform
CN108228323A (en) Hadoop method for scheduling task and device based on data locality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant