CN110647406B - Coarse-grained graph data asynchronous iterative updating method - Google Patents

Coarse-grained graph data asynchronous iterative updating method

Info

Publication number
CN110647406B
CN110647406B (application CN201910812905.1A)
Authority
CN
China
Prior art keywords
data
node
edge
data set
attribute
Prior art date
Legal status
Active
Application number
CN201910812905.1A
Other languages
Chinese (zh)
Other versions
CN110647406A (en)
Inventor
李超
唐杰辰
黄新宇
Current Assignee
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN201910812905.1A
Publication of CN110647406A
Application granted
Publication of CN110647406B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G06F9/547 Remote procedure calls [RPC]; Web services

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to the field of graph data processing and discloses an asynchronous iterative updating method for coarse-grained graph data. The method encapsulates the point partition data sets and edge partition data sets of large-scale graph data into remote procedure call message loop bodies, realizing asynchronous message communication and avoiding the I/O processing cost caused by frequent access to external storage.

Description

Coarse-grained graph data asynchronous iterative updating method
Technical Field
The invention relates to the field of graph data processing, in particular to a coarse-grained graph data asynchronous iterative updating method.
Background
In order to mine valuable information from massive graph data, various distributed graph processing frameworks are widely applied to iterative graph computation. Among them, the representative Pregel model computes at node granularity and iteratively updates the graph data in a message-driven manner. Although the Pregel model is easy to implement, its global synchronous update mechanism requires the state information of all surrounding neighbor nodes to arrive before a new round of iterative updating can begin, which slows the convergence of graph computation.
In recent years, the development of the mobile internet and social networks has led to ever-expanding network scale, and the data these networks generate has grown from the PB level to the EB level. As the amount of data to be processed by graph mining algorithms increases, traditional centralized data processing platforms can no longer meet the storage and computation requirements of large-scale graph data, so a number of distributed graph data processing systems have emerged, typified by distributed parallel processing systems based on the MapReduce model and on the BSP (bulk synchronous parallel) model. Although these parallel graph processing systems offer mature solutions for graph partitioning, computation, and fault-tolerance management, they still fall short when applied to large-scale iterative graph computation. There are three main problems:
1. Slow convergence of computation
Among distributed parallel processing systems based on the BSP model, Google's Pregel system is the most representative. In Pregel, as in most bulk synchronous parallel computation models, the computation is divided into a series of supersteps executed in a globally synchronous manner. Iterative algorithms share the property that round k can begin only from the results of round k-1; under global synchronization, the k-th iteration cannot start until every node has completed the (k-1)-th. However, differences in data complexity give different nodes different computation speeds. Globally synchronous iterative graph computation requires a node to receive the results of all its surrounding neighbors before entering the next round of updates, so during actual computation the faster nodes inevitably wait for the slower ones, which reduces the execution speed of the algorithm and causes unnecessary time overhead.
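The waiting cost of global synchronization can be illustrated with a toy iteration (an illustrative sketch, not the patent's method): a Jacobi-style update that reads only the previous round's values stands in for BSP supersteps, while an in-place update that consumes a neighbor's fresh value as soon as it exists stands in for asynchronous iteration. The ring system and coefficients below are invented for the demonstration.

```python
# Illustrative only: "jacobi" mimics a BSP superstep (reads only the
# previous round's values); "in_place" mimics asynchronous updating
# (uses a neighbor's value as soon as it is produced). Both solve
# x_i = 0.4 * (x_{i-1} + x_{i+1}) + 1 on a ring of n nodes.
def sweep_counts(n=8, tol=1e-8, max_sweeps=10_000):
    def jacobi():
        x = [0.0] * n
        for sweep in range(1, max_sweeps):
            new = [0.4 * (x[i - 1] + x[(i + 1) % n]) + 1 for i in range(n)]
            if max(abs(a - b) for a, b in zip(new, x)) < tol:
                return sweep
            x = new

    def in_place():
        x = [0.0] * n
        for sweep in range(1, max_sweeps):
            delta = 0.0
            for i in range(n):
                v = 0.4 * (x[i - 1] + x[(i + 1) % n]) + 1
                delta = max(delta, abs(v - x[i]))
                x[i] = v
            if delta < tol:
                return sweep

    return jacobi(), in_place()

sync_sweeps, async_sweeps = sweep_counts()
print(sync_sweeps, async_sweeps)  # the asynchronous variant needs fewer sweeps
```

The per-sweep work is identical; only the freshness of the values consumed differs, which is the same lever the asynchronous method pulls.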
2. High I/O processing cost
At present, the MapReduce computation model is mature in the fields of data mining and data analysis. Although systems of this type have good fault tolerance and can easily access resources in a distributed cluster, they store intermediate computation results via global disk reads and writes during iteration. When the MapReduce model is applied to iterative computations that must repeatedly reuse intermediate results, memory utilization is therefore low. Furthermore, external I/O accesses are slow and unreliable compared with direct memory accesses: frequently reading and writing large amounts of data to external disks not only consumes a great deal of time but can even bring the system to a standstill. To address the high I/O processing cost of iterative computation, UC Berkeley's AMPLab proposed the Apache Spark computation model in 2012. Spark is a memory-based distributed computing framework whose core is the Resilient Distributed Dataset (RDD), an abstraction of distributed memory for processing data on a cluster in parallel. Each Spark job chains RDDs into a directed acyclic graph (DAG) that can be executed efficiently, so Spark is particularly well suited to iterative algorithms, processing iterative jobs roughly 20 times faster than Hadoop.
Spark GraphX, a subproject of the Spark platform, implements in-memory graph computation on master-slave clusters. It has lower communication requirements for graph loading, node processing, edge processing, and adjacent-node computation, and the generated graph has a simpler structure, so performance is greatly improved. However, this graph subsystem still performs iterative computation with the Pregel programming framework in a globally synchronous manner, so the problem of slow convergence remains.
3. Low data transmission efficiency
In a distributed environment, a large-scale connected graph must be divided into several completely independent subgraphs, and when the subgraph data is deployed on different machines, cross edges between subgraphs are unavoidable. Graph processing systems based on the BSP model are usually node-centric and message-driven, so from the standpoint of communication cost the number of cross edges is an important measure of partition quality. If the coupling between subgraphs is high, graph computation must frequently access non-local data. In general, access to local memory is measured in nanoseconds while network communication between machines is measured in milliseconds, a difference of four to five orders of magnitude. Frequently following cross edges into another partition's data, one node at a time, therefore slows the processing of the whole graph.
In addition, under a node-centric fine-grained exchange model, every access by a node to data in another subgraph also carries the framing overhead of the transmission control protocol. A single node-centric iteration therefore generates a large amount of redundant information, placing a heavy burden on network communication; this is especially significant when the network is large.
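The redundancy argument can be made concrete with back-of-envelope arithmetic (all numbers below are hypothetical, chosen only to show the shape of the saving): fine-grained node-centric exchange pays per-message protocol overhead on every cross-edge update, while coarse-grained partition-level exchange pays it once per partition pair.

```python
# Hypothetical sizes: per-message framing overhead versus payload.
HEADER = 64          # assumed bytes of TCP/RPC framing per message
PAYLOAD = 12         # one (node_id, value) update

cross_edges = 1_000_000
partitions = 16

# Fine-grained: one message per cross-edge update.
fine = cross_edges * (HEADER + PAYLOAD)

# Coarse-grained: one message per ordered (partition, partition) pair,
# carrying all of that pair's updates in a single data block.
pairs = partitions * (partitions - 1)
coarse = pairs * HEADER + cross_edges * PAYLOAD

print(fine, coarse, round(fine / coarse, 2))
```

Under these assumed numbers the framing overhead drops from dominating the traffic to being negligible; the payload itself is unchanged.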
In summary, graph data is a type of big data with strong correlation characteristics, and its analysis requires careful attention to how the data is partitioned, stored, and iteratively computed. Given the shortcomings of existing synchronous parallel computation models for graph processing, namely slow convergence, high I/O processing cost, and low data transmission efficiency, the models and processing mechanisms for large-scale graph data urgently need improvement and optimization.
Disclosure of Invention
Aiming at the above shortcomings of the prior art, the invention provides an asynchronous iterative updating method for coarse-grained graph data: the point partition data sets and edge partition data sets of large-scale graph data are encapsulated into remote procedure call message loop bodies, realizing asynchronous message communication and avoiding the I/O processing cost caused by frequent access to external storage.
In order to achieve the above object, the asynchronous iterative update method for coarse-grained graph data according to the present invention comprises the following steps:
a) Defining a node and a plurality of destination edges which are connected with the node and take the node as a destination node and a plurality of source edges which are connected with the node and take the node as a source node, storing node attribute data of the node in a point partition data set, and storing edge attribute data of the destination edge and the source edge in a plurality of different edge partition data sets;
b) Wrapping the point partition data set and edge partition data sets obtained in step A) with a remote procedure call protocol, treating the point partition data set and each edge partition data set as an independent message loop body, and storing the node attribute data and edge attribute data with resilient distributed data sets;
c) After an edge partition data set sends a data block containing destination-edge attributes to the point partition, the message loop body of the point partition data set stores the received data block in a message queue, and data blocks are periodically taken out of the queue for calculation and analysis to obtain node attribute data state change values;
d) The point partition data set periodically sends the node attribute data state change value to an edge partition data set storing the active edge attribute;
e) After the edge partition data set storing the active edge attributes receives the node attribute data state change values, it uses them to update the node attribute information of the destination nodes corresponding to its source edges, then completes the aggregation of the source-edge attribute data inside its message loop body and sends the aggregated information back to the point partition data set in the form of data blocks, starting the next round of iterative computation. These steps are repeated until the attribute data state change values of all nodes in the point partition data set no longer change; the algorithm has then converged and the procedure ends.
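Steps A) through E) can be sketched as a single-process simulation (illustrative names and a toy delta-propagation update, not the patent's code): the "edge partition" aggregates source-edge contributions into a data block, the "point partition" turns the block into state-change values, and the loop runs until no change value survives the convergence threshold.

```python
# Toy delta-propagation over a 3-node graph; the roles of the edge and
# point partition data sets are simulated in one process. All names and
# constants are illustrative.
from collections import defaultdict

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]            # (source, destination)
out_deg = defaultdict(int)
for s, _ in edges:
    out_deg[s] += 1

D = 0.85                                            # damping-style edge weight
rank = {v: 1.0 for v in {v for e in edges for v in e}}
delta = dict(rank)                                  # initial state-change values

for _ in range(300):
    # Edge partition: aggregate source-edge contributions into a data block.
    block = defaultdict(float)
    for s, d in edges:
        c = delta.get(s, 0.0)
        if c:
            block[d] += D * c / out_deg[s]
    # Point partition: turn the block into new state-change values.
    delta = {v: c for v, c in block.items() if abs(c) > 1e-10}
    if not delta:                                   # step E): convergence
        break
    for v, c in delta.items():
        rank[v] += c

print({v: round(r, 3) for v, r in sorted(rank.items())})
```

At convergence the accumulated values satisfy the fixed-point equations of the toy update, without any round ever waiting for a global barrier.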
Preferably, in step C), the node attribute data is iteratively updated in an asynchronous manner, and three independent threads are created in the message loop body of the point partition data set:
message receiving thread: receives data blocks from the different edge partition data sets and puts them into the cache blocking queue of the message loop body;
data iteration updating thread: periodically takes data blocks out of the cache blocking queue for calculation and analysis to obtain node attribute data state change values, which it stores in the node attribute state table;
message sending thread: periodically sends the node attribute data state change values in the node attribute state table to the edge partition data sets storing the active edge attributes.
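A minimal sketch of the three cooperating threads (Python stand-ins with invented names; the patent's implementation sits inside a Spark/RPC message loop body, which is not reproduced here). The receiver feeds a blocking queue, the updater folds blocks into the state table under a lock, and the sender flushes the table; here the sender runs once at the end so the run is deterministic.

```python
# Illustrative three-thread point-partition loop body.
import queue
import threading

inbox = queue.Queue()                 # the cache blocking queue
state = {}                            # node attribute state table
state_lock = threading.Lock()
sent = []                             # stand-in for the outgoing RPC channel
STOP = object()                       # sentinel to end the run

def receiver(blocks):
    for b in blocks:                  # in reality: blocks arrive via RPC
        inbox.put(b)
    inbox.put(STOP)

def updater():
    while True:
        block = inbox.get()
        if block is STOP:
            break
        with state_lock:              # keep access to the table consistent
            for node_id, value in block:
                state[node_id] = state.get(node_id, 0.0) + value

def sender():                         # here: a single final flush
    with state_lock:
        sent.append(dict(state))

blocks = [[(1, 0.5), (2, 1.0)], [(1, 0.25)], [(3, 2.0)]]
t_recv = threading.Thread(target=receiver, args=(blocks,))
t_upd = threading.Thread(target=updater)
t_recv.start(); t_upd.start()
t_recv.join(); t_upd.join()
sender()
print(sent[0])   # {1: 0.75, 2: 1.0, 3: 2.0}
```

In the real design the sender would run periodically and concurrently; the lock around the state table is the minimal form of the inter-thread consistency guarantee the description mentions.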
Preferably, when the data blocks taken out of the cache blocking queue by the data iteration updating thread contain nodes with the same ID, the data of that node is aggregated and the processed result is filled into the node attribute state table.
Preferably, an index number is established for each node in the node attribute state table so that, when the data volume is very large, the ID of a target node can be located quickly.
Preferably, the node attribute data state change values obtained by the data iteration updating thread are accumulated, and a sending threshold is set: the message sending thread is awakened only after a node attribute data state change value exceeds the threshold, and it then sends the qualifying values to the edge partition data sets storing the active edge attributes. This avoids the communication overhead of frequently sending data blocks and makes information transmission more efficient.
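The threshold rule can be sketched as follows (threshold value and update stream are invented): changes accumulate per node, and only when a node's accumulated change exceeds the threshold is it handed to the message sending thread.

```python
# Illustrative threshold-gated sending of accumulated change values.
THRESHOLD = 1.0
pending = {}      # node_id -> accumulated state-change value
flushed = []      # what the message sending thread would transmit

def accumulate(node_id, change):
    pending[node_id] = pending.get(node_id, 0.0) + change
    if abs(pending[node_id]) > THRESHOLD:        # wake the sending thread
        flushed.append((node_id, pending.pop(node_id)))

for upd in [('a', 0.4), ('a', 0.4), ('b', 0.2), ('a', 0.4), ('b', 0.3)]:
    accumulate(*upd)

print([(n, round(v, 6)) for n, v in flushed],
      {k: round(v, 6) for k, v in pending.items()})
# [('a', 1.2)] {'b': 0.5}
```

Node 'a' crosses the threshold and is flushed once; node 'b' stays buffered, so no network round trip is spent on its small updates.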
Preferably, in step C), the asynchronous parallel calculation formula used by the point partition data set to analyze the data blocks mathematically satisfies the commutative, associative, and distributive laws, and a node attribute data state change value can serve as the input of the next iteration.
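The role of these algebraic laws can be checked directly (toy values, addition as the combine operator): when the combine step is commutative and associative, the folded result is independent of block arrival order, which is exactly what makes out-of-order asynchronous delivery safe.

```python
# Fold four state-change values in every possible arrival order and
# collect the distinct totals; a law-abiding operator yields exactly one.
import functools
import itertools
import operator

blocks = [3.0, 1.5, -2.0, 0.25]       # state-change values from four blocks

results = {
    round(functools.reduce(operator.add, perm), 10)
    for perm in itertools.permutations(blocks)
}
print(results)   # a single value: every arrival order gives the same total
```

An operator without these properties (e.g. subtraction) would produce many distinct totals here, and the asynchronous result would depend on network timing.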
Compared with the prior art, the invention has the following advantages:
1. by means of the resilient distributed data sets available under the Spark platform and the conditions required for asynchronous message communication, the point partition data sets and edge partition data sets of large-scale graph data are encapsulated into remote procedure call message loop bodies, and an asynchronous message communication model between the point and edge partition data sets is realized with the remote procedure call protocol; for the storage of intermediate iteration data, this avoids the I/O processing cost of frequent access to external storage;
2. by means of a remote procedure call protocol message loop body of a point partition data set and an edge partition data set, a traditional point-centered fine-grained transmission mechanism is converted into a partition data set-centered coarse-grained transmission mechanism, and by setting a threshold value, the data transmission efficiency is improved, and the network communication overhead is reduced;
3. and establishing a message cache queue in the point partition data set, performing cooperative processing among a message receiving thread, a message sending thread and a data iterative updating thread, and keeping the iterative updating of data in a running state as much as possible on the premise of ensuring the data access consistency among the threads.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
A coarse-grained graph data asynchronous iterative updating method comprises the following steps:
a) Defining a node and a plurality of destination edges which are connected with the node and take the node as a destination node and a plurality of source edges which are connected with the node and take the node as a source node, storing node attribute data of the node in a point partition data set, and storing edge attribute data of the destination edge and the source edge in a plurality of different edge partition data sets;
b) Wrapping the point partition data set and edge partition data sets obtained in step A) with a remote procedure call protocol, treating the point partition data set and each edge partition data set as an independent message loop body, and storing the node attribute data and edge attribute data with resilient distributed data sets;
c) After the edge partition data sets send data blocks containing destination-edge attributes to the point partition, the message loop body of the point partition data set stores the received data blocks in a message queue and periodically takes them out for calculation and analysis to obtain node attribute data state change values; computation need not wait until all data blocks have arrived, which reduces the waiting time of the point partition data set;
d) The point partition data set periodically sends the node attribute data state change value to the edge partition data set storing the active edge attribute;
e) After the edge partition data set storing the active edge attributes receives the node attribute data state change values, it uses them to update the node attribute information of the destination nodes corresponding to its source edges, then completes the aggregation of the source-edge attribute data inside its message loop body and sends the aggregated information back to the point partition data set in the form of data blocks, starting the next round of iterative computation. These steps are repeated until the attribute data state change values of all nodes in the point partition data set no longer change; the algorithm has then converged and the procedure ends.
In step C), performing iterative update processing on the node attribute data in an asynchronous manner, and creating three independent threads in a message loop body of a point partition data set, including:
message receiving thread: receives data blocks from the different edge partition data sets and puts them into the cache blocking queue of the message loop body;
data iteration updating thread: periodically takes data blocks out of the cache blocking queue for calculation and analysis to obtain node attribute data state change values; if the data blocks taken out contain nodes with the same ID, the thread aggregates that node's data and fills the processed result into the node attribute state table;
message sending thread: periodically sends the node attribute data state change values in the point partition data set to the edge partition data sets storing the active edge attributes.
In this embodiment, the node attribute data state change value obtained by the data iteration update thread is subjected to accumulation processing, a threshold is set for sending the node attribute data state change value, the message sending thread is awakened when the node attribute data state change value exceeds the threshold, and the message sending thread sends the node attribute data state change value meeting the condition to the edge partition dataset storing the active edge attribute.
In step C), the asynchronous parallel calculation formula used for performing calculation analysis on the data blocks by the point partition data set is mathematically in accordance with the commutative law, the associative law and the distributive law.
By means of the resilient distributed data sets available under the Spark platform and the conditions required for asynchronous message communication, this asynchronous iterative updating method for coarse-grained graph data encapsulates the point partition data sets and edge partition data sets of large-scale graph data into remote procedure call message loop bodies and realizes an asynchronous message communication model between them with the remote procedure call protocol, so that, for the storage of intermediate iteration data, the I/O processing cost of frequent access to external storage is avoided. Meanwhile, through the remote procedure call message loop bodies of the point and edge partition data sets, the traditional point-centered fine-grained transmission mechanism is converted into a coarse-grained transmission mechanism centered on partition data sets, and by setting a threshold the data transmission efficiency is improved and the network communication overhead is reduced. In addition, a message cache queue is created in the point partition data set, and the message receiving thread, message sending thread, and data iteration updating thread cooperate so that, on the premise of guaranteeing data access consistency among threads, the iterative updating of data stays in a running state as much as possible.

Claims (6)

1. A coarse-grained graph data asynchronous iterative updating method, characterized by comprising the following steps:
a) Defining a node and a plurality of destination edges which are connected with the node and take the node as a destination node and a plurality of source edges which are connected with the node and take the node as a source node, storing node attribute data of the node in a point partition data set, and storing edge attribute data of the destination edge and the source edge in a plurality of different edge partition data sets;
b) Packaging a remote procedure call protocol on the point partition data set and the edge partition data set obtained in the step A), respectively treating the point partition data set and each edge partition data set as an independent message loop body, and storing the node attribute data and the edge attribute data by using an elastic distributed data set;
c) After an edge partition data set sends a data block containing destination-edge attributes to the point partition, the message loop body of the point partition data set stores the received data block in a message queue, and data blocks are periodically taken out of the queue for calculation and analysis to obtain node attribute data state change values;
d) The point partition data set periodically sends the node attribute data state change value to the edge partition data set storing the active edge attribute;
e) After the edge partition data set storing the active edge attributes receives the node attribute data state change values, it uses them to update the node attribute information of the destination nodes corresponding to its source edges, then completes the aggregation of the source-edge attribute data inside its message loop body and sends the aggregated information back to the point partition data set in the form of data blocks, starting the next round of iterative computation. These steps are repeated until the attribute data state change values of all nodes in the point partition data set no longer change; the algorithm has then converged and the procedure ends.
2. The coarse-grained graph data asynchronous iterative updating method according to claim 1, characterized in that: in step C), the node attribute data is iteratively updated in an asynchronous manner, and three independent threads are created in the message loop body of the point partition data set, including:
message receiving thread: receiving data blocks from different edge partition data sets, and putting the data blocks into the cache blocking queue of the message loop body;
data iteration updating thread: periodically taking data blocks out of the cache blocking queue for calculation and analysis to obtain node attribute data state change values;
message sending thread: periodically sending the node attribute data state change values in the point partition data set to the edge partition data set storing the active edge attributes.
3. The coarse-grained graph data asynchronous iterative update method according to claim 2, characterized in that: and in a plurality of data blocks taken out from the cache blocking queue by the data iteration updating thread, if the data blocks contain the nodes with the same ID, carrying out aggregation calculation on the data of the ID node, and filling the processed result into a node attribute state table.
4. The coarse-grained graph data asynchronous iterative update method according to claim 3, characterized in that: and establishing an index number for each node in the node attribute state table.
5. The coarse-grained graph data asynchronous iterative update method of claim 2, wherein: and performing accumulation processing on the node attribute data state change value obtained by the data iteration updating thread, setting a threshold value for sending the node attribute data state change value, awakening the message sending thread after the node attribute data state change value exceeds the threshold value, and sending the node attribute data state change value meeting the condition to an edge partition data set storing the active edge attribute by the message sending thread.
6. The coarse-grained graph data asynchronous iterative updating method according to claim 1, characterized in that: in step C), the asynchronous parallel calculation formula adopted by the point partition data set to analyze the data blocks mathematically conforms to the commutative, associative, and distributive laws.
CN201910812905.1A 2019-08-29 2019-08-29 Coarse-grained graph data asynchronous iterative updating method Active CN110647406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910812905.1A CN110647406B (en) 2019-08-29 2019-08-29 Coarse-grained graph data asynchronous iterative updating method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910812905.1A CN110647406B (en) 2019-08-29 2019-08-29 Coarse-grained graph data asynchronous iterative updating method

Publications (2)

Publication Number Publication Date
CN110647406A (en) 2020-01-03
CN110647406B (en) 2022-11-29

Family

ID=69009949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910812905.1A Active CN110647406B (en) 2019-08-29 2019-08-29 Coarse-grained graph data asynchronous iterative updating method

Country Status (1)

Country Link
CN (1) CN110647406B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914556A (en) * 2014-04-15 2014-07-09 西北工业大学 Large-scale graph data processing method
CN107122248A (en) * 2017-05-02 2017-09-01 华中科技大学 A kind of distributed figure processing method of storage optimization

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120084747A1 (en) * 2010-10-01 2012-04-05 Nec Laboratories America, Inc. Partitioned iterative convergance programming model
US8521782B2 (en) * 2011-07-20 2013-08-27 Salesforce.Com, Inc. Methods and systems for processing large graphs using density-based processes using map-reduce

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914556A (en) * 2014-04-15 2014-07-09 西北工业大学 Large-scale graph data processing method
CN107122248A (en) * 2017-05-02 2017-09-01 华中科技大学 A kind of distributed figure processing method of storage optimization

Also Published As

Publication number Publication date
CN110647406A (en) 2020-01-03

Similar Documents

Publication Publication Date Title
Khorasani et al. Scalable simd-efficient graph processing on gpus
Hong et al. Efficient parallel graph exploration on multi-core CPU and GPU
Chu et al. Confidence-based work stealing in parallel constraint programming
CN110232087B (en) Big data increment iteration method and device, computer equipment and storage medium
Wang et al. Hybrid pulling/pushing for i/o-efficient distributed and iterative graph computing
Morari et al. Scaling irregular applications through data aggregation and software multithreading
CN110659278A (en) Graph data distributed processing system based on CPU-GPU heterogeneous architecture
CN112015765B (en) Spark cache elimination method and system based on cache value
Burkes et al. Design approaches for real-time transaction processing remote site recovery
Koo et al. Revealing critical loads and hidden data locality in GPGPU applications
Lee et al. Performance improvement of mapreduce process by promoting deep data locality
CN105426163A (en) Single data stream quantile processing method based on MIC coprocessor
EP3662376A1 (en) Reconfigurable cache architecture and methods for cache coherency
CN111176831A (en) Dynamic thread mapping optimization method and device based on multithread shared memory communication
CN110647406B (en) Coarse-grained graph data asynchronous iterative updating method
US11429299B2 (en) System and method for managing conversion of low-locality data into high-locality data
CN116069480B (en) Processor and computing device
CN110515729B (en) Graph computing node vector load balancing method and device based on graph processor
CN103955397A (en) Virtual machine scheduling multi-strategy selection method based on micro-architecture perception
Lin et al. swFLOW: A dataflow deep learning framework on sunway taihulight supercomputer
Toliopoulos et al. Adaptive distributed partitioning in apache flink
Liu et al. A hierarchical synchronous parallel model for wide-area graph analytics
CN115906684A (en) Hydrodynamics multi-grid solver parallel optimization method for Shenwei architecture
CN105573834B (en) A kind of higher-dimension vocabulary tree constructing method based on heterogeneous platform
CN108228323A (en) Hadoop method for scheduling task and device based on data locality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant