CN116188239B - Multi-request concurrent GPU (graphics processing unit) graph random walk optimization realization method and system

Info

Publication number
CN116188239B
CN116188239B
Authority
CN
China
Prior art keywords
graph
request
random walk
requests
gpu
Prior art date
Legal status
Active
Application number
CN202211536501.2A
Other languages
Chinese (zh)
Other versions
CN116188239A (en)
Inventor
李超
徐诚
王靖
汪陶磊
梅君夷
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202211536501.2A priority Critical patent/CN116188239B/en
Publication of CN116188239A publication Critical patent/CN116188239A/en
Application granted granted Critical
Publication of CN116188239B publication Critical patent/CN116188239B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Generation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A multi-request concurrent GPU graph random walk optimization implementation method and system: in the offline stage, graph random walk requests are classified and, based on the GPU resource occupation of each request, a concurrency effect judgment mechanism is established from the graph random walk request types and their resource occupation; in the online stage, a scheduler predicts the execution time of each graph random walk request, adjusts operation priorities and/or operation combinations according to the suitability of each request, and performs graph data segmentation management and graph random walk request execution across the GPU memory hierarchy and multiple accelerator ends. The invention realizes low-interference, low-delay graph data segmentation management and low-stall graph random walk request execution, fully exploits the performance potential of GPU spatial sharing for processing multiple concurrent requests, improves the overall throughput of graph random walk requests running on the GPU, and reduces energy consumption.

Description

Multi-request concurrent GPU (graphics processing unit) graph random walk optimization realization method and system
Technical Field
The invention relates to a technology in the field of distributed data processing, and in particular to a method and system for multi-request concurrent GPU (graphics processing unit) graph random walk optimization.
Background
The graph random walk algorithm repeatedly selects a successor of the current node according to a predefined pattern, thereby extracting subgraph sequences from the current large graph. Existing optimized graph random walk frameworks running on GPUs achieve a certain acceleration by exploiting the high compute power of the GPU platform. However, these GPU-based graph random walk frameworks only consider executing graph random walk task requests serially, leaving GPU resources underutilized in some cases, and they do not consider the concurrency potential of multiple simultaneous random walk requests.
Disclosure of Invention
Graph random walk is a memory-intensive task, and large graph data often exceeds the upper limit of GPU memory, so the GPU's computing capacity outstrips the available bandwidth when executing graph random walk tasks and the pipeline frequently stalls. Aiming at these problems, the invention provides a multi-request concurrent GPU graph random walk optimization implementation method and system that optimizes at fine granularity according to the random walk algorithm and GPU hardware characteristics, scores different types of requests with a concurrency model, and selects the request combinations best suited to concurrent operation according to the predicted times and the numbers of the different request types. Following the GPU storage hierarchy, it realizes low-interference, low-delay graph data segmentation management and low-stall graph random walk request execution, fully exploits the performance potential of GPU spatial sharing while processing multiple concurrent requests, improves the overall throughput of graph random walk requests running on the GPU, and reduces energy consumption.
The invention is realized by the following technical scheme:
the invention relates to a multi-request concurrent GPU graph random walk optimization implementation method, which comprises the steps of classifying graph random walk requests in an offline stage and setting up a concurrent effect judgment mechanism based on graph random walk request types and resource occupation conditions against GPU resource occupation conditions of all requests; and in the online stage, the execution time of the graph random walk request is predicted by a scheduler, the operation priority and/or operation combination are adjusted according to the suitability of each graph random walk request, and graph data segmentation management and graph random walk request execution are carried out by a GPU memory storage hierarchy and a multi-accelerator side.
The offline stage refers to: anticipating possible request situations before actual application requests arrive.
Establishing the concurrency effect judgment mechanism based on graph random walk request types and resource occupation specifically comprises the following steps (a classification sketch follows this list):
i) Dividing requests, according to the size of the graph data to be processed and the size of the on-board GPU memory, into large-graph requests whose graph data exceeds the on-board GPU memory and small-graph requests whose graph data is smaller than or equal to it;
ii) Determining the amount of memory and computing resources required by each type of graph random walk request;
iii) Judging how well each request suits concurrent execution according to the characteristics of graph random walk requests, comprising: concurrent execution at the request level, i.e., concurrent execution of requests using different graph random walk algorithms; and concurrent execution at the graph data level, i.e., concurrent execution of large-graph and small-graph requests.
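As a concrete illustration of step i), the sketch below classifies a request by comparing its graph data size against the on-board memory budget. It is a minimal sketch in C++: the type and field names (WalkRequest, kGpuBoardMemBytes, and so on) are illustrative assumptions rather than identifiers from the patent, and the 11 GB budget simply matches the 2080 Ti boards of the embodiment described later.

```cpp
#include <cstddef>

// Assumed on-board memory budget; the embodiment uses 11 GB RTX 2080 Ti GPUs.
constexpr std::size_t kGpuBoardMemBytes = 11ULL << 30;

enum class GraphClass { SmallGraph, LargeGraph };

// Hypothetical request descriptor carrying the attributes the method extracts:
// graph data size, batch size, average vertex degree, and algorithm type.
struct WalkRequest {
    std::size_t graph_bytes;  // size of the graph data to process
    std::size_t batch_size;   // number of walkers launched by this request
    double      avg_degree;   // average vertex degree of the graph
    int         algo_type;    // walk algorithm identifier (encoding assumed)
};

// Step i): a large-graph request if the graph data exceeds on-board GPU
// memory, a small-graph request otherwise.
inline GraphClass Classify(const WalkRequest& r) {
    return r.graph_bytes > kGpuBoardMemBytes ? GraphClass::LargeGraph
                                             : GraphClass::SmallGraph;
}
```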
The predicted graph random walk request execution time is obtained by taking a base time according to the type of the graph random walk request and fine-tuning it by the batch size of the request and the average degree of the graph, in terms of the following quantities: T(t), the predicted execution time of the graph random walk task t; T(s), the execution time of a reference task s of the same type; D(t, s), the difference in average graph degree between t and s; batch_t, the batch size of the target graph random walk task; batch_s, the reference batch size; and θ, an adjustable constant.
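Since only the variable definitions survive in the text above, the prediction formula itself cannot be reproduced; the sketch below assumes one plausible form consistent with those definitions (a same-type reference time scaled by the batch-size ratio and corrected by the degree difference through θ) and should be read as an illustration, not as the patent's actual equation.

```cpp
// Assumed form: T(t) = T(s) * (batch_t / batch_s) * (1 + theta * D(t, s)).
// Only the variable meanings come from the text; their combination is a guess.
inline double PredictRuntime(double ref_time,     // T(s): reference task time
                             double degree_diff,  // D(t, s): avg-degree gap
                             double batch_t,      // target batch size
                             double batch_s,      // reference batch size
                             double theta) {      // adjustable constant
    return ref_time * (batch_t / batch_s) * (1.0 + theta * degree_diff);
}
```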
The suitability of each graph random walk request refers to: judging the suitability of a graph random walk request according to its parameters under each concurrent operation mode in the graph random walk concurrency model, and adding a bias toward the more numerous request types, specifically: S(t1, t2) = Max(G(t1, t2), M(t1, t2)) + α·Abundant, wherein: t1 and t2 are graph random walk tasks; G and M compute the concurrency suitability at the graph data level and the request level, respectively; Abundant is a value between 0 and 1 generated from the share of these two types of graph random walk requests among all graph random walk requests, the larger the share the higher the value; and α is an adjustable constant.
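The suitability score can be sketched directly, reusing the WalkRequest type and Classify helper from the classification sketch above. The outer form S(t1, t2) = Max(G(t1, t2), M(t1, t2)) + α·Abundant comes from the text; the particular G and M scoring rules below are assumptions that merely encode the stated preferences (different algorithms mix well at the request level, large and small graphs mix well at the data level).

```cpp
#include <algorithm>

// G: graph-data-level fit. Pairing a large-graph request with a small-graph
// request is the favorable case named in the text (score values assumed).
inline double DataLevelScore(const WalkRequest& a, const WalkRequest& b) {
    return Classify(a) != Classify(b) ? 1.0 : 0.0;
}

// M: request-level fit. Requests running different walk algorithms mix well.
inline double RequestLevelScore(const WalkRequest& a, const WalkRequest& b) {
    return a.algo_type != b.algo_type ? 1.0 : 0.0;
}

// S(t1, t2) = Max(G(t1, t2), M(t1, t2)) + alpha * abundant, where `abundant`
// in [0, 1] reflects how common the two request types are among all requests.
inline double Suitability(const WalkRequest& t1, const WalkRequest& t2,
                          double abundant, double alpha) {
    return std::max(DataLevelScore(t1, t2), RequestLevelScore(t1, t2))
           + alpha * abundant;
}
```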
Adjusting the operation priority and/or the operation combination comprises: (1) setting operation priorities in descending order of suitability, and/or (2) splicing graph random walk requests of the same type together and treating them as a single new request; the combination of (1) and (2) specifically comprises (see the sketch after this step list):
step 1: selecting a graph random walk request;
step 2: selecting a next graph random walk request according to the receiving time sequence;
step 3: scoring the two requests;
step 4: predicting a runtime of the two requests;
step 5: judging whether a plurality of requests need to be spliced or not;
step 6: multiplying the score by the run-time percentage difference and recording the result as the final score of the two requests;
Step 7: and (5) repeating the steps 2-6 until all the request scoring is completed, and selecting the combination with the highest score to execute.
The graph data segmentation management refers to: for concurrent requests, graph data smaller than the on-board GPU memory is stored entirely on the GPU, while a large graph is kept in main memory, transmitted to the GPU in real time through the PCIe interface for processing, and partitioned into hot and cold portions, with the hot data preferentially stored in the space left on the GPU. Meanwhile, zero-copy is used to perform address translation and read the data in main memory directly, obtaining lower latency in suitable scenarios.
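The placement just described can be sketched with real CUDA runtime calls: cudaMalloc and cudaMemcpy for the resident hot partition, and cudaHostAlloc with the cudaHostAllocMapped flag plus cudaHostGetDevicePointer for the zero-copy cold partition. The hot/cold split policy here (simply the leading gpu_budget bytes) is a placeholder assumption; the patent partitions by access heat, which this sketch does not model, and error checks are omitted for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <cuda_runtime.h>

struct GraphPlacement {
    void* d_hot  = nullptr;  // hot partition resident in on-board GPU memory
    void* h_cold = nullptr;  // cold partition pinned in host main memory
    void* d_cold = nullptr;  // device-visible alias of the cold partition
};

// Place `bytes` of graph data given `gpu_budget` bytes of free GPU memory.
GraphPlacement PlaceGraph(const void* graph, std::size_t bytes,
                          std::size_t gpu_budget) {
    GraphPlacement p;
    std::size_t hot = std::min(bytes, gpu_budget);
    // Hot data: staged into the space left on the GPU board.
    cudaMalloc(&p.d_hot, hot);
    cudaMemcpy(p.d_hot, graph, hot, cudaMemcpyHostToDevice);
    if (bytes > hot) {
        // Cold data: pinned, mapped host memory that kernels read directly
        // over PCIe (zero-copy) instead of going through staged transfers.
        std::size_t cold = bytes - hot;
        cudaHostAlloc(&p.h_cold, cold, cudaHostAllocMapped);
        std::memcpy(p.h_cold, static_cast<const char*>(graph) + hot, cold);
        cudaHostGetDevicePointer(&p.d_cold, p.h_cold, 0);  // address translation
    }
    return p;
}
```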
The graph random walk request execution refers to: activating an appropriate number of accelerator ends based on the requested graph random walk algorithm type and graph data; redundant accelerator ends are not activated, so as to save energy.
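A small sketch of the activation rule, reusing WalkRequest and kGpuBoardMemBytes from the classification sketch: bring up only as many accelerator ends as the concurrent batch needs and leave the rest idle for energy savings. The one-GPU-per-board-memory sizing heuristic is an assumption, not the patent's allocation policy.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns how many GPUs to activate for a batch of concurrent requests;
// boards beyond this count are simply left unactivated to save energy.
int GpusToActivate(const std::vector<WalkRequest>& batch, int gpus_available) {
    std::size_t total_bytes = 0;
    for (const WalkRequest& r : batch) total_bytes += r.graph_bytes;
    // Assumed sizing: one GPU per board-memory's worth of graph data.
    int needed = static_cast<int>(
        (total_bytes + kGpuBoardMemBytes - 1) / kGpuBoardMemBytes);
    return std::min(std::max(needed, 1), gpus_available);
}
```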
Technical effects
The GPU-based graph random walk concurrency model, through an efficient spatial-sharing design, enables the GPU to execute multiple graph random walk requests simultaneously and efficiently; the multi-request scheduling mechanism screens out the request combinations best suited to concurrent execution through offline analytical modeling and online correction. Compared with the prior art, the invention significantly improves the resource utilization and overall throughput of a GPU-based graph random walk acceleration system.
Drawings
FIG. 1 is a schematic diagram of a system architecture of the present invention;
FIG. 2 is a flow chart of the present invention;
FIG. 3 is a schematic diagram of an embodiment execution process.
Detailed Description
As shown in fig. 1, the GPU graph random walk optimization system implementing the method of this embodiment comprises a host (HOST) end and an accelerator (GPU) end, wherein: the host end and the accelerator end are connected through PCIe and exchange data; the host end runs various graph random walk requests concurrently offline to obtain the corresponding resource consumption and system throughput information and builds the concurrent execution model, then classifies, predicts, and schedules the graph random walk requests received in the online stage according to their attributes, determines the request combinations suitable for concurrent execution, and outputs graph data to the cache; the accelerator end loads the corresponding graph data into memory according to the concurrent request combination determined by the host end, generates the memory buffer structure, activates matching computing resources for the different graph random walk requests, and returns the results to the host end after computation completes.
The host end comprises: a concurrency model analysis module, a request extraction module, a request scheduling module, and a data segmentation and transmission module, wherein: the concurrency model analysis module classifies requests at the graph data level and the algorithm level according to graph random walk request characteristics in the offline state, and derives the requests' concurrency model from the GPU resources, graph data size, operation mode, and the like, obtaining request combinations suitable for concurrent operation; the request extraction module, after acquiring a real request, analyzes and classifies its attributes and feeds them to the request scheduling module for scheduling; the request scheduling module predicts the running times of different requests from the input attributes, scores how well different requests suit concurrency according to the concurrency model obtained offline, and schedules by score, running time, and the number of each request type to obtain the combination best suited to concurrent operation on the accelerator end; the data segmentation and transmission module first loads graph data from the hard disk into memory and then segments it according to the graph random walk request type: smaller graph data is transmitted into the on-board GPU memory, while larger graph data is partitioned into blocks by hot/cold degree, with its hot data transmitted into the on-board GPU memory and its cold data kept in host memory, interacting with the accelerator end in real time through the PCIe interface.
The accelerator end comprises: a graph data management module, an accelerator-end distribution module, and a graph random walk module, wherein: the graph data management module adopts a hybrid data management mode, using unified memory and zero-copy techniques together to manage graph data and achieve the lowest latency with no interference between concurrent requests; the accelerator-end distribution module allocates an appropriate number of accelerator ends to each graph random walk request according to the request attributes, so that pipeline stalls caused by insufficient memory bandwidth are minimized without degrading performance and redundant accelerator ends are not activated, saving energy; the graph random walk module finally executes the graph random walk task and returns the result to the host end.
As shown in fig. 2, the present embodiment relates to an optimization method of the GPU graph random walk system with multiple concurrent requests, which includes the following steps:
Step 1) In the offline stage, possible request types are analyzed in advance and their concurrent operation effects are tested in advance, covering concurrency at the graph data level and at the algorithm level, and a concurrent execution model is established.
Step 2) In the real-time stage, attributes are first extracted from the various graph random walk requests, such as graph data size, average vertex degree, and algorithm type.
Step 3) The scheduler schedules the different graph requests according to the pre-established concurrent execution model and selects suitable request combinations for concurrent execution; the specific steps comprise:
i) Selecting a graph random walk request to be executed;
ii) selecting a next graph random walk request in the order of time received;
iii) Scoring the two requests;
iv) predicting the run time of the two requests;
v) Multiplying the score by the run-time percentage difference and recording the result as the final score of the two requests;
vi) Returning to step ii); the process ends when no new graph random walk request remains to be computed.
Step 4) The accelerator end performs graph data segmentation management through the graph data management module: hot data is moved to the accelerator end, cold data is kept at the host end and interacts with the GPU in real time through PCIe during execution, and the graph data is managed using unified memory and zero-copy techniques.
Step 5) An appropriate number of accelerator ends are activated to process the graph random walk requests concurrently. After computation completes, the results are returned to the host end, and the memory areas occupied at the host end and the accelerator end are reclaimed.
In this embodiment, taking a graph random walk application as an example, NVIDIA GPUs are used as the heterogeneous accelerator platform: the server is equipped with two 20-core Intel(R) Xeon(R) Gold 6148 CPUs, 256 GB of memory, an 8 TB hard disk, and four NVIDIA 2080 Ti GPUs, each with 11 GB of GDDR6 memory.
Practical experiments show that, after rewriting the accelerator end, various graph random walk algorithms such as deepwalk/node2vec and PPR can run concurrently in this mode using GPU spatial sharing, processing seven datasets such as LiveJournal (ranging from 1 GB to 14 GB). The experimental results, shown in the tables below, indicate that the overall system throughput improves by up to 54% while energy consumption is reduced by up to 12%. Compared with the prior art, the method achieves better performance indicators: higher system throughput, lower latency, and lower energy consumption.
Table 1 data plane embodiment dataset and results comparison
Table 2 algorithm level example dataset and results comparison
Compared with the prior art, the method executes multiple graph random walk requests concurrently while using both on-board GPU memory and PCIe transfers to maximize the GPU's effective memory bandwidth, and it allocates accelerator-end computing resources reasonably to improve overall resource utilization; it schedules the graph requests and concurrently executes those with high mutual suitability to achieve the best acceleration effect.
The foregoing embodiments may be partially modified in numerous ways by those skilled in the art without departing from the principles and spirit of the invention, the scope of which is defined in the claims and not by the foregoing embodiments, and all such implementations are within the scope of the invention.

Claims (9)

1. A multi-request concurrent GPU graph random walk optimization implementation method, characterized in that graph random walk requests are classified in an offline stage and a concurrency effect judgment mechanism based on graph random walk request types and resource occupation is established according to the GPU resource occupation of each request; in the online stage, a scheduler predicts the execution time of each graph random walk request, adjusts operation priorities and/or operation combinations according to the suitability of each graph random walk request, and performs graph data segmentation management and graph random walk request execution across the GPU memory hierarchy and multiple accelerator ends;
the suitability of each graph random walk request refers to: judging the suitability of the graph random walk request according to the parameters of the graph random walk request in each concurrent operation mode in the graph random walk concurrent operation model, and adding bias to the requests with more quantity by referring to the quantity of the requests of each type, wherein the method specifically comprises the following steps: s (t) 1 ,t 2 )=Max(G(t 1 ,t 2 ),M(t 1 ,t 2 ) Abondant), wherein: t is a graph random walk request, G and M respectively calculate the parallelism suitability degree of a graph data layer and a request layer, and Abundant is a graph based on t 1 ,t 2 Graph random request of (a)The duty cycle in all graph random walk requests generates a value between 0 and 1, the higher the duty cycle the higher the value, α being the constant of adjustment.
2. The method for implementing the multiple-request concurrent GPU graph random walk optimization according to claim 1, wherein the establishing a concurrent effect judging mechanism based on the graph random walk request type and the resource occupation condition specifically comprises:
i) Dividing requests, according to the size of the graph data to be processed and the size of the on-board GPU memory, into large-graph requests whose graph data exceeds the on-board GPU memory and small-graph requests whose graph data is smaller than or equal to it;
ii) Determining the amount of memory and computing resources required by each type of graph random walk request;
iii) Judging how well each request suits concurrent execution, comprising: concurrent execution at the request level, i.e., concurrent execution of requests using different graph random walk algorithms; and concurrent execution at the graph data level, i.e., concurrent execution of large-graph and small-graph requests.
3. The method for implementing multi-request concurrent GPU graph random walk optimization according to claim 1, wherein the predicted graph random walk request execution time is obtained by taking a base time according to the type of the graph random walk request and fine-tuning it by the batch size of the request and the average degree of the graph, in terms of the following quantities: T(t), the predicted execution time of the graph random walk request t; T(s), the execution time of a reference task s of the same type; D(t, s), the difference in average graph degree; batch_t, the batch size of the target graph random walk request; batch_s, the reference batch size; and θ, an adjustable constant.
4. The method for implementing multi-request concurrent GPU graph random walk optimization according to claim 1, wherein the adjusting of the operation priority and/or the operation combination comprises: (1) setting operation priorities in descending order of suitability, and/or (2) splicing graph random walk requests of the same type together and treating them as a single new request.
5. The method for implementing the random walk optimization of the GPU graph with multiple concurrent requests according to claim 1, wherein the adjusting the operation priority and/or the operation combination specifically comprises:
step 1: selecting a graph random walk request;
step 2: selecting a next graph random walk request according to the receiving time sequence;
step 3: scoring the two requests;
step 4: predicting a runtime of the two requests;
step 5: judging whether a plurality of requests need to be spliced or not;
step 6: multiplying the score by the run-time percentage difference, and recording as the final score of the two requests;
step 7: and (5) repeating the steps 2-6 until all the request scoring is completed, and selecting the combination with the highest score to execute.
6. The method for implementing multi-request concurrent GPU graph random walk optimization according to claim 1, wherein the graph data segmentation management means: for concurrent requests, graph data smaller than the on-board GPU memory is stored entirely on the GPU, while a large graph is kept in main memory, transmitted to the GPU in real time through the PCIe interface for processing, and partitioned into hot and cold portions, with the hot data stored in the space left on the GPU; zero-copy is used to perform address translation and read the data in main memory directly to obtain lower latency.
7. The method for implementing multi-request concurrent GPU graph random walk optimization according to claim 1, wherein the graph random walk request execution means: activating a matching number of accelerator ends according to the requested graph random walk algorithm type and graph data, and not activating redundant accelerator ends, so as to save energy.
8. A GPU-graph random walk optimization system that implements the multi-request concurrent GPU-graph random walk optimization implementation method of any of claims 1-7, comprising: host side and accelerator side, wherein: the host end and the accelerator end are connected through PCIe and exchange data; the method comprises the steps that a host side concurrently operates various graph random walk requests offline to obtain corresponding resource consumption and system throughput rate information, a concurrent execution model is built, the graph random walk requests received in an online stage are classified, predicted and scheduled according to the attributes of the graph random walk requests, request combinations suitable for concurrent execution are determined, and graph data are output to a cache; and the accelerator side loads corresponding graph data into the memory according to the concurrently executed request combination determined by the host side to generate a memory buffer structure, activates matched computing resources for different graph random walk requests, and returns a result to the host side after the computation is completed.
9. The GPU-graph random walk optimization system of claim 8, wherein said host side comprises: the system comprises a concurrency model analysis module, a request extraction module, a request scheduling module and a data segmentation transmission module, wherein: the concurrent model analysis module classifies the requests from a graph data layer and an algorithm layer according to the random walk request characteristics of the graph in an offline state, and obtains a concurrent model of the requests according to GPU resources, the graph data size and the operation mode to obtain a request combination suitable for concurrent operation; after acquiring a real request, the request extraction module analyzes the attribute of the real request, classifies the real request, and inputs the attribute of the real request into the request scheduling module for scheduling; the request scheduling module predicts different request running times according to the input attribute, scores the degree of the suitable concurrency of different requests according to the concurrency model obtained in an offline state, and schedules according to the score, the running time and the number of request types to obtain the combination most suitable for concurrency for the concurrent running of the accelerator terminal; the data segmentation transmission module firstly loads the graph data from the hard disk into the memory, then segments the graph data according to the type of the graph random walk request, transmits the graph data with the size smaller than that of the memory on the GPU board into the memory on the GPU board, segments the graph data with the size larger than that of the memory on the GPU board into blocks according to the cold and hot degree, transmits the hot data of the graph data into the memory on the GPU board, and the cold data of the graph data is still placed in the memory of the host end and interacts with the accelerator end in real time through a PCIe interface;
the accelerator end comprises: the system comprises a graph data management module, an accelerator end distribution module and a graph random walk module, wherein: the graph data management module adopts a mixed data management mode, and simultaneously adopts a unified memory and zero copy technology to manage graph data so as to achieve the effects of lowest delay and no interference between concurrent requests; the accelerator terminal distribution module distributes a proper number of accelerator terminals for each graph random walk request according to the request attribute, so that pipeline stagnation caused by insufficient memory bandwidth is minimized under the condition that performance is not reduced, and redundant accelerator terminals are prevented from being activated to achieve the aim of saving energy; the graph random walk module finally executes the graph random walk request and returns the result to the host end.
CN202211536501.2A 2022-12-02 2022-12-02 Multi-request concurrent GPU (graphics processing unit) graph random walk optimization realization method and system Active CN116188239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211536501.2A CN116188239B (en) 2022-12-02 2022-12-02 Multi-request concurrent GPU (graphics processing unit) graph random walk optimization realization method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211536501.2A CN116188239B (en) 2022-12-02 2022-12-02 Multi-request concurrent GPU (graphics processing unit) graph random walk optimization realization method and system

Publications (2)

Publication Number Publication Date
CN116188239A CN116188239A (en) 2023-05-30
CN116188239B (en) 2023-09-12

Family

ID=86437222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211536501.2A Active CN116188239B (en) 2022-12-02 2022-12-02 Multi-request concurrent GPU (graphics processing unit) graph random walk optimization realization method and system

Country Status (1)

Country Link
CN (1) CN116188239B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104158840A (en) * 2014-07-09 2014-11-19 东北大学 Method for calculating node similarity of chart in distributing manner
CN112667562A (en) * 2021-01-22 2021-04-16 北京工业大学 CPU-FPGA-based random walk heterogeneous computing system on large-scale graph
CN112925627A (en) * 2021-03-25 2021-06-08 上海交通大学 Graph sampling and random walk accelerating method and system based on graph processor

Also Published As

Publication number Publication date
CN116188239A (en) 2023-05-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant