CN116188239B - Multi-request concurrent GPU (graphics processing unit) graph random walk optimization realization method and system

Info

Publication number
CN116188239B
CN116188239B
Authority
CN
China
Prior art keywords
graph
request
random walk
requests
gpu
Prior art date
Legal status
Active
Application number
CN202211536501.2A
Other languages
Chinese (zh)
Other versions
CN116188239A (en)
Inventor
李超
徐诚
王靖
汪陶磊
梅君夷
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202211536501.2A priority Critical patent/CN116188239B/en
Publication of CN116188239A publication Critical patent/CN116188239A/en
Application granted granted Critical
Publication of CN116188239B publication Critical patent/CN116188239B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Generation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A multi-request concurrent GPU graph random walk optimization implementation method and system: in the offline stage, graph random walk requests are classified and, based on the GPU resource occupation of each request, a concurrency effect judgment mechanism is established from the graph random walk request types and their resource occupation; in the online stage, a scheduler predicts the execution time of each graph random walk request, adjusts operation priorities and/or operation combinations according to the suitability of each request, and performs graph data segmentation management and graph random walk request execution across the GPU memory hierarchy and multiple accelerator ends. The invention realizes low-interference, low-delay graph data segmentation management and low-stall graph random walk request execution, fully exploits the performance potential of GPU spatial sharing for processing multiple concurrent requests, improves the overall throughput of graph random walk requests running on the GPU, and reduces energy consumption.

Description

Multi-request concurrent GPU (graphics processing unit) graph random walk optimization realization method and system
Technical Field
The invention relates to a technology in the field of distributed data processing, and in particular to a method and system for multi-request concurrent GPU (graphics processing unit) graph random walk optimization.
Background
The graph random walk algorithm repeatedly selects a successor of the current node according to a predefined pattern, thereby extracting subgraph sequences from the current large graph. Existing optimized graph random walk frameworks running on GPUs achieve a certain acceleration by exploiting the high compute power of the GPU platform. However, these GPU-based graph random walk frameworks only consider executing graph random walk task requests serially, leaving GPU resources underutilized in some cases, and they do not consider the concurrency potential of multiple simultaneous random walk requests.
Disclosure of Invention
Graph random walk is a memory-intensive task, and large graph data often exceeds the upper limit of GPU memory, so the GPU's computing capacity outstrips the available bandwidth when executing graph random walk tasks and the pipeline frequently stalls. Aiming at these problems, the invention provides a multi-request concurrent GPU graph random walk optimization implementation method and system that optimizes at fine granularity according to the random walk algorithm and GPU hardware characteristics, scores different types of requests with a concurrency model, and selects the request combinations best suited to concurrent operation according to the predicted times and the numbers of the different request types. Following the GPU storage hierarchy, it realizes low-interference, low-delay graph data segmentation management and low-stall graph random walk request execution, fully exploits the performance potential of GPU spatial sharing while processing multiple concurrent requests, improves the overall throughput of graph random walk requests running on the GPU, and reduces energy consumption.
The invention is realized by the following technical scheme:
the invention relates to a multi-request concurrent GPU graph random walk optimization implementation method, which comprises the steps of classifying graph random walk requests in an offline stage and setting up a concurrent effect judgment mechanism based on graph random walk request types and resource occupation conditions against GPU resource occupation conditions of all requests; and in the online stage, the execution time of the graph random walk request is predicted by a scheduler, the operation priority and/or operation combination are adjusted according to the suitability of each graph random walk request, and graph data segmentation management and graph random walk request execution are carried out by a GPU memory storage hierarchy and a multi-accelerator side.
The offline stage refers to: anticipating possible request situations before actual application requests arrive.
Establishing the concurrency effect judgment mechanism based on graph random walk request types and resource occupation specifically comprises the following steps (a classification sketch follows this list):
i) Dividing requests, according to the size of the graph data to be processed and the size of the on-board GPU memory, into large-graph requests whose graph data exceeds the on-board GPU memory and small-graph requests whose graph data is smaller than or equal to it;
ii) Determining the amount of memory and computing resources required by each type of graph random walk request;
iii) Judging how well each request suits concurrent execution according to the characteristics of graph random walk requests, comprising: concurrent execution at the request level, i.e., concurrent execution of requests using different graph random walk algorithms; and concurrent execution at the graph data level, i.e., concurrent execution of large-graph and small-graph requests.
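As a concrete illustration of step i), the sketch below classifies a request by comparing its graph data size against the on-board memory budget. It is a minimal sketch in C++: the type and field names (WalkRequest, kGpuBoardMemBytes, and so on) are illustrative assumptions rather than identifiers from the patent, and the 11 GB budget simply matches the 2080 Ti boards of the embodiment described later.

```cpp
#include <cstddef>

// Assumed on-board memory budget; the embodiment uses 11 GB RTX 2080 Ti GPUs.
constexpr std::size_t kGpuBoardMemBytes = 11ULL << 30;

enum class GraphClass { SmallGraph, LargeGraph };

// Hypothetical request descriptor carrying the attributes the method extracts:
// graph data size, batch size, average vertex degree, and algorithm type.
struct WalkRequest {
    std::size_t graph_bytes;  // size of the graph data to process
    std::size_t batch_size;   // number of walkers launched by this request
    double      avg_degree;   // average vertex degree of the graph
    int         algo_type;    // walk algorithm identifier (encoding assumed)
};

// Step i): a large-graph request if the graph data exceeds on-board GPU
// memory, a small-graph request otherwise.
inline GraphClass Classify(const WalkRequest& r) {
    return r.graph_bytes > kGpuBoardMemBytes ? GraphClass::LargeGraph
                                             : GraphClass::SmallGraph;
}
```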
The predicted graph random walk request execution time is obtained by taking a base time according to the type of the graph random walk request and fine-tuning it by the batch size of the request and the average degree of the graph, in terms of the following quantities: T(t), the predicted execution time of the graph random walk task t; T(s), the execution time of a reference task s of the same type; D(t, s), the difference in average graph degree between t and s; batch_t, the batch size of the target graph random walk task; batch_s, the reference batch size; and θ, an adjustable constant.
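Since only the variable definitions survive in the text above, the prediction formula itself cannot be reproduced; the sketch below assumes one plausible form consistent with those definitions (a same-type reference time scaled by the batch-size ratio and corrected by the degree difference through θ) and should be read as an illustration, not as the patent's actual equation.

```cpp
// Assumed form: T(t) = T(s) * (batch_t / batch_s) * (1 + theta * D(t, s)).
// Only the variable meanings come from the text; their combination is a guess.
inline double PredictRuntime(double ref_time,     // T(s): reference task time
                             double degree_diff,  // D(t, s): avg-degree gap
                             double batch_t,      // target batch size
                             double batch_s,      // reference batch size
                             double theta) {      // adjustable constant
    return ref_time * (batch_t / batch_s) * (1.0 + theta * degree_diff);
}
```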
The suitability of each graph random walk request refers to: judging the suitability of a graph random walk request according to its parameters under each concurrent operation mode in the graph random walk concurrency model, and adding a bias toward the more numerous request types, specifically: S(t1, t2) = Max(G(t1, t2), M(t1, t2)) + α·Abundant, wherein: t1 and t2 are graph random walk tasks; G and M compute the concurrency suitability at the graph data level and the request level, respectively; Abundant is a value between 0 and 1 generated from the share of these two types of graph random walk requests among all graph random walk requests, the larger the share the higher the value; and α is an adjustable constant.
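The suitability score can be sketched directly, reusing the WalkRequest type and Classify helper from the classification sketch above. The outer form S(t1, t2) = Max(G(t1, t2), M(t1, t2)) + α·Abundant comes from the text; the particular G and M scoring rules below are assumptions that merely encode the stated preferences (different algorithms mix well at the request level, large and small graphs mix well at the data level).

```cpp
#include <algorithm>

// G: graph-data-level fit. Pairing a large-graph request with a small-graph
// request is the favorable case named in the text (score values assumed).
inline double DataLevelScore(const WalkRequest& a, const WalkRequest& b) {
    return Classify(a) != Classify(b) ? 1.0 : 0.0;
}

// M: request-level fit. Requests running different walk algorithms mix well.
inline double RequestLevelScore(const WalkRequest& a, const WalkRequest& b) {
    return a.algo_type != b.algo_type ? 1.0 : 0.0;
}

// S(t1, t2) = Max(G(t1, t2), M(t1, t2)) + alpha * abundant, where `abundant`
// in [0, 1] reflects how common the two request types are among all requests.
inline double Suitability(const WalkRequest& t1, const WalkRequest& t2,
                          double abundant, double alpha) {
    return std::max(DataLevelScore(t1, t2), RequestLevelScore(t1, t2))
           + alpha * abundant;
}
```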
Adjusting the operation priority and/or the operation combination comprises: (1) setting operation priorities in descending order of suitability, and/or (2) splicing graph random walk requests of the same type together and treating them as a single new request; the combination of (1) and (2) specifically comprises (see the sketch after this step list):
step 1: selecting a graph random walk request;
step 2: selecting a next graph random walk request according to the receiving time sequence;
step 3: scoring the two requests;
step 4: predicting a runtime of the two requests;
step 5: judging whether a plurality of requests need to be spliced or not;
step 6: multiplying the score by the run-time percentage difference and recording the result as the final score of the two requests;
Step 7: and (5) repeating the steps 2-6 until all the request scoring is completed, and selecting the combination with the highest score to execute.
The graph data segmentation management refers to: for concurrent requests, graph data smaller than the on-board GPU memory is stored entirely on the GPU, while a large graph is kept in main memory, transmitted to the GPU in real time through the PCIe interface for processing, and partitioned into hot and cold portions, with the hot data preferentially stored in the space left on the GPU. Meanwhile, zero-copy is used to perform address translation and read the data in main memory directly, obtaining lower latency in suitable scenarios.
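The placement just described can be sketched with real CUDA runtime calls: cudaMalloc and cudaMemcpy for the resident hot partition, and cudaHostAlloc with the cudaHostAllocMapped flag plus cudaHostGetDevicePointer for the zero-copy cold partition. The hot/cold split policy here (simply the leading gpu_budget bytes) is a placeholder assumption; the patent partitions by access heat, which this sketch does not model, and error checks are omitted for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <cuda_runtime.h>

struct GraphPlacement {
    void* d_hot  = nullptr;  // hot partition resident in on-board GPU memory
    void* h_cold = nullptr;  // cold partition pinned in host main memory
    void* d_cold = nullptr;  // device-visible alias of the cold partition
};

// Place `bytes` of graph data given `gpu_budget` bytes of free GPU memory.
GraphPlacement PlaceGraph(const void* graph, std::size_t bytes,
                          std::size_t gpu_budget) {
    GraphPlacement p;
    std::size_t hot = std::min(bytes, gpu_budget);
    // Hot data: staged into the space left on the GPU board.
    cudaMalloc(&p.d_hot, hot);
    cudaMemcpy(p.d_hot, graph, hot, cudaMemcpyHostToDevice);
    if (bytes > hot) {
        // Cold data: pinned, mapped host memory that kernels read directly
        // over PCIe (zero-copy) instead of going through staged transfers.
        std::size_t cold = bytes - hot;
        cudaHostAlloc(&p.h_cold, cold, cudaHostAllocMapped);
        std::memcpy(p.h_cold, static_cast<const char*>(graph) + hot, cold);
        cudaHostGetDevicePointer(&p.d_cold, p.h_cold, 0);  // address translation
    }
    return p;
}
```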
The graph random walk request execution refers to: activating an appropriate number of accelerator ends based on the requested graph random walk algorithm type and graph data; redundant accelerator ends are not activated, so as to save energy.
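A small sketch of the activation rule, reusing WalkRequest and kGpuBoardMemBytes from the classification sketch: bring up only as many accelerator ends as the concurrent batch needs and leave the rest idle for energy savings. The one-GPU-per-board-memory sizing heuristic is an assumption, not the patent's allocation policy.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns how many GPUs to activate for a batch of concurrent requests;
// boards beyond this count are simply left unactivated to save energy.
int GpusToActivate(const std::vector<WalkRequest>& batch, int gpus_available) {
    std::size_t total_bytes = 0;
    for (const WalkRequest& r : batch) total_bytes += r.graph_bytes;
    // Assumed sizing: one GPU per board-memory's worth of graph data.
    int needed = static_cast<int>(
        (total_bytes + kGpuBoardMemBytes - 1) / kGpuBoardMemBytes);
    return std::min(std::max(needed, 1), gpus_available);
}
```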
Technical effects
The GPU-based graph random walk concurrency model, through an efficient spatial-sharing design, enables the GPU to execute multiple graph random walk requests simultaneously and efficiently; the multi-request scheduling mechanism screens out the request combinations best suited to concurrent execution through offline analytical modeling and online correction. Compared with the prior art, the invention significantly improves the resource utilization and overall throughput of a GPU-based graph random walk acceleration system.
Drawings
FIG. 1 is a schematic diagram of a system architecture of the present invention;
FIG. 2 is a flow chart of the present invention;
FIG. 3 is a schematic diagram of an embodiment execution process.
Detailed Description
As shown in fig. 1, the GPU graph random walk optimization system implementing the method of this embodiment comprises a host (HOST) end and an accelerator (GPU) end, wherein: the host end and the accelerator end are connected through PCIe and exchange data; the host end runs various graph random walk requests concurrently offline to obtain the corresponding resource consumption and system throughput information and builds the concurrent execution model, then classifies, predicts, and schedules the graph random walk requests received in the online stage according to their attributes, determines the request combinations suitable for concurrent execution, and outputs graph data to the cache; the accelerator end loads the corresponding graph data into memory according to the concurrent request combination determined by the host end, generates the memory buffer structure, activates matching computing resources for the different graph random walk requests, and returns the results to the host end after computation completes.
The host end comprises: a concurrency model analysis module, a request extraction module, a request scheduling module, and a data segmentation and transmission module, wherein: the concurrency model analysis module classifies requests at the graph data level and the algorithm level according to graph random walk request characteristics in the offline state, and derives the requests' concurrency model from the GPU resources, graph data size, operation mode, and the like, obtaining request combinations suitable for concurrent operation; the request extraction module, after acquiring a real request, analyzes and classifies its attributes and feeds them to the request scheduling module for scheduling; the request scheduling module predicts the running times of different requests from the input attributes, scores how well different requests suit concurrency according to the concurrency model obtained offline, and schedules by score, running time, and the number of each request type to obtain the combination best suited to concurrent operation on the accelerator end; the data segmentation and transmission module first loads graph data from the hard disk into memory and then segments it according to the graph random walk request type: smaller graph data is transmitted into the on-board GPU memory, while larger graph data is partitioned into blocks by hot/cold degree, with its hot data transmitted into the on-board GPU memory and its cold data kept in host memory, interacting with the accelerator end in real time through the PCIe interface.
The accelerator end comprises: a graph data management module, an accelerator-end distribution module, and a graph random walk module, wherein: the graph data management module adopts a hybrid data management mode, using unified memory and zero-copy techniques together to manage graph data and achieve the lowest latency with no interference between concurrent requests; the accelerator-end distribution module allocates an appropriate number of accelerator ends to each graph random walk request according to the request attributes, so that pipeline stalls caused by insufficient memory bandwidth are minimized without degrading performance and redundant accelerator ends are not activated, saving energy; the graph random walk module finally executes the graph random walk task and returns the result to the host end.
As shown in fig. 2, the present embodiment relates to an optimization method of the GPU graph random walk system with multiple concurrent requests, which includes the following steps:
Step 1) In the offline stage, possible request types are analyzed in advance and their concurrent operation effects are tested in advance, covering concurrency at the graph data level and at the algorithm level, and a concurrent execution model is established.
Step 2) In the real-time stage, attributes are first extracted from the various graph random walk requests, such as graph data size, average vertex degree, and algorithm type.
Step 3) The scheduler schedules the different graph requests according to the pre-established concurrent execution model and selects suitable request combinations for concurrent execution; the specific steps comprise:
i) Selecting a graph random walk request to be executed;
ii) selecting a next graph random walk request in the order of time received;
iii) Scoring the two requests;
iv) predicting the run time of the two requests;
v) Multiplying the score by the run-time percentage difference and recording the result as the final score of the two requests;
vi) Returning to step ii); the process ends when no new graph random walk request remains to be computed.
Step 4) The accelerator end performs graph data segmentation management through the graph data management module: hot data is moved to the accelerator end, cold data is kept at the host end and interacts with the GPU in real time through PCIe during execution, and the graph data is managed using unified memory and zero-copy techniques.
Step 5) An appropriate number of accelerator ends are activated to process the graph random walk requests concurrently. After computation completes, the results are returned to the host end, and the memory areas occupied at the host end and the accelerator end are reclaimed.
In this embodiment, taking a graph random walk application as an example, NVIDIA GPUs are used as the heterogeneous accelerator platform: the server is equipped with two 20-core Intel(R) Xeon(R) Gold 6148 CPUs, 256 GB of memory, an 8 TB hard disk, and four NVIDIA 2080 Ti GPUs, each with 11 GB of GDDR6 memory.
Practical experiments show that, after rewriting the accelerator end, various graph random walk algorithms such as deepwalk/node2vec and PPR can run concurrently in this mode using GPU spatial sharing, processing seven datasets such as LiveJournal (ranging from 1 GB to 14 GB). The experimental results, shown in the tables below, indicate that the overall system throughput improves by up to 54% while energy consumption is reduced by up to 12%. Compared with the prior art, the method achieves better performance indicators: higher system throughput, lower latency, and lower energy consumption.
Table 1 data plane embodiment dataset and results comparison
Table 2 algorithm level example dataset and results comparison
Compared with the prior art, the method executes multiple graph random walk requests concurrently while using both on-board GPU memory and PCIe transfers to maximize the GPU's effective memory bandwidth, and it allocates accelerator-end computing resources reasonably to improve overall resource utilization; it schedules the graph requests and concurrently executes those with high mutual suitability to achieve the best acceleration effect.
The foregoing embodiments may be partially modified in numerous ways by those skilled in the art without departing from the principles and spirit of the invention, the scope of which is defined in the claims and not by the foregoing embodiments, and all such implementations are within the scope of the invention.

Claims (9)

1. A multi-request concurrent GPU graph random walk optimization implementation method, characterized in that graph random walk requests are classified in an offline stage and a concurrency effect judgment mechanism based on graph random walk request types and resource occupation is established according to the GPU resource occupation of each request; in the online stage, a scheduler predicts the execution time of each graph random walk request, adjusts operation priorities and/or operation combinations according to the suitability of each graph random walk request, and performs graph data segmentation management and graph random walk request execution across the GPU memory hierarchy and multiple accelerator ends;
the suitability of each graph random walk request refers to: judging the suitability of the graph random walk request according to the parameters of the graph random walk request in each concurrent operation mode in the graph random walk concurrent operation model, and adding bias to the requests with more quantity by referring to the quantity of the requests of each type, wherein the method specifically comprises the following steps: s (t) 1 ,t 2 )=Max(G(t 1 ,t 2 ),M(t 1 ,t 2 ) Abondant), wherein: t is a graph random walk request, G and M respectively calculate the parallelism suitability degree of a graph data layer and a request layer, and Abundant is a graph based on t 1 ,t 2 Graph random request of (a)The duty cycle in all graph random walk requests generates a value between 0 and 1, the higher the duty cycle the higher the value, α being the constant of adjustment.
2. The method for implementing the multiple-request concurrent GPU graph random walk optimization according to claim 1, wherein the establishing a concurrent effect judging mechanism based on the graph random walk request type and the resource occupation condition specifically comprises:
i) Dividing requests, according to the size of the graph data to be processed and the size of the on-board GPU memory, into large-graph requests whose graph data exceeds the on-board GPU memory and small-graph requests whose graph data is smaller than or equal to it;
ii) Determining the amount of memory and computing resources required by each type of graph random walk request;
iii) Judging how well each request suits concurrent execution, comprising: concurrent execution at the request level, i.e., concurrent execution of requests using different graph random walk algorithms; and concurrent execution at the graph data level, i.e., concurrent execution of large-graph and small-graph requests.
3. The method for implementing multi-request concurrent GPU graph random walk optimization according to claim 1, wherein the predicted graph random walk request execution time is obtained by taking a base time according to the type of the graph random walk request and fine-tuning it by the batch size of the request and the average degree of the graph, in terms of the following quantities: T(t), the predicted execution time of the graph random walk request t; T(s), the execution time of a reference task s of the same type; D(t, s), the difference in average graph degree; batch_t, the batch size of the target graph random walk request; batch_s, the reference batch size; and θ, an adjustable constant.
4. The method for implementing multi-request concurrent GPU graph random walk optimization according to claim 1, wherein the adjusting of the operation priority and/or the operation combination comprises: (1) setting operation priorities in descending order of suitability, and/or (2) splicing graph random walk requests of the same type together and treating them as a single new request.
5. The method for implementing the random walk optimization of the GPU graph with multiple concurrent requests according to claim 1, wherein the adjusting the operation priority and/or the operation combination specifically comprises:
step 1: selecting a graph random walk request;
step 2: selecting a next graph random walk request according to the receiving time sequence;
step 3: scoring the two requests;
step 4: predicting a runtime of the two requests;
step 5: judging whether a plurality of requests need to be spliced or not;
step 6: multiplying the score by the run-time percentage difference, and recording as the final score of the two requests;
step 7: and (5) repeating the steps 2-6 until all the request scoring is completed, and selecting the combination with the highest score to execute.
6. The method for implementing multi-request concurrent GPU graph random walk optimization according to claim 1, wherein the graph data segmentation management means: for concurrent requests, graph data smaller than the on-board GPU memory is stored entirely on the GPU, while a large graph is kept in main memory, transmitted to the GPU in real time through the PCIe interface for processing, and partitioned into hot and cold portions, with the hot data stored in the space left on the GPU; zero-copy is used to perform address translation and read the data in main memory directly to obtain lower latency.
7. The method for implementing multi-request concurrent GPU graph random walk optimization according to claim 1, wherein the graph random walk request execution means: activating a matching number of accelerator ends according to the requested graph random walk algorithm type and graph data, and not activating redundant accelerator ends, so as to save energy.
8. A GPU-graph random walk optimization system that implements the multi-request concurrent GPU-graph random walk optimization implementation method of any of claims 1-7, comprising: host side and accelerator side, wherein: the host end and the accelerator end are connected through PCIe and exchange data; the method comprises the steps that a host side concurrently operates various graph random walk requests offline to obtain corresponding resource consumption and system throughput rate information, a concurrent execution model is built, the graph random walk requests received in an online stage are classified, predicted and scheduled according to the attributes of the graph random walk requests, request combinations suitable for concurrent execution are determined, and graph data are output to a cache; and the accelerator side loads corresponding graph data into the memory according to the concurrently executed request combination determined by the host side to generate a memory buffer structure, activates matched computing resources for different graph random walk requests, and returns a result to the host side after the computation is completed.
9. The GPU-graph random walk optimization system of claim 8, wherein said host side comprises: the system comprises a concurrency model analysis module, a request extraction module, a request scheduling module and a data segmentation transmission module, wherein: the concurrent model analysis module classifies the requests from a graph data layer and an algorithm layer according to the random walk request characteristics of the graph in an offline state, and obtains a concurrent model of the requests according to GPU resources, the graph data size and the operation mode to obtain a request combination suitable for concurrent operation; after acquiring a real request, the request extraction module analyzes the attribute of the real request, classifies the real request, and inputs the attribute of the real request into the request scheduling module for scheduling; the request scheduling module predicts different request running times according to the input attribute, scores the degree of the suitable concurrency of different requests according to the concurrency model obtained in an offline state, and schedules according to the score, the running time and the number of request types to obtain the combination most suitable for concurrency for the concurrent running of the accelerator terminal; the data segmentation transmission module firstly loads the graph data from the hard disk into the memory, then segments the graph data according to the type of the graph random walk request, transmits the graph data with the size smaller than that of the memory on the GPU board into the memory on the GPU board, segments the graph data with the size larger than that of the memory on the GPU board into blocks according to the cold and hot degree, transmits the hot data of the graph data into the memory on the GPU board, and the cold data of the graph data is still placed in the memory of the host end and interacts with the accelerator end in real time through a PCIe interface;
the accelerator end comprises: the system comprises a graph data management module, an accelerator end distribution module and a graph random walk module, wherein: the graph data management module adopts a mixed data management mode, and simultaneously adopts a unified memory and zero copy technology to manage graph data so as to achieve the effects of lowest delay and no interference between concurrent requests; the accelerator terminal distribution module distributes a proper number of accelerator terminals for each graph random walk request according to the request attribute, so that pipeline stagnation caused by insufficient memory bandwidth is minimized under the condition that performance is not reduced, and redundant accelerator terminals are prevented from being activated to achieve the aim of saving energy; the graph random walk module finally executes the graph random walk request and returns the result to the host end.
CN202211536501.2A 2022-12-02 2022-12-02 Multi-request concurrent GPU (graphics processing unit) graph random walk optimization realization method and system Active CN116188239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211536501.2A CN116188239B (en) 2022-12-02 2022-12-02 Multi-request concurrent GPU (graphics processing unit) graph random walk optimization realization method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211536501.2A CN116188239B (en) 2022-12-02 2022-12-02 Multi-request concurrent GPU (graphics processing unit) graph random walk optimization realization method and system

Publications (2)

Publication Number Publication Date
CN116188239A CN116188239A (en) 2023-05-30
CN116188239B (en) 2023-09-12

Family

ID=86437222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211536501.2A Active CN116188239B (en) 2022-12-02 2022-12-02 Multi-request concurrent GPU (graphics processing unit) graph random walk optimization realization method and system

Country Status (1)

Country Link
CN (1) CN116188239B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104158840A (en) * 2014-07-09 2014-11-19 东北大学 Method for calculating node similarity of chart in distributing manner
CN112667562A (en) * 2021-01-22 2021-04-16 北京工业大学 CPU-FPGA-based random walk heterogeneous computing system on large-scale graph
CN112925627A (en) * 2021-03-25 2021-06-08 上海交通大学 Graph sampling and random walk accelerating method and system based on graph processor

Also Published As

Publication number Publication date
CN116188239A (en) 2023-05-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant