WO2022099925A1 - Adaptive unified memory management method and system for large graphs - Google Patents

Adaptive unified memory management method and system for large graphs

Info

Publication number
WO2022099925A1
Authority
WO
WIPO (PCT)
Prior art keywords
gpu
memory
graph
data
read
Prior art date
Application number
PCT/CN2021/072376
Other languages
English (en)
French (fr)
Inventor
李超
王鹏宇
邵传明
王靖
郭进阳
朱浩瑾
过敏意
Original Assignee
上海交通大学 (Shanghai Jiao Tong University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海交通大学 (Shanghai Jiao Tong University)
Priority to US18/040,905 (published as US20230297234A1)
Publication of WO2022099925A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0207Addressing or allocation; Relocation with multidimensional access, e.g. row/column, matrix
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/0644Management of space entities, e.g. partitions, extents, pools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5021Priority
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/30Providing cache or TLB in specific location of a processing system
    • G06F2212/302In image processor or graphics adapter
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/45Caching of specific data in cache memory
    • G06F2212/455Image or video data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60Details of cache memory
    • G06F2212/6028Prefetching based on hints or prefetch instructions

Definitions

  • The invention relates to a technology in the field of graph processing, in particular to a method and system that, under a unified memory architecture, adaptively and dynamically configures memory for the read strategy of large graph data whose size exceeds GPU memory capacity.
  • Unified memory adds a unified memory space on top of the existing memory management scheme, so that a program can use a single pointer to directly access data stored either in central processing unit (CPU) memory or in graphics processing unit (GPU) video memory.
  • This technique enlarges the address space available to the GPU, allowing the GPU to process graph data that exceeds its memory capacity; used directly, however, it often incurs a significant performance penalty (see the sketch below).
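A minimal sketch of the unified-memory model just described, illustrative only and not taken from the patent: a single cudaMallocManaged allocation yields one pointer that both the CPU and the GPU dereference, with pages migrating on demand.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// GPU kernel: dereferences the same pointer the CPU filled in.
__global__ void scale(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *v = nullptr;
    cudaMallocManaged(&v, n * sizeof(float));  // one pointer, valid on CPU and GPU
    for (int i = 0; i < n; ++i) v[i] = 1.0f;   // CPU writes through it
    scale<<<(n + 255) / 256, 256>>>(v, n);     // GPU reads/writes through it
    cudaDeviceSynchronize();                   // wait before the CPU touches it again
    printf("v[0] = %.1f\n", v[0]);             // prints 2.0
    cudaFree(v);
    return 0;
}
```

When the managed working set exceeds GPU memory, each GPU page fault triggers on-demand migration over the interconnect; that fault-handling cost is the performance penalty the method below targets.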
  • Against this shortcoming of the prior art, the present invention proposes an adaptive, large-graph-oriented unified memory management method and system.
  • Tailored to the characteristics of the graph data and the different graph algorithms adopted, and combined with the size of the GPU's available memory, the method significantly improves the performance of processing large graphs that exceed GPU memory capacity under a unified memory architecture, including raising GPU bandwidth utilization, reducing both the number of page faults and the overhead of handling them, and shortening the running time of graph-computing programs.
  • The invention relates to an adaptive unified memory management method for large graphs: for the different types of graph data in a graph-computing application, in priority order, it checks in turn whether the current GPU memory is full and whether the size of the current graph data exceeds the GPU's available memory capacity, and then configures the unified memory management policy.
  • The different types of graph data include: vertex offsets (VertexOffset), vertex attribute labels (VertexProperty), edges (Edge), and the frontier of vertices to be processed (Frontier), wherein VertexOffset, VertexProperty, and Edge are the three arrays of the compressed sparse row (CSR) format (sketched below).
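For illustration only, one plausible in-memory layout of these four structures, under our own naming and field types (the patent does not prescribe them):

```cuda
// Sketch of a CSR graph plus frontier, matching the terminology above.
struct CSRGraph {
    int    numVertices;
    long   numEdges;
    long  *vertexOffset;    // VertexOffset: numVertices + 1 row pointers into edge[]
    float *vertexProperty;  // VertexProperty: per-vertex label/value the algorithm updates
    int   *edge;            // Edge: destination vertex of each edge, grouped by source
    int   *frontier;        // Frontier: vertices queued for processing this iteration
};
// Neighbors of vertex u are edge[vertexOffset[u]] .. edge[vertexOffset[u + 1] - 1].
```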
  • The priority order refers to the descending order in which the graph data structures are accessed during execution of the graph algorithm, specifically: vertex properties, vertex offsets, frontier, and edges.
  • The graph algorithms can be divided into traversal algorithms and computational algorithms, including but not limited to the single-source shortest path algorithm (SSSP), the breadth-first search algorithm (BFS), the page ranking algorithm (PageRank, PR), and the connected components algorithm (Connected Components, CC).
  • The GPU memory judgment calls cudaMemGetInfo to check the remaining capacity of the current GPU memory.
  • The data-exceeded judgment compares whether the size of the data exceeds the size of the GPU's available memory (both judgments are sketched below).
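A sketch of the two judgments; the helper names are ours, while cudaMemGetInfo is the documented CUDA runtime call:

```cuda
#include <cuda_runtime.h>

// GPU memory judgment: query the free bytes on the current device.
size_t availGPUMemSize() {
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    return freeBytes;
}

// Data-exceeded judgment: does this array still fit in what is left?
bool dataExceeds(size_t dataBytes, size_t availBytes) {
    return dataBytes > availBytes;
}
```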
  • The unified memory management policy configuration adopts, but is not limited to, setting the management policy of the current graph data by calling cudaMemPrefetchAsync and cudaMemAdvise, wherein: cudaMemPrefetchAsync can move part of the data into GPU memory in advance; cudaMemAdvise can set a memory usage hint (Memory Usage Hint, hereinafter "hint") for specified data to help the GPU driver control data movement in an appropriate way and improve final performance.
  • The optional data usage hints include AccessedBy and ReadMostly. These instructions target NVIDIA's various GPU series; a combined use of the two primitives is sketched below.
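A hedged sketch of how the two primitives combine into one policy decision; the function name is ours, device 0 and byte-denominated sizes are assumptions, and both calls are documented CUDA runtime APIs:

```cuda
#include <cuda_runtime.h>

// Apply one policy decision to a managed allocation.
void configurePolicy(void *data, size_t bytes, size_t prefetchBytes, bool accessedBy) {
    const int device = 0;  // assumption: single-GPU system
    if (accessedBy) {
        // Hint: map the range into the GPU's page tables so faults need not migrate pages.
        cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, device);
    }
    if (prefetchBytes > 0) {
        // Read-ahead: migrate (part of) the range to GPU memory before the kernel runs.
        cudaMemPrefetchAsync(data, prefetchBytes, device, 0);
    }
}
```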
  • The invention as a whole solves the technical problem that existing GPUs cannot process large graphs that exceed GPU memory capacity.
  • Compared with the prior art, the present invention uses unified memory technology to manage graph data, applies a targeted management policy to each graph data structure in a specific priority order, and adjusts the management policies according to the size of the graph data relative to the GPU's available memory and the type of graph algorithm, significantly improving the running efficiency of graph algorithms.
  • Fig. 1 is a schematic diagram of the system of the present invention.
  • Fig. 2 is a schematic flow chart of the memory management policy setting of the present invention.
  • The adaptive large-graph-oriented unified memory management system involved in this embodiment comprises: a system parameter setting module, a data reading module, and a memory management policy setting module, wherein the system parameter setting module calls the CUDA programming interface to obtain the operating parameters of the memory management policy and initializes them.
  • The data reading module reads the graph data file from storage and builds the corresponding graph data structure in CPU memory.
  • The memory management policy setting module sets the data read-ahead and hint policies for the graph data structures by calling application programming interfaces available since CUDA 8.0.
  • The operating parameters include: the memory-full flag (GPUIsFull), the GPU's currently available memory capacity (availGPUMemSize), and the read-ahead rate τ.
  • Initialization refers to: setting GPUIsFull to false, and obtaining availGPUMemSize through cudaMemGetInfo.
  • The read-ahead rate τ is set to 0.5 for traversal graph algorithms (such as BFS) and to 0.8 for computational graph algorithms (such as CC).
  • The application programming interfaces available since CUDA 8.0 include, but are not limited to, those that provide the same functionality as the explicit memory-copy and pinning APIs without reinstating the limitations of explicit GPU memory allocation: explicit prefetching (cudaMemPrefetchAsync) and memory usage hints (cudaMemAdvise).
  • This example relates to an adaptive memory management method based on the above system, comprising the following steps (a code sketch follows the list):
  • Step 1 (B0 in the figure): obtain the initial values of the operating parameters (GPUIsFull, availGPUMemSize, τ).
  • Step 2 (B1, C0 in the figure): set a memory management policy for each item Data in the data structures (VertexProperty, VertexOffset, Frontier, Edge) in turn, judging each item as follows:
  • Step 2.1 (C1 in the figure): when the value of the variable GPUIsFull is false, execute step 2.1.1; otherwise, execute step 2.1.2.
  • Step 2.1.1 (C2 in the figure): when the size of Data is less than availGPUMemSize, execute step 2.1.1.1; otherwise, execute step 2.1.1.2.
  • Step 2.1.1.1 (B3–B4 in the figure): call cudaMemPrefetchAsync to prefetch Data into GPU memory; set availGPUMemSize -= size of Data; return to step 2.
  • Step 2.1.1.2 (B5–B7 in the figure): set the memory-full flag (GPUIsFull = true); call cudaMemAdvise to set the hint of Data to AccessedBy; call cudaMemPrefetchAsync to prefetch τ × availGPUMemSize bytes of Data into GPU memory; return to step 2.
  • Step 2.1.2 (B8 in the figure): call cudaMemAdvise to set the hint of Data to AccessedBy; return to step 2.
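Putting steps 1 through 2.1.2 together, a minimal sketch of the decision flow under stated assumptions: the struct and function names are ours, the four arrays are cudaMallocManaged allocations ordered by the priority above, and all sizes are in bytes.

```cuda
#include <cuda_runtime.h>

struct ManagedArray { void *ptr; size_t bytes; };

// arrays[0..3] = VertexProperty, VertexOffset, Frontier, Edge (priority order);
// tau = read-ahead rate (0.5 for traversal, 0.8 for computational algorithms).
void setMemoryPolicies(ManagedArray arrays[4], double tau, int device) {
    size_t availGPUMemSize = 0, total = 0;
    cudaMemGetInfo(&availGPUMemSize, &total);          // Step 1
    bool gpuIsFull = false;

    for (int i = 0; i < 4; ++i) {                      // Step 2
        ManagedArray &d = arrays[i];
        if (!gpuIsFull) {                              // Step 2.1
            if (d.bytes < availGPUMemSize) {           // Step 2.1.1.1: fits, prefetch all of it
                cudaMemPrefetchAsync(d.ptr, d.bytes, device, 0);
                availGPUMemSize -= d.bytes;
            } else {                                   // Step 2.1.1.2: too large, partial prefetch
                gpuIsFull = true;
                cudaMemAdvise(d.ptr, d.bytes, cudaMemAdviseSetAccessedBy, device);
                cudaMemPrefetchAsync(d.ptr, (size_t)(tau * availGPUMemSize), device, 0);
            }
        } else {                                       // Step 2.1.2: memory already full
            cudaMemAdvise(d.ptr, d.bytes, cudaMemAdviseSetAccessedBy, device);
        }
    }
}
```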
  • In concrete experiments, the graph algorithms were executed on different datasets and the execution time was measured, i.e., the total running time on the GPU from start to finish, excluding preprocessing and data-transfer time. During measurement, each algorithm was repeated 5 times and the average of the 5 execution times was taken.
  • The datasets are multiple graph datasets of different sizes, including social network graph datasets (LiveJournal, Orkut, Twitter, Friendster) and Internet snapshot graph datasets (UK-2005, SK-2005, UK-union), where LiveJournal contains 5M vertices and 69M edges with a volume of 1.4 GB, and UK-union contains 133M vertices and 5.5B edges with a volume of 110 GB.
  • The graph algorithms are four: SSSP, BFS, PR, and CC, wherein SSSP and BFS are traversal algorithms, and PR and CC are computational algorithms.
  • For BFS and SSSP, the algorithm takes the first vertex in each graph dataset as the source vertex; for PR, the decay coefficient is set to 0.85 and the error tolerance to 0.01.
  • The condition for the PR algorithm to stop running is that the algorithm converges, or the number of iterations reaches 100.
  • The present invention can significantly shorten the running time of graph-computing programs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Memory System (AREA)

Abstract

An adaptive unified memory management method for large graphs: for the different types of graph data in a graph-computing application, in priority order, a GPU memory judgment first checks whether the current GPU memory is full, a data-exceeded judgment then checks whether the size of the current graph data exceeds the GPU's available memory capacity, and the unified memory management policy is then configured. Tailored to the characteristics of graph data and the different graph algorithms adopted, and combined with the size of the GPU's available memory, the method can significantly improve the performance of processing large graphs that exceed GPU memory capacity under a unified memory architecture, including raising GPU bandwidth utilization, reducing the number of page faults and the overhead of handling them, and shortening the running time of graph-computing programs.

Description

Adaptive Unified Memory Management Method and System for Large Graphs

Technical Field

The invention relates to a technology in the field of graph processing, in particular to a method and system that, under a unified memory architecture, adaptively and dynamically configures memory for the read strategy of large graph data whose size exceeds GPU memory capacity.

Background

Unified memory adds a unified memory space on top of the existing memory management scheme, so that a program can use a single pointer to directly access data stored either in central processing unit (CPU) memory or in graphics processing unit (GPU) video memory. This technique enlarges the address space available to the GPU, allowing the GPU to process graph data that exceeds its memory capacity. Using the technique directly, however, often incurs a significant performance penalty.

Summary of the Invention

In view of the above shortcomings of the prior art, the present invention proposes an adaptive, large-graph-oriented unified memory management method and system. Tailored to the characteristics of graph data and the different graph algorithms adopted, and combined with the size of the GPU's available memory, it can significantly improve the performance of processing large graphs that exceed GPU memory capacity under a unified memory architecture, including raising GPU bandwidth utilization, reducing the number of page faults and the overhead of handling them, and shortening the running time of graph-computing programs.

The invention is realized through the following technical solution:

The invention relates to an adaptive unified memory management method for large graphs: for the different types of graph data in a graph-computing application, in priority order, it checks in turn whether the current GPU memory is full and whether the size of the current graph data exceeds the GPU's available memory capacity, and then configures the unified memory management policy.
The different types of graph data include: vertex offsets (VertexOffset), vertex attribute labels (VertexProperty), edges (Edge), and the frontier of vertices to be processed (Frontier), wherein VertexOffset, VertexProperty, and Edge are the three arrays of the compressed sparse row (CSR) format.

The priority order refers to the descending order in which the graph data structures are accessed during execution of the graph algorithm, specifically: vertex properties, vertex offsets, frontier, and edges.

The graph algorithms can be divided into traversal algorithms and computational algorithms, including but not limited to the single-source shortest path algorithm (SSSP), the breadth-first search algorithm (BFS), the page ranking algorithm (PageRank, PR), and the connected components algorithm (Connected Components, CC).

The GPU memory judgment calls cudaMemGetInfo to check the remaining capacity of the current GPU memory. The data-exceeded judgment compares whether the size of the data exceeds the size of the GPU's available memory.

The unified memory management policy configuration adopts, but is not limited to, setting the management policy of the current graph data by calling cudaMemPrefetchAsync and cudaMemAdvise, wherein: cudaMemPrefetchAsync can move part of the data into GPU memory in advance; cudaMemAdvise can set a memory usage hint (Memory Usage Hint, hereinafter "hint") for specified data to help the GPU driver control data movement in an appropriate way and improve final performance. The optional data usage hints include AccessedBy and ReadMostly. These instructions target NVIDIA's various GPU series, specifically:
① For vertex property data: when GPU memory is full, set the hint of VertexProperty to AccessedBy; otherwise, i.e., when GPU memory is not full and VertexProperty does not exceed the GPU's available memory capacity, set the read-ahead amount of VertexProperty to the size of VertexProperty; when VertexProperty exceeds the GPU's available memory capacity, set the hint of VertexProperty to AccessedBy and set the read-ahead amount of VertexProperty to: read-ahead rate × available GPU memory capacity, in bytes.

② For vertex offset data: when GPU memory is full, set the hint of VertexOffset to AccessedBy; otherwise, i.e., when GPU memory is not full and VertexOffset does not exceed the GPU's available memory capacity, set the read-ahead amount of VertexOffset to the size of VertexOffset; when VertexOffset exceeds the GPU's available memory capacity, set the hint of VertexOffset to AccessedBy and set the read-ahead amount of VertexOffset to: read-ahead rate × available GPU memory capacity, in bytes.

③ For frontier data: when GPU memory is full, set the hint of Frontier to AccessedBy; otherwise, i.e., when GPU memory is not full and Frontier does not exceed the GPU's available memory capacity, set the read-ahead amount of Frontier to the size of Frontier; when Frontier exceeds the GPU's available memory capacity, set the hint of Frontier to AccessedBy and set the read-ahead amount of Frontier to: read-ahead rate × available GPU memory capacity, in bytes.

④ For edge data: when GPU memory is full, set the hint of Edge to AccessedBy; otherwise, i.e., when GPU memory is not full and Edge does not exceed the GPU's available memory capacity, set the read-ahead amount of Edge to the size of Edge; when Edge exceeds the GPU's available memory capacity, set the hint of Edge to AccessedBy and set the read-ahead amount of Edge to: read-ahead rate × available GPU memory capacity, in bytes.
Technical Effects

The invention as a whole solves the technical problem that existing GPUs cannot process large graphs that exceed GPU memory capacity.

Compared with the prior art, the present invention uses unified memory technology to manage graph data, applies a targeted management policy to each graph data structure in a specific priority order, and adjusts the management policies according to the size of the graph data relative to the GPU's available memory and the type of graph algorithm, significantly improving the running efficiency of graph algorithms.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the system of the present invention;

Fig. 2 is a schematic flow chart of the memory management policy setting of the present invention.

Detailed Description
As shown in Fig. 1, the adaptive large-graph-oriented unified memory management system involved in this embodiment comprises: a system parameter setting module, a data reading module, and a memory management policy setting module, wherein: the system parameter setting module calls the CUDA programming interface to obtain the operating parameters of the memory management policy and initializes them; the data reading module reads the graph data file from storage and builds the corresponding graph data structure in CPU memory; and the memory management policy setting module sets the data read-ahead and hint policies for the graph data structures by calling application programming interfaces available since CUDA 8.0.

The operating parameters include: the memory-full flag (GPUIsFull), the GPU's currently available memory capacity (availGPUMemSize), and the read-ahead rate τ.

Initialization refers to: setting GPUIsFull to false, and obtaining availGPUMemSize through cudaMemGetInfo.
The read-ahead rate τ is set to 0.5 for traversal graph algorithms (such as BFS) and to 0.8 for computational graph algorithms (such as CC), as illustrated by the sketch below.
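Purely as an illustration (the enum and function are our own naming, not the patent's), this choice of τ can be written as:

```cuda
// Sketch: select the read-ahead rate tau from the algorithm class.
enum class AlgoKind { Traversal /* e.g. BFS, SSSP */, Compute /* e.g. PR, CC */ };

double readAheadRate(AlgoKind k) {
    return (k == AlgoKind::Traversal) ? 0.5 : 0.8;
}
```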
The application programming interfaces available since CUDA 8.0 include, but are not limited to, those that provide the same functionality as the explicit memory-copy and pinning APIs without reinstating the limitations of explicit GPU memory allocation: explicit prefetching (cudaMemPrefetchAsync) and memory usage hints (cudaMemAdvise).

As shown in Fig. 2, this example relates to an adaptive memory management method based on the above system, comprising the following steps:

Step 1 (B0 in the figure): obtain the initial values of the operating parameters (GPUIsFull, availGPUMemSize, τ).

Step 2 (B1, C0 in the figure): set a memory management policy for each item Data in the graph data structures (VertexProperty, VertexOffset, Frontier, Edge) in turn, judging each item as follows:

Step 2.1 (C1 in the figure): when the value of the variable GPUIsFull is false, execute step 2.1.1; otherwise, execute step 2.1.2.

Step 2.1.1 (C2 in the figure): when the size of Data is less than availGPUMemSize, execute step 2.1.1.1; otherwise, execute step 2.1.1.2.

Step 2.1.1.1 (B3–B4 in the figure): call cudaMemPrefetchAsync to prefetch Data into GPU memory; set availGPUMemSize -= size of Data; return to step 2.

Step 2.1.1.2 (B5–B7 in the figure): set the memory-full flag (GPUIsFull = true); call cudaMemAdvise to set the hint of Data to AccessedBy; call cudaMemPrefetchAsync to prefetch τ × availGPUMemSize bytes of Data into GPU memory; return to step 2.

Step 2.1.2 (B8 in the figure): call cudaMemAdvise to set the hint of Data to AccessedBy; return to step 2.
In concrete experiments, on a server equipped with an Intel Xeon E5-2620 CPU, 128 GB of memory, and an NVIDIA GTX 1080 Ti GPU, the graph algorithms were executed on different datasets based on this method and the execution time was measured, i.e., the total running time on the GPU from start to finish, excluding preprocessing and data-transfer time. During measurement, each algorithm was repeated 5 times and the average of the 5 execution times was taken.

The datasets are multiple graph datasets of different sizes, specifically social network graph datasets (LiveJournal, Orkut, Twitter, Friendster) and Internet snapshot graph datasets (UK-2005, SK-2005, UK-union), where LiveJournal contains 5M vertices and 69M edges with a volume of 1.4 GB, and UK-union contains 133M vertices and 5.5B edges with a volume of 110 GB.

The graph algorithms are four: SSSP, BFS, PR, and CC, wherein SSSP and BFS are traversal algorithms, and PR and CC are computational algorithms. For BFS and SSSP, the algorithm takes the first vertex in each graph dataset as the source vertex; for PR, the decay coefficient is set to 0.85 and the error tolerance to 0.01. The condition for the algorithm to stop running is that it converges, or the number of iterations reaches 100.

The experimental results show that the method accelerates the total execution time of graph computation by 1.1× to 6.6×. SSSP gains the most and PR the least, indicating that the method is more beneficial to memory-access-intensive programs. τ = 0.5 and τ = 0.8 are the optimal read-ahead rates for traversal and computational algorithms respectively, achieving maximum average speedups of 1.43× and 1.48× respectively compared with τ = 0.

In summary, the present invention can significantly shorten the running time of graph-computing programs.

The above specific embodiments may be locally adjusted by those skilled in the art in different ways without departing from the principle and purpose of the present invention; the protection scope of the present invention is determined by the claims and is not limited by the above specific embodiments, and each implementation within its scope is bound by the present invention.

Claims (6)

  1. An adaptive unified memory management method for large graphs, characterized in that, for the different types of graph data in a graph-computing application, in priority order, a GPU memory judgment first checks whether the current GPU memory is full and a data-exceeded judgment then checks whether the size of the current graph data exceeds the GPU's available memory capacity, after which the unified memory management policy is configured;
    the unified memory management policy configuration sets the management policy of the current graph data by calling the cudaMemPrefetchAsync and cudaMemAdvise instructions, whose data usage hints include AccessedBy and ReadMostly;
    the unified memory management policy configuration specifically comprises:
    ① for graph data of vertex properties: when GPU memory is full, set the hint of VertexProperty to AccessedBy; otherwise, i.e., when GPU memory is not full and VertexProperty does not exceed the GPU's available memory capacity, set the read-ahead amount of VertexProperty to the size of VertexProperty; when VertexProperty exceeds the GPU's available memory capacity, set the hint of VertexProperty to AccessedBy and set the read-ahead amount of VertexProperty to: read-ahead rate × available GPU memory capacity, in bytes;
    ② for graph data of vertex offsets: when GPU memory is full, set the hint of VertexOffset to AccessedBy; otherwise, i.e., when GPU memory is not full and VertexOffset does not exceed the GPU's available memory capacity, set the read-ahead amount of VertexOffset to the size of VertexOffset; when VertexOffset exceeds the GPU's available memory capacity, set the hint of VertexOffset to AccessedBy and set the read-ahead amount of VertexOffset to: read-ahead rate × available GPU memory capacity, in bytes;
    ③ for graph data of the frontier: when GPU memory is full, set the hint of Frontier to AccessedBy; otherwise, i.e., when GPU memory is not full and Frontier does not exceed the GPU's available memory capacity, set the read-ahead amount of Frontier to the size of Frontier; when Frontier exceeds the GPU's available memory capacity, set the hint of Frontier to AccessedBy and set the read-ahead amount of Frontier to: read-ahead rate × available GPU memory capacity, in bytes;
    ④ for graph data of edges: when GPU memory is full, set the hint of Edge to AccessedBy; otherwise, i.e., when GPU memory is not full and Edge does not exceed the GPU's available memory capacity, set the read-ahead amount of Edge to the size of Edge; when Edge exceeds the GPU's available memory capacity, set the hint of Edge to AccessedBy and set the read-ahead amount of Edge to: read-ahead rate × available GPU memory capacity, in bytes.
  2. The adaptive unified memory management method for large graphs according to claim 1, characterized in that the graph algorithm is a traversal algorithm or a computational algorithm, and the corresponding read-ahead rate τ is set to 0.5 for traversal graph algorithms and to 0.8 for computational graph algorithms.
  3. The adaptive unified memory management method for large graphs according to claim 1, characterized in that it specifically comprises:
    Step 1: obtain the initial values of the operating parameters;
    Step 2: set a memory management policy for each item of data in the graph data structures in turn, judging each item as follows:
    Step 2.1: when the value of the variable GPUIsFull among the operating parameters is false, execute step 2.1.1; otherwise, execute step 2.1.2;
    Step 2.1.1: when the size of the graph data is less than availGPUMemSize, execute step 2.1.1.1; otherwise, execute step 2.1.1.2;
    Step 2.1.1.1: call cudaMemPrefetchAsync to prefetch the graph data into GPU memory; set availGPUMemSize -= size of Data; return to step 2;
    Step 2.1.1.2: set the value of the variable GPUIsFull among the operating parameters to true, call cudaMemAdvise to set the hint of Data to AccessedBy, and call cudaMemPrefetchAsync to prefetch graph data of size read-ahead rate τ × availGPUMemSize into GPU memory; return to step 2;
    Step 2.1.2: call cudaMemAdvise to set the hint of Data to AccessedBy; return to step 2.
  4. The adaptive unified memory management method for large graphs according to claim 3, characterized in that the operating parameters include: the memory-full flag (GPUIsFull), the GPU's currently available memory capacity (availGPUMemSize), and the read-ahead rate τ.
  5. The adaptive unified memory management method for large graphs according to claim 4, characterized in that initialization refers to: setting GPUIsFull to false, and obtaining availGPUMemSize through cudaMemGetInfo.
  6. An adaptive unified memory management system for large graphs implementing the method of any one of the preceding claims, characterized in that it comprises: a system parameter setting module, a data reading module, and a memory management policy setting module, wherein: the system parameter setting module calls the CUDA programming interface to obtain the operating parameters of the memory management policy and initializes them; the data reading module reads the graph data file from storage and builds the corresponding graph data structure in CPU memory; and the memory management policy setting module sets the data read-ahead and hint policies for the graph data structures by calling application programming interfaces available since CUDA 8.0.
PCT/CN2021/072376 2020-11-10 2021-01-18 Adaptive unified memory management method and system for large graphs WO2022099925A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/040,905 US20230297234A1 (en) 2020-11-10 2021-01-18 Adaptive unified memory management method and system for large-scale graphs

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011244031.3A 2020-11-10 Adaptive unified memory management method and system for large graphs
CN202011244031.3 2020-11-10

Publications (1)

Publication Number Publication Date
WO2022099925A1 true WO2022099925A1 (zh) 2022-05-19

Family

ID=74362382

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/072376 2020-11-10 2021-01-18 Adaptive unified memory management method and system for large graphs

Country Status (3)

Country Link
US (1) US20230297234A1 (zh)
CN (1) CN112346869B (zh)
WO (1) WO2022099925A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999946A * 2012-09-17 2013-03-27 Tcl集团股份有限公司 3D graphics data processing method, apparatus and device
CN104835110A * 2015-04-15 2015-08-12 华中科技大学 GPU-based asynchronous graph data processing system
US20160125566A1 (en) * 2014-10-29 2016-05-05 Daegu Gyeongbuk Institute Of Science And Technology SYSTEM AND METHOD FOR PROCESSING LARGE-SCALE GRAPHS USING GPUs
CN110187968A * 2019-05-22 2019-08-30 上海交通大学 Graph data processing acceleration method in a heterogeneous computing environment

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8373710B1 (en) * 2011-12-30 2013-02-12 GIS Federal LLC Method and system for improving computational concurrency using a multi-threaded GPU calculation engine
KR20130093995A * 2012-02-15 2013-08-23 한국전자통신연구원 Method for optimizing performance of a hierarchical multi-core processor and multi-core processor system performing the same
US9430400B2 (en) * 2013-03-14 2016-08-30 Nvidia Corporation Migration directives in a unified virtual memory system architecture
CN103226540B * 2013-05-21 2015-08-19 中国人民解放军国防科学技术大学 Multi-zone structured grid CFD acceleration method on GPUs based on grouped multi-streams
US9400767B2 (en) * 2013-12-17 2016-07-26 International Business Machines Corporation Subgraph-based distributed graph processing
US9886736B2 (en) * 2014-01-20 2018-02-06 Nvidia Corporation Selectively killing trapped multi-process service clients sharing the same hardware context
CN111930498B * 2020-06-29 2022-11-29 苏州浪潮智能科技有限公司 Efficient GPU resource allocation optimization method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999946A * 2012-09-17 2013-03-27 Tcl集团股份有限公司 3D graphics data processing method, apparatus and device
US20160125566A1 (en) * 2014-10-29 2016-05-05 Daegu Gyeongbuk Institute Of Science And Technology SYSTEM AND METHOD FOR PROCESSING LARGE-SCALE GRAPHS USING GPUs
CN104835110A * 2015-04-15 2015-08-12 华中科技大学 GPU-based asynchronous graph data processing system
CN110187968A * 2019-05-22 2019-08-30 上海交通大学 Graph data processing acceleration method in a heterogeneous computing environment

Also Published As

Publication number Publication date
CN112346869B (zh) 2021-07-13
US20230297234A1 (en) 2023-09-21
CN112346869A (zh) 2021-02-09

Similar Documents

Publication Publication Date Title
CN106991011B Method based on CPU multithreading and GPU multi-granularity parallelism with collaborative optimization
KR101253012B1 Method and apparatus for sharing pointers across heterogeneous platforms
US8196147B1 (en) Multiple-processor core optimization for producer-consumer communication
US20080005473A1 (en) Compiler assisted re-configurable software implemented cache
US11720496B2 (en) Reconfigurable cache architecture and methods for cache coherency
JP2015505091A Mechanism for using a GPU controller for cache preloading
WO2009123492A1 (en) Optimizing memory copy routine selection for message passing in a multicore architecture
WO2022233195A1 Neural network weight storage method, reading method, and related device
CN113590508A Dynamically reconfigurable memory address mapping method and apparatus
US8380962B2 (en) Systems and methods for efficient sequential logging on caching-enabled storage devices
CN116227599A Inference model optimization method and apparatus, electronic device, and storage medium
JPH05274252A Transaction execution method in a computer system
CN112801856B Data processing method and apparatus
CN113448897B Optimization method for pure user-mode remote direct memory access
WO2022099925A1 Adaptive unified memory management method and system for large graphs
CN116438543A Shared memory space in data and model parallelization
US20230126783A1 (en) Leveraging an accelerator device to accelerate hash table lookups
CN116132369A Traffic distribution method for multiple network ports in a cloud gateway server and related device
TWI718634B Feature map access method for convolutional neural network and system thereof
Feng et al. Understanding Scalability of Multi-GPU Systems
CN112487352B Fast Fourier transform computation method on a reconfigurable processor and reconfigurable processor
US11915138B2 (en) Method and device for reducing a size of a neural network model
US20240054179A1 (en) Systems and methods for inference system caching
US20240070107A1 (en) Memory device with embedded deep learning accelerator in multi-client environment
CN108762666B Storage system access method, system, medium and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21890463

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21890463

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 21890463

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.01.2024)