WO2020073641A1 - Data-Structure-Oriented Graphics Processor Data Prefetching Method and Device - Google Patents

Data-Structure-Oriented Graphics Processor Data Prefetching Method and Device

Info

Publication number
WO2020073641A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
data
prefetch
request
list
Prior art date
Application number
PCT/CN2019/084774
Other languages
English (en)
French (fr)
Inventor
黄立波
郭辉
郑重
王志英
郭维
雷国庆
王俊辉
隋兵才
孙彩霞
王永文
Original Assignee
国防科技大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国防科技大学 filed Critical 国防科技大学
Priority to US16/960,894 (US11520589B2)
Publication of WO2020073641A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824Operand accessing
    • G06F9/383Operand prefetching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3037Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • G06F12/0895Caches characterised by their organisation or structure of parts of caches, e.g. directory or tag array
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
    • G06F9/345Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes of multiple operands or results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/60Memory management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/45Caching of specific data in cache memory
    • G06F2212/455Image or video data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60Details of cache memory
    • G06F2212/6022Using a prefetch buffer or dedicated prefetch cache
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60Details of cache memory
    • G06F2212/6024History based prefetching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/28Indexing scheme for image data processing or generation, in general involving image processing hardware

Definitions

  • The invention relates to the field of graphics processor data prefetching, and in particular to a data-structure-oriented graphics processor data prefetching method and device.
  • Breadth-first search is the basic graph traversal algorithm in many graph computing applications.
  • Because the storage accesses of breadth-first search are irregular, the GPU must generate more than one storage access request for each irregular access. This greatly reduces the GPU's memory access efficiency, which in turn prevents the GPU from effectively accelerating breadth-first search.
  • The GPU's accesses to the graph data structure also lack sufficient locality, so the cache miss rate for some data is as high as 80%.
  • Worse, for lack of enough arithmetic computation, the GPU cannot hide latency through massive parallelism, and the pipeline has to stall to wait for data.
  • As a result, the GPU cannot fully exploit its powerful computing capability to accelerate the breadth-first search algorithm.
  • Data prefetching is a promising technique for improving memory access and cache efficiency.
  • Typical data prefetchers on GPUs, such as stride-based prefetchers, stream prefetchers, and prefetchers based on global history access information, can effectively reduce the latency of regular storage accesses in applications.
  • For irregular storage accesses, however, these typical prediction-based data prefetchers have a significantly higher prefetch error rate than for regular storage accesses.
  • Such a high prefetch error rate directly leads to reading too much useless data, which in turn causes serious cache pollution and wastes memory bandwidth.
  • Because prefetchers based on access pattern recognition cannot accurately identify complex and changeable irregular access patterns, these data prefetchers contribute almost nothing to reducing access latency or improving GPU execution efficiency.
  • In general, there are three typical data prefetching mechanisms: the stride-based prefetcher, the data-stream-based (stream) prefetcher, and the prefetcher based on global history access information (Global History Buffer prefetcher).
  • The stride-based data prefetcher uses a table to record local memory access history. The recorded information mainly includes: the program counter value (PC, used as the table index), the address of the most recent access (used to compute the stride and the next access address), the stride between the last two access addresses (their difference), and a stride-valid bit (marking whether the currently recorded stride is valid). If the access addresses of load instructions sharing the same PC have a fixed stride, the stride-based prefetcher computes the address of the data to be prefetched from the stride value and the most recently accessed address.
  • The stream prefetcher usually tracks the direction of accesses to a given storage region. When all access addresses change monotonically in one direction, it continuously reads data, in units of cache blocks, into the prefetcher's own buffer along the identified direction.
  • The prefetched data is not stored in the on-chip cache, to prevent it from polluting useful data in the cache; only when a data access misses in the cache does the cache take in the prefetched cache block.
  • When the identified sequential access pattern changes, the prefetcher's buffer is flushed.
  • The prefetcher based on global history access information uses a Global History Buffer (GHB) to record the address information of all cache-miss accesses in the entire system. Each GHB entry stores a miss address and a pointer, and these pointers link GHB entries that share the same index in chronological order.
  • This prefetcher also uses an index table that stores the keys used to index GHB entries; a key can be an instruction's PC value or a miss address. Each key maps to a pointer to its corresponding GHB entry, through which all entries in the GHB with the same key can be found.
  • A GHB prefetcher can be combined with other prefetching mechanisms, such as stride-based and stream prefetchers, so that multiple access patterns can be recognized.
  • All three typical data prefetchers are thus designed around one or more common regular storage access patterns. For the irregular access pattern of breadth-first search, their prefetching is very inefficient, or even ineffective.
  • In view of this, the invention provides a data-structure-oriented graphics processor data prefetching method and device oriented to breadth-first search that can efficiently prefetch irregularly accessed data, has a simple hardware structure, and is transparent to the programmer.
  • A data-structure-oriented graphics processor data prefetching method is provided, whose implementation steps are detailed below.
  • Obtaining the monitored storage access requests of the processor core to the graph data structure in step 1) includes: the memory access monitor in the load/store unit monitors ordinary load instruction accesses to the work list vector, and the memory access result buffer in the load/store unit records all access request information processed by the L1 cache together with the data read.
  • When generating the prefetch request for the next item of the work list vector, the data address of that prefetch request is the data address read by the storage access request plus the size of the data read.
  • The detailed steps of generating the prefetch request for the vertex list vector in step 2) include: using the node ID obtained when the work list prefetch request was last generated, determine the addresses of the corresponding row of the vertex list vector and of the next row; if the two addresses are in the same cache block, generate one storage access request that retrieves both at once; if they are not in the same cache block, generate two storage access requests that retrieve the corresponding row and the next row separately.
  • The detailed steps of generating the prefetch request for the edge list vector in step 2) include: the generation unit produces prefetch requests for the edge list vector according to its runtime start and end indices, and the number of requests produced depends mainly on how many cache blocks are needed to store the edge data and how many are needed for address alignment.
  • The detailed steps of generating the prefetch request for the visited list vector in step 2) include: reading the returned results of the edge list prefetch requests as the data from which the visited list prefetch is computed, and generating a corresponding access request address for each value read.
  • A data-structure-oriented graphics processor data prefetching device includes a data prefetch unit distributed in each processing unit; the data prefetch unit is connected to the memory access monitor of the load/store unit, the memory access result buffer, and the L1 cache, and includes:
  • an address space classifier, used to select the corresponding data prefetch request generation method according to the type of the storage access request issued by the processor core to the graph data structure;
  • a runtime information table, used to record the runtime information of the vectors in each processing unit (Warp); the runtime information comprises the work list vector index, the vertex list vector index, and the start and end indices of the edge list vector;
  • a prefetch request generation unit, used to execute the designated prefetch request generation method: if the storage access request is an ordinary read of the work list vector, a prefetch request for the next item of the work list vector is generated; if it is a prefetch access to the work list vector, a prefetch request for the vertex list vector is generated; if it is a prefetch access to the vertex list vector, a prefetch request for the edge list vector is generated; if it is a prefetch request for the edge list vector, a prefetch request for the visited list vector is generated;
  • a prefetch request queue, used to store the generated data prefetch requests.
  • The address space classifier includes an address space range table and eight address comparators. The address space range table holds eight addresses: the start and end addresses of the address space ranges of the work list, vertex list, edge list, and visited list vectors. One input of each of the eight address comparators is the accessed address carried in the information from the load/store unit, the other input is the corresponding address in the address space range table, and the outputs of the eight comparators are connected to the prefetch request generation unit.
  • Each entry of the runtime information table includes five fields: WID, work list vector index, vertex list vector index, and the start and end indices of the edge list vector, where WID records the Warp ID of the processing unit. The runtime information table also includes a selector that updates the corresponding table entry according to the information source of the message from the load/store unit, the Warp ID of the processing unit, and the accessed data.
  • The information source distinguishes whether the access information comes from the memory access monitor or the memory access result buffer. If it comes from the memory access monitor, the data prefetching required for traversing a new node is judged to begin: the content of the entry for that WID in the runtime information table is cleared and the accessed data is recorded in the work list index. If it comes from the memory access result buffer, the accessed data is recorded in the entry for the corresponding WID in the runtime information table.
  • The prefetch request generation unit includes a prefetch generation unit selector, a work list vector prefetch request generation unit, a vertex list vector prefetch request generation unit, an edge list vector prefetch request generation unit, and a visited list vector prefetch request generation unit. According to the type of access information output by the address space classifier, the information source in the message from the load/store unit, and the runtime information output by the runtime information table, the prefetch generation unit selector selects one of the four generation units to generate the prefetch request. The work list vector prefetch request generation unit generates the prefetch request for the next item of the work list vector and writes it into the prefetch request queue; the vertex list vector prefetch request generation unit generates prefetch requests for the vertex list vector and writes them into the prefetch request queue; the edge list vector prefetch request generation unit generates prefetch requests for the edge list vector and writes them into the prefetch request queue; and the visited list vector prefetch request generation unit generates prefetch requests for the visited list vector and writes them into the prefetch request queue.
  • Compared with the prior art, the data-structure-oriented graphics processor data prefetching method of the invention has the following advantages.
  • The invention can efficiently prefetch the irregularly accessed data of breadth-first search. Its data-structure-oriented prefetching mechanism obtains the pattern in which breadth-first search accesses the graph data structure in an explicit way, and uses the information of the node currently being searched to read the data required for searching the next node into the on-chip cache ahead of time.
  • The invention has a simple hardware implementation: because the data prefetch unit needs no complex calculations, its computation logic is very simple. The main overhead of data prefetching comes from storing the graph data structure information and the runtime information, and this storage can be provided by on-chip shared memory.
  • The invention is transparent to the programmer. Using its data-structure-oriented prefetching mechanism does not require extensive changes to the original program; it only requires replacing the original storage allocation code with allocation code tagged with the data structure.
  • The data-structure-oriented graphics processor data prefetching device of the invention is the hardware counterpart of the method of the invention and therefore shares the advantages stated above, which are not repeated here.
  • FIG. 1 is a schematic diagram of the working principle of an existing stride-based data prefetcher.
  • FIG. 2 is a schematic diagram of the working principle of an existing stream prefetcher.
  • FIG. 3 is a schematic diagram of the working principle of an existing data prefetcher based on global history access information.
  • FIG. 4 is a schematic flowchart of a method according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of the distributed arrangement of the data prefetch units in an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of the basic structure and interfaces of the data prefetch unit in an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of the data prefetch unit in an embodiment of the present invention.
  • The implementation steps of the data-structure-oriented graphics processor data prefetching method of this embodiment are as follows.
  • The method is carried out by the data prefetch unit in the processor: the storage access requests of the processor core to the graph data structure are monitored, and the monitored request information and the data read are sent to the data prefetch unit. After receiving the storage access request information, the data prefetch unit selects the corresponding prefetch request generation unit, and thereby the corresponding generation method, according to whether the storage access request is a data prefetch request and which data vector of the graph data structure it accesses.
  • The data-flow-driven breadth-first search algorithm uses a graph data structure in Compressed Sparse Row format (a compressed format for storing large sparse graphs), which contains four data vectors: the work list vector, the vertex list vector, the edge list vector, and the visited list vector.
  • If the storage access request is an ordinary read of the work list vector, the data prefetch unit generates a prefetch request for the next item of the work list vector; if it is a prefetch access to the work list vector, the unit generates a prefetch request for the vertex list vector; if it is a prefetch access to the vertex list vector, the unit generates a prefetch request for the edge list vector; and if it is a prefetch request for the edge list vector, the unit generates a prefetch request for the visited list vector.
  • Obtaining the monitored storage access requests in step 1) includes: the memory access monitor in the load/store unit monitors ordinary load instruction accesses to the work list vector, and the memory access result buffer in the load/store unit records all access request information processed by the L1 cache together with the data read.
  • When generating the prefetch request for the next item of the work list vector in step 2), the data address of that prefetch request is the data address read by the storage access request plus the size of the data read.
  • The detailed steps of generating the prefetch request for the vertex list vector in step 2) include: using the node ID obtained when the work list prefetch request was last generated, determine the addresses of the corresponding row and the next row of the vertex list vector; if the two addresses are in the same cache block, generate one storage access request that retrieves both at once; otherwise generate two storage access requests that retrieve them separately.
  • The detailed steps of generating the prefetch request for the edge list vector in step 2) include: the index generation unit produces prefetch requests for the edge list vector according to its runtime start and end indices; the number of requests depends mainly on how many cache blocks are needed to store the edge data and on address alignment.
  • The detailed steps of generating the prefetch request for the visited list vector in step 2) include: reading the returned results of the edge list prefetch requests as the data from which the visited list prefetch is computed, and generating a corresponding access request address for each value read.
  • The data-structure-oriented graphics processor data prefetching device of this embodiment includes a data prefetch unit distributed in each processing unit; the data prefetch unit is connected to the memory access monitor of the load/store unit, the memory access result buffer, and the L1 cache.
  • In GPU architectures, a GPU contains multiple streaming multiprocessors (SMs), and each SM contains many simple single-threaded processor cores.
  • In the GPU hardware model, every 32 hardware threads form a scheduling unit called a Warp; all threads in the same Warp execute the same instruction at the same time.
  • This embodiment therefore uses distributed data prefetch units to process and generate the data prefetch requests within each streaming multiprocessor.
  • The data prefetch unit includes:
  • the address space classifier 1, used to select the corresponding prefetch request generation method according to the type of the storage access request issued by the processor core to the graph data structure;
  • the runtime information table 2, used to record the runtime information of the vectors in each processing unit (Warp); the runtime information comprises the work list vector index, the vertex list vector index, and the start and end indices of the edge list vector;
  • the prefetch request generation unit 3, used to execute the designated prefetch request generation method: if the storage access request is an ordinary read of the work list vector, a prefetch request for the next item of the work list vector is generated; if it is a prefetch access to the work list vector, a prefetch request for the vertex list vector is generated; if it is a prefetch access to the vertex list vector, a prefetch request for the edge list vector is generated; if it is a prefetch request for the edge list vector, a prefetch request for the visited list vector is generated;
  • the prefetch request queue 4, used to store the generated data prefetch requests.
  • The access information about the graph data structure obtained by the data prefetch unit comes mainly from two components of the load/store unit: the memory access monitor and the memory access result buffer (response FIFO).
  • The memory access monitor monitors ordinary load instruction accesses to the work list vector. When such load instructions are observed, the data prefetch unit knows that a new search iteration has started and begins preparing to prefetch the data required for the next iteration.
  • The memory access result buffer records all access request information processed by the L1 cache together with the data read.
  • Because the data prefetch unit's own requests are also processed by the L1 cache, the memory access result buffer can observe the handling of prefetch requests and send the requested data and access information to the data prefetch unit.
  • Using this information from the load/store unit and the pattern in which breadth-first search accesses the graph data structure, the data prefetch unit can generate the corresponding prefetch requests. After receiving information from the load/store unit, the data prefetch unit updates the corresponding entry in the runtime information table 2 according to the Warp ID in the message, and the prefetch request generation unit 3 selects a prefetch request generator according to the source of the information and which vector of the graph data structure the access request touches.
  • Finally, the data prefetch unit puts the newly generated prefetch requests into the prefetch request queue 4.
  • The prefetch request generation unit 3 in the data prefetch unit is responsible for controlling how many prefetch requests are generated.
  • The L1 cache handles ordinary memory access requests and also processes data prefetch requests, treating them as ordinary memory access requests.
  • The address space classifier 1 includes an address space range table and eight address comparators.
  • The address space range table holds eight addresses: the start and end addresses of the address space ranges of the work list, vertex list, edge list, and visited list vectors. One input of each of the eight address comparators is the accessed address carried in the information from the load/store unit, the other input is the corresponding address in the address space range table, and the outputs of the eight comparators are connected to the prefetch request generation unit 3.
  • By comparing the address of a memory access request with the address space ranges of all data vectors, the address comparators determine to which data vector's address space the request belongs.
  • The runtime information table 2 updates the corresponding entry for the Warp ID according to the information received.
  • Each entry of the runtime information table 2 includes five fields: WID, work list vector index, vertex list vector index, and the start and end indices of the edge list vector, where WID records the Warp ID of the processing unit. The runtime information table 2 also includes a selector that updates the corresponding table entry according to the information source of the message from the load/store unit, the Warp ID of the processing unit, and the accessed data.
  • The information source distinguishes whether the access information comes from the memory access monitor or the memory access result buffer.
  • If it comes from the memory access monitor, the data prefetching required for traversing a new node is judged to begin: the content of the entry for that WID in the runtime information table 2 is cleared and the accessed data is recorded in the work list index. If it comes from the memory access result buffer, the accessed data is recorded in the entry for the corresponding WID in the runtime information table 2.
  • The five fields of each runtime information table 2 entry (WID, work list vector index, vertex list vector index, and the start and end indices of the edge list vector) are as follows.
  • WID: indicates the Warp to which the recorded information belongs. All received access information carries a Warp ID, which is compared with the WIDs in the runtime information table to determine which entry to update. As shown in FIG. 7, 0, 1, and 2 represent Warp0, Warp1, and Warp2, respectively.
  • Work list vector index: indicates which node of the work list vector the prefetch unit is currently prefetching data for. This field is updated from the ordinary access information for the work list vector observed by the memory access monitor. By determining the position of the work list item the Warp is currently visiting, the position of the item the Warp will visit in the next cycle is obtained, namely the item after the one currently visited. For example, in FIG. 7 the entry with WID 0 has a work list index of 2, indicating that Warp0 is traversing the first item of the work list vector while the data prefetch unit is prefetching the data required for the second item.
  • Vertex list vector index: indicates the node ID this Warp will traverse in the next cycle. This field is updated from the prefetch access information for the work list vector observed by the memory access result buffer. From the accessed address of that prefetch access and the work list index recorded in the runtime information table, the data address of the prefetch request can be determined, the corresponding data read out, and the vertex index updated.
  • Start and end indices of the edge list vector: mark the range within the edge list vector of all edges of the node this Warp will traverse in the next cycle. This field is updated from the prefetch access information for the vertex list vector observed by the memory access result buffer. From the accessed address of that prefetch access and the vertex list index recorded in the runtime information table, the data address of the prefetch request can be determined and the corresponding data read out. Since a node's start and end edge indices are two adjacent items of the vertex list vector, they can be obtained with a single prefetch access if both values are stored in the same cache block; otherwise they must be read by two separate prefetches.
  • For example, as shown in FIG. 7 for Warp1, vertex index 1279 corresponds to address 0x900232FC, while the next vertex index 1280 corresponds to address 0x90023200. These two addresses lie in two different cache blocks, so two prefetches are needed to obtain the start and end indices of the edge list vector. The current state indicates that the unit is waiting for the prefetch request to address 0x90023200 in order to update the end index of the edge list.
  • The selector of the runtime information table 2 updates the table through three inputs. (1) Since every piece of access information carries a Warp ID, the selector determines which entry to update by matching it against the WIDs in the runtime information table. (2) The information source indicates whether the access information comes from the memory access monitor (denoted 0) or the memory access result buffer (denoted 1). (3) The accessed data is then recorded: if the information comes from the memory access monitor, the data prefetching required for traversing a new node begins, so the entry with the same WID is cleared and the accessed data is recorded in the work list index; if it comes from the memory access result buffer, the accessed data is recorded in the corresponding field of the table.
  • Unlike the work list index and vertex index, the start and end indices of the edge list may not be obtained at the same time; the runtime information output is therefore produced only after both have been acquired, and they are then sent to the edge list vector prefetch request generation unit 34 to generate prefetch requests for the edge list vector. The work list index and vertex index, by contrast, can be forwarded to the prefetch request generation unit while the runtime information table is being updated.
  • The prefetch request generation unit 3 contains a prefetch generation unit selector 31 and four prefetch request generators, which are responsible for generating access requests for the four data vectors, respectively.
  • The prefetch request generation unit 3 includes the prefetch generation unit selector 31, the work list vector prefetch request generation unit 32, the vertex list vector prefetch request generation unit 33, the edge list vector prefetch request generation unit 34, and the visited list vector prefetch request generation unit 35. According to the type of access information output by the address space classifier 1, the information source in the message from the load/store unit, and the runtime information output by the runtime information table 2, the prefetch generation unit selector 31 selects one of the four generation units to generate the prefetch request. The work list vector prefetch request generation unit 32 generates the prefetch request for the next item of the work list vector and writes it into the prefetch request queue; the vertex list vector prefetch request generation unit 33 generates prefetch requests for the vertex list vector and writes them into the prefetch request queue; the edge list vector prefetch request generation unit 34 generates prefetch requests for the edge list vector and writes them into the prefetch request queue; and the visited list vector prefetch request generation unit 35 generates prefetch requests for the visited list vector and writes them into the prefetch request queue.
  • Because each data vector of the graph data structure is accessed with a different pattern, the data prefetch unit provides four prefetch request generation units (the work list vector prefetch request generation unit 32, the vertex list vector prefetch request generation unit 33, the edge list vector prefetch request generation unit 34, and the visited list vector prefetch request generation unit 35) to generate prefetch requests for the four data vectors, respectively.
  • Overall, the four generation units fall into two categories: the work list vector prefetch request generation unit 32, and the units that generate prefetch requests for the other three data vectors. This is because the information the generators need comes from different sources: the memory access monitor and the memory access result buffer, respectively.
  • The data prefetch unit uses the ordinary access information for the work list vector observed by the memory access monitor to generate prefetch requests for the work list vector, whereas prefetch requests for the other data vectors require the prefetch request information observed by the memory access result buffer.
  • In addition, the data prefetch unit uses the access address of the observed prefetch request to decide whether to generate a prefetch request for the vertex list vector, the edge list vector, or the visited list vector. Given the data structure access pattern of breadth-first search, the access order of the data vectors is predictable.
  • Thus, when the data prefetch unit receives information about a prefetch request for the work list vector, it generates a prefetch request for the vertex list vector; when it receives one for the vertex list vector, it generates a prefetch request for the edge list vector; and likewise, when it receives a prefetch access request for the edge list vector, it generates access requests for the visited list vector. The source of the received information and the access address of the observed request therefore jointly determine which prefetch request generation unit is used.
  • When the data prefetch unit receives information about an ordinary load instruction accessing the work list vector, the work list vector prefetch request generation unit 32 generates the prefetch request for the work list vector. The address of the prefetch request is the address of the item after the data requested by that ordinary load; that is, the prefetch data address is the address read by the ordinary load plus the data size. For example, if the ordinary load reads the data at address 0x88021d00, the work list prefetch request generation unit generates a prefetch request for the data at address 0x88021d04.
  • By the Compressed Sparse Row definition of the graph data structure, the vertex list vector records, for each row of the graph's adjacency matrix, the starting position of that row's non-zero elements within the edge list vector (each row of the adjacency matrix corresponds to a node of the graph, the non-zero elements are the edges attached to that node, and these edges are stored contiguously in the edge list vector). To determine the number of edges of a node, the values of the corresponding node and of the next node in the vertex list vector must therefore both be read.
  • Since the earlier prefetch request for the work list vector yields the vertex index, the vertex list vector prefetch request generation unit 33 can use that index to obtain the addresses of the node and of the next node within the vertex list vector. Usually, when the two values lie in the same cache block, one storage access request retrieves both at once; if they do not, the vertex list vector prefetch request generation unit 33 generates two access requests.
  • According to the corresponding start and end indices of the edge list vector in the runtime information table, the edge list vector prefetch request generation unit generates prefetch requests for the edge list vector.
  • For example, the entry with WID 2 in the runtime information table has edge list start and end indices of 24003 and 25210, which means that all edges attached to node 2020 occupy items 24003 through 25209 of the edge list vector (item 25210 excluded). Because each node's edges are stored contiguously in the edge list vector, the number of requests generated depends mainly on how many cache blocks are needed to hold the edge data, and unaligned addresses must also be taken into account.
  • For breadth-first search, the edge list vector stores the end-node ID of each edge, so the visited list vector prefetch request generation unit 35 reads the end-node IDs returned by the edge list prefetch requests; using each end-node ID as an index into the visited list vector, the address of the corresponding visited list value can be computed.
  • Because these end-node IDs are discontinuous and scattered, the visited list vector prefetch request generation unit 35 must generate a separate visited list access request address for every value in the prefetched cache block.
  • A GPU cache block is typically 128 B, so with a data size of 4 B one cache block holds 32 end-node IDs, and the visited list vector prefetch request generation unit must generate 32 access requests for them.
  • In summary, the data-structure-oriented data prefetching method and device of this embodiment use the data structure access pattern defined by breadth-first search together with graph data structure information to generate the corresponding prefetch requests. Compared with existing GPU data prefetching mechanisms, they prefetch the data required for breadth-first-search graph traversal more accurately and efficiently, thereby improving GPU performance on graph computing problems.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a data-structure-oriented graphics processor data prefetching method and device. The method comprises obtaining the information and read data of the monitored storage access requests issued by a processor core to a graph data structure, and using the data structure access pattern defined by breadth-first search together with graph data structure information to generate the corresponding prefetch requests for the four vectors and store them in a prefetch request queue. The device comprises a data prefetch unit distributed in each processing unit; the data prefetch unit is connected to the memory access monitor of the load/store unit, the memory access result buffer, and the L1 cache, and comprises an address space classifier (1), a runtime information table (2), a prefetch request generation unit (3), and a prefetch request queue (4). The invention prefetches the data required for breadth-first-search graph traversal more accurately and efficiently, thereby improving the performance of GPUs on graph computing problems.

Description

Data-Structure-Oriented Graphics Processor Data Prefetching Method and Device [Technical Field]
The invention relates to the field of graphics processor data prefetching, and in particular to a data-structure-oriented graphics processor data prefetching method and device.
[Background Art]
As the problem sizes of graph computing applications keep growing, using graphics processors (GPUs) to accelerate graph applications in parallel has become key to handling large-scale graph computing problems. However, since most graph applications are memory-intensive, their largest time cost comes from the storage accesses generated by graph traversal. Breadth-first search is the basic graph traversal algorithm in many graph applications. Because its storage accesses are irregular, the GPU must issue more than one storage access request per irregular access. This severely hurts the GPU's memory access efficiency and in turn prevents it from effectively accelerating breadth-first search. In addition, GPU accesses to the graph data structure lack sufficient locality, so the cache miss rate for some data reaches 80%. Worse, for lack of enough arithmetic computation, the GPU cannot hide latency through massive parallelism, and the pipeline must stall waiting for data. Ultimately, the GPU cannot bring its powerful compute capability to bear on the breadth-first search algorithm.
Data prefetching is a promising technique for improving memory access and cache efficiency. Typical GPU data prefetchers, such as stride-based prefetchers, stream prefetchers, and prefetchers based on global history access information, can effectively reduce the latency of regular storage accesses in applications. For irregular storage accesses, however, typical prediction-based prefetchers have a markedly higher prefetch error rate than for regular accesses. Such a high error rate directly leads to reading too much useless data, which in turn causes serious cache pollution and wastes memory bandwidth. Moreover, because prefetchers based on access pattern recognition cannot accurately identify complex and changeable irregular access patterns, these prefetchers contribute almost nothing to reducing access latency or improving GPU execution efficiency.
In general, there are three typical data prefetching mechanisms: the stride-based prefetcher, the data-stream-based prefetcher (stream prefetcher), and the prefetcher based on global history access information (Global History Buffer prefetcher).
As shown in FIG. 1, a stride-based data prefetcher uses a table to record local memory access history. This information mainly includes: the program counter value (PC, used as the table index), the address of the most recent access (used to compute the stride and the next access address), the stride between the last two access addresses (their difference), and a stride-valid bit (marking whether the currently recorded stride is valid). If the access addresses of load instructions sharing the same PC have a fixed stride, the stride-based prefetcher computes the address of the data to be prefetched from the stride value and the most recently accessed address.
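As a minimal illustration of the table just described, the C++ sketch below models a PC-indexed stride prefetcher. The class and field names are illustrative assumptions, not taken from the patent; real hardware uses a fixed-size table rather than a hash map.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Sketch of a PC-indexed stride prefetch table (simplified software model).
struct StrideEntry {
    uint64_t lastAddr = 0;     // address of the most recent access for this PC
    int64_t  stride   = 0;     // difference between the last two access addresses
    bool     valid    = false; // stride-valid bit
};

class StridePrefetcher {
    std::unordered_map<uint64_t, StrideEntry> table; // indexed by PC
public:
    // Observe one access; return a prefetch address once the stride repeats.
    std::optional<uint64_t> observe(uint64_t pc, uint64_t addr) {
        StrideEntry& e = table[pc];
        int64_t newStride = static_cast<int64_t>(addr - e.lastAddr);
        std::optional<uint64_t> prefetch;
        if (e.valid && newStride == e.stride)
            prefetch = addr + e.stride;   // next address = last address + stride
        e.valid    = (e.lastAddr != 0);   // stride is meaningful from the 2nd access on
        e.stride   = newStride;
        e.lastAddr = addr;
        return prefetch;
    }
};
```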
As shown in FIG. 2, a stream prefetcher typically tracks the direction of accesses to a given storage region. When all access addresses change monotonically in the same direction, the stream prefetcher continuously reads data, in units of cache blocks, into the prefetcher's own buffer along the identified direction. The prefetched data is not placed in the on-chip cache, to prevent it from polluting useful cached data; only when a data access misses in the cache does the cache take in the prefetched block. When the recognized sequential access pattern changes, the prefetcher's buffer is flushed.
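The direction tracking and the separate prefetch buffer can be sketched as follows; the 128-byte block size, the unbounded buffer, and the zero-address sentinel are simplifications assumed for this model:

```cpp
#include <cstdint>
#include <deque>

// Simplified stream-prefetcher model: detect a monotonic access direction
// and read ahead block by block into the prefetcher's own buffer.
class StreamPrefetcher {
    static constexpr uint64_t kBlock = 128; // assumed cache block size
    std::deque<uint64_t> buffer;            // addresses of prefetched blocks
    uint64_t lastBlock = 0;                 // 0 used as "no access yet" sentinel
    int direction = 0;                      // +1 ascending, -1 descending, 0 unknown
public:
    void observe(uint64_t addr) {
        uint64_t blk = addr / kBlock;
        if (lastBlock != 0) {
            int dir = blk > lastBlock ? 1 : (blk < lastBlock ? -1 : direction);
            if (direction != 0 && dir != direction)
                buffer.clear();             // identified pattern changed: flush
            direction = dir;
            if (direction != 0)             // read ahead in the identified direction
                buffer.push_back((blk + direction) * kBlock);
        }
        lastBlock = blk;
    }
    // On a cache miss, a matching buffered block would be moved into the cache.
    bool hasBlock(uint64_t addr) const {
        uint64_t blkAddr = addr / kBlock * kBlock;
        for (uint64_t b : buffer) if (b == blkAddr) return true;
        return false;
    }
};
```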
As shown in FIG. 3, a prefetcher based on global history access information uses a Global History Buffer (GHB) to record the address information of all cache-miss accesses in the entire system. Each GHB entry stores a miss address and a pointer; these pointers link GHB entries that share the same index in chronological order. The prefetcher also uses an index table that stores the keys used to index GHB entries; a key can be an instruction's PC value or a miss address. Each key maps to a pointer to its corresponding GHB entry, through which all entries in the GHB with the same key can be found. A GHB prefetcher can be combined with other prefetching mechanisms, such as stride-based and stream prefetchers, so that multiple access patterns can be recognized.
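The linked structure of the GHB and its index table can be modeled as below; the choice of the PC as the index key and the unbounded history are assumptions of this sketch (hardware uses a fixed-size circular buffer):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch of a Global History Buffer: every cache-miss address is appended,
// and entries sharing the same key are chained in chronological order.
struct GHBEntry {
    uint64_t missAddr;
    int      prevSameKey; // index of the previous entry with the same key, -1 if none
};

class GHB {
    std::vector<GHBEntry> entries;                // global miss history
    std::unordered_map<uint64_t, int> indexTable; // key (e.g. PC) -> newest entry
public:
    void recordMiss(uint64_t key, uint64_t missAddr) {
        auto it  = indexTable.find(key);
        int prev = (it == indexTable.end()) ? -1 : it->second;
        entries.push_back({missAddr, prev});
        indexTable[key] = static_cast<int>(entries.size()) - 1;
    }
    // Follow the pointer chain to collect the full miss history for one key.
    std::vector<uint64_t> historyFor(uint64_t key) const {
        std::vector<uint64_t> hist;
        auto it = indexTable.find(key);
        for (int i = (it == indexTable.end()) ? -1 : it->second; i >= 0;
             i = entries[i].prevSameKey)
            hist.push_back(entries[i].missAddr); // newest first
        return hist;
    }
};
```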
It follows that all three typical data prefetchers are designed around one or more common regular storage access patterns. For the irregular access pattern of breadth-first search, their prefetch efficiency is very low, or even nil.
[Summary of the Invention]
The technical problem to be solved by the invention: in view of the above problems in the prior art, provide a data-structure-oriented graphics processor data prefetching method and device oriented to breadth-first search that can efficiently prefetch irregularly accessed data, has a simple hardware structure, and is transparent to the programmer.
To solve the above technical problem, the invention adopts the following technical solution:
A data-structure-oriented graphics processor data prefetching method, whose implementation steps include:
1) obtaining the information and read data of the monitored storage access requests issued by the processor core to the graph data structure;
2) selecting the corresponding data prefetch request generation method according to the type of the storage access request: if the storage access request is an ordinary read of the work list vector, generating a prefetch request for the next item of the work list vector; if it is a prefetch access to the work list vector, generating a prefetch request for the vertex list vector; if it is a prefetch access to the vertex list vector, generating a prefetch request for the edge list vector; if it is a prefetch request for the edge list vector, generating a prefetch request for the visited list vector;
3) storing the generated prefetch requests in the prefetch request queue.
Preferably, obtaining the monitored storage access requests of the processor core to the graph data structure in step 1) includes: the memory access monitor in the load/store unit monitors ordinary load instruction accesses to the work list vector, and the memory access result buffer in the load/store unit records all access request information processed by the L1 cache together with the data read.
Preferably, when the prefetch request for the next item of the work list vector is generated in step 2), the data address of that prefetch request is the data address read by the storage access request plus the size of the data read.
Preferably, the detailed steps of generating the prefetch request for the vertex list vector in step 2) include: using the node ID obtained when the work list prefetch request was last generated, determining the addresses of the corresponding row of the vertex list vector and of the next row; if the two addresses are in the same cache block, generating one storage access request that retrieves both at once; if they are not in the same cache block, generating two storage access requests that retrieve the corresponding row and the next row separately.
Preferably, the detailed steps of generating the prefetch request for the edge list vector in step 2) include: the generation unit produces prefetch requests for the edge list vector according to its runtime start and end indices, and the number of requests produced depends mainly on how many cache blocks are needed to store the edge data and how many are needed for address alignment.
Preferably, the detailed steps of generating the prefetch request for the visited list vector in step 2) include: reading the returned results of the edge list prefetch requests as the data from which the visited list prefetch is computed, and generating a corresponding access request address for each value read.
A data-structure-oriented graphics processor data prefetching device includes a data prefetch unit distributed in each processing unit; the data prefetch unit is connected to the memory access monitor of the load/store unit, the memory access result buffer, and the L1 cache, and includes:
an address space classifier, used to select the corresponding data prefetch request generation method according to the type of the storage access request issued by the processor core to the graph data structure;
a runtime information table, used to record the runtime information of the vectors in each processing unit (Warp); the runtime information comprises the work list vector index, the vertex list vector index, and the start and end indices of the edge list vector;
a prefetch request generation unit, used to execute the designated data prefetch request generation method: if the storage access request is an ordinary read of the work list vector, a prefetch request for the next item of the work list vector is generated; if it is a prefetch access to the work list vector, a prefetch request for the vertex list vector is generated; if it is a prefetch access to the vertex list vector, a prefetch request for the edge list vector is generated; if it is a prefetch request for the edge list vector, a prefetch request for the visited list vector is generated;
a prefetch request queue, used to store the generated data prefetch requests.
Preferably, the address space classifier includes an address space range table and eight address comparators. The address space range table holds eight addresses: the start and end addresses of the address space ranges of the work list, vertex list, edge list, and visited list vectors. One input of each of the eight address comparators is the accessed address carried in the information from the load/store unit, the other input is the corresponding address in the address space range table, and the outputs of the eight comparators are connected to the prefetch request generation unit.
Preferably, each entry of the runtime information table includes five fields: WID, work list vector index, vertex list vector index, and the start and end indices of the edge list vector, where WID records the Warp ID of the processing unit. The runtime information table also includes a selector that updates the corresponding table entry according to the information source of the message from the load/store unit, the Warp ID of the processing unit, and the accessed data. The information source distinguishes whether the access information comes from the memory access monitor or the memory access result buffer: if it comes from the memory access monitor, the data prefetching required for traversing a new node is judged to begin, the content of the entry for that WID in the runtime information table is cleared, and the accessed data is recorded in the work list index; if it comes from the memory access result buffer, the accessed data is recorded in the entry for the corresponding WID in the runtime information table.
Preferably, the prefetch request generation unit includes a prefetch generation unit selector, a work list vector prefetch request generation unit, a vertex list vector prefetch request generation unit, an edge list vector prefetch request generation unit, and a visited list vector prefetch request generation unit. According to the type of access information output by the address space classifier, the information source in the message from the load/store unit, and the runtime information output by the runtime information table, the prefetch generation unit selector selects one of the four generation units to generate the prefetch request. The work list vector prefetch request generation unit generates the prefetch request for the next item of the work list vector and writes it into the prefetch request queue; the vertex list vector prefetch request generation unit generates prefetch requests for the vertex list vector and writes them into the prefetch request queue; the edge list vector prefetch request generation unit generates prefetch requests for the edge list vector and writes them into the prefetch request queue; and the visited list vector prefetch request generation unit generates prefetch requests for the visited list vector and writes them into the prefetch request queue.
Compared with the prior art, the data-structure-oriented graphics processor data prefetching method of the invention has the following advantages:
1. The invention can efficiently prefetch the irregularly accessed data of breadth-first search. The data-structure-oriented prefetching mechanism of the invention obtains the pattern in which breadth-first search accesses the graph data structure in an explicit way, and uses the information of the node currently being searched to read the data required for searching the next node into the on-chip cache ahead of time.
2. The invention has a simple hardware implementation. Because the data prefetch unit needs no complex calculations, its computation logic is very simple; the main overhead of data prefetching comes from storing the graph data structure information and the runtime information, and this storage can be provided by on-chip shared memory.
3. The invention is transparent to the programmer. Using the data-structure-oriented prefetching mechanism of the invention does not require extensive changes to the original program; it only requires replacing the original storage allocation code with allocation code tagged with the data structure.
The data-structure-oriented graphics processor data prefetching device of the invention is the hardware of the corresponding method of the invention and likewise has the advantages stated above, which are not repeated here.
[Brief Description of the Drawings]
FIG. 1 is a schematic diagram of the working principle of an existing stride-based data prefetcher.
FIG. 2 is a schematic diagram of the working principle of an existing stream prefetcher.
FIG. 3 is a schematic diagram of the working principle of an existing data prefetcher based on global history access information.
FIG. 4 is a schematic flowchart of the method of an embodiment of the invention.
FIG. 5 is a schematic diagram of the distributed arrangement of the data prefetch units in an embodiment of the invention.
FIG. 6 is a schematic diagram of the basic structure and interfaces of the data prefetch unit in an embodiment of the invention.
FIG. 7 is a schematic structural diagram of the data prefetch unit in an embodiment of the invention.
[Detailed Description of the Embodiments]
As shown in FIG. 4, the implementation steps of the data-structure-oriented graphics processor data prefetching method of this embodiment include:
1) obtaining the information and read data of the monitored storage access requests issued by the processor core to the graph data structure;
2) selecting the corresponding data prefetch request generation method according to the type of the storage access request: if the storage access request is an ordinary read of the work list vector, generating a prefetch request for the next item of the work list vector; if it is a prefetch access to the work list vector, generating a prefetch request for the vertex list vector; if it is a prefetch access to the vertex list vector, generating a prefetch request for the edge list vector; if it is a prefetch request for the edge list vector, generating a prefetch request for the visited list vector;
3) storing the generated prefetch requests in the prefetch request queue. When the L1 cache is idle, prefetch requests are taken from the prefetch request queue and processed, and the prefetched data is stored in the L1 cache.
The method of this embodiment is carried out by the data prefetch unit in the processor: the storage access requests of the processor core to the graph data structure are monitored, and the monitored request information and the data read are sent to the data prefetch unit. After receiving the storage access request information, the data prefetch unit selects the corresponding prefetch request generation unit, and thereby the corresponding generation method, according to whether the storage access request is a data prefetch request and which data vector of the graph data structure it accesses. In this embodiment, the data-flow-driven breadth-first search algorithm uses a graph data structure in Compressed Sparse Row format (a compressed format for storing large sparse graphs) containing four data vectors: the work list vector, the vertex list vector, the edge list vector, and the visited list vector. Accordingly, if the storage access request is an ordinary read of the work list vector, the data prefetch unit generates a prefetch request for the next item of the work list vector; if it is a prefetch access to the work list vector, the unit generates a prefetch request for the vertex list vector; if it is a prefetch access to the vertex list vector, the unit generates a prefetch request for the edge list vector; and if it is a prefetch request for the edge list vector, the unit generates a prefetch request for the visited list vector.
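To make the four-vector layout and the access chain concrete, the following C++ sketch shows a CSR-style graph and one breadth-first-search expansion step; the vector names and element types are illustrative assumptions consistent with this description, not definitions from the patent:

```cpp
#include <cstdint>
#include <vector>

// CSR-style graph data as used by the data-flow-driven BFS described here.
// worklist: node IDs queued for the current iteration
// vertex  : row pointers; edges of node v occupy edge[vertex[v] .. vertex[v+1])
// edge    : end-node IDs, stored contiguously per node
// visited : one flag per node
struct CsrBfsData {
    std::vector<uint32_t> worklist;
    std::vector<uint32_t> vertex;
    std::vector<uint32_t> edge;
    std::vector<uint8_t>  visited;
};

// One BFS expansion step over work-list item i, showing the access chain
// the prefetcher exploits: work list -> vertex list -> edge list -> visited list.
inline void expand(CsrBfsData& g, std::vector<uint32_t>& nextWorklist, size_t i) {
    uint32_t v     = g.worklist[i];   // 1. ordinary read of the work list
    uint32_t start = g.vertex[v];     // 2. vertex list: start of v's edges ...
    uint32_t end   = g.vertex[v + 1]; //    ... and end (the adjacent row pointer)
    for (uint32_t e = start; e < end; ++e) { // 3. edge list: contiguous edges
        uint32_t dst = g.edge[e];
        if (!g.visited[dst]) {        // 4. visited list: scattered accesses
            g.visited[dst] = 1;
            nextWorklist.push_back(dst);
        }
    }
}
```

Each step in this chain corresponds to one of the four prefetch request types above: an access to one vector reveals the address needed from the next.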
In this embodiment, obtaining the monitored storage access requests in step 1) includes: the memory access monitor in the load/store unit monitors ordinary load instruction accesses to the work list vector, and the memory access result buffer in the load/store unit records all access request information processed by the L1 cache together with the data read.
In this embodiment, when the prefetch request for the next item of the work list vector is generated in step 2), the data address of that prefetch request is the data address read by the storage access request plus the size of the data read.
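The address arithmetic is a single addition; a sketch assuming 4-byte work-list elements (the element size matches the 0x88021d00 to 0x88021d04 example given later in this description):

```cpp
#include <cstdint>

// Prefetch address for the next work-list item: the address just read
// plus the size of the element read (4-byte elements assumed here).
constexpr uint64_t worklistPrefetchAddr(uint64_t readAddr, uint64_t elemSize = 4) {
    return readAddr + elemSize;
}
static_assert(worklistPrefetchAddr(0x88021d00) == 0x88021d04,
              "matches the worked example in this description");
```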
In this embodiment, the detailed steps of generating the prefetch request for the vertex list vector in step 2) include: using the node ID obtained when the work list prefetch request was last generated, determining the addresses of the corresponding row of the vertex list vector and of the next row; if the two addresses are in the same cache block, generating one storage access request that retrieves both at once; if they are not in the same cache block, generating two storage access requests that retrieve them separately.
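The one-or-two-request decision reduces to a cache-block comparison; a sketch assuming a 128-byte block and 4-byte vertex-list entries (both assumptions, consistent with the examples elsewhere in this text):

```cpp
#include <cstdint>

constexpr uint64_t kBlock = 128; // assumed GPU cache block size

constexpr bool sameCacheBlock(uint64_t a, uint64_t b) {
    return (a / kBlock) == (b / kBlock);
}

// The start/end row pointers of node v are adjacent vertex-list entries;
// one prefetch suffices exactly when both fall in the same cache block.
constexpr int vertexPrefetchCount(uint64_t vertexBase, uint32_t v) {
    uint64_t a = vertexBase + 4ull * v; // address of vertex[v]
    uint64_t b = a + 4;                 // address of vertex[v + 1]
    return sameCacheBlock(a, b) ? 1 : 2;
}
```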
In this embodiment, the detailed steps of generating the prefetch request for the edge list vector in step 2) include: the generation unit produces prefetch requests for the edge list vector according to its runtime start and end indices, and the number of requests produced depends mainly on how many cache blocks are needed to store the edge data and how many are needed for address alignment.
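The request count is the number of cache blocks spanned by the edge range, which misalignment can increase by one; a sketch assuming 4-byte edge entries and a 128-byte block:

```cpp
#include <cstdint>

constexpr uint64_t kBlockSize = 128; // assumed cache block size

// Number of cache-block prefetch requests covering edge items [start, end):
// the first and last blocks may be only partially filled by the range.
constexpr uint64_t edgePrefetchCount(uint64_t edgeBase, uint64_t start, uint64_t end) {
    if (end <= start) return 0;
    uint64_t firstBlock = (edgeBase + 4 * start) / kBlockSize;
    uint64_t lastBlock  = (edgeBase + 4 * end - 1) / kBlockSize;
    return lastBlock - firstBlock + 1;
}
// For a block-aligned base, the 24003..25210 range used as an example later
// in this description spans 38 cache blocks.
static_assert(edgePrefetchCount(0, 24003, 25210) == 38, "edge range block count");
```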
In this embodiment, the detailed steps of generating the prefetch request for the visited list vector in step 2) include: reading the returned results of the edge list prefetch requests as the data from which the visited list prefetch is computed, and generating a corresponding access request address for each value read.
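Since the end-node IDs are scattered, one address must be produced per ID; a sketch assuming 4-byte visited-list entries:

```cpp
#include <cstdint>
#include <vector>

// One visited-list request address per end-node ID read from a prefetched
// edge-list cache block (a 128-B block holds 32 four-byte IDs).
std::vector<uint64_t> visitedPrefetchAddrs(uint64_t visitedBase,
                                           const uint32_t* endNodeIds,
                                           size_t count /* up to 32 per block */) {
    std::vector<uint64_t> addrs;
    addrs.reserve(count);
    for (size_t i = 0; i < count; ++i)
        addrs.push_back(visitedBase + 4ull * endNodeIds[i]); // ID used as index
    return addrs;
}
```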
The data-structure-oriented graphics processor data prefetching device of this embodiment includes a data prefetch unit distributed in each processing unit; the data prefetch unit is connected to the memory access monitor of the load/store unit, the memory access result buffer, and the L1 cache.
As shown in FIG. 5, in architectural design a GPU usually contains multiple streaming multiprocessors (SMs), and each SM contains many simple single-threaded processor cores. In the GPU hardware model, every 32 hardware threads form a scheduling unit called a Warp, and all threads in the same Warp execute the same instruction at the same time. For the data-flow-driven breadth-first search algorithm, each Warp processes an independent subset of the work list vector, so different processing units and streaming multiprocessors need to obtain different graph data structure information. This embodiment therefore uses distributed data prefetch units to process and generate the data prefetch requests within each streaming multiprocessor.
As shown in FIG. 7, the data prefetch unit includes:
the address space classifier 1, used to select the corresponding data prefetch request generation method according to the type of the storage access request issued by the processor core to the graph data structure;
the runtime information table 2, used to record the runtime information of the vectors in each processing unit (Warp); the runtime information comprises the work list vector index, the vertex list vector index, and the start and end indices of the edge list vector;
the prefetch request generation unit 3, used to execute the designated data prefetch request generation method: if the storage access request is an ordinary read of the work list vector, a prefetch request for the next item of the work list vector is generated; if it is a prefetch access to the work list vector, a prefetch request for the vertex list vector is generated; if it is a prefetch access to the vertex list vector, a prefetch request for the edge list vector is generated; if it is a prefetch request for the edge list vector, a prefetch request for the visited list vector is generated;
the prefetch request queue 4, used to store the generated data prefetch requests.
As shown in FIG. 6, the access information about the graph data structure obtained by the data prefetch unit comes mainly from two components of the load/store unit: the memory access monitor and the memory access result buffer (response FIFO). The memory access monitor monitors ordinary load instruction accesses to the work list vector; when such load instructions are observed, the data prefetch unit knows that a new search iteration has started and begins preparing to prefetch the data required for the next iteration. The memory access result buffer records all access request information processed by the L1 cache together with the data read. Because the data prefetch unit's own prefetch requests are also processed by the L1 cache, the memory access result buffer can observe the handling of prefetch requests and send the requested data and access information to the data prefetch unit. Using this information from the load/store unit and the pattern in which breadth-first search accesses the graph data structure, the data prefetch unit can generate the corresponding data prefetch requests. After receiving information from the load/store unit, the data prefetch unit updates the corresponding entry in the runtime information table 2 according to the Warp ID in the message, and the prefetch request generation unit 3 selects a prefetch request generator according to the source of the information and which vector of the graph data structure the access request touches. Finally, the data prefetch unit puts the newly generated prefetch requests into the prefetch request queue 4. The prefetch request generation unit 3 in the data prefetch unit is responsible for controlling how many prefetch requests are generated. The L1 cache handles ordinary memory access requests and also processes data prefetch requests, treating them as ordinary memory access requests.
As shown in FIG. 7, the address space classifier 1 includes an address space range table and eight address comparators. The address space range table holds eight addresses: the start and end addresses of the address space ranges of the work list, vertex list, edge list, and visited list vectors. One input of each of the eight address comparators is the accessed address carried in the information from the load/store unit, the other input is the corresponding address in the address space range table, and the outputs of the eight comparators are connected to the prefetch request generation unit 3. By comparing the address of a memory access request with the address space ranges of all data vectors, the address comparators determine to which data vector's address space the request belongs.
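Functionally, the eight comparators implement four range checks; the sketch below models that behavior (the enum and struct names are illustrative):

```cpp
#include <cstdint>

enum class VectorKind { None, WorkList, VertexList, EdgeList, VisitedList };

struct Range { uint64_t start, end; }; // [start, end) address space of one vector

// Model of the address space classifier: the range table holds eight
// addresses (start and end of four ranges); each comparison below stands
// for one hardware comparator.
struct AddressSpaceClassifier {
    Range work, vertex, edge, visited;

    VectorKind classify(uint64_t addr) const {
        if (addr >= work.start    && addr < work.end)    return VectorKind::WorkList;
        if (addr >= vertex.start  && addr < vertex.end)  return VectorKind::VertexList;
        if (addr >= edge.start    && addr < edge.end)    return VectorKind::EdgeList;
        if (addr >= visited.start && addr < visited.end) return VectorKind::VisitedList;
        return VectorKind::None;
    }
};
```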
The runtime information table 2 updates the corresponding entry for the Warp ID according to the information received. As shown in FIG. 7, each entry of the runtime information table 2 includes five fields: WID, work list vector index, vertex list vector index, and the start and end indices of the edge list vector, where WID records the Warp ID of the processing unit. The runtime information table 2 also includes a selector that updates the corresponding table entry according to the information source of the message from the load/store unit, the Warp ID of the processing unit, and the accessed data. The information source distinguishes whether the access information comes from the memory access monitor or the memory access result buffer: if it comes from the memory access monitor, the data prefetching required for traversing a new node is judged to begin, the content of the entry for that WID in the runtime information table 2 is cleared, and the accessed data is recorded in the work list index; if it comes from the memory access result buffer, the accessed data is recorded in the entry for the corresponding WID in the runtime information table 2.
In this embodiment, the five fields of each entry of the runtime information table 2 (WID, work list vector index, vertex list vector index, and the start and end indices of the edge list vector) are as follows.
WID: indicates the Warp to which the recorded information belongs. All received access information carries a Warp ID, which is compared with the WIDs in the runtime information table to determine which entry to update. As shown in FIG. 7, 0, 1, and 2 represent Warp0, Warp1, and Warp2, respectively.
Work list vector index: indicates which node of the work list vector the prefetch unit is currently prefetching data for. This field is updated from the ordinary access information for the work list vector observed by the memory access monitor. By determining the position of the work list item the Warp is currently visiting, the position of the item the Warp will visit in the next cycle is obtained, namely the item after the one currently visited. For example, the entry with WID 0 in FIG. 7 has a work list index of 2, indicating that Warp0 is traversing the first item of the work list vector while the data prefetch unit is prefetching the data required for traversing the second item.
Vertex list vector index: indicates the node ID this Warp will traverse in the next cycle. This field is updated from the prefetch access information for the work list vector observed by the memory access result buffer. From the accessed address of that prefetch access and the work list index recorded in the runtime information table, the data address of the prefetch request can be determined, the corresponding data read out, and the vertex index updated.
Start and end indices of the edge list vector: mark the range within the edge list vector of all edges of the node this Warp will traverse in the next cycle. This field is updated from the prefetch access information for the vertex list vector observed by the memory access result buffer. From the accessed address of that prefetch access and the vertex list index recorded in the runtime information table, the data address of the prefetch request can be determined and the corresponding data read out. Since a node's start and end edge indices are two adjacent items of the vertex list vector, they can be obtained with a single prefetch access if both values are stored in the same cache block; otherwise they must be read by two separate prefetches. For example, as shown in FIG. 7 for Warp1, vertex index 1279 corresponds to address 0x900232FC, while the next vertex index 1280 corresponds to address 0x90023200; these two addresses lie in two different cache blocks, so two prefetches are needed to obtain the start and end indices of the edge list vector. The current state indicates that the unit is waiting for the prefetch request to address 0x90023200 to update the end index of the edge list.
The selector of the runtime information table 2 updates the table through three inputs. (1) Since every piece of access information carries a Warp ID, the selector determines which entry to update by matching it against the WIDs in the runtime information table. (2) The information source indicates whether the access information comes from the memory access monitor (denoted 0) or the memory access result buffer (denoted 1). (3) The accessed data is then recorded: if the information comes from the memory access monitor, the data prefetching required for traversing a new node begins, so the entry with the same WID is cleared and the accessed data is recorded in the work list index; if it comes from the memory access result buffer, the accessed data is recorded in the corresponding field of the table. Unlike the work list index and vertex index, the start and end indices of the edge list may not be obtained at the same time; the runtime information output is therefore produced only after both have been acquired, and they are then sent to the edge list vector prefetch request generation unit 34 to generate prefetch requests for the edge list vector, while the work list index and vertex index can be forwarded to the prefetch request generation unit as the runtime information table is updated.
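The entry layout and the selector's update rule can be condensed into the following sketch; the table size, field widths, and the haveStart flag are assumptions of this simplified software model:

```cpp
#include <array>
#include <cstdint>

enum class Vec { Work, Vertex, Edge, Visited };
enum class Source { Monitor, ResultBuffer }; // 0 = access monitor, 1 = result buffer

struct RuntimeInfoEntry {
    uint32_t wid       = 0;     // Warp ID
    uint32_t workIndex = 0;     // work-list item being prefetched for
    uint32_t vertexIdx = 0;     // node ID to traverse in the next cycle
    uint32_t edgeStart = 0;     // start index into the edge list
    uint32_t edgeEnd   = 0;     // end index into the edge list
    bool     haveStart = false; // whether edgeStart has arrived yet
};

struct RuntimeInfoTable {
    std::array<RuntimeInfoEntry, 48> entries; // one entry per Warp (size assumed)

    void update(uint32_t wid, Source src, Vec vec, uint32_t data) {
        RuntimeInfoEntry& e = entries[wid];   // (1) select entry by WID
        if (src == Source::Monitor) {         // (2) ordinary work-list read:
            e = RuntimeInfoEntry{};           //     new traversal, clear the entry
            e.wid = wid;
            e.workIndex = data;               // (3) record the accessed data
        } else if (vec == Vec::Work) {        // prefetch result for the work list
            e.vertexIdx = data;
        } else if (vec == Vec::Vertex) {      // prefetch results for the vertex list:
            if (!e.haveStart) { e.edgeStart = data; e.haveStart = true; }
            else              { e.edgeEnd = data; } // start/end may arrive separately
        }
    }
};
```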
预取请求生成器单元3包含一个预取生成单元选择器31和4个预取请求生成器,分别负责产生对4个数据向量的访存请求。如图7所示,预取请求生成单元3包括预取生成单元选择器31、work list向量预取请求生成单元32、vertex list向量预取请求生成单元33、edge list向量预取请求生成单元34以及visited list向量预取请求生成单元35,预取生成单元选择器31根据地址空间分类器1输出的访存信息类型、来自访存指令单元的信息中的信息来源、运行时信息表2输出的运行时信息,并从work list向量预取请求生成单元32、vertex list向量预取请求生成单元33、edge list向量预取请求生成单元34、visited list向量预取请求生成单元35四者中选择一个来进行预取请求生成;work list向量预取请求生成单元32用于生成对work list向量下一项的预取请求并写入预取请求队列4;vertex list向量预取请求生成单元33用于生成对vertex list向量的预取请求并写入预取请求队列4;edge list向量预取请求生成单元34用于生成对edge list向量的预取请求并写入预取请求队列4;visited list向量预取请求生成单元35用于生成对visited list向量的预取请求并写入预取请 求队列4。
因为对图数据结构中每个数据向量的访问模式不同,数据预取单元添加了四个预取请求生成单元(work list向量预取请求生成单元32、vertex list向量预取请求生成单元33、edge list向量预取请求生成单元34以及visited list向量预取请求生成单元35)来分别产生对四个数据向量的预取请求。总的来说,这四个预取请求生成单元可以分为两类:生成对work list向量的work list向量预取请求生成单元32以及生成对其他三个数据向量的预取请求的生成单元。这是因为生成器产生预取请求所需要信息的来源分别为访存监视器和访存结果缓冲。数据预取单元利用访存监视器监控到的对work list向量的普通访存信息来生成对work list向量的预取请求,而生成对其他数据向量的数据预取请求则需要由访存结果缓冲监控的预取请求的信息。另外,数据预取单元还需要使用监控到的预取请求的访问地址来选择对vertex list向量、edge list向量还是visited list向量生成预取请求。根据宽度优先搜索的数据结构访问模式,对每个数据向量的访问顺序是可以预测的。因此,当数据预取单元收到一个对work list向量的预取请求信息时,它就可以生成对vertex list向量的预取请求,而当它收到的是针对vertex list向量的预取请求,它就会生成对edge list向量的预取请求。同理,如果收到的是对edge list向量的预取访存请求,数据预取单元就会生成对visited list向量的访存请求。因此,数据预取单元收到信息的来源和监测到的访存请求的访问地址共同决定了使用哪个预取请求生成单元。
当数据预取单元接收到访问work list向量的普通访存读指令的信息时,work list向量预取请求生成单元32会负责产生对work list向量的预取请求。预取请求的地址是这个普通访存读指令请求的数据下一项的地址。因此,预取请求的数据地址就是这个普通访存读指令所读取的数据地址加上数据大小的结果。例如,如果普通的访存读指令读取的是地址0x88021d00的数据,那么work list的预取请求生成单元就会产生对地址为0x88021d04数据的预取请求。
According to the definition of the Compressed Sparse Row graph data structure, the vertex list vector records the starting position in the edge list vector of the non-zero elements of each row of the graph's adjacency matrix (each row of the adjacency matrix corresponds to a node of the graph, the non-zero elements are the edges connected to that node, and these edges are stored contiguously in the edge list vector). Therefore, determining the number of edges of a node requires reading the values of both the corresponding node and the next node in the vertex list vector. Since the earlier prefetch request on the work list vector yields the vertex index, the vertex list vector prefetch request generation unit 33 can obtain from that index the addresses of the node and of the next node in the vertex list vector. Normally, when the two values lie in the same cache block, one memory access request can fetch both at once; if they do not lie in the same cache block, the vertex list vector prefetch request generation unit 33 must produce two memory access requests.
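For illustration only, a C++ sketch of unit 33 fetching vertex[v] and vertex[v+1] follows; the 128-byte block size and 4-byte entries are assumptions of the sketch.

```cpp
#include <cstdint>
#include <vector>

std::vector<uint64_t> vertex_prefetch(uint64_t vertex_base, uint64_t v) {
    const uint64_t kBlock = 128, kElem = 4;
    uint64_t a0 = vertex_base + v * kElem;    // address of vertex[v]
    uint64_t a1 = a0 + kElem;                 // address of vertex[v+1]
    if (a0 / kBlock == a1 / kBlock)
        return {a0 & ~(kBlock - 1)};          // one block holds both values
    return {a0 & ~(kBlock - 1), a1 & ~(kBlock - 1)};  // two requests needed
}
```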
According to the corresponding start and end indices of the edge list vector in the runtime information table, the edge list vector prefetch request generation unit produces the prefetch requests for the edge list vector. For example, as shown in FIG. 7, in the entry with WID 2 of the runtime information table the start and end indices of the edge list are 24003 and 25210, which means that all the edges connected to node 2020 are the entries of the edge list vector from entry 24003 up to entry 25210 (entry 25210 excluded). Because the edge information of each node is stored contiguously in the edge list vector, the number of requests produced depends mainly on how many cache blocks are needed to hold the edge data, and address misalignment must also be taken into account.
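For illustration only, a C++ sketch of unit 34 covering the edge entries [begin, end) with block-aligned requests follows; a misaligned start or end can add a block, and the block and entry sizes are assumptions of the sketch.

```cpp
#include <cstdint>
#include <vector>

std::vector<uint64_t> edge_prefetch(uint64_t edge_base,
                                    uint64_t begin, uint64_t end) {
    const uint64_t kBlock = 128, kElem = 4;
    std::vector<uint64_t> reqs;
    if (end <= begin) return reqs;            // node with no edges
    uint64_t first = (edge_base + begin * kElem) & ~(kBlock - 1);
    uint64_t last  = (edge_base + end * kElem - 1) & ~(kBlock - 1);
    for (uint64_t a = first; a <= last; a += kBlock)
        reqs.push_back(a);                    // one request per cache block
    return reqs;
}
```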
For breadth-first search, the edge list vector stores the destination node IDs of the edges, so the visited list vector prefetch request generation unit 35 reads the destination node IDs returned by the prefetch requests on the edge list vector; using a destination node ID as an index into the visited list vector, the address of the corresponding visited list value can be computed. Since these destination node IDs are non-contiguous and scattered, the visited list vector prefetch request generation unit 35 must produce a separate visited list access request address for every value in the prefetched cache block. Since a GPU cache block is typically 128 B, a data size of 4 B means that one cache block stores 32 destination node IDs, and the visited list vector prefetch request generation unit accordingly produces 32 memory access requests, one for each of those destination node IDs.
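For illustration only, a C++ sketch of unit 35 issuing one visited list request per destination node ID in a returned edge list block follows (32 IDs for a 128 B block of 4 B entries); the names are assumptions of the sketch.

```cpp
#include <cstdint>
#include <vector>

std::vector<uint64_t> visited_prefetch(uint64_t visited_base,
                                       const std::vector<uint32_t>& dst_ids) {
    std::vector<uint64_t> reqs;
    reqs.reserve(dst_ids.size());             // typically 32 scattered IDs
    for (uint32_t id : dst_ids)
        reqs.push_back(visited_base + uint64_t(id) * 4);  // &visited[id]
    return reqs;
}
```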
In summary, the data structure-oriented data prefetching method and device of this embodiment use the data structure access pattern defined by breadth-first search together with graph data structure information to produce the corresponding data prefetch requests. Compared with existing data prefetching mechanisms on GPUs, this data prefetching method and device can prefetch the data needed for graph traversal using breadth-first search more accurately and efficiently, thereby improving the performance of GPUs on graph computing problems.
The above are merely preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions under the concept of the present invention belong to the protection scope of the present invention. It should be noted that, for those of ordinary skill in the art, several improvements and refinements made without departing from the principles of the present invention shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

  1. A data structure-oriented graphics processor data prefetching method, characterized in that its implementation steps comprise:
    1) acquiring monitored information on memory access requests of a processor core to a graph data structure, together with the data read;
    2) selecting a corresponding data prefetch request generation mode according to the type of the memory access request: if the memory access request is an ordinary read access to the work list vector, generating a prefetch request for the next entry of the work list vector; if the memory access request is a prefetch access to the work list vector, generating a prefetch request for the vertex list vector; if the memory access request is a prefetch access to the vertex list vector, generating a prefetch request for the edge list vector; if the memory access request is a prefetch request to the edge list vector, generating a prefetch request for the visited list vector;
    3) storing the generated prefetch requests into a prefetch request queue.
  2. The data structure-oriented graphics processor data prefetching method according to claim 1, characterized in that acquiring the monitored memory access requests of the processor core to the graph data structure in step 1) comprises: the memory access monitor in the memory access instruction unit monitoring ordinary memory read instruction accesses to the work list vector, and the memory access result buffer in the memory access instruction unit recording the information of all memory access requests processed by the level-one cache together with the data read.
  3. The data structure-oriented graphics processor data prefetching method according to claim 1, characterized in that, when the prefetch request for the next entry of the work list vector is generated in step 2), the data address of that prefetch request is the result of adding the size of the data read to the data address read by the memory access request.
  4. The data structure-oriented graphics processor data prefetching method according to claim 1, characterized in that the detailed steps of generating the prefetch request for the vertex list vector in step 2) comprise: determining, from the node ID obtained when the prefetch request for the work list vector was last generated, the addresses of the corresponding row and of the next row for the vertex list vector prefetch request; if the addresses of the corresponding row and the next row lie in the same cache block, generating one memory access request to fetch both addresses at once; if they do not lie in the same cache block, generating two memory access requests to fetch the address of the corresponding row and that of the next row separately.
  5. The data structure-oriented graphics processor data prefetching method according to claim 1, characterized in that the detailed steps of generating the prefetch request for the edge list vector in step 2) comprise: generating the prefetch requests for the edge list vector according to the runtime start and end indices of the edge list vector, the number of requests produced depending mainly on how many cache blocks are needed to hold the edge data and on how many cache blocks address alignment requires.
  6. The data structure-oriented graphics processor data prefetching method according to claim 1, characterized in that the detailed steps of generating the prefetch request for the visited list vector in step 2) comprise: reading the results returned by the prefetch requests on the edge list vector as the data for computing the visited list vector prefetch, and producing a corresponding access request address for every value read.
  7. A data structure-oriented graphics processor data prefetching device, characterized by comprising a data prefetch unit distributed in each processing unit, the data prefetch unit being connected to the memory access monitor of the memory access instruction unit, to the memory access result buffer and to the level-one cache, the data prefetch unit comprising:
    an address space classifier (1) for selecting a corresponding data prefetch request generation mode according to the type of the memory access request of the processor core to the graph data structure;
    a runtime information table (2) for recording the runtime information of the various vectors in each processing unit Warp, the runtime information of the various vectors including the index of the work list vector, the index of the vertex list vector, and the start and end indices of the edge list vector;
    a prefetch request generation unit (3) for executing the designated data prefetch request generation mode: if the memory access request is an ordinary read access to the work list vector, generating a prefetch request for the next entry of the work list vector; if the memory access request is a prefetch access to the work list vector, generating a prefetch request for the vertex list vector; if the memory access request is a prefetch access to the vertex list vector, generating a prefetch request for the edge list vector; if the memory access request is a prefetch request to the edge list vector, generating a prefetch request for the visited list vector;
    a prefetch request queue (4) for storing the generated data prefetch requests.
  8. The data structure-oriented graphics processor data prefetching device according to claim 7, characterized in that the address space classifier (1) comprises an address space range table and eight address comparators, the address space range table holding eight addresses corresponding one-to-one to the start and end addresses of the address space ranges of the work list vector, the vertex list vector, the edge list vector and the visited list vector; one input of each of the eight address comparators is the accessed address carried in the information from the memory access instruction unit and the other input is the corresponding address in the address space range table, and the outputs of the eight address comparators are connected to the prefetch request generation unit (3).
  9. The data structure-oriented graphics processor data prefetching device according to claim 7, characterized in that each entry of the runtime information table (2) contains five fields: a WID, the index of the work list vector, the index of the vertex list vector, and the start and end indices of the edge list vector, the WID recording the Warp ID of the processing unit; the runtime information table (2) further comprises a selector for updating the corresponding entry of the runtime information table (2) according to the information source carried in the information from the memory access instruction unit, the Warp ID of the processing unit Warp, and the accessed data, the information source distinguishing whether the memory access information comes from the memory access monitor or from the memory access result buffer; if it comes from the memory access monitor, it is judged that data prefetching for traversing a new node has begun, the contents of the entry with the corresponding WID in the runtime information table (2) are cleared, and the accessed data is recorded into the work list index; if it comes from the memory access result buffer, the accessed data is recorded into the entry with the corresponding WID in the runtime information table (2).
  10. The data structure-oriented graphics processor data prefetching device according to claim 7, characterized in that the prefetch request generation unit (3) comprises a prefetch generation unit selector (31), a work list vector prefetch request generation unit (32), a vertex list vector prefetch request generation unit (33), an edge list vector prefetch request generation unit (34) and a visited list vector prefetch request generation unit (35); based on the type of memory access information output by the address space classifier (1), the information source carried in the information from the memory access instruction unit, and the runtime information output by the runtime information table (2), the prefetch generation unit selector (31) selects one of the work list vector prefetch request generation unit (32), the vertex list vector prefetch request generation unit (33), the edge list vector prefetch request generation unit (34) and the visited list vector prefetch request generation unit (35) to perform prefetch request generation; the work list vector prefetch request generation unit (32) generates the prefetch request for the next entry of the work list vector and writes it into the prefetch request queue (4); the vertex list vector prefetch request generation unit (33) generates the prefetch request for the vertex list vector and writes it into the prefetch request queue (4); the edge list vector prefetch request generation unit (34) generates the prefetch request for the edge list vector and writes it into the prefetch request queue (4); and the visited list vector prefetch request generation unit (35) generates the prefetch request for the visited list vector and writes it into the prefetch request queue (4).
PCT/CN2019/084774 2018-10-11 2019-04-28 Data structure-oriented graphics processor data prefetching method and device WO2020073641A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/960,894 US11520589B2 (en) 2018-10-11 2019-04-28 Data structure-aware prefetching method and device on graphics processing unit

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811183490.8A CN109461113B (zh) 2018-10-11 2018-10-11 Data structure-oriented graphics processor data prefetching method and device
CN201811183490.8 2018-10-11

Publications (1)

Publication Number Publication Date
WO2020073641A1 true WO2020073641A1 (zh) 2020-04-16

Family

ID=65607513

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/084774 WO2020073641A1 (zh) 2018-10-11 2019-04-28 Data structure-oriented graphics processor data prefetching method and device

Country Status (3)

Country Link
US (1) US11520589B2 (zh)
CN (1) CN109461113B (zh)
WO (1) WO2020073641A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109461113B (zh) 2018-10-11 2021-07-16 中国人民解放军国防科技大学 Data structure-oriented graphics processor data prefetching method and device
CN111124675B (zh) * 2019-12-11 2023-06-20 华中科技大学 Heterogeneous in-memory computing device for graph computing and operating method thereof
CN113741567B (zh) * 2021-11-08 2022-03-29 广东省新一代通信与网络创新研究院 Vector accelerator and control method and device thereof
CN114218132B (zh) * 2021-12-14 2023-03-24 海光信息技术股份有限公司 Information prefetching method, processor, and electronic device
CN114565503B (zh) * 2022-05-03 2022-07-12 沐曦科技(北京)有限公司 Method, apparatus, device and storage medium for GPU instruction data management
CN116821008B (zh) * 2023-08-28 2023-12-26 英特尔(中国)研究中心有限公司 Processing apparatus with improved cache hit rate and cache device thereof

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8089486B2 (en) * 2005-03-21 2012-01-03 Qualcomm Incorporated Tiled prefetched and cached depth buffer
KR100703709B1 (ko) * 2005-06-02 2007-04-06 삼성전자주식회사 Graphics processing apparatus, processing method thereof, and recording medium therefor
CN100481028C (zh) * 2007-08-20 2009-04-22 杭州华三通信技术有限公司 Method and apparatus for implementing data storage using a cache
US8397049B2 (en) * 2009-07-13 2013-03-12 Apple Inc. TLB prefetching
US20140184630A1 (en) * 2012-12-27 2014-07-03 Scott A. Krig Optimizing image memory access
CN104156264B (zh) * 2014-08-01 2017-10-10 西北工业大学 Multi-GPU-based parallel real-time scheduling method for baseband signal processing tasks
US9535842B2 (en) * 2014-08-28 2017-01-03 Oracle International Corporation System and method for performing message driven prefetching at the network interface
US10180803B2 (en) * 2015-07-28 2019-01-15 Futurewei Technologies, Inc. Intelligent memory architecture for increased efficiency
US9990690B2 (en) * 2015-09-21 2018-06-05 Qualcomm Incorporated Efficient display processing with pre-fetching
US20170091103A1 (en) * 2015-09-25 2017-03-30 Mikhail Smelyanskiy Instruction and Logic for Indirect Accesses
US10423411B2 (en) * 2015-09-26 2019-09-24 Intel Corporation Data element comparison processors, methods, systems, and instructions
US20180189675A1 (en) * 2016-12-31 2018-07-05 Intel Corporation Hardware accelerator architecture and template for web-scale k-means clustering
CN111124675B (zh) * 2019-12-11 2023-06-20 华中科技大学 Heterogeneous in-memory computing device for graph computing and operating method thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899156A (zh) * 2015-05-07 2015-09-09 中国科学院信息工程研究所 Graph data storage and query method for large-scale social networks
CN104952032A (zh) * 2015-06-19 2015-09-30 清华大学 Graph processing method and apparatus, and rasterized representation and storage method
US20170060958A1 (en) * 2015-08-27 2017-03-02 Oracle International Corporation Fast processing of path-finding queries in large graph databases
CN109461113A (zh) * 2018-10-11 2019-03-12 中国人民解放军国防科技大学 Data structure-oriented graphics processor data prefetching method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GUO, HUI ET AL.: "Accelerating BFS via Data Structure-Aware Prefetching on GPU", IEEE Access, ISSN: 2169-3536, 16 October 2018 (2018-10-16), pages 60234-60246, XP011698514 *

Also Published As

Publication number Publication date
CN109461113B (zh) 2021-07-16
CN109461113A (zh) 2019-03-12
US11520589B2 (en) 2022-12-06
US20200364053A1 (en) 2020-11-19

Similar Documents

Publication Publication Date Title
WO2020073641A1 (zh) Data structure-oriented graphics processor data prefetching method and device
Falsafi et al. A primer on hardware prefetching
San Miguel et al. Load value approximation
EP2542973B1 (en) Gpu support for garbage collection
US8583874B2 (en) Method and apparatus for caching prefetched data
US9361233B2 (en) Method and apparatus for shared line unified cache
US20080065809A1 (en) Optimized software cache lookup for simd architectures
US20090138680A1 (en) Vector atomic memory operations
US6711651B1 (en) Method and apparatus for history-based movement of shared-data in coherent cache memories of a multiprocessor system using push prefetching
US10318261B2 (en) Execution of complex recursive algorithms
Lee et al. Improving energy efficiency of GPUs through data compression and compressed execution
TW201621671A (zh) Apparatus and method for dynamically updating hardware prefetch traits to exclusive or shared in multiple memory access agents
US6810472B2 (en) Page handling efficiency in a multithreaded processor
Kim et al. Leveraging cache coherence in active memory systems
US20030088636A1 (en) Multiprocessor system having distributed shared memory and instruction scheduling method used in the same system
Beard et al. Eliminating Dark Bandwidth: a data-centric view of scalable, efficient performance, post-Moore
Yoon et al. Design of DRAM-NAND flash hybrid main memory and Q-learning-based prefetching method
Sun et al. Server-based data push architecture for multi-processor environments
Rau et al. The effect of instruction fetch strategies upon the performance of pipelined instruction units
Guo et al. Accelerating BFS via data structure-aware prefetching on GPU
CN114661626A (zh) Apparatus, system and method for selectively dropping software prefetch instructions
Keshtegar et al. Cluster‐based approach for improving graphics processing unit performance by inter streaming multiprocessors locality
Zhang et al. Locality protected dynamic cache allocation scheme on GPUs
Wang et al. Using idle workstations to implement predictive prefetching
CN110347487B (zh) Energy consumption characterization method and system for data movement in database applications

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19871112

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19871112

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 04.11.2021)
