WO2023151231A1 - Method and apparatus for loading data in a single instruction multithreading computing system - Google Patents

Method and apparatus for loading data in a single instruction multithreading computing system

Info

Publication number
WO2023151231A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2022/107081
Inventor
彭永超
袁红岗
满新攀
赵鹏
徐立宝
王东辉
仇小钢
Original Assignee
海飞科(南京)信息技术有限公司
Application filed by 海飞科(南京)信息技术有限公司
Publication of WO2023151231A1

Classifications

    • G06F9/30098 Register arrangements
    • G06F9/462 Saving or restoring of program or task context with multiple register sets
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Definitions

  • Embodiments of the present disclosure relate generally to the field of electronics, and more particularly to a method and apparatus for loading data in a single instruction multithreading (SIMT) computing system.
  • each thread has its own register file (that is, an array of registers), and each thread can perform thread-level data exchange between its registers and the memory.
  • a typical register access architecture (a load-store architecture) is usually used to exchange data between registers and memory.
  • a load instruction may be utilized to read data from memory and store data into registers.
  • conventional load instructions lack optimization for data exchange between multiple threads. Therefore, there is a need for a scheme for efficiently loading data for multiple threads in a single instruction multithreading computing system.
  • Embodiments of the present disclosure provide a technical solution for loading data in a single instruction multithreading computing system.
  • a method of loading data in a single instruction multithreading computing system includes: based on a received single load instruction, determining a plurality of predicates for a plurality of threads, each predicate indicating whether an address specified in the corresponding thread is valid, the address being used to access data in the memory; determining, based on the determined plurality of predicates, at least one execution thread among the plurality of threads; determining target data for each execution thread of the at least one execution thread; and writing the set of target data for each execution thread into a register file of each target thread among the plurality of threads.
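The four steps of the method can be illustrated with a minimal Python sketch (the function name, the dict-based memory model, and the zero fill value for invalid addresses are illustrative assumptions, not the patent's implementation):

```python
def broadcast_load(predicates, addresses, memory):
    """Simulate the four claimed steps of the single load instruction.

    predicates[i] -- True if the address specified in thread i is valid
    addresses[i]  -- the address specified in thread i
    memory        -- dict mapping address -> data (illustrative model)
    """
    # Steps 1-2: determine the execution threads from the predicates.
    # Here the execution threads run from thread 0 to the last thread
    # whose predicate is true (one rule described later in the text).
    last_true = max((i for i, p in enumerate(predicates) if p), default=-1)
    execution_threads = range(last_true + 1)

    # Step 3: determine target data for each execution thread.
    # Valid address -> read from memory; invalid -> predetermined value 0.
    targets = [memory[addresses[i]] if predicates[i] else 0
               for i in execution_threads]

    # Step 4: write the whole set of target data into the register file
    # of every target thread (here: all threads), in execution-thread order.
    return [list(targets) for _ in range(len(predicates))]
```

With predicates (true, true, false, true), every target thread receives the same ordered set of target data, and the invalid address is never dereferenced.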
  • an apparatus for loading data in a single instruction multithreading computing system includes: a predicate determination unit configured to determine a plurality of predicates for a plurality of threads based on a received single load instruction, each predicate indicating whether an address specified in the corresponding thread is valid, the address being used to access data in the memory; an execution thread determination unit configured to determine at least one execution thread among the plurality of threads based on the determined plurality of predicates; a target data determination unit configured to determine target data for each execution thread of the at least one execution thread; and a writing unit configured to write the set of target data for each execution thread into the register file of each target thread among the plurality of threads.
  • a computer readable storage medium stores a plurality of programs configured to be executed by one or more processing engines, the plurality of programs including instructions for performing the method of the first aspect of the present disclosure.
  • a computer program product comprises a plurality of programs configured for execution by one or more processing engines, the plurality of programs comprising instructions for performing the method of the first aspect of the present disclosure.
  • corresponding target data can be determined for each thread of execution and a set of target data written to each target thread based on a single load instruction. In this way, the efficiency of data exchange between the register and the memory can be improved.
  • Figure 1 shows a schematic diagram of an example environment in which various embodiments of the present disclosure can be implemented
  • FIG. 2 shows a schematic diagram of a chip according to an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of data exchange between a register and a memory using a conventional load instruction
  • FIG. 4 shows a flowchart of a method for loading data according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of a result of loading data according to an embodiment of the present disclosure
  • FIGS. 6a and 6b show schematic diagrams of transpose storage according to an embodiment of the present disclosure
  • FIG. 7 shows a schematic diagram of a process of loading data according to an embodiment of the present disclosure.
  • FIG. 8 shows a schematic block diagram of an apparatus for loading data according to an embodiment of the present disclosure.
  • the term “comprise” and its variants mean open inclusion, ie “including but not limited to”.
  • the term “or” means “and/or” unless otherwise stated.
  • the term “based on” means “based at least in part on”.
  • the terms “one example embodiment” and “one embodiment” mean “at least one example embodiment.”
  • the term “another embodiment” means “at least one further embodiment”.
  • the terms “first”, “second”, etc. may refer to different or the same object. Other definitions, both express and implied, may also be included below.
  • FIG. 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented.
  • Example environment 100 may include, for example, electronic devices with computing capabilities, such as computers.
  • example environment 100 includes, for example, a central processing unit (CPU) 120, system memory 110, a north bridge/memory bridge 130, an accelerator system 140, an external storage device 150, and a south bridge/input-output (IO) bridge 160.
  • System memory 110 may include, for example, volatile memory such as dynamic random access memory (DRAM).
  • the north bridge/memory bridge 130 for example, integrates a memory controller, a PCIe controller, etc., and is responsible for data exchange between the CPU 120 and the high-speed interface as well as bridging the CPU 120 and the south bridge/IO bridge 160.
  • the south bridge/IO bridge 160 is used for low-speed interfaces of the computer, such as a Serial Advanced Technology Attachment (SATA) controller and the like.
  • the accelerator system 140 may include, for example, devices or chips such as a graphics processing unit (GPU) and an artificial intelligence (AI) accelerator for accelerated processing of data such as graphics and video.
  • the external storage device 150 may be, for example, a volatile memory such as DRAM located outside the accelerator system 140 .
  • the external storage device 150 is also referred to as an off-chip memory, that is, a memory located outside the chip of the accelerator system 140 .
  • the chip of the accelerator system 140 also has volatile memory, such as a first-level (L1) cache and optionally a second-level (L2) cache, which will be described in detail below in conjunction with some embodiments of the present disclosure. While an example environment 100 in which embodiments of the disclosure can be implemented is shown in FIG. 1, the disclosure is not limited thereto. Some embodiments of the present disclosure may also be used in other application environments with accelerator systems such as GPUs, for example ARM architectures and RISC-V architectures.
  • FIG. 2 shows a schematic block diagram of an accelerator system 200 according to one embodiment of the present disclosure.
  • the accelerator system 200 may be, for example, a specific implementation manner of the chip of the accelerator system 140 in FIG. 1 .
  • the accelerator system 200 includes, for example, an accelerator system-on-a-chip such as a GPU.
  • the accelerator system 200 may include a stream processor (SP) 210, a page table device 220, a processing engine (PE) unit 230, a direct memory access (DMA) controller 240, an L1 cache 260 and an L2 cache 250.
  • the accelerator system 200 may be controlled by a host device such as the CPU 120 and receive instructions from the CPU 120.
  • the SP 210 analyzes instructions from the CPU 120, and assigns the analyzed operations to the PE unit 230, the page table device 220, and the DMA controller 240 for processing.
  • the page table device 220 is used to manage virtual storage accessible by the accelerator system 200 .
  • the virtual storage may include, for example, the L2 cache 250 and off-chip memory such as the external storage device 150 in FIG. 1 .
  • the page table device 220 is jointly maintained by the SP 210, the PE unit 230 and the DMA controller 240.
  • the PE unit 230 may include a plurality of processing engines PE_1, PE_2, ..., PE_N, where N represents an integer greater than 1.
  • Each PE in PE unit 230 may be a single instruction multiple thread device.
  • each thread can have its own register file, and all threads of each PE also share a uniform register file.
  • Multiple PEs can perform the same or different processing jobs in parallel. For example, PE can perform processing such as sorting and convolution on the data to be processed.
  • the application program can be divided into multiple parts, and the multiple parts are run in parallel on multiple PEs.
  • Each thread can have its own register file and execution unit, and use its own memory addresses.
  • the execution units may include a floating-point/fixed-point unit supporting multiple data types, and an arithmetic logic unit for performing arithmetic operations (such as addition, subtraction, multiplication, and division of floating-point and fixed-point numbers) and logic operations (such as logical AND, OR, and NOT).
  • FIG. 3 shows a schematic diagram of data exchange between a register and a memory using a conventional load instruction.
  • threads 310-1 to 310-N (collectively referred to as a plurality of threads 310) in the PE 300 can exchange data with the memory 320.
  • Each thread has its own register file, eg, thread 310-1 has register file 330-1, thread 310-2 has register file 330-2, and so on.
  • Each thread also has a respective datapath, such as datapaths 340-1 through 340-N (collectively, datapaths 340).
  • the memory 320 may include a memory inside the chip (such as a first-level cache, a second-level cache), or a memory outside the chip.
  • multiple threads 310 in PE 300 need to read data from the same address in memory 320 and load the same data read into each thread's respective register (this operation is simply called broadcast load). For example, when performing matrix multiplication, each row of matrix A needs to be multiplied by the same column of matrix B, which requires the same column of matrix B to be broadcast to each thread used to process each row of matrix A.
  • each thread would need to specify the same address if a conventional load instruction were utilized. Then, for each thread, data would be read from memory 320 based on the address and written into the thread's registers. In other words, the data would be read N times, that is, the data would be copied N times over the data paths 340 between the registers and the memory 320. Multiple reads and copies of the same data reduce the efficiency of data exchange between the registers and the memory 320 and increase the power consumption of the data exchange.
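The redundancy described above can be made concrete with a sketch that counts memory reads (the counter dict and function names are illustrative, not from the patent):

```python
def conventional_broadcast(num_threads, address, memory, counter):
    """Conventional loads: every thread reads the same address itself."""
    regs = []
    for _ in range(num_threads):
        counter['reads'] += 1          # one memory access per thread
        regs.append(memory[address])
    return regs

def single_broadcast(num_threads, address, memory, counter):
    """Broadcast load: the data is read once, then copied to all threads."""
    counter['reads'] += 1              # a single memory access
    return [memory[address]] * num_threads
```

Both produce the same register contents, but the conventional path issues N memory reads where the broadcast load issues one.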
  • a technical solution for efficiently loading data in a single instruction multithreading computing system is provided.
  • based on a received single load instruction, a plurality of predicates are determined for a plurality of threads, each predicate indicating whether the address specified in the corresponding thread is valid, the address being used to access data in the memory; based on the determined plurality of predicates, at least one execution thread among the plurality of threads is determined; target data is determined for each execution thread of the at least one execution thread; and the set of target data for each execution thread of the at least one execution thread is written into the register file of each target thread among the plurality of threads.
  • corresponding target data can be determined for each execution thread based on a single load instruction, and the set of target data, that is, the plurality of target data respectively corresponding to the plurality of execution threads, can be written to each target thread, without reading the same target data multiple times for each execution thread. In this way, the efficiency of data exchange between registers and memory can be improved.
  • FIG. 4 shows a flowchart of a method 400 for loading data according to an embodiment of the present disclosure.
  • Method 400 may be implemented at a SIMT computing system including multiple threads, such as PE 300 shown in FIG. 3 . Specifically, it may be implemented by an input and output module (not shown in FIG. 3 ) used for exchanging data with the memory 320 in the PE 300.
  • a plurality of predicates for the plurality of threads are determined, each predicate indicating whether a specified address in the corresponding thread is valid, the address being used to access data in memory.
  • Multiple threads may be some or all of the threads in PE 300.
  • multiple threads may be partial threads started by PE 300.
  • Each of the multiple threads may specify a corresponding predicate.
  • the value of a predicate can be specified in a thread's predicate register.
  • a predicate may indicate whether an address specified in a thread for accessing data in memory is valid.
  • the predicate may indicate whether data can be read from memory 320 based on the address specified in the thread. For example, data may be read from memory 320 based on an address specified in a thread when the value of the predicate is true. Conversely, when the value of the predicate is false, the address specified in the thread may be regarded as invalid, that is, the data in the memory 320 is not read based on the address.
  • a thread of execution is the thread for which data to be written is to be determined.
  • the data to be written refers to the data to be written into the thread, also referred to as target data.
  • the thread of execution among the plurality of threads may be determined based on predicates and predetermined rules.
  • only threads for which the predicate indicates that the address is valid may be determined as executing threads. For example, a thread whose predicate value is true may be determined as an execution thread, and a thread whose predicate value is false may be excluded from the execution thread.
  • a plurality of threads may be sorted into a sequence by number, a target subsequence in the sequence may be determined, and all threads in the target subsequence may be determined as execution threads.
  • a target subsequence starts at the first thread in the sequence (e.g., thread 310-1) and ends at the last thread in the sequence whose predicate evaluates to true. That is, the predicates of the remaining threads after the target subsequence all evaluate to false.
  • in this way, both threads whose predicates evaluate to true and some threads whose predicates evaluate to false may be determined as execution threads.
  • the target subsequence can also be determined based on other rules. For example, it may be specified that the number of threads in the target subsequence is an integer multiple of N, where N is an integer greater than or equal to 1. Thus, the target subsequence may end at the 1st, 2nd... or N-1th thread after the last thread whose predicate evaluates to true.
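One way to read the subsequence rules above is the following sketch (the rounding rule for the `multiple_of` variant is an assumption based on the description, and the cap at the thread count is illustrative):

```python
def target_subsequence(predicates, multiple_of=1):
    """Thread numbers forming the target subsequence.

    The subsequence starts at thread 0 and ends at the last thread whose
    predicate is true; with multiple_of=N its length is rounded up to a
    multiple of N (capped at the thread count), as the variant rule allows.
    """
    last_true = max((i for i, p in enumerate(predicates) if p), default=-1)
    length = last_true + 1
    if multiple_of > 1:
        # ceiling division, then scale back up to the next multiple of N
        length = min(-(-length // multiple_of) * multiple_of, len(predicates))
    return list(range(length))
```

Note that threads inside the subsequence whose predicates are false still count as execution threads; they simply receive the predetermined value as target data.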
  • target data is determined for each of the at least one thread of execution.
  • target data refers to data to be written into a thread.
  • different methods may be used to determine the corresponding target data.
  • target data for a thread of execution may be fetched from memory 320 based on the address in response to the thread of execution's predicate indicating that the address is valid. Conversely, in response to the execution thread's predicate indicating that the address is invalid, the target data may be determined based on a predetermined value. For example, the target data may be determined to be zero.
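The per-thread rule just stated can be written compactly (the zero default for the predetermined value is taken from the example in the text; the dict memory model is illustrative):

```python
def thread_target_data(predicate, address, memory, predetermined=0):
    """Valid address -> fetch from memory; invalid -> predetermined value."""
    return memory[address] if predicate else predetermined
```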
  • different addresses are assigned in different threads of execution.
  • multiple target data from different addresses can be read from the memory 320 based on a single load instruction, and each target data is read only once. In this way, the efficiency of data exchange between the registers of multiple threads and the memory 320 can be improved.
  • an address register for storing an address in a register file of an executing thread may be determined based on the load instruction, and the address is read from the address register. Based on the address, target data for the thread of execution may be fetched from memory 320 .
  • a parameter for identifying an address register may be included in the load instruction. For example, the parameter could be the number of the address register in the register file.
  • target data may be fetched from memory 320 based on a data width of 4 bytes or 16 bytes, depending on the load instruction.
  • the set of target data for each of the at least one execution thread is written to the register file of each target thread among the plurality of threads.
  • a target thread refers to a thread, among the plurality of threads, into which the set of target data is to be written.
  • the target thread can be one or more of the plurality of threads.
  • the target thread can also be each of multiple threads.
  • the target thread may be determined based on a single load instruction. For example, the number of the target thread can be determined based on the modifiers of the load instruction.
  • the set of target data includes the target data for all execution threads. In this way, based on a single load instruction, more identical data can be written to each target thread, thereby improving the efficiency of loading data from the memory 320.
  • the target data in the set of target data are sorted in the same order as the execution threads. For example, the target data may be sorted by the number of the corresponding execution thread. In this way, even when the target data are determined for the execution threads in parallel, the multiple target data can be written into the registers of the target thread in a well-defined order.
  • when the set of target data is written into each target thread, it may be written into designated registers.
  • the specified register may be at least one register determined based on various predetermined rules.
  • the specified registers may be at least one consecutive register, that is, registers whose addresses in the register file are consecutive.
  • the specified registers can start from the target register.
  • the target register in the register file may be determined based on the load instruction. For example, a parameter for identifying the target register, such as the number of the target register in the register file, may be included in the load instruction.
  • the set of target data may be written to at least one consecutive register in the register file, starting from the target register.
  • for example, when the data width is 16 bytes and each register holds 4 bytes, each target data needs to be written into 4 registers.
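Writing the ordered set into consecutive registers starting at the target register might look like the following sketch (the flat-list register file is an illustrative model of one target thread):

```python
def write_set(register_file, target_register, target_set):
    """Write the ordered set of target data into consecutive registers
    of one target thread, starting at the target (destination) register."""
    for offset, data in enumerate(target_set):
        register_file[target_register + offset] = data
    return register_file
```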
  • FIG. 5 shows a schematic diagram of the results of loading data according to one embodiment of the present disclosure.
  • FIG. 5 shows thread 310-1, thread 310-2, thread 310-3, thread 310-4 and memory 320.
  • the thread 310-1, the thread 310-2, the thread 310-3, and the thread 310-4 are only examples of target threads, and the target threads may include 1, 2, 3, 4 or more threads.
  • each thread may include a predicate register, an address register, and a data register for storing data.
  • the register file of thread 310-1 may include a predicate register 501-1 for storing the value of the predicate, an address register 502-1 for storing the address of memory data, and a data register 503-1.
  • the data register 503-1 may include a target register 504-1, which serves as a starting register for storing target data. The details of thread 310-2, thread 310-3 and thread 310-4 will not be repeated here.
  • one target data can be written into one register, and the target data can be written sequentially, based on the number of the corresponding execution thread, into the register file of each target thread.
  • data A' (denoted as data 520-1) at address A in memory 320 may be retrieved and written as target data into the first register, i.e., the target register, in each target thread.
  • data B' at address B in memory 320 may be written as target data into the second register in each target thread.
  • data C' at address C in memory 320 may not be written into the target threads; instead, target data of zero can be written into the third register in each target thread.
  • data D' at address D in memory 320 may be written as target data into the fourth register in each target thread.
  • prior to writing the set of target data to the threads, the set of target data may be transposed, and the transposed set of target data written to each target thread (this operation is also referred to as transposed storage).
  • the data may be transposed per single byte or per double byte.
  • FIGS. 6a and 6b show schematic diagrams of transpose storage according to an embodiment of the present disclosure. It should be understood that, similar to FIG. 5, thread 310-1, thread 310-2, thread 310-3, and thread 310-4 are only examples of target threads, and the target threads may include 1, 2, 3, 4 or more threads. Furthermore, it is assumed in FIGS. 6a and 6b that the size of the data A', B', C', and D' to be read is 4 bytes, and the size of each register is also 4 bytes.
  • data A' can be split into data a1, a2, a3, a4, where each of a1, a2, a3, a4 is a single byte.
  • the details of data B' and D' are analogous and are not described again, and data C' corresponds to zeros. Therefore, without transposition, the data written in the first register shown in FIG. 5 are a1, a2, a3, a4; the data written in the second register are b1, b2, b3, b4; the data written in the third register are 0, 0, 0, 0; and the data written in the fourth register are d1, d2, d3, d4.
  • FIG. 6a shows a schematic diagram of performing transposed storage per single byte.
  • the data written in the first register are a1, b1, 0, d1;
  • the data written in the second register are a2, b2, 0, d2;
  • the data written in the third register are a3, b3, 0, d3;
  • the data written in the fourth register are a4, b4, 0, d4.
  • FIG. 6b shows a schematic diagram of performing transposed storage per double byte.
  • the data written in the first register are a1, a2, b1, b2;
  • the data written in the second register are a3, a4, b3, b4;
  • the data written in the third register are 0, 0, d1, d2;
  • the data written in the fourth register are 0, 0, d3, d4.
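One possible interpretation of FIGS. 6a and 6b, assuming 4-byte data items and 4-byte registers, is to group 4 / granularity items at a time and transpose each group chunk-wise (this grouping rule is inferred from the figures, not stated explicitly in the text):

```python
def transpose_store(target_set, granularity):
    """Transpose 4-byte items (lists of 4 one-byte units) before broadcast.

    granularity: 1 = per single byte (FIG. 6a), 2 = per double byte (FIG. 6b).
    Items are grouped 4 // granularity at a time; each group is transposed
    chunk-wise, so every output register again holds 4 byte units.
    """
    group_size = 4 // granularity
    registers = []
    for g in range(0, len(target_set), group_size):
        group = target_set[g:g + group_size]
        # split each item of the group into chunks of `granularity` units
        chunks = [[item[i:i + granularity] for i in range(0, 4, granularity)]
                  for item in group]
        # register k of the group holds chunk k of every item, in order
        for k in range(group_size):
            registers.append([u for item in chunks for u in item[k]])
    return registers
```

With granularity 1 this reproduces the register contents of FIG. 6a, and with granularity 2 those of FIG. 6b.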
  • FIG. 7 shows a schematic diagram of a process of loading data according to an embodiment of the present disclosure.
  • FIG. 7 shows a predicate check module 710 , a sort input module 720 , a sort output module 730 and a transpose module 740 .
  • FIG. 7 also shows a plurality of buffers 750-1, 750-2 to 750-N (collectively referred to as buffers 750), respectively corresponding to the plurality of threads 310, and an off-chip memory 760.
  • Buffer 750 and off-chip memory 760 may be part of memory 320 .
  • FIG. 7 only shows an example of loading data from the memory 320, but not all details of data exchange. For example, although not shown, addresses specified in each thread may be transferred to memory 320 via the address bus.
  • the predicate checking module 710 determines a plurality of predicates for the plurality of threads 310 based on a single received load instruction, each predicate indicating whether the address specified in the corresponding thread is valid, the address being used to access data in the memory.
  • the predicate checking module 710 also determines at least one execution thread of the plurality of threads 310 based on the determined plurality of predicates. For example, the predicate checking module 710 may determine the thread whose value of the predicate is true as the thread of execution.
  • the predicate checking module 710 may record from which thread in the sequence of multiple threads the predicate is false. For example, the predicate checking module 710 may determine that the predicates are all false starting from the thread numbered N.
  • predicate checking module 710 may determine whether to read data at a specified address in a thread of execution from off-chip memory 760 based on the predicate. For example, when the value of the predicate of the execution thread is true, the predicate checking module 710 may instruct to read corresponding data from the off-chip memory 760 and cache the data in the buffer 750 corresponding to the execution thread.
  • the sort input module 720 is configured to determine target data for each of the at least one execution thread. As described above, target data can be determined based on predicates. In some embodiments, the sort input module 720 passes the currently processed thread number to the predicate checking module 710. The predicate checking module 710 decides how to proceed according to the thread number and whether the predicate is true or false, for example, whether to read data from the off-chip memory 760 and write it into the buffer 750.
  • if the predicates are all false starting from thread N, three situations can be distinguished: (1) the currently processed thread number is less than N and the value of its predicate is true, in which case the target data can be read from the buffer 750 corresponding to this thread; (2) the currently processed thread number is less than N and the value of the predicate is false, in which case all-zero data can be used as the target data; (3) the currently processed thread number is greater than or equal to N, in which case processing stops.
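The three situations can be sketched as a single decision function (the names and the None "stop" signal are illustrative conventions, not from the patent):

```python
def select_target(thread_no, predicates, first_all_false, buffers):
    """Decide how one thread number is handled by the sort input module.

    first_all_false -- the number N from which all predicates are false.
    Returns target data, or None as the 'stop processing' signal.
    """
    if thread_no >= first_all_false:
        return None                    # case (3): stop processing
    if predicates[thread_no]:
        return buffers[thread_no]      # case (1): read the thread's buffer
    return 0                           # case (2): all-zero target data
```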
  • the sort input module 720 can sort the target data based on the numbers of the corresponding execution threads, so as to allow reading and determining each target data in parallel.
  • the sort output module 730 is configured to write a set of target data for each of the at least one execution thread into a register file of each of the plurality of threads.
  • the sort output module 730 may write the set of target data into each target thread through the broadcast bus.
  • the width of the broadcast bus can match the number of register file ports and the number of threads.
  • a transpose module 740 may be provided between the sort input module 720 and the sort output module 730 .
  • the transpose module 740 may be configured to transpose the set of target data to update the set of target data. In this way, the transposed set of target data can be written into each target thread.
  • FIG. 8 shows a schematic block diagram of an apparatus 800 for loading data according to an embodiment of the present disclosure.
  • Apparatus 800 may be implemented as or included in accelerator system 200 of FIG. 2 .
  • the apparatus 800 may include a plurality of units for performing corresponding steps in the method 400 as discussed in FIG. 4 .
  • Each unit can implement part or all of the functions of at least one of the predicate checking module 710 , sorting input module 720 , sorting output module 730 and transposition module 740 .
  • the apparatus 800 includes: a predicate determining unit 810 configured to determine a plurality of predicates for a plurality of threads based on a received single load instruction, each predicate indicating whether the address specified in the corresponding thread is valid, the address being used to access data in the memory; an execution thread determination unit 820 configured to determine at least one execution thread among the plurality of threads based on the determined plurality of predicates; a target data determination unit 830 configured to determine target data for each execution thread of the at least one execution thread; and a writing unit 840 configured to write the set of target data for each execution thread of the at least one execution thread into the register file of each target thread among the plurality of threads.
  • the target data determining unit 830 is configured to perform one of the following: in response to the execution thread's predicate indicating that the address is valid, fetching target data for the execution thread from the memory based on the address; or, in response to the execution thread's predicate indicating that the address is invalid, determining the target data based on a predetermined value.
  • the target data determining unit 830 is configured to: determine, based on the load instruction, the address register for storing the address in the register file of the execution thread; read the address from the address register; and fetch target data for the execution thread from the memory based on the address.
  • the target data determining unit 830 is further configured to fetch the target data based on a data width of 4 bytes or 16 bytes.
  • the writing unit 840 is configured to: determine the target thread among the plurality of threads based on the load instruction; determine a target register in the register file based on the load instruction; and write, based on a size of the set of target data, the set of target data into at least one consecutive register in the register file, the at least one consecutive register starting from the target register.
  • the apparatus 800 further includes: a transposition unit 850 configured to update the set of target data by transposing the set of target data.
  • the transposition unit 850 is further configured to transpose the set of target data per byte or per double byte.
  • the execution thread determination unit 820 is configured to: determine a target subsequence in the sequence, the target subsequence starting from the first thread in the sequence and ending at the last thread in the sequence whose predicate evaluates to true; and determine all threads in the target subsequence as the at least one execution thread.
  • a load instruction for broadcast-loading data, which is also called a broadcast read instruction.
  • when executed, the broadcast read instruction causes the processing engine to perform the following operations: determine a plurality of predicates for a plurality of threads, each predicate indicating whether an address specified in the corresponding thread is valid, the address being used to access data in a memory; determine at least one execution thread among the plurality of threads based on the determined predicates; determine target data for each execution thread of the at least one execution thread; and write a set of the target data for each execution thread of the at least one execution thread into the register file of each target thread among the plurality of threads.
  • the broadcast read instruction may include a first parameter for specifying an address register and a second parameter for specifying a target register. Based on the broadcast read instruction, a set of target data may be written to at least one consecutive register in the register file of each target thread.
  • a computer-readable storage medium stores a plurality of programs configured to be executed by one or more processing engines, the plurality of programs including instructions for performing the methods described above.
  • a computer program product comprises a plurality of programs configured for execution by one or more processing engines, the plurality of programs comprising instructions for performing the methods described above.
  • an accelerator system includes: a processor; and a memory coupled to the processor, the memory having instructions stored therein that, when executed by the processor, cause the device to perform the method described above.
  • the present disclosure may be a method, apparatus, system and/or computer program product.
  • a computer program product may include a computer readable storage medium having computer readable program instructions thereon for carrying out various aspects of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

A method and electronic apparatus for loading data in a single-instruction multiple-thread computing system. In the method, based on a received single load instruction, a plurality of predicates are determined for a plurality of threads, each predicate indicating whether an address specified in the corresponding thread is valid, the address being used to access data in a memory (410); at least one execution thread among the plurality of threads is determined based on the determined plurality of predicates (420); target data is determined for each execution thread of the at least one execution thread (430); and the set of the target data for each execution thread of the at least one execution thread is written into the register file of each target thread among the plurality of threads (440). Based on a single load instruction, the method can determine corresponding target data for each execution thread and write the set of target data to each target thread, which can improve the efficiency of data exchange between registers and memory.

Description

Method and apparatus for loading data in a single-instruction multiple-thread computing system
This application claims priority to Chinese patent application No. 202210122226.3, entitled "Method and apparatus for loading data in a single-instruction multiple-thread computing system", filed with the China National Intellectual Property Administration on February 9, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate generally to the field of electronics, and more particularly to a method and apparatus for loading data in a single-instruction multiple-thread (SIMT) computing system.
Background
In a conventional single-instruction multiple-thread computing system, each thread has its own register file (i.e., an array of registers), and each thread can perform thread-level data exchange between its registers and memory.
Currently, a typical load-store architecture is usually adopted to exchange data between registers and memory. For example, a load instruction can be used to read data from memory and store the data into a register. However, conventional load instructions lack optimization for data exchange across multiple threads. A scheme is therefore needed for efficiently loading data for multiple threads in a single-instruction multiple-thread computing system.
Summary
Embodiments of the present disclosure provide a technical solution for loading data in a single-instruction multiple-thread computing system.
In a first aspect, a method for loading data in a single-instruction multiple-thread computing system is provided. The method includes: determining, based on a received single load instruction, a plurality of predicates for the plurality of threads, each predicate indicating whether an address specified in the corresponding thread is valid, the address being used to access data in a memory; determining at least one execution thread among the plurality of threads based on the determined plurality of predicates; determining target data for each execution thread of the at least one execution thread; and writing the set of the target data for each execution thread of the at least one execution thread into the register file of each target thread among the plurality of threads.
In a second aspect, an apparatus for loading data in a single-instruction multiple-thread computing system is provided. The apparatus includes: a predicate determining unit configured to determine, based on a received single load instruction, a plurality of predicates for the plurality of threads, each predicate indicating whether an address specified in the corresponding thread is valid, the address being used to access data in a memory; an execution thread determining unit configured to determine at least one execution thread among the plurality of threads based on the determined plurality of predicates; a target data determining unit configured to determine target data for each execution thread of the at least one execution thread; and a writing unit configured to write the set of the target data for each execution thread of the at least one execution thread into the register file of each target thread among the plurality of threads.
In a third aspect, a computer-readable storage medium is provided. The medium stores a plurality of programs configured to be executed by one or more processing engines, the plurality of programs including instructions for performing the method of the first aspect of the present disclosure.
In a fourth aspect, a computer program product is provided. The computer program product includes a plurality of programs configured to be executed by one or more processing engines, the plurality of programs including instructions for performing the method of the first aspect of the present disclosure.
With exemplary implementations of the present disclosure, corresponding target data can be determined for each execution thread and the set of target data can be written to each target thread based on a single load instruction. In this way, the efficiency of data exchange between registers and memory can be improved.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the present disclosure, taken in conjunction with the accompanying drawings, in which like reference numerals generally denote like components.
Fig. 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
Fig. 2 shows a schematic diagram of a chip according to an embodiment of the present disclosure;
Fig. 3 shows a schematic diagram of data exchange between registers and memory using a conventional load instruction;
Fig. 4 shows a flowchart of a method for loading data according to an embodiment of the present disclosure;
Fig. 5 shows a schematic diagram of the result of loading data according to an embodiment of the present disclosure;
Figs. 6a and 6b show schematic diagrams of transposed storage according to an embodiment of the present disclosure;
Fig. 7 shows a schematic diagram of a process of loading data according to an embodiment of the present disclosure; and
Fig. 8 shows a schematic block diagram of an apparatus for loading data according to an embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be thorough and complete, and will fully convey the scope of the present disclosure to those skilled in the art.
As used herein, the term "include" and its variants denote open-ended inclusion, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one further embodiment". The terms "first", "second", and the like may refer to different or identical objects. Other explicit and implicit definitions may also be included below.
An overview of an environment for carrying out exemplary implementations of the present disclosure is first described with reference to Fig. 1. Fig. 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. The example environment 100 may include an electronic device with computing capability, such as a computer. In one embodiment, the example environment 100 includes, for example, a central processing unit (CPU) 120, a system memory 110, a north bridge/memory bridge 130, an accelerator system 140, an external storage device 150, and a south bridge/input-output (IO) bridge 160. The system memory 110 may include volatile memory such as dynamic random access memory (DRAM). The north bridge/memory bridge 130 integrates, for example, a memory controller and a PCIe controller, and is responsible for data exchange between the CPU 120 and high-speed interfaces as well as for bridging the CPU 120 and the south bridge/IO bridge 160. The south bridge/IO bridge 160 serves the computer's low-speed interfaces, such as a serial advanced technology attachment (SATA) controller. The accelerator system 140 may include devices or chips, such as a graphics processing unit (GPU) or an artificial intelligence (AI) accelerator, for accelerated processing of data such as graphics and video. The external storage device 150 may be, for example, volatile memory such as DRAM located outside the accelerator system 140.
In the present disclosure, the external storage device 150 is also referred to as off-chip memory, i.e., memory located outside the chip of the accelerator system 140. In contrast, the chip of the accelerator system 140 also has volatile memory inside, such as a level-one (L1) cache and optionally a level-two (L2) cache, which will be described below in connection with some embodiments of the present disclosure. Although Fig. 1 shows one example environment 100 in which embodiments of the present disclosure can be implemented, the present disclosure is not limited thereto. Some embodiments of the present disclosure can also be used in other application environments that have an accelerator system such as a GPU, for example environments based on the ARM architecture or the RISC-V architecture.
Fig. 2 shows a schematic block diagram of an accelerator system 200 according to an embodiment of the present disclosure. The accelerator system 200 may be a specific implementation of the chip of the accelerator system 140 in Fig. 1. The accelerator system 200 includes, for example, an accelerator system chip such as a GPU. According to an exemplary implementation of the present disclosure, the accelerator system 200 may include a stream processor (SP) 210, a page table apparatus 220, a processing engine (PE) unit 230, a direct memory access (DMA) controller 240, an L1 cache 260, and an L2 cache 250.
The accelerator system 200 may be controlled by a host device such as the CPU 120 and receive instructions from the CPU 120. The SP 210 analyzes instructions from the CPU 120 and dispatches the analyzed operations to the PE unit 230, the page table apparatus 220, and the DMA controller 240 for processing. The page table apparatus 220 manages the virtual storage accessible to the accelerator system 200. In the present disclosure, in addition to the L1 cache 260, the virtual storage may include the L2 cache 250 and off-chip memory such as the external storage device 150 in Fig. 1. The page table apparatus 220 is jointly maintained by the SP 210, the PE unit 230, and the DMA controller 240.
The PE unit 230 may include a plurality of processing engines PE_1, PE_2, ..., PE_N, where N denotes an integer greater than 1. Each PE in the PE unit 230 may be a single-instruction multiple-thread device. In a PE, each thread may have its own register file, and all threads of each PE also share a uniform register file. Multiple PEs may perform the same or different processing work in parallel. For example, a PE may perform processing such as sorting or convolution on the data to be processed.
A user (e.g., a programmer) may write an application program to achieve a particular purpose. For an application requiring a large amount of computation, the application may be divided into multiple parts, and the multiple parts may be run in parallel on multiple PEs.
Further, one or more threads may be started at each PE. Each thread may have its own register file and execution unit and use its own memory address. The execution unit may include a floating-point/fixed-point unit supporting multiple data types and an arithmetic logic unit for performing arithmetic and logic operations. Arithmetic operations include, for example, addition, subtraction, multiplication, and division of floating-point and fixed-point numbers. Logic operations include, for example, logical AND, OR, and NOT.
As described above, a typical load-store architecture may be used to exchange data between each thread and the memory. Fig. 3 shows a schematic diagram of data exchange between registers and memory using a conventional load instruction. As shown in Fig. 3, threads 310-1 to 310-N (collectively, threads 310) in a PE 300 can exchange data with a memory 320. Each thread has its own register file; for example, thread 310-1 has a register file 330-1, thread 310-2 has a register file 330-2, and so on. Each thread also has its own data path, e.g., data paths 340-1 to 340-N (collectively, data paths 340). The memory 320 may include on-chip memory (e.g., the L1 cache and the L2 cache) as well as off-chip memory.
In some scenarios, the threads 310 in the PE 300 need to read data from the same address in the memory 320 and load the same read data into each thread's own registers (an operation referred to as a broadcast load for short). For example, when performing matrix multiplication, each row of matrix A needs to be multiplied by the same column of matrix B, which requires broadcasting the same column of matrix B to every thread that processes a row of matrix A.
In such a scenario, with a conventional load instruction, each thread must specify the same address. Then, for each thread, data is read from the memory 320 based on that address and the read data is written into the thread's registers. In other words, the data is read N times, that is, the data is copied N times in the data paths 340 between the registers and the memory 320. Reading and copying the same data multiple times reduces the efficiency of data exchange between the registers and the memory 320 and increases the power consumption of the data exchange.
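The cost of the conventional approach can be illustrated with a small sketch (a hypothetical simulation for illustration only, not part of the disclosure): with per-thread loads, each of the N threads issues its own read of the shared address, so the memory sees N reads of identical data; a broadcast load reads the data once and fans it out to every thread's register.

```python
# Hypothetical illustration: count memory reads when N threads load
# the same address, conventional per-thread load vs. broadcast load.
N = 8
memory = {0x100: 0xAB}
reads = 0

def mem_read(addr):
    """Read one value from the model memory, counting each access."""
    global reads
    reads += 1
    return memory[addr]

# Conventional load: every thread reads the shared address itself.
regs_conventional = [mem_read(0x100) for _ in range(N)]
conventional_reads = reads  # N reads of identical data

# Broadcast load: read once, then copy to every thread's register.
reads = 0
value = mem_read(0x100)
regs_broadcast = [value] * N
broadcast_reads = reads  # a single read

print(conventional_reads, broadcast_reads)  # 8 1
```

Both variants leave every thread with the same value; the broadcast variant simply avoids the N-fold traffic on the data paths.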
To at least partially address the above and other deficiencies of existing solutions, according to an exemplary implementation of the present disclosure, a technical solution is provided for efficiently loading data in a single-instruction multiple-thread computing system. In this solution, based on a received single load instruction, a plurality of predicates are determined for a plurality of threads, each predicate indicating whether an address specified in the corresponding thread is valid, the address being used to access data in a memory; at least one execution thread among the plurality of threads is determined based on the determined plurality of predicates; target data is determined for each execution thread of the at least one execution thread; and the set of the target data for each execution thread of the at least one execution thread is written into the register file of each target thread among the plurality of threads.
In this way, based on a single load instruction, corresponding target data can be determined for each execution thread, and the set of target data, that is, the multiple pieces of target data respectively corresponding to the multiple execution threads, can be written to each target thread, without reading the same target data multiple times for each execution thread. Thus, the efficiency of data exchange between registers and memory can be improved.
Details of this solution are described below with reference to Figs. 4 to 8.
Fig. 4 shows a flowchart of a method 400 for loading data according to an embodiment of the present disclosure. The method 400 may be implemented at a SIMT computing system including a plurality of threads, e.g., at the PE 300 shown in Fig. 3. Specifically, it may be implemented by an input-output module (not shown in Fig. 3) in the PE 300 that exchanges data with the memory 320.
At block 410, based on a received single load instruction, a plurality of predicates are determined for a plurality of threads, each predicate indicating whether an address specified in the corresponding thread is valid, the address being used to access data in a memory. The plurality of threads may be some or all of the threads in the PE 300; for example, they may be a subset of the threads started by the PE 300. Each of the plurality of threads may specify a corresponding predicate; for example, the value of the predicate may be specified in the thread's predicate register.
The predicate may indicate whether the address specified in the thread for accessing data in the memory is valid. In other words, the predicate may indicate whether data can be read from the memory 320 based on the address specified in the thread. For example, when the value of the predicate is true, data can be read from the memory 320 based on the address specified in the thread. Conversely, when the value of the predicate is false, the address specified in the thread may be regarded as invalid, i.e., data in the memory 320 is not read based on that address.
At block 420, at least one execution thread among the plurality of threads is determined based on the determined plurality of predicates. An execution thread is a thread for which data to be written is determined. The data to be written is the data to be written into the threads, also referred to as target data. The execution threads among the plurality of threads may be determined based on the predicates and a predetermined rule.
In some embodiments, only the threads whose predicates indicate that the address is valid may be determined as execution threads. For example, the threads whose predicate value is true may be determined as execution threads, and the threads whose predicate value is false may be excluded from the execution threads.
Alternatively, the plurality of threads may be ordered by number into a sequence and a target subsequence of the sequence may be determined, with all threads in the target subsequence determined as execution threads. The target subsequence starts from the first thread in the sequence (e.g., thread 310-1) and ends at the last thread in the sequence whose predicate value is true. That is, the predicate values of the remaining threads after the target subsequence are all false. In such embodiments, all threads whose predicate value is true, as well as some threads whose predicate value is false, may be determined as execution threads.
Additionally, the target subsequence may also be determined based on other rules. For example, it may be stipulated that the number of threads in the target subsequence is an integer multiple of N, where N is an integer greater than or equal to 1. In that case, the target subsequence may end at the 1st, 2nd, ..., or (N-1)th thread after the last thread whose predicate value is true.
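The selection rule above can be sketched as follows (the helper name `select_execution_threads` and its `multiple_of` parameter are illustrative assumptions, not terminology from the disclosure): take the prefix of the thread sequence that ends at the last thread whose predicate is true, optionally padding the count up to a multiple of N.

```python
def select_execution_threads(predicates, multiple_of=1):
    """Return the indices of the execution threads: the prefix of the
    thread sequence ending at the last thread whose predicate is true,
    optionally rounded up to a multiple of `multiple_of`."""
    last_true = -1
    for i, p in enumerate(predicates):
        if p:
            last_true = i
    if last_true < 0:
        return []  # no thread has a valid address
    count = last_true + 1
    # Pad so the subsequence length is a multiple of `multiple_of`,
    # without exceeding the number of available threads.
    if count % multiple_of:
        count = min(len(predicates),
                    count + multiple_of - count % multiple_of)
    return list(range(count))

print(select_execution_threads([True, True, False, True, False, False]))
# [0, 1, 2, 3]: the subsequence ends at the last true predicate
```

Note that thread 2 is included even though its predicate is false, because it precedes the last thread whose predicate is true.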
At block 430, target data is determined for each execution thread of the at least one execution thread. As described above, the target data is the data to be written into the threads. Different methods may be used to determine the corresponding target data for different execution threads.
In some embodiments, in response to the predicate of an execution thread indicating that the address is valid, the target data for the execution thread may be fetched from the memory 320 based on the address. Conversely, in response to the predicate of an execution thread indicating that the address is invalid, the target data may be determined based on a predetermined value. For example, the target data may be determined to be zero.
In some embodiments, different execution threads specify different addresses. In this case, based on a single load instruction, multiple pieces of target data from different addresses can be read from the memory 320, and each piece of target data is read only once. In this way, the efficiency of data exchange between the registers of the threads and the memory 320 can be improved.
In some embodiments, based on the load instruction, an address register for storing the address may be determined in the register file of the execution thread, and the address may be read from that address register. Based on the address, the target data for the execution thread may be fetched from the memory 320. The load instruction may include a parameter for identifying the address register; for example, the parameter may be the number of the address register in the register file.
In some embodiments, depending on the load instruction, the target data may be fetched from the memory 320 based on a data width of 4 bytes or 16 bytes.
At block 440, the set of the target data for each execution thread of the at least one execution thread is written into the register file of each target thread among the plurality of threads. A target thread is a thread among the plurality of threads into which the set of target data will be written.
The target threads may be one or more of the plurality of threads, or may be each of the plurality of threads. In some embodiments, the target threads may be determined based on the single load instruction; for example, the numbers of the target threads may be determined based on a modifier of the load instruction.
The set of target data includes the target data for all execution threads. In this way, based on a single load instruction, more identical data can be written to each target thread, improving the efficiency of loading data from the memory 320.
In some embodiments, within the set of target data, the pieces of target data are ordered in the same order as the execution threads. For example, the target data may be ordered by the numbers of the corresponding execution threads. In this way, while target data is determined for each execution thread in parallel, the multiple pieces of target data can be written into the registers of the target threads in a definite order.
In some embodiments, when the set of target data is written into each target thread, it may be written into designated registers. The designated registers may be at least one register determined based on various predetermined rules.
The designated registers may be at least one consecutive register, i.e., registers whose addresses in the register file are consecutive. The designated registers may start from a target register.
In some embodiments, the target register in the register file may be determined based on the load instruction. For example, the load instruction may include a parameter for identifying the target register, e.g., the number of the target register in the register file.
Based on the size of the set of target data and the size of each register, the set of target data may be written into at least one consecutive register in the register file starting from the target register.
For example, if one execution thread specifies 16 bytes of target data, then 8 execution threads specify a set of target data of 8 x 16 = 128 bytes. With registers 4 bytes wide, each piece of target data needs to be written into 4 registers. The set of target data can be written in order into 128 / 4 = 32 registers in the register file. It should be understood that if the data size of the set of target data exceeds the size of the register file, the excess data can be discarded.
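The register-count arithmetic in the example above can be sketched as follows (the helper name and the register-file size are illustrative assumptions):

```python
def registers_needed(num_exec_threads, bytes_per_load, reg_width,
                     regfile_regs):
    """Number of consecutive registers, starting at the target
    register, needed to hold the set of target data; any data beyond
    the end of the register file is discarded."""
    total_bytes = num_exec_threads * bytes_per_load
    regs = total_bytes // reg_width
    return min(regs, regfile_regs)

# 8 execution threads x 16-byte loads = 128 bytes; with 4-byte
# registers that is 128 / 4 = 32 consecutive registers.
print(registers_needed(8, 16, 4, regfile_regs=64))  # 32
```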
Fig. 5 shows a schematic diagram of the result of loading data according to an embodiment of the present disclosure. Fig. 5 shows threads 310-1, 310-2, 310-3, and 310-4 as well as the memory 320. It should be understood that threads 310-1 to 310-4 are merely examples of target threads, and the target threads may include 1, 2, 3, 4, or more threads. As shown in Fig. 5, each thread may include a predicate register, an address register, and data registers for storing data.
Taking thread 310-1 as an example, the register file of thread 310-1 may include a predicate register 501-1 for storing the value of the predicate, an address register 502-1 for storing the address of data in the memory, and data registers 503-1. The data registers 503-1 may include a target register 504-1, which serves as the starting register for storing the target data. The details of threads 310-2, 310-3, and 310-4 are not repeated.
As shown in Fig. 5, when the data access width matches the data width of the registers (e.g., 4 bytes), one piece of target data can be written into one register, and the pieces of target data can be written in order, by the numbers of the corresponding execution threads, into the register file of each target thread.
Specifically, in response to the predicate value in thread 310-1 being true (T) and the specified address being A, the data A' at address A in the memory 320 (denoted as data 520-1) can be written, as target data, into the first register of each target thread, i.e., the target register.
In response to the predicate value in thread 310-2 being true (T) and the specified address being B, the data B' at address B in the memory 320 (denoted as data 520-2) can be written, as target data, into the second register of each target thread.
In response to the predicate value in thread 310-3 being false (F) and the specified address being C, the data C' at address C in the memory 320 (denoted as data 520-3) is not written into the target threads; instead, target data equal to zero can be written into the third register of each target thread.
In response to the predicate value in thread 310-4 being true (T) and the specified address being D, the data D' at address D in the memory 320 (denoted as data 520-4) can be written, as target data, into the fourth register of each target thread.
In this way, based on a single load instruction, by specifying the address register and the target register, the same set of target data can be written into each target thread, realizing a broadcast load of data and improving the efficiency of data exchange between the registers and the memory 320.
In some embodiments, before the set of target data is written into the threads, the set of target data may be transposed, and the transposed set of target data may be written into each target thread (an operation also referred to as transposed storage).
In some embodiments, the data may be transposed per single byte or per double byte. Figs. 6a and 6b show schematic diagrams of transposed storage according to an embodiment of the present disclosure. It should be understood that, as in Fig. 5, threads 310-1 to 310-4 are merely examples of target threads, and the target threads may include 1, 2, 3, 4, or more threads. In addition, in Figs. 6a and 6b it is assumed that the data A', B', C', and D' to be read are 4 bytes in size and that the registers are also 4 bytes in size.
For convenience of description, data A' can be split into data a1, a2, a3, a4, where a1, a2, a3, a4 are each a single byte. Similarly, the details of data B' and D' are not repeated. Thus, the data written into the first register shown in Fig. 5 is a1, a2, a3, a4; the data written into the second register is b1, b2, b3, b4; the data written into the third register is 0, 0, 0, 0; and the data written into the fourth register is d1, d2, d3, d4.
Fig. 6a shows a schematic diagram of transposed storage per single byte. As shown in Fig. 6a, after transposition, the data written into the first register is a1, b1, 0, d1; the second register holds a2, b2, 0, d2; the third register holds a3, b3, 0, d3; and the fourth register holds a4, b4, 0, d4.
Fig. 6b shows a schematic diagram of transposed storage per double byte. As shown in Fig. 6b, after transposition, the data written into the first register is a1, a2, b1, b2; the second register holds a3, a4, b3, b4; the third register holds 0, 0, d1, d2; and the fourth register holds 0, 0, d3, d4.
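The single-byte and double-byte layouts of Figs. 6a and 6b can be modeled with a short sketch (an illustrative model, not the hardware implementation; each row stands for one thread's 4-byte target data, with C' shown as zeros):

```python
def transpose_store(rows, granularity=1, reg_width=4):
    """Transpose a set of target data per `granularity` bytes
    (1 = single byte, 2 = double byte), modeled after Figs. 6a/6b.
    With 4-byte registers, reg_width // granularity consecutive rows
    share each output register, so rows are processed in groups."""
    group = reg_width // granularity  # rows interleaved per register
    out = []
    for base in range(0, len(rows), group):
        block = rows[base:base + group]
        # Emit the k-th granularity-sized chunk of every row in the
        # block as one output register.
        for k in range(reg_width // granularity):
            reg = []
            for row in block:
                reg.extend(row[k * granularity:(k + 1) * granularity])
            out.append(reg)
    return out

rows = [["a1", "a2", "a3", "a4"],
        ["b1", "b2", "b3", "b4"],
        [0, 0, 0, 0],
        ["d1", "d2", "d3", "d4"]]
print(transpose_store(rows, 1))  # first register: a1, b1, 0, d1
print(transpose_store(rows, 2))  # first register: a1, a2, b1, b2
```

The single-byte case reduces to a full 4 x 4 byte transpose; the double-byte case interleaves pairs of rows, matching the register contents listed for Figs. 6a and 6b.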
A hardware implementation of embodiments of the present disclosure is described below with reference to Figs. 7 and 8. Fig. 7 shows a schematic diagram of a process of loading data according to an embodiment of the present disclosure. Fig. 7 shows a predicate checking module 710, a sorting input module 720, a sorting output module 730, and a transposition module 740. Fig. 7 also shows a plurality of buffers 750-1, 750-2, ..., 750-N (collectively, buffers 750) respectively corresponding to the plurality of threads 310, as well as an off-chip memory 760. The buffers 750 and the off-chip memory 760 may be part of the memory 320.
It should be understood that Fig. 7 only shows an example of loading data from the memory 320 and does not show all details of the data exchange. For example, although not shown, the address specified in each thread may be transmitted to the memory 320 over an address bus.
In some embodiments, the predicate checking module 710 determines, based on a received single load instruction, a plurality of predicates for the plurality of threads 310, each predicate indicating whether an address specified in the corresponding thread is valid, the address being used to access data in the memory.
The predicate checking module 710 also determines at least one execution thread among the plurality of threads 310 based on the determined plurality of predicates. For example, the predicate checking module 710 may determine the threads whose predicate value is true as execution threads.
In some embodiments, the predicate checking module 710 may record the thread in the sequence of threads from which onward all predicates are false. For example, the predicate checking module 710 may determine that the predicates are all false starting from the thread numbered N.
In some embodiments, the predicate checking module 710 may determine, based on the predicate, whether to read the data at the address specified in an execution thread from the off-chip memory 760. For example, when the predicate value of an execution thread is true, the predicate checking module 710 may direct the corresponding data to be read from the off-chip memory 760 and buffered in the buffer 750 corresponding to the execution thread.
The sorting input module 720 is configured to determine target data for each execution thread of the at least one execution thread. As described above, the target data may be determined based on the predicate. In some embodiments, the sorting input module 720 passes the number of the thread currently being processed to the predicate checking module 710. The predicate checking module 710 decides how to proceed based on the thread number and the truth value of the predicate, for example whether to read data from the off-chip memory 760 and write it into the buffers 750.
Suppose the predicates are all false starting from thread N. Three cases can then be distinguished: (1) if the number of the thread currently being processed is less than N and its predicate value is true, the target data can be read from the buffer 750 corresponding to that thread; (2) if the number of the thread currently being processed is less than N and its predicate value is false, all-zero data can be used as the target data; and (3) if the number of the thread currently being processed is greater than or equal to N, processing stops.
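The three cases can be sketched as a small dispatch function (hypothetical names for illustration; `buffers[i]` stands for the buffer 750 corresponding to thread i, and `None` stands for "stop processing"):

```python
def fetch_target_data(thread_id, first_all_false, predicates, buffers):
    """Model the three cases of the sorting-input stage: threads
    before `first_all_false` yield buffered data or zeros depending
    on their predicate; threads from `first_all_false` on are not
    processed at all."""
    if thread_id >= first_all_false:
        return None                # case 3: stop processing
    if predicates[thread_id]:
        return buffers[thread_id]  # case 1: read the buffered data
    return 0                       # case 2: all-zero target data

preds = [True, True, False, True]
bufs = {0: "A'", 1: "B'", 3: "D'"}
first_false = 4  # predicates are all false from thread 4 onward
print([fetch_target_data(t, first_false, preds, bufs) for t in range(5)])
# ["A'", "B'", 0, "D'", None]
```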
In some embodiments, the sorting input module 720 may sort the target data by the numbers of the corresponding execution threads, allowing the pieces of target data to be read and determined in parallel.
The sorting output module 730 is configured to write the set of the target data for each execution thread of the at least one execution thread into the register file of each target thread among the plurality of threads. The sorting output module 730 may write the set of target data into each target thread over a broadcast bus. The width of the broadcast bus may match the ports of the register files and the number of the plurality of threads.
In some embodiments, a transposition module 740 may be placed between the sorting input module 720 and the sorting output module 730. The transposition module 740 may be configured to update the set of target data by transposing it. In this way, the transposed set of target data can be written into each target thread.
Fig. 8 shows a schematic block diagram of an apparatus 800 for loading data according to an embodiment of the present disclosure. The apparatus 800 may be implemented as, or included in, the accelerator system 200 of Fig. 2. The apparatus 800 may include a plurality of units for performing the corresponding steps of the method 400 discussed in Fig. 4. Each unit may implement part or all of the functions of at least one of the predicate checking module 710, the sorting input module 720, the sorting output module 730, and the transposition module 740.
As shown in Fig. 8, the apparatus 800 includes: a predicate determining unit 810 configured to determine, based on a received single load instruction, a plurality of predicates for a plurality of threads, each predicate indicating whether an address specified in the corresponding thread is valid, the address being used to access data in a memory; an execution thread determining unit 820 configured to determine at least one execution thread among the plurality of threads based on the determined plurality of predicates; a target data determining unit 830 configured to determine target data for each execution thread of the at least one execution thread; and a writing unit 840 configured to write the set of the target data for each execution thread of the at least one execution thread into the register file of each target thread among the plurality of threads.
According to an exemplary implementation of the present disclosure, the target data determining unit 830 is configured to perform one of the following: in response to the predicate of an execution thread indicating that the address is valid, fetching the target data for the execution thread from the memory based on the address; or in response to the predicate of an execution thread indicating that the address is invalid, determining the target data based on a predetermined value.
According to an exemplary implementation of the present disclosure, the target data determining unit 830 is configured to: determine, based on the load instruction, an address register in the register file of the execution thread for storing the address; read the address from the address register; and fetch the target data for the execution thread from the memory based on the address.
According to an exemplary implementation of the present disclosure, the target data determining unit 830 is further configured to fetch the target data based on a data width of 4 bytes or 16 bytes.
According to an exemplary implementation of the present disclosure, the writing unit 840 is configured to: determine the target thread among the plurality of threads based on the load instruction; determine a target register in the register file based on the load instruction; and write, based on the size of the set of target data, the set of target data into at least one consecutive register in the register file, the at least one consecutive register starting from the target register.
According to an exemplary implementation of the present disclosure, the apparatus 800 further includes a transposition unit 850 configured to update the set of target data by transposing the set of target data.
According to an exemplary implementation of the present disclosure, the transposition unit 850 is further configured to transpose the set of target data per byte or per double byte.
According to an exemplary implementation of the present disclosure, the plurality of threads are ordered by number into a sequence, and the execution thread determining unit 820 is configured to: determine a target subsequence of the sequence, the target subsequence starting from the first thread in the sequence and ending at the last thread in the sequence whose predicate value is true; and determine all threads in the target subsequence as the at least one execution thread.
According to an exemplary implementation of the present disclosure, a load instruction for broadcast-loading data, also called a broadcast read instruction, is provided. When executed, the broadcast read instruction causes a processing engine to perform the following operations: determining a plurality of predicates for a plurality of threads, each predicate indicating whether an address specified in the corresponding thread is valid, the address being used to access data in a memory; determining at least one execution thread among the plurality of threads based on the determined plurality of predicates; determining target data for each execution thread of the at least one execution thread; and writing the set of the target data for each execution thread of the at least one execution thread into the register file of each target thread among the plurality of threads.
According to an exemplary implementation of the present disclosure, the broadcast read instruction may include a first parameter for specifying the address register and a second parameter for specifying the target register. Based on the broadcast read instruction, the set of target data may be written into at least one consecutive register in the register file of each target thread.
According to an exemplary implementation of the present disclosure, a computer-readable storage medium is provided. The medium stores a plurality of programs configured to be executed by one or more processing engines, the plurality of programs including instructions for performing the methods described above.
According to an exemplary implementation of the present disclosure, a computer program product is provided. The computer program product includes a plurality of programs configured to be executed by one or more processing engines, the plurality of programs including instructions for performing the methods described above.
According to an exemplary implementation of the present disclosure, an accelerator system is provided. The accelerator system includes: a processor; and a memory coupled to the processor, the memory having instructions stored therein which, when executed by the processor, cause the device to perform the methods described above.
The present disclosure may be a method, an apparatus, a system, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for carrying out various aspects of the present disclosure.
Furthermore, although the operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the present disclosure. Certain features described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (13)

  1. A method for loading data in a single-instruction multiple-thread computing system, the single-instruction multiple-thread computing system comprising a plurality of threads, the method comprising:
    determining, based on a received single load instruction, a plurality of predicates for the plurality of threads, each predicate indicating whether an address specified in the corresponding thread is valid, the address being used to access data in a memory;
    determining at least one execution thread among the plurality of threads based on the determined plurality of predicates;
    determining target data for each execution thread of the at least one execution thread; and
    writing a set of the target data for each execution thread of the at least one execution thread into a register file of each target thread among the plurality of threads.
  2. The method of claim 1, wherein determining target data for each execution thread of the at least one execution thread comprises one of the following:
    in response to the predicate of the execution thread indicating that the address is valid, fetching the target data for the execution thread from the memory based on the address; or
    in response to the predicate of the execution thread indicating that the address is invalid, determining the target data based on a predetermined value.
  3. The method of claim 1, wherein determining target data for each execution thread of the at least one execution thread comprises:
    determining, based on the load instruction, an address register in the register file of the execution thread for storing the address;
    reading the address from the address register; and
    fetching the target data for the execution thread from the memory based on the address.
  4. The method of claim 1, wherein writing the set of the target data for each execution thread of the at least one execution thread into the register file of each target thread among the plurality of threads comprises:
    determining the target thread among the plurality of threads based on the load instruction;
    determining a target register in the register file based on the load instruction; and
    writing, based on a size of the set of the target data, the set of the target data into at least one consecutive register in the register file, the at least one consecutive register starting from the target register.
  5. The method of claim 1, wherein writing the set of the target data for each execution thread of the at least one execution thread into the register file of each target thread among the plurality of threads comprises:
    transposing the set of the target data; and
    writing the transposed set of the target data into the register file of each target thread among the plurality of threads.
  6. The method of claim 1, wherein the plurality of threads are ordered by number into a sequence, and determining at least one execution thread among the plurality of threads based on the determined plurality of predicates comprises:
    determining a target subsequence of the sequence, the target subsequence starting from the first thread in the sequence and ending at the last thread in the sequence whose predicate value is true; and
    determining all threads in the target subsequence as the at least one execution thread.
  7. The method of claim 6, wherein the target data in the set of the target data are ordered by the numbers of the corresponding threads.
  8. The method of claim 5, wherein transposing the set of the target data comprises:
    transposing the set of the target data per byte or per double byte.
  9. The method of claim 3, wherein fetching the target data for the execution thread from the memory based on the address comprises:
    fetching the target data based on a data width of 4 bytes or 16 bytes.
  10. An apparatus for loading data in a single-instruction multiple-thread computing system, the single-instruction multiple-thread computing system comprising a plurality of threads, the apparatus comprising:
    a predicate determining unit configured to determine, based on a received single load instruction, a plurality of predicates for the plurality of threads, each predicate indicating whether an address specified in the corresponding thread is valid, the address being used to access data in a memory;
    an execution thread determining unit configured to determine at least one execution thread among the plurality of threads based on the determined plurality of predicates;
    a target data determining unit configured to determine target data for each execution thread of the at least one execution thread; and
    a writing unit configured to write a set of the target data for each execution thread of the at least one execution thread into a register file of each target thread among the plurality of threads.
  11. The apparatus of claim 10, further comprising a transposition unit configured to update the set of the target data by transposing the set of the target data.
  12. A computer-readable storage medium storing a plurality of programs, the plurality of programs being configured to be executed by one or more processing engines, the plurality of programs comprising instructions for performing the method of any one of claims 1-9.
  13. A computer program product comprising a plurality of programs, the plurality of programs being configured to be executed by one or more processing engines, the plurality of programs comprising instructions for performing the method of any one of claims 1-9.
PCT/CN2022/107081 2022-02-09 2022-07-21 Method and apparatus for loading data in a single-instruction multiple-thread computing system WO2023151231A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210122226.3A CN114510271B (zh) 2022-02-09 2022-02-09 Method and apparatus for loading data in a single-instruction multiple-thread computing system
CN202210122226.3 2022-02-09

Publications (1)

Publication Number Publication Date
WO2023151231A1 true WO2023151231A1 (zh) 2023-08-17

Family

ID=81552575

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107081 WO2023151231A1 (zh) 2022-02-09 2022-07-21 Method and apparatus for loading data in a single-instruction multiple-thread computing system

Country Status (2)

Country Link
CN (1) CN114510271B (zh)
WO (1) WO2023151231A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114510271B (zh) * 2022-02-09 2023-08-15 海飞科(南京)信息技术有限公司 用于在单指令多线程计算系统中加载数据的方法和装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309702A * 2012-03-05 2013-09-18 NVIDIA Corp. Uniform load processing for parallel thread sub-sets
US20140013087A1 * 2011-03-25 2014-01-09 Freescale Semiconductor, Inc Processor system with predicate register, computer system, method for managing predicates and computer program product
CN108140011A * 2015-10-14 2018-06-08 Arm Ltd. Vector load instruction
CN109426519A * 2017-08-31 2019-03-05 NVIDIA Corp. Inline data inspection for workload simplification
CN112241290A * 2019-07-16 2021-01-19 NVIDIA Corp. Techniques for efficiently performing data reductions in parallel processing units
CN114510271A * 2022-02-09 2022-05-17 海飞科(南京)信息技术有限公司 Method and apparatus for loading data in a single-instruction multiple-thread computing system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7984248B2 (en) * 2004-12-29 2011-07-19 Intel Corporation Transaction based shared data operations in a multiprocessor environment
US7921263B2 (en) * 2006-12-22 2011-04-05 Broadcom Corporation System and method for performing masked store operations in a processor
JP5043560B2 (ja) * 2007-08-24 2012-10-10 Panasonic Corp. Program execution control device
US8661226B2 (en) * 2007-11-15 2014-02-25 Nvidia Corporation System, method, and computer program product for performing a scan operation on a sequence of single-bit values using a parallel processor architecture
US10360039B2 (en) * 2009-09-28 2019-07-23 Nvidia Corporation Predicted instruction execution in parallel processors with reduced per-thread state information including choosing a minimum or maximum of two operands based on a predicate value
CN104011670B (zh) * 2011-12-22 2016-12-28 英特尔公司 用于基于向量写掩码的内容而在通用寄存器中存储两个标量常数之一的指令
US11755484B2 (en) * 2015-06-26 2023-09-12 Microsoft Technology Licensing, Llc Instruction block allocation
US20170177352A1 (en) * 2015-12-18 2017-06-22 Intel Corporation Instructions and Logic for Lane-Based Strided Store Operations
US11194583B2 (en) * 2019-10-21 2021-12-07 Advanced Micro Devices, Inc. Speculative execution using a page-level tracked load order queue


Also Published As

Publication number Publication date
CN114510271A (zh) 2022-05-17
CN114510271B (zh) 2023-08-15

Similar Documents

Publication Publication Date Title
US10860326B2 (en) Multi-threaded instruction buffer design
US8327109B2 (en) GPU support for garbage collection
US8639730B2 (en) GPU assisted garbage collection
US8904153B2 (en) Vector loads with multiple vector elements from a same cache line in a scattered load operation
US20070022428A1 (en) Context switching method, device, program, recording medium, and central processing unit
US11231930B2 (en) Methods and systems for fetching data for an accelerator
JP2006107497A (ja) 制御方法、処理方法、またはそれらを利用した処理システム、コンピュータ処理システム、コンピュータのネットワーク
US11947821B2 (en) Methods and systems for managing an accelerator's primary storage unit
WO2023173642A1 (zh) Instruction scheduling method, processing circuit, and electronic device
WO2023151231A1 (zh) Method and apparatus for loading data in a single-instruction multiple-thread computing system
WO2023103392A1 (zh) Method, medium, program product, system, and apparatus for storage management
US9170638B2 (en) Method and apparatus for providing early bypass detection to reduce power consumption while reading register files of a processor
US11372768B2 (en) Methods and systems for fetching data for an accelerator
US9507725B2 (en) Store forwarding for data caches
US20230385258A1 (en) Dynamic random access memory-based content-addressable memory (dram-cam) architecture for exact pattern matching
WO2023077875A1 (zh) Method and apparatus for executing kernel programs in parallel
WO2023103397A1 (zh) Method, medium, program product, system, and apparatus for storage management
WO2023077880A1 (zh) Method and electronic apparatus for sharing data based on scratchpad memory
JP2024518587A (ja) Programmable accelerator for data-dependent irregular operations
CN112559037B (zh) Instruction execution method, unit, apparatus, and system
US20220413849A1 (en) Providing atomicity for complex operations using near-memory computing
US10114650B2 (en) Pessimistic dependency handling based on storage regions
CN109683959B (zh) Instruction execution method of a processor and the processor thereof
KR100861701B1 (ko) Register renaming system and method based on similarity of register values
JP2005071351A (ja) Processor and method of operating processor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22925590

Country of ref document: EP

Kind code of ref document: A1