WO2023077880A1 - Method and electronic device for sharing data based on a scratchpad memory - Google Patents

Method and electronic device for sharing data based on a scratchpad memory

Info

Publication number
WO2023077880A1
Authority
WO
WIPO (PCT)
Prior art keywords
virtual address
data
program
thread
address area
Prior art date
Application number
PCT/CN2022/108045
Other languages
English (en)
French (fr)
Inventor
徐立宝
常亮
杨经纬
彭永超
桑永奇
姚飞
仇小钢
Original Assignee
海飞科(南京)信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 海飞科(南京)信息技术有限公司
Publication of WO2023077880A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes

Definitions

  • Embodiments of the present disclosure generally relate to the field of electronics, and more particularly, to a method and electronic device for sharing data based on a scratchpad.
  • Processor systems such as graphics processing units (GPUs) have been proposed, and multiple processor cores in such processing systems can provide parallel multi-thread processing, thereby providing higher processing speeds. These processing systems can break down complex calculations into smaller tasks that are processed in parallel by multiple cores, thereby reducing processing time.
  • A large number of threads can run on a multi-core processor such as a GPU, and data sharing among these threads is usually required.
  • A technical solution for sharing data based on a cache has been proposed.
  • However, since a cache includes only a small storage space and involves a complicated management process, it is desirable to share data among multiple threads in a more efficient and convenient manner.
  • Embodiments of the present disclosure provide a technical solution for sharing data based on a scratchpad memory.
  • A method for sharing data based on a scratchpad memory includes: allocating to the program, based on definitions in the program, a virtual address region in virtual storage accessible by the accelerator system, the virtual address region being mapped to any one of a plurality of physical storage devices: an L2 cache and external storage; setting the virtual address region to a scratchpad memory attribute; and managing data shared between a first thread and a second thread in the program based on the virtual address region.
  • Allocating the virtual address region to the program may include: determining, based on the definition, a level specifying the physical storage device corresponding to the virtual address region; and selecting, from physical devices of that level among the plurality of physical devices, the virtual address region for allocation to the program.
  • Selecting the virtual address region for allocation to the program may further include: determining the size of the virtual address region based on the definition; and in response to determining that the size is not above a threshold size, selecting a virtual address region matching the size.
  • The method may further include: in response to determining that the size is above the threshold size, selecting a virtual address region matching the size from physical devices of that level and physical devices below that level among the plurality of physical devices.
  • Sharing data between the first thread and the second thread in the program may include: modifying the swap policy associated with the virtual address region so that data in the virtual address region is not swapped to another physical storage device whose level is lower than that of the physical storage device corresponding to the virtual address region.
  • Sharing data between the first thread and the second thread in the program may include: in response to determining that the first thread writes data to a cache line in the virtual address region, setting a "dirty" flag for the cache line; and prohibiting data in the cache line from being written to a next-level storage device associated with the virtual address region.
  • The method may further include: in response to determining that the data in the cache line is to be swapped to another physical storage device, writing the data in the cache line back to the other physical storage device.
  • Sharing data between the first thread and the second thread in the program may include: setting a data block in the virtual address region to "unused"; and in response to determining that the first thread reads data from a data block set to "unused", invoking read exception handling.
  • The method may further include: releasing the virtual address region in response to determining that the program ends.
  • Allocating the virtual address region to the program may include: setting, based on the definition in the program, the format of the virtual address region to a tensor of any of the following dimensions: 1-dimensional, 2-dimensional, 3-dimensional, and 4-dimensional.
  • The method may be executed at one processing engine among multiple processing engines at the accelerator system; the virtual address region is mapped to the multiple physical storage devices via an address mapping table, and the address mapping table is stored in the accelerator system.
  • Allocating the virtual address region to the program may include: determining the virtual address region in a portion of the virtual storage allocated to the processing engine; and providing the program with an address offset associated with the virtual address region.
  • A computer-readable storage medium stores a plurality of programs configured to be executed by one or more processing engines, the plurality of programs including instructions for performing the method of the first aspect of the present disclosure.
  • A computer program product includes a plurality of programs configured for execution by one or more processing engines, the plurality of programs including instructions for performing the method of the first aspect of the present disclosure.
  • An accelerator system includes: a processor; and a memory coupled to the processor, the memory having instructions stored therein that, when executed by the processor, cause the device to perform the method of the first aspect.
  • An apparatus for sharing data based on a scratchpad memory includes: an allocation unit configured to allocate, based on a definition in the program, a virtual address region in virtual storage accessible by the accelerator system to the program, the virtual address region being mapped to any one of a plurality of physical storage devices: an L2 cache and external storage; a setting unit configured to set the virtual address region to a scratchpad memory attribute; and a sharing unit configured to manage, based on the virtual address region, data shared between a first thread and a second thread in the program.
  • The allocation unit may include: a level determination unit configured to determine, based on the definition, a level specifying the physical storage device corresponding to the virtual address region; and a selection unit configured to select, from physical devices of that level among the plurality of physical devices, the virtual address region for allocation to the program.
  • The selection unit may include: a size determination unit configured to determine the size of the virtual address region based on the definition; and a first selection unit configured to select a virtual address region matching the size in response to determining that the size is not above a threshold size.
  • The selection unit may include a second selection unit configured to, in response to determining that the size is above the threshold size, select a virtual address region matching the size from physical devices of that level and physical devices below that level among the plurality of physical devices.
  • The management unit may include: a modification unit configured to modify the swap policy associated with the virtual address region so that data in the virtual address region is not swapped to another physical storage device whose level is lower than that of the physical storage device corresponding to the virtual address region.
  • The management unit may include: a write unit configured to set a "dirty" flag for a cache line in response to determining that the first thread writes data to the cache line associated with the virtual address region; and a prohibition unit configured to prohibit writing data in the cache line back to the virtual address region.
  • The management unit may include: a write-back unit configured to write the data in a cache line back to another physical storage device in response to determining that the data in the cache line is to be swapped to the other physical storage device.
  • The management unit may include: an initial setting unit configured to set a data block in the virtual address region to "unused"; and an invoking unit configured to invoke read exception handling in response to determining that the first thread reads data from a data block set to "unused".
  • The management unit may include: a release unit configured to release the virtual address region in response to determining that the program ends.
  • The allocation unit may include: a format setting unit configured to set, based on the definition in the program, the format of the virtual address region to a tensor of any of the following dimensions: 1-dimensional, 2-dimensional, 3-dimensional, and 4-dimensional.
  • The apparatus may be implemented at one processing engine among multiple processing engines at the accelerator system; the virtual address region is mapped to the multiple physical storage devices via an address mapping table, and the address mapping table is stored in the accelerator system.
  • The allocation unit may include: an address determination unit configured to determine the virtual address region in a portion of the virtual storage allocated to the processing engine; and an offset unit configured to provide the program with an address offset associated with the virtual address region.
  • A user can designate a virtual storage space in a program for sharing data among multiple threads involved in the program.
  • The size of the virtual storage space is then no longer limited by the size of the cache in a computing device such as a processor; instead, more shared storage space can be provided in a more flexible and efficient manner.
  • FIG. 1 shows a schematic diagram of an example environment in which various embodiments of the present disclosure can be implemented
  • FIG. 2 shows a schematic diagram of a chip according to an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of virtual storage according to an embodiment of the present disclosure
  • FIG. 4 shows a schematic block diagram for sharing data between threads based on a scratchpad memory according to an embodiment of the present disclosure
  • FIG. 5 shows a flow chart of a method for sharing data between threads based on a scratchpad memory according to an embodiment of the present disclosure
  • FIG. 6 shows a schematic block diagram of the working process of a virtual address region according to an embodiment of the present disclosure
  • FIG. 7 shows a schematic block diagram for exchanging data between physical storage devices of different levels according to an embodiment of the present disclosure.
  • FIG. 8 shows a schematic block diagram of an apparatus for sharing data among threads according to an embodiment of the present disclosure.
  • The term "comprise" and its variants mean open-ended inclusion, i.e., "including but not limited to".
  • The term "or" means "and/or" unless otherwise stated.
  • The term "based on" means "based at least in part on".
  • The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment".
  • The term "another embodiment" means "at least one further embodiment".
  • The terms "first", "second", etc. may refer to different or the same objects. Other definitions, both express and implied, may also be included below.
  • FIG. 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented.
  • Example environment 100 may include, for example, electronic devices with computing capabilities, such as computers.
  • The example environment 100 includes, for example, a central processing unit (CPU) 120, a system memory 110, a north bridge/memory bridge 130, an accelerator system 140, an external storage device 150, and a south bridge/input-output (IO) bridge 160.
  • System memory 110 may include, for example, volatile memory such as dynamic random access memory (DRAM).
  • The north bridge/memory bridge 130, for example, integrates a memory controller, a PCIe controller, and the like; it is responsible for data exchange between the CPU 120 and high-speed interfaces, and it bridges the CPU 120 and the south bridge/IO bridge 160.
  • The south bridge/IO bridge 160 serves the computer's low-speed interfaces, such as a Serial Advanced Technology Attachment (SATA) controller and the like.
  • the accelerator system 140 may include, for example, devices or chips such as a graphics processing unit (GPU) and an artificial intelligence (AI) accelerator for accelerated processing of data such as graphics and video.
  • The external storage device 150 may be, for example, a volatile memory such as DRAM located outside the accelerator system 140.
  • In the present disclosure, the external storage device 150 is also referred to as off-chip memory, that is, memory located outside the chip of the accelerator system 140.
  • The chip of the accelerator system 140 also has volatile memory of its own, such as a level-one (L1) cache and an optional level-two (L2) cache, which will be described in detail below in conjunction with some embodiments of the present disclosure. While an example environment 100 in which embodiments of the disclosure can be implemented is shown in FIG. 1, the disclosure is not limited thereto. Some embodiments of the present disclosure may also be used in other application environments with accelerator systems such as GPUs, for example ARM architectures and RISC-V architectures.
  • FIG. 2 shows a schematic block diagram of an accelerator system 200 according to one embodiment of the present disclosure.
  • the accelerator system 200 may be, for example, a specific implementation manner of the chip of the accelerator system 140 in FIG. 1 .
  • the accelerator system 200 includes, for example, an accelerator system-on-a-chip such as a GPU.
  • The accelerator system 200 may include a system processor (SP) 210, a page table device 220, a processing engine (PE) unit 230, a direct memory access (DMA) controller 240, an L1 cache 260, and an L2 cache 250.
  • the accelerator system 200 may be controlled by a host device such as the CPU 120 and receive instructions from the CPU 120.
  • the SP 210 analyzes instructions from the CPU 120, and assigns the analyzed operations to the PE unit 230, the page table device 220, and the DMA controller 240 for processing.
  • the page table device 220 is used to manage virtual storage accessible by the accelerator system 200 .
  • the virtual storage may include, for example, the L2 cache 250 and off-chip memory such as the external storage device 150 in FIG. 1 .
  • the page table device 220 is jointly maintained by the SP 210, the PE unit 230 and the DMA controller 240.
  • The PE unit 230 may include a plurality of processing engines PE_1, PE_2, ..., PE_N, where N represents an integer greater than 1.
  • Each PE in PE unit 230 may be a single instruction multiple thread (SIMT) device.
  • each thread can have its own register file, and all threads of each PE also share a uniform register file.
  • Multiple PEs can perform the same or different processing jobs in parallel. For example, a PE can perform processing such as sorting and convolution on the data to be processed.
  • Each thread can have its own arithmetic logic execution unit and use its own storage address, which can adopt a typical load-store architecture, for example.
  • Each execution unit can include a floating-point/fixed-point unit supporting multiple data types and an arithmetic logic unit.
  • Most instructions are used to perform arithmetic and logic operations, such as addition, subtraction, multiplication, and division of floating-point and fixed-point numbers, or logical AND, OR, and NOT. Operands come from registers.
  • Memory read and write instructions can provide data exchange between registers and on-chip/off-chip memory.
  • a tensor may have one or more dimensions.
  • a tensor may be a four-dimensional tensor, which has four dimensions D1, D2, D3, and D4, and the size of the tensor may be different in each dimension.
  • the tensor may be 1-dimensional, 2-dimensional, 3-dimensional or more dimensional tensor, which is not limited in the present disclosure.
  • the tensor may internally support such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32 and other custom element types, and the present disclosure does not limit this.
  • Tensor addressing uses the element as its basic unit. For example, if the element type is int8, the basic unit is the byte; if the element type is int16, the basic unit of addressing is the double byte, and so on.
  • an application program may be divided into multiple program parts for parallel execution at multiple PEs respectively. It will be appreciated that a user may specify multiple (eg, tens, hundreds, or even more) threads to be launched at a PE to perform certain operations in parallel. During execution, data may need to be shared among multiple threads, so shared space needs to be provided for these threads.
  • a technical solution for sharing data between threads based on a cache in an accelerator system has been provided. However, the size of the cache is usually limited and sometimes difficult to meet the size of the data to be shared. In addition, the sharing method of the existing technical solution may involve frequent data exchange, so it is difficult to guarantee the operating efficiency.
  • FIG. 3 shows a schematic block diagram of a virtual storage 300 according to an embodiment of the present disclosure.
  • The virtual storage 300 may be represented by virtual addresses, and the virtual storage 300 may be mapped to various types of physical storage devices, for example, at least any one of the L1 cache 260, the L2 cache 250, the external storage device 150, and the like.
  • The virtual storage 300 can be organized in segments, and each program (for example, a Kernel program from an application program) can use one or more segments (for example, Kernel 1, Kernel 2, and Kernel 3 in FIG. 3 can use different numbers of segments).
  • Each segment can include one or more pages (for example, the segment used by Kernel 1 in FIG. 3 includes pages P1, P2, P3, and P4), where the page size is defined by the application program and can be variable.
  • Each Kernel program can be executed by one or more PEs; for example, Kernel 1 can be executed by 8 PEs (i.e., PE_1 through PE_8), and Kernel 2 can be executed by 4 PEs (i.e., PE_1 through PE_4).
  • data may be processed in parallel by multiple threads at each PE. At this point, multiple threads often need to exchange data.
  • The data to be sorted may be divided into multiple parts, and multiple threads process the parts respectively. The preliminarily sorted parts of data may then be processed by one or more threads. At this point, data needs to be shared among the multiple threads.
  • FIG. 4 shows a schematic block diagram 400 for sharing data among threads based on scratch memory according to an embodiment of the present disclosure.
  • Corresponding virtual storage can be allocated to each PE, and a certain virtual address region 440 in the virtual storage 430 accessible by the accelerator system can be allocated to the program, so that multiple threads started by the program (for example, a first thread 410 and a second thread 420) can share data.
  • The program here is, for example, a Kernel program as shown in FIG. 3, and the storage space in the virtual address region 440 can be available storage space from at least any one of a plurality of physical storage devices (for example, the L2 cache 250 and the external storage device 150).
  • A scratchpad memory attribute 450 can be set for the virtual address region 440. The scratchpad memory attribute 450 can indicate that the virtual address region 440 needs to be managed in a different memory management manner, so that the data in the virtual address region 440 resides as much as possible in a physical storage device of a higher level (that is, with a higher access speed); for example, after being loaded into the L1 cache, the data is kept in the L1 cache as long as possible. In this way, it can be ensured that multiple threads can read and write the data in the virtual address region 440 at a higher speed so as to achieve the purpose of data sharing.
  • The storage space in the virtual address region 440 may come from physical storage devices of multiple levels. In this way, a larger storage space can be ensured for data sharing purposes. Compared with prior solutions that allocate the shared storage region only in a cache (for example, the L1 cache 260), implementations of the present disclosure can provide a larger shared storage space while preserving data access efficiency as much as possible.
  • FIG. 5 shows a flow chart of a method 500 for sharing data among threads based on scratch memory according to an embodiment of the present disclosure.
  • A virtual address region 440 in virtual storage 430 accessible by the accelerator system may be allocated to the program based on definitions in the program, where the virtual address region 440 is mapped to any one of a plurality of physical storage devices: an L2 cache and external storage.
  • keywords for defining a shared storage area may be provided to the user.
  • "ScrathPadMemoryLevel" (or other keywords) may be used to specify the level of the physical storage device corresponding to virtual address region 440 .
  • the user can use this keyword to specify which level of storage space in the physical storage device is used to provide the shared area.
  • the user may specify to use at least any one of the L2 cache 250 and the external storage device 150 to provide the shared area.
  • When inter-process sharing involves only a relatively small amount of data, the L2 cache 250 can be designated to provide the shared area; and when inter-process sharing involves a relatively large amount of data, the external storage device 150 can be designated to provide the shared area.
  • When the program is running, no matter where the virtual address region is located, the data to be shared in the virtual address region needs to be loaded into the L1 cache to facilitate inter-thread data exchange.
  • Usually, the L2 cache can already provide a relatively sufficient shared area for exchanging data.
  • When the storage space of the L2 cache is insufficient, storage space in the external storage device 150 may be designated as the shared area.
  • When the program is running, the relevant definitions in the input Kernel program can be parsed to determine which level of physical storage device the programmer expects to use to provide the shared area. Based on the level of the physical storage device specified by the definition in the program, the corresponding storage device can be conveniently selected from multiple physical devices. For example, the user may designate that the shared area be provided by at least one of the L2 cache 250 and the external storage device 150. With exemplary implementations of the present disclosure, the size of the shared area in which data is shared between threads is no longer confined to the L1 cache 260; instead, available storage space can be selected from a variety of physical storage devices with more storage space. In this way, programs involving large amounts of data sharing can be served with greater efficiency.
  • a default allocation method may be provided.
  • Physical storage devices with faster access speeds can be used preferentially in the following order: the L1 cache 260, the L2 cache 250, and the external storage device 150.
  • shared space can be automatically allocated from a physical storage device having a higher access speed without user's designation.
  • A mapping between the virtual storage 430 and the multiple physical storage devices may be established via the address mapping table 460.
  • The address mapping table 460 may include a plurality of entries, and each entry may include a mapping relationship between a virtual identifier (identifying a virtual address accessible by the accelerator system) and a real address (pointing to a physical address in a physical storage device).
  • a mapping can be conveniently established between the virtual storage 430 and each physical storage device, so that the accelerator system can run programs without knowing the real address of the accessed data.
  • the storage areas pointed to by the multiple entries here may have the same or different sizes.
  • When a program needs a large storage space, a virtual identifier can point to a physical storage space of, for example, 10 M (or another value); when a program needs only a small storage space, a virtual identifier can point to a physical storage space of, for example, 4 K (or another value).
  • each virtual address area does not have to have the same size, but can be specified according to specific needs. Therefore, the data volume of the address mapping table 460 itself can be kept at a low level, and thus can be stored in a physical storage device (for example, the L2 cache 250 ) of the accelerator system with a relatively high access speed. In this way, the access speed of the accelerator system can be further improved and the overall performance can be improved.
  • the user can define the required size of the virtual address area 440 in the program.
  • the keyword "ScrathPadMemorySize" (or other keywords) may be used to specify the size of the virtual address area 440 .
  • When the program is running, it can be automatically detected whether the size specified by the user is out of bounds, that is, whether it exceeds a predetermined threshold size.
  • The threshold size may be determined based on the size of the L1 cache 260 or the L2 cache 250, for example.
  • The threshold size may be set to a certain percentage (e.g., 40% or another value) of the L1 cache 260.
  • When the size specified by the user is below this percentage, a virtual address region matching the size can be selected from the L1 cache 260, and the remaining storage space in the L1 cache 260 can still meet the other work requirements of the accelerator system. In this manner, the requirement for sharing data between processes can be met without interfering with the normal operation of the accelerator system.
  • If the user-defined size is above the threshold size, a virtual address region meeting the size required by the user may be selected from the storage device desired by the user together with other physical storage devices of a lower level.
  • Assume the keyword in the program indicates that the user desires to allocate a virtual address region of size "size" from the L2 cache 250, and the value of "size" is above the predetermined threshold size.
  • In this case, the virtual address region 440 can be allocated from both the external storage device 150 and the L2 cache 250, with the total storage space from the two levels equal to "size".
  • a balance can be made between the overall processing performance of the accelerator system and user requirements, so as to satisfy the user requirements as much as possible without affecting the overall performance.
  • the method 500 may be performed at one processing engine among a plurality of processing engines at an accelerator system. Specifically, in the case that the program is executed by PE_1 in the processing engine unit 230 shown in FIG. 2 , the method 500 may be executed at the PE_1. In this manner, it can be ensured that the program executed by the processing engine can be locally managed, thereby improving the running efficiency of the program.
  • the format of the virtual address area 440 can be set based on the definition in the program. Assuming that the purpose of the program is to process tensor data, the format of the virtual address area 440 can be set as a tensor of any of the following dimensions according to the dimension of the tensor to be processed: 1-dimensional, 2-dimensional, 3-dimensional, and 4-dimensional or other dimensions. With the exemplary implementation of the present disclosure, the format of the virtual address area 440 can be set according to user definition, so that the allocated shared area can be adapted to the user's requirements, so as to transfer data in a required format between multiple threads.
  • the address offset can be used to identify the virtual address area 440 used as the shared space. Specifically, the location of the virtual address area 440 may be determined in the part of the virtual storage allocated to the processing engine PE_1 as shown in FIG. 3 . Further, an address offset associated with virtual address region 440 may be provided to the program. In this way, when performing inter-thread data sharing, the first thread 410 and the second thread 420 can access the virtual address area 440 via the address offset, thereby pointing to the shared space in a simple and efficient manner.
  • the format of the offset may be determined here based on the format of the tensor data. If the tensor data involves a 1-dimensional format, then a 1-dimensional address offset can be used at this time; if the tensor data involves a 2-dimensional format, a 2-dimensional address offset can be used at this time, and so on. In this way, the address offset can be set in a manner that matches the format of the data to be shared, so that the program can accurately locate the virtual address area 440 .
  • The virtual address region 440 may be set to the scratchpad memory attribute 450.
  • An identifier may be set at the location of the virtual address region 440 to indicate the scratchpad memory attribute 450.
  • the scratchpad memory attribute 450 here may represent a special memory management strategy that can ensure that multiple threads in the program can share data in an efficient manner when the program is running.
  • data shared between the first thread 410 and the second thread 420 in the program may be managed based on the virtual address region 440 .
  • the virtual address area 440 is a special temporary area for data sharing, and thus the data in the virtual address area 440 does not need to be initialized at the beginning of program execution.
  • After one thread writes data to the virtual address region 440, another thread reads the written data from the virtual address region 440, so as to achieve data sharing between processes.
  • More details of sharing data between the threads 410 and 420 based on the virtual address region 440 will be described with reference to FIG. 6.
  • FIG. 6 shows a schematic block diagram 600 of a working process of a virtual address region according to an embodiment of the present disclosure.
  • the virtual address area 440 may include multiple data blocks, and a flag 620 may be set for each data block to indicate a state. For example, at an initial stage, flag 620 may be set to "unused" to indicate that the data in data block 610 is not ready for inter-thread sharing at this time.
  • Assuming the data block 610 is used to transmit data from the second thread 420 to the first thread 410, if it is determined that the first thread 410 reads data from the "unused" data block 610 (that is, one to which the second thread 420 has not yet written data), read exception handling can be invoked. For example, the first thread 410 may be notified to continue waiting until the second thread 420 writes the data to be shared into the data block 610.
  • The swap policy associated with the virtual address region 440 may be modified so that, as far as possible, the data in the virtual address region 440 is not swapped to other physical storage devices with lower access speeds.
  • Assuming the virtual address region 440 is initially provided from storage space in the L2 cache 250, then after the data in the virtual address region 440 is loaded into the L1 cache 260 during program execution, storage resource management can keep that data in the L1 cache 260 as far as possible, rather than swapping it to the L2 cache 250 and the external storage device 150, which have lower access speeds.
  • storage resources can be managed using a least recently used (LRU) principle.
  • special treatment may be performed on the storage space marked as the scratchpad storage, so that the data in the virtual address area 440 is kept in the physical storage device with a higher access speed as much as possible.
  • a threshold time can be set, and only when the data satisfies the LRU principle and exceeds the threshold time, the data is swapped to a physical storage device with a lower access speed.
  • other rules may also be used to ensure that the data in the virtual address area 440 is kept in a physical storage device with a faster access speed as much as possible.
  • Each thread can use storage space in a physical storage device with a higher access speed as the shared area, thereby reducing the time overhead of the read and write operations involved in data sharing and improving the efficiency of data sharing between threads.
  • The data in the virtual address region 440 is temporary data used for inter-thread data sharing and is valid only during program execution, so as long as the data in the virtual address region 440 is not swapped to a low-speed physical storage device, there is no need to write the data back to the low-speed physical storage device. Specifically, if it is determined that a thread writes data to the cache line associated with the data block 610 in the virtual address region 440, the cache line may be set to "dirty" to indicate that the content in the cache line has been modified. At this time, writing the data in the cache line back to the virtual address region 440 may be prohibited.
  • the time overhead occupied by unnecessary data writing operations can be reduced as much as possible, thereby improving the performance of the accelerator system.
  • FIG. 7 shows a schematic block diagram 700 for exchanging data between physical storage devices of different levels according to an embodiment of the present disclosure. As shown in FIG. 7, assume that a data block 610 is loaded into a cache line 710 in the L1 cache 260, and the data in the cache line 710 is to be moved out to a physical storage block 720 in the L2 cache 250, which has a lower access speed.
  • data can be written back to the physical storage block 720 in the L2 cache 250 , in other words, the data in the cache line 710 can be written to the physical storage block 720 .
  • only data marked as "dirty" may be written back.
  • Data write-back is performed only when data in a higher-speed physical storage device is moved out to a lower-speed physical storage device. On the one hand, this ensures that "dirty" data in the virtual address region 440 will not be lost; on the other hand, it ensures that there is no useless overhead of repeatedly writing data to lower-speed physical storage devices.
  • the data in the virtual address area 440 is temporary data during program execution and these data are only useful to the program, so when the program ends, the virtual address area 440 can be released.
  • the virtual address region 440 may be marked as "free" for other purposes.
  • the virtual address area 440 can be allocated at the start of the program for data sharing among multiple threads called by the program. Further, the virtual address area 440 can be released when the program ends. In this way, the corresponding virtual address areas can be continuously allocated and released along with the running of different programs, so as to achieve the purpose of cyclically using free storage space by different programs in different time periods.
  • Assume the purpose of the sorting program is to sort a large amount of data (e.g., 10,000 numbers); the program can then divide the numbers to be sorted into two parts, with threads T1 and T2 each processing 5,000 of them, and designate a virtual address region that can hold all 10,000 items as the shared area.
  • When threads T1 and T2 each complete the local sorting of their 5,000 data items, the two threads can respectively write the sorted 5,000 items to the corresponding positions in the virtual address region.
  • With exemplary implementations of the present disclosure, a shared space can be provided based on virtual address regions mapped to one or more physical storage devices, so the size of the provided shared space is no longer limited by the L1 cache and can be easily expanded. In this manner, a larger shared space can be provided to improve data processing efficiency. Further, the special storage swap strategy for the scratchpad memory ensures that the shared space is located, as far as possible, in physical storage devices with higher access speeds. Therefore, the frequency of reads and writes to the off-chip storage device can be reduced, and the energy consumption of the accelerator system can be lowered while the read/write efficiency is improved.
  • The above uses the sorting program only as an example to describe how to share data among multiple threads based on the scratchpad memory; the scratchpad memory can also be used in programs for other purposes.
  • In some cases, the number of register files may be insufficient.
  • In such cases, the virtual address region can also be used for storing local variables. It will be understood that although the above schematically shows an example of sharing data between two threads via the virtual address region 440, according to an exemplary implementation of the present disclosure, data may also be shared among more threads via the virtual address region 440.
  • the virtual storage is on-chip tensor virtual storage.
  • the virtual storage can be used to store various tensor data involved in the running of the program.
  • data can be shared among more threads than two threads. For example, data shared between a first thread, a second thread, and one or more other threads in the program can be managed based on the virtual address region.
  • a maximum of two programs can be run on the PE.
  • two kernel programs can be run in parallel on the PE.
  • The virtual address region can only be used to share data among the multiple threads of a kernel program running on the PE; that is, data cannot be shared between threads invoked by two different kernel programs running in parallel.
  • The scratchpad memory of the present disclosure is no longer limited to on-chip storage of a fixed size, but provides an expandable virtual storage mode.
  • The scratchpad memory is private to the kernel program running in the PE.
  • When the kernel program starts, the data values in the scratchpad memory are undefined; the data values are defined by the threads of the kernel.
  • When the kernel program ends, the data in the scratchpad memory is discarded, that is, it is not written back to other levels of storage devices.
  • a scratchpad attribute may be defined and assigned to certain segments called by a kernel program, the attribute being provided to the hardware.
  • this attribute is used to control the memory swap policy.
  • The scratchpad memory can be mapped to the L2 cache or DRAM memory. Accesses to the scratchpad memory are treated specially in the L1 cache relative to accesses to other types of memory, so that they have higher priority during cache swapping; for example, data in the scratchpad memory can be resident in the L1 cache for a longer time.
  • a scratchpad can be defined to store a structured tensor format, which can include, for example, 1-, 2-, 3-, or 4-dimensional tensors.
  • When the kernel program is running, out-of-bounds accesses to the scratchpad memory segment can be automatically detected.
  • FIG. 8 shows a schematic block diagram of an apparatus 800 for sharing data among threads according to an embodiment of the present disclosure.
  • Apparatus 800 may be implemented as or included in accelerator system 200 of FIG. 2 .
  • the apparatus 800 may include a plurality of units for performing corresponding steps in the method 500 as discussed in FIG. 5 .
  • The apparatus 800 includes: an allocation unit 810 configured to allocate, based on the definition in the program, a virtual address region in the virtual storage accessible by the accelerator system to the program, the virtual address region being mapped to any one of a plurality of physical storage devices: an L2 cache and external storage; a setting unit 820 configured to set the virtual address region to the scratchpad memory attribute; and a sharing unit 830 configured to manage, based on the virtual address region, data shared between a first thread and a second thread in the program.
  • The allocation unit 810 may include: a level determination unit configured to determine, based on the definition, a level specifying the physical storage device corresponding to the virtual address region; and a selection unit configured to select, from physical devices of that level among the plurality of physical devices, the virtual address region for allocation to the program.
  • The selection unit may include: a size determination unit configured to determine the size of the virtual address region based on the definition; and a first selection unit configured to select a virtual address region matching the size in response to determining that the size is not above a threshold size.
  • The selection unit may include a second selection unit configured to, in response to determining that the size is above the threshold size, select a virtual address region matching the size from physical devices of that level and physical devices below that level among the plurality of physical devices.
  • The management unit 830 may include: a modification unit configured to modify the swap policy associated with the virtual address region so that data in the virtual address region is not swapped to another physical storage device whose level is lower than that of the physical storage device corresponding to the virtual address region.
  • The management unit 830 may include: a write unit configured to set a "dirty" flag for a cache line in the virtual address region in response to determining that the first thread writes data to the cache line; and a prohibition unit configured to prohibit writing data in the cache line to a next-level storage device associated with the virtual address region.
  • The management unit 830 may include: a write-back unit configured to write the data in a cache line back to another physical storage device in response to determining that the data in the cache line is to be swapped to the other physical storage device.
  • The management unit 830 may include: an initial setting unit configured to set a data block in the virtual address region to "unused"; and an invoking unit configured to invoke read exception handling in response to determining that the first thread reads data from a data block set to "unused".
  • The management unit 830 may include: a release unit configured to release the virtual address region in response to determining that the program ends.
  • The allocation unit 810 may include: a format setting unit configured to set, based on the definition in the program, the format of the virtual address region to a tensor of any of the following dimensions: 1-dimensional, 2-dimensional, 3-dimensional, and 4-dimensional.
  • The apparatus 800 may be implemented at one processing engine among the multiple processing engines at the accelerator system; the virtual address region is mapped to the multiple physical storage devices via an address mapping table, and the address mapping table is stored in the accelerator system.
  • The allocation unit 810 may include: an address determination unit configured to determine the virtual address region in a portion of the virtual storage allocated to the processing engine; and an offset unit configured to provide the program with an address offset associated with the virtual address region.
  • The virtual storage is an on-chip tensor virtual storage.
  • The management unit 830 may be further configured to manage, based on the virtual address region, data shared among the first thread, the second thread, and other threads in the program.
  • a computer-readable storage medium stores a plurality of programs configured to be executed by one or more processing engines, the plurality of programs including instructions for performing the methods described above.
  • a computer program product comprises a plurality of programs configured for execution by one or more processing engines, the plurality of programs comprising instructions for performing the methods described above.
  • an accelerator system includes: a processor; and a memory coupled to the processor, the memory having instructions stored therein that, when executed by the processor, cause the device to perform the method described above.
  • the present disclosure may be a method, apparatus, system and/or computer program product.
  • a computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for carrying out various aspects of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present disclosure relates to a method and an electronic device for sharing data based on a scratchpad memory. In a method for sharing data based on a scratchpad memory, a virtual address region in virtual storage accessible by an accelerator system is allocated to a program based on a definition in the program, the virtual address region being mapped to any one of a plurality of physical storage devices: an L2 cache and external storage. The virtual address region is set to a scratchpad memory attribute; and data shared between a first thread and a second thread in the program is managed based on the virtual address region. Further, a corresponding electronic device, computer-readable storage medium, computer program product, and accelerator system are provided. With exemplary implementations of the present disclosure, storage space can be allocated from a variety of physical storage devices for sharing data among multiple threads in a program.

Description

Method and electronic device for sharing data based on a scratchpad memory
This application claims priority to the Chinese patent application No. 202111314187.9, filed with the Chinese Patent Office on November 8, 2021 and entitled "Method and electronic device for sharing data based on a scratchpad memory", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure generally relate to the field of electronics, and more specifically, to a method and an electronic device for sharing data based on a scratchpad memory.
Background
Processor systems such as graphics processing units (GPUs) have been proposed. Multiple processor cores in such processing systems can provide parallel multi-thread processing and thus higher processing speeds. These processing systems can decompose complex computations into smaller tasks to be processed in parallel by multiple cores, thereby reducing processing time.
In some situations, a large number of threads may run on a multi-core processor such as a GPU, and data usually needs to be shared among these threads. Technical solutions for sharing data based on a cache have been proposed. However, since a cache includes only a small storage space and involves a complicated management process, it is desirable to share data among multiple threads in a more efficient and convenient manner.
Summary of the Invention
Embodiments of the present disclosure provide a technical solution for sharing data based on a scratchpad memory.
In a first aspect, a method for sharing data based on a scratchpad memory is provided. The method includes: allocating, based on a definition in a program, a virtual address region in virtual storage accessible by an accelerator system to the program, the virtual address region being mapped to any one of a plurality of physical storage devices: an L2 cache and external storage; setting the virtual address region to a scratchpad memory attribute; and managing, based on the virtual address region, data shared between a first thread and a second thread in the program.
According to an exemplary implementation of the present disclosure, allocating the virtual address region to the program includes: determining, based on the definition, a level specifying the physical storage device corresponding to the virtual address region; and selecting, from physical devices of that level among the plurality of physical devices, the virtual address region for allocation to the program.
According to an exemplary implementation of the present disclosure, selecting the virtual address region for allocation to the program further includes: determining a size of the virtual address region based on the definition; and in response to determining that the size is not above a threshold size, selecting a virtual address region matching the size.
According to an exemplary implementation of the present disclosure, the method further includes: in response to determining that the size is above the threshold size, selecting a virtual address region matching the size from physical devices of that level and physical devices below that level among the plurality of physical devices.
According to an exemplary implementation of the present disclosure, sharing data between the first thread and the second thread in the program based on the virtual address region includes: modifying a swap policy associated with the virtual address region so that data in the virtual address region is not swapped to another physical storage device whose level is lower than that of the physical storage device corresponding to the virtual address region.
According to an exemplary implementation of the present disclosure, sharing data between the first thread and the second thread in the program based on the virtual address region includes: in response to determining that the first thread writes data to a cache line in the virtual address region, setting a "dirty" flag for the cache line; and prohibiting data in the cache line from being written to a next-level storage device associated with the virtual address region.
According to an exemplary implementation of the present disclosure, the method further includes: in response to determining that the data in the cache line is to be swapped to another physical storage device, writing the data in the cache line back to the other physical storage device.
According to an exemplary implementation of the present disclosure, sharing data between the first thread and the second thread in the program based on the virtual address region includes: setting a data block in the virtual address region to "unused"; and in response to determining that the first thread reads data from the data block set to "unused", invoking read exception handling.
According to an exemplary implementation of the present disclosure, the method further includes: releasing the virtual address region in response to determining that the program ends.
According to an exemplary implementation of the present disclosure, allocating the virtual address region to the program includes: setting, based on the definition in the program, the format of the virtual address region to a tensor of any of the following dimensions: 1-dimensional, 2-dimensional, 3-dimensional, and 4-dimensional.
According to an exemplary implementation of the present disclosure, the method is executed at one of a plurality of processing engines at the accelerator system, the virtual address region is mapped to the plurality of physical storage devices via an address mapping table, and the address mapping table is stored in the accelerator system.
According to an exemplary implementation of the present disclosure, allocating the virtual address region to the program includes: determining the virtual address region in a portion of the virtual storage allocated to the processing engine; and providing the program with an address offset associated with the virtual address region.
In a second aspect, a computer-readable storage medium is provided. The medium stores a plurality of programs configured to be executed by one or more processing engines, the plurality of programs including instructions for performing the method of the first aspect of the present disclosure.
In a third aspect, a computer program product is provided. The computer program product includes a plurality of programs configured to be executed by one or more processing engines, the plurality of programs including instructions for performing the method of the first aspect of the present disclosure.
In a fourth aspect, an accelerator system is provided, including: a processor; and a memory coupled to the processor, the memory having instructions stored therein which, when executed by the processor, cause the device to perform the method of the first aspect.
In a fifth aspect, an apparatus for sharing data based on a scratchpad memory is provided. The apparatus includes: an allocation unit configured to allocate, based on a definition in a program, a virtual address region in virtual storage accessible by an accelerator system to the program, the virtual address region being mapped to any one of a plurality of physical storage devices: an L2 cache and external storage; a setting unit configured to set the virtual address region to a scratchpad memory attribute; and a sharing unit configured to manage, based on the virtual address region, data shared between a first thread and a second thread in the program.
According to an exemplary implementation of the present disclosure, the allocation unit includes: a level determination unit configured to determine, based on the definition, a level specifying the physical storage device corresponding to the virtual address region; and a selection unit configured to select, from physical devices of that level among the plurality of physical devices, the virtual address region for allocation to the program.
According to an exemplary implementation of the present disclosure, the selection unit includes: a size determination unit configured to determine the size of the virtual address region based on the definition; and a first selection unit configured to select a virtual address region matching the size in response to determining that the size is not above a threshold size.
According to an exemplary implementation of the present disclosure, the selection unit includes a second selection unit configured to, in response to determining that the size is above the threshold size, select a virtual address region matching the size from physical devices of that level and physical devices below that level among the plurality of physical devices.
According to an exemplary implementation of the present disclosure, the management unit includes: a modification unit configured to modify the swap policy associated with the virtual address region so that data in the virtual address region is not swapped to another physical storage device whose level is lower than that of the physical storage device corresponding to the virtual address region.
According to an exemplary implementation of the present disclosure, the management unit includes: a write unit configured to set a "dirty" flag for a cache line in response to determining that the first thread writes data to the cache line associated with the virtual address region; and a prohibition unit configured to prohibit writing data in the cache line back to the virtual address region.
According to an exemplary implementation of the present disclosure, the management unit includes: a write-back unit configured to write the data in a cache line back to another physical storage device in response to determining that the data in the cache line is to be swapped to the other physical storage device.
According to an exemplary implementation of the present disclosure, the management unit includes: an initial setting unit configured to set a data block in the virtual address region to "unused"; and an invoking unit configured to invoke read exception handling in response to determining that the first thread reads data from a data block set to "unused".
According to an exemplary implementation of the present disclosure, the management unit includes: a release unit configured to release the virtual address region in response to determining that the program ends.
According to an exemplary implementation of the present disclosure, the allocation unit includes: a format setting unit configured to set, based on the definition in the program, the format of the virtual address region to a tensor of any of the following dimensions: 1-dimensional, 2-dimensional, 3-dimensional, and 4-dimensional.
According to an exemplary implementation of the present disclosure, the apparatus is implemented at one of a plurality of processing engines at the accelerator system, the virtual address region is mapped to the plurality of physical storage devices via an address mapping table, and the address mapping table is stored in the accelerator system.
According to an exemplary implementation of the present disclosure, the allocation unit includes: an address determination unit configured to determine the virtual address region in a portion of the virtual storage allocated to the processing engine; and an offset unit configured to provide the program with an address offset associated with the virtual address region.
With exemplary implementations of the present disclosure, a user can specify a virtual storage space in a program for sharing data among the multiple threads involved in the program. In this way, the size of the virtual storage space is no longer limited by the size of the cache in a computing device such as a processor; instead, more shared storage space can be provided in a more flexible and efficient manner.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the present disclosure in conjunction with the accompanying drawings, in which the same reference numerals generally represent the same components.
FIG. 1 shows a schematic diagram of an example environment in which various embodiments of the present disclosure can be implemented;
FIG. 2 shows a schematic diagram of a chip according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of virtual storage according to an embodiment of the present disclosure;
FIG. 4 shows a schematic block diagram for sharing data between threads based on a scratchpad memory according to an embodiment of the present disclosure;
FIG. 5 shows a flowchart of a method for sharing data between threads based on a scratchpad memory according to an embodiment of the present disclosure;
FIG. 6 shows a schematic block diagram of the working process of a virtual address region according to an embodiment of the present disclosure;
FIG. 7 shows a schematic block diagram for exchanging data between physical storage devices of different levels according to an embodiment of the present disclosure; and
FIG. 8 shows a schematic block diagram of an apparatus for sharing data among threads according to an embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be thorough and complete, and will fully convey the scope of the present disclosure to those skilled in the art.
As used herein, the term "comprise" and its variants mean open-ended inclusion, i.e., "including but not limited to". Unless otherwise stated, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one further embodiment". The terms "first", "second", and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
An overview of an environment for carrying out various exemplary implementations of the present disclosure is first described with reference to FIG. 1. FIG. 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented. The example environment 100 may include, for example, an electronic device with computing capability, such as a computer. In one embodiment, the example environment 100 includes, for example, a central processing unit (CPU) 120, a system memory 110, a north bridge/memory bridge 130, an accelerator system 140, an external storage device 150, and a south bridge/input-output (IO) bridge 160. The system memory 110 may include, for example, a volatile memory such as a dynamic random access memory (DRAM). The north bridge/memory bridge 130, for example, integrates a memory controller, a PCIe controller, and the like; it is responsible for data exchange between the CPU 120 and high-speed interfaces, and it bridges the CPU 120 and the south bridge/IO bridge 160. The south bridge/IO bridge 160 serves the computer's low-speed interfaces, such as a Serial Advanced Technology Attachment (SATA) controller. The accelerator system 140 may include, for example, devices or chips such as a graphics processing unit (GPU) and an artificial intelligence (AI) accelerator for accelerated processing of data such as graphics and video. The external storage device 150 may be, for example, a volatile memory such as DRAM located outside the accelerator system 140.
In the present disclosure, the external storage device 150 is also referred to as off-chip memory, that is, memory located outside the chip of the accelerator system 140. In contrast, the chip of the accelerator system 140 also has volatile memory of its own, such as a level-one (L1) cache and an optional level-two (L2) cache, which will be described in detail below in conjunction with some embodiments of the present disclosure. Although FIG. 1 shows one example environment 100 in which various embodiments of the present disclosure can be implemented, the present disclosure is not limited thereto. Some embodiments of the present disclosure may also be used in other application environments having accelerator systems such as GPUs, for example, ARM architectures and RISC-V architectures.
FIG. 2 shows a schematic block diagram of an accelerator system 200 according to an embodiment of the present disclosure. The accelerator system 200 may be, for example, a specific implementation of the chip of the accelerator system 140 in FIG. 1. The accelerator system 200 is, for example, an accelerator system-on-chip such as a GPU. According to an exemplary implementation of the present disclosure, the accelerator system 200 may include a system processor (SP) 210, a page table device 220, a processing engine (PE) unit 230, a direct memory access (DMA) controller 240, an L1 cache 260, and an L2 cache 250.
The accelerator system 200 may be controlled by a host device such as the CPU 120 and receive instructions from the CPU 120. The SP 210 analyzes instructions from the CPU 120 and assigns the analyzed operations to the PE unit 230, the page table device 220, and the DMA controller 240 for processing. The page table device 220 is used to manage the virtual storage accessible by the accelerator system 200. In the present disclosure, in addition to the L1 cache 260, the virtual storage may include, for example, the L2 cache 250 and off-chip memory such as the external storage device 150 in FIG. 1. The page table device 220 is jointly maintained by the SP 210, the PE unit 230, and the DMA controller 240.
The PE unit 230 may include a plurality of processing engines PE_1, PE_2, ..., PE_N, where N represents an integer greater than 1. Each PE in the PE unit 230 may be a single-instruction multiple-thread (SIMT) device. In a PE, each thread may have its own register file, and all threads of each PE also share a uniform register file. Multiple PEs can perform the same or different processing jobs in parallel; for example, a PE can perform processing such as sorting and convolution on the data to be processed.
A user (e.g., a programmer) can write an application program to achieve a particular purpose. For an application program requiring a large amount of computation, the application program can be divided into multiple portions that run in parallel at multiple PEs. Further, one or more threads can be started at each PE. Each thread may have its own arithmetic logic execution unit and use its own storage address, and may adopt, for example, a typical load-store architecture. Each execution unit may include a floating-point/fixed-point unit supporting multiple data types and an arithmetic logic unit. Most instructions are used to perform arithmetic and logic operations, for example, addition, subtraction, multiplication, and division of floating-point and fixed-point numbers, or logical AND, OR, and NOT. Operands come from registers. Memory read and write instructions can provide data exchange between registers and on-chip/off-chip memory.
According to an exemplary implementation of the present disclosure, the accelerator system 200 of FIG. 2 can execute application programs to process data, for example tensor data. According to an exemplary implementation of the present disclosure, a tensor may have one or more dimensions. For example, a tensor may be a four-dimensional tensor having four dimensions D1, D2, D3, and D4, and the size of the tensor may differ in each dimension. In other embodiments, the tensor may be a 1-dimensional, 2-dimensional, 3-dimensional, or higher-dimensional tensor, which is not limited in the present disclosure.
In addition, in embodiments of the present disclosure, a tensor may internally support element types such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32, and other custom element types, which is also not limited in the present disclosure. Tensor addressing uses the element as its basic unit. For example, if the element type is int8, the element unit is the byte; if the element type is int16, the basic unit of addressing is the double byte, and so on.
In some situations, an application program can be divided into multiple program portions to be executed in parallel at multiple PEs. It will be appreciated that a user may specify that multiple (e.g., tens, hundreds, or even more) threads be launched at a PE to perform certain operations in parallel. During execution, data may need to be shared among multiple threads, so shared space needs to be provided for these threads. Technical solutions for sharing data between threads based on a cache in an accelerator system have been provided. However, the size of the cache is usually limited and sometimes cannot accommodate the size of the data to be shared. In addition, the sharing method of existing technical solutions may involve frequent data exchange, making it difficult to guarantee operating efficiency.
To at least partially address the above and other deficiencies of existing technical solutions, according to an exemplary implementation of the present disclosure, a technical solution is provided for sharing data between threads based on a scratchpad memory in virtual storage. An overview of virtual storage is first described with reference to FIG. 3. FIG. 3 shows a schematic block diagram of a virtual storage 300 according to an embodiment of the present disclosure. Here, the virtual storage 300 may be represented by virtual addresses, and the virtual storage 300 may be mapped to multiple types of physical storage devices, for example, at least any one of the L1 cache 260, the L2 cache 250, the external storage device 150, and the like.
As shown in FIG. 3, the virtual storage 300 can be organized in segments, and each program (for example, a Kernel program from an application program) can use one or more segments (for example, Kernel 1, Kernel 2, and Kernel 3 in FIG. 3 can use different numbers of segments). Each segment can include one or more pages (for example, the segment used by Kernel 1 in FIG. 3 includes pages P1, P2, P3, and P4), where the page size is defined by the application program and can be variable.
Further, each Kernel program can be executed by one or more PEs; for example, Kernel 1 can be executed by 8 PEs (i.e., PE_1 through PE_8), and Kernel 2 can be executed by 4 PEs (i.e., PE_1 through PE_4). It will be appreciated that, to improve data processing performance, data can be processed in parallel by multiple threads at each PE. At this point, the multiple threads often need to exchange data. For example, in a sorting operation, the data to be sorted can be divided into multiple parts, each part processed by a different thread. One or more threads can then process the preliminarily sorted parts. At this point, data needs to be shared among the multiple threads.
In the virtual storage 300 shown in FIG. 3, a portion of the region can be set as a scratchpad memory according to the definition in a program, for sharing data among the multiple threads started by the program. An overview of an exemplary implementation of the present disclosure is first described with reference to FIG. 4. FIG. 4 shows a schematic block diagram 400 for sharing data between threads based on a scratchpad memory according to an embodiment of the present disclosure. Corresponding virtual storage can be allocated to each PE, and a certain virtual address region 440 in the virtual storage 430 accessible by the accelerator system can be allocated to a program, so that data can be shared among the multiple threads started by the program (for example, a first thread 410 and a second thread 420). The program here is, for example, a Kernel program as shown in FIG. 3, and the storage space in the virtual address region 440 can be available storage space from at least any one of a plurality of physical storage devices (for example, the L2 cache 250 and the external storage device 150).
Further, to distinguish the virtual address region 440 used for data sharing from ordinary storage space, a scratchpad memory attribute 450 can be set for the virtual address region 440. The scratchpad memory attribute 450 can indicate that the virtual address region 440 needs to be managed in a different memory management manner, so that the data in the virtual address region 440 resides as much as possible in a physical storage device of a higher level (that is, with a higher access speed); for example, after being loaded into the L1 cache, the data is kept in the L1 cache as long as possible. In this way, it can be ensured that multiple threads can read and write the data in the virtual address region 440 at a higher speed so as to achieve the purpose of data sharing.
Further, according to an exemplary implementation of the present disclosure, the storage space in the virtual address region 440 may come from physical storage devices of multiple levels. In this way, a larger storage space can be ensured for data sharing purposes. Compared with prior solutions that allocate the shared storage region only in a cache (for example, the L1 cache 260), implementations of the present disclosure can provide a larger shared storage space while preserving data access efficiency as much as possible.
In the following, more details of an exemplary implementation of the present disclosure will be described with reference to FIG. 5. FIG. 5 shows a flowchart of a method 500 for sharing data between threads based on a scratchpad memory according to an embodiment of the present disclosure. At block 510, a virtual address region 440 in the virtual storage 430 accessible by the accelerator system may be allocated to the program based on a definition in the program, where the virtual address region 440 is mapped to any one of a plurality of physical storage devices: an L2 cache and external storage.
According to an exemplary implementation of the present disclosure, keywords for defining the shared storage region may be provided to the user. For example, "ScrathPadMemoryLevel" (or another keyword) may be used to specify the level of the physical storage device corresponding to the virtual address region 440. With this keyword, the user can specify which level of physical storage device provides the storage space for the shared region. For example, the user may specify that at least any one of the L2 cache 250 and the external storage device 150 provides the shared region.
When inter-process sharing involves only a relatively small amount of data, the L2 cache 250 can be designated to provide the shared region; when inter-process sharing involves a relatively large amount of data, the external storage device 150 can be designated to provide the shared region, and so on. When the program is running, no matter where the virtual address region is located, the data to be shared in the virtual address region needs to be loaded into the L1 cache to facilitate inter-thread data exchange. Generally speaking, the L2 cache can already provide a relatively sufficient shared region for exchanging data. When the storage space of the L2 cache is insufficient, storage space in the external storage device 150 can be designated to serve as the shared region.
When the program is running, the relevant definitions in the input Kernel program can be parsed to determine which level of physical storage device the programmer expects to use to provide the shared region. Based on the level of the physical storage device specified by the definition in the program, the corresponding storage device can be conveniently selected from the multiple physical devices. For example, the user may specify that the shared region be provided by at least one of the L2 cache 250 and the external storage device 150. With exemplary implementations of the present disclosure, the size of the shared region in which data is shared between threads is no longer confined to the L1 cache 260; instead, available storage space can be selected from a variety of physical storage devices with more storage space. In this way, programs involving large amounts of data sharing can be served with greater efficiency.
According to an exemplary implementation of the present disclosure, a default allocation manner may be provided. For example, physical storage devices with faster access speeds may be used preferentially in the following order: the L1 cache 260, the L2 cache 250, and the external storage device 150. In this way, shared space can be allocated automatically from a physical storage device with a higher access speed without the user's designation.
According to an exemplary implementation of the present disclosure, a mapping between the virtual storage 430 and the multiple physical storage devices may be established via an address mapping table 460. Here the address mapping table 460 may include a plurality of entries, and each entry may include a mapping relationship between a virtual identifier (identifying a virtual address accessible by the accelerator system) and a real address (pointing to a physical address in a physical storage device). In this way, a mapping can be conveniently established between the virtual storage 430 and each physical storage device, so that the accelerator system can run programs without knowing the real addresses of the accessed data.
Further, the storage regions pointed to by the entries may have the same or different sizes. For example, when a program needs a large storage space, a virtual identifier can point to a physical storage space of, for example, 10 M (or another value); when a program needs only a small storage space, a virtual identifier can point to a physical storage space of, for example, 4 K (or another value). In this way, each virtual address region need not have the same size but can be specified according to specific needs. As a result, the data volume of the address mapping table 460 itself can be kept low, and it can therefore be stored in a physical storage device of the accelerator system with a relatively high access speed (for example, the L2 cache 250). In this way, the access speed of the accelerator system can be further increased, improving overall performance.
According to an exemplary implementation of the present disclosure, the user can define the required size of the virtual address region 440 in the program. For example, the keyword "ScrathPadMemorySize" (or another keyword) may be used to specify the size of the virtual address region 440. When the program is running, it can be automatically detected whether the size specified by the user is out of bounds, that is, whether it exceeds a predetermined threshold size. The threshold size here may be determined based on the size of the L1 cache 260 or the L2 cache 250, for example. According to an exemplary implementation of the present disclosure, the threshold size may be set to a certain percentage of the L1 cache 260 (for example, 40% or another value). When the size specified by the user is below this percentage, a virtual address region matching the size can be selected from the L1 cache 260, and the remaining storage space in the L1 cache 260 can still meet the other work requirements of the accelerator system. In this way, the requirement for sharing data between processes can be met without interfering with the normal operation of the accelerator system.
According to an exemplary implementation of the present disclosure, if the user-defined size is above the threshold size, the storage space required by the user is considered too large and would affect the normal operation of the accelerator system. In this case, a virtual address region meeting the size required by the user can be selected from the storage device desired by the user together with other physical storage devices of a lower level. Assume the keyword in the program indicates that the user desires to allocate a virtual address region of size "size" from the L2 cache 250, and the value of "size" is above the predetermined threshold size. In this case, the virtual address region 440 can be allocated from both the external storage device 150 and the L2 cache 250, with the total storage space from the two levels equal to "size". With exemplary implementations of the present disclosure, a balance can be struck between the overall processing performance of the accelerator system and user requirements, so as to satisfy the user requirements as far as possible without affecting overall performance.
According to one exemplary implementation of the present disclosure, the method 500 may be performed at one processing engine among multiple processing engines at the accelerator system. Specifically, where the program is executed by PE_1 in the processing engine unit 230 shown in FIG. 2, the method 500 may be performed at PE_1. In this way, a program executed by a processing engine can be managed locally at that processing engine, thereby improving the operating efficiency of the program.
According to one exemplary implementation of the present disclosure, the format of the virtual address region 440 may be set based on the definition in the program. Supposing the purpose of the program is to process tensor data, then according to the dimensions of the tensors to be processed, the format of the virtual address region 440 may be set as a tensor of any of the following dimensions: 1-D, 2-D, 3-D, 4-D, or other dimensions. With the exemplary implementations of the present disclosure, the format of the virtual address region 440 can be set according to the user's definition, so that the allocated shared region suits the user's needs and data of the required format can be passed between multiple threads.
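One possible shape for such a format descriptor is sketched below as an illustrative assumption rather than the disclosed interface: the region is typed as a 1- to 4-D tensor, with unused dimensions set to 1.

    #include <array>
    #include <cstddef>

    struct TensorFormat {
        int rank;                        // 1, 2, 3, or 4
        std::array<std::size_t, 4> dims; // extents; trailing dims are 1

        std::size_t element_count() const {
            std::size_t n = 1;
            for (int i = 0; i < rank; ++i) n *= dims[i];
            return n;
        }
    };

    int main() {
        TensorFormat fmt{2, {128, 64, 1, 1}};  // a 128x64 2-D tensor
        return fmt.element_count() == 8192 ? 0 : 1;
    }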
According to one exemplary implementation of the present disclosure, in addition to the user-specified virtual address region 440, other virtual address space may be allocated to the program for performing other tasks. An address offset may then be used to identify the virtual address region 440 serving as the shared space. Specifically, the location of the virtual address region 440 may be determined in the portion of the virtual storage shown in FIG. 3 that is allocated to the processing engine PE_1. Further, the address offset associated with the virtual address region 440 may be provided to the program. In this way, when data is shared between threads, the first thread 410 and the second thread 420 can access the virtual address region 440 via the address offset, thereby pointing to the shared space in a simple and effective manner.
It will be appreciated that the format of the offset here may be determined based on the format of the tensor data. Supposing the tensor data is in a 1-D format, a 1-D address offset may be used; supposing the tensor data is in a 2-D format, a 2-D address offset may be used, and so on. In this way, the address offset can be set in a manner matching the format of the data to be shared, so that the program can accurately locate the virtual address region 440.
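For instance, a 2-D address offset can be resolved into a linear offset from the region's base. The row-major layout in this sketch is an assumption for illustration; the disclosure does not fix a layout.

    #include <cstddef>

    // Resolve a 2-D offset (row, col) against a region formatted as a 2-D
    // tensor; thread code would add the result to the region's base address.
    std::size_t linear_offset(std::size_t row, std::size_t col,
                              std::size_t row_stride /* elements per row */) {
        return row * row_stride + col;
    }

    int main() {
        return linear_offset(3, 5, 64) == 197 ? 0 : 1;
    }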
How the virtual address region 440 is allocated to the program according to the definition in the program has been described above. At block 520 in FIG. 5, the virtual address region 440 may be set with the scratchpad attribute 450. Specifically, an identifier may be set at the location of the virtual address region 440 to represent the scratchpad attribute 450. It will be appreciated that the scratchpad attribute 450 here may represent a special memory management policy which, when the program runs, ensures that multiple threads in the program can share data efficiently.
At block 530 in FIG. 5, the data shared between the first thread 410 and the second thread 420 in the program may be managed based on the virtual address region 440. It will be appreciated that the virtual address region 440 is a dedicated temporary region for data sharing, so the data in the virtual address region 440 does not need to be initialized at the beginning of the program run. According to one exemplary implementation of the present disclosure, after one thread writes data into the virtual address region 440, another thread reads the written data from the virtual address region 440, thereby achieving the purpose of inter-thread data sharing. In the following, more details on sharing data between the threads 410 and 420 based on the virtual address region 440 will be described with reference to FIG. 6.
FIG. 6 shows a schematic block diagram 600 of the working process of a virtual address region according to one implementation of the present disclosure. As shown in FIG. 6, the virtual address region 440 may include multiple data blocks, and a flag 620 may be set for each data block to represent its state. For example, at the initial stage, the flag 620 may be set to "unused" to indicate that the data in the data block 610 is not yet ready for inter-thread sharing. Supposing the data block 610 is used to transfer data from the second thread 420 to the first thread 410, if it is determined that the first thread 410 reads data from the "unused" data block 610 (i.e., one to which the second thread 420 has not yet written data), read exception handling may be invoked. For example, the first thread 410 may be notified to keep waiting until the second thread 420 writes the data to be shared into the data block 610.
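The per-block flag and the read-exception path can be sketched as follows. This is a single-process stand-in using C++ atomics; a real accelerator would implement the flag and the exception path in hardware or the runtime, and might stall the reader rather than throw.

    #include <atomic>
    #include <stdexcept>

    enum class BlockState { Unused, Ready };

    struct DataBlock {
        std::atomic<BlockState> state{BlockState::Unused};
        int payload = 0;
    };

    // Writer side (e.g., the second thread 420): publish data, then mark ready.
    void write_block(DataBlock& b, int value) {
        b.payload = value;
        b.state.store(BlockState::Ready, std::memory_order_release);
    }

    // Reader side (e.g., the first thread 410): reading an "unused" block
    // triggers the read-exception path; the sketch throws for simplicity.
    int read_block(const DataBlock& b) {
        if (b.state.load(std::memory_order_acquire) != BlockState::Ready)
            throw std::runtime_error("read of unused scratchpad block");
        return b.payload;
    }

    int main() {
        DataBlock b;
        write_block(b, 42);
        return read_block(b) == 42 ? 0 : 1;
    }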
It will be appreciated that, since the first thread 410 and the second thread 420 will continuously exchange data while the program runs, the data in the allocated virtual address region 440 should reside in the L1 cache 260 as much as possible so as to provide a higher access speed. To this end, the swap policy associated with the virtual address region 440 may be modified so that the data in the virtual address region 440 is, as far as possible, not swapped out to other physical storage devices with lower access speeds. Supposing the virtual address region 440 is initially provided by storage space in the L2 cache 250, then after the data in the virtual address region 440 is loaded into the L1 cache 260 while the program runs, the management of storage resources may keep the data of the virtual address region 440 in the L1 cache 260 as much as possible, rather than swapping it out to the L2 cache 250 and the external storage device 150, which have lower access speeds.
As the accelerator system runs, the least recently used (LRU) principle may be used to manage storage resources. Storage space marked as scratchpad may then be treated specially so that the data in the virtual address region 440 stays in a physical storage device with a higher access speed as much as possible. For example, a threshold time may be set, and data is swapped out to a physical storage device with a lower access speed only when the data both satisfies the LRU principle and has exceeded the threshold time. According to one exemplary implementation of the present disclosure, other rules may also be used to keep the data in the virtual address region 440 in a faster physical storage device as much as possible.
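One way to realize this special treatment is sketched below, under the assumption that the threshold-time test is applied to the candidate already selected by LRU; the metadata layout and the time values are illustrative.

    #include <cstdint>

    struct LineMeta {
        std::uint64_t last_use;  // timestamp of the most recent access
        bool scratchpad;         // set when the line backs a scratchpad region
    };

    // An ordinary LRU victim is evictable immediately; a scratchpad line is
    // evicted only after it has additionally been idle past the threshold.
    bool evictable(const LineMeta& m, std::uint64_t now, std::uint64_t threshold) {
        std::uint64_t idle = now - m.last_use;
        if (!m.scratchpad) return true;  // plain LRU decision suffices
        return idle > threshold;         // keep scratchpad data resident longer
    }

    int main() {
        LineMeta normal{100, false}, pad{100, true};
        // At time 150 with threshold 100, the scratchpad line stays resident.
        return (evictable(normal, 150, 100) && !evictable(pad, 150, 100)) ? 0 : 1;
    }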
With the exemplary implementations of the present disclosure, while the program runs, each thread can use storage space in a physical storage device with a higher access speed as the shared region, thereby reducing the time overhead of the read and write operations involved in data sharing and improving the efficiency of inter-thread data sharing.
Here, the data in the virtual address region 440 is temporary data for inter-thread data sharing and is valid only while the program runs. Therefore, as long as the data in the virtual address region 440 is not swapped out to a slower physical storage device, it does not have to be written back to the slower physical storage device. Specifically, if it is determined that a thread writes data to a cache line associated with the data block 610 in the virtual address region 440, the cache line may be set to "dirty" to indicate that the content in the cache line has been modified. Writing the data in the cache line back to the virtual address region 440 may then be prohibited. With the exemplary implementations of the present disclosure, the time overhead of unnecessary data write operations can be minimized, thereby improving the performance of the accelerator system.
According to one exemplary implementation of the present disclosure, the write-back operation may be performed only when the data in the virtual address region 440 is swapped out to a slower physical storage device. In the following, more details about data swapping will be described with reference to FIG. 7. FIG. 7 shows a schematic block diagram 700 for swapping data between physical storage devices of different levels according to one implementation of the present disclosure. As shown in FIG. 7, suppose the data block 610 is loaded into a cache line 710 in the L1 cache 260, and the data in the cache line 710 is to be evicted to a physical storage block 720 in the L2 cache 250, which has a lower access speed. Data may then be written back to the physical storage block 720 in the L2 cache 250; in other words, the data in the cache line 710 may be written to the physical storage block 720. According to one exemplary implementation of the present disclosure, only data marked as "dirty" may be written back.
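A compact sketch of this evict-time write-back rule follows, with plain arrays standing in for the cache line 710 and the lower-level block 720; the 64-byte line size is an assumption of the sketch.

    #include <cstdint>
    #include <cstring>

    struct CacheLine {
        std::uint8_t data[64];
        bool dirty = false;
    };

    // A dirty scratchpad line is written to the lower-level block only at
    // the moment it is evicted; clean lines are simply dropped.
    void evict(CacheLine& line, std::uint8_t* lower_level_block) {
        if (line.dirty) {
            std::memcpy(lower_level_block, line.data, sizeof(line.data));
            line.dirty = false;  // the lower level now holds the data
        }
    }

    int main() {
        CacheLine line;
        std::uint8_t backing[64] = {};
        line.data[0] = 7;
        line.dirty = true;   // a thread wrote to the line
        evict(line, backing);
        return backing[0] == 7 ? 0 : 1;
    }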
With the exemplary implementations of the present disclosure, data write-back is performed only when data in a faster physical storage device is evicted to a slower physical storage device. On the one hand, this ensures that the "dirty" data in the virtual address region 440 is not lost; on the other hand, it avoids the wasted overhead of repeatedly writing data to the slower physical storage device.
According to one exemplary implementation of the present disclosure, the data in the virtual address region 440 is temporary data during the program run and is useful only to the program; therefore, when the program ends, the virtual address region 440 may be released. In other words, the virtual address region 440 may be marked as "free" so as to be used for other purposes. With the exemplary implementations of the present disclosure, the virtual address region 440 can be allocated at program startup according to the definition in the program, for data sharing among the multiple threads invoked by the program. Further, the virtual address region 440 can be released when the program ends. In this way, corresponding virtual address regions can be continually allocated and released as different programs run, so that free storage space is recycled by different programs in different time periods.
The general principle of sharing data based on a scratchpad according to one exemplary implementation of the present disclosure has been described above. In the following, how the virtual address region 440 is used to share data among multiple threads will be described, taking a data sorting program merely as an example. The data sorting program here may be implemented in multiple ways, for example, based on bubble sort, merge sort, and/or any other sorting algorithm. Suppose the purpose of the sorting program is to sort a large amount of data (e.g., 10000 numbers). The program may divide the numbers to be sorted into multiple portions and designate a different thread to process each portion. Specifically, the 10000 numbers may be divided into two portions, with threads T1 and T2 each processing 10000/2 = 5000 numbers, and the sorting program may designate a virtual address region capable of holding 10000 numbers as the shared region.
When the threads T1 and T2 each complete the local sorting of their respective 5000 numbers, the two threads T1 and T2 may each write the 5000 sorted numbers to the corresponding locations in the virtual address region. Further, a thread (e.g., thread T1 or T2) may sort the preliminarily sorted 5000 + 5000 = 10000 numbers so as to obtain the final sorted 10000 numbers. It will be appreciated that, although multiple sorting passes are performed, the threads T1 and T2 can run in parallel to provide the locally sorted data sequences. Compared with sorting the 10000 raw numbers as a whole with a single thread, using the virtual address region to help multiple threads perform the sorting operation in parallel can improve sorting efficiency and reduce sorting time.
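For illustration only, this two-thread flow can be mirrored in ordinary host-side C++, with a std::vector standing in for the virtual address region; this is an analogy, not the accelerator's programming model.

    #include <algorithm>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> shared(10000);           // stands in for the region
        for (int i = 0; i < 10000; ++i) shared[i] = 10000 - i;  // data to sort

        auto mid = shared.begin() + 5000;
        std::thread t1([&] { std::sort(shared.begin(), mid); });  // thread T1
        std::thread t2([&] { std::sort(mid, shared.end()); });    // thread T2
        t1.join();
        t2.join();

        // One thread (e.g., T1) combines the two locally sorted halves.
        std::inplace_merge(shared.begin(), mid, shared.end());
        return std::is_sorted(shared.begin(), shared.end()) ? 0 : 1;
    }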
In the above sorting example, the 10000 numbers occupy a relatively large storage space. It will be appreciated that the traditional solution of providing the shared region based on the L1 cache cannot provide storage space of this size, which leads to frequent data exchanges between physical storage devices of different speeds and greatly reduces sorting performance.
Unlike traditional solutions, according to one exemplary implementation of the present disclosure, the shared space can be provided based on a virtual address region mapped to one or more physical storage devices; the size of the provided shared space is therefore not constrained by the L1 cache and can be easily extended. In this way, a larger shared space can be provided, thereby improving data processing efficiency. Further, the special memory swap policy for the scratchpad can ensure that the shared space resides as much as possible in physical storage devices with higher access speeds. The frequency of reads and writes to off-chip storage devices can thus be reduced, improving read/write efficiency while lowering the energy consumption of the accelerator system.
It will be appreciated that the above describes, using only a sorting program as an example, how to share data among multiple threads based on a scratchpad; the scratchpad may also be used in programs for other purposes. According to one exemplary implementation of the present disclosure, when a register file is used to store local variables, the registers may run short. The virtual address region may then be used to store local variables. It will be appreciated that, although the above schematically shows an example of sharing data between two threads via the virtual address region 440, according to one exemplary implementation of the present disclosure, data may also be shared among more threads via the virtual address region 440.
According to one exemplary implementation of the present disclosure, the virtual storage is on-chip tensor virtual storage. Specifically, the virtual storage may be used to store various tensor data involved while the program runs. Further, data may be shared among more than two threads. For example, based on the virtual address region, the data shared among the first thread, the second thread, and one or more other threads in the program may be managed.
According to one exemplary implementation of the present disclosure, at most two programs may run on a PE; in other words, two kernel programs may run in parallel on a PE. In this case, a virtual address region enables data sharing among the multiple threads of one kernel program running on the PE; that is, data cannot be shared among threads invoked by the two different kernel programs running in parallel.
Compared with a conventional scratchpad, the scratchpad of the present disclosure is no longer limited to on-chip storage of a fixed size but provides an extensible virtual storage manner. Here, the scratchpad-based storage is private to the kernel program running in the PE. When the kernel program starts, the data values in the scratchpad are undefined, and the data values are defined by the threads of the kernel. When the kernel program ends, the data in the scratchpad is discarded, that is, it is not written back to storage devices of other levels.
According to one exemplary implementation of the present disclosure, a scratchpad attribute may be defined and assigned to certain segments used by the kernel program, and the attribute is provided to the hardware. When the hardware executes the kernel, the attribute is used to control the memory swap policy. Here, the scratchpad may be mapped to the L2 cache or DRAM memory. Compared with accesses to other types of memory, accesses to the scratchpad are treated specially in the L1 cache so as to have a higher priority during cache swapping; for example, data of the scratchpad may reside in the L1 cache for a longer time.
Further, accesses to the scratchpad may have delayed write-back in the L1 cache: dirty data is written out to memory of other levels only when it must be replaced. When the kernel program ends, the data in the L1 cache corresponding to the scratchpad may be invalidated (including dirty data), because this data is no longer needed once the kernel program ends. The scratchpad may be defined to store a structured tensor format, for example, including 1-D, 2-D, 3-D, or 4-D tensors. In addition, when the kernel program runs, out-of-bounds accesses to scratchpad segments can be detected automatically.
FIG. 8 shows a schematic block diagram of an apparatus 800 for sharing data between threads according to one implementation of the present disclosure. The apparatus 800 may be implemented as, or included in, the accelerator system 200 of FIG. 2. The apparatus 800 may include multiple units for performing the corresponding steps of the method 500 as discussed with reference to FIG. 5.
As shown in FIG. 8, the apparatus 800 includes: an allocating unit 810 configured to allocate, based on a definition in a program, a virtual address region in virtual storage accessible by an accelerator system to the program, the virtual address region being mapped to any one of the following multiple physical storage devices: a level-2 cache and external storage; a setting unit 820 configured to set the virtual address region with a scratchpad attribute; and a management unit 830 configured to manage, based on the virtual address region, data shared between a first thread and a second thread in the program.
According to one exemplary implementation of the present disclosure, the allocating unit 810 includes: a level determining unit configured to determine, based on the definition, a level for specifying a physical storage device corresponding to the virtual address region; and a selecting unit configured to select, from physical devices having the level among the multiple physical devices, the virtual address region for allocation to the program.
According to one exemplary implementation of the present disclosure, the selecting unit includes: a size determining unit configured to determine the size of the virtual address region based on the definition; and a first selecting unit configured to select, in response to determining that the size is not above a threshold size, a virtual address region matching the size.
According to one exemplary implementation of the present disclosure, the selecting unit includes a second selecting unit configured to select, in response to determining that the size is above the threshold size, a virtual address region matching the size from, among the multiple physical devices, physical devices having the level and physical devices below the level.
According to one exemplary implementation of the present disclosure, the management unit 830 includes: a modifying unit configured to modify a swap policy associated with the virtual address region so that data in the virtual address region is not swapped to another physical storage device, the level of the other physical storage device being lower than the level of the physical storage device corresponding to the virtual address region.
According to one exemplary implementation of the present disclosure, the management unit 830 includes: a writing unit configured to set a cache line with a "dirty" flag in response to determining that the first thread writes data to the cache line in the virtual address region; and a prohibiting unit configured to prohibit the data in the cache line from being written to a storage device of a next level associated with the virtual address region.
According to one exemplary implementation of the present disclosure, the management unit 830 includes: a write-back unit configured to write the data in the cache line back to another physical storage device in response to determining that the data in the cache line is to be swapped to the other physical storage device.
According to one exemplary implementation of the present disclosure, the management unit 830 includes: an initial setting unit configured to set a cache line in the virtual address region to "unused"; and an invoking unit configured to invoke read exception handling in response to determining that the first thread reads data from a cache line set to "unused".
According to one exemplary implementation of the present disclosure, the management unit 830 includes: a releasing unit configured to release the virtual address region in response to determining that the program ends.
According to one exemplary implementation of the present disclosure, the allocating unit 810 includes: a format setting unit configured to set, based on the definition in the program, the format of the virtual address region as a tensor of any of the following dimensions: 1-D, 2-D, 3-D, and 4-D.
According to one exemplary implementation of the present disclosure, the apparatus 800 is implemented at one processing engine among multiple processing engines at the accelerator system, the virtual address region is mapped to the multiple physical storage devices via an address mapping table, and the address mapping table is stored in the accelerator system.
According to one exemplary implementation of the present disclosure, the allocating unit 810 includes: an address determining unit configured to determine the virtual address region in a portion of the virtual storage that is allocated to the processing engine; and an offset unit configured to provide the program with an address offset associated with the virtual address region.
According to one exemplary implementation of the present disclosure, the virtual storage is on-chip tensor virtual storage, and the management unit 830 is further configured to manage, based on the virtual address region, data shared among the first thread, the second thread, and other threads in the program.
According to one exemplary implementation of the present disclosure, a computer-readable storage medium is provided. The medium stores multiple programs, the multiple programs are configured to be executed by one or more processing engines, and the multiple programs include instructions for performing the method described above.
According to one exemplary implementation of the present disclosure, a computer program product is provided. The computer program product includes multiple programs, the multiple programs are configured to be executed by one or more processing engines, and the multiple programs include instructions for performing the method described above.
According to one exemplary implementation of the present disclosure, an accelerator system is provided. The accelerator system includes: a processor; and a memory coupled to the processor, the memory having instructions stored therein which, when executed by the processor, cause the accelerator system to perform the method described above.
The present disclosure may be a method, a device, a system, and/or a computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for carrying out various aspects of the present disclosure.
Furthermore, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the above discussion, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation may also be implemented in multiple implementations separately or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (21)

  1. A method for sharing data based on a scratchpad, comprising:
    allocating, based on a definition in a program, a virtual address region in virtual storage accessible by an accelerator system to the program, the virtual address region being mapped to any one of the following multiple physical storage devices: a level-2 cache and external storage;
    setting the virtual address region with a scratchpad attribute; and
    managing, based on the virtual address region, data shared between a first thread and a second thread in the program.
  2. The method of claim 1, wherein allocating the virtual address region to the program comprises:
    determining, based on the definition, a level for specifying a physical storage device corresponding to the virtual address region; and
    selecting, from physical devices having the level among the multiple physical devices, the virtual address region for allocation to the program.
  3. The method of claim 2, wherein selecting the virtual address region for allocation to the program further comprises:
    determining a size of the virtual address region based on the definition; and
    in response to determining that the size is not above a threshold size, selecting the virtual address region matching the size.
  4. The method of claim 3, wherein selecting the virtual address region matching the size further comprises: in response to determining that the size is above the threshold size,
    selecting, from physical devices having the level and physical devices below the level among the multiple physical devices, the virtual address region matching the size.
  5. The method of claim 1, wherein sharing data between the first thread and the second thread in the program based on the virtual address region comprises:
    modifying a swap policy associated with the virtual address region so that data in the virtual address region is not swapped to another physical storage device, a level of the other physical storage device being lower than the level of the physical storage device corresponding to the virtual address region.
  6. The method of claim 1, wherein sharing data between the first thread and the second thread in the program based on the virtual address region comprises:
    in response to determining that the first thread writes data to a cache line associated with the virtual address region, setting the cache line with a "dirty" flag; and
    prohibiting the data in the cache line from being written back to the virtual address region.
  7. The method of claim 6, further comprising:
    in response to determining that the data in the cache line is to be swapped to another physical storage device, writing the data in the cache line back to the other physical storage device.
  8. The method of claim 1, wherein sharing data between the first thread and the second thread in the program based on the virtual address region comprises:
    setting a data block in the virtual address region to "unused"; and
    in response to determining that the first thread reads data from a data block set to "unused", invoking read exception handling.
  9. The method of claim 1, further comprising: in response to determining that the program ends, releasing the virtual address region.
  10. The method of claim 1, wherein allocating the virtual address region to the program comprises: setting, based on the definition in the program, a format of the virtual address region as a tensor of any one of the following dimensions: 1-D, 2-D, 3-D, and 4-D.
  11. The method of claim 1, wherein the method is performed at one processing engine of multiple processing engines at the accelerator system, the virtual address region is mapped to the multiple physical storage devices via an address mapping table, and the address mapping table is stored in the accelerator system.
  12. The method of claim 11, wherein allocating the virtual address region to the program comprises:
    determining the virtual address region in a portion of the virtual storage that is allocated to the processing engine; and
    providing the program with an address offset associated with the virtual address region.
  13. The method of claim 1, wherein the virtual storage is on-chip tensor virtual storage, and the method further comprises:
    managing, based on the virtual address region, data shared among the first thread, the second thread, and other threads in the program.
  14. A computer-readable storage medium storing multiple programs, the multiple programs being configured to be executed by one or more processing engines, and the multiple programs comprising instructions for performing the method of any one of claims 1-13.
  15. A computer program product comprising multiple programs, the multiple programs being configured to be executed by one or more processing engines, and the multiple programs comprising instructions for performing the method of any one of claims 1-13.
  16. An accelerator system, comprising:
    a processor; and
    a memory coupled to the processor, the memory having instructions stored therein which, when executed by the processor, cause the accelerator system to perform the method of any one of claims 1-13.
  17. An apparatus for sharing data based on a scratchpad, comprising:
    an allocating unit configured to allocate, based on a definition in a program, a virtual address region in virtual storage accessible by an accelerator system to the program, the virtual address region being mapped to any one of the following multiple physical storage devices: a level-2 cache and external storage;
    a setting unit configured to set the virtual address region with a scratchpad attribute; and
    a management unit configured to manage, based on the virtual address region, data shared between a first thread and a second thread in the program.
  18. The apparatus of claim 17, wherein the allocating unit comprises:
    a determining unit configured to determine, based on the definition, a level for specifying a physical storage device corresponding to the virtual address region; and
    a selecting unit configured to select, from physical devices having the level among the multiple physical devices, the virtual address region for allocation to the program.
  19. The apparatus of claim 17, wherein the management unit comprises:
    a modifying unit configured to modify a swap policy associated with the virtual address region so that data in the virtual address region is not swapped to another physical storage device, a level of the other physical storage device being lower than the level of the physical storage device corresponding to the virtual address region.
  20. The apparatus of claim 17, wherein the management unit comprises:
    a writing unit configured to set a cache line with a "dirty" flag in response to determining that the first thread writes data to the cache line in the virtual address region; and
    a prohibiting unit configured to prohibit the data in the cache line from being written to a storage device of a next level associated with the virtual address region.
  21. The apparatus of claim 20, wherein the management unit further comprises:
    a write-back unit configured to write the data in the cache line back to another physical storage device in response to determining that the data in the cache line is to be swapped to the other physical storage device; and
    a releasing unit configured to release the virtual address region in response to determining that the program ends.
PCT/CN2022/108045 2021-11-08 2022-07-26 Method and electronic apparatus for sharing data based on a scratchpad WO2023077880A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111314187.9 2021-11-08
CN202111314187.9A CN114035980B (zh) 2021-11-08 2021-11-08 Method and electronic apparatus for sharing data based on a scratchpad

Publications (1)

Publication Number Publication Date
WO2023077880A1 true WO2023077880A1 (zh) 2023-05-11

Family

ID=80143340

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/108045 WO2023077880A1 (zh) 2021-11-08 2022-07-26 基于便笺存储器来共享数据的方法和电子装置

Country Status (2)

Country Link
CN (1) CN114035980B (zh)
WO (1) WO2023077880A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114035980B (zh) * 2021-11-08 2023-11-14 海飞科(南京)信息技术有限公司 基于便笺存储器来共享数据的方法和电子装置

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1506849A (zh) * 2002-12-12 2004-06-23 International Business Machines Corp Data processing system capable of managing virtual memory processing schemes
CN1506851A (zh) * 2002-12-12 2004-06-23 International Business Machines Corp Data processing system capable of utilizing virtual memory processing schemes
CN1506850A (zh) * 2002-12-12 2004-06-23 International Business Machines Corp Data processing system without system memory
CN103268297A (zh) * 2013-05-20 2013-08-28 Zhejiang University Method for virtual scratchpad memory of acceleration cores based on a heterogeneous multi-core platform
CN103778072A (zh) * 2012-10-25 2014-05-07 Nvidia Corp Efficient memory virtualization in multithreaded processing units
CN104881330A (zh) * 2015-05-22 2015-09-02 Datang Mobile Communications Equipment Co., Ltd. Method and apparatus for sharing data among multiple processes
CN105868028A (zh) * 2015-01-23 2016-08-17 Huawei Technologies Co., Ltd. Method, apparatus and terminal for sharing data between processes
US9858199B1 (en) * 2016-03-30 2018-01-02 Amazon Technologies, Inc. Memory management unit for shared memory allocation
CN114035980A (zh) * 2021-11-08 2022-02-11 海飞科(南京)信息技术有限公司 Method and electronic apparatus for sharing data based on a scratchpad

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0728706A (ja) * 1993-07-14 1995-01-31 Sumitomo Electric Ind Ltd Cache memory device
US7275246B1 (en) * 1999-01-28 2007-09-25 Ati International Srl Executing programs for a first computer architecture on a computer of a second architecture
GB2381886B (en) * 2001-11-07 2004-06-23 Sun Microsystems Inc Computer system with virtual memory and paging mechanism
GB2469299B (en) * 2009-04-07 2011-02-16 Imagination Tech Ltd Ensuring consistency between a data cache and a main memory
US8458440B2 (en) * 2009-09-25 2013-06-04 Nvidia Corporation Deferred complete virtual address computation for local memory space requests
US8627041B2 (en) * 2009-10-09 2014-01-07 Nvidia Corporation Efficient line and page organization for compression status bit caching
US9612966B2 (en) * 2012-07-03 2017-04-04 Sandisk Technologies Llc Systems, methods and apparatus for a virtual machine cache
KR101801567B1 (ko) * 2013-12-19 2017-11-27 Intel Corporation Policy-based trusted inspection of rights managed content
US9892039B2 (en) * 2015-04-21 2018-02-13 Oracle International Corporation Non-temporal write combining using cache resources
JP7184074B2 (ja) * 2018-02-15 2022-12-06 Sony Group Corporation Memory management device, memory management method, and information processing device

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1506849A (zh) * 2002-12-12 2004-06-23 International Business Machines Corp Data processing system capable of managing virtual memory processing schemes
CN1506851A (zh) * 2002-12-12 2004-06-23 International Business Machines Corp Data processing system capable of utilizing virtual memory processing schemes
CN1506850A (zh) * 2002-12-12 2004-06-23 International Business Machines Corp Data processing system without system memory
CN103778072A (zh) * 2012-10-25 2014-05-07 Nvidia Corp Efficient memory virtualization in multithreaded processing units
CN103268297A (zh) * 2013-05-20 2013-08-28 Zhejiang University Method for virtual scratchpad memory of acceleration cores based on a heterogeneous multi-core platform
CN105868028A (zh) * 2015-01-23 2016-08-17 Huawei Technologies Co., Ltd. Method, apparatus and terminal for sharing data between processes
CN104881330A (zh) * 2015-05-22 2015-09-02 Datang Mobile Communications Equipment Co., Ltd. Method and apparatus for sharing data among multiple processes
US9858199B1 (en) * 2016-03-30 2018-01-02 Amazon Technologies, Inc. Memory management unit for shared memory allocation
CN114035980A (zh) * 2021-11-08 2022-02-11 海飞科(南京)信息技术有限公司 Method and electronic apparatus for sharing data based on a scratchpad

Also Published As

Publication number Publication date
CN114035980B (zh) 2023-11-14
CN114035980A (zh) 2022-02-11

Similar Documents

Publication Publication Date Title
EP2542973B1 (en) Gpu support for garbage collection
US8639730B2 (en) GPU assisted garbage collection
US8266337B2 (en) Dynamic logical data channel assignment using channel bitmap
US11741019B2 (en) Memory pools in a memory model for a unified computing system
KR20130010442A (ko) Virtual GPU
WO2017222893A1 (en) System and method for using virtual vector register files
WO2023040460A1 (zh) Memory access method and electronic apparatus
CN114667508B (zh) Method and system for fetching data for an accelerator
WO2023103392A1 (zh) Method, medium, program product, system, and apparatus for storage management
US11868306B2 (en) Processing-in-memory concurrent processing system and method
WO2023077880A1 (zh) Method and electronic apparatus for sharing data based on a scratchpad
WO2023103397A1 (zh) Method, medium, program product, system, and apparatus for storage management
WO2023077875A1 (zh) Method and apparatus for executing kernel programs in parallel
WO2023151231A1 (zh) Method and apparatus for loading data in a single-instruction multiple-thread computing system
US20130262814A1 (en) Mapping Memory Instructions into a Shared Memory Address Place
Očkay et al. Memory partitions and access patterns used for optimization of GPU processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22888915

Country of ref document: EP

Kind code of ref document: A1