WO2023103392A1 - Method, medium, program product, system and apparatus for storage management - Google Patents

Method, medium, program product, system and apparatus for storage management

Info

Publication number
WO2023103392A1
Authority
WO
WIPO (PCT)
Prior art keywords
page
data
chip memory
address
access
Prior art date
Application number
PCT/CN2022/107143
Other languages
English (en)
French (fr)
Inventor
杨经纬
李甲
赵鹏
徐立宝
谢钢锋
王磊
许飞翔
仇小钢
Original Assignee
海飞科(南京)信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 海飞科(南京)信息技术有限公司
Publication of WO2023103392A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781On-chip cache; Off-chip memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10Address translation
    • G06F12/1009Address translation using page tables, e.g. page table structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/20Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal

Definitions

  • Embodiments of the present disclosure generally relate to the field of electronics, and more specifically to a method, medium, program product, system and apparatus for storage management.
  • Processor systems such as graphics processing units (GPUs) have been proposed; the multiple processor cores in such processing systems can provide parallel multi-threaded processing and thus higher processing speeds. These processing systems can decompose complex computations into smaller tasks that are processed in parallel by multiple cores and threads, thereby reducing processing time.
  • In some cases, the amount of data to be processed by a program (such as tensor data) may be large, while the capacity of the on-chip memory (such as the L2 cache) is limited, so a large amount of data cannot be loaded into the on-chip memory at once, which affects the efficiency of parallel data processing.
  • Embodiments of the present disclosure provide a solution for storage management.
  • In a first aspect of the present disclosure, a storage management method is provided. The method includes creating a page table for a virtual storage space based on data to be accessed during execution of an application program, the virtual storage space being mapped to an on-chip memory and an off-chip memory, the page table indicating at least a mapping relationship between logical addresses of the data in the virtual storage space and physical addresses on the on-chip memory or the off-chip memory; and, when the application program is executed, using the page table to access the data.
  • Embodiments of the present disclosure propose merging the addressing of the on-chip memory and the off-chip memory and performing unified addressing in a virtual storage space.
  • Such a storage space is beneficial not only to storage space management, but also to program design and operation.
  • For example, an application program can use logical addresses to address the data to be accessed, without knowing the physical address information of the data or on which physical medium the virtually stored data is stored. This allows programmers to configure the different data to be processed conveniently and flexibly: only the logical addresses corresponding to the data portion to be processed by each application program need to be specified. The running program does not need to manage data migration.
  • creating the page table includes: creating, in the page table, page table entries respectively corresponding to the pages into which the data is divided, each page table entry at least indicating the mapping relationship between the logical address of the corresponding page in the virtual storage space and the physical address on the on-chip memory or off-chip memory.
  • each page table entry in the page table also indicates the value of the reference counter for the corresponding page.
  • the value of the reference counter in each page table entry is updated based on at least one of: the readiness status of the corresponding page's data in the on-chip memory or the off-chip memory, or the access status of the page by the processing engines that are to access the corresponding page.
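To make this concrete, below is a minimal C++ sketch of a page table entry that carries both the address mapping and the per-page reference counters described above. All names (PageTableEntry, v_counter, NUM_V_COUNTERS) and the field widths are illustrative assumptions, not identifiers from the disclosure.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative page table entry for the unified virtual storage space.
// One entry per page; field names and widths are assumptions.
constexpr std::size_t NUM_V_COUNTERS = 3;  // "two or more" counters per entry

struct PageTableEntry {
    uint32_t page_id;        // page identifier (page number) within a segment
    bool     on_chip;        // whether the page currently resides in on-chip memory
    uint64_t on_chip_addr;   // physical address in on-chip memory (valid when loaded)
    uint64_t off_chip_addr;  // physical address in off-chip memory (backing copy)
    // Reference counters (v-counters), e.g. one tracking data readiness and
    // one counting the processing engines that still need the page.
    std::array<uint32_t, NUM_V_COUNTERS> v_counter{};
};
```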
  • the logical address in the virtual storage space indicates the segment identifier of the segment where the data resides, the reference address data, the page identifier of the page where the data resides, and the offset value of the page relative to the reference address data.
  • data includes tensor data and/or program instructions.
  • the page tables are stored in on-chip memory.
  • using the page table to access data includes: determining the target page according to the logical address of the data in the virtual storage space; using the page table to determine the physical address of the target page in the on-chip memory or the off-chip memory; determining an in-page offset address of the data according to the logical address; and accessing the data using the physical address of the target page and the in-page offset address.
  • accessing the data using the physical address of the target page and the in-page offset address includes: if the access to the target page includes reading the target page, reading the data directly from the on-chip memory or the off-chip memory using the physical address and the in-page offset address; and if the access to the target page includes writing to the target page, writing the data to the on-chip memory or the off-chip memory using the physical address and the in-page offset address.
  • In some embodiments, the target page is mapped into the off-chip memory, and the physical address determined using the page table includes the physical address of the target page in the on-chip memory.
  • Accessing the data using the physical address of the target page and the in-page offset address then further includes: if the access to the target page includes reading the target page, loading the data of the target page from the off-chip memory into the on-chip memory using the physical address of the target page in the off-chip memory, and reading the data from the on-chip memory based on the physical address of the target page in the on-chip memory and the in-page offset address; and if the access to the target page includes writing to the target page, writing the data into the on-chip memory using the physical address of the target page in the on-chip memory and the in-page offset address, and flushing the data of the target page from the on-chip memory to the off-chip memory using the physical address of the target page in the off-chip memory.
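As an illustration of this read path, the sketch below simulates the on-chip and off-chip memories with byte arrays: when the target page is resident only off-chip, it is first staged into on-chip memory (the indirect-access case) and then read via the on-chip physical address plus the in-page offset. The simulated DMA copy, the bump allocator and all names are assumptions for illustration, not the disclosure's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Simulated physical memories (sizes are arbitrary for the sketch).
static std::vector<uint8_t> off_chip(1 << 20);  // stands in for device memory
static std::vector<uint8_t> on_chip(64 << 10);  // stands in for L2-like memory
constexpr uint64_t PAGE_SIZE = 4096;

struct Pte {                       // minimal page table entry for this sketch
    bool on_chip_resident = false;
    uint64_t on_chip_addr = 0;     // offset into on_chip
    uint64_t off_chip_addr = 0;    // offset into off_chip
};

static uint64_t next_on_chip = 0;
uint64_t alloc_on_chip_page() {    // trivial bump allocator, illustration only
    uint64_t a = next_on_chip;
    next_on_chip += PAGE_SIZE;
    return a;
}

// Indirect access: load the page on-chip first, then read at the offset.
void read_data(Pte& pte, uint64_t page_offset, uint8_t* dst, std::size_t len) {
    if (!pte.on_chip_resident) {
        pte.on_chip_addr = alloc_on_chip_page();
        // Stand-in for the DMA transfer from off-chip to on-chip memory.
        std::memcpy(&on_chip[pte.on_chip_addr], &off_chip[pte.off_chip_addr],
                    PAGE_SIZE);
        pte.on_chip_resident = true;
    }
    std::memcpy(dst, &on_chip[pte.on_chip_addr + page_offset], len);
}

int main() {
    Pte pte;
    pte.off_chip_addr = 8192;      // page backed at off-chip offset 8192
    off_chip[8192 + 16] = 42;      // pretend this data was written earlier
    uint8_t v = 0;
    read_data(pte, 16, &v, 1);
    std::printf("read %d\n", v);   // prints: read 42
}
```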
  • In a second aspect of the present disclosure, a computer-readable storage medium is provided. It stores a plurality of programs configured to be executed by one or more processing units, the plurality of programs including instructions for performing the method of the first aspect.
  • In a third aspect of the present disclosure, a computer program product is provided. The computer program product comprises a plurality of programs configured to be executed by one or more processing units, the plurality of programs comprising instructions for performing the method of the first aspect.
  • In a fourth aspect of the present disclosure, an accelerator system is provided. The accelerator system includes: a processing unit; and a memory coupled to the processing unit, the memory having instructions stored therein which, when executed by the processing unit, perform the method of the first aspect.
  • In a fifth aspect of the present disclosure, an apparatus for storage management is provided. The apparatus includes a creation unit configured to create a page table for a virtual storage space based on data to be accessed during execution of an application program, the virtual storage space being mapped to an on-chip memory and an off-chip memory, the page table indicating at least a mapping relationship between the logical addresses of the data in the virtual storage space and physical addresses on the on-chip memory or the off-chip memory; and an access unit configured to use the page table to access the data when the application program is executed.
  • the data is divided into at least one segment, each segment including at least one page.
  • the creation unit is configured to: create, in the page table, page table entries respectively corresponding to the pages into which the data is divided, each page table entry at least indicating the mapping relationship between the logical address of the corresponding page in the virtual storage space and the physical address in the on-chip memory or off-chip memory.
  • each page table entry in the page table also indicates the value of the reference counter for the corresponding page.
  • the value of the reference counter in each page table entry is updated based on at least one of: the readiness status of the corresponding page's data in the on-chip memory or the off-chip memory, or the access status of the page by the processing engines that are to access the corresponding page.
  • the logical address of the data in the virtual storage space indicates the segment identifier of the segment where the data resides, the reference address data, the page identifier of the page where the data resides, and the offset value of the page relative to the reference address data.
  • data includes tensor data or program instructions.
  • the page tables are stored in on-chip memory.
  • the access unit includes: a logical address determination unit configured to determine the target page according to the logical address of the data in the virtual storage space; an address translation unit configured to use the page table to determine the physical address of the target page in the on-chip memory or the off-chip memory; an in-page offset address determination unit configured to determine an in-page offset address of the data based on the logical address; and an address-based access unit configured to use the physical address of the target page and the in-page offset address to access the on-chip memory or the off-chip memory.
  • the address-based access unit is configured to: if the access to the target page includes reading the target page, use the physical address and the in-page offset address to directly read data from the on-chip memory or the off-chip memory; and if the access to the target page includes writing to the target page, use the physical address and the in-page offset address to directly write data to the on-chip memory or the off-chip memory.
  • In some embodiments, the target page is mapped into the off-chip memory, and the physical address determined using the page table includes the physical address of the target page in the on-chip memory.
  • The address-based access unit may also be configured to: if the access to the target page includes reading the target page, use the physical address of the target page in the off-chip memory to load the data of the target page from the off-chip memory into the on-chip memory, and read the data from the on-chip memory based on the physical address of the target page in the on-chip memory and the in-page offset address; and if the access to the target page includes writing to the target page, use the physical address of the target page in the on-chip memory and the in-page offset address to write the data into the on-chip memory, and use the physical address of the target page in the off-chip memory to flush the data of the target page from the on-chip memory to the off-chip memory.
  • Figure 1 shows a schematic diagram of an example environment in which various embodiments of the present disclosure can be implemented
  • Fig. 2 shows a schematic block diagram of a chip according to some embodiments of the present disclosure
  • Fig. 3 shows a schematic block diagram of a parallel processing engine structure according to some embodiments of the present disclosure
  • Figure 4 shows an example of on-chip virtual storage space according to some embodiments of the present disclosure
  • Fig. 5 shows a schematic flowchart of a storage management method according to some embodiments of the present disclosure
  • Fig. 6 shows a schematic flowchart of a storage management method according to other embodiments of the present disclosure
  • FIG. 7 shows a schematic block diagram of an apparatus for storage management according to some embodiments of the present disclosure.
  • Fig. 8 shows a schematic block diagram of an apparatus for storage management according to some other embodiments of the present disclosure.
  • the term “comprise” and its variants mean open inclusion, ie “including but not limited to”.
  • the term “or” means “and/or” unless otherwise stated.
  • the term “based on” means “based at least in part on”.
  • the terms “one example embodiment” and “some embodiments” mean “at least one example embodiment.”
  • the term “another embodiment” means “at least one further embodiment”.
  • the terms “first”, “second”, etc. may refer to different or the same object. Other definitions, both express and implied, may also be included below.
  • As mentioned above, the amount of data to be accessed during execution of an application may be large, while the capacity of the on-chip memory (such as the L2 cache) is limited, so a large amount of data cannot be loaded into the on-chip memory at once, which will affect the parallel processing efficiency of the data.
  • In the embodiments of the present disclosure, a solution of on-chip virtual storage is proposed. Unlike virtual storage techniques that use secondary storage devices (such as hard disks, remote storage, etc.) to expand the main storage space, the embodiments of the present disclosure combine the on-chip storage and the off-chip storage of the accelerator system into a unified virtual storage space.
  • The data to be accessed by the application program is addressed in this space, which provides the application program with a larger, uniformly addressable storage space, expands the available memory space, and improves parallel processing efficiency, especially the parallel processing efficiency of large data such as tensor data.
  • FIG. 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented.
  • Example environment 100 may be, for example, an electronic device with computing capabilities, such as a computer.
  • The example environment 100 includes, for example, a central processing unit (CPU) 20, a system memory 10, a north bridge/memory bridge 30, an accelerator system 40, a device memory 50, and a south bridge/input-output (IO) bridge 60.
  • System memory 10 may be, for example, a volatile memory such as dynamic random access memory (DRAM).
  • the north bridge/memory bridge 30, for example, integrates a memory controller, a PCIe controller, etc., and is responsible for data exchange between the CPU 20 and the high-speed interface and bridging the CPU 20 and the south bridge/IO bridge 60.
  • the south bridge/IO bridge 60 is used for low-speed interfaces of the computer, such as Serial Advanced Technology Attachment (SATA) controllers and the like.
  • the accelerator system 40 may include, for example, devices or chips such as a graphics processing unit (GPU) and an artificial intelligence (AI) accelerator for accelerated processing of data such as graphics and video.
  • Device memory 50 may be, for example, a volatile memory such as DRAM that is external to accelerator system 40 .
  • the device memory 50 is also referred to as an off-chip memory, ie, a memory located outside the chip of the accelerator system 40 .
  • the chip of the accelerator system 40 also has a volatile memory, such as a level-one (L1) cache and, optionally, a level-two (L2) cache. This will be specifically described below in conjunction with some embodiments of the present disclosure.
  • While an example environment 100 in which embodiments of the disclosure can be implemented is shown in FIG. 1, the disclosure is not limited thereto. Some embodiments of the present disclosure may also be used in other application environments with accelerator systems such as GPUs, for example ARM architectures and RISC-V architectures.
  • FIG. 2 shows a schematic block diagram of an accelerator system 200 according to some embodiments of the present disclosure.
  • the accelerator system 200 may be, for example, a specific implementation of the chip of the accelerator system 40 in FIG. 1 .
  • the accelerator system 200 is, for example, an accelerator system-on-a-chip such as a GPU.
  • the accelerator system 200 includes a stream processor (SP) 210, a page table device 220, a processing engine (PE) unit 230, a direct memory access (DMA) controller 240, an L1 cache 260, and an L2 cache 250.
  • the accelerator system 200 may be controlled by a host device such as the CPU 20, and receives instructions from the CPU 20.
  • the SP 210 analyzes instructions from the CPU 20, and assigns the analyzed operations to the PE unit 230, the page table device 220, and the DMA controller 240 for processing.
  • the page table device 220 maintains page tables for managing on-chip virtual storage accessible by the accelerator system 200 .
  • The on-chip memory, such as the L2 cache 250, and the off-chip memory, such as the device memory 50 in FIG. 1, are combined into a unified virtual address space.
  • the page table in the page table device 220 can be jointly accessed and updated by the SP 210, the PE unit 230 and the DMA controller 240.
  • the PE unit 230 may include one or more processing engines (PEs) PE_1, PE_2, ..., PE_N, where N represents an integer greater than or equal to 1.
  • Each processing engine may be associated with a corresponding L1 cache.
  • PE_1 may be associated with L1_1
  • PE_2 may be associated with L1_2, and so on.
  • Each PE in PE unit 230 may be a single instruction multiple thread (SIMT) device.
  • FIG. 3 shows a schematic diagram of a SIMT parallel PE structure 300 according to some embodiments of the present disclosure.
  • the parallel PE structure 300 shown in FIG. 3 may be implemented within the PEs in the PE unit 230 .
  • each thread in a PE can have its own register file, and all threads of each PE also share a uniform register file (not shown).
  • PEs can perform the same or different processing tasks in parallel, and can perform address translation and access to target data in memory in parallel, thereby reducing processing time.
  • PE can perform processing such as sorting and convolution on the data to be processed.
  • Each thread can perform thread-level data exchange between its own register file and the memory subsystem. It will be appreciated that a user may specify multiple (eg, tens, hundreds, or even more) threads to be launched at a PE to perform certain operations in parallel.
  • Each thread has its own arithmetic logic execution unit and uses its own storage address, adopting a typical load-store architecture.
  • Each execution unit includes a floating-point/fixed-point unit supporting multiple data types and an arithmetic logic unit.
  • the data processed by the accelerator system 200 may be multi-dimensional tensor data or one-dimensional tensor data.
  • the tensor may be a four-dimensional tensor having four dimensions D1, D2, D3 and D4, and the tensor may be of different size in each dimension.
  • the tensor may be a one-dimensional, two-dimensional, three-dimensional or more dimensional tensor, which is not limited in the present disclosure.
  • the tensor may internally support custom element types such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32 and uint32, which the present disclosure does not limit.
  • When addressing tensor data, the basic unit is the element. For example, if the element type is int8, the basic unit of addressing is one byte; if the element type is int16, the basic unit of addressing is two bytes, and so on.
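As a small worked illustration of this element-based addressing (a sketch; the enumeration and function names are assumptions, not identifiers from the disclosure):

```cpp
#include <cstdint>

// Element types mentioned above; the enumeration itself is illustrative.
enum class ElemType { u8, i8, bf16, f16, u16, i16, f32, i32, u32 };

// Size in bytes of the basic addressing unit for each element type.
constexpr uint32_t elem_size(ElemType t) {
    switch (t) {
        case ElemType::u8:
        case ElemType::i8:  return 1;   // byte-addressed
        case ElemType::bf16:
        case ElemType::f16:
        case ElemType::u16:
        case ElemType::i16: return 2;   // double-byte addressing unit
        default:            return 4;   // f32 / i32 / u32
    }
}

// Byte offset of the i-th element of a one-dimensional tensor.
constexpr uint64_t byte_offset(ElemType t, uint64_t index) {
    return index * elem_size(t);
}

static_assert(byte_offset(ElemType::i16, 10) == 20,
              "int16 tensors are addressed in 2-byte units");
```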
  • on-chip storage is faster and consumes less power, but has limited capacity, while off-chip storage has longer access latency, higher power consumption, and relatively low bandwidth.
  • on-chip storage is designed as a cache and cannot be explicitly addressed.
  • the main memory is generally off-chip storage, and its data access uses physical addresses.
  • In the embodiments of the present disclosure, a virtual storage method is used to manage the on-chip storage, instead of treating it as an L2 cache.
  • The on-chip storage and the off-chip storage form a uniformly addressable virtual storage space, which provides the program with a virtual on-chip storage view.
  • the data to be accessed by the application program is managed through the page table, which indicates the mapping relationship between the logical address of the data in the virtual storage space and the physical address on the on-chip memory or on the off-chip memory.
  • the page table is used to access the data, which can be physically stored in the on-chip memory or in the off-chip memory.
  • a unified virtual on-chip storage space is not only beneficial to storage space management, but also beneficial to program design and operation.
  • an application can use logical addresses to address the data to be accessed, without knowing the physical address information of the data or on which physical medium the virtually stored data is stored. This allows programmers to configure the different data to be processed conveniently and flexibly: only the logical addresses corresponding to the data portion to be processed by each application program need to be specified. The running program does not need to manage data migration.
  • FIG. 4 shows a schematic block diagram of a portion of a virtual storage space 400 according to some embodiments of the present disclosure.
  • FIG. 5 shows a flowchart of an example process 500 of storage management. Process 500 may be implemented in accelerator system 200 .
  • the virtual memory space 400 is mapped to on-chip memory and off-chip memory.
  • the on-chip memory refers to the on-chip memory of the accelerator system 200 , such as the L2 cache in FIG. 2 , which may be static random access memory (SRAM) or other types of on-chip memory.
  • the off-chip memory is, for example, the off-chip memory of the accelerator system 200 , such as the device memory 50 in FIG. 1 , which may be a dynamic random access memory (DRAM) or other types of off-chip memory.
  • A page table for a virtual storage space, such as the page table for the virtual storage space 400, is created based on the data to be accessed during execution of the application program.
  • the page table at least indicates the mapping relationship between the logical address of the data to be accessed in the virtual storage space 400 and the physical address on the on-chip memory or off-chip memory.
  • the page tables are utilized to access data as the application is executed.
  • the page table is maintained in the accelerator system 200 , for example, in the page table device 220 .
  • the SP 210 may receive a sequence of commands sent by the host to initiate the execution of the application.
  • the SP 210 can create a page table corresponding to the data according to the data to be accessed during the execution of the application program, so as to indicate the mapping relationship between the logical address and the physical address of the data.
  • the storage structure of the data to be accessed in the virtual storage space can be flexibly defined in different application programs. Specifically, the data to be accessed during the execution of the application program can be organized in the virtual storage space 400 by segments and pages.
  • a "segment” is sometimes referred to as a storage segment or a data segment
  • a "page” is sometimes referred to as a storage page or a data page.
  • Data can be divided into one or more segments, and each segment can include one or more pages.
  • the number and size of segments, and the number and size of pages can be determined according to the application.
  • the operation of the application program in each PE can use one or more segments, and each segment can include one or more pages.
  • the data to be accessed when the application is running may include data to be processed by the application, such as tensor data or other forms of data.
  • the data to be accessed when the application is running may also include program instructions related to the application.
  • the programmer can specify the data part to be processed through the logical address in the application program. For example, programmers only need to configure the overall data (for example, tensor data) and structural attribute information to be processed in the application program, and the corresponding data parts to be processed by each PE.
  • a page table can be established to map a logical address to a physical address of an on-chip or off-chip memory.
  • In FIG. 4, a virtual storage space 400 is used to store tensor data having three dimensions D1, D2 and D3; the figure schematically shows the first segment S1, the second segment S2 and the third segment S3.
  • Different applications can use different numbers of segments.
  • Each segment of data can have a different size, so programmers have the flexibility to configure segments based on design needs.
  • the number of segments occupied by an application program may be limited, for example, it may be stipulated that an application program may occupy a maximum of 16 segments.
  • At least one page can also be set to further subdivide the data.
  • page division can be implemented on any one or more dimensions, and the number of pages divided on each dimension is independent of each other.
  • segment S1 in FIG. 4 may have 4 pages P[1], P[2], P[3] and P[4]; the second segment S2 has only one page, and so on.
  • the page size is defined by the application and can be variable.
  • the number of pages in each segment can be different, so programmers can flexibly configure the size of pages in a segment based on design requirements. For example, because an entire page of data needs to be loaded into the on-chip memory when an application program is running, the size of the page can be configured to fit into the on-chip memory as a whole, so that the on-chip memory space can be fully utilized.
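As a purely arithmetic sketch of this sizing consideration, the following splits one tensor dimension into pages so that a single page fits a hypothetical on-chip budget; the figures and names are assumptions for illustration only:

```cpp
#include <cstdint>
#include <cstdio>

// Number of pages one dimension is split into so that a page (whole rows of
// the remaining dimensions) fits in the on-chip budget. Illustrative only.
uint64_t pages_for_dim(uint64_t dim_elems, uint64_t bytes_per_row,
                       uint64_t on_chip_budget_bytes) {
    uint64_t rows_per_page = on_chip_budget_bytes / bytes_per_row;
    if (rows_per_page == 0) rows_per_page = 1;               // page cannot be empty
    return (dim_elems + rows_per_page - 1) / rows_per_page;  // ceiling division
}

int main() {
    // E.g., 1024 rows of 16 KiB each, with an assumed 4 MiB on-chip budget:
    uint64_t n = pages_for_dim(1024, 16 << 10, 4 << 20);
    std::printf("split the dimension into %llu pages\n",
                (unsigned long long)n);                      // prints 4
}
```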
  • each segment can be accessed by one or more PEs, including reading, writing or executing.
  • segment S1 can be accessed by 8 PEs (i.e., PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, PE_8), where segment S1 stores data in the form of tensors to be processed by these PEs at runtime.
  • data may be processed in parallel by multiple threads at each PE.
  • FIG. 4 it may be specified that the data of segment S1 is processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, and PE_8.
  • the application itself can be stored within the segment.
  • segment S2 may be used to store program instructions for one or more application programs. Program instructions stored in segment S2 may be executed by one or more PEs.
  • In some embodiments, the SP 210 in the accelerator system 200 can establish, in the page table, page table entries corresponding to the page identifiers (also referred to as "page numbers") of the pages into which the data is divided, each page table entry at least indicating the mapping relationship between the corresponding page and its physical address in the on-chip memory or the off-chip memory.
  • the page identifier (or page number) of a page is derived from the logical address of the data to be accessed.
  • each segment may have a segment identification and reference address data, called an anchor or reference point.
  • the reference address data may represent the starting coordinate point of the data assigned by each PE.
  • the reference address data can be coordinates for each dimension of the tensor, for example (0,0,0,0) or (0,4,0,0).
  • Multiple PEs can have the same or different reference address data.
  • Data within a segment can be addressed within that segment relative to reference address data.
  • The logical address of data in a segment can include the segment identifier of the segment, the reference address data, and an in-segment offset address, where the in-segment offset address can include the page identifier of the page where the data is located and the offset value of the page relative to the reference address data within the segment.
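A minimal sketch of such a logical address follows. The disclosure specifies only which components the address carries; the field names, widths and the four-dimensional anchor used here are assumptions for illustration.

```cpp
#include <array>
#include <cstdint>

// Logical address in the virtual storage space, as described above: segment
// identifier, reference address data (anchor), page identifier, and an offset
// relative to the anchor. Field types are illustrative assumptions.
struct LogicalAddress {
    uint8_t  segment_id;            // e.g., at most 16 segments per program
    std::array<int32_t, 4> anchor;  // reference point, one coordinate per dimension
    uint32_t page_id;               // page within the segment
    uint64_t offset;                // offset relative to the reference address data
};

// Two PEs may address the same segment with different anchors, e.g.
// PE_1 with anchor (0,0,0,0) and PE_2 with anchor (0,4,0,0):
constexpr LogicalAddress pe1_addr{1, {0, 0, 0, 0}, 2, 128};
constexpr LogicalAddress pe2_addr{1, {0, 4, 0, 0}, 2, 128};
```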
  • each page table entry may include a page identifier of the page and a physical address to which the page is mapped, which may be a physical address in on-chip memory or off-chip memory.
  • the number of page table entries established in the page table may be limited, and the number may be configured according to actual applications.
  • the page table is stored in on-chip memory to facilitate subsequent fast access to the page table.
  • the page table is used to access data when the application is running.
  • SP 210 may receive a sequence of commands from the host, including memory map information and other commands, such as initialization commands and the like.
  • the SP 210 can create a page table based on the storage mapping information and store it in the page table device 220.
  • SP 210 can control applications running on PE.
  • the page identifier (or page number) of the target page where the data is located is derived from the logical address.
  • the logical address is also more specifically used to determine an intra-page offset address of data within a page.
  • the page offset address can be used to indicate the starting position of the data to be accessed in a page.
  • the PE can access the page table through the page table device 220, locate the corresponding page table entry according to the page identifier, and read from the page table entry the physical address of the target page in the on-chip memory or the off-chip memory.
  • an address translator may be included in the PE to perform translation between logical addresses and physical addresses. The PE can use the determined physical address to access the on-chip memory or the off-chip memory to access the corresponding data portion.
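The translation step itself might be sketched as follows, with an ordinary map standing in for the page table held in the page table device; the lookup API and all names are assumptions for illustration, not the hardware mechanism.

```cpp
#include <cstdint>
#include <unordered_map>

// Minimal page table entry: the physical base address the page is mapped to.
struct Pte { uint64_t phys_page_addr; };

// PE-side translation: locate the page table entry by the page identifier
// derived from the logical address, then add the in-page offset.
uint64_t translate(const std::unordered_map<uint32_t, Pte>& page_table,
                   uint32_t page_id, uint64_t in_page_offset) {
    const Pte& pte = page_table.at(page_id);     // locate the page table entry
    return pte.phys_page_addr + in_page_offset;  // final physical access address
}

int main() {
    std::unordered_map<uint32_t, Pte> page_table{{5, {0x4000}}};
    // Data at in-page offset 0x10 of page 5 resolves to physical 0x4010.
    return translate(page_table, 5, 0x10) == 0x4010 ? 0 : 1;
}
```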
  • Data access methods can include direct access and indirect access.
  • Direct access means that, whether the data is located in the off-chip memory or the on-chip memory, the PE can access it directly from the off-chip memory or the on-chip memory.
  • Indirect access means that the data to be accessed is first loaded into the on-chip memory and then accessed. Indirect access applies to the case where the target page storing the data is mapped to the off-chip memory, so the data needs to be loaded from off-chip to on-chip.
  • the data access mode can be default, or can be set by the programmer as needed.
  • the constructed page table indicates the mapping relationship between the logical address of the data in the virtual storage space and the physical address of the on-chip memory or the off-chip memory.
  • For direct access, if the determined physical address shows that the target page is mapped to the off-chip memory or the on-chip memory, the data of the target page can be read directly from the off-chip memory or the on-chip memory based on the physical address and the in-page offset address, or data can be written directly to the off-chip memory or the on-chip memory.
  • the data to be processed can be read from the off-chip memory or the on-chip memory to the register, or the data to be processed can be written from the register to the off-chip memory or the on-chip memory.
  • the program instructions may be directly fetched from the off-chip memory or the on-chip memory and executed, or may be directly written into the off-chip memory or the on-chip memory.
  • For indirect access, the constructed page table indicates the mapping relationship between the logical address of the data in the virtual storage space and the physical address in the on-chip memory. If a target page mapped to the off-chip memory is to be read while the application program is running, the physical address of the target page in the off-chip memory can be used to load the data of the target page from the off-chip memory into the on-chip memory for access.
  • the SP 210 can instruct the DMA controller 240 in the accelerator system 200 to read data from the off-chip memory and cache it to the on-chip memory.
  • DMA operations may operate in parallel with application execution to enable streaming.
  • the physical address of the target page loaded into the on-chip memory can be determined through the page table, and the in-page offset address of the data to be read can be determined.
  • Data can be read from the on-chip memory based on the physical address of the target page in the on-chip memory and the determined offset address within the page.
  • For a write, when the application is running, the data of the target page is first written into the on-chip memory using the physical address of the target page in the on-chip memory and the determined in-page offset address.
  • the SP 210 uses the physical address of the target page in the off-chip memory to flush the data of the target page from the on-chip memory to the off-chip memory.
  • the SP 210 can execute the flushing of data from the on-chip memory to the off-chip memory through the FLUSH command. This frees up more on-chip storage for runtime use.
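A matching sketch of this write-then-flush path follows, again simulating the two memories with byte arrays; flush_page models the effect of the FLUSH command as a copy of the page back to its off-chip physical address. All names and sizes are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

static std::vector<uint8_t> off_chip(1 << 20);  // stands in for device memory
static std::vector<uint8_t> on_chip(64 << 10);  // stands in for on-chip memory
constexpr uint64_t PAGE_SIZE = 4096;

struct Pte { uint64_t on_chip_addr; uint64_t off_chip_addr; };

// Write into the on-chip copy of the target page at the in-page offset.
void write_data(const Pte& pte, uint64_t page_offset,
                const uint8_t* src, std::size_t len) {
    std::memcpy(&on_chip[pte.on_chip_addr + page_offset], src, len);
}

// FLUSH: copy the whole page back to its off-chip physical address,
// freeing the on-chip copy for reuse while the program runs.
void flush_page(const Pte& pte) {
    std::memcpy(&off_chip[pte.off_chip_addr], &on_chip[pte.on_chip_addr],
                PAGE_SIZE);
}

int main() {
    Pte pte{0, 8192};          // page at on-chip 0, backed at off-chip 8192
    uint8_t v = 7;
    write_data(pte, 16, &v, 1);
    flush_page(pte);
    return off_chip[8192 + 16] == 7 ? 0 : 1;  // 0 means write-back succeeded
}
```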
  • each page table entry in the page table also indicates the value of one or more reference counters for the corresponding page.
  • The value of a reference counter can be used to manage the data dependencies of a page. The value of the reference counter in each page table entry may be updated based on at least one of the following: the readiness status of the corresponding page's data in the on-chip memory or the off-chip memory, or the access status of the page by the PEs that are to access it.
  • tensor data may be stored in on-chip high-speed memory, such as L2 cache 250 .
  • the programmer can divide the tensor into multiple segments, and each segment describes a part of the tensor.
  • the application program can be started multiple times, and each time, the DMA controller 240 moves a segment of the tensor from the off-chip storage to the on-chip storage in advance, and provides it for the operation of the application program. After multiple starts of the application, all segments contained in the tensor are processed, and the entire run ends.
  • If the on-chip high-speed memory is sufficient to accommodate all the tensors to be accessed by the application, a tensor needs only one segment description, and the application needs to be started only once.
  • the same application can be run on one or more PEs. These applications are written to work with specific data, such as tensor data. As mentioned earlier, data can be stored in pages in the virtual memory space and can be used by applications at runtime after being written to on-chip memory. Therefore, the same page may be used by different PEs. In this case, management of pages is important. In some other embodiments of the present disclosure, it is also proposed to use reference counters corresponding to pages to manage data dependencies on pages in the virtual storage space.
  • FIG. 6 shows a flowchart of a process 600 for storage management according to other embodiments of the present disclosure.
  • Process 600 may be implemented in accelerator system 200 .
  • the page to be accessed by the application is determined, and data is stored in the page.
  • the SP 210 can receive the command sequence sent by the host to initiate the operation of the application program. By analyzing the sequence of commands, the SP 210 can determine the pages to be accessed by the application to run.
  • an application can access one or more pages.
  • Herein, "accessing" a page or "fetching" a page refers to reading data from, writing data to, or executing instructions in a page in a storage space, where the storage space may be the virtual storage space obtained using the on-chip virtualization technology described above, or a storage space in the accelerator system that does not use such on-chip virtualization technology.
  • For example, in a task related to a machine learning model, an application might be configured to perform matrix multiplication, and the application might access three pages for storing data, where the first page holds matrix A, the second page holds matrix B, and the third page holds the result of multiplying matrix A by matrix B.
  • the addressing information of the page to be accessed can be determined from a sequence of commands related to the application.
  • For example, the first page, storing matrix A, may be located at page P[1] in segment 1 (S1); the second page, storing matrix B, may be located at page P[2] in segment 2 (S2); and the third page, storing the result of the matrix multiplication, may be located at page P[5] in segment 3 (S3).
  • the PE may fetch instructions from a page storing program instructions and execute the instructions.
  • the value of the first reference counter corresponding to the page is set.
  • Applications can perform data access operations on one or more PEs.
  • the access to the page is managed by setting a reference counter (v-counter), so as to prevent the data in the page from being deleted or replaced before being used up by the related PE.
  • the value of the reference counter may be maintained in a page table.
  • each page table entry corresponds to a page, includes the address information of the page used to complete the translation between the logical address and the physical address as mentioned above, and may also include the values of the reference counters.
  • each page may correspond to one or more reference counters, which may be set to respective values, as will be described below.
  • the value of the page's corresponding first reference counter may be set based on the number of PEs on which the application is to be run, in order to track the access to the page by those PEs. In some embodiments, the value of the first reference counter may be set equal to the number of PEs on which the application is to be run.
  • the value of another reference counter (sometimes referred to herein as the second reference counter) corresponding to the page can also be set to represent the ready status of the data in the page in the on-chip memory or in the off-chip memory.
  • In this way, the data in the page can be prevented from being accessed, for example used for subsequent calculations, before it is ready.
  • the SP 210 may set the value of the second reference counter corresponding to the page based on the ready state of the page in the on-chip memory or the off-chip memory.
  • the access operation may be performed based on the value of the second reference counter.
  • the data in the page may be originally stored in the on-chip memory, or may be stored in the off-chip memory.
  • the data in the page will not be completely written to the on-chip memory until the calculation of the matrix multiplication is completed.
  • the value of the second reference counter may be set in consideration of the readiness status of the data on the on-chip memory or the off-chip memory.
  • If the access to the data is an indirect access, that is, the data needs to be loaded from the off-chip memory into the on-chip memory to reduce access latency, the value of the second reference counter can be set according to the ready state of the data in the on-chip memory.
  • That is, if the data in the page is not yet ready, the value of the second reference counter can be set to a first value, for example 1, to indicate that the data in the page cannot yet be accessed. For example, if the data in the page needs to be moved from the off-chip memory to the on-chip memory, or the data in the page can only be obtained after a calculation is completed, then the value of the second reference counter corresponding to the page is set to 1 when the move or calculation is started, to avoid access to the page by other entities while the move or computation is pending. In some embodiments, if the data in the page is ready in the on-chip memory or the off-chip memory, the second reference counter is set to a second value indicating that the page is accessible and the data is ready, for example 0.
  • the second reference counter may be set by the SP 210 to the first value (eg, 1).
  • For example, the SP 210 can set the second reference counter (denoted, for example, v-counter[0]) to 1 to indicate that the page is being loaded, and the SP 210 may instruct the DMA controller 240 to load matrix A into the on-chip memory.
  • After the loading is complete, the DMA controller 240 may set v-counter[0] in the page table entry corresponding to the first page to 0.
  • the value of the second reference counter v-counter[0] can also be similarly set.
  • For the third page P[5], which will hold the result, the PE can first set the counter in the corresponding page table entry (such as v-counter[0]; for the writer it serves as the first counter, and for the reader as the second counter) to the number of PEs, to avoid the page being accessed while the result has not yet been completely written.
  • When the write is complete, the PE can set v-counter[0] in the page table entry corresponding to the third page P[5] to 0.
  • The value of the counter can be used to determine whether the data in the page is ready. Specifically, if the value of the second reference counter indicates that the data in the page is not yet accessible (for example, the value is 1, or equals the number of PEs), then access to the page must wait, and the PE can prevent the application from performing the access operation for the time being. In some embodiments, the PE can periodically query the corresponding page table entry in the page table to determine the readiness status of the data in the page. In some embodiments, if the value of the second reference counter shows that the data in the page is not ready, an interrupt may also be sent to the host to inform the host that the data in the page is not ready.
  • For example, the counter v-counter[0] corresponding to page P[1] is set to 1 and the DMA controller 240 is started to load matrix A from the off-chip memory into the on-chip memory; likewise, the counter v-counter[0] corresponding to page P[2] is set to 1 and the DMA controller 240 is started to load matrix B from the off-chip memory into the on-chip memory.
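Pulling the example together, the sketch below simulates this counter discipline: the counters are set to 1 before the loads are started, the simulated DMA clears them on completion, and the PE side polls until both pages are ready. Threads stand in for the DMA controller 240, and all identifiers are illustrative assumptions.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Per-page readiness counter, as in the example: v_counter[0] == 1 means
// "being loaded / not ready", 0 means "data ready".
struct Page { std::atomic<int> v_counter0{0}; };

void dma_load(Page& p) {       // stand-in for a DMA transfer of one matrix
    // ... move the matrix from off-chip to on-chip memory here ...
    p.v_counter0.store(0);     // mark the data ready once the copy completes
}

int main() {
    Page a, b;                 // P[1] holds matrix A, P[2] holds matrix B
    a.v_counter0.store(1);     // counters set to 1 before the DMA starts
    b.v_counter0.store(1);
    std::thread t1(dma_load, std::ref(a));
    std::thread t2(dma_load, std::ref(b));

    // PE side: poll the page table entries until both inputs are ready.
    while (a.v_counter0.load() != 0 || b.v_counter0.load() != 0) { /* wait */ }
    std::printf("matrices ready; the PEs may start the multiplication\n");
    t1.join();
    t2.join();
}
```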
  • the application can begin to perform access operations on the selected one or more PEs.
  • the value of the first reference counter is updated based on the status of the application's access to the page on the PE.
  • the value of the first reference counter is set based on the number of PEs. Through real-time updating, the value of the first reference counter can be used to reflect the real-time access status of the page corresponding to the first reference counter by the application program on the PE.
  • the PE may update the value of the first reference counter of the corresponding page to reflect that the PE has completed its use of the page. For example, the PE may decrement the value of the first reference counter by one. As the application programs running on the PEs successively complete their access operations to a page, the value of the first reference counter decreases.
  • In this way, the value of the first reference counter can be updated to indicate that no PE still needs to access the page. For example, if the value of the first reference counter (for example, v_counter[1]) corresponding to a page is set to 4, then after the application program's access to the page has completed on all four PEs, the first reference counter v_counter[1] has a value of 0, indicating that no PE still needs to access the page.
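A sketch of this bookkeeping follows; the atomic decrement is one plausible way to make concurrent completions by several PEs safe, and the names are illustrative.

```cpp
#include <atomic>
#include <cassert>

// Usage counter in the style of v_counter[1]: initialized to the number of
// PEs that will access the page, decremented as each PE finishes.
struct Page { std::atomic<int> users{0}; };

void launch(Page& p, int num_pes) { p.users.store(num_pes); }

// Called by a PE when its run of the application has finished using the
// page. Returns true for the last user, i.e. no PE still needs the page.
bool pe_done(Page& p) { return p.users.fetch_sub(1) - 1 == 0; }

int main() {
    Page p;
    launch(p, 4);                        // four PEs will access the page
    bool last = false;
    for (int pe = 0; pe < 4; ++pe) last = pe_done(p);
    assert(last);                        // counter reached 0: page reusable
    return 0;
}
```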
  • the data in the page is freed or replaced based on the updated value of the first reference counter.
  • If the updated value of the first reference counter indicates that no PE is to perform an access operation on the page, for example the value of the first reference counter is 0, it means that all relevant PEs have completed their use of the page.
  • the data in this page can be freed, eg deleted from the on-chip memory, or replaced with other data. The choice of release or replacement depends on the specific application.
  • the value of the first counter can reflect the use of the page by the PEs, for example how many PEs are to use it and how many PEs have already finished using it, so as to avoid the page being deleted or replaced before it is fully used.
  • In this way, the reuse rate of pages and the utilization of the on-chip memory storage space can be improved.
  • In the above example, the LAUNCH command establishes the logical-address-to-physical-address mappings for pages P[0], P[1], P[2] and P[5]; each PE queries the value of v_counter[0] of each page, and the access operations of the application on the PEs refer to the value of v_counter[1] of these pages.
  • The values of the reference counters corresponding to a page should also be queried when an application program is to run. If the value of the first reference counter corresponding to the page indicates that no PE is to perform an access operation on the page, and the value of the second reference counter indicates that the page is accessible, for example the values of the first and second reference counters corresponding to the page are all 0, then the data to be accessed by the application can replace the existing data in the page, and the value of the first reference counter corresponding to the page can be updated accordingly. Note that the application here could be another run of the same application that previously accessed the page, or a different application.
  • multiple reference counters, such as two or more (e.g., 3) reference counters, may be maintained in a page table entry of the page table.
  • some counter values can be selected according to needs to indicate the readiness status of the data in the page and the access status of the page by the application program on each PE.
  • Unused reference counters can be initialized to zero. In this way, whether a page can be accessed, deleted or replaced can be determined by checking that the values of all counters corresponding to the page are 0.
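The "all counters are 0" test might look like the following sketch, consistent with the illustrative page table entry shown earlier; the structure and names remain assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t NUM_V_COUNTERS = 3;  // e.g., three counters per entry

struct PageTableEntry {
    std::array<uint32_t, NUM_V_COUNTERS> v_counter{};  // unused ones stay 0
};

// A page may be freed or its data replaced only when every reference counter
// is zero: no pending load or computation, and no PE still using the page.
constexpr bool page_reclaimable(const PageTableEntry& e) {
    for (std::size_t i = 0; i < NUM_V_COUNTERS; ++i)
        if (e.v_counter[i] != 0) return false;
    return true;
}

static_assert(page_reclaimable(PageTableEntry{}),
              "a fresh entry with all counters at 0 is reclaimable");
```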
  • Fig. 7 shows a schematic block diagram of an apparatus 700 for storage management according to some embodiments of the present disclosure.
  • Apparatus 700 may be implemented as or included in accelerator system 200 of FIG. 2 .
  • the apparatus 700 may include a plurality of modules for performing corresponding steps in the method 500 as discussed in FIG. 5 .
  • the apparatus 700 includes a creation unit 710 configured to create a page table for a virtual storage space based on data to be accessed during execution of an application program.
  • the virtual memory space is mapped to on-chip memory and off-chip memory.
  • the page table at least indicates the mapping relationship between the logical address of the data in the virtual storage space and the physical address on the on-chip memory or on the off-chip memory.
  • the apparatus 700 also includes an access unit 720 configured to use the page table to access data when the application program is executed.
  • the data is divided into at least one segment, each segment comprising at least one page.
  • the creation unit 710 is configured to: create, in the page table, page table entries respectively corresponding to the pages into which the data is divided, each page table entry at least indicating the mapping relationship between the logical address of the corresponding page in the virtual storage space and the physical address in the on-chip memory or off-chip memory.
  • each page table entry in the page table also indicates the value of the reference counter for the corresponding page.
  • the value of the reference counter in each page table entry is updated based on at least one of: the readiness status of the corresponding page's data in the on-chip memory or the off-chip memory, or the access status of the page by the processing engines that are to access the corresponding page.
  • the logical address of the data in the virtual storage space indicates the segment identifier of the segment where the data resides, the reference address data, the page identifier of the page where the data resides, and the offset value of the page relative to the reference address data.
  • data includes tensor data and/or program instructions.
  • the page tables are stored in on-chip memory.
  • the access unit includes: a logical address determination unit configured to determine the target page according to the logical address of the data in the virtual storage space; an address translation unit configured to use the page table to determine the physical address of the target page in the on-chip memory or the off-chip memory; an in-page offset address determination unit configured to determine an in-page offset address of the data based on the logical address; and an address-based access unit configured to use the physical address of the target page and the in-page offset address to access the data.
  • the address-based access unit is configured to: if the access to the target page includes reading the target page, use the physical address and the in-page offset address to directly read data from the on-chip memory or the off-chip memory; and if the access to the target page includes writing to the target page, use the physical address and the in-page offset address to directly write data to the on-chip memory or the off-chip memory.
  • In some embodiments, the target page is mapped into the off-chip memory, and the physical address determined using the page table includes the physical address of the target page in the on-chip memory.
  • The address-based access unit may also be configured to: if the access to the target page includes reading the target page, use the physical address of the target page in the off-chip memory to load the data of the target page from the off-chip memory into the on-chip memory, and read the data from the on-chip memory based on the physical address of the target page in the on-chip memory and the in-page offset address; and if the access to the target page includes writing to the target page, use the physical address of the target page in the on-chip memory and the in-page offset address to write the data into the on-chip memory, and use the physical address of the target page in the off-chip memory to flush the data of the target page from the on-chip memory to the off-chip memory.
  • Fig. 8 shows a schematic block diagram of an apparatus 800 for storage management according to other embodiments of the present disclosure.
  • Apparatus 800 may be implemented as or included in accelerator system 200 of FIG. 2 .
  • the apparatus 800 may include a plurality of modules for performing corresponding steps in the method 600 as discussed in FIG. 6 .
  • the apparatus 800 includes a page determining unit 810 configured to determine a page to be accessed by an application program, in which data is stored.
  • the apparatus 800 further includes a first counter setting unit 820 configured to set the value of the first reference counter corresponding to the page based on the number of processing engines to be started to run the application.
  • the apparatus 800 further includes a first counter updating unit 830 configured to update the value of the first reference counter based on the access state of the page by the application on the processing engine.
  • the apparatus 800 further includes a data release or replacement unit 840 configured to release or replace data in the page based on the updated value of the first reference counter.
  • the apparatus 800 may further include: a second counter setting unit configured to set the value of the second reference counter corresponding to the page based on the ready state of the data in the page in the on-chip memory or the off-chip memory; and a program running unit configured to run the application program on the processing engine based on the value of the second reference counter.
  • the second counter setting unit includes: a first value setting unit configured to set the second reference counter to a first value if the data in the page is not ready in the on-chip memory or the off-chip memory; and a second value setting unit configured to set the second reference counter to a second value if the data in the page is ready in the on-chip memory or the off-chip memory.
  • the program running unit includes: an access blocking unit configured to prevent the application program from performing an access operation on the page on the processing engine if the second reference counter is the first value; and an access starting unit configured to allow the application program to start performing access operations on the page on the processing engine if the second reference counter is the second value.
  • the first counter setting unit 820 is configured to: set the value of the first reference counter equal to the number of processing engines.
  • the first counter update unit 830 is configured to: decrement the value of the first reference counter by one if the application completes the page access operation on one of the processing engines.
  • the data release or replacement unit 840 is configured to: if the updated value of the first reference counter indicates that there is no processing engine to perform an access operation on the page, release the data of the page from the on-chip memory or replace the data in the page.
  • the data release or replacement unit 840 is configured to: if the updated value of the first reference counter indicates that there is no processing engine to perform an access operation on the page, and the value of the second reference counter indicates that the page is accessible, replace the data in the page with data to be accessed by another application.
  • pages have corresponding page table entries in a page table and are mapped to physical addresses in physical storage space.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A method, medium, program product, system and apparatus for storage management are described. In some embodiments of the present disclosure, a page table for a virtual storage space is created based on data to be accessed during execution of an application program. The virtual storage space is mapped to an on-chip memory and an off-chip memory. The created page table indicates at least a mapping relationship between logical addresses of the data in the virtual storage space and physical addresses on the on-chip memory or the off-chip memory. In this way, when the application program is run, the page table is used to access the data. Embodiments of the present disclosure propose merging the addressing of the on-chip memory and the off-chip memory and performing unified addressing in the virtual storage space. Such a storage space is beneficial not only to storage space management but also to the design and running of programs.

Description

Method, medium, program product, system and apparatus for storage management
This application claims priority to Chinese patent application No. 202111479984.2, entitled "Method, medium, program product, system and apparatus for storage management" and filed with the China National Intellectual Property Administration on December 6, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate generally to the field of electronics, and more particularly to a method, medium, program product, system and apparatus for storage management.
Background
Processor systems such as graphics processing units (GPUs) have been proposed. The multiple processor cores in such processing systems can provide parallel multi-threaded processing and thus higher processing speeds. These processing systems can decompose a complex computation into smaller tasks to be processed in parallel by multiple cores and multiple threads, thereby reducing processing time.
In some situations, the amount of data to be processed by a program (for example, tensor data) may be large, while the capacity of the on-chip memory (for example, an L2 cache) is limited, so a large amount of data cannot be loaded into the on-chip memory at once, which affects the efficiency of parallel data processing.
Summary
Embodiments of the present disclosure provide a scheme for storage management.
In a first aspect, a storage management method is provided. The method includes creating a page table for a virtual storage space based on data to be accessed during execution of an application, the virtual storage space being mapped to an on-chip memory and an off-chip memory, the page table indicating at least a mapping between logical addresses of the data in the virtual storage space and physical addresses in the on-chip memory or the off-chip memory; and accessing the data by means of the page table when the application is executed.
Embodiments of the present disclosure propose addressing the on-chip memory and the off-chip memory jointly, with unified addressing in the virtual storage space. Such a storage space not only facilitates storage space management but also benefits the design and running of programs. For example, an application can address the data it accesses with logical addresses, without needing to know the physical address information of the data or on which physical medium the virtually stored data resides. This allows programmers to configure the different data to be processed conveniently and flexibly, by merely specifying the logical addresses corresponding to the data portion each application is to process. The running program does not need to manage data migration.
In some embodiments, the data is divided into at least one segment, each segment including at least one page. In some embodiments, creating the page table includes: establishing, in the page table, page table entries respectively corresponding to the pages into which the data is divided, each page table entry indicating at least a mapping between the logical address of the corresponding page in the virtual storage space and its physical address in the on-chip memory or the off-chip memory.
In some embodiments, each page table entry in the page table further indicates the value of a reference counter of the corresponding page. In some embodiments, the value of the reference counter in each page table entry is updated based on at least one of the following: the ready state of the data of the corresponding page in the on-chip memory or the off-chip memory, or the access state of the page by the processing engines that are to access the corresponding page.
In some embodiments, the logical address in the virtual storage space indicates the segment identifier of the segment in which the data resides, base address data, the page identifier of the page in which the data resides, and the offset of that page relative to the base address data.
In some embodiments, the data includes tensor data and/or program instructions. In some embodiments, the page table is stored in the on-chip memory.
In some embodiments, accessing the data by means of the page table includes: determining a target page according to the logical address of the data in the virtual storage space; determining, using the page table, the physical address of the target page in the on-chip memory or the off-chip memory; determining an in-page offset address of the data according to the logical address; and accessing the data using the physical address of the target page and the in-page offset address.
In some embodiments, accessing the data using the physical address of the target page and the in-page offset address includes: if the access to the target page includes reading the target page, reading the data directly from the on-chip memory or the off-chip memory using the physical address and the in-page offset address; and if the access to the target page includes writing to the target page, writing the data to the on-chip memory or the off-chip memory using the physical address and the in-page offset address.
In some embodiments, the target page is mapped into the off-chip memory, and the physical address determined using the page table includes the physical address of the target page in the on-chip memory. Accessing the data using the physical address of the target page and the in-page offset address further includes: if the access to the target page includes reading the target page, loading the data of the target page from the off-chip memory into the on-chip memory using the physical address of the target page in the off-chip memory, and reading the data from the on-chip memory based on the physical address of the target page in the on-chip memory and the in-page offset address; and if the access to the target page includes writing to the target page, writing data into the on-chip memory using the physical address of the target page in the on-chip memory and the in-page offset address, and flushing the data of the target page from the on-chip memory to the off-chip memory using the physical address of the target page in the off-chip memory.
In a second aspect of the present disclosure, a computer-readable storage medium is provided. It stores a plurality of programs configured to be executed by one or more processing units, the plurality of programs including instructions for performing the method of the first aspect.
In a third aspect of the present disclosure, a computer program product is provided. The computer program product includes a plurality of programs configured to be executed by one or more processing units, the plurality of programs including instructions for performing the method of the first aspect.
In a fourth aspect of the present disclosure, an accelerator system is provided. The accelerator system includes: a processing unit; and a memory coupled to the processing unit, the memory having instructions stored therein which, when executed by the processing unit, perform the method of the first aspect.
In a fifth aspect of the present disclosure, an apparatus for storage management is provided. The apparatus includes a creating unit configured to create a page table for a virtual storage space based on data to be accessed during execution of an application, the virtual storage space being mapped to an on-chip memory and an off-chip memory, the page table indicating at least a mapping between logical addresses of the data in the virtual storage space and physical addresses in the on-chip memory or the off-chip memory; and an accessing unit configured to access the data by means of the page table when the application is executed.
In some embodiments, the data is divided into at least one segment, each segment including at least one page. In some embodiments, the creating unit is configured to: establish, in the page table, page table entries respectively corresponding to the pages into which the data is divided, each page table entry indicating at least a mapping between the logical address of the corresponding page in the virtual storage space and its physical address in the on-chip memory or the off-chip memory.
In some embodiments, each page table entry in the page table further indicates the value of a reference counter of the corresponding page. In some embodiments, the value of the reference counter in each page table entry is updated based on at least one of the following: the ready state of the data of the corresponding page in the on-chip memory or the off-chip memory, or the access state of the page by the processing engines that are to access the corresponding page.
In some embodiments, the logical address of the data in the virtual storage space indicates the segment identifier of the segment in which the data resides, base address data, the page identifier of the page in which the data resides, and the offset of that page relative to the base address data.
In some embodiments, the data includes tensor data or program instructions. In some embodiments, the page table is stored in the on-chip memory.
In some embodiments, the accessing unit includes: a logical address determining unit configured to determine a target page according to the logical address of the data in the virtual storage space; an address translation unit configured to determine, using the page table, the physical address of the target page in the on-chip memory or the off-chip memory; an in-page offset address determining unit configured to determine an in-page offset address of the data according to the logical address; and an address-based accessing unit configured to access the on-chip memory or the off-chip memory using the physical address of the target page and the in-page offset address.
In some embodiments, the address-based accessing unit is configured to: if the access to the target page includes reading the target page, read the data directly from the on-chip memory or the off-chip memory using the physical address and the in-page offset address; and if the access to the target page includes writing to the target page, write the data directly to the on-chip memory or the off-chip memory using the physical address and the in-page offset address.
In some embodiments, the target page is mapped into the off-chip memory, and the physical address determined using the page table includes the physical address of the target page in the on-chip memory. The address-based accessing unit may further be configured to: if the access to the target page includes reading the target page, load the data of the target page from the off-chip memory into the on-chip memory using the physical address of the target page in the off-chip memory, and read the data from the on-chip memory based on the physical address of the target page in the on-chip memory and the in-page offset address; and if the access to the target page includes writing to the target page, write data into the on-chip memory using the physical address of the target page in the on-chip memory and the in-page offset address, and flush the data of the target page from the on-chip memory to the off-chip memory using the physical address of the target page in the off-chip memory.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
Brief Description of the Drawings
The above and other objectives, features and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the present disclosure taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components throughout the exemplary embodiments of the present disclosure.
Fig. 1 shows a schematic diagram of an example environment in which multiple embodiments of the present disclosure can be implemented;
Fig. 2 shows a schematic block diagram of a chip according to some embodiments of the present disclosure;
Fig. 3 shows a schematic block diagram of a parallel processing engine structure according to some embodiments of the present disclosure;
Fig. 4 shows an example of an on-chip virtual storage space according to some embodiments of the present disclosure;
Fig. 5 shows a schematic flowchart of a storage management method according to some embodiments of the present disclosure;
Fig. 6 shows a schematic flowchart of a storage management method according to other embodiments of the present disclosure;
Fig. 7 shows a schematic block diagram of an apparatus for storage management according to some embodiments of the present disclosure; and
Fig. 8 shows a schematic block diagram of an apparatus for storage management according to other embodiments of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described below in more detail with reference to the accompanying drawings. Although the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be more thorough and complete, and the scope of the present disclosure can be fully conveyed to those skilled in the art.
As used herein, the term "include" and its variants denote open-ended inclusion, i.e., "including but not limited to". Unless specifically stated, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "some embodiments" mean "at least one example embodiment". The term "another embodiment" means "at least one further embodiment". The terms "first", "second" and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
As mentioned above, the amount of data to be accessed during the execution of an application (for example, tensor data) may be large, while the capacity of the on-chip memory (for example, an L2 cache) is limited, so a large amount of data cannot be loaded into the on-chip memory at once, which affects the efficiency of parallel data processing.
In some embodiments of the present disclosure, a scheme of on-chip virtual storage is proposed. Unlike virtual storage techniques that extend the main memory space with secondary storage devices (such as hard disks or remote storage), in the embodiments of the present disclosure the on-chip memory and the off-chip memory of an accelerator system are merged into a unified virtual storage space. The data to be accessed by applications is addressed in this virtual storage space, which provides applications with a larger, uniformly addressable storage space, extends the usable memory space, and improves the efficiency of parallel processing, in particular for large-size data such as tensor data.
Fig. 1 shows a schematic diagram of an example environment 100 in which multiple embodiments of the present disclosure can be implemented. The example environment 100 may be, for example, an electronic device with computing capability, such as a computer. In some embodiments, the example environment 100 includes, for example, a central processing unit (CPU) 20, a system memory 10, a north bridge/memory bridge 30, an accelerator system 40, a device memory 50, and a south bridge/input-output (IO) bridge 60. The system memory 10 may be, for example, a volatile memory such as a dynamic random access memory (DRAM). The north bridge/memory bridge 30, for example, integrates a memory controller, a PCIe controller and the like; it is responsible for data exchange between the CPU 20 and the high-speed interfaces, and bridges the CPU 20 and the south bridge/IO bridge 60. The south bridge/IO bridge 60 serves the low-speed interfaces of the computer, such as a serial advanced technology attachment (SATA) controller. The accelerator system 40 may include, for example, devices or chips for accelerated processing of data such as graphics and video, such as a graphics processing unit (GPU) and an artificial intelligence (AI) accelerator. The device memory 50 may be, for example, a volatile memory such as a DRAM located outside the accelerator system 40.
In the present disclosure, the device memory 50 is also referred to as off-chip memory, i.e., memory located outside the chip of the accelerator system 40. In contrast, the accelerator system 40 also has volatile memory inside its chip, such as a level-1 (L1) cache and optionally a level-2 (L2) cache. This will be described in detail below in connection with some embodiments of the present disclosure.
Although Fig. 1 shows one example environment 100 in which multiple embodiments of the present disclosure can be implemented, the present disclosure is not limited thereto. Some embodiments of the present disclosure may also be used in other application environments that have an accelerator system such as a GPU, for example environments based on the ARM architecture or the RISC-V architecture.
Fig. 2 shows a schematic block diagram of an accelerator system 200 according to some embodiments of the present disclosure. The accelerator system 200 may be, for example, one specific implementation of the chip of the accelerator system 40 in Fig. 1. The accelerator system 200 is, for example, an accelerator system chip such as a GPU. In some embodiments, the accelerator system 200 includes a stream processor (SP) 210, a page table apparatus 220, a processing engine (PE) unit 230, a direct memory access (DMA) controller 240, an L1 cache 260, and an L2 cache 250.
The accelerator system 200 may be controlled by a host device such as the CPU 20 and receive instructions from the CPU 20. The SP 210 analyzes the instructions from the CPU 20 and assigns the analyzed operations to the PE unit 230, the page table apparatus 220 and the DMA controller 240 for processing.
The page table apparatus 220 maintains a page table for managing the on-chip virtual storage accessible to the accelerator system 200. As will be described in detail below, in the embodiments of the present disclosure, on-chip memory such as the L2 cache 250 and off-chip memory such as the device memory 50 in Fig. 1 constitute a virtual storage system with a unified virtual addressing space. The page table in the page table apparatus 220 can be accessed and updated jointly by the SP 210, the PE unit 230 and the DMA controller 240.
The PE unit 230 may include one or more processing engines (PEs) PE_1, PE_2, ..., PE_N, where N denotes an integer greater than or equal to 1. Each processing engine may be associated with a corresponding L1 cache. For example, as shown in Fig. 1, PE_1 may be associated with L1_1, PE_2 may be associated with L1_2, and so on. Each PE in the PE unit 230 may be a single-instruction multiple-thread (SIMT) device. Fig. 3 shows a schematic diagram of a SIMT parallel PE structure 300 according to some embodiments of the present disclosure. The parallel PE structure 300 shown in Fig. 3 may be implemented within a PE of the PE unit 230.
As shown, there may be one or more threads 320-1, 320-2, ..., 320-M in a PE, where M is an integer greater than or equal to 1, and the data to be processed by each thread comes from a corresponding buffer 310-1, 310-2, ..., 310-M. Each thread in a PE may have its own register file, and all threads of each PE also share a uniform register file (not shown).
Multiple PEs may perform the same or different processing work in parallel, and may perform address translation and access of target data in memory in parallel, thereby reducing processing time. For example, when performing computing tasks such as machine learning, a PE may perform processing such as sorting and convolution on the data to be processed.
A user (for example, a programmer) can write an application to achieve a particular purpose. For an application that requires a large amount of computation, the application can be run in parallel on multiple PEs, each processing a different portion of the data. An application is also called a kernel program (kernel). Further, one or more threads can be started at each PE. Each thread can perform thread-level data exchange between its own register file and the memory subsystem. It will be understood that a user can specify that multiple (for example, dozens, hundreds, or even more) threads be started at a PE to perform certain operations in parallel. Each thread has its own arithmetic logic execution unit and uses its own memory addresses, following a typical load-store architecture. Each execution unit includes a floating-point/fixed-point unit supporting multiple data types and an arithmetic logic unit.
Most instructions perform arithmetic and logic operations, for example, addition, subtraction, multiplication and division of floating-point and fixed-point numbers, or logical AND, OR, NOT and the like. The operands come from registers. Memory load/store instructions provide data exchange between the registers and the on-chip/off-chip memory. In general, all execution units in a PE execute the same instruction synchronously. By using predicate registers, some of the execution units can be masked off, thereby implementing the function of branch instructions.
In some embodiments, the data processed by the accelerator system 200 may be multi-dimensional tensor data or one-dimensional tensor data. For example, in some embodiments a tensor may be a four-dimensional tensor having four dimensions D1, D2, D3 and D4, and the tensor may have different sizes in the different dimensions. In other embodiments, a tensor may be a one-, two-, three- or higher-dimensional tensor, which is not limited in the present disclosure.
Moreover, in the embodiments of the present disclosure, element types such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32 and other custom element types may be supported within a tensor, which is likewise not limited in the present disclosure. For the addressing of a tensor, the element is the basic unit. For example, if the element type is int8, the element is one byte; if the element type is int16, the basic addressing unit is two bytes, and so on.
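To illustrate the element-based addressing just described, the following is a minimal C++ sketch; the names and the set of types are hypothetical, chosen only for illustration and not taken from the disclosure:

```cpp
// Illustrative sketch only: the element is the basic addressing unit, so a
// tensor element's byte offset scales with the element type's width.
#include <cstddef>

enum class ElemType { Uint8, Int8, Float16, Int16, Float32, Int32 };

constexpr std::size_t elem_size_bytes(ElemType t) {
    switch (t) {
        case ElemType::Uint8:
        case ElemType::Int8:    return 1;  // basic addressing unit: one byte
        case ElemType::Float16:
        case ElemType::Int16:   return 2;  // basic addressing unit: two bytes
        case ElemType::Float32:
        case ElemType::Int32:   return 4;
    }
    return 1;  // unreachable for the cases above
}

// Byte offset of the i-th element of a contiguously stored tensor.
constexpr std::size_t byte_offset(std::size_t elem_index, ElemType t) {
    return elem_index * elem_size_bytes(t);
}
```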
In a typical computing system, on-chip storage is faster and consumes less power, but its capacity is limited, while off-chip storage has longer access latency, higher power consumption and relatively lower bandwidth. Typically, on-chip storage is designed as a cache that cannot be explicitly addressed. In a typical computer system, the main memory is generally off-chip storage, and its data accesses use physical addresses.
As mentioned above, unlike existing on-chip storage mapping and management approaches, in the embodiments of the present disclosure the on-chip storage is managed by means of virtual storage rather than as an L2 cache; the on-chip storage and the off-chip storage form a uniformly addressable virtual storage space, providing programs with a virtual on-chip storage view. The data to be accessed by an application is managed through a page table, which indicates the mapping between the logical addresses of the data in the virtual storage space and the physical addresses in the on-chip memory or the off-chip memory.
Thus, when an application is run, the data is accessed by means of the page table, and the data may be physically stored in the on-chip memory or the off-chip memory. The unified virtual on-chip storage space not only facilitates storage space management, but also benefits the design and running of programs. For example, an application can address the data it accesses with logical addresses, without needing to know the physical address information of the data or on which physical medium the virtually stored data resides. This allows programmers to configure the different data to be processed conveniently and flexibly, by merely specifying the logical addresses corresponding to the data portion each application is to process. The running program does not need to manage data migration.
Storage management in the virtual storage space according to some embodiments of the present disclosure will be described below with reference to Fig. 4 and Fig. 5. Fig. 4 shows a schematic block diagram of part of a virtual storage space 400 according to some embodiments of the present disclosure. Fig. 5 shows a flowchart of an example process 500 of storage management. The process 500 may be implemented in the accelerator system 200.
The virtual storage space 400 is mapped to the on-chip memory and the off-chip memory. The on-chip memory refers to the on-chip memory of the accelerator system 200, for example the L2 cache in Fig. 2, which may be a static random access memory (SRAM) or another type of on-chip memory. The off-chip memory is, for example, the off-chip memory of the accelerator system 200, for example the device memory 50 in Fig. 1, which may be a dynamic random access memory (DRAM) or another type of off-chip memory.
In the process 500, at block 510, a page table for a virtual storage space, for example a page table for the virtual storage space 400, is created based on data to be accessed during execution of an application. The page table indicates at least the mapping between the logical addresses of the data to be accessed in the virtual storage space 400 and the physical addresses in the on-chip memory or the off-chip memory. At block 520, when the application is run, the data is accessed by means of the page table.
In some embodiments, the page table is maintained in the accelerator system 200, for example in the page table apparatus 220. In some embodiments, the SP 210 may receive a command sequence sent by the host to initiate the running of the application. According to the data to be accessed during execution of the application, the SP 210 may create the page table corresponding to the data, so as to indicate the mapping between the logical addresses and the physical addresses of the data.
In some embodiments, the storage structure of the data to be accessed in the virtual storage space can be flexibly defined in different applications. Specifically, the data to be accessed during execution of an application can be organized in the virtual storage space 400 in segments and pages. Herein, a "segment" is sometimes also called a storage segment or data segment, and a "page" is sometimes also called a storage page or data page.
The data may be divided into one or more segments, and each segment may include one or more pages. The number and size of the segments, and the number and size of the pages, can be determined according to the application. The running of an application on each PE may use one or more segments, and each segment may include one or more pages. The data to be accessed at runtime may include the data to be processed by the application, such as tensor data or data in other forms. The data to be accessed at runtime may also include the program instructions related to the application.
Since in the virtual storage space the physical address of data in the on-chip memory or the off-chip memory can be addressed through logical addresses, programmers do not need to care about the actual physical storage location of the data when writing programs. This gives programmers great flexibility in defining the segments and pages of the data, and also allows the on-chip storage space to be fully utilized. In processing tasks such as machine learning, a large number of matrix multiplications may need to be performed. Dividing the data into larger partitions (for example, larger segments or pages) to be executed in parallel by different PEs is therefore very beneficial to computing performance.
In addition, since the physical address information of the segments and pages in which the data resides does not need to be known, programmers can specify the data portion to be processed in an application through logical addresses. For example, a programmer only needs to configure the overall data to be processed by the application (for example, tensor data) together with its structural attribute information, as well as the data portion to be processed by each PE. When the program runs on the accelerator system 200, the logical addresses can be mapped to the physical addresses of the on-chip or off-chip memory by establishing a page table.
As an example, in Fig. 4, the virtual storage space 400 is used to store tensor data having three dimensions D1, D2 and D3; it schematically shows the first segment S1, the second segment S2 and the third segment S3 of the data (program) storage space of a single application. Different applications may use different numbers of segments. Each segment of the data may have a different size, so programmers can configure segments flexibly based on design needs. In some embodiments, the number of segments occupied by one application may be capped; for example, it may be specified that one application can occupy at most 16 segments.
In some embodiments, within a segment, at least one page may also be set up to further subdivide the data. For tensor data, the division into pages may be carried out in any one or more dimensions, and the numbers of pages divided in the different dimensions are independent of each other. For example, segment S1 in Fig. 4 may have four pages P[1], P[2], P[3] and P[4]; the second segment S2 has only one page; and so on. Here, the page size is defined by the application and may be variable. In the embodiments of the present disclosure, each segment may have a different number of pages, so programmers can flexibly configure the size of the pages within a segment based on design needs. For example, since the data of an entire page is loaded into the on-chip memory when the application is run, the page size can be configured so that a whole page fits into the on-chip memory, allowing the on-chip storage space to be fully utilized.
Further, each segment may be accessed by one or more PEs, including reading, writing or execution. For example, segment S1 may be accessed by 8 PEs (i.e., PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7 and PE_8), where segment S1 stores the data in tensor form to be processed by these PEs at runtime. It will be understood that, to improve data processing performance, the data may be processed in parallel by multiple threads at each PE. For example, in Fig. 4, it may be specified that the data of segment S1 is processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7 and PE_8. Besides the data to be processed by applications, a segment may also store the applications themselves. For example, segment S2 may be used to store the program instructions of one or more applications, and the program instructions stored in segment S2 may be executed by one or more PEs.
When establishing the page table, the SP 210 in the accelerator system 200 may establish in the page table the page table entries respectively corresponding to the page identifiers (also called "page numbers") of the pages into which the data is divided, each page table entry indicating at least the mapping of the corresponding page to its physical address in the on-chip memory or the off-chip memory. The page identifier (or page number) of a page is derived from the logical address of the data to be accessed.
When defining logical addresses, each segment may have a segment identifier and base address data, referred to as an anchor or reference point. For example, if a segment is divided into multiple portions to be executed on different PEs, the base address data may be expressed as the starting coordinate point of the data assigned to each PE. For example, the base address data may be coordinates in the respective dimensions of the tensor, such as (0, 0, 0, 0) or (0, 4, 0, 0). Multiple PEs may have the same or different base address data.
Data within a segment can be addressed within that segment relative to the base address data. The logical address of data within a segment may include the segment identifier of the segment, the base address data, and an intra-segment offset address, where the intra-segment offset address may include the page identifier of the page in which the data resides and the offset of that page within the segment relative to the base address data.
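As an illustration only, a logical address of this shape could be represented as follows; the field names and widths are assumptions made for the sketch, not part of the disclosure:

```cpp
// Hypothetical encoding of a logical address in the unified virtual storage
// space, mirroring the fields named above: segment identifier, anchor (base
// address data), and an intra-segment part consisting of a page identifier
// and the page's offset relative to the anchor.
#include <array>
#include <cstdint>

struct LogicalAddress {
    uint16_t segment_id;            // which segment the data belongs to
    std::array<int32_t, 4> anchor;  // base address data, e.g. (0, 4, 0, 0)
    uint16_t page_id;               // page within the segment
    uint32_t page_offset;           // offset of the page relative to the anchor
};
```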
In some embodiments, in the page table, each page table entry may include the page identifier of the page and the physical address to which the page is mapped, which may be a physical address in the on-chip memory or the off-chip memory. In some embodiments, the number of page table entries established in the page table may be capped, and this number can be configured according to the actual application. In some embodiments, the page table is stored in the on-chip memory to facilitate subsequent fast access to the page table.
After the page table is established, it is used to access the data when the application is run. For example, the SP 210 may receive a command sequence from the host, which includes storage mapping information and other commands, such as initialization commands. The SP 210 may create the page table based on the storage mapping information and store it in the page table apparatus 220. The SP 210 may control the running of the application on the PEs.
At runtime, if a PE is to access data in a target segment while running the application, the page identifier (or page number) of the target page in which the data resides is derived from the logical address. The logical address is further used to determine the in-page offset address of the data within the page. The in-page offset address may indicate the starting position of the data to be accessed within a page. The PE can access the page table through the page table apparatus 220, locate the corresponding page table entry according to the page identifier, and read from that page table entry the physical address of the target page in the on-chip memory or the off-chip memory. In some implementations, the PE may include an address translator for performing the translation between logical addresses and physical addresses. The PE may use the determined physical address to access the on-chip memory or the off-chip memory, so as to access the corresponding data portion.
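The lookup a PE performs at this point can be sketched as follows. This is a host-side illustration under assumed data structures (a flat map keyed by page identifier); the real page table apparatus 220 is a hardware structure whose layout the text does not specify:

```cpp
// Minimal sketch of the runtime lookup: derive the target page from the
// logical address, consult the page table for its physical address, and
// add the in-page offset.
#include <cstdint>
#include <optional>
#include <unordered_map>

struct PageTableEntry {
    uint64_t phys_addr;  // physical address in on-chip or off-chip memory
    bool on_chip;        // which physical medium the page is mapped to
};

using PageTable = std::unordered_map<uint32_t, PageTableEntry>;  // keyed by page id

// Returns the physical address of the accessed data if the page is mapped.
std::optional<uint64_t> translate(const PageTable& pt,
                                  uint32_t page_id,
                                  uint64_t in_page_offset) {
    auto it = pt.find(page_id);
    if (it == pt.end()) return std::nullopt;  // no page table entry for this page
    return it->second.phys_addr + in_page_offset;
}
```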
When the application is run, the data is accessed through the physical address and the in-page offset address. The data access modes may include direct access and indirect access. Direct access means that the PE can access the data directly from the off-chip memory or the on-chip memory, regardless of which of them the data resides in. Indirect access means that the data to be accessed is first loaded into the on-chip memory and then accessed; when the target page storing the data is mapped to the off-chip memory, indirect access requires loading the data from off-chip to on-chip. The data access mode may be a default, or may be set by the programmer as needed.
In some embodiments, when direct access is used, the constructed page table indicates the mapping between the logical addresses of the data in the virtual storage space and the physical addresses in the on-chip memory or the off-chip memory. When the application is run, if it is found through the determined physical address that the target page is mapped to the off-chip memory or the on-chip memory, the data of the target page located in the off-chip memory or the on-chip memory can be read directly based on the physical address and the in-page offset address, or the data can be written directly to the off-chip memory or the on-chip memory. For example, if the data of the target page includes data to be processed by the application, the data to be processed can be read from the off-chip memory or the on-chip memory into registers, or written from registers to the off-chip memory or the on-chip memory. As another example, if the data of the target page includes the program instructions of the application, the program instructions can be fetched directly from the off-chip memory or the on-chip memory and executed, or written directly to the off-chip memory or the on-chip memory.
In some embodiments, when indirect access is used, it is first ensured that the data to be accessed has been placed into the on-chip memory before subsequent access operations are performed. In this case, the constructed page table indicates the mapping between the logical addresses of the data in the virtual storage space and the physical addresses in the on-chip memory. If a target page mapped into the off-chip memory is to be read while the application is running, the data of the target page can be loaded from the off-chip memory into the on-chip memory for access, using the physical address of the target page in the off-chip memory. In some embodiments, the SP 210 may instruct the DMA controller 240 in the accelerator system 200 to read the data from the off-chip memory and buffer it in the on-chip memory. In some embodiments, the DMA operations and the running of the application can proceed in parallel, realizing stream processing. After the target page has been loaded into the on-chip memory, the physical address of the loaded target page can be determined through the page table, and the in-page offset address of the data to be read can be determined. The data can then be read from the on-chip memory based on the physical address of the target page in the on-chip memory and the determined in-page offset address.
In some embodiments, if a target page is to be written while the application is running and the target page is mapped to the off-chip memory, the data of the target page may first be written into the on-chip memory during the run, using the physical address of the target page in the on-chip memory and the determined in-page offset address. After the application has finished running, the SP 210 flushes the data of the target page from the on-chip memory to the off-chip memory, using the physical address of the target page in the off-chip memory. For example, the SP 210 may perform the flush of data from the on-chip memory to the off-chip memory by means of a FLUSH command. This frees more on-chip storage space for use at runtime.
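A simplified model of the indirect access mode might look as follows; here the two memories are simulated by byte arrays and the DMA transfers reduced to copies. All names are hypothetical; a real implementation would program the DMA controller 240 and issue the flush via a FLUSH command from the SP 210:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

std::vector<unsigned char> off_chip(1 << 20);  // stands in for device memory
std::vector<unsigned char> on_chip(1 << 16);   // stands in for the L2 cache

struct Page {
    std::size_t off_chip_addr;  // physical address in off-chip memory
    std::size_t on_chip_addr;   // physical address once loaded on-chip
    std::size_t size;
};

// Off-chip -> on-chip load, as performed by the DMA controller.
void dma_load(const Page& p) {
    std::memcpy(&on_chip[p.on_chip_addr], &off_chip[p.off_chip_addr], p.size);
}

// On-chip -> off-chip flush, issued after the application finishes.
void dma_flush(const Page& p) {
    std::memcpy(&off_chip[p.off_chip_addr], &on_chip[p.on_chip_addr], p.size);
}

// Indirect read: ensure the page is resident on-chip, then read at the offset.
void indirect_read(const Page& p, std::size_t offset, void* dst, std::size_t n) {
    dma_load(p);
    std::memcpy(dst, &on_chip[p.on_chip_addr + offset], n);
}

// Indirect write: write on-chip during the run; dma_flush(p) follows later.
void indirect_write(const Page& p, std::size_t offset, const void* src, std::size_t n) {
    std::memcpy(&on_chip[p.on_chip_addr + offset], src, n);
}
```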
In some embodiments, in addition to recording the mapping between the logical address and the physical address of a page, each page table entry in the page table further indicates the values of one or more reference counters of the corresponding page. As described in more detail below with reference to Fig. 6, the values of the reference counters can be used to manage the data dependences of pages, and the value of the reference counter in each page table entry can be updated based on the state in which the corresponding page is referenced in the on-chip memory or the off-chip memory, in particular based on at least one of the following: the ready state of the data of the corresponding page in the on-chip memory or the off-chip memory, or the access state of the page by the PEs that are to access the corresponding page.
In some embodiments, tensor data may be stored in high-speed on-chip memory, such as the L2 cache 250. However, since the capacity of the high-speed on-chip memory is small, when the tensor is large the programmer can divide the tensor into multiple segments, each segment describing a part of the tensor. The application can be launched multiple times; each time, the DMA controller 240 moves one segment of the tensor from off-chip storage to on-chip storage in advance for use by the application's operations. After the application has been launched multiple times, all segments contained in the tensor have been processed, and the whole run ends. When the high-speed on-chip memory is sufficient to accommodate all the tensors the application is to access, one tensor needs only one segment description, and the application needs to be launched only once.
In a parallel-processing acceleration system, the same application can be run on one or more PEs. These applications are written to process specific data, such as tensor data. As described above, in the virtual storage space data can be stored by pages, and can be used by applications at runtime after being written into the on-chip memory. The same page may therefore be used by different PEs. In this case, the management of pages is important. In other embodiments of the present disclosure, it is further proposed to manage the data dependences on pages in the virtual storage space by means of reference counters corresponding to the pages.
Fig. 6 shows a flowchart of a process 600 for storage management according to other embodiments of the present disclosure. The process 600 may be implemented in the accelerator system 200.
As shown in Fig. 6, at block 610, a page to be accessed by an application is determined, the page storing data. In the accelerator system 200, the SP 210 may receive a command sequence sent by the host to initiate the running of the application. By analyzing the command sequence, the SP 210 can determine the page to be accessed by the application to be run.
In some embodiments, one application may access one or more pages. Here, "accessing" a page refers to reading data from, writing data to, or executing instructions in a page in a storage space, and this storage space may be a virtual storage space obtained with the on-chip virtualization technique described above, or a storage space in an accelerator system that does not use such an on-chip virtualization technique.
As an example, in a task related to a machine learning model, an application may be configured to perform a matrix multiplication; that application may then access three pages for storing data, where the first page holds matrix A, the second page holds matrix B, and the third page holds the result of multiplying matrix A by matrix B. In some embodiments, the addressing information of the pages to be accessed can be determined from the command sequence concerning the application. For example, the first page holding matrix A may be page P[1] in segment 1 (S1), the second page holding matrix B may be page P[2] in segment 2 (S2), and the third page holding the matrix multiplication result may be page P[5] in segment 3 (S3). In some embodiments, a PE can fetch instructions from a page holding program instructions and execute them.
At block 620, the value of a first reference counter corresponding to the page is set based on the number of PEs to be started to run the application.
An application may perform access operations on data on one or more PEs. In the embodiments of the present disclosure, access to pages is managed by setting reference counters (v-counters), so as to prevent the data in a page from being deleted or replaced before the relevant PEs have finished using it.
In some embodiments, the values of the reference counters can be maintained in the page table. In the page table, each page table entry corresponds to one page and includes the address information of the page, realizing the translation from logical address to physical address described above, and may further include the values of reference counters. In some embodiments, each page may correspond to one or more reference counters, which can be set to respective values, as will be described below.
In some embodiments, the value of the first reference counter corresponding to a page (denoted v-counter[1]) may be based on the number of PEs that are to run the application, so as to keep track of the PEs' accesses to the page. In some embodiments, the value of the first reference counter may be set equal to the number of PEs that are to run the application.
In some embodiments, the value of another reference counter corresponding to the page (sometimes referred to herein as a second reference counter) may also be set, to characterize the ready state of the data in the page in the on-chip memory or the off-chip memory. By maintaining the value of the second reference counter, the data in a page can be prevented from being accessed, for example used in subsequent computation, before it is ready.
Specifically, after determining the page to be accessed, the SP 210 may set the value of the second reference counter corresponding to the page based on the ready state of the page in the on-chip memory or the off-chip memory. When the application is started on a PE, access operations can be performed based on the value of the second reference counter.
Depending on the actual storage arrangement and the execution of the individual applications, the data in a page may originally be stored in the on-chip memory or in the off-chip memory. For a page used to store processing results, for example the page storing the matrix multiplication result in the example above, the data in the page is only fully written into the on-chip memory after the matrix multiplication has finished.
In some embodiments, if the access to the data is a direct access, the value of the second reference counter can be set in consideration of the ready state of the data in the on-chip memory or the off-chip memory. In some embodiments, if the access to the data is an indirect access, i.e., the data needs to be loaded from the off-chip memory into the on-chip memory to reduce access latency, the value of the second reference counter can be set in consideration of the ready state of the data in the on-chip memory.
That is to say, in some embodiments, if it is determined that the data in the page is not ready in the on-chip memory or the off-chip memory, the value of the second reference counter may be set to a first value, for example set to 1, to indicate that the data of the page cannot be accessed yet. For example, if the data in a page is to be moved from the off-chip memory to the on-chip memory, or the data in the page will only be available after a computation finishes, the value of the second reference counter corresponding to the page is set to 1 when the move or computation starts, to prevent other entities from accessing the page before the move or computation is complete. In some embodiments, if the data in the page is ready in the on-chip memory or the off-chip memory, the second reference counter is set to a second value indicating that the page is accessible or the data is ready, for example 0.
In some embodiments, the second reference counter may be set to the first value (for example 1) by the SP 210. Continuing the example above, if matrix A in the first page P[1] is physically stored in the off-chip memory, the SP 210 may set the second reference counter (for example, denoted v-counter[0]) in the page table entry corresponding to the first page to 1, to indicate that the page is being loaded, and the SP 210 may instruct the DMA controller 240 to load matrix A into the on-chip memory. After the data loading of matrix A is completed, the DMA controller 240 may set v-counter[0] in the page table entry corresponding to the first page to 0. For the second page P[2] holding matrix B, the value of the second reference counter v-counter[0] can be set similarly.
For the third page P[5] holding the result of the matrix multiplication, if the application running on a PE is to write the result to that page, the PE may first set the first counter in the page table entry corresponding to the third page P[5] (such as v-counter[0]; for the writer it is referred to as the first counter, while for a reader it is referred to as the second counter) to the number of PEs, to prevent the page from being accessed before the result has been completely written. After the writing of the result is completed, the PE may set v-counter[0] in the page table entry corresponding to the third page P[5] to 0.
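The ready-state protocol around v-counter[0] can be illustrated as follows; the structure and function names are assumptions made for the sketch:

```cpp
// Sketch of the ready-state counter: the SP sets v_counter0 to 1 before a
// DMA load starts and the DMA controller clears it on completion; a
// producer initializes it to the number of writers instead.
#include <atomic>
#include <cstdint>

struct PageEntryCounters {
    std::atomic<uint32_t> v_counter0{0};  // ready state of the page's data
    std::atomic<uint32_t> v_counter1{0};  // outstanding PE accesses
};

void begin_dma_load(PageEntryCounters& pte) {
    pte.v_counter0.store(1);  // data not ready while the load is in flight
    // ... program the DMA controller 240 here ...
}

void dma_load_done(PageEntryCounters& pte) {
    pte.v_counter0.store(0);  // data ready; the page may now be accessed
}

bool page_ready(const PageEntryCounters& pte) {
    return pte.v_counter0.load() == 0;
}
```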
For a PE that is to access a page, whether the data in the page is ready can be determined from the counter value. Specifically, if the value of the second reference counter indicates that the data in the page is not yet accessible (for example, the value is 1 or the number of PEs), access to the page must wait, and the PE may first prevent the application from performing the access operation. In some embodiments, the PE may periodically query the corresponding page table entry in the page table to determine the ready state of the data in the page. In some embodiments, if it is found from the value of the second reference counter that the data in the page is not ready, an interrupt may also be sent to the host to inform the host that the data in the page is not yet ready.
The setting of the values of some reference counters corresponding to pages has been discussed above. For better understanding, a concrete example will be described. Assume again that an application is configured to perform the multiplication of matrix A and matrix B, and that the application is to be run on 4 PEs. The following is an example of the command flow of the SP 210.
● LOAD P[0] (where P[0] may include the application and static global parameters);
● LOAD P[1] (matrix A, i.e., input data to be processed);
○ set the counter v-counter[0] corresponding to page P[1] to 1, and start the DMA controller 240 to load matrix A from the off-chip memory into the on-chip memory,
○ set the counter v-counter[1] corresponding to page P[1] to the number of PEs used (for example, equal to 4)
● LOAD P[2] (matrix B, i.e., input data to be processed);
○ set the counter v-counter[0] corresponding to page P[2] to 1, and start the DMA controller 240 to load matrix B from the off-chip memory into the on-chip memory,
○ set the counter v-counter[1] corresponding to page P[2] to the number of PEs used (for example, equal to 4)
● INIT P[5] (used to store the result of multiplying matrix A by matrix B)
○ assign an initial value, for example 4, to the counter v-counter[0] corresponding to page P[5].
After determining that the pages to be used are ready, the application can start performing access operations on the selected PE or PEs.
In the process 600, at block 630, the value of the first reference counter is updated based on the access state of the page by the application on the PEs. As described above, the value of the first reference counter is set based on the number of PEs. Through real-time updates, the value of the first reference counter can reflect the real-time access state, by the application on the PEs, of the page corresponding to that counter.
In some embodiments, when a PE has completed the application's access operations on a page, the PE may update the value of the first reference counter of the corresponding page to reflect that the PE has finished using the page. For example, the PE may decrement the value of the first reference counter by one. As the access operations on the page by the application running on the individual PEs are completed one after another, the value of the first reference counter decreases.
After all PEs have completed the application's access operations on a page, the value of the first reference counter will have been updated so as to indicate that no PE is to access the page. For example, if the value of the first reference counter (for example, v_counter[1]) corresponding to a page is set to 4, then after the application's access operations on the page have been completed on all 4 PEs, the value of the first reference counter v_counter[1] is 0, indicating that no PE is to access the page.
At block 640, the data in the page is released or replaced based on the updated value of the first reference counter. In some embodiments, if the updated value of the first reference counter indicates that there is no PE that is to perform an access operation on the page, for example if the value of the first reference counter is 0, this means that all relevant PEs have finished using the page. The data in the page can then be released, for example deleted from the on-chip memory, or replaced with other data. The choice between releasing and replacing depends on the specific application.
The value of the first counter can reflect the usage of the page by the PEs, for example how many PEs still need to use the page and how many PEs have already used it, preventing the page from being deleted or replaced before it has been fully used. Maintaining the first reference counter can improve the reuse rate of pages and the reuse rate of the storage space of the on-chip memory.
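A minimal sketch of this access-tracking counter, assuming an atomic counter per page table entry (the disclosure does not mandate this representation), is:

```cpp
// The counter is initialized to the number of PEs that will run the
// kernel, decremented as each PE completes its accesses, and used to
// decide when the page may be released or replaced.
#include <atomic>
#include <cstdint>

struct PageAccessState {
    std::atomic<uint32_t> v_counter1{0};  // PEs still to access the page
};

void on_launch(PageAccessState& p, uint32_t num_pes) {
    p.v_counter1.store(num_pes);  // e.g. 4 PEs will access this page
}

// Called by each PE when its access operations on the page are complete.
// Returns true when no PE still needs the page, so it may be freed/replaced.
bool on_pe_done(PageAccessState& p) {
    return p.v_counter1.fetch_sub(1) == 1;
}
```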
Continuing the example of the command flow of the SP 210 above: after executing LOAD P[1], LOAD P[2] and INIT P[5], the SP 210 continues with the following command flow:
● LAUNCH the application to run on the selected PEs
○ the LAUNCH command establishes the logical-address-to-physical-address mappings for P[0], 1, 2 and 5,
○ the PEs query the value of v_counter[0] of each page, and the access operations of the application on the PEs refer to the values of v_counter[1] of these pages,
○ after the matrix multiplication is completed, the result is placed into P[5],
○ when each PE completes its computing task, it decrements v_counter[1] or v_counter[0] corresponding to the pages it accessed by 1.
● FLUSH P[5], which updates the states of P[0], 1, 2 and 5 and writes the matrix multiplication result in P[5] to the external memory. The data of P[5] in the on-chip memory is released.
In the example above, suppose there is another application in the accelerator system 200 that is to access the computation result in P[5]. Since the result in P[5] itself resides in the on-chip memory, and the value of the counter v_counter[0] corresponding to P[5] (the second reference counter from a reader's perspective) is 0, the data in P[5] can be used directly when that application is run on the PEs. That is, the result in P[5] does not need to be written to the external memory and then loaded into the on-chip memory again.
In some embodiments, if an application is to access a page, for example to write new data into the page, the values of the reference counters corresponding to the page must also be queried when the application runs. If the value of the first reference counter corresponding to the page indicates that there is no PE that is to perform an access operation on the page, and the value of the second reference counter indicates that the page is accessible, for example if the values of both the first and second reference counters corresponding to the page are 0, then the data to be accessed by the application can replace the existing data in the page, and the value of the first reference counter corresponding to the page is updated accordingly. Note that the application here may be another run of the same application that previously accessed the page, or may be a different application.
The use of some reference counters corresponding to a page has been discussed above. The reference counters corresponding to a page can be used to manage the various uses of the page.
In some embodiments, multiple reference counters, for example two or more (for example, 3) reference counters, can be maintained in a page table entry of the page table. When the values of multiple reference counters are maintained, the values of some of the counters can be selected as needed to indicate the ready state of the data in a page and the access state of the page by the application on the individual PEs. The values of unused reference counters can be initialized to 0. In this way, whether a page is accessible, or can be deleted or replaced, can be determined by checking that the values of all counters corresponding to the page are 0.
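The resulting idle check can be as simple as verifying that every counter of the page reads zero, as in this small sketch (three counters per entry is an assumption made for the illustration):

```cpp
// A page may be accessed anew, released or replaced only when every
// reference counter in its page table entry reads zero (unused counters
// are initialized to zero).
#include <algorithm>
#include <array>
#include <atomic>
#include <cstdint>

bool page_idle(const std::array<std::atomic<uint32_t>, 3>& counters) {
    return std::all_of(counters.begin(), counters.end(),
                       [](const std::atomic<uint32_t>& c) { return c.load() == 0; });
}
```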
Moreover, it should be understood that the counter values given above are all examples; other values can be set, as long as they can reflect the indicated states.
Fig. 7 shows a schematic block diagram of an apparatus 700 for storage management according to some embodiments of the present disclosure. The apparatus 700 may be implemented as, or included in, the accelerator system 200 of Fig. 2. The apparatus 700 may include multiple modules for performing the corresponding steps of the method 500 as discussed with respect to Fig. 5.
As shown in Fig. 7, the apparatus 700 includes a creating unit 710 configured to create a page table for a virtual storage space based on data to be accessed during execution of an application. The virtual storage space is mapped to an on-chip memory and an off-chip memory. The page table indicates at least a mapping between logical addresses of the data in the virtual storage space and physical addresses in the on-chip memory or the off-chip memory. The apparatus 700 further includes an accessing unit 720 configured to access the data by means of the page table when the application is executed.
In some embodiments, the data is divided into at least one segment, each segment including at least one page. In some embodiments, the creating unit 710 is configured to: establish, in the page table, page table entries respectively corresponding to the pages into which the data is divided, each page table entry indicating at least a mapping between the logical address of the corresponding page in the virtual storage space and its physical address in the on-chip memory or the off-chip memory.
In some embodiments, each page table entry in the page table further indicates the value of a reference counter of the corresponding page. In some embodiments, the value of the reference counter in each page table entry is updated based on at least one of the following: the ready state of the data of the corresponding page in the on-chip memory or the off-chip memory, or the access state of the page by the processing engines that are to access the corresponding page.
In some embodiments, the logical address of the data in the virtual storage space indicates the segment identifier of the segment in which the data resides, base address data, the page identifier of the page in which the data resides, and the offset of that page relative to the base address data.
In some embodiments, the data includes tensor data and/or program instructions.
In some embodiments, the page table is stored in the on-chip memory.
In some embodiments, the accessing unit includes: a logical address determining unit configured to determine a target page according to the logical address of the data in the virtual storage space; an address translation unit configured to determine, using the page table, the physical address of the target page in the on-chip memory or the off-chip memory; an in-page offset address determining unit configured to determine an in-page offset address of the data according to the logical address; and an address-based accessing unit configured to access the data using the physical address of the target page and the in-page offset address.
In some embodiments, the address-based accessing unit is configured to: if the access to the target page includes reading the target page, read the data directly from the on-chip memory or the off-chip memory using the physical address and the in-page offset address; and if the access to the target page includes writing to the target page, write the data directly to the on-chip memory or the off-chip memory using the physical address and the in-page offset address.
In some embodiments, the target page is mapped into the off-chip memory, and the physical address determined using the page table includes the physical address of the target page in the on-chip memory. The address-based accessing unit may further be configured to: if the access to the target page includes reading the target page, load the data of the target page from the off-chip memory into the on-chip memory using the physical address of the target page in the off-chip memory, and read the data from the on-chip memory based on the physical address of the target page in the on-chip memory and the in-page offset address; and if the access to the target page includes writing to the target page, write data into the on-chip memory using the physical address of the target page in the on-chip memory and the in-page offset address, and flush the data of the target page from the on-chip memory to the off-chip memory using the physical address of the target page in the off-chip memory.
Fig. 8 shows a schematic block diagram of an apparatus 800 for storage management according to other embodiments of the present disclosure. The apparatus 800 may be implemented as, or included in, the accelerator system 200 of Fig. 2. The apparatus 800 may include multiple modules for performing the corresponding steps of the method 600 as discussed with respect to Fig. 6.
As shown in Fig. 8, the apparatus 800 includes a page determining unit 810 configured to determine a page to be accessed by an application, the page storing data. The apparatus 800 further includes a first counter setting unit 820 configured to set the value of a first reference counter corresponding to the page based on the number of processing engines to be started to run the application. The apparatus 800 further includes a first counter updating unit 830 configured to update the value of the first reference counter based on the access state of the page by the application on the processing engines. The apparatus 800 further includes a data release or replacement unit 840 configured to release or replace the data in the page based on the updated value of the first reference counter.
In some embodiments, the apparatus 800 may further include: a second counter setting unit configured to set the value of a second reference counter corresponding to the page based on the ready state of the data in the page in the on-chip memory or the off-chip memory; and a program running unit configured to run the application on the processing engines based on the value of the second reference counter.
In some embodiments, the second counter setting unit includes: a first value setting unit configured to set the second reference counter to a first value if the data in the page is not ready in the on-chip memory or the off-chip memory; and a second value setting unit configured to set the second reference counter to a second value if the data in the page is ready in the on-chip memory or the off-chip memory.
In some embodiments, the program running unit includes: an access blocking unit configured to prevent the application from performing access operations on the page on the processing engine if the second reference counter has the first value; and an access starting unit configured to allow the application to start performing access operations on the page on the processing engine if the second reference counter has the second value.
In some embodiments, the first counter setting unit 820 is configured to: set the value of the first reference counter equal to the number of the processing engines.
In some embodiments, the first counter updating unit 830 is configured to: decrement the value of the first reference counter by one when the application completes its access operations on the page on one of the processing engines.
In some embodiments, the data release or replacement unit 840 is configured to: release the page from the on-chip memory, or replace the data in the page, if the updated value of the first reference counter indicates that there is no processing engine that is to perform an access operation on the page.
In some embodiments, another application is to access the page. In some embodiments, the data release or replacement unit 840 is configured to: replace the data in the page with the data to be accessed by the other application if the updated value of the first reference counter indicates that there is no processing engine that is to perform an access operation on the page and the value of the second reference counter indicates that the page is accessible.
In some embodiments, the page has a corresponding page table entry in a page table and is mapped to a physical address in physical storage space.
Moreover, although the operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (21)

  1. A storage management method, comprising:
    creating a page table for a virtual storage space based on data to be accessed during execution of an application, the virtual storage space being mapped to an on-chip memory and an off-chip memory, the page table indicating at least a mapping between logical addresses of the data in the virtual storage space and physical addresses in the on-chip memory or the off-chip memory; and
    accessing the data by means of the page table when the application is executed.
  2. The method according to claim 1, wherein the data is divided into at least one segment, each segment comprising at least one page, and wherein creating the page table comprises:
    establishing, in the page table, page table entries respectively corresponding to the pages into which the data is divided, each page table entry indicating at least a mapping between the logical address of the corresponding page in the virtual storage space and its physical address in the on-chip memory or the off-chip memory.
  3. The method according to claim 2, wherein each page table entry in the page table further indicates the value of a reference counter of the corresponding page, and
    wherein the value of the reference counter in each page table entry is updated based on at least one of the following: a ready state of the data of the corresponding page in the on-chip memory or the off-chip memory, or an access state of the page by a processing engine that is to access the corresponding page.
  4. The method according to claim 2, wherein the logical address of the data in the virtual storage space indicates a segment identifier of the segment in which the data resides, base address data, a page identifier of the page in which the data resides, and an offset of that page relative to the base address data.
  5. The method according to claim 1, wherein the data comprises tensor data and/or program instructions.
  6. The method according to claim 1, wherein the page table is stored in the on-chip memory.
  7. The method according to claim 1, wherein accessing the data by means of the page table comprises:
    determining a target page according to the logical address of the data in the virtual storage space;
    determining, using the page table, a physical address of the target page in the on-chip memory or the off-chip memory;
    determining an in-page offset address of the data according to the logical address; and
    accessing the data using the physical address of the target page and the in-page offset address.
  8. The method according to claim 7, wherein accessing the data using the physical address of the target page and the in-page offset address comprises:
    if the access to the target page comprises reading the target page, reading the data directly from the on-chip memory or the off-chip memory using the physical address and the in-page offset address; and
    if the access to the target page comprises writing to the target page, writing the data directly to the on-chip memory or the off-chip memory using the physical address and the in-page offset address.
  9. The method according to claim 7, wherein the target page is mapped into the off-chip memory, and the physical address determined using the page table comprises a physical address of the target page in the on-chip memory, and wherein accessing the data using the physical address of the target page and the in-page offset address further comprises:
    if the access to the target page comprises reading the target page, loading the data of the target page from the off-chip memory into the on-chip memory using a physical address of the target page in the off-chip memory, and reading the data from the on-chip memory based on the physical address of the target page in the on-chip memory and the in-page offset address; and
    if the access to the target page comprises writing to the target page, writing data into the on-chip memory using the physical address of the target page in the on-chip memory and the in-page offset address, and flushing the data of the target page from the on-chip memory to the off-chip memory using the physical address of the target page in the off-chip memory.
  10. A computer-readable storage medium storing a plurality of programs configured to be executed by one or more processing units, the plurality of programs comprising instructions for performing the method according to any one of claims 1-9.
  11. A computer program product comprising a plurality of programs configured to be executed by one or more processing units, the plurality of programs comprising instructions for performing the method according to any one of claims 1-9.
  12. An accelerator system, comprising:
    a processing unit; and
    a memory coupled to the processing unit, the memory having instructions stored therein which, when executed by the processing unit, perform the method according to any one of claims 1-9.
  13. An apparatus for storage management, comprising:
    a creating unit configured to create a page table for a virtual storage space based on data to be accessed during execution of an application, the virtual storage space being mapped to an on-chip memory and an off-chip memory, the page table indicating at least a mapping between logical addresses of the data in the virtual storage space and physical addresses in the on-chip memory or the off-chip memory; and
    an accessing unit configured to access the data by means of the page table when the application is executed.
  14. The apparatus according to claim 13, wherein the data is divided into at least one segment, each segment comprising at least one page, and wherein the creating unit is configured to:
    establish, in the page table, page table entries respectively corresponding to the pages into which the data is divided, each page table entry indicating at least a mapping between the logical address of the corresponding page in the virtual storage space and its physical address in the on-chip memory or the off-chip memory.
  15. The apparatus according to claim 14, wherein each page table entry in the page table further indicates the value of a reference counter of the corresponding page, and
    wherein the value of the reference counter in each page table entry is updated based on at least one of the following: a ready state of the data of the corresponding page in the on-chip memory or the off-chip memory, or an access state of the page by a processing engine that is to access the corresponding page.
  16. The apparatus according to claim 14, wherein the logical address of the data in the virtual storage space indicates a segment identifier of the segment in which the data resides, base address data, a page identifier of the page in which the data resides, and an offset of that page relative to the base address data.
  17. The apparatus according to claim 13, wherein the data comprises tensor data and/or program instructions.
  18. The apparatus according to claim 13, wherein the page table is stored in the on-chip memory.
  19. The apparatus according to claim 13, wherein the accessing unit comprises:
    a logical address determining unit configured to determine a target page according to the logical address of the data in the virtual storage space;
    an address translation unit configured to determine, using the page table, a physical address of the target page in the on-chip memory or the off-chip memory;
    an in-page offset address determining unit configured to determine an in-page offset address of the data according to the logical address; and
    an address-based accessing unit configured to access the on-chip memory or the off-chip memory using the physical address of the target page and the in-page offset address.
  20. The apparatus according to claim 19, wherein the address-based accessing unit is configured to:
    if the access to the target page comprises reading the target page, read the data directly from the on-chip memory or the off-chip memory using the physical address and the in-page offset address; and
    if the access to the target page comprises writing to the target page, write the data directly to the on-chip memory or the off-chip memory using the physical address and the in-page offset address.
  21. The apparatus according to claim 19, wherein the target page is mapped into the off-chip memory, and the physical address determined using the page table comprises a physical address of the target page in the on-chip memory, and wherein the address-based accessing unit is further configured to:
    if the access to the target page comprises reading the target page, load the data of the target page from the off-chip memory into the on-chip memory using a physical address of the target page in the off-chip memory, and read the data from the on-chip memory based on the physical address of the target page in the on-chip memory and the in-page offset address; and
    if the access to the target page comprises writing to the target page, write data into the on-chip memory using the physical address of the target page in the on-chip memory and the in-page offset address, and flush the data of the target page from the on-chip memory to the off-chip memory using the physical address of the target page in the off-chip memory.
PCT/CN2022/107143 2021-12-06 2022-07-21 Method, medium, program product, system and apparatus for storage management WO2023103392A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111479984.2 2021-12-06
CN202111479984.2A CN114218153B (zh) 2021-12-06 2021-12-06 Method, medium, program product, system and apparatus for storage management

Publications (1)

Publication Number Publication Date
WO2023103392A1 true WO2023103392A1 (zh) 2023-06-15

Family

ID=80700015

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107143 WO2023103392A1 (zh) 2021-12-06 2022-07-21 Method, medium, program product, system and apparatus for storage management

Country Status (2)

Country Link
CN (1) CN114218153B (zh)
WO (1) WO2023103392A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114218153B (zh) 2021-12-06 2023-11-14 海飞科(南京)信息技术有限公司 Method, medium, program product, system and apparatus for storage management
CN115718641A (zh) * 2023-01-09 2023-02-28 苏州浪潮智能科技有限公司 Memory simulation method and apparatus, storage medium, and electronic apparatus

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1379334A (zh) * 2001-03-30 2002-11-13 斯罗扬有限公司 地址转换
CN104850503A (zh) * 2015-05-06 2015-08-19 中国航天科技集团公司第九研究院第七七一研究所 一种通用地址空间管理方法及其结构
US20210318812A1 (en) * 2020-04-09 2021-10-14 Synaptics Incorporated Page-based memory access control
CN114218153A (zh) * 2021-12-06 2022-03-22 海飞科(南京)信息技术有限公司 用于存储管理的方法、介质、程序产品、系统和装置

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005108262A (ja) * 1994-09-09 2005-04-21 Renesas Technology Corp Data processing device
US6298428B1 (en) * 1998-03-30 2001-10-02 International Business Machines Corporation Method and apparatus for shared persistent virtual storage on existing operating systems
US6910106B2 (en) * 2002-10-04 2005-06-21 Microsoft Corporation Methods and mechanisms for proactive memory management
GB2399899B (en) * 2003-03-27 2005-06-22 Micron Technology Inc Active memory command engine and method
US7334076B2 (en) * 2005-03-08 2008-02-19 Microsoft Corporation Method and system for a guest physical address virtualization in a virtual machine environment
JP5664347B2 (ja) * 2011-03-04 2015-02-04 ソニー株式会社 Virtual memory system, virtual memory control method, and program
US20130326151A1 (en) * 2012-05-31 2013-12-05 Semiconductor Energy Laboratory Co., Ltd. Memory management system and program
US10037228B2 (en) * 2012-10-25 2018-07-31 Nvidia Corporation Efficient memory virtualization in multi-threaded processing units
CN103034593B (zh) * 2012-12-11 2015-07-22 中国人民解放军国防科学技术大学 On-chip lock variable globally addressed storage method and device for many-core processors
US9720717B2 (en) * 2013-03-14 2017-08-01 Sandisk Technologies Llc Virtualization support for storage devices
US9495302B2 (en) * 2014-08-18 2016-11-15 Xilinx, Inc. Virtualization of memory for programmable logic
US20180024938A1 (en) * 2016-07-21 2018-01-25 Advanced Micro Devices, Inc. Allocating physical pages to sparse data sets in virtual memory without page faulting
US10423541B1 (en) * 2016-12-22 2019-09-24 Amazon Technologies, Inc. Using encryption within a computing system
GB2570744B (en) * 2018-06-29 2020-08-26 Imagination Tech Ltd Virtual memory management
KR20210112923A (ko) * 2020-03-06 2021-09-15 삼성전자주식회사 System on chip and operating method thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1379334A (zh) * 2001-03-30 2002-11-13 斯罗扬有限公司 地址转换
CN104850503A (zh) * 2015-05-06 2015-08-19 中国航天科技集团公司第九研究院第七七一研究所 一种通用地址空间管理方法及其结构
US20210318812A1 (en) * 2020-04-09 2021-10-14 Synaptics Incorporated Page-based memory access control
CN114218153A (zh) * 2021-12-06 2022-03-22 海飞科(南京)信息技术有限公司 用于存储管理的方法、介质、程序产品、系统和装置

Also Published As

Publication number Publication date
CN114218153A (zh) 2022-03-22
CN114218153B (zh) 2023-11-14

Similar Documents

Publication Publication Date Title
US10037228B2 (en) Efficient memory virtualization in multi-threaded processing units
WO2023103392A1 (zh) Method, medium, program product, system and apparatus for storage management
US10310973B2 (en) Efficient memory virtualization in multi-threaded processing units
US10169091B2 (en) Efficient memory virtualization in multi-threaded processing units
US9244839B2 (en) Methods and apparatus for supporting persistent memory
CN114667508B (zh) Method and system for fetching data for an accelerator
KR20130010442A (ko) Virtual GPU
WO2023040460A1 (zh) Memory access method and electronic apparatus
US11947821B2 (en) Methods and systems for managing an accelerator's primary storage unit
US11868306B2 (en) Processing-in-memory concurrent processing system and method
JP7126136B2 (ja) Reconfigurable cache architecture and cache coherency method
WO2023173642A1 (zh) Instruction scheduling method, processing circuit, and electronic device
EP3830702A1 (en) Vmid as a gpu task container for virtualization
WO2023103397A1 (zh) Method, medium, program product, system and apparatus for storage management
US11372768B2 (en) Methods and systems for fetching data for an accelerator
WO2023077880A1 (zh) Method for sharing data based on scratchpad memory, and electronic apparatus
WO2023103391A1 (zh) Stream processing method, processing circuit, and electronic device
CN114281414B (zh) Data writing method for URF registers in an AIGPU architecture
US12061550B2 (en) Coherent multiprocessing enabled compute in storage and memory

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22902827

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE