WO2023040460A1 - Memory access method and electronic device - Google Patents

Memory access method and electronic device

Info

Publication number
WO2023040460A1
Authority
WO
WIPO (PCT)
Prior art keywords
page
target element
memory
cache
segment
Application number
PCT/CN2022/107136
Other languages
English (en)
French (fr)
Inventor
杨经纬
葛建明
谢钢锋
许飞翔
彭永超
袁红岗
仇小钢
Original Assignee
海飞科(南京)信息技术有限公司
Application filed by 海飞科(南京)信息技术有限公司
Publication of WO2023040460A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/0223 User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023 Free address space management
    • G06F12/0238 Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • G06F12/0246 Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory in block erasable memory, e.g. flash memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0842 Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0866 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
    • G06F12/0871 Allocation or management of cache space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal

Definitions

  • Embodiments of the present disclosure generally relate to the electronic field, and more specifically relate to a method and an electronic device for accessing a memory.
  • GPUs: graphics processing units.
  • In computing, tensor data usually represents one-dimensional or multi-dimensional array data.
  • Image data is a typical two-dimensional tensor, which can be represented by a two-dimensional array.
  • When processing image data, different parts of the image can be processed in parallel by a multi-core processor to reduce processing time.
  • Conventional tensor data is usually stored in memory in the form of a one-dimensional array. Therefore, when designing a program, the programmer needs to consider how to correctly address the tensor data in memory when the program loads it.
  • When the tensor data represents a multi-dimensional tensor, the programmer usually needs to know the mapping from the multi-dimensional tensor to one-dimensional data, that is, the correct mapping from the multi-dimensional tensor to the physical addresses of the one-dimensional data in memory, which imposes a severe cognitive burden.
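The mapping that a programmer conventionally has to manage by hand can be sketched as follows. This is an illustrative row-major flattening only (function name and shapes are hypothetical), not the addressing scheme claimed by this disclosure:

```python
# Illustrative only: the manual flattening a programmer conventionally needs
# when a multi-dimensional tensor is stored as a one-dimensional array,
# assuming row-major order.

def flatten_index(coords, shape):
    """Map multi-dimensional coordinates to a 1-D array index (row-major)."""
    assert len(coords) == len(shape)
    index = 0
    for c, dim in zip(coords, shape):
        assert 0 <= c < dim, "coordinate out of range"
        index = index * dim + c
    return index

# A 2-D "image" of shape (height=480, width=640): element (row=2, col=5)
# lives at 1-D index 2*640 + 5 = 1285.
print(flatten_index((2, 5), (480, 640)))  # 1285
```

Keeping such formulas correct for every tensor in a program is exactly the cognitive burden the disclosure aims to remove.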
  • Embodiments of the present disclosure provide a method and apparatus for accessing a memory.
  • a method for accessing memory includes converting the logical address of the target element in the first segment of the tensor to the physical address of the target element in memory.
  • the logical address includes segment reference data and offset data of the first segment in the tensor, and the offset data represents the offset of the target element on each dimension among the multiple dimensions of the first segment.
  • the method also includes accessing the memory using the physical address of the target element. Programmers can thus consider data from the perspective of multi-dimensional tensors without knowing the mapping of multi-dimensional tensors to one-dimensional data, that is, the correct mapping of multi-dimensional tensors to the physical addresses of one-dimensional data in memory, thereby reducing programmers' cognitive load, increasing development efficiency and reducing development time.
  • the first segment includes at least one page, and the at least one page includes a target element page where the target element is located.
  • Converting the logical address of the target element into a physical address includes: converting the logical address into a linear address at least according to the size of each dimension of the target element page, where the linear address includes a one-dimensional page identifier of the target element page and a one-dimensional offset value of the target element within the target element page; and converting the linear address into a physical address according to the page table entry for the target element page, where the page table entry includes the page physical address of the target element page.
  • Converting the linear address into the physical address according to the page table entry for the target element page includes: looking up the page physical address corresponding to the target element page from the page table entry; and adding the page physical address to the one-dimensional offset value to obtain the physical address of the target element.
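The two-step conversion described above can be sketched as follows. All names, page shapes, and addresses here are hypothetical; a row-major arrangement of pages and in-page elements is assumed for illustration:

```python
# Minimal sketch of the claimed two-step conversion: logical address ->
# linear address (page identifier + one-dimensional in-page offset) ->
# physical address. Names, shapes and addresses are hypothetical.

def logical_to_linear(offsets, page_shape, pages_per_dim):
    """Convert per-dimension element offsets into (page id, in-page offset)."""
    page_id = 0
    in_page_off = 0
    for o, p, n in zip(offsets, page_shape, pages_per_dim):
        page_id = page_id * n + o // p         # which page along this dimension
        in_page_off = in_page_off * p + o % p  # offset inside the page
    return page_id, in_page_off

def linear_to_physical(page_id, in_page_off, page_table):
    """Look up the page physical address and add the one-dimensional offset."""
    page_phys = page_table[page_id]  # page table entry holds the page address
    return page_phys + in_page_off

# A 2-D segment split into 2 x 2 pages, each page holding 4 x 8 elements.
page_table = {0: 0x1000, 1: 0x2000, 2: 0x3000, 3: 0x4000}
pid, off = logical_to_linear(offsets=(5, 10), page_shape=(4, 8), pages_per_dim=(2, 2))
print(pid, off, hex(linear_to_physical(pid, off, page_table)))  # 3 10 0x400a
```

In this sketch the hardware (rather than the programmer) performs the flattening, which is the point of the claimed method.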
  • each page in the at least one page includes multiple cache lines, and the dimension arrangement order of the multiple elements of the tensor within a cache line is different from the dimension arrangement order of the multiple cache lines of the tensor within the at least one page, or is different from the dimension arrangement order of the at least one page within the first segment.
  • Using the physical address of the target element to access the memory includes: using the physical address of the target element to read a cache line including the target element from the memory to a first-level cache, and the memory is a second-level cache or an off-chip memory.
  • the first page of the at least one page includes a first cache line and a second cache line arranged along a first dimension, and a third cache line and a fourth cache line arranged along the first dimension; the first cache line and the third cache line are arranged along a second dimension different from the first dimension, and the second cache line and the fourth cache line are arranged along the second dimension.
  • the first cache line includes a first element and a second element arranged along the second dimension, and a third element and a fourth element arranged along the second dimension; the first element and the third element are arranged along the first dimension, and the second element and the fourth element are arranged along the first dimension.
  • the target element is one of the first element, the second element, the third element and the fourth element. Reading the cache line including the target element from the memory into the L1 cache using the physical address of the target element includes storing at least the first element, the second element, the third element and the fourth element into the L1 cache.
  • a first page of the at least one page includes a first plurality of cache lines, and the first plurality of cache lines includes a target element.
  • a second page of the at least one page includes a second plurality of cache lines including another target element.
  • the size of the second page is different from the size of the first page, or the size of each cache line in the second plurality of cache lines is different from the size of each cache line in the first plurality of cache lines, or the dimension arrangement order of the plurality of elements of the first plurality of cache lines is different from the dimension arrangement order of the plurality of elements of the second plurality of cache lines.
  • the method further includes: converting, in parallel with the target element, the logical address of another target element into another physical address of the other target element in the memory, the logical address of the other target element including the segment reference data and another offset data, where the other offset data represents the offset value of the other target element on each of the multiple dimensions of the first segment; and using the physical address of the other target element to access the memory in parallel with the target element. Since the size and arrangement of pages, cache lines, and elements can be flexibly set, this greatly increases the flexibility of programmers when processing tensor data, so that tensor data can be processed more efficiently and processing time is reduced.
  • using the physical address of the target element to access the memory includes using the physical address of the target element to access a part of the tensor stored in the memory.
  • the method further includes: determining whether the logical address exceeds a logical boundary of the tensor; and if the logical address exceeds the logical boundary, generating a memory access exception signal. By detecting whether it is out of bounds, the correctness of tensor processing can be ensured before processing.
  • determining whether the logical address exceeds the logical boundary of the tensor includes: determining whether the sum of the segment reference data and the offset data exceeds the logical boundary of each dimension of the first segment, that is, the size of each dimension of the first segment.
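The per-dimension boundary check described above can be sketched as follows (names are hypothetical; the segment reference data and offsets are treated as per-dimension values, as in the claim):

```python
# Sketch of the out-of-bounds check: the sum of segment reference data and
# offset data is compared against each dimension's size. Names hypothetical.

class MemoryAccessException(Exception):
    """Raised when a logical address exceeds the tensor's logical boundary."""

def check_bounds(seg_ref, offsets, seg_dim_sizes):
    """Raise if the logical address exceeds the segment's logical boundary."""
    for base, off, size in zip(seg_ref, offsets, seg_dim_sizes):
        if base + off >= size:
            raise MemoryAccessException(
                f"offset {base + off} exceeds dimension size {size}")

check_bounds((0, 0), (5, 10), (16, 16))      # in bounds: no exception
try:
    check_bounds((0, 0), (5, 20), (16, 16))  # second dimension out of bounds
except MemoryAccessException as e:
    print("memory access exception:", e)
```

Because the check works directly on the logical address, it can run before any physical address is formed, which is what allows correctness to be ensured before processing.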
  • the method further includes: retaining the target element in the first-level cache based on the segment attribute of the first segment, where the segment attribute includes the replacement rule, in the first-level cache, for the cache line where the target element is located. Because certain elements are used frequently, they should not be moved in and out of the L1 cache frequently but should instead remain in the L1 cache as much as possible. By setting replacement rules, the probability of the target element remaining in the first-level cache can be effectively increased, thereby reducing element transmission time and correspondingly reducing tensor processing time.
  • the first segment includes a plurality of pages
  • the method further includes: if the first page of the plurality of pages is moved into the memory or the off-chip memory, establishing a first page table entry corresponding to the first page, where the first page table entry stores the physical address of the first page in the memory or off-chip memory.
  • the first segment includes a plurality of pages
  • the method further includes: if the first page of the plurality of pages is moved out of the memory or the off-chip memory, deleting the first page table entry corresponding to the first page, where the first page table entry stores the physical address of the first page in the memory or off-chip memory.
  • the first segment includes a target element page
  • the target element page includes the target element
  • the method further includes: determining, based on the logical address of the target element, whether the target element page is located in the memory; and if the target element page is not in the memory, moving the target element page from the off-chip memory into the memory.
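The page-table bookkeeping in the bullets above (entry established on move-in, deleted on move-out, and a demand move-in when the target element page is absent) can be sketched as follows; the class, method names, and addresses are hypothetical:

```python
# Hypothetical sketch of the described page-table maintenance: an entry is
# created when a page is moved into memory, deleted when it is moved out,
# and an access to an absent page triggers a move-in from off-chip memory.

class PageTable:
    def __init__(self):
        self.entries = {}  # page id -> physical address of the page in memory

    def page_in(self, page_id, phys_addr):
        """Establish the page table entry when a page is moved into memory."""
        self.entries[page_id] = phys_addr

    def page_out(self, page_id):
        """Delete the page table entry when a page is moved out of memory."""
        del self.entries[page_id]

    def resolve(self, page_id, move_in_from_off_chip):
        """Demand paging: move the page in from off-chip memory if absent."""
        if page_id not in self.entries:
            self.page_in(page_id, move_in_from_off_chip(page_id))
        return self.entries[page_id]

pt = PageTable()
# Page 7 is absent, so it is moved in at a hypothetical physical address.
addr = pt.resolve(7, move_in_from_off_chip=lambda pid: 0x8000 + pid * 0x100)
print(hex(addr))  # 0x8700
```

In the disclosure this bookkeeping is maintained jointly by hardware units (the page table device, PEs, and DMA controller) rather than by software as sketched here.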
  • In a second aspect of the present disclosure, a computer-readable storage medium stores a plurality of programs configured for execution by one or more processing engines, the plurality of programs comprising instructions for performing the method of the first aspect.
  • Programmers can consider data from the perspective of multi-dimensional tensors without knowing the mapping of multi-dimensional tensors to one-dimensional data, that is, the correct mapping of multi-dimensional tensors to the physical addresses of one-dimensional data in memory, thereby reducing programmers' cognitive load, increasing development efficiency and reducing development time.
  • In a third aspect of the present disclosure, a computer program product comprises a plurality of programs configured for execution by one or more processing engines, the plurality of programs comprising instructions for performing the method of the first aspect.
  • Programmers can consider data from the perspective of multi-dimensional tensors without knowing the mapping of multi-dimensional tensors to one-dimensional data, that is, the correct mapping of multi-dimensional tensors to the physical addresses of one-dimensional data in memory, thereby reducing programmers' cognitive load, increasing development efficiency and reducing development time.
  • In a fourth aspect of the present disclosure, an electronic device includes: a stream processor; a page table device coupled to the stream processor; a memory; and a processing engine unit coupled to the stream processor, the memory and the page table device, configured to execute the method of the first aspect.
  • Programmers can consider data from the perspective of multi-dimensional tensors without knowing the mapping of multi-dimensional tensors to one-dimensional data, that is, the correct mapping of multi-dimensional tensors to the physical addresses of one-dimensional data in memory, thereby reducing programmers' cognitive load, increasing development efficiency and reducing development time.
  • In a fifth aspect of the present disclosure, a memory access device includes a conversion unit and an access unit.
  • the conversion unit is used to convert the logical address of the target element in the first segment of the tensor to the physical address of the target element in the memory.
  • the logical address includes the segment reference data and offset data of the first segment in the tensor.
  • the offset data represents the offset of the target element on each of the multiple dimensions of the first segment.
  • the access unit is used to access the memory using the physical address of the target element.
  • the first segment includes at least one page, and the at least one page includes a target element page where the target element is located.
  • the conversion unit is further configured to convert the logical address into a linear address at least according to the size of each dimension of the target element page, and the linear address includes a one-dimensional page identifier of the target element page and a one-dimensional offset value of the target element within the target element page; and converting the linear address into a physical address according to the page table entry for the target element page, where the page table entry includes the page physical address of the target element page.
  • the conversion unit is further configured to: look up the page physical address corresponding to the target element page from the page table entry; and add the page physical address to the one-dimensional offset value to obtain the physical address of the target element.
  • each page in the at least one page includes multiple cache lines, and the dimension arrangement order of the multiple elements of the tensor within a cache line is different from the dimension arrangement order of the multiple cache lines of the tensor within the at least one page, or is different from the dimension arrangement order of the at least one page within the first segment. The access unit is further configured to read the cache line including the target element from the memory into the first-level cache using the physical address of the target element,
  • the memory is L2 cache or off-chip memory.
  • the first page of the at least one page includes a first cache line and a second cache line arranged along a first dimension, and a third cache line and a fourth cache line arranged along the first dimension; the first cache line and the third cache line are arranged along a second dimension different from the first dimension, and the second cache line and the fourth cache line are arranged along the second dimension. The first cache line includes a first element and a second element arranged along the second dimension, and a third element and a fourth element arranged along the second dimension; the first element and the third element are arranged along the first dimension, and the second element and the fourth element are arranged along the first dimension. The target element is one of the first, second, third, and fourth elements.
  • the access unit is further used for: storing at least the first element, the second element, the third element and the fourth element in the first-level cache.
  • a first page of the at least one page includes a first plurality of cache lines, and the first plurality of cache lines includes a target element.
  • a second page of the at least one page includes a second plurality of cache lines including another target element.
  • the size of the second page is different from the size of the first page, or the size of each cache line in the second plurality of cache lines is different from the size of each cache line in the first plurality of cache lines, or the dimension arrangement order of the plurality of elements of the first plurality of cache lines is different from the dimension arrangement order of the plurality of elements of the second plurality of cache lines.
  • the conversion unit is also used to convert, in parallel with the target element, the logical address of another target element into another physical address of the other target element in the memory; the logical address of the other target element includes the segment reference data and another offset data, where the other offset data represents the offset value of the other target element on each of the multiple dimensions of the first segment.
  • the access unit is also used to access the memory in parallel with the target element using the physical address of another target element. Since the size and arrangement of pages, cache lines, and elements can be flexibly set, this greatly increases the flexibility of programmers when processing tensor data, so that tensor data can be processed more efficiently and processing time is reduced.
  • the access unit is further configured to use the physical address of the target element to access a part of the tensor stored in the memory.
  • the memory access device further includes a determining unit and a generating unit.
  • the determining unit is used to determine whether a logical address is outside the logical bounds of a tensor.
  • the generating unit is used to generate a memory access exception signal if the logical address exceeds a logical boundary. By detecting whether it is out of bounds, the correctness of tensor processing can be ensured before processing.
  • the determining unit is further configured to: determine whether the sum of the segment reference data and the offset data exceeds the size of each dimension of the segment. By confirming whether the boundary is crossed based on the logical address, the detection steps can be simplified, and the calculation amount and processing time can be reduced.
  • the memory access device further includes a retention unit.
  • the retention unit is used to retain the target element in the L1 cache based on the segment attributes of the first segment.
  • the segment attributes include the replacement rule, in the L1 cache, for the cache line where the target element resides. Because some elements are used frequently, they should not be moved in and out of the L1 cache frequently but should instead remain in the L1 cache as much as possible. By setting replacement rules, the probability of the target element remaining in the first-level cache can be effectively increased, thereby reducing element transmission time and correspondingly reducing tensor processing time.
  • the first segment includes a plurality of pages
  • the memory access device further includes an establishment unit.
  • the establishment unit is used to establish a first page table entry corresponding to the first page if the first page of the plurality of pages is moved into the memory or the off-chip memory, where the first page table entry stores the physical address of the first page in the memory.
  • the first segment includes multiple pages
  • the memory access device further includes a delete unit.
  • the delete unit is used to delete the first page table entry corresponding to the first page if the first page of the multiple pages is moved out of the memory or the off-chip memory, where the first page table entry stores the physical address of the first page in the memory.
  • the first segment includes a target element page, and the target element page includes the target element.
  • the determining unit is further configured to determine whether the target element page is located in the memory based on the logical address of the target element.
  • the memory access device also includes a shift-in unit. The shift-in unit is used to move the target element page from the off-chip memory into the memory if the target element page is not located in the memory.
  • Programmers can consider data from the perspective of multi-dimensional tensors without knowing the mapping of multi-dimensional tensors to one-dimensional data, that is, the correct mapping of multi-dimensional tensors to the physical addresses of one-dimensional data in memory, thereby reducing programmers' cognitive burden, increasing development efficiency and reducing development time.
  • Fig. 1 shows a schematic diagram of an example environment in which various embodiments of the present disclosure can be implemented;
  • Fig. 2 shows a schematic block diagram of a chip according to an embodiment of the present disclosure;
  • Fig. 3 shows a schematic block diagram of a three-dimensional tensor according to an embodiment of the present disclosure;
  • Fig. 4 shows a schematic diagram of page allocation of image data according to an embodiment of the present disclosure;
  • Fig. 5 shows a schematic flowchart of a method for accessing a memory according to an embodiment of the present disclosure;
  • Fig. 6 shows a schematic diagram of cache line allocation of the image data in Fig. 4;
  • Fig. 7 shows a schematic diagram of one-dimensional storage of the image data in Fig. 4 in the memory;
  • Fig. 8 shows a schematic block diagram of a memory access device according to some embodiments of the present disclosure.
  • the term “comprise” and its variants mean open inclusion, ie “including but not limited to”.
  • the term “or” means “and/or” unless otherwise stated.
  • the term “based on” means “based at least in part on”.
  • the terms “one example embodiment” and “one embodiment” mean “at least one example embodiment.”
  • the term “another embodiment” means “at least one further embodiment”.
  • the terms “first”, “second”, etc. may refer to different or the same object. Other definitions, both express and implied, may also be included below.
  • Such a mapping adds a serious cognitive load to the programmer. In addition, it may increase the runtime overhead of the software.
  • According to embodiments of the present disclosure, the programmer does not need to know the mapping relationship between the tensor data and the physical address, but only needs to know the position of the data in the multi-dimensional tensor, such as its coordinates. In this way, the cognitive burden of programmers can be significantly reduced, the development time of application programs or software can be significantly reduced, and development efficiency can be increased.
  • FIG. 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented.
  • Example environment 100 may be, for example, an electronic device with computing capabilities, such as a computer.
  • example environment 100 includes, for example, central processing unit (CPU) 20 , system memory 10 , north/memory bridge 30 , accelerator subsystem 40 , device memory 50 , and south/input-output (IO) bridge 60 .
  • System memory 10 may be, for example, a volatile memory such as dynamic random access memory (DRAM).
  • the north bridge/memory bridge 30, for example, integrates a memory controller, a PCIe controller, etc., and is responsible for data exchange between the CPU 20 and the high-speed interface and bridging the CPU 20 and the south bridge/IO bridge 60.
  • the South Bridge/IO Bridge 60 is used for low-speed interfaces of the computer, such as Serial Advanced Technology Attachment (SATA) controllers and the like.
  • the accelerator subsystem 40 may include, for example, devices or chips such as a graphics processing unit (GPU) and an artificial intelligence (AI) accelerator for accelerated processing of data such as graphics and video.
  • Device memory 50 may be, for example, a volatile memory such as DRAM that is external to accelerator subsystem 40 .
  • device memory 50 is also referred to as off-chip memory, ie, memory located outside the chip of accelerator subsystem 40 .
  • the chip of the accelerator subsystem 40 also has a volatile memory, such as a first-level (L1) cache (cache) and an optional second-level (L2) cache.
  • FIG. 2 shows a schematic block diagram of an accelerator subsystem 200 according to one embodiment of the present disclosure.
  • the accelerator subsystem 200 may be, for example, a specific implementation of the chip of the accelerator subsystem 40 in FIG. 1 .
  • the accelerator subsystem 200 is, for example, an accelerator subsystem chip such as a GPU.
  • the accelerator subsystem 200 includes a stream processor (SP) 210, a page table device 220, a processing engine (PE) unit 230, a direct memory access (DMA) controller 240, an L1 cache (cache) 260, and L2 cache 250.
  • the accelerator subsystem 200 is controlled by a host device such as the CPU 20, and receives instructions from the CPU 20.
  • the SP 210 analyzes instructions from the CPU 20, and assigns the analyzed operations to the PE unit 230, the page table device 220, and the DMA controller 240 for processing.
  • the page table device 220 is used to manage the on-chip virtual storage of the accelerator subsystem 200 .
  • L2 cache 250 and off-chip memory such as device memory 50 in FIG. 1 constitute a virtual storage system.
  • the page table device 220 is jointly maintained by the SP 210, the PE unit 230 and the DMA controller 240.
  • the PE unit 230 includes a plurality of processing engines (processing engine, PE) PE_1, PE_2...PE_N, where N represents an integer greater than 1.
  • Each PE in PE unit 230 may be a single instruction multiple thread (SIMT) device.
  • each thread can have its own register file (register file), and all threads of each PE also share a unified register file (uniform register file).
  • Multiple PEs can perform the same or different processing tasks in parallel, and can perform the address conversion described below and access target data in memory in parallel, thereby reducing processing time. It can be understood that the target elements processed by multiple PEs are not the same, and the segment, page, and cache line to which a target element belongs, as well as its attributes, size, and dimension arrangement order, may differ, as described in detail below.
  • Each thread can perform thread-level data exchange between its own register file and the memory subsystem.
  • Each thread has its own arithmetic logic execution unit and uses its own storage address, adopting a typical load-store architecture.
  • Each execution unit includes a floating-point/fixed-point unit supporting multiple data types and an arithmetic logic unit.
  • the accelerator subsystem 200 of FIG. 2 may, for example, perform the following operations: 1) construct page table entry content and initial state; 2) move data in the off-chip memory to the on-chip memory, such as the L2 cache 250; 3) start and execute a program; 4) define each segment and describe the properties of the tensor and its storage; 5) when program execution is completed, write the data of the execution result to the off-chip memory.
  • the data processed by the accelerator subsystem 200 is mainly for multi-dimensional tensors.
  • the tensor may be a four-dimensional tensor having four dimensions D1, D2, D3, and D4, and the tensor may be of different size in each dimension.
  • the tensor may be a one-dimensional, two-dimensional, three-dimensional or more dimensional tensor, which is not limited in the present disclosure.
  • the tensor may internally support such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32 and other custom element types, and the present disclosure does not limit this.
  • The basic unit of addressing is the element. For example, if the element type is int8, the basic unit of addressing is one byte; if the element type is int16, the basic unit of addressing is a double byte, and so on.
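Addressing in units of elements rather than bytes can be illustrated as follows; the table of sizes matches the element types listed above, while the function name is illustrative:

```python
# Small illustration of element-granular addressing: the byte offset of the
# n-th element depends on the element type's size. Names are illustrative.

ELEMENT_SIZE_BYTES = {
    "uint8": 1, "int8": 1,
    "bfloat16": 2, "float16": 2, "uint16": 2, "int16": 2,
    "float32": 4, "int32": 4, "uint32": 4,
}

def element_byte_offset(element_index, element_type):
    """Convert an element-granular address into a byte offset."""
    return element_index * ELEMENT_SIZE_BYTES[element_type]

print(element_byte_offset(10, "int8"))   # 10  (addressed in single bytes)
print(element_byte_offset(10, "int16"))  # 20  (addressed in double bytes)
```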
  • tensors may be divided into at least one segment. In the case where the tensor contains only one segment, the tensor is the segment. Whereas, in the case where the tensor contains multiple segments, the segment is part of the tensor.
  • the CPU 20 can specify which PE processes each part of the segment by an instruction.
  • FIG. 3 shows a schematic block diagram of a three-dimensional tensor 300 according to an embodiment of the present disclosure.
  • the three-dimensional tensor 300 has three dimensions D1, D2, and D3, and includes a first segment S1, a second segment S2, and a third segment S3.
  • CPU 20 may specify that the tensor elements of segment S1 be processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, and PE_8.
  • CPU 20 also specifies that the tensor elements of the second segment S2 are processed by PE_1-PE_4.
  • each segment may have a different size, so programmers can flexibly configure segments based on design needs.
  • page division can be implemented on any one or more dimensions, and the number of pages divided on each dimension is independent of each other.
  • tensor data may be stored in on-chip high-speed memory, such as the L2 cache 250.
  • the kernel program (kernel) can be started multiple times, and each time the DMA controller 240 moves a segment of the tensor from the off-chip storage to the on-chip storage in advance for kernel operation. After starting the kernel multiple times, all the segments contained in the tensor are processed, and the entire running process ends.
  • if the on-chip high-speed memory is sufficient to accommodate all tensors to be accessed by the kernel, a tensor needs only one segment description, and the kernel needs to be started only once.
  • At least one page may also be set to further subdivide the tensor.
  • in the first segment S1, there are 4 pages P[1], P[2], P[3] and P[4].
  • the second segment S2 has only one page.
  • the number of pages in each segment can be different, so programmers can flexibly configure the size of pages in a segment based on design requirements. For example, pages are configured to fit into L2 cache 250 in their entirety.
  • a page can usually contain multiple elements.
  • the page where the target element is located is referred to as a "target element page" herein.
  • a page may include multiple cache lines.
  • it takes only a few clock cycles for the PE to read data from the L1 cache 260, but it may take dozens or even hundreds of clock cycles for the L1 cache 260 to read data from the L2 cache 250. Therefore, it is desirable to reduce the number of times the L1 cache 260 reads data from the L2 cache 250.
  • although "cache line" is used here to describe the minimum unit of data transferred from the L2 cache 250 to the L1 cache 260, in the present disclosure this portion of data is not necessarily arranged in rows or columns; the data inside a "cache line" may be distributed over multiple dimensions, and the amount of data distributed on each dimension is not limited to 1.
  • PE performs parallel processing on the data in a segment, and the allocation of PE is carried out in the logical address space of the data, which is independent of the physical storage structure of the segment, as described below.
  • the first group of cache lines in the first page P[1] is designated to be processed by PE_1, and the second group of cache lines is designated to be processed by PE_2.
  • the amount of tensor data processed by each PE_M may differ, where M represents any integer not greater than N.
  • FIG. 4 shows a schematic diagram of page allocation of image data 400 according to an embodiment of the present disclosure.
  • Image data is typically a two-dimensional tensor.
  • the image data 400 is, for example, 8*8 pixels.
  • the image data 400 has 8 pixels in the first dimension D1 and also has 8 pixels in the second dimension D2. Therefore, the image data 400 has pixels P00, P01...P77.
  • the image data 400 has only one segment, but is divided into four pages P[1], P[2], P[3] and P[4] in two dimensions.
  • the four pages can be divided according to the second dimension D2 to be allocated to PE_1 and PE_2 for processing, or can be divided according to the first dimension D1 to be allocated to PE_1 and PE_2 for processing. In addition, it is also possible to divide by diagonal. This disclosure is not limited in this regard.
  • FIG. 5 shows a flowchart of a method 500 for accessing a memory according to some embodiments of the present disclosure.
  • the method 500 may be implemented by an accelerator subsystem such as a GPU, so various aspects described above with respect to FIGS. 1-3 may be selectively applied to the method 500 .
  • the accelerator subsystem translates a logical address of a target element in a first segment in the tensor to a physical address of the target element in memory.
  • the target element is the tensor element to be fetched and processed by the PE. For example, pixel P01 in FIG. 4 .
  • when the kernel program accesses a segment of the tensor, the kernel program can be deployed on one or more PEs, and each PE accesses, through the segment, some or all of the elements of the tensor data it describes.
  • the starting point of the tensor data accessed by each PE within the segment is defined by the anchor coordinates in the segment structure.
  • for different PEs, the definition of the reference point coordinates in the segment structure may also differ.
  • the logical address of the target element can be represented as seg:RF:imm, where seg represents the segment base register, RF represents the offset register, and imm represents the offset immediate value.
  • the logical address may include segment reference data and offset data of the first segment in the tensor, and the offset data represents the offset of the target element on each dimension among the multiple dimensions of the first segment.
  • the segment reference data is the address of the start point of the segment.
  • the first segment includes at least one page
  • the accelerator subsystem 200 may convert the logical address into a linear address at least according to the size of each dimension of the target element page.
  • the linear address includes a one-dimensional page identifier of the target element page and a one-dimensional offset value of the target element within the target element page.
  • the accelerator subsystem 200 can obtain the page number offset of the target element in each dimension according to the page size of the page in each dimension in the first segment, thereby obtaining the one-dimensional identification of the page where the target element is located.
  • the target element is located at the top layer of the tensor in FIG. 3 , and the page ID of the target element can be determined to be P[1] through the above method.
  • the accelerator subsystem can also obtain the relative offset of the target element in each dimension within the page, and based on this, determine the one-dimensional linear offset of the target element relative to the starting position of the page.
  • the one-dimensional identification of the page and the one-dimensional linear offset within the page together constitute the linear address of the target element.
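The logical-to-linear conversion described above can be sketched in a few lines. This is a hedged illustration, not the disclosure's exact hardware logic: the function name and the mixed-radix flattening order (faster-varying dimensions listed first) are assumptions of this sketch.

```python
def logical_to_linear(offsets, page_shape, pages_per_dim):
    """Convert per-dimension element offsets within a segment into
    (page_id, in_page_offset), assuming pages and in-page elements
    are flattened in the same dimension order (first dimension fastest)."""
    # Which page the element falls in, and where it sits inside that page.
    page_coords = [off // size for off, size in zip(offsets, page_shape)]
    elem_coords = [off % size for off, size in zip(offsets, page_shape)]

    # Flatten the per-dimension page coordinates into a one-dimensional page ID.
    page_id = 0
    for coord, count in zip(reversed(page_coords), reversed(pages_per_dim)):
        page_id = page_id * count + coord

    # Flatten the per-dimension in-page coordinates into a linear offset.
    in_page = 0
    for coord, size in zip(reversed(elem_coords), reversed(page_shape)):
        in_page = in_page * size + coord
    return page_id, in_page
```

For the 8x8 image of FIG. 4 with 4x4 pages, the element at offsets (3, 6) would land in page 2 (0-based) at in-page offset 11 under these assumptions.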
  • the accelerator subsystem 200 converts the linear address into a physical address according to the page table entry for the target element page, and the page table entry includes the page physical address of each page of the at least one page. Specifically, in one embodiment, after the accelerator subsystem 200 obtains the page identifier of the target element, it can search the corresponding entry in the page table device 220 according to the page identifier to obtain the physical address of the page.
  • the physical address plus the one-dimensional linear offset of the target element on the target element page is the physical address of the target element.
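The subsequent linear-to-physical step is then a table lookup plus an add, as described above. A minimal sketch, assuming a dictionary-based page table with illustrative field names (`base`, `loaded`) standing in for the page table device:

```python
# Illustrative page table: page_id -> {"base": page physical base address,
# "loaded": whether the page is resident}. Addresses are made up for the example.
page_table = {
    0: {"base": 0x1000, "loaded": True},
    1: {"base": 0x2000, "loaded": True},
}

def linear_to_physical(page_id, in_page_offset, elem_size=1):
    """Look up the page's physical base address and add the
    one-dimensional in-page offset to get the element's physical address."""
    entry = page_table.get(page_id)
    if entry is None or not entry["loaded"]:
        raise LookupError(f"page {page_id} not resident; move it in first")
    return entry["base"] + in_page_offset * elem_size
```

The `loaded` check mirrors the status attribute mentioned below, which indicates whether a page has been loaded and is available.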
  • the physical address may represent the storage address of the target element on off-chip device memory 50 or on-chip memory, such as L2 cache 250 .
  • alternatively, the page table entry of the target element page may store a physical address relative to another page; the physical address of the target element can then be obtained from the offset of the target element page relative to that other page, the physical address of the other page, and the one-dimensional linear offset.
  • the page table entry can also include other attributes, such as status, which is used to indicate whether the page has been loaded, that is, whether it is available.
  • This disclosure is not limited in this regard. Although a two-level translation of addresses is shown here, the disclosure is not so limited. Alternatively, more stages of conversion are also possible. For example, page offsets, cache line offsets, and element offsets are calculated hierarchically, and are sequentially added to the physical address to obtain the final physical address of the target element.
  • if the first page of the plurality of pages is moved from the off-chip memory into the on-chip memory, the accelerator subsystem 200 creates a first page table entry corresponding to the first page, and the first page table entry stores the physical address of the first page in memory.
  • conversely, if the first page of the plurality of pages is moved out of the memory into the off-chip memory, the accelerator subsystem 200 may delete the first page table entry corresponding to the first page.
  • the accelerator subsystem translates the logical address of the target element in the first segment S1 into a physical address in the on-chip virtual memory.
  • On-chip virtual memory may include on-chip L2 cache 250 and off-chip device memory 50 .
  • the logical address includes segment reference data and offset data of the first segment in the tensor, and the offset data represents the offset of the target element on each dimension among the multiple dimensions of the first segment.
  • the accelerator subsystem can access memory using the physical address of the target element.
  • the programmer can use the segment reference data and the multi-dimensional offset relative to the segment reference data to work with the tensor data from the perspective of the tensor when programming, without needing to know the one-to-one mapping relation between the tensor data and the physical addresses in memory.
  • the above describes the data addressing mode for converting the logical address of a multi-dimensional tensor into a physical address according to some embodiments of the present disclosure.
  • factors such as the spatial proximity of the L1 cache 260 fetching tensor data from the L2 cache 250 may be further considered.
  • the smallest acquisition unit is a cache line.
  • when the L1 cache 260 acquires an element from the L2 cache 250, it does not acquire only the target element, but also reads the tensor elements near it into the L1 cache 260.
  • the minimum amount of data fetched is one cache line. For example, referring to FIG. 4, when the cache lines are arranged in rows, if it is desired to obtain the target element P00, the L1 cache 260 will read the elements P00-P03 together from the L2 cache 250, because the cache line in this case includes elements P00-P03.
  • the elements near the target element often have a greater correlation and have a greater probability of being processed.
  • the pixels of an image are often accessed according to the Manhattan distance between pixels, and adjacent pixels are usually processed together.
  • the elements P01 and P10 near the target element P00 have a higher probability of being processed.
  • if, when the L1 cache 260 reads the target element P00 from the L2 cache 250, it only reads P01, P02 and P03 in the same cache line together, then when the PE later tries to read P10 from the L1 cache 260, the L1 cache 260 must read the cache line where P10 is located from the L2 cache 250 again. This reduces data processing speed and potentially wastes storage bandwidth.
  • each page in the at least one page includes multiple cache lines, and the dimension arrangement order of the multiple elements of the tensor in the cache line is different from the dimension arrangement order of the multiple cache lines of the tensor in the at least one page.
  • FIG. 6 shows a schematic diagram of cache line allocation of the image data in FIG. 4 .
  • Inverse cross addressing will be described below with reference to FIG. 6 and FIG. 4 .
  • image data 400 includes only one segment, and the segment includes four pages P[1], P[2], P[3], and P[4].
  • the image data 400 is a two-dimensional tensor, so the image data 400 includes a first dimension D1 and a second dimension D2.
  • the pages of image data 400 are first arranged in the first dimension D1, and after reaching the boundary of the first dimension D1, extend by one page in the second dimension D2, and continue to be arranged in the first dimension D1.
  • the arrangement order of the pages of the image data 400 is P[1], P[2], P[3], and P[4].
  • the order in which the dimensions of tensor elements are arranged in the cache line is different from the order in which the dimensions of the tensor's cache lines are arranged in at least one page, see, for example, FIG. 6 .
  • Fig. 6 shows 4 cache lines C[1], C[2], C[3] and C[4], and each cache line includes 4 elements.
  • the first cache line C[1] includes elements P00, P10, P01, and P11.
  • cache lines C[1], C[2], C[3], and C[4] are also first arranged along the first dimension D1 until reaching the boundary of the first dimension D1; the arrangement then extends by one cache line in the second dimension D2 and continues along the first dimension D1 until again reaching the boundary of the first dimension D1.
  • the arrangement order of the cache lines of the page 600 is C[1], C[2], C[3], and C[4].
  • the following three fields can be used to describe the logical addresses of the image data in FIGS. 4 and 6: addr[5:4], addr[3:2], and addr[1:0], where addr[5:4] is the page field, i.e., the page address; addr[3:2] is the cache line field, i.e., the address of the cache line within the page; and addr[1:0] is the address of the element or element sequence within the cache line.
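Under the addr[5:4]/addr[3:2]/addr[1:0] layout just described, extracting the three fields is plain bit manipulation. A small sketch (the 2-bit field widths follow the example above; the function name is illustrative):

```python
def split_address(addr):
    """Split a 6-bit logical address into (page, cache_line, element)
    using the addr[5:4] / addr[3:2] / addr[1:0] fields described above."""
    page = (addr >> 4) & 0x3        # addr[5:4]: which page
    cache_line = (addr >> 2) & 0x3  # addr[3:2]: which cache line in the page
    element = addr & 0x3            # addr[1:0]: which element in the cache line
    return page, cache_line, element
```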
  • the arrangement order of the tensor elements within a cache line is diametrically opposite to the arrangement order of the cache lines within the page or of the pages within the first segment; this is referred to as inverse interleaved addressing.
  • within a cache line, elements are first arranged along the second dimension D2 and then along the first dimension D1; that is, elements P00, P10, P01, and P11 are arranged sequentially within cache line C[1].
  • FIG. 7 shows more vividly a schematic diagram of the one-dimensional storage of the image data of FIG. 4 in the memory. As shown in FIG. 7, the image data 700 is actually arranged in one-dimensional order in the memory, with elements P00, P10, P01, and P11 arranged sequentially. Therefore, when the target element is P00, the L1 cache reads elements P00, P10, P01, and P11 from the L2 cache. In this way, the hit rate for the target element and its neighbors is improved, the time to access them is correspondingly reduced, and the overall processing time of the tensor is thus reduced.
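The inverse interleaved storage order can be illustrated by enumerating one page's element coordinates in the order they would be laid out in memory. This is a sketch under assumptions: cache lines advance along D1 first within the page, while elements inside each cache line advance along D2 first (the inverse order); how the (d1, d2) coordinate pairs map onto the P labels of FIG. 6 is likewise an assumption of this illustration.

```python
from itertools import product

def inverse_interleaved_order(line_shape=(2, 2), lines_per_page=(2, 2)):
    """Enumerate element coordinates of one page in storage order:
    cache lines advance along D1 first, while elements inside each
    cache line advance along D2 first (the inverse order)."""
    order = []
    # Cache lines: D1 varies fastest, so D2 sits in the outer loop position.
    for l2, l1 in product(range(lines_per_page[1]), range(lines_per_page[0])):
        # Elements inside the line: D2 varies fastest (inner loop position).
        for e1, e2 in product(range(line_shape[0]), range(line_shape[1])):
            d1 = l1 * line_shape[0] + e1
            d2 = l2 * line_shape[1] + e2
            order.append((d1, d2))
    return order
```

With the defaults, the first four coordinates emitted belong to one cache line and step along D2 first, while the fifth coordinate jumps to the next cache line along D1.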
  • the cache line including the target element is thus read from the memory into the L1 cache using the physical address of the target element, the memory being the L2 cache or off-chip memory. Also, some target elements may be used frequently, so it is desirable to have them resident in the L1 cache.
  • the replacement rule of the cache line containing the target element can be set in the segment attribute to be resident in the L1 cache, which relies on the temporal locality principle of data access.
  • although inverse interleaved addressing is shown here for a two-dimensional tensor, it is to be understood that the disclosure is not so limited.
  • the dimensional order of elements within a cache line may be in any order other than the dimensional order of a cache line within a page or the dimensional order of a page within a segment.
  • the dimension order of elements within a cache line, the dimension order of cache lines within a page, and the dimension order of pages within a segment may be pairwise different.
  • the number of pages in each segment, the number of cache lines in each page, and the number of elements in each cache line are not limited to the numbers illustrated in the figure.
  • although elements within the cache line are shown in a block arrangement in the embodiment of FIG. 6, it is understood that the present disclosure is not limited thereto. Elements within a cache line may also be arranged in rows, columns, or other sequences within a one-dimensional physical area of memory. This gives programmers further flexibility, so that they can adjust the arrangement of elements in the cache line based on the usage relationships of tensor elements, thereby reducing running time and improving the response speed of the program.
  • a tensor may include at least one segment, a segment may include at least one page, a page may include at least one cache line, and a cache line may include at least one element.
  • Tensors, segments, pages, and cache lines can all include one or more dimensions.
  • the accelerator subsystem 40 can use the logical address to determine whether the address to be accessed crosses the boundary of the tensor. For example, it can detect whether the sum of the segment reference data and the coordinates or offsets of the target element exceeds the size of each dimension. If a boundary is crossed, a memory access exception signal is generated. By detecting out-of-bounds situations, the accuracy of addressing and the correctness of accessing tensor elements in memory can be ensured.
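The boundary check described above amounts to a per-dimension comparison. A hedged sketch, in which a Python exception stands in for the hardware's memory access exception signal (all names are illustrative):

```python
class MemoryAccessException(Exception):
    """Stands in for the hardware's memory access exception signal."""

def check_bounds(reference, offsets, segment_shape):
    """Raise if (reference + offset) falls outside the segment in any dimension."""
    for ref, off, size in zip(reference, offsets, segment_shape):
        coord = ref + off
        if coord < 0 or coord >= size:
            raise MemoryAccessException(
                f"coordinate {coord} outside dimension of size {size}")
```

Performing the check on the logical address, before any translation, is what lets the addressing error be caught ahead of the actual memory access.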
  • the instruction may include various information of tensors, such as tensor description items.
  • Tensor description items include, for example, segment identifiers, PE masks, reference points, the starting page number, the number of pages, the dimensions per page, segment attributes, dimension attributes, and the like.
  • the segment ID is used to index the segment during a run.
  • the PE mask indicates the PEs to be allocated. Each bit corresponds to a PE. If a bit of the PE mask is set to 1, the corresponding PE is allocated; otherwise it is not. For example, if bit0 is 1, it means that PE_1 is allocated.
  • for example, a mask of 0x03 allocates PE_1 and PE_2.
  • Other situations can be deduced by analogy, and any combination can be made according to application requirements and resource availability.
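The PE mask decoding above (bit0 → PE_1, bit1 → PE_2, and so on) can be sketched as simple bit tests; the function name and the default PE count are illustrative assumptions:

```python
def allocated_pes(mask, num_pes=8):
    """Return the list of PE names whose mask bit is set; bit0 -> PE_1."""
    return [f"PE_{i + 1}" for i in range(num_pes) if (mask >> i) & 1]
```

For example, a mask of 0x03 yields PE_1 and PE_2, matching the description above.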
  • the reference point represents the starting coordinate point within the tensor specified for each PE.
  • the reference point can be (0,0,0,0) or (0,4,0,0).
  • Multiple PEs can have the same or different reference points. Since the reference point is actually a component of the target element's logical address, with different segment reference data the same coordinate offsets of different PEs are actually mapped to different physical addresses. After a PE has processed a target element, it can use an offset relative to the reference point to continue to the next target element; alternatively, a new reference point can be used to access the next target element.
  • the start page number indicates the start page of the element processed by the PE. Since the page identifiers contained in the segment are continuous, if the total number of pages in the segment is known, all pages in the segment can be accessed by only specifying the starting page identifier in the segment.
  • the dimension per page indicates the number of elements distributed on each dimension in the page.
  • the element distribution of the segment on each dimension is (8,8,0,0), and the distribution of pages on each dimension is (2,2,0,0), so the element distribution of each page is (8/2,8/2,0,0), which is (4,4,0,0). Each page thus contains 4x4 = 16 elements.
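The arithmetic above ((8,8,0,0) segment elements divided by (2,2,0,0) pages per dimension, giving (4,4,0,0), i.e. 16 elements per page) can be sketched as follows; treating a 0 entry as an unused dimension is an assumption of this illustration:

```python
from math import prod

def elements_per_page(segment_dims, pages_per_dim):
    """Divide each used dimension of the segment by its page count;
    dimensions marked 0 are treated as unused and stay 0.
    Returns (per-page dimension tuple, elements per page)."""
    per_page = tuple(
        s // p if p else 0 for s, p in zip(segment_dims, pages_per_dim))
    return per_page, prod(d for d in per_page if d)
```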
  • Segment attributes include status flags used by pages in the segment, element size, element data encoding type and conversion flags, replacement rules for cache lines in the segment, etc.
  • Dimension attributes can be used to independently set the attributes of each dimension, including information such as long mode, streaming mode, symbolic attributes of addresses, and bit widths of inverse cross addressing in a cache line.
  • Long mode indicates that the size of a tensor in one dimension is significantly larger than in the other dimensions.
  • Streaming mode means that infinitely long tensor data can be computed without stopping the kernel program.
  • the symbol attribute of the address indicates that the coordinate offset value relative to the reference point can be positive or negative, in other words, the offset can be positive or negative in the same dimension.
  • the properties of the page include page ID, physical base address, status field and dimension information, etc.
  • the page identifier is used to index the corresponding page table entry.
  • the physical base address describes the physical first address of the page in on-chip memory such as L2 cache or in off-chip memory.
  • the status field indicates whether the page is occupied or available.
  • Dimension information mainly includes the number of dimensions and the size of each dimension, and this field can be defined by a segment. Attributes of pages may be stored within page table device 220, for example.
  • as an example, suppose target element P36 is addressed; that is, the PE accesses the memory to obtain the target element P36.
  • Image data 400 is stored in memory in an inverse interleaved manner. Assume that the image data 400 is a four-dimensional tensor, that pages and cache lines are arranged in the dimension order D1->D2->D3->D4, and that the elements within a cache line are arranged in the dimension order D4->D3->D2->D1.
  • in this way, the real physical address of the target element in the memory is obtained.
  • FIG. 8 shows a schematic block diagram of a memory access device 800 according to some embodiments of the present disclosure.
  • Apparatus 800 may be implemented as or included in accelerator subsystem 200 of FIG. 2 .
  • the apparatus 800 may include a plurality of modules for performing corresponding steps in the method 500 as discussed in FIG. 5 .
  • the device 800 includes a conversion unit 802 and an access unit 804.
  • the conversion unit 802 is used to convert the logical address of the target element in the first segment of the tensor into the physical address of the target element in the memory.
  • the logical address includes segment reference data and offset data of the first segment in the tensor, and the offset data represents the offset of the target element in each of the multiple dimensions of the first segment.
  • the access unit 804 is used to access the memory using the physical address of the target element.
  • the first segment includes at least one page, and the at least one page includes the target element page where the target element is located.
  • the conversion unit 802 is further configured to convert the logical address into a linear address at least according to the size of each dimension of the target element page, where the linear address includes a one-dimensional page identifier of the target element page and a one-dimensional offset value of the target element within the target element page; and to convert the linear address into a physical address according to the page table entry for the target element page, where the page table entry includes the page physical address of the target element page.
  • the converting unit 802 is further configured to: search the page table entry for a page physical address corresponding to the target element page; and add the page physical address to a one-dimensional offset value to obtain the target element's physical address.
  • this further simplifies the calculation, which further saves conversion time and correspondingly improves tensor processing efficiency.
  • each page in the at least one page includes a plurality of cache lines, and the dimension arrangement order of the plurality of elements of the tensor within a cache line is different from the dimension arrangement order of the plurality of cache lines within the at least one page, or different from the dimension arrangement order of the at least one page within the first segment.
  • the access unit 804 is further configured to: use the physical address of the target element to read the cache line including the target element from the memory to the first-level cache, and the memory is L2 cache or off-chip storage device.
  • a first page of the at least one page includes a first cache line and a second cache line arranged along a first dimension, and a third cache line and a fourth cache line arranged along the first dimension; the first cache line and the third cache line are arranged along a second dimension different from the first dimension, and the second cache line and the fourth cache line are arranged along the second dimension;
  • the first cache line includes a first element and a second element arranged along the second dimension, and a third element and a fourth element arranged along the second dimension; the first element and the third element are arranged along the first dimension, and the second element and the fourth element are arranged along the first dimension;
  • the target element is one of the first, second, third, and fourth elements.
  • the access unit 804 is further configured to: store at least the first element, the second element, the third element and the fourth element in the first-level cache.
  • a first page of the at least one page includes a first plurality of cache lines, the first plurality of cache lines including the target element.
  • a second page of the at least one page includes a second plurality of cache lines including another target element.
  • the size of the second page is different from the size of the first page, or the size of each cache line in the second plurality of cache lines is different from the size of each cache line in the first plurality of cache lines, or the dimension arrangement order of the plurality of elements of the first plurality of cache lines is different from the dimension arrangement order of the plurality of elements of the second plurality of cache lines.
  • the conversion unit 802 is further configured to convert the logical address of another target element, in parallel with the target element, into another physical address of the other target element in the memory; the logical address of the other target element includes the segment reference data and other offset data, and the other offset data represents the offset value of the other target element in each of the multiple dimensions of the first segment.
  • the access unit 804 is further configured to use the physical address of another target element to access the memory in parallel with the target element. Since the size and arrangement of pages, cache lines, and elements can be flexibly set, this greatly increases the flexibility of programmers when processing tensor data, so that tensor data can be processed more efficiently and processing time is reduced.
  • the access unit 804 is further configured to access a portion of the tensor stored in memory using the physical address of the target element.
  • the memory access device 800 further includes a determining unit and a generating unit not shown.
  • the determining unit is used to determine whether the logical address is outside the logical boundary of the tensor.
  • the generating unit is used to generate a memory access exception signal if the logical address exceeds a logical boundary. By detecting whether it is out of bounds, the correctness of tensor processing can be ensured before processing.
  • the determining unit is further configured to: determine whether the sum of the segment reference data and the offset value of each dimension exceeds the size of each dimension of the segment. By confirming whether it is out of bounds based on the logical address, the detection steps can be simplified, and the calculation amount and processing time can be reduced.
  • the memory access device 800 further includes a not-shown retention unit.
  • the retention unit is used to retain the target element in the L1 cache based on the segment attributes of the first segment.
  • the segment attributes include the replacement rules in the L1 cache for the cache line where the target element resides. Because certain elements have a high frequency of use, it is not expected to be moved in and out of the L1 cache frequently, but to remain in the L1 cache as much as possible. By setting replacement rules, it is possible to effectively increase the probability of the target element being left in the first-level cache, thereby reducing element transmission time and correspondingly reducing tensor processing time.
  • the first segment includes a plurality of pages
  • the memory access device 800 further includes a setup unit.
  • the setup unit is used to create a first page table entry corresponding to the first page if the first page of the plurality of pages is moved into the memory or the off-chip memory; the first page table entry stores the physical address of the first page in the memory.
  • the first segment includes multiple pages
  • the memory access device 800 further includes a delete unit.
  • the delete unit is used to delete the first page table entry corresponding to the first page if the first page of the plurality of pages is moved out of the memory or the off-chip memory; the first page table entry stores the physical address of the first page in the memory.
  • the first segment includes a target element page
  • the target element page includes the target element.
  • the determining unit is further configured to determine whether the target element page is located in the memory based on the logical address of the target element.
  • the memory access device 800 also includes a move-in unit. The move-in unit is used to move the target element page from the off-chip memory into the memory if the target element page is not located in the memory.


Abstract

Described herein are a method for accessing a memory, a computer-readable storage medium, a computer program product, and an electronic device. In the present disclosure, a logical address of target data of a multi-dimensional tensor can be converted into a physical address in a memory, and the memory is accessed using the physical address. In this way, programmers only need to consider the logical address of the tensor from the perspective of the multi-dimensional tensor, without needing to know the mapping from the multi-dimensional tensor to one-dimensional physical addresses, which reduces the programmers' cognitive burden and reduces development time.

Description

Memory Access Method and Electronic Device
Technical Field
Embodiments of the present disclosure relate generally to the field of electronics, and more particularly to a method and an electronic device for accessing a memory.
Background
Parallel, high-performance, multi-threaded, multi-core processing systems such as graphics processing units (GPUs) process data much faster than in the past. These processing systems can decompose complex computations into smaller tasks to be processed in parallel by multiple cores, thereby increasing processing efficiency and reducing processing time.
In some situations, multi-core processors such as GPUs are particularly advantageous for processing tensors containing large amounts of data of the same or similar form. In the computer field, tensor data generally denotes one-dimensional or multi-dimensional array data; for example, image data is a conventional two-dimensional tensor that can be represented by a two-dimensional array. When processing image data, a multi-core processor can process different parts of the image data in parallel to reduce processing time.
Conventionally, tensor data is stored in memory as a one-dimensional array, so when designing a program the programmer must consider how to correctly address the tensor data in memory when the program loads it. When the tensor data represents a multi-dimensional tensor, the programmer usually needs to know the mapping from the multi-dimensional tensor to one-dimensional data, that is, the correct mapping from the multi-dimensional tensor to the physical addresses of the one-dimensional data in memory, which imposes a serious cognitive burden on the programmer.
Summary
Embodiments of the present disclosure provide a method and an apparatus for accessing a memory.
In a first aspect, a method for accessing a memory is provided. The method includes converting a logical address of a target element in a first segment of a tensor into a physical address of the target element in the memory. The logical address includes segment reference data of the first segment in the tensor and offset data, where the offset data represents the offset of the target element in each of the multiple dimensions of the first segment. The method further includes accessing the memory using the physical address of the target element. Programmers can thus consider data from the perspective of the multi-dimensional tensor without needing to know the mapping from the multi-dimensional tensor to one-dimensional data, that is, the correct mapping from the multi-dimensional tensor to the physical addresses of one-dimensional data in memory, which reduces the programmers' cognitive burden, increases development efficiency, and reduces development time.
In one possible implementation, the first segment includes at least one page, and the at least one page includes the target element page where the target element is located. Converting the logical address of the target element into the physical address includes: converting the logical address into a linear address at least according to the size of each dimension of the target element page, where the linear address includes a one-dimensional page identifier of the target element page and a one-dimensional offset value of the target element within the target element page; and converting the linear address into the physical address according to a page table entry for the target element page, where the page table entry includes the page physical address of the target element page. Converting the logical address first into a linear address and then into a physical address simplifies the computation and correspondingly improves tensor processing efficiency.
In one possible implementation, converting the linear address into the physical address according to the page table entry for the target element page includes: looking up, in the page table entry, the page physical address corresponding to the target element page; and adding the page physical address to the one-dimensional offset value to obtain the physical address of the target element. Adding the page physical address to the one-dimensional offset value further simplifies the computation and thus further improves tensor processing efficiency.
In one possible implementation, each page of the at least one page includes multiple cache lines, and the dimension arrangement order of the multiple elements of the tensor within a cache line is different from the dimension arrangement order of the multiple cache lines of the tensor within the at least one page, or different from the dimension arrangement order of the at least one page within the first segment. Accessing the memory using the physical address of the target element includes: using the physical address of the target element to read the cache line including the target element from the memory into the L1 cache, where the memory is the L2 cache or off-chip memory. By interleaving adjacent data from multiple dimensions within the same cache line, the principle of cache locality is exploited to a great extent during tensor processing, which greatly improves the L1 cache hit rate, greatly reduces the time spent reading data, and correspondingly reduces tensor processing time.
In one possible implementation, a first page of the at least one page includes a first cache line and a second cache line arranged along a first dimension, and a third cache line and a fourth cache line arranged along the first dimension; the first cache line and the third cache line are arranged along a second dimension different from the first dimension, and the second cache line and the fourth cache line are arranged along the second dimension. The first cache line includes a first element and a second element arranged along the second dimension, and a third element and a fourth element arranged along the second dimension; the first element and the third element are arranged along the first dimension, and the second element and the fourth element are arranged along the first dimension. The target element is one of the first element, the second element, the third element, and the fourth element. Using the physical address of the target element to read the cache line including the target element from the memory into the L1 cache includes: storing at least the first element, the second element, the third element, and the fourth element into the L1 cache.
In one possible implementation, a first page of the at least one page includes a first plurality of cache lines, and the first plurality of cache lines includes the target element. A second page of the at least one page includes a second plurality of cache lines, and the second plurality of cache lines includes another target element. The size of the second page is different from the size of the first page, or the size of each cache line in the second plurality of cache lines is different from the size of each cache line in the first plurality of cache lines, or the dimension arrangement order of the elements of the first plurality of cache lines is different from the dimension arrangement order of the elements of the second plurality of cache lines. The method further includes: converting a logical address of the other target element, in parallel with the target element, into another physical address of the other target element in the memory, where the logical address of the other target element includes the segment reference data and other offset data, and the other offset data represents the offset value of the other target element in each of the multiple dimensions of the first segment; and accessing the memory using the physical address of the other target element in parallel with the target element. Since the sizes and arrangements of pages, cache lines, and elements can all be set flexibly, programmers gain great flexibility when processing tensor data, so that tensor data can be processed more efficiently and processing time is reduced.
在一种可能的实现方式中,使用目标元素的物理地址访问存储器包括使用目标元素的物理地址访问存储器中所存储的张量的一部分。通过将张量划分为多个段,可以良好地适配硬件性能,从而可以基于硬件设置来灵活地处理张量数据。
在一种可能的实现方式中,该方法还包括:确定逻辑地址是否超出张量的逻辑边界;以及如果逻辑地址超出逻辑边界,则生成存储器访问异常信号。通过检测是否越界,可以在处理之前就确保张量处理的正确性。
在一种可能的实现方式中,确定逻辑地址是否超出张量的逻辑边界包括:确定段基准数据和偏移数据之和是否超出第一段的各维的逻辑边界,即第一段的各维的尺寸。通过基于逻辑地址确认是否越界,可以简化检测的步骤,降低计算量和处理时间。
在一种可能的实现方式中,该方法还包括:基于第一段的段属性,将目标元素留置在一级高速缓存中,段属性包括针对目标元素所在的缓存行在一级高速缓存中的替换规则。由于某些元素具有较高的使用频率,因此并不期望频繁地将其移入和移出一级高速缓存,而是期望尽可能地保留在一级高速缓存中。通过设置替换规则,可以有效地提升目标元素被留置在一级高速缓存中的概率,从而减少元素传输时间并且相应地减少张量处理时间。
在一种可能的实现方式中,第一段包括多个页,该方法还包括:如果将多个页中的第一页移入存储器或片外存储器,则建立与第一页对应的第一页表项,第一页表项存储第一页在存储器或片外存储器中的物理地址。通过在页被移入时建立页表项,一方面可以有效地并且正确地进行寻址,另一方面也可以减少页表装置以及片内高速存储的用量,从而减少成本。
在一种可能的实现方式中,第一段包括多个页,该方法还包括:如果将多个页中的第一页从存储器或片外存储器移出,则删除与第一页对应的第一页表项,第一页表项存储第一页在存储器或片外存储器中的物理地址。通过及时删除不被使用的页及页表项,可以减少页表装置以及片内高速存储用量,某些场景下还能减少片内存储的使用量,从而减少成本。
在一种可能的实现方式中,第一段包括目标元素页,目标元素页包括目标元素,该方法还包括:基于目标元素的逻辑地址,确定目标元素页是否位于存储器中;以及如果目标元素页不位于存储器中,则从片外存储器将目标元素页移入存储器。
在本公开的第二方面,提供一种计算机可读存储介质,其存储多个程序,多个程序被配置为一个或多个处理引擎执行,多个程序包括用于执行第一方面的方法的指令。编程人员可以从多维张量角度考虑数据,而无需知晓这种多维张量到一维数据的映射,即多维张量到存储器中的一维数据的物理地址的正确映射,从而降低了编程人员的认知负担,增加了开发效率并且减少开发时间。
在本公开的第三方面,提供一种计算机程序产品。计算机程序产品包括多个程序,多个程序被配置为一个或多个处理引擎执行,多个程序包括用于执行第一方面的方法的指令。编程人员可以从多维张量角度考虑数据,而无需知晓这种多维张量到一维数据的映射,即多维张量到存储器中的一维数据的物理地址的正确映射,从而降低了编程人员的认知负担,增加了开发效率并且减少开发时间。
在本公开的第四方面,提供一种电子设备。电子设备包括:流处理器;页表装置,耦合至流处理器;存储器;处理引擎单元,耦合至流处理器、存储器和页表装置,被配置为执行第一方面的方法。编程人员可以从多维张量角度考虑数据,而无需知晓这种多维张量到一维数据的映射,即多维张量到存储器中的一维数据的物理地址的正确映射,从而降低了编程人员的认知负担,增加了开发效率并且减少开发时间。
在本公开的第五方面,提供一种存储器访问装置。存储器访问装置包括转换单元和访问单元。转换单元用于将张量中的第一段中的目标元素的逻辑地址转换为目标元素在存储器中的物理地址,逻辑地址包括第一段在张量中的段基准数据和偏移数据,偏移数据表示目标元素在第一段的多个维度中的各维上的偏移量。访问单元用于使用目标元素的物理地址访问存储器。编程人员可以从多维张量角度考虑数据,而无需知晓这种多维张量到一维数据的映射,即多维张量到存储器中的一维数据的物理地址的正确映射,从而降低了编程人员的认知负担,增加了开发效率并且减少开发时间。
在一种可能的实现方式中,第一段包括至少一个页,至少一个页包括目标元素所在的目标元素页。转换单元进一步用于至少根据目标元素页的各维的尺寸,将逻辑地址转换为线性地址,线性地址包括目标元素页的一维页标识和目标元素在目标元素页内的一维偏移值;以及根据针对目标元素页的页表项,将线性地址转换为物理地址,页表项包括目标元素页的页物理地址。通过将逻辑地址转换为线性地址并且再转换为物理地址,简化了计算操作,从而提高了张量处理效率。
在一种可能的实现方式中,转换单元进一步用于:从页表项查找与目标元素页对应的页物理地址;以及将页物理地址与一维偏移值相加,以获得目标元素的物理地址。通过将页物理地址与一维偏移值相加,进一步简化了计算操作,从而进一步提高了张量处理效率。
在一种可能的实现方式中,至少一个页中的每个页包括多个缓存行,张量的多个元素在缓存行中的维度布置顺序不同于张量的多个缓存行在至少一个页中的维度布置顺序或不同于至少一个页在第一段中的维度布置顺序,访问单元进一步用于:使用目标元素的物理地址将包括目标元素的缓存行从存储器读取至一级高速缓存,存储器是二级高速缓存或片外存储器。通过在同一缓存行内交叉存储多个维上的相邻数据,可以确保张量处理过程中高速缓存的邻近性原理被极大地利用,从而极大地提高了一级高速缓存的命中率并且极大地减少了读取数据的时间,并且相应地减少了张量处理时间。
在一种可能的实现方式中,至少一个页中的第一页包括沿第一维度设置的第一缓存行和第二缓存行以及沿第一维度设置的第三缓存行和第四缓存行,第一缓存行和第三缓存行沿与第一维度不同的第二维度设置,并且第二缓存行和第四缓存行沿第二维度设置;第一缓存行包括沿第二维度设置的第一元素和第二元素以及沿第二维度设置的第三元素和第四元素,第一元素和第三元素沿第一维度设置,并且第二元素和第四元素沿第一维度设置;目标元素是第一元素、第二元素、第三元素和第四元素之一。访问单元进一步用于:至少将第一元素、第二元素、第三元素和第四元素存储至一级高速缓存。
在一种可能的实现方式中,至少一个页中的第一页包括第一多个缓存行,第一多个缓存行包括目标元素。至少一个页中的第二页包括第二多个缓存行,第二多个缓存行包括另一目标元素。第二页的尺寸不同于第一页的尺寸,或第二多个缓存行中的每个缓存行的尺寸不同于第一多个缓存行中的每个缓存行的尺寸,或第一多个缓存行的多个元素的维度布置顺序不同于第二多个缓存行的多个元素的维度布置顺序。转换单元还用于将另一目标元素的逻辑地址与目标元素并行地转换为另一目标元素在存储器中的另一物理地址,另一目标元素的逻辑地址包括段基准数据和另一偏移数据,另一偏移数据表示另一目标元素在第一段的多个维度中的各维上的偏移值。访问单元还用于使用另一目标元素的物理地址与目标元素并行地访问存储器。由于页、缓存行和元素的尺寸和布置方式都可以灵活设置,这极大地增加了编程人员在处理张量数据时的灵活性,从而可以更有效率地处理张量数据,并且减少处理时间。
在一种可能的实现方式中,访问单元进一步用于使用目标元素的物理地址访问存储器中所存储的张量的一部分。通过将张量划分为多个段,可以良好地适配硬件性能,从而可以基于硬件设置来灵活地处理张量数据。
在一种可能的实现方式中,存储器访问装置还包括确定单元和生成单元。确定单元用于确定逻辑地址是否超出张量的逻辑边界。生成单元用于如果逻辑地址超出逻辑边界,则生成存储器访问异常信号。通过检测是否越界,可以在处理之前就确保张量处理的正确性。
在一种可能的实现方式中,确定单元进一步用于:确定段基准数据和偏移数据之和是否超出段的各维的尺寸。通过基于逻辑地址确认是否越界,可以简化检测的步骤,降低计算量和处理时间。
在一种可能的实现方式中,存储器访问装置还包括留置单元。留置单元用于基于第一段的段属性,将目标元素留置在一级高速缓存中。段属性包括针对目标元素所在的缓存行在一级高速缓存中的替换规则。由于某些元素具有较高的使用频率,因此并不期望频繁地将其移入和移出一级高速缓存,而是期望尽可能地保留在一级高速缓存中。通过设置替换规则,可以有效地提升目标元素被留置在一级高速缓存中的概率,从而减少元素传输时间并且相应地减少张量处理时间。
在一种可能的实现方式中,第一段包括多个页,存储器访问装置还包括建立单元。建立单元用于如果将多个页中的第一页移入片外存储器或存储器,则建立与第一页对应的第一页表项,第一页表项存储第一页在存储器中的物理地址。通过在页被移入时建立页表项,一方面可以有效地并且正确地进行寻址,另一方面也可以减少页表装置以及片内高速存储的用量,从而减少成本。
在一种可能的实现方式中,第一段包括多个页,存储器访问装置还包括删除单元。删除单元用于如果将多个页中的第一页从存储器或片外存储器移出,则删除与第一页对应的第一页表项,第一页表项存储第一页在存储器中的物理地址。通过及时删除不被使用的页及页表项,可以减少页表装置以及片内高速存储的用量,从而减少成本。
在一种可能的实现方式中,第一段包括目标元素页,目标元素页包括目标元素。确定单元进一步用于基于目标元素的逻辑地址,确定目标元素页是否位于存储器中。存储器访问装置还包括移入单元。移入单元用于如果目标元素页不位于存储器中,则从片外存储器将目标元素页移入存储器。
根据本公开的实施例的方法和电子设备,编程人员可以从多维张量角度考虑数据,而无需知晓这种多维张量到一维数据的映射,即多维张量到存储器中的一维数据的物理地址的正确映射,从而降低了编程人员的认知负担,增加了开发效率并且减少开发时间。
附图说明
通过结合附图对本公开示例性实施例进行更详细的描述,本公开的上述以及其他目的、特征和优势将变得更加明显,其中,在本公开示例性实施例中,相同的参考标号通常代表相同部件。
图1示出了本公开的多个实施例能够在其中实现的示例环境的示意图;
图2示出了根据本公开的一个实施例的芯片示意框图;
图3示出了根据本公开的一个实施例的三维张量示意框图;
图4示出了根据本公开的一个实施例的图像数据的页分配示意图;
图5示出了根据本公开的一个实施例的访问存储器的方法的示意流程图;
图6示出了图4中的图像数据的缓存行分配示意图;
图7示出了图4中的图像数据在存储器中的一维存储示意图;以及
图8示出了根据本公开的一些实施例的存储器访问装置的示意框图。
具体实施方式
下面将参照附图更详细地描述本公开的优选实施例。虽然附图中示出了本公开的优选实施例,然而应该理解,本公开可以以各种形式实现而不应被这里阐述的实施例限制。相反,提供这些实施例是为了使本公开更加透彻和完整,并且能够将本公开的范围完整地传达给本领域的技术人员。
在本文中使用的术语“包括”及其变形表示开放性包括,即“包括但不限于”。除非特别申明,术语“或”表示“和/或”。术语“基于”表示“至少部分地基于”。术语“一个示例实施例”和“一个实施例”表示“至少一个示例实施例”。术语“另一实施例”表示“至少一个另外的实施例”。术语“第一”、“第二”等等可以指代不同的或相同的对象。下文还可能包括其他明确的和隐含的定义。
如前文所提及的,当张量数据表示多维张量时,编程人员通常需要知晓这种多维张量到一维数据的映射,即多维张量到存储器中的一维数据的物理地址的正确映射,这给编程人员增加了严重的认知负担。此外,还可能增加了软件的运行时开销。
在本公开的一些实施例中,通过使用与张量对应的具有多个维度的逻辑地址以将逻辑地址自动转换为与张量数据在存储器中的对应物理地址,编程人员可以无需知晓张量数据与物理地址的映射关系,而只需要了解数据在多维张量中的位置,例如坐标。这样,可以显著减少编程人员的认知负担,并且显著减少应用程序或软件的开发时间并且增加开发效率。
图1示出了本公开的多个实施例能够在其中实现的示例环境100的示意图。示例环境100例如可以是诸如计算机之类的具有计算能力的电子设备。在一个实施例中,示例环境100例如包括中央处理器(CPU)20、系统存储器10、北桥/存储器桥30、加速器子系统40、设备存储器50和南桥/输入输出(IO)桥60。系统存储器10例如可以是诸如动态随机存取存储器(DRAM)之类的易失性存储器。北桥/存储器桥30例如集成了内存控制器、PCIe控制器等,其负责CPU 20和高速接口之间的数据交换以及桥接CPU 20和南桥/IO桥60。南桥/IO桥60用于计算机的低速接口,例如串行高级技术接口(SATA)控制器等。加速器子系统40例如可以包括诸如图形处理器(GPU)和人工智能(AI)加速器等用于对图形、视频等数据进行加速处理的装置或芯片。设备存储器50例如可以是诸如DRAM之类的位于加速器子系统40外部的易失性存储器。在本公开中,设备存储器50也被称为片外存储器,即,位于加速器子系统40的芯片外部的存储器。相对而言,加速器子系统40的芯片内部也具有易失性存储器,例如一级(L1)高速缓存(cache)以及可选的二级(L2)高速缓存。这将在下文结合本公开的一些实施例具体描述。虽然在图1中示出了本公开的多个实施例能够在其中实现的一种示例环境100,但是本公开不限于此。本公开的一些实施例也可以在诸如ARM架构和RISC-V架构之类的具有诸如GPU之类的加速器子系统的一些应用环境中使用。
图2示出了根据本公开的一个实施例的加速器子系统200的示意框图。加速器子系统200例如可以是图1中加速器子系统40的芯片的一种具体实现方式。加速器子系统200例如是诸如GPU之类的加速器子系统芯片。在一个实施例中,加速器子系统200包括流处理器(SP)210、页表装置220、处理引擎(PE)单元230、直接存储器访问(DMA)控制器240、L1高速缓存(cache)260和L2高速缓存250。
加速器子系统200由诸如CPU 20之类的主机设备控制,并且接收来自CPU 20的指令。SP 210对来自CPU 20的指令进行分析,并且将经分析的操作指派给PE单元230、页表装置220和DMA控制器240进行处理。页表装置220用于管理加速器子系统200的片上虚拟存储。在本公开中,L2高速缓存250和诸如图1中的设备存储器50之类的片外存储器构成虚拟存储系统。页表装置220由SP 210、PE单元230和DMA控制器240共同维护。
PE单元230包括多个处理引擎(processing engine,PE)PE_1、PE_2……PE_N,其中N表示大于1的整数。PE单元230中的每个PE可以是单指令多线程(SIMT)装置。在PE中,每个线程可以具有自己的寄存器堆(register file),并且每个PE的所有线程还共享一个统一寄存器堆(uniform register file)。多个PE可以并行地执行相同或不同的处理工作,可以并行地进行下文所述的地址转换和存储器中目标数据的访问,从而减少处理时间。可以理解,多个PE处理的目标元素并不相同,并且目标元素所在的段、页、缓存行和元素的属性、尺寸、维度排序等可以有所不同,如下文具体描述。
每个线程可以在自己的寄存器堆与存储器子系统之间做线程级的数据交换。每个线程有自己的算术逻辑执行单元并使用自己的存储地址,其采用典型的寄存器存取架构(load-store architecture)。每个执行单元包括一个支持多种数据类型的浮点/定点单元以及一个算术逻辑单元。
大多数的指令执行算术和逻辑运算,例如,浮点和定点数的加、减、乘、除,或者逻辑与、或、非等。操作数来自于寄存器。存储器读写指令可以提供寄存器与片上/片外存储器之间的数据交换。一般地,PE中所有的执行单元可以同步地执行相同指令。通过使用谓词(predicate)寄存器,可以屏蔽部分执行单元,从而实现分支指令的功能。
在一个实施例中,图2的加速器子系统200可以例如执行如下操作:1)组建页表项内容和初始状态;2)将诸如图1中的设备存储器50之类的片外存储器上的数据搬运至片上存储器,例如L2高速缓存250;3)启动和执行程序;4)定义各个段并对张量以及存储的属性进行描述;5)在程序执行完成时,将执行结果的数据写入至片外存储器。
可以理解,在公开的实施例中,加速器子系统200所处理的数据主要针对多维张量。例如,在一个实施例中,张量可以是四维张量,其具有四个维度D1、D2、D3和D4,并且张量在各维上的尺寸可以不同。在另一些实施例中,张量可以是一维、二维、三维或更多维张量,本公开对此不进行限制。
此外,在本公开的实施例中,张量内部可以支持诸如uint8、int8、bfloat16、float16、uint16、int16、float32、int32、uint32以及其他自定义元素类型,本公开对此也不进行限制。对于张量的寻址而言,其以元素为基本单位。例如,如果元素类型为int8,则元素以字节为单位。再例如,如果元素类型为int16,则寻址基本单位为双字节,依此类推。
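"寻址基本单位随元素类型变化"这一点可以用一个简单的映射示意(类型与字节数取自上文列举,映射表与函数名均为本文为说明而假设):

```python
# 元素类型到寻址基本单位(字节数)的映射,类型取自上文列举
ELEM_BYTES = {
    "uint8": 1, "int8": 1,
    "bfloat16": 2, "float16": 2, "uint16": 2, "int16": 2,
    "float32": 4, "int32": 4, "uint32": 4,
}

def elem_unit(dtype: str) -> int:
    """返回给定元素类型的寻址基本单位(以字节计)。"""
    return ELEM_BYTES[dtype]
```

例如,元素类型为 int8 时以字节为单位,为 int16 时以双字节为单位,与上文描述一致。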
在一些情形中,张量所包含的数据量可能较大,而L2高速缓存250的容量有限,因此无法将张量整体加载至片上的L2高速缓存250。在本公开的一些实施例中,为了便于张量的并行处理,可以将张量划分为至少一个段。在张量仅包括一个段的情形下,张量即为段。而在张量包括多个段的情形下,段为张量的一部分。CPU 20可以通过指令指定段的各个部分由哪个PE进行处理。
图3示出了根据本公开的一个实施例的三维张量300的示意框图。三维张量300具有三个维度D1、D2和D3,并且包括第一段S1、第二段S2和第三段S3。CPU 20可以指定段S1的张量元素由PE_1、PE_2、PE_3、PE_4、PE_5、PE_6、PE_7和PE_8处理。此外,CPU 20还指定了第二段S2的张量元素由PE_1-PE_4处理。在本公开的实施例中,每个段所具有的尺寸可以不同,因此编程人员可以基于设计需要灵活配置段。实际上,页的划分可以在任意一个或多个维上实施,并且各维上划分的页数是相互独立的。
在一个实施例中,可以将张量数据存储于片上的高速存储器,例如L2高速缓存250。但由于片上的高速存储器的容量较少,因此在张量规模较大时,编程人员可以将张量划分为多个段,每个段描述张量的一部分。核心程序(kernel)可以分多次启动,每次由DMA控制器240提前将张量的一个段由片外存储搬运到片内存储,并供kernel操作使用。在多次启动kernel后,张量包含的所有段均被处理,整个运行过程结束。当片上的高速存储器足以容纳kernel所要访问的所有张量时,一个张量仅需要一个段描述即可,kernel也只需要启动一次。
进一步地,在本公开的一些实施例中,在一个段内,还可以设置至少一个页以进一步细分张量。例如,在第一段S1中,具有4个页P[1]、P[2]、P[3]和P[4]。第二段S2仅具有一个页。在本公开的实施例中,每个段所具有的页的数目可以不同,因此编程人员可以基于设计需要灵活配置段内页的尺寸。例如,将页配置为适于整体存入L2高速缓存250。
如上所述,当对张量寻址时,最小的寻址单位是元素。一个页通常可以包括多个元素。目标元素所在的页在本文中被称为“目标元素页”。在本公开的一些实施例中,页可以包括多个缓存行。当目标元素页位于L2高速缓存250中时,如果PE经由L1高速缓存260读取目标元素,则L2高速缓存250需要将L2高速缓存250中的包括目标元素在内的一小部分的物理地址连续的数据整体传输至L1高速缓存260。这一小部分数据也被称为缓存行(cache line)数据,而这种缓存机制基于空间邻近性原理。PE从L1高速缓存260读取数据仅需几个时钟周期,而L1高速缓存260从L2高速缓存250读取数据可能需要几十个甚至上百个时钟周期。因此,期望减少L1高速缓存260从L2高速缓存250读取数据的次数。虽然在此以“缓存行”来描述从L2高速缓存250到L1高速缓存260的最小传输数据单位,但在本公开中,这部分数据可以并不必然按行或列排列,一个“缓存行”里面的数据分布在多个维上,且各维上分布的数据尺寸不限于1。PE对一个段内的数据进行并行处理,PE的分配在数据的逻辑地址空间展开,独立于段的物理存储结构,具体如下文描述。
在图3中,第一页P[1]中的第一组缓存行被指定由PE_1处理,第二组缓存行被指定由PE_2处理。虽然在此以顺序示出了张量由多个PE依序处理,但是可以理解张量数据的处理独立于PE的顺序,本公开对此不进行限制。例如图3中的PE_2表示部分的张量数据可以由PE_M处理,其中M表示不大于N的任意整数。
图4示出了根据本公开的一个实施例的图像数据400的页分配示意图。图像数据是典型的二维张量。在一个实施例中,图像数据400例如为8*8像素。换言之,图像数据400在第一维D1具有8个像素,并且在第二维D2也具有8个像素。因此,图像数据400具有像素P00、P01……P77。在图4的实施例中,图像数据400仅具有一个段,但是按两个维度分为4个页P[1]、P[2]、P[3]和P[4]。4个页可以按第二维D2划分以分配给PE_1和PE_2处理,也可以按第一维D1划分以分配给PE_1和PE_2处理。此外,还可以按对角线划分。本公开对此不进行限制。
图5示出了根据本公开的一些实施例的对存储器进行访问的方法500的流程图。在一个实施例中,方法500例如可以由诸如GPU之类的加速器子系统实施,因此上面针对图1-图3描述的各个方面可以选择性地适用于方法500。
在502,加速器子系统将张量中的第一段中的目标元素的逻辑地址转换为目标元素在存储器中的物理地址。目标元素为待获取并且将由PE处理的张量元素,例如图4中的像素P01。在一个实施例中,当核心程序访问张量的段时,核心程序可以被部署至一个或多个PE上,每个PE通过段访问其所描述张量数据的部分或全部元素。每个PE在段内访问的张量数据的起始点由段结构中的基准点(anchor)坐标定义。在一些实施例中,在处理相同的张量的相同段时,对于不同PE来说,其段结构中基准点坐标的定义也不尽相同。
在一个实施例中,目标元素的逻辑地址可以表示为seg:RF:imm,其中seg表示段基址寄存器,RF表示偏移寄存器,imm表示偏移立即数。从张量角度而言,逻辑地址可以包括第一段在张量中的段基准数据和偏移数据,偏移数据表示目标元素在第一段的多个维度中的各维上的偏移量。段基准数据为段起始点的地址。
在一个实施例中,第一段包括至少一个页,加速器子系统200可以至少根据目标元素页的各维的尺寸,将逻辑地址转换为线性地址。线性地址包括目标元素页的一维页标识和目标元素在目标元素页内的一维偏移值。具体而言,加速器子系统200可以根据第一段内各维上页的页尺寸得到目标元素在各维上所处的页序号偏移,由此获得目标元素所处的页的一维标识。例如,目标元素位于图3中的张量的最上层,通过上述方式可以确定目标元素的页标识为P[1]。
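上述"由段内各维页尺寸得到页序号偏移、再合成一维页标识"的步骤可以用如下 Python 草图示意(假设页按第一维优先排列,与图4一致;函数名与参数均为本文为说明而设,并非本公开的实际实现):

```python
def page_coords_and_id(elem_coords, page_dims, page_counts):
    """elem_coords: 目标元素在段内各维的坐标;
    page_dims: 每页各维的尺寸;page_counts: 段内各维的页数。
    返回 (各维页序号偏移, 一维页标识),页按"第一维变化最快"排列。"""
    # 各维上的页序号偏移
    coords = tuple(c // d for c, d in zip(elem_coords, page_dims))
    # 按第一维优先的顺序线性化为一维页标识(0 起始编号)
    pid, stride = 0, 1
    for coord, count in zip(coords, page_counts):
        pid += coord * stride
        stride *= count
    return coords, pid

# 以图4为例:8*8 图像,每页维度 (4, 4),各维 2 页;P36 的坐标为 (6, 3)
coords, pid = page_coords_and_id((6, 3), (4, 4), (2, 2))
```

按此计算,P36 的页序号偏移为 (1, 0),一维页标识为 1,对应文中的页 P[2]。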
此外,加速器子系统还可以得到目标元素在该页内部各维上的相对偏移量,并以此为基础,确定目标元素相对于页的起始位置的一维线性偏移量。页的一维标识以及页内的一维线性偏移量共同构成目标元素的线性地址。
加速器子系统200根据针对目标元素页的页表项,将线性地址转换为物理地址,页表项包括至少一个页中的每个页的页物理地址。具体而言,在一个实施例中,加速器子系统200在获取目标元素的页标识之后,可以根据页标识查找页表装置220中对应的项,获取页的物理地址。该物理地址加上目标元素在目标元素页的一维线性偏移量即为目标元素的物理地址。该物理地址可以表示片外的设备存储器50或片上的存储器,例如L2高速缓存250上的目标元素的存储地址。备选地,目标元素页的页表项也可以存储相对于其它页的物理地址,并且基于目标元素页相对于其它页的偏移、其它页的物理地址和一维线性偏移量来获得目标元素的物理地址。
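"查页表得到页物理基址、再与页内一维偏移相加"这一步可以草描如下(页表以字典模拟,示例数值沿用下文 P36 寻址示例,函数名为本文假设):

```python
def linear_to_physical(page_table, page_id, intra_page_offset):
    """根据页表项将线性地址(一维页标识 + 页内一维偏移)转换为物理地址:
    从页表项查找页物理基址,再与一维偏移值相加。"""
    page_base = page_table[page_id]      # 从页表项查找与目标元素页对应的页物理地址
    return page_base + intra_page_offset # 页物理地址与一维偏移值相加

# 沿用下文示例:页 P[2](标识 1)的物理基址为 0b010000,
# P36 的页内一维偏移为 0b1101(缓存行域为 3、行内偏移为 1)
addr = linear_to_physical({1: 0b010000}, 1, 0b1101)
```

按此计算得到 addr = 0b011101,即下文示例中的物理地址。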
除了物理地址之外,页表项还可以包括其它属性,例如状态,用于表示页是否加载完毕,即是否可用。本公开对此不进行限制。虽然在此示出了地址的二级转换,但是本公开不限于此。备选地,也可以经过更多级转换。例如,分级计算页偏移、缓存行偏移、元素偏移,并且依次与物理地址相加以得到目标元素的最终的物理地址。
在一个实施例中,加速器子系统200将多个页中的第一页从片外的存储器移入片上的存储器,并且建立与第一页对应的第一页表项,第一页表项存储第一页在存储器中的物理地址。如果将多个页中的第一页从片上的存储器移出至片外存储器,则加速器子系统200可以删除与第一页对应的第一页表项。
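页表项随页的移入而建立、随页的移出而删除的行为,可以用一个小类草描(类名、字段与结构均为本文为说明而假设,并非页表装置220的实际实现):

```python
class PageTableSketch:
    """以字典模拟页表装置:页被移入片上存储时建立页表项,
    页被移出时删除对应页表项。"""

    def __init__(self):
        self.entries = {}

    def page_in(self, page_id, phys_base):
        # 建立页表项,记录页的物理基址,状态字段置为可用
        self.entries[page_id] = {"base": phys_base, "ready": True}

    def page_out(self, page_id):
        # 页被移出时删除对应页表项,减少页表装置的用量
        self.entries.pop(page_id, None)
```

这样的"按需建表、及时删表"对应上文所述减少页表装置与片内高速存储用量的效果。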
加速器子系统将第一段S1中的目标元素的逻辑地址转换为在片上虚拟存储器中的物理地址。片上虚拟存储器可以包括片上的L2高速缓存250和片外的设备存储器50。逻辑地址包括第一段在张量中的段基准数据和偏移数据,偏移数据表示目标元素在第一段的多个维度中的各维上的偏移量。
在504,加速器子系统可以使用目标元素的物理地址访问存储器。这样,编程人员在编程时可以使用段基准数据和相对于段基准数据的多维偏移量来从张量的视角使用张量数据,而无需知晓张量数据与存储器中的物理地址的一一映射关系。
上面描述了根据本公开的一些实施例的将多维张量的逻辑地址转换为物理地址的数据寻址方式。然而在张量的处理过程中,还可以进一步考虑诸如L1高速缓存260从L2高速缓存250获取张量数据的空间邻近性因素。L1高速缓存260从L2高速缓存250获取张量数据时,最小的获取单位为缓存行。换言之,当L1高速缓存260从L2高速缓存250获取元素时,并非只是获取目标元素,而是将其附近的张量元素一起读取至L1高速缓存260。这个获取的数据的最小量是一个缓存行。例如参考图4,当缓存行按行排布时,如果期望获得目标元素P00,则L1高速缓存260会从L2高速缓存250一起读取元素P00-P03,这是因为在此情形下的缓存行包括元素P00-P03。
然而在张量处理过程中,往往是目标元素附近的元素具有较大相关性,并且被处理的概率较大。例如,图像的像素经常是按照像素间的曼哈顿距离存取,相邻的像素通常会在一起处理。当目标元素P00被处理时,目标元素P00附近的元素P01和P10有较大概率被处理。以P10举例而言,由于L1高速缓存260在从L2高速缓存250读取目标元素P00时,仅一同读取了同一缓存行内的P01、P02和P03,因此当PE试图从L1高速缓存260读取P10时,L1高速缓存260会再次从L2高速缓存250读取P10所在的缓存行。这会造成数据处理速度的降低以及存储带宽的潜在浪费。
在本公开的一些实施例中,在上面的多维张量寻址方法的基础上,提出一种多维张量的逆交叉寻址的方法。具体而言,至少一个页中的每个页包括多个缓存行,张量的多个元素在缓存行中的维度布置顺序不同于张量的多个缓存行在至少一个页中的维度布置顺序。图6示出了图4中的图像数据的缓存行分配示意图。下面结合图6和图4来描述逆交叉寻址。如图4所示,图像数据400包括仅一个段,并且该段包括4个页P[1]、P[2]、P[3]和P[4]。图像数据400为二维张量,因此图像数据400包括第一维D1和第二维D2。
在一个实施例中,图像数据400的页首先按照第一维D1排列,并且在到达第一维D1的边界之后,在第二维D2延伸一页,并且继续按第一维D1排列。如图4所示,图像数据400的页的布置顺序是P[1]、P[2]、P[3]和P[4]。然而,在本公开的一些实施例中,张量的多个元素在缓存行中的维度布置顺序不同于张量的多个缓存行在至少一个页中的维度布置顺序,例如参见图6。
图6示出了图4中的图像数据的缓存行分配示意图。图6示出了4个缓存行C[1]、C[2]、C[3]和C[4],并且每个缓存行包括4个元素。例如,第一缓存行C[1]包括元素P00、P10、P01和P11。与图4中的页的布置顺序相似,在图6中,缓存行C[1]、C[2]、C[3]和C[4]也是先沿着第一维D1布置,直至到达第一维D1的边界。缓存行随后在第二维D2延伸一个缓存行,并且继续沿着第一维D1布置,直至到达第一维D1的边界。如图6所示,页600的缓存行的布置顺序是C[1]、C[2]、C[3]和C[4]。在一个实施例中,可以使用如下三个域来描述图4和图6的图像数据的逻辑地址:addr[5:4]、addr[3:2]和addr[1:0],其中addr[5:4]表示页的域,即页地址;addr[3:2]表示缓存行的域,即,页内缓存行的地址;addr[1:0]表示缓存行内元素或元素序列的地址。
然而,在缓存行的内部,张量元素的布置顺序与缓存行在页内或者页在第一段内的布置顺序截然相反,是为逆交叉寻址。以第一缓存行C[1]为例说明,元素首先按照第二维D2布置然后按第一维D1布置。即,元素P00、P10、P01和P11在缓存行C[1]内依序布置。图7更为形象地示出了图4中的图像数据在存储器中的一维存储示意图。如图7所示,图像数据700在存储器中实际上按一维顺序排列,并且元素P00、P10、P01和P11依次被布置在存储器中。因此,当目标元素为P00时,L1高速缓存从L2高速缓存读取元素P00、P10、P01和P11。这样,就可以提高目标元素的命中率,并且相应地减少访问目标元素的时间,从而减少张量的整体处理时间。因此使用目标元素的物理地址将包括目标元素的缓存行从存储器读取至L1高速缓存,存储器是L2高速缓存或片外存储器。此外,有些目标元素可能会被经常使用,因此期望将其常驻在L1高速缓存中。在一些实施例中,可以在段属性中将包含该目标元素的缓存行的替换规则设置为常驻于L1高速缓存中,这里依赖的是数据访问的时间局部性原理。
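缓存行内"维度顺序与页/段相反"的逆交叉线性化可以用如下函数示意(坐标均按 (D1, D2, …) 给出;函数名为本文假设,仅为与图6对照的草图):

```python
def intra_line_offset(local_coords, line_dims):
    """缓存行内按与页/段相反的维度顺序线性化:
    页和缓存行按 D1 优先排列,行内则反转为 D2(乃至更高维)优先。"""
    off, stride = 0, 1
    # 反转维度顺序后,按"反转后第一维变化最快"累加偏移
    for c, d in zip(reversed(local_coords), reversed(line_dims)):
        off += c * stride
        stride *= d
    return off

# 图6中缓存行 C[1](行内维度 (2, 2))的元素应依序为 P00、P10、P01、P11;
# 像素命名 Pxy 中 x 为 D2 坐标、y 为 D1 坐标
order = sorted(
    ["P00", "P10", "P01", "P11"],
    key=lambda p: intra_line_offset((int(p[2]), int(p[1])), (2, 2)),
)
```

按此线性化,C[1] 内的元素顺序与图7所示的一维存储顺序一致。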
虽然在此以二维张量示出了逆交叉寻址,但是可以理解本公开不限于此。缓存行内的元素的维度顺序可以是不同于缓存行在页内的维度顺序或页在段内的维度顺序的任意顺序。在另一些实施例中,在3维或4维张量的情形下,缓存行内的元素维度顺序、页内的缓存行维度顺序和段内的页维度顺序可以两两不同。
此外,每个段中的页数,每个页中的缓存行数以及每个缓存行中的元素的个数也不限于图中示例的数目。
虽然在图6的实施例中将缓存行内的元素示出为方块布置,但是可以理解本公开不限于此。缓存行内的元素也可以按行、按列或是以其它顺序布置在存储器的一维物理区域内。这可以给编程人员进一步的灵活性,使得编程人员可以基于张量元素的使用关系灵活调整缓存行内的元素布置,从而减少程序的运行时间和响应速度。
上面具体介绍了在本公开的一些实施例中的张量分级结构,即张量-段-页-缓存行-元素。张量可以包括至少一个段,段可以包括至少一个页,页可以包括至少一个缓存行,并且缓存行可以包括至少一个元素。张量、段、页和缓存行都可以包括一个或多个维度。可以理解,段的维度数目和尺寸不高于张量的维度数目和尺寸,页的维度数目和尺寸不高于段的维度数目和尺寸,并且缓存行的维度数目和尺寸不高于页的维度数目和尺寸。为了确保程序的正确运行,在访问存储器的数据之前,加速器子系统40可以使用逻辑地址确定待访问的地址是否越过张量边界。例如可以检测段基准数据与目标元素的坐标或偏移量之和是否超出各维的尺寸。如果越过边界,则生成存储器访问异常信号。通过检测越界情形的发生,可以确保寻址的准确性和对存储器中的张量元素存取的正确性。
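越界检测可以在地址转换之前按逻辑地址完成,下面是一个最小草图(以 Python 异常模拟存储器访问异常信号;函数名与异常类型均为本文假设):

```python
def check_logical_bounds(anchor, offsets, seg_dims):
    """检查段基准数据与各维偏移之和是否越出段在各维上的逻辑边界;
    越界时产生存储器访问异常(此处以异常示意)。"""
    for a, o, s in zip(anchor, offsets, seg_dims):
        if not 0 <= a + o < s:
            raise IndexError("memory access exception: out of tensor bounds")

# 以图4语境为例:段各维尺寸 (8, 8),基准点 (0, 4),偏移 (3, 3) 为合法访问
check_logical_bounds((0, 4), (3, 3), (8, 8))
```

若偏移改为 (0, 4),则 4 + 4 = 8 达到第二维尺寸,将触发异常。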
下面介绍从逻辑地址到物理地址的具体寻址过程的一个示例。诸如CPU 20之类的主机向SP 210传输指令时,指令可以包括张量的各种信息,例如张量描述项。张量描述项例如包括段标识、PE掩码、基准点、起始页号、页数、每页维度、段属性、维度属性等。段标识用于在一次运行过程中索引该段。PE掩码表示待分配的PE。每个比特位对应一个PE。PE掩码被置为1表示分配对应的PE,否则不被分配PE。例如bit0为1,则表示PE_1被分配。一种可能是指定两个PE(PE_1和PE_2),其掩码为0x03。也可以指定4个PE(PE_2、PE_3、PE_5、PE_6),则掩码为0x36,或者只指定PE_8,掩码为0x80。其他情况依此类推,可以根据应用需求及资源可用情况进行任意组合。
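PE 掩码的按位含义可以用几行代码验证(PE 总数取 8 仅为对应上文掩码示例,函数名为本文假设):

```python
def pes_from_mask(mask, total=8):
    """解析 PE 掩码:bit i 为 1 表示 PE_(i+1) 被分配。"""
    return [f"PE_{i + 1}" for i in range(total) if (mask >> i) & 1]
```

例如掩码 0x03 分配 PE_1 和 PE_2,0x36 分配 PE_2、PE_3、PE_5、PE_6,0x80 仅分配 PE_8,与上文示例一致。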
基准点表示为各个PE指定的张量内的起始坐标点。例如基准点可以是(0,0,0,0)或(0,4,0,0)。多个PE可以具有相同或不同的基准点,由于基准点实际上是构成目标元素逻辑地址的组成部分,因此段基准数据不同的情形下,不同的PE的相同坐标偏移实际上被映射到不同的物理地址。在PE处理完一个目标元素之后,可以使用相对于基准点的偏移继续访问下一目标元素。备选地,也可以使用新的基准点来访问下一目标元素。
起始页号表示PE处理的元素的起始页。由于段内包含的页标识是连续的,因此在知道段内总页数的情况下,只需要指定段内的起始页标识即可访问段内所有页。页数表示每个维度上划分的页数的乘积。在图4中,在D1和D2两个维度各划分了两个页,因此段内总共包含了2*2=4个页。具体表示为(2,2,0,0)。
每页维度表示元素在页内各维上分布的个数。例如在图4中,段在各维上的元素分布为(8,8,0,0),页在各维上的分布为(2,2,0,0),因此页内各维上的元素分布为(8/2,8/2,0,0),即(4,4,0,0)。每个页内有4x4=16个元素。
段属性包括段中页使用的状态标志、元素大小、元素数据编码类型以及转换标志,段内缓存行的替换规则等。维属性可以用于独立设置各维的属性,包括长模式,流(streaming)模式,地址的符号属性以及缓存行内逆交叉寻址的位宽等信息。长模式表示张量在一个维度的尺寸显著高于其它维度的尺寸。流模式表示可以在核心程序不停止的情形下支持无限长的张量数据的计算。地址的符号属性表示相对于基准点的坐标偏移值可以为正也可以为负,换言之在同一维度上可以正向偏移也可以负向偏移。
页的属性包括页标识、物理基址、状态字段和维度信息等。页标识用于索引对应的页表项。物理基址描述页在诸如L2高速缓存之类的片上存储器或片外的存储器内的物理首地址。状态字段表示页是否被占用或可用。维度信息主要包括维度的数目以及各维的尺寸,该字段可以由段定义。页的属性例如可以被存储在页表装置220内。
参见图4,在一个实施例中,对目标元素P36进行寻址,即PE访问存储器以获取目标元素P36。图像数据400在存储器中以逆交叉的方式被存储。假设图像数据400是一个四维张量,并且页和缓存行均按D1->D2->D3->D4的维度顺序布置,而缓存行内的元素则按D4->D3->D2->D1的维度顺序布置。PE接收到的加载数据指令表示目标元素在D1维上的地址为6,在D2维上的地址为3。由于每页维度为(4,4,0,0),因此目标元素P36落在坐标为(6/4,3/4,0,0)=(1,0,0,0)的页上,即页P[2]上。
在页P[2]内部,目标元素P36的逻辑地址偏移为(6%4,3%4,0,0)=(2,3,0,0),而缓存行中各维上元素的个数为(2,2,0,0),因此目标元素P36落在页P[2]中坐标为(2/2,3/2,0,0)=(1,1,0,0)的缓存行上,即缓存行C[4]中。
在缓存行C[4]中,各维上元素的个数为(2,2,0,0),因此缓存行内部的偏移为(3%2,2%2,0,0)=(1,0,0,0),即缓存行内偏移量为1的地址上。由于图像数据400是以逆交叉的方式被存储,因此在计算过程将3和2的位置实际上进行了对调,从而得到目标元素P36在缓存行C[4]内部的真实物理偏移量,也即线性地址。
通过查阅页表项,可以获得页P[2]的物理地址为6’b010000。因此目标元素P36存储地址的页的域addr[5:4]=1,缓存行的域addr[3:2]=3,缓存行内的偏移addr[1:0]=1,从而目标元素P36的存储地址值为addr[5:0]=6’b01_11_01=6’b011_101=6’o35。由此得到目标元素在存储器中的真实物理地址。
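P36 的完整寻址过程可以整体复算如下(页排列方式、各域位宽与页表内容均沿用上文示例中的数值;这是一个说明性的草图,并非本公开的实际实现):

```python
def locate_p36():
    d1, d2 = 6, 3                            # 目标元素 P36 的逻辑坐标
    # 1) 页:每页维度 (4,4),页按 D1 优先排列,各维 2 页
    page_id = (d2 // 4) * 2 + (d1 // 4)      # 页坐标 (1,0),标识 1,即页 P[2]
    # 2) 缓存行:行维度 (2,2),页内同样按 D1 优先排列
    r1, r2 = (d1 % 4) // 2, (d2 % 4) // 2    # 行坐标 (1,1),即缓存行 C[4]
    line_id = r2 * 2 + r1                    # 一维缓存行标识 3
    # 3) 行内逆交叉:元素按 D2 优先排列,故对调两维的位置
    elem_off = (d1 % 2) * 2 + (d2 % 2)       # 行内偏移 1
    # 4) 拼接 addr[5:4] | addr[3:2] | addr[1:0]
    return (page_id << 4) | (line_id << 2) | elem_off
```

按此复算得到 addr = 0b011101 = 6’o35,与上文结果一致。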
图8示出了根据本公开的一些实施例的存储器访问装置800的示意框图。装置800可以被实现为或者被包括在图2的加速器子系统200中。装置800可以包括多个模块,以用于执行如图5中所讨论的方法500中的对应步骤。
如图8所示,装置800包括转换单元802和访问单元804。转换单元802用于将张量中的第一段中的目标元素的逻辑地址转换为目标元素在存储器中的物理地址,逻辑地址包括第一段在张量中的段基准数据和偏移数据,偏移数据表示目标元素在第一段的多个维度中的各维上的偏移量。访问单元804用于使用目标元素的物理地址访问存储器。编程人员可以从多维张量角度考虑数据,而无需知晓这种多维张量到一维数据的映射,即多维张量到存储器中的一维数据的物理地址的正确映射,从而降低了编程人员的认知负担,增加了开发效率并且减少开发时间。
在一些实施例中,第一段包括至少一个页,至少一个页包括目标元素所在的目标元素页。转换单元802进一步用于至少根据目标元素页的各维的尺寸,将逻辑地址转换为线性地址,线性地址包括目标元素页的一维页标识和目标元素在目标元素页内的一维偏移值;以及根据针对目标元素页的页表项,将线性地址转换为物理地址,页表项包括目标元素页的页物理地址。通过将逻辑地址转换为线性地址并且再转换为物理地址,简化了计算操作,从而节省了转换时间并且相应地提高了张量处理效率。
在一些实施例中,转换单元802进一步用于:从页表项查找与目标元素页对应的页物理地址;以及将页物理地址与一维偏移值相加,以获得目标元素的物理地址。通过将页物理地址与一维偏移值相加,进一步简化了计算操作,从而进一步节省了转换时间并且相应地提高了张量处理效率。
在一些实施例中,至少一个页中的每个页包括多个缓存行,张量的多个元素在缓存行中的维度布置顺序不同于张量的多个缓存行在至少一个页中的维度布置顺序或不同于至少一个页在第一段中的维度布置顺序,访问单元804进一步用于:使用目标元素的物理地址将包括目标元素的缓存行从存储器读取至一级高速缓存,存储器是二级高速缓存或片外存储设备。通过在同一缓存行内交叉存储多个维上的相邻数据,可以确保张量处理过程中高速缓存的邻近性原理被极大地利用,从而极大地提高了一级高速缓存的命中率并且极大地减少了读取数据的时间,并且相应地减少了张量处理时间。
在一些实施例中,至少一个页中的第一页包括沿第一维度设置的第一缓存行和第二缓存行以及沿第一维度设置的第三缓存行和第四缓存行,第一缓存行和第三缓存行沿与第一维度不同的第二维度设置,并且第二缓存行和第四缓存行沿第二维度设置;第一缓存行包括沿第二维度设置的第一元素和第二元素以及沿第二维度设置的第三元素和第四元素,第一元素和第三元素沿第一维度设置,并且第二元素和第四元素沿第一维度设置;目标元素是第一元素、第二元素、第三元素和第四元素之一。访问单元804进一步用于:至少将第一元素、第二元素、第三元素和第四元素存储至一级高速缓存。
在一些实施例中,至少一个页中的第一页包括第一多个缓存行,第一多个缓存行包括目标元素。至少一个页中的第二页包括第二多个缓存行,第二多个缓存行包括另一目标元素。第二页的尺寸不同于第一页的尺寸,或第二多个缓存行中的每个缓存行的尺寸不同于第一多个缓存行中的每个缓存行的尺寸,或第一多个缓存行的多个元素的维度布置顺序不同于第二多个缓存行的多个元素的维度布置顺序。转换单元802还用于将另一目标元素的逻辑地址与目标元素并行地转换为另一目标元素在存储器中的另一物理地址,另一目标元素的逻辑地址包括段基准数据和另一偏移数据,另一偏移数据表示另一目标元素在第一段的多个维度中的各维上的偏移值。访问单元804还用于使用另一目标元素的物理地址与目标元素并行地访问存储器。由于页、缓存行和元素的尺寸和布置方式都可以灵活设置,这极大地增加了编程人员在处理张量数据时的灵活性,从而可以更有效率地处理张量数据,并且减少处理时间。
在一些实施例中,访问单元804进一步用于使用目标元素的物理地址访问存储器中所存储的张量的一部分。通过将张量划分为多个段,可以良好地适配硬件性能,从而可以基于硬件设置来灵活地处理张量数据。
在一些实施例中,存储器访问装置800还包括未示出的确定单元和生成单元。确定单元用于确定逻辑地址是否超出张量的逻辑边界。生成单元用于如果逻辑地址超出逻辑边界,则生成存储器访问异常信号。通过检测是否越界,可以在处理之前就确保张量处理的正确性。
在一些实施例中,确定单元进一步用于:确定段基准数据与各维的偏移值之和是否超出段的各维的尺寸。通过基于逻辑地址确认是否越界,可以简化检测的步骤,降低计算量和处理时间。
在一些实施例中,存储器访问装置800还包括未示出的留置单元。留置单元用于基于第一段的段属性,将目标元素留置在一级高速缓存中。段属性包括针对目标元素所在的缓存行在一级高速缓存中的替换规则。由于某些元素具有较高的使用频率,因此并不期望频繁地将其移入和移出一级高速缓存,而是期望尽可能地保留在一级高速缓存中。通过设置替换规则,可以有效地提升目标元素被留置在一级高速缓存中的概率,从而减少元素传输时间并且相应地减少张量处理时间。
在一些实施例中,第一段包括多个页,存储器访问装置800还包括建立单元。建立单元用于如果将多个页中的第一页移入片外存储器或片上存储器,则建立与第一页对应的第一页表项,第一页表项存储第一页在存储器中的物理地址。通过在页被移入时建立页表项,一方面可以有效地并且正确地进行寻址,另一方面也可以减少页表装置以及片内高速存储的用量,从而减少成本。
在一些实施例中,第一段包括多个页,存储器访问装置800还包括删除单元。删除单元用于如果将多个页中的第一页从片上存储器或片外存储器移出,则删除与第一页对应的第一页表项,第一页表项存储第一页在存储器中的物理地址。通过及时删除不被使用的页及页表项,可以减少页表装置以及片内高速存储的用量,从而减少成本。
在一些实施例中,第一段包括目标元素页,目标元素页包括目标元素。确定单元进一步用于基于目标元素的逻辑地址,确定目标元素页是否位于存储器中。存储器访问装置800还包括移入单元。移入单元用于如果目标元素页不位于存储器中,则从片外存储器将目标元素页移入存储器。
此外,虽然采用特定次序描绘了各操作,但是这不应当被理解为要求这样的操作以所示出的特定次序或以顺序次序执行,或者要求所有图示的操作均应被执行以取得期望的结果。在一定环境下,多任务和并行处理可能是有利的。同样地,虽然在上面论述中包含了若干具体实现细节,但是这些不应当被解释为对本公开的范围的限制。在单独的实施例的上下文中描述的某些特征还可以组合地实现在单个实现中。相反地,在单个实现的上下文中描述的各种特征也可以单独地或以任何合适的子组合的方式实现在多个实现中。
尽管已经采用特定于结构特征和/或方法逻辑动作的语言描述了本主题,但是应当理解所附权利要求书中所限定的主题未必局限于上面描述的特定特征或动作。相反,上面所描述的特定特征和动作仅仅是实现权利要求书的示例形式。

Claims (20)

  1. 一种用于访问存储器的方法,包括:
    将张量中的第一段中的目标元素的逻辑地址转换为所述目标元素在所述存储器中的物理地址,所述逻辑地址包括所述第一段在所述张量中的段基准数据和偏移数据,所述偏移数据表示所述目标元素在所述第一段的多个维度中的各维上的偏移值;以及
    使用所述目标元素的物理地址访问所述存储器。
  2. 根据权利要求1所述的方法,其中所述第一段包括至少一个页,所述至少一个页包括所述目标元素所在的目标元素页,将所述目标元素的逻辑地址转换为所述物理地址包括:
    至少根据所述目标元素页的各维的尺寸,将所述逻辑地址转换为线性地址,所述线性地址包括所述目标元素页的一维页标识和所述目标元素在所述目标元素页内的一维偏移值;以及
    根据针对所述目标元素页的页表项,将所述线性地址转换为所述物理地址,所述页表项包括所述目标元素页的页物理地址。
  3. 根据权利要求2所述的方法,其中根据针对所述目标元素页的页表项将所述线性地址转换为所述物理地址包括:
    从所述页表项查找与所述目标元素页对应的页物理地址;以及
    将所述页物理地址与所述一维偏移值相加,以获得所述目标元素的物理地址。
  4. 根据权利要求2或3所述的方法,其中所述至少一个页中的每个页包括多个缓存行,所述张量的多个元素在所述缓存行中的维度布置顺序不同于所述张量的多个缓存行在所述至少一个页中的维度布置顺序或不同于所述至少一个页在所述第一段中的维度布置顺序,使用所述目标元素的物理地址访问所述存储器包括:
    使用所述目标元素的物理地址将包括所述目标元素的缓存行从所述存储器读取至一级高速缓存,所述存储器是二级高速缓存或片外存储设备。
  5. 根据权利要求4所述的方法,其中所述至少一个页中的第一页包括沿第一维度设置的第一缓存行和第二缓存行以及沿所述第一维度设置的第三缓存行和第四缓存行,所述第一缓存行和所述第三缓存行沿与所述第一维度不同的第二维度设置,并且所述第二缓存行和所述第四缓存行沿所述第二维度设置;
    所述第一缓存行包括沿所述第二维度设置的第一元素和第二元素以及沿所述第二维度设置的第三元素和第四元素,所述第一元素和所述第三元素沿所述第一维度设置,并且所述第二元素和所述第四元素沿所述第一维度设置;
    所述目标元素是所述第一元素、所述第二元素、所述第三元素和所述第四元素之一;
    使用所述目标元素的物理地址从所述存储器读取包括所述目标元素的缓存行至一级高速缓存包括:
    至少将所述第一元素、所述第二元素、所述第三元素和所述第四元素存储至所述一级高速缓存。
  6. 根据权利要求2或3所述的方法,其中
    所述至少一个页中的第一页包括第一多个缓存行,所述第一多个缓存行包括所述目标元素;
    所述至少一个页中的第二页包括第二多个缓存行,所述第二多个缓存行包括另一目标元素;
    所述第二页的尺寸不同于所述第一页的尺寸,或所述第二多个缓存行中的每个缓存行的尺寸不同于所述第一多个缓存行中的每个缓存行的尺寸,或所述第一多个缓存行的多个元素的维度布置顺序不同于所述第二多个缓存行的多个元素的维度布置顺序;以及
    所述方法还包括:
    将所述另一目标元素的逻辑地址与所述目标元素并行地转换为所述另一目标元素在所述存储器中的另一物理地址,所述另一目标元素的逻辑地址包括所述段基准数据和另一偏移数据,所述另一偏移数据表示所述另一目标元素在所述第一段的多个维度中的各维上的偏移值;以及
    使用所述另一目标元素的物理地址与所述目标元素并行地访问所述存储器。
  7. 根据权利要求1-3中任一项所述的方法,其中使用所述目标元素的物理地址访问所述存储器包括:
    使用所述目标元素的物理地址访问所述存储器中所存储的所述张量的一部分。
  8. 根据权利要求1-3中任一项所述的方法,还包括:
    确定所述逻辑地址是否超出所述张量的逻辑边界;以及
    如果所述逻辑地址超出所述逻辑边界,则生成存储器访问异常信号。
  9. 根据权利要求8所述的方法,其中确定所述逻辑地址是否超出所述张量的逻辑边界包括:
    确定所述段基准数据和所述偏移数据之和是否超出所述第一段的逻辑边界。
  10. 根据权利要求1-3中任一项所述的方法,还包括:
    基于所述第一段的段属性,将所述目标元素留置在一级高速缓存中,所述段属性包括针对所述目标元素所在的缓存行在所述一级高速缓存中的替换规则。
  11. 根据权利要求1-3中任一项所述的方法,其中所述第一段包括多个页,所述存储器是片上存储器,所述方法还包括:
    如果将所述多个页中的第一页移入所述存储器或片外存储器,则建立与所述第一页对应的第一页表项,所述第一页表项存储所述第一页在所述存储器或所述片外存储器中的物理地址。
  12. 根据权利要求1-3中任一项所述的方法,其中所述第一段包括多个页,所述存储器是片上存储器,所述方法还包括:
    如果将所述多个页中的第一页从所述存储器或片外存储器移出,则删除与所述第一页对应的第一页表项,所述第一页表项存储所述第一页在所述存储器或所述片外存储器中的物理地址。
  13. 根据权利要求1-3中任一项所述的方法,其中所述第一段包括目标元素页,所述目标元素页包括所述目标元素,所述方法还包括:
    基于所述目标元素的逻辑地址,确定所述目标元素页是否位于所述存储器中;以及
    如果所述目标元素页不位于所述存储器中,则从片外存储器将所述目标元素页移入所述存储器。
  14. 一种计算机可读存储介质,存储多个程序,所述多个程序被配置为一个或多个处理引擎执行,所述多个程序包括用于执行权利要求1-13中任一项所述的方法的指令。
  15. 一种计算机程序产品,所述计算机程序产品包括多个程序,所述多个程序被配置为一个或多个处理引擎执行,所述多个程序包括用于执行权利要求1-13中任一项所述的方法的指令。
  16. 一种电子设备,包括:
    流处理器;
    页表装置,耦合至所述流处理器;
    存储器;
    处理引擎单元,耦合至所述流处理器、所述存储器和所述页表装置,被配置为执行权利要求1-13中任一项所述的方法。
  17. 一种存储器访问装置,包括:
    转换单元,用于将张量中的第一段中的目标元素的逻辑地址转换为所述目标元素在所述存储器中的物理地址,所述逻辑地址包括所述第一段在所述张量中的段基准数据和偏移数据,所述偏移数据表示所述目标元素在所述第一段的多个维度中的各维上的偏移量;以及
    访问单元,用于使用所述目标元素的物理地址访问所述存储器。
  18. 根据权利要求17所述的存储器访问装置,其中所述第一段包括至少一个页,所述至少一个页包括所述目标元素所在的目标元素页,所述转换单元进一步用于:
    至少根据所述目标元素页的各维的尺寸,将所述逻辑地址转换为线性地址,所述线性地址包括所述目标元素页的一维页标识和所述目标元素在所述目标元素页内的一维偏移值;以及
    根据针对所述目标元素页的页表项,将所述线性地址转换为所述物理地址,所述页表项包括所述目标元素页的页物理地址。
  19. 根据权利要求18所述的存储器访问装置,其中所述转换单元进一步用于:
    从所述页表项查找与所述目标元素页对应的页物理地址;以及
    将所述页物理地址与所述一维偏移值相加,以获得所述目标元素的物理地址。
  20. 根据权利要求18或19所述的存储器访问装置,其中所述至少一个页中的每个页包括多个缓存行,所述张量的多个元素在所述缓存行中的维度布置顺序不同于所述张量的多个缓存行在所述至少一个页中的维度布置顺序或不同于所述至少一个页在所述第一段中的维度布置顺序,所述访问单元进一步用于:
    使用所述目标元素的物理地址将包括所述目标元素的缓存行从所述存储器读取至一级高速缓存,所述存储器是二级高速缓存或片外存储器。