WO2023142403A1 - Method and electronic device for determining an out-of-bounds state of a tensor element - Google Patents

Method and electronic device for determining an out-of-bounds state of a tensor element

Info

Publication number
WO2023142403A1
Authority
WO
WIPO (PCT)
Prior art keywords
tensor
segment
offset
bounds
elements
Prior art date
Application number
PCT/CN2022/107337
Other languages
English (en)
French (fr)
Inventor
杨经纬
葛建明
谢钢锋
许飞翔
彭永超
袁红岗
仇小钢
Original Assignee
海飞科(南京)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 海飞科(南京)信息技术有限公司
Publication of WO2023142403A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 - Arrangements for executing specific machine instructions
    • G06F9/3004 - Arrangements for executing specific machine instructions to perform operations on memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 - Register arrangements
    • G06F9/3012 - Organisation of register space, e.g. banked or distributed register file
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34 - Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes

Definitions

  • Embodiments of the present disclosure generally relate to the field of electronics, and more particularly, to a method and an electronic device for determining an out-of-bounds state of a tensor element.
  • graphics processing units (GPUs)
  • Tensor elements usually represent one-dimensional or multi-dimensional array data in the computer field.
  • grayscale image data is a conventional two-dimensional tensor element, which can be represented by a two-dimensional array.
  • Conventional tensor elements are usually stored in the form of a one-dimensional array in the memory, so when designing the program, the programmer needs to consider the correct address of the tensor element in the one-dimensional array when the program loads the tensor element.
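  • As a minimal sketch of the address calculation a programmer has to get right, the following maps a multi-dimensional coordinate to its position in a one-dimensional array. The row-major layout and the name `flat_index` are assumptions made for illustration; the source does not specify a storage order.

```python
def flat_index(coords, shape):
    """Row-major position of a multi-dimensional coordinate in a 1-D array.

    Assumption: row-major (last dimension contiguous) layout.
    No bounds checking is done here, on purpose.
    """
    if len(coords) != len(shape):
        raise ValueError("coords and shape must have the same rank")
    index = 0
    for c, n in zip(coords, shape):
        index = index * n + c
    return index


# Pixel at row 2, column 3 of an 8 x 8 grayscale image.
position = flat_index((2, 3), (8, 8))

# A coordinate that is outside the image in the second dimension still
# yields a valid-looking position in the 1-D array.
aliased = flat_index((0, 9), (8, 8))
```

  • Note that the coordinate (0, 9) lies outside an 8 x 8 tensor yet maps to a position inside the one-dimensional array, so a check on the flat address alone cannot reveal that a tensor boundary was crossed.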
  • conventional addressing methods cannot provide information about whether an access to a tensor element is outside the bounds of the tensor. For example, in some conventional schemes, for a read operation, if an address out of bounds is detected, a value of 0 or other preset default values is returned directly, and for a write operation, if an address out of bounds is detected, the write operation is discarded.
  • Embodiments of the present disclosure provide a method and electronic apparatus for determining an out-of-bounds state of a tensor element.
  • a method for determining an out-of-bounds state of a tensor element includes determining a first set of offsets of the first tensor element in a plurality of dimensions of the tensor segment based on an out-of-bounds query instruction for the first tensor element in the tensor segment.
  • the method also includes determining a set of element numbers for the tensor segment in each of the plurality of dimensions based on the segment attributes of the tensor segment.
  • the method also includes determining a first out-of-bounds status indication for the first tensor element based on the first set of offsets and the set of element numbers.
  • the offset corresponding to the logical address in the tensor segment can be compared with the range of the tensor segment. Once the offset on any dimension exceeds the range of the tensor segment on this dimension, it can be determined that the logical address is out of bounds, that is, beyond the boundary of the tensor segment.
  • an out-of-bounds status indication, such as a specific value, can be obtained from a circuit unit such as a predicate register. This makes subsequent program operations, such as program debugging or dynamic detection, more convenient.
  • because the out-of-bounds query instruction does not return stored data from the memory or write data to it, but only gives an indication of an out-of-bounds state (such as a logic 1 or 0), it avoids the time delay and large operating power consumption of reading data from or writing data to the memory.
  • the out-of-bounds query instruction includes a logical address representation of the first tensor element, and the logical address representation includes a segment base address register representation, an offset register representation, and an immediate value.
  • Determining a first set of offsets for the first tensor element in the tensor segment includes computing the first set of offsets based on the segment base register representation, the offset register representation, and the immediate value.
  • the method further includes: determining the segment base address register representation based on the mode representation.
  • with the mode representation, the offset of the logical address within the tensor segment can be determined for different modes, such as long mode or default mode (non-long mode). In this way, the out-of-bounds determination scheme is compatible with different modes and can adapt to more tensor situations.
  • the method further includes determining a dimension corresponding to the immediate value based on the out-of-bounds query instruction.
  • calculating the first offset includes: according to the dimension of the tensor, adding the segment base address register representation and the offset register representation within the range of the segment base address on the corresponding dimension to obtain the first offset.
  • the method further includes, on a dimension that does not correspond to the immediate value, adding the segment base address register representation and the offset register representation within the range of the segment base address on the corresponding dimension to obtain the first offset.
  • the first offset can be obtained in a simple manner by adding the representation of the segment base address register and the representation of the offset register of the corresponding dimension, thereby reducing calculation cost and power consumption.
  • calculating the first offset includes: according to the dimension of the tensor, adding the segment base address register representation, the offset register representation, and the immediate value within the range of the segment base address on the corresponding dimension to obtain the first offset.
  • the method further includes, on the dimension corresponding to the immediate value, adding the segment base address register representation, the offset register representation, and the immediate value within the range of the segment base address on the corresponding dimension to obtain the first offset.
  • the first offset can be obtained in a simple manner by adding the representation of the segment base address register, the representation of the offset register and the immediate value of the corresponding dimension, thereby reducing calculation cost and power consumption.
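  • The per-dimension addition described above can be sketched as follows. This is an illustration under stated assumptions: the segment base address register and the offset register are modelled as Python lists holding one value per dimension (in hardware they would presumably be fields of a register), and the function and parameter names are hypothetical.

```python
def compute_offsets(base, reg_offset, immediate, imm_dim):
    """Return the offset of an element in every dimension.

    base, reg_offset: per-dimension values (assumed list model of the
        segment base address register and the offset register).
    immediate: the immediate value of the query instruction.
    imm_dim: index of the dimension the immediate value applies to.
    """
    if len(base) != len(reg_offset):
        raise ValueError("base and reg_offset must have the same rank")
    offsets = []
    for dim, (b, r) in enumerate(zip(base, reg_offset)):
        value = b + r
        if dim == imm_dim:
            # Only the dimension that corresponds to the immediate adds it.
            value += immediate
        offsets.append(value)
    return offsets


offsets = compute_offsets(base=[0, 4], reg_offset=[3, 2], immediate=5, imm_dim=1)
```

  • Each dimension costs at most two additions, which is consistent with the low calculation cost the text attributes to this scheme.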
  • determining the set of element numbers of the tensor segment in each of the multiple dimensions based on the segment attribute of the tensor segment includes: based on the segment attribute, determining the number of pages of the tensor segment in each dimension and the number of elements per page in each dimension; and based on the number of pages and the number of elements, determining the set of element numbers.
  • the set of element numbers may include at least one element number in at least one dimension.
  • tensor segments include pages, and pages include elements.
  • Tensor segments may have a first plurality of dimensions
  • pages have a second plurality of dimensions
  • the number of dimensions of the second plurality of dimensions is not greater than the number of dimensions of the first plurality of dimensions.
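  • One plausible way to derive the set of element numbers from the segment attributes is to multiply, in each dimension, the number of pages by the number of elements per page. The multiplication, the treatment of missing page dimensions as size 1, and all names below are assumptions; the source only states that the set is determined from these two quantities.

```python
def element_counts(pages_per_dim, elems_per_page):
    """Number of elements of a tensor segment in each dimension.

    pages_per_dim: number of pages of the segment in each dimension.
    elems_per_page: number of elements of one page in each dimension.
        A page may have fewer dimensions than the segment; the missing
        trailing dimensions are assumed to have size 1.
    """
    if len(elems_per_page) > len(pages_per_dim):
        raise ValueError("a page cannot have more dimensions than its segment")
    missing = len(pages_per_dim) - len(elems_per_page)
    padded = list(elems_per_page) + [1] * missing
    return [p * e for p, e in zip(pages_per_dim, padded)]


# A 2-D segment of 2 x 2 pages, each page holding 4 x 4 elements.
counts = element_counts([2, 2], [4, 4])
```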
  • determining the first out-of-bounds status indication for the first tensor element based on the first set of offsets and the set of element numbers includes: determining whether each offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of element numbers; and if at least one offset in the first set of offsets exceeds the number of elements in the corresponding dimension, setting the first out-of-bounds status indication for the first tensor element to a first value.
  • determining the first out-of-bounds status indication for the first tensor element based on the first set of offsets and the set of element numbers includes: determining whether each offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of element numbers; and if no offset in the first set of offsets exceeds the number of elements in the corresponding dimension, setting the first out-of-bounds status indication for the first tensor element to a second value.
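  • The comparison step can be sketched as below. The encoding of the first value as 1 and the second value as 0, the zero-based offsets (so that an offset equal to the element count is already out of bounds), and the rejection of negative offsets are assumptions made for illustration.

```python
OUT_OF_BOUNDS = 1  # assumed encoding of the "first value"
IN_BOUNDS = 0      # assumed encoding of the "second value"


def out_of_bounds_status(offsets, counts):
    """Compare each offset with the element count of its dimension.

    Returns OUT_OF_BOUNDS as soon as one dimension is exceeded,
    otherwise IN_BOUNDS. Offsets are assumed to be zero-based.
    """
    if len(offsets) != len(counts):
        raise ValueError("offsets and counts must have the same rank")
    for off, n in zip(offsets, counts):
        if off < 0 or off >= n:
            return OUT_OF_BOUNDS
    return IN_BOUNDS


status = out_of_bounds_status([3, 11], [8, 8])
```

  • The function touches only offsets and counts and never the tensor data itself, mirroring the point that the query returns a status rather than stored data.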
  • a device such as a predicate register
  • a computer readable storage medium stores a plurality of programs configured to be executed by one or more processing engines, and the plurality of programs include instructions for executing the method of the first aspect.
  • a computer program product comprises a plurality of programs configured for execution by one or more processing engines, the plurality of programs comprising instructions for performing the method of the first aspect.
  • in a fourth aspect of the present disclosure, an electronic device is provided. The electronic device includes: a stream processor; a page table device coupled to the stream processor; a memory; and a processing engine unit coupled to the stream processor, the memory and the page table device, and configured to execute the method of the first aspect.
  • in a fifth aspect of the present disclosure, an electronic device is provided. The electronic device includes: an offset set determining unit configured to determine a first set of offsets of the first tensor element in multiple dimensions of the tensor segment based on an out-of-bounds query instruction for the first tensor element in the tensor segment.
  • the electronic device further includes: an element number set determining unit configured to determine a set of element numbers of the tensor segment in each of the plurality of dimensions based on the segment attribute of the tensor segment.
  • the electronic device includes: an out-of-bounds status indication determining unit configured to determine a first out-of-bounds status indication for a first tensor element based on a first set of offsets and a set of element numbers.
  • the offset corresponding to the logical address in the tensor segment can be compared with the range of the tensor segment. Once the offset on any dimension exceeds the range of the tensor segment on this dimension, it can be determined that the logical address is out of bounds, that is, beyond the boundary of the tensor segment.
  • an out-of-bounds status indication, such as a specific value, can be obtained from a circuit unit such as a predicate register. This makes subsequent program operations, such as program debugging or dynamic detection, more convenient.
  • because the out-of-bounds query instruction does not return stored data from the memory or write data to it, but only gives an indication of an out-of-bounds state (such as a logic 1 or 0), it avoids the time delay and large operating power consumption of reading data from or writing data to the memory.
  • the out-of-bounds query instruction includes a logical address representation of the first tensor element, and the logical address representation includes a segment base address register representation, an offset register representation, and an immediate value; the offset set determining unit is further configured to: calculate the first set of offsets based on the segment base address register representation, the offset register representation and the immediate value.
  • the electronic device further includes: a segment base address register representation determination unit configured to determine the segment base address register representation based on the mode representation.
  • the method further includes determining the dimension corresponding to the immediate value based on the out-of-bounds query instruction.
  • the offset determination unit is further configured to: according to the dimension of the tensor, add the segment base address register representation, the offset register representation and the immediate value within the range of the segment base address on the corresponding dimension to obtain the first offset.
  • the offset determining unit is further configured to: on a dimension that does not correspond to the immediate value, add the segment base address register representation and the offset register representation within the range of the segment base address on the corresponding dimension to obtain the first offset.
  • the first offset can be obtained in a simple manner by adding the representation of the segment base address register and the representation of the offset register of the corresponding dimension, thereby reducing calculation cost and power consumption.
  • the element number set determining unit is further configured to: determine, based on the segment attribute, the number of pages of the tensor segment in each dimension and the number of elements of each page in each dimension; and determine the set of element numbers based on the number of pages and the number of elements.
  • the offset determination unit is further configured to: on the dimension corresponding to the immediate value, add the segment base address register representation, the offset register representation and the immediate value within the range of the segment base address on the corresponding dimension to obtain the first offset. In different dimensions, the first offset can be obtained in a simple manner by adding the representation of the segment base address register, the representation of the offset register and the immediate value of the corresponding dimension, thereby reducing calculation cost and power consumption.
  • determining the set of element numbers of the tensor segment in each of the multiple dimensions based on the segment attribute of the tensor segment includes: based on the segment attribute, determining the number of pages of the tensor segment in each dimension and the number of elements per page in each dimension; and based on the number of pages and the number of elements, determining the set of element numbers.
  • the set of element numbers may include at least one element number in at least one dimension.
  • tensor segments include pages, and pages include elements.
  • Tensor segments may have a first plurality of dimensions
  • pages have a second plurality of dimensions
  • the number of dimensions of the second plurality of dimensions is not greater than the number of dimensions of the first plurality of dimensions.
  • the out-of-bounds state indication determining unit is further configured to: determine whether each offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of element numbers; and if at least one offset in the first set of offsets exceeds the number of elements in the corresponding dimension, set the first out-of-bounds status indication for the first tensor element to a first value.
  • the out-of-bounds state indication determining unit is further configured to: determine whether each offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of element numbers; and if no offset in the first set of offsets exceeds the number of elements in the corresponding dimension, set the first out-of-bounds status indication for the first tensor element to a second value.
  • a device such as a predicate register
  • programmers can obtain an out-of-bounds status indication, such as a specific value, from a circuit unit such as a predicate register through an out-of-bounds query instruction. For read operations, a preset default value can also be returned. This makes subsequent program operations, such as program debugging or dynamic detection, more convenient.
  • because the out-of-bounds query instruction does not return stored data from the memory or write data to it, but only gives an indication of an out-of-bounds state (such as a logic 1 or 0), it avoids the time delay and large operating power consumption of reading data from or writing data to the memory.
  • Figure 1 shows a schematic diagram of an example environment in which various embodiments of the present disclosure can be implemented
  • Fig. 2 shows a schematic block diagram of a chip according to an embodiment of the present disclosure
  • Fig. 3 shows a schematic block diagram of a three-dimensional tensor according to an embodiment of the present disclosure
  • Fig. 4 shows a schematic diagram of page allocation of image data according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic flowchart of a method for determining an out-of-bounds state of a tensor element according to an embodiment of the present disclosure
  • FIG. 6 shows a schematic block diagram of an electronic device according to some embodiments of the present disclosure.
  • the term “comprise” and its variants mean open inclusion, i.e., “including but not limited to”.
  • the term “or” means “and/or” unless otherwise stated.
  • the term “based on” means “based at least in part on”.
  • the terms “one example embodiment” and “one embodiment” mean “at least one example embodiment.”
  • the term “another embodiment” means “at least one further embodiment”.
  • the terms “first”, “second”, etc. may refer to different or the same object. Other definitions, both express and implied, may also be included below.
  • by calculating the offset of the logical address in the tensor segment and comparing the offset with the offset range of the tensor segment, it can be determined whether the logical address is out of bounds.
  • by setting the first out-of-bounds status indication to different values in a device such as a predicate register, it is possible to determine whether an access is out of bounds by reading the stored value. This brings convenience to subsequent operations such as program debugging and dynamic detection, and saves time and power, because this kind of out-of-bounds detection does not need to return specific data from the memory, but only needs to return a status value from a register inside the processing engine.
  • reading a state value of a single byte or a small number of bytes from the predicate register inside the processing engine can significantly reduce transmission and processing time, and the processing power consumption is reduced accordingly.
  • FIG. 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented.
  • Example environment 100 may be, for example, an electronic device with computing capabilities, such as a computer.
  • example environment 100 includes, for example, central processing unit (CPU) 20 , system memory 10 , north/memory bridge 30 , accelerator subsystem 40 , device memory 50 , and south/input-output (IO) bridge 60 .
  • System memory 10 may be, for example, a volatile memory such as dynamic random access memory (DRAM).
  • the north bridge/memory bridge 30, for example, integrates a memory controller, a PCIe controller, etc., and is responsible for data exchange between the CPU 20 and the high-speed interface and bridging the CPU 20 and the south bridge/IO bridge 60.
  • the south bridge/IO bridge 60 is used for low-speed interfaces of computers, such as Serial Advanced Technology Attachment (SATA) controllers and the like.
  • the accelerator subsystem 40 may include devices or chips such as a graphics processing unit (GPU) and an artificial intelligence (AI) accelerator for accelerated processing of graphics, video and other data, for example.
  • Device memory 50 may be, for example, a volatile memory such as DRAM that is external to accelerator subsystem 40 .
  • device memory 50 is also referred to as off-chip memory, i.e., memory located outside the chip of accelerator subsystem 40 .
  • the chip of the accelerator subsystem 40 also has volatile memory, such as a first-level (L1) cache and an optional second-level (L2) cache.
  • FIG. 2 shows a schematic block diagram of an accelerator subsystem 200 according to one embodiment of the present disclosure.
  • the accelerator subsystem 200 may be, for example, a specific implementation of the chip of the accelerator subsystem 40 in FIG. 1 .
  • the accelerator subsystem 200 is, for example, an accelerator subsystem chip such as a GPU.
  • the accelerator subsystem 200 includes a stream processor (SP) 210, a page table device 220, a processing engine (PE) unit 230, a direct memory access (DMA) controller 240, an L1 cache 260, and an L2 cache 250.
  • the accelerator subsystem 200 is controlled by a host device such as the CPU 20, and receives instructions from the CPU 20.
  • the SP 210 analyzes instructions from the CPU 20, and assigns the analyzed operations to the PE unit 230, the page table device 220, and the DMA controller 240 for processing.
  • the page table device 220 is used to manage the on-chip virtual storage of the accelerator subsystem 200 .
  • L2 cache 250 and off-chip memory such as device memory 50 in FIG. 1 constitute a virtual storage system.
  • the page table device 220 is jointly maintained by the SP 210, the PE unit 230 and the DMA controller 240.
  • the PE unit 230 includes a plurality of processing engines (PEs) PE_1, PE_2 . . . PE_N, where N represents an integer greater than or equal to 1.
  • Each PE in PE unit 230 may be a single instruction multiple thread (SIMT) device.
  • each thread can have its own register file, and all threads of each PE also share a uniform register file.
  • Multiple PEs can perform the same or different processing tasks in parallel, and can perform address conversion described below and access to target data in memory in parallel, thereby reducing processing time. It can be understood that the target elements processed by multiple PEs are not the same, and the segment, page, cache line, and attribute, size, and dimension order of the target element may be different, as described in detail below.
  • Each thread can perform thread-level data exchange between its own register file and the memory subsystem.
  • Each thread has its own arithmetic logic execution unit and uses its own storage address, which adopts a typical register access architecture (load-store architecture).
  • Each execution unit includes a floating-point/fixed-point unit supporting multiple data types and an arithmetic logic unit.
  • the accelerator subsystem 200 of FIG. 2 may, for example, perform the following operations: 1) construct page table entry content and initial state; 2) move data from the off-chip memory to the on-chip memory, such as the L2 cache 250; 3) start and execute the program; 4) define each segment and describe the properties of the tensor and storage; 5) when the program execution is completed, write the data of the execution result to off-chip memory.
  • the data processed by the accelerator subsystem 200 is mainly for multi-dimensional tensors.
  • the tensor may be a four-dimensional tensor, which has four dimensions D1, D2, D3, and D4, and the tensor may be of different sizes in each dimension.
  • the tensor may be a one-dimensional, two-dimensional, three-dimensional or more dimensional tensor, which is not limited in the present disclosure.
  • tensors can support such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32 and other custom element types, and the present disclosure does not limit this.
  • the basic unit of addressing is the element. For example, if the element type is int8, the basic unit of addressing is one byte. For another example, if the element type is int16, the basic unit of addressing is two bytes, and so on.
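  • Element-based addressing can be illustrated with a small lookup of element sizes. The byte sizes below are the standard widths of the listed types; the table name and the helper `byte_offset` are hypothetical and not part of the disclosure.

```python
# Size in bytes of one element for each listed element type.
ELEMENT_SIZE_BYTES = {
    "uint8": 1, "int8": 1,
    "bfloat16": 2, "float16": 2, "uint16": 2, "int16": 2,
    "float32": 4, "int32": 4, "uint32": 4,
}


def byte_offset(element_index, element_type):
    """Byte distance from the start of a buffer to the given element."""
    return element_index * ELEMENT_SIZE_BYTES[element_type]


int8_offset = byte_offset(10, "int8")
int16_offset = byte_offset(10, "int16")
```

  • Because offsets are counted in elements, the same logical offset of 10 corresponds to different byte distances depending on the element type.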
  • tensors may be divided into at least one segment. In the case where the tensor contains only one segment, the tensor is the segment. Whereas, in the case where the tensor contains multiple segments, the segment is part of the tensor.
  • the CPU 20 can specify which PE processes each part of the segment by an instruction.
  • FIG. 3 shows a schematic block diagram of a three-dimensional tensor 300 according to an embodiment of the present disclosure.
  • the three-dimensional tensor 300 has three dimensions D1, D2, and D3, and includes a first segment S1, a second segment S2, and a third segment S3.
  • CPU 20 may specify that the tensor elements of segment S1 be processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, and PE_8.
  • CPU 20 also specifies that the tensor elements of the second segment S2 are processed by PE_1-PE_4.
  • each segment may have a different size, so programmers can flexibly configure segments based on design needs.
  • page division can be implemented on any one or more dimensions, and the number of pages divided on each dimension is independent of each other.
  • tensor elements may be stored in on-chip high-speed memory, such as L2 cache 250 .
  • the kernel program (kernel) can be started multiple times, and each time the DMA controller 240 moves a segment of the tensor from the off-chip storage to the on-chip storage in advance for kernel operation. After starting the kernel multiple times, all the segments contained in the tensor are processed, and the entire running process ends.
  • if the on-chip high-speed memory is sufficient to accommodate all tensors to be accessed by the kernel, a tensor only needs one segment description, and the kernel only needs to be started once.
  • At least one page may also be set to further subdivide the tensor.
  • in the first segment S1, there are 4 pages P[1], P[2], P[3] and P[4].
  • the second segment S2 has only one page.
  • the number of pages in each segment can be different, so programmers can flexibly configure the size of pages in a segment based on design needs. For example, pages are configured to fit into L2 cache 250 in their entirety.
  • a page can usually contain multiple elements.
  • the page where the target element is located is referred to as a "target element page" herein.
  • a page may include multiple cache lines.
  • It only takes a few clock cycles for the PE to read data from the L1 cache 260 , but it may take dozens or even hundreds of clock cycles for the L1 cache 260 to read data from the L2 cache 250 . Therefore, it is desirable to reduce the number of times the L1 cache 260 reads data from the L2 cache 250 .
  • although a "cache line" is used here to describe the minimum unit of data transferred from the L2 cache 250 to the L1 cache 260, in this disclosure this data is not necessarily arranged in rows or columns. The data in a "cache line" is distributed in multiple dimensions, and the size of the data distributed in each dimension is not limited to 1.
  • PE performs parallel processing on the data in a segment, and the allocation of PE is carried out in the logical address space of the data, which is independent of the physical storage structure of the segment, as described below.
  • the first group of cache lines in the first page P[1] is designated to be processed by PE_1, and the second group of cache lines is designated to be processed by PE_2.
  • the processing of tensor elements is independent of the order of the PEs, and the present disclosure is not limited in this regard.
  • PE_M in Figure 3 indicates that some tensor elements can be processed by PE_M, where M represents any integer not greater than N.
  • FIG. 4 shows a schematic diagram of page allocation of image data 400 according to an embodiment of the present disclosure.
  • Image data is typically a two-dimensional tensor.
  • the image data 400 is, for example, 8*8 pixels.
  • the image data 400 has 8 pixels in the first dimension D1 and also has 8 pixels in the second dimension D2. Accordingly, the image data 400 has pixels P00, P01...P07, P10...P77.
  • the image data 400 has only one segment, but is divided into four pages P[1], P[2], P[3] and P[4] in two dimensions.
  • the four pages can be divided according to the second dimension D2 to be allocated to PE_1 and PE_2 for processing, or can be divided according to the first dimension D1 to be allocated to PE_1 and PE_2 for processing. In addition, it is also possible to divide by diagonal. This disclosure is not limited in this regard.
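  • The page layout of the 8*8 image can be sketched as follows. The source only states that the image is divided into four pages in two dimensions; the equal 4*4 page size, the numbering of pages along D2 first, and the function name `locate` are assumptions for illustration.

```python
PAGE_SHAPE = (4, 4)      # assumed elements per page in (D1, D2)
PAGES_PER_DIM = (2, 2)   # pages in (D1, D2), four pages in total


def locate(d1, d2):
    """Map a pixel coordinate to (page number, offset inside the page).

    Page numbers run from 1 to 4, matching P[1]..P[4]; the numbering
    order (D2 varies fastest) is an assumption.
    """
    page_d1, in_page_d1 = divmod(d1, PAGE_SHAPE[0])
    page_d2, in_page_d2 = divmod(d2, PAGE_SHAPE[1])
    page_number = page_d1 * PAGES_PER_DIM[1] + page_d2 + 1
    return page_number, (in_page_d1, in_page_d2)


page, inner = locate(5, 2)
```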
  • FIG. 5 shows a flowchart of a method 500 for determining an out-of-bounds state of a tensor element according to some embodiments of the present disclosure.
  • the method 500 may be implemented by an accelerator subsystem such as a GPU, so various aspects described above with respect to FIGS. 1-3 may be selectively applied to the method 500 .
  • method 500 includes determining a first set of offsets of a first tensor element in a plurality of dimensions of a tensor segment based on an out-of-bounds query instruction for the first tensor element in the tensor segment.
  • an out-of-bounds query instruction may be, for example,
  • query indicates an out-of-bounds query
  • p0 indicates the dimension where the immediate value is located
  • Dp indicates the out-of-bounds register representation
  • segment_id indicates the segment base address register representation
  • RF_offset represents the offset register representation
  • immediate represents the immediate value representation.
  • a tensor can contain multiple tensor segments. Each tensor segment can be accessed by one or more PEs according to the partition.
  • the starting logical address of PE access is defined by the reference point, which is a logical address in the tensor space, that is, the addressing base address of each PE in the tensor space.
  • the addressing of PEs within a segment is based on reference points, and each PE can have its own reference point. In one embodiment, when the reference point of all PEs is the starting point of the tensor segment, there is no need to specifically define the reference point. The only time you need to explicitly define a reference point for each PE is when a different reference point is required.
  • the accelerator subsystem uses a tensor-oriented storage architecture, and the tensor attributes of each storage segment are specified with the SP command. Tensor attributes are stored in an on-chip private cache for use during program execution. In addition, SP 210 also directly or indirectly defines the storage properties of tensors. Tensor-related attributes defined by SP 210 include, but are not limited to: the tensor segment identification (ID), tensor segment dimension information, tensor segment page information, tensor segment size information, tensor segment reference point information, interleaving information of tensor elements, dimension information supporting long mode during addressing, and the address sign attribute of the segment.
  • the tensor segment ID is unique during a running process and is an important part of the memory access or query address.
  • Segment dimension information is used to indicate the number of dimensions a tensor has, i.e. one-dimensional, two-dimensional, three-dimensional, or four-dimensional.
  • the page information of a tensor segment includes the starting page number and the number of pages in each dimension.
  • the size information of the tensor segment is used to indicate the size of the pages in the tensor segment in each dimension.
  • the reference point information of the tensor segment is used to indicate the logical address of the starting point of each PE in the tensor segment, that is, the logical offset in each dimension.
  • the interleaving information of tensor elements is used to indicate whether tensor elements are interleaved in two dimensions.
  • the dimension information that supports long mode during addressing is used to meet the larger addressing-range requirements of tensors in certain dimensions.
  • the sign attribute of the segment address is used to indicate whether the tensor segment address supports negative numbers.
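The attributes listed above can be pictured as one record per tensor segment. The following is only a sketch: the class name, field names, and validation rules are hypothetical and are not taken from the disclosure, which does not specify an encoding for these attributes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TensorSegmentAttributes:
    """Hypothetical container for the segment attributes described in the text."""
    segment_id: int                     # unique within a running process
    ndim: int                           # one to four dimensions
    start_page: int                     # starting page number
    pages_per_dim: Tuple[int, ...]      # number of pages in each dimension
    page_size_per_dim: Tuple[int, ...]  # elements per page in each dimension
    reference_point: Tuple[int, ...]    # logical offset of a PE's starting point
    interleaved: bool = False           # elements interleaved in two dimensions
    long_mode_dim: Optional[int] = None  # dimension with the wider address field
    signed_addresses: bool = False      # whether negative addresses are supported

    def __post_init__(self):
        if not 1 <= self.ndim <= 4:
            raise ValueError("ndim must be between 1 and 4")
        for name in ("pages_per_dim", "page_size_per_dim", "reference_point"):
            if len(getattr(self, name)) != self.ndim:
                raise ValueError(f"{name} must have one entry per dimension")
```

Grouping the attributes this way makes it explicit that every per-dimension attribute carries one entry per tensor dimension.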
  • access to a given tensor element in the segment can be completed according to the logical address representation segment_id:RF_offset:immediate used as the program address.
  • an out-of-bounds query can be issued together with an access to a logical address. Alternatively, it can also be performed independently.
  • like a memory access instruction, the out-of-bounds query instruction requires the precise logical address representation of the tensor element, segment_id:RF_offset:immediate, as the source operand in order to execute the out-of-bounds query.
  • although each PE is assigned a different logical offset within the segment, threads inside each PE can still access any element in other PEs or even in the entire tensor segment.
  • the base point offset of the PE can be reflected in the base address information of the tensor segment, and the base address information of the tensor segment can be indexed by the tensor segment ID.
  • negative address support can be introduced, that is, to access storage units at negative offset positions relative to the reference point.
  • segment_id and immediate may be negative numbers, for example. Since tensor addresses can have multiple dimensions, such as 4D, negative offsets also support multiple dimensions, such as 4D, and the judgment of storage out-of-bounds also supports multiple dimensions such as 4D.
  • the source operand segment_id:RF_offset:immediate needs to include multiple-dimensional address offset information (such as 3-dimensional or 4-dimensional address offset information), to fit tensors of different dimensions.
  • the number of bits of the registers and of the immediate value in the source operand segment_id:RF_offset:immediate is fixed, so when tensors with different numbers of dimensions are addressed, the width of the bit field allocated to each dimension differs. The more dimensions the tensor has, the fewer bits are allocated to each dimension.
  • the bit fields of the segment base address register and of the RF register may be allocated to the dimensions according to the principle of approximately equal division. In some cases, some dimensions may be assigned relatively wide bit fields because the bit width of the register is not divisible by the number of dimensions. In addition, in some cases a tensor has a different size in each dimension, and allocating approximately equal bit widths to every dimension is not an efficient address encoding scheme. In some embodiments, to balance the complexity of the system design against the efficiency of the address encoding, it is proposed to allow additional bit width to be provided in some dimensions to meet the requirements of a larger addressing range. The dimension that receives the additional bit width is not fixed, and can be chosen flexibly according to the shape of the tensor and the addressing requirements.
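A minimal sketch of the "approximately equal division" idea follows. It assumes a 32-bit register and assumes that any leftover bits go to the lowest-numbered dimensions; the text fixes neither of these details, and the function name is hypothetical.

```python
from typing import List


def allocate_bit_fields(total_bits: int, ndim: int) -> List[int]:
    """Split total_bits across ndim dimensions as evenly as possible."""
    if ndim < 1:
        raise ValueError("ndim must be at least 1")
    base, extra = divmod(total_bits, ndim)
    # Assumption: the leftover bits are given to the lowest-numbered dimensions.
    return [base + 1 if d < extra else base for d in range(ndim)]
```

With 32 bits, a 4D tensor gets four 8-bit fields, while a 3D tensor gets fields of 11, 11, and 10 bits, which illustrates why some dimensions end up with a slightly wider field when the register width is not divisible by the number of dimensions.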
  • for the immediate part of segment_id:RF_offset:immediate, the bit width is relatively limited, and the offsets of all dimensions cannot be expressed effectively at one time.
  • sequential subset encoding of a four-dimensional (4D) model may be used. That is, the offset within the immediate value describes one or two consecutive dimensions of the tensor, and the index of the higher dimension described must not exceed the index of the highest dimension of the tensor.
  • for a one-dimensional (1D) tensor, the immediate value can describe only the offset of the 0th dimension at a time (dimensions are counted from 0 here; the same applies below); for a two-dimensional (2D) tensor, the immediate value can describe the offsets of the 0th and 1st dimensions at a time; for a three-dimensional (3D) tensor, the immediate value can describe at a time either the offsets of the 0th and 1st dimensions or the offsets of the 1st and 2nd dimensions; for a 4D tensor, the immediate value can describe the offsets of the 0th and 1st dimensions, of the 1st and 2nd dimensions, or of the 2nd and 3rd dimensions.
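The enumeration above can be reproduced with a few lines of code. This is an illustrative sketch only; the function name is hypothetical and dimensions are numbered from 0 as in the text.

```python
def immediate_dimension_choices(ndim):
    """Return the dimension groups one immediate value may describe."""
    if not 1 <= ndim <= 4:
        raise ValueError("ndim must be between 1 and 4")
    if ndim == 1:
        return [(0,)]  # a 1D tensor only has dimension 0
    # Otherwise: every pair of consecutive dimensions (d, d + 1).
    return [(d, d + 1) for d in range(ndim - 1)]
```

For a 4D tensor this yields the three pairs (0, 1), (1, 2), and (2, 3), matching the cases listed in the text.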
  • ur0 is a specific example of segment_id
  • r0 is a specific example of RF_offset
  • 0x55aa is a specific example of immediate.
  • the base address inside the ur0 register includes ur_addr3[7:0], ur_addr2[7:0], ur_addr1[7:0], and ur_addr0[7:0], where ur_addr3[7:0] indicates the base address of tensor element A in the D4 dimension, ur_addr2[7:0] indicates the base address of tensor element A in the D3 dimension, ur_addr1[7:0] indicates the base address of tensor element A in the D2 dimension, and ur_addr0[7:0] indicates the base address of tensor element A in the D1 dimension.
  • r0 may include internal offset addresses, such as r_off3[7:0], r_off2[7:0], r_off1[7:0], and r_off0[7:0].
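As an illustration of how four 8-bit fields could share one register, the sketch below unpacks a 32-bit value. The packing order (field 0 in the least significant byte) and the function name are assumptions; the text names the fields but does not state their bit positions within ur0 or r0.

```python
def unpack_fields(reg, width=8, count=4):
    """Split a register value into `count` fields of `width` bits each.

    Assumption: field 0 occupies the least significant bits.
    """
    mask = (1 << width) - 1
    return [(reg >> (i * width)) & mask for i in range(count)]
```

Under this assumption, `unpack_fields(0x04030201)` gives the per-dimension values for fields 0 through 3 in order.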
  • the method 500 further includes: determining the segment base address register representation based on the mode representation. If long mode is enabled for the address attribute of a certain dimension, taking the base address of tensor element A in the D0 dimension as an example, that base address occupies 12 bits, so the effective base address in the D0 dimension is ur_addr0[11:0], while the base addresses of the other dimensions are still ur_addr1[7:0], ur_addr2[7:0], and ur_addr3[7:0].
  • the effective base address extracted is ur_addr1[11:0]
  • the base addresses of other dimensions are ur_addr0[7:0], ur_addr2[7:0] and ur_addr3[7:0], and so on.
  • the bit width of the address field allocated to each dimension is 12 bits; in long mode all 12 bits are valid, while in non-long mode the upper 4 of the 12 bits are 0 and only the lower 8 bits are used.
  • although 4D is used here as an example to illustrate the difference between long mode and non-long mode, this is only an example; for tensors with other numbers of dimensions, the bit width of each dimension and the bit widths selected in long mode and non-long mode can be set as needed.
  • the address fields allocated to each dimension are 16 bits wide; in long mode all 16 bits are valid, while in non-long mode the upper 8 of the 16 bits are 0 and only the lower 8 bits are used.
  • by using the mode representation, the offset of the logical address within the tensor segment can be determined for different modes, such as long mode or default mode (non-long mode). In this way, out-of-bounds determination schemes for different modes are supported, so that more tensor cases can be accommodated.
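The long-mode rule for the 12-bit example can be sketched as a masking step. The function names are hypothetical, and the sketch assumes that at most one dimension is in long mode at a time, which is how the examples in the text read.

```python
def effective_base(field12, long_mode):
    """Return the effective base address taken from a 12-bit field."""
    field12 &= 0xFFF  # keep the 12 bits allocated to the dimension
    # Long mode uses all 12 bits; non-long mode uses only the lower 8 bits.
    return field12 if long_mode else field12 & 0xFF


def effective_bases(fields, long_dim=None):
    """Apply the rule per dimension; long_dim is the dimension in long mode, if any."""
    return [effective_base(f, d == long_dim) for d, f in enumerate(fields)]
```

For example, with long mode on dimension 0 only that dimension keeps its upper four bits, and every other dimension is truncated to 8 bits.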
  • the method further includes determining a dimension corresponding to the immediate value based on the out-of-bounds query instruction.
  • p0 represents the dimension corresponding to the specified immediate value.
  • immediate can be 16-bit data. According to the suffix field "p0" in the instruction, it can be determined that immediate[7:0] indicates the offset of tensor element A in the D1 dimension, and immediate[15:8] indicates the offset of tensor element A in the D2 dimension.
  • immediate[7:0] is the offset of tensor element A in the D2 dimension
  • immediate[15:8] is the offset of tensor element A in the D3 dimension, and so on.
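A small sketch of how the suffix could steer the two bytes of the immediate value to dimensions follows. It numbers dimensions from 0 (so index 0 corresponds to D1), treats both bytes as unsigned, and uses a hypothetical function name; negative offsets, which the text also mentions, are deliberately left out.

```python
def split_immediate(immediate, p, ndim=4):
    """Place immediate[7:0] on dimension p and immediate[15:8] on dimension p + 1."""
    if not 0 <= p < ndim:
        raise ValueError("p must name an existing dimension")
    offsets = [0] * ndim
    offsets[p] = immediate & 0xFF
    if p + 1 < ndim:
        offsets[p + 1] = (immediate >> 8) & 0xFF
    return offsets
```

With the immediate value 0x55aa used as the example in the text and suffix p0, the low byte 0xaa lands on the first dimension and the high byte 0x55 on the second.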
  • the method 500 further includes: according to the dimensions of the tensor, adding the segment base address register representation and the offset register representation within the segment base address range of the corresponding dimension to obtain the first offset. In an embodiment of the present disclosure, the method further includes, in a dimension that does not correspond to the immediate value, adding the segment base address register representation and the offset register representation within the segment base address range of that dimension to obtain the first offset.
  • the immediate value immediate[7:0] represents the offset of the tensor element A in the D1 dimension
  • immediate[15:8] indicates the offset of tensor element A in the D2 dimension.
  • the logical address of tensor element A can be expressed as:
  • addr_A = (ur_addr3[7:0]+r_off3[7:0], ur_addr2[7:0]+r_off2[7:0], ur_addr1[7:0]+r_off1[7:0]+immediate[15:8], ur_addr0[7:0]+r_off0[7:0]+immediate[7:0])
  • the logical address of tensor element A can be expressed as:
  • addr_A = (ur_addr3[7:0]+r_off3[7:0], ur_addr2[7:0]+r_off2[7:0], ur_addr1[7:0]+r_off1[7:0]+immediate[15:8], ur_addr0[11:0]+r_off0[7:0]+immediate[7:0])
  • the logical address of tensor element A can be expressed as:
  • addr_A = (ur_addr3[7:0]+r_off3[7:0], ur_addr2[7:0]+r_off2[7:0], ur_addr1[11:0]+r_off1[7:0]+immediate[15:8], ur_addr0[7:0]+r_off0[7:0]+immediate[7:0])
  • the first offset can be obtained in a simple manner by adding the representation of the segment base address register and the representation of the offset register of the corresponding dimension, thereby reducing calculation cost and power consumption.
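The per-dimension sums in the addr_A expressions can be sketched as follows. The function name is hypothetical, and the sequences are indexed from the lowest dimension upward (index 0 is the dimension whose fields are named ur_addr0 and r_off0), whereas the expressions in the text list the highest dimension first. Dimensions not covered by the immediate value simply receive an immediate contribution of 0.

```python
def logical_address(base, rf_offset, imm_offsets):
    """Add base, register offset, and immediate offset in every dimension."""
    if not (len(base) == len(rf_offset) == len(imm_offsets)):
        raise ValueError("all inputs need one entry per dimension")
    return tuple(b + r + i for b, r, i in zip(base, rf_offset, imm_offsets))
```

For instance, with the immediate 0x55aa applied to the two lowest dimensions, `imm_offsets` would be `(0xAA, 0x55, 0, 0)` and the two higher dimensions reduce to base plus register offset.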
  • method 500 also includes determining a set of number of elements of the tensor segment in each of the plurality of dimensions based on the segment attribute of the tensor segment.
  • the method 500 may include: determining, based on the segment attributes, the number of pages of the tensor segment in each dimension and the number of elements of each page in each dimension; and determining the set of element numbers based on the number of pages and the number of elements.
  • the set of element numbers may include at least one element number in at least one dimension. In other words, the set can include the number of elements in each dimension.
  • tensor segments include pages, and pages include elements. Tensor segments may have a first plurality of dimensions, pages have a second plurality of dimensions, and the number of dimensions of the second plurality of dimensions is not greater than the number of dimensions of the first plurality of dimensions.
  • the number of elements in each dimension in the tensor segment can be calculated according to the definition information of the tensor segment, the dimension of the tensor segment, the number of pages in each dimension, and the number of elements in each dimension in the page.
  • the number of tensor elements in each dimension is elem_per_dim = (N[1]*P[1], N[2]*P[2], N[3]*P[3], N[4]*P[4]).
  • although the calculation of the number of elements is shown here for the 4D case, this is only for illustration and does not limit the scope of the present disclosure. Calculation methods for other numbers of dimensions are not repeated here.
  • by calculating the number of pages of the tensor segment in each dimension and the number of elements of a page in each dimension, the number of elements in that dimension, that is, the maximum possible offset range of tensor elements in that dimension, can be determined, for example by multiplication. In this way, the set of element numbers can be determined in a simple manner.
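The multiplication described above can be written directly. In this sketch the parameter names `pages_per_dim` and `elems_per_page` are chosen for illustration and stand for the two per-dimension quantities that are multiplied in the elem_per_dim formula.

```python
def elements_per_dimension(pages_per_dim, elems_per_page):
    """Multiply page count by page size in every dimension."""
    if len(pages_per_dim) != len(elems_per_page):
        raise ValueError("both inputs need one entry per dimension")
    return tuple(n * p for n, p in zip(pages_per_dim, elems_per_page))
```

A segment with 2 pages of 4 elements in each of two dimensions therefore spans 8 elements in each dimension.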
  • method 500 also includes determining a first out-of-bounds status indication for the first tensor element based on the first set of offsets and the set of element numbers.
  • determining the first out-of-bounds status indication for the first tensor element based on the first set of offsets and the set of element numbers includes: determining whether each offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of element numbers; and if at least one offset in the first set of offsets exceeds the number of elements in the corresponding dimension, setting the first out-of-bounds status indication for the first tensor element to a first value.
  • if no offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of element numbers, the first out-of-bounds status indication for the first tensor element is set to a second value different from the first value.
  • if any one of the above-mentioned conditions a)-d) is true, it may be determined that the tensor element is out of bounds, and the out-of-bounds status indication is set to a first logical value, such as "0". If none of the conditions a)-d) is satisfied, it may be determined that the tensor element is not out of bounds, and the out-of-bounds status indication is set to a second logical value, such as "1".
  • although the out-of-bounds determination is shown here for the 4D case, this is only for illustration and does not limit the scope of the present disclosure.
  • the above-mentioned method is applicable to other numbers of dimensions, to long mode in other dimensions, or to immediate values corresponding to other dimensions, and the present disclosure does not limit this.
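The comparison step can be sketched as below. Conditions a)-d) are not reproduced in this text, so the sketch assumes that an offset is in bounds exactly when it lies in the half-open range from 0 up to the element count of its dimension. The flag values follow the example in the text ("0" for out of bounds, "1" otherwise); the constant and function names are hypothetical.

```python
OUT_OF_BOUNDS = 0  # example value used in the text for an out-of-bounds element
IN_BOUNDS = 1      # example value used in the text for an in-bounds element


def out_of_bounds_flag(offsets, elem_per_dim):
    """Return the status flag for one tensor element."""
    if len(offsets) != len(elem_per_dim):
        raise ValueError("need one offset per dimension")
    for off, count in zip(offsets, elem_per_dim):
        # Assumption: valid offsets satisfy 0 <= off < count in every dimension.
        if off < 0 or off >= count:
            return OUT_OF_BOUNDS
    return IN_BOUNDS
```

A single violating dimension is enough to flag the element, which mirrors the "at least one offset" wording of the method.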
  • a target predicate register may be provided to store a flag indicating whether the access address is out of bounds.
  • the designated register Dp is used to store a flag indicating whether the access address is out of bounds.
  • each thread has a dedicated predicate register to indicate whether to perform special processing in the next step, such as program debugging or dynamic detection. Since the predicate register is located inside the thread, when subsequent instructions are executed, the predicate register can provide faster and more convenient out-of-bounds lookup results than when stored in a memory outside the PE. In addition, since the out-of-bounds indication is not multi-bit data but only a single-bit logic value, and there is no need to transmit between the PE and a memory outside the PE, the resulting power consumption can also be significantly reduced.
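To illustrate how a per-thread predicate bit could guard a later access, the sketch below simulates several threads that each query a 2D coordinate and then load only if their own predicate allows it. Everything here is an assumption made for illustration: the function name, the row-major flat list standing in for memory, and the choice of 1 for "in bounds".

```python
def query_then_load(data, shape, thread_coords):
    """Simulate a per-thread bounds query followed by a predicated load."""
    rows, cols = shape
    # One single-bit predicate per thread, as if held in a predicate register.
    predicates = [
        1 if 0 <= r < rows and 0 <= c < cols else 0
        for r, c in thread_coords
    ]
    # Threads whose predicate is 0 skip the load entirely.
    results = [
        data[r * cols + c] if p else None
        for p, (r, c) in zip(predicates, thread_coords)
    ]
    return predicates, results
```

Because the predicate is separate from the loaded value, a stored value of 0 can no longer be confused with an out-of-bounds access.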
  • FIG. 6 shows a schematic block diagram of an electronic device 600 according to some embodiments of the present disclosure.
  • the electronic device 600 may be implemented as or included in the accelerator subsystem 200 of FIG. 2.
  • the electronic device 600 may include a plurality of modules for performing corresponding steps in the method 500 as discussed in FIG. 5 .
  • the electronic device 600 includes an offset set determination unit 602, an element number set determination unit 604, and an out-of-bounds status indication determination unit 606.
  • the offset set determining unit 602 is configured to determine a first set of offsets of the first tensor element in multiple dimensions of the tensor segment based on an out-of-bounds query instruction for the first tensor element in the tensor segment.
  • the element number set determining unit 604 is configured to determine a set of element numbers of the tensor section in each of the plurality of dimensions based on the section attribute of the tensor section.
  • the out-of-bounds status indication determining unit 606 is configured to determine a first out-of-bounds status indication for the first tensor element based on the first set of offsets and the set of element numbers.
  • by using the out-of-bounds query instruction, the offset corresponding to the logical address in the tensor segment can be compared with the range of the tensor segment. Once the offset in any dimension exceeds the range of the tensor segment in that dimension, it can be determined that the logical address is out of bounds, that is, beyond the boundary of the tensor segment.
  • an out-of-bounds status indication, such as a specific value, can be obtained from a circuit unit such as a predicate register. This facilitates subsequent program operations such as program debugging or dynamic detection.
  • since the out-of-bounds query instruction does not involve returning stored data from the memory or writing data to it, but only gives an indication of the out-of-bounds state (such as logic 1 or 0), it avoids the time delay and high operating power consumption of reading data from or writing data to the memory.
  • the out-of-bounds query instruction includes a logical address representation of the first tensor element, and the logical address representation includes a segment base address register representation, an offset register representation, and an immediate value. The offset set determination unit 602 is further configured to compute the first set of offsets based on the segment base address register representation, the offset register representation, and the immediate value.
  • the electronic device 600 further includes: a segment base address register representation determination unit configured to determine the segment base address register representation based on the mode representation.
  • the method further includes determining a dimension corresponding to the immediate value based on the out-of-bounds query instruction.
  • the offset determination unit is further configured to: according to the dimensions of the tensor, add the segment base address register representation, the offset register representation, and the immediate value within the segment base address range of the corresponding dimension to obtain the first offset.
  • the offset determination unit is further configured to: in a dimension that does not correspond to the immediate value, add the segment base address register representation and the offset register representation within the segment base address range of that dimension to obtain the first offset.
  • the first offset can be obtained in a simple manner by adding the representation of the segment base address register and the representation of the offset register of the corresponding dimension, thereby reducing calculation cost and power consumption.
  • the element number set determination unit 604 is further configured to: determine, based on the segment attributes, the number of pages of the tensor segment in each dimension and the number of elements of each page in each dimension; and determine the set of element numbers based on the number of pages and the number of elements.
  • the offset determination unit is further configured to: in the dimension corresponding to the immediate value, add the segment base address register representation, the offset register representation, and the immediate value within the segment base address range of that dimension to obtain the first offset.
  • the first offset can be obtained in a simple way by adding the segment base address register representation, offset register representation and immediate value of the corresponding dimension, thereby reducing calculation cost and power consumption.
  • determining the set of the number of elements in each dimension of the tensor segment based on the segment attribute of the tensor segment includes: determining the number of pages of the tensor segment in each dimension based on the segment attribute and the number of elements for each page in each dimension; and based on the number of pages and the number of elements, a set of number of elements is determined.
  • the set of element numbers may include at least one element number in at least one dimension.
  • tensor segments include pages, and pages include elements.
  • Tensor segments may have a first plurality of dimensions, pages have a second plurality of dimensions, and the number of dimensions of the second plurality of dimensions is not greater than the number of dimensions of the first plurality of dimensions.
  • by calculating the number of pages of the tensor segment in each dimension and the number of elements of a page in each dimension, the number of elements in that dimension, that is, the maximum possible offset range of tensor elements in that dimension, can be determined, for example by multiplication. In this way, the set of element numbers can be determined in a simple manner.
  • the out-of-bounds state indication determining unit 606 is further configured to: determine whether each offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of number of elements; and if the first At least one of the offsets in the set of offsets exceeds the number of elements in the corresponding dimension in the set of numbers of elements, setting the first out-of-bounds status indication for the first tensor element to a first value.
  • the out-of-bounds state indication determining unit 606 is further configured to: determine whether each offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of number of elements; and if the first Each offset in the set of offsets does not exceed the number of elements in the corresponding dimension in the set of number of elements, then the first out-of-bounds status indication for the first tensor element is set to the second value.
  • by setting the first out-of-bounds status indication to different values in a device such as a predicate register, whether an access is out of bounds can be determined by reading the stored value.


Abstract

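This document describes a method and an electronic device for determining the out-of-bounds status of a tensor element. The method includes determining, based on an out-of-bounds query instruction for a first tensor element in a tensor segment, a first set of offsets of the first tensor element in multiple dimensions of the tensor segment. The method further includes determining, based on segment attributes of the tensor segment, a set of element numbers of the tensor segment in each of the multiple dimensions. The method further includes determining a first out-of-bounds status indication for the first tensor element based on the first set of offsets and the set of element numbers. By using the out-of-bounds query instruction, the offsets can be compared with the range of the tensor segment. Once an offset exceeds that range, the logical address can be determined to be out of bounds, that is, beyond the boundary of the tensor segment. An out-of-bounds status indication, such as a specific value, can then be obtained from a circuit unit such as a predicate register. This facilitates subsequent program operations such as program debugging or dynamic detection.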

Description

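Method and electronic device for determining the out-of-bounds status of tensor elements
This application claims priority to Chinese patent application No. 202210088659.1, entitled "Method and electronic device for determining the out-of-bounds status of tensor elements", filed with the China Patent Office on January 25, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure generally relate to the field of electronics, and more particularly to a method and an electronic device for determining the out-of-bounds status of tensor elements.
Background
Parallel high-performance multi-threaded multi-core processing systems, such as graphics processing units (GPUs), process data much faster than in the past. These processing systems can break complex computations into smaller tasks that are processed in parallel by multiple cores, increasing processing efficiency and reducing processing time.
In some cases, multi-core processors such as GPUs are particularly well suited to processing tensors that contain large amounts of data of the same or similar form. In the computer field, tensor elements usually represent the data of a one-dimensional or multi-dimensional array; for example, grayscale image data is a conventional two-dimensional tensor, which can be represented by a two-dimensional array. When image data is processed, different parts of it can be processed in parallel by a multi-core processor to reduce processing time.
Conventional tensor elements are usually stored in memory as a one-dimensional array, so when designing a program, the programmer must work out the correct address of each tensor element in the one-dimensional array when the program loads tensor elements. Conventional addressing methods generally cannot report whether an access to a tensor element goes beyond the tensor boundary. For example, in some conventional schemes, if an out-of-bounds address is detected on a read operation, 0 or another preset default value is returned directly, and if an out-of-bounds address is detected on a write operation, the write is discarded. A user or designer, however, cannot tell from a returned value of 0 whether the access was out of bounds, because the value of a tensor element may itself be 0. Likewise, the user cannot know whether a write operation was out of bounds. This makes some subsequent operations in the programming process inconvenient.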
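The ambiguity of the conventional scheme can be shown in a few lines. This is a sketch of the behavior described in the background (returning a default value on an out-of-bounds read), not code from any real system; the function name is hypothetical.

```python
def conventional_read(data, index, default=0):
    """Return the element, or a default value when the index is out of bounds."""
    return data[index] if 0 <= index < len(data) else default


stored = [0, 5, 7]
in_bounds_value = conventional_read(stored, 0)      # a genuine element equal to 0
out_of_bounds_value = conventional_read(stored, 3)  # past the end of the data
```

Both reads return 0, so the caller cannot tell the valid access from the invalid one by looking at the value alone.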
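Summary
Embodiments of the present disclosure provide a method and an electronic device for determining the out-of-bounds status of tensor elements.
In a first aspect, a method for determining the out-of-bounds status of a tensor element is provided. The method includes determining, based on an out-of-bounds query instruction for a first tensor element in a tensor segment, a first set of offsets of the first tensor element in multiple dimensions of the tensor segment. The method further includes determining, based on segment attributes of the tensor segment, a set of element numbers of the tensor segment in each of the multiple dimensions. The method further includes determining a first out-of-bounds status indication for the first tensor element based on the first set of offsets and the set of element numbers. By using the out-of-bounds query instruction, the offsets corresponding to the logical address within the tensor segment can be compared with the range of the tensor segment. Once the offset in any dimension exceeds the range of the tensor segment in that dimension, the logical address can be determined to be out of bounds, that is, beyond the boundary of the tensor segment. An out-of-bounds status indication, such as a specific value, can then be obtained from a circuit unit such as a predicate register. This facilitates subsequent program operations such as program debugging or dynamic detection. In addition, because the out-of-bounds query instruction does not return stored data from the memory or write data to it, but only gives an indication of the out-of-bounds status (for example, logic 1 or 0), it avoids the time delay and high operating power consumption of reading data from or writing data to the memory.
In one implementation of the first aspect of the present disclosure, the out-of-bounds query instruction includes a logical address representation of the first tensor element, and the logical address representation includes a segment base address register representation, an offset register representation, and an immediate value. Determining the first set of offsets of the first tensor element in the tensor segment includes: computing the first set of offsets based on the segment base address register representation, the offset register representation, and the immediate value.
In one implementation of the first aspect of the present disclosure, the method further includes: determining the segment base address register representation based on a mode representation. By using the mode representation, the offset of the logical address within the tensor segment can be determined for different modes, such as long mode or default mode (non-long mode). In this way, out-of-bounds determination schemes for different modes are supported, so that more tensor cases can be accommodated.
In one implementation of the first aspect of the present disclosure, the method further includes determining, based on the out-of-bounds query instruction, the dimension corresponding to the immediate value.
In one implementation of the first aspect of the present disclosure, computing the first offset based on the segment base address register representation, the offset register representation, and the immediate value includes: according to the dimensions of the tensor, adding the segment base address register representation and the offset register representation within the segment base address range of the corresponding dimension to obtain the first offset. In one implementation of the first aspect of the present disclosure, the method further includes, in a dimension that does not correspond to the immediate value, adding the segment base address register representation and the offset register representation within the segment base address range of that dimension to obtain the first offset. Adding the segment base address register representation and the offset register representation of each such dimension yields the first offset in a simple manner, which reduces computation cost and power consumption.
In one implementation of the first aspect of the present disclosure, computing the first offset based on the segment base address register representation, the offset register representation, and the immediate value includes: according to the dimensions of the tensor, adding the segment base address register representation, the offset register representation, and the immediate value within the segment base address range of the corresponding dimension to obtain the first offset. In one implementation of the first aspect of the present disclosure, the method further includes, in the dimension corresponding to the immediate value, adding the segment base address register representation, the offset register representation, and the immediate value within the segment base address range of that dimension to obtain the first offset. Adding the segment base address register representation, the offset register representation, and the immediate value of each such dimension yields the first offset in a simple manner, which reduces computation cost and power consumption.
In one implementation of the first aspect of the present disclosure, determining the set of element numbers of the tensor segment in each of the multiple dimensions based on the segment attributes of the tensor segment includes: determining, based on the segment attributes, the number of pages of the tensor segment in each dimension and the number of elements of each page in each dimension; and determining the set of element numbers based on the number of pages and the number of elements. The set of element numbers may include at least one element number in at least one dimension. In implementations of the present disclosure, a tensor segment includes pages, and a page includes elements. A tensor segment may have a first plurality of dimensions and a page may have a second plurality of dimensions, where the number of dimensions in the second plurality is not greater than that in the first plurality. By calculating the number of pages of the tensor segment in each dimension and the number of elements of a page in each dimension, the number of elements in that dimension, that is, the maximum possible offset range of a tensor element in that dimension, can be determined, for example by multiplication. In this way, the set of element numbers can be determined in a simple manner.
In one implementation of the first aspect of the present disclosure, determining the first out-of-bounds status indication for the first tensor element based on the first set of offsets and the set of element numbers includes: determining whether each offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of element numbers; and if at least one offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of element numbers, setting the first out-of-bounds status indication for the first tensor element to a first value.
In one implementation of the first aspect of the present disclosure, determining the first out-of-bounds status indication for the first tensor element based on the first set of offsets and the set of element numbers includes: determining whether each offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of element numbers; and if no offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of element numbers, setting the first out-of-bounds status indication for the first tensor element to a second value. By setting the first out-of-bounds status indication to different values in a device such as a predicate register, whether an access is out of bounds can be determined by reading the stored value. This facilitates subsequent operations such as program debugging and dynamic detection, and can, for example, save time and power, because this kind of out-of-bounds detection does not need to return actual data from the memory; it only returns a status value from a register inside the processing engine. Compared with reading multi-byte data into the processing engine from a memory outside it, reading a status value of a single byte or a few bytes from a predicate register inside the processing engine can therefore significantly reduce transfer and processing time and correspondingly lower processing power consumption.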
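In a second aspect of the present disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores a plurality of programs configured to be executed by one or more processing engines, the plurality of programs including instructions for performing the method of the first aspect.
In a third aspect of the present disclosure, a computer program product is provided. The computer program product includes a plurality of programs configured to be executed by one or more processing engines, the plurality of programs including instructions for performing the method of the first aspect.
In a fourth aspect of the present disclosure, an electronic device is provided. The electronic device includes: a stream processor; a page table device coupled to the stream processor; a memory; and a processing engine unit coupled to the stream processor, the memory, and the page table device and configured to perform the method of the first aspect.
In a fifth aspect of the present disclosure, an electronic device is provided. The electronic device includes an offset set determination unit configured to determine, based on an out-of-bounds query instruction for a first tensor element in a tensor segment, a first set of offsets of the first tensor element in multiple dimensions of the tensor segment. The electronic device further includes an element number set determination unit configured to determine, based on segment attributes of the tensor segment, a set of element numbers of the tensor segment in each of the multiple dimensions. The electronic device includes an out-of-bounds status indication determination unit configured to determine a first out-of-bounds status indication for the first tensor element based on the first set of offsets and the set of element numbers. By using the out-of-bounds query instruction, the offsets corresponding to the logical address within the tensor segment can be compared with the range of the tensor segment. Once the offset in any dimension exceeds the range of the tensor segment in that dimension, the logical address can be determined to be out of bounds, that is, beyond the boundary of the tensor segment. An out-of-bounds status indication, such as a specific value, can then be obtained from a circuit unit such as a predicate register. This facilitates subsequent program operations such as program debugging or dynamic detection. In addition, because the out-of-bounds query instruction does not return stored data from the memory or write data to it, but only gives an indication of the out-of-bounds status (for example, logic 1 or 0), it avoids the time delay and high operating power consumption of reading data from or writing data to the memory.
In one implementation of the fifth aspect of the present disclosure, the out-of-bounds query instruction includes a logical address representation of the first tensor element, the logical address representation includes a segment base address register representation, an offset register representation, and an immediate value, and the offset set determination unit is further configured to compute the first set of offsets based on the segment base address register representation, the offset register representation, and the immediate value.
In one implementation of the fifth aspect of the present disclosure, the electronic device further includes a segment base address register representation determination unit configured to determine the segment base address register representation based on a mode representation. By using the mode representation, the offset of the logical address within the tensor segment can be determined for different modes, such as long mode or default mode (non-long mode). In this way, out-of-bounds determination schemes for different modes are supported, so that more tensor cases can be accommodated.
In one implementation of the fifth aspect of the present disclosure, the method further includes determining, based on the out-of-bounds query instruction, the dimension corresponding to the immediate value.
In one implementation of the fifth aspect of the present disclosure, the offset determination unit is further configured to: according to the dimensions of the tensor, add the segment base address register representation, the offset register representation, and the immediate value within the segment base address range of the corresponding dimension to obtain the first offset. In one implementation of the fifth aspect of the present disclosure, the offset determination unit is further configured to: in a dimension that does not correspond to the immediate value, add the segment base address register representation and the offset register representation within the segment base address range of that dimension to obtain the first offset. Adding the segment base address register representation and the offset register representation of each such dimension yields the first offset in a simple manner, which reduces computation cost and power consumption.
In one implementation of the fifth aspect of the present disclosure, the element number set determination unit is further configured to: determine, based on the segment attributes, the number of pages of the tensor segment in each dimension and the number of elements of each page in each dimension; and determine the set of element numbers based on the number of pages and the number of elements. In one implementation of the fifth aspect of the present disclosure, the offset determination unit is further configured to: in the dimension corresponding to the immediate value, add the segment base address register representation, the offset register representation, and the immediate value within the segment base address range of that dimension to obtain the first offset. Adding the segment base address register representation, the offset register representation, and the immediate value of each such dimension yields the first offset in a simple manner, which reduces computation cost and power consumption.
In one implementation of the fifth aspect of the present disclosure, determining the set of element numbers of the tensor segment in each of the multiple dimensions based on the segment attributes of the tensor segment includes: determining, based on the segment attributes, the number of pages of the tensor segment in each dimension and the number of elements of each page in each dimension; and determining the set of element numbers based on the number of pages and the number of elements. The set of element numbers may include at least one element number in at least one dimension. In implementations of the present disclosure, a tensor segment includes pages, and a page includes elements. A tensor segment may have a first plurality of dimensions and a page may have a second plurality of dimensions, where the number of dimensions in the second plurality is not greater than that in the first plurality. By calculating the number of pages of the tensor segment in each dimension and the number of elements of a page in each dimension, the number of elements in that dimension, that is, the maximum possible offset range of a tensor element in that dimension, can be determined, for example by multiplication. In this way, the set of element numbers can be determined in a simple manner.
In one implementation of the fifth aspect of the present disclosure, the out-of-bounds status indication determination unit is further configured to: determine whether each offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of element numbers; and if at least one offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of element numbers, set the first out-of-bounds status indication for the first tensor element to a first value.
In one implementation of the fifth aspect of the present disclosure, the out-of-bounds status indication determination unit is further configured to: determine whether each offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of element numbers; and if no offset in the first set of offsets exceeds the number of elements in the corresponding dimension in the set of element numbers, set the first out-of-bounds status indication for the first tensor element to a second value. By setting the first out-of-bounds status indication to different values in a device such as a predicate register, whether an access is out of bounds can be determined by reading the stored value. This facilitates subsequent operations such as program debugging and dynamic detection, and can, for example, save time and power, because this kind of out-of-bounds detection does not need to return actual data from the memory; it only returns a status value from a register inside the processing engine. Compared with reading multi-byte data into the processing engine from a memory outside it, reading a status value of a single byte or a few bytes from a predicate register inside the processing engine can therefore significantly reduce transfer and processing time and correspondingly lower processing power consumption.
According to the method and electronic device of embodiments of the present disclosure, a programmer can use the out-of-bounds query instruction to obtain an out-of-bounds status indication, such as a specific value, from a circuit unit such as a predicate register. For read operations, a preset default value may also be returned. This facilitates subsequent program operations such as program debugging or dynamic detection. In addition, because the out-of-bounds query instruction does not return stored data from the memory or write data to it, but only gives an indication of the out-of-bounds status (for example, logic 1 or 0), it avoids the time delay and high operating power consumption of reading data from or writing data to the memory.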
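Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the present disclosure taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
FIG. 1 shows a schematic diagram of an example environment in which multiple embodiments of the present disclosure can be implemented;
FIG. 2 shows a schematic block diagram of a chip according to an embodiment of the present disclosure;
FIG. 3 shows a schematic block diagram of a three-dimensional tensor according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of page allocation of image data according to an embodiment of the present disclosure;
FIG. 5 shows a schematic flowchart of a method for determining the out-of-bounds status of tensor elements according to an embodiment of the present disclosure; and
FIG. 6 shows a schematic block diagram of an electronic device according to some embodiments of the present disclosure.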
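Detailed Description
Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
The term "including" and its variants as used herein denote open-ended inclusion, that is, "including but not limited to". Unless otherwise stated, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one further embodiment". The terms "first", "second", and so on may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
As mentioned above, in some conventional schemes, if an out-of-bounds address is detected on a read operation, 0 or another preset default value is returned directly, and if an out-of-bounds address is detected on a write operation, the write is discarded. A user or designer, however, cannot tell from a returned 0 or other value whether the access was out of bounds, because the value of a tensor element may itself be 0 or that other value. Likewise, the user cannot know whether a write operation was out of bounds. This makes some subsequent operations in the programming process inconvenient.
In some embodiments of the present disclosure, whether a logical address is out of bounds can be determined by computing the offset of the logical address within the tensor segment and comparing that offset with the offset range of the tensor segment. In addition, by setting the first out-of-bounds status indication, which represents whether the access is out of bounds, to different values in a device such as a predicate register, whether an access is out of bounds can be determined by reading the stored value. This facilitates subsequent operations such as program debugging and dynamic detection, and can, for example, save time and power, because this kind of out-of-bounds detection does not need to return actual data from the memory; it only returns a status value from a register inside the processing engine. Compared with reading multi-byte data into the processing engine from a memory outside it, reading a status value of a single byte or a few bytes from a predicate register inside the processing engine can therefore significantly reduce transfer and processing time and correspondingly lower processing power consumption.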
图1示出了本公开的多个实施例能够在其中实现的示例环境100的示意图。示例环境100例如可以是诸如计算机之类的具有计算能力的电子设备。在一个实施例中,示例环境100例如包括中央处理器(CPU)20、系统存储器10、北桥/存储器桥30、加速器子系统40、设备存储器50和南桥/输入输出(IO)桥60。系统存储器10例如可以是诸如动态随机存取存储器(DRAM)之类的易失性存储器。北桥/存储器桥30例如集成了内存控制器、PCIe控制器等,其负责CPU 20和高速接口之间的数据交换以及桥接CPU 20和南桥/IO桥60。南桥/IO桥60用于计算机的低速接口,例如串行高级技术接口(SATA)控制器等。加速器子系统40例如可以包括诸如图形处理 器(GPU)和人工智能(AI)加速器等用于对图形、视频等数据进行加速处理的装置或芯片。设备存储器50例如可以是诸如DRAM之类的位于加速器子系统40外部的易失性存储器。在本公开中,设备存储器50也被称为片外存储器,即,位于加速器子系统40的芯片外部的存储器。相对而言,加速器子系统40的芯片内部也具有易失性存储器,例如一级(L1)高速缓存(cache)以及可选的二级(L2)高速缓存。这将在下文结合本公开的一些实施例具体描述。虽然在图1中示出了本公开的多个实施例能够在其中实现的一种示例环境100,但是本公开不限于此。本公开的一些实施例也可以在诸如ARM架构和RISC-V架构之类的具有诸如GPU之类的加速器子系统的一些应用环境中使用。
图2示出了根据本公开的一个实施例的加速器子系统200的示意框图。加速器子系统200例如可以是图1中加速器子系统40的芯片的一种具体实现方式。加速器子系统200例如是诸如GPU之类的加速器子系统芯片。在一个实施例中,加速器子系统200包括流处理器(SP)210、页表装置220、处理引擎(PE)单元230、直接存储器访问(DMA)控制器240、L1高速缓存(cache)260和L2高速缓存250。
加速器子系统200由诸如CPU 20之类的主机设备控制,并且接收来自CPU 20的指令。SP 210对来自CPU 20的指令进行分析,并且将经分析的操作指派给PE单元230、页表装置220和DMA控制器240进行处理。页表装置220用于管理加速器子系统200的片上虚拟存储。在本公开中,L2高速缓存250和诸如图1中的设备存储器50之类的片外存储器构成虚拟存储系统。页表装置220由SP 210、PE单元230和DMA控制器240共同维护。
PE单元230包括多个处理引擎(processing engine,PE)PE_1、PE_2……PE_N,其中N表示大于或等于1的整数。PE单元230中的每个PE可以是单指令多线程(SIMT)装置。在PE中,每个线程可以具有自己的寄存器堆(register file),并且每个PE的所有线程还 共享一个统一寄存器堆(uniform register file)。多个PE可以并行地执行相同或不同的处理工作,可以并行地进行下文所述的地址转换和存储器中目标数据的访问,从而减少处理时间。可以理解,多个PE处理的目标元素并不相同,并且目标元素所在的段、页、缓存行和元素的属性、尺寸、维度排序等可以有所不同,如下文具体描述。
每个线程可以在自己的寄存器堆与存储器子系统之间做线程级的数据交换。每个线程有自己的算数逻辑执行单元并使用自己的存储地址,其采用典型的寄存器存取架构(load-store architecture)。每个执行单元包括一个支持多种数据类型的浮点/定点单元以及一个算数逻辑单元。
大多数的指令执行算数和逻辑运算,例如,浮点和定点数的加、减、乘、除,或者逻辑与、或、非等。操作数来自于寄存器。存储器读写指令可以提供寄存器与片上/片外存储器之间的数据交换。一般地,PE中所有的执行单元可以同步地执行相同指令。通过使用谓词(predicate)寄存器,可以屏蔽部分执行单元,从而实现分支指令的功能。
在一个实施例中,图2的加速器子系统200可以例如执行如下操作:1)组建页表项内容和初始状态;2)将诸如图1中的设备存储器50之类的片外存储器上的数据搬运至片上存储器,例如L2高速缓存250;3)启动和执行程序;4)定义各个段并对张量以及存储的属性进行描述;5)在程序执行完成时,将执行结果的数据写入至片外存储器。
可以理解,在公开的实施例中,加速器子系统200所处理的数据主要针对多维张量。例如,在一个实施例中,张量可以是四维张量,其具有四个维度D1、D2、D3和D4,并且张量在各个维度上的尺寸可以不同。在另一些实施例中,张量可以是一维、二维、三维或更多维张量,本公开对此不进行限制。
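In addition, in the embodiments of the present disclosure, a tensor may internally support element types such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, and uint32, as well as other user-defined element types, which the present disclosure likewise does not limit. Tensor addressing uses the element as its basic unit. For example, if the element type is int8, the addressing unit is one byte. If the element type is int16, the basic addressing unit is two bytes, and so on.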
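The element-granular addressing described above can be illustrated with a small sketch. The helper name `element_byte_offset` and the lookup table are assumptions introduced only for illustration; the byte widths are simply the standard widths of the listed types, and the disclosure itself does not define such a function.

```python
# Byte width of each element type named in the text (standard widths for these types).
ELEMENT_BYTES = {
    "uint8": 1, "int8": 1,
    "bfloat16": 2, "float16": 2, "uint16": 2, "int16": 2,
    "float32": 4, "int32": 4, "uint32": 4,
}


def element_byte_offset(element_index: int, element_type: str) -> int:
    """Convert an element-granular address into a byte offset (hypothetical helper)."""
    return element_index * ELEMENT_BYTES[element_type]
```

Under this sketch, element 5 of an int8 tensor sits 5 bytes from the start, while element 5 of an int16 tensor sits 10 bytes from the start.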
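In some cases, a tensor may contain a large amount of data while the capacity of the L2 cache 250 is limited, so the tensor cannot be loaded as a whole into the on-chip L2 cache 250. In some embodiments of the present disclosure, to facilitate parallel processing of a tensor, the tensor may be divided into at least one segment. When a tensor includes only one segment, the tensor is the segment. When a tensor includes multiple segments, a segment is a part of the tensor. The CPU 20 can specify by instruction which PE processes each part of a segment.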
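FIG. 3 shows a schematic block diagram of a three-dimensional tensor 300 according to one embodiment of the present disclosure. The three-dimensional tensor 300 has three dimensions D1, D2, and D3, and includes a first segment S1, a second segment S2, and a third segment S3. The CPU 20 may specify that the tensor elements of the first segment S1 are processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, and PE_8. In addition, the CPU 20 specifies that the tensor elements of the second segment S2 are processed by PE_1 to PE_4. In the embodiments of the present disclosure, each segment may have a different size, so programmers can configure segments flexibly according to design needs. In practice, the division into pages can be carried out in any one or more dimensions, and the numbers of pages in the individual dimensions are independent of one another.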
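In one embodiment, tensor elements may be stored in on-chip high-speed memory, for example the L2 cache 250. Because the capacity of the on-chip high-speed memory is small, however, when the tensor is large the programmer can divide the tensor into multiple segments, each of which describes a part of the tensor. The kernel program (kernel) can be launched multiple times; each time, the DMA controller 240 moves one segment of the tensor from off-chip storage to on-chip storage in advance for use by the kernel. After the kernel has been launched multiple times, all segments of the tensor have been processed and the whole run ends. When the on-chip high-speed memory is large enough to hold all the tensors that the kernel needs to access, a tensor needs only one segment description and the kernel needs to be launched only once.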
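Further, in some embodiments of the present disclosure, at least one page may be set within a segment to subdivide the tensor further. For example, the first segment S1 has four pages P[1], P[2], P[3], and P[4]. The second segment S2 has only one page. In the embodiments of the present disclosure, each segment may have a different number of pages, so programmers can flexibly configure the size of the pages within a segment according to design needs, for example so that a page fits into the L2 cache 250 as a whole.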
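As described above, when a tensor is addressed, the smallest addressing unit is the element. A page can usually include multiple elements. The page in which the target element is located is referred to herein as the "target element page". In some embodiments of the present disclosure, a page may include multiple cache lines. When the target element page is located in the L2 cache 250 and a PE reads the target element via the L1 cache 260, the L2 cache 250 needs to transfer to the L1 cache 260, as a whole, a small portion of data with contiguous physical addresses in the L2 cache 250 that includes the target element. This small portion of data is also called cache line data, and this caching mechanism is based on the principle of spatial locality. A PE needs only a few clock cycles to read data from the L1 cache 260, whereas the L1 cache 260 may need tens or even hundreds of clock cycles to read data from the L2 cache 250. It is therefore desirable to reduce the number of times the L1 cache 260 reads data from the L2 cache 250. Although "cache line" is used here to describe the smallest unit of data transferred from the L2 cache 250 to the L1 cache 260, in the present disclosure this data is not necessarily arranged by row or column: the data in one "cache line" is distributed over multiple dimensions, and the size of the data in each dimension is not limited to 1. The PEs process the data within a segment in parallel, and the allocation of PEs is laid out in the logical address space of the data, independently of the physical storage structure of the segment, as described below.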
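In FIG. 3, the first group of cache lines in the first page P[1] is designated to be processed by PE_1, and the second group of cache lines is designated to be processed by PE_2. Although the tensor is shown here as being processed by multiple PEs in order, it can be understood that the processing of tensor elements is independent of the order of the PEs, which the present disclosure does not limit. For example, the tensor elements marked PE_2 in FIG. 3 may be processed by PE_M, where M is any integer not greater than N.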
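FIG. 4 shows a schematic diagram of the page allocation of image data 400 according to one embodiment of the present disclosure. Image data is a typical two-dimensional tensor. In one embodiment, the image data 400 is, for example, 8*8 pixels. In other words, the image data 400 has 8 pixels in the first dimension D1 and 8 pixels in the second dimension D2. The image data 400 therefore has pixels P00, P01, ..., P07, P10, ..., P77. In the embodiment of FIG. 4, the image data 400 has only one segment but is divided along two dimensions into four pages P[1], P[2], P[3], and P[4]. The four pages may be divided along the second dimension D2 for allocation to PE_1 and PE_2, or along the first dimension D1 for allocation to PE_1 and PE_2. They may also be divided along the diagonal. The present disclosure does not limit this.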
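As a rough illustration of the page layout described for FIG. 4, the following sketch maps a pixel coordinate of the 8*8 image to one of the four pages. The 4*4 page extent follows from splitting 8 pixels into 2 pages per dimension; the numbering order of the pages (D1 varying fastest) and the helper name `page_of` are assumptions, since the text does not state how P[1] to P[4] are ordered.

```python
PAGE_EXTENT = 4      # 8 pixels per dimension split into 2 pages of 4 pixels each
PAGES_ALONG_D1 = 2   # assumed numbering: pages counted along D1 first, then D2


def page_of(d1: int, d2: int) -> int:
    """Return the 1-based page number (1..4) holding pixel (d1, d2); hypothetical helper."""
    if not (0 <= d1 < 8 and 0 <= d2 < 8):
        raise ValueError("pixel outside the 8 x 8 image")
    return (d2 // PAGE_EXTENT) * PAGES_ALONG_D1 + (d1 // PAGE_EXTENT) + 1
```

With this assumed numbering, the four image corners fall into four different pages, which is the property the PE assignment relies on.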
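FIG. 5 shows a flowchart of a method 500 for determining the out-of-bounds state of a tensor element according to some embodiments of the present disclosure. In one embodiment, the method 500 may be carried out, for example, by an accelerator subsystem such as a GPU, so the aspects described above with reference to FIGS. 1 to 3 may selectively apply to the method 500.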
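At 502, the method 500 includes determining, based on an out-of-bounds query instruction for a first tensor element in a tensor segment, a first offset set of the first tensor element in multiple dimensions of the tensor segment. In one embodiment, the out-of-bounds query instruction may be, for example: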
query.p0 Dp,segment_id:RF_offset:immediate
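Here, query denotes the out-of-bounds query, p0 denotes the dimension in which the immediate lies, Dp denotes the out-of-bounds register representation, segment_id denotes the segment base register representation, RF_offset denotes the offset register representation, and immediate denotes the immediate representation. Although one specific out-of-bounds query instruction is shown here, the present disclosure is not limited thereto. It can be understood that other forms of out-of-bounds query instruction may be used, which the present disclosure does not limit.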
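As described above, a tensor may include multiple tensor segments. Each tensor segment may be accessed by one or more PEs according to the partitioning. The starting logical address accessed by a PE is defined by a reference point, which is a logical address within the tensor space, that is, the addressing base of each PE within the tensor space. A PE addresses within a segment relative to its reference point, and each PE may have its own reference point. In one embodiment, when the reference points of all PEs are the starting point of the tensor segment, no reference point needs to be defined specifically. A reference point needs to be defined explicitly for each PE only when different reference points are required.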
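The accelerator subsystem uses a tensor-oriented storage architecture in which the tensor attributes of each storage segment are specified by SP commands. The tensor attributes are stored in a dedicated on-chip cache for use during program execution. In addition, the SP 210 directly or indirectly defines the storage attributes of the tensor. The tensor-related attributes defined by the SP 210 include, but are not limited to, the following information: the tensor segment identifier (ID), the dimension information of the tensor segment, the page information of the tensor segment, the size information of the tensor segment, the reference point information of the tensor segment, the interleaving information of the tensor elements, the information on which dimensions support long mode during addressing, the sign attribute of the segment address, and so on.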
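The tensor segment ID is unique within one run and is an important part of a memory access or query address. The segment dimension information indicates the number of dimensions of the tensor, that is, one, two, three, or four. The page information of the tensor segment includes the starting page number and the number of pages in each dimension. The size information of the tensor segment indicates the size, in each dimension, of the pages within the tensor segment. The reference point information of the tensor segment indicates the logical address of each PE's starting point within the tensor segment, that is, the logical offset in each dimension. The interleaving information of the tensor elements indicates whether the tensor elements are interleaved across two dimensions. The information on which dimensions support long mode during addressing is used to accommodate the need for a larger addressing range in certain dimensions of the tensor. The sign attribute of the segment address indicates whether the tensor segment address supports negative numbers.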
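Based on the above information, an access to a tensor element within a segment can be completed according to the logical address representation segment_id:RF_offset:immediate used in the program. In one embodiment, an out-of-bounds query may accompany an access to the logical address. Alternatively, it may be executed independently. Like a memory access instruction, the out-of-bounds query instruction requires the exact logical address representation segment_id:RF_offset:immediate of the tensor element as its source operand.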
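In one embodiment, although each PE is assigned a different logical offset within the segment, the threads inside each PE can still access any element of other PEs or even of the whole tensor segment. The reference point offset of a PE can be reflected in the base address information of the tensor segment, and the base address information of the tensor segment can be indexed by the tensor segment ID. In reference-point-based addressing, negative addresses can be supported in order to access data elements outside the PE, that is, to access storage locations at a negative offset relative to the reference point. In one embodiment, segment_id and immediate may, for example, be negative. Because a tensor address can have multiple dimensions, for example 4D, negative offsets also support multiple dimensions, for example 4D, and the out-of-bounds determination likewise supports multiple dimensions such as 4D.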
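Because the out-of-bounds situation of multi-dimensional addresses in a tensor segment has to be represented and determined, the source operand segment_id:RF_offset:immediate needs to contain address offset information for multiple dimensions (for example, three-dimensional or four-dimensional address offset information) to fit tensors of different dimensionality. In one embodiment, the bit widths of the registers and of the immediate in the source operand segment_id:RF_offset:immediate are fixed, so the width of the bit field allocated to each dimension differs when tensors of different dimensionality are addressed. The more dimensions a tensor has, the fewer bits are allocated to each dimension.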
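In one embodiment, the bit fields of the segment base register representation and of the RF register representation may be allocated to the dimensions on an approximately equal basis. In some cases, because the bit width of the register representation is not divisible by the number of dimensions, some dimensions may be allocated a relatively wider bit field. In addition, in some cases the sizes of the tensor differ from dimension to dimension, so allocating approximately equal bit widths to all dimensions is not an efficient address encoding scheme. In some embodiments, as a balance between the complexity of the system design and the efficiency of the address encoding, extra bit width may be provided in certain dimensions to meet the need for a larger addressing range. The dimension that receives the extra bit width is not fixed and can be assigned flexibly among the dimensions according to the shape of the tensor and the addressing requirements.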
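The approximately equal allocation can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_bits`, the 32-bit total, and the policy of giving all leftover bits to a single chosen dimension are not specified by the disclosure; they were picked because they reproduce the offset-register widths listed in Table 1 (32 bits for 1D, 16+16 for 2D, 12+10+10 for 3D, 8+8+8+8 for 4D).

```python
from typing import List, Optional


def split_bits(total_bits: int, ndim: int, wide_dim: Optional[int] = None) -> List[int]:
    """Split a register of total_bits into per-dimension bit-field widths.

    Index 0 of the result is the lowest dimension. Leftover bits (when
    total_bits is not divisible by ndim) all go to wide_dim, which defaults
    to the highest dimension. Both choices are assumptions for illustration.
    """
    if ndim < 1:
        raise ValueError("ndim must be at least 1")
    if wide_dim is None:
        wide_dim = ndim - 1
    base, extra = divmod(total_bits, ndim)
    widths = [base] * ndim
    widths[wide_dim] += extra
    return widths
```

Passing a different `wide_dim` moves the extra bits to another dimension, which mirrors the statement that the dimension receiving the extra width is not fixed.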
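The immediate part of segment_id:RF_offset:immediate has a rather limited bit width and cannot effectively express the offsets of all dimensions at once. In one embodiment, a contiguous-subset encoding of the four-dimensional (4D) model may be used. That is, the offsets in the immediate describe one or two contiguous dimensions of the tensor, and the index of the higher of these dimensions must not exceed the highest dimension index of the tensor. For example, for a one-dimensional (1D) tensor, the immediate describes, and can only describe, the offset of dimension 0 (counting from 0, likewise below); for a two-dimensional (2D) tensor, the immediate can describe the offsets of dimensions 0 and 1; for a three-dimensional (3D) tensor, the immediate can describe the offsets of dimensions 0 and 1 or of dimensions 1 and 2; for a 4D tensor, the immediate can describe the offsets of dimensions 0 and 1, of dimensions 1 and 2, or of dimensions 2 and 3.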
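The contiguous-subset rule above can be written down as a small validity check. The function name `immediate_dims` and the integer `suffix` argument (0 for the pair starting at dimension 0, 1 for the pair starting at dimension 1, and so on) are assumptions made for this sketch; dimensions are counted from 0 as in the paragraph above.

```python
from typing import Tuple


def immediate_dims(ndim: int, suffix: int) -> Tuple[int, ...]:
    """Return the 0-based dimensions whose offsets one immediate describes.

    ndim is the tensor dimensionality (1 to 4); suffix selects the lower
    dimension of the pair. Combinations outside the rule raise ValueError.
    """
    if not 1 <= ndim <= 4:
        raise ValueError("only 1D to 4D tensors are covered by this sketch")
    if ndim == 1:
        if suffix != 0:
            raise ValueError("a 1D tensor only has dimension 0")
        return (0,)
    if not 0 <= suffix <= ndim - 2:
        raise ValueError("the higher dimension would exceed the tensor's highest dimension")
    return (suffix, suffix + 1)
```

For a 3D tensor the valid results are `(0, 1)` and `(1, 2)`; asking for the pair starting at dimension 2 is rejected because dimension 3 does not exist.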
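In one embodiment, take the address ur0:r0:0x55aa of tensor element A inside a 4D tensor as an example: ur0 is a specific example of segment_id, r0 is a specific example of RF_offset, and 0x55aa is a specific example of immediate. The base addresses inside register ur0 include ur_addr3[7:0], ur_addr2[7:0], ur_addr1[7:0], and ur_addr0[7:0], where ur_addr3[7:0] is the base address of tensor element A in dimension D4, ur_addr2[7:0] is the base address of tensor element A in dimension D3, ur_addr1[7:0] is the base address of tensor element A in dimension D2, and ur_addr0[7:0] is the base address of tensor element A in dimension D1. In addition, for a four-dimensional tensor, r0 may contain internal offset addresses, for example r_off3[7:0], r_off2[7:0], r_off1[7:0], and r_off0[7:0]. Although a four-dimensional tensor is used for the description here, this is only illustrative and does not limit the scope of the present disclosure. A tensor may also be one-dimensional, two-dimensional, three-dimensional, or of higher dimensionality.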
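In one embodiment of the present disclosure, the method 500 further includes determining the segment base register representation based on a mode representation. If long mode is enabled in the address attribute of a dimension, taking the base address of tensor element A in dimension D1 as an example, 12 bits should be taken, so the effective base address in dimension D1 is ur_addr0[11:0], while the base addresses of the other dimensions remain ur_addr1[7:0], ur_addr2[7:0], and ur_addr3[7:0]. If long mode is enabled for tensor element A in dimension D2, the extracted effective base address is ur_addr1[11:0], and the base addresses of the other dimensions are ur_addr0[7:0], ur_addr2[7:0], and ur_addr3[7:0], and so on.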
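It follows that, in this embodiment, for a 4D tensor the address field allocated to each dimension is 12 bits wide; in long mode all 12 bits are valid, whereas in non-long mode the upper 4 of the 12 bits are 0 and only the lower 8 bits are taken. Although the 4D case is used here to explain the difference between long mode and non-long mode, this is only an example; the bit width of each dimension, and the bit widths selected in long mode and non-long mode, can be set as needed for tensors of other dimensionality. For example, for a 3D tensor, the address field allocated to each dimension is 16 bits wide; in long mode all 16 bits are valid, whereas in non-long mode the upper 8 of the 16 bits are 0 and only the lower 8 bits are taken. By using the mode representation, the offset of a logical address within a tensor segment can be determined for different modes, for example long mode or the default (non-long) mode. In this way, the out-of-bounds determination scheme is compatible with different modes and fits more tensor cases.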
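For the 4D case described above, selecting the effective base address in long mode versus non-long mode amounts to choosing a mask. The sketch below assumes the per-dimension base field has already been extracted as a 12-bit integer; the helper name `effective_base` is hypothetical.

```python
def effective_base(field_12bit: int, long_mode: bool) -> int:
    """Return the effective base address of one dimension of a 4D tensor.

    In long mode all 12 bits of the field are used; otherwise only the
    low 8 bits are taken (the high 4 bits are treated as zero).
    """
    field = field_12bit & 0xFFF          # keep the 12-bit address field
    return field if long_mode else field & 0xFF
```

For instance, a field value of 0xABC yields 0xABC in long mode and 0xBC in the default mode.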
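In one embodiment of the present disclosure, the method further includes determining, based on the out-of-bounds query instruction, the dimensions corresponding to the immediate. For example, in the out-of-bounds query instruction query.p0 Dp,segment_id:RF_offset:immediate, p0 indicates the dimensions to which the specified immediate corresponds. The immediate may be 16-bit data. According to the suffix field "p0" in the instruction, it can be determined that immediate[7:0] is the offset of tensor element A in dimension D1 and immediate[15:8] is the offset of tensor element A in dimension D2. In another embodiment, if the suffix field in the out-of-bounds query instruction is "p1", it can be determined that immediate[7:0] is the offset of tensor element A in dimension D2 and immediate[15:8] is the offset of tensor element A in dimension D3, and so on.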
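The suffix-driven decoding of the 16-bit immediate can be sketched as below. The function name `decode_immediate` is hypothetical, dimensions are keyed from 0 (so key 0 corresponds to D1, key 1 to D2, and so on), and the two bytes are treated as unsigned values, which is a simplifying assumption given that the text also allows negative offsets.

```python
from typing import Dict


def decode_immediate(immediate: int, suffix: int) -> Dict[int, int]:
    """Split a 16-bit immediate into two per-dimension offsets.

    suffix 0 models "p0" (low byte -> D1, high byte -> D2), suffix 1 models
    "p1" (low byte -> D2, high byte -> D3). Keys are 0-based dimension indices.
    """
    low = immediate & 0xFF            # immediate[7:0]
    high = (immediate >> 8) & 0xFF    # immediate[15:8]
    return {suffix: low, suffix + 1: high}
```

Applied to the example immediate 0x55aa with suffix p0, the offset in D1 is 0xaa and the offset in D2 is 0x55.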
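In one embodiment of the present disclosure, the method 500 further includes: according to the dimensions of the tensor, adding, in a dimension that corresponds to the immediate, the segment base register representation within the segment base range of that dimension, the offset register representation, and the immediate to obtain the first offset. In one embodiment of the present disclosure, the method further includes: in a dimension that does not correspond to the immediate, adding the segment base register representation within the segment base range of that dimension and the offset register representation to obtain the first offset.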
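In one embodiment, as described above, because the out-of-bounds query instruction specifies the suffix field "p0", immediate[7:0] is the offset of tensor element A in dimension D1 and immediate[15:8] is the offset of tensor element A in dimension D2. In this case, the logical address of tensor element A can be expressed as: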
addr_A=(ur_addr3[7:0]+r_off3[7:0],ur_addr2[7:0]+r_off2[7:0],ur_addr1[7:0]+r_off1[7:0]+immediate[15:8],ur_addr0[7:0]+r_off0[7:0]+immediate[7:0])
In another embodiment, if long mode is enabled in dimension D1, the logical address of tensor element A can be expressed as:
addr_A=(ur_addr3[7:0]+r_off3[7:0],ur_addr2[7:0]+r_off2[7:0],ur_addr1[7:0]+r_off1[7:0]+immediate[15:8],ur_addr0[11:0]+r_off0[7:0]+immediate[7:0])
In another embodiment, if long mode is enabled in dimension D2, the logical address of tensor element A can be expressed as:
addr_A=(ur_addr3[7:0]+r_off3[7:0],ur_addr2[7:0]+r_off2[7:0],ur_addr1[11:0]+r_off1[7:0]+immediate[15:8],ur_addr0[7:0]+r_off0[7:0]+immediate[7:0])
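For 1D, 2D, 3D, and 4D tensors, the computation of the logical address can be summarized as in Table 1 below.

Table 1: Logical address computation for each tensor dimensionality

Field     | 1D             | 2D             | 3D             | 4D
Ur_addr3  | -              | -              | -              | [11:8]+[7:0]
Ur_addr2  | -              | -              | [15:12]+[11:0] | [11:8]+[7:0]
Ur_addr1  | -              | [19:16]+[15:0] | [13:10]+[9:0]  | [11:8]+[7:0]
Ur_addr0  | [35:32]+[31:0] | [19:16]+[15:0] | [13:10]+[9:0]  | [11:8]+[7:0]
R_off3    | -              | -              | -              | [7:0]
R_off2    | -              | -              | [11:0]         | [7:0]
R_off1    | -              | [15:0]         | [9:0]          | [7:0]
R_off0    | [31:0]         | [15:0]         | [9:0]          | [7:0]
immediate | [15:0]         | [15:8]+[7:0]   | [15:8]+[7:0]   | [15:8]+[7:0]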
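In each dimension, adding the segment base register representation and the offset register representation of that dimension yields the first offset in a simple way, which reduces computation cost and power consumption.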
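The 4D address expressions given above can be collected into one sketch. The function `logical_address`, its argument layout (lists indexed by 0-based dimension, so index 0 is D1), and the unsigned treatment of all fields are assumptions for illustration. Note that the result tuple is ordered from the lowest dimension to the highest, which is the reverse of the order in which addr_A is written in the text.

```python
from typing import Optional, Sequence, Tuple


def logical_address(ur_addr: Sequence[int], r_off: Sequence[int], immediate: int,
                    suffix: int, long_dim: Optional[int] = None) -> Tuple[int, ...]:
    """Compute the per-dimension logical offsets of a 4D tensor element.

    ur_addr / r_off: four per-dimension fields, index 0 = D1 ... index 3 = D4.
    suffix: 0 for "p0" (immediate covers D1 and D2), 1 for "p1", 2 for "p2".
    long_dim: 0-based dimension with long mode enabled, or None.
    """
    imm = {suffix: immediate & 0xFF, suffix + 1: (immediate >> 8) & 0xFF}
    addr = []
    for d in range(4):
        base_mask = 0xFFF if d == long_dim else 0xFF   # 12-bit base in long mode
        addr.append((ur_addr[d] & base_mask) + (r_off[d] & 0xFF) + imm.get(d, 0))
    return tuple(addr)
```

For example, with bases (1, 2, 3, 4), register offsets (10, 20, 30, 40), and an immediate of 0x0201 under suffix p0, the offsets in D1 to D4 come out as 12, 24, 33, and 44.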
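At 504, the method 500 further includes determining, based on a segment attribute of the tensor segment, an element-count set of the tensor segment in each of the multiple dimensions. In one embodiment of the present disclosure, the method 500 may include: determining, based on the segment attribute, the number of pages of the tensor segment in each dimension and the number of elements of each page in each dimension; and determining the element-count set based on the numbers of pages and the numbers of elements. The element-count set may include at least one element count in at least one dimension. In other words, the set may include the element counts of the individual dimensions. In implementations of the present disclosure, a tensor segment includes pages and a page includes elements. A tensor segment may have a first plurality of dimensions and a page may have a second plurality of dimensions, where the number of dimensions in the second plurality is not greater than the number in the first plurality.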
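In one embodiment, the number of elements of the tensor segment in each dimension can be computed from the definition information of the tensor segment, according to the dimensionality of the tensor segment, the number of pages in each dimension, and the number of elements per page in each dimension. For example, for a 4D tensor, the number of pages in each dimension, pages_per_dim=(P[1],P[2],P[3],P[4]), can be determined from the segment attribute field. In addition, the number of elements per page in each dimension, elem_per_page=(N[1],N[2],N[3],N[4]), can be determined. The number of tensor elements in each dimension is then elem_per_dim=(N[1]*P[1],N[2]*P[2],N[3]*P[3],N[4]*P[4]).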
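Although the computation of the element counts is shown here for 4D, this is only illustrative and does not limit the scope of the present disclosure. The computation for other dimensionalities is not repeated here. By computing the number of pages of the tensor segment in each dimension and the number of elements of a page in each dimension, the number of elements in that dimension, that is, the largest possible offset range of a tensor element in that dimension, can be determined, for example by multiplication. In this way, the element-count set can be determined in a simple manner.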
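The per-dimension multiplication described above is straightforward to sketch. The function name `elements_per_dim` is hypothetical; the inputs correspond to pages_per_dim and elem_per_page in the text, and the sketch works for any dimensionality, not only 4D.

```python
from typing import Sequence, Tuple


def elements_per_dim(pages_per_dim: Sequence[int],
                     elems_per_page: Sequence[int]) -> Tuple[int, ...]:
    """Multiply page counts by per-page element counts, dimension by dimension."""
    if len(pages_per_dim) != len(elems_per_page):
        raise ValueError("both inputs must have one entry per dimension")
    return tuple(p * n for p, n in zip(pages_per_dim, elems_per_page))
```

For the 8*8 image of FIG. 4, two pages of four elements in each dimension give an element count of 8 in both dimensions.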
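At 506, the method 500 further includes determining, based on the first offset set and the element-count set, a first out-of-bounds status indication for the first tensor element. In one embodiment of the present disclosure, determining the first out-of-bounds status indication for the first tensor element based on the first offset set and the element-count set includes: determining whether each offset in the first offset set exceeds the element count of the corresponding dimension in the element-count set; and, if at least one offset in the first offset set exceeds the element count of the corresponding dimension in the element-count set, setting the first out-of-bounds status indication for the first tensor element to a first value. If none of the offsets in the first offset set exceeds the element count of the corresponding dimension in the element-count set, the first out-of-bounds status indication for the first tensor element is set to a second value different from the first value.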
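In one embodiment, for example when the out-of-bounds query instruction query.p0 Dp,segment_id:RF_offset:immediate shown above is used to determine whether a 4D tensor element is out of bounds, and assuming that long mode is enabled in dimension D2 and that the immediate is specified to correspond to dimensions D1 and D2, the determination can be made with the following example comparisons.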
a)ur_addr3[7:0]+r_off3[7:0]>=n3*p3
b)ur_addr2[7:0]+r_off2[7:0]>=n2*p2
c)ur_addr1[11:0]+r_off1[7:0]+immediate[15:8]>=n1*p1
d)ur_addr0[7:0]+r_off0[7:0]+immediate[7:0]>=n0*p0
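If any of the above conditions a) to d) holds, it can be determined that the tensor element is out of bounds, and the out-of-bounds status indication is set to a first logical value, for example "0". If none of the above conditions a) to d) holds, it can be determined that the tensor element is not out of bounds, and the out-of-bounds status indication is set to a second logical value, for example "1". Although a specific out-of-bounds determination is shown here for the 4D case, this is only illustrative and does not limit the scope of the present disclosure. The above method applies equally to other dimensionalities, to long mode in other dimensions, and to immediates corresponding to other dimensions, which the present disclosure does not limit.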
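Conditions a) to d) reduce to one comparison per dimension. The sketch below assumes the offsets and element counts have already been computed as non-negative integers; the names `out_of_bounds_status`, `OUT_OF_BOUNDS`, and `IN_BOUNDS` are hypothetical, and the values 0 and 1 simply follow the example logical values given in the text.

```python
from typing import Sequence

OUT_OF_BOUNDS = 0   # example first logical value from the text
IN_BOUNDS = 1       # example second logical value from the text


def out_of_bounds_status(offsets: Sequence[int], elem_per_dim: Sequence[int]) -> int:
    """Return the status that would be written to the predicate register.

    An element is out of bounds as soon as its offset in any dimension is
    greater than or equal to the element count of that dimension.
    """
    for offset, count in zip(offsets, elem_per_dim):
        if offset >= count:
            return OUT_OF_BOUNDS
    return IN_BOUNDS
```

This sketch does not model negative offsets, which the text allows when the sign attribute of the segment address is enabled; handling them would require an additional lower-bound comparison.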
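In addition, in one embodiment, a target predicate register may be provided to hold the flag indicating whether the accessed address is out of bounds. For example, in the example above, the register Dp is designated to hold this flag. It can be understood that each thread has its own dedicated predicate registers to indicate whether special handling, such as program debugging or dynamic checking, is to be performed next. Because the predicate register is located inside the thread, it provides a faster and more convenient out-of-bounds lookup result during the execution of subsequent instructions than a result stored in memory outside the PE. Moreover, because the out-of-bounds indication is a single-bit logical value rather than multi-bit data, and does not need to be transferred between the PE and memory outside the PE, the associated power consumption can also be reduced significantly.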
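FIG. 6 shows a schematic block diagram of an electronic apparatus 600 according to some embodiments of the present disclosure. The electronic apparatus 600 may be implemented as, or included in, the accelerator subsystem 200 of FIG. 2. The electronic apparatus 600 may include multiple modules for performing the corresponding steps of the method 500 discussed with reference to FIG. 5.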
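As shown in FIG. 6, the electronic apparatus 600 includes an offset-set determining unit 602, an element-count-set determining unit 604, and an out-of-bounds status indication determining unit 606. The offset-set determining unit 602 is configured to determine, based on an out-of-bounds query instruction for a first tensor element in a tensor segment, a first offset set of the first tensor element in multiple dimensions of the tensor segment. The element-count-set determining unit 604 is configured to determine, based on a segment attribute of the tensor segment, an element-count set of the tensor segment in each of the multiple dimensions. The out-of-bounds status indication determining unit 606 is configured to determine, based on the first offset set and the element-count set, a first out-of-bounds status indication for the first tensor element.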
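By using the out-of-bounds query instruction, the offset corresponding to a logical address within a tensor segment can be compared with the range of the tensor segment. As soon as the offset in any dimension exceeds the range of the tensor segment in that dimension, it can be determined that the logical address is out of bounds, that is, beyond the boundary of the tensor segment. An out-of-bounds status indication, for example a specific value, can then be obtained from a circuit unit such as a predicate register. This facilitates subsequent program operations such as program debugging or dynamic checking. Furthermore, because the out-of-bounds query instruction does not involve returning stored data from memory or writing data to it, but only gives an indication of the out-of-bounds state (for example logical 1 or 0), the latency and the high operating power associated with reading data from or writing data to memory are avoided.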
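In one embodiment, the out-of-bounds query instruction includes a logical address representation of the first tensor element, the logical address representation includes a segment base register representation, an offset register representation, and an immediate, and the offset-set determining unit 602 is further configured to compute the first offset set based on the segment base register representation, the offset register representation, and the immediate.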
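In one embodiment, the electronic apparatus 600 further includes a segment base register representation determining unit configured to determine the segment base register representation based on a mode representation. By using the mode representation, the offset of a logical address within a tensor segment can be determined for different modes, for example long mode or the default (non-long) mode. In this way, the out-of-bounds determination scheme is compatible with different modes and fits more tensor cases.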
In one embodiment, the dimensions corresponding to the immediate are also determined based on the out-of-bounds query instruction.
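In one embodiment, the offset-set determining unit 602 is further configured to: according to the dimensions of the tensor, add the segment base register representation within the segment base range of the corresponding dimension, the offset register representation, and the immediate to obtain the first offset. In one embodiment, the offset-set determining unit 602 is further configured to: in a dimension that does not correspond to the immediate, add the segment base register representation within the segment base range of that dimension and the offset register representation to obtain the first offset. In each dimension, adding the segment base register representation and the offset register representation of that dimension yields the first offset in a simple way, which reduces computation cost and power consumption.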
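In one embodiment, the element-count-set determining unit 604 is further configured to: determine, based on the segment attribute, the number of pages of the tensor segment in each dimension and the number of elements of each page in each dimension; and determine the element-count set based on the numbers of pages and the numbers of elements. In one embodiment, the offset-set determining unit 602 is further configured to: in a dimension that corresponds to the immediate, add the segment base register representation within the segment base range of that dimension, the offset register representation, and the immediate to obtain the first offset. In each dimension, adding the segment base register representation, the offset register representation, and the immediate of that dimension yields the first offset in a simple way, which reduces computation cost and power consumption.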
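In one embodiment, determining, based on the segment attribute of the tensor segment, the element-count set of the tensor segment in each of the multiple dimensions includes: determining, based on the segment attribute, the number of pages of the tensor segment in each dimension and the number of elements of each page in each dimension; and determining the element-count set based on the numbers of pages and the numbers of elements. The element-count set may include at least one element count in at least one dimension. In implementations of the present disclosure, a tensor segment includes pages and a page includes elements. A tensor segment may have a first plurality of dimensions and a page may have a second plurality of dimensions, where the number of dimensions in the second plurality is not greater than the number in the first plurality. By computing the number of pages of the tensor segment in each dimension and the number of elements of a page in each dimension, the number of elements in that dimension, that is, the largest possible offset range of a tensor element in that dimension, can be determined, for example by multiplication. In this way, the element-count set can be determined in a simple manner.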
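In one embodiment, the out-of-bounds status indication determining unit 606 is further configured to: determine whether each offset in the first offset set exceeds the element count of the corresponding dimension in the element-count set; and, if at least one offset in the first offset set exceeds the element count of the corresponding dimension in the element-count set, set the first out-of-bounds status indication for the first tensor element to a first value.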
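In one embodiment, the out-of-bounds status indication determining unit 606 is further configured to: determine whether each offset in the first offset set exceeds the element count of the corresponding dimension in the element-count set; and, if none of the offsets in the first offset set exceeds the element count of the corresponding dimension in the element-count set, set the first out-of-bounds status indication for the first tensor element to a second value. By setting the first out-of-bounds status indication to different values in a device such as a predicate register, whether the address is out of bounds can be determined by reading the stored value. This facilitates subsequent operations such as program debugging and dynamic checking and can, for example, save time and power, because this out-of-bounds check does not need to return actual data from memory and only returns a status value from a register inside the processing engine. Therefore, compared with reading multi-byte data from memory outside the processing engine into the processing engine, reading a status value of a single byte or a few bytes from the predicate register inside the processing engine can significantly reduce transfer and processing time and correspondingly reduce processing power consumption.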
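Furthermore, although the operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although the above discussion contains several specific implementation details, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable sub-combination.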
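Although the subject matter has been described in language specific to structural features and/or methodological acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.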

Claims (17)

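  1. A method for determining the out-of-bounds state of a tensor element, comprising:
    determining, based on an out-of-bounds query instruction for a first tensor element in a tensor segment, a first offset set of the first tensor element in multiple dimensions of the tensor segment;
    determining, based on a segment attribute of the tensor segment, an element-count set of the tensor segment in each of the multiple dimensions; and
    determining, based on the first offset set and the element-count set, a first out-of-bounds status indication for the first tensor element.
  2. The method according to claim 1, wherein the out-of-bounds query instruction comprises a logical address representation of the first tensor element, the logical address representation comprises a segment base register representation, an offset register representation, and an immediate, and determining the first offset set of the first tensor element in the tensor segment comprises:
    computing the first offset set based on the segment base register representation, the offset register representation, and the immediate.
  3. The method according to claim 2, further comprising: determining the segment base register representation based on a mode representation.
  4. The method according to claim 3, wherein computing the first offset based on the segment base register representation, the offset register representation, and the immediate comprises:
    according to the dimensions of the tensor, adding the segment base register representation within the segment base range of the corresponding dimension, the offset register representation, and the immediate to obtain the first offset.
  5. The method according to any one of claims 1-4, wherein determining, based on the segment attribute of the tensor segment, the element-count set of the tensor segment in each of the multiple dimensions comprises:
    determining, based on the segment attribute, the number of pages of the tensor segment in each dimension and the number of elements of each page in each dimension; and
    determining the element-count set based on the numbers of pages and the numbers of elements.
  6. The method according to any one of claims 1-4, wherein determining, based on the first offset set and the element-count set, the first out-of-bounds status indication for the first tensor element comprises:
    determining whether each offset in the first offset set exceeds the element count of the corresponding dimension in the element-count set; and
    if at least one offset in the first offset set exceeds the element count of the corresponding dimension in the element-count set, setting the first out-of-bounds status indication for the first tensor element to a first value.
  7. The method according to any one of claims 1-4, wherein determining, based on the first offset set and the element-count set, the first out-of-bounds status indication for the first tensor element comprises:
    determining whether each offset in the first offset set exceeds the element count of the corresponding dimension in the element-count set; and
    if none of the offsets in the first offset set exceeds the element count of the corresponding dimension in the element-count set, setting the first out-of-bounds status indication for the first tensor element to a second value.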
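  8. A computer-readable storage medium storing a plurality of programs, the plurality of programs being configured to be executed by one or more processing engines, the plurality of programs comprising instructions for performing the method according to any one of claims 1-7.
  9. A computer program product comprising a plurality of programs, the plurality of programs being configured to be executed by one or more processing engines, the plurality of programs comprising instructions for performing the method according to any one of claims 1-8.
  10. An electronic apparatus, comprising:
    a stream processor;
    a page table apparatus coupled to the stream processor;
    a memory;
    a processing engine unit coupled to the stream processor, the memory, and the page table apparatus, and configured to perform the method according to any one of claims 1-8.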
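  11. An electronic apparatus, comprising:
    an offset-set determining unit configured to determine, based on an out-of-bounds query instruction for a first tensor element in a tensor segment, a first offset set of the first tensor element in multiple dimensions of the tensor segment;
    an element-count-set determining unit configured to determine, based on a segment attribute of the tensor segment, an element-count set of the tensor segment in each of the multiple dimensions; and
    an out-of-bounds status indication determining unit configured to determine, based on the first offset set and the element-count set, a first out-of-bounds status indication for the first tensor element.
  12. The electronic apparatus according to claim 11, wherein the out-of-bounds query instruction comprises a logical address representation of the first tensor element, the logical address representation comprises a segment base register representation, an offset register representation, and an immediate, and the offset-set determining unit is further configured to:
    compute the first offset set based on the segment base register representation, the offset register representation, and the immediate.
  13. The electronic apparatus according to claim 12, further comprising: a segment base register representation determining unit configured to determine the segment base register representation based on a mode representation.
  14. The electronic apparatus according to claim 13, wherein the offset determining unit is further configured to:
    according to the dimensions of the tensor, add the segment base register representation within the segment base range of the corresponding dimension, the offset register representation, and the immediate to obtain the first offset.
  15. The electronic apparatus according to any one of claims 11-14, wherein the element-count-set determining unit is further configured to:
    determine, based on the segment attribute, the number of pages of the tensor segment in each dimension and the number of elements of each page in each dimension; and
    determine the element-count set based on the numbers of pages and the numbers of elements.
  16. The electronic apparatus according to any one of claims 11-14, wherein the out-of-bounds status indication determining unit is further configured to:
    determine whether each offset in the first offset set exceeds the element count of the corresponding dimension in the element-count set; and
    if at least one offset in the first offset set exceeds the element count of the corresponding dimension in the element-count set, set the first out-of-bounds status indication for the first tensor element to a first value.
  17. The electronic apparatus according to any one of claims 11-14, wherein the out-of-bounds status indication determining unit is further configured to:
    determine whether each offset in the first offset set exceeds the element count of the corresponding dimension in the element-count set; and
    if none of the offsets in the first offset set exceeds the element count of the corresponding dimension in the element-count set, set the first out-of-bounds status indication for the first tensor element to a second value.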
PCT/CN2022/107337 2022-01-25 2022-07-22 Method for determining out-of-bounds state of tensor element, and electronic apparatus WO2023142403A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210088659.1 2022-01-25
CN202210088659.1A CN114489798B (zh) 2022-01-25 2022-01-25 Method for determining out-of-bounds state of tensor element, and electronic apparatus

Publications (1)

Publication Number Publication Date
WO2023142403A1 true WO2023142403A1 (zh) 2023-08-03

Family

ID=81475208

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107337 WO2023142403A1 (zh) 2022-01-25 2022-07-22 Method for determining out-of-bounds state of tensor element, and electronic apparatus

Country Status (2)

Country Link
CN (1) CN114489798B (zh)
WO (1) WO2023142403A1 (zh)


Also Published As

Publication number Publication date
CN114489798B (zh) 2024-04-05
CN114489798A (zh) 2022-05-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22923211

Country of ref document: EP

Kind code of ref document: A1